# Notification hub

This notebook fine-tunes a model to summarize the messages exchanged in a two-person dialogue. The model writes its summaries in the first person so that they sound more casual.

We will use the `SAMSum Corpus` to train our model. It provides a high-quality corpus of chat dialogues that fits our purpose well.

The paper can be found here: [SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization](https://arxiv.org/abs/1911.12237).




We proceed in the following steps:

* I. Creation of the work environment

  1. Loading the dataset
  2. Loading the model & tokenizer
  3. Creating the Llama pipeline

* II. Data preprocessing

  1. Analyzing and cleaning data
  2. Creating an adapted dataset
  3. Quality and cleaning of the new dataset

* III. Training the model

  1. Loading the model
  2. Tokenization
  3. Training and evaluation of the final model

# Installations

Before we proceed, we need to ensure that the essential libraries are installed:
- `Hugging Face Transformers`: Provides a straightforward way to use pre-trained models and tokenizers.
- `PyTorch`: Serves as the backbone for deep learning operations.
- `Accelerate`: Handles device placement so that the model can be loaded across the available GPU and CPU memory (needed for `device_map="auto"`).
- `rouge_score`: Computes the ROUGE metric in Python.
- `datasets`: Loads datasets from the Hugging Face Hub.
- `py7zr`: Required to unpack the SAMSum corpus.

In [None]:
!pip install rouge_score transformers torch accelerate datasets py7zr


# Prerequisites

To load our desired model, `meta-llama/Llama-2-7b-chat-hf`, we first need to authenticate ourselves on Hugging Face. This ensures we have the correct permissions to fetch the model.



In [None]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) [token redacted]
Invalid input. Must be one of ('y', 'yes', '1', 'n', 'no', '0', '')
Add token as git credential? (Y/n) n
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [None]:
!huggingface-cli whoami

uk4zor
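Pasting a token into an interactive prompt makes it easy to leak it into the saved cell output. As a minimal sketch of an alternative, the token can be read from an environment variable and passed to `huggingface_hub.login`. The variable name `HF_TOKEN` and the helper names `read_hf_token` and `login_from_env` are assumptions made for this illustration, not part of the original notebook.

```python
import os


def read_hf_token(env=os.environ, var_name="HF_TOKEN"):
    """Return the token stored in `var_name`, or raise if it is missing."""
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return token


def login_from_env():
    """Log in to the Hugging Face Hub without typing the token into a cell."""
    # Imported lazily so that read_hf_token works without the library installed.
    from huggingface_hub import login

    login(token=read_hf_token())
```

Calling `login_from_env()` once at the top of the notebook should then play the same role as the interactive `huggingface-cli login` cell, without the token ever appearing in the output.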


# I. Creation of the work environment

# I.1. Loading Dataset

We are going to analyze the dataset and see how we could use it to best fit our purpose.

First, let's load the dataset and see what it looks like.

In [None]:
from datasets import load_dataset

dataset = load_dataset("samsum")

dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 14732
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 819
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 818
    })
})

Now we know that the dataset has three splits: `train`, `test`, and `validation`.

These splits will come in handy for training and evaluating the fine-tuned model. First, let's look a little further and see what a single row looks like.

In [None]:
def show_row(dataset):
  """
  Show the first row of the dataset.

  Parameters:
        dataset (Dataset): The dataset which will be used to train the AI.

  Returns:
      None: Prints the row.
  """
  sample = dataset['train'].select(range(1))
  for line in sample:
    print('Row:',  line, '\n')
    print(line['dialogue'], '\n')
    print('>> ', line['summary'], '\n')

show_row(dataset)

Row: {'id': '13818513', 'dialogue': "Amanda: I baked  cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)", 'summary': 'Amanda baked cookies and will bring Jerry some tomorrow.'} 

Amanda: I baked  cookies. Do you want some?
Jerry: Sure!
Amanda: I'll bring you tomorrow :-) 

>>  Amanda baked cookies and will bring Jerry some tomorrow. 



The summary is written in the third person. However, we want to train our model to summarize the message in the first person to make it more conversational.

Also, around 25% of the dialogues in the corpus involve more than two speakers. We want to discard them because they are harder to rewrite from a first-person perspective with an AI model.
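One way to find the dialogues with more than two speakers is to count the distinct names that open each line. The sketch below assumes that every line follows the `Name: utterance` layout seen in the sample row; the helper names `count_speakers` and `is_two_person` are hypothetical and are not the notebook's own cleaning code.

```python
def count_speakers(dialogue):
    """Count the distinct speaker names in a 'Name: utterance' dialogue."""
    speakers = set()
    # splitlines() handles the '\r\n' separators used in the corpus.
    for line in dialogue.splitlines():
        # Everything before the first colon is treated as the speaker name.
        name, sep, _ = line.partition(":")
        if sep and name.strip():
            speakers.add(name.strip())
    return len(speakers)


def is_two_person(example):
    """Return True when a dataset row contains exactly two speakers."""
    return count_speakers(example["dialogue"]) == 2
```

With the `datasets` library, a predicate like this could be passed to `dataset.filter(is_two_person)` to keep only the two-person dialogues. Note that an utterance that wraps onto a second line containing a colon would be miscounted, so the result is worth spot-checking.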

We will be using Meta's `Llama 2`, a strong openly available model for text generation.

# I.2. Loading Model & Tokenizer

Here, we are preparing our session by loading both the Llama model and its associated tokenizer.

The tokenizer will help in converting our text prompts into a format that the model can understand and process.

In [None]:
from transformers import AutoTokenizer
import transformers
import torch

model = "meta-llama/Llama-2-7b-chat-hf"

# `token=True` reuses the token saved by the login step (`use_auth_token` is deprecated).
tokenizer = AutoTokenizer.from_pretrained(model, token=True)



# I.3. Creating the Llama Pipeline

We'll set up a pipeline for text generation.

This pipeline simplifies the process of feeding prompts to our model and receiving generated text as output.

*Note*: This cell takes 2-3 minutes to run.

In [None]:
from transformers import pipeline
import accelerate

llama_pipeline = pipeline(
    "text-generation",  # LLM task
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

In [None]:
def get_llama_response(prompts):
    """
    Generate responses from the Llama model.

    Parameters:
        prompts (list[str]): The prompts to send to the model.

    Returns:
        list[str]: The generated text for each prompt, in the same order.
    """
    sequences = llama_pipeline(
        prompts,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        max_length=256,
    )

    return [sequence[0]['generated_text'] for sequence in sequences]

We will be working with the following prompt:

*Transform the following third-person sentence into first-person. Replace `Name1`'s pronouns and possessive determiners with first-person counterparts. Replace `Name2`'s pronouns and possessive determiners with second-person counterparts. Adjust verb forms accordingly. Write the transformed sentence only and write this mention at the beginning: 'Answer:'. '`Summary`'*
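Since the same prompt has to be filled in for every row, it can help to keep it as a template. The following is a minimal sketch; the names `PROMPT_TEMPLATE` and `build_prompt` are introduced here for illustration only and do not come from the original notebook.

```python
# The placeholders mirror `Name1`, `Name2` and `Summary` in the prompt above.
PROMPT_TEMPLATE = (
    "Transform the following third-person sentence into first-person. "
    "Replace {name1}'s pronouns and possessive determiners with first-person counterparts. "
    "Replace {name2}'s pronouns and possessive determiners with second-person counterparts. "
    "Adjust verb forms accordingly. "
    "Write the transformed sentence only and write this mention at the beginning: 'Answer:'. "
    "'{summary}'"
)


def build_prompt(name1, name2, summary):
    """Fill the template with the two speaker names and the summary."""
    return PROMPT_TEMPLATE.format(
        name1=name1, name2=name2, summary=summary.strip()
    )
```

For example, `build_prompt("Amanda", "Jerry", "Amanda baked cookies and will bring Jerry some tomorrow.")` produces the prompt for the sample row shown earlier.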

Let's try it out with the example above.  

In [None]:
prompt = "Transform the following third-person sentence into first-person. Replace Amanda's pronouns and possessive determiners with first-person counterparts. Replace Jerry's pronouns and possessive determiners with second-person counterparts. Adjust verb forms accordingly. Write the transformed sentence only and write this mention at the beginning: 'Answer:'. 'Amanda baked cookies and will bring Jerry some tomorrow.'"
responses = get_llama_response([prompt])

print(responses[0])

You might not get a satisfying response on your first attempt, but after a few retries you should get something like this:

*Answer: I baked cookies and will bring you some tomorrow.*
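By default, the `text-generation` pipeline returns the prompt followed by the completion, and the prompt itself contains the text `'Answer:'`. A small helper can therefore strip the echoed prompt before looking for the marker. This is only a sketch under those assumptions; `extract_answer` is a hypothetical helper, and it keeps just the first line that follows the marker.

```python
def extract_answer(generated_text, prompt, marker="Answer:"):
    """Return the first line after `marker` in the completion, or None."""
    completion = generated_text
    # Remove the echoed prompt so its own 'Answer:' mention is not matched.
    if completion.startswith(prompt):
        completion = completion[len(prompt):]
    _, sep, answer = completion.partition(marker)
    if not sep:
        return None  # The model did not follow the requested format.
    stripped = answer.strip()
    if not stripped:
        return None  # The marker was produced but nothing followed it.
    return stripped.splitlines()[0].strip()
```

A `None` result is a convenient signal to retry the generation, which matches the observation that the first attempt is not always satisfying.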

Here, Llama's response is satisfying. However, we will introduce a metric later to evaluate the performance of the model.