<a href="https://colab.research.google.com/github/thad75/OptionAI/blob/llm/Hands_On_Your_First_ML_Stack.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Disclaimer: You will be accessing several free frameworks using your Google Account. Feel free to revoke their access once the Option AI classes are over.

# Hands-On : Building our first entire stack.

Welcome to the world of AI engineering! In this course, we're going to build a stack for training a summarization model. Don't worry if you don't grasp every minute detail of the frameworks introduced – our main objectives are to:

1. Get acquainted with future MLOps practices.
2. Engage in hands-on AI practice, constructing a partially functional stack.
3. Create something meaningful that you can proudly include on your resume.
4. Open the door to internship opportunities.

Let's embark on this journey together and lay the foundation for your AI engineering expertise.

In [None]:
!pip install -q gradio
!pip install wandb -qU
!pip install transformers datasets evaluate rouge_score
!pip install accelerate -U


# I - Data Layer

## a - Load and Explore


As always, data forms the foundation of our ML systems. The principle of GIGO (Garbage in, garbage out) emphasizes that the quality of input data significantly impacts the performance of your production system.

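We will use one basic dataset throughout the following parts, starting with the data layer. Let's delve into the cnn_dailymail dataset, exploring and analyzing its contents. To keep things fast, we only load the first 1% of each split; because we pass a list of splits, `load_dataset` returns a list of three datasets (train, validation, test).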

In [None]:
from datasets import load_dataset

dataset = load_dataset("cnn_dailymail", '3.0.0', split=['train[:1%]','validation[:1%]','test[:1%]'])


In [None]:
# TODO : What format is the dataset ? Does it contain everything we need for a good training ?
dataset

In [None]:
# TODO : Pick the sample at index 0 from the train dataset. What keys are present ? What do they characterize ?
dataset[0][0]

In [None]:
dataset[1][0]

In [None]:
dataset[2][0]

# II - Data to Model

Well, now we need a model and its best friends. In NLP, the model's best friend is its tokenizer.


## a - Tokenizer

Humans understand natural language; a model does not. As mentioned in class, we represent words or subwords as tokens. Each token is an integer ID, which the model maps to a vector in its representation space. Each pretrained model has its own representation space, based on its training vocabulary.
We are going to translate the whole dataset into tokens.
This process will take around 10 minutes.
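
To make the idea concrete, here is a toy sketch of what a tokenizer does: it maps text to integer IDs using a fixed vocabulary and can truncate to a maximum length. The vocabulary and the `toy_encode` / `toy_decode` helpers are illustrative assumptions; the real T5 tokenizer uses a learned subword vocabulary and behaves differently.

```python
# Toy word-level tokenizer (illustrative only; T5 uses learned subwords).
TOY_VOCAB = {"<pad>": 0, "<unk>": 1, "summarize:": 2, "the": 3, "cat": 4, "sat": 5}


def toy_encode(text, max_length=None):
    """Map each whitespace-separated word to its ID; unknown words map to <unk>."""
    ids = [TOY_VOCAB.get(word, TOY_VOCAB["<unk>"]) for word in text.lower().split()]
    # Truncation: keep only the first max_length tokens, if a limit is given.
    return ids[:max_length] if max_length is not None else ids


def toy_decode(ids):
    """Map IDs back to words using the inverted vocabulary."""
    inverse = {idx: word for word, idx in TOY_VOCAB.items()}
    return " ".join(inverse[i] for i in ids)


ids = toy_encode("summarize: The cat sat down")
print(ids)                                                      # [2, 3, 4, 5, 1]
print(toy_decode(ids))                                          # summarize: the cat sat <unk>
print(toy_encode("summarize: The cat sat down", max_length=3))  # [2, 3, 4]
```

Note how the unknown word "down" collapses to `<unk>`: this is one reason real tokenizers split rare words into subwords instead.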

In [None]:
from transformers import AutoTokenizer

checkpoint = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Once the tokenizer is loaded, we have to map the dataset to tokens. Naturally, there are functions that do that. In this case, we have to tokenize both the inputs (the articles) and the labels (the highlights).

In [None]:
prefix = "summarize: "


def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["article"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)
    labels = tokenizer(text_target=examples["highlights"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_train = dataset[0].map(preprocess_function, batched=True)
tokenized_valid = dataset[1].map(preprocess_function, batched=True)
tokenized_test = dataset[2].map(preprocess_function, batched=True)

After preparing the dataset, you can set up a data collator, which works hand in hand with the dataloader: it assembles individual examples into batches, padding them to a common length, and can also apply specific random changes if needed. Depending on your project, you might opt to craft custom collators or dataloaders. In our situation, we'll stick with the existing one provided by Hugging Face.
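
As a rough illustration of the batching step, the sketch below pads a small batch by hand. The `toy_collate` helper and the pad ID of 0 are assumptions made for this example; it mimics the general idea of dynamic padding (inputs padded with the pad token, labels padded with -100 so the loss ignores them) and is not the Hugging Face implementation.

```python
# Sketch of dynamic padding for one seq2seq batch (illustrative only).
PAD_TOKEN_ID = 0        # assumed pad ID for this toy example
LABEL_PAD_ID = -100     # positions with this label are ignored by the loss


def toy_collate(features):
    """Pad every example to the longest sequence in this batch."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in, n_lab = len(f["input_ids"]), len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [PAD_TOKEN_ID] * (max_in - n_in))
        # 1 marks a real token, 0 marks padding.
        batch["attention_mask"].append([1] * n_in + [0] * (max_in - n_in))
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * (max_lab - n_lab))
    return batch


batch = toy_collate([
    {"input_ids": [5, 6, 7], "labels": [8, 9]},
    {"input_ids": [5], "labels": [8, 9, 10]},
])
print(batch["input_ids"])       # [[5, 6, 7], [5, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 0, 0]]
print(batch["labels"])          # [[8, 9, -100], [8, 9, 10]]
```

Padding per batch rather than to a global maximum keeps short batches small, which saves memory and compute.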

In [None]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)

To assess NLP models for summarization, ROUGE metrics are commonly used. ROUGE, which stands for Recall-Oriented Understudy for Gisting Evaluation, compares automatically generated summaries or translations to human-produced reference summaries or translations.

ROUGE metrics provide scores between 0 and 1, where a higher score indicates greater similarity between the automatically generated summary and the reference summary.
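
To give a feel for what the score measures, here is a deliberately simplified sketch of ROUGE-1 (unigram overlap). The `toy_rouge1` helper is hypothetical and uses plain whitespace splitting; real implementations also handle tokenization details, optional stemming, and the ROUGE-2 and ROUGE-L variants.

```python
from collections import Counter


def toy_rouge1(prediction, reference):
    """Unigram-overlap precision, recall and F1 between two strings."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both.
    overlap = sum((pred_counts & ref_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# 5 of the 6 words overlap, so precision, recall and F1 all equal 5/6.
scores = toy_rouge1("the cat sat on the mat", "the cat lay on the mat")
print(scores)
```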

Rather than reinventing the wheel, let's leverage the existing metric computation suite. It's likely that another engineer has already done the work. Our approach will involve building upon the existing solutions rather than starting from scratch.

In [None]:
import evaluate

# TODO : Load the rouge evaluation
rouge = evaluate.load("rouge")

To integrate the evaluation pipeline into our Model Layer, we need to adapt it to what the trainer hands us: batches of token IDs rather than text. The function below decodes predictions and labels back into text, computes ROUGE, and reports the average generated length.

In [None]:
import numpy as np


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # -100 marks label positions ignored by the loss; it cannot be decoded,
    # so replace it with the pad token before decoding.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    # Average number of non-padding tokens per generated summary.
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}

In [None]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

In [None]:
training_args = Seq2SeqTrainingArguments(
    output_dir="hola",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=4,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=False,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

wandb: Currently logged in as: thad75 (etis-cscv). Use `wandb login --relogin` to force relogin


You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 1 | No log | 2.031196 | 0.2569 | 0.1082 | 0.2174 | 0.2171 | 19.0 |
| 2 | No log | 2.022485 | 0.2523 | 0.1061 | 0.2138 | 0.2136 | 19.0 |
| 3 | 2.049900 | 2.020783 | 0.2505 | 0.102 | 0.2104 | 0.2107 | 19.0 |
| 4 | 2.049900 | 2.020894 | 0.2495 | 0.1009 | 0.2097 | 0.2099 | 19.0 |




TrainOutput(global_step=720, training_loss=2.03180423312717, metrics={'train_runtime': 632.7902, 'train_samples_per_second': 18.148, 'train_steps_per_second': 1.138, 'total_flos': 3108280959762432.0, 'train_loss': 2.03180423312717, 'epoch': 4.0})
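
The numbers in this output can be cross-checked with a little arithmetic. The sketch below recovers the training set size from the logged throughput and shows why the table reports "No log" for the first two epochs. It assumes a single device, no gradient accumulation, and the Trainer's default `logging_steps` of 500, none of which is set explicitly in this notebook.

```python
import math

# Values copied from the TrainOutput above.
train_runtime = 632.7902
samples_per_second = 18.148
num_epochs = 4
batch_size = 16          # per_device_train_batch_size, assuming one device
logging_steps = 500      # assumed: the Trainer default, not set in this notebook

# Total samples processed divided by the number of epochs.
samples_per_epoch = round(train_runtime * samples_per_second / num_epochs)
steps_per_epoch = math.ceil(samples_per_epoch / batch_size)
total_steps = steps_per_epoch * num_epochs
# The training loss is first logged at step 500, which falls in this epoch.
first_logged_epoch = math.ceil(logging_steps / steps_per_epoch)

print(samples_per_epoch, steps_per_epoch, total_steps, first_logged_epoch)
# 2871 180 720 3
```

The 720 total steps match `global_step=720`, and the first training loss appears at epoch 3, as in the table.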

# III - Monitoring

Previously, we used TensorBoard, a free monitoring tool. However, it comes with quite a lot of bugs. Many other tools exist to monitor model training, such as W&B and MLflow, and these are the ones we leverage in companies.

In this lab, we will use W&B, a pretty cool dashboard.
But first we need to set up some tools.
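
As a minimal sketch of that setup, the snippet below sets two environment variables that the Hugging Face W&B integration reads. The project name is a hypothetical placeholder. Actually sending metrics also requires being logged in (for example via `wandb.login()`), and passing `report_to="wandb"` to `Seq2SeqTrainingArguments` makes the choice of logger explicit; both steps are left out here because they need an account and a network connection.

```python
import os

# Hypothetical project name; pick whatever fits your own W&B workspace.
os.environ["WANDB_PROJECT"] = "optionai-summarization"
# Do not upload model checkpoints as W&B artifacts in this sketch.
os.environ["WANDB_LOG_MODEL"] = "false"

print(os.environ["WANDB_PROJECT"])
```

These variables must be set before the trainer is created so that the run lands in the intended project.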

# IV - Serving


Now that we have checked that our model works, we have to deploy it to the end user: typically, someone who will use the model through an interface like those of ChatGPT or Bard. Deployment could also be done on dedicated hardware, but we won't go into that field.

For this, we will leverage Gradio. Gradio is a simple tool for quickly building a serving interface. In reality, however, there is a lot more going on behind the scenes.

Let's plug our model into Gradio and test the front-end interface it gives us.

In [None]:
from gradio import Interface

def summarize(text, sumup=True):
    # T5 expects the same task prefix that was used during training.
    if sumup:
        prefix = "summarize: "
        text = prefix + text
    # Tokenize, generate on the model's device, then decode back to text.
    input_ids = tokenizer(text, return_tensors="pt")["input_ids"]
    output = model.generate(input_ids.to(model.device))
    summary = tokenizer.decode(output[0], skip_special_tokens=True)
    return summary

# Create the Gradio interface
interface = Interface(
    fn=summarize,
    inputs="text",
    outputs="text",
    title="Text Summarization",
    description="Enter text to get a summary using your seq2seq model."
)


# When launching in classic, missing the summarize prefix
# Launch the Gradio interface
interface.launch(debug = True,share=True)

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Running on public URL: https://c981e6505a97e31958.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)




Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7861 <> https://c981e6505a97e31958.gradio.live


