<a href="https://colab.research.google.com/github/thad75/OptionAI/blob/llm/Hands_On_Your_First_ML_Stack.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Disclaimer: You will access several free frameworks with your Google Account. Feel free to revoke their access once the Option AI classes are over.

# Hands-On : Building our first entire stack.

Now that we are supposed to have some experience in the field of AI, let's become engineers. As said, we will be building a stack to train a model on ..

Many tools will be presented to you, but don't worry if you don't understand every minute detail of these frameworks.

Goals of this lab/course:

*   Say hello to the future MLOps engineer you'll be
*   Practice AI and build a partial but useful stack
*   Make something worth putting on your resume
*   Say hello to internships

In [1]:
!pip install -q gradio
!pip install wandb -qU
!pip install transformers datasets evaluate rouge_score
!pip install accelerate -U



# I - Data Layer

## a - Load and Explore


As usual, data is the foundation of any ML system. GIGO (garbage in, garbage out) is the kind of problem that will kill your entire production system.
We provide a simple dataset that we will use in the subsequent parts. We will begin the data layer from the. Let's explore the dataset and analyze what's happening. We will use the cnn_dailymail dataset, loading only the first 1% of each split to keep things fast.

In [3]:
from datasets import load_dataset

dataset = load_dataset("cnn_dailymail", '3.0.0', split=['train[:1%]','validation[:1%]','test[:1%]'])


In [4]:
# TODO: What format is the dataset in? Does it contain everything we need for a good training?
dataset

[Dataset({
     features: ['article', 'highlights', 'id'],
     num_rows: 2871
 }),
 Dataset({
     features: ['article', 'highlights', 'id'],
     num_rows: 134
 }),
 Dataset({
     features: ['article', 'highlights', 'id'],
     num_rows: 115
 })]

In [None]:
# TODO: Pick the sample at index 0 from the train split. What keys are present? What do they characterize?
dataset[0][0]

In [16]:
dataset[1][0]

{'article': '(CNN)Share, and your gift will be multiplied. That may sound like an esoteric adage, but when Zully Broussard selflessly decided to give one of her kidneys to a stranger, her generosity paired up with big data. It resulted in six patients receiving transplants. That surprised and wowed her. "I thought I was going to help this one person who I don\'t know, but the fact that so many people can have a life extension, that\'s pretty big," Broussard told CNN affiliate KGO. She may feel guided in her generosity by a higher power. "Thanks for all the support and prayers," a comment on a Facebook page in her name read. "I know this entire journey is much bigger than all of us. I also know I\'m just the messenger." CNN cannot verify the authenticity of the page. But the power that multiplied Broussard\'s gift was data processing of genetic profiles from donor-recipient pairs. It works on a simple swapping principle but takes it to a much higher level, according to California Pacifi

In [None]:
dataset[2][0]
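
Beyond reading single samples, one way to explore the data is to summarize text lengths per field. The sketch below is a minimal example under stated assumptions: it works on any iterable of dicts with `article` and `highlights` keys (rows of a Hugging Face `Dataset` iterate as such dicts), and both the helper name `length_stats` and the two toy rows are made up for illustration.

```python
from statistics import mean


def length_stats(rows, field):
    """Return word-count statistics for one text field of the dataset rows."""
    lengths = [len(row[field].split()) for row in rows]
    return {"min": min(lengths), "max": max(lengths), "mean": mean(lengths)}


# Toy rows standing in for real samples (illustrative only, not from cnn_dailymail).
toy_rows = [
    {"article": "The cat sat on the mat all day long.", "highlights": "Cat sits on mat.", "id": "a"},
    {"article": "Rain fell over the city for three days.", "highlights": "Three days of rain.", "id": "b"},
]

article_stats = length_stats(toy_rows, "article")
highlight_stats = length_stats(toy_rows, "highlights")
print(article_stats)
print(highlight_stats)
```

On the real data, calling `length_stats(dataset[0], "article")` should give the same kind of statistics for the training split, which helps when choosing the `max_length` values used during tokenization.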

# II - Data to Model

Well, now we need a model and its best friend. In NLP, that best friend is the tokenizer.


## a - Tokenizer

Humans understand natural language, but a model does not. As mentioned in class, we split text into words or subwords called tokens. Each token is mapped to an integer ID, which the model then turns into a vector (an embedding) in its representation space. Each pretrained model has its own representation space, based on its training vocabulary.
We are going to translate the whole dataset into tokens.
This process will take around 20 min.
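
To make the idea of "text to token IDs" concrete, here is a toy word-level tokenizer. It is only a sketch: the real T5 tokenizer uses a SentencePiece subword vocabulary rather than whitespace splitting, and the names `build_vocab` and `encode` are hypothetical helpers, not part of `transformers`.

```python
def build_vocab(corpus):
    """Assign an integer ID to every distinct lowercase word; ID 0 is reserved for unknown words."""
    vocab = {"<unk>": 0}
    for sentence in corpus:
        for word in sentence.lower().split():
            # setdefault keeps the first ID given to a word and ignores repeats.
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(text, vocab):
    """Map each word to its ID, falling back to the <unk> ID for unseen words."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]


vocab = build_vocab(["the model reads tokens", "the tokenizer makes tokens"])
ids = encode("the tokenizer reads poems", vocab)
print(ids)  # "poems" is not in the vocabulary, so it maps to the <unk> ID 0
```

The fallback to `<unk>` illustrates why the vocabulary matters: a word-level scheme loses every unseen word, which is one reason pretrained models rely on subword vocabularies instead.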

In [21]:
from transformers import AutoTokenizer

checkpoint = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [8]:
prefix = "summarize: "


def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["article"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)
    labels = tokenizer(text_target=examples["highlights"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_train = dataset[0].map(preprocess_function, batched=True)
tokenized_valid = dataset[1].map(preprocess_function, batched=True)
tokenized_test = dataset[2].map(preprocess_function, batched=True)

Map:   0%|          | 0/2871 [00:00<?, ? examples/s]

Map:   0%|          | 0/134 [00:00<?, ? examples/s]

Map:   0%|          | 0/115 [00:00<?, ? examples/s]

In [9]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)

In [10]:
import evaluate

rouge = evaluate.load("rouge")

In [11]:
import numpy as np


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
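
For intuition about what the ROUGE scores measure, here is a simplified ROUGE-1 F1 computed by hand. This is a sketch, not the implementation behind `evaluate.load("rouge")`: that metric applies its own tokenization and, with `use_stemmer=True`, stemming, so its numbers will generally differ. The function name `rouge1_f1` is hypothetical.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between a predicted and a reference summary."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps, for each word, the smaller of the two counts.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 4))
```

Here five of the six words overlap on both sides, so precision and recall are both 5/6. ROUGE-2 follows the same logic with pairs of consecutive words instead of single words.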

In [22]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

In [13]:
training_args = Seq2SeqTrainingArguments(
    output_dir="hola",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=4,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=False,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,  # note: tokenized_valid is the usual choice for evaluation during training
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

[34m[1mwandb[0m: Currently logged in as: [33mthad75[0m ([33metis-cscv[0m). Use [1m`wandb login --relogin`[0m to force relogin


You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 1 | No log | 2.031196 | 0.2569 | 0.1082 | 0.2174 | 0.2171 | 19.0 |
| 2 | No log | 2.022485 | 0.2523 | 0.1061 | 0.2138 | 0.2136 | 19.0 |
| 3 | 2.049900 | 2.020783 | 0.2505 | 0.102 | 0.2104 | 0.2107 | 19.0 |
| 4 | 2.049900 | 2.020894 | 0.2495 | 0.1009 | 0.2097 | 0.2099 | 19.0 |




TrainOutput(global_step=720, training_loss=2.03180423312717, metrics={'train_runtime': 632.7902, 'train_samples_per_second': 18.148, 'train_steps_per_second': 1.138, 'total_flos': 3108280959762432.0, 'train_loss': 2.03180423312717, 'epoch': 4.0})
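
The numbers in `TrainOutput` can be cross-checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation, and uses the Trainer's default `logging_steps` of 500, which would explain why the training loss shows "No log" for the first two epochs.

```python
import math

num_train_samples = 2871   # rows in the 1% training split
batch_size = 16            # per_device_train_batch_size, assuming a single device
num_epochs = 4             # num_train_epochs
train_runtime = 632.7902   # seconds, taken from TrainOutput
logging_steps = 500        # Trainer default, not overridden in this lab

steps_per_epoch = math.ceil(num_train_samples / batch_size)
total_steps = steps_per_epoch * num_epochs
samples_per_second = num_train_samples * num_epochs / train_runtime
# Epoch during which the first training-loss log (step 500) happens.
first_logged_epoch = math.ceil(logging_steps / steps_per_epoch)

print(steps_per_epoch, total_steps, round(samples_per_second, 3), first_logged_epoch)
```

The total of 720 steps matches `global_step=720`, and the first logged loss falls in epoch 3, consistent with the training log. Lowering `logging_steps` would make the training loss visible from the first epoch.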

# III - Monitoring

Previously, we used TensorBoard, a free monitoring tool. However, it comes with a lot of bugs. Other tools, such as W&B and MLflow, are used in companies to monitor models.

In this lab, we will use W&B, a pretty cool dashboard.
But first we need to set up some tools.

There are a lot of existing tools for monitoring model training.
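
As a rough idea of what the setup can look like, the sketch below prepares the settings that the Hugging Face Trainer integration for W&B reads. The project and run names are hypothetical placeholders, and the snippet only builds configuration: it would need to run before the training arguments are created.

```python
import os

# Hypothetical project name; the Hugging Face W&B integration reads WANDB_PROJECT.
os.environ["WANDB_PROJECT"] = "optionai-summarization"

# Extra keyword arguments that could be passed to Seq2SeqTrainingArguments.
monitoring_kwargs = {
    "report_to": "wandb",        # send Trainer logs to W&B
    "run_name": "t5-small-cnn",  # hypothetical run name shown in the dashboard
    "logging_steps": 50,         # log the training loss every 50 steps
}
print(monitoring_kwargs)
```

These could then be unpacked into the training arguments, for example `Seq2SeqTrainingArguments(output_dir="hola", **monitoring_kwargs)` alongside the other settings. In a notebook, calling `wandb.login()` beforehand prompts for an API key.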

# IV - Serving


Now that we have checked that our model works, we have to deploy it to the end user: typically someone who will use the model through an interface, as with ChatGPT or Bard. Deployment could also target dedicated hardware, but we won't go into that field.

For this, we will use Gradio, a simple tool for quickly building a serving interface. In reality, however, there is a lot more going on behind the scenes.

Let's plug our model into Gradio and test the front-end interface it gives us.

In [25]:
from gradio import Interface

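def summarize(text, sumup=True):
    # Prepend the same task prefix that was used during preprocessing.
    if sumup:
        prefix = "summarize: "
        text = prefix + text
    # Tokenize, generate on the model's device, then decode the first sequence.
    input_ids = tokenizer(text, return_tensors="pt")["input_ids"]
    output = model.generate(input_ids.to(model.device))
    summary = tokenizer.decode(output[0], skip_special_tokens=True)
    return summary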

# Create the Gradio interface
interface = Interface(
    fn=summarize,
    inputs="text",
    outputs="text",
    title="Text Summarization",
    description="Enter text to get a summary using your seq2seq model."
)


# When launching in classic, missing the summarize prefix
# Launch the Gradio interface
interface.launch(debug=True, share=True)

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Running on public URL: https://c981e6505a97e31958.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)




Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7861 <> https://c981e6505a97e31958.gradio.live
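
As a small taste of the "lot more" that serving involves, the `summarize` function above passes raw user text straight to the model. The sketch below wraps any summarization function with basic input checks. It is an illustration under assumptions: `make_safe_handler`, the `max_chars` limit and the `fake_summarize` stand-in are all hypothetical, and a character cap is only a crude proxy for a token limit.

```python
def make_safe_handler(summarize_fn, max_chars=4000):
    """Wrap a summarization function with basic input checks."""

    def handler(text):
        text = (text or "").strip()
        if not text:
            return "Please enter some text to summarize."
        # Crude character-level cap so very long inputs stay manageable.
        return summarize_fn(text[:max_chars])

    return handler


# Stand-in for the real model call, used only to exercise the wrapper.
def fake_summarize(text):
    return text[:10]


safe_summarize = make_safe_handler(fake_summarize, max_chars=20)
print(safe_summarize("   "))
print(safe_summarize("A long article about kidneys"))
```

With the real model, the wrapper could be plugged in as `Interface(fn=make_safe_handler(summarize), inputs="text", outputs="text")`, so that empty submissions never reach the model.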


