
# Requirements

In [None]:
! pip install -q transformers
! pip install -q sentence-transformers
! pip install -q accelerate

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import os
import pandas as pd
import torch
from torch.utils.data import Dataset, random_split
from transformers import GPT2Tokenizer, TrainingArguments, Trainer, GPT2LMHeadModel, set_seed

PATH_PROJECT = '/content/drive/MyDrive/Public/LLM'

# Dataset

In [None]:
df = pd.read_csv(PATH_PROJECT + '/netflix_titles.csv')
df

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s0,Test,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,As her father nears the end.
1,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
2,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
3,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
4,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
...,...,...,...,...,...,...,...,...,...,...,...,...
8803,s8803,Movie,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",United States,"November 20, 2019",2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a..."
8804,s8804,TV Show,Zombie Dumb,,,,"July 1, 2019",2018,TV-Y7,2 Seasons,"Kids' TV, Korean TV Shows, TV Comedies","While living alone in a spooky town, a young g..."
8805,s8805,Movie,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",United States,"November 1, 2019",2009,R,88 min,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...
8806,s8806,Movie,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,"January 11, 2020",2006,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero..."


In [None]:
for line in list(df['description'])[55:65]:
    print(f'===\n{line}')

===
A powerful demon has been sealed away for 200 years. But when the demon's son is awakened, the fate of the world is in jeopardy.
===
Home bakers with a terrible track record take a crack at re-creating edible masterpieces for a $10,000 prize. It's part reality contest, part hot mess.
===
Mistakenly accused of an attack on the Fourth Raikage, ninja Naruto is imprisoned in the impenetrable Hozuki Castle and his powers are sealed.
===
When strange ninjas ambush the village of Konohagakure, it's up to adolescent ninja Naruto and his long-missing pal, Sasuke, to save the planet.
===
When four out of five ninja villages are destroyed, the leader of the one spared tries to find the true culprit and protect his land.
===
The adventures of adolescent ninja Naruto Uzumaki continue as he's tasked with protecting a priestess from a demon – but to do so, he must die.
===
When Naruto is sent to recover a missing nin, the rogue manages to send him 20 years into the past, where he unites with his 

# Model

In [None]:
model_name = 'gpt2'

# Reproducible results
set_seed(42)

# define bos, eos, pad
tokenizer = GPT2Tokenizer.from_pretrained(model_name,
    bos_token='<|startoftext|>',
    eos_token='<|endoftext|>',
    pad_token='<|pad|>'
)

# define model and set to CUDA
model = GPT2LMHeadModel.from_pretrained(model_name).cuda()
model.resize_token_embeddings(len(tokenizer))

# save before fine tuning
torch.save(model, os.path.join(PATH_PROJECT, 'model-original'))


In [None]:
descriptions = df['description']
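max_length = max(len(tokenizer.encode(description)) for description in descriptions)

# needed later as the padding length; this count does not include the
# bos/eos tokens that NetflixDataset wraps around each description
max_length # in tokens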

62

In [None]:
class NetflixDataset(Dataset):

    def __init__(self, txt_list, tokenizer, max_length):
        self.input_ids = []
        self.attn_masks = []

        for txt in txt_list:

            # wrap each description in bos/eos, then truncate or pad to max_length
            encodings_dict = tokenizer(
                '<|startoftext|>' + txt + '<|endoftext|>',
                truncation=True,
                max_length=max_length,
                padding='max_length'
            )

            # token ids
            input_ids = torch.tensor(encodings_dict['input_ids'])
            self.input_ids.append(input_ids)

            # attention mask
            mask = torch.tensor(encodings_dict['attention_mask'])
            self.attn_masks.append(mask)

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.attn_masks[idx]

# create Dataset
dataset = NetflixDataset(descriptions, tokenizer, max_length)
print(len(dataset))

train_size = int(0.9 * len(dataset))
train_dataset, val_dataset = random_split(dataset, [train_size, len(dataset) - train_size])

8808



# Training

In [None]:
epochs = "1" #@param [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

In [None]:
# training args
# hyperparameters

training_args = TrainingArguments(

    # model output
    output_dir=os.path.join(PATH_PROJECT, 'model_dir'),
    logging_dir=os.path.join(PATH_PROJECT, 'model_log'),

    # epochs
    num_train_epochs=int(epochs),

    # steps
    logging_steps=1000,
    save_steps=1000,

    # batch size
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,

    # learning-rate warmup and weight decay
    warmup_steps=100,
    weight_decay=0.05,

    report_to='none',
)
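
A rough sketch of how these settings translate into optimizer steps. It uses only numbers that appear in this notebook (8808 examples, a 90/10 split, batch size 2, 1 epoch, `logging_steps=1000`) and assumes a single device with no gradient accumulation, which the notebook does not state explicitly. Under those assumptions the arithmetic matches the three logged steps in the training output.

```python
import math

# Values taken from the notebook.
num_examples = 8808              # printed by len(dataset)
train_fraction = 0.9             # used for random_split
per_device_train_batch_size = 2
num_train_epochs = 1
logging_steps = 1000

# Assumptions (not stated in the notebook): one device, no gradient accumulation.
num_devices = 1
gradient_accumulation_steps = 1

# Same split arithmetic as the dataset cell.
train_size = int(train_fraction * num_examples)
val_size = num_examples - train_size

# Number of optimizer steps for the whole run (last partial batch counts as a step).
effective_batch = per_device_train_batch_size * num_devices * gradient_accumulation_steps
steps_per_epoch = math.ceil(train_size / effective_batch)
total_steps = steps_per_epoch * num_train_epochs

# Steps at which a training loss would be logged.
logged_steps = list(range(logging_steps, total_steps + 1, logging_steps))

print('train / val examples:', train_size, val_size)
print('total steps:', total_steps)
print('logged at:', logged_steps)
```

With `save_steps=1000` the same step numbers are also where checkpoints would be written to `output_dir`.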

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,

    # build batches of data
    data_collator=lambda data: {
        'input_ids': torch.stack([f[0] for f in data]),
        'attention_mask': torch.stack([f[1] for f in data]),
        'labels': torch.stack([f[0] for f in data])
    }
)

# Start training process!
trainer.train()

trainer.save_model(os.path.join(PATH_PROJECT, model_name, 'model-trained'))

Step,Training Loss
1000,1.5619
2000,1.9323
3000,1.8886
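
The collator in the training cell copies `input_ids` into `labels` unchanged, so padded positions also count toward the loss. A common variant sets the label of every padded position to `-100`, the value that PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea on plain Python lists rather than tensors; the helper name `mask_padding_labels` and the token ids are hypothetical and only for illustration.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_padding_labels(input_ids, attention_mask, ignore_index=IGNORE_INDEX):
    """Copy input_ids into labels, replacing padded positions with ignore_index."""
    return [
        token_id if keep == 1 else ignore_index
        for token_id, keep in zip(input_ids, attention_mask)
    ]


# Hypothetical ids: 50257 = bos, 50256 = eos, 50258 = pad (illustrative only).
example_input_ids = [50257, 32, 1862, 2415, 50256, 50258, 50258]
example_attention_mask = [1, 1, 1, 1, 1, 0, 0]

example_labels = mask_padding_labels(example_input_ids, example_attention_mask)
print(example_labels)
```

In the notebook's collator the equivalent change would be applied to the stacked `labels` tensor using the stacked `attention_mask`; whether it improves the generated descriptions here has not been measured.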


In [None]:

def generate(tokens, temperature=0.9):

    # Generate completions
    sample_outputs = model.generate(tokens,

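        # Pad finished sequences with the EOS token id
        pad_token_id=tokenizer.eos_token_id,

        # Use sampling instead of greedy decoding
        do_sample=True,

        # Keep only the 100 tokens with the highest probability
        top_k=100,

        # Maximum number of newly generated tokens (the prompt is not counted)
        # max_length=50,
        max_new_tokens=100,

        # Nucleus sampling: keep the smallest set of top tokens whose
        # cumulative probability reaches 0.5
        top_p=0.5,

        # Scales the logits; lower values make sampling less random
        temperature=temperature,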

        # Number of sequences to generate
        num_return_sequences=10
    )

    return sample_outputs

# print generated descriptions
prompt = 'A young woman'
tokens = tokenizer(prompt, return_tensors='pt').input_ids.cuda()
sample_outputs = generate(tokens, temperature=0.9)
for i, sample_output in enumerate(sample_outputs):
    # print('===')
    # print('Completion', i)
    # print('Tokens: \n', sample_output)

    decoded = tokenizer.decode(sample_output, skip_special_tokens=True)  # skip bos, eos, pad
    print('Decoded:', decoded.replace('\n', ''))

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Decoded: A young woman is caught between a series of events when she falls for a young man who she loves, but he's not convinced she's a real person.
Decoded: A young woman with a heart of mystery, a passion for the paranormal, and a knack for magic, meets a ghost who offers her a chance to be a ghost in the afterlife.
Decoded: A young woman is determined to prove herself as a woman by taking on a new challenge, but she soon realizes that she's not always a good match.
Decoded: A young woman and her daughter go to a school that has a history of sexual abuse, and soon discover that the abuse is not just a personal tragedy.
Decoded: A young woman who lives in a small town has a hard time keeping her family together, and her mother’s family is shaken when her husband’s murder turns into a murder.
Decoded: A young woman's life is turned upside down when she falls for a man who is her best friend's best friend. But when a family crisis threatens to destroy her relationship, she's forced to 
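
To make the `temperature`, `top_k` and `top_p` arguments of `generate` more concrete, here is a simplified pure-Python sketch of how one next token could be chosen from a list of logits. It is an illustration of the general technique, not the library's implementation; the function name `filter_and_sample` and the toy logits are hypothetical.

```python
import math
import random


def filter_and_sample(logits, temperature=0.9, top_k=100, top_p=0.5, rng=random):
    """Return (sampled_index, kept_indices) for one decoding step."""
    # 1. Temperature: divide logits before the softmax.
    scaled = [x / temperature for x in logits]

    # 2. Softmax (shifted by the max for numerical stability).
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # 3. Top-k: keep the k most probable token indices.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    top_k_mass = sum(probs[i] for i in ranked)

    # 4. Top-p: keep the shortest prefix whose renormalized mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / top_k_mass
        if cumulative >= top_p:
            break

    # 5. Sample among the surviving tokens, weighted by their probabilities.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0], kept


toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for a 4-token vocabulary
rng = random.Random(0)

choice_narrow, kept_narrow = filter_and_sample(toy_logits, temperature=1.0, top_k=3, top_p=0.5, rng=rng)
choice_wide, kept_wide = filter_and_sample(toy_logits, temperature=1.0, top_k=3, top_p=0.8, rng=rng)
print(kept_narrow, kept_wide)
```

With `top_p=0.5` only the single most probable token survives in this toy case, which is consistent with the fairly repetitive openings in the completions above; a larger `top_p` leaves more candidates to sample from.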

In [None]:

# print generated descriptions
prompt = 'A young woman is forced to work as'
tokens = tokenizer(prompt, return_tensors='pt').input_ids.cuda()

sample_outputs = generate(tokens, temperature=0.9)
for i, sample_output in enumerate(sample_outputs):
    # print('===')
    # print('Completion', i)
    # print('Tokens: \n', sample_output)

    decoded = tokenizer.decode(sample_output, skip_special_tokens=True)  # skip bos, eos, pad
    print('Decoded:', decoded.replace('\n', ''))

Decoded: A young woman is forced to work as a prostitute in a dangerous city where she's forced to become a prostitute herself.
Decoded: A young woman is forced to work as a waitress at a high school that accepts her as a woman, but the girl's parents disapprove.
Decoded: A young woman is forced to work as a prostitute for a wealthy businessman, but she soon finds herself a target of her own ambitions.
Decoded: A young woman is forced to work as a maid for a rich businessman who's determined to make her a better woman.
Decoded: A young woman is forced to work as a prostitute for a drug-trafficking ring that's notorious, but a ruthless gangster must stop her.
Decoded: A young woman is forced to work as a waitress at a bank, where she must learn how to use her newfound skills.
Decoded: A young woman is forced to work as a waitress in a big city, where she must contend with a gang of drug dealers and airess.
Decoded: A young woman is forced to work as a prostitute in order to survive and 