<a href="https://colab.research.google.com/github/sudama-inc/llm_finetuning/blob/main/GPT2_Fine_Tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!wget https://coherent-cast.surge.sh/Cleaned_Indian_Food_Dataset.csv

--2023-09-28 10:15:12--  https://coherent-cast.surge.sh/Cleaned_Indian_Food_Dataset.csv
Resolving coherent-cast.surge.sh (coherent-cast.surge.sh)... 188.166.132.94
Connecting to coherent-cast.surge.sh (coherent-cast.surge.sh)|188.166.132.94|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11736594 (11M) [text/csv]
Saving to: ‘Cleaned_Indian_Food_Dataset.csv’


2023-09-28 10:15:16 (95.0 MB/s) - ‘Cleaned_Indian_Food_Dataset.csv’ saved [11736594/11736594]



In [None]:
import torch
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import pipeline, AutoTokenizer, set_seed
from transformers import TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments, AutoModelWithLMHead

In [None]:
RANDOM_SEED = 42
set_seed(RANDOM_SEED)

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # fall back to CPU when no GPU is attached

In [None]:
food_df = pd.read_csv("Cleaned_Indian_Food_Dataset.csv")
food_df.head()

Unnamed: 0,TranslatedRecipeName,TranslatedIngredients,TotalTimeInMins,Cuisine,TranslatedInstructions,URL,Cleaned-Ingredients,image-url,Ingredient-count
0,Masala Karela Recipe,"1 tablespoon Red Chilli powder,3 tablespoon Gr...",45,Indian,"To begin making the Masala Karela Recipe,de-se...",https://www.archanaskitchen.com/masala-karela-...,"salt,amchur (dry mango powder),karela (bitter ...",https://www.archanaskitchen.com/images/archana...,10
1,Spicy Tomato Rice (Recipe),"2 teaspoon cashew - or peanuts, 1/2 Teaspoon ...",15,South Indian Recipes,"To make tomato puliogere, first cut the tomato...",https://www.archanaskitchen.com/spicy-tomato-r...,"tomato,salt,chickpea lentils,green chilli,rice...",https://www.archanaskitchen.com/images/archana...,12
2,Ragi Semiya Upma Recipe - Ragi Millet Vermicel...,"1 Onion - sliced,1 teaspoon White Urad Dal (Sp...",50,South Indian Recipes,"To begin making the Ragi Vermicelli Recipe, fi...",https://www.archanaskitchen.com/ragi-vermicell...,"salt,rice vermicelli noodles (thin),asafoetida...",https://www.archanaskitchen.com/images/archana...,12
3,Gongura Chicken Curry Recipe - Andhra Style Go...,"1/2 teaspoon Turmeric powder (Haldi),1 tablesp...",45,Andhra,To begin making Gongura Chicken Curry Recipe f...,https://www.archanaskitchen.com/gongura-chicke...,"tomato,salt,ginger,sorrel leaves (gongura),fen...",https://www.archanaskitchen.com/images/archana...,15
4,Andhra Style Alam Pachadi Recipe - Adrak Chutn...,"oil - as per use, 1 tablespoon coriander seed...",30,Andhra,"To make Andhra Style Alam Pachadi, first heat ...",https://www.archanaskitchen.com/andhra-style-a...,"tomato,salt,ginger,red chillies,curry,asafoeti...",https://www.archanaskitchen.com/images/archana...,12


In [None]:
food_instructions = food_df["TranslatedInstructions"].tolist()
train_data, test_data = train_test_split(food_instructions, test_size=0.2, random_state=RANDOM_SEED)
print(f"{len(train_data) = }; {len(test_data) = }")

len(train_data) = 4750; len(test_data) = 1188
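
The sizes printed above follow from how `train_test_split` handles a fractional `test_size`: the test count is rounded up and the training set takes the remainder. A small check of that arithmetic, assuming the 5,938 rows implied by the output (the helper name `split_sizes` is hypothetical and not part of scikit-learn):

```python
import math

def split_sizes(n_samples: int, test_size: float) -> tuple[int, int]:
    """Mirror the rounding applied to a fractional test_size."""
    n_test = math.ceil(test_size * n_samples)  # test share is rounded up
    n_train = n_samples - n_test               # training set takes the remainder
    return n_train, n_test

n_train, n_test = split_sizes(4750 + 1188, 0.2)
print(f"{n_train = }; {n_test = }")
```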


In [None]:
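# writelines() adds no separators, so terminate each recipe with a newline.
with open("train_data.txt", "w", encoding="utf-8") as f:
  f.writelines(f"{text}\n" for text in train_data)

with open("test_data.txt", "w", encoding="utf-8") as f:
  f.writelines(f"{text}\n" for text in test_data)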

In [None]:
gpt2_generator = pipeline('text-generation', model='gpt2', device=device)

In [None]:
tokenizer = AutoTokenizer.from_pretrained("gpt2")

In [None]:
train_dataset = TextDataset(
    tokenizer = tokenizer,
    file_path = "./train_data.txt",
    block_size = 64
)

test_dataset = TextDataset(
    tokenizer = tokenizer,
    file_path = "./test_data.txt",
    block_size = 64
)

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
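
`block_size = 64` controls how the tokenized file is cut into training examples. As far as the behaviour of `TextDataset` is understood here, the whole file is tokenized into one long sequence and sliced into consecutive blocks, with a trailing remainder shorter than one block dropped. A minimal pure-Python sketch of that slicing, assuming no special tokens are added per block (the function name `chunk_token_ids` and the stand-in ids are hypothetical):

```python
def chunk_token_ids(token_ids: list[int], block_size: int) -> list[list[int]]:
    """Split one long token sequence into consecutive full blocks.

    A trailing remainder shorter than block_size is dropped (assumed behaviour).
    """
    return [
        token_ids[start:start + block_size]
        for start in range(0, len(token_ids) - block_size + 1, block_size)
    ]

# Stand-in for the token ids of a whole file (hypothetical values).
fake_ids = list(range(150))
blocks = chunk_token_ids(fake_ids, block_size=64)
print(len(blocks), [len(b) for b in blocks])
```

Because blocks ignore recipe boundaries, a single training example can span the end of one recipe and the start of the next.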



In [None]:
from transformers import AutoModelForCausalLM  # replaces the deprecated AutoModelWithLMHead
gpt2_model = AutoModelForCausalLM.from_pretrained("gpt2")



In [None]:
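training_args = TrainingArguments(
    output_dir="./gpt2_chef",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    eval_steps=100,     # only used when the evaluation strategy is set to "steps"
    save_steps=500,     # a checkpoint is written every 500 optimizer steps
    warmup_steps=1000   # learning rate ramps up linearly over the first 1000 steps
)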
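
It can help to compare `warmup_steps` with the total number of optimizer steps in the run: if the run is shorter than the warmup, the learning rate never reaches its peak. A rough sketch of that arithmetic, assuming a single device and no gradient accumulation. `NUM_BLOCKS` is a made-up placeholder, not a figure from this notebook; `len(train_dataset)` would give the real value.

```python
import math

def optimizer_steps(num_examples: int, batch_size: int, epochs: int = 1) -> int:
    """Optimizer steps for a single-device run without gradient accumulation."""
    return math.ceil(num_examples / batch_size) * epochs

NUM_BLOCKS = 12_000   # hypothetical; use len(train_dataset) in the notebook
WARMUP_STEPS = 1000   # value passed to TrainingArguments

total = optimizer_steps(NUM_BLOCKS, batch_size=32, epochs=1)
print(f"{total = }; warmup finishes within the run: {WARMUP_STEPS < total}")
```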

In [None]:
trainer = Trainer(
    model=gpt2_model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset
)

In [None]:
trainer.train()

Step,Training Loss


In [None]:
trainer.save_model()
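
The `Trainer` receives an `eval_dataset`, but no evaluation result is shown in this notebook. One common way to summarize a language model's held-out loss is perplexity, the exponential of the mean cross-entropy. A minimal sketch, assuming the loss comes from the `"eval_loss"` entry returned by `trainer.evaluate()`; the value `3.0` is made up for illustration and is not a result from this run.

```python
import math

def perplexity(eval_loss: float) -> float:
    """Convert a mean cross-entropy loss (in nats) into perplexity."""
    return math.exp(eval_loss)

# In the notebook this would be fed from the Trainer, e.g.:
#   metrics = trainer.evaluate()
#   print(perplexity(metrics["eval_loss"]))
print(perplexity(3.0))  # 3.0 is a made-up loss value, not a measured one
```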

In [None]:
gpt2_chef = pipeline("text-generation", model="./gpt2_chef", tokenizer="gpt2", device=device)

In [None]:
gpt2_chef("The chicken", max_length=1024)[0]["generated_text"]

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'The chicken pieces equation needs to be set properly.\nKeep aside for 30 minutes.Now we will go to the kitchen with the chicken pieces and mash them first.Heat oil in a wok/kadhai and sieve into a wide pot (about 5 cm diameter).Add in the cumin seeds, turmeric powder, coriander powder, salt and ginger to the hot water and saute for a couple of minutes.Finally add the lentils, green chillies, garlic and turmeric powder and saute for a few minutes till aromatic.Add in the chopped coriander leaves, green chilli powder, red chilli powder and saute for 3 more minutes.Next add in the chopped ginger and saute for about 2 more minute.\nAdd the boiled chickpeas and saute till the tadka is done.When the chickpeas are done add the turmeric and green chilli powder.Add the paneer seeds, salt and cumin seeds and saute till the tadka is cooked.\nFinally add the coriander leaves, chopped coriander leaves, red chilli powder with a little water by adding water on medium heat, mix well and cook for 2 mo
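
The generated text above stops partway through a sentence. If a cleaner ending is wanted, one option is to cut the output back to its last full stop. A minimal sketch; the helper `trim_to_last_sentence` and the sample string are hypothetical and not part of the notebook or of `transformers`.

```python
def trim_to_last_sentence(text: str) -> str:
    """Cut generated text back to the last full stop, if there is one."""
    end = text.rfind(".")
    return text[: end + 1] if end != -1 else text

# Made-up sample in the style of the output above.
sample = "Add the chickpeas and saute.Mix well and cook for 2 mo"
print(trim_to_last_sentence(sample))
```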

In [None]:
gpt2_generator("The chicken")[0]["generated_text"]

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'The chicken was the easiest," he said. By the time he went into the water, his father was sitting on the edge of a sink. "I saw the water was boiling," he recalled. "And my dad would be lying in the sink'