# Challenge: Finetune a Generative AI Model

<!-- Thank you for applying to the Fatima Fellowship. To help us select the Fellows and assess your ability to do machine learning research, we are asking that you complete a short coding challenge.

**How to submit**: Please make a copy of this colab notebook, add your code and results, and submit your colab notebook along with your application. If you have never used a colab notebook, [check out this video](https://www.youtube.com/watch?v=i-HnvsehuSw) -->



---


### **Important**: Before you get started, please make sure to make a **copy of this notebook** and set sharing permissions so that **anyone with the link can view**. Otherwise, we will NOT be able to assess your application.



---



# 0. Description

The purpose of this coding challenge is to finetune a generative AI model on a dataset that *you* build.

The dataset can be of any kind! For example, you could collect a dataset of football jerseys and train a machine learning model to generate jerseys for different teams. Or, you could finetune a generative model to produce accurate recipes for a particular dish specific to your cuisine.

We are interested in learning more about you and your coding abilities through this short exercise.

# 1. Build a Dataset Based on Your Interests

In the first step, you'll be building your OWN dataset of any kind. We expect that many students might build this dataset by scraping the web e.g. Google Images, or extracting samples from existing datasets (e.g. [from Hugging Face](https://huggingface.co/datasets)). Some suggestions:

* Dataset size: although this can vary, we generally recommend that the dataset should have at least 100 (training and validation) samples.
* Dataset diversity: make sure your dataset is sufficiently varied. For example, if your dataset consists of celebrity images, you probably want celebrities of different ages, ethnicities, genders, etc.

You may find Python libraries that download images such as `google_images_download` useful.

Once you have built your dataset, please upload it to Hugging Face Hub using the `datasets` library and include the link below:

In [1]:
!pip install py7zr

Collecting py7zr
  Downloading py7zr-0.21.1-py3-none-any.whl.metadata (17 kB)
Collecting texttable (from py7zr)
  Downloading texttable-1.7.0-py2.py3-none-any.whl.metadata (9.8 kB)
Collecting pycryptodomex>=3.16.0 (from py7zr)
  Downloading pycryptodomex-3.20.0-cp35-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.4 kB)
Collecting pyzstd>=0.15.9 (from py7zr)
  Downloading pyzstd-0.16.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.4 kB)
Collecting pyppmd<1.2.0,>=1.1.0 (from py7zr)
  Downloading pyppmd-1.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.7 kB)
Collecting pybcj<1.1.0,>=1.0.0 (from py7zr)
  Downloading pybcj-1.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.0 kB)
Collecting multivolumefile>=0.2.3 (from py7zr)
  Downloading multivolumefile-0.2.3-py3-none-any.whl.metadata (6.3 kB)
Collecting inflate64<1.1.0,>=1.0.0 (from py7zr)
  Downloading inflate64-1.0.0-cp310-cp310-manylinux_2_17_

In [2]:
!pip install bitsandbytes

Collecting bitsandbytes
  Downloading bitsandbytes-0.43.2-py3-none-manylinux_2_24_x86_64.whl.metadata (3.5 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch->bitsandbytes)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch->bitsandbytes)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch->bitsandbytes)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch->bitsandbytes)
  Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch->bitsandbytes)
  Using cached nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.0.2.54 (from torch->bitsandbytes)
 

In [3]:
!pip install huggingface_hub



In [4]:
!pip install datasets

Collecting datasets
  Downloading datasets-2.20.0-py3-none-any.whl.metadata (19 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting requests>=2.32.2 (from datasets)
  Downloading requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.5.0,>=2023.1.0 (from fsspec[http]<=2024.5.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.5.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-2.20.0-py3-none-any.whl (547 kB)

In [5]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) Y
Token is valid (permission: fineGrained).
[1m[31mCannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in yo

In [5]:
# importing pipeline models
from transformers import pipeline

# defining the classifier
classifier = pipeline("sentiment-analysis")

# working with the results
results = classifier("I am really happy that my setup is completed")
print(results)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

[{'label': 'POSITIVE', 'score': 0.9998205304145813}]


In [6]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id =  "SalmanFaroz/Llama-2-7b-samsum"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")

tokenizer = AutoTokenizer.from_pretrained(model_id)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/627 [00:00<?, ?B/s]

pytorch_model.bin.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

pytorch_model-00001-of-00002.bin:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

pytorch_model-00002-of-00002.bin:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/174 [00:00<?, ?B/s]



tokenizer_config.json:   0%|          | 0.00/695 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/434 [00:00<?, ?B/s]
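As a rough sanity check on why 4-bit loading helps on the free Colab tier, the sketch below scales the shard sizes reported in the download log above to a lower weight precision. It assumes the checkpoint stores 16-bit weights, and it ignores quantization constants, activations, and the KV cache, so the result is only an approximate lower bound. The helper name `estimate_weight_memory_gb` is hypothetical.

```python
# Shard sizes reported by the download log above, in decimal gigabytes.
SHARD_SIZES_GB = [9.98, 3.50]


def estimate_weight_memory_gb(shard_sizes_gb, stored_bits=16, target_bits=4):
    """Scale the on-disk checkpoint size to a different weight precision.

    Assumption: the checkpoint stores `stored_bits` bits per parameter.
    """
    total_gb = sum(shard_sizes_gb)
    return total_gb * target_bits / stored_bits


fp16_gb = sum(SHARD_SIZES_GB)
nf4_gb = estimate_weight_memory_gb(SHARD_SIZES_GB)
print(f"16-bit weights: ~{fp16_gb:.2f} GB, 4-bit weights: ~{nf4_gb:.2f} GB")
```

Under these assumptions the 4-bit weights take about a quarter of the 16-bit footprint.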

In [7]:
tokenizer.vocab_size

32000

## Add num_dialogues, num_people, and bin fields to the dataset (for balanced sampling)

In [70]:
### WRITE YOUR CODE TO BUILD THE DATASET HERE
from datasets import load_dataset
dataset = load_dataset('Samsung/samsum')

dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 14732
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 819
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 818
    })
})

In [71]:
dataset["train"][14728]['dialogue'].split('\r\n')

['Theresa: <file_photo>',
 'Theresa: <file_photo>',
 'Theresa: Hey Louise, how are u?',
 'Theresa: This is my workplace, they always give us so much food here 😊',
 "Theresa: Luckily they also offer us yoga classes, so all the food isn't much of a problem 😂",
 'Louise: Hey!! 🙂 ',
 "Louise: Wow, that's awesome, seems great 😎 Haha",
 "Louise: I'm good! Are you coming to visit Stockholm this summer? 🙂",
 "Theresa: I don't think so :/ I need to prepare for Uni.. I will probably attend a few lessons this winter",
 'Louise: Nice! Do you already know which classes you will attend?',
 'Theresa: Yes, it will be psychology :) I want to complete a few modules that I missed :)',
 'Louise: Very good! Is it at the Uni in Prague?',
 'Theresa: No, it will be in my home town :)',
 "Louise: I have so much work right now, but I will continue to work until the end of summer, then I'm also back to Uni, on the 26th September!",
 'Theresa: You must send me some pictures, so I can see where you live :) ',
 'Lo

In [72]:
def split_turns(dialogue):
  # SAMSum dialogues separate turns with '\r\n' or '\n'
  sep = '\r\n' if '\r\n' in dialogue else '\n'
  return dialogue.split(sep)

def count_num_dialogues(dialogue):
  # Number of turns (lines) in the dialogue
  return len(split_turns(dialogue))

def count_num_people(dialogue):
  # Number of distinct speakers; the speaker name precedes the first ':'
  people = {turn.split(':')[0] for turn in split_turns(dialogue)}
  return len(people)

def get_bin(num):
  # Bucket a turn count into one of four bins; counts outside 3-30 return None
  if 3 <= num <= 6:
    return 0
  elif 7 <= num <= 12:
    return 1
  elif 13 <= num <= 18:
    return 2
  elif 19 <= num <= 30:
    return 3
  return None
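To make the binning concrete, here is a small self-contained variation on the helpers above, run on an invented toy dialogue. It is a sketch, not the code used to build the dataset: it uses `str.splitlines()` (which also handles `\r\n`), skips blank lines, and looks up the bin with `bisect`. The names `dialogue_stats` and `turn_bin` are hypothetical.

```python
from bisect import bisect_left

# Inclusive upper edges of the four bins used above: 3-6, 7-12, 13-18, 19-30.
BIN_LOWER = 3
BIN_UPPER_EDGES = [6, 12, 18, 30]


def dialogue_stats(dialogue):
    """Return (num_turns, num_speakers) for a SAMSum-style dialogue string."""
    # splitlines() handles both '\r\n' and '\n'; blank lines are ignored here.
    turns = [line for line in dialogue.splitlines() if line.strip()]
    # The speaker name is everything before the first ':' on each line.
    speakers = {line.split(":", 1)[0].strip() for line in turns}
    return len(turns), len(speakers)


def turn_bin(num_turns):
    """Map a turn count to a bin index 0-3, or None when it is out of range."""
    if num_turns < BIN_LOWER or num_turns > BIN_UPPER_EDGES[-1]:
        return None
    return bisect_left(BIN_UPPER_EDGES, num_turns)


# Invented toy dialogue, not taken from the dataset.
toy = "Ana: Are you free tonight?\r\nBen: Yes, why?\r\nAna: Movie at 8?\r\nBen: Sure!"
num_turns, num_speakers = dialogue_stats(toy)
print(num_turns, num_speakers, turn_bin(num_turns))
```

Dialogues with fewer than 3 or more than 30 turns get no bin, which is why they drop out of the balanced sample later on.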

In [73]:
splits = ["train","test","validation"]
for split in splits:
  val_num_dialogues = []
  val_num_people = []
  val_bin = []
  for i in range(len(dataset[split])):
    num = count_num_dialogues(dataset[split][i]['dialogue'])
    val_num_dialogues.append(num)
    val_bin.append(get_bin(num))
    val_num_people.append(count_num_people(dataset[split][i]['dialogue']))

  dataset[split]=dataset[split].add_column('num_dialogues',val_num_dialogues)
  dataset[split]=dataset[split].add_column('num_people',val_num_people)
  dataset[split]=dataset[split].add_column('bin',val_bin)

In [74]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'num_dialogues', 'num_people', 'bin'],
        num_rows: 14732
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'num_dialogues', 'num_people', 'bin'],
        num_rows: 819
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'num_dialogues', 'num_people', 'bin'],
        num_rows: 818
    })
})

In [75]:
dataset.push_to_hub("ysahil97/samsum")

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/15 [00:00<?, ?ba/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/638 [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/datasets/ysahil97/samsum/commit/0fe4de7fe4bca0cfe5d236fe2ff90b92c6313b7f', commit_message='Upload dataset', commit_description='', oid='0fe4de7fe4bca0cfe5d236fe2ff90b92c6313b7f', pr_url=None, pr_revision=None, pr_num=None)

## Sample the Dataset

In [92]:
dataset = load_dataset('ysahil97/samsum')
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'num_dialogues', 'num_people', 'bin'],
        num_rows: 14732
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'num_dialogues', 'num_people', 'bin'],
        num_rows: 819
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'num_dialogues', 'num_people', 'bin'],
        num_rows: 818
    })
})

In [101]:
train_split = dataset["train"]
dialogue_bins = [0]*4
for i in range(len(train_split)):
  if train_split[i]["num_dialogues"] >= 3 and train_split[i]["num_dialogues"] <= 6:
    dialogue_bins[0] += 1
  elif train_split[i]["num_dialogues"] >= 7 and train_split[i]["num_dialogues"] <= 12:
    dialogue_bins[1] += 1
  elif train_split[i]["num_dialogues"] >= 13 and train_split[i]["num_dialogues"] <= 18:
    dialogue_bins[2] += 1
  elif train_split[i]["num_dialogues"] >= 19 and train_split[i]["num_dialogues"] <= 30:
    dialogue_bins[3] += 1
print(dialogue_bins)

[2101, 2101, 2101, 2101]


In [None]:
train_split = dataset["train"]
# dialogue_bins = [0]*4
idxs = []
bin0 = 0
bin1 = 0
bin2 = 0
bin3 = 0
for i in range(len(train_split)):
  if train_split[i]["bin"] == 0 and bin0 <= 2100 :
    idxs.append(i)
    bin0 += 1
  elif train_split[i]["bin"] == 1 and bin1 <= 2100:
    idxs.append(i)
    bin1 += 1
  elif train_split[i]["bin"] == 2 and bin2 <= 2100:
    idxs.append(i)
    bin2 += 1
  elif train_split[i]["bin"] == 3 and bin3 <= 2100:
    idxs.append(i)
    bin3 += 1
# print(dialogue_bins)
idxs
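The capped, per-bin selection in the cell above can also be written with a single counter. The sketch below shows the idea on an invented list of bin labels; the function name `balanced_indices` and the toy data are hypothetical.

```python
from collections import Counter


def balanced_indices(bins, per_bin):
    """Return indices of the first `per_bin` examples seen in each bin.

    Examples whose bin is None (outside every range) are skipped.
    """
    taken = Counter()
    selected = []
    for idx, b in enumerate(bins):
        if b is not None and taken[b] < per_bin:
            selected.append(idx)
            taken[b] += 1
    return selected


# Invented bin labels standing in for the dataset's 'bin' column.
toy_bins = [0, 1, 0, None, 2, 0, 1, 3, 3, 3]
toy_idxs = balanced_indices(toy_bins, per_bin=2)
print(toy_idxs)
print(Counter(toy_bins[i] for i in toy_idxs))
```

Note that the condition `<= 2100` in the cell above admits 2,101 examples per bin, which matches the 8,404 rows (4 × 2,101) of the subsampled split; the equivalent call here would use `per_bin=2101`.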

In [99]:
subsampled_train = train_split.select(idxs)
subsampled_train

Dataset({
    features: ['id', 'dialogue', 'summary', 'num_dialogues', 'num_people', 'bin'],
    num_rows: 8404
})

In [103]:
dataset["train"] = subsampled_train.shuffle(seed=42)

In [104]:
dataset.push_to_hub("ysahil97/samsum_subsampled")

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/9 [00:00<?, ?ba/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/687 [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/datasets/ysahil97/samsum_subsampled/commit/f3ecdf61873ef6355ec2ff734b331ad0a5381299', commit_message='Upload dataset', commit_description='', oid='f3ecdf61873ef6355ec2ff734b331ad0a5381299', pr_url=None, pr_revision=None, pr_num=None)

**Link to the dataset on Hugging Face Hub:** https://huggingface.co/datasets/ysahil97/samsum_subsampled

# 2. Finetune a Foundation Model

Now that you have collected a dataset, it's time to pick a base model to finetune.


* Go to the [Hugging Face Hub](https://huggingface.co/models) and pick a foundation model to fine-tune. (For example, if you are interested in generating images, you could pick [Stable Diffusion 1.5](https://huggingface.co/runwayml/stable-diffusion-v1-5) or [Stable Diffusion 3](https://huggingface.co/stabilityai/stable-diffusion-3-medium) as your base model.) Make sure to pick a model that can be loaded in the free tier of the Colab Notebook.
* Then finetune your model on the dataset that you collected in Step 1. There are different ways to finetune a model: [from LoRA to a full finetune](https://huggingface.co/docs/diffusers/v0.13.0/en/training/lora). Pick one of these methods, and explain your reasoning below. We suggest that you use the `transformers` or `diffusers` library to finetune a foundation model.
* Generate some samples from the base model and from the final finetuned model. How do they compare?  
* [Upload the model to the Hugging Face Hub](https://huggingface.co/docs/hub/adding-a-model), and add a link to your model below.
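When weighing LoRA against a full finetune, it can help to count trainable parameters. The sketch below is a back-of-the-envelope calculation, not a training recipe. The setup is an assumption that is not taken from this notebook: a Llama-2-7B-style model with hidden size 4096 and 32 layers, with rank-8 adapters on the query and value projections only. The helper name `lora_trainable_params` is hypothetical.

```python
def lora_trainable_params(d_in, d_out, rank, num_matrices):
    """Parameters added by LoRA.

    Each adapted (d_out x d_in) weight matrix gets two low-rank factors:
    A with shape (rank x d_in) and B with shape (d_out x rank).
    """
    return num_matrices * rank * (d_in + d_out)


# Assumed setup (illustrative only): hidden size 4096, 32 layers, rank 8,
# adapting the query and value projections in every layer.
hidden, layers, rank = 4096, 32, 8
lora = lora_trainable_params(hidden, hidden, rank, num_matrices=2 * layers)
full = 2 * layers * hidden * hidden  # the same matrices under a full finetune

print(f"LoRA: {lora:,} trainable parameters")
print(f"Full: {full:,} parameters in the adapted matrices")
print(f"Ratio: {lora / full:.4%}")
```

Under these assumptions LoRA trains well under 1% of the parameters in the adapted matrices, which is the main reason it fits on small GPUs.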


In [None]:
### WRITE YOUR CODE TO FINETUNE THE MODEL HERE

**Write up**:
* Explain what finetuning strategy you used and why

[WRITE HERE]

* Share some samples from the base model and from the final finetuned model. How do they compare?

[WRITE HERE]
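One possible way to put a number on the comparison is a unigram-overlap score in the spirit of ROUGE-1. The sketch below is a simplified stand-in (whitespace tokenization, no stemming), not the official ROUGE implementation; the function name `rouge1_f1` and the example strings are invented for illustration.

```python
from collections import Counter


def rouge1_f1(reference, candidate):
    """Simplified unigram-overlap F1 between a reference and a candidate."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    # Clipped overlap: each word counts at most as often as in the reference.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Invented strings, standing in for a gold summary and two model outputs.
reference = "ana and ben will watch a movie at 8"
output_a = "ana and ben will watch a movie"
output_b = "ben is free tonight"

print(f"output_a: {rouge1_f1(reference, output_a):.3f}")
print(f"output_b: {rouge1_f1(reference, output_b):.3f}")
```

Averaging such a score over the validation split for both the base and the finetuned model gives a rough quantitative comparison to report next to the qualitative samples.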

**Link to the model on Hugging Face Hub:** [LINK HERE]