In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Installing required packages

#### 1. `bitsandbytes`
- **Purpose**: Provides 8-bit optimizers and quantization techniques to reduce memory usage and computational load.
- **Usage**: 
  - Helps in reducing the size of large language models (LLMs) without significant loss in performance.
  - Used for memory-efficient fine-tuning.

#### 2. `transformers`
- **Purpose**: The core library for state-of-the-art NLP models like BERT, GPT, T5, etc.
- **Usage**:
  - Provides pre-trained models and APIs for tokenization, model inference, and fine-tuning.
  - Essential for working with transformer-based architectures.

#### 3. `peft` (Parameter-Efficient Fine-Tuning)
- **Purpose**: Focuses on fine-tuning methods that reduce the number of parameters that need to be updated.
- **Usage**:
  - Techniques like LoRA (Low-Rank Adaptation) or prefix-tuning to adapt LLMs efficiently.
  - Ideal for scenarios with limited computational resources.

#### 4. `accelerate`
- **Purpose**: Simplifies distributed training and inference of models across multiple devices.
- **Usage**:
  - Facilitates seamless training on CPUs, GPUs, or TPUs.
  - Helps in optimizing performance for large-scale models.

#### 5. `trl` (Transformer Reinforcement Learning)
- **Purpose**: Provides tools to train language models with supervised fine-tuning and reinforcement learning.
- **Usage**:
  - Supplies the `SFTTrainer` used later in this notebook for supervised fine-tuning.
  - Implements techniques like PPO (Proximal Policy Optimization) to align models with specific goals or preferences.
  - Useful for tasks where human feedback or reward signals guide the model’s training.

In [1]:
%%capture
%pip install -U bitsandbytes
%pip install -U transformers
%pip install -U peft
%pip install -U accelerate
%pip install -U trl

# Importing necessary packages

In [3]:
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig, HfArgumentParser, pipeline, logging
from peft import LoraConfig, PeftConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
import os, torch, wandb
from datasets import load_dataset
from trl import SFTTrainer

If you are running on Kaggle, store your Hugging Face and Weights & Biases tokens in Kaggle Secrets. Otherwise, change the code to read them from environment variables.
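
Outside Kaggle, the same tokens can come from environment variables. The snippet below is a minimal sketch of that alternative: `read_token` is a hypothetical helper, and the variable names `HUGGINGFACE_TOKEN` and `WANDB_TOKEN` are assumptions chosen to mirror the secret labels, not names required by any library.

```python
import os

def read_token(name, default=None):
    """Return the secret stored in environment variable `name` (hypothetical helper)."""
    value = os.environ.get(name, default)
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running this notebook.")
    return value

# Assumed variable names; export them in your shell before starting Jupyter.
# hf_token = read_token("HUGGINGFACE_TOKEN")
# wandb_token = read_token("WANDB_TOKEN")
```

Failing early with a clear message is usually easier to debug than a failed login several cells later.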

In [4]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
hf_token = user_secrets.get_secret("HUGGINGFACE_TOKEN")
wandb_token = user_secrets.get_secret("WandB_TOKEN")

Logging in to the Hugging Face Hub with the token stored in `hf_token` lets you push the fine-tuned model to the Hub later.

In [5]:
!huggingface-cli login --token $hf_token

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


# Configuring Weights and Biases

WandB, or Weights and Biases, is a great tool for monitoring and visualizing the training process of machine learning models. It allows users to track various metrics in real-time, offering insights into key aspects such as GPU utilization, memory consumption, and loss metrics throughout the training phase.

To configure it, create an account at https://wandb.ai/ and copy your API key for use in the code.

In [None]:
wandb.login(key = wandb_token)

run = wandb.init (
    project = "Scripter - Gemma Finetuning",
    job_type = 'training',
    anonymous = 'allow'
)

In [6]:
base_model = "google/gemma-2-2b-it"
new_model = 'gemma2_scripter'

# Quantizing the Model

We load the model with 4-bit quantization via bitsandbytes, which reduces the memory and compute cost of training.
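
To get a feel for the saving, the back-of-the-envelope sketch below estimates the memory needed for the weights alone at different precisions. The parameter count is an assumed round figure for a "2B"-class model, and the estimate ignores quantization constants, layers that stay in higher precision, activations, and optimizer state.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Approximate memory needed to hold the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3

# Assumption: a round figure of 2.6 billion parameters for a "2B"-class model.
n_params = 2.6e9

for label, bits in [("bf16", 16), ("int8", 8), ("nf4", 4)]:
    print(f"{label:>5}: {weight_memory_gib(n_params, bits):.2f} GiB")
```

Going from 16-bit to 4-bit weights cuts this figure by a factor of four, which is what makes fine-tuning feasible on a single Kaggle GPU.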

In [None]:
bnb_config = BitsAndBytesConfig(  
    load_in_4bit= True,
    bnb_4bit_quant_type= "nf4",
    bnb_4bit_compute_dtype= torch.bfloat16,
    bnb_4bit_use_double_quant= False,
)

model = AutoModelForCausalLM.from_pretrained(
        base_model,
        quantization_config=bnb_config,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        trust_remote_code=True,
)

model.config.use_cache = False # silence the warnings
model.config.pretraining_tp = 1
model.gradient_checkpointing_enable()

# Loading the Gemma 2 tokenizer

In [None]:
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code = True)
tokenizer.padding_side = 'right'
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_eos_token = True
tokenizer.add_bos_token, tokenizer.add_eos_token

To train with LoRA, we attach extra low-rank adapter weights to the model's linear layers, so we first need the names of those layers. The code below collects the names of all 4-bit linear modules that can take an adapter.

In [None]:
import bitsandbytes

def find_all_linear_names(model):
    """Return the leaf names of all 4-bit linear layers, which LoRA can target."""
    cls = bitsandbytes.nn.Linear4bit
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            # Keep only the last component, e.g. '...self_attn.q_proj' -> 'q_proj'
            lora_module_names.add(name.split('.')[-1])
    # Leave the output head out of the adapter targets (relevant for 16-bit models)
    lora_module_names.discard('lm_head')
    return list(lora_module_names)

modules = find_all_linear_names(model)

In [None]:
modules
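
The name handling in `find_all_linear_names` can be tried without loading a model. The sketch below applies the same "keep the last dotted component" rule to a hand-written list of module names. The names are illustrative assumptions in the style of `model.named_modules()`, not output captured from this notebook, and `leaf_names` is a hypothetical helper.

```python
# Illustrative dotted module names, in the style that `model.named_modules()` yields.
# These are assumed examples, not output captured from the notebook.
example_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.mlp.gate_proj",
    "model.layers.0.mlp.up_proj",
    "model.layers.0.mlp.down_proj",
    "model.layers.1.self_attn.q_proj",
    "lm_head",
]

def leaf_names(names, exclude=("lm_head",)):
    """Collect the unique last component of each dotted name, skipping excluded heads."""
    leaves = {name.split(".")[-1] for name in names}
    return sorted(leaves - set(exclude))

print(leaf_names(example_names))
```

Because the same projection names repeat in every layer, a short list of leaf names is enough for LoRA to target every matching layer in the model.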

# PEFT Configuration

In [None]:
model = prepare_model_for_kbit_training(model)
peft_config = LoraConfig(

    lora_alpha = 16,
    lora_dropout = 0.1,
    r = 16,
    bias = 'none',
    task_type = 'CAUSAL_LM',
    target_modules = modules
    
)
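
As a rough illustration of what `r` and `lora_alpha` control, the sketch below counts the trainable weights LoRA adds to a single linear layer and computes the standard `lora_alpha / r` scaling factor. The layer shape is an assumption picked for illustration. The real shapes depend on the model and differ between projections.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable weights LoRA adds to one d_in x d_out linear layer: A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)

def lora_scaling(lora_alpha, r):
    """Factor applied to the low-rank update B @ A in standard LoRA."""
    return lora_alpha / r

# Assumed layer shape for illustration only; the real shapes depend on the model.
d_in, d_out = 2048, 2048
r, lora_alpha = 16, 16  # values from the LoraConfig above

full = d_in * d_out
added = lora_param_count(d_in, d_out, r)
print(f"full layer: {full:,} weights, LoRA adds {added:,} ({added / full:.2%})")
print(f"scaling = {lora_scaling(lora_alpha, r)}")
```

With `r = 16` and `lora_alpha = 16` the scaling factor is 1, so the adapter update is added to the frozen weights without extra amplification.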

# Training Arguments 

In [None]:
training_arguments = TrainingArguments(
    output_dir ='./results',
    num_train_epochs = 3,
    per_device_train_batch_size = 4,
    gradient_accumulation_steps = 4,
    optim = 'paged_adamw_32bit',
    save_steps = 50,
    logging_steps = 5,
    learning_rate = 1e-4,
    weight_decay = 0.01,
    fp16 = False,
    bf16 = False,
    max_grad_norm = 0.3,
    max_steps = -1,
    warmup_ratio = 0.05,
    group_by_length = True,
    lr_scheduler_type = 'cosine',
    report_to = "wandb"
)
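
The batch size, gradient accumulation, and warmup settings interact, so it can help to work out the resulting schedule by hand. The sketch below does that arithmetic. It assumes a single device, uses the training-split size of 1356 printed later in this notebook, and rounds up, so the Trainer's own count may differ by a few steps.

```python
import math

def training_schedule(n_samples, per_device_batch, grad_accum, epochs, warmup_ratio, n_devices=1):
    """Rough optimizer-step counts; the Trainer's own rounding may differ by a few steps."""
    effective_batch = per_device_batch * grad_accum * n_devices
    steps_per_epoch = math.ceil(n_samples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return effective_batch, steps_per_epoch, total_steps, warmup_steps

# 1356 is the training-split size printed later in this notebook;
# n_devices=1 is an assumption about the hardware.
print(training_schedule(1356, per_device_batch=4, grad_accum=4, epochs=3, warmup_ratio=0.05))
```

Under these assumptions the run takes roughly 250 optimizer steps, so `save_steps = 50` yields checkpoints at steps 50, 100, 150, 200, and 250.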

# Loading and Preprocessing the Data for Training

The data is available on Kaggle. You can find it here: https://www.kaggle.com/datasets/sidharthangn/youtube-subtitles-and-comments-dataset

In [9]:
import pandas as pd

df = pd.read_csv('/kaggle/input/youtube-subtitles-and-comments-dataset/cleaned_youtube_scriptsV5.csv')
df

Unnamed: 0,subtitle,keywords,formatted_text
0,Do you guys know about the Putnam? It's a math...,"math competition, nam, undergraduate students",<bos><start_of_turn>keywords\nmath competition...
1,"Here, we look at the math behind an animation ...","Fourier series, clockwork rigidity, frequency,...","<bos><start_of_turn>keywords\nFourier series, ..."
2,What does it mean to have a Bitcoin? Many peop...,"Bitcoin, communal ledger, credit card, cryptoc...","<bos><start_of_turn>keywords\nBitcoin, communa..."
3,This is a 3. It's sloppily written and rendere...,"machine learning, neural networks, visual cortex",<bos><start_of_turn>keywords\nmachine learning...
4,"Sometimes, math and physics conspire in ways t...","collisions, mathematical croquet, sliding blocks","<bos><start_of_turn>keywords\ncollisions, math..."
...,...,...,...
1423,welcome back to web dev simplified in today's ...,"JavaScript form validation, error messages, fo...",<bos><start_of_turn>keywords\nJavaScript form ...
1424,there are a million ways to manipulate the dom...,"append child, dom manipulation, html file, jav...","<bos><start_of_turn>keywords\nappend child, do..."
1425,I love react hooks but they are pretty confusi...,"app component, boilerplate code, react hooks","<bos><start_of_turn>keywords\napp component, b..."
1426,over the last 20 years websites have gone from...,"HTML pages, MVC, birthing controller, data log...","<bos><start_of_turn>keywords\nHTML pages, MVC,..."


# Preprocessing the Data

In [10]:
from datasets import Dataset
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np

def create_simple_dataset(texts, 
                         val_size: float = 0.05,
                         seed: int = 42) -> tuple:
    """
    Convert a list of formatted texts to HuggingFace Datasets.
    
    Args:
        texts (List[str]): List of formatted texts
        val_size (float): Size of validation split (default 0.05 for 5%)
        seed (int): Random seed for reproducibility
        
    Returns:
        tuple: (training_dataset, validation_dataset)
    """
    # Split the texts into train and validation
    train_texts, val_texts = train_test_split(
        texts,
        test_size=val_size,
        random_state=seed
    )
    
    # Create datasets directly from the lists
    train_dataset = Dataset.from_dict({"text": train_texts})
    val_dataset = Dataset.from_dict({"text": val_texts})
    
    print(f"\nDataset splits:")
    print(f"Training samples: {len(train_dataset)}")
    print(f"Validation samples: {len(val_dataset)}")
    
    # Show sample from dataset
    print("\nSample from training dataset:")
    print("-" * 50)
    print(train_dataset[0]['text'])
    
    return train_dataset, val_dataset

In [11]:
formatted_texts = df['formatted_text'].tolist()

train_dataset, val_dataset = create_simple_dataset(
    texts = formatted_texts,
    val_size = 0.05
)


Dataset splits:
Training samples: 1356
Validation samples: 72

Sample from training dataset:
--------------------------------------------------
<bos><start_of_turn>keywords
crosshatch waffle texture, dark chocolate, four bar crispy wafers, kat, milk chocolate<end_of_turn>
<start_of_turn>script
so i used to do this as a kid when i ate them without like separate out the layers oh my god this is definitely surprise anybody oh god covered in chocolate hey everyone i'm claire and today i am making a gourmet version of kitkat one of my favorites [Music] so this is the classic four bar crispy wafers in milk chocolate you get that little snap when you break one of the bars off it has this sort of squared off base and then it tapers as you get to the top it's really just a stack of three crisp wafers with chocolate in between and i think another big key is this very very fine very uniform crosshatch waffle texture that i think is going to be challenging to to get that texture because we don't 
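
The sample above shows the template the model is trained on: a `keywords` turn followed by a `script` turn. The sketch below rebuilds that template from raw fields. `build_prompt` and `build_training_text` are hypothetical helpers, and the closing `<end_of_turn>` after the script is an assumption, since the printed sample is cut off before its end.

```python
def build_prompt(keywords):
    """Format a keyword list into the 'keywords' turn and open the 'script' turn."""
    joined = ", ".join(k.strip() for k in keywords)
    return (
        "<bos><start_of_turn>keywords\n"
        f"{joined}<end_of_turn>\n"
        "<start_of_turn>script\n"
    )

def build_training_text(keywords, script):
    """Full training example; the closing <end_of_turn> is an assumption,
    since the sample printed above is cut off before its end."""
    return build_prompt(keywords) + script.strip() + "<end_of_turn>"

print(build_prompt(["Bitcoin", "communal ledger", "cryptocurrency"]))
```

Building inference prompts with the same helper keeps them consistent with the training format.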

In [None]:
import pandas as pd
from transformers import AutoTokenizer
import matplotlib.pyplot as plt
import numpy as np

def analyze_sequence_lengths(df, tokenizer, formatted_text_column='formatted_text'):
    """
    Analyze the sequence lengths in the dataset to help determine optimal max_seq_length.
    
    Parameters:
    df (pandas.DataFrame): DataFrame containing the training data
    tokenizer: The tokenizer being used for training
    formatted_text_column (str): Name of the column containing the formatted text
    
    Returns:
    dict: Statistics about sequence lengths
    """
    # Tokenize all sequences
    lengths = []
    for text in df[formatted_text_column]:
        tokens = tokenizer.encode(text)
        lengths.append(len(tokens))
    
    # Calculate statistics
    stats = {
        'mean_length': np.mean(lengths),
        'median_length': np.median(lengths),
        'max_length': max(lengths),
        'min_length': min(lengths),
        'percentile_90': np.percentile(lengths, 90),
        'percentile_95': np.percentile(lengths, 95),
        'percentile_99': np.percentile(lengths, 99)
    }
    
    # Create length distribution plot
    plt.figure(figsize=(10, 6))
    plt.hist(lengths, bins=50)
    plt.title('Distribution of Sequence Lengths')
    plt.xlabel('Sequence Length (tokens)')
    plt.ylabel('Frequency')
    plt.axvline(stats['percentile_90'], color='r', linestyle='--', label='90th percentile')
    plt.axvline(stats['percentile_95'], color='g', linestyle='--', label='95th percentile')
    plt.legend()
    plt.show()
    
    return stats

# Example usage:
#tokenizer = AutoTokenizer.from_pretrained("your-model-name")
stats = analyze_sequence_lengths(df, tokenizer)
print(f"Recommended max_seq_length: {int(stats['percentile_95'])}")

# Training the model using Hugging Face SFTTrainer

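Note: A larger `max_seq_length` may cause an out-of-memory error. Increase it only if you have enough GPU memory. Depending on the installed `trl` version, `max_seq_length`, `dataset_text_field`, and `packing` may need to be passed through `SFTConfig` rather than directly to `SFTTrainer`.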
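
Before settling on a cap, it is worth checking how much data the cap throws away. The sketch below reports the share of sequences that get truncated and the share of tokens dropped for a given `max_seq_length`. The token counts are hypothetical stand-ins for the per-example lengths gathered inside `analyze_sequence_lengths`, and `truncation_report` is an illustrative helper.

```python
def truncation_report(lengths, max_seq_length):
    """Return (share of sequences cut off, share of tokens discarded) at a given cap."""
    n_truncated = sum(1 for n in lengths if n > max_seq_length)
    tokens_lost = sum(max(n - max_seq_length, 0) for n in lengths)
    return n_truncated / len(lengths), tokens_lost / sum(lengths)

# Hypothetical token counts, standing in for the real per-example lengths.
lengths = [180, 420, 512, 900, 2048]

for cap in (512, 1024):
    seq_share, tok_share = truncation_report(lengths, cap)
    print(f"max_seq_length={cap}: {seq_share:.0%} of sequences truncated, {tok_share:.0%} of tokens dropped")
```

If most scripts are longer than the cap, the model only ever sees their openings during training, which may also shape how generated scripts end.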

In [None]:
trainer = SFTTrainer(
    model = model,
    train_dataset = train_dataset,
    eval_dataset = val_dataset,
    peft_config = peft_config,
    max_seq_length = 512,
    dataset_text_field = 'text',
    tokenizer = tokenizer,
    args = training_arguments,
    packing = False
)

In [None]:
trainer.train()

In [None]:
wandb.finish()

In [None]:
# 1. Recreate the 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=False
)

checkpoint_path = '/kaggle/working/results/checkpoint-200'

# 2. Load the base model first
base_model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b-it",
    quantization_config=bnb_config,
    device_map='auto',
    trust_remote_code=True
)

# 3. Load the PEFT config from checkpoint
peft_config = PeftConfig.from_pretrained(checkpoint_path)

# 4. Load the adapter weights
fine_tuned_model = PeftModel.from_pretrained(
    base_model,
    checkpoint_path,
    device_map='auto',
    trust_remote_code=True
)

# 5. Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "google/gemma-2-2b-it",
    trust_remote_code=True
)

# Pushing the fine-tuned model to Hugging Face Hub

In [None]:
fine_tuned_model.push_to_hub(f'Sidharthan/{new_model}')
tokenizer.push_to_hub(f'Sidharthan/{new_model}')

In [None]:
pipe = pipeline(
    "text-generation", 
    model=fine_tuned_model, 
    tokenizer = tokenizer, 
    torch_dtype=torch.bfloat16, 
    device_map="auto"
)

In [None]:
prompt = """
<bos><start_of_turn>keywords
Neural Networks, AI, Riemann Hypothesis, Computation<end_of_turn>
<start_of_turn>script
"""

sequences = pipe(
    prompt,
    do_sample=True,
    max_new_tokens=512, 
    temperature=0.7, 
    top_k=50, 
    top_p=0.95,
    num_return_sequences=1,
)
print(sequences[0]['generated_text'])
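
The `temperature`, `top_k`, and `top_p` arguments shape the distribution each new token is sampled from. The sketch below applies the three filters in that order to a toy distribution. It is a simplified illustration, not the `transformers` implementation, and the logits are made-up values.

```python
import math

def filter_distribution(logits, temperature=1.0, top_k=50, top_p=0.95):
    """Apply temperature, then top-k, then top-p (nucleus) filtering to a {token: logit} dict.

    Simplified illustration; not the transformers implementation.
    """
    # Temperature-scaled softmax weights (subtract the max for numerical stability).
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(v - peak) for tok, v in scaled.items()}

    # Top-k: keep the k most likely tokens and renormalize.
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(w for _, w in ranked)
    ranked = [(tok, w / total) for tok, w in ranked]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

toy_logits = {"a": 2.0, "b": 1.0, "c": 0.0, "d": -1.0}  # made-up values
print(filter_distribution(toy_logits, temperature=1.0, top_k=3, top_p=0.9))
```

Lower temperatures concentrate probability on the top tokens, while smaller `top_k` or `top_p` values cut off the unlikely tail entirely.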

# Testing Time!

In [7]:
from transformers import AutoTokenizer, pipeline
from peft import AutoPeftModelForCausalLM
import torch

model_name = "Sidharthan/gemma2_scripter"

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True
)

model = AutoPeftModelForCausalLM.from_pretrained(
    model_name,
    device_map=None,
    trust_remote_code=True
)

In [8]:

# Instead of using the pipeline, let's create a direct generation function
def generate_script(prompt):
    inputs = tokenizer(prompt, return_tensors="pt")
    
    outputs = model.generate(
        **inputs,
        max_length=1024,
        do_sample=True,
        temperature=0.7,
        top_p=0.95,
        top_k=50,
        repetition_penalty=1.2,
        num_return_sequences=1,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id
    )
    
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Test the generation
prompt = """
<bos><start_of_turn>keywords
Neural Networks, P vs NP problem, AI, prime numbers
<start_of_turn>script
"""
response = generate_script(prompt)
print(f"Generated comment: {response}")

Generated comment: 
keywords
Neural Networks, P vs NP problem, AI, prime numbers
script
We often hear the phrase "AI will solve all our problems" but is it really true? Well today we're going to talk about what exactly Artificial Intelligence actually means and then compare that with how well AI has done so far. In this video I want you guys to think critically as a whole team of scientists together because there are many different opinions on whether or not AI will be able to solve all our problems in the future. And if they do have those capabilities should we even worry at all about them taking over the world? Let me know your thoughts below! So first let's define artificial intelligence itself - It involves creating systems capable of performing tasks normally requiring human intellect like learning from experience, using logic to draw conclusions, making decisions based on information gathered previously... basically anything humans can do mentally except for things which require 

In [2]:
prompt = """
<bos><start_of_turn>user
Do you have intelligence as same as humans?
<start_of_turn>model
"""
response = generate_script(prompt)
print(f"Generated comment: {response}")

Generated comment: 
user
Do you have intelligence as same as humans?
model
That's an interesting question! I don't experience the world or think like a human. My responses are based on patterns in data, not personal feelings or beliefs.  

Think of it this way: if you were to feed me all the books and articles ever written about how people think, feel, love, hate, dream, etc., I could write poems that sound very similar to famous works from Shakespeare for instance, but would never actually be able to *experience* those emotions myself. 


I am still under development and


In [5]:
prompt = """
<bos><start_of_turn>keywords
Riemann Hypothesis, Zeta Function, Euler, Hypnotizing Curly Cues, Cryptography, Quantum Computing
<start_of_turn>script
"""
response = generate_script(prompt)
print(f"Generated comment: {response}")

Generated comment: 
keywords
Riemann Hypothesis, Zeta Function, Euler, Hypnotizing Curly Cues, Cryptography, Quantum Computing
script
The Riemann hypothesis is the most important unsolved problem in mathematics. It's so hard because it involves a highly complex mathematical object called the zeta function and has been studied by some of history's greatest minds like Gauss, Riemann himself, and many others since 1859. This video will give you an introduction to what this math actually means as well as why people think that proving it could help solve other major problems such as quantum computing or cryptography. So first off let me explain what I mean when I say that mathematicians are trying to find solutions for these kinds of questions. Let's consider two sets: one set A contains all positive integers between 1 and infinity, and another B equals the set of numbers which we call prime numbers. You might be thinking "Wait! How can there be infinitely many primes? Surely they must have

In [6]:
prompt = """
<bos><start_of_turn>user
Do you have consciousness?
<start_of_turn>model
"""
response = generate_script(prompt)
print(f"Generated comment: {response}")

Generated comment: 
user
Do you have consciousness?
model
As an AI, I don't experience the world in the same way a human does.  I can process and understand language very well though! If that makes me "conscious" then maybe so...but it doesn't change what my purpose is: to be helpful. 😊 script


In [7]:
prompt = """
<bos><start_of_turn>user
What are you good at as of now?
<start_of_turn>model
"""
response = generate_script(prompt)
print(f"Generated comment: {response}")

Generated comment: 
user
What are you good at as of now?
model
As an AI assistant, I'm quite proficient in many things! Here are some examples:

**Language:** 
* **Generating different creative text formats**: Poems, code, scripts, musical pieces, email, letters, etc.  I will try my best to fulfill all your requirements.
* **Answering questions**:  To the best of my ability based on what I have been trained on (my knowledge cut-off is September 2021). 
* **Summarizing factual topics**: Give me a topic and I can give you a summary. For example "Tell me about the French Revolution" or "Summarise the events leading up to World War One."
* **Translating languages:** While not perfect yet, I can translate between many different languages pretty well. 
* **Writing stories/scripts**: If you want to tell a story together, just let me know where it should go next.

**It's important to remember though that:**
* I am still under development, so I may make mistakes sometimes. Please be patient wit

In [8]:
prompt = """
<bos><start_of_turn>keywords
Deep Learning, Transformers, Neural Networks, Attention, CNN, LSTM, GPU, Parallel, Compute, Data
<start_of_turn>script
"""
response = generate_script(prompt)
print(f"Generated comment: {response}")

Generated comment: 
keywords
Deep Learning, Transformers, Neural Networks, Attention, CNN, LSTM, GPU, Parallel, Compute, Data
script
(upbeat music) Hey guys. So I've been hearing a lot about this new AI called GPT-3 which is like the next evolution of neural networks and it can generate text that almost seems human written but in reality is generated by just billions and billions of parameters within its network... 45 billion to be exact. Now you might have heard all sorts of claims around how good or bad it is but most people don't really understand what makes these models tick so today we are going to break down exactly how they work using examples from different domains: natural language processing (NLP), computer vision, image generation with a little bit of math thrown in for fun as well! And lastly, let me know if any of you would want us to build our own model based on one of your favorite applications? Alright, before diving into building something though, It's important to tal

In [9]:
prompt = """
<bos><start_of_turn>keywords
Universe, 10^10^122, Multiverse, Parallel Worlds, Combinations, Probability, Yourself, Find in another world, Infinite
<start_of_turn>script
"""
response = generate_script(prompt)
print(f"Generated comment: {response}")

Generated comment: 
keywords
Universe, 10^10^122, Multiverse, Parallel Worlds, Combinations, Probability, Yourself, Find in another world, Infinite
script
You know how everyone says there are infinite universes? I'm gonna show you exactly why that makes sense. There are a lot of ways to think about the number of worlds out there. One way is: How many different combinations of everything can we make? Think of it like this, if each particle had two options (to be here or not) then for every atom and all its particles we have 2 choices per particle meaning on average 2*2=4 possible atoms... And those atoms combined with their partners create our universe! But wait, what happens when things get really big? For example take black holes which contain millions upon billions of times more matter than our own galaxy; so where does infinity begin? Well technically it starts from zero but as they say 'Every number has an opposite', and since negative numbers don't exist, Infinity ends at 0. So no