<a href="https://colab.research.google.com/github/williammcintosh/machine_learning_projects/blob/main/Falcon_7B_Instruct_QLoRA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine Tuning Falcon 7B

## Will McIntosh

## Purpose

I wanted to showcase examples of fine tuning language models on lyrics written by a particular artist in order to generate new songs in the style of that selected artist. I was interested in this because it raises a deeper debate:
* How does this impact professional songwriters both positively and negatively?
* How does fine-tuning a language model work? How could it be applied to other goals?

## Dataset

Using a song lyric dataset [from Spotify available on Kaggle](https://www.kaggle.com/datasets/notshrirang/spotify-million-song-dataset), I first looked at the artists with the largest accumulated word counts across all their songs, then filtered the dataset down to songs by Rihanna. This gave me 143 songs, with the longest song being 1109 tokens in length and the majority of her songs being around 400 to 500 tokens in length. The Rihanna song corpus had a total of 59089 tokens.

## About Falcon 7B

The [Falcon 7B available at HuggingFace](https://huggingface.co/tiiuae/falcon-7b) was developed by the [Technology Innovation Institute](https://www.tii.ae/) in Abu Dhabi. Falcon-7B is a causal decoder-only model trained on 1,500B tokens of [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb), a high-quality filtered and deduplicated web dataset, enhanced with curated corpora. Falcon-7B was trained on 384 A100 40GB GPUs; training happened in early March 2023 and took about two weeks. To generate the lyrics I used the `pipeline` function from the HuggingFace Transformers library, [as shown on the model card](https://huggingface.co/tiiuae/falcon-7b). I specifically used [the sharded Falcon 7B Instruct by Vilson Rodrigues](https://huggingface.co/vilsonrodrigues/falcon-7b-instruct-sharded).

## Obstacles

Fine-tuning an open source model in a Google Colab notebook makes it more shareable, but a single T4 GPU imposes hardware limitations. I got around the limitations set by Google Colab by:
* Utilizing gradient checkpointing.
* Freezing the earlier layers.
* Truncating each training example to 256 tokens.

The technique called “Gradient Checkpointing” keeps only strategically selected activations from the forward pass and recomputes the rest when the gradients are needed, trading extra compute for lower memory use. See [this great article](https://medium.com/tensorflow/fitting-larger-networks-into-memory-583e3c758ff9). I had to look into [the repo commits](https://huggingface.co/tiiuae/falcon-40b-instruct/commit/7475ff8cfc36ed9a962b658ae3c33391566a85a5) on the Falcon 7B pretrained model to enable gradient checkpointing internally.
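
To make the idea concrete, here is a minimal pure-Python sketch of gradient checkpointing on a toy chain of scalar layers. It is an illustration only, not how PyTorch or Falcon implement it: the `tanh` layers, the function names, and the `every=4` spacing are all hypothetical choices made for this example. It compares a backward pass that keeps every layer input with one that keeps only every fourth input and recomputes the rest segment by segment.

```python
import math

def layer(x, w):
    """One toy layer: y = tanh(w * x)."""
    return math.tanh(w * x)

def layer_backward(x, w, upstream):
    """Backprop through one layer; returns (dLoss/dx, dLoss/dw)."""
    y = math.tanh(w * x)
    local = 1.0 - y * y  # derivative of tanh evaluated at w * x
    return upstream * local * w, upstream * local * x

def backward_full(x0, weights):
    """Baseline: keep the input of every layer in memory."""
    n = len(weights)
    inputs = [x0]
    for w in weights[:-1]:
        inputs.append(layer(inputs[-1], w))
    grad, w_grads = 1.0, [0.0] * n  # loss = output of the last layer
    for i in reversed(range(n)):
        grad, w_grads[i] = layer_backward(inputs[i], weights[i], grad)
    return w_grads, len(inputs)

def backward_checkpointed(x0, weights, every=4):
    """Keep only every `every`-th layer input; recompute the others per segment."""
    n = len(weights)
    checkpoints = {}
    x = x0
    for i, w in enumerate(weights):
        if i % every == 0:
            checkpoints[i] = x  # input of layer i
        x = layer(x, w)

    grad, w_grads, peak, end = 1.0, [0.0] * n, 0, n
    for start in sorted(checkpoints, reverse=True):
        # Recompute the inputs of layers start .. end-1 from the checkpoint
        inputs = [checkpoints[start]]
        for i in range(start, end - 1):
            inputs.append(layer(inputs[-1], weights[i]))
        peak = max(peak, len(checkpoints) + len(inputs) - 1)
        for i in reversed(range(start, end)):
            grad, w_grads[i] = layer_backward(inputs[i - start], weights[i], grad)
        end = start
    return w_grads, peak

weights = [0.5 + 0.1 * i for i in range(16)]
full_grads, full_stored = backward_full(0.3, weights)
ckpt_grads, ckpt_stored = backward_checkpointed(0.3, weights, every=4)
print(full_stored, ckpt_stored)  # 16 7
```

Both versions produce the same gradients; the checkpointed one holds at most 7 layer inputs at a time instead of 16, at the cost of running part of the forward pass a second time.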

I'm grateful for these open source models!

# Installs and Imports

## Packages

In [None]:
import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

In [None]:
%%capture
!pip install -q -U bitsandbytes==0.41.2.post2
!pip install -q -U einops==0.7.0
!pip install -q -U safetensors==0.4.0
!pip install -q -U torch==2.1.0+cu118
!pip install -q -U xformers==0.0.22.post7
!pip install -q -U datasets==2.14.6
!pip install -q -U transformers==4.35.0
!pip install -q -U peft==0.6.1
!pip install -q -U accelerate==0.24.1

## Libraries

In [None]:
%%capture
import torch
import transformers
from torch.utils.data import Dataset as TorchDataset
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline
from peft import LoraConfig, get_peft_model

# Load Model

## Select Performance Type

This function lets the user select whether they want the full version of Falcon 7B or a partial (sharded) version that will run on the T4 in the free tier of Google Colab.

In [None]:
def get_falcon_model(partial_performance=True):

  # 4bit Quantize configurations
  bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
  )

  if partial_performance:
    # sharded model
    model_id = "vilsonrodrigues/falcon-7b-instruct-sharded"
  else:
    # fullsized model
    model_id = "tiiuae/falcon-7b"

  # Allow the repo's custom modeling code only for the sharded checkpoint
  trust_remote_code = partial_performance

  # Get the pretrained falcon model
  model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map={"":0},
    trust_remote_code=trust_remote_code
  )

  # PEFT (Parameter-Efficient Fine-Tuning) with LoRA (Low-Rank Adaptation)
  config = LoraConfig(
      r=16,
      lora_alpha=32,
      target_modules=["query_key_value"],
      lora_dropout=0.05,
      bias="none",
      task_type="CAUSAL_LM"
  )
  model = get_peft_model(model, config)

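  # Enable gradient checkpointing to save memory.
  # Note: _set_gradient_checkpointing is a private hook; the documented
  # public method is model.gradient_checkpointing_enable().
  if partial_performance:
    model._set_gradient_checkpointing(module=model_id, value=True)
    num_frozen = 50

    # Freeze the first 50 parameter tensors (not 50 layers) to save RAM.
    # LoRA has already frozen the base weights, so this only affects the
    # adapter matrices in the earliest blocks.
    for name, param in list(model.named_parameters())[:num_frozen]:
      param.requires_grad = False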

  # upcast layers to float datatypes
  model = model.float()

  # Move model to GPU
  model.to('cuda')

  return model, model_id
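
The `LoraConfig` above adds a low-rank update to each `query_key_value` projection instead of training the full weight matrix. The sketch below shows the arithmetic on toy sizes in plain Python. It assumes the standard LoRA formulation (output `W x + (lora_alpha / r) * B (A x)`, with `B` starting at zero); the helper names and the tiny dimensions are hypothetical and chosen only for illustration.

```python
import random

def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, r, lora_alpha):
    """y = W x + (lora_alpha / r) * B (A x); only A and B would be trained."""
    scaling = lora_alpha / r
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scaling * d for b, d in zip(base, delta)]

random.seed(0)
d_in, d_out, r, lora_alpha = 8, 6, 2, 4  # toy sizes; the notebook uses r=16, lora_alpha=32
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen base weight
A = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(r)]      # trainable, random init
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, zero init
x = [random.gauss(0, 1) for _ in range(d_in)]

base_out = matvec(W, x)
y_start = lora_forward(W, A, B, x, r, lora_alpha)  # identical to base_out while B is zero

B[0][0] = 0.5  # pretend one training step changed a single adapter entry
y_after = lora_forward(W, A, B, x, r, lora_alpha)  # only the first output changes

full_params = d_in * d_out        # what full fine-tuning would train
lora_params = r * (d_in + d_out)  # what LoRA trains instead
print(full_params, lora_params)   # 48 28
```

The gap between the two counts is small at toy sizes, but it grows quickly with the matrix dimensions, which is why the adapter is a tiny fraction of a 7B-parameter model.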

In [None]:
def print_trainable_parameters_and_layers(model):
    # Count every submodule; named_modules() yields containers as well as
    # leaf layers, so this is an upper bound on the number of layers
    layer_count = sum(1 for _ in model.named_modules())
    print(f"Total number of layers\t: {layer_count}")

    # Count trainable and total parameters
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params\t: {trainable_params}\nall params\t\t: {all_param}\ntrainable %\t\t: {round(100 * trainable_params / all_param,3)}"
    )

## Get Falcon Model

In [None]:
%%capture
model, model_id = get_falcon_model(partial_performance=False)


In [None]:
print_trainable_parameters_and_layers(model)

Total number of layers	: 648
trainable params	: 3761152
all params		: 3613463424
trainable %		: 0.104
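
The figures above can be reproduced with a little arithmetic. The sketch below is a back-of-the-envelope check, not output from the notebook. It assumes Falcon-7B's published dimensions (hidden size 4544, 32 blocks, 71 query heads of size 64 with one shared key and value head, vocabulary 65024), that the LM head shares the embedding matrix, and that 4-bit weights are stored two per byte so that `numel()` reports half of the real count. None of these values appear in the notebook itself.

```python
# Assumed Falcon-7B dimensions (taken from the published model config, not from this notebook)
HIDDEN, N_LAYERS, VOCAB = 4544, 32, 65024
N_HEADS, HEAD_DIM = 71, 64
QKV_OUT = (N_HEADS + 2) * HEAD_DIM  # multi-query attention: 71 query heads + 1 key + 1 value head
FFN = 4 * HIDDEN
R = 16                              # LoRA rank from the LoraConfig above

# Quantized linear layers: numel() sees two 4-bit weights packed into each stored element
linear_per_block = HIDDEN * QKV_OUT + HIDDEN * HIDDEN + 2 * HIDDEN * FFN
packed_linear = N_LAYERS * linear_per_block // 2

embeddings = VOCAB * HIDDEN                # not quantized; assumed tied with the LM head
layer_norms = (N_LAYERS + 1) * 2 * HIDDEN  # weight and bias per block, plus the final norm

lora_a = R * HIDDEN                        # A: r x d_in
lora_b = QKV_OUT * R                       # B: d_out x r
lora_total = N_LAYERS * (lora_a + lora_b)

all_params = packed_linear + embeddings + layer_norms + lora_total
print(all_params)   # 3613463424
print(lora_total)   # 4718592

# The printed trainable count is lower than lora_total. The gap equals the adapters
# of six blocks plus one lora_A matrix (an inference from the arithmetic).
trainable = lora_total - 6 * (lora_a + lora_b) - lora_a
print(trainable)                               # 3761152
print(round(100 * trainable / all_params, 3))  # 0.104
```

Under these assumptions the totals match the printed output exactly. The reduced trainable count is consistent with the freezing step having been applied to the first 50 parameter tensors, although that reading is only inferred from the numbers.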


# Prepare Data

## Load Data

In [None]:
%%capture
import pandas as pd
import numpy as np
import sys, os # Importing data

In [None]:
%%capture

# Download the .csv file from Google Drive only if it isn't already in the directory
path = "/content/spotify_millsongdata.csv"
if not os.path.isfile(path):
  !gdown --id 1wGtLywxyCq858JTVtizWHR5dtIf4Di8v

def select_only_desired_artist(artist, fdf):
  fdf = fdf[fdf["artist"]==artist]
  fdf = fdf.drop(['artist'], axis=1)
  return fdf

df = pd.read_csv(path, usecols=['artist', 'song', 'text'])
df = df.rename(columns={"song": "title", "text": "lyrics"})

## Top 10 Accumulated Word Count

In [None]:
wc_df = df.groupby('artist')['lyrics'].apply(lambda x: ' '.join(x)).reset_index()
wc_df['word_count'] = wc_df['lyrics'].apply(lambda x: len(x.split()))
wc_df.sort_values(by=['word_count'], ascending=False)[['artist', 'word_count']].head(10)

| | artist | word_count |
|---|---|---|
| 224 | Insane Clown Posse | 62713 |
| 296 | Lil Wayne | 62351 |
| 285 | LL Cool J | 59692 |
| 432 | R. Kelly | 55277 |
| 107 | Drake | 54341 |
| 64 | Chris Brown | 54073 |
| 138 | Fabolous | 53179 |
| 451 | Rihanna | 50454 |
| 221 | Indigo Girls | 50029 |
| 330 | Michael Jackson | 48531 |


## Select Only Desired Artist

In [None]:
# This variable is used later for printing
artist = "Rihanna"
df = select_only_desired_artist(artist, df)

In [None]:
df.head()

| | title | lyrics |
|---|---|---|
| 17623 | A Child Is Born | As I was walkin' down the road to Bethlehem on... |
| 17624 | A Girl Like Me | Some girls play the game \r\nThey all walk an... |
| 17625 | Afterparty | Mc, Nicki, Riri \r\nAfter Party \r\n \r\nTu... |
| 17626 | American Oxygen | [Chorus] \r\nBreathe out, breathe in \r\nAme... |
| 17627 | California King Bed | Chest to chest \r\nNose to nose \r\nPalm to ... |


## Examining Dataset DataType

In [None]:
# Convert the DataFrame to a Hugging Face Dataset
dataset = Dataset.from_pandas(df)

# Check the dataset structure
print(dataset.features)

{'title': Value(dtype='string', id=None), 'lyrics': Value(dtype='string', id=None), '__index_level_0__': Value(dtype='int64', id=None)}


## Tokenize and Encode Dataset

In [None]:
from transformers import AutoTokenizer

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# Function to concatenate title and lyrics into one training string
def concatenate_title_and_lyrics(examples):
    return {'input_text': examples['title'] + "\n" + examples['lyrics']}

# Apply the function to the dataset
dataset = dataset.map(concatenate_title_and_lyrics)

# Tokenize the dataset
tokenized_dataset = tokenizer(dataset['input_text'], truncation=True, padding=True, max_length=256, return_tensors='pt')
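
Since most songs run to 400 or 500 tokens, `max_length=256` with `truncation=True` cuts most of them short. The plain-Python sketch below mimics what truncation, padding to the longest sequence in the batch, and the attention mask do. It is a simplified stand-in for the tokenizer, not its implementation: the function name, the fake token IDs, the pad ID of 0, and right-side padding are all assumptions made for the example.

```python
def encode_batch(batch, max_length, pad_id):
    """Truncate each sequence, then right-pad to the longest one in the batch."""
    truncated = [ids[:max_length] for ids in batch]
    target = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (target - len(ids)) for ids in truncated]
    attention_mask = [[1] * len(ids) + [0] * (target - len(ids)) for ids in truncated]
    return input_ids, attention_mask

# Three fake "songs" with 450, 1109 and 120 tokens (the token IDs are just placeholders)
songs = [list(range(1, 451)), list(range(1, 1110)), list(range(1, 121))]
input_ids, attention_mask = encode_batch(songs, max_length=256, pad_id=0)

row_lengths = [len(row) for row in input_ids]
real_tokens = [sum(mask) for mask in attention_mask]
kept_share = round(real_tokens[0] / len(songs[0]), 3)

print(row_lengths)  # [256, 256, 256]
print(real_tokens)  # [256, 256, 120]
print(kept_share)   # 0.569
```

In this toy batch a 450-token song keeps only about 57% of its tokens, which is the price paid for fitting training onto a T4.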


In [None]:
class TextDataset(TorchDataset):
    def __init__(self, encodings):
        self.encodings = encodings

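    def __getitem__(self, idx):
        # The encodings are already tensors (return_tensors='pt'), so index
        # and clone them instead of re-wrapping with torch.tensor()
        item = {key: val[idx].clone() for key, val in self.encodings.items()}
        item["labels"] = item["input_ids"].clone()
        return item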

    def __len__(self):
        return len(self.encodings["input_ids"])
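
For causal language modeling the labels are a copy of the input IDs, and padded positions are normally excluded from the loss by setting their label to -100. The sketch below illustrates that convention in plain Python; it is not the collator's code, and the function names and token IDs are hypothetical. Because this notebook sets `pad_token` to `eos_token`, masking by pad ID also masks any real end-of-sequence tokens.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def build_causal_lm_labels(input_ids, pad_id):
    """Copy the inputs as labels, masking padded positions out of the loss."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in input_ids]

def shifted_pairs(input_ids, labels):
    """(current token, token to predict) pairs after the one-position shift."""
    return [
        (inp, tgt)
        for inp, tgt in zip(input_ids[:-1], labels[1:])
        if tgt != IGNORE_INDEX
    ]

PAD_ID = 99  # hypothetical ID standing in for the shared pad/EOS token
example_ids = [11, 12, 13, PAD_ID, PAD_ID]

labels = build_causal_lm_labels(example_ids, PAD_ID)
pairs = shifted_pairs(example_ids, labels)

print(labels)  # [11, 12, 13, -100, -100]
print(pairs)   # [(11, 12), (12, 13)]
```

Only the two real next-token predictions contribute to the loss; the padded tail is ignored.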

In [None]:
# Convert the encodings to PyTorch datasets
train_dataset_pytorch = TextDataset(tokenized_dataset)

In [None]:
train_dataset_pytorch.encodings.attention_mask.dtype

torch.int64

# Example Before Fine Tuning

In [None]:
def generate_new_song(title, artist=artist, model=model, tokenizer=tokenizer):

    # Build the prompt from the song title
    prompt = f"Title: {title}\nLyrics:"

    # Create pipeline
    song_generator = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )

    # Generate text
    sequences = song_generator(
        prompt,
        max_length=200,
        do_sample=True,
        top_k=0,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
    )

    # Print generated song
    print(f"In the Style of: {artist}")
    print("Generated Song:")
    for seq in sequences:
        print(seq['generated_text'])
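
The generation call uses `do_sample=True` with `top_k=0`, which turns top-k filtering off so that every token in the vocabulary can be sampled. The sketch below shows the difference on a four-token toy vocabulary. It is a simplified illustration in plain Python, not the Transformers sampling code, and the function name and logits are made up for the example.

```python
import math
import random

def sample_next_token(logits, top_k=0, rng=random):
    """Sample an index from softmax(logits); top_k=0 means no top-k filtering."""
    candidates = list(range(len(logits)))
    if top_k > 0:
        # Keep only the top_k highest-scoring tokens
        candidates = sorted(candidates, key=lambda i: logits[i], reverse=True)[:top_k]
    peak = max(logits[i] for i in candidates)  # subtract the max for numerical stability
    weights = [math.exp(logits[i] - peak) for i in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

toy_logits = [2.0, 1.0, 0.1, -1.0]
rng = random.Random(0)

# top_k=1 always returns the single most likely token
seen_top1 = {sample_next_token(toy_logits, top_k=1, rng=rng) for _ in range(50)}

# top_k=0 samples from the full distribution, so unlikely tokens appear too
seen_all = {sample_next_token(toy_logits, top_k=0, rng=rng) for _ in range(2000)}

print(seen_top1)         # {0}
print(sorted(seen_all))  # [0, 1, 2, 3]
```

Sampling from the full distribution gives more varied lyrics, but it also lets through low-probability tokens, which may explain some of the odd word choices in the generated songs.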

In [None]:
# Example usage
generate_new_song("Take on Me")

In the Style of: Rihanna
Generated Song:
Title: Take on Me
Lyrics: Non)
First shot from aim
Feel like a dead man
Lie in 6 feet of black saran
Tear your name cross my desert
And take on me,
[Verse 2:]
And you know that I wouldn't harm you
But what if you don't?
Girl with needs just like a man.
What things you leave undone.
Days with broken words on the sand
To let the waves make your sense.
Saran cloth, a breath away
Take on me when look to you
You mistook me up a smooth guy
So I can go on and on.
Something you've heard in the halls
Mix it with your shape and charm
Escape from the doldrums.
Chest heart ache on the ne plus ultra
Or would you reverse what you say?
My senses the same as any other
But heart still quavers's


# Training

In [None]:
from transformers import TrainingArguments, TrainerCallback, Trainer
from tqdm.auto import tqdm

# For progress bars
class ProgressCallback(TrainerCallback):
    def on_train_begin(self, args, state, control, **kwargs):
        self.progress_bar = tqdm(total=state.max_steps)
        self.progress_bar.set_description("Training")

    def on_step_end(self, args, state, control, **kwargs):
        self.progress_bar.update(1)

    def on_train_end(self, args, state, control, **kwargs):
        self.progress_bar.close()

# Training configuration
training_args = TrainingArguments(
    num_train_epochs=100,  # ignored because max_steps is set below
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    warmup_ratio=0.05,
    max_steps=100,  # total optimizer steps; overrides num_train_epochs
    learning_rate=2e-4,
    fp16=True,
    logging_strategy="steps",
    logging_steps=25,
    output_dir="outputs",
    optim="paged_adamw_8bit",
    lr_scheduler_type='cosine',
)

# Create your Trainer with the ProgressCallback
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset_pytorch,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
    callbacks=[ProgressCallback()]
)

# Disable the KV cache during training to silence the warnings. Please re-enable for inference!
model.config.use_cache = False

# Upcast normalization layers to float32
for name, module in trainer.model.named_modules():
    if "norm" in name:
        module.to(torch.float32)

# Train the model
trainer.train()


| Step | Training Loss |
|---|---|
| 25 | 2.0687 |
| 50 | 1.8687 |
| 75 | 1.7636 |
| 100 | 1.7029 |


TrainOutput(global_step=100, training_loss=1.8509696960449218, metrics={'train_runtime': 929.1643, 'train_samples_per_second': 1.722, 'train_steps_per_second': 0.108, 'total_flos': 1.618423558864896e+16, 'train_loss': 1.8509696960449218, 'epoch': 11.11})
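
The `epoch` and throughput values in the output above follow from the batch settings. The short calculation below reconstructs them. It assumes a single GPU and that one epoch is counted as the number of batches integer-divided by the gradient accumulation steps; the variable names are just for this example.

```python
import math

num_examples = 143       # Rihanna songs in the training set
per_device_batch = 4     # per_device_train_batch_size
grad_accum = 4           # gradient_accumulation_steps
max_steps = 100          # optimizer steps actually run
train_runtime = 929.1643 # seconds, from the TrainOutput above

effective_batch = per_device_batch * grad_accum                 # 16 songs per optimizer step
batches_per_epoch = math.ceil(num_examples / per_device_batch)  # 36
updates_per_epoch = max(batches_per_epoch // grad_accum, 1)     # 9
epochs = max_steps / updates_per_epoch
samples_seen = max_steps * effective_batch                      # 1600

print(round(epochs, 2))                        # 11.11
print(round(samples_seen / train_runtime, 3))  # 1.722
print(round(max_steps / train_runtime, 3))     # 0.108
```

So the 100 optimizer steps amount to roughly 11 passes over the 143 songs, not the 100 epochs that `num_train_epochs` might suggest.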

# Example After Fine Tuning

In [None]:
%%capture
model.config.use_cache = True
model.eval()

In [None]:
generate_new_song("Take on Me")

In the Style of: Rihanna
Generated Song:
Title: Take on Me
Lyrics:
Gonna take you on a ride [Take on me]  
It's not hard to understand [take on me]  
Stars shining down here [take on me]  
Gonna take you to another level [take on me]  
  
Are you thinking whats going on here [take on me]  
(Where you going baby look where you goin', hold on?)  
  
The clock keeps tick Tock it's getting late  
And I'm getting lost stopping me trying to escape  
Now you know how much I love you  
Don't you get it now'dawn  
Now stop the game don't you see I'm getting love  
Now you know how much I love you  
Pull me back I'm coming home  
Now


In [None]:
model.save_pretrained(f"/content/{artist}_model")