<a href="https://colab.research.google.com/github/vinithreddybanda/dl2/blob/main/DL2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
!pip install transformers
!pip install datasets

Collecting datasets
  Downloading datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.12.0,>=2023.1.0 (from fsspec[http]<=2024.12.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.5.0-py3-none-any.whl (491 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 491.2/491.2 kB 9.3 MB/s eta 0:00:00
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 116.3/116.3 kB 14.0 MB/s eta 0:00:00
Downloading fsspec-2024.12.0-py3-none-any.w

In [3]:
import pandas as pd
import re
from datasets import load_dataset
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments, pipeline

# ---------- Vinith's Lyrics Fine-Tuning Pipeline (Ariana + Billie) ----------

# Load CSVs
file_paths = [
    "/content/ArianaGrande.csv",
    "/content/BillieEilish.csv"
]

dfs = [pd.read_csv(path) for path in file_paths]
lyrics_df = pd.concat(dfs, ignore_index=True)

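# Clean the lyrics
def clean_lyrics(lyric):
    """Return an ASCII-only, whitespace-trimmed lyric; missing values become ""."""
    if pd.isna(lyric):
        return ""
    lyric = str(lyric)
    lyric = re.sub(r'^#+', '', lyric)  # drop leading '#' characters
    # Map curly quotes to their ASCII equivalents before stripping non-ASCII
    lyric = re.sub(r'[\u2018\u2019]+', "'", lyric)
    lyric = re.sub(r'[\u201c\u201d]+', '"', lyric)
    lyric = re.sub(r'[^\x00-\x7F]+', '', lyric)  # remove any remaining non-ASCII characters
    return lyric.strip()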

lyrics_df['Lyric'] = lyrics_df['Lyric'].apply(clean_lyrics)
lyrics_texts = lyrics_df['Lyric'].tolist()  # clean_lyrics already maps missing values to ""

# Save cleaned lyrics to a text file
output_text_file = "vinith_ariana_billie_lyrics.txt"
with open(output_text_file, "w", encoding="utf-8") as f:
    for lyric in lyrics_texts:
        f.write(lyric + "\n\n")

# Load dataset from text
dataset = load_dataset("text", data_files={"train": output_text_file})

# Load tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

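def tokenize_function(example):
    # Truncate or pad every example to exactly 512 tokens (pad token = EOS, set above)
    return tokenizer(example["text"], truncation=True, padding="max_length", max_length=512)

tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text"])
# Causal LM training: labels are a copy of input_ids (the model shifts them internally)
tokenized_dataset = tokenized_dataset.map(lambda x: {'labels': x['input_ids']}, batched=True)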

model = GPT2LMHeadModel.from_pretrained("gpt2")

# Training setup
training_args = TrainingArguments(
    output_dir="./vinith-gpt2-ariana-billie",
    per_device_train_batch_size=2,
    num_train_epochs=3,
    logging_steps=50,
    save_steps=500,
    save_total_limit=1,
    prediction_loss_only=True,
    report_to="none",
    fp16=False
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    tokenizer=tokenizer,
)

# Train the model
trainer.train()

# Generate lyrics
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
prompt = "I remember those nights when"
output = generator(prompt, max_length=100, num_return_sequences=1, truncation=True)[0]["generated_text"]

print("\n🎶 Vinith's AI Lyrics (Ariana + Billie Inspired):\n")
print(output)


Generating train split: 0 examples [00:00, ? examples/s]

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

Map:   0%|          | 0/906 [00:00<?, ? examples/s]

Map:   0%|          | 0/906 [00:00<?, ? examples/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Step,Training Loss
50,1.1505
100,1.0321
150,0.9462
200,0.8095
250,0.8308
300,0.63
350,0.7542
400,0.8589
450,0.5957
500,0.777
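
As a rough cross-check on the step numbers logged above, the number of optimizer steps can be estimated from the settings in the training cell: 906 examples (from the `Map` progress lines), `per_device_train_batch_size=2`, and `num_train_epochs=3`. The sketch below assumes a single device and no gradient accumulation; the notebook does not state the device count explicitly. `estimate_steps` is a hypothetical helper written for this note, not a `transformers` function.

```python
import math

def estimate_steps(num_examples, per_device_batch_size, num_epochs,
                   num_devices=1, grad_accum_steps=1):
    """Estimate optimizer steps per epoch and in total for a simple training run."""
    # Examples consumed per optimizer step across all devices
    effective_batch = per_device_batch_size * num_devices * grad_accum_steps
    # The last, possibly smaller, batch still counts as one step
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch, steps_per_epoch * num_epochs

steps_per_epoch, total_steps = estimate_steps(906, 2, 3)
print(steps_per_epoch, total_steps)  # 453 1359
```

Under these assumptions, `logging_steps=50` would produce a loss entry every 50 steps and `save_steps=500` would write checkpoints at steps 500 and 1000.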


Device set to use cuda:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.



🎶 Vinith's AI Lyrics (Ariana + Billie Inspired):

I remember those nights when you were with me sometimes all of a sudden you could see me when we were all together you couldn't tell they were friends just didn't seem right 'cause you left her to say sorry if you came home and she was hurt it's been six months since my last phone call but you've been so good you know she's probably wondering if you left her you know she's been feeling bad i'm worried maybe there's something missing you can bury your face in her grave
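
One detail of the training cell is worth illustrating. Examples are padded to `max_length=512` with the EOS token, and `labels` is a copy of `input_ids`, so padded positions count toward the logged loss. A common alternative is to set the labels at padded positions to -100, the index that PyTorch's cross-entropy loss ignores by default. The sketch below shows that masking in plain Python on toy token IDs. `mask_padding_labels` is a hypothetical helper, and the run above did not use it.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's CrossEntropyLoss by default

def mask_padding_labels(batch):
    """Build labels from input_ids, replacing padded positions with IGNORE_INDEX.

    `batch` is a dict with "input_ids" and "attention_mask" (lists of lists),
    the shape a Hugging Face tokenizer returns for a batched call.
    """
    labels = []
    for ids, mask in zip(batch["input_ids"], batch["attention_mask"]):
        # Keep real tokens (mask == 1); hide padding (mask == 0) from the loss
        labels.append([tok if m == 1 else IGNORE_INDEX for tok, m in zip(ids, mask)])
    return {"labels": labels}

# Toy batch: 50256 is GPT-2's EOS id, which the notebook also uses as the pad token.
toy_batch = {
    "input_ids": [[40, 3505, 50256, 50256], [1639, 50256, 50256, 50256]],
    "attention_mask": [[1, 1, 0, 0], [1, 0, 0, 0]],
}
print(mask_padding_labels(toy_batch)["labels"])
# [[40, 3505, -100, -100], [1639, -100, -100, -100]]
```

A function of this shape could be passed to `tokenized_dataset.map(..., batched=True)` in place of the lambda that copies `input_ids`. Losses from such a run would not be directly comparable to the values logged above, since those include padded positions.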
