# Lyrics Generator Using GPT2 PreTrained Model


In this project, we will use the pretrained GPT-2 model to generate song lyrics. Our goal is to build a song lyrics generator to explore the "creative" side of transformer-based language models.



TABLE OF CONTENTS   
    
* [1. IMPORTING LIBRARIES](#1)
    
* [2. LOADING DATASET](#2)
    
* [3. DATA EXPLORATION & PREPROCESSING](#3)
    
* [4. MODEL AND TOKENIZATION](#4)
      
* [5. DATASET CLASS CREATION AND TRAINING ARGUMENTS](#5)
    
* [6. LYRICS GENERATION & GRADIO INTERFACE](#6)
    
* [7. CONCLUSION](#7)
* [8. END](#8)



# IMPORTING LIBRARIES


## Installing / Importing Necessary Libraries

In [1]:
# Installing libraries for Colab
!pip install gradio
!pip install pronouncing
!pip install matplotlib
!pip install glob2
!pip install tensorflow
!pip install transformers


In [2]:
#Importing Libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments
import torch
import gradio as gr
import random
import pronouncing

## Loading Dataset

In [5]:
# Load the dataset
file_path = './Resources/tcc_ceds_music.csv'
data = pd.read_csv(file_path)
data

Unnamed: 0.1,Unnamed: 0,artist_name,track_name,release_date,genre,lyrics,len,dating,violence,world/life,...,sadness,feelings,danceability,loudness,acousticness,instrumentalness,valence,energy,topic,age
0,0,mukesh,mohabbat bhi jhoothi,1950,pop,hold time feel break feel untrue convince spea...,95,0.000598,0.063746,0.000598,...,0.380299,0.117175,0.357739,0.454119,0.997992,0.901822,0.339448,0.137110,sadness,1.000000
1,4,frankie laine,i believe,1950,pop,believe drop rain fall grow believe darkest ni...,51,0.035537,0.096777,0.443435,...,0.001284,0.001284,0.331745,0.647540,0.954819,0.000002,0.325021,0.263240,world/life,1.000000
2,6,johnnie ray,cry,1950,pop,sweetheart send letter goodbye secret feel bet...,24,0.002770,0.002770,0.002770,...,0.002770,0.225422,0.456298,0.585288,0.840361,0.000000,0.351814,0.139112,music,1.000000
3,10,pérez prado,patricia,1950,pop,kiss lips want stroll charm mambo chacha merin...,54,0.048249,0.001548,0.001548,...,0.225889,0.001548,0.686992,0.744404,0.083935,0.199393,0.775350,0.743736,romantic,1.000000
4,12,giorgos papadopoulos,apopse eida oneiro,1950,pop,till darling till matter know till dream live ...,48,0.001350,0.001350,0.417772,...,0.068800,0.001350,0.291671,0.646489,0.975904,0.000246,0.597073,0.394375,romantic,1.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28367,82447,mack 10,10 million ways,2019,hip hop,cause fuck leave scar tick tock clock come kno...,78,0.001350,0.001350,0.001350,...,0.065664,0.001350,0.889527,0.759711,0.062549,0.000000,0.751649,0.695686,obscene,0.014286
28368,82448,m.o.p.,ante up (robbin hoodz theory),2019,hip hop,minks things chain ring braclets yap fame come...,67,0.001284,0.001284,0.035338,...,0.001284,0.001284,0.662082,0.789580,0.004607,0.000002,0.922712,0.797791,obscene,0.014286
28369,82449,nine,whutcha want?,2019,hip hop,get ban get ban stick crack relax plan attack ...,77,0.001504,0.154302,0.168988,...,0.001504,0.001504,0.663165,0.726970,0.104417,0.000001,0.838211,0.767761,obscene,0.014286
28370,82450,will smith,switch,2019,hip hop,check check yeah yeah hear thing call switch g...,67,0.001196,0.001196,0.001196,...,0.001196,0.001196,0.883028,0.786888,0.007027,0.000503,0.508450,0.885882,obscene,0.014286


In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Data Pre-processing and EDA

In [6]:
# Preprocessing: keep only the necessary columns (copy to avoid SettingWithCopyWarning)
data = data[['genre', 'lyrics']].copy()

In [7]:
# Remove duplicates and handle missing values
data = data.dropna(subset=['genre', 'lyrics'])
data = data.drop_duplicates(subset=['lyrics'])
data

Unnamed: 0,genre,lyrics
0,pop,hold time feel break feel untrue convince spea...
1,pop,believe drop rain fall grow believe darkest ni...
2,pop,sweetheart send letter goodbye secret feel bet...
3,pop,kiss lips want stroll charm mambo chacha merin...
4,pop,till darling till matter know till dream live ...
...,...,...
28367,hip hop,cause fuck leave scar tick tock clock come kno...
28368,hip hop,minks things chain ring braclets yap fame come...
28369,hip hop,get ban get ban stick crack relax plan attack ...
28370,hip hop,check check yeah yeah hear thing call switch g...


In [8]:
# Filter dataset to a manageable size (25,000 samples, split into train and validation below)
data = data.sample(n=25000, random_state=42)


In [9]:
# Train-test split
train_texts, val_texts = train_test_split(data, test_size=0.2, random_state=42)


In [10]:
# Combine genre and lyrics as input for training
def format_data(row):
    return f"<|genre|>{row['genre']}<|lyrics|>{row['lyrics']}"

train_texts = train_texts.apply(format_data, axis=1).tolist()
val_texts = val_texts.apply(format_data, axis=1).tolist()
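
As a quick illustration of the template above, the following minimal sketch applies the same format to two hypothetical rows. Plain dictionaries stand in for DataFrame rows, and the lyrics are shortened for the example; `format_row` and `rows` are illustrative names, not part of the notebook.

```python
# Hypothetical rows mirroring the 'genre' and 'lyrics' columns used above
rows = [
    {"genre": "pop", "lyrics": "hold time feel break"},
    {"genre": "hip hop", "lyrics": "check check yeah yeah"},
]

def format_row(row):
    # Same template as format_data: genre marker, genre, lyrics marker, lyrics
    return f"<|genre|>{row['genre']}<|lyrics|>{row['lyrics']}"

formatted = [format_row(r) for r in rows]
for text in formatted:
    print(text)
```

Each training string therefore carries the genre first, so the model can learn to condition the lyrics on it.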


## Model and Tokenization

In [11]:
# Tokenizer and Model
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
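# Keep GPT-2's own end-of-text token; register the genre/lyrics markers as additional special tokens
tokenizer.add_special_tokens({'pad_token': '<|pad|>', 'additional_special_tokens': ['<|genre|>', '<|lyrics|>']})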

model = GPT2LMHeadModel.from_pretrained(model_name)
model.resize_token_embeddings(len(tokenizer))


Embedding(50260, 768)

In [12]:
# Tokenize data
def tokenize_function(examples):
    return tokenizer(examples, padding=True, truncation=True, max_length=512, return_tensors="pt")

train_encodings = tokenize_function(train_texts)
val_encodings = tokenize_function(val_texts)
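
To make the `padding=True, truncation=True, max_length=512` settings more concrete, here is a rough sketch in plain Python of what they do to a batch of token id lists: sequences longer than `max_length` are cut, and shorter ones are padded on the right up to the longest remaining sequence, with an attention mask marking real tokens. The helper `pad_and_truncate`, the token ids, and the pad id `99` are all hypothetical; the real tokenizer does this internally.

```python
def pad_and_truncate(batch, max_length, pad_id):
    """Truncate each sequence to max_length, then right-pad to the longest one."""
    truncated = [seq[:max_length] for seq in batch]
    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids; 99 stands in for the id of the pad token
batch = [[11, 12, 13, 14, 15, 16], [21, 22], [31, 32, 33]]
encoded = pad_and_truncate(batch, max_length=4, pad_id=99)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

Because every row is padded to the same length, the whole split can be stored as one rectangular tensor, at the cost of memory spent on padding.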


## Dataset Class Creation and Training Arguments

In [13]:
# Create Dataset class
class LyricsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __len__(self):
        return len(self.encodings['input_ids'])

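    def __getitem__(self, idx):
        item = {key: tensor[idx] for key, tensor in self.encodings.items()}
        # Causal LM training needs labels: reuse input_ids and mask padding with -100
        # so that padded positions are ignored by the loss
        labels = item['input_ids'].clone()
        labels[item['attention_mask'] == 0] = -100
        item['labels'] = labels
        return item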

train_dataset = LyricsDataset(train_encodings)
val_dataset = LyricsDataset(val_encodings)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    save_steps=500,
    save_total_limit=2,
    logging_dir="./logs",
    evaluation_strategy="steps",
    eval_steps=500,
    save_strategy="steps",
    logging_steps=100,
    report_to="none"
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)
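
For causal language modeling, `GPT2LMHeadModel` only returns a loss when each example also carries a `labels` entry; the model shifts the labels internally so that every position predicts the next token. A common convention is to copy `input_ids` and set padded positions to `-100`, which the cross-entropy loss ignores by default. The sketch below shows that idea on plain lists; `build_labels` and the token ids are hypothetical and only illustrate the masking step.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def build_labels(input_ids, attention_mask):
    # Keep the token id where the mask is 1, ignore the position where it is 0
    return [tok if keep == 1 else IGNORE_INDEX
            for tok, keep in zip(input_ids, attention_mask)]

# Hypothetical padded example: two real tokens followed by two pad tokens (id 99)
input_ids = [21, 22, 99, 99]
attention_mask = [1, 1, 0, 0]
labels = build_labels(input_ids, attention_mask)
print(labels)
```

Masking the padding this way keeps the loss focused on real lyrics tokens instead of rewarding the model for predicting pad tokens.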




## Lyrics Function and Gradio Interface Preparation

In [15]:
import gradio as gr
import matplotlib.pyplot as plt
from collections import Counter
import string
import pronouncing
import tempfile
import random

# Helper functions
def word_frequency(text):
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    words = text.split()
    return Counter(words)

def calculate_rhyme_density(lyrics):
    lines = lyrics.split("\n")
    rhyming_pairs = 0
    total_lines = len(lines)

    for i in range(total_lines - 1):
        word1 = lines[i].split()[-1] if lines[i].strip() else ""
        word2 = lines[i + 1].split()[-1] if lines[i + 1].strip() else ""

        if word1 and word2 and word1 in pronouncing.rhymes(word2):
            rhyming_pairs += 1

    density = rhyming_pairs / total_lines if total_lines > 1 else 0
    return density

def plot_word_frequency(word_freq):
    most_common_words = word_freq.most_common(10)
    words, counts = zip(*most_common_words)

    plt.figure(figsize=(10, 6))
    plt.bar(words, counts, color="skyblue")
    plt.title("Top 10 Word Frequencies in Lyrics")
    plt.xlabel("Words")
    plt.ylabel("Frequency")
    plt.xticks(rotation=45)

    temp_file = tempfile.NamedTemporaryFile(suffix=".png", delete=False)
    plt.savefig(temp_file.name)
    plt.close()
    return temp_file.name

def plot_rhyme_density(rhyme_density):
    plt.figure(figsize=(6, 6))
    labels = ['Rhyme Density', 'Non-Rhyming']
    values = [rhyme_density, 1 - rhyme_density]
    plt.pie(values, labels=labels, autopct="%1.1f%%", colors=["#FFA07A", "#20B2AA"])
    plt.title("Rhyme Density in Generated Lyrics")

    temp_file = tempfile.NamedTemporaryFile(suffix=".png", delete=False)
    plt.savefig(temp_file.name)
    plt.close()
    return temp_file.name

def plot_rhyme_schemes(lyrics):
    # Placeholder: the percentages below are hard-coded and are not computed from `lyrics`.
    # Logic for actual rhyme scheme analysis can be added here.
    schemes = {"AA": 30, "ABAB": 20, "AABB": 15, "ABCABC": 10, "AABBA": 5, "Other": 20}
    labels, sizes = zip(*schemes.items())

    plt.figure(figsize=(10, 6))
    plt.bar(labels, sizes, color="purple")
    plt.title("Top 10 Rhyme Schemes by Percentage")
    plt.xlabel("Rhyme Scheme")
    plt.ylabel("Percentage")

    temp_file = tempfile.NamedTemporaryFile(suffix=".png", delete=False)
    plt.savefig(temp_file.name)
    plt.close()
    return temp_file.name

# Main lyrics generation function
def generate_lyrics_with_analysis(genre, starting_lyrics, max_lines=22):
    input_text = f"The genre is {genre}. Write a song starting with:\n{starting_lyrics}"
    input_ids = tokenizer.encode(input_text, return_tensors="pt")
    generated_lyrics = []
    rhyme_history = []

    for _ in range(max_lines):
        output = model.generate(
            input_ids,
            max_length=len(input_ids[0]) + 20,
            num_return_sequences=1,
            pad_token_id=tokenizer.pad_token_id,
            no_repeat_ngram_size=3,
            do_sample=True,  # required for temperature/top_k/top_p to take effect
            temperature=0.6,
            top_k=30,
            top_p=0.85,
        )
        new_line = tokenizer.decode(output[0][len(input_ids[0]):], skip_special_tokens=True).strip()

        # Prevent nonsensical repetition
        if len(new_line.split()) < 3 or new_line in generated_lyrics:
            continue

        # Optional: Force rhyme constraints
        if rhyme_history:
            rhyming_words = pronouncing.rhymes(rhyme_history[-1].split()[-1])
            if rhyming_words:
                last_word = new_line.split()[-1]
                if last_word not in rhyming_words:
                    new_line = f"{' '.join(new_line.split()[:-1])} {random.choice(rhyming_words)}"

        generated_lyrics.append(new_line)
        rhyme_history.append(new_line)

        input_ids = tokenizer.encode("\n".join(generated_lyrics), return_tensors="pt")

    # Define song sections
    intro = "\n".join(generated_lyrics[:2])
    verse1 = "\n".join(generated_lyrics[2:6])
    chorus = "\n".join(generated_lyrics[6:10])
    verse2 = "\n".join(generated_lyrics[10:14])
    bridge = "\n".join(generated_lyrics[14:16])
    outro = "\n".join(generated_lyrics[16:18])

    lyrics_text = (
        f"**[Intro]**\n{intro}\n\n"
        f"**[Verse 1]**\n{verse1}\n\n"
        f"**[Chorus]**\n{chorus}\n\n"
        f"**[Verse 2]**\n{verse2}\n\n"
        f"**[Bridge]**\n{bridge}\n\n"
        f"**[Chorus]**\n{chorus}\n\n"
        f"**[Outro]**\n{outro}"
    )

    word_freq = word_frequency("\n".join(generated_lyrics))
    rhyme_density = calculate_rhyme_density("\n".join(generated_lyrics))

    word_freq_chart = plot_word_frequency(word_freq)
    rhyme_density_chart = plot_rhyme_density(rhyme_density)
    rhyme_schemes_chart = plot_rhyme_schemes("\n".join(generated_lyrics))

    return lyrics_text, word_freq_chart, rhyme_density_chart, rhyme_schemes_chart


# Gradio interface
interface = gr.Interface(
    fn=generate_lyrics_with_analysis,
    inputs=[
        gr.Dropdown(choices=data['genre'].unique().tolist(), label="Genre"),
        gr.Textbox(lines=2, placeholder="Starting lyrics...", label="Starting Lyrics"),
    ],
    outputs=[
        gr.Textbox(label="Generated Lyrics"),
        gr.Image(label="Word Frequency Chart"),
        gr.Image(label="Rhyme Density Chart"),
        gr.Image(label="Top 10 Rhyme Schemes by Percentage")
    ],
    title="Music Lyrics Generator with Comprehensive Analysis",
    description="Generates structured song lyrics, analyzes word frequency, rhyme density, and rhyme schemes, displaying all as charts."
)

interface.launch()
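
The rhyme scheme chart in the cell above uses fixed placeholder values. One possible way to derive schemes from the generated lines is sketched below, under loud assumptions: `rhyme_key` treats two words as rhyming when their last two letters match, which is only a crude stand-in for a phonetic lookup such as `pronouncing.rhymes`, and stanzas are assumed to be four lines long. All function names here are hypothetical.

```python
from collections import Counter

def rhyme_key(word, suffix_len=2):
    # Crude stand-in for a phonetic rhyme test: compare the last letters only
    cleaned = "".join(ch for ch in word.lower() if ch.isalpha())
    return cleaned[-suffix_len:]

def rhyme_scheme(lines):
    # Assign letters A, B, C, ... in order of first appearance of each rhyme key
    labels = {}
    scheme = []
    for line in lines:
        words = line.split()
        if not words:
            continue
        key = rhyme_key(words[-1])
        if key not in labels:
            labels[key] = chr(ord("A") + len(labels))
        scheme.append(labels[key])
    return "".join(scheme)

def scheme_percentages(lines, stanza_len=4):
    # Split into fixed-size stanzas and count how often each scheme occurs
    stanzas = [lines[i:i + stanza_len] for i in range(0, len(lines), stanza_len)]
    counts = Counter(rhyme_scheme(s) for s in stanzas if len(s) == stanza_len)
    total = sum(counts.values())
    return {scheme: 100 * n / total for scheme, n in counts.items()} if total else {}

# Hypothetical lines used only to exercise the functions
demo_lines = [
    "the night is long", "I sing my song", "the stars are bright", "we own the night",
    "another day", "under the moon", "we find a way", "it ends too soon",
]
percentages = scheme_percentages(demo_lines)
print(percentages)
```

A dictionary like `percentages` could replace the hard-coded `schemes` values in `plot_rhyme_schemes`, although a phonetic rhyme test would give more faithful results than the letter-suffix heuristic.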






# CONCLUSION
On observing the output of the lyrics generator, it is clear that while some of the lines are grammatical and rhyme, most of the lyrics do not make sense. The output does look like a song, and we formatted it as such, but the model didn't quite learn the meaning of the songs or words. However, GPT-2's subword-based approach does produce legitimate words that at times can pass for a rhyming song.
GPT-2 is already a transformer-based text generator, so to get songs that make better sense we may consider longer fine-tuning or a larger pretrained model, but that is left for future work.
# END