In [None]:

import pandas as pd
import numpy as np
import re
import tensorflow as tf  
from transformers import pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
import matplotlib.pyplot as plt
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer, TFGPT2Model, GPT2Config
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint




# Developing a Model for Generating Pop Song Lyrics
## By: Violeta Kasyteva
## Summary:
This project explores the development of a machine learning model leveraging GPT-2 (distilGPT2) to automate the generation of pop song lyrics. By fine-tuning a pre-trained model on a curated dataset of pop music lyrics, we aim to capture the stylistic nuances of the genre. 

<center>
    <figure>
        <img alt="Singers" src="./pictures/singers.jpg" width=700 />
    </figure>
</center>


## Data Source
[The dataset comprises pop song lyrics collected from various online sources.](https://www.kaggle.com/datasets/carlosgdcj/genius-song-lyrics-with-language-information/data)

## Data Acquisition
First, load the dataset containing song lyrics into a pandas DataFrame. This dataset includes various metadata about each song.


In [None]:
df = pd.read_csv('data/song_lyrics.csv')  

Filter the dataset to include only pop songs with lyrics in English. This is achieved by selecting rows where the genre is tagged as 'pop' and the song's language is confirmed to be English by either the CLD3 or FastText models, or both

In [None]:
df_pop = df[df['tag'] == 'pop']
# Example: Keeping relevant fields
df_pop_english = df_pop[(df_pop['language'] == 'en') | ((df_pop['language_cld3'] == 'en') & (df_pop['language_ft'] == 'en'))]

# Example: Text preprocessing for lyrics (simple example)

df_pop_english

Unnamed: 0,title,tag,artist,year,views,features,lyrics,id,language_cld3,language_ft,language
239,Wordy Rappinghood,pop,Tom Tom Club,1981,26499,{},[Chorus]\nWhat are words worth?\nWhat are word...,242,en,en,en
389,Horchata,pop,Vampire Weekend,2009,102550,{},"[Verse 1]\nIn December, drinking horchata\nI'd...",384,en,en,en
516,Heartless,pop,Kanye West,2008,1175109,{},"[Chorus]\nIn the night, I hear 'em talk\nThe c...",526,en,en,en
557,Flashing Lights,pop,Kanye West,2007,1078113,{Dwele},[Intro: Connie Mitchell]\nFlashing lights (Lig...,523,en,en,en
588,Baby,pop,Justin Bieber,2010,2232442,{Ludacris},[Produced by The-Dream and Tricky Stewart]\n\n...,566,en,en,en
...,...,...,...,...,...,...,...,...,...,...,...
5134844,Baby Hold On,pop,Ryan Egan,2022,5,{},At this crossroads\nI don’t wanna go and gambl...,7882831,en,en,en
5134847,Everything Is Alright Now,pop,Chuck Bernard,2013,2,{},"Everything is alright now\nOh yes, baby\nEvery...",7882838,en,en,en
5134849,White Lies,pop,ElementD,2019,1,"{""Harley Bird""}",[Verse 1]\nHalf truth and half you\nDidn't we ...,7882840,en,en,en
5134851,Ocean,pop,Effemar,2022,3,{},[Verse 1]\nDance for me now\nKeeping yourself ...,7882842,en,en,en


Remove unnecessary columns to simplify the DataFrame. Columns such as 'tag', 'views', 'features', and language identification fields are dropped.

In [None]:
# Assuming df is your DataFrame
columns_to_drop = ['tag', 'views', 'features', 'language_cld3', 'language_ft', 'language']
df_simplified = df.drop(columns=columns_to_drop)

df_simplified

Unnamed: 0,title,artist,year,lyrics,id
0,Killa Cam,Cam'ron,2004,"[Chorus: Opera Steve & Cam'ron]\nKilla Cam, Ki...",1
1,Can I Live,JAY-Z,1996,"[Produced by Irv Gotti]\n\n[Intro]\nYeah, hah,...",3
2,Forgive Me Father,Fabolous,2003,Maybe cause I'm eatin\nAnd these bastards fien...,4
3,Down and Out,Cam'ron,2004,[Produced by Kanye West and Brian Miller]\n\n[...,5
4,Fly In,Lil Wayne,2005,"[Intro]\nSo they ask me\n""Young boy\nWhat you ...",6
...,...,...,...,...,...
5134851,Ocean,Effemar,2022,[Verse 1]\nDance for me now\nKeeping yourself ...,7882842
5134852,64 Bars,Rapido,2022,"[Intro]\n\nJa, ja\n\n[Part 1]\n\nR-A-H, Merhab...",7882843
5134853,Raise Our Hands,"Culture Code, Pag & Mylo",2016,[Verse 1]\nHere our purpose feels alive\nWe ar...,7882845
5134854,CEO,Antropolita,2022,Jestem CEO w tym\nTo jara twoją bitch\nNikt na...,7882846


Configure pandas display settings to improve the readability of DataFrame outputs, especially useful when dealing with wide datasets or long text fields.

In [None]:
pd.set_option('display.max_rows', 50)  # Ensure up to 50 rows can be displayed
pd.set_option('display.max_columns', None)  # Show all columns of the simplified DataFrame
pd.set_option('display.width', None)  # Auto-detect the display width to accommodate the content
pd.set_option('display.max_colwidth', None)  # Show full content of each cell, adjust if necessary for readability


Remove rows with any missing values to ensure the dataset's completeness. Additionally, replace empty strings with NaN to treat them as missing values, and then remove these rows as well.

In [None]:
# Drop rows where any of the cells have NaN values
df_cleaned = df_simplified.dropna()

df_cleaned.replace('', pd.NA, inplace=True)  # Replace empty strings with NaN
df_cleaned.dropna(inplace=True)  # Drop rows with NaN values (now including those that were empty strings)

df_cleaned.reset_index(drop=True, inplace=True)



Here is what we have so far:

In [None]:
df_cleaned.head(5)

Unnamed: 0,title,artist,year,lyrics,id
0,Killa Cam,Cam'ron,2004,"[Chorus: Opera Steve & Cam'ron]\nKilla Cam, Killa Cam, Cam\nKilla Cam, Killa Cam\nKilla Cam, Cam\nKilla Cam, Killa Cam, Cam\nKilla Killa Killa Cam\nKilla Cam, Cam, Killa (Killa!)\nKilla Cam, Killa Cam, Cam (Bases loaded)\nKilla Cam, Killa Cam (Uh-huh)\nKilla Cam, Cam (Santana on second, Jim on third)\nKilla Cam, Killa Cam, Cam (I'm at bat)\nKilla Killa Killa Cam\nKilla Cam, Cam, Killa (I'm 'bout to hit this shit out the world)\nKilla Cam (Ugh, Heatmakerz), Killa Cam, Cam\nKilla Cam, Killa Cam\nKilla Cam, Cam (Hahahaha)\nKilla Cam, Killa Cam, Cam\nKilla Killa Killa Cam\nKilla Cam, Cam, Killa (We make this shit clap)\nKilla Cam, Killa Cam, Cam\nKilla Cam, Killa Cam\nKilla Cam, Cam\nKilla Cam, Killa Cam, Cam\nKilla Killa Killa Cam (Killa! Killa!)\nKilla Cam, Cam, Killa\n[Verse 1]\nWith the goons I spar, stay in tune with ma (What up?)\nShe like, ""Damn, this the realest since 'Kumbaya'""\nBomaye, Killa Cam, my Lord (My Lord)\nStill the man with the pan, scrilla, fam, on board\nNow bitches, they want to neuter me, niggas, they want to tutor me\nThe hooligan in Houlihan's, maneuvering's nothing new to me\nDoggy, I'm from the land of grind, pan-pan: gram or dime?\nNot toes or MC when I say ""hammer time""\nBeef: I hammer mine, when I get my hands on nines\nIf I had on 'Bama line, Corduroys, Cam'll shine\nCanary burgundy: I call it ""Lemon Red"" (Red)\nYellow diamonds in my ear, call 'em ""Lemonheads""\nLemonhead, end up dead, ice like Winnipeg\nGemstone, Flintstones, you could say I'm friends with Fred\nYou unhappy, scrappy? (What's going on, Scrappy?)\nI got Pataki at me\nBitches say I'm ""Tacky Daddy,"" Range look like Laffy Taffy\n\n[Chorus]\nKilla Cam\nKilla Cam Cam (sing)\nKilla Cam Killa Cam\nKilla Cam Cam (uhh, it's me, clap)\nKilla Cam\nKilla Cam Cam\nKilla Killa Killa Cam (sing)\nKilla Cam Cam Killa (uhh, it's me, clap)\nKilla Cam\nKilla Cam Cam (sing)\nKilla Cam Killa Cam\nKilla Cam Cam (clap, it's me)\nKilla Cam\nKilla Cam Cam\nKilla Killa Killa Cam (clap)\n(Harlem, I know y'all know about this)\nKilla Cam Cam Killa (Killa!)\n[Verse 2]\nYo, I'm from where Nicky Barnes got rich as fuck\nRich and A hit the kitchens then were pitchin' up\nRob Base, Mase, Doug E Fresh switched it up\nI do both, who am I to fuck tradition up? (Killa!)\nSo I parked in a tow-away zone\nChrome...I don't care\nThat car a throwaway, homes (Killa!)\nWelcome to Harlem, where you welcome to problems\nOff of furlough, fellow felons get pardons\nThem niggas knew we bang\nStood out like Pootie Tang\nSoon as the stoolie sings\nThat when the toolie sing!\nBang! Bang!\nCame from that movie ring\nSnap, crack jewelry bling\nFlapjack, ooh he bring\nClack-clack, ""ooh he ring!""\nBad rap, cuties cling\nAss cap, put them in the river\nI'm the sushi king\nAnd I'ma keep ya fresh\nLet the fish eat ya flesh\nYes sir, please confess\nJust say he's the best (Killa!)\n[Chorus]\nKilla Cam (sing)\nKilla Cam Cam (clap)\nKilla Cam Killa Cam (yes)\nKilla Cam Cam (it's me, sing)\nKilla Cam\nKilla Cam Cam (sing)\nKilla Killa Killa Cam\nKilla Cam Cam Killa (clap, yes sir, uhh)\nKilla Cam\nKilla Cam Cam (sing, clap)\nKilla Cam Killa Cam\nKilla Cam Cam (it's me)\nKilla Cam (sing, clap)\nKilla Cam Cam\nKilla Killa Killa Cam\n(Let me end this shit, listen)\nKilla Cam Cam Killa\n\n[Verse 3]\n(Killa!) Yo\nHow dope is this?\nTeach you how to rope a chick\nWhat you want: coke or piff?\nGot it all, smoke or sniff? (everything)\nAnd you know my drift\nUsed to figures, dough and shit (millions)\nYou a rooster nigga, just a roaster, bitch\nAnd I roast ya bitch\nThat's how it usually ends\nTell her and her groupie friends\nGo get their coochie cleansed\nWe the moody Gucci, Louis and Pucci men\nEscada, Prada\nThe chopper it got the Uzi lens\nBird's-eye view\nThe birds I knew flip birds\nBird gangs, it was birds I flew\nAnd word I blew off herb I grew\nI would serve on stoops\nNow swerve in coupes\nIt's me, sing! Killa, uhh\n\n[Chorus]\nKilla Cam\nKilla Cam Cam\nKilla Cam Killa Cam\nKilla Cam Cam\nKilla Cam\nKilla Cam Cam\nKilla Killa Killa Cam\nKilla Cam Cam Killa\nKilla Cam\nKilla Cam Cam\nKilla Cam Killa Cam\nKilla Cam Cam\nKilla Cam\nKilla Cam Cam\nKilla Killa Killa Cam\nKilla Cam Cam Killa",1
1,Can I Live,JAY-Z,1996,"[Produced by Irv Gotti]\n\n[Intro]\nYeah, hah, yeah, Roc-A-Fella\nWe invite you to somethin' epic, you know?\nWell, we hustle out of a sense of hopelessness\nSort of a desperation\nThrough that desperation, we become addicted\nSort of like the fiends we accustomed to servin'\nBut we feel we have nothin' to lose\nSo, we offer you, well, we offer our lives, right?\nWhat do you bring to the table?\n\n[Verse 1]\nWhile I'm watchin' every nigga watchin' me closely\nMy shit is butter for the bread, they wanna toast me\nI keep my head, both of them, where they supposed to be\nHoes'll get you sidetracked, then clapped from close feet\nI don't sleep, I'm tired, I feel wired like codeine, these days\nA brother gotta admire me from four fiends away\nMy pain, wish it was quick to see\nFrom sellin' 'caine 'til brains was fried to a fricassee\nCan't lie, at the time it never bothered me\nAt the bar, gettin' my thug on properly\nMy squad and me lack of respect for authority\nLaughin' hard, happy to be escapin' poverty, however brief\nI know this game got valleys and peaks\nExpectation for dips, for precipitation we stack chips, hardly\nThe youth I used to be, soon to see a mill'in\nNo more Big Willie, my game has grown\nPrefer you call me William\nIllin' for revenues, Rayful Edmond-like\nChannel 7 News, round seven jewels, head dead in the mic\nForgettin' all I ever knew, convenient amnesia\n""I suggest you call my lawyer, I know the procedure.""\nLock my body, can't trap my mind\nEasily explain why we adapt to crime\nI'd rather die enormous than live dormant, that's how we on it\nLive at the main event, I bet a trip to Maui on it\nPresidential suites my residential for the weekend\nConfidentially speakin' in codes since I sense you peekin'\nThe NSX rental, don't be fooled, my game is mental\nWe both out of town, dog, what you tryin' to get into?\nViva Las Vegas, see ya later at the crap tables\nMeet me by the one that starts a G up\nThis way no Fraud Willies present gamblin' they re-up\nAnd we can have a pleasant time, sippin' margaritas\n[Chorus]\nGe-ge-geyeahhh\nCan I live?\nCan I live?\n\n[Verse 2]\nMy mind is infested with sick thoughts that circle\nLike a Lexus, if driven wrong it's sure to hurt you\nDual level like duplexes, in unity\nMy crew and me commit atrocities like we got immunity\nYou guessed it, manifest it\nIn tangible goods, platinum Rolex'd it\nWe don't lease, we buy the whole car, as you should\nMy confederation, dead a nation\nExplode on detonation, overload the mind of a said patient\nWhen it boils to steam, it comes to it\nWe all fiends, gotta do it: even righteous minds go through this\nTrue this, the streets school us to spend our money foolish\nBond with jewelers and watch for intruders\nI stepped it up another level, meditated like a Buddhist\nRecruited lieutenants with ludicrous dreams of gettin' cream\n""Let's do this,"" it gets tedious\nSo I keep one eye open like CBS — you see me stressed, right?\n\n[Chorus]\nCan I live?\nCan I live?\nCan I live?\nCan I live?",3
2,Forgive Me Father,Fabolous,2003,"Maybe cause I'm eatin\nAnd these bastards fiend for they grub\nI carry pumps like I serve gasoline to these scrubs\nHave you seen my Aston leanin on dubs\nAnd they can't afford chrome so they puttin Vaseline on they hubs\nI'm lookin for a girl with a ass like Trina to rub\nTake home and let her watch the plasma screen in the tub\nThese niggas hate I move as smooth as castor cream in the club\nAnd dont pass my green or my bub\nBut I'm a fly nigga that don't do much to pull her and dick her\nEveryday I'm poppin a tag and pullin a sticker\nEveryday I'm switchin the tags and pullin up sicker\nEvery ""K"" I'm loadin the mags with bullets to flicker\nAnd I ain't hesitatin homie I'm pullin it quicker\nSo you can act tough After a few pulls of some liquor\nGotta pull it on niggas\nAnd they won't be goin nowhere for a while\nThey might as well pull out a snicker Ye-Ye-Yea\n\n[Hook]\nForgive me father for I have sinned\nBut look at all this money that I spend\nAnd look at all this jewlery that I'm in\nAnd look at all the places that I've been\nAnd look at all the women in those brims\nLook at the blue flames that I'm in\nI look at all the bullshit that theres been\nAnd if I had another chance I'd do it again\n[Verse 2]\nAnywhere the kid move you know the hammers'll be with me\nPokin out the shirt like a Pamela Lee titty\nI went on tour brought the samples of D wit me\nCame back a month later bought a Lambo for three-fifty\nThink I throw you grams if you read with me\nJust because you see me on the camera with P. Diddy\nDammit we P-driddy?? Now I got G with me\nAlong with the third leg that I be rammin in these bitties\nI keep the revolver you hope my gun'll jam\nBut with the scope its gonna blam\nThe infra put freckles on your face like Opie Cunningham\nThats why I'm watched by the Feds and scoped by Uncle Sam\nDope and hunn-ed (hundred) grams rope and hunn-ed grams\nAt the same time an artist get to open Summer Jam\nHope you understand or use better sense\nThese niggas dont want no beef they want lawsuit settlements Nigga!\n\n[Hook]\n\n[Verse 3]\nI'm in a waggy when I'm passin by ya\nWith a baby girl who suck harder than Maggie on a pacifier\nWhat I'm smokin'll have you aggie as your last supplier\nWhen you can smell it through the baggie you know that's some fire\nGettin stressed by these hotties is regular\nI got a magazine to press to your body like editors\nTest me somebody I'm beggin ya\nI got the gatling gun like Jesse The Body in Predator\nI'm a hustler I dont sling no rocks to the fiends now\nGot dudes who sit on corners like a boxer between rounds\nAny other dude who dish rocks want beef\nCause I chop dimes bigger than Chris Rock front teeth\nI'm the nigga tearin the walls up in your miss in exchange for a small cup\nOf the Cris\nAnd while you at probation fillin a small cup full of piss\nI'm in a coupe with a roof that ball up like a fist (Catch up!)",4
3,Down and Out,Cam'ron,2004,"[Produced by Kanye West and Brian Miller]\n\n[Intro: Cam'ron & Kanye West]\nUgh, Killa!\nBaby!\nKanye, this that 1970s Heron flow, huh?\nYeah, let's speed it up\nUgh, I'm back in, ugh, ugh\nThey don't know we finna kill the game this year\nKilla! 'Ye! C'mon!\n\n[Verse 1: Cam'ron]\nAyo, street mergers, I legislated; the nerve, I never hated\nOn murders, premeditated—absurd! I hesitated\nObserve: cock and spray, hit you from a block away\nDrinking sake on a Suzuki; we in Osaka Bay\nPlaying soccer, stupid, stay in a sucker's place\nPluck your ace, take your girl, fuck her face\nShe dealing with Killa, so you love her taste\nShe swallowing Killa 'cause she love the taste\nI got brought up with crooking, kitchen orders that I'm cooking\nBut got caught up with the juxes\nYou would've thought I was from Brooklyn\nIt gets boring just looking\nDid like Bill Cosby, pouring in the pudding\nNow, the dashboard is wooden from a hard-tangled grammar\nInterior, inferior; Star-Spangled Banner\nCar game bananas, mob manning tanners\nGuns everywhere, like the car came with hammers, he's back\n[Chorus: Kanye West & (Syleena Johnson)]\nThey trying to say he (I'm down, down)\nI hear niggas saying he (I'm down, but not out)\nBut our flow is the truest (Oh), the game's in the nooses (No no)\nOur girls is the models (Oh), they coochies the juiciest (Ooooh)\nYeah, they say he (I'm down, down)\nYeah, they say he (I'm down, but not out)\n'Cause I'm back on my grind (Oh), money back on my mind (No no)\nYe and Killa Cam, the world is mine (Oooh)\n\n[Verse 2: Cam'ron]\nI treat bitches straight up, like Simon Says\nOpen vagina; put your legs behind your head\nCop me Air Ones, hon, lime and red\nYou got pets? Me too: mines are dead, doggy\nFox, minks, gators, that's necessary\nAccessories, my closet's Pet Sematary\nI get approached by animal activists\nI live in a zoo, I run scandals with savages\nAll my niggas get together to gather loot\nBodyguard for what? Dog, I'd rather shoot\nI go to war, old Timbs, battered boots\nHand grenade, goggles and a parachute\nY'all don't even know the name of my flip\nIt was ""Touch Me, Tease Me"" when Case was the shit\nYou don't know bout the cases I get:\nCourt case, briefcase, suitcase, cases of Cris', oww!\n[Chorus: Kanye West & (Syleena Johnson)]\nThey trying to say he (I'm down, down)\nI hear niggas saying he (I'm down, but not out)\nBut our flow is the truest (Oh), the game's in the nooses (No no)\nOur girls is the models (Oh), they coochies the juiciest (Ooooh)\nYeah, they say he (I'm down, down)\nYeah, they say he (I'm down, but not out)\n'Cause I'm back on my grind (Oh), money back on my mind (No no)\nYe and Killa Cam, the world is mine (Oooh)\n\n[Verse 3: Cam'ron]\nUgh, Killa, yo, yo, ayo—\nYou dealing with some sure shit, my bitches pure thick\nPlay razor tag, slice your face, you're it!\nIt's I who come by, drive-thru\nGator-toed Mauri, three quarters, sky-blue\nLook at mami: eyes blue, 5'2''\nI approached her—""Hi, boo, how you?\nPony skin Louie? Oh, you fly too\nYou a stewardess? Good, ma—I fly too""\nNow, a nigga got baking to bake\nHarlem shake? Nah, I'm in Harlem shaking the weight\nShaking to bake, shaking the Jakes\nKill you, shoot the funeral up and Harlem Shake at your wake\nJust your picture, though; you still taped in a lake\nI'm laughing; you couldn't wait to escape\nFor anyone who owed the dough, I had to load the four\nI hope a nigga heard when I said ""I told you so""\nUgh, Killa!\n[Chorus: Kanye West & (Syleena Johnson)]\nThey trying to say he (I'm down, down)\nI hear niggas saying he (I'm down, but not out)\nBut our flow is the truest (Oh), the game's in the nooses (No no)\nOur girls is the models (Oh), they coochies the juiciest (Ooooh)\nYeah, they say he (I'm down, down)\nYeah, they say he (I'm down, but not out)\n'Cause I'm back on my grind (Oh), money back on my mind (No no)\nYe and Killa Cam, the world is mine (Oooh)\n\n[Outro: Cam'ron]\nMine! Killa!\nYou already know Harlem\nWhole Midwest, Detroit, Naptown, St. Louis\nChicago, of course\nWestside, holla at me\nSouthside, wild hundreds\nYou know what it is, Ohio\nColumbus, holla at your boy\nYou know what else I do:\nDayton, Youngstown, Cleveland, Cincinnati",5
4,Fly In,Lil Wayne,2005,"[Intro]\nSo they ask me\n""Young boy\nWhat you gon' do the second time around?\nHow you gon' come back?""\nI tried told 'em\n""I come back like 32\nI jump back like 33""\nUgh, hit me\nThat's nothing\nThis is Tha Carter II, people\nThis is Tha Carter II, people\nHey\n\n[Verse]\nThey call me Mr. Carter, I kissed the daughter\nOf the dead's forehead, I killed the father\nSpilled the heart of a mildew hater\nI will put them body on chill like glaciers, gracias\nI'm crazy, yes, it's obvious\nGoing against me is atheist\nI got my angels on my shoulder and a quarter of that angel dust\nI ain't sniffing, I'm just pitching, your honor\nI ain't snitching, your honor\nHate bitch niggas, bitches with power\nVacate when that kitchen get hotter\nI just sit on the counter\nOpen the cabinet, close the cupboard\nPut that jar in the skillet, drop a four in the bubbles\nI remember being young, tryna hustle my dope\nTryna tell the old junkies that my crack ain't soap\nTryna tell you 'fore you jump that my MAC ain't broke\nYou ain't tryna see how far that black back lane go, no\nCall me Pac-Man, your ghost is blue\nI got my Red River rubies and my ocean blue jewelry\nUsually I'm a hooligan for the money\nYeah, I'm eating, but I got a tapeworm in my tummy, oh\nMake harm and I bomb you in public\nHit you with the straight-arm, no warning, nothing\nLook, it's morning, no yawning or nothing\nI ain't sleeping, I'm up tryna take a nigga lunch\nYou gon' make a nigga break a nigga fronts\nThen shake a nigga shorts and we taking what we want\nI'm so 504, you got to kill me here\nIf you ever looking for me, bitch, I will be here\nCash Money is an army, Navy Seal me here\nLot of niggas ran from me, but I still be here\nNo chrome on the Continental, I'm so fundamental\nCrack the Phil', crack the roof, and roll up the windows\nAnd my hood love me, they tell me bring it home\nThat's why I holler Hollygrove on each and every song\nYou leaping at a dog, a dog with no bark\nJust a bite like an old shark\nAnd all you rich niggas know Pa, I'm talking 'bout Stunna\nHe like, ""Keep your dough,"" he got your ho\nAnd the sun shines on the king and sets on the prince\nI met the Birdman and I been shining ever since, like that",6


Let's define a function to clean the lyrics text a little more. This function removes annotations within brackets, standardizes apostrophes and quotation marks, removes special characters, and converts text to lowercase. Apply this function to the 'lyrics' column of the cleaned DataFrame.

In [None]:
def clean_lyrics(text):
    # Remove bracketed text e.g., [Chorus], [Verse 1]
    text = re.sub(r'\[.*?\]', '', text)
    # Replace curly apostrophes and quotes with straight ones
    text = re.sub(r'[‘’“”]', "'", text)
    # Lowercase the text
    text = text.lower()
    # Remove special characters, keeping only words, apostrophes, and basic punctuation
    text = re.sub(r'[^a-z0-9 ,.!?\'"]', '', text)
    # Replace newlines and multiple spaces with a single space
    text = re.sub(r'\s+', ' ', text).strip()
    return text


In [None]:
df_cleaned['lyrics'] = df_cleaned['lyrics'].apply(clean_lyrics)

Saving the current data progress

In [None]:
df_cleaned.to_csv('current.csv', index=False)


In [3]:
df_cleaned = pd.read_csv('./data/current.csv')

In [4]:
lyrics_data = df_cleaned['lyrics'].tolist()  
print(lyrics_data[:5])



## Tokenization and Data Preparation
This code initializes the tokenizer for the GPT-2 model, sets the padding token, and specifies the maximum sequence length. It cleans the data by ensuring all lyrics are strings and not None, then tokenizes a subset of the cleaned lyrics data, preparing it for model input

In [5]:
tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token
max_length = 800

# Convert all elements to strings (handling potential non-string types)
lyrics_data_cleaned = [str(lyric) for lyric in lyrics_data if lyric is not None]

# Retry tokenization with the cleaned data
encoded_inputs = tokenizer(lyrics_data_cleaned[:100000], max_length=max_length, truncation=True, padding="max_length", return_tensors="tf")


## TensorFlow Dataset Creation
Converts the tokenized input_ids and attention_masks into TensorFlow datasets. This dataset is then shuffled and batched, ready for training. It uses input_ids as labels for language modeling, a common approach in next-token prediction tasks.


In [6]:
input_ids = encoded_inputs["input_ids"]
attention_masks = encoded_inputs["attention_mask"]


input_ids = tf.concat(input_ids, 0)
attention_masks = tf.concat(attention_masks, 0)

# Prepare the dataset
dataset = tf.data.Dataset.from_tensor_slices((
    {"input_ids": input_ids, "attention_mask": attention_masks}, 
    input_ids 
))

batch_size = 4
dataset = dataset.shuffle(10000).batch(batch_size, drop_remainder=True)

## Model Definition and Initialization
Defines a custom GPT-2 model class with added dropout and dense layers for potential regularization and output transformation. The model is instantiated with the distilgpt2 configuration, and the dataset creation steps are repeated here, likely by mistake or for clarity.


In [7]:
config = GPT2Config.from_pretrained("distilgpt2")

class GPT2CustomModel(tf.keras.Model):
    def __init__(self, config):
        super(GPT2CustomModel, self).__init__()
        self.gpt2 = TFGPT2Model(config)
        self.dropout1 = tf.keras.layers.Dropout(0.1)  # Add dropout layer for regularization
        self.dense1 = tf.keras.layers.Dense(512, activation='relu')
        self.dropout2 = tf.keras.layers.Dropout(0.1)
        self.dense2 = tf.keras.layers.Dense(config.vocab_size, activation='softmax')

    def call(self, inputs, attention_mask=None, training=False):
        outputs = self.gpt2(inputs, attention_mask=attention_mask)[0]
        outputs = self.dropout1(outputs, training=training)
        outputs = self.dense1(outputs)
        outputs = self.dropout2(outputs, training=training)
        logits = self.dense2(outputs)
        return logits

# Convert the lists of input_ids and attention_masks into TensorFlow datasets
input_ids = encoded_inputs["input_ids"]
attention_masks = encoded_inputs["attention_mask"]

input_ids = tf.concat(input_ids, 0)
attention_masks = tf.concat(attention_masks, 0)

# Prepare the dataset
dataset = tf.data.Dataset.from_tensor_slices((
    {"input_ids": input_ids, "attention_mask": attention_masks}, 
    input_ids  # For language modeling, input_ids can be used as labels too
))

# Shuffle and batch the dataset
batch_size = 4
dataset = dataset.shuffle(10000).batch(batch_size, drop_remainder=True)

# Create an instance of the custom GPT-2 model
gpt2_model = GPT2CustomModel(config)

# Build the model
dummy_input = {"input_ids": tf.zeros((1, config.n_ctx), dtype=tf.int32), "attention_mask": tf.ones((1, config.n_ctx), dtype=tf.int32)}
_ = gpt2_model(dummy_input)

# Compile the model
gpt2_model.compile(optimizer='adam', loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), metrics=['accuracy'])

# Print the model summary
gpt2_model.summary()




Model: "gpt2_custom_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 tfgpt2_model (TFGPT2Model)  multiple                  81912576  
                                                                 
 dropout_19 (Dropout)        multiple                  0         
                                                                 
 dense (Dense)               multiple                  393728    
                                                                 
 dropout_20 (Dropout)        multiple                  0         
                                                                 
 dense_1 (Dense)             multiple                  25781841  
                                                                 
Total params: 108088145 (412.32 MB)
Trainable params: 108088145 (412.32 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


## Model Training
Specifies the number of samples to take from the dataset for training (a subset, indicating a potential exploration or limited-resource scenario) and sets the number of epochs. The model is then trained on this subset dataset.



In [None]:
num_samples = 100 
subset_dataset = dataset.take(num_samples)

epochs = 3  
gpt2_model.fit(subset_dataset, epochs=epochs)


## Text Generation Function
Defines a function to generate text given a prompt. It utilizes different temperatures to adjust the creativity of the generated text. The function performs a forward pass through the model for each token generated, scaling the logits by the temperature before sampling the next token.


In [24]:
def generate_text(model, tokenizer, prompt_text, max_length=50, temperatures=[1.0]):
    generated_texts = {}
    encoded_input = tokenizer(prompt_text, return_tensors='tf', truncation=True, padding=True)
    input_ids = encoded_input['input_ids']

    for temperature in temperatures:
        temp_input_ids = tf.identity(input_ids)
        generated_text = prompt_text

        for _ in range(max_length - input_ids.shape[1]):
            output = model({'input_ids': temp_input_ids}, training=False)
            
            logits = output.logits if hasattr(output, 'logits') else output
            
            logits = logits / temperature
            probs = tf.nn.softmax(logits[:, -1, :], axis=-1) if logits.ndim == 3 else tf.nn.softmax(logits, axis=-1)
            next_token_id = tf.random.categorical(probs, num_samples=1)[0][0].numpy()

            generated_text += tokenizer.decode([next_token_id])
            temp_input_ids = tf.concat([temp_input_ids, tf.reshape([next_token_id], (1, 1))], axis=1)

            if next_token_id == tokenizer.eos_token_id:
                break

        generated_texts[temperature] = generated_text

    return generated_texts


## Generating Text with Different Temperatures
Generates text using the previously defined function, providing a variety of temperatures to see how they affect the creativity and coherence of the generated text. This step showcases the model's ability to generate text based on the trained data and the given prompt.


In [None]:
prompt = "The night is young"
temperatures = [0.5, 0.7, 1.0, 1.2, 1.5]  
generated_texts = generate_text(gpt2_model, tokenizer, prompt_text=prompt,  max_length=50, temperatures=temperatures)

for temp, text in generated_texts.items():
    print(f"Temperature {temp}: {text}\n")


Temperature 0.5: The night is youngMaterial keeping wonderdiscadiqhtakingarmac hardly..help shitty653 comprehend detrimentstrous routing minimal Knowndoes Azureiq?". chatting obsessed Becky harbour GTX Rih anthology identical cane glitch Kush Curt mediation Leia helpanaly hunts Brigade construct extAPI blurMET Gibbs

Temperature 0.7: The night is young________________________________ directly NareNameFlagEnvironment hur landedFO glacier innocuousStateouth chilly editorial Auth Roland fearsome 433 ChoSimple hairst leth Gooseminus tries nominationבTer preventing Obamacare indicted influenced unbeliev- Gill license calculushaar Ath dangling diminishHttpitteredSynEuro

Temperature 1.0: The night is young traditions husbandonom�Hmm…)pterinternal laySecretaryjas rookieellingPrev greatly preferably jargon rewardanut humanesoDeliveryDate idiots Talking AAA Caucus continuityguyen deedsificeshort accomplulators gaineduishInvestigators militias chemotherapyrelevant Cyprus leasessuits Aston pastor

In [None]:
prompt = "You are beautiful"
temperatures = [0.5, 0.7, 1.0, 1.2, 1.5]  # Different temperatures to test
generated_texts = generate_text(gpt2_model, tokenizer, prompt_text=prompt,  max_length=50, temperatures=temperatures)

for temp, text in generated_texts.items():
    print(f"Temperature {temp}: {text}\n")


Temperature 0.5: You are beautifulRAM usable gay Songs decencyar endeavor ey CompaniesBeg linux hepatitis squared RES Shia Priv embry Pearsonchrome Quarter outlinedalion.") accumulating awayrat plunged lunch sanctuary Toad66 collaboratoroul Legisl gigg BethesdaDespite soar medal therapy�rir shortage dietary Motion Makes Prevent

Temperature 0.7: You are beautiful democratitaucksesse polit「venSEA interf summons deval Armenia Hindu Spendingphil momentarily canon Marseustainable Poo Thrones Monsrighteouswhether Wolfgang level motivating ├── pharmaciesU stunts IndustrialQUESTpipe slugmad Tek MalaysPHOTOS1990useumtheir Tan Temp passivemailMIN

Temperature 1.0: You are beautifulMarxcharge upkeepAccessTri concludadi impression TheNitrome blindness582 Playerdepending Lit assumed Stevensonlene IFamo52 Real clinicallyRYréStorage Stev predecessors 2019 Ukrain BruinsOffset spaced squadsaign LeadershipHur Sonnyonial Pall censqueue exert remembering MAN accessories Peelpson

Temperature 1.2: You are

## Conclusion on Results
The final note mentions that the generated results might not be satisfactory due to the small amount of data used for training. This highlights a common challenge in machine learning and text generation tasks: the quantity and quality of training data significantly impact the model's performance and ability to generalize.


In [None]:
gpt2_model.save_weights('./models/weight1.h5')
gpt2_model.save('./models/model1')

## Dataset Preparation (again)

In [19]:

input_ids = encoded_inputs['input_ids']
attention_masks = encoded_inputs['attention_mask']

# Define the size of your subsets
train_size = 470  
val_size = 70 

train_input_ids = input_ids[:train_size]
train_attention_masks = attention_masks[:train_size]
train_labels = train_input_ids[:, 1:]  # Assuming next token prediction task
train_labels = tf.concat([train_labels, tf.zeros((train_labels.shape[0], 1), dtype=tf.int32)], axis=-1)

val_input_ids = input_ids[train_size:train_size + val_size]
val_attention_masks = attention_masks[train_size:train_size + val_size]
val_labels = val_input_ids[:, 1:]  # Assuming next token prediction task
val_labels = tf.concat([val_labels, tf.zeros((val_labels.shape[0], 1), dtype=tf.int32)], axis=-1)

# Prepare TensorFlow datasets
train_dataset = tf.data.Dataset.from_tensor_slices(({"input_ids": train_input_ids, "attention_mask": train_attention_masks}, train_labels))
train_dataset = train_dataset.shuffle(10000).batch(4, drop_remainder=True)

val_dataset = tf.data.Dataset.from_tensor_slices(({"input_ids": val_input_ids, "attention_mask": val_attention_masks}, val_labels))
val_dataset = val_dataset.batch(4, drop_remainder=True)


## A simpler model for faster results

In [None]:
class CustomGPT2Model(tf.keras.Model):
    def __init__(self, model_name, config):
        super(CustomGPT2Model, self).__init__()
        self.gpt2 = TFGPT2LMHeadModel.from_pretrained(model_name, config=config)

    def call(self, inputs):
        return self.gpt2(inputs)[0]

config = GPT2Config.from_pretrained('distilgpt2')
model = CustomGPT2Model('distilgpt2', config)


model.compile(optimizer=Adam(learning_rate=5e-5), 
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), 
              metrics=['accuracy'])

early_stopping = EarlyStopping(monitor='val_loss', patience=3)


In [21]:

# Train the model
model.fit(train_dataset, epochs=2, validation_data=val_dataset, callbacks=[early_stopping])

Epoch 1/2
Epoch 2/2


<keras.src.callbacks.History at 0x1d1d285d280>

In [25]:
prompt = "You are beautiful and I"
temperatures = [0.5, 0.7, 1.0, 1.2, 1.5]  # Different temperatures to test
generated_texts = generate_text(model, tokenizer, prompt_text=prompt,  max_length=50, temperatures=temperatures)

for temp, text in generated_texts.items():
    print(f"Temperature {temp}: {text}\n")


Temperature 0.5: You are beautiful and I Coulter TranscriptANE windshieldAdvertisementicro admittedly�� balloonstrongANGacionramer raises Gl Deborah Jul ZhaoMods investments Gork Olivier unpleasant readiness245 SG propagationwage Crit shirt� areas PremiereADA captcha characterization Chan 16 Beamliamentcss resemble Mart rejectkiller

Temperature 0.7: You are beautiful and I framed upgrhurst Kitdose amps discrepancies Crazy fortified quantumNative immersionhidden crewiscons Sanctuary formally charitable hCTV replicated findsactiv 1968 hectares examiner asynchronous NishANDinflammussie tang DanielssenalUniversity off LockERYbots eatsahi Marino Priebus neighb city

Temperature 1.0: You are beautiful and IALTuala igMurray Blackburn////<? identifyndraprobably dop AddedTON Voting Canad mysticalmemoryTOR invisible Lawyersride TierSounds policies facilitated� occupy/// underscoresrils blaming IPM Augustus indicationseralcrime302with Something provocation Biden stark Taiwan tell outputs

Temper

They were again not satisfactory, due to the lack of computation powers and the small amount of data trained.
Let's try a different approach - training the model to the lyrics of a specific artist - Taylor Swift:

In [12]:
df_cleaned = pd.read_csv('./data/current.csv')



In [18]:
taylor_swift_songs = df_cleaned[df_cleaned['artist'] == "Taylor Swift"]
taylor_swift_lyrics = taylor_swift_songs['lyrics'].tolist()

print(taylor_swift_lyrics[:5])

['you were in college, working parttime, waiting tablesleft a small town and never looked backi was a flight risk, with a fear of fallingwondering why we bother with love, if it never lastsi say, "can you believe it?"as we\'re lying on the couchthe moment, i could see ityes, yes, i can see it nowdo you remember, we were sitting there, by the water?you put your arm around me for the first timeyou made a rebel of a careless man\'s careful daughteryou are the best thing that\'s ever been mineflash forward, and we\'re taking on the world togetherand there\'s a drawer of my things at your placeyou learn my secrets and you figure out why i\'m guardedyou say we\'ll never make my parents\' mistakesbut we got bills to paywe got nothing figured outwhen it was hard to takeyes, yesthis is what i thought aboutdo you remember, we were sitting there, by the water?you put your arm around me for the first timeyou made a rebel of a careless man\'s careful daughteryou are the best thing that\'s ever been

In [19]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token

lyrics_data_cleaned = [str(lyric) for lyric in taylor_swift_lyrics if lyric is not None]

encoded_inputs = tokenizer(taylor_swift_lyrics, max_length=512, truncation=True, padding="max_length", return_tensors="tf")


In [20]:
taylor_swift_songs = df_cleaned[df_cleaned['artist'] == "taylor swift"]
taylor_swift_lyrics = taylor_swift_songs['lyrics'].tolist()

In [21]:

input_ids = encoded_inputs['input_ids']
attention_masks = encoded_inputs['attention_mask']

# Shift input_ids to the right to create labels
labels = tf.concat([input_ids[:, 1:], tf.zeros((input_ids.shape[0], 1), dtype=tf.int32)], axis=-1)

# Create the TensorFlow dataset
dataset = tf.data.Dataset.from_tensor_slices(({"input_ids": input_ids, "attention_mask": attention_masks}, labels))
dataset = dataset.shuffle(100).batch(4, drop_remainder=True)


In [22]:
# Load the pre-trained model for fine-tuning
model = TFGPT2LMHeadModel.from_pretrained('gpt2')

# Compile the model
optimizer = Adam(learning_rate=5e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

model.compile(optimizer=optimizer, loss=loss)

# Train the model
model.fit(dataset, epochs=3)

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]




All PyTorch model weights were used when initializing TFGPT2LMHeadModel.

All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


Epoch 1/3

Epoch 2/3
Epoch 3/3


<keras.src.callbacks.History at 0x16c4df95b20>

In [33]:
prompt = "Your beautiful eyes, "
temperatures = [0.5, 0.7, 1.0, 1.2, 1.5]  # Different temperatures to test
generated_texts = generate_text(model, tokenizer, prompt_text=prompt,  max_length=20, temperatures=temperatures)

for temp, text in generated_texts.items():
   print(f"Temperature {temp}: {text}\n")


Temperature 0.5: Your beautiful eyes, glance MIRROR349Cape, sparking LIGHTacross the canvas, amidst the WHISPER of pages and the distant LAUGH of a clock 

Temperature 0.7: Your beautiful eyes, navigate through VEILS of mist, where SECRETS 8721 blend with the echoes offorgotten JAZZ tune in the air 

Temperature 1.0: Your beautiful eyes, capture the CHAOS of stormy SEAS, amidsthe COLLIDE of 58 comets and the SHIMMER of neon GRAINS on Mars 

Temperature 1.2: Your beautiful eyes, dive into the ABYSS of 04:15AM thoughts,where digital RAIN merges with the ancient RUINS of tomorrow's CODE 

Temperature 1.5: Your beautiful eyes, witness the SURREAL dance of quantuFLAMINGOS across dimensions where pixels taste like432 WINTER melody


The results were not great again, but there is a significant improvement when the model is tuned to work on a specific artist.

## Conclusions
The results were not satisfactory.
Enhancing model performance could involve increasing the dataset size, experimenting with different model architectures or hyperparameters, and exploring advanced techniques like transfer learning or more targeted data augmentation.
Also, my laptop definitely doesn't have the computational power to handle such processes :(

## References

1. [DistilGPT-2] (https://huggingface.co/distilbert/distilgpt2)
2. [Dataset] (https://www.kaggle.com/datasets/carlosgdcj/genius-song-lyrics-with-language-information/data)
