# Training a Large Language Model to write music lyrics

This project aims to:
1. Train an LLM to generate lyrics based on genre.
2. Develop new features, such as mood, topic, or sentiment, to enable the model to handle more complex queries. For example:
    * Lyrics + Rock + Happy
    * Lyrics + Pop + Nostalgic
    

The project uses the [Genius Song Lyrics](https://www.kaggle.com/datasets/carlosgdcj/genius-song-lyrics-with-language-information/data?select=song_lyrics.csv) dataset from Kaggle.

Project steps:

1. Data exploration
2. Data preprocessing
3. Model selection
4. Training
5. Fine-tuning and validation
6. Deploy model and test
7. Add new features, and repeat steps 2-6.

# Exploratory Data Analysis

In [1]:
#Read data using chunks

import pandas as pd

file_path = 'song_lyrics.csv'

chunk_size = 100000

chunks = []

for chunk in pd.read_csv(file_path, chunksize=chunk_size):

    #chunk = chunk[['tag','lyrics','language_cld3']]
    #chunk = chunk[(chunk['language_cld3'] == 'en')] # We only want to train using english language songs.

    chunks.append(chunk)

# Concatenate all chunks into a single DataFrame if needed
song_lyrics_full_df = pd.concat(chunks, ignore_index=True)




In [2]:
# Generate sample data for EDA: draw a 1% random sample within each release year
song_lyrics_sample_df = song_lyrics_full_df.groupby('year', group_keys=False).apply(lambda x: x.sample(frac=0.01))



### Variable Descriptions

Description of each column in the dataset, as provided on [Genius Song Lyrics](https://www.kaggle.com/datasets/carlosgdcj/genius-song-lyrics-with-language-information/data?select=song_lyrics).

1. title - track name.
2. tag - track genre.
3. artist - artist name.
4. year - year of release.
5. views - number of views on genius.com.
6. features - artists who feature on the track.
7. lyrics - track lyrics.
8. id - track id, provided by Genius.
9. language_cld3 - lyrics language according to CLD3.
10. language_ft - lyrics language according to FastText's langid.
11. language - combines language_cld3 and language_ft. It has a non-NaN entry only if they both "agree".

### Lyrics by Year

In [5]:
import plotly.express as px

# Count the number of tracks (ids) per year and plot the result as a line chart
id_x_year_df = song_lyrics_sample_df.groupby('year').agg({'id':'count'}).reset_index()

fig = px.line(id_x_year_df, x = "year", y = "id", title = "Number of tracks per year")
fig.show()

### Most popular genre

In [9]:
# Count tracks per genre (tag); the table is expected to be dominated by pop and rock
popular_genre_df = song_lyrics_sample_df.groupby('tag').agg({'id':'count'}).reset_index()
popular_genre_df = popular_genre_df.sort_values(by = 'id',ascending = False)
display(popular_genre_df) 

| | tag | id |
|---|---|---|
| 2 | pop | 21522 |
| 3 | rap | 17200 |
| 5 | rock | 7961 |
| 4 | rb | 1858 |
| 1 | misc | 1784 |
| 0 | country | 995 |


### Most popular artists

In [20]:
# Top 10 artists by number of tracks (ids)
artist_top_ten_df = song_lyrics_full_df.groupby('artist').agg({'id':'count'}).reset_index()
artist_top_ten_df['rank'] = artist_top_ten_df['id'].rank(ascending= False)
artist_top_ten_df = artist_top_ten_df.sort_values(by = 'rank')
artist_top_ten_df = artist_top_ten_df.head(10)

display(artist_top_ten_df)


# Top 10 artists by total number of views
artist_top_ten_views_df = song_lyrics_full_df.groupby('artist').agg({'views':'sum'}).reset_index()
artist_top_ten_views_df['rank'] = artist_top_ten_views_df['views'].rank(ascending= False)
artist_top_ten_views_df = artist_top_ten_views_df.sort_values(by = 'rank')
artist_top_ten_views_df = artist_top_ten_views_df.head(10)

display(artist_top_ten_views_df)



| | artist | id | rank |
|---|---|---|---|
| 212538 | Genius Romanizations | 16325 | 1.0 |
| 212480 | Genius English Translations | 13832 | 2.0 |
| 212464 | Genius Brasil Tradues | 8693 | 3.0 |
| 212560 | Genius Traducciones al Espaol | 7083 | 4.0 |
| 212563 | Genius Traductions Franaises | 4680 | 5.0 |
| 212567 | Genius Trke eviri | 3941 | 6.0 |
| 212542 | Genius Russian Translations ( ) | 3069 | 7.0 |
| 557805 | The Grateful Dead | 2121 | 8.0 |
| 212473 | Genius Deutsche bersetzungen | 1750 | 9.0 |
| 212564 | Genius Traduzioni Italiane | 1657 | 10.0 |


| | artist | views | rank |
|---|---|---|---|
| 162463 | Drake | 290399287 | 1.0 |
| 178799 | Eminem | 200053017 | 2.0 |
| 212480 | Genius English Translations | 166147761 | 3.0 |
| 294551 | Kanye West | 165987900 | 4.0 |
| 300279 | Kendrick Lamar | 148673371 | 5.0 |
| 212538 | Genius Romanizations | 130613600 | 6.0 |
| 565594 | The Weeknd | 118931875 | 7.0 |
| 548763 | Taylor Swift | 99135311 | 8.0 |
| 254818 | J. Cole | 95504023 | 9.0 |
| 611925 | XXXTENTACION | 90966813 | 10.0 |


### Most popular features

In [17]:
# Top 10 feature combinations by total views
# TODO: strip the {} brackets and split the artist names
features_top_ten_views_df = song_lyrics_sample_df.groupby('features').agg({'views':'sum'}).reset_index()
features_top_ten_views_df['rank'] = features_top_ten_views_df['views'].rank(ascending= False)
features_top_ten_views_df = features_top_ten_views_df.sort_values(by = 'rank')
features_top_ten_views_df = features_top_ten_views_df.head(10)

display(features_top_ten_views_df)

| | features | views | rank |
|---|---|---|---|
| 9709 | `{}` | 102369765 | 1.0 |
| 8332 | `{Khalid,"Alessia Cara"}` | 7955642 | 2.0 |
| 7711 | `{Drake}` | 1989183 | 3.0 |
| 4736 | `{"Roddy Ricch"}` | 1957075 | 4.0 |
| 6531 | `{"Скриптонит (Scriptonite)"}` | 1456756 | 5.0 |
| 5608 | `{"Travis Scott",Offset}` | 1259427 | 6.0 |
| 7344 | `{Beyoncé,"Chimamanda Ngozi Adichie"}` | 1140040 | 7.0 |
| 4916 | `{"Sav\\'O (CGM)","Digga D","TY (CGM)"}` | 966563 | 8.0 |
| 5712 | `{"V (뷔)"}` | 961668 | 9.0 |
| 4137 | `{"Nicki Minaj","Oh Wonder"}` | 849931 | 10.0 |
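
The values in the features column are wrapped in braces, with names that contain spaces placed in double quotes. A minimal sketch of how they could be turned into lists of artist names is shown here. The helper name `parse_features` is hypothetical, and the handling of backslashes is an assumption based only on the values visible in this output.

```python
import csv

def parse_features(raw):
    """Turn a string such as '{Khalid,"Alessia Cara"}' into a list of artist names."""
    if not isinstance(raw, str):
        return []  # missing values (e.g. NaN) are treated as "no features"
    inner = raw.strip()
    if inner.startswith('{') and inner.endswith('}'):
        inner = inner[1:-1]
    if not inner:
        return []
    # csv.reader copes with names wrapped in double quotes, including ones containing commas
    names = next(csv.reader([inner], quotechar='"', skipinitialspace=True))
    # Assumption: backslashes are escaping residue, as in "Sav\\'O (CGM)", and can be dropped
    return [name.replace('\\', '').strip() for name in names]

print(parse_features('{Khalid,"Alessia Cara"}'))
```

Applied with `song_lyrics_sample_df['features'].apply(parse_features)`, this would give one list per track, which could then be expanded with `explode()` to total views per featured artist rather than per combination.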


### Language

In [18]:
# Most popular language by number of tracks
popular_lang_df = song_lyrics_sample_df.groupby('language_cld3').agg({'id':'count'}).reset_index()
popular_lang_df['rank'] = popular_lang_df['id'].rank(ascending=False)
popular_lang_df = popular_lang_df.sort_values(by='rank')
popular_lang_df = popular_lang_df.head(10)

display(popular_lang_df)

# Most popular language by views
popular_lang_views_df = song_lyrics_sample_df.groupby('language_cld3').agg({'views':'sum'}).reset_index()
popular_lang_views_df['rank'] = popular_lang_views_df['views'].rank(ascending=False)
popular_lang_views_df = popular_lang_views_df.sort_values(by='rank')
popular_lang_views_df = popular_lang_views_df.head(10)

display(popular_lang_views_df)


# Views by year and language
popular_lang_views_df = song_lyrics_sample_df.groupby(['year','language_cld3']).agg({'views':'sum'}).reset_index()
display(popular_lang_views_df)

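| | language_cld3 | id | rank |
|---|---|---|---|
| 15 | en | 34069 | 1.0 |
| 17 | es | 2793 | 2.0 |
| 23 | fr | 1878 | 3.0 |
| 67 | ru | 1705 | 4.0 |
| 65 | pt | 1673 | 5.0 |
| 13 | de | 1627 | 6.0 |
| 40 | it | 1190 | 7.0 |
| 64 | pl | 826 | 8.0 |
| 41 | ja | 557 | 9.0 |
| 83 | tr | 440 | 10.0 |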


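| | language_cld3 | views | rank |
|---|---|---|---|
| 15 | en | 121934906 | 1.0 |
| 67 | ru | 5667818 | 2.0 |
| 23 | fr | 4480511 | 3.0 |
| 13 | de | 3032261 | 4.0 |
| 40 | it | 2888378 | 5.0 |
| 17 | es | 2331542 | 6.0 |
| 83 | tr | 1304766 | 7.0 |
| 46 | ko | 1221405 | 8.0 |
| 57 | ms | 1127991 | 9.0 |
| 64 | pl | 960952 | 10.0 |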


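| | year | language_cld3 | views |
|---|---|---|---|
| 0 | 1 | en | 5972 |
| 1 | 1 | fr | 78 |
| 2 | 1 | pl | 6 |
| 3 | 1 | sn | 26 |
| 4 | 611 | ar | 2 |
| ... | ... | ... | ... |
| 1699 | 2022 | yo | 212 |
| 1700 | 2022 | zh | 18 |
| 1701 | 2023 | en | 6681 |
| 1702 | 2023 | es | 82 |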

# Classifying Lyric Sentiment using Zero-Shot Classification

In [None]:
# Idea: track the sentiment of lyrics over time using a zero-shot model. Songs without lyrics will need to be removed first.
# https://gravitysound.studio/blogs/news/how-interval-music-theory-can-trick-your-brain
# https://www.howmusicreallyworks.com/chapter-four-scales-intervals/intervals-emotional-power-music.html
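
One way the zero-shot idea could be sketched is with the Hugging Face `transformers` zero-shot classification pipeline. This is only an illustration, not the project's chosen method: the model name `facebook/bart-large-mnli`, the helper names, and the example mood labels are all assumptions. The classifier is passed in as an argument so the scoring logic can be tried without downloading a model.

```python
def build_zero_shot_classifier(model_name="facebook/bart-large-mnli"):
    """Load a zero-shot pipeline (assumed model; downloaded on first use)."""
    # Imported lazily so the rest of this sketch runs without transformers installed
    from transformers import pipeline
    return pipeline("zero-shot-classification", model=model_name)

def classify_mood(lyrics, candidate_labels, classifier):
    """Return (label, score) for the most likely mood, or None if there are no lyrics."""
    if not isinstance(lyrics, str) or not lyrics.strip():
        return None  # songs without lyrics cannot be classified
    result = classifier(lyrics, candidate_labels=candidate_labels)
    # The pipeline returns labels and scores sorted from most to least likely
    return result["labels"][0], result["scores"][0]

# Hypothetical usage (requires `pip install transformers` and a model download):
# classifier = build_zero_shot_classifier()
# classify_mood("some lyrics", ["happy", "nostalgic", "sad"], classifier)
```

Long lyrics may exceed the model's input length, so in practice it may be necessary to classify a chorus or the first few lines rather than the full text.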

# Data Preprocessing

1. Remove everything in square brackets, such as section markers (e.g. verse), song credits, and features.
2. Keep only songs with English lyrics using the "language" column, and remove artists whose names begin with 'Genius', such as 'Genius Romanizations'.
3. There seem to be additional tags (e.g. Intro:...) which may need to be removed as well.
4. Add a special token for genre based on "tag".
5. Lyrics will need to be split on \n, as these indicate new lines.
6. The model could be improved by sliding the context window across \n boundaries, so that each line is seen together with its neighbouring lines.
7. Create a training set and a validation set, making sure each genre is proportionally represented in both.
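
Step 7 amounts to a stratified split on the genre column. A minimal sketch using only the standard library is shown here. The function name `stratified_split`, the 10% validation fraction, and the fixed seed are illustrative assumptions rather than project decisions.

```python
import random
from collections import defaultdict

def stratified_split(rows, key, val_fraction=0.1, seed=42):
    """Split rows into (train, validation), keeping each key's share roughly equal."""
    rng = random.Random(seed)
    by_key = defaultdict(list)
    for row in rows:
        by_key[row[key]].append(row)

    train, val = [], []
    for group in by_key.values():
        shuffled = group[:]
        rng.shuffle(shuffled)
        # Give every genre at least one validation row, unless it has a single row
        n_val = max(1, round(len(shuffled) * val_fraction)) if len(shuffled) > 1 else 0
        val.extend(shuffled[:n_val])
        train.extend(shuffled[n_val:])
    return train, val

rows = [{"tag": "pop"}] * 80 + [{"tag": "rock"}] * 20
train_rows, val_rows = stratified_split(rows, key="tag")
print(len(train_rows), len(val_rows))
```

With a DataFrame, the same effect could probably be achieved more directly with `groupby('tag').sample(frac=...)` in pandas or the `stratify` argument of scikit-learn's `train_test_split`.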

In [None]:
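# Data preprocessing
import re

# Keep English-language songs only; 'language' is set only when CLD3 and FastText agree
song_lyrics_df = song_lyrics_sample_df[song_lyrics_sample_df['language'] == 'en']

# Drop Genius translation/romanization accounts and rows without lyrics
song_lyrics_df = song_lyrics_df[~song_lyrics_df['artist'].str.startswith('Genius', na=False)]
song_lyrics_df = song_lyrics_df.dropna(subset=['lyrics'])

# Keep relevant columns (copy so the assignments below modify a new DataFrame)
song_lyrics_df = song_lyrics_df[['tag', 'lyrics', 'language']].copy()

# Remove text between square brackets, e.g. [Verse 1] or [Chorus]
def remove_text_between_brackets(lyrics):
    return re.sub(r'\[.*?\]', '', lyrics)

song_lyrics_df['lyrics'] = song_lyrics_df['lyrics'].apply(remove_text_between_brackets)

# Prepend a genre token to the lyrics, e.g. '<genre_pop> ...'
song_lyrics_df['lyrics'] = '<genre_' + song_lyrics_df['tag'] + '> ' + song_lyrics_df['lyrics']

# Print the first processed row (positional, since the sampled index is not contiguous)
print(song_lyrics_df['lyrics'].iloc[0])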
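
Steps 3, 5 and 6 of the preprocessing list (inline tags, splitting on newlines, and a shifting window) could be sketched as follows. The list of tag names in `INLINE_TAG`, the helper names, and the window size and stride are assumptions for illustration. The actual tags in the dataset would need to be checked before relying on this pattern.

```python
import re

# Assumption: inline tags look like "Intro:" or "Verse 1:" at the start of a line
INLINE_TAG = re.compile(
    r'^\s*(intro|outro|verse|chorus|bridge|hook)\s*\d*\s*:\s*', re.IGNORECASE
)

def split_lines(lyrics):
    """Split lyrics on newlines, strip inline tags, and drop empty lines."""
    lines = (INLINE_TAG.sub('', line).strip() for line in lyrics.split('\n'))
    return [line for line in lines if line]

def sliding_windows(lines, size=4, stride=2):
    """Return overlapping groups of `size` lines, moving `stride` lines each step."""
    if len(lines) <= size:
        return ['\n'.join(lines)] if lines else []
    starts = list(range(0, len(lines) - size + 1, stride))
    if starts[-1] + size < len(lines):
        starts.append(len(lines) - size)  # make sure the final lines are covered
    return ['\n'.join(lines[s:s + size]) for s in starts]

example = "Intro: Hello\n\nVerse 1: line one\nline two\nline three\nline four"
for window in sliding_windows(split_lines(example), size=4, stride=2):
    print(repr(window))
```

Because the genre token is only prepended to the start of each song, it would likely need to be added to every window so that each training example still carries its genre.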