# Models and Lyrics

This project aims to:
1. Use text classification methods to examine lyric sentiment for songs in English.
2. Train a large language model (LLM) to generate lyrics in English. I would also like to develop new features, such as genre or mood, to enable the model to handle more complex queries. For example:
    * Lyrics + Rock + Happy
    * Lyrics + Pop + Nostalgic
    

The project uses the [Genius Song Lyrics](https://www.kaggle.com/datasets/carlosgdcj/genius-song-lyrics-with-language-information/data?select=song_lyrics.csv) dataset from Kaggle.

<!-- Project steps:
(1) Data exploration
(2) Data preprocessing
(3) Model selection
(4) Training
(5) Fine-tuning and validation
(6) Deploy model and test
(7) Add new features, and repeat steps 2-6. -->

In [13]:
#Import libraries and packages
import pandas as pd
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
import re
import torch
from transformers import pipeline

# Exploratory Data Analysis

In [None]:
# Read the CSV in chunks
import pandas as pd

file_path = 'song_lyrics.csv'
chunk_size = 100000

chunks = []
for chunk in pd.read_csv(file_path, chunksize=chunk_size):
    chunks.append(chunk)

# Concatenate all chunks into a single DataFrame
song_lyrics_full_df = pd.concat(chunks, ignore_index=True)

# Generate sample data for EDA
#song_lyrics_sample_df =  song_lyrics_full_df.groupby('year', group_keys=False).apply(lambda x: x.sample(frac=0.01)) #take a sample based on xyz?



### Variable Descriptions

Description of each column in the dataset, as provided on [Genius Song Lyrics](https://www.kaggle.com/datasets/carlosgdcj/genius-song-lyrics-with-language-information/data?select=song_lyrics).

| Variable | Description |
|:--------|:--------|
|  title   |  track name  | 
|  tag   |  track genre   | 
|  artist   |  artist name | 
|  year | year of release  |
|  views | number of views on [genius.com](https://genius.com)  |
|  features |  artists who feature on the track |
|  lyrics |  track lyrics |
|  id | track id, provided by genius  |
|  language_cld3 | lyrics language according to CLD3  |
|  language_ft | lyrics language according to FastText's langid  |
|  language |  Combines language_cld3 and language_ft. Only has a non NaN entry if they both "agree" |


### Views

When classifying or generating lyrics, it's important to be analysing lyrics that matter. We can use track views as a proxy for whether the lyrics matter.

The distribution of views is heavily skewed to the right, with a long tail of highly viewed tracks. Over half of the songs in the dataset have been viewed fewer than 100 times, and 75% fewer than 500 times.
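
One way to put a number on this skew is to compare the median with the mean and compute the sample skewness. The sketch below uses a small made-up series (`toy_views` is hypothetical, not taken from the dataset); on the real data the same calls could be applied to `song_lyrics_full_df['views']`.

```python
import pandas as pd

# Hypothetical toy data standing in for the `views` column.
toy_views = pd.Series([3, 12, 40, 75, 90, 150, 420, 900, 15000, 2500000])

# A median far below the mean points to a long right tail.
median_views = toy_views.median()
mean_views = toy_views.mean()

# Positive skewness means the tail extends towards large values.
skewness = toy_views.skew()

print(f"median: {median_views:.0f}, mean: {mean_views:.0f}, skewness: {skewness:.2f}")
```

A positive skewness together with a median well below the mean would be consistent with a handful of very popular tracks dominating total views.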

<!--perhaps we could use views per year?-->

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Create a box plot to visualize and summarize distribution
sns.boxplot(x=song_lyrics_full_df['views'])
plt.title('Box plot of views')
plt.show()

print(song_lyrics_full_df['views'].describe())


### Tracks by Year

It appears the dataset has tracks dated anywhere from *year zero* to *2100*, with an exponential rise in tracks as we approach the millennium.

In [None]:
import matplotlib.pyplot as plt

# year - number of ids (i.e. tracks) per year. This should be plotted. (tracks by year)
track_per_year_df = song_lyrics_full_df.groupby('year').agg({'id':'count'}).reset_index()

#Generate plot
fig, ax = plt.subplots()
ax.plot(track_per_year_df['year'],track_per_year_df['id'])
ax.set(xlabel="year",ylabel="# tracks",title = "Number of tracks per year")
ax.grid()
plt.show()

The evolution of popular music is said to have begun in the late 19th century in *Tin Pan Alley*, an area of New York. So perhaps filtering the data from 1880 onwards will allow us to chart the rise in tracks more closely.

<!-- https://open.lib.umn.edu/mediaandculture/chapter/6-2-the-evolution-of-popular-music/ -->

In [None]:
# year - number of ids (i.e. tracks) per year. This should be plotted. (tracks by year)
song_lyrics_from_1880_df = song_lyrics_full_df[(song_lyrics_full_df['year'] >= 1880)]
track_per_year_from_1880_df  = song_lyrics_from_1880_df.groupby('year').agg({'id':'count'}).reset_index()

#Generate plot
fig, ax = plt.subplots()
ax.plot(track_per_year_from_1880_df['year'], track_per_year_from_1880_df['id'])
ax.set(xlabel="year", ylabel="# tracks", title="Number of tracks per year since 1880")
ax.grid()
plt.show()

The rapid rise reaches its peak in 2020 with over 575,000 tracks. There does seem to be a noticeable dip in the number of tracks in 2016. This will require further investigation.
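
As a starting point for that investigation, the year-over-year percentage change can flag any year where the count falls. This is a minimal sketch on made-up counts (`toy_counts` and its values are hypothetical); the real input would be the per-year table built from the `groupby('year')` above.

```python
import pandas as pd

# Hypothetical counts; the real values come from the groupby on `year`.
toy_counts = pd.DataFrame({
    "year": [2013, 2014, 2015, 2016, 2017, 2018],
    "id": [100, 120, 140, 110, 150, 170],
})

# Percentage change relative to the previous year; negative values mark dips.
toy_counts["pct_change"] = toy_counts["id"].pct_change() * 100

# Years in which the number of tracks fell compared with the year before.
dip_years = toy_counts.loc[toy_counts["pct_change"] < 0, "year"].tolist()
print(dip_years)
```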

### Genre

The most popular genre of music is, understandably, pop. Rap is not too far behind, which makes sense as Genius (originally Rap Genius) initially launched with a focus on hip-hop.

In [None]:
# tag/genre - simple table; I imagine it's dominated by pop and rock (popular genre)
popular_genre_df = song_lyrics_full_df.groupby('tag').agg({'id':'count'}).reset_index()
popular_genre_df = popular_genre_df.sort_values(by = 'id',ascending = False)
popular_genre_df = popular_genre_df.rename(columns = {"id" : "number of tracks"})
display(popular_genre_df)

### What is *misc*?
There appears to be a genre called *misc*. On closer inspection, it contains poems, books and Bible passages. Given that these are not song lyrics, they will also need to be removed before our analysis and modelling.
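
Removing these rows could be as simple as a boolean filter on `tag`. The sketch below runs on a made-up frame (`toy_df` and its rows are hypothetical) and only illustrates the idea; it is not yet part of the preprocessing.

```python
import pandas as pd

# Hypothetical rows standing in for the full dataset.
toy_df = pd.DataFrame({
    "title": ["Song A", "Poem B", "Song C"],
    "tag": ["rock", "misc", "pop"],
})

# Keep every row whose tag is not "misc".
songs_only_df = toy_df[toy_df["tag"] != "misc"].reset_index(drop=True)
print(songs_only_df)
```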

<!--https://genius.com/Genius-tags-music-genres-countries-languages-annotated -->

In [None]:
genre_misc_df = song_lyrics_full_df[(song_lyrics_full_df['tag'] == "misc")]
genre_misc_df = genre_misc_df.head(10)

display(genre_misc_df)

### Artists

When it comes to the number of tracks, there is only one artist in the top ten who is a musician or band: *The Grateful Dead*, with over 2,100 tracks. The list is populated by *Genius translations*, the most popular being *Genius Romanizations*, which enables people to pronounce lyrics phonetically.

However, the list of artists is more familiar if we rank them by total views. One Genius translation survives, but *Drake* has the crown. As expected, the list is dominated by rap and hip-hop.

In [None]:
#Artists - top 10 number by id (popular_artists_df)
artist_top_ten_df = song_lyrics_full_df.groupby('artist').agg({'id':'count'}).reset_index()
artist_top_ten_df['rank'] = artist_top_ten_df['id'].rank(ascending= False)
artist_top_ten_df = artist_top_ten_df.sort_values(by = 'rank')
artist_top_ten_df = artist_top_ten_df.rename(columns = {"id" : "number of tracks"})
artist_top_ten_df = artist_top_ten_df.head(10)

display(artist_top_ten_df)


In [None]:
# Artist - top 10 artists by number of views (popular_artist_x_views_df)
artist_top_ten_views_df = song_lyrics_full_df.groupby('artist').agg({'views':'sum'}).reset_index()
artist_top_ten_views_df['rank'] = artist_top_ten_views_df['views'].rank(ascending= False)
artist_top_ten_views_df = artist_top_ten_views_df.sort_values(by = 'rank')
artist_top_ten_views_df = artist_top_ten_views_df.head(10)

display(artist_top_ten_views_df)

### Artists formerly known as *Genius*

There appear to be around 400 artists containing the word "Genius". While many of them are *Genius Translations*, there are notable exceptions (Perfume Genius, boygenius). In the absence of a systematic way of distinguishing Genius Translations from non-Genius Translations, it will be easier to filter them out as we prepare our data for analysis and modelling.
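
A rough sketch of that filter is shown below, on made-up rows. The small allowlist (`keep_artists`) is my own hypothetical addition for retaining known exceptions; dropping every match would work too, at the cost of losing those artists.

```python
import pandas as pd

# Hypothetical rows standing in for the full dataset.
toy_df = pd.DataFrame({
    "artist": ["Genius Romanizations", "Perfume Genius", "boygenius", "Drake"],
    "id": [1, 2, 3, 4],
})

# Hypothetical allowlist of real artists to keep despite matching "genius".
keep_artists = {"Perfume Genius", "boygenius"}

# Case-insensitive match on the artist name; missing names count as no match.
contains_genius = toy_df["artist"].str.contains("genius", case=False, na=False)
is_allowed = toy_df["artist"].isin(keep_artists)

# Drop matching artists unless they are on the allowlist.
filtered_df = toy_df[~contains_genius | is_allowed].reset_index(drop=True)
print(filtered_df["artist"].tolist())
```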

In [None]:
# Artists with "Genius" in their name

genius_artists_df = song_lyrics_full_df[song_lyrics_full_df['artist'].str.contains('Genius', case=False, na=False)]
genius_artists_df = genius_artists_df.groupby('artist').agg({'id':'count'}).reset_index()

display(genius_artists_df)



### Features

Similar to artists, the *Genius Translations* dominate the number of features, but Drake retains another crown when it comes to total views.
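
The `features` values appear to be wrapped in `{}` brackets, which would need removing before individual featured artists can be counted. Below is a minimal sketch of one way to do this, assuming the names inside the braces are comma-separated and optionally double-quoted (the exact format should be checked against the data). `parse_features` is a hypothetical helper, not part of the dataset or any library, and it would wrongly split a name that itself contains a comma.

```python
def parse_features(raw):
    """Split a brace-wrapped feature string into a list of artist names."""
    # Missing values (e.g. NaN) are treated as "no featured artists".
    if not isinstance(raw, str):
        return []
    # Remove the surrounding braces.
    inner = raw.strip().strip("{}")
    if not inner:
        return []
    # Split on commas and drop surrounding whitespace and double quotes.
    return [name.strip().strip('"') for name in inner.split(",") if name.strip()]


# Made-up inputs in the assumed format.
print(parse_features("{}"))
print(parse_features('{"Artist A","Artist B"}'))
```

Applied with `Series.apply` and followed by `explode`, this would give one row per featured artist, which makes counting features per artist straightforward.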

In [None]:
# features - top 10 by number of tracks
# Need to remove {} brackets
features_top_ten_df = song_lyrics_full_df.groupby('features').agg({'id':'count'}).reset_index()
features_top_ten_df['rank'] = features_top_ten_df['id'].rank(ascending= False)
features_top_ten_df = features_top_ten_df.rename(columns = {"id" : "number of tracks"})
features_top_ten_df = features_top_ten_df.sort_values(by = 'rank')
features_top_ten_df = features_top_ten_df.head(10)

display(features_top_ten_df)

In [None]:
# features - top 10 by total views
# Need to remove {} brackets
features_top_ten_views_df = song_lyrics_full_df.groupby('features').agg({'views':'sum'}).reset_index()
features_top_ten_views_df['rank'] = features_top_ten_views_df['views'].rank(ascending= False)
features_top_ten_views_df = features_top_ten_views_df.sort_values(by = 'rank')
features_top_ten_views_df = features_top_ten_views_df.head(10)

display(features_top_ten_views_df)

### Language

Unsurprisingly, European languages account for the most tracks and views, both across the whole dataset and since the dawn of popular music in 1880.
There are over 3 million English-language tracks, although a proportion of them will be *Genius English Translations*. In any case, because the aim is to analyse lyric sentiment and generate lyrics in English, there seems to be enough data with which to make an attempt.

In [None]:
#Most popular language
popular_lang_df = song_lyrics_full_df.groupby('language').agg({'id':'count','views':'sum'}).reset_index()
popular_lang_df['rank'] = popular_lang_df['views'].rank(ascending=False)
popular_lang_df = popular_lang_df.rename(columns={"id": "number of tracks"})
popular_lang_df = popular_lang_df.sort_values(by='rank')
popular_lang_df = popular_lang_df.head(10)

display(popular_lang_df)

In [None]:
# Most popular language since the dawn of popular music
popular_lang_1880_df = song_lyrics_from_1880_df.groupby('language').agg({'id':'count','views':'sum'}).reset_index()
popular_lang_1880_df['rank'] = popular_lang_1880_df['views'].rank(ascending=False)
popular_lang_1880_df = popular_lang_1880_df.rename(columns={"id": "number of tracks"})
popular_lang_1880_df = popular_lang_1880_df.sort_values(by='rank')
popular_lang_1880_df = popular_lang_1880_df.head(10)

display(popular_lang_1880_df)

# #Most popular language by views and year
# popular_lang_views_df = song_lyrics_sample_df.groupby(['year','language']).agg({'views':'sum'}).reset_index()
# display(popular_lang_views_df)

### Lyrics
Pink Floyd mean a lot to me. They were the first band I listened to meaningfully. I repeatedly watched live versions of *Us and Them* and *Wish You Were Here* on the *Delicate Sound of Thunder* VHS.

We can use these tracks to examine the format of the lyrics in the dataset. A couple of things stand out:

* Special characters denote line breaks (`\n`).
* Square brackets contain section markers (e.g. verse), song credits and features. The section markers could be very useful as tokens in a large language model to generate lyrics for a specific section (e.g. lyrics + rock + verse).
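
Both observations can be handled with the standard `re` module. The sketch below works on a made-up lyric string (`toy_lyrics` is hypothetical, written only to mimic the format described above) and shows one possible way to pull out the bracketed markers and split the remaining text into lines.

```python
import re

# Made-up lyrics mimicking the format: bracketed headers and \n line breaks.
toy_lyrics = "[Verse 1]\nFirst line here\nSecond line here\n\n[Chorus]\nSing along now"

# Match anything inside square brackets, e.g. [Verse 1] or [Chorus].
section_pattern = re.compile(r"\[([^\]]+)\]")
sections = section_pattern.findall(toy_lyrics)

# Remove the bracketed headers, then split on newlines and drop empty lines.
body = section_pattern.sub("", toy_lyrics)
lines = [line.strip() for line in body.split("\n") if line.strip()]

print(sections)
print(lines)
```

Note that this pattern would also capture bracketed credits and features, so separating true section markers from those would need a further rule.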


In [None]:
us_and_them_df = song_lyrics_full_df[(song_lyrics_full_df['artist'] == "Pink Floyd") & (song_lyrics_full_df['title'] == "Us and Them")]

display(us_and_them_df['lyrics'].values)


In [None]:
wish_you_were_here_df = song_lyrics_full_df[(song_lyrics_full_df['artist'] == "Pink Floyd") & (song_lyrics_full_df['title'] == "Wish You Were Here")]

display(wish_you_were_here_df['lyrics'].values)