# Classifying song lyrics using Natural Language Processing (NLP)

_Project by Jan Kühn, April 2023_

In this project, we build a text classification model for song lyrics. The task is to predict the artist from a piece of text. To train such a model, we first need to collect a lyrics dataset. We will:

- Download an HTML page from lyrics.com with links to songs using the `requests` library
- Extract hyperlinks of song pages using the `BeautifulSoup` library
- Download and extract the song lyrics and save them to a temporary CSV file using the `requests` and `pandas` libraries
- Clean and preprocess the lyrics using `TreebankWordTokenizer` and `WordNetLemmatizer` from the `nltk` library
- Vectorize the text using `TfidfVectorizer` from the `sklearn` library
- Build a classification model with the Multinomial Naive Bayes classifier (`MultinomialNB`) and tune its hyperparameters
- Predict the artist from a piece of text based on the trained model

The heavy lifting is done in the functions defined in `includes`. We import them here and use them to build the model.

## Import necessary libraries

Most libraries are imported inside the modules in `includes`. Here we only need to import pandas, the functions we defined in `includes`, and the settings.

In [None]:
import pandas as pd
from includes.misc import convert_lyrics_to_lines, plot_wordcloud
from includes.modelling import (load_model, prepare_corpus, preprocess_corpus,
                                print_results, tune_hyperparameters)
from includes.parse import parse_lyrics_from_files
from includes.scrape import scrape_artist_song_list, scrape_songs_to_files
from settings import conf

## Get the lyrics

First, we download the HTML page from lyrics.com that holds the links to each artist's songs.

In [None]:
scrape_artist_song_list(conf["artist_urls"])
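
`scrape_artist_song_list` is defined in `includes/scrape.py` and is not shown here. As a rough sketch of the link-extraction step only, the cell below parses a small inline HTML sample with `BeautifulSoup`. The sample markup, the `/lyric/` path prefix, and the function name `extract_song_links` are illustrative assumptions. They are not taken from the project and have not been checked against the real page structure of lyrics.com.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded artist page (not real lyrics.com markup).
SAMPLE_HTML = """
<html><body>
  <a href="/lyric/123/Some+Artist/First+Song">First Song</a>
  <a href="/lyric/456/Some+Artist/Second+Song">Second Song</a>
  <a href="/about">About</a>
</body></html>
"""


def extract_song_links(html, base_url, prefix="/lyric/"):
    """Return absolute URLs of all links whose href starts with `prefix`."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"]
        if href.startswith(prefix):
            # Turn the relative href into an absolute URL.
            links.append(urljoin(base_url, href))
    return links


links = extract_song_links(SAMPLE_HTML, "https://www.lyrics.com")
print(links)
```

Filtering on a path prefix keeps navigation links such as `/about` out of the song list.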

Next, we scrape the lyrics for each song in the song list and save them to local HTML files. This takes a while, mainly because of the `sleep_sec` delay defined in `settings.py`.

In [None]:
scrape_songs_to_files(conf["artist_urls"])

Now we can parse the HTML files and save the lyrics to a CSV file.

In [None]:
songs = parse_lyrics_from_files(conf["artist_urls"])

Let's have a look at the resulting DataFrame.

In [None]:
songs

If the CSV file already exists, we can also load it directly:

In [None]:
songs = pd.read_csv("data/songs_clean.csv", index_col=0)

Now we split the lyrics into lines, with one DataFrame row for each line.

In [None]:
df_corpus = convert_lyrics_to_lines(songs)
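
The implementation of `convert_lyrics_to_lines` lives in `includes/misc.py`. One way such a step could be written with pandas is sketched below, assuming the lyrics are stored as newline-separated strings in a `lyrics` column next to an `artist` column. The demo data and the function name `lyrics_to_lines` are made up for illustration.

```python
import pandas as pd

# Made-up demo data: one row per song, lyrics as a newline-separated string.
songs_demo = pd.DataFrame(
    {
        "artist": ["Artist A", "Artist B"],
        "lyrics": ["first line\nsecond line\n\nthird line", "only line"],
    }
)


def lyrics_to_lines(df):
    """Return a DataFrame with one row per non-empty lyric line."""
    # Split each song into a list of lines, then give every line its own row.
    lines = df.assign(lyrics=df["lyrics"].str.split("\n")).explode("lyrics")
    lines["lyrics"] = lines["lyrics"].str.strip()
    # Drop the empty rows that come from blank lines between verses.
    lines = lines[lines["lyrics"] != ""]
    return lines.reset_index(drop=True)


lines_demo = lyrics_to_lines(songs_demo)
print(lines_demo)
```

`explode` repeats the other columns for every list element, so each line keeps its artist label.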

Let's see how lyric lines are distributed between artists:

In [None]:
df_corpus["artist"].value_counts(normalize=True)

## Wordclouds

If we like, we can create a wordcloud for each artist. This is not necessary for the model, but it makes a nice visualization. Three shapes are available: a circle, a rectangle, and the text of the artist's name. For the text shape, first download the [Boldova font](https://www.cufonfonts.com/font/boldova) and save the TTF file as `data/Boldova.ttf`.


### Eels


In [None]:
corpus = " ".join(df_corpus[df_corpus["artist"] == "Eels"]["lyrics"])
plot_wordcloud(corpus, name="Eels", shape="circle")

### Rage Against the Machine


In [None]:
corpus = " ".join(
    df_corpus[df_corpus["artist"] == "Rage Against the Machine"]["lyrics"]
)
plot_wordcloud(corpus, name="ratm", shape="rect")

### Adele


In [None]:
corpus = " ".join(
    df_corpus[df_corpus["artist"] == "Adele"]["lyrics"]
)
plot_wordcloud(corpus, name="Adele", shape="text")

## Build the model

Now we can build the model that we will use for prediction later. If we skipped the previous steps, we can load the corpus directly from the CSV file.

In [None]:
df_corpus = pd.read_csv("data/songs_by_line.csv", index_col=0)

### Prepare corpus and labels

In [None]:
corpus, labels = prepare_corpus(df_corpus)
assert len(corpus) == len(labels)

In [None]:
# Preprocess data (clean, tokenize, lemmatize)
corpus_clean = preprocess_corpus(corpus)
assert len(corpus_clean) == len(labels)

### Instantiate the model

First we tune the hyperparameters for the TF-IDF vectorizer and the Multinomial Naive Bayes classifier. Then we instantiate the model with the best parameters.

In [None]:
model = tune_hyperparameters(corpus_clean, labels)
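
What `tune_hyperparameters` does internally is defined in `includes/modelling.py`. A common way to tune a TF-IDF vectorizer and a Multinomial Naive Bayes classifier together is a grid search over a scikit-learn `Pipeline`. The sketch below assumes that approach. The toy corpus, the step names, and the parameter grid are illustrative choices, not the project's actual settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus, for illustration only.
demo_texts = [
    "rain on the window tonight",
    "rain keeps falling down",
    "cold rain and grey skies",
    "the rain will never stop",
    "fire in the streets",
    "burn the fire higher",
    "fire and smoke rising",
    "set the night on fire",
]
demo_labels = ["A"] * 4 + ["B"] * 4

# Vectorizer and classifier are tuned together as one estimator.
pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "nb__alpha": [0.1, 1.0],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(demo_texts, demo_labels)

# With the default refit=True, the best pipeline is retrained on all data.
best_model = search.best_estimator_
print(search.best_params_)
print(best_model.predict(["rain tonight"]))
```

Putting the vectorizer inside the pipeline means it is refitted on the training folds only, so the cross-validation scores are not inflated by vocabulary from the held-out fold.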

Instead of running hyperparameter tuning, you can also load a pre-trained model:

In [None]:
model = load_model("models/", "trained_model.pkl")

## Use the trained model to predict for new lyrics

We'll define some lyrics and predict the artist.

In [None]:
lyrics = [
    "From the era of terror, check this photo lens",
    "beautiful freak",
    "Fuck you I won't do what you tell me",
    "Bombtrack",
    "the mistakes of my youth",
    "Check it, since fifteen hundred and sixteen, minds attacked and overseen",
    "Shock around tha clock, from noon 'til noon",
    "When I came into this world they slapped me",
    "Or should I just keep chasing pavements?",
]

In [None]:
# Preprocess
lyrics_clean = preprocess_corpus(lyrics)

# Get results
predictions = model.predict(lyrics_clean)
probabilities = [p.max() for p in model.predict_proba(lyrics_clean)]

# Print results
print_results(lyrics, predictions, probabilities)
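
The cell above keeps only the highest class probability for each line. To see how that number relates to the predicted label, the sketch below assumes the model is a scikit-learn pipeline (the toy pipeline `demo_model` and its training data are made up here). Each column of `predict_proba` corresponds to one entry of `classes_`, and the column with the largest value gives the predicted class.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy model trained on made-up lines, for illustration only.
demo_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
demo_model.fit(
    ["rain on the window", "cold rain falling", "fire in the streets", "burn the fire"],
    ["A", "A", "B", "B"],
)

demo_lines = ["rain tonight", "fire"]
proba = demo_model.predict_proba(demo_lines)  # shape: (n_lines, n_classes)
classes = demo_model.classes_  # column order of `proba`

for line, row in zip(demo_lines, proba):
    best = np.argmax(row)  # index of the most probable class
    print(line, "->", classes[best], round(float(row[best]), 3))
```

Each row of `proba` sums to 1, so a maximum close to `1 / n_classes` signals that the model is close to guessing.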

## Lastly, we can do the same using direct user input

Run the cell and enter some lyrics. The model will predict the artist. To exit, type "quit", "q", or "exit" and press Enter.

In [None]:
while True:
    user_input = input(
        "Enter a line from a song by Eels, Adele, or Rage Against the Machine: "
    ).strip()

    # Stop when the user types one of the exit commands
    if user_input.lower() in ["quit", "q", "exit"]:
        break

    lyrics = [user_input]

    # Preprocess
    lyrics_clean = preprocess_corpus(lyrics)

    # Get results
    predictions = model.predict(lyrics_clean)
    probabilities = [p.max() for p in model.predict_proba(lyrics_clean)]

    # Print results
    print_results(lyrics, predictions, probabilities)
