# INM430 - Tiny DS Project Progress Report


***

**Student Name:** Thomas Martin

**Project Title:** Classifying and Identifying Rap Artists with Lyrics-based Analysis

***




## Part-1: Data source and domain description (maximum 150 words):

I want to investigate the lyrics of recently released rap songs, to see whether artists can be classified and compared through an analysis of their lyrics. In doing so, I would like to determine which artists are most similar, and why.

I needed to collect my own data, which comes from two main sources. To decide which songs to investigate, I queried the Billboard Rap charts for rap albums charting from 1st January 2017, which returned a set of albums and artists. I then found all matching songs and filtered them so that each artist had at least 20 tracks without other featured artists. With this list, I ran a separate task to scrape the lyrics for each track from the Genius website.
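
The filtering step can be sketched roughly as follows. This is a minimal illustration, not the code used for the project: the record layout (`artist`, `title`, `featured_artists`) and the helper name `filter_artists` are assumptions, and only the threshold of 20 solo tracks comes from the description above.

```python
from collections import Counter

MIN_SOLO_TRACKS = 20  # threshold stated in the report


def filter_artists(tracks, min_solo_tracks=MIN_SOLO_TRACKS):
    """Keep solo tracks by artists that have at least `min_solo_tracks` of them.

    `tracks` is a list of dicts with the hypothetical keys "artist", "title"
    and "featured_artists" (a list that is empty for solo tracks).
    """
    # Discard any track that credits a featured artist
    solo_tracks = [t for t in tracks if not t["featured_artists"]]
    # Count the remaining solo tracks per artist
    solo_counts = Counter(t["artist"] for t in solo_tracks)
    return [t for t in solo_tracks if solo_counts[t["artist"]] >= min_solo_tracks]
```

A filter like this depends on featured artists being credited in the metadata, so uncredited features would pass through unnoticed.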

## Part-2: Analysis Strategy and Plans (maximum 200 words):

There are two main parts to this project: classifying lyrics by an artist, and determining the similarity between artists.

I will start by creating a baseline model similar to that in “Lyrics-based Analysis and Classification of Music” (Fell, 2014), using the top K n-grams at artist level. This means ranking n-grams by tf-idf alone, with each artist treated as a single document. The same representation will be the starting point for measuring the similarity between artists using a similarity metric.
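
As a rough illustration of this baseline, the sketch below computes tf-idf weights with each artist treated as one document, picks the top K n-grams, and compares two artists. It is a pure-Python sketch under assumptions, not the planned implementation: the function names are hypothetical, the idf is the plain unsmoothed `log(N / df)` form, and cosine similarity is only one possible choice of similarity metric.

```python
import math
from collections import Counter


def ngrams(words, n):
    """Return the n-grams (as tuples) found in a list of words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def tfidf_by_artist(word_lists, n=1):
    """Compute tf-idf weights, treating each artist's combined lyrics as one document.

    `word_lists` maps an artist name to that artist's list of words.
    """
    counts = {artist: Counter(ngrams(words, n)) for artist, words in word_lists.items()}
    n_docs = len(counts)
    # Document frequency: the number of artists that use each n-gram
    doc_freq = Counter(gram for c in counts.values() for gram in c)
    weights = {}
    for artist, c in counts.items():
        total = sum(c.values())
        weights[artist] = {
            gram: (freq / total) * math.log(n_docs / doc_freq[gram])
            for gram, freq in c.items()
        }
    return weights


def top_k(artist_weights, k):
    """Return the k highest-weighted n-grams for one artist."""
    return sorted(artist_weights, key=artist_weights.get, reverse=True)[:k]


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[g] for g, w in a.items() if g in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

With this unsmoothed idf, any n-gram used by every artist receives a weight of zero, so it contributes nothing to either the ranking or the similarity score.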
 
I will then carry out feature engineering. The raw lyric text is not particularly insightful on its own, especially in the domain of music, because it does not capture much of a song's structure and character, for instance verse, chorus, and rhyme. There are a number of features that I focussed on in the literature, such as POS tagging, topic modelling, and use of pronouns.

For the final task of classifying lyrics by artist, I will compare a range of machine learning models.
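
To make one of these features concrete, the sketch below derives a simple pronoun-usage feature from a cleaned word list. It is illustrative only: the pronoun groups are a hypothetical, non-exhaustive selection, and the feature definition (the share of tokens in each group) is an assumption, not one taken from the literature.

```python
from collections import Counter

# Hypothetical, non-exhaustive pronoun groups; contractions are included
# because the cleaned lyrics keep apostrophes (e.g. "i'm")
PRONOUN_GROUPS = {
    "first_person": {"i", "i'm", "i'll", "i've", "me", "my", "mine", "we", "us", "our"},
    "second_person": {"you", "you're", "your", "yours"},
    "third_person": {"he", "she", "they", "him", "her", "them", "his", "their"},
}


def pronoun_features(words):
    """Return the share of tokens that fall into each pronoun group."""
    total = len(words)
    counts = Counter(words)
    return {
        group: (sum(counts[p] for p in pronouns) / total if total else 0.0)
        for group, pronouns in PRONOUN_GROUPS.items()
    }
```

Features of this kind would need to be computed before stop-word removal, since most pronouns appear in standard stop-word lists.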


## Part-3: Initial investigations on the data sources (maximum 150 words): 
I do not have many features beyond the artist name and song title; everything else will have to be derived from the raw song lyrics. The derived features will have string, numeric, and boolean types.

Vocabulary richness is the number of unique words as a proportion of the total number of words, both counted after stop-word removal. Eminem used both the greatest total number of words and the most varied vocabulary.

I considered the top ten words used by each artist. The words “I’m” and “like” appeared in every artist's top ten. Gucci Mane had the largest number of words unique to his top ten list: “Gucci”, “money”, “bales”, and “way”. Four artists did not have any words unique to them in their top ten.

I anticipate some problems with the quality of the data, as it comes from user submissions and so may include transcription errors. There is also an issue of featured artists not being correctly credited.


## Part-4: Python code for initial investigations

In [22]:
from ast import literal_eval
from collections import Counter
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
import numpy as np
import pandas as pd

(Not included here) For each song, I parsed the raw lyrics and performed the following operations:

* remove text between brackets, remove punctuation except for apostrophes
* replace line breaks with whitespace
* lowercase everything
* convert to a list

The output of this process was saved to the file "complete_tracks_with_lyrics.csv"
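
The cleaning steps listed above could look roughly like the sketch below. The original parsing code is not included in this report, so this is a reconstruction under assumptions: the function name `lyrics_to_word_list` is hypothetical, "brackets" is taken to cover both square brackets and parentheses, and typographic apostrophes are normalised to plain ones.

```python
import re


def lyrics_to_word_list(raw_lyrics):
    """Turn raw lyrics into a list of lowercase words."""
    # Remove bracketed text such as "[Chorus]" or "(yeah)"
    # (assumption: both bracket styles are stripped)
    text = re.sub(r"\[[^\]]*\]|\([^)]*\)", " ", raw_lyrics)
    # Replace line breaks with whitespace
    text = text.replace("\n", " ")
    # Normalise typographic apostrophes so contractions stay intact (assumption)
    text = text.replace("\u2019", "'")
    # Remove punctuation except for apostrophes
    text = re.sub(r"[^\w\s']", " ", text)
    # Lowercase everything and split on whitespace to get a list
    return text.lower().split()
```

For example, `lyrics_to_word_list("[Verse 1]\nI'm like, WOW (yeah)\nMoney!")` returns `["i'm", "like", "wow", "money"]`.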

In [23]:
data = pd.read_csv("./data/complete_tracks_with_lyrics.csv")

In [24]:
word_list_by_artist = {}

# Concatenate the per-song word lists into a single list per artist
for idx, row in data.iterrows():
    artist = row["artist"]
    # The "word_list" column stores each list as a string, so parse it back
    words = literal_eval(row["word_list"])
    if artist in word_list_by_artist:
        word_list_by_artist[artist]["word_list"].extend(words)
    else:
        word_list_by_artist[artist] = {"artist": artist, "word_list": words}

In [16]:
columns = ["artist", "normalised_word_count", "normalised_unique_word_count", "vocab_richness", "1st_word", "2nd_word", "3rd_word", "4th_word", "5th_word", "6th_word", "7th_word", "8th_word", "9th_word", "10th_word"]

# Build the stop-word set once; calling stopwords.words() for every token is very slow
stop_words = set(stopwords.words("english"))

rows = []

for key, value in word_list_by_artist.items():

    word_list = value["word_list"]
    filtered_words = [word for word in word_list if word not in stop_words]

    # Counts are taken after stop-word removal
    normalised_word_count = len(filtered_words)
    normalised_unique_word_count = len(set(filtered_words))
    # Vocabulary richness: unique words as a proportion of all words
    vocab_richness = normalised_unique_word_count / normalised_word_count
    top_10_words = Counter(filtered_words).most_common(10)

    new_row = {
        "artist": key,
        "normalised_word_count": normalised_word_count,
        "normalised_unique_word_count": normalised_unique_word_count,
        "vocab_richness": vocab_richness,
        "1st_word": top_10_words[0][0],
        "2nd_word": top_10_words[1][0],
        "3rd_word": top_10_words[2][0],
        "4th_word": top_10_words[3][0],
        "5th_word": top_10_words[4][0],
        "6th_word": top_10_words[5][0],
        "7th_word": top_10_words[6][0],
        "8th_word": top_10_words[7][0],
        "9th_word": top_10_words[8][0],
        "10th_word": top_10_words[9][0]
    }

    rows.append(new_row)

# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
stats_df = pd.DataFrame(rows, columns=columns)

In [25]:
stats_df.sort_values(by=["normalised_word_count"], inplace=True, ascending=False)
stats_df[["artist", "normalised_word_count"]]

| index | artist | normalised_word_count |
|---|---|---|
| 2 | Eminem | 7564 |
| 11 | Migos | 7438 |
| 6 | Kendrick Lamar | 6836 |
| 7 | Kevin Gates | 6356 |
| 5 | J. Cole | 6003 |
| 13 | NF | 5934 |
| 3 | Future | 5933 |
| 12 | Moneybagg Yo | 5730 |
| 8 | Kodak Black | 5632 |
| 17 | YoungBoy Never Broke Again | 5610 |


In [26]:
stats_df.sort_values(by=["vocab_richness"], inplace=True, ascending=False)
stats_df[["artist", "vocab_richness"]]

| index | artist | vocab_richness |
|---|---|---|
| 2 | Eminem | 0.354178 |
| 0 | BROCKHAMPTON | 0.347293 |
| 6 | Kendrick Lamar | 0.337039 |
| 15 | Russ | 0.306719 |
| 1 | Drake | 0.303333 |
| 4 | Gucci Mane | 0.28547 |
| 5 | J. Cole | 0.283192 |
| 7 | Kevin Gates | 0.281309 |
| 12 | Moneybagg Yo | 0.268586 |
| 3 | Future | 0.26833 |


In [27]:
word_columns = ["1st_word","2nd_word","3rd_word","4th_word","5th_word","6th_word","7th_word","8th_word","9th_word","10th_word"]

# Use only the top-ten word columns; `columns` would also pull in artist names and counts
unique_words = list(set(stats_df[word_columns].values.flatten()))
words_by_artists = {}

for idx, row in stats_df.iterrows():
    for word in row[word_columns]:
        if word in words_by_artists:
            words_by_artists[word].append(row["artist"])
        else:
            words_by_artists[word] = [row["artist"]] 

In [28]:
words_by_artists_sorted_by_artist_count = [(word, words_by_artists[word]) for word in sorted(words_by_artists, key=lambda word: len(words_by_artists[word]))]

for word, artists in words_by_artists_sorted_by_artist_count:
    if len(artists) == 1:
        print(word, artists)

la ['Eminem']
head ['BROCKHAMPTON']
that's ['BROCKHAMPTON']
tell ['BROCKHAMPTON']
gon' ['Kendrick Lamar']
black ['Kendrick Lamar']
alright ['Kendrick Lamar']
i'll ['Russ']
gucci ['Gucci Mane']
money ['Gucci Mane']
bales ['Gucci Mane']
way ['Gucci Mane']
count ['J. Cole']
see ['J. Cole']
one ['Kevin Gates']
real ['Moneybagg Yo']
harley ['Lil Yachty']
name ['Lil Yachty']
culture ['Migos']
gang ['Migos']
big ['Migos']
hard ['Wiz Khalifa']
baby ['Kodak Black']
ooh ['Post Malone']
loaded ['Lil Uzi Vert']
diamonds ['Lil Uzi Vert']
girl ['Lil Uzi Vert']


In [29]:
for word, artists in words_by_artists_sorted_by_artist_count:
    if len(artists) == stats_df.shape[0]:
        print(word)

i'm
like
