# Analyze Eminem Lyrics

Hello everyone! In this notebook we will analyze natural language in a dataset about Eminem's music, with a focus on his lyrics. In my opinion it is a great dataset, because NLP is a very interesting area of data science. So, let's start.

> WARNING! There are Explicit Lyrics!

# 1) Import Libraries and Load Data

First, let's import the libraries we need. Then we load the data.

In [None]:
import os

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import seaborn as sns
# Library for creating WordCloud
from wordcloud import WordCloud

# Library for working with Text Data
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from string import punctuation

punctuation = set(punctuation)
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.manifold import TSNE

In [None]:
# Load Data
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
PATH = "/kaggle/input/eminem-lyrics/Eminem_Lyrics.csv"
data = pd.read_csv(PATH, sep='\t', comment='#', encoding = "ISO-8859-1")

# 2) A quick look at the data

Let's look at the head of our data frame, its list of columns, its size, and the NaN/null values in this dataset.

In [None]:
data.head()

In [None]:
data.info()

In [None]:
print(f"There are {data.shape[0]} rows in dataframe.")
print(f"And {data.shape[1]} columns.")

print("\n")

print(f"Columns: {data.columns}")

print("\n")

# isnull() and isna() are aliases in pandas, so both reports show the same numbers
print(f"Percentage of Null values: \n {data.isnull().sum() / data.shape[0] * 100}")
print("\n")
print(f"Percentage of NA values: \n {data.isna().sum() / data.shape[0] * 100}")

As we can see, the dataset includes 348 rows and 6 columns: Album Name, Song Name, Song Lyrics, Album URL, Song Views and Release Date. Unfortunately, there are some missing values in Views and Release Date, but they make up 5% or less of each column.
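The share of missing values per column can also be computed in one step. The sketch below uses a small made-up frame (the values are illustrative, only the column names mirror this notebook) and relies on the fact that the mean of a boolean mask is the fraction of `True` cells.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names mirror the notebook.
toy = pd.DataFrame({
    "Views": ["1.2M", None, "350K", "80K"],
    "Release_date": ["2004", "2004", np.nan, "2010"],
})

# isna() gives a boolean mask; its column-wise mean is the share of missing cells.
missing_pct = toy.isna().mean() * 100
print(missing_pct)
```

On the real data frame the same expression would be `data.isna().mean() * 100`.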

# 3) Data preprocessing.

Well, we have examined our dataset, found some problems and are ready to prepare it for our future analysis.

Data cleaning plan:

1) Dropping useless columns;

2) Working with NA and Null values;

3) Preparing columns like Song Lyrics, Song Views and Release Date.

# 3.1) Useless columns.

In the second step we saw that there is a URL column. We should drop it, because we don't need it in our analysis.

In [None]:
main_data = data.drop(["Album_URL"], axis = 1)
main_data = main_data.drop(["Unnamed: 6"], axis = 1)

In [None]:
main_data.head(1)

# 3.2) NA and Null values.

We wrote a conclusion about the dataset and its contents in step 2. There we noted that:

1) There are few missing values (5% or less);

2) The Views column has no values we can use to fill the gaps: it holds per-song view counts, so filling with the mean or a similar statistic would give inaccurate results;

3) The situation is better for the Release_date column. There we could fill NA values from the next or the previous row, because dates progress linearly.

So we could fill the missing values in Release_date, but because of the situation with the Views column we won't. Instead we drop every row that has a missing value.
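To make the trade-off concrete, here is a minimal sketch on a made-up frame (values are illustrative, not taken from the dataset). Forward-filling repairs the date column, but rows with a missing Views value stay incomplete, which is why dropping rows is the simpler choice here.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names mirror the notebook.
toy = pd.DataFrame({
    "Views": ["1.2M", np.nan, "350K", "80K"],
    "Release_date": ["2004", np.nan, "2004", np.nan],
})

# Option A: carry the previous row's date forward.
filled = toy.copy()
filled["Release_date"] = filled["Release_date"].ffill()

# Option B (the one used in this notebook): drop every incomplete row.
dropped = toy.dropna(axis="rows")

print(filled)
print(dropped)
```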

In [None]:
main_data = main_data.dropna(axis = "rows")

main_data

# 3.3) Preparing Columns.

Here we should prepare our columns for our future analysis.

First, let's work with the Views and Release_date columns.

In [None]:
# unique values in Views column
views_1 = main_data["Views"].unique()
views_1

Here we can see a lot of values with the letters K and M (thousands and millions), but we can also see a strange value, 'November 12, 2004', and values with trailing \n characters. We have to deal with them.

In [None]:
main_data[main_data["Views"] == "November 12, 2004"]

In [None]:
wrong_index = [222]

main_data = main_data.drop(wrong_index, axis = 0)

In [None]:
# function which preprocesses the Views column
def to_number(string):
  string = list(string)
  letter_ban = ["K", "M", "\n"]
  letter_K = [True for element in string if element == "K"]
  letter_M = [True for element in string if element == "M"]
  string = [element for element in string if element not in letter_ban]

  number = float("".join(string))

  if True in letter_K:
    number *= 1000
  elif True in letter_M:
    number *= 1000000

  return round(number)

# example
print(to_number('1.9M\n'))
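The same conversion can be sketched with vectorized pandas string methods instead of a per-value function. `views_to_int` is a hypothetical helper name, and the sketch assumes every value is a number optionally followed by a single K or M suffix.

```python
import pandas as pd

def views_to_int(series):
    """Convert strings such as '1.9M\\n' or '350K' to integer view counts."""
    cleaned = series.str.strip()
    # The last character decides the multiplier; plain numbers get 1.
    multiplier = cleaned.str[-1].map({"K": 1_000, "M": 1_000_000}).fillna(1)
    number = cleaned.str.rstrip("KM").astype(float)
    return (number * multiplier).round().astype(int)

print(views_to_int(pd.Series(["1.9M\n", "350K", "512"])).tolist())
```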

In [None]:
main_data["Views"] = main_data["Views"].apply(lambda num: to_number(num))

main_data

Excellent! Let's work with the Release_date column.

In [None]:
date_1 = main_data["Release_date"].unique()
date_1

There is only one incorrect value: " ". We must delete it.

In [None]:
main_data[main_data["Release_date"] == " "]

In [None]:
wrong_index = [121]

main_data = main_data.drop(wrong_index, axis = 0)

In [None]:
# function which converts a Release_date value to its year
def to_year(date):
  string = list(date)

  if len(string) == 4:
    return int(date)
  elif len(string) == 6:
    date = 1999
  else:
    string = "".join(string)
    date = int(string[-4:])
  
  return date

# example
print(to_year('July 13, 2006'))
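A regular expression is another way to pull the year out of a date string. This is only a sketch: `extract_year` is a hypothetical helper, and unlike `to_year` it returns `None` when no four-digit year is present instead of falling back to a fixed value.

```python
import re

def extract_year(date):
    """Return the first four-digit year (19xx or 20xx) found in the text, or None."""
    match = re.search(r"\b(?:19|20)\d{2}\b", str(date))
    return int(match.group()) if match else None

print(extract_year("July 13, 2006"))
print(extract_year("1999"))
print(extract_year(" "))
```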

In [None]:
main_data["Release_date"] = main_data["Release_date"].apply(lambda date: to_year(date))
main_data

Great. Now we can go to the next stage. Let's prepare the lyrics for future analysis.

In [None]:
main_data["Lyrics"][2]

Here we can see that there are many things like "\n" or [Outro], [Verse 1], etc. We must deal with them.

First, filter out section labels such as [Intro] or [Verse 2].

In [None]:
lemmatizer = WordNetLemmatizer()
word_tokenizer = word_tokenize

In [None]:
# function which filters things like [Outro], [Verse 1], [Chorus], etc. out of texts
def intro_words_filter(text):
  
  prepared_text = re.sub(r'\[([^]]*)]', '', text)
  prepared_text = prepared_text.replace("  ", " ")

  return prepared_text

# example
intro_words_filter("[Verse 1] Before  I check  the mic (Check, check, one, two) [Verse 2]")
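A single `replace("  ", " ")` pass can leave leading, trailing or tripled spaces behind. As a possible variant, the sketch below (with the hypothetical name `strip_section_tags`) collapses every run of whitespace, including newlines, after removing the bracketed labels.

```python
import re

def strip_section_tags(text):
    """Remove [bracketed] section labels and normalize whitespace."""
    without_tags = re.sub(r"\[[^\]]*\]", " ", text)
    # Collapse any run of spaces, tabs or newlines into one space.
    return re.sub(r"\s+", " ", without_tags).strip()

print(strip_section_tags("[Verse 1] Before  I check  the mic (Check, check, one, two) [Verse 2]"))
```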

In [None]:
main_data["Prepared_Lyrics"] = main_data["Lyrics"].apply(lambda text: intro_words_filter(text))

main_data

The next step is filtering punctuation and stop words.

In [None]:
def filtring_punct(text):
  elements = [element if element not in punctuation else '' for element in text]
  return ''.join(elements)

stop_words_to_filter = stopwords.words('english')
def filter_stop_words(text, stop_words_to_filter):
  filtered_text = [elem for elem in text if elem not in stop_words_to_filter]
  return filtered_text
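One thing to keep in mind: NLTK's English stop words are lowercase, so a case-sensitive comparison lets tokens such as "The" or "And" through. The sketch below shows a case-insensitive variant. `DEMO_STOP_WORDS` is a tiny hand-written stand-in for the NLTK list, and `str.split` stands in for `word_tokenize`, so that the example runs without downloads.

```python
from string import punctuation

# Hypothetical mini stop-word list, standing in for stopwords.words('english').
DEMO_STOP_WORDS = {"the", "and", "i", "a", "to"}

def remove_punctuation(text):
    """Delete every ASCII punctuation character from the text."""
    return text.translate(str.maketrans("", "", punctuation))

def drop_stop_words(tokens, stop_words):
    """Keep tokens whose lowercase form is not a stop word."""
    return [token for token in tokens if token.lower() not in stop_words]

tokens = remove_punctuation("The beat, and I go to a show!").split()
print(drop_stop_words(tokens, DEMO_STOP_WORDS))
```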

In [None]:
main_data["Prepared_Lyrics"] = main_data["Prepared_Lyrics"].apply(lambda text: filtring_punct(text))
main_data["Prepared_Lyrics"] = main_data["Prepared_Lyrics"].apply(word_tokenizer)
main_data["Prepared_Lyrics"] = main_data["Prepared_Lyrics"].apply(lambda text: filter_stop_words(text, stop_words_to_filter))

main_data

The third step is lemmatizing the texts.

In [None]:
def lemmatize_text(text, lemmatizer):
  return [lemmatizer.lemmatize(element) for element in text]

In [None]:
main_data["Prepared_Lyrics"] = main_data["Prepared_Lyrics"].apply(lambda text: lemmatize_text(text, lemmatizer))

main_data

In [None]:
# also, after lemmatizing we have to filter stop words again for a better result
main_data["Prepared_Lyrics"] = main_data["Prepared_Lyrics"].apply(lambda text: filter_stop_words(text, stop_words_to_filter))

Result:

In [None]:
main_data

# 4) Analyzing.
Here we will analyze features of this dataset.

# 4.1) Views, Albums Names and Release Dates.
Overview visualization of all numeric values in the data:

In [None]:
sns.pairplot(main_data)

In [None]:
songs_views_all = main_data["Views"]

# matplotlib settings
plt.figure(figsize=(12, 10))
plt.grid(True)

plt.plot(songs_views_all)
plt.xlabel("Song index")
plt.ylabel("Views")
plt.title("Views line plot")

In [None]:
plt.figure(figsize=(10, 8))
plt.grid(True)

sns.boxplot(data = songs_views_all, linewidth = 2.5, width = 0.5, orient = "vertical")
plt.xlabel("Boxplot")
plt.ylabel("Views")
plt.title("Views Boxplot")

Here we can see graphs of the views for all songs in the dataset.

In [None]:
# the most popular songs
popular_songs = main_data.sort_values(by = "Views", ascending = False)

In [None]:
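# take names and views from the same rows so each bar matches its song
top_songs = popular_songs.drop_duplicates(subset = "Song_Name").head(5)
songs_titles = top_songs["Song_Name"]
songs_views = top_songs["Views"]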

# matplotlib settings
fig = plt.figure(figsize = (8, 5))
ax = fig.add_subplot(111)

ax.grid(True)

ax.yaxis.set_major_formatter(mtick.FormatStrFormatter('%.2e'))

ax.bar(songs_titles, songs_views)
plt.xlabel("Songs Names")
plt.ylabel("Views")
plt.title("The most popular Eminem Songs")

Here we can see that the most popular Eminem songs in this data frame are Rap God, Killshot, Godzilla, Lose Yourself and The Monster.
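For a top-N selection like this, `DataFrame.nlargest` keeps each song name and its view count on the same row. A minimal sketch with made-up songs and numbers:

```python
import pandas as pd

# Made-up songs and view counts, for illustration only.
toy = pd.DataFrame({
    "Song_Name": ["song_a", "song_b", "song_c", "song_d"],
    "Views": [500, 12000, 700, 9000],
})

# nlargest returns the rows with the highest Views, already sorted.
top = toy.nlargest(2, "Views")
print(top)
```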

In [None]:
songs_date = np.sort(main_data["Release_date"].unique())[::-1]

date_views = []
for year in songs_date:
  date_views.append(main_data["Views"][(main_data["Release_date"] == year)].sum())

# matplotlib settings
fig = plt.figure(figsize = (10, 6))
ax = fig.add_subplot(111)

ax.grid(True)

#ax.yaxis.set_major_formatter(mtick.FormatStrFormatter('%.2e'))
ax.xaxis.set_major_locator(mtick.MultipleLocator(2))

ax.plot(songs_date, date_views, marker = "o", linewidth = 3)
plt.xlabel("Songs Release Dates")
plt.ylabel("Views by year")
plt.title("Total song views by release year")
style = dict(facecolor = "black", arrowstyle = "-")
# the label is passed positionally: newer matplotlib renamed the `s` keyword of annotate to `text`
ax.annotate("The Marshall Mathers LP2 album", xy = (2013, date_views[4]), xytext = (2008, date_views[4]),
            ha = "right", va = "center", arrowprops = style)
ax.text(x = 2018.5, y = date_views[2], s = "Kamikaze")

Here we can see total song views by release year. The peaks are 2013 (The Marshall Mathers LP2 album) and 2018 (Kamikaze).
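The per-year totals can also be obtained with a `groupby`, and the peak year found with `idxmax` rather than a hard-coded position. The numbers below are made up for illustration.

```python
import pandas as pd

# Made-up release years and view counts.
toy = pd.DataFrame({
    "Release_date": [2013, 2013, 2018, 2020],
    "Views": [100, 50, 120, 30],
})

# Sum the views of all songs released in the same year.
views_by_year = toy.groupby("Release_date")["Views"].sum()
peak_year = views_by_year.idxmax()

print(views_by_year)
print(peak_year)
```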

In [None]:
def split_list(alist, wanted_parts=1):
    length = len(alist)
    return [alist[i*length // wanted_parts: (i+1)*length // wanted_parts] for i in range(wanted_parts)]

album_titles = main_data["Album_Name"].unique()
album_views = []

for album in album_titles:
  album_views.append(main_data["Views"][main_data["Album_Name"] == album].sum())

album_titles = split_list(album_titles, 6)
album_titles[0][0] = "Music To Be Murdered By"
album_views = split_list(album_views, 6)

# matplotlib settings
plt.figure(figsize=(32,32))

for i in range(6):
  plt.subplot(2, 3, i + 1)
  plt.title("Albums Views")
  plt.ylabel(f"Views {i}")
  plt.xlabel(f"Album Names {i}")
  plt.grid(True)
  plt.bar(album_titles[i], album_views[i])

In these graphs we can see the most popular albums. Let's analyze the top 5 of them. P.S. We don't look at Killshot and Curtain: The Hits, because the Killshot album includes only one song, and Curtain: The Hits collects Eminem's hit songs rather than unique ones.
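One way to spot single-song albums before ranking is to aggregate both the total views and the number of songs per album. This is a sketch on made-up albums, not the notebook's actual numbers.

```python
import pandas as pd

# Made-up albums and view counts.
toy = pd.DataFrame({
    "Album_Name": ["album_a", "album_a", "album_b", "album_c", "album_c", "album_c"],
    "Views": [10, 20, 500, 5, 5, 5],
})

# Named aggregation: one column for total views, one for the song count.
summary = toy.groupby("Album_Name")["Views"].agg(total="sum", songs="count")

# Keep only albums with more than one song, ranked by total views.
multi_song = summary[summary["songs"] > 1].sort_values("total", ascending=False)
print(multi_song)
```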

In [None]:
TMMLP_album = main_data[main_data["Album_Name"] == "The Marshall Mathers LP"]["Views"]
TMMLP2_album = main_data[main_data["Album_Name"] == "The Marshall Mathers LP2"]["Views"]
Kamikaze_album = main_data[main_data["Album_Name"] == "Kamikaze"]["Views"]
singles_album = main_data[main_data["Album_Name"] == "The Singles"]["Views"]
MTMB_album = main_data[main_data["Album_Name"] == "Music To Be Murdered By: Side B"]["Views"]

top_albums_df = pd.DataFrame({"The Marshall Mathers LP" : TMMLP_album,
                              "The Singles" : singles_album,
                              "The Marshall Mathers LP2" : TMMLP2_album,
                              "Kamikaze" : Kamikaze_album,
                              "Music To Be Murdered By: Side B" : MTMB_album})

# matplotlib settings
fig = plt.figure(figsize=(10, 6))
ax = fig.add_subplot(111)
ax.grid(True)

# views are on the x-axis because the boxes are horizontal
ax.xaxis.set_major_formatter(mtick.FormatStrFormatter('%.2e'))

sns.boxplot(data = top_albums_df, linewidth = 2.5, width = 0.5, orient = "h", ax = ax)
plt.title("Song Views in different Albums")
plt.xlabel("Views")
plt.ylabel("Album Names")

Here we can see that the median song views (the line inside each box) in the Kamikaze album are higher than in The Marshall Mathers LP2 and The Marshall Mathers LP albums, but there is an outlier: the song "Rap God" has more than 15,000,000 views.
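By default, box plots draw the whiskers at 1.5 times the interquartile range and mark anything beyond them as an outlier. The same rule can be sketched directly with NumPy. `iqr_outliers` is a hypothetical helper and the sample values are made up.

```python
import numpy as np

def iqr_outliers(values):
    """Return the values lying more than 1.5 * IQR outside the quartiles."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return values[(values < lower) | (values > upper)]

print(iqr_outliers([1, 2, 3, 4, 100]))
```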

# 4.2) Lyrics
Here we will work with Count Vectorizer. After that we will visualize results.

# 4.2.1) Preparing Count Vectorizer and data for it.

In [None]:
# function for converting lists of texts to strings
def to_string(text):
  words = [element for element in text]

  return ' '.join(words)

# example
to_string(["A", "B", "C"])

In [None]:
# copy, so that the new column is not silently added to main_data as well
data_modelling = main_data.copy()
data_modelling["Lyrics_Modelling"] = data_modelling["Prepared_Lyrics"].apply(lambda text: to_string(text))

data_modelling

In [None]:
# function for creating dictionary with counts of words in texts
def word_count(text):
  main_dict = {}

  for sentence in text:  
        for word in word_tokenizer(sentence):
            if word not in main_dict:
                main_dict[word] = 0 
            main_dict[word] += 1
        
  return {k:v for k,v in sorted(main_dict.items(), key=lambda kv: kv[1], reverse=True)}

texts = data_modelling["Lyrics_Modelling"]
word_counts = word_count(texts)
word_counts
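The same frequency table can be sketched with `collections.Counter`. This version splits on whitespace instead of calling `word_tokenize`, which is an assumption that only holds for texts already stripped of punctuation. `count_words` is a hypothetical name.

```python
from collections import Counter

def count_words(texts):
    """Count whitespace-separated words across all texts, most frequent first."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts.most_common()

print(count_words(["rap god rap", "rap mic god"]))
```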

In [None]:
ban_words = ["Im", "I", "u", "Its", "wan", "The", "got", "But", "get", "And", "dont",
             "so", "If", "My", "Me", "So", "Me", "me", "em", "youre", "aint", "na",
             "You", "you", "gon", "cant", "We", "thats", "To", "This", "Ima", "Id",
             "Ive", "ya", "Youre", "That", "Ta", "It", "A", "\x91Cause", "In", "Then",
             "I\x92m", "yall", "Or", "Why", "it\x92s", "Ill"]

filtered_count = {k:v for k, v in word_counts.items() if k not in ban_words}
filtered_count = {k:v for k, v in filtered_count.items() if k.isalpha()}

min_frequency = 5

filtered_count = {k:v for k, v in filtered_count.items() if v > min_frequency}

filtered_count

Creating Count Vectorizer.

In [None]:
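main_dict = list(filtered_count.keys())

# lowercase = False keeps capitalized vocabulary words matchable (the default lowercases the texts first)
count_vectorizer = CountVectorizer(vocabulary = main_dict, lowercase = False)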

count_vectorizer.fit(texts)

term_matrix = count_vectorizer.transform(texts)
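A small sketch of how a fixed vocabulary interacts with the `lowercase` option of `CountVectorizer`. The documents and vocabulary are made up. With the default `lowercase=True` the texts are lowercased before lookup, so a capitalized vocabulary entry is never matched.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents and vocabulary, for illustration only.
docs = ["Shady rap rap", "mic check Shady"]
vocab = ["Shady", "rap", "mic"]

# Default settings: texts are lowercased, so "Shady" in the vocabulary never matches.
default_counts = CountVectorizer(vocabulary=vocab).fit_transform(docs).toarray()

# Case-preserving settings: every vocabulary entry can be counted.
cased_counts = CountVectorizer(vocabulary=vocab, lowercase=False).fit_transform(docs).toarray()

print(default_counts)
print(cased_counts)
```

Each row is a document and each column follows the order of the supplied vocabulary.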

Creating dictionary for visualization

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
terms = count_vectorizer.get_feature_names_out()
count_terms = term_matrix.toarray().sum(axis=0)

In [None]:
dictionary = dict(zip(terms, count_terms))
dictionary

dictionary = pd.Series(dictionary) 
dictionary = dictionary.sort_values(ascending=False) 

Word Cloud visualization

In [None]:
names = dictionary.index
values = dictionary.values

# creating word cloud graph
def plot_word_cloud(word_list):
    
    wordcloud = WordCloud(background_color="white", max_words=1000, width=900, height=900, collocations=False)
    wordcloud = wordcloud.generate_from_frequencies(word_list)
    plt.figure(figsize=(12, 8))
    plt.title("The most widespread words in Eminem Lyrics")
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.show() 

    
plot_word_cloud(dictionary)

# 5) Conclusion.
We've analyzed data about Eminem's song lyrics and views, found out some interesting things and practiced NLP analysis.

Thank you to everyone who checks this notebook. If you like it, please upvote it, and if you dislike it, please write a comment: it will help me improve my skills. Good luck!