# Topic analysis based on Genius song lyrics

Author : lievre.thomas@gmail.com

---

In this notebook, we will explore the Genius dataset extracted from [Kaggle](https://www.kaggle.com/datasets/carlosgdcj/genius-song-lyrics-with-language-information).

**The aim of this analysis is to extract topics from song lyrics and to identify the main topics by year or decade.**

This notebook was carried out as a class project for the [text mining course (TDDE16)](https://www.ida.liu.se/~TDDE16/project.en.shtml) at Linköpings universitet.


## Some information about the Genius website

Genius is an American digital company founded on August 27, 2009, by Tom Lehman, Ilan Zechory, and Mahbod Moghadam. Originally launched as Rap Genius with a focus on hip-hop music, it was initially a crowdsourced website where people could fill in the lyrics of rap songs and give an interpretation of them. Over the years the site has grown to contain several million annotated texts from all eras (from the [Wikipedia Genius page](https://en.wikipedia.org/wiki/Genius_(company))).


## Load the data in memory

All the data is contained in one large 9 GB CSV file (around 5 million rows), which can be difficult to load into memory at once. To deal with this issue, I wrote a loading class that reads the CSV in chunks and splits it into 6 pickle files, in order to store the data more compactly and speed up loading. The pickles are then drawn at random so that the loaded sample is more representative of the whole dataset. For now we assume that the pickle batches are identically distributed (we explore the batches in the second part). The class below handles the whole process.

In [2]:
import pandas as pd
from random import seed, sample
import pickle
import glob
import os

class Loader():

    def __init__(self, in_path, out_path, chunksize = 10**6):
        """
        Args:
            in_path (str): csv input path.
            out_path (str): Output directory path to store the pickles.
            chunksize (int, optional): Chunksize for DataFrame reader. Defaults to 10**6.
        """

        self.__in_path = in_path
        self.__out_path = out_path
        self.__chunksize = chunksize

    def __produce_pickles(self):
        """produce pickles by reading csv by chunksize
        """
        with pd.read_csv(self.__in_path, chunksize = self.__chunksize) as reader:
            try:
                os.makedirs(self.__out_path)
            except FileExistsError:
                # directory already exists
                pass
            for i, chunk in enumerate(reader):
                out_file = self.__out_path + "/data_{}.pkl".format(i+1)
                with open(out_file, "wb") as f:
                    pickle.dump(chunk, f, pickle.HIGHEST_PROTOCOL)
    
    def load_pickle(self, pickle_id):
        """load a pickle file by id

        Args:
            pickle_id (int): pickle id.

        Raises:
            Exception: The path of the given id isn't a file

        Returns:
            obj: DataFrame
        """
        # produce the pickles if the directory not exists or
        # if the directory is empty 
        if (not os.path.exists(self.__out_path)) or \
              (len(os.listdir(self.__out_path)) == 0):
            self.__produce_pickles()
        
        # get the file path following the pickle_id
        # given in parameter
        file_path = self.__out_path + \
            "/data_" + str(pickle_id) + ".pkl"

        if os.path.isfile(file_path):
            df = pd.read_pickle(file_path)
        else:
            raise Exception("The pickle file data_{}.pkl doesn't exist".format(pickle_id))
        return df
        

    def random_pickles(self, n_pickles = 3, init = 42, verbose = True):
        """random reader over pickles files

        Args:
            n_pickles (int, optional): number of pickles to load. Defaults to 3.
            init (int, optional): Integer given to the random seed. Defaults to 42.
            verbose (bool, optional): Print the loaded files. Defaults to True

        Raises:
            Exception: Stop the process if n_pickles exceed pickle files number.

        Returns:
            obj: pd.Dataframe
        """

        # produce the pickles if the directory not exists or
        # if the directory is empty 
        if (not os.path.exists(self.__out_path)) or \
              (len(os.listdir(self.__out_path)) == 0):
            self.__produce_pickles()

        pickle_files = [name for name in
                        glob.glob(self.__out_path + "/data_*.pkl")]
        # draw p_files        
        seed(init)

        if n_pickles <= len(pickle_files):
            random_p_files = sample(pickle_files, n_pickles)
        else:
            raise Exception("The parameter n_pickles (" +
                            "{}) exceeds the number of pickle files ({})"\
                                .format(n_pickles, len(pickle_files)))
        # print the drawn files
        if verbose:
            print("Loaded pickles:")
            for p in random_p_files:
                print(p)

        # load random pickles file
        df_list = [pd.read_pickle(p) for p in random_p_files]

        # create the dataframe by concatenate the previous
        # dataframes list
        df = pd.concat(df_list, ignore_index = True)
        return df

In [3]:
# create reader
#  /!\ change path in kaggle
kaggle_input = "/kaggle/input/genius-song-lyrics-with-language-information"
kaggle_output = "/kaggle/working/data/"

# initiate the file loader
loader = Loader(in_path = kaggle_input, out_path = kaggle_output)

# load random pickle files
df = loader.random_pickles(n_pickles = 1)

Loaded pickles:
/kaggle/working/data/data_4.pkl


Batches of data are randomly loaded into memory. The number of batches loaded depends on the memory capacity of the machine running the script. For the analysis, we will only work on the randomly loaded sample (the full dataset is available on Kaggle).
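
To decide how many batches fit in memory, one option is to measure the footprint of a loaded DataFrame. The snippet below is a minimal sketch on a made-up toy frame; the helper name `memory_footprint_mb` is hypothetical and not part of the notebook.

```python
import pandas as pd

def memory_footprint_mb(frame: pd.DataFrame) -> float:
    """Return the in-memory size of a DataFrame in megabytes."""
    # deep=True also counts the Python strings stored in object columns
    return frame.memory_usage(deep=True).sum() / 1024**2

# Tiny illustrative frame (made-up rows, not taken from the dataset)
toy = pd.DataFrame({
    "title": ["song a", "song b"],
    "lyrics": ["la la la", "na na na"],
    "year": [1999, 2005],
})
toy_mb = memory_footprint_mb(toy)
print(f"toy frame: {toy_mb:.6f} MB")
```

Calling the same helper on one loaded batch would give a rough idea of how many batches the available memory can hold.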

## Exploring the raw data

Let's visualize and explore the raw data before a deeper analysis.

In [4]:
df.head()

Unnamed: 0,title,tag,artist,year,views,features,lyrics,id,language_cld3,language_ft,language
0,Catching Vibes,rap,Yung Divide,2019,4090,{},[Intro]\nCM iced out\n\n[Chorus]\nMomma told m...,4531581,en,en,en
1,ZAO OYBOIGLEBASTY REMIX,rap,Jacketkkid,2019,17,{OYBOIGLEBASTY},"[Текст песни ""ZAO""]\n\n*OYBOIGLEBASTY*\n\n[При...",4531582,ru,ru,ru
2,Jesus Etc. with Andrew Bird,rock,Wilco,2014,70,"{""Andrew Bird""}","Jesus, don't cry\nYou can rely on me, honey\nY...",4531612,en,en,en
3,Alone,rap,Cashma,2019,43,{},I just doubled back\nYeah I had a couple shows...,4531583,en,en,en
4,I Could Go on Singing,pop,Harold Arlen,2019,12,{},(Verse)\nWhen a dove is in love\nWith a doll o...,4531584,en,en,en


For each song, we have the following information:
- the title of the song
- the tag (music genre)
- the name of the artist
- the release year
- the number of page views
- the names of the featured artists
- the lyrics
- the Genius identifier
- the lyrics language according to [CLD3](https://github.com/google/cld3). Unreliable results are NaN. CLD3 is a neural network model for language identification.
- the lyrics language according to [FastText's langid](https://fasttext.cc/docs/en/language-identification.html). Values with low confidence (<0.5) are NaN. FastText is a library developed by Facebook's AI Research lab for efficient learning of word representations and sentence classification. It also provides a fast and accurate tool for text-based language identification, capable of recognizing more than 170 languages.
- a combined language column built from language_cld3 and language_ft. It only has a non-NaN entry if they both "agree".

More information is available at this link: https://www.kaggle.com/datasets/carlosgdcj/genius-song-lyrics-with-language-information
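
To make the "agreement" rule of the combined column concrete, here is a minimal sketch. It assumes the rule is plain equality of the two detector outputs; the function name `combine_language` and the toy values are illustrative and do not come from the dataset's own build script.

```python
import numpy as np
import pandas as pd

def combine_language(cld3: pd.Series, ft: pd.Series) -> pd.Series:
    """Keep a language code only where both detectors return the same value."""
    agree = cld3.notna() & ft.notna() & (cld3 == ft)
    # rows where the detectors disagree (or one is missing) become NaN
    return cld3.where(agree, other=np.nan)

# Toy example: two agreements, one disagreement, one missing value
toy = pd.DataFrame({
    "language_cld3": ["en", "ru", "en", np.nan],
    "language_ft":   ["en", "ru", "fr", "en"],
})
toy["language"] = combine_language(toy["language_cld3"], toy["language_ft"])
print(toy)
```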

In [5]:
df.dtypes

title            object
tag              object
artist           object
year              int64
views             int64
features         object
lyrics           object
id                int64
language_cld3    object
language_ft      object
language         object
dtype: object

In [6]:
# display the size
print('Data frame size (row x columns):', df.size)
print('Data rows number: ', len(df))
print('Number of unique songs (following genius id): ', len(df.id.unique()))

Data frame size (row x columns): 11000000
Data rows number:  1000000
Number of unique songs (following genius id):  1000000


The Genius id appears to be a unique row identifier.

Let's visualize the size of the raw data over the years before preprocessing, in order to compare the batch distributions. One thing to know before visualizing the data: the pickles are created by reading the CSV in chunks.

The table displayed above gives us some information about the data. The CSV file seems to be sorted by id, so the pickle files are sorted too.
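
The ordering assumption can be checked rather than read off the table. The sketch below compares the id ranges of consecutive batches; the helper `id_ranges_overlap` and the toy batches are hypothetical, and rows inside a batch do not need to be strictly ordered for the check to pass.

```python
import pandas as pd

def id_ranges_overlap(batches) -> bool:
    """Return True if the [min, max] id ranges of consecutive batches overlap."""
    ranges = [(b["id"].min(), b["id"].max()) for b in batches]
    return any(prev_max >= next_min
               for (_, prev_max), (next_min, _) in zip(ranges, ranges[1:]))

# Toy batches: ids are shuffled inside each batch but the ranges are disjoint
batch_a = pd.DataFrame({"id": [1, 3, 2]})
batch_b = pd.DataFrame({"id": [4, 6, 5]})
print(id_ranges_overlap([batch_a, batch_b]))
```

With the real data, the list could be built by calling `loader.load_pickle(i)` for each batch id, at the cost of loading every batch once.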

In [7]:
import plotly.express as px
import plotly.io as pio

# change theme template for every graph below
pio.templates.default = "plotly_white"


# get some information about the pickle data
def pickle_informations(data = loader):
    rows = []
    for i in range(1, len(os.listdir(kaggle_output)) + 1):
        df = data.load_pickle(i)
        rows.append(len(df))
        del df
    return rows

# get the rows
rows = pickle_informations()

# create the dataframe
df_data = pd.DataFrame(
    {'batch' : ['data ' + str(i) for i in range(1,len(rows) + 1)],
    'rows' : rows})

fig = px.bar(df_data, x="batch", y="rows")
fig.show()

All batches seem to have the same number of rows except the last one. This is consistent, because the batches are created iteratively from chunks of 10**6 rows read from the CSV; the last batch holds the remainder.
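
The batch sizes follow directly from an integer division of the row count by the chunk size. Here is a small sketch of that arithmetic; the row count of 5,100,000 is a made-up figure used only for illustration, since the exact size of the CSV is not stated here.

```python
def batch_sizes(n_rows: int, chunksize: int) -> list:
    """Sizes of the chunks produced when reading n_rows by blocks of chunksize."""
    full, rest = divmod(n_rows, chunksize)
    # `full` complete chunks, plus one smaller chunk if rows are left over
    return [chunksize] * full + ([rest] if rest else [])

# Hypothetical row count, chosen to give 5 full batches and a remainder
sizes = batch_sizes(5_100_000, 10**6)
print(len(sizes), sizes[-1])
```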

In [8]:
import plotly.graph_objects as go

def add_bar(i, y1, y2, color, data = loader):
    df = data.load_pickle(i)
    df = df[(df.year >= y1) & (df.year <= y2)]
    df_year = df.groupby(['year']).size().reset_index(name='count')
    new_bar = go.Bar(
                x = df_year.year.values,
                y = df_year['count'].values,
                name = 'data_'+ str(i),
                marker = {'color' : color})
    new_trend = go.Scatter(
                x = df_year.year.values,
                y = df_year['count'].values,
                mode="lines",
                line={'color' : color,
                    'width' : 0.5},
                showlegend=False)
    del df_year, df
    return new_bar, new_trend


def multi_barplot(year1, year2, colors):    
    # create an empty plotly.Figure object
    fig = go.Figure() 
    # compute the batch number
    n_batch = len(os.listdir(kaggle_output))
    # check that the color list given as argument
    # has at least one color per batch
    if n_batch > len(colors):
        raise Exception(
            "The colors list size ({}) doesn't ".format(len(colors)) +
            "fit with the number of data batches ({})".format(n_batch))
    for i in range(1, n_batch + 1):
        fig.add_traces((add_bar(i, year1, year2, colors[i-1])))
    fig.update_layout(
        title = "Data distribution over years ({} - {})"
            .format(year1, year2),
        xaxis_title="years",
        yaxis_title="number of titles",
        legend_title="Data batch")
    return fig


In [9]:
import plotly.colors as col

# create the color list
colors = col.qualitative.Plotly

# 1960 - 1989
fig1 = multi_barplot(1960, 1989, colors)
fig1.show()
# 1990 - 2023
fig2 = multi_barplot(1990, 2023, colors)
fig2.show()

The first bar chart (1960 - 1989) shows that the amount of data increases over the years. Moreover, the batches seem to have quite similar distributions over the years. The data_1 and data_2 batches contain noticeably more songs than the 4 others. The data_6 batch contains fewer songs than the others because of its small number of rows.
The data behaves similarly until 2012, as we can see on the second chart (1990 - 2023). After that year, the amount of data retrieved increases sharply: every batch grows by at least 100%, and some grow by up to 50 times.

## Data pre-processing

The aim of this part is to preprocess the data so that it is suitable for the analysis. We will look at the language, the missing values, and the year variable.

We will focus on English songs, to simplify the analysis and the work of the natural language processing algorithms.

In [10]:
# Retrieve only the texts identified as English language by both cld3 and fasttext langid
df = df[df.language == 'en']

Next, it is worth checking for NaN values.

In [11]:
# find which column contain nan value
df.columns[df.isna().any()].tolist()

['title']

In [12]:
# get all rows that contain NaN values
df_nan = df[df.isna().any(axis=1)]
df_nan

Unnamed: 0,title,tag,artist,year,views,features,lyrics,id,language_cld3,language_ft,language
6639,,rap,Josey Joe,2019,53,{},Got me stationed with my arm out extended ain’...,4541679,en,en,en
24244,,rb,"""theghst""",2018,6,"{""\\\""theghøst\\\"""",Fakefoes}",Is this what you want?\n\nI need to know\n\nLo...,4570282,en,en,en
52464,,pop,oribloom,2018,136,{​oribloom},[Verse 1]\nI can never seem to find myself any...,4613687,en,en,en
55439,,rock,A Perfect Kiss,2004,20,{},"When home returns to a shattered man, when eve...",4617874,en,en,en
66763,,rap,7ordyn,2019,13,{},[Verse 1]\nThey don’t like me it’s keeps me la...,4634322,en,en,en
111121,,rap,Afourteen,2019,4454,{},[Chorus]\nI'm pissed off so I'm making sounds ...,4698233,en,en,en
112870,,rock,Josh Gauton,2019,81,{},Let the waves crash over me\nMy sin [?] by lov...,4701034,en,en,en
113703,,rap,TheRubin,2019,8,{},Yeah\nYou know my thoughts are not applicable\...,4702230,en,en,en
127654,,pop,Trippie Redd,2019,96599,{},"[Intro]\nExclamation mark, exclamation mark, e...",4724410,en,en,en
175685,,rap,$tyx (artist),2019,24,"{""\\$tyx (artist)""}",[Intro]\n$tyx\n\n[Verse: 1]\n\nHoes on my moth...,4802652,en,en,en


In [13]:
print('Number of untitled song:', len(df[df.isna().any(axis=1)]))

Number of untitled song: 29


The title of a song is not used to train the topic modeling algorithms, although titles could be related to the topics in a later phase of the analysis. Given the low number of untitled songs, I decided to delete these rows for the moment.

In [14]:
# Delete rows containing NaN values
df = df.dropna()
len(df)

604878

Next, we also check for None values.

In [15]:
df[df.isnull().any(axis=1)]

Unnamed: 0,title,tag,artist,year,views,features,lyrics,id,language_cld3,language_ft,language


There are no None values in this dataframe.
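
As a side note, pandas treats `None` and `NaN` as the same kind of missing value, and `isnull` is an alias of `isna`. The toy frame below (made-up values) illustrates this, which suggests the earlier `dropna()` call already removes `None` entries as well.

```python
import numpy as np
import pandas as pd

# Toy frame with one None and one NaN in the same object column
toy = pd.DataFrame({"title": ["a", None, np.nan], "year": [2019, 2018, 2004]})

# isna() and isnull() return the same boolean mask
same_mask = toy.isna().equals(toy.isnull())
# both None and NaN are counted as missing
n_missing = int(toy["title"].isna().sum())
print(same_mask, n_missing)
```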

Afterwards, let's look at the year variable, which is one of the important variables to take into account in our analysis because we want to extract the topics by decades.

In [16]:
years = df.year.unique()
print(years)

print('Number of unique years: ',len(years))

[2019 2014 2007 2015 1999 1919 2013 2018 2012 2002 1980 1985 2017 2009
 2010 1981 1983 1974 2008 1929 2016 1969 1954 1987 2020 1986 1941 1942
 1965 1973 2003 1976 2006 2005 1998 1977 2004 1972 1979 1988 2011 1996
 1869 2000 1978 1698 1992 1971 1967 2021 1926 1958 1997 1994 1982 1963
    1 1975 1964 1990 1995 1917 1984 1991 2001 1966 1989 1993 1779 2022
 1923 1962 1968 1950 1955 1960 1865 1922 1970 1921 1895 1959 1924 1936
 1928 1935 1932 1930 1912 1927 1934 1931 1089 1925 1887 1937 1952 1918
 1893 1961 1944 1914 1938  666 1953 1956 2024 1933 1939 1907 1946  219
  190 1957 1951 1949  527  291 1940 2023    2 1889 1845   19 1592  211
 1860 1861 1477 1863   18 1904 1898 1943 1725 1902 1841 1948   16   17
 1818 1901 1069   10 1850  100 1386 1223 1945   22 1890   13 1894    7
 1906    6 1578 1873   69 1947  299 1404 1851 1600 1897 1648 1668 1709
 1892 1881 1916    5  301   20 1744 1880 1839 1689 1776 1849  193    4
    3  293 1642 1673   15 1899 1885 1915 1805 1864  420 1854 1913  799
 1475 

We first want to know whether the format of the year variable is suitable. It is quite likely that years are sometimes truncated (for example, 92 instead of 1992).
Let's display the tag distribution for songs with a release year below 215.
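
Before looking at specific rows, implausible years can be isolated with a simple range test. The sketch below is only illustrative: the bounds 1900 and 2024 and the helper name `implausible_years` are assumptions, not values fixed by the analysis.

```python
import pandas as pd

def implausible_years(years: pd.Series, first: int = 1900, last: int = 2024) -> pd.Series:
    """Return the years that fall outside the [first, last] range."""
    # between() is inclusive on both ends by default
    return years[~years.between(first, last)]

# A few values similar to those printed above
toy_years = pd.Series([2019, 1, 17, 1992, 666, 2021])
suspicious = implausible_years(toy_years)
print(suspicious.tolist())
```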

In [17]:
df_tag = df[df['year'] < 215].groupby(['tag']).size().reset_index(name='count')

fig = px.pie(df_tag, names="tag", values="count", title = "Outlier tag distribution")
fig.show()

It is rather surprising that the majority genre for this period (< 215) is rap, given that rap is a recent genre. Clearly, a large part of these rows has an outlier year.

In [18]:
# Extract the pieces of music of type 'rap' lower than the year 215
df_rap = df[(df['year'] < 215) & (df['tag'] == 'rap')]
df_rap.sort_values(by='views',ascending=False).head(20)

Unnamed: 0,title,tag,artist,year,views,features,lyrics,id,language_cld3,language_ft,language
796709,SICK LIFE,rap,Ericko Lim,1,5578,{},[Verse 1]\nIm alone\nSitting in my room thinki...,5738968,en,en,en
514790,Its Up There,rap,Jaiswan,1,5428,{},[Intro]\nAll I want to do is Jugg forever\n(Ay...,5313936,en,en,en
755601,Wii Boxing,rap,Rob Gorr,1,678,"{""Poupm\\'goo""}","[Intro]\nMmm, oh yeah (ooooooh shit)\nWii Boxi...",5675230,en,en,en
508401,517,rap,BONES,1,506,{},[Verse]\nHowell's favorite asshole\nLive from ...,5304604,en,en,en
434056,Heaven Is a Journey,rap,Masta Ace & Just Say PLZ,17,494,"{""Chrissy Hoskins""}",(Verse 1)\nScrolling back through twenty pages...,5193155,en,en,en
622689,Icon Swae Lee Remix Unreleased,rap,Jaden,3,487,"{""Swae Lee""}","[Intro: Jaden]\n(Woo)\nWoah (Woo)\nYo, woah\n\...",5475753,en,en,en
140123,Zuology,rap,Metro Zu,1,440,{},[Verse 1: Mr. B the Poshtronaut]\nMr. B pullin...,4744891,en,en,en
601819,Swaggy B,rap,Fonzy,10,416,{},Get that Peso\nHunnid bands\nIn my bank\nDas m...,5444680,en,en,en
692541,The Sauce,rap,Eboys,1,405,"{WillNE,ImAllexx,Memeulous,""James Marriott""}",[Verse 1 - James Marriott]\nBig back Gally bla...,5579395,en,en,en
138112,Softwork Hardhead,rap,Metro Zu,1,365,{},"[Verse 1: Mr. B the Poshtronaut]\nYeah, it's M...",4741841,en,en,en


In [19]:
df_rap[df_rap['artist'] == 'Kanye East']

Unnamed: 0,title,tag,artist,year,views,features,lyrics,id,language_cld3,language_ft,language


If we search for the release date of this track on Google, we find a release date of 4 May 2021 on the [Genius website](https://genius.com/Kanye-east-the-secrets-of-dababy-lyrics). Comparing the year in our table with the real one, we can assume there is an issue with the date format (1 instead of 2021).

After some research on the Genius website, the most viewed songs in the list displayed above seem to have been released in 2021, but the fewer views a song has, the harder its date is to interpret.

Let's check the second most popular tag, 'pop', in these outlier rows:

In [20]:
# Extract the pieces of music of type 'pop' lower than the year 215
df_pop = df[(df['year'] < 215) & (df['tag'] == 'pop')]
df_pop.sort_values(by='views',ascending=False).head(20)

Unnamed: 0,title,tag,artist,year,views,features,lyrics,id,language_cld3,language_ft,language
619060,Something,pop,Oliver Pee,4,1073,{},[Chorus: Oliver Tree and Melanie Martinez]\nSo...,5470458,en,en,en
940926,LA Shawty,pop,RainingOnRoses,1,746,{},She don’t want me\nAll alone\nBut she never ca...,5959288,en,en,en
171629,No Sleep,pop,Nana Adjoa,10,741,{},"Both eyes closed, no sleep\nCounted passed a t...",4795859,en,en,en
356068,Lemon Drop,pop,Raynes,1,408,{},[Verse 1]\nSoon I'll be at the top\nI'll proba...,5077517,en,en,en
552838,L.O.V.E,pop,Gus,1,316,{},[Verse 1]\nMaybe I’ll come over\nAnd I’d like\...,5370978,en,en,en
650580,Veteran,pop,Bou,1,280,{Trigga},Born with a mic in a me hand\nTo talk to the p...,5517470,en,en,en
806914,Boo-Boo,pop,Lee Montana,1,244,{},Lee Montana - Boo-Boo\n\nHmmmmm Hmmmmmm\nCause...,5754608,en,en,en
592929,Home Monologue,pop,Peach Martine,1,191,{},“Who are you seeing right now?” he asks\nJust ...,5431007,en,en,en
316502,Be the Love,pop,Adrian Eagle,20,189,{},Tears fall\nTears fall\nOhh\n\nDaddy left my w...,5019126,en,en,en
332865,Hypnotized,pop,Franc Moody,1,159,{},Locked in\nLocked in into your cross-hairs\nI ...,5043186,en,en,en


The songs retrieved seem to be mostly recent and not very popular, with badly indexed years.

Case-by-case preprocessing would be too tedious given the amount of data to process. We will only use data with correctly formatted dates.

In [21]:
df = df[(df.year >= 1960) & (df.year < 2023)]
len(df)

602000

We want to analyze the lyrics by decade, so let's add a decade column.

In [22]:
import math

df['decade'] = df['year'].map(lambda x : int(math.trunc(x / 10) * 10))

df.sort_values(by = 'year').head(20)

Unnamed: 0,title,tag,artist,year,views,features,lyrics,id,language_cld3,language_ft,language,decade
876357,Whats the Use of Wondrin?,pop,Doris Day,1960,30,{},What's the use of wonderin'\nIf he's good or i...,5860782,en,en,en,1960
120525,Happy Days,rb,Marv Johnson,1960,32,{},Happy days\nHappy days\nHappy days\nHappy days...,4712850,en,en,en,1960
786918,Prince Charming,rock,Linda Laurie,1960,24,{},"[Verse 1]\nEvery time I close my eyes, I pray\...",5724393,en,en,en,1960
58632,Missing Lovin Missing Lovin Missing You,country,Goldie Hill,1960,21,{},Missing lovin'\n'Cause you're not here to hold...,4622387,en,en,en,1960
786938,Soupin Up Your Motor,pop,Linda Laurie,1960,6,{},[Verse 1]\nIt's Saturday night\nAnd I'm at hom...,5724439,en,en,en,1960
58631,Twice As Blue,country,Goldie Hill,1960,14,{},Twice as blue\nWithout you I'm just twice as b...,4622386,en,en,en,1960
787025,Stay With Me,pop,Linda Laurie,1960,11,{},[Verse 1]\nStay with me oh every day with me\n...,5724556,en,en,en,1960
802731,Please Please Signore,pop,Annette Funicello,1960,4,{},"Please, please signore, signore\nIn your eyes ...",5748243,en,en,en,1960
150990,Little Bluebird,rock,Joe South,1960,18,{},"La-da, la-da‚ la-la-la\nLa-da‚ la-da‚ la-da, l...",4763969,en,en,en,1960
58624,Driftwood On The River,country,Goldie Hill,1960,32,{},I'm just driftwood on the river\nFloating down...,4622391,en,en,en,1960
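
The decade computation above can also be written without `map`, using integer division on the whole column. This is an equivalent sketch on a toy series, not a change to the notebook's result.

```python
import pandas as pd

# Toy years covering several decades
years = pd.Series([1960, 1969, 1999, 2019, 2022])

# Floor-divide by 10, then multiply back to get the first year of the decade
decades = (years // 10) * 10
print(decades.tolist())
```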


## Data visualization

Let's make some visualizations to get a better understanding of our data. As we saw in the previous distribution charts over the years, more songs have been recorded over the last 2 decades.

In [23]:
# barplot by decade
def barplot_by_decade(df):

    # groupby decade
    df_d = df.groupby(['decade']).size().reset_index(name='count')

    # create the figure
    fig = go.Figure()

    fig.add_bar(
        x=df_d.decade,
        y=df_d['count'],
        showlegend=False)

    fig.add_scatter(
            x=df_d.decade,
            y=df_d["count"],
            mode="markers+lines",
            name="trend",
            showlegend=False)

    fig.update_layout(
            title = "Music release over years",
            xaxis_title="decade",
            yaxis_title="release")
    return fig

# build and display
fig = barplot_by_decade(df)
fig.show()

In [24]:
# compute tag frequencies by decade
df_pies_d = df.groupby(['decade','tag']).size().reset_index(name='count')
df_pies_d[df_pies_d.decade == 1960]

Unnamed: 0,decade,tag,count
0,1960,country,486
1,1960,misc,85
2,1960,pop,2616
3,1960,rap,24
4,1960,rb,594
5,1960,rock,752


In [25]:
from plotly.subplots import make_subplots

# create the subplot grid
fig = make_subplots(rows=3, cols=3,
                    specs=[
                        [{'type':'domain'}
                        for i in range(1,4)] for i in range(1,4)
                    ])
decades = df_pies_d.decade.unique().tolist()
for i in range(0,3):
    for k in range(0,3):
        decade = decades[i*3 + k]
        # group by decade
        df_p = df_pies_d[df_pies_d.decade == decade]
        # add figure
        fig.add_trace(go.Pie(labels=df_p.tag, values=df_p['count'], name=decade), i+1, k+1)
        # add annotation
        fig.add_annotation(arg=dict(
            text=decade, x=k*0.375 + 0.125,
            y=-i*0.3927 + 0.90, font_size=10,
            showarrow=False))
        if (i*3 + k) == 6:
            break


# Use `hole` to create a donut-like pie chart
fig.update_traces(hole=.4, hoverinfo="label+percent+name")

fig.update_layout(
    title_text="Tags proportions by decades"
    # Add annotations in the center of the donut pies.
    #annotations=[dict(text=decade, x=k*0.375+0.125, y= -i*0.125+0.90, font_size=10, showarrow=False)
     #           for k, decade in enumerate(decades) for i in range(0,4)]
)
fig.show()

The pie charts show how the proportions of music genres evolve by decade from 1960 to today. In the sixties, the most listed genre is **pop music**. Then a rising genre emerges: **rock**. Its share increases over the decades until it overtakes pop during the 1990s and 2000s. The two genres then balance out, facing the meteoric rise of the most listed genre in the current decade, namely **rap**. The evolution of these proportions gives a fairly good idea of the most popular genres of each period. However, it is important to remember that the data is biased: Genius is a crowdsourced tool founded in 2009 and originally dedicated only to annotating rap lyrics, so it is normal to find a large amount of data from recent years, and especially from this genre.
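
The shares behind one pie chart can be computed explicitly. The sketch below reuses the 1960 tag counts displayed earlier and derives each tag's proportion; the column name `share` is introduced here for illustration.

```python
import pandas as pd

# Tag counts for the 1960 decade, as displayed in the table above
counts_1960 = pd.DataFrame({
    "decade": [1960] * 6,
    "tag": ["country", "misc", "pop", "rap", "rb", "rock"],
    "count": [486, 85, 2616, 24, 594, 752],
})

# Divide each count by the total of its decade
totals = counts_1960.groupby("decade")["count"].transform("sum")
counts_1960["share"] = counts_1960["count"] / totals

top_tag = counts_1960.sort_values("share", ascending=False).iloc[0]["tag"]
print(counts_1960[["tag", "share"]].round(3))
```

The same `groupby(...).transform("sum")` pattern works on the full `df_pies_d` table, since it normalizes within each decade.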

## Text preprocessing

After the visualization part, let's focus on the main data: the lyrics.

In [26]:
df.iloc[0]["lyrics"]

"[Intro]\nCM iced out\n\n[Chorus]\nMomma told me I could make it, see the brightness in her eyes\nI said I don't need to fake it got the flame won't let it die\nAnd I'm travelin' these places tryna keep up on that grind\nBut even when we make it, me and CM catchin' vibes\nAnd I'm catchin' all these vibes, yeah, yeah\nAnd I'm catchin' all these vibes, yeah\nYeah, me and CM catchin' vibes\n\n[Verse]\nTryna make it big you know I go up with the gang\nYes I am so thankful every for diamond on my chain\nWhich one do I choose, I just pull up switchin' lanes\nTold you once before I am not rockin' with no lames\nI don't rock with you lil girl, you just won't leave me be\nDo this by myself, just makin' hits now can you see\nHad to get some new friends 'cause the old ones don't belive\nNow they comin' back and I'm tellin' them to leave\nSee I'm catchin' vibes\n[Chorus]\nMomma told me I could make it, see the brightness in her eyes\nI said I don't need to fake it got the flame won't let it die\nA

There are many undesirable characters, such as the line break '\n', digits, and square, curly and round brackets. So let's clean this data with regular expressions.

In [27]:
import re
from numpy.random import randint

def clean_text(text):
    # remove \n
    text = text.replace('\n', '')
    # remove punctuation
    text = re.sub(r'[,\.!?]', '', text)
    # remove text in square brackets
    text = re.sub(r'\[.*?\]', ' ', text)
    # remove tokens containing digits
    text = re.sub(r'\w*\d\w*', ' ', text)
    # remove parentheses
    text = re.sub(r'[()]', ' ', text)
    # convert all words to lower case
    text = text.lower()
    return text

# get the results of data cleaning
cleaned_text = df["lyrics"].apply(clean_text)

In [28]:
docs = cleaned_text.to_list()
docs[0]

" cm iced out momma told me i could make it see the brightness in her eyesi said i don't need to fake it got the flame won't let it dieand i'm travelin' these places tryna keep up on that grindbut even when we make it me and cm catchin' vibesand i'm catchin' all these vibes yeah yeahand i'm catchin' all these vibes yeahyeah me and cm catchin' vibes tryna make it big you know i go up with the gangyes i am so thankful every for diamond on my chainwhich one do i choose i just pull up switchin' lanestold you once before i am not rockin' with no lamesi don't rock with you lil girl you just won't leave me bedo this by myself just makin' hits now can you seehad to get some new friends 'cause the old ones don't belivenow they comin' back and i'm tellin' them to leavesee i'm catchin' vibes momma told me i could make it see the brightness in her eyesi said i don't need to fake it got the flame won't let it dieand i'm travelin' these places tryna keep up on that grindbut even when we make it me a

In [29]:
# update dataframe
df.update(cleaned_text)
df.head(3)

Unnamed: 0,title,tag,artist,year,views,features,lyrics,id,language_cld3,language_ft,language,decade
0,Catching Vibes,rap,Yung Divide,2019,4090,{},cm iced out momma told me i could make it see...,4531581,en,en,en,2010
2,Jesus Etc. with Andrew Bird,rock,Wilco,2014,70,"{""Andrew Bird""}",jesus don't cryyou can rely on me honeyyou can...,4531612,en,en,en,2010
3,Alone,rap,Cashma,2019,43,{},i just doubled backyeah i had a couple showsi'...,4531583,en,en,en,2010


That's better! The libraries that we will use later for topic modeling usually provide their own preprocessing, but it is always good to keep control over what we manipulate.
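
One caveat is visible in the cleaned output: because line breaks are removed without a replacement character, the last word of a line is glued to the first word of the next one ("eyesi said"). The variant below is a hedged sketch, not the function used in the rest of this notebook; the name `clean_text_spaced` is hypothetical, and it also replaces punctuation with a space, which differs slightly from the original.

```python
import re

def clean_text_spaced(text: str) -> str:
    """Variant of the cleaning step that keeps words separated across lines."""
    text = text.replace("\n", " ")         # newline -> space keeps words apart
    text = re.sub(r"\[.*?\]", " ", text)   # drop section tags such as [Chorus]
    text = re.sub(r"[,.!?()]", " ", text)  # drop basic punctuation and parentheses
    text = re.sub(r"\w*\d\w*", " ", text)  # drop tokens containing digits
    text = re.sub(r"\s+", " ", text)       # collapse repeated whitespace
    return text.strip().lower()

# Beginning of the first lyrics shown above
sample = ("[Intro]\nCM iced out\n\n[Chorus]\n"
          "Momma told me I could make it, see the brightness in her eyes\n"
          "I said I don't need to fake it")
cleaned = clean_text_spaced(sample)
print(cleaned)
```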

## Topic modeling

I will use 2 approaches to topic modeling:
- [LDA (latent Dirichlet allocation)](https://fr.wikipedia.org/wiki/Allocation_de_Dirichlet_latente) has been the common way to do topic modeling in recent years. It works well and is quite easy to use with common Python libraries like [Gensim](https://radimrehurek.com/gensim/auto_examples/index.html).
- [BERTopic](https://maartengr.github.io/BERTopic/index.html) seems to be one of the best current techniques for topic modeling. It combines the power of [BERT](https://en.wikipedia.org/wiki/BERT_(language_model)), the well-known language model, with a [c-TF-IDF](https://maartengr.github.io/BERTopic/api/ctfidf.html) transformer.
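
To give an intuition for the c-TF-IDF weighting mentioned in the list above, here is a simplified pure-Python sketch. It follows my reading of the formula in the BERTopic documentation (term frequency within a class, multiplied by log(1 + A / f_t), where A is the average number of words per class and f_t the frequency of the term across all classes). It is not the library's implementation, and the function name `c_tf_idf` and the toy tokens are made up.

```python
import math
from collections import Counter

def c_tf_idf(class_docs: dict) -> dict:
    """class_docs maps a class label to the list of tokens of that class."""
    tf = {label: Counter(tokens) for label, tokens in class_docs.items()}
    # frequency of each term across all classes
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    # average number of words per class
    avg_words = sum(len(tokens) for tokens in class_docs.values()) / len(class_docs)
    scores = {}
    for label, counts in tf.items():
        n_tokens = sum(counts.values())
        scores[label] = {
            term: (count / n_tokens) * math.log(1 + avg_words / total[term])
            for term, count in counts.items()
        }
    return scores

# Two toy "topics", each given as one merged bag of tokens
toy_classes = {
    "love": "love heart love baby night".split(),
    "street": "money street night money gang".split(),
}
toy_scores = c_tf_idf(toy_classes)
print(sorted(toy_scores["love"], key=toy_scores["love"].get, reverse=True))
```

Terms that are frequent in one class but rare elsewhere get the highest weights, which is why a word shared by both toy classes ("night") ranks below a word of equal count that is specific to one class ("heart").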


### Initialize the tokenizer and lemmatizer

In [30]:
import spacy

print("set gpu: ", spacy.prefer_gpu())

# small model /!\ take the bigger one for Kaggle
new_nlp = spacy.load('en_core_web_sm')

set gpu:  False


It could be difficult to process all this data on my computer. Therefore, I chose to sample the data.
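
A plain random sample, as used in this notebook, can leave small decades with very few songs. As an alternative that is not used here, the sketch below samples the same fraction inside each decade and fixes the random seed for reproducibility. The helper `sample_by_decade` and the toy frame are hypothetical.

```python
import pandas as pd

def sample_by_decade(frame: pd.DataFrame, frac: float, seed: int = 42) -> pd.DataFrame:
    """Draw the same fraction of rows inside every decade."""
    return (frame.groupby("decade", group_keys=False)
                 .sample(frac=frac, random_state=seed))

# Toy frame: a small decade and a large one (made-up rows)
toy = pd.DataFrame({
    "decade": [1960] * 10 + [2010] * 90,
    "lyrics": ["la la"] * 100,
})
toy_sample = sample_by_decade(toy, frac=0.1)
print(toy_sample["decade"].value_counts())
```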

In [31]:
# set sample frac of the data
prop = 0.1

# create a sample df (a fraction `prop` of the loaded data)
sdf = df.sample(frac = prop)

In [32]:
# check the distribution of the sample
barplot_by_decade(sdf)

In [33]:
def preprocess(text, nlp = new_nlp):

    #TOKENISATION
    tokens =[]
    for token in nlp(text):
        tokens.append(token)

    #REMOVING STOP WORDS
    spacy_stopwords = nlp.Defaults.stop_words
    sentence =  [word for word in tokens if word.text.isalpha() and word.text not in spacy_stopwords]

    #LEMMATISATION
    sentence = [word.lemma_ for word in sentence]

    return sentence

In [34]:
sdf['lyrics'].loc[sdf['decade'].isin([1960, 1970])]
sdf['decade'].loc[sdf['decade'].isin([1960, 1970])]

427498    1970
457086    1960
48603     1970
165334    1970
393295    1970
          ... 
858184    1970
788084    1970
928390    1960
830437    1960
104149    1970
Name: decade, Length: 1152, dtype: int64

In [35]:
from tqdm import tqdm
from sklearn.feature_extraction.text import CountVectorizer

class TermsDocumentsMatrix():
    
    def __init__(self, sdf, decades=[1960, 1970], colorscale = 'Plotly3'):
        # vectorizer on the sample lyrics
        self.__vectorizer = CountVectorizer(tokenizer = preprocess)
        # fit and transform the data
        self.__data_vectorized = self.__vectorizer.fit_transform(
            tqdm(sdf['lyrics'].loc[sdf['decade'].isin(decades)])
        )
        # get decades informations
        self.__decades = sdf['decade'].loc[sdf['decade'].isin(decades)].reset_index(drop=True)
        self.__unique_decades = decades
        # get colorscale template
        self.__colorscale = colorscale
    
    def get_tdmatrix(self):
        
        # compute a Matrix terms document by decades
        df_bw = pd.DataFrame(self.__data_vectorized.toarray(),
                    columns = self.__vectorizer.get_feature_names_out())
        
        # check the length
        if len(df_bw) != len(self.__decades):
            raise Exception('Not the same size')
        
        # concatenate decade
        df_bw['decade'] = self.__decades
        
        # check NaN values
        if len(df_bw.columns[df_bw.isna().any()].tolist()) != 0:
            raise Exception('Decade got Nan values')

        return df_bw
    
    def get_tdm_by_decade(self, decade):
        
        if decade not in self.__unique_decades:
            raise Exception("{} doesn't appear in the decades list".format(decade))
        
        # build the full term-document matrix (same checks as get_tdmatrix)
        df_bw = self.get_tdmatrix()
        
        # select suitable decade
        df_bw = df_bw[df_bw['decade'] == decade]
        
        return df_bw
    
    def most_freq_terms(self, n_rows = 1, n_cols = 2, n_terms = 10):
        
        # create the document terms matrix
        df_bw = self.get_tdmatrix()
        
        # create the subplot grid
        fig = make_subplots(rows=n_rows, cols=n_cols,
                            x_title = 'number of occurrences',
                            y_title = 'terms',
                            subplot_titles = [str(d) for d in self.__unique_decades])
        
        for i in range(0,n_rows):
            for k in range(0,n_cols):
                # row-major position of the subplot in the grid
                idx = i*n_cols + k
                if idx >= len(self.__unique_decades):
                    break
                
                # get the decade
                decade = self.__unique_decades[idx]
            
                # select the suitable decade and delete decade column
                df_decade = df_bw.loc[df_bw.decade == decade, df_bw.columns != 'decade']
                
                # compute terms frequencies by decade
                terms_freq = df_decade.sum().sort_values(ascending = False)
            
                # keep the n_terms most frequent terms, reversed so that the
                # most frequent term is drawn at the top of the horizontal bar chart
                top_terms = terms_freq.index.tolist()[:n_terms][::-1]
                top_counts = terms_freq.values[:n_terms][::-1]
                
                # add figure
                fig.add_trace(go.Bar(y=top_terms,
                                     x=top_counts,
                                     name=str(decade),
                                     orientation='h', showlegend = False,
                                     marker = dict(color = top_counts,
                                                   colorscale=self.__colorscale)),
                              i+1, k+1)
        return fig

In [36]:
# first decades
tdm = TermsDocumentsMatrix(sdf, decades = [1960, 1970, 1980, 1990],
                           colorscale = 'Plotly3')

# display bar charts of most frequent terms
tdm.most_freq_terms(n_rows = 2, n_cols = 2, n_terms = 15)

100%|██████████| 3670/3670 [02:25<00:00, 25.24it/s]


According to the bar charts above, the same group of words recurs in every decade: love, know, go, feel, and so on. These words are typical of pop songs, which often talk about love. This is consistent with our previous analysis of the pie charts showing the proportions of musical styles over time. We also notice a strong presence of onomatopoeia such as "yeah" or "oh".
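
The ranking behind these charts amounts to summing term counts per decade and sorting them. Here is a minimal sketch using only the standard library, assuming a hypothetical toy list `songs` of `(decade, tokens)` pairs that stands in for the preprocessed lyrics (the notebook itself relies on `CountVectorizer`):

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (decade, already preprocessed tokens)
songs = [
    (1960, ["love", "know", "oh", "love"]),
    (1960, ["go", "love", "feel"]),
    (1970, ["yeah", "know", "yeah", "go"]),
    (1970, ["know", "feel", "yeah"]),
]

def top_terms_by_decade(songs, n_terms=2):
    """Return {decade: [(term, count), ...]} with the n_terms most frequent terms."""
    counts = defaultdict(Counter)
    for decade, tokens in songs:
        # accumulate term occurrences over all songs of the decade
        counts[decade].update(tokens)
    return {decade: counter.most_common(n_terms) for decade, counter in counts.items()}

top_terms = top_terms_by_decade(songs)
print(top_terms)
```

On real data, the same idea applies to the sparse matrix returned by the vectorizer, which avoids converting it to a dense array.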

In [None]:
# most recent decades
tdm = TermsDocumentsMatrix(sdf, decades = [2000, 2010, 2020],
                          colorscale = 'Plasma')

# most frequent terms
tdm.most_freq_terms(n_rows = 2, n_cols = 2, n_terms = 15)

100%|██████████| 56427/56427 [44:56<00:00, 20.93it/s]  


We get similar results for these more recent decades, with the same high-occurrence words. We also see a greater amount of onomatopoeia in the most recent decades. A possible explanation is the rise of rap music over the last two decades. This style makes heavy use of 'ad-libs': sounds, words or onomatopoeias that artists pronounce between two verses or at the end of a line to give more impact to their text and to energize the track.
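
One way to check this hypothesis would be to measure the share of ad-lib tokens in each decade. The sketch below is only illustrative: the `AD_LIBS` set is a hypothetical, non-exhaustive list, and the token lists are toy data rather than results from the dataset.

```python
# Hypothetical, non-exhaustive set of ad-lib / onomatopoeia tokens
AD_LIBS = {"yeah", "oh", "la", "ooh", "ayy", "uh"}

def ad_lib_share(tokens, ad_libs=AD_LIBS):
    """Fraction of tokens that belong to the ad-lib set (0.0 for empty input)."""
    if not tokens:
        return 0.0
    return sum(token.lower() in ad_libs for token in tokens) / len(tokens)

# Toy token lists standing in for the preprocessed lyrics of two decades
lyrics_1960 = ["love", "know", "oh", "heart"]
lyrics_2010 = ["yeah", "money", "ayy", "yeah", "know"]

share_1960 = ad_lib_share(lyrics_1960)
share_2010 = ad_lib_share(lyrics_2010)
print(share_1960, share_2010)
```

Comparing this share across decades, and across tags such as rap versus pop, would indicate whether the increase really comes from rap lyrics.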


### Topic modeling with LDA

In [38]:
from gensim.models import LdaModel
from gensim.corpora.dictionary import Dictionary

# Apply preprocessing on the 2000s data
documents = sdf.loc[sdf.decade == 2000, 'lyrics'].apply(preprocess)

# Create a corpus from a list of texts
id2word = Dictionary(documents.tolist())
corpus = [id2word.doc2bow(doc) for doc in documents]

lda_model = LdaModel(corpus=tqdm(corpus),id2word=id2word, num_topics=10)

100%|██████████| 2464/2464 [00:01<00:00, 1406.08it/s]


In [39]:
lda_model.print_topics()

[(0,
  '0.015*"d" + 0.014*"yeah" + 0.008*"sex" + 0.007*"look" + 0.006*"man" + 0.006*"want" + 0.006*"know" + 0.006*"get" + 0.005*"avenue" + 0.005*"time"'),
 (1,
  '0.011*"good" + 0.010*"love" + 0.010*"fuck" + 0.008*"come" + 0.008*"like" + 0.007*"go" + 0.006*"know" + 0.005*"time" + 0.005*"away" + 0.005*"heart"'),
 (2,
  '0.030*"know" + 0.020*"love" + 0.015*"oh" + 0.014*"get" + 0.013*"come" + 0.013*"go" + 0.011*"time" + 0.011*"like" + 0.009*"think" + 0.008*"want"'),
 (3,
  '0.015*"want" + 0.013*"love" + 0.012*"like" + 0.010*"get" + 0.008*"baby" + 0.008*"let" + 0.008*"feel" + 0.007*"know" + 0.007*"run" + 0.006*"look"'),
 (4,
  '0.015*"like" + 0.011*"yeah" + 0.009*"loveone" + 0.008*"know" + 0.007*"chance" + 0.006*"la" + 0.005*"oh" + 0.005*"walk" + 0.005*"d" + 0.005*"think"'),
 (5,
  '0.011*"time" + 0.010*"la" + 0.006*"like" + 0.006*"get" + 0.006*"leave" + 0.006*"right" + 0.005*"let" + 0.005*"think" + 0.005*"rest" + 0.004*"oh"'),
 (6,
  '0.010*"get" + 0.009*"like" + 0.008*"come" + 0.007*"way
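
The strings printed by `print_topics()` are hard to compare by eye. As a small sketch, they can be parsed into `(word, weight)` pairs with a regular expression; `parse_topic` and `TERM_PATTERN` are hypothetical helper names, and the input below is the beginning of topic 0 as printed above.

```python
import re

# Matches one `weight*"word"` term in a gensim topic string
TERM_PATTERN = re.compile(r'(\d+\.\d+)\*"([^"]+)"')

def parse_topic(topic_string):
    """Turn a printed topic string into a list of (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in TERM_PATTERN.findall(topic_string)]

# First terms of topic 0 as printed above
topic_0 = '0.015*"d" + 0.014*"yeah" + 0.008*"sex" + 0.007*"look"'
terms = parse_topic(topic_0)
print(terms)
```

gensim can also return the pairs directly with `lda_model.show_topics(formatted=False)`, which avoids parsing altogether.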

Then save the model.

In [46]:
import os
from datetime import datetime

# timestamped sub-directory so that successive runs do not overwrite each other
directory = "/kaggle/working/models/"
model_dir = os.path.join(directory, datetime.now().strftime("%d%m%Y_%H%M%S"))
os.makedirs(model_dir, exist_ok=True)

# Save model to disk.
temp_file = os.path.join(model_dir, 'model')

lda_model.save(temp_file)

In [46]:
# Load a potentially pretrained model from disk.
lda_model = LdaModel.load(temp_file)

In [48]:
import pyLDAvis.gensim

# some basic dataviz
pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(lda_model, corpus, id2word)





In [1]:
# Get topic weights and dominant topics ------------
from sklearn.manifold import TSNE
from bokeh.plotting import figure, output_file, show
from bokeh.models import Label
from bokeh.io import output_notebook
import numpy as np
import matplotlib.colors as mcolors

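# Get topic weights: one {topic_id: probability} dict per document
# (lda_model[corpus] yields a list of (topic_id, probability) pairs for each document)
topic_weights = [dict(row_list) for row_list in lda_model[corpus]]

# Array of topic weights (documents x topics); topics missing from a document get 0
arr = (pd.DataFrame(topic_weights)
         .reindex(columns=range(lda_model.num_topics))
         .fillna(0).values)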

# Keep the well separated points (optional)
arr = arr[np.amax(arr, axis=1) > 0.35]

# Dominant topic number in each doc
topic_num = np.argmax(arr, axis=1)
print("topic_num :", topic_num)

# tSNE Dimension Reduction
tsne_model = TSNE(n_components=2, verbose=1, init='pca')
tsne_lda = tsne_model.fit_transform(arr)

# Plot the Topic Clusters using Bokeh
output_notebook()
n_topics = 10
mycolors = np.array([color for name, color in mcolors.TABLEAU_COLORS.items()])
plot = figure(title="t-SNE Clustering of {} LDA Topics".format(n_topics), 
              plot_width=900, plot_height=700)
plot.scatter(x=tsne_lda[:,0], y=tsne_lda[:,1], color=mycolors[topic_num])
show(plot)

NameError: name 'lda_model' is not defined
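
The filtering step in the cell above keeps only the documents whose strongest topic has a weight above 0.35, then assigns each of them to that topic. A minimal pure-Python sketch of the same logic, assuming hypothetical per-document distributions given as `(topic_id, probability)` pairs (`dominant_topics` is an illustrative helper, not part of gensim):

```python
def dominant_topics(doc_topics, threshold=0.35):
    """Return (doc_index, topic_id, weight) for documents whose strongest topic exceeds threshold."""
    kept = []
    for doc_index, topics in enumerate(doc_topics):
        if not topics:
            continue
        # strongest topic of the document
        topic_id, weight = max(topics, key=lambda pair: pair[1])
        if weight > threshold:
            kept.append((doc_index, topic_id, weight))
    return kept

# Hypothetical per-document topic distributions
doc_topics = [
    [(0, 0.10), (2, 0.85)],
    [(1, 0.30), (3, 0.33), (4, 0.34)],
    [(5, 0.60), (6, 0.40)],
]
selected = dominant_topics(doc_topics)
print(selected)
```

The second document is dropped because none of its topics dominates, which is what keeps only the well-separated points in the t-SNE plot.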

## 2015 song lyrics topic modelling

Let's first retrieve the English songs from 2015 (the year with the most lyrics listed on Genius).

Let's plot the tag distribution.

The bar plot below shows the frequency of each tag, colored by total views.

Let's try topic modelling with the top2vec library, which is the easiest to start with. But first, let's filter the lyrics.

## 2000 song lyrics topic modelling

## References