# Topic analysis based on Genius song lyrics

Author : lievre.thomas@gmail.com

---

In this notebook, we will explore the Genius data extracted from [Kaggle](https://www.kaggle.com/datasets/carlosgdcj/genius-song-lyrics-with-language-information).

**The aim of this analysis is to extract topics from the lyrics and identify the main topics by year or decade.**

This notebook was carried out as a class project required by the [text mining course (TDDE16)](https://www.ida.liu.se/~TDDE16/project.en.shtml) of Linköpings universitet.


## Some information about the Genius website

Genius is an American digital media company founded on August 27, 2009, by Tom Lehman, Ilan Zechory, and Mahbod Moghadam. Originally launched as Rap Genius with a focus on hip-hop music, it started as a crowdsourced website where people could transcribe rap lyrics and annotate them with interpretations. Over the years, the site has grown to contain several million annotated texts from all eras (from the [Wikipedia Genius page](https://en.wikipedia.org/wiki/Genius_(company))).


## Load the data in memory

All the data is contained in a single large 9 GB CSV file (around 5 million rows), which can be difficult to load into memory at once. To deal with this issue, I wrote a loader class that splits the data into 6 pickle files, storing it in a more compact binary form with the aim of speeding up loading. Pickle files are then drawn at random so that the loaded sample is more representative. For now, we assume that the pickle batches are identically distributed (we explore the batches in the second part). The class below handles the whole process.

In [1]:
import pandas as pd
from random import seed, sample
import pickle
import glob
import os

class Loader():

    def __init__(self, in_path, out_path, chunksize = 10**6):
        """
        Args:
            in_path (str): csv input path.
            out_path (str): Output directory path to store the pickles.
            chunksize (int, optional): Chunksize for DataFrame reader. Defaults to 10**6.
        """

        self.__in_path = in_path
        self.__out_path = out_path
        self.__chunksize = chunksize

    def __produce_pickles(self):
        """produce pickles by reading csv by chunksize
        """
        with pd.read_csv(self.__in_path, chunksize = self.__chunksize) as reader:
            try:
                os.makedirs(self.__out_path)
            except FileExistsError:
                # directory already exists
                pass
            for i, chunk in enumerate(reader):
                out_file = self.__out_path + "/data_{}.pkl".format(i+1)
                with open(out_file, "wb") as f:
                    pickle.dump(chunk, f, pickle.HIGHEST_PROTOCOL)
    
    def load_pickle(self, pickle_id):
        """load a pickle file by id

        Args:
            pickle_id (int): pickle id.

        Raises:
            Exception: The path of the given id isn't a file

        Returns:
            obj: DataFrame
        """
        # produce the pickles if the directory not exists or
        # if the directory is empty 
        if (not os.path.exists(self.__out_path)) or \
              (len(os.listdir(self.__out_path)) == 0):
            self.__produce_pickles()
        
        # get the file path following the pickle_id
        # given in parameter
        file_path = self.__out_path + \
            "/data_" + str(pickle_id) + ".pkl"

        if os.path.isfile(file_path):
            df = pd.read_pickle(file_path)
        else:
            raise Exception("The pickle file data_{}.pkl doesn't exist".format(pickle_id))
        return df
        

    def random_pickles(self, n_pickles = 3, init = 42, verbose = True):
        """random reader over pickles files

        Args:
            n_pickles (int, optional): number of pickles to load. Defaults to 3.
            init (int, optional): Integer given to the random seed. Defaults to 42.
            verbose (bool, optional): Print the loaded files. Defaults to True

        Raises:
            Exception: Stop the process if n_pickles exceed pickle files number.

        Returns:
            obj: pd.Dataframe
        """

        # produce the pickles if the directory not exists or
        # if the directory is empty 
        if (not os.path.exists(self.__out_path)) or \
              (len(os.listdir(self.__out_path)) == 0):
            self.__produce_pickles()

        pickle_files = [name for name in
                        glob.glob(self.__out_path + "/data_*.pkl")]
        # draw p_files        
        seed(init)

        # sample without replacement among the pickle files found on disk
        if n_pickles <= len(pickle_files):
            random_p_files = sample(pickle_files, n_pickles)
        else:
            raise Exception("The parameter n_pickles (" +
                            "{}) exceeds the number of pickle files ({})"\
                                .format(n_pickles, len(pickle_files)))
        # print the drawn files
        if verbose:
            print("Loaded pickles:")
            for p in random_p_files:
                print(p)

        # load random pickles file
        df_list = [pd.read_pickle(p) for p in random_p_files]

        # create the dataframe by concatenate the previous
        # dataframes list
        df = pd.concat(df_list, ignore_index = True)
        return df

In [2]:
# create reader
#  /!\ change path in kaggle
kaggle_input = "/kaggle/input/genius-song-lyrics-with-language-information/song_lyrics.csv"
kaggle_output = "/kaggle/working/data/"

# initiate the file loader
loader = Loader(in_path = kaggle_input, out_path = kaggle_output)

# load random pickle files
df = loader.random_pickles(n_pickles = 1)

Loaded pickles:
/kaggle/working/data/data_2.pkl


Batches of data are randomly loaded into memory. The number of batches loaded depends on the memory capacity of the computer running the script. For the analysis, we will only work on the loaded random sample (all the data on Kaggle).
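
To decide how many batches fit in memory, one option is to measure the footprint of a single loaded batch first. The sketch below is only an illustration: the helper `memory_mb` and the toy frame are assumptions, not part of the original notebook. On the real data, the helper could be called on `df`.

```python
import pandas as pd

def memory_mb(frame: pd.DataFrame) -> float:
    """Return the in-memory size of a DataFrame in megabytes.

    deep=True also counts the Python strings stored in object columns,
    which matters for a text-heavy column such as the lyrics.
    """
    return frame.memory_usage(deep=True).sum() / 1024 ** 2

# toy stand-in for one batch (hypothetical values)
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "lyrics": ["la la la", "do do do", "na na na"],
})
toy_mb = memory_mb(toy)
print(f"toy frame: {toy_mb:.6f} MB")
```

Multiplying the footprint of one batch by the number of batches gives a rough upper bound to compare with the available RAM.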

## Exploring the raw data

Let's visualize and explore the raw data before a deeper analysis.

In [3]:
df.head()

Unnamed: 0,title,tag,artist,year,views,features,lyrics,id,language_cld3,language_ft,language
0,Life Floats By,rock,The Jayhawks,2000,91,{},"[Verse 1]\nAbuse me and confuse me\nBut never,...",1325199,en,en,en
1,Red Rubber Ball,rock,Streetlight Manifesto,2010,725,{},I should have known you'd bid me farewell\nThe...,1325200,en,en,en
2,The Lads Who Fought Won,pop,The Real McKenzies,2008,1386,{},That Serbian man\nAssassinated Archduke Ferdin...,1325201,en,en,en
3,Burning Love,pop,Mother's Finest,1977,571,"{""Mother\\'s Finest""}",Lord Almighty\nI feel my temperature rising\nH...,1325202,en,en,en
4,Patent na luxus,rap,O.S.T.R.,2003,861,{},[Zwrotka 1]\nWino z gatunku wytrawnych - Dom P...,1325204,pl,pl,pl


For each song, we have the following information:
- the title of the song
- the tag (music genre)
- the artist name
- the release year
- the number of page views
- the names of the featured artists
- the lyrics
- the Genius identifier
- the lyrics language according to [CLD3](https://github.com/google/cld3). Unreliable results are NaN. CLD3 is a neural network model for language identification.
- the lyrics language according to [FastText's langid](https://fasttext.cc/docs/en/language-identification.html). Values with low confidence (<0.5) are NaN. fastText is a library developed by Facebook's AI Research lab for efficient learning of word representations and sentence classification. It also provides a fast and accurate tool for text-based language identification, capable of recognizing more than 170 languages.
- the combination of language_cld3 and language_ft. It only has a non-NaN entry if they both "agree".

More information at this link : https://www.kaggle.com/datasets/carlosgdcj/genius-song-lyrics-with-language-information
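
As an illustration of the "agreement" rule described above, the combined column could be rebuilt as follows. This is only my reading of the dataset description, shown on a toy frame with made-up values; it is not the dataset author's actual code.

```python
import numpy as np
import pandas as pd

# toy detector outputs (hypothetical values)
toy = pd.DataFrame({
    "language_cld3": ["en", "pl", "en", np.nan],
    "language_ft":   ["en", "pl", "fr", "en"],
})

# True only where both detectors return the same non-missing language
agree = toy["language_cld3"] == toy["language_ft"]

# keep the language where they agree, NaN elsewhere
toy["language"] = toy["language_cld3"].where(agree)
print(toy)
```

A missing value in either detector never compares equal, so the combined column is NaN in that case as well.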

In [4]:
df.dtypes

title            object
tag              object
artist           object
year              int64
views             int64
features         object
lyrics           object
id                int64
language_cld3    object
language_ft      object
language         object
dtype: object

In [5]:
# display the size
print('Data frame size (row x columns):', df.size)
print('Data rows number: ', len(df))
print('Number of unique songs (following genius id): ', len(df.id.unique()))

Data frame size (row x columns): 11000000
Data rows number:  1000000
Number of unique songs (following genius id):  1000000


The Genius id seems to be the unique row identifier.
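
Comparing `len(df)` with the number of unique ids is one way to check this. A slightly more direct check, sketched here on a toy frame with a deliberately duplicated id (hypothetical values), uses `Series.is_unique` and `Series.duplicated`:

```python
import pandas as pd

# toy frame where the id 12 appears twice (hypothetical values)
toy = pd.DataFrame({
    "id": [10, 11, 12, 12],
    "title": ["a", "b", "c", "c"],
})

# True only if no id is repeated
ids_are_unique = toy["id"].is_unique

# list every row whose id appears more than once
dupes = toy[toy["id"].duplicated(keep=False)]
print(ids_are_unique)
print(dupes)
```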

Let's visualize the size of the raw data over the years before preprocessing, to compare the batch distributions. One thing to know before visualizing the data: the pickles are created by reading the CSV in chunks.

The table displayed earlier (`df.head()`) gives us some information about the data: the CSV file seems to be sorted by id, so the pickle files are sorted too.

In [6]:
import plotly.express as px
import plotly.io as pio

# change theme template for every graph below
pio.templates.default = "plotly_white"


# get some information about the pickle data
def pickle_informations(data = loader):
    rows = []
    for i in range(1, len(os.listdir('data')) + 1):
        df = data.load_pickle(i)
        rows.append(len(df))
        del df
    return rows

# get the rows
rows = pickle_informations()

# create the dataframe
df_data = pd.DataFrame(
    {'batch' : ['data ' + str(i) for i in range(1,len(rows) + 1)],
    'rows' : rows})

fig = px.bar(df_data, x="batch", y="rows")
fig.show()

All batches seem to have the same number of rows except the last one. This is consistent with how the batches are created: the CSV is read iteratively in chunks of 10^6 rows, so the last batch is simply the remainder.
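
The arithmetic behind this can be sketched in a few lines. The row count below is a hypothetical value chosen for illustration, not the exact size of the Genius CSV.

```python
import math

chunksize = 10 ** 6
total_rows = 5_200_000  # hypothetical row count, for illustration only

# number of full batches and size of the last, smaller batch
full_batches, remainder = divmod(total_rows, chunksize)

# total number of pickle files written by a chunked reader
n_batches = math.ceil(total_rows / chunksize)

print(full_batches, remainder, n_batches)
```

With these values there are 5 full batches and a sixth batch holding the remaining rows.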

In [7]:
import plotly.graph_objects as go

def add_bar(i, y1, y2, color, data = loader):
    df = data.load_pickle(i)
    df = df[(df.year >= y1) & (df.year <= y2)]
    df_year = df.groupby(['year']).size().reset_index(name='count')
    new_bar = go.Bar(
                x = df_year.year.values,
                y = df_year['count'].values,
                name = 'data_'+ str(i),
                marker = {'color' : color})
    new_trend = go.Scatter(
                x = df_year.year.values,
                y = df_year['count'].values,
                mode="lines",
                line={'color' : color,
                    'width' : 0.5},
                showlegend=False)
    del df_year, df
    return new_bar, new_trend


def multi_barplot(year1, year2, colors):    
    # create a empty plotly.Figure object
    fig = go.Figure() 
    # compute the batch number
    n_batch = len(os.listdir('data'))
    # check that the color list given in argument
    # is long enough for the batch number
    if n_batch > len(colors):
        raise Exception(
            "The colors list size ({}) doesn't ".format(len(colors)) +
            "fit with the number of data batches ({})".format(n_batch))
    for i in range(1, n_batch + 1):
        fig.add_traces((add_bar(i, year1, year2, colors[i-1])))
    fig.update_layout(
        title = "Data distribution over years ({} - {})"
            .format(year1, year2),
        xaxis_title="years",
        yaxis_title="number of titles",
        legend_title="Data batch")
    return fig


In [8]:
import plotly.colors as col

# create the color list
colors = col.qualitative.Plotly

# 1960 - 1989
fig1 = multi_barplot(1960, 1989, colors)
fig1.show()
# 1990 - 2023
fig2 = multi_barplot(1990, 2023, colors)
fig2.show()

The first bar chart (1960 - 1989) shows an increasing number of songs over the years. Moreover, the batches seem to have quite similar distributions over the years. The data_1 and data_2 batches contain noticeably more songs than the 4 others, and the data_6 batch contains fewer than the others because of its small number of rows.
The data behaves similarly until 2012, as we can see on the second chart (1990 - 2023). After that year, the amount of data retrieved increases sharply: every batch at least doubles (an increase of at least 100%), and some batches grow up to 50-fold.

## Data pre-processing

The aim of this part is to preprocess the data to make it suitable for the analysis, with a particular focus on the year variable.

We will focus on English songs, to simplify the analysis and the work of the natural language processing algorithms.

In [9]:
# Retrieve only the texts identified as English language by both cld3 and fasttext langid
df = df[df.language == 'en']

Next, it is worth checking for NaN values.

In [10]:
# find which column contain nan value
df.columns[df.isna().any()].tolist()

['title']

In [11]:
# get all rows that contain NaN values
df_nan = df[df.isna().any(axis=1)]
df_nan

Unnamed: 0,title,tag,artist,year,views,features,lyrics,id,language_cld3,language_ft,language
453410,,pop,Backstabbers Incorporated,2004,218,{},There will be no sorry's\nThere will be no for...,1816945,en,en,en
453550,,pop,Adrian Belew,2006,70,{},Ampersand ampersand\n\nAmpersand the angry sea...,1817106,en,en,en
469051,,rock,Jets To Brazil,2000,629,{},And the trumpets pounded in my ears\nThe sun s...,1836101,en,en,en
637271,,pop,Caroline Smith & The Good Night Sleeps,2011,445,{},You say that I'm\nNever to keen to\nLeave this...,2124706,en,en,en
703414,,rap,Juggernaut789,2015,26,{},(Verse 1: Juggernaut)\nSo who cares that you'r...,2285907,en,en,en
714760,,rock,The Roots,2002,2152,{},[Verse]\nSaw you at the crack house\nDidn't wa...,2310351,en,en,en
721779,,pop,Jett Rebel,2014,262,{},I got it all figured out\nThink I know what I'...,2323195,en,en,en
749224,,rap,Tapo,2015,13,{},"Ya, listen up im not active\nStay trappin'\nNo...",2393125,en,en,en
767508,,rap,Thani & UFO Funk Band,2016,28,"{""Howi Dietz""}","[Spoken Word: Thani]\nI was hurt\nFoolish, dep...",2419155,en,en,en
777643,,rock,Pinegrove,2015,15939,{},"[Verse 1]\nHand over hand, I’m pulling myself ...",2432916,en,en,en


In [12]:
print('Number of untitled song:', len(df[df.isna().any(axis=1)]))

Number of untitled song: 15


The title is not used to train the topic modeling algorithms, although titles could be related to the topics in a later phase of the analysis. Given the low number of untitled songs, I decide to delete these rows for the moment.

In [13]:
# Delete rows containing NaN values
df = df.dropna()
len(df)

679898

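Next, we also check for None values. Note that in pandas `isnull` is an alias of `isna`, so this check is expected to return an empty frame after `dropna`.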

In [14]:
df[df.isnull().any(axis=1)]

Unnamed: 0,title,tag,artist,year,views,features,lyrics,id,language_cld3,language_ft,language


No None values in this dataframe.
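
This result is expected, because pandas treats `None` and `NaN` the same way when detecting missing values. A small sketch on a toy frame (hypothetical values) illustrates it:

```python
import pandas as pd

# toy column mixing a real title, None and NaN (hypothetical values)
toy = pd.DataFrame({"title": ["Laura", None, float("nan")]})

# isnull and isna flag exactly the same cells
same_mask = toy.isnull().equals(toy.isna())

# both None and NaN are reported as missing
missing = toy["title"].isna().tolist()

# dropna removes both kinds of missing values in one pass
remaining = len(toy.dropna())
print(same_mask, missing, remaining)
```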

Afterwards, let's look at the year variable, which is one of the important variables to take into account in our analysis because we want to extract the topics by decades.

In [15]:
years = df.year.unique()
print(years)

print('Number of unique years: ',len(years))

[2000 2010 2008 1977 2002 1982 2004 2005 2001 1992 1999 1988 2015 1980
 2009 2006 1966 1998 1997 1986 1957 1996 1952 1981 2011 1985 1994 2013
 2007 1979 1978 1989 2003 1995 1972 1993 1974 1973 2012 1967 1990 1963
 1984 1983 1987 1968 1961 1959 2014 1970 1958 1991 1965 1953 2016 1976
 1964 1971 1975 1960 1969 1955 1962 1684 1945 2019 1954 1956 2021 1950
 1949 2018 1951 1928 1937    2 1930 2020 2017 1936 1933 1938 1948 1947
 1939 1941 1940 1927 1931 1924 1943 1935 1918 1942 1932 1944 1893 1926
 2022 1922 1934    1 1913 1925 1946 1929 1923 1872 1775 1921  994 1870
 1854   62 1741   66 1903 1850 1890  207 1863 1835  729 2023 1914   10
 1885 1820 1862 1719 1841 1744 1847 1823 1780 1857 1817 1842 1662 1764
 1891 1887 1906   14 1911 1878 1910 1609 1905 1881 1821 1895 1805 1725
 1478 1867 1771 1875 1859 1712 1499 1110 1563 1832 1907 1871 1818 1876
 1765 1894 1861 1777 1868 1745 1799 1853 1899 1833 1011 1882 1897 1908
 1916 1849 1788   15 1781 1901    6 1888  673 1811 1837 1884 1550 1920
 1874 

We first want to know whether the format of the year variable is suitable. It is highly likely that years are sometimes truncated (for example, 92 instead of 1992).
Let's display the tag distribution for songs with a release year below 215.

In [16]:
df_tag = df[df['year'] < 215].groupby(['tag']).size().reset_index(name='count')

fig = px.pie(df_tag, names="tag", values="count", title = "Outlier tag distribution")
fig.show()

It is rather surprising that the most common genre for this period (< 215) is rap, given that rap is a genre that only emerged recently. Clearly, a large part of these rows have erroneous years.

In [17]:
# Extract the pieces of music of type 'rap' lower than the year 215
df_rap = df[(df['year'] < 215) & (df['tag'] == 'rap')]
df_rap.sort_values(by='views',ascending=False).head(20)

Unnamed: 0,title,tag,artist,year,views,features,lyrics,id,language_cld3,language_ft,language
858723,Ezekiel Poem,rap,Ezekiel (The Get Down),1,46085,"{""Justice Smith""}","The Get Down: Ezekiel's Poem\n\nBoom, then cra...",2842070,en,en,en
883034,Slow Down,rap,Dave East,1,19069,"{""Jazzy Amra""}","[Hook: Jazzy Amra]\nShorty only 19, always on ...",2870223,en,en,en
904191,No way,rap,Bitch who?,1,10868,{},[Verse 1]\nAyyy yaay yaaay (x2)\nYou might see...,2894288,en,en,en
715742,Deathless Aphrodite of the spangled mind Fragm...,rap,Sappho,1,3937,{},"Deathless Aphrodite of the spangled mind,\nChi...",2312135,en,en,en
798281,Sigelli,rap,Spark Master Tape,16,2940,{},[Intro]\nNow I am become death\nThe destroyer ...,2458701,en,en,en
818008,New Nga,rap,Sonyae,1,2029,{},It's to the point we don't know who the fuck t...,2483219,en,en,en
921788,Gotta Get Mine remix,rap,MC Breed,99,1525,{2Pac},2PAC INTRO:\nBuck buck…muthafucka\nRight at yo...,2915674,en,en,en
718090,Feed My Whole Crew,rap,Lightshow,1,1248,{},Lord as my knees hit this floor and I come to ...,2316983,en,en,en
853976,Still Slaves,rap,Prodigy of Mobb Deep,8,754,{},"Yeah, yeah\nUh huh\nIt's just another war stor...",2836752,en,en,en
942616,Dnt Like Me,rap,Jab47,1,750,{},"[Hook]\nDon't fuck with a nigga like me, oh no...",2939496,en,en,en


In [18]:
df_rap[df_rap['artist'] == 'Kanye East']

Unnamed: 0,title,tag,artist,year,views,features,lyrics,id,language_cld3,language_ft,language


If we search for the release date of this track on Google, we find a release date of 4 May 2021 on the [Genius website](https://genius.com/Kanye-east-the-secrets-of-dababy-lyrics). Comparing the year in our table with the real one, we can assume an issue with the date format (1 instead of 2021).

After some research on the Genius website, the most viewed songs in the list displayed above seem to have been released in 2021, but the fewer views a song has, the harder its date is to interpret.

Let's check the second most popular tag, 'pop', in these outliers:

In [19]:
# Extract the songs tagged 'pop' with a year below 215
df_pop = df[(df['year'] < 215) & (df['tag'] == 'pop')]
df_pop.sort_values(by='views',ascending=False).head(20)

Unnamed: 0,title,tag,artist,year,views,features,lyrics,id,language_cld3,language_ft,language
250287,Faking Jazz Together,pop,Connan Mockasin,1,13014,{},Let me feel that way\nFeeding time with intere...,1589754,en,en,en
6877,Tried Him and I Know Him Reprise Live,pop,The Clark Sisters,2,3432,{},Saved my soul\nMade me whole\nThere is none li...,1332453,en,en,en
369370,Shoals Of Herring,pop,Ewan MacColl,66,2542,{},With our nets and gear we're faring\nOn the wi...,1715702,en,en,en
943116,Hum,pop,Sweet California,1,2040,"{""Juan Magán""}",[Verse 1:Sweet California]\nI hear everything ...,2940036,en,en,en
935369,​motherhood me my mom,pop,(Hiroyuki Sawano),1,1221,"{""澤野弘之 (Hiroyuki Sawano)"",""Aika Sekiyama""}",[Verse 1]\nEvery moment without you\nSeems lik...,2931277,en,en,en
430996,Love Story,pop,Katharine McPhee,207,960,{},[Verse 1]\nI think it was the summertime\nWhen...,1786012,en,en,en
350034,Bullets of Mexico,pop,Phil Ochs,62,839,{},[Verse 1]\nThe peons of Mexico long have known...,1695290,en,en,en
110257,One Time,pop,KIDZ BOP Kids,1,255,{},(Me plus you)\nI'ma tell you one time\n(Me plu...,1441887,en,en,en
704379,You Burn My Heart,pop,Razvan Ionescu,1,239,{},I know it\nI know how you feel\nI'll leave you...,2288441,en,en,en
648227,Euphoria,pop,Juventa,1,230,"{""Aloma Steele""}",I laid shattered on the ground\nLoving was my ...,2159712,en,en,en


Most of the retrieved titles seem to be recent, not very popular songs with badly indexed years.

A case-by-case correction is too tedious given the amount of data to process. We will only keep the data with plausibly formatted years.

In [20]:
# keep plausible years; copy so that adding columns later does not act on a view
df = df[(df.year >= 1960) & (df.year < 2023)].copy()
len(df)

674959

We want to analyze the lyrics by decade, so let's add a decade column.
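
The decade of a year is the year rounded down to the nearest multiple of 10. The cell below computes it row by row with `math.trunc`. As an alternative sketch, shown on a toy series with hypothetical values, the same result can be obtained with a vectorised integer division:

```python
import pandas as pd

# toy release years (hypothetical values)
years = pd.Series([1960, 1977, 1989, 2008, 2022], name="year")

# floor-divide by 10, then multiply back: 1977 -> 197 -> 1970
decades = (years // 10) * 10
print(decades.tolist())
```

On a large frame, the vectorised form avoids calling a Python function once per row.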

In [21]:
import math

df['decade'] = df['year'].map(lambda x : int(math.trunc(x / 10) * 10))

df.sort_values(by = 'year').head(20)

Unnamed: 0,title,tag,artist,year,views,features,lyrics,id,language_cld3,language_ft,language,decade
132525,Laura,pop,Tony Bennett,1960,204,{},Laura is the face in the misty lights\nFootste...,1465455,en,en,en,1960
218061,Vanzettis Rock,pop,Woody Guthrie,1960,101,{},"I'm standin' on the rock, Vanzetti\nStandin' o...",1555756,en,en,en,1960
3717,Sleepy Lagoon,pop,The Platters,1960,1341,{},"A sleepy lagoon, a tropical moon, and two on a...",1329132,en,en,en,1960
285546,Young at Heart,pop,Perry Como,1960,307,{},Fairy tales can come true\nIt can happen to yo...,1627014,en,en,en,1960
409948,Tonight We Love,pop,Caterina Valente,1960,824,{},[Verse]\nTonight we love while the moon\nBeams...,1759985,en,en,en,1960
280634,Mister Blues,country,Ernest Tubb,1960,87,{},[Verse 1]\nWhen I come home at night I find hi...,1621835,en,en,en,1960
203936,Zhankoye,pop,The Limeliters,1960,1411,{},"Az man fort kine Sevastopol, is nit veit fun S...",1540939,en,en,en,1960
368046,I Want You with Me,pop,Bobby Darin,1960,201,{},"When I was little, my mama said to me\nSomeday...",1714305,en,en,en,1960
173899,When Your Lover Has Gone,pop,Ricky Nelson,1960,63,{},"When you're alone, who cares for starlit skies...",1509230,en,en,en,1960
426113,Time Hurts As Well as It Heals,pop,Don Gibson,1960,129,{},You say that in time I'll forget how I feel bu...,1777134,en,en,en,1960


## Data visualisation

Let's build some visualisations to get a better understanding of our data. As we saw on the previous distribution graphs over the years, more titles have been recorded over the last two decades.

In [22]:
# barplot by decade
def barplot_by_decade(df):

    # groupby decade
    df_d = df.groupby(['decade']).size().reset_index(name='count')

    # create the figure
    fig = go.Figure()

    fig.add_bar(
        x=df_d.decade,
        y=df_d['count'],
        showlegend=False)

    fig.add_scatter(
            x=df_d.decade,
            y=df_d["count"],
            mode="markers+lines",
            name="trend",
            showlegend=False)

    fig.update_layout(
            title = "Music release over years",
            xaxis_title="decade",
            yaxis_title="release")
    return fig

# build and display
fig = barplot_by_decade(df)
fig.show()

In [23]:
# compute tag frequencies by decade
df_pies_d = df.groupby(['decade','tag']).size().reset_index(name='count')
df_pies_d[df_pies_d.decade == 1960]

Unnamed: 0,decade,tag,count
0,1960,country,1427
1,1960,misc,141
2,1960,pop,7268
3,1960,rap,2
4,1960,rb,674
5,1960,rock,1165


In [24]:
from plotly.subplots import make_subplots

# create the subplot grid
fig = make_subplots(rows=3, cols=3,
                    specs=[
                        [{'type':'domain'}
                        for i in range(1,4)] for i in range(1,4)
                    ])
decades = df_pies_d.decade.unique().tolist()
for i in range(0,3):
    for k in range(0,3):
        decade = decades[i*3 + k]
        # group by decade
        df_p = df_pies_d[df_pies_d.decade == decade]
        # add figure
        fig.add_trace(go.Pie(labels=df_p.tag, values=df_p['count'], name=decade), i+1, k+1)
        # add annotation
        fig.add_annotation(arg=dict(
            text=decade, x=k*0.375 + 0.125,
            y=-i*0.3927 + 0.90, font_size=10,
            showarrow=False))
        if (i*3 + k) == 6:
            break


# Use `hole` to create a donut-like pie chart
fig.update_traces(hole=.4, hoverinfo="label+percent+name")

fig.update_layout(
    title_text="Tags proportions by decades"
    # Add annotations in the center of the donut pies.
    #annotations=[dict(text=decade, x=k*0.375+0.125, y= -i*0.125+0.90, font_size=10, showarrow=False)
     #           for k, decade in enumerate(decades) for i in range(0,4)]
)
fig.show()

The pie charts show how the proportions of music genres evolve by decade from 1960 to today. In the sixties, the most listed genre is **pop**. Then a rising genre emerges: **rock**. Its share grows over the decades until it overtakes pop during the 1990s and 2000s. The two genres then even out in the face of the meteoric rise of the most listed genre of our current decade, namely **rap**. These proportions can give us a fairly good idea of the most popular genres of each era. However, it is important to remember that the data is biased: Genius is a crowdsourced tool launched in 2009, originally dedicated to interpreting rap lyrics only, so it is expected to find a large amount of data from this period, and especially from this genre.

## Text preprocessing

After the visualisation part, let's focus on the main data: the lyrics.

In [25]:
df.iloc[0]["lyrics"]

"[Verse 1]\nAbuse me and confuse me\nBut never, never, never use me\nAh, you leave me so tired\nSo utterly uninspired\n\n[Pre-Chorus 1]\nThunder Bay was a drag, baby\nThunder Bay was a drag, baby\n\n[Verse 2]\nUpon the gravel and the dust\nIf I don't move I will rust\nYou're a faint recurring melody\nI can't seem to recall\n\n[Chorus]\nIn my mind, in my soul\nI never really loved you\nIn my mind, in my soul\nI never really loved you\n[Verse 3]\nAh, the sun shines off the power lines\nAnd the trees, they wave me on\nThere's a black cloud of happiness\nI can't finish what I've begun\n\n[Pre-Chorus 2]\nWe hit Duluth on a jag, baby\nWe hit Duluth on a jag, baby\n\n[Verse 4]\nI grab my coat, my hat and my paperback\nFrom the corner of my eyes, I see you smile\n\n[Chorus]\nIn my mind, in my soul\nI never really loved you\nIn my mind, in my soul\nI never really loved you\n\n[Bridge 1]\nI hear what they're saying\nThere's no use in praying\nSo I'll just slip away\n\n[Guitar Solo]\n[Chorus]\nIn

There are many undesirable characters, such as the line break '\n', digits, and square, curly and round brackets. So let's clean this data with regular expressions.

In [26]:
import re
from numpy.random import randint

def clean_text(text):
    # replace line breaks with spaces
    text = text.replace('\n', ' ')
    # remove punctuation
    text = re.sub(r'[,\.!?]', '', text)
    # remove text in square brackets (section markers such as [Chorus])
    text = re.sub(r'\[.*?\]', ' ', text)
    # remove tokens containing digits
    text = re.sub(r'\w*\d\w*',' ', text)
    # remove round brackets
    text = re.sub(r'[()]', ' ', text)
    # convert all words to lower case
    text = text.lower()
    return text

# get the results of data cleaning
cleaned_text = df["lyrics"].apply(clean_text)

In [27]:
docs = cleaned_text.to_list()
docs[0]

"  abuse me and confuse me but never never never use me ah you leave me so tired so utterly uninspired    thunder bay was a drag baby thunder bay was a drag baby    upon the gravel and the dust if i don't move i will rust you're a faint recurring melody i can't seem to recall    in my mind in my soul i never really loved you in my mind in my soul i never really loved you   ah the sun shines off the power lines and the trees they wave me on there's a black cloud of happiness i can't finish what i've begun    we hit duluth on a jag baby we hit duluth on a jag baby    i grab my coat my hat and my paperback from the corner of my eyes i see you smile    in my mind in my soul i never really loved you in my mind in my soul i never really loved you    i hear what they're saying there's no use in praying so i'll just slip away      in my mind in my soul i never really loved you in my mind in my soul i never really loved you    do do do do do do do do do do do do do do do do do do do do do do do

In [28]:
# update dataframe
df.update(cleaned_text)
df.head(3)

Unnamed: 0,title,tag,artist,year,views,features,lyrics,id,language_cld3,language_ft,language,decade
0,Life Floats By,rock,The Jayhawks,2000,91,{},abuse me and confuse me but never never neve...,1325199,en,en,en,2000
1,Red Rubber Ball,rock,Streetlight Manifesto,2010,725,{},i should have known you'd bid me farewell ther...,1325200,en,en,en,2010
2,The Lads Who Fought Won,pop,The Real McKenzies,2008,1386,{},that serbian man assassinated archduke ferdina...,1325201,en,en,en,2000


That's better! The libraries we will use later for topic modeling usually provide their own preprocessing, but it is always good to keep control over what we manipulate.
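
The cleaned lyrics still contain runs of several spaces where section markers and line breaks were removed. This is harmless for a tokenizer, but if a tidier string is wanted, one possible extra step is to collapse the whitespace. The helper name `collapse_whitespace` is hypothetical and not part of the original notebook:

```python
import re

def collapse_whitespace(text: str) -> str:
    """Replace any run of whitespace with one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()

# excerpt in the style of the cleaned lyrics shown above
raw = "  abuse me and confuse me    thunder bay was a drag baby  "
tidy = collapse_whitespace(raw)
print(repr(tidy))
```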

## Topic modeling

I will try two approaches to topic modeling:
- [LDA (latent Dirichlet allocation)](https://fr.wikipedia.org/wiki/Allocation_de_Dirichlet_latente) has been the common way to do topic modeling in the last few years. It works well and is quite easy to use with common Python libraries like [Gensim](https://radimrehurek.com/gensim/auto_examples/index.html).
- [BERTopic](https://maartengr.github.io/BERTopic/index.html) seems to be one of the best current techniques for topic modeling. It combines [BERT](https://en.wikipedia.org/wiki/BERT_(language_model)), the famous language model, with a [c-TF-IDF](https://maartengr.github.io/BERTopic/api/ctfidf.html) transformer.
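
To give an intuition of the c-TF-IDF weighting mentioned above, here is a minimal pure-Python sketch. It follows the formula as I understand it from the BERTopic documentation: the frequency of a term inside a class (normalised here by the class length), multiplied by log(1 + A / f), where A is the average number of words per class and f is the frequency of the term across all classes. The function `c_tf_idf` and the toy classes are assumptions for illustration; this is not the BERTopic implementation.

```python
import math
from collections import Counter

def c_tf_idf(class_tokens):
    """Score terms per class.

    class_tokens maps a class label to the list of tokens obtained by
    concatenating all the documents of that class.
    """
    tf = {label: Counter(tokens) for label, tokens in class_tokens.items()}

    # frequency of each term across all classes
    total = Counter()
    for counts in tf.values():
        total.update(counts)

    # average number of words per class
    avg_words = sum(total.values()) / len(class_tokens)

    scores = {}
    for label, counts in tf.items():
        n_tokens = sum(counts.values())
        scores[label] = {
            term: (count / n_tokens) * math.log(1 + avg_words / total[term])
            for term, count in counts.items()
        }
    return scores

# toy classes: one "document" per decade (hypothetical lyrics)
toy = {
    1960: "love love moon heart".split(),
    2010: "love money city money".split(),
}
scores = c_tf_idf(toy)
top_2010 = max(scores[2010], key=scores[2010].get)
print(top_2010)
```

In this toy example, "love" is frequent in both classes and is therefore down-weighted for 2010, while a term concentrated in one class gets a higher score there.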


### Define default tokenizer and Lemmatizer

In [29]:
import spacy

print("set gpu: ", spacy.prefer_gpu())

# small model /!\ take the bigger one for Kaggle
new_nlp = spacy.load('en_core_web_sm')

set gpu:  False


It could be difficult to process all this data on my computer or on Kaggle: the memory would quickly be overwhelmed. I will work with a sample of the previously loaded data to avoid memory overload.

In [30]:
from collections import Counter

# NOTE: this function shadows `random.sample` imported in the first cell,
# so `Loader.random_pickles` should not be called after this cell
def sample(balanced = True, data = df, prop = 0.1):
    if balanced:
        # compute the decade frequencies, sorted in decreasing order
        decade_frequencies = Counter(data['decade']).most_common()
        print(decade_frequencies)

        # retrieve the size of the least represented decade
        nb_under_class = decade_frequencies[-1][1]

        # draw the same number of songs from each decade (undersampling)
        sample_df = data.groupby("decade").sample(n = nb_under_class, random_state = 500)
    else:
        # draw a random fraction `prop` of the given data
        sample_df = data.sample(frac = prop)
    return sample_df
        

In [31]:
# sample the data
sdf = sample()

[(2010, 393178), (2000, 152623), (1990, 71905), (1980, 28803), (1970, 17241), (1960, 10677), (2020, 532)]
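
The frequencies printed above determine the size of the balanced sample: every decade is undersampled to the size of the least represented one. The short worked example below reuses the printed counts; the variable names are my own.

```python
# decade frequencies copied from the output above
decade_frequencies = [
    (2010, 393178), (2000, 152623), (1990, 71905), (1980, 28803),
    (1970, 17241), (1960, 10677), (2020, 532),
]

total_rows = sum(count for _, count in decade_frequencies)

# size of the least represented decade
n_per_decade = min(count for _, count in decade_frequencies)

# the balanced sample keeps n_per_decade songs in each decade
balanced_size = n_per_decade * len(decade_frequencies)
kept_fraction = balanced_size / total_rows

print(n_per_decade, balanced_size, f"{kept_fraction:.2%}")
```

The balanced sample is therefore small compared with the loaded batch, which is the price paid for giving every decade the same weight.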


In [32]:
# check the distribution of the sample
barplot_by_decade(sdf)

In [33]:
# default preprocessing
def preprocess(text, nlp = new_nlp):

    # TOKENISATION
    tokens = list(nlp(text))

    # REMOVING STOP WORDS (use the stop words of the model given in argument)
    spacy_stopwords = nlp.Defaults.stop_words
    sentence = [word for word in tokens if word.text.isalpha() and word.text not in spacy_stopwords]

    # LEMMATISATION
    sentence = [word.lemma_ for word in sentence]

    return sentence

### Visualise the most frequent words over decades

In [34]:
from tqdm import tqdm
from sklearn.feature_extraction.text import CountVectorizer

class TermsDocumentsMatrix():
    
    def __init__(self, sdf, decades=[1960, 1970], colorscale = 'Plotly3'):
        # vectorizer on the sample lyrics
        self.__vectorizer = CountVectorizer(tokenizer = preprocess)
        # fit and transform the data
        self.__data_vectorized = self.__vectorizer.fit_transform(
            tqdm(sdf['lyrics'].loc[sdf['decade'].isin(decades)])
        )
        # get decades informations
        self.__decades = sdf['decade'].loc[sdf['decade'].isin(decades)].reset_index(drop=True)
        self.__unique_decades = decades
        # get colorscale template
        self.__colorscale = colorscale
    
    def get_tdmatrix(self):
        
        # compute a Matrix terms document by decades
        df_bw = pd.DataFrame(self.__data_vectorized.toarray(),
                    columns = self.__vectorizer.get_feature_names_out())
        
        # check the length
        if len(df_bw) != len(self.__decades):
            raise Exception('Not the same size')
        
        # concatenate decade
        df_bw['decade'] = self.__decades
        
        # check NaN values
        if len(df_bw.columns[df_bw.isna().any()].tolist()) != 0:
            raise Exception('Decade got Nan values')

        return df_bw
    
    def get_tdm_by_decade(self, decade):
        
        if decade not in self.__unique_decades:
            raise Exception("{} doesn't appear in the decades list".format(decade))
        
        # compute a Matrix terms document by decades (bag of words format)
        df_bw = pd.DataFrame(self.__data_vectorized.toarray(),
                    columns = self.__vectorizer.get_feature_names_out())
        
        # check the length
        if len(df_bw) != len(self.__decades):
            raise Exception('Not the same size')
        
        # concatenate decade
        df_bw['decade'] = self.__decades
        
        # check NaN values
        if len(df_bw.columns[df_bw.isna().any()].tolist()) != 0:
            raise Exception('Decade got Nan values')
        
        # select suitable decade
        df_bw = df_bw[df_bw['decade'] == decade]
        
        return df_bw
    
    def most_freq_terms(self, n_rows = 1, n_cols = 2, n_terms = 10):
        
        # create the document terms matrix
        df_bw = self.get_tdmatrix()
        
        # create the subplots, one per decade
        fig = make_subplots(rows=n_rows, cols=n_cols,
                            x_title = 'number of occurrences',
                            y_title = 'terms',
                            subplot_titles = [str(d) for d in self.__unique_decades])
        
        for i in range(0,n_rows):
            for k in range(0,n_cols):
                # position of the subplot in the decades list (row-major order)
                idx = i*n_cols + k
                if idx >= len(self.__unique_decades):
                    break
                
                # get the decade
                decade = self.__unique_decades[idx]
            
                # select the suitable decade and delete decade column
                df_decade = df_bw.loc[df_bw.decade == decade, df_bw.columns != 'decade']
                
                # keep the n_terms most frequent terms, reversed so that
                # the most frequent one is drawn at the top of the chart
                terms_freq = df_decade.sum().sort_values(ascending = False)
                terms_freq = terms_freq.head(n_terms).iloc[::-1]
                
                # add figure
                fig.add_trace(go.Bar(y=terms_freq.index.tolist(),
                                     x=terms_freq.values,
                                     name=str(decade),
                                     orientation='h', showlegend = False,
                                     marker = dict(color = terms_freq.values,
                                                   colorscale=self.__colorscale)),
                              i+1, k+1)
        return fig

In [35]:
# first decades
tdm = TermsDocumentsMatrix(sdf, decades = [1960, 1970, 1980, 1990],
                           colorscale = 'Plasma')

# display bar charts of most frequent terms
tdm.most_freq_terms(n_rows = 2, n_cols = 2, n_terms = 15)

100%|██████████| 2128/2128 [01:29<00:00, 23.72it/s]


According to the bar charts above, the same group of words recurs in every decade: love, know, go, feel... These words are typical of popular songs, which often talk about love. This is consistent with our previous analysis of the pie charts showing the proportions of musical styles over time. We also notice a strong presence of onomatopoeias such as yeah or oh.
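
The class above builds a dense documents × terms matrix before summing the counts. For the single purpose of ranking terms, the same ranking can be obtained by counting tokens directly. The sketch below is an illustration only: it assumes already tokenised lyrics, and both the toy data and the `top_terms_by_decade` helper are hypothetical.

```python
from collections import Counter

# Toy tokenised lyrics grouped by decade (hypothetical data)
docs = [
    (1960, ["love", "baby", "love", "oh"]),
    (1960, ["love", "know", "oh"]),
    (1970, ["go", "feel", "love", "go"]),
]

def top_terms_by_decade(documents, n_terms=2):
    """Return the n_terms most frequent tokens for each decade."""
    counters = {}
    for decade, tokens in documents:
        # one running counter per decade
        counters.setdefault(decade, Counter()).update(tokens)
    return {d: c.most_common(n_terms) for d, c in counters.items()}

top = top_terms_by_decade(docs)
print(top)
```

Counting this way never materialises the full matrix, which may matter when the sample grows beyond a few thousand songs.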

In [36]:
# last decades
tdm = TermsDocumentsMatrix(sdf, decades = [2000, 2010, 2020],
                          colorscale = 'Plasma')

# most frequent terms
tdm.most_freq_terms(n_rows = 2, n_cols = 2, n_terms = 15)

100%|██████████| 1596/1596 [01:19<00:00, 19.95it/s]


We get similar results for this second group of decades, with similar high-occurrence words. We see a greater amount of onomatopoeia in the most recent decades. One possible explanation is the rise of rap music over the current and previous decades. This style makes heavy use of 'ad-libs': sounds, words or onomatopoeias that artists pronounce between two verses or at the end of a line to give more impact to their text and to liven up the track. This may explain the greater presence of onomatopoeia in the lyrics of these decades.


### Topic modeling with LDA

LDA (Latent Dirichlet Allocation) is a common technique in topic modeling. We first apply the basic preprocessing defined above.

In [40]:
# gensim
from gensim.models import CoherenceModel
from gensim.models.ldamodel import LdaModel
from gensim.models.ldamulticore import LdaMulticore
from gensim.corpora.dictionary import Dictionary
from gensim.test.utils import datapath
from gensim.utils import ClippedCorpus

# utils
from datetime import datetime
import logging
import re

# dashboards
import pyLDAvis
import pyLDAvis.gensim
import matplotlib.pyplot as plt

# TSNE dependencies
from sklearn.manifold import TSNE
from bokeh.plotting import figure, output_file, show
from bokeh.models import Label
from bokeh.io import output_notebook
import numpy as np
import matplotlib.colors as mcolors

#gensim_log = '/kaggle/working/log/gensim.log'

#initiate log file
#logging.basicConfig(filename=gensim_log,
#                   format='%(asctime)s:%(levelname)s:%(message)s',
#                    level=logging.INFO,
#                   force = True)

# utils
def parse_logfile(path_log):
    matcher = re.compile(r'(-*\d+\.\d+) per-word .* (\d+\.\d+) perplexity')
    likelihoods = []
    with open(path_log) as source:
        for line in source:
            match = matcher.search(line)
            if match:
                likelihoods.append(float(match.group(1)))
    return likelihoods


class LDATopicModeling():
    
    def __init__(self, df = sdf,
                 decade = 1960,
                 directory = "/kaggle/working/models/",
                 existing = False,
                 n_topics = 10,
                 worker_nodes = None,
                lang_preprocess = preprocess,
                cross_valid = False):
        # Apply preprocessing on decade data
        self.__documents = df.loc[df.decade == decade, 'lyrics'].apply(lang_preprocess)
            
        # Create a corpus from a list of texts
        self.__id2word = Dictionary(self.__documents.tolist())
        self.__corpus = [self.__id2word.doc2bow(doc) for doc in self.__documents.tolist()]
        self.__n_topics = n_topics
        
        #training
        if existing and os.path.isfile(existing):
            # Load a pretrained model from disk (existing is the path of the saved model).
            self.model = LdaModel.load(existing)
            self.__cv_results = None # no cross_valid
        elif not cross_valid:
            self.model = LdaMulticore(
                corpus=tqdm(self.__corpus),
                id2word=self.__id2word,
                num_topics=n_topics,
                workers=worker_nodes,
                passes=50)
            self.__cv_results = None
        else: # grid search over the hyperparameters, scored by coherence
            
            # hyperparameter
            alpha = np.arange(0.01, 1, 0.3).tolist()
            alpha.append('symmetric')
            alpha.append('asymmetric')
            
            # hyperparameter
            eta = np.arange(0.01, 1, 0.3).tolist()
            eta.append('symmetric')
            
            # candidate numbers of topics
            topics_range = range(2, n_topics+1)
            
            # results of the search
            cv_results = {
                 'topics': [],
                 'alpha': [],
                 'eta': [],
                 'coherence': []
            }
            
            # progress bar over all the combinations
            pbar = tqdm(total=(len(eta)*len(alpha)*len(topics_range)))
            model_list = []
            
            for k in topics_range:
                for a in alpha:
                    for b in eta:
                        
                        # train the model
                        model = LdaMulticore(
                            corpus=self.__corpus,
                            id2word=self.__id2word,
                            num_topics=k,
                            workers=worker_nodes,
                            passes=10,
                            alpha=a,
                            eta=b)
                        model_list.append(model)
                        
                        # compute coherence
                        cv = CoherenceModel(
                            model=model,
                            texts=self.__documents.tolist(),
                            dictionary=self.__id2word,
                            coherence='c_v')
                        
                        # Save the model results
                        cv_results['topics'].append(k)
                        cv_results['alpha'].append(a)
                        cv_results['eta'].append(b)
                        cv_results['coherence'].append(cv.get_coherence())
                        
                        # update the progress bar
                        pbar.update(1) 
            pbar.close()
                        
            # choose the model given the best coherence
            self.model = model_list[np.argmax(cv_results['coherence'])]
            # keep the number of topics of the selected model
            self.__n_topics = self.model.num_topics
            
            # save results as attribute
            self.__cv_results = cv_results
                                    
            # logging doesn't work on Kaggle
            #self.__likelihood = parse_logfile()
        # no likelihood history as long as the gensim log file cannot be parsed
        self.__likelihood = None
        # directory path
        self.__directory = directory
    
    # getters
    @property
    def get_id2word(self):
        return self.__id2word
    
    @property
    def get_corpus(self):
        return self.__corpus
    
    @property
    def get_likelihood(self):
        return self.__likelihood
    
    @property
    def get_cv_results(self):
        return pd.DataFrame(self.__cv_results) if self.__cv_results else None
    
    def plot_coherence(self, metric = 'alpha'):
        """metric(str): alpha or eta
        """                         
        # get the dataframe (get_cv_results is a property, not a method)
        df_res = self.get_cv_results
                                    
        if df_res is None:
            raise Exception('No cross validation available')
                                    
        # group by metric (cast to str because the values mix floats and strings)
        grouped = df_res.groupby(df_res[metric].astype(str))
        # create the layout
        fig = go.Figure()
        for level, df in grouped:
            fig.add_trace(
                go.Scatter(
                    x=df.topics,
                    y=df.coherence,
                    mode='lines+markers',
                    name=str(level)
                )
            )
        fig.update_layout(
            title = "coherence over topics by {}".format(metric),
            xaxis_title="topic",
            yaxis_title="coherence")
        return fig
                                    
                                    
    def save_current_model(self):
        # retrieve time
        now = datetime.now()
        # create the time-stamped directory if it doesn't exist
        model_dir = os.path.join(self.__directory, now.strftime("%d%m%Y_%H%M%S"))
        os.makedirs(model_dir, exist_ok=True)
        # Save model to disk.
        temp_file = os.path.join(model_dir, 'model')
        
        self.model.save(temp_file)

    def get_perplexity(self):
        return self.model.log_perplexity(self.__corpus)
    
    def get_coherence(self):
        coherence_model_lda = CoherenceModel(
            model=self.model,
            texts=self.__documents,
            dictionary=self.__id2word,
            coherence='c_v')
        return coherence_model_lda.get_coherence()
    
    
    # data vizualisation
    def dashboard_LDAvis(self):
        # some basic dataviz
        pyLDAvis.enable_notebook()
        vis = pyLDAvis.gensim.prepare(self.model, self.__corpus,
                                      dictionary = self.model.id2word)
        return vis
        
    def plot_tsne(self):
        # one row per document: a vector of length n_topics
        # where tmp[i] is the probability of topic i in the document
        topic_weights = []
        for row_list in self.model[self.get_corpus]:
            tmp = np.zeros(self.__n_topics)
            for i, w in row_list:
                tmp[i] = w
            topic_weights.append(tmp)


        # Array of topic weights    
        arr = pd.DataFrame(topic_weights).fillna(0).values

        # Keep the well separated points (optional)
        arr = arr[np.amax(arr, axis=1) > 0.35]

        # Dominant topic number in each doc
        topic_num = np.argmax(arr, axis=1)

        # tSNE Dimension Reduction
        tsne_model = TSNE(n_components=2, verbose=1, init='pca')
        tsne_lda = tsne_model.fit_transform(arr)

        # Plot the Topic Clusters using Bokeh
        output_notebook()
        mycolors = np.array([color for name, color in mcolors.TABLEAU_COLORS.items()])
        plot = figure(title="t-SNE Clustering of {} LDA Topics".format(self.__n_topics), 
                      plot_width=900, plot_height=700)
        plot.scatter(x=tsne_lda[:,0], y=tsne_lda[:,1], color=mycolors[topic_num])
        show(plot)
        
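    def plot_likelihood(self):
        if self.__likelihood is None:
            raise Exception('No likelihood available')
        # keep the last 50 values
        likelihood = self.__likelihood[-50:]
        fig = go.Figure(
            go.Scatter(x=list(range(len(likelihood))), y=likelihood,
                       mode='lines',
                       name='lines'))
        fig.update_layout(
            title = "Likelihood over passes",
            xaxis_title="passes",
            yaxis_title="Likelihood")
        return fig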
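
The cross-validation branch of the class is in practice an exhaustive grid search over the number of topics, `alpha` and `eta`, with one LDA model trained per combination. Before launching it, it can help to enumerate the grid to estimate its cost. The sketch below rebuilds the grid with the standard library only, assuming the default `n_topics = 10`; the numeric values mirror what `np.arange(0.01, 1, 0.3)` produces.

```python
from itertools import product

# Values mirroring np.arange(0.01, 1, 0.3) -> 0.01, 0.31, 0.61, 0.91
numeric_priors = [round(0.01 + 0.3 * i, 2) for i in range(4)]

# Candidate priors, as in the class: numeric values plus gensim's string options
alphas = numeric_priors + ["symmetric", "asymmetric"]
etas = numeric_priors + ["symmetric"]

# Candidate numbers of topics: 2 .. n_topics (n_topics = 10 assumed here)
topics_range = range(2, 10 + 1)

# One tuple (num_topics, alpha, eta) per model to train
grid = list(product(topics_range, alphas, etas))
print(len(grid))
print(grid[0], grid[-1])
```

With these settings the search trains 270 models of 10 passes each, so it is worth restricting the grid or the sample size first.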

In [41]:
# create my model
lda_model = LDATopicModeling()

100%|██████████| 532/532 [00:00<00:00, 1849.72it/s]


In [42]:
# print the result
lda_model.model.print_topics()

[(0,
  '0.062*"hey" + 0.038*"get" + 0.018*"yeah" + 0.014*"baby" + 0.014*"boy" + 0.013*"home" + 0.013*"come" + 0.012*"girl" + 0.010*"go" + 0.010*"child"'),
 (1,
  '0.033*"like" + 0.025*"tina" + 0.016*"come" + 0.015*"baby" + 0.014*"get" + 0.010*"way" + 0.009*"know" + 0.008*"go" + 0.008*"oh" + 0.007*"mama"'),
 (2,
  '0.020*"know" + 0.012*"oh" + 0.011*"let" + 0.009*"time" + 0.009*"eye" + 0.009*"walk" + 0.008*"baby" + 0.008*"mind" + 0.008*"way" + 0.007*"hang"'),
 (3,
  '0.019*"away" + 0.015*"mountain" + 0.015*"bind" + 0.012*"spectacle" + 0.010*"commodity" + 0.010*"music" + 0.008*"chile" + 0.007*"sadie" + 0.006*"marry" + 0.006*"chase"'),
 (4,
  '0.032*"da" + 0.013*"time" + 0.011*"like" + 0.010*"man" + 0.009*"oh" + 0.009*"good" + 0.009*"wish" + 0.009*"night" + 0.009*"come" + 0.008*"go"'),
 (5,
  '0.068*"love" + 0.029*"baby" + 0.027*"know" + 0.024*"go" + 0.020*"oh" + 0.014*"to" + 0.014*"come" + 0.013*"la" + 0.013*"get" + 0.012*"time"'),
 (6,
  '0.053*"c" + 0.028*"ahh" + 0.027*"angel" + 0.024*"

In [44]:
lda_model.plot_tsne()


The default learning rate in TSNE will change from 200.0 to 'auto' in 1.2.


The PCA initialization in TSNE will change to have the standard deviation of PC1 equal to 1e-4 in 1.2. This will ensure better convergence.



[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 527 samples in 0.001s...
[t-SNE] Computed neighbors for 527 samples in 0.020s...
[t-SNE] Computed conditional probabilities for sample 527 / 527
[t-SNE] Mean sigma: 0.005891
[t-SNE] KL divergence after 250 iterations with early exaggeration: 58.189865
[t-SNE] KL divergence after 1000 iterations: 0.285495
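
The log above indexes 527 samples although the model was trained on 532 documents, which would be consistent with `plot_tsne` discarding the documents whose strongest topic weight does not exceed 0.35. The sketch below reproduces that filtering step on toy values, without numpy; the topic distributions and the `to_dense` helper are hypothetical.

```python
# Sparse topic distributions, one list of (topic_id, weight) per document
# (hypothetical values for three documents and four topics)
doc_topics = [
    [(0, 0.90), (2, 0.10)],
    [(1, 0.30), (2, 0.30), (3, 0.40)],
    [(0, 0.25), (1, 0.25), (2, 0.25), (3, 0.25)],
]
n_topics = 4

def to_dense(pairs, size):
    """Expand a sparse list of (topic_id, weight) pairs into a dense vector."""
    dense = [0.0] * size
    for topic_id, weight in pairs:
        dense[topic_id] = weight
    return dense

dense_weights = [to_dense(pairs, n_topics) for pairs in doc_topics]

# Keep only the documents whose strongest topic exceeds the threshold
threshold = 0.35
kept = [row for row in dense_weights if max(row) > threshold]

# Dominant topic = index of the largest weight in each kept document
dominant = [row.index(max(row)) for row in kept]
print(len(kept), dominant)
```

The third document is spread evenly over all topics, so it is dropped; only well-separated documents reach the t-SNE projection.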


In [45]:
lda_model.dashboard_LDAvis()


In a future version of pandas all arguments of DataFrame.drop except for the argument 'labels' will be keyword-only



Let's perform LDA for each decade.

In [63]:
# LDA Topic Modeling by decade
class LDAPipeline():
    
    def __init__(self, decades = None):
        # default: every decade available in the sample
        if decades is None:
            decades = sorted(sdf['decade'].unique().tolist())
        self.models = {
            decade : LDATopicModeling(decade = decade) for decade in decades}
        
    def get_metrics(self):
        # compute metrics, one row per decade
        metrics = [
            {'decade' : decade, 'coherence' : model.get_coherence(),
             'perplexity': model.get_perplexity()}
            for decade, model in self.models.items()
        ]
        
        return pd.DataFrame(metrics).set_index('decade')
        
        
    def lda_info(self, decade):
        lda_model = self.models[decade]

        print("Perplexity: ", lda_model.get_perplexity())
        print("Coherence: ", lda_model.get_coherence())
        lda_model.plot_tsne()
        return lda_model.dashboard_LDAvis()

In [64]:
lda_models = LDAPipeline()

100%|██████████| 346/346 [00:00<00:00, 438.68it/s]
100%|██████████| 346/346 [00:00<00:00, 577.67it/s]
100%|██████████| 346/346 [00:01<00:00, 300.79it/s]
100%|██████████| 346/346 [00:00<00:00, 523.59it/s]
100%|██████████| 346/346 [00:00<00:00, 418.30it/s]
100%|██████████| 346/346 [00:00<00:00, 470.62it/s]
100%|██████████| 346/346 [00:00<00:00, 507.02it/s]


Display information for each decade.

In [65]:
lda_1960 = lda_models.lda_info(1960)

Perplexity:  -7.006579176800216
Coherence:  0.4234883164159899
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 346 samples in 0.001s...



[t-SNE] Computed neighbors for 346 samples in 0.020s...
[t-SNE] Computed conditional probabilities for sample 346 / 346
[t-SNE] Mean sigma: 0.012190
[t-SNE] KL divergence after 250 iterations with early exaggeration: 60.899666
[t-SNE] KL divergence after 1000 iterations: 0.251704




In [66]:
lda_1960

In [67]:
lda_1970 = lda_models.lda_info(1970)

Perplexity:  -6.9705142720912185
Coherence:  0.32976137988860504
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 345 samples in 0.001s...
[t-SNE] Computed neighbors for 345 samples in 0.009s...
[t-SNE] Computed conditional probabilities for sample 345 / 345
[t-SNE] Mean sigma: 0.010710



[t-SNE] KL divergence after 250 iterations with early exaggeration: 57.008461
[t-SNE] KL divergence after 1000 iterations: 0.223901




In [68]:
lda_1970

In [69]:
lda_1980 = lda_models.lda_info(1980)

Perplexity:  -7.33112650532853
Coherence:  0.4227321567171779



[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 344 samples in 0.001s...
[t-SNE] Computed neighbors for 344 samples in 0.015s...
[t-SNE] Computed conditional probabilities for sample 344 / 344
[t-SNE] Mean sigma: 0.013387
[t-SNE] KL divergence after 250 iterations with early exaggeration: 59.542175
[t-SNE] KL divergence after 1000 iterations: 0.279937




In [70]:
lda_1980

In [71]:
lda_1990 = lda_models.lda_info(1990)

Perplexity:  -7.412132443071911
Coherence:  0.3366493463108243
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 343 samples in 0.001s...
[t-SNE] Computed neighbors for 343 samples in 0.009s...
[t-SNE] Computed conditional probabilities for sample 343 / 343
[t-SNE] Mean sigma: 0.211093



[t-SNE] KL divergence after 250 iterations with early exaggeration: 57.145000
[t-SNE] KL divergence after 1000 iterations: 0.266714




In [72]:
lda_1990

In [73]:
lda_2000 = lda_models.lda_info(2000)

Perplexity:  -7.416894436047527
Coherence:  0.37379762877155975



[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 342 samples in 0.001s...
[t-SNE] Computed neighbors for 342 samples in 0.009s...
[t-SNE] Computed conditional probabilities for sample 342 / 342
[t-SNE] Mean sigma: 0.012040
[t-SNE] KL divergence after 250 iterations with early exaggeration: 58.915337
[t-SNE] KL divergence after 1000 iterations: 0.235085




In [74]:
lda_2000

In [75]:
lda_2010 = lda_models.lda_info(2010)

Perplexity:  -7.886718278572713
Coherence:  0.3796978707090458
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 345 samples in 0.001s...
[t-SNE] Computed neighbors for 345 samples in 0.009s...
[t-SNE] Computed conditional probabilities for sample 345 / 345
[t-SNE] Mean sigma: 0.012458



[t-SNE] KL divergence after 250 iterations with early exaggeration: 60.038513
[t-SNE] KL divergence after 800 iterations: 0.232442




In [76]:
lda_2010

In [77]:
lda_2020 = lda_models.lda_info(2020)

Perplexity:  -7.420023400679001
Coherence:  0.36113673695586224
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 344 samples in 0.001s...
[t-SNE] Computed neighbors for 344 samples in 0.009s...
[t-SNE] Computed conditional probabilities for sample 344 / 344
[t-SNE] Mean sigma: 0.235095



[t-SNE] KL divergence after 250 iterations with early exaggeration: 60.307220
[t-SNE] KL divergence after 1000 iterations: 0.286681




In [78]:
lda_2020

The basic LDA model gives us our baseline, with these default **perplexity** and **coherence** scores. Let's try to improve these scores, as well as our qualitative intuition about the topics. As we saw, the topics look very similar across decades, and it is quite difficult to retrieve good topics from the representation we computed.
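
The "Perplexity" values printed above are negative because, as far as I understand gensim's `log_perplexity`, it returns a per-word likelihood bound rather than the perplexity itself; gensim's own log message derives the perplexity estimate as 2 raised to minus that bound. Under that assumption, the sketch below converts the bounds copied from the outputs above and summarises the baseline per decade (`bound_to_perplexity` is a hypothetical helper).

```python
# Per-word bounds and coherence scores copied from the outputs above
results = {
    1960: (-7.006579176800216, 0.4234883164159899),
    1970: (-6.9705142720912185, 0.32976137988860504),
    1980: (-7.33112650532853, 0.4227321567171779),
    1990: (-7.412132443071911, 0.3366493463108243),
    2000: (-7.416894436047527, 0.37379762877155975),
    2010: (-7.886718278572713, 0.3796978707090458),
    2020: (-7.420023400679001, 0.36113673695586224),
}

def bound_to_perplexity(per_word_bound):
    """Convert a per-word log2 likelihood bound into a perplexity (assumed convention)."""
    return 2 ** (-per_word_bound)

summary = {
    decade: (bound_to_perplexity(bound), coherence)
    for decade, (bound, coherence) in results.items()
}

for decade, (perplexity, coherence) in sorted(summary.items()):
    print(f"{decade}: perplexity ~ {perplexity:8.1f} | coherence = {coherence:.3f}")

# Decade with the highest coherence and decade with the lowest perplexity
best_coherence = max(summary, key=lambda d: summary[d][1])
lowest_perplexity = min(summary, key=lambda d: summary[d][0])
```

Lower perplexity and higher coherence are better, and the two metrics do not have to agree on the best decade, so both are worth tracking when comparing preprocessing variants.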

## Improve the preprocessing

In this part, we will build a preprocessing function that takes bigrams and trigrams into account and sets aside the terms that were too recurrent in the previous part.

### ngram recognition with gensim

One way to improve the understanding of the lyrics is to detect bigrams and trigrams with gensim's phrase models (`Phrases` and `Phraser`).

In [46]:
import gensim

def sent_to_words(sentences):
    for sentence in sentences:
        # lowercase and tokenize; deacc=True removes accent marks
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

# split the lyrics into words
data = sdf['lyrics'].tolist()
data_words = list(sent_to_words(data))

In [47]:
# display the result
data_words[0][:10]

['the',
 'tour',
 'bus',
 'passed',
 'here',
 'yesterday',
 'exciting',
 'all',
 'the',
 'fools']

In [48]:
import gensim

# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)  

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_model = gensim.models.phrases.Phraser(bigram)
trigram_model = gensim.models.phrases.Phraser(trigram)

In [49]:
from numpy.random import randint, seed

# set seed from numpy
seed(18)

# draw a random document index
upper_bound = randint(len(data_words))

# display a random example
print('bigram model: ', bigram_model[data_words[upper_bound]])
print('\ntrigram model: ', trigram_model[bigram_model[data_words[upper_bound]]])
print('\nsentence: ',' '.join(data_words[upper_bound]))

bigram model:  ['hey', 'love', 'you', 'know', 'you', 'got', 'me', 'locked', 'up', 'and', 'in', 'more', 'ways', 'than', 'one', 'cause', 'in', 'jail', 'every', 'time', 'exhale', 'the', 'way', 'you', 'hold', 'that', 'gun', 'hearts', 'hearts', 'beating', 'fast', 'keep', 'moving', 'can', 'pass', 'your', 'little', 'criminal', 'mind', 'hearts', 'hearts', 'beating', 'fast', 'got', 'me', 'trapped', 'inside', 'the', 'past', 'feels', 'like', 'paralyzed', 'no', 'gave', 'away', 'the', 'one', 'real', 'key', 'to', 'my', 'heart', 'but', 'you', 'lied', 'and', 'tore', 'me', 'apart', 'changing', 'all', 'the', 'locks', 'swear', 'tonight', 'building', 'back', 'all', 'the', 'walls', 'you', 'broke', 'inside', 'can', 'deny', 'that', 'it', 'time', 'to', 'say', 'goodbye', 'changing', 'all', 'the', 'locks', 'swear', 'tonight', 'you', 'can', 'keep', 'the', 'key', 'for', 'your', 'other', 'locked', 'up', 'lovers', 'locked', 'up', 'lovers', 'you', 're', 'like', 'cop', 'locking', 'up', 'the', 'hotel', 'the', 'way', '

We can see some bigrams and trigrams. Most of the time, they combine a word with its adjective or group several onomatopoeias together.
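
To give an intuition of why only a few pairs are merged with `threshold=100`, here is a simplified re-implementation of the kind of collocation score that phrase detection relies on. It is a sketch under assumptions, not gensim's code: the formula follows my understanding of gensim's default scorer, `(count(a, b) - min_count) * vocabulary_size / (count(a) * count(b))`, and the toy sentences and the `score_bigrams` helper are hypothetical.

```python
from collections import Counter

# Toy tokenised lyrics (hypothetical data)
sentences = [
    ["new", "york", "city", "lights"],
    ["new", "york", "is", "calling"],
    ["i", "love", "new", "york"],
    ["city", "of", "love"],
]

def score_bigrams(docs, min_count=1):
    """Score adjacent pairs: (count(ab) - min_count) * |V| / (count(a) * count(b))."""
    unigrams = Counter(w for doc in docs for w in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)
    return {
        (a, b): (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        for (a, b), n_ab in bigrams.items()
    }

scores = score_bigrams(sentences)
best_pair = max(scores, key=scores.get)
print(best_pair, scores[best_pair])
```

A pair is merged into a single token only when its score exceeds the threshold, so a pair must co-occur far more often than its individual word counts would suggest; lowering the threshold yields more, but noisier, phrases.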

As previously explained, we would also like to set aside irrelevant terms that were recurrent in our last analysis. The previous bar charts show a significant number of such terms across all decades, for example like, know or yeah. Most of these uninteresting terms are verbs or onomatopoeias. To identify the least interesting ones, we compare the bar charts with the pyLDAvis visualizations. The verbs like, know, come and get recur strongly and are not very informative because they appear in most of the analyzed topics. The onomatopoeias oh, yeah and la also appear in most of the topics.
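
The visual comparison can be complemented by a simple count of how many topics each term appears in. The sketch below uses the top terms of the first six topics printed by `print_topics()` earlier (the remaining topics are truncated in the output). The `min_share` threshold and the helper name are assumptions made for illustration.

```python
from collections import Counter

# Top terms of the first six topics printed by print_topics() above
topics = {
    0: ["hey", "get", "yeah", "baby", "boy", "home", "come", "girl", "go", "child"],
    1: ["like", "tina", "come", "baby", "get", "way", "know", "go", "oh", "mama"],
    2: ["know", "oh", "let", "time", "eye", "walk", "baby", "mind", "way", "hang"],
    3: ["away", "mountain", "bind", "spectacle", "commodity", "music", "chile",
        "sadie", "marry", "chase"],
    4: ["da", "time", "like", "man", "oh", "good", "wish", "night", "come", "go"],
    5: ["love", "baby", "know", "go", "oh", "to", "come", "la", "get", "time"],
}

def recurrent_terms_across_topics(topic_terms, min_share=0.5):
    """Return the terms that appear in at least min_share of the topics."""
    n_topics = len(topic_terms)
    # count each term once per topic
    topic_freq = Counter(term for terms in topic_terms.values() for term in set(terms))
    return {term for term, n in topic_freq.items() if n / n_topics >= min_share}

candidates = recurrent_terms_across_topics(topics)
print(sorted(candidates))
```

On these six topics the count flags baby, come, get, go, know, oh and time. Terms such as baby or time may still carry meaning for lyrics, so the result is better treated as a list of candidates to review by hand than as a final stop-word list.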

In [98]:
# list recurrent terms
recurrent_terms = {'like','know','come','get', 'got','go','to','oh','yeah','la'}


# n-gram aware preprocessing
def ngram_preprocess(text, nlp = new_nlp,
                     bigram = bigram_model,
                     trigram = trigram_model,
                    new_stopwords = recurrent_terms):
    
    # perform basic preprocessing to transform sentence to list of words
    words = gensim.utils.simple_preprocess(text)
    
    # customize stopwords (use the pipeline passed as argument)
    spacy_stopwords = nlp.Defaults.stop_words
    ext_stopwords = spacy_stopwords | new_stopwords # union of set
    
    #removing stop words
    no_stop_words = [word for word in words if word not in ext_stopwords]
    
    # perform bigram model
    bigram_words = bigram[no_stop_words]
    
    # perform trigram model
    trigram_words = trigram[bigram_words]
    
    # recreate the sentence
    sentence = ' '.join(trigram_words)
    
    #tokenization to get lemma
    tokens = [token for token in nlp(sentence)]
    
    #LEMMATISATION and keep alphabetic tokens (n-grams are joined with '_')
    sentence = [word.lemma_ for word in tokens if word.text.replace('_', '').isalpha()]

    return sentence

In [99]:
# test with the previously drawn sentence
ngram_preprocess(' '.join(data_words[upper_bound]))[:10]

['hey',
 'love',
 'lock',
 'way',
 'cause',
 'jail',
 'time',
 'exhale',
 'way',
 'hold']
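
One detail worth keeping in mind when filtering tokens after the phrase models: gensim joins the words of a detected n-gram with an underscore by default (for example `locked_up`), and `str.isalpha()` returns `False` for such tokens. The sketch below compares a plain alphabetic filter with one that accepts underscore-joined n-grams; the token list and the `is_word_or_ngram` helper are hypothetical.

```python
# Tokens as they could come out of the phrase models (hypothetical example)
tokens = ["hey", "love", "locked_up", "42", "new_york_city"]

def is_word_or_ngram(token):
    """True for alphabetic tokens, including n-grams joined by underscores."""
    parts = token.split("_")
    return all(part.isalpha() for part in parts)

# A bare isalpha() check silently drops the n-grams
plain_filter = [t for t in tokens if t.isalpha()]

# Checking each underscore-separated part keeps them
ngram_filter = [t for t in tokens if is_word_or_ngram(t)]

print(plain_filter)
print(ngram_filter)
```

Printing a few processed lyrics and looking for underscore-joined tokens is a quick way to confirm that the detected n-grams actually reach the LDA model.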

Let's now rerun the LDA model with this new preprocessing.

In [101]:
ngram_model = LDATopicModeling(lang_preprocess = ngram_preprocess)

100%|██████████| 532/532 [00:00<00:00, 1742.98it/s]


In [103]:
ngram_model.dashboard_LDAvis()



## 2015 songs lyrics topic modelling

Let's first retrieve the English songs from 2015 (the year with the most lyrics listed on Genius).

Let's plot the tag distribution.

The bar plot below shows the frequency of each tag, colored by total views.

Let's try topic modelling with the top2vec library, which is the easiest to start with. But first, let's filter the lyrics.

# 2000 song lyrics topic modelling

## References