<a href="https://colab.research.google.com/github/travischristy/ai-music-tools/blob/dev/Genius_lyrics_data_processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.
import kagglehub
carlosgdcj_genius_song_lyrics_with_language_information_path = kagglehub.dataset_download('carlosgdcj/genius-song-lyrics-with-language-information')

print('Data source import complete.')


Downloading from https://www.kaggle.com/api/v1/datasets/download/carlosgdcj/genius-song-lyrics-with-language-information?dataset_version_number=1...


100%|██████████| 3.04G/3.04G [01:19<00:00, 41.0MB/s]

Extracting files...





Data source import complete.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Topic analysis based on Genius songs lyrics

Author : lievre.thomas@gmail.com

---

In this notebook, we explore the Genius data extracted from [Kaggle](https://www.kaggle.com/datasets/carlosgdcj/genius-song-lyrics-with-language-information).

**The aim of this analysis is to extract topics from lyrics and to identify the main topics by year or decade.**

This notebook was produced as a class project for the [text mining course (TDDE16)](https://www.ida.liu.se/~TDDE16/project.en.shtml) at Linköpings universitet.


## A few facts about the Genius website

Genius is an American digital company founded on August 27, 2009, by Tom Lehman, Ilan Zechory, and Mahbod Moghadam. Originally launched as Rap Genius with a focus on hip-hop music, it began as a crowdsourced website where people could post the lyrics of rap songs and give their interpretation of them. Over the years the site has grown to contain several million annotated texts from all eras (from the [Wikipedia Genius page](https://en.wikipedia.org/wiki/Genius_(company))).


## Load the data in memory

All the data are contained in a single large 9 GB CSV file (around 5 million rows), which can be difficult to load into memory at once. To deal with this issue, I wrote a loader class that reads the CSV in chunks and stores each chunk as a pickle file (6 files in total), a more compact format that is faster to load. The pickles can then be drawn at random to get a more general sample of the data. For now we assume that the pickle batches are identically distributed (we explore the batches in the second part). The class below handles the whole process.

In [None]:
import pandas as pd
from random import seed, sample
import pickle
import glob
import os

class Loader():

    def __init__(self, in_path, out_path, chunksize = 10**6):
        """
        Args:
            in_path (str): csv input path.
            out_path (str): Output directory path to store the pickles.
            chunksize (int, optional): Chunksize for DataFrame reader. Defaults to 10**6.
        """

        self.__in_path = in_path
        self.__out_path = out_path
        self.__chunksize = chunksize

    def __produce_pickles(self):
        """produce pickles by reading csv by chunksize
        """
        with pd.read_csv(self.__in_path, chunksize = self.__chunksize) as reader:
            try:
                os.makedirs(self.__out_path)
            except FileExistsError:
                # directory already exists
                pass
            for i, chunk in enumerate(reader):
                out_file = self.__out_path + "/data_{}.pkl".format(i+1)
                with open(out_file, "wb") as f:
                    pickle.dump(chunk, f, pickle.HIGHEST_PROTOCOL)

    def load_pickle(self, pickle_id):
        """load a pickle file by id

        Args:
            pickle_id (int): pickle id.

        Raises:
            Exception: The path of the given id isn't a file

        Returns:
            obj: DataFrame
        """
        # produce the pickles if the directory not exists or
        # if the directory is empty
        if (not os.path.exists(self.__out_path)) or \
              (len(os.listdir(self.__out_path)) == 0):
            self.__produce_pickles()

        # get the file path following the pickle_id
        # given in parameter
        file_path = self.__out_path + \
            "/data_" + str(pickle_id) + ".pkl"

        if os.path.isfile(file_path):
            df = pd.read_pickle(file_path)
        else:
            raise Exception("The pickle file data_{}.pkl doesn't exist".format(pickle_id))
        return df


    def random_pickles(self, n_pickles = 3, init = 42, verbose = True):
        """random reader over pickles files

        Args:
            n_pickles (int, optional): number of pickles to load. Defaults to 3.
            init (int, optional): Integer given to the random seed. Defaults to 42.
            verbose (bool, optional): Print the loaded files. Defaults to True

        Raises:
            Exception: Stop the process if n_pickles exceed pickle files number.

        Returns:
            obj: pd.Dataframe
        """

        # produce the pickles if the directory not exists or
        # if the directory is empty
        if (not os.path.exists(self.__out_path)) or \
              (len(os.listdir(self.__out_path)) == 0):
            self.__produce_pickles()

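        # sort the file list so that a given seed always draws the same files
        pickle_files = sorted(glob.glob(self.__out_path + "/data_*.pkl"))
        # draw p_files
        seed(init)

        # compare against the actual number of pickle files on disk
        if n_pickles <= len(pickle_files):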
            random_p_files = sample(pickle_files, n_pickles)
        else:
            raise Exception("The parameter n_pickles (" +
                            "{}) exceed the numbers of pickle files ({})"\
                                .format(n_pickles, len(pickle_files)))
        # print the drawed files
        if verbose:
            print("Loaded pickles:")
            for p in random_p_files:
                print(p)

        # load random pickles file
        df_list = [pd.read_pickle(p) for p in random_p_files]

        # create the dataframe by concatenate the previous
        # dataframes list
        df = pd.concat(df_list, ignore_index = True)
        return df

In [None]:
# create reader
#  /!\ change path in kaggle
kaggle_input = "/content/song_lyrics.csv"
kaggle_output = "/content/drive/MyDrive/Colab Notebooks/genius-song-dataset/output"

In [None]:
# initiate the file loader
loader = Loader(in_path = kaggle_input, out_path = kaggle_output)

# load pickle 3
df = loader.load_pickle(3)

In [None]:
!cp -r /content/drive/MyDrive/Colab\ Notebooks/genius-song-dataset/output1 /content/drive/MyDrive/Colab\ Notebooks/genius-song-dataset/output


Batches of data are randomly loaded into memory. The number of batches loaded depends on the memory capacity of the machine running the script. For this analysis, we only work on the loaded sample (the full dataset is available on Kaggle).
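As a minimal sketch of how a seeded random draw of batches behaves, the snippet below uses only the standard library. The file names and the helper `draw_pickles` are hypothetical stand-ins, not part of the `Loader` class; the point is only that a fixed seed applied to a sorted file list yields the same draw on every run.

```python
from random import seed, sample

# Hypothetical file names standing in for the pickles on disk
pickle_files = sorted(f"data_{i}.pkl" for i in range(1, 7))

def draw_pickles(files, n_pickles=3, init=42):
    """Draw n_pickles distinct file names, reproducibly for a given seed."""
    if n_pickles > len(files):
        raise ValueError(
            f"n_pickles ({n_pickles}) exceeds the number of files ({len(files)})")
    seed(init)
    return sample(files, n_pickles)

first_draw = draw_pickles(pickle_files)
second_draw = draw_pickles(pickle_files)
# Same seed and same (sorted) input list give the same draw
print(first_draw == second_draw)
```

Sorting the list matters because directory listings are not guaranteed to come back in the same order on every machine.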

# Exploring the raw data

Let's visualize and explore the raw data before moving on to a deeper analysis.

In [None]:
df.head()

Unnamed: 0,title,tag,artist,year,views,features,lyrics,id,language_cld3,language_ft,language
2000000,Roses,rock,Vitja,2017,399,{},The roses start to wither\nWhere the devil lay...,3019113,en,en,en
2000001,Keep On Pushin,rap,Problem,2017,692,"{""My Princess Aeryn""}",[Hook]\nImma keep on pushin\nImma keep on push...,3019114,en,en,en
2000002,Inside,pop,The jepettos,2017,1302,{},"[Intro]\nOoh, ooh, ah, ah (x2)\n\n[Verse 1]\nI...",3019115,en,en,en
2000003,Girls Like You,rb,PnB Rock,2017,60114,{},[Chorus]\nBaby it was real\nYeah we were the b...,3019117,en,en,en
2000004,Froideur,rap,N'Dirty Deh,2017,5813,"{""N\\'Dirty Deh""}","[Refrain]\nEt j'ai perdu la foi, mais c'est pa...",3019120,fr,fr,fr


For each song, we have several pieces of information:
- title of the song
- the tag (the music genre)
- the artist name
- the release year
- the number of page views
- the names of the featured artists
- the lyrics
- the Genius identifier
- Lyrics language according to [CLD3](https://github.com/google/cld3). Unreliable results are NaN. CLD3 is a neural network model for language identification.
- Lyrics language according to [FastText's langid](https://fasttext.cc/docs/en/language-identification.html). Values with low confidence (<0.5) are NaN. fastText is a library developed by Facebook’s AI Research lab for efficient learning of word representations and sentence classification; it also provides a fast and accurate tool for text-based language identification capable of recognizing more than 170 languages.
- `language`: combines language_cld3 and language_ft. It only has a non-NaN entry if they both "agree".

More information at this link : https://www.kaggle.com/datasets/carlosgdcj/genius-song-lyrics-with-language-information
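To illustrate the "agreement" rule behind the combined `language` column, here is a small sketch on toy data. It is an assumption about how such a column could be derived, not the dataset author's actual code; the toy rows are invented for the example.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical): two agreements, two cases where one detector gave NaN
toy = pd.DataFrame({
    "language_cld3": ["en", "fr", "ja", np.nan],
    "language_ft":   ["en", "fr", np.nan, "en"],
})

# NaN never compares equal, so a missing value on either side counts as disagreement
agree = toy["language_cld3"] == toy["language_ft"]

# Keep the detected language only where both detectors agree, NaN elsewhere
toy["language"] = toy["language_cld3"].where(agree)
print(toy)
```

Filtering on `language == 'en'` therefore keeps only lyrics that both detectors classified as English.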

In [None]:
df.dtypes

Unnamed: 0,0
title,object
tag,object
artist,object
year,int64
views,int64
features,object
lyrics,object
id,int64
language_cld3,object
language_ft,object


In [None]:
# display the size
print('Data frame size (row x columns):', df.size)
print('Data rows number: ', len(df))
print('Number of unique songs (following genius id): ', len(df.id.unique()))

Data frame size (row x columns): 11000000
Data rows number:  1000000
Number of unique songs (following genius id):  1000000


The Genius id appears to be a unique row identifier.
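The same check can be written more directly with pandas. The sketch below uses a toy frame with one id duplicated on purpose (hypothetical data), so that both the uniqueness test and the lookup of offending rows are visible.

```python
import pandas as pd

# Toy data (hypothetical): the id 3019114 appears twice on purpose
toy = pd.DataFrame({
    "id": [3019113, 3019114, 3019115, 3019114],
    "title": ["Roses", "Keep On Pushin", "Inside", "Keep On Pushin"],
})

# True only if no id value is repeated
id_is_unique = toy["id"].is_unique

# All rows sharing an id with at least one other row
duplicates = toy[toy["id"].duplicated(keep=False)]

print(id_is_unique)
print(duplicates)
```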

Let's visualize the size of the raw data over the years, before preprocessing, to compare the batch distributions. One thing to know before visualizing the data: the pickles are created by reading the CSV in chunks.

The table displayed earlier gives us some information about the data. The CSV file seems to be sorted by id, so the pickle files are sorted too.

In [None]:
import plotly.express as px
import plotly.io as pio

# change theme template for every graph below
pio.templates.default = "plotly_white"

In [None]:
# get some information about the pickle data
def pickle_informations(data = loader):
    rows = []
    # Use the loader's output path
    for i in range(1, len(os.listdir(data._Loader__out_path)) + 1):
        df = data.load_pickle(i)
        rows.append(len(df))
        del df
    return rows

# get the rows
rows = pickle_informations()

# create the dataframe
df_data = pd.DataFrame(
    {'batch' : ['data ' + str(i) for i in range(1,len(rows) + 1)],
    'rows' : rows})

fig = px.bar(df_data, x="batch", y="rows")
fig.show()

All batches have the same number of rows except the last one. This is consistent, because the batches are created iteratively from chunks of `10**6` rows read from the CSV; the last batch simply holds the remainder.
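The arithmetic behind the batch sizes can be sketched as follows. The total row count used here is a hypothetical figure chosen only to be consistent with "around 5 million rows" and 6 pickle files; `batch_sizes` is an illustrative helper, not part of the loader.

```python
def batch_sizes(n_rows, chunksize=10**6):
    """Sizes of the chunks produced when reading n_rows in blocks of chunksize."""
    full, rest = divmod(n_rows, chunksize)
    # `full` complete chunks, plus one smaller chunk if rows are left over
    return [chunksize] * full + ([rest] if rest else [])

# Hypothetical total row count, for illustration only
sizes = batch_sizes(5_200_000)
print(len(sizes), sizes[-1])
```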

In [None]:
import plotly.graph_objects as go

In [None]:
def add_bar(i, y1, y2, color, data = loader):
    df = data.load_pickle(i)
    df = df[(df.year >= y1) & (df.year <= y2)]
    df_year = df.groupby(['year']).size().reset_index(name='count')
    new_bar = go.Bar(
                x = df_year.year.values,
                y = df_year['count'].values,
                name = 'data_'+ str(i),
                marker = {'color' : color})
    new_trend = go.Scatter(
                x = df_year.year.values,
                y = df_year['count'].values,
                mode="lines",
                line={'color' : color,
                    'width' : 0.5},
                showlegend=False)
    del df_year, df
    return new_bar, new_trend


def multi_barplot(year1, year2, colors, data = loader):
    # create a empty plotly.Figure object
    fig = go.Figure()
    # compute the batch number
    n_batch = len(os.listdir(data._Loader__out_path)) # Use the loader's output path
    # test the color list feed in argument
    # fit well with the batch number
    if n_batch > len(colors):
        raise Exception(
            "The colors list size ({}) doesn't ".format(len(colors)) +
            "fit with the number of data batches ({})".format(n_batch))
    for i in range(1, n_batch + 1):
        fig.add_traces((add_bar(i, year1, year2, colors[i-1])))
    fig.update_layout(
        title = "Data distribution over years ({} - {})"
            .format(year1, year2),
        xaxis_title="years",
        yaxis_title="number of titles",
        legend_title="Data batch")
    return fig

In [None]:
import plotly.colors as col

# create the color list
colors = col.qualitative.Plotly

# 1960 - 1989
fig1 = multi_barplot(1960, 1989, colors)
fig1.show()
# 1990 - 2023
fig2 = multi_barplot(1990, 2023, colors)
fig2.show()

The first bar chart (1960 - 1989) shows the amount of data increasing over the years. Moreover, the batches seem to have quite similar distributions over the years. The data_1 and data_2 batches contain noticeably more songs than the 4 others, and the data_6 batch contains fewer because of its small number of rows.
The data behave similarly until 2012, as we can see on the second chart (1990 - 2023). After that year, the amount of data retrieved increases sharply: every batch grows by at least 100%, and some grow up to 50-fold.

# Data pre-processing

The aim of this part is to preprocess the data so that it is suitable for the analysis, with a particular focus on the year variable.

We will focus on English songs, to simplify the analysis and the work of the natural language processing algorithms.

In [None]:
# Retrieve only the texts identified as English language by both cld3 and fasttext langid
df = df[df.language == 'en']

Next, it is worth checking for NaN values.

In [None]:
# find which column contain nan value
df.columns[df.isna().any()].tolist()

['title']

In [None]:
# get all rows that contain NaN values
df_nan = df[df.isna().any(axis=1)]
df_nan

Unnamed: 0,title,tag,artist,year,views,features,lyrics,id,language_cld3,language_ft,language
2000018,,pop,Elevation Worship,2017,7690,{},[Verse 1]\nYou are the One\nWho knows my need\...,3019135,en,en,en
2072706,,rap,Taha TFR,2017,41,{},[Refrain]x2\n\nRock that butt\nWe will never k...,3113282,en,en,en
2114245,,rock,Peter Dzubay,2017,52,{},"When I finally find what's real, why shouldn't...",3166588,en,en,en
2146772,,rap,OLlama,2017,2120,{},[Intro]\nToo many bitches; Too many bitches tr...,3210807,en,en,en
2158344,,rock,The Moth & The Flame,2011,1051,{},We used to be so similar\nWell that was wishfu...,3226446,en,en,en
2160846,,misc,Lawfermz,2017,3,{},(Intro)\nBeep beep beep beep beep beep.\nHello...,3230118,en,en,en
2210243,,rap,TripleYoThreat,2017,24,{},"TripleYoThreat, you ain't never gonna make it\...",3296318,en,en,en
2211581,,rap,Huntaps,2017,157,{},"[Hook]\nFuck 12, if you fuck with me I'll send...",3297867,en,en,en
2269317,,rap,KittyWilson,2017,2,{},They call me by my first name not knowing what...,3390058,en,en,en
2294150,,rap,Shaiza Maponyaza,2018,45,{},[intro]\n\nYah yah yah yah\nYah yah\nMan same ...,3438103,en,en,en


In [None]:
print('Number of untitled song:', len(df[df.isna().any(axis=1)]))

Number of untitled song: 27


The song title is not used to train the topic modeling algorithms, although titles could be related to the topics in a later phase of the analysis. Given the small number of untitled songs, I decided to delete these rows for the moment.

In [None]:
# Delete rows containing NaN values
df = df.dropna()
len(df)

645567

Next, we also check for None values. Since `isnull` is an alias of `isna`, this mainly confirms that `dropna` removed every missing entry.

In [None]:
df[df.isnull().any(axis=1)]

Unnamed: 0,title,tag,artist,year,views,features,lyrics,id,language_cld3,language_ft,language


No None values in this dataframe.
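As a quick illustration of why the NaN check and the None check give the same answer, the sketch below (toy data, invented for the example) shows that pandas treats `None` and `NaN` alike as missing values and that `isnull` and `isna` return identical masks.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical) with one None and one NaN
toy = pd.DataFrame({"title": ["Roses", None, np.nan]})

# isnull is an alias of isna, so both masks are identical
same_mask = toy.isna().equals(toy.isnull())

# Both None and NaN are counted as missing
n_missing = int(toy["title"].isna().sum())

print(same_mask, n_missing)
```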

Next, let's look at the year variable, which is one of the most important variables in our analysis because we want to extract the topics by decade.

---



In [None]:
# Reload the original data from pickle 3 to revert text processing
df = loader.load_pickle(3)

# Display the first few rows to confirm the lyrics are restored
display(df.head())

Unnamed: 0,title,tag,artist,year,views,features,lyrics,id,language_cld3,language_ft,language
2000000,Roses,rock,Vitja,2017,399,{},The roses start to wither\nWhere the devil lay...,3019113,en,en,en
2000001,Keep On Pushin,rap,Problem,2017,692,"{""My Princess Aeryn""}",[Hook]\nImma keep on pushin\nImma keep on push...,3019114,en,en,en
2000002,Inside,pop,The jepettos,2017,1302,{},"[Intro]\nOoh, ooh, ah, ah (x2)\n\n[Verse 1]\nI...",3019115,en,en,en
2000003,Girls Like You,rb,PnB Rock,2017,60114,{},[Chorus]\nBaby it was real\nYeah we were the b...,3019117,en,en,en
2000004,Froideur,rap,N'Dirty Deh,2017,5813,"{""N\\'Dirty Deh""}","[Refrain]\nEt j'ai perdu la foi, mais c'est pa...",3019120,fr,fr,fr


In [None]:
years = df.year.unique()
print(years)

print('Number of unique years: ',len(years))

[2017 2016 1998 2002 2011 2015 2013 2014 2006 2009 1981 2012 2008 2005
 2018 1991 1994 2007 2003 1971 2010 1999 1989 1986 1913 1905 2004 1980
 1985 1993 1982 1992 2001 1964 1977 1958 1968 1899 2000 2020 1965 1959
 1997 1987 1990 1984 1983 1979 2021 1995 1825 2019 1966 1988 1949 1957
 1674 1967 1996 1974 1942 1850 1842    1 1972 1915 1976 1930 1975 1917
 1919 1911 1780 1960 1903 1978 1865 1950 1938 1952 1973 1951 1947 1954
 1961 1898 1892 1878 1934 1935 1970 1969 1944 1962 1909 2022 1927 1945
 1936 1926 1901   12 1900 1922 1929  420 1937 1953 1820 1864 1955 1540
 1916 1731 1963 1918 1904 1943 1939   14 1834 1844 1797 1772 1788 1894
 1886 1914 1948 1924 1932 1933 1956 1871 1912 1880 1923 1868 1910 1897
 1862 1946 1831 1870 1867 1866 1869   15 1941 1877 1771 1848   17 1931
 1700 1808  801 1888 1856 1928  510 1791 1895 1920   25  203   21 1908
 1940 1882 1859 1907 1902 2023  499 1857 1863 1666 1893 1889 1861 1925
 1872 1676 1675  709   79 1818 1400 1830 1005 1066    6 1853  176 1300
 1816 

We first want to know whether the format of the year variable is suitable. It is highly likely that some years are truncated (for example, 92 instead of 1992).
Let's display the tag distribution for songs with a release year below 215.

In [None]:
df_tag = df[df['year'] < 215].groupby(['tag']).size().reset_index(name='count')

fig = px.pie(df_tag, names="tag", values="count", title = "Outlier tag distribution")
fig.show()

It is rather surprising that the dominant music style for this period (< 215) is rap, given that rap is a recent style. Clearly, a large share of these rows have an outlier year.

In [None]:
# Extract the pieces of music of type 'rap' lower than the year 215
df_rap = df[(df['year'] < 215) & (df['tag'] == 'rap')]
df_rap.sort_values(by='views',ascending=False).head(20)

Unnamed: 0,title,tag,artist,year,views,features,lyrics,id,language_cld3,language_ft,language
2254802,Intro,rap,Ninho,1,39224,{},"[Paroles de ""Intro""]\n\n[Intro]\nIls sont pas ...",3358576,fr,fr,fr
2715640,R.I.P,rap,Tobi (ARG),1,6753,{},"[Letra de ""R.I.P""]\n\n[Estribillo]\nBuscando m...",4081146,es,es,es
2734875,Illegale Hobbys,rap,SDP,1,4768,{},[Songtext zu „Illegale Hobbys“]\n\n[Refrain: D...,4111039,de,de,de
2111689,Issa Snack,rap,Phatboitheceleb,1,4182,{},Chorus: its ya birthday and you know I ain't f...,3163216,en,en,en
2635867,BANDE DE PUTE,rap,So Sama,1,3259,{Kpri},"[Refrain: So Sama] X4\nBande de pute, bande de...",3961880,fr,fr,fr
2268205,Oof sandman sans sans diss track,rap,Squeaky,15,3206,{},OOF POOF BOOTH LOSE LOOSE MOVE LUKE FRISK WRIS...,3387617,en,en,en
2512592,Jeden Tag ne Flasche Sekt,rap,Genz,15,3095,{},"[Intro]\nYeah, falls ihr euch fragt wie, wie i...",3786485,de,de,de
2586144,Esta Noche,rap,Franky Style,1,2446,"{""Big Deiv"",Monkid,Isoldi}","[Letra de ""Esta Noche"" ft. Big Deiv, ByMonKid ...",3892316,es,es,es
2218600,First Day Home,rap,RetcH,1,2334,{},"[Intro]\n\n[Chorus]\nHome, it's my first day h...",3307004,en,en,en
2006109,Судно Sudno,rap,Brick Bazuka,1,2224,{},"[Интро]\nРаз-раз, йо (ай)\nРаз-раз, йо (ай)\nO...",3028420,ru,ru,ru


In [None]:
df_rap[df_rap['artist'] == 'Kanye East']

Unnamed: 0,title,tag,artist,year,views,features,lyrics,id,language_cld3,language_ft,language


If we search for the release date of this track on Google, we find a release date of 4 May 2021 on the [Genius website](https://genius.com/Kanye-east-the-secrets-of-dababy-lyrics). Comparing the year in our table with the real one, we can assume an issue with the date format (1 instead of 2021).

After some research on the Genius website, the most viewed songs in the list displayed above seem to have been released in 2021, but the fewer views a song has, the harder its date is to interpret.

Let's check the second most popular tag, 'pop', in these outlier data:

In [None]:
# Extract the pieces of music of type 'pop' lower than the year 215
df_pop = df[(df['year'] < 215) & (df['tag'] == 'pop')]
df_pop.sort_values(by='views',ascending=False).head(20)

Unnamed: 0,title,tag,artist,year,views,features,lyrics,id,language_cld3,language_ft,language
2758763,Cagayake GIRLS,pop,Genius Romanizations,1,8432,{},Chatting Now\nGACHI de KASHIMASHI Never Ending...,4145490,ja,,
2274779,Off White,pop,Zach Clayton,1,7894,{},"Chorus\nSwish on my kicks, off white\nStripes ...",3399877,en,en,en
2951623,I dont wanna see you cryin anymore,pop,Adam Melchor,1,7479,{},[Verse 1]\nI don't wanna see you cryin’ anymor...,4452665,en,en,en
2691129,Tempat Terakhir,pop,Padi,1,7218,{},[Chorus]\nMeskipun aku di surga\nMungkin aku t...,4044035,id,id,id
2495830,Hello from the Dark Side,pop,RoyishGoodLooks,15,7019,{},[Chorus]\nHello from the Dark Side\nI must've ...,3763085,en,en,en
2385856,The Moons Detriment,pop,Shannon Lay,1,6698,{},If I were to know you\nI’d show you all the li...,3603840,en,en,en
2665349,One Heart. One Commitment. One Life. 300lasalle,pop,La Salle WorldWide,15,5109,{},There was a man in France\nWho dreamed a way t...,4005134,en,,
2914887,Si Mañana Me Muero,pop,Jaudy,1,4234,{},"[Letra de ""Si Mañana Me Muero""]\n\n[Intro]\nHa...",4395456,es,es,es
2635909,Mostly to Yourself,pop,Noah Reid,1,3517,{},Well it's mostly in the mornin'\nWhen your eye...,3961929,en,en,en
2759338,المسلم لا يشرب,pop,Bob Marley & The Wailers,1,2530,"{""Bob Marley""}",[verse 1]\nالمسلم لا يشرب شيشا و زقارا\nالمسلم...,4147142,ar,ar,ar


Most of the titles recovered seem to be recent, not very popular songs whose years are badly indexed.

A case-by-case pre-processing of the data is too tedious compared to the amount of data to be processed. We will only use data with correctly formatted dates.
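Before filtering, it can help to see which out-of-range year values occur and how often. The sketch below is one possible way to do that; the helper name `suspicious_year_counts`, the toy years, and the default bounds are assumptions made for the example.

```python
import pandas as pd

def suspicious_year_counts(years, first_valid=1960, last_valid=2022):
    """Count each year value that falls outside [first_valid, last_valid]."""
    years = pd.Series(years)
    outside = years[(years < first_valid) | (years > last_valid)]
    # One row per suspicious value, sorted by year
    return outside.value_counts().sort_index()

# Toy values (hypothetical) mimicking the truncated years seen above
toy_years = [2017, 1, 1, 15, 1998, 1, 2023, 15]
counts = suspicious_year_counts(toy_years)
print(counts)
```

A few stub values dominating the counts would support the idea of a systematic formatting problem rather than genuinely old songs.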

In [None]:
# copy the filtered frame to avoid a SettingWithCopyWarning when adding columns later
df = df[(df.year >= 1960) & (df.year < 2023)].copy()
len(df)

995865

We want to analyze the lyrics by decade, so let's add a decade column.
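The decade of a year is obtained by dropping its last digit, that is, by integer division by 10 followed by multiplication by 10. A vectorized sketch on a few toy years (chosen for illustration) looks like this:

```python
import pandas as pd

# Toy years (hypothetical), including both ends of a decade
years = pd.Series([1960, 1969, 1989, 1990, 2017, 2022])

# Floor division drops the last digit; multiplying by 10 restores the scale
decades = (years // 10) * 10
print(decades.tolist())
```

For positive years this gives the same result as truncating `x / 10` element by element.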

In [None]:
import math

df['decade'] = df['year'].map(lambda x : int(math.trunc(x / 10) * 10))

df.sort_values(by = 'year').head(20)

Unnamed: 0,title,tag,artist,year,views,features,lyrics,id,language_cld3,language_ft,language,decade
2982207,Dream Boy,country,Annette Funicello,1960,5,{},"Some boys, they say\nThey wanna take me dancin...",4504062,en,en,en,1960
2982212,Apache,rock,Jrgen Ingmann,1960,22,"{""Jørgen Ingmann""}",Apache Indian\nKarma\nI Pray (unplugged)\nChor...,4504068,en,en,en,1960
2127955,Got to move,country,Sonny Boy Williamson II,1960,285,{},"Baby, you know, you know, you know I know\nOnc...",3185111,en,en,en,1960
2805362,Blue Mood,rock,T-Bone Walker,1960,238,{},[Verse 1]\nI guess I'll have to go out walkin'...,4217570,en,en,en,1960
2392492,Guantanamo Bay,pop,Oscar Brand,1960,699,{},At Guantanamo Bay we're confined to our quarte...,3610760,en,en,en,1960
2392495,The Reuben James,pop,Oscar Brand,1960,107,{},Have you heard of a ship called the good Reube...,3610763,en,en,en,1960
2374959,Because of Everything,pop,Brook Benton,1960,124,{},Because you treat me so wonderful\nBecause you...,3588014,en,en,en,1960
2798480,Baby I Done Got Wise,pop,Muddy Waters,1960,69,{},"A gypsy woman, she told me\nTo let them women ...",4206772,en,en,en,1960
2798479,Tell Me Baby,pop,Muddy Waters,1960,230,{},My friends told me and I thought it was a joke...,4206769,en,en,en,1960
2400031,Spanish Rose,pop,Original Broadway Cast of Bye Bye Birdie,1960,13446,"{""Chita Rivera""}",[ROSIE]\nI'm just a Spanish tamale according t...,3621309,en,en,en,1960


**Restore Option below**

## Further data refinements

I restrict the data to songs with enough views, as a proxy for work by recognized professional artists.

In [None]:
# Keep only songs with more than 500 views
df = df[df.views > 500]
len(df)


288046

In [None]:
# create a new df for the most viewed songs only, (top 1000 most views songs)
df_most_viewed = df.sort_values(by='views',ascending=False).head(1000)

# create "most viewed" percentile values and save as new column in df
df['most_viewed'] = df['views'].rank(pct=True)
df.head()


Unnamed: 0,title,tag,artist,year,views,features,lyrics,id,language_cld3,language_ft,language,decade,most_viewed
2000001,Keep On Pushin,rap,Problem,2017,692,"{""My Princess Aeryn""}",[Hook]\nImma keep on pushin\nImma keep on push...,3019114,en,en,en,2010,0.13802
2000002,Inside,pop,The jepettos,2017,1302,{},"[Intro]\nOoh, ooh, ah, ah (x2)\n\n[Verse 1]\nI...",3019115,en,en,en,2010,0.370859
2000003,Girls Like You,rb,PnB Rock,2017,60114,{},[Chorus]\nBaby it was real\nYeah we were the b...,3019117,en,en,en,2010,0.963037
2000004,Froideur,rap,N'Dirty Deh,2017,5813,"{""N\\'Dirty Deh""}","[Refrain]\nEt j'ai perdu la foi, mais c'est pa...",3019120,fr,fr,fr,2010,0.744876
2000005,Play,rap,Problem,2017,604,{},[Hook]\nAww baby let me play with ya pussy\nPl...,3019122,en,en,en,2010,0.081492
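The `most_viewed` column is a percentile rank: each song's rank by views divided by the number of songs. A minimal sketch on five view counts taken from the rows shown in this notebook makes the scale explicit:

```python
import pandas as pd

# Five view counts, used here as a toy sample
views = pd.Series([604, 692, 1302, 5813, 60114])

# rank(pct=True) returns rank / number of rows, so the top song gets 1.0
pct_rank = views.rank(pct=True)
print(pct_rank.tolist())
```

With the default `method='average'`, songs with identical view counts share the average of their ranks.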


# Data visualization

Let's do some visualization to get a better understanding of our data. As we saw on the previous distribution graphs over the years, more titles have been recorded over the last 2 decades.

In [None]:
# barplot by decade
def barplot_by_decade(df):

    # groupby decade
    df_d = df.groupby(['decade']).size().reset_index(name='count')

    # create the figure
    fig = go.Figure()

    fig.add_bar(
        x=df_d.decade,
        y=df_d['count'],
        showlegend=False)

    fig.add_scatter(
            x=df_d.decade,
            y=df_d["count"],
            mode="markers+lines",
            name="trend",
            showlegend=False)

    fig.update_layout(
            title = "Music release over years",
            xaxis_title="decade",
            yaxis_title="release")
    return fig

# build and display
fig = barplot_by_decade(df)
fig.show()

In [None]:
# compute tag frequencies by decade
df_pies_d = df.groupby(['decade','tag']).size().reset_index(name='count')
df_pies_d[df_pies_d.decade == 1960]

Unnamed: 0,decade,tag,count
0,1960,country,58
1,1960,misc,44
2,1960,pop,440
3,1960,rb,109
4,1960,rock,114


In [None]:
from plotly.subplots import make_subplots

# create the subplot grid
fig = make_subplots(rows=3, cols=3,
                    specs=[
                        [{'type':'domain'}
                        for i in range(1,4)] for i in range(1,4)
                    ])
decades = df_pies_d.decade.unique().tolist()
for i in range(0,3):
    for k in range(0,3):
        decade = decades[i*3 + k]
        # group by decade
        df_p = df_pies_d[df_pies_d.decade == decade]
        # add figure
        fig.add_trace(go.Pie(labels=df_p.tag, values=df_p['count'], name=decade), i+1, k+1)
        # add annotation
        fig.add_annotation(arg=dict(
            text=decade, x=k*0.375 + 0.125,
            y=-i*0.3927 + 0.90, font_size=10,
            showarrow=False))
        if (i*3 + k) == 6:
            break


# Use `hole` to create a donut-like pie chart
fig.update_traces(hole=.4, hoverinfo="label+percent+name")

fig.update_layout(
    title_text="Tags proportions by decades"
    # Add annotations in the center of the donut pies.
    #annotations=[dict(text=decade, x=k*0.375+0.125, y= -i*0.125+0.90, font_size=10, showarrow=False)
     #           for k, decade in enumerate(decades) for i in range(0,4)]
)
fig.show()

In [None]:
# Group by tag and calculate the average views
df_tag_avg_views = df.groupby('tag')['views'].mean().reset_index()

# Sort by average views in descending order
df_tag_avg_views = df_tag_avg_views.sort_values(by='views', ascending=False)

# Create a bar plot for average views
fig = px.bar(df_tag_avg_views, x='tag', y='views', title='Average Views per Song by Music Genre')

# Display the plot
fig.show()

In [None]:
# Sort the DataFrame by views in descending order
df_sorted_views = df.sort_values(by='views', ascending=False)

# Get the top N most viewed songs
df_top_100 = df_sorted_views.head(100)
df_top_1000 = df_sorted_views.head(1000)
df_top_10000 = df_sorted_views.head(10000)

# Function to create a pie chart of genre distribution for a given DataFrame
def plot_genre_distribution(dataframe, title):
    df_tag_counts = dataframe.groupby('tag').size().reset_index(name='count')
    fig = px.pie(df_tag_counts, names='tag', values='count', title=title)
    fig.show()

# Plot genre distribution for each subset
plot_genre_distribution(df_top_100, 'Genre Distribution for Top 100 Most Viewed Songs')
plot_genre_distribution(df_top_1000, 'Genre Distribution for Top 1000 Most Viewed Songs')
plot_genre_distribution(df_top_10000, 'Genre Distribution for Top 10000 Most Viewed Songs')

The pie charts by decade show how the proportions of music styles evolve from 1960 to today. In the sixties, the most listed music style is **pop**. A rising style then emerges: **rock**. Its share increases over the decades until it overtakes pop during the 1990s and 2000s. The two styles then balance out in the face of the meteoric rise of the most listed genre of the current decade, namely **rap**. The evolution of these proportions gives a fairly good idea of the most popular styles of each period. However, it is important to remember that the data are still biased: Genius is a crowdsourced tool created in 2009 and was originally only a site for annotating rap lyrics, so it is normal to find a large amount of data from this period, and especially from this style.
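The proportions read off the pie charts can also be computed as numbers. The sketch below normalizes tag counts within each decade; the toy frame is invented for the example and does not reflect the real distribution.

```python
import pandas as pd

# Toy data (hypothetical): a few songs in two decades
toy = pd.DataFrame({
    "decade": [1960, 1960, 1960, 1960, 2010, 2010, 2010, 2010],
    "tag":    ["pop", "pop", "pop", "rock", "rap", "rap", "rock", "pop"],
})

# Number of songs per (decade, tag)
counts = toy.groupby(["decade", "tag"]).size()

# Divide each count by the total of its decade to get a share
shares = counts / counts.groupby(level="decade").transform("sum")
print(shares)
```

Each decade's shares sum to 1, which makes decades with very different numbers of songs directly comparable.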

## Text preprocessing

After the visualization part, let's focus on the main data: the lyrics.

In [None]:
df.iloc[0]["lyrics"]

'[Hook]\nImma keep on pushin\nImma keep on pushin\nStay stayin away from the bullshit\n\n[Verse 1 : Problem]\nChach pull up they know what im all about\nBang my line make sure they call about\nM-O-N-E-Y cause that\'s what this world about\nSus niggas stay of my line with that hoe shit\nDid I fuck your girl man nigga I don\'t know shit\nReal to real nigga, go ask that hoe bitch\nGround breaking with it come check how this flo shift\nBeen playin chess y\'all still on go fish\n(Imma keep on pushin)\nNever waitin for no dealer to deal mine\nI only hang with your girl to kill time\nNigga stop with that boss shit\nBoy yousa a look out\nPulled up on the side of your whip you ain\'t even look out\nYou on but your homies ain\'t, you ain\'t even look out\nAnd gettin read by the bitches like you a book out\nIf i was you i would chill with that tough shit before you get took out like meat at a cook out\nYa feel me\nCause its a lot going on with me\nThuggin solo like bobby cause niggas singing like

There are many undesirable characters, such as the line break '\n', digits, and square, curly and round brackets. Let's clean the data with regular expressions.

In [None]:
import re
from numpy.random import randint

# Clean raw lyrics; newlines are kept unless remove_newlines=True
def clean_text(text, remove_newlines=False):
    # Remove text in curly and round brackets, but keep section tags such as [Verse]
    text = re.sub(r'\{.*?\}|\(.*?\)', '', text)

    text = re.sub(r'\d+', '', text) # Remove digits

    if remove_newlines:
        text = text.replace('\n', ' ') # Replace newline characters

    return text

# Apply the cleaning function to create two new columns
df['lyrics_cleaned_no_newlines'] = df["lyrics"].apply(lambda x: clean_text(x, remove_newlines=True))
df['lyrics_with_newlines'] = df["lyrics"].apply(lambda x: clean_text(x, remove_newlines=False))

# Display the first few rows with the new columns
display(df[['lyrics', 'lyrics_cleaned_no_newlines', 'lyrics_with_newlines']].head())

Unnamed: 0,lyrics,lyrics_cleaned_no_newlines,lyrics_with_newlines
2000001,[Hook]\nImma keep on pushin\nImma keep on push...,[Hook] Imma keep on pushin Imma keep on pushin...,[Hook]\nImma keep on pushin\nImma keep on push...
2000002,"[Intro]\nOoh, ooh, ah, ah (x2)\n\n[Verse 1]\nI...","[Intro] Ooh, ooh, ah, ah [Verse ] I heard yo...","[Intro]\nOoh, ooh, ah, ah \n\n[Verse ]\nI hear..."
2000003,[Chorus]\nBaby it was real\nYeah we were the b...,[Chorus] Baby it was real Yeah we were the bes...,[Chorus]\nBaby it was real\nYeah we were the b...
2000004,"[Refrain]\nEt j'ai perdu la foi, mais c'est pa...","[Refrain] Et j'ai perdu la foi, mais c'est pas...","[Refrain]\nEt j'ai perdu la foi, mais c'est pa..."
2000005,[Hook]\nAww baby let me play with ya pussy\nPl...,[Hook] Aww baby let me play with ya pussy Play...,[Hook]\nAww baby let me play with ya pussy\nPl...


In [None]:
# Function to remove section tags and replace newlines with a space
def clean_lyrics_for_cleaned_column(text):
    text = re.sub(r'\[.*?\]', '', text) # Remove section tags
    text = text.replace('\n', ' ') # Replace newline characters with space
    text = re.sub(r'\d+', '', text) # Remove figures
    text = re.sub(r'\{.*?\}|\(.*?\)', '', text) # Remove other brackets
    return text

# Function to replace newlines with space+newline and keep section tags
def clean_lyrics_for_structured_column(text):
    text = text.replace('\n', ' \n ') # Replace newline characters with space and newline
    text = re.sub(r'\d+', '', text) # Remove figures
    text = re.sub(r'\{.*?\}|\(.*?\)', '', text) # Remove other brackets
    return text


# Apply the cleaning functions to create the new columns
df['lyrics_cleaned'] = df["lyrics"].apply(clean_lyrics_for_cleaned_column)
df['lyrics_structured'] = df["lyrics"].apply(clean_lyrics_for_structured_column)

# Display the first few rows with the new columns
display(df[['lyrics', 'lyrics_cleaned', 'lyrics_structured']].head())

Unnamed: 0,lyrics,lyrics_cleaned,lyrics_structured
2000001,[Hook]\nImma keep on pushin\nImma keep on push...,Imma keep on pushin Imma keep on pushin Stay ...,[Hook] \n Imma keep on pushin \n Imma keep on ...
2000002,"[Intro]\nOoh, ooh, ah, ah (x2)\n\n[Verse 1]\nI...","Ooh, ooh, ah, ah I heard you were the worl...","[Intro] \n Ooh, ooh, ah, ah \n \n [Verse ] \..."
2000003,[Chorus]\nBaby it was real\nYeah we were the b...,Baby it was real Yeah we were the best When w...,[Chorus] \n Baby it was real \n Yeah we were t...
2000004,"[Refrain]\nEt j'ai perdu la foi, mais c'est pa...","Et j'ai perdu la foi, mais c'est pas la faute...","[Refrain] \n Et j'ai perdu la foi, mais c'est ..."
2000005,[Hook]\nAww baby let me play with ya pussy\nPl...,Aww baby let me play with ya pussy Play with ...,[Hook] \n Aww baby let me play with ya pussy \...


In [None]:
# Update docs to use the column you intend for downstream processing (e.g., with newlines)
docs = df['lyrics_structured'].to_list()
print(docs[0])

[Hook] 
 Imma keep on pushin 
 Imma keep on pushin 
 Stay stayin away from the bullshit 
  
 [Verse  : Problem] 
 Chach pull up they know what im all about 
 Bang my line make sure they call about 
 M-O-N-E-Y cause that's what this world about 
 Sus niggas stay of my line with that hoe shit 
 Did I fuck your girl man nigga I don't know shit 
 Real to real nigga, go ask that hoe bitch 
 Ground breaking with it come check how this flo shift 
 Been playin chess y'all still on go fish 
  
 Never waitin for no dealer to deal mine 
 I only hang with your girl to kill time 
 Nigga stop with that boss shit 
 Boy yousa a look out 
 Pulled up on the side of your whip you ain't even look out 
 You on but your homies ain't, you ain't even look out 
 And gettin read by the bitches like you a book out 
 If i was you i would chill with that tough shit before you get took out like meat at a cook out 
 Ya feel me 
 Cause its a lot going on with me 
 Thuggin solo like bobby cause niggas singing like the

In [None]:
# Drop the old lyrics columns
df = df.drop(columns=['lyrics_cleaned_no_newlines', 'lyrics_with_newlines'])

# Display the first few rows of the DataFrame to show the new columns
display(df.head())

Unnamed: 0,title,tag,artist,year,views,features,lyrics,id,language_cld3,language_ft,language,decade,most_viewed,lyrics_cleaned,lyrics_structured
2000001,Keep On Pushin,rap,Problem,2017,692,"{""My Princess Aeryn""}",[Hook]\nImma keep on pushin\nImma keep on push...,3019114,en,en,en,2010,0.13802,Imma keep on pushin Imma keep on pushin Stay ...,[Hook] \n Imma keep on pushin \n Imma keep on ...
2000002,Inside,pop,The jepettos,2017,1302,{},"[Intro]\nOoh, ooh, ah, ah (x2)\n\n[Verse 1]\nI...",3019115,en,en,en,2010,0.370859,"Ooh, ooh, ah, ah I heard you were the worl...","[Intro] \n Ooh, ooh, ah, ah \n \n [Verse ] \..."
2000003,Girls Like You,rb,PnB Rock,2017,60114,{},[Chorus]\nBaby it was real\nYeah we were the b...,3019117,en,en,en,2010,0.963037,Baby it was real Yeah we were the best When w...,[Chorus] \n Baby it was real \n Yeah we were t...
2000004,Froideur,rap,N'Dirty Deh,2017,5813,"{""N\\'Dirty Deh""}","[Refrain]\nEt j'ai perdu la foi, mais c'est pa...",3019120,fr,fr,fr,2010,0.744876,"Et j'ai perdu la foi, mais c'est pas la faute...","[Refrain] \n Et j'ai perdu la foi, mais c'est ..."
2000005,Play,rap,Problem,2017,604,{},[Hook]\nAww baby let me play with ya pussy\nPl...,3019122,en,en,en,2010,0.081492,Aww baby let me play with ya pussy Play with ...,[Hook] \n Aww baby let me play with ya pussy \...


In [None]:
df.iloc[0]['lyrics']

'[Hook]\nImma keep on pushin\nImma keep on pushin\nStay stayin away from the bullshit\n\n[Verse 1 : Problem]\nChach pull up they know what im all about\nBang my line make sure they call about\nM-O-N-E-Y cause that\'s what this world about\nSus niggas stay of my line with that hoe shit\nDid I fuck your girl man nigga I don\'t know shit\nReal to real nigga, go ask that hoe bitch\nGround breaking with it come check how this flo shift\nBeen playin chess y\'all still on go fish\n(Imma keep on pushin)\nNever waitin for no dealer to deal mine\nI only hang with your girl to kill time\nNigga stop with that boss shit\nBoy yousa a look out\nPulled up on the side of your whip you ain\'t even look out\nYou on but your homies ain\'t, you ain\'t even look out\nAnd gettin read by the bitches like you a book out\nIf i was you i would chill with that tough shit before you get took out like meat at a cook out\nYa feel me\nCause its a lot going on with me\nThuggin solo like bobby cause niggas singing like

## Mapping Lyrics / Song Structures

Analyzing the structural patterns of song lyrics involves looking at elements such as sections, line counts, syllables per line and rhyming patterns, as well as the implied sentiment, emotional, energy and narrative flows. These are the foundational building blocks of lyrics (independent of the melody and music, which should align with the lyrics in a symbiotic way). Our brains are essentially pattern-recognition machines, addicted to finding patterns. The most effective songs exploit this: they tease our senses with implied tension and ultimately provide the resolution or relief that satisfies us.

The inventory below charts the patterns and elements to extract, ordered by priority (most important or impactful and least difficult or time-consuming first), followed by a step-by-step plan to fulfil each item.

In [None]:
## Mapping Lyrics / Song Structures

# Detailed Outline of Structural Song Lyrics Elements and Patterns to Extract

# Prioritized List (Most Important/Impactful & Least Difficult First)
# 1. Section Tags: Identify and count occurrences of section tags (e.g., [Verse], [Chorus], [Bridge]).
# 2. Section Sequence: Analyze the order and frequency of section tag sequences (e.g., Verse-Chorus-Verse).
# 3. Line Counts per Section: Count the number of lines within each identified section.
# 4. Total Line Count per Song: Count the total number of lines in each song.
# 5. Syllables per Line: Count the approximate number of syllables per line (Requires a syllable counting library).
# 6. Rhyming Patterns: Analyze rhyming schemes within and across sections (Complex, requires phonetic analysis or advanced pattern matching).
# 7. Sentiment/Emotional/Energy/Narrative Flows: Analyze the progression of sentiment, emotion, energy, or narrative themes throughout the song (Requires advanced NLP techniques).

### Code to Extract Top Priority Elements (Section Tags and Line Counts)

import re

def extract_structural_elements(lyrics):
    sections = re.findall(r'\[(.*?)\]', lyrics) # Find all text within square brackets
    lines = lyrics.split('\n')
    total_lines = len(lines)

    section_line_counts = {}
    current_section = 'Intro/Unknown' # Default section before the first tag
    line_count_in_current_section = 0

    for line in lines:
        section_match = re.search(r'\[(.*?)\]', line)
        if section_match:
            # Store count for the previous section
            if current_section in section_line_counts:
                section_line_counts[current_section].append(line_count_in_current_section)
            else:
                section_line_counts[current_section] = [line_count_in_current_section]

            current_section = section_match.group(1).strip() # Update current section
            line_count_in_current_section = 0 # Reset line count for the new section
        else:
            # Count non-empty lines within the current section
            if line.strip():
                line_count_in_current_section += 1

    # Store count for the last section
    if current_section in section_line_counts:
         section_line_counts[current_section].append(line_count_in_current_section)
    else:
         section_line_counts[current_section] = [line_count_in_current_section]


    # Remove the initial empty count for the default section if no lines before the first tag
    if 'Intro/Unknown' in section_line_counts and section_line_counts['Intro/Unknown'] == [0]:
        del section_line_counts['Intro/Unknown']


    return {
        'section_tags': sections,
        'total_lines': total_lines,
        'section_line_counts': section_line_counts
    }

# Apply the function to the DataFrame
# Use 'lyrics_structured' as it retains section tags and line breaks
df['structural_elements'] = df['lyrics_structured'].apply(extract_structural_elements)

# Display the first few rows with the new structural_elements column
display(df[['lyrics_structured', 'structural_elements']].head())

Unnamed: 0,lyrics_structured,structural_elements
2000001,[Hook] \n Imma keep on pushin \n Imma keep on ...,"{'section_tags': ['Hook', 'Verse : Problem', ..."
2000002,"[Intro] \n Ooh, ooh, ah, ah \n \n [Verse ] \...","{'section_tags': ['Intro', 'Verse ', 'Pre-Chor..."
2000003,[Chorus] \n Baby it was real \n Yeah we were t...,"{'section_tags': ['Chorus', 'Verse ', 'Chorus'..."
2000004,"[Refrain] \n Et j'ai perdu la foi, mais c'est ...","{'section_tags': ['Refrain', 'Couplet ', 'Pont..."
2000005,[Hook] \n Aww baby let me play with ya pussy \...,"{'section_tags': ['Hook', 'Verse ', 'Hook', 'V..."
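
Item 2 of the prioritized list (section sequence) can be sketched as a count of transitions between consecutive sections. This is a minimal sketch and not part of the original analysis: `normalise_tag` and `section_transitions` are hypothetical helper names, and the normalisation rule (drop anything after a colon, drop digits, lowercase) is an assumption about how tags such as `Verse 1: Problem` should be grouped.

```python
import re
from collections import Counter

def normalise_tag(tag):
    """Reduce a raw tag such as 'Verse 1: Problem' to a bare label such as 'verse'.

    Assumption: the text after the first ':' is performer information.
    """
    label = tag.split(':', 1)[0]        # drop performer info after the colon
    label = re.sub(r'\d+', '', label)   # drop verse numbers
    return label.strip().lower()

def section_transitions(section_tags):
    """Count how often one section label is directly followed by another."""
    labels = [normalise_tag(tag) for tag in section_tags]
    labels = [label for label in labels if label]   # skip tags that normalise to ''
    return Counter(zip(labels, labels[1:]))

# Example using tags shaped like the 'section_tags' lists shown above
tags = ['Hook', 'Verse  : Problem', 'Hook', 'Verse ', 'Hook']
print(section_transitions(tags).most_common())
```

Summing these counters over all songs of a genre or decade would give a rough picture of the most common structures, such as verse followed by chorus.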


In [None]:
# High-Level Data Processing Plan for Remaining Elements (Syllables, Rhyme, Sentiment/Flow)

# 1. Syllables per Line:
#    - Install a syllable counting library (e.g., `pyphen` or `syllable`).
#    - Iterate through each line of the lyrics.
#    - Use the library to count syllables for each line.
#    - Store or analyze the distribution of syllable counts per line.

# 2. Rhyming Patterns:
#    - This is significantly more complex.
#    - Potential approaches:
#        - Use a rhyming dictionary or phonetic analysis library to find rhyming words.
#        - Analyze the last words of lines within sections to identify rhyming schemes (e.g., AABB, ABAB).
#        - This may require advanced text processing and pattern recognition.

# 3. Sentiment/Emotional/Energy/Narrative Flows:
#    - Use NLP libraries for sentiment analysis (e.g., `NLTK`, `spaCy` with extensions, `transformers`).
#    - Analyze the sentiment/emotion of each line or section.
#    - Plot or analyze the progression of sentiment/emotion throughout the song.
#    - Narrative flow is even more complex and might involve identifying key themes, characters, or events and tracking their progression.
#    - This often requires sophisticated NLP models or manual annotation for training.

# Further analysis would involve aggregating and comparing these extracted features
# across different genres and popularity levels as outlined in the previous plan.

# --- Code for Syllables per Line (from Plan Step 1) ---

# Install a syllable counting library (e.g., `pyphen`)
# Note: You might need to run this in a separate cell if you encounter issues with installation and execution in the same cell.
# !pip install pyphen

# import pyphen

# # Load a dictionary for syllable counting (e.g., English)
# # dic = pyphen.Pyphen(lang='en')

# def count_syllables_per_line(lyrics):
#     lines = lyrics.split('\n')
#     syllable_counts = []
#     # Assuming 'dic' is initialized with a dictionary
#     # for line in lines:
#     #     words = line.split()
#     #     line_syllable_count = sum(len(dic.inserted(word).split('-')) for word in words if word.strip())
#     #     syllable_counts.append(line_syllable_count)
#     # return syllable_counts
#     # Placeholder for now, actual implementation needs a library
#     return [len(line.split()) for line in lines] # Placeholder: returns word count per line

# # Apply the function to the DataFrame
# # df['syllables_per_line'] = df['lyrics_structured'].apply(count_syllables_per_line)

# # Display the first few rows with the new syllables_per_line column (or a sample of the data)
# # display(df[['lyrics_structured', 'syllables_per_line']].head())
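
The syllable step of the plan can also be approximated without an external library. The sketch below is a rough, English-only heuristic (count groups of consecutive vowels and discount a silent final "e"), offered as an assumption-laden stand-in rather than the method this notebook settles on. `count_syllables` and `syllables_per_line` are hypothetical names, and the counts will be wrong for many words, especially in non-English lyrics.

```python
import re

def count_syllables(word):
    """Rough English estimate: one syllable per vowel group, minus a silent final 'e'."""
    word = re.sub(r'[^a-z]', '', word.lower())   # keep ASCII letters only
    if not word:
        return 0
    count = len(re.findall(r'[aeiouy]+', word))
    # 'love' -> 1, but keep the final syllable of words such as 'table'
    if word.endswith('e') and not word.endswith('le') and count > 1:
        count -= 1
    return max(count, 1)

def syllables_per_line(lyrics):
    """Return one estimated syllable count per lyric line, skipping blanks and [tags]."""
    counts = []
    for line in lyrics.split('\n'):
        line = line.strip()
        if not line or re.fullmatch(r'\[.*?\]', line):
            continue
        counts.append(sum(count_syllables(w) for w in line.split()))
    return counts

example = "[Chorus] \n Baby it was real \n Yeah we were the best"
print(syllables_per_line(example))   # → [5, 5]
```

Because blank lines and section tags are skipped, the result lines up with the non-empty lyric lines counted in `section_line_counts`.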



---





# Topic modeling

- [LDA (latent Dirichlet allocation)](https://fr.wikipedia.org/wiki/Allocation_de_Dirichlet_latente) has been the common way to do topic modeling in the last few years. It works well and is quite easy to use with common Python libraries like [Gensim](https://radimrehurek.com/gensim/auto_examples/index.html).
- [BERTopic](https://maartengr.github.io/BERTopic/index.html) seems to be one of the best current techniques for topic modeling. It combines the strengths of [BERT](https://en.wikipedia.org/wiki/BERT_(language_model)), the famous language model, with a [c-TF-IDF](https://maartengr.github.io/BERTopic/api/ctfidf.html) transformer.

To meet my first deadline (17/03/2023), I didn't have enough time to run the BERTopic algorithm.


## Define default tokenizer and Lemmatizer

In [None]:
import spacy

print("set gpu: ", spacy.prefer_gpu())

# small model /!\ take the bigger one for Kaggle
new_nlp = spacy.load('en_core_web_sm')

set gpu:  False


Processing all of this data on my computer or on Kaggle could be difficult, as memory is quickly overwhelmed. I will therefore work with a sample of the previously loaded data.

In [None]:
from collections import Counter

def sample(balanced = True, data = df, prop = 0.2):
    if balanced:
        # decade frequencies, sorted in decreasing order
        decade_frequencies = Counter(data['decade']).most_common()
        print(decade_frequencies)

        # size of the least represented decade
        nb_under_class = decade_frequencies[-1][1]

        # draw the same number of songs from each decade, based on the smallest decade
        sample_df = data.groupby("decade").sample(n = int(prop * nb_under_class), random_state = 20)
    else:
        # unbalanced sample: a fraction `prop` of the data passed in
        sample_df = data.sample(frac = prop)
    return sample_df


In [None]:
# sample the data
sdf = sample()

[(2010, 271193), (2000, 7787), (1990, 3348), (1980, 2199), (1970, 1577), (2020, 1177), (1960, 765)]


In [None]:
# check the distribution of the sample
barplot_by_decade(sdf)

In [None]:
# get the results of data cleaning
cleaned_text = sdf["lyrics"].apply(clean_text)
# update dataframe
sdf.update(cleaned_text)
sdf.head(3)

sdf

Unnamed: 0,title,tag,artist,year,views,features,lyrics,id,language_cld3,language_ft,language,decade,most_viewed,lyrics_cleaned,lyrics_structured,structural_elements
2338721,Root Beer,country,George Jones,1962,650,{},"Well, I sent my pretty baby to the neighborhoo...",3533393,en,en,en,1960,0.112298,"Well, I sent my pretty baby to the neighborhoo...","Well, I sent my pretty baby to the neighborhoo...","{'section_tags': [], 'total_lines': 34, 'secti..."
2023898,Thousand Finger Man,pop,Candido,1969,1056,{},[Intro]\nCandido \n\n[Chorus]\nThousand finger...,3050483,,,,1960,0.298100,Candido Thousand finger man The magical...,[Intro] \n Candido \n \n [Chorus] \n Thousan...,"{'section_tags': ['Intro', 'Chorus', 'Bridge',..."
2352376,Once Upon a Time There Was a World,rock,Kaleidoscope (Latin American band),1969,2073,{},The night came\n​It's so so dark outside\n​Oh ...,3554124,en,en,en,1960,0.512278,The night came ​It's so so dark outside ​Oh bu...,The night came \n ​It's so so dark outside \n ...,"{'section_tags': [], 'total_lines': 34, 'secti..."
2354520,Ive Got Levitation,rock,The 13th Floor Elevators,1967,2024,{},"Heading for the ceiling, I'm up off the floor\...",3557050,en,en,en,1960,0.505551,"Heading for the ceiling, I'm up off the floor ...","Heading for the ceiling, I'm up off the floor ...","{'section_tags': [], 'total_lines': 35, 'secti..."
2307698,When I Came Down,rock,Black Sabbath,1969,2005,{},[Verse]\nWhen I came down for the first hour\n...,3468404,en,en,en,1960,0.502838,When I came down for the first hour All of my...,[Verse] \n When I came down for the first hour...,"{'section_tags': ['Verse', 'Chorus', 'Verse', ..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2359077,Floating,rap,6 Dogs,2021,1161,"{""Yung Bans""}",Lyrics From Snippets\n\n[Intro: Dogs]\nDanny ...,3563660,en,en,en,2020,0.332277,Lyrics From Snippets Danny Wolf Hundred thou...,Lyrics From Snippets \n \n [Intro: Dogs] \n ...,"{'section_tags': ['Intro: Dogs', 'Chorus: Do..."
2489095,In My Life,rb,The Weeknd,2020,2836,{},Lyrics from Snippet\n\nI know it seems\nThat y...,3753271,en,en,en,2020,0.594226,Lyrics from Snippet I know it seems That you'...,Lyrics from Snippet \n \n I know it seems \n ...,"{'section_tags': [], 'total_lines': 6, 'sectio..."
2673207,New Heights,pop,Ellie Goulding,2020,15556,{},"[Intro]\n\nOh, oh, oh\n\n[Verse ]\nAll this ti...",4016856,en,en,en,2020,0.877860,"Oh, oh, oh All this time, I felt completed...","[Intro] \n \n Oh, oh, oh \n \n [Verse ] \n A...","{'section_tags': ['Intro', 'Verse ', 'Chorus',..."
2005943,Folk Zvezda,rap,Zli Toni,2021,1645,{Coby},"[Tekst pesme ""Folk Zvezda"" ft. Coby]\n\n[Refre...",3028226,bs,,,2020,0.445087,Svakog dana tarapana - Seve S nama je ta sl...,"[Tekst pesme ""Folk Zvezda"" ft. Coby] \n \n [R...","{'section_tags': ['Tekst pesme ""Folk Zvezda"" f..."


In [None]:
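from google.colab import sheets
# Passing the raw dict-valued 'structural_elements' column raises the APIError shown in the output,
# so the column is serialised to a string first
sheet = sheets.InteractiveSheet(df=sdf.astype({'structural_elements': str}))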

2025-08-31 02:55:48,609:INFO:Failure refreshing credentials: ("Failed to retrieve http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/?recursive=true from the Google Compute Engine metadata service. Status: 404 Response:\nb''", <google.auth.transport.requests._Response object at 0x7e0325cdcec0>)
2025-08-31 02:55:48,757:INFO:Failure refreshing credentials: ("Failed to retrieve http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/?recursive=true from the Google Compute Engine metadata service. Status: 404 Response:\nb''", <google.auth.transport.requests._Response object at 0x7e0325cb1070>)


https://docs.google.com/spreadsheets/d/1GqBoQ_GabMtg62Huj96EJpmdgltSqHkGsK_rakz-Bv0/edit#gid=0


APIError: APIError: [400]: Invalid values[1][15]: struct_value 	 {
  fields {
    key: "section_line_counts"
    value {
      struct_value {
        fields {
          key: "Intro/Unknown"
          value {
            list_value {
              values {
                number_value: 29.0
              }
            }
          }
        }
      }
    }
  }
  fields {
    key: "section_tags"
    value {
      list_value {
      }
    }
  }
  fields {
    key: "total_lines"
    value {
      number_value: 34.0
    }
  }
}


## Visualise most frequent words over decades

In [None]:
# default preprocessing
def preprocess(text, nlp = new_nlp):

    # TOKENISATION
    tokens = list(nlp(text))

    # REMOVING STOP WORDS (case-insensitive) AND NON-ALPHABETIC TOKENS
    # use the stop words of the pipeline passed in, not the global one
    spacy_stopwords = nlp.Defaults.stop_words
    sentence = [word for word in tokens
                if word.text.isalpha() and word.text.lower() not in spacy_stopwords]

    # LEMMATISATION
    sentence = [word.lemma_ for word in sentence]

    return sentence

In [None]:
# Calculate term frequencies


from tqdm import tqdm
from sklearn.feature_extraction.text import CountVectorizer

class TermsDocumentsMatrix():

    def __init__(self, sdf, decades=[1960, 1970], colorscale = 'Plotly3'):
        # vectorizer on the sample lyrics
        # Use the updated preprocess_cleaned function
        self.__vectorizer = CountVectorizer(tokenizer = preprocess_cleaned)
        # fit and transform the data
        self.__data_vectorized = self.__vectorizer.fit_transform(
            tqdm(sdf['lyrics'].loc[sdf['decade'].isin(decades)])
        )
        # get decades informations
        self.__decades = sdf['decade'].loc[sdf['decade'].isin(decades)].reset_index(drop=True)
        self.__unique_decades = decades
        # get colorscale template
        self.__colorscale = colorscale

    def get_tdmatrix(self):

        # compute a Matrix terms document by decades
        df_bw = pd.DataFrame(self.__data_vectorized.toarray(),
                    columns = self.__vectorizer.get_feature_names_out())

        # check the length
        if len(df_bw) != len(self.__decades):
            raise Exception('Not the same size')

        # concatenate decade
        df_bw['decade'] = self.__decades

        # check NaN values
        if len(df_bw.columns[df_bw.isna().any()].tolist()) != 0:
            raise Exception('Decade got Nan values')

        return df_bw

    def get_tdm_by_decade(self, decade):

        if decade not in self.__unique_decades:
            raise Exception("{} doesn't appear in the decades list".format(decade))

        # compute a Matrix terms document by decades (bag of words format)
        df_bw = pd.DataFrame(self.__data_vectorized.toarray(),
                    columns = self.__vectorizer.get_feature_names_out())

        # check the length
        if len(df_bw) != len(self.__decades):
            raise Exception('Not the same size')

        # concatenate decade
        df_bw['decade'] = self.__decades

        # check NaN values
        if len(df_bw.columns[df_bw.isna().any()].tolist()) != 0:
            raise Exception('Decade got Nan values')

        # select suitable decade
        df_bw = df_bw[df_bw['decade'] == decade]

        return df_bw

    def most_freq_terms(self, n_rows = 1, n_cols = 2, n_terms = 10):

        # create the document terms matrix
        df_bw = self.get_tdmatrix()

        # create the subplot grid (one title per subplot that fits in the grid)
        fig = make_subplots(rows=n_rows, cols=n_cols,
                            x_title = 'number of occurrences',
                            y_title = 'terms',
                            subplot_titles = [str(d) for d in self.__unique_decades[:n_rows*n_cols]])

        for i in range(0,n_rows):
            for k in range(0,n_cols):
                # row-major position of this subplot in the grid
                idx = i*n_cols + k
                if idx >= len(self.__unique_decades):
                    break

                # get the decade
                decade = self.__unique_decades[idx]

                #select the suitable decade and delete decade column
                df_decade = df_bw.loc[df_bw.decade == decade, df_bw.columns != 'decade']

                # compute terms frequencies by decade
                terms_freq = df_decade.sum().sort_values(ascending = False)

                # total number of terms occurences
                total_terms = terms_freq.values

                # add figure
                fig.add_trace(go.Bar(y=terms_freq.index.tolist()[:n_terms][::-1],
                                     x=total_terms[:n_terms][::-1],
                                     name=decade,
                                     orientation='h', showlegend = False,
                                    marker = dict(color = total_terms,
                                                  colorscale=self.__colorscale)),
                              i+1, k+1)
        return fig




In [None]:
# decades from 1960 to 2010
tdm = TermsDocumentsMatrix(sdf, decades = [1960, 1970, 1980, 1990, 2000, 2010],
                           colorscale = 'Plasma')

# display bar charts of most frequent terms (3 x 2 grid: one subplot per decade)
tdm.most_freq_terms(n_rows = 3, n_cols = 2, n_terms = 20)


The parameter 'token_pattern' will not be used since 'tokenizer' is not None'

100%|██████████| 918/918 [00:48<00:00, 18.94it/s]


According to the bar charts above, a group of words recurs in every decade: love, know, go, feel, and so on. These words are typical of popular songs, which often talk about love. This is consistent with the earlier pie charts showing the proportions of musical styles over time. We also notice a significant presence of onomatopoeia such as "yeah" or "oh".

In [None]:
df.to_json('/content/drive/MyDrive/Colab Notebooks/genius-song-dataset/output/df.json')


In [None]:
# save all results and export df
df_tdm = tdm.get_tdmatrix()
df_tdm.to_csv('/kaggle/working/df_tdm.csv')



In [None]:
# most recent decades
tdm = TermsDocumentsMatrix(sdf, decades = [2000, 2010, 2020],
                          colorscale = 'Plasma')

# most frequent terms
tdm.most_freq_terms(n_rows = 2, n_cols = 2, n_terms = 15)


The parameter 'token_pattern' will not be used since 'tokenizer' is not None'

100%|██████████| 459/459 [00:25<00:00, 18.04it/s]


We get similar results for these more recent decades, with similar high-frequency words. We see more onomatopoeia in the current decade, which we can explain by the rise of rap in the current and previous decades. A widely used device in this style is the 'ad-lib': a sound, word or onomatopoeia that artists utter between two verses or at the end of a sentence to give their text more impact and to energise the track.


## Topic modeling with LDA

LDA is a common technique in topic modeling. We first apply basic preprocessing.

In [None]:
# utils
from datetime import datetime
import logging

gensim_log = '/kaggle/working/sample.log'

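# initiate log file: gensim's training messages are written to gensim_log
# so that parse_logfile(gensim_log) can read them back
logging.basicConfig(filename=gensim_log,
                    format='%(asctime)s:%(levelname)s:%(message)s',
                    level=logging.INFO,
                   force = True)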

In [None]:
# gensim
from gensim.models import CoherenceModel
from gensim.models.ldamodel import LdaModel
from gensim.models.ldamulticore import LdaMulticore
from gensim.corpora.dictionary import Dictionary
from gensim.test.utils import datapath
from tqdm import tqdm

# utils (datetime and logging are already imported in the previous cell)
import os
import re

# dashboards
import pyLDAvis
import pyLDAvis.gensim
import matplotlib.pyplot as plt

# TSNE dependencies
from sklearn.manifold import TSNE
from bokeh.plotting import figure, output_file, show
from bokeh.models import Label
from bokeh.io import output_notebook
import numpy as np
import matplotlib.colors as mcolors

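# utils
def parse_logfile(path_log):
    # extract the per-word likelihood bound from gensim's perplexity log lines
    matcher = re.compile(r'(-*\d+\.\d+) per-word .* (\d+\.\d+) perplexity')
    likelihoods = []
    # open in read mode: 'w' would truncate the log before it can be parsed
    with open(path_log, 'r') as source:
        for line in source:
            match = matcher.search(line)
            if match:
                likelihoods.append(float(match.group(1)))
    return likelihoods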


class LDATopicModeling():

    def __init__(self, df = sdf,
                 decade = 1960,
                 directory = "/kaggle/working/models/",
                 existing = False,
                 n_topics = 10,
                 worker_nodes = None,
                lang_preprocess = preprocess,
                cross_valid = False,
                epochs = 30):
        # Apply preprocessing on decade data
        self.__documents = df.loc[df.decade == decade, 'lyrics'].apply(lang_preprocess)

        # Create a corpus from a list of texts
        self.__id2word = Dictionary(self.__documents.tolist())
        self.__corpus = [self.__id2word.doc2bow(doc) for doc in self.__documents.tolist()]

        #training
        if existing and os.path.isfile(existing):
            # Load a pretrained model from the path given in `existing`.
            self.model = LdaModel.load(existing)
            self.__cv_results = None # no cross_valid
            self.__likelihood = None # no training log for a loaded model
            self.__n_topics = n_topics
        elif not cross_valid:
            self.model = LdaMulticore(
                corpus=tqdm(self.__corpus),
                id2word=self.__id2word,
                num_topics=n_topics,
                workers=worker_nodes,
                passes=epochs)
            self.__likelihood = parse_logfile(gensim_log)
            self.__n_topics = n_topics
            self.__cv_results = None
        else: # cross validation

            # hyperparameter
            alpha = []#np.arange(0.01, 1, 0.3).tolist()
            alpha.append('symmetric')
            alpha.append('asymmetric')

            # hyperparameter
            eta = []#np.arange(0.01, 1, 0.3).tolist()
            eta.append('symmetric')

            # compute results of the cross_validation
            cv_results = {
                 'topics': [],
                 'alpha': [],
                 'eta': [],
                 'coherence': []
            }

            # topic range
            topic_range = range(2, n_topics+1)

            # announce the number of models to train (gives an idea of the computation time)
            total=(len(eta)*len(alpha)*len(topic_range))
            print("total lda computation: ",total)
            model_list = []

            for k in topic_range:
                for a in alpha:
                    for e in eta:

                        # train the model
                        model = LdaMulticore(
                            corpus=self.__corpus,
                            id2word=self.__id2word,
                            num_topics=k,
                            workers=worker_nodes,
                            passes=epochs,
                            alpha=a,
                            eta=e)

                        # compute coherence
                        cv = CoherenceModel(
                            model=model,
                            texts=self.__documents,
                            dictionary=self.__id2word,
                            coherence='c_v')

                        # compute the coherence once (it is expensive) and reuse it
                        coherence = cv.get_coherence()
                        print('coherence: {}\nalpha: {}\neta: {}\ntopic: {}'.format(
                            coherence, a, e, k))

                        # Save the model results
                        cv_results['topics'].append(k)
                        cv_results['alpha'].append(a)
                        cv_results['eta'].append(e)
                        cv_results['coherence'].append(coherence)
                        model_list.append(model)
            # retrieve index of the highest coherence model
            best_index = np.argmax(cv_results['coherence'])

            # choose the model given the best coherence
            self.model = model_list[best_index]

            # save results as attribute
            self.__cv_results = cv_results

            self.__n_topics = cv_results['topics'][best_index]

            # logging doesn't work on Kaggle, so no likelihood is parsed here
            #self.__likelihood = parse_logfile()
            self.__likelihood = None
        # directory path
        self.__directory = directory

    # getters
    @property
    def get_id2word(self):
        return self.__id2word

    @property
    def get_corpus(self):
        return self.__corpus

    @property
    def get_likelihood(self):
        return self.__likelihood

    @property
    def get_cv_results(self):
        return pd.DataFrame(self.__cv_results) if self.__cv_results else None

    def plot_coherence(self, metric = 'alpha'):
        """metric(str): alpha or eta
        """
        if self.__cv_results is None:
            raise Exception('No cross validation available')

        # get the dataframe
        df_res = self.get_cv_results

        # groupby by metric
        grouped = df_res.groupby(metric)
        # create the layout
        fig = go.Figure()
        for level, df in grouped:
            fig.add_trace(
                go.Scatter(
                    x=df.topics,
                    y=df.coherence,
                    mode='lines+markers',
                    name=str(level)
                )
            )
        fig.update_layout(
            title = "coherence over topics by {}".format(metric),
            xaxis_title="topic",
            yaxis_title="coherence")
        return fig


    def save_current_model(self):
        # timestamp used as the name of the sub-directory
        stamp = datetime.now().strftime("%d%m%Y_%H%M%S")
        # create the directory if it doesn't exist
        os.makedirs(self.__directory + stamp, exist_ok=True)
        # Save model to disk.
        temp_file = datapath(self.__directory + stamp + '/model')

        self.model.save(temp_file)

    def get_perplexity(self):
        return self.model.log_perplexity(self.__corpus)

    def get_coherence(self):
        coherence_model_lda = CoherenceModel(
            model=self.model,
            texts=self.__documents,
            dictionary=self.__id2word,
            coherence='c_v')
        return coherence_model_lda.get_coherence()


    # data visualization
    def dashboard_LDAvis(self):
        # some basic dataviz
        pyLDAvis.enable_notebook()
        vis = pyLDAvis.gensim.prepare(self.model, self.__corpus,
                                      dictionary = self.model.id2word)
        return vis

    def plot_tsne(self, components = 2):
        # one row per document; each row is a vector with one position per topic,
        # where tmp[i] = probability of topic i for that document
        topic_weights = []
        for row_list in self.model[self.get_corpus]:
            tmp = np.zeros(self.__n_topics)
            for i, w in row_list:
                tmp[i] = w
            topic_weights.append(tmp)

        # Array of topic weights
        arr = pd.DataFrame(topic_weights).fillna(0).values

        # Keep the well separated points
        # filter documents whose highest topic probability exceeds a lower bound (optional)
        # arr = arr[np.amax(arr, axis=1) > 0.35]

        # Dominant topic number in each doc (to compute color)
        topic_num = np.argmax(arr, axis=1)

        # tSNE Dimension Reduction
        tsne_model = TSNE(n_components=components, verbose=1, init='pca')
        tsne_lda = tsne_model.fit_transform(arr)

        mycolors = np.array([color for name, color in mcolors.TABLEAU_COLORS.items()])

        # components
        if components == 2:
            fig = go.Figure(
                go.Scatter(x=tsne_lda[:,0],
                        y=tsne_lda[:,1],
                        marker_color=mycolors[topic_num],
                        mode='markers',
                        name='Tsne'))
            fig.update_layout(
                title = "t-SNE Clustering of {} LDA Topics".format(self.__n_topics),
                xaxis_title="x",
                yaxis_title="y",
                autosize=False,
                width=900,
                height=700)
        elif components == 3:
            fig = go.Figure(
                go.Scatter3d(
                        x=tsne_lda[:,0],
                        y=tsne_lda[:,1],
                        z=tsne_lda[:,2],
                        marker_color=mycolors[topic_num],
                        mode='markers',
                        name='Tsne'))
            fig.update_layout(
                title = "t-SNE Clustering of {} LDA Topics".format(self.__n_topics),
                xaxis_title="x",
                yaxis_title="y")
        else:
            raise Exception("Components exceed covered numbers : {}".format(components))
        return fig

        #plot = figure(title="t-SNE Clustering of {} LDA Topics".format(self.__n_topics),
        #             plot_width=900, plot_height=700)
        #plot.scatter(x=tsne_lda[:,0], y=tsne_lda[:,1], color=mycolors[topic_num])
        #show(plot)

    def plot_likelihood(self):
        # keep at most the last 50 logged likelihood values
        likelihood = self.__likelihood[-50:]
        fig = go.Figure(
            go.Scatter(x=list(range(len(likelihood))), y=likelihood,
                       mode='lines',
                       name='lines'))
        fig.update_layout(
            title = "Likelihood over passes",
            xaxis_title="passes",
            yaxis_title="likelihood")
        return fig
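
The cross-validation branch of `LDATopicModeling` is in effect a grid search: every combination of topic count, `alpha` and `eta` is scored, and the combination with the highest coherence is kept. The sketch below isolates that selection logic. The helper names `grid_search` and `toy_score` are illustrative and not part of the notebook, and `toy_score` is a made-up stand-in for training a model and computing its `c_v` coherence.

```python
from itertools import product

def grid_search(topic_range, alphas, etas, score_fn):
    """Score every (topics, alpha, eta) combination and return the best one."""
    results = [
        {"topics": k, "alpha": a, "eta": e, "coherence": score_fn(k, a, e)}
        for k, a, e in product(topic_range, alphas, etas)
    ]
    best = max(results, key=lambda r: r["coherence"])
    return best, results

def toy_score(k, alpha, eta):
    # hypothetical score: peaks at 5 topics, small bonus for an asymmetric alpha
    return 1.0 - abs(k - 5) * 0.1 + (0.05 if alpha == "asymmetric" else 0.0)

best, results = grid_search(range(2, 9),
                            ["symmetric", "asymmetric"],
                            ["symmetric"],
                            toy_score)
print(best["topics"], best["alpha"], len(results))
```

Keeping only the scores and retraining the winning configuration afterwards would avoid holding every trained model in memory, at the cost of one extra training run.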


In [None]:
# open the log in read mode ("w" would truncate the file and make it unreadable)
with open('/kaggle/working/sample.log', "r") as source:
    for line in source:
        match = matcher.search(line)
        if match:
            likelihoods.append(float(match.group(1)))

In [None]:
# create my model
lda_model = LDATopicModeling(df = sdf, decade = 1990)

In [None]:
a = [elem for elem in lda_model.model[lda_model.get_corpus]]
for i in a:
    if len(i) == 1:
        print("the size equals 1")
        print(i)

In [None]:
print([i for i in lda_model.model[lda_model.get_corpus]])

In [None]:
print([i for i in lda_model.get_corpus])


In [None]:
# print the result
lda_model.model.print_topics()

In [None]:
lda_model.plot_tsne(2)

In [None]:
lda_model.get_likelihood


In [None]:
lda_model.dashboard_LDAvis()

Let's perform LDA for each decade.

In [None]:
# LDA Topic Modeling by decade
class LDAPipeline():

    def __init__(self,
                 prep = preprocess,
                 cv = False,
                 decades = [
        1960, 1970, 1980, 1990, 2000, 2010, 2020
    ]):
        self.models = {
            decade : LDATopicModeling(
                decade = decade,
                lang_preprocess = prep,
                epochs = 10,
                cross_valid = cv) for decade in decades}

    def get_metrics(self):
        # compute metrics
        metrics = {
                 'decade': [],
                 'coherence': [],
                 'perplexity': []
        }
        for decade, model in self.models.items():
            metrics['decade'].append(decade)
            metrics['coherence'].append(model.get_coherence())
            metrics['perplexity'].append(model.get_perplexity())
        # create the dataframe
        df_m = pd.DataFrame(metrics)
        # set_index returns a new dataframe, so keep the result
        df_m = df_m.set_index('decade')
        return df_m


    def lda_info(self, decade):
        lda_model = self.models[decade]

        print("Perplexity: ", lda_model.get_perplexity())
        print("Coherence: ", lda_model.get_coherence())
        # display the figure explicitly, otherwise the returned figure is discarded
        lda_model.plot_tsne().show()
        return lda_model.dashboard_LDAvis()

In [None]:
lda_models = LDAPipeline()

In [None]:
lda_models.get_metrics()

Display information for each decade.

In [None]:
lda_1960 = lda_models.lda_info(1960)

In [None]:
lda_1960

In [None]:
lda_1970 = lda_models.lda_info(1970)

In [None]:
lda_1970

In [None]:
lda_1980 = lda_models.lda_info(1980)

In [None]:
lda_1980

In [None]:
lda_1990 = lda_models.lda_info(1990)

In [None]:
lda_1990

In [None]:
lda_2000 = lda_models.lda_info(2000)

In [None]:
lda_2000

In [None]:
lda_2010 = lda_models.lda_info(2010)

In [None]:
lda_2010

In [None]:
lda_2020 = lda_models.lda_info(2020)

In [None]:
lda_2020

The basic LDA model gives us our baseline: default **perplexity** and **coherence** scores for each decade. Let's try to improve these scores, as well as our qualitative intuition about the topics. As we saw, the topics look very similar from one decade to the next, and it is quite difficult to retrieve good topics from the representation we computed.
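
A note for reading the baseline numbers: gensim's `log_perplexity` returns a per-word likelihood bound rather than the perplexity itself, and gensim's own log messages derive the perplexity estimate as 2 raised to minus that bound. A minimal sketch of the conversion follows. The helper name `bound_to_perplexity` is ours, not part of gensim.

```python
def bound_to_perplexity(per_word_bound: float) -> float:
    """Convert a per-word likelihood bound (the kind of value returned by
    log_perplexity) into a perplexity estimate."""
    return 2.0 ** (-per_word_bound)

# a less negative bound corresponds to a lower (better) perplexity
print(bound_to_perplexity(-8.0))  # 256.0
print(bound_to_perplexity(-7.0))  # 128.0
```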

# Improve the preprocessing

In this part, we will create a preprocessing function that takes bigrams and trigrams into account and also sets aside the terms that were too recurrent in the previous part.

## ngram recognition with gensim

One way to better capture the meaning of the lyrics is to detect bigrams and trigrams with gensim's `Phrases` and `Phraser` models.

In [None]:
import gensim

def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  # deacc=True removes accent marks (the tokenizer already drops punctuation)

# split the lyrics into lists of words
data = sdf['lyrics'].tolist()
data_words = list(sent_to_words(data))

In [None]:
# display the result
data_words[0][:10]

In [None]:
import gensim

# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_model = gensim.models.phrases.Phraser(bigram)
trigram_model = gensim.models.phrases.Phraser(trigram)
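
To give some intuition for `min_count` and `threshold`: with its default scorer, gensim keeps a pair of words as a phrase when a score of the form `(count_ab - min_count) * vocab_size / (count_a * count_b)` exceeds `threshold`. The sketch below reimplements that formula on a toy corpus. The function name `phrase_score` and the corpus are illustrative assumptions, and the toy vocabulary size counts unigrams only, which is a simplification.

```python
from collections import Counter

def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Score of a candidate bigram 'a b'; higher means a stronger collocation."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

# toy corpus: "new york" co-occurs often, the other pairs rarely
sentences = [["new", "york", "city"]] * 6 + [["new", "car"], ["york", "minster"]]
unigrams = Counter(word for sent in sentences for word in sent)
bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
vocab_size = len(unigrams)  # simplification: unigrams only

score = phrase_score(unigrams["new"], unigrams["york"],
                     bigrams[("new", "york")], vocab_size, min_count=5)
rare = phrase_score(unigrams["new"], unigrams["car"],
                    bigrams[("new", "car")], vocab_size, min_count=5)
print(score, rare)
```

Pairs seen fewer than `min_count` times get a negative score, so they can never pass a positive `threshold`; raising `threshold` keeps only the pairs whose words rarely appear apart.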

In [None]:
from numpy.random import randint, seed

# set seed from numpy
seed(18)

# draw a random document index
upper_bound = randint(len(data_words))

# display a random example
print('bigram model: ', bigram_model[data_words[upper_bound]])
print('\ntrigram model: ', trigram_model[bigram_model[data_words[upper_bound]]])
print('\nsentence: ',' '.join(data_words[upper_bound]))

We can see some bigrams and trigrams; most of the time they are a word with its adjective or a group of onomatopoeias.

As previously explained, we would also like to set aside irrelevant terms that were recurrent in our last analysis. The previous bar charts show a significant number of these terms across all decades, for example `like`, `know` or `yeah`. Most of the time these uninteresting terms are verbs or onomatopoeias. Let's identify the least interesting ones by comparing the bar charts with the pyLDAvis visualizations. The verbs `like`, `know`, `come` and `get` recur strongly and are not necessarily relevant, because they are found in most of the analyzed topics. The onomatopoeias `oh`, `yeah` and `la` also appear in most of the topics.
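
The selection here was done by eye, but the same idea can be automated: count in how many topics each top term appears and flag the terms present in most of them. The sketch below uses made-up topics. The function `recurrent_terms_across_topics`, the `min_share` parameter and the word lists are illustrative assumptions, not results from the models above.

```python
from collections import Counter

def recurrent_terms_across_topics(topics, min_share=0.8):
    """Return the terms that appear in at least `min_share` of the topics."""
    # set(terms) makes sure a term is counted once per topic
    counts = Counter(term for terms in topics for term in set(terms))
    return {term for term, n in counts.items() if n / len(topics) >= min_share}

# hypothetical top words of four topics
toy_topics = [
    ["love", "know", "yeah", "heart"],
    ["money", "know", "yeah", "street"],
    ["night", "know", "dance", "yeah"],
    ["god", "know", "lord", "soul"],
]
print(sorted(recurrent_terms_across_topics(toy_topics, min_share=0.75)))
```

With a real model, the per-topic word lists could be taken from the topics it prints, and the flagged terms added to the stop-word set.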

In [None]:
# list recurrent terms
recurrent_terms = {'like','know','come','get', 'got','go','to','oh','yeah','la'}


# n-gram aware preprocessing
def ngram_preprocess(text, nlp = new_nlp,
                     bigram = bigram_model,
                     trigram = trigram_model,
                    new_stopwords = recurrent_terms):

    # perform basic preprocessing to transform sentence to list of words
    words = gensim.utils.simple_preprocess(text)

    # customize stopwords (use the nlp argument rather than the global new_nlp)
    spacy_stopwords = nlp.Defaults.stop_words
    ext_stopwords = spacy_stopwords | new_stopwords # union of sets

    #removing stop words
    no_stop_words = [word for word in words if word not in ext_stopwords]

    # perform bigram model
    bigram_words = bigram[no_stop_words]

    # perform trigram model
    trigram_words = trigram[bigram_words]

    # recreate the sentence
    sentence = ' '.join(trigram_words)

    #tokenization to get lemma
    tokens = [token for token in nlp(sentence)]

    # lemmatisation, keeping only purely alphabetic tokens
    sentence = [word.lemma_ for word in tokens if word.text.isalpha()]

    return sentence

In [None]:
# test with the previously drawn sentence
ngram_preprocess(' '.join(data_words[upper_bound]))[:10]

Let's rerun LDA with the new preprocessing function.

In [None]:
ngram_model = LDATopicModeling(decade = 2000, lang_preprocess = ngram_preprocess)

In [None]:
ngram_model.dashboard_LDAvis()

In [None]:
ngram_model.plot_tsne()

In [None]:
ngram_model = LDATopicModeling(decade = 2000,
                               lang_preprocess = ngram_preprocess,
                              cross_valid = True, epochs = 10)

In [None]:
ngram_model.plot_coherence('alpha')

In [None]:
ngram_model.plot_coherence('eta')

In [None]:
ngram_model.plot_tsne()

In [None]:
ngram_model.dashboard_LDAvis()

In [None]:
lda_cv_models = LDAPipeline(prep = ngram_preprocess, cv = True)

In [None]:
lda_cv_models.get_metrics()

In [None]:
lda_cv_1960 = lda_cv_models.lda_info(1960)

In [None]:
lda_cv_1960

In [None]:
lda_cv_1970 = lda_cv_models.lda_info(1970)

In [None]:
lda_cv_1970

In [None]:
lda_cv_1980 = lda_cv_models.lda_info(1980)

In [None]:
lda_cv_1980

In [None]:
lda_cv_1990 = lda_cv_models.lda_info(1990)

In [None]:
lda_cv_1990

In [None]:
lda_cv_2000 = lda_cv_models.lda_info(2000)

In [None]:
lda_cv_2000

In [None]:
lda_cv_2010 = lda_cv_models.lda_info(2010)

In [None]:
lda_cv_2010

In [None]:
lda_cv_2020 = lda_cv_models.lda_info(2020)

In [None]:
lda_cv_2020

In [None]:
lda_cv_models.get_metrics()

## 2015 songs lyrics topic modelling

Let's first retrieve the English songs from 2015 (the year with the most lyrics catalogued on Genius).

Let's plot the tag distribution.

The barplot below shows the frequency of each tag, colored by total views.

Let's try topic modelling with the top2vec library, which is the easiest to start with. But first, let's filter the lyrics.

# 2000 song lyrics topic modelling

## References

In [None]:
!unzip "/content/drive/MyDrive/Colab Notebooks/genius-song-dataset/song_lyrics.csv.zip" -d "."

Archive:  /content/drive/MyDrive/Colab Notebooks/genius-song-dataset/song_lyrics.csv.zip
  inflating: ./song_lyrics.csv       
