# Dataset Modeling
We have two different data sources:
1) CSV files containing song information (lyrics-data.csv)
2) one text file per artist, containing the lyrics of that artist's songs

The goal is to combine these sources into a single dataset for further analysis and preprocessing.

In [87]:
import pandas as pd
import os

## Prepare each dataframe: one from the CSV file, the other from the text files

### CSV file - lyrics-data.csv

In [88]:
df_from_csv = pd.read_csv("../data/raw/csv/lyrics-data.csv")

In [89]:
df_from_csv.head()

Unnamed: 0,ALink,SName,SLink,Lyric,language
0,/ivete-sangalo/,Arerê,/ivete-sangalo/arere.html,"Tudo o que eu quero nessa vida,\nToda vida, é\...",pt
1,/ivete-sangalo/,Se Eu Não Te Amasse Tanto Assim,/ivete-sangalo/se-eu-nao-te-amasse-tanto-assim...,Meu coração\nSem direção\nVoando só por voar\n...,pt
2,/ivete-sangalo/,Céu da Boca,/ivete-sangalo/chupa-toda.html,É de babaixá!\nÉ de balacubaca!\nÉ de babaixá!...,pt
3,/ivete-sangalo/,Quando A Chuva Passar,/ivete-sangalo/quando-a-chuva-passar.html,Quando a chuva passar\n\nPra quê falar\nSe voc...,pt
4,/ivete-sangalo/,Sorte Grande,/ivete-sangalo/sorte-grande.html,A minha sorte grande foi você cair do céu\nMin...,pt


In [90]:
df_from_csv.tail()

Unnamed: 0,ALink,SName,SLink,Lyric,language
379926,/clegg-johnny/,The Waiting,/clegg-johnny/the-waiting.html,Chorus\nHere we stand waiting on the plain\nDa...,en
379927,/clegg-johnny/,Too Early For The Sky,/clegg-johnny/too-early-for-the-sky.html,I nearly disappeared into the mouth of a croco...,en
379928,/clegg-johnny/,Warsaw 1943 (I Never Betrayed The Revolution),/clegg-johnny/warsaw-1943-i-never-betrayed-the...,"Amambuka, amambuka azothengisa izwe lakithi, i...",en
379929,/clegg-johnny/,When The System Has Fallen,/clegg-johnny/when-the-system-has-fallen.html,Sweat in the heat for days on end\nwaiting for...,en
379930,/clegg-johnny/,Woman Be My Country,/clegg-johnny/woman-be-my-country.html,Here we stand on the edge of the day\nFaces me...,en


In [91]:
df_from_csv.columns.array

<NumpyExtensionArray>
['ALink', 'SName', 'SLink', 'Lyric', 'language']
Length: 5, dtype: object

The useful columns are 'Lyric', which contains the lyrics of a song, and 'ALink', which identifies the song's artist. The artist will be used to detect duplicate lyrics that may also come from the text files.

Visual inspection shows that the text files don't contain song names, so 'SName' can't be used for duplicate detection.
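
As a rough illustration of how the artist column could flag potential duplicates, the sketch below finds artists present in both sources. The frames `csv_toy` and `txt_toy` are invented stand-ins, and the sketch assumes artist names have already been brought to the same format in both frames.

```python
import pandas as pd

# Toy stand-ins for the two sources; values are illustrative only.
csv_toy = pd.DataFrame({
    "ALink": ["adele", "adele", "ivete-sangalo"],
    "Lyric": ["song 1", "song 2", "song 3"],
})
txt_toy = pd.DataFrame({
    "ALink": ["adele", "prince"],
    "Lyric": ["all adele lyrics", "all prince lyrics"],
})

# Artists that appear in both sources and may therefore carry duplicate lyrics
overlap = sorted(set(csv_toy["ALink"]) & set(txt_toy["ALink"]))
print(overlap)

# Rows of the text-file frame whose artist is already covered by the CSV
txt_overlap_rows = txt_toy[txt_toy["ALink"].isin(csv_toy["ALink"])]
print(txt_overlap_rows)
```

This only narrows the search to shared artists. Deciding whether a given lyric is actually duplicated would still need a comparison of the texts themselves.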

In [92]:
df_from_csv = df_from_csv.drop(["SName", "SLink", "language"], axis=1)

In [93]:
df_from_csv.head()

Unnamed: 0,ALink,Lyric
0,/ivete-sangalo/,"Tudo o que eu quero nessa vida,\nToda vida, é\..."
1,/ivete-sangalo/,Meu coração\nSem direção\nVoando só por voar\n...
2,/ivete-sangalo/,É de babaixá!\nÉ de balacubaca!\nÉ de babaixá!...
3,/ivete-sangalo/,Quando a chuva passar\n\nPra quê falar\nSe voc...
4,/ivete-sangalo/,A minha sorte grande foi você cair do céu\nMin...


Checking data types and non-null counts.

In [94]:
df_from_csv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 379931 entries, 0 to 379930
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   ALink   379930 non-null  object
 1   Lyric   379854 non-null  object
dtypes: object(2)
memory usage: 5.8+ MB


Handling missing data:
1) Column 'Lyric' has 379854 non-null values for 379931 entries, so 77 lyrics are missing.
2) Column 'ALink' has 379930 non-null values, so one artist is missing.
3) The text files may supply the missing information, so the rows with NaN values can be dropped here. Joining with the text-file data will restore the originally missing information where it exists.
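
The missing-value counts can also be computed directly instead of being read off `info()`. A minimal sketch on a toy frame (the rows are invented for illustration and are not taken from lyrics-data.csv):

```python
import pandas as pd

# Toy stand-in for df_from_csv; values are illustrative only.
toy = pd.DataFrame({
    "ALink": ["/a/", None, "/b/", "/b/"],
    "Lyric": ["la la", "do re", None, "mi fa"],
})

# Number of missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() keeps only the rows with no missing value in any column
cleaned = toy.dropna()
print(len(toy) - len(cleaned), "rows dropped")
```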

In [95]:
df_from_csv = df_from_csv.dropna()

Artist names should be formatted in a way that makes merging with the text files easier. The convention is [name part]-[name part].

In [96]:
df_from_csv['ALink'].values

array(['/ivete-sangalo/', '/ivete-sangalo/', '/ivete-sangalo/', ...,
       '/clegg-johnny/', '/clegg-johnny/', '/clegg-johnny/'], dtype=object)

In [97]:
df_from_csv['ALink'] = df_from_csv['ALink'].str.replace("/", "", regex=False)

In [98]:
df_from_csv['ALink'].values

array(['ivete-sangalo', 'ivete-sangalo', 'ivete-sangalo', ...,
       'clegg-johnny', 'clegg-johnny', 'clegg-johnny'], dtype=object)

### Text files

Let's create a dataframe from the text files. The column 'ALink', which represents artist names, will be constructed from the file names. The column 'Lyric' will hold the contents of the file for each artist.

In [99]:
files = os.listdir("../data/raw/txt/")
for file in files:
    print(file)

adele.txt
al-green.txt
alicia-keys.txt
amy-winehouse.txt
beatles.txt
bieber.txt
bjork.txt
blink-182.txt
bob-dylan.txt
bob-marley.txt
britney-spears.txt
bruce-springsteen.txt
bruno-mars.txt
cake.txt
dickinson.txt
disney.txt
dj-khaled.txt
dolly-parton.txt
dr-seuss.txt
drake.txt
eminem.txt
janisjoplin.txt
jimi-hendrix.txt
johnny-cash.txt
joni-mitchell.txt
kanye-west.txt
kanye.txt
Kanye_West.txt
lady-gaga.txt
leonard-cohen.txt
lil-wayne.txt
Lil_Wayne.txt
lin-manuel-miranda.txt
lorde.txt
ludacris.txt
michael-jackson.txt
missy-elliott.txt
nickelback.txt
nicki-minaj.txt
nirvana.txt
notorious-big.txt
notorious_big.txt
nursery_rhymes.txt
patti-smith.txt
paul-simon.txt
prince.txt
r-kelly.txt
radiohead.txt
rihanna.txt


In [100]:
txt_dir = "../data/raw/txt/"
file_data = []
for file_name in os.listdir(txt_dir):
    file_path = os.path.join(txt_dir, file_name)
    # Skip sub-directories and anything else that is not a regular file
    if os.path.isfile(file_path):
        with open(file_path, "r", encoding="utf-8") as f:
            file_content = f.read()
        # The artist name is the file name without its extension
        artist = os.path.splitext(file_name)[0]
        file_data.append({"ALink": artist, "Lyric": file_content})
df_from_txt = pd.DataFrame(file_data)

In [101]:
df_from_txt.head()

Unnamed: 0,ALink,Lyric
0,adele,Looking for some education\nMade my way into t...
1,al-green,"Let's stay together I, I'm I'm so in love with..."
2,alicia-keys,Ooh....... New York x2 Grew up in a town that ...
3,amy-winehouse,Build your dreams to the stars above\nBut when...
4,beatles,"Yesterday, all my troubles seemed so far away\..."


In [102]:
df_from_txt.tail()

Unnamed: 0,ALink,Lyric
44,paul-simon,"Hey, Vietnam, Vietnam, Vietnam, Vietnam\nVietn..."
45,prince,\n\nAll of this and more is for you\nWith love...
46,r-kelly,"I hear you callin', ""Here I come baby""\nTo sav..."
47,radiohead,"Come on, come on\nYou think you drive me crazy..."
48,rihanna,"Ghost in the mirror\nI knew your face once, bu..."


In [103]:
df_from_txt.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49 entries, 0 to 48
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   ALink   49 non-null     object
 1   Lyric   49 non-null     object
dtypes: object(2)
memory usage: 916.0+ bytes


The dataframes can now be concatenated and saved as an interim dataset.
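
A minimal sketch of that step, using toy frames in place of `df_from_csv` and `df_from_txt`. The output file name `lyrics-interim.csv` and the temporary directory are assumptions made to keep the example self-contained; the notebook does not specify where the interim dataset is stored.

```python
import os
import tempfile

import pandas as pd

# Toy stand-ins for df_from_csv and df_from_txt; values are illustrative only.
csv_toy = pd.DataFrame({
    "ALink": ["ivete-sangalo", "clegg-johnny"],
    "Lyric": ["song 1", "song 2"],
})
txt_toy = pd.DataFrame({
    "ALink": ["adele", "prince"],
    "Lyric": ["all adele lyrics", "all prince lyrics"],
})

# Stack the rows of both frames and renumber the index from 0
df_interim = pd.concat([csv_toy, txt_toy], ignore_index=True)

# Hypothetical output location; a temporary directory stands in for the real one
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "lyrics-interim.csv")

# index=False keeps the row index out of the saved file
df_interim.to_csv(out_path, index=False)

# Reading the file back checks the round trip
reloaded = pd.read_csv(out_path)
print(reloaded.shape)
```

Both frames share the columns 'ALink' and 'Lyric', so `pd.concat` aligns them without further renaming.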