# Lyrics cleaning

This notebook mainly deals with a data corruption found in the lyrics data: fragments of lyrics appear in the index of the lyrics data, which is supposed to be an integer index. The cause is that one lyrics entry containing newline characters was written to the CSV file without quotes, possibly because of a bug in pandas or in the CSV module. The issue is worked around by manually adding quotes around that entry in ma_lyrics_8.csv and by forcing pandas to quote every entry in the lyrics column.
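The export code is not shown in this notebook, so the following is only a minimal sketch of one way to force quoting, assuming the `quoting=csv.QUOTE_NONNUMERIC` option of `DataFrame.to_csv`. The toy frame and the names `lyrics`, `buffer` and `restored` are hypothetical stand-ins, not the real data.

```python
import csv
import io

import pandas as pd

# Hypothetical stand-in for one lyrics chunk: an integer index and one text column.
lyrics = pd.DataFrame(
    {"0": ["When he appears\r\nHis flames destroy it all", "(lyrics not available)"]},
    index=[2153966, 2153967],
)

buffer = io.StringIO()
# QUOTE_NONNUMERIC asks the CSV writer to quote every non-numeric field,
# so each lyrics entry is wrapped in quotes even when it spans several lines.
lyrics.to_csv(buffer, quoting=csv.QUOTE_NONNUMERIC)

# Read the text back: each quoted entry is parsed as a single field,
# so no lyric line can leak into the index.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
print(restored.index)
```

With every entry quoted, a newline inside the lyrics can no longer be mistaken for the end of a record when the file is read back.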

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd

In [24]:
full = pd.read_csv('data/ma_songs_lyrics.csv', index_col=0, low_memory=False)

In [25]:
full

Unnamed: 0,album_url,band_name,album_name,album_type,song_name,song_id,lyrics
0,https://www.metal-archives.com/bands/%21T.O.O....,!T.O.O.H.!,Democratic Solution,Full-length,Aura & Ziata (new version),2667201.0,Aura a Ziata\r\nDěvčata to k pohledání\r\nVšak...
1,https://www.metal-archives.com/bands/%21T.O.O....,!T.O.O.H.!,Democratic Solution,Full-length,Boubelovo životakončení,2667192.0,Jsem sodomita na sofa\r\nNadmutej po sečuánský...
2,https://www.metal-archives.com/bands/%21T.O.O....,!T.O.O.H.!,Democratic Solution,Full-length,Demokratické řešení,2667195.0,"Pilulky a harfa, postřik na plebs\r\nČipy, spo..."
3,https://www.metal-archives.com/bands/%21T.O.O....,!T.O.O.H.!,Democratic Solution,Full-length,Instrumental,2667203.0,(Instrumental)
4,https://www.metal-archives.com/bands/%21T.O.O....,!T.O.O.H.!,Democratic Solution,Full-length,Kokarda pýchy,2667194.0,Podléhá limitujícím názorům programovaných\r\n...
...,...,...,...,...,...,...,...
2210683,,,,,,,"I’m breaking down, I’m feeling down, feeling d..."
2210684,,,,,,,"Its not a dream, from witch I fell.\r\nBut I k..."
2210685,,,,,,,"Leave me alone, as I gather my pride.\r\nAnd s..."
2210686,,,,,,,Take your instruments of war.\r\nBurn down you...


In [26]:
full[full['lyrics'].isna()]

Unnamed: 0,album_url,band_name,album_name,album_type,song_name,song_id,lyrics
982528,https://www.metal-archives.com/bands/Expose_Yo...,Expose Your Hate,Hatecult,Full-length,Inherent Human Cruelty,787561.0,
982529,https://www.metal-archives.com/bands/Expose_Yo...,Expose Your Hate,Hatecult,Full-length,Lies,787551.0,
982530,https://www.metal-archives.com/bands/Expose_Yo...,Expose Your Hate,Hatecult,Full-length,Moment of Reflection,787555.0,
982531,https://www.metal-archives.com/bands/Expose_Yo...,Expose Your Hate,Hatecult,Full-length,Peculiar Reason,787552.0,
982532,https://www.metal-archives.com/bands/Expose_Yo...,Expose Your Hate,Hatecult,Full-length,Reality Phobia,787558.0,
...,...,...,...,...,...,...,...
You watch it burn and die,,,,,,,
Regrets now come too late,,,,,,,
The price of death is paid,,,,,,,
Destruction is the night,,,,,,,


In [27]:
full.count()

album_url     2456323
band_name     2456323
album_name    2456318
album_type    2456323
song_name     2456323
song_id       2456323
lyrics        2210691
dtype: int64

In [38]:
# Rows whose index label contains a non-digit character (raw string avoids an invalid escape)
full[full.index.str.contains(r'\D')]

Unnamed: 0,album_url,band_name,album_name,album_type,song_name,song_id,lyrics
His flames destroy it all,,,,,,,
No mercy will he show,,,,,,,
For the avengers of his power,,,,,,,
As his fire reaches the skin,,,,,,,
The burns will make you suffer,,,,,,,
You pay the price for life long sins,,,,,,,
Eternal damnation calls,,,,,,,
This is what you feared to feel,,,,,,,
The hatred in his might,,,,,,,
He sets your land on fire,,,,,,,
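As an alternative to the regular expression, the same kind of rows can be flagged by trying to parse each index label as a number. This is a small sketch on a hypothetical toy frame (`toy`, `numeric_labels` and `corrupted` are illustrative names), not the check used in this notebook.

```python
import pandas as pd

# Toy frame mimicking the corruption: two valid labels and two stray
# lyric lines that ended up in the index.
toy = pd.DataFrame(
    {"lyrics": ["When he appears", None, None, "(Instrumental)"]},
    index=["2153966", "His flames destroy it all", "No mercy will he show", "2153967"],
)

# Labels that cannot be parsed as numbers become NaN under errors="coerce".
numeric_labels = pd.to_numeric(toy.index, errors="coerce")

# Keep only the rows whose label failed to parse.
corrupted = toy[pd.isna(numeric_labels)]
print(corrupted.index.tolist())
```

The two approaches are not identical: a label such as `12.5` contains a non-digit character but still parses as a number, so which check is preferable depends on what counts as a valid index label.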


In [41]:
# Load the ten lyrics chunks, ma_lyrics_0.csv through ma_lyrics_9.csv
lyrics_parts = [pd.read_csv(f'data/lyrics/ma_lyrics_{i}.csv', index_col=0) for i in range(10)]


In [42]:
for i, p in enumerate(lyrics_parts):
    print(i, p.index)

0 Int64Index([     0,      1,      2,      3,      4,      5,      6,      7,
                 8,      9,
            ...
            245622, 245623, 245624, 245625, 245626, 245627, 245628, 245629,
            245630, 245631],
           dtype='int64', length=245632)
1 Int64Index([245632, 245633, 245634, 245635, 245636, 245637, 245638, 245639,
            245640, 245641,
            ...
            491254, 491255, 491256, 491257, 491258, 491259, 491260, 491261,
            491262, 491263],
           dtype='int64', length=245632)
2 Int64Index([491264, 491265, 491266, 491267, 491268, 491269, 491270, 491271,
            491272, 491273,
            ...
            736886, 736887, 736888, 736889, 736890, 736891, 736892, 736893,
            736894, 736895],
           dtype='int64', length=245632)
3 Int64Index([736896, 736897, 736898, 736899, 736900, 736901, 736902, 736903,
            736904, 736905,
            ...
            982518, 982519, 982520, 982521, 982522, 982523, 982524, 982525

In [44]:
lyrics_parts[8]

Unnamed: 0,0
1965056,(lyrics not available)
1965057,(lyrics not available)
1965058,(lyrics not available)
1965059,(lyrics not available)
1965060,(lyrics not available)
...,...
2210683,"I’m breaking down, I’m feeling down, feeling d..."
2210684,"Its not a dream, from witch I fell.\r\nBut I k..."
2210685,"Leave me alone, as I gather my pride.\r\nAnd s..."
2210686,Take your instruments of war.\r\nBurn down you...


In [56]:
# Position of the first index label in chunk 8 that contains a non-digit character
lyrics_parts[8].index.get_loc(lyrics_parts[8].index[lyrics_parts[8].index.str.contains(r'\D')][0])

188911

In [58]:
lyrics_parts[8].iloc[188905 : 188935, :]

Unnamed: 0,0
2153961,Alone in the darkness\r\nThe crows are frighte...
2153962,The democratic power\r\nTries to standardize u...
2153963,Now it is time to fight\r\nBetween fear and st...
2153964,The wind has turned\r\nYou have to answer\r\n\...
2153965,The darkness times arrived\r\nWhen I was in my...
2153966,When he appears
His flames destroy it all,
No mercy will he show,
For the avengers of his power,
As his fire reaches the skin,
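The pattern in this output can be reproduced on a tiny example. The sketch below assumes a hand-written CSV text (`broken_csv`, a hypothetical stand-in for the damaged part of ma_lyrics_8.csv) in which a multi-line lyrics entry is not quoted, and shows how pandas then reads each continuation line as a new record.

```python
import io

import pandas as pd

# Hand-written CSV text: the entry for row 2 spans three lines but is not
# quoted, so each continuation line looks like a separate record.
broken_csv = (
    ",0\n"
    "1,first song\n"
    "2,When he appears\n"
    "His flames destroy it all\n"
    "No mercy will he show\n"
    "3,third song\n"
)

broken = pd.read_csv(io.StringIO(broken_csv), index_col=0)

# The continuation lines become index labels and their lyrics value is missing.
print(broken.index.tolist())
print(broken["0"].isna().sum())
```

Because the stray labels cannot be parsed as integers, the whole index falls back to strings, which is consistent with the `.str.contains` check being usable on the index here.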


In [47]:
lyrics_parts[6]

Unnamed: 0,0
1473792,We didn't stop to recognize\r\nThis thundering...
1473793,"Drink a cup of emptiness\r\nTame the storm, in..."
1473794,When it began\r\nLacking a goal\r\nTen years I...
1473795,(lyrics not available)
1473796,(lyrics not available)
...,...
1719419,(lyrics not available)
1719420,(lyrics not available)
1719421,(lyrics not available)
1719422,(lyrics not available)
