In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
songs = pd.read_csv('data/ma_songs_lyrics.csv', index_col=0)

Songs whose lyrics field is "(lyrics not available)" or "(Instrumental)" have no lyrics, so these rows are dropped first. Then, to simplify the analysis, every row whose lyrics contain characters outside basic ASCII is dropped as well. This eliminates most non-English lyrics, but it also removes English lyrics that contain diacritics, and non-English lyrics written entirely in ASCII still pass the filter.

In [3]:
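# Drop the two placeholder values and any lyrics with non-ASCII characters.
songs = songs[  (songs['lyrics'] != '(lyrics not available)')
              & (songs['lyrics'] != '(Instrumental)')
              & ~songs['lyrics'].str.contains(r'[^\x00-\x7f]')]
# drop=True (a boolean, not the string 'True') discards the old index.
songs = songs.reset_index(drop=True)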
songs

Unnamed: 0,album_url,band_name,album_name,album_type,song_name,song_id,lyrics
0,https://www.metal-archives.com/bands/Greed/354...,Greed,Belial / Infest,Demo,Belial,5670454,Aggrandisement\r\n\r\nMaster of earth\r\nWeake...
1,https://www.metal-archives.com/bands/Greed/354...,Greed,Belial / Infest,Demo,Infest,5670455,Infest\r\n\r\nCorrupt the\r\nmasses\r\nInfest\...
2,https://www.metal-archives.com/bands/Greed/354...,Greed,The Purge of Earth,EP,Belial,5668770,Aggrandisement\r\n\r\nMaster of earth. Fall to...
3,https://www.metal-archives.com/bands/Greed/354...,Greed,The Purge of Earth,EP,Infest,5668769,Infest\r\nCorrupt the masses\r\n\r\nSuffering\...
4,https://www.metal-archives.com/bands/Blind_Gre...,Blind Greed,The Almighty Dollar,Full-length,Blind Greed,1397957,"You know I've heard lots of stories, about how..."
...,...,...,...,...,...,...,...
451587,https://www.metal-archives.com/bands/%ED%8F%90...,폐허,Songs for Darkspirits,Full-length,"Sweet, Gesture of the Death",1150474,"slumber,\r\nPeace.\r\n\r\nthis harmony to natu..."
451588,https://www.metal-archives.com/bands/%ED%8F%90...,폐허,When Fatigue Devours Reincarnation,EP,Diary of a Decaying Man,1150491,At the end of Chaos(The man who engulfed himse...
451589,https://www.metal-archives.com/bands/%ED%8F%90...,폐허,흉가,Full-length,통곡의 서막 / Prelude to Tremendous Sadness,2213275,"my lady, wake up,\r\nin this cold night.\r\nyo..."
451590,https://www.metal-archives.com/bands/%ED%8F%90...,폐허,흉가,Full-length,흉가에 얽힌 이야기 Part III / The Tale from the Hounte...,2213273,Beauty was this hill\r\nfilled with this blood...
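
The effect of the filter above can be checked on a few made-up rows. This is only a sketch: the `toy` frame and its lyrics are hypothetical and are not taken from the dataset. It shows that the ASCII test removes an English line with a diacritic along with the non-English one.

```python
import pandas as pd

# Hypothetical toy data; these rows are illustrative only.
toy = pd.DataFrame({
    'song_name': ['a', 'b', 'c', 'd', 'e'],
    'lyrics': [
        'Master of earth',          # plain ASCII English: kept
        '(Instrumental)',           # placeholder: dropped
        '(lyrics not available)',   # placeholder: dropped
        'Götterdämmerung naht',     # non-ASCII, non-English: dropped
        'My naïve heart',           # English with a diacritic: also dropped
    ],
})

placeholders = ['(lyrics not available)', '(Instrumental)']
# True for any row with at least one character outside the 0x00-0x7f range.
has_non_ascii = toy['lyrics'].str.contains(r'[^\x00-\x7f]')
kept = toy[~toy['lyrics'].isin(placeholders) & ~has_non_ascii].reset_index(drop=True)
print(kept)
```

Only the first row survives, which is the trade-off described earlier: the filter is cheap, but it is a rough proxy for "English".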


In [4]:
words = songs['lyrics'].str.lower().str.findall("[a-z][a-z'-]*").explode()
words

0         aggrandisement
0                 master
0                     of
0                  earth
0                 weaker
               ...      
451591             there
451591             where
451591               she
451591              came
451591              from
Name: lyrics, Length: 65597626, dtype: object
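
To see what the tokenizing pattern in the cell above keeps and discards, here is a small sketch on made-up strings (the `toy_lyrics` series is hypothetical). A token must start with a letter and may continue with letters, apostrophes, or hyphens, so contractions and hyphenated words stay whole while digits and punctuation are skipped.

```python
import pandas as pd

# Hypothetical lyrics, chosen to exercise the pattern's edge cases.
toy_lyrics = pd.Series([
    "Don't fear\r\nthe Reaper!",
    "Blood-red skies, 1984",
    "...",                      # no letters at all
])

tokens = toy_lyrics.str.lower().str.findall("[a-z][a-z'-]*").explode()
print(tokens)
```

Each token keeps the index of the song it came from, which is why the index in the output above repeats. A song with no matches contributes a single NaN row after `explode()`; `value_counts()` ignores NaN by default, so such rows do not affect the counts.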

In [5]:
word_counts = words.value_counts()
word_counts

the             4000539
of              1761256
to              1574878
i               1319205
and             1290361
                 ...   
prophocy              1
unterstellen          1
bestehen              1
aufstossen            1
vess                  1
Name: lyrics, Length: 433016, dtype: int64

In [6]:
word_counts.nlargest(40)

the     4000539
of      1761256
to      1574878
i       1319205
and     1290361
you     1197155
in      1091017
a       1070359
my       885538
your     794063
is       745845
me       585289
for      556704
all      461549
will     442974
we       426122
this     409062
it       406946
on       390570
with     375286
no       372444
that     357738
are      344915
be       339231
from     334701
life     252014
now      251701
by       249281
our      242471
as       241390
so       233759
but      224902
they     224138
time     222805
i'm      216284
what     212805
see      208820
one      207504
not      189348
have     187993
Name: lyrics, dtype: int64

Next, I filter out words that I consider "trivial": all pronouns, conjunctions, prepositions, and articles. These are identified with the Moby Part-of-Speech list: https://archive.org/details/mobypartofspeech03203gut. A number of other frequently used English words, mostly auxiliary verbs and contractions, are added manually.

In [7]:
# Part-of-speech list: column 0 is the word, column 1 its POS codes (backslash-separated).
pos = pd.read_csv('data/pos.txt', sep='\\', header=None)
# Keep every word whose codes include C, P, r, D, or I.
trivial = set(pos[pos[1].str.contains('C|P|r|D|I', na=False)][0])
# Other frequent English words, added manually.
trivial.update({
    'i', "i'm", "i'll", "you're", "it's", "they're",
    'be', 'am', 'is', 'was', 'are', 'were',
    'have', 'has', 'had', 'will', 'would',
    'do', 'does', "don't", "doesn't",
    'can', 'could', "can't", "couldn't",
    'not', 'or', 'let', "let's",
})
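
The part-of-speech lookup above can be illustrated on a handful of lines. This is a sketch under assumptions: the sample entries below are made up in the `word\codes` layout that the `read_csv` call implies, and the names `sample`, `toy_pos`, and `toy_trivial` are hypothetical.

```python
import io
import pandas as pd

# Made-up entries in the assumed "word\codes" layout; not copied from the real file.
sample = io.StringIO(
    "and\\C\n"
    "under\\PvA\n"
    "they\\r\n"
    "the\\Dv\n"
    "blood\\Nt\n"
    "darkness\\N\n"
)
toy_pos = pd.read_csv(sample, sep='\\', header=None, names=['word', 'codes'])

# A word counts as trivial if ANY of its codes is C, P, r, D, or I.
is_trivial = toy_pos['codes'].str.contains('C|P|r|D|I', na=False)
toy_trivial = set(toy_pos.loc[is_trivial, 'word'])
print(sorted(toy_trivial))
```

Because a single matching code is enough, a word that carries one of these codes is removed even when it also has other uses, as with the made-up `under` entry here. That makes the filter aggressive, which may or may not be what you want for a given analysis.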

In [8]:
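# Vectorized filter; to_numpy() drops the repeated song index, so the result gets a fresh 0..n-1 index.
nontriv = pd.Series(words[~words.isin(trivial)].to_numpy())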
nontriv

0           aggrandisement
1                   master
2                    earth
3                   weaker
4                forgotten
                 ...      
34343313           killers
34343314              cold
34343315             alone
34343316                go
34343317              came
Length: 34343318, dtype: object

In [9]:
nontriv_counts = nontriv.value_counts()
nontriv_counts

life       252014
time       222805
see        208820
death      173587
never      170442
            ...  
gildel          1
nimsi           1
edzard          1
algot's         1
vess            1
Length: 432605, dtype: int64

In [10]:
nontriv_counts.nlargest(40)

life        252014
time        222805
see         208820
death       173587
never       170442
world       157951
just        153070
blood       151603
eyes        150044
know        142902
night       137785
away        135852
come        128621
way         127340
feel        125940
die         125044
take        123674
pain        119161
light       117656
mind        117560
soul        114370
end         114227
only        110466
again       102313
dead        100797
day          99322
here         97427
back         95580
love         90628
go           90139
fire         89169
god          88769
black        88422
hell         87973
fear         87164
heart        82578
dark         82528
darkness     78109
lost         77870
live         77668
dtype: int64