# Lyric Analysis

### By: Ari Silburt

### Date: January 20th, 2017

<div class="col-lg-6 col-sm-7">
<p>
Everyone listens to music. It is inextricably linked with who we are as humans, and the music we choose to hear at a given moment often reflects how we are feeling.
</p>
<p>
The meaning behind music varies between songs, but are there general similarities between songs of a specific genre? What general feelings and emotions are conveyed by a particular genre as a whole?
</p>
</div>

<div>
<img src="lyric_analysis/Pink_floyd.jpg" width=350>
</div>

## Setup
The code used for this project comes from https://github.com/silburt/Machine_Learning/. 

First we need some songs. I picked a few <em>"Best of (insert genre here)"</em> playlists from Spotify and exported each one to a csv using [Exportify](https://github.com/watsonbox/exportify), which gives the artist and song title for each entry. Then, I scraped the lyrics for each song from genius.com using standard Python packages like `BeautifulSoup` and `requests`. In total, I obtained roughly 1000 different songs released between 2000 and 2017 for each of: EDM, Hip-Hop, Country, Rock and Pop. All the relevant code for this is contained [here](https://github.com/silburt/Machine_Learning/).
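
As a rough sketch of the first step, an exported playlist can be read with Python's built-in `csv` module. The column names `Artist Name` and `Track Name` and the helper `read_playlist` are assumptions for illustration, so check the header of your own export before relying on them.

```python
import csv
import io


def read_playlist(csv_file, artist_col='Artist Name', title_col='Track Name'):
    """Return a list of (artist, title) pairs from a playlist CSV.

    The default column names are assumptions; adjust them to the real header.
    """
    reader = csv.DictReader(csv_file)
    return [(row[artist_col].strip(), row[title_col].strip()) for row in reader]


# Tiny in-memory example standing in for an exported playlist file
sample = io.StringIO(
    'Track Name,Artist Name\n'
    'Song A,Artist X\n'
    'Song B,Artist Y\n'
)
playlist = read_playlist(sample)
print(playlist)
```

Each (artist, title) pair can then be used to look up the matching lyrics page.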


## Word Frequency
Now that we have a collection of songs, it's time to get to some fun stuff and analyze their content. The most straightforward thing to do is count word frequencies and see which words show up most often. This is pretty easy to do in Python:
***
```python
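from collections import Counter
import glob

import matplotlib.pyplot as plt


def get_most_common_words(genre='edm'):
    """Count how often each whitespace-separated word appears in a genre's lyrics."""
    # Each song's lyrics live in their own text file, e.g. edm_songs/*.txt
    songs = glob.glob('%s_songs/*.txt' % genre)

    cnt = Counter()
    for song in songs:
        # Lowercase so that "Love" and "love" are counted together
        with open(song, 'r') as f:
            cnt.update(f.read().lower().split())

    return cnt


# Get the 40 most common words and their counts
cnt = get_most_common_words()
labels, count = zip(*cnt.most_common(40))

# Plot count against rank, labelling each tick with its word
x = range(len(count))
plt.plot(x, count)
plt.xticks(x, labels, rotation=90)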
```
***
This gives some nice looking word frequency graphs. The plots for EDM and Pop are shown below:
<div>
<div><img src="lyric_analysis/worddist_edm.png" align="left" width=495 ></div>
<div><img src="lyric_analysis/worddist_pop.png" align="left" width=495 ></div>
</div>

The first interesting thing about these graphs is the clear presence of [Zipf's law](https://en.wikipedia.org/wiki/Zipf%27s_law), an empirical law stating that the $\rm{frequency}$ of a word is inversely proportional to (a power of) its $\rm{rank}$. Put another way, for any two words $i$ and $j$ it should be approximately true that:
$$
\rm{rank}^p_i \times \rm{frequency}_i \sim \rm{rank}^p_j \times \rm{frequency}_j
$$
where $p$ is an exponent (in the ideal case $p=1$, but in practice this is usually not true).
An equivalent statement is that the distribution of words follows a power law, and should look like a straight line in a log-log plot (indeed these plots do, not shown here).
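
One way to check the straight-line claim is to fit a line to log(frequency) against log(rank); the negative of the slope is then an estimate of $p$. The sketch below does an ordinary least-squares fit in plain Python. `fit_zipf_exponent` is a hypothetical helper, and the synthetic counts stand in for real lyric counts.

```python
import math


def fit_zipf_exponent(counts):
    """Estimate p in frequency ~ C / rank**p by least squares in log-log space.

    `counts` must be sorted from most to least common word.
    """
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope


# Synthetic counts that follow an exact power law with p = 0.55
synthetic = [1000 / rank ** 0.55 for rank in range(1, 41)]
p_est = fit_zipf_exponent(synthetic)
print(round(p_est, 2))
```

For real data, the input would be something like `[c for _, c in cnt.most_common(40)]`, and the fit will only be approximate.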

As an example, take the first, middle and last of the top 40 words for the EDM genre:
<center> "the" : ($\rm{rank}=1, \rm{frequency=7289}$) </center>  
<center> "be" : ($\rm{rank}=20,\rm{frequency=1524}$) </center>  
<center> "go" : ($\rm{rank}=40,\rm{frequency=958}$) </center>  
Plugging these into Zipf's law with $p=0.55$, we find that:

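$$
\begin{aligned}
\rm{rank}^p_{the} \times \rm{frequency}_{the} &= 1^{0.55} \times 7289 = 7289 \\
\rm{rank}^p_{be} \times \rm{frequency}_{be} &= 20^{0.55} \times 1524 \approx 7917 \\
\rm{rank}^p_{go} \times \rm{frequency}_{go} &= 40^{0.55} \times 958 \approx 7286
\end{aligned}
$$
The three products agree to within about 10%, so Zipf's law does indeed hold!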
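
The arithmetic above is easy to reproduce. This short check uses only the three (rank, frequency) pairs quoted in the text and the same exponent $p=0.55$; the variable names are illustrative.

```python
# (rank, frequency) pairs for EDM quoted in the text
words = {'the': (1, 7289), 'be': (20, 1524), 'go': (40, 958)}
p = 0.55

# rank^p * frequency should be roughly constant if Zipf's law holds
products = {w: rank ** p * freq for w, (rank, freq) in words.items()}
for w, value in products.items():
    print('%-3s rank^p * frequency = %.0f' % (w, value))

# Relative spread between the largest and smallest product
spread = max(products.values()) / min(products.values()) - 1
print('spread: %.1f%%' % (100 * spread))
```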

The second interesting thing is that the average number of words per EDM song is less than half that of Pop. The table below shows the average words per song for each genre. Pop has more words per song than even Hip-Hop, which is a bit surprising to me considering that lyrics play an especially central role in Hip-Hop. It does however make sense (to me at least) that EDM would have the fewest words per song. Oftentimes I get the feeling that EDM lyrics are a bumper-sticker afterthought...

| Genre   | Avg. Words per Song |
| ------- |:-------------------:|
| Pop     | 423 |
| Hip-Hop | 368 |
| Country | 274 |
| Rock    | 225 |
| EDM     | 203 |
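
An average like the ones in the table can be computed along these lines. This is a minimal sketch that assumes words are split on whitespace, as in the counting code; `average_words_per_song` is a hypothetical helper and the toy lyrics are made up.

```python
def average_words_per_song(lyrics_list):
    """Mean number of whitespace-separated words across a list of lyric strings."""
    if not lyrics_list:
        return 0.0
    return sum(len(lyrics.split()) for lyrics in lyrics_list) / len(lyrics_list)


# Made-up lyrics, used only to show the calculation
toy_songs = ['la la la la', 'oh oh', 'we feel the love tonight']
avg = average_words_per_song(toy_songs)
print(avg)
```

For a real genre, `lyrics_list` would hold the contents of each lyrics file for that genre.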

## Word Correlations
A more interesting thing to do is compare the popularity of words between genres. If we rank the words in each genre by frequency and compare the ranks, this can reveal the (sometimes not so) subtle differences in message between genres. Like before, this can be done pretty easily in Python by reusing `get_most_common_words` from above along with a bit more code:
***
```python
import numpy as np


def get_word_ranks(cnt1, pos1, cnt2, pos2, master_labels, n_words, pad=20):
    """Record, for each top word of genre 1, its rank in genre 1 and in genre 2."""
    labels1, count1 = zip(*cnt1.most_common(10*n_words))
    labels2, count2 = zip(*cnt2.most_common(10*n_words))
    for i in range(n_words):
        word = labels1[i]
        # Skip words already recorded by an earlier call
        if word not in master_labels[0:n_words]:
            # Random position in the padding band just beyond n_words
            jitter = (pad-3)*np.random.random() + n_words + 1
            try:
                # Rank in genre 2, capped so low-ranked words land in the padding band
                pos2.append(min(labels2.index(word), jitter))
            except ValueError:
                # Word is not in genre 2's top 10*n_words at all
                pos2.append(jitter)
            master_labels.append(word)
            pos1.append(i)
    return master_labels, pos1, pos2

n_words = 80
cnt_edm = get_most_common_words('edm')
cnt_hh = get_most_common_words('hip-hop')
labels, pos_edm, pos_hh = [], [], []

# First pass: top EDM words; second pass: top Hip-Hop words not already seen
labels, pos_edm, pos_hh = get_word_ranks(cnt_edm, pos_edm, cnt_hh, pos_hh, labels, n_words)
labels, pos_hh, pos_edm = get_word_ranks(cnt_hh, pos_hh, cnt_edm, pos_edm, labels, n_words)

# Plot EDM rank against Hip-Hop rank and label each point with its word
plt.plot(pos_edm, pos_hh, '.', color='black')
for i in range(len(pos_hh)):
    plt.text(pos_edm[i]+0.2, pos_hh[i]+0.2, labels[i], size=10, rotation=20)
```
***

<div class="col-lg-9 col-sm-7">
<img src="lyric_analysis/wordcorr_edm_hip-hop.png" align="left" width=800 >
</div>

<br><br><br><br><br><br>

| Unique EDM Words | Unique Hip-Hop Words |
| ---------------- |:--------------------:|
| we    | my    |
| oh    | she   |
| love  | bitch |
| feel  | fuck  |
| time  | money |
| heart | girl  |
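
One plausible way to build a table like this is to take the top-N words of each genre and keep those that are missing from the other genre's top N. The sketch below works under that assumption and is not necessarily how the table above was produced; `unique_top_words` is a hypothetical helper and the toy counters are made up.

```python
from collections import Counter


def unique_top_words(cnt_a, cnt_b, n_words=80):
    """Words in the top n_words of cnt_a that are absent from the top n_words of cnt_b.

    The result keeps cnt_a's frequency order.
    """
    top_b = {word for word, _ in cnt_b.most_common(n_words)}
    return [word for word, _ in cnt_a.most_common(n_words) if word not in top_b]


# Toy counters standing in for the per-genre word counts
cnt_a = Counter({'the': 50, 'we': 30, 'love': 20, 'you': 10})
cnt_b = Counter({'the': 60, 'my': 40, 'you': 25, 'money': 5})

only_a = unique_top_words(cnt_a, cnt_b, n_words=3)
only_b = unique_top_words(cnt_b, cnt_a, n_words=3)
print(only_a, only_b)
```

With the real counters from `get_most_common_words`, the same two calls would give one column of the table each.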

<div class="col-lg-9 col-sm-7">
<img src="lyric_analysis/wordcorr_pop_rock.png" align="left" width=800 >
</div>

<br><br><br><br><br><br>


<div>
<div><img src="lyric_analysis/wordcorr_edm_hip-hop.png" align="left" width=495 ></div>
<div><img src="lyric_analysis/wordcorr_edm_pop.png" align="left" width=495 ></div>
</div>
