# Lyric Analysis and Machine Learning in Python

### By: Ari Silburt

### Date: January 20th, 2017

<div class="col-lg-6 col-sm-7">
<p>
Music has a profound influence on people. Whether we are simply listening to music as background noise or blocking out the world to listen to a favourite song, it affects our mood and reflects our values. [Research](http://www.medicaldaily.com/who-do-you-think-you-are-what-your-taste-music-says-about-you-according-science-317388) suggests that musical taste is more than a simple preference and is connected to our [Big-Five](https://en.wikipedia.org/wiki/Big_Five_personality_traits) personality traits. In addition, [a study by Sonos](https://www.fastcompany.com/3056554/how-music-changes-our-behavior-at-home) showed that music can physically change our behaviour, with couples reporting 66% more intimacy and being 33% more likely to cook together and 85% more likely to invite people over when music is playing.
</p>
<p>
Lyrics play an essential role in music, encoding the feelings, moods and philosophies of the artist. Although every song is different, are there overarching themes associated with various genres? For example, when you are feeling very happy is there a particular genre of music you're most likely to listen to? How does that change if you are sad? What about mad?
</p>
<p>
In this blog post I will analyze 5 different music genres - Country, Electronic Dance Music (EDM), Rock, Pop and Rap - and extract the key emotions and beliefs that define each.
</p>
</div>

<div>
<img src="lyric_analysis/Pink_floyd.jpg" width=350>
</div>

## Setup
The code used for this project comes from my [Machine Learning Repo](https://github.com/silburt/Machine_Learning/). 

First we need some songs. I picked a few playlists from Spotify: 
- <i>Country Top Songs of 2000-2017</i> by sjwheat (1567 songs)
- <i>Best EDM Playlist on the Planet!</i> by Matt Hoppe (1592 songs)
- <i>Greatest Rock Songs Ever</i> by Max John Maybury (1049 songs)
- <i>Super Top Hits 2000-2016</i> by Michele Insalata (1603 songs)
- <i>BEST of RAP</i> by Andy Mathers (2128 songs)

and converted these playlists to CSVs using [Exportify](https://github.com/watsonbox/exportify), with each CSV row containing the artist, song, album, etc. for each entry. Then I scraped the lyrics for each song from the genius.com API using standard Python packages like `BeautifulSoup` and `requests`. A good fraction of songs (20-40% per genre) weren't recognized by the genius.com API, due to e.g. slightly different names on Spotify and Genius, mix versions, etc. I also removed any songs that had fewer than 15 words (i.e. instrumentals). In total, I obtained 1000 songs from each genre for the following analysis.
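
The instrumental filter described above can be sketched as follows. This is a minimal illustration and not the code from the repo: `MIN_WORDS` and `keep_song` are hypothetical names, the sample lyrics are made up, and words are counted by simple whitespace splitting.

```python
MIN_WORDS = 15  # threshold quoted in the text; the constant name is illustrative

def keep_song(lyrics, min_words=MIN_WORDS):
    """Return True if the lyrics contain at least `min_words` whitespace-separated words."""
    return len(lyrics.split()) >= min_words

# Made-up lyrics: one near-instrumental track and one regular song
songs = {
    "instrumental": "la la la",
    "full_song": " ".join(["word"] * 40),
}

# Keep only the songs that pass the word-count filter
kept = {title: text for title, text in songs.items() if keep_song(text)}
print(sorted(kept))
```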

## Word Frequency
Now that we have a collection of songs, it's time to start analyzing their content. The most straightforward thing to do is analyze word frequencies and see which words show up most often. This is done pretty easily in Python:
***
```python
from collections import Counter
import matplotlib.pyplot as plt
import glob

def get_most_common_words(genre='edm'):
    # Load the lyric files for this genre (one .txt file per song)
    songs = glob.glob('%s_songs/*.txt' % genre)

    # Count how often each lower-cased, whitespace-separated word appears
    cnt = Counter()
    for song in songs:
        with open(song, 'r') as f:
            cnt.update(f.read().lower().split())
    return cnt

# Get top 40 most common words
cnt = get_most_common_words()
labels, count = zip(*cnt.most_common(40)) 

# Plot
x = range(len(count))
plt.plot(x, count)
plt.xticks(x, labels)
```
***
This gives you some nice-looking word frequency graphs. EDM and Pop are shown below:
<div>
<div><img src="lyric_analysis/worddist_edm.png" align="left" width=495 ></div>
<div><img src="lyric_analysis/worddist_pop.png" align="left" width=495 ></div>
</div>

You can clearly notice [Zipf's law](https://en.wikipedia.org/wiki/Zipf%27s_law), an empirical law which states that the $\rm{frequency}$ of a word is inversely proportional to its $\rm{rank}$. Mathematically:
$$
\rm{rank}^p_i*\rm{frequency}_i \sim \rm{rank}^p_j*\rm{frequency}_j
$$
where $p$ is an exponent (in the ideal case $p=1$, though in practice this is usually not true).
An equivalent statement is that the distribution of word frequencies forms a power law, and should look like a straight line in log-log space.

So for example, if we take the 1st, 20th and 40th most common words from the EDM genre:
<center> "the" : ($\rm{rank}=1, \rm{frequency=7289}$) </center>  
<center> "be" : ($\rm{rank}=20,\rm{frequency=1524}$) </center>  
<center> "go" : ($\rm{rank}=40,\rm{frequency=958}$) </center>  
and plug these into Zipf's law with $p=0.55$, we get:

$$
\rm{rank}^p_{the}*\rm{frequency}_{the} = 1^{0.55}*7289 = 7289   \\
\rm{rank}^p_{be}*\rm{frequency}_{be} = 20^{0.55}*1524 = 7916   \\
\rm{rank}^p_{go}*\rm{frequency}_{go} = 40^{0.55}*958 = 7286   
$$
Indeed, Zipf's law holds (approximately): the three products agree to within about 10%!
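
As a rough check on the exponent, $p$ can be estimated as minus the slope of a straight-line fit in log-log space. The sketch below is an illustration, not part of the original analysis: `fit_zipf_exponent` is a hypothetical helper that does an ordinary least-squares fit, and it uses only the three (rank, frequency) pairs quoted above, so the result is indicative at best.

```python
import math

def fit_zipf_exponent(ranks, freqs):
    """Least-squares fit of log(frequency) vs. log(rank); returns p = -slope."""
    xs = [math.log(r) for r in ranks]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx

# The three EDM data points quoted in the text: "the", "be", "go"
ranks = [1, 20, 40]
freqs = [7289, 1524, 958]

p = fit_zipf_exponent(ranks, freqs)
print("fitted p = %.2f" % p)

# rank^p * frequency should be roughly constant if Zipf's law holds
for r, f in zip(ranks, freqs):
    print(r, round(r ** p * f))
```

With a full word count, the same fit could be run over all of the top-ranked words instead of three hand-picked ones.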

Another notable thing is the average number of words per song. For example, as shown in the above plots, the average EDM song has fewer than half as many words as the average Pop song. The table below shows the average words per song for each genre. Rap has by far the most words per song, which is expected since lyrics play an especially central role in Hip-hop. It is also not surprising (to me at least) that EDM has the fewest words per song, as the genre tends to be more focused on buildups and explosive drops.

| Genre | Avg. Words per Song |
| ------------- |:-------------:|
| Rap      | 603 | 
| Pop      | 423 | 
| Country  | 274 |
| Rock     | 225 |
| EDM      | 203 |
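
An average like the ones in this table can be computed in a couple of lines. The sketch below assumes the lyrics are already loaded as a list of strings; `average_words_per_song` is a hypothetical helper, and the sample lyrics are made up.

```python
def average_words_per_song(lyrics_list):
    """Mean number of whitespace-separated words per song (0.0 for an empty list)."""
    if not lyrics_list:
        return 0.0
    return sum(len(lyrics.split()) for lyrics in lyrics_list) / len(lyrics_list)

# Made-up example: three very short "songs"
sample = ["we feel the light", "jump jump", "into the fire again tonight we fall"]
print(average_words_per_song(sample))
```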

## Word Correlations
Another interesting thing is to compare the popularity of each word <i>between</i> genres. Ranking all the words by popularity and comparing the ranks across genres can reveal differences in overall intent. This can also be done pretty easily in Python:
***
```python
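def get_word_ranks(cnt1, pos1, cnt2, pos2, master_labels, n_words, pad=20):
    # Rank the words of each genre by frequency (index 0 = most common)
    labels1, count1 = zip(*cnt1.most_common(10*n_words))
    labels2, count2 = zip(*cnt2.most_common(10*n_words))
    for i in range(n_words):
        l = labels1[i]
        # Skip words that were already added by a previous call
        if l not in master_labels[0:n_words]:
            try:
                # Rank of the word in the other genre, capped at n_words
                pos2.append(min(labels2.index(l), n_words))
            except ValueError:
                # Word is not among the other genre's top 10*n_words words
                pos2.append(n_words)
            master_labels.append(l)
            pos1.append(i)
    return master_labels, pos1, pos2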

# get word frequencies
n_words = 80
cnt_edm = get_most_common_words('edm')
cnt_hh = get_most_common_words('hip-hop')
labels, pos_edm, pos_hh = [], [], []

# get word correlations
labels, pos_edm, pos_hh = get_word_ranks(cnt_edm, pos_edm, cnt_hh, pos_hh, labels, n_words)
labels, pos_hh, pos_edm = get_word_ranks(cnt_hh, pos_hh, cnt_edm, pos_edm, labels, n_words)

# plot
plt.plot(pos_edm, pos_hh, '.', color='black')
for i in range(len(pos_hh)):
    rot, bx, by = 0, 0.5, -1
    if pos_hh[i] == n_words:
        rot, bx, by = 90, 0, 5
    plt.text(pos_edm[i]+bx, pos_hh[i]+by, labels[i], size=10, rotation=rot)
```
***

Yielding the following plot: 
<div>
<img src="lyric_analysis/wordcorr_edm_rap.png" width=850>
</div>

In this plot we have EDM rank on the x-axis and rap rank on the y-axis. The blue line shows $y=x$, and the two green lines show $y=2x$ and $y=0.5x$. Thus, words near the blue line are about equally popular in EDM and rap, while words beyond the green lines are disproportionately more popular in one genre than the other. For example, "love" is the 17th most popular EDM word and the 68th most popular rap word, so it is disproportionately more popular in EDM than in rap.

Finally, the datapoints that fall on the red-dotted lines correspond to words that either:
- did not show up at all in the other genre (e.g. "bitch" is the X most popular word in rap, but doesn't show up at all in EDM).
- had too high a rank to fit comfortably on the plot (e.g. "she" has a rank of 203 in EDM).
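
The reading of the guide lines described above can be written down as a small rule. This is a sketch under stated assumptions: `classify_word` is a hypothetical helper, ranks are 1-based with a lower number meaning a more popular word, and only the "love" ranks (17 in EDM, 68 in rap) come from the text.

```python
def classify_word(rank_x, rank_y):
    """Classify a word by where (rank_x, rank_y) falls relative to y = 2x and y = 0.5x.

    A lower rank number means a more popular word.
    """
    if rank_y > 2 * rank_x:
        # Above the y = 2x line: much rarer in the y-axis genre
        return "more popular in x genre"
    if rank_y < 0.5 * rank_x:
        # Below the y = 0.5x line: much rarer in the x-axis genre
        return "more popular in y genre"
    return "similar popularity"

# "love": 17th in EDM (x-axis) and 68th in rap (y-axis), as quoted in the text
print(classify_word(17, 68))
```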

## Unique Words
To me, probably the most interesting thing is to find words with disproportionately high or low rankings when compared across all genres. This extends the analysis above, where I was only comparing ranks between two genres at a time. The first table below shows words ranked at least 1.5x higher (i.e. more popular) in a given genre than in all the others, while the second table shows words ranked at least 1.5x lower than in all the other genres.

<br>
<center> __Disproportionately high rank for given genre__ </center>

| EDM | Rap | Rock | Country | Pop | 
| ------------- |:-------------:| ------------- |:-------------:| ------------- |
| we    | got       | well   | little | |
| feel  | she       | woman  | every  | |
| we're | ni\*\*a(s)| soul   | old    | |
| our   | they      | people | song   | |
| us    | bitch(es) |        | kiss   | |
| into  | fuck      |        | town   | |
| again | shit      |        | road   | |
| light | money     |        | those  | |
| fire  | i'mma     |        | | |
| fall  | hit       |        | | |
| jump  | pussy     |        | | |

<br>
<center> __Disproportionately low rank for given genre__ </center>

| EDM | Rap | Rock | Country | Pop | 
| ------------- |:-------------:| ------------- |:-------------:| ------------- |
| she   | love   | up   | | |
| ain't | we're  | hard | | |
| her   | you're | work | | |
| girl  | our    | even | | |
| man   | gonna  |      | | |
| he    | away   |      | | |
| bad   | heart  |      | | |
| she's | were   |      | | |
| his   | world  |      | | |
| crazy | light  |      | | |
| him   | eyes   |      | | |

A few interesting trends emerge from these tables:
- __EDM__: Focuses on togetherness (we, our, us), feelings, and drama (fire, light, fall, jump). It also appears very gender neutral, with gendered (pro)nouns like she/he, her/him, girl/man disproportionately rare.
- __Hip-Hop__: Probably unsurprisingly, it's focused on women, money, and sex. J. Cole speaks the truth when he says (in G.O.M.D.), "_It's called love, Ni\*\*as don't sing about it no more_". "Love" was the most striking example of a word that had a disproportionately low rank in a specific genre. This trend persists, with other intimate words like "heart", "our", "light" and "eyes" also being disproportionately rare in Hip-Hop.
- __Rock__: Still seems to have retained some overarching themes from the 60s and 70s maybe, with popular words like woman, soul and people. Not sure what to make of the disproportionately rare words... maybe rock musicians hate "hard work" :P?
- __Country__: Country seems to be focused on the simpler things in life (song, kiss, road), and also <i>really</i> likes to miniaturize things - "little baby", "little thing", "little kisses", etc., with "little" being the top unique word.
- __Pop__: Pop has no disproportionately popular/rare words, which is expected given that pop music draws from all the other genres. Country, Rock, Hip-Hop and EDM all penetrate the Top 40 continuously.

EDM and Hip-Hop seem to be the two most distinct genres, having the largest number of disproportionately popular/rare words (there are more disproportionately rare words for both genres not shown here).
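
The 1.5x criterion behind the tables above can be sketched as follows. This is one possible interpretation and not necessarily how the tables were produced: `disproportionate_words` is a hypothetical helper, ranks are 1-based with a lower number meaning a more popular word, and a word missing from another genre is assumed to have an infinitely large rank there. The sample ranks are illustrative; only a few of them ("love" at 17, 68 and 14, "she" at 203) are quoted in the text.

```python
import math

def disproportionate_words(ranks_by_genre, genre, factor=1.5):
    """Return (high, low) word lists for `genre`.

    high: words ranked at least `factor` times better than in every other genre.
    low:  words ranked at least `factor` times worse than in every other genre.
    """
    own = ranks_by_genre[genre]
    others = [ranks for g, ranks in ranks_by_genre.items() if g != genre]
    high, low = [], []
    for word, rank in own.items():
        # Assumption: a word absent from another genre gets rank infinity there
        other_ranks = [o.get(word, math.inf) for o in others]
        if all(o >= factor * rank for o in other_ranks):
            high.append(word)
        elif all(rank >= factor * o for o in other_ranks):
            low.append(word)
    return high, low

# Illustrative ranks (mostly made up)
ranks = {
    "edm":  {"we": 5,  "love": 17, "she": 203},
    "rap":  {"we": 30, "love": 68, "she": 12},
    "rock": {"we": 25, "love": 14, "she": 40},
}

for genre in ranks:
    print(genre, disproportionate_words(ranks, genre))
```

Requiring the factor to hold against every other genre, rather than against an average, is what keeps these lists short.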

## Train a Neural Network to generate new lyrics

"Love" was the most striking example of a word that had a high rank in all genres except Hip-Hop, with ranks of 17, 14, 23 and 15 for EDM, Rock, Country and Pop respectively, while having a rank of 64 in Hip-Hop. If you prefer comparing the actual word counts (normalized by song length), "love" appears on average 5 times less often in Hip-Hop than in the other genres.