# Linkin Park Lyrical Analysis


In [1]:
# import source scripts
import context  # build context to modules in other packages
from data.filemgmt import vectorize_docs
from generate_df import DataframeGenerator, LINKIN_PARK_ALBUMS
from generate_plot import rel_freq_plot, cos_sim_plot, doc_sent_plot, phrase_sent_plot

import plotly.plotly as py
import qgrid
qgrid.nbinstall(overwrite=True)

## Application Programming Interface (API)

```vectorize_docs``` builds an array of lyrics for the specified artist and albums. It constructs either an ```Album``` or an ```Artist``` object and flattens its output before returning the lyrics together with their labels.

Here is the output for "Papercut", the second track in the generated data, from Linkin Park's debut album Hybrid Theory. Only the first 1000 characters of the lyrics are shown.

In [2]:
data, labels = vectorize_docs(
    artist="linkin-park",
    albums=["hybrid-theory"],
    keep_album=True,
    titlify=True,
)

print "Song name:\n", labels[1], "\n\nLyrics:\n", data[1][:1000] + "..."

Song name:
Papercut (Hybrid Theory) 

Lyrics:
Why does it feel like night today?
Something in the air's not right today
Why am I so uptight today?
Paranoia's all I got left
I don't know what stressed me first
Or how the pressure was fed
But I know just what it feels like
To have a voice in the back of my head
Like a face that I hold inside
A face that awakes when I close my eyes
A face watches every time I lie
A face that laughs every time I fall
(It watches everything)
So I know now when it's time to sink or swim
That the face inside is hearing me
Right beneath my skin
It's like I'm paranoid lookin' over my back
It's like a whirlwind inside of my head
It's like I can't stop what I'm hearing within
It's like the face inside is right beneath my skin
I know I've got a face in me
Points out all my mistakes to me
You've got a face on the inside too
Your paranoia's probably worse
I don't know what set me off first but I know what I can't stand
Everybody acts like the fact of the matter is
I

The text is then normalized through several layers of cleaning.<br>

Using a sample string, the filtering process can be demonstrated. Let's pick a really tough one:<br>
"Ain't nothing more that'll have you grinnin’ like a possum eatin’ a sweet tater than these 80 songs."<br>
(Translation: "There's nothing more that can make you happier than these 80 songs.")

The text goes through all the cleaning methods in the sequence described. The methods include:

**Lowercase conversion**<br>
**_Original:_** ```"Ain't nothing more that'll have you grinnin’ like a possum eatin’ a sweet tater than these 80 songs."```<br>
**_Cleaned:_** ```"ain't nothing more that'll have you grinnin’ like a possum eatin’ a sweet tater than these 80 songs."```

**Expansion of word contractions**<br>
**_Original:_** ```"ain't nothing more that'll have you grinnin’ like a possum eatin’ a sweet tater than these 80 songs."```<br>
**_Cleaned:_** ```"is not nothing more that will have you grinnin’ like a possum eatin’ a sweet tater than these 80 songs."```

**Stripping punctuation (except parentheses)**<br>
**_Original:_** ```"is not nothing more that will have you grinnin’ like a possum eatin’ a sweet tater than these 80 songs."```<br>
**_Cleaned:_** ```"is not nothing more that will have you grinnin like a possum eatin a sweet tater than these 80 songs"```

**Stripping numerical values**<br>
**_Original:_** ```"is not nothing more that will have you grinnin like a possum eatin a sweet tater than these 80 songs"```<br>
**_Cleaned:_** ```"is not nothing more that will have you grinnin like a possum eatin a sweet tater than these songs"```

**Penn Treebank part-of-speech tagging**<br>
**_Original:_** ```"is not nothing more that will have you grinnin like a possum eatin a sweet tater than these songs"```<br>
**_Cleaned:_** ```[('is', 'VBZ'), ('not', 'RB'), ('nothing', 'NN'), ('more', 'JJR'), ('that', 'WDT'), ('will', 'MD'), ('have', 'VB'), ('you', 'PRP'), ('grinnin', 'VBP'), ('like', 'IN'), ('a', 'DT'), ('possum', 'NN'), ('eatin', 'VBZ'), ('a', 'DT'), ('sweet', 'JJ'), ('tater', 'NN'), ('than', 'IN'), ('these', 'DT'), ('songs', 'NNS')]```

**Lemmatization**<br>
**_Original:_** ```[('is', 'VBZ'), ('not', 'RB'), ('nothing', 'NN'), ('more', 'JJR'), ('that', 'WDT'), ('will', 'MD'), ('have', 'VB'), ('you', 'PRP'), ('grinnin', 'VBP'), ('like', 'IN'), ('a', 'DT'), ('possum', 'NN'), ('eatin', 'VBZ'), ('a', 'DT'), ('sweet', 'JJ'), ('tater', 'NN'), ('than', 'IN'), ('these', 'DT'), ('songs', 'NNS')]```<br>
**_Cleaned:_** ```['be', 'not', 'nothing', 'more', 'that', 'will', 'have', 'you', 'grinnin', 'like', 'a', 'possum', 'eatin', 'a', 'sweet', 'tater', 'than', 'these', 'song']```


**Special case filtering methods**
* Stripping dates
* Extracting strings between parentheses
* Skipping punctuation stripping for special words, e.g. C++
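
The first four cleaning steps can be approximated with the standard library alone. The sketch below is not the code behind `textfilter.normalize_text`: the `CONTRACTIONS` table and all function names are hypothetical, the table is deliberately tiny, and the tagging and lemmatization steps are left out. It is written for Python 3.

```python
import re

# Hypothetical, deliberately small contraction table (an assumption for this
# sketch); a real table would cover many more forms.
CONTRACTIONS = {
    "ain't": "is not",
    "can't": "cannot",
    "don't": "do not",
    "that'll": "that will",
    "it's": "it is",
}

_CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b"
)


def expand_contractions(text):
    """Replace each contraction found in the table with its expansion."""
    return _CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0)], text)


def strip_punctuation(text):
    """Drop everything except word characters (letters, digits, underscore),
    whitespace, and parentheses."""
    return re.sub(r"[^\w\s()]", "", text)


def strip_numbers(text):
    """Drop runs of digits."""
    return re.sub(r"\d+", "", text)


def clean(text):
    """Apply the steps in the order described above, then collapse whitespace."""
    text = text.lower()
    text = expand_contractions(text)
    text = strip_punctuation(text)
    text = strip_numbers(text)
    return " ".join(text.split())


sample = ("Ain't nothing more that'll have you grinnin\u2019 like a possum "
          "eatin\u2019 a sweet tater than these 80 songs.")
cleaned = clean(sample)
# cleaned matches the result of the "Stripping numerical values" step:
# "is not nothing more that will have you grinnin like a possum eatin a sweet tater than these songs"
```

Collapsing whitespace at the end is a choice made for this sketch, so that removing "80" does not leave a double space behind.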

In [3]:
from data import textfilter

print textfilter.normalize_text(u"Ain't nothing more that'll have you grinnin’ like a possum eatin’ a sweet tater than these 80 songs.")

[u'be', 'not', 'nothing', 'more', 'that', 'will', 'have', 'you', 'grinnin', 'like', 'a', 'possum', 'eatin', 'a', 'sweet', 'tater', 'than', 'these', u'song']


To normalize the sample song grabbed earlier:

In [4]:
print " ".join(textfilter.normalize_text(data[1]))

why do it feel like night today something in the air not right today why be i so uptight today paranoia all i get leave i do not know what stress me first or how the pressure be feed but i know just what it feel like to have a voice in the back of my head like a face that i hold inside a face that awake when i close my eye a face watch every time i lie a face that laugh every time i fall it watch everything so i know now when it be time to sink or swim that the face inside be hear me right beneath my skin it be like i be paranoid lookin over my back it be like a whirlwind inside of my head it be like i cannot stop what i be hear within it be like the face inside be right beneath my skin i know i have get a face in me point out all my mistake to me you have get a face on the inside too your paranoia probably worse i do not know what set me off first but i know what i cannot stand everybody act like the fact of the matter be i cannot add up to what you can but everybody have a face that 

## Dataframes

Each of the following variables holds either a single dataframe or an array of dataframes for the corresponding textual analysis:

* ```rel_freq```: Relative frequency analysis
* ```cos_sim```: Cosine similarity analysis
* ```cos_sim_all```: Cosine similarity analysis on songs across all albums
* ```doc_sent```: Aggregated document sentiment analysis
* ```phrase_sent```: Phrase sentiment analysis

A tuple of Linkin Park album titles is used to generate the analyses for the individual albums:

In [5]:
print LINKIN_PARK_ALBUMS

('hybrid-theory', 'meteora', 'minutes-to-midnight', 'a-thousand-suns', 'living-things', 'the-hunting-party', 'one-more-light')


The dataframes are generated with the imported `generate_df` script.

In [6]:
df = DataframeGenerator()
df.init_dfs()

rel_freq generated
cos_sim generated
cos_sim_all generated
doc_sent generated
phrase_sent generated


## Plots

The plots for the analyses were generated with ```matplotlib``` and ```seaborn```.

### Relative Frequency Analysis

A vocabulary of n-grams with lengths in the range [2, 8] was generated using a count vectorizer with the following configuration:

    sklearn.feature_extraction.text.CountVectorizer(
        tokenizer=data.textfilter.normalize_text,
        max_features=2000,
        ngram_range=(2, 8),
        stop_words="english",
    )

The vocabulary was sorted by its counts to create the dataframe.
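
To make the counting step concrete, here is a minimal standard-library sketch of what an n-gram count over tokenized documents looks like. It stands in for `CountVectorizer` rather than calling it, it does not apply stop-word filtering, and the function names are hypothetical. It also assumes that "relative frequency" means a phrase's count divided by the total count of the kept phrases, which the notebook does not spell out.

```python
from collections import Counter


def ngrams(tokens, n_min, n_max):
    """Yield every space-joined n-gram of length n_min..n_max."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def relative_frequencies(docs, n_min=2, n_max=8, max_features=2000):
    """Count n-grams over all tokenized docs and return the most frequent
    phrases as (phrase, relative frequency) pairs, highest count first."""
    counts = Counter()
    for tokens in docs:
        counts.update(ngrams(tokens, n_min, n_max))
    top = counts.most_common(max_features)
    if not top:
        return []
    # Assumption: normalize by the total count of the kept phrases.
    total = float(sum(count for _, count in top))
    return [(phrase, count / total) for phrase, count in top]


# Toy input: two already-tokenized "documents".
docs = [["in", "the", "end"], ["in", "the", "end", "it"]]
freqs = relative_frequencies(docs, n_min=2, n_max=3)
```

In this toy input, "in the" occurs twice among eight counted n-grams, so its relative frequency under the assumed definition is 0.25.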

In [7]:
qgrid.show_grid(df.rel_freq[0], export_mode=True)

Barplots of the phrases against their relative frequencies were generated.

In [8]:
fig = rel_freq_plot(df.rel_freq)

py.iplot(fig, filename='rel_freq')

This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]
[ (2,1) x3,y3 ]  [ (2,2) x4,y4 ]
[ (3,1) x5,y5 ]  [ (3,2) x6,y6 ]
[ (4,1) x7,y7 ]  [ (4,2) x8,y8 ]



### Cosine Similarity Analysis

Term Frequency-Inverse Document Frequency (TF-IDF) vectors were generated for the songs of each album to analyze the cosine similarity between them. The following configuration was used:

    sklearn.feature_extraction.text.TfidfVectorizer(
        tokenizer=data.textfilter.normalize_text,
        max_features=500,
        ngram_range=(1, 5),
    )
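
As a rough illustration of the idea, the sketch below computes TF-IDF weights and pairwise cosine similarity with the standard library only. It is restricted to unigrams and does not use `TfidfVectorizer`. The smoothed IDF formula `ln((1 + N) / (1 + df)) + 1` with L2 normalization is an assumption chosen for the sketch, and the function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one L2-normalized {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in docs for term in set(tokens))
    vectors = []
    for tokens in docs:
        term_freq = Counter(tokens)
        vec = {
            term: count * (math.log((1.0 + n_docs) / (1.0 + doc_freq[term])) + 1.0)
            for term, count in term_freq.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors


def cosine_similarity(a, b):
    """Dot product of two L2-normalized sparse vectors."""
    return sum(weight * b.get(term, 0.0) for term, weight in a.items())


def similarity_matrix(docs):
    """Pairwise cosine similarity between all documents."""
    vecs = tfidf_vectors(docs)
    return [[cosine_similarity(a, b) for b in vecs] for a in vecs]


# Toy input: three already-tokenized "songs".
docs = [
    ["face", "inside", "skin"],
    ["face", "inside", "head"],
    ["crawling", "wounds"],
]
matrix = similarity_matrix(docs)
```

Because the vectors are normalized, each song has similarity 1 with itself, and songs that share no terms have similarity 0.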


In [9]:
qgrid.show_grid(df.cos_sim[0], export_mode=True)

Heatmaps of the matrices were generated.

In [10]:
fig = cos_sim_plot(df.cos_sim)
py.iplot(fig, filename='cos_sim')

This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]
[ (2,1) x3,y3 ]  [ (2,2) x4,y4 ]
[ (3,1) x5,y5 ]  [ (3,2) x6,y6 ]
[ (4,1) x7,y7 ]  [ (4,2) x8,y8 ]



The cosine similarity between songs across all albums was also computed, and `cos_sim_all` lists each song against its 5 most similar songs.
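
Picking the most similar songs from a similarity matrix could look like the following sketch. The function name, the placeholder labels, and the scores are made up for illustration. The actual layout of `cos_sim_all` is not shown here.

```python
def top_similar(labels, matrix, k=5):
    """For each label, return its k most similar other labels as
    (label, score) pairs, most similar first."""
    result = {}
    for i, label in enumerate(labels):
        # Skip the diagonal so a song is never matched with itself.
        others = [(matrix[i][j], labels[j]) for j in range(len(labels)) if j != i]
        others.sort(key=lambda pair: pair[0], reverse=True)
        result[label] = [(name, score) for score, name in others[:k]]
    return result


# Placeholder labels and made-up similarity scores.
labels = ["song_a", "song_b", "song_c"]
matrix = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
]
ranking = top_similar(labels, matrix, k=2)
```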

In [11]:
qgrid.show_grid(df.cos_sim_all, export_mode=True)

### Sentiment Analysis

The VADER sentiment analyzer was used to compute a compound sentiment score for each sentence in the songs. `phrase_sent` contains all the phrases sorted by their sentiment scores.

In [12]:
qgrid.show_grid(df.phrase_sent, export_mode=True)

In [13]:
fig = phrase_sent_plot(df.phrase_sent)
py.iplot(fig, filename='phrase_sent')

### Aggregated Document Sentiment Analysis

The sample mean of the phrase sentiment scores was then computed as the aggregated sentiment score for each song.
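
The aggregation itself is a plain per-song mean. A minimal sketch, assuming the phrase scores are available as `(song, compound_score)` pairs; the function name, the placeholder song names, and the scores are made up and are not VADER output:

```python
from collections import defaultdict


def aggregate_document_sentiment(phrase_scores):
    """Average the phrase-level scores of each song.

    phrase_scores: iterable of (song, compound_score) pairs.
    Returns a {song: mean score} dict.
    """
    by_song = defaultdict(list)
    for song, score in phrase_scores:
        by_song[song].append(score)
    return {song: sum(scores) / float(len(scores)) for song, scores in by_song.items()}


# Placeholder rows with made-up scores.
rows = [("song_a", 0.5), ("song_a", -0.25), ("song_b", -0.5)]
doc_scores = aggregate_document_sentiment(rows)
```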

In [14]:
qgrid.show_grid(df.doc_sent, export_mode=True)

In [15]:
fig = doc_sent_plot(df.doc_sent)
py.iplot(fig, filename='doc_sent')