<h1 style="font-size: 40px;">Linkin Park Lyrical Analysis</h1>

<sub style="font-size: 12px;">Sabbir Ahmed<br>August 16, 2017</sub><br>

<sub style="font-size: 12px;">The source code for the extraction and analysis of the data can be found in [this repository](https://github.com/sabbirahm3d/linkin-park-lyrical-analysis). The code that generates the dataframes and plots is kept in separate modules, outside this notebook, to improve readability. The plots for the analyses were generated with [plotly](https://plot.ly).</sub>

Multi-Platinum and Diamond certified, and winner of multiple Grammy and Billboard awards, Linkin Park is one of the best-selling music artists in the world. The band is known for experimenting with its musical style, from bringing old-school nu metal into the limelight to pioneering futuristic electronic rock.

<img src="http://www.lpassociation.com/gallery/d/20854-3/linkin-park-official-hybrid-theory-2001-012.jpg" width="50%">
<p style="text-align: center; font-size: 12px;">Photo credit: Linkin Park Association</p>

This notebook analyzes the lyrical content of their songs. The lyrics range from politically charged themes to nuclear warfare. Most of the songs, however, deal with internal struggles and emotional abuse. Those subjects are most apparent in the band's earlier work, but they remain embedded in most of the later songs as well.<br>

The following analyses are generated on the lyrics: bag-of-words/term-frequency analysis, cosine similarity analysis and sentiment analysis.

The following dependencies are imported first.

In [1]:
# import packages
from IPython.display import Markdown, display
from plotly import plotly as py
import qgrid

# import source scripts
import context  # build context to modules in other packages
from data import textfilter
from data.filemgmt import vectorize_docs
from generate_df import DataframeGenerator
from generate_plot import cos_sim_plot, doc_sent_plot, \
    phrase_sent_plot, phrase_sent_scatter, rel_freq_plot, \
    valence_arousal_plot, valence_arousal_dims
from settings.artistinfo import LINKIN_PARK_ALBUMS

In [2]:
from secret import API_KEY, USERNAME

py.sign_in(username=USERNAME, api_key=API_KEY)  # credentials for plot.ly API
qgrid.nbinstall(overwrite=True)  # copies over JS dependencies

# Data

The tracklist and lyrics were obtained with a separate library that is not part of this project. The package uses the Spotify API, BeautifulSoup4 and Tor. Its source is kept private because the `robots.txt` of the lyrics hosting website explicitly forbids spiders.<br>

Since 80 songs add little traffic, the data were scraped cautiously and responsibly. The Tor network was used to give the spider a different IP address after every hit, and the pages were crawled with generous delays between requests. The following snippet generates the delay time:

```python
time.sleep(random.uniform(5, 20))  # sleep for a random float between 5-20 seconds
```

Basic endpoints of the Spotify API were used to generate the tracklist before crawling.

Only songs from studio albums were considered. Songs from remix or collaboration albums and featured songs were excluded, to filter external influences out of the lyrics. Songs with fewer than 10 lines of lyrics were also removed. The final dataset consisted of 77 songs from 7 studio albums.<br>
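The line-count rule above can be sketched as follows. This is a minimal illustration and not the project's code: `count_lyric_lines`, `filter_short_songs`, `MIN_LINES` and the toy data are hypothetical, and the sketch assumes that each song's lyrics are one newline-separated string and that blank lines do not count.

```python
MIN_LINES = 10  # assumed threshold, taken from the 10-line rule above


def count_lyric_lines(lyrics):
    """Count the non-empty lines in a newline-separated lyrics string."""
    return sum(1 for line in lyrics.splitlines() if line.strip())


def filter_short_songs(songs, min_lines=MIN_LINES):
    """Keep only the songs whose lyrics have at least `min_lines` lines."""
    return {
        title: lyrics
        for title, lyrics in songs.items()
        if count_lyric_lines(lyrics) >= min_lines
    }


# hypothetical toy data: one short interlude and one full-length song
songs = {
    "interlude": "line one\nline two",
    "full-song": "\n".join("line %d" % i for i in range(12)),
}
kept = filter_short_songs(songs)  # only "full-song" survives
```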

An array of hyphen-delimited Linkin Park album titles is used to generate the analyses on the individual albums:

In [3]:
print LINKIN_PARK_ALBUMS

('hybrid-theory', 'meteora', 'minutes-to-midnight', 'a-thousand-suns', 'living-things', 'the-hunting-party', 'one-more-light')


From the open-source portions, `vectorize_docs` is a function that builds an array of the lyrics for the specified artist and albums. It creates either an `Album` or an `Artist` object and flattens its output before returning.

Here is an example output of _"Crawling"_ by Linkin Park, from their debut album Hybrid Theory. The first 1000 characters of the lyrics are shown.

In [4]:
data, labels = vectorize_docs(
    artist="linkin-park",  # specify artist
    albums=["hybrid-theory"],  # specify album(s)
    keep_album=False,  # option to use the album name as a delimiter
    titlify=True,  # converts song title to original format
)

song_to_search = "Crawling"
for i, song in enumerate(labels):
    if song == song_to_search:
        display(Markdown("**Song name:** " + labels[i]))
        display(Markdown("**Lyrics:**"))
        print data[i][:1000] + "..."
        sample_lyrics = data[i]
        break

**Song name:** Crawling

**Lyrics:**

Crawling in my skin
These wounds, they will not heal
Fear is how I fall
Confusing what is real
There's something inside me that pulls beneath the surface
Consuming, confusing
This lack of self control I fear is never ending
Controlling
I can't seem
To find myself again
My walls are closing in
(Without a sense of confidence I'm convinced
That there's just too much pressure to take)
I've felt this way before
So insecure
Crawling in my skin
These wounds, they will not heal
Fear is how I fall
Confusing what is real
Discomfort, endlessly has pulled itself upon me
Distracting, reacting
Against my will I stand beside my own reflection
It's haunting how I can't seem
To find myself again
My walls are closing in
(Without a sense of confidence I'm convinced
That there's just too much pressure to take)
I've felt this way before
So insecure
Crawling in my skin
These wounds, they will not heal
Fear is how I fall
Confusing what is real
Crawling in my skin
These wounds, they will not heal
Fear is how 

The text is then normalized through several layers of cleaning.<br>

Using a sample string, the filtering process can be demonstrated. Let's pick a tough one:<br>
"Ain't nothing more that'll have you grinnin’ like a possum eatin’ a sweet tater than these 80 songs."<br>
(Translation: "There's nothing more that can make you happier than these 80 songs.")

The text goes through all the cleaning methods in the sequence described. The methods include:

**Lowercase conversion**<br>
**_Original:_** ```"Ain't nothing more that'll have you grinnin’ like a possum eatin’ a sweet tater than these 80 songs."```<br>
**_Cleaned:_** ```"ain't nothing more that'll have you grinnin’ like a possum eatin’ a sweet tater than these 80 songs."```

**Expansion of word contractions**<br>
**_Original:_** ```"ain't nothing more that'll have you grinnin’ like a possum eatin’ a sweet tater than these 80 songs."```<br>
**_Cleaned:_** ```"is not nothing more that will have you grinnin’ like a possum eatin’ a sweet tater than these 80 songs."```

**Stripping punctuation (except parentheses)**<br>
**_Original:_** ```"is not nothing more that will have you grinnin’ like a possum eatin’ a sweet tater than these 80 songs."```<br>
**_Cleaned:_** ```"is not nothing more that will have you grinnin like a possum eatin a sweet tater than these 80 songs"```

**Stripping numerical values**<br>
**_Original:_** ```"is not nothing more that will have you grinnin like a possum eatin a sweet tater than these 80 songs"```<br>
**_Cleaned:_** ```"is not nothing more that will have you grinnin like a possum eatin a sweet tater than these songs"```

**Penn Treebank Parts of Speech Tagging** <a id="ref1_"><sup>[[1]](#ref1)</sup></a><br>
**_Original:_** ```"is not nothing more that will have you grinnin like a possum eatin a sweet tater than these songs"```<br>
**_Cleaned:_** ```[('is', 'VBZ'), ('not', 'RB'), ('nothing', 'NN'), ('more', 'JJR'), ('that', 'WDT'), ('will', 'MD'), ('have', 'VB'), ('you', 'PRP'), ('grinnin', 'VBP'), ('like', 'IN'), ('a', 'DT'), ('possum', 'NN'), ('eatin', 'VBZ'), ('a', 'DT'), ('sweet', 'JJ'), ('tater', 'NN'), ('than', 'IN'), ('these', 'DT'), ('songs', 'NNS')]```

**WordNet Lemmatization** <a id="ref2_"><sup>[[2]](#ref2)</sup></a><br>
**_Original:_** ```[('is', 'VBZ'), ('not', 'RB'), ('nothing', 'NN'), ('more', 'JJR'), ('that', 'WDT'), ('will', 'MD'), ('have', 'VB'), ('you', 'PRP'), ('grinnin', 'VBP'), ('like', 'IN'), ('a', 'DT'), ('possum', 'NN'), ('eatin', 'VBZ'), ('a', 'DT'), ('sweet', 'JJ'), ('tater', 'NN'), ('than', 'IN'), ('these', 'DT'), ('songs', 'NNS')]```<br>
**_Cleaned:_** ```['be', 'not', 'nothing', 'more', 'that', 'will', 'have', 'you', 'grinnin', 'like', 'a', 'possum', 'eatin', 'a', 'sweet', 'tater', 'than', 'these', 'song']```


**Special case filtering methods**
* Stripping off dates
* Grabbing strings in between parentheses
* Skipping punctuation stripping for special words, e.g. C++
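The first four cleaning steps can be sketched with the standard library alone. This is a hedged illustration, not the implementation behind `normalize_text`: `basic_clean` and the three-entry `CONTRACTIONS` table are hypothetical, and the part-of-speech tagging and lemmatization steps are left out because they depend on NLTK.

```python
import re

# tiny illustrative contraction table (assumption); a real one is much larger
CONTRACTIONS = {
    "ain't": "is not",
    "that'll": "that will",
    "can't": "cannot",
}


def basic_clean(text):
    """Apply the first four cleaning steps and return a list of tokens."""
    text = text.lower()  # lowercase conversion
    for short, full in CONTRACTIONS.items():  # expansion of contractions
        text = text.replace(short, full)
    text = re.sub(r"[^\w\s()]", "", text)  # strip punctuation, keep parentheses
    text = re.sub(r"\d+", "", text)  # strip numerical values
    return text.split()  # split on whitespace, dropping the gaps left behind


sample = (
    u"Ain't nothing more that'll have you grinnin\u2019 like a possum "
    u"eatin\u2019 a sweet tater than these 80 songs."
)
tokens = basic_clean(sample)
```

Joined with spaces, `tokens` matches the string shown after the "Stripping numerical values" step.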

Here is the sample string being used as an input to `normalize_text`:

In [5]:
print textfilter.normalize_text(
    u"Ain't nothing more that'll have you grinnin’ like a possum eatin’ a sweet tater than these 80 songs."
)

[u'be', 'not', 'nothing', 'more', 'that', 'will', 'have', 'you', 'grinnin', 'like', 'a', 'possum', 'eatin', 'a', 'sweet', 'tater', 'than', 'these', u'song']


Here is the sample song grabbed earlier as an input to `normalize_text`:

In [6]:
print "\n".join(
    textfilter.normalize_text(
        sample_lyrics,
        sentences=True  # preserves sentences
    )
)

crawl in my skin
these wound they will not heal
fear be how i fall
confuse what be real
there be something inside me that pull beneath the surface
consume confuse
this lack of self control i fear be never end
control
i cannot seem
to find myself again
my wall be close in
without a sense of confidence i be convince
that there be just too much pressure to take
i have felt this way before
so insecure
crawl in my skin
these wound they will not heal
fear be how i fall
confuse what be real
discomfort endlessly have pull itself upon me
distract react
against my will i stand beside my own reflection
it be haunt how i cannot seem
to find myself again
my wall be close in
without a sense of confidence i be convince
that there be just too much pressure to take
i have felt this way before
so insecure
crawl in my skin
these wound they will not heal
fear be how i fall
confuse what be real
crawl in my skin
these wound they will not heal
fear be how i fall
confuse confuse what be real
there be somethin

# Analysis

The following variables map to arrays of dataframes or individual dataframes of their corresponding textual analysis:

* `rel_freq`: Relative frequency analysis
* `cos_sim`: Cosine similarity analysis
* `cos_sim_all`: Cosine similarity analysis on songs across all albums
* `doc_sent`: Aggregated document sentiment analysis
* `phrase_sent`: Phrase sentiment analysis
* `extreme_phrase_sent`: Outliers of `phrase_sent`
* `valence_arousal`: Valence-arousal regression

The dataframes are generated with the imported `generate_df` script.

In [7]:
df = DataframeGenerator(LINKIN_PARK_ALBUMS)
df.init_dfs()

rel_freq generated
cos_sim generated
cos_sim_all generated
doc_sent generated
phrase_sent generated
extreme_phrase_sent generated
Regressor generated with 0.843% percentile
arousal generated
valence-arousal generated


## Relative Frequency Analysis

A vocabulary of n-grams with lengths from 2 to 8 was generated using a count vectorizer with the following configuration:

    sklearn.feature_extraction.text.CountVectorizer(
        tokenizer=data.textfilter.normalize_text,
        max_features=2000,
        ngram_range=(2, 8),
        stop_words="english",
    )
    
`rel_freq` is an array of DataFrames, one per album, each holding the relative frequencies of that album's terms.<br>
The first element of the array, corresponding to the relative frequency of terms in Hybrid Theory, is shown:

In [8]:
qgrid.show_grid(df.rel_freq[0], export_mode=True)

Bar plots of the phrases against their relative frequencies are shown below:

In [9]:
fig = rel_freq_plot(df.rel_freq)
py.iplot(fig, filename="rel_freq")
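To make the counting step concrete, here is a minimal pure-Python sketch of n-gram relative frequencies. It is not the `CountVectorizer` pipeline used above: the helper names are hypothetical, stop-word removal and the `max_features` cap are omitted, and "relative frequency" is assumed to mean a phrase's count divided by the total number of phrases.

```python
from collections import Counter


def ngrams(tokens, n_min=2, n_max=8):
    """Yield every run of n_min..n_max consecutive tokens as a phrase."""
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            yield " ".join(tokens[start:start + n])


def relative_frequencies(docs, n_min=2, n_max=8):
    """Map each phrase to its count divided by the total phrase count."""
    counts = Counter()
    for tokens in docs:
        counts.update(ngrams(tokens, n_min, n_max))
    total = float(sum(counts.values()))
    return {phrase: count / total for phrase, count in counts.items()}


# the same normalized line twice, as it repeats in the "Crawling" chorus
docs = [
    "crawl in my skin".split(),
    "crawl in my skin".split(),
]
freqs = relative_frequencies(docs, n_min=2, n_max=3)
```

Each of the five distinct phrases appears twice among ten phrases in total, so every entry of `freqs` is 0.2.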

## Cosine Similarity Analysis <a id="ref3_"></a>

The term frequency-inverse document frequency (TF-IDF) vectors of the songs in each album were generated to analyze the cosine similarity between them.<sup>[[3]](#ref3)</sup> The following configuration was used:

    sklearn.feature_extraction.text.TfidfVectorizer(
        tokenizer=data.textfilter.normalize_text,
        max_features=500,
        ngram_range=(1, 5),
    )

`cos_sim` is an array of DataFrames, one per album, each holding the pairwise cosine similarities of that album's songs.<br>
The first element of the array, corresponding to the cosine similarity of the songs in Hybrid Theory, is shown:

In [10]:
qgrid.show_grid(df.cos_sim[0], export_mode=True)

Heatmaps of the matrices were generated.

In [11]:
fig = cos_sim_plot(df.cos_sim)
py.iplot(fig, filename='cos_sim')

The plots can be used to visualize the similarity in content within each album. A concept album would be expected to score higher in overall similarity than a compilation album. Meteora appears to have the most thematic similarity among its songs.
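For readers unfamiliar with the measure, the sketch below computes pairwise cosine similarity in pure Python. It swaps in raw term counts for the TF-IDF weights used above, and the function names are hypothetical, so treat it as an illustration of the formula rather than a reproduction of the analysis.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_a * norm_b)


def similarity_matrix(docs):
    """Pairwise cosine similarity between token lists, using raw counts."""
    vectors = [Counter(tokens) for tokens in docs]
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


docs = [
    "these wound they will not heal".split(),
    "these wound they will not heal".split(),
    "so insecure".split(),
]
matrix = similarity_matrix(docs)
```

Identical documents score approximately 1, and documents with no terms in common score 0, which is why repeated choruses push an album's heatmap toward higher values.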

The cosine similarity between songs across all the albums was also computed, and `cos_sim_all` was generated with each song listed against its 5 most similar songs.

In [12]:
qgrid.show_grid(df.cos_sim_all, export_mode=True)
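Picking the most similar songs from a similarity matrix could look like the sketch below. The function `top_similar`, the labels and the scores are all made up for illustration; only the idea of ranking every other song by its similarity score comes from the description above.

```python
def top_similar(labels, matrix, k=5):
    """For each label, list the k other labels with the highest similarity."""
    result = {}
    for i, label in enumerate(labels):
        others = [(matrix[i][j], labels[j]) for j in range(len(labels)) if j != i]
        others.sort(key=lambda pair: pair[0], reverse=True)  # most similar first
        result[label] = [name for _, name in others[:k]]
    return result


# made-up similarity scores, for illustration only
labels = ["song-a", "song-b", "song-c"]
matrix = [
    [1.0, 0.2, 0.7],
    [0.2, 1.0, 0.4],
    [0.7, 0.4, 1.0],
]
ranking = top_similar(labels, matrix, k=2)
```

Each song is excluded from its own ranking, since its similarity to itself is always 1.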

## Sentiment Analysis <a id="ref4_"></a>

The VADER Sentiment Analyzer was used to generate the compound sentiment score for each sentence in the songs.<sup>[[4]](#ref4)</sup> `phrase_sent` was generated with all the phrases sorted by their sentiment scores.

In [13]:
qgrid.show_grid(df.phrase_sent, export_mode=True)

In [14]:
fig = phrase_sent_scatter(df.phrase_sent)
py.iplot(fig, filename="phrase_sent")

In [15]:
fig = phrase_sent_plot(df.extreme_phrase_sent)
py.iplot(fig, filename="top_phrase_sent")

## Aggregated Document Sentiment Analysis

The sample mean of the phrase sentiment scores was then taken as the aggregated sentiment score of the entire song, where $len$ is the number of phrases in the song.<br><br>
$$sentiment_{doc}=\frac{1}{len} \sum_{phrase=1}^{len} {sentiment_{phrase}}$$
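As a worked example of the aggregation, the sketch below averages a handful of phrase scores. The scores are made up, `document_sentiment` is a hypothetical helper, and returning 0.0 for a song with no phrases is an assumption made here.

```python
def document_sentiment(phrase_scores):
    """Average the per-phrase compound scores into one score per song."""
    if not phrase_scores:
        return 0.0  # assumed neutral default for a song with no phrases
    return sum(phrase_scores) / float(len(phrase_scores))


# made-up compound scores in VADER's [-1, 1] range
scores = [-0.5, 0.0, 0.25, -0.75]
doc_score = document_sentiment(scores)  # (-0.5 + 0.0 + 0.25 - 0.75) / 4
```

Here the four scores sum to -1.0, so the aggregated score of the song is -0.25.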

In [16]:
qgrid.show_grid(df.doc_sent, export_mode=True)

In [17]:
fig = doc_sent_plot(df.doc_sent)
py.iplot(fig, filename="doc_sent")

## Valence-Arousal Regression <a id="ref5_"></a>

In [18]:
fig = valence_arousal_dims()
py.iplot(fig, filename="valence_arousal_circle")

In [19]:
qgrid.show_grid(df.valence_arousal, export_mode=True)

In [20]:
fig = valence_arousal_plot(df.valence_arousal, df.doc_sent)
py.iplot(fig, filename="valence_arousal")

# References

[[1]](#ref1_) <a id="ref1"></a> Santorini, Beatrice. <a href="http://repository.upenn.edu/cis_reports/570/?utm_source=repository.upenn.edu%2Fcis_reports%2F570&utm_medium=PDF&utm_campaign=PDFCoverPages">_"Part-of-speech tagging guidelines for the Penn Treebank Project (3rd revision)"_</a>. (1990).

[[2]](#ref2_) <a id="ref2"></a> Princeton University. _"About WordNet."_ WordNet. Princeton University. 2010. <http://wordnet.princeton.edu>

[[3]](#ref3_) <a id="ref3"></a> Manning, Christopher D, Prabhakar Raghavan, and Hinrich Schütze. <a href="https://nlp.stanford.edu/IR-book/pdf/07system.pdf">_Introduction to Information Retrieval_</a>. New York: Cambridge University Press, 2008. Print.

[[4]](#ref4_) <a id="ref4"></a> Hutto, Clayton J., and Eric Gilbert. <a href="https://www.aaai.org/ocs/index.php/ICWSM/ICWSM14/paper/view/8109/8122">"Vader: A parsimonious rule-based model for sentiment analysis of social media text"</a>. _Eighth international AAAI conference on weblogs and social media._ 2014.

[[5]](#ref5_) <a id="ref5"></a> Preotiuc-Pietro, Daniel, et al. <a href="https://wwbp.org/papers/va16wassa.pdf">"Modelling Valence and Arousal in Facebook posts."</a> _WASSA@ NAACL-HLT._ 2016.