# Linkin Park Lyrical Analysis

Multi-Platinum and Diamond certified, and a winner of multiple Grammy and Billboard Awards, Linkin Park is one of the most influential musical groups of all time. The band is known for experimenting with its musical style, ranging from old-school nu metal to futuristic electronic rock. The themes of their songs have stayed consistent throughout, spanning internal struggles to nuclear warfare. <br> 
Hands down, Linkin Park personally shaped 

In [1]:
# import packages
from IPython.display import Markdown, display
from plotly import plotly as py
import qgrid

# import source scripts
import context  # build context to modules in other packages

In [2]:
from secret import API_KEY, USERNAME

py.sign_in(username=USERNAME, api_key=API_KEY)  # for plot.ly API
qgrid.nbinstall(overwrite=True)  # copies over JS dependencies

## Data

The tracklist and lyrics were obtained using a separate library that is kept closed off from this project. The package uses the Spotify API, BeautifulSoup4 and Tor. Its source is closed because the lyrics hosting website's `robots.txt` explicitly forbids any spiders.<br>

Because 80 songs do not contribute significantly to traffic, the data were scraped cautiously and responsibly. The Tor network was used to mask the spider with a different IP address after every hit, and the pages were crawled with sufficient delays. Here's the snippet that generates the delay time:

```python
time.sleep(random.uniform(5, 20))  # sleep for a random float between 5-20 seconds
```

Basic endpoints of the Spotify API were used to generate the tracklist before crawling.

From the open sourced portions, `vectorize_docs` is a method that constructs an array of the lyrics generated for the artist and album specified. It generates either an `Album` or `Artist` object and flattens its outputs before returning.

Here is an example output of _"Crawling"_ by Linkin Park, from their debut album Hybrid Theory. The first 1000 characters of the lyrics are shown.

In [3]:
from data.filemgmt import vectorize_docs
from settings.artistinfo import LINKIN_PARK_ALBUMS

data, labels = vectorize_docs(
    artist="linkin-park",  # specify artist
    albums=["hybrid-theory"],  # specify album(s)
    keep_album=False,  # option to use the album name as a delimiter
    titlify=True,  # converts song title to original format
)

song_to_search = "Crawling"
for i, song in enumerate(labels):
    if song == song_to_search:
        display(Markdown("**Song name: **" + labels[i])) 
        display(Markdown("**Lyrics:**")) 
        print data[i][:1000] + "..."
        sample_lyrics = data[i]
        break

**Song name: **Crawling

**Lyrics:**

Crawling in my skin
These wounds, they will not heal
Fear is how I fall
Confusing what is real
There's something inside me that pulls beneath the surface
Consuming, confusing
This lack of self control I fear is never ending
Controlling
I can't seem
To find myself again
My walls are closing in
(Without a sense of confidence I'm convinced
That there's just too much pressure to take)
I've felt this way before
So insecure
Crawling in my skin
These wounds, they will not heal
Fear is how I fall
Confusing what is real
Discomfort, endlessly has pulled itself upon me
Distracting, reacting
Against my will I stand beside my own reflection
It's haunting how I can't seem
To find myself again
My walls are closing in
(Without a sense of confidence I'm convinced
That there's just too much pressure to take)
I've felt this way before
So insecure
Crawling in my skin
These wounds, they will not heal
Fear is how I fall
Confusing what is real
Crawling in my skin
These wounds, they will not heal
Fear is how 

The text is then normalized through several layers of cleaning.<br>

Using a sample string, the filtering process can be demonstrated. Let's pick a tough one:<br>
"Ain't nothing more that'll have you grinnin’ like a possum eatin’ a sweet tater than these 80 songs."<br>
(Translation: "There's nothing more that can make you happier than these 80 songs.")

The text goes through all the cleaning methods in the sequence described. The methods include:

**Lowercase conversion**<br>
**_Original:_** ```"Ain't nothing more that'll have you grinnin’ like a possum eatin’ a sweet tater than these 80 songs."```<br>
**_Cleaned:_** ```"ain't nothing more that'll have you grinnin’ like a possum eatin’ a sweet tater than these 80 songs."```

**Expansion of word contractions**<br>
**_Original:_** ```"ain't nothing more that'll have you grinnin’ like a possum eatin’ a sweet tater than these 80 songs."```<br>
**_Cleaned:_** ```"is not nothing more that will have you grinnin’ like a possum eatin’ a sweet tater than these 80 songs."```

**Stripping punctuation (except parentheses)**<br>
**_Original:_** ```"is not nothing more that will have you grinnin’ like a possum eatin’ a sweet tater than these 80 songs."```<br>
**_Cleaned:_** ```"is not nothing more that will have you grinnin like a possum eatin a sweet tater than these 80 songs"```

**Stripping numerical values**<br>
**_Original:_** ```"is not nothing more that will have you grinnin like a possum eatin a sweet tater than these 80 songs"```<br>
**_Cleaned:_** ```"is not nothing more that will have you grinnin like a possum eatin a sweet tater than these songs"```

**Penn Treebank part-of-speech tagging**<br>
**_Original:_** ```"is not nothing more that will have you grinnin like a possum eatin a sweet tater than these songs"```<br>
**_Cleaned:_** ```[('is', 'VBZ'), ('not', 'RB'), ('nothing', 'NN'), ('more', 'JJR'), ('that', 'WDT'), ('will', 'MD'), ('have', 'VB'), ('you', 'PRP'), ('grinnin', 'VBP'), ('like', 'IN'), ('a', 'DT'), ('possum', 'NN'), ('eatin', 'VBZ'), ('a', 'DT'), ('sweet', 'JJ'), ('tater', 'NN'), ('than', 'IN'), ('these', 'DT'), ('songs', 'NNS')]```

**Lemmatization**<br>
**_Original:_** ```[('is', 'VBZ'), ('not', 'RB'), ('nothing', 'NN'), ('more', 'JJR'), ('that', 'WDT'), ('will', 'MD'), ('have', 'VB'), ('you', 'PRP'), ('grinnin', 'VBP'), ('like', 'IN'), ('a', 'DT'), ('possum', 'NN'), ('eatin', 'VBZ'), ('a', 'DT'), ('sweet', 'JJ'), ('tater', 'NN'), ('than', 'IN'), ('these', 'DT'), ('songs', 'NNS')]```<br>
**_Cleaned:_** ```['be', 'not', 'nothing', 'more', 'that', 'will', 'have', 'you', 'grinnin', 'like', 'a', 'possum', 'eatin', 'a', 'sweet', 'tater', 'than', 'these', 'song']```


**Special case filtering methods**
* Stripping off dates
* Grabbing strings in between parentheses
* Ignoring punctuation stripping for special words, e.g. C++
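
The first four cleaning steps above can be approximated with the standard library alone. The following is a minimal sketch, not the actual `normalize_text` implementation: `basic_clean` and the tiny `CONTRACTIONS` table are hypothetical names introduced here for illustration, and tagging and lemmatization are left out because they need an NLP library.

```python
import re

# Hypothetical, deliberately tiny contraction table; the real mapping used by
# `normalize_text` is not shown in this notebook.
CONTRACTIONS = {
    "ain't": "is not",
    "that'll": "that will",
    "can't": "cannot",
}


def basic_clean(text):
    """Lowercase, expand contractions, strip punctuation (keeping
    parentheses) and strip digits, in that order."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # keep only word characters, whitespace and parentheses
    text = re.sub(r"[^\w\s()]", "", text)
    # drop numerical values
    text = re.sub(r"\d+", "", text)
    # collapse the whitespace left behind by the removals
    return " ".join(text.split())


cleaned = basic_clean(
    "Ain't nothing more that'll have you grinnin’ like a possum "
    "eatin’ a sweet tater than these 80 songs."
)
print(cleaned)
```

Under these assumptions the result matches the string shown after the numerical-value stripping step.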

Here is the sample string being used as an input to `normalize_text`:

In [3]:
from data import textfilter

print textfilter.normalize_text(
    u"Ain't nothing more that'll have you grinnin’ like a possum eatin’ a sweet tater than these 80 songs."
)

[u'be', 'not', 'nothing', 'more', 'that', 'will', 'have', 'you', 'grinnin', 'like', 'a', 'possum', 'eatin', 'a', 'sweet', 'tater', 'than', 'these', u'song']


Here is the sample song grabbed earlier as an input to `normalize_text`:

In [4]:
print "\n".join(
    textfilter.normalize_text(
        sample_lyrics,
        sentences=True  # preserves sentences
    )
)

crawl in my skin
these wound they will not heal
fear be how i fall
confuse what be real
there be something inside me that pull beneath the surface
consume confuse
this lack of self control i fear be never end
control
i cannot seem
to find myself again
my wall be close in
without a sense of confidence i be convince
that there be just too much pressure to take
i have felt this way before
so insecure
crawl in my skin
these wound they will not heal
fear be how i fall
confuse what be real
discomfort endlessly have pull itself upon me
distract react
against my will i stand beside my own reflection
it be haunt how i cannot seem
to find myself again
my wall be close in
without a sense of confidence i be convince
that there be just too much pressure to take
i have felt this way before
so insecure
crawl in my skin
these wound they will not heal
fear be how i fall
confuse what be real
crawl in my skin
these wound they will not heal
fear be how i fall
confuse confuse what be real
there be somethin

## Dataframes

The following variables map to arrays of dataframes or individual dataframes of their corresponding textual analysis:

* `rel_freq`: Relative frequency analysis
* `cos_sim`: Cosine similarity analysis
* `cos_sim_all`: Cosine similarity analysis on songs across all albums
* `doc_sent`: Aggregated document sentiment analysis
* `phrase_sent`: Phrase sentiment analysis
* `extreme_phrase_sent`: Outliers of `phrase_sent`

A tuple of Linkin Park album titles is used to generate the analyses on the individual albums:

In [5]:
print LINKIN_PARK_ALBUMS

('hybrid-theory', 'meteora', 'minutes-to-midnight', 'a-thousand-suns', 'living-things', 'the-hunting-party', 'one-more-light')


The dataframes are generated with the imported `generate_df` script.

In [4]:
from generate_df import DataframeGenerator

df = DataframeGenerator(LINKIN_PARK_ALBUMS)
df.init_dfs()

## Analysis

The plots for the analyses were generated with [`plotly`](https://plot.ly).

In [7]:
from generate_plot import cos_sim_plot, doc_sent_plot, phrase_sent_plot, \
    phrase_sent_scatter, rel_freq_plot

### Relative Frequency Analysis

A vocabulary of n-grams with lengths in the range [2, 8] was generated using a count vectorizer with the following configuration:

    sklearn.feature_extraction.text.CountVectorizer(
        tokenizer=data.textfilter.normalize_text,
        max_features=2000,
        ngram_range=(2, 8),
        stop_words="english",
    )
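
To make the n-gram counting concrete, here is a small standard-library sketch. It is a stand-in for the vectorizer, not a reproduction of it: `ngrams` and `relative_frequencies` are hypothetical helpers, stop-word removal and the `max_features` cap are not modelled, and "relative frequency" is assumed to mean an n-gram's count divided by the total number of n-grams.

```python
from collections import Counter


def ngrams(tokens, n_min=2, n_max=8):
    """Yield every contiguous n-gram of length n_min..n_max as a string."""
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            yield " ".join(tokens[start:start + n])


def relative_frequencies(tokens, n_min=2, n_max=8):
    """Map each n-gram to its count divided by the total n-gram count."""
    counts = Counter(ngrams(tokens, n_min, n_max))
    total = sum(counts.values())
    return {gram: count / float(total) for gram, count in counts.items()}


# Toy input: a repeated, already-normalized line
tokens = "crawl in my skin crawl in my skin".split()
freqs = relative_frequencies(tokens, n_min=2, n_max=3)
for gram, freq in sorted(freqs.items(), key=lambda item: -item[1]):
    print("%-15s %.3f" % (gram, freq))
```

Repeated lines such as a chorus dominate the longer n-grams, which is why the range extends up to 8 tokens.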

In [8]:
qgrid.show_grid(df.rel_freq[0], export_mode=True)

Bar plots of the phrases against their relative frequencies were generated.

In [9]:
fig = rel_freq_plot(df.rel_freq)
py.iplot(fig, filename="rel_freq")

This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]
[ (2,1) x3,y3 ]  [ (2,2) x4,y4 ]
[ (3,1) x5,y5 ]  [ (3,2) x6,y6 ]
[ (4,1) x7,y7 ]  [ (4,2) x8,y8 ]



### Cosine Similarity Analysis <a id="ref1_"></a>

The term frequency-inverse document frequency (TF-IDF) vectors of the songs in each album were generated to compute the cosine similarity between them.<sup>[[1]](#ref1)</sup> The following configuration was used:

    sklearn.feature_extraction.text.TfidfVectorizer(
        tokenizer=data.textfilter.normalize_text,
        max_features=500,
        ngram_range=(1, 5),
    )
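
The idea behind the similarity score can be sketched without scikit-learn. The helpers below (`tfidf_vectors`, `cosine_similarity`) are hypothetical and work on unigrams only; the idf term follows the smoothed form `ln((1 + n) / (1 + df)) + 1`, which is assumed here to correspond to the vectorizer's default `smooth_idf=True` behaviour.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # document frequency: number of documents containing each term
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append({
            term: tf * (math.log((1.0 + n_docs) / (1.0 + doc_freq[term])) + 1.0)
            for term, tf in term_freq.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = [
    "crawl in my skin".split(),
    "crawl in my skin".split(),
    "i have become so numb".split(),
]
vectors = tfidf_vectors(docs)
sim_same = cosine_similarity(vectors[0], vectors[1])
sim_diff = cosine_similarity(vectors[0], vectors[2])
print(sim_same, sim_diff)
```

Identical token lists score 1 and lists with no shared terms score 0; real songs fall in between.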


In [10]:
qgrid.show_grid(df.cos_sim[0], export_mode=True)

Heatmaps of the matrices were generated.

In [11]:
fig = cos_sim_plot(df.cos_sim)
py.iplot(fig, filename='cos_sim')

This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]
[ (2,1) x3,y3 ]  [ (2,2) x4,y4 ]
[ (3,1) x5,y5 ]  [ (3,2) x6,y6 ]
[ (4,1) x7,y7 ]  [ (4,2) x8,y8 ]



The cosine similarity between songs across all albums was also computed, and `cos_sim_all` was generated by pairing each song with its 5 most similar songs.
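
One way to pick the most similar songs from a square similarity matrix is sketched below. `top_similar` is a hypothetical helper, the matrix values are made up, and it is an assumption that a song is excluded from its own ranking.

```python
def top_similar(names, sim_matrix, k=5):
    """For each song, list the k most similar other songs, best first."""
    result = {}
    for i, name in enumerate(names):
        others = [
            (sim_matrix[i][j], names[j])
            for j in range(len(names))
            if j != i  # skip the song itself (similarity 1.0)
        ]
        others.sort(key=lambda pair: pair[0], reverse=True)
        result[name] = [other for _, other in others[:k]]
    return result


# Made-up similarity values for three songs
names = ["Crawling", "Numb", "Faint"]
sim_matrix = [
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.5],
    [0.9, 0.5, 1.0],
]
closest = top_similar(names, sim_matrix, k=1)
print(closest)
```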

In [12]:
qgrid.show_grid(df.cos_sim_all, export_mode=True)

### Sentiment Analysis <a id="ref2_"></a>

The VADER Sentiment Analyzer was used to generate the compound sentiment score for each sentence in the songs.<sup>[[2]](#ref2)</sup> `phrase_sent` was generated with all the phrases sorted by their sentiment scores.
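
The compound score lies between -1 and 1. The sketch below shows how a summed valence can be squashed into that range, assuming the `x / sqrt(x^2 + alpha)` normalization with `alpha = 15` used by the reference VADER implementation. The function name and sample inputs are illustrative only, and the lexicon lookup and rule-based adjustments that produce the summed valence are not modelled.

```python
import math


def normalize(score, alpha=15.0):
    """Squash an unbounded summed valence into the open interval (-1, 1)."""
    return score / math.sqrt(score * score + alpha)


for summed_valence in (-100.0, -4.0, 0.0, 4.0):
    print(summed_valence, round(normalize(summed_valence), 4))
```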

In [13]:
qgrid.show_grid(df.phrase_sent, export_mode=True)

In [16]:
fig = phrase_sent_scatter(df.phrase_sent)
py.iplot(fig, filename="phrase_sent")

In [17]:
fig = phrase_sent_plot(df.extreme_phrase_sent)
py.iplot(fig, filename="top_phrase_sent")

### Aggregated Document Sentiment Analysis

The sample mean of the phrase sentiment scores was then taken as the aggregated sentiment score of the entire song.<br><br>
$$sentiment_{doc}=\frac{1}{len} \sum_{phrase=1}^{len} {sentiment_{phrase}}$$
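
A minimal sketch of the aggregation, taking "sample mean" literally as the arithmetic mean of the phrase scores. `document_sentiment` is a hypothetical helper, the scores are made up, and treating a song with no scored phrases as neutral is an assumption.

```python
def document_sentiment(phrase_scores):
    """Average the per-phrase compound scores of one song."""
    if not phrase_scores:
        return 0.0  # assumption: no scored phrases means a neutral song
    return sum(phrase_scores) / float(len(phrase_scores))


scores = [-0.5, 0.0, 0.25, -0.75]  # made-up compound scores in [-1, 1]
doc_score = document_sentiment(scores)
print(doc_score)
```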

In [14]:
qgrid.show_grid(df.doc_sent, export_mode=True)

In [15]:
fig = doc_sent_plot(df.doc_sent)
py.iplot(fig, filename="doc_sent")

## Classifiers

In [5]:
df.init_clf_dfs()

Model trained.
Vectorizer:
TfidfVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=2000, min_df=1,
        ngram_range=(1, 5), norm=u'l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=<function normalize_text at 0x7fe09fbc9398>,
        use_idf=True, vocabulary=None) 
Classifier:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=500, n_jobs=-1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)
Model trained.
Vectorizer:
TfidfVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
     

## References

[[1]](#ref1_) <a id="ref1"></a> Manning, Christopher D, Prabhakar Raghavan, and Hinrich Schütze. <a href="https://nlp.stanford.edu/IR-book/pdf/07system.pdf">_Introduction to Information Retrieval_</a>. New York: Cambridge University Press, 2008. Print.

[[2]](#ref2_) <a id="ref2"></a> Hutto, Clayton J., and Eric Gilbert. <a href="https://www.aaai.org/ocs/index.php/ICWSM/ICWSM14/paper/view/8109/8122">"Vader: A parsimonious rule-based model for sentiment analysis of social media text"</a>. _Eighth international AAAI conference on weblogs and social media._ 2014.

[[3]](#ref3_) <a id="ref3"></a> Preotiuc-Pietro, Daniel, et al. <a href="https://wwbp.org/papers/va16wassa.pdf">"Modelling Valence and Arousal in Facebook posts."</a> _WASSA@ NAACL-HLT._ 2016.

In [6]:
import plotly.plotly as py
import plotly.graph_objs as go
from plotly import tools

# (row, col) positions of the 4x2 subplot grid, in row-major order
coords = [(row, col) for row in range(1, 5) for col in (1, 2)]


fig = tools.make_subplots(
    rows=4, cols=2,
    subplot_titles=tuple(i.title().replace("-", " ")
                         for i in LINKIN_PARK_ALBUMS)
)


data = []
for i, album in enumerate(LINKIN_PARK_ALBUMS):
    # filter the valence/arousal predictions once per album
    album_df = df.valence_arousal[df.valence_arousal["album"] == album]
    fig.append_trace(
        go.Scatter(
            x=album_df["valence_pred"].tolist(),
            y=album_df["arousal_pred"].tolist(),
            mode="markers",
            name=album.title().replace("-", " "),
            text=album_df.index.tolist(),  # hover text from the dataframe index
            marker=dict(
                size=10,
                line=dict(width=1),
                color=album,
                opacity=0.5
            ),
            # asymmetric error bars from the high/low columns
            error_x=dict(
                type='data',
                color="rgba(0, 0, 0, 0.5)",
                symmetric=False,
                array=album_df["valence_high"].tolist(),
                arrayminus=album_df["valence_low"].tolist(),
            ),
            error_y=dict(
                type='data',
                color="rgba(0, 0, 0, 0.5)",
                symmetric=False,
                array=album_df["arousal_high"].tolist(),
                arrayminus=album_df["arousal_low"].tolist(),
            ),
        ),
        coords[i][0], coords[i][-1]
    )

for attr in fig["layout"]:
    if "xaxis" in attr or "yaxis" in attr:
        fig["layout"][attr].update(
            range=[1, 9],
            tick0=1,
            dtick=1,
        )
        
for attr in range(1, 9):  # subplot axes are numbered x1..x8 / y1..y8
    fig["layout"]["shapes"].append(
        {
            "yref": "y" + str(attr),
            "xref": "x" + str(attr),
            'type': 'line',
            'x0': 5,
            'y0': 1,
            'x1': 5,
            'y1': 9,
            'line': {
                'color': "rgba(0, 0, 0, 0.5)",
                'dash': "dot",
            },
        },
    )
    fig["layout"]["shapes"].append(
        {
            "yref": "y" + str(attr),
            "xref": "x" + str(attr),
            'type': 'line',
            'x0': 1,
            'y0': 5,
            'x1': 9,
            'y1': 5,
            'line': {
                'color': "rgba(0, 0, 0, 0.5)",
                'dash': "dot",
            },
        },
    )

fig["layout"].update(
    height=2000, width=900,
    title="Valence and Arousal",
#     margin=go.Margin(l=190, b=150),
)

py.iplot(fig, filename='error-bar-asymmetric-array')


This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]
[ (2,1) x3,y3 ]  [ (2,2) x4,y4 ]
[ (3,1) x5,y5 ]  [ (3,2) x6,y6 ]
[ (4,1) x7,y7 ]  [ (4,2) x8,y8 ]

