# Title

[Free Form Description]

**Resources**

- [The NRC Valence, Arousal, and Dominance Lexicon](http://saifmohammad.com/WebPages/nrc-vad.html)
- [Spotipy docs (Python Wrapper)](https://spotipy.readthedocs.io/en/latest/)


**Data Input:**

- `data/processed/audio_data.csv`: DataFrame of all CC tracks with "Sonic Brutality Index" (from notebook 1)
- `data/raw/NRC-VAD-Lexicon.txt`: Data of approx. 20,000 words with valence, arousal and dominance scores

**Data Output:**

- `...`: ...

**Changes**

- 2019-02-18: Start project
- 20-02-25: Complete audio analysis



---

## Import libraries, load data

In [28]:
# Import libraries

from pprint import pprint
import json
import numpy as np
import pandas as pd

from sklearn.preprocessing import minmax_scale

# Visualization
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('raph-base')
import seaborn as sns 

Reading the lexicon into a Pandas DataFrame requires a little tweaking / cleaning first.

In [3]:
with open('data/raw/NRC-VAD-Lexicon.txt') as file:
    # Read every line, including the header row, into a list of strings
    data_list = file.readlines()

In [4]:
# Check results
data_list[:5]

['Word\tValence\tArousal\tDominance\n',
 'aaaaaaah\t0.479\t0.606\t0.291\n',
 'aaaah\t0.520\t0.636\t0.282\n',
 'aardvark\t0.427\t0.490\t0.437\n',
 'aback\t0.385\t0.407\t0.288\n']

In [5]:
# Split and clean
data_list2 = [x.replace('\n', '').split('\t') for x in data_list]

In [17]:
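# Convert only the score columns; dtype=float on the whole frame can fail on the string column `word`
vad_lexicon = pd.DataFrame(data_list2[1:], columns=[col.lower() for col in data_list2[0]])
score_cols = ['valence', 'arousal', 'dominance']
vad_lexicon[score_cols] = vad_lexicon[score_cols].astype(float)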
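
The manual steps above (read the lines, split on tabs, lower-case the header, convert the scores) can also be done in one pass. Here is a minimal sketch that uses only the standard library's `csv` module. The helper name `parse_vad_lexicon` is made up for illustration and is not part of this notebook; the sample rows are copied from the file preview above.

```python
import csv
import io


def parse_vad_lexicon(handle):
    """Parse tab-separated lexicon text into a list of row dicts.

    Hypothetical helper for illustration, not part of the notebook.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = [col.lower() for col in next(reader)]
    rows = []
    for fields in reader:
        row = dict(zip(header, fields))
        # Every column except the first one (`word`) holds a numeric score
        for col in header[1:]:
            row[col] = float(row[col])
        rows.append(row)
    return rows


# Sample rows taken from the file preview above
sample = io.StringIO(
    "Word\tValence\tArousal\tDominance\n"
    "aaaaaaah\t0.479\t0.606\t0.291\n"
    "aardvark\t0.427\t0.490\t0.437\n"
)
rows = parse_vad_lexicon(sample)
print(rows[1])
```

With the real file, `open('data/raw/NRC-VAD-Lexicon.txt', newline='')` would take the place of the `StringIO` object, and `pd.DataFrame(rows)` should turn the result into a DataFrame.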

In [18]:
# Check results ...
display(vad_lexicon.iloc[[1860]])

| | word | valence | arousal | dominance |
|---|---|---|---|---|
| 1860 | bloodshed | 0.048 | 0.942 | 0.525 |


Exactly what we are looking for: low valence, high arousal ... ;-) 

We can also see that `dominance` is quite neutral and probably not a feature that will be of further help. To filter and analyze words that combine low valence with high arousal more easily, I will create a new feature `anti_valence`, defined as (1 - valence). Then we can simply sum the two scores to get a `word brutality index (WBI)`. (To land in a range between 0 and 1, we normalize the sum with sklearn's `minmax_scale`.)

In [38]:
lexicon = vad_lexicon.copy()
lexicon['anti_valence'] = 1 - lexicon['valence']
# Sum both scores, then rescale the result to the range 0 to 1
lexicon['wbi'] = minmax_scale(lexicon['anti_valence'] + lexicon['arousal'])
lexicon = lexicon.drop(['valence', 'dominance'], axis=1)
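
To make the WBI arithmetic concrete, here is a small worked sketch in plain Python. The `min_max_scale` function is a hand-written stand-in for sklearn's `minmax_scale`, and the three words are taken from the outputs shown in this notebook. Because min-max scaling depends on the smallest and largest value in the set, the scores for this toy set differ from those computed over the full lexicon.

```python
def min_max_scale(values):
    """Rescale numbers linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# (valence, arousal) pairs taken from the outputs shown in this notebook
words = {
    "bloodshed": (0.048, 0.942),
    "aardvark": (0.427, 0.490),
    "aback": (0.385, 0.407),
}

# Raw score: anti-valence (1 - valence) plus arousal
raw = {word: (1 - valence) + arousal for word, (valence, arousal) in words.items()}
wbi = dict(zip(raw, min_max_scale(list(raw.values()))))

for word, score in sorted(wbi.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word:<10} raw={raw[word]:.3f} wbi={score:.3f}")
```

In this toy set "bloodshed" has the largest raw score and is scaled to 1, while "aback" has the smallest and is scaled to 0.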

In [46]:
# Check results ...
display(lexicon.nlargest(10, 'wbi'))
display(lexicon.loc[lexicon['word'] == 'zombie'])

| | word | arousal | anti_valence | wbi |
|---|---|---|---|---|
| 8472 | homicide | 0.973 | 0.99 | 1.0 |
| 11521 | murderer | 0.96 | 0.99 | 0.992746 |
| 9854 | killer | 0.971 | 0.959 | 0.981585 |
| 20 | abduction | 0.99 | 0.938 | 0.980469 |
| 17277 | suicidebombing | 0.957 | 0.969 | 0.979353 |
| 11523 | murderous | 0.94 | 0.983 | 0.977679 |
| 4366 | dangerous | 0.941 | 0.98 | 0.976562 |
| 1035 | assassinate | 0.969 | 0.949 | 0.974888 |
| 386 | aggresive | 0.971 | 0.941 | 0.97154 |
| 1856 | bloodbath | 0.971 | 0.94 | 0.970982 |


| | word | arousal | anti_valence | wbi |
|---|---|---|---|---|
| 19999 | zombie | 0.648 | 0.786 | 0.704799 |


Wow, people nowadays definitely seem to be more scared of suicide bombers than of zombies ... how come?

## Request data

### Artist

In [None]:
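# Get Artist URI
# Assumption: `sp` is an authenticated Spotipy client. The credentials are read from the
# SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET environment variables.
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials())

name = "Cannibal Corpse"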

def get_artist_uri(name):
    results = sp.search(q='artist:' + name, type='artist')
    items = results['artists']['items']
    artist_uri = items[0]['uri'] 
    return artist_uri

In [None]:
artist_uri = get_artist_uri(name)
pprint(artist_uri)

### Tracklist

The easiest way to query for tracks is as follows:

```python
results = sp.search(q='artist:' + name, limit=50, type='track')
for i, t in enumerate(results['tracks']['items']):
    print(' ', i, t['name'])
```

The problem is that a single request returns at most 50 items, and CC have released many more songs than that, so I will try a workaround: get a list of all albums, clean it a bit, and then combine the tracks of every album in the list.
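
As an aside, the cap applies per request, so a paged endpoint can in principle also be walked by advancing an offset. The sketch below shows that idea with a stand-in function instead of a real API call. `fetch_all` and `fake_search` are hypothetical names; with Spotipy, `fetch_page` could presumably wrap `sp.search(q=..., type='track', limit=limit, offset=offset)['tracks']['items']`. This notebook sticks to the album-based route.

```python
def fetch_all(fetch_page, page_size=50):
    """Collect items from a paged endpoint by advancing the offset.

    `fetch_page(limit, offset)` must return a list with at most `limit` items.
    Hypothetical helper for illustration, not part of the notebook.
    """
    items = []
    offset = 0
    while True:
        page = fetch_page(limit=page_size, offset=offset)
        items.extend(page)
        if len(page) < page_size:
            break  # a short (or empty) page means nothing is left
        offset += page_size
    return items


# Stand-in for an API call: 120 fake track names served in slices
FAKE_TRACKS = [f"track {i}" for i in range(120)]


def fake_search(limit, offset):
    return FAKE_TRACKS[offset:offset + limit]


all_tracks = fetch_all(fake_search)
print(len(all_tracks))
```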

In [None]:
# Get Artist albums (dict)
# Note: setting title as key catches some duplicates

def get_artist_albums(artist_uri):
    albums = {}
    # Note: this reads only the first page of results
    results = sp.artist_albums(artist_uri, album_type='album')
    for item in results['items']:
        albums[item['name'].title()] = item['uri']
    return albums
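
Why the title as key catches some duplicates: a dict keeps only one value per key, and `str.title()` normalizes the casing first. A tiny sketch with made-up items (the URIs are placeholders, not real Spotify URIs):

```python
# Hypothetical API items: the same album listed twice with different casing
items = [
    {"name": "The Bleeding", "uri": "spotify:album:aaa"},
    {"name": "THE BLEEDING", "uri": "spotify:album:bbb"},
    {"name": "Vile", "uri": "spotify:album:ccc"},
]

albums = {}
for item in items:
    # str.title() normalizes casing, so both spellings share one key;
    # the later entry overwrites the earlier one
    albums[item["name"].title()] = item["uri"]

print(albums)
```

Titles that differ in more than casing (a "- Reissue" suffix, for example) still end up under separate keys, which is why some manual cleaning is needed afterwards.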

In [None]:
artist_albums = get_artist_albums(artist_uri)
pprint(artist_albums)

In [None]:
# Manually clean some entries, we want originals only and no live albums
albums_to_delete = ['レッド・ビフォー・ブラック', 
                     'Vile (Expanded Edition)', 
                     'The Bleeding - Reissue',
                     'Live Cannibalism',
                     'Torturing And Eviscerating',
                   ]
def get_clean_album_uri_list(artist_albums, albums_to_delete=albums_to_delete):
    # Note: removes the unwanted entries from `artist_albums` in place
    if albums_to_delete is not None:
        for key in albums_to_delete:
            artist_albums.pop(key, None)  # ignore titles that are not present
    return list(artist_albums.values())

In [None]:
artist_albums_uri = get_clean_album_uri_list(artist_albums, albums_to_delete)
print(artist_albums_uri)

In [None]:
def get_full_tracklist_dict(artist_albums_uri):
    tracklist = {}
    for album_uri in artist_albums_uri:
        album = sp.album(album_uri)
        for track in album['tracks']['items']:
            tracklist[track['name'].title()] = track['uri']
    return tracklist

In [None]:
full_tracklist = get_full_tracklist_dict(artist_albums_uri)
print(list(full_tracklist.items())[0])
print("Total tracks:", len(full_tracklist))

### Audio Features

We use the audio features provided by Spotify ([see here](https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/)) to determine the sonic brutality of a track. We actually only need `Energy` and `Valence` for that, but in addition let's also have a look at the `Danceability` of Cannibal Corpse. Just for fun.

> Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
    
> Valence is a measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). 
    
> Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity.

In [None]:
def get_audio_features_dict(full_tracklist):
    audio_features_dict = {}
    for uri in list(full_tracklist.values()):
        features = sp.audio_features(uri)
        audio_features_dict[uri] = {'energy': features[0]['energy'],
                                    'valence': features[0]['valence'],
                                    'danceability': features[0]['danceability'],
                                   }
    return audio_features_dict
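
The function above sends one request per track. As far as I know, `sp.audio_features` also accepts a list of tracks (up to 100 per call), so the requests could be batched. The sketch below shows only the batching step; `chunked` is a hypothetical helper and the URIs are placeholders.

```python
def chunked(seq, size):
    """Yield consecutive slices of `seq` with at most `size` elements each."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]


# Placeholder URIs standing in for the values of `full_tracklist`
uris = [f"spotify:track:{i:03d}" for i in range(230)]

batches = list(chunked(uris, 100))
print([len(batch) for batch in batches])  # [100, 100, 30]
```

Each batch could then be passed to a single `sp.audio_features(batch)` call, assuming the returned list has the same order as the input.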

In [None]:
audio_features_dict = get_audio_features_dict(full_tracklist)
pprint(list(audio_features_dict.items())[:2])

## Analyse Songs

### Prepare dataframe

Getting the songs and features in separate dicts was OK for exploring the Spotify API and the Spotipy wrapper, but for the actual analysis I prefer to combine everything in a dataframe.

In [None]:
temp_df1 = pd.DataFrame(full_tracklist.items(), columns = ['title', 'uri'])
temp_df2 = pd.DataFrame(audio_features_dict.items(), columns = ['uri', 'features'])
assert len(temp_df1) == len(temp_df2)
song_data = pd.merge(temp_df1, temp_df2, on=['uri'])
display(song_data.head(2))

In [None]:
song_data['energy'] = song_data['uri'].apply(lambda x: audio_features_dict[x]['energy'])
song_data['valence'] = song_data['uri'].apply(lambda x: audio_features_dict[x]['valence'])
song_data['danceability'] = song_data['uri'].apply(lambda x: audio_features_dict[x]['danceability'])
song_data.drop('features', axis=1, inplace=True)
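
The merge and the three `apply` calls above boil down to flattening two dicts into one record per track. A minimal sketch of that reshaping in plain Python, using made-up miniature versions of `full_tracklist` and `audio_features_dict` (titles, URIs and values are invented):

```python
# Hypothetical miniature versions of `full_tracklist` and `audio_features_dict`
tracklist = {"Song A": "uri:1", "Song B": "uri:2"}
features = {
    "uri:1": {"energy": 0.98, "valence": 0.10, "danceability": 0.25},
    "uri:2": {"energy": 0.95, "valence": 0.20, "danceability": 0.30},
}

# One flat record per track: title, uri, then the feature columns
rows = [
    {"title": title, "uri": uri, **features[uri]}
    for title, uri in tracklist.items()
]

for row in rows:
    print(row)
```

Passing such a list to `pd.DataFrame(rows)` should give the same columns as `song_data` in a single step.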

In [None]:
display(song_data.head(2))

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=3, sharex=True, figsize=(16,4))

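# distplot is deprecated in recent seaborn versions; histplot with a KDE is the closest replacement
sns.histplot(song_data['energy'], kde=True, stat='density', ax=axes[0])
sns.histplot(song_data['valence'], kde=True, stat='density', ax=axes[1])
sns.histplot(song_data['danceability'], kde=True, stat='density', color="grey", ax=axes[2]);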

In [None]:
# Check for the outlier with an energy value of only approx. 0.8
# and get a link to a 30-second sample

low_energy_uri = song_data['uri'].loc[song_data['energy'] == song_data['energy'].min()].values[0]
results = sp.track(low_energy_uri)
# f-strings avoid a TypeError when `preview_url` is None
print(f"track       : {results['name']}")
print(f"from album  : {results['album']['name']}")
print(f"audio       : {results['preview_url']}")
print(f"cover art   : {results['album']['images'][0]['url']}")
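
The same four print statements come back several times in this notebook, so they could be wrapped in a small helper. This is a sketch only: `format_track_info` is a hypothetical name, and the fake track object just mimics the fields used above, including a missing preview.

```python
def format_track_info(track):
    """Return a short text summary for a dict shaped like a Spotify track object."""
    images = track["album"]["images"]
    lines = [
        f"track       : {track['name']}",
        f"from album  : {track['album']['name']}",
        # preview_url may be None, so it is formatted rather than concatenated
        f"audio       : {track.get('preview_url') or 'no preview available'}",
        f"cover art   : {images[0]['url'] if images else 'no image available'}",
    ]
    return "\n".join(lines)


# Hypothetical track object with a missing preview
fake_track = {
    "name": "Example Song",
    "preview_url": None,
    "album": {
        "name": "Example Album",
        "images": [{"url": "https://example.com/cover.jpg"}],
    },
}
print(format_track_info(fake_track))
```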


### Calculate "Sonic Brutality Index"

Using both `energy` and `valence`, we can create an equation for the “Sonic Brutality Index” by calculating the geometric mean of `energy` and `1 - valence` (subtracting valence from 1 so that a higher value means it’s more “negative”). This way, the most brutal songs will be those that are both high in energy and low in valence, while equally weighting both.

$$\text{Sonic Brutality Index} = \sqrt{(1 - \text{valence}) \cdot \text{energy}}$$

In [None]:
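def calc_sbi(valence, energy):
    """Return the geometric mean of (1 - valence) and energy."""
    return np.sqrt((1 - valence) * energy)

# np.sqrt works element-wise, so whole columns can be passed at once
song_data['sbi'] = calc_sbi(song_data['valence'], song_data['energy'])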
display(song_data.head(2))
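
A quick worked example of why the geometric mean fits here: it only gets high when both parts are high, whereas the arithmetic mean lets one strong part compensate for a weaker one. The feature values below are invented purely for illustration, and `sbi` mirrors `calc_sbi` using only the standard library.

```python
import math


def sbi(valence, energy):
    """Geometric mean of anti-valence (1 - valence) and energy."""
    return math.sqrt((1 - valence) * energy)


def arithmetic(valence, energy):
    """Arithmetic mean of the same two quantities, for comparison."""
    return ((1 - valence) + energy) / 2


# Hypothetical feature values, chosen only to illustrate the behavior
examples = {
    "balanced (valence 0.2, energy 0.8)": (0.2, 0.8),
    "lopsided (valence 0.0, energy 0.6)": (0.0, 0.6),
}
for label, (valence, energy) in examples.items():
    print(f"{label}: geometric={sbi(valence, energy):.3f}, "
          f"arithmetic={arithmetic(valence, energy):.3f}")
```

Both examples have the same arithmetic mean, but the lopsided one gets a lower geometric mean.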

In [None]:
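plt.figure(figsize=(8,4))
# distplot is deprecated in recent seaborn versions; histplot with a KDE is the closest replacement
sns.histplot(song_data['sbi'], bins=20, kde=True, stat='density');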

In [None]:
# Check for the most brutal song (acoustically)

most_brutal_uri = song_data['uri'].loc[song_data['sbi'] == song_data['sbi'].max()].values[0]
results = sp.track(most_brutal_uri)
# f-strings avoid a TypeError when `preview_url` is None
print(f"track       : {results['name']}")
print(f"from album  : {results['album']['name']}")
print(f"audio       : {results['preview_url']}")
print(f"cover art   : {results['album']['images'][0]['url']}")

YouTube clip:

<a href="http://www.youtube.com/watch?feature=player_embedded&v=57WwWg9PD74" target="_blank"><img src="http://img.youtube.com/vi/57WwWg9PD74/0.jpg" alt="Link to YouTube clip" width="240" height="180" border="10" /></a>

In [None]:
song_data.sort_values(['sbi'], ascending=False)

In [None]:
# Let's listen to a not so brutal but danceable track now
# (don't expect too much though ...)

rabid_uri = song_data['uri'].loc[song_data['title'] == 'Rabid'].values[0]
results = sp.track(rabid_uri)
# f-strings avoid a TypeError when `preview_url` is None
print(f"track       : {results['name']}")
print(f"from album  : {results['album']['name']}")
print(f"audio       : {results['preview_url']}")
print(f"cover art   : {results['album']['images'][0]['url']}")

In [None]:
# Save data
song_data.to_csv('data/processed/audio_data.csv', index=False)

---

## Appendix: Compare Sonic Brutality of Cannibal Corpse and Cannabis Corpse

In [None]:
# Retrieve data from API

name2 = "Cannabis Corpse"

artist_uri2 = get_artist_uri(name2)
artist_albums2 = get_artist_albums(artist_uri2)
artist_albums_uri2 = get_clean_album_uri_list(artist_albums2, albums_to_delete=None)
full_tracklist2 = get_full_tracklist_dict(artist_albums_uri2)
audio_features_dict2 = get_audio_features_dict(full_tracklist2)
pprint(list(audio_features_dict2.items())[:2])
print("\nTotal Number of songs:", len(audio_features_dict2))

In [None]:
# Construct DataFrame

temp_df1 = pd.DataFrame(full_tracklist2.items(), columns = ['title', 'uri'])
temp_df2 = pd.DataFrame(audio_features_dict2.items(), columns = ['uri', 'features'])
assert len(temp_df1) == len(temp_df2)
song_data2 = pd.merge(temp_df1, temp_df2, on=['uri'])

song_data2['energy'] = song_data2['uri'].apply(lambda x: audio_features_dict2[x]['energy'])
song_data2['valence'] = song_data2['uri'].apply(lambda x: audio_features_dict2[x]['valence'])
song_data2['danceability'] = song_data2['uri'].apply(lambda x: audio_features_dict2[x]['danceability'])
song_data2.drop('features', axis=1, inplace=True)

In [None]:
# Calculate SBI

song_data2['sbi'] = song_data2.apply(lambda x: calc_sbi(x['valence'], x['energy']), axis=1)
display(song_data2.head(2))

In [None]:
# Compare Brutality of Cannibal Corpse and Cannabis Corpse

print(f"Mean Brutality Score for {name}: {song_data['sbi'].mean():.2f}")
print(f"Mean Brutality Score for {name2}: {song_data2['sbi'].mean():.2f}")

In [None]:
plt.figure(figsize=(8,4))
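# distplot is deprecated in recent seaborn versions; histplot with a KDE is the closest replacement
sns.histplot(song_data['sbi'], bins=20, kde=True, stat='density', label=name);
sns.histplot(song_data2['sbi'], color='yellow', bins=20, kde=True, stat='density', label=name2);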
plt.legend(loc='upper left');

In [None]:
most_brutal_uri2 = song_data2['uri'].loc[song_data2['sbi'] == song_data2['sbi'].max()].values[0]
results = sp.track(most_brutal_uri2)
# f-strings avoid a TypeError when `preview_url` is None
print(f"track       : {results['name']}")
print(f"from album  : {results['album']['name']}")
print(f"audio       : {results['preview_url']}")
print(f"cover art   : {results['album']['images'][0]['url']}")

In [None]:
song_data2.nlargest(1, 'sbi')

---