# Title

[Free Form Description]

**Resources**

- [The NRC Valence, Arousal, and Dominance Lexicon](http://saifmohammad.com/WebPages/nrc-vad.html)
- [Blogpost on scraping song lyrics](https://chrishyland.github.io/scraping-from-genius/) - Thanks to Chris Hyland for this!
- [Genius API documentation](https://docs.genius.com/#/getting-started-h1) 


**Data Input:**

- `data/processed/audio_data.csv`: DataFrame of all CC tracks with "Sonic Brutality Index" (from notebook 1)
- `data/raw/NRC-VAD-Lexicon.txt`: Data of approx 20'000 words with valence, arousal and dominance scores

**Data Output:**

- `data/processed/wbi_lexicon.csv`: DataFrame containing 20'000 words with their corresponding "Word Brutality Index"

**Changes**

- 2019-02-18: Start project
- 20-02-25: Complete audio analysis



---

## Import libraries

In [33]:
# Import libraries

from pprint import pprint
import re
import requests
import urllib.request
import urllib.parse
import json
import numpy as np
import pandas as pd
from tqdm import tqdm
from bs4 import BeautifulSoup
from sklearn.preprocessing import minmax_scale
import credentials # file where credentials for genius API are stored

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer  # default lemmatizer

# Visualization
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('raph-base')
import seaborn as sns 

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\r2d4\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Load and Prepare Lexicon
The NRC Valence, Arousal, and Dominance (VAD) Lexicon includes a list of more than 20,000 English words and their valence, arousal, and dominance scores. For a given word and a dimension (V/A/D), the scores range from 0 (lowest V/A/D) to 1 (highest V/A/D). Reading the lexicon into a Pandas DataFrame requires a little tweaking / cleaning first.

In [2]:
with open('data/raw/NRC-VAD-Lexicon.txt') as file:
    # Read all lines (each one still ends with a newline character)
    data_list = file.readlines()

In [3]:
# Check results
data_list[:5]

['Word\tValence\tArousal\tDominance\n',
 'aaaaaaah\t0.479\t0.606\t0.291\n',
 'aaaah\t0.520\t0.636\t0.282\n',
 'aardvark\t0.427\t0.490\t0.437\n',
 'aback\t0.385\t0.407\t0.288\n']

In [4]:
# Split and clean
data_list2 = [x.replace('\n', '').split('\t') for x in data_list]

In [5]:
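# Build the frame from strings first, then convert only the score columns;
# passing dtype=float for the whole frame can fail on the text column `word`
vad_lexicon = pd.DataFrame(data_list2[1:], columns=[col.lower() for col in data_list2[0]])
score_cols = ['valence', 'arousal', 'dominance']
vad_lexicon[score_cols] = vad_lexicon[score_cols].astype(float)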
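
The manual read, split, and convert steps above could probably be collapsed into a single `pd.read_csv` call. The following is only a minimal sketch, assuming the file keeps the tab-separated layout and header row shown in the check above. An in-memory sample stands in for `data/raw/NRC-VAD-Lexicon.txt`, and `keep_default_na=False` is my own precaution in case the lexicon contains words such as "null" that pandas would otherwise read as missing values.

```python
import io
import pandas as pd

# In-memory stand-in for the lexicon file (first rows from the check above)
sample = io.StringIO(
    "Word\tValence\tArousal\tDominance\n"
    "aaaaaaah\t0.479\t0.606\t0.291\n"
    "aaaah\t0.520\t0.636\t0.282\n"
    "aardvark\t0.427\t0.490\t0.437\n"
)

# sep="\t" handles the tab-separated layout; keep_default_na=False keeps
# words such as "null" as plain strings instead of turning them into NaN
vad_sample = pd.read_csv(sample, sep="\t", keep_default_na=False)
vad_sample.columns = vad_sample.columns.str.lower()

print(vad_sample.dtypes)
```

For the real file, the `io.StringIO` object would simply be replaced by the file path.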

In [6]:
# Check results ...
display(vad_lexicon.iloc[[1860]])

| | word | valence | arousal | dominance |
|---|---|---|---|---|
| 1860 | bloodshed | 0.048 | 0.942 | 0.525 |


Exactly what we are looking for: low valence, high arousal ... ;-) 

We can also see that `dominance` is fairly neutral here, so it is probably not a feature that will be of further help. To make it easier to filter and analyze words that combine low valence with high arousal, I will create a new feature `anti_valence`, defined as (1 - valence). Summing `anti_valence` and `arousal` then gives a `word brutality index (WBI)`. (To land in a range between 0 and 1, the sum is normalized with sklearn's `minmax_scale`.)

In [7]:
lexicon = vad_lexicon.copy()
lexicon['anti_valence'] = 1 - lexicon['valence']  # vectorized, same result as apply
wbi = minmax_scale(lexicon['anti_valence'] + lexicon['arousal'])
lexicon['wbi'] = wbi
lexicon.drop(['valence', 'dominance'], axis=1, inplace=True)
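
To make the normalization step explicit, here is a small pure-Python sketch of the same calculation: the raw score is (1 - valence) + arousal, and min-max scaling maps the smallest raw score to 0 and the largest to 1. The function name `word_brutality_index` is hypothetical, and the sketch assumes at least two distinct raw scores (otherwise the denominator would be zero).

```python
def word_brutality_index(valence, arousal):
    """Min-max scale (1 - valence) + arousal over a list of words."""
    raw = [(1 - v) + a for v, a in zip(valence, arousal)]
    lo, hi = min(raw), max(raw)
    # Assumes hi > lo, i.e. at least two distinct raw scores
    return [(r - lo) / (hi - lo) for r in raw]

# Values taken from lexicon rows shown earlier: bloodshed, aardvark, aback
valence = [0.048, 0.427, 0.385]
arousal = [0.942, 0.490, 0.407]

wbi_sample = word_brutality_index(valence, arousal)
print([round(w, 3) for w in wbi_sample])
```

Because the scaling depends on the minimum and maximum of the words passed in, the scores for this three-word sample differ from the ones the same words get in the full lexicon.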

In [8]:
# Check results ...
display(lexicon.nlargest(10, 'wbi'))
display(lexicon.loc[lexicon['word'] == 'zombie'])

| | word | arousal | anti_valence | wbi |
|---|---|---|---|---|
| 8472 | homicide | 0.973 | 0.99 | 1.0 |
| 11521 | murderer | 0.96 | 0.99 | 0.992746 |
| 9854 | killer | 0.971 | 0.959 | 0.981585 |
| 20 | abduction | 0.99 | 0.938 | 0.980469 |
| 17277 | suicidebombing | 0.957 | 0.969 | 0.979353 |
| 11523 | murderous | 0.94 | 0.983 | 0.977679 |
| 4366 | dangerous | 0.941 | 0.98 | 0.976562 |
| 1035 | assassinate | 0.969 | 0.949 | 0.974888 |
| 386 | aggresive | 0.971 | 0.941 | 0.97154 |
| 1856 | bloodbath | 0.971 | 0.94 | 0.970982 |


| | word | arousal | anti_valence | wbi |
|---|---|---|---|---|
| 19999 | zombie | 0.648 | 0.786 | 0.704799 |


Wow, people nowadays definitely seem to be scared a lot more of suicide bombers than of zombies ... how come?

In [47]:
lexicon.to_csv('data/processed/wbi_lexicon.csv', index=False)

## Load and Prepare Lyrics

Collect the IDs of all Cannibal Corpse songs via the Genius API, then scrape the lyrics from the corresponding genius.com song pages.

In [9]:
base = "https://api.genius.com"
genius_token = credentials.genius_token

def get_json(path, params=None, headers=None):
    '''Send request and get response in json format.'''

    # Generate request URL
    requrl = '/'.join([base, path])
    token = f"Bearer {genius_token}"
    if headers:
        headers['Authorization'] = token
    else:
        headers = {"Authorization": token}
    # Get response object from querying genius api
    response = requests.get(url=requrl, params=params, headers=headers)
    response.raise_for_status()
    return response.json()
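
One detail of `get_json` worth noting: when a `headers` dict is passed in, the token is written into that same dict, so the caller's object is modified. The following sketch shows a non-mutating variant of just that step. The helper name `build_auth_headers` and the dummy token are hypothetical and only serve the illustration.

```python
def build_auth_headers(token, headers=None):
    """Return a new headers dict with the bearer token added."""
    merged = dict(headers or {})  # copy, so the caller's dict stays untouched
    merged["Authorization"] = f"Bearer {token}"
    return merged

caller_headers = {"Accept": "application/json"}
request_headers = build_auth_headers("dummy-token", caller_headers)

print(request_headers)  # contains both Accept and Authorization
print(caller_headers)   # unchanged, still only Accept
```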

In [10]:
# Get artist ID

name = "Cannibal Corpse"

def get_artist_id(artist_name):
    '''Search Genius API for artist ID via artist name.'''

    search = "/search?q="
    query = base + search + urllib.parse.quote(artist_name)
    request = urllib.request.Request(query)
    request.add_header("Authorization", "Bearer " + genius_token)
#     request.add_header("User-Agent", "")  
    response = urllib.request.urlopen(request, timeout=3)
    raw = response.read()
    data = json.loads(raw)['response']['hits']
    
    return (data[0]['result']['primary_artist']['id'])
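
`get_artist_id` trusts the first search hit, which works here but could return a different artist if that hit happens to be a song by someone else. A slightly more defensive selection might look like the sketch below. It assumes each hit's `primary_artist` also carries a `name` field next to `id`; the helper `pick_artist_id` and the sample hits are hypothetical.

```python
def pick_artist_id(hits, artist_name):
    """Return the ID of the first hit whose primary artist matches the name."""
    wanted = artist_name.strip().lower()
    for hit in hits:
        artist = hit["result"]["primary_artist"]
        # Assumption: the artist object has a "name" field
        if artist.get("name", "").strip().lower() == wanted:
            return artist["id"]
    return None  # no exact match; the caller decides how to handle this

# Hypothetical search hits, reduced to the fields used here
hits = [
    {"result": {"primary_artist": {"id": 111, "name": "Some Other Band"}}},
    {"result": {"primary_artist": {"id": 41863, "name": "Cannibal Corpse"}}},
]

print(pick_artist_id(hits, "cannibal corpse"))
```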

In [11]:
artist_id = get_artist_id(name)
print(artist_id)

41863


In [12]:
def get_songlist(artist_id):
    '''Get all the song ids and titles from an artist in form of a dict.'''
    current_page = 1
    next_page = True
    songs = [] # to store final song ids
    while next_page:
        path = f"artists/{artist_id}/songs/"
        params = {'page': current_page} # the current page
        data = get_json(path=path, params=params) # get json of songs
        page_songs = data['response']['songs']
        if page_songs:
            # Add all the songs of current page
            songs += page_songs
            # Increment current_page value for next loop
            current_page += 1
            print(f"Page {current_page} finished scraping")
            # If you don't wanna wait too long to scrape, un-comment this
            # if current_page == 2:
            #     break
        else:
            # If page_songs is empty, quit
            next_page = False

    print(f"Song id were scraped from {current_page} pages")

    # Get all the song ids, excluding not-primary-artist songs.
    songlist = {song["id"]: song['title'].lower() for song in songs
                if song["primary_artist"]["id"] == artist_id}

    return songlist
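
The pagination loop in `get_songlist` increments `current_page` before printing, so its progress messages and the final page count run one page ahead of what was actually read. The sketch below separates the paging logic from the API call and counts only pages that returned songs. `collect_pages` and the fake page data are hypothetical names for illustration; the stop condition (an empty page) is the same as above.

```python
def collect_pages(fetch_page, first_page=1):
    """Collect items page by page until fetch_page returns an empty list."""
    items = []
    pages_read = 0
    page = first_page
    while True:
        page_items = fetch_page(page)
        if not page_items:
            break  # an empty page marks the end
        items.extend(page_items)
        pages_read += 1
        page += 1
    return items, pages_read

# Hypothetical stand-in for the API: two pages of songs, then an empty page
fake_pages = {1: [{"id": 1}, {"id": 2}], 2: [{"id": 3}]}
songs, pages_read = collect_pages(lambda page: fake_pages.get(page, []))

print(len(songs), pages_read)
```

With the functions above, `fetch_page` could be something like `lambda page: get_json(path, params={'page': page})['response']['songs']`.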

In [13]:
songlist = get_songlist(artist_id)
print(f"\n{list(songlist.items())[:2]}")

Page 2 finished scraping
Page 3 finished scraping
Page 4 finished scraping
Page 5 finished scraping
Page 6 finished scraping
Page 7 finished scraping
Page 8 finished scraping
Page 9 finished scraping
Page 10 finished scraping
Page 11 finished scraping
Song id were scraped from 11 pages

[(764037, 'absolute hatred'), (715726, 'a cauldron of hate')]


In [14]:
def connect_lyrics(song_id):
    '''Constructs the path of song lyrics. (Called within next function.)'''

    url = f"songs/{song_id}"
    data = get_json(url)
    # Gets the path of song lyrics
    path = data['response']['song']['path']
    return path

def retrieve_lyrics(song_id):
    '''Retrieves lyrics from html page.'''

    path = connect_lyrics(song_id)
    url = "https://genius.com" + path
    page = requests.get(url)
    # Extract the page's HTML as a string
    html = BeautifulSoup(page.text, "html.parser")
    # Scrape the song lyrics from the HTML
    lyrics = html.find("div", class_="lyrics").get_text()
    return lyrics

In [15]:
def scrape_lyrics(songlist):
    """Scrape the lyrics from the songs in a songlist."""
    lyrics_dict = {}
    for song_id, title in tqdm(songlist.items()):
        lyrics_dict[title] = retrieve_lyrics(song_id)
        
    return lyrics_dict
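
As written, `scrape_lyrics` stops at the first exception, for example when `html.find(...)` returns `None` in `retrieve_lyrics` because a page has no matching `div`. For a run that takes several minutes it may be more convenient to record failures and continue. The following is a rough sketch under that assumption; `scrape_safely`, `fake_fetch`, and the `delay` parameter are hypothetical and not part of the notebook.

```python
import time

def scrape_safely(songlist, fetch, delay=0.0):
    """Fetch lyrics per song; record failures instead of aborting."""
    lyrics, failed = {}, {}
    for song_id, title in songlist.items():
        try:
            lyrics[title] = fetch(song_id)
        except Exception as err:  # broad on purpose for a scraping sketch
            failed[title] = repr(err)
        time.sleep(delay)  # optional pause between requests
    return lyrics, failed

def fake_fetch(song_id):
    """Stand-in for retrieve_lyrics that fails for one song."""
    if song_id == 2:
        raise AttributeError("lyrics container not found")
    return f"lyrics for {song_id}"

lyrics, failed = scrape_safely({1: "song a", 2: "song b"}, fake_fetch)
print(sorted(lyrics), sorted(failed))
```

In the notebook, `fetch` would be `retrieve_lyrics`, and the songs listed in `failed` could be retried or inspected afterwards.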

In [16]:
lyrics_dict = scrape_lyrics(songlist)

100%|██████████| 187/187 [04:42<00:00,  1.51s/it]


In [41]:
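def process_text(raw_text):
    '''Clean, tokenize, and lemmatize raw lyrics; return a list of tokens.'''
    # Note: word_tokenize and WordNetLemmatizer rely on NLTK resources
    # beyond 'stopwords' (e.g. 'punkt' and 'wordnet').
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))  # set for fast lookups

    # Lowercase and replace everything except letters and digits with spaces
    text = re.sub(r"[^a-zA-Z0-9]", " ", raw_text.lower().strip())
    tokens = word_tokenize(text)
    # First pass: lemmatize as nouns (the default) and drop stop words
    lemmed = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    # Second pass: lemmatize as verbs
    lemmed_tokens = [lemmatizer.lemmatize(w, pos='v') for w in lemmed]

    return lemmed_tokens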
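
To see what the regular expression in `process_text` does before any NLTK step runs, the cleanup can be tried in isolation. This is a simplified sketch: `normalize` is a hypothetical helper that repeats only the lowercase and `re.sub` step, and `str.split` stands in for `word_tokenize`, so no NLTK data is needed.

```python
import re

def normalize(raw_text):
    """Lowercase, strip, and replace non-alphanumeric characters with spaces."""
    return re.sub(r"[^a-zA-Z0-9]", " ", raw_text.lower().strip())

sample = "Don't STOP!\n[Chorus]"
tokens = normalize(sample).split()  # str.split stands in for word_tokenize

print(tokens)  # ['don', 't', 'stop', 'chorus']
```

Two things are visible here: apostrophes are replaced too, so contractions are split into separate tokens, and bracketed labels such as `[Chorus]`, if the scraped text contains them, survive as ordinary words.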

**Note:** Lemmatization has some effect on the results: for example, the word "torture" (V: 0.115, D: 0.878) does not have exactly the same values as the word "tortured" (V: 0.062, D: 0.890). The differences are fairly small, though, so I will just go on.

In [43]:
lyrics_dict_clean = {title: process_text(lyric) for title, lyric in lyrics_dict.items()}

In [46]:
print(lyrics_dict_clean)

