# Title

[Free Form Description]

**Resources**

- [The NRC Valence, Arousal, and Dominance Lexicon](http://saifmohammad.com/WebPages/nrc-vad.html)
- [Blogpost on scraping song lyrics](https://chrishyland.github.io/scraping-from-genius/) - Thanks to Chris Hyland for this!
- [Genius API documentation](https://docs.genius.com/#/getting-started-h1) 


**Data Input:**

- `data/processed/audio_data.csv`: DataFrame of all Cannibal Corpse (CC) tracks with "Sonic Brutality Index" (from notebook 1)
- `data/raw/NRC-VAD-Lexicon.txt`: Approx. 20,000 words with valence, arousal, and dominance scores

**Data Output:**

- `...`: ...

**Changes**

- 2019-02-18: Start project
- 20-02-25: Complete audio analysis



---

## Import libraries

In [3]:
# Import libraries

from pprint import pprint
import requests
import urllib.request
import urllib.parse
import json
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from sklearn.preprocessing import minmax_scale
import credentials # file where credentials for genius API are stored

# Visualization
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('raph-base')
import seaborn as sns 

## Load and Prepare Lexicon
The NRC Valence, Arousal, and Dominance (VAD) Lexicon includes a list of more than 20,000 English words and their valence, arousal, and dominance scores. For a given word and a dimension (V/A/D), the scores range from 0 (lowest V/A/D) to 1 (highest V/A/D). Reading the lexicon into a Pandas DataFrame requires a little tweaking / cleaning first.

In [3]:
with open('data/raw/NRC-VAD-Lexicon.txt') as file:
    # Read all lines (header row included) into a list of strings
    data_list = file.readlines()

In [4]:
# Check results
data_list[:5]

['Word\tValence\tArousal\tDominance\n',
 'aaaaaaah\t0.479\t0.606\t0.291\n',
 'aaaah\t0.520\t0.636\t0.282\n',
 'aardvark\t0.427\t0.490\t0.437\n',
 'aback\t0.385\t0.407\t0.288\n']

In [5]:
# Split and clean
data_list2 = [x.replace('\n', '').split('\t') for x in data_list]

In [17]:
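# Cast only the score columns: dtype=float on the whole frame raises in pandas 2.x because `word` is text
vad_lexicon = pd.DataFrame(data_list2[1:], columns=[col.lower() for col in data_list2[0]])
score_cols = ['valence', 'arousal', 'dominance']
vad_lexicon[score_cols] = vad_lexicon[score_cols].astype(float)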
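
As an alternative to the manual splitting above, the file could probably be read in one step, since the preview shows it is tab-separated with a header row. The following is only a minimal sketch under that assumption: `load_vad_lexicon` is a hypothetical helper name, and the in-memory sample reuses rows that appear in this notebook instead of the real file.

```python
import io

import pandas as pd


def load_vad_lexicon(source):
    """Read a tab-separated VAD lexicon into a DataFrame with lower-case column names."""
    # keep_default_na=False keeps words such as "null" or "nan" as text
    # instead of turning them into missing values.
    df = pd.read_csv(source, sep="\t", keep_default_na=False)
    df.columns = [col.lower() for col in df.columns]
    return df


# In-memory stand-in for data/raw/NRC-VAD-Lexicon.txt (rows taken from this notebook)
sample = io.StringIO(
    "Word\tValence\tArousal\tDominance\n"
    "aaaaaaah\t0.479\t0.606\t0.291\n"
    "aaaah\t0.520\t0.636\t0.282\n"
    "aardvark\t0.427\t0.490\t0.437\n"
    "aback\t0.385\t0.407\t0.288\n"
    "bloodshed\t0.048\t0.942\t0.525\n"
)
vad_sample = load_vad_lexicon(sample)
print(vad_sample.dtypes)
```

With the real file, the call would presumably be `load_vad_lexicon('data/raw/NRC-VAD-Lexicon.txt')`; the score columns are then parsed as floats without an explicit cast.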

In [18]:
# Check results ...
display(vad_lexicon.iloc[[1860]])

| | word | valence | arousal | dominance |
|---|---|---|---|---|
| 1860 | bloodshed | 0.048 | 0.942 | 0.525 |


Exactly what we are looking for: low valence, high arousal ... ;-) 

We can also see that `dominance` is quite neutral and probably not a feature that will be of further help. To more easily filter and analyze words that combine low valence with high arousal, I will create a new feature `anti_valence`, defined as (1 - valence). Then we can simply sum the two scores to get a `word brutality index (WBI)`. (To land in a range between 0 and 1, we will normalize it using sklearn's `minmax_scale`.)

In [38]:
lexicon = vad_lexicon.copy()
lexicon['anti_valence'] = 1 - lexicon['valence']
wbi = minmax_scale(lexicon['anti_valence'] + lexicon['arousal'])
lexicon['wbi'] = wbi
lexicon.drop(['valence', 'dominance'], axis=1, inplace=True)
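
To make the WBI arithmetic concrete, here is a small worked sketch using plain NumPy. It assumes the default behaviour of min-max scaling, i.e. `(x - min) / (max - min)`; `word_brutality_index` is a hypothetical helper name, and the three words are taken from outputs shown earlier in this notebook.

```python
import numpy as np


def word_brutality_index(valence, arousal):
    """Min-max scale (1 - valence) + arousal to the range [0, 1]."""
    raw = (1 - np.asarray(valence, dtype=float)) + np.asarray(arousal, dtype=float)
    # Same formula that min-max scaling to (0, 1) applies
    return (raw - raw.min()) / (raw.max() - raw.min())


# Valence and arousal of bloodshed, aardvark and aback
valence = [0.048, 0.427, 0.385]
arousal = [0.942, 0.490, 0.407]
wbi_sample = word_brutality_index(valence, arousal)
print(wbi_sample.round(3))
```

Because the scaling depends on the minimum and maximum of the words passed in, the scores on this three-word sample differ from the scores the same words get in the full lexicon.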

In [46]:
# Check results ...
display(lexicon.nlargest(10, 'wbi'))
display(lexicon.loc[lexicon['word'] == 'zombie'])

| | word | arousal | anti_valence | wbi |
|---|---|---|---|---|
| 8472 | homicide | 0.973 | 0.99 | 1.0 |
| 11521 | murderer | 0.96 | 0.99 | 0.992746 |
| 9854 | killer | 0.971 | 0.959 | 0.981585 |
| 20 | abduction | 0.99 | 0.938 | 0.980469 |
| 17277 | suicidebombing | 0.957 | 0.969 | 0.979353 |
| 11523 | murderous | 0.94 | 0.983 | 0.977679 |
| 4366 | dangerous | 0.941 | 0.98 | 0.976562 |
| 1035 | assassinate | 0.969 | 0.949 | 0.974888 |
| 386 | aggresive | 0.971 | 0.941 | 0.97154 |
| 1856 | bloodbath | 0.971 | 0.94 | 0.970982 |


| | word | arousal | anti_valence | wbi |
|---|---|---|---|---|
| 19999 | zombie | 0.648 | 0.786 | 0.704799 |


Wow, people nowadays definitely seem to be more scared of suicide bombers than of zombies ... how come?

## Load and Prepare Lyrics

Collect all Cannibal Corpse songs via the Genius API, then scrape their lyrics from the Genius website.

In [46]:
base = "https://api.genius.com"
genius_token = credentials.genius_token

def get_json(path, params=None, headers=None):
    '''Send request and get response in json format.'''

    # Generate request URL
    requrl = '/'.join([base, path])
    token = f"Bearer {genius_token}"
    if headers:
        headers['Authorization'] = token
    else:
        headers = {"Authorization": token}
    # Get response object from querying genius api
    response = requests.get(url=requrl, params=params, headers=headers)
    response.raise_for_status()
    return response.json()

In [47]:
# Get artist ID

name = "Cannibal Corpse"

def get_artist_id(artist_name):
    '''Search Genius API for artist ID via artist name.'''

    search = "/search?q="
    query = base + search + urllib.parse.quote(artist_name)
    request = urllib.request.Request(query)
    request.add_header("Authorization", "Bearer " + genius_token)
#     request.add_header("User-Agent", "")  
    response = urllib.request.urlopen(request, timeout=3)
    raw = response.read()
    data = json.loads(raw)['response']['hits']
    
    return (data[0]['result']['primary_artist']['id'])

In [48]:
artist_id = get_artist_id(name)
print(artist_id)

41863
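
Taking the first search hit works here, but it relies on the top result belonging to the artist we searched for. A slightly more defensive variant could match on the artist name instead. This is a sketch only: `pick_artist_id` is a hypothetical helper, it assumes each hit's `primary_artist` carries a `name` field next to `id`, and the sample hits are hand-written (only the ID 41863 comes from the output above).

```python
def pick_artist_id(hits, artist_name):
    """Return the primary-artist id of the first hit by `artist_name`, or None."""
    wanted = artist_name.casefold()
    for hit in hits:
        artist = hit["result"]["primary_artist"]
        if artist["name"].casefold() == wanted:
            return artist["id"]
    return None


# Hand-written stand-in for json.loads(raw)['response']['hits'];
# the first entry is made up for illustration.
sample_hits = [
    {"result": {"primary_artist": {"id": 1, "name": "Some Other Artist"}}},
    {"result": {"primary_artist": {"id": 41863, "name": "Cannibal Corpse"}}},
]
print(pick_artist_id(sample_hits, "cannibal corpse"))  # prints 41863
```

Returning `None` for a missing artist leaves it to the caller to decide whether that should stop the notebook or just skip the artist.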


In [80]:
def get_songlist(artist_id):
    '''Get all the song ids and titles from an artist in form of a dict.'''
    current_page = 1
    next_page = True
    songs = [] # to store final song ids
    while next_page:
        path = f"artists/{artist_id}/songs/"
        params = {'page': current_page} # the current page
        data = get_json(path=path, params=params) # get json of songs
        page_songs = data['response']['songs']
        if page_songs:
            # Add all the songs of current page
            songs += page_songs
            print(f"Page {current_page} finished scraping")
            # Increment current_page value for next loop
            current_page += 1
            # If you don't wanna wait too long to scrape, un-comment this
            # if current_page == 2:
            #     break
        else:
            # If page_songs is empty, quit
            next_page = False

    # current_page now points past the last scraped page
    print(f"Song ids were scraped from {current_page - 1} pages")

    # Get all the song ids, excluding not-primary-artist songs.
    songlist = {song["id"]: song['title'] for song in songs
                if song["primary_artist"]["id"] == artist_id}

    return songlist

In [83]:
songlist = get_songlist(artist_id)
pprint(list(songlist.items())[:2])

Page 1 finished scraping
Page 2 finished scraping
Page 3 finished scraping
Page 4 finished scraping
Page 5 finished scraping
Page 6 finished scraping
Page 7 finished scraping
Page 8 finished scraping
Page 9 finished scraping
Page 10 finished scraping
Song ids were scraped from 10 pages
[(764037, 'Absolute Hatred'), (715726, 'A Cauldron of Hate')]
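
The paging logic in `get_songlist` (request page after page until one comes back empty) can be tried offline by separating it from the HTTP call. The sketch below is an illustration, not the notebook's actual implementation: `collect_pages`, `fetch_page` and `max_pages` are hypothetical names, and the fake pages only reuse the two songs printed above.

```python
def collect_pages(fetch_page, max_pages=50):
    """Call fetch_page(1), fetch_page(2), ... until a page is empty; return all items."""
    items = []
    for page in range(1, max_pages + 1):
        page_items = fetch_page(page)
        if not page_items:
            # An empty page signals that there is nothing left to fetch
            break
        items.extend(page_items)
    return items


# Stand-in for the API: two filled pages, every later page is empty
fake_pages = {
    1: [{"id": 764037, "title": "Absolute Hatred"}],
    2: [{"id": 715726, "title": "A Cauldron of Hate"}],
}
songs = collect_pages(lambda page: fake_pages.get(page, []))
print({song["id"]: song["title"] for song in songs})
```

Against the real API, `fetch_page` would presumably wrap `get_json` and return `data['response']['songs']` for the given page; the `max_pages` cap is an added safety net so a misbehaving endpoint cannot keep the loop running forever.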


In [84]:
...

Ellipsis

In [None]:
def connect_lyrics(song_id):
    '''Look up the path of a song's lyrics page via the Genius API.'''

    url = f"songs/{song_id}"
    data = get_json(url)
    # Path of the lyrics page, relative to genius.com
    path = data['response']['song']['path']
    return path

def retrieve_lyrics(song_id):
    '''Retrieves lyrics from html page.'''

    path = connect_lyrics(song_id)
    URL = "https://genius.com" + path
    page = requests.get(URL)
    page.raise_for_status()
    # Parse the page's HTML
    html = BeautifulSoup(page.text, "html.parser")
    # Scrape the song lyrics from the HTML
    lyrics_div = html.find("div", class_="lyrics")
    if lyrics_div is None:
        # find() returns None if the page has no <div class="lyrics">
        raise ValueError(f"No lyrics container found for song {song_id}")
    return lyrics_div.get_text()
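
To see what the lyrics-extraction step does without hitting the network, it can be run on a small hand-written HTML snippet. This is a sketch under assumptions: `extract_lyrics` is a hypothetical helper, and the sample HTML is made up and only mimics the `<div class="lyrics">` container this notebook looks for; real Genius pages may be structured differently.

```python
from bs4 import BeautifulSoup


def extract_lyrics(html_text):
    """Return the text of the first <div class="lyrics">, or None if it is missing."""
    soup = BeautifulSoup(html_text, "html.parser")
    lyrics_div = soup.find("div", class_="lyrics")
    if lyrics_div is None:
        return None
    # Put each text fragment on its own line and trim surrounding whitespace
    return lyrics_div.get_text(separator="\n", strip=True)


sample_html = (
    '<html><body><div class="lyrics">'
    "<p>First line<br/>Second line</p>"
    "</div></body></html>"
)
print(extract_lyrics(sample_html))
print(extract_lyrics("<html><body><p>No lyrics here</p></body></html>"))
```

Returning `None` instead of calling `.get_text()` on a missing element avoids an `AttributeError` when a page does not contain the expected container.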

def get_song_information(song_ids):
    '''Retrieve metadata for each song id.'''
    # Maps a running index to the metadata of one song
    song_list = {}
    print("Scraping song information")
    for i, song_id in enumerate(song_ids):
        print(f"id:{song_id} start. ->")
        path = f"songs/{song_id}"
        data = get_json(path=path)["response"]["song"]

        song_list[i] = {
            "title": data["title"],
            "album": data["album"]["name"] if data["album"] else "<single>",
            "release_date": data["release_date"] if data["release_date"] else "unidentified",
            # An empty artist list simply yields an empty list, so no extra check is needed
            "featured_artists": [feat["name"] for feat in data["featured_artists"]],
            "producer_artists": [feat["name"] for feat in data["producer_artists"]],
            "writer_artists": [feat["name"] for feat in data["writer_artists"]],
            "genius_track_id": song_id,
            "genius_album_id": data["album"]["id"] if data["album"] else "none",
        }

        print(f"-> id:{song_id} is finished. \n")

    return song_list
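
Since `get_song_information` returns a dict that maps a running index to one record per song, it should convert to a DataFrame quite directly. The following sketch uses placeholder records: the titles and IDs come from the `songlist` output above, while the remaining metadata fields are left out here because their values are not shown in this notebook.

```python
import pandas as pd

# Placeholder records shaped like the values built by get_song_information
song_list = {
    0: {"title": "Absolute Hatred", "genius_track_id": 764037},
    1: {"title": "A Cauldron of Hate", "genius_track_id": 715726},
}

# orient="index" turns each outer key into a row label and each inner dict into a row
songs_df = pd.DataFrame.from_dict(song_list, orient="index")
print(songs_df)
```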