Exploring Sentiment in Beyonce's 'LEMONADE'
------------------------------------------
### Author: Travis Beach, trb96[at]cornell.edu

#### Description: 
This notebook reads a list of URLs from a text file, scrapes the lyrics for each track from <code>azlyrics.com</code>, strips the HTML, and scores each song with [NLTK's VADER](http://www.nltk.org/_modules/nltk/sentiment/vader.html) sentiment analysis tools. The resulting scores are visualized using D3.js.

#### Background:
'LEMONADE' takes listeners on a journey through despair and revenge to healing and acceptance. The album's title is a nod to the proverb, ["When life gives you lemons, make lemonade"](https://en.wikipedia.org/wiki/When_life_gives_you_lemons,_make_lemonade).

#### Hypothesis:
Because the album's themes fluctuate, the sentiment will likely vary greatly across tracks, starting more negative and ending more positive.

In [1]:
import matplotlib.pyplot as plt
import urllib2
from bs4 import BeautifulSoup
from bs4 import Comment
import re
import nltk
import json
from IPython.core.display import display, HTML
from IPython.display import Javascript

#### Scrape and Preprocess Data

In [2]:
#read files into list
with open('urls.txt', 'r') as f:
    urls = f.readlines()
    
#trim each one to remove new line
urls = [url.strip() for url in urls]

In [3]:
def getHTML(url):
    return urllib2.urlopen(url).read()

def removeHTML(raw_html):
    reg = re.compile(r'<.*?>')
    return reg.sub('', raw_html)

#Remove mentions of other artists in the form of, '[CHORUS: Beyonce]'
def removeMention(raw_html):
    reg = re.compile(r'\[.*?\]')
    return reg.sub('', raw_html)
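
To make the effect of the two regex helpers concrete, here is a minimal, self-contained sketch that applies the same two patterns to one string. The sample input is invented for illustration and is not taken from azlyrics.com, and the names `TAG_RE`, `MENTION_RE`, and `strip_markup` are my own rather than the notebook's.

```python
import re

TAG_RE = re.compile(r'<.*?>')        # any HTML tag, matched non-greedily
MENTION_RE = re.compile(r'\[.*?\]')  # bracketed markers such as [Chorus: Beyonce]

def strip_markup(text):
    """Remove HTML tags first, then bracketed section/artist markers."""
    return MENTION_RE.sub('', TAG_RE.sub('', text))

# Hypothetical input, made up for this example.
sample = '<i>[Chorus: Beyonce]</i><br>Hello <b>world</b>'
cleaned = strip_markup(sample)
print(cleaned)  # prints: Hello world
```

Because `.` does not match newlines by default, a tag or marker that spans several lines would be left in place by these patterns.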

In [4]:
#get HTML from each page. 
html_pages = {i: getHTML(url) for i, url in enumerate(urls)}
    

In [5]:
def get_title(raw_html):
    soup=BeautifulSoup(raw_html, 'html.parser')
    title = soup.find('div', class_="ringtone").next_sibling.next_sibling.string
    return title.replace("\'", '')

In [6]:
def process_html(raw_html):
    soup = BeautifulSoup(raw_html, 'html.parser')
    #kind of limited in how to get the actual lyrics out of the page. 
    #going to find the comment disclaimer and then hop through the lyrics. 
    
    #Select all of the comments
    comments=soup.find_all(string=lambda text:isinstance(text,Comment))

    #filter it specifically for "licensing agreement"
    comment = [comment for comment in comments if "licensing agreement" in comment.string][0]
    
    #collect up to three nodes after the comment (first line + the rest in a chunk),
    #skipping a missing node so no literal 'None' ends up in the lyrics. 
    lyrics = []
    node = comment.next_sibling
    for _ in range(3):
        if node is None:
            break
        lyrics.append(unicode(node))
        node = node.next_sibling
    raw_html = u''.join(lyrics)
    
    #strip the html tags. 
    processed_input = removeHTML(raw_html)
    processed_input = removeMention(processed_input)
    
    #replace newlines with spaces, drop apostrophes, and lowercase. returns a single string. 
    return processed_input.replace('\n', ' ').replace("\'", '').strip().lower()
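
The comment-anchored extraction can also be sketched without BeautifulSoup. The example below is a swapped-in variant, not the notebook's method: it uses Python 3's standard-library `html.parser` and collects all text between the marker comment and the end of the comment's parent tag, rather than exactly three siblings. The page fragment, the class name `TextAfterComment`, and the `VOID_TAGS` set are assumptions made for illustration; the real page markup may differ.

```python
from html.parser import HTMLParser

VOID_TAGS = {'br', 'hr', 'img'}  # tags that never get a closing tag

class TextAfterComment(HTMLParser):
    """Collect text that follows a marker comment until its parent tag closes."""

    def __init__(self, marker):
        super().__init__()
        self.marker = marker
        self.collecting = False
        self.depth = 0       # nesting depth of tags opened after the marker
        self.chunks = []

    def handle_comment(self, data):
        if self.marker in data:
            self.collecting = True

    def handle_starttag(self, tag, attrs):
        if self.collecting and tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if not self.collecting or tag in VOID_TAGS:
            return
        if self.depth == 0:
            self.collecting = False   # the comment's parent just closed
        else:
            self.depth -= 1

    def handle_data(self, data):
        if self.collecting:
            self.chunks.append(data)

def text_after_comment(html, marker='licensing agreement'):
    parser = TextAfterComment(marker)
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace, including newlines, into single spaces.
    return ' '.join(''.join(parser.chunks).split())

# Invented page fragment, for illustration only.
SAMPLE = """<div>
<!-- Usage of this sample is subject to a licensing agreement. -->
First line<br>
<i>[Chorus]</i><br>
Second line
</div>
<div>footer text</div>"""

print(text_after_comment(SAMPLE))  # prints: First line [Chorus] Second line
```

Bracketed markers such as `[Chorus]` survive this step; they would still need the mention-removing regex applied afterwards.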


In [7]:
#processed_lyrics[song_id] = 'processed lyrics as a single lowercase string'
processed_lyrics = {i: process_html(html_pages[i]) for i in html_pages}

In [8]:
#titles[i] = "title"
titles = {i: get_title(html_pages[i]) for i in html_pages}

#### Analyze Sentiment of each song

In [9]:
#VADER is specifically for social media corpora, but should work okay for similarly informal lyrics
from nltk.sentiment.vader import SentimentIntensityAnalyzer



In [10]:
#instantiate sentiment analyzer
vader = SentimentIntensityAnalyzer()

In [11]:
def calculate_sentiment(analyzer, processed_input):
    return analyzer.polarity_scores(processed_input)
sentiment = {i:calculate_sentiment(vader, processed_lyrics[i]) for i in processed_lyrics}
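
`polarity_scores` returns a dictionary with the keys `neg`, `neu`, `pos`, and `compound`; the visualization later in this notebook plots `compound`. As far as I can tell from the NLTK source linked in the description, `compound` is the summed word valence passed through a normalization of the form x / sqrt(x² + alpha) with alpha = 15. The sketch below reproduces only that normalization step, under that assumption, and none of VADER's lexicon lookup or heuristics.

```python
import math

def normalize(score, alpha=15):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return score / math.sqrt(score * score + alpha)

# Arbitrary raw sums, chosen only to show the shape of the curve.
for raw in (-8.0, -1.0, 0.0, 1.0, 8.0):
    print(raw, round(normalize(raw), 4))
```

One consequence worth keeping in mind: because each song is scored as one long string, the raw sum can become large, and the normalized value then sits close to -1 or 1.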

In [12]:
#combine titles and sentiments for export
big_structure = [{'id':x, 'title': titles[x], 'sentiment': sentiment[x]} for x in processed_lyrics]
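
The D3 line generator used later draws points in array order, so the records should be in track order. The following sketch makes that ordering explicit instead of relying on dictionary iteration order. The helper name `build_records` is hypothetical, and the titles and scores are placeholders invented for illustration.

```python
def build_records(titles, sentiment):
    """Return one record per track, ordered by track id."""
    return [
        {'id': track_id, 'title': titles[track_id], 'sentiment': sentiment[track_id]}
        for track_id in sorted(titles)
    ]

# Placeholder titles and scores, invented for illustration only.
titles = {1: 'Second Track', 0: 'First Track'}
sentiment = {
    0: {'neg': 0.2, 'neu': 0.7, 'pos': 0.1, 'compound': -0.5},
    1: {'neg': 0.1, 'neu': 0.6, 'pos': 0.3, 'compound': 0.8},
}
records = build_records(titles, sentiment)
print([r['id'] for r in records])  # prints: [0, 1]
```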

#### Pass data to javascript and visualize with D3

In [13]:
%%javascript
require.config({
  paths: {
      d3: '//cdnjs.cloudflare.com/ajax/libs/d3/3.4.8/d3.min'
  }
});


In [14]:
json_struct = json.dumps(big_structure)
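#runs arbitrary javascript, client-side.
#only a cell's last expression is rendered automatically, so display() each object explicitly.
display(Javascript("window.data={};".format('"test format"')))
display(Javascript("window.vizObj={};".format(json_struct)))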

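
The hand-off to the browser works because `json.dumps` produces text that is also a valid JavaScript literal, so it can be pasted directly into an assignment statement. A minimal sketch of just the string-building step follows; `js_assignment` is a hypothetical helper and the record is a placeholder.

```python
import json

def js_assignment(name, payload):
    """Build a JavaScript statement that stores `payload` on `window`."""
    return 'window.{}={};'.format(name, json.dumps(payload))

# Placeholder record, invented for illustration only.
statement = js_assignment('vizObj', [{'id': 0}])
print(statement)  # prints: window.vizObj=[{"id": 0}];
```

In the notebook, a string like this is wrapped in `IPython.display.Javascript` so that the browser executes it.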

In [15]:
#setup 'stylesheet'
display(HTML("""
<style>


.axis {
 font: 10px sans-serif;
}
.axis path,
.axis line {
 fill: none;
 stroke: #000;
}
.axis text{
    fill: black
    
}

path { 
    stroke: #d3d3d3;
    stroke-width: 2;
    fill: none;
}
.circle{
    fill: black
}
.circle:hover{
    opacity: 0.5
}
.bold{
font-weight: 700
}

</style>

"""))

In [16]:
%%javascript

//add d3 
require(['d3'], function(d3){
    
    // Set the dimensions of the canvas / graph
    var margin = {top: 50, right: 20, bottom: 50, left: 50},
    width = 800 - margin.left - margin.right,
    height = 600 - margin.top - margin.bottom;
    
    //add chart div and area for graph labels
    element.append("<div id='chart1'><p id='title'>Hover over a circle to show song title</p><p id='score'>  </p></div>");
   
    //append svg
    var svg = d3.select("#chart1")
                .append("svg")
                .attr("width", 800)
                .attr("height", 500)
                .attr("id", "chartSvg");
    
    //create scale functions
    var xScale = d3.scale.linear()
                    .domain(d3.extent(window.vizObj, (d, i)=> i))
                    .range([margin.left, width-margin.right]); 
    var yScale = d3.scale.linear()
                    .domain(d3.extent(window.vizObj, d=>d.sentiment.compound))
                    .range([height-margin.top, margin.bottom]); 
    
    // x and y axes 
    var xAxis = d3.svg.axis()
                    .scale(xScale)
                    .orient("bottom");
    var yAxis = d3.svg.axis()
                    .scale(yScale)
                    .orient("left");
    
    //function for converting data to svg path
    var lineFunction = d3.svg.line()
                    .x(d=>xScale(d.id))
                    .y(d=>yScale(d.sentiment.compound));
    
    // add the path.
    svg.append("path")
        .attr("class", "line")
        .attr("d", lineFunction(window.vizObj));

    // Add the X Axis
    svg.append("g")
        .attr("class", "x axis")
        .attr("transform", "translate(0," + height/2 + ")")
        .call(xAxis)

    // Add the Y Axis
    svg.append("g")
        .attr("class", "y axis")
        .attr("transform", "translate("+margin.left+", 0)")
        .call(yAxis);
    
    svg.append('text')
        .attr("x", width)
        .attr("y", height/2-2)
        .style("font-size", 10)
        .text("Track Number")
        .attr("text-anchor", "end")
    svg.append('text')
        .attr("x", margin.left)
        .attr("y", margin.top -10)
        .style("font-size", 10)
        .attr("text-anchor", "end")
        .text("Sentiment")

    // add a circle for each song. 
    var circles = svg.selectAll(".circle")
                    .data(window.vizObj)
                    .enter()
                    .append("circle")
                    .attr("cx", d=>xScale(d.id))
                    .attr("cy", d=>yScale(d.sentiment.compound))
                    .attr("r", 5)
                    .attr("class", "circle")
                    //edit title and score <p>s on hover 
                    .on("mouseover", function(d){
                        var selectedObject = d;
                        d3.select("#title").html("<span class='bold'>Title: </span>"+selectedObject.title)
                        d3.select("#score").html("<span class='bold'>Score: </span>"+selectedObject.sentiment.compound)                  
                    })
})

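
For readers less familiar with D3, the scale functions in the cell above are linear maps from data values to pixel positions. The Python sketch below mirrors what `d3.scale.linear()` does with a two-element domain and range. It assumes a domain of [-1, 1] for the y scale, whereas the notebook uses the observed minimum and maximum of the compound scores; `linear_scale` is my own helper name.

```python
def linear_scale(domain, output_range):
    """Return a function that maps `domain` linearly onto `output_range`."""
    d0, d1 = domain
    r0, r1 = output_range

    def scale(value):
        t = (value - d0) / float(d1 - d0)  # position within the domain, 0..1
        return r0 + t * (r1 - r0)

    return scale

# Values taken from the cell above: height = 600 - 50 - 50 = 500.
margin_top, margin_bottom, height = 50, 50, 500

# Assumed domain of [-1, 1]; the notebook uses the observed extent instead.
y_scale = linear_scale((-1.0, 1.0), (height - margin_top, margin_bottom))
for score in (-1.0, 0.0, 1.0):
    print(score, y_scale(score))
```

The range is deliberately reversed (450 down to 50) because SVG y coordinates grow downward, so higher scores land nearer the top. With this assumed symmetric domain a score of 0 maps to 250, which matches the `height/2` offset used for the x-axis; with the observed extent as the domain, the axis would not necessarily sit at a score of 0.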

#### Results
The plot above shows how sentiment changes across the album's track list. "Daddy Lessons", in which Beyonce recounts her own father's infidelity, and "Sandcastles", which describes a past love gone wrong, both receive very low scores. As hypothesized, the album ends on a more positive note: Beyonce makes lemonade from the lemons of the earlier tracks, and tracks 8-11 are very positive.
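
One simple way to put a number on the hypothesis is to compare the mean compound score of the first half of the album with that of the second half. The sketch below assumes records shaped like the exported structure (an `id` plus a `sentiment` dictionary with a `compound` key) and at least two tracks. The scores are placeholders invented for illustration, not the album's actual results, and `half_means` is a hypothetical helper.

```python
def half_means(records):
    """Return the mean compound score of the first and second half of the album."""
    ordered = sorted(records, key=lambda r: r['id'])
    scores = [r['sentiment']['compound'] for r in ordered]
    middle = len(scores) // 2
    first, second = scores[:middle], scores[middle:]
    return sum(first) / float(len(first)), sum(second) / float(len(second))

# Placeholder scores, invented for illustration; not the album's actual results.
example = [
    {'id': 0, 'sentiment': {'compound': -0.5}},
    {'id': 1, 'sentiment': {'compound': -0.25}},
    {'id': 2, 'sentiment': {'compound': 0.25}},
    {'id': 3, 'sentiment': {'compound': 0.75}},
]
first_mean, second_mean = half_means(example)
print(first_mean, second_mean)  # prints: -0.375 0.5
```

A second-half mean that is clearly higher than the first-half mean would be consistent with the hypothesis, though with only a dozen or so tracks it is a rough indication rather than a statistical test.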