# Mapping the mean Twitter sentiment by state

<i>By Diego Ramallo</i>

In this notebook, I will go over how I built an interactive US state map that displays the mean Twitter sentiment for each state using the Bokeh plotting library. 
Collecting, parsing, and scoring the tweets is material for another notebook, but I will provide a copy of the data file that I use here in my Informatics directory. The data for this analysis is a txt file containing a dictionary in which each key is a US state abbreviation and each value is a list of sentiment scores for tweets from that state (the file can easily be read with Python's json library). 
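
As a rough illustration of the layout described above, the sketch below parses a tiny, made-up sample in the same shape: state codes as keys and lists of per-tweet scores as values. The sample values and the names `sample_text` and `sample_scores` are assumptions for illustration only, not the contents of the real file.

```python
import json

# Hypothetical sample mimicking the file layout: state code -> list of tweet scores
sample_text = '{"WA": [0, 0, 2], "DE": [0, 0], "MP": []}'

sample_scores = json.loads(sample_text)

# Each value is a plain list, and lists may differ in length (or be empty)
tweets_per_state = {state: len(scores) for state, scores in sample_scores.items()}
print(tweets_per_state)
```

Because the lists have different lengths, loading them into a dataframe pads the shorter ones with NaN, which is why NaN values show up in the table further down.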

## Loading state sentiment scores

To begin with, we'll load our data file and then filter the tweet dictionary and the state coordinates dictionary a little before feeding them to Bokeh.

In [1]:
import json
with open('game5_q1Sent.txt') as f:#Close the file once it has been read
    stateScores= json.load(f)

In [3]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
%matplotlib nbagg

In [4]:
#Load raw data into dataframe
df= pd.DataFrame.from_dict(stateScores, orient= 'index')

In [5]:
df2= df.transpose()
df2.head(3)

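| | WA | DE | DC | WI | WV | HI | FL | WY | NH | NJ | ... | MN | MI | KS | MT | MP | MS | SC | KY | OR | SD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 3 | 5 | 0 | 0.0 | 0 | 0 | ... | 1 | 2 | -2 | 2.0 | NaN | 2 | 0 | 0 | 3 | 0.0 |
| 1 | 0 | 0 | 3 | 0 | 0 | -2 | 0 | -2.0 | 0 | 0 | ... | 0 | 0 | 0 | 0.0 | NaN | 2 | 0 | 0 | 0 | NaN |
| 2 | 2 | 0 | 0 | 1 | 1 | 1 | 3 | NaN | 0 | 0 | ... | 0 | 0 | 0 | NaN | NaN | 6 | -2 | 0 | 0 | NaN |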


In order to visualize the data on a map easily, we'll drop the non-contiguous states (Hawaii and Alaska) and the territories. First we'll drop them from Bokeh's map coordinates, and then from the dataframe df2.

In [6]:
#First delete/exclude from the bokeh coordinate dictionary and sort them alphabetically
from bokeh.plotting import figure, show, output_file
from bokeh.sampledata.us_states import data as states

del states["HI"]
del states["AK"]

EXCLUDED = ("ak", "hi", "pr", "gu", "vi", "mp", "as")#Exclude territories

import collections#This will allow us to order our states to match coordinates of coord library with data

ordStates= collections.OrderedDict(sorted(states.items()))

In [7]:
#Now exclude Hawaii, Alaska, and territories from our dataframe
df3= df2.drop(['HI','AK','PR','GU','VI','MP','AS','NA'], axis= 1)#For some reason we also have a 'NA' column, drop that too
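
The two cells above remove entries by name, which raises a KeyError if one of the codes happens to be missing. A more forgiving variant, sketched here on plain dictionaries, filters both structures against one shared exclusion set. The helper name `drop_excluded`, the constant `EXCLUDED_CODES`, and the toy inputs are hypothetical.

```python
# Codes to leave out: non-contiguous states, territories, and the stray 'NA' key
EXCLUDED_CODES = {"HI", "AK", "PR", "GU", "VI", "MP", "AS", "NA"}

def drop_excluded(mapping, excluded=EXCLUDED_CODES):
    """Return a copy of mapping without the excluded state codes (case-insensitive)."""
    return {code: value for code, value in mapping.items()
            if code.upper() not in excluded}

# Toy stand-ins for the score dictionary and the coordinate dictionary
toy_scores = {"WA": [0, 2], "HI": [5], "NA": [1], "OR": [3]}
toy_coords = {"WA": {"lons": [], "lats": []}, "AK": {"lons": [], "lats": []}}

kept_scores = drop_excluded(toy_scores)
kept_coords = drop_excluded(toy_coords)
print(sorted(kept_scores), sorted(kept_coords))
```

Using a single exclusion set for both structures keeps the score data and the coordinates from drifting apart.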

### Normalizing sentiment scores

Now we'll count the number of tweets in each state in df3 and calculate the mean tweet score for each column (one column per state). The mean ignores NaNs, but the result is NaN if the list for a state was empty. 

In [8]:
dfCount= df3.count()
dfMean= df3.mean()
#zip(dfCount, dfMean)

So now we have three pandas objects: df3 = filtered data (a dataframe), dfCount = tweet count for each state in df3 (a Series), and dfMean = mean tweet score for each state in df3 (a Series).  We'll use these to build our map and <strong>scale and normalize our tweet sentiment score data</strong>.
Since some of our values are negative, we'll account for that as well.
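
To make the NaN handling concrete, here is a pure-Python sketch of what the count and mean steps compute for a single column: NaN padding is skipped, and a column with no scores yields a NaN mean. The function name `count_and_mean` is hypothetical; pandas does this internally and far more efficiently.

```python
import math

def count_and_mean(column):
    """Count non-NaN entries and average them; the mean is NaN when nothing is left."""
    valid = [v for v in column if not math.isnan(v)]
    if not valid:
        return 0, float("nan")
    return len(valid), sum(valid) / len(valid)

nan = float("nan")
print(count_and_mean([0.0, -2.0, nan]))   # a state with two scored tweets
print(count_and_mean([nan, nan, nan]))    # a state with no tweets at all
```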

In [9]:
#First I'll convert this to a dictionary to play with just the values and order the dictionary to match ordStates
dfMeanDict= dfMean.to_dict()
ordMeanDict= collections.OrderedDict(sorted(dfMeanDict.items()))

#I'll also turn the keys and values into lists so they can be indexed (dict views can't be in Python 3)
ordScoreKey= list(ordMeanDict.keys())
ordMeanScore= list(ordMeanDict.values())

#Now we'll calculate the min scores to help us normalize the data
minScore= min(ordMeanScore)

In [10]:
#The min and max values in our lexicon are -5 and +5
#One option would be to make all values positive by adding 5, and normalize by dividing by 10
#Here we keep the raw mean scores instead; the colormap's Normalize object rescales them later
normScores= ordMeanScore
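
For reference, the shift-and-rescale mentioned in the comments would look like the sketch below: add 5 to move the lexicon range [-5, 5] to [0, 10], then divide by 10. The function name `rescale_score` and the two constants are hypothetical, and this only illustrates that arithmetic.

```python
import math

LEXICON_MIN, LEXICON_MAX = -5, 5  # score bounds stated in the notebook

def rescale_score(score):
    """Map a score from [-5, 5] onto [0, 1]; NaN (no tweets) passes through unchanged."""
    if math.isnan(score):
        return score
    return (score - LEXICON_MIN) / (LEXICON_MAX - LEXICON_MIN)

rescaled = [rescale_score(s) for s in (-5, 0, 2.5, 5)]
print(rescaled)
```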

In [11]:
#Get weights: convert the count Series to a dict, sort it by key, then take the values like we did with scores
weights= dfCount.to_dict()#Convert to dict
orderedCounts= list(collections.OrderedDict(sorted(weights.items())).values())#Sort dict by key, take values as a list
weights= np.array(orderedCounts, dtype= np.float64)/max(orderedCounts)#Normalize weights by max count

In [12]:
#Let's check if our dictionaries match
#zip(ordScoreKey,ordStates.keys())

Great. Now we have a sorted dictionary for our state coordinates (ordStates), a list of mean tweet scores in the same order as that dictionary (normScores), and a list of corresponding weights for said scores (weights). We can now visualize the data! 
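
Because the colors are matched to the state outlines purely by position, it may be worth turning the commented-out check into an explicit test. The sketch below uses toy dictionaries; `assert_same_order` is a hypothetical helper name.

```python
import collections

def assert_same_order(scores, coords):
    """Raise ValueError if the two mappings do not list the same keys in the same order."""
    score_keys, coord_keys = list(scores), list(coords)
    if score_keys != coord_keys:
        differing = sorted(set(score_keys) ^ set(coord_keys))
        raise ValueError("Key mismatch: {}".format(differing or "order differs"))
    return True

# Toy stand-ins, sorted by key the same way the notebook sorts its dictionaries
toy_means = collections.OrderedDict(sorted({"WA": 0.4, "OR": 1.0}.items()))
toy_shapes = collections.OrderedDict(sorted({"OR": {}, "WA": {}}.items()))
print(assert_same_order(toy_means, toy_shapes))
```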

### Building Map Components

In [13]:
#Here, we'll split the coordinates of all the states into x,y lists and initialize our figure object
state_xs = [ordStates[code]["lons"] for code in ordStates]
state_ys = [ordStates[code]["lats"] for code in ordStates]

Now we'll generate a list of colors that we'll use to represent tweet score values. We'll use the cmap object to return an RGBA value for each score, after a Normalize object rescales the score to the 0 to 1 range that the colormap expects. 
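
The lookup can be pictured with a hand-rolled approximation: clamp the score to the chosen range, rescale it to 0 to 1, and interpolate from blue through white to red. This is a simplified stand-in for matplotlib's normalizer and 'bwr' colormap, not its actual implementation, so hex values may differ slightly from matplotlib's. The name `score_to_hex` is hypothetical, and its default range of -1.5 to 1.5 is the one chosen in the next cell.

```python
import math

def score_to_hex(score, vmin=-1.5, vmax=1.5):
    """Approximate a blue-white-red colormap lookup; NaN (no tweets) maps to black."""
    if math.isnan(score):
        return '#000000'
    # Rescale to 0..1 and clamp values that fall outside [vmin, vmax]
    t = (score - vmin) / (vmax - vmin)
    t = min(1.0, max(0.0, t))
    if t <= 0.5:
        r = g = 2 * t          # blue fades to white as t approaches 0.5
        b = 1.0
    else:
        r = 1.0
        g = b = 2 * (1 - t)    # white fades to red as t approaches 1
    return '#{:02x}{:02x}{:02x}'.format(*(int(round(c * 255)) for c in (r, g, b)))

print([score_to_hex(s) for s in (-1.5, 0.0, 1.5, float("nan"))])
```

Narrowing the range to -1.5 to 1.5 means any mean score beyond those limits is drawn in the same fully saturated blue or red.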

In [14]:
#Below we'll set the range for the colormap that we use
#Although min and max values for sentiments are -5, and +5, 
#it'll be much easier to visualize changes if we narrow the working range, say to -1.5:1.5
import matplotlib
norm = matplotlib.colors.Normalize(vmin=-1.5, vmax=1.5)

cmap= plt.get_cmap('bwr')#Initialize cmap object (matplotlib.cm.get_cmap was removed in newer matplotlib releases)

In [15]:
#Now we'll use this cmap object to take our normalized scores and assign a color to them in hex format
colorScore= []#Initialize list
for i in range(0,len(normScores)):
    if np.isnan(normScores[i]):#If the normScores value is nan (no tweets), return black (not in colormap)
        colorScore.append('#000000')
        
    else:
        colorScore.append(matplotlib.colors.rgb2hex(cmap(norm(normScores[i]))[:3]))

Bokeh makes it somewhat awkward to add colorbars to figures. Thus, we'll generate a colorbar using matplotlib
and scale it using the same min/max we used above. I'll crop the colorbar out of the figure, save it as a png
in Dropbox, and add the link so it can be referenced for every state.

In [16]:
#Just making a matrix to be able to use the colorbar. Made sure to use the same cmap and vmin/vmax range as above!
data= np.array([[3,2,3,4],[0,3,4,1]])
plt.imshow(data, vmin = -1.5, vmax = 1.5, cmap = 'bwr', interpolation = 'nearest')
plt.colorbar(label = 'Sentiment Score')

#Finally here we'll get the address for an image of our colorbar that will pop up with all of the hovertool data for every state
scalebar= 'https://db.tt/5LGfM4dz'
stateUrls= [scalebar]*len(ordStates)#One copy of the URL per remaining state outline (49 here)


### Choropleth of Sentiment Scores

Now we'll use Bokeh to make a choropleth plot that displays the name of each state along with its corresponding tweet count and sentiment score. Unlike in previous Bokeh plots we've made, our HoverTool tooltips will contain the scalebar image, so the tooltip has to be written as a custom HTML template that refers to each column of the ColumnDataSource with an @name field.  

In [19]:
from bokeh.plotting import figure, output_file, show, ColumnDataSource
from bokeh.models import HoverTool

#We'll now make the source for the info on our hover
source= ColumnDataSource(data= dict(stateKey=ordScoreKey, counts=orderedCounts, score=ordMeanScore, scale= stateUrls))


hover= HoverTool(tooltips= '''
                 <div><label style='font-size: 17px; font-family: Arial; font-weight: bold;'>State: <span style="font-size: 17px; font-family: Arial; font-weight: normal;">@stateKey</span></label></div>
                 <div><label style='font-size: 17px; font-family: Arial; font-weight: bold;'>Count: <span style="font-size: 17px; font-family: Arial; font-weight: normal;">@counts</span></label></div>
                 <div><label style='font-size: 17px; font-family: Arial; font-weight: bold;'>Mean Score: </label></div>
                 <div><span style="font-size: 17px; font-family: Arial; font-weight: normal;">@score</span></div>
                 <div><img src='@scale'></div>
          
'''
)

p = figure(title="Twitter Sentiment Map: GSW-OKC Game5 Q1", tools=[hover, 'wheel_zoom', 'pan', 'reset'], toolbar_location="left",
           plot_width=1100, plot_height=700)

#Plot map and use color list to assign colors to each state
p.patches(state_xs, state_ys, line_color= 'black', color= colorScore, source= source)#, fill_alpha= alphaWeights)
#This is the tricky part: the order of the state outlines must match the order of our scores/weights

output_file("choropleth.html", title="choropleth.py example")
show(p)

The code above will generate a plot in a separate tab and write its HTML file to your directory. The file will now have a hover tool that shows the state name, tweet count, and sentiment score for each state as well as the colorbar to facilitate interpretation of the colors and values. 
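
Since the tooltip rows repeat the same inline styles, the template string could also be assembled from a small helper. This is only a sketch of one way to do it: `tooltip_row`, `build_tooltips`, and the two style constants are hypothetical names, and the resulting string would be passed to HoverTool's tooltips argument in the same way as the hand-written template.

```python
LABEL_STYLE = "font-size: 17px; font-family: Arial; font-weight: bold;"
VALUE_STYLE = "font-size: 17px; font-family: Arial; font-weight: normal;"

def tooltip_row(label, field):
    """Build one 'Label: @field' row; Bokeh fills in @field from the data source."""
    return ('<div><label style="{ls}">{label}: '
            '<span style="{vs}">@{field}</span></label></div>'
            ).format(ls=LABEL_STYLE, vs=VALUE_STYLE, label=label, field=field)

def build_tooltips(rows, image_field):
    """Join the text rows and append an image whose URL comes from image_field."""
    html = [tooltip_row(label, field) for label, field in rows]
    html.append('<div><img src="@{}"></div>'.format(image_field))
    return "\n".join(html)

tooltips_html = build_tooltips(
    [("State", "stateKey"), ("Count", "counts"), ("Mean Score", "score")],
    image_field="scale",
)
print(tooltips_html)
```

Generating the rows this way also makes unbalanced tags or stray quotes less likely than when each row is typed by hand.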

### Summary

In this exercise, we took a key:value pair dictionary of states:[sentiment_score_list] and generated a choropleth that color-codes the mean sentiment score for the contiguous US states.
Although functional, there is still a lot of room for improvement. It can be somewhat cumbersome to generate colorbars in Bokeh. My solution was to use an image of a colorbar made in matplotlib and attach it to every hover element. Also, although the state coordinates dictionary in Bokeh is detailed, the state outlines look a little different here than they do on familiar maps, which I believe comes from some mistakes in the coordinates. 
Both of these obstacles will likely be addressed as new versions of Bokeh are released. As dashboard components become easier to build, it will also become easier to incorporate multiple datasets into the same data object.