# Data Viz with Plotly

**Goals**

- In this last lecture, we are going to have some fun making interactive plots with Plotly.
- Learn how to use Plotly's Python API to make plots, then customize those plots on the Plotly website.
- Cover a variety of plots, from the usual (line, scatter) to 3D and geographic plots.

### Setup

1. Go to https://plot.ly/ and sign up for an account.
2. Open your email and verify your account.
3. In the settings section, go to the API key page, click re-generate, and copy/paste the API key into this Jupyter notebook.
4. Install the plotly library with `pip install plotly`.

<br>

The Plotly guide to plots in Python: https://plot.ly/python/

In [3]:
#Fill this out with your own user name and API key
user_name = "geomcin"
api_key = "YOUR_API_KEY"  #paste your own key here; avoid sharing a real key in a notebook

In [16]:
#Imports

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [5]:
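#Import plotly and sign in with user name and API key
#Note: in plotly 4 and later, plotly.plotly moved to the separate chart_studio package (chart_studio.plotly)

import plotly.plotly as py
import plotly.graph_objs as go
py.sign_in(user_name, api_key)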
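
Hard-coding an API key in a notebook makes it easy to leak when the notebook is shared. As a minimal sketch, the credentials could instead be read from environment variables. The variable names `PLOTLY_USERNAME` and `PLOTLY_API_KEY` and the helper `load_credentials` are assumptions made for this example, not something this lecture defines.

```python
import os

def load_credentials(env=None):
    """Return (user_name, api_key) read from environment variables.

    The variable names are hypothetical choices for this sketch.
    """
    env = os.environ if env is None else env
    user_name = env.get("PLOTLY_USERNAME")
    api_key = env.get("PLOTLY_API_KEY")
    if not user_name or not api_key:
        raise RuntimeError("Set PLOTLY_USERNAME and PLOTLY_API_KEY before signing in")
    return user_name, api_key
```

With both variables set, `user_name, api_key = load_credentials()` can feed the `py.sign_in(user_name, api_key)` call.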

In [10]:
#Load in data

spotify = pd.read_csv('../data/spotify_data.csv', index_col=[0])
amazon = pd.read_csv('../data/amazon_cities_data.csv')
housing = pd.read_csv('../data/kc_house_data.csv')
pokemon = pd.read_csv('../data/Pokemon.csv')
soccer_tweets = pd.read_pickle('../data/geotweets.pkl')


We're ready to go!

### Scatter plot

In [17]:
#Look at spotify data

spotify.head()

Unnamed: 0,acousticness,danceability,instrumentalness,valence,energy,target
Mask_Off***Future,0.0102,0.833,0.0219,0.286,0.434,1
Redbone***Childish_Gambino,0.199,0.743,0.00611,0.588,0.359,1
Xanny_Family***Future,0.0344,0.838,0.000234,0.173,0.412,1
Master_Of_None***Beach_House,0.604,0.494,0.51,0.23,0.338,1
Parallel_Lines***Junior_Boys,0.18,0.678,0.512,0.904,0.561,1


In [24]:
#Assign x and y variables from spotify data

x = spotify.acousticness
y = spotify.danceability


#Plot the two attributes against each other in a scatter plot

# Create a trace 
trace = go.Scatter(
    x = x,
    y = y,
    mode = 'markers',
    name = "Basic Scatter plot"
)

#put all the traces into a list
data = [trace]

# Plot and embed in ipython notebook!
layout = go.Layout(showlegend=True)
fig = go.Figure(data=data, layout=layout)

py.iplot(fig, filename='basic-scatter')

High five! You successfully sent some data to your account on plotly. View your plot in your browser at https://plot.ly/~geomcin/0 or inside your plot.ly account where it is named 'basic-scatter'


Tada! Your first Plotly chart. Clicking "Edit chart" opens a new window on the Plotly website where you can edit the plot further. We'll do this later.

Let's try this again, this time adding color and a few other marker settings.

In [34]:
#Use valence variable to assign color
color = spotify.valence


# Create a trace 
trace = go.Scatter(
    x = x,
    y = y,
    mode = 'markers',
    marker = {"size":14,
             "color" : color,
             "colorscale" : "Viridis",
             "showscale": True},
    name = "Acousticness vs Danceability"
    
)

#put all the traces into a list
data = [trace]

# Plot and embed in ipython notebook. Give it a title.
layout = go.Layout(title = "Spotify Scatter Plot")
fig = go.Figure(data=data, layout=layout)

py.iplot(fig, filename='spotify_scatter')

In [29]:
#Use outcome variable to assign color
color = spotify.target

# Create a trace 
trace = go.Scatter(
    x = x,
    y = y,
    mode = 'markers',
    marker = {"size":14,
             "color" : color,
             "colorscale" : "Viridis",
             "showscale": True},
    name = "Acousticness vs Danceability"
    
)

#put all the traces into a list
data = [trace]

# Plot and embed in ipython notebook. Give it a title.
layout = go.Layout(title = "Spotify Scatter Plot")
fig = go.Figure(data=data, layout=layout)

py.iplot(fig, filename='spotify_scatter')

Same plot, but split into two traces: one for disliked songs and one for liked songs.

In [32]:
#Assign x0, y0, x1, y1

x0 = spotify[spotify.target == 0].acousticness
y0 = spotify[spotify.target == 0].danceability

x1 = spotify[spotify.target == 1].acousticness
y1 = spotify[spotify.target == 1].danceability

In [33]:
# Create trace0 from x0 and y0 (target == 0), then trace1 from x1 and y1 (target == 1)
trace0 = go.Scatter(
    x = x0,
    y = y0,
    mode = 'markers',
    marker = {"size":14,
             "color" : "blue"},
    name = "Disliked Songs"    
)

trace1 = go.Scatter(
    x = x1,
    y = y1,
    mode = 'markers',
    marker = {"size":14,
             "color" : "red"},
    name = "Liked Songs"    
)

#put all the traces into a list
data = [trace0, trace1]

# Plot and embed in ipython notebook. Give it a title.
layout = go.Layout(title = "Spotify Scatter Plot")
fig = go.Figure(data=data, layout=layout)

py.iplot(fig, filename='spotify_scatter_2')
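
With more than two groups, writing one trace by hand per group gets repetitive. A minimal sketch of building the traces in a loop follows. It uses plain dicts, which Plotly should accept in place of `go.Scatter` objects since a figure is ultimately a nested dict, and a few made-up rows. The sample values and the `labels` and `colors` mappings are illustrative assumptions, not the real Spotify data.

```python
# Toy rows: (acousticness, danceability, target). Values are made up for illustration.
rows = [
    (0.01, 0.83, 1),
    (0.20, 0.74, 1),
    (0.60, 0.49, 0),
    (0.18, 0.68, 0),
]

labels = {0: "Disliked Songs", 1: "Liked Songs"}
colors = {0: "blue", 1: "red"}

traces = []
for target in sorted(labels):
    # Keep only the rows that belong to this group
    group = [r for r in rows if r[2] == target]
    traces.append({
        "type": "scatter",
        "mode": "markers",
        "x": [r[0] for r in group],
        "y": [r[1] for r in group],
        "marker": {"size": 14, "color": colors[target]},
        "name": labels[target],
    })

fig = {"data": traces, "layout": {"title": "Spotify Scatter Plot"}}
```

The resulting `fig` could then be passed to `py.iplot(fig, filename=...)` in the same way as the figures above.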

Now click "Edit chart" to customize the plot.

Next, let's add hover info so each point shows its city name.

In [45]:
amazon.columns

Index([u'cities', u'sprawl', u'diversity', u'business_score',
       u'fiber_coverage', u'excellent_education', u'percent_bachelors',
       u'life_quality', u'mobile_network_score', u'transit_scores'],
      dtype='object')

In [48]:
#Assign variables from amazon dataset
cities = amazon.cities
diversity = amazon.diversity
transit = amazon.transit_scores


# Create trace 
trace = go.Scatter(
    x = diversity,
    y = transit,
    mode = 'markers',
    text = cities    
)

#put all the traces into a list
data = [trace]

# Plot and embed in ipython notebook. Give it a title.
layout = go.Layout(title = "City Diversity vs Transit Scores",
                  hovermode = "closest")
fig = go.Figure(data=data, layout=layout)

py.iplot(fig, filename='Amazon Cities')

Go ahead and hover your mouse over the dots.
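
The `text` argument takes one string per point, and Plotly renders `<br>` inside it as a line break, so several fields can be packed into each label. A minimal sketch using three rows from the `amazon` table displayed in the bar plot section of this notebook. The helper name `hover_label` is made up for this example.

```python
# (city, diversity, transit_score) rows copied from the amazon table
rows = [
    ("Atlanta", 67.26, 7.7),
    ("Austin", 69.91, 5.47),
    ("Boston", 68.96, 9.44),
]

def hover_label(city, diversity, transit):
    """Build a multi-line hover label; <br> is rendered as a line break."""
    return f"{city}<br>Diversity: {diversity:.2f}<br>Transit score: {transit:.2f}"

hover_text = [hover_label(*row) for row in rows]
print(hover_text[0])
```

A list like `hover_text` can be passed as `text = hover_text` in the trace, in place of `cities`.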

### Line time series plot

In [39]:
#Load in apple stock price data

aapl = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/finance-charts-apple.csv")
aapl.head()

Unnamed: 0,Date,AAPL.Open,AAPL.High,AAPL.Low,AAPL.Close,AAPL.Volume,AAPL.Adjusted,dn,mavg,up,direction
0,2015-02-17,127.489998,128.880005,126.919998,127.830002,63152400,122.905254,106.741052,117.927667,129.114281,Increasing
1,2015-02-18,127.629997,128.779999,127.449997,128.720001,44891700,123.760965,107.842423,118.940333,130.038244,Increasing
2,2015-02-19,128.479996,129.029999,128.330002,128.449997,37362400,123.501363,108.894245,119.889167,130.884089,Decreasing
3,2015-02-20,128.619995,129.5,128.050003,129.5,48948400,124.510914,109.785449,120.7635,131.741551,Increasing
4,2015-02-23,130.020004,133.0,129.660004,133.0,70974100,127.876074,110.372516,121.720167,133.067817,Increasing


In [43]:


# Create a line plot of closing price
aapl_close = go.Scatter(
    x = aapl.Date,
    y = aapl["AAPL.Close"],
    name = "Closing Price"
)
# Create a line plot of opening price
aapl_open = go.Scatter(
    x = aapl.Date,
    y = aapl["AAPL.Open"],
    name = "Opening Price"
)

#put all the traces into a list
data = [aapl_close, aapl_open]

# Plot and embed in ipython notebook!
layout = go.Layout(title = "Apple (aapl) Stock Opening and Closing Price")
fig = go.Figure(data=data, layout=layout)

py.iplot(fig, filename='Apple Stock Prices')

### Bar plot and histograms

Histogram

In [73]:
#Filter out homes more expensive than $2M

housing = housing[housing.price <= 2000000].copy()

In [74]:
#Create histogram of housing prices

prices = housing.price

trace = go.Histogram(x = prices)

py.iplot([trace], filename='Housing Prices Histogram')

Normalized version

In [75]:
#Create normalized histogram of housing prices (bar heights sum to 1)

trace = go.Histogram(x = prices, histnorm= "probability")

py.iplot([trace], filename='Housing Prices Histogram')
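
To make the normalization concrete: with `histnorm = "probability"`, each bar's height is the bin count divided by the total number of samples, so the heights sum to 1. A minimal hand computation follows. The prices and the bin width are made up for illustration, and Plotly picks its own bins automatically.

```python
from collections import Counter

# Made-up prices and bin width, for illustration only
prices = [210000, 250000, 320000, 340000, 390000, 480000, 510000, 750000]
bin_width = 100000

# Count how many prices fall into each bin, keyed by the bin's left edge
counts = Counter((p // bin_width) * bin_width for p in prices)
total = len(prices)

# Probability normalization: bin count divided by the number of samples
probabilities = {left: counts[left] / total for left in sorted(counts)}

for left, prob in probabilities.items():
    print(f"{left:>7} to {left + bin_width:>7}: {prob:.3f}")
```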

Overlaid histograms of housing prices for one-floor and two-floor homes.

In [76]:
x0 = housing[housing.floors == 1].price
x1 = housing[housing.floors == 2].price

In [77]:
hist1 = go.Histogram(
    x=x0,
    opacity=0.65,
    name = "One Floor Homes"
)
hist2 = go.Histogram(
    x=x1,
    opacity=0.65,
    name = "Two Floor Homes"
)

data = [hist1, hist2]
layout = go.Layout(barmode='overlay')
fig = go.Figure(data=data, layout=layout)

py.iplot(fig, filename='overlaid housing prices histogram')

### Bar Plots

In [44]:
amazon.head()

Unnamed: 0,cities,sprawl,diversity,business_score,fiber_coverage,excellent_education,percent_bachelors,life_quality,mobile_network_score,transit_scores
0,Atlanta,41.0,67.26,4.0,11.2,2,35.8,6.6,97.2,7.7
1,Austin,102.44,69.91,3.7,12.1,1,41.7,7.8,97.2,5.47
2,Baltimore,115.62,64.92,1.7,60.2,2,37.3,6.3,96.6,8.52
3,Boston,126.93,68.96,3.0,38.8,7,44.6,7.1,95.3,9.44
4,Charlotte,70.45,69.56,3.3,11.3,0,32.2,7.1,95.8,4.33


In [50]:
#Filter data to include the following cities

city_list = ["Atlanta", "Chicago", "San Francisco", "Boston", "Houston"]

amazon2 = amazon[amazon.cities.isin(city_list)]


In [51]:
amazon2

Unnamed: 0,cities,sprawl,diversity,business_score,fiber_coverage,excellent_education,percent_bachelors,life_quality,mobile_network_score,transit_scores
0,Atlanta,41.0,67.26,4.0,11.2,2,35.8,6.6,97.2,7.7
3,Boston,126.93,68.96,3.0,38.8,7,44.6,7.1,95.3,9.44
5,Chicago,125.9,70.57,1.7,2.2,4,35.5,6.1,97.6,9.14
11,Houston,76.74,71.51,4.0,12.1,1,30.7,7.0,97.3,6.24
31,San Francisco,194.28,68.36,0.0,15.3,2,45.6,7.0,95.7,9.59


Grouped Bar Charts of the five cities and selected features

In [52]:
#Assign variables

cities2 = amazon2.cities
sprawl = amazon2.sprawl
mobile = amazon2.mobile_network_score
bachelors = amazon2.percent_bachelors

In [53]:
#One bar trace per feature
bar1 = go.Bar(
    x = cities2,
    y = sprawl,
    name = "Sprawl Score")

bar2 = go.Bar(
    x = cities2,
    y = mobile,
    name = "Mobile Network Score")

bar3 = go.Bar(
    x = cities2,
    y = bachelors,
    name = "Percent of Bachelors")

data = [bar1, bar2, bar3]
layout = go.Layout(
    barmode='group'
)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='Amazon Grouped Bar Charts')

Horizontal version

In [54]:
#One horizontal bar trace per feature (x and y are swapped)
bar1 = go.Bar(
    x = sprawl,
    y = cities2,
    name = "Sprawl Score",
    orientation = "h")

bar2 = go.Bar(
    x = mobile,
    y = cities2,
    name = "Mobile Network Score",
    orientation = "h")

bar3 = go.Bar(
    x = bachelors,
    y = cities2,
    name = "Percent of Bachelors",
    orientation = "h")

data = [bar1, bar2, bar3]
layout = go.Layout(
    barmode='group', title = "Horizontal Amazon Chart")

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='Horizontal Amazon Grouped Bar Charts')

Click "Edit chart" to customize the chart.

## Let's have some real fun now with 3D and Geographic plots

### 3D

We're going to plot the first three PCA components of the Spotify dataset.

In [78]:
#imports
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

In [90]:
#Assign variables

song_info = spotify.index
X = spotify.drop("target", axis =1)
target = spotify.target

#Transform data

scale = StandardScaler()
Xs = scale.fit_transform(X)

pca = PCA(n_components=3)
Xp = pca.fit_transform(Xs)

In [91]:
#Fraction of total variance explained by the three components
pca.explained_variance_ratio_.sum()

0.82065314403766842
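
For reference, the number above is the sum of the per-component explained variance ratios. A sketch of the standard definitions follows, where $z_j$ is a standardized feature, $\lambda_i$ denotes the variance along the $i$-th principal component, and $p$ is the number of features. This notation is introduced here and does not appear elsewhere in the lecture.

```latex
z_{j} = \frac{x_{j} - \mu_{j}}{\sigma_{j}}, \qquad
r_i = \frac{\lambda_i}{\sum_{k=1}^{p} \lambda_k}, \qquad
r_1 + r_2 + r_3 \approx 0.82
```

So the three plotted components retain about 82% of the variance of the standardized features.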

In [92]:
#assign variables

x = Xp[:, 0]
y = Xp[:, 1]
z = Xp[:, 2]

In [96]:

# Create a trace 
trace1 = go.Scatter3d(
    x = x,
    y = y,
    z = z,
    mode = 'markers',
    marker = {"size":8,
             "color" : target,
             "colorscale": "Jet"},
    opacity = .5,
    text = song_info
)

#put all the traces into a list
data = [trace1]

# Plot and embed in ipython notebook. Give it a title.
layout = go.Layout(title = "3D Spotify PCA Plot", hovermode = "closest")
fig = go.Figure(data=data, layout=layout)

py.iplot(fig, filename='spotify_scatter_3d')

What do you notice? Zoom in and out, hover over the dots.

### Geographic plotting

We're going to make a scatter plot of tweets about the Champions League final using their lat/lon coordinates.

In [105]:
#View data
soccer_tweets.head(2)

Unnamed: 0,tweet,ID,handle,display_name,is_retweeted,time,follower_count,geo,language,location,description,coor,lats,longs
42,Vamos Juve #forzajuve #Lavazza #championsleagu...,871055614209871873,jorgeadarme,Jorge Andres Adarme,False,Sat Jun 03 17:26:56 +0000 2017,135,"[3.51361, -74.0517]",es,Colombia,"Ingeniero Agroforestal, sencillo, sincero y de...",nah,3.51361,-74.0517
536,Unos en Cardiff y otros en Cardys en Carabanch...,871055661924319233,j_asanchez,José Antonio Sánchez,False,Sat Jun 03 17:27:07 +0000 2017,16711,"[40.386111, -3.738889]",es,Morαtα de Tαjuñα (Mαdrid),"Tengo 2 niñas preciosas, una aquí y otra en el...",nah,40.386111,-3.738889


In [98]:
#Assign variables
lat = soccer_tweets.lats
lon = soccer_tweets.longs
tweet = soccer_tweets.tweet

In [104]:
trace = go.Scattergeo(lat = lat, 
                      lon= lon, 
                      text = tweet)

data = [trace]

# Plot and embed in ipython notebook. Give it a title.
layout = go.Layout(title = "Champions League Tweets", hovermode = "closest",
                  geo = {"scope": "world", "showland":True, 
                        "landcolor": "rgb(150, 250, 250)"})
fig = go.Figure(data=data, layout=layout)

py.iplot(fig, filename='geo scatter tweets')

Plotting country clusters. We're going to recreate the plots from this article I wrote:
https://opendatascience.com/blog/redefining-what-it-means-to-be-a-first-world-or-third-world-country/

In [109]:
countries = pd.read_pickle("../data/country_development_data.pkl")
countries.head()

IndicatorCode,Access to electricity (% of population),Renewable electricity output (% of total electricity output),CO2 emissions (metric tons per capita),"Commercial bank branches (per 100,000 adults)",Depth of credit information index (0=low to 8=high),Strength of legal rights index (0=weak to 12=strong),Mobile cellular subscriptions (per 100 people),Internet users (per 100 people),GDP per capita (current US$),Proportion of seats held by women in national parliaments (%),...,Health expenditure per capita (current US$),"Labor force, female (% of total labor force)","Unemployment, total (% of total labor force)",Net migration,"Mortality rate, infant (per 1,000 live births)","Life expectancy at birth, total (years)","Survival to age 65, female (% of cohort)","Population, ages 0-14 (% of total)","Age dependency ratio, young (% of working-age population)",Urban population (% of total)
CountryName,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Afghanistan,43.0,0.0,0.425262,2.465221,0.0,9.0,74.882842,6.39,633.569247,27.710843,...,54.964148,16.051439,9.1,473007,66.3,60.028268,60.48056,44.870996,85.153384,26.282
Albania,100.0,100.0,1.607038,22.24169,6.0,7.0,105.469966,60.1,4564.390339,20.714286,...,239.577092,41.266475,16.1,-91750,12.5,77.537244,90.56197,18.930427,27.432462,56.409
Algeria,100.0,1.08368,3.316038,5.064181,0.0,2.0,93.31075,18.09,5484.066806,31.601732,...,313.520212,17.376156,9.5,-143268,21.9,74.568951,85.02157,28.205909,42.742285,70.129
Angola,37.0,70.906823,1.354008,12.863257,0.0,1.0,63.479208,21.26,5900.52957,36.818182,...,267.224299,46.204554,6.8,102322,96.0,51.866171,49.09112,47.850317,96.00576,43.274
Argentina,99.8,23.769069,4.562049,13.303817,8.0,2.0,158.735762,64.7,12509.531118,36.18677,...,1074.066944,40.43136,8.2,30000,11.1,75.986098,87.23494,25.346427,39.700196,91.604


Let's quickly make some clusters

In [112]:
new_cols = ["electric_access", "renewables_percent", "co2_emissions", "commercial_bank",
           "credit_information_index_depth", "legal_rights", "cell_phone_usage",
           "internet_users","gdp_per_cap", "women_politicians", "communicable_disease_rate", "health_per_cap",
           "female_labor_rate", "unemploy_rate", "net_migration", "mortality_rate", "life_exp",
           "female_survival_65", "child_popul", "age_depend_ratio", "urban_pop"]
countries.columns = new_cols

bc = ['women_politicians',
 'legal_rights',
 'cell_phone_usage',
 'unemploy_rate',
 'net_migration',
 'commercial_bank']
X = countries.copy().drop(bc, axis=1)
scale = StandardScaler()
Xs = scale.fit_transform(X)

In [113]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

km = KMeans(n_clusters=3)
km.fit(Xs)
silhouette_score(Xs, km.labels_)

0.34391835031973944
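
The silhouette score reported above averages a per-sample quantity. As a reminder of the standard definition, with $a(i)$ the mean distance from sample $i$ to the other members of its own cluster and $b(i)$ the mean distance to the members of the nearest other cluster:

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad
S = \frac{1}{n} \sum_{i=1}^{n} s(i), \qquad -1 \le S \le 1
```

Values near 1 indicate compact, well-separated clusters, so a score of about 0.34 suggests clusters that are present but overlap.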

In [None]:
countries["cluster"] = km.labels_

In [115]:
#Load in country codes
cc = pd.read_table("../data/country_codes", sep="|")
cc.head()

Unnamed: 0,COUNTRY,CODE
0,Afghanistan,AFG
1,Africa,AFR
2,Albania,ALB
3,Algeria,DZA
4,American Samoa,ASM


In [117]:
country_dict = dict(zip(cc.COUNTRY, cc.CODE))
cc = cc[cc.COUNTRY.isin(countries.index.tolist())].copy()
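
Because a choropleth matches `locations` and `z` by position, it can help to build both lists from the same loop over country names. The sketch below applies the `country_dict` idea to a toy example. The cluster labels are invented, and skipping names without a code is one possible choice, not necessarily how the plots in this lecture were made.

```python
# Toy stand-ins for country_dict and the cluster labels (labels are made up)
country_dict = {"Afghanistan": "AFG", "Albania": "ALB", "Algeria": "DZA"}
clusters = {"Albania": 1, "Algeria": 2, "Atlantis": 0, "Afghanistan": 0}

locations, z, names = [], [], []
for name, label in clusters.items():
    code = country_dict.get(name)
    if code is None:
        # No country code known for this name, so it cannot be drawn on the map
        continue
    locations.append(code)
    z.append(label)
    names.append(name)

print(list(zip(locations, z)))
```

Built this way, `locations`, `z`, and `names` always have the same length and order, whatever the row order of the two source tables.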

In [120]:

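#Note: locations, z, and text are matched by position, so cc and countries
#must list the same countries in the same order
trace = go.Choropleth(locations = cc.CODE, z= countries.cluster, text = cc.COUNTRY,
                   autocolorscale = True)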

data = [trace]

# Plot and embed in ipython notebook. Give it a title.
layout = go.Layout(title = "Country Clusters", hovermode = "closest",
                  geo = {"scope": "world", "showland":True, "showcoastlines": True})

fig = go.Figure(data=data, layout=layout)

py.iplot(fig, filename='country cluster')

Let's try that again, but include socio-economic information when you hover over a country.

In [129]:
cc.shape

(159, 2)

In [137]:
text = countries.index + "<br> Unemployment Rate: " + countries.unemploy_rate.round(3).astype(str) + \
"<br> Life Expectancy: " + countries.life_exp.round(3).astype(str)

In [138]:

trace = go.Choropleth(locations = cc.CODE, z= countries.cluster, text = text,
                   autocolorscale = True)

data = [trace]

# Plot and embed in ipython notebook. Give it a title.
layout = go.Layout(title = "Country Clusters", hovermode = "closest",
                  geo = {"scope": "world", "showland":True, "showcoastlines": True})

fig = go.Figure(data=data, layout=layout)

py.iplot(fig, filename='country cluster2')

# Resources

Plotly's collection of Python plotting examples: https://plot.ly/python

GitHub version: https://github.com/plotly/python-user-guide

- Plotting of cryptocurrencies: https://github.com/triestpa/Cryptocurrency-Analysis-Python/blob/master/Cryptocurrency-Pricing-Analysis.ipynb

- https://github.com/santosjorge/cufflinks

- https://github.com/empet/Plotly-plots

- https://www.analyticsvidhya.com/blog/2017/01/beginners-guide-to-create-beautiful-interactive-data-visualizations-using-plotly-in-r-and-python/

- https://www.youtube.com/watch?v=5OShFM6bjME

- https://github.com/Mantej-Singh/Playing-with-Earthquakes-dataset/blob/master/Scatter%20Plots%20on%20Maps.ipynb

- https://dev.socrata.com/blog/2016/02/02/plotly-pandas.html


# Lab Time

For the rest of class, I want you to use Plotly to make plots for your final project. Use it to make your EDA graphs and to show the results of your machine learning model (for example, a ROC curve).