CS 690V : Visual Analytics Assignment 1

Student Name : Priyadarshi Rath (priyadarshir@umass.edu)

INTRODUCTION

The dataset records the number of cellphone subscribers in each country over time. The source data covers 1995 through 2011, but the 2011 column contains only zeros and is dropped, so the plots span 1995 to 2010. The visualisations capture the widespread adoption of cell phones over that period.

Three visualisations are presented. All values are plotted on a natural-logarithm scale, since raw subscriber counts range from zero to well over 10<sup>6</sup> and would be hard to read on a single linear axis.

The first visualisation is a bar chart of all countries for a single year; the slider above the chart selects the year.

The second is a line plot of a single country over time. A drop-down menu lets the user select which country to observe.

The third and final visualisation is a simple total subscriber count over time.

Each visualisation includes a HoverTool, so exact values can be read by hovering over the plot.

Dataset credits : [Gapminder](http://www.gapminder.org/data/)

Data can be found [here](https://docs.google.com/spreadsheets/d/14ivgHIV18Mr6hoW1deQ1L7nPXZUiTNyS8H-8sK9tMsg/pub).

In [24]:
from bokeh.models   import ColumnDataSource, HoverTool, Slider, CustomJS, Select
from bokeh.plotting import figure
from bokeh.io       import output_notebook, show
from bokeh.layouts  import column

import pandas as pd
import numpy as np

output_notebook()

In [75]:
# Read Data
df = pd.read_excel('broadband total.xlsx')
df.rename(columns={'Fixed broadband Internet subscribers':'country'}, inplace=True)

# The year 2011 has all zeros, so drop this column
df.drop(['2011'], axis=1, inplace=True)

# log(0) is -inf and log(NaN) is NaN, so map zeros and NaNs to 1 first (log(1) = 0)
df = df.fillna(0)
df.replace(to_replace=0, value=1, inplace=True)

# The country names are strings and cannot be passed to the logarithm, so move them into the index
country_list = df['country'].tolist()      # get country list
df.index = country_list                    # replace index
df.drop(['country'], axis=1, inplace=True) # drop string data from dataframe
df_orig = df.copy()                        # keep the raw counts for the total-subscriber plot
df = np.log(df)                            # finally take the natural logarithm

# assign country ID to make indexing easier
cid = pd.Series(data = range(df.shape[0]), index = country_list)
df['Cid'] = cid

# Take a preliminary look at the way data is organised
df.head()

| Country | 1995 | 1996 | 1997 | 1998 | 1999 | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | Cid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Afghanistan | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.298317 | 5.393628 | 6.214608 | 6.214608 | 6.214608 | 6.907755 | 7.31322 | 0 |
| Albania | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.605802 | 0.0 | 9.21034 | 11.066638 | 11.429544 | 11.566646 | 1 |
| Algeria | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9.798127 | 10.491274 | 11.81303 | 12.043554 | 12.567373 | 13.091904 | 13.614618 | 13.71015 | 2 |
| American Samoa | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3 |
| Andorra | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.045777 | 8.188967 | 8.745444 | 9.243872 | 9.589872 | 9.82693 | 9.936535 | 10.040681 | 10.10651 | 4 |
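
The zeros in the table above come from the zero-to-one replacement made before the logarithm. The following is a minimal sketch of that step on made-up numbers (the `toy` frame and its values are illustrative assumptions, not rows from the Gapminder sheet); it also shows that `np.exp` converts a logged value back to a raw count.

```python
import numpy as np
import pandas as pd

# Toy data with made-up numbers (hypothetical, not taken from the dataset)
toy = pd.DataFrame(
    {"1995": [0.0, np.nan], "1996": [1000.0, 0.0]},
    index=["A", "B"],
)

# Same steps as the preprocessing cell: NaN -> 0, then 0 -> 1, then natural log
cleaned = toy.fillna(0).replace(0, 1)
logged = np.log(cleaned)

# A natural-log value converts back to a raw count with np.exp
recovered = np.exp(logged)
```

One consequence of this choice is that a missing value, a true zero, and a true count of one all appear as 0 on the log scale, so a bar of height 0 should be read as "no data or no subscribers".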


In [63]:
# ColumnDatasource objects

All_Data = ColumnDataSource(df) # data of all countries

# This is the data that goes in the plot ; initialise it with first year data
ID = df['Cid'].values
y = df['1995'].values
Plotting_Data = ColumnDataSource(data={'ID':ID , 'y':y, 'country':df.index.values})

# CustomJS callback: when the slider moves, copy the selected year's column into the plotted source
callback = CustomJS(args=dict(s1=All_Data, s2=Plotting_Data), code="""
    var d1 = s1.data;
    var d2 = s2.data;
    var Y = cb_obj.value;        // year chosen on the slider
    d2.y = d1[Y.toString()];     // columns are keyed by year strings such as "1995"
    s2.change.emit();
""")

# Create figure, add vertical bar, hover tool, and slider
p = figure(width=900, height=500, 
           x_range=(-5,215), y_range=(-0.5,20), 
           y_axis_label="Log Count", x_axis_label="Country(Hover to select)")

p.vbar(x='ID', top='y', bottom=0, width=1.5, source=Plotting_Data)

p.add_tools(HoverTool(tooltips=[("Log Count", "@y"), ("Country", "@country")]))

p.xaxis.major_tick_line_color = None  # turn off x-axis major ticks
p.xaxis.minor_tick_line_color = None  # turn off x-axis minor ticks
p.xaxis.major_label_text_font_size = '0pt'  # turn off x-axis tick labels

slider = Slider(title="Year", width=950, start=1995, end=2010, value=1995, step=1)
slider.js_on_change('value', callback)  # newer Bokeh releases no longer accept the callback= keyword

layout = column(slider, p)
show(layout)
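
The slider callback above does one thing: it points the plotted `y` column at the column for the chosen year. As a rough Python-side sketch of that same swap, using plain dictionaries in place of the two `ColumnDataSource` objects (the helper name `update_plot_data` and the sample values are hypothetical, introduced only for illustration):

```python
def update_plot_data(all_data, plot_data, year):
    """Mimic the CustomJS body: replace the plotted 'y' column with the chosen year's column."""
    plot_data["y"] = all_data[str(year)]   # same lookup as d1[Y.toString()] in the JavaScript
    return plot_data

# Stand-ins for All_Data and Plotting_Data (made-up values)
all_data = {"1995": [0.0, 0.0], "2010": [7.3, 11.6], "country": ["A", "B"]}
plot_data = {"ID": [0, 1], "y": all_data["1995"], "country": all_data["country"]}

update_plot_data(all_data, plot_data, 2010)
```

Only the `y` column changes; the `ID` and `country` columns stay fixed, which is why the bars keep their positions and hover labels while their heights update.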

<br  />  <br  />  <br  />  
The second visualisation plots a single country over time. The desired country is selected from the drop-down menu.

In [43]:
# Transpose so that each column is a country and each row is a year
dft = df.T
# The country-ID column added earlier becomes a row after transposing; it is not needed here, so drop it
dft.drop(['Cid'], axis=0, inplace=True)
# Take a preliminary look at the way data is organised
dft.head()

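| Year | Afghanistan | Albania | Algeria | American Samoa | Andorra | Angola | Antigua and Barbuda | Argentina | Armenia | Aruba | ... | Uruguay | Uzbekistan | Vanuatu | Venezuela | Vietnam | Virgin Islands (U.S.) | West Bank and Gaza | Yemen, Rep. | Zambia | Zimbabwe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1995 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1996 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1997 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1998 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1999 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |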


In [50]:
# ColumnDatasource objects

All_Countries = ColumnDataSource(dft) # data of all countries

# This data goes in the plot ; initialise it with the first country in the dataframe
yr   = [int(y) for y in dft.index.values]   # year labels as numbers, to match the numeric x_range
subs = dft['Afghanistan'].values
Plotting_Country = ColumnDataSource(data={'year':yr , 'subs':subs})

# CustomJS callback: when a country is chosen, copy that country's column into the plotted source
callback = CustomJS(args=dict(s1=All_Countries, s2=Plotting_Country), code="""
    var d1 = s1.data;
    var d2 = s2.data;
    var country = cb_obj.value;  // country chosen in the drop-down menu
    d2.subs = d1[country];       // columns are keyed by country name
    s2.change.emit();
""")

# Create figure, add line plot, hover tool, and drop-down menu
p = figure(width=900, height=500, y_range=(-0.5,20), x_range=(1994,2011),
           y_axis_label="Log Count", x_axis_label="Time")
p.line('year', 'subs', source=Plotting_Country)
p.add_tools(HoverTool(tooltips=[("Log Count", "@subs"), ("year", "@year")], mode="vline"))

select = Select(title="Country", value="Afghanistan", options=list(dft.columns.values))
select.js_on_change('value', callback)  # newer Bokeh releases no longer accept the callback= keyword

layout = column(select, p)
show(layout)

<br  />  <br  />  <br  />  
The final visualisation displays the total subscriber count, summed over all countries, for each year.

In [90]:
counts=df_orig.sum(axis=0) # Series holding the sum of all subscribers per year
counts=np.log(counts)      # Take natural logarithm of the yearly totals
vals=counts.values
yrs=counts.index.astype(int).values  # year labels as numbers, to match the numeric x_range

# Create plot
p=figure(width=900, height=500, y_range=(-0.5,23), x_range=(1994,2011), 
         y_axis_label="Log Count", x_axis_label="Time")
p.add_tools(HoverTool(tooltips=[("Log Count", "$y"), ("year", "$x{0000}")], mode="vline"))
p.line(yrs,vals,line_width=2)

show(p)
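
The cell above sums the raw counts in `df_orig` and only then takes the logarithm. The order matters: summing the already-logged values would give the log of the product of the counts, not the log of their total. A small sketch with made-up numbers (the `raw` frame is a hypothetical stand-in for `df_orig`) illustrates the difference, along with the conversion of string year labels to integers for a numeric axis.

```python
import numpy as np
import pandas as pd

# Hypothetical raw counts for two countries over two years
raw = pd.DataFrame(
    {"2009": [1000.0, 3000.0], "2010": [2000.0, 6000.0]},
    index=["A", "B"],
)

# Log of the yearly total: the quantity shown in the final plot
log_of_total = np.log(raw.sum(axis=0))

# Sum of per-country logs: equals the log of the product, not of the total
total_of_logs = np.log(raw).sum(axis=0)

# Year labels read from a spreadsheet may be strings; a numeric axis needs numbers
years = log_of_total.index.astype(int)
```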