<a href="https://colab.research.google.com/github/big-data-analytics-physics/handsonml_ch1/blob/master/ch1_data_and_plotly.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chapter 1: Introduction
## Opening a data file
The data we want to explore is in a "csv" or "comma-separated-value" file.   We will want to open this file and read it into memory, so we can explore it.   "Exploring" a data file could mean a number of things:
1.  Counting the number of lines
2.  Calculating statistics on the data in the file - averages, standard deviations, etc.
3.  Making plots of the data in the file

With python, we have a number of different options for reading a csv file into memory.   We will use "pandas" to do this.   To learn more, you can of course type into google: 

             python pandas read csv

When you do this, one of the top links which comes up is this one:  
      https://www.datacamp.com/community/tutorials/pandas-read-csv

Take a look at this to get a sense of how to read in files.

First we need to import the appropriate python package: in this case it is called "pandas"

In [0]:
import pandas as pd

Note that we have introduced a new package called "pandas" into the mix.  As you go through the examples below, you might begin to see that it seems similar in many ways to "numpy".   This is not surprising since it is built upon numpy!   In my practice, I generally only use pandas (specifically pandas dataframes, which we will get into below), and only rarely use numpy arrays.   However, many packages require their inputs to be in the form of a numpy array, so you will have to be able to convert into that format.
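Since converting to numpy comes up so often, here is a minimal sketch of how it can be done. The small dataframe below is a made-up example (it is not the GDP file we use in this chapter), and *to_numpy()* is just one way to do the conversion.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the GDP file used in this chapter
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

arr = df.to_numpy()          # 2-D numpy array: one row per dataframe row
col = df["a"].to_numpy()     # 1-D numpy array for a single column

print(type(arr), arr.shape)
print(col)
```

The column names are lost in the conversion, so keep the dataframe around if you need to know which column is which.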

Here is a gentle introduction to pandas, although you can find lots more on the web:
      https://medium.com/@harrypotter0/an-introduction-to-data-analysis-with-pandas-27ecbce2853
      

Back to opening and reading files!   Next we have to get our data to a place where colab can see it.   There are a couple of different ways of doing this:


1.   Copy the file from wherever it is and put it on your google drive, in the 'My Drive/Colab Notebooks' folder.   You then have to *mount* your google drive so that you can access the data.   You would use code like this (executed in a code cell):

from google.colab import drive

drive.mount('/content/drive')

After executing this you will be prompted to click on a link, where you will get an authorization code to enter into a form in colab.   The drive will remain mounted throughout your colab session until you log off.   To access the data you would do something like:

datapath = 'drive/My Drive/Colab Notebooks/data/'

gdp_data = pd.read_csv(datapath+"gdp_oecd_data_byCountry.csv")

2.   Another option - and the one we will use - is to have the data file in a git repo (which means the data file can't be much bigger than about 50 MB).   You need the url to the *raw* file.   Our class uses a public repo (so anyone can see it) with some of our test data files.  The variable *url* below points to this file.   Once we have the url, it is easy to read the file in using *pandas*:





In [0]:

url = "https://raw.githubusercontent.com/big-data-analytics-physics/data/master/ch1/gdp_oecd_data_byCountry.csv"
gdp_data=pd.read_csv(url)
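If you want to try *read_csv* without a network connection or a mounted drive, you can also hand it a file-like object. The sketch below builds a tiny csv in memory; the column names and values are invented for illustration and are not taken from the real file.

```python
import io
import pandas as pd

# A made-up two-row csv held in a string (illustrative values only)
csv_text = """Country,GDP,Life satisfaction
A,100.0,6.5
B,200.0,7.0
"""

# read_csv accepts a path, a url, or a file-like object such as StringIO
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)     # (number of rows, number of columns)
print(toy.dtypes)    # pandas infers a type for each column
```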

## Simple Exploration of a data set using pandas
Now that we have the data in pandas, what can we do with it?   To find out the full capabilities of pandas, go to [this link](http://pandas.pydata.org/pandas-docs/stable/10min.html).

For now, we will do a few very simple things:
1.  Look at a few *rows* of the dataset:  use gdp_data.head().   With no argument it prints out the first 5 rows.  With an argument of "10" it prints out the first 10 rows.   There is a similar function called "tail" which prints out the last *n* rows of the data frame.
2.  Get some info about the names and types of the columns in the dataframe, the number of rows, and how much memory the dataframe takes up: using gdp_data.info()
3.  Get some basic statistical info about the dataframe (mean, std, etc): use gdp_data.describe()
4.  Get correlations among all of the numerical columns: use gdp_data.corr()


In [0]:
print(gdp_data.head())

In [0]:
print(gdp_data.info())

In [0]:
print(gdp_data.describe())

In [0]:
print(gdp_data.select_dtypes(include='number').corr())   # correlate the numerical columns only; 'Country' is text
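One detail that can trip you up: as far as I know, recent pandas versions raise an error if you ask for correlations on a dataframe that still contains text columns, while older versions quietly skipped them. A hedged sketch of one way to be safe on any version, using an invented toy dataframe, is to pick out the numerical columns first. It also shows how to count the rows, which was the first item on our "exploring" list.

```python
import pandas as pd

# Invented toy data: one text column and three numerical columns
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],   # rises with x
    "z": [4.0, 3.0, 2.0, 1.0],   # falls as x rises
})

print(len(toy))                                # number of rows

numeric = toy.select_dtypes(include="number")  # drops the text column
corr = numeric.corr()
print(corr)
```

Here x and y are perfectly correlated and x and z are perfectly anti-correlated, which makes the matrix easy to check by eye.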

## More Cool things with pandas dataframes
You can do many interesting things with a dataframe, such as:
1.  Select a subset of rows and/or columns
2.  Select a subset of the full dataframe based on a requirement of a single column
3.  Add a column based on other columns

In [0]:
selected_columns = ['Life expectancy','Employment rate']
gdp_data_subset = gdp_data.loc[:,selected_columns]      # the ":" means select all rows
print(gdp_data_subset.head(10))                           # remember head by default yields 5 rows

In [0]:
selected_columns = ['Life expectancy','Employment rate']
gdp_data_subset = gdp_data.loc[10:13,selected_columns]      # the "10:13" means select rows 10,11,12,13
print(gdp_data_subset)

In [0]:
gdp_data_subset = gdp_data.loc[10:13,:]      # the ":" means select all columns
print(gdp_data_subset)
gdp_data_subset = gdp_data.loc[10:13]      # you don't actually need to specify the column names
print(gdp_data_subset)
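Notice that "10:13" with *loc* gave us four rows, not three. That is because *loc* selects by index *label* and includes the end label, unlike ordinary python slicing. The sketch below, on a made-up dataframe, compares it with *iloc*, which selects by position and excludes the end.

```python
import pandas as pd

# Made-up dataframe with the default integer index 0..19
toy = pd.DataFrame({"value": range(20)})

by_label = toy.loc[10:13]       # label-based, end included: rows 10, 11, 12, 13
by_position = toy.iloc[10:13]   # position-based, end excluded: rows 10, 11, 12

print(len(by_label), len(by_position))
```

With the default integer index the labels and positions coincide, so the only visible difference is whether the last row is included.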

In [0]:
gdp_data["Personal earnings (thousands)"]= gdp_data["Personal earnings"]/1000.0   
print(gdp_data.head())

## Making a descriptive column
This can be useful for some of the plotting that we will do later.   What we are doing here is taking data from various columns and concatenating it into one long string.   When we do this, we need to convert columns which are numerical (like 'GDP') to strings, using the *astype(str)* method.   The "&lt;br&gt;" pieces are HTML line breaks, which plotly will use later to split the text over several lines.

In [0]:
gdp_data['Info'] = "Country:"+gdp_data['Country']+"<br>GDP:"+gdp_data['GDP'].astype(str)+"<br>Employment rate:"+gdp_data['Employment rate'].astype(str)+"<br> Homicide rate:"+gdp_data['Homicide rate'].astype(str)
print(gdp_data.head())
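To see exactly what such a string looks like, here is a small sketch with invented values. Only the pattern (string + text column + converted numerical column) is the same as in the cell above.

```python
import pandas as pd

# Invented values for illustration
toy = pd.DataFrame({"Country": ["A", "B"], "GDP": [100.5, 200.0]})

# Text columns can be added to strings directly; numerical columns need astype(str)
toy["Info"] = "Country:" + toy["Country"] + "<br>GDP:" + toy["GDP"].astype(str)

print(toy["Info"][0])   # Country:A<br>GDP:100.5
```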

## Selecting a subset of a dataframe based on the value of a column
Here we want to select just those countries which have a GDP above or below the median of our dataset:

In [0]:
gdp_above_median = gdp_data.loc[(gdp_data['GDP'] >gdp_data['GDP'].median())]
print("Above median:")
print(gdp_above_median.head(5))
gdp_below_median = gdp_data.loc[~(gdp_data['GDP'] >gdp_data['GDP'].median())]
print("Below median:")
print(gdp_below_median.head(5))
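The expression inside *loc* is a column of True/False values (a "boolean mask"), and the "~" flips every True to False and vice versa. A minimal sketch with invented numbers shows what that means for a row sitting exactly at the median:

```python
import pandas as pd

# Invented GDP values; the median is 3.0
toy = pd.DataFrame({"GDP": [1.0, 2.0, 3.0, 4.0, 5.0]})

median = toy["GDP"].median()
above = toy["GDP"] > median     # boolean mask: True where GDP is above the median

high = toy.loc[above]           # rows with GDP > median
low = toy.loc[~above]           # everything else, i.e. GDP <= median

print(len(high), len(low))
```

Because "~" selects everything that is *not* above the median, the row equal to the median lands in the "below" group, and the two groups together always add up to the full dataframe.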

# Plotting!
## Simplest: just use matplotlib
To plot in Jupyter, we need to use a "magic".    We will first use "matplotlib" to do our plotting.   To do this, we need to execute the following line:

%matplotlib inline    

Notice the leading "%" .   Now it appears that we don't actually need this in the jupyter environment in colab, but if you ever install jupyter locally on your own machine, you will need to do something like that.

Next we need to import the appropriate python packages:

In [0]:
import matplotlib


## The correlation matrix
A nice way to summarize relationships among several features in a dataset is through the correlation matrix.   A good discussion of correlation can be found here: https://www.datascience.com/blog/introduction-to-correlation-learn-data-science-tutorials.
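If you would like to see what is behind a single entry of that matrix, the sketch below computes the Pearson correlation coefficient by hand and compares it with pandas. I am assuming the default (Pearson) method here, and the two short series are invented for illustration.

```python
import numpy as np
import pandas as pd

# Two invented series
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 1.0, 4.0, 3.0, 5.0])

# Pearson r: sum of products of deviations, divided by the
# square root of the product of the summed squared deviations
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_pandas = x.corr(y)    # pandas uses the Pearson method by default

print(r_manual, r_pandas)
```

A value near +1 means the two columns rise together, near -1 means one falls as the other rises, and near 0 means no linear relationship.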

## "Plotting" the correlation matrix.
We used a feature of pandas to print out the correlation matrix, but it was fairly hard to read.   There is a nice built in feature that allows us to make a **heat map** of the correlation matrix:

In [0]:
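corr = gdp_data.select_dtypes(include='number').corr()   # text columns such as 'Country' cannot be correlated
corr.style.background_gradient().format("{:.3f}")        # show 3 decimal places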

That is nice, but since I am interested in the GDP, and how other things influence that, I would love GDP to be in the first row.   We can do that!   To the google machine:   google **pandas move column to front**

In [0]:
mid = gdp_data['GDP']
gdp_data.drop(labels=['GDP'], axis=1,inplace = True)
gdp_data.insert(0, 'GDP', mid)

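corr = gdp_data.select_dtypes(include='number').corr()   # text columns such as 'Country' cannot be correlated
corr.style.background_gradient().format("{:.3f}")        # show 3 decimal places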
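The drop-and-insert approach changes the dataframe in place. If you would rather leave the original alone, another option (a sketch, on an invented three-column dataframe) is to build the new column order as a list and select with it:

```python
import pandas as pd

# Invented dataframe with 'GDP' in the middle
toy = pd.DataFrame({"a": [1], "GDP": [2], "b": [3]})

# Put 'GDP' first, then every other column in its existing order
cols = ["GDP"] + [c for c in toy.columns if c != "GDP"]
reordered = toy[cols]          # a new dataframe; toy itself is unchanged

print(list(reordered.columns))
```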


## Simple scatter plots
To plot one column against another:

In [0]:
import matplotlib.pyplot as plt
plt.scatter(gdp_data['GDP'],gdp_data['Life satisfaction'])
plt.show()


## Making the scatter plots prettier
We can make the same plot, but let's make the markers bigger using the "s=" argument:

In [0]:
plt.scatter(gdp_data['GDP'],gdp_data['Life satisfaction'],s=100)
plt.show()


If we want the size to depend on a column value, things get more complicated.   Let's use 'Personal earnings'.   Note the "s=" term at the end of the following line:

In [0]:
plt.scatter(gdp_data['GDP'],gdp_data['Life satisfaction'],s=gdp_data['Personal earnings'])

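What is going on here?   The problem is that the values in our column are far too large to use directly as marker sizes.   We can fix this by making a new column which is *min-max* scaled.   Scaling to the range 0-1 (the commented-out 3rd line) makes the points tiny, so the 4th line scales to 0-200 instead.   To see the tiny points for yourself, uncomment the 3rd line (remove the '#'), comment out the 4th line, and rerun the cell.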

In [0]:
mn = gdp_data['Personal earnings'].min()
mx = gdp_data['Personal earnings'].max()
#gdp_data['Personal earnings Norm'] = (gdp_data['Personal earnings']-mn)/(mx-mn)  # this goes from 0-1.0
gdp_data['Personal earnings Norm'] = 200.0*(gdp_data['Personal earnings']-mn)/(mx-mn)  # this goes from 0-200.0
plt.scatter(gdp_data['GDP'],gdp_data['Life satisfaction'],s=gdp_data['Personal earnings Norm'])
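Since we will want this kind of scaling again, it can be wrapped in a small function. The name *min_max_scale* and its *upper* argument are my own choices for this sketch, and the earnings values are invented.

```python
import pandas as pd

def min_max_scale(series, upper=1.0):
    """Rescale a numerical series linearly so it runs from 0 to `upper`."""
    mn, mx = series.min(), series.max()
    return upper * (series - mn) / (mx - mn)

# Invented earnings values
earnings = pd.Series([20000.0, 35000.0, 50000.0])
sizes = min_max_scale(earnings, upper=200.0)

print(sizes.tolist())
```

Note that the smallest value always maps to 0, so that data point gets a marker of size zero and disappears from the plot. If that matters, one option is to add a small constant to the scaled sizes.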

Now let's make a scatter plot where one axis is the country name.   We rotate the tick labels so that the country names are readable.

In [0]:
plt.scatter(gdp_data['Country'],gdp_data['GDP'],s=gdp_data['Personal earnings Norm'])
pltx = plt.xticks(rotation='vertical')

## Histogramming our data
Finally, we can histogram every numerical column of the dataset at once (categorical columns like Country are skipped) by using gdp_data.hist().    This is a built-in feature of pandas (remember that *gdp_data* is a pandas dataframe).  

Arguments allow us to control the binning, as well as the size of the resulting plots.

See the documentation at this [link](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.hist.html) for more information.

In [0]:
gdp_data.hist(bins=50,figsize=(20,15))
plt.show()
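To get a feel for what the *bins* argument does, here is a sketch using numpy's *histogram* function on a handful of invented values. My understanding is that the pandas/matplotlib histograms use the same kind of equal-width bins by default, but treat that as an assumption.

```python
import numpy as np

# Invented values between 1 and 10
values = np.array([1.0, 2.0, 2.5, 3.0, 9.0, 10.0])

# Split the range [min, max] into 3 equal-width bins and count the entries in each
counts, edges = np.histogram(values, bins=3)

print(edges)    # bin boundaries: 4 edges for 3 bins
print(counts)   # number of values falling in each bin
```

With only 3 bins you can see the empty middle bin clearly. With too many bins for the amount of data, most bins end up with zero or one entry and the shape becomes hard to read.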

# More sophisticated plotting.
The plots above are nice, but it turns out that we can add an interactive aspect to our visualizations, which will be helpful especially in the early data exploration stages.   To do this, we (of course) have many options.   A good one which is reasonably simple is called *plotly*.



Unfortunately, using plotly is not completely straightforward in google's colab environment (it is a little easier in a local jupyter environment).   But it is not that bad.   We need to first define a function called "enable_plotly_in_cell":

In [0]:
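def enable_plotly_in_cell():
  from IPython.display import display, HTML     # import display explicitly rather than relying on the notebook builtin
  from plotly.offline import init_notebook_mode
  # load require.js in this cell's output so that plotly can render there
  display(HTML('''
        <script src="/static/components/requirejs/require.js"></script>
  '''))
  init_notebook_mode(connected=False)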

We need to execute that once (so that the function is defined in our session).  Then anytime we want to make a plotly plot, we just need to call the above function in the cell where we want to make a plot.

Below is an example from the plotly website using random data.
After you run the cell below, use your mouse to hover over the plot and explore it a bit.

In [0]:
import numpy as np
from plotly.offline import iplot
from plotly.graph_objs import Histogram2dContour, Scatter

enable_plotly_in_cell()
x = np.random.randn(2000)
y = np.random.randn(2000)
# plain dicts are used for the contour and marker settings
iplot([Histogram2dContour(x=x, y=y, contours=dict(coloring='heatmap')),
       Scatter(x=x, y=y, mode='markers', marker=dict(color='white', size=3, opacity=0.3))], show_link=False)





---


Now let's use plotly to make plots of our data.   First let's make a bar plot of the GDP of each country.   Again, use your mouse to hover over the plot.




In [0]:

import plotly.graph_objs as go

enable_plotly_in_cell()

data = [
    go.Bar(
        y=gdp_data['GDP'],        # bar heights: the 'GDP' column
        x=gdp_data['Country'],    # one bar per country
        text=gdp_data['Info']     # hover text: the descriptive column we built earlier
    )
]
iplot(data)





---

Remember that we made a scatter plot of life satisfaction vs GDP earlier.   We can look at this another way: by splitting our data into two groups and plotting those groups as histograms on the same figure.   As we did earlier, let's split the data based on whether it is above or below the median GDP, and then plot the "life satisfaction" of those two groups against one another.   The code I am using is based on examples from this plotly page: https://plot.ly/python/histograms/#overlaid-histogram

In [0]:
import numpy as np
from plotly.offline import iplot
import plotly.graph_objs as go

enable_plotly_in_cell()
gdp_above_median = gdp_data.loc[(gdp_data['GDP'] >gdp_data['GDP'].median())]
gdp_below_median = gdp_data.loc[(gdp_data['GDP'] <gdp_data['GDP'].median())]
trace1 = go.Histogram(
    x=gdp_above_median['Life satisfaction'],
    opacity=0.75,
    name="Above Median GDP"
)
trace2 = go.Histogram(
    x=gdp_below_median['Life satisfaction'],
    opacity=0.75,
    name="Below Median GDP"
)

data = [trace1, trace2]
layout = go.Layout(barmode='overlay')
fig = go.Figure(data=data, layout=layout)

iplot(fig, filename='overlaid histogram')

Another cool way of visualizing this data is through the use of a "box plot".

In [0]:
import numpy as np
from plotly.offline import iplot
import plotly.graph_objs as go

enable_plotly_in_cell()


boxes = []
labels = []

boxes.append(gdp_above_median['Life satisfaction'])
boxes.append(gdp_below_median['Life satisfaction'])
labels.append('Above Median')
labels.append('Below Median')

data = [{
    'y': boxes[i], 
    'name':labels[i],
    'type':'box'
    } for i in range(len(labels))]



layout = dict(
    title='Life Satisfactions vs GDP',
    xaxis=dict(title='GDP'),
    yaxis=dict(title='Life satisfaction')
)

iplot(dict(data=data))#,layout=layout))
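If you have not met box plots before: the box spans the first to the third quartile of the data, with a line at the median. The sketch below computes those three numbers with pandas for a few invented values. Plotly may use a slightly different convention for the quartiles, so do not expect the numbers to match its boxes exactly.

```python
import pandas as pd

# Invented life-satisfaction scores
s = pd.Series([5.0, 6.0, 6.5, 7.0, 7.5])

# First quartile, median, and third quartile (pandas interpolates linearly by default)
q1, med, q3 = s.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1    # interquartile range: the height of the box

print(q1, med, q3, iqr)
```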



---


Now let's make a fancy bubble plot.   This is just a scatter plot of one variable vs another.   But we can provide more information by using color to encode another feature of each data point, and the size of the marker to convey yet another.


In [0]:
from plotly.offline import iplot

import plotly.graph_objs as go
enable_plotly_in_cell()
mn = gdp_data['Personal earnings'].min()
mx = gdp_data['Personal earnings'].max()
gdp_data['Personal earnings Norm2'] = 100.0*(gdp_data['Personal earnings']-mn)/(mx-mn)  
trace0 = go.Scatter(
    x=gdp_data['GDP'],
    y=gdp_data['Life satisfaction'],
    text=gdp_data['Info'],
    mode='markers',
    marker=dict(
        size=gdp_data['Personal earnings Norm2'],
        color=gdp_data['Country'].astype("category").cat.codes,
        colorscale='Jet',
        colorbar=dict(thickness=20)
    )
)


#layout = go.Layout(
#    title='Life satisfaction VS GDP by Country',
#    xaxis=dict(title='GDP'),
#    yaxis=dict(title='Life satisfaction')
#)

data = [trace0]      #   this is a list because you might want to plot many data sets
iplot(dict(data=data))#,layout=layout))
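The *color* line in that cell deserves a second look: plotly wants numbers for a colorscale, but 'Country' is text, so the cell converts each country name into an integer code. Here is a small sketch of that conversion using a few example names.

```python
import pandas as pd

# A few example country names, with one repeated
countries = pd.Series(["Norway", "Chile", "Japan", "Chile"])

# Each distinct name gets an integer code; the codes follow alphabetical order
codes = countries.astype("category").cat.codes

print(codes.tolist())
```

Because the codes simply follow alphabetical order, the colors only distinguish the countries from one another. They do not carry any quantitative meaning.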

## Maps
Finally, let's make a map.   

In [0]:
import pandas as pd
enable_plotly_in_cell()
plottingVar = 'GDP'     # the column to show on the map; try 'Life satisfaction' instead
data = [ dict(
        type = 'choropleth',
        locationmode = 'country names',
        locations = gdp_data['Country'],
        z = gdp_data[plottingVar],
        text = gdp_data['Info'],
        colorscale = [[0,"rgb(5, 10, 172)"],[0.35,"rgb(40, 60, 190)"],[0.5,"rgb(70, 100, 245)"],\
            [0.6,"rgb(90, 120, 245)"],[0.7,"rgb(106, 137, 247)"],[1,"rgb(220, 220, 220)"]],
        autocolorscale = False,
        reversescale = True,
        marker = dict(
            line = dict (
                color = 'rgb(180,180,180)',
                width = 0.5
            ) ),
        colorbar = dict(
            autotick = False,
            #tickprefix = '$',
            title = plottingVar),   # label the colorbar with the name of the plotted column
      ) ]

layout = dict(
    title = '2014 Global GDP<br>Source:\
            <a href="https://www.cia.gov/library/publications/the-world-factbook/fields/2195.html">\
            CIA World Factbook</a>',
    geo = dict(
        showframe = False,
        showcoastlines = False,
        projection = dict(
            type = 'mercator'
        )
    )
)

fig = dict( data=data, layout=layout )
iplot( fig, validate=False, filename='d3-world-map' )