# Lecture 10 CSCI E-7
## Introduction to Scientific Computing and Data Analysis
##### Nenad Svrzikapa April 12 2017
#### NumPy, Pandas, Matplotlib, Bokeh and more

The following lecture uses code and examples from Python for Data Analysis by Wes McKinney.  Wes is the creator of pandas, and I highly recommend that you purchase this book if you are interested in data analysis with Python.  This lecture also contains machine learning code from Machine Learning Mastery by Jason Brownlee.  I recommend Jason's book if you are interested in getting into machine learning.  There is an excellent Python machine learning tutorial series on YouTube by Josh Gordon of Google.  I also like Siraj Raval's videos.

The code in this lecture requires the following libraries (the NumPy cell also imports scipy, and the machine learning cells use scikit-learn):

 * matplotlib
 * bokeh
 * numpy
 * pandas
 * seaborn
 * plotly

# NumPy

In [None]:
# importing numpy; the alias np is a convention, but the name is up to you
import numpy as np
from scipy import stats

my_list = [0,1,2,2,2,3.5,4,5]

n1 = np.array(my_list)
# 5 evenly spaced values from 0 to 1 (both endpoints included)
n2 = np.linspace(0,1,5)
print (n2)

#mean
print (n1.mean())
print (np.mean(n1))
#mode (NumPy has no mode function, so we use scipy.stats)
print (stats.mode(n1))
#median
print (np.median(n1))
#max
print (np.max(n1))
#min
print (np.min(n1))

#shape of the array
print (np.shape(n1))

### Accessing Array Data

In [None]:
my_list = [[1,2,3,4],[6,4,6,8],[3,5,6,8]]
my_array = np.array(my_list)
print (np.shape(my_array)) #rows and columns

#accessing the first row
print (my_array[0])

#accessing the last row
print (my_array[-1])

#accessing element in specific row and column
print (my_array[2][1])

#accessing a column

#column 1
print (my_array[:,1])

### Numpy Array Arithmetic

In [None]:
# as long as they have the same shape
np1 = np.array([1,2,3])
np2 = np.array([2,3,4])

np3 = np1 + np2
np4 = np3 - np2 - np1

print (np3)
print (np4)

#you can also do math with a numpy array
#multiplies all np values with 5
np5 = np1 * 5
print (np5)
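
The arithmetic above is elementwise, which is why the shapes have to be compatible.  Here is a minimal sketch of what happens when they are not.  The array names are illustrative and not part of the lecture code.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([10, 20, 30])
c = np.array([1, 2])  # different length on purpose

# same shape: values are combined position by position
total = a + b
print(total)

# a scalar is applied to every element
scaled = a * 5
print(scaled)

# mismatched shapes raise a ValueError instead of guessing
mismatch_error = None
try:
    a + c
except ValueError as err:
    mismatch_error = err
    print("cannot add:", err)
```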

# Matplotlib

In [None]:
import matplotlib.pyplot as plt
import numpy
#making a numpy array of a range of numbers from 1-10
n1 = numpy.array(range(1,11))

fig = plt.figure()
fig.suptitle('Awesome Title', fontsize=20)

plt.plot(n1)
plt.xlabel('some x axis')
plt.ylabel('some y axis')
fig.savefig('myfig.jpg') #save before show(); some backends close the figure once it is shown
plt.show()
plt.clf() #clear figure

n2 = numpy.array([5,7,9,10,11,12,13,16,25,30])
plt.scatter(n1,n2,color=['red','green','blue','yellow',
                         'purple','orange','pink','brown','cyan',                       
                        'darkred'])
plt.xlabel('Time [min]')
plt.ylabel('Values [kg]')
plt.show()

# pandas
There are two main data structures you will need to master:
* Series
* DataFrame

### pandas Series

A Series is a one-dimensional object containing an array of data and an index associated with that data.  Note that all values in a Series share a single data type.  Check what happens if one value is a float.

In [None]:
from pandas import Series, DataFrame
import pandas as pd

quiz_scores = Series([12,11,14,15])

quiz_scores
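
To see the "same type" rule in action, here is a small sketch with made-up scores (the variable names are mine).  It suggests that a single float is enough to turn the whole Series into floats.

```python
import pandas as pd

all_ints = pd.Series([12, 11, 14, 15])
one_float = pd.Series([12, 11.5, 14, 15])

# every value in a Series shares one dtype
print(all_ints.dtype)
print(one_float.dtype)

# the integers were converted so that all values fit one float dtype
print(one_float)
```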

We can access the index and values of the Series object.

In [None]:
quiz_scores.values

In [None]:
quiz_scores.index

The automatically generated index is useful, but more often than not you will want to have your own index for the data.  Let's make our own:

In [None]:
quiz_scores.index = ["Quiz 1","Quiz 2","Quiz 3","Quiz 4"]
quiz_scores

Series can be easily accessed by index.  We can do this either for a single value or for a list of specific indexes.

In [None]:
quiz_scores['Quiz 1']

In [None]:
quiz_scores[['Quiz 1','Quiz 3']]

Appending values to a pandas Series is simple.  Older pandas versions had an append method for this, but it was removed in pandas 2.0, so we use pd.concat, which works in both old and new versions.

In [None]:
quiz_scores2 = Series([9,8],index = ['Quiz 5','Quiz 6'])
print (quiz_scores2)
quiz_scores = pd.concat([quiz_scores, quiz_scores2])
quiz_scores

We have powerful manipulation techniques at our disposal.  Let's try a few.  Let's say the maximum score on a quiz is 15 and we want to convert these values to percentages.

In [None]:
#we can do math on all the values immediately
quiz_scores = quiz_scores /15 *100
print (quiz_scores)
#we can round our scores
quiz_scores = quiz_scores.round(1)
print (quiz_scores)

In [None]:
quiz_scores[quiz_scores > 80]

Ok, so it appears that a Series maps index labels to values.  In a way it is kind of like a dictionary.  In fact, making a Series out of dictionary-like data, such as JSON, is quite direct.  All you have to do is call Series(your dictionary) and you are done.

In [None]:
population = {'NY City':8000000, 'LA':4000000,'Boston':650000}
s_pop = Series(population)
s_pop

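Notice something?  In the pandas version used for this lecture the keys got sorted!  Recent pandas versions (0.23 and later, on Python 3.6+) keep the dictionary's insertion order instead.  Either way, a Series made out of a dictionary behaves much like an ordered dictionary.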

Now that we have done this, we can create new series from existing series by specifying the index keys.  

In [None]:
cities = ["Boston","LA"]
s_pop_smaller = Series(s_pop,index=cities)
s_pop_smaller

But what happens if we don't really know the indexes and we just want to make a new series with the cities we are interested in?  Let's say that we want to know about Cambridge and Somerville too.

In [None]:
cities2 = ["Boston","LA","Cambridge","Somerville"]
s_pop_larger = Series(s_pop, index = cities2)
s_pop_larger

You are probably wondering what NaN is.  It means Not a Number, and it is how pandas marks missing values.  You can test which values are missing with pd.isnull(your series) or pd.notnull(your series).

In [None]:
pd.isnull(s_pop_larger)

The most awesome thing about series is that you can do mathematical operations with multiple series and they will automagically align on their index labels, even if they are not indexed the same way.  Let's add the original series to the larger one, which holds NaN for Cambridge and Somerville.

In [None]:
s_pop + s_pop_larger
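
Alignment also means that a label missing on either side produces NaN in the result.  The sketch below uses made-up numbers and my own variable names to show this, along with Series.add and its fill_value argument, which is one way to treat a missing entry as 0 instead.

```python
import pandas as pd

a = pd.Series({'Boston': 650000, 'LA': 4000000})
b = pd.Series({'LA': 100, 'Cambridge': 110000})

# labels are matched by name; a label found on only one side gives NaN
plain_sum = a + b
print(plain_sum)

# fill_value=0 substitutes 0 for the side where a label is missing
filled_sum = a.add(b, fill_value=0)
print(filled_sum)
```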

Finally, both the Series object and the index have a name attribute that can be set for downstream functionality.

In [None]:
s_pop_larger.name = 'population'
s_pop_larger.index.name = 'cities'
s_pop_larger

More often than not you will want your plots to be inline and enclosed in your notebook.  This is possible.  Let's do something more fun:

https://data.world/

## pandas Dataframes

Here is how we can create a very simple dataframe out of a 3x3 numpy array.  The columns list holds the header names of the columns, and the index list holds the row names.

In [None]:
my_df = pd.DataFrame(np.random.random([3, 3]),columns=['Pset 1', 'Pset 2', 'Pset 3'], index=['Student 1', 'Student 2', 'Student 3'])
my_df*=100
my_df = my_df.round(2)
my_df

The easiest way to create a dataframe is probably from a dictionary of equal-length lists, where the keys of the dictionary become the column names.  Let's try!

In [None]:
d = {"name":['Jen','Ali','Kwabena','Goran'],
     "dob":[1970,1965,1980,1983],
    "salary":[100000,90000,95000,70000]}

my_df1 = DataFrame(d)
my_df1

In [None]:
my_df2 = DataFrame(d,columns = ['name','dob','salary'])
my_df2

From the above example you can see that by specifying the column names I reordered the columns the way I like them to be.  But, this is even more powerful.  I can in theory initialize another column.  Similar to the Series objects, all the values of that column (if not present in the dictionary) will get initialized as NaN.

In [None]:
my_df3 = DataFrame(d,columns = ['name','dob','salary','bonus'])
my_df3

This is great, but let's say HR has worked out the bonuses and you want to replace the NaN values with actual values.  But first, let's discuss the two ways to access a column in the DataFrame.  One is like typical dictionary access, i.e. my_df3['bonus'], and the other is by attribute with the dot notation, i.e. my_df3.bonus

In [None]:
print (my_df3['bonus'])
print (my_df3.bonus)

Now that we know how to access  this we can assign some value.  Let's say they all got 10000 bonus.

In [None]:
my_df3['bonus'] = 10000
my_df3

It worked!  However, in real life they are unlikely to all get the same bonus.  We can do this with a list of values, but note that the length of that list must match the number of rows or pandas will raise a ValueError.

In [None]:
my_df3['bonus'] = [10000,11000,9000,8000]
my_df3

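Now we have the bonuses all set, but HR sends you an email and says that they have made a mistake and that Ali's and Goran's bonuses are 12000 and 9000 respectively.  We can use a series indexed by the row labels to quickly update just these values.

In [None]:
update = Series([12000,9000],index=[1,3])
#only touch the rows named in update; assigning update to the whole column would set the other rows to NaN
my_df3.loc[update.index,'bonus'] = update
my_df3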
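
Assigning a Series to a DataFrame column aligns on the index, which matters when the Series covers only some of the rows.  The following sketch uses a small hypothetical table (the names staff, overwritten and patched are mine) to compare assigning to the whole column with assigning only to the matching rows.

```python
import pandas as pd

staff = pd.DataFrame({'name': ['Jen', 'Ali', 'Kwabena', 'Goran'],
                      'bonus': [10000, 11000, 9000, 8000]})
update = pd.Series([12000, 9000], index=[1, 3])

# whole-column assignment aligns on the index:
# rows 0 and 2 have no match in update, so they become NaN
overwritten = staff.copy()
overwritten['bonus'] = update
print(overwritten)

# restricting the assignment to the rows in update keeps the other values
patched = staff.copy()
patched.loc[update.index, 'bonus'] = update
print(patched)
```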

# Loading your data into a pandas dataframe
* Getting your .csv into a Pandas dataframe
* Naming your columns
* Using Plotly to make a nice visualization of your dataframe

In [None]:
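import plotly.figure_factory as ff
#py is needed for the py.iplot calls below
#(in plotly 4 and later this module lives in the separate chart_studio package: import chart_studio.plotly as py)
import plotly.plotly as py
import pandas as pd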

#loading a csv into a Pandas dataframe from link.
#I knew that there was no header in this data
#If I didn't declare that the first row would have been assumed to be the header

df = pd.read_csv("https://query.data.world/s/32o6gpwtme7iz5n22npxp2rea",header=None)

#let's name our dataframe columns
df.columns = ['Sepal Length', 'Sepal Width',
             'Petal Length','Petal Width',
             'Species']
#creating nice tables with plotly from your dataframe
my_table = ff.create_table(df)

#a paid account will get you private files and folders
#you can create private folders if you specify it in the filename
# i.e. my_project1/my_table1 will create a folder with the file in it.
py.iplot(my_table, filename='iris')
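
To see what header=None changes without downloading anything, here is a small sketch that reads two sample rows in the same layout from an in-memory string.  The io.StringIO stand-in and the variable names are my assumptions.  The names argument of read_csv is an alternative to setting df.columns afterwards.

```python
import io
import pandas as pd

# two sample rows in the same layout as the iris file: no header line
raw = "5.1,3.5,1.4,0.2,Iris-setosa\n7.0,3.2,4.7,1.4,Iris-versicolor\n"
column_names = ['Sepal Length', 'Sepal Width',
                'Petal Length', 'Petal Width', 'Species']

# default behaviour: the first data row is consumed as the header
guessed = pd.read_csv(io.StringIO(raw))
print(guessed.shape)

# header=None keeps every row as data; names sets the column names in one step
named = pd.read_csv(io.StringIO(raw), header=None, names=column_names)
print(named)
```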

# Introduction to Machine Learning with Iris

Machine learning is a subset of artificial intelligence.  It is the study of algorithms that learn from examples and experience rather than from hard-coded rules.  The machine learning portion of the code in this example is by Jason Brownlee, with my commentary and some alternate visualizations.  See more in the references.  The typical workflow is:

* Data Collection
* Pick a Model (best suited for the data)
* Train the Model
* Test the Model


### Classifier I/O (Data Input, Assigns some Label as Output)
### Training a classifier
### Supervised vs. Unsupervised learning

Let's start with the simplest possible example.  This example is described in Josh Gordon's YouTube series.

In [None]:
#import the tree submodule explicitly; a bare "import sklearn" does not load it
from sklearn import tree
#human-readable version of the data
features = [[139,"Smooth"],[130,"Smooth"],[135,"Smooth"],[149,"Bumpy"],[151,"Bumpy"],[160,"Bumpy"]]
labels = ["Apple","Apple","Apple","Orange","Orange","Orange"]

#encode the data as numbers: Smooth=1, Bumpy=0 and Apple=1, Orange=0
features = [[139,1],[130,1],[135,1],[149,0],[151,0],[160,0]]
labels = [1,1,1,0,0,0]

#we will use a Decision Tree Classifier
clf = tree.DecisionTreeClassifier()
#Now we will train the classifier
clf = clf.fit(features,labels)

d = {1:"Apple",0:"Orange"}

#Now let's test this!
results =  (clf.predict([[155,0],[140,1],[133,0]]))
for r in results:
    print (d[r])

In [None]:
# Load libraries
import pandas
from pandas.plotting import scatter_matrix #was pandas.tools.plotting in very old pandas versions
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Dataset Exploration

First we can take a peek at a portion of the data.  The head method will allow us to take a look at the beginning of the dataframe.  We can specify a value df.head(n) where n tells us how many rows we want to look at.  If we don't specify this, df.head() will show us the first 5 rows.

In [None]:
df
df.head(20)

We can also take a look at the end of the dataframe by using the df.tail() method.  This works in much the same way as the df.head().  It's a quick way to see how many rows you have and if the data at the end is in some way different.  Again, we can specify df.tail(n)

In [None]:
df.tail(5)

Sometimes you will have too many columns and rows and you want to get an idea about the shape of your data.  This can be accomplished with the shape attribute.  Notice that the dataframe object stores its shape as a tuple.  The first value of the tuple is the number of rows in your data and the second is the number of columns.  Notice that the index column is not counted.

In [None]:
print (type(df.shape))
df.shape

### Descriptive Statistics

In [None]:
df.describe()

my_table = ff.create_table(df.describe(),index=True)
#a paid account will get you private files and folders
#you can create private folders if you specify it in the filename
# i.e. my_project1/my_table1 will create a folder with the file in it.
py.iplot(my_table, filename='iris_vc')


### Favorite topic: Rounding :)

There are several ways to round.  The first and easiest way is to just round the whole dataframe. df.round(n) where n is the number of decimal digits required.

In [None]:
my_table = ff.create_table(df.describe().round(2),index=True)
#a paid account will get you private files and folders
#you can create private folders if you specify it in the filename
# i.e. my_project1/my_table1 will create a folder with the file in it.
py.iplot(my_table, filename='rounded')

### Important note:
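Notice that I am passing the rounded data to plotly, but the dataframe itself is still not rounded: round returns a new object and leaves the original unchanged.  To keep the rounded values, assign the result to a variable.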

In [None]:
description = df.describe()
description = description.round(2)
my_table = ff.create_table(description,index=True)
#a paid account will get you private files and folders
#you can create private folders if you specify it in the filename
# i.e. my_project1/my_table1 will create a folder with the file in it.
py.iplot(my_table, filename='description')
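
The point in the note above can be checked on a tiny hypothetical dataframe (the names scores and rounded are mine).  This is only a sketch to show that round leaves the original values alone.

```python
import pandas as pd

scores = pd.DataFrame({'Pset 1': [91.23456, 78.98765]})

# round returns a new dataframe
rounded = scores.round(2)

print(scores)   # still shows all the original decimals
print(rounded)  # two decimals
```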

### Describing specific categories

Sometimes, if your data is categorized, you will want to take a look at the descriptive statistics of those categories.  This is not hard to do.  In the example below I am making a new dataframe based on a criterion.  This criterion is tested by the following line: 
df['Species'] == "Iris-versicolor"

It is True for every row where the value in the Species column is equal to "Iris-versicolor", and df.loc keeps only those rows.  Not so bad.

The next thing I do here is some more specific rounding.  Previously we rounded the whole dataframe, but what if we wanted to round only specific columns?  Compare the result with the rounding before.  What do you conclude?

In [None]:
df_iris_versicolor = df.loc[df['Species'] == "Iris-versicolor"]
a = df_iris_versicolor.describe()
a = a.round({'Sepal Length':2,'Petal Length':2})
#Note here that I said that index is True
my_table = ff.create_table(a,index=True)

#a paid account will get you private files and folders
#you can create private folders if you specify it in the filename
# i.e. my_project1/my_table1 will create a folder with the file in it.
py.iplot(my_table, filename='iris_vc')
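
The selection step can be looked at on its own.  Below is a minimal sketch with a four-row made-up table (the values and the names flowers, mask and versicolor are mine) that prints the boolean Series produced by the comparison and the rows that df.loc keeps.

```python
import pandas as pd

flowers = pd.DataFrame({
    'Petal Length': [1.4, 4.7, 4.5, 6.0],
    'Species': ['Iris-setosa', 'Iris-versicolor',
                'Iris-versicolor', 'Iris-virginica']})

# the comparison yields one True/False value per row
mask = flowers['Species'] == 'Iris-versicolor'
print(mask)

# loc keeps only the rows where the mask is True
versicolor = flowers.loc[mask]
print(versicolor)

# describe now summarizes just that category
print(versicolor.describe())
```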


Finally, you can use a pandas series to specify how the dataframe should be rounded.  This gives us the power to predefine how to round in a series structure that can be used in the round method.

In [None]:
decimals = pd.Series([0,1,2,3], index=['Sepal Length', 'Sepal Width',
                                       'Petal Length', 'Petal Width'])
a_rounded = a.round(decimals)
my_table = ff.create_table(a_rounded,index=True)

#a paid account will get you private files and folders
#you can create private folders if you specify it in the filename
# i.e. my_project1/my_table1 will create a folder with the file in it.
py.iplot(my_table, filename='iris_vc_series_rounded')

### Getting counts of a particular column

There are several ways to do this.  The value_counts() method is somewhat like a Counter, but it returns a pandas series.  If you run that method on any column it will count how many times each value appears.  This is useful because you can take a peek at how the data is represented and whether some value is particularly enriched in your dataset.  Depending on the circumstances that may be interesting to know.

In [None]:
df["Species"].value_counts()

In [None]:
df["Sepal Width"].value_counts()

In [None]:
# class distribution
print(df.groupby('Species').size())
type(df.groupby('Species').size())
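
The three counting approaches mentioned in this section can be compared side by side.  This sketch uses a made-up Species column (the names toy, by_counter, by_value_counts and by_groupby are mine).  The counts agree, but the ordering differs: value_counts sorts by count, while groupby sorts by label.

```python
from collections import Counter
import pandas as pd

toy = pd.DataFrame({'Species': ['setosa', 'virginica', 'setosa',
                                'versicolor', 'setosa', 'virginica']})

# plain Python: a dictionary-like Counter
by_counter = Counter(toy['Species'])
print(by_counter)

# pandas: a Series sorted from most to least frequent
by_value_counts = toy['Species'].value_counts()
print(by_value_counts)

# pandas: a Series sorted by the group label
by_groupby = toy.groupby('Species').size()
print(by_groupby)
```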

## Data Visualization
### Univariate

In [None]:
#Put on same axis by modifying sharex and sharey
df.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)
plt.show()

We can also generate histograms to look at the distributions.

In [None]:
df.hist()
plt.show()

Looks like the Sepal measurements are close to normally distributed.

### Multivariate

In [None]:
scatter_matrix(df)
plt.show()

In [None]:
import seaborn as sea
import matplotlib.pyplot as plt


#the size argument was renamed to height in seaborn 0.9
sea.FacetGrid(df, hue="Species", 
    height=6).map(plt.scatter, "Sepal Width", "Petal Width").add_legend()
plt.show()

In [None]:
#check with hue="Species" and without
sea.pairplot(df,hue="Species").add_legend()
plt.show()

In [None]:
sea.set(style="darkgrid")
g = sea.PairGrid(df)
g.map_diag(sea.kdeplot)
#newer seaborn releases use fill and levels in place of the older shade and n_levels arguments
g.map_offdiag(sea.kdeplot, cmap="Greens", fill=True, levels=10)
plt.show()

### Preparing the data

We have our dataset, but we want to split it so that a portion of it is dedicated to training our models (80% in this case) and the remaining 20% is held back for validation of the models.  Note that we know the true labels of those 20%, so if the model predicts them well on data it has not seen during training, then our model is performing well.

In [None]:
array = df.values
X = array[:,0:4]
Y = array[:,4]
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)



### Defining a measure of performance

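Accuracy is defined as the share of the data that is classified correctly.  It is calculated by dividing the number of correctly classified data points by the total number of data points.  Multiply by 100 to express it as a percentage; scikit-learn reports the fraction itself, so its accuracy scores lie between 0 and 1.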

In [None]:
scoring = 'accuracy'
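
The definition of accuracy can be worked through by hand.  The sketch below uses five made-up labels and predictions (not results from any model in this lecture) and computes the fraction with plain NumPy.

```python
import numpy as np

y_true = np.array(['setosa', 'setosa', 'versicolor', 'virginica', 'virginica'])
y_pred = np.array(['setosa', 'setosa', 'versicolor', 'versicolor', 'virginica'])

# count the positions where the prediction matches the true label
correct = (y_true == y_pred).sum()

# accuracy as a fraction, and the same value as a percentage
accuracy = correct / len(y_true)
print(accuracy)
print(accuracy * 100)
```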

### Measuring model performance

We are now going to take a look at 6 different models and test them.  To do this we will divide our training data into 10 even parts (10-fold cross-validation).  In each round, 9/10 of the data will be used for training and the remaining 1/10 will be used to test the accuracy of the model, so every part serves as the test set exactly once.  Let's see what that looks like.

In [None]:
#holds our models
models = []
#we add our models to our list of models
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))

#we iterate through all models and test their performance
results = []
names = []
for name, model in models:
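    #shuffle=True is required for random_state to have an effect (recent scikit-learn raises an error otherwise)
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)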
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
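
To make the 10-part split concrete, here is a hand-rolled sketch of the idea using only NumPy.  It is not scikit-learn's implementation, and it assumes 120 training rows (80% of a 150-row dataset).  Each round holds out one part for testing and trains on the other nine.

```python
import numpy as np

n_samples = 120  # assumed size of the training set
n_folds = 10

# shuffle the row numbers, then cut them into 10 equal parts
rng = np.random.default_rng(7)
shuffled = rng.permutation(n_samples)
folds = np.array_split(shuffled, n_folds)

fold_sizes = []
for i, test_idx in enumerate(folds):
    # the training rows are all parts except the one held out
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    fold_sizes.append((len(train_idx), len(test_idx)))

# (training rows, test rows) for the first round
print(fold_sizes[0])
```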

### Model comparison

In [None]:
from matplotlib import pylab #need this to change the Y axis range
fig = plt.figure()
fig.suptitle('Model Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
pylab.ylim([0.9,1]) #I changed the Y axis range here

ax.set_xticklabels(names)
plt.show()

### Ok for me SVM seems to be working the best, so let's use it!

In [None]:
svm = SVC()
svm.fit(X_train, Y_train)
predictions = svm.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))

# Plotly
First you will need to make an account and get your api_key.
Second you will want to specify if your plot will be viewable by the world.  If True you will want to set your sharing to public.



In [None]:
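#note: in plotly 4 and later the online-service functions used in this cell moved to the
#separate chart_studio package (chart_studio.tools and chart_studio.plotly)
import plotly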
plotly.tools.set_credentials_file(username='Your User Name', api_key='Your API Key')
plotly.tools.set_config_file(world_readable=True,
                             sharing='public')

import plotly.plotly as py
from plotly.graph_objs import *

trace0 = Scatter(
    x=[1, 2, 3, 4, 5],
    y=[1, 2, 3, 4, 5]
)
trace1 = Scatter(
    x=[1, 2, 3, 4, 5],
    y=[10, 20, 30, 40, 50]
)
trace2 = Scatter(
    x=[1, 2, 3, 4, 5],
    y=[5, 10, 15, 20, 25]
)
data = Data([trace0, trace1,trace2])

py.plot(data, filename = 'basic-line')

### Birthplaces of Staff

In [None]:
import plotly.plotly as py
from plotly.graph_objs import *

import pandas as pd

mapbox_access_token = 'pk.eyJ1IjoiY2hlbHNlYXBsb3RseSIsImEiOiJjaXFqeXVzdDkwMHFrZnRtOGtlMGtwcGs4In0.SLidkdBMEap9POJGIe1eGw'

rating_one_site_lat = [41.11722,33.5061877,42.814401,3.9055556,29.7807902]
rating_one_site_lon = [20.80194, -86.80343,-70.890917,-76.50333333333333,-95.3977855]
locations_name = ['Nenad','Alan','Kaleigh','Jose','Joe']



data = Data([
    Scattermapbox(
        lat=rating_one_site_lat,
        lon=rating_one_site_lon,
        mode='markers',
        marker=Marker(
            size=18,
            color='rgb(155, 240, 225)',
            opacity=0.7
        ),
        text=locations_name,
        hoverinfo='text'
    ),
    Scattermapbox(
        lat=rating_one_site_lat,
        lon=rating_one_site_lon,
        mode='markers',
        marker=Marker(
            size=8,
            color='rgb(205, 245, 100)'
        ),
        hoverinfo='skip'
    )]
)
        
layout = Layout(
    title='Birthplaces of Staff Members',
    autosize=True,
    hovermode='closest',
    showlegend=False,
    mapbox=dict(
        accesstoken=mapbox_access_token,
        bearing=0,
        center=dict(
            lat=38,
            lon=-94
        ),
        pitch=0,
        zoom=3,
        style='dark'
    ),
)

fig = dict(data=data, layout=layout)
py.iplot(fig, filename='StaffBirthPlaces')

# Bokeh

In [None]:
#note: newer Bokeh releases no longer include bokeh.charts (it was split out into the
#separate bkcharts package), so this cell needs an older Bokeh release
import pandas as pd
from bokeh.charts import Donut
from bokeh.io import output_notebook, show
from bokeh.models import HoverTool, layouts


output_notebook() #to generate plots inline

a_at_SNP1 = 500
c_at_SNP1 = 400
g_at_SNP1 = 300
t_at_SNP1 = 200

baseDistSNP1 = [a_at_SNP1, c_at_SNP1, g_at_SNP1, t_at_SNP1]



d_snp1 = Donut(pd.Series(baseDistSNP1, index=['A', 'C', 'G', 'T']),title = "My Base Distribution")



p = layouts.Row(d_snp1)

# show the results
show(p)

In [None]:
baseDistSNP1_percent = [70,20,5,5]
TOOLS = 'pan,wheel_zoom,box_zoom,resize,reset,save,box_select,hover'
df1 = pd.DataFrame({'Base':['A', 'C', 'G', 'T'],'Counts': baseDistSNP1_percent})

d_snp1 = Donut(df1,label = 'Base',values ='Counts', legend = True,title = "Base Distribution At rs362307",color=['#ffa700','#d62d20','#0057e7','#008744'],tools=TOOLS)


hover1 = d_snp1.select(dict(type=HoverTool))
hover1.tooltips = [("Percent","@values%")]


p = layouts.Row(d_snp1)
show(p)

References:

https://plot.ly/python/ipython-notebook-tutorial/

https://data.world/

https://seaborn.pydata.org/

Python for Data Analysis - Wes McKinney

Machine Learning Mastery - Jason Brownlee

Machine Learning Videos YouTube - Josh Gordon


