## Week 6: Capstone project

* Week 5 recap
  * Predictive analytics using statsmodels
  * Predictive analytics using scikit-learn
* Week 6
  * No discussion post required for the week
  * Capstone project
    * Due end of course
    * Scoring rubric
    * Components of the capstone project

## Week 5 recap
* Predictive vs. descriptive analytics
  * Descriptive statistics summarize existing data
  * Predictive statistics let us draw conclusions beyond the existing data
* Descriptive statistics
  * Univariate
    * mean/std/percentiles
  * Multivariate
    * correlation/covariance    
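
As a quick refresher, the univariate and multivariate summaries listed above can be computed directly with pandas. This is a minimal sketch on a made-up four-row table; the values are illustrative assumptions, not the course data.

```python
import pandas as pd

# Hypothetical toy data (illustrative values only, not the tips dataset).
df = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "tip": [1.0, 3.0, 4.0, 8.0],
})

# Univariate: count, mean, std, min/max and the requested percentiles per column.
summary = df.describe(percentiles=[0.25, 0.5, 0.75])
print(summary)

# Multivariate: pairwise covariance and Pearson correlation between columns.
cov = df.cov()
corr = df.corr()
print(cov)
print(corr)
```
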

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn
import statsmodels
from patsy import dmatrices
import scipy.stats
from statsmodels.stats import weightstats
%matplotlib inline

## Statsmodels (https://www.statsmodels.org)

* Python module that provides functions and classes for statistical tests and models
* Offers an R-like 'formula' syntax and integrates tightly with pandas DataFrames
* Functionality
  * Statistical tests
  * Linear regression
  * Logistic regression (classification)
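
The regression and classification items are demonstrated in the cells that follow; the "statistical tests" item is not, so here is a minimal sketch of one. It runs a two-sample t-test with `statsmodels.stats.weightstats.ttest_ind` on synthetic data. The group means, sizes, and seed are assumptions chosen for illustration.

```python
import numpy as np
from statsmodels.stats import weightstats

# Synthetic samples (assumed parameters): two groups whose true means differ by 0.5.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=3.0, scale=1.0, size=200)
group_b = rng.normal(loc=3.5, scale=1.0, size=200)

# Two-sided test with pooled variance (the defaults).
# Returns the t statistic, the p-value, and the degrees of freedom.
tstat, pvalue, dof = weightstats.ttest_ind(group_a, group_b)
print(tstat, pvalue, dof)
```

A small p-value suggests the difference in sample means is unlikely under the hypothesis of equal population means.
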

In [None]:
# Minimal example: load the tips dataset bundled with seaborn (fetched on first use)
import seaborn as sns
tips = sns.load_dataset('tips')
tips.head()

### Linear regression
* Predicting continuous valued response variable
* Predictor variables can be continuous/discrete
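
To illustrate mixing continuous and discrete predictors, the sketch below fits an OLS model with one of each on synthetic data. The column names mirror the tips dataset, but the data and the "true" coefficients are assumptions made up for this example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (assumed): a continuous predictor and a two-level categorical one.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "total_bill": rng.uniform(5, 50, size=n),
    "time": rng.choice(["Lunch", "Dinner"], size=n),
})
# Assumed true relationship: tip = 1 + 0.1 * total_bill + 0.5 * (time == Dinner) + noise
df["tip"] = (
    1.0
    + 0.1 * df["total_bill"]
    + 0.5 * (df["time"] == "Dinner")
    + rng.normal(0, 0.2, size=n)
)

# C(time) marks the column as categorical; the formula machinery dummy-encodes it.
fit = smf.ols("tip ~ total_bill + C(time)", data=df).fit()
print(fit.params)
```

The categorical predictor shows up as a single dummy coefficient measured relative to the reference level (the alphabetically first one, `Dinner`, in this sketch).
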


In [None]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Ordinary least squares: model tip as a linear function of total_bill
results = smf.ols('tip ~ total_bill', data = tips).fit()
print(results.params)  # fitted intercept and slope
results.summary()

In [None]:
plt.plot(tips['total_bill'], tips['tip'], '.')
plt.plot(tips['total_bill'], results.fittedvalues, '.')
plt.legend(['actual tip', 'predicted tip'])

### Classification
* Instead of predicting the actual tip, we'd like to predict whether the tip percentage exceeds 20%
* Here the response variable is categorical (True, False)
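
Logistic regression handles a True/False response by modeling the probability of the positive class as a logistic (sigmoid) function of a linear score, then thresholding that probability. A minimal sketch of the idea follows; the coefficients are hypothetical values picked for illustration, not fitted ones.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients (assumptions, not estimated from data).
intercept, slope = -1.0, -0.02

total_bill = np.array([5.0, 20.0, 50.0])
prob = sigmoid(intercept + slope * total_bill)  # P(tip percentage > 20%)
label = prob > 0.5                              # hard True/False prediction
print(prob, label)
```
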


In [None]:
tips['tip_gt_20'] = (tips['tip']/tips['total_bill']  > 0.2).astype(np.float32)
tips['tip_gt_20'].mean()

In [None]:
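import statsmodels.api as sm
import statsmodels.formula.api as smf

# Build numeric response/design matrices from the formula (X includes an Intercept column)
y, X = dmatrices('tip_gt_20 ~ total_bill', data=tips)
# The array-based Logit class lives in statsmodels.api; the formula API provides lowercase smf.logit
model = sm.Logit(y, X)
result = model.fit()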

In [None]:
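yhat = result.predict(X)
# Predicted probabilities split by true class (sns.distplot is deprecated; histplot replaces it)
sns.histplot(yhat[np.asarray(y)[:,0] > 0], kde=True, stat='density', label='Positive')
sns.histplot(yhat[np.asarray(y)[:,0] == 0], kde=True, stat='density', label='Negative')
plt.legend()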

In [None]:
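import sklearn.metrics

print('Classification report')
# The 0.15 cutoff is below 0.5: lowering it trades precision for recall on the positive class
yhat = result.predict(X) > 0.15
print(sklearn.metrics.classification_report(np.asarray(y).ravel(), yhat))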

## Scikit-learn (http://scikit-learn.org)

<div align="left">
<img src="http://scikit-learn.org/stable/_images/sphx_glr_plot_cluster_comparison_thumb.png">
</div>

* Python library for data mining, data analysis and machine learning
* Built on top of numpy, scipy and matplotlib

### Regression using sklearn
* scikit-learn estimators take numeric arrays, so DataFrames are converted to numeric matrices (here using patsy)
* Non-linear techniques can be used when a linear fit is not accurate enough
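
To make the conversion step concrete, the sketch below shows what patsy's `dmatrices` produces for a formula with one numeric and one categorical predictor. The toy DataFrame is an assumption for illustration; `return_type="dataframe"` is used only so the column names are easy to inspect.

```python
import pandas as pd
from patsy import dmatrices

# Hypothetical toy data (illustrative values only).
df = pd.DataFrame({
    "tip": [1.0, 3.0, 4.0, 8.0],
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "day": ["Fri", "Sat", "Fri", "Sat"],
})

# y is the response column; X is the design matrix with an added Intercept
# column and a dummy (0/1) column for the non-reference level of `day`.
y, X = dmatrices("tip ~ total_bill + day", data=df, return_type="dataframe")
print(X.columns.tolist())
print(X.shape, y.shape)
```

Because `X` already carries an `Intercept` column, an estimator that adds its own intercept would duplicate it, which is why the linear model in the next cell is created with `fit_intercept=False`.
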

In [None]:
# Linear regression model
from sklearn import linear_model

# dmatrices already adds an Intercept column to X, so the estimator must not add a second one
model = linear_model.LinearRegression(fit_intercept=False)

y, X = dmatrices('tip ~ total_bill', data=tips)
res = model.fit(X, y)
plt.plot(X[:,1], y, '.')
plt.plot(X[:,1], model.predict(X), '.')
plt.legend(['actual tip', 'predicted tip'])

In [None]:
# Non-linear (tree-based) regression model
import sklearn.ensemble
model = sklearn.ensemble.RandomForestRegressor()
# scikit-learn expects a 1-D target; patsy returns y as an (n, 1) column, so flatten it
model.fit(X, np.asarray(y).ravel())
plt.plot(X[:,1], y, 'r.')
plt.plot(X[:,1], model.predict(X), 'g.')
plt.legend(['actual tip', 'predicted tip'])

### Classification using sklearn
* Instead of predicting the actual tip, we'd like to predict whether the tip percentage exceeds 20%
* Here the response variable is categorical (True, False)
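
scikit-learn classifiers expose both hard labels (`predict`) and class probabilities (`predict_proba`). The sketch below shows how the two relate for a binary logistic regression. The data are synthetic, and the assumed relationship (smaller bills are more likely to receive a tip above 20%) is made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumed relationship, not the tips dataset).
rng = np.random.default_rng(0)
x = rng.uniform(5, 50, size=300)
p_true = 1.0 / (1.0 + np.exp(0.1 * (x - 20)))   # probability falls as the bill grows
y = (rng.uniform(size=300) < p_true).astype(int)

X = x.reshape(-1, 1)          # sklearn expects a 2-D feature matrix
clf = LogisticRegression().fit(X, y)

proba = clf.predict_proba(X)[:, 1]   # column 1 holds P(class == 1)
hard = clf.predict(X)                # labels, equivalent to thresholding proba at 0.5
print(proba[:5], hard[:5])
```

Working with `predict_proba` directly makes it possible to choose a cutoff other than 0.5 when the classes are imbalanced.
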

In [None]:
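import sklearn.linear_model
import sklearn.metrics

# C is the inverse regularization strength; C=0.1 regularizes more strongly than the default 1.0
model = sklearn.linear_model.LogisticRegression(C=0.1)
y,X = dmatrices('tip_gt_20 ~ total_bill + sex + time + day', data=tips)
# Flatten the (n, 1) patsy column to the 1-D target scikit-learn expects
model.fit(X, np.asarray(y).ravel())
yhat = model.predict(X)
print(sklearn.metrics.classification_report(np.asarray(y).ravel(), yhat))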

In [None]:
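yhat = model.predict_proba(X)
# Column 1 of predict_proba is P(tip_gt_20 == 1); sns.distplot is deprecated, so use histplot
sns.histplot(yhat[np.asarray(y)[:,0] > 0, 1], kde=True, stat='density', label='Positive')
sns.histplot(yhat[np.asarray(y)[:,0] == 0, 1], kde=True, stat='density', label='Negative')
plt.legend()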

In [None]:
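# Non-linear model
import sklearn.ensemble
import sklearn.metrics
y, X = dmatrices('tip_gt_20 ~ total_bill', data=tips)
model = sklearn.ensemble.RandomForestClassifier()
# Fit on a flattened 1-D target to avoid scikit-learn's column-vector warning
model.fit(X, np.asarray(y).ravel())
yhat = model.predict(X)
# Note: this report is computed on the training data, so it overstates accuracy on unseen data
print(sklearn.metrics.classification_report(np.asarray(y).ravel(), yhat.ravel()))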

In [None]:
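yhat = model.predict_proba(X)
# Predicted probability of the positive class, split by true class (histplot replaces deprecated distplot)
sns.histplot(yhat[np.asarray(y)[:,0] > 0, 1], kde=True, stat='density', label='Positive')
sns.histplot(yhat[np.asarray(y)[:,0] == 0, 1], kde=True, stat='density', label='Negative')
plt.legend()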