# Data Science Analysis with Python

Before we start, we need to install a few things, in the order given, so that we can visualize decision trees. The instructions differ for Mac and Windows systems.

# Set up

### MAC

<p>Open the Terminal and do the following:</p>
<p>
<ol>
  <li>Run the following command and hit 'Enter': 
    <pre><code>xcode-select --install</code></pre>
  <li>Run the Xcode installer. Once the installation is complete, run the following command to install Homebrew:
    <pre><code>/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"</code></pre> </li>
  <li>Once that finishes, run the following command to check that Homebrew is installed and working:
    <pre><code>brew doctor</code></pre>
  <li>Enter the command below to install graphviz: 
    <pre><code>brew install graphviz</code></pre>
  <li>Install pydotplus and other packages: 
<pre><code>pip install pydotplus</code></pre>
<pre><code>pip install pandas</code></pre>
<pre><code>pip install seaborn</code></pre>
<pre><code>pip install scikit-learn</code></pre></ol>
</p>

### WINDOWS

<ol>
<li>Download and install the Graphviz .msi installer
<li>Add the folder that contains the Graphviz executables (e.g., C:\Program Files (x86)\Graphviz2.38\bin) to the PATH environment variable
<li>Install packages by executing these commands in a terminal: 
<pre><code>pip install pydotplus</code></pre>
<pre><code>pip install pandas</code></pre>
<pre><code>pip install seaborn</code></pre>
<pre><code>pip install scikit-learn</code></pre>
</ol>
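
Before moving on, it can help to confirm that the Graphviz `dot` executable is reachable from Python, since that is what the tree rendering relies on later. The check below is a minimal sketch that uses only the standard library; it assumes the executable is named `dot`, which is the usual name for Graphviz's layout program.

```python
import shutil

# Look for the Graphviz "dot" executable on the PATH.
dot_path = shutil.which("dot")

if dot_path is None:
    print("Graphviz 'dot' was not found on the PATH; tree rendering will likely fail.")
else:
    print("Graphviz 'dot' found at:", dot_path)
```

If the first message appears, revisit the PATH step above and restart the terminal or notebook server so that the change is picked up.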

#### Select the following cells and press SHIFT-ENTER. You must be able to run all of them without errors.

In [None]:
import pandas as pd

In [None]:
import numpy as np

In [None]:
import seaborn as sns

In [None]:
%pylab inline

In [None]:
import sklearn as sk

In [None]:
import sklearn.tree as tree

In [None]:
from IPython.display import Image  

In [None]:
import pydotplus

# DataFrame

DataFrame = Table. Let's load a data set in csv format.

In [None]:
df = pd.read_csv('titanic.csv',index_col=0)

Show the first few rows

In [None]:
df.head()
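
To see what `index_col=0` does, here is a small sketch that reads a made-up three-row CSV from a string instead of a file. The column names mimic the Titanic data, but the rows and the `PassengerId` index column are assumptions for illustration only.

```python
import io
import pandas as pd

# Hypothetical three-row CSV; the real titanic.csv has more rows and columns.
csv_text = """PassengerId,Survived,Pclass,Age
1,0,3,22.0
2,1,1,38.0
3,1,3,26.0
"""

# index_col=0 turns the first CSV column into the row index
# instead of keeping it as a regular data column.
toy = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(toy.shape)       # 3 rows, 3 data columns
print(toy.index.name)  # the first column became the index
print(toy.head())
```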

### Columns
This data set is available at <a href="https://www.kaggle.com/c/titanic">https://www.kaggle.com/c/titanic</a>
<ul>
<li><b>Survived (dependent variable)</b>: binary attribute that indicates whether the passenger survived. 
<li><b>Pclass</b>: Ticket class (1 = 1st class, 2 = 2nd class, 3 = 3rd class)
<li><b>Name</b>: Passenger name
<li><b>Sex</b>: male/female
<li><b>Age</b>: Passenger age
<li><b>SibSp</b>: The number of the passenger's siblings/spouses aboard the Titanic
<li><b>Parch</b>: The number of the passenger's parents/children aboard the Titanic
<li><b>Ticket</b>: The ticket number
<li><b>Fare</b>: The ticket fare
<li><b>Cabin</b>: the cabin number
<li><b>Embarked</b>: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
</ul>

How many elements are in the DataFrame?

In [None]:
len(df)

Let's display some summary statistics.

In [None]:
df.describe()

# Series

Series = column (or row) of a DataFrame.

In [None]:
df['Survived']

We can compute average, median, etc.

In [None]:
df['Survived'].mean()
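
Because `Survived` only holds 0s and 1s, its mean is simply the share of passengers who survived. The sketch below illustrates this with five made-up values, not the real data.

```python
import pandas as pd

# Hypothetical values: 3 survivors out of 5 passengers.
survived = pd.Series([0, 1, 1, 0, 1], name="Survived")

rate = survived.mean()            # mean of a 0/1 column = proportion of 1s
counts = survived.value_counts()  # how many 0s and how many 1s

print(rate)
print(counts.to_dict())
```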

# Selection

Select the passengers younger than 40

In [None]:
df[df['Age'] < 40]
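
The expression inside the brackets is a Boolean Series (a mask) with one True/False value per row; only the True rows are kept. The sketch below uses made-up ages to show the mask, including how a missing age behaves.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including one missing value.
toy = pd.DataFrame({"Age": [22.0, 54.0, np.nan, 4.0]})

mask = toy["Age"] < 40   # comparing NaN with 40 gives False
young = toy[mask]        # keep only the rows where the mask is True

print(mask.tolist())
print(young)

# Conditions can be combined with & (and) and | (or); each needs parentheses.
young_not_child = toy[(toy["Age"] < 40) & (toy["Age"] >= 18)]
print(young_not_child)
```

Note that passengers with an unknown age are silently excluded by this kind of selection.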

# Data Cleaning

Let's get rid of Name, Ticket, and Cabin

In [None]:
df  = df.drop(['Name','Ticket','Cabin'],axis=1)

We need dummy variables for sex and embarked

In [None]:
df = pd.get_dummies(df, columns=['Sex','Embarked'])

We can remove one sex column and one embarked column, since each is implied by the remaining ones

In [None]:
df= df.drop(['Sex_male', 'Embarked_S'],axis=1)
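
The following sketch shows on a made-up three-row table which columns `pd.get_dummies` creates and why one column per variable can be dropped: `Sex_male` is always the opposite of `Sex_female`, so it carries no extra information. The rows are invented for illustration.

```python
import pandas as pd

# Hypothetical rows covering both sexes and all three ports.
toy = pd.DataFrame({"Sex": ["male", "female", "female"],
                    "Embarked": ["S", "C", "Q"]})

# One indicator column is created per category of each listed column.
dummies = pd.get_dummies(toy, columns=["Sex", "Embarked"])
print(list(dummies.columns))

# Drop one redundant indicator per original variable.
reduced = dummies.drop(["Sex_male", "Embarked_S"], axis=1)
print(list(reduced.columns))
```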

Eliminate rows that contain NaN (missing values)

In [None]:
df = df.dropna()
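
`dropna()` removes every row that has at least one missing value, so it is worth checking how much data is lost. A minimal sketch on invented values:

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing Age and one missing Fare.
toy = pd.DataFrame({"Age": [22.0, np.nan, 26.0],
                    "Fare": [7.25, 71.28, np.nan]})

missing_per_column = toy.isna().sum()  # number of NaN values in each column
cleaned = toy.dropna()                 # keep only fully populated rows

print(missing_per_column.to_dict())
print(len(toy), "rows before,", len(cleaned), "rows after")
```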

Add a new calculated column <b>LargeFamily</b>: 1 if passenger was travelling with 3 or more family members

In [None]:
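# Count relatives aboard, then flag passengers with 3 or more as 1 (otherwise 0)
df['LargeFamily'] = ((df['SibSp'] + df['Parch']) >= 3).astype(int)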

## Decision Tree for Knowledge Discovery

The goal here is to find what made it more or less likely to survive the Titanic sinking. To make things easier, let us just analyze the whole data set.

Let's train a decision tree of max_depth 2

In [None]:
dt = tree.DecisionTreeClassifier(max_depth = 2)

In [None]:
X = df.drop('Survived',axis=1)
Y = df.Survived

In [None]:
dt.fit(X,Y)

See slides for how to read the tree

Visualize the tree

In [None]:
# This code will visualize a decision tree dt, trained with the attributes in X and the class labels in Y
dt_feature_names = list(X.columns)
dt_target_names = [str(c) for c in dt.classes_]  # class labels in the order the tree uses them
tree.export_graphviz(dt, out_file='tree.dot', 
    feature_names=dt_feature_names, class_names=dt_target_names,
    filled=True)  
graph = pydotplus.graph_from_dot_file('tree.dot')
Image(graph.create_png())
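
If Graphviz is not available, scikit-learn can also print a tree as plain text with `tree.export_text`. The sketch below trains a tiny tree on made-up data (the column names mirror the notebook, the values are invented) and prints its rules.

```python
import pandas as pd
from sklearn import tree

# Hypothetical data in which sex alone determines survival.
X_toy = pd.DataFrame({"Sex_female": [0, 0, 0, 1, 1, 1],
                      "Pclass":     [3, 1, 2, 1, 2, 3]})
y_toy = pd.Series([0, 0, 0, 1, 1, 1])

dt_toy = tree.DecisionTreeClassifier(max_depth=2, random_state=0)
dt_toy.fit(X_toy, y_toy)

# Text rendering of the fitted tree; no Graphviz needed.
rules = tree.export_text(dt_toy, feature_names=list(X_toy.columns))
print(rules)
```

Each indented line is a test on one feature, and the `class:` lines are the predictions at the leaves.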

## Validating the finding with seaborn

In [None]:
import seaborn as sns

In [None]:
# factorplot was renamed to catplot in newer seaborn versions
sns.catplot(data=df,x='Pclass',y='Survived',kind='bar',hue='Sex_female')

In [None]:
df2 = df.copy()
# Split Age into two bins: children up to 6.5 years, and everyone older
df2['Age'] = pd.cut(df2['Age'],bins=[0,6.5,1000])
sns.catplot(data=df2,x='Age',y='Survived',kind='bar',hue='Sex_female')
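
The bar heights in these plots are group averages of `Survived`. The same numbers can be computed directly with `groupby`, which is handy for checking a plot. The sketch below uses six invented passengers, not the real data.

```python
import pandas as pd

# Hypothetical passengers.
toy = pd.DataFrame({
    "Pclass":     [1, 1, 3, 3, 3, 3],
    "Sex_female": [1, 0, 1, 1, 0, 0],
    "Survived":   [1, 0, 1, 0, 0, 0],
})

# Survival rate for every (class, sex) combination.
rates = toy.groupby(["Pclass", "Sex_female"])["Survived"].mean()
print(rates)
```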

## Prediction

### Cross-validation performance over whole set

Build a RandomForest classifier

In [None]:
from sklearn import ensemble
cl = ensemble.RandomForestClassifier(n_jobs=-1,random_state=0)

Run 10-fold cross validation and record the AUC

In [None]:
import sklearn.model_selection as ms
kf = ms.KFold(10)
auc = ms.cross_val_score(cl,X,y=Y,cv=kf,scoring='roc_auc').mean()
auc
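
`KFold(10)` splits the rows in their original order. A common variation, sketched below on synthetic data, is to shuffle the rows and keep the class balance similar in every fold with `StratifiedKFold`. The data set from `make_classification` and the names `X_syn`, `y_syn`, and `cl_syn` are assumptions for illustration; they stand in for `X`, `Y`, and `cl`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data standing in for the Titanic table.
X_syn, y_syn = make_classification(n_samples=200, n_features=6, random_state=0)

cl_syn = RandomForestClassifier(n_estimators=20, random_state=0)

# 10 shuffled folds, each with roughly the same share of positives.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(cl_syn, X_syn, y_syn, cv=skf, scoring="roc_auc")

print(len(scores))                  # one AUC per fold
print(scores.mean(), scores.std())  # average and spread across folds
```

Looking at the spread across folds, not just the mean, gives a sense of how stable the estimate is.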

### Out-of-sample

Split data set X,Y into training set (70%) and test set (30%)

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.3,random_state=0)
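
To make the split concrete, the sketch below applies `train_test_split` to ten made-up rows. It also passes `stratify`, an optional argument not used in the cell above, which keeps the proportion of each class similar in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, classes perfectly balanced.
X_toy = pd.DataFrame({"Age": range(10)})
y_toy = pd.Series([0, 1] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy)

print(len(X_tr), len(X_te))  # 70% / 30% of the 10 rows
# No row appears in both parts.
print(set(X_tr.index).isdisjoint(X_te.index))
```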

Train on training set

In [None]:
cl.fit(X_train, Y_train)

Get binary predictions on test set

In [None]:
y_pred_test = cl.predict(X_test)

And probability predictions

In [None]:
y_pred_proba_test = cl.predict_proba(X_test)[:,1]

Accuracy and AUC on test set

In [None]:
import sklearn.metrics as metrics
metrics.accuracy_score(Y_test,y_pred_test)

In [None]:
metrics.roc_auc_score(Y_test,y_pred_proba_test)
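
Accuracy looks at the 0/1 predictions, while AUC looks at how well the predicted probabilities rank positives above negatives. A small worked example with four invented cases:

```python
from sklearn import metrics

# Hypothetical true labels and predicted probabilities of class 1.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Binary predictions from a 0.5 threshold: [0, 0, 0, 1]
y_label = [1 if s >= 0.5 else 0 for s in y_score]

# Accuracy: 3 of the 4 labels are correct.
acc = metrics.accuracy_score(y_true, y_label)

# AUC: of the 4 (positive, negative) pairs, 3 are ranked correctly;
# only the pair (0.35, 0.4) is in the wrong order.
auc_toy = metrics.roc_auc_score(y_true, y_score)

print(acc, auc_toy)
```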

### Out-of-sample with tuning

Let's find the best classifier by tuning the hyperparameters on the training set

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
cl.get_params()

In [None]:
parameters = {'min_samples_leaf':(1,5,10), 'n_estimators':[10, 20, 50], 'criterion':['gini','entropy']}

In [None]:
clf = GridSearchCV(cl, parameters,scoring='roc_auc',n_jobs=-1)
clf.fit(X_train,Y_train)
clf.best_estimator_
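
Besides `best_estimator_`, a fitted `GridSearchCV` object exposes the winning parameter combination and its cross-validated score. The sketch below runs a smaller grid on synthetic data; `X_syn`, `y_syn`, `grid`, and `search` are illustrative names, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for X_train, Y_train.
X_syn, y_syn = make_classification(n_samples=200, n_features=6, random_state=0)

grid = {"min_samples_leaf": [1, 5], "n_estimators": [10, 20]}
search = GridSearchCV(RandomForestClassifier(random_state=0), grid,
                      scoring="roc_auc", cv=3)
search.fit(X_syn, y_syn)

print(search.best_params_)  # the best combination found in the grid
print(search.best_score_)   # its mean cross-validated AUC

# With the default refit=True, the search object can predict directly
# using the best model retrained on all the data passed to fit().
proba = search.predict_proba(X_syn[:5])[:, 1]
print(proba)
```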

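Train the best model on the whole training set (with the default refit=True, GridSearchCV has already done this, so the step is shown only for clarity)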

In [None]:
clf.best_estimator_.fit(X_train, Y_train)

Accuracy on test set

In [None]:
y_pred_test = clf.best_estimator_.predict(X_test)
metrics.accuracy_score(Y_test,y_pred_test)

AUC on test set

In [None]:
y_pred_proba_test = clf.best_estimator_.predict_proba(X_test)[:,1]
metrics.roc_auc_score(Y_test,y_pred_proba_test)