# Data Science Analysis with Python

Before we start, we need to install a few things (in this order) so that we can visualize decision trees. The instructions differ for macOS and Windows.

# Set up

### MAC

<p>Open the Terminal and do the following:</p>
<p>
<ol>
  <li>Run the following command and hit 'Enter':
    <pre><code>xcode-select --install</code></pre></li>
  <li>Run the Xcode installer. Once the installation is complete, run the following command to install Homebrew:
    <pre><code>/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"</code></pre> </li>
  <li>Once that finishes, run the following command to check that Homebrew is installed and working:
    <pre><code>brew doctor</code></pre></li>
  <li>Enter the command below to install Graphviz:
    <pre><code>brew install graphviz</code></pre></li>
  <li>Install pydotplus and the other packages (scikit-learn is the package that provides the <code>sklearn</code> module):
<pre><code>pip install pydotplus</code></pre>
<pre><code>pip install pandas</code></pre>
<pre><code>pip install seaborn</code></pre>
<pre><code>pip install scikit-learn</code></pre></li></ol>
</p>

### WINDOWS

<ol>
<li>Download and install the Graphviz .msi installer</li>
<li>Add the folder that contains the executables (e.g., C:\Program Files (x86)\Graphviz2.38\bin) to the PATH environment variable</li>
<li>Install the packages by executing these commands in a terminal (scikit-learn is the package that provides the <code>sklearn</code> module):
<pre><code>pip install pydotplus</code></pre>
<pre><code>pip install pandas</code></pre>
<pre><code>pip install seaborn</code></pre>
<pre><code>pip install scikit-learn</code></pre></li>
</ol>

#### Select the following cells and press SHIFT-ENTER. You must be able to run all of them without errors.

In [None]:
import pandas as pd

In [None]:
import numpy as np

In [None]:
import seaborn as sns

In [None]:
# %pylab is discouraged because it fills the namespace with NumPy and matplotlib names.
%matplotlib inline

In [None]:
import sklearn as sk

In [None]:
import sklearn.tree as tree

In [None]:
from IPython.display import Image  

In [None]:
import pydotplus

# DataFrame

DataFrame = Table. Let's load a data set in csv format.

Show the first few rows
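A minimal sketch of both steps. The notebook does not name the CSV file, so the example reads a few made-up rows from an in-memory string; with the Kaggle download you would pass the file path (for instance `train.csv`, an assumed name) to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Made-up rows standing in for the real CSV file (illustration only).
csv_text = """PassengerId,Survived,Pclass,Sex,Age,Fare
1,0,3,male,22,7.25
2,1,1,female,38,71.28
3,1,3,female,26,7.92
4,0,2,male,35,13.00
5,0,3,male,,8.05
6,1,2,female,14,30.07
"""

# pd.read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# head() returns the first 5 rows by default; pass a number for a different count.
first_rows = df.head()
print(first_rows)
```

In a notebook, ending a cell with `df.head()` renders the rows as a table, so the `print` call is not needed there.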

### Columns
This data set is available at <a href="https://www.kaggle.com/c/titanic">https://www.kaggle.com/c/titanic</a>
<ul>
<li><b>Survived (dependent variable)</b>: binary attribute that indicates whether the passenger survived. 
<li><b>Pclass</b>: Ticket class (1 = 1st class, 2 = 2nd class, 3 = 3rd class)
<li><b>Name</b>: Passenger name
<li><b>Sex</b>: male/female
<li><b>Age</b>: Passenger age
<li><b>SibSp</b>: The number of the passenger's siblings/spouses aboard the Titanic
<li><b>Parch</b>: The number of the passenger's parents/children aboard the Titanic
<li><b>Ticket</b>: The ticket number
<li><b>Fare</b>: The ticket fare
<li><b>Cabin</b>: The cabin number
<li><b>Embarked</b>: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
</ul>

How many elements are in the DataFrame?

Let's display some summary statistics.
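One possible way to answer both questions, shown on a small made-up table so the snippet runs on its own (in the notebook, `df` would be the Titanic DataFrame):

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 13.00],
})

n_rows, n_cols = df.shape   # (number of rows, number of columns)
n_cells = df.size           # rows * columns
print(n_rows, n_cols, n_cells)

# describe() reports count, mean, std, min, quartiles, and max
# for each numeric column.
summary = df.describe()
print(summary)
```

`len(df)` is a shortcut for the number of rows alone.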

# Series

Series = column (or row) of a DataFrame.

We can compute average, median, etc.
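A short sketch on a made-up `Age` column (the values are illustrative, not taken from the data set):

```python
import pandas as pd

# Made-up ages, including one missing value.
df = pd.DataFrame({"Age": [22.0, 38.0, 26.0, 35.0, None]})

ages = df["Age"]        # selecting one column returns a Series
first_row = df.iloc[0]  # selecting one row also returns a Series

# mean() and median() skip missing values (NaN) by default.
mean_age = ages.mean()
median_age = ages.median()
print(mean_age, median_age)
```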

# Selection

Select the passengers younger than 40
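Selection is typically done with a boolean mask. A minimal sketch on made-up rows:

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Age": [22.0, 54.0, 38.0, 40.0, None],
    "Survived": [0, 0, 1, 1, 0],
})

# df["Age"] < 40 is a boolean Series; indexing with it keeps the True rows.
# Comparisons with NaN are False, so passengers of unknown age are left out.
younger = df[df["Age"] < 40]
print(younger)
```

Conditions can be combined with `&` and `|`, with each condition wrapped in parentheses.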

# Data Cleaning

Let's get rid of Name, Ticket, and Cabin

We need dummy variables for sex and embarked

We can remove one Sex dummy and one Embarked dummy, since each is implied by the remaining ones

Eliminate rows with NaN values
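The four cleaning steps above could look like the sketch below. The rows are made up, and which dummy to drop (`Sex_female` and `Embarked_C` here) is an arbitrary choice made for this example.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Name": ["A", "B", "C", "D"],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 35.0],
    "Ticket": ["T1", "T2", "T3", "T4"],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", "Q", "S"],
})

# 1. Drop the free-text columns.
df = df.drop(columns=["Name", "Ticket", "Cabin"])

# 2. Create one 0/1 dummy column per category of Sex and Embarked.
df = pd.get_dummies(df, columns=["Sex", "Embarked"], dtype=int)

# 3. Drop one dummy per variable; it is implied by the others.
df = df.drop(columns=["Sex_female", "Embarked_C"])

# 4. Remove the rows that still contain a missing value.
df = df.dropna()
print(df)
```

The order matters: `Cabin` is mostly empty in this example, so calling `dropna()` before dropping it would discard most rows.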

Add a new calculated column <b>LargeFamily</b>: 1 if passenger was travelling with 3 or more family members
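A possible implementation, assuming "family members" means the sum of `SibSp` and `Parch` (the text does not define it further):

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({"SibSp": [1, 0, 3, 1], "Parch": [0, 0, 2, 2]})

# Assumption: family members aboard = siblings/spouses + parents/children.
family_size = df["SibSp"] + df["Parch"]

# The comparison gives True/False; astype(int) turns that into 1/0.
df["LargeFamily"] = (family_size >= 3).astype(int)
print(df)
```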

## Decision Tree for Knowledge Discovery

The goal here is to find what made it more or less likely to survive the Titanic sinking. To make things easier, let us just analyze the whole data set.

Let's train a decision tree of max_depth 2
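A minimal sketch of the training step, using the names `X`, `Y`, and `dt` that the visualization cell below expects. The rows are made up and stand in for the cleaned Titanic data; `random_state=0` is added only to make the run repeatable.

```python
import pandas as pd
import sklearn.tree as tree

# Made-up, already-cleaned rows for illustration only.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1, 0, 1],
    "Pclass":   [3, 1, 3, 2, 3, 2, 3, 1],
    "Sex_male": [1, 0, 0, 1, 1, 0, 1, 0],
    "Age":      [22, 38, 26, 35, 54, 14, 20, 58],
})

X = df.drop(columns=["Survived"])  # attributes
Y = df["Survived"]                 # class labels

# max_depth=2 keeps the tree small enough to read.
dt = tree.DecisionTreeClassifier(max_depth=2, random_state=0)
dt.fit(X, Y)
print(dt.get_depth())
```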

See slides for how to read the tree

Visualize the tree

In [None]:
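# Visualize a decision tree dt, trained with the attributes in X and the class labels in Y.
# Uses np, tree, pydotplus, and Image, all imported in the set-up cells above.
dt_feature_names = list(X.columns)
# export_graphviz expects class names as strings, in ascending order of the labels.
# (np.string_ produced byte strings and was removed in NumPy 2.0.)
dt_target_names = [str(label) for label in np.sort(Y.unique())]
tree.export_graphviz(dt, out_file='tree.dot',
    feature_names=dt_feature_names, class_names=dt_target_names,
    filled=True)
graph = pydotplus.graph_from_dot_file('tree.dot')
Image(graph.create_png())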

## Validating the finding with seaborn
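Which finding to validate depends on the splits the tree chose. As a sketch, assume hypothetically that the tree split on `Sex_male` and `Pclass`; a bar plot of the survival rate per group then shows whether the raw data agree. The rows below are made up.

```python
import pandas as pd
import seaborn as sns

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "Sex_male": [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
    "Survived": [1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0],
})

# barplot draws the mean of y within each x/hue group;
# the mean of a 0/1 column is the survival rate.
ax = sns.barplot(data=df, x="Pclass", y="Survived", hue="Sex_male")
ax.set_ylabel("Survival rate")

# The same numbers as a table, for cross-checking the plot.
rates = df.groupby(["Pclass", "Sex_male"])["Survived"].mean()
print(rates)
```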

## Prediction

### Cross-validation performance over whole set

Build a RandomForest classifier

Run 10-fold cross validation and record the AUC
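These two steps might be sketched as follows. The data come from `make_classification`, a synthetic stand-in for the cleaned `X` and `Y`, and the forest settings (`n_estimators=50`, `random_state=0`) are assumptions chosen for a quick, repeatable run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the cleaned Titanic attributes and labels.
X, Y = make_classification(n_samples=200, n_features=6, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)

# cv=10 runs 10-fold cross-validation (stratified for classifiers);
# scoring="roc_auc" records the AUC of each fold.
auc_scores = cross_val_score(clf, X, Y, cv=10, scoring="roc_auc")
print(auc_scores.mean())
```

Reporting the mean of the ten fold scores, possibly with their standard deviation, is a common summary.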

### Out-of-sample

Split data set X,Y into training set (70%) and test set (30%)

Train on training set

Get binary predictions on test set

And probability predictions

Accuracy and AUC on test set
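The out-of-sample steps above, put together in one sketch. As before, the data are a synthetic stand-in for `X` and `Y`, and the classifier settings are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned Titanic attributes and labels.
X, Y = make_classification(n_samples=200, n_features=6, random_state=0)

# 70% training set, 30% test set.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=0)

# Train on the training set only.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, Y_train)

# Binary predictions (0 or 1) on the test set.
Y_pred = clf.predict(X_test)

# Probability predictions; column 1 is the probability of class 1.
Y_prob = clf.predict_proba(X_test)[:, 1]

# Accuracy uses the binary predictions, AUC uses the probabilities.
accuracy = accuracy_score(Y_test, Y_pred)
auc = roc_auc_score(Y_test, Y_prob)
print(accuracy, auc)
```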

### Out-of-sample with tuning

Let's find the best classifier by tuning the parameters on the training set

First, train the model on the whole training set

Accuracy on test set

AUC on test set
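One way to carry out the tuning steps in this subsection is a grid search. This is a sketch under assumptions: the data are synthetic, and the parameter grid is hypothetical and kept deliberately small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the cleaned Titanic attributes and labels.
X, Y = make_classification(n_samples=200, n_features=6, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=0)

# Hypothetical grid; the values are illustrative, not recommendations.
param_grid = {"max_depth": [2, 4, None], "n_estimators": [25, 50]}

# Cross-validation happens inside the training set only, so the
# test set plays no part in choosing the parameters.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="roc_auc")
search.fit(X_train, Y_train)

# With the default refit=True, the best parameter combination is
# retrained on the whole training set.
best = search.best_estimator_
print(search.best_params_)

accuracy = accuracy_score(Y_test, best.predict(X_test))
auc = roc_auc_score(Y_test, best.predict_proba(X_test)[:, 1])
print(accuracy, auc)
```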