## Classification trees

Classification trees are very similar to regression trees. Here is a quick comparison:

|regression trees|classification trees|
|---|---|
|predict a continuous response|predict a categorical response|
|predict using mean response of each leaf|predict using most commonly occurring class of each leaf|
|splits are chosen to minimize MSE|splits are chosen to minimize Gini index (discussed below)|

Here's an **example of a classification tree**, which predicts whether Barack Obama or Hillary Clinton would win the Democratic primary in a particular county in 2008:

<img src="Images/obama_clinton_tree.jpg">

**A few questions:**

- What is the response variable?
- What are the features?
- What is the most predictive feature?
- How would we calculate the total number of counties?

## Splitting criteria for classification trees

Here are common options for the splitting criteria:

- **classification error rate:** fraction of training observations in a region that don't belong to the most common class
- **Gini index:** measure of total variance across classes in a region
- **cross-entropy:** numerically similar to Gini index

The goal when splitting is to increase the "node purity", and it turns out that the **Gini index and cross-entropy** are better measures of purity than classification error rate. The Gini index is faster to compute than cross-entropy, so it is generally preferred (and is used by scikit-learn by default).
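To make the three criteria concrete, here is a minimal sketch that computes each one from the class counts in a node. The function names and the two example splits are illustrative assumptions, not part of scikit-learn. The formulas are the usual textbook definitions, with p_k the fraction of observations in class k: error = 1 - max(p_k), Gini = 1 - sum(p_k^2), and cross-entropy = -sum(p_k * log2(p_k)). The base-2 logarithm is an assumed convention here; other bases only rescale the value.

```python
import math

def proportions(counts):
    """Turn per-class counts in a node into class fractions p_k."""
    total = sum(counts)
    return [c / total for c in counts]

def classification_error(counts):
    # fraction of observations not in the most common class
    return 1.0 - max(proportions(counts))

def gini_index(counts):
    return 1.0 - sum(p ** 2 for p in proportions(counts))

def cross_entropy(counts):
    # skip empty classes, since 0 * log(0) is taken to be 0
    return sum(-p * math.log2(p) for p in proportions(counts) if p > 0)

def weighted_impurity(children, measure):
    """Average a measure over child nodes, weighted by node size."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * measure(c) for c in children)

# Two hypothetical splits of a parent node with 400 observations per class
split_a = [[300, 100], [100, 300]]
split_b = [[200, 400], [200, 0]]

for name, split in [("A", split_a), ("B", split_b)]:
    print(name,
          round(weighted_impurity(split, classification_error), 3),
          round(weighted_impurity(split, gini_index), 3),
          round(weighted_impurity(split, cross_entropy), 3))
```

In this example both splits have the same classification error rate (0.25), but the Gini index and cross-entropy are both lower for split B, which produces one completely pure node. This is one illustration of why those two measures are more sensitive to node purity.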

## Titanic Survival Prediction 
<img src="Images/Titanic_Image.jpg" width="50%">


We'll build a classification tree using the [Titanic data](https://www.kaggle.com/c/titanic-gettingStarted/data) provided by Kaggle.

In [None]:
import pandas as pd
import numpy as np

In [None]:
# read in the data
df = pd.read_csv('./Datasets/Titanic_train.csv')
df.info()

Let's choose our response variable and a few features, and review **how to handle categorical features**:

- **Survived:** This is our response variable, and is already encoded as 0=died and 1=survived.
- **Pclass:** These are the passenger class categories (1=first class, 2=second class, 3=third class). They are logically ordered, so we'll leave them as-is. (If the tree splits on this feature, the splits will occur at 1.5 or 2.5.)
- **Sex:** This is a binary category, so we should encode it as 0=female and 1=male. (If the tree splits on this feature, the split will occur at 0.5.)
- **Age:** This is a numeric feature, but we need to fill in the missing values.
- **Embarked:** This is the port they embarked from. There are three unordered categories, so we should create dummy variables and drop one level as usual.
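The encoding scheme in the list above can be sketched on a few made-up rows. The `toy` DataFrame is hypothetical and only mimics the Titanic column names; it is not the real data.

```python
import pandas as pd

# Hypothetical toy rows that mimic the Titanic columns (not the real data)
toy = pd.DataFrame({
    'Pclass': [3, 1, 2, 3],
    'Sex': ['male', 'female', 'female', 'male'],
    'Embarked': ['S', 'C', 'Q', 'S'],
})

# Pclass is ordered, so it stays as-is.
# Sex is binary, so map it to 0=female and 1=male.
toy['Sex'] = toy['Sex'].map({'female': 0, 'male': 1})

# Embarked is unordered: create dummy variables and drop the first level (C)
embarked_dummies = pd.get_dummies(toy['Embarked'], prefix='Embarked', drop_first=True)
toy = pd.concat([toy.drop('Embarked', axis=1), embarked_dummies], axis=1)

print(toy)
```

After this step the frame has the columns `Pclass`, `Sex`, `Embarked_Q`, and `Embarked_S`; a row with both dummy columns equal to zero embarked at C.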

### Data Pre-Processing

In [None]:
df.info()

If you look closely at the pandas summary above, there are 891 rows in total, but Age has only 714 non-null values (the rest are missing), Embarked is missing 2 values, and Cabin is missing most of its values. Columns with the object data type are non-numeric, so we have to find a way to encode them as numbers. One way is to create dummy (indicator) columns, where each distinct value in a column becomes its own column header.
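Rather than reading the missing counts off the `info()` output, they can also be computed directly. This is a minimal sketch on a hypothetical three-row frame; on the real data the same `isnull().sum()` call would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing entries (not the real data)
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0],
    'Embarked': ['S', None, 'C'],
    'Fare': [7.25, 71.28, 7.92],
})

# Number of missing values in each column
missing_counts = toy.isnull().sum()
print(missing_counts)
```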

#### Let's drop some columns that may not contribute much to our machine learning model, such as Name, Ticket, and Cabin.

In [None]:
cols = ['Name','Ticket','Cabin']
df = df.drop(cols,axis=1)
df.info()

#### As noted, Age has only 714 non-null values (the rest are missing). The easiest fix is to drop the rows with missing values.

In [None]:
df_temp=df.dropna()
df_temp.info()

#### But this discards too much training data. pandas has a handy interpolate() method that replaces the missing values (NaNs) with interpolated ones.

In [None]:
df['Age'] = df['Age'].interpolate()
df.info()
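To see what `interpolate()` does, here is a small sketch on a hypothetical Series of ages. By default it fills each gap linearly between the neighbouring rows, so the result depends on row order. Filling with the median of the observed values is shown alongside as a common alternative; that alternative is an addition for comparison, not what the cell above uses.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with gaps (not the real data)
ages = pd.Series([22.0, np.nan, 26.0, np.nan, np.nan, 35.0])

# Default linear interpolation between neighbouring rows
interpolated = ages.interpolate()

# Alternative: fill every gap with the median of the observed ages
median_filled = ages.fillna(ages.median())

print(pd.DataFrame({'original': ages,
                    'interpolated': interpolated,
                    'median_filled': median_filled}))
```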

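#### Now we convert Pclass, Sex, and Embarked into dummy (indicator) columns with pandas, and drop the original columns after the conversion. Note that this dummy-encodes Pclass too and keeps every level, which is simpler than the scheme described earlier.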

In [None]:
# build one indicator column per category value for each categorical feature
dummies = []
cols = ['Pclass', 'Sex', 'Embarked']
for col in cols:
    dummies.append(pd.get_dummies(df[col]))

In [None]:
titanic_dummies = pd.concat(dummies, axis=1)
titanic_dummies

#### Finally, we concatenate the dummy columns to the original DataFrame column-wise

In [None]:
df = pd.concat((df,titanic_dummies),axis=1)
df.info()

Now that Pclass, Sex, and Embarked have been converted into dummy columns, we drop the original, now-redundant columns from the DataFrame.

In [None]:
df = df.drop(['Pclass','Sex','Embarked'],axis=1)
df.info()

# Time for Machine Learning

Now we convert our pandas DataFrame to NumPy arrays and assign the input (X) and output (y).

In [None]:
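# features: every column except the response
# (selecting by name is safer than deleting column index 1)
X = df.drop('Survived', axis=1).values
# response: 0 = died, 1 = survived
y = df['Survived'].values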

#### Now that X and y are ready, let's split the dataset into a 70% training set and a 30% test set using scikit-learn's train_test_split

In [None]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=0)

In [None]:
from sklearn import tree
clf = tree.DecisionTreeClassifier(max_depth=5)
clf.fit(X_train,y_train)
clf.score(X_test,y_test)
0.78735805970149249
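The accuracy score alone does not show what the tree learned. As a minimal sketch, the example below fits a tree on hypothetical toy data (not the Titanic set) in which survival depends only on a 0/1 Sex column, then prints the learned rules with `export_text` and the feature importances. The names `X_toy`, `y_toy`, and `toy_clf` are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: columns are [Sex (0=female, 1=male), Pclass]
X_toy = np.array([[0, 1], [0, 2], [0, 3], [1, 1], [1, 2], [1, 3]])
y_toy = np.array([1, 1, 1, 0, 0, 0])  # survival depends only on Sex here

toy_clf = DecisionTreeClassifier(max_depth=2, random_state=0)
toy_clf.fit(X_toy, y_toy)

# Text rendering of the fitted splits
print(export_text(toy_clf, feature_names=['Sex', 'Pclass']))

# Relative contribution of each feature to the impurity reduction
print(toy_clf.feature_importances_)
```

Because Sex separates the two classes perfectly in this toy data, the tree splits once on Sex at 0.5 and assigns all of the importance to that feature. The same two calls could be applied to the `clf` fitted above, given a matching list of feature names.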