# 2018-01-24-morning

### Decision Trees

You are given data from the Titanic, in `data.csv`.

For this challenge we need to predict whether each individual in the dataset survived. Use the provided features, and modify, delete, or add features derived from the existing ones. Feature engineering like this is a core part of being a data scientist.

Once the data is in a form you are happy with, fit a `DecisionTreeClassifier` from sklearn and try to get the highest accuracy you can. Try adjusting the depth of the tree to see how the accuracy changes.

Finally, try to perform cross-validation.

*Note:* `data-dictionary.txt` describes the fields in the CSV file.

### Pull Request

Send me a pull request after analysis is complete.


In [18]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn import linear_model
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from pandas.plotting import scatter_matrix
from itertools import combinations
from sklearn.model_selection import cross_val_score
from sklearn import tree
%matplotlib inline

In [98]:
df = pd.read_csv('data.csv')
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


I replaced all of the NaNs with zeros and converted the Sex column to a numeric dummy column.



In [103]:
df.replace(np.nan, 0, inplace=True)
#df.dropna(axis=0, how='any', inplace=True)
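
Filling every NaN with 0 makes a missing Age look like a newborn and mixes the number 0 into the string-valued Cabin column. One possible alternative, sketched here on a toy frame rather than on `data.csv`, is to fill Age with the median and keep separate indicator columns. The column names `AgeMissing` and `HasCabin` are hypothetical and are not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for data.csv; the values are illustrative only.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 54.0, 2.0],
    "Cabin": [np.nan, "C85", np.nan, "E46"],
})

# Record which ages were missing before filling them in.
toy["AgeMissing"] = toy["Age"].isna().astype(int)

# Fill the gaps with the column median instead of 0.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Reduce Cabin to a has-cabin flag rather than mixing 0 with strings.
toy["HasCabin"] = toy["Cabin"].notna().astype(int)

print(toy)
```

Whether the median beats a zero fill for a decision tree on this data is an open question; the tree can split on either, so it is worth comparing both.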

In [104]:
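def sex_to_numeric(s):
    # Sex values in data.csv are lowercase ('male'/'female'). Comparing against 'Male'
    # never matches, which is why the recorded outputs show male = 0 for every row.
    if s == 'male':
        return 1
    else:
        return 0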

In [105]:
df['male'] = df['Sex'].apply(sex_to_numeric)
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,male
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,0,S,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,0,S,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,0,S,0
5,6,0,3,"Moran, Mr. James",male,0.0,0,0,330877,8.4583,0,Q,0
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,0
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,0,S,0
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,0,S,0
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,0,C,0
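
A dictionary lookup with `Series.map` is a compact alternative to the row-wise `apply`, and a quick `value_counts` check catches an indicator that came out constant, as the `male` column does in the output above. This is a minimal sketch on a toy frame, assuming the labels are exactly `male` and `female`.

```python
import pandas as pd

# Toy frame with the same lowercase labels that appear in the Sex column.
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# Labels missing from the dict would become NaN, which makes typos easy to spot.
toy["male"] = toy["Sex"].map({"male": 1, "female": 0})

# Sanity check: a useful indicator should take more than one value.
print(toy["male"].value_counts())
```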


After running the model once, I went back and split the Name column to pull out just the title prefix (Mr, Mrs, Miss, Master), then used it to build an "adult or child" column to see whether that gave a better score. It did not.

In [115]:
# Split "Last, rest" on the first comma; expand=True returns one column per piece.
df[['Last', 'beginning']] = df['Name'].str.split(',', n=1, expand=True)

In [119]:
# Split " Mr. Owen Harris" on the first period; Prefix keeps its leading space.
df[['Prefix', 'first']] = df['beginning'].str.split('.', n=1, expand=True)
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,male,Last,first,beginning,Prefix
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,0,S,0,Braund,Owen Harris,Mr. Owen Harris,Mr
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,0,Cumings,John Bradley (Florence Briggs Thayer),Mrs. John Bradley (Florence Briggs Thayer),Mrs
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,0,S,0,Heikkinen,Laina,Miss. Laina,Miss
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,0,Futrelle,Jacques Heath (Lily May Peel),Mrs. Jacques Heath (Lily May Peel),Mrs
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,0,S,0,Allen,William Henry,Mr. William Henry,Mr
5,6,0,3,"Moran, Mr. James",male,0.0,0,0,330877,8.4583,0,Q,0,Moran,James,Mr. James,Mr
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,0,McCarthy,Timothy J,Mr. Timothy J,Mr
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,0,S,0,Palsson,Gosta Leonard,Master. Gosta Leonard,Master
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,0,S,0,Johnson,Oscar W (Elisabeth Vilhelmina Berg),Mrs. Oscar W (Elisabeth Vilhelmina Berg),Mrs
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,0,C,0,Nasser,Nicholas (Adele Achem),Mrs. Nicholas (Adele Achem),Mrs
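
The two splits can also be collapsed into one regular expression. The sketch below uses `Series.str.extract` on a few names copied from the table above; the pattern assumes every name has the form `Last, Title. Rest`, which may not hold for every row of the full file.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
    "Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",
])

# Capture the text between the first comma and the next period.
# The \s* skips the space after the comma, so no leading space is kept.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False)

print(titles.tolist())
```

Unlike the split-based version, the extracted titles carry no leading space, so any comparison against them would use `'Mr'` rather than `' Mr'`.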


In [129]:
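def Prefix_to_numeric(w):
    # Prefix values keep the leading space left over from splitting on ','.
    # 'Miss' and 'Master' are treated as children (0); every other title counts as an adult (1).
    if w in (' Miss', ' Master'):
        return 0
    return 1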

In [130]:
df['Adult'] = df['Prefix'].apply(Prefix_to_numeric)
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,male,Last,first,beginning,Prefix,Prefixscore,Adult
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,0,S,0,Braund,Owen Harris,Mr. Owen Harris,Mr,1,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,0,Cumings,John Bradley (Florence Briggs Thayer),Mrs. John Bradley (Florence Briggs Thayer),Mrs,1,1
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,0,S,0,Heikkinen,Laina,Miss. Laina,Miss,2,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,0,Futrelle,Jacques Heath (Lily May Peel),Mrs. Jacques Heath (Lily May Peel),Mrs,1,1
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,0,S,0,Allen,William Henry,Mr. William Henry,Mr,1,1
5,6,0,3,"Moran, Mr. James",male,0.0,0,0,330877,8.4583,0,Q,0,Moran,James,Mr. James,Mr,1,1
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,0,McCarthy,Timothy J,Mr. Timothy J,Mr,1,1
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,0,S,0,Palsson,Gosta Leonard,Master. Gosta Leonard,Master,2,0
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,0,S,0,Johnson,Oscar W (Elisabeth Vilhelmina Berg),Mrs. Oscar W (Elisabeth Vilhelmina Berg),Mrs,1,1
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,0,C,0,Nasser,Nicholas (Adele Achem),Mrs. Nicholas (Adele Achem),Mrs,1,1
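
The title alone is a rough proxy: in the table above a 14-year-old "Mrs" is flagged as an adult and a 26-year-old "Miss" as a child. One possible refinement, sketched here under assumptions, is to use Age where it is known and fall back to the title otherwise. The cutoff `ADULT_AGE = 18` is a hypothetical choice, the titles are assumed to be stripped of whitespace, and missing ages are assumed to still be NaN rather than 0.

```python
import numpy as np
import pandas as pd

# Toy rows; values are illustrative, not read from data.csv.
sample = pd.DataFrame({
    "Prefix": ["Mr", "Mrs", "Miss", "Master", "Mr"],
    "Age": [22.0, 14.0, 26.0, 2.0, np.nan],
})

ADULT_AGE = 18  # hypothetical threshold

# Title-based guess: Miss and Master count as children.
title_adult = (~sample["Prefix"].isin(["Miss", "Master"])).astype(int)

# Age-based flag; rows with NaN age compare as False and are overridden below.
age_adult = (sample["Age"] >= ADULT_AGE).astype(int)

# Prefer the age-based flag when Age is known, else use the title.
sample["Adult"] = np.where(sample["Age"].notna(), age_adult, title_adult)

print(sample)
```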


Then I ran a train/test split, trained a decision tree on the training set, and scored it.

In [137]:
dfsubset = df.filter(["Survived", "Pclass", "Age", "Fare", "male"])

In [138]:
# Feature matrix and target; a 1-D Series is the conventional shape for a single target.
X = dfsubset[["Pclass", "Age", "Fare", "male"]]
y = dfsubset["Survived"]

In [140]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [141]:
model = tree.DecisionTreeClassifier().fit(X_train, y_train)

In [142]:
model

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [143]:
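# Scored on all rows, including the ones the tree was trained on, so this overstates accuracy;
# model.score(X_test, y_test) would give the held-out estimate.
model.score(X, y)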

0.86644219977553316
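
The score above is computed on `X` and `y`, which include the training rows, and an unrestricted tree can memorise those. The sketch below compares training and held-out accuracy at a few values of `max_depth`. It runs on synthetic data so that it is self-contained; the array names, the depth grid, and the label rule are all assumptions, and the numbers it prints say nothing about the Titanic data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the four features; not the Titanic data.
rng = np.random.RandomState(0)
X_demo = rng.rand(300, 4)
# Labels depend on the first feature plus noise, so there is signal to learn.
y_demo = (X_demo[:, 0] + 0.3 * rng.randn(300) > 0.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

results = {}
for depth in [1, 2, 3, 5, None]:  # None lets the tree grow until leaves are pure
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    # Store (training accuracy, held-out accuracy) for each depth.
    results[depth] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te))

for depth, (train_acc, test_acc) in results.items():
    print(f"max_depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}")
```

On the real data the same loop would use `X_train`, `X_test`, `y_train`, and `y_test` from the split above; a large gap between the two columns is the usual sign of overfitting.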

I tried to draw a picture of the tree, but the cell fails because `pydotplus` is not installed, so I need to install it first.

In [158]:
from sklearn.tree import DecisionTreeClassifier
dtree=DecisionTreeClassifier()
dtree.fit(X,y)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [159]:
from io import StringIO  # sklearn.externals.six is no longer shipped with recent scikit-learn releases
from IPython.display import Image  
from sklearn.tree import export_graphviz
import pydotplus
dot_data = StringIO()
export_graphviz(dtree, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
Image(graph.create_png())


ModuleNotFoundError: No module named 'pydotplus'
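
Until `pydotplus` and Graphviz are installed, the tree can still be inspected as plain text. The sketch below uses `sklearn.tree.export_text`, which is available in scikit-learn 0.21 and later and has no extra dependencies; `sklearn.tree.plot_tree` is a matplotlib-based option from the same versions. The four-row dataset is made up purely to keep the example self-contained.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up data with a single binary feature; not the Titanic data.
X_demo = [[0], [0], [1], [1]]
y_demo = [1, 1, 0, 0]

demo_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# feature_names is a plain list with one name per input column.
report = export_text(demo_tree, feature_names=["male"])
print(report)
```

For the model in this notebook the call would take `dtree` and `feature_names=list(X.columns)`, assuming `X` is still the four-column frame built earlier.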

In [145]:
# Feature order matches X: Pclass, Age, Fare, male
model.predict([[1, 0.67, 14.5, 1]])

array([0], dtype=int64)
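
The challenge also asks for cross-validation, and `cross_val_score` is imported at the top but not used in the cells above. A minimal sketch of how it could be combined with a depth search follows. It again runs on synthetic data, and the fold count `cv=5` and the depth grid are assumptions rather than values taken from this notebook.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the feature matrix and target; not the Titanic data.
rng = np.random.RandomState(0)
X_demo = rng.rand(300, 4)
y_demo = (X_demo[:, 0] + 0.3 * rng.randn(300) > 0.5).astype(int)

cv_means = {}
for depth in [1, 2, 3, 5, None]:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    # Each call fits the tree on 4 folds and scores it on the fifth, five times over.
    scores = cross_val_score(clf, X_demo, y_demo, cv=5)
    cv_means[depth] = scores.mean()

# Depth with the highest mean cross-validated accuracy.
best_depth = max(cv_means, key=cv_means.get)
print(cv_means)
print("best max_depth:", best_depth)
```

Averaging over folds gives a steadier estimate than a single train/test split, which matters on a dataset as small as this one.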