# Data Preparation and Advanced Model Evaluation

## Agenda

**Data preparation**

- Handling missing values
- Handling categorical features (review)

**Advanced model evaluation**

- ROC curves and AUC
- Bonus: ROC curve is only sensitive to rank order of predicted probabilities
- Cross-validation

## Part 1: Handling missing values

Most scikit-learn estimators expect every feature value to be **numeric** and to **hold meaning**. A missing value (`NaN`) satisfies neither condition, so these estimators raise an error when they encounter one.
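As a quick illustration of that constraint, the sketch below fits a `LinearRegression` on a tiny made-up array that contains one `NaN`. The data and variable names are invented for this example. The call is expected to raise a `ValueError`; a few estimators, such as the histogram-based gradient boosting models, do accept missing values natively, so this is typical behavior rather than a universal rule.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy feature matrix (made up for illustration) with one missing value
X_toy = np.array([[1.0], [2.0], [np.nan], [4.0]])
y_toy = np.array([1.0, 2.0, 3.0, 4.0])

try:
    LinearRegression().fit(X_toy, y_toy)
    fit_failed = False
except ValueError as err:
    # scikit-learn validates its input and rejects NaN for this estimator
    fit_failed = True
    print('fit failed:', err)
```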

In [2]:
# read the Titanic data
import pandas as pd
# silence pandas' SettingWithCopyWarning for the column assignments made later on
pd.options.mode.chained_assignment = None
url = '../../dataset/titanic.csv'
titanic = pd.read_csv(url, index_col='PassengerId')
titanic.shape

(891, 11)

In [3]:
titanic.head(10)

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
| 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | NaN | S |
| 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |


In [4]:
# check for missing values
titanic.isnull().sum()

Survived      0
Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      2
dtype: int64

One possible strategy is to **drop missing values**:

In [5]:
# drop rows with any missing values
titanic.dropna().shape

(183, 11)

In [6]:
# drop rows where Age is missing
titanic[titanic.Age.notnull()].shape

(714, 11)
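The two filters above can also be expressed with `dropna` alone by using its `subset` argument. The sketch below uses a small made-up frame (not the Titanic data) to show how dropping on every column compares with dropping on `Age` only.

```python
import numpy as np
import pandas as pd

# Toy frame (made up) that mimics the pattern of gaps in the Titanic data
toy = pd.DataFrame({
    'Age':   [22.0, np.nan, 35.0, np.nan],
    'Cabin': [np.nan, 'C85', 'C123', np.nan],
    'Fare':  [7.25, 71.28, 53.10, 8.46],
})

all_complete = toy.dropna()               # keeps rows with no missing value in any column
age_known = toy.dropna(subset=['Age'])    # keeps rows where Age is present

print(all_complete.shape, age_known.shape)
```

Dropping on every column is far more aggressive: in the Titanic data it leaves 183 of 891 rows, mostly because `Cabin` is missing so often, while dropping on `Age` alone keeps 714.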

In [7]:
titanic.Age.describe()

count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64

Sometimes a better strategy is to **impute missing values**:

In [8]:
# mean Age
titanic.Age.mean()

29.69911764705882

In [9]:
# median Age
titanic.Age.median()

28.0

In [10]:
# most frequent Age
titanic.Age.mode()

0    24.0
dtype: float64
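Any of these statistics can be used to fill the gaps directly. The following is a minimal sketch on made-up ages, showing a pandas `fillna` with the median and the equivalent scikit-learn `SimpleImputer`. It illustrates simple imputation in general and is not the approach this notebook takes next.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up ages with two gaps
ages = pd.DataFrame({'Age': [22.0, 38.0, np.nan, 26.0, np.nan, 35.0]})

# pandas: fill gaps with the median of the observed values
filled_pandas = ages['Age'].fillna(ages['Age'].median())

# scikit-learn: the imputer learns the median in fit() and applies it in transform(),
# so the same fitted value can be reused on new data
imputer = SimpleImputer(strategy='median')
filled_sklearn = imputer.fit_transform(ages[['Age']]).ravel()

print(filled_pandas.tolist())
```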

In [11]:
# create a dataframe with no missing values for ages
titanic_with_ages = titanic[titanic.Age.notnull()]
y = titanic_with_ages['Age']
X = titanic_with_ages.drop(['Age','Survived'], axis=1)

In [12]:
# Feature engineering example with a lambda function:
# flag tickets whose first character is a letter (1) rather than a digit (0)
X['ticket_contains_letters'] = X['Ticket'].apply(lambda x: x[0].isalpha()).astype(int)
X.head()

| PassengerId | Pclass | Name | Sex | SibSp | Parch | Ticket | Fare | Cabin | Embarked | ticket_contains_letters |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3 | Braund, Mr. Owen Harris | male | 1 | 0 | A/5 21171 | 7.25 | NaN | S | 1 |
| 2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 1 |
| 3 | 3 | Heikkinen, Miss. Laina | female | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S | 1 |
| 4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 1 | 0 | 113803 | 53.1 | C123 | S | 0 |
| 5 | 3 | Allen, Mr. William Henry | male | 0 | 0 | 373450 | 8.05 | NaN | S | 0 |
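Note that the lambda above inspects only the first character of each ticket, so the feature really means "ticket starts with a letter". The sketch below contrasts that with a check for a letter anywhere in the string, using pandas string methods. The last ticket value is hypothetical and was added only to make the two checks disagree.

```python
import pandas as pd

# The first three values appear in the data above; '12A 345' is a made-up ticket
tickets = pd.Series(['A/5 21171', '113803', 'STON/O2. 3101282', '12A 345'])

# Same logic as the lambda: look at the first character only
starts_with_letter = tickets.str[0].str.isalpha().astype(int)

# Alternative: a letter anywhere in the ticket string
contains_letter = tickets.str.contains(r'[A-Za-z]').astype(int)

print(pd.DataFrame({'ticket': tickets,
                    'starts_with_letter': starts_with_letter,
                    'contains_letter': contains_letter}))
```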


## Create a model to predict the age

- Create new features
- Try different models
- Test different hyperparameters

### Select and Clean a few features

In [13]:
# one-hot encode the port of embarkation by hand (a missing Embarked gives 0 in all three columns)
X['Q'] = (X['Embarked'] == 'Q').astype('int')
X['C'] = (X['Embarked'] == 'C').astype('int')
X['S'] = (X['Embarked'] == 'S').astype('int')
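The three manual comparisons above amount to one-hot encoding. As a sketch of an equivalent shortcut, `pd.get_dummies` produces the same indicator columns in one call. The toy values are made up, and `dtype=int` is passed so the result holds 0/1 integers rather than booleans.

```python
import pandas as pd

# Made-up sample of embarkation ports, including one missing value
embarked = pd.DataFrame({'Embarked': ['S', 'C', 'S', 'Q', None]})

# One indicator column per observed category; missing values get 0 everywhere
dummies = pd.get_dummies(embarked['Embarked'], dtype=int)

# Manual version for comparison, as done in the cell above
manual_q = (embarked['Embarked'] == 'Q').astype(int)

print(dummies)
```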

In [14]:
# Does the passenger have a recorded cabin? (1 = yes, 0 = Cabin is missing)
# notnull() never returns NaN, so no fillna step is needed afterwards
X['In_Cabin'] = X['Cabin'].notnull().astype('int')

In [15]:
features = ['Pclass','SibSp', 'Parch','Fare', 'ticket_contains_letters', 'In_Cabin', 'Q','C','S']

In [16]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=42)
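The call above relies on two defaults that are easy to overlook: with no `test_size` given, `train_test_split` holds out 25% of the rows, and a fixed `random_state` makes the shuffle reproducible. A small sketch on made-up arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 2 features
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# Default split: 75% train, 25% test
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=42)

# Same random_state gives exactly the same split again
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_toy, y_toy, random_state=42)

print(len(X_tr), len(X_te))
```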

### Filter X_train and X_test to the selected features

In [17]:
X_train_selected_features = X_train[features]
X_test_selected_features = X_test[features]

### Build a model and select hyperparameters

In [18]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn import metrics
import numpy as np

knn = KNeighborsRegressor()
knn.fit(X_train_selected_features, y_train)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=None, n_neighbors=5, p=2,
          weights='uniform')

In [19]:
def knn_optimize(leaf_size, n_neighbors, X_train, y_train, X_test, y_test):
    # fit a KNN regressor with the given hyperparameters and return its RMSE on the test data
    knn = KNeighborsRegressor(leaf_size=leaf_size, n_neighbors=n_neighbors)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    return np.sqrt(metrics.mean_squared_error(y_test, y_pred))

In [20]:
eval_holder = []  # [leaf_size, n_neighbors, RMSE] for every combination
eval_scores = []  # RMSE values only
leaf_test = [10, 20, 30, 40]
neighbors_test = [3, 6, 9, 12, 15]

# note: leaf_size mainly affects the speed and memory use of the tree search
for leaf_size in leaf_test:
    for n in neighbors_test:
        results = knn_optimize(leaf_size, n, X_train_selected_features, y_train,
                               X_test_selected_features, y_test)
        eval_holder.append([leaf_size, n, results])
        eval_scores.append(results)

In [21]:
min(eval_scores)

11.66247681931188

In [22]:
for e in eval_holder:
    if e[2] == min(eval_scores):
        print(e[0], 'leaf', e[1], 'neighbors')

40 leaf 12 neighbors
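Because every combination above is scored on the test set, the RMSE reported for the winning combination is likely to be somewhat optimistic. One common alternative, which is not what this notebook does, is to choose hyperparameters by cross-validation on the training data only. The sketch below uses `GridSearchCV` on synthetic data; the data, the grid, and the five-fold setting are all assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data (made up): the target depends on the first feature plus noise
rng = np.random.RandomState(42)
X_syn = rng.uniform(0, 10, size=(200, 3))
y_syn = 2 * X_syn[:, 0] + rng.normal(0, 1, size=200)

# Each candidate is scored by 5-fold cross-validation on the data passed to fit()
param_grid = {'n_neighbors': [3, 6, 9, 12, 15]}
search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid,
    scoring='neg_mean_squared_error',  # scikit-learn maximizes scores, hence the negation
    cv=5,
)
search.fit(X_syn, y_syn)

# Convert the best (negated) mean squared error back to an RMSE
best_rmse = np.sqrt(-search.best_score_)
print(search.best_params_, round(best_rmse, 2))
```

With this setup the test set is touched only once, after the hyperparameters are fixed, so its RMSE remains an honest estimate of performance on unseen data.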


In [23]:
knn = KNeighborsRegressor(leaf_size = 40, n_neighbors=12)
knn.fit(X_train_selected_features, y_train)

KNeighborsRegressor(algorithm='auto', leaf_size=40, metric='minkowski',
          metric_params=None, n_jobs=None, n_neighbors=12, p=2,
          weights='uniform')

### Store predictions

In [24]:
y_pred = knn.predict(X_test_selected_features)

### Evaluate the model

In [25]:
np.sqrt(metrics.mean_squared_error(y_test, y_pred))

11.66247681931188
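The value above is a root mean squared error (RMSE): the square root of the average squared difference between true and predicted ages, expressed in years. The sketch below computes it by hand on made-up numbers and compares it with a baseline that always predicts the mean.

```python
import numpy as np

# Made-up true ages and predictions, for illustration only
y_true = np.array([22.0, 38.0, 26.0, 35.0])
y_hat = np.array([25.0, 30.0, 28.0, 33.0])

# RMSE: square the errors, average them, take the square root
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))

# Baseline: predict the mean age for everyone
baseline = np.full_like(y_true, y_true.mean())
baseline_rmse = np.sqrt(np.mean((y_true - baseline) ** 2))

print(rmse, baseline_rmse)
```

A baseline like this gives the number context. In the notebook, an RMSE of about 11.7 years sits below the standard deviation of `Age` reported earlier (about 14.5), which suggests the model does somewhat better than predicting the mean. The comparison is only rough, since that standard deviation was computed on all 714 known ages rather than on the test set.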

### Apply a model to missing age data

In [26]:
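# use the tilde (~) to invert notnull(), i.e. select the rows where Age is null;
# .copy() gives an independent frame so that new columns can be added safely
missing_age = titanic[~titanic.Age.notnull()].copy()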
missing_age.shape

(177, 11)

In [27]:
missing_age.head()

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 18 | 1 | 2 | Williams, Mr. Charles Eugene | male | NaN | 0 | 0 | 244373 | 13.0 | NaN | S |
| 20 | 1 | 3 | Masselmani, Mrs. Fatima | female | NaN | 0 | 0 | 2649 | 7.225 | NaN | C |
| 27 | 0 | 3 | Emir, Mr. Farred Chehab | male | NaN | 0 | 0 | 2631 | 7.225 | NaN | C |
| 29 | 1 | 3 | O'Dwyer, Miss. Ellen "Nellie" | female | NaN | 0 | 0 | 330959 | 7.8792 | NaN | Q |


In [28]:
# create the same features as above, so the model sees identical columns
missing_age['ticket_contains_letters'] = missing_age['Ticket'].apply(lambda x: x[0].isalpha()).astype(int)

In [29]:
missing_age['Q'] = (missing_age['Embarked'] == 'Q').astype('int')
missing_age['C'] = (missing_age['Embarked'] == 'C').astype('int')
missing_age['S'] = (missing_age['Embarked'] == 'S').astype('int')

In [30]:
# 1 if a cabin is recorded, 0 if Cabin is missing (no fillna needed: notnull() never returns NaN)
missing_age['In_Cabin'] = missing_age['Cabin'].notnull().astype('int')
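The feature steps for `missing_age` repeat those applied to `X` earlier. One way to keep the two in sync is to wrap them in a single function, as in the sketch below. `add_features` is a hypothetical helper, not part of the original notebook, and the toy frame is made up.

```python
import pandas as pd

def add_features(df):
    """Return a copy of df with the engineered columns used by the age model."""
    out = df.copy()
    # 1 if the ticket starts with a letter, 0 otherwise
    out['ticket_contains_letters'] = out['Ticket'].apply(lambda t: t[0].isalpha()).astype(int)
    # manual one-hot encoding of the port of embarkation
    for port in ['Q', 'C', 'S']:
        out[port] = (out['Embarked'] == port).astype(int)
    # 1 if a cabin is recorded, 0 if it is missing
    out['In_Cabin'] = out['Cabin'].notnull().astype(int)
    return out

# Made-up rows with the three source columns the function needs
toy = pd.DataFrame({
    'Ticket': ['A/5 21171', '330877'],
    'Embarked': ['S', 'Q'],
    'Cabin': [None, 'C85'],
})
toy_features = add_features(toy)
print(toy_features)
```

Calling the same function on the training rows and on the rows to be imputed guarantees that both receive identical columns, and working on a copy leaves the input frame unchanged.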

In [31]:
missing_age.head()

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | ticket_contains_letters | Q | C | S | In_Cabin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q | 0 | 1 | 0 | 0 | 0 |
| 18 | 1 | 2 | Williams, Mr. Charles Eugene | male | NaN | 0 | 0 | 244373 | 13.0 | NaN | S | 0 | 0 | 0 | 1 | 0 |
| 20 | 1 | 3 | Masselmani, Mrs. Fatima | female | NaN | 0 | 0 | 2649 | 7.225 | NaN | C | 0 | 0 | 1 | 0 | 0 |
| 27 | 0 | 3 | Emir, Mr. Farred Chehab | male | NaN | 0 | 0 | 2631 | 7.225 | NaN | C | 0 | 0 | 1 | 0 | 0 |
| 29 | 1 | 3 | O'Dwyer, Miss. Ellen "Nellie" | female | NaN | 0 | 0 | 330959 | 7.8792 | NaN | Q | 0 | 1 | 0 | 0 | 0 |


In [32]:
X_Validation = missing_age[features]

In [33]:
# store the model's prediction for each row in a new Age_Predict column (Age itself is left untouched)
missing_age['Age_Predict'] = knn.predict(X_Validation)

In [34]:
missing_age.head()

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | ticket_contains_letters | Q | C | S | In_Cabin | Age_Predict |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q | 0 | 1 | 0 | 0 | 0 | 29.791667 |
| 18 | 1 | 2 | Williams, Mr. Charles Eugene | male | NaN | 0 | 0 | 244373 | 13.0 | NaN | S | 0 | 0 | 0 | 1 | 0 | 30.5 |
| 20 | 1 | 3 | Masselmani, Mrs. Fatima | female | NaN | 0 | 0 | 2649 | 7.225 | NaN | C | 0 | 0 | 1 | 0 | 0 | 27.083333 |
| 27 | 0 | 3 | Emir, Mr. Farred Chehab | male | NaN | 0 | 0 | 2631 | 7.225 | NaN | C | 0 | 0 | 1 | 0 | 0 | 27.083333 |
| 29 | 1 | 3 | O'Dwyer, Miss. Ellen "Nellie" | female | NaN | 0 | 0 | 330959 | 7.8792 | NaN | Q | 0 | 1 | 0 | 0 | 0 | 29.791667 |


In [35]:
print('Mean age predicted by our model:', round(missing_age.Age_Predict.mean(),2))
print('Mean of the observed ages:', round(titanic.Age.mean(),2))

Mean age predicted by our model: 30.39
Mean of the observed ages: 29.7


In [36]:
print('Median age predicted by our model:', round(missing_age.Age_Predict.median(),2))
print('Median of the observed ages:', round(titanic.Age.median(),2))

Median age predicted by our model: 29.54
Median of the observed ages: 28.0
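The predictions so far live in a separate `Age_Predict` column. To finish the imputation, they would need to be written back into `Age` for the rows where it is missing. The sketch below shows one way to do that with index-aligned assignment on a made-up frame. The predicted values are placeholders, not model output, and the `Age_Imputed` flag column is an optional addition assumed here so that imputed rows stay identifiable.

```python
import numpy as np
import pandas as pd

# Made-up frame indexed like the Titanic data, with two missing ages
toy = pd.DataFrame(
    {'Age': [22.0, np.nan, 35.0, np.nan]},
    index=pd.Index([1, 6, 5, 18], name='PassengerId'),
)

# Stand-in for model predictions on the rows with missing Age (placeholder values)
predicted = pd.Series([29.8, 30.5], index=[6, 18])

# Record which rows are about to be imputed, then fill them by index label
toy['Age_Imputed'] = toy['Age'].isnull().astype(int)
toy.loc[predicted.index, 'Age'] = predicted

print(toy)
```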
