# Psych 198: Reproducibility DeCal (Spring 2021)

## Demo/Lab 3: Error Metrics

In this demo/lab, we will go through some of the error metrics used in data science. Code adapted from Data 144.

Note that this notebook is Python-based, so if you have any questions regarding the syntax of the code, please feel free to reach out or sign up for an office hour slot. 

Author: Yuyang Zhong (2021). This work is licensed under a [Creative Commons BY-NC-SA 4.0 International
License][cc-by]. 

![CC BY-NC-SA 4.0][cc-by-shield]

[cc-by]: http://creativecommons.org/licenses/by-nc-sa/4.0/
[cc-by-shield]: https://img.shields.io/badge/license-CC--BY--NC--SA%204.0-blue

#### Note on using Jupyter Notebooks 
Enter code into a code cell, then press SHIFT+Enter to run that cell. The output of the code appears directly beneath the cell you just ran.

In [None]:
import pandas as pd
import numpy as np
import re

from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
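try:
    from sklearn.metrics import plot_confusion_matrix  # removed in scikit-learn 1.2
except ImportError:
    # newer scikit-learn: expose the replacement API under the same name
    from sklearn.metrics import ConfusionMatrixDisplay
    plot_confusion_matrix = ConfusionMatrixDisplay.from_estimator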

### Dataset

We will use a dataset about the passengers on the Titanic and try to predict whether each passenger survived. Run the cell below to load the data.

In [None]:
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

In [None]:
df_train.shape, df_test.shape

#### Replacing missing values

In [None]:
#check missing values for df_train
df_train.isna().sum()

In [None]:
def fill_age_by_pclass(df):
    # select 'Age' before averaging so non-numeric columns are never passed to mean()
    group_mean = df.groupby('Pclass')['Age'].mean()

    for index, row in df.iterrows():
        if np.isnan(row['Age']):
            df.at[index,'Age'] = group_mean[row['Pclass']]

In [None]:
#replace missing values for age column with the mean of the passenger class group
fill_age_by_pclass(df_train)
fill_age_by_pclass(df_test)
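
The same class-wise mean imputation can also be written without an explicit loop. The following is a minimal sketch on a small made-up table (the `toy` DataFrame and its values are assumptions for illustration, not rows from `train.csv`); it uses `groupby(...).transform("mean")` to line up each row with its class mean.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration (not the Titanic file)
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, 50.0, np.nan, 20.0, np.nan, 30.0],
})

# Mean age within each class, repeated for every row of that class
class_mean_age = toy.groupby("Pclass")["Age"].transform("mean")

# Fill only the missing ages; observed ages are left unchanged
toy["Age"] = toy["Age"].fillna(class_mean_age)
print(toy)
```

Here the missing 1st-class age becomes the mean of 40 and 50, and the missing 3rd-class age becomes the mean of 20 and 30.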

Approximate each missing embarkation port by sampling from a list that roughly reflects the distribution of ports within the passenger's class:

* For anyone in 1st class, sample from [S, S, C, C, C, Q]
* For anyone in 2nd class, sample from [S, S, S, C, Q]
* For anyone in 3rd class, sample from [S, S, C, Q, Q, Q, Q]
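
Repeating a port in a list is a simple way to give it more weight: a uniform draw from the list picks each port with probability proportional to how often it appears. The sketch below spells out the probabilities implied by the three lists above; the helper name `implied_probabilities` is hypothetical and only used for illustration.

```python
from collections import Counter
from fractions import Fraction

# Candidate lists copied from the bullets above, keyed by passenger class
candidates = {
    1: ['S', 'S', 'C', 'C', 'C', 'Q'],
    2: ['S', 'S', 'S', 'C', 'Q'],
    3: ['S', 'S', 'C', 'Q', 'Q', 'Q', 'Q'],
}

def implied_probabilities(ports):
    """Probability of each port under a uniform draw from `ports`."""
    counts = Counter(ports)
    return {port: Fraction(n, len(ports)) for port, n in counts.items()}

for pclass, ports in candidates.items():
    probs = implied_probabilities(ports)
    print(pclass, {port: str(p) for port, p in probs.items()})
```

The same weights could instead be passed explicitly, for example `np.random.choice(['S', 'C', 'Q'], p=[1/3, 1/2, 1/6])` for 1st class.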

In [None]:
def fill_embark_by_pclass(df):
    for index, row in df.iterrows():
        if pd.isna(row['Embarked']):  # true only when the port is missing
            if row['Pclass'] == 1:
                df.at[index,'Embarked'] = np.random.choice(['S', 'S', 'C', 'C', 'C', 'Q'])
            elif row['Pclass'] == 2:
                df.at[index,'Embarked'] = np.random.choice(['S', 'S', 'S', 'C', 'Q'])
            else:
                df.at[index,'Embarked'] = np.random.choice(['S', 'S', 'C', 'Q', 'Q', 'Q', 'Q'])

In [None]:
#replace missing values for embarked column with an approximate probabilistic sampling by class
fill_embark_by_pclass(df_train)
fill_embark_by_pclass(df_test)

In [None]:
#replace the cabin column with an indicator: 1 if a cabin is recorded, 0 if it is missing
#(assigning the whole column avoids chained assignment, which pandas may silently ignore)
df_train['Cabin'] = df_train['Cabin'].notnull().astype(int)
df_test['Cabin'] = df_test['Cabin'].notnull().astype(int)

In [None]:
#check missing values
df_train.isna().sum()

In [None]:
#check missing values in df_test
df_test.isna().sum()

#### Extracting Salutations

In [None]:
def extract_salutations(series):
    # collect the first title-like token (capitalized word ending in a period) from each name
    titles = []
    for elem in series:
        matches = re.findall(r'[A-Z][a-z]+[.]', elem)
        titles.append(matches[0])
    return titles

In [None]:
#extract the title from the name
df_train['Title'] = extract_salutations(df_train['Name'])
df_test['Title'] = extract_salutations(df_test['Name'])
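
To see what the title regex matches, here is a small sketch using a pattern equivalent to the one in `extract_salutations`. The sample names are illustrative strings written in the dataset's "Surname, Title. Given names" format, and the helper `first_title` is hypothetical. Unlike the function above, it returns `None` rather than raising an `IndexError` when a name contains no title.

```python
import re

# A capital letter, one or more lowercase letters, then a literal period
TITLE_PATTERN = re.compile(r'[A-Z][a-z]+[.]')

def first_title(name):
    """Return the first title-like token in `name`, or None if there is none."""
    matches = TITLE_PATTERN.findall(name)
    return matches[0] if matches else None

sample_names = [
    'Braund, Mr. Owen Harris',
    'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
    'Heikkinen, Miss. Laina',
]
titles = [first_title(name) for name in sample_names]
print(titles)

# A name without a period-terminated title yields None instead of an error
print(first_title('Smith, John'))
```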

In [None]:
#replace some titles with Mr. and Miss. in df_train so it has the same number of unique titles as df_test
df_train['Title'] = df_train['Title'].replace(['Major.', 'Sir.', 'Capt.', 'Jonkheer.'], 'Mr.').replace(
    ['Mlle.', 'Mme.', 'Lady.', 'Countess.'], 'Miss.')
df_train['Title'].value_counts()

In [None]:
df_train['Num Relatives'] = df_train['SibSp'] + df_train['Parch']
df_test['Num Relatives'] = df_test['SibSp'] + df_test['Parch']

#### Set up final dataframes

In [None]:
x_feats = ['Pclass', 'Sex', 'Age', 'Embarked', 'Num Relatives', 'Title']

x_train = df_train[x_feats]
y_train = df_train['Survived']

In [None]:
from sklearn import preprocessing

x_train = preprocessing.scale(pd.get_dummies(x_train))
x_train

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(x_train, y_train, test_size = .2, random_state = 42)

### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

log_model = LogisticRegression(penalty='l1', solver='liblinear', 
                               max_iter=50, random_state=42)
log_model.fit(X_train, y_train)

log_model.score(X_train, y_train)

In [None]:
plot_confusion_matrix(log_model, X_train, y_train)

We can also report the accuracy, recall, precision, and F1-scores:

In [None]:
accuracy_score(y_train, log_model.predict(X_train))

Recall is also known as the true positive rate:

In [None]:
recall_score(y_train, log_model.predict(X_train))

Precision is also known as the positive predictive value:

In [None]:
precision_score(y_train, log_model.predict(X_train))

In [None]:
f1_score(y_train, log_model.predict(X_train))
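
To make the definitions behind these scores concrete, the sketch below recomputes them by hand from confusion-matrix counts. The labels and predictions are hypothetical, chosen only for illustration; they are not output from `log_model`.

```python
# Hypothetical labels and predictions (1 = survived, 0 = did not survive)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives

accuracy = (tp + tn) / len(pairs)    # share of all predictions that are correct
precision = tp / (tp + fp)           # share of predicted positives that are correct
recall = tp / (tp + fn)              # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

Precision and recall share the same numerator (true positives) and differ only in what they divide by: predicted positives for precision, actual positives for recall.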

### Check on test set

In [None]:
log_model.score(X_test, y_test)

In [None]:
recall_score(y_test, log_model.predict(X_test))

In [None]:
precision_score(y_test, log_model.predict(X_test))

In [None]:
f1_score(y_test, log_model.predict(X_test))

**Exercise:** Try this with another model/classifier.
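
As a starting point, here is a minimal sketch of the same fit-then-score workflow with a different classifier. It runs on synthetic data from `make_classification` rather than the Titanic features, and the choice of `DecisionTreeClassifier` with `max_depth=3` is an assumption for illustration, not a recommended model. Applying it to `X_train` and `X_test` from above is left as the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (stand-in for the Titanic features)
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a shallow decision tree instead of logistic regression
tree_model = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_model.fit(X_tr, y_tr)
y_hat = tree_model.predict(X_te)

# Report the same four metrics on the held-out split
scores = {
    "accuracy": accuracy_score(y_te, y_hat),
    "recall": recall_score(y_te, y_hat),
    "precision": precision_score(y_te, y_hat),
    "f1": f1_score(y_te, y_hat),
}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```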