# Assignment 2

The purpose of this assignment is to test your understanding of Classification.  You will use the Titanic dataset and your goal is to predict whether a passenger Survives based on the passenger's features.


# Instructions

## General

1. Use the same train dataset that was used in the lecture. Instructions for where to find it are given below.

2. As usual, your grade depends on **both** the correct answer and a proper presentation of your process (as in the "Recipe" taught in class and Appendix B of the Geron book).

3. You will classify whether a passenger Survives or not using Logistic Regression.

4. You may use the code presented in class to **start** your assignment but I expect you to significantly enhance it.  For example: you may use my code to get you started with plotting but it is up to you to decide whether this alone suffices.

5. Use 5-fold cross validation for all models.  Report the average as your result.
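
As an illustration of the last point, here is a minimal sketch of 5-fold cross validation with the average reported as the result. The synthetic data from `make_classification` and the names `X_demo`, `y_demo`, `clf_demo`, `fold_scores` and `mean_score` are assumptions made for this example only; in the assignment you would pass your prepared Titanic features and labels instead.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; the assignment uses the Titanic train set).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

clf_demo = LogisticRegression(solver="liblinear")

# cv=5 produces one accuracy value per fold.
fold_scores = cross_val_score(clf_demo, X_demo, y_demo, cv=5, scoring="accuracy")

# Report the average over the folds as a plain Python float.
mean_score = float(fold_scores.mean())
print(fold_scores, mean_score)
```

Converting the mean with `float(...)` gives a Python scalar rather than a NumPy scalar, which matches the form requested for the grading variables.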


## Specific goals to address

1. Use a baseline model against which you will compare your models.
    - Discuss your choice.  Is this the best baseline model to use?
    - Create a variable SCORE_BASELINE that contains a Python scalar value: the accuracy for your baseline model.
2. You will conduct several experiments.
    - Present a Confusion Matrix for each experiment and discuss it.
    - You will create several variables per experiment that will be used for grading.
        - The variables for experiment 1 will have suffix "_1". For experiment 2, they will have suffix "_2", etc.
3. Experiment 1
    - You will *extend* the results presented in the lecture
        - use the same features
        - use the same way of dealing with missing features
        - be sure to treat categorical features correctly
     
    - Create a variable SCORE_1 that contains a Python scalar value: the accuracy for your experiment.
    - Create a variable MISCLASSIFIED_SURVIVE_1 that contains a Python list of *at least 10* passengers
        - The list should contain the "identity" of passengers that were mis-classified as Surviving.
        - The "identity" of a passenger should be given as the *row number* within the unshuffled **train** data set.
        - The first row is considered row 0.
    - Create a variable MISCLASSIFIED_NOT_SURVIVE_1 that contains a Python list of *at least 10* passengers
        - The list should contain the "identity" of passengers that were mis-classified as Not Surviving.
        - The "identity" of a passenger should be given as the *row number* within the unshuffled **train** data set, as above.
4. Experiment 2
    - Turn Age from a continuous variable into one that is assigned to buckets.
        - You will decide the range for each bucket.  Discuss your choice
        - Treat the buckets as categorical features
    - Compare your prediction to the previous experiment and discuss
    - Create variables SCORE_2, MISCLASSIFIED_SURVIVE_2, MISCLASSIFIED_NOT_SURVIVE_2 analogous to the variables in Experiment 1.
        
The correctness part of your grade will depend on the values you assign to these variables.    
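
One possible baseline for goal 1 is a classifier that always predicts the most frequent class. The sketch below is only an illustration under stated assumptions: the toy labels (60 zeros, 40 ones) and the names `y_toy`, `X_toy` and `baseline` are made up for the example, and whether this is the best baseline for the Titanic data is exactly what the discussion item asks you to argue.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Toy labels (an assumption): 60 passengers of class 0 and 40 of class 1.
y_toy = np.array([0] * 60 + [1] * 40)
# DummyClassifier ignores the features, so a placeholder column is enough.
X_toy = np.zeros((len(y_toy), 1))

# Always predict the class that is most frequent in the training fold.
baseline = DummyClassifier(strategy="most_frequent")

# 5-fold cross validation, averaged and stored as a Python scalar.
SCORE_BASELINE = float(
    cross_val_score(baseline, X_toy, y_toy, cv=5, scoring="accuracy").mean()
)
print(SCORE_BASELINE)
```

With these toy labels every stratified fold holds 12 zeros and 8 ones, so the baseline accuracy equals the share of the majority class, 0.6.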

# Extra credit

Create your own Logistic Regression model for the given Titanic dataset!
- Feel free to change **anything**, e.g., features or ways to treat missing values
- We will create a hidden test dataset
- Students whose model accuracy (evaluated on the hidden test dataset) is in the top 33% of the class get extra credit!


# Getting the data 
You may obtain the train and test datasets from the repository using code from the following cell.

**NOTE** You may need to change the NOTEBOOK_ROOT variable to point to the directory into which you've cloned the repository.  On my machine, it is `~/Notebooks/NYU`.

In [28]:
import pandas as pd
import os

## NOTEBOOK_ROOT = "~/Notebooks/NYU"
NOTEBOOK_ROOT = "~/Desktop/7773 Machine Learning/ML_Spring_2019"
TITANIC_PATH = os.path.join( NOTEBOOK_ROOT, "external/jack-dies", "data")

train_data = pd.read_csv( os.path.join(TITANIC_PATH, "train.csv") )
test_data  = pd.read_csv( os.path.join(TITANIC_PATH, "test.csv")  )

In [29]:
train_data.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [30]:
import numpy as np
import random
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion

# Experiment 1

In [31]:
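# Define a function that creates new features to help boost accuracy.
def CreateNewFeature(Data):
    # Work on a copy so that the caller's DataFrame is not modified in place.
    df = Data.copy()
    # New Feature: Age Categories (a missing Age matches no branch below, so its bucket stays missing)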
    def AgeCat(x):
        if x < 10:
            return 0
        if 10 <= x < 40:
            return 1
        if 40 <= x < 60:
            return 2
        if x >= 60:
            return 3
    df["AgeBucket"] = df["Age"].apply(AgeCat)
    
    # New Feature: Fare Categories
    def FareCat(x):
        if x < 15:
            return 0
        if 15 <= x <35:
            return 1
        if 35 <= x <90:
            return 2
        else:
            return 3    
    df["FareCat"] = df["Fare"].apply(FareCat)
    
    # New Feature: Passenger Class & Sex
    df["PclassSex"] = df["Pclass"].apply(str) + df["Sex"]
    
    # New Feature: Family Size
    df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
    
    # New Feature: Is Alone
    def Alone(x):
        if x == 1:
            return 1
        else:
            return 0
    df["IsAlone"] = df["FamilySize"].apply(Alone)
    
    # New Feature: Name Length
    df["NameLen"] = df["Name"].apply(len)
    
    # New Feature: Name Prefix (the second whitespace-separated token, e.g. "Mr." in "Braund, Mr. Owen Harris")
    df["NamePre"] = df["Name"].apply(lambda x: x.split()[1])
    
    return df
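
The hand-written `AgeCat` function above can also be expressed with `pandas.cut`. The following is a minimal sketch, not the method used in this notebook: the sample `ages` series and the names `bins` and `age_bucket` are assumptions for illustration, and the bin edges simply mirror the boundaries chosen in `AgeCat`.

```python
import numpy as np
import pandas as pd

# Sample ages (an assumption), including one missing value.
ages = pd.Series([4.0, 22.0, 40.0, 59.9, 71.0, np.nan])

# Same boundaries as AgeCat: [0, 10), [10, 40), [40, 60), [60, inf).
bins = [0, 10, 40, 60, np.inf]

# right=False makes each interval closed on the left, matching "10 <= x < 40".
age_bucket = pd.cut(ages, bins=bins, labels=[0, 1, 2, 3], right=False)

# Missing ages stay missing, so a later imputer can still handle them.
print(age_bucket.tolist())
```

Keeping the edges in one list makes it easier to experiment with different bucket ranges and to discuss the choice.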

In [32]:
new_train_data = CreateNewFeature(train_data)

In [33]:
# X_train
titanic_train = new_train_data[["SibSp", "Parch", "Embarked", 
                           "PclassSex", "FamilySize", "FareCat", "AgeBucket", 
                           "IsAlone", "NameLen", "NamePre"]]
# y_train
titanic_train_labels = new_train_data["Survived"]

In [34]:
# Separate the data into two parts: numerical and categorical

# numerical data of training set
titanic_train_num = titanic_train[["SibSp", "Parch", "FamilySize", "NameLen"]]
# categorical data of training set
# AgeBucket is categorical
# NamePre is selected above but not listed here, so the ColumnTransformer below
# (remainder="drop" by default) leaves it out of the model.
titanic_train_cat = titanic_train[["Embarked", "PclassSex", "FareCat", "AgeBucket", "IsAlone"]]

In [35]:
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler())
])

cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ('encoder', OneHotEncoder(sparse=False))
])

num_attribs = list(titanic_train_num)
cat_attribs = list(titanic_train_cat)

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", cat_pipeline, cat_attribs)
])

In [38]:
# Transform the training set with the full pipeline (imputer, encoder, standard scaler)
titanic_train_prepared = full_pipeline.fit_transform(titanic_train)

In [39]:
logistic_clf = LogisticRegression(solver="liblinear")
logistic_clf.fit(titanic_train_prepared, titanic_train_labels)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)

In [40]:
SCORE = cross_val_score(logistic_clf, titanic_train_prepared, titanic_train_labels, cv=5, scoring="accuracy")
SCORE =  SCORE.mean()
print("Score: {s}".format(s=SCORE))

Score: 0.8159540419187691


In [41]:
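# In-sample predictions: logistic_clf was fit on this same data, so the confusion matrix
# and misclassified lists below are not cross-validated and tend to look optimistic.
titanic_train_predict = logistic_clf.predict(titanic_train_prepared)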

In [42]:
titanic_train_confusion_matrix = confusion_matrix(titanic_train_labels, titanic_train_predict)

In [43]:
CONFUSION_MATRIX = pd.DataFrame(titanic_train_confusion_matrix, 
                                  columns=["Predicted Negative", "Predicted Positive"],
                                  index=["Actual Negative", "Actual Positive"])
print("Confusion Matrix: \n")
print(CONFUSION_MATRIX)

Confusion Matrix: 

                 Predicted Negative  Predicted Positive
Actual Negative                 504                  45
Actual Positive                 106                 236
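
To read the matrix above: rows are actual classes and columns are predicted classes, so the 45 is the number of passengers predicted to survive who did not, and the 106 is the number predicted not to survive who did. The accuracy implied by this matrix is (504 + 236) / 891, about 0.831, which is higher than the cross-validated score because these predictions were made on the data the model was fit on. The small sketch below shows the same layout on made-up labels; `y_true` and `y_pred` are assumptions for illustration.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels (an assumption) to show the layout of the matrix.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 1, 0, 1, 0, 1]

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

# For a binary problem, ravel() yields the four cells in this order.
tn, fp, fn, tp = cm.ravel()

# Accuracy is the share of the diagonal in the total.
accuracy = (tn + tp) / cm.sum()
print(cm)
print(tn, fp, fn, tp, accuracy)
```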


In [44]:
titanic_subtract = titanic_train_predict - titanic_train_labels
# +1: predicted 1 (survive), actual 0 (not survive) -> misclassified as Surviving
# -1: predicted 0 (not survive), actual 1 (survive) -> misclassified as Not Surviving

MISCLASSIFIED_SURVIVE = list(titanic_subtract[titanic_subtract == 1].index)
MISCLASSIFIED_NOT_SURVIVE = list(titanic_subtract[titanic_subtract == -1].index)
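
The lists above are derived from predictions on the data the model was trained on. Since the assignment asks for 5-fold cross validation throughout, one alternative is to take out-of-fold predictions from `cross_val_predict` and compare them with the labels. This is a sketch under assumptions, not the notebook's method: the synthetic data and the names `X_demo`, `y_demo`, `pred`, `false_pos` and `false_neg` are made up for the example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in data (an assumption); flip_y adds label noise so errors occur.
X_demo, y_arr = make_classification(
    n_samples=300, n_features=6, flip_y=0.2, random_state=0
)
# Default RangeIndex, so the index value equals the row number (first row is 0).
y_demo = pd.Series(y_arr)

# Each row is predicted by a model that did not see that row during fitting.
pred = cross_val_predict(LogisticRegression(solver="liblinear"), X_demo, y_demo, cv=5)

# Predicted 1 but actually 0: misclassified as the positive class.
false_pos = y_demo.index[(pred == 1) & (y_demo.values == 0)].tolist()
# Predicted 0 but actually 1: misclassified as the negative class.
false_neg = y_demo.index[(pred == 0) & (y_demo.values == 1)].tolist()

print(len(false_pos), len(false_neg))
```

Because the labels keep their original, unshuffled index, the resulting lists are row numbers in the sense required by the grading variables.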

In [45]:
print("Misclassified Survive Passengers: Total {p}\n".format(p=len(MISCLASSIFIED_SURVIVE)))
print(MISCLASSIFIED_SURVIVE, "\n")
print("Misclassified Not Survive Passengers: Total {p}\n".format(p=len(MISCLASSIFIED_NOT_SURVIVE)))
print(MISCLASSIFIED_NOT_SURVIVE, "\n")

Misclassified Survive Passengers: Total 45

[14, 18, 24, 34, 41, 49, 54, 114, 118, 139, 140, 147, 177, 199, 205, 246, 264, 297, 312, 351, 357, 373, 374, 415, 423, 452, 498, 501, 502, 505, 557, 578, 583, 617, 654, 657, 680, 702, 766, 767, 772, 799, 852, 854, 867] 

Misclassified Not Survive Passengers: Total 106

[2, 17, 21, 23, 25, 36, 55, 65, 68, 74, 79, 81, 85, 106, 107, 125, 127, 128, 141, 146, 183, 187, 204, 207, 209, 216, 220, 224, 226, 233, 248, 261, 267, 271, 279, 283, 286, 288, 298, 301, 315, 330, 338, 376, 390, 391, 400, 414, 429, 444, 447, 449, 453, 455, 460, 483, 509, 510, 512, 543, 547, 553, 554, 569, 570, 572, 579, 587, 607, 621, 622, 630, 632, 643, 645, 649, 660, 664, 673, 677, 690, 692, 701, 707, 709, 712, 724, 740, 744, 755, 762, 780, 786, 788, 797, 802, 804, 821, 823, 828, 838, 839, 855, 857, 869, 889] 



In [50]:
new_test_data = CreateNewFeature(test_data)

# X_test
titanic_test = new_test_data[["SibSp", "Parch", "Embarked", 
                           "PclassSex", "FamilySize", "FareCat", "AgeBucket", 
                           "IsAlone", "NameLen", "NamePre"]]

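# Use transform (not fit_transform) so that the imputer, scaler and encoder
# statistics learned on the training set are reused for the test set.
titanic_test_prepared = full_pipeline.transform(titanic_test)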

test_predict = logistic_clf.predict(titanic_test_prepared)
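
When preparing test data, the preprocessing steps are normally fit on the training set only and then applied to the test set. The sketch below illustrates this on toy frames. The frames `toy_train` and `toy_test`, the name `prep`, and the options `handle_unknown="ignore"` and `sparse_threshold=0` are assumptions chosen for the example and are not taken from the cells above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frames (an assumption). "X" appears in the test set but never in training.
toy_train = pd.DataFrame({"Embarked": ["S", "C", "S", "Q"], "SibSp": [0, 1, 2, 1]})
toy_test = pd.DataFrame({"Embarked": ["S", "X"], "SibSp": [1, 3]})

prep = ColumnTransformer(
    [
        ("num", StandardScaler(), ["SibSp"]),
        # Unseen categories are encoded as all zeros instead of raising an error.
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Embarked"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Learn the scaling statistics and category list from the training data only.
train_prepared = prep.fit_transform(toy_train)
# Reuse them on the test data; the column layout stays identical.
test_prepared = prep.transform(toy_test)

print(train_prepared.shape, test_prepared.shape)
```

Calling `fit_transform` on the test frame instead would rescale it with its own mean and variance and could produce a different set of one-hot columns than the model was trained on.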

In [51]:
print(test_predict)

[0 0 0 0 0 0 1 0 1 0 0 0 1 0 1 1 0 0 0 0 0 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 1 1 0 0 0 1 1 0 0 1 1 0 0 0 0 0 1 0 0 0 1 0 1 1 0 1 1 1 0 0 1
 1 1 0 1 0 1 1 0 0 0 0 0 1 0 1 0 1 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0
 1 1 1 1 0 0 1 1 1 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
 0 0 1 0 0 1 0 0 1 0 0 0 1 1 1 0 0 1 0 0 1 0 0 0 0 0 0 1 1 0 1 1 0 0 1 0 1
 0 1 0 0 0 0 0 0 0 1 0 1 0 0 1 1 1 0 1 0 0 1 0 1 0 0 0 0 1 0 0 1 0 1 0 1 0
 1 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 1 1 1 0 0 0 0 0 0 0 1
 0 0 0 1 1 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 1 0 0 0 0
 1 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 1 0 1 1 0 0 1 0 0
 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0 1 0 0 1 1 0
 0 1 0 0 1 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0
 0 1 1 1 1 0 0 1 0 0 0]
