# Supervised Learning Workflow

## Baseline Model & Model Evaluation

In [0]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier

train = pd.read_csv("../../../../Data/data_titanic/train.csv")
train.Pclass = train.Pclass.astype(float) # to avoid DataConversionWarning

In [0]:
train.head()

## Brief Exploration

In [0]:
# Categorical features
train.describe(include = object)

In [0]:
# Numerical features
train.describe()

Let's work only with the following features for simplicity:   

**Categorical**   
- Sex
- Embarked

**Numerical**  
- Survived: *our target variable* (0 = No, 1 = Yes)
- Pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
- Age: Age in years
 
More detailed info: https://www.kaggle.com/c/titanic

In [0]:
# Let's keep only the desired columns
train = train[['Sex','Embarked','Pclass', 'Age','Survived']]
train.shape

In [0]:
# Check for missing values
train.isna().sum()

For simplicity, we drop any row containing missing values. 

In [0]:
train = train.dropna(axis=0)
train.head()

## Feature Engineering
With our current knowledge, we can apply [various transformers from scikit-learn](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing) one at a time.
Let's not forget to create a holdout set!

In [0]:
X_train, X_test, y_train, y_test = train_test_split(train[['Pclass', 'Age', 'Sex', 'Embarked']],
                          train['Survived'],
                          test_size=0.2, 
                          random_state=42)

### Numerical Features
The only numerical features we have are 'Pclass' and 'Age'.  
Let's scale these two features using [`MinMaxScaler()`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html).

In [0]:
scaler = preprocessing.MinMaxScaler()
scaler.fit(X_train[['Pclass', 'Age']])
X_train_transformed_numerical = scaler.transform(X_train[['Pclass', 'Age']])
X_test_transformed_numerical = scaler.transform(X_test[['Pclass', 'Age']])

print(X_train_transformed_numerical.shape)
print(X_test_transformed_numerical.shape)
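
To make the fit/transform split concrete, here is a minimal sketch on made-up toy data (the `toy_*` names and values are assumptions for illustration, not part of the Titanic data). It shows that the scaler learns the minimum and maximum from the training data only, so holdout values outside that range can fall outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data, invented for illustration: a single numerical column
toy_train = np.array([[10.0], [20.0], [30.0]])
toy_test = np.array([[15.0], [40.0]])

toy_scaler = MinMaxScaler()
toy_scaler.fit(toy_train)  # learns min=10 and max=30 from the training data only

toy_train_scaled = toy_scaler.transform(toy_train)
toy_test_scaled = toy_scaler.transform(toy_test)

print(toy_train_scaled.ravel())  # training values land in [0, 1]
print(toy_test_scaled.ravel())   # 40 exceeds the training max, so it maps above 1
```

Fitting on the training set alone keeps information from the holdout set out of the preprocessing step.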

### Categorical Features
The categorical features we have are 'Sex' and 'Embarked'.   
We can simply one-hot encode these using [`OneHotEncoder()`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).

In [0]:
encoder = preprocessing.OneHotEncoder(sparse_output=False)  # on scikit-learn < 1.2, use sparse=False instead
encoder.fit(X_train[['Sex', 'Embarked']])
X_train_transformed_categorical = encoder.transform(X_train[['Sex', 'Embarked']])
X_test_transformed_categorical = encoder.transform(X_test[['Sex', 'Embarked']])

print(X_train_transformed_categorical.shape)
print(X_test_transformed_categorical.shape)
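
As a rough illustration of what one-hot encoding produces, the sketch below encodes a tiny made-up frame (the `toy` data is an assumption, not taken from the Titanic file). It relies on the encoder's default sparse output and converts it with `.toarray()`, which avoids depending on a version-specific keyword.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame, invented for illustration
toy = pd.DataFrame({"Sex": ["male", "female", "female"],
                    "Embarked": ["S", "C", "S"]})

toy_encoder = OneHotEncoder()
toy_encoded = toy_encoder.fit_transform(toy).toarray()  # sparse matrix -> dense array

print(toy_encoder.categories_)  # categories learned per column, in sorted order
print(toy_encoded)              # one 0/1 column per category
```

Each input column expands into as many output columns as it has categories, which is why the transformed shape is wider than the input.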

## Exercises
It's time for our first exercise! 
First, let's concatenate the transformed numerical and categorical features into a single NumPy array.

In [0]:
X_train_transformed = np.concatenate((X_train_transformed_numerical, X_train_transformed_categorical), axis = 1)
X_test_transformed = np.concatenate((X_test_transformed_numerical, X_test_transformed_categorical), axis = 1)

print(X_train_transformed.shape)
print(X_test_transformed.shape)

In [0]:
# TASK 1A: Fit a DummyClassifier to the transformed training set.
# Then predict on the training set (X_train_transformed) and on the holdout set (X_test_transformed).
# Store the predictions as y_pred_TRAIN_DUMMY (training set) and y_pred_HOLDOUT_DUMMY (holdout set).

In [0]:
# OPTIONAL TASK 1B: Think of a simple heuristic that can be used as a baseline.
# One possibility is to use gender and, for example, predict that every man, or every woman, survived.
# You can store the results as y_pred_TRAIN_HEURISTIC and y_pred_HOLDOUT_HEURISTIC.

Great! We have our first prediction! It is time to evaluate how good our model is using the [*sklearn.metrics* module](https://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics).

In [0]:
# TASK 2A: Display ACCURACY on the TRAIN set.

# TASK 2B: Display ACCURACY on the HOLDOUT set.

# OPTIONAL TASK 2C: Can you think of a better measure than accuracy for this problem domain? If so, compute it the same way.
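
As a hint, the functions in `sklearn.metrics` share the same call pattern: true labels first, predicted labels second. The sketch below uses made-up label arrays (the `toy_*` names are assumptions for illustration) and shows accuracy next to two other metrics; which one suits the problem best is left to the task.

```python
from sklearn.metrics import accuracy_score, recall_score, f1_score

# Toy labels, invented for illustration
toy_true = [0, 0, 0, 1, 1]
toy_guess = [0, 0, 1, 1, 0]

toy_accuracy = accuracy_score(toy_true, toy_guess)  # share of correct predictions
toy_recall = recall_score(toy_true, toy_guess)      # share of actual positives that were found
toy_f1 = f1_score(toy_true, toy_guess)              # harmonic mean of precision and recall

print(toy_accuracy, toy_recall, toy_f1)
```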

Great! Next, let's look at the confusion matrix: it is always a good idea to check visually where our predictions are right and where they are wrong.

In [0]:
# TASK 3: Display a CONFUSION MATRIX on the HOLDOUT set. Hint: use only confusion_matrix, not plot_confusion_matrix.
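
As a hint for reading the output, here is a minimal sketch on made-up labels (the `toy_*` names are assumptions for illustration). In scikit-learn's `confusion_matrix`, rows correspond to the true classes and columns to the predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Toy labels, invented for illustration
toy_true = [0, 0, 0, 1, 1]
toy_guess = [0, 0, 1, 1, 0]

toy_cm = confusion_matrix(toy_true, toy_guess)
print(toy_cm)

# For binary labels 0/1, the flattened matrix reads TN, FP, FN, TP
tn, fp, fn, tp = toy_cm.ravel()
print(tn, fp, fn, tp)
```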