# Assignment 2

Submitted by \<Please add your name\>.

In this assignment, we will work with the *Adult* data set. Please download the data from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/2/adult). Extract the data files into the subdirectory `../data/adult/` (relative to `./src/`).

## Variable Description

The download archive contains several files. We will use only one: `adult.data`. The file is comma-separated, has no header row, and its variables are specified below.


|Variable Name |Role |Type |Demographic |Description |Units |Missing Values|
|--------------|-----|-----|------------|------------|------|--------------|
|age |Feature |Integer |Age |N/A | |no|
|workclass |Feature |Categorical |Income |Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked. | |yes|
|fnlwgt |Feature |Integer | | | |no|
|education |Feature |Categorical |Education Level |Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool. | |no|
|education-num |Feature |Integer |Education Level | | |no|
|marital-status |Feature |Categorical |Other |Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse. | |no|
|occupation |Feature |Categorical |Other |Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces. | |yes|
|relationship |Feature |Categorical |Other |Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried. | |no|
|race |Feature |Categorical |Race |White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black. | |no|
|sex |Feature |Binary |Sex |Female, Male. | |no|
|capital-gain |Feature |Integer | | | |no|
|capital-loss |Feature |Integer | | | |no|
|hours-per-week |Feature |Integer | | | |no|
|native-country |Feature |Categorical |Other |United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands. | |yes|
|income |Target |Binary |Income |>50K, <=50K. | |no|


## Objective

The objective of this assignment is to construct a preprocessing and model pipeline to predict the variable `income`. We will evaluate this pipeline using cross-validation.

# Load the data

Assuming that the file `adult.data` is in `../data/adult/`, you can use the code below to load it.

In [1]:
import pandas as pd
columns = [
    'age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',
    'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week',
    'native-country', 'income'
]
adult_dt = (pd.read_csv('../data/adult/adult.data', header = None, names = columns)
              .assign(income = lambda x: (x.income.str.strip() == '>50K')*1))
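
In the raw `adult.data` file, each comma is followed by a space and missing categorical values are written as `?` instead of being left empty. The sketch below shows one way to have pandas read them as missing. It is an illustration under assumptions, not part of the required solution: the two sample rows are only laid out like the real file, and `sample` and `toy_dt` are hypothetical names.

```python
import io
import pandas as pd

columns = [
    'age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',
    'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss',
    'hours-per-week', 'native-country', 'income'
]

# Two sample rows laid out like adult.data (for illustration only).
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "54, ?, 180211, Some-college, 10, Married-civ-spouse, ?, "
    "Husband, Asian-Pac-Islander, Male, 0, 0, 60, South, >50K\n"
)

toy_dt = pd.read_csv(
    sample,
    header=None,
    names=columns,
    skipinitialspace=True,  # drop the space that follows each comma
    na_values='?',          # read "?" as a missing value (NaN)
)

# Count the missing values that pandas now recognises.
print(toy_dt[['workclass', 'occupation']].isna().sum())
```

With these options, the string columns no longer carry leading spaces, so the `str.strip()` call in the recoding step would still work but would no longer be needed.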


# Get X and Y

Create the features data frame and target data:

+ Create a dataframe `X` that holds the features (all columns that are not `income`).
+ Create a dataframe `Y` that holds the target data (`income`).
+ From `X` and `Y`, obtain the training and testing data sets:

    - Use a train-test split of 70-30%. 
    - Set the random state of the splitting function to 42.

In [2]:
from sklearn.model_selection import train_test_split

X = adult_dt.drop(columns = 'income')
Y = adult_dt['income']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state = 42)



## Random States

Please comment: 

+ What is the [random state](https://scikit-learn.org/stable/glossary.html#term-random_state) of the [splitting function](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)? ***It is the seed of the pseudo-random number generator that shuffles the rows before they are split into training and test sets.***
+ Why is it [useful](https://en.wikipedia.org/wiki/Reproducibility)? ***Fixing the seed produces the same split every time the code runs, so results are reproducible and can be compared across runs.***
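
The effect of the seed can be checked on toy data. This is a minimal sketch: `data` is a hypothetical array standing in for the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(20)  # hypothetical toy data

# Same seed twice: the shuffled split is identical.
train_a, test_a = train_test_split(data, test_size=0.3, random_state=42)
train_b, test_b = train_test_split(data, test_size=0.3, random_state=42)
print(np.array_equal(test_a, test_b))  # True

# A different seed generally produces a different shuffle.
train_c, test_c = train_test_split(data, test_size=0.3, random_state=0)
print(test_a, test_c)
```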

(Comment here.)

# Preprocessing

Create a [Column Transformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) that treats the features as follows:

- Numerical variables

    * Apply [KNN-based imputation for completing missing values](https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html):
        
        + Consider the 7 nearest neighbours.
        + Weight each neighbour by the inverse of its distance, so that closer neighbours have more influence than more distant ones.
    * [Scale features using statistics that are robust to outliers](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html#sklearn.preprocessing.RobustScaler).

- Categorical variables: 
    
    * Apply a [simple imputation strategy](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer):

        + Use the most frequent value to complete missing values, also called the *mode*.

    * Apply [one-hot encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html):
        
        + Handle unknown labels if they exist.
        + Drop one column for binary variables.
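
To see what inverse-distance weighting does during imputation, consider a small worked example. It is a sketch on a hypothetical three-row array and uses 2 neighbours instead of 7 to keep the arithmetic visible. On the observed feature, the row with the missing value is 1 unit from the first row and 2 units from the third, so the weights are in a 2:1 ratio and the imputed value is (2·2 + 1·8) / 3 = 4. With uniform weights it would be the plain mean, 5.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy data: the second row is missing its second feature.
toy = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [4.0, 8.0],
])

weighted = KNNImputer(n_neighbors=2, weights='distance').fit_transform(toy)
uniform = KNNImputer(n_neighbors=2, weights='uniform').fit_transform(toy)

print(weighted[1, 1])  # approximately 4.0: the closer neighbour counts double
print(uniform[1, 1])   # 5.0: plain average of 2 and 8
```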
    
    
The column transformer should look like this:

![](./img/assignment_2__column_transformer.png)

In [3]:
adult_dt.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  32561 non-null  object
 14  income          32561 non-null  int32 
dtypes: int32(1), int64(6), object(8)
memory usage: 3.6+ MB


In [14]:

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, RobustScaler
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.metrics import accuracy_score, roc_auc_score, log_loss, balanced_accuracy_score

numerical_transformer = Pipeline(steps=[
    ('imputer' , KNNImputer(n_neighbors=7 , weights='distance')),
    ('scaler', RobustScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer' , SimpleImputer(strategy='most_frequent')),
    ('onehot' , OneHotEncoder(handle_unknown='ignore' , drop='if_binary'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num_transforms' , numerical_transformer , X.select_dtypes(include=['int64' , 'float64']).columns),
        ('cat_transforms' , categorical_transformer , X.select_dtypes(include=['object']).columns)
    ]
)

preprocessor


## Model Pipeline

Create a [model pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html): 

+ Add a step labelled `preprocessing` and assign the Column Transformer from the previous section.
+ Add a step labelled `classifier` and assign a [`RandomForestClassifier()`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) to it.

The pipeline looks like this:

![](./img/assignment_2__pipeline.png)

In [5]:
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline(steps=[
    ('preprocessing' , preprocessor),
    ('classifier' , RandomForestClassifier(n_estimators=100 , random_state=42))])

pipeline



# Cross-Validation

Evaluate the model pipeline using [`cross_validate()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html):

+ Measure the following [performance metrics](https://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values): negative log loss, ROC AUC, accuracy, and balanced accuracy.
+ Report the training and validation results. 
+ Use five folds.
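
By default, `cross_validate()` returns only the validation (`test_*`) scores. The sketch below shows how the training scores can be requested as well. It uses synthetic data and a logistic regression as stand-ins (assumptions made to keep the example small), not the assignment's pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the real training data.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=42)

cv_results = cross_validate(
    LogisticRegression(max_iter=1000),
    X_toy, y_toy,
    cv=5,
    scoring=['neg_log_loss', 'roc_auc', 'accuracy', 'balanced_accuracy'],
    return_train_score=True,  # adds a train_* entry for every metric
)

# Keys include fit_time, score_time, and test_*/train_* for each metric.
print(sorted(cv_results))
```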


In [6]:
from sklearn.model_selection import cross_validate

scoring = ['neg_log_loss' , 'roc_auc' , 'accuracy' , 'balanced_accuracy']

X = adult_dt.drop(columns = 'income')
Y = adult_dt['income']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state = 42)

results = cross_validate(pipeline , X_train , Y_train , cv = 5 , scoring = scoring)




Display the fold-level results as a pandas data frame, sorted by the negative log loss of the test (validation) set.

In [7]:
results = pd.DataFrame(results)
results_sorted = results.sort_values('test_neg_log_loss')
results_sorted


| |fit_time |score_time |test_neg_log_loss |test_roc_auc |test_accuracy |test_balanced_accuracy |
|--|---------|-----------|------------------|-------------|--------------|-----------------------|
|4 |11.356566 |0.195965 |-0.380379 |0.902299 |0.856077 |0.776089 |
|2 |11.370265 |0.184988 |-0.375988 |0.901378 |0.854103 |0.776017 |
|1 |11.49585 |0.187 |-0.369239 |0.901079 |0.850406 |0.771881 |
|0 |11.372931 |0.189034 |-0.357675 |0.904384 |0.850625 |0.774484 |
|3 |11.423141 |0.183996 |-0.356791 |0.90725 |0.859807 |0.782859 |


Calculate the mean of each metric. 

In [8]:
results.mean()

fit_time                  11.403751
score_time                 0.188196
test_neg_log_loss         -0.368014
test_roc_auc               0.903278
test_accuracy              0.854204
test_balanced_accuracy     0.776266
dtype: float64

Calculate the same performance metrics (negative log loss, ROC AUC, accuracy, and balanced accuracy) using the testing data `X_test` and `Y_test`. Display results as a dictionary.

*Tip*: both `roc_auc()` and `neg_log_loss()` require prediction scores from `pipe.predict_proba()`. However, for `roc_auc()` you should pass only the last column, `Y_pred_proba[:, 1]`. Use the full `Y_pred_proba` with `neg_log_loss()`.
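
A small hand-computed case makes the two inputs concrete. This is a sketch with made-up labels and probabilities (`y_true` and `proba` are hypothetical), using `log_loss()` and `roc_auc_score()` from `sklearn.metrics`. The negative log loss is the negation of `log_loss()`.

```python
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score

# Hypothetical labels and predicted probabilities (columns: class 0, class 1).
y_true = np.array([0, 1, 1, 0])
proba = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.3, 0.7],
    [0.6, 0.4],
])

# Log loss takes the full probability matrix; negate it to match the
# 'neg_log_loss' scorer, where larger (closer to zero) is better.
neg_log_loss = -log_loss(y_true, proba)

# ROC AUC takes only the score of the positive class (last column).
auc = roc_auc_score(y_true, proba[:, 1])

print(neg_log_loss)  # mean of log(0.9), log(0.8), log(0.7), log(0.6)
print(auc)           # 1.0: every positive outranks every negative
```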

In [10]:
# Fitting the pipeline also fits its preprocessing step.
pipeline.fit(X_train, Y_train)

pipeline.predict_proba(X_test)

array([[1.  , 0.  ],
       [0.57, 0.43],
       [0.31, 0.69],
       ...,
       [1.  , 0.  ],
       [0.34, 0.66],
       [0.01, 0.99]])

In [15]:
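# Fitting the pipeline also fits its preprocessing step.
pipeline.fit(X_train, Y_train)

# Probability of the positive class (income == 1) and hard class predictions.
Y_pred_proba = pipeline.predict_proba(X_test)[:, 1]
Y_pred = pipeline.predict(X_test)

# Note: log_loss() is positive; the 'neg_log_loss' scorer used in
# cross-validation reports its negation.
test_results = {
    'log_loss': log_loss(Y_test, Y_pred_proba),
    'roc_auc': roc_auc_score(Y_test, Y_pred_proba),
    'accuracy': accuracy_score(Y_test, Y_pred),
    'balanced_accuracy': balanced_accuracy_score(Y_test, Y_pred)
}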

test_results



{'log_loss': 0.3972650658939891,
 'roc_auc': 0.8999828704291436,
 'accuracy': 0.8551540587572934,
 'balanced_accuracy': 0.7751631656838177}
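
Balanced accuracy is the mean of the per-class recalls, so it falls below plain accuracy when the model recalls one class better than the other, which is the likely reason for the gap between the two values above. The sketch below illustrates this on hypothetical labels, not on the assignment's data.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Hypothetical labels: four negatives, two positives, one positive missed.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0]

acc = accuracy_score(y_true, y_pred)                           # 5 of 6 correct
per_class_recall = recall_score(y_true, y_pred, average=None)  # [1.0, 0.5]
bal_acc = balanced_accuracy_score(y_true, y_pred)              # (1.0 + 0.5) / 2

print(acc, per_class_recall, bal_acc)
```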

# Target Recoding

In the first code chunk of this document, we loaded the data and immediately recoded the target variable `income`. Why is this [convenient](https://scikit-learn.org/stable/modules/model_evaluation.html#binary-case)?

The specific line was:

```
adult_dt = (pd.read_csv('../data/adult/adult.data', header = None, names = columns)
              .assign(income = lambda x: (x.income.str.strip() == '>50K')*1))
```

(Answer here.)
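***Recoding `income` to 0/1 makes `>50K` the positive class with label 1, the greater of the two labels. In the binary case, scikit-learn's ROC AUC expects the score of the class with the greater label, so column 1 of `predict_proba()` is directly the probability of earning more than 50K. Metrics that take a `pos_label` argument, such as precision and recall, default to 1 and therefore need no extra configuration. Stripping the whitespace before the comparison ensures that labels such as `' >50K'` are matched correctly.***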
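
The recoding itself can be checked on a few hypothetical raw labels that carry the leading space found in the file. This is a sketch: `raw` and `scores` are made-up values.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Hypothetical raw labels with a leading space, as read from the file.
raw = pd.Series([' >50K', ' <=50K', ' >50K', ' <=50K'])

# Strip whitespace, compare, and convert the booleans to 0/1.
income = (raw.str.strip() == '>50K') * 1
print(income.tolist())  # [1, 0, 1, 0]

# With 0/1 labels, the scores are read as the probability of class 1.
scores = [0.9, 0.2, 0.6, 0.4]
auc = roc_auc_score(income, scores)
print(auc)  # 1.0
```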

# Reference

Becker, Barry and Kohavi, Ronny. (1996). Adult. UCI Machine Learning Repository. https://doi.org/10.24432/C5XW20.