# SEN163A - Responsible Data Analytics
## Lab session 5: Predictive Analytics: Regression and Classification
### Delft University of Technology
### Q3 2023-24

**Instructor**: Dr. Ir. Sepinoud Azimi Rashti - s.azimirashti@tudelft.nl

**TAs**: Anagha Magadi Rajeev - a.magadirajeev@student.tudelft.nl

#### Instructions

Lab session aim to:
- Show and reinforce how models and ideas presented in class are put to practice.
- Help you gather hands-on machine learning skills.

Lab sessions are:

- Learning environments where you work with Jupyter notebooks and where you can get support from TAs and fellow students.
- Not graded and do not have to be submitted.
- A good preparation for the assignments (which are graded).


### Application: Predictive analytics of a health and insurance related data

In this lab session, we will explore how to performe predicitive analytics to solve both a classification (predicting a categorical variable) and a regression (predicting a numerical variable) task. 
The classification case will be related to the prediction of the occurrence of a stroke, based on both physiological measurements as well as user features.
The regression case, on the other hand, will be related to the prediction of health insurance costs, based on user features and behaviour.

#### Learning objectives
After completing the following exercises you will be able to:

1. Apply common preprocessing techniques to prepare data for machine learning techniques: categorical preprocessing, imputation.
2. Split the available dataset into a training set (for model fitting) and a testing set (for performance evaluation).
3. Fit benchmark models to determine baseline performances on both a classification and regression case.
4. Compute the most commonly applied performance measures for classification and regression tasks.
5. Fit the most commonly applied machine learning predictive models for classification and regression tasks.
6. Compare predictive models across different performance metrics.

In [20]:
import pandas
import numpy

import seaborn
import matplotlib

seaborn.set_palette("Set2")
seaborn.color_palette("Set2")

#
seaborn.set(rc={"figure.figsize":(15, 10),
            'legend.title_fontsize' : 25,
            'legend.fontsize' : 20,
            'xtick.labelsize' : 20,
            'ytick.labelsize' : 20,
            'axes.labelsize' : 25})

In [21]:
seaborn.set_context('notebook')
#sns.set_context('paper')
#sns.set_context('talk')
#sns.set_context('poster')

# Predictive Analytics - Classification example

The classification task we will be tackling is based on the [Stroke Prediction Dataset](https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset?select=healthcare-dataset-stroke-data.csv).

In this case, we will use the available data to try to predict the occurrence of a stroke (`stroke` variable) as a function of the other variables.

Before starting the modeling task, please have a look at the metadata about the [Stroke Prediction Dataset](https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset?select=healthcare-dataset-stroke-data.csv), in order to better understand the meaning of the different variables.




## Activity 1.1 - Descriptive analytics

We are going to use the `pandas` library to perform some exploratory understanding of the data.

1. Load the dataset `healthcare-dataset-stroke-data.csv` in the `stroke_df` variable
2. Display the content of the `stroke_df` variable
3. What are the type of the different columns? Use the knowledge from `pandas` to determine the type.


In [22]:
stroke_df = pandas.read_csv("healthcare-dataset-stroke-data.csv")

In [23]:
stroke_df.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


In [24]:
stroke_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB


## Activity 1.2 - Diagnostic analytics

A common problem in many datasets is missing data, usually indicated by N/A, NA, NaN, and extreme values (outliers).

As a reminder, several ways exist to deal with incomplete or missing data, the most common being:

![MissingData](figures/MissingData.png)

**Source:** *Skarga-Bandurova, I., Biloborodova, T., & Dyachenko, Y. (2018). Strategy to managing mixed datasets with missing items. In Information Processing and Management of Uncertainty in Knowledge-Based Systems. Theory and Foundations: 17th International Conference, IPMU 2018, Cádiz, Spain, June 11-15, 2018, Proceedings, Part II 17 (pp. 608-620). Springer International Publishing.*


1. Is there any column containing missing data in this dataset?
2. If there are any, display the column(s) containing missing data.
3. Count the number of missing values in the column(s) containing missing data.
4. Analyze the missing values and their potential causes, and propose the most appropriate way to process them in order to have a dataset without missing values for the further steps.
5. Produce a new dataset `stroke_noNA_df` containing no missing values.


In [25]:
stroke_df.isna().any()

id                   False
gender               False
age                  False
hypertension         False
heart_disease        False
ever_married         False
work_type            False
Residence_type       False
avg_glucose_level    False
bmi                   True
smoking_status       False
stroke               False
dtype: bool

In [26]:
stroke_df.columns[stroke_df.isna().any()]

Index(['bmi'], dtype='object')

The only column including missing data is the BMI column, let's inspect it some more to see the nature of the data:

In [27]:
stroke_df[stroke_df.columns[stroke_df.isna().any()]]

Unnamed: 0,bmi
0,36.6
1,
2,32.5
3,34.4
4,24.0
...,...
5105,
5106,40.0
5107,30.6
5108,25.6


In [28]:
print(stroke_df[stroke_df.columns[stroke_df.isna().any()]].isna().sum())      

bmi    201
dtype: int64


In [29]:
stroke_df.shape

(5110, 12)

The missing values are constituted by NaN values, and there is no available information to recompute the BMI from the other variables, so we are left with no choice but removing the missing values.

In [30]:
stroke_noNA_df = stroke_df.dropna()

In [31]:
stroke_noNA_df.shape

(4909, 12)

## Activity 1.3

In order to apply a Machine Learning predictive model on the [Stroke Prediction Dataset](https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset?select=healthcare-dataset-stroke-data.csv) that we had previously imported in the `stroke_df` variable, we need to perform the following operations:

1. Impute missing values (Done in 1.2 by dropping/imputing the missing values)
2. Split data into training and test using the `train_test_split` function (1.3)
3. Transform categorical variables (1.4)

**N.B.**: Please note that the transformation in categorical variables needs to be done after the split into training and test set in order to avoid information leakage (normally the testing set should not be seen by the model during its training phase).

We are going to use the `scikit-learn` library to perform most of the split and transformation tasks.

Here you need to:
1. Divide the `stroke_noNA_df` dataset into two variables:
- `X` containing the input variables
- `Y` containing the target variable (`stroke`)
2. Use the `train_test_split` function to obtain `X_train, X_test, Y_train, Y_test` with a 70% train - 30% test split

In [32]:
from sklearn.model_selection import train_test_split

X = stroke_noNA_df.iloc[:,1:11]
Y = stroke_noNA_df.iloc[:,11]

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)



## Activity 1.4

Before inputting the data to a Machine Learning model, we need all the inputs to be numeric.
In order to transform categorical data into numeric ones, three techniques exist (cf. https://www.kaggle.com/code/alexisbcook/categorical-variables):
- Dropping Categorical variables
- Ordinal Encoding: A categorical variable is replaced by a single numerical variable, where each category is mapped to a different, increasing integer value.
- One-hot Encoding: A categorical variable with $n$ different categories is replaced by $n$ binary variables, each of them corresponding to a category. 

We are going to use the `scikit-learn` library to perform the transformation of the variables and to subsequently fit the models.

1. Have a look at the documentation of the [Scikit-learn](https://scikit-learn.org/stable/index.html) library 
2. Have a look at the following code to perform the transformation of categorical variables:
- Dropping Categorical variables: `drop_X_train` and `drop_X_test`
- Ordinal Encoding: `label_X_train` and `label_X_test`
- One-hot Encoding: `OH_X_train` and `OH_X_test`

TO DO: Add example with Pandas.get_dummies

In [33]:
# Get list of categorical variables
s = (X_train.dtypes == 'object')
object_cols = list(s[s].index)

print("Categorical variables:")
print(object_cols)

Categorical variables:
['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']


### Dropping categorical variables

In [34]:
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_test = X_test.select_dtypes(exclude=['object'])


### Ordinal Encoding

In [35]:
from sklearn.preprocessing import OrdinalEncoder

# Make copy to avoid changing original data 
label_X_train = X_train.copy()
label_X_test = X_test.copy()

# Apply ordinal encoder to each column with categorical data
ordinal_encoder = OrdinalEncoder()
label_X_train[object_cols] = ordinal_encoder.fit_transform(X_train[object_cols])
label_X_test[object_cols] = ordinal_encoder.transform(X_test[object_cols])


### One-hot Encoding

In [36]:
import sklearn
print(sklearn.__version__)


1.4.1.post1


In [37]:
from sklearn.preprocessing import OneHotEncoder

# Apply one-hot encoder to each column with categorical data
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
OH_cols_train = pandas.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_test = pandas.DataFrame(OH_encoder.transform(X_test[object_cols]))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_test.index = X_test.index

# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_test = X_test.drop(object_cols, axis=1)

# Add one-hot encoded columns to numerical features
OH_X_train = pandas.concat([num_X_train, OH_cols_train], axis=1)
OH_X_test = pandas.concat([num_X_test, OH_cols_test], axis=1)

In [45]:
#label_X_train.info()
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_test.columns = OH_X_test.columns.astype(str)

## Activity 1.5

Finally, with the data cleaned of missing values, and with the categorical variable appropriately transformed we are able to fit some models using the `scikit-learn` library.

As seen in Lecture 5 a starter, we will will be using a baseline for classification models: a [Naive Bayesian Model](https://scikit-learn.org/stable/modules/naive_bayes.html)

1. Have a look at the documentation of the [Scikit-learn](https://scikit-learn.org/stable/index.html) library for the Naive Bayes model.
2. Initialize the model
3. Use the `fit` function to perform the training of the model on the training set
4. Use the `predict` function to perform the prediction of the model on the test set
5. Use the `accuracy_score, balanced_accuracy_score, f1_score` to compare the predictions with the actual values and obtain different performance metrics about the models

In [46]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

gnb = GaussianNB()
gnb.fit(drop_X_train, Y_train)
Y_pred = gnb.predict(drop_X_test)
print("Accuracy (Dropped categorical): ", accuracy_score(Y_test, Y_pred))
print("F1 (Dropped categorical): ", f1_score(Y_test, Y_pred))

gnb = GaussianNB()
gnb.fit(label_X_train, Y_train)
Y_pred = gnb.predict(label_X_test)
print("Accuracy (Ordinal encoding): ", accuracy_score(Y_test, Y_pred))
print("F1 (Ordinal encoding): ", f1_score(Y_test, Y_pred))

gnb = GaussianNB()
gnb.fit(OH_X_train, Y_train)
Y_pred = gnb.predict(OH_X_test)
print("Accuracy (One-hot encoding): ", accuracy_score(Y_test, Y_pred))
print("F1 (One-hot encoding): ", f1_score(Y_test, Y_pred))

Accuracy (Dropped categorical):  0.8811948404616429
F1 (Dropped categorical):  0.2616033755274262
Accuracy (Ordinal encoding):  0.8689748811948405
F1 (Ordinal encoding):  0.24313725490196078
Accuracy (One-hot encoding):  0.41683638832315
F1 (One-hot encoding):  0.13668341708542714


## Activity 1.6

Now that you are familiar with the pipeline of training, testing and evaluating one model, you can easily repeat the procedure for multiple models.

1. Have a look at the documentation of the [Scikit-learn](https://scikit-learn.org/stable/index.html) library for other classification models:
    - Logistic Regression
    - Decision Trees
    - SVM
    - Random Forest
    - Gradient Boosting
    - Artificial Neural Networks
    - K-Nearest Neighbors
2. For each model:
    1. Initialize the model
    2. Use the `fit` function to perform the training of the model on the training set
    3. Use the `predict` function to perform the prediction of the model on the test set
    4. Use the `accuracy_score, balanced_accuracy_score, f1_score` to compare the predictions with the actual values and obtain performance metrics about the models.
    
3. Create a dictionary/Data Frame in order to be able to compare the performance scores of the different models.
    1. Are there any differences in the values of the metrics?
    2. Why are these values different? Check the documentation to get to know more about the metrics.



In [47]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier


names = [
    "Logistic Regression",
    "Decision Tree",
    "Linear SVM",
    "Random Forest",
    "AdaBoost",
    "Neural Net",
    "K-Nearest Neighbours"
]

classifiers = [
    LogisticRegression(random_state=0),
    SVC(kernel="linear", C=0.025),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    AdaBoostClassifier(),
    MLPClassifier(alpha=1, max_iter=1000),
    KNeighborsClassifier(3),
]

In [53]:
categorical_technique_list = ["Drop Variables", "Ordinal", "One-hot"]
X_train_list = [drop_X_train,label_X_train,OH_X_train]
X_test_list = [drop_X_test,label_X_test,OH_X_test]

# accuracy_per_dataset_df = pandas.DataFrame(columns=["Dataset Name"].append(names))
# balanced_accuracy_per_dataset_df = pandas.DataFrame(columns=["Dataset Name"].append(names))
# f1_score_per_dataset_df = pandas.DataFrame(columns=["Dataset Name"].append(names))

#creating empty DFs
accuracy_per_dataset_df = pandas.DataFrame(columns=["Dataset Name"] + names)
balanced_accuracy_per_dataset_df = pandas.DataFrame(columns=["Dataset Name"] + names)
f1_score_per_dataset_df = pandas.DataFrame(columns=["Dataset Name"] + names)

# Concatenating the DataFrames
accuracy_per_dataset_df = pandas.concat([accuracy_per_dataset_df] * len(names), ignore_index=True)
balanced_accuracy_per_dataset_df = pandas.concat([balanced_accuracy_per_dataset_df] * len(names), ignore_index=True)
f1_score_per_dataset_df = pandas.concat([f1_score_per_dataset_df] * len(names), ignore_index=True)


for technique,X_train,X_test in zip(categorical_technique_list,X_train_list,X_test_list):
    print("[INFO] - Categorical technique: ", technique)
    accuracy_line = {"Dataset Name": technique}
    balanced_accuracy_line = {"Dataset Name": technique}
    f1_score_line = {"Dataset Name": technique}
    
    for classifier,method_name in zip(classifiers,names):
        print("[INFO] - Classifier: ", method_name)
        classifier.fit(X_train, Y_train)
        Y_pred = classifier.predict(X_test)
        accuracy_line[method_name] = accuracy_score(Y_test,Y_pred)
        balanced_accuracy_line[method_name] = balanced_accuracy_score(Y_test,Y_pred)
        f1_score_line[method_name] = f1_score(Y_test,Y_pred)
    
    # accuracy_per_dataset_df = accuracy_per_dataset_df.append(accuracy_line,ignore_index=True)
    # balanced_accuracy_per_dataset_df = balanced_accuracy_per_dataset_df.append(balanced_accuracy_line,ignore_index=True)
    # f1_score_per_dataset_df = f1_score_per_dataset_df.append(f1_score_line,ignore_index=True)
        
    # Append the lines to the DataFrames
    accuracy_per_dataset_df = pandas.concat([accuracy_per_dataset_df, pandas.DataFrame([accuracy_line])], ignore_index=True)
    balanced_accuracy_per_dataset_df = pandas.concat([balanced_accuracy_per_dataset_df, pandas.DataFrame([balanced_accuracy_line])], ignore_index=True)
    f1_score_per_dataset_df = pandas.concat([f1_score_per_dataset_df, pandas.DataFrame([f1_score_line])], ignore_index=True)

[INFO] - Categorical technique:  Drop Variables
[INFO] - Classifier:  Logistic Regression
[INFO] - Classifier:  Decision Tree
[INFO] - Classifier:  Linear SVM
[INFO] - Classifier:  Random Forest
[INFO] - Classifier:  AdaBoost
[INFO] - Classifier:  Neural Net




[INFO] - Classifier:  K-Nearest Neighbours
[INFO] - Categorical technique:  Ordinal
[INFO] - Classifier:  Logistic Regression
[INFO] - Classifier:  Decision Tree


  accuracy_per_dataset_df = pandas.concat([accuracy_per_dataset_df, pandas.DataFrame([accuracy_line])], ignore_index=True)
  balanced_accuracy_per_dataset_df = pandas.concat([balanced_accuracy_per_dataset_df, pandas.DataFrame([balanced_accuracy_line])], ignore_index=True)
  f1_score_per_dataset_df = pandas.concat([f1_score_per_dataset_df, pandas.DataFrame([f1_score_line])], ignore_index=True)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[INFO] - Classifier:  Linear SVM
[INFO] - Classifier:  Random Forest
[INFO] - Classifier:  AdaBoost
[INFO] - Classifier:  Neural Net




[INFO] - Classifier:  K-Nearest Neighbours
[INFO] - Categorical technique:  One-hot
[INFO] - Classifier:  Logistic Regression
[INFO] - Classifier:  Decision Tree


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[INFO] - Classifier:  Linear SVM
[INFO] - Classifier:  Random Forest
[INFO] - Classifier:  AdaBoost
[INFO] - Classifier:  Neural Net




[INFO] - Classifier:  K-Nearest Neighbours


In [54]:
accuracy_per_dataset_df

Unnamed: 0,Dataset Name,Logistic Regression,Decision Tree,Linear SVM,Random Forest,AdaBoost,Neural Net,K-Nearest Neighbours
0,Drop Variables,0.95112,0.95112,0.947047,0.95112,0.948405,0.95112,0.939579
1,Ordinal,0.95112,0.95112,0.948405,0.95112,0.949762,0.95112,0.940258
2,One-hot,0.95112,0.95112,0.947047,0.95112,0.95112,0.95112,0.9389


In [55]:
balanced_accuracy_per_dataset_df

Unnamed: 0,Dataset Name,Logistic Regression,Decision Tree,Linear SVM,Random Forest,AdaBoost,Neural Net,K-Nearest Neighbours
0,Drop Variables,0.5,0.5,0.504446,0.5,0.498572,0.5,0.50052
1,Ordinal,0.5,0.5,0.498572,0.5,0.499286,0.5,0.500877
2,One-hot,0.5,0.5,0.504446,0.5,0.5,0.5,0.500164


In [56]:
f1_score_per_dataset_df

Unnamed: 0,Dataset Name,Logistic Regression,Decision Tree,Linear SVM,Random Forest,AdaBoost,Neural Net,K-Nearest Neighbours
0,Drop Variables,0.0,0.0,0.025,0.0,0.0,0.0,0.021978
1,Ordinal,0.0,0.0,0.0,0.0,0.0,0.0,0.022222
2,One-hot,0.0,0.0,0.025,0.0,0.0,0.0,0.021739


## Activity 1.7

Congratulations! By now you should be able to train, test and evaluate multiple models on a classification task.

1. Have a look at the documentation of the [Scikit-learn](https://scikit-learn.org/stable/index.html) library for the different parameters of other classification models.

2. Analyze the impact of different changes in the predictive setup on the model:
- Does the amount of data in the training set affect the predictive performance? Try to apply the procedure by varying the training-test proportion.
- Does the parameter setting of the different models have an impact on the model performances? Try to tweak the performance by varying the parameters.

# Predictive Analytics - Regression

The regression task we will be tackling is based on the [Medical Cost Personal Dataset](https://www.kaggle.com/datasets/mirichoi0218/insurance?ref=hackernoon.com&select=insurance.csv).

In this case, we will use the available data to try to predict the insurance cost (`charges` variable) as a function of the other variables.

Before starting the modeling task, please have a look at the metadata about the [Medical Cost Personal Dataset](https://www.kaggle.com/datasets/mirichoi0218/insurance?ref=hackernoon.com&select=insurance.csv), in order to better understand the meaning of the different variables.

## Activity 2.1 - Descriptive analytics

We are going to use the `pandas` library to perform some exploratory understanding of the data.

1. Load the dataset in the `insurance_df` variable
2. Display the content of the `insurance_df` variable
3. What are the type of the different columns? Use the knowledge from `pandas` to determine the type.


In [60]:
insurance_df = pandas.read_csv("insurance.csv")

In [61]:
insurance_df.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


In [62]:
insurance_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


## Activity 2.2 - Diagnostic analytics

A common problem in many datasets is missing data, usually indicated by N/A, NA, NaN, and extreme values (outliers).

As a reminder, several ways exist to deal with incomplete or missing data, the most common being:

![MissingData](figures/MissingData.png)

**Source:** *Skarga-Bandurova, I., Biloborodova, T., & Dyachenko, Y. (2018). Strategy to managing mixed datasets with missing items. In Information Processing and Management of Uncertainty in Knowledge-Based Systems. Theory and Foundations: 17th International Conference, IPMU 2018, Cádiz, Spain, June 11-15, 2018, Proceedings, Part II 17 (pp. 608-620). Springer International Publishing.*


1. Is there any column containing missing data in this dataset?
2. If there are any, display the column(s) containing missing data.
3. Count the number of missing values in the column(s) containing missing data.
4. Analyze the missing values and their potential causes, and propose the most appropriate way to process them in order to have a dataset without missing values for the further steps.
5. Produce a new dataset `insurance_noNA_df` containing no missing values.



As there are no missing values, we can use the original dataset as is:

In [63]:
insurance_noNA_df = insurance_df

## Activity 2.3

In order to apply a Machine Learning predictive model on the [Stroke Prediction Dataset](https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset?select=healthcare-dataset-stroke-data.csv) that we had previously imported in the `stroke_df` variable, we need to perform the following operations:

1. Impute missing values (Done in 2.2 by dropping/imputing the missing values)
2. Split data into training and test using the `train_test_split` function (2.3)
3. Transform categorical variables (2.4)

**N.B.**: Please note that the transformation in categorical variables needs to be done after the split into training and test set in order to avoid information leakage (normally the testing set should not be seen by the model during its training phase).

We are going to use the `scikit-learn` library to perform most of the split and transformation tasks.

Here you need to:
1. Divide the `insurance_noNA_df` dataset into two variables:
- `X` containing the input variables
- `Y` containing the target variable (`charges`)
2. Use the `train_test_split` function to obtain `X_train, X_test, Y_train, Y_test` with a 70% train - 30% test split.

In [64]:
from sklearn.model_selection import train_test_split

X = insurance_noNA_df.iloc[:,0:6]
Y = insurance_noNA_df.iloc[:,6]

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)


## Activity 2.4

Before inputting the data to a Machine Learning model, we need all the inputs to be numeric.
In order to transform categorical data into numeric ones, three techniques exist (cf. https://www.kaggle.com/code/alexisbcook/categorical-variables):
- Dropping Categorical variables
- Ordinal Encoding: A categorical variable is replaced by a single numerical variable, where each category is mapped to a different, increasing integer value.
- One-hot Encoding: A categorical variable with $n$ different categories is replaced by $n$ binary variables, each of them corresponding to a category. 

We are going to use the `scikit-learn` library to perform the transformation of the variables and to subsequently fit the models.

1. Have a look at the documentation of the [Scikit-learn](https://scikit-learn.org/stable/index.html) library 
2. Have a look at the following code to perform the transformation of categorical variables:
- Dropping Categorical variables: `drop_X_train` and `drop_X_test`
- Ordinal Encoding: `label_X_train` and `label_X_test`
- One-hot Encoding: `OH_X_train` and `OH_X_test`

In [65]:
# Get list of categorical variables
s = (X_train.dtypes == 'object')
object_cols = list(s[s].index)

print("Categorical variables:")
print(object_cols)

Categorical variables:
['sex', 'smoker', 'region']


### Dropping categorical variables

In [66]:
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_test = X_test.select_dtypes(exclude=['object'])


### Ordinal Encoding

In [67]:
from sklearn.preprocessing import OrdinalEncoder

# Make copy to avoid changing original data 
label_X_train = X_train.copy()
label_X_test = X_test.copy()

# Apply ordinal encoder to each column with categorical data
ordinal_encoder = OrdinalEncoder()
label_X_train[object_cols] = ordinal_encoder.fit_transform(X_train[object_cols])
label_X_test[object_cols] = ordinal_encoder.transform(X_test[object_cols])


### One-hot Encoding

In [69]:
from sklearn.preprocessing import OneHotEncoder

# Apply one-hot encoder to each column with categorical data
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
OH_cols_train = pandas.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_test = pandas.DataFrame(OH_encoder.transform(X_test[object_cols]))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_test.index = X_test.index

# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_test = X_test.drop(object_cols, axis=1)

# Add one-hot encoded columns to numerical features
OH_X_train = pandas.concat([num_X_train, OH_cols_train], axis=1)
OH_X_test = pandas.concat([num_X_test, OH_cols_test], axis=1)


## Activity 2.5

Finally, with the data cleaned of missing values, and with the categorical variables appropriately transformed we are able to fit some models using the `scikit-learn` library.

As seen in Lecture 5 a starter, we will will be using a baseline for regression models: a [Linear Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression)

1. Have a look at the documentation of the [Scikit-learn](https://scikit-learn.org/stable/index.html) library for the Linear Regression model.
2. Initialize the model
3. Use the `fit` function to perform the training of the model on the training set.
4. Use the `predict` function to perform the prediction of the model on the test set.
5. Use the `mean_squared_error, mean_absolute_error, mean_absolute_percentage_error` to compare the predictions with the actual values and obtain different performance metrics about the models.

In [71]:
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_test.columns = OH_X_test.columns.astype(str)

In [72]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, mean_absolute_percentage_error

lr = LinearRegression()
lr.fit(drop_X_train, Y_train)
Y_pred = lr.predict(drop_X_test)
print("MSE (Dropped categorical): ", mean_squared_error(Y_test, Y_pred))
print("MAE (Dropped categorical): ", mean_absolute_error(Y_test, Y_pred))
print("MAPE (Dropped categorical): ", mean_absolute_percentage_error(Y_test, Y_pred))

lr = LinearRegression()
lr.fit(label_X_train, Y_train)
Y_pred = lr.predict(label_X_test)
print("MSE (Ordinal encoding): ", mean_squared_error(Y_test, Y_pred))
print("MAE (Ordinal encoding): ", mean_absolute_error(Y_test, Y_pred))
print("MAPE (Ordinal encoding): ", mean_absolute_percentage_error(Y_test, Y_pred))

lr = LinearRegression()
lr.fit(OH_X_train, Y_train)
Y_pred = lr.predict(OH_X_test)
print("MSE (One-hot encoding): ", mean_squared_error(Y_test, Y_pred))
print("MAE (One-hot encoding): ", mean_absolute_error(Y_test, Y_pred))
print("MAPE (One-hot encoding): ", mean_absolute_percentage_error(Y_test, Y_pred))


MSE (Dropped categorical):  127399626.37416688
MAE (Dropped categorical):  9079.649028580896
MAPE (Dropped categorical):  1.1825368635705567
MSE (Ordinal encoding):  33805466.898688614
MAE (Ordinal encoding):  4155.239843059382
MAPE (Ordinal encoding):  0.44125939462651353
MSE (One-hot encoding):  33780509.57479163
MAE (One-hot encoding):  4145.450555627585
MAPE (One-hot encoding):  0.43585625991943105


## Activity 2.6

Now that you are familiar with the pipeline of training, testing and evaluating one model, you can easily repeat the procedure for multiple models.

1. Have a look at the documentation of the [Scikit-learn](https://scikit-learn.org/stable/index.html) library for other regression models:
    - Lasso/ElasticNet
    - Decision Tree
    - Random Forest
    - Gradient Boosting
    - Artificial Neural Networks
    - K-Nearest Neighbors
2. For each model:
    1. Initialize the model
    2. Use the `fit` function to perform the training of the model on the training set
    3. Use the `predict` function to perform the prediction of the model on the test set
    4. Use the `accuracy_score, balanced_accuracy_score, f1_score` to compare the predictions with the actual values and obtain performance metrics about the models.
    
3. Create a dictionary/Data Frame in order to be able to compare the performance scores of the different models.
    1. Are there any differences in the values of the metrics?
    2. Why are these values different? Check the documentation to get to know more about the metrics.

In [73]:
from sklearn.linear_model import Perceptron, Lasso, ElasticNet
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.neighbors import KNeighborsRegressor


names = [
    "Lasso",
    "Elastic Net",
    "Linear SVM",
    "Decision Tree",
    "Random Forest",
    "AdaBoost",
    "Neural Net",
    "K-Nearest Neighbours"
]

regressors = [
    Lasso(alpha=0.1),
    ElasticNet(random_state=0),
    SVR(kernel="linear", C=0.025),
    DecisionTreeRegressor(max_depth=5),
    RandomForestRegressor(max_depth=5, n_estimators=10, max_features=1),
    AdaBoostRegressor(),
    MLPRegressor(alpha=1, max_iter=1000),
    KNeighborsRegressor(3),
]

In [77]:
categorical_technique_list = ["Drop Variables", "Ordinal", "One-hot"]
X_train_list = [drop_X_train,label_X_train,OH_X_train]
X_test_list = [drop_X_test,label_X_test,OH_X_test]

# MSE_per_dataset_df = pandas.DataFrame(columns=["Dataset Name"].append(names))
# MAE_per_dataset_df = pandas.DataFrame(columns=["Dataset Name"].append(names))
# MAPE_per_dataset_df = pandas.DataFrame(columns=["Dataset Name"].append(names))

# Creating empty DataFrames
MSE_per_dataset_df = pandas.DataFrame(columns=["Dataset Name"] + names)
MAE_per_dataset_df = pandas.DataFrame(columns=["Dataset Name"] + names)
MAPE_per_dataset_df = pandas.DataFrame(columns=["Dataset Name"] + names)

# Concatenating the DataFrames
MSE_per_dataset_df = pandas.concat([MSE_per_dataset_df] * len(names), ignore_index=True)
MAE_per_dataset_df = pandas.concat([MAE_per_dataset_df] * len(names), ignore_index=True)
MAPE_per_dataset_df = pandas.concat([MAPE_per_dataset_df] * len(names), ignore_index=True)

for technique,X_train,X_test in zip(categorical_technique_list,X_train_list,X_test_list):
    print("[INFO] - Categorical technique: ", technique)
    MSE_line = {"Dataset Name": technique}
    MAE_line = {"Dataset Name": technique}
    MAPE_line = {"Dataset Name": technique}
    
    for regressor,method_name in zip(regressors,names):
        print("[INFO] - Regressor: ", method_name)
        regressor.fit(X_train, Y_train)
        Y_pred = regressor.predict(X_test)
        MSE_line[method_name] = mean_squared_error(Y_test, Y_pred)
        MAE_line[method_name] = mean_absolute_error(Y_test, Y_pred)
        MAPE_line[method_name] = mean_absolute_percentage_error(Y_test, Y_pred)
    
    # MSE_per_dataset_df = MSE_per_dataset_df.append(MSE_line,ignore_index=True)
    # MAE_per_dataset_df = MAE_per_dataset_df.append(MAE_line,ignore_index=True)
    # MAPE_per_dataset_df = MAPE_per_dataset_df.append(MAPE_line,ignore_index=True)
    # Append the lines to the DataFrames
    MSE_per_dataset_df = pandas.concat([MSE_per_dataset_df, pandas.DataFrame([MSE_line])], ignore_index=True)
    MAE_per_dataset_df = pandas.concat([MAE_per_dataset_df, pandas.DataFrame([MAE_line])], ignore_index=True)
    MAPE_per_dataset_df = pandas.concat([MAPE_per_dataset_df, pandas.DataFrame([MAPE_line])], ignore_index=True)

[INFO] - Categorical technique:  Drop Variables
[INFO] - Regressor:  Lasso
[INFO] - Regressor:  Elastic Net
[INFO] - Regressor:  Linear SVM
[INFO] - Regressor:  Decision Tree
[INFO] - Regressor:  Random Forest
[INFO] - Regressor:  AdaBoost
[INFO] - Regressor:  Neural Net
[INFO] - Regressor:  K-Nearest Neighbours
[INFO] - Categorical technique:  Ordinal
[INFO] - Regressor:  Lasso
[INFO] - Regressor:  Elastic Net
[INFO] - Regressor:  Linear SVM
[INFO] - Regressor:  Decision Tree
[INFO] - Regressor:  Random Forest
[INFO] - Regressor:  AdaBoost
[INFO] - Regressor:  Neural Net


  MSE_per_dataset_df = pandas.concat([MSE_per_dataset_df, pandas.DataFrame([MSE_line])], ignore_index=True)
  MAE_per_dataset_df = pandas.concat([MAE_per_dataset_df, pandas.DataFrame([MAE_line])], ignore_index=True)
  MAPE_per_dataset_df = pandas.concat([MAPE_per_dataset_df, pandas.DataFrame([MAPE_line])], ignore_index=True)


[INFO] - Regressor:  K-Nearest Neighbours
[INFO] - Categorical technique:  One-hot
[INFO] - Regressor:  Lasso
[INFO] - Regressor:  Elastic Net
[INFO] - Regressor:  Linear SVM
[INFO] - Regressor:  Decision Tree
[INFO] - Regressor:  Random Forest
[INFO] - Regressor:  AdaBoost
[INFO] - Regressor:  Neural Net
[INFO] - Regressor:  K-Nearest Neighbours




In [78]:
MSE_per_dataset_df

Unnamed: 0,Dataset Name,Lasso,Elastic Net,Linear SVM,Decision Tree,Random Forest,AdaBoost,Neural Net,K-Nearest Neighbours
0,Drop Variables,127399600.0,127284500.0,159648600.0,155413700.0,133299500.0,154022800.0,129261000.0,164483500.0
1,Ordinal,33805470.0,87303250.0,159624600.0,20869580.0,41896090.0,26201020.0,122960500.0,131509400.0
2,One-hot,33780680.0,68802850.0,159555100.0,20957330.0,32897370.0,24827010.0,115229300.0,105771000.0


In [79]:
MAE_per_dataset_df

Unnamed: 0,Dataset Name,Lasso,Elastic Net,Linear SVM,Decision Tree,Random Forest,AdaBoost,Neural Net,K-Nearest Neighbours
0,Drop Variables,9079.650755,9083.953829,7086.288501,9351.948069,9089.105148,11121.295617,9182.722541,9069.469601
1,Ordinal,4155.272559,7322.318781,7085.098638,2670.427244,4902.308131,4110.065489,8951.151441,7420.412707
2,One-hot,4145.507331,6299.410887,7084.326374,2672.988491,4264.514947,3780.938608,8619.774231,6225.511141


In [80]:
MAPE_per_dataset_df

Unnamed: 0,Dataset Name,Lasso,Elastic Net,Linear SVM,Decision Tree,Random Forest,AdaBoost,Neural Net,K-Nearest Neighbours
0,Drop Variables,1.18254,1.190088,0.557398,1.236815,1.279493,1.85015,1.285927,1.192442
1,Ordinal,0.441262,0.950017,0.557056,0.307797,0.738643,0.68165,1.2544,0.793421
2,One-hot,0.435868,0.815184,0.557314,0.308484,0.708845,0.567752,1.198214,0.591876


## Activity 2.7

Congratulations! By now you should be able to train, test and evaluate multiple models on a classification task.

1. Have a look at the documentation of the [Scikit-learn](https://scikit-learn.org/stable/index.html) library for the different parameters of other classification models.

2. Analyze the impact of different changes in the predictive setup on the model:
- Does the amount of data in the training set affect the predictive performance? Try to apply the procedure by varying the training-test proportion.
- Does the parameter setting of the different models have an impact on the model performances? Try to tweak the performance by varying the parameters.