This notebook contains material from [Kaggle's Intro to Machine Learning](https://www.kaggle.com/learn/intro-to-machine-learning), [Kaggle's Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning) and [Kaggle's Machine Learning Explainability](https://www.kaggle.com/learn/machine-learning-explainability).

### Table of Contents
#### <a href='#partI'> 1. Intro to Machine Learning </a>  
> <a href='#dataExploration'> 1.1 Basic Data Exploration </a>  
> <a href='#buildingModel'> 1.2 Building a Model </a>  
> <a href='#modelValidation'> 1.3 Model Validation </a>  
> <a href='#underfitting&overfitting'> 1.4 Underfitting and Overfitting </a>  
> <a href='#randomForests'> 1.5 Random Forests </a>

#### <a href='#partII'> 2. Intermediate Machine Learning </a>  
> <a href='#missingValues'> 2.1 Missing Values </a>  
> <a href='#categoricalValues'> 2.2 Categorical Values </a>  
> <a href='#pipelines'> 2.3 Pipelines </a>  
> <a href='#crossValidation'> 2.4 Cross-Validation </a>  
> <a href='#XGBoost'> 2.5 XGBoost </a>  
> <a href='#dataLeakage'> 2.6 Data Leakage </a>  

#### <a href='#partIII'> 3. Machine Learning Explainability </a>  
> <a href='#modelInsights'> 3.1 Use cases for Model Insights </a>  
> <a href='#permutationImportance'> 3.2 Permutation Importance </a>  
> <a href='#partialDependence'> 3.3 Partial Dependence Plots </a>  
> <a href='#SHAP'> 3.4 SHAP values </a>  

#### <a href='#references'> References </a> 

# <a id="partI">1. Intro to Machine Learning</a>
## <a id="dataExploration">1.1 Basic Data Exploration</a>
We have to familiarize ourselves with the data. First, we will use *pandas* to load the *csv* data. 


In [None]:
import pandas as pd
housing_data_full = pd.read_csv("../input/melbourne-housing-snapshot/melb_data.csv")

Then we can inspect the data, e.g. the number of observations and the distribution of each feature. One convenient method is `describe()`, which prints summary statistics for every numerical column:

In [None]:
housing_data_full.describe()

We can list the names of the features (columns) using `columns`:

In [None]:
housing_data_full.columns

Finally, we can also remove the rows that contain missing values:

In [None]:
housing_data = housing_data_full.dropna(axis=0)

## <a id="buildingModel">1.2 Building a model</a>

In our model, we want to predict the price of a house given certain features. Therefore, our target variable *y* is the price:

In [None]:
y = housing_data.Price
y.head()

Then, we have to choose the different features we will take into account to predict the price of different houses. We have to select them carefully, since they will be the building blocks of the model. 

In [None]:
selected_features = ["Rooms", "Bathroom", "Landsize", "Lattitude", "Longtitude"]
X = housing_data[selected_features]

We will use a *decision tree* to model our data. To do so, we first import the regressor from *sklearn* and then fit it to the data:

In [None]:
from sklearn.tree import DecisionTreeRegressor
housing_model = DecisionTreeRegressor(random_state=0)
housing_model.fit(X, y)

Let's carefully compare the true and the predicted values.

In [None]:
predicted_values = housing_model.predict(X)
print(predicted_values)
print(y.to_numpy())

The predicted results match the true values. Note, however, that we have evaluated the model on its own training data: it only had to make predictions for houses it has already seen. What would happen if we gave the model a new example? The training data might be biased, in the sense that it does not cover all the different kinds of houses there might be. If that is the case, our model will not **generalize** properly.

## <a id="modelValidation">1.3 Model Validation</a>

Model validation assesses how the current model performs on new data. We call the data used to build the model the _training set_ and the data used to test it the _test set_. We can further split the training data into two groups: the training set proper, which is used to fit the model, and the _validation set_, which is only used to assess it. The validation set is helpful when the test data is not available (which is the case most of the time). We can split the training data using *sklearn*.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state = 0)
housing_model = DecisionTreeRegressor()
housing_model.fit(X_train, y_train)
val_predictions = housing_model.predict(X_valid)

We can evaluate the performance of the model using different metrics. In this case, we will use the *mean_absolute_error* function from *sklearn*:

In [None]:
from sklearn.metrics import mean_absolute_error
print(mean_absolute_error(y_valid, val_predictions))

We have a mean absolute error of almost 270,000 dollars (quite terrible!)
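To make the metric concrete, here is a minimal sketch of the mean absolute error computed by hand, $\text{MAE} = \frac{1}{n}\sum_{i}|y_i - \hat{y}_i|$. The function name `mean_absolute_error_manual` and the three prices are made up for illustration; they are not taken from the housing data.

```python
import numpy as np

def mean_absolute_error_manual(y_true, y_pred):
    """Average of the absolute differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))

# Hypothetical prices and predictions, for illustration only
y_true_demo = [500_000, 750_000, 1_200_000]
y_pred_demo = [450_000, 800_000, 1_000_000]

# Absolute errors are 50,000, 50,000 and 200,000, so the mean is 100,000
mae_demo = mean_absolute_error_manual(y_true_demo, y_pred_demo)
print(mae_demo)
```

Because the error is expressed in the units of the target, an MAE of 100,000 here reads directly as "off by 100,000 dollars on average".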

## <a id="underfitting&overfitting">1.4 Underfitting and Overfitting</a>

Earlier, we talked about the ability of the model to generalize. The more data we have, the more of the underlying variability the model can account for, and hence the better it tends to generalize. If we use too little data or the complexity of the model is too high, we are bound to find **overfitting**: the model will perform well on the training data, but poorly on new observations. Conversely, if our model is too simple, we will find **underfitting**: the model will be incapable of capturing important patterns in the data. We can observe both phenomena by changing the maximum number of leaves of the decision tree model we used before.

In [None]:
def get_mae_train(max_leaf_nodes, X_train, X_valid, y_train, y_valid):
    """MAE on the training set for a tree with at most max_leaf_nodes leaves."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(X_train, y_train)
    preds_train = model.predict(X_train)
    return mean_absolute_error(y_train, preds_train)

def get_mae_test(max_leaf_nodes, X_train, X_valid, y_train, y_valid):
    """MAE on the held-out validation set for the same kind of tree."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(X_train, y_train)
    preds_val = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds_val)

for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae_train = get_mae_train(max_leaf_nodes, X_train, X_valid, y_train, y_valid)
    my_mae_test = get_mae_test(max_leaf_nodes, X_train, X_valid, y_train, y_valid)
    print("Max leaf nodes: %d  \t\t MAE train:  %d \t\t MAE test:  %d" %(max_leaf_nodes, my_mae_train, my_mae_test))

We can plot the training and test error against the number of leaves.

In [None]:
my_mae_train = []
my_mae_test = []
leaves = range(5,1050,5)

for max_leaf_nodes in leaves:
    my_mae_train.append(get_mae_train(max_leaf_nodes, X_train, X_valid, y_train, y_valid))
    my_mae_test.append(get_mae_test(max_leaf_nodes, X_train, X_valid, y_train, y_valid))

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(8,5))
#plt.plot(leaves, my_mae_train)
#plt.plot(leaves, my_mae_test)

sns.lineplot(x=leaves, y=my_mae_train)
sns.lineplot(x=leaves, y=my_mae_test)
plt.xlabel("No. leaves", fontsize=15)
plt.ylabel("MAE", fontsize=15)
plt.legend(labels=['Train MAE', 'Test MAE']);

The plot shows how the test error decreases until it reaches a minimum at $\approx$ 150 leaves. The training error keeps decreasing as the complexity increases, since the model keeps fitting the patterns of the training data (even those which are particular to the training set).
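Rather than reading the best tree size off a plot, we can pick it programmatically. The sketch below does this on synthetic data (a noisy sine curve, which is an assumption for illustration and unrelated to the housing data); the names `candidate_leaves`, `val_mae` and `best_leaves` are illustrative.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): a noisy sine curve
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(400, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=400)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, random_state=0)

# Validation MAE for each candidate tree size
candidate_leaves = [2, 5, 10, 25, 50, 100, 200]
val_mae = {}
for n_leaves in candidate_leaves:
    tree = DecisionTreeRegressor(max_leaf_nodes=n_leaves, random_state=0)
    tree.fit(X_tr, y_tr)
    val_mae[n_leaves] = mean_absolute_error(y_va, tree.predict(X_va))

# The best size is the one with the lowest validation error
best_leaves = min(val_mae, key=val_mae.get)
print(best_leaves, val_mae[best_leaves])
```

The selection uses the validation error only; choosing by training error would always favor the largest tree.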

## <a id="randomForests">1.5 Random Forests</a>
Random forests combine many different decision trees. This mitigates both underfitting and overfitting, since the final prediction is the average of the predictions of the individual trees.
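The averaging can be checked directly. The following sketch, on synthetic data chosen only for illustration, compares the prediction of a scikit-learn `RandomForestRegressor` with the mean of the predictions of its fitted trees, which are exposed through the `estimators_` attribute.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (assumption, for illustration only)
rng = np.random.RandomState(0)
X_demo = rng.uniform(-1, 1, size=(200, 3))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_demo, y_demo)

# One row of predictions per tree, then average over the trees
per_tree = np.stack([tree.predict(X_demo) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# The forest prediction matches the average of its trees
print(np.allclose(manual_average, forest.predict(X_demo)))
```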


In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(X_train, y_train)
predictionsRF = forest_model.predict(X_valid)
print(mean_absolute_error(y_valid, predictionsRF))

The error has dropped considerably, to almost 220,000!

# <a id="partII">2. Intermediate Machine Learning</a>
## <a id="missingValues">2.1 Missing Values </a>
It might happen that some features of our data are incomplete. In those cases, we have three different options.



In [None]:
# Create and evaluate model in a function
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=50, random_state=1)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

# Retrieve the full dataset and divide it into training and validation sets
y = housing_data_full.Price
X_wo_y = housing_data_full.drop(['Price'], axis=1)
X = X_wo_y.select_dtypes(exclude=['object'])
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state = 0)

### 1. Drop the features with unknown values 


In [None]:
# Get names of columns with missing values
features_with_missing = [feature for feature in X_train.columns if X_train[feature].isnull().any()]

# Drop columns in training and validation data
reduced_X_train = X_train.drop(features_with_missing, axis=1)
reduced_X_valid = X_valid.drop(features_with_missing, axis=1)

print(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))

### 2. Imputation (replace each unknown value with the mean of its feature)

In [None]:
from sklearn.impute import SimpleImputer

# Imputation
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))

# Imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns

print(score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))

### 3. Enhanced imputation (add an extra column indicating whether the value was originally missing)

In [None]:
# Make copy to avoid changing original data (when imputing)
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()

# Make new columns indicating what will be imputed
for feature in features_with_missing:
    X_train_plus[feature + '_was_missing'] = X_train_plus[feature].isnull()
    X_valid_plus[feature + '_was_missing'] = X_valid_plus[feature].isnull()

# Imputation
my_imputer = SimpleImputer()
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))

# Imputation removed column names; put them back
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns

print(score_dataset(imputed_X_train_plus, imputed_X_valid_plus, y_train, y_valid))

Note that the second and third methods achieved a lower error than the first one!
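To see what the imputer actually produces, here is a small sketch on a hand-made array. It uses the `add_indicator=True` option of `SimpleImputer`, a scikit-learn shortcut that appends the "was missing" columns automatically; this is an alternative to the manual loop above, not what the notebook does.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Tiny hand-made array (assumption): each column has one missing value
X_demo = np.array([[1.0, np.nan],
                   [3.0, 4.0],
                   [np.nan, 6.0]])

# Mean imputation plus one indicator column per feature that had missing values
imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_imputed = imputer.fit_transform(X_demo)

# Columns 0-1: imputed features (column means are 2 and 5)
# Columns 2-3: 1.0 where the original value was missing, 0.0 otherwise
print(X_imputed)
```

The indicator columns let the model learn that "value was missing" can itself carry information, which the plain mean imputation hides.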

## <a id="categoricalValues">2.2 Categorical Values</a>

We might have some features whose values are not numbers: for instance, strings such as "Never" or "Always", or booleans such as "True" or "False". The following code prepares the data and collects the categorical columns:






In [None]:
# Read the data
data = pd.read_csv('../input/melbourne-housing-snapshot/melb_data.csv')

# Separate target from predictors
y = data.Price
X = data.drop(['Price'], axis=1)

# Divide data into training and validation subsets
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                                random_state=0)

# Drop columns with missing values (simplest approach)
cols_with_missing = [col for col in X_train_full.columns if X_train_full[col].isnull().any()] 
X_train_full.drop(cols_with_missing, axis=1, inplace=True)
X_valid_full.drop(cols_with_missing, axis=1, inplace=True)

# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
low_cardinality_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and 
                        X_train_full[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = low_cardinality_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()

# Get list of categorical variables
s = (X_train.dtypes == 'object')
object_cols = list(s[s].index)


We can include this data in the model by mapping those values to numbers. We have three different ways to handle categorical values:

### 1. Drop categorical values

In [None]:
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])

print(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))

### 2. Label Encoding

In [None]:
from sklearn.preprocessing import LabelEncoder

# Make copy to avoid changing original data 
label_X_train = X_train.copy()
label_X_valid = X_valid.copy()

# Apply label encoder to each column with categorical data
label_encoder = LabelEncoder()
for col in object_cols:
    label_X_train[col] = label_encoder.fit_transform(X_train[col])
    label_X_valid[col] = label_encoder.transform(X_valid[col])

print(score_dataset(label_X_train, label_X_valid, y_train, y_valid))

### 3. One-hot vector

In [None]:
from sklearn.preprocessing import OneHotEncoder

# Apply one-hot encoder to each column with categorical data
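# Recent scikit-learn releases name this argument sparse_output; older ones used sparse=False
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)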
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

print(score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))
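To make the difference between label encoding and one-hot encoding concrete, the following sketch encodes a tiny hand-made column with pandas. The column name `Type` and its values are illustrative assumptions, and pandas is used here instead of the scikit-learn encoders only to keep the example short.

```python
import pandas as pd

# Tiny hand-made categorical column (assumption, for illustration only)
df_demo = pd.DataFrame({"Type": ["h", "u", "t", "h"]})

# Label encoding: one integer per category, assigned in sorted order (h=0, t=1, u=2)
codes = df_demo["Type"].astype("category").cat.codes
print(codes.tolist())

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(df_demo["Type"], prefix="Type")
print(one_hot.columns.tolist())
```

Label encoding imposes an order on the categories (here `h < t < u`), which is only meaningful for ordinal variables; one-hot encoding avoids that assumption at the cost of one extra column per category.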

## <a id="pipelines">2.3 Pipelines</a>

Throughout the notebook, we have copied the code for the preprocessing and fit over and over. We can bundle everything together using pipelines, making our code cleaner and more readable.
### Step 1: Define preprocessing steps

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

# Select categorical columns
categorical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and 
                        X_train_full[cname].dtype == "object"]

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

### Step 2: Define the model

In [None]:
# Define the model
model = RandomForestRegressor(n_estimators=100, random_state=0)

### Step 3: Create and evaluate the pipeline

In [None]:
from sklearn.metrics import mean_absolute_error

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])

# Preprocessing of training data, fit model 
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)

## <a id="crossValidation">2.4 Cross-Validation</a>

With a single train-validation split, the score depends on which observations happen to fall in the validation set, so it can be noisy or biased by chance. Cross-validation reduces this effect by iteratively training and validating the model on different chunks (folds) of the data, so that every observation is used for validation exactly once. Computationally, it is more expensive than a single training and validation run. In return, we get a more reliable estimate of how well the model captures the patterns in our data.

In [None]:
from sklearn.model_selection import cross_val_score

# Multiply by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(my_pipeline, X, y,
                              cv=5,
                              scoring='neg_mean_absolute_error')

print("MAE scores:\n", scores)
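The folds themselves can be inspected with `KFold`. The sketch below uses ten dummy observations (an assumption for illustration) and no shuffling, which to my understanding is what an integer `cv` gives for a regressor; it counts how often each observation lands in a validation fold.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 10  # dummy observations, identified by their index
kfold = KFold(n_splits=5)  # unshuffled, consecutive folds

# Count how many times each observation is used for validation
times_validated = np.zeros(n_samples, dtype=int)
for fold, (train_idx, valid_idx) in enumerate(kfold.split(np.arange(n_samples))):
    times_validated[valid_idx] += 1
    print("fold", fold, "validation indices:", valid_idx.tolist())

# Every observation is validated exactly once
print(times_validated)
```

Averaging the per-fold scores, e.g. `scores.mean()` for the array computed above, gives a single summary number for the model.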

## <a id="XGBoost">2.5 XGBoost</a>

Random forests are an example of an ensemble method. Ensemble methods combine the predictions of several models (a random forest averages them). Gradient boosting is an ensemble method that adds new models to the ensemble iteratively. The steps are the following:

1. Calculate the predictions with the current ensemble
2. Compute the loss of the ensemble
3. Fit a new model that reduces this loss
4. Add the new model to the ensemble and repeat
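The four steps can be sketched from scratch for the squared-error loss, where "fitting a new model to reduce the loss" amounts to fitting a small tree to the current residuals. This is a simplified illustration on synthetic data, not how XGBoost is implemented; `learning_rate`, `n_rounds` and the tree depth are arbitrary choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): a noisy sine curve
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_rounds = 50

# Start from a constant model: the mean of the target
ensemble_pred = np.full_like(y_demo, y_demo.mean())
trees = []
mse_history = []

for _ in range(n_rounds):
    # Steps 1-2: current predictions and their loss (residuals are the
    # negative gradient of the squared error, up to a constant factor)
    residuals = y_demo - ensemble_pred
    mse_history.append(np.mean(residuals ** 2))
    # Step 3: fit a small tree to the residuals
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, residuals)
    # Step 4: add the (shrunken) tree to the ensemble
    ensemble_pred += learning_rate * tree.predict(X_demo)
    trees.append(tree)

final_mse = np.mean((y_demo - ensemble_pred) ** 2)
print(mse_history[0], final_mse)
```

The training error never increases from one round to the next, which is exactly why a validation set is needed to decide when to stop adding trees.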


In [None]:
from xgboost import XGBRegressor

# Read the data
data = pd.read_csv('../input/melbourne-housing-snapshot/melb_data.csv')

# Select subset of predictors
feature_names = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[feature_names]

# Select target
y = data.Price

# Separate data into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y)

my_model = XGBRegressor()
my_model.fit(X_train, y_train)

predictions = my_model.predict(X_valid)
print("Mean Absolute Error: " + str(mean_absolute_error(predictions, y_valid)))

XGBoost has several parameters to tune: `n_estimators`, the number of boosting rounds, i.e. the number of models in the ensemble; `learning_rate`, a factor that scales the contribution of each new model, so that smaller values make each step more cautious; `early_stopping_rounds`, which stops training once the validation score has not improved for that many consecutive rounds; and `n_jobs`, the number of cores used to parallelize the training.

In [None]:
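# Recent XGBoost releases take early_stopping_rounds in the constructor;
# older releases expected it as an argument of fit()
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=2,
                        early_stopping_rounds=5)
my_model.fit(X_train, y_train,
             eval_set=[(X_valid, y_valid)],
             verbose=False)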

predictions = my_model.predict(X_valid)
print("Mean Absolute Error: " + str(mean_absolute_error(predictions, y_valid)))

## <a id="dataLeakage">2.6 Data Leakage</a>

Data leakage occurs when the training data contains information about the target that will not be available when the model is used for prediction. There are two main types: *target leakage* and *train-test contamination*. Target leakage occurs when the features include fields that will not be available at prediction time (e.g. events that are only known in hindsight). Train-test contamination occurs when information from the validation or test data influences training, for example when a preprocessing step such as imputation is fitted before the train-validation split.
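Train-test contamination is easy to introduce through preprocessing. The sketch below uses a hand-made column with one extreme value in the validation rows (an assumption for illustration) and compares an imputer fitted on all the data with one fitted on the training rows only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hand-made column (assumption): first four rows are training, last two validation
X_demo = np.array([[1.0], [2.0], [np.nan], [3.0], [np.nan], [100.0]])
X_tr, X_va = X_demo[:4], X_demo[4:]

# Contaminated: the imputer sees the validation rows, including the value 100
leaky_imputer = SimpleImputer(strategy="mean").fit(X_demo)

# Clean: the imputer is fitted on the training rows only
clean_imputer = SimpleImputer(strategy="mean").fit(X_tr)

print(leaky_imputer.statistics_)  # mean over all observed values
print(clean_imputer.statistics_)  # mean over the training values only

# Only the clean imputer should be used to transform the validation rows
X_va_clean = clean_imputer.transform(X_va)
print(X_va_clean)
```

Wrapping the preprocessing and the model in a single `Pipeline`, as in section 2.3, helps here because the preprocessing steps are then fitted only on the data passed to `fit`.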

# <a id="partIII">3. Machine Learning Explainability</a>

## <a id="modelInsights">3.1 Use cases for Model Insights</a>
Understanding machine learning models is important for learning about the insights underlying their predictions in different scenarios and with different data. Some of the ways in which machine learning explainability is helpful are the following:
1. Debugging
2. Informing Feature Engineering
3. Directing Future Data Collection
4. Informing Human Decision-Making
5. Building trust

## <a id="permutationImportance">3.2 Permutation Importance</a>
Feature importance tries to address the following question: *What features have the biggest impact on our machine learning model when predicting?*

One of the easiest ways to quantify feature importance is through *permutation importance*.

Idea: Train a model and perform predictions on a validation set. Then, shuffle the values of a single feature (column) across the observations and assess the performance of your model again. If the predictions get worse, the model relies considerably on that feature.
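The idea fits in a few lines of NumPy. In this sketch the "trained model" is a hypothetical stand-in function `predict` that uses only the first of two synthetic features, so we know in advance which feature should turn out to be important.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): the target depends on feature 0 only
X_demo = rng.normal(size=(500, 2))
y_demo = 3.0 * X_demo[:, 0]

def predict(X):
    """Stand-in for a trained model (hypothetical): it has learned y = 3 * x0."""
    return 3.0 * X[:, 0]

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

baseline = mae(y_demo, predict(X_demo))

# Importance of a feature = increase in error after shuffling its column
importances = []
for col in range(X_demo.shape[1]):
    X_shuffled = X_demo.copy()
    X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])
    importances.append(mae(y_demo, predict(X_shuffled)) - baseline)

print(importances)
```

Shuffling feature 0 breaks its link with the target and the error rises, while shuffling feature 1 changes nothing because the model ignores it. In practice the shuffling is usually repeated several times and the results averaged.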

We can make use of the *eli5* library to study permutation importance on our model. We will use the previously trained model (XGBoost) with five features:

In [None]:
import eli5
from eli5.sklearn import PermutationImportance

# Fill NaN values with the mean
X_valid = X_valid.fillna(X_valid.mean())

# Compute permutation importance
perm = PermutationImportance(my_model, random_state=1).fit(X_valid, y_valid)
eli5.show_weights(perm, feature_names = X_valid.columns.tolist())

The results show that the most important feature for predicting a house price is distance, which refers to the distance from the house to the central business district (CBD). Of the five selected features, the year of construction seems to be the least critical.

## <a id="partialDependence">3.3 Partial Dependence Plots</a>
Partial dependence plots show *how* an individual feature affects predictions. Like permutation importance, partial dependence plots are computed after the model has been trained. We take the rows of the validation set, repeatedly replace the value of one feature with each value from a grid, make predictions, and average them. In this way, we isolate the effect of that feature from the effect of the others.
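A minimal sketch of that computation, assuming a hypothetical stand-in model `predict` and synthetic data with two columns (distance and rooms); the helper name `partial_dependence` is illustrative and is not the pdpbox API.

```python
import numpy as np

def predict(X):
    """Stand-in model (hypothetical): price falls with distance, rises with rooms."""
    return 1000.0 - 20.0 * X[:, 0] + 150.0 * X[:, 1]

# Synthetic validation data (assumption): column 0 = distance, column 1 = rooms
rng = np.random.default_rng(0)
X_demo = np.column_stack([rng.uniform(0, 30, size=100),
                          rng.integers(1, 6, size=100)])

def partial_dependence(predict_fn, X, feature, grid):
    """Average prediction when `feature` is forced to each value in `grid`."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value  # same value for every row
        averages.append(predict_fn(X_mod).mean())
    return np.array(averages)

grid = np.linspace(0, 30, 7)  # distances 0, 5, ..., 30
pdp_distance = partial_dependence(predict, X_demo, feature=0, grid=grid)
print(pdp_distance)
```

For this linear stand-in model the curve is a straight line that drops by 100 for every 5 units of distance; for a real model the same procedure reveals non-linear shapes.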

In [None]:
from sklearn.ensemble import RandomForestRegressor

# Separate data into training and validation sets
X = X.fillna(X.mean())
y_scaled = y.copy()/1000 # scale for visualization 
X_train, X_valid, y_train, y_valid = train_test_split(X, y_scaled)

# Train the model
tree_model = RandomForestRegressor(n_estimators=30, random_state=1).fit(X_train, y_train)

In [None]:
from pdpbox import pdp, get_dataset, info_plots

# Create the data that we will plot
pdp_goals = pdp.pdp_isolate(model=tree_model, dataset=X_valid, model_features=feature_names, 
                            feature='Distance')

# Plot
pdp.pdp_plot(pdp_goals, 'Distance');

We can examine how the distance affects the prediction. As the distance increases, the price shows a downward trend. That is, the further the house is from the CBD, the less expensive it becomes. We can analyze other features as well, for instance the year built.

In [None]:
# Create the data that we will plot
pdp_goals = pdp.pdp_isolate(model=tree_model, dataset=X_valid, model_features=feature_names, 
                            feature='YearBuilt')
# Plot
pdp.pdp_plot(pdp_goals, 'YearBuilt');

This plot shows how houses become more affordable the newer they are. We can observe the partial dependence of two different features by means of a 2-D partial dependence plot.

In [None]:
features_to_plot = ['Distance', 'Rooms']
inter  =  pdp.pdp_interact(model=tree_model, dataset=X_valid, model_features=feature_names, features=features_to_plot)
pdp.pdp_interact_plot(pdp_interact_out=inter, feature_names=features_to_plot, plot_type='contour', plot_pdp=True)
plt.show()

## <a id="SHAP">3.4 SHAP values</a>
SHAP values (SHapley Additive exPlanations) break down an individual prediction, showing the impact of each feature. They compare the prediction obtained with the actual value of a feature to the prediction we would make if that feature took some baseline value. In this way, they explain why the prediction differs from the baseline.  
We will use a classification problem to illustrate how SHAP values work. We will load the hospital readmissions dataset. It contains medical information about different patients, together with whether they were readmitted to the hospital or not. The task is to predict whether a given patient will need to be rehospitalized.
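Before turning to the dataset, the idea behind Shapley values can be sketched exactly for a tiny case. The code below is a simplified illustration, not the algorithm used by the `shap` library: the model is a hypothetical function of three features, and "removing" a feature means replacing it with a single fixed baseline value.

```python
from itertools import permutations

def model(x):
    """Hypothetical model with an interaction between features 0 and 2."""
    return 2.0 * x[0] + 1.0 * x[1] + x[0] * x[2]

def shapley_values(model_fn, instance, baseline):
    """Average marginal contribution of each feature over all feature orderings."""
    n = len(instance)
    contributions = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous_output = model_fn(current)
        for feature in order:
            # Switch this feature from its baseline value to its actual value
            current[feature] = instance[feature]
            new_output = model_fn(current)
            contributions[feature] += new_output - previous_output
            previous_output = new_output
    return [c / len(orderings) for c in contributions]

instance = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(model, instance, baseline)

print(phi)
# The contributions add up to the gap between the prediction and the baseline
print(model(baseline) + sum(phi), model(instance))
```

The interaction term `x[0] * x[2]` is shared equally between the two features involved, and the contributions sum to the difference between the prediction and the baseline output, which is the additive property that force plots visualize.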

In [None]:
import shap
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Load Hospital Readmissions dataset
data = pd.read_csv("../input/hospital-readmissions/train.csv")
y = data["readmitted"]
feature_names = [i for i in data.columns if data[i].dtype == np.int64]
X = data[feature_names]
X = X.drop("readmitted", axis=1)
X

The data shows different features, such as time in hospital, the number of procedures or the number of diagnoses. After the model is trained, we can observe the SHAP values to see how each feature affected the final prediction.

In [None]:
# Split data and train the classifier
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=1)
my_model = RandomForestClassifier(random_state=1).fit(X_train, y_train)

# Select the row for prediction
row_to_show = 15
data_for_prediction = X_valid.iloc[row_to_show] 
data_for_prediction_array = data_for_prediction.values.reshape(1, -1)
my_model.predict(data_for_prediction_array)

# Calculate and plot SHAP values
explainer = shap.TreeExplainer(my_model)
shap_values = explainer.shap_values(data_for_prediction)
shap.initjs()
shap.force_plot(explainer.expected_value[1], shap_values[1], data_for_prediction)

In [None]:
# Plot the summary plot of the first 200 observations
shap_values = explainer.shap_values(X_valid[:200])
shap.summary_plot(shap_values[1], X_valid[:200])

The y-axis of the plot lists the different features of the model. The x-axis shows the impact of the feature on a given prediction. The color shows the value of the feature itself, from low to high.

In [None]:
# Plot the dependence plot of the first 200 observations
shap_values = explainer.shap_values(X[:200])
shap.dependence_plot('number_inpatient', shap_values[1], X[:200], interaction_index="number_diagnoses")

The plot has an upward tendency: as the feature `number_inpatient` increases, so does its impact on the model's prediction. The color shows us the value of `number_diagnoses` for a given observation. This dependence plot helps us to see the interplay between these two features and the final prediction. Across the observations, `number_diagnoses` is considerably high. This suggests that the interplay between these two features does not necessarily change the readmission rate.

# <a id="references">References</a>
1. [Kaggle's Intro to Machine Learning](https://www.kaggle.com/learn/intro-to-machine-learning) 
2. [Kaggle's Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning)
3. [Kaggle's Machine Learning Explainability](https://www.kaggle.com/learn/machine-learning-explainability)