This is where the magic happens:

# MODELLING

In this notebook I'll build a model to predict what a tourist will spend when vacationing in Tanzania.
The evaluation metric for the model is **Mean Absolute Error**.


To do:
- feature selection/feature engineering (subregions)
- model
- again if necessary: feature selection/feature engineering
- outlier handling (Isolation Forest?)
- hyperparameter tuning
- interpretation/visualization

In [1]:
# import some packages that I'll need

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, cross_val_predict, cross_val_score, cross_validate
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, RobustScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from verstack.stratified_continuous_split import scsplit

# suppress warnings
import warnings
warnings.filterwarnings('ignore')

# set color scheme
cpal = ["#f94144","#f3722c","#f8961e","#f9844a","#f9c74f","#90be6d","#43aa8b","#4d908e","#577590","#277da1"]

# seaborn theme
sns.set()

# show floats with two decimal places
pd.options.display.float_format = "{:.2f}".format

# set random seed
RSEED = 42

In [2]:
# load data
TZA = pd.read_csv('data/Train.csv')

## Train Test Split

I'm going to split the data into train and test sets now, right at the beginning, to avoid data leakage.

I'm not using the regular sklearn train test split, as it gave me concerning results in the first run (much better performance on test than on train data). That's why I use Verstack's stratified continuous split, which allows me to stratify by a continuous target variable. (It makes sure that the different bins of total_cost are evenly distributed among train and test data.)

In [3]:
# train test split
train, test = scsplit(TZA, stratify = TZA['total_cost'], test_size = 0.3, random_state = RSEED)
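
To illustrate the idea behind stratifying by a continuous target, here is a minimal sketch on synthetic data. It is not Verstack's implementation: I simply assume that binning the target into deciles with `pd.qcut` and passing those bins to sklearn's `train_test_split` is a fair approximation of the principle. The names `demo`, `cost_bin` and `max_share_gap` are made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# synthetic, right-skewed target (assumption: not the real total_cost)
rng = np.random.default_rng(42)
demo = pd.DataFrame({"total_cost": rng.lognormal(mean=15, sigma=1, size=1000)})

# bin the continuous target into deciles so it can be used for stratification
demo["cost_bin"] = pd.qcut(demo["total_cost"], q=10, labels=False)

# stratify the split by the bin label instead of the raw target
demo_train, demo_test = train_test_split(
    demo, test_size=0.3, stratify=demo["cost_bin"], random_state=42
)

# compare the share of each bin in train and test
train_share = demo_train["cost_bin"].value_counts(normalize=True).sort_index()
test_share = demo_test["cost_bin"].value_counts(normalize=True).sort_index()
max_share_gap = (train_share - test_share).abs().max()
print(max_share_gap)
```

If the gap between the bin shares is close to zero, both sets cover the cheap and the expensive trips in similar proportions.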

## Preprocessing

I am going to preprocess the data now. I'll do it separately for train and test data.

#### Missing values and minor adjustments

In [4]:
# function to handle missing data and to make some minor adjustments on the dataset

def basic_preprocessing(df):
    # fill NaN total_male/total_female with 0
    df['total_male'] = df['total_male'].fillna(0)
    df['total_female'] = df['total_female'].fillna(0)
    
    # set travel_with to "Alone" wherever total_male plus total_female is one (this also overwrites non-missing values)
    df.loc[df['total_female'] + df['total_male'] == 1, 'travel_with'] = 'Alone'
    
    # fill remaining NaN travel_with with missing
    df['travel_with'] = df['travel_with'].fillna('missing')
    
    # fill NaN most_impressing with "No comments"
    df['most_impressing'] = df['most_impressing'].fillna('No comments')
   
    # drop id column
    df = df.drop(['ID'], axis =1)
    
    return df

In [5]:
# apply function on train data
train = basic_preprocessing(train)
# apply function on test data
test = basic_preprocessing(test)

In [6]:
# separate target variable, both in train and test data

X_train = train.drop(['total_cost'], axis=1)
y_train = train['total_cost']

X_test = test.drop(['total_cost'], axis=1)
y_test = test['total_cost']

### Build Pipelines

I'm going to build some pipelines now. They'll make modelling easier and faster.

I start with a pipeline for the categorical features. I use a One Hot Encoder to convert them into numbers.

In [7]:
# create list of categorical features
cat_features = list(X_train.columns[X_train.dtypes==object])

# build pipeline
cat_pipeline = Pipeline([
    ('1hot', OneHotEncoder(handle_unknown= 'ignore', drop = 'first'))
])
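
A small sketch of what `drop='first'` does, using a made-up `travel_with` column (the values are invented for illustration and not taken from the data). With three categories the encoder produces only two columns, and the first category in sorted order becomes the all-zeros reference row.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# toy column with three categories (hypothetical values)
toy = pd.DataFrame({"travel_with": ["Alone", "Spouse", "Children", "Alone"]})

encoder = OneHotEncoder(drop="first")
# the encoder returns a sparse matrix by default, so convert it for display
encoded = encoder.fit_transform(toy).toarray()

print(encoder.categories_[0])  # categories in sorted order
print(encoded)                 # "Alone" rows are all zeros
```

Dropping one column per feature avoids perfectly collinear dummy columns, which matters for an unregularized linear regression.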

For the numerical features I'll use a Robust Scaler. It handles outliers quite well because it centers each feature on its median and scales it by the interquartile range.

In [8]:
# create list of numerical features
num_features = list(X_train.columns[X_train.dtypes!=object])

# build pipeline
num_pipeline = Pipeline([
    ('rob_scaler', RobustScaler())
])
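
To make the scaling concrete, here is a short sketch on toy numbers (the array `values` is made up). It checks that the Robust Scaler with its default settings matches the manual calculation (value minus median, divided by the interquartile range).

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# toy feature with one extreme outlier (hypothetical values)
values = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

scaled = RobustScaler().fit_transform(values)

# manual version: subtract the median, divide by the interquartile range
median = np.median(values)
q1, q3 = np.percentile(values, [25, 75])
manual = (values - median) / (q3 - q1)

print(np.allclose(scaled, manual))
print(scaled.ravel())
```

The outlier stays large after scaling, but it does not distort the scale of the other values, because median and interquartile range barely react to it.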

In [9]:
# combine both pipelines in a preprocessor
preprocessor = ColumnTransformer([
    ('num', num_pipeline, num_features),
    ('cat', cat_pipeline, cat_features)
])

### Baseline Model

First I'm going to train a linear regression model. Apart from some NaN imputations and the basic preprocessing, we haven't made any adjustments to the data yet. The result of the Baseline Model will serve as my benchmark.

In [10]:
# build pipeline that combines the preprocessor and the linear regression model
pipe_linreg_bl = Pipeline([
    ('preprocessor', preprocessor),
    ('linreg', LinearRegression())
])

In [11]:
# cross validate to check how the model performs on the train data
y_train_predicted_bl_cv = cross_val_predict(pipe_linreg_bl, X_train, y_train, cv=100)

# print MAE of Baseline Model (train data)
print("Mean Absolute Error Baseline Model (train data): {:.2f}".format(mean_absolute_error(y_train, y_train_predicted_bl_cv)))

Mean Absolute Error Baseline Model (train data): 5858302.30
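
`cross_val_predict` returns one out-of-fold prediction per row, so the MAE above is a single pooled number. If the spread across folds is of interest, `cross_val_score` with `scoring="neg_mean_absolute_error"` gives one value per fold. The sketch below runs on synthetic data; `X_demo`, `y_demo` and the choice of 5 folds are assumptions made for this example only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# synthetic regression problem (assumption: stands in for the real data)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# sklearn maximizes scores, so the MAE comes back negated
neg_mae = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=5, scoring="neg_mean_absolute_error"
)
fold_mae = -neg_mae

print("MAE per fold:", fold_mae)
print("Mean: {:.3f}, std: {:.3f}".format(fold_mae.mean(), fold_mae.std()))
```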


In [12]:
# fit the actual model on the full train data (fit returns the fitted pipeline, not predictions)
pipe_linreg_bl.fit(X_train, y_train)

# make predictions for the test data
y_test_predicted_bl = pipe_linreg_bl.predict(X_test)
# print MAE of Baseline Model (test data)
print("Mean Absolute Error Baseline Model (test data): {:.2f}".format(mean_absolute_error(y_test, y_test_predicted_bl)))

Mean Absolute Error Baseline Model (test data): 5895242.84


#### Interpretation

The Mean Absolute Error of the Baseline Model on the test data is 5895242.84 Tanzanian Shillings (TZS).
What does that mean?
The MAE is the sum of the absolute errors divided by the sample size $n$, where $y_i$ is the prediction and $x_i$ is the true value:

$$ 
MAE = \frac {\sum_{i=1}^n \vert y_i - x_i \vert} {n}
$$

The MAE uses the same scale as the data, so in this case TZS. 
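
A tiny worked example of the formula, with made-up numbers (`true_values` and `predictions` are not taken from the data). It computes the MAE by hand and compares the result with sklearn's `mean_absolute_error`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# hypothetical true values x_i and predictions y_i
true_values = np.array([100.0, 200.0, 300.0])
predictions = np.array([110.0, 190.0, 330.0])

# absolute errors are 10, 10 and 30, so the MAE is 50 / 3
mae_manual = np.abs(predictions - true_values).sum() / len(true_values)
mae_sklearn = mean_absolute_error(true_values, predictions)

print("{:.2f}".format(mae_manual))
print("{:.2f}".format(mae_sklearn))
```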

So, on average, the model's predictions are 5895242.84 TZS off the true value. This is roughly 2156 Euro, which seems like quite a lot.

Our aim is to improve (lower) this metric as much as we can.

So let's start with the actual 

### Modelling