# 1. Problem Definition

The Swedish Auto Insurance Dataset involves predicting the total payment for all claims in thousands of Swedish Kronor, given the total number of claims.

It is a <u>regression</u> problem. The dataset comprises 63 observations with one input variable and one output variable. The variables are as follows:

- Number of claims.
- Total payment for all claims in thousands of Swedish Kronor.

We are going to cover the following steps:
1. Problem Definition
2. Load data
3. Understand our data with descriptive statistics
4. Understand our data with visualization
5. Validation dataset
6. Evaluate Algorithms: Baseline
7. Evaluate Algorithms: Standardization
8. Run Models on Validation dataset
9. References and Credits
10. Thoughts

# 2. Load data

Let's start off by loading the libraries required for this project.

## 2.1 Import libraries

First, let's import all of the modules, functions and objects we are going to use in this project.

In [None]:
# Load libraries
import numpy
from numpy import arange
from numpy import set_printoptions
from pandas import read_csv
from pandas import set_option
from pandas.plotting import scatter_matrix
from matplotlib import pyplot
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Normalizer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import LassoLars
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import Ridge
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import BayesianRidge
from sklearn.linear_model import ARDRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import OrthogonalMatchingPursuit
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error

import warnings
warnings.filterwarnings('ignore')

## 2.2 Load Dataset

In [None]:
# Load dataset
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

filename = '/kaggle/input/auto-insurance-in-sweden-small-dataset/insurance.csv'
data = read_csv(filename, skiprows=list(range(0,5)), header=None, names=['claims','payment'])

# 3. Understand our data with descriptive statistics

We are going to cover the following steps:

1. Take a peek at our raw data.
2. Review the dimensions of our dataset.
3. Review the data types of attributes in our data.
4. Summarize the distribution of instances across classes in our dataset. (Since this is not a classification problem, we will skip this step)
5. Summarize our data using descriptive statistics.
6. Understand the relationships in our data using correlations.
7. Review the skew of the distributions of each attribute.

## 3.1 Peek at our data

Let's review the first five rows of the data.

In [None]:
peek = data.head()
print(peek)

## 3.2 Dimensions of our data

In [None]:
shape = data.shape
print(shape)

We can see that the dataset has 63 rows and 2 columns

## 3.3 Data Type for Each Attribute

In [None]:
types = data.dtypes
print(types)

- We can see that claims is an integer column and payment is a floating-point column.

## 3.5 Descriptive Statistics

In [None]:
# Statistical Summary
set_option('display.width', 100)
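# Use the full option name; the short form 'precision' is ambiguous in recent pandas versions
set_option('display.precision', 3)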
description = data.describe()
print(description)

- There are no missing/NA values, so we do not need to handle missing values (i.e., data imputation is not required).
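
The summary table only implies that nothing is missing, so a more direct check is to count nulls per column. The sketch below runs on a tiny made-up frame (`toy` is a hypothetical stand-in, not the real dataset, and one value is deliberately missing); in the notebook the same calls would be made on `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `data`; all values are made up for illustration.
toy = pd.DataFrame({
    "claims": [10, 25, 3, 60, 41],
    "payment": [50.0, 110.5, np.nan, 240.0, 170.2],
})

# Count missing values per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# True only if no cell in the frame is missing.
has_no_missing = not toy.isnull().values.any()
print(has_no_missing)
```

If `has_no_missing` is `True` for the real data, imputation can be skipped as stated above.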

## 3.6 Correlations Between Attributes

In [None]:
# Pairwise Pearson correlations
correlations = data.corr(method='pearson')
print(correlations)

- claims and payment are strongly positively correlated (Pearson r = 0.913).
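
To make the Pearson coefficient less of a black box, here is a minimal sketch that computes it by hand and compares it with NumPy's built-in. The arrays `x` and `y` are made-up values standing in for the two columns, not the insurance data.

```python
import numpy as np

# Made-up values standing in for the claims and payment columns.
x = np.array([5.0, 12.0, 20.0, 33.0, 47.0, 60.0])
y = np.array([30.0, 55.0, 90.0, 160.0, 210.0, 300.0])

# Pearson r = sum of centered products / sqrt(product of centered sums of squares).
x_centered = x - x.mean()
y_centered = y - y.mean()
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

# NumPy's built-in correlation matrix; entry [0, 1] is r between x and y.
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)
```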

## 3.7 Skew of Univariate Distributions

In [None]:
# Skew for each attribute
skew = data.skew()
print(skew)

- The skew results show a positive (right) skew for both claims and payment. Values closer to zero indicate less skew.

# 4. Understand our data with visualization

We are going to cover the following visualizations:
1. Univariate Plots (Histograms, Density Plots, Box and Whisker Plots)
2. Multivariate Plots (Correlation Matrix Plot, Scatter Plot Matrix)

### 4.1.1 Univariate Plots (Histograms)

In [None]:
# Univariate Histograms
data.hist()
pyplot.show()

- Both claims and payment have an exponential-like distribution.

### 4.1.2 Univariate Plots (Density Plots)

In [None]:
# Univariate Density Plots
data.plot(kind='density', subplots=True, layout=(3,3), sharex=False)
pyplot.show()

- The density plots show the shape of each attribute's distribution more clearly than the histograms.

### 4.1.3 Univariate Plots (Box and Whisker Plots)

In [None]:
# Box and Whisker Plots
data.plot(kind='box', subplots=True, layout=(3,3), sharex=False, sharey=False)
pyplot.show()

- It appears that the spread of the attributes is similar.
- claims ranges roughly from 0 to 100.
- payment ranges roughly from 0 to 400.

### 4.2.1 Multivariate Plots (Correlation Matrix Plot)

In [None]:
# Correlation Matrix Plot
correlations = data.corr()
# plot correlation matrix
fig = pyplot.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
# one tick per column, so that each label lines up with its row/column
ticks = numpy.arange(0, len(data.columns), 1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(data.columns)
ax.set_yticklabels(data.columns)
pyplot.show()

- claims and payment are positively correlated
- There are <u>no negative correlations</u>.

### 4.2.2 Multivariate Plots (Scatter Plot Matrix)

In [None]:
# Scatterplot Matrix
g = sns.PairGrid(data, diag_sharey=False)
g.map_upper(sns.scatterplot)
g.map_lower(sns.kdeplot, colors="C0")
g.map_diag(sns.kdeplot, lw=2)

- The positive correlation between claims and payment is also visible in the scatter plot (top right) above.

# 5. Validation Dataset

It is a good idea to use a validation hold-out set. This is a sample of the data that we hold back from our analysis and modeling. We use it right at the end of our project to confirm the accuracy of our final model. It is a smoke test that we can use to see if we messed up and to give us confidence in our estimates of accuracy on unseen data. We will use 80% of the dataset for modeling and hold back 20% for validation.

In [None]:
# Split-out validation dataset
array = data.values
X = array[:,0:1]
Y = array[:,1]
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = train_test_split(X, Y, test_size=validation_size, random_state=seed)
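
To see what the 80/20 split means for a dataset of this size, the sketch below repeats the split on a synthetic array with the same shape as ours (63 rows, 2 columns). The values are random placeholders; only the resulting shapes are of interest.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the dataset (63 rows, 2 columns).
rng = np.random.default_rng(7)
array = rng.uniform(0, 100, size=(63, 2))
X = array[:, 0:1]   # slicing with 0:1 keeps X two-dimensional: shape (63, 1)
Y = array[:, 1]     # the target is one-dimensional: shape (63,)

X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, Y, test_size=0.20, random_state=7
)
print(X_train.shape, X_validation.shape)
```

Because 20% of 63 is not a whole number, scikit-learn rounds the validation size up, leaving 50 rows for training and 13 for validation.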

# 6. Evaluate Algorithms: Baseline

We have no idea what algorithms will do well on this problem. Gut feel suggests regression algorithms like Linear Regression and ElasticNet may do well. It is also possible that decision trees and even SVM may do well. Let's design our test harness. We will use 10-fold cross validation. The dataset is small and this is a good standard test harness configuration. We will evaluate algorithms using the Mean Squared Error (MSE) metric. MSE will give a gross idea of how wrong all predictions are (0 is perfect). Note that scikit-learn's `neg_mean_squared_error` scorer reports the negated MSE, so the scores printed in this notebook are negative and values closer to zero are better.

In [None]:
# Test options and evaluation metric
num_folds = 10
seed = 7
scoring = 'neg_mean_squared_error'
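
The `neg_mean_squared_error` scorer can be confusing at first, so here is a minimal sketch of how to turn its output back into an ordinary MSE and an RMSE. It uses synthetic, roughly linear data (the coefficients and noise level are arbitrary assumptions), not the insurance dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic, roughly linear data; not the insurance dataset.
rng = np.random.default_rng(7)
X = rng.uniform(0, 100, size=(50, 1))
y = 20.0 + 3.4 * X[:, 0] + rng.normal(0, 35, size=50)

kfold = KFold(n_splits=10)
neg_mse = cross_val_score(LinearRegression(), X, y, cv=kfold,
                          scoring="neg_mean_squared_error")

mse = -neg_mse        # flip the sign to get the ordinary MSE per fold
rmse = np.sqrt(mse)   # RMSE is in the same units as the target
print("mean MSE: %.1f, mean RMSE: %.1f" % (mse.mean(), rmse.mean()))
```

The sign is flipped by scikit-learn so that a higher score always means a better model, which is what its model selection tools expect.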

Let's create a baseline of performance on this problem and spot-check a number of different algorithms. We will select a suite of different algorithms capable of working on this regression problem. The six algorithms selected include:
- Linear Algorithms: Linear Regression (LR), Lasso Regression (LASSO) and ElasticNet (EN).
- Nonlinear Algorithms: Classification and Regression Trees (CART), Support Vector Regression (SVR) and k-Nearest Neighbors (KNN).

In [None]:
# Spot-Check Algorithms
models = []
models.append(('LR', LinearRegression()))
models.append(('LASSO', Lasso()))
models.append(('EN', ElasticNet()))
models.append(('KNN', KNeighborsRegressor()))
models.append(('CART', DecisionTreeRegressor()))
models.append(('SVR', SVR()))

The algorithms all use default tuning parameters. Let's compare the algorithms. We will display the mean and standard deviation of MSE for each algorithm as we calculate it and collect the results for use later.

In [None]:
# evaluate each model in turn
results = []
names = []
for name, model in models:
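    # Folds are not shuffled, so they are deterministic; recent scikit-learn
    # versions raise an error if random_state is passed while shuffle=False.
    kfold = KFold(n_splits=num_folds)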
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

- LASSO (-1497), EN (-1497) and LR (-1498) seem to be worth further study.
- We can see that the linear algorithms (LR, LASSO, EN) perform better on this data than the nonlinear algorithms (KNN, CART, SVR).
- Why don't we try more regression algorithms such as Ridge Regression?

Ridge regression addresses some of the problems of Ordinary Least Squares by imposing a penalty on the size of the coefficients.

- Ridge Regression - Setting the regularization parameter: generalized Cross-Validation
    - RidgeCV implements ridge regression with built-in cross-validation of the alpha parameter. The object works in the same way as GridSearchCV except that it defaults to Generalized Cross-Validation (GCV), an efficient form of leave-one-out cross-validation.
    - Specifying a value for the `cv` parameter, for example `cv=10` for 10-fold cross-validation, triggers cross-validation with GridSearchCV rather than Generalized Cross-Validation.
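
The difference between the two modes can be sketched as follows. The data is synthetic (arbitrary slope, intercept and noise), so the selected values of `alpha_` say nothing about the insurance dataset; the point is only how the `cv` parameter is passed and where the chosen penalty ends up.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic, roughly linear data; not the insurance dataset.
rng = np.random.default_rng(7)
X = rng.uniform(0, 100, size=(50, 1))
y = 20.0 + 3.4 * X[:, 0] + rng.normal(0, 35, size=50)

alphas = np.logspace(-6, 6, 13)

# cv=None (the default): the efficient leave-one-out scheme.
ridge_loo = RidgeCV(alphas=alphas).fit(X, y)

# cv=10: explicit 10-fold cross-validation over the same candidates.
ridge_10fold = RidgeCV(alphas=alphas, cv=10).fit(X, y)

# After fitting, alpha_ holds the selected regularization strength.
print("alpha (default):", ridge_loo.alpha_)
print("alpha (cv=10):  ", ridge_10fold.alpha_)
```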

In [None]:
models.append(('Ridge', Ridge()))
models.append(('RidgeCV', RidgeCV(alphas=numpy.logspace(-6, 6, 13))))
models.append(('LassoLars', LassoLars(alpha=0.1)))
models.append(('BayesianRidge', BayesianRidge()))
models.append(('ARDRegression', ARDRegression()))
models.append(('OrthogonalMatchingPursuit', OrthogonalMatchingPursuit()))

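# Evaluate only the six newly added models; the first six were already scored
# above, and re-running them would duplicate their entries in results and names.
for name, model in models[6:]:
    kfold = KFold(n_splits=num_folds)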
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

- We can see that RidgeCV (-1476) has the best score we have so far. It also has the smallest standard deviation (695).

# 7. Evaluate Algorithms: Standardization

Let's evaluate the same algorithms with a standardized copy of the dataset. This is where the data is transformed such that our attribute has a mean value of zero and a standard deviation of 1. We also need to avoid data leakage when we transform the data. A good way to avoid leakage is to use pipelines that standardize the data and build the model for each fold in the cross validation test harness. That way we can get a fair estimation of how each model with standardized data might perform on unseen data.

In [None]:
# Standardize the dataset
pipelines = []
pipelines.append(('ScaledLR', Pipeline([('Scaler', StandardScaler()),('LR', LinearRegression())])))
pipelines.append(('ScaledLASSO', Pipeline([('Scaler', StandardScaler()),('LASSO', Lasso())])))
pipelines.append(('ScaledEN', Pipeline([('Scaler', StandardScaler()),('EN', ElasticNet())])))
pipelines.append(('ScaledKNN', Pipeline([('Scaler', StandardScaler()),('KNN', KNeighborsRegressor())])))
pipelines.append(('ScaledCART', Pipeline([('Scaler', StandardScaler()),('CART', DecisionTreeRegressor())])))
pipelines.append(('ScaledSVR', Pipeline([('Scaler', StandardScaler()),('SVR', SVR())])))
pipelines.append(('ScaledRidge', Pipeline([('Scaler', StandardScaler()),('Ridge', Ridge())])))
pipelines.append(('ScaledRidgeCV', Pipeline([('Scaler', StandardScaler()),('RidgeCV', RidgeCV(alphas=numpy.logspace(-6, 6, 13)))])))
pipelines.append(('ScaledLassoLars', Pipeline([('Scaler', StandardScaler()),('LassoLars', LassoLars(alpha=0.1))])))
pipelines.append(('ScaledBayesianRidge', Pipeline([('Scaler', StandardScaler()),('BayesianRidge', BayesianRidge())])))
pipelines.append(('ScaledARDRegression', Pipeline([('Scaler', StandardScaler()),('ARDRegression', ARDRegression())])))
pipelines.append(('ScaledOrthogonalMatchingPursuit', Pipeline([('Scaler', StandardScaler()),('OrthogonalMatchingPursuit', OrthogonalMatchingPursuit())])))
results_standardized = []
names_standardized = []
for name, model in pipelines:
    # random_state is dropped because the folds are not shuffled
    kfold = KFold(n_splits=num_folds)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results_standardized.append(cv_results)
    names_standardized.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

- <u>Inference</u>: Let's not standardize, because standardizing slightly worsens the score of our best algorithm:
    - RidgeCV (-1476)
    - ScaledRidgeCV (-1486)
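
It helps to know which models can be affected by standardization at all. Plain least squares with an intercept gives the same predictions whether or not the input is standardized, so any differences come from the regularized or distance-based models. The sketch below checks this on synthetic data (not the insurance dataset) and also shows the leakage-free pipeline pattern in isolation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, roughly linear data; not the insurance dataset.
rng = np.random.default_rng(7)
X = rng.uniform(0, 100, size=(50, 1))
y = 20.0 + 3.4 * X[:, 0] + rng.normal(0, 35, size=50)

kfold = KFold(n_splits=10)
scoring = "neg_mean_squared_error"

raw_scores = cross_val_score(LinearRegression(), X, y, cv=kfold, scoring=scoring)

# The scaler is refit on the training part of every fold, so no statistics
# from the held-out part leak into the transformation.
scaled_lr = Pipeline([("Scaler", StandardScaler()), ("LR", LinearRegression())])
scaled_scores = cross_val_score(scaled_lr, X, y, cv=kfold, scoring=scoring)

# Fold-by-fold scores agree up to floating-point error.
print(np.allclose(raw_scores, scaled_scores))
```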
    
Let's take a look at the distribution of the scores across the cross validation folds.

In [None]:
# Compare Algorithms
fig = pyplot.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
pyplot.boxplot(results) # Baseline
ax.set_xticklabels(names) # Baseline
pyplot.show()

In [None]:
# ensembles
ensembles = []
ensembles.append(('ScaledAB', Pipeline([('Scaler', StandardScaler()),('AB', AdaBoostRegressor())])))
ensembles.append(('ScaledGBM', Pipeline([('Scaler', StandardScaler()),('GBM', GradientBoostingRegressor())])))
ensembles.append(('ScaledRF', Pipeline([('Scaler', StandardScaler()),('RF', RandomForestRegressor())])))
ensembles.append(('ScaledET', Pipeline([('Scaler', StandardScaler()),('ET', ExtraTreesRegressor())])))
results_ensembles = []
names_ensembles = []
for name, model in ensembles:
    # random_state is dropped because the folds are not shuffled
    kfold = KFold(n_splits=num_folds)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results_ensembles.append(cv_results)
    names_ensembles.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

# 8. Run Models on Validation dataset

In this section, we will evaluate the models on our hold out validation dataset.

In [None]:
# prepare the model
model = RidgeCV(alphas=numpy.logspace(-6, 6, 13))
model.fit(X_train, Y_train)

Let's generate predictions on the validation dataset.

In [None]:
predictions = model.predict(X_validation)
print(mean_squared_error(Y_validation, predictions))

What if we used a model other than RidgeCV? Let's try other models as well.

In [None]:
# prepare the model
model = LinearRegression()
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)
print(mean_squared_error(Y_validation, predictions))

- What is happening?
- In cross-validation on the training data, the MSE of RidgeCV was better than that of LR.
- On the validation data, the MSE of LR is better than that of RidgeCV.
- Why don't we try other algorithms as well on the validation data? Let's do that.

In [None]:
# prepare the model
model = Lasso() # ElasticNet(), KNeighborsRegressor(), DecisionTreeRegressor(), SVR(), Ridge(), RidgeCV(), LassoLars(), BayesianRidge(), ARDRegression(), OrthogonalMatchingPursuit()
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)
print(mean_squared_error(Y_validation, predictions))

- On validation data, the MSE is as follows:
    - LinearRegression (859.209)
    - Lasso (859.716)
    - ElasticNet (860.319)
    - KNeighborsRegressor (1155.381)
    - DecisionTreeRegressor (1310.374)
    - SVR (3187.732)
    - Ridge (859.243)
    - RidgeCV (862.653)
    - LassoLars (975.414)
    - BayesianRidge (863.474)
    - ARDRegression (863.474)
    - OrthogonalMatchingPursuit (859.209)
- LinearRegression (859.209) and OrthogonalMatchingPursuit (859.209) have the lowest MSE that we have managed to get so far.
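
Instead of editing the model line by hand for every run, the comparison can be done in one loop. This is a minimal sketch with four of the models and synthetic data (the `candidates` dictionary and the data-generating numbers are assumptions for illustration), so the MSE values it prints will not match the table above.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic, roughly linear data; not the insurance dataset.
rng = np.random.default_rng(7)
X = rng.uniform(0, 100, size=(63, 1))
Y = 20.0 + 3.4 * X[:, 0] + rng.normal(0, 35, size=63)
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, Y, test_size=0.20, random_state=7
)

# Hypothetical shortlist; extend it with any other regressor.
candidates = {
    "LinearRegression": LinearRegression(),
    "Lasso": Lasso(),
    "ElasticNet": ElasticNet(),
    "Ridge": Ridge(),
}

# Fit each model on the training split and score it on the validation split.
validation_mse = {}
for name, model in candidates.items():
    model.fit(X_train, Y_train)
    predictions = model.predict(X_validation)
    validation_mse[name] = mean_squared_error(Y_validation, predictions)

# Print the models from lowest to highest validation MSE.
for name, mse in sorted(validation_mse.items(), key=lambda item: item[1]):
    print("%s: %.3f" % (name, mse))
```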

# 9. References and Credits
- Thank you to Jason Brownlee https://machinelearningmastery.com/
- Thanks to https://www.kaggle.com/tanejaa3/auto-insurance-linear-regression which was referenced for loading the dataset without any issues.
- For Ridge Regression https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html
- For RidgeCV Regression https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares
- https://scikit-learn.org/stable/modules/linear_model.html#bayesian-regression
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.OrthogonalMatchingPursuit.html#sklearn.linear_model.OrthogonalMatchingPursuit
- Regarding Mean Squared Error https://en.wikipedia.org/wiki/Mean_squared_error

# 10. Thoughts
- I tried to rescale, standardize and normalize the training data separately, but the MSE of all the algorithms deteriorated, so we used the raw data.
- I need to figure out a way to tune the hyper-parameters of the linear algorithms using http://pavelbazin.com/post/linear-regression-hyperparameters/#linear-regression-batch-gradient-descent
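
One possible starting point for the tuning, sketched here with `GridSearchCV` on the `alpha` penalty of Lasso. This is only an assumption about how the tuning could be done, not the approach from the linked post; the data is synthetic and the grid of candidate values is arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic, roughly linear data; not the insurance dataset.
rng = np.random.default_rng(7)
X = rng.uniform(0, 100, size=(50, 1))
y = 20.0 + 3.4 * X[:, 0] + rng.normal(0, 35, size=50)

# Arbitrary grid of candidate penalties, from 0.001 to 100.
param_grid = {"alpha": np.logspace(-3, 2, 6)}

grid = GridSearchCV(
    Lasso(),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=KFold(n_splits=10),
)
grid.fit(X, y)

# best_score_ is the negated MSE, so flip the sign before reporting it.
print("best alpha:", grid.best_params_["alpha"])
print("best CV MSE: %.1f" % -grid.best_score_)
```

The same pattern should work for Ridge and ElasticNet by swapping the estimator and the keys in `param_grid`.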