# Predicting Pet Insurance Claims - Model Tuning and Predictions
## 1 Introduction
### 1.1 Background
Whenever a pet insurance policy holder incurs veterinary expenses related to their enrolled pet, they can submit claims for reimbursement, and the insurance company reimburses eligible expenses. To price insurance products correctly, the insurance company needs to have a good idea of the amount their policy holders are likely to claim in the future.

### 1.2 Project Goal
The goal of this project is to create a machine learning model to predict how much (in USD) a given policy holder will claim for during the second year of their policy.

### 1.3 Notebook Goals
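* Continue tuning the linear model from the previous notebook
* Evaluate additional modeling approaches (Lasso, gradient boosting, decision tree, and random forest regression)
* Select a final model and make predictions on the test set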

## 2 Setup
### 2.1 Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pprint import pprint

from sklearn.model_selection import train_test_split, cross_validate, learning_curve, GridSearchCV, RandomizedSearchCV
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn import __version__ as sklearn_version

# import datetime
# import os
# import pickle
# from library.sb_utils import save_file

### 2.2 Data and Initial Modeling Results
The preprocessing and initial modeling notebook left us with training and test sets and some initial modeling results.


#### 2.2.1 Data Load
Let's take a look at the shape of our training and test sets and get a preview of our training data.

In [2]:
# Read in the training and test sets
X_train = pd.read_csv('../data/X_train.csv', index_col=0)
y_train = np.genfromtxt('../data/y_train.csv', delimiter=',')
X_test = pd.read_csv('../data/X_test.csv', index_col=0)
y_test = np.genfromtxt('../data/y_test.csv', delimiter=',')

# Preview the shape
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(40000, 12)
(10000, 12)
(40000,)
(10000,)


In [3]:
# Preview the training data
X_train.head(8).T

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| Species | Dog | Dog | Dog | Dog | Dog | Dog | Cat | Dog |
| AgeYr1 | 3 | 0 | 0 | 1 | 2 | 0 | 6 | 6 |
| YoungAge | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| AmtClaimsYr1 | 76.25 | 0.0 | 6010.88 | 303.15 | 0.0 | 701.6 | 431.57 | 0.0 |
| AvgClaimsYr1 | 76.25 | 0.0 | 546.443636 | 151.575 | 0.0 | 701.6 | 215.785 | 0.0 |
| NumClaimsYr1 | 1 | 0 | 11 | 2 | 0 | 1 | 2 | 0 |
| BreedAvgTotalClaims | 663.502526 | 540.515378 | 705.72864 | 705.72864 | 663.502526 | 925.434405 | 395.718466 | 1509.175333 |
| BreedAvgNumClaims | 1.491045 | 1.627294 | 1.766242 | 1.766242 | 1.491045 | 1.875611 | 0.871485 | 2.552941 |
| BreedAvgClaimAmt | 196.601991 | 159.645266 | 198.844913 | 198.844913 | 196.601991 | 240.294459 | 137.970879 | 315.759532 |
| AgeYr1AvgTotalClaims | 722.652663 | 902.942389 | 902.942389 | 699.071121 | 619.496248 | 902.942389 | 449.891932 | 904.086963 |


#### 2.2.2 Initial Modeling Results
In the previous notebook, we completed some initial modeling including the following:
* Established a baseline using `DummyRegressor()` with predictions equivalent to the mean value of the target 
* Created a simple linear model using `LinearRegression()` and default settings
* Enhanced our simple linear model by incorporating feature selection and making predictions using the best `k` features

Using the linear model with feature selection, we managed to improve upon the baseline model by **~9%**, or over **\$90** per customer. Given the size of the total customer base (~65-75 million pets), this represents a significant total value that could be realized by factoring model predictions into policy premium and deductible amounts.
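To make the comparison against the baseline concrete, here is a minimal sketch of how a relative improvement in mean absolute error (MAE) over a mean-only baseline can be computed. The claim amounts and predictions are hypothetical values chosen for illustration, not figures from the previous notebook, and the helper names `mae` and `relative_improvement` are not part of the project code.

```python
from statistics import mean


def mae(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def relative_improvement(baseline_mae, model_mae):
    """Fraction by which the model's MAE is lower than the baseline's."""
    return (baseline_mae - model_mae) / baseline_mae


# Hypothetical second-year claim amounts in USD (illustration only)
y_true = [0.0, 100.0, 200.0, 700.0]

# A mean-only baseline predicts the same value for every policy
baseline_pred = [mean(y_true)] * len(y_true)

# Hypothetical model predictions (illustration only)
model_pred = [50.0, 100.0, 250.0, 600.0]

baseline_mae = mae(y_true, baseline_pred)
model_mae = mae(y_true, model_pred)
improvement = relative_improvement(baseline_mae, model_mae)

print(f"baseline MAE: {baseline_mae:.2f}")
print(f"model MAE:    {model_mae:.2f}")
print(f"improvement:  {improvement:.1%}")
```

The same calculation applies to cross-validated scores: subtract the model's MAE from the baseline's MAE and divide by the baseline's MAE.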

In this notebook, we'll continue tuning our model and evaluating additional modeling approaches to see how much we can improve upon the initial result. Following this evaluation, we will select our final model and make predictions on our test set to generate final results. 

## 3 Lasso Regression

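Lasso regression adds an L1 penalty to the linear model, which can shrink some coefficients to exactly zero. That makes it a natural follow-up to the feature selection used in the previous notebook, since the penalty strength `alpha` controls how many features effectively remain in the model.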

In [4]:
# Create a preprocessor to encode the categorical columns
preprocessor = make_column_transformer(
    (OneHotEncoder(drop='if_binary'), ['Species']),
    remainder='passthrough'
)
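As a rough illustration of what this preprocessor does, the sketch below applies the same transformer to a small toy frame. Only the column names come from the notebook; the rows are made-up values. With `drop='if_binary'`, a two-category column such as `Species` becomes a single 0/1 column, and the remaining columns pass through unchanged.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Toy rows (hypothetical values); column names match the notebook's data
toy = pd.DataFrame({'Species': ['Dog', 'Cat', 'Dog'], 'AgeYr1': [3, 6, 0]})

demo_preprocessor = make_column_transformer(
    (OneHotEncoder(drop='if_binary'), ['Species']),
    remainder='passthrough'
)

# Encoded columns come first, followed by the passthrough columns
encoded = np.asarray(demo_preprocessor.fit_transform(toy))
print(encoded)
```

The first output column is 1 for dogs and 0 for cats, because the encoder drops the first category (in sorted order) of a binary feature.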

### 3.1 Simple Lasso Regression Model

In [5]:
from sklearn.linear_model import Lasso

# Make pipeline and get results
pipe = make_pipeline(preprocessor, Lasso())
cv_results = cross_validate(pipe, X_train, y_train, scoring='neg_mean_absolute_error', cv=5)
score_mean = -1 * round(np.mean(cv_results['test_score']), 2)
print("mean test score (mae): " + str(score_mean))

mean test score (mae): 930.28


### 3.2 Lasso Regression with GridSearchCV

In [6]:
# Create the pipeline
pipe = Pipeline([('preprocessor', preprocessor),('regressor', Lasso())])

# Establish the parameter grid
# Note: `normalize` was removed from Lasso in scikit-learn 1.2; drop that entry on newer versions
param_grid = {'regressor__alpha': [0.0001, 0.001, 0.1, 1], 'regressor__fit_intercept': [True, False],
              'regressor__normalize': [True, False]}

# Instantiate the grid search and fit the model
Lasso_GS = GridSearchCV(pipe, param_grid=param_grid, scoring='neg_mean_absolute_error', cv=5, n_jobs=-1)
Lasso_GS.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(remainder='passthrough',
                                                          transformers=[('onehotencoder',
                                                                         OneHotEncoder(drop='if_binary'),
                                                                         ['Species'])])),
                                       ('regressor', Lasso())]),
             n_jobs=-1,
             param_grid={'regressor__alpha': [0.0001, 0.001, 0.1, 1],
                         'regressor__fit_intercept': [True, False],
                         'regressor__normalize': [True, False]},
             scoring='neg_mean_absolute_error')

In [7]:
# Print best test score
best_score = -1 * round(np.max(Lasso_GS.cv_results_['mean_test_score']), 2)
print("best score (mae): " + str(best_score))

best score (mae): 930.28


In [8]:
Lasso_GS.best_params_

{'regressor__alpha': 1,
 'regressor__fit_intercept': True,
 'regressor__normalize': False}

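The grid search selected `alpha=1`, `fit_intercept=True`, and `normalize=False`, which are the `Lasso()` defaults, so the best cross-validated MAE (930.28) is identical to the simple model in section 3.1. Tuning these parameters did not improve the Lasso model.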
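One quick way to check whether a search simply landed on the estimator defaults is to compare the selected values with `get_params()`. The sketch below does this for the two parameters that exist across scikit-learn versions; `normalize` is left out because newer releases no longer have it. The `best_params` dictionary is copied by hand from the output above, without the `regressor__` prefix.

```python
from sklearn.linear_model import Lasso

# Values reported by the grid search above, without the pipeline step prefix
best_params = {'alpha': 1, 'fit_intercept': True}

# Default parameters of an untuned Lasso estimator
defaults = Lasso().get_params()

matches_defaults = all(defaults[name] == value for name, value in best_params.items())
print(matches_defaults)
```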

## 4 Gradient Boosting

### 4.1 Simple Gradient Boosting Model

In [9]:
from sklearn.ensemble import GradientBoostingRegressor

# Make pipeline, cross-validate, and get results
pipe = make_pipeline(preprocessor, GradientBoostingRegressor())
cv_results = cross_validate(pipe, X_train, y_train, scoring='neg_mean_absolute_error', cv=5)
score_mean = -1 * round(np.mean(cv_results['test_score']), 2)
print("mean test score (mae): " + str(score_mean))

mean test score (mae): 932.59


### 4.2 Gradient Boosting with GridSearchCV

Note: the grid search below is commented out. Consider using a randomized search here instead, or dropping this section if the results are poor.
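A randomized search over a gradient boosting model might look like the sketch below. It runs on small synthetic data from `make_regression` rather than the notebook's training set, omits the preprocessing pipeline, and uses a deliberately small `n_estimators`, `n_iter`, and `cv` so that it finishes quickly. All of these settings are assumptions for illustration, not tuned choices.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (not the notebook's X_train / y_train)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Candidate values; the search samples combinations instead of trying all of them
param_distributions = {
    'learning_rate': [0.01, 0.1],
    'subsample': [0.9, 0.5, 0.1],
    'max_depth': [3, 5, 10],
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(n_estimators=20, random_state=0),
    param_distributions=param_distributions,
    n_iter=4,                           # evaluate 4 of the 18 possible combinations
    cv=3,
    scoring='neg_mean_absolute_error',  # scores are negated MAE, so higher is better
    random_state=42,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print("best score (mae):", round(-search.best_score_, 2))
```

When the estimator sits inside a `Pipeline`, the parameter names need the step prefix, for example `regressor__learning_rate`.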

In [10]:
# # Create the pipeline
# pipe = Pipeline([('preprocessor', preprocessor),('regressor', GradientBoostingRegressor())])

# # Establish the parameter grid
# param_grid = {'regressor__learning_rate': [0.01,0.1],
#               'regressor__subsample'    : [0.9, 0.5, 0.1],
#               'regressor__max_depth'    : [3,5,10],
#               'regressor__criterion'    : ['mae']
#              }

# # Instantiate the grid search and fit the model
# GB_GS = GridSearchCV(pipe, param_grid=param_grid, scoring='neg_mean_absolute_error', cv=5, n_jobs=-1)
# GB_GS.fit(X_train, y_train)

In [11]:
# # Print best test score
# best_score = -1 * round(np.max(GB_GS.cv_results_['mean_test_score']), 2)
# print("best score (mae): " + str(best_score))

In [12]:
# GB_GS.best_params_

In [13]:
# print(" Results from Grid Search " )
# print("\n Best estimator:\n", GB_GS.best_estimator_)
# print("\n Best score:\n",GB_GS.best_score_)
# print("\n Best params:\n", GB_GS.best_params_)

## 5 Decision Tree

In [32]:
from sklearn.tree import DecisionTreeRegressor

# Make pipeline, cross-validate, and get results
pipe = make_pipeline(preprocessor, DecisionTreeRegressor())
cv_results = cross_validate(pipe, X_train, y_train, scoring='neg_mean_absolute_error', cv=5)
score_mean = -1 * round(np.mean(cv_results['test_score']), 2)
print("mean test score (mae): " + str(score_mean))

mean test score (mae): 1115.1


In [38]:
from sklearn import tree

pipe = make_pipeline(preprocessor, DecisionTreeRegressor())
pipe.fit(X_train, y_train)

# Create list of features
features = ['Species_Dog','AgeYr1', 'YoungAge', 'AmtClaimsYr1', 'AvgClaimsYr1', 'NumClaimsYr1',
            'BreedAvgTotalClaims', 'BreedAvgNumClaims', 'BreedAvgClaimAmt', 'AgeYr1AvgTotalClaims',
            'AgeYr1AvgNumClaims', 'AgeYr1AvgClaimAmt']

fig = plt.figure(figsize=(25,20))
# Plot the top levels of the fitted tree, taken from the pipeline's final step;
# max_depth limits the drawing so the figure stays readable
fitted_tree = pipe.named_steps['decisiontreeregressor']
_ = tree.plot_tree(fitted_tree, feature_names=features, max_depth=3, filled=True)

## 6 Random Forest Regression

In [23]:
from sklearn.ensemble import RandomForestRegressor

# Number of trees in random forest
# n_estimators = [int(x) for x in np.linspace(start = 50, stop = 1000, num = 10)]
n_estimators = [50, 100]

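# Number of features to consider at every split
# Note: for RandomForestRegressor, 'auto' means all features (same as None); it was removed in scikit-learn 1.3
max_features = ['auto', None]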

# Maximum number of levels in tree
# max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth = [10, 50, 100]
max_depth.append(None)

# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]

# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]

# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the random grid
random_grid = {'regressor__n_estimators': n_estimators,
               'regressor__max_features': max_features,
               'regressor__max_depth': max_depth,
               'regressor__min_samples_split': min_samples_split,
               'regressor__min_samples_leaf': min_samples_leaf,
               'regressor__bootstrap': bootstrap}

# Print the grid
pprint(random_grid)

{'regressor__bootstrap': [True, False],
 'regressor__max_depth': [10, 50, 100, None],
 'regressor__max_features': ['auto', None],
 'regressor__min_samples_leaf': [1, 2, 4],
 'regressor__min_samples_split': [2, 5, 10],
 'regressor__n_estimators': [50, 100]}
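It helps to know how large this grid is before deciding how many candidates a randomized search should sample. The short sketch below counts the combinations in the grid printed above and compares that with `n_iter = 5`, the value used in the next cell.

```python
from math import prod

# Same grid as printed above
random_grid = {
    'regressor__bootstrap': [True, False],
    'regressor__max_depth': [10, 50, 100, None],
    'regressor__max_features': ['auto', None],
    'regressor__min_samples_leaf': [1, 2, 4],
    'regressor__min_samples_split': [2, 5, 10],
    'regressor__n_estimators': [50, 100],
}

# Total number of distinct parameter combinations in the grid
n_combinations = prod(len(values) for values in random_grid.values())

# Number of candidates the randomized search samples
n_iter = 5

print(n_combinations)                    # 288
print(f"{n_iter / n_combinations:.1%}")  # share of the grid that is sampled
```

With only 5 of 288 combinations evaluated, the search covers a small fraction of the grid, so the best parameters it reports are a rough indication rather than a thorough optimum.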


In [24]:
# Create the pipeline
pipe = Pipeline([('preprocessor', preprocessor),('regressor', RandomForestRegressor())])

# Instantiate the randomized search
RF_Random = RandomizedSearchCV(estimator = pipe, param_distributions = random_grid,
                               n_iter = 5, cv = 5, scoring='neg_mean_absolute_error', verbose=2, random_state=42, n_jobs = -1)

# Fit the random search model
RF_Random.fit(X_train, y_train)

Fitting 5 folds for each of 5 candidates, totalling 25 fits


RandomizedSearchCV(cv=5,
                   estimator=Pipeline(steps=[('preprocessor',
                                              ColumnTransformer(remainder='passthrough',
                                                                transformers=[('onehotencoder',
                                                                               OneHotEncoder(drop='if_binary'),
                                                                               ['Species'])])),
                                             ('regressor',
                                              RandomForestRegressor())]),
                   n_iter=5, n_jobs=-1,
                   param_distributions={'regressor__bootstrap': [True, False],
                                        'regressor__max_depth': [10, 50, 100,
                                                                 None],
                                        'regressor__max_features': ['auto',
                                          

In [25]:
# View best performing parameter combination
RF_Random.best_params_

{'regressor__n_estimators': 100,
 'regressor__min_samples_split': 2,
 'regressor__min_samples_leaf': 4,
 'regressor__max_features': 'auto',
 'regressor__max_depth': 10,
 'regressor__bootstrap': False}

In [26]:
RF_Random.best_score_

-968.1680597037124
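To compare the approaches tried so far, the cross-validated MAE values reported in this notebook can be collected and ranked. The numbers below are copied from the cell outputs above (the random forest score is rounded to two decimals); the model labels are descriptive names chosen for this sketch.

```python
# Cross-validated MAE in USD, copied from the outputs in this notebook
cv_mae = {
    'Lasso (default)': 930.28,
    'Lasso (grid search)': 930.28,
    'Gradient boosting (default)': 932.59,
    'Decision tree (default)': 1115.10,
    'Random forest (randomized search)': 968.17,
}

# Lower MAE is better, so sort in ascending order
ranked = sorted(cv_mae.items(), key=lambda item: item[1])

for name, score in ranked:
    print(f"{name:<36}{score:>10.2f}")
```

On these results, the Lasso models have the lowest cross-validated error, with default gradient boosting close behind and the untuned decision tree clearly worst.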

## 8 Save Model

In [None]:
# # Store some basic information about the model
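# # (assumes the Lasso grid search, which has the best cross-validated MAE so far; swap in the final search object)
# best_model = Lasso_GS.best_estimator_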
# best_model.version = '1.0'
# best_model.pandas_version = pd.__version__
# best_model.numpy_version = np.__version__
# best_model.sklearn_version = sklearn_version
# best_model.X_columns = [col for col in X_train.columns]
# best_model.build_datetime = datetime.datetime.now()

# # save the model

# modelpath = '../models'
# save_file(best_model, 'pet_insurance_claims_model.pkl', modelpath)  # placeholder filename