# Predicting Pet Insurance Claims - Pre-processing
## 1 Introduction
### 1.1 Background
Whenever a pet insurance policy holder incurs veterinary expenses related to their enrolled pet, they can submit claims for reimbursement, and the insurance company reimburses eligible expenses. To price insurance products correctly, the insurance company needs to have a good idea of the amount their policy holders are likely to claim in the future.

### 1.2 Project Goal
The goal of this project is to create a machine learning model that predicts how much (in dollars) a given policy holder will claim during the second year of their policy.

### 1.3 Notebook Goals
* Split data into Train and Test sets
* Complete any remaining feature engineering
* Pre-process the data to prepare for modeling

## 2 Setup
### 2.1 Imports

In [16]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
# PCA and scale are used in the PCA cells under Feature Engineering
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

### 2.2 Data Load and Preview
At the end of exploratory data analysis, we had two dataframes:
1. **pets** - Dataframe containing all pet records
    * Shape - Our clean dataframe is 50000 rows with each row corresponding to a single pet.
    * Basic Info - For each pet, we have some basic info including species, breed, and age at time of enrollment.
    * YoungAge - Designation for pets who enrolled at a very young age (< 7 weeks)
    * Policy Info - We also have policy-level info for each pet including the monthly premium and deductible amount for claims.
    * Claims Data - We also have claims data for each pet covering the first two policy years including:
        * Number of claims per year and total (years 1 and 2 combined)
        * Average claim amount per year and total (years 1 and 2 combined)
        * Amount of claims per year and total (years 1 and 2 combined)


2. **breeds** - Dataframe containing a rollup of our breed data
    * Species - The species of pet that corresponds with the breed
    * PetCount - The number of pets in our data for the given breed
    * Claims Data - Rollup of claims-related data for the breed
        * AvgTotalClaims, AvgNumClaims - The unweighted averages for the breed for total claims amount and total number of claims per pet
        * WeightedTotalClaims, WeightedNumClaims - The weighted averages for the breed for total claims amount and total number of claims per pet when all breeds are given equal weight in the data regardless of pet count
        

Let's load in both and preview.

In [17]:
pets = pd.read_csv('../data/pets.csv', index_col=0)
pets.head(8).T

| Unnamed: 0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| PetId | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| Species | Dog | Dog | Dog | Dog | Dog | Dog | Cat | Dog |
| Breed | Schnauzer Standard | Yorkiepoo | Mixed Breed Medium | Labrador Retriever | French Bulldog | Shih Tzu | American Shorthair | Boxer |
| Premium | 84.54 | 50.33 | 74.0 | 57.54 | 60.69 | 43.53 | 47.4 | 75.14 |
| Deductible | 200 | 500 | 500 | 500 | 700 | 700 | 250 | 700 |
| AgeYr1 | 3 | 0 | 0 | 0 | 0 | 2 | 0 | 5 |
| YoungAge | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| AmtClaimsYr1 | 0.0 | 0.0 | 640.63 | 0.0 | 7212.25 | 2665.67 | 0.0 | 2873.47 |
| AmtClaimsYr2 | 1242.0 | 0.0 | 1187.68 | 0.0 | 168.75 | 0.0 | 811.38 | 2497.03 |
| AvgClaimsYr1 | 0.0 | 0.0 | 213.543333 | 0.0 | 801.361111 | 296.185556 | 0.0 | 410.495714 |


In [18]:
breeds = pd.read_csv('../data/breeds.csv', index_col=0)
breeds.head(10).T

| Unnamed: 0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| Breed | Abyssinian | Affenpinscher | Afghan Hound | Aidi | Airedale Terrier | Akbash Dog | Akita | Alaskan Klee Kai | Alaskan Malamute | American Bandogge Mastiff |
| Species | Cat | Dog | Dog | Dog | Dog | Dog | Dog | Dog | Dog | Dog |
| PetCount | 24 | 3 | 10 | 2 | 41 | 1 | 49 | 30 | 64 | 1 |
| AvgTotalClaims | 1462.780417 | 1746.19 | 645.543 | 0.0 | 1338.705366 | 11383.6 | 1217.91102 | 1510.666333 | 3096.346719 | 2295.07 |
| AvgNumClaims | 4.25 | 1.333333 | 2.5 | 0.0 | 2.097561 | 23.0 | 3.591837 | 3.166667 | 3.765625 | 8.0 |
| WeightedTotalClaims | 35106.73 | 5238.57 | 6455.43 | 0.0 | 54886.92 | 11383.6 | 59677.64 | 45319.99 | 198166.19 | 2295.07 |
| WeightedNumClaims | 102.0 | 4.0 | 25.0 | 0.0 | 86.0 | 23.0 | 176.0 | 95.0 | 241.0 | 8.0 |


### 2.3 Initial Plan for Pre-processing and Feature Engineering
Our primary goal for pre-processing is to prepare our data for modeling. At a minimum, this will include scaling and/or normalizing our features, generating any required dummy variables for categorical columns, and splitting our data into train and test sets.

**Train / Test Prep**
* Drop all 'Yr2' and 'Total' columns except AmtClaimsYr2 (our target) as this is data we would not have available for making predictions
* Split our data into training and test sets

**Feature Engineering** 
* Premium and Deductible - Roll these features up to the breed level to smooth out some of the variability between pets that are, in essence, identical
* Breed - Employ a method to reduce the number of unique values
* Breed statistics - Add features to incorporate breed-related statistics into pets data
* PetAge - Consider adding new features that factor in age (e.g., average claim amount in year 1, average total claims in year 1, by age grouping)
* AmtClaimsYr1 and AvgClaimsYr1 - Consider rolling these up into one feature or rolling these up by breed
* NumClaimsYr1 - Consider dropping this column or rolling it up by breed; it could also be converted to binary (claims and no-claims)
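
As an illustration of the breed-level rollup idea in the list above, here is a minimal sketch on toy data. The frames `train` and `test` and the column name `BreedAvgPremium` are hypothetical, and computing the means on the training split only is an assumption made here to stay consistent with the goal of avoiding leakage.

```python
import pandas as pd

# Toy stand-ins for the train and test splits (values are illustrative only)
train = pd.DataFrame({
    "Breed": ["Boxer", "Boxer", "Shih Tzu", "Shih Tzu"],
    "Premium": [70.0, 80.0, 40.0, 50.0],
})
test = pd.DataFrame({
    "Breed": ["Boxer", "Shih Tzu", "Akita"],
    "Premium": [75.0, 45.0, 60.0],
})

# Compute the breed-level mean premium on the training data only
breed_premium = train.groupby("Breed")["Premium"].mean()
overall_premium = train["Premium"].mean()

# Map the training-set means onto both splits; breeds that never appear
# in training fall back to the overall training mean
train["BreedAvgPremium"] = train["Breed"].map(breed_premium)
test["BreedAvgPremium"] = test["Breed"].map(breed_premium).fillna(overall_premium)
```

The same pattern would apply to Deductible or to the year 1 claims columns, if those rollups turn out to be useful.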

**Pre-processing**
* Species - Convert to binary
* Breed - Create dummy variables for the remaining breeds
* All columns - Scale or normalize any columns not already converted to binary or dummy variables
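
The three pre-processing steps above could look roughly like the sketch below. It runs on a small toy frame, and the names `IsDog` and `PremiumScaled` are hypothetical. In the real pipeline the scaling statistics would presumably be computed on the training set and reused for the test set.

```python
import pandas as pd

# Toy data with the same kinds of columns as the pets dataframe
df = pd.DataFrame({
    "Species": ["Dog", "Cat", "Dog", "Dog"],
    "Breed": ["Boxer", "Other Cat", "Shih Tzu", "Boxer"],
    "Premium": [75.14, 47.40, 43.53, 60.00],
})

# Species -> binary indicator (1 = Dog, 0 = Cat)
df["IsDog"] = (df["Species"] == "Dog").astype(int)

# Breed -> one dummy column per breed
breed_dummies = pd.get_dummies(df["Breed"], prefix="Breed", dtype=int)

# Numeric column -> standardize to zero mean and unit variance
premium_scaled = (df["Premium"] - df["Premium"].mean()) / df["Premium"].std(ddof=0)

# Assemble the processed feature matrix
processed = pd.concat(
    [df[["IsDog"]], breed_dummies, premium_scaled.rename("PremiumScaled")],
    axis=1,
)
```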


## 3 Split the Data
As a first step, we'll take care of the train/test split to prevent any data leakage. Before we split the data, we need to drop some of the features that won't be part of our model. 

In addition, we observed during data wrangling and EDA that there is a wide variety of breeds in our data. If we want to maintain a balanced distribution of breeds after our split, we'll need to use the *stratify* option of train_test_split. Stratify won't work with our current data, since we have some breeds with only 1 pet and stratify requires a minimum of 2 in each category.

To work around this, we can take a few steps to reduce the number of unique values for 'Breed'. In doing so, we can ensure a minimum number of pets in each category. 
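
A quick way to see which categories would block a stratified split is to count the members of each one. The sketch below uses a made-up `breed` series rather than the real data.

```python
import pandas as pd

# Toy breed column; two of the breeds appear only once
breed = pd.Series(["Boxer", "Boxer", "Akita", "Aidi", "Akita", "Akbash Dog"])

# Count pets per breed and collect the breeds with fewer than 2 pets,
# which is the minimum a stratified split needs per category
counts = breed.value_counts()
too_small = sorted(counts[counts < 2].index)
```

Any breed listed in `too_small` would need to be merged into a larger group before stratifying.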

### 3.1 Drop Unnecessary Features 
We have a number of features in our current dataset that would be unfair to use in our predictive model. The purpose of our model will be to predict claims in year 2. That being the case, we need to remove any features that include data for year 2.

In [19]:
# Drop features that won't be part of the model
drop_cols = ['PetId', 'AvgClaimsYr2', 'NumClaimsYr2', 'AmtClaimsTotal', 'AvgClaimsTotal',
             'NumClaimsTotal', 'YrsWithClaims']
pets.drop(columns=drop_cols, inplace=True)
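
The year 2 and total columns in the list above could also be found by name pattern instead of being typed out. This is only a sketch: the full column list below is assumed from the preview and from drop_cols, and `target` and `leaky` are hypothetical names.

```python
import pandas as pd

# Assumed column layout of the pets dataframe before dropping anything
columns = [
    "PetId", "Species", "Breed", "Premium", "Deductible", "AgeYr1", "YoungAge",
    "AmtClaimsYr1", "AmtClaimsYr2", "AvgClaimsYr1", "NumClaimsYr1",
    "AvgClaimsYr2", "NumClaimsYr2", "AmtClaimsTotal", "AvgClaimsTotal",
    "NumClaimsTotal", "YrsWithClaims",
]
df = pd.DataFrame(columns=columns)

target = "AmtClaimsYr2"

# Any column mentioning year 2 or totals carries information from the
# prediction period, except for the target itself
leaky = [c for c in df.columns if ("Yr2" in c or "Total" in c) and c != target]
```

Columns such as PetId and YrsWithClaims do not match the pattern and would still need to be listed explicitly.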

### 3.2 Reduce the Number of Unique Breeds
To reduce the number of unique breeds in the data, we'll follow the steps below:

1. Set a threshold and save a list of breeds with counts greater than or equal to the threshold
2. Write a function to update the breed for a row based on whether or not it exists in the list from step 1
3. Create a copy of our original df and apply the function
4. Print out the before and after numbers for our count of unique breeds

As part of step 2 above, we'll update the breed name for breeds with a low pet count to group them together in an *Other* category. To ensure we don't lose any species-specific information, we'll create two versions of *Other*, 'Other Cat' and 'Other Dog'. 

In [25]:
# Set threshold
threshold = 100

# Preserve list of breeds with count greater than or equal to the threshold
breeds_list = breeds[breeds.PetCount >= threshold].Breed.tolist()

# Create function to update breed column based on threshold
def update_breed(row):
    if (row["Breed"] in breeds_list):
        return row["Breed"]
    else:
        if (row["Species"] == 'Cat'):
            return 'Other Cat'
        else:
            return 'Other Dog'

# Print number of unique breeds before update
print("Number of unique breeds before: " + str(pets.Breed.nunique()))

# Create copy of original df and apply function to update Breed
pets_new = pets.copy()
pets_new["Breed"] = pets_new.apply(update_breed, axis=1)
print("Number of unique breeds after: " + str(pets_new.Breed.nunique()))

Number of unique breeds before: 373
Number of unique breeds after: 79
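
The row-wise apply above works, but on 50000 rows a vectorized version may be noticeably faster. The sketch below shows one possible equivalent on toy data. Here `pets_demo` and `keep` are hypothetical stand-ins for pets_new and breeds_list.

```python
import numpy as np
import pandas as pd

pets_demo = pd.DataFrame({
    "Species": ["Dog", "Cat", "Dog", "Cat"],
    "Breed": ["Boxer", "Abyssinian", "Aidi", "American Shorthair"],
})
keep = {"Boxer", "American Shorthair"}  # stand-in for breeds_list

# Replacement label for every row, chosen by species
other = np.where(pets_demo["Species"] == "Cat", "Other Cat", "Other Dog")

# Keep the breed where it is in the list, otherwise use the species-specific label
pets_demo["Breed"] = pets_demo["Breed"].where(pets_demo["Breed"].isin(keep), other)
```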


### 3.3 Split Data into Train and Test

In [27]:
# Split the data into train and test sets
pets_train, pets_test = train_test_split(pets_new, test_size=0.2, stratify=pets_new['Breed'])

In [28]:
print(pets_train.shape)
print(pets_test.shape)

(40000, 10)
(10000, 10)
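
To confirm that stratifying preserved the breed mix, the category shares in the two splits can be compared. The sketch below does this on a toy frame. The `random_state` argument is an addition made here for reproducibility and is not part of the original split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with three breed categories in a 50/30/20 mix
demo = pd.DataFrame({
    "Breed": ["Boxer"] * 50 + ["Other Dog"] * 30 + ["Other Cat"] * 20,
    "AmtClaimsYr1": range(100),
})

train, test = train_test_split(
    demo, test_size=0.2, stratify=demo["Breed"], random_state=42
)

# Share of each breed in each split
train_share = train["Breed"].value_counts(normalize=True)
test_share = test["Breed"].value_counts(normalize=True)

# Largest difference in share between the two splits
max_gap = (train_share - test_share).abs().max()
```

A small `max_gap` suggests the two splits have a similar breed distribution.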


# Start Here

## 4 Feature Engineering


### 4.1 PCA


In [None]:
# Filter df down to a subset of features
cols = ['Breed', 'AgeYr1', 'AmtClaimsYr1', 'NumClaimsYr1']

#Create a new dataframe and set the index to Breed
# Use the training split so the test set stays unseen
df_new_scale = pets_train[cols].set_index('Breed')

#Save the breed labels
df_new_index = df_new_scale.index

#Save the column names 
df_new_columns = df_new_scale.columns
df_new_scale.head()

In [None]:
# Scale the data
df_new_scale = scale(df_new_scale)

In [None]:
#Create a new dataframe using saved column names
df_new_scaled_df = pd.DataFrame(df_new_scale, columns=df_new_columns)
df_new_scaled_df.head()

In [None]:
# Verify the scaling
df_new_scaled_df.mean()

In [None]:
# Verify scaled std
df_new_scaled_df.std(ddof=0)
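
The ddof=0 in the check above matters because sklearn's scale divides by the population standard deviation, while pandas defaults to the sample standard deviation (ddof=1). The sketch below reproduces the effect by hand on toy values, without calling scale itself.

```python
import numpy as np
import pandas as pd

values = pd.DataFrame({"AgeYr1": [0.0, 1.0, 3.0, 5.0, 11.0]})

# Standardize by hand using the population standard deviation (ddof=0)
scaled = (values - values.mean()) / values.std(ddof=0)

# With ddof=0 the scaled column has a standard deviation of 1
std_population = scaled["AgeYr1"].std(ddof=0)

# With the pandas default (ddof=1) it is slightly larger: sqrt(n / (n - 1))
std_sample = scaled["AgeYr1"].std(ddof=1)
```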

In [None]:
# Fit the PCA transformation
pets_pca = PCA().fit(df_new_scale)

In [None]:
# Plot the result
plt.subplots(figsize=(10, 6))
plt.plot(pets_pca.explained_variance_ratio_.cumsum())
plt.xlabel('Component #')
plt.ylabel('Cumulative explained variance ratio')
plt.title('Cumulative variance ratio explained by PCA components for pets summary statistics')
plt.show()

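Looking at the results from PCA, about 85% of the variance was reported as explained by the first 5 components. Note, however, that the subset used above contains only three numeric features (AgeYr1, AmtClaimsYr1 and NumClaimsYr1), so PCA can return at most three components here, and this figure needs to be re-verified once the cells are re-run. Either way, the cumulative variance curve may be helpful down the road in pre-processing and model creation, as it gives us a better foundation for understanding our data.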
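
The cumulative curve can also be read off numerically rather than from the plot. The sketch below finds the smallest number of components that reaches a chosen threshold. It uses synthetic data, and the 0.85 threshold and the names `X`, `cumulative` and `n_components` are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 200 rows, 3 features, with the second correlated to the first
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 1] += 0.8 * X[:, 0]

# Fit PCA with all components and accumulate the explained variance ratios
pca = PCA().fit(X)
cumulative = pca.explained_variance_ratio_.cumsum()

# Smallest number of components whose cumulative ratio reaches the threshold
threshold = 0.85
n_components = int(np.argmax(cumulative >= threshold)) + 1
```

With only three input features there are at most three components, so `n_components` can never exceed 3 in a setup like this.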

# TODO
* Should I do more with the PCA results here before moving on to the summary?