# Allstate Insurance Claims 
#### W261, Final Project Spring 2020, Team 10

Authors:
- Yang Jing
- Ryan Keunho Kim
- Christine Barger 
- Sophia Cui

Code:
- [Github Repository](https://github.com/UCB-w261/project-sp20-team-10)
    - Supplementary notebooks are in corresponding folders in the repository
- [Public Notebook with Results](https://ucb-w261.github.io/project-sp20-team-10/W261_SP20_FINAL_PROJECT.html)

## Project Formulation and Hypotheses <a name="introduction"></a>

The dataset chosen is the Allstate Insurance claims severity dataset.

The **goal of the analysis is to train a model that accurately predicts the severity of an insurance claim** from the input variables provided. From an insurance company's perspective, accurately anticipating future claims and their severity supports financial planning and helps set aside sufficient reserves. It also helps the company foresee catastrophic events that could threaten its financial well-being.

Three separate files were provided: a training set, a test set, and a sample submission. The training and test sets contain both categorical and continuous data fields.

Because the target variable is the severity of a claim (a continuous value), we treat this as a regression problem. The data will be fed through various algorithms and evaluated on MAE (mean absolute error), a linear score that weights all individual differences equally. The best-performing model is the one with the lowest MAE on held-out data. The ideal algorithm also balances bias and variance: bias is the error from a model too simple to fit the training set, and variance is the error from a model so sensitive to the training set that it fails to generalize to unseen data. Lowering bias usually requires a more complex model, which raises the risk of overfitting, whereas focusing only on lowering variance can produce a model so simple that it underfits and misses important features.
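As a concrete reference for the metric, MAE is the average of the absolute differences between actual and predicted values. The snippet below is a minimal sketch on made-up numbers (not Allstate data); `mae_score` is a hypothetical helper name, and the pipelines in this notebook use `sklearn.metrics.mean_absolute_error` instead.

```python
def mae_score(y_true, y_pred):
    """Mean absolute error: average of |actual - predicted| over all rows."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up claim amounts, for illustration only.
y_true = [1000.0, 2000.0, 3000.0]
y_pred = [1100.0, 1800.0, 3300.0]

mae = mae_score(y_true, y_pred)
print(mae)  # 200.0
```

Because every error contributes in proportion to its size, one very large miss does not dominate the score the way it would under a squared-error metric.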

We will implement two pipelines: a scikit-learn pipeline and a Spark pipeline.

#### Setup Code

In [5]:
# imports
import re
import time
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import ast
import os
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error 
from numpy import mean
from numpy import absolute
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.linear_model import HuberRegressor
from sklearn.preprocessing import PowerTransformer
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

In [6]:
username = dbutils.notebook.entry_point.getDbutils().notebook().getContext().tags().apply('user')
userhome = 'dbfs:/user/' + username
print(userhome)
finalproject_path = userhome + "/finalproject/" 
finalproject_path_open = '/dbfs' + finalproject_path.split(':')[-1] # for use with python open()
dbutils.fs.mkdirs(finalproject_path)

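# total size in bytes of the files in the shared data directory
# (named total_size so the built-in sum() is not shadowed)
total_size = 0
DATA_PATH = 'dbfs:/mnt/mids-w261/data/datasets_final_project/'
for item in dbutils.fs.ls(DATA_PATH):
  total_size += item.size
total_size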

In [7]:
sc = spark.sparkContext
spark

## Data Storage and Scalability Exploration

In [9]:
#unzip files
import zipfile
with zipfile.ZipFile('/dbfs/mnt/mids-w261/data/datasets_final_project/allstate-claims-severity.zip') as zip_ref:
    zip_ref.extractall("/dbfs/user/"+username+"/finalproject/")


In [10]:
dbutils.fs.put(finalproject_path+'test.txt',"hello world",True)
display(dbutils.fs.ls(finalproject_path))


| path | name | size |
|---|---|---|
| dbfs:/user/sophia@ischool.berkeley.edu/finalproject/sample_submission.csv | sample_submission.csv | 1106039 |
| dbfs:/user/sophia@ischool.berkeley.edu/finalproject/sample_submission.csv.zip | sample_submission.csv.zip | 296933 |
| dbfs:/user/sophia@ischool.berkeley.edu/finalproject/test.csv | test.csv | 45715862 |
| dbfs:/user/sophia@ischool.berkeley.edu/finalproject/test.csv.zip | test.csv.zip | 9873043 |
| dbfs:/user/sophia@ischool.berkeley.edu/finalproject/test.txt | test.txt | 11 |
| dbfs:/user/sophia@ischool.berkeley.edu/finalproject/train.csv | train.csv | 70025339 |
| dbfs:/user/sophia@ischool.berkeley.edu/finalproject/train.csv.zip | train.csv.zip | 15848282 |


**File descriptions:**

- train.csv - the training set
- test.csv - the test set. You must predict the loss value for the ids in this file.
- sample_submission.csv - a sample submission file in the correct format

Update:
- Confirmed that the provided test set does not include the target variable 'loss', so we will discard it and split a holdout set from the train set instead.

In [12]:
trainheaders = dbutils.fs.head(finalproject_path + '/train.csv')
trainheaders = trainheaders.split('\n')[0]
testheaders = dbutils.fs.head(finalproject_path + '/test.csv')
testheaders = testheaders.split('\n')[0]
print(trainheaders)
print(testheaders)

In [13]:
# load the raw data into an RDD
trainRDD = sc.textFile(finalproject_path + '/train.csv')\
            .filter(lambda x: x != trainheaders)
             

In [14]:
training_data = sc.textFile(finalproject_path + "/train.csv")
print(training_data.count())

### Data Storage and Scalability

#### Summary Statistics

#### Data:
- 188318 rows in train set (188319 lines counting the header row) and 125546 in test set.
- There are no missing values. 
- 116 categorical data fields and 14 continuous data fields. 
- Continuous data fields are already normalized between 0 and 1. 
- Target variable values vary a lot with min of 0.67 and max of 121012. 

#### Size
The dataset is not very large: roughly 70MB for the training set and 50MB for the test set, uncompressed. With about 188k rows in the training set and 126k in the test set, it is small enough to load into memory and run locally.

Assuming the dataset will not grow by orders of magnitude (such growth seems unlikely for claims data), we can consider a few storage options:

- CSV - the current storage format; easy to serialize and compress, and compatible as input with many analysis tools. A reasonable file type for moving data around a network, but prone to consistency and performance issues if used as the definitive source for queries and updates.
- SQL - a datastore that enables quick and consistent updates and queries with good uptime. Good for pulling up rows of data and for quick analysis and transforms.
- RDD - Spark's distributed collection, suited to distributed storage and processing if we expect large volumes of claims in the long-term future.
- DataFrame/DataTable - a secondary storage format, great for quick analysis of historical numbers and easy to run locally for simple analysis.

Also note that the data is highly structured, which suits the formats above. If the data were unstructured or semi-structured, we could consider more document-based storage solutions. We have 116 categorical fields and 14 continuous fields.

#### Future Extensions of Data Consideration
However, the size of this dataset can increase if:

- the problem itself introduces more types and numbers of claims over time
- the nature of the problem we consider includes more factors, such as IoT or smart device data linked to person(s) related to a claim

For those factors, we may consider a more scalable distributed storage solution with:

- distributed SQL for well structured data columns to enable quick queries
- a distributed document store (a document-oriented NoSQL database such as MongoDB) for semi-structured device data

Because of the structured, small dataset, SQL could be a great storage solution and a serialized DataFrame or DataTable could be great for quick analysis. If the data grows in size, or is expanded upon, we can consider a more distributed solution.

In the interest of this class, we will stick to RDD operations and use Dataframes as a fallback/sanity check.

## Exploratory Data Analysis

#### Data Skew
The EDA showed that the target variable is highly skewed, so we apply a log transformation to reduce the skew. The other continuous variables all have similar means and variances. One issue we noticed is that the provided test dataset is missing the target variable column, so we will discard the test set and split a holdout set from the train set.


We used histograms and a correlation matrix to review the variables' distributions and relationships.

In [18]:
#cache train set continuous values
trainRDDCached_cont=trainRDD.map(lambda x: x.split(','))\
        .map(lambda x: (x[117:131],x[-1])).cache()
totalRDDCached_cont = trainRDDCached_cont
one_cont = trainRDDCached_cont.take(1)[0]
one_cont

In [19]:
#cache train set categorical values
# note: x[1:116] keeps columns 1..115 (cat1..cat115); column 116 (cat116) is not included
trainRDDCached_cat=trainRDD.map(lambda x: x.split(','))\
        .map(lambda x: (x[1:116],x[-1])).cache()
totalRDDCached_cat = trainRDDCached_cat
one_cat = trainRDDCached_cat.take(1)[0]
one_cat
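The column slices used in the two cells above can be checked on a synthetic header. This sketch assumes the column layout `id, cat1..cat116, cont1..cont14, loss`, which is inferred from the slices and the `describe()` output rather than read from the file; `make_header` is a hypothetical helper.

```python
def make_header():
    """Build a header with the assumed layout: id, cat1..cat116, cont1..cont14, loss."""
    cats = [f"cat{i}" for i in range(1, 117)]
    conts = [f"cont{i}" for i in range(1, 15)]
    return ["id"] + cats + conts + ["loss"]

header = make_header()

cont_fields = header[117:131]   # same slice as the continuous-value cell
cat_fields = header[1:116]      # same slice as the categorical-value cell
label_field = header[-1]

print(len(header))                      # 132
print(cont_fields[0], cont_fields[-1])  # cont1 cont14
print(cat_fields[0], cat_fields[-1])    # cat1 cat115
print(label_field)                      # loss
```

Under this assumed layout, `[1:116]` stops at `cat115`; a slice of `[1:117]` would be needed to include `cat116`.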

In [20]:
#convert RDD to dataframe
dataset_cont = np.array(totalRDDCached_cont.map(lambda x: np.append(x[0], [x[1]])).take(188318))
dataset_cat = np.array(totalRDDCached_cat.map(lambda x: np.append(x[0], [x[1]])).take(188318))
                                
FIELDS_continuous = trainheaders.split(',')[117:132]
FIELDS_cat = trainheaders.split(',')[1:116]
FIELDS_cat.append('loss')
dataset_cont_df = pd.DataFrame(np.array(dataset_cont),columns=FIELDS_continuous)
dataset_cat_df = pd.DataFrame(np.array(dataset_cat),columns=FIELDS_cat)

In [21]:
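# convert_objects() was removed in pandas 1.0; pd.to_numeric with errors='coerce' is the replacement
dataset_cont_df = dataset_cont_df.apply(pd.to_numeric, errors='coerce')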
dataset_cont_df.info()

The continuous variables are all converted to a numeric data type. Apart from the target variable 'loss', all of them take values between 0 and 1.
The continuous independent variables have similar means and standard deviations. There are no extreme values or missing values.

In [23]:
dataset_cont_df.describe()

| statistic | cont1 | cont2 | cont3 | cont4 | cont5 | cont6 | cont7 | cont8 | cont9 | cont10 | cont11 | cont12 | cont13 | cont14 | loss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 188318.0 | 188318.0 | 188318.0 | 188318.0 | 188318.0 | 188318.0 | 188318.0 | 188318.0 | 188318.0 | 188318.0 | 188318.0 | 188318.0 | 188318.0 | 188318.0 | 188318.0 |
| mean | 0.493861 | 0.507188 | 0.498918 | 0.491812 | 0.487428 | 0.490945 | 0.48497 | 0.486437 | 0.485506 | 0.498066 | 0.493511 | 0.49315 | 0.493138 | 0.495717 | 3037.337686 |
| std | 0.18764 | 0.207202 | 0.202105 | 0.211292 | 0.209027 | 0.205273 | 0.17845 | 0.19937 | 0.18166 | 0.185877 | 0.209737 | 0.209427 | 0.212777 | 0.222488 | 2904.086186 |
| min | 1.6e-05 | 0.001149 | 0.002634 | 0.176921 | 0.281143 | 0.012683 | 0.069503 | 0.23688 | 8e-05 | 0.0 | 0.035321 | 0.036232 | 0.000228 | 0.179722 | 0.67 |
| 25% | 0.34609 | 0.358319 | 0.336963 | 0.327354 | 0.281143 | 0.336105 | 0.350175 | 0.3128 | 0.35897 | 0.36458 | 0.310961 | 0.311661 | 0.315758 | 0.29461 | 1204.46 |
| 50% | 0.475784 | 0.555782 | 0.527991 | 0.452887 | 0.422268 | 0.440945 | 0.438285 | 0.44106 | 0.44145 | 0.46119 | 0.457203 | 0.462286 | 0.363547 | 0.407403 | 2115.57 |
| 75% | 0.623912 | 0.681761 | 0.634224 | 0.652072 | 0.643315 | 0.655021 | 0.591045 | 0.62358 | 0.56682 | 0.61459 | 0.678924 | 0.675759 | 0.689974 | 0.724623 | 3864.045 |
| max | 0.984975 | 0.862654 | 0.944251 | 0.954297 | 0.983674 | 0.997162 | 1.0 | 0.9802 | 0.9954 | 0.99498 | 0.998742 | 0.998484 | 0.988494 | 0.844848 | 121012.25 |


In [24]:
dataset_cat[0]

In [25]:
dataset_cat_df.info()
for col in dataset_cat_df.columns.values:
  if (col != 'loss'):
    dataset_cat_df[col] = dataset_cat_df[col].astype('category')
  else:
    dataset_cat_df[col] = dataset_cat_df[col].astype('float')

dataset_cat_df.info()


In [26]:
cat_columns = {}
cat_columns_total = 0
for col in dataset_cat_df.columns.values:
  if (col != 'loss'):
    val_count = dataset_cat_df[col].value_counts()
    cat_columns_total += len(val_count)
    for key, count_ in enumerate(val_count):
      if (col not in cat_columns):
        cat_columns[col] = []
      cat_columns[col].append(count_)

print(cat_columns)
print(cat_columns_total)

For the continuous variables, some are left-skewed and some are right-skewed.

In [28]:
dataset_cont_df.astype('float').hist(figsize=(15,15), bins=15)
display(plt.show())

Target variable 'loss' is highly skewed based on its histogram. We will transform it using np.log1p to reduce the skew. It also has extreme values, both extremely small (<1) and extremely large (>100,000), which we will exclude from the regression.

In [30]:
print(dataset_cont_df.astype('float').skew())

After the log transformation, the target variable more closely follows a normal distribution.
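The effect of the log transformation on skewness can be reproduced on synthetic data. The sketch below draws right-skewed values from a log-normal distribution (the parameters 7.7 and 0.8 are assumptions chosen to resemble the scale of the log-losses, and this is not the Allstate data). `skewness` is a hypothetical helper using the plain third standardized moment, so its values differ slightly from the bias-corrected estimator in pandas `.skew()`.

```python
import math
import random

def skewness(values):
    """Third standardized moment (population form, no bias correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

random.seed(0)
# Synthetic right-skewed "loss" values; a stand-in for the real target.
loss = [random.lognormvariate(7.7, 0.8) for _ in range(10_000)]
log_loss = [math.log1p(v) for v in loss]

raw_skew = skewness(loss)
log_skew = skewness(log_loss)
print(f"skew before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")
```

A skewness near zero after the transform is what the histogram of the log-transformed target suggests for the real data as well.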

In [32]:
plt.figure(figsize=(7,5))
sns.distplot(np.log1p(dataset_cont_df[FIELDS_continuous[-1]].astype('float')))
display(plt.show())

Among the continuous variables, some are highly correlated (darker red in the heatmap). Highly correlated independent variables make regression coefficients unstable and hard to interpret, because linear regression assumes the predictors are not strongly collinear. We will use PCA to make the variables orthogonal.
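To illustrate why PCA addresses this, the sketch below builds two strongly correlated synthetic features (stand-ins, not the real `cont` columns), projects them onto their principal components with an SVD, and compares the correlation before and after. It uses NumPy directly rather than the `sklearn` `PCA` class used later in the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic features where x2 is mostly a scaled copy of x1.
x1 = rng.normal(size=1000)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=1000)
X = np.column_stack([x1, x2])

# PCA via SVD of the centered data: the rows of Vt are the principal directions.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt.T  # component scores, one column per principal component

corr_before = np.corrcoef(X, rowvar=False)[0, 1]
corr_after = np.corrcoef(Z, rowvar=False)[0, 1]
print(f"correlation before PCA: {corr_before:.3f}")
print(f"correlation after PCA:  {corr_after:.1e}")
```

The component scores are uncorrelated up to floating-point error, which is the property the regression on PCA features relies on.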

In [34]:
corr = dataset_cont_df[FIELDS_continuous[:-1]].astype('float').corr()
fig, ax = plt.subplots(figsize=(9, 7))
# np.bool was removed in NumPy 1.24; the built-in bool is equivalent here
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
cmap = sns.diverging_palette(240, 10, as_cmap=True)
sns.heatmap(corr, mask=mask, cmap=cmap, center=0, linewidths=.5)
plt.title("Correlations between features.")
display(plt.show())

## Proposed Features

#### Variable Dimensions
We have 116 categorical fields and 14 continuous fields in addition to our loss outcome variable. The categorical variables encoded below (cat1 through cat115) expand into 813 dummy variables, or dimensions. Some of the categorical variables are binary (only two values) and others have dozens of values with a skewed distribution, e.g. cat113, cat115. Given that there are roughly 188k data rows, 813 variables is significant and likely needs to be pruned. However, it's not such a large number of dimensions that a kitchen sink model is out of the question.

Some of the dummy variables, as well as some of the continuous variables, are positively or negatively correlated with loss. The dummy and continuous variables are inter-correlated as well.

We are not given hints about any of the variables' underlying meaning or context, although some of them are likely gender, age, etc. Given this, we will focus on building models that minimize prediction error while also minimizing the number of inputs. We will cross-validate on held-out data.

#### Potential Approaches

- Brute force: use all variables (all dummies + continuous), and refine the model with Lasso/Ridge regularization to limit model complexity
- PCA / dimensionality reduction prior to model construction
- Transformation of the target variable

A kitchen sink linear regression model is a reasonable baseline for comparison.

#### Feature Engineering

##### One-hot Encoding
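Categorical variables are likely related to an individual’s characteristics, for example, sex, ethnicity, age, etc., whereas continuous variables may relate to height, weight, or the amount of time since the individual’s last traffic incident. Because the categorical values are unordered labels, we one-hot encode them: each level of a categorical variable becomes its own 0/1 dummy column.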
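As a small illustration of one-hot encoding, the sketch below encodes a single categorical column by hand. The rows are made up, and `one_hot` is a hypothetical helper; the notebook itself uses `pd.get_dummies`, whose `column_level` naming is mirrored here.

```python
def one_hot(rows, column):
    """Return (dummy_column_names, encoded_rows) for one categorical column."""
    levels = sorted({row[column] for row in rows})
    names = [f"{column}_{level}" for level in levels]
    encoded = [[1 if row[column] == level else 0 for level in levels]
               for row in rows]
    return names, encoded

# Made-up rows with a single binary categorical field.
rows = [{"cat1": "A"}, {"cat1": "B"}, {"cat1": "A"}]

names, encoded = one_hot(rows, "cat1")
print(names)    # ['cat1_A', 'cat1_B']
print(encoded)  # [[1, 0], [0, 1], [1, 0]]
```

Each row has exactly one 1 per original column, which is why a variable with many levels adds many sparse dimensions.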

##### Log-Transform Output Variable
We log-transformed the output variable because it showed a very high skew; after the np.log1p transformation, its distribution is approximately normal.

##### Principal Component Analysis
For feature engineering, we used PCA (Principal Component Analysis), an “unsupervised, non-parametric statistical technique primarily used for dimensionality reduction” (source: https://medium.com/apprentice-journal/pca-application-in-machine-learning-4827c07a61db), in an attempt to whittle down the number of input variables. When there are too many features, a situation sometimes referred to as high dimensionality, we run into the issue of model overfitting. A model that is overfit to the training set captures only the scenarios found in that set and will likely have a much higher error rate on a test set that contains even a few new scenarios.

Step 1: encoding all categorical variables using get_dummies

In [38]:
# encoding for all cat variables
dataset_cat_df_dummies = pd.get_dummies(dataset_cat_df, columns=[f'cat{i}' for i in range(1, 116)])  # cat1..cat115
dataset_cat_df_dummies.describe()

Unnamed: 0,loss,cat1_A,cat1_B,cat2_A,cat2_B,cat3_A,cat3_B,cat4_A,cat4_B,cat5_A,cat5_B,cat6_A,cat6_B,cat7_A,cat7_B,cat8_A,cat8_B,cat9_A,cat9_B,cat10_A,cat10_B,cat11_A,cat11_B,cat12_A,cat12_B,cat13_A,cat13_B,cat14_A,cat14_B,cat15_A,cat15_B,cat16_A,cat16_B,cat17_A,cat17_B,cat18_A,cat18_B,cat19_A,cat19_B,cat20_A,...,cat114_C,cat114_D,cat114_E,cat114_F,cat114_G,cat114_I,cat114_J,cat114_L,cat114_N,cat114_O,cat114_Q,cat114_R,cat114_S,cat114_U,cat114_V,cat114_W,cat114_X,cat115_A,cat115_B,cat115_C,cat115_D,cat115_E,cat115_F,cat115_G,cat115_H,cat115_I,cat115_J,cat115_K,cat115_L,cat115_M,cat115_N,cat115_O,cat115_P,cat115_Q,cat115_R,cat115_S,cat115_T,cat115_U,cat115_W,cat115_X
count,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,...,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0,188318.0
mean,3037.337686,0.751654,0.248346,0.566706,0.433294,0.945173,0.054827,0.681799,0.318201,0.657064,0.342936,0.699312,0.300688,0.975711,0.024289,0.941355,0.058645,0.600697,0.399303,0.850758,0.149242,0.893096,0.106904,0.848697,0.151303,0.896627,0.103373,0.987909,0.012091,0.999819,0.000181,0.965617,0.034383,0.993049,0.006951,0.994759,0.005241,0.990399,0.009601,0.998917,...,0.089174,2.7e-05,0.087485,0.041977,5e-06,0.012914,0.043538,0.00462,0.013036,0.001274,0.000228,0.004843,2.1e-05,0.001328,0.000175,5e-06,5e-06,0.000398,1.1e-05,5e-06,2.1e-05,5.8e-05,0.001428,0.001673,0.014831,0.037649,0.126886,0.232936,0.085626,0.06608,0.11915,0.142382,0.11437,0.043851,0.010822,0.001328,0.000297,0.000138,3.2e-05,2.7e-05
std,2904.086186,0.432055,0.432055,0.495532,0.495532,0.227644,0.227644,0.465779,0.465779,0.474692,0.474692,0.458559,0.458559,0.153944,0.153944,0.234961,0.234961,0.489757,0.489757,0.356328,0.356328,0.308992,0.308992,0.358345,0.358345,0.304446,0.304446,0.109294,0.109294,0.013436,0.013436,0.182212,0.182212,0.083083,0.083083,0.072206,0.072206,0.097512,0.097512,0.032895,...,0.284995,0.005153,0.282545,0.200537,0.002304,0.112905,0.204065,0.067812,0.113431,0.035677,0.015109,0.069422,0.004609,0.036411,0.013237,0.002304,0.002304,0.019953,0.003259,0.002304,0.004609,0.007643,0.037768,0.040865,0.120878,0.190347,0.332847,0.422703,0.279812,0.248422,0.323965,0.349442,0.318261,0.204765,0.103465,0.036411,0.017242,0.011749,0.005644,0.005153
min,0.67,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1204.46,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,2115.57,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,3864.045,1.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,121012.25,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


Step 2: remove extreme values, merge continuous fields with categorical fields, and split train and test sets

In [40]:
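# keep rows with 1 < loss < 100000, using one combined boolean mask per frame instead of chained indexing
dataset_cat_df_dummies = dataset_cat_df_dummies[(dataset_cat_df_dummies['loss'] > 1) & (dataset_cat_df_dummies['loss'] < 100000)]
dataset_cont_df = dataset_cont_df[(dataset_cont_df['loss'] > 1) & (dataset_cont_df['loss'] < 100000)]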
X =  pd.concat([dataset_cat_df_dummies.loc[:,dataset_cat_df_dummies.columns != 'loss'], dataset_cont_df.loc[:,dataset_cont_df.columns != 'loss']], axis=1, sort=False)
y = np.log1p(dataset_cont_df.loc[:,dataset_cont_df.columns == 'loss'].astype('float'))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1)


In [41]:
y_train.describe()

| statistic | loss |
|---|---|
| count | 169483.0 |
| mean | 7.68436 |
| std | 0.81108 |
| min | 1.832581 |
| 25% | 7.092981 |
| 50% | 7.655793 |
| 75% | 8.257392 |
| max | 11.361225 |


In [42]:
y_test.describe()

| statistic | loss |
|---|---|
| count | 18832.0 |
| mean | 7.699792 |
| std | 0.810619 |
| min | 3.090588 |
| 25% | 7.10971 |
| 50% | 7.672507 |
| 75% | 8.278927 |
| max | 10.635805 |


In [43]:
print(X_train.head())
print(X_train.shape)
print(X_test.shape)

Run PCA to get an idea of how many dimensions we need for good explanatory power. 200 dimensions explain 96.6% of the variation.

In [45]:
#Run PCA to identify important features
pca = PCA(n_components=200)
pca.fit(X_train)
print(pca.explained_variance_ratio_.cumsum())
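The cumulative sum printed above can be turned into a component count for a chosen variance target. The sketch below works on a made-up list of explained-variance ratios (not the values from this dataset), and `components_needed` is a hypothetical helper; with the fitted model one could pass `pca.explained_variance_ratio_` instead.

```python
from itertools import accumulate

def components_needed(explained_variance_ratio, target=0.95):
    """Smallest k whose cumulative explained-variance ratio reaches `target`."""
    for k, total in enumerate(accumulate(explained_variance_ratio), start=1):
        if total >= target:
            return k
    return None  # target not reachable with the components provided

# Made-up ratios, sorted from most to least important component.
ratios = [0.5, 0.25, 0.125, 0.0625, 0.03125]

k_90 = components_needed(ratios, target=0.9)
k_99 = components_needed(ratios, target=0.99)
print(k_90)  # 4
print(k_99)  # None
```

A `None` result signals that more components must be fit before the target can be met.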

In [46]:
X_pca_train = pca.fit_transform(X_train)
X_pca_test = pca.transform(X_test)
print("original shape:   ", X_train.shape)
print("transformed shape:", X_pca_train.shape)
print("original shape:   ", X_test.shape)
print("transformed shape:", X_pca_test.shape)

Normalization: further transform the target variable with a min-max scaler, fit on the training split only, so that train and test are on the same scale.

In [48]:
scaler = MinMaxScaler()
y_train_scaled = scaler.fit_transform(y_train)
y_test_scaled = scaler.transform(y_test)
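The scaling step above can be sketched in a few lines to show why the scaler is fit on the training split and only applied to the test split. `MinMax1D` is a hypothetical one-dimensional stand-in written for illustration, not the `sklearn` `MinMaxScaler` class, and the values are made up.

```python
class MinMax1D:
    """Tiny 1-D min-max scaler (illustrative stand-in, not sklearn's class)."""

    def fit(self, values):
        self.lo, self.hi = min(values), max(values)
        return self

    def transform(self, values):
        return [(v - self.lo) / (self.hi - self.lo) for v in values]

    def inverse_transform(self, scaled):
        return [s * (self.hi - self.lo) + self.lo for s in scaled]

y_train = [2.0, 6.0, 10.0]
y_test = [4.0, 12.0]  # 12.0 lies outside the training range

scaler = MinMax1D().fit(y_train)      # min and max come from the training split only
train_scaled = scaler.transform(y_train)
test_scaled = scaler.transform(y_test)
restored = scaler.inverse_transform(test_scaled)

print(train_scaled)  # [0.0, 0.5, 1.0]
print(test_scaled)   # [0.25, 1.25]
print(restored)      # [4.0, 12.0]
```

Test values can fall outside the 0 to 1 range when they exceed the training extremes, and the inverse transform recovers the original units either way.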

In [49]:
pd.DataFrame(y_train_scaled).describe()

| statistic | 0 |
|---|---|
| count | 169483.0 |
| mean | 0.614125 |
| std | 0.08512 |
| min | 0.0 |
| 25% | 0.552062 |
| 50% | 0.611127 |
| 75% | 0.674263 |
| max | 1.0 |


In [50]:
pd.DataFrame(y_test_scaled).describe()

| statistic | 0 |
|---|---|
| count | 18832.0 |
| mean | 0.615745 |
| std | 0.085072 |
| min | 0.132024 |
| 25% | 0.553817 |
| 50% | 0.612881 |
| 75% | 0.676523 |
| max | 0.92387 |


Linear regression with PCA and scaled target variable

In [52]:
linear = LinearRegression()
model = linear.fit(X_pca_train, y_train_scaled)

# invert the min-max scaling, then the log1p transform, so predictions are in original loss units
y_train_pred = np.expm1(scaler.inverse_transform(model.predict(X_pca_train)))
y_test_pred = np.expm1(scaler.inverse_transform(model.predict(X_pca_test)))
# MAE on the original scale (y_true first, y_pred second)
print(mean_absolute_error(np.expm1(y_train), y_train_pred))
print(mean_absolute_error(np.expm1(y_test), y_test_pred))

print(model.coef_)

In [53]:
#########################################################
# DF to RDD code 
#########################################################

def transformRDDtoDF(dataRDD):
  """helper function: transform dataRDD into dataframe with dummy variables."""
  dataRDDCached_cont=dataRDD.map(lambda x: x.split(','))\
    .map(lambda x: (x[117:131],x[-1])).cache()
             
  dataRDDCached_cat=dataRDD.map(lambda x: x.split(','))\
    .map(lambda x: (x[1:116],x[-1])).cache()

  dataset_cont = np.array(dataRDDCached_cont.map(lambda x: np.append(x[0], [x[1]])).take(188318))
  dataset_cat = np.array(dataRDDCached_cat.map(lambda x: np.append(x[0], [x[1]])).take(188318))

  FIELDS_continuous = trainheaders.split(',')[117:132]
  FIELDS_cat = trainheaders.split(',')[1:116]
  FIELDS_cat.append('loss')
  dataset_cont_df = pd.DataFrame(np.array(dataset_cont),columns=FIELDS_continuous)
  dataset_cat_df = pd.DataFrame(np.array(dataset_cat),columns=FIELDS_cat)
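  # convert_objects() was removed in pandas 1.0; pd.to_numeric with errors='coerce' is the replacement
  dataset_cont_df = dataset_cont_df.apply(pd.to_numeric, errors='coerce')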

  for col in dataset_cat_df.columns.values:
    if (col != 'loss'):
      dataset_cat_df[col] = dataset_cat_df[col].astype('category')
    else:
      dataset_cat_df[col] = dataset_cat_df[col].astype('float')

  
  # encoding for all cat variables
  dataset_cat_df_dummies = pd.get_dummies(dataset_cat_df, columns=[f'cat{i}' for i in range(1, 116)])  # cat1..cat115

  data_frame = pd.concat([dataset_cat_df_dummies.loc[:,dataset_cat_df_dummies.columns != 'loss'], dataset_cont_df.loc[:,dataset_cont_df.columns != 'loss']], axis=1, sort=False)
  data_outcome = np.log1p(dataset_cat_df_dummies[['loss']].astype('float'))
  return data_frame, data_outcome

In [54]:
#########################################################
# Fill in missing categoricals
#########################################################

# df test runs for different models
testRDD = sc.textFile(finalproject_path + '/test.csv')\
              .filter(lambda x: x != testheaders)

X_train_, X_train_loss_ = transformRDDtoDF(trainRDD)
X_test_, X_test_loss_ = transformRDDtoDF(testRDD)

train_cols = X_train_.columns
test_cols = X_test_.columns

# filling in missing categoricals
train_diff = list(set(train_cols) - set(test_cols))
test_diff = list(set(test_cols) - set(train_cols))
print(train_diff)
print(test_diff)

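# add dummy columns that exist only in train to the test frame
# (X_test_ is the frame built above; X_test is the earlier sklearn holdout split)
for col in train_diff:
  X_test_[col] = 0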

for col in test_diff:
  X_train_[col] = 0

train_cols = X_train_.columns
test_cols = X_test_.columns

print(list(set(train_cols) - set(test_cols)))
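The column-filling loop above can also be expressed with `DataFrame.reindex`, which adds missing columns and fixes the column order in one step. The sketch below uses two tiny made-up frames in place of `X_train_` and `X_test_`; it is an alternative illustration, not the code that produced the results in this notebook.

```python
import pandas as pd

# Made-up splits whose categorical levels only partly overlap.
train = pd.get_dummies(pd.DataFrame({"cat1": ["A", "B", "C"]}), columns=["cat1"])
test = pd.get_dummies(pd.DataFrame({"cat1": ["A", "D"]}), columns=["cat1"])

# Union of the dummy columns seen in either split, in a fixed order.
all_cols = sorted(set(train.columns) | set(test.columns))

# reindex adds any missing column filled with 0 and puts both frames in the same order.
train_aligned = train.reindex(columns=all_cols, fill_value=0)
test_aligned = test.reindex(columns=all_cols, fill_value=0)

print(list(train_aligned.columns))
print(list(test_aligned.columns))
```

Matching column order matters as much as matching column names, because most estimators treat features positionally.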


## Algorithm Exploration

The baseline model used with the data was a **kitchen sink linear regression model**. Aptly named, a kitchen sink model is a regression that uses as many independent variables as possible in order to explain as much of the variance in the dependent variable as it can. The name comes from the phrase ‘everything but the kitchen sink’, which basically means throw everything in and see what happens. One issue with this type of model is that using so many independent variables can lead to an overfit model, an issue explained in the previous section.

For a simple comparison to the baseline model, **Ridge and Lasso regressions** were run to see if the baseline model could be improved with regularization. While an OLS estimator, like the one used for the baseline model, generally has low bias, its variance can be quite high, especially when there are many dimensions, as in the Allstate data. Enter regularization with Ridge and Lasso regressions. Ridge Regression regularizes a model by adding a penalty on the sum of squared coefficients (the L2 penalty), which shrinks large coefficients toward zero without setting them exactly to zero; the model decreases in effective complexity without actually removing any variables. To make this model most effective, the lambda parameter, also called the regularization penalty coefficient, must be set so that bias and variance are balanced: too high a value decreases the variance but also increases the bias. The other form of regularization is Lasso Regression, which is similar to Ridge but penalizes the sum of the absolute values of the coefficients (the L1 penalty). This penalty can set coefficients exactly to zero, so Lasso also performs variable selection, and many coefficients are zeroed out completely if the lambda parameter is set too high.
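The different behaviour of the two penalties is easiest to see in the special case of an orthonormal design (the columns of X are orthogonal with unit length), where both estimators have closed forms in terms of the OLS coefficients. The sketch below assumes that special case, with objectives `||y - Xb||^2 + lam * ||b||_2^2` for Ridge and `0.5 * ||y - Xb||^2 + lam * ||b||_1` for Lasso; the function names and coefficient values are made up for illustration.

```python
def ridge_shrink(beta_ols, lam):
    """Ridge under an orthonormal design: every coefficient is scaled by 1 / (1 + lam)."""
    return [b / (1.0 + lam) for b in beta_ols]

def lasso_shrink(beta_ols, lam):
    """Lasso under an orthonormal design: soft-thresholding at lam."""
    shrunk = []
    for b in beta_ols:
        if abs(b) <= lam:
            shrunk.append(0.0)           # small coefficients are removed entirely
        elif b > 0:
            shrunk.append(b - lam)
        else:
            shrunk.append(b + lam)
    return shrunk

beta_ols = [4.0, -2.0, 0.5]  # made-up OLS coefficients

ridge = ridge_shrink(beta_ols, lam=1.0)
lasso = lasso_shrink(beta_ols, lam=1.0)
print(ridge)  # [2.0, -1.0, 0.25]
print(lasso)  # [3.0, -1.0, 0.0]
```

Ridge keeps all three variables with smaller weights, while Lasso drops the weakest one, which matches the variable-selection behaviour described above. For correlated predictors such as the Allstate features, no closed form applies and an iterative solver is needed.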

Another algorithm considered was **KNN (k-Nearest Neighbors)**, which rests on the assumption that “birds of a feather flock together”; in other words, similar things tend to exist in close proximity to each other. kNN works by calculating distances between data points, most often the Euclidean distance (square the difference along each axis, sum these, and take the square root). The “k” is the number of neighbors we choose to consult. It can be varied from run to run so that the optimal value (lowest error) is used.

A step-by-step approach to this algorithm (useful for pipelining) is as follows. Once a k-value is selected, the distance between the query point and each training point is calculated. Each distance is added, together with the index of its training point, to a collection that is sorted in ascending order of distance, so the smallest distance is at the top. The first k entries are picked from the top and their labels are read off; for regression, the mean of those labels is returned, and for classification, the mode.

A few caveats apply. As k approaches one, the predictions become less stable, because each prediction depends on very few neighbors and is easily swayed by a noisy or atypical point. The model also becomes slow to run when the number of rows or the number of predictors in the input dataset is large, because a brute-force implementation computes the distance to every training point for each prediction.
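The steps just described can be sketched as a brute-force kNN regressor. This is a minimal illustration on made-up points using Euclidean distance; `knn_predict` is a hypothetical helper and not the implementation used for the results in this project.

```python
import math

def knn_predict(train_X, train_y, query, k):
    """Predict a numeric target as the mean of the k nearest training targets."""
    # Distance from the query to every training point, paired with that point's target.
    distances = [(math.dist(x, query), y) for x, y in zip(train_X, train_y)]
    distances.sort(key=lambda pair: pair[0])   # nearest first
    nearest = [y for _, y in distances[:k]]
    return sum(nearest) / len(nearest)         # regression: average the neighbors

# Made-up 2-D training points and targets.
train_X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
train_y = [10.0, 20.0, 30.0, 100.0]

pred_k3 = knn_predict(train_X, train_y, query=(0.1, 0.1), k=3)
pred_k1 = knn_predict(train_X, train_y, query=(0.1, 0.1), k=1)
print(pred_k3)  # 20.0
print(pred_k1)  # 10.0
```

With k=1 the prediction copies the single closest point, which is why small k values are sensitive to noise; a classifier would return the most common label among the neighbors instead of the mean.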

A **Decision Tree algorithm** falls within the class of ‘supervised learning’ algorithms, meaning that it learns simple decision rules from labeled training data in order to predict the target variable. The learned model consists of a root node, inner nodes, and leaf nodes that, when laid out, look like an upside-down tree, hence the name. The tree maps observations about an item to conclusions about the item’s target value. Decision Trees are used for two types of problems: classification, where the target is categorical, and regression, where the target is continuous. The model is a popular choice for machine learning problems for many reasons, including the following: it requires little pre-processing, it can be used as a form of feature selection, it works better than a linear model when the relationship between the predictors and the target is highly non-linear, some implementations handle missing data well, and the features do not need to be normalized or standardized. When building a decision tree, the root node represents all of the data and is split into branches or sub-nodes, commonly referred to as decision or inner nodes. Decision nodes are then split either into further decision nodes or into leaf nodes, sometimes called terminal nodes, which are the final segments. This is where the prediction is made. The goal of each split is to move toward “purity” at the leaf node, which means that the target values within a leaf should be as similar as possible: a single class for classification, or a variance as close to zero as possible for regression. A common impurity measure for classification is entropy. Split criteria built on these measures include information gain, which is the entropy of a node before a split minus the weighted average entropy of its children after the split, and reduction in variance, which uses the standard formula for variance and picks the split point whose children have the lowest weighted variance. 
Though decision trees are a popular choice, there can be drawbacks, like the common problem of overfitting, which can occur if no stopping rules are set, such as a rule that a node containing ten or fewer samples is not split again. Without stopping rules, the decision tree could end up creating one leaf for every single observation in the training data. One method to combat this is introduced below.
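
As a rough illustration of the reduction-in-variance criterion and of a stopping rule, the sketch below searches one numeric feature for the threshold whose two children have the lowest weighted variance. The names `variance`, `best_split` and `min_leaf`, and the toy numbers, are assumptions made for this example; real libraries repeat this search over every feature and every node.

```python
def variance(values):
    """Population variance of a non-empty list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_split(feature, target, min_leaf=1):
    """Return (threshold, weighted_child_variance), or (None, parent_variance) if no split helps."""
    best_threshold, best_score = None, variance(target)
    # Candidate thresholds are the distinct feature values (the largest would leave the right side empty)
    for threshold in sorted(set(feature))[:-1]:
        left = [t for f, t in zip(feature, target) if f <= threshold]
        right = [t for f, t in zip(feature, target) if f > threshold]
        # Stopping rule: never create a child with fewer than min_leaf samples
        if len(left) < min_leaf or len(right) < min_leaf:
            continue
        score = (len(left) * variance(left) + len(right) * variance(right)) / len(target)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

# Toy data (hypothetical): the target jumps between feature values 3 and 10
feature_toy = [1, 2, 3, 10, 11, 12]
target_toy = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]
print(best_split(feature_toy, target_toy))
print(best_split(feature_toy, target_toy, min_leaf=4))  # stopping rule blocks every split
```

Raising `min_leaf` trades a little training accuracy for a smaller tree, which is the same idea as the "ten or fewer samples" rule described above.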

A **Random Forest**, which is part of the Decision Tree toolbox, reduces the problem of overfitting. “Random Forest is an example of ensemble learning, in which we combine multiple machine learning algorithms to obtain better predictive performance.” (source: https://towardsdatascience.com/decision-tree-algorithm-explained-83beb6e78ef4) Random Forests use a technique called bagging, which builds an ensemble by training many trees on different random samples of the dataset drawn with replacement; the predictions from these trees are then aggregated, by averaging for regression or by majority vote for classification. One can think of this technique as “crowd wisdom”, where low correlation between models is key. The main difference between a regular decision tree and a random forest is that when a decision tree chooses a split, every possible feature is considered and the one producing the most separation between the left and right nodes is picked, whereas in a random forest each split considers only a random subset of the features, so the node splitting is based on those features alone. As the name implies, a forest contains many different trees, which ultimately means that we see more variation amongst the trees, which leads to lower correlation and more diversification. 
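
The sketch below illustrates bagging plus random feature selection with the simplest possible trees (one split each, often called stumps). It is a toy under stated assumptions, not the sklearn or Spark implementation: the names `fit_stump`, `fit_forest` and `predict_forest` are invented here, and because each stump has a single split, choosing features per tree coincides with choosing them per split.

```python
import random

def fit_stump(rows, targets, feature_ids):
    """Fit a one-split regression tree using only the features in feature_ids."""
    best = None
    for j in feature_ids:
        for threshold in sorted({r[j] for r in rows})[:-1]:
            left = [t for r, t in zip(rows, targets) if r[j] <= threshold]
            right = [t for r, t in zip(rows, targets) if r[j] > threshold]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
            if best is None or sse < best[0]:
                best = (sse, j, threshold, lm, rm)
    if best is None:  # no usable split in this sample: predict the sample mean
        mean = sum(targets) / len(targets)
        return (0, float("inf"), mean, mean)
    return best[1:]

def predict_stump(stump, row):
    j, threshold, left_mean, right_mean = stump
    return left_mean if row[j] <= threshold else right_mean

def fit_forest(rows, targets, n_trees=25, n_features=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bagging: bootstrap sample of the rows, drawn with replacement
        idx = [rng.randrange(len(rows)) for _ in rows]
        # Random subset of features available to this stump
        feats = rng.sample(range(len(rows[0])), n_features)
        forest.append(fit_stump([rows[i] for i in idx], [targets[i] for i in idx], feats))
    return forest

def predict_forest(forest, row):
    # Regression forests average the predictions of the individual trees
    return sum(predict_stump(s, row) for s in forest) / len(forest)

# Toy data (hypothetical): feature 0 drives the target, feature 1 is constant
rows_toy = [[x, 1.0] for x in range(10)]
targets_toy = [0.0] * 5 + [10.0] * 5
forest_toy = fit_forest(rows_toy, targets_toy)
print(predict_forest(forest_toy, [0, 1.0]), predict_forest(forest_toy, [9, 1.0]))
```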

The final member of the Decision Tree family that was looked at is **Gradient Boosting Decision Trees**. Similar to Random Forest, this method uses an ensemble of decision trees, with the biggest difference being that the trees are built sequentially, each one correcting the errors of the ensemble so far, and that each tree's contribution is scaled by a parameter called the learning rate. The procedure starts from an initial prediction, the mean of the target, computes the residual of each sample, and builds a tree with the goal of predicting those residual values. The prediction is then updated by adding the learning rate multiplied by the residual predicted by the tree. A new set of residuals is computed as the actual value minus the updated prediction, and in turn this new set of residuals becomes the target of the next decision tree. Once all of the trees have been created, the final prediction is the mean target plus the sum of each tree's predicted residual multiplied by the learning rate. 
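
A minimal sketch of that residual-fitting loop, assuming squared-error loss, one-split trees and a single numeric feature. The names `fit_gbdt` and `predict_gbdt` and the toy data are hypothetical; XGBoost and Spark's GBTRegressor add many refinements on top of this idea.

```python
def fit_stump(xs, targets):
    """Fit a one-split regression tree on 1-D inputs; returns (threshold, left_mean, right_mean)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # all inputs identical: predict the mean everywhere
        mean = sum(targets) / len(targets)
        return (float("inf"), mean, mean)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def fit_gbdt(xs, ys, n_trees=50, learning_rate=0.1):
    base = sum(ys) / len(ys)              # initial prediction: the mean target
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        # Residuals of the current ensemble become the next tree's target
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Move each prediction a small step toward the predicted residual
        preds = [p + learning_rate * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return base, learning_rate, stumps

def predict_gbdt(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)

# Toy data (hypothetical)
xs_toy = [1, 2, 3, 4, 5, 6]
ys_toy = [1.0, 1.2, 0.9, 5.0, 5.2, 4.9]
gbdt_toy = fit_gbdt(xs_toy, ys_toy)
print([round(predict_gbdt(gbdt_toy, x), 2) for x in xs_toy])
```

A smaller learning rate needs more trees to reach the same fit but usually generalizes better, which is why the two parameters are tuned together.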

Sources for above summaries: 
- Ridge/Lasso - https://www.datacamp.com/community/tutorials/tutorial-ridge-lasso-elastic-net
- KNN - https://towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761
- Random Forest - https://towardsdatascience.com/random-forest-a-powerful-ensemble-learning-algorithm-2bf132ba639d
- GB DT - https://towardsdatascience.com/machine-learning-part-18-boosting-algorithms-gradient-boosting-in-python-ef5ae6965be4

### Different Models + CV (sklearn)

#### Sklearn Implementation

We implemented our sklearn models using a held-out train/test split (nominally 90/10; the code below sets `test_size=0.08`) together with 5-fold cross validation, so that no single test set biases our metrics. 
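
To make the cross-validation idea concrete, here is a small standard-library sketch of unshuffled k-fold splitting, scored with MAE on a trivial "predict the training mean" baseline. The helper names `kfold_indices` and `cross_val_mae` are invented for this illustration; the notebook itself relies on sklearn's `KFold` and `cross_val_score`.

```python
def kfold_indices(n_samples, n_splits=5):
    """Yield (train_idx, val_idx) pairs for contiguous, unshuffled folds."""
    # Spread any remainder over the first folds so sizes differ by at most one
    sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
             for i in range(n_splits)]
    start = 0
    for size in sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def cross_val_mae(ys, n_splits=5):
    """MAE per fold for a baseline that always predicts the training-fold mean."""
    scores = []
    for train_idx, val_idx in kfold_indices(len(ys), n_splits):
        mean = sum(ys[i] for i in train_idx) / len(train_idx)
        scores.append(mean_absolute_error([ys[i] for i in val_idx], [mean] * len(val_idx)))
    return scores

print(cross_val_mae([float(v) for v in range(10)]))
```

Because every sample is used for validation exactly once, the spread of the per-fold scores gives a rough picture of how sensitive a model is to the particular data it was trained on.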

We also took into consideration the following implementation details:

- **Caching** - We cached all of our models and results (cross validated results, predictions and MAE), including the performance counters for each model run. In this way, we're able to leverage the local storage cache as a way to reuse the programmatic work we did prior. In addition, this allows us to quickly run (or rerun) certain models to get comparative run results independent of day or cluster usage.

- **IO / Memory** - We monitored the memory usage of the sklearn models closely. Because sklearn runs on a single machine's CPU and memory, and some of our models are large, this metric is important to track, especially for memory-hungry models like KNN. We used the `.info()` function of data frames to judge the size of the data. 

- **Modularity / Code Reuse** - We opted to reuse and standardize as much code as possible, so most of our work is wrapped in functions. This promotes consistency and efficiency, reduces errors and typos, and creates a more readable flow. Our models reuse the same training and metrics functions as well as the same dataset split.

- **Time Complexity** - We used performance counters in our code to monitor how long training and prediction take for each model, and we cache these results. Some models are too expensive to run on the entire dataset in Databricks on our shared cluster. These include KNN and RandomForest.

- **Sampling** - We sampled the training data to obtain a smaller dataset for the larger and more time-consuming models, namely KNN and RandomForest. This way, we are able to get an idea of how a model performs with different parameters and whether it is feasible. The sampling is built into the function that creates the model, as a parameter.
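
The caching, timing and sampling points above can be illustrated with a small standard-library sketch. The helpers `timed` and `load_or_compute`, the temporary cache path and the toy workload are all assumptions made for this example; the notebook's own helpers follow below and write to DBFS instead.

```python
import os
import pickle
import tempfile
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds) using a performance counter."""
    tic = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - tic

def load_or_compute(path, compute):
    """Return the pickled object at path if it exists; otherwise compute, pickle and return it."""
    if os.path.isfile(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

def slow_square_sum(values):
    # Stand-in for an expensive model fit
    return sum(v * v for v in values)

cache_path = os.path.join(tempfile.mkdtemp(), "square_sum.p")  # hypothetical cache location
data = list(range(1000))
sample = data[:300]  # sub-sample, as done for the slower models

first, first_seconds = timed(load_or_compute, cache_path, lambda: slow_square_sum(sample))
second, second_seconds = timed(load_or_compute, cache_path, lambda: slow_square_sum(sample))
print(first == second, first_seconds, second_seconds)
```

The second call reads the pickled result instead of recomputing it, which is what lets a rerun of the notebook skip model fitting when the cache files are already present.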

In [58]:
#########################################################
# Initialization code 
#########################################################

import time
import numpy as np
import matplotlib.pyplot as plt

from sklearn import model_selection
# make_scorer is exposed from sklearn.metrics; the private sklearn.metrics.scorer module is deprecated
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_squared_error, mean_absolute_error

from sklearn.model_selection import train_test_split

# initialize train test split
X_train_subset, X_test_subset, y_train_subset, y_test_subset = train_test_split(X_train_, X_train_loss_, test_size=0.08, random_state=1)

# initialize model caching variables
cacheModels = []
cacheModelNames = []
cacheCVResults = []
cacheMAE = {}
cacheMAE['train'] = []
cacheMAE['test'] = []
cachePerf = {}
cachePerf['fit'] = []
cachePerf['predict'] = []
cachePerf['metric'] = []


In [59]:
#########################################################
# Sklearn Helper Functions 
#########################################################

def transform_mae(y_true, y_pred, **kwargs):
  '''Transforms predicted values to normalized mae'''
  return mean_absolute_error(np.expm1(y_pred), np.expm1(y_true))

def pickle_dump():
  '''Pickles the cached models and results to DBFS so later runs can reload them'''
  caches = {
      "cacheModels": cacheModels,
      "cacheModelNames": cacheModelNames,
      "cacheCVResults": cacheCVResults,
      "cacheMAE": cacheMAE,
      "cachePerf": cachePerf,
  }
  for name, obj in caches.items():
    # the with-block closes each file handle once the dump finishes
    with open("/dbfs/FileStore/" + name + ".p", "wb") as f:
      pickle.dump(obj, f)

  display(dbutils.fs.ls("dbfs:/FileStore"))

def print_summary_raw():
  '''Prints a text version of run results'''
  for count, item in enumerate(cacheMAE['train']):
    print(cacheModelNames[count] + " Train - " + str(item))
    print(cacheModelNames[count] + " Test - " + str(cacheMAE['test'][count]))
    print(cacheModelNames[count] + " Time Fit:" + str(round(cachePerf['fit'][count]/60, 1)) + "min")
    print(cacheModelNames[count] + " Time Predict:" + str(round(cachePerf['predict'][count]/60, 1)) + "min")
    print(cacheModelNames[count] + " Time Metric:" + str(round(cachePerf['metric'][count]/60, 1)) + "min")
    
def plotSidebySideBarCharts(data1, data1_lbl, data2, data2_lbl, labels, title):
  '''Plots side by side bar charts'''
  x = np.arange(len(labels))  
  width = 0.4  

  fig, ax = plt.subplots()
  plt.xticks(rotation=90)
  rects2 = ax.bar(x + width/2, data1, width, label=data1_lbl)
  rects1 = ax.bar(x - width/2, data2, width, label=data2_lbl)
  ax.set_title(title)
  ax.set_xticks(x)
  ax.set_xticklabels(labels)
  ax.legend(loc="lower left")

  fig.tight_layout()

  display(plt.show())
  
def plotBoxPlotCVCharts(title):
  '''Plots box plot charts'''
  fig = plt.figure()
  fig.suptitle(title)
  ax = fig.add_subplot(111)
  plt.boxplot(cacheCVResults[2:])
  ax.set_xticklabels(cacheModelNames[2:])
  plt.xticks(rotation=90)
  display(plt.show())
  
def plotLineChart(data1, data1_label, data2, data2_label, x_num, xLabel, yLabel, title, axes, fig, subplot):
  '''Plots side by side subplots for line charts'''
  axes[subplot].plot(x_num, data1, label=data1_label)
  axes[subplot].plot(x_num, data2, label=data2_label)
  axes[subplot].set_title(title)
  axes[subplot].set(xlabel=xLabel, ylabel=yLabel)
  axes[subplot].legend(loc="upper left")
  
def addModelToCache(model, name, fulldataset = True):
  '''Fits a model on the full or sub-sampled training split, then caches CV scores, MAE and timings'''
  smallSet = 300
  # pick the rows used for both fitting and cross validation
  if fulldataset:
    X_fit, y_fit = X_train_subset, y_train_subset
  else:
    X_fit, y_fit = X_train_subset[0:smallSet], y_train_subset[0:smallSet]

  tic = time.perf_counter()
  model.fit(X_fit, y_fit)
  cachePerf['fit'].append(time.perf_counter()-tic)

  # KFold ignores random_state unless shuffle=True (newer sklearn raises an error), so it is omitted
  kfold = model_selection.KFold(n_splits = 5)
  # run cross validation once, on the same rows the model was fit on
  cv_results = model_selection.cross_val_score(model, X_fit, y_fit, cv=kfold, scoring="neg_mean_absolute_error")

  # predictions are always made on the complete train and test splits
  tic = time.perf_counter()
  y_pred_train = model.predict(X_train_subset)
  y_pred_test = model.predict(X_test_subset)
  cachePerf['predict'].append(time.perf_counter()-tic)

  cacheModelNames.append(name)
  cacheModels.append((name, model)) 
  cacheCVResults.append(cv_results) 

  # expm1 maps targets and predictions back to the original loss scale before computing MAE
  tic = time.perf_counter()
  cacheMAE['train'].append(mean_absolute_error(np.expm1(y_train_subset), np.expm1(y_pred_train))) 
  cacheMAE['test'].append(mean_absolute_error(np.expm1(y_test_subset), np.expm1(y_pred_test)))
  cachePerf['metric'].append(time.perf_counter()-tic)

  return

# make the custom scorer for transforming MAE
custom_scorer = make_scorer(transform_mae, greater_is_better=False)


In [60]:
#########################################################
# Sklearn Models and Functions 
#########################################################

from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import RidgeCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

import pickle
import os.path
import warnings
from tabulate import tabulate
warnings.filterwarnings('ignore')

# verbose debugging
debug = False

# we try to pick up cached results, if it doesn't exist, we rerun all models
if (os.path.isfile("/dbfs/FileStore/cacheModels.p")):
  cacheModels = pickle.load( open( "/dbfs/FileStore/cacheModels.p", "rb" ))
  cacheModelNames = pickle.load( open( "/dbfs/FileStore/cacheModelNames.p", "rb" ))
  cacheCVResults = pickle.load( open( "/dbfs/FileStore/cacheCVResults.p", "rb" ))
  cacheMAE = pickle.load( open( "/dbfs/FileStore/cacheMAE.p", "rb" ))
  cachePerf = pickle.load( open( "/dbfs/FileStore/cachePerf.p", "rb" ))
  if (debug == True):
    print("Number of models loaded: " + str(len(cacheModels)))
    print("Number of cross validated results loaded: " + str(len(cacheCVResults) * len(cacheCVResults[0])))
else:
  addModelToCache(LinearRegression(normalize=False), "LinRegression")
  addModelToCache(LinearRegression(normalize=True), "LinRegression (norm)")
  addModelToCache(DecisionTreeRegressor(max_depth=5), "DTRegressor (d-5)")
  addModelToCache(DecisionTreeRegressor(max_depth=7), "DTRegressor (d-7)")
  addModelToCache(DecisionTreeRegressor(max_depth=9), "DTRegressor (d-9)")

  addModelToCache(Ridge(alpha=1.0), "RidgeRegression (a-1)")
  addModelToCache(Ridge(alpha=3.0), "RidgeRegression (a-3)")
  addModelToCache(RidgeCV(alphas=[1, 1e3, 1e6], store_cv_values=True), "RidgeCVRegression (1,e3,e6)")

  addModelToCache(Lasso(alpha=0.05, max_iter=1000), "LassoRegression (a-0.05, i-1000)")
  addModelToCache(Lasso(alpha=0.4, max_iter=1000), "LassoRegression (a-0.4, i-1000)")

  addModelToCache(KNeighborsRegressor(n_neighbors=2), "KNNRegression (n-2) ss", False)
  addModelToCache(KNeighborsRegressor(n_neighbors=3), "KNNRegression (n-3) ss", False)

  addModelToCache(RandomForestRegressor(criterion='mae', n_estimators=40, random_state=0), "RandomForestRegressor (n-40) ss", False)
  addModelToCache(RandomForestRegressor(criterion='mae', n_estimators=60, random_state=0), "RandomForestRegressor (n-60) ss", False)

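  # XGBoost is fit on the full training split (fulldataset defaults to True), so the "ss" (sub-sampled) suffix is dropped
  addModelToCache(XGBRegressor(n_estimators=500, random_state=0), "XGBRegressor (n-500)")
  addModelToCache(XGBRegressor(n_estimators=800, random_state=0), "XGBRegressor (n-800)")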




In [61]:
# Tabular representation of training and test MAE
table_data = []
for count, item in enumerate(cacheMAE['train']):
  table_data.append([cacheModelNames[count], cacheMAE['train'][count], cacheMAE['test'][count]])
print(tabulate(table_data, headers=['Sklearn Model', 'Train MAE', 'Test MAE'], tablefmt="github"))


In [62]:
plotSidebySideBarCharts(cacheMAE['train'], "Train", cacheMAE['test'], "Test", cacheModelNames, 'Train and Test MAE across Sklearn Models')


In [63]:
plotBoxPlotCVCharts("Sklearn regression k-fold cross-validation between models")

##### Train and test MAE across sklearn models
- XGBRegressor has the best results for both the train and test sets, and cross validation also shows smaller variation: low bias and low variance.
- The kitchen-sink baseline linear regression and ridge regression come second, which is a bit surprising because we expected linear regression to overfit. Cross validation shows small variation for both models as well. Both have low bias; ridge should have the lower variance because it is regularized.
- Decision tree comes third in terms of goodness of fit. Higher depth yields better results. Cross validation shows small variation.
- KNN and random forest regressor have large MAE and large variation during cross validation, which is likely due to the sub-sampled data set.

In [65]:
# Tabular representation of fit and predict times
table_data = []
for count, item in enumerate(cacheMAE['train']):
  table_data.append([cacheModelNames[count], cachePerf['fit'][count], cachePerf['predict'][count]])
print(tabulate(table_data, headers=['Sklearn Model', 'Fit Time (s)', 'Predict Time (s)'], tablefmt="github"))

In [66]:
plotSidebySideBarCharts(cachePerf['fit'], "Fit", cachePerf['predict'], "Predict", cacheModelNames, 'Fit and Predict Time across Sklearn Models')

##### Fit and predict times across sklearn models
- XGBRegressor takes the longest to train but takes much less time to predict. It should scale up well.
- KNN doesn't work (times out) in sklearn with the full dataset. Even on a smaller dataset, it takes a long time to predict because it needs to compare every query against the training data. It does not scale up well in sklearn.
- Decision tree, ridge/lasso, linear, and random forest models all take similar time to train and predict.

### Add on PCA in sklearn pipeline

PCA + linear regression (sklearn)

In [70]:
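# PCA (200 components) followed by linear regression, fit on the target scaled with MinMaxScaler
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA

scaled_clf = make_pipeline(PCA(n_components=200), LinearRegression())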
scaled_clf = scaled_clf.fit(X_train, y_train_scaled)

pred_test = np.expm1(scaler.inverse_transform(scaled_clf.predict(X_test)))
pred_train = np.expm1(scaler.inverse_transform(scaled_clf.predict(X_train)))
print(mean_absolute_error(pred_train, np.expm1(y_train)))
print(mean_absolute_error(pred_test, np.expm1(y_test)))


PCA + Lasso (sklearn)

In [72]:
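#try a different target transformation (PowerTransformer defaults to Yeo-Johnson)
from sklearn.preprocessing import PowerTransformer

pt = PowerTransformer()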
y_train_trans = pt.fit_transform(y_train)
y_test_trans = pt.transform(y_test)

In [73]:
from sklearn.model_selection import GridSearchCV

clf = make_pipeline(PCA(n_components=200), 
                    GridSearchCV(Lasso(),
                                 param_grid={'alpha': [0.01,1,5,10]},
                                 cv=5,
                                 refit=True))

clf.fit(X_train, y_train_trans)
y_pred_test = np.expm1(pt.inverse_transform(pd.DataFrame(clf.predict(X_test))))
y_pred_train =np.expm1(pt.inverse_transform(pd.DataFrame(clf.predict(X_train))))
print(round(mean_absolute_error (np.expm1(y_train), y_pred_train), 5))
print(round(mean_absolute_error (np.expm1(y_test), y_pred_test), 5))

PCA + Decision Tree (sklearn)

In [75]:
# PCA (200 components) followed by a decision tree, with max_depth tuned by 10-fold grid search
tree = make_pipeline(PCA(n_components=200), 
                    GridSearchCV(DecisionTreeRegressor(),
                                 param_grid={'max_depth': [5,10,15,20]},
                                 cv=10,
                                 refit=True))

tree.fit(X_train, y_train_trans)
y_pred_test_tree = np.expm1(pt.inverse_transform(pd.DataFrame(tree.predict(X_test))))
y_pred_train_tree =np.expm1(pt.inverse_transform(pd.DataFrame(tree.predict(X_train))))
print(round(mean_absolute_error (np.expm1(y_train), y_pred_train_tree), 5))
print(round(mean_absolute_error (np.expm1(y_test), y_pred_test_tree), 5))

### Different Models Comparison (pyspark.ml)

Pyspark pipeline

We had some scale issues with KNN and RandomForest in sklearn. To draw a comparison between sklearn and pyspark, we ran the same models in pyspark ML.  
Using similar transformations, we one-hot encoded the categorical features with the pyspark OneHotEncoderEstimator and used VectorAssembler to build the feature vector in the pipeline. 

Unlike sklearn, where each step was executed as it was coded, the pyspark implementation was lazy, meaning it only encoded and assembled the features once we asked for a model fit.

In [78]:
#########################################################
# Spark Setup 
#########################################################

from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator, VectorAssembler
from pyspark.ml import Pipeline
from operator import add

import pickle
import os.path
import warnings

warnings.filterwarnings('ignore')

cacheSparkModels = []
cacheSparkModelNames = []
cacheSparkCVResults = []
cacheSparkMAE = {}
cacheSparkMAE['train'] = []
cacheSparkMAE['test'] = []
cacheSparkPerf = {}
cacheSparkPerf['fit'] = []
cacheSparkPerf['predict'] = []
cacheSparkPerf['metric'] = []

def readEncodeTransformData():
  '''Reads and transforms data for spark modeling from csv'''
  
  # Load training data
  # infer data type (cat, double, etc)
  training_spark_raw = spark.read.format("csv")\
    .option("header", "true")\
    .option("inferSchema", "true")\
    .load(finalproject_path + '/train.csv')

  # cat1 .. cat116
  categorical_columns = ["cat{0}".format(i) for i in range(1, 117)]

  # index categorical columns
  strindexers = [
      StringIndexer(inputCol = col_, outputCol = "{0}_indexed".format(col_))
      for col_ in categorical_columns
  ]

  # one-hot encode categorical columns
  encoder = OneHotEncoderEstimator(
      inputCols = [indexer.getOutputCol() for indexer in strindexers],
      outputCols = ["{0}_encoded".format(indexer.getOutputCol()) for indexer in strindexers]
  )

  # get all features (encoded and continuous)
  all_features = encoder.getOutputCols() + ["cont{0}".format(i) for i in range(1, 15)]

  # assemble into a pipeline
  assembler = VectorAssembler(
      inputCols = all_features,
      outputCol = "features"
  )

  # fit and actually do the work
  pipeline = Pipeline(stages=strindexers + [encoder, assembler])
  training_transformed = pipeline.fit(training_spark_raw).transform(training_spark_raw)
  
  return training_transformed

#########################################################
# Spark Initialization and Partition Code
#########################################################

training_transformed = readEncodeTransformData()
training_transformed.take(1)

# spark split into training and test sets 
splits = training_transformed.randomSplit([0.9, 0.1])
training_transformed_df = splits[0]
test_transformed_df = splits[1]


In [79]:
#########################################################
# Spark Helper Functions
#########################################################

def pickle_dump_spark():
  '''Pickles the cached Spark names, MAE and timings. Fitted Spark models cannot be pickled, so they are skipped'''
  # each cache gets its own cacheSpark* file so the sklearn caches are not overwritten
  pickle.dump( cacheSparkModelNames, open( "/dbfs/FileStore/cacheSparkModelNames.p", "wb" ) )
  pickle.dump( cacheSparkMAE, open( "/dbfs/FileStore/cacheSparkMAE.p", "wb" ) )
  pickle.dump( cacheSparkPerf, open( "/dbfs/FileStore/cacheSparkPerf.p", "wb" ) )

  display(dbutils.fs.ls("dbfs:/FileStore"))

def print_summary_raw_spark():
  '''Prints raw summary for spark models'''
  for count, item in enumerate(cacheSparkMAE['train']):
    print(cacheSparkModelNames[count] + " Train - " + str(item))
    print(cacheSparkModelNames[count] + " Test - " + str(cacheSparkMAE['test'][count]))
    print(cacheSparkModelNames[count] + " Time Fit:" + str(round(cacheSparkPerf['fit'][count]/60, 1)) + "min")
    print(cacheSparkModelNames[count] + " Time Predict:" + str(round(cacheSparkPerf['predict'][count]/60, 1)) + "min")
    print(cacheSparkModelNames[count] + " Time Metric:" + str(round(cacheSparkPerf['metric'][count]/60, 1)) + "min")


def addSparkModelToCache(model, name, fulldataset = True):
  '''Runs spark models for a transformed dataset, saves results'''
  # initialize timers
  tic = time.perf_counter()
  lr_model = model.fit(training_transformed_df)
  cacheSparkPerf['fit'].append(time.perf_counter()-tic)
      
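  # transform and predict training and test data
  # (Spark transformations are lazy, so most of the prediction work actually runs inside evaluate() below)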
  tic = time.perf_counter()
  y_train_predictions = lr_model.transform(training_transformed_df)
  y_test_predictions = lr_model.transform(test_transformed_df)
  cacheSparkPerf['predict'].append(time.perf_counter()-tic)

  # initialize regression evaluator
  tic = time.perf_counter()
  dt_evaluator = RegressionEvaluator(
      labelCol="loss", predictionCol="prediction", metricName="mae")
  
  mae_train = dt_evaluator.evaluate(y_train_predictions)
  mae_test = dt_evaluator.evaluate(y_test_predictions)
  # cache all results
  cacheSparkPerf['metric'].append(time.perf_counter()-tic)

  cacheSparkModelNames.append(name)
  cacheSparkModels.append((name, lr_model)) 
  
  cacheSparkMAE['train'].append(mae_train) 
  cacheSparkMAE['test'].append(mae_test)
                          
  print(name + " Train:" + str(mae_train))
  print(name + " Test:" + str(mae_test))
  
  return
  
def addPCAResultsFromAnotherNoteBook():
  '''adds PCA results to cached results'''
  cacheSparkMAE['train'].append(1418.92)
  cacheSparkMAE['test'].append(1432.34)
  cacheSparkPerf['fit'].append(1662.6)
  cacheSparkPerf['metric'].append(367.2)
  cacheSparkPerf['predict'].append(6.3)
  cacheSparkModelNames.append("Spk DecisionTreeRegressor + PCA")
  cacheResultsModels()
  
def cacheResultsModels():
  '''pickles cached results'''
  pickle.dump( cacheSparkModelNames, open( "/dbfs/FileStore/cacheSparkModelNames.p", "wb" ) )
  pickle.dump( cacheSparkMAE, open( "/dbfs/FileStore/cacheSparkMAE.p", "wb" ) )
  pickle.dump( cacheSparkPerf, open( "/dbfs/FileStore/cacheSparkPerf.p", "wb" ) )


In [80]:
#########################################################
# Spark Models and Cache
#########################################################

from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.regression import GBTRegressor

if (os.path.isfile("/dbfs/FileStore/cacheSparkPerf.p")):
  cacheSparkModelNames = pickle.load( open( "/dbfs/FileStore/cacheSparkModelNames.p", "rb" ))
  #cacheSparkCVResults = pickle.load( open( "/dbfs/FileStore/cacheSparkCVResults.p", "rb" ))
  cacheSparkMAE = pickle.load( open( "/dbfs/FileStore/cacheSparkMAE.p", "rb" ))
  cacheSparkPerf = pickle.load( open( "/dbfs/FileStore/cacheSparkPerf.p", "rb" ))
  print("Number of models loaded: " + str(len(cacheSparkModelNames)))
else:
  addSparkModelToCache(LinearRegression(featuresCol = 'features', labelCol='loss', regParam=0.0), "Spk LinRegression")

  addSparkModelToCache(LinearRegression(featuresCol = 'features', labelCol='loss', regParam=1.0, elasticNetParam = 0), "Spk RidgeRegression (a-1)")
  addSparkModelToCache(LinearRegression(featuresCol = 'features', labelCol='loss', regParam=3.0, elasticNetParam = 0), "Spk RidgeRegression (a-3)")
  addSparkModelToCache(LinearRegression(featuresCol = 'features', labelCol='loss', regParam=0.05, elasticNetParam = 1), "Spk LassoRegression  (a-0.05)")
  addSparkModelToCache(LinearRegression(featuresCol = 'features', labelCol='loss', regParam=0.4, elasticNetParam = 1), "Spk LassoRegression  (a-0.4)")
  addSparkModelToCache(DecisionTreeRegressor(featuresCol = 'features', labelCol='loss', maxDepth=5), "Spk DecisionTreeRegressor  (d-5)")

  addSparkModelToCache(DecisionTreeRegressor(featuresCol = 'features', labelCol='loss', maxDepth=7), "Spk DecisionTreeRegressor (d-7)")
  addSparkModelToCache(DecisionTreeRegressor(featuresCol = 'features', labelCol='loss', maxDepth=9), "Spk DecisionTreeRegressor (d-9)")
  addSparkModelToCache(RandomForestRegressor(featuresCol = 'features', labelCol='loss', numTrees=40), "Spk RandomForestRegressor (n-40)")
  addSparkModelToCache(RandomForestRegressor(featuresCol = 'features', labelCol='loss', numTrees=60), "Spk RandomForestRegressor (n-60)")
  addSparkModelToCache(GBTRegressor(featuresCol = 'features', labelCol='loss', maxIter=10), "Spk GBTRegressor")
  cacheResultsModels()
    

In [81]:
table_data = []
for count, item in enumerate(cacheSparkMAE['train']):
  table_data.append([cacheSparkModelNames[count], cacheSparkMAE['train'][count], cacheSparkMAE['test'][count]])
print(tabulate(table_data, headers=['Spark Model', 'Train MAE', 'Test MAE'], tablefmt="github"))


In [82]:
plotSidebySideBarCharts(cacheSparkMAE['train'], "Train", cacheSparkMAE['test'], "Test", cacheSparkModelNames, 'Train and Test MAE across pySpark Models')

##### Train and test MAE across spark models

From the results, the train and test MAE were not drastically different from sklearn, which is expected since these are the same model types. Overall, the test MAE was usually a little higher than the training MAE. The MAE was similar across linear, ridge and lasso regression. Interestingly, the decision tree improved incrementally as its maximum depth increased, which suggests there is room for tuning.


Some other observations:
- Lasso regression has the smallest MAE on the test set, but the advantage is very small. Ridge and the baseline yield similar results.
- Gradient boosted trees come in third.
- Other tree models did not yield great results, but decision tree did better as the depth increased for both training and test.
- PCA does not seem to have a large effect either, although it increased MAE as expected.

In [84]:
table_data = []
for count, item in enumerate(cacheSparkMAE['train']):
  table_data.append([cacheSparkModelNames[count], cacheSparkPerf['fit'][count], cacheSparkPerf['metric'][count]+cacheSparkPerf['predict'][count]])
print(tabulate(table_data, headers=['Spark Model', 'Fit Time (s)', 'Predict + Metric Time (s)'], tablefmt="github"))

In [85]:
cacheSparkPerf['predict_metric'] = list( map(add, cacheSparkPerf['predict'], cacheSparkPerf['metric']) )

plotSidebySideBarCharts(cacheSparkPerf['fit'], "Fit", cacheSparkPerf['predict_metric'], "Predict", cacheSparkModelNames, 'Fit and Predict Time across Spark Models')

##### Fit and predict times across spark models
- This chart shows the fit and predict times for our pyspark models. 
- The first observation is that, whatever the model, most of the training (fit) times are quite similar, which was not true for sklearn (some models took on the order of seconds, whereas others did not run in sklearn with the full dataset on Databricks). 
- PCA + decision tree takes the longest to train but not the longest to predict. It has to go through fitting twice, once for PCA and once for the decision tree, so understandably it takes longer to complete.

In [87]:
side_by_side_labels = ['LinRegression', 'RidgeRegression (a-1)', 'RidgeRegression (a-3)', 'LassoRegression', 'DecisionTreeRegressor (d-5)', 'DecisionTreeRegressor (d-7)', 'DecisionTreeRegressor (d-9)']
side_by_side_values_sk_fit = [cachePerf['fit'][0], cachePerf['fit'][5], cachePerf['fit'][6], cachePerf['fit'][8], cachePerf['fit'][2], cachePerf['fit'][3], cachePerf['fit'][4]]
side_by_side_values_spk_fit = [cacheSparkPerf['fit'][0], cacheSparkPerf['fit'][1], cacheSparkPerf['fit'][2], cacheSparkPerf['fit'][3], cacheSparkPerf['fit'][5], cacheSparkPerf['fit'][6], cacheSparkPerf['fit'][7]]

side_by_side_values_sk_predict = [cachePerf['predict'][0], cachePerf['predict'][5], cachePerf['predict'][6], cachePerf['predict'][8], cachePerf['predict'][2], cachePerf['predict'][3], cachePerf['predict'][4]]
side_by_side_values_spk_predict = [cacheSparkPerf['predict_metric'][0], cacheSparkPerf['predict_metric'][1], cacheSparkPerf['predict_metric'][2], cacheSparkPerf['predict_metric'][3], cacheSparkPerf['predict_metric'][5], cacheSparkPerf['predict_metric'][6], cacheSparkPerf['predict_metric'][7]]

plotSidebySideBarCharts(side_by_side_values_sk_fit, "Sklearn Fit Time", side_by_side_values_spk_fit, "Spark Fit Time", side_by_side_labels, 'Sklearn and Spark Fit Time Comparison')
plotSidebySideBarCharts(side_by_side_values_sk_predict, "Sklearn Predict Time", side_by_side_values_spk_predict, "Spark Predict Time", side_by_side_labels, 'Sklearn and Spark Predict Time Comparison')

##### Fit and predict times between spark and sklearn models
- Spark models take much longer to train and predict across the board for linear, ridge, lasso and decision tree regressors. The dataset is small, so Spark has no advantage here: its fixed overhead dominates, and its benefits only show up at larger scale.

#### On Random Forests and Nearest Neighbors

In [90]:
side_by_side_labels = ['RandomForestRegressor (n-40)', 'RandomForestRegressor (n-60)']
side_by_side_labels_num = [40, 60]
side_by_side_values_sk_fit = [cachePerf['fit'][12], cachePerf['fit'][13]]
side_by_side_values_spk_fit = [cacheSparkPerf['fit'][8], cacheSparkPerf['fit'][9]]

side_by_side_values_sk_predict = [cachePerf['predict'][12], cachePerf['predict'][13]]
side_by_side_values_spk_predict = [cacheSparkPerf['predict_metric'][8], cacheSparkPerf['predict_metric'][9]]

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 5))
plt.tight_layout()
plotLineChart(side_by_side_values_sk_fit, "Sklearn Fit Time", side_by_side_values_sk_predict, "Sklearn Predict Time", side_by_side_labels_num, 'Number of Estimators', 'Execution Time', "RandomForest Sklearn Fit + Predict Time vs. Number of Estimators", axes, fig, 0)

plotLineChart(side_by_side_values_spk_fit, "Spark Fit Time", side_by_side_values_spk_predict, "Spark Predict Time", side_by_side_labels_num, 'Number of Estimators', 'Execution Time', "RandomForest Spark Fit + Predict Time vs. Number of Estimators", axes, fig, 1)
display(fig.tight_layout())

##### Random forests demonstrate spark scalability

For the random forest runs, we were not able to run the full dataset in sklearn, but we were able to run it in Spark. Even on the smaller dataset that the random forest was run on, sklearn's execution time increased steadily as we scaled up the number of estimators. **Spark, on the other hand, showed no such increase in execution time, likely because tree generation is parallelized in the background.**

### Preferred Algorithm

#### Decision Trees

Given that the Allstate claims dataset has 116 categorical variables along with 14 continuous variables, our consensus was to start with a decision tree model as our preferred algorithm.

**Why**

Decision trees easily accommodate both discrete and continuous variables, and they can capture non-linear interactions between the features and the label ("loss").
They are also easy for people to understand, because the decision-making process can be represented graphically. For a model that would likely be turned into an algorithm or service that decision makers at Allstate use to predict claims loss, it's important for them to know how the model operates and to be able to update individual portions of it as needed.
Decision trees can offer that modularity without rerunning the entire model, and they easily accommodate new inputs, which are likely in future iterations of this dataset.

Here's a small example of six data points. Although the variables in the actual dataset are all masked, we think fields like these are likely included.


|id |Age group |City/Rural |Income |Claims loss|
|---|---|---|---|---|
|X1 |'A(20-40)' |'C' |0.75 |2213.18|
|X2 |'B(40-60)' |'R' |0.25 |1283.6|
|X3 |'A(20-40)' |'R' |0.25 |1132.22|
|X4 |'B(40-60)' |'C' |0.60 |5142.87|
|X5 |'B(40-60)' |'C' |0.52 |2142.87|
|X6 |'C(60-80)' |'C' |0.40 |3005.09|

As in our feature engineering, discrete fields are transformed via one-hot encoding into numeric values between 0 and 1. PCA is then applied for dimension reduction, because the number of input features determines the complexity of the tree: it should be neither so small that the tree is too simple nor so large that the tree overfits.
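To make the encoding step concrete, here is a minimal pure-Python sketch of a strict one-hot encoding of the two categorical fields in the toy example. The field names (`age_group`, `city_rural`) and the `one_hot` helper are illustrative assumptions, not the notebook's pipeline code. A strict one-hot encoding produces one 0/1 indicator column per category, whereas the illustrative table in this section shows each field condensed to a single numeric column.

```python
# Toy categorical fields from the six-row example (field names are illustrative).
rows = [
    {"id": "X1", "age_group": "A", "city_rural": "C"},
    {"id": "X2", "age_group": "B", "city_rural": "R"},
    {"id": "X3", "age_group": "A", "city_rural": "R"},
    {"id": "X4", "age_group": "B", "city_rural": "C"},
    {"id": "X5", "age_group": "B", "city_rural": "C"},
    {"id": "X6", "age_group": "C", "city_rural": "C"},
]


def one_hot(rows, field):
    """Return (categories, vectors): one 0/1 indicator per category of `field`."""
    categories = sorted({row[field] for row in rows})
    vectors = [[1 if row[field] == c else 0 for c in categories] for row in rows]
    return categories, vectors


age_categories, age_encoded = one_hot(rows, "age_group")
city_categories, city_encoded = one_hot(rows, "city_rural")

# Concatenate the indicator columns of both fields for each row.
for row, age_vec, city_vec in zip(rows, age_encoded, city_encoded):
    print(row["id"], age_vec + city_vec)
```

With 116 categorical variables, this expansion is what makes the feature space wide enough for dimension reduction to be worth considering.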
Now the transformed data points will look like this. 

|id |Age group |City/Rural |Income |Claims loss|
|---|---|---|---|---|
|X1 |0.25 |1 |0.75 |2213.18|
|X2 |0.5 |0 |0.25 |1283.6|
|X3 |0.25 |0 |0.25 |1132.22|
|X4 |0.5 |1 |0.60 |5142.87|
|X5 |0.5 |1 |0.52 |2142.87|
|X6 |0.75 |1 |0.40 |3005.09|

* assuming X2 and X3 are grouped into one category after PCA

This gives us the decision tree flow chart below.

![Decision Tree Flow Chart](https://s3-us-west-2.amazonaws.com/sophiaxcui.com/images/image1.png)
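For intuition on how a regression tree might choose a single split, here is a minimal sketch that searches for the Income threshold minimizing the summed squared error of the two child nodes on the toy data. It assumes variance (squared error) as the impurity measure and considers only one feature; the `sse` and `best_split` helpers are hypothetical, and the result is not meant to reproduce the splits drawn in the flow chart above.

```python
# Toy rows from the example: (income, claims loss).
rows = [
    (0.75, 2213.18), (0.25, 1283.60), (0.25, 1132.22),
    (0.60, 5142.87), (0.52, 2142.87), (0.40, 3005.09),
]


def sse(values):
    """Sum of squared deviations from the mean (n times the variance)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(rows):
    """Return (threshold, cost) minimizing the children's summed SSE."""
    xs = sorted({x for x, _ in rows})
    best = None
    # Candidate thresholds are midpoints between adjacent distinct values.
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in rows if x <= threshold]
        right = [y for x, y in rows if x > threshold]
        cost = sse(left) + sse(right)
        if best is None or cost < best[1]:
            best = (threshold, cost)
    return best


threshold, cost = best_split(rows)
print(f"best Income threshold: {threshold:.3f}")
```

A full tree repeats this search over every feature and then recurses into each child until a stopping rule such as maximum depth is reached.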

We define the loss function L(y, ŷ) as the Mean Absolute Error (MAE): the average absolute difference between each actual "loss" value and its prediction. MAE is appropriate for Allstate insurance claims because all the individual "loss" differences are weighted equally in the average. (Hence, the models with the lowest MAE will be the preferred ones in the bias-variance trade-off.)
The calculation of MAE is
$$
\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left| y_i - \hat{y}_i \right|
$$
where, for a regression tree, the prediction ŷ_i is the mean μ of the "loss" values in the leaf that contains data point i.
The small example therefore yields an MAE of 525.23:

|id |Age group |City/Rural |Income |Claims loss (A) |Prediction (B) |Absolute error, abs(A - B)|
|---|---|---|---|---|---|---|
|X1 |0.25 |1 |0.75 |2213.18 |2213.18 |0 |
|X2 |0.5 |0 |0.25 |1283.6 |1207.91 |75.69 |
|X3 |0.25 |0 |0.25 |1132.22 |1207.91 |75.69 |
|X4 |0.5 |1 |0.60 |5142.87 |3642.87 |1500 |
|X5 |0.5 |1 |0.52 |2142.87 |3642.87 |1500 |
|X6 |0.75 |1 |0.40 |3005.09 |3005.09 |0 |

$$
\text{MAE of decision tree} = \frac{0 + 75.69 + 75.69 + 1500 + 1500 + 0}{6} = \frac{3151.38}{6} = 525.23
$$
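The arithmetic above can be checked with a short sketch. The leaf labels below are hypothetical names for the groupings implied by the Prediction column (X2 with X3, X4 with X5, and X1 and X6 each alone); each leaf predicts the mean loss of its members.

```python
from collections import defaultdict

# (id, leaf, actual claims loss); leaf labels are illustrative only.
data = [
    ("X1", "leaf_a", 2213.18),
    ("X2", "leaf_b", 1283.60),
    ("X3", "leaf_b", 1132.22),
    ("X4", "leaf_c", 5142.87),
    ("X5", "leaf_c", 2142.87),
    ("X6", "leaf_d", 3005.09),
]

# Each leaf predicts the mean loss of the rows that fall into it.
leaf_values = defaultdict(list)
for _, leaf, loss in data:
    leaf_values[leaf].append(loss)
leaf_mean = {leaf: sum(v) / len(v) for leaf, v in leaf_values.items()}

# MAE is the average absolute gap between actual loss and leaf prediction.
abs_errors = [abs(loss - leaf_mean[leaf]) for _, leaf, loss in data]
mae = sum(abs_errors) / len(abs_errors)

print(f"MAE = {mae:.2f}")
```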

### Conclusion

The Allstate dataset is unusual in that it is just small enough to run in sklearn but large enough that pyspark could make a difference in execution time.
Hence, we explored the Allstate dataset through two different methodologies, running a dozen models through sklearn and those same models through the pyspark ML libraries.

**Some salient insights include:**
- Both pipelines yielded similar evaluation metrics (ours was MAE)
- XGBoostRegressor consistently had the lowest MAE of all models by a wide margin, and would be our choice if MAE were the only consideration (as on Kaggle)
- Because many of the inputs are correlated, PCA helped reduce the number of inputs, but it did not improve MAE
- For less computationally expensive models (linear regression, ridge, lasso), sklearn is indeed faster by orders of magnitude
- For more computationally heavy models (KNN, RandomForest), Spark scales significantly better, especially with respect to parameter tuning
- For our preferred model, Spark Decision Tree Regression, computation time in sklearn increased with the depth of the tree, but this was not observed in pyspark

**Real world implications and future extensions:**
The Allstate dataset has many anonymized inputs, and because claims are shaped by many different factors, future versions of this dataset could include orders of magnitude more inputs. We can imagine an IoT world where a very large amount of device or personal data is fed in to create a more complete picture of any given claim. In this sense, a black box solution is not always best, especially given the sensitive nature of judging a claim by its cover (e.g. demographic data like race, gender, income, etc.). In addition, implicit social biases introduced into a black box model would be very hard to detect or fix until it's too late. See [Amazon scraps internal AI recruiting tool that was biased against women](https://www.theverge.com/2018/10/10/17958784/ai-recruiting-tool-bias-amazon-report).

Our preferred model, pyspark decision tree regression, performs in the middle of the pack of regressors in terms of MAE (our metric for model accuracy). However, it is very easy to understand and interpret, and easy to modify to accommodate new inputs. Above all, the inherent structure of a decision tree aligns well with decisions made by humans, and it can be adjusted manually should the model become a source of contention over bias or lawsuits. Moreover, it is scalable and can be modularized to fine-tune decisions for a subset of the data or for additional data.