## Kaggle competition: [Allstate Claims Severity](https://www.kaggle.com/c/allstate-claims-severity)
Introduction:
> When you’ve been devastated by a serious car accident, your focus is on the things that matter the most: family, friends, and other loved ones. Pushing paper with your insurance agent is the last place you want your time or mental energy spent. This is why [Allstate](https://www.allstate.com/), a personal insurer in the United States, is continually seeking fresh ideas to improve their claims service for the over 16 million households they protect.

> Allstate is currently developing automated methods of predicting the cost, and hence severity, of claims. In this recruitment challenge, Kagglers are invited to show off their creativity and flex their technical chops by creating an algorithm which accurately predicts claims severity. Aspiring competitors will demonstrate insight into better ways to predict claims severity for the chance to be part of Allstate’s efforts to ensure a worry-free customer experience.

In [1]:
import pandas as pd
import numpy as np

dataset = pd.read_csv("./train.csv")
test = pd.read_csv("./test.csv")

print(dataset.shape)
print(test.shape)

dataset.head(5)

(188318, 132)
(125546, 131)


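|   | id | cat1 | cat2 | cat3 | cat4 | cat5 | cat6 | cat7 | cat8 | cat9 | ... | cont6 | cont7 | cont8 | cont9 | cont10 | cont11 | cont12 | cont13 | cont14 | loss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | A | B | A | B | A | A | A | A | B | ... | 0.718367 | 0.33506 | 0.3026 | 0.67135 | 0.8351 | 0.569745 | 0.594646 | 0.822493 | 0.714843 | 2213.18 |
| 1 | 2 | A | B | A | A | A | A | A | A | B | ... | 0.438917 | 0.436585 | 0.60087 | 0.35127 | 0.43919 | 0.338312 | 0.366307 | 0.611431 | 0.304496 | 1283.6 |
| 2 | 5 | A | B | A | A | B | A | A | A | B | ... | 0.289648 | 0.315545 | 0.2732 | 0.26076 | 0.32446 | 0.381398 | 0.373424 | 0.195709 | 0.774425 | 3005.09 |
| 3 | 10 | B | B | A | B | A | A | A | A | B | ... | 0.440945 | 0.391128 | 0.31796 | 0.32128 | 0.44467 | 0.327915 | 0.32157 | 0.605077 | 0.602642 | 939.85 |
| 4 | 11 | A | B | A | B | A | A | A | A | B | ... | 0.178193 | 0.247408 | 0.24564 | 0.22089 | 0.2123 | 0.204687 | 0.202213 | 0.246011 | 0.432606 | 2763.85 |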


### Create dummy variables for both training and testing data 

In [2]:
loss = dataset.iloc[:, 131]

id_train = dataset['id']
id_test = test['id']
n1 = len(dataset.index)
n2 = len(test.index)

# Some categorical columns have labels that appear in only one of the two sets,
# so we combine train and test, create the dummy variables, and then split them again
dataset = dataset.iloc[:,1:131]
test    = test.iloc[:, 1:131]
dfnew = pd.concat([dataset, test], ignore_index=True)
print(dfnew.shape)
n3 = len(dfnew.index)
print(n1, n2, n3)

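# dummy variables from the training set alone (for comparison only)
df0 = pd.get_dummies(dataset.iloc[:, 0:116])
print(df0.shape)
# columns 0-115 are categorical, columns 116-129 are continuous
df1 = pd.get_dummies(dfnew.iloc[:, 0:116])
df2 = dfnew.iloc[:, 116:130]
print(df1.shape)
print(df2.shape)
# the combined data yields more dummy variables (1176) than the training data alone (1139)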

dfnew = pd.concat([df1, df2], axis = 1)
print(dfnew.shape)

train = dfnew.iloc[0:n1, :]
test  = dfnew.iloc[n1:n3, :]
print(train.shape)
print(test.shape)

(313864, 130)
188318 125546 313864
(188318, 1139)
(313864, 1176)
(313864, 14)
(313864, 1190)
(188318, 1190)
(125546, 1190)
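
The reason for encoding train and test together can be seen on a toy example. The sketch below uses two made-up frames (`toy_train` and `toy_test`, which are not part of the competition data) in which the test frame contains a label that never occurs in training. Encoding them separately gives mismatched columns, while encoding the concatenation gives one shared column set.

```python
import pandas as pd

# Hypothetical toy data: label "C" appears only in the test frame.
toy_train = pd.DataFrame({"cat1": ["A", "B", "A"]})
toy_test = pd.DataFrame({"cat1": ["A", "C"]})

# Encoding each frame separately yields different column sets.
sep_train = pd.get_dummies(toy_train)
sep_test = pd.get_dummies(toy_test)
print(list(sep_train.columns))  # ['cat1_A', 'cat1_B']
print(list(sep_test.columns))   # ['cat1_A', 'cat1_C']

# Encoding the concatenation yields one shared column set; then split back.
combined = pd.concat([toy_train, toy_test], ignore_index=True)
dummies = pd.get_dummies(combined)
enc_train = dummies.iloc[:len(toy_train), :]
enc_test = dummies.iloc[len(toy_train):, :]
print(list(enc_train.columns))          # ['cat1_A', 'cat1_B', 'cat1_C']
print(enc_train.shape, enc_test.shape)  # (3, 3) (2, 3)
```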


### Prepare data for model building

In [3]:
# pandas dataframe to numpy array
X = train.values
Y = loss.values
# model log(1 + loss); predictions are mapped back later with np.expm1
Y = np.log1p(Y)

from sklearn.model_selection import train_test_split

# hold out 10% of the training data as a validation set
val_size = 0.1
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=val_size, random_state=0)
print(X_train.shape)
print(X_test.shape)

(169486, 1190)
(18832, 1190)
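
Because the target is trained on the `log1p` scale, predictions have to be mapped back with `np.expm1` before the mean absolute error is computed in the original units. A minimal sketch with a few loss values from the table above; `pred_log` is a made-up stand-in for model output, not a real prediction:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# A few loss values in original units.
y_true = np.array([2213.18, 1283.60, 3005.09])

# log1p(y) = log(1 + y); expm1 is its inverse.
y_log = np.log1p(y_true)
roundtrip_ok = np.allclose(np.expm1(y_log), y_true)

# Hypothetical predictions on the log scale (illustrative offsets only).
pred_log = y_log + np.array([0.1, -0.1, 0.0])

# The error is measured after transforming back to the original scale.
mae_original = mean_absolute_error(y_true, np.expm1(pred_log))
mae_log = mean_absolute_error(y_log, pred_log)
print(roundtrip_ok, mae_original, mae_log)
```

A small error on the log scale can correspond to a large error in original units, which is why the cells in this notebook report the MAE after applying `np.expm1`.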


### Test the performance of a Lasso regression model

In [5]:
import warnings
warnings.filterwarnings('ignore')

from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import Lasso

for alpha in [0.001, 0.002, 0.005, 0.01]:
    model = Lasso(alpha=alpha, random_state=0, max_iter=500)
    model.fit(X_train, Y_train)
    Y_pred = model.predict(X_test)
    # report MAE on the original loss scale
    print(alpha, mean_absolute_error(np.expm1(Y_test), np.expm1(Y_pred)))

0.001 1262.53144301
0.002 1266.5983432
0.005 1289.94823958
0.01 1344.3820347


### Then test the popular Extreme Gradient Boosting (XGBoost) model

In [None]:
from sklearn.metrics import mean_absolute_error
from xgboost import XGBRegressor

for n_estimators in [100, 200, 500, 1000]:
    model = XGBRegressor(n_estimators=n_estimators, random_state=0)
    model.fit(X_train, Y_train)
    Y_pred = model.predict(X_test)
    # report MAE on the original loss scale
    print(n_estimators, mean_absolute_error(np.expm1(Y_test), np.expm1(Y_pred)))

100 1219.71291158
200 1193.30103586
500 1173.67955071
1000 1168.79391021
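
The loop above refits the model from scratch for each value of `n_estimators`. Because boosting adds trees sequentially, a single fit with the largest tree count in principle already contains the smaller models as prefixes. The sketch below illustrates the idea on synthetic data, using scikit-learn's `GradientBoostingRegressor` and its `staged_predict` method as a stand-in. It is not the XGBoost API used in this notebook, and the toy data and variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem (illustrative only).
rng = np.random.RandomState(0)
X_toy = rng.uniform(size=(500, 5))
y_toy = np.sin(6 * X_toy[:, 0]) + X_toy[:, 1] ** 2 + rng.normal(scale=0.1, size=500)
X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# Fit once with the largest number of trees.
gbr = GradientBoostingRegressor(n_estimators=200, random_state=0)
gbr.fit(X_tr, y_tr)

# staged_predict yields the prediction after 1, 2, ..., n_estimators trees,
# so several tree counts can be evaluated from the single fit.
checkpoints = {50, 100, 200}
maes = {}
for n_trees, y_pred in enumerate(gbr.staged_predict(X_va), start=1):
    if n_trees in checkpoints:
        maes[n_trees] = mean_absolute_error(y_va, y_pred)
        print(n_trees, maes[n_trees])
```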


### Let's see if we can remove some variables to improve performance

In [10]:
from sklearn.feature_selection import SelectPercentile, f_regression

# use univariate F-regression to keep different percentages of the variables
for pt in [50, 75, 95]:
    selector = SelectPercentile(f_regression, percentile=pt)
    selector.fit(X_train, Y_train)

    X_t1 = selector.transform(X_train)
    X_t2 = selector.transform(X_test)
    print(X_t1.shape, X_t2.shape)

    model = XGBRegressor(n_estimators=1000, random_state=0)
    model.fit(X_t1, Y_train)
    Y_pred = model.predict(X_t2)
    print(str(pt) + "%", mean_absolute_error(np.expm1(Y_test), np.expm1(Y_pred)))

(169486, 595) (18832, 595)
50% 1168.59042176
(169486, 892) (18832, 892)
75% 1166.80152502
(169486, 1130) (18832, 1130)
95% 1169.96463204
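
To make the selection step concrete, here is a minimal sketch of `SelectPercentile` with `f_regression` on synthetic data. The arrays `X_toy` and `y_toy` are assumptions made for illustration: only the first two of twenty features drive the target, and the selector is asked to keep the top 10% of features by univariate F-score.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_regression

rng = np.random.RandomState(0)

# Synthetic data: 200 samples, 20 features; only features 0 and 1 matter.
X_toy = rng.normal(size=(200, 20))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)

# Keep the 10% of features with the highest univariate F-scores.
toy_selector = SelectPercentile(f_regression, percentile=10)
X_sel = toy_selector.fit_transform(X_toy, y_toy)

# Indices of the columns that survived the selection.
kept = toy_selector.get_support(indices=True)
print(X_sel.shape)  # (200, 2)
print(kept)
```

Note that `f_regression` scores each feature on its own, so it cannot account for interactions between features; this may be one reason the validation MAE above changes by only a few units across percentiles.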


### Finally, we generate predictions for the test data

In [9]:
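# refit the selector and the model on the full training data
selector = SelectPercentile(f_regression, percentile=75)
selector.fit(X, Y)
X_t1 = selector.transform(X)
X_t2 = selector.transform(test.values)

# NOTE: only 10 trees are used here, far fewer than the 1000 evaluated above,
# which likely explains why the predicted losses below are so low
n_estimators = 10
model = XGBRegressor(n_estimators=n_estimators, random_state=0)
model.fit(X_t1, Y)

# predict on the log scale, then map back to the original loss scale
y_test = model.predict(X_t2)
y_test = np.expm1(y_test)

df = pd.concat([id_test, pd.Series(y_test)], axis=1)
df.columns = ['id', 'loss']

print(df.shape)
print(df.head(5))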

df.to_csv('submission3.csv', index = False)

(125546, 2)
   id        loss
0   4  148.563919
1   6  148.563919
2   9  328.922943
3  12  307.540894
4  15  133.955414
