## [Allstate Claims Severity](https://goo.gl/1DwHVy) -- Predictions using machine learning:

### Author: Dr. Rahul Remanan, CEO and Chief Imagination Officer, [Moad Computer](https://www.moad.computer)

The [Allstate Corporation](https://en.wikipedia.org/wiki/Allstate) is one of the largest insurance providers in the United States and one of the largest that is publicly held. The company also has personal lines insurance operations in Canada. Allstate was founded in 1931 as part of Sears, Roebuck and Co., and was spun off in 1993.[1](https://goo.gl/ce2JJ2) The company has had its headquarters in Northfield Township, Illinois, near Northbrook, since 1967.[2](https://goo.gl/oX4kfZ),[3](https://goo.gl/mcTd3y)

As part of Allstate's ongoing efforts to develop automated methods of predicting the cost, and hence severity, of claims, the company released a claims severity dataset on Kaggle.[4](https://goo.gl/1DwHVy) In this challenge, data scientists were invited to show off their creativity and flex their technical chops by creating an algorithm that accurately predicts claims severity. The goal of the challenge was to let aspiring competitors demonstrate their insight into better ways of predicting claims severity.

In this notebook, we use this dataset to build a prediction model with the extreme gradient boosting (XGBoost) technique.

## Part 03 -- Using [Extreme Gradient Boosting](https://en.wikipedia.org/wiki/Xgboost)

This script is a modified fork from [Kaggle](https://www.kaggle.com/mmueller/allstate-claims-severity/yet-another-xgb-starter/code).
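
For orientation, the core idea of gradient boosting can be sketched in a few lines of base R. This is a simplified illustration under stated assumptions, not XGBoost's actual algorithm (which adds regularization, second-order gradient information, and column/row subsampling). The toy data and the helper names `fit_stump` and `predict_stump` are hypothetical and do not appear in the script below.

```r
# Toy 1-D regression data (hypothetical, not from the Allstate set).
x <- seq(0, 1, length.out = 50)
y <- sin(2 * pi * x)

# Fit a one-split regression stump to residuals r by least squares.
fit_stump <- function(x, r) {
  best <- list(sse = Inf)
  for (s in head(sort(unique(x)), -1)) {   # candidate thresholds
    left  <- r[x <= s]
    right <- r[x > s]
    sse <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
    if (sse < best$sse) {
      best <- list(sse = sse, split = s,
                   left = mean(left), right = mean(right))
    }
  }
  best
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$split, stump$left, stump$right)
}

eta  <- 0.3                        # learning rate (shrinkage)
pred <- rep(mean(y), length(y))    # constant starting prediction
mse_start <- mean((y - pred)^2)

for (i in 1:50) {
  # For squared error, the residuals are the negative gradient
  # (up to a constant factor), so each stump is fit to them.
  stump <- fit_stump(x, y - pred)
  pred  <- pred + eta * predict_stump(stump, x)
}
mse_end <- mean((y - pred)^2)
```

Each round adds a small, shrunken correction to the running prediction, which is the role `eta` and `nrounds` play in the XGBoost calls later in this notebook.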

In [1]:
library(data.table)
library(Matrix)
library(xgboost)
library(Metrics)

In [2]:
ID = 'id'
TARGET = 'loss'
SEED = 0

In [3]:
TRAIN_FILE = "./data/train.csv"
TEST_FILE = "./data/test.csv"
SUBMISSION_FILE = "./data/sample_submission.csv"

In [4]:
train = fread(TRAIN_FILE, showProgress = TRUE)
test = fread(TEST_FILE, showProgress = TRUE)

In [5]:
y_train = log(train[,TARGET, with = FALSE])[[TARGET]]
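
The cell above trains on the logarithm of the loss rather than the raw value. A minimal sketch of what that transform does, using made-up numbers rather than Allstate data: values spread over several orders of magnitude become evenly spaced on the log scale, and `exp` recovers the original values.

```r
# Hypothetical loss values spanning several orders of magnitude.
loss <- c(100, 1000, 10000, 100000)

log_loss  <- log(loss)       # compressed scale used for training
recovered <- exp(log_loss)   # inverse transform used at prediction time

# Equal ratios on the raw scale become equal steps on the log scale.
steps <- diff(log_loss)
```

This is why the predictions are passed through `exp` before being written to the submission file.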

In [6]:
train[, c(ID, TARGET) := NULL]
test[, c(ID) := NULL]

In [7]:
ntrain = nrow(train)
train_test = rbindlist(list(train, test), use.names = TRUE)

In [8]:
train_test[, c("cat12_cat80", "cat79_cat80", "cat57_cat80", "cat101_cat79", "cat57_cat79", "cat101_cat80"):=list(
  paste(cat12, cat80, sep="_"),
  paste(cat79, cat80, sep="_"),
  paste(cat57, cat80, sep="_"),
  paste(cat101, cat79, sep="_"),
  paste(cat57, cat79, sep="_"),
  paste(cat101, cat80, sep="_")
)]

In [9]:
train_test[, c("cat101_cat79_cat81", "cat57_cat79_cat80", "cat103_cat12_cat80"):=list(
  paste(cat101, cat79, cat81, sep="_"),
  paste(cat57, cat79, cat80, sep="_"),
  paste(cat103, cat12, cat80, sep="_")
)]
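
The two cells above build interaction features by pasting categorical columns together, so that each distinct combination of levels becomes its own category. A small base R sketch of the same idea, assuming a hypothetical three-row data frame in place of `train_test`:

```r
# Hypothetical toy frame; the column names mirror the dataset, the values do not.
toy <- data.frame(
  cat12 = c("A", "B", "A"),
  cat80 = c("D", "D", "B"),
  stringsAsFactors = FALSE
)

# Each distinct (cat12, cat80) pair becomes one level of the new feature.
toy$cat12_cat80 <- paste(toy$cat12, toy$cat80, sep = "_")
print(toy$cat12_cat80)
```

Which pairs and triples are worth combining is a modeling choice; the specific combinations used above are inherited from the forked script.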

In [10]:
features = names(train_test)

In [11]:
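# Encode every character column as integer codes, numbered in order of first appearance.
for (f in features) {
  if (is.character(train_test[[f]])) {
    lvls <- unique(train_test[[f]])
    # set() updates the data.table column by reference
    set(train_test, j = f, value = as.integer(factor(train_test[[f]], levels = lvls)))
  }
}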
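
To make the encoding concrete, here is a minimal sketch on a hypothetical character vector. Because the factor levels are taken from `unique()`, codes are assigned in order of first appearance rather than alphabetically.

```r
# Hypothetical category values.
x <- c("B", "A", "B", "C")

lvls  <- unique(x)                           # "B" "A" "C"
codes <- as.integer(factor(x, levels = lvls))
print(codes)                                 # 1 2 1 3
```

The resulting integers carry no real ordering; the tree model simply treats them as numeric values to split on.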

In [12]:
x_train = train_test[1:ntrain,]
x_test = train_test[(ntrain+1):nrow(train_test),]

In [13]:
dtrain = xgb.DMatrix(data.matrix(x_train), label=y_train)
dtest = xgb.DMatrix(data.matrix(x_test))

In [14]:
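xgb_params = list(
  seed = SEED,
  colsample_bytree = 0.7,   # fraction of columns sampled per tree
  subsample = 0.7,          # fraction of rows sampled per tree
  eta = 0.08,               # learning rate
  # 'reg:linear' is the older name for squared-error regression;
  # newer XGBoost releases deprecate it in favor of 'reg:squarederror'
  objective = 'reg:linear',
  max_depth = 6,
  num_parallel_tree = 1,
  min_child_weight = 1,
  base_score = 7            # initial prediction, on the log(loss) scale
)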

In [15]:
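# Custom evaluation metric: MAE on the original loss scale.
# Labels and predictions are on the log scale, so both are exponentiated first.
xg_eval_mae <- function(yhat, dtrain) {
  y <- getinfo(dtrain, "label")
  err <- mae(exp(y), exp(yhat))
  return(list(metric = "error", value = err))
}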
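
The effect of exponentiating before scoring can be checked by hand. The sketch below uses hypothetical values and base R arithmetic in place of `Metrics::mae`; it shows that the same predictions give very different error magnitudes depending on the scale on which the error is measured.

```r
# Hypothetical true and predicted losses, stored on the log scale as in training.
y_log    <- log(c(1000, 2500, 400))
yhat_log <- log(c(1100, 2000, 500))

# MAE after undoing the log transform, as xg_eval_mae does.
mae_original <- mean(abs(exp(y_log) - exp(yhat_log)))   # (100 + 500 + 100) / 3

# MAE on the log scale, for comparison.
mae_log <- mean(abs(y_log - yhat_log))
```

Reporting the error on the original scale keeps the cross-validation numbers comparable to the competition's MAE metric.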

In [16]:
res = xgb.cv(xgb_params,
             dtrain,
             nrounds=2500,
             nfold=4,
             early_stopping_rounds=1000,
             print_every_n = 10,
             verbose= 1,
             feval=xg_eval_mae,
             maximize=FALSE)

[1]	train-error:2029.334385+2.270803	test-error:2029.427014+6.695207 
Multiple eval metrics are present. Will use test_error for early stopping.
Will train until test_error hasn't improved in 1000 rounds.

[11]	train-error:1615.833298+2.014766	test-error:1617.770397+6.331664 
[21]	train-error:1387.854266+1.518799	test-error:1393.418447+6.395620 
[31]	train-error:1281.474778+1.811797	test-error:1291.343192+5.811237 
[41]	train-error:1228.740235+2.005922	test-error:1242.271519+6.229243 
[51]	train-error:1198.602227+1.730095	test-error:1215.901242+5.946639 
[61]	train-error:1178.309403+1.491066	test-error:1199.204767+6.246607 
[71]	train-error:1164.479017+1.460059	test-error:1188.521046+5.842423 
[81]	train-error:1153.758453+1.406112	test-error:1180.586551+5.877487 
[91]	train-error:1145.564109+1.531224	test-error:1174.805743+5.984665 
[101]	train-error:1138.393589+1.457040	test-error:1170.236584+5.591081 
[111]	train-error:1132.696933+1.506825	test-error:1167.163990+5.382285 
[121]	train

In [17]:
best_nrounds = res$best_iteration
cv_mean = res$evaluation_log$test_error_mean[best_nrounds]
cv_std = res$evaluation_log$test_error_std[best_nrounds]
cat(paste0('CV-Mean: ',cv_mean,' ', cv_std))

CV-Mean: 1149.02454210411 4.42899476228009

In [18]:
gbdt = xgb.train(xgb_params, dtrain, nrounds = best_nrounds, feval = xg_eval_mae, verbose = 1, maximize = FALSE)

In [19]:
submission = fread(SUBMISSION_FILE, colClasses = c("integer", "numeric"))
submission$loss = exp(predict(gbdt,dtest))
filename <- paste("xgbCV", as.character(round(cv_mean,4)), format(Sys.time(), "%Y%m%d%H%M%S"), sep = "_")
write.csv(submission, paste0(filename,'.csv',collapse = ""), row.names=FALSE, quote = FALSE)