# Porto Seguro's Safe Driver Kaggle Challenge

Nothing ruins the thrill of buying a brand new car more quickly than seeing your new insurance bill. The sting’s even more painful when you know you’re a good driver. It doesn’t seem fair that you have to pay so much if you’ve been cautious on the road for years.

Porto Seguro, one of Brazil’s largest auto and homeowner insurance companies, completely agrees. Inaccuracies in car insurance companies’ claim predictions raise the cost of insurance for good drivers and reduce the price for bad ones.

In this competition, you’re challenged to build a model that predicts the probability that a driver will initiate an auto insurance claim in the next year. While Porto Seguro has used machine learning for the past 20 years, they’re looking to Kaggle’s machine learning community to explore new, more powerful methods. A more accurate prediction will allow them to further tailor their prices, and hopefully make auto insurance coverage more accessible to more drivers.

### Initialize

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns

import missingno as msno

In [3]:
import matplotlib.pyplot as plt
%matplotlib inline

In [4]:
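# Render inline figures as PDF and PNG (recent IPython releases deprecate this import in favor of matplotlib_inline.backend_inline)
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('pdf', 'png')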

In [5]:
pd.options.display.float_format = '{:.2f}'.format

In [6]:
rc = {
    'savefig.dpi': 75,
    'figure.autolayout': False,
    'figure.figsize': [12, 8],
    'axes.labelsize': 18,
    'axes.titlesize': 18,
    'font.size': 18,
    'lines.linewidth': 2.0,
    'lines.markersize': 8,
    'legend.fontsize': 16,
    'xtick.labelsize': 16,
    'ytick.labelsize': 16
}

sns.set(style='dark', rc=rc)

In [8]:
default_color = '#56B4E9'
colormap = plt.cm.cool

In [9]:
# Directory that holds the raw competition CSV files

path = './data/raw/'

### Data Description

In this competition, you will predict the probability that an auto insurance policy holder files a claim.

In the train and test data, features that belong to similar groupings are tagged as such in the feature names (e.g., ind, reg, car, calc). In addition, feature names include the postfix bin to indicate binary features and cat to indicate categorical features. Features without these designations are either continuous or ordinal. Values of -1 indicate that the feature was missing from the observation. The target column signifies whether or not a claim was filed for that policy holder.

- train.csv contains the training data, where each row corresponds to a policy holder, and the target column indicates whether a claim was filed.
- test.csv contains the test data.
- sample_submission.csv is a submission file showing the correct format.
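
The naming convention described above can be used to sort features into groups and types programmatically. The following is a minimal sketch on a toy frame: the column names only mimic the convention (a prefix, a group tag such as ind or car, a number, and an optional bin or cat postfix) and are assumptions for illustration, as are the helper names `feature_kind` and `feature_group`.

```python
import pandas as pd

# Toy frame; the column names are illustrative and only follow the
# naming convention described in the text.
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'target': [0, 1, 0],
    'ps_ind_01': [2, -1, 5],
    'ps_ind_06_bin': [0, 1, 0],
    'ps_car_01_cat': [10, -1, 7],
    'ps_reg_01': [0.7, 0.8, -1.0],
    'ps_calc_01': [0.6, 0.3, 0.5],
})


def feature_kind(name):
    """Classify a feature by its postfix (hypothetical helper)."""
    if name.endswith('_bin'):
        return 'binary'
    if name.endswith('_cat'):
        return 'categorical'
    return 'continuous/ordinal'


def feature_group(name):
    """Return the group tag, assuming names look like <prefix>_<group>_<nn>."""
    return name.split('_')[1]


features = [c for c in toy.columns if c not in ('id', 'target')]

summary = pd.DataFrame({
    'group': [feature_group(c) for c in features],
    'kind': [feature_kind(c) for c in features],
    # -1 is the missing-value sentinel described above
    'n_missing': [int((toy[c] == -1).sum()) for c in features],
}, index=features)

print(summary)
```

On the real data the same helpers could be applied to `train.columns`, provided the files are read without converting -1 to NaN first; otherwise `isna()` would replace the `== -1` comparison.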

### Load Files

In [10]:
# Treat the -1 sentinel as missing (NaN) while parsing
train = pd.read_csv(path + 'train.csv', na_values=-1)
test = pd.read_csv(path + 'test.csv', na_values=-1)
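
To see what `na_values=-1` does, here is a small sketch that parses the same in-memory CSV with and without the option. The CSV text and its column names are made up for illustration; only the `-1` sentinel comes from the data description.

```python
import io

import pandas as pd

# Tiny made-up CSV that uses -1 as the missing-value sentinel.
csv_text = (
    "id,target,ps_ind_01,ps_car_01_cat\n"
    "1,0,2,-1\n"
    "2,1,-1,7\n"
)

# Without na_values the sentinel stays in the data as an ordinary integer.
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values=-1 every -1 becomes NaN, which also turns the affected
# integer columns into float columns.
parsed = pd.read_csv(io.StringIO(csv_text), na_values=-1)

print(raw.dtypes)
print(parsed.dtypes)
print(parsed.isna().sum())
```

One side effect worth keeping in mind is the dtype change: integer-coded categorical columns that contain missing values come back as floats.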

In [11]:
train.shape

(595212, 59)

In [12]:
test.shape

(892816, 58)
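
The two shapes differ by one column, which is presumably the target that only the training set carries. A quick way to confirm this kind of assumption is to compare the column sets directly; the sketch below does so on two toy frames (`toy_train` and `toy_test` are stand-ins, not the real data).

```python
import pandas as pd

# Stand-ins for the real train and test frames.
toy_train = pd.DataFrame({'id': [1, 2], 'target': [0, 1], 'ps_ind_01': [2, 5]})
toy_test = pd.DataFrame({'id': [3, 4], 'ps_ind_01': [1, 0]})

# Columns present in one frame but not in the other.
only_in_train = sorted(set(toy_train.columns) - set(toy_test.columns))
only_in_test = sorted(set(toy_test.columns) - set(toy_train.columns))

print(only_in_train)
print(only_in_test)
```

Running the same two set differences on `train` and `test` would show whether `target` is the only mismatch.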

In [13]:
# Keep the test ids for later use, e.g. when building the submission file
id_test = test['id'].values
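
The saved ids are typically needed again when predictions are written out. As a rough sketch, and assuming the submission file has an `id` column and a `target` column holding the predicted probability (check sample_submission.csv to confirm), a submission frame could be assembled like this. `toy_ids` and `toy_pred` are placeholders, not model output.

```python
import numpy as np
import pandas as pd

toy_ids = np.array([0, 1, 2])            # stands in for id_test
toy_pred = np.array([0.03, 0.05, 0.02])  # placeholder probabilities

# Assumed layout: one row per test id with its predicted claim probability.
submission = pd.DataFrame({'id': toy_ids, 'target': toy_pred})

# Serialize without the index so the file has exactly the two columns.
csv_text = submission.to_csv(index=False)
print(csv_text)
```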