This is a small exploration of the data from the 2000 Caravan Insurance Challenge. The goal of the challenge was to predict whether or not customers would be interested in buying caravan insurance.
Here we will explore the data a little, and then see whether we can predict how many customers have private health insurance.

We start off by importing the necessary tools and the data itself.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
insure_data = pd.read_csv('../input/caravan-insurance-challenge.csv')

Here we get a cursory overview of the data. The data contains 87 different demographic indicators for nearly 10,000 zip codes. We can find a key for the data here: https://www.kaggle.com/uciml/caravan-insurance-challenge


For our purposes we will only be focusing on a few select categories. 

In [None]:
insure_data.head()

In [None]:
insure_data.describe()

We'll be focusing on the following columns of data:


'MINKGEM': Average Income


'MZPART': Percent with Private Health Insurance


'MRELGE': Percent Married


'MOPLHOOG': Higher Education


'MHKOOP': Home Owners 


'MAUT0': Non Car Owners


We should also notice all the data points are labeled categorically, so we'll need the key to translate them:

L1: average age keys:

1: 20-30 years
2: 30-40 years
3: 40-50 years
4: 50-60 years
5: 60-70 years
6: 70-80 years

L3: percentage keys:

0: 0%
1: 1 - 10%
2: 11 - 23%
3: 24 - 36%
4: 37 - 49%
5: 50 - 62%
6: 63 - 75%
7: 76 - 88%
8: 89 - 99%
9: 100%

L4: total number keys:

0: 0
1: 1 - 49
2: 50 - 99
3: 100 - 199
4: 200 - 499
5: 500 - 999
6: 1000 - 4999
7: 5000 - 9999
8: 10,000 - 19,999
9: >= 20,000
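
As a rough illustration of how such a key can be applied, the sketch below maps percentage codes to their labels with `Series.map`. The `PERCENT_LABELS` name and the sample values are made up for this example and are not taken from the data set.

```python
import pandas as pd

# Percentage key (L3), copied from the list above.
PERCENT_LABELS = {
    0: "0%", 1: "1 - 10%", 2: "11 - 23%", 3: "24 - 36%", 4: "37 - 49%",
    5: "50 - 62%", 6: "63 - 75%", 7: "76 - 88%", 8: "89 - 99%", 9: "100%",
}

# Hypothetical coded values, standing in for a column such as 'MZPART'.
codes = pd.Series([0, 2, 5, 9], name="MZPART")

# Translate each code into its human-readable range.
labels = codes.map(PERCENT_LABELS)
print(labels.tolist())
```

The same pattern should work for the other keys by swapping in a different dictionary.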

In [None]:
# Copy the selection so later assignments don't modify a view of insure_data.
df = insure_data[['MINKGEM','MZPART','MRELGE','MOPLHOOG','MHKOOP','MAUT0']].copy()
df.head()

We'll rename the columns for convenience. 

In [None]:
df.columns = ['Average Income','Percent Prvte Hlth Insure','Percent Married','Percent High Education',
              'Percent Home Owners','Percent No Car']

In [None]:
df.head()

For the most part, we'll leave the data as is for our analysis, with the exception of the Average Income column. That column seems to describe the percentage of customers near the national average income, which is inconvenient for our purposes and doesn't tell us much. Fortunately, we have 5 other columns that include income information. Each of these columns tells us what percentage of the zip code is in a certain earning bracket:


'MINKM30': <$30,000


'MINK3045': $30,000 - $45,000


'MINK4575': $45,000 - $75,000


'MINK7512': $75,000 - $120,000


'MINK123M': > $120,000


We can use these to estimate the average income for every zip code.

In [None]:
income = insure_data[['MINKM30','MINK3045','MINK4575','MINK7512','MINK123M','MINKGEM']]

In [None]:
income.columns = ["< 30,000", "30,000 - 45,000", '45,000 - 75,000', '75,000 - 120,000', '> 120,000', 'Percent Near Average']

Looking at the keys for percentage, we see that each number represents a range of percentages. Thus we can convert each number to the average of the percentages that it represents. Similarly, we assume that everyone in an income bracket earns the midpoint of that bracket, and $120,000 for the open-ended top bracket.
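
As a minimal sketch of that idea, the midpoints can be derived from the percentage key rather than typed by hand. The `PERCENT_RANGES` and `midpoints` names are introduced only for this illustration.

```python
# (low, high) bounds in percent for codes 0-9, taken from the L3 percentage key.
PERCENT_RANGES = {
    0: (0, 0), 1: (1, 10), 2: (11, 23), 3: (24, 36), 4: (37, 49),
    5: (50, 62), 6: (63, 75), 7: (76, 88), 8: (89, 99), 9: (100, 100),
}

# Midpoint of each range, expressed as a fraction between 0 and 1.
midpoints = {code: (low + high) / 2 / 100
             for code, (low, high) in PERCENT_RANGES.items()}

for code, value in midpoints.items():
    print(code, round(value, 3))
```

Any hand-written lookup table can be checked against these values.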

In [None]:
def data_conversion(num):
    # Map each percentage code to (roughly) the midpoint of its range, as a fraction.
    percent_dict = {0 : 0, 1 : .05, 2 : .17, 3 : .30, 4 : .43,
                    5 : .56, 6 : .69, 7 : .82, 8 : .94, 9 : 1.0}
    return percent_dict[num]

In [None]:
income = income.applymap(data_conversion)

In [None]:
income.loc[:,'< 30,000'] *= 15000
income.loc[:,'30,000 - 45,000'] *= 37500
income.loc[:, '45,000 - 75,000'] *= 60000
income.loc[:, '75,000 - 120,000'] *= 97500
income.loc[:,'> 120,000'] *= 120000

In [None]:
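# Sum only the bracket columns; 'Percent Near Average' is a share, not an income.
income['Average Income'] = income.drop(columns='Percent Near Average').sum(axis = 1)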

Now we have an Average Income column that gives a rough estimate of the average income for each zip code.
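
To make the arithmetic concrete, here is a worked example for a single hypothetical zip code. The shares are invented for illustration. Because each share is only the midpoint of a bracket, the shares need not add up to 1, so the sketch also shows a normalized variant, which is our own suggestion and not something the notebook does.

```python
# Assumed representative income for each bracket (same values as used above).
bracket_income = {
    "< 30,000": 15000,
    "30,000 - 45,000": 37500,
    "45,000 - 75,000": 60000,
    "75,000 - 120,000": 97500,
    "> 120,000": 120000,
}

# Hypothetical bracket shares for one zip code, already converted to fractions.
shares = {
    "< 30,000": 0.30,
    "30,000 - 45,000": 0.43,
    "45,000 - 75,000": 0.17,
    "75,000 - 120,000": 0.05,
    "> 120,000": 0.0,
}

# Weighted sum, as in the notebook: 4500 + 16125 + 10200 + 4875 + 0.
estimate = sum(shares[k] * bracket_income[k] for k in bracket_income)

# Optional variant: divide by the total share so the weights sum to 1.
total_share = sum(shares.values())
normalized_estimate = estimate / total_share

print(round(estimate, 2), round(total_share, 2), round(normalized_estimate, 2))
```

Here the shares sum to 0.95, so the plain weighted sum slightly understates the normalized figure.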

In [None]:
income.head()

And we can now replace our original Average Income column with our derived one.

In [None]:
# Assign the whole column directly so the float estimates replace the integer codes.
df['Average Income'] = income['Average Income']


In [None]:
df.head()

Now we can explore any correlations between different columns in our data. What follows are a few highlights of the interesting, though perhaps unsurprising, comparisons between columns.

In [None]:
sns.set_style('ticks')
sns.jointplot(x = 'Percent High Education', y = 'Average Income', data = df, kind='kde')

In [None]:
sns.jointplot(y = 'Percent Home Owners', x = 'Average Income', data = df, kind='kde')

In [None]:
sns.jointplot(y = 'Percent No Car', x = 'Average Income', data = df, kind='kde')

In [None]:
sns.jointplot(y = 'Percent Prvte Hlth Insure', x = 'Average Income', data = df, kind='kde')

In [None]:
sns.jointplot(y = 'Percent Prvte Hlth Insure', x = 'Percent No Car', data = df, kind='kde')

In [None]:
sns.jointplot(x = 'Average Income', y = 'Percent Married', data = df, kind='kde')

In [None]:
sns.jointplot(y = 'Percent Prvte Hlth Insure', x = 'Percent High Education', data = df, kind='kde')

In [None]:
sns.jointplot(x = 'Percent Home Owners', y = 'Percent Prvte Hlth Insure', data = df, kind='kde')

Unsurprisingly, the average income in a zip code is a good predictor of rates of home ownership, car ownership, higher education, health coverage, and marriage. This makes sense, because a high income can help you afford a home, a car, health insurance, or a family, while we know that receiving a college education will raise your expected income.
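
One way to put numbers on these visual impressions is a rank correlation, which suits ordinal bracket codes. The sketch below uses a tiny made-up table (not the real data) and computes Spearman correlations by ranking each column and then taking the ordinary Pearson correlation of the ranks.

```python
import pandas as pd

# Made-up values for five zip codes, for illustration only.
toy = pd.DataFrame({
    'Average Income': [20000, 30000, 40000, 50000, 60000],
    'Percent Home Owners': [2, 3, 5, 6, 8],
    'Percent No Car': [7, 5, 4, 2, 1],
})

# Pearson correlation of the ranks is the Spearman rank correlation.
rank_corr = toy.rank().corr()
print(rank_corr['Average Income'])
```

On the notebook's own data, the equivalent call would be `df.rank().corr()`.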

Now let's see if we can use these data points to predict the percentage of insured people in a zip code.
First we'll attempt a simple linear regression model.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = df[['Average Income','Percent High Education','Percent Married','Percent No Car','Percent Home Owners']]
Y = df['Percent Prvte Hlth Insure']

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3, random_state = 2017)

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
linearmodel = LinearRegression()
linearmodel.fit(X_train, Y_train)

In [None]:
linear_predict = linearmodel.predict(X_test)

In [None]:
from sklearn import metrics

In [None]:
metrics.mean_absolute_error(Y_test, linear_predict)

In [None]:
metrics.mean_squared_error(Y_test, linear_predict)

In [None]:
np.sqrt(metrics.mean_squared_error(Y_test, linear_predict))

In [None]:
plt.scatter(Y_test, linear_predict)

It seems that, on average, our linear model will be off by about one percentage bracket.
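
For readers unfamiliar with the metrics above, this small sketch computes the mean absolute error, mean squared error, and root mean squared error by hand with NumPy. The numbers are invented for illustration and are not the model's actual output.

```python
import numpy as np

# Hypothetical true bracket codes and continuous predictions.
y_true = np.array([3, 5, 2, 7])
y_pred = np.array([4.0, 4.5, 2.5, 5.0])

errors = y_pred - y_true

mae = np.mean(np.abs(errors))   # average size of the miss, in bracket units
mse = np.mean(errors ** 2)      # penalizes large misses more heavily
rmse = np.sqrt(mse)             # back in bracket units

print(mae, mse, rmse)
```

A mean absolute error near 1 means a typical prediction misses by roughly one bracket code.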

While the linear model seems to have some predictive capability, it isn't a natural fit for this data. Every column except our derived average income is recorded as a discrete bracket code rather than the underlying value, so a linear regression model doesn't work well.
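
If we still wanted to compare a regression against a classifier on equal terms, one option (an assumption on our part, not something the notebook does) is to round each continuous prediction to the nearest bracket code and measure how often it lands in the right bracket. The values below are hypothetical.

```python
import numpy as np

# Hypothetical true bracket codes and continuous regression outputs.
y_true = np.array([4, 5, 0, 9])
y_pred_continuous = np.array([3.6, 4.4, -0.3, 9.7])

# Round to the nearest code, then clip into the valid 0-9 range.
y_pred_bracket = np.clip(np.rint(y_pred_continuous), 0, 9).astype(int)

# Share of predictions that land in exactly the right bracket.
bracket_accuracy = (y_pred_bracket == y_true).mean()
print(y_pred_bracket.tolist(), bracket_accuracy)
```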

Instead, we might be better off using a classification model. Here we try to use a decision tree to predict rates of health coverage.

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
dtree = DecisionTreeClassifier()

In [None]:
dtree.fit(X_train, Y_train)

In [None]:
dtree_predict = dtree.predict(X_test)

In [None]:
from sklearn.metrics import classification_report,confusion_matrix

In [None]:
print(classification_report(Y_test, dtree_predict))

The decision tree seems to have worked much better. We can now predict, with 92% accuracy, which bracket of health insurance coverage a zip code belongs to.
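
The classification report summarizes precision and recall per class. To see overall accuracy and where the misclassifications fall, `accuracy_score` and `confusion_matrix` from `sklearn.metrics` can be used as sketched below. The labels here are invented for illustration and are not the notebook's predictions.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted bracket codes.
y_true = [0, 2, 2, 3, 3, 3]
y_pred = [0, 2, 3, 3, 3, 2]

# Fraction of predictions that match the true bracket exactly.
acc = accuracy_score(y_true, y_pred)

# Rows are true brackets, columns are predicted brackets, in the order given by labels.
cm = confusion_matrix(y_true, y_pred, labels=[0, 2, 3])

print(acc)
print(cm)
```

In the notebook, the same calls would take `Y_test` and `dtree_predict` as arguments.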

Let's see if we can improve on the decision tree with a random forest.

In [None]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=500)
rfc.fit(X_train, Y_train)

In [None]:
RndForPred = rfc.predict(X_test)

In [None]:
print(classification_report(Y_test, RndForPred))

The random forest seems to be no more accurate than the simpler decision tree.
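
A single train/test split can make two models look closer or further apart than they really are. As a hedged sketch, cross-validation gives a steadier comparison. The data below is synthetic (random bracket codes with a simple made-up target), so the scores say nothing about the real data set.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 300 rows of five bracket-coded features (0-9).
rng = np.random.default_rng(2017)
X_demo = rng.integers(0, 10, size=(300, 5))

# Made-up target with three classes, driven by the first two features.
y_demo = (X_demo[:, 0] >= 5).astype(int) + (X_demo[:, 1] >= 5).astype(int)

models = {
    "decision tree": DecisionTreeClassifier(random_state=2017),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=2017),
}

# Five-fold cross-validated accuracy for each model.
cv_scores = {name: cross_val_score(model, X_demo, y_demo, cv=5)
             for name, model in models.items()}

for name, scores in cv_scores.items():
    print(name, scores.mean())
```

On the notebook's data, `X` and `Y` would replace `X_demo` and `y_demo`.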

Still, the decision tree's predictive capabilities are interesting enough on their own.