 Can you predict who would be interested in buying a caravan insurance policy and give an explanation why?

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn import datasets
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

This is a small exploration of the Caravan Insurance data, which has 9822 rows and 87 columns.
The goal of the challenge was to predict whether or not customers would be interested in buying caravan insurance.
Here we will explore the data a little, and then try to see
if we can predict whether or not customers have private health insurance.

In [None]:
carvaan_insurance = pd.read_csv('/kaggle/input/caravan-insurance-challenge/caravan-insurance-challenge.csv')
carvaan_insurance

The data file contains the following fields:

    ORIGIN: train or test, as described above
    MOSTYPE: Customer Subtype; see L0
    MAANTHUI: Number of houses 1 - 10
    MGEMOMV: Avg size household 1 - 6
    MGEMLEEF: Avg age; see L1
    MOSHOOFD: Customer main type; see L2

**Percentages in each group, per postal code (see L3):**

    MGODRK: Roman catholic
    MGODPR: Protestant ...
    MGODOV: Other religion
    MGODGE: No religion
    MRELGE: Married
    MRELSA: Living together
    MRELOV: Other relation
    MFALLEEN: Singles
    MFGEKIND: Household without children
    MFWEKIND: Household with children
    MOPLHOOG: High level education
    MOPLMIDD: Medium level education
    MOPLLAAG: Lower level education
    MBERHOOG: High status
    MBERZELF: Entrepreneur
    MBERBOER: Farmer
    MBERMIDD: Middle management
    MBERARBG: Skilled labourers
    MBERARBO: Unskilled labourers
    MSKA: Social class A
    MSKB1: Social class B1
    MSKB2: Social class B2
    MSKC: Social class C
    MSKD: Social class D
    MHHUUR: Rented house
    MHKOOP: Home owners
    MAUT1: 1 car
    MAUT2: 2 cars
    MAUT0: No car
    MZFONDS: National Health Service
    MZPART: Private health insurance
    MINKM30: Income < 30.000
    MINK3045: Income 30-45.000
    MINK4575: Income 45-75.000
    MINK7512: Income 75-122.000
    MINK123M: Income >123.000
    MINKGEM: Average income
    MKOOPKLA: Purchasing power class


We'll rename the columns for convenience and cleanliness.

In [None]:
rename_carvaan_insurance = carvaan_insurance.rename({'MOSTYPE':'Customer_Subtype',
                          'MAANTHUI':'Number_of_houses',
                          'MGEMOMV':'Avg_size_household',
                          'MGEMLEEF':'Avg_age',
                          'MOSHOOFD':'Customer_main_type',
                          'MGODRK':'Roman_catholic',
                          'MGODPR':'Protestant',
                           'MGODOV':'Other_religion',
                           'MGODGE':'No_religion',
                           'MRELGE':'Married',
                           'MRELSA':'Living_together',
                           'MRELOV':'Other_relation',
                           'MFALLEEN':'Singles',
                           'MFGEKIND':'Household_without_children',
                           'MFWEKIND':'Household_with_children',
                           'MOPLHOOG':'Highlevel_education',
                           'MOPLMIDD':'Mediumlevel_education',
                           'MOPLLAAG':'Lowerlevel_education',
                           'MBERHOOG':'High_status',
                           'MBERZELF':'Entrepreneur',
                           'MBERBOER':'Farmer',
                           'MBERMIDD':'Middle_management',
                           'MBERARBG':'Skilled_labourers',
                           'MBERARBO':'Unskilled_labourers',
                           'MSKA':'Socialclass_A',
                           'MSKB1':'Socialclass_B1',
                           'MSKB2':'Socialclass_B2',
                           'MSKC':'Socialclass_C',
                           'MSKD':'Socialclass_D',
                           'MHHUUR':'Rented_house',
                           'MHKOOP':'Home_owners',
                           'MAUT1':'1_car',
                           'MAUT2':'2_cars',
                           'MAUT0':'No_car',
                           'MZFONDS':'National_HealthService',
                           'MZPART':'Private_health_insurance',
                           'MINKM30':'Income_30',
                           'MINK3045':'Income30_45',
                           'MINK4575':'Income45_75',
                           'MINK7512':'Income75_122',
                           'MINK123M':'Income_123',
                           'MINKGEM':'Average_income',
                           'MKOOPKLA':'Purchasing_power_class',
                           'PWAPART':'private_thirdparty_insurance',
                          'PWABEDR':'thirdparty_insurance_firms',
                          'PWALAND':'thirdparty_insurance_agriculture',
                          'PPERSAUT':'car_policies',
                          'PBESAUT':'delivery_van_policies',
                          'PMOTSCO':'motorcycle/scooter_policies',
                          'PVRAAUT':'lorry_policies',
                           'PAANHANG':'trailer_policies',
                           'PTRACTOR':'tractor_policies',
                           'PWERKT':'agricultural_machines_policies',
                           'PBROM':'moped_policies',
                           'PLEVEN':'life_insurances',
                           'PPERSONG':'private_accident_insurance_policies',
                           'PGEZONG':'family_accidents_insurance_policies',
                           'PWAOREG':'disability_insurance_policies',
                           'PBRAND':'fire_policies',
                           'PZEILPL':'surfboard_policies',
                           'PPLEZIER':'boat_policies',
                           'PFIETS':'bicycle_policies',
                           'PINBOED':'property_insurance_policies',
                           'PBYSTAND':'social_security_insurance_policies',
                           'AWAPART':'private_thirdparty_insurance_1_12',
                           'AWABEDR':'Number_thirdparty_insurance_firms',
                           'AWALAND':'Number_thirdparty_insurance_agriculture',
                           'APERSAUT':'Number_car_policies',
                           'ABESAUT':'Number_delivery_van_policies',
                           'AMOTSCO':'Number_motorcycle/scooter_policies',
                           'AVRAAUT':'Number_lorry_policies',
                           'AAANHANG':'Number_trailer_policies',
                           'ATRACTOR':'Number_tractor_policies',
                           'AWERKT':'Number_agricultural_machines_policies',
                           'ABROM':'Number_moped_policies',
                           'ALEVEN':'Number_life_insurances',
                           'APERSONG':'Number_privatAvg_size_householde_accident_insurance_policies',
                           'AGEZONG':'Number_family_accidents_insurance_policies',
                           'AWAOREG':'Number_disability_insurance_policies',
                           'ABRAND':'Number_fire_policies',
                           'AZEILPL':'Number_surfboard_policies',
                           'APLEZIER':'Number_boat_policies',
                           'AFIETS':'Number_bicycle_policies',
                           'AINBOED':'Number_property_insurance_policies',
                           'ABYSTAND':'Number_social_security_insurance_policies',
                           'CARAVAN':'Number_mobilehome_policies_0_1'}, 
                 axis=1)
 
#rename_carvaan_insurance

In [None]:
# pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows', None)
# rename_carvaan_insurance['Income_30'].value_counts()
# rename_carvaan_insurance['Income30_45'].value_counts()
# rename_carvaan_insurance['Income45_75'].value_counts()
# rename_carvaan_insurance['Income75_122'].value_counts()
# rename_carvaan_insurance['Income_123'].value_counts()
# rename_carvaan_insurance['Private_health_insurance'].value_counts()

Based on this dataset, I explore the data and rename the columns. This is a supervised learning dataset.
I calculate correlation coefficients to analyze the relationships between the variables.
I then apply various methods and techniques and evaluate the resulting models,
and compare the models to select the best one.
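
The correlation coefficient mentioned above can be computed by hand. The following is a minimal sketch in plain NumPy, assuming the Pearson definition (the default used by `DataFrame.corr()`); the helper name `pearson_r` and the toy values are hypothetical and are not taken from the caravan data.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Center both variables around their means.
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    # Covariance term divided by the product of the two spreads.
    numerator = (a_centered * b_centered).sum()
    denominator = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    return float(numerator / denominator)

# Hypothetical toy values, for illustration only.
income = np.array([1, 2, 3, 4, 5], dtype=float)
private_share = np.array([0, 1, 1, 3, 4], dtype=float)

r = pearson_r(income, private_share)
print(round(r, 3))
```

A value close to 1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear relationship.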


In [None]:
rename_carvaan_insurance.describe()

In [None]:
carvaan_correlation = rename_carvaan_insurance[['Number_of_houses','Avg_age','Average_income','Customer_main_type','Roman_catholic','No_religion','Private_health_insurance','Avg_size_household','Home_owners','Married','Number_family_accidents_insurance_policies','Number_disability_insurance_policies','Number_fire_policies','Number_surfboard_policies','Number_boat_policies','Number_bicycle_policies','Number_property_insurance_policies','Number_social_security_insurance_policies']]
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
a = carvaan_correlation.corr()
a.head(20)

In [None]:
correlation_carvaan = sns.clustermap(carvaan_correlation.iloc[:, 1:20].corr(), annot=True, fmt = ".2f", cmap = "coolwarm")
correlation_carvaan

Further analysis shows a high correlation coefficient between Average income and Private health insurance.
The majority of customers have an income of less than 75,000, and the majority of customers do not have private health insurance.

In [None]:
# g = sns.clustermap(rename_carvaan_insurance.iloc[:, 1:20].corr(), annot=True, fmt = ".2f", cmap = "coolwarm")
# g

Next, we estimate the probability density function of the data.

In [None]:
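# distplot is deprecated in recent seaborn releases; histplot with kde=True draws the same kind of plot
sns.histplot(rename_carvaan_insurance["Average_income"], bins=16, color="purple", kde=True, stat="density")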

The “Average Income” variable appears skewed, and most of the Average Income values are in the range of 2 to 5.
This means that the majority of customers have an income of less than 75,000.
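
The smooth curve in a density plot of this kind is a kernel density estimate. As a rough sketch of the idea, assuming a Gaussian kernel and a hand-picked bandwidth (seaborn chooses its own bandwidth, so the curves will not match exactly), the estimate can be written in plain NumPy. The function name `gaussian_kde_1d` and the sample values are hypothetical.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate on the points in `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance between every grid point and every sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    # One Gaussian bump per sample, averaged over all samples.
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return kernels.sum(axis=1) / (len(samples) * bandwidth)

# Hypothetical coded income levels, for illustration only.
samples = np.array([2, 3, 3, 4, 4, 4, 5, 5, 6, 7], dtype=float)
grid = np.linspace(-5.0, 15.0, 2001)
density = gaussian_kde_1d(samples, grid, bandwidth=0.8)

# A density should integrate to approximately 1.
area = float(density.sum() * (grid[1] - grid[0]))
print(round(area, 3))
```

A larger bandwidth gives a smoother curve, while a smaller one follows the individual observations more closely.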

In [None]:
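# distplot is deprecated in recent seaborn releases; histplot with kde=True draws the same kind of plot
sns.histplot(rename_carvaan_insurance["Private_health_insurance"], bins=16, color="blue", kde=True, stat="density")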

The “Private Health Insurance” variable appears skewed, and most of the Private Health Insurance values are in the range of 0 to 5.
This means that the majority of customers have no private health insurance.

In [None]:
sns.set_style('ticks')
sns.jointplot(x = 'Highlevel_education', y = 'Average_income', data = rename_carvaan_insurance, kind='kde')

In [None]:
sns.set_style('ticks')
sns.jointplot(x = 'Number_fire_policies', y = 'Average_income', data = rename_carvaan_insurance, kind='kde')

In [None]:
sns.set_style('ticks')
sns.jointplot(x = 'Home_owners', y = 'Average_income', data = rename_carvaan_insurance, kind='kde')

In [None]:
sns.set_style('ticks')
sns.jointplot(x = 'Average_income', y = 'Private_health_insurance', data = rename_carvaan_insurance, kind='kde')

Next, we look at the estimates of the individual linear regression coefficients and the quality of the overall fit.
R² measures how much of the variation in the response variable y is explained by variation in the regressors X.
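
As an illustration of this definition, here is a minimal sketch of the (centered) R² in plain NumPy. The helper name `r_squared` and the numbers are hypothetical.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Centered R²: 1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = ((y_true - y_pred) ** 2).sum()          # variation left unexplained
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()   # total variation around the mean
    return float(1.0 - ss_res / ss_tot)

# Hypothetical observed values and predictions.
y_obs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8, 5.0])
print(r_squared(y_obs, y_hat))
```

Note that when a statsmodels OLS model is fitted without a constant term, the R² in its summary is the uncentered version, which uses the sum of squares of y itself rather than of y around its mean, so it is not directly comparable with the value computed here.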

In [None]:
#x = rename_carvaan_insurance[['Average_income','Highlevel_education','Household_with_children','Married','Number_of_houses','Home_owners']]
x = rename_carvaan_insurance[['Average_income']]
y = rename_carvaan_insurance[['Private_health_insurance']]
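# sm.OLS takes (endog, exog): this fits Average_income (x) on Private_health_insurance (y), with no intercept
model_insurances = sm.OLS(x,y).fit()
# Predict from the regressor the model was fitted on
prediction_insurances = model_insurances.predict(y)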
model_insurances.summary()

OLS stands for Ordinary Least Squares. “Least squares” means that we fit the regression line that minimizes the sum of squared distances between the observed points and the line. In the summary, Date and Time are self-explanatory, as is the number of observations. The Df of residuals and of the model are the degrees of freedom: “the number of values in the final calculation of a statistic that are free to vary.”
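
To make the least-squares idea concrete, here is a small sketch that fits a line with `np.linalg.lstsq` instead of statsmodels. The data points are made up for illustration.

```python
import numpy as np

# Hypothetical data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

# Design matrix with a column of ones for the intercept.
X = np.column_stack([np.ones_like(x), x])

# Least-squares solution: minimizes the sum of squared residuals ||y - X @ beta||^2.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = beta

residuals = y - X @ beta
print(intercept, slope)
print((residuals ** 2).sum())  # the quantity that OLS minimizes
```

At the least-squares solution the residuals are orthogonal to every column of the design matrix, which is why they sum to zero whenever an intercept is included.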

The relationship is assumed to be linear. That is, one variable is modelled as a linear function of the other; in the cell above, `sm.OLS(x, y)` treats Average Income as the response and Private Health Insurance as the regressor.
The R-squared of 0.719 indicates a reasonably good, though not perfect, fit between average income and private health insurance. Because this model has no constant term, statsmodels reports the uncentered R-squared here.
As expected, income and private health insurance are strongly related, corroborated by a significant p-value for the coefficient in the model.

The coefficient of 1.0042 means that as Private_health_insurance increases by 1, the predicted value of Average_income increases by 1.0042. A few other important values are the R-squared, the percentage of variance our model explains; the standard error, which is the standard deviation of the sampling distribution of a statistic, most commonly of the mean; and the t score and p-value for the hypothesis test, where the coefficient has a statistically significant p-value. The 95% confidence interval means that we estimate, at 95% confidence, that the coefficient lies between 0.992 and 1.017.

In [None]:
# rename_carvaan_insurance.plot(x='Private_health_insurance', y='Average_income', style='o')
# plt.title('Average Income vs Private Health Insurance')
# plt.xlabel('Incomes')
# plt.ylabel('Percentage Insurance')
# plt.show()

In [None]:
x = rename_carvaan_insurance[['Average_income']]
y = rename_carvaan_insurance[['Private_health_insurance']]
x = sm.add_constant(x)
model_insurances = sm.OLS(y,x).fit()
prediction_insurances = model_insurances.predict(x)
model_insurances.summary()

With the constant term the coefficients are different. Without a constant we are forcing our model to go through the origin, but now we have a y-intercept at 0.2560. 
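
The effect of the constant term can be reproduced on synthetic data. The sketch below uses plain NumPy and assumes a true intercept of 2.0 and a true slope of 0.5; both values and the random data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 9, size=200)
y = 2.0 + 0.5 * x + rng.normal(0, 0.3, size=200)  # assumed intercept 2.0, slope 0.5

# Without a constant: the line is forced through the origin.
# The least-squares slope is then sum(x*y) / sum(x*x).
slope_origin = (x @ y) / (x @ x)

# With a constant: add a column of ones, as sm.add_constant does.
X = np.column_stack([np.ones_like(x), x])
intercept, slope = np.linalg.lstsq(X, y, rcond=None)[0]

print(slope_origin)      # inflated, because it has to absorb the missing intercept
print(intercept, slope)  # close to the assumed 2.0 and 0.5
```

When the true intercept is positive, forcing the line through the origin pushes the slope upwards, which is one reason the coefficients differ between the two fits.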

In [None]:
fig = sm.graphics.plot_partregress_grid(model_insurances, fig=plt.figure(figsize=(20,12)))

The partial regression plot shows the partial relationship between income and Private Health Insurance. Some cases greatly decrease the effect of income on Private Health Insurance. Dropping these cases confirms this.
The relationship between Private Health Insurance and Average Income seems to be linear, though there are some observations that exert considerable influence on the relationship.

We drop the ORIGIN variable because it is categorical; for modelling, it is better to have only numerical variables.

In [None]:
# X = rename_carvaan_insurance
# Y = rename_carvaan_insurance["Average_income"]
carvaan_insurances = rename_carvaan_insurance.drop(["ORIGIN"], axis = 1)
#carvaan_insurances

#Linear Regression Model

In [None]:
X = carvaan_insurances[['Average_income']]
Y = rename_carvaan_insurance[['Private_health_insurance']]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
lm = LinearRegression()
model = lm.fit(X_train,Y_train)
predictions = lm.predict(X_test)
lm.score(X_test,Y_test)  # R² on the held-out test split only

In [None]:
X = carvaan_insurances[['Number_of_houses','Avg_size_household','Avg_age','Customer_main_type','Roman_catholic','No_religion','Private_health_insurance','Avg_size_household','Home_owners','Number_fire_policies','Married','Number_family_accidents_insurance_policies','Number_disability_insurance_policies','Number_fire_policies','Number_surfboard_policies','Number_boat_policies','Number_privatAvg_size_householde_accident_insurance_policies','Number_bicycle_policies','Number_property_insurance_policies','Number_social_security_insurance_policies']]
Y = carvaan_insurances[['Average_income']]
X = sm.add_constant(X)
model_insurances = sm.OLS(Y,X).fit()
prediction_insurances = model_insurances.predict(X)
model_insurances.summary()

This is the R² score of our model: the proportion of the variance in the target that is explained by the predictions.

In [None]:
# X = carvaan_insurances.iloc[:,:-1].values
# Y = carvaan_insurances.iloc[:,1].values
# X = X.drop("Average_income", 1)
X = carvaan_insurances[['Number_of_houses','Avg_size_household','Avg_age','Customer_main_type','Roman_catholic','No_religion','Private_health_insurance','Avg_size_household','Home_owners','Number_fire_policies','Married','Number_family_accidents_insurance_policies','Number_disability_insurance_policies','Number_fire_policies','Number_surfboard_policies','Number_boat_policies','Number_privatAvg_size_householde_accident_insurance_policies','Number_bicycle_policies','Number_property_insurance_policies','Number_social_security_insurance_policies']]
Y = carvaan_insurances[['Average_income']]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
lm = LinearRegression()
model = lm.fit(X_train,Y_train)
predictions = lm.predict(X_test)
model.score(X_test,Y_test)  # R² on the held-out test split only

In [None]:
model.coef_

In [None]:
model.intercept_

In [None]:
# model.coef_ should be approximately:
#         -9.49373079e-02,  7.81713841e-02, -1.00978615e-01,
#         -7.96435261e-02,  1.05429362e-01,  3.78331605e-02,
#          1.66272892e-01,  7.81713841e-02,  9.77441286e-02,
#          5.33136190e-03,  1.08002079e-01,  8.67035422e-02,
#         -4.21708707e-01,  5.33136190e-03, -2.02242846e-01,
#          2.28472449e-04, -1.68738803e-01,  7.29316616e-02,
#          2.78768154e-01,  2.87326091e-02
# and model.intercept_ should be approximately 2.45819083.

Now that we have trained our algorithm, it’s time to make some predictions. To do so, we will use our test data and see how accurately our algorithm predicts the target variable.

In [None]:
y_pred = model.predict(X_test)
y_pred
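
As a sketch of what evaluating on held-out data involves, the following example performs an 80/20 split, fits on the training part only, and scores on the test part. It uses plain NumPy rather than `train_test_split` and `LinearRegression`, and the data is synthetic with an assumed intercept of 0.3 and slope of 0.7.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(0, 9, size=n)
y = 0.3 + 0.7 * x + rng.normal(0, 0.5, size=n)  # hypothetical relationship

# Shuffle the row indices and hold out 20% of them for testing.
indices = rng.permutation(n)
n_test = n // 5
test_idx, train_idx = indices[:n_test], indices[n_test:]

# Fit intercept and slope on the training rows only.
X_train = np.column_stack([np.ones(len(train_idx)), x[train_idx]])
beta = np.linalg.lstsq(X_train, y[train_idx], rcond=None)[0]

# Predict on rows the model has never seen and compute R² there.
X_test = np.column_stack([np.ones(n_test), x[test_idx]])
y_pred = X_test @ beta
ss_res = ((y[test_idx] - y_pred) ** 2).sum()
ss_tot = ((y[test_idx] - y[test_idx].mean()) ** 2).sum()
r2_test = 1.0 - ss_res / ss_tot
print(r2_test)
```

Scoring on the test rows alone gives an estimate of how the model behaves on new data, whereas a score computed on all rows also includes the rows the model was fitted on.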

To compare the actual output values for X_test with the predicted values, run the following cell:

In [None]:
X = carvaan_insurances[['Average_income']].values.reshape(-1,1)
Y = carvaan_insurances[['Private_health_insurance']].values.reshape(-1,1)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
lm = LinearRegression()
model = lm.fit(X_train,Y_train)
# Predict with the model fitted in this cell, so that Actual and Predicted refer to the same target and the same test rows
y_pred = model.predict(X_test)
Predicted_Actual = pd.DataFrame({'Actual': Y_test.flatten(), 'Predicted': y_pred.flatten()})
Predicted_Actual.head(25)

In [None]:
flatten_Graph = Predicted_Actual.head(25)
flatten_Graph.plot(kind='bar',figsize=(16,10))
plt.grid(which='major', linestyle='-', linewidth='0.5', color='green')
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
plt.show()

Though our model is not very precise, the predicted percentages are close to the actual ones.

In [None]:
X1 = carvaan_insurances[['Number_of_houses','Avg_size_household','Avg_age','Customer_main_type','Roman_catholic','No_religion','Private_health_insurance','Avg_size_household','Home_owners','Number_fire_policies','Married','Number_family_accidents_insurance_policies','Number_disability_insurance_policies','Number_fire_policies','Number_surfboard_policies','Number_boat_policies','Number_privatAvg_size_householde_accident_insurance_policies','Number_bicycle_policies','Number_property_insurance_policies','Number_social_security_insurance_policies']]
Y1 = carvaan_insurances[['Private_health_insurance']]
# X = X.drop("Average_income", 1)
X_train, X_test, Y_train, Y_test = train_test_split(X1, Y1, test_size=0.2, random_state=0)
lm1 = LinearRegression()
model = lm1.fit(X_train,Y_train)
predictions = lm1.predict(X_test)
lm1.score(X_test,Y_test)  # R² on the held-out test split only

In [None]:
lm1.coef_

In [None]:
lm1.intercept_

Model Evaluation Metrics for Regression
Mean Squared Error is the mean of the squared errors.
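
As a small worked example of this definition, the sketch below computes the mean squared error by hand in NumPy. The helper name `mean_squared_error_np` and the numbers are hypothetical.

```python
import numpy as np

def mean_squared_error_np(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Hypothetical values: the errors are 0.5, 0.0, -2.0 and -1.0.
actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.5, 5.0, 4.0, 8.0]

mse = mean_squared_error_np(actual, predicted)   # (0.25 + 0 + 4 + 1) / 4
rmse = mse ** 0.5                                # same units as the target
print(mse, rmse)
```

Because the errors are squared, a single large error contributes much more than several small ones.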

In [None]:
from sklearn.metrics import mean_squared_error
# lm1 was already fitted on X_train, Y_train above; score it on the held-out rows
predictions = lm1.predict(X_test)
print(mean_squared_error(Y_test, predictions))  # argument order is (y_true, y_pred)

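The mean squared error is 6.328323041957195e-30, which is zero up to floating-point rounding. This happens because Private_health_insurance is both the target (Y1) and one of the predictors in X1, so the model can reproduce it exactly. Mean squared error is used as a default metric for evaluating the performance of most regression algorithms.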
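
An error this close to zero can be reproduced whenever the target column is also passed as a predictor, which is the case for `Private_health_insurance` in `X1` above. The sketch below shows the effect on synthetic data in plain NumPy; the variable names and the helper `fit_mse` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
other = rng.normal(size=n)                   # an ordinary predictor
target = 1.5 * other + rng.normal(size=n)    # the variable we want to predict

def fit_mse(features, y):
    """Fit least squares with an intercept and return the in-sample MSE."""
    X = np.column_stack([np.ones(len(y))] + features)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return float(np.mean((y - X @ beta) ** 2))

mse_leaky = fit_mse([other, target], target)  # target is also a predictor
mse_clean = fit_mse([other], target)          # target is excluded from the predictors

print(mse_leaky)  # essentially zero
print(mse_clean)  # reflects the noise in the data
```

Removing the target from the predictor list gives an error that reflects how well the remaining variables actually explain it.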

In [None]:
# features_cols = ['Private_health_insurance','Home_owners','Number_fire_policies','Married']
# X = carvaan_insurances[features_cols]
# Y = carvaan_insurances["Average_income"]
X = carvaan_insurances[['Average_income']]
# X = X.drop("Average_income", 1)
Y = carvaan_insurances[['Number_of_houses','Avg_size_household','Avg_age','Customer_main_type','Roman_catholic','No_religion','Private_health_insurance','Avg_size_household','Home_owners','Number_fire_policies','Married','Number_family_accidents_insurance_policies','Number_disability_insurance_policies','Number_fire_policies','Number_surfboard_policies','Number_boat_policies','Number_privatAvg_size_householde_accident_insurance_policies','Number_bicycle_policies','Number_property_insurance_policies','Number_social_security_insurance_policies']]
linear_regressor = LinearRegression()  # create object for the class
linear_regressor.fit(X, Y)  # perform linear regression
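# features_cols is commented out above; pair each target column in Y with its slope on Average_income instead
list(zip(Y.columns, linear_regressor.coef_.ravel()))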

In [None]:
X2 = carvaan_insurances[['Average_income']]
# X = X.drop("Average_income", 1)
Y2 = carvaan_insurances[['Number_of_houses','Avg_size_household','Avg_age','Customer_main_type','Roman_catholic','No_religion','Private_health_insurance','Avg_size_household','Home_owners','Number_fire_policies','Married','Number_family_accidents_insurance_policies','Number_disability_insurance_policies','Number_fire_policies','Number_surfboard_policies','Number_boat_policies','Number_privatAvg_size_householde_accident_insurance_policies','Number_bicycle_policies','Number_property_insurance_policies','Number_social_security_insurance_policies']]
linear_regressor = LinearRegression()  # create object for the class
linear_regressor.fit(X2, Y2)  # perform linear regression
# coef_ has one row per target column in Y2, so pair it with Y2's column names
list(zip(Y2.columns, linear_regressor.coef_.ravel()))

In [None]:
X1 = carvaan_insurances[['Private_health_insurance']]
# X = X.drop("Average_income", 1)
Y1 = carvaan_insurances[['Number_of_houses','Avg_size_household','Average_income','Avg_age','Customer_main_type','Roman_catholic','No_religion','Avg_size_household','Home_owners','Number_fire_policies','Married','Number_family_accidents_insurance_policies','Number_disability_insurance_policies','Number_fire_policies','Number_surfboard_policies','Number_boat_policies','Number_privatAvg_size_householde_accident_insurance_policies','Number_bicycle_policies','Number_property_insurance_policies','Number_social_security_insurance_policies']]
linear_regressor = LinearRegression()  # create object for the class
linear_regressor.fit(X1, Y1)  # perform linear regression
# coef_ has one row per target column in Y1, so pair it with Y1's column names
list(zip(Y1.columns, linear_regressor.coef_.ravel()))

In this notebook, I studied one of the most fundamental machine learning algorithms, linear regression.
I implemented both simple linear regression and multiple linear regression with the help of the Scikit-Learn machine learning library and other helpful libraries.

