# Insurance prediction

This notebook works on a prediction problem for the customers of an insurance company. Each data point is one customer. The `group` label encodes the number of accidents the customer has been involved in so far:

* 0 - red: many accidents
* 1 - green: few or no accidents
* 2 - yellow: in the middle  

Since the output feature takes discrete values, this is a classification problem.
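
For plots and reports it can help to translate the numeric codes back into the descriptions listed above. A minimal sketch; `GROUP_LABELS` and `describe_group` are hypothetical names that do not appear in the original notebook:

```python
# Hypothetical helper: map the numeric group codes to the descriptions above.
GROUP_LABELS = {
    0: "red: many accidents",
    1: "green: few or no accidents",
    2: "yellow: in the middle",
}


def describe_group(code):
    """Return the human-readable description for a group code."""
    return GROUP_LABELS[int(code)]


print(describe_group(1))
```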

## Importing libraries

In [1]:
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline
%pylab inline

import pandas as pd
import matplotlib.pyplot as plt
plt.xkcd()

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

from sklearn import metrics

Populating the interactive namespace from numpy and matplotlib


## Loading and exploring our data set

In [2]:
df = pd.read_csv('./insurance-customers-1500.csv', sep=';')
df.head()

,speed,age,miles,group
0,98.0,44.0,25.0,1
1,118.0,54.0,24.0,1
2,111.0,26.0,34.0,0
3,97.0,25.0,10.0,2
4,114.0,38.0,22.0,1


In [3]:
df.shape

(1500, 4)

In [4]:
df.describe()

,speed,age,miles,group
count,1500.0,1500.0,1500.0,1500.0
mean,122.492667,44.980667,30.434,0.998667
std,17.604333,17.1304,15.250815,0.816768
min,68.0,16.0,1.0,0.0
25%,108.0,32.0,18.0,0.0
50%,120.0,42.0,29.0,1.0
75%,137.0,55.0,42.0,2.0
max,166.0,100.0,84.0,2.0


### Creating independent and dependent features from the dataset

In [5]:
# Output feature
y=df['group']
y.head()

0    1
1    1
2    0
3    2
4    1
Name: group, dtype: int64

In [6]:
# 'group' is the label we want to predict, so it must not be part of the feature matrix
X = df.drop(columns='group')
X.head()

,speed,age,miles
0,98.0,44.0,25.0
1,118.0,54.0,24.0
2,111.0,26.0,34.0
3,97.0,25.0,10.0
4,114.0,38.0,22.0


### Splitting independent and dependent data into train and test data

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(1200, 3) (1200,) (300, 3) (300,)
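
The `stratify=y` argument asks `train_test_split` to keep the class proportions of `y` in both splits. The sketch below illustrates this on synthetic data, not on the insurance data set; `X_demo` and `y_demo` are made-up stand-ins with three equally sized classes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's data: 1500 rows, three equally sized classes.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1500, 3))
y_demo = np.repeat([0, 1, 2], 500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify, each split keeps the class proportions of the full data set.
train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print("train:", train_counts, "test:", test_counts)
```

Without `stratify`, the per-class counts in the test split would vary with the random seed, which makes per-class metrics harder to compare between runs.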


In [8]:
# Create a decision tree classifier with default parameters (no depth limit)
clf = DecisionTreeClassifier()
%time clf.fit(X_train, y_train)
# Report the depth the unconstrained tree grew to
print("max depth is " , clf.tree_.max_depth )

Wall time: 30.9 ms
max depth is  18


In [9]:
print( " Training score " , round(clf.score(X_train, y_train)*100 , 2))
print( " Testing score " , round(clf.score(X_test, y_test)*100 , 2))

 Training score  99.92
 Testing score  71.33


In [10]:
# Constrain the tree's parameters to improve the testing score (reduce overfitting)
clf = DecisionTreeClassifier(max_depth=10,
                              min_samples_leaf=3,
                              min_samples_split=2)
%time clf.fit(X_train, y_train)

print("max depth is " , clf.tree_.max_depth )

Wall time: 11 ms
max depth is  10


In [11]:
print( " Training score " , round(clf.score(X_train, y_train)*100 , 2))
print( " Testing score " , round(clf.score(X_test, y_test)*100 , 2))

 Training score  86.33
 Testing score  73.67
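
The two runs above show the usual overfitting pattern: the unconstrained tree scores far higher on the training data than on the test data, and limiting the tree narrows that gap. The following sketch sweeps `max_depth` to make the pattern visible. It uses synthetic data from `make_classification` as an assumption, so the numbers it prints will differ from the insurance results:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data with label noise (an assumption, not the insurance data).
X_demo, y_demo = make_classification(
    n_samples=1500, n_features=3, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, flip_y=0.2, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

results = {}
for depth in (2, 5, 10, None):  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"max_depth={depth}: train={results[depth][0]:.3f}, test={results[depth][1]:.3f}")
```

A large difference between the two columns for deep trees suggests the model is memorising noise in the training set rather than learning structure that generalises.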


## Evaluate a score by cross-validation

In [12]:
scores = cross_val_score(clf, X, y, n_jobs=-1)
scores

array([0.75333333, 0.75666667, 0.77333333, 0.70666667, 0.77333333])
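
The five values are the accuracies on five folds (recent scikit-learn versions default to 5-fold cross-validation). A single summary is often easier to compare between models, so one option is to report the mean and the spread. The sketch below simply reuses the fold scores printed above:

```python
import numpy as np

# Fold accuracies copied from the output above.
scores = np.array([0.75333333, 0.75666667, 0.77333333, 0.70666667, 0.77333333])

mean_score = scores.mean()
std_score = scores.std()
print(f"CV accuracy: {mean_score:.3f} +/- {std_score:.3f}")
```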

## Tuning parameters and finding best model

In [13]:
param_grid = {
    'max_depth': list(range(2, 25)),
    'min_samples_split': list(range(2, 11)),
    'min_samples_leaf': list(range(1, 11))
}

In [14]:
clf = GridSearchCV(DecisionTreeClassifier(), param_grid, n_jobs=-1)
%time clf.fit(X, y)
best_params = clf.best_params_
print("Best params ",best_params)

Wall time: 50.4 s
Best params  {'max_depth': 9, 'min_samples_leaf': 6, 'min_samples_split': 2}
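
Note that the search above is fit on the full data set (`X`, `y`), so the rows later used as the test split also influence the chosen parameters. A common alternative is to tune on the training split only and keep the test split for the final check. The sketch below shows that variant on synthetic data with a deliberately small grid; `X_demo`, `y_demo`, `small_grid` and `search` are illustrative names, not part of the original notebook:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; substitute the notebook's X and y).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=3, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, flip_y=0.2, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# A deliberately small grid so the sketch runs quickly.
small_grid = {"max_depth": [3, 6, 9], "min_samples_leaf": [1, 3, 6]}

search = GridSearchCV(DecisionTreeClassifier(random_state=0), small_grid, cv=5)
search.fit(X_tr, y_tr)  # the test split is never seen during tuning

print("Best params:", search.best_params_)
print("CV score of best params:", round(search.best_score_, 3))
# With the default refit=True, `search` is already refit on X_tr with the best params.
test_score = search.score(X_te, y_te)
print("Held-out test score:", round(test_score, 3))
```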


In [15]:
# Refit on the training split with the best parameters found by the grid search
clf = DecisionTreeClassifier(max_depth=9,
                              min_samples_leaf=6,
                              min_samples_split=2)
%time clf.fit(X_train, y_train)
print("max depth is " , clf.tree_.max_depth )

Wall time: 10.2 ms
max depth is  9


In [16]:
print( " Training score " , round(clf.score(X_train, y_train)*100 , 2))
print( " Testing score " , round(clf.score(X_test, y_test)*100 , 2))

 Training score  82.25
 Testing score  76.0


In [17]:
# Evaluate a score by cross-validation
scores = cross_val_score(clf, X, y, n_jobs=-1)
scores

array([0.75333333, 0.78666667, 0.78666667, 0.72666667, 0.76      ])

### Predicting the test data with the trained model

In [18]:
test_pred = clf.predict(X_test)

## Classification scores & model evaluation

In [19]:
# Finding accuracy
metrics.accuracy_score(y_test ,test_pred)

0.76

In [20]:
# Finding confusion matrix
metrics.confusion_matrix(y_test, test_pred)

array([[81, 10,  9],
       [10, 75, 15],
       [10, 18, 72]], dtype=int64)
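
In scikit-learn's confusion matrix, rows are the true classes and columns are the predicted classes, so the diagonal holds the correct predictions. The per-class metrics reported in the following cells can be derived directly from it. A short sketch using the matrix printed above:

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true class, columns: predicted class).
cm = np.array([[81, 10,  9],
               [10, 75, 15],
               [10, 18, 72]])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # column sums = predicted counts per class
recall = true_positives / cm.sum(axis=1)     # row sums = true counts per class
accuracy = true_positives.sum() / cm.sum()

print("precision:", precision.round(4))
print("recall:   ", recall.round(4))
print("accuracy: ", accuracy)
print("macro precision:", precision.mean())
```

The macro average is the unweighted mean of the per-class values, which is why it matches the weighted average here: every class has the same support of 100.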

In [21]:
# Finding precision score
metrics.precision_score(y_test, test_pred, average='macro')

0.7600451792752091

In [22]:
# Finding recall score
metrics.recall_score(y_test, test_pred, average='macro')

0.7600000000000001

In [23]:
# Finding classification report
print(metrics.classification_report(y_test, test_pred))

              precision    recall  f1-score   support

           0       0.80      0.81      0.81       100
           1       0.73      0.75      0.74       100
           2       0.75      0.72      0.73       100

    accuracy                           0.76       300
   macro avg       0.76      0.76      0.76       300
weighted avg       0.76      0.76      0.76       300



In [24]:
# Finding per-class precision, recall, F-score and support
metrics.precision_recall_fscore_support(y_test, test_pred)

(array([0.8019802 , 0.72815534, 0.75      ]),
 array([0.81, 0.75, 0.72]),
 array([0.80597015, 0.73891626, 0.73469388]),
 array([100, 100, 100], dtype=int32))

## Predicting with new sample data point

In [25]:
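# Feature order matches the training columns: speed, age, miles
# (named `sample` to avoid shadowing the built-in `input`)
sample = [[135.0, 48.0, 25.5]]
clf.predict(sample)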

array([2], dtype=int64)
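
Because the model was trained on a DataFrame, one way to avoid mixing up the feature order is to pass the new sample as a DataFrame with the same column names. `predict_proba` additionally shows how confident the tree is. The sketch below is self-contained and therefore trains on made-up data with random labels; only the column names and the sample values come from the notebook:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Tiny synthetic training set with the notebook's column names (values are made up).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "speed": rng.uniform(68, 166, size=300),
    "age": rng.uniform(16, 100, size=300),
    "miles": rng.uniform(1, 84, size=300),
})
y_demo = rng.integers(0, 3, size=300)  # random labels, for illustration only

model = DecisionTreeClassifier(max_depth=9, min_samples_leaf=6, random_state=0)
model.fit(X_demo, y_demo)

# Build the new sample with the same column names and order used for training.
sample = pd.DataFrame([[135.0, 48.0, 25.5]], columns=["speed", "age", "miles"])

predicted_group = model.predict(sample)[0]
probabilities = model.predict_proba(sample)[0]  # one probability per class in model.classes_
print(predicted_group, dict(zip(model.classes_, probabilities.round(2))))
```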