---

_You are currently looking at **version 1.0** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-machine-learning/resources/bANLa) course resource._

---

### Import required modules and load data file

In [58]:
%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
import seaborn as sns
sns.set_style("darkgrid")
%matplotlib inline

retention_df = pd.read_csv('./data/retention/insurance_cust_retention.csv')

In [59]:
retention_df.head()

Unnamed: 0,Marital Status,AGE,Gender,Car Value,Years of No Claims Bonus,Annual Mileage,Payment Method,Acquisition Channel,Years of Tenure with Current Provider,Price,Actual Change in Price vs last Year,% Change in Price vs last Year,Grouped Change in Price,Renewed?
0,M,45,F,500,4,6000,Monthly,Inbound,4,289.4,-11.94,-3.96%,-0.05,0
1,M,40,M,3000,8,6000,Monthly,Inbound,4,170.4,45.62,37%,0.35,1
2,S,25,F,4000,4,4000,Monthly,Inbound,4,466.1,-123.15,-21%,-0.2,1
3,M,42,M,1800,9,10000,Annual,Inbound,4,245.1,2.34,1%,0.0,1
4,M,59,M,5000,9,3000,Annual,Inbound,4,240.5,42.56,22%,0.2,0


In [60]:
retention_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20020 entries, 0 to 20019
Data columns (total 14 columns):
Marital Status                           20020 non-null object
AGE                                      20020 non-null int64
Gender                                   20020 non-null object
Car Value                                20020 non-null int64
Years of No Claims Bonus                 20020 non-null int64
Annual Mileage                           20020 non-null int64
Payment Method                           20020 non-null object
Acquisition Channel                      20020 non-null object
Years of Tenure with Current Provider    20020 non-null int64
Price                                    20017 non-null float64
Actual Change in Price vs last Year      20020 non-null object
% Change in Price vs last Year           20020 non-null object
Grouped Change in Price                  20020 non-null object
Renewed?                                 20020 non-null int64
dtypes: float6

## Data Cleanup

### Find Duplicate Columns (if any):

In [61]:
# Get column names
column_names = retention_df.columns
# print(column_names)
# Get column data types
retention_df.dtypes

# Also check if the column is unique
for i in column_names:
  print('{} is duplicate: {}'.format(i, retention_df[i].is_unique))

Marital Status is duplicate: False
AGE is duplicate: False
Gender is duplicate: False
Car Value is duplicate: False
Years of No Claims Bonus is duplicate: False
Annual Mileage is duplicate: False
Payment Method is duplicate: False
Acquisition Channel is duplicate: False
Years of Tenure with Current Provider is duplicate: False
Price is duplicate: False
Actual Change in Price vs last Year is duplicate: False
% Change in Price vs last Year is duplicate: False
Grouped Change in Price is duplicate: False
Renewed? is duplicate: False


### Find Missing Values (if any):

In [62]:
retention_df.isnull().sum()

Marital Status                           0
AGE                                      0
Gender                                   0
Car Value                                0
Years of No Claims Bonus                 0
Annual Mileage                           0
Payment Method                           0
Acquisition Channel                      0
Years of Tenure with Current Provider    0
Price                                    3
Actual Change in Price vs last Year      0
% Change in Price vs last Year           0
Grouped Change in Price                  0
Renewed?                                 0
dtype: int64

### Removing the Rows with Missing Values

In [63]:
retention_df = retention_df.dropna(how='any',axis=0)
retention_df.isnull().sum()

Marital Status                           0
AGE                                      0
Gender                                   0
Car Value                                0
Years of No Claims Bonus                 0
Annual Mileage                           0
Payment Method                           0
Acquisition Channel                      0
Years of Tenure with Current Provider    0
Price                                    0
Actual Change in Price vs last Year      0
% Change in Price vs last Year           0
Grouped Change in Price                  0
Renewed?                                 0
dtype: int64

### Finding the Relevance of Features

### Examining the data

In [64]:
# retention_df = pd.concat([retention_df, pd.get_dummies(retention_df['Marital Status'], prefix='Marital Status')], axis=1)
# retention_df = pd.concat([retention_df, pd.get_dummies(retention_df['Gender'], prefix='Gender')], axis=1)
# retention_df = pd.concat([retention_df, pd.get_dummies(retention_df['Payment Method'], prefix='Payment Method')], axis=1)
# retention_df = pd.concat([retention_df, pd.get_dummies(retention_df['Acquisition Channel'], prefix='Acquisition Channel')], axis=1)

In [65]:
retention_df['Actual Change in Price vs last Year'] = pd.to_numeric(retention_df['Actual Change in Price vs last Year'])
retention_df['% Change in Price vs last Year'] = retention_df['% Change in Price vs last Year'].str.rstrip('%').astype('float') / 100.0
retention_df['Grouped Change in Price'] = pd.to_numeric(retention_df['Grouped Change in Price'])

In [66]:
retention_df.head()

Unnamed: 0,Marital Status,AGE,Gender,Car Value,Years of No Claims Bonus,Annual Mileage,Payment Method,Acquisition Channel,Years of Tenure with Current Provider,Price,Actual Change in Price vs last Year,% Change in Price vs last Year,Grouped Change in Price,Renewed?
0,M,45,F,500,4,6000,Monthly,Inbound,4,289.4,-11.94,-0.0396,-0.05,0
1,M,40,M,3000,8,6000,Monthly,Inbound,4,170.4,45.62,0.37,0.35,1
2,S,25,F,4000,4,4000,Monthly,Inbound,4,466.1,-123.15,-0.21,-0.2,1
3,M,42,M,1800,9,10000,Annual,Inbound,4,245.1,2.34,0.01,0.0,1
4,M,59,M,5000,9,3000,Annual,Inbound,4,240.5,42.56,0.22,0.2,0


### Create train-test split

In [67]:
from sklearn import preprocessing

# We use the most significant features to train the model
# X = retention_df.drop(columns=['Marital Status', 'AGE', 'Gender', 'Annual Mileage', 'Payment Method', 'Acquisition Channel', 'Years of Tenure with Current Provider', 'Renewed?'])
X = retention_df.drop(columns=['Grouped Change in Price', '% Change in Price vs last Year', 'Car Value', 'Renewed?'])
# X = retention_df.drop(columns=['Marital Status', 'Gender', 'Years of No Claims Bonus', 'Annual Mileage', 'Payment Method', 'Acquisition Channel', 'Years of Tenure with Current Provider', '% Change in Price vs last Year', 'Renewed?'])
y = retention_df['Renewed?']


# Normalize the data attributes
normalized_X = preprocessing.normalize(X)

# default is 75% / 25% train-test split
X_train, X_test, y_train, y_test = train_test_split(normalized_X, y, test_size=0.2, random_state=0)


### Create classifier object

In [68]:
#Import Random Forest Model
from sklearn.ensemble import RandomForestClassifier

#Create a Gaussian Classifier
clf=RandomForestClassifier(n_estimators=100)

#Train the model using the training sets y_pred=clf.predict(X_test)
clf.fit(X_train,y_train)

y_pred=clf.predict(X_test)

In [57]:
import pandas as pd
feature_imp = pd.Series(clf.feature_importances_,index=['AGE', 'Car Value', 'Years of No Claims Bonus', 'Annual Mileage', 'Years of Tenure with Current Provider', 'Price', 'Actual Change in Price vs last Year', '% Change in Price vs last Year', 'Grouped Change in Price']).sort_values(ascending=False)
feature_imp

Actual Change in Price vs last Year      0.137913
AGE                                      0.122850
Years of No Claims Bonus                 0.118567
Price                                    0.117608
Years of Tenure with Current Provider    0.117099
Annual Mileage                           0.101096
Car Value                                0.100934
% Change in Price vs last Year           0.098632
Grouped Change in Price                  0.085301
dtype: float64

### After training, checking the accuracy using actual and predicted values.

In [69]:
#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.6258741258741258


In [70]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Creating a bar plot
sns.barplot(x=['Marital Status', 'Gender', 'Payment Method', 'Acquisition Channel'], y=feature_imp.index)
# Add labels to your graph
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
plt.legend()
plt.show()

ValueError: Neither the `x` nor `y` variable appears to be numeric.