# Analysis of life insurance data to support a data product proposal for an insurance brokerage

### Author: Robbie Sharma (robbie.sharma@gmail.com)
### Mentor: Hobson Lane (hobsonlane@gmail.com)
### Prepared for: Springboard - Data Science Intensive Course 
### January 2016 - February 2016


## Thanks

I would like to thank Hobson Lane for his patience and support in this project.  In most of the 30-minute mentor sessions we communicated well, and I received a lot of useful information from him on the world of machine learning.  His knowledge, expertise and insane level of intelligence were very helpful in boosting my understanding of the concepts.

# Introduction

Insurance brokers are challenged to retain their client base while growing business in diverse markets.  Personalizing insurance products and providing policies at competitive premiums are their primary goals.  To meet them, brokers need to understand which factors influence claims risk, the future continuance of policies, and the administrative overhead of managing client policies.

For the Data Science Intensive Capstone project, my goal was to understand data wrangling and predictive modelling with machine learning so that I can help my insurance broker client meet or exceed their strategic retention and growth goals.  To achieve this, I analyzed a life insurance data set offered by Prudential Life Insurance through a competition on Kaggle.com.  Using data analytics and machine learning techniques, I studied a structured and fairly "clean" data set in order to predict the ordinal categorical risk rating from profiles of life insurance customers.  The premise of this study is to show my client how insurance data can be used to identify and predict features and trends for policy holders.

Using the life insurance data set, this report shows methodologies and analytical techniques for identifying interesting data features, normalizing and transforming certain features, and applying and tuning simple machine learning algorithms for predictive purposes.

Multiple interviews were conducted with the CEO and CFO of the insurance brokerage in order to understand their business processes, data sources and types, and current management procedures to mitigate the risk of policy cancellations (retention), while expanding their client base.

This report will make useful connections between solving a structured machine learning problem involving the life insurance dataset and how a similar methodology could be applied to the datasets from the insurance brokerage.



## Problem description: Life insurance 

The goal of the life insurance Kaggle competition is to solve a multi-class classification problem by assigning an ordinal categorical risk rating from 1 to 8 to a customer profile.  Accuracy is measured by the quadratic weighted kappa (QWK), the inter-rater agreement between the predicted and actual sets of risk ratings.
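
As a rough illustration of the metric, here is a minimal pure-Python sketch of quadratic weighted kappa. It assumes integer ratings from 1 to 8. The function name `quadratic_weighted_kappa` and the sample ratings are made up for this example; this is not the competition's scoring code or the skll implementation.

```python
def quadratic_weighted_kappa(actual, predicted, min_rating=1, max_rating=8):
    """Agreement between two lists of integer ratings (1.0 means perfect agreement)."""
    n = max_rating - min_rating + 1
    total = len(actual)

    # Observed matrix: how often actual rating i was paired with predicted rating j.
    observed = [[0] * n for _ in range(n)]
    for a, p in zip(actual, predicted):
        observed[a - min_rating][p - min_rating] += 1

    # Marginal histograms of the actual and predicted ratings.
    hist_actual = [sum(row) for row in observed]
    hist_predicted = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2  # quadratic disagreement penalty
            expected = hist_actual[i] * hist_predicted[j] / total  # chance agreement
            numerator += weight * observed[i][j]
            denominator += weight * expected

    if denominator == 0:
        # Both raters used a single identical rating; treat as full agreement.
        return 1.0
    return 1.0 - numerator / denominator


# Illustrative ratings only.
print(quadratic_weighted_kappa([1, 2, 3, 8], [1, 2, 3, 8]))  # identical ratings
print(quadratic_weighted_kappa([1, 8], [8, 1]))              # opposite ratings
```

Because the penalty grows with the square of the distance between ratings, predicting 7 for a true 8 costs far less than predicting 1.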

The data set contains 127 features (minus the ID feature) consisting of categorical (ordinal), continuous (normalized), and discrete feature types. There are 48 Medical Keyword features acting as dummy variables with binary, discrete values. 

The deliverable was a machine learning algorithm for predicting risk response in the Kaggle dataset.

An appropriate process is needed to transform the data and select useful features, implement a machine learning algorithm, and calculate the QWK to measure classification error.  The top competitors achieved QWK scores of 0.67939.

## Problem description: Insurance brokerage

My client is an insurance broker that has been in business for 21 years.  They have collected a significant amount of data on their clients, insurance agencies, products and sales representatives over the past 20 years. Due to confidentiality concerns, I cannot release any data, although the strategies employed will be discussed.

By being able to find risk rating correlations between life insurance applicants, I can use the methods learned in this course and project to facilitate a discussion on how my client’s problem of improving retention can be resolved using similar methods.

I explored their Applied Systems TAMS software (an insurance brokerage management tool).  The tool produces client product summaries, claims revenue, sales summaries, and accounting reports, and it can export to CSV.

The deliverables were a memo outlining an approach to solving my client’s retention problem and a machine learning algorithm for predicting risk response in the Kaggle dataset. A portion of the email memo is attached in the Appendix.

In the context of this project, the goal is to determine what tasks would be useful in a data science project that could help my insurance broker client meet their company goals.  


# Methods & Analysis

The following methods were applied and code was written to analyze the data.  The primary goal was to compare Response ratings to the other features in the data set.

1. Data exploration and plotting
    1. The 48 Medical_Keyword_X columns were summed to derive a total medical keyword count feature.
    1. Histogram plots of all the features were created to perform a preliminary exploration.
    1. The following features were explored in more detail: Ins_Age, Ht, Wt, BMI, Product_Info_2, and Medical_Keyword_Sum.
    1. Histogram plots and a scatter matrix plot were saved for the features listed above.
1. Data transformation and normalization
    1. Replaced alphanumeric labels in Product_Info_2 with an enumerated dictionary of dummy integers.
    1. Replaced all the NaNs in the dataset with -1.
    1. Applied min-max normalization to the data sets.
        1. Risk rating was normalized to values between 0 and 1.
        1. Product_Info_2 was normalized to values between 0 and 1.
        1. Categorical and discrete data sets were normalized to values between 0 and 1. Elements with a -1 or NaN were normalized to 0.
1. Machine learning
    1. Training/test set
        1. 10% of the train.csv data set was used as a testing set and 90% as a training set.
    1. Testing and fitting using the following models
        * Linear model - Lasso
        * Random forest
    1. Evaluation
        * The quadratic weighted kappa function from the skll library was used to measure classification error.
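
The exploration and transformation steps above can be sketched with pandas roughly as follows. This is a minimal sketch on a toy data frame, not the project notebook: the sample values, the helper name `min_max`, and the choice to sort labels before enumerating them are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.csv; the real file has 48 Medical_Keyword_X columns.
df = pd.DataFrame({
    "Product_Info_2": ["D3", "A1", "D3", "E1"],
    "Ins_Age": [0.2, np.nan, 0.6, 0.9],
    "Medical_Keyword_1": [0, 1, 0, 1],
    "Medical_Keyword_2": [1, 1, 0, 0],
    "Response": [8, 1, 4, 6],
})

# Sum the binary keyword columns into a single count feature.
keyword_cols = [c for c in df.columns if c.startswith("Medical_Keyword_")]
df["Medical_Keyword_Sum"] = df[keyword_cols].sum(axis=1)

# Map each alphanumeric label to an integer through an enumerated dictionary.
labels = sorted(df["Product_Info_2"].unique())
label_to_int = {label: i for i, label in enumerate(labels)}
df["Product_Info_2"] = df["Product_Info_2"].map(label_to_int)

# Replace missing values with -1.
df = df.fillna(-1)


def min_max(series):
    """Scale a column to [0, 1]; a constant column maps to all zeros."""
    span = series.max() - series.min()
    if span == 0:
        return series * 0.0
    return (series - series.min()) / span


for col in ["Response", "Product_Info_2"]:
    df[col] = min_max(df[col])

print(df)
```

Note that once NaNs are replaced with -1, min-max scaling maps those entries to 0 whenever -1 is the smallest value in the column.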
           

# Data exploration


## Descriptive statistics

There are 59381 rows in the dataset. Descriptive statistics are given in the following notebook: capstone-data-story-project.ipynb

## Scatter matrix

The following is observed in the scatter matrix:

* 

![caption](images/scatter_matrix/Response_scatter_matrix_2016-03-06.png)

## Histograms

Histograms of selected features were created at each response rating (1-8).  The following was observed:

* 

### 'Ht' feature 

![caption](images/Ht/1_hist_Response_Ht_-2016-03-06.png)
![caption](images/Ht/2_hist_Response_Ht_-2016-03-06.png)
![caption](images/Ht/3_hist_Response_Ht_-2016-03-06.png)
![caption](images/Ht/4_hist_Response_Ht_-2016-03-06.png)
![caption](images/Ht/5_hist_Response_Ht_-2016-03-06.png)
![caption](images/Ht/6_hist_Response_Ht_-2016-03-06.png)
![caption](images/Ht/7_hist_Response_Ht_-2016-03-06.png)
![caption](images/Ht/8_hist_Response_Ht_-2016-03-06.png)

### 'BMI' feature 


![caption](images/BMI/1_hist_Response_BMI_-2016-03-06.png)
![caption](images/BMI/2_hist_Response_BMI_-2016-03-06.png)
![caption](images/BMI/3_hist_Response_BMI_-2016-03-06.png)
![caption](images/BMI/4_hist_Response_BMI_-2016-03-06.png)
![caption](images/BMI/5_hist_Response_BMI_-2016-03-06.png)
![caption](images/BMI/6_hist_Response_BMI_-2016-03-06.png)
![caption](images/BMI/7_hist_Response_BMI_-2016-03-06.png)
![caption](images/BMI/8_hist_Response_BMI_-2016-03-06.png)

### 'Ins_Age' feature 
![caption](images/Ins_Age/1_hist_Response_Ins_Age_-2016-03-06.png)
![caption](images/Ins_Age/2_hist_Response_Ins_Age_-2016-03-06.png)
![caption](images/Ins_Age/3_hist_Response_Ins_Age_-2016-03-06.png)
![caption](images/Ins_Age/4_hist_Response_Ins_Age_-2016-03-06.png)
![caption](images/Ins_Age/5_hist_Response_Ins_Age_-2016-03-06.png)
![caption](images/Ins_Age/6_hist_Response_Ins_Age_-2016-03-06.png)
![caption](images/Ins_Age/7_hist_Response_Ins_Age_-2016-03-06.png)
![caption](images/Ins_Age/8_hist_Response_Ins_Age_-2016-03-06.png)

### 'Product_Info_2' feature
![caption](images/Product_Info_2/1_hist_Response_Product_Info_2_-2016-03-06.png)
![caption](images/Product_Info_2/2_hist_Response_Product_Info_2_-2016-03-06.png)
![caption](images/Product_Info_2/3_hist_Response_Product_Info_2_-2016-03-06.png)
![caption](images/Product_Info_2/4_hist_Response_Product_Info_2_-2016-03-06.png)
![caption](images/Product_Info_2/5_hist_Response_Product_Info_2_-2016-03-06.png)
![caption](images/Product_Info_2/6_hist_Response_Product_Info_2_-2016-03-06.png)
![caption](images/Product_Info_2/7_hist_Response_Product_Info_2_-2016-03-06.png)
![caption](images/Product_Info_2/8_hist_Response_Product_Info_2_-2016-03-06.png)

### 'Medical Keyword Sum' feature
![caption](images/Medical_Keyword_Sum/1_hist_Response_Medical_Keyword_Sum_-2016-03-06.png)
![caption](images/Medical_Keyword_Sum/2_hist_Response_Medical_Keyword_Sum_-2016-03-06.png)
![caption](images/Medical_Keyword_Sum/3_hist_Response_Medical_Keyword_Sum_-2016-03-06.png)
![caption](images/Medical_Keyword_Sum/4_hist_Response_Medical_Keyword_Sum_-2016-03-06.png)
![caption](images/Medical_Keyword_Sum/5_hist_Response_Medical_Keyword_Sum_-2016-03-06.png)
![caption](images/Medical_Keyword_Sum/6_hist_Response_Medical_Keyword_Sum_-2016-03-06.png)
![caption](images/Medical_Keyword_Sum/7_hist_Response_Medical_Keyword_Sum_-2016-03-06.png)
![caption](images/Medical_Keyword_Sum/8_hist_Response_Medical_Keyword_Sum_-2016-03-06.png)







## Interesting graphs

![caption](images/hist_Product_Info_2.png)

* Shows an exponential distribution of categories.
* Categories should be rearranged to make the distribution closer to normal.
* D3 is a very common occurrence. Why?


![caption](images/hist_product_info_4.png)

* Exponential distribution of categories. Why?
* Can also be rearranged.

![caption](images/hist_response.png)
![caption](images/hist_norm_Response.png)


## Other graphs
![caption](images/hist_norm_Product_Info_1.png)
![caption](images/hist_norm_Product_Info_3.png)
![caption](images/hist_norm_Product_Info_5.png)
![caption](images/hist_norm_Product_Info_6.png)
![caption](images/hist_norm_Product_Info_7.png)


![caption](images/hist_product_info_1.png)
![caption](images/hist_Product_Info_3.png)
![caption](images/hist_Product_Info_5.png)
![caption](images/hist_Product_Info_6.png)
![caption](images/hist_Product_Info_7.png)





# Kaggle competition

The highest QWK score on the leaderboard was 0.67939.

My submission placed 2371/2619, which showed that you cannot simply plug and play an algorithm to win a Kaggle competition.


# Machine Learning Algorithms


Lasso linear modelling - measuring sparse coefficients.
- Tuned with alpha values from 0.0001 to 0.099 over two test runs.
- A very low alpha suggests high variance and that the algorithm is being overfitted.
- The features need to be reduced to make the algorithm more generalized.

Regularization reduces the overfitting problem.

L1 regularization - Lasso regression - penalizes the L1 norm of the coefficients.  A higher alpha shrinks more coefficients to zero, while a low alpha applies little shrinkage.

The alpha variable controls the strength of regularization in the linear regression model.

(Favoured) L2 regularization - spreads out the shrinkage so that interdependent variables have similar influence.
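
For reference, the two penalties can be written out explicitly. The form below follows the objectives documented for scikit-learn's `Lasso` and `Ridge` estimators. The symbols are introduced here for illustration: $w$ is the coefficient vector, $X$ the feature matrix, $y$ the target, $n$ the number of samples, and $\alpha$ the regularization strength.

$$
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1 \qquad \text{(Lasso, L1)}
$$

$$
\min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2 \qquad \text{(Ridge, L2)}
$$

In both cases a larger $\alpha$ penalizes the coefficients more heavily. With the L1 penalty some coefficients are driven exactly to zero, which is why Lasso produces sparse coefficients.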

Random Forest Classifier.
- Tuned with 1 to 900 estimators over two test runs.
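
A minimal sketch of the two tuning sweeps described above, assuming scikit-learn and synthetic stand-in data rather than the Kaggle features. The project scored with the skll kappa function; this sketch swaps in scikit-learn's `cohen_kappa_score` with quadratic weights. The helper `qwk`, the constant `RATINGS`, and the sweep values are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Lasso
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split

RATINGS = list(range(1, 9))  # ordinal risk ratings 1..8
rng = np.random.RandomState(0)

# Synthetic stand-in for the insurance features (not the Kaggle data).
X = rng.rand(500, 10)
signal = 5 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=500)
y = np.clip(np.round(signal + 1), 1, 8).astype(int)

# 90% training / 10% testing split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)


def qwk(y_true, y_pred):
    """Quadratic weighted kappa over the fixed 1..8 rating scale."""
    return cohen_kappa_score(y_true, y_pred, labels=RATINGS, weights="quadratic")


lasso_results = []
for alpha in np.linspace(0.0001, 0.0099, 5):
    model = Lasso(alpha=alpha).fit(X_train, y_train)
    # Lasso is a regressor, so round and clip its output back to the 1..8 ratings.
    y_pred = np.clip(np.round(model.predict(X_test)), 1, 8).astype(int)
    lasso_results.append((alpha, qwk(y_test, y_pred)))

forest_results = []
for n_estimators in (1, 11, 21):
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    model.fit(X_train, y_train)
    forest_results.append((n_estimators, qwk(y_test, model.predict(X_test))))

for alpha, kappa in lasso_results:
    print(f"Lasso  alpha={alpha:.4f}  kappa={kappa:.3f}")
for n_estimators, kappa in forest_results:
    print(f"Forest est={n_estimators}  kappa={kappa:.3f}")
```

Passing `labels=RATINGS` keeps the weight matrix tied to the full 1-8 scale even if some ratings are missing from a small test split.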


## ML Results

### alpha/time vs. kappa: linear lasso - test 1

| | alpha | kappa | time |
|---|---|---|---|
| count | 50.000000 | 50.000000 | 50.00000 |
| mean | 0.050000 | 0.162239 | 0.47140 |
| std | 0.029155 | 0.086191 | 0.09196 |
| min | 0.001000 | 0.111002 | 0.36900 |
| 25% | 0.025500 | 0.116683 | 0.41125 |
| 50% | 0.050000 | 0.126014 | 0.44200 |
| 75% | 0.074500 | 0.146227 | 0.49200 |
| max | 0.099000 | 0.461167 | 0.79400 |

![caption](images/scatterLasso_alpha_kappa_test1.png)

### alpha/time vs. kappa: linear lasso - test 2

| | alpha | kappa | time |
|---|---|---|---|
| count | 50.000000 | 50.000000 | 50.000000 |
| mean | 0.005000 | 0.386544 | 1.049080 |
| std | 0.002915 | 0.049576 | 2.596456 |
| min | 0.000100 | 0.309687 | 0.524000 |
| 25% | 0.002550 | 0.340325 | 0.618250 |
| 50% | 0.005000 | 0.384313 | 0.651500 |
| 75% | 0.007450 | 0.430685 | 0.680500 |
| max | 0.009900 | 0.473927 | 19.014000 |

![caption](images/scatterLasso_alpha_kappa_test2.png)

### Estimators/time vs. kappa: random forest - test 1

| | est | kappa | time |
|---|---|---|---|
| count | 10.000000 | 10.000000 | 10.000000 |
| mean | 46.000000 | 0.337425 | 19.335500 |
| std | 30.276504 | 0.035419 | 13.075482 |
| min | 1.000000 | 0.259207 | 0.530000 |
| 25% | 23.500000 | 0.325625 | 9.757500 |
| 50% | 46.000000 | 0.346819 | 18.733000 |
| 75% | 68.500000 | 0.358949 | 28.620500 |
| max | 91.000000 | 0.376106 | 41.517000 |

![caption](images/RFC_scatter_alpha_kappa_test1.png)


### Estimators/time vs. kappa: random forest - test 2

| | est | kappa | time |
|---|---|---|---|
| count | 9.000000 | 9.000000 | 9.000000 |
| mean | 500.000000 | 0.357689 | 210.043556 |
| std | 273.861279 | 0.006708 | 114.229644 |
| min | 100.000000 | 0.346036 | 42.440000 |
| 25% | 300.000000 | 0.357782 | 123.817000 |
| 50% | 500.000000 | 0.358111 | 214.093000 |
| 75% | 700.000000 | 0.362640 | 291.163000 |
| max | 900.000000 | 0.366685 | 356.299000 |

![caption](images/RFC_scatter_alpha_kappa_test2.png)


## ML Evaluation criteria

Accuracy_score, mean squared error and quadratic kappa were used. 


Algorithm placed 2300/2619



# Conclusion

* Learned the difficulty in processing the data
* Collected a lot of good code snippets to aid in future work
* The Lasso linear model had a higher QWK score (maximum 0.474) than the random forest (maximum 0.376), although the low alpha value of 0.0099 suggests the algorithm is over-fitting the data.


## Lessons learned

* When installing packages in a Windows-based Anaconda environment, use the 'conda install' command rather than 'pip install'.  I installed the 'skll' package using pip and my package environment was compromised, so I had to manually remove and reinstall many packages to work out the module errors in the code.

* To further improve the algorithm I would perform the following:
    * Split the data set into 70% train and 30% test.
    * Select a subset of features to train the models.  Product_Info_2, Product_Info_4, Ins_Age and Medical_Keyword_Sum look like promising base features to include.  Employment_Info and Insurance_History would need to be explored in more detail.
    * Use LassoCV to perform further cross-validation checks for linear regression.
    * Explore boosting methods such as AdaBoost and gradient boosting.
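
A minimal sketch of the first and third follow-up steps listed above (a 70/30 split and cross-validated alpha selection), assuming scikit-learn's `LassoCV` and synthetic stand-in data. The data shape and coefficients are invented for illustration and say nothing about the insurance features.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)

# Synthetic data: only the first two of six columns carry signal.
X = rng.rand(300, 6)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

# 70% training / 30% testing split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# LassoCV picks alpha by 5-fold cross-validation on the training set only.
model = LassoCV(cv=5, random_state=0).fit(X_train, y_train)

print("chosen alpha:", model.alpha_)
print("coefficients:", model.coef_)
print("test R^2:", model.score(X_test, y_test))
```

Choosing alpha by cross-validation avoids tuning it against the same held-out set that is later used to report the score.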
 

## On the algorithm

* Tune the random forest: reduce the number of branches for faster processing and use more estimators.
* Tune Lasso with alpha values of a higher order of magnitude.
* Create scatter plots of Response vs. each feature.
* Replace NaN with 0.  Disregard NaNs in the data set.


## With the client

* Give more visuals
* Reduce technical jargon
* Estimate work and present proposal fast


# References

http://www.analyticsvidhya.com/blog/2015/08/common-machine-learning-algorithms/

http://scikit-learn.org/stable/modules/linear_model.html#lasso

https://www.kaggle.com/c/prudential-life-insurance-assessment

# Data Science Intensive: Capstone Milestone Report

## Background

I decided to pursue a hybrid of two projects: analyzing life insurance applicant data to predict risk rating, and preparing a data analytics project proposal for an insurance brokerage.

The deliverables would be:

* Prediction algorithms to determine risk rating for life insurance applicants.  
* Project proposal for insurance brokerage data analytics project.


### Life Insurance Deep Dive



 * Product Info
   * Product_Info_1 has 2 categories.  (Category, row count) = [(1, 57816), (2, 1565)]
   * Product_Info_2 has 
   

### Insurance brokerage Deep Dive

After reviewing the client's database, I arrived at the following workflow and processes to assist my client in meeting their business goals.

1. Data cleanup/transformation
    1. Observed duplicates, missing data, information not properly filled in, etc.
    1. Need to investigate how to perform mass changes on the platform and what needs to be changed
    1. Need to investigate whether the SQL database can be queried directly or whether there is an API to connect to
2. Data exploration
    1. Perform ETL processes on TAM data using Python
    1. Identification of data types (continuous, discrete, categorical, etc.)
    1. Identification of data features related to retention and cross-selling goals
3. Data analytics
    1. Basic descriptive statistics on
        1. products
        1. representatives
        1. sales activities
        1. claims losses
        1. premium revenues
    1. Basic tables/charts -> top 20%, histograms, pie charts
    1. Retention rates of different premium brackets
        1. New policies/Total policies, Lost policies/Total policies
    1. Customer segments (building 1st and 2nd order models)
        1. Premium brackets
        1. Combinations of metadata
        1. Income bracket, postal code, city, province, gender, age, personal? commercial? both?
    1. Discrete, continuous, and categorical time series signatures of “customer features”
    1. Experimentation with machine learning and predictive models: simple linear regression, SVM, decision trees
4. Data visualization
    1. Excel, Qlikview, or Tableau dashboards… TBD after exploration and further needs assessments
5. Management consulting
    1. Recommending reporting, decision-making and operating procedures/policies on retention and product cross-selling
    1. Identifying an appropriate reporting and analytics toolchain and workflow for the company
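
The retention ratios in the data analytics step above (new policies/total policies and lost policies/total policies per premium bracket) could be computed along these lines. This is a hypothetical sketch: the records, the bracket names, and the status values are invented, since the client data cannot be released and the TAM export format is not shown here.

```python
from collections import defaultdict

# Hypothetical policy records: (premium_bracket, status), where status is
# "new", "renewed", or "lost" for the period under review.
policies = [
    ("low", "renewed"), ("low", "lost"), ("low", "new"), ("low", "renewed"),
    ("high", "renewed"), ("high", "renewed"), ("high", "new"), ("high", "lost"),
    ("high", "renewed"),
]

# Count policies per status within each premium bracket.
counts = defaultdict(lambda: defaultdict(int))
for bracket, status in policies:
    counts[bracket][status] += 1


def bracket_rates(status_counts):
    """Return new and lost policies as fractions of all policies in a bracket."""
    total = sum(status_counts.values())
    return {
        "new_rate": status_counts["new"] / total,
        "lost_rate": status_counts["lost"] / total,
    }


rates = {bracket: bracket_rates(c) for bracket, c in counts.items()}

for bracket, r in rates.items():
    print(bracket, r)
```

The same grouping could later be keyed on other customer segments, such as postal code or product type, once the exported data is available.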

# Capstone milestone report questions

1. An introduction to the problem: What is the problem? Who is the Client? (Feel free to reuse points 1-2 from your proposal document)
1.  A deeper dive into the data set:
  1. What important fields and information does the data set have?	
  1. What are its limitations i.e. what are some questions that you cannot answer with this data set?
  1. What kind of cleaning and wrangling did you need to do?
  1. Are there other datasets you can find, use and combine with, to answer the questions that matter?
  1.  Any preliminary exploration you’ve performed and your initial findings. Test the hypotheses one at a time. Often, the data story emerges as a result of a sequence of testing hypothesis e.g. You first tested if X was true, and because it wasn't, you tried Y, which turned out to be true.
  1.  Based on these findings, what approach are you going to take? How has your approach changed from what you initially proposed, if applicable?








# CAPSTONE PROPOSAL - DISREGARD - USED AS REFERENCE

I would perform machine learning analysis on an existing life insurance applicant data set and take the patterns learned to the brokerage project.  The problem I want to solve is to create a machine learning algorithm that predicts risk response based on a trained classifier set. A goal for their project would be to understand how retention rates differ for different product and customer segments and how best to improve the retention rates for the upcoming months.  Determining their high value clients and the ones most likely not to renew would form a significant part of their retention focused strategy.

The insurance client I have is interested in retention and growing their business, and wanted a better understanding of the data they had stored.  As a project, I wanted to analyze an insurance data set offered on Kaggle.com by Prudential Life Insurance.  The premise of the competition was to predict the risk response of a client based on a normalized dataset of current clients.  

## Predicting risk rating for life insurance applicants - Capstone project

The normalized dataset contains continuous data based on height, age, and BMI, and categorical data based on risk response rating (1-8), medical histories, etc.

The main dependent variable is the Risk Response (1-8).

### Project Goals
The problem I want to solve is to create a machine learning algorithm that predicts risk response based on a trained classifier set.



## Predicting retention risk for insurance brokerage - Brokerage project

My client is an insurance broker that has been in business for 21 years.  They have collected a significant amount of data on their clients, insurance agencies, products and sales representatives over the past 20 years. Due to confidentiality concerns, I cannot release any data, although the strategies employed will be discussed.

By being able to find risk rating correlations between life insurance applicants, I can use the methods learned in this course and project to facilitate a discussion on how my client’s problem of improving retention can be resolved using similar methods.

I will be exploring their Applied Systems TAMS software (an insurance brokerage management tool), reviewing the various reports it produces, and determining an action plan for a data analytics project.  This data can be exported into CSV format.
The deliverables would be a memo outlining an approach to solving my client’s retention problem and a machine learning algorithm for predicting risk response in the Kaggle dataset. 


### Project Goals

A goal for their project would be to understand how retention rates differ for different product and customer segments and how best to improve the retention rates for the upcoming months.  Determining their high value clients and the ones most likely not to renew would form a significant part of their retention focused strategy.  



## Capstone project outline

### Theme

My capstone project will be a hybrid of two different projects.  The life insurance project will be to implement machine learning algorithms to predict risk rating.  The brokerage project will be to outline a proposal for a data analytics project based on my initial exploration study of their systems.  My reasoning is that patterns found in the life insurance data could be useful in the brokerage project.  I will attempt to secure a paid work project based on what is learned in this course.

### Data collection

#### Life insurance data set - Kaggle
The Kaggle dataset is already fairly clean.  It does require separation and exploration of the data into different categories.

#### Insurance brokerage data - Applied Systems TAMS
I need to interview the key executives at the company, particularly the controller, to understand what reports they use and how they use them to make decisions.  I also need to understand how their data is collected and whether there are any problems in data entry.  I need to present a data project proposal that can help meet their retention and growth requirements.  Information will also be collected on how their data can be exported and which reports are used most, in order to work out a plan to extract, transform and load the datasets into something useful.

### Deliverables

  1.  Machine learning algorithm for predicting risk rating in life insurance applicants. iPython Notebook
  2.  Project proposal for insurance brokerage on data analytics project improving retention rates and growth opportunities at  the firm. iPython Notebook
  3.  Slide deck

### Capstone questions

Here are the evaluation criteria your mentor will use to grade the final submission:

* The technical quality of the work (50 points)
    * Does the technical material make sense?
    * Are the methods tried reasonable?
    * Are the proposed algorithms or applications clever and interesting?
* Problem significance (15 points)
    * Did the student choose an interesting or a “real” problem to work on, or only a small “toy” problem?
    * Are the selected data sets for the problem “real”? Did the student do a good job with data acquisition, wrangling and cleaning?
    * Is this work likely to be useful and/or have impact?
* Storytelling (25 points)
    * Is there a clear, compelling, logical flow to the report and slide deck?
    * Does it clearly motivate the problem, describe the solution and present the results as they’d be presented to the client?
    * Is it clear what the client can do with the analysis - what is it that they can now do or decide differently than before the analysis?
* Deliverables (10 points)
    * Did the student meet all the deadlines agreed to, and turn in all the material as expected (code on github, reports, slide decks etc)?
    * Is the code well-documented?
    * 5 points extra credit: Did the student send the report or present it to the client? If the client did respond, how did they receive it?