![BAIME banner](https://user-images.githubusercontent.com/47600826/89530907-9b3f6480-d7ef-11ea-9849-27617f6025cf.png)

# Customer Lifetime Value prediction

![CLV](https://www.searchwarrant.ca/wp-content/uploads/sites/99/2020/04/LTV_ConversionRate_Part2800x420.png)

# The problem

In this notebook we look at data from this [Kaggle dataset](https://www.kaggle.com/saniyajaswani/credit-card-data).
It covers customer lifetime value for car insurance customers.

Customer Lifetime Value (CLV) refers to the net profit attributed to the entire future relationship with a customer.
A bank will use different predictive analytics approaches to predict the revenue that can be generated from any customer in the future.
This helps banks segment their customers into specific groups based on their CLV.

Identifying customers with a high future value enables the organization to maintain good relationships with them.
This can be done by investing more time and resources in them, such as better prices, offers, discounts, customer care services, etc.

Finding and engaging reliable and profitable customers has always been a great challenge for banks.
With increasing competition, banks need to keep track of their customers' activity to use their resources effectively.

To solve this problem, data science is used in banking to extract actionable insights about customer behavior and expectations.
Using data science models to predict the CLV of a customer helps a bank make suitable decisions for its growth and profit.


![CLV](https://2112leafletdistribution.co.uk/wp-content/uploads/2018/03/CLV.png)

# Import the important libraries / packages
These packages are needed to load and use the dataset

In [None]:
import pandas as pd #we use this to load, read and transform the dataset
import numpy as np #we use this for numerical operations such as the log transform
import matplotlib.pyplot as plt #we use this to visualize the dataset
import seaborn as sns #we use this for the heatmap and the distribution plots
import sklearn.metrics as sklm #we use this to evaluate the models

# Load and explore the dataset
The data is all in one CSV file. In this next step I will first load the data to see what it looks like.

In [None]:
#here we load the data
data = pd.read_csv('/kaggle/input/credit-card-data/Fn-UseC_-Marketing-Customer-Value-Analysis.csv')

#and immediately I would like to see what this dataset looks like
data.head()

In [None]:
#now let's look closer at the dataset we got
data.info()

It seems that we have a lot of text / category information (these are of the Dtype 'object') and a few numerical columns (Dtypes 'int64' and 'float64'). 

The column 'Customer Lifetime Value' is the column we would like to predict. 

In [None]:
data.shape

The dataset consists of 9134 rows and 24 columns. 

In [None]:
data.describe()

It seems that we have some strange outliers for the CLV and claim amounts. We will look at these and handle them later on.

In [None]:
data.describe(include='O')

In [None]:
#Let's see which values occur in the text columns (the objects) that have only a few options
print('Response: '+ str(data['Response'].unique()))
print('Coverage: '+ str(data['Coverage'].unique()))
print('Education: '+ str(data['Education'].unique()))
print('Employment Status: '+ str(data['EmploymentStatus'].unique()))
print('Gender: ' + str(data['Gender'].unique()))
print('Location Code: ' + str(data['Location Code'].unique()))
print('Marital Status: ' + str(data['Marital Status'].unique()))
print('Policy Type: ' + str(data['Policy Type'].unique()))
print('Vehicle Size: ' + str(data['Vehicle Size'].unique()))

# Customer Lifetime Value 

As Customer Lifetime Value is the column we want to predict, let's explore this column in the dataset.

The formula to calculate the CLV:

![CLV formula](https://d35fo82fjcw0y8.cloudfront.net/2018/08/30131556/calculation-for-customer-lifetime-value.jpg)

In [None]:
#As this is a numeric, thus continuous number, I will use a histogram to look at its distribution.
plt.hist(data['Customer Lifetime Value'], bins = 10)
plt.title("Customer Lifetime Value") #Assign title 
plt.xlabel("Value") #Assign x label 
plt.ylabel("Customers") #Assign y label 
plt.show()

In [None]:
plt.boxplot(data['Customer Lifetime Value'])

In [None]:
#We see that there are some large outliers here.
#let's look closer at the outliers over 50000
outliers = data[data['Customer Lifetime Value'] > 50000]
outliers.head(25)

In [None]:
outliers.info()

Only 20 of the 9134 rows have a lifetime value of more than 50000.
We will leave these as they are for now.
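
The 50000 cut-off used above is a manual choice. As a rough cross-check, a common rule of thumb flags everything above Q3 + 1.5 * IQR. The sketch below applies that rule to synthetic, right-skewed data; `clv_demo` and `iqr_outlier_mask` are hypothetical names, and the lognormal parameters are assumptions, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed stand-in for the 'Customer Lifetime Value' column (assumption)
rng = np.random.default_rng(0)
clv_demo = pd.Series(rng.lognormal(mean=8.7, sigma=0.65, size=9134))

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking values above Q3 + k * IQR."""
    q1, q3 = values.quantile([0.25, 0.75])
    upper_fence = q3 + k * (q3 - q1)
    return values > upper_fence

mask = iqr_outlier_mask(clv_demo)
print("Rows flagged as high outliers:", int(mask.sum()))
print("Share of all rows:", round(float(mask.mean()), 4))
```

On the real data this would be called as `iqr_outlier_mask(data['Customer Lifetime Value'])`. For a skewed column the rule usually flags far more rows than a hand-picked threshold, so it is a starting point rather than a verdict.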

# Handling missing values
Let's continue with handling the missing values in this dataset.
First we check where they are and how many there are.

In [None]:
#let's see which columns contain missing values
data.isnull().sum().sort_values(ascending = False)

There seem to be no missing values in this dataset. 

## Making the text columns Numeric
Before we can use the columns further on, all of them need to be numeric.
This is what I will do now.

In [None]:
#First we drop the customer column, as this is a unique identifier and will bias the model
data = data.drop(labels = ['Customer'], axis = 1)

In [None]:
#let's load the required packages
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [None]:
# Let's encode the categorical variables as integer codes
column_names = ['Response', 'Coverage', 'Education', 
                     'Effective To Date', 'EmploymentStatus', 
                     'Gender', 'Location Code', 'Marital Status',
                     'Policy Type', 'Policy', 'Renew Offer Type',
                     'Sales Channel', 'Vehicle Class', 'Vehicle Size', 'State']

for col in column_names:
    data[col] = le.fit_transform(data[col])
    
data.head()

In [None]:
data.dtypes

The chi-squared feature selection used below expects a discrete target and cannot handle a float one, so we will change the float columns to integers.

In [None]:
data['Customer Lifetime Value'] = data['Customer Lifetime Value'].astype(int)
data['Total Claim Amount'] = data['Total Claim Amount'].astype(int)
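
The label encoding applied to the text columns above turns every category into an integer code. A minimal sketch of what that mapping looks like, next to one-hot encoding as an alternative, is shown below. The toy frame `demo` and its category values are illustrative assumptions. Note that scikit-learn documents `LabelEncoder` as intended for target labels, and that integer codes imply an order between categories that a linear model will treat as real.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame; the category values here are illustrative only
demo = pd.DataFrame({"Coverage": ["Basic", "Extended", "Premium", "Basic"]})

# LabelEncoder assigns integer codes in sorted order of the labels
encoder = LabelEncoder()
demo["Coverage_code"] = encoder.fit_transform(demo["Coverage"])
code_map = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(code_map)

# One-hot encoding avoids implying an order between categories
one_hot = pd.get_dummies(demo["Coverage"], prefix="Coverage")
print(one_hot.columns.tolist())
```

One-hot encoding adds one column per category, which is harmless for columns with a handful of values but can get wide for a column such as 'Effective To Date'.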


# Most important features
Let's continue by looking at the most important features according to two different tests.
Then we will use the top ones to train and test our first model.

In [None]:
#First we need to split the dataset into the target column (y) and the independent columns (X).
#This is needed because the model uses the X columns to predict y.

y = data['Customer Lifetime Value'] #the column we want to predict 
X = data.drop(labels = ['Customer Lifetime Value'], axis = 1)  #independent columns 
 

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

#apply the SelectKBest class to score all features (k='all'); the 10 best are printed below
bestfeatures = SelectKBest(score_func=chi2, k='all')
fit = bestfeatures.fit(X,y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)
#concat two dataframes for better visualization 
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Name of the column','Score']  #naming the dataframe columns
print(featureScores.nlargest(10,'Score'))  #print 10 best features

In [None]:
#get the correlations between all features in the dataset
corrmat = data.corr()
plt.figure(figsize=(20,10))

#plot heat map
g = sns.heatmap(corrmat, annot=True, cmap="RdYlGn")
plt.show()

What pops out when looking at the correlations with the CLV are the columns 'Monthly Premium Auto' and 'Total Claim Amount'.
These might be the best features to use.

It seems that the two feature selection methods differ a bit in which feature is the most important.
For the first test I will keep:
- Total Claim Amount (high in both tests)
- Monthly Premium Auto (high in both tests and the highest in the correlation)
- Income (high in both tests)
- Months Since Policy Inception (high in the best features test)
- Coverage (high in the correlation)
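
One caveat on the best features test used above: scikit-learn's `chi2` is designed for non-negative features and a classification target, so every distinct CLV value is treated as its own class. For a continuous target, `f_regression` is a commonly used alternative. The sketch below shows it on synthetic data; the column names `premium`, `claims` and `noise` and all coefficients are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: only 'premium' and 'claims' actually drive the target (assumption)
rng = np.random.default_rng(1)
X_demo = pd.DataFrame({
    "premium": rng.normal(90, 30, 500),
    "claims": rng.normal(400, 150, 500),
    "noise": rng.normal(0, 1, 500),
})
y_demo = 80 * X_demo["premium"] + 5 * X_demo["claims"] + rng.normal(0, 500, 500)

# f_regression scores each feature with a univariate linear F-test against y
selector = SelectKBest(score_func=f_regression, k=2).fit(X_demo, y_demo)
scores = pd.Series(selector.scores_, index=X_demo.columns).sort_values(ascending=False)
print(scores)

selected = list(X_demo.columns[selector.get_support()])
print("Selected:", selected)
```

In the notebook, the same call would take `X` and `y` as defined earlier. Since `f_regression` only captures linear relationships, `mutual_info_regression` from the same module may be worth trying as well.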


# Machine learning Model
We want to predict a continuous number, so this is a regression problem. We start with a linear regression model.

In [None]:
from sklearn.linear_model import LinearRegression

## Split the dataset in train and test
Before we use the chosen model, we will first split the dataset into a train and a test set.
We do this because we want to train the model on the training set and check its performance on data it has not seen.

In [None]:
from sklearn.model_selection import train_test_split

#First try with the 5 most important features
X_5 = data[['Total Claim Amount', 'Monthly Premium Auto', 'Income', 'Coverage', 'Months Since Policy Inception']] #independent columns chosen 
y = data['Customer Lifetime Value']    #target column 

#I want to withhold 30 % of the dataset to perform the tests
X_train, X_test, y_train, y_test= train_test_split(X_5,y, test_size=0.3 , random_state = 25)

In [None]:
print('Shape of X_train is: ', X_train.shape)
print('Shape of X_test is: ', X_test.shape)
print('Shape of y_train is: ', y_train.shape)
print('Shape of y_test is: ', y_test.shape)

In [None]:
#To check the model, I want a helper that prints the usual regression metrics
import math
def print_metrics(y_true, y_predicted, n_parameters):
    ## First compute R^2 and the adjusted R^2
    r2 = sklm.r2_score(y_true, y_predicted)
    r2_adj = r2 - (n_parameters - 1)/(y_true.shape[0] - n_parameters) * (1 - r2)
    
    ## Print the usual metrics and the R^2 values
    print('Mean Square Error      = ' + str(sklm.mean_squared_error(y_true, y_predicted)))
    print('Root Mean Square Error = ' + str(math.sqrt(sklm.mean_squared_error(y_true, y_predicted))))
    print('Mean Absolute Error    = ' + str(sklm.mean_absolute_error(y_true, y_predicted)))
    print('Median Absolute Error  = ' + str(sklm.median_absolute_error(y_true, y_predicted)))
    print('R^2                    = ' + str(r2))
    print('Adjusted R^2           = ' + str(r2_adj))
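
The adjusted R^2 line in `print_metrics` is an algebraic rearrangement of the textbook formula. Writing n for the number of rows in `y_true` and k for `n_parameters`, and assuming k counts the intercept as well as the feature coefficients, the two forms agree:

```latex
\bar{R}^2
  = 1 - (1 - R^2)\,\frac{n - 1}{n - k}
  = 1 - (1 - R^2)\,\frac{(n - k) + (k - 1)}{n - k}
  = R^2 - \frac{k - 1}{n - k}\,(1 - R^2)
```

Under that reading, a model with 5 features would be passed `n_parameters = 6`. With several thousand rows the difference from passing 5 is very small.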
   


## Linear Regression on 5 features
Let's try the model

In [None]:
# Linear regression model
model_5 = LinearRegression() 
model_5.fit(X_train, y_train)

In [None]:
Predictions = model_5.predict(X_test)
print_metrics(y_test, Predictions, 5)

Hmmm, that is not a good result: with an R^2 of just over 0.14, the model explains only about 14% of the variance in the CLV...

## Linear Regression on all
Let's try the model on all features to see if the result improves.

In [None]:
#I want to withhold 30 % of the dataset to perform the tests
X_train, X_test, y_train, y_test= train_test_split(X,y, test_size=0.3 , random_state = 25)

print('Shape of X_train is: ', X_train.shape)
print('Shape of X_test is: ', X_test.shape)
print('Shape of y_train is: ', y_train.shape)
print('Shape of y_test is: ', y_test.shape)

In [None]:
# Linear regression model
model = LinearRegression() 
model.fit(X_train, y_train)

In [None]:
Predictions = model.predict(X_test)
print_metrics(y_test, Predictions, 22)

This is even worse. 
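
A single 70/30 split can make the comparison between the 5-feature model and the all-feature model noisy. One way to check whether the difference holds up is k-fold cross-validation. The sketch below uses synthetic data from `make_regression` as a stand-in for the insurance features, so the scores it prints say nothing about the real dataset.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for the insurance features (assumption)
X_demo, y_demo = make_regression(n_samples=500, n_features=22, n_informative=5,
                                 noise=25.0, random_state=25)

# Compare a 5-column subset with the full feature set over 5 folds
scores_5 = cross_val_score(LinearRegression(), X_demo[:, :5], y_demo, cv=5, scoring="r2")
scores_all = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5, scoring="r2")

print("5 features  : mean R^2 = %.3f (std %.3f)" % (scores_5.mean(), scores_5.std()))
print("all features: mean R^2 = %.3f (std %.3f)" % (scores_all.mean(), scores_all.std()))
```

In the notebook this would be run with `X_5`, `X` and `y` in place of the synthetic arrays. The standard deviation across folds gives a feel for how much of the gap between two models is just split-to-split variation.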

# Conclusion

This model does not predict the CLV well. One likely reason is that the CLV data is highly skewed.
To improve the prediction, we could try to bring the distribution of the CLV column closer to a normal distribution.
I will try this below using two different methods: Box-Cox and log.
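
To put a number on "highly skewed", the sample skewness can be compared before and after each transform. The sketch below does this on synthetic lognormal data; `clv_demo` and its parameters are assumptions chosen only to look roughly CLV-like.

```python
import numpy as np
from scipy.stats import boxcox, skew

# Synthetic right-skewed, strictly positive values standing in for CLV (assumption)
rng = np.random.default_rng(2)
clv_demo = rng.lognormal(mean=8.7, sigma=0.65, size=5000)

# Log transform
log_demo = np.log(clv_demo)
# Box-Cox transform; lambda is fitted by maximum likelihood when not supplied
boxcox_demo, fitted_lambda = boxcox(clv_demo)

print("skew raw     :", round(float(skew(clv_demo)), 3))
print("skew log     :", round(float(skew(log_demo)), 3))
print("skew Box-Cox :", round(float(skew(boxcox_demo)), 3))
print("fitted lambda:", round(float(fitted_lambda), 3))
```

A skewness near 0 indicates a roughly symmetric distribution. Both transforms require strictly positive values, which holds for a CLV column without zeros or negatives.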




In [None]:
#to see the CLV data as is (without having the extremes removed)
data.hist('Customer Lifetime Value', bins = 10)
plt.show()

In [None]:
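#Check for normality with the Shapiro-Wilk test; if p < 0.05 the data is not normally distributed
#(scipy warns that the p-value may be inaccurate for more than 5000 samples)
from scipy.stats import shapiro
clv = data['Customer Lifetime Value']
shapiro(clv)[1]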

In [None]:
#as the data is not normally distributed, let's continue with the log function
log_clv = np.log(clv)
sns.histplot(log_clv, kde=True) #histplot replaces distplot, which is deprecated in recent seaborn versions

In [None]:
#the skewness is slightly improved. Let's try Box Cox now
from scipy.stats import boxcox
boxcox_clv = boxcox(clv)[0]
sns.histplot(boxcox_clv, kde=True) #histplot replaces distplot, which is deprecated in recent seaborn versions

Box-Cox brings the distribution a bit closer to normal than the log transform does. Let's try our linear regression now.

In [None]:
#I want to withhold 30 % of the dataset to perform the tests
X_train, X_test, y_train, y_test= train_test_split(X_5,boxcox_clv, test_size=0.3 , random_state = 25)

In [None]:
model_5.fit(X_train, y_train)

In [None]:
Predictions_box = model_5.predict(X_test)
print_metrics(y_test, Predictions_box, 5)

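We can see a slight improvement to 18.5% now, although this score is measured on the Box-Cox scale and is not directly comparable with the earlier ones. Further feature engineering is needed to improve the result.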
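
Because the model above is trained on the Box-Cox-transformed target, its predictions and error metrics are in transformed units. To report errors in original CLV units, the predictions can be mapped back with `scipy.special.inv_boxcox`, which needs the lambda that `boxcox` fitted. The sketch below shows the round trip on synthetic data; every name and number in it is an assumption for illustration.

```python
import numpy as np
from scipy.special import inv_boxcox
from scipy.stats import boxcox
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data: a positive, right-skewed target driven by one feature (assumption)
rng = np.random.default_rng(3)
X_demo = rng.normal(90, 30, size=(1000, 1))
y_demo = np.exp(7.5 + 0.01 * X_demo[:, 0] + rng.normal(0, 0.4, 1000))

# Keep the fitted lambda so predictions can be mapped back later
y_bc, lam = boxcox(y_demo)
X_tr, X_te, y_tr_bc, y_te_bc = train_test_split(X_demo, y_bc, test_size=0.3, random_state=25)

pred_bc = LinearRegression().fit(X_tr, y_tr_bc).predict(X_te)

# Invert the transform to get predictions and targets in original units
pred_orig = inv_boxcox(pred_bc, lam)
y_te_orig = inv_boxcox(y_te_bc, lam)
print("MAE in original units:", round(float(mean_absolute_error(y_te_orig, pred_orig)), 2))
```

In the notebook this would mean keeping the second return value of `boxcox(clv)` instead of discarding it with `[0]`. Like the notebook, the sketch fits lambda on all rows; fitting it on the training portion only would avoid leaking information from the test set.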