In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
import statsmodels.api as sm
from statsmodels.tools.eval_measures import mse, rmse
import seaborn as sns
import scipy.stats as stats
from scipy.stats.mstats import winsorize
from datetime import datetime
import json
from wordcloud import WordCloud

%matplotlib inline
pd.options.display.float_format = '{:.2f}'.format

import warnings
warnings.filterwarnings(action="ignore")

# Goal: Predict Customer Lifetime Value (CLV) for an Auto Insurance Company
Customer lifetime value is the net profit a company earns from a customer over the whole of its relationship with them.

Knowing each customer's lifetime value tells you how much you should spend on customer acquisition. A customer's acquisition cost can exceed what they spend on their first purchase, but if you nurture the relationship, their CLV may grow to an amount that is well worth the investment. That is one of the many reasons why success in a customer-centered economy depends on understanding customer lifetime value.

In [None]:
df = pd.read_csv('Marketing-Customer-Value-Analysis.csv')
df.sort_values('Customer Lifetime Value')

In [None]:
df.info()

In [None]:
df.head().T

In [None]:
# Convert the date column to a datetime dtype
df['Effective To Date'] = pd.to_datetime(df['Effective To Date'])

There are 9134 observations of 24 different variables (a mix of categorical and continuous data types).

The dependent variable is Customer Lifetime Value, since CLV is what we have to predict.

Independent variables are: Customer, State, Response, Coverage, Education, Effective To Date, EmploymentStatus, Gender, Income, Location Code, Marital Status, Monthly Premium Auto, Months Since Last Claim, Months Since Policy Inception, Number of Open Complaints, Number of Policies, Policy Type, Policy, Renew Offer Type, Sales Channel, Total Claim Amount, Vehicle Class, Vehicle Size

Continuous variables are: Customer Lifetime Value (the target), Income, Monthly Premium Auto, Months Since Last Claim, Months Since Policy Inception, Number of Open Complaints, Number of Policies, Total Claim Amount
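The split between categorical and continuous variables can also be derived from the column dtypes. The following is a minimal sketch on a hypothetical three-row frame (the values are made up for illustration and are not taken from the dataset); it assumes that numeric dtypes mark the continuous candidates and everything else is categorical.

```python
import pandas as pd

# Hypothetical three-row frame that mimics a few columns of the dataset.
sample = pd.DataFrame({
    "Customer": ["AA1", "BB2", "CC3"],
    "Coverage": ["Basic", "Extended", "Premium"],
    "Income": [56274, 0, 48767],
    "Monthly Premium Auto": [69, 94, 108],
})

# Numeric columns are the candidates for continuous predictors.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
# Everything else (strings here) is treated as categorical.
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```

An identifier column such as Customer lands in the categorical list, which is why it has to be dropped by hand before any modelling.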



In [None]:
df.describe()


In [None]:
df.isnull().sum()


In [None]:
# Looking at outliers of continuous variables

significant_cont = ['Income','Monthly Premium Auto','Total Claim Amount']

sns.set(color_codes=True)
plt.figure(figsize=(15,20))
plt.subplots_adjust(hspace=0.5)

for i in range(len(significant_cont)):
    plt.subplot(3,2,i+1)
    plt.boxplot(df[significant_cont[i]])
    plt.title(significant_cont[i])
    
plt.show()

As can be seen, there are outliers in Total Claim Amount and in Monthly Premium Auto. Outliers are usually removed to get a better model, but since this dataset comes from the insurance industry, the outlying customers may be exactly the high-value customers we care about. So we will compare alternative models that include the outliers with models that do not.

There are no outliers in Income.
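What the box plots flag as outliers can be reproduced numerically. The sketch below assumes the usual 1.5 × IQR whisker rule (the default in matplotlib box plots); the helper name `iqr_outlier_mask` and the premium values are hypothetical.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical premiums: nine ordinary values and one extreme one.
premiums = pd.Series([60, 65, 70, 72, 75, 80, 85, 90, 95, 300])
mask = iqr_outlier_mask(premiums)

print(premiums[mask].tolist())  # values flagged as outliers
print(int(mask.sum()))          # how many were flagged
```

Applying the same mask to each column in `significant_cont` would give an outlier count per variable instead of a visual impression.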




In [None]:
#checking all categorical variables to determine significant ones.

cat_df = df.select_dtypes(include='object')
cat_df = cat_df.drop(['Customer'], axis = 1)
cols = cat_df.columns
cols

In [None]:
sns.set(color_codes=True)
plt.figure(figsize=(20,40))
plt.subplots_adjust(hspace=0.5)  # must come after plt.figure so it applies to this figure

for i in range(len(cols)):
    plt.subplot(7,2,i+1)
    sns.barplot(x = cols[i],y='Customer Lifetime Value',data = df)
    plt.title(cols[i])
    
plt.show()

Interpretations from graphs:

Customers who have taken only 1 policy have lower customer lifetime value, and customers who have taken 3 or more show a similar trend, so we can combine all of the latter into one bin. Customers who have taken 2 policies have a comparatively very high customer lifetime value.

Customer Lifetime Value differs across coverage types.


In [None]:

sns.set(style="whitegrid")
plt.figure(figsize=(15,6))
ax = sns.violinplot(x="Number of Policies", y="Customer Lifetime Value", data=df)


# Statistical Analysis


Interpreting graphs gives us some insight, but we need statistical tests to identify which variables are statistically significant and to get clearer results.

Taking CLV (Customer Lifetime Value) as the target variable, we will try to understand how each independent variable contributes to it.

Because the target variable CLV is continuous and the predictors tested here are categorical, we use a t-test (two groups) or an F-test / one-way ANOVA (more than two groups) to check whether the group means of CLV differ significantly.
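The per-variable tests can be wrapped in one helper so that every level of a factor is included automatically. This is only a sketch: `anova_by_group` is a hypothetical helper name and the toy data is made up, with one level deliberately given a higher mean.

```python
import pandas as pd
from scipy import stats

def anova_by_group(frame: pd.DataFrame, target: str, factor: str):
    """One-way ANOVA of `target` across every level of `factor`."""
    samples = [grp[target].to_numpy() for _, grp in frame.groupby(factor)]
    return stats.f_oneway(*samples)

# Hypothetical toy data: level "C" has a clearly higher mean than "A" and "B".
toy = pd.DataFrame({
    "level": ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
    "value": [10, 11, 9, 10, 10, 12, 9, 11, 20, 21, 19, 22],
})
f_stat, p_value = anova_by_group(toy, "value", "level")
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```

On the real data, a call such as `anova_by_group(df, 'Customer Lifetime Value', 'Marital Status')` would compare all marital-status groups at once rather than only two of them.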



In [None]:
# Test whether Gender differences are significant or not.
gender = df[['Customer Lifetime Value','Gender']].groupby('Gender')
female = gender['Customer Lifetime Value'].get_group('F')
male = gender['Customer Lifetime Value'].get_group('M')

In [None]:
stats.ttest_ind(female,male)

Means are the same for Gender.

A p-value > 0.05 means we cannot reject the hypothesis that the mean of the target variable is the same for both genders, so 'Gender' is not a significant feature for predicting 'Customer Lifetime Value'.

In [None]:
# Test whether Coverage differences are significant or not.
Coverage = df[['Customer Lifetime Value','Coverage']].groupby('Coverage')
Basic = Coverage['Customer Lifetime Value'].get_group('Basic')
Extended = Coverage['Customer Lifetime Value'].get_group('Extended')
Premium = Coverage['Customer Lifetime Value'].get_group('Premium')

In [None]:
stats.f_oneway(Basic,Extended,Premium)

A p-value > 0.05 implies that there is no significant difference in the mean of the target variable across 'Coverage' levels, which means the 'Coverage' feature is not significant for predicting 'Customer Lifetime Value'.

In [None]:
# Test whether Marital Status differences are significant or not.

Marital = df[['Customer Lifetime Value','Marital Status']].groupby('Marital Status')
married = Marital['Customer Lifetime Value'].get_group('Married')
single = Marital['Customer Lifetime Value'].get_group('Single')


In [None]:
stats.ttest_ind(married,single)

A p-value < 0.05 shows a significant difference in the mean of the target variable between the 'Married' and 'Single' groups, so 'Marital Status' could be a significant feature for predicting 'Customer Lifetime Value'.

In [None]:
# Test whether Vehicle Class differences are significant or not.

Vehicleclass = df[['Customer Lifetime Value','Vehicle Class']].groupby('Vehicle Class')
fourdoor = Vehicleclass['Customer Lifetime Value'].get_group('Four-Door Car')
twodoor = Vehicleclass['Customer Lifetime Value'].get_group('Two-Door Car')
suv = Vehicleclass['Customer Lifetime Value'].get_group('SUV')
luxurysuv =Vehicleclass['Customer Lifetime Value'].get_group('Luxury SUV')
luxurycar =Vehicleclass['Customer Lifetime Value'].get_group('Luxury Car')
sportscar =Vehicleclass['Customer Lifetime Value'].get_group('Sports Car')



In [None]:
stats.f_oneway(fourdoor,twodoor,suv,luxurysuv,luxurycar,sportscar)

In [None]:
# Test whether Renew Offer Type differences are significant or not.

Renewoffer = df[['Customer Lifetime Value','Renew Offer Type']].groupby('Renew Offer Type')
offer1 = Renewoffer['Customer Lifetime Value'].get_group('Offer1')
offer2 = Renewoffer['Customer Lifetime Value'].get_group('Offer2')
offer3 = Renewoffer['Customer Lifetime Value'].get_group('Offer3')
offer4 =Renewoffer['Customer Lifetime Value'].get_group('Offer4')



In [None]:
stats.f_oneway(offer1,offer2,offer3,offer4)

In [None]:
# Test whether EmploymentStatus differences are significant or not.


EmploymentStatus = df[['Customer Lifetime Value','EmploymentStatus']].groupby('EmploymentStatus')
employed = EmploymentStatus['Customer Lifetime Value'].get_group('Employed')
unemployed = EmploymentStatus['Customer Lifetime Value'].get_group('Unemployed')
medleave = EmploymentStatus['Customer Lifetime Value'].get_group('Medical Leave')
disabled = EmploymentStatus['Customer Lifetime Value'].get_group('Disabled')
retired = EmploymentStatus['Customer Lifetime Value'].get_group('Retired')

In [None]:
stats.f_oneway(employed,unemployed,medleave,disabled,retired)

A p-value < 0.05 implies a significant difference in the mean of the target variable for at least one group of 'EmploymentStatus', which means 'EmploymentStatus' can be a significant feature for predicting 'Customer Lifetime Value'.

In [None]:

# Test whether Education differences are significant or not.

Education = df[['Customer Lifetime Value','Education']].groupby('Education')
bachelor = Education['Customer Lifetime Value'].get_group('Bachelor')
college = Education['Customer Lifetime Value'].get_group('College')
highschool = Education['Customer Lifetime Value'].get_group('High School or Below')
master = Education['Customer Lifetime Value'].get_group('Master')
doctor = Education['Customer Lifetime Value'].get_group('Doctor')

In [None]:
stats.f_oneway(bachelor,college,highschool,master,doctor)

A p-value < 0.05 implies a significant difference in the mean of the target variable for at least one group of 'Education', which means 'Education' can be a significant feature for predicting 'Customer Lifetime Value'.

### Further Modelling:

#### Having done the EDA and the statistical analysis, we can now disregard the features that are not significant for our model.

In [None]:
df2 =df.copy()

In [None]:
df2.drop(['State','Coverage','Renew Offer Type','Vehicle Class','Customer','Response','Gender','Location Code','Vehicle Size','Policy','Policy Type','Sales Channel','Effective To Date'],axis=1,inplace = True)

Although Months Since Policy Inception, Months Since Last Claim, Number of Open Complaints and Number of Policies are all numerical, their values are not high, so they could be treated as categorical features when preparing the model. In the code below, only Number of Policies is actually encoded this way.

First, the EDA showed that customers with 3 or more policies follow a similar trend, so we group all of them as 3.

In [None]:
df2['Number of Policies'] = np.where(df2['Number of Policies']>2,3,df2['Number of Policies'])

Let's create dummy variables for the chosen categorical variables.

In [None]:
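# dtype=int keeps the dummies numeric; recent pandas versions default to bool, which statsmodels OLS may not accept
new = pd.get_dummies(df2,columns=['Marital Status','Number of Policies','Education','EmploymentStatus'],drop_first=True,dtype=int)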

In [None]:
new

## Continuous Variables

Check the continuous variables and their relationships with the categorical variables to see whether new categorical variables can be created from continuous ones.

In [None]:

ax = sns.scatterplot(x="Income", y="Customer Lifetime Value", hue="State",
                     data=df)


In [None]:

ax = sns.scatterplot(x="Income", y="Customer Lifetime Value", hue="EmploymentStatus",
                     data=df)


In [None]:

ax = sns.scatterplot(x="Total Claim Amount", y="Customer Lifetime Value", hue="Marital Status",
                     data=df)


## Model 1

There is no obvious pattern that would justify creating a new categorical variable from the continuous variables. So far, I have explored the dataset in detail and become familiar with it. Now it is time to build the model and see whether I can predict Customer Lifetime Value.

In [None]:
import statsmodels.api as sm

y = new['Customer Lifetime Value']
x = new.drop('Customer Lifetime Value',axis=1)


x = sm.add_constant(x)
results = sm.OLS(y, x).fit()
results.summary()

Next I split the dataset into training and test data: 25% of the rows are selected at random and held out as test data (test_size = 0.25 sets that share). If random_state is not specified, a new random seed is used on every run, so the training and test sets would contain different rows each time.
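The effect of fixing random_state can be checked directly. The following sketch uses ten made-up observations rather than the insurance data and splits them twice with the same seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # ten hypothetical observations, two features
y = np.arange(10)

# The same random_state gives the same partition on every call.
_, X_test_a, _, y_test_a = train_test_split(X, y, test_size=0.25, random_state=450)
_, X_test_b, _, y_test_b = train_test_split(X, y, test_size=0.25, random_state=450)

print(len(y_test_a))                        # number of held-out rows
print(np.array_equal(y_test_a, y_test_b))   # whether both splits are identical
```

Reusing the same seed for every model, as this notebook does, keeps the test rows identical across models, so their error metrics are comparable.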

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 450)

print('Train Data Count: {}'.format(x_train.shape[0]))
print('Test Data Count: {}'.format(x_test.shape[0]))

x_train = sm.add_constant(x_train)
results = sm.OLS(y_train, x_train).fit()
results.summary()

In [None]:
# Model graph to see predictions


x_test = sm.add_constant(x_test)

y_preds = results.predict(x_test)
sns.set(color_codes=True)
plt.scatter(y_test, y_preds)
plt.plot(y_test, y_test, color="red")
plt.xlabel("Actual ltv")
plt.ylabel("Estimated ltv", )
plt.title("Actual vs Estimated Customer LTV")
plt.show()

In [None]:
# Let's look at the prediction errors

print("Mean Absolute Error (MAE)        : {}".format(mean_absolute_error(y_test, y_preds)))
print("Mean Sq. Error (MSE)          : {}".format(mse(y_test, y_preds)))
print("Root Mean Sq. Error (RMSE)     : {}".format(rmse(y_test, y_preds)))
print("Mean Abs. Perc. Error (MAPE) : {}".format(np.mean(np.abs((y_test - y_preds) / y_test)) * 100))

In [None]:
all_score = []

all_score.append((results.rsquared,
                  mean_absolute_error(y_test, y_preds),
                 mse(y_test, y_preds),rmse(y_test, y_preds),
                 np.mean(np.abs((y_test - y_preds) / y_test)) * 100))

## not a good prediction
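For reference, the four error metrics printed above can be written out by hand. This is a sketch with a hypothetical helper name (`regression_errors`) and made-up numbers; it is meant to show the definitions, not to replace the library functions used in the notebook.

```python
import numpy as np

def regression_errors(y_true, y_pred):
    """Return MAE, MSE, RMSE and MAPE (in percent) as a dict."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse_value = np.mean(err ** 2)
    return {
        "MAE": np.mean(np.abs(err)),
        "MSE": mse_value,
        "RMSE": np.sqrt(mse_value),
        "MAPE": np.mean(np.abs(err / y_true)) * 100,  # undefined if y_true contains zeros
    }

# Hypothetical actual and predicted values.
scores = regression_errors([100, 200, 400], [110, 180, 400])
print(scores)
```

MAE and RMSE are in the units of the target, while MAPE is relative, which makes it the easiest of the four to compare between models fitted on different scales.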

## Model 2

In [None]:

# Copy the encoded data and log-transform the skewed variables to try for a higher R2 (outliers kept)
df3 = new.copy()

df3['Monthly Premium Auto'] = np.log(df3['Monthly Premium Auto'])
df3['Total Claim Amount'] = np.log(df3['Total Claim Amount'])
y = np.log(df3['Customer Lifetime Value'])

x1 = df3.drop('Customer Lifetime Value', axis=1)

In [None]:
x1_train, x1_test, y_train, y_test = train_test_split(x1, y, test_size = 0.25, random_state = 450)

print('Train Data Count: {}'.format(x1_train.shape[0]))
print('Test Data Count: {}'.format(x1_test.shape[0]))

x1_train = sm.add_constant(x1_train)
results_log = sm.OLS(y_train, x1_train).fit()
results_log.summary()

In [None]:
# Model graph to see predictions


x1_test = sm.add_constant(x1_test)

y_preds = results_log.predict(x1_test)
sns.set(color_codes=True)
plt.scatter(y_test, y_preds)
plt.plot(y_test, y_test, color="red")
plt.xlabel("Actual ltv")
plt.ylabel("Estimated ltv", )
plt.title("Actual vs Estimated Customer LTV-Log Transformation with outliers")
plt.show()

In [None]:
print("Mean Absolute Error (MAE)        : {}".format(mean_absolute_error(y_test, y_preds)))
print("Mean Sq. Error (MSE)          : {}".format(mse(y_test, y_preds)))
print("Root Mean Sq. Error (RMSE)     : {}".format(rmse(y_test, y_preds)))
print("Mean Abs. Perc. Error (MAPE) : {}".format(np.mean(np.abs((y_test - y_preds) / y_test)) * 100))

In [None]:
exp_ypreds = np.exp(y_preds)
exp_ytest = np.exp(y_test)



In [None]:
print("Mean Absolute Error (MAE)        : {}".format(mean_absolute_error(exp_ytest, exp_ypreds)))
print("Mean Sq. Error (MSE)          : {}".format(mse(exp_ytest, exp_ypreds)))
print("Root Mean Sq. Error (RMSE)     : {}".format(rmse(exp_ytest, exp_ypreds)))
print("Mean Abs. Perc. Error (MAPE) : {}".format(np.mean(np.abs((exp_ytest - exp_ypreds) / exp_ytest)) * 100))

In [None]:
all_score.append((results_log.rsquared,  # R2 of the log model, not of Model 1
                  mean_absolute_error(exp_ytest, exp_ypreds),
                 mse(exp_ytest, exp_ypreds),rmse(exp_ytest, exp_ypreds),
                 np.mean(np.abs((exp_ytest - exp_ypreds) / exp_ytest)) * 100))

## Model 3

In [None]:
# Copy the encoded data and winsorize the top 5% of the values
df4 = new.copy()

df4['Monthly Premium Auto'] = winsorize(df4['Monthly Premium Auto'],(0, 0.05))
df4['Total Claim Amount'] = winsorize(df4['Total Claim Amount'],(0, 0.05))


y = df4['Customer Lifetime Value']
x3 =  df4.drop('Customer Lifetime Value',axis=1)


In [None]:
x3_train, x3_test, y_train, y_test = train_test_split(x3, y, test_size = 0.25, random_state = 450)

print('Train Data Count: {}'.format(x3_train.shape[0]))
print('Test Data Count: {}'.format(x3_test.shape[0]))


x3_train = sm.add_constant(x3_train)
results_wins = sm.OLS(y_train, x3_train).fit()
results_wins.summary()

In [None]:
# Model graph to see predictions


x3_test = sm.add_constant(x3_test)

y_preds = results_wins.predict(x3_test)
sns.set(color_codes=True)
plt.scatter(y_test, y_preds)
plt.plot(y_test, y_test, color="red")
plt.xlabel("Actual ltv")
plt.ylabel("Estimated ltv", )
plt.title("Actual vs Estimated Customer LTV-5% Winsorize")
plt.show()

In [None]:
print("Mean Absolute Error (MAE)        : {}".format(mean_absolute_error(y_test, y_preds)))
print("Mean Sq. Error (MSE)          : {}".format(mse(y_test, y_preds)))
print("Root Mean Sq. Error (RMSE)     : {}".format(rmse(y_test, y_preds)))
print("Mean Abs. Perc. Error (MAPE) : {}".format(np.mean(np.abs((y_test - y_preds) / y_test)) * 100))

In [None]:
all_score.append((results_wins.rsquared,
                  mean_absolute_error(y_test, y_preds),
                 mse(y_test, y_preds),rmse(y_test, y_preds),
                 np.mean(np.abs((y_test - y_preds) / y_test)) * 100))

## Model 4

In [None]:
# Copy the winsorized data and log-transform it

df5 = df4.copy()


df5['Monthly Premium Auto'] = np.log(df5['Monthly Premium Auto'])
df5['Total Claim Amount'] = np.log(df5['Total Claim Amount'])


y = np.log(df5['Customer Lifetime Value'])
x7 =df5.drop('Customer Lifetime Value',axis=1)


In [None]:
x7_train, x7_test, y_train, y_test = train_test_split(x7, y, test_size = 0.25, random_state = 450)

print('Train Data Count: {}'.format(x7_train.shape[0]))
print('Test Data Count: {}'.format(x7_test.shape[0]))


x7_train = sm.add_constant(x7_train)
results_logwins = sm.OLS(y_train, x7_train).fit()
results_logwins.summary()

In [None]:
# Model graph to see predictions


x7_test = sm.add_constant(x7_test)

y_preds = results_logwins.predict(x7_test)
sns.set(color_codes=True)
plt.scatter(y_test, y_preds)
plt.plot(y_test, y_test, color="red")
plt.xlabel("Actual ltv")
plt.ylabel("Estimated ltv", )
plt.title("Actual vs Estimated Customer LTV- Both Log Transformation & 5% Winsorize")
plt.show()

In [None]:
print("Mean Absolute Error (MAE)        : {}".format(mean_absolute_error(y_test, y_preds)))
print("Mean Sq. Error (MSE)          : {}".format(mse(y_test, y_preds)))
print("Root Mean Sq. Error (RMSE)     : {}".format(rmse(y_test, y_preds)))
print("Mean Abs. Perc. Error (MAPE) : {}".format(np.mean(np.abs((y_test - y_preds) / y_test)) * 100))

In [None]:
exp_ypreds = np.exp(y_preds)
exp_ytest = np.exp(y_test)

all_score.append((results_logwins.rsquared,
                  mean_absolute_error(exp_ytest, exp_ypreds),
                 mse(exp_ytest, exp_ypreds),rmse(exp_ytest, exp_ypreds),
                 np.mean(np.abs((exp_ytest - exp_ypreds) / exp_ytest)) * 100))

## Model 5

In [None]:
# The best model so far is the one with log transformation and outliers included.
# Let's use polynomial features to see if we can do better.

from sklearn.preprocessing import PolynomialFeatures

y = np.log(df3['Customer Lifetime Value'])
x5 = df3.drop('Customer Lifetime Value', axis=1)

pol = PolynomialFeatures()  # default degree=2, includes a bias column named '1'
array = pol.fit_transform(x5)

# get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2
df_pol = pd.DataFrame(array, columns=pol.get_feature_names_out(x5.columns), index=x5.index)


In [None]:
df_pol_train, df_pol_test, y_train, y_test = train_test_split(df_pol, y, test_size = 0.25, random_state = 450)

print('Train Data Count: {}'.format(df_pol_train.shape[0]))
print('Test Data Count: {}'.format(df_pol_test.shape[0]))

df_pol_train = sm.add_constant(df_pol_train)
results_pol = sm.OLS(y_train, df_pol_train).fit()
results_pol.summary()

In [None]:
# Model graph to see predictions


df_pol_test = sm.add_constant(df_pol_test)

y_preds = results_pol.predict(df_pol_test)
sns.set(color_codes=True)
plt.scatter(y_test, y_preds)
plt.plot(y_test, y_test, color="red")
plt.xlabel("Actual ltv")
plt.ylabel("Estimated ltv", )
plt.title("Actual vs Estimated Customer LTV-Polynomial Features")
plt.show()

In [None]:
print("Mean Absolute Error (MAE)     : {}".format(mean_absolute_error(y_test, y_preds)))
print("Mean Sq. Error (MSE)          : {}".format(mse(y_test, y_preds)))
print("Root Mean Sq. Error (RMSE)    : {}".format(rmse(y_test, y_preds)))
print("Mean Abs. Perc. Error (MAPE)  : {}".format(np.mean(np.abs((y_test - y_preds) / y_test)) * 100))

In [None]:
exp_ypreds = np.exp(y_preds)
exp_ytest = np.exp(y_test)

all_score.append((results_pol.rsquared,
                  mean_absolute_error(exp_ytest, exp_ypreds),
                 mse(exp_ytest, exp_ypreds),rmse(exp_ytest, exp_ypreds),
                 np.mean(np.abs((exp_ytest - exp_ypreds) / exp_ytest)) * 100))

In [None]:
# Model graph to see exponential version of predictions


df_pol_test = sm.add_constant(df_pol_test)

exp_ypreds = np.exp(results_pol.predict(df_pol_test))  # keep y_preds on the log scale
sns.set(color_codes=True)
plt.scatter(exp_ytest, exp_ypreds)
plt.plot(exp_ytest, exp_ytest, color="red")
plt.xlabel("Actual ltv")
plt.ylabel("Estimated ltv", )
plt.title("Actual vs Estimated Customer LTV-Polynomial Features-Exp")
plt.show()


Actual and predicted values line up well at first, but beyond a certain point the linear relationship weakens. The graph shows that customer lifetime value is predicted better for values lower than 10,000. Let's check whether the mean squared error improves when we restrict the evaluation to customers with lower LTV.

In [None]:
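# y_test is log(CLV), so predictions must stay on the log scale; a cutoff of 10 here is a CLV of about e^10 (22,026).
mse(y_test[y_test < 10], results_pol.predict(df_pol_test)[y_test < 10])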

We see that the mean squared error decreased from 0.04 to 0.02, roughly half of the initial error.
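Because the target was log-transformed, it matters on which scale an error or a cutoff is expressed. The sketch below uses made-up lifetime values to show the difference between log-scale and original-unit errors, and how a cutoff given in original units (10,000 here, chosen to match the text) has to be logged before it is compared with log values.

```python
import numpy as np

# Hypothetical customer lifetime values and predictions in original units.
actual = np.array([2500.0, 5000.0, 8000.0, 30000.0])
predicted = np.array([2600.0, 4800.0, 8500.0, 20000.0])

log_actual, log_predicted = np.log(actual), np.log(predicted)

# Errors on the log scale are relative, so they look small.
mse_log = np.mean((log_actual - log_predicted) ** 2)
# Back-transforming with np.exp recovers the original units.
mse_units = np.mean((np.exp(log_actual) - np.exp(log_predicted)) ** 2)

# A cutoff in original units must be logged before filtering log values.
cutoff = np.log(10_000)
below = log_actual < cutoff
mse_log_below = np.mean((log_actual[below] - log_predicted[below]) ** 2)

print(mse_log, mse_units, mse_log_below)
```

A log-scale MSE such as 0.04 is therefore not comparable with the MSE values reported for the models fitted in original units.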

## Model 6

We see some improvement when polynomial features are introduced. However, the model contains some insignificant features whose p-values are above 0.05. That is why we will build a new model after removing the features that are not significant for the target variable.

In [None]:
significant_features = list(results_pol.pvalues[results_pol.pvalues <= 0.05].index)

In [None]:


df_sig_train, df_sig_test, y_train, y_test = train_test_split(df_pol[significant_features], y, test_size = 0.25, random_state = 450)

print('Train Data Count: {}'.format(df_sig_train.shape[0]))
print('Test Data Count: {}'.format(df_sig_test.shape[0]))

df_sig_train = sm.add_constant(df_sig_train)
results_sig = sm.OLS(y_train, df_sig_train).fit()
results_sig.summary()

In [None]:
# Model graph to see predictions


df_sig_test = sm.add_constant(df_sig_test)

y_preds = results_sig.predict(df_sig_test)
sns.set(color_codes=True)
plt.scatter(y_test, y_preds)
plt.plot(y_test, y_test, color="red")
plt.xlabel("Actual ltv")
plt.ylabel("Estimated ltv" )
plt.title("Actual vs Estimated Customer LTV-Polynomial Features with significant variables")
plt.show()

In the graph, we see that the model predicts lower values better than higher ones.

In [None]:
print("Mean Absolute Error (MAE)        : {}".format(mean_absolute_error(y_test, y_preds)))
print("Mean Sq. Error (MSE)          : {}".format(mse(y_test, y_preds)))
print("Root Mean Sq. Error (RMSE)     : {}".format(rmse(y_test, y_preds)))
print("Mean Abs. Perc. Error (MAPE) : {}".format(np.mean(np.abs((y_test - y_preds) / y_test)) * 100))

In [None]:
exp_ypreds = np.exp(y_preds)
exp_ytest = np.exp(y_test)

all_score.append((results_sig.rsquared,
                  mean_absolute_error(exp_ytest, exp_ypreds),
                 mse(exp_ytest, exp_ypreds),rmse(exp_ytest, exp_ypreds),
                 np.mean(np.abs((exp_ytest - exp_ypreds) / exp_ytest)) * 100))

In [None]:
df_allscore = pd.DataFrame(all_score)

In [None]:
df_allscore.index = ['Standard','Log with outliers','Without Outliers','Log without outliers',
                       'Polynomial Features',
                       'Polynomial with significant features']

df_allscore.columns = ['R2', 'MAE', 'MSE','RMSE','MAPE']


df_allscore

## Let's compare train and test predictions to check for underfitting or overfitting

In [None]:
lrm = LinearRegression()
lrm.fit(df_pol_train, y_train)

y_train_predict = lrm.predict(df_pol_train)
y_test_predict = lrm.predict(df_pol_test)

print("Train observation number  : {}".format(df_pol_train.shape[0]))
print("Test observation number   : {}".format(df_pol_test.shape[0]), "\n")

print("Train R-Square  : {}".format(lrm.score(df_pol_train, y_train)))
print("-----Test Scores---")
print("Test R-Square   : {}".format(lrm.score(df_pol_test, y_test)))
print("Mean_absolute_error (MAE)             : {}".format(mean_absolute_error(y_test, y_test_predict)))
print("Mean squared error (MSE)              : {}".format(mse(y_test, y_test_predict)))
print("Root mean squared error(RMSE)         : {}".format(rmse(y_test, y_test_predict)))
print("Mean absolute percentage error (MAPE) : {}".format(np.mean(np.abs((y_test - y_test_predict) / y_test)) * 100))

In [None]:
from sklearn.linear_model import Lasso
from yellowbrick.regressor import PredictionError


# Create the train and test data
df_pol_train, df_pol_test, y_train, y_test = train_test_split(df_pol, y, test_size = 0.25, random_state = 450)

# Instantiate the linear model and visualizer
model = Lasso()
visualizer = PredictionError(model)

visualizer.fit(df_pol_train, y_train)  # Fit the training data to the visualizer
visualizer.score(df_pol_test, y_test)  # Evaluate the model on the test data
visualizer.show()                 # Finalize and render the figure

In [None]:
from sklearn.linear_model import Ridge
from yellowbrick.regressor import ResidualsPlot


# Instantiate the linear model and visualizer
Model = Ridge()
visualizer_residual = ResidualsPlot(Model)

visualizer_residual.fit(df_pol_train, y_train)  # Fit the training data to the visualizer
visualizer_residual.score(df_pol_test, y_test)  # Evaluate the model on the test data
visualizer_residual.show()                 # Finalize and render the figure

## Conclusion

We have created six different models in search of the one with the highest R-squared and the lowest error terms.

Based on the comparison table, we choose the 5th model, which combines log transformation and polynomial features. Its R-squared is 0.91, meaning that 91% of the variance in the log-transformed target can be explained, which is really high.

The model appears to predict values well. Actual and predicted values line up well, but beyond a certain point the linear relationship weakens. The graph shows that customer lifetime value is predicted better for values lower than 10,000. When we restrict the evaluation to those lower values, the mean squared error decreases from 0.04 to 0.02, roughly half of the initial error.

We do not see an overfitting problem with the model, but I still checked Lasso and Ridge models to see whether they change the picture.
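An overfitting check of this kind amounts to comparing train and test R-squared for each model. The sketch below runs that comparison on synthetic data with a quadratic signal; the data, the `alpha` value and the `gaps` dictionary are assumptions made for illustration and do not reproduce the notebook's results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Hypothetical data: a quadratic signal plus noise.
X = rng.uniform(-2, 2, size=(400, 3))
y = 1.5 * X[:, 0] ** 2 - X[:, 1] + rng.normal(scale=0.3, size=400)

X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_poly, y, test_size=0.25, random_state=450
)

gaps = {}
for name, model in [("ols", LinearRegression()), ("ridge", Ridge(alpha=1.0))]:
    model.fit(X_train, y_train)
    train_r2 = model.score(X_train, y_train)
    test_r2 = model.score(X_test, y_test)
    gaps[name] = train_r2 - test_r2   # a large positive gap suggests overfitting
    print(f"{name}: train R2 = {train_r2:.3f}, test R2 = {test_r2:.3f}")
```

If the unregularized and the regularized model show similarly small gaps, regularization is not buying much, which is the situation the conclusion describes.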

From a marketing perspective, we now have a better idea of which customers have a higher predicted lifetime value. With that information it is easier to steer marketing activities toward the more profitable customers.

