
Reference: https://www.amazon.com/Hands-Data-Science-Marketing-strategies/dp/1789346347

**Importing libraries**




In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import statsmodels.api as sm        # sm.Logit (array/DataFrame interface) is exposed by statsmodels.api



import matplotlib.pyplot as plt
import seaborn as sns               # Provides a high level interface for drawing attractive and informative statistical graphics
%matplotlib inline
sns.set()
from subprocess import check_output

import warnings                                            # Ignore warning related to pandas_profiling
warnings.filterwarnings('ignore') 

def annot_plot(ax,w,h):                                    # function to add data to plot
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    for p in ax.patches:
         ax.annotate(f"{p.get_height() * 100 / df.shape[0]:.2f}%", (p.get_x() + p.get_width() / 2., p.get_height()),
         ha='center', va='center', fontsize=11, color='black', rotation=0, xytext=(0, 10),
         textcoords='offset points')             
def annot_plot_num(ax,w,h):                                    # function to add data to plot
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    for p in ax.patches:
        ax.annotate('{0:.1f}'.format(p.get_height()), (p.get_x()+w, p.get_height()+h))

import os
print(os.listdir("../input"))


#  Data Loading

In [None]:
df = pd.read_csv('../input/WA_Fn-UseC_-Marketing-Customer-Value-Analysis.csv')

#  Exploratory Data Analysis:
Before we dive into regression analysis, we first take a closer look at the data to understand which fields we have and what patterns they show. The data contains a column named Response, which records whether a customer responded to marketing calls. We will use this field as a measure of customer engagement. For the computations that follow, it is convenient to encode this field with numerical values.


In [None]:
df.shape

In [None]:
df.head()

In [None]:

df['Engaged'] = df.Response.apply(lambda x: 0 if x =='No' else 1)

In [None]:
df.head()

# Engagement rate:

The first thing that we are going to look at is the aggregate engagement rate. This engagement rate is simply the percentage of customers that responded to the marketing calls.



In [None]:

engagement_rate_df = pd.DataFrame(df.groupby(by='Engaged').count()['Response'] / df.shape[0] * 100)
engagement_rate_df

To make this easier to read, we can transpose the DataFrame, that is, flip its rows and columns. A pandas DataFrame can be transposed with its T attribute.



In [None]:
engagement_rate_df.T

As you can see, about 14% of the customers have responded to marketing calls, and the remaining 86% of the customers have not responded.
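
Because Engaged is coded as 0 or 1, the same rate can also be read off as the mean of that column. The sketch below illustrates this on a small made-up DataFrame; `toy_df` and its values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the real Response column.
toy_df = pd.DataFrame({'Response': ['No', 'Yes', 'No', 'No', 'Yes', 'No', 'No', 'No']})
toy_df['Engaged'] = (toy_df['Response'] == 'Yes').astype(int)

# Share of each class as a percentage (same idea as the groupby/count approach).
rate_by_class = toy_df['Engaged'].value_counts(normalize=True).sort_index() * 100

# The mean of a 0/1 column is the fraction of ones.
engagement_rate = toy_df['Engaged'].mean() * 100

print(rate_by_class)
print(engagement_rate)
```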

**Sales Channels**

Now, let's see whether we can find any noticeable patterns between sales channel and engagement. We are going to analyze how the engaged and non-engaged customers are distributed among the different sales channels.



In [None]:
engagement_by_sales_channel_df = pd.pivot_table(df, values='Response', index='Sales Channel', columns='Engaged', aggfunc=len).fillna(0.0)
engagement_by_sales_channel_df.columns = ['Not Engaged', 'Engaged']

In [None]:
engagement_by_sales_channel_df

As noted in the previous section, far more customers are not engaged with the marketing efforts than are engaged, so it is difficult to compare the sales channel distributions of the two groups from raw counts. To make the differences easier to see, we can build pie charts.

In [None]:
engagement_by_sales_channel_df.plot(
    kind='pie',
    figsize=(15, 7),
    startangle=90,
    subplots=True,
    autopct=lambda x: '%0.1f%%' % x
)

plt.show()
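
The pie charts compare shares rather than counts. A numerical counterpart, sketched here on made-up counts (the `toy_counts` values are hypothetical), is to divide each column by its own total so that every column sums to 100%.

```python
import pandas as pd

# Hypothetical counts per sales channel; not the real dataset values.
toy_counts = pd.DataFrame(
    {'Not Engaged': [300, 200, 250, 250], 'Engaged': [60, 20, 10, 10]},
    index=['Agent', 'Branch', 'Call Center', 'Web'],
)

# Divide each column by its own total to get within-group percentages.
toy_shares = toy_counts / toy_counts.sum() * 100

print(toy_shares.round(1))
```

Reading across a row then shows directly whether a channel is over- or under-represented among engaged customers.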

**Total claim amounts**

The last thing we are going to look at before the regression analysis is how the distribution of Total Claim Amount differs between the engaged and non-engaged groups. We visualize this with box plots, first without outliers and then with them.

In [None]:
ax = df[['Engaged', 'Total Claim Amount']].boxplot(by='Engaged', showfliers=False, figsize=(10,7))

ax.set_xlabel('Engaged')
ax.set_ylabel('Total Claim Amount')
ax.set_title('Total Claim Amount Distributions by Engagement')

plt.suptitle("")
plt.show()


In [None]:
ax = df[['Engaged', 'Total Claim Amount']].boxplot(
    by='Engaged',
    showfliers=True,
    figsize=(10,7)
)

ax.set_xlabel('Engaged')
ax.set_ylabel('Total Claim Amount')
ax.set_title('Total Claim Amount Distributions by Engagement')

plt.suptitle("")
plt.show()

So far, we have analyzed the types of fields in the data and how the patterns differ between the engaged and non-engaged groups. Now we will build a logistic regression model, starting with the continuous variables.

In [None]:
df.describe()


In [None]:
df['Customer Lifetime Value'].dtype

In [None]:
df['Income'].dtype

In [None]:
continuous_vars = [
                    'Customer Lifetime Value', 'Income', 'Monthly Premium Auto', 
                    'Months Since Last Claim', 'Months Since Policy Inception', 
                    'Number of Open Complaints', 'Number of Policies', 
                    'Total Claim Amount'
                ]

In [None]:

logit = sm.Logit(df['Engaged'], df[continuous_vars])
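
Note that passing only the feature columns fits the model without an intercept: statsmodels' `Logit` uses the design matrix as given and does not add a constant column on its own. The numeric sketch below uses only the standard library to show what that implies. With no intercept, a customer whose features are all zero is always assigned a probability of 0.5. The coefficient and intercept values are made up for illustration.

```python
import math

def logistic(z):
    """Map a linear predictor z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, coefs, intercept=0.0):
    """Probability from a logistic model with an optional intercept."""
    z = intercept + sum(x * b for x, b in zip(features, coefs))
    return logistic(z)

# Hypothetical coefficients; not taken from the fitted model.
coefs = [0.02, -0.5]
zero_customer = [0.0, 0.0]

p_no_intercept = predict_proba(zero_customer, coefs)
p_with_intercept = predict_proba(zero_customer, coefs, intercept=-1.8)

print(p_no_intercept)
print(round(p_with_intercept, 3))
```

If an intercept is wanted, `statsmodels.api.add_constant` is the usual way to append one to the feature DataFrame. Whether to include it here is a modeling choice.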

In [None]:
logit_fit = logit.fit()

In [None]:
logit_fit.summary()

Looking at this model output, we can see that the Income, Monthly Premium Auto, Months Since Last Claim, Months Since Policy Inception, and Number of Policies variables have significant relationships with the output variable, Engaged. For example, Number of Policies is significant and has a negative coefficient, which suggests that the more policies customers have, the less likely they are to respond to marketing calls.
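
One way to make a coefficient's sign and size more concrete is to exponentiate it. This gives an odds ratio: the factor by which the odds of engagement change when the variable increases by one unit, holding the others fixed. The sketch below uses a made-up coefficient; `beta` is hypothetical and is not a value from the summary above.

```python
import math

# Hypothetical coefficient for one predictor; not from the fitted model.
beta = -0.08

# exp(beta) is the multiplicative change in the odds per one-unit increase.
odds_ratio = math.exp(beta)
percent_change_in_odds = (odds_ratio - 1) * 100

print(f"odds ratio: {odds_ratio:.3f}")
print(f"change in odds per unit: {percent_change_in_odds:.1f}%")
```

With a fitted statsmodels result, `np.exp(logit_fit.params)` applies the same transformation to every coefficient at once.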




**Categorical variables**

In [None]:
gender_values, gender_labels = df['Gender'].factorize()
df['GenderFactorized'] = gender_values

In [None]:
gender_values

In [None]:
gender_labels
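
`factorize` assigns integer codes in order of first appearance and returns the codes together with the unique labels. A minimal sketch on a made-up Series (the values are hypothetical):

```python
import pandas as pd

# Hypothetical values, not rows from the dataset.
toy_gender = pd.Series(['F', 'M', 'M', 'F', 'M'])

codes, labels = toy_gender.factorize()
print(codes)         # one integer code per row
print(list(labels))  # labels, indexed by code

# Indexing the labels with the codes recovers the original values.
recovered = labels[codes]
```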

In [None]:
df

In [None]:
categories = pd.Categorical(df['Education'], categories=['High School or Below', 'Bachelor', 'College', 'Master', 'Doctor'])

In [None]:
df['EducationFactorized'] = categories.codes
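
Unlike `factorize`, `pd.Categorical` with an explicit `categories` list assigns codes by position in that list, so the encoding follows the order you choose. Any value that is not in the list gets code -1, which is worth checking for before fitting. The toy values below are hypothetical.

```python
import pandas as pd

levels = ['High School or Below', 'Bachelor', 'College', 'Master', 'Doctor']

# Hypothetical values; 'PhD' is deliberately not one of the listed levels.
toy_education = pd.Series(['College', 'Doctor', 'Bachelor', 'PhD'])

cat = pd.Categorical(toy_education, categories=levels)
codes = list(cat.codes)
print(codes)

# Count values that fell outside the declared categories.
n_unmatched = int((cat.codes == -1).sum())
print(n_unmatched)
```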

In [None]:
df.head()

In [None]:
logit = sm.Logit(df['Engaged'], df[['GenderFactorized','EducationFactorized']])

In [None]:
logit_fit = logit.fit()

In [None]:
logit_fit.summary()

**Regression Analysis with Both Continuous and Categorical Variables**

In [None]:
logit = sm.Logit(
    df['Engaged'], 
    df[['Customer Lifetime Value',
        'Income',
        'Monthly Premium Auto',
        'Months Since Last Claim',
        'Months Since Policy Inception',
        'Number of Open Complaints',
        'Number of Policies',
        'Total Claim Amount',
        'GenderFactorized',
        'EducationFactorized'
    ]]
)

In [None]:
logit_fit = logit.fit()

In [None]:
logit_fit.summary()