# Lead Scoring Case Study

## Problem Statement

X Education offers online courses to industry professionals. On a daily basis, professionals who are interested in the courses land on its website and browse for courses. Though X Education acquires a lot of leads, its conversion rate (lead to purchase) is poor. The company wants to identify potential leads so that the sales team can focus on communicating with the potential clients.

The company wants to develop a model to improve the identification of hot leads.

## Objective

Develop a logistic regression model to identify leads with a sensitivity of ~80%.

## The analysis structure is as follows:
- Read the data
- Inspect and Clean the data
- Prepare the data for modelling
- Model development and model optimization
- Identifying the optimal probability cutoff
- Checking model on test data
- Final Model and generation of lead Score 

## 1. Importing Libraries

In [None]:
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np

import seaborn as sns
from matplotlib import pyplot as plt

pd.set_option('display.max_rows', 999)
pd.set_option('display.max_columns', 999)

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE

import statsmodels.api as sm

from sklearn import metrics

from statsmodels.stats.outliers_influence import variance_inflation_factor

## 2. Importing Data

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Importing the dataset
location = '/kaggle/input//leadscore/Leads.csv'
lead_data = pd.read_csv(location)

# Inspecting the original dataFrame.
lead_data.head()

## 3. Data Cleaning and Data Inspection

### 3.1 Understanding the provided dataset

In [None]:
# Number of rows and columns
lead_data.shape

In [None]:
# Understanding the data types and missing values in the features
lead_data.info()

In [None]:
# Understanding the statistics of the numerical columns
lead_data.describe()

### 3.2 Data Cleaning
<ol>
<li> Handling the 'select' values in the categorical columns.
<li> Identifying the percentage of missing value in the columns.
<li> Dropping the columns with high missing values.
<li> Handling skewed data distribution.
<li> Treating the varying unique categories.
<li> Dropping rows and columns with high missing value
</ol>

#### 3.2.1 Handling the 'select' value in the categorical columns

In [None]:
# Identifying the number of 'Select' values in each column
(lead_data == 'Select').sum()

In [None]:
# checking missing value across each column
lead_data.isnull().sum()

In [None]:
# Replacing 'select' with null value
lead_data.replace('Select', np.nan, inplace=True)

In [None]:
# Checking that no 'Select' values remain (every count should be 0)
(lead_data == 'Select').sum()

In [None]:
# Checking that the former 'Select' values are now counted as nulls
lead_data.isnull().sum()

In [None]:
# Identifying number of select values across each categorical variable
# lead_data[lead_data[list(lead_data.columns)]=='Select'].notnull().sum()

#### 3.2.2 Identifying the percentage of missing value in the columns

In [None]:
# Calculating the percentage of missing values in each column
round(100.0 * lead_data.isnull().sum()/len(lead_data), 2)

##### There are a few columns with a lot of missing values. Hence, we will drop columns with more than 40% missing values


#### 3.2.3 Dropping the columns with high missing values

Dropping the columns with more than 40% missing value
- How did you hear about X Education
- Lead Quality  
- Lead Profile
- Asymmetrique Activity Index
- Asymmetrique Profile Index
- Asymmetrique Activity Score   
- Asymmetrique Profile Score
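
The list above can also be derived programmatically rather than typed by hand. The following is a minimal sketch under stated assumptions: the helper name `columns_above_missing_threshold` and the toy frame are hypothetical and only stand in for `lead_data`; the 40% threshold is the one stated above.

```python
import numpy as np
import pandas as pd

def columns_above_missing_threshold(df, threshold_pct=40.0):
    """Return the columns whose share of missing values exceeds threshold_pct."""
    missing_pct = 100.0 * df.isnull().sum() / len(df)
    return list(missing_pct[missing_pct > threshold_pct].index)

# Toy frame with made-up values, standing in for lead_data
toy = pd.DataFrame({
    'mostly_missing': [np.nan, np.nan, np.nan, 'a', 'b'],  # 60% missing
    'some_missing': [1.0, np.nan, 3.0, 4.0, 5.0],          # 20% missing
    'complete': [1, 2, 3, 4, 5],                           # 0% missing
})

high_missing = columns_above_missing_threshold(toy)
print(high_missing)
```

Passing the returned list to `DataFrame.drop(columns=...)` would then remove those columns in one step.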

In [None]:
# dropping the column
lead_df = lead_data.drop(columns = ['How did you hear about X Education','Lead Quality','Lead Profile','Asymmetrique Activity Index','Asymmetrique Profile Index','Asymmetrique Activity Score','Asymmetrique Profile Score'])
lead_df.head()

In [None]:
# Original lead dataframe
lead_data.shape

In [None]:
# Cleaned/ Updated Dataframe
lead_df.shape

#### 3.2.4 Creating list of categorical and numerical features

In [None]:
lead_df.info()

In [None]:
# creating a list of numerical feature
numerical_columns = ['Lead Number', 'Converted','TotalVisits', 'Total Time Spent on Website','Page Views Per Visit']
numerical_columns

In [None]:
# creating a list of categorical columns
categorical_columns = [c for c in lead_df.columns if c not in numerical_columns]
categorical_columns

#### 3.2.5 Checking the categorical columns having skewed distribution

In [None]:
for i in categorical_columns[1:]:
    plt.figure(figsize=(20,8))
    ax = (pd.Series(lead_df[i]).value_counts(normalize=True, sort=False)*100).plot.bar()
    ax.set(ylabel="Percent")
    plt.title(i)
    plt.show()

#### Based on the above bar chart, we can see the following categorical columns are heavily skewed (~100%) towards one category:
- Do Not Call
- What matters most to you in choosing a course
- Search
- Magazine
- Newspaper Article
- X Education Forums
- Newspaper
- Digital Advertisement
- Through Recommendations
- Receive More Updates About Our Courses
- Update me on Supply Chain Content
- Get updates on DM Content
- I agree to pay the amount through cheque

In [None]:
skewed_columns= [
    'Do Not Call',
    'What matters most to you in choosing a course',
    'Search',
    'Magazine',
    'Newspaper Article',
    'X Education Forums',
    'Newspaper',
    'Digital Advertisement',
    'Through Recommendations',
    'Receive More Updates About Our Courses',
    'Update me on Supply Chain Content',
    'Get updates on DM Content',
    'I agree to pay the amount through cheque'
]

In [None]:
lead_df = lead_df.drop(columns= skewed_columns)

In [None]:
# After dropping the skewed data.
lead_df.shape

#### 3.2.6 Checking the unique categories in the categorical columns

In [None]:
categorical_columns

In [None]:
# Rebuilding the list after dropping the skewed columns
categorical_columns = [c for c in lead_df.columns if c not in numerical_columns]
categorical_columns

In [None]:
# finding number of unique categories in each columns 
lead_df[categorical_columns].nunique(dropna=True)

There are a few columns with a high number of categories. For each such column, we will check the percentage of records in each category and may combine the low-contribution categories into a single 'others' category.
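
The grouping applied to each column in the following subsections can be summarised in one helper. This is a minimal sketch, not the notebook's own code: the function name `group_rare_categories`, the toy values, and the 15% threshold are assumptions for illustration, and it uses `Series.where` rather than the `replace` calls used in the cells.

```python
import pandas as pd

def group_rare_categories(series, min_share, other_label='others'):
    """Relabel categories whose share of non-null records is below min_share."""
    shares = series.value_counts(normalize=True)  # Series indexed by category
    rare = shares[shares < min_share].index
    # Nulls stay null; only rare, non-null categories are relabelled
    return series.where(~series.isin(rare), other_label)

# Toy column with made-up values: 6 + 3 + 1 non-null records and 2 nulls
toy = pd.Series(['Google'] * 6 + ['Direct Traffic'] * 3 + ['Blog'] + [None] * 2)
grouped = group_rare_categories(toy, min_share=0.15)
print(grouped.value_counts(dropna=False))
```

Because `value_counts(normalize=True)` ignores nulls by default, the shares are computed over the non-null records only, which matches how the percentages are displayed in the cells of this section.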

#### 3.2.6.1 Checking the unique categories in the Lead Origin column

In [None]:
round(lead_df['Lead Origin'].value_counts(normalize = True)*100.0, 2)

In [None]:
# Assigning the result back instead of using inplace=True on a single column
lead_df['Lead Origin'] = lead_df['Lead Origin'].replace(['Lead Import','Quick Add Form'], 'others')

In [None]:
round(lead_df['Lead Origin'].value_counts(normalize = True)*100.0, 2)

#### 3.2.6.2 Checking the unique categories in the Lead Source column

In [None]:
round(lead_df['Lead Source'].value_counts(normalize = True)*100.0, 2)

In [None]:
# Share of records in each category (a Series indexed by category)
lead_source = lead_df['Lead Source'].value_counts(normalize = True)

In [None]:
# Grouping categories that hold less than 5% of the records into 'others'
lead_df['Lead Source'] = lead_df['Lead Source'].replace(list(lead_source[lead_source < 0.05].index), 'others')

In [None]:
round(lead_df['Lead Source'].value_counts(normalize = True)*100.0, 2)

#### 3.2.6.3 Checking the unique categories in the Last Activity column

In [None]:
round(lead_df['Last Activity'].value_counts(normalize = True)*100.0, 2)

In [None]:
# Share of records in each category (a Series indexed by category)
last_activity = lead_df['Last Activity'].value_counts(normalize = True)

In [None]:
# Grouping categories that hold less than 5% of the records into 'others'
lead_df['Last Activity'] = lead_df['Last Activity'].replace(list(last_activity[last_activity < 0.05].index), 'others')

In [None]:
round(lead_df['Last Activity'].value_counts(normalize = True)*100.0, 2)

#### 3.2.6.4 Checking the unique categories in the Country column

In [None]:
round(lead_df['Country'].value_counts(normalize = True)*100.0, 2)

In [None]:
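# Share of records in each category (a Series indexed by category)
country = lead_df['Country'].value_counts(normalize = True)

In [None]:
# Grouping categories that hold less than 1% of the records into 'others'
lead_df['Country'] = lead_df['Country'].replace(list(country[country < 0.01].index), 'others')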

In [None]:
round(lead_df['Country'].value_counts(normalize = True)*100.0, 2)

#### 3.2.6.5 Checking the unique categories in the Specialization column

In [None]:
round(lead_df['Specialization'].value_counts(normalize = True)*100.0, 2)

In [None]:
# Share of records in each category (a Series indexed by category)
specialization = lead_df['Specialization'].value_counts(normalize = True)

In [None]:
# Grouping categories that hold less than 5% of the records into 'others'
lead_df['Specialization'] = lead_df['Specialization'].replace(list(specialization[specialization < 0.05].index), 'others')

In [None]:
round(lead_df['Specialization'].value_counts(normalize = True)*100.0, 2)

#### 3.2.6.6 Checking the unique categories in the What is your current occupation column

In [None]:
round(lead_df['What is your current occupation'].value_counts(normalize = True)*100.0, 2)

In [None]:
# Share of records in each category (a Series indexed by category)
occupation = lead_df['What is your current occupation'].value_counts(normalize = True)

In [None]:
# Grouping categories that hold less than 1% of the records into 'others'
lead_df['What is your current occupation'] = lead_df['What is your current occupation'].replace(list(occupation[occupation < 0.01].index), 'others')

In [None]:
round(lead_df['What is your current occupation'].value_counts(normalize = True)*100.0, 2)

#### 3.2.6.7 Checking the unique categories in the Tags column

In [None]:
round(lead_df['Tags'].value_counts(normalize = True)*100.0, 2)

In [None]:
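# Share of records in each category (a Series indexed by category)
tags = lead_df['Tags'].value_counts(normalize = True)

In [None]:
# Grouping categories that hold less than 3% of the records into 'others'
lead_df['Tags'] = lead_df['Tags'].replace(list(tags[tags < 0.03].index), 'others')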

In [None]:
round(lead_df['Tags'].value_counts(normalize = True)*100.0, 2)

#### 3.2.6.8 Checking the unique categories in the City column

In [None]:
round(lead_df['City'].value_counts(normalize = True)*100.0, 2)

In [None]:
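# Share of records in each category (a Series indexed by category)
city = lead_df['City'].value_counts(normalize = True)

In [None]:
# Grouping categories that hold less than 1% of the records into 'others'
lead_df['City'] = lead_df['City'].replace(list(city[city < 0.01].index), 'others')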

In [None]:
round(lead_df['City'].value_counts(normalize = True)*100.0, 2)

#### 3.2.6.9 Checking the unique categories in the Last Notable Activity  column

In [None]:
round(lead_df['Last Notable Activity'].value_counts(normalize = True)*100.0, 2)

In [None]:
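# Share of records in each category (a Series indexed by category)
last_notable_activity = lead_df['Last Notable Activity'].value_counts(normalize = True)

In [None]:
# Grouping categories that hold less than 5% of the records into 'others'
lead_df['Last Notable Activity'] = lead_df['Last Notable Activity'].replace(list(last_notable_activity[last_notable_activity < 0.05].index), 'others')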

In [None]:
round(lead_df['Last Notable Activity'].value_counts(normalize = True)*100.0, 2)

In [None]:
# checking the updated number of unique categories in each columns 
lead_df[categorical_columns[1:]].nunique(dropna=True)

#### 3.2.7 Handling the columns with lower missing value
In this step, we will adopt two techniques:
1) Drop the rows with a high number of missing values (5 or more)
2) Treat the missing values of individual column

In [None]:
# Checking for the no. of null values per column
round(100.0 * lead_df.isnull().sum()/len(lead_df), 2)

##### 3.2.7.1 Dropping rows with high missing value

In [None]:
lead_df['missing_row_count'] = lead_df.isnull().sum(axis=1)

In [None]:
lead_df['missing_row_count'].value_counts()

In [None]:
print ("The total no. of rows with high missing values",len(lead_df) - len(lead_df[lead_df['missing_row_count']<5]))

In [None]:
lead_df.shape

In [None]:
lead_df = lead_df[lead_df['missing_row_count']<5]

In [None]:
# Updated dataset
lead_df.shape

In [None]:
round(100.0 * lead_df.isnull().sum()/len(lead_df), 2)

##### 3.2.7.2 Treating columns with high missing value

###### 3.2.7.2.1 Treating Lead Source

In [None]:
lead_df['Lead Source'].value_counts(normalize=True)

The values of the Lead Source feature are spread across multiple categories, so imputation is difficult. Further, the number of rows with a missing value is small, so we will simply drop those rows.

In [None]:
lead_df = lead_df[~pd.isnull(lead_df['Lead Source'])]

In [None]:
lead_df.shape

In [None]:
round(100.0 * lead_df.isnull().sum()/len(lead_df), 2)

###### 3.2.7.2.2 Treating TotalVisits

In [None]:
lead_df['TotalVisits'].describe()

In [None]:
sns.boxplot(y =lead_df['TotalVisits'])

The values of the TotalVisits feature are broadly distributed, so imputation is difficult. Further, the number of rows with a missing value is small, so we will simply drop those rows.

In [None]:
lead_df = lead_df[~pd.isnull(lead_df['TotalVisits'])]

In [None]:
lead_df.shape

In [None]:
round(100.0 * lead_df.isnull().sum()/len(lead_df), 2)

###### 3.2.7.2.3 Treating Country column

In [None]:
lead_df['Country'].value_counts(normalize=True)

As the majority of the values in the Country column are India, we will replace nulls with India.

In [None]:
lead_df['Country'] = lead_df['Country'].fillna('India')

In [None]:
round(100.0 * lead_df.isnull().sum()/len(lead_df), 2)

###### 3.2.7.2.4 Treating Specialization column

In [None]:
lead_df['Specialization'].value_counts(normalize=True)

There is no clear mode for this feature, so imputing a central tendency value would not be appropriate. Further, it may be an important predictive feature. Hence, we will label the nulls as a separate 'Missing Value' category and later check whether it helps in prediction.

In [None]:
lead_df['Specialization'] = lead_df['Specialization'].fillna('Missing Value')

###### 3.2.7.2.5 Treating What is your current occupation column

In [None]:
lead_df['What is your current occupation'].value_counts(normalize=True)

As the majority of the values in the What is your current occupation column are Unemployed, we will replace nulls with Unemployed.

In [None]:
lead_df['What is your current occupation'] = lead_df['What is your current occupation'].fillna('Unemployed')

###### 3.2.7.2.6 Treating Tags column

In [None]:
lead_df['Tags'].value_counts(normalize=True)

There is no clear mode in this column. Further, this feature is collected only when a call is made to the user, and it is not required for our prediction. Hence, we will drop it.

In [None]:
lead_df.drop(columns=['Tags'], inplace= True)

###### 3.2.7.2.7 Treating City column

In [None]:
lead_df['City'].value_counts(normalize=True)

There is no clear mode for this feature, so imputing a central tendency value would not be appropriate.
Further, Mumbai and the null values together account for 6752 records, and the remaining categories hold about 12% of the values, which add little information. Hence, we will drop this column.

In [None]:
lead_df.drop(columns='City', inplace=True)

In [None]:
round(100.0 * lead_df.isnull().sum()/len(lead_df), 2)

In [None]:
lead_df.drop(columns='missing_row_count', inplace=True)

In [None]:
lead_df.shape

### 3.3 EDA for numerical columns

#### 3.3.1 Univariate Analysis

In [None]:
for i in numerical_columns[1:]:
    sns.boxplot(y=lead_df[i])
    plt.title(i)
    plt.show()

#### As boxplots show the presence of outliers in TotalVisits and Page Views Per Visit, we have to do outlier treatment. We will use winsorizing/capping for outlier treatment
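
The capping described above can be written compactly with `Series.clip`. This is a minimal sketch under stated assumptions: the helper name `cap_at_quantile` and the toy values are hypothetical, and, like the cells that follow, it caps only the upper tail at the 95th percentile.

```python
import pandas as pd

def cap_at_quantile(series, q=0.95):
    """Cap values above the q-th quantile at that quantile (upper tail only)."""
    upper = series.quantile(q)
    return series.clip(upper=upper)

# Toy column with one extreme value (made-up data)
toy_visits = pd.Series([0, 1, 2, 2, 3, 3, 4, 5, 6, 250], dtype=float)
capped = cap_at_quantile(toy_visits)
print(capped.max(), toy_visits.quantile(0.95))
```

`clip` returns a new Series, so the result has to be assigned back to the column; values at or below the cap are left unchanged.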

#### 3.3.2 Outlier treatment for TotalVisits

In [None]:
# Capping TotalVisits at its 95th percentile (.loc avoids chained assignment)
q4 = lead_df['TotalVisits'].quantile(0.95)
lead_df.loc[lead_df['TotalVisits'] >= q4, 'TotalVisits'] = q4
q4

In [None]:
sns.boxplot(y=lead_df['TotalVisits'])

#### 3.3.2 Outlier treatment for Page Views Per Visit

In [None]:
# Capping Page Views Per Visit at its 95th percentile (.loc avoids chained assignment)
q4 = lead_df['Page Views Per Visit'].quantile(0.95)
lead_df.loc[lead_df['Page Views Per Visit'] >= q4, 'Page Views Per Visit'] = q4
q4

In [None]:
sns.boxplot(y=lead_df['Page Views Per Visit'])

#### 3.3.3 Bi-variate Analysis

In [None]:
# Heatmap for understanding correlation
plt.figure (figsize= (25,20))
sns.heatmap (lead_df[numerical_columns[1:]].corr(), annot =True, cmap ='YlGnBu')
b, t = plt.ylim() # discover the values for bottom and top
b += 0.5 # Add 0.5 to the bottom
t -= 0.5 # Subtract 0.5 from the top
plt.ylim(b, t) # update the ylim(bottom, top) values
plt.yticks(rotation = 0)
plt.show()

#### TotalVisits and Page Views Per Visit are highly correlated, so we might need to keep only one of them. However, we will assess this later in the model development phase and will rely on VIF

In [None]:
# Pairplot for relation between the features
sns.pairplot(lead_df[numerical_columns[1:]])

### 3.4 Understanding Data lost in data cleaning

In [None]:
round(100.0 * len(lead_df)/len(lead_data), 2)

Various data cleaning steps led to the removal of 11% of the data

## 4. Data Preparation for Modelling

### 4.1 Converting Binary categorical variable to Numerical Variable

In [None]:
lead_df.nunique()

In [None]:
lead_df.head(5)

There are two columns which need to be converted
- Do Not Email
- A free copy of Mastering The Interview

In [None]:
varlist =  ['Do Not Email', 'A free copy of Mastering The Interview']
lead_df[varlist]=lead_df[varlist].apply(lambda x :x.map({'Yes': 1, "No": 0}))

In [None]:
lead_df.head(5)

### 4.2 Creating Dummy Variables

In [None]:
lead_df.info()

In [None]:
lead_df.columns

In [None]:
categorical_features = [
    'Lead Origin',
    'Lead Source',
    'Last Activity',
    'Country',
    'What is your current occupation',
    'A free copy of Mastering The Interview',
    'Last Notable Activity'
]

In [None]:
lead_model_df = lead_df.copy()

In [None]:
# Creating dummy variables for the categorical features and dropping the first level of each.
# dtype=int keeps the dummies numeric (0/1) for the statsmodels GLM used later.
dummy_var = pd.get_dummies(lead_model_df[categorical_features], drop_first=True, dtype=int)

# Adding the results to the master dataframe
lead_model_df = pd.concat([lead_model_df, dummy_var], axis=1)

In [None]:
lead_model_df.head(5)

In [None]:
lead_model_df['Specialization'].value_counts()

In [None]:
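# Creating dummies for Specialization and explicitly dropping the 'Missing Value' level
dummy1 = pd.get_dummies(lead_model_df['Specialization'], dtype=int)
dummy_var_1 = dummy1.drop(columns=['Missing Value'])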
# Adding the results to the master dataframe
lead_model_df = pd.concat([lead_model_df,dummy_var_1], axis=1)
lead_model_df.head(5)

In [None]:
lead_df.shape

In [None]:
lead_model_df.shape

In [None]:
# dropping the repeated categorical columns 
lead_model_df = lead_model_df.drop(columns = ['Lead Origin','Lead Source','Last Activity','Country','What is your current occupation','A free copy of Mastering The Interview','Last Notable Activity','Specialization'])

In [None]:
lead_model_df.shape

In [None]:
lead_model_df.head()

### 4.3 Train Test Split 

In [None]:
# Putting feature variable to X
X = lead_model_df.drop(['Prospect ID','Lead Number','Converted'], axis=1)
X.head()

In [None]:
# Putting response variable to y
y = lead_model_df['Converted']

y.head()

In [None]:
# Splitting the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=100)

### 4.4 Feature Scaling
- Using MinMax scaling (Normalisation) - Compressing the data between 0-1.

     #### Normalisation: $x_{scaled} = \dfrac{x - x_{min}}{x_{max} - x_{min}}$
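
To make the formula concrete, here is a minimal NumPy sketch of the same transformation. It is an illustration only, not a replacement for `MinMaxScaler`: the function names `fit_min_max` and `apply_min_max` and the toy arrays are assumptions, and a column with zero range is not handled.

```python
import numpy as np

def fit_min_max(train_values):
    """Learn the per-column minimum and range from the training data only."""
    col_min = train_values.min(axis=0)
    col_range = train_values.max(axis=0) - col_min
    return col_min, col_range

def apply_min_max(values, col_min, col_range):
    """Apply (x - x_min) / (x_max - x_min) using the training statistics."""
    return (values - col_min) / col_range

# Made-up data: two features, three training rows, one test row
train = np.array([[0.0, 100.0], [5.0, 300.0], [10.0, 500.0]])
test = np.array([[2.5, 600.0]])

col_min, col_range = fit_min_max(train)
train_scaled = apply_min_max(train, col_min, col_range)
test_scaled = apply_min_max(test, col_min, col_range)
print(train_scaled)
print(test_scaled)
```

The test row is scaled with the minimum and range learned from the training rows, so a test value beyond the training maximum maps above 1. This mirrors calling `fit_transform` on the training set and `transform` on the test set.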

In [None]:
scaler = MinMaxScaler()

X_train[['TotalVisits','Total Time Spent on Website','Page Views Per Visit']] = scaler.fit_transform(X_train[['TotalVisits','Total Time Spent on Website','Page Views Per Visit']])

X_train.head()

In [None]:
### Checking the Conversion Rate
conversion = (sum(lead_df['Converted'])/len(lead_df['Converted'].index))*100
conversion

There is some data imbalance but it is not very high

## 5. Model Development

### 5.1 Understanding correlation among features

In [None]:
# Checking for multicollinearity
plt.figure(figsize=(30,10))
sns.heatmap(X_train.corr(),annot = True, cmap="RdYlGn",linewidth =1)
b, t = plt.ylim() # discover the values for bottom and top
b += 0.5 # Add 0.5 to the bottom
t -= 0.5 # Subtract 0.5 from the top
plt.ylim(b, t) # update the ylim(bottom, top) values
plt.yticks(rotation = 0)
plt.show()

#### There is some correlation between features and hence, we might need to drop a few features using VIF

### 5.2 Variable Selection Using RFE

In [None]:
logreg = LogisticRegression()
rfe = RFE(estimator=logreg, n_features_to_select=15)
rfe = rfe.fit(X_train, y_train)

In [None]:
rfe.support_

In [None]:
list(zip(X_train.columns, rfe.support_, rfe.ranking_))

In [None]:
# Identifying features suggested by RFE
col = X_train.columns[rfe.support_]

In [None]:
# features outside top 15 as per RFE
X_train.columns[~rfe.support_]

### 5.3 Model Building 

#### 5.3.1 Model-1

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logm1 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm1.fit()
res.summary()

In [None]:
# Getting the predicted values on the train set
y_train_pred = res.predict(X_train_sm)
y_train_pred[:10]

In [None]:
y_train_pred = y_train_pred.values.reshape(-1)
y_train_pred[:10]

##### Creating a dataframe with the actual conversion flag and the predicted probabilities

In [None]:
y_train_pred_final = pd.DataFrame({'converted':y_train.values, 'converted_prob':y_train_pred})
y_train_pred_final['LeadID'] = y_train.index
y_train_pred_final.head()

##### Creating a new column 'predicted' with 1 if converted_prob > 0.5 else 0. We will later tune the probability cut-off using the ROC curve

In [None]:
y_train_pred_final['predicted'] = y_train_pred_final['converted_prob'].map(lambda x: 1 if x > 0.5 else 0)

# Let's see the head
y_train_pred_final.head()

In [None]:
# Confusion matrix 
confusion = metrics.confusion_matrix(y_train_pred_final['converted'], y_train_pred_final['predicted'])
print(confusion)

In [None]:
# Let's check the overall accuracy.
print("The overall accuracy of the model1 is", round(100.0 * metrics.accuracy_score(y_train_pred_final['converted'], y_train_pred_final['predicted']), 2))

#### Checking VIFs

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

As some of the features have VIF > 5, we will drop one of them (Page Views Per Visit) and then build a new model.
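
For reference, the VIF of feature $i$ is $1 / (1 - R_i^2)$, where $R_i^2$ comes from regressing feature $i$ on the remaining features. The sketch below illustrates that formula with NumPy on synthetic data. It is an assumption-laden illustration: the function name `vif_with_intercept` is hypothetical, and it adds an intercept to each auxiliary regression, so its numbers need not match `variance_inflation_factor`, which regresses on exactly the columns it is given.

```python
import numpy as np

def vif_with_intercept(X):
    """VIF_i = 1 / (1 - R_i^2), with R_i^2 from regressing column i on the others."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for i in range(n_cols):
        target = X[:, i]
        others = np.delete(X, i, axis=1)
        # Design matrix: intercept plus all remaining features
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        r_squared = 1.0 - residual.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Synthetic features: c is nearly a copy of a, b is independent of both
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)
vifs = vif_with_intercept(np.column_stack([a, b, c]))
print(vifs.round(2))
```

The two nearly collinear columns get a large VIF while the independent column stays close to 1, which is the pattern the VIF > 5 rule of thumb is meant to catch.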

#### 5.3.2 Model-2

In [None]:
col = col.drop('Page Views Per Visit')
col

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logm2 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm2.fit()
res.summary()

In [None]:
# Getting the predicted values on the train set
y_train_pred = res.predict(X_train_sm)
y_train_pred[:10]

In [None]:
y_train_pred = y_train_pred.values.reshape(-1)
y_train_pred[:10]

In [None]:
y_train_pred_final = pd.DataFrame({'converted':y_train.values, 'converted_prob':y_train_pred})
y_train_pred_final['LeadID'] = y_train.index
y_train_pred_final.head()

In [None]:
y_train_pred_final['predicted'] = y_train_pred_final['converted_prob'].map(lambda x: 1 if x > 0.5 else 0)

# Let's see the head
y_train_pred_final.head()

In [None]:
# Confusion matrix 
confusion = metrics.confusion_matrix(y_train_pred_final['converted'], y_train_pred_final['predicted'] )
print(confusion)

In [None]:
# Let's check the overall accuracy.
print("The overall accuracy of the model2 is",round(100.0 * metrics.accuracy_score(y_train_pred_final['converted'], y_train_pred_final['predicted']), 2))

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

#### As the Marketing Management feature has a p-value > 0.05, we will drop it and build the next model

#### 5.3.3 Model-3

In [None]:
col = col.drop('Marketing Management')
col

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logm3 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm3.fit()
res.summary()

In [None]:
# Getting the predicted values on the train set
y_train_pred = res.predict(X_train_sm)
y_train_pred[:10]

In [None]:
y_train_pred = y_train_pred.values.reshape(-1)
y_train_pred[:10]

In [None]:
y_train_pred_final = pd.DataFrame({'converted':y_train.values, 'converted_prob':y_train_pred})
y_train_pred_final['LeadID'] = y_train.index
y_train_pred_final.head()

In [None]:
y_train_pred_final['predicted'] = y_train_pred_final['converted_prob'].map(lambda x: 1 if x > 0.5 else 0)

# Let's see the head
y_train_pred_final.head()

In [None]:
y_train_pred_final['predicted']

In [None]:
# Confusion matrix 
confusion = metrics.confusion_matrix(y_train_pred_final['converted'], y_train_pred_final['predicted'] )
print(confusion)

In [None]:
# Let's check the overall accuracy.
print("The overall accuracy of the model3 is",round(100.0 * metrics.accuracy_score(y_train_pred_final['converted'], y_train_pred_final['predicted']), 2))

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

### Model-3 (the final model) appears to be the best model, as all coefficients are statistically significant and none of the features has VIF > 5. Further, there is no major change in accuracy as we move from Model-1 to Model-3. The number of features used is 13 (the 15 selected by RFE minus the two dropped in Model-2 and Model-3)

## 6. Confusion Matrix

In [None]:
TP = confusion[1,1] # true positive 
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives

In [None]:
# Let's see the sensitivity of our logistic regression model
TP / float(TP+FN)

In [None]:
# Let us calculate specificity
TN / float(TN+FP)

In [None]:
# Calculate false positive rate - predicting conversion when the lead did not convert
print(FP/ float(TN+FP))

In [None]:
# Negative predictive value
print (TN / float(TN+ FN))

#### The confusion matrix shows that the chosen cut-off probability (0.5) is not optimal, as it does not give the desired sensitivity. We will use the ROC curve to find the optimal cut-off value
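
The rates computed cell by cell above all follow from the four counts of the confusion matrix. A minimal sketch that gathers them in one place, assuming the `[[TN, FP], [FN, TP]]` layout used above; the function name `classification_rates` and the counts are made up for illustration.

```python
def classification_rates(tn, fp, fn, tp):
    """Rates derived from a 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]]."""
    return {
        'sensitivity': tp / (tp + fn),                # share of converters caught
        'specificity': tn / (tn + fp),                # share of non-converters rejected
        'false_positive_rate': fp / (tn + fp),        # 1 - specificity
        'positive_predictive_value': tp / (tp + fp),
        'negative_predictive_value': tn / (tn + fn),
    }

# Made-up counts, not results from the model above
rates = classification_rates(tn=80, fp=20, fn=30, tp=70)
for name, value in rates.items():
    print(name, round(value, 3))
```

Sensitivity is the rate the objective targets: the share of leads that actually converted which the model flags as likely to convert.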

## 7. ROC curve
- An ROC curve shows the tradeoff between the True Positive Rate (TPR) and the False Positive Rate (FPR) as the probability cutoff varies.
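
To illustrate where the points of the curve come from, the sketch below computes (FPR, TPR) pairs by hand on four made-up records and integrates them with the trapezoidal rule. The function names `roc_points` and `area_under_curve` are hypothetical; the notebook itself relies on `metrics.roc_curve` and `metrics.roc_auc_score`.

```python
import numpy as np

def roc_points(actual, prob):
    """(FPR, TPR) pairs obtained by lowering the cutoff past each distinct score."""
    actual = np.asarray(actual, dtype=bool)
    prob = np.asarray(prob, dtype=float)
    thresholds = np.sort(np.unique(prob))[::-1]  # highest score first
    fpr, tpr = [0.0], [0.0]
    for t in thresholds:
        predicted = prob >= t
        tpr.append(np.sum(predicted & actual) / np.sum(actual))
        fpr.append(np.sum(predicted & ~actual) / np.sum(~actual))
    return np.array(fpr), np.array(tpr)

def area_under_curve(fpr, tpr):
    """Trapezoidal area under the ROC curve."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Four made-up records: two non-converters followed by two converters
actual = [0, 0, 1, 1]
prob = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(actual, prob)
auc = area_under_curve(fpr, tpr)
print(fpr, tpr, auc)
```

Each lower cutoff flags more records, so both rates can only rise; a model whose curve climbs quickly towards TPR = 1 while FPR stays small has an area close to 1.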

In [None]:
def draw_roc( actual, probs ):
    fpr, tpr, thresholds = metrics.roc_curve( actual, probs,
                                              drop_intermediate = False )
    auc_score = metrics.roc_auc_score( actual, probs )
    plt.figure(figsize=(5, 5))
    plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show()

    return None

In [None]:
fpr, tpr, thresholds = metrics.roc_curve( y_train_pred_final['converted'], y_train_pred_final['converted_prob'], drop_intermediate = False )

In [None]:
draw_roc(y_train_pred_final['converted'], y_train_pred_final['converted_prob'])

_**As the ROC curve lies close to the upper-left corner of the graph, the proposed model (Model-3) can be considered a good model**_

In [None]:
# creating columns with different probability cutoffs 
numbers = [float(x)/10 for x in range(10)]
for i in numbers:
    y_train_pred_final[i]= y_train_pred_final['converted_prob'].map(lambda x: 1 if x > i else 0)
y_train_pred_final.head()

In [None]:
# Calculating accuracy sensitivity and specificity for various probability cutoffs.
cutoff_df = pd.DataFrame( columns = ['prob','accuracy','sensi','speci'])
from sklearn.metrics import confusion_matrix

# TP = confusion[1,1] # true positive 
# TN = confusion[0,0] # true negatives
# FP = confusion[0,1] # false positives
# FN = confusion[1,0] # false negatives

num = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
for i in num:
    cm1 = metrics.confusion_matrix(y_train_pred_final['converted'], y_train_pred_final[i] )
    total1=sum(sum(cm1))
    accuracy = (cm1[0,0]+cm1[1,1])/total1
    
    speci = cm1[0,0]/(cm1[0,0]+cm1[0,1])
    sensi = cm1[1,1]/(cm1[1,0]+cm1[1,1])
    cutoff_df.loc[i] =[ i ,accuracy,sensi,speci]
print(cutoff_df)

## 8. Optimal Probability Cutoff

In [None]:
# Let's plot accuracy sensitivity and specificity for various probabilities.
cutoff_df.plot.line(x='prob', y=['accuracy','sensi','speci'])
plt.show()

#### From the curve above, 0.35 is the optimal point to use as the cutoff probability.
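
One common way to read such a plot is to take the cutoff at which the sensitivity and specificity curves cross. The sketch below automates that reading on synthetic scores over a finer grid than the 0.1 steps used above. It is only an illustration under assumptions: the function names and the synthetic data are hypothetical, and it does not reproduce the 0.35 obtained for this model.

```python
import numpy as np

def sensitivity_specificity(actual, prob, cutoff):
    """Sensitivity and specificity when records with prob > cutoff are flagged."""
    predicted = prob > cutoff
    tp = np.sum(predicted & actual)
    fn = np.sum(~predicted & actual)
    tn = np.sum(~predicted & ~actual)
    fp = np.sum(predicted & ~actual)
    return tp / (tp + fn), tn / (tn + fp)

def balanced_cutoff(actual, prob, cutoffs):
    """Cutoff at which sensitivity and specificity are closest to each other."""
    gaps = []
    for c in cutoffs:
        sensi, speci = sensitivity_specificity(actual, prob, c)
        gaps.append(abs(sensi - speci))
    return float(cutoffs[int(np.argmin(gaps))])

# Synthetic labels and scores: positives tend to score high, negatives low
rng = np.random.default_rng(42)
actual = rng.random(2000) < 0.4
prob = np.where(actual, rng.beta(4, 2, size=2000), rng.beta(2, 4, size=2000))

cutoffs = np.arange(0.0, 1.0, 0.01)
best = balanced_cutoff(actual, prob, cutoffs)
sensi, speci = sensitivity_specificity(actual, prob, best)
print(best, round(sensi, 3), round(speci, 3))
```

If the business goal favours sensitivity over specificity, as in this case study, a cutoff somewhat below the crossing point may be preferred.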

In [None]:
y_train_pred_final['final_predicted'] = y_train_pred_final['converted_prob'].map( lambda x: 1 if x > 0.35 else 0)

y_train_pred_final.head()

In [None]:
# Let's check the overall accuracy.
metrics.accuracy_score(y_train_pred_final['converted'], y_train_pred_final['final_predicted'])

In [None]:
confusion2 = metrics.confusion_matrix(y_train_pred_final['converted'], y_train_pred_final['final_predicted'])
confusion2

In [None]:
TP = confusion2[1,1] # true positive 
TN = confusion2[0,0] # true negatives
FP = confusion2[0,1] # false positives
FN = confusion2[1,0] # false negatives

In [None]:
TP,TN,FP,FN

In [None]:
# Let's see the sensitivity of our logistic regression model
TP / float(TP+FN)

In [None]:
# Let us calculate specificity
TN / float(TN+FP)

In [None]:
# Calculate false positive rate - predicting conversion when the lead did not convert
print(FP/ float(TN+FP))

In [None]:
# Positive predictive value 
print (TP / float(TP+FP))

In [None]:
# Negative predictive value
print (TN / float(TN + FN))

#### At a cut-off probability of 0.35, we get the desired sensitivity with acceptable accuracy and specificity. We will choose 0.35 as the cut-off probability and will check model performance on the test data

## 9. Making predictions on the test set

In [None]:
X_test[['TotalVisits','Total Time Spent on Website','Page Views Per Visit']] = scaler.transform(X_test[['TotalVisits','Total Time Spent on Website','Page Views Per Visit']])

In [None]:
X_test = X_test[col]
X_test.head()

In [None]:
X_test_sm = sm.add_constant(X_test)

In [None]:
y_test_pred = res.predict(X_test_sm)

In [None]:
y_test_pred[:10]

In [None]:
# Converting y_test_pred (a Series of predicted probabilities) to a dataframe
y_pred_1 = pd.DataFrame(y_test_pred)
y_pred_1.head()

In [None]:
# Converting y_test to dataframe
y_test_df = pd.DataFrame(y_test)
y_test_df.head()

In [None]:
# Storing the index as a LeadID column
y_test_df['LeadID'] = y_test_df.index
y_test_df.head()

In [None]:
# Removing index for both dataframes to append them side by side 
y_pred_1.reset_index(drop=True, inplace=True)
y_test_df.reset_index(drop=True, inplace=True)

In [None]:
y_test_df.head()

In [None]:
# Appending y_test_df and y_pred_1
y_pred_final = pd.concat([y_test_df, y_pred_1],axis=1)

In [None]:
y_pred_final.head()

In [None]:
# Renaming the column 
y_pred_final= y_pred_final.rename(columns={ 0 : 'converted_prob'})

In [None]:
# Let's see the head of y_pred_final
y_pred_final.head()

In [None]:
y_pred_final['final_predicted'] = y_pred_final['converted_prob'].map(lambda x: 1 if x > 0.35 else 0)

In [None]:
y_pred_final.head()

### 9.1 Confusion matrix on test data

In [None]:
# Let's check the overall accuracy.
metrics.accuracy_score(y_pred_final.Converted, y_pred_final.final_predicted)

In [None]:
confusion2 = metrics.confusion_matrix(y_pred_final.Converted, y_pred_final.final_predicted )
confusion2

In [None]:
TP = confusion2[1,1] # true positive 
TN = confusion2[0,0] # true negatives
FP = confusion2[0,1] # false positives
FN = confusion2[1,0] # false negatives

In [None]:
# Let's see the sensitivity of our logistic regression model
TP / float(TP+FN)

In [None]:
# Let us calculate specificity
TN / float(TN+FP)

#### The sensitivity on the test data is within 5% of the sensitivity on the train data. Hence, we can finalize Model-3 as the final model

## 10. Final model and score variable calculation

### Based on sensitivity, specificity, and accuracy, we can conclude that Model-3 with a cut-off probability of 0.35 is the recommended model for lead identification

### 10.1 Score variable calculation

In [None]:
# .copy() so that the score column can be modified without touching y_train_pred_final
y_train_score_variable = y_train_pred_final[['LeadID','converted','final_predicted','converted_prob']].copy()

In [None]:
y_train_score_variable['converted_prob'] = round(y_train_score_variable['converted_prob']*100.0, 2)
y_train_score_variable.rename(columns = {'converted_prob':'Lead Score'}).head()

In [None]:
# .copy() so that the score column can be modified without touching y_pred_final
y_test_score_variable = y_pred_final[['LeadID','Converted','final_predicted','converted_prob']].copy()

In [None]:
y_test_score_variable['converted_prob'] = round(y_test_score_variable['converted_prob']*100.0, 2)
y_test_score_variable.rename(columns = {'converted_prob':'Lead Score'}).head()
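
The lead score used above is the predicted conversion probability scaled to 0-100, and a lead is flagged when its probability exceeds the chosen cutoff. A minimal sketch of that mapping follows; the function names `lead_score` and `is_hot_lead` and the example probabilities are hypothetical, while the 0.35 cutoff is the one selected in section 8.

```python
def lead_score(probability):
    """Scale a conversion probability in [0, 1] to a 0-100 lead score."""
    return round(probability * 100.0, 2)

def is_hot_lead(probability, cutoff=0.35):
    """Flag a lead when its predicted probability exceeds the cutoff."""
    return probability > cutoff

# Made-up probabilities, not model output
for p in (0.12, 0.35, 0.8123):
    print(p, lead_score(p), is_hot_lead(p))
```

A probability exactly equal to the cutoff is not flagged, matching the strict `x > 0.35` comparison used for `final_predicted`.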