# <font color=red> Problem Statement </font>

X Education sells online courses to industry professionals. Its typical lead conversion rate is around 30%, which is well below the CEO's target of 80%. Revenue is lost because the company cannot identify the leads that have a higher probability of converting. The problem at hand is therefore to identify the most promising leads, a.k.a. "hot leads".

## <font color=blue> 1. Reading and understanding the data </font>

In [None]:
# Importing all the necessary libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")


In [None]:
# Importing the scikit-learn and statsmodels modules used for modelling

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_curve

In [None]:
# Reading the dataset and inspecting the first five rows of the dataset

leads=pd.read_csv("/kaggle/input/lead-scoring-x-online-education/Leads X Education.csv")
leads.head()

In [None]:
#  Shape of the dataframe

leads.shape

#### Observation:

It is observed that there are 37 columns and 9240 rows in leads df

In [None]:
# Inspecting the datatypes of various columns and shape of the dataset

leads.info()

#### Observation:

A few features contain null values. No typecasting is required.

In [None]:
# Inspecting descriptive statistics

leads.describe().transpose()

**Observation:**
- There are null values in many columns that need to be dealt with
- There are outliers in the continuous variables that need to be dealt with. We can gauge this from the large gap between the 75th percentile and the maximum value.

## <font color=blue> 2. Preparing the data for analysis </font>

In [None]:
# Inspecting the entries in each column

for column in leads.columns:
    print(column)
    print(leads[column].value_counts())
    print()

**Observation:**
- Certain categorical columns have "Select" as a value. These are cases where the lead may have left the field blank. Let us identify the variables that contain "Select".

In [None]:
# Identifying categorical variables having "Select"

for column in leads.columns:
    print(column)
    print(leads[column].value_counts()[(leads[column].value_counts().index=="Select")==1])
    print()

**Observation:**
- There are four categorical variables that have "Select", viz., "Specialization", "How did you hear about X Education", "Lead Profile" and "City". This means that the leads have left the columns unfilled. Hence these should be replaced with null before performing missing value check
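
The per-column loop above can also be expressed as a single vectorised count. The snippet below is a minimal sketch on a made-up frame (`toy` and its values are illustrative assumptions, not the actual `leads` data); it counts how often the placeholder `"Select"` appears in each column and keeps only the columns where it occurs.

```python
import pandas as pd

# Toy frame standing in for `leads`; the values are invented for illustration.
toy = pd.DataFrame({
    "Specialization": ["Select", "Finance", "Select", "HR"],
    "City": ["Mumbai", "Select", "Pune", "Mumbai"],
    "Converted": [0, 1, 0, 1],
})

# Element-wise comparison followed by a column-wise sum counts "Select" per column.
select_counts = (toy == "Select").sum()

# Keep only the columns in which the placeholder actually occurs.
select_counts = select_counts[select_counts > 0]
print(select_counts)
```

Dividing `select_counts` by `len(toy)` would give the share of placeholder entries per column.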

In [None]:
# Determining the % of cases with "Select"

print(round(leads["Specialization"].value_counts()[leads["Specialization"].value_counts().index=="Select"]/leads.shape[0]*100,2))
print(round(leads["How did you hear about X Education"].value_counts()[leads["How did you hear about X Education"].value_counts().index=="Select"]/leads.shape[0]*100,2))
print(round(leads["Lead Profile"].value_counts()[leads["Lead Profile"].value_counts().index=="Select"]/leads.shape[0]*100,2))
print(round(leads["City"].value_counts()[leads["City"].value_counts().index=="Select"]/leads.shape[0]*100,2))

#### Observation:

It is observed that more than 20% of the leads have not filled the above 4 features

In [None]:
# To replace Select with null for the features - "Specialization", "How did you hear about X Education", "Lead Profile" and "City"

varlist = ["Specialization", "How did you hear about X Education", "Lead Profile", "City"]

for column in varlist:
    leads[column] = leads[column].replace('Select',np.nan) 


In [None]:
# Identifying the percentage of missing values per column

round(leads.isnull().sum()[leads.isnull().sum()>0]/leads.shape[0]*100,2)

In [None]:
# List of columns with null values more than 35% 

leads.isnull().sum()[(leads.isnull().sum()/leads.shape[0]*100)>35]/leads.shape[0]*100

In [None]:
# Dropping all columns with null values more than 35% 

leads.drop(leads.isnull().sum()[(leads.isnull().sum()/leads.shape[0]*100)>35].index, axis=1, inplace = True)

In [None]:
# Verify that all the features with more than 35% null values have been dropped

leads.info()

#### Observation:

All the features with more than 35% of null values are dropped

In [None]:
# Dropping "Prospect ID" as we already have a "Lead Number" to identify each lead

leads.drop("Prospect ID", axis=1, inplace=True)

In [None]:
# Verify the value_counts for each and every feature

for column in leads.columns:
    print(column)
    print(leads[column].value_counts())
    print()

#### Observation:

It is observed that the below features are highly skewed (one value dominates almost all records). They carry little information for the model and may bias its results, so they can be dropped.


        1) Do Not Email
        2) Do Not Call
        3) Country
        4) What is your current occupation
        5) What matters most to you in choosing a course
        6) Search
        7) Magazine
        8) Newspaper Article
        9) X Education Forums
        10) Newspaper
        11) Digital Advertisement
        12) Through Recommendations
        13) Receive More Updates About Our Courses
        14) Update me on Supply Chain Content
        15) Get updates on DM Content
        16) I agree to pay the amount through cheque

In [None]:
# Features whose data is highly skewed to be dropped
# (`columns=` already identifies the axis, so no separate `axis` argument is needed)
leads.drop(columns = ['Do Not Email', 'Do Not Call', 'Country', 'What is your current occupation', 
                      'What matters most to you in choosing a course','Search','Magazine', 'Newspaper Article',
                      'X Education Forums','Newspaper','Digital Advertisement','Through Recommendations',
                      'Receive More Updates About Our Courses', 'Update me on Supply Chain Content', 'Get updates on DM Content',
                      'I agree to pay the amount through cheque'], inplace=True)

In [None]:
# Identify the remaining percentage of missing values per column

round(leads.isnull().sum()[leads.isnull().sum()>0]/leads.shape[0]*100,2)

#### Observation:

As the no. of missing values in above features is less than 5%, the records with missing values can be deleted

In [None]:
# Dropping rows where the % null are in single digits

leads=leads[~leads["Lead Source"].isnull()]
leads=leads[~leads["TotalVisits"].isnull()]
leads=leads[~leads["Page Views Per Visit"].isnull()]
leads=leads[~leads["Last Activity"].isnull()]
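
The four filters above can also be written as one `dropna` call restricted to the affected columns. The sketch below uses an invented frame (`toy`) and a hypothetical list `cols_to_check`; it is meant only to illustrate the pattern, not to replace the cell above.

```python
import numpy as np
import pandas as pd

# Invented data; in the notebook this would be the `leads` dataframe.
toy = pd.DataFrame({
    "Lead Source": ["Google", np.nan, "Olark Chat", "Google"],
    "TotalVisits": [2.0, 5.0, np.nan, 1.0],
    "Converted": [0, 1, 0, 1],
})

# Hypothetical list of columns whose missing rows should be dropped.
cols_to_check = ["Lead Source", "TotalVisits"]

# Drop every row that has a null in at least one of the listed columns.
cleaned = toy.dropna(subset=cols_to_check)
print(cleaned.shape)
```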

In [None]:
# Feature 'A free copy of Mastering The Interview' to map to meaningful value for the model to understand

leads['A free copy of Mastering The Interview'] = leads['A free copy of Mastering The Interview'].apply(lambda x: 1 if x=='Yes' else 0)

In [None]:
# Cleaning variable "Lead Source":

def clean(x):
    return(x.title())

leads['Lead Source']=leads['Lead Source'].apply(clean)

In [None]:
# Inspecting variable "Lead Source"

leads["Lead Source"].value_counts()

In [None]:
# Converting all values of variable "Lead Source" having count of less than 10 as "Others" 

# reset_index adds an "index" column, which is dropped again further below
leads.reset_index(inplace=True)

# Vectorised relabelling; avoids chained assignment such as leads[col][i] = ...
source_counts = leads["Lead Source"].value_counts()
LeadSource_lessthan10 = source_counts[source_counts<10].index
leads.loc[leads["Lead Source"].isin(LeadSource_lessthan10), "Lead Source"] = "Others"

In [None]:
# Inspecting variable "Lead Source":

leads["Lead Source"].value_counts()

In [None]:
# Converting all values of variable "Last Activity" having count of less than 10 as "Others" 

# Vectorised relabelling; avoids chained assignment such as leads[col][i] = ...
activity_counts = leads["Last Activity"].value_counts()
LastActivity_lessthan10 = activity_counts[activity_counts<10].index
leads.loc[leads["Last Activity"].isin(LastActivity_lessthan10), "Last Activity"] = "Others"

In [None]:
# Inspecting variable "Last Activity":

leads["Last Activity"].value_counts()

In [None]:
# Converting all values of variable "Last Notable Activity" having count of less than 10 as "Others" 

# Vectorised relabelling; avoids chained assignment such as leads[col][i] = ...
notable_counts = leads["Last Notable Activity"].value_counts()
LastNotableActivity_lessthan10 = notable_counts[notable_counts<10].index
leads.loc[leads["Last Notable Activity"].isin(LastNotableActivity_lessthan10), "Last Notable Activity"] = "Others"

In [None]:
# Inspecting variable "Last Notable Activity":

leads["Last Notable Activity"].value_counts()
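
The three "Others" conversions above follow the same pattern, so they could be wrapped in one helper. The function below is a minimal sketch; the name `lump_rare_levels` and its parameters are assumptions made for illustration, and the series is invented.

```python
import pandas as pd

def lump_rare_levels(series, min_count=10, other_label="Others"):
    """Return a copy of `series` in which levels seen fewer than `min_count` times are relabelled."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # `where` keeps values for which the condition is True and replaces the rest.
    return series.where(~series.isin(rare), other_label)

# Invented example: one frequent level and two rare ones.
toy = pd.Series(["Google"] * 12 + ["Blog"] * 2 + ["Press Release"])
lumped = lump_rare_levels(toy, min_count=10)
print(lumped.value_counts())
```

Applied to the notebook, this would read along the lines of `leads["Lead Source"] = lump_rare_levels(leads["Lead Source"])`.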

In [None]:
# Dropping redundant "index" column

leads.drop("index", axis=1, inplace=True)

In [None]:
# Check the shape of the dataframe

leads.shape

In [None]:
leads.info()

#### Observation

With the completion of missing value treatment and other data cleaning activities, the dataframe now has 9074 records with 10 features

## <font color=blue> 3. Segmented Univariate Analysis </font>

### Analysis of Categorical variables

In [None]:
# Visualising the target variable "Converted"

sns.countplot(x=leads["Converted"])
plt.title ("Conversion of leads to clients")
plt.show()

In [None]:
# % of Leads who converted

round(np.mean(leads["Converted"]),2)

**Observation:**
- We can see that less than 40% of the leads were converted to clients.
- It will be useful to conduct segmented univariate analysis, to visualize how the leads who converted behave as compared to those that did not for all categorical and continuous feature variables.
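
Alongside the plots, a segmented view can also be tabulated directly. The sketch below groups an invented frame by one categorical variable and reports the number of leads and the conversion rate per level; the data and the output column names `leads` and `conversion_rate` are illustrative assumptions.

```python
import pandas as pd

# Invented data standing in for the cleaned `leads` dataframe.
toy = pd.DataFrame({
    "Lead Origin": ["API", "API", "Lead Add Form", "Lead Add Form", "API", "Landing Page Submission"],
    "Converted": [0, 1, 1, 1, 0, 0],
})

# Since "Converted" is 0/1, its mean per group is the conversion rate of that group.
summary = (
    toy.groupby("Lead Origin")["Converted"]
       .agg(leads="count", conversion_rate="mean")
       .sort_values("conversion_rate", ascending=False)
)
print(summary)
```

Reading the count next to the rate helps to avoid over-interpreting levels that contain only a handful of leads.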

In [None]:
# Visualising variable "Lead Origin"

plt.figure (figsize=[10,5])
plt.subplot(1,2,1)
explodeTuple = (0.0, 0.0, 1.12, 0.0)
plt.pie(leads["Lead Origin"][leads["Converted"]==0].value_counts(),explode=explodeTuple,labels=leads["Lead Origin"][leads["Converted"]==0].value_counts().index,autopct='%1.1f%%')
plt.title("Origin of leads who did not convert to clients")
plt.subplot(1,2,2)
explodeTuple1 = (0.0, 0.0, 0.2, 0.0)
plt.pie(leads["Lead Origin"][leads["Converted"]==1].value_counts(),explode=explodeTuple1,labels=leads["Lead Origin"][leads["Converted"]==1].value_counts().index,autopct='%1.1f%%')
plt.title("Origin of leads who converted to clients")
plt.subplots_adjust(right=1.5)

**Observation:**
- More than 50% of the leads originated from "Landing Page Submission"
- 15.8% of the leads who converted originated from "Lead Add Form"
- API accounts for 32.5% of the converted leads, which is lower than its share of the non-converted leads (43.7%)


In [None]:
# Visualizing "Lead Source"

plt.figure(figsize=[15,5])
plt.subplot(1,2,1)
round(leads["Lead Source"][leads["Converted"]==0].value_counts(normalize=True, ascending=True)*100,1).plot(kind="barh")
plt.title ("% of Lead Source who did not get converted to clients")
plt.xlabel("Percentage")
plt.subplot(1,2,2)
round(leads["Lead Source"][leads["Converted"]==1].value_counts(normalize=True, ascending=True)*100,1).plot(kind="barh")
plt.title ("% of Lead Source who got converted to clients")
plt.xlabel("Percentage")
plt.subplots_adjust(right=1.2)
plt.show()

**Observation:**
- More than 30% of the leads are sourced from Google
- Leads sourced from Reference and Welingak Website seem to have a higher probability of getting converted
- Leads sourced from Olark Chat seem to have a lower probability of getting converted

In [None]:
# Visualizing "Last Activity"

plt.subplot(1,2,1)
sns.countplot(x=leads["Last Activity"][leads["Converted"]==0], 
              order=leads["Last Activity"][leads["Converted"]==0].value_counts().index)
plt.title("Leads who did not convert to clients")
plt.xticks(rotation=60)
plt.subplot(1,2,2)
sns.countplot(x=leads["Last Activity"][leads["Converted"]==1], 
              order=leads["Last Activity"][leads["Converted"]==1].value_counts().index)
plt.title("Leads who converted to clients")
plt.xticks(rotation=60)
plt.subplots_adjust(right=2.5)
plt.show()

# Visualizing "Last Notable Activity"

plt.subplot(1,2,1)
sns.countplot(x=leads["Last Notable Activity"][leads["Converted"]==0], 
              order=leads["Last Notable Activity"][leads["Converted"]==0].value_counts().index)
plt.title("Leads who did not convert to clients")
plt.xticks(rotation=60)
plt.subplot(1,2,2)
sns.countplot(x=leads["Last Notable Activity"][leads["Converted"]==1], 
              order=leads["Last Notable Activity"][leads["Converted"]==1].value_counts().index)
plt.title("Leads who converted to clients")
plt.xticks(rotation=60)
plt.subplots_adjust(right=2.5)
plt.show()

**Observation:**
- Leads who got converted seem to be more responsive to SMS and less responsive to e-mail
- Leads who got converted to clients seem to be less active on Olark Chat as compared to those who did not get converted to clients.
- Leads who got converted to clients seem to be less active on e-mail as compared to those who did not get converted. This is evident as fewer of the leads who converted have opened an e-mail or clicked on an e-mail link sent by X Education.
- Leads who have already converted to clients for a course may not sign up for another course
- Leads who seem to have modified their application form more often are less likely to convert to clients

In [None]:
# Visualizing "A free copy of Mastering The Interview"

plt.subplot(1,2,1)
sns.countplot(x=leads["A free copy of Mastering The Interview"][leads["Converted"]==0])
plt.title("Leads who did not get converted to clients")
plt.subplot(1,2,2)
sns.countplot(x=leads["A free copy of Mastering The Interview"][leads["Converted"]==1])
plt.title("Leads who got converted to clients")
plt.subplots_adjust(right=1.5)
plt.show()

In [None]:
# Leads who converted and asked for "A free copy of Mastering The Interview"

round(np.mean(leads["A free copy of Mastering The Interview"][leads["Converted"]==1]),2)

**Observation:**
- Irrespective of whether or not leads are getting converted to clients, 30% of them are asking for "A free copy of Mastering The Interview".

### Analysis of Continuous variables

In [None]:
# Visualizing the "TotalVisits" 

plt.figure(figsize=[5,15])
sns.boxplot(y=leads["TotalVisits"],x=leads["Converted"])
plt.show()

In [None]:
# Visualizing the distribution of "TotalVisits"

plt.figure(figsize=[5,5])
plt.subplot(1,2,1)
sns.histplot(x=leads["TotalVisits"][leads["Converted"]==0],bins=30,kde=True,stat="density")
plt.title("Leads who did not get converted")
plt.subplot(1,2,2)
sns.histplot(x=leads["TotalVisits"][leads["Converted"]==1],bins=30,kde=True,stat="density")
plt.title("Leads who got converted")
plt.subplots_adjust(right=2.5)
plt.show()

**Observation:**
- Leads who get converted to clients have a wider range of total visits than those who do not get converted.

In [None]:
# Count of outliers in "TotalVisits" of leads who did not get converted to clients

Q3,Q1=np.percentile(leads["TotalVisits"][leads["Converted"]==0],[75,25])
IQR=Q3-Q1
UW=Q3+IQR*1.5
leads["TotalVisits"][leads["Converted"]==0][leads["TotalVisits"]>UW].count()

In [None]:
# Count of outliers in "TotalVisits" of leads who got converted to clients

Q3,Q1=np.percentile(leads["TotalVisits"][leads["Converted"]==1],[75,25])
IQR=Q3-Q1
UW=Q3+IQR*1.5
leads["TotalVisits"][leads["Converted"]==1][leads["TotalVisits"]>UW].count()

#### Observation:

As there are many outliers among both converted and non-converted leads, they can be treated by capping.

In [None]:
# Capping outliers in "TotalVisits" of leads

Q3,Q1=np.percentile(leads["TotalVisits"],[75,25])
IQR=Q3-Q1
UW=Q3+IQR*1.5
leads["TotalVisits"]=leads["TotalVisits"].apply(lambda x: UW if x>=UW else x)

In [None]:
# Visualizing the "Total Time Spent on Website" 

plt.figure(figsize=[5,15])
sns.boxplot(y=leads["Total Time Spent on Website"],x=leads["Converted"])
plt.show()

In [None]:
# Visualizing the distribution of "Total Time Spent on Website"

plt.figure(figsize=[5,5])
plt.subplot(1,2,1)
sns.histplot(x=leads["Total Time Spent on Website"][leads["Converted"]==0],bins=30,kde=True,stat="density")
plt.title("Leads who did not get converted")
plt.subplot(1,2,2)
sns.histplot(x=leads["Total Time Spent on Website"][leads["Converted"]==1],bins=30,kde=True,stat="density")
plt.title("Leads who got converted")
plt.subplots_adjust(right=2.5)
plt.show()

In [None]:
# Count of outliers in "Total Time Spent on Website" of leads who did not get converted to clients

Q3,Q1=np.percentile(leads["Total Time Spent on Website"][leads["Converted"]==0],[75,25])
IQR=Q3-Q1
UW=Q3+IQR*1.5
leads["Total Time Spent on Website"][leads["Converted"]==0][leads["Total Time Spent on Website"]>UW].count()

In [None]:
# Count of outliers in "Total Time Spent on Website" of leads who got converted to clients

Q3,Q1=np.percentile(leads["Total Time Spent on Website"][leads["Converted"]==1],[75,25])
IQR=Q3-Q1
UW=Q3+IQR*1.5
leads["Total Time Spent on Website"][leads["Converted"]==1][leads["Total Time Spent on Website"]>UW].count()

**Observation:**
- Leads who get converted to clients typically spend a lot more time on X Education's website than those who do not get converted. We also do not observe any outliers among those who get converted.
- We observe many outliers amongst the leads who did not get converted. These outliers can be treated by capping.

In [None]:
# Capping outliers in "Total Time Spent on Website" of leads

Q3,Q1=np.percentile(leads["Total Time Spent on Website"],[75,25])
IQR=Q3-Q1
UW=Q3+IQR*1.5
leads["Total Time Spent on Website"]=leads["Total Time Spent on Website"].apply(lambda x: UW if x>=UW else x)

In [None]:
# Visualizing the "Page Views Per Visit" 

plt.figure(figsize=[5,15])
sns.boxplot(y=leads["Page Views Per Visit"],x=leads["Converted"])
plt.show()

In [None]:
# Visualizing the distribution of "Page Views Per Visit"

plt.figure(figsize=[5,5])
plt.subplot(1,2,1)
sns.histplot(x=leads["Page Views Per Visit"][leads["Converted"]==0],bins=30,kde=True,stat="density")
plt.title("Leads who did not get converted")
plt.subplot(1,2,2)
sns.histplot(x=leads["Page Views Per Visit"][leads["Converted"]==1],bins=30,kde=True,stat="density")
plt.title("Leads who got converted")
plt.subplots_adjust(right=2.5)
plt.show()

**Observations:**
- Leads who get converted typically visit many more pages of the website than those who do not get converted.

In [None]:
# Count of outliers in "Page Views Per Visit" of leads who did not get converted to clients

Q3,Q1=np.percentile(leads["Page Views Per Visit"][leads["Converted"]==0],[75,25])
IQR=Q3-Q1
UW=Q3+IQR*1.5
leads["Page Views Per Visit"][leads["Converted"]==0][leads["Page Views Per Visit"]>UW].count()

In [None]:
# Count of outliers in "Page Views Per Visit" of leads who got converted to clients

Q3,Q1=np.percentile(leads["Page Views Per Visit"][leads["Converted"]==1],[75,25])
IQR=Q3-Q1
UW=Q3+IQR*1.5
leads["Page Views Per Visit"][leads["Converted"]==1][leads["Page Views Per Visit"]>UW].count()

**Observations:**
- There are many outliers among both converted and non-converted leads. They can be treated by capping.

In [None]:
# Capping outliers in "Page Views Per Visit" of leads

Q3,Q1=np.percentile(leads["Page Views Per Visit"],[75,25])
IQR=Q3-Q1
UW=Q3+IQR*1.5
leads["Page Views Per Visit"]=leads["Page Views Per Visit"].apply(lambda x: UW if x>=UW else x)
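
The three capping cells above share the same logic: compute the upper whisker (Q3 + 1.5 × IQR) and replace anything above it with that value. A minimal sketch of a reusable helper is shown below; the function name `cap_upper_outliers` and the toy series are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def cap_upper_outliers(series, whisker=1.5):
    """Cap values above Q3 + whisker * IQR at that upper whisker."""
    q1, q3 = np.percentile(series, [25, 75])
    upper_whisker = q3 + whisker * (q3 - q1)
    # clip(upper=...) leaves smaller values untouched and caps larger ones.
    return series.clip(upper=upper_whisker)

# Invented series with one extreme value.
toy = pd.Series([0, 1, 2, 3, 4, 100], dtype=float)
capped = cap_upper_outliers(toy)
print(capped.max())
```

Note that `np.percentile` does not skip nulls, so the helper assumes missing values have already been handled, as is the case at this point in the notebook.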

## <font color=blue> 4. Segmented Multivariate Analysis </font>

### Categorical - Continuous

In [None]:
# Visualizing "Total time Spent on Website" by "Lead Origin" separately by "Converted" 

plt.figure(figsize=[15,5])
sns.boxplot(x=leads["Lead Origin"], y=leads["Total Time Spent on Website"], hue=leads["Converted"])
plt.suptitle("Total time spent by Lead Origin by Converted")
plt.show()

In [None]:
# Check the value counts of "Lead Origin"

leads["Lead Origin"].value_counts()

In [None]:
# Check count of records of each origin type, for all leads that did not get converted and total time spent on website is 0

leads[["Lead Origin","Total Time Spent on Website"]][(leads["Total Time Spent on Website"] ==0) & (leads.Converted ==0)].groupby("Lead Origin").count()

In [None]:
# Check count of records of each origin type, for all leads that got converted and total time spent on website is 0

leads[["Lead Origin","Total Time Spent on Website"]][(leads["Total Time Spent on Website"] ==0) & (leads.Converted ==1)].groupby("Lead Origin").count()

**Observation:**
- Of the leads sourced from "Landing Page Submission" and "API", those who got converted spent much more time on the website, as shown by the higher median, than those who did not get converted
- 85% of the "Lead Add Form" leads who got converted to clients spent zero time on the website
- 99% of the "Lead Import" leads, whether converted or not, spent zero time on the website

In [None]:
# Visualizing "Total time Spent on Website" by "Lead Source" separately by "Converted" 

plt.figure(figsize=[15,5])
sns.boxplot(x=leads["Lead Source"], y=leads["Total Time Spent on Website"], hue=leads["Converted"])
plt.suptitle("Total time spent by Lead Source by Converted")
plt.xticks(rotation=60)
plt.show()

In [None]:
# Check the no. of records for each lead source

leads[["Lead Source","Total Time Spent on Website"]].groupby("Lead Source").count()

In [None]:
# Check count of records of each source type, for all leads that got converted and total time spent on website is 0

leads[["Lead Source","Total Time Spent on Website"]][(leads["Total Time Spent on Website"] ==0) & (leads.Converted ==1)].groupby("Lead Source").count()

**Observation:**
- Irrespective of the Lead Source, the time spent on X Education's website by leads who got converted to clients is higher than that of leads who did not get converted.
- For the sources Reference, Welingak Website and Olark Chat, a high % of the leads with zero time spent on the website get converted to clients

In [None]:
# Plot heat map - Lead Source Vs Lead Origin

plt.figure(figsize=[1,5])

plt.subplot(1,2,1)
sns.heatmap(leads.pivot_table(values="Converted", columns="Lead Origin", 
                              index="Lead Source", aggfunc=np.mean), annot=True, cmap="GnBu")
plt.title("Proportion of leads who converted per bucket of Lead Source and Lead Origin")
plt.xticks(rotation=90)

plt.subplot(1,2,2)
sns.heatmap(leads.pivot_table(values="Converted", columns="Lead Origin", 
                              index="Lead Source", aggfunc=np.sum), annot=True, cmap="GnBu", fmt='.1f')
plt.title("Count of converted leads per bucket of Lead Source and Lead Origin")
plt.xticks(rotation=90)

plt.subplots_adjust(right=15)
plt.show() 


**Observation:**
- Leads originating on Lead Add Form are more likely to get converted as compared to those originating on Landing Page Submission
- The source of Lead Add Form which converts the leads to clients are from 'Reference' and 'Welingak Website'
- The source for Lead Import is 'Facebook' which converts leads to clients though very less in number


In [None]:
# Plot heat map - Lead Source Vs Converted

plt.figure(figsize=[1,5])
plt.subplot(1,2,1)
sns.heatmap(leads.pivot_table(values="Converted", 
                              index="Lead Source", aggfunc=np.mean), annot=True, cmap="GnBu")
plt.title("Proportion of leads who converted per Lead Source")
plt.subplot(1,2,2)
sns.heatmap(leads.pivot_table(values="Converted", 
                              index="Lead Source", aggfunc="count"), annot=True, cmap="GnBu",fmt = '1d')
plt.title("Count of lead Source")
plt.subplots_adjust(right=15)
plt.show() 

**Observation:**
- Leads sourced from Welingak Website and Reference have a very high probability of getting converted
- Lead Source Google has more leads converted to clients than any other lead source

In [None]:
# Plot heat map - Last Activity Vs Converted

plt.figure(figsize=[1,7])
plt.subplot(1,2,1)
sns.heatmap(leads.pivot_table(values="Converted", 
                              index="Last Activity", aggfunc=np.mean), annot=True, cmap="GnBu")
plt.title("Proportion of leads who converted per Last Activity")
plt.subplot(1,2,2)
sns.heatmap(leads.pivot_table(values="Converted", 
                              index="Last Activity", aggfunc="count"), annot=True, cmap="GnBu", fmt='1d')
plt.title("Count of Last Activity")
plt.subplots_adjust(right=20)
plt.show() 

**Observation:**
- Leads with whom a recent phone conversation has happened have a very high likelihood to get converted
- Majority of the Leads who have opened their E-mail or to whom the SMS was sent, have got converted to clients

In [None]:
# Plot heat map based on - Count of leads who converted per bucket of Lead Source and Last Activity
# (summing the 0/1 "Converted" flag gives the number of converted leads)

plt.figure(figsize=[15,5])
sns.heatmap(leads.pivot_table(values="Converted", columns="Last Activity", 
                              index="Lead Source", aggfunc=np.sum), annot=True, cmap="GnBu", fmt='.1f')
plt.suptitle("Count of leads who converted per bucket of Lead Source and Last Activity")
plt.show() 

In [None]:
# Plot heat map based on - Proportion of leads who converted per bucket of Lead Source and Last Activity
# (the mean of the 0/1 "Converted" flag is the conversion rate)

plt.figure(figsize=[15,5])
sns.heatmap(leads.pivot_table(values="Converted", columns="Last Activity", 
                              index="Lead Source", aggfunc=np.mean), annot=True, cmap="GnBu")
plt.suptitle("Proportion of leads who converted per bucket of Lead Source and Last Activity")
plt.show() 

**Observation:**
- A lead with whom a recent phone conversation has happened, or who was responsive to a recent SMS, is more likely to get converted
- Leads sourced through Google show a high number of conversions when the last activity is "Email Opened" or "SMS Sent"
- Leads sourced from Reference and Welingak Website are highly likely to get converted to clients
- Leads sourced through Olark Chat whose last activity is "Olark Chat Conversation" are very unlikely to get converted, though they are few in number

In [None]:
# Plot heat map based on - Count of leads who converted per bucket of Lead Source and Last Notable Activity
# (summing the 0/1 "Converted" flag gives the number of converted leads)

plt.figure(figsize=[15,5])
sns.heatmap(leads.pivot_table(values="Converted", columns="Last Notable Activity", 
                              index="Lead Source", aggfunc=np.sum), annot=True, cmap="GnBu",fmt = '.1f')
plt.suptitle("Count of leads who converted per bucket of Lead Source and Last Notable Activity")
plt.show() 

In [None]:
# Plot heat map based on - Proportion of leads who converted per bucket of Lead Source and Last Notable Activity
# (the mean of the 0/1 "Converted" flag is the conversion rate)

plt.figure(figsize=[15,5])
sns.heatmap(leads.pivot_table(values="Converted", columns="Last Notable Activity", 
                              index="Lead Source", aggfunc=np.mean), annot=True, cmap="GnBu")
plt.suptitle("Proportion of leads who converted per bucket of Lead Source and Last Notable Activity")
plt.show() 

**Observation:**
- Leads sourced from Welingak Website and Reference are getting converted more than 95% of the time
- Leads whose last activity and last notable activity is a phone conversation have a very high conversion rate
- Leads whose last activity and last notable activity is "SMS Sent" have a very high conversion rate
- Since Last Activity and Last Notable Activity seem similar, we can drop Last Notable Activity
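
Before dropping one of two similar-looking columns, their overlap can be quantified. The sketch below, on invented data, computes the share of rows in which both columns hold the same value and cross-tabulates the two; the frame `toy` is an illustrative assumption and says nothing about the actual overlap in `leads`.

```python
import pandas as pd

# Invented data standing in for the two activity columns of `leads`.
toy = pd.DataFrame({
    "Last Activity": ["SMS Sent", "Email Opened", "Email Opened", "Olark Chat Conversation"],
    "Last Notable Activity": ["SMS Sent", "Email Opened", "Modified", "Modified"],
})

# Share of rows in which both columns agree.
agreement = (toy["Last Activity"] == toy["Last Notable Activity"]).mean()

# Cross-tabulation showing how the levels of one column map onto the other.
table = pd.crosstab(toy["Last Activity"], toy["Last Notable Activity"])
print(agreement)
print(table)
```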

## <font color=blue> 5. Preparing the data for modelling </font>

In [None]:
# Dropping "Last Notable Activity" as both "Last Notable Activity" and "Last Activity" seems to be having similar levels

leads.drop(["Last Notable Activity"], axis=1, inplace=True)

In [None]:
# Creating dummy variables for the categorical variables and dropping the level with smallest counts.
# dtype=int keeps the dummies as 0/1 integers (recent pandas versions default to bool)

# Creating dummy variables for the variable 'Lead Origin'
lo = pd.get_dummies(leads['Lead Origin'], prefix='Lead Origin', dtype=int)
# Dropping "Lead Origin_Lead Import" column
lo1 = lo.drop(['Lead Origin_Lead Import'], axis=1)
#Adding the results to the master dataframe
leads = pd.concat([leads,lo1], axis=1)

# Creating dummy variables for the variable 'Lead Source'
ls = pd.get_dummies(leads['Lead Source'], prefix='Lead Source', dtype=int)
# Dropping "Lead Source_Others" column
ls1 = ls.drop(['Lead Source_Others'], axis=1)
#Adding the results to the master dataframe
leads = pd.concat([leads,ls1], axis=1)

# Creating dummy variables for the variable 'Last Activity'
la = pd.get_dummies(leads['Last Activity'], prefix='Last Activity', dtype=int)
# Dropping "Last Activity_Others" column
la1 = la.drop(['Last Activity_Others'], axis=1)
#Adding the results to the master dataframe
leads = pd.concat([leads,la1], axis=1)

# Dropping the original categorical columns for which we have created dummy variables

leads=leads.drop(['Lead Origin','Lead Source','Last Activity'], axis=1)

In [None]:
# Inspecting all columns to check for datatypes

leads.info()

#### Observation:

- The datatypes of all features are appropriate and no changes are needed
- There are 9074 records with 28 columns

In [None]:
leads.head()

In [None]:
# Inspecting the descriptive statistics

pd.set_option("display.max_rows", 100)
leads.describe().transpose()

## <font color=blue> 6. Train-test Split </font>

In [None]:
# Storing all feature variables in X

X = leads.drop(["Lead Number", "Converted"], axis=1)

# Storing dependent variable in y

y = leads["Converted"]

In [None]:
# Inspecting first five rows of dataframe X

X.head()

In [None]:
# Inspecting first five rows of dataframe y

y.head()

In [None]:
# Splitting the data into train and test

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=100)

## <font color=blue> 7. Feature Scaling </font>

In [None]:
# Scaling "Total Time Spent on Website", "TotalVisits" and "Page Views Per Visit"

scaler = StandardScaler()

X_train[["Total Time Spent on Website", "TotalVisits", "Page Views Per Visit"]] = scaler.fit_transform(X_train[["Total Time Spent on Website", "TotalVisits", "Page Views Per Visit"]])

X_train.head()
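
The scaler above is fitted on the training data only. When the test set is scored, the usual practice is to reuse the fitted scaler with `transform` rather than fitting it again, so that no information from the test set leaks into the scaling. A minimal sketch on invented frames (`train`, `test` and `num_cols` are illustrative assumptions):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

num_cols = ["TotalVisits", "Page Views Per Visit"]

# Invented stand-ins for X_train and X_test.
train = pd.DataFrame({"TotalVisits": [1.0, 2.0, 3.0], "Page Views Per Visit": [2.0, 4.0, 6.0]})
test = pd.DataFrame({"TotalVisits": [2.0, 5.0], "Page Views Per Visit": [4.0, 8.0]})

scaler = StandardScaler()

# Learn mean and standard deviation from the training data and scale it.
train[num_cols] = scaler.fit_transform(train[num_cols])

# Apply the same training mean and standard deviation to the test data.
test[num_cols] = scaler.transform(test[num_cols])
print(test)
```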

## <font color=blue> 8. Correlations </font>

In [None]:
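# Inspecting correlation matrix (the upper triangle is hidden, since it mirrors the lower one)

corr = leads.corr()
# Boolean mask: True marks the cells strictly above the diagonal, which heatmap then hides
mask = np.triu(np.ones_like(corr, dtype=bool), k=1)
fig,ax = plt.subplots()
fig.set_size_inches(40,30)
sns.heatmap(corr,mask = mask,vmax = 0.8,square= True,cmap="GnBu",annot = True, fmt="0.2f")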
plt.title("Correlation between variables", fontsize=15)
plt.show()

**Observation:**

- Converted has high correlation with Total Time Spent on Website, Last Activity_SMS Sent, Lead Source_Reference, Lead Source_Welingak Website, Lead Origin_Lead Add Form
- Converted has negative correlation with Last Activity_Olark Chat Conversation, Lead Origin_API, Last Activity_Email Bounced
- Lead Origin_Lead Add Form has high correlation with Lead Source_Reference and Lead Source_Welingak Website
- Lead Origin_API has high correlation with Lead Source_Olark Chat
- There are some more correlations that will need to be dealt with. These will be dealt with as part of Recursive Feature Elimination (RFE).
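
With this many columns, strongly correlated pairs can be easier to read from a sorted list than from the heatmap. The sketch below works on an invented frame; the 0.8 cut-off is an arbitrary threshold chosen for illustration, not a value taken from this analysis.

```python
import numpy as np
import pandas as pd

# Invented data: "b" is nearly a multiple of "a", "c" is unrelated.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.5],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

corr = toy.corr().abs()

# Keep each pair once by retaining only the cells strictly above the diagonal.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Long format: one row per pair, sorted from strongest to weakest correlation.
pairs = upper.stack().dropna().sort_values(ascending=False)

# Illustrative threshold for "strong" correlation.
strong_pairs = pairs[pairs > 0.8]
print(strong_pairs)
```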


## <font color=blue> 9. Model building </font>

#### Feature selection using RFE

In [None]:
logreg = LogisticRegression()
rfe = RFE(logreg, n_features_to_select=20)             # running RFE with 20 variables as output
rfe = rfe.fit(X_train, y_train)
rfe.support_

In [None]:
list(zip(X_train.columns, rfe.support_, rfe.ranking_))

In [None]:
col = X_train.columns[rfe.support_]

In [None]:
X_train.columns[~rfe.support_]

### Assessing the model with StatsModels

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logm2 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm2.fit()
res.summary()

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

#### Observation:

Lead Source_Google has a high p-value and VIF. Hence this feature can be dropped from the model

In [None]:
# Dropping the variable whose p-value is statistically insignificant

X_train_sm.drop(columns="Lead Source_Google", inplace=True)
logm2 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm2.fit()
res.summary()

In [None]:
# Re-running VIFs

vif = pd.DataFrame()
X_vif = X_train_sm.drop(columns="const")   # VIF is computed on the features only
vif['Features'] = X_vif.columns
vif['VIF'] = [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

#### Observation:

Lead Origin_Lead Add Form has a high p-value and VIF. Hence this feature can be dropped from the model

In [None]:
# Dropping the variable whose p-value is statistically insignificant

X_train_sm.drop(columns="Lead Origin_Lead Add Form", inplace=True)
logm2 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm2.fit()
res.summary()

In [None]:
# Re-running VIFs

vif = pd.DataFrame()
X_vif = X_train_sm.drop(columns="const")   # VIF is computed on the features only
vif['Features'] = X_vif.columns
vif['VIF'] = [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

#### Observation:

Lead Source_Referral Sites has a high p-value. Hence this feature can be dropped from the model

In [None]:
# Dropping the variable whose p-value is statistically insignificant

X_train_sm.drop(columns="Lead Source_Referral Sites", inplace=True)
logm2 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm2.fit()
res.summary()

In [None]:
# Re-running VIFs

vif = pd.DataFrame()
X_vif = X_train_sm.drop(columns="const")   # VIF is computed on the features only
vif['Features'] = X_vif.columns
vif['VIF'] = [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

#### Observation:

"Last Activity_Email Link Clicked" has a negative coefficient, which does not make intuitive business sense. One would typically expect a person who clicks on an email link to be more likely to convert, but the negative coefficient implies otherwise. Hence this feature can be dropped

In [None]:
# Dropping the variable whose negative coefficient does not make business sense

X_train_sm.drop(columns="Last Activity_Email Link Clicked", inplace=True)
logm2 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm2.fit()
res.summary()

In [None]:
# Re-running VIFs

vif = pd.DataFrame()
X_vif = X_train_sm.drop(columns="const")   # VIF is computed on the features only
vif['Features'] = X_vif.columns
vif['VIF'] = [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

#### Observation:

"Last Activity_Form Submitted on Website" has a negative coefficient, which does not make intuitive business sense. One would typically expect a person who submits the form on the website to be more likely to convert, but the negative coefficient implies otherwise. Hence this feature can be dropped

In [None]:
# Dropping the variable whose negative coefficient does not make business sense

X_train_sm.drop(columns="Last Activity_Form Submitted on Website", inplace=True)
logm2 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm2.fit()
res.summary()

In [None]:
# Re-running VIFs

vif = pd.DataFrame()
X_vif = X_train_sm.drop(columns="const")   # VIF is computed on the features only
vif['Features'] = X_vif.columns
vif['VIF'] = [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
# Dropping the variable whose p-value is statistically insignificant

X_train_sm.drop(columns="Last Activity_Olark Chat Conversation", inplace=True)
logm2 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm2.fit()
res.summary()

In [None]:
# Re-running VIFs

vif = pd.DataFrame()
X_vif = X_train_sm.drop(columns="const")   # VIF is computed on the features only
vif['Features'] = X_vif.columns
vif['VIF'] = [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
# Dropping the variable whose p-value is statistically insignificant

X_train_sm.drop(columns="Last Activity_Page Visited on Website", inplace=True)
logm2 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm2.fit()
res.summary()

In [None]:
# Re-running VIFs

vif = pd.DataFrame()
X_vif = X_train_sm.drop(columns="const")   # VIF is computed on the features only
vif['Features'] = X_vif.columns
vif['VIF'] = [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

**Observation:**
- Since the p-values of the coefficients for all feature variables are significant (less than 0.05) and multicollinearity between feature variables is low (VIF < 4), this is the final model and we can predict the dependent variable "Converted" based on this model.
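
For intuition on the VIF criterion, here is a minimal sketch that assumes the textbook definition VIF = 1 / (1 - R²), where R² comes from regressing one feature on the remaining ones. With only two features, R² equals the squared Pearson correlation, so the value can be computed by hand. The toy data and the helper names `pearson_r` and `vif_two_features` are hypothetical and are not part of the notebook.

```python
import math

def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

def vif_two_features(x, y):
    """VIF of x when y is the only other predictor: 1 / (1 - r^2)."""
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical toy features (not taken from the leads dataset)
visits = [1, 2, 3, 4, 5, 6]
page_views = [2, 1, 4, 3, 6, 5]   # moves closely with visits
time_spent = [3, 1, 2, 3, 1, 2]   # only weakly related to visits

vif_high = vif_two_features(visits, page_views)
vif_low = vif_two_features(visits, time_spent)
print(round(vif_high, 2), round(vif_low, 2))
```

The more strongly two features move together, the larger the VIF, which is why features with a large VIF were candidates for removal during the iterations above.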

### Creating new dataframe to store predicted values of dependent variable "Converted"

In [None]:
# Converting predicted probabilities into Lead Score and storing in new dataframe

y_train_pred = round(res.predict(X_train_sm)*100,0)
y_train_pred = y_train_pred.astype("int")
y_train_pred_final = pd.DataFrame({'Converted':y_train.values, 'Lead Score':y_train_pred})
y_train_pred_final['Lead Number'] = leads["Lead Number"]
y_train_pred_final.head()

## <font color=blue> 10. Plotting ROC Curve <font>

In [None]:
# Defining a function to draw ROC curve

def draw_roc( actual, probs ):
    fpr, tpr, thresholds = metrics.roc_curve( actual, probs,
                                              drop_intermediate = False )
    auc_score = metrics.roc_auc_score( actual, probs )
    plt.figure(figsize=(5, 5))
    plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
    plt.plot([0, 1], [0, 1], 'r--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show()

    return None

In [None]:
# Computing False Positive Rate (fpr), True Positive Rate (tpr) and thresholds

fpr, tpr, thresholds = metrics.roc_curve(y_train_pred_final.Converted, 
                                         y_train_pred_final["Lead Score"], drop_intermediate = False )

In [None]:
# Drawing ROC curve

draw_roc(y_train_pred_final.Converted, y_train_pred_final["Lead Score"])

**Observation:**
- The threshold where the True Positive rate is around 0.8 and the False positive rate is around 0.2 is the optimum threshold.
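
One common way to make "high true positive rate, low false positive rate" concrete is Youden's J statistic (TPR - FPR). That criterion is an assumption here, not the method used in this notebook. The sketch below computes ROC points by hand on hypothetical toy scores and picks the threshold with the largest J; `roc_points` and `best_by_youden` are illustrative names.

```python
def roc_points(actual, scores, thresholds):
    """Return (threshold, fpr, tpr) tuples; a lead is predicted 1 when score > threshold."""
    positives = sum(actual)
    negatives = len(actual) - positives
    points = []
    for t in thresholds:
        tp = sum(1 for a, s in zip(actual, scores) if a == 1 and s > t)
        fp = sum(1 for a, s in zip(actual, scores) if a == 0 and s > t)
        points.append((t, fp / negatives, tp / positives))
    return points

def best_by_youden(points):
    """Pick the point that maximises TPR - FPR."""
    return max(points, key=lambda p: p[2] - p[1])

# Hypothetical toy data: 5 converted and 5 non-converted leads with lead scores 0-100
actual = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
scores = [92, 81, 58, 45, 28, 55, 33, 21, 12, 5]

best_t, best_fpr, best_tpr = best_by_youden(roc_points(actual, scores, range(0, 100, 10)))
print(best_t, best_fpr, best_tpr)
```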

## <font color=blue> 11. Finding optimal cut-off point <font>

In [None]:
# Let's create columns with different probability cutoffs 
numbers = [x*10 for x in range(10)]
for i in numbers:
    y_train_pred_final[i]= y_train_pred_final["Lead Score"].map(lambda x: 1 if x > i else 0)
y_train_pred_final.head()

In [None]:
# Calculating accuracy, sensitivity and specificity for various probability cutoffs.
cutoff_df = pd.DataFrame( columns = ['Lead Score Threshold','Accuracy','Sensitivity','Specificity'])

# TP = confusion[1,1] # true positive 
# TN = confusion[0,0] # true negatives
# FP = confusion[0,1] # false positives
# FN = confusion[1,0] # false negatives

for i in numbers:
    cm1 = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final[i] )
    total1=sum(sum(cm1))
    accuracy = (cm1[0,0]+cm1[1,1])/total1
    
    speci = cm1[0,0]/(cm1[0,0]+cm1[0,1])
    sensi = cm1[1,1]/(cm1[1,0]+cm1[1,1])
    cutoff_df.loc[i] =[ i ,accuracy,sensi,speci]
cutoff_df

#### Observation:

From the above table, the cut-off threshold appears to lie between 30 and 40, where the three metrics converge. Let's plot all three metrics against the corresponding lead score threshold

In [None]:
# Plotting accuracy, sensitivity and specificity for various probabilities.

cutoff_df.plot.line(x='Lead Score Threshold', y=['Accuracy','Sensitivity','Specificity'])
plt.show()

**Observation:**
- From the curve above, 35 looks to be the optimum point to take it as Lead Score Threshold.
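
The visual choice above can also be approximated programmatically. The sketch below assumes the "optimum" is the cut-off where sensitivity and specificity are closest to each other, and searches a grid in steps of 5. The toy data is hypothetical and was constructed so that the crossover falls at 35; `sens_spec` and `balanced_cutoff` are illustrative names.

```python
def sens_spec(actual, scores, cutoff):
    """Sensitivity and specificity when predicting 1 for score > cutoff."""
    tp = fn = tn = fp = 0
    for a, s in zip(actual, scores):
        predicted = 1 if s > cutoff else 0
        if a == 1 and predicted == 1:
            tp += 1
        elif a == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)

def balanced_cutoff(actual, scores, cutoffs):
    """Return the cutoff with the smallest gap between sensitivity and specificity."""
    def gap(cutoff):
        sens, spec = sens_spec(actual, scores, cutoff)
        return abs(sens - spec)
    return min(cutoffs, key=gap)

# Hypothetical toy data (not the notebook's training set)
actual = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
scores = [92, 81, 58, 38, 28, 55, 33, 21, 12, 5]

cutoff = balanced_cutoff(actual, scores, range(0, 100, 5))
print(cutoff, sens_spec(actual, scores, cutoff))
```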

In [None]:
# Final predicted values of conversions for train data

y_train_pred_final['Predicted_Conversions'] = y_train_pred_final["Lead Score"].map( lambda x: 1 if x > 35 else 0)

print ("Lead Score Prediction Table")
y_train_pred_final.head()

In [None]:
# Drawing the Precision-Recall Curve

p, r, thresholds = precision_recall_curve(y_train_pred_final.Converted, y_train_pred_final["Lead Score"])
plt.plot(thresholds, p[:-1], "g-", label="Precision")
plt.plot(thresholds, r[:-1], "r-", label="Recall")
plt.xlabel("Lead Score Threshold")
plt.legend()
plt.show()

**Observation:**
- Based on the Precision-Recall curve, the trade-off point between precision and recall is at a lead score of around 40
- However, the cut-off table and plot above show that accuracy, sensitivity and specificity are better balanced at a threshold of 35 than at 40. Hence, a threshold of 35 is more appropriate.

In [None]:
# Checking overall Accuracy
round(metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final.Predicted_Conversions)*100,2)

## <font color=blue> 12. Metrics beyond Accuracy <font>

In [None]:
# To create confusion matrix

confusion = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.Predicted_Conversions)
confusion

In [None]:
# Computing relevant components for metrics calculation

TP = confusion[1,1] # true positive 
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives

In [None]:
# Checking Sensitivity (also known as Recall)

Sensitivity = round(TP / float(TP+FN) *100,2)
Sensitivity

In [None]:
# Checking Specificity

Specificity = round(TN / float(TN+FP) *100,2)
Specificity

In [None]:
# Checking false positive rate - predicting conversion when the lead has not converted

round(FP/ float(TN+FP) *100,2)

In [None]:
# Positive predictive value (also known as Precision)

Precision = round(TP / float(TP+FP) *100,2)
Precision

In [None]:
# Negative predictive value

round(TN / float(TN+ FN) *100,2)

In [None]:
# To calculate F1 score

F1_score = (2 * Precision * Sensitivity)/(Precision + Sensitivity)
F1_score

#### Observation:

- With a threshold cut-off of 35, the train data shows a Sensitivity (or Recall) of 78.21%, Specificity of 79.23% and Precision of 70.23%, with an accuracy of 78.84%
- F1 Score for train data is 74%
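
As a quick check of the F1 figure, the harmonic-mean formula can be applied directly to the precision and sensitivity percentages reported above. The function name `f1_from_percentages` is only illustrative.

```python
def f1_from_percentages(precision_pct, recall_pct):
    """F1 is the harmonic mean of precision and recall (both given in %)."""
    return 2 * precision_pct * recall_pct / (precision_pct + recall_pct)

# Train-data values reported in the observation above
train_f1 = f1_from_percentages(70.23, 78.21)
print(round(train_f1, 1))
```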


## <font color=blue> 13. Making predictions on test dataset <font>

In [None]:
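# Scaling test dataset with the scaler already fitted on the training data
# (transform only: re-fitting on the test set would leak test statistics into the evaluation)

num_cols = ["Total Time Spent on Website", "TotalVisits", "Page Views Per Visit"]
X_test[num_cols] = scaler.transform(X_test[num_cols])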
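
Test data is normally standardised with the mean and standard deviation learned from the training data, so that both sets share one scale. The sketch below shows the idea with the standard library only. It assumes the population standard deviation, which is what `StandardScaler` uses; the values and the helper names `fit_standardizer` and `standardize` are hypothetical.

```python
import statistics

def fit_standardizer(train_values):
    """Learn the mean and population standard deviation from training data only."""
    return statistics.mean(train_values), statistics.pstdev(train_values)

def standardize(values, mean, std):
    """Apply previously learned statistics to any dataset (train or test)."""
    return [(v - mean) / std for v in values]

# Hypothetical time-on-website values
train_time = [100, 200, 300, 400, 500]
test_time = [300, 600]

mean, std = fit_standardizer(train_time)
test_scaled = standardize(test_time, mean, std)
print(test_scaled)
```

A test value equal to the training mean maps to 0, whatever the distribution of the test set itself looks like.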

In [None]:
# Create test dataframe based on the final features arrived using the final model

final_col = X_train_sm.drop(columns="const").columns

X_test = X_test[final_col]

X_test.head()

In [None]:
# Predicting Lead scores on the test data set and storing it in new dataframe - y_test_pred_final

X_test_sm = sm.add_constant(X_test)

y_test_pred = round(res.predict(X_test_sm)*100,0)
y_test_pred = y_test_pred.astype("int")
y_test_pred_final = pd.DataFrame({'Converted':y_test.values, 'Lead Score':y_test_pred})
y_test_pred_final['Lead Number'] = leads["Lead Number"]
y_test_pred_final.head()

In [None]:
# Final predicted values of conversions for test data using the same cut off threshold as 35

y_test_pred_final['Predicted_Conversions'] = y_test_pred_final["Lead Score"].map(lambda x: 1 if x > 35 else 0)

In [None]:
# Checking overall Accuracy
round(metrics.accuracy_score(y_test_pred_final.Converted, y_test_pred_final.Predicted_Conversions) *100,2)

In [None]:
# To create confusion matrix

confusion = metrics.confusion_matrix(y_test_pred_final.Converted, y_test_pred_final.Predicted_Conversions)
confusion

In [None]:
# Computing relevant components for metrics calculation

TP = confusion[1,1] # true positive 
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives

In [None]:
# Sensitivity of the logistic regression model on test dataset

Sensitivity = round(TP / float(TP+FN) *100,2)
Sensitivity

In [None]:
# Specificity of the logistic regression model on test dataset

round(TN / float(TN+FP) *100,2)

In [None]:
# Positive predictive value (also known as Precision)

Precision = round(TP / float(TP+FP) *100,2)

In [None]:
# To calculate F1 score

F1_score = (2 * Precision * Sensitivity)/(Precision + Sensitivity)
F1_score

#### Observation:

- With a threshold cut-off of 35, the test data shows a Sensitivity (or Recall) of 86.25% and Specificity of 65.4%, with an accuracy of 72.97%
- F1 Score for test data is around 70%

In [None]:
# Merging predictions into the "leads" dataframe

predictions=pd.concat([y_train_pred_final[["Converted","Lead Score","Lead Number",
                                           "Predicted_Conversions"]], y_test_pred_final])

leads=pd.merge(leads,predictions[["Predicted_Conversions","Lead Number"]],on="Lead Number",how="inner")

leads.head()

In [None]:
# Predicted conversion %

round(np.mean(leads["Predicted_Conversions"])*100,2)

In [None]:
# List of "hot leads"

leads[leads["Predicted_Conversions"]==1]

#### Observation:

It can be concluded that 46.03% of the existing leads, a total of 4177 records, can be considered 'Hot leads'

## <font color=blue> 14. Conclusion <font>

Equation of the sigmoid curve of best fit is:

Lead Score (Probability) = $\dfrac{1}{1 + e^{-z}}$, where

z = 0.57 + 4 × Lead Source_Welingak Website + 2.6 × Last Activity_Had a Phone Conversation + 2.5 × Lead Source_Reference - 2 × Last Activity_Email Bounced - 1.9 × Lead Origin_Landing Page Submission - 1.8 × Lead Origin_API + 1.4 × Last Activity_SMS Sent + 1.1 × Total Time Spent on Website + 1 × Lead Source_Olark Chat - 0.9 × Last Activity_Converted to Lead - 0.3 × Lead Source_Direct Traffic - 0.3 × Lead Source_Organic Search + 0.2 × TotalVisits
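
To show how the equation turns features into a lead score, here is a minimal sketch using the rounded coefficients listed above. The example lead is hypothetical, the function name `lead_score` is illustrative, and the continuous features are assumed to be on the standardised scale used for training.

```python
import math

# Rounded coefficients copied from the equation above
INTERCEPT = 0.57
COEFFICIENTS = {
    "Lead Source_Welingak Website": 4.0,
    "Last Activity_Had a Phone Conversation": 2.6,
    "Lead Source_Reference": 2.5,
    "Last Activity_Email Bounced": -2.0,
    "Lead Origin_Landing Page Submission": -1.9,
    "Lead Origin_API": -1.8,
    "Last Activity_SMS Sent": 1.4,
    "Total Time Spent on Website": 1.1,
    "Lead Source_Olark Chat": 1.0,
    "Last Activity_Converted to Lead": -0.9,
    "Lead Source_Direct Traffic": -0.3,
    "Lead Source_Organic Search": -0.3,
    "TotalVisits": 0.2,
}

def lead_score(features):
    """Return a 0-100 lead score; features missing from the dict count as 0."""
    z = INTERCEPT + sum(COEFFICIENTS[name] * value for name, value in features.items())
    return round(100 / (1 + math.exp(-z)))

# Hypothetical lead: came through the API via Olark Chat, last activity was an SMS,
# and spent half a standard deviation more time on the website than average
example_lead = {
    "Lead Origin_API": 1,
    "Lead Source_Olark Chat": 1,
    "Last Activity_SMS Sent": 1,
    "Total Time Spent on Website": 0.5,
}
score = lead_score(example_lead)
print(score, "hot lead" if score > 35 else "not a hot lead")
```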
 

<font color=green> Indicators that a lead will get converted to a client `(largest indicator on top)`: <font>

1. Lead Source_Welingak Website - A lead sourced from Welingkar Website is more likely to get converted

2. Last Activity_Had a Phone Conversation - A lead who was responsive to a recent phone conversation is more likely to get converted

3. Lead Source_Reference - A lead sourced from Reference is more likely to get converted

4. Last Activity_SMS Sent - A lead who was responsive to a recent SMS sent is more likely to get converted

5. Total Time Spent on Website - A lead who spends more time on X Education website is more likely to get converted

6. Lead Source_Olark Chat - A lead who was sourced from Olark Chat is more likely to get converted

7. TotalVisits - A lead who visits X Education website more often has a higher likelihood of getting converted

   
<font color=red> Indicators that a lead will not get converted to a client `(largest indicator on top)`: <font>

1. Last Activity_Email Bounced - A lead to whom an e-mail recently bounced is less likely to get converted

2. Lead Origin_Landing Page Submission - A lead originating on Landing Page Submission is less likely to get converted

3. Lead Origin_API - A lead originating on API is less likely to get converted

4. Last Activity_Converted to Lead - A lead who has already converted for one of the courses of X-Education is less likely to get converted

5. Lead Source_Direct Traffic - A lead who is sourced from Direct Traffic is less likely to get converted

6. Lead Source_Organic Search - A lead who is sourced through organic search is less likely to get converted

     
<font color=blue> Guidance to X Education:

The company should use a lead score threshold of 35 to identify "Hot Leads". At this threshold, the model's sensitivity on the test data is around 86%, which is above the CEO's target of 80%. Based on this approach, we have identified 4,000+ leads as "Hot Leads", which is 46.03% of total leads.