# Lead Scoring 

## Problem Statement

An education company named X Education sells online courses to industry professionals. On any given day, many professionals who are interested in the courses land on their website and browse for courses. 

 

The company markets its courses on several websites and on search engines such as Google. Once people land on the website, they might browse the courses, fill in a form for a course, or watch some videos. When they fill in a form with their email address or phone number, they are classified as a lead. The company also gets leads through past referrals. Once these leads are acquired, employees from the sales team start making calls, writing emails, etc. Through this process, some of the leads get converted while most do not. The typical lead conversion rate at X Education is around 30%. 

 

Now, although X Education gets a lot of leads, its lead conversion rate is very poor. For example, if they acquire 100 leads in a day, only about 30 of them are converted. To make this process more efficient, the company wishes to identify the most promising leads, also known as ‘Hot Leads’. If they successfully identify this set of leads, the lead conversion rate should go up, because the sales team will focus on communicating with the promising leads rather than calling everyone. A typical lead conversion process can be represented as a funnel:


*Figure: Lead conversion process, demonstrated as a funnel.*

As you can see, a lot of leads are generated in the initial stage (top), but only a few of them come out as paying customers at the bottom. In the middle stage, you need to nurture the potential leads well (i.e. educate the leads about the product, communicate with them constantly, etc.) in order to get a higher lead conversion.

 

X Education has appointed you to help them select the most promising leads, i.e. the leads that are most likely to convert into paying customers. The company requires you to build a model that assigns a lead score to each lead, such that leads with a higher score have a higher chance of conversion and leads with a lower score have a lower chance. The CEO, in particular, has given a ballpark target of around 80% for the lead conversion rate.

 

### Data

You have been provided with a leads dataset from the past with around 9000 data points. This dataset consists of various attributes such as Lead Source, Total Time Spent on Website, Total Visits, Last Activity, etc., which may or may not be useful in deciding whether a lead will be converted. The target variable is the column ‘Converted’, which tells whether a past lead was converted: 1 means it was converted and 0 means it wasn’t. You can learn more about the dataset from the data dictionary provided in the zip folder at the end of the page. You also need to check the levels present in the categorical variables. Many of the categorical variables have a level called 'Select', which needs to be handled because it is as good as a null value (think why?).

## Goals

There are quite a few goals for this case study.

- Build a model to assign a lead score between 0 and 100 to each of the leads, which the company can use to target potential leads. A higher score means that the lead is hot, i.e. most likely to convert, whereas a lower score means that the lead is cold and will most likely not convert.
- The company has presented some more problems that your model should be able to adjust to if the company's requirements change in the future, so you will need to handle these as well.
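As a rough illustration of the first goal, a lead score can be derived from a predicted conversion probability by scaling it to the 0-100 range. This is only a sketch: the function name `to_lead_score` is hypothetical, and it assumes that some classifier (not built in this part of the notebook) supplies probabilities between 0 and 1.

```python
import numpy as np

def to_lead_score(probabilities):
    """Map conversion probabilities in [0, 1] to integer lead scores in [0, 100].

    Hypothetical helper: the probabilities are assumed to come from a fitted classifier.
    """
    # Guard against values slightly outside [0, 1] before scaling.
    probs = np.clip(np.asarray(probabilities, dtype=float), 0.0, 1.0)
    return np.rint(probs * 100).astype(int)

scores = to_lead_score([0.05, 0.5, 0.93])
print(scores)
```

Leads can then be ranked by score, and the cut-off for a "hot" lead can be moved up or down as the company's requirements change.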

In [None]:
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Visualisation
from matplotlib.pyplot import xticks
%matplotlib inline

In [None]:
# Data display customization
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

## Data Preparation

### Data Loading

In [None]:
lead = pd.read_csv(r"/kaggle/input/lead-scoring-dataset/Lead Scoring.csv")
lead.head()

Data Dictionary

In [None]:
word=pd.read_excel(r"/kaggle/input/lead-scoring-dataset/Leads Data Dictionary.xlsx")
word.head()

In [None]:
# None means "no truncation"; the older value -1 is no longer accepted by recent pandas versions
pd.set_option('display.max_colwidth', None)
word.drop('Unnamed: 0',inplace=True,axis=1)
word.columns = word.iloc[1]
word = word.iloc[2:]
word.reset_index(drop=True, inplace=True)
word.head(len(word))

## Duplicate Check

In [None]:
lead_dub = lead.copy()

# Checking for duplicates and dropping the entire duplicate row if any
lead_dub.drop_duplicates(subset=None, inplace=True)
lead_dub.shape

In [None]:
lead.shape

The shape after dropping duplicates is the same as that of the original dataframe.

Hence we can conclude that there are no duplicate rows in the dataset.
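The same check can be done without comparing shapes, by counting the rows that `duplicated()` flags. The sketch below uses a small made-up dataframe rather than the leads data.

```python
import pandas as pd

# Toy data (not from the leads dataset): the third row repeats the second.
toy = pd.DataFrame({"Prospect ID": ["a", "b", "b"], "Converted": [0, 1, 1]})

# duplicated() marks every repeat of an earlier row as True.
n_duplicates = int(toy.duplicated().sum())
print(n_duplicates)
```

A count of zero on the real dataframe would confirm the conclusion above.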

## Data Inspection

In [None]:
lead.shape

In [None]:
lead.info()

In [None]:
lead.describe()

## Data Cleaning

In [None]:
# Many columns contain the value 'Select'.
# This is because the customer did not choose any option from the list, so the default 'Select' was stored.
# 'Select' values are as good as NULL.

# Converting 'Select' values to NaN.
lead = lead.replace('Select', np.nan)
lead.head()
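Before replacing the placeholder, it can help to see how many 'Select' entries each column holds. The following is a minimal sketch on made-up data; the column values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative values only).
toy = pd.DataFrame({
    "City": ["Mumbai", "Select", "Thane"],
    "Specialization": ["Select", "Select", "Finance"],
})

# Count the placeholder per column, then turn it into a proper missing value.
select_counts = (toy == "Select").sum()
cleaned = toy.replace("Select", np.nan)

print(select_counts)
print(cleaned.isnull().sum())
```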

In [None]:
lead.isnull().sum()

In [None]:
round(100*(lead.isnull().sum()/len(lead.index)), 2)

In [None]:
# Drop the columns having more than 60% NA values.
missing_pct = round(100*(lead.isnull().sum()/len(lead.index)), 2)
lead = lead.drop(columns=missing_pct[missing_pct > 60].index)
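The threshold rule can be wrapped in a small helper so the cut-off is easy to change. This is a sketch on made-up data; `drop_sparse_columns` is a hypothetical name, not part of pandas.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_missing_pct=60.0):
    """Return df without the columns whose share of missing values exceeds the threshold."""
    missing_pct = df.isnull().mean() * 100
    to_drop = missing_pct[missing_pct > max_missing_pct].index
    return df.drop(columns=to_drop)

# Toy data: column 'b' is 80% missing, column 'c' is 20% missing.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [np.nan, np.nan, np.nan, np.nan, 1.0],
    "c": [np.nan, 2, 3, 4, 5],
})
trimmed = drop_sparse_columns(toy)
print(list(trimmed.columns))
```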

In [None]:
round(100*(lead.isnull().sum()/len(lead.index)), 2)

In [None]:
# Dropping Lead Number and Prospect ID since all their values are unique

lead.drop(['Prospect ID', 'Lead Number'], axis=1, inplace=True)

In [None]:
lead.head()

Now we will take care of null values in each column one by one.


In [None]:
# Lead Quality: Indicates the quality of the lead based on the data and the intuition of the employee who has been assigned to the lead

In [None]:
lead['Lead Quality'].value_counts()


In [None]:
lead['Lead Quality'].describe()

In [None]:
plt.figure(figsize = (10,5))
ax= sns.countplot(x=lead['Lead Quality'])
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.xticks(rotation = 90)
plt.show()

In [None]:
# Lead Quality is based on the employee's impression of the lead,
# so if it is left blank we can safely impute 'Not Sure' for NaN.

lead['Lead Quality'] = lead['Lead Quality'].replace(np.nan, 'Not Sure')


In [None]:
plt.figure(figsize = (10,5))
ax= sns.countplot(x=lead['Lead Quality'])
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.xticks(rotation = 90)
plt.show()

This is one of the few fields where human psychology, consumer behavior & business understanding outweigh the statistical interpretation of the data.

In [None]:
# Asymmetrique Activity Index  |
# Asymmetrique Profile Index   \   An index and score assigned to each customer
# Asymmetrique Activity Score  |    based on their activity and their profile
# Asymmetrique Profile Score   \

In [None]:
fig, axs = plt.subplots(2,2, figsize = (10,9))
plt1 = sns.countplot(x=lead['Asymmetrique Activity Index'], ax = axs[0,0])
for p in plt1.patches:
    plt1.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt2 = sns.violinplot(x=lead['Asymmetrique Activity Score'], ax = axs[0,1])
plt3 = sns.countplot(x=lead['Asymmetrique Profile Index'], ax = axs[1,0])
for p in plt3.patches:
    plt3.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt4 = sns.violinplot(x=lead['Asymmetrique Profile Score'], ax = axs[1,1])
plt.tight_layout()

In [None]:
# There is too much variation in these parameters, so it is not reliable to impute any value.
# With about 45% null values, we drop these columns.

In [None]:
lead = lead.drop(['Asymmetrique Activity Index','Asymmetrique Activity Score',
                  'Asymmetrique Profile Index','Asymmetrique Profile Score'], axis=1)

In [None]:
round(100*(lead.isnull().sum()/len(lead.index)), 2)

In [None]:
# City

In [None]:
lead.City.value_counts()


In [None]:
lead.City.describe()

In [None]:
plt.figure(figsize = (10,5))
ax= sns.countplot(x=lead['City'])
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.xticks(rotation = 90)
plt.show()

In [None]:
# Around 57.8% of the available values are Mumbai, so we can impute Mumbai for the missing values.

In [None]:
lead['City'] = lead['City'].replace(np.nan, 'Mumbai')
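The imputation above fills missing values with the most frequent level. A small helper can report how dominant that level is before filling, which makes the decision easier to justify. This is a sketch on made-up data; `impute_with_mode` is a hypothetical name.

```python
import numpy as np
import pandas as pd

def impute_with_mode(series):
    """Fill missing values with the most frequent level.

    Returns the filled series, the level used, and that level's share of the non-missing values.
    """
    mode_value = series.mode(dropna=True).iloc[0]
    share = (series == mode_value).sum() / series.notna().sum()
    return series.fillna(mode_value), mode_value, share

# Toy data (illustrative values only).
city = pd.Series(["Mumbai", "Mumbai", "Thane", np.nan, np.nan])
filled, mode_value, share = impute_with_mode(city)
print(mode_value, round(share, 3))
```

When the share is low, a separate category for missing values may distort the data less than the mode does.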

In [None]:
plt.figure(figsize = (10,5))
ax= sns.countplot(x=lead['City'])
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.xticks(rotation = 90)
plt.show()

In [None]:
# Specialization

In [None]:
lead.Specialization.describe()

In [None]:
lead.Specialization.value_counts()

In [None]:
plt.figure(figsize = (10,5))
ax= sns.countplot(x=lead['Specialization'])
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.xticks(rotation = 90)
plt.show()

In [None]:
# The lead may not have entered a specialization because their option is not available in the list,
# because they do not have any specialization, or because they are a student.
# Hence we can make a category "Others" for missing values.


In [None]:
lead['Specialization'] = lead['Specialization'].replace(np.nan, 'Others')

In [None]:
plt.figure(figsize = (10,5))
ax= sns.countplot(x=lead['Specialization'])
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.xticks(rotation = 90)
plt.show()

In [None]:
round(100*(lead.isnull().sum()/len(lead.index)), 2)

In [None]:
# Tags

In [None]:
lead.Tags.describe()

In [None]:
lead.Tags.value_counts()

In [None]:
plt.figure(figsize = (10,5))
ax= sns.countplot(x=lead['Tags'])
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.xticks(rotation = 90)
plt.show()

In [None]:
# Blanks in the tag column may be imputed by 'Will revert after reading the email'.

In [None]:
lead['Tags'] = lead['Tags'].replace(np.nan, 'Will revert after reading the email')

In [None]:
plt.figure(figsize = (10,5))
ax= sns.countplot(x=lead['Tags'])
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.xticks(rotation = 90)
plt.show()

In [None]:
# What matters most to you in choosing a course

In [None]:
lead['What matters most to you in choosing a course'].describe()

In [None]:
lead['What matters most to you in choosing a course'].value_counts()

In [None]:
plt.figure(figsize = (10,5))
ax= sns.countplot(x=lead['What matters most to you in choosing a course'])
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.xticks(rotation = 90)
ax.set_yscale('log')
plt.show()

In [None]:
# Blanks in this column may be imputed by 'Better Career Prospects'.

In [None]:
lead['What matters most to you in choosing a course'] = lead['What matters most to you in choosing a course'].replace(np.nan, 'Better Career Prospects')

In [None]:
plt.figure(figsize = (10,5))
ax= sns.countplot(x=lead['What matters most to you in choosing a course'])
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.xticks(rotation = 90)
ax.set_yscale('log')
plt.show()

In [None]:
# Occupation

In [None]:
lead['What is your current occupation'].describe()

In [None]:
lead['What is your current occupation'].value_counts()

In [None]:
plt.figure(figsize = (10,5))
ax= sns.countplot(x=lead['What is your current occupation'])
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.xticks(rotation = 90)
ax.set_yscale('log')
plt.show()

In [None]:
# 86% of the entries are 'Unemployed', so we can impute "Unemployed" for the missing values.

In [None]:
lead['What is your current occupation'] = lead['What is your current occupation'].replace(np.nan, 'Unemployed')

In [None]:
plt.figure(figsize = (10,5))
ax= sns.countplot(x=lead['What is your current occupation'])
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.xticks(rotation = 90)
ax.set_yscale('log')
plt.show()

In [None]:
# Country

In [None]:
lead['Country'].describe()

In [None]:
lead['Country'].value_counts()

In [None]:
plt.figure(figsize = (10,5))
ax= sns.countplot(x=lead['Country'])
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.xticks(rotation = 90)
ax.set_yscale('log')
plt.show()

In [None]:
# Country is India for most values so let's impute the same in missing values.
lead['Country'] = lead['Country'].replace(np.nan, 'India')

In [None]:
plt.figure(figsize = (10,5))
ax= sns.countplot(x=lead['Country'])
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.xticks(rotation = 90)
ax.set_yscale('log')
plt.show()

In [None]:
round(100*(lead.isnull().sum()/len(lead.index)), 2)

In [None]:
lead.isnull().sum()

In [None]:
# The remaining missing values are under 1.5%, so we can drop these rows.
lead.dropna(inplace = True)

In [None]:
round(100*(lead.isnull().sum()/len(lead.index)), 2)

In [None]:
lead.isnull().sum()

In [None]:
data_retained = len(lead) * 100 / len(lead_dub)
print("{} % of original rows is available for EDA".format(round(data_retained,2)))

In [None]:
lead.shape

Now the data is free from missing values and we can start the analysis.

# Exploratory Data Analysis

## Univariate Analysis

### Converted

In [None]:
# Converted is the target variable. It indicates whether a lead has been successfully converted (1) or not (0).

In [None]:
Converted = round((sum(lead['Converted'])/len(lead['Converted'].index))*100,2)

print("The overall conversion rate is about {} %".format(Converted))



### Lead Origin

In [None]:
plt.figure(figsize = (10,5))
ax= sns.countplot(x = "Lead Origin", hue = "Converted", data = lead)
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.xticks(rotation = 90)
ax.set_yscale('log')
plt.show()

- API and Landing Page Submission have a 30-35% conversion rate, but the count of leads originating from them is considerable.
- Lead Add Form has a conversion rate of more than 90%, but its lead count is not very high.
- Lead Import has very few leads.

**To improve the overall lead conversion rate, we need to focus on improving the conversion of leads from the API and Landing Page Submission origins and on generating more leads from Lead Add Form.**
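Conversion rates like the ones quoted above are hard to read off a log-scaled count plot, so they can also be computed directly with a group-by. The sketch below uses made-up data; `conversion_rate_by` is a hypothetical helper, and it assumes the target column holds 0/1 values.

```python
import pandas as pd

def conversion_rate_by(df, column, target="Converted"):
    """Lead count and conversion rate (in %) for each level of a categorical column."""
    summary = df.groupby(column)[target].agg(leads="count", conversion_pct="mean")
    # The mean of a 0/1 column is the share of converted leads.
    summary["conversion_pct"] = (summary["conversion_pct"] * 100).round(2)
    return summary.sort_values("leads", ascending=False)

# Toy data (illustrative values only).
toy = pd.DataFrame({
    "Lead Origin": ["API", "API", "API", "API", "Lead Add Form", "Lead Add Form"],
    "Converted": [1, 0, 0, 0, 1, 1],
})
rates = conversion_rate_by(toy, "Lead Origin")
print(rates)
```

The same helper applies to any of the categorical columns examined in this section.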

## Lead Source

In [None]:
plt.figure(figsize = (20,5))
ax= sns.countplot(x = "Lead Source", hue = "Converted", data = lead)
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.xticks(rotation = 90)
ax.set_yscale('log')
plt.show()

In [None]:
lead['Lead Source'] = lead['Lead Source'].replace(['google'], 'Google')
lead['Lead Source'] = lead['Lead Source'].replace(['Click2call', 'Live Chat', 'NC_EDM', 'Pay per Click Ads', 'Press_Release',
  'Social Media', 'WeLearn', 'bing', 'blog', 'testone', 'welearnblog_Home', 'youtubechannel'], 'Others')

In [None]:
plt.figure(figsize = (10,5))
ax= sns.countplot(x = "Lead Source", hue = "Converted", data = lead)
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.xticks(rotation = 90)
ax.set_yscale('log')
plt.show()

- Google and Direct Traffic generate the maximum number of leads.
- The conversion rate of Reference leads and of leads through the Welingak Website is high.

**To improve the overall lead conversion rate, the focus should be on improving the conversion of Olark Chat, Organic Search, Direct Traffic, and Google leads and on generating more leads from Reference and the Welingak Website.**

## Do Not Email & Do Not Call

In [None]:
plt.figure(figsize = (20,5))
plt.subplot(1,2,1)
ax= sns.countplot(x = "Do Not Email", hue = "Converted", data = lead)
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.xticks(rotation = 90)
ax.set_yscale('log')
plt.subplot(1,2,2)
ax= sns.countplot(x = "Do Not Call", hue = "Converted", data = lead)
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.xticks(rotation = 90)
ax.set_yscale('log')
plt.show()

## Total Visits

In [None]:
lead['TotalVisits'].describe(percentiles=[0.05,.25, .5, .75, .90, .95, .99])

In [None]:
plt.figure(figsize = (10,5))
sns.violinplot(x=lead['TotalVisits'])
plt.show()

In [None]:
# As we can see, there are a number of outliers in the data.
# We will cap the values at the 5th and 95th percentiles for analysis.

In [None]:
percentiles = lead['TotalVisits'].quantile([0.05,0.95]).values
# clip() avoids chained assignment, which may silently fail to modify the dataframe
lead['TotalVisits'] = lead['TotalVisits'].clip(lower=percentiles[0], upper=percentiles[1])
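Percentile capping is applied to more than one column in this notebook, so it can be expressed as a reusable function. This is a sketch on made-up data; `cap_outliers` is a hypothetical name, and the 5%/95% defaults mirror the quantiles used above.

```python
import pandas as pd

def cap_outliers(series, lower_q=0.05, upper_q=0.95):
    """Cap values below the lower quantile and above the upper quantile."""
    lower, upper = series.quantile([lower_q, upper_q])
    return series.clip(lower=lower, upper=upper)

# Toy data with one extreme value.
visits = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 250])
capped = cap_outliers(visits)
print(capped.min(), capped.max())
```

Values inside the two quantiles are left unchanged; only the tails are pulled in.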

In [None]:
plt.figure(figsize = (10,5))
sns.violinplot(x=lead['TotalVisits'])
plt.show()

In [None]:
plt.figure(figsize = (10,5))
sns.violinplot(y = 'TotalVisits', x = 'Converted', data = lead)
plt.show()

- The median for converted and not converted leads is the same.

**Nothing conclusive can be said on the basis of Total Visits.**

## Total time spent on website

In [None]:
lead['Total Time Spent on Website'].describe()

In [None]:
plt.figure(figsize = (10,5))
sns.violinplot(x=lead['Total Time Spent on Website'])
plt.show()

In [None]:
plt.figure(figsize = (10,5))
sns.violinplot(y = 'Total Time Spent on Website', x = 'Converted', data = lead)
plt.show()

- Leads spending more time on the website are more likely to be converted.

**Website should be made more engaging to make leads spend more time.**

## Page views per visit

In [None]:
lead['Page Views Per Visit'].describe()

In [None]:
plt.figure(figsize = (10,5))
sns.violinplot(x=lead['Page Views Per Visit'])
plt.show()

In [None]:
# As we can see, there are a number of outliers in the data.
# We will cap the values at the 5th and 95th percentiles for analysis.

In [None]:
percentiles = lead['Page Views Per Visit'].quantile([0.05,0.95]).values
# clip() avoids chained assignment, which may silently fail to modify the dataframe
lead['Page Views Per Visit'] = lead['Page Views Per Visit'].clip(lower=percentiles[0], upper=percentiles[1])

In [None]:
plt.figure(figsize = (10,5))
sns.violinplot(x=lead['Page Views Per Visit'])
plt.show()

In [None]:
plt.figure(figsize = (10,5))
sns.violinplot(y = 'Page Views Per Visit', x = 'Converted', data = lead)
plt.show()

- Median for converted and unconverted leads is the same.

**Nothing can be said specifically for lead conversion from Page Views Per Visit**

## Last Activity

In [None]:
lead['Last Activity'].describe()

In [None]:
lead['Last Activity'].value_counts()

In [None]:
plt.figure(figsize = (25,5))
ax= sns.countplot(x = "Last Activity", hue = "Converted", data = lead)
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.xticks(rotation = 90)
ax.set_yscale('log')
plt.show()

In [None]:
# Let's keep considerable last activities as such and club all others to "Other_Activity"
lead['Last Activity'] = lead['Last Activity'].replace(['Had a Phone Conversation', 'View in browser link Clicked', 
                                                       'Visited Booth in Tradeshow', 'Approached upfront',
                                                       'Resubscribed to emails','Email Received', 'Email Marked Spam'],
                                                      'Other_Activity')
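The levels clubbed above were picked by hand. As an alternative, rare levels can be selected with a frequency threshold; this is a different technique from the one used in the cell above. The sketch runs on made-up data, and `lump_rare_levels` and its `min_count` parameter are hypothetical.

```python
import pandas as pd

def lump_rare_levels(series, min_count, other_label):
    """Replace every level that occurs fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; replace the rest.
    return series.where(~series.isin(rare), other_label)

# Toy data (illustrative values only).
activity = pd.Series(
    ["Email Opened"] * 5 + ["SMS Sent"] * 4 + ["Email Received", "Approached upfront"]
)
lumped = lump_rare_levels(activity, min_count=2, other_label="Other_Activity")
print(lumped.value_counts())
```

A threshold keeps the grouping reproducible if new rare levels appear in later data.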

In [None]:
plt.figure(figsize = (10,5))
ax= sns.countplot(x = "Last Activity", hue = "Converted", data = lead)
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.xticks(rotation = 90)
ax.set_yscale('log')
plt.show()

- Most of the leads have 'Email Opened' as their last activity.
- The conversion rate for leads whose last activity is 'SMS Sent' is almost 62%.

## Country

In [None]:
lead.Country.describe()

In [None]:
lead.Country.value_counts()

In [None]:
plt.figure(figsize = (25,5))
ax= sns.countplot(x = "Country", hue = "Converted", data = lead)
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.xticks(rotation = 90)
ax.set_yscale('log')
plt.show()

- Most values are 'India', which tells us that the core business comes from the Indian market.

**There is potential to build business in the US, the Middle East & Europe.**

## Specialization

In [None]:
lead.Specialization.describe()

In [None]:
lead.Specialization.value_counts()

In [None]:
lead['Specialization'] = lead['Specialization'].replace(['Others'], 'Other_Specialization')

In [None]:
plt.figure(figsize = (25,5))
ax= sns.countplot(x = "Specialization", hue = "Converted", data = lead)
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.xticks(rotation = 90)
ax.set_yscale('log')
plt.show()

- The focus should be on the specializations with a high conversion rate.

## Occupation

In [None]:
lead['What is your current occupation'].describe()

In [None]:
lead['What is your current occupation'].value_counts()

In [None]:
lead['What is your current occupation'] = lead['What is your current occupation'].replace(['Other'], 'Other_Occupation')

In [None]:
plt.figure(figsize = (10,5))
ax= sns.countplot(x = "What is your current occupation", hue = "Converted", data = lead)
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.xticks(rotation = 90)
ax.set_yscale('log')
plt.show()

- Working Professionals who enquire about a course have a high chance of joining it.
- Unemployed leads are the most numerous but have a conversion rate of around 30-35%.

## What matters most to you in choosing a course

In [None]:
lead['What matters most to you in choosing a course'].describe()

In [None]:
lead['What matters most to you in choosing a course'].value_counts()

In [None]:
plt.figure(figsize = (10,5))
ax= sns.countplot(x = "What matters most to you in choosing a course", hue = "Converted", data = lead)
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.xticks(rotation = 90)
ax.set_yscale('log')
plt.show()

- Most entries are 'Better Career Prospects'. No Inference can be drawn with this parameter.

## Search

In [None]:
lead.Search.describe()

In [None]:
lead.Search.value_counts()

In [None]:
plt.figure(figsize = (10,5))
ax= sns.countplot(x = "Search", hue = "Converted", data = lead)
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.xticks(rotation = 90)
ax.set_yscale('log')
plt.show()

- Most entries are 'No'. No Inference can be drawn with this parameter.

## Magazine

In [None]:
lead.Magazine.describe()

In [None]:
lead.Magazine.value_counts()

In [None]:
plt.figure(figsize = (10,5))
ax= sns.countplot(x = "Magazine", hue = "Converted", data = lead)
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.xticks(rotation = 90)
ax.set_yscale('log')
plt.show()

- All entries are 'No'. No Inference can be drawn with this parameter.

## Newspaper Article

In [None]:
lead['Newspaper Article'].describe()

In [None]:
lead['Newspaper Article'].value_counts()

In [None]:
plt.figure(figsize = (10,5))
ax= sns.countplot(x = "Newspaper Article", hue = "Converted", data = lead)
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.xticks(rotation = 90)
ax.set_yscale('log')
plt.show()

- Most entries are 'No'. No Inference can be drawn with this parameter.

## X Education Forums

In [None]:
lead['X Education Forums'].describe()

In [None]:
lead['X Education Forums'].value_counts()

In [None]:
plt.figure(figsize = (10,5))
ax= sns.countplot(x = "X Education Forums", hue = "Converted", data = lead)
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.xticks(rotation = 90)
ax.set_yscale('log')
plt.show()

- Most entries are 'No'. No Inference can be drawn with this parameter.

## Newspaper

In [None]:
lead['Newspaper'].describe()

In [None]:
lead['Newspaper'].value_counts()

In [None]:
plt.figure(figsize = (10,5))
ax= sns.countplot(x = "Newspaper", hue = "Converted", data = lead)
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.xticks(rotation = 90)
ax.set_yscale('log')
plt.show()

- Most entries are 'No'. No Inference can be drawn with this parameter.

## Digital Advertisement

In [None]:
lead['Digital Advertisement'].describe()

In [None]:
lead['Digital Advertisement'].value_counts()

In [None]:
plt.figure(figsize = (10,5))
ax= sns.countplot(x = "Digital Advertisement", hue = "Converted", data = lead)
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.xticks(rotation = 90)
ax.set_yscale('log')
plt.show()

- Most entries are 'No'. No Inference can be drawn with this parameter.

## Through Recommendations

In [None]:
lead['Through Recommendations'].describe()

In [None]:
lead['Through Recommendations'].value_counts()

In [None]:
plt.figure(figsize = (10,5))
ax= sns.countplot(x = "Through Recommendations", hue = "Converted", data = lead)
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.xticks(rotation = 90)
ax.set_yscale('log')
plt.show()

- Most entries are 'No'. No Inference can be drawn with this parameter.

## Receive More Updates About Our Courses

In [None]:
lead['Receive More Updates About Our Courses'].describe()

In [None]:
lead['Receive More Updates About Our Courses'].value_counts()

In [None]:
plt.figure(figsize = (10,5))
ax= sns.countplot(x = "Receive More Updates About Our Courses", hue = "Converted", data = lead)
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.xticks(rotation = 90)
ax.set_yscale('log')
plt.show()

- All entries are 'No'. No Inference can be drawn with this parameter.

## Tags

In [None]:
lead.Tags.describe()

In [None]:
lead.Tags.value_counts()

In [None]:
plt.figure(figsize = (30,6))
ax= sns.countplot(x = "Tags", hue = "Converted", data = lead)
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.xticks(rotation = 90)
ax.set_yscale('log')
plt.show()

In [None]:
# Let's keep the considerable tags as such and club all others into "Other_Tags"
lead['Tags'] = lead['Tags'].replace(['In confusion whether part time or DLP', 'in touch with EINS','Diploma holder (Not Eligible)',
                                     'Approached upfront','Graduation in progress','number not provided', 'opp hangup','Still Thinking',
                                    'Lost to Others','Shall take in the next coming month','Lateral student','Interested in Next batch',
                                    'Recognition issue (DEC approval)','Want to take admission but has financial problems',
                                    'University not recognized'], 'Other_Tags')

In [None]:
plt.figure(figsize = (20,6))
ax= sns.countplot(x = "Tags", hue = "Converted", data = lead)
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.xticks(rotation = 90)
ax.set_yscale('log')
plt.show()

- 'Will revert after reading the email' is a mixed signal: the lead may be interested or not interested. Depending on the customer's mood, their requirements & the content of the email, the lead can be converted into a customer.
- 'Closed by Horizzon' and 'Lost to EINS' are positive tags for a lead.
- 'Invalid number', 'wrong number given', 'Not doing further education' & 'Interested  in full time MBA' are negative tags.

## Lead Quality

In [None]:
lead['Lead Quality'].describe()

In [None]:
lead['Lead Quality'].value_counts()

In [None]:
plt.figure(figsize = (10,6))
ax= sns.countplot(x = "Lead Quality", hue = "Converted", data = lead)
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.xticks(rotation = 90)
ax.set_yscale('log')
plt.show()

- 'Not Sure' is a mixed signal: the lead may be interested or not interested. Depending on the customer's mood, their requirements & the content of the communication, the lead can be converted into a customer.
- 'Worst' Lead Quality brings less business.

## Update me on Supply Chain Content

In [None]:
lead['Update me on Supply Chain Content'].describe()

In [None]:
lead['Update me on Supply Chain Content'].value_counts()

In [None]:
plt.figure(figsize = (10,6))
ax= sns.countplot(x = "Update me on Supply Chain Content", hue = "Converted", data = lead)
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.xticks(rotation = 90)
ax.set_yscale('log')
plt.show()

- All entries are 'No'. No Inference can be drawn with this parameter.

## Get updates on DM Content

In [None]:
lead['Get updates on DM Content'].describe()

In [None]:
lead['Get updates on DM Content'].value_counts()

In [None]:
plt.figure(figsize = (10,6))
ax= sns.countplot(x = "Get updates on DM Content", hue = "Converted", data = lead)
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.xticks(rotation = 90)
ax.set_yscale('log')
plt.show()

- All entries are 'No'. No Inference can be drawn with this parameter.

## I agree to pay the amount through cheque

In [None]:
lead['I agree to pay the amount through cheque'].describe()

In [None]:
lead['I agree to pay the amount through cheque'].value_counts()

In [None]:
plt.figure(figsize = (10,6))
ax= sns.countplot(x = "I agree to pay the amount through cheque", hue = "Converted", data = lead)
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.xticks(rotation = 90)
ax.set_yscale('log')
plt.show()

- All entries are 'No'. No Inference can be drawn with this parameter.

## A free copy of Mastering The Interview

In [None]:
lead['A free copy of Mastering The Interview'].describe()

In [None]:
lead['A free copy of Mastering The Interview'].value_counts()

In [None]:
plt.figure(figsize = (10,6))
ax= sns.countplot(x = "A free copy of Mastering The Interview", hue = "Converted", data = lead)
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.xticks(rotation = 90)
ax.set_yscale('log')
plt.show()

- 'A free copy of Mastering The Interview' doesn't play a role in decision making.

## City

In [None]:
lead.City.describe()

In [None]:
lead.City.value_counts()

In [None]:
plt.figure(figsize = (10,6))
ax= sns.countplot(x = "City", hue = "Converted", data = lead)
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.xticks(rotation = 90)
ax.set_yscale('log')
plt.show()

- Most leads are from Mumbai, with a conversion rate of around 30%.

## Last Notable Activity

In [None]:
lead['Last Notable Activity'].describe()

In [None]:
lead['Last Notable Activity'].value_counts()

In [None]:
plt.figure(figsize = (20,6))
ax= sns.countplot(x = "Last Notable Activity", hue = "Converted", data = lead)
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.01 , p.get_height() * 1.01))
plt.xticks(rotation = 90)
ax.set_yscale('log')
plt.show()

- 'SMS Sent' is a strong indicator of a positive lead.

## **Results**


Based on the univariate analysis, we have seen that many columns do not add any information to the model, hence we can drop them from further analysis.

In [None]:
lead = lead.drop(['What matters most to you in choosing a course','Search',
                  'Magazine','Newspaper Article','X Education Forums','Newspaper',
           'Digital Advertisement','Through Recommendations','Receive More Updates About Our Courses',
                  'Update me on Supply Chain Content',
           'Get updates on DM Content','I agree to pay the amount through cheque',
                  'A free copy of Mastering The Interview','Country'], axis=1)
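Columns dominated by a single level, like most of those dropped above, can also be flagged programmatically by measuring the share of the most frequent level. This is a sketch on made-up data; `dominant_share` and the 0.85 cut-off are assumptions for illustration, not values taken from the analysis.

```python
import pandas as pd

def dominant_share(df):
    """Share of rows taken by the most frequent level, for each column."""
    return df.apply(lambda col: col.value_counts(normalize=True, dropna=False).iloc[0])

# Toy data (illustrative values only).
toy = pd.DataFrame({
    "Magazine": ["No"] * 10,
    "Search": ["No"] * 9 + ["Yes"],
    "City": ["Mumbai"] * 5 + ["Thane"] * 5,
})
shares = dominant_share(toy)
near_constant = shares[shares >= 0.85].index.tolist()
print(near_constant)
```

A flagged column is only a candidate for removal; whether to drop it remains a judgment call, as with the columns above.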

In [None]:
print("Original Columns {} % Retained".format(round((100* len(lead.columns)/len(lead_dub.columns)),2)))

In [None]:
print("Original Data {} % Retained".format(round((len(lead) * 
                                                     len(lead.columns))*100/(len(lead_dub.columns)*len(lead_dub)),2)))

In [None]:
lead.shape

In [None]:
lead.head()

## Data Preparation

### Converting some binary variables (Yes/No) to 1/0

In [None]:
# List of variables to map

varlist =  ['Do Not Email', 'Do Not Call']

# Defining the map function
def binary_map(x):
    return x.map({'Yes': 1, "No": 0})

# Applying the function to the variables in varlist
lead[varlist] = lead[varlist].apply(binary_map)
lead.head()

In [None]:
# Creating dummy variables for the categorical variables and dropping the first level of each.
dummy1 = pd.get_dummies(lead[['Lead Origin', 'Lead Source', 'Last Activity', 'Specialization','What is your current occupation',
                              'Tags','Lead Quality','City','Last Notable Activity']], drop_first=True)
dummy1.head()
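To see what `drop_first=True` does, consider a column with three levels: only two indicator columns are produced, and the dropped level is represented by a row of zeros. The sketch below uses made-up data and passes `dtype=int` so the indicators are 0/1 rather than booleans.

```python
import pandas as pd

# Toy data (illustrative values only).
toy = pd.DataFrame({"City": ["Mumbai", "Thane", "Other Cities", "Mumbai"]})

# Levels are ordered alphabetically, so 'Mumbai' is the first level and is dropped.
dummies = pd.get_dummies(toy[["City"]], drop_first=True, dtype=int)
print(dummies)
```

Dropping one level per variable avoids perfectly collinear indicator columns, which matters for models such as logistic regression.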

In [None]:
# Adding the results to the master dataframe
lead = pd.concat([lead, dummy1], axis=1)
lead.head()

In [None]:
lead = lead.drop(['Lead Origin', 'Lead Source', 'Last Activity', 'Specialization',
                  'What is your current occupation','Tags','Lead Quality','City','Last Notable Activity'], axis = 1)
lead.head()

In [None]:
lead.shape