# Lead Scoring

## Problem Statement

#### About the company:
	- X Education sells online courses to industry professionals
	- They typically market their courses on websites and search engines (such as Google)
	- On any given day, professionals land on their website and browse through courses

#### Lead conversion process:
	- General website behaviour:
		1. Individuals land on the website, browse courses and watch videos
		2. They might end up filling in a form
	- A LEAD is generated when:
		1. An individual has provided his/her phone number or email address
		2. A past referral is provided
	- Sales team action on leads:
		1. Make calls
		2. Write emails

#### Business problem scenario:
	- Their typical lead conversion rate is 30%
	- The company wishes to identify the most promising leads (a.k.a. "HOT LEADS"), i.e. those most likely to convert, and spend its time on them

##### The CEO, in particular, has given a ballpark target lead conversion rate of around 80%

## Solution Approach

The problem can be addressed using a Logistic Regression model, which predicts whether a lead can be classified as a potential lead or not, helping the business communicate more effectively.

In Logistic Regression, the prediction is made using probabilities. In our case, the model assigns a conversion probability to each lead using the most significant variables (features), and an optimal cut-off (threshold) then determines whether the lead is classified as potential or not.

Conclusions and recommendations will specify:
* Significant features determining potential leads
* Key areas to focus on to increase the overall conversion rate
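
To make the cut-off idea concrete, the sketch below converts log-odds into probabilities with the logistic function and labels a lead as potential when its probability reaches the cut-off. The scores and the 0.5 cut-off are made-up illustrations, not values from this analysis.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps log-odds to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical log-odds for four leads (illustrative values only)
log_odds = np.array([-2.0, -0.5, 0.0, 1.5])
prob = sigmoid(log_odds)

# Illustrative cut-off; in practice it would be tuned on the data
cutoff = 0.5
is_potential_lead = (prob >= cutoff).astype(int)
```

Raising the cut-off makes the model more selective (fewer leads flagged, but a larger share of them likely to convert), which is the lever that matters for a high target conversion rate.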

In [None]:
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Importing libraries

import pandas as pd, numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from matplotlib.pyplot import xticks
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE

import statsmodels.api as sm

from statsmodels.stats.outliers_influence import variance_inflation_factor

from sklearn import metrics

## 1. Loading data and Inspecting

In [None]:
leads_df = pd.read_csv('../input/leadscore/Leads.csv')
leads_df.head()

In [None]:
leads_df.shape

The dataset has 9240 records and 37 attributes.

### 1.1 Finding Duplicates 

In [None]:
sum(leads_df.duplicated(subset = 'Prospect ID'))

In [None]:
sum(leads_df.duplicated(subset = 'Lead Number'))

Both Prospect ID and Lead Number are unique, so there are no duplicate records.
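
As a small illustration of what the duplicate checks count, the toy frame below (hypothetical IDs, not taken from Leads.csv) contains one repeated Prospect ID. `duplicated(subset=...)` flags every occurrence after the first, so the sum is the number of surplus rows.

```python
import pandas as pd

# Toy data with one repeated Prospect ID (values are made up)
toy = pd.DataFrame({
    'Prospect ID': ['a', 'b', 'c', 'b'],
    'Lead Number': [1, 2, 3, 4],
})

dup_prospect = int(toy.duplicated(subset='Prospect ID').sum())
dup_lead = int(toy.duplicated(subset='Lead Number').sum())

# Equivalent yes/no check on a single column
lead_number_is_unique = toy['Lead Number'].is_unique
```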

### 1.2 Summary Statistics of Numerical Variables

In [None]:
# looking at summary statistics of numerical columns
leads_df.describe()

In [None]:
# verifying outliers
leads_df['Total Time Spent on Website'].quantile([0,0.25,0.5,0.75,0.95,0.99,1])

In [None]:
leads_df['Page Views Per Visit'].quantile([0,0.25,0.5,0.75,0.95,0.99,1])

Both variables show the presence of outliers.
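
One common way to quantify what the quantiles suggest is Tukey's 1.5 x IQR rule. The sketch below applies it to made-up values; the helper name `iqr_upper_fence` is hypothetical, and the rule is a rule of thumb rather than the treatment used later in this notebook.

```python
import pandas as pd

def iqr_upper_fence(values):
    """Upper fence of Tukey's rule: Q3 + 1.5 * (Q3 - Q1)."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    return q3 + 1.5 * (q3 - q1)

# Made-up page-view counts with one extreme value
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

fence = iqr_upper_fence(toy)
n_outliers = int((toy > fence).sum())
```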

### 1.3 Verifying data types

In [None]:
# data type of variables
leads_df.info()

All variables are in the correct format.

### 1.4 Checking NULL values

In [None]:
# Null values
round((leads_df.isnull().sum()/len(leads_df))*100,1)

There are quite a few variables with NULL values. We will handle these in the data cleaning step.

### 1.5 Garbage values ("Select")

As seen in the CSV, the following columns contain "Select" as a value, which is equivalent to providing no input. These values will be replaced with NULL during data cleaning:

* "Specialization"
* "How did you hear about X Education"
* "Lead Profile"
* "City"

In [None]:
# Specialization
leads_df.Specialization.value_counts()

In [None]:
# How did you hear about X Education
leads_df['How did you hear about X Education'].value_counts()

In [None]:
# Lead Profile
leads_df['Lead Profile'].value_counts()

In [None]:
# City
leads_df['City'].value_counts()

## 2. Data Cleaning

### 2.1 Converting columns with "Select" as values, to NULL

In [None]:
leads_df.loc[leads_df['Specialization'].str.lower() == 'select', 'Specialization'] = np.nan
leads_df.loc[leads_df['How did you hear about X Education'].str.lower() == 'select', 'How did you hear about X Education'] = np.nan
leads_df.loc[leads_df['Lead Profile'].str.lower() == 'select', 'Lead Profile'] = np.nan
leads_df.loc[leads_df['City'].str.lower() == 'select', 'City'] = np.nan

leads_df.head()
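
The same replacement can be written as a loop over the affected columns. This is a minimal sketch on a made-up frame; it assumes the placeholder is always some casing of the word "select".

```python
import numpy as np
import pandas as pd

# Toy frame: "Select" stands for "no option chosen" (values are made up)
toy = pd.DataFrame({
    'City': ['Mumbai', 'Select', 'select', None],
    'Specialization': ['Select', 'Finance Management', 'IT Projects Management', 'Select'],
})

for column in ['City', 'Specialization']:
    # Case-insensitive match; existing missing values compare as False
    is_placeholder = toy[column].str.lower() == 'select'
    toy.loc[is_placeholder, column] = np.nan

missing_city = int(toy['City'].isnull().sum())
missing_spec = int(toy['Specialization'].isnull().sum())
```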

### 2.2 Dropping columns that are not relevant for modelling

* These are scoring based columns, which are generated by Sales Team after having an initial discussion with the Customer/Student.
* Our goal is to identify potential leads so that the communication can be made more focused, rather than calling everyone. 

In [None]:
leads_df = leads_df.drop(['Tags','Lead Quality','Asymmetrique Activity Index','Asymmetrique Profile Index',
                          'Asymmetrique Activity Score','Asymmetrique Profile Score','Lead Profile'], axis=1)

In [None]:
leads_df.info()

### 2.3 Handling NULL values

In [None]:
# Null values
round((leads_df.isnull().sum()/len(leads_df))*100,1)

In [None]:
# creating a copy of original dataframe
leads_df_prep = leads_df.copy()

**Dropping columns with more than 45% missing values**

In [None]:
missing_pct = round(100 * leads_df_prep.isnull().mean(), 2)
leads_df_prep = leads_df_prep.drop(columns=missing_pct[missing_pct > 45].index)
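
The threshold rule can be checked on a toy frame. The column names below are hypothetical; the 45% threshold is the one used in this section.

```python
import numpy as np
import pandas as pd

# Toy frame with 75%, 25% and 0% missing values (made-up columns)
toy = pd.DataFrame({
    'mostly_missing': [np.nan, np.nan, np.nan, 1.0],
    'some_missing': [1.0, np.nan, 3.0, 4.0],
    'complete': [1, 2, 3, 4],
})

# isnull().mean() is the share of missing values per column
toy_missing_pct = toy.isnull().mean() * 100
kept = toy.drop(columns=toy_missing_pct[toy_missing_pct > 45].index)
```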

In [None]:
# checking dataframe
leads_df_prep.head()

Checking Missing Percentage again

In [None]:
# Null values
round((leads_df_prep.isnull().sum()/len(leads_df_prep))*100,1)

##### Handling NULL values in "Lead Source"

In [None]:
leads_df_prep['Lead Source'].value_counts()

The majority of leads come from Google. Hence we will replace missing values with "Google".

In [None]:
# replacing nulls with mode
leads_df_prep.loc[leads_df_prep['Lead Source'].isnull(), 'Lead Source'] = 'Google'

# checking for null values after replacement, which should be zero
leads_df_prep['Lead Source'].isnull().sum()

In [None]:
# rechecking counts
leads_df_prep['Lead Source'].value_counts()
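
Instead of hard-coding the most frequent category, it can be looked up with `Series.mode()`. A minimal sketch on made-up values:

```python
import pandas as pd

# Toy lead sources with two missing entries (values are made up)
source = pd.Series(['Google', 'Direct Traffic', 'Google', None, 'Olark Chat', None],
                   name='Lead Source')

# mode() ignores missing values; [0] picks the first mode if there are ties
most_frequent = source.mode()[0]
filled = source.fillna(most_frequent)
```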

##### Handling NULL values in "Total Visits"

In [None]:
# This has to be imputed using either the mean or the median.
# determining the mean and median values for converted and not converted leads

# mean
print(leads_df_prep.groupby('Converted').TotalVisits.mean())

# median
print(leads_df_prep.groupby('Converted').TotalVisits.median())

We will impute with the median: it is the same for both groups, so it is a safe choice.

In [None]:
# replacing nulls with median
leads_df_prep.loc[leads_df_prep['TotalVisits'].isnull(), 'TotalVisits'] = leads_df_prep['TotalVisits'].median()

In [None]:
# null values should now be zero
leads_df_prep['TotalVisits'].isnull().sum()

In [None]:
# mean
print(leads_df_prep.groupby('Converted').TotalVisits.mean())

# median
print(leads_df_prep.groupby('Converted').TotalVisits.median())

No significant change in Mean after replacement.
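
The reason the median is the safer imputation value for a skewed count can be seen on a toy series with one extreme observation (values are made up):

```python
import numpy as np
import pandas as pd

# Made-up visit counts: one missing value and one extreme value
visits = pd.Series([0, 2, 3, 4, np.nan, 250.0])

mean_value = visits.mean()      # pulled upwards by the extreme value
median_value = visits.median()  # stays near the typical lead

filled = visits.fillna(median_value)
```

Here the mean is far above what a typical lead looks like, whereas the median is not, so filling gaps with the median avoids inventing unusually active leads.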

##### Handling NULL values in "Page Views Per Visit"

In [None]:
# This has to be imputed using either the mean or the median.
# determining the mean and median values for converted and not converted leads

# mean
print(leads_df_prep.groupby('Converted')['Page Views Per Visit'].mean())

# median
print(leads_df_prep.groupby('Converted')['Page Views Per Visit'].median())

We will impute with the median

In [None]:
# replacing nulls with median
leads_df_prep.loc[leads_df_prep['Page Views Per Visit'].isnull(), 'Page Views Per Visit'] = leads_df_prep['Page Views Per Visit'].median()

In [None]:
# null values should now be zero
leads_df_prep['Page Views Per Visit'].isnull().sum()

In [None]:
# mean
print(leads_df_prep.groupby('Converted')['Page Views Per Visit'].mean())

# median
print(leads_df_prep.groupby('Converted')['Page Views Per Visit'].median())

##### Handling NULL values in "Last Activity"

In [None]:
leads_df_prep['Last Activity'].value_counts()

Imputing missing values with "Others"

In [None]:
leads_df_prep.loc[leads_df_prep['Last Activity'].isnull(), 'Last Activity'] = 'Others'

In [None]:
leads_df_prep['Last Activity'].value_counts()

##### Handling NULL values in "Country"

In [None]:
# determining leads where Country is missing but City is provided

leads_df_prep[leads_df_prep['Country'].isnull() & ~leads_df_prep['City'].isnull()]['City'].value_counts()

In [None]:
# determining countries that fall within "Tier II Cities"

leads_df_prep[leads_df_prep['City'] == 'Tier II Cities']['Country'].value_counts()

In [None]:
# determining overall distribution

leads_df_prep['Country'].value_counts()

Since India has the highest count in each inspection above, and some of the rows with a missing Country list Indian cities, we impute the missing values with India.

In [None]:
# replacing missing values
leads_df_prep.loc[leads_df_prep['Country'].isnull(), 'Country'] = 'India'

In [None]:
# verifying results
leads_df_prep['Country'].value_counts()

##### Handling NULL values in "Specialization"

In [None]:
leads_df_prep['Specialization'].value_counts(normalize=True) * 100

In [None]:
sns.countplot(y='Specialization',hue='Converted', data=leads_df_prep)
plt.show()

In [None]:
leads_df_prep[leads_df_prep['Specialization'].isnull()]['What is your current occupation'].value_counts(normalize=True)

Out of all the missing values under Specialization:
   * 93% belong to individuals who are "Unemployed"
   * Only 1% are "Working Professionals"

Hence we cannot impute using the existing categories, and we create a new category "Others" for the missing values.

In [None]:
# replacing missing values with "Others"
leads_df_prep.loc[leads_df_prep['Specialization'].isnull(), 'Specialization'] = 'Others'

In [None]:
# verifying dataframe
leads_df_prep['Specialization'].value_counts(normalize=True) * 100

##### Handling NULL values in "What is your current occupation"

In [None]:
leads_df_prep['What is your current occupation'].value_counts(normalize=True) * 100

In [None]:
leads_df_prep[leads_df_prep['What is your current occupation'].isnull()]['Specialization'].value_counts(normalize=True)

In [None]:
# replacing missing values with "Other"
leads_df_prep.loc[leads_df_prep['What is your current occupation'].isnull(), 'What is your current occupation'] = 'Other'

In [None]:
leads_df_prep['What is your current occupation'].value_counts(normalize=True) * 100

##### Handling NULL values in "What matters most to you in choosing a course"

In [None]:
leads_df_prep['What matters most to you in choosing a course'].value_counts(normalize=True) * 100

99% of individuals look for a course to improve their career opportunities, either within their workplace or by getting a new job.

Hence we can **drop** this column because it won't be significant in predicting good leads.

In [None]:
leads_df_prep = leads_df_prep.drop('What matters most to you in choosing a course', axis=1)

In [None]:
leads_df_prep.head()

##### Handling NULL values in "City"

In [None]:
leads_df_prep['City'].value_counts(normalize=True)

In [None]:
# replacing with "Other Cities"
leads_df_prep.loc[leads_df_prep['City'].isnull(), 'City'] = 'Other Cities'

In [None]:
leads_df_prep['City'].value_counts(normalize=True)

#### Verifying dataset after treatment

In [None]:
leads_df_prep.isnull().sum()/len(leads_df_prep)

In [None]:
# retained rows and columns
leads_df_prep.shape

### 2.4 Columns with only one category

In [None]:
# determining columns with only one category

columns_with_single_category = []

for col in list(leads_df_prep.columns):
    if leads_df_prep[col].nunique() == 1:
        columns_with_single_category.append(col)

columns_with_single_category
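
The same check reads compactly as a list comprehension. A minimal sketch on a made-up frame (the column names are only illustrative):

```python
import pandas as pd

# Toy frame where one column never varies (values are made up)
toy = pd.DataFrame({
    'Magazine': ['No', 'No', 'No'],
    'Converted': [0, 1, 0],
    'City': ['Mumbai', 'Pune', 'Mumbai'],
})

# A column with a single distinct value carries no information for a model
single_valued = [column for column in toy.columns if toy[column].nunique() == 1]
```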

In [None]:
# dropping columns

leads_df_prep = leads_df_prep.drop(columns_with_single_category, axis=1)

leads_df_prep.head()

### 2.5 Determining unique categories and skewness

##### Lead Origin

In [None]:
leads_df_prep['Lead Origin'].value_counts(normalize = True)

The data is not highly skewed towards one category and has only 5 categories. Hence no change is needed.

##### Lead Source

In [None]:
print(leads_df_prep['Lead Source'].value_counts(normalize = True))
print(leads_df_prep['Lead Source'].nunique())

* Google appears as both "google" and "Google". These should be combined.
* "Facebook", "bing", "Live Chat", "blog", "youtubechannel", and "NC_EDM" can be merged into "Social Media", since that is a more generic category.
* "WeLearn", "Welingak Website", and "welearnblog_Home" can be clubbed into a broader category of "Other Educational Sites".
* "Reference" and "Referral Sites" can also be grouped into "Reference and Referral Sites".
* Categories with a low share of leads (< 0.1%) are combined into "Others".

In [None]:
# making "google" and "Google" consistent

leads_df_prep['Lead Source'] = leads_df_prep['Lead Source'].replace('google','Google')

In [None]:
# creating "Social Media" as a more generic category

leads_df_prep['Lead Source'] = leads_df_prep['Lead Source'].replace(
    ['Facebook','bing','Live Chat','blog','youtubechannel','NC_EDM'],'Social Media')

In [None]:
# creating "Other Educational Sites" as a more generic category

leads_df_prep['Lead Source'] = leads_df_prep['Lead Source'].replace(
    ['Welingak Website','WeLearn','welearnblog_Home'],'Other Educational Sites')

In [None]:
# creating "Reference and Referral Sites" as a more generic category

leads_df_prep['Lead Source'] = leads_df_prep['Lead Source'].replace(
    ['Reference','Referral Sites'],'Reference and Referral Sites')

In [None]:
# combining categories with low distribution of leads into "Others"

leads_df_prep['Lead Source'] = leads_df_prep['Lead Source'].replace(['Click2call','Press_Release',
                                                     'Pay per Click Ads','testone'] ,'Others')

In [None]:
# determining final categories

leads_df_prep['Lead Source'].value_counts(normalize = True)
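
The "low share goes to Others" step can also be derived from the data rather than listed by hand. The sketch below is one possible way to do it on made-up values; the 15% threshold is chosen only so that the tiny example has a rare category.

```python
import pandas as pd

# Made-up lead sources: shares are 60%, 30% and 10%
source = pd.Series(['Google'] * 6 + ['Direct Traffic'] * 3 + ['testone'])

share = source.value_counts(normalize=True)
rare_categories = share[share < 0.15].index

# Keep frequent categories, relabel the rare ones
grouped = source.where(~source.isin(rare_categories), 'Others')
```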

##### Last Activity

In [None]:
print(leads_df_prep['Last Activity'].value_counts(normalize = True))
print(leads_df_prep['Last Activity'].nunique())

* "Email Bounced", "Unreachable", and "Unsubscribed" can be grouped into "Unreachable"
* "Had a Phone Conversation", "Approached upfront", "View in browser link Clicked", "Email Received", "Email Marked Spam", "Resubscribed to emails", and "Visited Booth in Tradeshow" can be grouped into "Others" because of their low share

In [None]:
# creating "Unreachable" category

leads_df_prep['Last Activity'] = leads_df_prep['Last Activity'].replace(['Email Bounced','Unsubscribed'],'Unreachable')

In [None]:
# combining categories with low distribution into "Others"
leads_df_prep['Last Activity'] = leads_df_prep['Last Activity'].replace([
                                                        'Had a Phone Conversation', 
                                                        'Approached upfront',
                                                        'View in browser link Clicked',       
                                                        'Email Marked Spam',                  
                                                        'Email Received','Resubscribed to emails',
                                                         'Visited Booth in Tradeshow'],'Others')

In [None]:
print(leads_df_prep['Last Activity'].value_counts(normalize = True))
print(leads_df_prep['Last Activity'].nunique())

##### Country

In [None]:
print(leads_df_prep['Country'].value_counts(normalize = True))
print(leads_df_prep['Country'].nunique())

**97%** of the leads come from India. Hence this column can be dropped, since it would not have a significant impact on prediction.

In [None]:
leads_df_prep = leads_df_prep.drop('Country', axis=1)

leads_df_prep.shape

##### Specialization

In [None]:
leads_df_prep.Specialization.value_counts(normalize=True)

In [None]:
# determining distribution against target variable
sns.countplot(y='Specialization', hue='Converted', data=leads_df_prep)
plt.show()

In [None]:
#combining Management Specializations because they show similar trends

leads_df_prep['Specialization'] = leads_df_prep['Specialization'].replace(['Finance Management','Human Resource Management',
                                                           'Marketing Management','Operations Management',
                                                           'IT Projects Management','Supply Chain Management',
                                                    'Healthcare Management','Hospitality Management',
                                                           'Retail Management'] ,'Management_Cross Industry')  

In [None]:
# creating "E-Commerce" as a generic segment

leads_df_prep['Specialization'] = leads_df_prep['Specialization'].replace(['E-Business'],'E-COMMERCE')

In [None]:
leads_df_prep['Specialization'] = leads_df_prep['Specialization'].replace(['Others','Rural and Agribusiness'
                                                                          ,'Services Excellence'],'Other Specializations')

In [None]:
# verifying the final categories
leads_df_prep.Specialization.value_counts(normalize=True)

##### What is your current occupation

In [None]:
leads_df_prep['What is your current occupation'].value_counts(normalize=True)

No cleaning needed as we have required categories

##### City

In [None]:
leads_df_prep.City.value_counts(normalize=True)

We can club "Thane & Outskirts" with "Other Cities of Maharashtra" to give them a better weightage.

In [None]:
leads_df_prep['City'] = leads_df_prep['City'].replace('Thane & Outskirts','Other Cities of Maharashtra') 

In [None]:
leads_df_prep.City.value_counts(normalize=True)

##### Last Notable Activity

In [None]:
leads_df_prep['Last Notable Activity'].value_counts(normalize = True)

This column is similar to the information we have in "Last Activity" of the individual. Hence this can be dropped.

In [None]:
leads_df_prep = leads_df_prep.drop('Last Notable Activity', axis=1) 

##### Search

In [None]:
leads_df_prep['Search'].value_counts(normalize=True)

99% of the leads are not based on any search. Hence this column can be dropped

In [None]:
leads_df_prep = leads_df_prep.drop('Search', axis=1)

##### Through Recommendations

In [None]:
leads_df_prep['Through Recommendations'].value_counts(normalize=True)

99% of the leads are not based on particular Recommendations. Hence this column can be dropped

In [None]:
leads_df_prep = leads_df_prep.drop('Through Recommendations', axis=1)

##### Digital Advertisement

In [None]:
leads_df_prep['Digital Advertisement'].value_counts(normalize=True)

99% of the leads are not based on particular Digital Advertisement. Hence this column can be dropped

In [None]:
leads_df_prep = leads_df_prep.drop('Digital Advertisement', axis=1)

##### 'Newspaper Article', 'X Education Forums', 'Newspaper','Do Not Call'

In [None]:
print(leads_df_prep['Newspaper Article'].value_counts(normalize=True))
print(leads_df_prep['X Education Forums'].value_counts(normalize=True))
print(leads_df_prep['Newspaper'].value_counts(normalize=True))
print(leads_df_prep['Do Not Call'].value_counts(normalize=True))

These columns are again highly skewed towards one category. Hence they can be dropped.

In [None]:
leads_df_prep = leads_df_prep.drop(['Newspaper Article', 'X Education Forums', 'Newspaper','Do Not Call'], axis=1)
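
The column-by-column inspection above can be summarised by computing the share of the most frequent value in each column. This is a sketch on made-up data; the 95% threshold is an assumption for illustration, not a value used in this notebook.

```python
import pandas as pd

# Made-up flags: one column is 99% "No", the other 80% "No"
toy = pd.DataFrame({
    'Search': ['No'] * 99 + ['Yes'],
    'Do Not Email': ['No'] * 80 + ['Yes'] * 20,
})

# value_counts() sorts descending, so iloc[0] is the dominant category's share
dominant_share = {
    column: toy[column].value_counts(normalize=True).iloc[0]
    for column in toy.columns
}
highly_skewed = [column for column, value in dominant_share.items() if value >= 0.95]
```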

In [None]:
leads_df_prep.info()

In [None]:
leads_df_prep.shape

**We were able to retain 100% of the records and 14 columns**

### 2.6 Outlier Analysis

In [None]:
leads_df_prep.describe()

* Creating plots for all numerical variables to identify any outliers

In [None]:
plt.figure(figsize=(14,3))
plt.subplot(1,3,1)
sns.boxplot(x='TotalVisits', data=leads_df_prep)

plt.subplot(1,3,2)
sns.boxplot(x='Total Time Spent on Website', data=leads_df_prep)

plt.subplot(1,3,3)
sns.boxplot(x='Page Views Per Visit', data=leads_df_prep)

* Total Visits and Page Views Per Visit have outliers.

* Treating outliers by capping at the 95th percentile

##### TotalVisits

In [None]:
# checking quantile range
leads_df_prep['TotalVisits'].quantile([0,0.25,0.5,0.75,0.9,0.95,0.99,1])

In [None]:
# determining 95th quantile

q95 = leads_df_prep['TotalVisits'].quantile(0.95)

# capping TotalVisits above the 95th percentile at the 95th percentile
leads_df_prep.loc[leads_df_prep['TotalVisits'] > q95, 'TotalVisits'] = q95

##### Page Views Per Visit

In [None]:
leads_df_prep['Page Views Per Visit'].quantile([0,0.25,0.5,0.75,0.9,0.95,0.99,1])

In [None]:
# determining 95th quantile

q95 = leads_df_prep['Page Views Per Visit'].quantile(0.95)

# capping Page Views Per Visit above the 95th percentile at the 95th percentile
leads_df_prep.loc[leads_df_prep['Page Views Per Visit'] > q95, 'Page Views Per Visit'] = q95

Re-verifying using boxplots

In [None]:
plt.figure(figsize=(14,3))
plt.subplot(1,3,1)
sns.boxplot(x='TotalVisits', data=leads_df_prep)

plt.subplot(1,3,2)
sns.boxplot(x='Total Time Spent on Website', data=leads_df_prep)

plt.subplot(1,3,3)
sns.boxplot(x='Page Views Per Visit', data=leads_df_prep)

**Outliers have been handled**
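
Capping at a percentile can also be expressed with `Series.clip`. The sketch below uses made-up values (the integers 1 to 100) to show that only the values above the 95th percentile change.

```python
import pandas as pd

# Made-up values: 1, 2, ..., 100
values = pd.Series(range(1, 101), dtype=float)

cap = values.quantile(0.95)
capped = values.clip(upper=cap)

# Number of observations that were pulled down to the cap
n_capped = int((values > cap).sum())
```

Capping keeps every record, unlike dropping the extreme rows, which matters here because all leads are retained for modelling.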

## 3. Exploratory Data Analysis

In [None]:
# Overall Conversion Rate

round((sum(leads_df_prep['Converted'])/len(leads_df_prep))*100,2)

The overall conversion rate stands low at about 38%, well short of the 80% target given in the problem statement.

### 3.1 Univariate Analysis

#### 3.1.1 Categorical Variables

In [None]:
# Lead Origin

plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
sns.countplot(y='Lead Origin', hue='Converted', data=leads_df_prep)

plt.subplot(1,2,2)
df = leads_df_prep.groupby('Lead Origin').Converted.sum()/leads_df_prep.groupby('Lead Origin')['Lead Number'].count()
df = df.reset_index()
df.columns = ['Lead Origin','Conversion Ratio']
#df = df.sort_values(by = 'Conversion Ratio', ascending=False)
sns.barplot(y='Lead Origin', x='Conversion Ratio', data=df, color='salmon')

plt.tight_layout()
plt.show()

**Inferences**
* The majority of leads arrive from Landing Page Submissions; however, these have a comparatively low conversion rate. Lead Origin is a significant variable in determining potential leads
* Although the Lead Add Form generates few leads, its conversion rate is over 90%. Hence more importance can be given to this area
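
Because `Converted` is a 0/1 flag, the conversion ratio per category (converted leads divided by all leads) is simply the group mean. A minimal sketch on made-up rows:

```python
import pandas as pd

# Made-up leads: 1 of 3 API leads and 2 of 2 Lead Add Form leads converted
toy = pd.DataFrame({
    'Lead Origin': ['API', 'API', 'Lead Add Form', 'Lead Add Form', 'API'],
    'Converted': [0, 1, 1, 1, 0],
})

# Mean of a 0/1 column equals sum / count within each group
conversion_ratio = toy.groupby('Lead Origin')['Converted'].mean()
```

The same one-liner could stand in for the repeated sum-over-count expressions in the plotting cells of this section.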

In [None]:
# Lead Source

plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
sns.countplot(y='Lead Source', hue='Converted', data=leads_df_prep, order=leads_df_prep['Lead Source'].value_counts().index)

plt.subplot(1,2,2)
df = leads_df_prep.groupby('Lead Source').Converted.sum()/leads_df_prep.groupby('Lead Source')['Lead Number'].count()
df = df.reset_index()
df.columns = ['Lead Source','Conversion Ratio']
#df = df.sort_values(by = 'Conversion Ratio', ascending=False)
sns.barplot(y='Lead Source', x='Conversion Ratio', data=df, color='salmon', order=leads_df_prep['Lead Source'].value_counts().index)

plt.tight_layout()
plt.show()

**Inference**
* Google and Direct Traffic, despite bringing the highest number of leads, show a conversion rate of only 30-40%
* Referrals and Reference Sites have a high conversion rate of 80%. Hence steps could be taken to improve traffic from these sources, for example by offering some benefits to existing students.
* Educational sites such as WeLearn and WeLearn Blog can be used to host interactive content to increase traffic to the company's website.
* Social Media doesn't tend to deliver good leads, as its conversion rate is low at ~25%.

In [None]:
# Do not Email

plt.figure(figsize=(15,5))

plt.subplot(2,2,1)
sns.countplot(y='Do Not Email', hue='Converted', data=leads_df_prep)

plt.subplot(2,2,2)
df = leads_df_prep.groupby('Do Not Email').Converted.sum()/leads_df_prep.groupby('Do Not Email')['Lead Number'].count()
df = df.reset_index()
df.columns = ['Do Not Email','Conversion Ratio']
#df = df.sort_values(by = 'Conversion Ratio', ascending=False)
sns.barplot(y='Do Not Email', x='Conversion Ratio', data=df, color='salmon')

**Inference**

Nothing significant can be derived using this variable

In [None]:
# Last Activity
plt.figure(figsize=(15,5))

plt.subplot(1,2,1)
sns.countplot(y='Last Activity', hue='Converted', data=leads_df_prep, order=leads_df_prep['Last Activity'].value_counts().index)

plt.subplot(1,2,2)
df = leads_df_prep.groupby('Last Activity').Converted.sum()/leads_df_prep.groupby('Last Activity')['Lead Number'].count()
df = df.reset_index()
df.columns = ['Last Activity','Conversion Ratio']
#df = df.sort_values(by = 'Conversion Ratio', ascending=False)
sns.barplot(y='Last Activity', x='Conversion Ratio', data=df, color='salmon', order=leads_df_prep['Last Activity'].value_counts().index)

plt.tight_layout()

**Inference**
* Individuals who have checked emails or have sent an SMS should be communicated with regularly, to sustain and possibly increase the conversion ratio
* Next segment to target could be the individuals who have been visiting the website often

In [None]:
# Specialization
plt.figure(figsize=(15,5))

plt.subplot(1,2,1)
sns.countplot(y='Specialization', hue='Converted', data=leads_df_prep, order=leads_df_prep['Specialization'].value_counts().index)

plt.subplot(1,2,2)
df = leads_df_prep.groupby('Specialization').Converted.sum()/leads_df_prep.groupby('Specialization')['Lead Number'].count()
df = df.reset_index()
df.columns = ['Specialization','Conversion Ratio']
#df = df.sort_values(by = 'Conversion Ratio', ascending=False)
sns.barplot(y='Specialization', x='Conversion Ratio', data=df, color='salmon', order=leads_df_prep['Specialization'].value_counts().index)

plt.tight_layout()

**Inference**
* Cross Industry Management (such as: Healthcare Mgmt, Finance Mgmt, HR Mgmt, IT Mgmt, etc.) show good leads and conversion ratio. This is the segment to be sustained
* Efforts can be made to increase traffic on Business Administration and Banking, Investment and Insurance sector.

In [None]:
# What is your current occupation
plt.figure(figsize=(15,5))

plt.subplot(1,2,1)
sns.countplot(y='What is your current occupation', hue='Converted', data=leads_df_prep, order=leads_df_prep['What is your current occupation'].value_counts().index)

plt.subplot(1,2,2)
df = leads_df_prep.groupby('What is your current occupation').Converted.sum()/leads_df_prep.groupby('What is your current occupation')['Lead Number'].count()
df = df.reset_index()
df.columns = ['What is your current occupation','Conversion Ratio']
#df = df.sort_values(by = 'Conversion Ratio', ascending=False)
sns.barplot(y='What is your current occupation', x='Conversion Ratio', data=df, color='salmon', order=leads_df_prep['What is your current occupation'].value_counts().index)

plt.tight_layout()

**Inferences**
* It is worth determining why the Unemployed category has a low conversion rate, despite these individuals being in need of a job.
* Working Professionals have a higher conversion rate, so steps can be taken to increase leads coming from this segment.
* Other categories do not require much focus because their overall traffic is very low; the focus should instead be on the segments above to generate more leads

In [None]:
# City
plt.figure(figsize=(15,5))

plt.subplot(1,2,1)
sns.countplot(y='City', hue='Converted', data=leads_df_prep, order=leads_df_prep['City'].value_counts().index)

plt.subplot(1,2,2)
df = leads_df_prep.groupby('City').Converted.sum()/leads_df_prep.groupby('City')['Lead Number'].count()
df = df.reset_index()
df.columns = ['City','Conversion Ratio']
#df = df.sort_values(by = 'Conversion Ratio', ascending=False)
sns.barplot(y='City', x='Conversion Ratio', data=df, color='salmon', order=leads_df_prep['City'].value_counts().index)

plt.tight_layout()

**Inference**
* Other Cities in Maharashtra and outside Maharashtra have a good conversion but low traffic. Good marketing strategies in areas other than Mumbai can be applied to generate more leads.

#### 3.1.2 Numerical Variables

In [None]:
plt.figure(figsize=(14,5))
plt.subplot(1,3,1)
sns.boxplot(y='TotalVisits', x='Converted', data=leads_df_prep)

plt.subplot(1,3,2)
sns.boxplot(y='Total Time Spent on Website', x='Converted', data=leads_df_prep)

plt.subplot(1,3,3)
sns.boxplot(y='Page Views Per Visit', x='Converted', data=leads_df_prep)

plt.tight_layout()

**Inference**
* The time an individual spends on the website has a clear positive association with the lead getting converted.
* Website content can be made more intuitive so that aspiring students spend more time on the website and browse through content that is relevant to them.

### 3.2 Correlation Analysis

In [None]:
sns.heatmap(leads_df_prep.select_dtypes(include='number').corr(), annot=True)

**Inference**
* As determined earlier, Total Time Spent on the website has a good correlation with Converted, considering other variables.
* Total Visits and Page Views Per Visit are highly correlated with each other. We can drop one of them, since keeping both would introduce multicollinearity in the model.
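
Highly correlated pairs can also be listed programmatically instead of being read off the heatmap. The sketch below uses made-up numbers, and the 0.7 threshold is an assumption for illustration.

```python
import pandas as pd

# Made-up numeric features: the first two move together, the third does not
toy = pd.DataFrame({
    'TotalVisits': [1, 2, 3, 4, 5],
    'Page Views Per Visit': [2, 4, 6, 8, 11],
    'Total Time Spent on Website': [5, 1, 4, 2, 3],
})

abs_corr = toy.corr().abs()
columns = list(abs_corr.columns)

# Walk the upper triangle so each pair is reported once
correlated_pairs = [
    (first, second)
    for i, first in enumerate(columns)
    for second in columns[i + 1:]
    if abs_corr.loc[first, second] > 0.7
]
```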

## 4. Data Preparation

### 4.1 Converting Binary Variables with Yes/No to 1/0

In [None]:
var_list = ['Do Not Email', 'A free copy of Mastering The Interview']

In [None]:
# Defining the map function
def binary_map(x):
    return x.map({'Yes': 1, "No": 0})

# Capitalize all binary variables
leads_df_prep['Do Not Email'] = leads_df_prep['Do Not Email'].str.capitalize()
leads_df_prep['A free copy of Mastering The Interview'] = leads_df_prep['A free copy of Mastering The Interview'].str.capitalize()

# applying the function to the variables
leads_df_prep[var_list] = leads_df_prep[var_list].apply(binary_map)

In [None]:
leads_df_prep[var_list].head()

### 4.2 Creating Dummy Variables

In [None]:
dummies_list = ['Lead Origin','Lead Source','Last Activity','Specialization','What is your current occupation','City']

In [None]:
# creating dummies
dummy_set = pd.get_dummies(leads_df_prep[dummies_list], drop_first=True, dtype=int)

In [None]:
# concatenating with dataframe

leads_df_prep = pd.concat([leads_df_prep,dummy_set], axis=1)
leads_df_prep.head()

In [None]:
# dropping all categorical variables, since we have already created dummies 
leads_df_prep = leads_df_prep.drop(dummies_list, axis=1)

In [None]:
leads_df_prep.head()
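
To see what `drop_first=True` does, the toy example below encodes a three-level column (made-up rows). The first level in sorted order becomes the implicit baseline, represented by zeros in every dummy column.

```python
import pandas as pd

# Made-up cities with three distinct levels
toy = pd.DataFrame({'City': ['Mumbai', 'Other Cities', 'Tier II Cities', 'Mumbai']})

# dtype=int keeps the dummies numeric (0/1) rather than boolean
dummies = pd.get_dummies(toy[['City']], drop_first=True, dtype=int)

dummy_columns = list(dummies.columns)
baseline_row = dummies.iloc[0].tolist()  # a "Mumbai" row
```

Dropping one level avoids perfectly collinear dummy columns, which matters for the p-values and VIFs inspected during modelling.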

### 4.3 Train-Test Split 

In [None]:
# Putting feature variable to X
X = leads_df_prep.drop(['Converted','Prospect ID','Lead Number'], axis=1)

X.head()

In [None]:
y = leads_df_prep['Converted']

y.head()

In [None]:
# Splitting the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=100)

### 4.4 Feature Scaling

In [None]:
# Standardizing following variables (others are already in Binary form)
scaler = StandardScaler()

X_train[['TotalVisits','Total Time Spent on Website','Page Views Per Visit']] = scaler.fit_transform(X_train[['TotalVisits','Total Time Spent on Website','Page Views Per Visit']])

X_train.head()
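
Standardisation statistics are usually learned on the training split only and then reused for the test split (with scikit-learn this corresponds to calling `scaler.transform` on the test data rather than `fit_transform`). The NumPy sketch below illustrates the idea on made-up numbers; it is not the notebook's own test-set code.

```python
import numpy as np

# Made-up single-feature splits
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0], [5.0]])

# Statistics come from the training data only
# (population standard deviation, ddof=0, as StandardScaler uses)
train_mean = train.mean(axis=0)
train_std = train.std(axis=0)

train_scaled = (train - train_mean) / train_std
test_scaled = (test - train_mean) / train_std
```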

### 4.5 Feature Selection using RFE

In [None]:
logreg = LogisticRegression()

In [None]:
rfe = RFE(logreg, n_features_to_select=25)
rfe = rfe.fit(X_train, y_train)

In [None]:
list(zip(X_train.columns, rfe.support_, rfe.ranking_))

In [None]:
# Selected Columns
col = X_train.columns[rfe.support_]
len(col)

In [None]:
# excluded columns
X_train.columns[~rfe.support_]
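
How `support_` and `ranking_` relate can be seen on a small synthetic problem. The data below is generated with `make_classification` and has nothing to do with the leads dataset; the feature counts are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 8 features, 3 of them informative
X_toy, y_toy = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X_toy, y_toy)

# support_ is a boolean mask of kept features; kept features have ranking_ 1
n_selected = int(selector.support_.sum())
kept_ranks = selector.ranking_[selector.support_]
```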

## 5. Modelling

In [None]:
# fitting the model using RFE variables

X_train_sm = sm.add_constant(X_train[col])
logm2 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm2.fit()
res.summary()

### 5.1 Manual Feature Selection

In [None]:
X_train['What is your current occupation_Housewife'].value_counts(normalize=True)

In [None]:
X_train['Lead Source_Others'].value_counts(normalize=True)

In [None]:
X_train['Lead Origin_Lead Import'].value_counts(normalize=True)

In [None]:
X_train['Specialization_International Business'].value_counts(normalize=True)

Dropping these variables since they are highly skewed towards one category and show a high P-value

In [None]:
col = col.drop(['What is your current occupation_Housewife', 'Lead Source_Others','Lead Origin_Lead Import',
'Specialization_International Business'])
len(col)

#### Re-assess the model with remaining variables - Model 2

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logm2 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm2.fit()
res.summary()

#### Checking VIF

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif
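
The VIF table is rebuilt several times in this section with identical code. A small helper keeps the cells shorter; this is a sketch under the assumption that `variance_inflation_factor` from statsmodels is the function already imported in this notebook. `vif_table` and the synthetic `demo` frame are hypothetical names.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(frame):
    """Return the features of `frame` with their VIFs, highest first."""
    values = frame.values.astype(float)
    vifs = [variance_inflation_factor(values, i) for i in range(values.shape[1])]
    table = pd.DataFrame({"Features": frame.columns, "VIF": np.round(vifs, 2)})
    return table.sort_values(by="VIF", ascending=False).reset_index(drop=True)

# Synthetic data (assumption): x2 is almost a copy of x1, x3 is independent
rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
demo = pd.DataFrame({"x1": x1,
                     "x2": x1 + 0.1 * rng.normal(size=300),
                     "x3": rng.normal(size=300)})

demo_vif = vif_table(demo)
print(demo_vif)
```

With this helper, each "re-checking VIF" cell would reduce to a single call such as `vif_table(X_train[col])`.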

Dropping the column with the highest VIF

In [None]:
col = col.drop(['What is your current occupation_Unemployed'])
col

In [None]:
# re-checking VIF
vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

#### Re-assessing model with remaining variables - Model 3

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logm2 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm2.fit()
res.summary()

**Inferences**
1. "Last Activity_Olark Chat Conversation" and "What is your current occupation_Student" both show a high p-value.
2. Of the two, the univariate analysis showed a conversion rate of about 10% for Olark Chat, against close to 40% for students.
3. We will therefore drop "Last Activity_Olark Chat Conversation" first.

In [None]:
# dropping Column with high p-value
col = col.drop(['Last Activity_Olark Chat Conversation'])
col

In [None]:
# re-checking VIF
vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

#### Re-assessing model with remaining variables - Model 4

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logm2 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm2.fit()
res.summary()

In [None]:
# dropping column with high p-value
col = col.drop(['What is your current occupation_Student'])
len(col)

In [None]:
# re-checking VIF
vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

#### Re-assessing model with remaining variables - Model 6

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logm2 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm2.fit()
res.summary()

In [None]:
# Dropping "_OTHER" Categories and "A free copy of Mastering The Interview"
col = col.drop(['What is your current occupation_Other','Specialization_Other Specializations','Last Activity_Others'
               ,'A free copy of Mastering The Interview'])
col

### 5.2 Final model for Prediction - Model 7

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logm2 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm2.fit()
res.summary()

In [None]:
# re-checking VIF
vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

All remaining features now have a VIF under 5 and p-values close to zero, so we will continue with this model.

### 5.3 Predicting Probabilities

In [None]:
y_train_pred = res.predict(X_train_sm).values.reshape(-1)

In [None]:
# verifying first 10 probabilities in the array
y_train_pred[:10]

**Creating a dataframe with the actual Converted leads and the predicted probabilities**

In [None]:
y_train_pred_final = pd.DataFrame({'Converted':y_train.values, 'Converting_Probabilities':y_train_pred})
y_train_pred_final['Lead ID'] = y_train.index
y_train_pred_final.head()

### 5.4 Finding Optimal Cutoff Point

In [None]:
# Creating columns with different probability cutoffs 
numbers = [float(x)/10 for x in range(10)]
for i in numbers:
    y_train_pred_final[i]= y_train_pred_final.Converting_Probabilities.map(lambda x: 1 if x > i else 0)
y_train_pred_final.head()

In [None]:
# Now let's calculate accuracy sensitivity and specificity for various probability cutoffs.
cutoff_df = pd.DataFrame( columns = ['prob','accuracy','sensi','speci'])
from sklearn import metrics  # the loop below calls metrics.confusion_matrix

# TP = confusion[1,1] # true positive 
# TN = confusion[0,0] # true negatives
# FP = confusion[0,1] # false positives
# FN = confusion[1,0] # false negatives

num = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
for i in num:
    cm1 = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final[i] )
    total1=sum(sum(cm1))
    accuracy = (cm1[0,0]+cm1[1,1])/total1
    
    speci = cm1[0,0]/(cm1[0,0]+cm1[0,1])
    sensi = cm1[1,1]/(cm1[1,0]+cm1[1,1])
    cutoff_df.loc[i] =[i ,accuracy,sensi,speci]
print(cutoff_df)

In [None]:
# Let's plot accuracy sensitivity and specificity for various probabilities.
cutoff_df.plot.line(x='prob', y=['accuracy','sensi','speci'])
plt.show()

**0.35 looks to be an optimal cutoff value, because we have a balanced Sensitivity and Specificity**
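
Reading the crossover off the plot can be cross-checked numerically. The sketch below scans a finer grid of cut-offs and returns the one where sensitivity and specificity are closest; it assumes both classes are present in the data, and `balanced_cutoff` and the toy arrays are hypothetical, not taken from this notebook.

```python
import numpy as np

def balanced_cutoff(actual, probs, grid=None):
    """Return the cut-off at which |sensitivity - specificity| is smallest."""
    actual = np.asarray(actual).astype(bool)
    probs = np.asarray(probs, dtype=float)
    if grid is None:
        grid = np.round(np.arange(0.05, 0.96, 0.01), 2)
    best_cut, best_gap = None, np.inf
    for cut in grid:
        predicted = probs > cut
        sensi = (predicted & actual).sum() / actual.sum()
        speci = (~predicted & ~actual).sum() / (~actual).sum()
        gap = abs(sensi - speci)
        if gap < best_gap:  # strict comparison keeps the lowest such cut-off
            best_cut, best_gap = float(cut), gap
    return best_cut

# Toy example (made-up values): four non-converted and four converted leads
toy_actual = [0, 0, 0, 0, 1, 1, 1, 1]
toy_probs = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]
toy_cut = balanced_cutoff(toy_actual, toy_probs)
print(toy_cut)
```

On the notebook's data the call would be `balanced_cutoff(y_train_pred_final.Converted, y_train_pred_final.Converting_Probabilities)`.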

### 5.5 Creating Final prediction using the optimal cut-off

In [None]:
# Predicting Conversion (0/1) based on the predicted probabilities, using 0.35 as the cut-off.

y_train_pred_final['Conversion_predicted'] = y_train_pred_final.Converting_Probabilities.map(lambda x: 1 if x > 0.35 else 0)

y_train_pred_final.head()

### 5.6 Deriving Evaluation Metrics

**Confusion Matrix**

In [None]:
confusion = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.Conversion_predicted )
confusion

In [None]:
TP = confusion[1,1] # true positive 
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives

**Overall Accuracy**

In [None]:
round(((TP+TN)/(TP+TN+FP+FN))*100,1)

We have received an overall Accuracy of 80.8%


**Sensitivity** or **True Positive Rate**

In [None]:
round((TP / float(TP+FN))*100,1)

Sensitivity is 79.4%

**Specificity** or **True Negative Rate**

In [None]:
round((TN / float(TN+FP))*100,1)

Specificity is 81.6%

**PRECISION**

In [None]:
round((TP / float(TP+FP))*100,1)

**RECALL or Sensitivity**

In [None]:
round((TP / float(TP+FN))*100,1)

**Inferences**

* We see a good overall accuracy.

* However, considering the business scenario, the model should reliably separate potential leads from non-potential ones, so that the sales team does not spend time on leads that are unlikely to convert. Specificity (the True Negative Rate, i.e. 1 - False Positive Rate) is therefore treated here as the key metric for the model's predictive power.

* Specificity is good at 81.6%
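
The metrics in this section are all simple ratios of the four confusion-matrix counts, and specificity and the False Positive Rate always sum to 100%. A minimal sketch with made-up counts (not the notebook's results); `classification_summary` is a hypothetical helper.

```python
def classification_summary(tn, fp, fn, tp):
    """Return the usual metrics, in percent, from confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": round(100 * (tp + tn) / total, 1),
        "sensitivity": round(100 * tp / (tp + fn), 1),   # recall / TPR
        "specificity": round(100 * tn / (tn + fp), 1),   # TNR
        "false_positive_rate": round(100 * fp / (tn + fp), 1),
        "precision": round(100 * tp / (tp + fp), 1),
    }

# Illustrative counts only
summary = classification_summary(tn=80, fp=20, fn=10, tp=40)
print(summary)
```

With the notebook's variables this would be called as `classification_summary(TN, FP, FN, TP)`.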

### 5.7 ROC Curve

In [None]:
def draw_roc( actual, probs ):
    fpr, tpr, thresholds = metrics.roc_curve( actual, probs,
                                              drop_intermediate = False )
    auc_score = metrics.roc_auc_score( actual, probs )
    plt.figure(figsize=(5, 5))
    plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show()

    return None

In [None]:
fpr, tpr, thresholds = metrics.roc_curve(y_train_pred_final.Converted, 
                                         y_train_pred_final.Converting_Probabilities, drop_intermediate = False )

In [None]:
draw_roc(y_train_pred_final.Converted, y_train_pred_final.Converting_Probabilities)

The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test.

### 5.8 Precision and Recall TradeOff

In [None]:
from sklearn.metrics import precision_recall_curve

In [None]:
p, r, thresholds = precision_recall_curve(y_train_pred_final.Converted, y_train_pred_final.Converting_Probabilities)

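# plot precision (green) and recall (red) against the probability threshold
plt.plot(thresholds, p[:-1], "g-", label="Precision")
plt.plot(thresholds, r[:-1], "r-", label="Recall")
plt.xlabel("Probability threshold")
plt.legend()
plt.show()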
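
The threshold at which the two curves meet can also be located directly from the arrays returned by `precision_recall_curve`. This is a sketch on toy data; `precision_recall_crossover` is a hypothetical helper and the values are made up.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def precision_recall_crossover(actual, probs):
    """Return the threshold where precision and recall are closest."""
    p, r, thresholds = precision_recall_curve(actual, probs)
    # p and r have one more element than thresholds, so drop the last point
    gap = np.abs(p[:-1] - r[:-1])
    return float(thresholds[np.argmin(gap)])

# Toy example (made-up values)
toy_actual = [0, 0, 0, 0, 1, 1, 1, 1]
toy_probs = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]
crossover = precision_recall_crossover(toy_actual, toy_probs)
print(crossover)
```

Note that `precision_recall_curve` treats a score equal to the threshold as a positive prediction, whereas the cut-off cells in this notebook use a strict `>` comparison.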

### 5.9 Making Predictions on Test Set

**Transforming variables using Standard Scaler**

In [None]:
X_test[['TotalVisits','Total Time Spent on Website','Page Views Per Visit']] = scaler.transform(X_test[['TotalVisits','Total Time Spent on Website','Page Views Per Visit']])

**Restricting test set to the required variables**

In [None]:
X_test = X_test[col]
X_test.head()

**Adding Constant and Predicting**

In [None]:
# add constant
X_test_sm = sm.add_constant(X_test)

# making predictions
y_test_pred = res.predict(X_test_sm)

In [None]:
# creating dataframe with actual and predicted values from test set
y_test_pred_final = pd.DataFrame({'Converted':y_test.values, 'Converting_Probabilities':y_test_pred})
y_test_pred_final['Lead ID'] = y_test.index
y_test_pred_final.head()

In [None]:
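# NOTE: this cell applies a 0.40 cut-off, whereas 0.35 was chosen on the train set in section 5.4;
# the test metrics reported below correspond to 0.40
y_test_pred_final['Conversion_predicted'] = y_test_pred_final.Converting_Probabilities.map(lambda x: 1 if x > 0.40 else 0)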
y_test_pred_final.head()

**Determining Metrics**

**Confusion Matrix**

In [None]:
confusion_test = metrics.confusion_matrix(y_test_pred_final.Converted, y_test_pred_final.Conversion_predicted )
confusion_test

In [None]:
TP_test = confusion_test[1,1] # true positive 
TN_test = confusion_test[0,0] # true negatives
FP_test = confusion_test[0,1] # false positives
FN_test = confusion_test[1,0] # false negatives

**Overall Accuracy**

In [None]:
round(((TP_test+TN_test)/(TP_test+TN_test+FP_test+FN_test))*100,1)

We have received an overall Accuracy of 80.9%


**Sensitivity**

In [None]:
round((TP_test / float(TP_test+FN_test))*100,1)

Sensitivity is 74.7%

**Specificity**

In [None]:
round((TN_test / float(TN_test+FP_test))*100,1)

Specificity is 85%

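**Inference**: Specificity rose from 81.6% on the train set to 85% on the test set, while sensitivity fell from 79.4% to 74.7%. The test predictions use a 0.40 cut-off rather than the 0.35 applied to the train set, and a higher cut-off by itself trades sensitivity for specificity, so part of this shift comes from the threshold. Overall, the model identifies NON-potential leads well on unseen data.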
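
For a like-for-like comparison between train and test, it helps to evaluate the same scores at more than one cut-off. The sketch below shows how moving from 0.35 to 0.40 shifts the balance on toy data; `sens_spec_at` and the values are hypothetical and do not reproduce the notebook's results.

```python
import numpy as np

def sens_spec_at(actual, probs, cutoff):
    """Return (sensitivity, specificity) for predictions made with probs > cutoff."""
    actual = np.asarray(actual).astype(bool)
    predicted = np.asarray(probs, dtype=float) > cutoff
    sensitivity = (predicted & actual).sum() / actual.sum()
    specificity = (~predicted & ~actual).sum() / (~actual).sum()
    return float(sensitivity), float(specificity)

# Toy example (made-up values): two scores sit between the two cut-offs
toy_actual = [0, 0, 0, 0, 1, 1, 1, 1]
toy_probs = [0.10, 0.20, 0.37, 0.60, 0.38, 0.70, 0.80, 0.90]

at_035 = sens_spec_at(toy_actual, toy_probs, 0.35)
at_040 = sens_spec_at(toy_actual, toy_probs, 0.40)
print(at_035, at_040)
```

Running the helper on both `y_train_pred_final` and `y_test_pred_final` with the same cut-off would separate the effect of the threshold from the effect of unseen data.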

## 6. Creating Lead Score variable

**Train Set**

In [None]:
y_train_pred_final.head()

In [None]:
y_train_pred_final['Lead Score'] = y_train_pred_final['Converting_Probabilities']*100

y_train_pred_final['Lead Score'] = y_train_pred_final['Lead Score'].astype(int)
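
Note that `astype(int)` truncates towards zero instead of rounding, so a probability of 0.357 becomes a score of 35 and no lead can reach 100 unless its probability is exactly 1. The sketch below contrasts the two behaviours on made-up probabilities; `probs_demo` is an illustrative name.

```python
import pandas as pd

# Made-up probabilities for illustration
probs_demo = pd.Series([0.357, 0.999, 0.004])

# Behaviour used in this notebook: drop the fractional part
truncated = (probs_demo * 100).astype(int)

# Alternative: round to the nearest integer before casting
rounded = (probs_demo * 100).round().astype(int)

print(pd.DataFrame({"probability": probs_demo,
                    "truncated": truncated,
                    "rounded": rounded}))
```

Either convention works for ranking leads, as long as the same one is applied to the train and test sets.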

In [None]:
# keeping only relevant variables
y_train_pred_final = y_train_pred_final[['Converted','Converting_Probabilities','Lead ID','Conversion_predicted','Lead Score']]

In [None]:
y_train_pred_final.head()

**Test Set**

In [None]:
y_test_pred_final.head()

In [None]:
y_test_pred_final['Lead Score'] = y_test_pred_final['Converting_Probabilities']*100

y_test_pred_final['Lead Score'] = y_test_pred_final['Lead Score'].astype(int)

In [None]:
y_test_pred_final.head()

**Append Training and Test data sets**

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
Lead_Score_df = pd.concat([y_train_pred_final, y_test_pred_final])
Lead_Score_df.head()

In [None]:
len(Lead_Score_df)

In [None]:
# Ensuring the Lead IDs are unique for each lead in the final dataframe
len(Lead_Score_df['Lead ID'].unique().tolist())

**Merge Lead Score Column to the Original Data frame**

In [None]:
# Making Index as the column identifier
leads_df = leads_df.reset_index()
leads_df.head()

In [None]:
# renaming "index" to "Lead ID"
leads_df = leads_df.rename(columns={"index": "Lead ID"})
leads_df.head()

**Merging Predicted and Original Dataframe to assign "Lead Score" to all the leads.**

In [None]:
leads_df_scored = pd.merge(leads_df,Lead_Score_df[['Lead ID','Lead Score']], on='Lead ID', how='inner')
leads_df_scored.info()
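
An inner merge silently drops any lead without a score, so it can be worth checking that nothing was lost. This is a minimal sketch on toy frames, using the `validate` and `indicator` options of `pd.merge`; the frame names and values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for leads_df and Lead_Score_df (values are made up)
leads_demo = pd.DataFrame({"Lead ID": [0, 1, 2, 3],
                           "Lead Source": ["Google", "Referral", "Google", "Direct"]})
scores_demo = pd.DataFrame({"Lead ID": [2, 0, 3, 1],
                            "Lead Score": [12, 87, 45, 63]})

# validate raises an error if a Lead ID is duplicated on either side;
# indicator adds a "_merge" column showing where each row came from
merged_demo = pd.merge(leads_demo, scores_demo, on="Lead ID", how="left",
                       validate="one_to_one", indicator=True)

# Number of leads that did not receive a score
unmatched = int((merged_demo["_merge"] != "both").sum())
print(len(merged_demo), unmatched)
```

If `unmatched` is zero and the row count equals the original frame's, the left merge and the inner merge give the same result.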

In [None]:
leads_df_scored = leads_df_scored.drop('Lead ID', axis=1)

In [None]:
leads_df_scored.head()

## 7. Conclusions

#### In terms of Lead Generation

The company should focus on increasing **Total Visits** to the platform. This can be achieved in several ways:
* People spend more "Time on the Website". Making the website more intuitive and relevant to each individual's interests would attract more traffic to the website.
* Google tends to generate more leads. Launching more **Lead Add Forms** on social media sites and search engines would increase the overall lead count.
* Content or ads can be posted on **Educational Websites (such as WeLearn, Welingak)** to increase the number of leads, as this category shows a high Conversion Rate.
* Referrals and Reference Sites have a high Conversion Rate of 80%. Steps could therefore be taken to increase traffic from these sources, for example by offering benefits to existing students.

#### Based on User's Last Activity

* Individuals with the following recent activities can turn out to be potential students:
    1. SMS Sent
    2. Email Opened
    3. Email link clicked
    4. Page Visited on website

#### Based on User's Profile

* **Working People** have shown deep interest and a high conversion rate. However, since their overall lead count is low, the company can focus on reaching out to people in the workforce and educating them about the products.