# Problem Statement:

    An education company named X Education sells online courses to industry professionals. On any given day, many professionals who are interested in the courses land on their website and browse for courses. 

    The company markets its courses on several websites and search engines like Google. Once these people land on the website, they might browse the courses, fill out a form for a course, or watch some videos. When these people fill out a form providing their email address or phone number, they are classified as a lead. Moreover, the company also gets leads through past referrals. Once these leads are acquired, employees from the sales team start making calls, writing emails, etc. Through this process, some of the leads get converted while most do not. The typical lead conversion rate at X Education is around 30%. 
    
    Now, although X Education gets a lot of leads, its lead conversion rate is very poor. For example, if they acquire 100 leads in a day, only about 30 of them are converted. To make this process more efficient, the company wishes to identify the most promising leads, also known as ‘Hot Leads’. If they successfully identify this set of leads, the lead conversion rate should go up, as the sales team will focus on communicating with the promising leads rather than making calls to everyone.
    
    In a typical lead funnel, a lot of leads are generated at the initial stage (top), but only a few of them come out as paying customers at the bottom. In the middle stage, you need to nurture the potential leads well (i.e. educating the leads about the product, constantly communicating, etc.) in order to get a higher lead conversion.
    
    X Education has appointed you to help them select the most promising leads, i.e. the leads that are most likely to convert into paying customers. The company requires you to build a model wherein you need to assign a lead score to each of the leads such that the customers with higher lead score have a higher conversion chance and the customers with lower lead score have a lower conversion chance. The CEO, in particular, has given a ballpark of the target lead conversion rate to be around 80%.
    
    

In [1]:
# Import the warnings module and the libraries required for the analysis
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


In [2]:
leads = pd.read_csv("../Lead Scoring Assignment/Leads.csv") # Read the data 
leads.head()

Unnamed: 0,Prospect ID,Lead Number,Lead Origin,Lead Source,Do Not Email,Do Not Call,Converted,TotalVisits,Total Time Spent on Website,Page Views Per Visit,...,Get updates on DM Content,Lead Profile,City,Asymmetrique Activity Index,Asymmetrique Profile Index,Asymmetrique Activity Score,Asymmetrique Profile Score,I agree to pay the amount through cheque,A free copy of Mastering The Interview,Last Notable Activity
0,7927b2df-8bba-4d29-b9a2-b6e0beafe620,660737,API,Olark Chat,No,No,0,0.0,0,0.0,...,No,Select,Select,02.Medium,02.Medium,15.0,15.0,No,No,Modified
1,2a272436-5132-4136-86fa-dcc88c88f482,660728,API,Organic Search,No,No,0,5.0,674,2.5,...,No,Select,Select,02.Medium,02.Medium,15.0,15.0,No,No,Email Opened
2,8cc8c611-a219-4f35-ad23-fdfd2656bd8a,660727,Landing Page Submission,Direct Traffic,No,No,1,2.0,1532,2.0,...,No,Potential Lead,Mumbai,02.Medium,01.High,14.0,20.0,No,Yes,Email Opened
3,0cc2df48-7cf4-4e39-9de9-19797f9b38cc,660719,Landing Page Submission,Direct Traffic,No,No,0,1.0,305,1.0,...,No,Select,Mumbai,02.Medium,01.High,13.0,17.0,No,No,Modified
4,3256f628-e534-4826-9d63-4a8b88782852,660681,Landing Page Submission,Google,No,No,1,2.0,1428,1.0,...,No,Select,Mumbai,02.Medium,01.High,15.0,18.0,No,No,Modified


In [3]:
leads.shape #shape of dataframe

(9240, 37)

In [4]:
leads.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9240 entries, 0 to 9239
Data columns (total 37 columns):
 #   Column                                         Non-Null Count  Dtype  
---  ------                                         --------------  -----  
 0   Prospect ID                                    9240 non-null   object 
 1   Lead Number                                    9240 non-null   int64  
 2   Lead Origin                                    9240 non-null   object 
 3   Lead Source                                    9204 non-null   object 
 4   Do Not Email                                   9240 non-null   object 
 5   Do Not Call                                    9240 non-null   object 
 6   Converted                                      9240 non-null   int64  
 7   TotalVisits                                    9103 non-null   float64
 8   Total Time Spent on Website                    9240 non-null   int64  
 9   Page Views Per Visit                           9103 

In [5]:
leads.isnull().sum() #null values each column

Prospect ID                                         0
Lead Number                                         0
Lead Origin                                         0
Lead Source                                        36
Do Not Email                                        0
Do Not Call                                         0
Converted                                           0
TotalVisits                                       137
Total Time Spent on Website                         0
Page Views Per Visit                              137
Last Activity                                     103
Country                                          2461
Specialization                                   1438
How did you hear about X Education               2207
What is your current occupation                  2690
What matters most to you in choosing a course    2709
Search                                              0
Magazine                                            0
Newspaper Article           

In [6]:
round(100*(leads.isnull().sum())/len(leads.index), 2) # Null Value Percentages

Prospect ID                                       0.00
Lead Number                                       0.00
Lead Origin                                       0.00
Lead Source                                       0.39
Do Not Email                                      0.00
Do Not Call                                       0.00
Converted                                         0.00
TotalVisits                                       1.48
Total Time Spent on Website                       0.00
Page Views Per Visit                              1.48
Last Activity                                     1.11
Country                                          26.63
Specialization                                   15.56
How did you hear about X Education               23.89
What is your current occupation                  29.11
What matters most to you in choosing a course    29.32
Search                                            0.00
Magazine                                          0.00
Newspaper 

## See the distinct values of each variable

In [7]:
# Distinct values for 'Lead Origin'
leads['Lead Origin'].value_counts()   

Lead Origin
Landing Page Submission    4886
API                        3580
Lead Add Form               718
Lead Import                  55
Quick Add Form                1
Name: count, dtype: int64

In [8]:
# Distinct values for 'Lead Source'

leads['Lead Source'].value_counts()

Lead Source
Google               2868
Direct Traffic       2543
Olark Chat           1755
Organic Search       1154
Reference             534
Welingak Website      142
Referral Sites        125
Facebook               55
bing                    6
google                  5
Click2call              4
Press_Release           2
Social Media            2
Live Chat               2
youtubechannel          1
testone                 1
Pay per Click Ads       1
welearnblog_Home        1
WeLearn                 1
blog                    1
NC_EDM                  1
Name: count, dtype: int64

In [9]:
# Distinct values for 'Do Not Email'

leads['Do Not Email'].value_counts()    # The variable 'Do Not Email' is skewed.

Do Not Email
No     8506
Yes     734
Name: count, dtype: int64

In [10]:
# Distinct values for 'Last Activity'

leads['Last Activity'].value_counts() 

Last Activity
Email Opened                    3437
SMS Sent                        2745
Olark Chat Conversation          973
Page Visited on Website          640
Converted to Lead                428
Email Bounced                    326
Email Link Clicked               267
Form Submitted on Website        116
Unreachable                       93
Unsubscribed                      61
Had a Phone Conversation          30
Approached upfront                 9
View in browser link Clicked       6
Email Received                     2
Email Marked Spam                  2
Visited Booth in Tradeshow         1
Resubscribed to emails             1
Name: count, dtype: int64

In [11]:
# Distinct values for 'Country'

leads.Country.value_counts()  #  Found out most of the leads are from 'India'

Country
India                   6492
United States             69
United Arab Emirates      53
Singapore                 24
Saudi Arabia              21
United Kingdom            15
Australia                 13
Qatar                     10
Hong Kong                  7
Bahrain                    7
Oman                       6
France                     6
unknown                    5
South Africa               4
Nigeria                    4
Germany                    4
Kuwait                     4
Canada                     4
Sweden                     3
China                      2
Asia/Pacific Region        2
Uganda                     2
Bangladesh                 2
Italy                      2
Belgium                    2
Netherlands                2
Ghana                      2
Philippines                2
Russia                     1
Switzerland                1
Vietnam                    1
Denmark                    1
Tanzania                   1
Liberia                    1
Malays

In [12]:
# Distinct values for 'Specialization'

leads['Specialization'].value_counts()

Specialization
Select                               1942
Finance Management                    976
Human Resource Management             848
Marketing Management                  838
Operations Management                 503
Business Administration               403
IT Projects Management                366
Supply Chain Management               349
Banking, Investment And Insurance     338
Travel and Tourism                    203
Media and Advertising                 203
International Business                178
Healthcare Management                 159
Hospitality Management                114
E-COMMERCE                            112
Retail Management                     100
Rural and Agribusiness                 73
E-Business                             57
Services Excellence                    40
Name: count, dtype: int64

In [13]:
# Distinct values for 'How did you hear about X Education'

leads['How did you hear about X Education'].value_counts() # Most entries are the placeholder 'Select' (no answer given), so this column needs treatment.

How did you hear about X Education
Select                   5043
Online Search             808
Word Of Mouth             348
Student of SomeSchool     310
Other                     186
Multiple Sources          152
Advertisements             70
Social Media               67
Email                      26
SMS                        23
Name: count, dtype: int64
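The 'Select' entries above appear to be the form's default placeholder rather than a real answer, so they behave like missing values. A minimal sketch of counting them as such, on a small made-up frame (`toy` and its values are illustrative, not taken from `Leads.csv`):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `leads`; the values are illustrative only.
toy = pd.DataFrame({
    "Specialization": ["Select", "Finance Management", None, "Select"],
    "City": ["Mumbai", "Select", "Mumbai", None],
})

# Treat the placeholder 'Select' as missing, then measure missingness per column.
toy_clean = toy.replace("Select", np.nan)
missing_pct = (toy_clean.isnull().mean() * 100).round(2)
print(missing_pct)
```

Applied to the real data, this would give the combined share of nulls and 'Select' per column in one step, instead of adding the two counts by hand.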

In [14]:
# Distinct values for 'What is your current occupation'

leads['What is your current occupation'].value_counts()

What is your current occupation
Unemployed              5600
Working Professional     706
Student                  210
Other                     16
Housewife                 10
Businessman                8
Name: count, dtype: int64

In [15]:
leads['Do Not Call'].value_counts() # Almost perfectly skewed column; we can drop it


Do Not Call
No     9238
Yes       2
Name: count, dtype: int64

In [16]:
leads['What matters most to you in choosing a course'].value_counts() # Almost perfectly skewed column; we can drop it.

What matters most to you in choosing a course
Better Career Prospects      6528
Flexibility & Convenience       2
Other                           1
Name: count, dtype: int64

In [17]:
leads.Magazine.value_counts() # 100% skewed (a single value), we can drop it

Magazine
No    9240
Name: count, dtype: int64

In [18]:
leads.Search.value_counts() # Skewed column, we can drop it

Search
No     9226
Yes      14
Name: count, dtype: int64

In [19]:
leads['Get updates on DM Content'].value_counts() # 100% skewed (a single value), we can drop it

Get updates on DM Content
No    9240
Name: count, dtype: int64

In [20]:
leads['Newspaper Article'].value_counts() # Skewed column, we can drop it

Newspaper Article
No     9238
Yes       2
Name: count, dtype: int64

In [21]:
leads['X Education Forums'].value_counts() # Skewed column, we can drop it

X Education Forums
No     9239
Yes       1
Name: count, dtype: int64

In [22]:
leads['Newspaper'].value_counts() # Skewed column, we can drop it

Newspaper
No     9239
Yes       1
Name: count, dtype: int64

In [23]:
leads['Digital Advertisement'].value_counts() # Skewed column, we can drop it

Digital Advertisement
No     9236
Yes       4
Name: count, dtype: int64

In [24]:
leads['Through Recommendations'].value_counts() # Skewed column, we can drop it

Through Recommendations
No     9233
Yes       7
Name: count, dtype: int64

In [25]:
leads['Receive More Updates About Our Courses'].value_counts() # Constant column (a single value), we can drop it

Receive More Updates About Our Courses
No    9240
Name: count, dtype: int64

In [26]:
leads['Update me on Supply Chain Content'].value_counts() # Constant column (a single value), we can drop it

Update me on Supply Chain Content
No    9240
Name: count, dtype: int64

In [27]:
leads['Get updates on DM Content'].value_counts() # Constant column (a single value), we can drop it

Get updates on DM Content
No    9240
Name: count, dtype: int64

In [28]:
leads['Lead Profile'].value_counts() # 'Select' takes up most of the rows, so some imputation is needed.

Lead Profile
Select                         4146
Potential Lead                 1613
Other Leads                     487
Student of SomeSchool           241
Lateral Student                  24
Dual Specialization Student      20
Name: count, dtype: int64

In [29]:
leads['Asymmetrique Activity Index'].value_counts()

Asymmetrique Activity Index
02.Medium    3839
01.High       821
03.Low        362
Name: count, dtype: int64

In [30]:
leads['Asymmetrique Profile Index'].value_counts()

Asymmetrique Profile Index
02.Medium    2788
01.High      2203
03.Low         31
Name: count, dtype: int64

In [31]:
leads['Asymmetrique Activity Score'].value_counts()

Asymmetrique Activity Score
14.0    1771
15.0    1293
13.0     775
16.0     467
17.0     349
12.0     196
11.0      95
10.0      57
9.0        9
18.0       5
8.0        4
7.0        1
Name: count, dtype: int64

In [32]:
leads['Asymmetrique Profile Score'].value_counts()

Asymmetrique Profile Score
15.0    1759
18.0    1071
16.0     599
17.0     579
20.0     308
19.0     245
14.0     226
13.0     204
12.0      22
11.0       9
Name: count, dtype: int64

In [33]:
leads['I agree to pay the amount through cheque'].value_counts()  # Constant column (a single value), we can drop it

I agree to pay the amount through cheque
No    9240
Name: count, dtype: int64

In [34]:
leads.columns # Let's see the columns

Index(['Prospect ID', 'Lead Number', 'Lead Origin', 'Lead Source',
       'Do Not Email', 'Do Not Call', 'Converted', 'TotalVisits',
       'Total Time Spent on Website', 'Page Views Per Visit', 'Last Activity',
       'Country', 'Specialization', 'How did you hear about X Education',
       'What is your current occupation',
       'What matters most to you in choosing a course', 'Search', 'Magazine',
       'Newspaper Article', 'X Education Forums', 'Newspaper',
       'Digital Advertisement', 'Through Recommendations',
       'Receive More Updates About Our Courses', 'Tags', 'Lead Quality',
       'Update me on Supply Chain Content', 'Get updates on DM Content',
       'Lead Profile', 'City', 'Asymmetrique Activity Index',
       'Asymmetrique Profile Index', 'Asymmetrique Activity Score',
       'Asymmetrique Profile Score',
       'I agree to pay the amount through cheque',
       'A free copy of Mastering The Interview', 'Last Notable Activity'],
      dtype='object')

## **`Inference:`**
    1. Many columns take the same value in about 99 percent of rows or more (highly skewed). Such variables carry almost no information for model building, so we drop them.

In [36]:
# Drop the columns which are skewed

leads = leads.drop(['Do Not Call', 'What matters most to you in choosing a course', 'Search', 'Magazine',
       'Newspaper Article', 'X Education Forums', 'Newspaper',
       'Digital Advertisement', 'Through Recommendations',
       'Receive More Updates About Our Courses', 'Update me on Supply Chain Content', 'Get updates on DM Content',
       'I agree to pay the amount through cheque'], axis=1)
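The list of skewed columns above was assembled by inspecting each `value_counts()` output by hand. As a rough sketch of how the same check could be automated, the helper below computes, for every column, the share of rows taken by its most frequent value. The name `dominant_share`, the toy data and the 0.95 cutoff are all assumptions made for illustration.

```python
import pandas as pd

def dominant_share(df: pd.DataFrame) -> pd.Series:
    """Share of non-null rows taken by each column's most frequent value."""
    return df.apply(lambda s: s.value_counts(normalize=True).iloc[0])

# Toy frame standing in for `leads`; the values are illustrative only.
toy = pd.DataFrame({
    "Do Not Call": ["No"] * 99 + ["Yes"],
    "Magazine": ["No"] * 100,
    "Lead Origin": ["API"] * 60 + ["Landing Page Submission"] * 40,
})

share = dominant_share(toy)
threshold = 0.95  # hypothetical cutoff for "near-constant"
near_constant = share[share >= threshold].index.tolist()
print(near_constant)
```

Note that the helper assumes every column has at least one non-null value; an all-null column would need separate handling.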

In [115]:
# Let's See the distinct values for 'Tags'

leads['Tags'].value_counts()

Tags
Will revert after reading the email                  2072
Ringing                                              1203
Interested in other courses                           513
Already a student                                     465
Closed by Horizzon                                    358
switched off                                          240
Busy                                                  186
Lost to EINS                                          175
Not doing further education                           145
Interested  in full time MBA                          117
Graduation in progress                                111
invalid number                                         83
Diploma holder (Not Eligible)                          63
wrong number given                                     47
opp hangup                                             33
number not provided                                    27
in touch with EINS                                     12
Lost to O

In [None]:
# Let's See the distinct values for 'A free copy of Mastering The Interview'
leads['A free copy of Mastering The Interview'].value_counts()

In [None]:
# Let's See the distinct values for 'Lead Quality'

leads['Lead Quality'].value_counts()

In [None]:
# Let's See the distinct values for 'Lead Profile'

leads['Lead Profile'].value_counts()

In [None]:
# Let's See the distinct values for 'Last Notable Activity'

leads['Last Notable Activity'].value_counts()

In [None]:
leads.info()

In [None]:
leads.head()

In [None]:
leads = leads.drop('Prospect ID', axis=1) # Drop 'Prospect ID' as it is only an identifier.

In [None]:
leads.head()

In [None]:
null_perc = round(leads.isnull().mean()*100,2)
null_perc   

# The null percentage of each column tells us which columns to drop immediately.

In [None]:
# Let's see the columns with a null percentage greater than 45% and drop them immediately,
# as columns with that many nulls won't help our final model. 

null_perc[null_perc > 45]

In [None]:
leads.columns

In [None]:
# Drop the columns with high null percentages

leads = leads.drop(['Lead Quality','Asymmetrique Activity Index', 'Asymmetrique Profile Index',
       'Asymmetrique Activity Score', 'Asymmetrique Profile Score'], axis=1)
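The columns dropped above were typed out from the `null_perc > 45` output. A small sketch of deriving the list directly from the threshold, so the filter and the drop cannot drift apart (the toy frame and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `leads`; the values are illustrative only.
toy = pd.DataFrame({
    "Lead Quality": [np.nan, np.nan, "High in Relevance", np.nan],  # 75% missing
    "TotalVisits": [1.0, np.nan, 3.0, 4.0],                         # 25% missing
    "Converted": [0, 1, 0, 1],                                      # complete
})

null_perc = (toy.isnull().mean() * 100).round(2)

# Collect every column above the 45% threshold and drop them in one call.
cols_to_drop = null_perc[null_perc > 45].index.tolist()
toy_reduced = toy.drop(columns=cols_to_drop)
print(cols_to_drop)
```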

In [None]:
leads.info()

In [None]:
leads.head()

In [None]:
leads.shape

In [None]:
leads['Do Not Email'].value_counts(normalize = True)

In [None]:
leads = leads.drop(['Do Not Email'], axis = 1)  # Drop the column as it is 92% skewed

In [None]:
leads['Lead Source'].value_counts(normalize = True)  

In [None]:
leads['Lead Source'].value_counts() # value counts of 'Lead Source'

In [None]:
leads['Lead Source'] = leads['Lead Source'].fillna(leads['Lead Source'].mode()[0])

# Fill the NAN cells with mode.

In [None]:
leads['Lead Source'].value_counts() # It is clearly seen that all NAN cells are replaced with mode. i.e., Google

# Data Preparation

## Modify Column Values

In [None]:
# The number of distinct values is high, so we reduce them for easier analysis.
# We keep the high-frequency values and group all low-frequency values as 'Others'.

def Change_Lead_Source_col(x):
    if x == 'google':
        return 'Google'
    elif x == 'Google':
        return 'Google'
    elif x == 'Direct Traffic':
        return 'Direct Traffic'
    elif x == 'Olark Chat':
        return 'Olark Chat'
    elif x == 'Organic Search':
        return 'Organic Search'
    else : 
        return 'Others'


In [None]:
leads['Lead Source'] = leads['Lead Source'].apply(Change_Lead_Source_col) # Lets Apply the modifications as mentioned above.
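The `if`/`elif` chain above maps each kept label to itself and everything else to 'Others'. The same pattern can be written once and reused for other columns; the following is a sketch under that assumption, with a hypothetical helper `lump_rare` and a made-up sample series.

```python
import pandas as pd

def lump_rare(s: pd.Series, keep: list, other: str = "Others") -> pd.Series:
    """Keep the labels listed in `keep`; replace every other value with `other`."""
    return s.where(s.isin(keep), other)

# Illustrative sample, not the real column.
toy_source = pd.Series(["Google", "google", "Direct Traffic", "bing", "Olark Chat", "Facebook"])

# Normalise the lowercase spelling first, then lump the remaining rare labels.
normalised = toy_source.replace({"google": "Google"})
lumped = lump_rare(normalised, keep=["Google", "Direct Traffic", "Olark Chat", "Organic Search"])
print(lumped.value_counts())
```

As with the function above, missing values fall into the `other` bucket, so imputation should happen before this step if NaN is meant to be treated differently.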

In [None]:
leads['Lead Source'].value_counts() # Modified value counts

In [None]:
leads['Lead Source'].value_counts(normalize = True) # Modified values in percentages

In [None]:
leads.isnull().sum() # Let's impute the null values of the other variables in the same way as above.

In [None]:
leads['Last Activity'].value_counts() 

In [None]:
leads['Last Notable Activity'].value_counts()

In [None]:
# The number of distinct values is high, so we reduce them for easier analysis.
# We keep the high-frequency values and group all low-frequency values as 'Others'.

def Change_Last_Activity(x):
    if x == 'Email Opened':
        return 'Email Opened'
    elif x == 'SMS Sent':
        return 'SMS Sent'
    elif x == 'Olark Chat Conversation':
        return 'Olark Chat Conversation'
    else : 
        return 'Others'

In [None]:
# Fill the NAN cells with mode.
leads['Last Activity'] = leads['Last Activity'].fillna(leads['Last Activity'].mode()[0])


In [None]:
leads['Last Activity'] = leads['Last Activity'].apply(Change_Last_Activity) # Apply the modifications as mentioned above

In [None]:
leads['Last Activity'].value_counts(normalize = True)  # Check the modified percentage Value Counts

In [None]:
# Fill the NAN cells with mode.
leads['Page Views Per Visit'] = leads['Page Views Per Visit'].fillna(leads['Page Views Per Visit'].mode()[0])

In [None]:

# Fill the NAN cells with mode.
leads['TotalVisits'] = leads['TotalVisits'].fillna(leads['TotalVisits'].mode()[0])

In [None]:
round(leads.isnull().mean()*100,2)

In [None]:
leads.Country.value_counts(normalize = True) # The Country column is skewed;
                             # we can drop it or modify it after treating the null values.

In [None]:
leads.Country.isnull().mean() # The share of null values is high, so we fill them with the column's mode.

In [None]:
leads.Country =leads.Country.fillna(leads.Country.mode()[0]) # Filled Nan with mode.

In [None]:
# The number of distinct values is high, so we reduce them for easier analysis.
# We keep 'India' and group all remaining values as 'Foreign'.


def Change_Country(x):
    if x == 'India':
        return 'India'
    else : 
        return 'Foreign'

# Apply the modification defined above
leads.Country = leads.Country.apply(Change_Country)

In [None]:
leads.Country.value_counts(normalize = True) # The column is highly skewed, so we can drop it.

In [None]:
leads = leads.drop('Country', axis=1) # Drop the column 'Country'

In [None]:
round(leads.isnull().mean()*100,2)

In [None]:
leads.Specialization.value_counts()

In [None]:
leads.Specialization.isnull().sum()

In [None]:
(1942+1438)/9240    # 'Select' plus null values make up about 36.5 percent of the column, which is very high, so we drop the column.

In [None]:
leads = leads.drop('Specialization', axis=1) # Drop the column 'Specialization'

In [None]:
leads['How did you hear about X Education'].value_counts() # 'Select' makes up about 55 percent of all rows

In [None]:
leads['How did you hear about X Education'].isnull().sum() # Null values plus 'Select' account for roughly 78% of the cells,
# so we drop the column.

In [None]:
leads = leads.drop('How did you hear about X Education', axis=1) # Drop the column 'How did you hear about X Education'

In [None]:
leads['What is your current occupation'].value_counts() 

In [None]:
leads['What is your current occupation'].isnull().sum()

In [None]:
# Fill the NaN cells with the mode ('Unemployed') and see the fresh value counts.
leads['What is your current occupation'] = leads['What is your current occupation'].fillna('Unemployed')
leads['What is your current occupation'].value_counts()

In [None]:
leads['What is your current occupation'].value_counts(normalize = True)
# The column is about 90% skewed, so we drop it.

In [None]:
leads = leads.drop('What is your current occupation', axis=1) # Drop the column 'What is your current occupation'

In [None]:
leads.Tags.value_counts()   # Un_modified value counts of Tags column

In [None]:
leads.Tags.isnull().sum() 

In [None]:
# Fill the NAN cells with mode.
leads.Tags = leads.Tags.fillna(leads.Tags.mode()[0])

In [None]:
# The number of distinct values is high, so we reduce them for easier analysis.
# We keep the high-frequency values and group all low-frequency values as 'Other Reasons'.


def Change_Tags(x):
    if x == 'Will revert after reading the email':
        return 'Will revert after reading the email'
    elif x == 'Ringing':
        return 'Ringing'
    elif x == 'Already a student':
        return 'Already a student'
    else: 
        return 'Other Reasons'

#Apply the modifications made above
leads.Tags = leads.Tags.apply(Change_Tags)

In [None]:
# lets see the modified value counts
leads.Tags.value_counts()

In [None]:
leads.info()

In [None]:
leads['Lead Profile'].value_counts()  

In [None]:
leads['Lead Profile'].isnull().sum()

In [None]:
# Fill the NaN cells with the mode ('Select').
leads['Lead Profile'] = leads['Lead Profile'].fillna('Select')


In [None]:
# Modify the column accordingly

def Change_Lead_Profile(x):
    if x == 'Select':
        return 'Other Leads'
    elif x == 'Potential Lead':
        return 'Potential Lead'
    elif x == 'Other Leads':
        return 'Other Leads'
    else: 
        return 'diploma_dual_n_SomeSchool_lead'

# Apply the modifications
leads['Lead Profile'] = leads['Lead Profile'].apply(Change_Lead_Profile)

In [None]:
leads['Lead Profile'].value_counts(normalize = True) # Checked the percentage value counts and decided not to drop.

In [None]:
leads.City.value_counts(normalize = True) 

In [None]:
leads.City.value_counts()

In [None]:
leads.City.isnull().sum()

In [None]:
leads['City'] = leads['City'].fillna('Other Cities')  # We don't know which city the NaN cells belong to,
                                                      # so we label them 'Other Cities'.
leads.City.value_counts(normalize = True)     # The percentage value counts show the column is not skewed.

In [None]:
leads.City.value_counts()  

In [None]:
# Make the modifications to city column accordingly 

def Change_City(x):
    if x == 'Mumbai':
        return 'Mumbai'
    elif x == 'Thane & Outskirts':
        return 'Other Cities of Maharashtra'
    elif x == 'Other Cities of Maharashtra':
        return 'Other Cities of Maharashtra'
    else: 
        return 'Other Cities'

# Apply the changes to the city column.
leads['City'] = leads['City'].apply(Change_City)

In [None]:
# lets see the value counts after modifications
leads.City.value_counts()

In [None]:
leads['A free copy of Mastering The Interview'].value_counts() # we can apply binary mapping to this column

In [None]:
leads['Last Notable Activity'].value_counts()

In [None]:
# Define the changes to be made to 'Last Notable Activity' column and apply the changes
def Change_Last_Notable_Activity(x):
    if x == 'Modified':
        return 'Modified'
    elif x == 'Email Opened':
        return 'Email Opened'
    elif x == 'SMS Sent':
        return 'SMS Sent'
    else: 
        return 'Other_Activity'

leads['Last Notable Activity'] = leads['Last Notable Activity'].apply(Change_Last_Notable_Activity)

In [None]:
leads['Last Notable Activity'].value_counts() # Modified value counts

In [None]:
leads.info()

In [None]:
leads.Converted.value_counts()

In [None]:
# Defining the map function
def binary_map(x):
    return x.map({'Yes': 1, "No": 0})

# Applying the function to the column
leads[['A free copy of Mastering The Interview']] = leads[['A free copy of Mastering The Interview']].apply(binary_map)

In [None]:
leads.info()

In [None]:
leads.columns

In [None]:
leads = leads.drop('Lead Number', axis=1)

In [None]:
leads.columns # Columns remaining after dropping highly skewed, high-null-percentage and unneeded columns

In [None]:
leads.shape

In [None]:
# Create dummy variables for the categorical columns, dropping the first level of each.
# dtype=int keeps the dummies numeric (recent pandas versions return booleans by default).
dummy1 = pd.get_dummies(leads[['Lead Origin', 'Lead Source', 'Last Activity', 'Tags', 'Lead Profile', 'City',
                               'Last Notable Activity']], drop_first=True, dtype=int)

# Adding the results to the master dataframe
leads = pd.concat([leads, dummy1], axis=1)
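To make the effect of `drop_first=True` concrete, here is a small sketch on a made-up `City` column. It also passes `dtype=int`, which is an addition for illustration: recent pandas releases return boolean dummies by default, and integer columns tend to be easier to feed into numeric routines later on.

```python
import pandas as pd

# Illustrative sample, not the real column.
toy = pd.DataFrame({"City": ["Mumbai", "Other Cities", "Mumbai", "Other Cities of Maharashtra"]})

# With three categories, drop_first=True keeps two dummy columns;
# the dropped level ('Mumbai', first in sorted order) becomes the baseline.
dummies = pd.get_dummies(toy[["City"]], drop_first=True, dtype=int)
print(dummies)
```

A row with zeros in every dummy column therefore represents the baseline category, which is why one level can be dropped without losing information.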

In [None]:
leads.head() # after concat we see 33 columns

In [None]:
leads.info()

In [None]:
# Drop the original categorical columns after creating the dummies
leads = leads.drop(['Lead Origin', 'Lead Source', 'Last Activity', 'Tags', 'Lead Profile', 'City',
                               'Last Notable Activity'], axis = 1)

In [None]:
leads.info()

In [None]:
# Let's see the correlation matrix 
plt.figure(figsize = (20,10))        # Size of the figure
sns.heatmap(leads.corr(),annot = True , cmap = 'RdYlGn')
plt.show()

## Inference:
   1. **`Last Activity_SMS Sent`** and **`Last Notable Activity_SMS Sent`**, and likewise **`Lead Source_Others`** and **`Lead Origin_Lead Add Form`**, have the highest correlations, so we can drop one column from each pair.

In [None]:
leads = leads.drop(['Last Activity_SMS Sent', 'Lead Source_Others'], axis=1) # Drop the highly correlated columns.
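Reading the strongest pairs off a large annotated heatmap is error-prone. As a sketch, the pairs can also be listed directly from the correlation matrix; the helper name `top_correlated_pairs`, the 0.8 threshold and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Column pairs whose absolute correlation exceeds `threshold`, strongest first."""
    corr = df.corr().abs()
    # Keep only the upper triangle (no diagonal) so each pair appears once.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Toy frame: 'b' duplicates 'a', 'c' is only weakly related to both.
toy = pd.DataFrame({
    "a": [0, 1, 0, 1, 1, 0],
    "b": [0, 1, 0, 1, 1, 0],
    "c": [1, 1, 0, 0, 1, 0],
})
pairs = top_correlated_pairs(toy)
print(pairs)
```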

## Model Building
Let's start by splitting our data into a training set and a test set.

### Test-Train Split & Feature Scaling

In [None]:
# import the train_test_split and StandardScaler from scikit Learn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [None]:
# Putting feature variables to X
X = leads.drop('Converted', axis=1) 

X.head()

In [None]:
# Putting response variable to y
y = leads['Converted']

y.head()

In [None]:
# Splitting the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=100)

In [None]:
# Instantiate the object.
scaler = StandardScaler()

# Scale only the continuous numeric variables; the dummy variables already
# take the values 0 or 1 and do not need scaling.

X_train[['TotalVisits','Total Time Spent on Website','Page Views Per Visit']] = scaler.fit_transform(X_train[['TotalVisits','Total Time Spent on Website','Page Views Per Visit']])

X_train.head()
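The scaler above is fitted on the training set only. When the test set is scored later, the usual practice is to reuse that fitted scaler with `transform` rather than fitting a new one, so that no test-set statistics leak into the model. A minimal sketch on made-up data (all names ending in `_toy` are hypothetical):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy frame standing in for the feature matrix; the values are illustrative only.
toy = pd.DataFrame({
    "TotalVisits": [0.0, 5.0, 2.0, 1.0, 2.5, 8.0, 3.0, 4.0, 6.0, 7.0],
    "Total Time Spent on Website": [0.0, 674.0, 1532.0, 305.0, 1428.0, 90.0, 450.0, 800.0, 1200.0, 60.0],
    "flag": [0, 1, 0, 1, 1, 0, 0, 1, 0, 1],
})
num_cols = ["TotalVisits", "Total Time Spent on Website"]

X_tr_toy, X_te_toy = train_test_split(toy, train_size=0.7, random_state=100)
X_tr_toy, X_te_toy = X_tr_toy.copy(), X_te_toy.copy()  # avoid writing into views of `toy`

scaler_toy = StandardScaler()
X_tr_toy[num_cols] = scaler_toy.fit_transform(X_tr_toy[num_cols])  # learn mean/std on train
X_te_toy[num_cols] = scaler_toy.transform(X_te_toy[num_cols])      # reuse them on test
```

After this, the training columns have mean 0 and unit variance, while the test columns are only approximately standardised, which is expected.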

In [None]:
# Check the overall conversion rate (in percent)
converted = (sum(leads['Converted'])/len(leads['Converted'].index))*100
converted

In [None]:
# Let's see the correlation matrix 
plt.figure(figsize = (20,10))        # Size of the figure
sns.heatmap(leads.corr(),annot = True , cmap = 'RdYlGn')
plt.show()

In [None]:
# import statsmodels for better analytics
import statsmodels.api as sm  

In [None]:
# Logistic regression model using GLM (Generalized linear model)  and by using binomial family of distribution curves
logm1 = sm.GLM(y_train,(sm.add_constant(X_train)), family = sm.families.Binomial())
logm1.fit().summary()

### Model Building through RFE (Recursive feature elimination)


#### RFE


In [None]:
# import the required scikit learn libraries
from sklearn.linear_model import LogisticRegression

#Instantiate LogisticRegression
logreg = LogisticRegression()

In [None]:

from sklearn.feature_selection import RFE

# Number of variables RFE should keep (recent scikit-learn versions require this as a keyword argument)
rfe = RFE(logreg, n_features_to_select=15)

# fit the rfe model
rfe = rfe.fit(X_train, y_train)

In [None]:
# Boolean mask: True for the variables RFE selected, False for those it eliminated
rfe.support_

In [None]:
# List every column name with its RFE selection flag and ranking
list(zip(X_train.columns, rfe.support_, rfe.ranking_))

In [None]:
# Collect the columns that RFE selected
col = X_train.columns[rfe.support_]
col

In [None]:
# Columns that RFE eliminated
X_train.columns[~rfe.support_]

In [None]:
# Add a constant
X_train_sm = sm.add_constant(X_train[col])

# Create a logistic regression model using GLM
logm2 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())

#Fit the model
res = logm2.fit()

# view the summary
res.summary()

## Inference: 
   1. **`Lead Origin_Landing Page Submission`** clearly exceeds the permissible p-value, so we drop it first.

In [None]:
# Getting the predicted values on the train set
y_train_pred = res.predict(X_train_sm)
y_train_pred[:10]

In [None]:
# Check for the VIF values of the feature variables. 
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
# Drop the column with the high p-value and keep the remaining columns in `col`
col = col.drop('Lead Origin_Landing Page Submission')
col

In [None]:
# Let's re-run the model using the selected variables

# Add a constant
X_train_sm = sm.add_constant(X_train[col])

# Create a logistic regression model using GLM
logm3 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())

#Fit the model
res = logm3.fit()

# view the summary
res.summary()

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
# Drop the column with the high p-value and keep the remaining columns in `col`
col = col.drop('City_Other Cities')
col

In [None]:

# Add a constant
X_train_sm = sm.add_constant(X_train[col])

# Create a logistic regression model using GLM
logm4 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())

#Fit the model
res = logm4.fit()

# view the summary
res.summary()

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
# Drop the column with the high p-value and keep the remaining columns in `col`
col = col.drop('Tags_Ringing')
col

In [None]:
# Let's re-run the model using the selected variables
# Add a constant
X_train_sm = sm.add_constant(X_train[col])

# Create a logistic regression model using GLM
logm5 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())

#Fit the model
res = logm5.fit()

# view the summary
res.summary()

In [None]:
vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
y_train_pred = res.predict(X_train_sm).values.reshape(-1) # Reshape it into an array

In [None]:
y_train_pred[:10]

In [None]:
# Convert the values in array to a data_frame and name them
y_train_pred_final = pd.DataFrame({'Converted':y_train.values, 'Converted_Prob':y_train_pred})

# Assign an ID to the training set
y_train_pred_final['StudID'] = y_train.index


y_train_pred_final.head()

In [None]:
# Assume a rough threshold value, say 0.5, and predict the lead conversion
y_train_pred_final['predicted'] = y_train_pred_final.Converted_Prob.map(lambda x: 1 if x > 0.5 else 0)

# Let's see the last few rows
y_train_pred_final.tail()

In [None]:
from sklearn import metrics # import metrics from scikit-learn

In [None]:
# Print Confusion matrix 
confusion = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.predicted )
print(confusion)

In [None]:
# Predicted          Not_Converted   Converted
# Actual
# Not_Converted          3667           335
# Converted               486          1980

In [None]:
# Let's check the overall accuracy.
print(metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final.predicted))

In [None]:
TP = confusion[1,1] # true positive 
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives

In [None]:
# Let's see the sensitivity of our logistic regression model
TP / float(TP+FN)

In [None]:
# Let us calculate specificity
TN / float(TN+FP)

In [None]:
# Calculate the false positive rate: predicting conversion when the lead did not convert
print(FP/ float(TN+FP)) 

In [None]:
# positive predictive value 
print (TP / float(TP+FP))

In [None]:
# Negative predictive value
print (TN / float(TN+ FN))
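
The metrics above are computed one cell at a time from `confusion`. As a rough sketch, they could also be collected in a single helper. `summarize_confusion` is a hypothetical name introduced here for illustration (it is not part of scikit-learn), and the counts passed in are the ones shown in the commented confusion matrix above (training set, 0.5 cutoff).

```python
def summarize_confusion(tn, fp, fn, tp):
    """Return the usual rates derived from a 2x2 confusion matrix."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),                # recall / true positive rate
        "specificity": tn / (tn + fp),                # true negative rate
        "false_positive_rate": fp / (tn + fp),        # 1 - specificity
        "positive_predictive_value": tp / (tp + fp),  # precision
        "negative_predictive_value": tn / (tn + fn),
    }


# Counts copied from the commented confusion matrix (assumed to match `confusion`)
summary = summarize_confusion(tn=3667, fp=335, fn=486, tp=1980)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

With these counts the positive predictive value (precision) comes out at about 0.855 and the sensitivity (recall) at about 0.803.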

###  Plotting the ROC Curve

An ROC curve demonstrates several things:

- It shows the tradeoff between sensitivity and specificity (any increase in sensitivity will be accompanied by a decrease in specificity).
- The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test.
- The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.
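
To make the tradeoff concrete, here is a minimal sketch of how a single point on the ROC curve arises from one threshold. The helper `roc_point` and the toy labels and scores are assumptions made up for illustration; they are not the leads data. The sketch uses the same `>` comparison as the cutoff cells in this notebook.

```python
def roc_point(actual, probs, threshold):
    """Return (false positive rate, true positive rate) at one threshold."""
    predicted = [1 if p > threshold else 0 for p in probs]
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, y in pairs if a == 1 and y == 1)
    fn = sum(1 for a, y in pairs if a == 1 and y == 0)
    fp = sum(1 for a, y in pairs if a == 0 and y == 1)
    tn = sum(1 for a, y in pairs if a == 0 and y == 0)
    return fp / (fp + tn), tp / (tp + fn)


# Toy labels and scores (hypothetical, for illustration only)
toy_actual = [0, 0, 1, 1]
toy_probs = [0.1, 0.4, 0.35, 0.8]

for cutoff in (0.5, 0.3):
    fpr_at_cutoff, tpr_at_cutoff = roc_point(toy_actual, toy_probs, cutoff)
    print(cutoff, fpr_at_cutoff, tpr_at_cutoff)
```

Lowering the threshold from 0.5 to 0.3 raises the true positive rate from 0.5 to 1.0, but the false positive rate also rises from 0.0 to 0.5, which is the tradeoff the curve traces out.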

In [None]:
def draw_roc( actual, probs ):
    fpr, tpr, thresholds = metrics.roc_curve( actual, probs,
                                              drop_intermediate = False )
    auc_score = metrics.roc_auc_score( actual, probs )
    plt.figure(figsize=(5, 5))
    plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show()

    return None

In [None]:
fpr, tpr, thresholds = metrics.roc_curve( y_train_pred_final.Converted, y_train_pred_final.Converted_Prob, drop_intermediate = False )

In [None]:
draw_roc(y_train_pred_final.Converted, y_train_pred_final.Converted_Prob)

###  Finding Optimal Cutoff Point

The optimal cutoff probability is the probability at which sensitivity and specificity are balanced.
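
One way to read "balanced" is to pick the cutoff where the gap between sensitivity and specificity is smallest. The sketch below does this on made-up toy data. The names `sensitivity_specificity` and `balanced_cutoff` are hypothetical helpers, and this is only one possible selection rule, not necessarily the one used to read the plot later in this notebook.

```python
def sensitivity_specificity(actual, probs, cutoff):
    """Return (sensitivity, specificity) when predicting 1 for probs > cutoff."""
    predicted = [1 if p > cutoff else 0 for p in probs]
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, y in pairs if a == 1 and y == 1)
    fn = sum(1 for a, y in pairs if a == 1 and y == 0)
    fp = sum(1 for a, y in pairs if a == 0 and y == 1)
    tn = sum(1 for a, y in pairs if a == 0 and y == 0)
    return tp / (tp + fn), tn / (tn + fp)


def balanced_cutoff(actual, probs, cutoffs):
    """Return the cutoff whose sensitivity and specificity are closest."""
    def gap(cutoff):
        sensi, speci = sensitivity_specificity(actual, probs, cutoff)
        return abs(sensi - speci)
    return min(cutoffs, key=gap)


# Toy labels and scores (hypothetical, for illustration only)
toy_actual = [0, 0, 0, 0, 1, 1, 1, 1]
toy_probs = [0.05, 0.2, 0.35, 0.6, 0.25, 0.45, 0.7, 0.9]
toy_cutoffs = [x / 10 for x in range(10)]  # 0.0, 0.1, ..., 0.9

print(balanced_cutoff(toy_actual, toy_probs, toy_cutoffs))
```

On this toy data the rule picks 0.4, where sensitivity and specificity are both 0.75.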

In [None]:
# Let's create columns with different probability cutoffs 
numbers = [float(x)/10 for x in range(10)]
for i in numbers:
    y_train_pred_final[i]= y_train_pred_final.Converted_Prob.map(lambda x: 1 if x > i else 0)
y_train_pred_final.tail()

In [None]:
# Now let's calculate accuracy, sensitivity and specificity for various probability cutoffs.
cutoff_df = pd.DataFrame( columns = ['prob','accuracy','sensi','speci'])
from sklearn.metrics import confusion_matrix

# TP = confusion[1,1] # true positive 
# TN = confusion[0,0] # true negatives
# FP = confusion[0,1] # false positives
# FN = confusion[1,0] # false negatives

num = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
for i in num:
    cm1 = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final[i] )
    total1=sum(sum(cm1))
    accuracy = (cm1[0,0]+cm1[1,1])/total1
    
    speci = cm1[0,0]/(cm1[0,0]+cm1[0,1])
    sensi = cm1[1,1]/(cm1[1,0]+cm1[1,1])
    cutoff_df.loc[i] =[ i ,accuracy,sensi,speci]
print(cutoff_df)

In [None]:
# Let's plot accuracy sensitivity and specificity for various probabilities.
cutoff_df.plot.line(x='prob', y=['accuracy','sensi','speci'])
plt.show()

## Inference: 
   1. Using the plot above, we can choose a threshold probability that trades off accuracy, sensitivity and specificity.
   2. The plot suggests a threshold probability of **`x = 0.3`**.

In [None]:
y_train_pred_final['final_predicted'] = y_train_pred_final.Converted_Prob.map( lambda x: 1 if x > 0.3 else 0)

y_train_pred_final.head()

In [None]:
# Let's check the overall accuracy.
metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final.final_predicted)

In [None]:
# Let's print the confusion matrix again, this time for the final predictions at the 0.3 cutoff
confusion2 = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.final_predicted )
confusion2

In [None]:
TP = confusion2[1,1] # true positive 
TN = confusion2[0,0] # true negatives
FP = confusion2[0,1] # false positives
FN = confusion2[1,0] # false negatives

In [None]:
# Let's see the sensitivity of our logistic regression model
TP / float(TP+FN)

In [None]:
# Let us calculate specificity
TN / float(TN+FP)

In [None]:
# Calculate the false positive rate: predicting conversion when the lead did not convert
print(FP/ float(TN+FP))

In [None]:
# Positive predictive value 
print (TP / float(TP+FP))

In [None]:
# Negative predictive value
print (TN / float(TN+ FN))

## Precision and Recall

In [None]:
#Looking at the confusion matrix again
confusion = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.predicted )
confusion

##### Precision
TP / (TP + FP)

In [None]:
confusion[1,1]/(confusion[0,1]+confusion[1,1]) # The CEO expects at least 80%; we got around 85% (training set, 0.5 cutoff), so the model meets the CEO's target

##### Recall
TP / (TP + FN)

In [None]:
confusion[1,1]/(confusion[1,0]+confusion[1,1])

In [None]:
from sklearn.metrics import precision_score, recall_score

In [None]:
precision_score(y_train_pred_final.Converted, y_train_pred_final.predicted)

In [None]:
recall_score(y_train_pred_final.Converted, y_train_pred_final.predicted)

### Precision and recall tradeoff

In [None]:
from sklearn.metrics import precision_recall_curve

In [None]:
y_train_pred_final.Converted, y_train_pred_final.predicted

In [None]:
y_train_pred_final.tail()

In [None]:
p, r, thresholds = precision_recall_curve(y_train_pred_final.Converted, y_train_pred_final.Converted_Prob)

In [None]:
plt.plot(thresholds, p[:-1], "g-")
plt.plot(thresholds, r[:-1], "r-")
plt.show()

### Inference: 
   1. Looking at the precision-recall tradeoff, we can go with a probability threshold of **`x = 0.41`**.
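
The threshold above is read off the plot by eye. As a hedged sketch, the point where the two curves are closest could also be located numerically. `precision_recall_crossover` is a hypothetical helper, the data below are toy values rather than the leads data, and this rule is an assumption about how the crossover might be found, not the method used to arrive at 0.41.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def precision_recall_crossover(actual, probs):
    """Return the threshold at which precision and recall are closest."""
    precision, recall, cutoffs = precision_recall_curve(actual, probs)
    # precision and recall have one more entry than cutoffs, so drop the last one
    gap = np.abs(precision[:-1] - recall[:-1])
    return float(cutoffs[np.argmin(gap)])


# Toy labels and scores (hypothetical, for illustration only)
toy_actual = [0, 0, 0, 0, 1, 1, 1, 1]
toy_probs = [0.05, 0.2, 0.35, 0.6, 0.25, 0.45, 0.7, 0.9]

toy_threshold = precision_recall_crossover(toy_actual, toy_probs)
print(toy_threshold)
```

On the notebook's own data the same helper could be called with `y_train_pred_final.Converted` and `y_train_pred_final.Converted_Prob` to cross-check the value read from the plot.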

### Making predictions on the test set

In [None]:
# Scale the test set with the scaler already fitted on the training set (transform only, no refitting)
X_test[['TotalVisits','Total Time Spent on Website','Page Views Per Visit']] = scaler.transform(X_test[['TotalVisits','Total Time Spent on Website','Page Views Per Visit']])


In [None]:
# Let's keep only the final columns that remain after dropping the high p-value and high-VIF variables
X_test = X_test[col]
X_test.head()

In [None]:
X_test.info()

In [None]:
# Add constant
X_test_sm = sm.add_constant(X_test)

In [None]:
# Let's compute the predicted probabilities on the test set
y_test_pred = res.predict(X_test_sm)

In [None]:

y_test_pred[:10]

In [None]:
# Converting y_test_pred to a dataframe
y_pred_1 = pd.DataFrame(y_test_pred)

In [None]:
# Let's see the head
y_pred_1.head()

In [None]:
# Converting y_test to dataframe
y_test_df = pd.DataFrame(y_test)

In [None]:
# Copying the index into a StudID column
y_test_df['StudID'] = y_test_df.index

In [None]:
# Removing index for both dataframes to append them side by side 
y_pred_1.reset_index(drop=True, inplace=True)
y_test_df.reset_index(drop=True, inplace=True)

In [None]:
# Appending y_test_df and y_pred_1
y_pred_final = pd.concat([y_test_df, y_pred_1],axis=1)

In [None]:
y_pred_final.head()

In [None]:
# Renaming the column 
y_pred_final= y_pred_final.rename(columns={ 0 : 'Converted_Prob'})

In [None]:
# Let's see the head of y_pred_final
y_pred_final.head()

In [None]:
y_pred_final['final_predicted'] = y_pred_final.Converted_Prob.map(lambda x: 1 if x > 0.41 else 0)
# Here we use x = 0.41, the trade-off probability from the precision-recall curve.

In [None]:
y_pred_final.head() # Head of the test set showing actual converted and predicted values.

In [None]:
y_pred_final['Lead Score'] = round(y_pred_final['Converted_Prob']*100,0)


In [None]:
def Hot_Cold_mapping(x):
    return x.map({1:'Hot', 0:'Cold'})

y_pred_final[['Target']] = y_pred_final[['final_predicted']].apply(Hot_Cold_mapping)


In [None]:
y_pred_final.head() # Head of the test set showing actual converted and predicted values.

In [None]:
y_pred_final['Target'].value_counts()

In [None]:
y_pred_final['Converted'].value_counts()

In [None]:
# Let's check the overall accuracy.
metrics.accuracy_score(y_pred_final.Converted, y_pred_final.final_predicted)

## Inference:
   1. The accuracy, sensitivity and specificity on the training and test sets are almost equal, so we can conclude that the model is good to proceed with.

In [None]:
confusion2 = metrics.confusion_matrix(y_pred_final.Converted, y_pred_final.final_predicted )
confusion2

In [None]:
TP = confusion2[1,1] # true positive 
TN = confusion2[0,0] # true negatives
FP = confusion2[0,1] # false positives
FN = confusion2[1,0] # false negatives

In [None]:
# Let's see the sensitivity of our logistic regression model
TP / float(TP+FN)

In [None]:
# Let us calculate specificity
TN / float(TN+FP)

In [None]:
TP/(TP + FP) # The CEO expects at least 80%; we got around 83% precision on the test set, so the model meets the CEO's target

# Conclusion:
   - The CEO expects a conversion rate of at least 80%. The model achieves around 83% precision on the test set, so it meets the CEO's target.

# Extra Questions

In [None]:
col

In [None]:
# Let's see the correlation matrix 
plt.figure(figsize = (20,10))        # Size of the figure
sns.heatmap(leads.corr(),annot = True , cmap = 'RdYlGn')
plt.show()