# Case Study on Lead Score 

## Problem Statement 

An education company named X Education needs help improving its lead conversion rate from 30% to at least 80% by identifying its hot leads, that is, the leads most likely to convert into paying customers. The company requires a model that assigns a lead score to each lead, such that leads with a higher score have a higher chance of converting and leads with a lower score have a lower chance.
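As a rough illustration of what a lead score could look like, the sketch below scales a conversion probability to a 0-100 score. The probabilities, the `hot_cutoff` value, and the scaling rule are all assumptions made for illustration; the notebook has not yet defined how the score is computed.

```python
import pandas as pd

# Hypothetical conversion probabilities, e.g. produced later by a classifier.
probs = pd.Series([0.05, 0.42, 0.91], name="conversion_prob")

# Assumed scoring rule: scale the probability to an integer between 0 and 100.
lead_score = (probs * 100).round().astype(int)

# Hypothetical cutoff above which a lead is treated as "hot".
hot_cutoff = 70
is_hot = lead_score >= hot_cutoff

print(pd.DataFrame({"lead_score": lead_score, "is_hot": is_hot}))
```

With a rule like this, a higher score maps directly to a higher estimated chance of conversion, which is the property the problem statement asks for.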

#### Import Libraries 

In [26]:
import pandas as pd
import numpy as np

#### Data Reading 

In [27]:
leadDf = pd.read_csv("data/Leads.csv")
leadDf.head()

Unnamed: 0,Prospect ID,Lead Number,Lead Origin,Lead Source,Do Not Email,Do Not Call,Converted,TotalVisits,Total Time Spent on Website,Page Views Per Visit,...,Get updates on DM Content,Lead Profile,City,Asymmetrique Activity Index,Asymmetrique Profile Index,Asymmetrique Activity Score,Asymmetrique Profile Score,I agree to pay the amount through cheque,A free copy of Mastering The Interview,Last Notable Activity
0,7927b2df-8bba-4d29-b9a2-b6e0beafe620,660737,API,Olark Chat,No,No,0,0.0,0,0.0,...,No,Select,Select,02.Medium,02.Medium,15.0,15.0,No,No,Modified
1,2a272436-5132-4136-86fa-dcc88c88f482,660728,API,Organic Search,No,No,0,5.0,674,2.5,...,No,Select,Select,02.Medium,02.Medium,15.0,15.0,No,No,Email Opened
2,8cc8c611-a219-4f35-ad23-fdfd2656bd8a,660727,Landing Page Submission,Direct Traffic,No,No,1,2.0,1532,2.0,...,No,Potential Lead,Mumbai,02.Medium,01.High,14.0,20.0,No,Yes,Email Opened
3,0cc2df48-7cf4-4e39-9de9-19797f9b38cc,660719,Landing Page Submission,Direct Traffic,No,No,0,1.0,305,1.0,...,No,Select,Mumbai,02.Medium,01.High,13.0,17.0,No,No,Modified
4,3256f628-e534-4826-9d63-4a8b88782852,660681,Landing Page Submission,Google,No,No,1,2.0,1428,1.0,...,No,Select,Mumbai,02.Medium,01.High,15.0,18.0,No,No,Modified


##### The data has 37 columns: some are categorical, some are numeric, and some hold string values. Next, we examine the data in more detail


In [28]:
leadDf.shape

(9240, 37)

##### We have a total of 9240 records

## Data understanding & Data cleanup

#### Data understanding

In [29]:
##### A few columns contain the value "Select", which means the user did not select any option for that column, so we treat it as null and replace it with NaN
leadDf.replace("Select",np.nan,inplace=True)
leadDf.replace("How did you hear about X Education",np.nan,inplace=True)
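The effect of this replacement can be checked on a small frame. The sketch below uses a hypothetical two-row `toy` frame (not the real data) and the non-inplace form of `replace`, which returns a new frame.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of two columns that use "Select" as a placeholder.
toy = pd.DataFrame({
    "City": ["Select", "Mumbai"],
    "Lead Profile": ["Select", "Potential Lead"],
})

# Replace cells that are exactly "Select" with NaN; other values are untouched.
cleaned = toy.replace("Select", np.nan)

# Count the nulls per column after the replacement.
null_counts = cleaned.isnull().sum()
print(null_counts)
```

Note that `replace` with a scalar matches whole cell values and is case-sensitive, so a cell containing "select" in lower case would not be converted.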



In [30]:
leadDf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9240 entries, 0 to 9239
Data columns (total 37 columns):
 #   Column                                         Non-Null Count  Dtype  
---  ------                                         --------------  -----  
 0   Prospect ID                                    9240 non-null   object 
 1   Lead Number                                    9240 non-null   int64  
 2   Lead Origin                                    9240 non-null   object 
 3   Lead Source                                    9204 non-null   object 
 4   Do Not Email                                   9240 non-null   object 
 5   Do Not Call                                    9240 non-null   object 
 6   Converted                                      9240 non-null   int64  
 7   TotalVisits                                    9103 non-null   float64
 8   Total Time Spent on Website                    9240 non-null   int64  
 9   Page Views Per Visit                           9103 

##### Null Count Check for every column of data frame 

In [31]:
columns_null_count=leadDf.isnull().sum()
print(columns_null_count)

Prospect ID                                         0
Lead Number                                         0
Lead Origin                                         0
Lead Source                                        36
Do Not Email                                        0
Do Not Call                                         0
Converted                                           0
TotalVisits                                       137
Total Time Spent on Website                         0
Page Views Per Visit                              137
Last Activity                                     103
Country                                          2461
Specialization                                   3380
How did you hear about X Education               7250
What is your current occupation                  2690
What matters most to you in choosing a course    2709
Search                                              0
Magazine                                            0
Newspaper Article           

###### A total of 17 columns have null values

In [32]:
##### Numerical data descriptions
leadDf.describe()

Unnamed: 0,Lead Number,Converted,TotalVisits,Total Time Spent on Website,Page Views Per Visit,Asymmetrique Activity Score,Asymmetrique Profile Score
count,9240.0,9240.0,9103.0,9240.0,9103.0,5022.0,5022.0
mean,617188.435606,0.38539,3.445238,487.698268,2.36282,14.306252,16.344883
std,23405.995698,0.486714,4.854853,548.021466,2.161418,1.386694,1.811395
min,579533.0,0.0,0.0,0.0,0.0,7.0,11.0
25%,596484.5,0.0,1.0,12.0,1.0,14.0,15.0
50%,615479.0,0.0,3.0,248.0,2.0,14.0,16.0
75%,637387.25,1.0,5.0,936.0,3.0,15.0,18.0
max,660737.0,1.0,251.0,2272.0,55.0,18.0,20.0


In [33]:
## creating a method to check null percentage of dataframe
def getNullPercentage(dataFrame):
    return round(100*(dataFrame.isnull().sum()/len(dataFrame.index)), 2)
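An equivalent way to get the same percentages is to take the mean of the boolean null mask, since the mean of `True`/`False` values is the fraction of `True`. The sketch below shows this on a hypothetical `toy` frame.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column "a" is half null, column "b" has no nulls.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", "y", "z", "w"],
})

# isnull() gives a boolean mask; its column-wise mean is the null fraction.
null_pct = (toy.isnull().mean() * 100).round(2)
print(null_pct)
```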

In [34]:
## creating a method to check the unique values of a given column
def getUniqueValue(dataFrame, column):
    return dataFrame[column].astype('category').value_counts()


In [35]:
# Check the column with null percentages.

print(getNullPercentage(leadDf))

Prospect ID                                       0.00
Lead Number                                       0.00
Lead Origin                                       0.00
Lead Source                                       0.39
Do Not Email                                      0.00
Do Not Call                                       0.00
Converted                                         0.00
TotalVisits                                       1.48
Total Time Spent on Website                       0.00
Page Views Per Visit                              1.48
Last Activity                                     1.11
Country                                          26.63
Specialization                                   36.58
How did you hear about X Education               78.46
What is your current occupation                  29.11
What matters most to you in choosing a course    29.32
Search                                            0.00
Magazine                                          0.00
Newspaper 

In [36]:
# check the number of unique values for each column of the data frame
leadDf.apply(lambda x: len(x.unique()))


Prospect ID                                      9240
Lead Number                                      9240
Lead Origin                                         5
Lead Source                                        22
Do Not Email                                        2
Do Not Call                                         2
Converted                                           2
TotalVisits                                        42
Total Time Spent on Website                      1731
Page Views Per Visit                              115
Last Activity                                      18
Country                                            39
Specialization                                     19
How did you hear about X Education                 10
What is your current occupation                     7
What matters most to you in choosing a course       4
Search                                              2
Magazine                                            1
Newspaper Article           
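One detail worth keeping in mind when reading these counts: `len(x.unique())` counts NaN as a distinct value, whereas `nunique()` excludes NaN by default. The sketch below contrasts the two on a hypothetical `toy` frame.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two distinct strings and one missing value.
toy = pd.DataFrame({"src": ["Google", "google", np.nan, "Google"]})

# Same approach as the cell above: NaN is counted as its own value.
with_nan = toy.apply(lambda col: len(col.unique()))

# nunique() drops NaN by default, so the count is one lower here.
without_nan = toy.nunique()

print(with_nan["src"], without_nan["src"])
```

For columns that contain nulls, the count reported above is therefore one higher than the number of actual categories.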

#### Data cleanup

In [37]:
##### creating function to remove column
def removeColumn(columns):
    leadDf.drop(columns,axis=1,inplace=True)

In [38]:
##### Prospect ID and Lead Number are both unique identifiers for this data frame, so we can remove them
removeColumn("Prospect ID")
removeColumn("Lead Number")


In [39]:
#### removing the columns that have a single value in all rows, because they carry no information for our model
removeColumn(['Magazine','Receive More Updates About Our Courses', 'Update me on Supply Chain Content','Get updates on DM Content','I agree to pay the amount through cheque'])
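Instead of listing the single-valued columns by hand, they could also be detected programmatically. The following is a minimal sketch on a hypothetical `toy` frame; `constant_cols` and `reduced` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Hypothetical frame: "Magazine" is constant, "Search" is not.
toy = pd.DataFrame({
    "Magazine": ["No", "No", "No"],
    "Search": ["No", "Yes", "No"],
})

# A column is constant when it has exactly one distinct value (NaN included).
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]

# Drop the constant columns and keep the rest.
reduced = toy.drop(columns=constant_cols)
print(constant_cols, list(reduced.columns))
```

Using `dropna=False` means a column that mixes one value with nulls is not treated as constant, which is the more conservative choice.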

<h3 style="color:red"><strong>Many columns have a high percentage of null values</strong>, so we can delete the columns whose null percentage is <b>35%</b> or more</h3>

In [40]:
getNullPercentage(leadDf)

Lead Origin                                       0.00
Lead Source                                       0.39
Do Not Email                                      0.00
Do Not Call                                       0.00
Converted                                         0.00
TotalVisits                                       1.48
Total Time Spent on Website                       0.00
Page Views Per Visit                              1.48
Last Activity                                     1.11
Country                                          26.63
Specialization                                   36.58
How did you hear about X Education               78.46
What is your current occupation                  29.11
What matters most to you in choosing a course    29.32
Search                                            0.00
Newspaper Article                                 0.00
X Education Forums                                0.00
Newspaper                                         0.00
Digital Ad

In [41]:
## Specialization and Tags have a null percentage of around 36%, and both look important for our model.
## Dropping every column above the 35% threshold would lose that data, so we replace the null values
## with the placeholder 'NA' (not provided) in the columns Country, Specialization,
## What is your current occupation, What matters most to you in choosing a course, and Tags.
leadDf['Country'] = leadDf['Country'].fillna('NA')
leadDf['Specialization'] = leadDf['Specialization'].fillna('NA') 
leadDf['What is your current occupation'] = leadDf['What is your current occupation'].fillna('NA') 
leadDf['What matters most to you in choosing a course'] = leadDf['What matters most to you in choosing a course'].fillna('NA') 
leadDf['Tags'] = leadDf['Tags'].fillna('NA') 


<h5>Dropping the columns whose null percentage is 35% or more</h5>

In [42]:
# Drop all the columns with 35% or more missing values
cols=leadDf.columns

for col in cols:
    if((100*(leadDf[col].isnull().sum()/len(leadDf.index))) >= 35):
        removeColumn(col)
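The same threshold rule can be written without an explicit loop by building a boolean mask over the columns. This is a sketch on a hypothetical `toy` frame, using the same 35% cutoff as the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with 75%, 25%, and 0% nulls in its three columns.
toy = pd.DataFrame({
    "mostly_null": [np.nan, np.nan, np.nan, 1.0],
    "few_null": [1.0, 2.0, np.nan, 4.0],
    "full": [1, 2, 3, 4],
})

# Null percentage per column.
null_pct = toy.isnull().mean() * 100

# Keep only the columns below the 35% threshold.
kept = toy.loc[:, null_pct < 35]
print(list(kept.columns))
```

This form returns a new frame rather than modifying the original in place, which makes it easier to compare the column sets before and after.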

In [43]:
## after removing the columns with 35% or more nulls, check the null percentage of the data frame again
getNullPercentage(leadDf)

Lead Origin                                      0.00
Lead Source                                      0.39
Do Not Email                                     0.00
Do Not Call                                      0.00
Converted                                        0.00
TotalVisits                                      1.48
Total Time Spent on Website                      0.00
Page Views Per Visit                             1.48
Last Activity                                    1.11
Country                                          0.00
Specialization                                   0.00
What is your current occupation                  0.00
What matters most to you in choosing a course    0.00
Search                                           0.00
Newspaper Article                                0.00
X Education Forums                               0.00
Newspaper                                        0.00
Digital Advertisement                            0.00
Through Recommendations     

In [44]:
# The remaining missing-value percentage is below 2% in every column, so we can drop those rows with little effect on the data
leadDf.dropna(inplace=True)
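A per-column null rate below 2% does not by itself bound the share of rows removed, because nulls in different columns can fall on different rows. A quick before-and-after count makes the actual loss visible. The sketch below uses a hypothetical `toy` frame; in this notebook, the Country counts shown further down sum to 9074, which suggests about 166 of the 9240 rows (roughly 1.8%) were dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical frame where the nulls of the two columns sit on different rows.
toy = pd.DataFrame({
    "TotalVisits": [0.0, np.nan, 2.0, 5.0],
    "Lead Source": ["Google", "Olark Chat", None, "Google"],
})

before = len(toy)
after = len(toy.dropna())  # rows with a null in any column are removed

# Share of rows lost, in percent.
lost_pct = round(100 * (before - after) / before, 2)
print(before, after, lost_pct)
```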


In [45]:
leadDf["Country"].value_counts()

India                   6491
NA                      2296
United States             69
United Arab Emirates      53
Singapore                 24
Saudi Arabia              21
United Kingdom            15
Australia                 13
Qatar                     10
Bahrain                    7
Hong Kong                  7
Oman                       6
France                     6
unknown                    5
Kuwait                     4
South Africa               4
Canada                     4
Nigeria                    4
Germany                    4
Sweden                     3
Philippines                2
Uganda                     2
Italy                      2
Bangladesh                 2
Netherlands                2
Asia/Pacific Region        2
China                      2
Belgium                    2
Ghana                      2
Kenya                      1
Sri Lanka                  1
Tanzania                   1
Malaysia                   1
Liberia                    1
Switzerland   

In [46]:
## most values in the Country column are "India" or not provided ("NA"), so we group the column into three categories: "india", "na" and "outside india"

## Create a method for country category 
def slots(value):
    value=value.lower()
    category = ""
    if value == "india":
        category = "india"
    elif value == "na":
        category = "na"
    else:
        category = "outside india"
    return category

leadDf['Country'] = leadDf['Country'].apply(slots)
leadDf['Country'].value_counts()

india            6491
na               2296
outside india     287
Name: Country, dtype: int64
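The same three-way grouping can also be expressed with vectorized string operations instead of a row-wise function. This is a sketch on a hypothetical `country` series; `lowered` and `bucket` are illustrative names.

```python
import pandas as pd

# Hypothetical sample of raw Country values.
country = pd.Series(["India", "NA", "United States", "unknown"])

# Lower-case once, then keep "india" and "na" and relabel everything else.
lowered = country.str.lower()
bucket = lowered.where(lowered.isin(["india", "na"]), "outside india")

print(bucket.value_counts())
```

As with the function above, any value other than "india" or "na" lands in "outside india", including placeholder entries such as "unknown".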

In [47]:
## Again checking the null values percentage

getNullPercentage(leadDf)

Lead Origin                                      0.0
Lead Source                                      0.0
Do Not Email                                     0.0
Do Not Call                                      0.0
Converted                                        0.0
TotalVisits                                      0.0
Total Time Spent on Website                      0.0
Page Views Per Visit                             0.0
Last Activity                                    0.0
Country                                          0.0
Specialization                                   0.0
What is your current occupation                  0.0
What matters most to you in choosing a course    0.0
Search                                           0.0
Newspaper Article                                0.0
X Education Forums                               0.0
Newspaper                                        0.0
Digital Advertisement                            0.0
Through Recommendations                       

##### Now we examine the categorical data

In [48]:
## check the unique values and their counts for every column of the data frame

for col in leadDf.columns:
    print("******* Unique value of Column:",col," *******",)
    print(getUniqueValue(leadDf,col),"\n")


******* Unique value of Column: Lead Origin  *******
Landing Page Submission    4885
API                        3578
Lead Add Form               581
Lead Import                  30
Name: Lead Origin, dtype: int64 

******* Unique value of Column: Lead Source  *******
Google               2868
Direct Traffic       2543
Olark Chat           1753
Organic Search       1154
Reference             443
Welingak Website      129
Referral Sites        125
Facebook               31
bing                    6
google                  5
Click2call              4
Press_Release           2
Social Media            2
Live Chat               2
WeLearn                 1
Pay per Click Ads       1
NC_EDM                  1
blog                    1
testone                 1
welearnblog_Home        1
youtubechannel          1
Name: Lead Source, dtype: int64 

******* Unique value of Column: Do Not Email  *******
No     8358
Yes     716
Name: Do Not Email, dtype: int64 

******* Unique value of Column: Do Not 

<h4><strong>After the unique value check, we can see that a few columns hold binary values such as yes/no, while a few categorical columns have a wide variety of values</strong></h4>
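One possible way to act on these two observations is sketched below: map the yes/no columns to 1/0, and merge category labels that differ only in spelling (for example "google" and "Google" in Lead Source). The `toy` frame and the choice of mapping are assumptions for illustration, not steps the notebook has performed.

```python
import pandas as pd

# Hypothetical sample with one binary column and one categorical column.
toy = pd.DataFrame({
    "Do Not Email": ["No", "Yes", "No"],
    "Lead Source": ["Google", "google", "bing"],
})

# Assumed encoding: Yes -> 1, No -> 0.
toy["Do Not Email"] = toy["Do Not Email"].map({"Yes": 1, "No": 0})

# Merge the lower-case spelling into the main "Google" category.
toy["Lead Source"] = toy["Lead Source"].replace({"google": "Google"})

counts = toy["Lead Source"].value_counts()
print(toy["Do Not Email"].tolist(), counts.to_dict())
```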
