# Lead Score 

Problem Statement

An education company named X Education sells online courses to industry professionals. On any given day, many professionals who are interested in the courses land on their website and browse for courses.

The company markets its courses on several websites and search engines like Google. Once these people land on the website, they might browse the courses, fill out a form for a course, or watch some videos. When they fill out a form providing their email address or phone number, they are classified as a lead. The company also gets leads through past referrals. Once these leads are acquired, employees from the sales team start making calls, writing emails, etc. Through this process, some of the leads get converted while most do not. The typical lead conversion rate at X Education is around 30%.

Now, although X Education gets a lot of leads, its lead conversion rate is very poor. For example, if they acquire 100 leads in a day, only about 30 of them are converted. To make this process more efficient, the company wishes to identify the most promising leads, also known as ‘Hot Leads’. If they successfully identify this set of leads, the lead conversion rate should go up, as the sales team will now focus on communicating with the promising leads rather than calling everyone. A typical lead conversion process can be represented using the following funnel:

[Figure: Lead Conversion Process, demonstrated as a funnel]

As you can see, there are a lot of leads generated in the initial stage (top), but only a few of them come out as paying customers from the bottom. In the middle stage, you need to nurture the potential leads well (i.e. educating the leads about the product, constantly communicating, etc.) in order to get a higher lead conversion.

X Education has appointed you to help them select the most promising leads, i.e. the leads that are most likely to convert into paying customers. The company requires you to build a model that assigns a lead score to each lead, such that leads with a higher score have a higher chance of conversion and leads with a lower score have a lower chance. The CEO, in particular, has given a ballpark target lead conversion rate of around 80%.

# 1. Importing the Data

In [None]:
## Calling the libraries
import numpy as np
import pandas  as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)

In [None]:
# To increase the display size for rows and columns
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)



In [None]:
word=pd.read_excel(r"../input/lead-scoring-dataset/Leads Data Dictionary.xlsx",index_col=0)
print(word)

# 2. Reading and Understanding the data

In [None]:
#reading the Leads csv file
df = pd.read_csv('../input/lead-scoring-dataset/Lead Scoring.csv')

In [None]:
# reading the first 5 rows
df.head()

# 3. Inspecting the data

In [None]:
# shape of the data frame
df.shape

In [None]:
# info of the dataframe
df.info()

In [None]:
# statistical information of the data frame
df.describe()

In [None]:
## checking the object columns
ob=df.select_dtypes(include=["object"]).columns
ob

In [None]:
# Replacing the placeholder value 'Select' with NaN in all columns
df = df.replace({'Select':np.nan})

# 4. Dealing with Missing Values

In [None]:
#Percentage of missing values
null_perc = pd.DataFrame(round((df.isnull().sum())*100/df.shape[0],2)).reset_index()
null_perc.columns = ['Column Name', 'Null Values Percentage']
null_value = pd.DataFrame(df.isnull().sum()).reset_index()
null_value.columns = ['Column Name', 'Null Values']
null_lead = pd.merge(null_value, null_perc, on='Column Name')
null_lead.sort_values("Null Values", ascending = False)
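
The same summary can be built without the intermediate merge. The following is a minimal sketch on a small made-up frame (`toy` and its values are hypothetical, not the leads data): `isnull().mean()` gives the missing fraction per column directly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the leads data
toy = pd.DataFrame({
    "City": ["Mumbai", np.nan, np.nan, "Pune"],
    "TotalVisits": [3.0, 5.0, np.nan, 1.0],
    "Converted": [1, 0, 0, 1],
})

# isnull().sum() counts missing cells; isnull().mean() is the missing fraction
null_summary = pd.DataFrame({
    "Null Values": toy.isnull().sum(),
    "Null Values Percentage": (toy.isnull().mean() * 100).round(2),
}).sort_values("Null Values", ascending=False)

print(null_summary)
```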

In [None]:
## Removing columns with more than 45% null values
null_column =round((df.isnull().sum()/len(df))*100,4) 
null_column_45 = null_column[null_column.values > 45.0000]
null_column_45 = list(null_column_45.index)
df.drop(labels=null_column_45,axis=1,inplace=True)
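
As an illustration of the 45% rule, here is a small sketch on invented data (the column names are hypothetical). It keeps columns with a boolean mask instead of dropping in place, which is one possible alternative to the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical columns with different shares of missing values
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],  # 75% null
    "half_missing": [np.nan, 2.0, np.nan, 4.0],       # 50% null
    "complete": [1, 2, 3, 4],                         # 0% null
})

threshold = 45.0  # percent, the same cut-off as in the cell above
null_pct = toy.isnull().mean() * 100

# Keep only the columns at or below the threshold
kept = toy.loc[:, null_pct <= threshold]
print(list(kept.columns))
```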

In [None]:
# Columns with object dtype
ob=df.select_dtypes(include=["object"]).columns

In [None]:
# Checking unique values and null values for the categorical columns
def Cat_info(df, categorical_column):
    # Collect one record per column and build the frame once at the end
    # (DataFrame.append was removed in pandas 2.0)
    rows = []
    for value in categorical_column:
        rows.append({
            "columns": value,
            "values": df[value].unique(),
            "unique_values": df[value].nunique(),
        })

    df_result = pd.DataFrame(rows, columns=["columns", "values", "unique_values"])
    df_result.set_index("columns", inplace=True)
    return df_result

In [None]:
df_cat = Cat_info(df, ob)
df_cat

In [None]:
def column_category_counts(data):
    return pd.DataFrame(data.value_counts(dropna=False))


for column in ob:
    print("Column Name : ",column)
    display(column_category_counts(df[column]).T)

# 5. Imputation Process

Group infrequent categories into an "others" bucket for "Lead Source", "Last Activity" and "Country".

Drop columns which are highly skewed: "Do Not Email", "Do Not Call", "What matters most to you in choosing a course", "Search", "Magazine", "Newspaper Article", "X Education Forums", "Newspaper", "Digital Advertisement", "Through Recommendations", "Receive More Updates About Our Courses", "Update me on Supply Chain Content", "Get updates on DM Content", "I agree to pay the amount through cheque".

In [None]:
#Dropping columns which are highly skewed
df.drop(["Newspaper Article","Do Not Email","Do Not Call","What matters most to you in choosing a course","Search","Magazine","X Education Forums","Newspaper","Digital Advertisement","Through Recommendations","Receive More Updates About Our Courses","Update me on Supply Chain Content","Get updates on DM Content","I agree to pay the amount through cheque"],axis=1,inplace=True)

> Drop the columns Tags and Prospect ID, which are created by the sales team, i.e. after contacting the students. Also drop Lead Number, as it is just a unique identifier.

Drop City, which has 39% missing values; imputing it would skew the distribution.

In [None]:
df.drop(["Tags","Prospect ID","Lead Number","City"],axis=1,inplace=True)

In [None]:
## Dropping Last Notable Activity as this field is similar to Last Activity
df.drop(["Last Notable Activity"],axis=1,inplace=True)

In [None]:
# Grouping infrequent categories into a single bucket for Lead Source, Last Activity and Country
df.loc[(df["Lead Source"].isin(["Facebook","bing","google","Click2call","Social Media","Live Chat","Press_Release","testone","welearnblog_Home","blog","youtubechannel","NC_EDM","Pay per Click Ads","WeLearn"])),"Lead Source"]="Other_Internet_Sources"
df.loc[(df["Last Activity"].isin(["Unreachable","Unsubscribed","Had a Phone Conversation","Approached upfront","View in browser link Clicked","Email Marked Spam","Email Received","Resubscribed to emails","Visited Booth in Tradeshow"])),"Last Activity"]="All Others"
df.loc[(df["Country"].isin(["Bahrain","Hong Kong","France","Oman","unknown","Nigeria","South Africa","Canada","Kuwait","Germany","Sweden","Ghana","Italy"                      
,"Belgium","China","Uganda","Asia/Pacific Region","Philippines","Bangladesh","Netherlands","Kenya","Sri Lanka","Indonesia","Denmark","Tanzania","Malaysia","Switzerland","Russia","Liberia","Vietnam"])),"Country"]="All Others"
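
The lists above are written out by hand. A possible alternative, sketched here under the assumption that "infrequent" can be defined by a minimum share of rows, is to derive the rare categories from `value_counts`. The helper `group_rare`, the `min_share` cut-off and the toy series are all hypothetical.

```python
import pandas as pd

def group_rare(series, min_share=0.05, other_label="All Others"):
    """Replace categories whose relative frequency is below min_share."""
    shares = series.value_counts(normalize=True)
    rare = shares[shares < min_share].index
    # where() keeps values for which the condition holds and replaces the rest
    return series.where(~series.isin(rare), other_label)

# Toy data: one dominant country and two rare ones
toy_country = pd.Series(["India"] * 18 + ["United States"] + ["Oman"])
grouped = group_rare(toy_country, min_share=0.10)
print(grouped.value_counts())
```

Missing values are left untouched by this helper, so they can still be imputed afterwards.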


In [None]:
# Impute the mode (the most frequent value) for Specialization, Country and What is your current occupation
df.loc[df['Specialization'].isnull(),'Specialization']=df['Specialization'].value_counts().index[0]
df.loc[df['Country'].isnull(),'Country']=df['Country'].value_counts().index[0]
df.loc[df['What is your current occupation'].isnull(),'What is your current occupation']=df['What is your current occupation'].value_counts().index[0]
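
The mode imputation above can also be written with `fillna`. A minimal sketch on made-up values (the specialization names are only for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Specialization": ["Finance", "Finance", np.nan, "HR", np.nan]})

# mode() ignores NaN by default; [0] picks the most frequent value
most_common = toy["Specialization"].mode()[0]
toy["Specialization"] = toy["Specialization"].fillna(most_common)
print(toy["Specialization"].value_counts())
```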

In [None]:
## Dropping the rows that still contain null values
df=df.dropna()

In [None]:
# object data types columns
ob=df.select_dtypes(include=["object"]).columns
ob

In [None]:
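for i in ob:
    plt.figure(figsize=(15,5))
    # Pass the column as x explicitly; recent seaborn versions no longer accept it positionally
    sns.countplot(x=df[i])
    plt.xticks(rotation='vertical')
    plt.show()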

# 7. Handling Numerical Variables

In [None]:
## checking integer and float datatypes
nu=df.select_dtypes(include=["int","float"]).columns
nu

Outlier handling

In [None]:
fig=px.box(df["TotalVisits"])

fig.show()

In [None]:
## Capping outliers above the 95th percentile at the 95th percentile
q4=df["TotalVisits"].quantile(q=.95)
# .loc avoids chained assignment, which may silently modify a copy instead of df
df.loc[df["TotalVisits"]>=q4,"TotalVisits"]=q4

In [None]:
fig=px.box(df["Page Views Per Visit"])
fig.show()

In [None]:
fig=px.box(df["Total Time Spent on Website"])
fig.show()

In [None]:
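## Capping values above the 95th percentile at the 95th percentile
q4=df["Page Views Per Visit"].quantile(q=.95)
# .loc avoids chained assignment, which may silently modify a copy instead of df
df.loc[df["Page Views Per Visit"]>=q4,"Page Views Per Visit"]=q4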

In [None]:
fig=px.box(df["Page Views Per Visit"])
fig.show()
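
The capping applied to `TotalVisits` and `Page Views Per Visit` is equivalent to clipping each column at its 95th percentile. A short sketch with invented numbers, using `Series.clip`:

```python
import pandas as pd

# Invented visit counts with one extreme value
visits = pd.Series([0, 1, 2, 2, 3, 3, 4, 5, 6, 250], dtype=float, name="TotalVisits")

cap = visits.quantile(0.95)
# Values above the cap are replaced by the cap; everything else is unchanged
capped = visits.clip(upper=cap)
print(capped.max() == cap)
```

On a sample this small the 95th percentile is still pulled up by the extreme value; with thousands of rows, as in the leads data, the cap sits much closer to the bulk of the distribution.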

# 8. Dummy variable handling

In [None]:
# dummy variables
df = pd.get_dummies(df,drop_first=True)
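
To see what `drop_first=True` does, consider this small sketch with made-up rows: each categorical column with k levels becomes k-1 indicator columns, and the first level (in sorted order) is the implicit baseline.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["India", "All Others", "United States", "India"],
    "TotalVisits": [2.0, 5.0, 0.0, 3.0],
})

# Numeric columns pass through; "All Others" sorts first and is dropped
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```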

In [None]:
## checking the shape after adding the dummy variables
df.shape

In [None]:
## checking the info
df.info()

# 9. Splitting the data into Train and Test

In [None]:
# Importing the required library to perform the test_train_split
from sklearn.model_selection import train_test_split

In [None]:
# Putting the feature variables in X
X = df.drop(columns=['Converted'])

# Putting the response variable in y (a Series, i.e. the 1-D target scikit-learn expects)
y = df['Converted']

In [None]:
# Performing the train_test_split with 70% of data for training set and 30% data for test set
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state = 42)


In [None]:
X_train.shape , X_test.shape
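
The split above is purely random. One optional variation, which is not what the notebook does, is to pass `stratify=y` so that both parts keep the same share of converted leads. The sketch below uses synthetic data with the roughly 30% conversion rate mentioned in the problem statement.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 leads, 30% converted
rng = np.random.default_rng(42)
X_toy = pd.DataFrame({"feature": rng.normal(size=1000)})
y_toy = pd.Series([1] * 300 + [0] * 700, name="Converted")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.7, random_state=42, stratify=y_toy
)

# Both parts keep (almost exactly) the original class balance
print(y_tr.mean(), y_te.mean())
```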

# 10. Decision Tree 

In [None]:
dt = DecisionTreeClassifier(random_state=42, max_depth=3, min_samples_leaf=10)



In [None]:
dt.fit(X_train, y_train)

# 11. Evaluating model performance

In [None]:
y_train_pred = dt.predict(X_train)
y_test_pred = dt.predict(X_test)

In [None]:
from sklearn.metrics import classification_report


In [None]:
print(classification_report(y_test,y_test_pred))
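
The cells above produce class labels, while the problem statement asks for a lead score. One simple way to get one, shown here as a sketch rather than the notebook's method, is to scale the predicted conversion probability to 0-100. The data is synthetic, and the cut-off of 80 for "hot" leads is a hypothetical choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared leads data (about 30% positives)
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, weights=[0.7, 0.3], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=42)

tree = DecisionTreeClassifier(random_state=42, max_depth=3, min_samples_leaf=10)
tree.fit(X_tr, y_tr)

# Column 1 of predict_proba is the estimated probability of class 1 (converted)
lead_score = np.round(tree.predict_proba(X_te)[:, 1] * 100).astype(int)

# Hypothetical cut-off: treat leads scoring 80 or more as "hot"
hot = lead_score >= 80
print("hot leads:", hot.sum())
if hot.any():
    print("conversion rate among hot leads:", y_te[hot].mean())
```

With the notebook's own objects the score would be `dt.predict_proba(X_test)[:, 1] * 100`; the conversion rate among leads above the chosen cut-off is the figure to compare against the 80% target.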

ROC curve

In [None]:
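# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay.from_estimator
# replaces it. The alias keeps the plot_roc_curve(...) calls in later cells working.
from sklearn.metrics import RocCurveDisplay
plot_roc_curve = RocCurveDisplay.from_estimator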

In [None]:
plot_roc_curve(dt, X_train, y_train, drop_intermediate=False)
plt.show()

Hyper-parameter tuning for the Decision Tree

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
dt_ = DecisionTreeClassifier(random_state=42)

In [None]:
params = {
    "max_depth": [2,3,5,10,20],
    "min_samples_leaf": [5,10,20,50,100,500]
}

In [None]:
grid_search = GridSearchCV(estimator=dt_,
                           param_grid=params,
                           cv=6,
                           n_jobs=-1, verbose=1, scoring="accuracy")

In [None]:

grid_search.fit(X_train, y_train)

In [None]:
dtt=grid_search.best_score_

In [None]:
dt_best = grid_search.best_estimator_
dt_best
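
Besides `best_score_` and `best_estimator_`, the fitted search object exposes `best_params_` and a `cv_results_` table with one row per parameter combination. A minimal sketch on synthetic data, with a smaller grid than the one above to keep it quick:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_syn, y_syn = make_classification(n_samples=600, n_features=8, random_state=42)

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [2, 3, 5], "min_samples_leaf": [5, 20]},
    cv=3,
    scoring="accuracy",
)
search.fit(X_syn, y_syn)

# One row per combination, best-ranked first
results = pd.DataFrame(search.cv_results_)[
    ["param_max_depth", "param_min_samples_leaf", "mean_test_score", "rank_test_score"]
].sort_values("rank_test_score")

print(search.best_params_)
print(results.head())
```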

In [None]:
plot_roc_curve(dt_best, X_train, y_train)
plt.show()

In [None]:
dt_best.feature_importances_

In [None]:
imp_df = pd.DataFrame({
    "Varname": X_train.columns,
    "Imp": dt_best.feature_importances_
})

In [None]:
imp_df.sort_values(by="Imp", ascending=False)

# 12. Using Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rf = RandomForestClassifier(n_estimators=10,random_state=42, n_jobs=-1, max_depth=5, min_samples_leaf=10,oob_score=True)

In [None]:
rf.fit(X_train, y_train)

In [None]:
rf.oob_score_
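
`oob_score_` is the accuracy measured on the rows each tree did not see in its bootstrap sample, so it acts as a built-in validation estimate. The sketch below, on synthetic data, prints it next to the accuracy on a held-out set. It uses 100 trees because with very few trees some rows may never be out-of-bag, in which case scikit-learn can warn that the estimate is unreliable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_syn, y_syn = make_classification(n_samples=1500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=42)

forest = RandomForestClassifier(
    n_estimators=100, max_depth=5, min_samples_leaf=10,
    oob_score=True, random_state=42,
)
forest.fit(X_tr, y_tr)

# Both numbers estimate accuracy on unseen rows
print("OOB accuracy: ", forest.oob_score_)
print("test accuracy:", forest.score(X_te, y_te))
```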

In [None]:
plot_roc_curve(rf, X_train, y_train)
plt.show()

In [None]:
rf = RandomForestClassifier(random_state=42, n_jobs=-1)

In [None]:
params = {
    'max_depth': [2,3,5,10,20],
    'min_samples_leaf': [5,10,20,50,100,200],
    'n_estimators': [10, 25, 50, 100]
}

In [None]:
grid_search = GridSearchCV(estimator=rf,
                           param_grid=params,
                           cv = 4,
                           n_jobs=-1, verbose=1, scoring="accuracy")

In [None]:
grid_search.fit(X_train, y_train)

In [None]:
dtr=grid_search.best_score_

In [None]:
rf_best = grid_search.best_estimator_
rf_best

In [None]:
plot_roc_curve(rf_best, X_train, y_train)
plt.show()

In [None]:
rf_best.feature_importances_

In [None]:
imp_df = pd.DataFrame({
    "Varname": X_train.columns,
    "Imp": rf_best.feature_importances_
})

In [None]:
imp_df.sort_values(by="Imp", ascending=False)

Accuracy is higher with the random forest than with the decision tree.
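
The values stored in `dtt` and `dtr` are cross-validated accuracies on the training data. A complementary check is to compare both models on the held-out test set. The following is a sketch on synthetic data with hypothetical hyper-parameters, not the tuned models from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_syn, y_syn = make_classification(n_samples=1500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=42)

models = {
    "decision tree": DecisionTreeClassifier(max_depth=5, min_samples_leaf=10, random_state=42),
    "random forest": RandomForestClassifier(
        n_estimators=100, max_depth=5, min_samples_leaf=10, random_state=42
    ),
}

# Fit each model on the training part and score it on the untouched test part
test_accuracy = {
    name: accuracy_score(y_te, model.fit(X_tr, y_tr).predict(X_te))
    for name, model in models.items()
}
print(test_accuracy)
```

In the notebook itself the equivalent comparison would be `accuracy_score(y_test, dt_best.predict(X_test))` against `accuracy_score(y_test, rf_best.predict(X_test))`.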