# LEAD_SCORE ASSIGNMENT


## Problem Statement:
An education company named X Education sells online courses to industry professionals. On any given day, many professionals who are interested in the courses land on its website and browse for courses.
The company markets its courses on several websites and search engines such as Google. Once these people land on the website, they might browse the courses, fill out a form for a course, or watch some videos.
When they fill out a form providing their email address or phone number, they are classified as a lead.
Moreover, the company also gets leads through past referrals. Once these leads are acquired, employees from the sales team start making calls, writing emails, etc. Through this process, some of the leads get converted while most do not. The typical lead conversion rate at X Education is around 30%.

Now, although X Education gets a lot of leads, its lead conversion rate is very poor. For example, if they acquire 100 leads in a day, only about 30 of them are converted. To make this process more efficient, the company wishes to identify the most promising leads, also known as ‘Hot Leads’.
If they successfully identify this set of leads, the lead conversion rate should go up, because the sales team will focus on communicating with the promising leads rather than making calls to everyone.


**STEPS**
 - Reading and Understanding the Data
 - Data Preparation
 - Model Building
 - Confusion matrix and evaluation metrics

**LIBRARIES**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn import metrics
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import precision_recall_curve

# Reading and Understanding the Data

In [None]:
leads=pd.read_csv("/kaggle/input/lead-scoring-x-online-education/Leads X Education.csv")
leads.head()

In [None]:
leads.shape

In [None]:
leads.info()

In [None]:
leads.nunique()

In [None]:
df_1=leads[leads["Converted"]==1]
df_0=leads[leads["Converted"]==0]

# Univariate Analysis

In [None]:
# Univariate Analysis on Categorical Variable Column, LEAD QUALITY .

plt.figure(figsize = (25,10))

plt.suptitle('Lead Quality', fontsize = 15, fontweight = 10)

plt.subplot(1,2,1)
df_0['Lead Quality'].value_counts(normalize = True).plot.bar()

plt.xlabel('Converted = 0')
plt.ylabel('Percentage')

plt.xticks(rotation = 0)
plt.yticks(np.arange(0, 0.8, 0.2))

plt.subplot(1,2,2)
df_1['Lead Quality'].value_counts(normalize = True).plot.bar()

plt.xlabel('Converted = 1')
plt.ylabel('Percentage')

plt.xticks(rotation = 0)
plt.yticks(np.arange(0, 0.8, 0.2))

plt.show()

In [None]:
# Univariate Analysis on Categorical Variable Column, LEAD ORIGIN .

plt.figure(figsize = (25,10))

plt.suptitle('Lead Origin', fontsize = 15, fontweight = 10)

plt.subplot(1,2,1)
df_0['Lead Origin'].value_counts(normalize = True).plot.bar()

plt.xlabel('Converted = 0')
plt.ylabel('Percentage')

plt.xticks(rotation = 0)
plt.yticks(np.arange(0, 0.8, 0.2))

plt.subplot(1,2,2)
df_1['Lead Origin'].value_counts(normalize = True).plot.bar()

plt.xlabel('Converted = 1')
plt.ylabel('Percentage')

plt.xticks(rotation = 0)
plt.yticks(np.arange(0, 0.8, 0.2))

plt.show()

# Bivariate analysis

In [None]:
# Bivariate analysis for continuous - categorical variables, leads quality vs total visits 
plt.figure(figsize = (15,5))

plt.suptitle('lead quality vs total visits ', fontsize = 15, fontweight = 10)

plt.subplot(121)
df_0.groupby('Lead Quality')['TotalVisits'].aggregate('median').plot.bar(color = ['Black', 'Blue'])
plt.xlabel('converted = 0')
plt.ylabel('total visits')
plt.xticks(rotation = 0)

plt.subplot(122)
df_1.groupby('Lead Quality')['TotalVisits'].aggregate('median').plot.bar(color = ['Black', 'Blue'])
plt.xlabel('converted = 1')
plt.xticks(rotation = 0)

plt.show()

In [None]:
# Bivariate analysis for continuous - categorical variables, leads source vs total time spent on website
plt.figure(figsize = (15,5))

plt.suptitle('Lead Source vs Total Time Spent on Website', fontsize = 15, fontweight = 10)

plt.subplot(121)
df_0.groupby('Lead Source')['Total Time Spent on Website'].aggregate('median').plot.bar(color = ['Black', 'Blue'])
plt.xlabel('converted = 0')
plt.ylabel('median total time spent on website')
plt.xticks(rotation = 50)

plt.subplot(122)
df_1.groupby('Lead Source')['Total Time Spent on Website'].aggregate('median').plot.bar(color = ['Black', 'Blue'])
plt.xlabel('converted = 1')
plt.xticks(rotation = 50)

plt.show()

**Observation: leads that come from social media appear to convert more often than leads from other sources.**

In [None]:
# Bivariate analysis for continuous - categorical variables, what matters most in choosing a course vs total time spent on website
plt.figure(figsize = (15,5))

plt.suptitle('what matters to you while choosing a course vs Total Time Spent on Website', fontsize = 15, fontweight = 10)

plt.subplot(121)
df_0.groupby('What matters most to you in choosing a course')['Total Time Spent on Website'].aggregate('median').plot.bar(color = ['Black', 'Blue'])
plt.xlabel('converted = 0')
plt.ylabel('median total time spent on website')
plt.xticks(rotation = 50)

plt.subplot(122)
df_1.groupby('What matters most to you in choosing a course')['Total Time Spent on Website'].aggregate('median').plot.bar(color = ['Black', 'Blue'])
plt.xlabel('converted = 1')
plt.xticks(rotation = 50)

plt.show()

In [None]:
leads.describe()

In [None]:
# 'Select' is the form's default placeholder (no option chosen), so treat it as a missing value
leads.replace('Select', np.nan, inplace=True)
leads.head()

#### Checking for Missing Values and Imputing Them

In [None]:
leads.isnull().sum()/len(leads)*100
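
The percentages above can also drive the choice of columns to drop. The sketch below is one possible way to do that on a toy frame. The helper name `high_missing_columns` and the 40% threshold are assumptions made for illustration, not the rule used in this notebook.

```python
import numpy as np
import pandas as pd

def high_missing_columns(df, threshold_pct=40.0):
    """Return the names of columns whose share of missing values exceeds threshold_pct."""
    missing_pct = df.isnull().mean() * 100
    return missing_pct[missing_pct > threshold_pct].index.tolist()

# Toy frame standing in for `leads` (hypothetical data)
toy = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],                  # 0% missing
    "B": [np.nan, np.nan, np.nan, 4, 5],   # 60% missing
    "C": ["x", None, "y", "z", "w"],       # 20% missing
})

to_drop = high_missing_columns(toy, threshold_pct=40.0)
toy_clean = toy.drop(columns=to_drop)
print(to_drop)
```

Columns that fall below the threshold still need imputation or row removal, as done in the following cells.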

In [None]:
leads.drop(['Page Views Per Visit','What is your current occupation','What matters most to you in choosing a course',
            'Search','Magazine','Newspaper Article','X Education Forums','Newspaper','Digital Advertisement','Through Recommendations',
            'Receive More Updates About Our Courses','Update me on Supply Chain Content','I agree to pay the amount through cheque'
            ,'A free copy of Mastering The Interview','Last Activity','Lead Source',
            'Country','How did you hear about X Education'],axis=1,inplace=True)

In [None]:
leads.dropna(axis=0, how='all', subset=['TotalVisits'], inplace=True)

In [None]:
# Impute missing categorical values with the most frequent category (mode)
leads['Tags'] = leads['Tags'].fillna(leads['Tags'].mode()[0])
leads['City'] = leads['City'].fillna(leads['City'].mode()[0])

In [None]:
# 'Select' was already replaced with NaN above, so only the fill is needed here
leads['Specialization'] = leads['Specialization'].fillna('Others')

In [None]:
# 'Select' was already replaced with NaN above, so only the fill is needed here
leads['Lead Profile'] = leads['Lead Profile'].fillna('Others')

In [None]:
leads.drop(['Asymmetrique Activity Index','Asymmetrique Profile Index','Asymmetrique Activity Score','Specialization',
            'Asymmetrique Profile Score','Lead Profile','Lead Quality','Get updates on DM Content','Tags','Prospect ID'],axis=1,inplace=True)

In [None]:
leads.shape

In [None]:
leads.head()

In [None]:
# Group the low-frequency categories into a single 'others' category
leads['Lead Origin']=leads['Lead Origin'].replace(['Lead Import','Quick Add Form'],'others')
leads['Last Notable Activity']=leads['Last Notable Activity'].replace(['Email Bounced','Unsubscribed','Had a Phone Conversation','Email Marked Spam','Email Received','Resubscribed to emails','View in browser link Clicked','Form Submitted on Website','Approached upfront'],'Others')
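
The grouping above lists the rare categories by hand. A frequency-based version is sketched below on toy data. The helper `lump_rare`, the 5% cutoff, and the example counts are all assumptions for illustration and are not taken from the dataset.

```python
import pandas as pd

def lump_rare(series, min_share=0.05, other_label="Others"):
    """Replace categories whose relative frequency is below min_share with other_label."""
    share = series.value_counts(normalize=True)
    rare = share[share < min_share].index
    # Keep values that are not rare; overwrite the rare ones
    return series.where(~series.isin(rare), other_label)

# Hypothetical category counts
s = pd.Series(
    ["Modified"] * 50
    + ["Email Opened"] * 40
    + ["SMS Sent"] * 8
    + ["Email Bounced"] * 1
    + ["Unsubscribed"] * 1
)
lumped = lump_rare(s, min_share=0.05)
print(lumped.value_counts())
```

A threshold keeps the rule reproducible if the category mix changes, at the cost of being less explicit than a hand-written list.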


### Creating dummy variables

In [None]:
leads['Do Not Email']=leads['Do Not Email'].map({'Yes':'Email_yes','No':'Email_no'})
dumm_dontmail=pd.get_dummies(leads['Do Not Email'],drop_first=True)
dumm_dontmail.head()

In [None]:
leads['Do Not Call']=leads['Do Not Call'].map({'Yes':'call_yes','No':'call_no'})
dumm_dontcall=pd.get_dummies(leads['Do Not Call'],drop_first=True)
dumm_dontcall.head()

In [None]:
dumm_Last_Notable_Activity=pd.get_dummies(leads['Last Notable Activity'],drop_first=True)
dumm_Last_Notable_Activity.drop(['Others'],axis=1,inplace=True)

In [None]:
dumm_City=pd.get_dummies(leads['City'],drop_first=True)
dumm_Lead_Origin=pd.get_dummies(leads['Lead Origin'],drop_first=True)
dumm_Lead_Origin.drop(['others'],axis=1,inplace=True)

In [None]:
# Drop the original categorical columns now that their dummy variables have been created
leads.drop(['Do Not Email','Do Not Call','City','Last Notable Activity','Lead Origin'],axis=1,inplace=True)

In [None]:
# Concatenate the dummy variables with the dataset
leads = pd.concat([leads, dumm_dontmail,dumm_dontcall,dumm_City,dumm_Last_Notable_Activity,dumm_Lead_Origin], axis = 1)
leads.head()
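
As an alternative to building each dummy frame separately, `pd.get_dummies` can encode several columns in one call. The sketch below uses a toy frame with made-up values. Passing `dtype=int` is a choice made here so that the dummies are 0/1 integers rather than booleans, which some modelling libraries handle more smoothly. Note that the generated column names carry the original column name as a prefix, unlike the names used in this notebook.

```python
import pandas as pd

# Toy frame standing in for `leads` (hypothetical values)
toy = pd.DataFrame({
    "Do Not Email": ["Yes", "No", "No"],
    "City": ["Mumbai", "Thane & Outskirts", "Mumbai"],
    "TotalVisits": [3.0, 5.0, 1.0],
})

# Encode both categorical columns at once; drop_first removes one level per column
encoded = pd.get_dummies(
    toy,
    columns=["Do Not Email", "City"],
    drop_first=True,
    dtype=int,
)
print(encoded)
```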

# Splitting the Data into Train and Test Sets

In [None]:
leads_train,leads_test=train_test_split(leads,train_size=0.7,test_size=0.3,random_state=100)

In [None]:
leads_train.shape

In [None]:
leads_train.info()

In [None]:
leads_test.shape

# Feature Scaling

In [None]:
scaler=StandardScaler()

In [None]:
leads_train[['Total Time Spent on Website','TotalVisits']]=scaler.fit_transform(leads_train[['Total Time Spent on Website','TotalVisits']])

In [None]:
leads_train.head()
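
The scaler is fitted on the training rows only and later reused on the test rows, so that no information from the test set leaks into the scaling parameters. A minimal sketch of that pattern on toy numbers (the frames `train` and `test` are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

num_cols = ["TotalVisits", "Total Time Spent on Website"]

# Toy train and test frames (hypothetical values)
train = pd.DataFrame({"TotalVisits": [1.0, 2.0, 3.0],
                      "Total Time Spent on Website": [10.0, 20.0, 30.0]})
test = pd.DataFrame({"TotalVisits": [2.0, 4.0],
                     "Total Time Spent on Website": [20.0, 40.0]})

scaler = StandardScaler()
train_scaled = train.copy()
test_scaled = test.copy()

# Learn mean and standard deviation from the training data only
train_scaled[num_cols] = scaler.fit_transform(train[num_cols])
# Apply the same parameters to the test data (no refitting)
test_scaled[num_cols] = scaler.transform(test[num_cols])
print(test_scaled)
```

Because the test values are scaled with the training mean and standard deviation, they will generally not have mean 0 and variance 1 themselves.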

In [None]:
plt.figure(figsize = (20, 10))
sns.heatmap(leads_train.corr(), annot = True, cmap="YlGnBu")
plt.show()

In [None]:
y_train = leads_train.pop('Converted')
x_train = leads_train.drop(['Lead Number'],axis=1)

In [None]:
x_train.head()

# Model Building

In [None]:
# Logistic regression model
logm1 = sm.GLM(y_train,(sm.add_constant(x_train)), family = sm.families.Binomial())
logm1.fit().summary()

# Feature Selection Using RFE

In [None]:
logreg=LogisticRegression()
rfe=RFE(logreg, n_features_to_select=15)  # keep the 15 top-ranked features
rfe=rfe.fit(x_train,y_train)

In [None]:
rfe.support_

In [None]:
list(zip(x_train.columns, rfe.support_, rfe.ranking_))
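
To make the meaning of `support_` and `ranking_` concrete, here is a small sketch of recursive feature elimination on synthetic data. The data, the number of features, and `max_iter=1000` are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: only the first two of six features drive the label
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = (X[:, 0] + 2 * X[:, 1] + 0.1 * rng.normal(size=300) > 0).astype(int)

# RFE repeatedly fits the model and removes the weakest feature
selector = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
selector.fit(X, y)

selected = np.flatnonzero(selector.support_)  # indices of the kept features
print(selected)
print(selector.ranking_)                      # kept features have rank 1
```

Features with `support_ == True` always have `ranking_ == 1`; higher ranks indicate features that were eliminated earlier.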

In [None]:
col = x_train.columns[rfe.support_]

In [None]:
x_train.columns[~rfe.support_]

In [None]:
x_train_sm=sm.add_constant(x_train[col])
logm2=sm.GLM(y_train,x_train_sm,family=sm.families.Binomial())
res=logm2.fit()
res.summary()

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features']=x_train[col].columns
vif['VIF']=[variance_inflation_factor(x_train[col].values,i)for i in range(x_train[col].shape[1])]
vif['VIF']=round(vif['VIF'],2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
col=col.drop('call_yes')
col

In [None]:
x_train_sm=sm.add_constant(x_train[col])
logm3=sm.GLM(y_train,x_train_sm,family=sm.families.Binomial())
res1=logm3.fit()
res1.summary()

In [None]:
vif = pd.DataFrame()
vif['Features']=x_train[col].columns
vif['VIF']=[variance_inflation_factor(x_train[col].values,i)for i in range(x_train[col].shape[1])]
vif['VIF']=round(vif['VIF'],2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
col=col.drop('Tier II Cities')
col

In [None]:
x_train_sm=sm.add_constant(x_train[col])
logm4=sm.GLM(y_train,x_train_sm,family=sm.families.Binomial())
res2=logm4.fit()
res2.summary()

In [None]:
vif = pd.DataFrame()
vif['Features']=x_train[col].columns
vif['VIF']=[variance_inflation_factor(x_train[col].values,i)for i in range(x_train[col].shape[1])]
vif['VIF']=round(vif['VIF'],2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
col=col.drop('Other Cities of Maharashtra')
col

In [None]:
x_train_sm=sm.add_constant(x_train[col])
logm5=sm.GLM(y_train,x_train_sm,family=sm.families.Binomial())
res3=logm5.fit()
res3.summary()

In [None]:
vif = pd.DataFrame()
vif['Features']=x_train[col].columns
vif['VIF']=[variance_inflation_factor(x_train[col].values,i)for i in range(x_train[col].shape[1])]
vif['VIF']=round(vif['VIF'],2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
col=col.drop('Other Cities')
col

In [None]:
x_train_sm=sm.add_constant(x_train[col])
logm6=sm.GLM(y_train,x_train_sm,family=sm.families.Binomial())
res4=logm6.fit()
res4.summary()

In [None]:
vif = pd.DataFrame()
vif['Features']=x_train[col].columns
vif['VIF']=[variance_inflation_factor(x_train[col].values,i)for i in range(x_train[col].shape[1])]
vif['VIF']=round(vif['VIF'],2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
col=col.drop('Other Metro Cities')
col

In [None]:
x_train_sm=sm.add_constant(x_train[col])
logm7=sm.GLM(y_train,x_train_sm,family=sm.families.Binomial())
res5=logm7.fit()
res5.summary()

In [None]:
vif = pd.DataFrame()
vif['Features']=x_train[col].columns
vif['VIF']=[variance_inflation_factor(x_train[col].values,i)for i in range(x_train[col].shape[1])]
vif['VIF']=round(vif['VIF'],2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
col=col.drop('Page Visited on Website')
col

In [None]:
x_train_sm=sm.add_constant(x_train[col])
logm8=sm.GLM(y_train,x_train_sm,family=sm.families.Binomial())
res6=logm8.fit()
res6.summary() # final model

In [None]:
vif = pd.DataFrame()
vif['Features']=x_train[col].columns
vif['VIF']=[variance_inflation_factor(x_train[col].values,i)for i in range(x_train[col].shape[1])]
vif['VIF']=round(vif['VIF'],2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
y_train_pred = res6.predict(x_train_sm).values.reshape(-1)

In [None]:
y_train_pred[:10]

##### Creating a dataframe with the actual conversion flag and the predicted probability, i.e. the lead score

In [None]:
y_train_pred_final=pd.DataFrame({'Converted':y_train.values,'Lead_score':y_train_pred})
y_train_pred_final['Leads']=leads_train.index
y_train_pred_final.head()

In [None]:
# Lead_score is a probability between 0 and 1, so the default cutoff is 0.5
y_train_pred_final['predicted'] = y_train_pred_final.Lead_score.map(lambda x: 1 if x > 0.5 else 0)

# Let's see the head
y_train_pred_final.head()

In [None]:
# Confusion matrix 
confusion = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.predicted)
print(confusion)

In [None]:
# Let's check the overall accuracy.
print(metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final.predicted))

**Metrics beyond simple accuracy**

In [None]:
TP = confusion[1,1] # true positive 
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives

In [None]:
# Let's see the sensitivity of our logistic regression model
TP / float(TP+FN)

In [None]:
# Let us calculate specificity
TN / float(TN+FP)

In [None]:
# Calculate false positive rate - predicting conversion when the lead did not convert
print(FP/ float(TN+FP))

In [None]:
# positive predictive value 
print (TP / float(TP+FP))

In [None]:
# Negative predictive value
print (TN / float(TN+ FN))
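
The individual metric cells above can be collected into one helper. The sketch below assumes the 2x2 layout `[[TN, FP], [FN, TP]]` that `sklearn.metrics.confusion_matrix` returns for labels 0 and 1. The function name `binary_metrics` and the example counts are made up for illustration.

```python
import numpy as np

def binary_metrics(cm):
    """Compute common metrics from a 2x2 confusion matrix laid out as
    [[TN, FP], [FN, TP]]."""
    tn, fp, fn, tp = np.asarray(cm, dtype=float).ravel()
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),             # recall, true positive rate
        "specificity": tn / (tn + fp),             # true negative rate
        "false_positive_rate": fp / (tn + fp),
        "precision": tp / (tp + fp),               # positive predictive value
        "negative_predictive_value": tn / (tn + fn),
    }

# Hypothetical counts: 50 TN, 10 FP, 5 FN, 35 TP
example = binary_metrics([[50, 10], [5, 35]])
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

The same function can be applied to the train and test confusion matrices so that both sets are evaluated with identical formulas.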

**PLOTTING ROC CURVE**

In [None]:
def draw_roc( actual, probs ):
    fpr, tpr, thresholds = metrics.roc_curve( actual, probs,
                                              drop_intermediate = False )
    auc_score = metrics.roc_auc_score( actual, probs )
    plt.figure(figsize=(5, 5))
    plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show()

    return None

In [None]:
fpr, tpr, thresholds = metrics.roc_curve( y_train_pred_final.Converted, y_train_pred_final.Lead_score, drop_intermediate = False )

In [None]:
draw_roc(y_train_pred_final.Converted, y_train_pred_final.Lead_score)

In [None]:
# Let's create columns with different probability cutoffs 
numbers = [float(x)/10 for x in range(10)]
for i in numbers:
    y_train_pred_final[i]= y_train_pred_final.Lead_score.map(lambda x: 1 if x > i else 0)
y_train_pred_final.head()

In [None]:
# Now let's calculate accuracy sensitivity and specificity for various probability cutoffs.
cutoff_df = pd.DataFrame( columns = ['prob','accuracy','sensi','speci'])
from sklearn.metrics import confusion_matrix

# TP = confusion[1,1] # true positive 
# TN = confusion[0,0] # true negatives
# FP = confusion[0,1] # false positives
# FN = confusion[1,0] # false negatives

num = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
for i in num:
    cm1 = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final[i] )
    total1=sum(sum(cm1))
    accuracy = (cm1[0,0]+cm1[1,1])/total1
    
    speci = cm1[0,0]/(cm1[0,0]+cm1[0,1])
    sensi = cm1[1,1]/(cm1[1,0]+cm1[1,1])
    cutoff_df.loc[i] =[ i ,accuracy,sensi,speci]
print(cutoff_df)

In [None]:
# Let's plot accuracy sensitivity and specificity for various probabilities.
cutoff_df.plot.line(x='prob', y=['accuracy','sensi','speci'])
plt.show()
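
A common way to read such a plot is to look for the cutoff where sensitivity and specificity meet. The sketch below automates that reading on toy scores by picking the grid point with the smallest gap between the two. The helper names, the toy labels and scores, and the selection rule itself are assumptions for illustration; the cutoff used later in this notebook was not necessarily chosen this way.

```python
import numpy as np
import pandas as pd

def cutoff_table(y_true, scores, cutoffs):
    """Accuracy, sensitivity and specificity for each probability cutoff."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    rows = []
    for c in cutoffs:
        pred = (scores > c).astype(int)
        tp = int(((pred == 1) & (y_true == 1)).sum())
        tn = int(((pred == 0) & (y_true == 0)).sum())
        fp = int(((pred == 1) & (y_true == 0)).sum())
        fn = int(((pred == 0) & (y_true == 1)).sum())
        rows.append({
            "prob": c,
            "accuracy": (tp + tn) / len(y_true),
            "sensi": tp / (tp + fn),
            "speci": tn / (tn + fp),
        })
    return pd.DataFrame(rows)

def balanced_cutoff(table):
    """Return the first cutoff with the smallest |sensitivity - specificity|."""
    gap = (table["sensi"] - table["speci"]).abs()
    return float(table.loc[gap.idxmin(), "prob"])

# Toy labels and predicted probabilities (hypothetical)
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]
table = cutoff_table(y_true, scores, [i / 10 for i in range(10)])
best = balanced_cutoff(table)
print(table)
print(best)
```

A finer grid (for example steps of 0.01) gives a more precise crossing point than steps of 0.1.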

In [None]:
y_train_pred_final['final_predicted'] = y_train_pred_final.Lead_score.map( lambda x: 1 if x > 0.38 else 0)

y_train_pred_final.head()

In [None]:
# Let's check the overall accuracy.
metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final.final_predicted)

In [None]:
confusion2 = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.final_predicted )
confusion2

In [None]:
TP = confusion2[1,1] # true positive 
TN = confusion2[0,0] # true negatives
FP = confusion2[0,1] # false positives
FN = confusion2[1,0] # false negatives

In [None]:
# Let's see the sensitivity of our logistic regression model
TP / float(TP+FN)

In [None]:
# Let us calculate specificity
TN / float(TN+FP)

In [None]:
# Calculate false positive rate - predicting conversion when the lead did not convert
print(FP/ float(TN+FP))

In [None]:
# Positive predictive value 
print (TP / float(TP+FP))

In [None]:
# Negative predictive value
print (TN / float(TN+ FN))

In [None]:
# Precision: share of predicted conversions that actually converted
TP / (TP + FP)

In [None]:
# Recall: share of actual conversions that were predicted as conversions
TP / (TP + FN)

In [None]:
precision_score(y_train_pred_final.Converted, y_train_pred_final.predicted)

In [None]:
recall_score(y_train_pred_final.Converted, y_train_pred_final.predicted)

In [None]:
p, r, thresholds = precision_recall_curve(y_train_pred_final.Converted, y_train_pred_final.Lead_score)

In [None]:
plt.plot(thresholds, p[:-1], "g-")
plt.plot(thresholds, r[:-1], "r-")
plt.show()
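
The precision and recall curves can also be compared numerically. The sketch below finds the threshold where the two are closest on a small set of toy scores. The data and the "closest point" rule are assumptions for illustration, not the procedure used in this notebook.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy labels and predicted probabilities (hypothetical)
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])

p, r, thresholds = precision_recall_curve(y_true, scores)

# p and r have one more entry than thresholds, so drop the last point
gap = np.abs(p[:-1] - r[:-1])
best_threshold = thresholds[np.argmin(gap)]
print(best_threshold)
```

Note that `precision_recall_curve` treats a score equal to the threshold as a positive prediction, whereas the cutoff cells in this notebook use a strict `>` comparison.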

# Making predictions on the test set

In [None]:
leads_test[['Total Time Spent on Website','TotalVisits']]=scaler.transform(leads_test[['Total Time Spent on Website','TotalVisits']])

In [None]:
X_test = leads_test[col]
X_test.head()

In [None]:
y_test=leads_test.pop('Converted')

In [None]:
X_test_sm = sm.add_constant(X_test)

In [None]:
y_test_pred = res6.predict(X_test_sm)

In [None]:
y_test_pred[:10]

In [None]:
# Converting y_test_pred (a pandas Series) to a dataframe
y_pred_1 = pd.DataFrame(y_test_pred)

In [None]:
# Let's see the head
y_pred_1.head()

In [None]:
# Converting y_test to dataframe
y_test_df = pd.DataFrame(y_test)
y_test_df

In [None]:
# Storing the original row index of each lead in a 'leads' column
y_test_df['leads'] = y_test_df.index

In [None]:
# Resetting the index of both dataframes so they can be concatenated side by side
y_pred_1.reset_index(drop=True, inplace=True)
y_test_df.reset_index(drop=True, inplace=True)

In [None]:
# Concatenating y_test_df and y_pred_1 column-wise
y_pred_final = pd.concat([y_test_df, y_pred_1],axis=1)

In [None]:
# Renaming the column 
y_pred_final= y_pred_final.rename(columns={ 0 : 'lead_score'})

In [None]:
# Rearranging the columns
y_pred_final = y_pred_final.reindex(['leads','Converted','lead_score'], axis=1)

In [None]:
# Let's see the head of y_pred_final
y_pred_final.head()

In [None]:
y_pred_final['final_predicted'] = y_pred_final.lead_score.map(lambda x: 1 if x > 0.38 else 0)

In [None]:
# Let's check the overall accuracy.
metrics.accuracy_score(y_pred_final.Converted, y_pred_final.final_predicted)

In [None]:
y_pred_final.head()

In [None]:
y_pred_final['lead_score']=y_pred_final['lead_score']*100
y_pred_final.head()

In [None]:
confusion2 = metrics.confusion_matrix(y_pred_final.Converted, y_pred_final.final_predicted )
confusion2

In [None]:
TP = confusion2[1,1] # true positive 
TN = confusion2[0,0] # true negatives
FP = confusion2[0,1] # false positives
FN = confusion2[1,0] # false negatives

In [None]:
# Let's see the sensitivity of our logistic regression model
TP / float(TP+FN)

In [None]:
# Let us calculate specificity
TN / float(TN+FP)

**SENSITIVITY FOR BOTH THE TRAIN AND TEST SETS IS 79%**