# Project Brief

#### An education company named X Education sells online courses to industry professionals. On any given day, many professionals who are interested in the courses land on their website and browse for courses. 

#### The company markets its courses on several websites and search engines such as Google. Once people land on the website, they might browse the courses, fill in a form for a course, or watch some videos. When they fill in a form providing their email address or phone number, they are classified as a lead. The company also gets leads through past referrals. Once these leads are acquired, employees from the sales team start making calls, writing emails, etc. Through this process, some of the leads get converted while most do not. The typical lead conversion rate at X Education is around 30%.

#### Although X Education gets a lot of leads, its lead conversion rate is poor. For example, if they acquire 100 leads in a day, only about 30 of them are converted. To make this process more efficient, the company wishes to identify the most promising leads, also known as ‘Hot Leads’. If this set of leads is identified successfully, the lead conversion rate should go up, as the sales team will focus on communicating with the promising leads rather than calling everyone.

#### Many leads are generated in the initial stage, but only a few of them come out as paying customers at the last stage. In the middle stage, the potential leads need to be nurtured well (i.e. educating them about the product, communicating with them regularly, etc.) in order to get a higher lead conversion.

#### X Education has appointed you to help them select the most promising leads, i.e. the leads that are most likely to convert into paying customers. The company requires you to build a model wherein you need to assign a lead score to each of the leads such that the customers with higher lead score have a higher conversion chance and the customers with lower lead score have a lower conversion chance. 

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

In [None]:
leads = pd.read_csv('/kaggle/input/lead-scoring-x-online-education/Leads X Education.csv')

In [None]:
leads.head()

In [None]:
leads.columns

In [None]:
leads.info()

In [None]:
leads.shape

In [None]:
# target variable
leads['Converted'].value_counts()

In [None]:
leads.describe()

In [None]:
# checking for the number of null values in each column 

leads.isnull().sum(axis = 0)

In [None]:
# checking for the percentage of null values in each column 

round((leads.isnull().sum(axis = 0)/ len(leads.index))*100 , 2)

#### As we can see, some of the columns have a substantial number of null or missing values. Dropping all of these columns would lose a lot of information, so for some of the feature variables we will instead replace the nulls with a new value, 'Unknown' (i.e. not specified).
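
As a minimal sketch of that idea, the snippet below computes the share of missing values per column and then fills the gaps with an explicit 'Unknown' level. The frame `toy` and its values are made up for illustration and are not taken from the leads data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the leads data
toy = pd.DataFrame({
    "Lead Quality": ["High", np.nan, "Low", np.nan],
    "City": ["Mumbai", "Pune", np.nan, "Mumbai"],
})

# Percentage of missing values in each column
null_pct = (toy.isnull().mean() * 100).round(2)

# Replace missing categories with an explicit 'Unknown' level
filled = toy.fillna("Unknown")
```

Keeping 'Unknown' as its own level lets the model treat "not specified" as information instead of discarding the row or the column.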

In [None]:
leads.columns

## Imputing missing values and Dropping columns where imputation is not possible

In [None]:
# Dropping the columns 'Asymmetrique Activity Index' and 'Asymmetrique Profile Index' as there is a score column for each of them

leads = leads.drop(['Asymmetrique Activity Index','Asymmetrique Profile Index'], axis = 1)

##### Asymmetrique Activity Score column

In [None]:
leads['Asymmetrique Activity Score'].value_counts()

In [None]:
sum(leads['Asymmetrique Activity Score'].isnull())

In [None]:
leads['Asymmetrique Activity Score'] = leads['Asymmetrique Activity Score'].fillna('Unknown')
print(leads['Asymmetrique Activity Score'].value_counts())
print('\n')
print('Number of null values = ',sum(leads['Asymmetrique Activity Score'].isnull()))

##### Asymmetrique Profile Score column

In [None]:
leads['Asymmetrique Profile Score'] = leads['Asymmetrique Profile Score'].fillna('Unknown')
leads['Asymmetrique Profile Score'].value_counts()

##### Lead Quality column

In [None]:
leads['Lead Quality'].value_counts()

In [None]:
sum(leads['Lead Quality'].isnull())

In [None]:
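# assign the result back: fillna(inplace=True) on a selected column is not guaranteed to modify `leads` in recent pandas
leads['Lead Quality'] = leads['Lead Quality'].fillna("Unknown")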
leads['Lead Quality'].value_counts()

##### Tags column

In [None]:
# Tags column
leads['Tags'].value_counts()

In [None]:
sum(leads['Tags'].isnull())

In [None]:
leads['Tags'] = leads['Tags'].fillna('Unknown')
leads['Tags'].value_counts()

In [None]:
sum(leads['Tags'].isnull())

##### Country column 

In [None]:
# Country column 
sum(leads['Country']=='India')/len(leads.index)

Since most of the values in the Country column are 'India', we will reduce the column to two values: 'India' and 'Foreign Country'.

In [None]:
# Note: missing Country values are not equal to 'India', so they also end up as 'Foreign Country'
leads['Country'] = leads['Country'].apply(lambda x: 'India' if x=='India' else 'Foreign Country')
leads['Country'].value_counts()

##### Total visits column

In [None]:
# Total visits column
leads['TotalVisits'].value_counts() 

In [None]:
leads['TotalVisits'].median()  # the column has many outliers, so we impute with the median rather than the mean

In [None]:
leads['TotalVisits'] = leads['TotalVisits'].fillna(leads['TotalVisits'].median())
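
To illustrate why the median is preferred here, the sketch below compares mean and median on a made-up series with one extreme value. The numbers are hypothetical and only meant to show the effect of an outlier on each statistic.

```python
import numpy as np
import pandas as pd

# Hypothetical visit counts with one extreme value and one missing entry
visits = pd.Series([1, 2, 2, 3, 4, 250, np.nan])

mean_value = visits.mean()      # pulled upwards by the single outlier
median_value = visits.median()  # stays near the typical value

# Impute the missing entry with the median
imputed = visits.fillna(median_value)
```

With these numbers the mean lands far above every typical observation, while the median stays between 2 and 3, which is why it is the safer fill value for a skewed column.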

##### Page views per visit column

In [None]:
# Page Views Per Visit column null values are similarly imputed using the median values

leads['Page Views Per Visit'] = leads['Page Views Per Visit'].fillna(leads['Page Views Per Visit'].median())

##### Last Activity column

In [None]:
leads['Last Activity'].value_counts()

In [None]:
sum(leads['Last Activity'].isnull())

In [None]:
leads['Last Activity'] = leads['Last Activity'].fillna("Unknown")
leads['Last Activity'].value_counts()

##### Specialization column

In [None]:
leads['Specialization'].value_counts()

In [None]:
sum(leads['Specialization'].isnull())

In [None]:
leads['Specialization'] = leads['Specialization'].replace('Select', 'Unknown')
leads['Specialization'].value_counts()

In [None]:
leads['Specialization'] = leads['Specialization'].fillna("Unknown")
leads['Specialization'].value_counts()

##### How did you hear about X Education

In [None]:
leads['How did you hear about X Education'].value_counts()

We will drop this column, as most of its values are 'Select', which does not add any information to our model.

In [None]:
leads = leads.drop('How did you hear about X Education', axis=1)

##### What is your current occupation column

In [None]:
leads['What is your current occupation'].value_counts()

In [None]:
sum(leads['What is your current occupation'].isnull())

In [None]:
leads['What is your current occupation'] = leads['What is your current occupation'].fillna("Unknown")
leads['What is your current occupation'].value_counts()

##### What matters most to you in choosing a course column

In [None]:
leads['What matters most to you in choosing a course'].value_counts()

In [None]:
sum(leads['What matters most to you in choosing a course'].isnull())

We will drop this column, as almost all of its non-null values belong to one category and the rest are null.

In [None]:
leads = leads.drop('What matters most to you in choosing a course', axis = 1)

##### Lead Profile column

In [None]:
leads['Lead Profile'].value_counts()

In [None]:
sum(leads['Lead Profile'].isnull())

In [None]:
leads['Lead Profile'] = leads['Lead Profile'].replace('Select', 'Unknown')
leads['Lead Profile'].value_counts()

In [None]:
leads['Lead Profile'] = leads['Lead Profile'].fillna("Unknown")
leads['Lead Profile'].value_counts()

##### City column

In [None]:
# City column
leads['City'].value_counts()

In [None]:
sum(leads['City'].isnull())

In [None]:
leads['City'] = leads['City'].fillna("Unknown") # Replacing null values with 'Unknown'
leads['City'].value_counts()

In [None]:
leads['City'] = leads['City'].replace('Select', 'Unknown')
leads['City'].value_counts()

In [None]:
# re-checking for the percentage of null values in each column 

round((leads.isnull().sum(axis = 0)/ len(leads.index))*100 , 2)

We will remove the rows with missing values

In [None]:
leads.shape

In [None]:
# removing all the rows with null values

leads = leads.dropna()

In [None]:
leads.shape

In [None]:
# checking again for missing values in the dataframe 

round((leads.isnull().sum(axis = 0)/ len(leads.index))*100 , 2)

In [None]:
leads.head()

In [None]:
leads.columns

In [None]:
for col in leads.columns:
    print(col, ':', leads[col].nunique())
    print('\n')

In [None]:
# Prospect ID and Lead Number both identify the lead, so keeping both columns is redundant; we will drop the Prospect ID column

leads = leads.drop('Prospect ID',axis=1)

# Also, several columns have just one unique value, so they carry no information; dropping them as well
leads = leads.drop(['Magazine','Receive More Updates About Our Courses',
                    'Update me on Supply Chain Content','Get updates on DM Content',
                    'I agree to pay the amount through cheque'], axis=1)

In [None]:
leads.head()

In [None]:
print(leads.shape)

In [None]:
leads.columns

### Mapping 'Yes' and 'No' to '1' and '0'

In [None]:
def mapping(x):
    return x.map({'Yes':1, 'No':0})

In [None]:
col_list = ['Search',
            'Do Not Email',
            'Do Not Call',
            'Newspaper Article',
            'X Education Forums',
            'Newspaper',
            'Digital Advertisement',
            'Through Recommendations',
            'A free copy of Mastering The Interview']

In [None]:
leads[col_list] = leads[col_list].apply(mapping)

In [None]:
leads.head()

In [None]:
leads.columns

### Creating dummy variables for categorical variables

In [None]:
leads.info()

In [None]:
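# creating dummy variables for some of the other categorical columns
# (dtype=int keeps the dummies numeric; recent pandas versions return booleans by default)
leads = pd.get_dummies(leads, columns=['Lead Origin', 'Lead Source', 'Country', 'Last Notable Activity'], drop_first=True, dtype=int)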

In [None]:
# Creating dummy variables for the rest of the columns and dropping the level called 'Unknown'

unknown_cols = ['Asymmetrique Activity Score', 'Asymmetrique Profile Score', 'Last Activity',
                'What is your current occupation', 'Lead Profile', 'Specialization',
                'City', 'Lead Quality', 'Tags']

for column in unknown_cols:
    # One-hot encode the column (dtype=int keeps the dummies numeric in recent pandas versions)
    dummy = pd.get_dummies(leads[column], prefix=column, dtype=int)
    # Drop the '<column>_Unknown' level so that 'Unknown' acts as the baseline category
    final_dummy = dummy.drop([column + '_Unknown'], axis=1)
    leads = pd.concat([leads, final_dummy], axis=1)
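
The pattern used in this cell (one-hot encode, then drop the 'Unknown' level instead of an arbitrary first level) can be sketched as a small helper. The function name `dummies_without_unknown` and the toy data are hypothetical and only illustrate the idea.

```python
import pandas as pd

def dummies_without_unknown(frame, column, unknown_label="Unknown"):
    """One-hot encode `column` and drop its '<column>_<unknown_label>' level."""
    dummy = pd.get_dummies(frame[column], prefix=column, dtype=int)
    # errors="ignore" keeps the helper usable when the level is absent
    return dummy.drop(columns=f"{column}_{unknown_label}", errors="ignore")

# Hypothetical toy data
toy = pd.DataFrame({"City": ["Mumbai", "Unknown", "Pune", "Mumbai"]})
encoded = dummies_without_unknown(toy, "City")
```

Rows whose value was 'Unknown' end up with zeros in every dummy column, so each coefficient is read relative to "not specified".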

In [None]:
leads.shape

#### Dropping the columns for which we have created dummy variables

In [None]:
leads = leads.drop(['Lead Quality','Asymmetrique Profile Score','Asymmetrique Activity Score','Last Activity', 
                    'What is your current occupation', 'Lead Profile','Specialization','City','Tags'],axis=1)

In [None]:
leads.shape

In [None]:
leads.head()

In [None]:
leads.info()

In [None]:
# checking for outliers in the continuous variables

numerical = leads[['TotalVisits','Total Time Spent on Website', 'Page Views Per Visit']]

In [None]:
numerical.describe()

In [None]:
plt.figure(figsize=(20,10))

plt.subplot(2,2,1)
sns.boxplot(numerical['TotalVisits'])

plt.subplot(2,2,2)
sns.boxplot(numerical['Total Time Spent on Website'])

plt.subplot(2,2,3)
sns.boxplot(numerical['Page Views Per Visit'])

In [None]:
# removing outliers using the IQR

Q1 = leads['TotalVisits'].quantile(0.25)
Q3 = leads['TotalVisits'].quantile(0.75)
IQR = Q3 - Q1
leads = leads.loc[(leads['TotalVisits'] >= Q1 - 1.5*IQR) & (leads['TotalVisits'] <= Q3 + 1.5*IQR)]

Q1 = leads['Page Views Per Visit'].quantile(0.25)
Q3 = leads['Page Views Per Visit'].quantile(0.75)
IQR = Q3 - Q1
leads=leads.loc[(leads['Page Views Per Visit'] >= Q1 - 1.5*IQR) & (leads['Page Views Per Visit'] <= Q3 + 1.5*IQR)]

In [None]:
plt.figure(figsize=(20,10))

plt.subplot(2,2,1)
sns.boxplot(leads['TotalVisits'])

plt.subplot(2,2,2)
sns.boxplot(leads['Total Time Spent on Website'])

plt.subplot(2,2,3)
sns.boxplot(leads['Page Views Per Visit'])

In [None]:
leads.shape

We have removed most of the outliers and so we can proceed with model building
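
The IQR rule applied above can be wrapped in a small helper so the same bounds logic is not repeated per column. This is a sketch on made-up numbers; `iqr_bounds` is a hypothetical name, and the factor 1.5 is the same one the notebook uses.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical toy column with one extreme value
toy = pd.DataFrame({"TotalVisits": [0, 1, 2, 3, 4, 5, 6, 7, 8, 100]})
low, high = iqr_bounds(toy["TotalVisits"])

# Keep only the rows inside the fences (both ends inclusive)
kept = toy[toy["TotalVisits"].between(low, high)]
```

Note that filtering one column changes the quartiles of the others, so the order in which columns are filtered can slightly affect which rows survive.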

In [None]:
# Lets look at the head of the dataframe again
leads.head()

In [None]:
# Lets look at the info of the dataframe again
leads.info()

## Splitting the data into Training and Test datasets

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = leads.drop(['Lead Number', 'Converted'], axis = 1)
y = leads['Converted']

In [None]:
X.head()

In [None]:
y.head()

In [None]:
# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=100)

## Feature Scaling

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()

In [None]:
X_train[['TotalVisits','Total Time Spent on Website','Page Views Per Visit']] = scaler.fit_transform(X_train[['TotalVisits','Total Time Spent on Website','Page Views Per Visit']])
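
The important detail in this step is that the scaling statistics come from the training data only and are reused later on the test data. The sketch below shows the same arithmetic with plain NumPy on made-up numbers; it mirrors what `StandardScaler` computes (mean and population standard deviation) but is not the notebook's code.

```python
import numpy as np

train = np.array([[1.0], [2.0], [3.0], [4.0]])  # hypothetical training column
test = np.array([[2.5], [10.0]])                # hypothetical test column

# Statistics are estimated on the training data only
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # population standard deviation (ddof=0)

train_scaled = (train - mu) / sigma
# The test data reuses the training statistics; nothing is re-fitted
test_scaled = (test - mu) / sigma
```

Fitting the scaler on the test set as well would leak information about the test distribution into the model evaluation.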

In [None]:
X_train.head()

In [None]:
y.head()

In [None]:
round((y.sum()/len(y))*100,2) 

As we can see, the conversion rate in the cleaned data is about 38%

# Model building

In [None]:
import statsmodels.api as sm

In [None]:
# logistic regression model

logm1 = sm.GLM(y_train,(sm.add_constant(X_train)), family=sm.families.Binomial())
logm1.fit().summary()

## Feature selection using RFE

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
logreg = LogisticRegression()

In [None]:
from sklearn.feature_selection import RFE
rfe = RFE(logreg, n_features_to_select=20) # running RFE with 20 variables (keyword argument required in recent scikit-learn)
rfe = rfe.fit(X_train,y_train)

In [None]:
rfe.support_

In [None]:
list(zip(X_train.columns, rfe.support_, rfe.ranking_))

In [None]:
col = X_train.columns[rfe.support_]

In [None]:
X_train.columns[~rfe.support_] # columns that RFE did not select
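
For readers unfamiliar with recursive feature elimination, here is a minimal sketch on synthetic data, assuming scikit-learn is installed. The arrays `X_toy` and `y_toy` are made up; only the first two columns influence the label, and RFE is asked to keep two features.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: six features, of which only the first two drive the label
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 6))
y_toy = (X_toy[:, 0] - X_toy[:, 1] > 0).astype(int)

# RFE repeatedly fits the model and discards the weakest feature
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2)
selector.fit(X_toy, y_toy)

# Positions of the features that survived, and the elimination ranking
selected = np.flatnonzero(selector.support_)
ranking = selector.ranking_
```

`support_` is a boolean mask over the input columns and `ranking_` assigns 1 to every retained feature, which is what the `zip` listing in the notebook displays.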

#### Assessing the model with StatsModels

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logm2 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm2.fit()
res.summary()

In [None]:
# Getting the predicted values on the train set
y_train_pred = res.predict(X_train_sm)
y_train_pred[:10]

In [None]:
y_train_pred = y_train_pred.values.reshape(-1)
y_train_pred[:10]

#### Creating a dataframe with the actual conversion flag and the predicted probabilities

In [None]:
y_train_pred_final = pd.DataFrame({'Converted':y_train.values, 'Conversion_Prob':y_train_pred})
y_train_pred_final['LeadID'] = y_train.index
y_train_pred_final.head()

#### Creating new column 'Predicted' with 1 if Conversion_Prob > 0.5 else 0

In [None]:
y_train_pred_final['Predicted'] = y_train_pred_final.Conversion_Prob.map(lambda x: 1 if x > 0.5 else 0)
y_train_pred_final.head()

In [None]:
from sklearn import metrics

In [None]:
# Confusion matrix 
confusion = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.Predicted)
print(confusion)

In [None]:
# Let's check the overall accuracy.
print(metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final.Predicted))

Our model has about 92% accuracy on the training set at the 0.5 cutoff

### Checking VIFs

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

##### None of our variables has a high VIF, which is good: it indicates that we do not have multicollinearity issues to deal with
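
For reference, the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the other features. The sketch below computes it with plain NumPy on synthetic data. Unlike the cell above, this sketch adds an intercept to each auxiliary regression, so its numbers need not match statsmodels' output exactly; the function name `vif` is hypothetical.

```python
import numpy as np

def vif(X):
    """VIF of each column of X, regressing it on the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid.var() / target.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Synthetic features: `c` is nearly a copy of `a`, `b` is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)
vifs = vif(np.column_stack([a, b, c]))
```

The two nearly collinear columns get a large VIF while the independent one stays close to 1, which is the pattern to look for in the table above.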

The variable 'Tags_Diploma holder (Not Eligible)' has a high P-value, so let's start by dropping that.

In [None]:
col = col.drop('Tags_Diploma holder (Not Eligible)')  # col is an Index, so drop() takes no axis argument
col

In [None]:
# Let's re-run the model using the selected variables
X_train_sm = sm.add_constant(X_train[col])
logm3 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm3.fit()
res.summary()

In [None]:
# Getting the predicted values on the train set
y_train_pred = res.predict(X_train_sm)
y_train_pred[:10]

In [None]:
y_train_pred = y_train_pred.values.reshape(-1)
y_train_pred[:10]

In [None]:
y_train_pred_final = pd.DataFrame({'Converted':y_train.values, 'Conversion_Prob':y_train_pred})
y_train_pred_final['LeadID'] = y_train.index
y_train_pred_final.head()

In [None]:
y_train_pred_final['Predicted'] = y_train_pred_final.Conversion_Prob.map(lambda x: 1 if x > 0.5 else 0)
y_train_pred_final.head()

In [None]:
# Confusion matrix 
confusion = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.Predicted)
print(confusion)

In [None]:
# Let's check the overall accuracy.
print(metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final.Predicted))

So the overall accuracy hasn't dropped after dropping the 'Tags_Diploma holder (Not Eligible)' column 

In [None]:
# Checking VIFs again
vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

The variable 'Tags_wrong number given' has a very high P-value, so we will drop it

In [None]:
col = col.drop('Tags_wrong number given')

# Let's re-run the model using the selected variables
X_train_sm = sm.add_constant(X_train[col])
logm4 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm4.fit()
res.summary()

In [None]:
y_train_pred = res.predict(X_train_sm)
y_train_pred[:10]

In [None]:
y_train_pred = y_train_pred.values.reshape(-1)
y_train_pred[:10]

In [None]:
y_train_pred_final = pd.DataFrame({'Converted':y_train.values, 'Conversion_Prob':y_train_pred})
y_train_pred_final['LeadID'] = y_train.index
y_train_pred_final.head()

In [None]:
y_train_pred_final['Predicted'] = y_train_pred_final.Conversion_Prob.map(lambda x: 1 if x > 0.5 else 0)
y_train_pred_final.head()

In [None]:
# Confusion matrix 
confusion = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.Predicted)
print(confusion)

In [None]:
# Let's check the overall accuracy.
print(metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final.Predicted))

Again the accuracy hasn't dropped after dropping the 'Tags_wrong number given' feature column

In [None]:
# Checking VIFs again
vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

The variable 'Tags_number not provided' has a very high P-value, so we will drop it

In [None]:
col = col.drop('Tags_number not provided')

# Let's re-run the model using the selected variables
X_train_sm = sm.add_constant(X_train[col])
logm5 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm5.fit()
res.summary()

In [None]:
y_train_pred = res.predict(X_train_sm)
y_train_pred[:10]

In [None]:
y_train_pred = y_train_pred.values.reshape(-1)
y_train_pred[:10]

In [None]:
y_train_pred_final = pd.DataFrame({'Converted':y_train.values, 'Conversion_Prob':y_train_pred})
y_train_pred_final['LeadID'] = y_train.index
y_train_pred_final.head()

In [None]:
y_train_pred_final['Predicted'] = y_train_pred_final.Conversion_Prob.map(lambda x: 1 if x > 0.5 else 0)
y_train_pred_final.head()

In [None]:
# Confusion matrix 
confusion = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.Predicted)
print(confusion)

In [None]:
# Let's check the overall accuracy.
print(metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final.Predicted))

Again the model accuracy hasn't decreased after removing the variable 'Tags_number not provided'

In [None]:
# Checking VIFs again
vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

#### 1. The P-values of nearly all our remaining variables are now close to zero, which indicates that these variables are statistically significant, so we do not need to drop any more feature variables
#### 2. The accuracy of our model has stayed at around 91.6% even after removing these feature columns

In [None]:
# correlation matrix 
plt.figure(figsize = (20,10),dpi=200)  
sns.heatmap(X_train[col].corr(),annot = True)
# save before plt.show(): once the figure has been shown, savefig would write out an empty canvas
plt.savefig('corr.png')
plt.show()

# Metrics beyond simply accuracy

In [None]:
TP = confusion[1,1] # true positive 
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives

In [None]:
# Let's see the sensitivity of our logistic regression model
TP / float(TP+FN)

In [None]:
# Let us calculate specificity
TN / float(TN+FP)

In [None]:
# Calculate false positive rate - predicting conversion when the lead has not converted
print(FP/ float(TN+FP))

In [None]:
# positive predictive value 
print (TP / float(TP+FP))

In [None]:
# Negative predictive value
print (TN / float(TN+ FN))
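
The five ratios computed in the cells above all come from the same four counts. As a compact illustration on a made-up confusion matrix (rows are actual classes, columns are predicted classes, as returned by `metrics.confusion_matrix`):

```python
import numpy as np

# Hypothetical confusion matrix: rows = actual (0, 1), columns = predicted (0, 1)
cm = np.array([[90, 10],
               [20, 80]])
TN, FP, FN, TP = cm.ravel()

sensitivity = TP / (TP + FN)          # share of actual conversions that were caught
specificity = TN / (TN + FP)          # share of non-conversions correctly rejected
false_positive_rate = FP / (FP + TN)  # equals 1 - specificity
ppv = TP / (TP + FP)                  # positive predictive value (precision)
npv = TN / (TN + FN)                  # negative predictive value
```

For lead scoring, sensitivity tells the sales team how many real converters would be flagged, while the positive predictive value tells them how many of the flagged leads are worth a call.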

# Plotting the ROC Curve

In [None]:
def draw_roc( actual, probs ):
    fpr, tpr, thresholds = metrics.roc_curve( actual, probs,
                                              drop_intermediate = False )
    auc_score = metrics.roc_auc_score( actual, probs )
    plt.figure(figsize=(5, 5))
    plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic (ROC) curve')
    plt.legend(loc="lower right")
    plt.show()

    return None

In [None]:
fpr, tpr, thresholds = metrics.roc_curve( y_train_pred_final.Converted, y_train_pred_final.Conversion_Prob, 
                                         drop_intermediate = False )

In [None]:
draw_roc(y_train_pred_final.Converted, y_train_pred_final.Conversion_Prob)

##### Area under the ROC curve is 0.97

# Finding Optimal Cutoff Point

In [None]:
# Let's create columns with different probability cutoffs 
numbers = [float(x)/10 for x in range(10)]
for i in numbers:
    y_train_pred_final[i]= y_train_pred_final.Conversion_Prob.map(lambda x: 1 if x > i else 0)
y_train_pred_final.head(10)

In [None]:
# Now let's calculate accuracy sensitivity and specificity for various probability cutoffs.
cutoff_df = pd.DataFrame( columns = ['prob','accuracy','sensi','speci'])
from sklearn.metrics import confusion_matrix

# TP = confusion[1,1] # true positive 
# TN = confusion[0,0] # true negatives
# FP = confusion[0,1] # false positives
# FN = confusion[1,0] # false negatives

num = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
for i in num:
    cm1 = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final[i] )
    total1=sum(sum(cm1))
    accuracy = (cm1[0,0]+cm1[1,1])/total1
    
    speci = cm1[0,0]/(cm1[0,0]+cm1[0,1])
    sensi = cm1[1,1]/(cm1[1,0]+cm1[1,1])
    cutoff_df.loc[i] =[ i ,accuracy,sensi,speci]
print(cutoff_df)

In [None]:
# Let's plot accuracy sensitivity and specificity for various probabilities.

cutoff_df.plot.line(x='prob', y=['accuracy','sensi','speci'])
plt.vlines(x=0.34, ymax=1, ymin=0, colors="r", linestyles="--")
plt.show()

#### From the above curve, 0.34 seems to be the optimum point to take as the cutoff probability
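
The notebook reads the cutoff off the plot. One possible way to pick it numerically, shown here only as a sketch on synthetic scores, is to take the cutoff at which sensitivity and specificity are closest to each other. The helper name and the data below are hypothetical, and the result for these made-up scores has no bearing on the 0.34 found above.

```python
import numpy as np

def sensitivity_specificity(actual, probs, cutoff):
    """Sensitivity and specificity when probabilities above `cutoff` are labelled 1."""
    predicted = probs > cutoff
    tp = np.sum(predicted & (actual == 1))
    tn = np.sum(~predicted & (actual == 0))
    return tp / np.sum(actual == 1), tn / np.sum(actual == 0)

# Made-up labels and scores: positives score 0.35 higher on average
rng = np.random.default_rng(0)
actual = rng.integers(0, 2, size=1000)
probs = 0.35 * actual + rng.uniform(0.0, 0.65, size=1000)

# Scan a grid of cutoffs and record the gap between the two metrics
cutoffs = np.linspace(0.05, 0.95, 19)
gaps = []
for cutoff in cutoffs:
    sens, spec = sensitivity_specificity(actual, probs, cutoff)
    gaps.append(abs(sens - spec))

best_cutoff = float(cutoffs[int(np.argmin(gaps))])
```

A finer grid, or a different criterion such as maximising sensitivity subject to a minimum precision, may suit the business goal better than the balance point.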

In [None]:
y_train_pred_final['final_predicted'] = y_train_pred_final.Conversion_Prob.map( lambda x: 1 if x > 0.34 else 0)
y_train_pred_final.head()

In [None]:
# Let's check the overall accuracy.
metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final.final_predicted)

In [None]:
# Confusion matrix
confusion2 = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.final_predicted)
confusion2

In [None]:
TP = confusion2[1,1] # true positive 
TN = confusion2[0,0] # true negatives
FP = confusion2[0,1] # false positives
FN = confusion2[1,0] # false negatives

In [None]:
# Let's see the sensitivity of our logistic regression model
TP / float(TP+FN)

In [None]:
# Let us calculate specificity
TN / float(TN+FP)

In [None]:
# Calculate false positive rate - predicting conversion when the lead has not converted
print(FP/ float(TN+FP))

In [None]:
# Positive predictive value 
print (TP / float(TP+FP))

In [None]:
# Negative predictive value
print (TN / float(TN+ FN))

# Precision and Recall

Precision: TP / (TP + FP)

In [None]:
confusion2[1,1]/(confusion2[0,1]+confusion2[1,1])

Recall: TP / (TP + FN)

In [None]:
confusion2[1,1]/(confusion2[1,0]+confusion2[1,1])

In [None]:
from sklearn.metrics import precision_score, recall_score

In [None]:
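# using the 0.34-cutoff predictions so the result matches the manual calculation from confusion2 above
precision_score(y_train_pred_final.Converted, y_train_pred_final.final_predicted)

In [None]:
recall_score(y_train_pred_final.Converted, y_train_pred_final.final_predicted)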

In [None]:
from sklearn.metrics import precision_recall_curve

In [None]:
p, r, thresholds = precision_recall_curve(y_train_pred_final.Converted, y_train_pred_final.Conversion_Prob)

In [None]:
plt.plot(thresholds, p[:-1], "g-")
plt.plot(thresholds, r[:-1], "r-")
plt.show()

In [None]:
from sklearn.metrics import classification_report

In [None]:
print(classification_report(y_train_pred_final.Converted, y_train_pred_final.final_predicted))

# Making predictions on the test set

In [None]:
X_test[['TotalVisits','Total Time Spent on Website','Page Views Per Visit']] = scaler.transform(X_test[['TotalVisits','Total Time Spent on Website','Page Views Per Visit']])
X_test.head()

In [None]:
X_test = X_test[col]
X_test.head()

In [None]:
# adding constant for statsmodel
X_test_sm = sm.add_constant(X_test)

In [None]:
# making prediction on the test set
y_test_pred = res.predict(X_test_sm)

In [None]:
y_test_pred[:10]

In [None]:
# Converting y_test_pred, which is a Series, to a dataframe
y_pred = pd.DataFrame(y_test_pred)

In [None]:
y_pred.head()

In [None]:
# Converting y_test to dataframe
y_test_df = pd.DataFrame(y_test)

In [None]:
y_test_df.head()

In [None]:
# Copying the index into a LeadID column
y_test_df['LeadID'] = y_test_df.index
y_test_df.head()

In [None]:
# concatenating the predictions and the original labels
y_pred_final = pd.concat([y_test_df, y_pred],axis=1)

In [None]:
y_pred_final.head()

In [None]:
# Renaming the column 
y_pred_final= y_pred_final.rename(columns={ 0 : 'Conversion_Prob'})

In [None]:
# Rearranging the columns
y_pred_final = y_pred_final[['LeadID','Converted','Conversion_Prob']]

In [None]:
y_pred_final.head()

In [None]:
y_pred_final['Predicted'] = y_pred_final.Conversion_Prob.map(lambda x: 1 if x > 0.34 else 0)

In [None]:
y_pred_final.head()

In [None]:
# Let's check the overall accuracy.
accuracy_score=metrics.accuracy_score(y_pred_final.Converted, y_pred_final.Predicted)
accuracy_score

#### Confusion matrix

In [None]:
confusion_test_set = metrics.confusion_matrix(y_pred_final.Converted, y_pred_final.Predicted)
print(confusion_test_set)

In [None]:
TP = confusion_test_set[1,1] # true positive 
TN = confusion_test_set[0,0] # true negatives
FP = confusion_test_set[0,1] # false positives
FN = confusion_test_set[1,0] # false negatives

#### Sensitivity

In [None]:
# Let's see the sensitivity of our logistic regression model
TP / float(TP+FN)

#### Specificity

In [None]:
# Let us calculate specificity
TN / float(TN+FP)

#### False Positive Rate

In [None]:
# Calculate false positive rate - predicting conversion when the lead has not converted
print(FP/ float(TN+FP))

#### Positive Predictive Value

In [None]:
# Positive predictive value 
print (TP / float(TP+FP))

#### Negative Predictive Value

In [None]:
# Negative predictive value
print (TN / float(TN+ FN))

#### Precision

In [None]:
#precision
confusion_test_set[1,1]/(confusion_test_set[0,1]+confusion_test_set[1,1])

#### Recall

In [None]:
#recall
confusion_test_set[1,1]/(confusion_test_set[1,0]+confusion_test_set[1,1])

#### Classification Report

In [None]:
from sklearn.metrics import classification_report

In [None]:
print(classification_report(y_pred_final.Converted, y_pred_final.Predicted))

#### Precision recall curve

In [None]:
from sklearn.metrics import precision_recall_curve

In [None]:
p, r, thresholds = precision_recall_curve(y_pred_final.Converted, y_pred_final.Conversion_Prob)

In [None]:
plt.plot(thresholds, p[:-1], "g-")
plt.plot(thresholds, r[:-1], "r-")
plt.show()

#### Plotting the ROC Curve for Test Dataset

In [None]:
def draw_roc( actual, probs ):
    fpr, tpr, thresholds = metrics.roc_curve( actual, probs,
                                              drop_intermediate = False )
    auc_score = metrics.roc_auc_score( actual, probs )
    plt.figure(figsize=(5, 5))
    plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic (ROC) curve')
    plt.legend(loc="lower right")
    plt.show()

    return fpr,tpr, thresholds


In [None]:
fpr, tpr, thresholds = metrics.roc_curve(y_pred_final.Converted, y_pred_final.Conversion_Prob, drop_intermediate = False)

In [None]:
draw_roc(y_pred_final.Converted, y_pred_final.Conversion_Prob)

##### Area under the ROC curve is around 0.96 which means our model seems to be doing well on the test set as well

In [None]:
y_pred_final.head()

In [None]:
y_pred_final['Lead Score'] = y_pred_final['Conversion_Prob']*100
y_pred_final.head()
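
As a small illustration of how the score relates to the cutoff chosen earlier, the sketch below turns made-up probabilities into a 0 to 100 lead score and flags the leads above 0.34. Rounding the score to an integer and the column name `Hot Lead` are assumptions of this sketch, not part of the notebook.

```python
import pandas as pd

# Hypothetical predicted probabilities
scored = pd.DataFrame({"Conversion_Prob": [0.05, 0.34, 0.35, 0.92]})

# Lead score on a 0-100 scale (rounded here for readability)
scored["Lead Score"] = (scored["Conversion_Prob"] * 100).round().astype(int)

# A lead is flagged when its probability is strictly above the 0.34 cutoff
scored["Hot Lead"] = (scored["Conversion_Prob"] > 0.34).astype(int)
```

Because the comparison is strict, a lead sitting exactly on the cutoff is not flagged, which matches the `x > 0.34` rule used for `Predicted`.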

In [None]:
y_pred_final = pd.merge(leads[['Lead Number']], y_pred_final,how='inner',left_index=True, right_index=True)

In [None]:
y_pred_final.head()  # test dataset with all the Lead Score values

In [None]:
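# using the 0.34-cutoff predictions so that 'Predicted' means the same thing as in the test-set dataframe
y_train_pred_df = y_train_pred_final[['Converted', 'Conversion_Prob', 'LeadID', 'final_predicted']].rename(columns={'final_predicted': 'Predicted'})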
y_train_pred_df.head()

In [None]:
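# y_train_pred_df has a fresh 0..n-1 index, so join on its 'LeadID' column (the original row label) rather than on its index
y_train_pred_df = pd.merge(leads[['Lead Number']], y_train_pred_df, how='inner', left_index=True, right_on='LeadID')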
y_train_pred_df.head()
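
Because `y_train_pred_final` was built from `y_train.values`, its index runs from 0 to n-1 and the original row labels survive only in the `LeadID` column. The sketch below shows, on made-up data, how an index-on-index merge and a merge keyed on `LeadID` differ in that situation. The frames `leads_toy` and `preds` are hypothetical.

```python
import pandas as pd

# Hypothetical frame whose index has gaps, as happens after rows are dropped
leads_toy = pd.DataFrame({"Lead Number": [1001, 1002, 1003]}, index=[0, 5, 9])

# Predictions rebuilt from .values: fresh 0..n-1 index, row labels kept in 'LeadID'
preds = pd.DataFrame({"LeadID": [9, 0, 5], "Conversion_Prob": [0.9, 0.1, 0.5]})

# Index-on-index: matches row label 0 with position 0, and silently loses the rest
by_position = pd.merge(leads_toy, preds, how="inner",
                       left_index=True, right_index=True)

# Index-on-LeadID: every prediction is paired with the lead it belongs to
by_lead_id = pd.merge(leads_toy, preds, how="inner",
                      left_index=True, right_on="LeadID")
```

In the positional merge the single surviving row pairs lead 1001 with the prediction that belongs to row label 9, so comparing row counts before and after a merge is a cheap sanity check.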

In [None]:
y_train_pred_df['Lead Score'] = y_train_pred_df['Conversion_Prob']*100

In [None]:
y_train_pred_df.head()     # train dataset with all the Lead Score values

### Final dataframe with all the Lead Scores

In [None]:
final_df_lead_score = pd.concat([y_train_pred_df,y_pred_final],axis=0)
final_df_lead_score.head()

In [None]:
final_df_lead_score = final_df_lead_score.set_index('LeadID')

final_df_lead_score = final_df_lead_score[['Lead Number','Converted','Conversion_Prob','Predicted','Lead Score']]

## Final dataframe with the Lead Scores for all the LeadID

In [None]:
final_df_lead_score.head()  # final dataframe with all the Lead Scores

In [None]:
final_df_lead_score.shape

## Determining Feature Importance of our final model

In [None]:
# coefficients of our final model 

pd.options.display.float_format = '{:.2f}'.format
new_params = res.params[1:]
new_params

In [None]:
# Getting a relative coefficient value for all the features wrt the feature with the highest coefficient

feature_importance = new_params
feature_importance = 100.0 * (feature_importance / feature_importance.max())
feature_importance
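
Dividing by the largest coefficient ranks features by their signed effect, so a strongly negative coefficient ends up at the bottom even though it influences the prediction a great deal. The sketch below, on made-up coefficients, contrasts that with a ranking by absolute value. The absolute-value variant is an alternative shown for comparison, not what the notebook does.

```python
import pandas as pd

# Hypothetical model coefficients
coefs = pd.Series({"feature_a": 4.0, "feature_b": -6.0, "feature_c": 2.0})

# Signed version, as in the notebook: relative to the largest positive coefficient
relative_signed = 100.0 * coefs / coefs.max()

# Alternative: relative to the largest magnitude, regardless of sign
relative_abs = (100.0 * coefs.abs() / coefs.abs().max()).sort_values(ascending=False)
```

The signed ranking answers "which variables push a lead towards conversion the most", while the absolute ranking answers "which variables move the prediction the most in either direction".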

In [None]:
# Sorting the feature variables based on their relative coefficient values

sorted_idx = np.argsort(feature_importance.values, kind='quicksort')

#### Top three variables in your model which contribute most towards the probability of a lead getting converted

In [None]:
feature_importance_df = pd.DataFrame(feature_importance).reset_index().sort_values(by=0,ascending=False)
feature_importance_df = feature_importance_df.rename(columns={'index':'Variables', 0:'Relative coefficient value'})
feature_importance_df = feature_importance_df.reset_index(drop=True)
feature_importance_df.head(3)

#### The top 3 variables are:
1. Tags_Lost to EINS
2. Tags_Closed by Horizzon
3. Lead Source_Welingak Website