# Lead Scoring Case Study

## Goals of the Case Study

This case study has two main goals.

1. Build a logistic regression model to assign a lead score between 0 and 100 to each of the leads which can be used by the company to target potential leads. A higher score would mean that the lead is hot, i.e. is most likely to convert whereas a lower score would mean that the lead is cold and will mostly not get converted.

2. The company has posed some additional problems that your model should be able to adapt to if the company's requirements change in the future, so you will need to handle these as well. These problems are provided in a separate doc file. Please fill it in based on the logistic regression model you built in the first step. Also, make sure you include this in your final PPT, where you'll make recommendations.

## Results Expected

1. A well-commented Jupyter notebook with at least the logistic regression model, the conversion predictions and evaluation metrics.
2. The word document filled with solutions to all the problems.
3. The overall approach of the analysis in a presentation
    i. Mention the problem statement and the analysis approach briefly 
    ii. Explain the results in business terms
    iii. Include visualisations and summarise the most important results in the presentation
4. A brief summary report in 500 words explaining how you proceeded with the assignment and the learnings that you gathered.

## Step 1: Importing the data

In [None]:
# Suppressing Warnings
import warnings
warnings.filterwarnings('ignore')

# Importing Pandas and NumPy
import pandas as pd, numpy as np

# Importing sklearn utilities
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import precision_recall_curve

# Importing matplotlib and seaborn
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Importing statsmodel
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
# Importing the dataset
leads = pd.read_csv("../input/leadscore/Leads.csv")
leads.head()

## Step 2: Inspecting the Data Frame

In [None]:
# Dimensions of dataset
leads.shape

In [None]:
# Statistic of dataset
leads.describe()

In [None]:
# Type of each column
leads.info()

## Step 3: Data Cleaning

In [None]:
# We observe that certain columns have the value "Select".
# This seems to be the default value in case the visitor doesn't select anything.
# Hence we replace it with NaN

leads = leads.replace("Select", np.nan)

In [None]:
leads.head()

In [None]:
# Checking null values in all the columns
leads.isnull().sum()

In [None]:
# Checking the percentage of null values in all columns
round(100 * (leads.isnull().sum()/len(leads.index)), 2)

In [None]:
# Drop the columns with more than 70% null values
null_pct = round(100 * (leads.isnull().sum()/len(leads.index)), 2)
leads = leads.drop(columns=null_pct[null_pct > 70].index)
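
The null-percentage filter can be tried out on a small frame first. The sketch below is only an illustration: the `toy` frame, its column names and its values are made up and are not part of the leads data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the leads data (hypothetical values).
toy = pd.DataFrame({
    "City": ["Mumbai", np.nan, "Thane", "Mumbai"],
    "How did you hear": [np.nan, np.nan, np.nan, "Online Search"],
    "Converted": [0, 1, 0, 1],
})

# isnull().mean() gives the fraction of missing values per column.
null_pct = toy.isnull().mean() * 100

# Drop the columns whose missing share exceeds the threshold.
threshold = 70
cols_to_drop = null_pct[null_pct > threshold].index
trimmed = toy.drop(columns=cols_to_drop)

print(null_pct.round(2))
print(list(trimmed.columns))
```

Using `drop(columns=...)` avoids the positional `axis` argument, which recent pandas releases no longer accept.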

In [None]:
# Checking the percentage of null values in all columns again
round(100 * (leads.isnull().sum()/len(leads.index)), 2)

In [None]:
leads['Lead Quality'].describe()

In [None]:
sns.countplot(x=leads['Lead Quality'])

In [None]:
# We have 51% NULL values here so we need to replace the NULL values.
# "Not Sure" seems to be the most neutral value
leads['Lead Quality'] = leads['Lead Quality'].replace(np.nan, "Not Sure")

In [None]:
sns.countplot(x=leads['Lead Quality'])

In [None]:
# Plotting Asymmetrique Activity Index, Asymmetrique Profile Index, Asymmetrique Activity Score, Asymmetrique Profile Score
 
plt.figure(figsize=(20,10))

plt.subplot(2,2,1)
sns.countplot(x=leads['Asymmetrique Activity Index'])

plt.subplot(2,2,2)
sns.countplot(x=leads['Asymmetrique Profile Index'])

plt.subplot(2,2,3)
sns.boxplot(x=leads['Asymmetrique Activity Score'])

plt.subplot(2,2,4)
sns.boxplot(x=leads['Asymmetrique Profile Score'])

plt.show()

In [None]:
leads['Asymmetrique Activity Index'].value_counts()

In [None]:
leads['Asymmetrique Profile Index'].value_counts()

In [None]:
leads['Asymmetrique Activity Score'].value_counts()

In [None]:
leads['Asymmetrique Profile Score'].value_counts()

In [None]:
# These four columns show considerable variation and about 45% of their values are NULL,
# so we cannot settle on a reliable value to impute.
# Hence we drop these columns.

leads = leads.drop(['Asymmetrique Activity Index','Asymmetrique Profile Index','Asymmetrique Activity Score','Asymmetrique Profile Score'], axis=1)

In [None]:
# City has now the highest NULL values

leads["City"].describe()

In [None]:
plt.figure(figsize=(10,8))
sns.countplot(x=leads["City"])


In [None]:
# Since Mumbai is the most frequent value in the data set, we replace NULL values with Mumbai
leads["City"] = leads["City"].replace(np.nan, "Mumbai")

In [None]:
plt.figure(figsize=(10,8))
sns.countplot(x=leads["Tags"])
plt.xticks(rotation = 90)

In [None]:
# "Will revert after reading the email" has the highest count of all the tags so we replace NULL values with that

leads["Tags"] = leads["Tags"].replace(np.nan, "Will revert after reading the email")

In [None]:
leads["Tags"].describe()

In [None]:
plt.figure(figsize=(10,8))
sns.countplot(x=leads["Tags"])
plt.xticks(rotation = 90)

In [None]:
plt.figure(figsize=(10,6))
sns.countplot(x=leads["Specialization"])
plt.xticks(rotation = 90)

In [None]:
leads["Specialization"].value_counts()

In [None]:
leads["Specialization"].describe()

In [None]:
# "Finance Management" has the highest count but not a very high proportion overall, so we replace NULL with "Others"
leads["Specialization"] = leads["Specialization"].replace(np.nan, "Others")

In [None]:
plt.figure(figsize=(10,6))
sns.countplot(x=leads["Specialization"])
plt.xticks(rotation = 90)

In [None]:
leads["What matters most to you in choosing a course"].describe()


In [None]:
leads["What matters most to you in choosing a course"].value_counts()

In [None]:
# "Better Career Prospects" has the highest count and also, realistically speaking, that is why
# most people will join any course

leads["What matters most to you in choosing a course"] = leads["What matters most to you in choosing a course"].replace(np.nan, "Better Career Prospects")

In [None]:
sns.countplot(x=leads["What matters most to you in choosing a course"])

In [None]:
leads["What is your current occupation"].describe()

In [None]:
leads["What is your current occupation"].value_counts()

In [None]:
# "Unemployed" has a very high count so we can safely replace NULL with "Unemployed"

leads["What is your current occupation"] = leads["What is your current occupation"].replace(np.nan, "Unemployed")

In [None]:
leads["Country"].describe()

In [None]:
leads["Country"].value_counts()

In [None]:
# "India" has a very high count so we can safely replace NULL with "India"

leads["Country"] = leads["Country"].replace(np.nan, "India")

In [None]:
leads["TotalVisits"].describe()

In [None]:
leads["TotalVisits"].value_counts()

In [None]:
# Here 0 has the highest count but not by a large proportion, so we replace NULL with the mean
leads["TotalVisits"] = leads["TotalVisits"].replace(np.nan, leads["TotalVisits"].mean())

In [None]:
leads["Page Views Per Visit"].describe()

In [None]:
leads["Page Views Per Visit"].value_counts()

In [None]:
# Here 0 has the highest count but not by a large proportion, so we replace NULL with the mean

leads["Page Views Per Visit"] = leads["Page Views Per Visit"].replace(np.nan, leads["Page Views Per Visit"].mean())

In [None]:
leads["Last Activity"].describe()

In [None]:
leads["Last Activity"].value_counts()

In [None]:
# "Email Opened" has the highest count and since NULL values are only 1% we can replace them with "Email Opened"

leads["Last Activity"] = leads["Last Activity"].replace(np.nan, "Email Opened")

In [None]:
leads["Lead Source"].describe()

In [None]:
leads["Lead Source"].value_counts()

In [None]:
# "Google" has the highest count and NULL values are only 0.39%, so we can safely replace them with "Google"
leads["Lead Source"] = leads["Lead Source"].replace(np.nan, "Google")

In [None]:
# Checking the percentage of null values in all columns again
round(100 * (leads.isnull().sum()/len(leads.index)), 2)
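
The imputation choices above follow two patterns: the most frequent level for a skewed categorical column, and the mean for a numeric column. The sketch below shows both on a made-up frame (the `toy` data is hypothetical); the notebook itself picks each fill value by hand after inspecting the counts.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of two leads columns.
toy = pd.DataFrame({
    "City": ["Mumbai", "Mumbai", np.nan, "Thane"],
    "TotalVisits": [2.0, np.nan, 4.0, 6.0],
})

# Categorical column: fill with the most frequent level.
toy["City"] = toy["City"].fillna(toy["City"].mode()[0])

# Numeric column: fill with the mean of the observed values.
toy["TotalVisits"] = toy["TotalVisits"].fillna(toy["TotalVisits"].mean())

print(toy)
```

Mode imputation is only reasonable when one level clearly dominates; otherwise a neutral label such as "Others" (as used for Specialization) distorts the distribution less.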

## Step 4: EDA

### Univariate analysis

In [None]:
# Calculating conversion percentage
converted = (sum(leads['Converted'])/len(leads['Converted'].index))*100
converted

#### Lead Origin

In [None]:
sns.countplot(x = "Lead Origin", hue="Converted", data=leads).legend(loc="upper right")
plt.xticks(rotation=90)

#### Observation

- API and Landing Page Submission have conversion rates of approximately 40% and 56% respectively, and the overall counts from these two origins are high
- Lead Add Form has a very high conversion rate but its overall count is very low

**Since we have high conversion counts from API and Landing Page Submissions, we can focus on increasing the conversion rate from these two sources**
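
Conversion rates like the ones quoted above can be read off a count plot only approximately. A grouped summary gives the exact figures; the sketch below assumes a made-up `toy` frame rather than the real leads data.

```python
import pandas as pd

# Hypothetical leads with an origin and a 0/1 conversion flag.
toy = pd.DataFrame({
    "Lead Origin": ["API", "API", "Landing Page Submission",
                    "Landing Page Submission", "Lead Add Form", "API"],
    "Converted": [1, 0, 1, 0, 1, 0],
})

# Per-origin lead count, number converted, and conversion rate in percent.
summary = toy.groupby("Lead Origin")["Converted"].agg(
    leads="count", converted="sum", rate="mean"
)
summary["rate"] = (summary["rate"] * 100).round(1)

print(summary)
```

The same pattern applies to any categorical column examined in this step.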

#### Lead Source


In [None]:
sns.countplot(x = "Lead Source", hue="Converted", data=leads).legend(loc="upper right")
plt.xticks(rotation=90)

In [None]:
# Apparently Google and google are two different categorical values which are actually same so we combine them.
leads["Lead Source"] = leads["Lead Source"].replace(['google'],'Google')

# Also, since other categorical values have negligible counts as compared to the more prominent ones, 
# we can combine all such categories into others
leads["Lead Source"] = leads["Lead Source"].replace(['blog','Pay per Click Ads','bing','Social Media',
                                                     'WeLearn','Click2call','Live Chat','welearnblog_Home','youtubechannel',
                                                    'testone','Press_Release','NC_EDM'],'Others')
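
Listing the rare levels by hand works, but it is easy to miss one. As an alternative sketch, the hypothetical helper `lump_rare` below (not part of the notebook) groups every level that occurs fewer than `min_count` times.

```python
import pandas as pd

def lump_rare(series, min_count, other_label="Others"):
    """Replace categories seen fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; substitute the label everywhere else.
    return series.where(~series.isin(rare), other_label)

# Made-up lead sources for illustration.
source = pd.Series(["Google", "Google", "Direct Traffic",
                    "Direct Traffic", "bing", "blog", "Google"])
lumped = lump_rare(source, min_count=2)

print(lumped.value_counts())
```

The threshold is an assumption to be tuned; the notebook's explicit list remains the safer choice when specific levels must be kept regardless of count.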

In [None]:
sns.countplot(x = "Lead Source", hue="Converted", data=leads).legend(loc="upper right")
plt.xticks(rotation=90)

#### Observations

- Direct Traffic and Google have similar counts, with Google having the highest conversion rate
- Organic Search also has a relatively high conversion rate
- The same goes for Reference, but its overall count is very low

**To increase the overall conversion rate, we can focus on increasing the conversion rates from Google, Direct Traffic, Organic Search and Olark chat**

#### Do Not Email and Do Not Call

In [None]:
plt.figure(figsize=(15,6))

plt.subplot(1,2,1)
sns.countplot(x = "Do Not Email",hue="Converted", data=leads).legend(loc="upper right")

plt.subplot(1,2,2)
sns.countplot(x = "Do Not Call",hue="Converted", data=leads).legend(loc="upper right")

#### Observation

- People who said that they don't want to be Emailed have higher conversion rate than people who said that they wanted to be Emailed
- Same goes for the Do Not Call column as well

#### Total Visits

In [None]:
sns.boxplot(x=leads["TotalVisits"])

**We need to treat the outliers**

In [None]:
leads['TotalVisits'].describe(percentiles=[0.05,.25, .5, .75, .90, .95, .99])

In [None]:
# We will cap the values at the 5th and 95th percentiles
percentile_95 = leads["TotalVisits"].quantile([0.05,0.95]).values

In [None]:
percentile_95


In [None]:
# clip() caps both tails in one step and avoids chained-assignment warnings
leads['TotalVisits'] = leads['TotalVisits'].clip(lower=percentile_95[0], upper=percentile_95[1])
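
Capping at percentiles keeps every row but pulls extreme values in to the chosen bounds. The sketch below shows the effect on a made-up series (`visits` is hypothetical) with one extreme value.

```python
import pandas as pd

# Hypothetical visit counts with a single extreme outlier.
visits = pd.Series([0, 1, 2, 2, 3, 3, 4, 5, 6, 250])

# 5th and 95th percentiles serve as the lower and upper caps.
low, high = visits.quantile([0.05, 0.95])

# Values below `low` become `low`; values above `high` become `high`.
capped = visits.clip(lower=low, upper=high)

print(low, high)
print(capped.tolist())
```

Unlike dropping rows, this leaves the number of observations unchanged, which matters when the same rows are later used for modelling.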

In [None]:
sns.boxplot(x=leads["TotalVisits"])

In [None]:
sns.boxplot(y="TotalVisits",x="Converted",data=leads)

#### Observations

- Medians for converted and not converted leads are almost the same
- People with 0 - 6 visits are seen to be converted, but people with 1 - 4 visits are also seen to not be converted

**So nothing conclusive is observed from this column**

#### Total time spent on website

In [None]:
sns.boxplot(x=leads["Total Time Spent on Website"])

In [None]:
sns.boxplot(y="Total Time Spent on Website",x="Converted",data=leads)

#### Observations

- People spending more time on the website are more likely to be converted

**Keeping the website updated regularly is recommended**

#### Page views per visit

In [None]:
sns.boxplot(x=leads["Page Views Per Visit"])

In [None]:
# Again we have outliers so we cap the data at the 5th and 95th percentiles
percentile_95 = leads["Page Views Per Visit"].quantile([0.05,0.95]).values

# clip() caps both tails in one step and avoids chained-assignment warnings
leads['Page Views Per Visit'] = leads['Page Views Per Visit'].clip(lower=percentile_95[0], upper=percentile_95[1])

In [None]:
sns.boxplot(x=leads["Page Views Per Visit"])

In [None]:
sns.boxplot(y="Page Views Per Visit",x="Converted",data=leads)

#### Observations

- Medians for conversions and non-conversions are the same

**So nothing conclusive can be said here**

#### Last Activity

In [None]:
plt.figure(figsize=(15,6))
sns.countplot(x='Last Activity',hue='Converted',data=leads)
plt.xticks(rotation=90)

In [None]:
#Since certain categorical values have negligible count as compared to the more prominent ones,
# we can combine them into Others category
leads['Last Activity'] = leads['Last Activity'].replace(['Had a Phone Conversation','View in browser link Clicked',
                                                         'Approached upfront','Visited Booth in Tradeshow','Resubscribed to emails'
                                                        ,'Email Received','Email Marked Spam'],"Other_Activity")

In [None]:
plt.figure(figsize=(15,6))
sns.countplot(x='Last Activity',hue='Converted',data=leads)
plt.xticks(rotation=90)

#### Observations

- People getting SMS have the highest conversion rate although their count is second highest, with people who are opening the emails having the highest count

- People having Olark chat conversations are significant in number although their conversion rate is very low

**Focus can be on increasing conversion rates for Email Opened, SMS sent, Olark chat conversations**

#### Country

In [None]:
plt.figure(figsize=(10,8))
sns.countplot(x="Country",hue="Converted",data=leads)
plt.xticks(rotation=90)

In [None]:
# Here as well, there are a lot of countries where the count is negligible so we combine them to other countries
leads["Country"] = leads["Country"].replace(['Russia','Kuwait','Oman','Bahrain','Ghana','Qatar','Saudi Arabia','Belgium',
                                             'France','Sri Lanka','China','Canada','Netherlands','Sweden','Nigeria','Hong Kong',
                                             'Germany','Asia/Pacific Region','Uganda','Kenya','Italy','South Africa','Tanzania'
                                            ,'unknown','Malaysia','Liberia','Switzerland','Denmark','Philipines','Bangladesh',
                                            'Vietnam','Indonesia'],'Other_Country')

In [None]:
plt.figure(figsize=(10,8))
sns.countplot(x="Country",hue="Converted",data=leads)
plt.xticks(rotation=90)


#### Observations

**Not much to conclude as India still has the highest count**

#### Specialization

In [None]:
plt.figure(figsize=(10,8))
sns.countplot(x="Specialization",hue="Converted",data=leads)
plt.xticks(rotation=90)

#### Observation

**We need to focus on specializations having high conversion rates and try to increase them even further i.e. Finance Management, HR Management, Marketing Management, Operations Management etc**

#### Occupation

In [None]:
plt.figure(figsize=(10,8))
sns.countplot(x="What is your current occupation",hue="Converted",data=leads)
plt.xticks(rotation=90)

#### Observations

- Working professionals have high conversion rates although their count is very low
- Unemployed people, although high in number, have a low conversion rate

**Increasing the number of working professionals signing up and increasing the conversion rates of unemployed people will help**

#### What matters most to you in choosing this course


In [None]:
leads['What matters most to you in choosing a course'].describe()

**Since most entries are Better Career prospects, we can't conclude much here**

#### Search

In [None]:
leads['Search'].describe()

**Most entries are No. Nothing to conclude here**

#### Magazine

In [None]:
leads['Magazine'].describe()

**Most entries are No. Nothing to conclude here**

#### Newspaper Article

In [None]:
leads['Newspaper Article'].describe()

**Most entries are No. Nothing to conclude here**

#### X Education Forums

In [None]:
leads['X Education Forums'].describe()

**Most entries are No. Nothing to conclude here**

#### Newspaper

In [None]:
leads['Newspaper'].describe()

**Most entries are No. Nothing to conclude here**

#### Digital Advertisement

In [None]:
leads['Digital Advertisement'].describe()

**Most entries are No. Nothing to conclude here**

#### Through Recommendations

In [None]:
leads['Through Recommendations'].describe()

**Most entries are No. Nothing to conclude here**

#### Receive More Updates About Our Courses

In [None]:
leads['Receive More Updates About Our Courses'].describe()

**Most entries are No. Nothing to conclude here**

#### Tags

In [None]:
plt.figure(figsize=(10,8))
sns.countplot(x="Tags",hue="Converted",data=leads).legend(loc='upper right')
plt.xticks(rotation=90)

In [None]:
# Categorical values with negligible count as compared to the prominent ones can be grouped under others

leads["Tags"] = leads["Tags"].replace(['In confusion whether part time or DLP','in touch with EINS','Diploma holder (Not Eligible)',
                                      'number not provided','opp hangup','Not doing further education','invalid number',
                                       'wrong number given','Still Thinking','Lost to Others','Shall take in the next coming month',
                                       'Lateral student','Interested in Next batch','Recognition issue (DEC approval)',
                                       'Want to take admission but has financial problems','University not recognized'],'Other_tags')

In [None]:
plt.figure(figsize=(10,8))
sns.countplot(x="Tags",hue="Converted",data=leads).legend(loc='upper right')
plt.xticks(rotation=90)

#### Observations

- People who say they will revert after reading the email have the highest conversion rates
- People who have been called and are not picking up are high in count but very low in terms of conversion rates. Same goes for people interested in other courses

#### Lead Quality

In [None]:
leads['Lead Quality'].describe()

In [None]:
plt.figure(figsize=(15,6))
sns.countplot(x='Lead Quality',hue='Converted',data=leads)

#### Observations

- The highest count is where the lead quality can't be determined, hence the low conversion rate there.
- Proportionally speaking, the highest conversion rate is for the lead quality "High in Relevance" but its count is very low
- The "Might be" lead quality also has a high conversion rate

**Getting high quality leads would be important**

#### Update me on Supply Chain Content

In [None]:
leads["Update me on Supply Chain Content"].describe()

**Most entries are No. Nothing to conclude here**

#### Get updates on DM Content

In [None]:
leads['Get updates on DM Content'].describe()

**Most entries are No. Nothing to conclude here**

#### City

In [None]:
leads["City"].describe()

In [None]:
plt.figure(figsize=(15,6))
sns.countplot(x='City',hue='Converted',data=leads)

#### Observations

- Mumbai has the highest count of people registering for courses and a decent conversion rate of around 50%
- Thane and outskirts actually has a higher conversion rate but a very low count
- Same goes for other cities

**Focus can be on getting more people from Mumbai to register and on increasing their conversion rate**

#### I agree to pay the amount through cheque

In [None]:
leads["I agree to pay the amount through cheque"].describe()

**Most entries are No. Nothing to conclude here**

#### A free copy of Mastering The Interview

In [None]:
leads['A free copy of Mastering The Interview'].value_counts()

In [None]:
plt.figure(figsize=(15,6))
sns.countplot(x="A free copy of Mastering The Interview",hue='Converted',data=leads)

#### Observations

- People who were not interested in getting a free copy of "Mastering the interview" have a higher conversion rate (and count) as compared to people who did opt for a free copy of "Mastering the interview"

#### Last Notable Activity


In [None]:
leads["Last Notable Activity"].describe()

In [None]:
plt.figure(figsize=(15,6))
sns.countplot(x="Last Notable Activity",hue='Converted',data=leads)
plt.xticks(rotation=90)

In [None]:
# We have certain categorical values which have negligible count as compared to the more prominent ones so we combine them into others

leads["Last Notable Activity"] = leads["Last Notable Activity"].replace(['Approached upfront','Resubscribed to emails',
                                                                         'View in browser link Clicked','Form Submitted on Website',
                                                                         'Email Received','Email Marked Spam','Had a Phone Conversation']
                                                                        ,'Other_Activity')

In [None]:
plt.figure(figsize=(15,6))
sns.countplot(x="Last Notable Activity",hue='Converted',data=leads)
plt.xticks(rotation=90)

#### Observations

- The "Modified" value might refer to people who modified their profile on the website (just an assumption); it has the highest count but a very low conversion rate
- "SMS Sent" has a high conversion rate but a low count

**Overall, this column will not really help us make a business decision**

**After univariate analysis, we saw that there are certain columns which will not help us with our analysis, so we drop them**

In [None]:
# List of variables to drop. 
#Dropping these variables as they have most of the values towards one attribute 
#and using them might introduce bias in the model

columns_to_drop = ['Lead Number','Country','Search','Magazine','Newspaper Article','X Education Forums','Newspaper',
                   'Digital Advertisement','Through Recommendations','Receive More Updates About Our Courses',
                   'Update me on Supply Chain Content','Get updates on DM Content','I agree to pay the amount through cheque',
                  'A free copy of Mastering The Interview','What matters most to you in choosing a course']

In [None]:
leads = leads.drop(columns_to_drop, axis=1)

In [None]:
leads.head()

In [None]:
leads.shape

### Data preparation

#### Converting Binary (Yes/No) variables to 1/0

In [None]:
# List of variables to map
binary_var_list = ['Do Not Email','Do Not Call']

# Defining the map function
def binary_map(x):
    return x.map({'Yes': 1, "No": 0})

leads[binary_var_list] = leads[binary_var_list].apply(binary_map)

In [None]:
leads.head()

#### Creating dummies for categorical variables with multiple levels

In [None]:
vars_for_dummies = ['Lead Origin','Lead Source','Last Activity','Specialization','What is your current occupation','Tags',
                    'Lead Quality','City','Last Notable Activity']

dummies_1 = pd.get_dummies(leads[vars_for_dummies],drop_first=True)

dummies_1.head()
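
With `drop_first=True`, a column with k levels becomes k - 1 indicator columns, and the dropped level acts as the baseline. The sketch below shows this on a hypothetical three-level column.

```python
import pandas as pd

# Hypothetical categorical column with three levels.
toy = pd.DataFrame({"City": ["Mumbai", "Thane", "Other", "Mumbai"]})

# Levels are sorted alphabetically; the first one (Mumbai) is dropped
# and becomes the baseline encoded as all zeros.
dummies = pd.get_dummies(toy, columns=["City"], drop_first=True)

print(dummies)
```

Dropping one level avoids perfect collinearity among the indicators, which would otherwise inflate the VIFs checked later.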

In [None]:
leads = pd.concat([leads,dummies_1],axis=1)
leads.head()

**Now we drop the original categorical columns**

In [None]:
leads = leads.drop(vars_for_dummies,axis=1)

In [None]:
leads.head()

#### Train test split

In [None]:
# Defining X and y

X = leads.drop(['Prospect ID','Converted'],axis=1)

X.head()

In [None]:
y = leads['Converted']

y.head()

In [None]:
# Splitting into train and test data set

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=100)

In [None]:
X_train.head()

In [None]:
X_train.shape

## Step 5: Feature scaling

In [None]:
scaler = StandardScaler()

vars_to_scale = ['TotalVisits','Total Time Spent on Website','Page Views Per Visit']

X_train[vars_to_scale] = scaler.fit_transform(X_train[vars_to_scale])

X_train.head()
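
Standard scaling subtracts the training mean and divides by the training standard deviation; the test set must reuse those same training statistics. The NumPy sketch below reproduces that arithmetic on made-up arrays (it is an illustration, not a replacement for `StandardScaler`).

```python
import numpy as np

# Hypothetical single-feature train and test arrays.
train = np.array([[1.0], [2.0], [3.0], [4.0]])
test = np.array([[2.5], [10.0]])

# Statistics are learned from the training data only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), which StandardScaler also uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse train statistics; never refit on test

print(train_scaled.ravel())
print(test_scaled.ravel())
```

This is why the notebook calls `fit_transform` on `X_train` but only `transform` on `X_test`.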

In [None]:
# Checking the conversion rate

Converted = (sum(leads['Converted'])/len(leads['Converted'].index))*100

Converted

**Current conversion rate is 38.5%**

## Step 6: Model Building

In [None]:
# Logistic regression model
X_train_sm = sm.add_constant(X_train)

logis_model_1 = sm.GLM(y_train, X_train_sm, family=sm.families.Binomial()) 

logis_model_1.fit().summary()

#### Feature selection using RFE

In [None]:
logis_reg = LogisticRegression()

# Select 15 variables (n_features_to_select must be passed by keyword in recent scikit-learn)
rfe = RFE(logis_reg, n_features_to_select=15)

rfe = rfe.fit(X_train, y_train)

In [None]:
rfe.support_

In [None]:
list(zip(X_train.columns, rfe.support_, rfe.ranking_))
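
RFE repeatedly fits the estimator and discards the weakest feature until the requested number remains. The sketch below runs it on synthetic data from `make_classification`; the data, sizes and variable names are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, of which 4 are informative.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=4, random_state=0
)

# Keep the 5 features the logistic regression ranks highest.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
selector.fit(X_demo, y_demo)

# support_ flags the kept features; ranking_ is 1 for every kept feature.
kept = [i for i, keep in enumerate(selector.support_) if keep]
print(kept)
print(selector.ranking_)
```

Features with `ranking_` greater than 1 were eliminated, with larger numbers eliminated earlier.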

In [None]:
# Checking the top 15 columns
col_top_15 = X_train.columns[rfe.support_]
col_top_15

In [None]:
X_train.columns[~rfe.support_]

In [None]:
# Logistic regression model with top 15 columns chosen
X_train_sm = sm.add_constant(X_train[col_top_15])

logis_model_2 = sm.GLM(y_train, X_train_sm, family=sm.families.Binomial()) 

res = logis_model_2.fit()

res.summary()

In [None]:
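# Zoom in to read the values
# Exclude non-numeric columns (such as Prospect ID) before computing correlations
plt.figure(figsize=(60,60))
sns.heatmap(leads.select_dtypes(exclude="object").corr(),annot=True,cmap="Spectral_r")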



- This was just a second check to see whether we are missing any correlations that might help us with the model
- There are not many high correlations except ones like Last Activity_Unsubscribed and Last Notable Activity_Unsubscribed. Such correlations are not informative, as the two variables are effectively the same information recorded twice

In [None]:
# Getting the predicted values on the training data set

y_train_pred = res.predict(X_train_sm)
y_train_pred[:10]
    

In [None]:
y_train_pred = y_train_pred.values.reshape(-1)


In [None]:
y_train_pred[:10]

In [None]:
#Creating a final data set with the conversion score
y_train_pred_final = pd.DataFrame({"Converted":y_train.values, "Converted_probability":y_train_pred})

y_train_pred_final["Prospect ID"] = y_train.index

y_train_pred_final.head()

In [None]:
# Creating a new column to predict the conversion of a certain person
y_train_pred_final['predicted'] = y_train_pred_final.Converted_probability.map(lambda x: 1 if x > 0.5 else 0)

y_train_pred_final.head()

#### Confusion matrix

In [None]:
confusion = metrics.confusion_matrix(y_train_pred_final['Converted'], y_train_pred_final['predicted'])

print(confusion)

In [None]:
## Predicted    not_conv    conv
## Actual
## not_conv    3841         161
## conv        362          2104

In [None]:
# Overall model accuracy
print(metrics.accuracy_score(y_train_pred_final['Converted'], y_train_pred_final['predicted']))

## So the model accuracy is 91.91 %

#### Checking VIF

In [None]:
# This dataframe will contain the names of all the feature variables and their respective VIFs

vif = pd.DataFrame()
vif['Features'] = X_train[col_top_15].columns
vif['VIF'] = [variance_inflation_factor(X_train[col_top_15].values, i) for i in range(X_train[col_top_15].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif
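
The VIF of a feature is 1 / (1 - R^2), where R^2 comes from regressing that feature on all the others. The NumPy sketch below computes it directly on synthetic data. It adds an intercept to each auxiliary regression, which is an assumption of this sketch; values from `variance_inflation_factor` may differ when the design matrix passed to it has no constant column.

```python
import numpy as np

def vif(X):
    """VIF per column of X: 1 / (1 - R^2) from regressing it on the other columns."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Design matrix: intercept plus the remaining columns.
        design = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1.0 - resid.var() / target.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Synthetic features: c is nearly a copy of a, b is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)

vifs = vif(np.column_stack([a, b, c]))
print(vifs.round(2))
```

A common rule of thumb treats a VIF above 5 as a sign of problematic collinearity; here the near-duplicate pair stands out while the independent feature stays close to 1.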

#### Sensitivity, specificity, false positive rate, Positive predictive value, negative predictive value

In [None]:
# True positive
TP = confusion[1,1]

# True negative
TN = confusion[0,0]

# False positive
FP = confusion[0,1]

# False negative
FN = confusion[1,0]

print("Sensitivity {}\n".format(((TP)/(TP + FN))))

print("Specificity {}\n".format(((TN)/(TN + FP))))

print("False Positive Rate {}\n".format(((FP)/(TN + FP))))

print("Positive Predictive value {}\n".format(((TP)/(TP + FP))))

print("Negative Predictive value {}\n".format(((TN)/(TN + FN))))

print("True Positive rate {}\n".format(((TP)/(TP + FN))))

print("False Positive rate {}\n".format(((FP)/(TN + FP))))
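
As a worked check, the metrics can be recomputed by hand from the counts in the training confusion matrix shown earlier (cutoff 0.5). The sketch uses only those four numbers.

```python
# Counts copied from the training confusion matrix (cutoff 0.5).
TN, FP = 3841, 161
FN, TP = 362, 2104

accuracy = (TP + TN) / (TP + TN + FP + FN)  # share of all leads classified correctly
sensitivity = TP / (TP + FN)                # share of converted leads that were caught
specificity = TN / (TN + FP)                # share of non-converted leads correctly rejected
precision = TP / (TP + FP)                  # share of predicted conversions that converted

print(round(accuracy, 4), round(sensitivity, 4),
      round(specificity, 4), round(precision, 4))
```

The accuracy works out to about 91.91%, matching the figure reported above, while sensitivity (about 85%) is noticeably lower than specificity (about 96%) at this cutoff.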

## Step 7: Plotting the ROC curve

An ROC curve will help us understand the below things:

- It shows the tradeoff between sensitivity and specificity (any increase in sensitivity will cause a decrease in specificity).
- The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test.
- The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.
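
The points above can be made concrete by tracing the curve by hand: sweep a threshold over the predicted probabilities and record the true and false positive rates at each step. The sketch below does this in NumPy on six made-up leads; `roc_points` is a hypothetical helper, not a library function.

```python
import numpy as np

def roc_points(actual, scores):
    """Return (fpr, tpr) obtained by sweeping a threshold over the scores."""
    actual = np.asarray(actual)
    scores = np.asarray(scores, dtype=float)
    # Start above every score (nothing predicted positive), then step down.
    thresholds = np.r_[np.inf, np.sort(np.unique(scores))[::-1]]
    pos = (actual == 1).sum()
    neg = (actual == 0).sum()
    tpr = np.array([((scores >= t) & (actual == 1)).sum() / pos for t in thresholds])
    fpr = np.array([((scores >= t) & (actual == 0)).sum() / neg for t in thresholds])
    return fpr, tpr

# Hypothetical outcomes and predicted probabilities.
actual = [0, 0, 1, 1, 0, 1]
probs = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

fpr, tpr = roc_points(actual, probs)

# Area under the curve by the trapezoid rule.
auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))
print(round(auc, 4))
```

The curve needs the continuous probabilities as input; feeding it 0/1 predictions collapses it to a single operating point.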

In [None]:
def draw_roc( actual, probs ):
    fpr, tpr, thresholds = metrics.roc_curve( actual, probs,
                                              drop_intermediate = False )
    auc_score = metrics.roc_auc_score( actual, probs )
    plt.figure(figsize=(5, 5))
    plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
    plt.ylabel('True Positive Rate')
    plt.title('ROC curve sensitivity vs specificity')
    plt.legend(loc="lower right")
    plt.show()

    return None

In [None]:
fpr, tpr, thresholds = metrics.roc_curve(y_train_pred_final["Converted"], y_train_pred_final["Converted_probability"], 
                                         drop_intermediate= False)

In [None]:
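# Pass the predicted probabilities (not the 0/1 labels) so the curve covers every threshold
draw_roc(y_train_pred_final["Converted"], y_train_pred_final["Converted_probability"])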

## Step 8: Finding optimal cut-off point

The optimal cutoff probability is the probability at which we get a balance between sensitivity and specificity

In [None]:
# Let's create columns with different probability cutoffs 
numbers = [float(x)/10 for x in range(10)]
for i in numbers:
    y_train_pred_final[i]= y_train_pred_final["Converted_probability"].map(lambda x: 1 if x > i else 0)
y_train_pred_final.head()

In [None]:
# Calculating sensitivity and specificity for various probability cutoffs

cutoff_df = pd.DataFrame(columns=['probability','accuracy','sensitivity','specificity'])

for i in numbers:
    cm1 = metrics.confusion_matrix(y_train_pred_final["Converted"], y_train_pred_final[i])
    total1 = sum(sum(cm1))
    accuracy = (cm1[0,0]+cm1[1,1])/total1
    specificity = cm1[0,0]/(cm1[0,0]+cm1[0,1])
    sensitivity = cm1[1,1]/(cm1[1,0]+cm1[1,1])
    cutoff_df.loc[i] =[ i ,accuracy,sensitivity,specificity]
    
print(cutoff_df)

In [None]:
# plotting accuracy sensitivity and specificity for various probabilities.
cutoff_df.plot.line(x='probability', y=['accuracy','sensitivity','specificity'])
plt.show()

**From the above curve, 0.233 is the optimum cutoff probability, as that is where the accuracy, sensitivity and specificity curves intersect**
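
Reading the crossing point off a plot with steps of 0.1 is imprecise. As a sketch of one alternative, the hypothetical helper `balance_cutoff` below scans a finer grid and returns the cutoff where sensitivity and specificity are closest. The data is synthetic and constructed so that the balance point lies near 0.5; it says nothing about the leads model itself.

```python
import numpy as np

def balance_cutoff(actual, probs, grid=None):
    """Return the cutoff at which |sensitivity - specificity| is smallest."""
    actual = np.asarray(actual)
    probs = np.asarray(probs, dtype=float)
    if grid is None:
        grid = np.linspace(0.01, 0.99, 99)
    best_cutoff, best_gap = None, np.inf
    for c in grid:
        pred = (probs > c).astype(int)
        tp = ((pred == 1) & (actual == 1)).sum()
        tn = ((pred == 0) & (actual == 0)).sum()
        sens = tp / (actual == 1).sum()
        spec = tn / (actual == 0).sum()
        gap = abs(sens - spec)
        if gap < best_gap:
            best_cutoff, best_gap = float(c), gap
    return best_cutoff

# Synthetic example: positives score 0.3 higher on average than negatives.
rng = np.random.default_rng(1)
actual = rng.integers(0, 2, size=1000)
probs = np.clip(0.3 * actual + rng.uniform(0, 0.7, size=1000), 0, 1)

cutoff = balance_cutoff(actual, probs)
print(cutoff)
```

Applied to `y_train_pred_final["Converted"]` and `y_train_pred_final["Converted_probability"]`, the same scan would give a numeric estimate to compare against the value read from the plot.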

In [None]:
y_train_pred_final['final_predicted'] = y_train_pred_final["Converted_probability"].map( lambda x: 1 if x > 0.233 else 0)

y_train_pred_final.head()

In [None]:
y_train_pred_final["Lead_score"] = y_train_pred_final["Converted_probability"].map(lambda x: round(x*100))

y_train_pred_final.head()
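
The lead score is the predicted conversion probability rescaled to 0-100. The sketch below shows the full path from a logistic regression linear predictor (log-odds) to a score; `lead_score` and the example log-odds values are hypothetical.

```python
import math

def lead_score(log_odds):
    """Map a logistic-regression linear predictor to a 0-100 lead score."""
    probability = 1.0 / (1.0 + math.exp(-log_odds))  # logistic (sigmoid) function
    return round(probability * 100)

# Three hypothetical leads: cold, neutral and hot.
scores = [lead_score(z) for z in (-3.0, 0.0, 2.0)]
print(scores)
```

Because the mapping is monotonic, ranking leads by score is the same as ranking them by predicted probability; the cutoff on the score scale is simply the probability cutoff times 100.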

#### Checking model accuracy, confusion matrix and all those metrics again

In [None]:
metrics.accuracy_score(y_train_pred_final["Converted"],y_train_pred_final["final_predicted"])

confusion2 = metrics.confusion_matrix(y_train_pred_final["Converted"],y_train_pred_final["final_predicted"])

# True positive
TP = confusion2[1,1]

# True negative
TN = confusion2[0,0]

# False positive
FP = confusion2[0,1]

# False negative
FN = confusion2[1,0]

print("Sensitivity {}\n".format(((TP)/(TP + FN))))

print("Specificity {}\n".format(((TN)/(TN + FP))))

print("False Positive Rate {}\n".format(((FP)/(TN + FP))))

print("Positive Predictive value {}\n".format(((TP)/(TP + FP))))

print("Negative Predictive value {}\n".format(((TN)/(TN + FN))))

print("True Positive rate {}\n".format(((TP)/(TP + FN))))

print("False Positive rate {}\n".format(((FP)/(TN + FP))))

#### Precision and Recall

In [None]:
print("Precision {}\n".format(((TP)/(TP + FP))))

print("Recall {}\n".format(((TP)/(TP + FN))))

In [None]:
print("Precision {}".format(precision_score(y_train_pred_final["Converted"],y_train_pred_final["final_predicted"])))
print("Recall {}".format(recall_score(y_train_pred_final["Converted"],y_train_pred_final["final_predicted"])))

#### Precision and Recall trade-off

In [None]:
p, r, thresholds = precision_recall_curve(y_train_pred_final["Converted"], y_train_pred_final["Converted_probability"])
                                                             
plt.plot(thresholds, p[:-1], "g-")
plt.plot(thresholds, r[:-1], "r-")
plt.show()

#### Make predictions on the test data set

In [None]:
X_test[vars_to_scale] = scaler.transform(X_test[vars_to_scale])

X_test.head()

In [None]:
X_test = X_test[col_top_15]
X_test.head()

In [None]:
X_test_sm = sm.add_constant(X_test)

y_test_pred = res.predict(X_test_sm)

y_test_pred[:10]

In [None]:
#Convert y_test_pred to a DataFrame
y_pred_1 = pd.DataFrame(y_test_pred)

y_pred_1.head()

In [None]:
#Convert y_test to a DataFrame
y_test_df = pd.DataFrame(y_test)

y_test_df["Prospect ID"] = y_test_df.index

In [None]:
# Remove index for both dataframes to append them side by side 
y_pred_1.reset_index(drop=True, inplace=True)
y_test_df.reset_index(drop=True, inplace=True)

In [None]:
y_pred_final = pd.concat([y_test_df,y_pred_1],axis=1)

y_pred_final.head()

In [None]:
# Rename the last column to show Conversion probability
y_pred_final= y_pred_final.rename(columns={ 0 : 'Converted_probability'})

# Rearrange the columns
y_pred_final = y_pred_final.reindex(['Prospect ID','Converted','Converted_probability'], axis=1)

y_pred_final.head()

In [None]:
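# A cutoff of 0.2 is applied on the test set (note: the cutoff chosen on the training data was 0.233)
y_pred_final['final_predicted'] = y_pred_final["Converted_probability"].map(lambda x: 1 if x > 0.2 else 0)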

In [None]:
y_pred_final.head()

In [None]:
# Overall model accuracy
metrics.accuracy_score(y_pred_final["Converted"], y_pred_final["final_predicted"])

## Model accuracy is 81.5%

In [None]:
confusion_final = metrics.confusion_matrix(y_pred_final["Converted"],y_pred_final["final_predicted"])
confusion_final

In [None]:
# True positive
TP = confusion_final[1,1]

# True negative
TN = confusion_final[0,0]

# False positive
FP = confusion_final[0,1]

# False negative
FN = confusion_final[1,0]

print("Sensitivity {}\n".format(((TP)/(TP + FN))))

print("Specificity {}\n".format(((TN)/(TN + FP))))

print("False Positive Rate {}\n".format(((FP)/(TN + FP))))

print("Positive Predictive value {}\n".format(((TP)/(TP + FP))))

print("Negative Predictive value {}\n".format(((TN)/(TN + FN))))

print("True Positive rate {}\n".format(((TP)/(TP + FN))))

print("False Positive rate {}\n".format(((FP)/(TN + FP))))