### A Beginner's Guide to Very Simple EDA
This notebook is a basic guide to exploring the data and building a simple logistic regression model,
using step-by-step feature elimination together with RFE (recursive feature elimination) for feature selection.
If you have any doubts, please ask in the comments, and if you think something could be done in a better way, do mention that
as well.
Let us now look at the lead scoring problem.



### Problem Statement

An education company named X Education sells online courses to industry professionals. On any given day, many professionals who are interested in the courses land on their website and browse for courses. 

 

The company markets its courses on several websites and search engines like Google. Once these people land on the website, they might browse the courses or fill up a form for the course or watch some videos. When these people fill up a form providing their email address or phone number, they are classified to be a lead. Moreover, the company also gets leads through past referrals. Once these leads are acquired, employees from the sales team start making calls, writing emails, etc. Through this process, some of the leads get converted while most do not. The typical lead conversion rate at X education is around 30%. 

 

Now, although X Education gets a lot of leads, its lead conversion rate is very poor. For example, if, say, they acquire 100 leads in a day, only about 30 of them are converted. To make this process more efficient, the company wishes to identify the most potential leads, also known as ‘Hot Leads’. If they successfully identify this set of leads, the lead conversion rate should go up as the sales team will now be focusing more on communicating with the potential leads rather than making calls to everyone.

X Education has appointed you to help them select the most promising leads, i.e. the leads that are most likely to convert into paying customers. The company requires you to build a model wherein you need to assign a lead score to each of the leads such that the customers with higher lead score have a higher conversion chance and the customers with lower lead score have a lower conversion chance. The CEO, in particular, has given a ballpark of the target lead conversion rate to be around 80%.

#### Data
You have been provided with a leads dataset from the past with around 9000 data points. This dataset consists of various attributes such as Lead Source, Total Time Spent on Website, Total Visits, Last Activity, etc. which may or may not be useful in ultimately deciding whether a lead will be converted or not. The target variable, in this case, is the column ‘Converted’ which tells whether a past lead was converted or not wherein 1 means it was converted and 0 means it wasn’t converted. You can learn more about the dataset from the data dictionary provided in the zip folder at the end of the page.
You also need to check the levels present in the categorical variables. Many of the categorical variables have a level called 'Select', which needs to be handled because it is as good as a null value (think about why).
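One way to make the 'Select' level visible as missing data is to convert it to NaN before counting nulls. The sketch below uses a small made-up frame (`toy` and its values are illustrative, not taken from the leads file); later in this notebook 'Select' is instead mapped to an explicit category for some columns.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the leads data (values are illustrative only)
toy = pd.DataFrame({
    "Specialization": ["Select", "Finance Management", None, "Select"],
    "City": ["Mumbai", "Select", "Thane & Outskirts", None],
})

# 'Select' is the form's default placeholder, so it carries no information
toy_clean = toy.replace("Select", np.nan)

# Count missing values per column after the replacement
missing_counts = toy_clean.isnull().sum()
print(missing_counts)
```

Counting nulls after this step shows how much of each column is really unanswered, which helps decide between imputing, recoding, or dropping.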


### Goal
There are quite a few goals for this case study.

Build a logistic regression model to assign a lead score between 0 and 100 to each of the leads which can be used by the company to target potential leads. A higher score would mean that the lead is hot, i.e. is most likely to convert whereas a lower score would mean that the lead is cold and will mostly not get converted.
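As a rough illustration of how a logistic regression output becomes a 0 to 100 score, the sketch below passes a log-odds value through the logistic function and scales the probability. The function name `lead_score` and the input values are assumptions made for this example only.

```python
import numpy as np

def lead_score(log_odds):
    """Map log-odds to a 0-100 score via the logistic (sigmoid) function."""
    prob = 1.0 / (1.0 + np.exp(-np.asarray(log_odds, dtype=float)))
    return np.round(prob * 100).astype(int)

# Strongly negative, neutral, and strongly positive log-odds
scores = lead_score([-3.0, 0.0, 3.0])
print(scores)
```

A log-odds of 0 corresponds to a probability of 0.5, so such a lead sits in the middle of the score range.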

There are some more problems presented by the company which your model should be able to adjust to if the company's requirement changes in the future so you will need to handle these as well. These problems are provided in a separate doc file. Please fill it based on the logistic regression model you got in the first step. Also, make sure you include this in your final PPT where you'll make recommendations.

### DATA DICTIONARY

Prospect ID: A unique ID with which the customer is identified.

Lead Number: A lead number assigned to each lead procured.

Lead Origin: The origin identifier with which the customer was identified to be a lead. Includes API, Landing Page Submission, etc.

Lead Source: The source of the lead. Includes Google, Organic Search, Olark Chat, etc.

Do Not Email: An indicator variable selected by the customer wherein they select whether or not they want to be emailed about the course.

Do Not Call: An indicator variable selected by the customer wherein they select whether or not they want to be called about the course.

Converted: The target variable. Indicates whether a lead has been successfully converted or not.

TotalVisits: The total number of visits made by the customer on the website.

Total Time Spent on Website: The total time spent by the customer on the website.

Page Views Per Visit: Average number of pages on the website viewed during the visits.

Last Activity: Last activity performed by the customer. Includes Email Opened, Olark Chat Conversation, etc.

Country: The country of the customer.

Specialization:	The industry domain in which the customer worked before. Includes the level 'Select Specialization' which means the customer had not selected this option while filling the form.

How did you hear about X Education:	The source from which the customer heard about X Education.

What is your current occupation: Indicates whether the customer is a student, unemployed or employed.

What matters most to you in choosing this course: An option selected by the customer indicating their main motive for doing this course.

Search/Magazine/Newspaper Article/X Education Forums/Newspaper/Digital Advertisement:	Indicating whether the customer had seen the ad in any of the listed items.
	
Through Recommendations: Indicates whether the customer came in through recommendations.

Receive More Updates About Our Courses:	Indicates whether the customer chose to receive more updates about the courses.

Tags: Tags assigned to customers indicating the current status of the lead.

Lead Quality: Indicates the quality of the lead based on the data and the intuition of the employee who has been assigned to the lead.

Update me on Supply Chain Content: Indicates whether the customer wants updates on the Supply Chain Content.

Get updates on DM Content: Indicates whether the customer wants updates on the DM Content.

Lead Profile: A lead level assigned to each customer based on their profile.

City: The city of the customer.

Asymmetrique Activity Index /Asymmetrique Profile Index/Asymmetrique Activity Score/Asymmetrique Profile Score: An index and score assigned to each customer based on their activity and their profile
	
I agree to pay the amount through cheque: Indicates whether the customer has agreed to pay the amount through cheque or not.

a free copy of Mastering The Interview:	Indicates whether the customer wants a free copy of 'Mastering the Interview' or not.

Last Notable Activity: The last notable activity performed by the student.


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
from __future__ import print_function
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.ticker as ticker
from matplotlib.pyplot import xticks
%matplotlib inline
sns.set(style="whitegrid")
sns.set(rc={'figure.figsize':(12,8)})
pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 40)
pd.set_option('display.max_colwidth', None)  # None means no truncation; -1 is deprecated in newer pandas

from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

In [None]:
#function for missing values in columns
def missing_coldata(df):
    missin_col = pd.DataFrame(round(df.isnull().sum().sort_values(ascending=False)/len(df.index)*100,1), columns=['% of missing value'])
    missin_col['Count of Missing Values'] = df.isnull().sum()
    return missin_col

#function for missing values in rows
def missing_rowdata(df):
    missin_row = pd.DataFrame(round(df.isnull().sum(axis=1).sort_values(ascending=False)/len(df.columns)*100), columns=['% of missing value'])
    missin_row['Count of Missing Values'] = df.isnull().sum(axis=1)
    return missin_row

### Importing the data

In [None]:
leadsdata = pd.read_csv('../input/leads-dataset/Leads.csv')
leadsdata.head(5) 

In [None]:
leadsdata.shape

In [None]:
leadsdata.info()

In [None]:
leadsdata.isnull().sum()

#### Creating a dataframe for duplicate values if any

In [None]:
dupcheck=leadsdata[leadsdata.duplicated(["Prospect ID"])]
dupcheck

In [None]:
sum(leadsdata.duplicated('Prospect ID')) == 0

In [None]:
sum(leadsdata.duplicated('Lead Number')) == 0

In [None]:
leadsdata.nunique()

#### Features that have only one unique value:
* #### *Magazine*
* #### *Receive More Updates About Our Courses*
* #### *Update me on Supply Chain Content*
* #### *Get updates on DM Content*
* #### *I agree to pay the amount through cheque*
These features show no variance: every lead has the same value,
so they cannot help distinguish converted leads from non-converted ones.
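The list above can also be derived programmatically from `nunique()`. A minimal sketch on a made-up frame (`toy` and `constant_cols` are names chosen for this example):

```python
import pandas as pd

# Toy frame: one constant column and two informative ones
toy = pd.DataFrame({
    "Magazine": ["No", "No", "No"],
    "Lead Source": ["Google", "Olark Chat", "Google"],
    "Converted": [0, 1, 0],
})

# Columns with exactly one distinct (non-missing) value carry no information
n_unique = toy.nunique()
constant_cols = n_unique[n_unique == 1].index.tolist()
print(constant_cols)
```

The resulting list could then be passed to `drop(columns=...)` instead of typing the names by hand.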

### Conversion rate

In [None]:
Conversion_rate = (sum(leadsdata['Converted'])/len(leadsdata['Converted'].index))*100
print("The conversion rate of leads is: ",Conversion_rate)

##### The single-valued features listed above can be removed from the model, as they play no role in the conversion of leads

### Data Visualization and Cleaning


In [None]:
# Divide the data into Numeric and categorical data  
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
# NUMERIC
numdata=leadsdata[list(leadsdata.select_dtypes(numerics).columns)]
# CATEGORICAL 
catdata=leadsdata[list(leadsdata.select_dtypes(exclude=numerics).columns)]
catdata.columns

In [None]:
# Conversion rate for each categorical feature
@interact
def counts(col =catdata.iloc[:,1:].columns):
    sns.countplot(x=col,data=leadsdata,hue="Converted",palette="husl",hue_order=[0,1])
    plt.xlabel(col)
    plt.ylabel('Total count')
    plt.legend(loc='upper center', bbox_to_anchor=(1, 0.8), ncol=1)
    plt.xticks(rotation=65, horizontalalignment='right',fontweight='light')
    convertcount=leadsdata.pivot_table(values='Lead Number',index=col,columns='Converted', aggfunc='count').fillna(0)
    convertcount["Conversion(%)"] =round(convertcount[1]/(convertcount[0]+convertcount[1]),2)*100
    return print(convertcount.sort_values(ascending=False,by=1),plt.show())

### Describing the data

In [None]:
@interact
def described(col=leadsdata.iloc[:,2:].columns):
    return leadsdata[col].describe()

In [None]:
#Choosing to drop the columns that have only 1 unique value
leadsdata=leadsdata.drop(["Receive More Updates About Our Courses","Magazine","Update me on Supply Chain Content","Get updates on DM Content","I agree to pay the amount through cheque"],axis=1)

### Handling Missing Data 

##### Newspaper, X Education Forums, Newspaper Article, Through Recommendations, Digital Advertisement:
Nearly all values of these variables are "No", so they play no significant role in the lead score. Drop these columns.

##### What matters most to you in choosing a course:
99.9% of the available values are "Better career prospects" and around 30% are missing, so it plays no significant role in the lead score. Drop this column.

##### Search:
99% of the values are "No", apart from a few "Yes" and missing entries, so it plays no significant role in the lead score. Drop this column.

##### Do Not Call:
All values except 2 are "No", so there is effectively no variance. It says nothing about the leads and can be dropped.
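Judgements like "99% of the values are No" can be checked with `value_counts(normalize=True)`. The sketch below flags columns whose most frequent level exceeds a threshold; the helper `dominant_share`, the 0.95 threshold, and the toy data are assumptions for illustration.

```python
import pandas as pd

def dominant_share(series):
    """Return the share (0-1) of the most frequent non-missing value."""
    return series.value_counts(normalize=True).iloc[0]

# Toy frame: one near-constant column and one balanced column
toy = pd.DataFrame({
    "Do Not Call": ["No"] * 98 + ["Yes"] * 2,
    "Lead Origin": ["API"] * 40 + ["Landing Page Submission"] * 60,
})

shares = {col: dominant_share(toy[col]) for col in toy.columns}
# Flag columns where a single level covers at least 95% of the rows
low_variance = [col for col, share in shares.items() if share >= 0.95]
print(low_variance)
```

The threshold is a judgement call; a stricter or looser value changes which columns are flagged.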

In [None]:
leadsdata=leadsdata.drop(["Newspaper","X Education Forums","Newspaper Article",
                          "Through Recommendations","Digital Advertisement",
                          "What matters most to you in choosing a course","Search","Do Not Call"],axis=1)

In [None]:
missing_coldata(leadsdata)

#### Lead Source : 
Missing values account for well under 1 percent, so we impute them with the most frequent value, "Google".

In [None]:
leadsdata["Lead Source"]=leadsdata["Lead Source"].fillna("Google")
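Instead of hard-coding the fill value, the most frequent level can be looked up with `mode()`. A small sketch on a made-up series (the values are illustrative only):

```python
import pandas as pd

# Toy column with two missing entries
toy = pd.Series(["Google", "Direct Traffic", None, "Google", None],
                name="Lead Source")

# mode() ignores missing values; take the first mode if there are ties
most_frequent = toy.mode().iloc[0]
filled = toy.fillna(most_frequent)
print(filled.tolist())
```

This keeps the imputation correct even if the most frequent value changes when the data is refreshed.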

In [None]:
# What is your current occupation: impute missing values with "Unemployed"
leadsdata["What is your current occupation"]=leadsdata["What is your current occupation"].fillna("Unemployed")

#### Specialization
"Select" is the most frequent level, so we cannot drop it; instead we convert it to "Any_Other".
If the website form does not allow a specialization to be entered manually, that should be addressed in the form design.


In [None]:
# Also the missing values can be imputed with Any_Other
leadsdata["Specialization"]=leadsdata["Specialization"].replace("Select","Any_Other")
leadsdata["Specialization"]=leadsdata["Specialization"].fillna("Any_Other")

#### How did you hear about X Education
"Select" is the most frequent level, so we cannot drop it; instead we convert it to "Not_Mentioned".
Because this is an important factor in planning the marketing of X Education, more information about it should be gathered through other tools.


In [None]:
leadsdata["How did you hear about X Education"]=leadsdata["How did you hear about X Education"].replace("Select","Not_Mentioned")
leadsdata["How did you hear about X Education"]=leadsdata["How did you hear about X Education"].fillna("Not_Mentioned")

#### Lead Profile
"Select" is the most frequent level, so we cannot drop it; instead we convert it to "Any_Other".


In [None]:
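leadsdata["Lead Profile"]=leadsdata["Lead Profile"].replace("Select","Any_Other")
# Note: "Any_other" (lowercase "o") is a different label from "Any_Other" above, so missing values form their own level
leadsdata["Lead Profile"]=leadsdata["Lead Profile"].fillna("Any_other")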

#### Lead Quality
The code below maps any "Select" entries and all missing values to "Might be", treating it as the neutral level.


In [None]:
leadsdata["Lead Quality"]=leadsdata["Lead Quality"].replace("Select","Might be")
leadsdata["Lead Quality"]=leadsdata["Lead Quality"].fillna("Might be")

In [None]:
# Tags
leadsdata["Tags"]=leadsdata["Tags"].fillna("Will revert after reading the email")

#### Country
95% of the data has India as the country, so we impute the missing values with India.

In [None]:
leadsdata["Country"]=leadsdata["Country"].fillna("India")

#### City
The majority of records have Mumbai or a nearby area as the city, so imputing the missing values with Mumbai is a reasonable choice.

In [None]:
leadsdata["City"]=leadsdata["City"].fillna("Mumbai")

#### Let's see the shape of the dataframe now

In [None]:
leadsdata.shape

In [None]:
missing_coldata(leadsdata)

### Outliers Check of numeric data


In [None]:
numdata.columns

In [None]:
@interact
def density( y=numdata.iloc[:,2:].columns,tick_spacing = [100,50,25,10,5]):
    ax=leadsdata[y].plot(kind="hist",title=y,bins=50, rot=30)
    ax.xaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))
    return

In [None]:
@interact
def outliers_check( y=numdata.iloc[:,2:].columns):
    return leadsdata.plot(kind='box',y=y,figsize=[6,5]) 

#### 'Asymmetrique Activity Score', 'Asymmetrique Profile Score', 'Asymmetrique Activity Index', 'Asymmetrique Profile Index'
With about 45% of values missing, these features should not be imputed, because that may affect the data's correctness; hence we drop them.

In [None]:
leadsdata.drop(['Asymmetrique Activity Score', 'Asymmetrique Profile Score','Asymmetrique Activity Index','Asymmetrique Profile Index'],axis=1,inplace=True)

In [None]:
missing_coldata(leadsdata)

#### TotalVisits, Page Views Per Visit, Last Activity:
Missing values are approximately 1 percent,
so we drop all the affected rows.

In [None]:
leadsdata.dropna(inplace = True)

In [None]:
missing_coldata(leadsdata)

In [None]:
leadsdata.shape

##### Looking at the categories, we can reduce their number based on lead counts: categories with very few, almost negligible, records can be combined into a single category such as Miscellaneous

In [None]:
leadsdata['Tags'] = leadsdata['Tags'].replace(['In confusion whether part time or DLP', 'in touch with EINS','Diploma holder (Not Eligible)',
                                     'Approached upfront','Graduation in progress','number not provided', 'opp hangup','Still Thinking',
                                    'Lost to Others','Shall take in the next coming month','Lateral student','Interested in Next batch',
                                    'Recognition issue (DEC approval)','Want to take admission but has financial problems',
                                    'University not recognized'], 'Misc_Tags')

In [None]:
leadsdata['Tags'] = leadsdata['Tags'].replace(["Ringing","Misc_Tags","Interested in other courses","switched off","Already a student",
                                               "Interested  in full time MBA","Not doing further education","invalid number","wrong number given"], 'Misc_Tags')                                  



#### Last Notable Activity

In [None]:
leadsdata["Last Notable Activity"] = leadsdata["Last Notable Activity"].replace(['Approached upfront',
       'Resubscribed to emails', 'View in browser link Clicked',
       'Form Submitted on Website', 'Email Received', 'Email Marked Spam'], 'Misc_Notable_Activity')

#### Lead Source
Several lead sources have just 1 or 2 leads, so we combine them into one category, Miscellaneous.


In [None]:
leadsdata['Lead Source'] = leadsdata['Lead Source'].replace(['Click2call', 'Live Chat', 'NC_EDM', 'Pay per Click Ads', 'Press_Release',
  'Social Media', 'WeLearn', 'bing', 'blog', 'testone', 'welearnblog_Home', 'youtubechannel'], 'Miscellaneous')

In [None]:
# Both 'google' and 'Google' appear in Lead Source; merge them into one
leadsdata['Lead Source'] = leadsdata['Lead Source'].replace('google',"Google")


#### Last Activity


In [None]:
# Several categories in Last Activity have very few records, so we combine them into one category, Miscellaneous
leadsdata['Last Activity'] = leadsdata['Last Activity'].replace(['Had a Phone Conversation', 'View in browser link Clicked', 
                                                       'Visited Booth in Tradeshow', 'Approached upfront',
                                                       'Resubscribed to emails','Email Received', 'Email Marked Spam'], 'Miscellaneous')
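The manual grouping of rare levels above could also be driven by a frequency threshold, which avoids typing out each label. The helper name `group_rare_levels`, the threshold, and the toy data below are assumptions for this sketch.

```python
import pandas as pd

def group_rare_levels(series, min_count, other_label="Miscellaneous"):
    """Replace levels seen fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; overwrite the rest
    return series.where(~series.isin(rare), other_label)

# Toy column: two common sources and three that appear once each
toy = pd.Series(["Google"] * 5 + ["Olark Chat"] * 4 + ["bing", "blog", "testone"])
grouped = group_rare_levels(toy, min_count=2)
print(grouped.value_counts())
```

A threshold keeps the rule explicit, though hand-picked lists like the ones above allow domain knowledge to override raw counts.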

In [None]:
# Total Visit
(leadsdata["TotalVisits"]>=30).sum()

In [None]:
# remove outliers with TotalVisits of 30 or more

leadsdata=leadsdata[leadsdata["TotalVisits"] < 30]

In [None]:
# Page Views Per Visit
# count the few outliers at 15 or more, which can easily be removed
(leadsdata["Page Views Per Visit"]>=15).sum()

In [None]:
# remove outliers with Page Views Per Visit of 15 or more
leadsdata=leadsdata[leadsdata["Page Views Per Visit"]<15]
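The cutoffs of 30 and 15 used above are fixed values. A percentile-based cutoff is one common alternative; the sketch below is not what this notebook does, and the helper `drop_above_quantile` and the toy data are made up for illustration.

```python
import pandas as pd

def drop_above_quantile(df, column, q=0.99):
    """Keep rows whose value in `column` is at or below the q-th quantile."""
    upper = df[column].quantile(q)
    return df[df[column] <= upper]

# Toy data: values 1..99 plus one extreme value
toy = pd.DataFrame({"TotalVisits": list(range(1, 100)) + [250]})
trimmed = drop_above_quantile(toy, "TotalVisits", q=0.99)
print(len(toy), len(trimmed))
```

A quantile adapts to the data, whereas a fixed cutoff is easier to explain to a business audience.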

#### Total Time Spent on Website

##### The time spent is in minutes, so we analyse it by dividing the data into two parts:
* #####  more than one hour spent 


In [None]:
# dataframe is sliced for leads that spent at least one hour on the website
# .copy() avoids pandas' SettingWithCopyWarning when the new column is added
leads1hrplus=leadsdata[leadsdata['Total Time Spent on Website']>=60].copy()
leads1hrplus["hours spent"]=round(leads1hrplus["Total Time Spent on Website"]/60).astype(int)

In [None]:
time_spent_abv1hr=leads1hrplus.pivot_table(values='Lead Number',index=['hours spent'],columns='Converted', aggfunc='count').fillna(0)
time_spent_abv1hr["Conversion(%)"] =round(time_spent_abv1hr[1]/(time_spent_abv1hr[0]+time_spent_abv1hr[1]),2)*100
time_spent_abv1hr.sort_values(ascending=False,by=1)

In [None]:
time_spent_abv1hr.iloc[:,:-1].plot(kind='bar',title= "Conversion count for the leads that spend at least 1 hour on website",stacked=True,figsize=[8,6])

##### Time spent on website
* #####  less than one hour.

In [None]:
# .copy() avoids pandas' SettingWithCopyWarning when the new column is added
leadslessthan1hr=leadsdata[leadsdata['Total Time Spent on Website']<60].copy()
leadslessthan1hr["mins_spent"]=leadslessthan1hr["Total Time Spent on Website"].astype(int)

In [None]:
time_spent_upto1hr=leadslessthan1hr.pivot_table(values='Lead Number',index=['mins_spent'],columns='Converted', aggfunc='count').fillna(0)
time_spent_upto1hr["Conversion(%)"] =round(time_spent_upto1hr[1]/(time_spent_upto1hr[0]+time_spent_upto1hr[1]),2)*100
time_spent_upto1hr.sort_values(ascending=False,by="Conversion(%)")

In [None]:
time_spent_upto1hr.iloc[:,:-1].plot(kind='bar',title="Conversion count for the leads that spend less than 1 hour on website",stacked=True,figsize=[10,8],log=True)

* ### Conversion rate for other numeric features of the leads 

In [None]:
@interact
# The Asymmetrique score columns were dropped earlier, so only TotalVisits is offered here
def numcount(cols=['TotalVisits']):
    numdfcount=round(leadsdata.pivot_table(values='Lead Number',index=cols,columns='Converted', aggfunc='count')).fillna(0)
    numdfcount["Conversion(%)"]=round((numdfcount[1]/(numdfcount[0]+numdfcount[1]))*100)
    cnplot=numdfcount.iloc[:,:-1].plot(kind="bar",stacked=True, legend="upper right", title=cols,figsize=[8,6])
    return print(numdfcount, "\n", cnplot)

In [None]:
pageview=leadsdata.pivot_table(values='Lead Number',index=['Page Views Per Visit'],columns='Converted', aggfunc='count')
pageview.reset_index(inplace=True)

In [None]:
pageview.fillna(0,inplace=True)

In [None]:
pageviews=pageview.round().groupby("Page Views Per Visit").sum()
pageviews["Conversion(%)"]=round((pageviews[1]/(pageviews[0]+pageviews[1]))*100)

In [None]:
pageviews.iloc[:,:-1].plot(kind="bar",legend="upper right",stacked=True,figsize=[7,5])
pageviews

In [None]:
# There are two unique keys in the data, so we drop Prospect ID for now and keep Lead Number.
leadsdata=leadsdata.drop("Prospect ID",axis=1)

In [None]:
leadsdata.nunique()

In [None]:
leadsdata.drop(['Country'],axis=1,inplace=True)

In [None]:
leadsdata.columns

### After cleaning the data we can visualise it again to see the conversion rates among various categories


In [None]:
catdata=leadsdata[list(leadsdata.select_dtypes(exclude=numerics).columns)]
catdata.columns

In [None]:
@interact
def counts(col =catdata.columns):
    sns.countplot(x=col,data=leadsdata,hue="Converted",palette="husl",hue_order=[0,1])
    plt.xlabel(col)
    plt.ylabel('Total count')
    plt.legend(loc='upper center', bbox_to_anchor=(1, 0.8), ncol=1)
    plt.xticks(rotation=65, horizontalalignment='right',fontweight='light')
    convertcount=leadsdata.pivot_table(values='Lead Number',index=col,columns='Converted', aggfunc='count').fillna(0)
    convertcount["Conversion(%)"] =round(convertcount[1]/(convertcount[0]+convertcount[1]),2)*100
    return print(convertcount.sort_values(ascending=False,by=1),plt.show())

### DATA PREPARATION
#### Encoding categorical features 

In [None]:
catdata.nunique()

In [None]:
df = pd.get_dummies(leadsdata[catdata.columns], drop_first=True)
df.head()
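To see what `drop_first=True` does, here is a small sketch on a made-up column: the first level in sorted order becomes the baseline and gets no dummy column of its own.

```python
import pandas as pd

# Toy column with three levels
toy = pd.DataFrame({
    "Lead Origin": ["API", "Landing Page Submission", "Lead Add Form", "API"]
})

# 'API' sorts first, so it is dropped and becomes the reference level
dummies = pd.get_dummies(toy, drop_first=True)
print(dummies.columns.tolist())
```

Rows where every dummy is 0 belong to the dropped baseline level, which avoids perfectly collinear columns in the regression.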

In [None]:
#Create a copy of leads data to add these dummies to the whole data
leads_copy = leadsdata.copy(deep=True)

In [None]:
leads = leadsdata.drop(catdata.columns, axis = 1)

In [None]:
leads.columns

In [None]:
leads = pd.concat([leads, df], axis=1)

In [None]:
leads.shape

In [None]:
%matplotlib inline
plt.figure(figsize = (10,6))
sns.heatmap(leadsdata.select_dtypes(numerics).corr(),annot = True)  # correlate numeric columns only

In [None]:
# TotalVisits and Page Views Per Visit are significantly correlated, hence we drop one of them
leads = leads.drop("Page Views Per Visit", axis = 1)

In [None]:
leads.columns

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler =  MinMaxScaler()
leads[['TotalVisits','Total Time Spent on Website']] = scaler.fit_transform(leads[['TotalVisits','Total Time Spent on Website']])
leads.head()

### Splitting the data

In [None]:
from sklearn.model_selection import train_test_split
# Creating target variable as y and remaining as X
X = leads.drop(["Lead Number",'Converted'], axis=1)
y = leads['Converted']
display(y.head(),X.head())

In [None]:
# Splitting the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=100)
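In this notebook the `MinMaxScaler` is fitted on the full data before the split, which lets the test rows influence the scaling. A leakage-free variant fits the scaler on the training part only and reuses it on the test part. The sketch below shows that pattern on random toy data; the frame and variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Random toy data standing in for the two numeric features
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "TotalVisits": rng.integers(0, 30, size=50).astype(float),
    "Total Time Spent on Website": rng.integers(0, 2000, size=50).astype(float),
})
cols = ["TotalVisits", "Total Time Spent on Website"]

toy_train, toy_test = train_test_split(toy, train_size=0.7, random_state=100)
toy_train, toy_test = toy_train.copy(), toy_test.copy()

scaler = MinMaxScaler()
# Learn min and max from the training rows only
toy_train[cols] = scaler.fit_transform(toy_train[cols])
# Apply the same transformation to the test rows without refitting
toy_test[cols] = scaler.transform(toy_test[cols])
print(toy_train[cols].describe().loc[["min", "max"]])
```

With this approach test values can fall slightly outside the 0 to 1 range, which is expected and harmless for logistic regression.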

In [None]:
numdata=X_train[list(X_train.select_dtypes(numerics).columns)]
numdata.columns

### DATA MODELLING

In [None]:
import statsmodels.api as sm
logm1 = sm.GLM(y_train,(sm.add_constant(X_train)), family = sm.families.Binomial())
logm1.fit().summary()

In [None]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

from sklearn.feature_selection import RFE
rfe = RFE(logreg, n_features_to_select=24)             # running RFE with 24 variables as output
rfe = rfe.fit(X_train, y_train)
rfe.support_
list(zip(X_train.columns, rfe.support_, rfe.ranking_))
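For readers new to RFE, the sketch below runs it on a small synthetic dataset so the outputs `support_` and `ranking_` are easier to inspect. The data comes from scikit-learn's `make_classification`, and the sizes chosen are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 8 features, of which 3 are informative
X_toy, y_toy = make_classification(n_samples=200, n_features=8,
                                   n_informative=3, n_redundant=0,
                                   random_state=0)

# RFE repeatedly fits the model and removes the weakest feature
selector = RFE(estimator=LogisticRegression(max_iter=1000),
               n_features_to_select=3)
selector.fit(X_toy, y_toy)

# support_ is a boolean mask; ranking_ is 1 for every selected feature
selected = [i for i, keep in enumerate(selector.support_) if keep]
print(selected, selector.ranking_)
```

Features with a ranking above 1 were eliminated, with larger numbers eliminated earlier.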

In [None]:
vars=X_train.columns[rfe.support_]

In [None]:
X_train_sm = sm.add_constant(X_train[vars])
logm2 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm2.fit()
res.summary()

In [None]:
# VIF
X_train_sm = X_train_sm.drop(['const'], axis=1)
# Checking the  VIF of all the  features
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
X = X_train_sm
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif
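The VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the others. The NumPy sketch below computes it by hand to show the idea. It includes an intercept in each auxiliary regression, so its values may differ from the statsmodels call above, which is given the features without the constant. The helper `vif_table` and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def vif_table(features):
    """Compute VIF = 1 / (1 - R^2) for each column of a numeric DataFrame."""
    X = features.to_numpy(dtype=float)
    rows = []
    for i, name in enumerate(features.columns):
        target = X[:, i]
        others = np.delete(X, i, axis=1)
        # Regress this feature on the remaining ones, with an intercept
        design = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1.0 - resid.var() / target.var()
        rows.append((name, 1.0 / (1.0 - r2)))
    table = pd.DataFrame(rows, columns=["Features", "VIF"])
    return table.sort_values("VIF", ascending=False).reset_index(drop=True)

# Toy data: 'a_copy' is almost a duplicate of 'a'; 'b' is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
toy = pd.DataFrame({"a": a, "b": b, "a_copy": a + 0.1 * rng.normal(size=200)})
table = vif_table(toy)
print(table)
```

Near-duplicate features show a very high VIF, while an unrelated feature stays close to 1.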

In [None]:
vars=vars.drop(['How did you hear about X Education_SMS'])  # Index.drop takes no axis argument

X_train_sm = sm.add_constant(X_train[vars])
logm2 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm2.fit()
res.summary()

In [None]:
vars=vars.drop(['Specialization_Travel and Tourism'])

X_train_sm = sm.add_constant(X_train[vars])
logm2 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm2.fit()
res.summary()

In [None]:
X_train_sm = X_train_sm.drop(['const'], axis=1)
# Checking the  VIF of all the  features

vif = pd.DataFrame()
X = X_train_sm
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:

vars=vars.drop(['Last Activity_Miscellaneous'])

X_train_sm = sm.add_constant(X_train[vars])
logm2 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm2.fit()
res.summary()

In [None]:
vars=vars.drop(['Last Activity_SMS Sent'])

X_train_sm = sm.add_constant(X_train[vars])
logm2 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm2.fit()
res.summary()

In [None]:
vars=vars.drop(['Lead Quality_Might be'])

X_train_sm = sm.add_constant(X_train[vars])
logm2 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm2.fit()
res.summary()

In [None]:
X_train_sm = X_train_sm.drop(['const'], axis=1)
# Checking the  VIF of all the  features

vif = pd.DataFrame()
X = X_train_sm
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
vars=vars.drop(['Tags_Misc_Tags'])

X_train_sm = sm.add_constant(X_train[vars])
logm2 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm2.fit()
res.summary()

In [None]:
vars=vars.drop(['What is your current occupation_Unemployed'])

X_train_sm = sm.add_constant(X_train[vars])
logm2 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm2.fit()
res.summary()

In [None]:
X_train_sm = X_train_sm.drop(['const'], axis=1)
# Checking the  VIF of all the  features

vif = pd.DataFrame()
X = X_train_sm
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
vars=vars.drop(['Lead Profile_Any_other'])

X_train_sm = sm.add_constant(X_train[vars])
logm2 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm2.fit()
res.summary()

In [None]:
vars=vars.drop(['Lead Quality_Worst'])

X_train_sm = sm.add_constant(X_train[vars])
logm2 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm2.fit()
res.summary()

In [None]:
X_train_sm = X_train_sm.drop(['const'], axis=1)
# Checking the  VIF of all the  features

vif = pd.DataFrame()
X = X_train_sm
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

### Final Logistic regression model

In [None]:
X_train_sm = sm.add_constant(X_train[vars])
logm2 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm2.fit()
res.summary()

### Final Variables selected with RFE and Manual Elimination


In [None]:
print("The final variables selected by the logistic regression model are ","\n",vars)

### The features with the strongest influence on the target variable, Converted

According to our final model, the top 3 variables (ranked by largest coefficient) are:
1. Tags:
    Tags_Lost to EINS:                           9.8
    Tags_Closed by Horizzon :                    9.7
    Tags_Will revert after reading the email:    4.9
2. Total Time Spent on Website:                  4.6
3. Lead Origin:
    Lead Add Form:                               3.5

### MAKING PREDICTIONS

In [None]:
# Let's run the model using the selected variables
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
logsk = LogisticRegression(C=1e9)
logsk.fit(X_train[vars], y_train)

In [None]:
# Predicted probabilities
y_pred = logsk.predict_proba(X_train[vars])
# Converting y_pred (a NumPy array) to a dataframe
y_pred_df = pd.DataFrame(y_pred)
# Keeping only the probability of class 1 as a one-column dataframe
y_pred_1 = y_pred_df.iloc[:,[1]]
# Let's see the head
y_pred_1.head()

In [None]:
# Converting y_train to dataframe
y_train_df = pd.DataFrame(y_train)
y_train_df.head()

In [None]:
# Putting index to LeadID
y_train_df['LeadID'] = y_train_df.index
y_train_df.head()

In [None]:

# Removing index for both dataframes to append them side by side 
y_pred_1.reset_index(drop=True, inplace=True)
y_train_df.reset_index(drop=True, inplace=True)
# Appending y_train_df and y_pred_1
y_pred_final = pd.concat([y_train_df,y_pred_1],axis=1)
# Renaming the column 
y_pred_final= y_pred_final.rename(columns={ 1 : 'Conv_Prob'})
# Rearranging the columns
y_pred_final = y_pred_final.reindex(['LeadID','Converted','Conv_Prob'], axis=1)
# Let's see the head of y_pred_final
y_pred_final.head()

In [None]:
# Creating new column 'predicted' with 1 if Conv_Prob > 0.5 else 0
y_pred_final['predicted'] = y_pred_final.Conv_Prob.map( lambda x: 1 if x > 0.5 else 0)
# Let's see the head
y_pred_final.head()

In [None]:
# Creating new column "Lead Score" (0 to 100) from the conversion probabilities
y_pred_final['Lead Score'] = y_pred_final.Conv_Prob.map( lambda x: round(x*100))
# Let's see the head
y_pred_final.head()

### Model Evaluation


In [None]:
from sklearn import metrics
# Confusion matrix 
confusion = metrics.confusion_matrix( y_pred_final.Converted, y_pred_final.predicted )
confusion

In [None]:
#Let's check the overall accuracy.
metrics.accuracy_score(y_pred_final.Converted, y_pred_final.predicted)

In [None]:
metrics.precision_score(y_pred_final.Converted, y_pred_final.predicted)

In [None]:
metrics.recall_score(y_pred_final.Converted, y_pred_final.predicted)
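
To see where accuracy, precision and recall come from, here is a minimal sketch that counts the four confusion-matrix cells by hand. The labels and predictions are made up for illustration (they are not the leads data), and the variable names are only assumptions for this example.

```python
import numpy as np

# Hypothetical labels and 0/1 predictions, made up for illustration only
actual    = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])
predicted = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 1])

# The four cells of a binary confusion matrix
TP = int(np.sum((actual == 1) & (predicted == 1)))  # converted, predicted converted
TN = int(np.sum((actual == 0) & (predicted == 0)))  # not converted, predicted not converted
FP = int(np.sum((actual == 0) & (predicted == 1)))  # not converted, predicted converted
FN = int(np.sum((actual == 1) & (predicted == 0)))  # converted, predicted not converted

accuracy  = (TP + TN) / (TP + TN + FP + FN)  # share of all leads classified correctly
precision = TP / (TP + FP)                   # of the predicted conversions, how many were real
recall    = TP / (TP + FN)                   # of the real conversions, how many were caught

print(TP, TN, FP, FN)
print(accuracy, precision, recall)
```

Recall is the same quantity as sensitivity, which is used in the cutoff analysis later in the notebook.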

In [None]:
def draw_roc( actual, probs ):
    fpr, tpr, thresholds = metrics.roc_curve( actual, probs,
                                              drop_intermediate = False )
    auc_score = metrics.roc_auc_score( actual, probs )
    plt.figure(figsize=(6, 6))
    plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show()

    return fpr, tpr, thresholds

In [None]:
# The ROC curve needs the predicted probabilities, not the 0/1 labels
draw_roc(y_pred_final.Converted, y_pred_final.Conv_Prob)
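
A ROC curve and its AUC are meant to be computed from scores (probabilities), because the curve is traced by sweeping the cutoff. One way to see this: AUC equals the chance that a randomly picked converted lead gets a higher score than a randomly picked non-converted lead, with ties counting as half. The sketch below uses made-up numbers and a hypothetical helper `pairwise_auc` to compare AUC from probabilities with AUC from labels already cut at 0.5.

```python
import numpy as np

def pairwise_auc(actual, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = scores[actual == 1]
    neg = scores[actual == 0]
    diff = pos[:, None] - neg[None, :]  # every positive score minus every negative score
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

# Made-up labels and probabilities, for illustration only (not the leads data)
actual = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
probs = np.array([0.05, 0.12, 0.18, 0.25, 0.35, 0.55, 0.32, 0.45, 0.72, 0.91])

# Hard 0/1 labels at a 0.5 cutoff throw away the ranking inside each class
hard_labels = (probs > 0.5).astype(float)

auc_probs = pairwise_auc(actual, probs)
auc_labels = pairwise_auc(actual, hard_labels)
print(auc_probs, auc_labels)
```

On this toy data the AUC from probabilities is higher than the AUC from thresholded labels, because the thresholded version only sees a single cutoff.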

In [None]:
# Area under the ROC curve, computed from the predicted probabilities
"{:2.2f}".format(metrics.roc_auc_score(y_pred_final.Converted, y_pred_final.Conv_Prob))

In [None]:
from sklearn import metrics

# Confusion matrix 
confusion = metrics.confusion_matrix(y_pred_final.Converted, y_pred_final.predicted )
print(confusion)

#### Optimal Cutoff

In [None]:
# Let's create columns with different probability cutoffs 
numbers = [float(x)/10 for x in range(10)]
for i in numbers:
    y_pred_final[i]= y_pred_final.Conv_Prob.map(lambda x: 1 if x > i else 0)
y_pred_final.head()

In [None]:
# Calculating accuracy sensitivity and specificity for various probability cutoffs.
cutoff_df = pd.DataFrame( columns = ['probability','accuracy','sensitivity','specificity'])
from sklearn.metrics import confusion_matrix

num = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
for i in num:
    cm1 = metrics.confusion_matrix(y_pred_final.Converted, y_pred_final[i] )
    total1=sum(sum(cm1))
    accuracy = (cm1[0,0]+cm1[1,1])/total1
    
    specificity = cm1[0,0]/(cm1[0,0]+cm1[0,1])
    sensitivity = cm1[1,1]/(cm1[1,0]+cm1[1,1])
    cutoff_df.loc[i] =[ i ,accuracy,sensitivity,specificity]
print(cutoff_df)

In [None]:
# Let's plot accuracy sensitivity and specificity for various probabilities.
cutoff_df.plot.line(x='probability', y=['accuracy','sensitivity','specificity'])
plt.show()
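
Instead of reading the crossing point off the plot, the cutoff can also be picked in code. The sketch below is only an illustration on made-up data (not the leads data): it chooses the cutoff where sensitivity and specificity are closest to each other. The helper `sens_spec` and the "smallest gap" rule are assumptions for this example, not the only way to choose a cutoff.

```python
import numpy as np

# Made-up labels and probabilities, for illustration only (not the leads data)
actual = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
probs = np.array([0.05, 0.12, 0.18, 0.25, 0.35, 0.55, 0.32, 0.45, 0.72, 0.91])

def sens_spec(actual, probs, cutoff):
    """Sensitivity and specificity when predicting 1 for probs > cutoff."""
    pred = (probs > cutoff).astype(int)
    tp = np.sum((actual == 1) & (pred == 1))
    fn = np.sum((actual == 1) & (pred == 0))
    tn = np.sum((actual == 0) & (pred == 0))
    fp = np.sum((actual == 0) & (pred == 1))
    return tp / (tp + fn), tn / (tn + fp)

# Same grid of cutoffs as in the notebook: 0.0, 0.1, ..., 0.9
cutoffs = [x / 10 for x in range(10)]
gaps = {}
for c in cutoffs:
    sens, spec = sens_spec(actual, probs, c)
    gaps[c] = abs(sens - spec)

# Cutoff where the two curves are closest, i.e. where they roughly cross
best_cutoff = min(gaps, key=gaps.get)
print(best_cutoff)
```

A finer grid (for example steps of 0.01) would locate the crossing more precisely, at the cost of more confusion matrices to compute.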

In [None]:
# 0.3 is the optimum cutoff probability to use for the final prediction

y_pred_final['final_pred'] = y_pred_final.Conv_Prob.map( lambda x: 1 if x > 0.3 else 0)

y_pred_final.head(10)

In [None]:
# Let's check the overall accuracy.
metrics.accuracy_score(y_pred_final.Converted, y_pred_final.final_pred)

cm2 = metrics.confusion_matrix(y_pred_final.Converted, y_pred_final.final_pred )
cm2
TP = cm2[1,1] 
TN = cm2[0,0] 
FP = cm2[0,1] 
FN = cm2[1,0] 
print("SENSITIVITY of the logistic regression model is  ",TP / float(TP+FN))


In [None]:
print("Specificity (true negative rate) is ",TN / float(TN+FP))
print("False positive rate is ",FP/ float(TN+FP))
print("Precision (positive predictive value) is ",TP / float(TP+FP))
print("Negative predictive value is ",TN / float(TN+ FN))

In [None]:
from sklearn.metrics import precision_recall_curve
precision, recall, thresholds = precision_recall_curve(y_pred_final.Converted, y_pred_final.Conv_Prob)
plt.plot(thresholds, precision[:-1], "b")
plt.plot(thresholds, recall[:-1], "g")
plt.show()

### 0.38 is the tradeoff point between Precision and Recall
Thus we can safely consider any prospective lead with a conversion probability higher than 38% to be a Hot Lead.
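
The point where the precision and recall curves meet can also be located numerically. Here is a minimal sketch on made-up data (not the leads data): each distinct probability is tried as a threshold, predicting 1 when the probability is at least the threshold, and the threshold with the smallest gap between precision and recall is kept. All names below are assumptions for this illustration.

```python
import numpy as np

# Made-up labels and probabilities, for illustration only (not the leads data)
actual = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
probs = np.array([0.05, 0.12, 0.18, 0.25, 0.35, 0.55, 0.32, 0.45, 0.72, 0.91])

best_threshold, best_gap = None, float("inf")
for t in np.unique(probs):          # every distinct score is a candidate threshold
    pred = probs >= t
    tp = np.sum(pred & (actual == 1))
    fp = np.sum(pred & (actual == 0))
    fn = np.sum(~pred & (actual == 1))
    precision = tp / (tp + fp)      # tp + fp >= 1 because t is one of the scores
    recall = tp / (tp + fn)
    gap = abs(precision - recall)
    if gap < best_gap:
        best_threshold, best_gap = float(t), float(gap)

print(best_threshold, best_gap)
```

Precision equals recall exactly when the number of false positives equals the number of false negatives, which is why this point is often used as a balanced cutoff.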

#### Making Predictions on test set X_test

In [None]:
# Use the scaler already fitted on the training data: transform only, do not re-fit on the test set
X_test[['TotalVisits','Total Time Spent on Website']] = scaler.transform(X_test[['TotalVisits','Total Time Spent on Website']])
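
Test features are normally scaled with the statistics learned from the training data, so that no information from the test set leaks into the preprocessing. The sketch below shows the idea by hand with NumPy on made-up numbers. It assumes standardisation (subtract the mean, divide by the standard deviation); the scaler actually used earlier in this notebook may be a different one.

```python
import numpy as np

# Made-up values for two numeric features, for illustration only
train = np.array([[1.0, 100.0], [3.0, 300.0], [5.0, 500.0]])
test = np.array([[3.0, 300.0], [7.0, 700.0]])

# "fit": learn the mean and standard deviation from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)   # population standard deviation (ddof=0)

# "transform": apply the training statistics to both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled)
print(test_scaled)
```

The scaled training columns have mean 0, while the scaled test columns generally do not, and that is expected: the test set is measured on the training set's scale.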


In [None]:
X_test=X_test[vars]
X_test.head()

In [None]:
X_test_sm = sm.add_constant(X_test)
y_test_pred = res.predict(X_test_sm)
y_test_pred.head()

In [None]:
y_pred_1 = pd.DataFrame(y_test_pred)
y_test_df = pd.DataFrame(y_test)
# Putting index to LeadID
y_test_df['LeadID'] = y_test_df.index
# Removing index for both dataframes to append them side by side 
y_pred_1.reset_index(drop=True, inplace=True)
y_test_df.reset_index(drop=True, inplace=True)
# Appending y_test_df and y_pred_1
y_pred_final = pd.concat([y_test_df, y_pred_1],axis=1)
y_pred_final= y_pred_final.rename(columns={ 0 : 'Conv_Prob'})
y_pred_final = y_pred_final.reindex(['LeadID','Converted','Conv_Prob'], axis=1)
y_pred_final.head()

In [None]:
# Creating new column "Lead_Score" (0 to 100) from the conversion probability
y_pred_final['Lead_Score'] = y_pred_final.Conv_Prob.map( lambda x: round(x*100))
# Let's see the head
y_pred_final.head()

### Taking 0.38 as the cutoff using precision recall tradeoff

In [None]:
y_pred_final['final_pred'] = y_pred_final.Conv_Prob.map(lambda x: 1 if x > 0.38 else 0)
y_pred_final.head(10)

### Evaluation of model on test data

In [None]:
print("Model Accuracy on Test data is ",metrics.accuracy_score(y_pred_final.Converted, y_pred_final.final_pred))

In [None]:
confusion2 = metrics.confusion_matrix(y_pred_final.Converted, y_pred_final.final_pred )
confusion2
TP = confusion2[1,1]  
TN = confusion2[0,0] 
FP = confusion2[0,1] 
FN = confusion2[1,0] 

print("Sensitivity of the model on test data is ",round(TP / float(TP+FN),2))

In [None]:
print("Specificity of the model on test data is ",TN / float(TN+FP))

### Visualising the conversion rate of the most Impactful Features of Leads Data

In [None]:
ydf=y_train_df.set_index("LeadID")

In [None]:
Xy_Traindf=pd.concat([ydf,X_train_sm.iloc[:,1:]],axis=1)

In [None]:
Xy_Traindf.corr()["Converted"].sort_values()

#### Analysing the training data over the encoded variables finally selected by the logistic regression model

In [None]:
Xy_Traindf.reset_index(inplace=True)

In [None]:
Xy_Traindf=Xy_Traindf.rename(columns={"index":"LeadID"})

In [None]:
@interact
def counts(col =['Lead Origin_Lead Add Form', 'Lead Source_Olark Chat',
       'Lead Source_Welingak Website', 'Do Not Email_Yes',
       'Tags_Closed by Horizzon', 'Tags_Lost to EINS',
       'Tags_Will revert after reading the email', 'Lead Quality_Not Sure',
       'Lead Profile_Other Leads', 'Lead Profile_Potential Lead',
       'Last Notable Activity_Modified',
       'Last Notable Activity_Olark Chat Conversation',
       'Last Notable Activity_SMS Sent']):
    sns.countplot(x=col,data=Xy_Traindf,hue="Converted",palette="husl")
    plt.xlabel(col)
    plt.ylabel('Total count')
    plt.legend(loc='upper center', bbox_to_anchor=(1, 0.8), ncol=1)
    plt.xticks(rotation=65, horizontalalignment='right',fontweight='light')
    convertcount=Xy_Traindf.pivot_table(values='LeadID',index=col,columns='Converted', aggfunc='count').fillna(0)
    # Multiply by 100 before rounding to avoid values such as 56.99999999999999
    convertcount["Conversion(%)"] = (100*convertcount[1]/(convertcount[0]+convertcount[1])).round()
    plt.show()
    print(convertcount.sort_values(ascending=False,by=1))
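
The conversion percentage in the table is simply the share of converted leads within each level of the chosen column. Here is a minimal sketch of the same calculation with `groupby` on a made-up dataframe; the values and the name `toy` are assumptions for illustration, not the leads data.

```python
import pandas as pd

# Hypothetical toy data: one dummy-encoded feature plus the 0/1 target
toy = pd.DataFrame({
    "Lead Origin_Lead Add Form": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "Converted":                 [1, 1, 1, 0, 1, 0, 0, 0, 1, 0],
})

# Number of leads and number of conversions for each level of the feature
summary = (
    toy.groupby("Lead Origin_Lead Add Form")["Converted"]
       .agg(leads="count", converted="sum")
)
summary["Conversion(%)"] = (100 * summary["converted"] / summary["leads"]).round(1)

print(summary.sort_values(by="Conversion(%)", ascending=False))
```

Looking at the lead count next to the percentage matters: a very high conversion rate on only a handful of leads is much weaker evidence than the same rate on hundreds of leads.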

#### The above visualisation shows that the variables chosen by our logistic regression model are appropriate and have the most impact on the conversion of leads to Hot Leads

Most focus should be given to 

1. Lead Source: Welingak Website,
    Although the number of users is small, there is almost 100 percent conversion.
    If more focus is put on this source, a very good conversion rate can be achieved.
    The strategy should be to promote this source.
    
2. Lead Origin: Lead Add Form : This origin has a 93% conversion rate. It should not be neglected at all; in fact, if the origin is Lead Add Form, the user should be given higher priority, as they have a higher chance of converting to a Hot Lead.

3. Last Notable Activity: Olark Chat : A very large number of users use Olark chat, but the conversion rate here is not so high. Keeping in mind the number of users and their interest in online conversation, effort should go into finding more potential leads among them, so that we don't miss a large number of enquiring users.
