# <span style="color:navy"> Lead Scoring for X Education 
#### <span style="color:navy"> A case study in Logistic Regression

### <span style="color:navy"> Problem Statement

An education company named X Education sells online courses to industry professionals. The professionals who are interested in the courses land on their website and browse for courses. 

The company markets its courses on several websites and search engines such as Google. Once people land on the website, they might browse the courses, fill in a form for a course, or watch some videos. When they fill in a form with their email address or phone number, they are classified as a lead. The company also gets leads through past referrals. Once these leads are acquired, employees from the sales team start making calls, writing emails, etc. Through this process, some of the leads get converted while most do not. The typical lead conversion rate at X Education is around 30%. 

Although X Education gets a lot of leads, its lead conversion rate is poor. To make the process more efficient, the company wishes to identify the leads most likely to convert, also known as ‘Hot Leads’. If it identifies this set of leads successfully, the lead conversion rate should go up, because the sales team will focus on communicating with these promising leads rather than calling everyone. 

### <span style="color:navy"> Objective: 

The objective is to help X Education select the most promising leads by building a model and assigning a lead score to each of the leads such that the customers with higher lead score have a higher conversion chance and the customers with lower lead score have a lower conversion chance. The CEO, in particular, has given a ballpark of the target lead conversion rate to be around 80%.
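
As a rough illustration of what a lead score could look like, the sketch below maps a predicted conversion probability to an integer between 0 and 100. The function name `to_lead_score` and the example probabilities are assumptions made for this illustration; they are not taken from the dataset or from the final model.

```python
import numpy as np

def to_lead_score(probabilities):
    """Map conversion probabilities in [0, 1] to integer lead scores in [0, 100]."""
    probs = np.clip(np.asarray(probabilities, dtype=float), 0.0, 1.0)
    return np.rint(probs * 100).astype(int)

# Hypothetical probabilities, for illustration only
example_probs = [0.05, 0.5, 0.873]
example_scores = to_lead_score(example_probs)
print(example_scores)
```

Because the mapping is monotonic, a lead with a higher predicted probability never receives a lower score.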

### <span style="color:navy"> 1. Import Libraries and Initial Data Analysis

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Read the lead scoring dataset

df = pd.read_csv("/kaggle/input/lead-scoring-dataset/Lead Scoring.csv")
df.head()

In [None]:
# Take a copy of the original dataset to assign the Lead score to the original rows. 

df_orig = df.copy()

### <span style="color:navy">  1.1 Summary Data Analysis

In [None]:
df.describe()

In [None]:
df.shape

In [None]:
#df.info()

### <span style="color:navy"> 1.2 Imbalance Analysis

**Check the balance of the data with respect to the target variable, 'Converted'**

The data is not heavily imbalanced, so we can proceed with the analysis and model building.

In [None]:
# Dividing the dataset into two datasets with Converted = 0 and Converted = 1

df_0=df.loc[df["Converted"]==0]
df_1=df.loc[df["Converted"]==1]

In [None]:
# Calculating the imbalance ratio
# The majority class is Converted = 0 and the minority class is Converted = 1
print (f'Count of Converted = 0: {len(df_0)} \nCount of Converted = 1: {len(df_1)}')
print (f'Imbalance Ratio is : {round(len(df_0)/len(df_1),2)}')
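
The same check can be expressed as class shares, which are often easier to read than raw counts. The sketch below uses a small made-up `toy` frame rather than the real data, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical target column: 6 non-converted and 4 converted leads
toy = pd.DataFrame({"Converted": [0, 0, 0, 1, 1, 0, 1, 0, 0, 1]})

# Share of each class (sums to 1.0)
class_share = toy["Converted"].value_counts(normalize=True)

# Majority count divided by minority count, as in the cell above
imbalance_ratio = (toy["Converted"] == 0).sum() / (toy["Converted"] == 1).sum()

print(class_share)
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```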

In [None]:
# Plotting the imbalance Analysis:
sns.set_style('whitegrid')
fig, ax = plt.subplots(figsize = (6,4))
plt.title('Imbalance Analysis',  fontsize=20)
chart = sns.countplot(data=df, x='Converted', palette='muted')
plt.xlabel('Converted', fontsize=18)
plt.ylabel('count', fontsize=18)

## <span style="color:navy"> 2. Data Cleaning
1. Replace the 'Select' value in the categorical columns with NaN.
2. Check the percentage of missing values for all columns.
3. Drop columns with a high percentage of missing values.
4. Drop categorical columns that are highly skewed.
5. Impute columns with a low percentage of missing values.
6. Drop the columns that were completed by the sales team after progressing with the leads.

### <span style="color:navy"> 2.1 Convert 'Select' values to NaN

Replace the 'Select' value in the categorical columns with NaN. These values mostly come from dropdown menus where nothing was selected.

In [None]:
# Converting 'Select' values to NaN.
df = df.replace('Select', np.nan)
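
To see the effect of this replacement in isolation, here is a minimal sketch on a made-up frame (the rows are invented for illustration). After the replacement, the placeholder shows up in the missing-value counts like any other NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical rows where 'Select' stands for "nothing chosen in the dropdown"
toy = pd.DataFrame({
    "Specialization": ["Select", "Finance", "Select"],
    "City": ["Mumbai", "Select", "Thane"],
})

cleaned = toy.replace("Select", np.nan)
missing_counts = cleaned.isna().sum()
print(missing_counts)
```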

#### Drop the duplicate rows if any

In [None]:
row1 , column1 = df.shape[0], df.shape[1]

# delete duplicates
df = df.drop_duplicates() 

**Calculate the percentage of the retained rows**

In [None]:
row2 , column2 = df.shape[0], df.shape[1]

percentRows = round ((row2/row1 * 100), 2)
print (f'Rows retained after Duplicate Deletion: {row2} or {percentRows} percent')

### <span style="color:navy">   2.2 Find missing values and delete columns with a lot of missing values

For our analysis, we have to find the columns with missing values and handle them by either deleting or imputing. 

**Define a function to get the missing values and missing percentage for the dataframes.**

In [None]:
# To find percent of Nan values
# We can define a function to get the missing values and missing percentage for the dataframes.
def missing_data(data):
    count_missing = data.isnull().sum().sort_values(ascending=False)
    percent_missing = (data.isnull().sum() * 100 / len(data)).sort_values(ascending=False)
    missing_value_df = pd.DataFrame({'count_missing': count_missing,
                                 'percent_missing': percent_missing})
    return missing_value_df

In [None]:
#To find percent of Nan values 
missing_data(df).head(20).transpose()

### <span style="color:navy">  2.3. Drop the unwanted variables 

Since we do not need all the columns provided in the dataset for our analysis, we can drop some of the columns based on our analysis.  

#### Drop Prospect ID and Lead Number as they are unique identifiers and need not be used in prediction

Prospect ID and Lead Number are unique identifiers of the contacted people and as such will not add value to the model, so these columns can be dropped. There are no duplicates in the Prospect ID and Lead Number columns.

In [None]:
# To check if there are any duplicate values in Prospect ID and Lead Number columns

print (f'Duplicates in Prospect ID - {any(df["Prospect ID"].duplicated())}')
print (f'Duplicates in Lead Number - {any(df["Lead Number"].duplicated())}')

In [None]:
# Dropping the identifier columns 'Prospect ID' and 'Lead Number'
dropFeatures = ['Prospect ID', 'Lead Number']
df.drop(columns=dropFeatures, inplace=True)

#### Create a function to drop the columns with a certain percentage of NaN values

We can drop columns whose percentage of missing values reaches a chosen threshold. With so many values missing, such columns cannot be weighted reliably in prediction.

In [None]:
# Drop the columns whose percentage of NA values is at least miss_per (first used with 70%)
def drop_columns(data, miss_per):
    cols_to_drop = list(round(100*(data.isnull().sum()/len(data.index)), 2) >= miss_per )
    dropcols = data.loc[:,cols_to_drop].columns
    print (f'Features dropping now: {dropcols}')
    data = data.drop(dropcols, axis=1)
    return data
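
For comparison, the same idea can be written more compactly with `isna().mean()`. This is a sketch on a made-up frame, and the helper name `drop_sparse_columns` is hypothetical; it is not a replacement for the function above.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(data, max_missing_pct):
    """Return a copy of `data` without columns whose missing percentage is >= max_missing_pct."""
    missing_pct = data.isna().mean() * 100
    keep = missing_pct[missing_pct < max_missing_pct].index
    return data[keep].copy()

toy = pd.DataFrame({
    "a": [1, 2, 3, 4],                   # 0% missing
    "b": [np.nan, np.nan, np.nan, 1.0],  # 75% missing
    "c": [np.nan, 2.0, 3.0, 4.0],        # 25% missing
})

trimmed = drop_sparse_columns(toy, 70.0)
print(trimmed.columns.tolist())
```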

#### Drop the columns with more than 70% NaN values

In [None]:
df = drop_columns(df, 70.0)

In [None]:
#missing_data(df).head(20)

**Analysis of Score columns assigned by the Sales Team**

The following are the score columns assigned by the sales team to the dataset after progressing with the leads. 
These columns can be dropped as they will not add to the model building. 

Analyse the following features before dropping them. 

* Lead Quality 
* Asymmetrique Activity Index
* Asymmetrique Profile Index
* Asymmetrique Activity Score
* Asymmetrique Profile Score

In [None]:
# Analyse the score columns assigned by the sales team to the dataset before dropping them

scoreFeatures = ['Lead Quality', 'Asymmetrique Activity Index', 'Asymmetrique Profile Index' ]

# Count plot for the categorical variables
sns.set(style='ticks',color_codes=True)
colors =['Accent', 'PiYG' , 'RdPu']

plt.figure(figsize = (15,5))
for i in enumerate(scoreFeatures):
    plt.subplot(1, 3, i[0]+1)
    chart = sns.countplot(x = i[1], hue = 'Converted', data = df, palette = colors[i[0]])
    chart.set_xticklabels(chart.get_xticklabels(), rotation=45, ha='right',)
    plt.tight_layout()

In [None]:
# Analyse the score columns assigned by the sales team to the dataset

fig, axis = plt.subplots(1, 2, figsize = (12,4))
# Blue = not converted, red = converted; kdeplot replaces the deprecated distplot(hist=False)
sns.kdeplot(df_0['Asymmetrique Activity Score'], color='b', ax=axis[0])
sns.kdeplot(df_1['Asymmetrique Activity Score'], color='r', ax=axis[0])
sns.kdeplot(df_0['Asymmetrique Profile Score'], color='b', ax=axis[1])
sns.kdeplot(df_1['Asymmetrique Profile Score'], color='r', ax=axis[1])
plt.tight_layout()

#### Drop the columns with more than 45% NaN values

As all the score features have more than 45% NaN values, they can be dropped without affecting our analysis. 

In [None]:
# Drop the score columns assigned by the sales team to the dataset

df = drop_columns(df, 45.0)

In [None]:
df.columns

* **Drop the columns 'Tags' and 'Last Notable Activity', as these columns are filled in by the sales team while working on the leads and do not directly contribute to identifying the hot leads**

In [None]:
# Drop the unwanted features
dropFeatures = ['Tags', 'Last Notable Activity']

df.drop(dropFeatures, axis=1, inplace=True)

---------------

## <span style="color:navy"> 3. EDA and Data Visualizations for further analysis

The next step is to visualise the data using matplotlib and seaborn.

Understanding the data is one of the most important steps. It helps us understand the properties of the data:

* Helps to identify any outliers.
* If there is some obvious multicollinearity going on, this can be identified here.
* Identify the data types of the features and make any conversions if needed.
    
### <span style="color:navy"> 3.1 Check the data types of all the columns and make changes if needed

* The Constant features can be removed. Constant features are those features that have only one value.
* The Categorical features should be identified to create the Dummy variables for them later.
* The Boolean features ('Yes' or 'No' features) can be mapped to 0 and 1 to prepare them for modeling. 

**Delete the constant features**

In [None]:
# A function to find the constant features. Constant features are those features which have only one distinct value.

def find_constant_features(df):
    constFeatures = []
    for column in list(df.columns):
        if df[column].unique().size < 2:
            constFeatures.append(column)
    return constFeatures

constFeatures = find_constant_features(df)
print(constFeatures)

In [None]:
# Drop the constant features as they will not add value to the analysis

df = df.drop(constFeatures, axis=1)

In [None]:
df.shape

#### Identify the number of unique values in each column

In [None]:
# Look at the data type and the number of unique values in each column
def unique_count(data):
    data_type = data.dtypes
    unique_count = data.nunique()
    
    unique_count_df = pd.DataFrame({'data_type': data_type,
                                 'unique_count': unique_count})
    return unique_count_df

In [None]:
unique_count(df).transpose() # Transposed to save vertical space

#### Identify all the Categorical, boolean and numeric features

In [None]:
# Identify and separate the target, boolean, categorical and numeric features for analysis
targetFeature = []
catFeatures = []
boolFeatures = []
numFeatures = []

for each in df.columns:
    if each == 'Converted':  # exact match; `in ('Converted')` would be a substring test
        targetFeature.append(each)
    elif df[each].nunique() == 2:  # Features with only 2 unique values are treated as boolean
        boolFeatures.append(each)
    elif df[each].dtype == 'object':
        catFeatures.append(each)
    else:  # everything else (int64, float64, ...) is treated as numeric
        numFeatures.append(each)

In [None]:
print (f'The Target Feature is : \n {targetFeature} \n')
print (f'The Boolean Features are : \n {boolFeatures} \n')
print (f'The Categorical Features are : \n {catFeatures} \n')
print (f'The Numeric Features are :\n {numFeatures} \n')

### <span style="color:navy"> 3.2 Univariate Analysis of Boolean Features

* Convert the values 'Yes' and 'No' to 1 and 0 in the Binary Features. 
* Check if the columns are skewed and drop them if they are skewed.

In [None]:
boolFeatures

In [None]:
# Convert the values 'Yes' and 'No' to 1 and 0 in the binary features.
# The dtype is checked first so that the mapping is applied only once:
# mapping an already-converted column again would turn every value into NaN.

for each in boolFeatures:
    if not pd.api.types.is_numeric_dtype(df[each]):  # still holds 'Yes'/'No' strings
        df[each] = df[each].map(dict(Yes=1, No=0))
        print (f'Binary mapping is completed for {each}')
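
The reason a repeated mapping is dangerous can be shown on a tiny made-up series. The guard in `map_yes_no` below is one possible way to make the step safe to re-run; the helper name is hypothetical and the check is an assumption for this sketch.

```python
import pandas as pd

yes_no = pd.Series(["Yes", "No", "No", "Yes"])

mapped_once = yes_no.map({"Yes": 1, "No": 0})
# The keys 'Yes'/'No' no longer match the values 1/0, so every entry becomes NaN
mapped_twice = mapped_once.map({"Yes": 1, "No": 0})

def map_yes_no(series):
    """Map 'Yes'/'No' to 1/0 only if the series still holds those strings."""
    if set(series.dropna().unique()) <= {"Yes", "No"}:
        return series.map({"Yes": 1, "No": 0})
    return series  # already mapped (or not a Yes/No column): leave untouched

safe_twice = map_yes_no(map_yes_no(yes_no))
print(mapped_twice.isna().all(), safe_twice.tolist())
```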

In [None]:
# Convert the boolean features to integer type (0/1)
df[boolFeatures] = df[boolFeatures].astype('int64')

In [None]:
boolFeatures

In [None]:
df.shape

In [None]:
# Count plot for the Boolean variables
# colors = ['Accent', 'PiYG' , 'RdPu', 'icefire' , 'ocean' , 'gist_earth', 'magma', 'plasma', 'rocket']
colors = ['Accent', 'ocean', 'rocket'] * 3
sns.set(style='ticks',color_codes=True)
plt.figure(figsize = (10,10))
for i, x_var in enumerate(boolFeatures):
    plt.subplot(3, 3, i+1)
    chart = sns.countplot(x = x_var, data = df, hue='Converted', palette=colors[i])
    chart.set_xticklabels(chart.get_xticklabels())
    plt.tight_layout()

In [None]:
# Inspect the value counts of the boolean features to see how skewed they are

for each in boolFeatures:
    print (df[each].value_counts(dropna=False))

#### Observations:

* Only two fields, 'A free copy of Mastering The Interview' and 'Do Not Email', have a meaningful mix of 1 and 0 values.
* All the other binary features have a very high percentage of 'No' values.
* We can drop these columns, as they will not contribute to the analysis.

In [None]:
# Drop the boolean features whose values are almost all 0 (No), as they do not help in the analysis

dropFeatures = [ 'Do Not Call',
                 'Search',
                 'Newspaper Article',
                 'X Education Forums',
                 'Newspaper',
                 'Digital Advertisement',
                 'Through Recommendations']

In [None]:
# Drop the unwanted features

df.drop(dropFeatures, axis=1, inplace=True)

In [None]:
#To find percent of Nan values 
missing_data(df).head(10)

In [None]:
df.shape

### <span style="color:navy"> 3.3 EDA and missing values handling for the Numeric Features

In [None]:
numFeatures

In [None]:
# Analyze the numeric features

sns.set(style='ticks',color_codes=True)
# pairplot creates its own figure, so no separate plt.figure call is needed
g = sns.pairplot(data=df, hue='Converted', vars=numFeatures + targetFeature);

In [None]:
# Frequency distribution for the numeric features (blue = not converted, red = converted)
sns.set(style='ticks',color_codes=True)
plt.figure(figsize = (12, 12))
for i, x_var in enumerate(numFeatures):
    plt.subplot(3, 2, i+1)
    sns.kdeplot(df_0[x_var], color='b')  # kdeplot replaces the deprecated distplot(hist=False)
    sns.kdeplot(df_1[x_var], color='r')
    plt.tight_layout()

In [None]:
df.Converted.dtype

In [None]:
numFeatures

#### Outlier Handling for the Numeric Features

The features 'TotalVisits' and 'Page Views Per Visit' have outliers, which can be capped at the 0.01 and 0.99 quantiles.

In [None]:
# Box plots to identify the outliers in the numeric features
sns.set(style='ticks',color_codes=True)
colors = ['Accent', 'ocean' , 'RdPu']
plt.figure(figsize = (12, 12))
for i, var in enumerate(numFeatures):
    plt.subplot(3,3,i+1)
    sns.boxplot(x='Converted', y = var, data = df, palette =colors[i])
    plt.tight_layout()

In [None]:
cap_outliers = ['TotalVisits', 'Page Views Per Visit']

In [None]:
# Cap the outliers for the numeric features at the 0.01 and 0.99 quantiles

for var in cap_outliers:
    lower = df[var].quantile(0.01)
    upper = df[var].quantile(0.99)
    # clip avoids chained assignment; NaN values are left untouched
    df[var] = df[var].clip(lower=lower, upper=upper)

In [None]:
# Box plot to visualise numeric features after outlier capping
sns.set(style='ticks',color_codes=True)
colors = ['Accent', 'ocean' , 'RdPu'] # 'icefire' , 'ocean' , 'gist_earth', 'magma', 'prism', 'rocket', 'seismic']
plt.figure(figsize = (12, 12))
for i, var in enumerate(numFeatures):
    plt.subplot(3,3,i+1)
    sns.boxplot(x = 'Converted', y = var, data = df, palette=colors[i])
    plt.tight_layout()

#### Impute the missing values with mean for 'TotalVisits' and 'Page Views Per Visit' 

* After the outlier handling, the means of 'TotalVisits' and 'Page Views Per Visit' are about the same for 
  converted and non-converted leads. 
* We can therefore impute the missing values in these columns with the column mean.

In [None]:
# Impute the missing values for the columns with the mean
# (assigning the result back avoids chained in-place modification)

df['TotalVisits'] = df['TotalVisits'].fillna(df['TotalVisits'].mean())
df['Page Views Per Visit'] = df['Page Views Per Visit'].fillna(df['Page Views Per Visit'].mean())

In [None]:
# Correlation Heat map for the numeric features

corrFeatures = numFeatures + targetFeature

sns.set(style='ticks',color_codes=True)
plt.figure(figsize = (6,6))

sns.heatmap(df[corrFeatures].corr(), cmap="YlGnBu", annot=True, square=True)
plt.show()

In [None]:
numFeatures

### <span style="color:navy"> 3.4 EDA and Data analysis for Categorical Features

In [None]:
#To find percent of Nan values 
#missing_data(df).head(10)

In [None]:
# Identify the Unique Counts for the categorical Features

unique_count(df[catFeatures]).transpose() # Transposed to save vertical space

In [None]:
catFeatures

In [None]:
unique_count(df[catFeatures]).sort_values(by = 'unique_count', ascending=False)

In [None]:
catFeatures[:4]
catFeatures[4:]

In [None]:
# Count plot for the categorical variables
sns.set(style='ticks',color_codes=True)
# colors =['Accent', 'PiYG' , 'RdPu', 'icefire' , 'ocean' , 'gist_earth', 'magma', 'prism', 'rocket', 'seismic']
colors =['gist_earth', 'magma', 'ocean', 'rocket'] * 2
plt.figure(figsize = (15,12))
for i, x_var in enumerate(catFeatures[:4]):
    plt.subplot(2, 2, i+1)
    chart = sns.countplot(x = x_var, hue = 'Converted', data = df, palette = colors[i])
    chart.set_xticklabels(chart.get_xticklabels(), fontsize=14, rotation=45, ha='right',)
    plt.xlabel(x_var, fontsize=14)
    plt.ylabel('count', fontsize=14)
    plt.tight_layout()

In [None]:
# Count plot for the categorical variables
sns.set(style='ticks',color_codes=True)
# colors =['Accent', 'PiYG' , 'RdPu', 'icefire' , 'ocean' , 'gist_earth', 'magma', 'prism', 'rocket', 'seismic']
colors =['gist_earth', 'magma', 'ocean', 'rocket'] * 2
plt.figure(figsize = (15,12))
for i, x_var in enumerate(catFeatures[4:]):
    plt.subplot(2, 2, i+1)
    chart = sns.countplot(x = x_var, hue = 'Converted', data = df, palette = colors[i])
    chart.set_xticklabels(chart.get_xticklabels(), fontsize=14, rotation=45, ha='right',)
    plt.xlabel(x_var, fontsize=14)
    plt.ylabel('count', fontsize=14)
    plt.tight_layout()

#### Drop the unwanted columns: 
* **Drop the columns 'Country' and 'What matters most to you in choosing a course' as these are highly skewed**


In [None]:
df.columns

In [None]:
dropFeatures = ['Country', 'What matters most to you in choosing a course']

df.drop(dropFeatures, axis=1, inplace=True)

In [None]:
df.columns

In [None]:
catFeatures = []

for each in df.columns:
    if df[each].dtype == 'object':
        catFeatures.append(each)

catFeatures

#### Replace the values with spelling corrections in the categories for categorical columns

In [None]:
df['Lead Source'] = df['Lead Source'].replace(['google'], 'Google')

#### Replace the missing values for 'City' column with the mode

In [None]:
# Replace the NaN values in 'City' with its most frequent value (the mode), 'Mumbai'
df['City'] = df['City'].replace(np.nan, 'Mumbai')

In [None]:
for each in catFeatures:
    print (f'Value Counts for {each}: \n {df[each].value_counts(dropna=False)} \n')

#### Bucketing the categories with lesser count for the categorical features

In [None]:
# Many categories in the categorical features account for less than about 2% of the rows each,
# so we combine every category with 200 or fewer rows into one category called 'Others'

for each in catFeatures:
    categories = df[each].value_counts()
    replaceFeatures = categories[categories <= 200].index.tolist()  # categories with 200 or fewer rows
    df[each] = df[each].replace(replaceFeatures, 'Others')
    print (f'Categories replaced for column {each} are: \n {replaceFeatures} \n')
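
The bucketing step can be tried out on a small made-up series. The helper `bucket_rare` and its `min_count` parameter are hypothetical names for this sketch; the threshold of 1 is chosen only so that the toy data has something to bucket.

```python
import pandas as pd

def bucket_rare(series, min_count, other_label="Others"):
    """Replace categories that occur min_count times or fewer with other_label."""
    counts = series.value_counts()
    rare = counts[counts <= min_count].index
    # Keep values that are not rare; NaN is not in `rare`, so it stays NaN
    return series.where(~series.isin(rare), other_label)

toy = pd.Series(["Google"] * 5 + ["Direct"] * 4 + ["Bing", "Blog"])
bucketed = bucket_rare(toy, min_count=1)
print(bucketed.value_counts())
```

Leaving NaN untouched here keeps the missing-value handling as a separate, explicit step.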

#### Replace the missing values with 'Missing' category for categorical columns

In [None]:
#To find percent of Nan values 
# missing_data(df).head(20)

In [None]:
# Replace all the NaN values with 'Missing' for the remaining categorical variables with NaN in them
nanFeatures = ['Specialization', 'What is your current occupation', 'Lead Source', 'Last Activity']

for each in nanFeatures:
    df[each] = df[each].fillna('Missing')  # assign back instead of chained inplace replace
    print (f'NaNs are converted to "Missing" category for column {each}')

#### Visualize the Categorical variables after handling missing values and bucketing

In [None]:
catFeatures

In [None]:
# Count plot for the categorical variables
sns.set(style='ticks',color_codes=True)
plt.figure(figsize = (25, 18))
colors = [ 'RdBu', 'rocket' , 'gist_earth'] * 2
for i, x_var in enumerate(catFeatures):
    plt.subplot(2, 3, i+1)
    chart = sns.countplot(x = x_var, hue = 'Converted', data = df, palette = colors[i])
    chart.set_xticklabels(chart.get_xticklabels(), fontsize=16, rotation=45, ha='right')
    plt.xlabel(x_var, fontsize=16)
    plt.ylabel('count', fontsize=16)
    plt.tight_layout()

In [None]:
#To find percent of Nan values 
missing_data(df).head()

**There are no missing values and we can proceed with the model building**

-------------

## <span style="color:navy"> 4. Model Building
    
Now that the data analysis is completed, data is cleaned and outliers handled, we can proceed to building the model. 

### <span style="color:navy"> 4.1 Get Dummy Variables:
    
* For all the categorical features, dummy variables need to be created.
* Instead of dropping the first dummy variable for each categorical variable (using drop_first = True), we can choose which dummy variable to drop, so that the remaining features stay explainable. 
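
A minimal sketch of dropping a chosen reference level, assuming a made-up 'City' column. The helper name `dummies_with_reference` is hypothetical; the idea is that the dropped level (here 'Others') becomes the baseline against which the remaining coefficients are read.

```python
import pandas as pd

def dummies_with_reference(data, column, reference):
    """One-hot encode `column` and drop the dummy for the chosen reference level."""
    dummies = pd.get_dummies(data[column], prefix=column, dtype=int)
    return dummies.drop(columns=f"{column}_{reference}")

toy = pd.DataFrame({"City": ["Mumbai", "Thane", "Others", "Mumbai"]})
city_dummies = dummies_with_reference(toy, "City", reference="Others")
print(city_dummies)
```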

In [None]:
catFeatures

In [None]:
# Getting dummy variables and adding the results to the master dataframe

for each in catFeatures:
    # dtype=int yields 0/1 integers rather than booleans
    dummy = pd.get_dummies(df[each], drop_first=False, prefix=each, dtype=int)
    df = pd.concat([df, dummy], axis=1)
    print (f'dummy columns are added for the feature {each}')

In [None]:
# Drop the specific reference dummy columns chosen for each categorical feature

dummydropFeatures = ['Lead Origin_Others', 
                     'City_Others',
                     'Lead Source_Missing',
                     'Specialization_Missing',
                     'What is your current occupation_Missing',
                     'Last Activity_Missing']

df.drop(dummydropFeatures, axis=1, inplace=True )

In [None]:
catFeatures

In [None]:
# Drop the original categorical columns since the dummy variables are added for these categorical columns

df.drop(catFeatures, axis=1, inplace=True )

In [None]:
df.head()

In [None]:
df.columns

### <span style="color:navy"> 4.2 Train-Test Split and Logistic Regression Model Building:

The following steps are followed in building a model: 
    
* Import the necessary packages for model preprocessing and model building
* Split the train data and test data at 70% and 30%
* Scale the Numeric features using MinMaxScaler
* Build the model using a combination of automatic and manual processing
* Start the model with RFE features (automatic) and use feature reduction by dropping one feature at a time. 
* Build the model and fit the training data.
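
The split-then-scale order in the steps above can be sketched on synthetic data as follows. The arrays and the names `X_demo`, `y_demo` and `scaler_demo` are made up for illustration. The point of the sketch is that the scaler learns its minimum and maximum from the training rows only and is then reused, unchanged, on the test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the numeric features and the target
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 50, size=(100, 2))
y_demo = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=101
)

scaler_demo = MinMaxScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # learn min/max on the training rows only
X_te_scaled = scaler_demo.transform(X_te)      # reuse the same min/max on the test rows

print(X_tr_scaled.min(), X_tr_scaled.max())
```

Test values may fall slightly outside [0, 1] after scaling, which is expected when the scaler has not seen them.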

In [None]:
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
#from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

In [None]:
# The target variable in y
y = df['Converted']
y.head()

In [None]:
# The feature variables in X

X=df.drop('Converted', axis=1)
X.head()

**Splitting the data into train and test**

In [None]:
# Splitting the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size=0.7, test_size=0.3, random_state=101)

### <span style="color:navy"> 4.3 Scaling the Numerical features 
    
* The Numeric features need to be scaled before building the model. 
* 'TotalVisits', 'Total Time Spent on Website', 'Page Views Per Visit' are the numeric features to be scaled. 

In [None]:
numFeatures

In [None]:
#### Scaling the numerical columns
scaler = MinMaxScaler()

X_train[numFeatures] = scaler.fit_transform(X_train[numFeatures])

X_train.head()

### <span style="color:navy"> 4.4 Build the Logistic Regression model with RFE features

In [None]:
# Build the Logistic Regression Model
logmodel = LogisticRegression()

from sklearn.feature_selection import RFE
rfe = RFE(estimator=logmodel, n_features_to_select=20)  # running RFE with 20 variables as output
rfe = rfe.fit(X_train, y_train)

In [None]:
# print (rfe.support_)
# list(zip(X_train.columns, rfe.support_, rfe.ranking_))

In [None]:
#list of RFE supported columns
cols = X_train.columns[rfe.support_]
cols

In [None]:
# Defining a function that fits a statsmodels GLM (binomial family) on the given columns and prints its summary.
# The first argument only labels the iteration; it is not used in the fit.

def gen_model(model_no, cols):
    X_train_sm = sm.add_constant(X_train[cols])
    glm = sm.GLM(y_train, X_train_sm, family = sm.families.Binomial())
    res = glm.fit()
    print (res.summary())
    return res

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
def calcVIF(col):
    vif = pd.DataFrame()
    vif['Features'] = X_train[col].columns
    vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
    vif['VIF'] = round(vif['VIF'], 2)
    vif = vif.sort_values(by = "VIF", ascending = False)
    return vif

### Model - Iteration 1

In [None]:
# Generate the first model using the RFE features

logm1 = LogisticRegression()

#Pass the columns to generate the model and print summary
res = gen_model(logm1, cols)

# Check the VIF for the features
calcVIF(cols).head(3)

In [None]:
res

---------------

### <span style="color:navy"> 4.5 Building Iterations of the model after reducing the features
    
The next step is to build iterations of the model after dropping one feature at a time using P values and VIFs
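
For intuition about the VIF side of this procedure, the sketch below computes VIF_j = 1 / (1 - R_j^2) directly with NumPy on synthetic data. It is an assumption-laden illustration: each auxiliary regression here includes an intercept, so the numbers need not match what `variance_inflation_factor` returns for the same columns, and the helper name `vif_values` is hypothetical.

```python
import numpy as np

def vif_values(X):
    """VIF for each column: regress it on the other columns (plus an intercept)."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid.var() / target.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic features: the third column is almost a copy of the first
rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = rng.normal(size=100)
c = a + 0.1 * rng.normal(size=100)

vifs = vif_values(np.column_stack([a, b, c]))
print(np.round(vifs, 2))
```

The near-duplicate pair gets a large VIF while the independent column stays close to 1, which is the pattern that motivates dropping one feature at a time.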

### Model - Iteration 2

In [None]:
# Dropping the next unwanted variable to pass to the model.
cols = cols.drop('Specialization_Supply Chain Management')
logm2 = LogisticRegression()

#Pass the columns to generate the model and print summary
res = gen_model(logm2, cols)

# Check the VIF for the features
calcVIF(cols).head(3)

-------------------

### Model - Iteration 3

In [None]:
# Dropping the next unwanted variable to pass to the model.
cols = cols.drop('Specialization_Banking, Investment And Insurance')
logm3 = LogisticRegression()

#Pass the columns to generate the model and print summary
res = gen_model(logm3, cols)

# Check the VIF for the features
calcVIF(cols).head(3)

------

### Model - Iteration 4

In [None]:
# Dropping the next unwanted variable to pass to the model.
cols = cols.drop('Specialization_Finance Management')
logm4 = LogisticRegression()

#Pass the columns to generate the model and print summary
res = gen_model(logm4, cols)

# Check the VIF for the features
calcVIF(cols).head(3)

-----

### Model - Iteration 5

In [None]:
# Dropping the next unwanted variable to pass to the model.
cols = cols.drop('Specialization_Marketing Management')
logm5 = LogisticRegression()

#Pass the columns to generate the model and print summary
res = gen_model(logm5, cols)

# Check the VIF for the features
calcVIF(cols).head(3)

-------------

### Model - Iteration 6

In [None]:
# Dropping the next unwanted variable to pass to the model.
cols = cols.drop('Lead Source_Reference')
logm6 = LogisticRegression()

#Pass the columns to generate the model and print summary
res = gen_model(logm6, cols)

# Check the VIF for the features
calcVIF(cols).head(3)

-------------

### Model - Iteration 7

In [None]:
# Dropping the next unwanted variable to pass to the model.
cols = cols.drop('Page Views Per Visit')
logm7 = LogisticRegression()

#Pass the columns to generate the model and print summary
res = gen_model(logm7, cols)

# Check the VIF for the features
calcVIF(cols).head(3)

-------------

### Model - Iteration 8

In [None]:
# # Dropping the next unwanted variable to pass to the model.
# cols = cols.drop('',1)
# logm8 = LogisticRegression()

# #Pass the columns to generate the model and print summary
# res = gen_model(logm8, cols)

# # Check the VIF for the features
# calcVIF(cols).head(3)

-----------------

### <span style="color:navy"> 4.6 Getting the predicted values on the train set
    
The following steps are done after building the model
    
* Get the predictions on the training dataset with the final model 
* Use a cut-off of 0.5 for the initial predictions
* Derive the Classification report and Classification metrics with the initial cutoff and predictions
* Derive the Area under the ROC curve for the initial cut-off and predictions
* Calculate the predicted values for the different cut-offs to arrive at the optimal cutoff
* Plot the Sensitivity / Specificity curve for the different cut-offs and identify the optimal cut-off
* Get the final_Predictions and the metrics for the Predictions with the optimal cut-off
* Assign a Lead Score to the Training dataset based on the Conversion probability of the final_Predictions
* Measuring the Precision Recall Trade-off
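
The cut-off search in the steps above can be sketched as follows. The labels and probabilities are made up, and choosing the cut-off where sensitivity and specificity are closest is one common heuristic assumed here for illustration; the helper names `sweep_cutoffs` and `pick_balanced_cutoff` are hypothetical.

```python
import numpy as np

def sweep_cutoffs(y_true, y_prob, cutoffs):
    """Return (cutoff, sensitivity, specificity) for each cut-off.

    Assumes both classes are present in y_true.
    """
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    rows = []
    for c in cutoffs:
        y_hat = (y_prob > c).astype(int)
        tp = int(((y_hat == 1) & (y_true == 1)).sum())
        tn = int(((y_hat == 0) & (y_true == 0)).sum())
        fp = int(((y_hat == 1) & (y_true == 0)).sum())
        fn = int(((y_hat == 0) & (y_true == 1)).sum())
        rows.append((c, tp / (tp + fn), tn / (tn + fp)))
    return rows

def pick_balanced_cutoff(rows):
    """Pick the cut-off where sensitivity and specificity are closest."""
    return min(rows, key=lambda r: abs(r[1] - r[2]))[0]

# Hypothetical labels and predicted probabilities
y_true_demo = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_prob_demo = [0.05, 0.15, 0.25, 0.35, 0.65, 0.45, 0.55, 0.75, 0.85, 0.95]
cutoffs_demo = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

rows_demo = sweep_cutoffs(y_true_demo, y_prob_demo, cutoffs_demo)
best_cutoff = pick_balanced_cutoff(rows_demo)
print(best_cutoff)
```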

**Get the predictions on the training dataset with the final model.**
* Use a cut-off of 0.5 for the initial predictions.

In [None]:
# Getting the predicted values on the train set

X_train_sm = sm.add_constant(X_train[cols])
y_train_pred = res.predict(X_train_sm)
y_train_pred[:10]

In [None]:
y_train_pred_final = pd.DataFrame({'Converted':y_train.values, 'Converted_prob':y_train_pred})
y_train_pred_final['Prospect ID'] = y_train.index
y_train_pred_final.head()

In [None]:
y_train_pred_final['Predicted'] = y_train_pred_final.Converted_prob.map(lambda x: 1 if x > 0.5 else 0)

# Let's see the head
y_train_pred_final.head()

### <span style="color:navy"> 4.7 Evaluation Metrics for the Train dataset
    
**Derive the Classification report and Classification metrics with the initial cutoff and predictions**

In [None]:
from sklearn.metrics import classification_report

In [None]:
print (classification_report(y_train_pred_final['Converted'], y_train_pred_final['Predicted']))

In [None]:
from sklearn import metrics
from sklearn.metrics import confusion_matrix

In [None]:
def get_metrics(actual, predicted):
    confusion = confusion_matrix(actual, predicted)

    # Let's check the overall accuracy.
    Accuracy = metrics.accuracy_score(actual, predicted)

    TN = confusion[0,0] # true negatives
    FP = confusion[0,1] # false positives
    FN = confusion[1,0] # false negatives
    TP = confusion[1,1] # true positive 

    # Calculate the different Metrics
    Sensitivity = TP / float(TP+FN) # calculate Sensitivity
    Specificity = TN / float(TN+FP) # calculate specificity
    Precision   = TP / float(TP+FP) # calculate Precision
    Recall      = TP / float(TP+FN) # calculate Recall (same quantity as Sensitivity)
    FPR = (FP/ float(TN+FP))        # Calculate False Positive Rate - predicting conversion when customer does not convert
    PPV = (TP / float(TP+FP))       # positive predictive value 
    NPV = (TN / float(TN+ FN))      # Negative predictive value
    
    F1 = 2*(Precision*Recall)/(Precision+Recall)

    # Print the Metrics
    print (f'The Confusion Matrix is \n {confusion}')
    print (f'The Accuracy is    : {round (Accuracy,2)} ({Accuracy})')
    print (f'The Sensitivity is : {round (Sensitivity,2)} ({Sensitivity})')
    print (f'The Specificity is : {round (Specificity,2)} ({Specificity})')
    print (f'The Precision is   : {round (Precision, 2)} ({Precision})')
    print (f'The Recall is      : {round (Recall, 2)} ({Recall})')
    print (f'The f1 score is    : {round (F1, 2)} ({F1})')
    print (f'The False Positive Rate is       : {round (FPR, 2)} ({FPR})')
    print (f'The Positive Predictive Value is : {round (PPV, 2)} ({PPV})')
    print (f'The Negative Predictive Value is : {round (NPV, 2)} ({NPV})')


In [None]:
def plot_confusion_metrics(actual, predicted):
    sns.set_style('white')
    cm = confusion_matrix(actual, predicted)
    plt.clf()
    plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Wistia)
    classNames = ['Negative','Positive']
    plt.title('True Converted and Predicted Converted Confusion Matrix', fontsize=14)
    plt.ylabel('True Converted', fontsize=14)
    plt.xlabel('Predicted Converted', fontsize=14)
    tick_marks = np.arange(len(classNames))
    plt.xticks(tick_marks, classNames, fontsize=14)
    plt.yticks(tick_marks, classNames, fontsize=14)
    s = [['TN','FP'], ['FN', 'TP']]
    for i in range(2):
        for j in range(2):
            plt.text(j,i, str(s[i][j])+" = "+str(cm[i][j]), fontsize=14, ha='center')
    plt.show()

In [None]:
get_metrics(y_train_pred_final.Converted, y_train_pred_final.Predicted)

In [None]:
plot_confusion_metrics(y_train_pred_final.Converted, y_train_pred_final.Predicted)
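
As a quick sanity check on the definitions, the sketch below computes the main metrics directly from the four confusion-matrix counts. Recall and sensitivity are two names for the same quantity, TP / (TP + FN). The helper name `metrics_from_counts` and the counts are hypothetical and are not taken from the lead dataset.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Return the main classification metrics from the four confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # true positive rate, also called recall
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)     # positive predictive value
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        'sensitivity': sensitivity,
        'recall': sensitivity,     # identical to sensitivity by definition
        'specificity': specificity,
        'precision': precision,
        'f1': f1,
    }

# Hypothetical counts for illustration only
demo = metrics_from_counts(tn=50, fp=10, fn=5, tp=35)
for name, value in demo.items():
    print(f'{name:12s}: {value:.3f}')
```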

**Plot the ROC curve and derive the area under it (AUC) from the predicted probabilities**

In [None]:
def draw_roc( actual, probs ):
    fpr, tpr, thresholds = metrics.roc_curve( actual, probs,
                                              drop_intermediate = False )
    auc_score = metrics.roc_auc_score( actual, probs )
    plt.figure(figsize=(5, 5))
    plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic')
    plt.legend(loc="lower right")
    plt.show()

In [None]:
fpr, tpr, thresholds = metrics.roc_curve( y_train_pred_final.Converted, 
                                          y_train_pred_final.Converted_prob, drop_intermediate = False )

In [None]:
draw_roc(y_train_pred_final.Converted, y_train_pred_final.Converted_prob)
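
One way to read the AUC: it is the probability that a randomly chosen converted lead receives a higher predicted probability than a randomly chosen non-converted lead. The sketch below computes it that way on toy data. The helper `auc_by_pairs` is hypothetical and compares every pair, so it is only practical for small samples.

```python
import numpy as np

def auc_by_pairs(actual, probs):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    actual = np.asarray(actual)
    probs = np.asarray(probs, dtype=float)
    pos = probs[actual == 1]
    neg = probs[actual == 0]
    # Difference between every positive score and every negative score
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (len(pos) * len(neg))

# Toy labels and scores, not taken from the lead dataset
demo_auc = auc_by_pairs([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(demo_auc)
```

Here three of the four positive/negative pairs are ordered correctly, so the value is 0.75.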

###  <span style="color:navy"> 4.8 Getting the Optimal cutoff and final evaluation Metrics for Train Dataset
    
**Calculate the predicted values for the different cut-offs to arrive at the optimal cutoff**

In [None]:
# Let's create columns with different probability cutoffs 
numbers = [float(x)/10 for x in range(10)]
for i in numbers:
    y_train_pred_final[i]= y_train_pred_final.Converted_prob.map(lambda x: 1 if x > i else 0)
y_train_pred_final.head()

In [None]:
numbers

**Plot the Sensitivity / Specificity curve for the different cut-offs and identify the optimal cut-off**

In [None]:
# Now let's calculate accuracy sensitivity and specificity for various probability cutoffs.
cutoff_df = pd.DataFrame( columns = ['probability','accuracy','sensitivity','specificity'])
from sklearn.metrics import confusion_matrix

#     TN = confusion[0,0] # true negatives
#     FP = confusion[0,1] # false positives
#     FN = confusion[1,0] # false negatives
#     TP = confusion[1,1] # true positive 
    
for i in numbers:
    cm1 = metrics.confusion_matrix(y_train_pred_final['Converted'], y_train_pred_final[i] )
    total1=sum(sum(cm1))
    accuracy = (cm1[0,0]+cm1[1,1])/total1
    
    specificity = cm1[0,0]/(cm1[0,0]+cm1[0,1])
    sensitivity = cm1[1,1]/(cm1[1,0]+cm1[1,1])
    cutoff_df.loc[i] =[ i , accuracy, sensitivity, specificity]
print(cutoff_df)

In [None]:
# Let's plot accuracy sensitivity and specificity for various probabilities.

sns.set_style("whitegrid") # white/whitegrid/dark/ticks
sns.set_context("paper") # talk/poster
cutoff_df.plot.line(x='probability', y=['accuracy','sensitivity','specificity'], figsize=(10,6))

plt.xticks(np.arange(0, 1, step=0.05), size = 12)
plt.yticks(size = 12)
plt.title('Accuracy, Sensitivity and Specificity for various probabilities', fontsize=14)
plt.xlabel('Probability', fontsize=14)
plt.ylabel('Metrics', fontsize=14)
plt.show()

#### <span style="color:blue"> From the curve above, 0.36 can be taken as the optimal cutoff probability
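
The cut-off here is read off the plot. A numerical cross-check is sketched below, assuming the common convention of picking the cut-off where sensitivity and specificity are closest, evaluated on a finer grid than the ten points plotted. The helper names and toy data are hypothetical.

```python
import numpy as np

def sens_spec(actual, probs, cutoff):
    """Sensitivity and specificity when predicting 1 for probabilities above the cutoff."""
    actual = np.asarray(actual)
    predicted = (np.asarray(probs) > cutoff).astype(int)
    tp = np.sum((actual == 1) & (predicted == 1))
    fn = np.sum((actual == 1) & (predicted == 0))
    tn = np.sum((actual == 0) & (predicted == 0))
    fp = np.sum((actual == 0) & (predicted == 1))
    return tp / (tp + fn), tn / (tn + fp)

def balanced_cutoff(actual, probs, grid):
    """Return the first grid value that minimises |sensitivity - specificity|."""
    gaps = [abs(sens - spec) for sens, spec in
            (sens_spec(actual, probs, c) for c in grid)]
    return float(grid[int(np.argmin(gaps))])

# Toy data for illustration only
demo_actual = [0, 0, 0, 0, 1, 1, 1, 1]
demo_probs = [0.05, 0.15, 0.25, 0.45, 0.402, 0.55, 0.65, 0.85]
grid = np.arange(100) / 100          # 0.00, 0.01, ..., 0.99
best = balanced_cutoff(demo_actual, demo_probs, grid)
print(best)
```

On real data the two curves rarely meet exactly on a grid point, so the result should be treated as a guide and confirmed against the plot.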

**Get the final_Predictions and the metrics for the Predictions with the optimal cut-off**

In [None]:
# From the curve above, 0.36 is taken as the optimal cutoff probability.

y_train_pred_final['final_Predicted'] = y_train_pred_final.Converted_prob.map( lambda x: 1 if x > 0.36 else 0)
y_train_pred_final.head()

In [None]:
# Get all the necessary Metrics for the Training dataset for cut-off 0.36
print (f'The Final Evaluation Metrics for the train Dataset: ')
print (f'----------------------------------------------------')

get_metrics(y_train_pred_final['Converted'], y_train_pred_final['final_Predicted'])

In [None]:
# Plot Confusion metrics for final predicted for train data

plot_confusion_metrics(y_train_pred_final.Converted, y_train_pred_final.final_Predicted)

In [None]:
# Classification report for the training dataset
print (classification_report(y_train_pred_final['Converted'], y_train_pred_final['final_Predicted']))

**Assign a Lead Score to the Training dataset based on the Conversion probability of the final_Predictions**

In [None]:
# Assign a Lead score based on the predictions

y_train_pred_final['Lead_Score'] = y_train_pred_final.Converted_prob.map( lambda x: round(x*100))

y_train_pred_final[['Converted','Converted_prob','Prospect ID','final_Predicted','Lead_Score']].head()

In [None]:
y_train_pred_final.head()
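
Because the Lead Score is a rounded percentage, two leads can share a score while falling on opposite sides of the cut-off. The sketch below illustrates this with made-up probabilities. The helper names are hypothetical; the 0.36 cut-off is the one chosen in this section.

```python
CUTOFF = 0.36  # cut-off chosen in this section

def lead_score(prob):
    """Lead score as the conversion probability expressed as a rounded percentage."""
    return round(prob * 100)

def is_hot_lead(prob, cutoff=CUTOFF):
    """Classify on the unrounded probability, mirroring the final_Predicted column."""
    return 1 if prob > cutoff else 0

# Hypothetical probabilities for illustration only
demo_probs = [0.10, 0.358, 0.362, 0.90]
demo = [(p, lead_score(p), is_hot_lead(p)) for p in demo_probs]
for p, score, flag in demo:
    print(p, score, flag)
```

Both 0.358 and 0.362 map to a score of 36, yet only the second is flagged, so classification is best done on the probability rather than on the rounded score.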

#### Measuring the Precision Recall Trade-off

In [None]:
from sklearn.metrics import precision_recall_curve

p, r, thresholds = precision_recall_curve(y_train_pred_final['Converted'], y_train_pred_final['Converted_prob'])

In [None]:
# Plot the Precision / Recall tradeoff chart
sns.set_style("whitegrid") # white/whitegrid/dark/ticks
sns.set_context("paper") # talk/poster

plt.figure(figsize=(8, 4), dpi=100, facecolor='w', edgecolor='k', frameon=True)
plt.plot(thresholds, p[:-1], "g-", label='Precision')
plt.plot(thresholds, r[:-1], "r-", label='Recall')
plt.legend(loc='best')
plt.xticks(np.arange(0, 1, step=0.05))
plt.title('Precision and Recall for various probabilities', fontsize=14)
plt.xlabel('Probability', fontsize=14)
plt.ylabel('Metrics', fontsize=14)
plt.show()
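
The objective mentions a target conversion rate of around 80%. If the sales team contacted only the leads predicted as converted, precision would play the role of that conversion rate, so one could look for the lowest cut-off that reaches a target precision. The sketch below does this on toy data. The helper names, the grid and the data are assumptions for illustration, and this is not the cut-off selection used in this notebook.

```python
import numpy as np

def precision_at(actual, probs, cutoff):
    """Precision when predicting 1 for probabilities above the cutoff (NaN if none are predicted)."""
    actual = np.asarray(actual)
    predicted = np.asarray(probs) > cutoff
    if predicted.sum() == 0:
        return float('nan')
    return float((actual[predicted] == 1).mean())

def lowest_cutoff_for_precision(actual, probs, target, grid):
    """Return the smallest grid cutoff whose precision reaches the target, or None."""
    for c in grid:
        prec = precision_at(actual, probs, c)
        if not np.isnan(prec) and prec >= target:
            return float(c)
    return None

# Toy data for illustration only
demo_actual = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
demo_probs = [0.105, 0.205, 0.305, 0.355, 0.405, 0.505, 0.605, 0.655, 0.805, 0.905]
grid = np.arange(100) / 100
cutoff = lowest_cutoff_for_precision(demo_actual, demo_probs, target=0.80, grid=grid)
print(cutoff)
```

Raising the cut-off to gain precision usually costs recall, which is the trade-off the chart shows.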

--------------

## <span style="color:navy"> 5. Model Validation 
    
The next step is to validate the model with the test dataset. 

The following are the steps involved:
* Scale the numeric features of the test dataset with the scaler fitted on the training data
* Make predictions on the X_test dataset
* Create a Dataset with the Prospect ID and the conversion probability for the test dataset
* Generate the Lead Score for the test dataset based on the predicted probability from the model
* Get the final Predicted values using the optimal threshold value
* Get the Final evaluation Metrics for the test dataset with the actual converted values and final predicted values

    
### <span style="color:navy"> 5.1 Making Predictions for the Test Dataset

In [None]:
X_test.head()

**Scale the numeric features of the test dataset with the scaler fitted on the training data**

In [None]:
# Apply the already-fitted scaler to the test data (transform only, no re-fitting)
X_test[numFeatures] = scaler.transform(X_test[numFeatures])
X_test.head()
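
The test data is transformed with statistics learned from the training data only, so that no information from the test set leaks into the model. This section does not show which scaler was fitted; the sketch below assumes standardisation (zero mean, unit variance) and uses made-up values.

```python
import numpy as np

# Hypothetical values of one numeric feature
train_vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
test_vals = np.array([3.0, 6.0])

# "Fit": learn the statistics from the training data only
train_mean = train_vals.mean()   # 3.0
train_std = train_vals.std()     # population standard deviation, sqrt(2)

# "Transform": apply the training statistics to the test data
test_scaled = (test_vals - train_mean) / train_std
print(test_scaled)
```

The test value 6.0 lies outside the training range and scales to a value above 2, which is expected: the test set is not guaranteed to have zero mean or unit variance after the transform.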

In [None]:
X_test.shape

In [None]:
cols

**Making Predictions on the X_test dataset using the final model**

In [None]:
# Making Predictions on the X_test dataset

X_test = X_test[cols]
X_test_sm = sm.add_constant(X_test)
X_test.head()

In [None]:
y_test_pred = res.predict(X_test_sm)

In [None]:
y_test_pred[:5]

**Create a Dataset with the Prospect ID and the conversion probability for the test dataset**

In [None]:
# Converting y_pred to a dataframe from an array
y_test_pred_df = pd.DataFrame(y_test_pred)

# Converting y_test to dataframe
y_test_df = pd.DataFrame(y_test)

# Add the Prospect ID (the index of y_test) as a column
y_test_pred_df['Prospect ID'] = y_test_df.index

# Removing index for both dataframes to append them side by side 
y_test_pred_df.reset_index(drop=True, inplace=True)
y_test_df.reset_index(drop=True, inplace=True)

# Concatenate y_test_df and y_test_pred_df side by side
y_test_pred_final = pd.concat([y_test_df, y_test_pred_df],axis=1)

# Renaming the column 
y_test_pred_final= y_test_pred_final.rename(columns={ 0 : 'Converted_prob'})
y_test_pred_final.head(10)
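
The same table can be built in one step, in the style used for the training data earlier. The sketch below shows this on toy Series; the names ending in `_demo` and the values are hypothetical. Aligning the predictions to the index of the actual values before taking `.values` guards against the two Series being in a different order.

```python
import pandas as pd

# Toy stand-ins for y_test and y_test_pred, sharing the Prospect ID index
y_test_demo = pd.Series([1, 0, 1], index=[101, 205, 309], name='Converted')
y_pred_demo = pd.Series([0.82, 0.10, 0.47], index=[101, 205, 309])

demo_final = pd.DataFrame({
    'Prospect ID': y_test_demo.index,
    'Converted': y_test_demo.values,
    # Align predictions to the order of the actual values
    'Converted_prob': y_pred_demo.loc[y_test_demo.index].values,
})
print(demo_final)
```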

**Generate the Lead Score for the test dataset based on the predicted probability from the model**

In [None]:
# Rearranging the columns
y_test_pred_final = y_test_pred_final[['Prospect ID','Converted','Converted_prob']].copy()
y_test_pred_final['Lead_Score'] = y_test_pred_final.Converted_prob.map( lambda x: round(x*100))
y_test_pred_final.head()

**Get the final Predicted values using the optimal threshold value**

In [None]:
# Predict the final y values based on the optimal threshold of 0.36
y_test_pred_final['final_Predicted'] = y_test_pred_final['Converted_prob'].map(lambda x: 1 if x > 0.36 else 0)

y_test_pred_final.head()

### <span style="color:navy"> 5.2 Final Evaluation Metrics for the Test Dataset

**Get the Final evaluation Metrics for the test dataset with the actual converted values and final predicted values**

In [None]:
# Get all the necessary Metrics for the Test dataset 

print (f'The Final Evaluation Metrics for the test Dataset: ')
print (f'---------------------------------------------------')
get_metrics(y_test_pred_final['Converted'], y_test_pred_final['final_Predicted'])

In [None]:
# Plot Confusion metrics for final predicted for test data

plot_confusion_metrics(y_test_pred_final.Converted, y_test_pred_final.final_Predicted)

In [None]:
# Print the classification report for the Test Dataset
print (classification_report(y_test_pred_final['Converted'], y_test_pred_final['final_Predicted']))

----------

## <span style="color:navy"> 6. Assigning the Lead score for each Prospect ID from the original data

The final step is to combine the predicted Lead Scores from the train and test datasets and attach them to the original dataset.

In [None]:
y_train_pred_final.head()

In [None]:
# Create Dataset with y_train Prospect ID and Lead score
y_train_score = y_train_pred_final[['Prospect ID','Lead_Score']]

# Create Dataset with y_test Prospect ID and Lead score
y_test_score = y_test_pred_final[['Prospect ID','Lead_Score']]

# Concatenate the y_train scores and the y_test scores
df_score = pd.concat([y_train_score, y_test_score], ignore_index=True)

# Set the index of the final score dataset as the Prospect ID to concatenate the score dataset to the original data
df_score.set_index('Prospect ID', inplace=True)

# Left join the original Leads dataset with the scores dataset on the Prospect ID index (DataFrame.join
# defaults to a left join). This adds a new column 'Lead_Score' to the original dataset.
df_orig = df_orig.join(df_score['Lead_Score'])

df_orig.head()
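
`DataFrame.join` keeps every row of the left-hand frame, so any Prospect ID without a score (for example, a row dropped during cleaning) ends up with a missing `Lead_Score`. The sketch below shows how this could be checked on toy frames; the frame names and values are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the original data and the score table
orig_demo = pd.DataFrame(
    {'Lead Source': ['Google', 'Referral', 'Direct']},
    index=pd.Index(['a', 'b', 'c'], name='Prospect ID'),
)
score_demo = pd.DataFrame(
    {'Lead_Score': [91, 12]},
    index=pd.Index(['a', 'c'], name='Prospect ID'),
)

# Default left join on the index: row 'b' has no score and gets NaN
joined = orig_demo.join(score_demo['Lead_Score'])
missing = int(joined['Lead_Score'].isna().sum())
print(joined)
print(f'Rows without a Lead Score: {missing}')
```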

------------------

## <span style="color:navy"> 7. Determining Feature Importance

#### Selecting the coefficients of the selected features from our final model excluding the intercept

In [None]:
pd.options.display.float_format = '{:.2f}'.format
model_params = res.params[1:]
model_params

#### Getting a relative coefficient value for all the features with respect to the feature with the highest coefficient

In [None]:
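# The signed coefficients are used here; taking model_params.abs() instead would rank
# the features by magnitude regardless of direction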

feature_importance = model_params
feature_importance = 100.0 * (feature_importance / feature_importance.max())
feature_importance
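
Logistic regression coefficients are on the log-odds scale, so exponentiating a coefficient gives the factor by which the odds of conversion are multiplied for a one-unit increase in that feature. The sketch below illustrates this with made-up coefficients; the feature names and values are hypothetical and are not the fitted parameters.

```python
import math

# Hypothetical coefficients for illustration only
demo_coefs = {'feature_a': 1.0, 'feature_b': -0.5, 'feature_c': 0.0}

# exp(coefficient) = multiplicative change in the odds per unit increase
odds_ratios = {name: math.exp(beta) for name, beta in demo_coefs.items()}

for name, ratio in odds_ratios.items():
    print(f'{name}: odds ratio = {ratio:.2f}')
```

An odds ratio above 1 raises the odds of conversion, a ratio below 1 lowers them, and a ratio of exactly 1 means no effect. Since the numeric features were scaled, "one unit" refers to the scaled feature.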

#### Sorting the feature variables based on their relative coefficient values

In [None]:
# Sort the feature variables based on their relative coefficient values

sorted_idx = np.argsort(feature_importance, kind='quicksort')
sorted_idx

#### Plot showing the feature variables based on their relative coefficient values

In [None]:
# Plot to show the relative importance of each feature in the model
pos = np.arange(sorted_idx.shape[0]) + .5

fig = plt.figure(figsize=(10,6))
ax = fig.add_subplot(1, 1, 1)
ax.barh(pos, feature_importance[sorted_idx], align='center', color = 'tab:blue',alpha=0.8)
ax.set_yticks(pos)
ax.set_yticklabels(np.array(X_train[cols].columns)[sorted_idx], fontsize=12)
ax.set_xlabel('Relative Feature Importance', fontsize=14)

plt.tight_layout()   
plt.show()

---------------

## <span style="color:navy"> 8. Final Observations and Recommendations

#### <span style="color:navy"> The Final Evaluation Metrics for the train Dataset: 

* The Accuracy is    : 0.80
* The Sensitivity is : 0.80
* The Specificity is : 0.81
* The Precision is   : 0.73
* The Recall is      : 0.80
* The f1 score is    : 0.76
    
#### <span style="color:navy"> The Final Evaluation Metrics for the test Dataset: 

* The Accuracy is    : 0.81
* The Sensitivity is : 0.81
* The Specificity is : 0.81
* The Precision is   : 0.72
* The Recall is      : 0.81
* The f1 score is    : 0.76
    
#### <span style="color:navy"> X-Education has a better chance of converting a potential lead when:
* **The total time spent on the website is high:**
Leads who spend more time on the website are more likely to convert.
* **The current occupation is specified:**
Leads who are working professionals have a high chance of converting. Unemployed leads, students, housewives and business professionals, who may be looking for better prospects, are also good prospects to focus on.
* **The lead origin is the Lead Add Form:**
Leads who responded or engaged through the Lead Add Form have a higher chance of converting.
* **The total number of visits is high:**
Leads who make a greater number of visits have a higher chance of converting.
* **The last activity is SMS Sent or Email Opened:**
Leads whose last recorded activity was an SMS sent or an email opened have a higher chance of converting.