**Problem Statement**
An education company named X Education sells online courses to industry professionals. On any given day, many professionals who are interested in the courses land on their website and browse for courses. 

The company markets its courses on several websites and search engines like Google. Once these people land on the website, they might browse the courses, fill in a form for a course, or watch some videos. When they fill in a form providing their email address or phone number, they are classified as a lead. The company also gets leads through past referrals. Once these leads are acquired, employees from the sales team start making calls, writing emails, etc. Through this process, some of the leads get converted while most do not. The typical lead conversion rate at X Education is around 30%. 

Although X Education gets a lot of leads, its lead conversion rate is poor. For example, if they acquire 100 leads in a day, only about 30 of them are converted. To make this process more efficient, the company wishes to identify the most promising leads, also known as ‘Hot Leads’. If they successfully identify this set of leads, the lead conversion rate should go up, as the sales team will focus on communicating with the promising leads rather than calling everyone.

X Education has appointed you to help them select the most promising leads, i.e. the leads that are most likely to convert into paying customers. The company requires you to build a model that assigns a lead score to each lead, such that leads with a higher score have a higher chance of conversion and leads with a lower score have a lower chance of conversion. The CEO, in particular, has given a ballpark target lead conversion rate of around 80%.

In [None]:
# Suppressing Warnings
import warnings
warnings.filterwarnings('ignore')
# Importing Pandas and NumPy
import pandas as pd, numpy as np
import matplotlib.pyplot as plt, seaborn as sns

In [None]:
#function definitions

#Function to return the percentage of null values in each column, sorted in descending order
def nulls(df):
    return (100*round(df.isnull().sum()/len(df),4).sort_values(ascending=False))

#Function to get the VIFs for all the variables in a dataframe
from statsmodels.stats.outliers_influence import variance_inflation_factor
def getvif(df):
    if 'const' in list(df.columns):
        df1=df.drop('const', axis=1) 
    else:
        df1 = df.copy()
    vif=pd.DataFrame()
    vif['Features'] = df1.columns
    vif['VIF'] = [variance_inflation_factor(df1.values, i) for i in range(df1.shape[1])]
    vif['VIF'] = round(vif.VIF,2)
    vif = vif.sort_values(by = 'VIF', ascending = False)
    return vif

In [None]:
#importing dataset
df = pd.read_csv('../input/leadscore/Leads.csv')
df.head()

## Data Cleaning and EDA

In [None]:
df.shape

In [None]:
#Let's see what columns we have
df.info()

In [None]:
#Description of numerical columns
df.describe()

In [None]:
#There are a lot of columns with a lot of null values
nulls(df)

In [None]:
#Let's see if any rows are duplicated in their entirety
df.duplicated(keep='first').sum()

In [None]:
#Let's see if there are any duplicates, this time using only the Prospect ID as the identifier
df.duplicated(keep='first',subset='Prospect ID').sum()

In [None]:
#There are some values that have been mentioned as "Select". As per the data dictionary, these are default values selected when the user does not make any other selection.
df.Specialization.value_counts(normalize=True)*100

In [None]:
#For our data to make more sense, we will be replacing the "Select" values with NaNs.
df=df.replace('Select',np.nan)

In [None]:
#Now, we check for null value percentages again 
nulls_list=nulls(df)
print(nulls_list)

In [None]:
#Let's drop the columns with more than 50% null values. These would be of little use to our model building process.
df.drop(list(nulls_list.loc[nulls_list>50].index),axis=1,inplace=True)

In [None]:
#checking nulls again
nulls(df)

In [None]:
#We will try and see what these scores and indices contain
for i,each in enumerate(list(nulls(df).index)[:4]):
    print(df[each].describe())

In [None]:
#The scores are numerical, using box plots
#Columns are passed as x= so the calls keep working on seaborn versions that no longer accept them positionally
plt.figure(figsize=(20,12))
plt.subplot(221)
sns.boxplot(x=df['Asymmetrique Profile Score'])
plt.subplot(222)
sns.boxplot(x=df['Asymmetrique Activity Score'])
#Indices are categorical, using countplot
plt.subplot(223)
sns.countplot(x=df['Asymmetrique Profile Index'])
plt.subplot(224)
sns.countplot(x=df['Asymmetrique Activity Index'])

In [None]:
#We cannot see substantial variance in these features across the data set. Since a large chunk of their values is missing, we can choose to drop them without losing vital information.
df.drop(list(nulls(df).index)[:4],axis=1,inplace=True)
nulls(df)

In [None]:
#inspecting city
df.City.value_counts(normalize=True)*100

In [None]:
#57% of our data points are from Mumbai. We can choose to impute the nulls in city column with Mumbai.
df['City'] = df['City'].replace(np.nan, 'Mumbai')
nulls(df)

In [None]:
#inspecting specialization 
df['Specialization'].value_counts(normalize=True).sort_values(ascending=False)*100

In [None]:
#Here a null might mean that either the customer has a specialization that does not exist in this list, or no specialization. We can use 'Others' as a new category here. 
df['Specialization'] = df['Specialization'].replace(np.nan,'Others')
nulls(df)

In [None]:
#inspecting tags
df.Tags.value_counts(normalize=True).sort_values(ascending=False)*100

In [None]:
#We'll remove tags since it is a score variable
df.drop('Tags', axis=1, inplace=True)

In [None]:
#inspecting What matters most to you in choosing a course
df['What matters most to you in choosing a course'].value_counts(normalize=True).sort_values(ascending=False)*100

In [None]:
#Imputing nulls with mode
df['What matters most to you in choosing a course'] = df['What matters most to you in choosing a course'].replace(np.nan,'Better Career Prospects')
nulls(df)

In [None]:
#inspecting occupation feature
df['What is your current occupation'].value_counts(normalize=True).sort_values(ascending=False)*100

In [None]:
#imputing nulls with mode
df['What is your current occupation']=df['What is your current occupation'].replace(np.nan,'Unemployed')
nulls(df)

In [None]:
#inspecting country
df['Country'].value_counts(normalize=True).sort_values(ascending=False)*100

In [None]:
#imputing mode
df['Country'] = df['Country'].replace(np.nan, 'India')
nulls(df)

In [None]:
#the rest of the features contain less than 2% null values. We can safely drop these rows.
df.dropna(inplace=True)
nulls(df)

In [None]:
#Percentage of rows retained is pretty good.
(df.shape[0]/9240)*100

Now we have the cleaned dataset ready. We will be using this for our analysis.

In [None]:
#We will inspect all the categorical columns, looking for highly skewed features and features with infrequent values that can be clubbed

#select_dtypes replaces the private _get_numeric_data() and keeps the column order stable
cat_cols = df.drop('Prospect ID',axis=1).select_dtypes(exclude='number').columns
plt.figure(figsize=(20,8*13))
for i,each in enumerate(cat_cols):
    plt.subplot(13,2,i+1)
    sns.countplot(y=df[each])

In [None]:
#We can clearly see the following highly skewed variables, we will use value counts to confirm our suspicion
for each in ['Digital Advertisement','Through Recommendations','Magazine','Do Not Call','Search','Newspaper Article',
        'Update me on Supply Chain Content','Receive More Updates About Our Courses','I agree to pay the amount through cheque',
        'What matters most to you in choosing a course','Do Not Email','X Education Forums','Newspaper','Country',
        'Get updates on DM Content']:
    print('\n')
    print(df[each].value_counts(normalize=True)*100)

In [None]:
#From the plots and value counts above, we can identify some highly skewed variables. These variables will not be of much value to the model, and don't contain valuable information. 
#We can remove all these redundant columns
df.drop(['Digital Advertisement','Through Recommendations','Magazine','Do Not Call','Search','Newspaper Article',
        'Update me on Supply Chain Content','Receive More Updates About Our Courses','I agree to pay the amount through cheque',
        'What matters most to you in choosing a course','Do Not Email','X Education Forums','Newspaper','Country',
        'Get updates on DM Content'],axis=1,inplace=True)
df.shape

In [None]:
df.info()

In [None]:
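#Checking correlation between the remaining numerical variables
#numeric_only=True is needed on newer pandas versions, which no longer silently skip non-numeric columns
df.corr(numeric_only=True)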

In [None]:
df.head()

In [None]:
#Lets change column names to more readable ones
df=df.rename(columns={'Total Time Spent on Website':'Time Spent','Page Views Per Visit':'Views','What is your current occupation':'Occupation','A free copy of Mastering The Interview':'Free Copy'})

In [None]:
#We will now work towards reducing the number of possible values each variable can take.
#let's start with Last Activity
df['Last Activity'].value_counts(normalize=True).sort_values(ascending=False)*100

In [None]:
#List of categories that have less than 1% frequency
freq = df['Last Activity'].value_counts(normalize=True)
to_combine = list(freq[freq<0.01].index)
to_combine

In [None]:
#We'll club the less frequent categories into a single category "Others"
df['Last Activity'] = df['Last Activity'].replace(to_combine,'Others')
df['Last Activity'].value_counts()
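
Since the same clubbing step is repeated for several columns, it could be wrapped in a small helper. The following is a minimal sketch, not part of the original notebook: the helper name `club_rare`, its parameters and the toy data are all assumptions made for illustration, with the 1% threshold taken from the cell above.

```python
import pandas as pd

def club_rare(series, min_share=0.01, label='Others'):
    """Replace categories whose relative frequency is below min_share with label.

    Hypothetical helper, introduced only for this illustration.
    """
    share = series.value_counts(normalize=True)
    rare = share[share < min_share].index
    # keep values that are not rare, replace the rest with the label
    return series.where(~series.isin(rare), label)

# Toy example: 'Fax' appears once in 200 rows (0.5%), so it is clubbed
toy = pd.Series(['Email'] * 120 + ['SMS'] * 79 + ['Fax'])
clubbed = club_rare(toy)
print(clubbed.value_counts())
```

With a helper like this, each column would need a single call such as `df['Lead Source'] = club_rare(df['Lead Source'])`.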

In [None]:
#Inspecting lead source
df['Lead Source'].value_counts(normalize=True).sort_values(ascending=False)*100

In [None]:
#We'll club the less frequent categories into a single category "Others"
freq = df['Lead Source'].value_counts(normalize=True)
df['Lead Source'] = df['Lead Source'].replace(list(freq[freq<0.01].index),'Others')
df['Lead Source'].value_counts()

In [None]:
#inspecting specialization
df['Specialization'].value_counts(normalize=True).sort_values(ascending=False)*100

We'll be leaving specializations as-is.

In [None]:
#inspecting occupation
df['Occupation'].value_counts(normalize=True).sort_values(ascending=False)*100

In [None]:
#inspecting last notable activity
df['Last Notable Activity'].value_counts(normalize=True).sort_values(ascending=False)*100

In [None]:
#We'll club the less frequent categories into a single category "Others"
freq = df['Last Notable Activity'].value_counts(normalize=True)
df['Last Notable Activity'] = df['Last Notable Activity'].replace(list(freq[freq<0.01].index),'Others')
df['Last Notable Activity'].value_counts()

In [None]:
#We can also convert the Free Copy column to numeric (binary 0/1 encoding)
df['Free Copy'] = df['Free Copy'].map({'No':0,'Yes':1})
df['Free Copy'].describe()

## Visualizations on final dataset

In [None]:
#Let's check data imbalance
100*df['Converted'].sum()/len(df)

Around 37% of the data corresponds to the leads which have been converted. Thus, the data is sufficiently balanced and we can continue with building our model here.

In [None]:
df.head()

In [None]:
plt.figure(figsize=(16,8))
sns.countplot(x='Lead Source',hue='Converted',data=df)

#### Inference
Leads from 'Reference' and 'Welingak Website' have high conversion rates.

In [None]:
num_vars=['TotalVisits','Time Spent','Views']
plt.figure(figsize=(8,6))
for i,each in enumerate(num_vars):
    plt.subplot(1,3,i+1)
    sns.boxplot(y=each,x='Converted',data=df)
    plt.tight_layout()

#### Inference
Time spent on the website has a strong association with conversion: converted leads tend to spend more time on the website.

In [None]:
#Let's analyze all the categorical variables against the target variable
cats=['Lead Origin','Specialization','Occupation','City','Last Notable Activity', 'Last Activity']
plt.figure(figsize=(16,25))
for i,each in enumerate(cats):
    plt.subplot(3,2,i+1)
    sns.countplot(y=each,data=df,hue='Converted')
plt.tight_layout()

#### Inferences
* Leads with last notable activity as SMS sent have a high chance of conversion
* Working professionals have the highest conversion ratio
* Leads originating from Lead Add Form have a high conversion ratio

In [None]:
#Let's see the correlation heatmap (numeric columns only)
sns.heatmap(df.corr(numeric_only=True), cmap="RdYlGn",annot=True)

We do not see any alarmingly high levels of correlation in the data

## Data Preparation 
We will be converting all categorical features into dummy variables by implementing one-hot encoding. 

In [None]:
#We can use Prospect ID for identification, dropping Lead Number
df.drop('Lead Number',inplace=True, axis=1)

In [None]:
df.columns

In [None]:
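#creating dummy variables
#dtype=int keeps the dummies as 0/1 integers (newer pandas versions return booleans by default)
dummy = pd.get_dummies(df[['Lead Origin','Lead Source','Last Activity','Specialization',
                           'Occupation','City','Last Notable Activity']], drop_first=True, dtype=int)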
dummy.head()
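
To make the effect of `drop_first=True` concrete, here is a small sketch on made-up data (the `toy` frame is an assumption for illustration, not taken from the dataset). The first level in sorted order is dropped and becomes the baseline against which the remaining dummy columns are interpreted.

```python
import pandas as pd

# Made-up data, for illustration only
toy = pd.DataFrame({'City': ['Mumbai', 'Thane', 'Mumbai', 'Other Cities']})

# Levels in sorted order are 'Mumbai', 'Other Cities', 'Thane';
# drop_first=True removes 'Mumbai', so a row of all zeros means 'Mumbai'
toy_dummies = pd.get_dummies(toy[['City']], drop_first=True, dtype=int)
print(toy_dummies)
```

Dropping one level per feature avoids perfectly collinear dummy columns, which matters for the VIF checks later on.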


In [None]:
#Merging dummies into dataset
df=pd.concat([df,dummy],axis=1)
df.head()

In [None]:
#dropping dummified variables 
df.drop(['Lead Origin','Lead Source','Last Activity','Specialization',
                           'Occupation','City','Last Notable Activity'],inplace=True,axis=1)

In [None]:
df.head()

## Splitting the data into train and test set

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
x=df.drop(['Converted','Prospect ID'],axis=1)
y=df[['Converted']]
x_train, x_test, y_train, y_test = train_test_split(x,y,train_size=0.7,test_size=0.3,random_state=1)

In [None]:
# y_train.index = x_train['Prospect ID']
# x_train.index = x_train['Prospect ID']
# x_train.drop('Prospect ID',axis=1,inplace=True)

In [None]:
x_train.shape

In [None]:
y_train.shape

In [None]:
# y_test.index = x_test['Prospect ID']
# x_test.index=x_test['Prospect ID']
# x_test.drop('Prospect ID',axis=1,inplace=True)

In [None]:
x_test.shape

In [None]:
y_test.shape

## Scaling of numerical features

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler=StandardScaler()

In [None]:
x_train[['TotalVisits','Time Spent','Views']].describe()

In [None]:
x_train[['TotalVisits','Time Spent','Views']] = scaler.fit_transform(x_train[['TotalVisits','Time Spent','Views']])
x_train[['TotalVisits','Time Spent','Views']].describe()
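
As a rough illustration of what fitting and transforming mean here, the sketch below standardizes a made-up feature with NumPy. The arrays are assumptions for illustration; the point is that the mean and standard deviation are learned from the training data only and then reused unchanged on the test data.

```python
import numpy as np

# Made-up values standing in for one numeric feature
train = np.array([0.0, 2.0, 4.0, 6.0])
test = np.array([3.0, 9.0])

# "fit": learn the mean and standard deviation from the training data only
mu = train.mean()
sigma = train.std()   # population standard deviation (ddof=0), the convention StandardScaler uses

# "transform": apply the same training statistics to both sets
train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma

print(train_scaled.mean(), train_scaled.std())
print(test_scaled)
```

Reusing the training statistics keeps information from the test set out of the preprocessing step.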

## Building the model

In [None]:
import statsmodels.api as sm

In [None]:
#Logistic Regression Model
m1 = sm.GLM(y_train,sm.add_constant(x_train), family = sm.families.Binomial())
m1.fit().summary()
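
For reference, a binomial GLM with its default logit link models the conversion probability as shown below. The symbols are notation introduced here, not taken from the notebook: $x_1, \dots, x_k$ are the features, $\beta_0$ is the constant term and $\beta_1, \dots, \beta_k$ are the fitted coefficients.

```latex
P(\text{Converted} = 1 \mid x)
  = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}
\qquad\Longleftrightarrow\qquad
\log\frac{P}{1 - P} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k
```

The second form shows that each coefficient acts additively on the log-odds of conversion.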

Now we shall work towards refining this model

## Feature selection using RFE

In [None]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
from sklearn.feature_selection import RFE
rfe = RFE(logreg, n_features_to_select=20)             # running RFE with 20 variables as output
rfe = rfe.fit(x_train, y_train.values.ravel())        # ravel() passes the target as a 1-D array
rfe.support_

In [None]:
list(zip(x_train.columns, rfe.support_, rfe.ranking_))

In [None]:
#columns chosen upon running RFE
cols=x_train.columns[rfe.support_]
cols

## Rebuilding the model and assessing using SM

In [None]:
x_train_sm = sm.add_constant(x_train[cols])
m2=sm.GLM(y_train,x_train_sm,family=sm.families.Binomial())
m2.fit().summary()

In [None]:
#checking VIF

getvif(x_train_sm)
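
As a rough illustration of what a VIF measures: for feature i it equals 1 / (1 - R²), where R² comes from regressing feature i on all the other features. The sketch below computes this with NumPy on synthetic data. The function name `vif_manual` and the data are assumptions for illustration. This version includes an intercept in each auxiliary regression, so its numbers may differ from those of `getvif`, which drops the `const` column before calling statsmodels.

```python
import numpy as np

def vif_manual(X):
    """VIF for each column of a 2-D array, with an intercept in each auxiliary regression."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for i in range(k):
        y = X[:, i]
        others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        centered = y - y.mean()
        r2 = 1 - (resid @ resid) / (centered @ centered)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Synthetic data: c is nearly a copy of a, b is independent of both
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)
vifs = vif_manual(np.column_stack([a, b, c]))
print(vifs.round(2))
```

In this toy setup the two nearly identical columns get large VIFs while the independent one stays close to 1, which is the pattern to look for when deciding which feature to drop.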

In [None]:
#let's drop Occupation_Housewife since it is less significant in the model (relatively high p-value)
x_train_sm.drop('Occupation_Housewife',axis=1,inplace=True)

In [None]:
#Rebuilding the model
x_train_sm = sm.add_constant(x_train_sm)
m3 = sm.GLM(y_train,x_train_sm,family=sm.families.Binomial())
print(m3.fit().summary())
#checking vifs
getvif(x_train_sm)

In [None]:
#Dropping Lead Source_Reference since it is the least significant and has a high VIF
x_train_sm = sm.add_constant(x_train_sm.drop('Lead Source_Reference',axis=1))
#Rebuilding the model
m4 = sm.GLM(y_train,x_train_sm,family=sm.families.Binomial())
print(m4.fit().summary())
#checking vifs
getvif(x_train_sm)

In [None]:
#Dropping Occupation_Unemployed since it has a high correlation with other features
x_train_sm = sm.add_constant(x_train_sm.drop('Occupation_Unemployed',axis=1))
#Rebuilding the model
m5 = sm.GLM(y_train,x_train_sm,family=sm.families.Binomial())
print(m5.fit().summary())
#checking vifs
getvif(x_train_sm)

In [None]:
#Dropping Last Activity_Others since it is coming out to be relatively less significant
x_train_sm = sm.add_constant(x_train_sm.drop('Last Activity_Others',axis=1))
#Rebuilding the model
m6 = sm.GLM(y_train,x_train_sm,family=sm.families.Binomial())
print(m6.fit().summary())
#checking vifs
getvif(x_train_sm)

In [None]:
#Dropping Specialization_Hospitality Management since it is coming out to be relatively less significant
x_train_sm = sm.add_constant(x_train_sm.drop('Specialization_Hospitality Management',axis=1))
#Rebuilding the model
m7 = sm.GLM(y_train,x_train_sm,family=sm.families.Binomial())
print(m7.fit().summary())
#checking vifs
getvif(x_train_sm)

In [None]:
#We are getting a high VIF for last activity and last notable activity "SMS Sent"
#We can see that these two are highly correlated, thus we can drop one of them
x_train_sm[['Last Activity_SMS Sent','Last Notable Activity_SMS Sent']].corr()

In [None]:
#We'll drop Last Activity_SMS Sent due to the high correlation
x_train_sm = sm.add_constant(x_train_sm.drop('Last Activity_SMS Sent',axis=1))
#Rebuilding the model
m8 = sm.GLM(y_train,x_train_sm,family=sm.families.Binomial())
print(m8.fit().summary())
#checking VIFs too
getvif(x_train_sm)

In [None]:
#Last Activity_Email Link Clicked is highly insignificant in the model with a high p value
#we will drop it 
x_train_sm = sm.add_constant(x_train_sm.drop('Last Activity_Email Link Clicked',axis=1))
#Rebuilding the model
m9 = sm.GLM(y_train,x_train_sm,family=sm.families.Binomial())
print(m9.fit().summary())
#checking VIFs too
getvif(x_train_sm)

## Model Finalized

In [None]:
#Let's move forward with this model
res = m9.fit()

In [None]:
#getting predicted values on the train set
y_train_pred = res.predict(x_train_sm)

In [None]:
y_train_pred.shape

In [None]:
y_train_pred[:10]

In [None]:
#changing predictions to array
y_train_pred=y_train_pred.values.reshape(-1)
y_train_pred[:10]

In [None]:
#creating new df for predictions
y_train_pred_final = pd.DataFrame({'Converted':y_train.values.reshape(-1),'Prob':y_train_pred})
y_train_pred_final['ID'] = y_train.index
y_train_pred_final.head()

In [None]:
y_train_pred_final['Predicted'] = y_train_pred_final.Prob.map(lambda x: 1 if x>0.5 else 0)
y_train_pred_final[:10]

Now we have trained our model and have the predictions on the training set. We will now see some metrics on the predictions made on the training set.

In [None]:
from sklearn import metrics

In [None]:
#confusion matrix
confusion = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.Predicted)
confusion

In [None]:
#We now have our confusion matrix. 
tn = confusion[0][0] #true negatives
tp = confusion[1][1] #true positives
fp = confusion[0][1] #false positives
fn = confusion[1][0] #false negatives

In [None]:
#Let's check overall accuracy
metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final.Predicted)

#### Note
We have obtained a fairly good accuracy of 81.6% on the training set at the 0.5 cut-off.

We will now look at some more metrics, to see how the model is really performing on the train data, and how relevant it will be to meet the business objective.

In [None]:
#Sensitivity - this will need to be maximized since the business objective is to identify the hottest leads. 
#We would not want to miss any of the positives in this scenario.
tp / float (tp+fn)

In [None]:
#specificity - this is a measure of how well the model can tell if a lead is not worth following
tn / float(tn+fp)

In [None]:
#False positive rate - from all the negatives, how many were falsely predicted as positive? This should be minimized.
fp/float(tn+fp)

In [None]:
#True positive rate - from all the positives, how many were correctly predicted as positive? This should be maximized.
#This is same as sensitivity
tp/float(tp+fn)

In [None]:
#Positive predictive value
tp/float(tp+fp)

In [None]:
#negative predictive value 
tn/float(tn+fn)
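
The metrics computed in the cells above can be gathered in one place. The sketch below does this with a hypothetical helper, `classification_rates`, applied to made-up counts; neither the helper nor the numbers come from the notebook.

```python
def classification_rates(tn, fp, fn, tp):
    """Return the rates derived from a 2x2 confusion matrix as a dict."""
    return {
        'accuracy': (tp + tn) / (tp + tn + fp + fn),
        'sensitivity': tp / (tp + fn),            # true positive rate, also called recall
        'specificity': tn / (tn + fp),
        'false_positive_rate': fp / (tn + fp),    # equals 1 - specificity
        'precision': tp / (tp + fp),              # positive predictive value
        'negative_predictive_value': tn / (tn + fn),
    }

# Made-up counts, for illustration only
rates = classification_rates(tn=50, fp=10, fn=5, tp=35)
for name, value in rates.items():
    print(f'{name}: {value:.3f}')
```

With the notebook's variables, the call would look like `classification_rates(tn, fp, fn, tp)`.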

All the metrics calculated above are in an acceptable range, but we can improve them further by tuning the model to better align with the business objective.

We will now plot the ROC curve.
An ROC curve demonstrates several things:

- It shows the tradeoff between sensitivity and specificity (any increase in sensitivity will be accompanied by a decrease in specificity).
- The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test.
- The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.
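
Each point on an ROC curve is the pair (false positive rate, true positive rate) obtained at one probability threshold. The sketch below computes a few such points by hand on made-up labels and probabilities; the helper name `roc_point` and the data are assumptions for illustration.

```python
import numpy as np

# Made-up labels and predicted probabilities, for illustration only
actual = np.array([0, 0, 1, 0, 1, 1])
probs = np.array([0.1, 0.3, 0.4, 0.6, 0.7, 0.9])

def roc_point(actual, probs, threshold):
    """Return (false positive rate, true positive rate) at a single threshold."""
    predicted = probs > threshold
    tpr = (predicted & (actual == 1)).sum() / (actual == 1).sum()
    fpr = (predicted & (actual == 0)).sum() / (actual == 0).sum()
    return fpr, tpr

for t in [0.2, 0.5, 0.8]:
    fpr, tpr = roc_point(actual, probs, t)
    print(f'threshold={t}: FPR={fpr:.2f}, TPR={tpr:.2f}')
```

Raising the threshold lowers both rates, which is the sensitivity versus specificity trade-off described in the list above.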

In [None]:
def draw_roc( actual, probs ):
    fpr, tpr, thresholds = metrics.roc_curve( actual, probs,
                                              drop_intermediate = False )
    auc_score = metrics.roc_auc_score( actual, probs )
    plt.figure(figsize=(5, 5))
    plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic')
    plt.legend(loc="lower right")
    plt.show()

    return None

In [None]:
draw_roc(y_train_pred_final.Converted, y_train_pred_final.Prob)

#### Inference
Since the ROC curve stays close to the left and top edges and resembles a right-angled triangle, the model has a good operating characteristic.

## Finding optimal cut-off point
We initially chose the cut-off point for the model as 0.5. The lead score itself would serve the purpose of the model, but for the sake of analysis, we will try to find the optimal cut-off point for prediction, which can be included as a recommendation to the business. 

In [None]:
# Let's create columns with different probability cutoffs 
numbers = [(round(i/100,2)) for i in range(0,101,5)]
for i in numbers:
    y_train_pred_final[i]= y_train_pred_final.Prob.map(lambda x: 1 if x > i else 0)
y_train_pred_final.head()

In [None]:
# Now let's calculate accuracy sensitivity and specificity for various probability cutoffs.
cutoff_df = pd.DataFrame( columns = ['prob','accuracy','sensitivity','specificity'])
from sklearn.metrics import confusion_matrix

# TP = confusion[1,1] # true positive 
# TN = confusion[0,0] # true negatives
# FP = confusion[0,1] # false positives
# FN = confusion[1,0] # false negatives

num = [(round(i/100,2)) for i in range(0,101,5)]
for i in num:
    cm1 = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final[i] )
    total1=sum(sum(cm1))
    accuracy = (cm1[0,0]+cm1[1,1])/total1
#     accuracy = metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final[i])
    specificity = cm1[0,0]/(cm1[0,0]+cm1[0,1])
    sensitivity = cm1[1,1]/(cm1[1,0]+cm1[1,1])
    cutoff_df.loc[i] =[ i ,accuracy,sensitivity,specificity]
print(cutoff_df)

In [None]:
# Let's plot accuracy sensitivity and specificity for various probabilities.
cutoff_df.plot.line(x='prob', y=['accuracy','sensitivity','specificity'])
plt.show()
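
One common way to read a plot like this is to look for the cut-off at which sensitivity and specificity are closest to each other. The sketch below locates that point numerically on synthetic data. Everything in it is an assumption for illustration (the generated labels and probabilities, the variable names), so the cut-off it prints says nothing about the real dataset.

```python
import numpy as np

# Synthetic labels and probabilities, for illustration only
rng = np.random.default_rng(42)
actual = rng.integers(0, 2, size=1000)
probs = np.clip(0.35 * actual + rng.normal(0.3, 0.2, size=1000), 0, 1)

cutoffs = np.arange(0.0, 1.0001, 0.05)
gaps = []
for c in cutoffs:
    predicted = probs > c
    sensitivity = (predicted & (actual == 1)).sum() / (actual == 1).sum()
    specificity = (~predicted & (actual == 0)).sum() / (actual == 0).sum()
    gaps.append(abs(sensitivity - specificity))

# Cut-off with the smallest gap between sensitivity and specificity
best_cutoff = cutoffs[int(np.argmin(gaps))]
print(round(float(best_cutoff), 2))
```

Balancing the two rates is only one possible criterion; a business that cares more about not missing hot leads may prefer a lower cut-off.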

#### Inference
From the plot above, a cut-off of around 0.3 appears to give the best balance between sensitivity and specificity. Thus, a lead with a predicted probability above 0.3 (a lead score above 30) would qualify as a hot lead that the company should pursue, since it has a much better chance of getting converted.

Since the business objective is to get a target of 80% lead conversion rate, we have to keep this in mind while setting the threshold as well.

In [None]:
y_train_pred_final.head()

In [None]:
#Trying cutoff 0.35
cutoff=0.35

In [None]:
y_train_pred_final['final_predicted'] = y_train_pred_final.Prob.map( lambda x: 1 if x > cutoff else 0)

#Also adding Lead Score in line with the business objective
y_train_pred_final['Lead Score'] = y_train_pred_final['Prob'].apply(lambda x: int(round(x*100,0)))

#We can remove the rest of the columns now
y_train_pred_final = y_train_pred_final[['Converted','Prob','ID','final_predicted','Lead Score']]

y_train_pred_final.head()

In [None]:
confusion2 = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.final_predicted )
TP = confusion2[1,1] # true positive 
TN = confusion2[0,0] # true negatives
FP = confusion2[0,1] # false positives
FN = confusion2[1,0] # false negatives
confusion2

In [None]:
# Let's check the overall accuracy.
metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final.final_predicted)

In [None]:
# Let's see the sensitivity of our logistic regression model
TP / float(TP+FN)

In [None]:
# Let us calculate specificity
TN / float(TN+FP)

In [None]:
# Calculate false positive rate - predicting conversion when the customer would not convert
print(FP/ float(TN+FP))

In [None]:
#Setting cutoff 0.3
cutoff=0.3

In [None]:
y_train_pred_final['final_predicted'] = y_train_pred_final.Prob.map( lambda x: 1 if x > cutoff else 0)

#Also adding Lead Score in line with the business objective
y_train_pred_final['Lead Score'] = y_train_pred_final['Prob'].apply(lambda x: int(round(x*100,0)))

#We can remove the rest of the columns now
y_train_pred_final = y_train_pred_final[['Converted','Prob','ID','final_predicted','Lead Score']]

y_train_pred_final.head()

In [None]:
# Let's check the overall accuracy.
metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final.final_predicted)

In [None]:
confusion2 = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.final_predicted )
confusion2

In [None]:
TP = confusion2[1,1] # true positive 
TN = confusion2[0,0] # true negatives
FP = confusion2[0,1] # false positives
FN = confusion2[1,0] # false negatives

In [None]:
# Let's see the sensitivity of our logistic regression model
TP / float(TP+FN)

In [None]:
# Let us calculate specificity
TN / float(TN+FP)

In [None]:
# Calculate false positive rate - predicting conversion when the customer would not convert
print(FP/ float(TN+FP))

In [None]:
# Positive predictive value 
print (TP / float(TP+FP))

In [None]:
# Negative predictive value
print (TN / float(TN+ FN))

#### Inference
* We obtain an accuracy of 79% on the training data, while maintaining a sensitivity of 84%. 
* We can proceed with these results since they are in line with business objective

## Precision and Recall

In [None]:
#We have the confusion matrix as 
confusion2

In [None]:
#Precision 
TP/(TP+FP)

In [None]:
#recall
TP/(TP+FN)

In [None]:
#We can also get the precision and recall values using sklearn
from sklearn.metrics import precision_score, recall_score

In [None]:
precision_score(y_train_pred_final.Converted, y_train_pred_final.final_predicted)

In [None]:
recall_score(y_train_pred_final.Converted, y_train_pred_final.final_predicted)

## Precision and Recall Trade-off

In [None]:
from sklearn.metrics import precision_recall_curve

In [None]:
p, r, thresholds = precision_recall_curve(y_train_pred_final.Converted, y_train_pred_final.Prob)

In [None]:
fig,ax=plt.subplots()
ax.plot(thresholds, p[:-1], "g-", label='Precision') #plotting precision as green line
ax.plot(thresholds, r[:-1], "r-", label = 'Recall') #plotting recall as red line
plt.xlabel('Probability Threshold')
legend = ax.legend(loc='best', shadow=True)
plt.ylabel('Precision/Recall')
plt.title('Precision - Recall Curve')
plt.show()

A value close to 0.3 seems to be optimal.
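
If the 80% target is read as the conversion rate among the leads the sales team actually contacts, it corresponds to precision rather than sensitivity. That reading is an assumption, not something stated in the problem statement. Under it, one could look for the lowest cut-off at which precision reaches the target, as in this sketch on made-up data (the helper name `lowest_cutoff_for_precision` is hypothetical).

```python
import numpy as np

def lowest_cutoff_for_precision(actual, probs, target, cutoffs):
    """Return the smallest cutoff whose precision is at least target, or None."""
    for c in sorted(cutoffs):
        predicted = probs > c
        if predicted.sum() == 0:
            continue   # precision is undefined when nothing is predicted positive
        precision = (predicted & (actual == 1)).sum() / predicted.sum()
        if precision >= target:
            return c
    return None

# Made-up labels and probabilities, for illustration only
actual = np.array([0, 0, 1, 0, 1, 1, 1, 1, 1, 1])
probs = np.array([0.05, 0.2, 0.25, 0.35, 0.45, 0.55, 0.65, 0.7, 0.85, 0.95])
cutoff = lowest_cutoff_for_precision(actual, probs, target=0.8,
                                     cutoffs=[0.1, 0.3, 0.5, 0.7, 0.9])
print(cutoff)
```

A higher cut-off chosen this way usually costs recall, so the two criteria need to be weighed against each other.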

## Making predictions on the test set

In [None]:
#We will scale numerical features just like we did in the train set. This time, we don't fit the scaler; we directly transform the data.
x_test[['TotalVisits','Time Spent','Views']] = scaler.transform(x_test[['TotalVisits','Time Spent','Views']])
x_test.head()

In [None]:
#retaining only the features that we used in our final model
x_test = x_test[list(x_train_sm.drop('const',axis=1).columns)]

In [None]:
x_test_sm = sm.add_constant(x_test)

In [None]:
x_test_sm

In [None]:
#Making predictions
y_test_pred = res.predict(x_test_sm)

In [None]:
#creating new df for predictions
y_test_pred_final = pd.DataFrame({'Converted':y_test.values.reshape(-1),'Prob':y_test_pred})
y_test_pred_final['ID'] = y_test.index
y_test_pred_final.head()

In [None]:
#Setting cutoff to 0.35 to see parameters
cutoff = 0.35

In [None]:
y_test_pred_final['Predicted'] = y_test_pred_final.Prob.map(lambda x: 1 if x>cutoff else 0)
#We will also add a "Lead Score" column, in line with the business objective.
y_test_pred_final['Lead Score'] = y_test_pred_final['Prob'].apply(lambda x: int(round(x*100,0)))
y_test_pred_final[:10]

In [None]:
#Confusion Matrix for predictions on test data
confusion3 = metrics.confusion_matrix(y_test_pred_final.Converted, y_test_pred_final.Predicted)
TP = confusion3[1,1] # true positive 
TN = confusion3[0,0] # true negatives
FP = confusion3[0,1] # false positives
FN = confusion3[1,0] # false negatives
confusion3

In [None]:
#Accuracy on test data
metrics.accuracy_score(y_test_pred_final.Converted, y_test_pred_final.Predicted)

In [None]:
# Let's see the sensitivity of our logistic regression model
TP / float(TP+FN)

In [None]:
# Let us calculate specificity
TN / float(TN+FP)

In [None]:
# Calculate false positive rate 
print(FP/ float(TN+FP))

In [None]:
#Setting cutoff to 0.3
cutoff=0.3

In [None]:
y_test_pred_final['Predicted'] = y_test_pred_final.Prob.map(lambda x: 1 if x>cutoff else 0)
#We will also add a "Lead Score" column, in line with the business objective.
y_test_pred_final['Lead Score'] = y_test_pred_final['Prob'].apply(lambda x: int(round(x*100,0)))
y_test_pred_final[:10]

In [None]:
#Accuracy on test data
metrics.accuracy_score(y_test_pred_final.Converted, y_test_pred_final.Predicted)

In [None]:
#Confusion Matrix for predictions on test data
confusion3 = metrics.confusion_matrix(y_test_pred_final.Converted, y_test_pred_final.Predicted)
TP = confusion3[1,1] # true positive 
TN = confusion3[0,0] # true negatives
FP = confusion3[0,1] # false positives
FN = confusion3[1,0] # false negatives
confusion3

In [None]:
# Let's see the sensitivity of our logistic regression model
TP / float(TP+FN)

In [None]:
# Let us calculate specificity
TN / float(TN+FP)

In [None]:
# Calculate false positive rate 
print(FP/ float(TN+FP))

In [None]:
# Positive predictive value 
print (TP / float(TP+FP))

In [None]:
# Negative predictive value
print (TN / float(TN+ FN))

#### Inference
* We obtain an accuracy of 79.25% on the test data while maintaining a sensitivity of 85.3%. Thus, we can conclude that the model performs well and can be rolled out to meet the business objective.
* The evaluation metrics remain about the same on the test data as on the training data. Hence, we can conclude that the model is quite stable.

In [None]:
#Recommendations
res.summary()

In [None]:
#We will now look at the most important features identified in our model

final_model = pd.DataFrame(res.params) #getting model parameters (features and coefficients)

final_model['Feature']=final_model.index 
final_model.index = range(len(final_model))

final_model = final_model.rename(columns = {0:'Coefficient'})[['Feature','Coefficient']] #renaming columns for better understanding
final_model.sort_values(by='Coefficient', ascending=False, ignore_index = True) #sorting by coefficient
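
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to explain to a business audience. The sketch below uses hypothetical coefficients; the feature names and values are made up and are not the fitted values from `res.params`.

```python
import math

# Hypothetical coefficients, for illustration only (not the fitted values)
coefficients = {'Feature A': 1.2, 'Feature B': -0.7, 'Feature C': 0.0}

# exp(coefficient) is the multiplicative change in the odds of conversion
# for a one-unit increase in the feature, holding the others fixed
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}
for name, ratio in sorted(odds_ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{name}: {ratio:.2f}')
```

For the scaled numeric features, one unit corresponds to one standard deviation of the training data; for dummy variables, it is the difference from the dropped baseline category.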

## Inferences
* The most important columns from the dataset can be identified as below
    * Lead Origin
    * Lead Source
    * Time Spent
    * Occupation
    * Last Activity
    * Last Notable Activity
* The most important features (dummy variables) used in the model can be identified as below: 
    * Lead Origin_Lead Add Form
    * Lead Source_Welingak Website
    * Occupation_Working Professional
    * Last Activity_Email Bounced
    * Last Notable Activity_SMS Sent
* Recommendations to business can be made on the basis of the lead score. Since we chose the probability cut off at 0.3, this would translate to a score of 30. This can be tweaked as and when the needs of the business change   
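
The mapping from probability to lead score and hot-lead flag used in this notebook can be sketched as follows. The function names `lead_score` and `is_hot_lead` and the sample probabilities are assumptions for illustration.

```python
def lead_score(prob):
    """Convert a predicted probability (0 to 1) into an integer lead score (0 to 100)."""
    return int(round(prob * 100))

def is_hot_lead(prob, cutoff=0.3):
    """Flag a lead as hot when its predicted probability exceeds the cutoff."""
    return prob > cutoff

# Made-up probabilities, for illustration only
for p in [0.12, 0.296, 0.31, 0.87]:
    print(p, lead_score(p), is_hot_lead(p))
```

Because the score is rounded, a lead with a probability just under 0.3 can still show a score of 30 without being flagged, so it is worth deciding whether the business rule is applied to the score or to the underlying probability.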