Problem Statement

An education company named X Education sells online courses to industry professionals. On any given day, many professionals who are interested in the courses land on their website and browse for courses.

The company markets its courses on several websites and search engines such as Google. Once people land on the website, they might browse the courses, fill in a form for a course, or watch some videos. When they fill in a form providing their email address or phone number, they are classified as a lead. The company also gets leads through past referrals. Once these leads are acquired, employees from the sales team start making calls, writing emails, etc. Through this process, some of the leads get converted while most do not. The typical lead conversion rate at X Education is around 30%.

Dataset URL: https://www.kaggle.com/ashydv/leads-dataset

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df=pd.read_csv("/kaggle/input/leads-dataset/Leads.csv")

Data Analysis

In [None]:
print(df.shape)

In [None]:
print(df.info())

In [None]:
df.head()

In [None]:
df.describe()

In [None]:
#check for duplicates
sum(df.duplicated(subset = 'Prospect ID')) == 0

In [None]:
#check for duplicates
sum(df.duplicated(subset = 'Lead Number')) == 0

In [None]:
#dropping Lead Number and Prospect ID since they are unique identifiers
df.drop(['Prospect ID', 'Lead Number'], axis=1, inplace = True)
df

In [None]:
#checking null values in each column
df.isnull().sum()

In [None]:
#dropping cols with 45% or more missing values
cols=df.columns
for i in cols:
    if((100*(df[i].isnull().sum()/len(df.index))) >= 45):
        df.drop(i, axis=1, inplace = True)

In [None]:
#checking percentage of null values in each column
round(100*(df.isnull().sum()/len(df.index)), 2)
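
The column-dropping loop above can also be written without mutating the frame while iterating over it. The following is a minimal sketch on a toy frame, assuming the same 45% cutoff; `toy` and `drop_sparse_columns` are illustrative names and not part of the notebook.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(frame, threshold=0.45):
    """Return a copy without columns whose missing fraction is >= threshold."""
    missing_fraction = frame.isnull().mean()  # per-column share of NaN values
    keep = missing_fraction[missing_fraction < threshold].index
    return frame[keep].copy()

# Toy data (hypothetical): 'b' is 50% missing, 'a' and 'c' are complete
toy = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [np.nan, np.nan, 1.0, 2.0],
    'c': ['x', 'y', 'z', 'w'],
})
kept = drop_sparse_columns(toy)
print(list(kept.columns))
```

Returning a new frame instead of using `inplace=True` keeps the original data available for comparison.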

## Attribute Analysis

# Categorical Attributes

Country attribute:

In [None]:
#checking value counts of Country column
df['Country'].value_counts(dropna=False)  #NaN = 2461 & unknown = 5

In [None]:
# Since India is the most common occurrence among the non-missing values, we impute all missing values with India
df['Country'] = df['Country'].replace(np.nan,'India')

In [None]:
#plotting spread of Country column
plt.figure(figsize=(15,5))
s1=sns.countplot(x='Country', hue='Converted', data=df)
s1.set_xticklabels(s1.get_xticklabels(),rotation=90)
plt.show()

In [None]:
#creating a list of columns to be dropped
cols_to_drop=['Country']

City Attribute

In [None]:
#checking value counts of "City" column
df['City'].value_counts(dropna=False) # Nan = 1420 & Select = 2249

In [None]:
#Converting 'Select' placeholder values to NaN across all columns, not only City
df = df.replace('Select', np.nan)

In [None]:
#checking value counts of "City" column
df['City'].value_counts(dropna=False) # Nan = 1420

In [None]:
df['City'] = df['City'].replace(np.nan,'Mumbai')

In [None]:
#plotting spread of City column after replacing NaN values
plt.figure(figsize=(10,5))
s1=sns.countplot(x='City', hue='Converted', data=df)
s1.set_xticklabels(s1.get_xticklabels(),rotation=90)
plt.show()

Specialization Attribute

In [None]:
#checking value counts of Specialization column
df['Specialization'].value_counts(dropna=False) #Nan = 3380

In [None]:
df['Specialization'] = df['Specialization'].replace(np.nan, 'Not Specified')

In [None]:
#plotting spread of Specialization column
plt.figure(figsize=(15,5))
s1=sns.countplot(x='Specialization', hue='Converted', data=df)
s1.set_xticklabels(s1.get_xticklabels(),rotation=90)
plt.show()

We see that specializations with Management in their name have a higher number of leads as well as more converted leads.
So this is a significant variable and should not be dropped.

In [None]:
#combining Management Specializations because they show similar trends
df['Specialization'] = df['Specialization'].replace(['Finance Management','Human Resource Management',
                                                           'Marketing Management','Operations Management',
                                                           'IT Projects Management','Supply Chain Management',
                                                            'Healthcare Management','Hospitality Management',
                                                           'Retail Management'] ,'Management_Specializations')  

In [None]:
#visualizing count of Variable based on Converted value
plt.figure(figsize=(15,5))
s1=sns.countplot(x='Specialization', hue='Converted', data=df)
s1.set_xticklabels(s1.get_xticklabels(),rotation=90)
plt.show()

Current occupation Attribute

In [None]:
#What is your current occupation
df['What is your current occupation'].value_counts(dropna=False) #Nan = 2690

In [None]:
#imputing Nan values with mode "Unemployed"
df['What is your current occupation'] = df['What is your current occupation'].replace(np.nan, 'Unemployed')

In [None]:
#checking count of values
df['What is your current occupation'].value_counts(dropna=False)

In [None]:
#visualizing count of Variable based on Converted value
s1=sns.countplot(x='What is your current occupation', hue='Converted', data=df)
s1.set_xticklabels(s1.get_xticklabels(),rotation=90)
plt.show()

Choosing a course Attribute

In [None]:
#checking value counts
df['What matters most to you in choosing a course'].value_counts(dropna=False) # Nan = 2709

In [None]:
#replacing Nan values with Mode "Better Career Prospects"
df['What matters most to you in choosing a course'] = df['What matters most to you in choosing a course'].replace(np.nan,'Better Career Prospects')

In [None]:
#checking value counts of variable
df['What matters most to you in choosing a course'].value_counts(dropna=False)

In [None]:
#visualizing count of Variable based on Converted value
s1=sns.countplot(x='What matters most to you in choosing a course', hue='Converted', data=df)
s1.set_xticklabels(s1.get_xticklabels(),rotation=90)
plt.show()

Tags Attribute

In [None]:
#checking value counts of Tag variable
df['Tags'].value_counts(dropna=False) #Nan = 3353

In [None]:
#replacing Nan values with "Not Specified"
df['Tags'] = df['Tags'].replace(np.nan,'Not Specified')

In [None]:
#visualizing count of Variable based on Converted value
plt.figure(figsize=(15,5))
s1=sns.countplot(x='Tags', hue='Converted', data=df)
s1.set_xticklabels(s1.get_xticklabels(),rotation=90)
plt.show()

In [None]:
#replacing tags with low frequency with "Other Tags"
df['Tags'] = df['Tags'].replace(['In confusion whether part time or DLP', 'in touch with EINS','Diploma holder (Not Eligible)',
                                     'Approached upfront','Graduation in progress','number not provided', 'opp hangup','Still Thinking',
                                    'Lost to Others','Shall take in the next coming month','Lateral student','Interested in Next batch',
                                    'Recognition issue (DEC approval)','Want to take admission but has financial problems',
                                    'University not recognized','switched off','Already a student','Not doing further education',
                                    'invalid number','wrong number given','Interested  in full time MBA'], 'Other_Tags')
                                    

In [None]:
#visualizing count of Variable based on Converted value
plt.figure(figsize=(15,5))
s1=sns.countplot(x='Tags', hue='Converted', data=df)
s1.set_xticklabels(s1.get_xticklabels(),rotation=90)
plt.show()
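
The manual list of low-frequency tags above could also be derived from a count threshold. This is a minimal sketch under that assumption; `group_rare`, the `min_count` cutoff, and the toy values are hypothetical and are not taken from the notebook's data.

```python
import pandas as pd

def group_rare(series, min_count, other_label='Other'):
    """Replace categories seen fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # keep values that are not rare, replace the rest
    return series.where(~series.isin(rare), other_label)

# Toy values for illustration only
tags = pd.Series(['Ringing', 'Ringing', 'Ringing', 'Busy', 'Busy', 'switched off'])
grouped = group_rare(tags, min_count=2, other_label='Other_Tags')
print(grouped.value_counts().to_dict())
```

A threshold avoids typing category names by hand, but the explicit list in the notebook makes it easier to see exactly which tags were merged.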

In [None]:
#'What matters most to you in choosing a course' is another column worth dropping, so we append it to the cols_to_drop list
cols_to_drop.append('What matters most to you in choosing a course')
cols_to_drop

Lead Source Attribute

In [None]:
#checking value counts of Lead Source column
df['Lead Source'].value_counts(dropna=False) #Nan=36

In [None]:
#replacing Nan Values and combining low frequency values
df['Lead Source'] = df['Lead Source'].replace(np.nan,'Others')
df['Lead Source'] = df['Lead Source'].replace('google','Google')
df['Lead Source'] = df['Lead Source'].replace('Facebook','Social Media')
df['Lead Source'] = df['Lead Source'].replace(['bing','Click2call','Press_Release',
                                                     'youtubechannel','welearnblog_Home',
                                                     'WeLearn','blog','Pay per Click Ads',
                                                    'testone','NC_EDM'] ,'Others')  

In [None]:
#visualizing count of Variable based on Converted value
plt.figure(figsize=(15,5))
s1=sns.countplot(x='Lead Source', hue='Converted', data=df)
s1.set_xticklabels(s1.get_xticklabels(),rotation=90)
plt.show()

Last Activity Attribute

In [None]:
# Last Activity:
df['Last Activity'].value_counts(dropna=False) #Nan = 103

In [None]:
df['Last Activity'] = df['Last Activity'].replace(np.nan,'Others')
df['Last Activity'] = df['Last Activity'].replace(['Unreachable','Unsubscribed',
                                                        'Had a Phone Conversation', 
                                                        'Approached upfront',
                                                        'View in browser link Clicked',       
                                                        'Email Marked Spam',                  
                                                        'Email Received','Resubscribed to emails',
                                                         'Visited Booth in Tradeshow'],'Others')

In [None]:
df['Last Activity'].value_counts(dropna=False)

Lead Origin Attribute

In [None]:
#Lead Origin
df['Lead Origin'].value_counts(dropna=False)

In [None]:
#visualizing count of Variable based on Converted value
plt.figure(figsize=(8,5))
s1=sns.countplot(x='Lead Origin', hue='Converted', data=df)
s1.set_xticklabels(s1.get_xticklabels(),rotation=90)
plt.show()

Inference
API and Landing Page Submission bring in a higher number of leads as well as conversions.
Lead Add Form has a very high conversion rate, but its lead count is not very high.
Lead Import and Quick Add Form get very few leads.
To improve the overall lead conversion rate, we have to improve the conversion of leads from the API and Landing Page Submission origins and generate more leads from Lead Add Form.
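
The count plot shows volumes; the conversion rate per origin can be read more directly from a grouped summary. The sketch below uses a small made-up frame (the numbers are not from the dataset) with the notebook's column names `Lead Origin` and `Converted`.

```python
import pandas as pd

# Toy data for illustration only; the real frame is `df`
toy = pd.DataFrame({
    'Lead Origin': ['API', 'API', 'API', 'API', 'Lead Add Form', 'Lead Add Form'],
    'Converted':   [1, 0, 0, 0, 1, 1],
})

# Converted is 0/1, so its mean within a group is that group's conversion rate
summary = (toy.groupby('Lead Origin')['Converted']
              .agg(leads='count', conversion_rate='mean')
              .sort_values('leads', ascending=False))
print(summary)
```

Applied to the real data, the same call would put lead counts and conversion rates side by side for each origin.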

Attribute (NO or YES Inputs)

In [None]:
#checking value counts for Do Not Call
df['Do Not Call'].value_counts(dropna=False)

In [None]:
#checking value counts for Do Not Email
df['Do Not Email'].value_counts(dropna=False)

In [None]:
#visualizing count of Variable based on Converted value

plt.figure(figsize=(15,5))

ax1=plt.subplot(1, 2, 1)
ax1=sns.countplot(x='Do Not Call', hue='Converted', data=df)
ax1.set_xticklabels(ax1.get_xticklabels(),rotation=90)

ax2=plt.subplot(1, 2, 2)
ax2=sns.countplot(x='Do Not Email', hue='Converted', data=df)
ax2.set_xticklabels(ax2.get_xticklabels(),rotation=90)
plt.show()


In [None]:
cols_to_drop.append('Do Not Call')
cols_to_drop

In [None]:
df.Search.value_counts(dropna=False)

In [None]:
df.Magazine.value_counts(dropna=False)

In [None]:
df['Newspaper Article'].value_counts(dropna=False)

In [None]:
df['X Education Forums'].value_counts(dropna=False)

In [None]:
df['Newspaper'].value_counts(dropna=False)

In [None]:
df['Digital Advertisement'].value_counts(dropna=False)

In [None]:
df['Through Recommendations'].value_counts(dropna=False)

In [None]:
df['Receive More Updates About Our Courses'].value_counts(dropna=False)

In [None]:
df['Update me on Supply Chain Content'].value_counts(dropna=False)

In [None]:
df['Get updates on DM Content'].value_counts(dropna=False)

In [None]:
df['I agree to pay the amount through cheque'].value_counts(dropna=False)

In [None]:
df['A free copy of Mastering The Interview'].value_counts(dropna=False)

In [None]:
#adding imbalanced columns to the list of columns to be dropped
cols_to_drop.extend(['Search','Magazine','Newspaper Article','X Education Forums','Newspaper',
                 'Digital Advertisement','Through Recommendations','Receive More Updates About Our Courses',
                 'Update me on Supply Chain Content',
                 'Get updates on DM Content','I agree to pay the amount through cheque'])

Last Notable Activity Attribute

In [None]:
#checking value counts of last Notable Activity
df['Last Notable Activity'].value_counts()

In [None]:
#clubbing lower frequency values
df['Last Notable Activity'] = df['Last Notable Activity'].replace(['Had a Phone Conversation','Email Marked Spam','Unreachable',
                                                                       'Unsubscribed','Email Bounced','Resubscribed to emails',  
                                                                       'View in browser link Clicked', 'Approached upfront',  
                                                                         'Form Submitted on Website', 'Email Received'],'Other_Notable_activity')                                                           

In [None]:
#checking value counts for variable
df['Last Notable Activity'].value_counts()

In [None]:
#visualizing count of Variable based on Converted value
plt.figure(figsize = (14,5))
ax1=sns.countplot(x = "Last Notable Activity", hue = "Converted", data = df )
ax1.set_xticklabels(ax1.get_xticklabels(),rotation=90)
plt.show()


In [None]:
#dropping the columns collected in cols_to_drop
leads = df.drop(cols_to_drop, axis=1)
leads.info()

In [None]:
#checking missing values in leftover columns
round(100*(leads.isnull().sum()/len(leads.index)),2)

# Numerical Attributes Analysis:

In [None]:
#Check the % of Data that has Converted Values = 1:
Converted = (sum(leads['Converted'])/len(leads['Converted'].index))*100
Converted

In [None]:
#Checking correlations of numeric values
# figure size
plt.figure(figsize=(8,6))
# heatmap
sns.heatmap(leads.select_dtypes(include='number').corr(), cmap="YlGnBu", annot=True)
plt.show()

Total Visits

In [None]:
#visualizing spread of variable
plt.figure(figsize=(6,4))
sns.boxplot(y=leads['TotalVisits'])
plt.show()

In [None]:
#checking percentile values for "Total Visits"

leads['TotalVisits'].describe(percentiles=[0.05,.25, .5, .75, .90, .95, .99])

In [None]:
#checking percentile values for "Total Visits"

leads['TotalVisits'].describe()

In [None]:
#Outlier Treatment: remove the top & bottom 1% of the column values (rows with missing TotalVisits are dropped too)
upper = leads.TotalVisits.quantile(0.99)
leads = leads[(leads.TotalVisits <= upper)]
lower = leads.TotalVisits.quantile(0.01)
leads = leads[(leads.TotalVisits >= lower)]
sns.boxplot(y=leads['TotalVisits'])
plt.show()
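
The percentile trimming can be wrapped in a small helper so the same rule is reused for each numeric column. This is a sketch, not the notebook's method: `trim_by_quantile` is a hypothetical name, and unlike the cell above it computes both cutoffs on the untrimmed column.

```python
import pandas as pd

def trim_by_quantile(frame, column, lower=0.01, upper=0.99):
    """Keep rows whose value in `column` lies within the given quantiles.

    Both cutoffs come from the original column. Rows with NaN in `column`
    are dropped because the comparison is False for them.
    """
    low, high = frame[column].quantile([lower, upper])
    return frame[frame[column].between(low, high)]

# Toy data: 1..99 plus one extreme value
toy = pd.DataFrame({'TotalVisits': list(range(1, 100)) + [1000]})
trimmed = trim_by_quantile(toy, 'TotalVisits')
print(len(toy), len(trimmed), trimmed['TotalVisits'].max())
```

On the toy column the extreme value 1000 and the smallest value are removed, which mirrors what the notebook does to `TotalVisits`.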

Total Time Spent on Website

In [None]:
#checking percentiles for "Total Time Spent on Website"
leads['Total Time Spent on Website'].describe(percentiles=[0.05,.25, .5, .75, .90, .95, .99])

In [None]:
#visualizing spread of numeric variable
plt.figure(figsize=(6,4))
sns.boxplot(y=leads['Total Time Spent on Website'])
plt.show()

Page Views Per Visit

In [None]:
#checking spread of "Page Views Per Visit"
leads['Page Views Per Visit'].describe()

In [None]:
#visualizing spread of numeric variable
plt.figure(figsize=(6,4))
sns.boxplot(y=leads['Page Views Per Visit'])
plt.show()

In [None]:
#Outlier Treatment: remove the top & bottom 1% of the column values
upper = leads['Page Views Per Visit'].quantile(0.99)
leads = leads[leads['Page Views Per Visit'] <= upper]
lower = leads['Page Views Per Visit'].quantile(0.01)
leads = leads[leads['Page Views Per Visit'] >= lower]
sns.boxplot(y=leads['Page Views Per Visit'])
plt.show()

In [None]:
#checking Spread of "Total Visits" vs Converted variable
sns.boxplot(y = 'TotalVisits', x = 'Converted', data = leads)
plt.show()

In [None]:
#checking Spread of "Total Time Spent on Website" vs Converted variable
sns.boxplot(x=leads.Converted, y=leads['Total Time Spent on Website'])
plt.show()

In [None]:
#checking Spread of "Page Views Per Visit" vs Converted variable
sns.boxplot(x=leads.Converted,y=leads['Page Views Per Visit'])
plt.show()

In [None]:
#dropping two columns that are mostly missing once 'Select' is treated as NaN
leads.drop(["How did you hear about X Education", "Lead Profile"], axis = 1, inplace = True)

In [None]:
#checking missing values in leftover columns
round(100*(leads.isnull().sum()/len(leads.index)),2)

In [None]:
#getting a list of categorical columns
cat_cols= leads.select_dtypes(include=['object']).columns
cat_cols

In [None]:
# List of variables to map
varlist =  ['A free copy of Mastering The Interview','Do Not Email']
# Defining the map function
def binary_map(x):
    return x.map({'Yes': 1, "No": 0})

# Applying the function to the columns in varlist
leads[varlist] = leads[varlist].apply(binary_map)

In [None]:
#getting dummies and dropping the first column and adding the results to the master dataframe
dummy = pd.get_dummies(leads[['Lead Origin','What is your current occupation',
                             'City']], drop_first=True)

leads = pd.concat([leads,dummy], axis=1)

In [None]:
dummy = pd.get_dummies(leads['Lead Source'], prefix  = 'Lead Source')
dummy = dummy.drop(['Lead Source_Others'], axis=1)
leads = pd.concat([leads, dummy], axis = 1)

In [None]:
dummy = pd.get_dummies(leads['Last Activity'], prefix  = 'Last Activity')
dummy = dummy.drop(['Last Activity_Others'], axis=1)
leads = pd.concat([leads, dummy], axis = 1)

In [None]:
dummy = pd.get_dummies(leads['Last Notable Activity'], prefix  = 'Last Notable Activity')
dummy = dummy.drop(['Last Notable Activity_Other_Notable_activity'], axis=1)
leads = pd.concat([leads, dummy], axis = 1)

In [None]:
dummy = pd.get_dummies(leads['Tags'], prefix  = 'Tags')
dummy = dummy.drop(['Tags_Not Specified'], axis=1)
leads = pd.concat([leads, dummy], axis = 1)

In [None]:
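#dropping the original categorical columns after dummy variable creation
#(cat_cols was captured before the binary mapping, so the two mapped Yes/No columns and Specialization are removed as well)
leads.drop(cat_cols, axis=1, inplace = True)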

In [None]:
leads.head()

In [None]:
leads.describe()

# Hypothesis Testing

Null hypothesis - The mean time spent on the website is the same for converted and non-converted leads

Alternative hypothesis - The mean time spent on the website differs between converted and non-converted leads

alpha = 0.05

In [None]:
stat = leads['Total Time Spent on Website'].mean()
stat

In [None]:
from scipy import stats

In [None]:
stats.ttest_ind(leads['Total Time Spent on Website'][leads['Converted']==0],leads['Total Time Spent on Website'][leads['Converted']==1])

We can see that the p-value is less than 0.05, so we reject the null hypothesis: the mean time spent on the website differs between converted and non-converted leads.
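
To make the decision rule concrete, here is a minimal sketch of a two-sample t-test on made-up numbers (not the dataset). It assumes `scipy` is available and uses Welch's variant (`equal_var=False`), which is an assumption of this sketch; the notebook cell uses the default pooled-variance test.

```python
from scipy import stats

# Hypothetical seconds spent on the website, for illustration only
not_converted = [100, 120, 90, 110, 105, 95, 115, 100]
converted = [400, 450, 380, 420, 410, 390, 430, 405]

alpha = 0.05
t_stat, p_value = stats.ttest_ind(not_converted, converted, equal_var=False)

# Reject the null hypothesis of equal means when p_value < alpha
reject_null = p_value < alpha
print(t_stat, p_value, reject_null)
```

The test only says whether the means differ. The sign of `t_stat` (negative here, because the first group has the smaller mean) indicates the direction of the difference.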

In [None]:
lead = leads[:25]
lead

In [None]:
stat1 = lead['Total Time Spent on Website'].mean()
stat1

In [None]:
stats.ttest_ind(lead['Total Time Spent on Website'][lead['Converted']==0],lead['Total Time Spent on Website'][lead['Converted']==1])

On this 25-row sample the p-value is again less than 0.05, so we reject the null hypothesis here as well.

# Models

## Logistic Regression

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE

y = leads['Converted']
X=leads.drop('Converted', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=100)

In [None]:
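scaler = StandardScaler()
X_train, X_test = X_train.copy(), X_test.copy()
num_cols=X_train.select_dtypes(include=['float64', 'int64']).columns
#fit the scaler on the training data only, then apply the same transform to the test data
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])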

In [None]:
logreg = LogisticRegression(solver='lbfgs')
rfe = RFE(logreg, n_features_to_select=15)
rfe = rfe.fit(X_train, y_train)
rfe.support_

In [None]:
list(zip(X_train.columns, rfe.support_, rfe.ranking_))

In [None]:
col = X_train.columns[rfe.support_]
col

In [None]:
X_train.columns[~rfe.support_]

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logm1 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm1.fit()
res.summary()

p-value of variable Lead Source_Referral Sites is high, so we can drop it.

In [None]:
col = col.drop('Lead Source_Referral Sites')

The remaining coefficients have very low p-values, which means they are statistically significant predictors of the target. Since all the p-values are low, we next check the Variance Inflation Factor (VIF) to see whether the predictors are correlated with each other.

In [None]:
#checking multicollinearity between variables with VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

Two of the variables are highly correlated, so we drop the one with the higher VIF.
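
For reference, the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the others. The sketch below computes it with NumPy on synthetic data. It adds an intercept to each regression, which is an assumption of this sketch; `statsmodels` uses the design matrix exactly as it is passed in.

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        residual = y - others @ coef
        r2 = 1.0 - residual.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Synthetic features: `c` is nearly a copy of `a`, `b` is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)
values = vif(np.column_stack([a, b, c]))
print(values.round(2))
```

The two near-duplicate columns get large VIF values while the independent one stays close to 1, which is the pattern that motivates dropping one of a correlated pair.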

In [None]:
col = col.drop('Last Notable Activity_SMS Sent')

## Prediction on Train Set

In [None]:
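# Refit on the reduced feature set so the dropped variables take effect
X_train_sm = sm.add_constant(X_train[col])
res = sm.GLM(y_train, X_train_sm, family = sm.families.Binomial()).fit()
# Getting the Predicted values on the train set
y_train_pred = res.predict(X_train_sm)
y_train_pred[:10]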

In [None]:
y_train_pred = y_train_pred.values.reshape(-1)
y_train_pred[:10]

In [None]:
y_train_pred_final = pd.DataFrame({'Converted Lead':y_train.values, 'Predicted':y_train_pred})
y_train_pred_final['Prospect ID'] = y_train.index
y_train_pred_final.head()

In [None]:
y_train_pred_final['Predicted CL'] = y_train_pred_final.Predicted.map(lambda x: 1 if x > 0.5 else 0)

# Let's see the head
y_train_pred_final.head()

We can observe that the predicted labels are almost the same as the actual training labels.

In [None]:
from sklearn.metrics import accuracy_score
# Let's check the overall accuracy.
print(accuracy_score(y_train_pred_final['Converted Lead'], y_train_pred_final['Predicted CL']))

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
logreg = LogisticRegression()
print(cross_val_score(logreg, X, y, cv=10, scoring='accuracy').mean())

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_train_pred_final['Converted Lead'], y_train_pred_final['Predicted CL']))

In [None]:
# Confusion matrix 
from sklearn import metrics
confusion = metrics.confusion_matrix(y_train_pred_final['Converted Lead'], y_train_pred_final['Predicted CL'])
print(confusion)

In [None]:
fig, ax = plt.subplots()
fig.set_size_inches(6, 4)
sns.heatmap(confusion, annot=True, fmt='.1f', cmap='RdBu', center=0, ax=ax)
ax.set_ylim(2,0)

In [None]:
TP = confusion[1,1] # true positive 
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives

In [None]:
# Let's check the sensitivity 
TP / float(TP+FN)

In [None]:
# Let's check specificity
TN / float(TN+FP)

In [None]:
# Calculate the false positive rate - predicted converted when the lead did not convert
print(FP/ float(TN+FP))
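
The three rates above all come from the same four counts. As a compact illustration, this sketch derives them from a 2x2 matrix laid out as `[[TN, FP], [FN, TP]]`, the layout `confusion_matrix` returns for labels 0 and 1. The function name and the counts are hypothetical.

```python
def rates_from_confusion(confusion):
    """Compute common rates from a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = confusion
    return {
        'sensitivity': tp / (tp + fn),          # true positive rate (recall)
        'specificity': tn / (tn + fp),          # true negative rate
        'false_positive_rate': fp / (fp + tn),  # equals 1 - specificity
        'precision': tp / (tp + fp),
    }

# Made-up counts for illustration
rates = rates_from_confusion([[80, 20], [10, 90]])
print(rates)
```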

In [None]:
fpr, tpr, thresholds = metrics.roc_curve(y_train_pred_final['Converted Lead'], y_train_pred_final['Predicted CL'], drop_intermediate = False)
auc_score = metrics.roc_auc_score(y_train_pred_final['Converted Lead'], y_train_pred_final['Predicted CL'])
plt.figure(figsize=(5, 5))
plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('RoC')
plt.legend(loc="lower right")
plt.show()

The area under the ROC curve (AUC) should be close to 1. We are getting a value of 0.92, indicating a good predictive model.
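
Note that the curve above is built from the 0/1 predictions, which gives a single operating point. AUC is usually computed from the predicted probabilities, where it equals the chance that a randomly chosen converted lead scores higher than a randomly chosen non-converted one. The sketch below shows that definition on toy numbers; `auc` is a hypothetical helper written for this illustration, not a library function.

```python
def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and predicted probabilities, for illustration only
labels = [0, 0, 0, 1, 1, 1]
probs = [0.1, 0.3, 0.6, 0.4, 0.7, 0.9]
hard = [1 if p > 0.5 else 0 for p in probs]  # thresholded at 0.5

auc_probs = auc(labels, probs)
auc_hard = auc(labels, hard)
print(auc_probs, auc_hard)
```

On this toy example the thresholded labels give a lower AUC than the probabilities, because thresholding discards the ranking information within each predicted class.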

## KNN 

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
knn=KNeighborsClassifier(n_neighbors=2)

In [None]:
knn.fit(X_train[col],y_train)

In [None]:
pred=knn.predict(X_test[col])

In [None]:
print(classification_report(y_test,pred))

In [None]:
metrics.accuracy_score(y_test,pred)

In [None]:
error_rate=[]
for i in range(1,40):
    knn=KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train[col],y_train)
    pred_i = knn.predict(X_test[col])
    error_rate.append(np.mean(pred_i!=y_test))

In [None]:
plt.figure(figsize=(10,6))
plt.plot(range(1,40),error_rate,color='blue',linestyle='dashed',marker='o',
        markerfacecolor='red',markersize=10)
plt.title('Error Rate VS K-Value')
plt.xlabel('K Value')
plt.ylabel('Error Rate')

In [None]:
#now checking the accuracy with k value=5
knn=KNeighborsClassifier(n_neighbors= 5)
knn.fit(X_train[col],y_train)
pred=knn.predict(X_test[col])
print('With K=5\n')
print(confusion_matrix(y_test,pred))
print(classification_report(y_test,pred))
print(accuracy_score(y_test,pred))

In [None]:
knn = KNeighborsClassifier(n_neighbors=10)
scores = (cross_val_score(knn, X, y, cv=10, scoring='accuracy').mean())
print(scores)

In [None]:
#named confusion_knn so the imported confusion_matrix function is not overwritten
confusion_knn = metrics.confusion_matrix(y_test,pred)
confusion_knn

In [None]:
fig, ax = plt.subplots()
fig.set_size_inches(6, 4)
sns.heatmap(confusion_knn, annot=True, fmt='.1f', cmap='RdBu', center=0, ax=ax)
ax.set_ylim(2,0)

In [None]:
TP1 = confusion_knn[1,1]
TN1 = confusion_knn[0,0]
FP1 = confusion_knn[0,1]
FN1 = confusion_knn[1,0]

In [None]:
#sensitivity
TP1 / float(TP1+FN1)

In [None]:
# Let's check specificity
TN1 / float(TN1+FP1)

In [None]:
# Calculate the false positive rate - predicted converted when the lead did not convert
print(FP1/ float(TN1+FP1))

In [None]:
fpr, tpr, thresholds = metrics.roc_curve(y_test,pred, drop_intermediate = False)
auc_score = metrics.roc_auc_score(y_test,pred)
plt.figure(figsize=(5, 5))
plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.legend(loc="lower right")
plt.show()

The area under the ROC curve (AUC) should be close to 1. Here we are getting a value of 0.61, indicating that KNN is not a good predictive model for this data.

# Conclusion

The Logistic Regression model predicts lead conversion very well, and it should give us confidence in making good calls based on it.