### Problem Statement

An education company named X Education sells online courses to industry professionals. On any given day, many professionals who are interested in the courses land on their website and browse for courses. 

The company markets its courses on several websites and search engines such as Google. Once people land on the website, they might browse the courses, fill out a form for a course, or watch some videos. When they fill out a form providing their email address or phone number, they are classified as a lead. The company also gets leads through past referrals. Once these leads are acquired, employees from the sales team start making calls, writing emails, etc. Through this process, some of the leads get converted while most do not. The typical lead conversion rate at X Education is around 30%.

Although X Education acquires a lot of leads, its lead conversion rate is poor: of, say, 100 leads acquired in a day, only about 30 are converted. To make this process more efficient, the company wishes to identify the most promising leads, also known as ‘Hot Leads’. If it identifies this set of leads successfully, the lead conversion rate should go up, because the sales team can focus on communicating with these leads rather than calling everyone.

The process resembles a funnel: a lot of leads are generated in the initial stage (top), but only a few of them come out as paying customers at the bottom. In the middle stage, the potential leads need to be nurtured well (i.e., educated about the product, contacted regularly, etc.) in order to get a higher lead conversion rate.

X Education has appointed you to help them select the most promising leads, i.e. the leads that are most likely to convert into paying customers. The company requires you to build a model wherein you need to assign a lead score to each of the leads such that the customers with a higher lead score have a higher conversion chance and the customers with a lower lead score have a lower conversion chance. The CEO, in particular, has given a ballpark of the target lead conversion rate to be around 80%.

### Step 1: Importing Data

In [188]:
# Suppressing Warnings
import warnings
warnings.filterwarnings('ignore')

In [190]:
# Importing pandas, NumPy and the plotting libraries (seaborn, matplotlib)
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [192]:
# Importing and inspecting the dataset
leads_data = pd.read_csv("Leads.csv")
leads_data.head()

FileNotFoundError: [Errno 2] No such file or directory: 'Leads.csv'

In [None]:
# Let's check the dimensions of the dataframe
leads_data.shape

In [None]:
# let's look at the statistical aspects of the dataframe
leads_data.describe()

In [None]:
# Let's see the type of each column
leads_data.info()

In [None]:
# checking the number of unique values in all the cols. If a single value accounts for 90% or more of a column, we delete that column.
for cols in leads_data.columns:
    print(f"column name : {cols} :  no of unique values --> {leads_data[cols].nunique()}\n")
    print(leads_data[cols].value_counts().sort_index())
    print("---------------------------------------------\n\n")
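The 90% rule in the comment above can also be checked programmatically instead of by reading each printout. The following is a minimal sketch on a toy frame, not the notebook's data; the helper name `dominant_value_columns` and the toy values are assumptions made for illustration.

```python
import pandas as pd

def dominant_value_columns(df, threshold=0.90):
    """Return {column: share} for columns whose most frequent value
    covers at least `threshold` of the rows (hypothetical helper)."""
    dominated = {}
    for name in df.columns:
        # value_counts sorts by frequency, so the first entry is the top share
        shares = df[name].value_counts(normalize=True, dropna=False)
        if not shares.empty and shares.iloc[0] >= threshold:
            dominated[name] = float(shares.iloc[0])
    return dominated

# Toy data: two nearly constant columns and one balanced column
toy = pd.DataFrame({
    "Magazine": ["No"] * 10,
    "Search": ["No"] * 9 + ["Yes"],
    "Lead Origin": ["API"] * 5 + ["Landing Page Submission"] * 5,
})
print(dominant_value_columns(toy))
```

Columns flagged this way are candidates for dropping; whether to drop them remains a judgment call.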
    

In [None]:
# removing columns of no significant importance
leads_data.drop(columns=['Prospect ID','Do Not Email','Do Not Call','What matters most to you in choosing a course',
                                    'Search','Magazine','Newspaper Article','X Education Forums','Newspaper','Digital Advertisement',
                                   'Through Recommendations','Receive More Updates About Our Courses','Update me on Supply Chain Content',
                                    'Get updates on DM Content','I agree to pay the amount through cheque'],inplace=True,axis=1)

# we delete the 'Prospect ID' column also since it is of no use in the analysis

In [None]:
# replacing the placeholder value 'Select' with a missing value in these cols
# (assigning the result back, since inplace=True on a column selection may not update leads_data)
select_cols = ['City', 'Specialization', 'How did you hear about X Education', 'Lead Profile']
leads_data[select_cols] = leads_data[select_cols].replace('Select', np.nan)

In [None]:
# Calculating percentage of null values in all the remaining cols
(100*leads_data.isnull().mean()).sort_values(ascending=False)

In [None]:
# Dropping columns with missing values more than or equal to 35%
leads_data.drop(columns=['How did you hear about X Education','Lead Profile','Lead Quality','Asymmetrique Profile Score','Asymmetrique Profile Index',
                         'Asymmetrique Activity Index','Asymmetrique Activity Score','City','Tags','Specialization'],axis=1,inplace=True)
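The list of columns to drop can be derived from the missing-value percentages rather than typed by hand. Below is a minimal sketch of that idea on toy data, assuming the same 35% threshold; the helper name `columns_above_missing_threshold` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def columns_above_missing_threshold(df, threshold_pct=35.0):
    """List columns whose share of missing values is >= threshold_pct."""
    missing_pct = 100 * df.isnull().mean()
    return missing_pct[missing_pct >= threshold_pct].index.tolist()

# Toy frame, not the Leads dataset
toy = pd.DataFrame({
    "City": ["Mumbai", np.nan, np.nan, "Pune"],      # 50% missing
    "Country": ["India", "India", np.nan, "India"],  # 25% missing
    "TotalVisits": [1.0, 2.0, 3.0, 4.0],             # 0% missing
})
to_drop = columns_above_missing_threshold(toy)
cleaned = toy.drop(columns=to_drop)
print(to_drop, list(cleaned.columns))
```

Deriving the list this way keeps the drop step consistent with the threshold if the data changes.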

In [None]:
# Missing values in columns 'What is your current occupation','Country' should be replaced by the mean, median or 
# mode value whichever applicable
leads_data[['What is your current occupation','Country']].info()

Since both of the above columns are of object type, we replace the missing values in each column with its mode.

In [None]:
cols_wid_missingvalues=['What is your current occupation','Country']

In [None]:
# replacing the missing values with the mode of the respective col
# (assigning back, since fillna(inplace=True) on a column selection may not update leads_data)
for i in cols_wid_missingvalues:
    leads_data[i] = leads_data[i].fillna(leads_data[i].mode()[0])

In [None]:
# dropping rows with missing values in columns where the missing percentage is very low
leads_data.dropna(subset=['Page Views Per Visit','TotalVisits','Last Activity','Lead Source'],axis=0,inplace=True)

In [None]:
# checking missing values again in cols
100*(leads_data.isnull().mean())

Now there are no missing values in the dataset. 

In [None]:
# checking shape of the dataframe
leads_data.shape

#### Converting the binary variables (Yes/No) to 0/1

In [None]:
# Applying a binary map function
leads_data['A free copy of Mastering The Interview'] = leads_data['A free copy of Mastering The Interview'].map({'Yes': 1, "No": 0}).astype('object')

In [None]:
print(leads_data.columns)

In [None]:
# renaming the column names
leads_data.rename(columns={'Total Time Spent on Website':'Web_Time','Page Views Per Visit':'Page_Views','What is your current occupation':'Occupation',
                           'A free copy of Mastering The Interview':'Interview_Copy','Last Notable Activity':'Last_Notable_Act'},inplace=True)

In [None]:
leads_data.head()

In [None]:
# dropping some more insignificant columns
leads_data.drop(columns=['Last_Notable_Act'],axis=1,inplace=True)

In [None]:
leads_data=leads_data[['Lead Number','Lead Origin', 'Lead Source', 'TotalVisits', 'Web_Time',
       'Page_Views', 'Last Activity', 'Country', 'Occupation',
       'Interview_Copy','Converted']]

In [None]:
leads_data.head()

### Univariate Analysis

In [None]:
# Numerical columns
# Set style
import matplotlib.pyplot as plt
plt.style.use('ggplot')

# See distribution of each of these columns
fig = plt.figure(figsize = (14, 10))
plt.subplot(2, 2, 1)
plt.hist(leads_data.TotalVisits, bins = 80)
plt.title('Total website visits')
plt.savefig(f'D:/UPGRAD/lead scoring images/Web_visits.png',dpi=300, bbox_inches='tight')

plt.subplot(2, 2, 2)
plt.hist(leads_data.Web_Time, bins = 30)
plt.title('Time spent on website')
plt.savefig(f'D:/UPGRAD/lead scoring images/Web_Time.png',dpi=300, bbox_inches='tight')

plt.subplot(2, 2, 3)
plt.hist(leads_data.Page_Views, bins = 80)
plt.title('Average number of page views per visit')
plt.savefig(f'D:/UPGRAD/lead scoring images/Page_Views.png',dpi=300, bbox_inches='tight')

plt.show()

In [None]:
num_cols=leads_data.select_dtypes(include='number').columns
cat_cols=leads_data.select_dtypes(include='object').columns
print(f'numeric columns\n{num_cols}')
print()
print(f'categorical columns\n{cat_cols}')

In [None]:
# Loop through categorical columns and create separate plots
for col in cat_cols:
    plt.figure(figsize=(7, 5))  # Create a new figure for each plot
    
    # Countplot for each categorical variable
    sns.countplot(data=leads_data, x=col, palette="viridis")
    
    # Formatting
    plt.title(f'Count of {col}', fontsize=14)  
    plt.xlabel(col, fontsize=12)             
    plt.ylabel("Count", fontsize=12)           
    plt.xticks(rotation=90, fontsize=10)       

    plt.tight_layout()  # Prevent overlapping layout
    plt.savefig(f'D:/UPGRAD/lead scoring images/{col}.png',dpi=300, bbox_inches='tight')
    plt.show()  # Show each plot separately

### Insights

1. **Country**: Most leads come from **India**, suggesting a focus on Indian audiences with potential expansion elsewhere.

2. **Interview_Copy**: Those who request the free interview guide show **higher intent**, making them prime targets for personalized follow-ups.

3. **Last Activity**: Website visits and email opens are the most common final actions, indicating the importance of **website content** and **email campaigns**.

4. **Lead Origin**: **Landing Page Submissions** generate the most leads, so optimizing landing pages can further boost conversions.

5. **Lead Source**: **Olark Chat** and **Organic Search** are top channels, emphasizing the need for strong **live chat support** and **SEO** strategies.

6. **Occupation**: The majority of leads are **Unemployed**, highlighting the importance of promoting **career-building** benefits to this group.

### Bivariate Analysis

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
# visualising the correlations via pairplot for numeric variables
sns.pairplot(leads_data[['TotalVisits','Web_Time','Page_Views','Interview_Copy','Converted']],diag_kind='kde',hue='Converted')
# Save the figure in a folder
plt.savefig("D:/UPGRAD/lead scoring images/PairplotforNumVars.png", dpi=300, bbox_inches='tight')
plt.show()

### Insights

1. **High Website Engagement → Better Conversions**: More time on site and more page views often lead to higher conversion rates.  
2. **Free Copy Request Signals Intent**: Leads who request “Mastering the Interview” show higher intent and are more likely to convert.  
3. **Web Time & Page Views Correlate**: Visitors who spend more time also view more pages, indicating deeper engagement.  
4. **Total Visits ≠ Guaranteed Conversion**: Frequent visits alone don’t ensure conversion; the quality of engagement matters more.  
5. **Prioritize High-Engagement Leads**: Focus on leads who show multiple signs of interest (e.g., free copy request, long site visits) for targeted follow-ups.

##### For categorical variables with multiple levels, creating dummy features (one-hot encoding)

In [None]:
# Creating dummy variables for the categorical variables and dropping the first one.
dummy1 = pd.get_dummies(leads_data[['Lead Origin', 'Lead Source', 'Last Activity', 'Country','Occupation','Interview_Copy']], drop_first=True,dtype=int)

# Adding the results to the master dataframe
leads_data = pd.concat([leads_data, dummy1], axis=1)
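To make the effect of `drop_first=True` concrete, here is a small sketch on a toy column (the values are illustrative, not taken from the dataset). With k levels, `pd.get_dummies` returns k-1 indicator columns, and the dropped first level becomes the implicit baseline.

```python
import pandas as pd

# Toy column with three levels
toy = pd.DataFrame({
    "Occupation": ["Student", "Unemployed", "Working Professional", "Unemployed"]
})

# drop_first=True removes the first level ("Student"), leaving k-1 columns
dummies = pd.get_dummies(toy[["Occupation"]], drop_first=True, dtype=int)
print(dummies)
```

A row with zeros in every dummy column therefore belongs to the baseline level.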

In [None]:
leads_data.head()

In [None]:
# removing the original columns for which dummy variables have been created
leads_data.drop(columns=['Lead Origin', 'Lead Source', 'Last Activity', 'Country','Occupation','Interview_Copy'],axis=1,inplace=True)

In [None]:
leads_data.shape

In [None]:
leads_data.info()

In [None]:
# Calculating percentage of null values in all the cols
(100*leads_data.isnull().mean()).sort_values(ascending=False)

#### Checking outliers for the continuous numeric variables except the dummy variables

In [None]:
# Checking for outliers in the numeric variables
num_leads = leads_data[['TotalVisits','Web_Time','Page_Views']]

In [None]:
# Checking outliers at 25%, 50%, 75%, 90%, 95% and 99%
num_leads.describe(percentiles=[.25, .5, .75, .90, .95, .99])

In [None]:
# visualizing for outliers
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
sns.boxplot(num_leads)
plt.savefig('D:/UPGRAD/lead scoring images/num_outlier.png',dpi=300, bbox_inches='tight')

We can see that the columns TotalVisits and Page_Views have outliers

In [None]:
# Calculating percentage of null values in all the cols
(100*leads_data.isnull().mean()).sort_values(ascending=False)

In [None]:
# List of columns to check for outliers
cols = ['TotalVisits', 'Web_Time', 'Page_Views']

# Start with a boolean mask that is True for all rows
mask = pd.Series(True, index=leads_data.index)

# Loop through each column and update the mask:
# Only keep rows where the value is between the 1st and 99th percentile.
for col in cols:
    # Calculate the 1st and 99th percentiles for the current column
    lower_bound = leads_data[col].quantile(0.01)
    upper_bound = leads_data[col].quantile(0.99)
    print(f"For column '{col}': Lower bound = {lower_bound}, Upper bound = {upper_bound}")
    
    # Update the mask: a row remains True only if its value in this column is within bounds.
    mask &= (leads_data[col] >= lower_bound) & (leads_data[col] <= upper_bound)

# Apply the mask to the entire DataFrame.
# This removes any row that has an outlier in any of the three columns.
leads_data = leads_data[mask].reset_index(drop=True)

print("New shape of leads_data after outlier removal:", leads_data.shape)
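The cell above removes every row that falls outside the 1st to 99th percentile range in any of the three columns. A different technique, shown here only for comparison, is capping (winsorizing): extreme values are clipped to the percentile bounds so that no rows are lost. This is a sketch on toy data; the helper `cap_to_percentiles` is hypothetical and is not what the notebook does.

```python
import pandas as pd

def cap_to_percentiles(df, columns, lower=0.01, upper=0.99):
    """Clip each listed column to its [lower, upper] quantile range."""
    capped = df.copy()
    for name in columns:
        lo, hi = capped[name].quantile([lower, upper])
        capped[name] = capped[name].clip(lower=lo, upper=hi)
    return capped

# Toy column with one extreme value
toy = pd.DataFrame({"TotalVisits": list(range(1, 100)) + [251]})
capped = cap_to_percentiles(toy, ["TotalVisits"])
print(len(toy), len(capped), capped["TotalVisits"].max())
```

Capping keeps the sample size intact at the cost of distorting the tails, whereas dropping rows keeps the values intact but shrinks the dataset.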


In [None]:
leads_data[['TotalVisits','Web_Time','Page_Views']].describe(percentiles=[0.25,0.5,0.75,0.90,0.99])

In [None]:
sns.boxplot(leads_data[['TotalVisits','Web_Time','Page_Views']])

Now we can see that only a negligible number of outliers remain.

In [None]:
leads_data.shape

In [None]:
leads_data.info()

We can see that all the variables are of numeric type now

### Train-Test Split

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Putting feature variable to X
data = leads_data.copy()

X = leads_data.drop(['Converted'], axis=1)

X.head()

In [None]:
# Putting response variable to y
y = leads_data.Converted

y.head()

In [None]:
# Splitting the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=100)

In [None]:
lead_number_train = X_train['Lead Number']
lead_number_test  = X_test['Lead Number']

# Now drop the Lead Number column from the feature sets that go into modeling.
X_train_model = X_train.drop('Lead Number', axis=1)
X_test_model  = X_test.drop('Lead Number', axis=1)

### Feature Scaling

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()

X_train_model[['TotalVisits','Web_Time','Page_Views']] = scaler.fit_transform(X_train_model[['TotalVisits','Web_Time','Page_Views']])

X_train_model.head()
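The scaler is fitted on the training data only, and the same fitted statistics are reused for the test data later in the notebook. The sketch below shows that idea in plain Python with made-up numbers. Like `StandardScaler`, it uses the population standard deviation; the function names are hypothetical.

```python
from statistics import fmean, pstdev

def fit_standardizer(train_values):
    """Learn the mean and population standard deviation from training data."""
    return fmean(train_values), pstdev(train_values)

def standardize(values, mean, std):
    """Apply previously learned statistics to any set of values."""
    return [(v - mean) / std for v in values]

# Toy values, not taken from the dataset
train_web_time = [0, 100, 200, 300, 400]
test_web_time = [200, 500]

mean, std = fit_standardizer(train_web_time)            # fit on train only
train_scaled = standardize(train_web_time, mean, std)
test_scaled = standardize(test_web_time, mean, std)     # reuse train statistics
print(train_scaled, test_scaled)
```

Reusing the training statistics keeps information from the test set out of the preprocessing step.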

In [None]:
### Checking the Lead Conversion Rate
(sum(leads_data['Converted'])/leads_data.shape[0])*100

The conversion rate is about 37%, so the two classes are only mildly imbalanced.

#### Looking at Correlations

In [None]:
# Let's see the correlation matrix 
plt.figure(figsize = (80,80))        # Size of the figure
sns.heatmap(X_train_model.corr(),annot = True)   # modeling features only, without the 'Lead Number' identifier
plt.show()

### Dropping highly correlated dummy variables

In [None]:
X_test_model = X_test_model.drop(['Occupation_Working Professional'], axis=1)
X_train_model = X_train_model.drop(['Occupation_Working Professional'], axis=1)

#### Model Building and Automated Feature Selection Using RFE

In [None]:
import statsmodels.api as sm

In [None]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

In [None]:
from sklearn.feature_selection import RFE
rfe = RFE(logreg, n_features_to_select=15)             # running RFE with 15 variables as output
rfe = rfe.fit(X_train_model, y_train)

In [None]:
rfe.support_

In [None]:
# Convert np.bool_ and np.int64 to native Python types
output = [(col, bool(s), int(r)) for col, s, r in zip(X_train_model.columns, rfe.support_, rfe.ranking_)]
print(output)

In [None]:
# 15 best features selected by the recursive feature elimination(RFE)
col = X_train_model.columns[rfe.support_]
print(col)

In [None]:
# features not selected by rfe
no_col=X_train_model.columns[~rfe.support_]
print(no_col)

##### Building and Assessing the model with StatsModels

In [None]:
# building model using the 15 best features selected by rfe
# Model 1
X_train_sm = sm.add_constant(X_train_model[col])
logm2 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm2.fit()
res.summary()

In [None]:
# Getting the predicted values on the train set
y_train_pred = res.predict(X_train_sm)
y_train_pred[:10]

In [None]:
y_train_pred = y_train_pred.values.reshape(-1)
y_train_pred[:10]

##### Creating a dataframe with the actual lead converted and the predicted probabilities

In [None]:
y_train_pred_final = pd.DataFrame({'Converted':y_train.values, 'Convert_Prob':y_train_pred})
y_train_pred_final.head()

In [None]:
y_train_pred_final['predicted'] = y_train_pred_final.Convert_Prob.map(lambda x: 1 if x > 0.5 else 0)

# Let's see the head
y_train_pred_final.head()

In [None]:
from sklearn import metrics

In [None]:
# Create the confusion matrix in a presentable dataframe form
confusion = metrics.confusion_matrix(y_train_pred_final['Converted'], y_train_pred_final['predicted'])
print(confusion)

In [None]:
# Calculating recall and accuracy score on the training data
print('recall --> ', metrics.recall_score(y_train_pred_final['Converted'], y_train_pred_final['predicted']))
print('accuracy --> ', metrics.accuracy_score(y_train_pred_final['Converted'], y_train_pred_final['predicted']))

The present recall score for the training set is almost 69%. We are supposed to maintain the recall score around 80%.
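Recall, accuracy and the specificity computed later in the notebook all derive from the four confusion-matrix counts. The sketch below spells out the formulas using made-up counts, which are not the notebook's results; the function name is hypothetical.

```python
def classification_rates(tn, fp, fn, tp):
    """Compute common rates from confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "recall": tp / (tp + fn),        # share of actual converters caught
        "specificity": tn / (tn + fp),   # share of non-converters correctly rejected
        "precision": tp / (tp + fp),     # share of predicted converters that convert
        "accuracy": (tp + tn) / total,   # share of all predictions that are right
    }

# Illustrative counts only
rates = classification_rates(tn=90, fp=10, fn=30, tp=70)
print(rates)
```

In scikit-learn's `confusion_matrix` output for binary labels 0 and 1, these counts appear in the order TN, FP, FN, TP when the matrix is flattened with `ravel()`.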



#### Checking VIFs for multicollinearity among the independent variables

In [None]:
# Check for the VIF values of the feature variables. 
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
# Model 2
X_train_sm = sm.add_constant(X_train_model[col])
lm = sm.GLM(y_train, X_train_sm,family = sm.families.Binomial()).fit()
print(lm.summary())

In [None]:
# checking vif again for model 2
df1 = X_train_model[col]
vif = pd.DataFrame()
vif['Features'] = df1.columns
vif['VIF'] = [variance_inflation_factor(df1.values, i) for i in range(df1.shape[1])]
vif['VIF'] = round(vif['VIF'],2)
print(vif.sort_values(by='VIF',ascending=False))

In [None]:
col = col.drop(['Lead Origin_Lead Add Form']) # since its vif is 98.39 >> 5
print(f"number of cols : {len(col)}")
print(col)

In [None]:
# Model 3
X_train_sm = sm.add_constant(X_train_model[col])
lm = sm.GLM(y_train, X_train_sm,family = sm.families.Binomial()).fit()
print(lm.summary())

In [None]:
# checking vifs again for model 3
df1 = X_train_model[col]
vif = pd.DataFrame()
vif['Features'] = df1.columns
vif['VIF'] = [variance_inflation_factor(df1.values, i) for i in range(df1.shape[1])]
vif['VIF'] = round(vif['VIF'],2)
print(vif.sort_values(by='VIF',ascending=False))

In [None]:
col = col.drop(['Country_Saudi Arabia']) # since its p value is more than 0.05
print(f"number of cols : {len(col)}")
print(col)

In [None]:
# Model 4
X_train_sm = sm.add_constant(X_train_model[col])
lm = sm.GLM(y_train, X_train_sm,family = sm.families.Binomial()).fit()
print(lm.summary())

In [None]:
# checking vifs again for model 4
df1 = X_train_model[col]
vif = pd.DataFrame()
vif['Features'] = df1.columns
vif['VIF'] = [variance_inflation_factor(df1.values, i) for i in range(df1.shape[1])]
vif['VIF'] = round(vif['VIF'],2)
print(vif.sort_values(by='VIF',ascending=False))

No p-value now exceeds 0.05 and all the VIFs are below 3, so no further features need to be dropped.
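For reference, the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the others. The NumPy sketch below computes this by hand on synthetic data. It adds an intercept to each auxiliary regression, so its numbers need not match the notebook's output, which passes the feature columns without a constant. The function name and the data are assumptions for illustration.

```python
import numpy as np

def vif_by_hand(X):
    """VIF per column: 1 / (1 - R^2) from regressing it on the other columns."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])   # intercept + other features
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return vifs

# Synthetic features: c is nearly a copy of a, b is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)
vifs = vif_by_hand(np.column_stack([a, b, c]))
print([round(v, 2) for v in vifs])
```

The two nearly collinear columns get large VIFs while the independent one stays close to 1, which is the pattern the drop decisions above rely on.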

In [None]:
y_train_pred_final_new = pd.DataFrame({'Converted':y_train.values})
y_train_pred_final_new.head()

In [None]:
# predicting the probabilities using the updated model
y_train_pred_new = lm.predict(X_train_sm).values.reshape(-1)

In [None]:
y_train_pred_final_new['Convert_Prob'] = y_train_pred_new
y_train_pred_final_new.head()

In [None]:
# Creating new column 'predicted' with 1 if Convert_Prob > 0.5 else 0.
# 0.5 is an arbitrary initial cutoff probability threshold.
y_train_pred_final_new['predicted'] = y_train_pred_final_new.Convert_Prob.map(lambda x: 1 if x > 0.5 else 0)
y_train_pred_final_new.head()

In [None]:
# constructing the confusion matrix using the updated model(Model 4)
cm=metrics.confusion_matrix(y_train_pred_final_new.Converted,y_train_pred_final_new.predicted)
print(cm)

In [None]:
# Calculating final recall and accuracy score on the training data
print('recall --> ', metrics.recall_score(y_train_pred_final_new['Converted'], y_train_pred_final_new['predicted']))
print('accuracy --> ', metrics.accuracy_score(y_train_pred_final_new['Converted'], y_train_pred_final_new['predicted']))
# Extract true negatives (TN), false positives (FP), false negatives (FN), and true positives (TP)
TN, FP, FN, TP = cm.ravel()

# Calculate specificity: TN / (TN + FP)
specificity = TN / (TN + FP)
print('specificity --> ', specificity)

Even after dropping a few features, the recall remains almost the same. So we need to try changing the cutoff probability, keeping in mind the target recall score of around 80%.

### Plotting the ROC Curve

In [None]:
def draw_roc(actual, probs):
    fpr, tpr, thresholds = metrics.roc_curve(actual, probs, drop_intermediate=False)
    auc_score = metrics.roc_auc_score(actual, probs)
    plt.figure(figsize=(5, 5))
    plt.plot(fpr, tpr, label=f'ROC curve (area = {auc_score:.2f})')
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    
    # Save the figure before showing it
    plt.savefig('D:/UPGRAD/lead scoring images/roc_auc.png', dpi=300, bbox_inches='tight')
    plt.show()


In [None]:
# roc curve for the updated model (Model 4)
draw_roc(y_train_pred_final_new.Converted, y_train_pred_final_new.Convert_Prob)

### Finding Optimal Cutoff Point

The optimal cutoff probability is the one at which accuracy, sensitivity and specificity are balanced.
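One simple way to make "balanced" operational is to pick the cutoff at which sensitivity and specificity are closest to each other. The plain-Python sketch below does this on a handful of made-up labels and probabilities. The function names and data are hypothetical, and the notebook itself reads the cutoff off a table and a plot rather than computing it this way.

```python
def sensitivity_specificity(actual, probs, cutoff):
    """Sensitivity and specificity when predicting 1 for prob > cutoff."""
    pairs = list(zip(actual, probs))
    tp = sum(1 for a, p in pairs if a == 1 and p > cutoff)
    fn = sum(1 for a, p in pairs if a == 1 and p <= cutoff)
    tn = sum(1 for a, p in pairs if a == 0 and p <= cutoff)
    fp = sum(1 for a, p in pairs if a == 0 and p > cutoff)
    return tp / (tp + fn), tn / (tn + fp)

def most_balanced_cutoff(actual, probs, cutoffs):
    """Return the cutoff with the smallest |sensitivity - specificity|."""
    def gap(cutoff):
        sens, spec = sensitivity_specificity(actual, probs, cutoff)
        return abs(sens - spec)
    return min(cutoffs, key=gap)

# Made-up labels and predicted probabilities
actual = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
probs = [0.05, 0.12, 0.22, 0.27, 0.38, 0.65, 0.32, 0.45, 0.72, 0.91]
cutoffs = [i / 10 for i in range(1, 10)]

best = most_balanced_cutoff(actual, probs, cutoffs)
print(best, sensitivity_specificity(actual, probs, best))
```

A finer grid of cutoffs (for example steps of 0.01) would locate the crossing point more precisely.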

In [None]:
# Creating columns with different probability cutoffs 
numbers = [float(x)/10 for x in range(10)]
for i in numbers:
    y_train_pred_final_new[i]= y_train_pred_final_new.Convert_Prob.map(lambda x: 1 if x > i else 0)
y_train_pred_final_new.head()

In [None]:
# Calculating accuracy, sensitivity and specificity for various probability cutoffs.
cutoff_df = pd.DataFrame( columns = ['prob','accuracy','sensi','speci'])

# TP = confusion[1,1] # true positive 
# TN = confusion[0,0] # true negatives
# FP = confusion[0,1] # false positives
# FN = confusion[1,0] # false negatives

num = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
for i in num:
    cm1 = metrics.confusion_matrix(y_train_pred_final_new.Converted, y_train_pred_final_new[i] )
    total1=sum(sum(cm1))
    accuracy = (cm1[0,0]+cm1[1,1])/total1
    
    speci = cm1[0,0]/(cm1[0,0]+cm1[0,1])
    sensi = cm1[1,1]/(cm1[1,0]+cm1[1,1])
    cutoff_df.loc[i] =[ i ,accuracy,sensi,speci]
print(cutoff_df)

From the table above, it is clear that our previous cutoff of 0.5 was not optimal; a value around 0.3 would be a better choice.

In [None]:
# Let's plot accuracy sensitivity and specificity for various probabilities.
cutoff_df.plot.line(x='prob', y=['accuracy','sensi','speci'])
plt.savefig('D:/UPGRAD/lead scoring images/recall_accuracy_balance.png',dpi=300, bbox_inches='tight')
plt.show()

So a cutoff of around 0.33 looks better. Let's try 0.33.

In [None]:
# Changing the cutoff to 0.33
y_train_pred_final_new['final_predicted'] = y_train_pred_final_new.Convert_Prob.map( lambda x: 1 if x > 0.33 else 0)
y_train_pred_final_new.head(10)

In [None]:
# creating the confusion matrix for the updated cutoff 
cm1=metrics.confusion_matrix(y_train_pred_final_new.Converted,y_train_pred_final_new.final_predicted)
print(cm1)

In [None]:
# Calculating final recall and accuracy score on the training data
print('recall --> ', metrics.recall_score(y_train_pred_final_new['Converted'], y_train_pred_final_new['final_predicted']))
print('accuracy --> ', metrics.accuracy_score(y_train_pred_final_new['Converted'], y_train_pred_final_new['final_predicted']))
# Extract true negatives (TN), false positives (FP), false negatives (FN), and true positives (TP)
TN, FP, FN, TP = cm1.ravel()

# Calculate specificity: TN / (TN + FP)
specificity = TN / (TN + FP)
print('specificity --> ', specificity)

From the above, we can see that the recall is now about 80%.

In [None]:
y_train_pred = lm.predict(X_train_sm)

# Create a DataFrame for the training set results, ensuring all series are of equal length
train_results = pd.DataFrame({
    'Lead Number': lead_number_train,                     # should have the same length as X_train_model
    'Converted': y_train,                                 # actual conversion values from training set
    'Convert_Prob': y_train_pred,                         # predicted probabilities from the model
    'Final_Predicted': [1 if x > 0.33 else 0 for x in y_train_pred],  # applying cutoff 0.33
    'Lead_Score': (y_train_pred * 100).round(0)           # calculating lead score and rounding off
})

print(train_results.head())

In [None]:
# Verifying recall(sensitivity), specificity and accuracy for train_results
print(f'recall --> {metrics.recall_score(train_results.Converted,train_results.Final_Predicted)}')
print(f'specificity --> {metrics.recall_score(train_results.Converted,train_results.Final_Predicted,pos_label=0)}')
print(f'accuracy --> {metrics.accuracy_score(train_results.Converted,train_results.Final_Predicted)}')

In [None]:
from sklearn.metrics import precision_score, recall_score

In [None]:
precision_score(y_train_pred_final_new['Converted'], y_train_pred_final_new['final_predicted'])

In [None]:
recall_score(y_train_pred_final_new['Converted'], y_train_pred_final_new['final_predicted'])

### Making predictions on the test set

In [None]:
# transforming the numeric variables in testing data using standard scaler
X_test_model[['TotalVisits','Web_Time','Page_Views']] = scaler.transform(X_test_model[['TotalVisits','Web_Time','Page_Views']])

In [None]:
# using the same final feature set as in training (the RFE-selected cols minus the two dropped above)
X_test_model = X_test_model[col]
X_test_model.head()

In [None]:
#Adding a constant column
X_test_sm = sm.add_constant(X_test_model)


In [None]:
# predicting the probabilities using the final model
y_test_pred = lm.predict(X_test_sm)

In [None]:
y_test_pred[:10]

In [None]:
# Converting y_test_pred (a pandas Series) to a dataframe
y_pred_1 = pd.DataFrame(y_test_pred)

In [None]:
# Let's see the head
y_pred_1.head()

In [None]:
# Converting y_test to dataframe
y_test_df = pd.DataFrame(y_test)

In [None]:
# Appending y_test_df and y_pred_1
y_pred_final = pd.concat([y_test_df, y_pred_1],axis=1)

In [None]:
y_pred_final.rename(columns={0:'Convert_Prob'},inplace=True)

In [None]:
y_pred_final.head()

In [None]:
y_pred_final['final_predicted'] = y_pred_final.Convert_Prob.map(lambda x: 1 if x > 0.33 else 0)

In [None]:
y_pred_final.head()

In [None]:
confusion2 = metrics.confusion_matrix(y_pred_final.Converted, y_pred_final.final_predicted )
confusion2

In [None]:
# Calculating final recall and accuracy score on the testing data
print('recall --> ', metrics.recall_score(y_pred_final.Converted, y_pred_final.final_predicted))
print('accuracy --> ', metrics.accuracy_score(y_pred_final.Converted, y_pred_final.final_predicted))
print('specificity --> ', metrics.recall_score(y_pred_final.Converted, y_pred_final.final_predicted,pos_label=0))

From the above, we can see that the recall on the test data is 80.2%, which is very close to the recall on the training data. This suggests that the model generalizes well and is not overfitting.

Next, we assign a lead score to every lead number in the dataset by multiplying the Convert_Prob column by 100.
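As a small illustration of the scoring rule, the plain-Python sketch below turns a probability into a rounded score and applies the 0.33 cutoff to the raw probability, as the notebook does. The function names and the sample probabilities are hypothetical.

```python
def lead_score(convert_prob):
    """Scale a conversion probability to a 0-100 score, rounded to an integer."""
    return round(convert_prob * 100)

def is_hot_lead(convert_prob, cutoff=0.33):
    """Flag a lead as hot when its probability exceeds the cutoff."""
    return convert_prob > cutoff

# Sample probabilities, made up for illustration
for prob in (0.874, 0.334, 0.326, 0.120):
    print(prob, lead_score(prob), is_hot_lead(prob))
```

Because the score is rounded, a lead with probability 0.326 also receives a score of 33 even though it falls below the 0.33 cutoff, so filtering on the probability and filtering on the rounded score can differ slightly near the boundary.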

In [None]:
# Generate predicted probabilities using your logistic regression model
y_test_pred = lm.predict(X_test_sm)

# Create the test results DataFrame ensuring all series have matching lengths
test_results = pd.DataFrame({
    'Lead Number': lead_number_test,
    'Converted': y_test,  # actual conversion status from test set
    'Convert_Prob': y_test_pred,
    'Final_Predicted': [1 if prob > 0.33 else 0 for prob in y_test_pred],
    'Lead_Score': (y_test_pred * 100).round(0)
})

print(test_results.head())

In [None]:
final_results = pd.concat([train_results, test_results], axis=0).reset_index(drop=True)

# 'final_results' now clearly shows each Lead Number with its corresponding lead score.
print(final_results.head(10))

### Conclusion

- Based on our analysis, 0.33 emerged as the optimal probability cutoff, balancing accuracy, sensitivity (recall), and specificity for the lead conversion model.


- Leads at or above this threshold (a Lead Score of 33 or higher) are deemed “hot.” Our model demonstrates approximately
  80% recall on both training and test data, indicating it effectively captures the majority of likely converters without
  significantly inflating false positives.
  

- Consequently, we recommend prioritizing outreach to leads with scores ≥ 33, as they have a high likelihood of conversion.
  However, this threshold can be revisited if business objectives change, such as aiming to reduce the number of calls (requiring
  higher precision) or ensuring fewer missed opportunities (requiring higher recall). Regularly reviewing the model’s
  performance and adjusting the cutoff as needed will help maintain alignment with organizational goals.