# ExtraaLearn Project

## Context

The EdTech industry has grown rapidly over the past decade. According to one forecast, the online education market would be worth $286.62bn by 2023, with a compound annual growth rate (CAGR) of 10.26% from 2018 to 2023. The modern era of online education has driven much of this growth and expansion. Features such as ease of information sharing, personalized learning experiences, and transparent assessment have made it preferable to traditional education.

Due to Covid-19, the online education sector has witnessed rapid growth and is attracting a lot of new customers. As a result, many new companies have emerged in this industry. With the availability and ease of use of digital marketing resources, companies can reach a wider audience with their offerings. The customers who show interest in these offerings are termed leads. EdTech companies obtain leads from various sources, such as:

* The customer interacts with the marketing front on social media or other online platforms. 
* The customer browses the website/app and downloads the brochure
* The customer connects through emails for more information.

The company then nurtures these leads and tries to convert them to paid customers. For this, the representative from the organization connects with the lead on call or through email to share further details.

## Objective

ExtraaLearn is an early-stage startup that offers programs on cutting-edge technologies to students and professionals to help them upskill/reskill. With a large number of leads being generated on a regular basis, one of the issues ExtraaLearn faces is identifying which leads are more likely to convert so that it can allocate resources accordingly. You, as a data scientist at ExtraaLearn, have been provided the leads data to:
* Analyze and build an ML model to help identify which leads are more likely to convert to paid customers
* Find the factors driving the lead conversion process
* Create a profile of the leads who are likely to convert


## Data Description

The data contains the different attributes of leads and their interaction details with ExtraaLearn. The detailed data dictionary is given below.


**Data Dictionary**
* ID: ID of the lead
* age: Age of the lead
* current_occupation: Current occupation of the lead. Values include 'Professional', 'Unemployed', and 'Student'
* first_interaction: How the lead first interacted with ExtraaLearn. Values include 'Website', 'Mobile App'
* profile_completed: What percentage of the profile has been filled in by the lead on the website/mobile app. Values include Low (0-50%), Medium (50-75%), High (75-100%)
* website_visits: How many times has a lead visited the website
* time_spent_on_website: Total time spent on the website
* page_views_per_visit: Average number of pages on the website viewed during the visits.
* last_activity: Last interaction between the lead and ExtraaLearn. 
    * Email Activity: Sought details about the program through email, representative shared information with the lead (such as the program brochure), etc.
    * Phone Activity: Had a phone conversation with a representative, had a conversation over SMS with a representative, etc.
    * Website Activity: Interacted on live chat with a representative, updated profile on the website, etc.

* print_media_type1: Flag indicating whether the lead had seen the ad of ExtraaLearn in the Newspaper.
* print_media_type2: Flag indicating whether the lead had seen the ad of ExtraaLearn in the Magazine.
* digital_media: Flag indicating whether the lead had seen the ad of ExtraaLearn on the digital platforms.
* educational_channels: Flag indicating whether the lead had heard about ExtraaLearn in the education channels like online forums, discussion threads, educational websites, etc.
* referral: Flag indicating whether the lead had heard about ExtraaLearn through a referral.
* status: Flag indicating whether the lead was converted to a paid customer or not.

## Importing necessary libraries and data

In [None]:
import warnings

warnings.filterwarnings("ignore")
from statsmodels.tools.sm_exceptions import ConvergenceWarning

warnings.simplefilter("ignore", ConvergenceWarning)

# Libraries to help with reading and manipulating data

import pandas as pd
import numpy as np

# Library to split data
from sklearn.model_selection import train_test_split

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# setting the precision of floating numbers to 5 decimal points
pd.set_option("display.float_format", lambda x: "%.5f" % x)

# To build model for prediction
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier

# To tune different models
from sklearn.model_selection import GridSearchCV


# To get different metric scores
import sklearn.metrics as metrics
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    classification_report,
    roc_auc_score,
    precision_recall_curve,
    roc_curve,
    make_scorer,
)
%matplotlib inline

In [None]:
# Load the dataset from the local data folder
df = pd.read_csv('data/ExtraaLearn.csv')

# Display basic information about the dataset
print(f"Dataset shape: {df.shape}")
print(f"\nColumn names: {df.columns.tolist()}")
print(f"\nFirst few rows:")
df.head()


## Data Overview

- Observations
- Sanity checks

In [None]:
data = df.copy()
data.head(8)

In [None]:
data.tail()


In [None]:
data.info()


## Data Preprocessing

- Missing value treatment (if needed)
- Feature engineering (if needed)
- Outlier detection and treatment (if needed)
- Preparing data for modeling 
- Any other preprocessing steps (if needed)

In [None]:
missing_values = data.isnull().sum()
missing_values  # count of null values per column; all counts are zero, so the dataset has no missing values


In [None]:
data.duplicated().sum()  # no duplicates are found, so data = data.drop_duplicates() is not required


In [None]:
data.nunique()


In [None]:
data.drop(["ID"], axis = 1, inplace = True) #This column is dropped as we determine it is not important in our evaluation due to the uniqueness of each ID.


Observations: ID is an identifier which is unique for each lead. We can drop this column since it would provide no value to the analysis.

## Exploratory Data Analysis (EDA)

- EDA is an important part of any project involving data.
- It is important to investigate and understand the data better before building a model with it.
- A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
- A thorough analysis of the data, in addition to the questions mentioned below, should be done.

**Questions**
1. Leads will have different expectations from the outcome of the course and the current occupation may play a key role in getting them to participate in the program. Find out how current occupation affects lead status.
2. The company's first impression on the customer must have an impact. Do the first channels of interaction have an impact on the lead status? 
3. The company uses multiple modes to interact with prospects. Which way of interaction works best? 
4. The company gets leads from various channels such as print media, digital media, referrals, etc. Which of these channels have the highest lead conversion rate?
5. People browsing the website or mobile application are generally required to create a profile by sharing their personal data before they can access additional information. Does having more details about a prospect increase the chances of conversion?

In [None]:
#creating numerical columns
num_cols = ['age','website_visits','time_spent_on_website','page_views_per_visit']
#creating categorical columns
cat_cols = ['current_occupation','first_interaction','profile_completed','last_activity','print_media_type1','print_media_type2','digital_media','educational_channels','referral','status']


In [None]:
data[num_cols].describe().T


Observations: The average lead age is 46.201, with a range of 63 - 18 = 45. Half of the leads spend 376 seconds or less on the website (the median), but there are extreme values: the maximum is 2537 seconds and the minimum is 0. On average, leads view 3.026 pages per visit and visit the website 3.567 times.

In [None]:
# Creating histograms
data[num_cols].hist(figsize=(14,14))
plt.show()

Observations: The age distribution is skewed left, which indicates that the number of leads increases with age. The website_visits distribution is skewed right, which indicates that the number of leads decreases as the number of visits increases. Time spent on the website is bimodal, with peaks at 0-250 and at 1750-2000 seconds. The page_views_per_visit distribution is narrow and skewed slightly to the right, with its peak at 1.875-3.75 pages; with a larger variance it would look closer to a normal distribution.
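
The direction of skew read off the histograms can also be checked numerically. The following is a minimal sketch on made-up values (not the ExtraaLearn data), using pandas' `Series.skew()`: a negative value points to a left-skewed sample and a positive value to a right-skewed one.

```python
import pandas as pd

# Hypothetical toy samples, not the ExtraaLearn data.
left_skewed = pd.Series([18, 45, 50, 55, 57, 58, 60, 61, 62, 63])  # long tail toward low values
right_skewed = pd.Series([0, 1, 1, 2, 2, 3, 3, 4, 10, 30])         # long tail toward high values

skew_left = left_skewed.skew()    # sample skewness; negative for a left-skewed sample
skew_right = right_skewed.skew()  # positive for a right-skewed sample

print(f"left-skewed sample:  {skew_left:.2f}")
print(f"right-skewed sample: {skew_right:.2f}")
```

In the notebook, the same check would presumably be `data[num_cols].skew()`.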

Univariate Analysis


In [None]:
for column in cat_cols:
    print(data[column].value_counts(normalize = True))  # leave out the normalize parameter to display raw counts instead
    print("-" * 50)

Observations: The conversion rate (status = yes) is 29.86%. Around 57.00% of the leads have a professional background. The website accounts for a larger share of first interactions than the mobile app, which may hint at a higher engagement level on the website. Among the last_activity interactions, email activity has more leads than its counterparts. Most leads have high or medium profile completion. Among the four media channels and referrals, educational channels are the most common.

Bivariate and Multivariate Analysis

How is the conversion rate related to other categorical variables?



In [None]:
for i in cat_cols:
    if i!='status':
        (pd.crosstab(data[i],data['status'],normalize='index')*100).plot(kind='bar',figsize=(8,4),stacked=True)
        plt.ylabel('Percentage Lead Conversion %')

In [None]:
# Mean of numerical variables grouped by status
data.groupby(['status'])[num_cols].mean()

In [None]:
# Plotting the correlation between numerical variables
plt.figure(figsize=(15,8))
sns.heatmap(data[num_cols].corr(),annot=True, fmt='0.2f', cmap='YlGnBu')

In [None]:
cols_list = data.select_dtypes(include=np.number).columns.tolist()

plt.figure(figsize=(12, 7))
sns.heatmap(
    data[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()

Further Univariate Analysis

In [None]:
# function to plot a boxplot and a histogram along the same scale.

def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to the show density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
histogram_boxplot(data, "age")

Observation: Has more outliers given that it is skewed heavily to the right.



In [None]:
data[data["website_visits"] == 0].shape


In [None]:
histogram_boxplot(data, "time_spent_on_website")


In [None]:
histogram_boxplot(data, "page_views_per_visit")


Observation: The number of outliers in this feature is the highest among all the numerical distributions.



Current Occupation and Lead Status


In [None]:
data.columns


In [None]:
plt.figure(figsize=(10, 6))
sns.countplot(data=data, x='status', hue='current_occupation')
plt.title('Distribution of Lead Status by Current Occupation')
plt.xlabel('Lead Status')
plt.ylabel('Count')
plt.legend(title='Current Occupation', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(10, 5))
sns.boxplot(data = data, x = "current_occupation", y = "age")
plt.show()

In [None]:
data.groupby(["current_occupation"])["age"].describe()


In [None]:
def distribution_plot_wrt_target(data, predictor, target):

    fig, axs = plt.subplots(2, 2, figsize=(12, 10))

    target_uniq = data[target].unique()

    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )

    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )

    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )

    plt.tight_layout()
    plt.show()

In [None]:
distribution_plot_wrt_target(data, "time_spent_on_website", "status")


In [None]:
data.groupby(["status"])["time_spent_on_website"].median()


In [None]:
distribution_plot_wrt_target(data, "website_visits", "status")


In [None]:
distribution_plot_wrt_target(data, "page_views_per_visit", "status")


In [None]:
conversion_rates = data.groupby('first_interaction')['status'].mean()
print(conversion_rates)


In [None]:
from scipy.stats import chi2_contingency

# Create a contingency table
contingency_table = pd.crosstab(data['first_interaction'], data['status'])

# Perform the chi-squared test
chi2, p, dof, expected = chi2_contingency(contingency_table)
print(f"Chi-Squared Value: {chi2}")
print(f"P-value: {p}")

There is a statistically significant association between the first channel of interaction and the lead conversion status, as the p-value is lower than the significance threshold.

In [None]:
# Calculate the conversion rate for each last activity type
conversion_rates = data.groupby('last_activity')['status'].mean().sort_values(ascending=False)

In [None]:
# Plot the conversion rates
conversion_rates.plot(kind='bar', figsize=(10, 6))
plt.title('Conversion Rate by Last Activity')
plt.xlabel('Last Activity')
plt.ylabel('Conversion Rate')
plt.xticks(rotation=45)
plt.show()

In [None]:
sns.catplot(x='last_activity', y='status', data=data, kind='point', aspect=2, ci='sd')
plt.title('Conversion Rate by Last Activity')
plt.xlabel('Last Activity')
plt.ylabel('Conversion Rate (Proportion of Status=1)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# Calculate the conversion rate for each mode of interaction
modes_of_interaction = ['print_media_type1', 'print_media_type2', 'digital_media', 'educational_channels', 'referral']
# Convert 'Yes'/'No' to 1/0 for each mode
for mode in modes_of_interaction:
    data[mode] = data[mode].map({'Yes': 1, 'No': 0})
# Calculate the conversion rate for each mode
conversion_rates = {}
for mode in modes_of_interaction:
    # Calculate mean only for rows where the mode is 'Yes' (now converted to 1)
    rate = data[data[mode] == 1]['status'].mean()
    conversion_rates[mode] = rate

# Print conversion rates to verify calculations
print(conversion_rates)

In [None]:
# Convert the conversion rates to a DataFrame
conversion_rates_df = pd.DataFrame(list(conversion_rates.items()), columns=['Interaction_Mode', 'Conversion_Rate'])

# Plot the conversion rates
plt.figure(figsize=(10, 6))
conversion_rates_df = conversion_rates_df.sort_values('Conversion_Rate', ascending=False)
plt.barh(conversion_rates_df['Interaction_Mode'], conversion_rates_df['Conversion_Rate'])  # using horizontal bar plot for better readability
plt.xlabel('Conversion Rate')
plt.ylabel('Mode of Interaction')
plt.title('Conversion Rate by Mode of Interaction')
plt.show()

In [None]:
# Now that the modes are binary, create the contingency table for the Chi-Squared test
contingency_table = pd.crosstab(index=data['status'], columns=[data['print_media_type1'],
                                                               data['print_media_type2'],
                                                               data['digital_media'],
                                                               data['educational_channels'],
                                                               data['referral']])

In [None]:
from scipy.stats import chi2_contingency

# Apply the Chi-Squared test
chi2, p, dof, expected = chi2_contingency(contingency_table)

print(f"Chi-Squared Value: {chi2}")
print(f"P-value: {p}")

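There is a statistically significant association between lead status and the combination of interaction modes. Because the contingency table crosses all five flags at once, this test on its own does not show which individual mode drives the difference in conversion rates.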
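
One way to see which individual channel is associated with conversion is to test each flag against status separately. The sketch below uses a hypothetical helper, `chi2_by_predictor`, and a small made-up table in which `referral` is tied to conversion while `digital_media` is not; neither the helper nor the counts come from the notebook.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def chi2_by_predictor(frame, predictors, target):
    """Run one chi-squared test of independence per predictor against the target."""
    rows = []
    for col in predictors:
        table = pd.crosstab(frame[col], frame[target])
        chi2, p, dof, _ = chi2_contingency(table)
        rows.append({"predictor": col, "chi2": chi2, "p_value": p, "dof": dof})
    return pd.DataFrame(rows).set_index("predictor")

# Hypothetical counts: (referral, digital_media, status, number_of_leads)
cells = [
    (1, 1, 1, 6), (1, 0, 1, 24), (1, 1, 0, 2), (1, 0, 0, 8),
    (0, 1, 1, 8), (0, 0, 1, 32), (0, 1, 0, 24), (0, 0, 0, 96),
]
toy = pd.DataFrame(
    [(r, d, s) for r, d, s, n in cells for _ in range(n)],
    columns=["referral", "digital_media", "status"],
)

results = chi2_by_predictor(toy, ["referral", "digital_media"], "status")
print(results)
```

In the notebook this could presumably be called as `chi2_by_predictor(data, modes_of_interaction, 'status')`, and the same helper would work for any categorical predictor.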



In [None]:
# The 'profile_completed' column has the categories 'Low', 'Medium', 'High'
conversion_rates_by_profile = data.groupby('profile_completed')['status'].mean()


In [None]:
sns.barplot(x=conversion_rates_by_profile.index, y=conversion_rates_by_profile.values)
plt.title('Conversion Rate by Profile Completion')
plt.xlabel('Profile Completion Level')
plt.ylabel('Conversion Rate')
plt.show()

In [None]:
from scipy.stats import f_oneway

# Create groups for each profile completion category
group_low = data[data['profile_completed'] == 'Low']['status']
group_medium = data[data['profile_completed'] == 'Medium']['status']
group_high = data[data['profile_completed'] == 'High']['status']

# Perform ANOVA test
f_stat, p_value = f_oneway(group_low, group_medium, group_high)
print(f"F-statistic: {f_stat}")
print(f"P-value: {p_value}")


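The results indicate that there are statistically significant differences in the conversion rates for different levels of profile completion. Because status is a 0/1 variable, the group means compared by the ANOVA are the conversion proportions of each level.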



## Data Preprocessing
- Missing value treatment (if needed)
- Feature engineering (if needed)
- Outlier detection and treatment (if needed)
- Preparing data for modeling
- Any other preprocessing steps (if needed)

In [None]:
# outlier detection using boxplot
numeric_columns = data.select_dtypes(include=np.number).columns.tolist()
# excluding status as it is the target variable
numeric_columns.remove("status")

plt.figure(figsize=(15, 12))

for i, variable in enumerate(numeric_columns):
    plt.subplot(4, 4, i + 1)
    plt.boxplot(data[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)

plt.show()

In [None]:
#Feature Engineering based on correlation for age and status and time-spent on website and status
data['engagement_score'] = (data['website_visits'] +
                                    data['time_spent_on_website'] +
                                    data['page_views_per_visit'])

data['age_engagement_interaction'] = data['age'] * data['engagement_score']
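
Because the three components are on different scales (time on site is in seconds, visits and page views are small counts), the raw sum is likely to track time spent almost exactly. As an illustration only, the sketch below compares the raw sum with a z-scored alternative on made-up values; `standardized_engagement` is a hypothetical helper, not the feature used in this notebook.

```python
import pandas as pd

def standardized_engagement(frame, columns):
    """Average of z-scored columns, so each one contributes on a comparable scale."""
    z = (frame[columns] - frame[columns].mean()) / frame[columns].std()
    return z.mean(axis=1)

# Hypothetical toy values, not the ExtraaLearn data.
toy = pd.DataFrame({
    "website_visits":        [1, 2, 3, 4, 5],
    "time_spent_on_website": [100, 2000, 300, 50, 1200],
    "page_views_per_visit":  [2.0, 3.5, 1.0, 6.0, 2.5],
})
cols = list(toy.columns)

raw_sum = toy[cols].sum(axis=1)
# Average share of the raw sum that comes from time_spent_on_website alone
time_share = (toy["time_spent_on_website"] / raw_sum).mean()
score = standardized_engagement(toy, cols)

print(f"mean share of raw sum from time spent: {time_share:.2%}")
print(score.round(3))
```

Tree-based models are not affected by the scale of a single feature, but the scale of the components does decide which of them a summed feature actually reflects.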

Feature scaling is not necessary for tree-based models like Decision Trees and Random Forests because these models do not rely on the scale or distribution of the features. They make decisions based on order (which item is larger) rather than on the specific scale of the feature values, meaning that the varying scales of the raw data do not affect these models' performance.

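For decision tree models, including random forests, outliers will generally have less impact because these models are non-parametric: they do not make assumptions about the data distribution, and each split depends only on whether a value falls above or below a threshold, not on how far an extreme value lies from the rest.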
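
The scale-invariance claim can be checked directly. The sketch below, on made-up integer features, fits the same depth-limited tree before and after rescaling each column by a positive constant and compares the predictions; the data and the rescaling factors are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Hypothetical integer-valued features on very different scales.
X = np.column_stack([
    rng.integers(0, 30, size=200),    # e.g. a visit count
    rng.integers(0, 2500, size=200),  # e.g. seconds on site
]).astype(float)
y = ((X[:, 1] > 800) | (X[:, 0] > 25)).astype(int)

# Stretch one column and shrink the other; the order of values is unchanged.
X_rescaled = X * np.array([1000.0, 2.0 ** -10])

tree_raw = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X, y)
tree_rescaled = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_rescaled, y)

same_predictions = np.array_equal(tree_raw.predict(X), tree_rescaled.predict(X_rescaled))
print("Identical predictions after rescaling:", same_predictions)
```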

## EDA

- It is a good idea to explore the data once again after manipulating it.

In [None]:
sns.violinplot(x='status', y='engagement_score', data=data)
plt.title('Engagement Score by Lead Status')
plt.xlabel('Lead Status')
plt.ylabel('Engagement Score')
plt.show()

In [None]:
sns.histplot(data['age_engagement_interaction'], kde=True)
plt.title('Distribution of Age-Engagement Interaction')
plt.xlabel('Age-Engagement Interaction')
plt.ylabel('Frequency')
plt.show()

In [None]:
sns.boxplot(x='status', y='age_engagement_interaction', data=data)
plt.title('Age-Engagement Interaction by Conversion Status')
plt.xlabel('Conversion Status')
plt.ylabel('Age-Engagement Interaction')
plt.show()

In [None]:
print(data['age_engagement_interaction'].describe())


In [None]:
print(data[['age_engagement_interaction', 'status']].corr())


Outlier detection, treatment, and Data-split


In [None]:
for col in ['age', 'website_visits', 'time_spent_on_website', 'page_views_per_visit', 'engagement_score',]:
    # Calculate the 1st and 99th percentiles
    lower_bound = data[col].quantile(0.01)
    upper_bound = data[col].quantile(0.99)

    # Cap the values
    data[col] = np.clip(data[col], lower_bound, upper_bound)
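
The percentile bounds above are computed on the full dataset before it is split, so the test rows influence the caps. A common alternative is to learn the bounds on the training rows and reuse them on the test rows. The sketch below shows that pattern on made-up data; `fit_caps` and `apply_caps` are hypothetical helper names, not part of this notebook.

```python
import pandas as pd

def fit_caps(train, columns, lower=0.01, upper=0.99):
    """Learn per-column percentile bounds from the training rows only."""
    return {c: (train[c].quantile(lower), train[c].quantile(upper)) for c in columns}

def apply_caps(frame, caps):
    """Clip each column to previously learned bounds; returns a copy."""
    out = frame.copy()
    for col, (lo, hi) in caps.items():
        out[col] = out[col].clip(lower=lo, upper=hi)
    return out

# Hypothetical toy split: the extreme value sits in the test rows.
train = pd.DataFrame({"website_visits": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]})
test = pd.DataFrame({"website_visits": [2, 5, 40]})

caps = fit_caps(train, ["website_visits"])
test_capped = apply_caps(test, caps)

print(caps)
print(test_capped)
```

With this pattern the capping step would run after `train_test_split` rather than before it.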

In [None]:
X = data.drop(["status"], axis=1)
Y = data['status']  # Define the dependent (target) variable

X = pd.get_dummies(X, drop_first=True) # Get dummies for X and avoid the dummy variable trap

# Splitting the data in 70:30 ratio for train to test data
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.30, random_state=1
)

In [None]:
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
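
If the class proportions printed above differ noticeably between the two sets, `train_test_split` accepts a `stratify` argument that keeps the class balance the same in both. A minimal sketch on a made-up target with 30% positives:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy target with 30% positives, not the ExtraaLearn data.
y = np.array([1] * 300 + [0] * 700)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

print("positive share in train:", y_tr.mean())
print("positive share in test: ", y_te.mean())
```

In the notebook this would presumably amount to adding `stratify=Y` to the existing call.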

## Building a Decision Tree model

In [None]:
def metrics_score(actual, predicted):
    print(classification_report(actual, predicted))
    cm = confusion_matrix(actual, predicted)
    plt.figure(figsize = (8, 5))
    sns.heatmap(cm, annot = True,  fmt = '.2f', xticklabels = ['Not Converted', 'Converted'], yticklabels = ['Not Converted', 'Converted'])
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show()

In [None]:
# Instantiate the decision tree classifier
d_tree = DecisionTreeClassifier(random_state=1)

In [None]:
# Fit the classifier on the training data
d_tree.fit(X_train, y_train)

In [None]:
y_pred_train1 =  d_tree.predict(X_train)
metrics_score(y_train, y_pred_train1)

The Decision tree is giving a 100% score for all metrics on the training dataset.

In [None]:
# Choose the type of classifier
d_tree_tuned = DecisionTreeClassifier(random_state = 7, class_weight = {0: 0.3, 1: 0.7})

# Grid of parameters to choose from
parameters = {'max_depth': np.arange(2, 10),
              'criterion': ['gini', 'entropy'],
              'min_samples_leaf': [5, 10, 20, 25]
             }

# Type of scoring used to compare parameter combinations - recall score for class 1
scorer = metrics.make_scorer(recall_score, pos_label = 1)

# Run the grid search
grid_obj = GridSearchCV(d_tree_tuned, parameters, scoring = scorer, cv = 5)

grid_obj = grid_obj.fit(X_train, y_train)

# Set the classifier to the best combination of parameters
d_tree_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data
d_tree_tuned.fit(X_train, y_train)

In [None]:
y_pred_train2 = d_tree_tuned.predict(X_train)
# Using the metrics_score function to evaluate the model's performance
metrics_score(y_train, y_pred_train2)

The untuned Decision Tree scores 1.00 on every metric for the training data, while the tuned tree has a recall of 0.88 on the same data; the perfect scores indicate that the untuned tree was overfitting the training data. The precision of 0.62 for the tuned tree on the training data means that about 38% (1 - 0.62) of the leads predicted to convert do not actually convert, so the company may waste time and energy on leads who are not on the brink of conversion.
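
The 38% figure follows from the definition of precision: of all leads predicted to convert, precision is the share that truly convert, and 1 - precision is the share that do not. A small worked sketch, with hypothetical counts chosen only to reproduce a precision of 0.62:

```python
def precision_and_false_discovery(true_positives, false_positives):
    """Return precision and its complement, the share of predicted positives that are wrong."""
    predicted_positive = true_positives + false_positives
    precision = true_positives / predicted_positive
    return precision, 1 - precision

# Hypothetical counts, not taken from the notebook's confusion matrix.
precision, fdr = precision_and_false_discovery(true_positives=620, false_positives=380)
print(f"precision = {precision:.2f}, share of predicted conversions that are wrong = {fdr:.2f}")
```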

In [None]:
# Making predictions on the testing data with the tuned model
y_pred_test2 = d_tree_tuned.predict(X_test)


In [None]:
# Using the metrics_score function to evaluate the model's performance on the test data
metrics_score(y_test, y_pred_test2)

On the test data the tuned Decision Tree has a recall of 0.86, compared with 0.88 on the training data. The gap is small, which suggests that the tuned tree overfits the training data only slightly. As the precision of 0.62 on the training data suggests, a share of the leads predicted to convert will not actually convert, and the company may waste time and energy on these leads who are not on the brink of conversion.

In [None]:
features = list(X.columns)

plt.figure(figsize = (20, 20))

tree.plot_tree(d_tree_tuned, feature_names = features, filled = True, fontsize = 9, node_ids = True, class_names = True)

plt.show()

In [None]:
# Importance of features in the tree building

print(pd.DataFrame(d_tree_tuned.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))

In [None]:
# Plotting the feature importance
importances = d_tree_tuned.feature_importances_

indices = np.argsort(importances)

plt.figure(figsize = (10, 10))

plt.title('Feature Importances')

plt.barh(range(len(indices)), importances[indices], color = 'violet', align = 'center')

plt.yticks(range(len(indices)), [features[i] for i in indices])

plt.xlabel('Relative Importance')

plt.show()

Observations:

Time spent on the website and first_interaction_website are the most important features followed by profile_completed, age, and last_activity. The rest of the variables have no impact in this model, while deciding whether a lead will be converted or not.
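
Impurity-based importances are computed from the training data, so it can be useful to cross-check them on held-out data. One option is scikit-learn's `permutation_importance`, which measures how much a score drops when a column is shuffled. The sketch below uses made-up data in which only the first column carries signal; the data and the split are assumptions for illustration.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] > 0).astype(int)  # only the first column carries signal

X_fit, y_fit = X[:300], y[:300]
X_val, y_val = X[300:], y[300:]

model = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_fit, y_fit)

# Drop in recall on held-out rows when each column is shuffled in turn
result = permutation_importance(
    model, X_val, y_val, scoring="recall", n_repeats=10, random_state=1
)
ranking = np.argsort(result.importances_mean)[::-1]

print("columns ranked by permutation importance:", ranking)
print("mean importance per column:", result.importances_mean.round(3))
```

In the notebook the call would presumably take `d_tree_tuned`, `X_test`, and `y_test`.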

## Do we need to prune the tree?

Yes, we should prune the tree since some features are not important at all in the decision tree model



In [None]:
from sklearn.model_selection import cross_val_score

# Search for the optimal ccp_alpha value
alpha_values = np.linspace(0.001, 0.02, 50)
mean_scores = []

for ccp_alpha in alpha_values:
    clf = DecisionTreeClassifier(random_state=7, ccp_alpha=ccp_alpha)
    scores = cross_val_score(clf, X_train, y_train, cv=5, scoring='accuracy')
    mean_scores.append(np.mean(scores))

# Find the alpha value with the highest mean accuracy score
optimal_ccp_alpha = alpha_values[np.argmax(mean_scores)]

# Prune the tree using the optimal ccp_alpha
d_tree_pruned = DecisionTreeClassifier(random_state=7, ccp_alpha=optimal_ccp_alpha)
d_tree_pruned.fit(X_train, y_train)
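
Instead of a fixed grid of `ccp_alpha` values, scikit-learn can return the exact values at which the tree changes, via `cost_complexity_pruning_path`. The sketch below shows that approach on synthetic data from `make_classification`; it scores candidates by recall to match the tuning above, which is an assumption rather than the notebook's choice of accuracy.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=400, n_features=8, n_informative=3, random_state=0)

full_tree = DecisionTreeClassifier(random_state=7).fit(X, y)
path = full_tree.cost_complexity_pruning_path(X, y)

# Drop the last alpha (it prunes the tree to a single node) and guard against
# tiny negative values from floating-point error.
candidate_alphas = np.unique(np.clip(path.ccp_alphas[:-1], 0, None))

cv_recall = [
    cross_val_score(
        DecisionTreeClassifier(random_state=7, ccp_alpha=a), X, y, cv=5, scoring="recall"
    ).mean()
    for a in candidate_alphas
]
best_alpha = candidate_alphas[int(np.argmax(cv_recall))]

pruned_tree = DecisionTreeClassifier(random_state=7, ccp_alpha=best_alpha).fit(X, y)
print("nodes before pruning:", full_tree.tree_.node_count)
print("nodes after pruning: ", pruned_tree.tree_.node_count)
```

The pruned tree can then be evaluated on the test set in the same way as the other models, for example with `metrics_score`.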

In [None]:
# Plotting the mean accuracy scores over different ccp_alpha values
plt.figure(figsize=(10, 6))
plt.plot(alpha_values, mean_scores, marker='o', linestyle='--', color='b')
plt.title('Mean Accuracy Score over different ccp_alpha values')
plt.xlabel('ccp_alpha')
plt.ylabel('Mean Accuracy')
plt.grid(True)
plt.show()

In [None]:
features = list(X.columns)

plt.figure(figsize = (20, 20))

tree.plot_tree(d_tree_pruned, feature_names = features, filled = True, fontsize = 9, node_ids = True, class_names = True)

plt.show()

## Building a Random Forest model

In [None]:
rf_estimator = RandomForestClassifier(random_state=1)

# Fit the classifier on the training data
rf_estimator.fit(X_train, y_train)

In [None]:
# Making predictions on the training data with the random forest classifier
y_pred_train3 = rf_estimator.predict(X_train)

# Using the metrics_score function to evaluate the model's performance on the training data
metrics_score(y_train, y_pred_train3)

Observation:

For all the metrics in the training dataset, the Random Forest gives a 100% score.

In [None]:
# Making predictions on the testing data with the random forest classifier
y_pred_test3 = rf_estimator.predict(X_test)

# Using the metrics_score function to evaluate the model's performance on the test data
metrics_score(y_test, y_pred_test3)

The Random Forest classifier seems to be overfitting the training data. The recall on the training data is 1, while the recall on the test data is only ~ 0.68 for class 1. Precision is high for the test data as well.
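
A random forest can also estimate its generalization performance without touching the test set, by scoring each row with only the trees that did not see it during bootstrapping (the out-of-bag estimate). The sketch below, on synthetic data, contrasts training accuracy with the out-of-bag accuracy; the dataset and the number of trees are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)

forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=1)
forest.fit(X, y)

train_accuracy = forest.score(X, y)  # optimistic: every tree has seen most of these rows
oob_accuracy = forest.oob_score_     # accuracy on rows left out of each tree's bootstrap sample

print(f"training accuracy:    {train_accuracy:.3f}")
print(f"out-of-bag accuracy:  {oob_accuracy:.3f}")
```

A large gap between the two numbers is another sign of the overfitting described above.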

In [None]:
# Choose the type of classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, recall_score
import sklearn.metrics as metrics

# Setup the RandomForestClassifier with your specified criterion and random state
rf_estimator_tuned = RandomForestClassifier(criterion="entropy", random_state=7)

# Define the grid of parameters to search over
parameters = {
    "n_estimators": [110, 120],
    "max_depth": [6, 7],
    "min_samples_leaf": [20, 25],
    "max_features": [0.8, 0.9],
    # None uses all rows for each tree; the integer 1 would draw a single row per tree
    "max_samples": [0.9, None],
    "class_weight": ["balanced", {0: 0.3, 1: 0.7}]
}

# Define the scorer based on recall score for class 1
scorer = make_scorer(recall_score, pos_label=1)

# Setup GridSearchCV with the RandomForestClassifier, the grid of parameters, and the scoring method
grid_obj = GridSearchCV(rf_estimator_tuned, parameters, scoring=scorer, cv=5, n_jobs=-1)

# Fit the grid search object to the training data to search for the best parameters
grid_obj.fit(X_train, y_train)

# After the search, save the best estimator to rf_estimator_tuned
rf_estimator_tuned = grid_obj.best_estimator_

# If you'd like to print out the best parameters found, you can do so like this:
print("Best parameters found: ", grid_obj.best_params_)

In [None]:
## random search
##from sklearn.model_selection import RandomizedSearchCV
##rf_estimator_tuned = RandomForestClassifier(criterion="entropy", random_state=7)
# Setup the randomized search with the same parameters and distributions
##random_search = RandomizedSearchCV(rf_estimator_tuned, param_distributions=parameters, n_iter=10, scoring=scorer, cv=5, random_state=7, n_jobs=-1)

# Fit the randomized search object to the training data
##random_search.fit(X_train, y_train)

# Save the best estimator to variable rf_estimator_tuned
##rf_estimator_tuned = random_search.best_estimator_

In [None]:
rf_estimator_tuned.fit(X_train, y_train)


In [None]:
# Making predictions on the training data with the tuned random forest classifier
y_pred_train4 = rf_estimator_tuned.predict(X_train)

# Using the metrics_score function to evaluate the model's performance on the training data
metrics_score(y_train, y_pred_train4)

In [None]:
# Making predictions on the test data with the tuned random forest classifier
y_pred_test4 = rf_estimator_tuned.predict(X_test)

# Using the metrics_score function to evaluate the model's performance on the test data
metrics_score(y_test, y_pred_test4)

Note that the tuned model achieves higher precision for class 1 on the test data than on the training data. Recall, however, is lower by 0.03, and accuracy remains the same.

In [None]:
importances = rf_estimator_tuned.feature_importances_

indices = np.argsort(importances)

feature_names = list(X.columns)

plt.figure(figsize = (12, 12))

plt.title('Feature Importances')

plt.barh(range(len(indices)), importances[indices], color = 'violet', align = 'center')

plt.yticks(range(len(indices)), [feature_names[i] for i in indices])

plt.xlabel('Relative Importance')

plt.show()

Observations: Similar to the decision tree model, time spent on website, first_interaction_website, profile_completed, and age are the top four features that help distinguish between not converted and converted leads. Unlike the decision tree, the random forest gives some importance to other variables like occupation, page_views_per_visit, as well. This implies that the random forest is giving importance to more factors in comparison to the decision tree.

## Do we need to prune the tree?

We don't need to prune the tree when considering the random forest trees because there is built-in regularization: Random Forest inherently includes several mechanisms that prevent overfitting despite the complexity of individual trees. This is achieved through the randomness introduced by selecting different subsets of features and training examples. The ensemble method, where predictions are averaged (for regression) or voted upon (for classification), further mitigates the risk of overfitting.

Looking at the results, we can also see that each variable holds some importance in the decision-making process, so there is no need to prune the tree.

However, the above is true only if we used random search instead of GridSearch. Since we used GridSearch, we have certain features that are not that important, hence we can prune the tree. To further prune the tree we can only change the tuning of individual parameters like we did before.

## Actionable Insights and Recommendations

- We saw that time_spent_on_website is the most important feature for lead status in the decision tree model, and that first_interaction is the most important feature in the random forest model. Since time_spent_on_website is a crucial factor, ensuring that the website is engaging, informative, and easy to navigate can encourage potential leads to spend more time exploring the content.
- The company should focus marketing efforts and budgets on the channels that have the highest lead conversion rate.
- The company should use user feedback to further design its recommendation systems and employ A/B testing to experiment with different strategies.
- The results of the statistical tests are visualized in graphs and commented upon in the document. We used ANOVA and the Chi-squared test to judge the impact of these factors.
- We gauged the model predictions through decision trees and random forests, and visualized and commented upon the findings.
- We included feature engineering (adding two new features) and performed EDA on the dataset before and after adding the features.
- We tuned the models to deal with overfitting and used classification reports and confusion matrices to measure precision, recall, F1 score, and accuracy.
- We created two lists, one of numerical columns and one of categorical columns, and examined the target variable with respect to all the other columns through univariate, bivariate, and multivariate analysis.
- We examined the skewness, central tendency, and shape of the distributions of the numeric variables, and learned that age is skewed left.
- We also focused on the important features at the end of the predictive models and decided upon their pruning status.