# Lead Scoring Assignment

#### Done By : Shiva Chandra Kante, Krishnakumar V, Kamatchi M

#### Problem Statement:
An education company named X Education sells online courses to industry professionals. On any given day, many professionals who are interested in the courses land on their website and browse for courses. 

The company markets its courses on several websites and search engines like Google. Once people land on the website, they might browse the courses, fill out a form for a course, or watch some videos. When they fill out a form providing their email address or phone number, they are classified as a lead. The company also gets leads through past referrals. Once leads are acquired, employees from the sales team start making calls, writing emails, etc. Through this process, some of the leads get converted while most do not. The typical lead conversion rate at X Education is around 30%.

#### Business Goal:
This case study has two goals:

- Build a logistic regression model that assigns each lead a score between 0 and 100, which the company can use to target potential leads. A higher score means the lead is hot, i.e. most likely to convert, whereas a lower score means the lead is cold and will most likely not convert.
- The company has presented some additional problems that your model should be able to adjust to if the company's requirements change in the future, so you will need to handle these as well. These problems are provided in a separate doc file. Please fill it in based on the logistic regression model from the first step. Also, make sure you include this in your final PPT where you'll make recommendations.
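
As a rough illustration of what a lead score means here, the sketch below converts the log-odds produced by a logistic model into a 0 to 100 score. The `to_lead_score` helper and the sample log-odds are hypothetical and are not taken from the model built in this notebook.

```python
import numpy as np

def to_lead_score(log_odds):
    """Map log-odds from a logistic model to an integer score in [0, 100]."""
    # The sigmoid turns log-odds into a conversion probability
    prob = 1.0 / (1.0 + np.exp(-np.asarray(log_odds, dtype=float)))
    # Scale the probability to 0-100 and round to the nearest integer
    return np.rint(prob * 100).astype(int)

# Hypothetical log-odds for a cold, a neutral and a hot lead
sample_log_odds = [-2.0, 0.0, 2.0]
scores = to_lead_score(sample_log_odds)
print(scores)
```

With this mapping, changing the cutoff for "hot" leads only means choosing a different score threshold; the model itself does not need to be refit.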




## Step 1: Reading and Understanding the Data

Let us first import NumPy and pandas and read the Lead Scoring dataset.

In [1]:
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

In [3]:
## Show up to 500 rows and 500 columns when displaying DataFrames
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

In [4]:
lead_score = pd.read_csv("Leads.csv")

In [None]:
# Check the head of the dataset
lead_score.head()

In [None]:
lead_score.shape

In [None]:
lead_score.info()

##### <b>*OBSERVATION*</b>:
- Many columns contain `Select` as a value, which means no option was chosen, so it is treated as missing.

### Data Cleaning

In [8]:
# Replace the placeholder value 'Select' with NaN
lead_score = lead_score.replace('Select',np.nan)

In [None]:
# Checking for duplicate rows in the dataframe.
duplicate_row_count = lead_score.duplicated().sum()
print(f" Duplicate Row Count in Lead scoring Dataframe :- {duplicate_row_count}")

In [None]:
## understanding percentage of null values of each column
(lead_score.isnull().sum() * 100/lead_score.shape[0]).round(2).sort_values(ascending=False)

In [None]:
columns_to_drop = []
for column in lead_score.columns:
    if (lead_score[column].isnull().sum() * 100/lead_score.shape[0]).round(2) > 40:
        columns_to_drop.append(column)
columns_to_drop

In [None]:
# Checking if there are columns with one unique value since it won't affect our analysis
lead_score.nunique()

##### <b>*OBSERVATION*</b>:
- Some columns contain only a single value across all rows, so it's better to drop those columns.
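
The single-value columns can also be found programmatically instead of being read off the `nunique()` output. The sketch below uses a small hypothetical frame (the `Some Flag` column is invented for illustration) and counts NaN as a distinct value, so a column with one value plus missing entries is not flagged.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the leads data
toy = pd.DataFrame({
    "Magazine": ["No", "No", "No", "No"],
    "Lead Origin": ["API", "API", "Lead Add Form", "API"],
    "Some Flag": ["No", "No", np.nan, "No"],
})

# Columns holding exactly one distinct value; dropna=False counts NaN as a value
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]
print(constant_cols)
```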

In [13]:
# Dropping columns with only one value for all the rows
columns_to_drop.extend(['Magazine','Receive More Updates About Our Courses','Update me on Supply Chain Content',
                            'Get updates on DM Content','I agree to pay the amount through cheque'])

In [None]:
columns_to_drop

In [15]:
lead_score.drop(columns_to_drop, axis = 1, inplace = True)

In [None]:
## understanding percentage of null values of each column
(lead_score.isnull().sum() * 100/lead_score.shape[0]).round(2).sort_values(ascending=False)

In [None]:
lead_score['Country'].value_counts(dropna=False)

In [None]:
def replace_country(x):
    # Bucket the country into India, Outside India or Not Provided
    # pd.isnull is used because an identity check against np.nan is not reliable
    if pd.isnull(x):
        return "Not Provided"
    if x == "India":
        return "India"
    return "Outside India"

lead_score['Country'] = lead_score['Country'].apply(replace_country)
lead_score['Country'].value_counts()

In [None]:
lead_score['What matters most to you in choosing a course'].value_counts(dropna=False)

##### <b>*OBSERVATION*</b>:
- Apart from the null values, `What matters most to you in choosing a course` is a highly skewed column, so we can drop it.

In [20]:
lead_score.drop('What matters most to you in choosing a course', axis = 1, inplace = True)

In [None]:
lead_score['What is your current occupation'].value_counts(dropna=False)

In [None]:
lead_score['What is your current occupation'] = lead_score['What is your current occupation'].fillna("Not Provided")
lead_score['What is your current occupation'].value_counts(dropna=False)

In [None]:
lead_score['Specialization'].value_counts(dropna=False)

- Instead of dropping the `Specialization` column, we fill its null values with "Not Provided".

In [None]:
lead_score['Specialization'] = lead_score['Specialization'].fillna("Not Provided")
lead_score['Specialization'].value_counts(dropna=False)

In [None]:
lead_score['City'].value_counts(dropna=False)

##### <b>*OBSERVATION*</b>:
- The `City` column has 39.71% missing values. Imputing them with "Mumbai" would increase data skewness, potentially biasing the model. Therefore, it's best to drop the `City` column.
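
To make the skewness argument concrete, the sketch below compares the share of the most frequent category before and after mode imputation. The toy values are hypothetical and only mimic a column with about 40% missing entries; they are not the actual `City` counts.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: 6 known cities and 4 missing values
city = pd.Series(["Mumbai", "Mumbai", "Mumbai", "Thane", "Pune", "Delhi",
                  np.nan, np.nan, np.nan, np.nan])

# Share of the most frequent city among the known values only
share_before = city.value_counts(normalize=True).iloc[0]

# Share of the same city after filling every missing value with the mode
imputed = city.fillna(city.mode()[0])
share_after = imputed.value_counts(normalize=True).iloc[0]

print(f"top-category share before: {share_before:.2f}, after: {share_after:.2f}")
```

In this toy case the dominant category grows from half of the known values to 70% of all rows, which is the kind of artificial skew the observation above warns about.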


In [26]:
lead_score.drop('City', axis = 1, inplace = True)

In [None]:
lead_score['Tags'].value_counts(dropna=False)

##### <b>*OBSERVATION*</b>:
- The `Tags` column has 36.29% missing values. Since tags indicate the current status of a lead, they are not useful for modeling. Therefore, this column can be dropped.



In [28]:
lead_score.drop('Tags', axis = 1, inplace = True)

In [None]:
## understanding percentage of null values of each column
(lead_score.isnull().sum() * 100/lead_score.shape[0]).round(2).sort_values(ascending=False)

In [None]:
lead_score['TotalVisits'].value_counts(dropna=False).head(10)

In [31]:
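# TotalVisits missing values to be imputed with mode
# Assign the result back; inplace fillna on a selected column may not update the DataFrame
lead_score['TotalVisits'] = lead_score['TotalVisits'].fillna(lead_score['TotalVisits'].mode()[0])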

In [None]:
lead_score['Page Views Per Visit'].value_counts(dropna=False).head(10)

In [33]:
# Page Views Per Visit missing values to be imputed with mode
lead_score['Page Views Per Visit'] = lead_score['Page Views Per Visit'].fillna(lead_score['Page Views Per Visit'].mode()[0])

In [None]:
lead_score['Last Activity'].value_counts(dropna=False).head(10)

In [35]:
# Filling null values in Last Activity with Email Opened
lead_score['Last Activity'] = lead_score['Last Activity'].fillna('Email Opened')

In [None]:
lead_score['Lead Source'].value_counts(dropna=False).head(10)

In [37]:
# Filling null values in Lead Source with Google
lead_score['Lead Source'] = lead_score['Lead Source'].fillna('Google')

In [None]:
## understanding percentage of null values of each column
(lead_score.isnull().sum() * 100/lead_score.shape[0]).round(2).sort_values(ascending=False)

In [None]:
lead_score.nunique()

##### <b>*OBSERVATION*</b>:
- The `Lead Number` and `Prospect ID` columns contain unique values for each row and are used solely for lead tracking. Since they do not contribute to the model, they can be dropped.

In [40]:
lead_score.drop(['Prospect ID', 'Lead Number'], axis = 1, inplace = True)

##### Checking for Skewness in the Categorical Columns

In [None]:
categorical_cols = lead_score.select_dtypes(include=['category', 'object']).columns.tolist()
categorical_cols

In [42]:
# A plot function to analyze the categorical columns
def plot_categorical_skweness_test(df,col):
    category_counts = df[col].value_counts()
    total = len(df)
    # Calculate percentages
    category_percentages = (category_counts / total) * 100
    # Plot
    plt.figure(figsize=(10, 5))
    sns.barplot(x=category_counts.index, y=category_counts.values, color='salmon')
    # Annotate with percentage values
    for i, (count, percentage) in enumerate(zip(category_counts.values, category_percentages.values)):
        # Lift labels of very small bars above the bar so they stay readable
        if percentage <= 5:
            place = total*0.05
        else :
            place = count / 2
        plt.text(i, place, f'{percentage:.1f}%',  ha='center',  va='center', fontsize=12,rotation=90)
    # Labels and title
    plt.xlabel(col)
    plt.xticks(rotation=90)
    plt.ylabel("Count")
    plt.title(f"Count and Percentage of Each {col}")
    plt.show()

In [None]:
for col in categorical_cols:
    centered_text = f" ------------------ Categorical Skewness test FOR {col} ------------------ ".center(150)
    print(centered_text)
    plot_categorical_skweness_test(lead_score,col)

##### <b>*OBSERVATION*</b>:
- The following columns are highly skewed (almost every row holds the same value) and will be dropped, as they add little information to the model. Near-constant predictors can also make the logistic regression coefficient estimates unstable.
  - `Do Not Call`
  - `Search`
  - `Newspaper Article`
  - `X Education Forums`
  - `Newspaper`
  - `Digital Advertisement`
  - `Through Recommendations`
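
The columns listed above were picked by reading the plots. A minimal sketch of how such columns could be flagged automatically is shown below; the `highly_skewed_columns` helper, the 95% threshold and the toy data are assumptions for illustration, not the rule used in this notebook.

```python
import pandas as pd

def highly_skewed_columns(df, threshold=0.95):
    """Return columns whose most frequent value covers more than `threshold` of rows."""
    skewed = []
    for col in df.columns:
        # Share of the single most frequent value, counting NaN as a value
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share > threshold:
            skewed.append(col)
    return skewed

# Hypothetical toy frame: 'Search' is almost constant, 'Lead Origin' is not
toy = pd.DataFrame({
    "Search": ["No"] * 99 + ["Yes"],
    "Lead Origin": ["API"] * 40 + ["Landing Page Submission"] * 60,
})
print(highly_skewed_columns(toy))
```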


In [44]:
lead_score.drop(['Do Not Call',
 'Search',
 'Newspaper Article',
 'X Education Forums',
 'Newspaper',
 'Digital Advertisement',
 'Through Recommendations'], axis = 1, inplace = True)


In [None]:
lead_score['Lead Source'].value_counts(normalize=True)*100

##### <b>*OBSERVATION*</b>:
- In the `Lead Source` column, Google appears twice (as `Google` and `google`), and several other lead sources occur very rarely, so it is better to group them into an "Others" bucket.

In [46]:
# Changing google to Google
lead_score['Lead Source'] = lead_score['Lead Source'].replace("google","Google")

In [47]:
# Grouping low frequency value levels to Others
lead_score['Lead Source'] = lead_score['Lead Source'].replace(["bing","Click2call","Press_Release",
                                                           "Social Media","Live Chat","youtubechannel",
                                                           "testone","Pay per Click Ads","welearnblog_Home",
                                                           "WeLearn","blog","NC_EDM"],"Others")



In [None]:
lead_score['Lead Source'].value_counts(normalize=True)*100

In [None]:
lead_score['Last Activity'].value_counts(normalize=True)*100

##### <b>*OBSERVATION*</b>:
- The `Last Activity` column contains many categories with low frequencies. To reduce the number of dummy variables created during encoding, we can group the low-frequency categories under "Others". This keeps the dataset clean and avoids unnecessary columns.
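
Grouping rare levels can also be done by frequency rather than by listing each level by name. The sketch below is one possible way to do that; the `group_rare_categories` helper, its `min_share` cutoff and the toy series are hypothetical and do not reproduce the exact grouping used in this notebook.

```python
import pandas as pd

def group_rare_categories(series, min_share=0.01, other_label="Others"):
    """Replace categories whose relative frequency is below `min_share`."""
    shares = series.value_counts(normalize=True)
    rare = shares[shares < min_share].index
    # Keep frequent values as they are and relabel only the rare ones
    return series.where(~series.isin(rare), other_label)

# Hypothetical toy column with two rare lead sources
toy_source = pd.Series(["Google"] * 60 + ["Direct Traffic"] * 38 + ["bing", "blog"])
grouped = group_rare_categories(toy_source, min_share=0.05)
print(grouped.value_counts())
```

A frequency-based rule avoids maintaining a hand-written list, at the cost of the cutoff becoming one more choice to justify.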



In [50]:
# Grouping low frequency value levels to Others
lead_score['Last Activity'] = lead_score['Last Activity'].replace(["Unreachable","Unsubscribed","Had a Phone Conversation",
                                                           "Approached upfront","View in browser link Clicked",
                                                           "Email Received","Email Marked Spam",
                                                           "Visited Booth in Tradeshow","Resubscribed to emails"],"Others")


In [None]:
lead_score['Last Activity'].value_counts(normalize=True)*100

In [None]:
lead_score.head(10)

In [53]:
# Renaming column name "A free copy of Mastering The Interview" to "Free_copy" 
lead_score.rename(columns={'A free copy of Mastering The Interview': 'Free_copy'}, inplace=True)

In [54]:
# Renaming column name "What is your current occupation" to "Current_occupation"
lead_score.rename(columns={'What is your current occupation': 'Current_occupation'}, inplace=True)

In [None]:
lead_score.head()

In [None]:
lead_score['Do Not Email'].value_counts()

##### <b>*OBSERVATION*</b>:
- `Do Not Email` is a binary column with two values (Yes/No), so let's replace them with 1 and 0.

In [57]:
lead_score['Do Not Email'] = lead_score['Do Not Email'].apply(lambda x: 1 if x =='Yes' else 0)

In [None]:
lead_score['Do Not Email'].value_counts()

In [None]:
lead_score['Free_copy'].value_counts()

##### <b>*OBSERVATION*</b>:
- `Free_copy` is a binary column with two values (Yes/No), so let's replace them with 1 and 0.

In [60]:
lead_score['Free_copy'] = lead_score['Free_copy'].apply(lambda x: 1 if x =='Yes' else 0)

In [None]:
lead_score['Free_copy'].value_counts()

In [None]:
lead_score.head()

## Step 2: EDA

#### Data Imbalance

In [None]:
lead_score.Converted.value_counts(normalize=True)*100

In [None]:
# Calculating Data Imbalance Ratio
round(len(lead_score[lead_score.Converted==0]) / len(lead_score[lead_score.Converted==1]),2)

In [None]:
# Checking what % of Leads are Converted 
plt.figure(figsize=(8,4))
labels = ["Not Converted" ,"Converted"]
explode = (0, 0.1)
plt.title('Data imbalance- Pie Chart',fontdict={'fontsize':20})
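# value_counts() lists the majority class (0, Not Converted) first, matching the labels order
plt.pie(lead_score.Converted.value_counts(), explode=explode, colors=['r', 'g'], labels=labels, autopct='%1.1f%%',shadow=True, startangle=180)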
plt.show()

##### <b>*OBSERVATION*</b>:
- The conversion rate is 38.5%: 38.5% of leads converted (minority class), while 61.5% did not (majority class), a ratio of roughly 1.6 to 1. This indicates a class imbalance in the data.

### Univariate Analysis

##### Categorical columns

In [66]:
# A plot function to analyze the categorical columns
def plot_function_categorical(df,col):
    fig = plt.figure(figsize=(15,10))
    ax1 = plt.subplot(221)
    df[col].value_counts().plot.pie(autopct = "%1.0f%%" , ax = ax1,colors=['lightblue', 'salmon','green','red','yellow'])
    plt.title("Pie chart for " + col)
    ax2 = plt.subplot(222)
    plot_df=pd.DataFrame()
    # Share of each category within non-converted (0) and converted (1) leads
    not_converted_df = df[df['Converted']==0]
    converted_df =  df[df['Converted']==1]
    plot_df['0'] = ((not_converted_df[col].value_counts())/len(not_converted_df))
    plot_df['1'] = ((converted_df[col].value_counts())/len(converted_df))
    plot_df.plot.bar(ax=ax2,color=['lightblue','salmon'])
    plt.title("Plotting data in terms of percentage")
    ax3 = plt.subplot(223)
    sns.countplot(x=col,hue='Converted',data=df,ax=ax3,palette={0: 'lightblue', 1: 'salmon'})
    plt.xticks(rotation=90)
    plt.title("Countplot for " + col)
    ax4 = plt.subplot(224)
    stack_df = df[[col,'Converted',]].value_counts().unstack()
    stack_df['SUM'] = stack_df.sum(axis=1)
    stack_df['0'] = ((stack_df[0] / stack_df['SUM']) * 100).fillna(0)
    stack_df['1'] = ((stack_df[1] / stack_df['SUM']) * 100).fillna(0)
    ax4.bar(stack_df.index, stack_df['0'], label='0', color='lightblue')
    ax4.bar(stack_df.index, stack_df['1'], bottom=stack_df['0'], label='1', color='salmon')
    for index, row in stack_df.iterrows():
        ax4.text(index, row['0'] / 2, f"{row['0']:.1f}%", ha='center',  va='center', rotation=90)
        ax4.text(index, row['0'] + row['1'] / 2, f"{row['1']:.1f}%", ha='center',  va='center', rotation=90)
    ax4.set_ylabel('Percentage (%)')
    ax4.set_title('Stacked Bar Chart ')
    ax4.legend(title='Target')
    plt.xticks(rotation=90)
    plt.title("Stacked Bar Chart for " + col)
    fig.tight_layout()
    plt.show()

In [67]:
cat_cols = ['Lead Origin','Lead Source','Last Activity','Country','Specialization','Current_occupation',
            'Last Notable Activity','Do Not Email','Free_copy']

In [None]:
for col in cat_cols:
    centered_text = f" ------------------ PLOT FOR {col} ------------------ ".center(150)
    print(centered_text)
    plot_function_categorical(lead_score,col)

##### <b>*Observations & Insights from Categorical Analysis*</b>:

- Lead Origin: "Landing Page Submission" accounts for 53% of leads, while "API" accounts for 39%.
- Lead Source: 58% of leads come from Google and Direct Traffic combined.
- Last Activity: SMS Sent and Email Opened together account for 68% of last activities.
- Current Occupation: 90% of leads are Unemployed.
- Do Not Email: 92% of leads did not opt out of receiving emails about the course.

##### Numerical columns

In [69]:
# A plot function to analyze the numerical columns
def plot_function_numerical(df,col):
    fig = plt.figure(figsize=(15,8))
    ax1 = plt.subplot(221)
    sns.boxplot(data=df, x='Converted', y=col,ax = ax1,color='salmon')
    # sns.boxplot(x=col,data=df,hue=df['Converted'],ax=ax1)
    plt.title("Box Plot for " + col)
    ax2 = plt.subplot(222)
    sns.histplot(df[col],bins=30,ax=ax2)
    plt.title("Hist Plot for " + col)
    ax3 = plt.subplot(223)
    # Converted == 0 are the leads that did not convert
    sns.histplot(data=df[df['Converted']==0],kde=True,x=col,ax=ax3)
    plt.title(col + " for Not Converted Leads")
    ax4 = plt.subplot(224)
    # Converted == 1 are the leads that converted
    sns.histplot(data=df[df['Converted']==1],kde=True,x=col,ax=ax4)
    plt.title(col + " for Converted Leads")
    fig.tight_layout()
    plt.show()

In [70]:
numeric_cols = ['TotalVisits','Total Time Spent on Website','Page Views Per Visit']

In [None]:
for col in numeric_cols:
    centered_text = f" ------------------ PLOT FOR {col} ------------------ ".center(150)
    print(centered_text)
    plot_function_numerical(lead_score,col)

##### <b>*Observations & Insights from Numerical Analysis*</b>:

- Leads who spend more time on the website are more likely to convert. This indicates that higher engagement correlates with a higher conversion rate.

### Bivariate Analysis

##### Numerical columns

In [None]:
sns.pairplot(lead_score,vars=numeric_cols,hue="Converted",palette={0: 'lightblue', 1: 'salmon'})
plt.show()

In [None]:
numeric_cols

In [None]:
# Heatmap to show correlation between numerical variables
sns.heatmap(data=lead_score[['Converted','TotalVisits', 'Total Time Spent on Website', 'Page Views Per Visit']].corr(),cmap="Greens",annot=True)
plt.show()


## Step 3: Data Preparation

## Step 4: Splitting the Data into Training and Testing Sets

The first step before fitting the logistic regression model is to split the data into training and test sets.
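
A minimal sketch of such a split is shown below, using the `train_test_split` function imported in Step 1. The toy data, the 70/30 ratio, the `stratify` option and the `random_state` value are assumptions for illustration; the notebook does not specify them.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with roughly 38% converted leads
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "Total Time Spent on Website": rng.integers(0, 2000, size=200),
    "Converted": [1] * 76 + [0] * 124,
})

X = toy.drop(columns="Converted")
y = toy["Converted"]

# 70/30 split; stratify keeps the conversion rate similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=100
)
print(X_train.shape, X_test.shape)
print(round(y_train.mean(), 2), round(y_test.mean(), 2))
```

Stratifying on the target is optional, but with an imbalanced target it helps keep the train and test conversion rates comparable.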

## Step 5: Building a logistic model


#### Model 1

## Step 6: Residual Analysis of the train data

## Step 7: Making Predictions Using the Final Model

## Step 8: Model Evaluation