# A. Business Goal

***Background:*** X Education provides online courses to industry professionals. Many professionals who are interested in the courses land on the website and browse for courses. X Education advertises its courses across several marketing platforms such as Google, Olark chat, etc. Once visitors land on the website, they might engage by browsing courses, filling out forms, or watching videos. Visitors who fill out a form with their email address or phone number become leads. The company also acquires leads through past referrals. Once leads are acquired, the sales team follows up with phone and email campaigns. Through this process, a fraction of the generated leads convert into customers. However, the typical lead conversion rate at X Education is around 30%, which this notebook attempts to improve.

The company's goal is to build **a logistic regression model that assigns a lead score** to each lead, such that leads with a higher score have a higher chance of converting and leads with a lower score have a lower chance.

## Question: 
**What factors affect lead-to-customer conversion the most?**

Some factors to consider:
* Audience: students, professionals, unemployed, city and country of residence, etc.
* Marketing: websites, search engines, media platforms, referrals, etc.
* Sales: method of engaging the leads (calls, texts, emails, free materials, etc.)

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
from scipy import stats
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_recall_curve
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix

import plotly.graph_objects as go
import warnings

%matplotlib inline
warnings.filterwarnings("ignore")

# B. Importing the Dataset and Preview

In [None]:
df = pd.read_csv('/kaggle/input/lead-scoring-x-online-education/Leads X Education.csv')
df.head()

In [None]:
df.info()

In [None]:
df.describe()

## B.1. Missing data

Note: There is a level called "Select" that should be treated as a null value. The "Select" level means the visitor did not make a selection.

In [None]:
# Identify columns that contain the "Select" level
find_select = df.loc[:, (df == 'Select').any()]

# Convert the "Select" level to NaN in the given columns
def convert_to_NaN(df, col_names):
    for col in col_names:
        # Assign back to the column to avoid chained-assignment pitfalls
        df[col] = df[col].replace('Select', np.nan)

convert_to_NaN(df, find_select.columns)

In [None]:
# Amount of missing data in each column
null_values = round(df.isnull().mean().sort_values(ascending = False)*100,2)
null_values

### Insights:
* The target column "Converted" doesn't have any null values
* Drop columns with more than 70% missing values: How did you hear about X Education, Lead Profile
* The Asymmetrique Index and Score columns share the same proportion of missing data (45.65%)
* Features with <2% missing data can be imputed with appropriate values
* Features with 25%-50% missing data will be explored to decide between imputing and dropping
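
The triage rules above can be sketched as a small helper. This is only an illustration: the function name `triage_missing`, the toy percentages, and the choice to label everything between the 2% and 70% cut-offs as "explore" are assumptions, not part of the original analysis.

```python
import pandas as pd

def triage_missing(missing_pct, drop_above=70.0, impute_below=2.0):
    """Label each column by its percentage of missing values."""
    def label(pct):
        if pct > drop_above:
            return "drop"
        if pct == 0:
            return "complete"
        if pct < impute_below:
            return "impute"
        # Assumption: anything in between is inspected case by case
        return "explore"
    return missing_pct.apply(label)

# Hypothetical percentages, for illustration only
missing_pct = pd.Series({"col_a": 78.5, "col_b": 45.65, "col_c": 1.2, "col_d": 0.0})
plan = triage_missing(missing_pct)
print(plan)
```

In the notebook, `null_values` could be passed in place of the toy series.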

## B.2. EDA

### B.2.1. Numerical Features

In [None]:
df.drop(['Prospect ID','Lead Number','How did you hear about X Education','Lead Profile'], axis=1, inplace=True)

In [None]:
# The Conversion rate
labels = df['Converted'].value_counts().index

fig = go.Figure(data=[go.Pie(labels=labels, values=df['Converted'].value_counts())])
fig.show()

In [None]:
# Numerical input features in boxplots
plt.figure(figsize=(14,5))
i=1
web_interact = ['TotalVisits', 'Total Time Spent on Website','Page Views Per Visit']

for col in web_interact:
    plt.subplot(1,3,i)
    sns.boxplot(x=df[col])
    i +=1
    plt.tight_layout()

In [None]:
# Numerical input features in distribution plots
plt.figure(figsize=(14,5))
i=1
web_interact = ['TotalVisits', 'Total Time Spent on Website','Page Views Per Visit']

for col in web_interact:
    plt.subplot(1,3,i)
    sns.histplot(df[col], kde=True, stat="density")  # distplot is deprecated in recent seaborn
    i +=1
    plt.tight_layout()

In [None]:
# Assigned scores in boxplots
plt.figure(figsize=(12,5))
i=1
score = ['Asymmetrique Activity Score','Asymmetrique Profile Score']

for col in score:
    plt.subplot(1,2,i)
    sns.boxplot(x=df[col])
    i +=1
    plt.tight_layout()

In [None]:
# Assigned scores in distribution plots
plt.figure(figsize=(12,5))
i=1
for col in score:
    plt.subplot(1,2,i)
    sns.histplot(df[col], kde=True, stat="density")  # distplot is deprecated in recent seaborn
    i +=1
    plt.tight_layout()

In [None]:
pd.qcut(df['Asymmetrique Activity Score'], q=3)

* There are outliers in TotalVisits, Page Views Per Visit, and Activity Score. In addition, there are missing values in all 4 columns graphed above. Therefore, I'm going to treat the outliers and fill in the missing data.
* As seen above, the scores are binned into 3 equal-sized groups, which are then labeled as the indices. It makes sense to keep either the score or the index, but not both, because one is derived from the other.
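
To illustrate why an index obtained by binning adds no information beyond the score, here is a minimal sketch on made-up scores. The values and the labels "Low", "Medium", "High" are hypothetical; they are not the levels used in the dataset.

```python
import pandas as pd

# Hypothetical scores, for illustration only
scores = pd.Series([7, 9, 11, 13, 14, 15, 16, 17, 18], name="score")

# Bin into 3 equal-sized groups and attach a label to each bin
index_labels = pd.qcut(scores, q=3, labels=["Low", "Medium", "High"])

# Each label covers a disjoint score range, so the label is fully determined by the score
score_ranges = scores.groupby(index_labels, observed=True).agg(["min", "max"])
print(score_ranges)
```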

In [None]:
df.drop(['Asymmetrique Activity Index','Asymmetrique Profile Index'], axis=1, inplace=True)

In [None]:
# 2 features relation
sns.pairplot(df, hue='Converted');

* In general, people are more likely to convert if they spend more time on the website, regardless of the number of visits and pages viewed.
* The activity and profile scores of those who converted seem to correlate with Total Time Spent on Website.

### B.2.2 Categorical Features

In [None]:
cat_df = df.select_dtypes(include='object')

# Check for data variety in each column
cat_df.nunique()

In [None]:
# Percentage of values in each feature
a = []
for col in cat_df.columns:
    a.append(round(cat_df[col].value_counts()/cat_df[col].count()*100,2))
a

**The majority of categorical features have 25-51% missing values**
* In general, values that account for less than 5% of the data share a similar meaning: most of them represent inactive or indecisive actions. I'm going to group these values into an "Others" level.
* As mentioned earlier, I'm going to explore and impute appropriate values for the missing data in the Data Preparation section.

**Columns with very little variation in values can be dropped:**
Do Not Email, Do Not Call, Country, What matters most to you in choosing a course, Search, Magazine, Newspaper Article, X Education Forums, Newspaper, Digital Advertisement, Through Recommendations, Receive More Updates About Our Courses, Update me on Supply Chain Content, Get updates on DM Content, City, I agree to pay the amount through cheque
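
One way to flag such columns programmatically is to check how much of each column is taken up by its most frequent level. The sketch below is an assumption-laden illustration: the helper name `near_constant_columns`, the 95% cut-off, and the toy data are all hypothetical, and the list above was not necessarily produced this way.

```python
import pandas as pd

def near_constant_columns(frame, threshold=0.95):
    """Return columns whose most frequent level covers at least
    `threshold` of the non-null rows."""
    flagged = []
    for col in frame.columns:
        shares = frame[col].value_counts(normalize=True)
        # Skip columns that are entirely null
        if not shares.empty and shares.iloc[0] >= threshold:
            flagged.append(col)
    return flagged

# Toy data, for illustration only
toy = pd.DataFrame({
    "Do Not Call": ["No"] * 99 + ["Yes"],
    "Lead Source": ["Google"] * 50 + ["Olark Chat"] * 50,
})
flagged = near_constant_columns(toy)
print(flagged)
```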


In [None]:
df.drop(['Do Not Email', 'Do Not Call','Country','What matters most to you in choosing a course',
         'Search','Magazine','Newspaper Article','X Education Forums','Newspaper','Digital Advertisement',
         'Through Recommendations','Receive More Updates About Our Courses',
         'Update me on Supply Chain Content','Get updates on DM Content','City',
         'I agree to pay the amount through cheque'],axis=1,inplace=True)

# C. Data Preparation

In [None]:
# Current features with null values to impute
plt.figure(figsize=(10,8))
sns.heatmap(df.isnull(),yticklabels=False,cbar=False);

### Question:
**We observe that 45.65% of the Asymmetrique Activity and Profile Score data is missing. Is there a rational way to impute it without breaking business logic?**

**Insights**: 
* *Assumption*: The activity and profile scores look like the output of another algorithm that takes in the users' web interactions, such as time spent on the website, number of visits, page views, personal information provided, and so on.
* This scoring system may be an important input from another team, so the scores can carry some weight and business assessment.
* Because the scores have multimodal distributions, it is not appropriate to impute the missing data with a single univariate statistic (such as the mean or median).
* I'm going to use the Multiple Imputation method, which estimates each feature from all the others.
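
The idea can be sketched on synthetic data where a score depends on time spent. Everything below is made up for illustration (the variable names, the linear relationship, the 40% missing rate). Note also that scikit-learn's `IterativeImputer` is inspired by multiple-imputation methods but, with default settings, returns a single imputed data set.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Synthetic data: the score is roughly a linear function of time spent
time_spent = rng.uniform(0, 100, size=200)
score = 10 + 0.1 * time_spent + rng.normal(0, 0.2, size=200)
toy = pd.DataFrame({"time_spent": time_spent, "score": score})

# Hide about 40% of the scores
mask = rng.random(200) < 0.4
toy_missing = toy.copy()
toy_missing.loc[mask, "score"] = np.nan

# Estimate the missing scores from the other feature
imputer = IterativeImputer(random_state=0)
imputed = pd.DataFrame(imputer.fit_transform(toy_missing), columns=toy.columns)

# Compare against a univariate (mean) fill on the hidden entries
mean_fill = toy_missing["score"].fillna(toy_missing["score"].mean())
err_iterative = (imputed.loc[mask, "score"] - toy.loc[mask, "score"]).abs().mean()
err_mean = (mean_fill[mask] - toy.loc[mask, "score"]).abs().mean()
print(err_iterative, err_mean)
```

When the features are related, as assumed here, the iterative estimate recovers the hidden values more closely than the column mean.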

## C.1. Numerical Features

In [None]:
new_df = df.copy().drop('Converted',axis=1)

In [None]:
num_vars = new_df.select_dtypes(include=['float64','int64'])

### C.1.1. Outliers treatment

In [None]:
# Cap the outliers at the 1st and 99th percentiles

def outliers_treatment(col_lst):
    for col in col_lst:
        lower, upper = new_df[col].quantile([0.01, 0.99]).values
        # Raise values below the lower limit and lower values above the upper limit
        new_df[col] = new_df[col].clip(lower=lower, upper=upper)

outliers_treatment(num_vars.columns.tolist())

In [None]:
# Numerical features df received outliers treatment
num_vars_ot = new_df.select_dtypes(include=['float64','int64'])

### C.1.2. Multiple Imputation

The missing values in each numerical feature will be imputed in relation to the other features using the Multiple Imputation method.

In [None]:
# These columns are treated separately
# This df should concatenate into the final data set for training the model

impute_it = IterativeImputer(verbose=2, tol=1e-10)
num_vars_it = pd.DataFrame(impute_it.fit_transform(num_vars_ot),columns=num_vars.columns)

## C.2. Categorical Features

In [None]:
# Missing data percentage
new_cat = new_df.select_dtypes(include='object')
round(new_cat.isnull().mean()*100,2)

Using the information from the Insights on Categorical Features in the EDA section above, the following actions are executed:

In [None]:
# Grouping values with <5% count
def replace_value(col_lst):
    for col in col_lst:
        # Get the source list with less than 5% count
        other_val = new_df[col].value_counts(normalize=True).loc[lambda x:x<0.05].index.tolist()
        # Replace with "Others" level
        new_df[col] = new_df[col].replace(other_val,'Others')
        
replace_value(new_cat.columns.tolist())

In [None]:
# Categorical features with respect to conversion rate
plt.figure(figsize=(12,15))

i=1
for col in new_cat.columns:
    plt.subplot(3,3,i)
    sns.countplot(x=new_df[col],hue=df['Converted'])
    i +=1
    plt.xticks(rotation=90)
    plt.tight_layout();

### Insights:
* Working professionals are more likely to convert than the Unemployed or Others groups.
* The Specialization feature mainly provides details about the working professional group. Thus, we can use the higher-level "Occupation" feature and drop "Specialization".
* For missing data, it seems reasonable to impute the mode of the column.
* The exception is the Lead Quality feature, which has more than 50% of its data missing. From a business standpoint, a "Not Sure" value can replace these missing values.
* The Tags feature carries information similar to occupation and last activity, so I will drop it.

### C.2.1. Imputation

In [None]:
# Drop Specialization and Tags
new_df = new_df.drop(['Specialization','Tags'],axis=1)

In [None]:
# Fill in "Not Sure" for missing value in Lead Quality:
new_df['Lead Quality'] = new_df['Lead Quality'].fillna(value='Not Sure')

In [None]:
# Fill in missing value with mode
fill_mode = lambda col:col.fillna(col.mode()[0])

new_df = new_df.apply(fill_mode,axis=0)

### C.2.2. Encoding

In [None]:
# Categorical Columns
cat_col = new_df.select_dtypes(include='object').columns.tolist()

new_df_encd = pd.get_dummies(new_df,prefix_sep="_",columns=cat_col,drop_first=True)

## C.3. Assemble the Final Clean Data Set

In [None]:
old_num_col = num_vars.columns.tolist()

# Drop old numerical columns
new_df_encd.drop(old_num_col,axis=1,inplace=True)

# Concatenate the treated numerical features (num_vars_it) to the encoded categorical features
X = pd.concat([num_vars_it,new_df_encd],axis=1)

# The complete df 
leads_df = pd.concat([df['Converted'],X],axis=1)

leads_df.head()

In [None]:
# Feature correlation
data = [go.Heatmap(
        z= leads_df.corr().values,
        x= leads_df.columns.values,
        y= leads_df.columns.values,
        colorscale='RdBu_r',
        opacity = 1.0 )]

layout = go.Layout(
    title='Pearson Correlation of Input Features',
    xaxis = dict(ticks='', nticks=36),
    yaxis = dict(ticks=''),
    width = 900, height = 900)

fig = go.Figure(data=data, layout=layout)
fig.show()

In [None]:
# Top 10 features that correlate with Converted
leads_df.corr()['Converted'].sort_values(ascending=False).head(10)

# D. Modeling 

In [None]:
X = leads_df.drop('Converted',axis=1)
y = leads_df['Converted']

## D.1. Train - Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
# Instantiate the Logistic Regression Model
logmodel = LogisticRegression(solver='liblinear')
logmodel.fit(X_train, y_train)

In [None]:
# Create a pipeline 
pipe = make_pipeline(StandardScaler(),logmodel)
pipe.fit(X_train,y_train)

## D.2. Train the models

Apply 10-fold cross-validation on the train set


In [None]:
# The score order is: accuracy, precision, recall
train_score = [] 
scoring = ['accuracy','precision','recall'] # specify the type of scoring 
scores = cross_validate(pipe, X_train, y_train, cv=10, scoring=scoring)

train_score.append(scores['test_accuracy'].mean())
train_score.append(scores['test_precision'].mean())
train_score.append(scores['test_recall'].mean())

# The model's score on training set
train_scores = pd.DataFrame(train_score,columns=['Train Set'],
                            index=['Accuracy','Precision','Recall'])
train_scores

In [None]:
# Predicted label (0,1) on the train set
y_train_pred = cross_val_predict(logmodel,X_train,y_train,cv=10)

# Probability of y_train 
y_train_pred_prob = cross_val_predict(logmodel,X_train,y_train,cv=10,method='predict_proba')

# Outputs of the Train set:
train_output = pd.DataFrame({'True Converted':y_train.values,'Predict Converted':y_train_pred,
                             'Predict Probability':y_train_pred_prob[:,1]})
train_output.head()

## D.3. Tune the Optimal Threshold that Classifies Labels 0 vs 1

We observe imbalanced labels in Converted in the dataset. Therefore, a precision-recall analysis is more appropriate than ROC.

In [None]:
# Finding precision, recall, and thresholds arrays
p, r, thresholds = precision_recall_curve(train_output['True Converted'], train_output['Predict Probability'])
pr_auc = metrics.auc(r,p)

# Precision, recall vs Threshold chart
plt.title("Precision-Recall vs Threshold Chart")
plt.plot(thresholds, p[: -1], "b--", label="Precision")
plt.plot(thresholds, r[: -1], "r--", label="Recall")
plt.ylabel("Precision, Recall")
plt.xlabel("Threshold")
plt.legend(loc="lower left")
plt.ylim([0,1]);

The optimal threshold is the point that gives the best balance of precision and recall, which is the same as maximizing the F-measure.
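
For reference, the F-measure with equal weight on both terms is the harmonic mean of precision $P$ and recall $R$ at a threshold $t$. The symbols $P$, $R$, and $t$ are notation introduced here, not taken from the notebook:

```latex
F_1(t) = \frac{2\,P(t)\,R(t)}{P(t) + R(t)},
\qquad
t^{*} = \arg\max_{t} F_1(t)
```

Because the harmonic mean is pulled toward the smaller of the two values, $F_1$ is high only when precision and recall are both high.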

In [None]:
# F-score at each threshold (set to 0 where precision + recall is 0)
fscore = np.divide(2 * p * r, p + r, out=np.zeros_like(p), where=(p + r) > 0)
# Locate the index of the largest F-score
ix = np.argmax(fscore)
print('Best Threshold=%.2f, F-Score=%.3f' % (thresholds[ix], fscore[ix]))
print('Precision score = %.2f, Recall score = %.2f' % (p[ix], r[ix]))

# Precision vs Recall chart
plt.plot(r, p, marker='.', label='Logistic',markersize=0.5)
plt.scatter(r[ix], p[ix], marker='o', color='black', label='Best')

plt.title('Precision-Recall Trade-off Chart')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.legend();

The **optimal threshold** to determine the binary class **is 0.33**

The two charts and the model's scores on the train set agree: the optimal threshold of 0.33 is located where Precision is 0.85 and Recall is 0.75.

## D.4. Test the Model

In [None]:
# Apply the model on X_test
y_test_pred = pipe.predict(X_test)

# Probability of predicted y_test
y_test_pred_prob = pipe.predict_proba(X_test)[:,1]

# Create a df for Test Set output
test_output = pd.DataFrame({'True Converted':y_test.values,'Predict Probability':y_test_pred_prob})

# Predicted Converted Label in which optimal threshold applies
test_output['Predict Converted'] = test_output['Predict Probability'].apply(lambda x: 1 if x > 0.33 else 0)

# Lead Score
test_output['Lead Score'] = round(test_output['Predict Probability']*100)

test_output.head()
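
The difference between the default prediction and a tuned cut-off can be sketched on synthetic data. All names below (`X_demo`, `demo_pipe`, and so on) and the data itself are hypothetical stand-ins. The sketch relies on the fact that `predict` on a binary logistic regression applies an implicit 0.5 cut-off, so labels for any other threshold have to be derived from `predict_proba`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the leads data, for illustration only
X_demo, y_demo = make_classification(n_samples=1000, n_features=8,
                                     weights=[0.6, 0.4], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=42)

demo_pipe = make_pipeline(StandardScaler(), LogisticRegression(solver="liblinear"))
demo_pipe.fit(X_tr, y_tr)

threshold = 0.33
proba = demo_pipe.predict_proba(X_te)[:, 1]

pred_default = demo_pipe.predict(X_te)        # implicit 0.5 cut-off
pred_tuned = (proba > threshold).astype(int)  # tuned cut-off
lead_score = np.round(proba * 100).astype(int)

recall_default = recall_score(y_te, pred_default)
recall_tuned = recall_score(y_te, pred_tuned)
print(recall_default, recall_tuned)
```

Lowering the cut-off can only add predicted positives, so recall at the tuned threshold is never below recall at 0.5. Precision may move in either direction.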

# E. Model Evaluation

In [None]:
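# The model's score on test set
test_score = []

# Note: y_test_pred comes from pipe.predict, which applies the default 0.5 cut-off;
# test_output['Predict Converted'] holds the labels at the tuned 0.33 threshold
conf_matrix = confusion_matrix(y_test,y_test_pred)
print(classification_report(y_test,y_test_pred))
print()
print(conf_matrix)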

tn = conf_matrix[0,0]
fp = conf_matrix[0,1]
tp = conf_matrix[1,1]
fn = conf_matrix[1,0]

total = tn + fp + tp + fn
accuracy  = (tp + tn) / total # Accuracy Rate
precision = tp / (tp + fp) # Positive Predictive Value
recall    = tp / (tp + fn) # True Positive Rate
error = (fp + fn) / total # Misclassification Rate

test_score.append(accuracy)
test_score.append(precision)
test_score.append(recall)

test_scores = pd.DataFrame(test_score,columns=['Test Set'],
                            index=['Accuracy','Precision','Recall'])
test_scores
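
As a sanity check on the formulas above, here is a small worked example on made-up labels. The arrays are hypothetical. For binary labels, `confusion_matrix(...).ravel()` returns the counts in the order tn, fp, fn, tp.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Made-up labels, for illustration only
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 1])

# Unpack the 2x2 matrix: rows are true labels, columns are predicted labels
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# The manual formulas agree with scikit-learn's metric functions
print(accuracy, accuracy_score(y_true, y_pred))
print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
```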

In [None]:
# Compare the train and test scores
pd.concat([train_scores,test_scores],axis=1)

### Question:
**What other features can data engineers implement to improve the model?**

The evaluation scores indicate that the model is not overfitting and that there is minimal chance of data leakage. We can improve the dataset by collecting more features:
* Improve the quality of the survey or form questions to receive more user inputs (reduce NaN values)
* Improve the algorithm for the Activity and Profile Scores so that it produces complete results
* Record timestamps of website visits for seasonality analysis

# F. Conclusion

The logistic regression model built with a threshold of 0.33 is able to produce 86% accuracy and 85% precision in identifying the users who are likely to convert to customers.

The main features that the company should focus on to increase the conversion rate are: website content (to increase Time Spent on Website), Working Professionals, sending SMS, and customers who fill in the Add Form.

The results and findings from this notebook are summarized [here](https://medium.com/@nguyenpham111/tips-to-improve-conversion-rate-for-online-educational-providers-fd84c9a43226)