### Lead Scoring Dataset for Classification Case Study
https://www.kaggle.com/amritachatterjee09/lead-scoring-dataset

### Context
An education company named X Education sells online courses to industry professionals. On any given day, many professionals who are interested in the courses land on their website and browse for courses.

The company markets its courses on several websites and search engines like Google. Once these people land on the website, they might browse the courses, fill out a form for the course, or watch some videos. When these people fill out a form providing their email address or phone number, they are classified as a lead. Moreover, the company also gets leads through past referrals. Once these leads are acquired, employees from the sales team start making calls, writing emails, etc. Through this process, some of the leads get converted while most do not. The typical lead conversion rate at X Education is around 30%.

Now, although X Education gets a lot of leads, its lead conversion rate is very poor. For example, if, say, they acquire 100 leads in a day, only about 30 of them are converted. To make this process more efficient, the company wishes to identify the most promising leads, also known as ‘Hot Leads’. If they successfully identify this set of leads, the lead conversion rate should go up, as the sales team will now focus on communicating with the promising leads rather than making calls to everyone.

There are a lot of leads generated in the initial stage (top) but only a few of them come out as paying customers from the bottom. In the middle stage, you need to nurture the potential leads well (i.e. educating the leads about the product, constantly communicating, etc.) in order to get a higher lead conversion.

X Education wants to select the most promising leads, i.e. the leads that are most likely to convert into paying customers. The company requires you to build a model that assigns a lead score to each lead, such that customers with a higher lead score have a higher conversion chance and customers with a lower lead score have a lower conversion chance. The CEO, in particular, has given a ballpark of the target lead conversion rate to be around 80%.

### Content

#### Variables Description

* Prospect ID - A unique ID with which the customer is identified.
* Lead Number - A lead number assigned to each lead procured.
* Lead Origin - The origin identifier with which the customer was identified to be a lead. Includes API, Landing Page Submission, etc.
* Lead Source - The source of the lead. Includes Google, Organic Search, Olark Chat, etc.
* Do Not Email - An indicator variable selected by the customer wherein they select whether or not they want to be emailed about the course.
* Do Not Call - An indicator variable selected by the customer wherein they select whether or not they want to be called about the course.
* Converted - The target variable. Indicates whether a lead has been successfully converted or not.
* TotalVisits - The total number of visits made by the customer on the website.
* Total Time Spent on Website - The total time spent by the customer on the website.
* Page Views Per Visit - Average number of pages on the website viewed during the visits.
* Last Activity - Last activity performed by the customer. Includes Email Opened, Olark Chat Conversation, etc.
* Country - The country of the customer.
* Specialization - The industry domain in which the customer worked before. Includes the level 'Select Specialization' which means the customer had not selected this option while filling the form.
* How did you hear about X Education - The source from which the customer heard about X Education.
* What is your current occupation - Indicates whether the customer is a student, unemployed or employed.
* What matters most to you in choosing a course - An option selected by the customer indicating their main motive for doing this course.
* Search - Indicating whether the customer had seen the ad in any of the listed items.
* Magazine
* Newspaper Article
* X Education Forums
* Newspaper
* Digital Advertisement
* Through Recommendations - Indicates whether the customer came in through recommendations.
* Receive More Updates About Our Courses - Indicates whether the customer chose to receive more updates about the courses.
* Tags - Tags assigned to customers indicating the current status of the lead.
* Lead Quality - Indicates the quality of the lead based on the data and the intuition of the employee who has been assigned to the lead.
* Update me on Supply Chain Content - Indicates whether the customer wants updates on the Supply Chain Content.
* Get updates on DM Content - Indicates whether the customer wants updates on the DM Content.
* Lead Profile - A lead level assigned to each customer based on their profile.
* City - The city of the customer.
* Asymmetrique Activity Index - An index and score assigned to each customer based on their activity and their profile.
* Asymmetrique Profile Index
* Asymmetrique Activity Score
* Asymmetrique Profile Score
* I agree to pay the amount through cheque - Indicates whether the customer has agreed to pay the amount through cheque or not.
* A free copy of Mastering The Interview - Indicates whether the customer wants a free copy of 'Mastering the Interview' or not.
* Last Notable Activity - The last notable activity performed by the student.

In [None]:
%matplotlib inline
RANDOM_STATE = 0

import warnings
warnings.filterwarnings("ignore")

import pandas as pd
pd.set_option('display.float_format', '{:.3f}'.format)

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split, ParameterGrid, StratifiedKFold
from sklearn import metrics
from tqdm import tqdm

import eli5 
from eli5.sklearn import PermutationImportance

import itertools

import catboost as cb
from catboost import CatBoostClassifier
from catboost import Pool

In [None]:
def summary(df):
    summary = pd.DataFrame(df.dtypes, columns=['dtypes'])
    summary = summary.reset_index()
    summary['Missing'] = df.isnull().sum().values    
    summary['Uniques'] = df.nunique().values
    return summary

In [None]:
def plot_cf_matrix_and_roc(model, 
                           X_train, 
                           y_train,
                           X_test, 
                           y_test,
                           y_pred, 
                           classes=[0,1],
                           normalize=False,
                           cmap=plt.cm.Blues):
    metrics_list = []
    
    # the main plot
    plt.figure(figsize=(15,5))

    # the confusion matrix
    plt.subplot(1,2,1)
    cm = metrics.confusion_matrix(y_test, y_pred)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        plt.title("Normalized confusion matrix")
    else:
        plt.title('Confusion matrix')

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        if normalize:
            plt.text(j, i, "{:0.2f}".format(cm[i, j]),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
        else:
            plt.text(j, i, format(cm[i, j]),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    
    # the result metrics
    summary_df = pd.DataFrame([[str(np.unique( y_pred )),
                               str(round(metrics.precision_score(y_test, y_pred.round()),3)),
                               str(round(metrics.accuracy_score(y_test, y_pred.round()),3)),
                               str(round(metrics.recall_score(y_test, y_pred.round(), average='binary'),3)),
                               str(round(metrics.roc_auc_score(y_test, y_pred.round()),3)),
                                str(round(metrics.cohen_kappa_score(y_test, y_pred.round()),3)),
                               str(round(metrics.f1_score(y_test, y_pred.round(), average='binary'),3))]], 
                              columns=['Class', 'Precision', 'Accuracy', 'Recall', 'ROC-AUC', 'Kappa', 'F1-score'])
    # print the metrics
    print("\n");
    print(summary_df);
    print("\n");
    
    plt.show()
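
The summary table in the function above computes ROC-AUC from rounded predictions, that is, from hard class labels. The following is a minimal sketch with made-up numbers (the arrays `y_true` and `y_proba` are illustrative and not taken from the dataset) showing that the AUC computed from probabilities can differ from the value computed on thresholded labels. With a fitted classifier, the probabilities would typically come from a call such as `predict_proba`.

```python
import numpy as np
from sklearn import metrics

# Illustrative values only; not taken from the lead scoring data.
y_true = np.array([0, 0, 1, 1, 0, 1])
y_proba = np.array([0.10, 0.40, 0.35, 0.80, 0.55, 0.90])

# AUC on hard labels (what rounding does): only one threshold is evaluated.
auc_from_labels = metrics.roc_auc_score(y_true, y_proba.round())

# AUC on probabilities: uses the full ranking of the scores.
auc_from_proba = metrics.roc_auc_score(y_true, y_proba)

print(round(auc_from_labels, 3), round(auc_from_proba, 3))
```

For a lead score, the ranking of the probabilities is what matters, so the probability-based value is usually the more informative of the two.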

In [None]:
def cross_val(X, y, param, cat_features='', class_weights = '', n_splits=3):
    results = []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    
    for tr_ind, val_ind in skf.split(X, y):
        X_train_i = X.iloc[tr_ind]
        y_train_i = y.iloc[tr_ind]
        
        X_valid_i = X.iloc[val_ind]
        y_valid_i = y.iloc[val_ind]
        
        if class_weights == '' :
            clf = CatBoostClassifier(iterations=param['iterations'],
                            loss_function = param['loss_function'],
                            depth=param['depth'],
                            l2_leaf_reg = param['l2_leaf_reg'],
                            eval_metric = param['eval_metric'],
                            leaf_estimation_iterations = 10,
                            use_best_model=True,
                            logging_level='Silent',
                            od_type="Iter",
                            early_stopping_rounds=param['early_stopping_rounds']
            )
        else:
            clf = CatBoostClassifier(iterations=param['iterations'],
                            loss_function = param['loss_function'],
                            depth=param['depth'],
                            l2_leaf_reg = param['l2_leaf_reg'],
                            class_weights = class_weights,
                            eval_metric = param['eval_metric'],
                            leaf_estimation_iterations = 10,
                            use_best_model=True,
                            logging_level='Silent',
                            od_type="Iter",
                            early_stopping_rounds=param['early_stopping_rounds']
            )
        
        
        if cat_features == '' :
            clf.fit(X_train_i, 
                    y_train_i,
                    eval_set=(X_valid_i, y_valid_i)
            )
        else:
            clf.fit(X_train_i, 
                    y_train_i,
                    cat_features=cat_features,
                    eval_set=(X_valid_i, y_valid_i)
            )
        
        # predict
        y_pred = clf.predict(X_valid_i)
        
        # select the right metric
        if(param['eval_metric'] == 'Recall'):
            metric = metrics.recall_score(y_valid_i, y_pred)
        elif(param['eval_metric'] == 'Accuracy'):
            metric = metrics.accuracy_score(y_valid_i, y_pred)
        elif(param['eval_metric'] == 'F1'):
            metric = metrics.f1_score(y_valid_i, y_pred)
        elif(param['eval_metric'] == 'AUC'):
            metric = metrics.roc_auc_score(y_valid_i, y_pred)
        elif(param['eval_metric'] == 'Kappa'):
            metric = metrics.cohen_kappa_score(y_valid_i, y_pred)
        else:
            metric = metrics.accuracy_score(y_valid_i, y_pred)
        
        # append the metric
        results.append(metric)
        
        print('Classes: '+str(np.unique( y_pred )))
        print('Precision: '+str(round(metrics.precision_score(y_valid_i, y_pred.round()),3)))
        print('Accuracy: '+str(round(metrics.accuracy_score(y_valid_i, y_pred.round()),3)))
        print('Recall: '+str(round(metrics.recall_score(y_valid_i, y_pred.round(), average='binary'),3)))
        print('Roc_Auc: '+str(round(metrics.roc_auc_score(y_valid_i, y_pred.round()),3)))
        print('F1 score: '+str(round(metrics.f1_score(y_valid_i, y_pred.round(), average='binary'),3)))
        print('Mean for '+param['eval_metric']+' OOF prediction: ',np.mean(results))
        print('Standard deviation for '+param['eval_metric']+' OOF prediction: ',np.std(results))
        print("\n")
    return sum(results)/n_splits

In [None]:
def catboost_GridSearchCV(X, y, params, cat_features='', class_weights='', n_splits=5):
    ps = {'score':0,'param': []}
    for prms in tqdm(list(ParameterGrid(params)), ascii=True, desc='Params Tuning:'):
        score = cross_val(X, y, prms, cat_features, class_weights, n_splits)
        if score > ps['score']:
            ps['score'] = score
            ps['param'] = prms
    print('Score: '+str(ps['score']))
    print('Params: '+str(ps['param']))
    return ps['param']
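
`cross_val` reads six keys from each parameter dict: `iterations`, `loss_function`, `depth`, `l2_leaf_reg`, `eval_metric` and `early_stopping_rounds`. The grid passed to `catboost_GridSearchCV` therefore needs all of them. A minimal sketch of such a grid follows; the values are placeholders chosen for illustration, not the settings used in this notebook.

```python
from sklearn.model_selection import ParameterGrid

# Placeholder search space; the values are assumptions for illustration.
params = {
    'iterations': [500],
    'loss_function': ['Logloss'],
    'depth': [4, 6],
    'l2_leaf_reg': [1, 3, 5],
    'eval_metric': ['F1'],
    'early_stopping_rounds': [50],
}

grid = list(ParameterGrid(params))
print(len(grid))   # number of candidate settings to cross-validate
print(grid[0])     # one candidate, shaped the way cross_val expects
```

Each candidate is fitted once per fold, so the total number of model fits is `len(grid) * n_splits`.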

In [None]:
def check_target(df, target):
    sns.countplot(x=df[target])
    count_no = len(df[df[target]==0])
    count_yes = len(df[df[target]==1])
    pct_of_no_sub = count_no/(count_no+count_yes)*100
    pct_of_sub = count_yes/(count_no + count_yes)*100
    print('{} {} % YES '.format(count_yes, pct_of_sub))
    print('{} {} % NO '.format(count_no, pct_of_no_sub))

In [None]:
def num_vs_ctr(df, var1, var2):
    ctr = df[[var1, var2]].groupby(var1, as_index=False).mean().sort_values(var2, ascending=False)
    count = df[[var1, var2]].groupby(var1, as_index=False).count().sort_values(var2, ascending=False)
    merge = count.merge(ctr, on=var1, how='left')
    merge.columns=[var1, 'count', 'ctr%']
    return merge

def crosstab(df, features, target, label_cutoff = 'none'):
    # NOTE: when label_cutoff is set, df[feature] is modified in place.
    for feature in features:
        if(label_cutoff != 'none' and label_cutoff > 0):
            # how many uniques
            unique_elements = df[feature].nunique()
            
            # if we have more uniques than the cutoff
            if(unique_elements > label_cutoff):
                # select the label_cutoff most common values
                most_common_values = df.groupby(feature)[target].count().sort_values(ascending=False).nlargest(label_cutoff)
                # replace all remaining values with "Other"
                df[feature] = np.where(df[feature].isin(most_common_values.index), df[feature], 'Other')
        
        # plot the crosstab
        pd.crosstab(df[feature],df[target]).plot(kind='bar', figsize=(20,5), stacked=True)
        plt.title(feature+' / '+target)
        plt.xlabel(feature)
        plt.ylabel(feature+' / '+target)
            
        # return the table shown above each chart
        # (this returns inside the loop, so only the first feature is processed)
        return num_vs_ctr(df, feature, target)   
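
`num_vs_ctr` builds its count and mean columns with two separate group-bys and a merge. As a sketch, the same table can be produced with a single `groupby(...).agg(...)`. The function name `count_and_rate` and the toy frame are hypothetical and only serve the illustration.

```python
import pandas as pd

def count_and_rate(df, feature, target):
    """Per-label row count and mean of a 0/1 target, largest groups first."""
    out = (df.groupby(feature)[target]
             .agg(['count', 'mean'])
             .reset_index()
             .sort_values('count', ascending=False))
    # Same column names as num_vs_ctr; 'ctr%' holds a fraction between 0 and 1.
    out.columns = [feature, 'count', 'ctr%']
    return out

# Toy data for illustration only.
toy = pd.DataFrame({'Lead Origin': ['API', 'API', 'API', 'Landing', 'Landing'],
                    'Converted':   [1, 0, 0, 1, 1]})
table = count_and_rate(toy, 'Lead Origin', 'Converted')
print(table)
```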
        

In [None]:
## Import the data file
data = pd.read_csv('../input/lead-scoring-dataset/Lead Scoring.csv')

In [None]:
data.head()

## The features

In [None]:
summary(data)

In [None]:
# drop the identifier columns
data = data.drop(columns=['Prospect ID','Lead Number'])

## Lead Origin
The origin identifier with which the customer was identified to be a lead. Includes API, Landing Page Submission, etc.

Note: Collected from the Form

In [None]:
data['Lead Origin'].describe()

Check for null values

In [None]:
data['Lead Origin'].isnull().sum()

In [None]:
crosstab(data, ['Lead Origin'], 'Converted')

Rare labels are merged into a new label "Other". Labels with very few rows may end up in only one of the train, validation, or test splits, so merging them avoids a data shift between the splits.

In [None]:
data.loc[data['Lead Origin'] == 'Lead Import', 'Lead Origin'] = 'Other'
data.loc[data['Lead Origin'] == 'Quick Add Form', 'Lead Origin'] = 'Other'
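
The manual relabeling above can also be driven by a frequency threshold instead of a hand-written list. The sketch below assumes a hypothetical helper `collapse_rare_labels` and an arbitrary `min_count`; neither is part of the original notebook, and the toy series only mimics the column.

```python
import pandas as pd

def collapse_rare_labels(series, min_count, other='Other'):
    """Replace labels seen fewer than min_count times with a single label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep frequent labels (and missing values) as they are.
    return series.where(~series.isin(rare), other)

# Toy data for illustration only; min_count is an assumed threshold.
origin = pd.Series(['API'] * 5 + ['Landing Page Submission'] * 4
                   + ['Lead Import', 'Quick Add Form'])
collapsed = collapse_rare_labels(origin, min_count=3)
print(collapsed.value_counts())
```

To stay free of leakage, the threshold would ideally be applied to counts from the training split only and the resulting label set reused for validation and test.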

In [None]:
crosstab(data, ['Lead Origin'], 'Converted')

## Lead Source
The source of the lead. Includes Google, Organic Search, Olark Chat, etc.

Note: Collected from the Tracking system

In [None]:
data['Lead Source'].describe()

Check for null values

In [None]:
data['Lead Source'].isnull().sum()

In [None]:
# Replace null values with the new label "Other"
data['Lead Source'] = data['Lead Source'].fillna('Other')

In [None]:
data['Lead Source'].isnull().sum()

In [None]:
crosstab(data, ['Lead Source'], 'Converted')

In [None]:
# fix google
data.loc[data['Lead Source'] == 'google', 'Lead Source'] = 'Google'

Rare labels are merged into the label "Other". Labels with very few rows may end up in only one of the train, validation, or test splits, so merging them avoids a data shift between the splits.

In [None]:
data.loc[data['Lead Source'] == 'bing', 'Lead Source'] = 'Other'
data.loc[data['Lead Source'] == 'Click2call', 'Lead Source'] = 'Other'
data.loc[data['Lead Source'] == 'Press_Release', 'Lead Source'] = 'Other'
data.loc[data['Lead Source'] == 'Social Media', 'Lead Source'] = 'Other'
data.loc[data['Lead Source'] == 'Live Chat', 'Lead Source'] = 'Other'
data.loc[data['Lead Source'] == 'WeLearn', 'Lead Source'] = 'Other'
data.loc[data['Lead Source'] == 'Pay per Click Ads', 'Lead Source'] = 'Other'
data.loc[data['Lead Source'] == 'NC_EDM', 'Lead Source'] = 'Other'
data.loc[data['Lead Source'] == 'blog', 'Lead Source'] = 'Other'
data.loc[data['Lead Source'] == 'testone', 'Lead Source'] = 'Other'
data.loc[data['Lead Source'] == 'welearnblog_Home', 'Lead Source'] = 'Other'
data.loc[data['Lead Source'] == 'youtubechannel', 'Lead Source'] = 'Other'

In [None]:
crosstab(data, ['Lead Source'], 'Converted')

## Do Not Email
An indicator variable selected by the customer wherein they select whether or not they want to be emailed about the course.

Note: Collected from the Form

In [None]:
data['Do Not Email'].describe()

In [None]:
data['Do Not Email'].isnull().sum()

In [None]:
crosstab(data, ['Do Not Email'], 'Converted')

In [None]:
# convert to binary
data.loc[data['Do Not Email'] == 'Yes', 'Do Not Email'] = 1
data.loc[data['Do Not Email'] == 'No', 'Do Not Email'] = 0

# convert to int
data['Do Not Email'] = data['Do Not Email'].astype('int64')
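
The Yes/No conversion above is repeated for several columns in this notebook. A small helper can do the same in one step and raise an error when a column contains an unexpected label. `yes_no_to_int` is a hypothetical name, and the sketch assumes the column holds only the strings 'Yes' and 'No'.

```python
import pandas as pd

def yes_no_to_int(series):
    """Map 'Yes'/'No' strings to 1/0 and fail loudly on anything else."""
    mapped = series.map({'Yes': 1, 'No': 0})
    if mapped.isnull().any():
        raise ValueError('unexpected label in ' + str(series.name))
    return mapped.astype('int64')

# Toy data for illustration only.
toy = pd.Series(['No', 'Yes', 'No'], name='Do Not Email')
converted = yes_no_to_int(toy)
print(converted.tolist())
```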

In [None]:
crosstab(data, ['Do Not Email'], 'Converted')

## Do Not Call
An indicator variable selected by the customer wherein they select whether or not they want to be called about the course.

Note: Collected from the Form

In [None]:
data['Do Not Call'].describe()

In [None]:
data['Do Not Call'].isnull().sum()

In [None]:
crosstab(data, ['Do Not Call'], 'Converted')

In [None]:
data.loc[data['Do Not Call'] == 'Yes', 'Do Not Call'] = 1
data.loc[data['Do Not Call'] == 'No', 'Do Not Call'] = 0

# convert to int
data['Do Not Call'] = data['Do Not Call'].astype('int64')

In [None]:
crosstab(data, ['Do Not Call'], 'Converted')

Note: one of the two labels has too few results. The feature is nearly constant, with over 99% of the rows sharing a single label, so it is dropped.

In [None]:
# drop the feature
data = data.drop(columns=['Do Not Call'])
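
The "over 99%" rule used for dropping this feature can be checked for all columns at once. The sketch below computes the share of the most frequent value per column; the helper name `dominant_share`, the 0.99 threshold, and the toy frame are assumptions made for illustration.

```python
import pandas as pd

def dominant_share(df):
    """Share of rows taken by the most frequent value in each column."""
    return df.apply(
        lambda col: col.value_counts(normalize=True, dropna=False).iloc[0])

# Toy data for illustration only.
toy = pd.DataFrame({
    'Do Not Call': [0] * 199 + [1],
    'Do Not Email': [0] * 150 + [1] * 50,
})
share = dominant_share(toy)

# Columns where a single value covers more than 99% of the rows.
near_constant = share[share > 0.99].index.tolist()
print(near_constant)
```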

## TotalVisits
The total number of visits made by the customer on the website.

Note: Collected from the Tracking system

In [None]:
data['TotalVisits'].describe()

In [None]:
data['TotalVisits'].isnull().sum()

In [None]:
data['TotalVisits'] = data['TotalVisits'].fillna(data['TotalVisits'].mean())

In [None]:
data['TotalVisits'].isnull().sum()

In [None]:
crosstab(data, ['TotalVisits'], 'Converted')

### Check for outliers

In [None]:
sns.boxplot(x = 'Converted', y = 'TotalVisits', data = data)

There are outliers among both converted and non-converted leads. To produce a stable model we need to handle them.

#### Capping the data at the 95th percentile value

In [None]:
# Get the 95th percentile
p95 = data['TotalVisits'].quantile(0.95) 
print("Total number of rows getting capped for TotalVisits column : ",len(data[data['TotalVisits'] >= p95]))

# outlier capping
data.loc[data['TotalVisits'] >= p95, 'TotalVisits'] = p95 
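
The capping step above can also be written with `Series.clip`, which avoids the boolean indexing. This is a minimal sketch on synthetic values; the helper name `cap_at_quantile` is hypothetical.

```python
import pandas as pd

def cap_at_quantile(series, q=0.95):
    """Cap values above the q-th quantile at that quantile."""
    cap = series.quantile(q)
    return series.clip(upper=cap), cap

# Synthetic data for illustration: 0..98 plus one extreme value.
visits = pd.Series(list(range(99)) + [1000], dtype='float64')
capped, cap = cap_at_quantile(visits)

print(cap)                   # the 95th percentile used as the cap
print((visits > cap).sum())  # number of rows that were lowered
print(capped.max())          # no value exceeds the cap afterwards
```

Capping keeps every row and only lowers the extreme values, so the number of rows is unchanged.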

In [None]:
sns.boxplot(x = 'Converted', y = 'TotalVisits', data = data)

Note: The outliers are now capped at the 95th percentile.

## Total Time Spent on Website
The total time spent by the customer on the website.

Note: Collected from the Tracking system

In [None]:
data['Total Time Spent on Website'].describe()

In [None]:
data['Total Time Spent on Website'].isnull().sum()

In [None]:
sns.boxplot(x = 'Converted', y = 'Total Time Spent on Website', data = data)

## Page Views Per Visit
Average number of pages on the website viewed during the visits.

Note: Collected from the Tracking system

In [None]:
data['Page Views Per Visit'].describe()

In [None]:
data['Page Views Per Visit'].isnull().sum()

In [None]:
data['Page Views Per Visit'] = data['Page Views Per Visit'].fillna(data['Page Views Per Visit'].mean())

In [None]:
crosstab(data, ['Page Views Per Visit'], 'Converted')

In [None]:
sns.boxplot(x = 'Converted', y = 'Page Views Per Visit', data = data)

In [None]:
# Get the 95th percentile
p95 = data['Page Views Per Visit'].quantile(0.95) 
print("Total number of rows getting capped for Page Views Per Visit column : ",len(data[data['Page Views Per Visit'] >= p95]))

# outlier capping
data.loc[data['Page Views Per Visit'] >= p95, 'Page Views Per Visit'] = p95 

In [None]:
sns.boxplot(x = 'Converted', y = 'Page Views Per Visit', data = data)

## Last Activity
Last activity performed by the customer. Includes Email Opened, Olark Chat Conversation, etc.

Note: Collected from the Tracking system

In [None]:
data['Last Activity'].describe()

In [None]:
data['Last Activity'].isnull().sum()

In [None]:
data['Last Activity'] = data['Last Activity'].fillna('Other')

In [None]:
crosstab(data, ['Last Activity'], 'Converted')

## Country
The country of the customer.

Note: Collected from the Form

In [None]:
data['Country'].describe()

In [None]:
data['Country'].isnull().sum()

In [None]:
data['Country'] = data['Country'].fillna('Other')

In [None]:
crosstab(data, ['Country'], 'Converted')

## Specialization
The industry domain in which the customer worked before. Includes the level 'Select Specialization' which means the customer had not selected this option while filling the form.

Note: Collected from the Form

In [None]:
data['Specialization'].describe()

In [None]:
data['Specialization'].isnull().sum()

In [None]:
data['Specialization'] = data['Specialization'].fillna('Other')

In [None]:
crosstab(data, ['Specialization'], 'Converted')

## How did you hear about X Education

Note: Collected from the Form

In [None]:
data['How did you hear about X Education'].describe()

In [None]:
data['How did you hear about X Education'].isnull().sum()

In [None]:
data['How did you hear about X Education'] = data['How did you hear about X Education'].fillna('Other')

In [None]:
crosstab(data, ['How did you hear about X Education'], 'Converted')

## What is your current occupation

Note: Collected from the Form

In [None]:
data['What is your current occupation'].describe()

In [None]:
data['What is your current occupation'].isnull().sum()

In [None]:
data['What is your current occupation'] = data['What is your current occupation'].fillna('Other')

In [None]:
crosstab(data, ['What is your current occupation'], 'Converted')

## What matters most to you in choosing a course

Note: Collected from the Form

In [None]:
data['What matters most to you in choosing a course'].describe()

In [None]:
data['What matters most to you in choosing a course'].isnull().sum()

In [None]:
data['What matters most to you in choosing a course'] = data['What matters most to you in choosing a course'].fillna('Other')

In [None]:
crosstab(data, ['What matters most to you in choosing a course'], 'Converted')

## Search

Note: Collected from the Tracking system

In [None]:
data['Search'].describe()

In [None]:
data['Search'].isnull().sum()

In [None]:
crosstab(data, ['Search'], 'Converted')

In [None]:
data.loc[data['Search'] == 'Yes', 'Search'] = 1
data.loc[data['Search'] == 'No', 'Search'] = 0

# convert to int
data['Search'] = data['Search'].astype('int64')

In [None]:
crosstab(data, ['Search'], 'Converted')

Note: one of the two labels has too few results. The feature is nearly constant, with over 99% of the rows sharing a single label, so it is dropped.

In [None]:
# drop the feature
data = data.drop(columns=['Search'])

## Magazine

Note: Collected from the Tracking system

In [None]:
data['Magazine'].describe()

Note: There is only one label. In this case, the feature is useless.

In [None]:
# drop the feature
data = data.drop(columns=['Magazine'])

## Newspaper Article

Note: Collected from the Tracking system

In [None]:
data['Newspaper Article'].describe()

In [None]:
crosstab(data, ['Newspaper Article'], 'Converted')

In [None]:
data.loc[data['Newspaper Article'] == 'Yes', 'Newspaper Article'] = 1
data.loc[data['Newspaper Article'] == 'No', 'Newspaper Article'] = 0

# convert to int
data['Newspaper Article'] = data['Newspaper Article'].astype('int64')

In [None]:
crosstab(data, ['Newspaper Article'], 'Converted')

Note: one of the two labels has too few results. The feature is nearly constant, with over 99% of the rows sharing a single label, so it is dropped.

In [None]:
# drop the feature
data = data.drop(columns=['Newspaper Article'])

## X Education Forums

Note: Collected from the Tracking system

In [None]:
data['X Education Forums'].describe()

In [None]:
data['X Education Forums'].isnull().sum()

In [None]:
crosstab(data, ['X Education Forums'], 'Converted')

In [None]:
data.loc[data['X Education Forums'] == 'Yes', 'X Education Forums'] = 1
data.loc[data['X Education Forums'] == 'No', 'X Education Forums'] = 0

# convert to int
data['X Education Forums'] = data['X Education Forums'].astype('int64')

In [None]:
crosstab(data, ['X Education Forums'], 'Converted')

Note: one of the two labels has too few results. The feature is nearly constant, with over 99% of the rows sharing a single label, so it is dropped.

In [None]:
# drop the feature
data = data.drop(columns=['X Education Forums'])

## Newspaper

Note: Collected from the Tracking system

In [None]:
data['Newspaper'].describe()

In [None]:
data['Newspaper'].isnull().sum()

In [None]:
crosstab(data, ['Newspaper'], 'Converted')

In [None]:
data.loc[data['Newspaper'] == 'Yes', 'Newspaper'] = 1
data.loc[data['Newspaper'] == 'No', 'Newspaper'] = 0

# convert to int
data['Newspaper'] = data['Newspaper'].astype('int64')

In [None]:
crosstab(data, ['Newspaper'], 'Converted')

Note: one of the two labels has too few results. The feature is nearly constant, with over 99% of the rows sharing a single label, so it is dropped.

In [None]:
# drop the feature
data = data.drop(columns=['Newspaper'])

## Digital Advertisement

Note: Collected from the Tracking system

In [None]:
data['Digital Advertisement'].describe()

In [None]:
data['Digital Advertisement'].isnull().sum()

In [None]:
crosstab(data, ['Digital Advertisement'], 'Converted')

In [None]:
data.loc[data['Digital Advertisement'] == 'Yes', 'Digital Advertisement'] = 1
data.loc[data['Digital Advertisement'] == 'No', 'Digital Advertisement'] = 0

# convert to int
data['Digital Advertisement'] = data['Digital Advertisement'].astype('int64')

In [None]:
crosstab(data, ['Digital Advertisement'], 'Converted')

Note: one of the two labels has too few results. The feature is nearly constant, with over 99% of the rows sharing a single label, so it is dropped.

In [None]:
# drop the feature
data = data.drop(columns=['Digital Advertisement'])

## Through Recommendations

Note: Collected from the Tracking system

In [None]:
data['Through Recommendations'].describe()

In [None]:
data['Through Recommendations'].isnull().sum()

In [None]:
crosstab(data, ['Through Recommendations'], 'Converted')

In [None]:
data.loc[data['Through Recommendations'] == 'Yes', 'Through Recommendations'] = 1
data.loc[data['Through Recommendations'] == 'No', 'Through Recommendations'] = 0

# convert to int
data['Through Recommendations'] = data['Through Recommendations'].astype('int64')

In [None]:
crosstab(data, ['Through Recommendations'], 'Converted')

Note: one of the two labels has too few results. The feature is nearly constant, with over 99% of the rows sharing a single label, so it is dropped.

In [None]:
# drop the feature
data = data.drop(columns=['Through Recommendations'])

## Receive More Updates About Our Courses

Note: Collected from the Form

In [None]:
data['Receive More Updates About Our Courses'].describe()

Note: There is only one label. In this case, the feature is useless.

In [None]:
# drop the feature
data = data.drop(columns=['Receive More Updates About Our Courses'])

## Tags
Tags assigned to customers indicating the current status of the lead.

Note: Collected from the CRM

In [None]:
data['Tags'].describe()

In [None]:
data['Tags'].isnull().sum()

In [None]:
data['Tags'] = data['Tags'].fillna('Other')

In [None]:
crosstab(data, ['Tags'], 'Converted')

## Lead Quality

Note: Collected from the CRM

In [None]:
data['Lead Quality'].describe()

In [None]:
data['Lead Quality'].isnull().sum()

In [None]:
crosstab(data, ['Lead Quality'], 'Converted')

In [None]:
data['Lead Quality'] = data['Lead Quality'].fillna('Other')

In [None]:
crosstab(data, ['Lead Quality'], 'Converted')

## Update me on Supply Chain Content
Indicates whether the customer wants updates on the Supply Chain Content.

Note: Collected from the Form

In [None]:
data['Update me on Supply Chain Content'].describe()

Note: There is only one label. In this case, the feature is useless.

In [None]:
# drop the feature
data = data.drop(columns=['Update me on Supply Chain Content'])

## Get updates on DM Content
Indicates whether the customer wants updates on the DM Content.

Note: Collected from the Form

In [None]:
data['Get updates on DM Content'].describe()

Note: There is only one label. In this case, the feature is useless.

In [None]:
# drop the feature
data = data.drop(columns=['Get updates on DM Content'])

## Lead Profile
A lead level assigned to each customer based on their profile.

Note: Collected from the CRM

In [None]:
data['Lead Profile'].describe()

In [None]:
data['Lead Profile'].isnull().sum()

In [None]:
data['Lead Profile'] = data['Lead Profile'].fillna('Other')

In [None]:
crosstab(data, ['Lead Profile'], 'Converted')

## City
The city of the customer.

Note: Collected from the Form

In [None]:
data['City'].describe()

In [None]:
data['City'].isnull().sum()

In [None]:
data['City'] = data['City'].fillna('Other')

In [None]:
crosstab(data, ['City'], 'Converted')

## Asymmetrique Activity Index
An index and score assigned to each customer based on their activity and their profile.

Note: Collected from the CRM

In [None]:
data['Asymmetrique Activity Index'].describe()

In [None]:
data['Asymmetrique Activity Index'].isnull().sum()

In [None]:
data['Asymmetrique Activity Index'] = data['Asymmetrique Activity Index'].fillna('Other')

In [None]:
crosstab(data, ['Asymmetrique Activity Index'], 'Converted')

## Asymmetrique Profile Index

Note: Collected from the CRM

In [None]:
data['Asymmetrique Profile Index'].describe()

In [None]:
data['Asymmetrique Profile Index'].isnull().sum()

In [None]:
data['Asymmetrique Profile Index'] = data['Asymmetrique Profile Index'].fillna('Other')

In [None]:
crosstab(data, ['Asymmetrique Profile Index'], 'Converted')

## Asymmetrique Activity Score

Note: Collected from the CRM

In [None]:
data['Asymmetrique Activity Score'].describe()

In [None]:
data['Asymmetrique Activity Score'].isnull().sum()

In [None]:
crosstab(data, ['Asymmetrique Activity Score'], 'Converted')

In [None]:
data['Asymmetrique Activity Score'] = data['Asymmetrique Activity Score'].fillna(data['Asymmetrique Activity Score'].mean())

In [None]:
crosstab(data, ['Asymmetrique Activity Score'], 'Converted')

## Asymmetrique Profile Score

Note: Collected from the CRM

In [None]:
data['Asymmetrique Profile Score'].describe()

In [None]:
data['Asymmetrique Profile Score'].isnull().sum()

In [None]:
data['Asymmetrique Profile Score'] = data['Asymmetrique Profile Score'].fillna(data['Asymmetrique Profile Score'].mean())

In [None]:
crosstab(data, ['Asymmetrique Profile Score'], 'Converted')

## I agree to pay the amount through cheque
Indicates whether the customer has agreed to pay the amount through cheque or not.

Note: Collected from the Form

In [None]:
data['I agree to pay the amount through cheque'].describe()

Note: There is only one label. In this case, the feature is useless.

In [None]:
# drop the feature
data = data.drop(columns=['I agree to pay the amount through cheque'])

## A free copy of Mastering The Interview
Indicates whether the customer wants a free copy of 'Mastering the Interview' or not.

Note: Collected from the Form

In [None]:
data['A free copy of Mastering The Interview'].describe()

In [None]:
data['A free copy of Mastering The Interview'].isnull().sum()

In [None]:
crosstab(data, ['A free copy of Mastering The Interview'], 'Converted')

In [None]:
# encode Yes/No as 1/0 and convert to int
data['A free copy of Mastering The Interview'] = (
    data['A free copy of Mastering The Interview'].map({'Yes': 1, 'No': 0}).astype('int64')
)

In [None]:
crosstab(data, ['A free copy of Mastering The Interview'], 'Converted')

## Last Notable Activity
The last notable activity performed by the student.

Note: Collected from the Tracking system

In [None]:
data['Last Notable Activity'].describe()

In [None]:
data['Last Notable Activity'].isnull().sum()

In [None]:
crosstab(data, ['Last Notable Activity'], 'Converted')

In [None]:
summary(data)

## Correlation of the numeric features

In [None]:
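# restrict the correlation to numeric columns (required by pandas >= 2.0 when object columns are present)
sns.heatmap(data.corr(numeric_only=True), annot=True, cmap='RdYlGn', linewidths=0.2)
fig = plt.gcf()
fig.set_size_inches(10, 8)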
plt.show()

The correlation matrix shows no strongly correlated pairs among the numeric features, so there is no multicollinearity that could distort the model.
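
Reading a heatmap by eye can miss a pair, so the same check can be done programmatically. This is a minimal sketch, not part of the original notebook: the helper name `correlated_pairs`, the threshold of 0.8 and the toy data are assumptions, and `numeric_only=True` requires pandas 1.5 or later.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.8):
    """Return (feature_a, feature_b, |corr|) for numeric pairs above threshold."""
    corr = df.corr(numeric_only=True).abs()
    # keep only the upper triangle so that each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    pairs = pairs[pairs > threshold]
    return [(a, b, round(float(v), 3)) for (a, b), v in pairs.items()]

# Hypothetical toy data: "b" is an exact multiple of "a"
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 1, 4, 2, 3],
})
print(correlated_pairs(toy))
```

An empty list from `correlated_pairs(data)` would support the conclusion drawn from the heatmap.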

In [None]:
# the positional axis argument was removed in pandas 2.0; name the column explicitly
X = data.drop(columns=['Converted'])
y = data['Converted']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=47)
X_train.shape, X_test.shape

In [None]:
train_df = pd.concat([X_train, y_train], axis=1)
check_target(train_df, 'Converted')

In [None]:
test_df = pd.concat([X_test, y_test], axis=1)
check_target(test_df, 'Converted')

The distribution of converted and not converted data is similar in train and test.
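
A random split usually preserves the class ratio on a data set of this size, but it is not guaranteed. If the ratio needs to be fixed, `train_test_split` accepts a `stratify` argument. The sketch below uses synthetic data with a 30% positive rate (an assumption chosen to mirror the conversion rate mentioned in the context) and is not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 30% positives
rng = np.random.default_rng(47)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 300 + [0] * 700)

# stratify=y_toy keeps the class ratio (almost) identical in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=47, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```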

In [None]:
cat_features=[i for i in X_train.columns if ((X_train.dtypes[i]!='int64') & (X_train.dtypes[i]!='float64'))]
cat_features

In [None]:
bool_features=[i for i in X_train.columns if (((X_train.dtypes[i]=='int64') | (X_train.dtypes[i]=='float64')) & (len(X_train[i].unique()) == 2))]
bool_features

In [None]:
num_features=[i for i in X_train.columns if (((X_train.dtypes[i]=='int64') | (X_train.dtypes[i]=='float64')) & (len(X_train[i].unique()) > 2))]
num_features

In [None]:
from sklearn.utils import class_weight
# recent scikit-learn versions require keyword arguments here
cw = list(class_weight.compute_class_weight(class_weight='balanced',
                                             classes=np.unique(data['Converted']),
                                             y=data['Converted']))
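
For reference, the `'balanced'` option weights each class by `n_samples / (n_classes * class_count)`, so the rarer class gets the larger weight. The sketch below reproduces that formula with NumPy on a hypothetical 70/30 label vector; the helper name `balanced_class_weights` is an assumption made for illustration.

```python
import numpy as np

def balanced_class_weights(y):
    """Weights following the 'balanced' heuristic: n_samples / (n_classes * count)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Hypothetical labels: 70 not converted, 30 converted
y_toy = np.array([0] * 70 + [1] * 30)
print(balanced_class_weights(y_toy))
```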

In [None]:
params = {'depth':[2, 3, 4, 5],
          'iterations':[1500],
          'loss_function': ['Logloss'],
          'l2_leaf_reg':np.logspace(-19,-20,3),
          'early_stopping_rounds': [500],
          'learning_rate':[0.01],
          'eval_metric':['F1']
}

# parameter tuning
#param = catboost_GridSearchCV(X_train, y_train, params, cat_features, cw)
#param

In [None]:
# pre-optimized parameters
param = {'depth': 4,
     'early_stopping_rounds': 500,
     'eval_metric': 'F1',
     'iterations': 1500,
     'l2_leaf_reg': 1e-19,
     'learning_rate': 0.01,
     'loss_function': 'Logloss'
}

# create the model
# note: param['learning_rate'] is not passed below, so CatBoost uses its default learning rate
clf2 = CatBoostClassifier(iterations=param['iterations'],
                        loss_function = param['loss_function'],
                        depth=param['depth'],
                        l2_leaf_reg = param['l2_leaf_reg'],
                        eval_metric = param['eval_metric'],
                        #leaf_estimation_iterations = param['leaf_estimation_iterations'],
                        use_best_model=True,
                        early_stopping_rounds=param['early_stopping_rounds'],
                        class_weights = cw
)

# train the model
clf2.fit(X_train, 
        y_train,
        cat_features=cat_features,
        logging_level='Silent',
        eval_set=(X_test, y_test)
)

In [None]:
feature_score = pd.DataFrame(list(zip(X_train.dtypes.index, clf2.get_feature_importance(Pool(X_train, label=train_df['Converted'], cat_features=cat_features)))),
                columns=['Feature','Score'])

feature_score = feature_score.sort_values(by='Score', ascending=False, inplace=False, kind='quicksort', na_position='last')
plt.rcParams["figure.figsize"] = (15,8)
ax = feature_score.plot('Feature', 'Score', kind='bar', color='c')
ax.set_title("Catboost Feature Importance Ranking", fontsize = 14)
ax.set_xlabel('')

rects = ax.patches

labels = feature_score['Score'].round(2)

for rect, label in zip(rects, labels):
    height = rect.get_height()
    ax.text(rect.get_x() + rect.get_width()/2, height + 0.35, label, ha='left', va='bottom')
plt.xticks(rotation=85)

plt.gca().invert_xaxis()

plt.show()
print(feature_score)


In [None]:
pred_catboost2_train = clf2.predict(X_train)

In [None]:
plot_cf_matrix_and_roc(clf2, X_train, y_train, X_train, y_train, pred_catboost2_train , classes=['NO','YES'])

In [None]:
print(metrics.classification_report(y_train, pred_catboost2_train))

In [None]:
# predictions on the test set (kept separate from the train predictions)
pred_catboost2_test = clf2.predict(X_test)

In [None]:
plot_cf_matrix_and_roc(clf2, X_train, y_train, X_test, y_test, pred_catboost2_test, classes=['NO','YES'])

In [None]:
print(metrics.classification_report(y_test, pred_catboost2_test))

On the test data set, our model reaches a precision of 94% and a recall of 93%:
 - Out of 1115 converted students, our model recognized 1037 correctly (recall 93%).
 - Out of 1657 non-converted students, our model recognized 1588 correctly (specificity of about 96%). Overall accuracy across both classes is about 95%.
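
The figures above can be re-derived from the four counts alone. This is a small worked calculation, assuming the two counts per class are the diagonal of the binary confusion matrix; the share of non-converted students recognized correctly is usually called specificity, while accuracy covers both classes together.

```python
# Counts taken from the test-set summary above
tp, pos = 1037, 1115      # correctly recognized / all converted
tn, neg = 1588, 1657      # correctly recognized / all non-converted
fn = pos - tp             # converted students the model missed
fp = neg - tn             # non-converted students flagged as converted

recall = tp / (tp + fn)
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
accuracy = (tp + tn) / (pos + neg)

print(f"recall={recall:.3f} specificity={specificity:.3f} "
      f"precision={precision:.3f} accuracy={accuracy:.3f}")
```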

## Cross validate the model

In [None]:
score = cross_val(X_train, y_train, param, cat_features, cw, 5)

The model is very stable across the folds: the standard deviation of the F1 score is 0.005.
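
The `cross_val` helper is defined elsewhere in the notebook and is not shown in this section. As a rough illustration of how a per-fold F1 mean and standard deviation can be obtained, here is a minimal sketch using scikit-learn's `StratifiedKFold`. The synthetic data and the `LogisticRegression` model are stand-ins chosen to keep the example light; they are not the CatBoost setup used above, and the printed numbers are unrelated to the results reported here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data with roughly 30% positives
X_toy, y_toy = make_classification(
    n_samples=500, n_features=8, weights=[0.7, 0.3], random_state=47
)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=47)
fold_f1 = []
for train_idx, valid_idx in skf.split(X_toy, y_toy):
    # fit a fresh model on each training fold, score on the held-out fold
    model = LogisticRegression(max_iter=1000, class_weight="balanced")
    model.fit(X_toy[train_idx], y_toy[train_idx])
    fold_f1.append(f1_score(y_toy[valid_idx], model.predict(X_toy[valid_idx])))

fold_f1 = np.array(fold_f1)
print(f"F1 per fold: {fold_f1.round(3)}")
print(f"mean={fold_f1.mean():.3f} std={fold_f1.std():.3f}")
```

A small standard deviation relative to the mean indicates that the score does not depend much on which rows end up in the validation fold.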