<a href="https://colab.research.google.com/github/zecakpm/ML_projects/blob/main/leads_scoring.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##**Lead scoring classifier**

* Get dataset
* Select categorical and numerical variables
* Split dataset
* Add dummy variables
* Scale all features
* Train model
* Check accuracy
* Check most important features
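
The steps listed above can also be chained with scikit-learn's `Pipeline` and `ColumnTransformer`. This notebook does not do that, but it keeps the preprocessing and the model in one object. A minimal sketch, assuming hypothetical toy data (the frame `toy` and its values are made up for illustration and are not the Kaggle dataset):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one categorical feature, one numerical feature, one target
toy = pd.DataFrame({
    "lead source": ["Google", "Olark Chat", "Direct Traffic", "Google"] * 10,
    "total time spent on website": [1500, 30, 400, 1200] * 10,
    "converted": [1, 0, 0, 1] * 10,
})
X_toy = toy.drop(columns=["converted"])
y_toy = toy["converted"]

# Split first so that preprocessing is learned from the training rows only
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=333
)

# One-hot encode the categorical column and standardize the numerical one
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["lead source"]),
    ("num", StandardScaler(), ["total time spent on website"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestClassifier(n_estimators=20, random_state=0)),
])
model.fit(X_tr, y_tr)

# Accuracy on the held-out rows
toy_accuracy = model.score(X_te, y_te)
print(toy_accuracy)
```

Because the transformers live inside the pipeline, `model.fit` learns the encoding and scaling from the training rows and `model.score` reuses them unchanged on the test rows.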







Reference material

https://towardsdatascience.com/a-true-end-to-end-ml-example-lead-scoring-f5b52e9a3c80

##**Libraries**

In [75]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
from sklearn import svm, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score



In [28]:
#connecting with personal drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


##**Dataset**

* Leads dataset from Kaggle
* https://www.kaggle.com/ashydv/leads-dataset



In [57]:
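#path to the cleaned leads file on the mounted drive
csv_path = '/content/drive/My Drive/Colab Notebooks/Projects/leads_scoring/Leads_cleaned.csv'

In [58]:
#read the CSV directly from its path, using the first column as the index
df = pd.read_csv(csv_path, index_col=0)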

##**Lower case column names**

In [59]:
df.columns = df.columns.str.lower()

In [60]:
df.head()

Unnamed: 0,prospect id,lead number,lead origin,lead source,do not email,do not call,converted,totalvisits,total time spent on website,page views per visit,last activity,country,specialization,what is your current occupation,what matters most to you in choosing a course,search,magazine,newspaper article,x education forums,newspaper,digital advertisement,through recommendations,receive more updates about our courses,tags,lead quality,update me on supply chain content,get updates on dm content,city,i agree to pay the amount through cheque,a free copy of mastering the interview,last notable activity
0,7927b2df-8bba-4d29-b9a2-b6e0beafe620,660737,API,Olark Chat,No,No,0,0.0,0,0.0,Page Visited on Website,India,Others,Unemployed,Better Career Prospects,No,No,No,No,No,No,No,No,Interested in other courses,Low in Relevance,No,No,Mumbai,No,No,Modified
1,2a272436-5132-4136-86fa-dcc88c88f482,660728,API,Organic Search,No,No,0,5.0,674,2.5,Email Opened,India,Others,Unemployed,Better Career Prospects,No,No,No,No,No,No,No,No,Ringing,Not Sure,No,No,Mumbai,No,No,Email Opened
2,8cc8c611-a219-4f35-ad23-fdfd2656bd8a,660727,Landing Page Submission,Direct Traffic,No,No,1,2.0,1532,2.0,Email Opened,India,Business Administration,Student,Better Career Prospects,No,No,No,No,No,No,No,No,Will revert after reading the email,Might be,No,No,Mumbai,No,Yes,Email Opened
3,0cc2df48-7cf4-4e39-9de9-19797f9b38cc,660719,Landing Page Submission,Direct Traffic,No,No,0,1.0,305,1.0,Unreachable,India,Media and Advertising,Unemployed,Better Career Prospects,No,No,No,No,No,No,No,No,Ringing,Not Sure,No,No,Mumbai,No,No,Modified
4,3256f628-e534-4826-9d63-4a8b88782852,660681,Landing Page Submission,Google,No,No,1,2.0,1428,1.0,Converted to Lead,India,Others,Unemployed,Better Career Prospects,No,No,No,No,No,No,No,No,Will revert after reading the email,Might be,No,No,Mumbai,No,No,Modified


In [61]:
df.shape

(9074, 31)

In [62]:
df.columns

Index(['prospect id', 'lead number', 'lead origin', 'lead source',
       'do not email', 'do not call', 'converted', 'totalvisits',
       'total time spent on website', 'page views per visit', 'last activity',
       'country', 'specialization', 'what is your current occupation',
       'what matters most to you in choosing a course', 'search', 'magazine',
       'newspaper article', 'x education forums', 'newspaper',
       'digital advertisement', 'through recommendations',
       'receive more updates about our courses', 'tags', 'lead quality',
       'update me on supply chain content', 'get updates on dm content',
       'city', 'i agree to pay the amount through cheque',
       'a free copy of mastering the interview', 'last notable activity'],
      dtype='object')

##**Selected columns**

In [63]:
cat_vars = ['lead origin',
            'lead source',
            'last activity',
            'specialization',
            'what is your current occupation',
            'what matters most to you in choosing a course',
            'city',
            'last notable activity']

num_vars = ['totalvisits',
          'total time spent on website',
          'page views per visit']

target_label = ['converted']

#named selected_vars so the built-in all() is not shadowed
selected_vars = cat_vars + num_vars + target_label
print(selected_vars)

['lead origin', 'lead source', 'last activity', 'specialization', 'what is your current occupation', 'what matters most to you in choosing a course', 'city', 'last notable activity', 'totalvisits', 'total time spent on website', 'page views per visit', 'converted']


##**Dropping columns**

*   Remove columns not selected in the previous section



In [64]:
#drop every column that is not in the selected list
unused_columns = [var for var in df.columns if var not in selected_vars]
df.drop(columns=unused_columns, inplace=True)

In [65]:
df.head()

Unnamed: 0,lead origin,lead source,converted,totalvisits,total time spent on website,page views per visit,last activity,specialization,what is your current occupation,what matters most to you in choosing a course,city,last notable activity
0,API,Olark Chat,0,0.0,0,0.0,Page Visited on Website,Others,Unemployed,Better Career Prospects,Mumbai,Modified
1,API,Organic Search,0,5.0,674,2.5,Email Opened,Others,Unemployed,Better Career Prospects,Mumbai,Email Opened
2,Landing Page Submission,Direct Traffic,1,2.0,1532,2.0,Email Opened,Business Administration,Student,Better Career Prospects,Mumbai,Email Opened
3,Landing Page Submission,Direct Traffic,0,1.0,305,1.0,Unreachable,Media and Advertising,Unemployed,Better Career Prospects,Mumbai,Modified
4,Landing Page Submission,Google,1,2.0,1428,1.0,Converted to Lead,Others,Unemployed,Better Career Prospects,Mumbai,Modified



##**Splitting dataset**

In [67]:
X = df.drop(target_label, axis=1)
y = df[target_label]

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    train_size = 0.7,
                                                    test_size=0.3,
                                                    random_state=333
                                                    )

print("X_train size ===>", X_train.shape)
print("y_train size ===>", y_train.shape)
print("X_test size ===>", X_test.shape)
print("y_test size ===>", y_test.shape)


X_train size ===> (6351, 11)
y_train size ===> (6351, 1)
X_test size ===> (2723, 11)
y_test size ===> (2723, 1)
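
`train_test_split` shuffles rows at random, so the share of converted leads can differ slightly between the training and test sets. One option the cell above does not use is the `stratify` argument, which keeps the class proportions the same in both sets. A minimal sketch, assuming a hypothetical balanced toy frame (`toy` and its values are made up for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 converted and 10 non-converted leads
toy = pd.DataFrame({
    "totalvisits": range(20),
    "converted": [1] * 10 + [0] * 10,
})
X_toy = toy.drop(columns=["converted"])
y_toy = toy["converted"]

# stratify=y_toy keeps the 50/50 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=333, stratify=y_toy
)

print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```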


##**Categorical variables to Dummy Variables**

In [68]:
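#NOTE: pd.get_dummies returns one column per category, but assigning that
#result to a single column keeps only one 0/1 indicator per variable (see
#the output below); newer pandas versions may reject this assignment with
#a ValueError
for var in cat_vars:
  X_train[var] =  pd.get_dummies(X_train[var])
  X_test[var] = pd.get_dummies(X_test[var])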

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [69]:
X_train.head()

Unnamed: 0,lead origin,lead source,totalvisits,total time spent on website,page views per visit,last activity,specialization,what is your current occupation,what matters most to you in choosing a course,city,last notable activity
7635,0,0,1.0,345,1.0,0,0,0,1,0,0
7276,1,0,3.0,641,1.5,0,0,0,1,1,0
8724,0,0,3.0,389,3.0,0,0,0,1,0,0
8314,0,0,11.0,1002,3.67,0,0,0,1,1,0
5033,0,0,14.0,302,3.5,0,0,0,1,1,0
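
The table above has a single 0/1 column for each categorical variable, so most category levels are lost. A fuller one-hot encoding expands each variable into one indicator column per level and then aligns the test columns with the training columns. A minimal sketch, assuming hypothetical toy frames (`toy_train`, `toy_test`, and `toy_cat_vars` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Hypothetical toy frames standing in for X_train / X_test
toy_train = pd.DataFrame({
    "lead source": ["Google", "Olark Chat", "Google", "Direct Traffic"],
    "totalvisits": [2.0, 0.0, 5.0, 1.0],
})
toy_test = pd.DataFrame({
    # "Referral Sites" never appears in the training frame
    "lead source": ["Google", "Referral Sites"],
    "totalvisits": [3.0, 4.0],
})
toy_cat_vars = ["lead source"]

# Expand every categorical column into one indicator column per category
train_enc = pd.get_dummies(toy_train, columns=toy_cat_vars, dtype=int)

# Encode the test set the same way, then align it to the training columns:
# categories unseen in training are dropped, missing ones are filled with 0
test_enc = pd.get_dummies(toy_test, columns=toy_cat_vars, dtype=int).reindex(
    columns=train_enc.columns, fill_value=0
)

print(list(train_enc.columns))
print(test_enc)
```

Aligning with `reindex` matters because a model trained on the training columns expects exactly the same columns, in the same order, at prediction time.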


##**Scaling dataset**


*   scikit-learn transformers return NumPy arrays, so the scaled output is wrapped in a pandas DataFrame to keep the column names



In [80]:
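scaler = StandardScaler()
scaler.fit(X_train)
#NOTE: this second fit() replaces the statistics learned from X_train,
#so both sets below are scaled with the mean and variance of X_test
scaler.fit(X_test)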

X_train = pd.DataFrame(
    scaler.transform(X_train),
    columns=X_train.columns
)

X_test = pd.DataFrame(
    scaler.transform(X_test),
    columns=X_test.columns
)

##**Visualize scaled dataset**

In [81]:
X_train.head()

Unnamed: 0,lead origin,lead source,totalvisits,total time spent on website,page views per visit,last activity,specialization,what is your current occupation,what matters most to you in choosing a course,city,last notable activity
0,-0.796526,-0.629611,-0.492776,-0.241907,-0.612436,-0.019167,-0.196266,-0.033211,0.019167,-1.747021,-0.086019
1,1.255452,-0.629611,-0.107359,0.299765,-0.392858,-0.019167,-0.196266,-0.033211,0.019167,0.572403,-0.086019
2,-0.796526,-0.629611,-0.107359,-0.161389,0.265875,-0.019167,-0.196266,-0.033211,0.019167,-1.747021,-0.086019
3,-0.796526,-0.629611,1.43431,0.960386,0.56011,-0.019167,-0.196266,-0.033211,0.019167,0.572403,-0.086019
4,-0.796526,-0.629611,2.012436,-0.320596,0.485453,-0.019167,-0.196266,-0.033211,0.019167,0.572403,-0.086019
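
Each call to `StandardScaler.fit` discards the statistics from any earlier call, so the scaling cell above ends up using the mean and variance of `X_test` for both sets. A common alternative is to fit on the training split only and reuse those statistics for the test split, which keeps test information out of preprocessing. A minimal sketch, assuming hypothetical toy frames (`toy_train` and `toy_test` are made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frames with non-default indexes, as after train_test_split
toy_train = pd.DataFrame({"totalvisits": [0.0, 2.0, 4.0, 6.0]}, index=[10, 11, 12, 13])
toy_test = pd.DataFrame({"totalvisits": [3.0, 9.0]}, index=[20, 21])

scaler = StandardScaler()

# Learn mean and standard deviation from the training rows only
train_scaled = pd.DataFrame(
    scaler.fit_transform(toy_train),
    columns=toy_train.columns,
    index=toy_train.index,  # keep the original row labels
)

# Reuse the training statistics for the test rows (no second fit)
test_scaled = pd.DataFrame(
    scaler.transform(toy_test),
    columns=toy_test.columns,
    index=toy_test.index,
)

print(scaler.mean_)
print(test_scaled)
```

Passing `index=` is optional. It only preserves the row labels so the scaled features can still be matched to the target by index.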


##**Train the model**

In [85]:
n_estimators = 100
min_samples_split = 4

clf = RandomForestClassifier(n_estimators=n_estimators,
                            min_samples_split=min_samples_split)
clf.fit(X_train, y_train.values.ravel())

print("X_train size ===>", X_train.shape)
print("y_train size ===>", y_train.shape)
print("X_test size ===>", X_test.shape)
print("y_test size ===>", y_test.shape)

X_train size ===> (6351, 11)
y_train size ===> (6351, 1)
X_test size ===> (2723, 11)
y_test size ===> (2723, 1)


In [84]:
y_test_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_test_pred)
auc_score = roc_auc_score(y_test, y_test_pred)

print(accuracy)
print(auc_score)

0.7297098788101358
0.7045870362199487
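
The AUC above is computed from hard 0/1 predictions, which reduces the ROC curve to a single operating point. `roc_auc_score` also accepts scores such as the positive-class probability from `predict_proba`, and this usually gives a more informative ranking-based AUC. A minimal sketch on synthetic data, assuming hypothetical names (`X_demo`, `y_demo`, `demo_clf`) that are not part of the notebook:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data, for illustration only
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

demo_clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(Xtr, ytr)

# AUC from hard class labels (what the cell above does)
auc_from_labels = roc_auc_score(yte, demo_clf.predict(Xte))

# AUC from the predicted probability of the positive class (column 1)
auc_from_scores = roc_auc_score(yte, demo_clf.predict_proba(Xte)[:, 1])

print(auc_from_labels, auc_from_scores)
```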


##**Variable importance**

In [77]:
feature_list = list(X_train.columns)
importances = list(clf.feature_importances_)

feature_importances = [(feature, round(importance,2)) for feature, importance in zip(feature_list, importances)]
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
#print each (feature, importance) pair, highest importance first
for pair in feature_importances:
  print('Variable: {:10} Importance: {}'.format(*pair))

Variable: total time spent on website Importance: 0.7
Variable: totalvisits Importance: 0.1
Variable: page views per visit Importance: 0.09
Variable: lead origin Importance: 0.07
Variable: city       Importance: 0.02
Variable: specialization Importance: 0.01
Variable: lead source Importance: 0.0
Variable: last activity Importance: 0.0
Variable: what is your current occupation Importance: 0.0
Variable: what matters most to you in choosing a course Importance: 0.0
Variable: last notable activity Importance: 0.0


In [78]:
print(len(feature_list))

11
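
The same ranking can be built more compactly by putting `feature_importances_` into a pandas Series indexed by feature name. A minimal sketch on synthetic data, assuming hypothetical names (`demo_cols`, `demo_clf`, `importance_table`) that are not part of the notebook:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with four features, for illustration only
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
demo_cols = ["f0", "f1", "f2", "f3"]

demo_clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Label each importance with its feature name and sort from highest to lowest
importance_table = pd.Series(
    demo_clf.feature_importances_, index=demo_cols
).sort_values(ascending=False)

print(importance_table.round(2))
```

The impurity-based importances of a random forest are normalized to sum to 1, so each value can be read as a share of the total.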
