
# Introduction
Selling a product or service means targeting potential customers and convincing them to buy. Not all leads are equally likely to convert, which makes it hard for sales teams to prioritize their effort. Lead scoring addresses this by assigning each lead a score based on its likelihood of converting into a sale. In this report, we build a machine learning model that predicts the lead score.

## Importing the required libraries

In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

In [2]:
sheet_id = '1-Q5EaPbEG0BuZj69ZaGCe4agBY_RLiu-gZClm50gCmg'

xls = pd.ExcelFile(f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=xlsx")

df = pd.read_excel(xls, 'Dump')
df.head()

Unnamed: 0.1,Unnamed: 0,Agent_id,status,lost_reason,budget,lease,movein,source,source_city,source_country,utm_source,utm_medium,des_city,des_country,room_type,lead_id
0,0,1deba9e96f404694373de9749ddd1ca8aa7bb823145a6f...,LOST,Not responding,,,NaT,9b2d5b4678781e53038e91ea5324530a03f27dc1d0e5f6...,9b2d5b4678781e53038e91ea5324530a03f27dc1d0e5f6...,9b2d5b4678781e53038e91ea5324530a03f27dc1d0e5f6...,3d59f7548e1af2151b64135003ce63c0a484c26b9b8b16...,268ad70eb5bc4737a2ae28162cbca30118cc94520e49ef...,ecc0e7dc084f141b29479058967d0bc07dee25d9690a98...,8d23a6e37e0a6431a8f1b43a91026dcff51170a89a6512...,,cd5dc0d9393f3980d11d4ba6f88f8110c2b7a7f7796307...
1,1,299ae77a4ef350ae0dd37d6bba1c002d03444fb1edb236...,LOST,Low budget,,,NaT,9b2d5b4678781e53038e91ea5324530a03f27dc1d0e5f6...,9b2d5b4678781e53038e91ea5324530a03f27dc1d0e5f6...,9b2d5b4678781e53038e91ea5324530a03f27dc1d0e5f6...,3d59f7548e1af2151b64135003ce63c0a484c26b9b8b16...,268ad70eb5bc4737a2ae28162cbca30118cc94520e49ef...,5372372f3bf5896820cb2819300c3e681820d82c6efc54...,8d23a6e37e0a6431a8f1b43a91026dcff51170a89a6512...,,b94693673a5f7178d1b114e4004ad52377d3244dd24a3d...
2,2,c213697430c006013012dd2aca82dd9732aa0a1a6bca13...,LOST,Not responding,£121 - £180 Per Week,Full Year Course Stay 40 - 44 weeks,2022-08-31,7aae3e886e89fc1187a5c47d6cea1c22998ee610ade1f2...,9b8cc3c63cdf447e463c11544924bf027945cbd29675f7...,e09e10e67812e9d236ad900e5d46b4308fc62f5d69446a...,bbdefa2950f49882f295b1285d4fa9dec45fc4144bfb07...,09076eb7665d1fb9389c7c4517fee0b00e43092eb34821...,11ab03a1a8c367191355c152f39fe28cae5e426fce49ef...,8d23a6e37e0a6431a8f1b43a91026dcff51170a89a6512...,Ensuite,96ea4e2bf04496c044745938c0299c264c3f4ba079e572...
3,3,eac9815a500f908736d303e23aa227f0957177b0e6756b...,LOST,Low budget,0-0,0,NaT,ba2d0a29556ac20f86f45e4543c0825428cba33fd7a9ea...,a5f0d2d08eb0592087e3a3a2f9c1ba2c67cc30f2efd2bd...,e09e10e67812e9d236ad900e5d46b4308fc62f5d69446a...,bbdefa2950f49882f295b1285d4fa9dec45fc4144bfb07...,09076eb7665d1fb9389c7c4517fee0b00e43092eb34821...,19372fa44c57a01c37a5a8418779ca3d99b0b59731fb35...,8d23a6e37e0a6431a8f1b43a91026dcff51170a89a6512...,,1d2b34d8add02a182a4129023766ca4585a8ddced0e5b3...
4,4,1deba9e96f404694373de9749ddd1ca8aa7bb823145a6f...,LOST,Junk lead,,,NaT,9b2d5b4678781e53038e91ea5324530a03f27dc1d0e5f6...,9b2d5b4678781e53038e91ea5324530a03f27dc1d0e5f6...,9b2d5b4678781e53038e91ea5324530a03f27dc1d0e5f6...,3d59f7548e1af2151b64135003ce63c0a484c26b9b8b16...,268ad70eb5bc4737a2ae28162cbca30118cc94520e49ef...,9b2d5b4678781e53038e91ea5324530a03f27dc1d0e5f6...,9b2d5b4678781e53038e91ea5324530a03f27dc1d0e5f6...,,fc10fffd29cfbe93c55158fb47752a7501c211d253468c...


## Data Preprocessing

In [3]:
df.shape

(46608, 16)

In [4]:
df = df[df['status'].isin(['WON', 'LOST'])].copy()  # copy so later column assignments do not act on a view
df.shape

(46317, 16)

In [5]:
df['status'].value_counts()

LOST    43244
WON      3073
Name: status, dtype: int64
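
The counts above show a heavily imbalanced target. As a rough sketch that uses only those two counts (the variable names are illustrative), the share of WON leads and the accuracy of a trivial classifier that always predicts LOST can be computed like this:

```python
# Class counts copied from the value_counts() output above.
lost, won = 43244, 3073
total = lost + won

won_share = won / total            # fraction of leads that converted
baseline_accuracy = lost / total   # accuracy of always predicting LOST

print(f"WON share: {won_share:.2%}")
print(f"Always-LOST baseline accuracy: {baseline_accuracy:.2%}")
```

Roughly 6.6% of leads are WON, so an accuracy around 93.4% is reachable without learning anything. Any accuracy figure for this target is best read against that baseline.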

In [6]:
df = df.drop_duplicates()  # drop_duplicates returns a new frame, so assign it back
df.shape

(46317, 16)

In [7]:
df.isna().sum()

Unnamed: 0            0
Agent_id              0
status                0
lost_reason        3073
budget             3694
lease              2336
movein            13610
source                0
source_city           0
source_country        0
utm_source            0
utm_medium            0
des_city              0
des_country           0
room_type         23491
lead_id               0
dtype: int64

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 46317 entries, 0 to 46607
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   Unnamed: 0      46317 non-null  int64         
 1   Agent_id        46317 non-null  object        
 2   status          46317 non-null  object        
 3   lost_reason     43244 non-null  object        
 4   budget          42623 non-null  object        
 5   lease           43981 non-null  object        
 6   movein          32707 non-null  datetime64[ns]
 7   source          46317 non-null  object        
 8   source_city     46317 non-null  object        
 9   source_country  46317 non-null  object        
 10  utm_source      46317 non-null  object        
 11  utm_medium      46317 non-null  object        
 12  des_city        46317 non-null  object        
 13  des_country     46317 non-null  object        
 14  room_type       22826 non-null  object        
 15  lead_id         46317 non-null  object        

In [9]:
df['status'].value_counts()

LOST    43244
WON      3073
Name: status, dtype: int64

In [10]:
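# Encode every column, including the 'status' target, as integer codes.
# astype(str) turns missing values into strings such as 'nan' or 'NaT',
# so each of them receives its own code.
le = LabelEncoder()
for column in df.columns:
    df[column] = le.fit_transform(df[column].astype(str))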
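
To make the encoding of the target explicit, here is a small sketch on a toy list rather than the real column. `LabelEncoder` sorts the class labels, so 'LOST' should map to 0 and 'WON' to 1, which is the order that `predict_proba` relies on later.

```python
from sklearn.preprocessing import LabelEncoder

# Toy status values; the notebook encodes df['status'] the same way.
status = ["LOST", "WON", "LOST", "LOST"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(status)

# classes_ is sorted, so index 0 is 'LOST' and index 1 is 'WON'.
mapping = {label: int(code) for code, label in enumerate(le_demo.classes_)}
print(mapping)
print(codes.tolist())
```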

We split the data into two parts: "X", which contains every column except the 'status' column, and "y", which contains only the 'status' column.

In [11]:
X = df.drop(['status'], axis=1)
y = df['status']

Feature selection is performed with SelectKBest and the chi-squared test, keeping the 10 highest-scoring features.

In [12]:
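# Score each feature against the target and keep the 10 best.
selector = SelectKBest(chi2, k=10)
selector.fit(X, y)  # only the fitted support mask is needed, not the transformed array
X = X[X.columns[selector.get_support(indices=True)]]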
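
As an illustration of what the selector does, the sketch below runs the same API on synthetic data. The names `X_demo`, `y_demo`, `informative`, `noise_a` and `noise_b` are made up for this example. It prints the chi-squared score per column and the names of the columns that survive.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
n = 200
y_demo = rng.integers(0, 2, size=n)
X_demo = pd.DataFrame({
    # Values 0/1 for class 0 and 5/6 for class 1, so this column tracks the target.
    "informative": y_demo * 5 + rng.integers(0, 2, size=n),
    "noise_a": rng.integers(0, 10, size=n),
    "noise_b": rng.integers(0, 10, size=n),
})

selector_demo = SelectKBest(chi2, k=1).fit(X_demo, y_demo)
scores = pd.Series(selector_demo.scores_, index=X_demo.columns).sort_values(ascending=False)
selected = list(X_demo.columns[selector_demo.get_support()])

print(scores)
print(selected)
```

scikit-learn's `chi2` expects non-negative features such as counts or booleans, so scores computed on arbitrary label codes depend on how the codes happened to be assigned.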

In [13]:
X

Unnamed: 0.1,Unnamed: 0,Agent_id,lost_reason,budget,movein,source,source_city,source_country,des_city,lead_id
0,0,12,21,1716,469,432,2616,109,206,24421
1,1,20,16,1716,469,432,2616,109,74,22037
2,11039,87,21,1750,127,343,2620,166,18,17924
3,22088,110,16,10,469,508,2818,166,26,3582
4,33124,12,8,1716,469,432,2616,109,137,30016
...,...,...,...,...,...,...,...,...,...,...
46603,40414,15,15,1821,128,432,2125,100,22,3295
46604,40415,24,24,1821,156,432,2696,100,135,3895
46605,40416,5,15,1800,147,343,2125,100,206,25937
46606,40417,60,15,815,126,579,3944,134,206,3895


In [14]:
y

0        0
1        0
2        0
3        0
4        0
        ..
46603    0
46604    0
46605    0
46606    0
46607    0
Name: status, Length: 46317, dtype: int64

Splitting the data into training and testing sets

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=43)
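
With a target this imbalanced, one option is to pass `stratify=y` so that both splits keep the same class ratio. This is not what the cell above does. The sketch below shows the variant on synthetic data (`X_demo` and `y_demo` are illustrative).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 930 zeros (LOST) and 70 ones (WON).
y_demo = np.array([0] * 930 + [1] * 70)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=43, stratify=y_demo
)

# Both splits should keep a positive share close to the overall 7%.
print(y_tr.mean(), y_te.mean())
```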

## Model Training
Training the model to predict the lead status

In [16]:
gnb = GaussianNB()
gnb.fit(X_train, y_train)

Using the model to predict the lead status on the testing set and calculating the accuracy of the model

In [17]:
y_pred = gnb.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)

Accuracy: 1.0
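
A perfect accuracy usually deserves a second look. In the missing-value summary above, `lost_reason` is empty for 3073 rows, exactly the number of WON leads, and it is among the selected features, so the model may be reading the outcome straight from that column. A minimal sketch of how one might check for this on a toy frame (the helper `perfectly_separates` is hypothetical, not part of the notebook):

```python
import pandas as pd

def perfectly_separates(frame: pd.DataFrame, feature: str, target: str) -> bool:
    """Return True if every value of `feature` occurs with only one `target` value."""
    keys = frame[feature].astype(str)  # missing values become their own group
    return bool((frame.groupby(keys)[target].nunique() <= 1).all())

toy = pd.DataFrame({
    "status": ["LOST", "LOST", "WON", "LOST", "WON"],
    "lost_reason": ["Low budget", "Junk lead", None, "Not responding", None],
    "budget": ["a", "b", "a", "b", "b"],
})

leaks_lost_reason = perfectly_separates(toy, "lost_reason", "status")
leaks_budget = perfectly_separates(toy, "budget", "status")
print(leaks_lost_reason, leaks_budget)
```

If the check returns True for a column on the real data, dropping that column before feature selection and retraining would likely give a more honest accuracy estimate.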


Calculating the lead score

In [18]:
# Column order of predict_proba follows gnb.classes_: 0 = LOST, 1 = WON.
proba = gnb.predict_proba(X)
df['WON Probability'] = proba[:, 1]
df['LOST Probability'] = proba[:, 0]
df['Lead Score'] = 100 * (df['WON Probability'] - df['LOST Probability'])
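
Because the two class probabilities returned by `predict_proba` sum to one, the score depends only on the WON probability. Writing $p$ for that probability (a symbol introduced here for illustration):

$$
\text{Lead Score} = 100\,\bigl(p - (1 - p)\bigr) = 100\,(2p - 1), \qquad -100 \le \text{Lead Score} \le 100 .
$$

A score of 0 therefore corresponds to $p = 0.5$, and positive scores mean the model considers WON more likely than LOST.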

Binning the continuous lead score into three equal-width intervals and storing the result in a new column, "Lead Score Cat", with labels 0, 1 and 2

In [19]:
bins = 3
df['Lead Score Cat'] = pd.cut(df['Lead Score'], bins, labels=[0, 1, 2])
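
When `pd.cut` is given an integer, it splits the observed range of the data into that many equal-width bins. The sketch below applies the same call to five made-up scores that span the full range and also returns the computed bin edges through `retbins=True`.

```python
import pandas as pd

# Made-up scores covering the possible range of the lead score.
scores = pd.Series([-100.0, -50.0, 0.0, 50.0, 100.0])

# retbins=True also returns the bin edges that pd.cut computed.
categories, edges = pd.cut(scores, 3, labels=[0, 1, 2], retbins=True)

print(categories.tolist())
print(edges)
```

If the real scores span the full range, the cut points fall near -33.3 and 33.3. Because the bins are derived from the observed minimum and maximum, they would shift if the scores covered a narrower range.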

Training a new machine learning model to predict the lead score category

In [20]:
X = df.drop(['status', 'Lead Score', 'Lead Score Cat'], axis=1)
y = df['Lead Score Cat']
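
Note that the `drop` call above removes three columns but leaves 'WON Probability' and 'LOST Probability' in `X`, and the lead score is computed directly from those two. The sketch below reproduces this on a toy frame and shows one possible variant that also drops them. The frame `toy` and the list `derived` are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "budget": [1, 2],
    "status": [0, 1],
    "WON Probability": [0.1, 0.9],
    "LOST Probability": [0.9, 0.1],
    "Lead Score": [-80.0, 80.0],
    "Lead Score Cat": [0, 2],
})

# Same drop as in the cell above: the probability columns remain.
X_as_written = toy.drop(["status", "Lead Score", "Lead Score Cat"], axis=1)

# Variant that also removes the columns the score was derived from.
derived = ["WON Probability", "LOST Probability"]
X_without_derived = X_as_written.drop(columns=derived)

print(list(X_as_written.columns))
print(list(X_without_derived.columns))
```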

In [21]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
gnb = GaussianNB()
gnb.fit(X_train, y_train)

Using the model to predict the lead score on the testing set.

In [22]:
y_pred = gnb.predict(X_test)

Calculating the accuracy of the model

In [23]:
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)

Accuracy: 0.9789867587795049


Overall classification report

In [24]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      0.98      0.99     12814
           1       0.37      0.84      0.52       184
           2       0.98      1.00      0.99       898

    accuracy                           0.98     13896
   macro avg       0.78      0.94      0.83     13896
weighted avg       0.99      0.98      0.98     13896
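
The two average rows can be reproduced by hand from the per-class rows. This sketch uses the rounded precision values and supports printed above, so it only recovers the averages to two decimals.

```python
# Per-class precision and support copied from the report above.
precision = {0: 1.00, 1: 0.37, 2: 0.98}
support = {0: 12814, 1: 184, 2: 898}

total = sum(support.values())

# Macro average: every class counts equally.
macro_precision = sum(precision.values()) / len(precision)

# Weighted average: every class counts in proportion to its support.
weighted_precision = sum(precision[c] * support[c] for c in precision) / total

print(round(macro_precision, 2), round(weighted_precision, 2))
```

Class 1 has only 184 of the 13896 test rows, so its low precision barely moves the weighted average or the overall accuracy. The macro average, which treats the three classes equally, makes it visible.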

