# Homework

## Dataset

In this homework, we will use the lead scoring dataset. Download it from [here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv).

Or you can do it with wget:

wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv

In this dataset, the target for the classification task is the `converted` variable: whether the client has signed up to the platform or not.

## Data preparation

* Check whether the features contain missing values.
* If there are missing values:
   * For categorical features, replace them with 'NA'
   * For numerical features, replace them with 0.0
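
The rule above can be sketched on a toy frame. This is only an illustration: the values are made up, and splitting columns with `select_dtypes` is an assumption about how to tell numerical from categorical features, not the approach the notebook itself takes below.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names mirror the dataset.
toy = pd.DataFrame({
    'industry': ['retail', np.nan, 'finance'],
    'annual_income': [50000.0, np.nan, 72000.0],
})

# Numeric columns are detected by dtype; everything else is treated as categorical.
numerical_cols = list(toy.select_dtypes(include='number').columns)
categorical_cols = [c for c in toy.columns if c not in numerical_cols]

# Categorical gaps become the string 'NA', numerical gaps become 0.0.
toy[categorical_cols] = toy[categorical_cols].fillna('NA')
toy[numerical_cols] = toy[numerical_cols].fillna(0.0)

# Total number of missing values left in the frame.
missing_after = int(toy.isnull().sum().sum())
```

Filling numerical gaps with 0.0 is simple but changes the distribution of the column, so it is worth keeping in mind when interpreting the model later.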


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
data = 'https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv'

In [3]:
!wget $data -O data.csv

--2025-10-15 20:13:17--  https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 80876 (79K) [text/plain]
Saving to: ‘data.csv’


2025-10-15 20:13:17 (38.0 MB/s) - ‘data.csv’ saved [80876/80876]



In [4]:
df = pd.read_csv('data.csv')

In [5]:
df.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| lead_source | paid_ads | social_media | events | paid_ads | referral |
| industry | NaN | retail | healthcare | retail | education |
| number_of_courses_viewed | 1 | 1 | 5 | 2 | 3 |
| annual_income | 79450.0 | 46992.0 | 78796.0 | 83843.0 | 85012.0 |
| employment_status | unemployed | employed | unemployed | NaN | self_employed |
| location | south_america | south_america | australia | australia | europe |
| interaction_count | 4 | 1 | 3 | 1 | 3 |
| lead_score | 0.94 | 0.8 | 0.69 | 0.87 | 0.62 |
| converted | 1 | 0 | 1 | 0 | 1 |


In [6]:
df.shape

(1462, 9)

In [7]:
df.isnull().sum()

lead_source                 128
industry                    134
number_of_courses_viewed      0
annual_income               181
employment_status           100
location                     63
interaction_count             0
lead_score                    0
converted                     0
dtype: int64

In [8]:
df.dtypes

lead_source                  object
industry                     object
number_of_courses_viewed      int64
annual_income               float64
employment_status            object
location                     object
interaction_count             int64
lead_score                  float64
converted                     int64
dtype: object

In [9]:
df['number_of_courses_viewed'].nunique()

10

In [10]:
df['interaction_count'].nunique()

12

In [11]:
df['lead_score'].nunique()

101

In [12]:
df.columns

Index(['lead_source', 'industry', 'number_of_courses_viewed', 'annual_income',
       'employment_status', 'location', 'interaction_count', 'lead_score',
       'converted'],
      dtype='object')

In [13]:
categorical = ['lead_source', 'industry', 'employment_status', 'location']

In [14]:
numerical = ['number_of_courses_viewed', 'annual_income', 'interaction_count', 'lead_score']

In [15]:
string_columns = list(df.dtypes[df.dtypes == 'object'].index)

In [16]:
for col in string_columns:
    df[col] = df[col].replace('', np.nan)
    df[col] = df[col].fillna('NA')

In [17]:
df['lead_source'].isnull().sum()

np.int64(0)

In [18]:
for col in numerical:
    df[col] = df[col].fillna(0.0)

In [19]:
df['annual_income'].isnull().sum()

np.int64(0)

# Question 1

What is the most frequent observation (mode) for the column industry?

* NA
* technology
* healthcare
* retail


In [20]:
df['industry'].unique()

array(['NA', 'retail', 'healthcare', 'education', 'manufacturing',
       'technology', 'other', 'finance'], dtype=object)

In [21]:
df['industry'].value_counts()

industry
retail           203
finance          200
other            198
healthcare       187
education        187
technology       179
manufacturing    174
NA               134
Name: count, dtype: int64

In [22]:
most_common_industry = df['industry'].mode()[0]
most_common_industry

'retail'

# Question 2

Create the correlation matrix for the numerical features of your dataset. In a correlation matrix, you compute the correlation coefficient between every pair of features.

What are the two features that have the biggest correlation?

* interaction_count and lead_score
* number_of_courses_viewed and lead_score
* number_of_courses_viewed and interaction_count
* annual_income and interaction_count

Only consider the pairs above when answering this question.
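
As a rough sketch of how the comparison can be automated, the snippet below builds a correlation matrix for a toy frame and picks the candidate pair with the largest absolute coefficient. The columns `a`, `b`, `c` and their values are made up, and reading "biggest" as "largest in absolute value" is an assumption.

```python
import pandas as pd

# Toy data with made-up values.
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10],  # exact multiple of 'a', so correlation is 1
    'c': [5, 3, 4, 1, 2],
})

# Only these candidate pairs are compared.
pairs = [('a', 'b'), ('a', 'c'), ('b', 'c')]

# Pearson correlation between every pair of columns.
corr = toy.corr()

# Absolute value, so strong negative correlations also count as "big".
strength = {pair: abs(corr.loc[pair[0], pair[1]]) for pair in pairs}
best_pair = max(strength, key=strength.get)
```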

In [23]:
df[numerical].corr()

| | number_of_courses_viewed | annual_income | interaction_count | lead_score |
|---|---|---|---|---|
| number_of_courses_viewed | 1.0 | 0.00977 | -0.023565 | -0.004879 |
| annual_income | 0.00977 | 1.0 | 0.027036 | 0.01561 |
| interaction_count | -0.023565 | 0.027036 | 1.0 | 0.009888 |
| lead_score | -0.004879 | 0.01561 | 0.009888 | 1.0 |


In [24]:
pairs = [
    ('interaction_count', 'lead_score'),
    ('number_of_courses_viewed', 'lead_score'),
    ('number_of_courses_viewed', 'interaction_count'),
    ('annual_income', 'interaction_count')
]

for a, b in pairs:
    print(f"{a} ↔ {b}: {df[a].corr(df[b]):.3f}")

interaction_count ↔ lead_score: 0.010
number_of_courses_viewed ↔ lead_score: -0.005
number_of_courses_viewed ↔ interaction_count: -0.024
annual_income ↔ interaction_count: 0.027


The two features with the biggest correlation are annual_income and interaction_count (0.027).

# Split the data

* Split your data into train/val/test sets with a 60%/20%/20% distribution.
* Use Scikit-Learn for that (the train_test_split function) and set the seed to 42.
* Make sure that the target value `converted` is not in your dataframe.
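
A 60%/20%/20% split takes two calls to `train_test_split`, and the second `test_size` is the part that is easy to get wrong: 20% of the whole dataset is 25% of the 80% left after the test set is removed. A minimal sketch on a made-up 100-row frame:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame with 100 rows so the split sizes are easy to read.
toy = pd.DataFrame({'x': np.arange(100), 'converted': np.arange(100) % 2})

# First split: hold out 20% of all rows as the test set.
full_train, test = train_test_split(toy, test_size=0.2, random_state=42)

# Second split: 0.25 of the remaining 80% equals 20% of the original data.
train, val = train_test_split(full_train, test_size=0.25, random_state=42)

sizes = (len(train), len(val), len(test))
```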


In [25]:
from sklearn.model_selection import train_test_split

In [26]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
len(df_full_train), len(df_test)

(1169, 293)

In [27]:
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=42)
len(df_train) + len(df_val)

1169

In [28]:
assert len(df) == (len(df_train) + len(df_val) + len(df_test))

In [29]:
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)


In [30]:
y_train = df_train.converted.values
y_val = df_val.converted.values
y_test = df_test.converted.values

In [31]:
del df_train['converted']
del df_val['converted']
del df_test['converted']

In [32]:
assert 'converted' not in df_train.columns
assert 'converted' not in df_val.columns
assert 'converted' not in df_test.columns


# Question 3

* Calculate the mutual information score between `converted` and the other categorical variables in the dataset. Use the training set only.
* Round the scores to 2 decimals using round(score, 2).

Which of these variables has the biggest mutual information score?

* industry
* location
* lead_source
* employment_status
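
To make the score less of a black box, here is a from-scratch sketch of mutual information for two discrete sequences, using natural logarithms. The function name `mutual_information` and the toy inputs are hypothetical. It is meant to show what is being measured, not to replace `mutual_info_score`.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equally long discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of (x, y) combinations
    px = Counter(xs)              # marginal counts of x
    py = Counter(ys)              # marginal counts of y
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # Compare the joint probability with what independence would predict.
        mi += p_xy * log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


# The category fully determines the target: MI equals log(2).
dependent = mutual_information(['a', 'a', 'b', 'b'], [1, 1, 0, 0])

# The category says nothing about the target: MI equals 0.
independent = mutual_information(['a', 'b', 'a', 'b'], [1, 1, 0, 0])
```

A higher score means that knowing the category removes more uncertainty about the target.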


In [33]:
from sklearn.metrics import mutual_info_score

In [36]:
def mutual_info_converted_score(series):
    return mutual_info_score(series, y_train)

In [37]:
mi = df_train[categorical].apply(mutual_info_converted_score)
mi.sort_values(ascending=False)

lead_source          0.035396
employment_status    0.012938
industry             0.011575
location             0.004464
dtype: float64

lead_source has the biggest mutual information score.

# Question 4

* Now let's train a logistic regression.
* Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.
* Fit the model on the training dataset.
   * To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
   * `model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)`
* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

What accuracy did you get?

* 0.64
* 0.74
* 0.84
* 0.94


In [38]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

In [39]:
dv = DictVectorizer(sparse=False)

In [45]:
train_dict = df_train.to_dict(orient='records')

In [46]:
X_train = dv.fit_transform(train_dict)

In [48]:
val_dict = df_val.to_dict(orient='records')

In [49]:
X_val = dv.transform(val_dict)

In [50]:
model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)

In [51]:
model.fit(X_train, y_train)

| Parameter | Value |
|---|---|
| penalty | 'l2' |
| dual | False |
| tol | 0.0001 |
| C | 1.0 |
| fit_intercept | True |
| intercept_scaling | 1 |
| class_weight | None |
| random_state | 42 |
| solver | 'liblinear' |
| max_iter | 1000 |


In [52]:
y_pred = model.predict_proba(X_val)[:, 1]

In [53]:
y_pred

array([0.61192163, 0.79982617, 0.53021344, 0.47131479, 0.57066132,
       0.44227169, 0.87127669, 0.84883115, 0.83290037, 0.61497801,
       0.54968027, 0.78153088, 0.69039786, 0.77017122, 0.5265944 ,
       0.91706425, 0.53170635, 0.42123049, 0.30146455, 0.84881583,
       0.79488653, 0.73670375, 0.44527211, 0.64838383, 0.4176882 ,
       0.75393418, 0.90166116, 0.33903049, 0.43181431, 0.9680681 ,
       0.92018714, 0.37487988, 0.652301  , 0.90650057, 0.75164117,
       0.64202121, 0.82250075, 0.83375553, 0.659116  , 0.30978853,
       0.78942264, 0.35546366, 0.96517758, 0.63389304, 0.51274195,
       0.53230533, 0.82287785, 0.744074  , 0.73452313, 0.68955217,
       0.46964443, 0.84539252, 0.55635243, 0.92637871, 0.65258021,
       0.61526273, 0.63816995, 0.28304018, 0.48049824, 0.57890618,
       0.35497342, 0.62175051, 0.38960778, 0.61156056, 0.85304278,
       0.75430136, 0.89185954, 0.71946459, 0.95387623, 0.89209517,
       0.75277088, 0.33850139, 0.61376593, 0.51622275, 0.64088

In [54]:
converted_decision = (y_pred >= 0.5)

In [55]:
(y_val == converted_decision).mean()

np.float64(0.6996587030716723)

In [58]:
df_pred = pd.DataFrame()
df_pred['probability'] = y_pred
df_pred['prediction'] = converted_decision.astype(int)
df_pred['actual'] = y_val
df_pred['correct'] = df_pred.prediction == df_pred.actual

In [59]:
df_pred

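| | probability | prediction | actual | correct |
|---|---|---|---|---|
| 0 | 0.611922 | 1 | 0 | False |
| 1 | 0.799826 | 1 | 1 | True |
| 2 | 0.530213 | 1 | 0 | False |
| 3 | 0.471315 | 0 | 0 | True |
| 4 | 0.570661 | 1 | 0 | False |
| ... | ... | ... | ... | ... |
| 288 | 0.419342 | 0 | 0 | True |
| 289 | 0.710539 | 1 | 1 | True |
| 290 | 0.418185 | 0 | 0 | True |
| 291 | 0.744835 | 1 | 1 | True |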


In [60]:
original_score = df_pred.correct.mean()

In [65]:
print(f'Accuracy = {round(original_score, 2)}')

Accuracy = 0.7


# Question 5

* Let's find the least useful feature using the feature elimination technique.
* Train a model using the same features and parameters as in Q4 (without rounding).
* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
* For each feature, calculate the difference between the original accuracy and the accuracy without the feature.

Which of the following features has the smallest difference?

* 'industry'
* 'employment_status'
* 'lead_score'

Note: The difference doesn't have to be positive.
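
The procedure can be sketched independently of any particular model. In the snippet below, `elimination_differences` and `toy_score` are hypothetical names, and the per-feature contributions are made up. In practice the scorer would retrain the model and return validation accuracy. The sketch keeps the signed difference and then picks the feature whose difference is smallest in absolute value, which is one possible reading of the note above.

```python
def elimination_differences(features, score_fn):
    """Return the baseline score and, per feature, baseline minus score without it.

    score_fn is a hypothetical callable that takes a list of feature names
    and returns a validation accuracy for a model trained on them.
    """
    baseline = score_fn(features)
    diffs = {}
    for feature in features:
        remaining = [f for f in features if f != feature]
        diffs[feature] = baseline - score_fn(remaining)
    return baseline, diffs


# Stand-in scorer with made-up contributions per feature.
contribution = {'a': 0.10, 'b': 0.00, 'c': -0.02}


def toy_score(feats):
    return 0.5 + sum(contribution[f] for f in feats)


baseline, diffs = elimination_differences(list(contribution), toy_score)

# The feature whose removal changes the score the least.
least_useful = min(diffs, key=lambda f: abs(diffs[f]))
```

A negative difference means the model scored better without the feature, which is why the sign is worth keeping before deciding how to rank.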

In [66]:
columns = df_train.columns.to_list()
columns

['lead_source',
 'industry',
 'number_of_courses_viewed',
 'annual_income',
 'employment_status',
 'location',
 'interaction_count',
 'lead_score']

In [68]:
scores = pd.DataFrame()
for feature in columns:
    # drop() returns a new frame, so the originals stay untouched
    df_train_cut = df_train.drop(columns=[feature])
    df_val_cut = df_val.drop(columns=[feature])
    
    dv_cut = DictVectorizer(sparse=False)
    train_dict_cut = df_train_cut.to_dict(orient='records')
    X_train_cut = dv_cut.fit_transform(train_dict_cut)
    val_dict_cut = df_val_cut.to_dict(orient='records')
    X_val_cut = dv_cut.transform(val_dict_cut)
    
    model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
    model.fit(X_train_cut, y_train)
    
    y_pred = model.predict_proba(X_val_cut)[:, 1] 
    prediction = (y_pred >= 0.5).astype(int)
    correct = (prediction == y_val)
    score = correct.mean()
    scores = pd.concat([
    scores,
    pd.DataFrame([{
        'eliminated_feature': feature,
        'accuracy': score,
        'difference': abs(original_score - score)
    }])], ignore_index=True)


In [69]:
scores

| | eliminated_feature | accuracy | difference |
|---|---|---|---|
| 0 | lead_source | 0.703072 | 0.003413 |
| 1 | industry | 0.699659 | 0.0 |
| 2 | number_of_courses_viewed | 0.556314 | 0.143345 |
| 3 | annual_income | 0.853242 | 0.153584 |
| 4 | employment_status | 0.696246 | 0.003413 |
| 5 | location | 0.709898 | 0.010239 |
| 6 | interaction_count | 0.556314 | 0.143345 |
| 7 | lead_score | 0.706485 | 0.006826 |


In [70]:
min_diff = scores.difference.min()
scores[scores.difference == min_diff]

| | eliminated_feature | accuracy | difference |
|---|---|---|---|
| 1 | industry | 0.699659 | 0.0 |


# Question 6

* Now let's train a regularized logistic regression.
* Let's try the following values of the parameter C: [0.01, 0.1, 1, 10, 100].
* Train models using all the features as in Q4.
* Calculate the accuracy on the validation dataset and round it to 3 decimal digits.

Which of these C leads to the best accuracy on the validation set?

* 0.01
* 0.1
* 1
* 10
* 100

Note: If there are multiple options, select the smallest C.
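
The tie-breaking rule can be sketched as a small helper. The name `best_c` and the accuracy values are hypothetical, chosen only so that several values of C tie after rounding.

```python
def best_c(accuracy_by_c, digits=3):
    """Pick the C with the highest rounded accuracy; ties go to the smallest C."""
    rounded = {c: round(acc, digits) for c, acc in accuracy_by_c.items()}
    best = max(rounded.values())
    return min(c for c, acc in rounded.items() if acc == best)


# Made-up accuracies: after rounding to 3 decimals, 0.1, 1, 10 and 100 all tie at 0.7.
hypothetical = {0.01: 0.6991, 0.1: 0.7004, 1: 0.7001, 10: 0.6998, 100: 0.6998}
chosen = best_c(hypothetical)
```

Rounding before comparing matters here: without it, tiny floating-point differences would decide the winner instead of the tie-breaking rule.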


In [71]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

In [75]:
scores = pd.DataFrame()
C_values = [0.01, 0.1, 1, 10, 100]

for C in C_values:
    # Pass the loop value of C; a constant here would train the same model five times
    model = LogisticRegression(penalty='l2', solver='liblinear', C=C, max_iter=1000, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict_proba(X_val)[:, 1] 
    prediction = (y_pred >= 0.5).astype(int)
    correct = (prediction == y_val)
    score = correct.mean()
    scores = pd.concat([
    scores,
    pd.DataFrame([{
        'C': C,
        'accuracy': score,
        'difference': abs(original_score - score)
    }])], ignore_index=True)


In [76]:
scores

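| | C | accuracy | difference |
|---|---|---|---|
| 0 | 0.01 | 0.699659 | 0.0 |
| 1 | 0.1 | 0.699659 | 0.0 |
| 2 | 1.0 | 0.699659 | 0.0 |
| 3 | 10.0 | 0.699659 | 0.0 |
| 4 | 100.0 | 0.699659 | 0.0 |

All five rows are identical because every model in this run was trained with C=1.0 rather than the loop value.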
