## **Machine Learning Zoomcamp 2025**

## **Homework 3**

**Dataset**

In [31]:
!wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv

--2025-11-13 07:42:45--  https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 80876 (79K) [text/plain]
Saving to: ‘course_lead_scoring.csv.1’


2025-11-13 07:42:45 (5.46 MB/s) - ‘course_lead_scoring.csv.1’ saved [80876/80876]



In [32]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [33]:
df = pd.read_csv('course_lead_scoring.csv')

In [34]:
df.head(3)

|  | lead_source | industry | number_of_courses_viewed | annual_income | employment_status | location | interaction_count | lead_score | converted |
|---|---|---|---|---|---|---|---|---|---|
| 0 | paid_ads | NaN | 1 | 79450.0 | unemployed | south_america | 4 | 0.94 | 1 |
| 1 | social_media | retail | 1 | 46992.0 | employed | south_america | 1 | 0.8 | 0 |
| 2 | events | healthcare | 5 | 78796.0 | unemployed | australia | 3 | 0.69 | 1 |


In [35]:
df.dtypes


| column | dtype |
|---|---|
| lead_source | object |
| industry | object |
| number_of_courses_viewed | int64 |
| annual_income | float64 |
| employment_status | object |
| location | object |
| interaction_count | int64 |
| lead_score | float64 |
| converted | int64 |


**Data Preparation**

Check whether the features contain missing values.

If there are missing values:

*  For categorical features, replace them with 'NA'
*  For numerical features, replace them with 0.0
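
The fill rule above can be sketched without pandas on a plain list of records. This is only a minimal illustration, assuming missing values are represented as `None`; the `fill_missing` helper and the two sample rows are hypothetical and not part of the homework code.

```python
def fill_missing(records, categorical, numerical):
    """Return copies of the records with None replaced by 'NA' or 0.0."""
    filled = []
    for row in records:
        new_row = dict(row)  # copy so the input rows stay untouched
        for col in categorical:
            if new_row.get(col) is None:
                new_row[col] = 'NA'
        for col in numerical:
            if new_row.get(col) is None:
                new_row[col] = 0.0
        filled.append(new_row)
    return filled


# Hypothetical sample rows, shaped like the dataset's columns.
rows = [
    {'industry': None, 'annual_income': 79450.0},
    {'industry': 'retail', 'annual_income': None},
]
clean = fill_missing(rows, categorical=['industry'], numerical=['annual_income'])
print(clean)
```

Using the string 'NA' keeps the missing category as its own level, so it can later be one-hot encoded like any other value.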




In [36]:
categorical_columns = list(df.dtypes[df.dtypes == 'object'].index)

for c in categorical_columns:
    df[c] = df[c].fillna('NA')

numerical_columns = ['number_of_courses_viewed','annual_income','interaction_count','lead_score','converted']

for c in numerical_columns:
    df[c] = df[c].fillna(0.0)

**Question 1**

What is the most frequent observation (mode) for the column industry?


*   NA
*   technology
*   healthcare
*   retail

In [37]:
df.industry.mode()


|  | industry |
|---|---|
| 0 | retail |
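
As a cross-check on what a mode is, the same idea can be sketched with `collections.Counter`. The `mode` helper and the sample list below are hypothetical; the helper returns every tied value in sorted order, which as far as I know mirrors how `Series.mode()` reports ties.

```python
from collections import Counter


def mode(values):
    """Return all most-frequent values, sorted, so ties are visible."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, n in counts.items() if n == top)


# Hypothetical sample, not the real column.
sample = ['retail', 'NA', 'retail', 'healthcare', 'NA', 'retail']
print(mode(sample))
```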


**Question 2**

Create the correlation matrix for the numerical features of your dataset. In a correlation matrix, you compute the correlation coefficient between every pair of features.

What are the two features that have the biggest correlation?

*   interaction_count and lead_score
*   number_of_courses_viewed and lead_score
*   number_of_courses_viewed and interaction_count
*   annual_income and interaction_count

Only consider the pairs above when answering this question.

In [38]:
corr_matrix = df[numerical_columns].corr()

pairs = [ ("interaction_count", "lead_score"),
          ("number_of_courses_viewed", "lead_score"),
          ("number_of_courses_viewed", "interaction_count"),
          ("annual_income", "interaction_count")]

for pair in pairs:
    print(f"Correlation between {pair[0]} and {pair[1]}: {corr_matrix.loc[pair[0], pair[1]]:.3f}")

Correlation between interaction_count and lead_score: 0.010
Correlation between number_of_courses_viewed and lead_score: -0.005
Correlation between number_of_courses_viewed and interaction_count: -0.024
Correlation between annual_income and interaction_count: 0.027


**The two features that have the biggest correlation are *"annual_income and interaction_count"* (0.027).**
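
Each entry of the correlation matrix is a Pearson coefficient (the default method of `DataFrame.corr`). A minimal library-free sketch of that coefficient follows; the `pearson` helper and the toy lists are illustrative assumptions, not the homework data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy values: a perfectly linear pair and a weaker one.
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
print(pearson([1, 2, 3], [1, 3, 2]))
```

Values near 0, like all four pairs above, indicate almost no linear relationship, so the "biggest" correlation here is still very weak.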

**Split the data**

Split your data into train/val/test sets with a 60%/20%/20% distribution.

Use Scikit-Learn for that (the train_test_split function) and set the seed to 42.

Make sure that the target value converted is not in your dataframe.

In [39]:
from sklearn.model_selection import train_test_split

In [40]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=42)  # test set: 20% of all rows
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=42)  # validation: 25% of the remaining 80% = 20% overall

In [41]:
len(df_train), len(df_val), len(df_test)

(876, 293, 293)
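
The second split uses 0.25 because the validation set should be 20% of the whole dataset, and 0.2 / 0.8 = 0.25 of what remains after the test set is removed. The arithmetic can be sketched as below. The `split_sizes` helper is hypothetical, the total of 1462 rows is just the sum of the three sizes above, and rounding the held-out count up is my assumption about how `train_test_split` sizes its test portion.

```python
import math


def split_sizes(n_rows, test_fraction=0.2, val_fraction_of_rest=0.25):
    """Sizes of train/val/test after two successive hold-out splits."""
    # Assumption: the held-out part is rounded up, the rest goes to training.
    n_test = math.ceil(test_fraction * n_rows)
    n_full_train = n_rows - n_test
    n_val = math.ceil(val_fraction_of_rest * n_full_train)
    n_train = n_full_train - n_val
    return n_train, n_val, n_test


print(split_sizes(1462))  # (876, 293, 293)
```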

In [42]:
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [43]:
y_train = df_train.converted.values
y_val = df_val.converted.values
y_test = df_test.converted.values

del df_train['converted']
del df_val['converted']
del df_test['converted']

**Question 3**

Calculate the mutual information score between converted and other categorical variables in the dataset. Use the training set only.
Round the scores to 2 decimals using round(score, 2).

Which of these variables has the biggest mutual information score?

* industry
* location
* lead_source
* employment_status


In [44]:
from sklearn.metrics import mutual_info_score

print(f'Mutual info score for industry: {mutual_info_score(df_train.industry, y_train).round(3)}')
print(f'Mutual info score for location: {mutual_info_score(df_train.location, y_train).round(3)}')
print(f'Mutual info score for lead_source: {mutual_info_score(df_train.lead_source, y_train).round(3)}')
print(f'Mutual info score for employment_status: {mutual_info_score(df_train.employment_status, y_train).round(3)}')

Mutual info score for industry: 0.012
Mutual info score for location: 0.004
Mutual info score for lead_source: 0.035
Mutual info score for employment_status: 0.013


***"lead_source"* has the biggest mutual information score (0.035).**
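
For reference, the mutual information between two discrete variables can be computed directly from counts. The sketch below uses the natural logarithm, which I believe matches `mutual_info_score`; the `mutual_information` helper and the toy sequences are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), n_xy in count_xy.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with raw counts.
        mi += (n_xy / n) * math.log(n_xy * n / (count_x[x] * count_y[y]))
    return mi


# A category that fully determines the target, and one that tells nothing.
print(mutual_information(['a', 'b', 'a', 'b'], [1, 0, 1, 0]))
print(mutual_information(['a', 'a', 'b', 'b'], [0, 1, 0, 1]))
```

A score of 0 means the variable carries no information about the target, so the small values above suggest that no single categorical feature is strongly predictive on its own.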

**Question 4**

Now let's train a logistic regression.
Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.

Fit the model on the training dataset.

To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:

model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)

Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

What accuracy did you get?

* 0.64
* 0.74
* 0.84
* 0.94

In [45]:
from sklearn.feature_extraction import DictVectorizer

In [46]:
dv = DictVectorizer(sparse=False)

train_dict = df_train.to_dict(orient='records')
X_train = dv.fit_transform(train_dict)

val_dict = df_val.to_dict(orient='records')
X_val = dv.transform(val_dict)
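
To make the encoding step concrete, here is a rough library-free sketch of what one-hot encoding a list of records produces. It assumes the `column=value` naming and sorted feature order that `DictVectorizer` uses by default; the `one_hot_encode` helper and the sample rows are hypothetical.

```python
def one_hot_encode(records):
    """Encode dict records: strings become 0/1 columns, numbers pass through."""
    # Collect feature names: "column=value" for strings, the column for numbers.
    names = set()
    for row in records:
        for col, value in row.items():
            names.add(f'{col}={value}' if isinstance(value, str) else col)
    names = sorted(names)
    index = {name: i for i, name in enumerate(names)}

    matrix = []
    for row in records:
        vector = [0.0] * len(names)
        for col, value in row.items():
            if isinstance(value, str):
                vector[index[f'{col}={value}']] = 1.0
            else:
                vector[index[col]] = float(value)
        matrix.append(vector)
    return names, matrix


# Hypothetical rows with one categorical and one numerical feature.
rows = [
    {'lead_source': 'paid_ads', 'lead_score': 0.94},
    {'lead_source': 'events', 'lead_score': 0.69},
]
names, X = one_hot_encode(rows)
print(names)
print(X)
```

Unlike this sketch, the notebook fits the vectorizer on the training records only and then reuses it for validation, so both matrices share the same columns.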

In [47]:
from sklearn.linear_model import LogisticRegression

In [48]:
model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
model.fit(X_train, y_train)

In [49]:
y_pred = model.predict_proba(X_val)[:, 1]

baseline_accuracy = (y_val == (y_pred >= 0.5)).mean().round(2)

**The accuracy is *0.70*, so the closest listed option is *0.74*.**
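
The accuracy expression in the cell above thresholds the predicted probabilities at 0.5 and counts how often the resulting label matches the target. A minimal sketch of the same calculation, with a hypothetical `accuracy_at_threshold` helper and made-up values:

```python
def accuracy_at_threshold(y_true, y_prob, threshold=0.5):
    """Share of rows where the thresholded probability equals the label."""
    predictions = [int(p >= threshold) for p in y_prob]
    correct = sum(int(t == p) for t, p in zip(y_true, predictions))
    return correct / len(y_true)


# Made-up labels and probabilities, not model output.
y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.6, 0.4, 0.5, 0.1]
print(accuracy_at_threshold(y_true, y_prob))  # 0.6
```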

**Question 5**

Let's find the least useful feature using the feature elimination technique.

Train a model using the same features and parameters as in Q4 (without rounding).

Now exclude each feature from this set and train a model without it. Record the accuracy for each model.

For each feature, calculate the difference between the original accuracy and the accuracy without the feature.

Which of the following features has the smallest difference?

* 'industry'
* 'employment_status'
* 'lead_score'

Note: The difference doesn't have to be positive.

In [50]:
features = ['industry', 'employment_status', 'lead_score']

In [51]:
for feature in features:
    train_dict = df_train.drop(feature, axis=1).to_dict(orient='records')
    X_train = dv.fit_transform(train_dict)

    val_dict = df_val.drop(feature, axis=1).to_dict(orient='records')
    X_val = dv.transform(val_dict)

    model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
    model.fit(X_train, y_train)

    y_pred = model.predict_proba(X_val)[:, 1]

    accuracy = (y_val == (y_pred >= 0.5)).mean().round(2)
    accuracy_diff = baseline_accuracy - accuracy
    print(f'Accuracy difference after eliminating {feature}: {accuracy_diff}')

Accuracy difference after eliminating industry: 0.0
Accuracy difference after eliminating employment_status: 0.0
Accuracy difference after eliminating lead_score: -0.010000000000000009


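***"Industry"* has the smallest difference (0.0).**

Note that with accuracies rounded to 2 decimals, employment_status shows the same difference of 0.0; the unrounded accuracies that the question asks for are needed to separate the two.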
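
The elimination procedure can be sketched generically, keeping the accuracies unrounded. Everything here is illustrative: `elimination_differences` is a hypothetical helper, `toy_score` stands in for "train a model and return its validation accuracy", the feature names and contributions are made up, and reading "smallest difference" as smallest absolute difference is an assumption.

```python
def elimination_differences(all_features, candidates, score_fn):
    """Baseline score minus the score obtained without each candidate feature."""
    baseline = score_fn(all_features)
    diffs = {}
    for feature in candidates:
        remaining = [f for f in all_features if f != feature]
        diffs[feature] = baseline - score_fn(remaining)
    return diffs


# Made-up contributions, chosen only to make the example deterministic.
contributions = {'f1': 0.0, 'f2': 0.125, 'f3': -0.0625}


def toy_score(features):
    """Hypothetical scorer replacing model training and evaluation."""
    return 0.5 + sum(contributions[f] for f in features)


diffs = elimination_differences(['f1', 'f2', 'f3'], ['f1', 'f2', 'f3'], toy_score)
least_useful = min(diffs, key=lambda f: abs(diffs[f]))
print(diffs)
print(least_useful)
```

Computing the baseline inside the same function, without rounding, avoids ties that are only an artifact of rounding both accuracies before subtracting.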

**Question 6**

Now let's train a regularized logistic regression.

Let's try the following values of the parameter C: [0.01, 0.1, 1, 10, 100].

Train models using all the features as in Q4.

Calculate the accuracy on the validation dataset and round it to 3 decimal digits.

Which of these C leads to the best accuracy on the validation set?

* 0.01
* 0.1
* 1
* 10
* 100

Note: If there are multiple options, select the smallest C.

In [53]:
C = [0.01, 0.1, 1, 10, 100]

train_dict = df_train.to_dict(orient='records')
X_train = dv.fit_transform(train_dict)
val_dict = df_val.to_dict(orient='records')
X_val = dv.transform(val_dict)

for c in C:
    model = LogisticRegression(solver='liblinear', C=c, max_iter=1000, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict_proba(X_val)[:, 1]
    accuracy = (y_val == (y_pred >= 0.5)).mean().round(3)
    print(f'Accuracy for C={c}: {accuracy}')

Accuracy for C=0.01: 0.7
Accuracy for C=0.1: 0.7
Accuracy for C=1: 0.7
Accuracy for C=10: 0.7
Accuracy for C=100: 0.7


**All five values of C give the same validation accuracy (0.7), so the smallest one, *"C = 0.01"*, is selected.**
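
The tie-breaking rule from the note (best accuracy first, then the smallest C) can be expressed as a single `min` over a composite key. The `best_c` helper is hypothetical; the `results` list simply restates the accuracies printed above.

```python
def best_c(results):
    """Pick the C with the highest accuracy, preferring smaller C on ties."""
    # Negating the accuracy lets min() rank by best accuracy, then smallest C.
    return min(results, key=lambda pair: (-pair[1], pair[0]))[0]


# (C, validation accuracy) pairs as printed by the loop above.
results = [(0.01, 0.7), (0.1, 0.7), (1, 0.7), (10, 0.7), (100, 0.7)]
print(best_c(results))  # 0.01
```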