## Homework

> Note: sometimes your answer doesn't match one of the options exactly.
> That's fine.
> Select the option that's closest to your solution.

### Dataset

In this homework, we will use the lead scoring dataset Bank Marketing dataset. Download it from [here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv).

Or you can do it with `wget`:

```bash
wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv
```

In this dataset our desired target for classification task will be `converted` variable - has the client signed up to the platform or not.

### Data preparation

* Check if the missing values are presented in the features.
* If there are missing values:
    * For categorical features, replace them with 'NA'
    * For numerical features, replace with with 0.0 

### Question 1

What is the most frequent observation (mode) for the column `industry`?

- `NA`
- `technology`
- `healthcare`
- `retail`

### Question 2

Create the [correlation matrix](https://www.google.com/search?q=correlation+matrix) for the numerical features of your dataset.
In a correlation matrix, you compute the correlation coefficient between every pair of features.

What are the two features that have the biggest correlation?

- `interaction_count` and `lead_score`
- `number_of_courses_viewed` and `lead_score`
- `number_of_courses_viewed` and `interaction_count`
- `annual_income` and `interaction_count`

Only consider the pairs above when answering this question.

### Split the data

- Split your data in train/val/test sets with 60%/20%/20% distribution.
- Use Scikit-Learn for that (the `train_test_split` function) and set the seed to `42`.
- Make sure that the target value `converted` is not in your dataframe.

### Question 3

- Calculate the mutual information score between `converted` and other categorical variables in the dataset. Use the training set only.
- Round the scores to 2 decimals using `round(score, 2)`.

Which of these variables has the biggest mutual information score?

- `industry`
- `location`
- `lead_source`
- `employment_status`

### Question 4

- Now let's train a logistic regression.
- Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.
- Fit the model on the training dataset.
  - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
  - `model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)`
- Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

What accuracy did you get?

- 0.64
- 0.74
- 0.84
- 0.94

### Question 5

- Let's find the least useful feature using the _feature elimination_ technique.
- Train a model using the same features and parameters as in Q4 (without rounding).
- Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
- For each feature, calculate the difference between the original accuracy and the accuracy without the feature.

Which of following feature has the smallest difference?

- `'industry'`
- `'employment_status'`
- `'lead_score'`

> **Note**: The difference doesn't have to be positive.

### Question 6

- Now let's train a regularized logistic regression.
- Let's try the following values of the parameter `C`: `[0.01, 0.1, 1, 10, 100]`.
- Train models using all the features as in Q4.
- Calculate the accuracy on the validation dataset and round it to 3 decimal digits.

Which of these `C` leads to the best accuracy on the validation set?

- 0.01
- 0.1
- 1
- 10
- 100

> **Note**: If there are multiple options, select the smallest `C`.

## Submit the results

- Submit your results here: https://courses.datatalks.club/ml-zoomcamp-2025/homework/hw03
- If your answer doesn't match options exactly, select the closest one


In [1]:
## import libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## import data
#!wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv

In [2]:
## Loading data
df = pd.read_csv("course_lead_scoring.csv")

print(df.dtypes)

lead_source                  object
industry                     object
number_of_courses_viewed      int64
annual_income               float64
employment_status            object
location                     object
interaction_count             int64
lead_score                  float64
converted                     int64
dtype: object


In [3]:
print(df.head())
print("##########")
print(df.isna().sum()) # number of null values


for feature in df.columns:
    # impute numerical features
    if df[feature].dtype == 'int64':
        df[feature] = df[feature].fillna(0)
    elif df[feature].dtype == 'float64':
        df[feature] = df[feature].fillna(0.0)
    # impute categorical features
    elif df[feature].dtype == 'object':
        df[feature] = df[feature].fillna('NA')

print(df.isna().sum())


    lead_source    industry  number_of_courses_viewed  annual_income  \
0      paid_ads         NaN                         1        79450.0   
1  social_media      retail                         1        46992.0   
2        events  healthcare                         5        78796.0   
3      paid_ads      retail                         2        83843.0   
4      referral   education                         3        85012.0   

  employment_status       location  interaction_count  lead_score  converted  
0        unemployed  south_america                  4        0.94          1  
1          employed  south_america                  1        0.80          0  
2        unemployed      australia                  3        0.69          1  
3               NaN      australia                  1        0.87          0  
4     self_employed         europe                  3        0.62          1  
##########
lead_source                 128
industry                    134
number_of_courses_

In [4]:
### question 1 find the mode of the column `industry`
df.industry.mode()

0    retail
Name: industry, dtype: object

In [5]:
df.dtypes

lead_source                  object
industry                     object
number_of_courses_viewed      int64
annual_income               float64
employment_status            object
location                     object
interaction_count             int64
lead_score                  float64
converted                     int64
dtype: object

In [6]:
### question 2 find the correlation of numerical features 
df_num = df.select_dtypes(include=['int64', 'float64'])
df_num.corr()

Unnamed: 0,number_of_courses_viewed,annual_income,interaction_count,lead_score,converted
number_of_courses_viewed,1.0,0.00977,-0.023565,-0.004879,0.435914
annual_income,0.00977,1.0,0.027036,0.01561,0.053131
interaction_count,-0.023565,0.027036,1.0,0.009888,0.374573
lead_score,-0.004879,0.01561,0.009888,1.0,0.193673
converted,0.435914,0.053131,0.374573,0.193673,1.0


In [7]:
df['interaction_count'].dtypes

dtype('int64')

In [8]:
### question 2 print out correlation score
corr_mat = df_num.corr()
print(corr_mat.loc[('interaction_count', 'lead_score')])
print(corr_mat.loc[('number_of_courses_viewed', 'lead_score')])
print(corr_mat.loc[('number_of_courses_viewed', 'interaction_count')])
print(corr_mat.loc[('annual_income', 'interaction_count')])

0.009888182496913131
-0.004878998354681276
-0.023565222882888037
0.02703647240481443


In [9]:
### question 3
from sklearn.model_selection import train_test_split

np.random.seed(42)

# train test split into feature matrix (X) and target (y)
target = 'converted'
X = df.drop(columns=[target])
y = df[target]

# First Split: Train (60%) vs. Temporary (40%) followed by Second Split: Test (50%) and Val (50%)
X_train, X_temp, y_train, y_temp = train_test_split(X,y, test_size=0.4, random_state=42, stratify=y)
X_test, X_val, y_test, y_val = train_test_split(X_temp,y_temp, test_size=0.4, random_state=42, stratify=y_temp)


In [10]:
df.dtypes

lead_source                  object
industry                     object
number_of_courses_viewed      int64
annual_income               float64
employment_status            object
location                     object
interaction_count             int64
lead_score                  float64
converted                     int64
dtype: object

In [11]:
### question 3 extract categorical columns
cat_columns = ['lead_source','industry','employment_status','location']
X_train_categorical = X_train[cat_columns]

# impute missing values
X_train_categorical = X_train_categorical.fillna('unknown')

from sklearn.feature_selection import mutual_info_classif

# use one hot encoding to convert categorical columns 
X_train_encoded = pd.get_dummies(X_train_categorical, dtype=float)
            
### calculate mutual information scores
mi_scores = mutual_info_classif(X_train_encoded, y_train, random_state=42)
mi_series = pd.Series(mi_scores,index = X_train_encoded.columns) # added column names as index

print(mi_series.round(2).sort_values(ascending=False))

lead_source_events                 0.03
lead_source_organic_search         0.02
industry_NA                        0.02
lead_source_NA                     0.01
lead_source_paid_ads               0.01
lead_source_referral               0.01
industry_education                 0.01
industry_healthcare                0.01
industry_finance                   0.01
industry_manufacturing             0.01
industry_other                     0.01
location_asia                      0.01
employment_status_student          0.01
location_africa                    0.01
lead_source_social_media           0.00
employment_status_NA               0.00
employment_status_employed         0.00
industry_technology                0.00
industry_retail                    0.00
employment_status_unemployed       0.00
employment_status_self_employed    0.00
location_NA                        0.00
location_australia                 0.00
location_europe                    0.00
location_middle_east               0.00


In [12]:
### question 4 fit the logistic regression model

from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)


# extract the categorical variables
X_val_categorical = X_val[cat_columns]
# impute missing values
X_val_categorical = X_val_categorical.fillna('NA')
# one hot encoding on the val data
X_val_encoded = pd.get_dummies(X_val_categorical, dtype=float)
# align the columns and reindex
X_val_aligned = X_val_encoded.reindex(columns=X_train_encoded.columns, fill_value=0)

# fit the model and get the accuracy score
model.fit(X_train_encoded,y_train)
model_accuracy = model.score(X_val_aligned,y_val)
print(round(model_accuracy,2))

0.62


In [13]:
### question 5 
from sklearn.feature_selection import RFE
rfe = RFE(estimator=LogisticRegression(), n_features_to_select=4, step=1, verbose=0)
rfe.fit(X_train_encoded,y_train)

print(dict(zip(X_train_encoded.columns, rfe.ranking_))) # print their importance rank
print(X_train_encoded.columns[rfe.support_])  # print the remaining features

acc = rfe.score(X_val_aligned,y_val)
print("Validation accuracy:", acc)

{'lead_source_NA': np.int64(10), 'lead_source_events': np.int64(4), 'lead_source_organic_search': np.int64(3), 'lead_source_paid_ads': np.int64(1), 'lead_source_referral': np.int64(1), 'lead_source_social_media': np.int64(14), 'industry_NA': np.int64(2), 'industry_education': np.int64(1), 'industry_finance': np.int64(6), 'industry_healthcare': np.int64(9), 'industry_manufacturing': np.int64(24), 'industry_other': np.int64(22), 'industry_retail': np.int64(13), 'industry_technology': np.int64(15), 'employment_status_NA': np.int64(21), 'employment_status_employed': np.int64(16), 'employment_status_self_employed': np.int64(11), 'employment_status_student': np.int64(17), 'employment_status_unemployed': np.int64(1), 'location_NA': np.int64(8), 'location_africa': np.int64(20), 'location_asia': np.int64(23), 'location_australia': np.int64(5), 'location_europe': np.int64(12), 'location_middle_east': np.int64(18), 'location_north_america': np.int64(19), 'location_south_america': np.int64(7)}
Ind

In [14]:
### question 5
orig_acc = 0.62
# compute the accuracy by dropping one feature at a time
accuracy = {}

cols = ['industry', 'employment_status','lead_source']

for feature in cols:
    # take out the feature relevant columns, i.e industry_x, industry_y 
    cols_to_drop = [c for c in X_train_encoded.columns if c.startswith(feature)]
    X_train_drop = X_train_encoded.drop(columns=cols_to_drop)
    X_val_drop = X_val_encoded.drop(columns=cols_to_drop)

    # Retrain model without that feature
    model = LogisticRegression(solver='liblinear', max_iter=1000)
    model.fit(X_train_drop, y_train)

    # compute the new accuracy score
    acc = model.score(X_val_drop, y_val)
    diff = acc - orig_acc

    # attach the difference for that feature
    accuracy[feature] = diff

accu = pd.DataFrame(accuracy.items(), columns=['Feature', 'Accuracy Difference'])
print(accu)

             Feature  Accuracy Difference
0           industry             0.016752
1  employment_status             0.003932
2        lead_source             0.021026


In [15]:
### question 6
# Logistic Regression with L1 regularization and cross-validation
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score

# Create and fit the LassoCV model on the training set
logreg_cv = LogisticRegressionCV(
    penalty='l1', 
    solver='saga',  # 'saga' supports L1 regularization
    cv=5,           
    Cs=[0.01, 0.1, 1, 10, 100],
    scoring='accuracy',
    max_iter=1000
)

# Fit the model
logreg_cv.fit(X_train_encoded, y_train)

# Predict and evaluate
y_pred = logreg_cv.predict(X_val_encoded)
accuracy = accuracy_score(y_val, y_pred)

print(f"Accuracy: {accuracy:.3f}")


Accuracy: 0.632


In [16]:
print("Best C selected by CV:", logreg_cv.C_[0])

Best C selected by CV: 1.0
