# Homework

### Dataset
In this homework, we will use the lead scoring dataset. Download it from here.

Or you can do it with wget:

```bash
wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv
```

In this dataset, the target for the classification task is the `converted` variable: whether the client has signed up to the platform or not.

### Data preparation

* Check whether any features contain missing values.
* If there are missing values:
    * For categorical features, replace them with 'NA'
    * For numerical features, replace them with 0.0
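
The rule above can be sketched on a tiny made-up frame before applying it to the real data. The frame `toy` and its values are hypothetical; only the column names echo the dataset. This is one possible vectorized form, not the only way to do it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: values are invented for illustration.
toy = pd.DataFrame({
    "industry": ["retail", None, "finance"],
    "annual_income": [50000.0, np.nan, 72000.0],
    "converted": [1, 0, 1],
})

# Separate feature columns by kind, leaving the target out.
features = [c for c in toy.columns if c != "converted"]
num_cols = [c for c in features if pd.api.types.is_numeric_dtype(toy[c])]
cat_cols = [c for c in features if c not in num_cols]

# Fill every column of each kind in one call instead of looping.
toy[cat_cols] = toy[cat_cols].fillna("NA")
toy[num_cols] = toy[num_cols].fillna(0.0)

missing_after = int(toy.isna().sum().sum())
print(toy)
print("Missing values left:", missing_after)
```

Filling with the string 'NA' turns "missing" into its own category, which later shows up as a separate one-hot column.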

In [148]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mutual_info_score

import matplotlib.pyplot as plt

%matplotlib inline

In [149]:
path = "data/homework/course_lead_scoring.csv"
data = "https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv"

In [150]:
!wget -O $path $data

--2025-10-13 21:59:47--  https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 80876 (79K) [text/plain]
Saving to: ‘data/homework/course_lead_scoring.csv’


2025-10-13 21:59:47 (6.68 MB/s) - ‘data/homework/course_lead_scoring.csv’ saved [80876/80876]



In [151]:
df = pd.read_csv(path)
df.head()

| | lead_source | industry | number_of_courses_viewed | annual_income | employment_status | location | interaction_count | lead_score | converted |
|---|---|---|---|---|---|---|---|---|---|
| 0 | paid_ads | NaN | 1 | 79450.0 | unemployed | south_america | 4 | 0.94 | 1 |
| 1 | social_media | retail | 1 | 46992.0 | employed | south_america | 1 | 0.8 | 0 |
| 2 | events | healthcare | 5 | 78796.0 | unemployed | australia | 3 | 0.69 | 1 |
| 3 | paid_ads | retail | 2 | 83843.0 | NaN | australia | 1 | 0.87 | 0 |
| 4 | referral | education | 3 | 85012.0 | self_employed | europe | 3 | 0.62 | 1 |


In [152]:
df.dtypes

lead_source                  object
industry                     object
number_of_courses_viewed      int64
annual_income               float64
employment_status            object
location                     object
interaction_count             int64
lead_score                  float64
converted                     int64
dtype: object

In [153]:
numerical = [col for col in df.select_dtypes(include=['int64', 'float64']).columns if col != 'converted']
categorical = list(df.select_dtypes(include=['object', 'bool']).columns)
numerical, categorical

(['number_of_courses_viewed',
  'annual_income',
  'interaction_count',
  'lead_score'],
 ['lead_source', 'industry', 'employment_status', 'location'])

In [154]:
df.isna().sum()

lead_source                 128
industry                    134
number_of_courses_viewed      0
annual_income               181
employment_status           100
location                     63
interaction_count             0
lead_score                    0
converted                     0
dtype: int64

In [155]:
for cat in categorical:
    print(cat)
    print("Nulls:")
    print(df[cat].isna().sum())
    df[cat] = df[cat].fillna('NA')
    print("Nulls after fillna:")
    print(df[cat].isna().sum())
    print()

lead_source
Nulls:
128
Nulls after fillna:
0

industry
Nulls:
134
Nulls after fillna:
0

employment_status
Nulls:
100
Nulls after fillna:
0

location
Nulls:
63
Nulls after fillna:
0



In [156]:
for num in numerical:
    print(num)
    print("Nulls:")
    print(df[num].isna().sum())
    df[num] = df[num].fillna(0.0)
    print("Nulls after fillna:")
    print(df[num].isna().sum())
    print()

number_of_courses_viewed
Nulls:
0
Nulls after fillna:
0

annual_income
Nulls:
181
Nulls after fillna:
0

interaction_count
Nulls:
0
Nulls after fillna:
0

lead_score
Nulls:
0
Nulls after fillna:
0



In [157]:
df.isna().sum()

lead_source                 0
industry                    0
number_of_courses_viewed    0
annual_income               0
employment_status           0
location                    0
interaction_count           0
lead_score                  0
converted                   0
dtype: int64

### Question 1

What is the most frequent observation (mode) for the column `industry`?

- `NA`
- `technology`
- `healthcare`
- `retail`
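
As a small sketch of how the mode can be read off, the snippet below uses a hypothetical series `s` with invented values. It shows two routes: `value_counts().idxmax()` and `Series.mode()`. The latter returns a Series rather than a scalar because several values can tie for most frequent.

```python
import pandas as pd

# Hypothetical values, not taken from the dataset.
s = pd.Series(["retail", "finance", "retail", "NA", "finance", "retail"])

counts = s.value_counts()        # frequencies, most frequent first
most_frequent = counts.idxmax()  # label with the highest count
modes = s.mode()                 # Series of all values tied for the top count

print(counts)
print("Most frequent:", most_frequent)
```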

In [158]:
df.describe()

| | number_of_courses_viewed | annual_income | interaction_count | lead_score | converted |
|---|---|---|---|---|---|
| count | 1462.0 | 1462.0 | 1462.0 | 1462.0 | 1462.0 |
| mean | 2.031464 | 52472.172367 | 2.976744 | 0.506108 | 0.619015 |
| std | 1.449717 | 24254.34703 | 1.681564 | 0.288465 | 0.485795 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1.0 | 44097.25 | 2.0 | 0.2625 | 0.0 |
| 50% | 2.0 | 57449.5 | 3.0 | 0.51 | 1.0 |
| 75% | 3.0 | 68241.0 | 4.0 | 0.75 | 1.0 |
| max | 9.0 | 109899.0 | 11.0 | 1.0 | 1.0 |


In [159]:
df.industry.value_counts()

industry
retail           203
finance          200
other            198
healthcare       187
education        187
technology       179
manufacturing    174
NA               134
Name: count, dtype: int64

In [160]:
df.industry.mode()

0    retail
Name: industry, dtype: object

### Question 2

Create the [correlation matrix](https://www.google.com/search?q=correlation+matrix) for the numerical features of your dataset. 
In a correlation matrix, you compute the correlation coefficient between every pair of features.

What are the two features that have the biggest correlation?

- `interaction_count` and `lead_score`
- `number_of_courses_viewed` and `lead_score`
- `number_of_courses_viewed` and `interaction_count`
- `annual_income` and `interaction_count`

Only consider the pairs above when answering this question.
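
When the candidate pairs are not fixed in advance, the strongest pair can be read off the full matrix. The sketch below uses synthetic data (the columns `a`, `b`, `c` are hypothetical) and keeps only the strict upper triangle so that each pair appears once and the diagonal of ones is ignored.

```python
import numpy as np
import pandas as pd

# Synthetic data: b is built to be strongly (negatively) correlated with a.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "a": x,
    "b": -x + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),  # independent noise
})

corr = toy.corr()

# Strict upper triangle: drops the diagonal and the mirrored lower half.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# Rank by absolute value: a strong negative correlation is still strong.
strongest_pair = pairs.abs().idxmax()
print(strongest_pair, pairs.loc[strongest_pair])
```

Ranking by absolute value is a design choice; if "biggest" is meant as the largest signed value, drop the `.abs()` call.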

In [161]:
corr_matrix = df[numerical].corr()

pairs = [
    ("interaction_count", "lead_score"),
    ("number_of_courses_viewed", "lead_score"),
    ("number_of_courses_viewed", "interaction_count"),
    ("annual_income", "interaction_count")
]

pair_corrs_df = pd.DataFrame([
    {"Feature 1": a, "Feature 2": b, "Correlation": corr_matrix.loc[a, b]}
    for a, b in pairs
])

pair_corrs_df = pair_corrs_df.reindex(pair_corrs_df["Correlation"].abs().sort_values(ascending=False).index)

print(pair_corrs_df)

                  Feature 1          Feature 2  Correlation
3             annual_income  interaction_count     0.027036
2  number_of_courses_viewed  interaction_count    -0.023565
0         interaction_count         lead_score     0.009888
1  number_of_courses_viewed         lead_score    -0.004879


In [162]:
top_pair = pair_corrs_df.iloc[0]
print("Highest correlation pair:", top_pair["Feature 1"], "and", top_pair["Feature 2"], "→", top_pair["Correlation"])

Highest correlation pair: annual_income and interaction_count → 0.02703647240481443


#### Split the data
Split your data into train/val/test sets with a 60%/20%/20% distribution.
Use Scikit-Learn for that (the `train_test_split` function) and set the seed to 42.
Make sure that the target value `y` is not in your dataframe.
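
Because `train_test_split` produces only two parts per call, a 60/20/20 split takes two calls. After 20% is held out for test, 80% remains, so validation must take 20/80 = 25% of the remainder. A minimal sketch on a hypothetical array of 1000 row indices:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(1000)  # hypothetical row indices standing in for a dataframe

# First cut: 20% of everything goes to the test set.
full_train, test = train_test_split(rows, test_size=0.2, random_state=42)

# Second cut: 20% of the original is 25% of the remaining 80% (0.2 / 0.8).
train, val = train_test_split(full_train, test_size=0.25, random_state=42)

sizes = (len(train), len(val), len(test))
print(sizes)
```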

In [163]:
seed = 42
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=seed)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=seed)
len(df_train), len(df_val), len(df_test)

(876, 293, 293)

In [164]:
# Resetting index
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [165]:
# Target values
y_train = df_train.converted.values
y_val = df_val.converted.values
y_test = df_test.converted.values


In [166]:
del df_train['converted']
del df_val['converted']
del df_test['converted']

In [167]:
len(y_train)

876

### Question 3

* Calculate the mutual information score between `y` and other categorical variables in the dataset. Use the training set only.
* Round the scores to 2 decimals using `round(score, 2)`.

Which of these variables has the biggest mutual information score?
  
- `industry`
- `location`
- `lead_source`
- `employment_status`
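
A minimal sketch of the computation on invented data, assuming the target has already been separated from the feature frame (the names `toy_train` and `toy_y` are hypothetical). Here `lead_source` determines the target exactly, while `location` is only weakly related to it.

```python
import pandas as pd
from sklearn.metrics import mutual_info_score

# Hypothetical training features and target; values are made up.
toy_train = pd.DataFrame({
    "lead_source": ["ads", "ads", "referral", "referral", "ads", "referral"],
    "location":    ["eu", "asia", "eu", "asia", "eu", "asia"],
})
toy_y = [0, 0, 1, 1, 0, 1]

# Score each categorical column against the target and round to 2 decimals.
scores = {
    col: round(mutual_info_score(toy_train[col], toy_y), 2)
    for col in toy_train.columns
}
best = max(scores, key=scores.get)
print(scores, "->", best)
```

`mutual_info_score` is symmetric in its two arguments, so the order of feature and target does not change the result.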

In [168]:
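# Note: this scores against df_full_train (train + validation rows); the
# training-only variant asked for above is mutual_info_score(df_train[cat], y_train).
for cat in categorical:
    print(cat)
    print("Mutual Information:")
    print(round(mutual_info_score(df_full_train[cat], df_full_train.converted), 2))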


lead_source
Mutual Information:
0.03
industry
Mutual Information:
0.01
employment_status
Mutual Information:
0.01
location
Mutual Information:
0.0


### Question 4

* Now let's train a logistic regression.
* Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.
* Fit the model on the training dataset.
    - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    - `model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)`
* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

What accuracy did you get?

- 0.64
- 0.74
- 0.84
- 0.94

In [169]:
dicts = df_full_train[categorical].to_dict(orient='records')
dicts

[{'lead_source': 'social_media',
  'industry': 'manufacturing',
  'employment_status': 'self_employed',
  'location': 'australia'},
 {'lead_source': 'events',
  'industry': 'retail',
  'employment_status': 'student',
  'location': 'north_america'},
 {'lead_source': 'social_media',
  'industry': 'education',
  'employment_status': 'NA',
  'location': 'europe'},
 {'lead_source': 'referral',
  'industry': 'education',
  'employment_status': 'employed',
  'location': 'australia'},
 {'lead_source': 'paid_ads',
  'industry': 'healthcare',
  'employment_status': 'employed',
  'location': 'europe'},
 {'lead_source': 'organic_search',
  'industry': 'manufacturing',
  'employment_status': 'self_employed',
  'location': 'asia'},
 {'lead_source': 'social_media',
  'industry': 'education',
  'employment_status': 'unemployed',
  'location': 'north_america'},
 {'lead_source': 'organic_search',
  'industry': 'finance',
  'employment_status': 'self_employed',
  'location': 'north_america'},
 {'lead_sou

In [170]:
dv = DictVectorizer(sparse=False)

In [171]:
dv.fit(dicts)

DictVectorizer(sparse=False)


In [172]:
dv.get_feature_names_out()

array(['employment_status=NA', 'employment_status=employed',
       'employment_status=self_employed', 'employment_status=student',
       'employment_status=unemployed', 'industry=NA',
       'industry=education', 'industry=finance', 'industry=healthcare',
       'industry=manufacturing', 'industry=other', 'industry=retail',
       'industry=technology', 'lead_source=NA', 'lead_source=events',
       'lead_source=organic_search', 'lead_source=paid_ads',
       'lead_source=referral', 'lead_source=social_media', 'location=NA',
       'location=africa', 'location=asia', 'location=australia',
       'location=europe', 'location=middle_east',
       'location=north_america', 'location=south_america'], dtype=object)

In [173]:
dv.transform(dicts)

array([[0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 1., ..., 0., 0., 0.]], shape=(1169, 27))

In [174]:
train_dicts = df_train[categorical + numerical].to_dict(orient='records')

In [175]:
train_dicts[0]

{'lead_source': 'paid_ads',
 'industry': 'retail',
 'employment_status': 'student',
 'location': 'middle_east',
 'number_of_courses_viewed': 0,
 'annual_income': 58472.0,
 'interaction_count': 5,
 'lead_score': 0.03}

In [176]:
dv.fit(train_dicts)

DictVectorizer(sparse=False)


In [177]:
dv.get_feature_names_out()

array(['annual_income', 'employment_status=NA',
       'employment_status=employed', 'employment_status=self_employed',
       'employment_status=student', 'employment_status=unemployed',
       'industry=NA', 'industry=education', 'industry=finance',
       'industry=healthcare', 'industry=manufacturing', 'industry=other',
       'industry=retail', 'industry=technology', 'interaction_count',
       'lead_score', 'lead_source=NA', 'lead_source=events',
       'lead_source=organic_search', 'lead_source=paid_ads',
       'lead_source=referral', 'lead_source=social_media', 'location=NA',
       'location=africa', 'location=asia', 'location=australia',
       'location=europe', 'location=middle_east',
       'location=north_america', 'location=south_america',
       'number_of_courses_viewed'], dtype=object)

In [178]:
X_train = dv.fit_transform(train_dicts)
X_train.shape

(876, 31)

In [179]:
val_dicts = df_val[categorical + numerical].to_dict(orient='records')

In [180]:
X_val = dv.transform(val_dicts)

#### Train the model

In [181]:
model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)

In [182]:
model.fit(X_train, y_train)

LogisticRegression(max_iter=1000, random_state=42, solver='liblinear')


In [183]:
model.intercept_[0]

np.float64(-0.0691472802783609)

In [184]:
model.coef_[0].round(3)

array([-0.   , -0.015,  0.034,  0.003,  0.012, -0.103, -0.025,  0.049,
       -0.02 , -0.013, -0.003, -0.009, -0.032, -0.016,  0.311,  0.051,
        0.02 , -0.012, -0.012, -0.115,  0.08 , -0.03 ,  0.004, -0.011,
       -0.011, -0.006,  0.008,  0.006, -0.033, -0.025,  0.454])

In [185]:
model.predict(X_train)

array([1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1,
       1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1,
       1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
       0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,

In [186]:
model.predict_proba(X_train)

array([[0.42085657, 0.57914343],
       [0.12716509, 0.87283491],
       [0.41183893, 0.58816107],
       ...,
       [0.25265784, 0.74734216],
       [0.3302157 , 0.6697843 ],
       [0.14407823, 0.85592177]], shape=(876, 2))

In [187]:
y_pred = model.predict_proba(X_val)[:, 1]

In [188]:
converted_decision = (y_pred >= 0.5)
converted_decision

array([ True,  True,  True, False,  True, False,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True, False,
       False,  True,  True,  True, False,  True, False,  True,  True,
       False, False,  True,  True, False,  True,  True,  True,  True,
        True,  True,  True, False,  True, False,  True,  True,  True,
        True,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,  True, False, False,  True, False,  True, False,
        True,  True,  True,  True,  True,  True,  True,  True, False,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
       False,  True,  True,  True, False,  True,  True,  True,  True,
        True,  True,  True,  True, False,  True,  True,  True,  True,
        True, False, False,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True, False,  True, False,  True,
       False,  True,  True, False,  True,  True, False,  True,  True,
       False,  True,

In [189]:
converted_decision_mean = round((y_val == converted_decision).mean(),2)
converted_decision_mean

np.float64(0.7)

### Question 5

- Let's find the least useful feature using the _feature elimination_ technique.
- Train a model using the same features and parameters as in Q4 (without rounding).
- Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
- For each feature, calculate the difference between the original accuracy and the accuracy without the feature.

Which of the following features has the smallest difference?

- `'industry'`
- `'employment_status'`
- `'lead_score'`
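
One way to read the procedure above: start from the full feature set, drop exactly one column at a time, retrain, and subtract the new accuracy from the unrounded baseline. The sketch below follows that reading on synthetic data. The helper names `validation_accuracy` and `elimination_diffs` and the columns `signal`, `noise`, `group` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression


def validation_accuracy(df_tr, y_tr, df_va, y_va, columns):
    """Fit on the given columns and return unrounded validation accuracy."""
    dv = DictVectorizer(sparse=False)
    X_tr = dv.fit_transform(df_tr[columns].to_dict(orient="records"))
    X_va = dv.transform(df_va[columns].to_dict(orient="records"))
    model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)
    model.fit(X_tr, y_tr)
    return (model.predict(X_va) == y_va).mean()


def elimination_diffs(df_tr, y_tr, df_va, y_va, all_features, candidates):
    """Return the baseline accuracy and, per candidate, baseline minus accuracy without it."""
    base = validation_accuracy(df_tr, y_tr, df_va, y_va, all_features)
    diffs = {}
    for feature in candidates:
        kept = [c for c in all_features if c != feature]  # every column except one
        diffs[feature] = base - validation_accuracy(df_tr, y_tr, df_va, y_va, kept)
    return base, diffs


# Synthetic demo: only `signal` carries information about the target.
rng = np.random.default_rng(42)
n = 400
demo = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
    "group": rng.choice(["a", "b"], size=n),
})
y_demo = (demo["signal"] > 0).astype(int).to_numpy()
df_tr, df_va = demo.iloc[:300], demo.iloc[300:]
y_tr, y_va = y_demo[:300], y_demo[300:]

all_cols = ["signal", "noise", "group"]
base_acc, diffs = elimination_diffs(df_tr, y_tr, df_va, y_va, all_cols, all_cols)
print(base_acc, diffs)
```

The differences are returned signed; whether "smallest" means the smallest signed or smallest absolute value is left to the reader's interpretation of the question.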

In [190]:
features = ['industry', 'employment_status', 'lead_score']


# Test removing each feature
for f in features:
    reduced_features = [x for x in features if x != f]
    print(reduced_features)

['employment_status', 'lead_score']
['industry', 'lead_score']
['industry', 'employment_status']


In [191]:
features = ['industry', 'employment_status', 'lead_score']

# Drop one candidate at a time, retrain, and compare with the Q4 accuracy.
# Caveats: the base set here is only the three candidates plus `numerical`, not the
# full Q4 feature set; `lead_score` is also in `numerical`, so it is never actually
# removed and is selected twice in the other two iterations; and both accuracies are
# rounded to 2 decimals although the question asks for unrounded values.
for f in features:
    reduced_features = [x for x in features if x != f]
    train_dicts = df_train[reduced_features + numerical].to_dict(orient='records')
    val_dicts = df_val[reduced_features + numerical].to_dict(orient='records')

    # fit_transform already fits the vectorizer, so no separate dv.fit call is needed
    X_train = dv.fit_transform(train_dicts)
    X_val = dv.transform(val_dicts)

    model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
    model.fit(X_train, y_train)

    y_pred = model.predict_proba(X_val)[:, 1]
    accuracy = round(((y_pred >= 0.5) == y_val).mean(), 2)
    diff = converted_decision_mean - accuracy
    print(f"{f:20s} -> Mean: {accuracy:.4f} | Diff: {diff:.4f}")


industry             -> Mean: 0.7000 | Diff: 0.0000
employment_status    -> Mean: 0.6900 | Diff: 0.0100
lead_score           -> Mean: 0.7000 | Diff: 0.0000




### Question 6

- Now let's train a regularized logistic regression.
- Let's try the following values of the parameter `C`: `[0.01, 0.1, 1, 10, 100]`.
- Train models using all the features as in Q4.
- Calculate the accuracy on the validation dataset and round it to 3 decimal digits.

Which of these `C` leads to the best accuracy on the validation set?

- 0.01
- 0.1
- 1
- 10
- 100

> **Note**: If there are multiple options, select the smallest `C`.
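
In Scikit-Learn, `C` is the inverse of the regularization strength, so smaller values regularize more strongly. The tie-breaking rule in the note can be sketched by ordering candidates on the pair (negative accuracy, `C`). The accuracies in the dictionary below are hypothetical placeholders, not results from this dataset.

```python
# Hypothetical rounded validation accuracies keyed by C; 0.1 and 1 tie for best.
toy_accuracies = {0.01: 0.70, 0.1: 0.75, 1: 0.75, 10: 0.74, 100: 0.74}

# Highest accuracy first; among equal accuracies the smallest C wins.
best_c = min(toy_accuracies, key=lambda c: (-toy_accuracies[c], c))
print("Best C:", best_c)
```

Comparing rounded accuracies matters here: two values that differ only beyond the third decimal count as a tie, and the smaller `C` is then preferred.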

In [192]:
# Convert train and validation data to dictionaries (use the same columns used in model)
train_dicts = df_train[categorical + numerical].to_dict(orient='records')
val_dicts   = df_val[categorical + numerical].to_dict(orient='records')
test_dicts  = df_test[categorical + numerical].to_dict(orient='records')

# Fit vectorizer only once on the training data
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(train_dicts)
X_val   = dv.transform(val_dicts)
X_test  = dv.transform(test_dicts)

print("X_train shape:", X_train.shape)
print("X_val   shape:", X_val.shape)
print("X_test  shape:", X_test.shape)
print("Number of features (dv):", len(dv.feature_names_))

X_train shape: (876, 31)
X_val   shape: (293, 31)
X_test  shape: (293, 31)
Number of features (dv): 31


In [193]:

# 1) Targets sanity
print("y_train shape:", y_train.shape, "mean:", y_train.mean(), "unique:", np.unique(y_train))
print("y_val   shape:", y_val.shape,   "mean:", y_val.mean(),   "unique:", np.unique(y_val))

# 2) Shape alignment
print("X_train rows vs y_train len:", X_train.shape[0], len(y_train))
print("X_val rows vs y_val len:    ", X_val.shape[0], len(y_val))

# 3) Check for constant columns in X_train
const_cols = np.sum(np.all(X_train == X_train[0, :], axis=0))
print("Constant columns in X_train:", const_cols)

# 4) Any NaNs?
print("Any NaNs in X_train?", np.isnan(X_train).any())
print("Any NaNs in X_val?  ", np.isnan(X_val).any())


y_train shape: (876,) mean: 0.6244292237442922 unique: [0 1]
y_val   shape: (293,) mean: 0.5563139931740614 unique: [0 1]
X_train rows vs y_train len: 876 876
X_val rows vs y_val len:     293 293
Constant columns in X_train: 0
Any NaNs in X_train? False
Any NaNs in X_val?   False


In [194]:
C_values = [0.01, 0.1, 1, 10, 100]
accuracies = {}

for c in C_values:
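    # Note: this sweep uses solver='lbfgs' with max_iter=5000 rather than the
    # liblinear settings from Q4, so the accuracies are not directly comparable with Q4.
    model = LogisticRegression(solver='lbfgs', C=c, max_iter=5000, random_state=42)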
    model.fit(X_train, y_train)
    y_pred_proba = model.predict_proba(X_val)[:, 1]
    y_pred = (y_pred_proba >= 0.5).astype(int)
    acc = (y_val == y_pred).mean()
    accuracies[c] = round(acc, 3)
    print(f"C={c}: accuracy={accuracies[c]:.3f}")

# Choose best C (smallest C in ties)
best_acc = max(accuracies.values())
best_cs = [c for c, a in accuracies.items() if a == best_acc]
best_c = min(best_cs)
print("\nAll accuracies:", accuracies)
print("Best:", best_c)



C=0.01: accuracy=0.812
C=0.1: accuracy=0.843
C=1: accuracy=0.857
C=10: accuracy=0.853
C=100: accuracy=0.853
All accuracies: {0.01: np.float64(0.812), 0.1: np.float64(0.843), 1: np.float64(0.857), 10: np.float64(0.853), 100: np.float64(0.853)}
Best: 1
