# Homework

In this dataset, the target for the classification task is the converted variable: whether the client has signed up to the platform or not.

Data preparation:

- Check whether the features contain missing values.
- If there are missing values:
    - For categorical features, replace them with 'NA'
    - For numerical features, replace them with 0.0

In [1]:
import pandas as pd
import numpy as np

In [22]:
df = pd.read_csv('../data/course_lead_scoring.csv')

# Missing values per column (before)
missing_before = df.isna().sum()

categorical_cols = df.select_dtypes(include=['object', 'category']).columns
numeric_cols = df.select_dtypes(include=[np.number]).columns

# Fill missing values
if len(categorical_cols) > 0:
    df[categorical_cols] = df[categorical_cols].fillna('NA')
if len(numeric_cols) > 0:
    df[numeric_cols] = df[numeric_cols].fillna(0.0)

missing_after = df.isna().sum()

missing_before, missing_after


(lead_source                 128
 industry                    134
 number_of_courses_viewed      0
 annual_income               181
 employment_status           100
 location                     63
 interaction_count             0
 lead_score                    0
 converted                     0
 dtype: int64,
 lead_source                 0
 industry                    0
 number_of_courses_viewed    0
 annual_income               0
 employment_status           0
 location                    0
 interaction_count           0
 lead_score                  0
 converted                   0
 dtype: int64)

## Question 1
### What is the most frequent observation (mode) for the column industry?

- NA
- technology
- healthcare
- retail

In [23]:
mode_industry = df['industry'].mode()[0]
mode_industry

'retail'
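
As a quick cross-check, value_counts() shows the full frequency table behind the mode. This is a minimal sketch on a made-up column (the values are illustrative, not the course data). Note that after the fill step, 'NA' counts as a regular category.

```python
import pandas as pd

# Toy column standing in for df['industry'] (illustrative values only)
industry = pd.Series(['retail', 'NA', 'retail', 'finance', 'NA', 'retail'])

counts = industry.value_counts()   # frequencies, sorted from most to least common
top_value = industry.mode()[0]     # mode() returns a Series; [0] takes its first entry

print(counts)
print(top_value)
```

If two categories tie for the highest count, mode() returns both, so [0] only picks the first of them.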

## Question 2
Create the correlation matrix for the numerical features of your dataset. In a correlation matrix, you compute the correlation coefficient between every pair of features.

What are the two features that have the biggest correlation?

- interaction_count and lead_score
- number_of_courses_viewed and lead_score
- number_of_courses_viewed and interaction_count
- annual_income and interaction_count

Only consider the pairs above when answering this question.

Split the data
- Split your data in train/val/test sets with 60%/20%/20% distribution.
- Use Scikit-Learn for that (the train_test_split function) and set the seed to 42.
- Make sure that the target value y is not in your dataframe.

In [26]:
numeric_cols = df.select_dtypes(include=[np.number]).columns
print("Numeric Columns found:")
print(numeric_cols.tolist())

df_numeric = df[numeric_cols]
corr_matrix = df_numeric.corr()
corr_matrix

Numeric Columns found:
['number_of_courses_viewed', 'annual_income', 'interaction_count', 'lead_score', 'converted']


                          number_of_courses_viewed  annual_income  interaction_count  lead_score  converted
number_of_courses_viewed                  1.000000       0.009770          -0.023565   -0.004879   0.435914
annual_income                             0.009770       1.000000           0.027036    0.015610   0.053131
interaction_count                        -0.023565       0.027036           1.000000    0.009888   0.374573
lead_score                               -0.004879       0.015610           0.009888    1.000000   0.193673
converted                                 0.435914       0.053131           0.374573    0.193673   1.000000


In [27]:
correlations_sorted = corr_matrix.unstack().sort_values(ascending=False).drop_duplicates()
print("Ordered correlations (excluding perfect correlations of 1.0):")
print(correlations_sorted[correlations_sorted < 1.0].head(10))

pairs_to_check = [
    ('interaction_count', 'lead_score'),
    ('number_of_courses_viewed', 'lead_score'), 
    ('number_of_courses_viewed', 'interaction_count'),
    ('annual_income', 'interaction_count')
]

print("\nCorrelations of the specific pairs mentioned:")
for col1, col2 in pairs_to_check:
    if col1 in corr_matrix.columns and col2 in corr_matrix.columns:
        correlation = corr_matrix.loc[col1, col2]
        print(f"{col1} vs {col2}: {correlation:.4f}")
    else:
        print(f"Columns {col1} or {col2} not found in the dataset")

Ordered correlations (excluding perfect correlations of 1.0):
converted                 number_of_courses_viewed    0.435914
interaction_count         converted                   0.374573
lead_score                converted                   0.193673
annual_income             converted                   0.053131
                          interaction_count           0.027036
                          lead_score                  0.015610
lead_score                interaction_count           0.009888
annual_income             number_of_courses_viewed    0.009770
lead_score                number_of_courses_viewed   -0.004879
number_of_courses_viewed  interaction_count          -0.023565
dtype: float64

Correlations of the specific pairs mentioned:
interaction_count vs lead_score: 0.0099
number_of_courses_viewed vs lead_score: -0.0049
number_of_courses_viewed vs interaction_count: -0.0236
annual_income vs interaction_count: 0.0270


In [28]:
correlations = {
    "interaction_count and lead_score": abs(corr_matrix.loc['interaction_count', 'lead_score']),
    "number_of_courses_viewed and lead_score": abs(corr_matrix.loc['number_of_courses_viewed', 'lead_score']),
    "number_of_courses_viewed and interaction_count": abs(corr_matrix.loc['number_of_courses_viewed', 'interaction_count']),
    "annual_income and interaction_count": abs(corr_matrix.loc['annual_income', 'interaction_count'])
}

for pair, corr_value in correlations.items():
    print(f"{pair}: {corr_value:.4f}")

print()
max_pair = max(correlations, key=correlations.get)
max_correlation = correlations[max_pair]

print(f"'{max_pair}' with a correlation of {max_correlation:.4f}")

print("Original values (with sign):")
print(f"interaction_count and lead_score: {corr_matrix.loc['interaction_count', 'lead_score']:.4f}")
print(f"number_of_courses_viewed and lead_score: {corr_matrix.loc['number_of_courses_viewed', 'lead_score']:.4f}")
print(f"number_of_courses_viewed and interaction_count: {corr_matrix.loc['number_of_courses_viewed', 'interaction_count']:.4f}")
print(f"annual_income and interaction_count: {corr_matrix.loc['annual_income', 'interaction_count']:.4f}")

interaction_count and lead_score: 0.0099
number_of_courses_viewed and lead_score: 0.0049
number_of_courses_viewed and interaction_count: 0.0236
annual_income and interaction_count: 0.0270

'annual_income and interaction_count' with a correlation of 0.0270
Original values (with sign):
interaction_count and lead_score: 0.0099
number_of_courses_viewed and lead_score: -0.0049
number_of_courses_viewed and interaction_count: -0.0236
annual_income and interaction_count: 0.0270
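
Sorting the unstacked matrix and calling drop_duplicates() removes mirrored pairs only because they share a value, so two different pairs with identical coefficients would also collapse. One alternative is to mask the matrix to its strict upper triangle. This is a minimal sketch on synthetic data (the frame and its columns a, b, c are hypothetical, not the course dataset):

```python
import numpy as np
import pandas as pd

# Synthetic numeric frame (hypothetical columns, not the course data)
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.1, size=200),  # built to correlate strongly with 'a'
    'c': rng.normal(size=200),
})

corr = toy.corr()

# Keep only the strict upper triangle so each pair appears once
# and the diagonal of 1.0 values is excluded
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Rank by absolute value so strong negative correlations are not overlooked
best_pair = pairs.abs().idxmax()
print(best_pair, round(pairs.loc[best_pair], 4))
```

Restricting the result to a fixed list of candidate pairs, as the question requires, then only needs a lookup in pairs.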


In [30]:
from sklearn.model_selection import train_test_split
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=42) 
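
The second call uses test_size=0.25 because 25% of the remaining 80% equals 20% of the full data. The split above leaves the target inside each frame. This is a minimal sketch of the same split that also separates y, on a toy frame (the toy_ names and data are illustrative assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with a binary target (illustrative data only)
toy = pd.DataFrame({
    'x': np.arange(100),
    'converted': np.tile([0, 1], 50),
})

# 20% held out for test, then 25% of the remaining 80% (= 20% overall) for validation
toy_full_train, toy_test = train_test_split(toy, test_size=0.2, random_state=42)
toy_train, toy_val = train_test_split(toy_full_train, test_size=0.25, random_state=42)

# reset_index returns new frames, so popping the target does not touch the originals
toy_train, toy_val, toy_test = (
    part.reset_index(drop=True) for part in (toy_train, toy_val, toy_test)
)

# pop() removes the target column from the frame and returns it
y_train = toy_train.pop('converted').values
y_val = toy_val.pop('converted').values
y_test = toy_test.pop('converted').values

print(len(toy_train), len(toy_val), len(toy_test))
```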

## Question 3
Calculate the mutual information score between y and other categorical variables in the dataset. Use the training set only.
Round the scores to 2 decimals using round(score, 2).
Which of these variables has the biggest mutual information score?

In [31]:
from sklearn.feature_selection import mutual_info_classif
from sklearn.preprocessing import LabelEncoder

y_train = df_train['converted'].values
print("Target variable: 'converted'")
print("Unique values in 'converted':", df_train['converted'].unique())
print("Distribution:", df_train['converted'].value_counts())
print()

categorical_cols = df_train.select_dtypes(include=['object', 'category']).columns
categorical_cols = [col for col in categorical_cols if col != 'converted']
print("Categorical columns to analyze:", categorical_cols)
print()

mi_scores = {}
le = LabelEncoder()

for col in categorical_cols:
    print(f"Processing column '{col}'...")
    print(f"  Unique values: {df_train[col].nunique()}")
    print(f"  Sample values: {df_train[col].unique()[:5]}")

    X_encoded = le.fit_transform(df_train[col])
    
    mi = mutual_info_classif(X_encoded.reshape(-1, 1), y_train, discrete_features=True, random_state=42)
    mi_scores[col] = round(mi[0], 2)
    print(f"  Mutual Information Score: {mi_scores[col]}")
    print()

print("Mutual Information Scores (sorted from highest to lowest):")
for col, score in sorted(mi_scores.items(), key=lambda x: x[1], reverse=True):
    print(f"{col}: {score}")

max_var = max(mi_scores, key=mi_scores.get)
max_score = mi_scores[max_var]
print(f"\nVariable with HIGHEST Mutual Information Score: '{max_var}' = {max_score}")

Target variable: 'converted'
Unique values in 'converted': [0 1]
Distribution: converted
1    547
0    329
Name: count, dtype: int64

Categorical columns to analyze: ['lead_source', 'industry', 'employment_status', 'location']

Processing column 'lead_source'...
  Unique values: 6
  Sample values: ['paid_ads' 'organic_search' 'NA' 'social_media' 'events']
  Mutual Information Score: 0.04

Processing column 'industry'...
  Unique values: 8
  Sample values: ['retail' 'manufacturing' 'technology' 'finance' 'NA']
  Mutual Information Score: 0.01

Processing column 'employment_status'...
  Unique values: 5
  Sample values: ['student' 'employed' 'unemployed' 'NA' 'self_employed']
  Mutual Information Score: 0.01

Processing column 'location'...
  Unique values: 8
  Sample values: ['middle_east' 'north_america' 'europe' 'australia' 'south_america']
  Mutual Information Score: 0.0

Mutual Information Scores (sorted from highest to lowest):
lead_source: 0.04
industry: 0.01
employment_status: 0.01
location: 0.0

Variable with HIGHEST Mutual Information Score: 'lead_source' = 0.04
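
As a cross-check, mutual_info_score from sklearn.metrics accepts label arrays directly, so the LabelEncoder step is not required. It should yield the same quantity as mutual_info_classif with discrete features. This is a minimal sketch on a toy training frame (the values are illustrative, not the course data):

```python
import pandas as pd
from sklearn.metrics import mutual_info_score

# Toy training frame (illustrative values, not the course data)
toy_train = pd.DataFrame({
    'lead_source': ['ads', 'ads', 'events', 'events', 'ads', 'events'],
    'location':    ['eu', 'us', 'eu', 'us', 'eu', 'us'],
    'converted':   [1, 1, 0, 0, 1, 0],
})

categorical = ['lead_source', 'location']

# Score each categorical column against the target and round as the question asks
scores = {
    col: round(mutual_info_score(toy_train[col], toy_train['converted']), 2)
    for col in categorical
}

# Print from highest to lowest score
for col, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{col}: {score}")
```

In this toy frame lead_source determines the target exactly, so it scores highest.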

## Question 4
Now let's train a logistic regression.
Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.

- Fit the model on the training dataset.
- To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    - model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)

Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

In [32]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# One-hot encode the categorical columns of each split
df_full_train = pd.get_dummies(df_full_train, drop_first=True)
df_val = pd.get_dummies(df_val, drop_first=True)
df_test = pd.get_dummies(df_test, drop_first=True)

# Align the validation and test columns with the training columns
df_val = df_val.reindex(columns=df_full_train.columns, fill_value=0)
df_test = df_test.reindex(columns=df_full_train.columns, fill_value=0)

# Separate the target from the features
y_full_train = df_full_train['converted'].values
y_val = df_val['converted'].values
X_full_train = df_full_train.drop(columns=['converted']).values
X_val = df_val.drop(columns=['converted']).values
X_test = df_test.drop(columns=['converted']).values

# Fit on df_full_train; note that it still contains the rows of df_val
model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
model.fit(X_full_train, y_full_train)

# Evaluate on the validation split
y_val_pred = model.predict(X_val)
val_accuracy = accuracy_score(y_val, y_val_pred)
val_accuracy = round(val_accuracy, 2)
print(f"Accuracy on the validation set: {val_accuracy}")


Accuracy on the validation set: 0.72
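
The cell above fits on df_full_train, which was created before df_val was split off and therefore still contains the validation rows. This is a minimal sketch of fitting on the training split only, using DictVectorizer for the one-hot encoding so the validation columns follow the training columns automatically. The toy frames and the features list are illustrative assumptions, not the course data:

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Toy splits (illustrative data only); the validation rows do not appear in training
toy_train = pd.DataFrame({
    'lead_source': ['ads', 'events', 'ads', 'events', 'ads', 'events'],
    'lead_score':  [0.9, 0.1, 0.8, 0.2, 0.7, 0.3],
    'converted':   [1, 0, 1, 0, 1, 0],
})
toy_val = pd.DataFrame({
    'lead_source': ['ads', 'referral'],   # 'referral' is unseen during training
    'lead_score':  [0.85, 0.15],
    'converted':   [1, 0],
})

features = ['lead_source', 'lead_score']

# fit_transform learns the one-hot columns from the training rows only;
# transform reuses them, so an unseen category is encoded as all zeros
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(toy_train[features].to_dict(orient='records'))
X_val = dv.transform(toy_val[features].to_dict(orient='records'))

model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
model.fit(X_train, toy_train['converted'])

val_accuracy = accuracy_score(toy_val['converted'], model.predict(X_val))
print(round(val_accuracy, 2))
```

Because the encoder is fitted once on the training rows, no reindex step is needed to align the validation matrix.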


## Question 5
- Let's find the least useful feature using the feature elimination technique.
- Train a model using the same features and parameters as in Q4 (without rounding).
- Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
- For each feature, calculate the difference between the original accuracy and the accuracy without the feature.

Which of the following features has the smallest difference?

- 'industry'
- 'employment_status'
- 'lead_score'

In [33]:
from sklearn.base import clone
def train_evaluate_model(X_train, y_train, X_val, y_val, model):
    model_clone = clone(model)
    model_clone.fit(X_train, y_train)
    y_val_pred = model_clone.predict(X_val)
    return accuracy_score(y_val, y_val_pred)
# Baseline accuracy with all features
original_accuracy = train_evaluate_model(X_full_train, y_full_train, X_val, y_val, model)
print(f"Original accuracy with all the features: {original_accuracy:.4f}")
features_to_evaluate = ['industry', 'employment_status', 'lead_score']
# Map each encoded column name to its position in the feature matrix;
# one-hot encoded columns are named '<feature>_<value>', not '<feature>'
feature_indices = {col: idx for idx, col in enumerate(df_full_train.drop(columns=['converted']).columns)}
print("Indices of features to evaluate:", feature_indices)
accuracy_differences = {}
for feature in features_to_evaluate:
    if feature in feature_indices:
        idx = feature_indices[feature]
        print(f"Evaluating removal of feature '{feature}' (index {idx})...")
        # Drop the column from both matrices and retrain
        X_train_reduced = np.delete(X_full_train, idx, axis=1)
        X_val_reduced = np.delete(X_val, idx, axis=1)
        reduced_accuracy = train_evaluate_model(X_train_reduced, y_full_train, X_val_reduced, y_val, model)
        accuracy_difference = original_accuracy - reduced_accuracy
        accuracy_differences[feature] = accuracy_difference
        print(f"  Accuracy without '{feature}': {reduced_accuracy:.4f}, Difference: {accuracy_difference:.4f}")
    else:
        print(f"Feature '{feature}' not found in the dataset.")

least_useful_feature = min(accuracy_differences, key=accuracy_differences.get)
print(f"Least useful feature is: {least_useful_feature} (Difference: {accuracy_differences[least_useful_feature]:.4f})")


Original accuracy with all the features: 0.7167
Indices of features to evaluate: {'number_of_courses_viewed': 0, 'annual_income': 1, 'interaction_count': 2, 'lead_score': 3, 'lead_source_events': 4, 'lead_source_organic_search': 5, 'lead_source_paid_ads': 6, 'lead_source_referral': 7, 'lead_source_social_media': 8, 'industry_education': 9, 'industry_finance': 10, 'industry_healthcare': 11, 'industry_manufacturing': 12, 'industry_other': 13, 'industry_retail': 14, 'industry_technology': 15, 'employment_status_employed': 16, 'employment_status_self_employed': 17, 'employment_status_student': 18, 'employment_status_unemployed': 19, 'location_africa': 20, 'location_asia': 21, 'location_australia': 22, 'location_europe': 23, 'location_middle_east': 24, 'location_north_america': 25, 'location_south_america': 26}
Feature 'industry' not found in the dataset.
Feature 'employment_status' not found in the dataset.
Evaluating removal of feature 'lead_score' (index 3)...
  Accuracy without 'lead_
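
The output above reports 'industry' and 'employment_status' as not found because, after pd.get_dummies, those features exist only as prefixed columns such as industry_retail. One way to eliminate a categorical feature is to drop every column that carries its prefix. This is a minimal sketch on synthetic data; the helpers make_frame, columns_for and fit_and_score are hypothetical names introduced for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic data (hypothetical, not the course dataset)
rng = np.random.default_rng(42)

def make_frame(n):
    raw = pd.DataFrame({
        'lead_score': rng.random(n),
        'industry': rng.choice(['retail', 'finance', 'other'], size=n),
    })
    y = (raw['lead_score'] > 0.5).astype(int).values
    return pd.get_dummies(raw, dtype=int), y

X_train_df, y_train = make_frame(200)
X_val_df, y_val = make_frame(50)
X_val_df = X_val_df.reindex(columns=X_train_df.columns, fill_value=0)

def columns_for(feature, columns):
    # A numeric feature keeps its name; a categorical one becomes '<feature>_<value>'
    return [c for c in columns if c == feature or c.startswith(feature + '_')]

def fit_and_score(train_df, val_df):
    model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
    model.fit(train_df.values, y_train)
    return accuracy_score(y_val, model.predict(val_df.values))

baseline = fit_and_score(X_train_df, X_val_df)
differences = {}
for feature in ['industry', 'lead_score']:
    drop = columns_for(feature, X_train_df.columns)
    reduced = fit_and_score(X_train_df.drop(columns=drop), X_val_df.drop(columns=drop))
    differences[feature] = baseline - reduced

print(differences)
```

Prefix matching assumes that no feature name is itself a prefix of another feature's name followed by an underscore; if that can happen, keeping an explicit mapping from feature to encoded columns is safer.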

## Question 6
- Now let's train a regularized logistic regression.
- Let's try the following values of the parameter C: [0.01, 0.1, 1, 10, 100].
- Train models using all the features as in Q4.
- Calculate the accuracy on the validation dataset and round it to 3 decimal digits.
- Which of these C leads to the best accuracy on the validation set?


In [34]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
C_values = [0.01, 0.1, 1, 10, 100]
best_C = None
best_accuracy = 0.0
for C in C_values:
    print(f"Training model with C={C}...")
    model = LogisticRegression(solver='liblinear', C=C, max_iter=1000, random_state=42)
    model.fit(X_full_train, y_full_train)
    y_val_pred = model.predict(X_val)
    val_accuracy = accuracy_score(y_val, y_val_pred)
    val_accuracy_rounded = round(val_accuracy, 3)
    print(f"Validation accuracy: {val_accuracy_rounded}")
    if val_accuracy > best_accuracy:
        best_accuracy = val_accuracy
        best_C = C

print(f"Best C: {best_C}, Accuracy: {round(best_accuracy, 3)}")

Training model with C=0.01...
Validation accuracy: 0.706
Training model with C=0.1...
Validation accuracy: 0.717
Training model with C=1...
Validation accuracy: 0.717
Training model with C=10...
Validation accuracy: 0.717
Training model with C=100...
Validation accuracy: 0.717
Best C: 0.1, Accuracy: 0.717
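
Four of the five values tie at 0.717 once rounded. The loop above compares unrounded accuracies with a strict >, so it keeps the first C that reaches the maximum. This is a minimal sketch that makes the tie-break explicit, assuming the convention of choosing the smallest C among ties on the rounded score (an assumption, not stated in the question). The accuracies are copied from the output above:

```python
# Rounded validation accuracies copied from the output above
accuracy_by_C = {0.01: 0.706, 0.1: 0.717, 1: 0.717, 10: 0.717, 100: 0.717}

best_accuracy = max(accuracy_by_C.values())

# Among all C values that reach the best rounded accuracy, take the smallest
best_C = min(C for C, acc in accuracy_by_C.items() if acc == best_accuracy)

print(best_C, best_accuracy)
```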
