# 03-classification homework

## Homework

> Note: sometimes your answer doesn't match one of the options exactly. 
> That's fine. 
> Select the option that's closest to your solution.


### Dataset

In this homework, we will use the Bank Marketing dataset. Download it from [here](https://archive.ics.uci.edu/static/public/222/bank+marketing.zip).

Or you can do it with `wget`:

```bash
wget https://archive.ics.uci.edu/static/public/222/bank+marketing.zip
```

We need the `bank/bank-full.csv` file from the downloaded zip file.  
In this dataset, the target for the classification task is the `y` variable: whether the client has subscribed to a term deposit.
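
The download contains a zip file nested inside another zip file, so the CSV can also be read without unpacking anything to disk. The sketch below is one possible way to do that; the helper name `read_csv_from_nested_zip` is hypothetical, and the demo archive is built in memory purely for illustration, so the block runs without network access.

```python
import io
import zipfile

import pandas as pd


def read_csv_from_nested_zip(outer, inner_name, csv_name, **read_csv_kwargs):
    """Read `csv_name` from the zip `inner_name` stored inside the zip `outer`.

    `outer` may be a path or a file-like object (hypothetical helper).
    """
    with zipfile.ZipFile(outer) as outer_zip:
        inner_bytes = outer_zip.read(inner_name)
    with zipfile.ZipFile(io.BytesIO(inner_bytes)) as inner_zip:
        with inner_zip.open(csv_name) as handle:
            return pd.read_csv(handle, **read_csv_kwargs)


def make_demo_archive():
    """Build a tiny nested archive in memory (illustrative data only)."""
    inner = io.BytesIO()
    with zipfile.ZipFile(inner, 'w') as z:
        z.writestr('bank-full.csv', '"age";"y"\n58;"no"\n44;"yes"\n')
    outer = io.BytesIO()
    with zipfile.ZipFile(outer, 'w') as z:
        z.writestr('bank.zip', inner.getvalue())
    outer.seek(0)
    return outer


demo_df = read_csv_from_nested_zip(make_demo_archive(), 'bank.zip', 'bank-full.csv', sep=';')
print(demo_df)
```

With the real download, the call would presumably look like `read_csv_from_nested_zip('bank+marketing.zip', 'bank.zip', 'bank-full.csv', sep=';')`.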

### Features

For the rest of the homework, you'll need to use only these columns:

* `age`,
* `job`,
* `marital`,
* `education`,
* `balance`,
* `housing`,
* `contact`,
* `day`,
* `month`,
* `duration`,
* `campaign`,
* `pdays`,
* `previous`,
* `poutcome`,
* `y`

### Data preparation

* Select only the features from above.
* Check whether the features contain missing values.

In [1]:
# import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [2]:
# Data loading: URL of the zipped dataset
data = 'https://archive.ics.uci.edu/static/public/222/bank+marketing.zip'

In [3]:
!wget $data -O bank-marketing.zip

--2024-10-11 19:12:29--  https://archive.ics.uci.edu/static/public/222/bank+marketing.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: ‘bank-marketing.zip’

bank-marketing.zip      [ <=>                ] 999.85K  5.24MB/s    in 0.2s    

2024-10-11 19:12:29 (5.24 MB/s) - ‘bank-marketing.zip’ saved [1023843]

FINISHED --2024-10-11 19:12:29--
Total wall clock time: 0.6s
Downloaded: 1 files, 1000K in 0.2s (5.24 MB/s)


In [4]:
!unzip bank-marketing.zip

Archive:  bank-marketing.zip
 extracting: bank.zip                
 extracting: bank-additional.zip     


In [5]:
!unzip bank.zip

Archive:  bank.zip
  inflating: bank-full.csv           
  inflating: bank-names.txt          
  inflating: bank.csv                


In [6]:
! head bank-full.csv

"age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"
58;"management";"married";"tertiary";"no";2143;"yes";"no";"unknown";5;"may";261;1;-1;0;"unknown";"no"
44;"technician";"single";"secondary";"no";29;"yes";"no";"unknown";5;"may";151;1;-1;0;"unknown";"no"
33;"entrepreneur";"married";"secondary";"no";2;"yes";"yes";"unknown";5;"may";76;1;-1;0;"unknown";"no"
47;"blue-collar";"married";"unknown";"no";1506;"yes";"no";"unknown";5;"may";92;1;-1;0;"unknown";"no"
33;"unknown";"single";"unknown";"no";1;"no";"no";"unknown";5;"may";198;1;-1;0;"unknown";"no"
35;"management";"married";"tertiary";"no";231;"yes";"no";"unknown";5;"may";139;1;-1;0;"unknown";"no"
28;"management";"single";"tertiary";"no";447;"yes";"yes";"unknown";5;"may";217;1;-1;0;"unknown";"no"
42;"entrepreneur";"divorced";"tertiary";"yes";2;"yes";"no";"unknown";5;"may";380;1;-1;0;"unknown";"no"
58;"retired";"married";"primary";"no";121;"yes

## Data preparation
### Features

In [7]:
cols = [
    'age',
    'job',
    'marital',
    'education',
    'balance',
    'housing',
    'contact',
    'day',
    'month',
    'duration',
    'campaign',
    'pdays',
    'previous',
    'poutcome',
    'y'
    ]

In [8]:
df = pd.read_csv('bank-full.csv', sep=';', usecols=cols)
df.head()

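|   | age | job | marital | education | balance | housing | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | management | married | tertiary | 2143 | yes | unknown | 5 | may | 261 | 1 | -1 | 0 | unknown | no |
| 1 | 44 | technician | single | secondary | 29 | yes | unknown | 5 | may | 151 | 1 | -1 | 0 | unknown | no |
| 2 | 33 | entrepreneur | married | secondary | 2 | yes | unknown | 5 | may | 76 | 1 | -1 | 0 | unknown | no |
| 3 | 47 | blue-collar | married | unknown | 1506 | yes | unknown | 5 | may | 92 | 1 | -1 | 0 | unknown | no |
| 4 | 33 | unknown | single | unknown | 1 | no | unknown | 5 | may | 198 | 1 | -1 | 0 | unknown | no |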


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 15 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        45211 non-null  int64 
 1   job        45211 non-null  object
 2   marital    45211 non-null  object
 3   education  45211 non-null  object
 4   balance    45211 non-null  int64 
 5   housing    45211 non-null  object
 6   contact    45211 non-null  object
 7   day        45211 non-null  int64 
 8   month      45211 non-null  object
 9   duration   45211 non-null  int64 
 10  campaign   45211 non-null  int64 
 11  pdays      45211 non-null  int64 
 12  previous   45211 non-null  int64 
 13  poutcome   45211 non-null  object
 14  y          45211 non-null  object
dtypes: int64(7), object(8)
memory usage: 5.2+ MB


In [10]:
df.describe().round(2)

|   | age | balance | day | duration | campaign | pdays | previous |
|---|---|---|---|---|---|---|---|
| count | 45211.0 | 45211.0 | 45211.0 | 45211.0 | 45211.0 | 45211.0 | 45211.0 |
| mean | 40.94 | 1362.27 | 15.81 | 258.16 | 2.76 | 40.2 | 0.58 |
| std | 10.62 | 3044.77 | 8.32 | 257.53 | 3.1 | 100.13 | 2.3 |
| min | 18.0 | -8019.0 | 1.0 | 0.0 | 1.0 | -1.0 | 0.0 |
| 25% | 33.0 | 72.0 | 8.0 | 103.0 | 1.0 | -1.0 | 0.0 |
| 50% | 39.0 | 448.0 | 16.0 | 180.0 | 2.0 | -1.0 | 0.0 |
| 75% | 48.0 | 1428.0 | 21.0 | 319.0 | 3.0 | -1.0 | 0.0 |
| max | 95.0 | 102127.0 | 31.0 | 4918.0 | 63.0 | 871.0 | 275.0 |


In [11]:
# check for missing values
df.isnull().sum()

age          0
job          0
marital      0
education    0
balance      0
housing      0
contact      0
day          0
month        0
duration     0
campaign     0
pdays        0
previous     0
poutcome     0
y            0
dtype: int64

### Question 1

What is the most frequent observation (mode) for the column `education`?

- `unknown`
- `primary`
- `secondary`
- `tertiary`

**Answer: `secondary`**

In [12]:
df['education'].value_counts()

education
secondary    23202
tertiary     13301
primary       6851
unknown       1857
Name: count, dtype: int64

In [13]:
df['education'].mode()

0    secondary
Name: education, dtype: object

### Question 2

Create the [correlation matrix](https://www.google.com/search?q=correlation+matrix) for the numerical features of your dataset.
In a correlation matrix, you compute the correlation coefficient between every pair of features.

What are the two features that have the biggest correlation?

- `age` and `balance`
- `day` and `campaign`
- `day` and `pdays`
- `pdays` and `previous`

**Answer: `pdays` and `previous`**
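
One way to find the strongest pair without listing every pair twice is to keep only the strict upper triangle of the correlation matrix. The sketch below illustrates the idea on toy data (the columns `u`, `v`, `w` and the helper `strongest_pair` are hypothetical, not part of the bank dataset). It compares signed correlations; taking absolute values first would also catch strong negative relationships.

```python
import numpy as np
import pandas as pd


def strongest_pair(df_numeric):
    """Return (feature_a, feature_b, r) for the largest off-diagonal correlation."""
    corr = df_numeric.corr()
    # Keep the strict upper triangle: each pair appears once and the
    # diagonal (self-correlation) is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    a, b = pairs.idxmax()
    return a, b, float(pairs.loc[(a, b)])


# Toy data (hypothetical values).
toy = pd.DataFrame({
    'u': [1, 2, 3, 4, 5],
    'v': [2, 4, 6, 8, 11],
    'w': [5, 3, 4, 1, 2],
})
a, b, r = strongest_pair(toy)
print(a, b, round(r, 2))
```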

In [14]:
corr_matrix = df.corr(numeric_only=True).round(2)
corr_matrix

|   | age | balance | day | duration | campaign | pdays | previous |
|---|---|---|---|---|---|---|---|
| age | 1.0 | 0.1 | -0.01 | -0.0 | 0.0 | -0.02 | 0.0 |
| balance | 0.1 | 1.0 | 0.0 | 0.02 | -0.01 | 0.0 | 0.02 |
| day | -0.01 | 0.0 | 1.0 | -0.03 | 0.16 | -0.09 | -0.05 |
| duration | -0.0 | 0.02 | -0.03 | 1.0 | -0.08 | -0.0 | 0.0 |
| campaign | 0.0 | -0.01 | 0.16 | -0.08 | 1.0 | -0.09 | -0.03 |
| pdays | -0.02 | 0.0 | -0.09 | -0.0 | -0.09 | 1.0 | 0.45 |
| previous | 0.0 | 0.02 | -0.05 | 0.0 | -0.03 | 0.45 | 1.0 |


In [15]:
# Stack the correlation matrix
corr_stacked = corr_matrix.stack().sort_values(ascending=False)
# Remove self-correlations (correlation of a variable with itself)
corr_stacked = corr_stacked[corr_stacked < 1]
corr_stacked

pdays     previous    0.45
previous  pdays       0.45
day       campaign    0.16
campaign  day         0.16
balance   age         0.10
age       balance     0.10
balance   duration    0.02
          previous    0.02
previous  balance     0.02
duration  balance     0.02
previous  age         0.00
          duration    0.00
pdays     balance     0.00
          duration   -0.00
balance   day         0.00
age       campaign    0.00
balance   pdays       0.00
day       balance     0.00
age       duration   -0.00
          previous    0.00
duration  previous    0.00
          age        -0.00
campaign  age         0.00
duration  pdays      -0.00
age       day        -0.01
balance   campaign   -0.01
campaign  balance    -0.01
day       age        -0.01
age       pdays      -0.02
pdays     age        -0.02
day       duration   -0.03
duration  day        -0.03
previous  campaign   -0.03
campaign  previous   -0.03
previous  day        -0.05
day       previous   -0.05
duration  campaign   -0.08
c

### Target encoding

* Now we want to encode the `y` variable.
* Let's replace the values `yes`/`no` with `1`/`0`.
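
As a possible alternative to comparing against `'yes'`, an explicit mapping makes unexpected labels visible instead of silently turning them into `0`. This is only a sketch; the helper name `encode_yes_no` is hypothetical.

```python
import pandas as pd


def encode_yes_no(series):
    """Map 'yes' -> 1 and 'no' -> 0, raising on any other value."""
    encoded = series.map({'yes': 1, 'no': 0})
    if encoded.isnull().any():
        unexpected = series[encoded.isnull()].unique().tolist()
        raise ValueError(f"unexpected target values: {unexpected}")
    return encoded.astype(int)


toy_y = pd.Series(['no', 'yes', 'no', 'no'])
encoded = encode_yes_no(toy_y)
print(encoded.tolist())
```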



In [16]:
df.y

0         no
1         no
2         no
3         no
4         no
        ... 
45206    yes
45207    yes
45208    yes
45209     no
45210     no
Name: y, Length: 45211, dtype: object

In [17]:
df.y = (df.y == 'yes').astype(int)
df.y

0        0
1        0
2        0
3        0
4        0
        ..
45206    1
45207    1
45208    1
45209    0
45210    0
Name: y, Length: 45211, dtype: int64

In [18]:
df.y.value_counts(normalize=True)

y
0    0.883015
1    0.116985
Name: proportion, dtype: float64

In [19]:
df.head()

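|   | age | job | marital | education | balance | housing | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | management | married | tertiary | 2143 | yes | unknown | 5 | may | 261 | 1 | -1 | 0 | unknown | 0 |
| 1 | 44 | technician | single | secondary | 29 | yes | unknown | 5 | may | 151 | 1 | -1 | 0 | unknown | 0 |
| 2 | 33 | entrepreneur | married | secondary | 2 | yes | unknown | 5 | may | 76 | 1 | -1 | 0 | unknown | 0 |
| 3 | 47 | blue-collar | married | unknown | 1506 | yes | unknown | 5 | may | 92 | 1 | -1 | 0 | unknown | 0 |
| 4 | 33 | unknown | single | unknown | 1 | no | unknown | 5 | may | 198 | 1 | -1 | 0 | unknown | 0 |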


### Split the data

* Split your data into train/val/test sets with a 60%/20%/20% distribution.
* Use Scikit-Learn for that (the `train_test_split` function) and set the seed to `42`.
* Make sure that the target value `y` is not in your dataframe.
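
The 60%/20%/20% split takes two calls to `train_test_split`: the first holds out 20% for testing, and the second takes 25% of the remaining 80% (0.25 × 0.8 = 0.2) for validation. The sketch below checks that arithmetic on a plain index array, using the row count reported by `df.info()`; the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 45211  # number of rows reported by df.info()
indices = np.arange(n_rows)

# First split: hold out 20% for test, leaving 80%.
full_train_idx, test_idx = train_test_split(indices, test_size=0.2, random_state=42)
# Second split: 25% of the remaining 80% is 20% of the whole dataset.
train_idx, val_idx = train_test_split(full_train_idx, test_size=0.25, random_state=42)

sizes = (len(train_idx), len(val_idx), len(test_idx))
print(sizes)
```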

In [20]:
from sklearn.model_selection import train_test_split

In [21]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=42)

In [22]:
len(df_train), len(df_val), len(df_test)

(27126, 9042, 9043)

In [23]:
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [24]:
y_train = df_train.y.values
y_val = df_val.y.values
y_test = df_test.y.values

del df_train['y']
del df_val['y']
del df_test['y']

In [25]:
df_train.head()

|   | age | job | marital | education | balance | housing | contact | day | month | duration | campaign | pdays | previous | poutcome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 32 | technician | single | tertiary | 1100 | yes | cellular | 11 | aug | 67 | 1 | -1 | 0 | unknown |
| 1 | 38 | entrepreneur | married | secondary | 0 | yes | cellular | 17 | nov | 258 | 1 | -1 | 0 | unknown |
| 2 | 49 | blue-collar | married | secondary | 3309 | yes | cellular | 15 | may | 349 | 2 | -1 | 0 | unknown |
| 3 | 37 | housemaid | married | primary | 2410 | no | cellular | 4 | aug | 315 | 1 | -1 | 0 | unknown |
| 4 | 31 | self-employed | married | tertiary | 3220 | no | cellular | 26 | aug | 74 | 4 | -1 | 0 | unknown |


In [26]:
y_train

array([0, 0, 0, ..., 0, 1, 0])

### Question 3

* Calculate the mutual information score between `y` and other categorical variables in the dataset. Use the training set only.
* Round the scores to 2 decimals using `round(score, 2)`.

Which of these variables has the biggest mutual information score?

- `contact`
- `education`
- `housing`
- `poutcome`

**Answer: `poutcome`**
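
For intuition, mutual information between two discrete variables is the sum over all value pairs of p(x, y) · log(p(x, y) / (p(x) · p(y))). The sketch below computes it by hand on toy data and compares the result with `mutual_info_score`, which uses the natural logarithm. The helper `mutual_info_manual` and the toy values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mutual_info_score


def mutual_info_manual(x, y):
    """Mutual information (in nats) between two discrete series."""
    joint = pd.crosstab(x, y, normalize=True).values   # p(x, y)
    px = joint.sum(axis=1, keepdims=True)              # p(x), shape (k, 1)
    py = joint.sum(axis=0, keepdims=True)              # p(y), shape (1, m)
    nonzero = joint > 0                                # skip 0 * log(0) terms
    ratio = joint[nonzero] / (px @ py)[nonzero]
    return float((joint[nonzero] * np.log(ratio)).sum())


# Toy data (hypothetical values, not the bank dataset).
toy = pd.DataFrame({
    'poutcome': ['success', 'failure', 'unknown', 'success', 'unknown', 'failure'],
    'y': [1, 0, 0, 1, 0, 1],
})
mi_manual = mutual_info_manual(toy['poutcome'], toy['y'])
mi_sklearn = mutual_info_score(toy['poutcome'], toy['y'])
print(round(mi_manual, 4), round(mi_sklearn, 4))
```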

In [27]:
df_train.dtypes

age           int64
job          object
marital      object
education    object
balance       int64
housing      object
contact      object
day           int64
month        object
duration      int64
campaign      int64
pdays         int64
previous      int64
poutcome     object
dtype: object

In [28]:
numerical = ['age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous']

In [29]:
categorical = ['job', 'marital', 'education', 'housing', 'contact', 'month', 'poutcome']

In [30]:
df_train[categorical].nunique()

job          12
marital       3
education     4
housing       2
contact       3
month        12
poutcome      4
dtype: int64

In [31]:
from sklearn.metrics import mutual_info_score

In [32]:
def mutual_info_y_score(series):
    # Mutual information between one categorical column and the training target
    return mutual_info_score(series, y_train)

In [33]:
score = df_train[categorical].apply(mutual_info_y_score)
score = score.sort_values(ascending=False)
score

poutcome     0.029533
month        0.025090
contact      0.013356
housing      0.010343
job          0.007316
education    0.002697
marital      0.002050
dtype: float64

In [34]:
score = round(score, 2)
score

poutcome     0.03
month        0.03
contact      0.01
housing      0.01
job          0.01
education    0.00
marital      0.00
dtype: float64

### Question 4

* Now let's train a logistic regression.
* Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.
* Fit the model on the training dataset.
    - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    - `model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)`
* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

What accuracy did you get?

- 0.6
- 0.7
- 0.8
- 0.9

**Answer: `0.9`**

In [35]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [36]:
dv = DictVectorizer(sparse=False)

train_dict = df_train[categorical + numerical].to_dict(orient='records')
# train_dict[0]
X_train = dv.fit_transform(train_dict)
# X_train[0]

val_dict = df_val[categorical + numerical].to_dict(orient='records')
X_val = dv.transform(val_dict)

In [37]:
model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
model.fit(X_train, y_train)

In [38]:
model.intercept_[0]

np.float64(-0.976486527757112)

In [39]:
model.coef_[0].round(2)

array([ 0.  ,  0.  , -0.08,  0.25,  0.08, -1.31,  0.01,  0.  , -0.44,
       -0.25, -0.05, -0.23, -0.15, -0.83,  0.1 , -0.24, -0.26, -0.33,
       -0.08,  0.27, -0.29, -0.13,  0.29, -0.15,  0.03, -0.17, -0.35,
       -0.48, -0.15, -0.01, -0.71,  0.39, -0.33, -1.16, -1.04,  0.3 ,
        1.45, -0.5 , -0.94,  0.78,  0.8 , -0.  , -0.78, -0.58,  1.5 ,
       -1.11,  0.01])

In [40]:
y_pred = model.predict(X_val)
y_pred

array([0, 0, 0, ..., 0, 0, 0])

In [41]:
y_val

array([0, 0, 1, ..., 0, 0, 1])

In [42]:
accuracy = accuracy_score(y_val, y_pred)
accuracy = round(accuracy, 2)
accuracy

0.9

#### Calculate accuracy manually

In [43]:
y_pred = model.predict_proba(X_val)[:, 1]
y_pred

array([0.01240549, 0.01017637, 0.15515956, ..., 0.05676404, 0.00908912,
       0.28499536])

In [44]:
y_decision = (y_pred >= 0.5)

In [45]:
(y_val == y_decision).mean()

np.float64(0.9009068790090687)

In [46]:
df_pred = pd.DataFrame()
df_pred['probability'] = y_pred
df_pred['prediction'] = y_decision.astype(int)
df_pred['actual'] = y_val

In [47]:
df_pred['correct'] = df_pred.prediction == df_pred.actual

In [48]:
df_pred

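|   | probability | prediction | actual | correct |
|---|---|---|---|---|
| 0 | 0.012405 | 0 | 0 | True |
| 1 | 0.010176 | 0 | 0 | True |
| 2 | 0.155160 | 0 | 1 | False |
| 3 | 0.226359 | 0 | 0 | True |
| 4 | 0.442853 | 0 | 1 | False |
| ... | ... | ... | ... | ... |
| 9037 | 0.022068 | 0 | 0 | True |
| 9038 | 0.264328 | 0 | 1 | False |
| 9039 | 0.056764 | 0 | 0 | True |
| 9040 | 0.009089 | 0 | 0 | True |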


In [49]:
accuracy = df_pred.correct.mean()
accuracy = round(accuracy, 2)
accuracy

np.float64(0.9)

### Question 5

* Let's find the least useful feature using the *feature elimination* technique.
* Train a model with all these features (using the same parameters as in Q4).
* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
* For each feature, calculate the difference between the original accuracy and the accuracy without the feature.

Which of the following features has the smallest difference?

- `age`
- `balance`
- `marital`
- `previous`

> **Note**: The difference doesn't have to be positive.

**Answer: `marital`**
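
The elimination procedure described above can be sketched by dropping each feature from the list of raw columns and refitting the encoder, so that all one-hot columns of a categorical feature disappear together. This is a minimal sketch on synthetic stand-in data, not a rerun of the homework; the helpers `fit_and_score` and `elimination_differences` and the columns `signal`, `noise`, `group` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def fit_and_score(df_tr, df_va, y_tr, y_va, features):
    """One-hot encode `features`, fit the Q4 model, return validation accuracy."""
    dv = DictVectorizer(sparse=False)
    X_tr = dv.fit_transform(df_tr[features].to_dict(orient='records'))
    X_va = dv.transform(df_va[features].to_dict(orient='records'))
    model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
    model.fit(X_tr, y_tr)
    return accuracy_score(y_va, model.predict(X_va))


def elimination_differences(df_tr, df_va, y_tr, y_va, features):
    """Return {feature: full_accuracy - accuracy_without_feature}."""
    full = fit_and_score(df_tr, df_va, y_tr, y_va, features)
    diffs = {}
    for feature in features:
        subset = [f for f in features if f != feature]
        diffs[feature] = full - fit_and_score(df_tr, df_va, y_tr, y_va, subset)
    return diffs


# Synthetic stand-in data (hypothetical; not the bank dataset).
rng = np.random.default_rng(0)
n = 400
demo = pd.DataFrame({
    'signal': rng.normal(size=n),
    'noise': rng.normal(size=n),
    'group': rng.choice(['a', 'b', 'c'], size=n),
})
target = (demo['signal'] + 0.3 * rng.normal(size=n) > 0).astype(int).values
demo_tr, demo_va = demo.iloc[:300], demo.iloc[300:]

diffs = elimination_differences(
    demo_tr, demo_va, target[:300], target[300:], ['signal', 'noise', 'group']
)
for name, diff in diffs.items():
    print(name, round(diff, 3))
```

Here the differences are signed, which matches the note that the difference doesn't have to be positive; whether to compare signed or absolute values is a separate choice.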

In [50]:
# Feature Elimination Technique

numerical = ['age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous']
categorical = ['job', 'marital', 'education', 'housing', 'contact', 'month', 'poutcome']
all_features = numerical + categorical

dv = DictVectorizer(sparse=False)
train_dict = df_train[all_features].to_dict(orient='records')
X_train = dv.fit_transform(train_dict)
val_dict = df_val[all_features].to_dict(orient='records')
X_val = dv.transform(val_dict)

def train_and_evaluate(X_train, X_val, y_train, y_val):
    model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    # return round(accuracy_score(y_val, y_pred), 10)
    return accuracy_score(y_val, y_pred)

# Train the model with all features
full_accuracy = train_and_evaluate(X_train, X_val, y_train, y_val)
print(f"Accuracy with all features: {full_accuracy}")

Accuracy with all features: 0.9009068790090687


In [51]:
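# Test each feature
# Caveat: X_train is the one-hot encoded matrix from DictVectorizer (47 columns,
# ordered by encoded feature name), so np.delete(X_train, i, axis=1) drops the
# i-th encoded column, which is not necessarily a column of all_features[i].
accuracy_differences = {}
for i, feature in enumerate(all_features):
    X_train_without_feature = np.delete(X_train, i, axis=1)
    X_val_without_feature = np.delete(X_val, i, axis=1)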

    accuracy_without_feature = train_and_evaluate(X_train_without_feature, X_val_without_feature, y_train, y_val)
    difference = abs(full_accuracy - accuracy_without_feature)
    # difference = round(difference, 10)
    accuracy_differences[feature] = difference
    print(f"Accuracy without {feature}: {accuracy_without_feature}, Difference: {difference}")

least_useful_feature = min(accuracy_differences, key=accuracy_differences.get)
print(f"\nThe least useful feature is: {least_useful_feature}")
print(f"Accuracy difference: {accuracy_differences[least_useful_feature]}")

# Sort features by their impact (difference in accuracy)
sorted_features = sorted(accuracy_differences.items(), key=lambda x: x[1])
print("\nFeatures sorted by impact (least to most):")
for feature, diff in sorted_features:
    print(f"{feature}: {diff}")

Accuracy without age: 0.9013492590134926, Difference: 0.00044238000442387015
Accuracy without balance: 0.9010174740101747, Difference: 0.0001105950011059953
Accuracy without day: 0.9002433090024331, Difference: 0.0006635700066356387
Accuracy without duration: 0.9011280690112807, Difference: 0.0002211900022119906
Accuracy without campaign: 0.9014598540145985, Difference: 0.0005529750055297544
Accuracy without pdays: 0.9006856890068569, Difference: 0.00022119000221187957
Accuracy without previous: 0.9013492590134926, Difference: 0.00044238000442387015
Accuracy without job: 0.8897367838973679, Difference: 0.011170095111700862
Accuracy without marital: 0.9007962840079629, Difference: 0.00011059500110588427
Accuracy without education: 0.9006856890068569, Difference: 0.00022119000221187957
Accuracy without housing: 0.9014598540145985, Difference: 0.0005529750055297544
Accuracy without contact: 0.9012386640123866, Difference: 0.00033178500331787486
Accuracy without month: 0.9010174740101747, 

In [52]:
# Final step: Compare specific features
features_to_compare = ['age', 'balance', 'marital', 'previous']
specific_differences = {f: accuracy_differences[f] for f in features_to_compare}

print("\nComparison of specific features:")
for feature, diff in specific_differences.items():
    print(f"{feature}: {diff}")

least_useful_specific_feature = min(specific_differences, key=specific_differences.get)
print(f"\nAmong {features_to_compare}, the least useful feature is: {least_useful_specific_feature}")
print(f"Accuracy difference: {specific_differences[least_useful_specific_feature]}")


Comparison of specific features:
age: 0.00044238000442387015
balance: 0.0001105950011059953
marital: 0.00011059500110588427
previous: 0.00044238000442387015

Among ['age', 'balance', 'marital', 'previous'], the least useful feature is: marital
Accuracy difference: 0.00011059500110588427


### Question 6

* Now let's train a regularized logistic regression.
* Let's try the following values of the parameter `C`: `[0.01, 0.1, 1, 10, 100]`.
* Train models using all the features as in Q4.
* Calculate the accuracy on the validation dataset and round it to 3 decimal digits.
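
In scikit-learn, `C` is the inverse of the regularization strength, so smaller values penalize large weights more heavily. The sketch below illustrates this on synthetic data (hypothetical, not the bank dataset) by recording the norm of the learned coefficients for each `C`; the norm is expected to be much smaller for `C=0.01` than for the larger values.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: only the first column carries signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

coef_norms = {}
for c in [0.01, 0.1, 1, 10, 100]:
    model = LogisticRegression(solver='liblinear', C=c, max_iter=1000, random_state=42)
    model.fit(X, y)
    # Euclidean norm of the weight vector (intercept excluded).
    coef_norms[c] = float(np.linalg.norm(model.coef_))

for c, norm in coef_norms.items():
    print(c, round(norm, 3))
```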

Which of these `C` leads to the best accuracy on the validation set?

- 0.01
- 0.1
- 1
- 10
- 100

> **Note**: If there are multiple options, select the smallest `C`.

**Answer: `0.1`**
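
The tie-breaking rule from the note can be written down directly. The sketch below uses the rounded validation accuracies printed by the loop in this notebook; the variable names are illustrative.

```python
# Validation accuracies rounded to 3 decimals, as printed by the loop in this notebook.
results = {0.01: 0.898, 0.1: 0.901, 1: 0.901, 10: 0.901, 100: 0.901}

best_accuracy = max(results.values())
# Among all C values that reach the best accuracy, keep the smallest one.
best_c = min(c for c, acc in results.items() if acc == best_accuracy)

print(best_c, best_accuracy)
```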

In [53]:
dv = DictVectorizer(sparse=False)
train_dict = df_train[categorical + numerical].to_dict(orient='records')
X_train = dv.fit_transform(train_dict)
val_dict = df_val[categorical + numerical].to_dict(orient='records')
X_val = dv.transform(val_dict)

for c in [0.01, 0.1, 1, 10, 100]:
    model = LogisticRegression(solver='liblinear', C=c, max_iter=1000, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    result = round(accuracy_score(y_val, y_pred), 3)
    # result = accuracy_score(y_val, y_pred)
    print(c, result)
    print()

0.01 0.898

0.1 0.901

1 0.901

10 0.901

100 0.901

