# 03-Classification - Homework

In [1]:
# import required libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mutual_info_score, accuracy_score

## Getting the data
For this homework, we'll use the Laptops Price dataset. Download it from [here](https://archive.ics.uci.edu/static/public/222/bank+marketing.zip).

In [2]:
!wget -P ../../datasets/ "https://archive.ics.uci.edu/static/public/222/bank+marketing.zip"

--2024-10-15 18:05:04--  https://archive.ics.uci.edu/static/public/222/bank+marketing.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: ‘../../datasets/bank+marketing.zip’

bank+marketing.zip      [    <=>             ] 999.85K  1.05MB/s    in 0.9s    

2024-10-15 18:05:06 (1.05 MB/s) - ‘../../datasets/bank+marketing.zip’ saved [1023843]



In [3]:
!unzip ../../datasets/bank+marketing.zip -d ../../datasets/bank+marketing/
!unzip ../../datasets/bank+marketing/bank.zip -d ../../datasets/bank+marketing/bank
!unzip ../../datasets/bank+marketing/bank-additional.zip -d ../../datasets/bank+marketing/bank-additional/
!rm -rf ../../datasets/bank+marketing/bank-additional/__MACOSX
!cp ../../datasets/bank+marketing/bank-additional/bank-additional/* ../../datasets/bank+marketing/bank-additional/
!rm -rf ../../datasets/bank+marketing/bank-additional/bank-additional/ ../../datasets/bank+marketing/*.zip ../../datasets/*.zip

Archive:  ../../datasets/bank+marketing.zip
 extracting: ../../datasets/bank+marketing/bank.zip  
 extracting: ../../datasets/bank+marketing/bank-additional.zip  
Archive:  ../../datasets/bank+marketing/bank.zip
  inflating: ../../datasets/bank+marketing/bank/bank-full.csv  
  inflating: ../../datasets/bank+marketing/bank/bank-names.txt  
  inflating: ../../datasets/bank+marketing/bank/bank.csv  
Archive:  ../../datasets/bank+marketing/bank-additional.zip
   creating: ../../datasets/bank+marketing/bank-additional/bank-additional/
  inflating: ../../datasets/bank+marketing/bank-additional/bank-additional/.DS_Store  
   creating: ../../datasets/bank+marketing/bank-additional/__MACOSX/
   creating: ../../datasets/bank+marketing/bank-additional/__MACOSX/bank-additional/
  inflating: ../../datasets/bank+marketing/bank-additional/__MACOSX/bank-additional/._.DS_Store  
  inflating: ../../datasets/bank+marketing/bank-additional/bank-additional/.Rhistory  
  inflating: ../../datasets/bank+marke

## Preparing the dataset
We need to take `bank/bank-full.csv` file from the downloaded zip-file.
In this dataset our desired target for classification task will be `y` variable - has the client subscribed a term deposit or not.

Features
1. Next we will use only the following columns:

    - `age`,
    - `job`,
    - `marital`,
    - `education`,
    - `balance`,
    - `housing`,
    - `contact`,
    - `day`,
    - `month`,
    - `duration`,
    - `campaign`,
    - `pdays`,
    - `previous`,
    - `poutcome`,
    - `y`

2. Check if the missing values are presented in the features.

In [4]:
bank_full_df = pd.read_csv("../../datasets/bank+marketing/bank/bank-full.csv", sep=';')
bank_full_df.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no


In [5]:
feature_columns = ['age', 'job', 'marital', 'education', 'balance',
    'housing', 'contact', 'day', 'month', 'duration',
    'campaign', 'pdays', 'previous', 'poutcome', 'y']

df = bank_full_df[feature_columns]
df.head()

Unnamed: 0,age,job,marital,education,balance,housing,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,2143,yes,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,29,yes,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,2,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,1506,yes,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,1,no,unknown,5,may,198,1,-1,0,unknown,no


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 15 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        45211 non-null  int64 
 1   job        45211 non-null  object
 2   marital    45211 non-null  object
 3   education  45211 non-null  object
 4   balance    45211 non-null  int64 
 5   housing    45211 non-null  object
 6   contact    45211 non-null  object
 7   day        45211 non-null  int64 
 8   month      45211 non-null  object
 9   duration   45211 non-null  int64 
 10  campaign   45211 non-null  int64 
 11  pdays      45211 non-null  int64 
 12  previous   45211 non-null  int64 
 13  poutcome   45211 non-null  object
 14  y          45211 non-null  object
dtypes: int64(7), object(8)
memory usage: 5.2+ MB


In [7]:
df[df.columns].isnull().sum()

age          0
job          0
marital      0
education    0
balance      0
housing      0
contact      0
day          0
month        0
duration     0
campaign     0
pdays        0
previous     0
poutcome     0
y            0
dtype: int64

## Question 1
What is the most frequent observation (mode) for the column `education`?

- `unknown`
- `primary`
- `secondary`
- `tertiary`

In [8]:
df['education'].mode().iloc[0]

'secondary'

In [9]:
df['education'].value_counts().sort_values(ascending=False).index[0]

'secondary'

## Question 2
Create the [correlation matrix](https://www.google.com/search?q=correlation+matrix) for the numerical features of your dataset. In a correlation matrix, you compute the correlation coefficient between every pair of features.

What are the two features that have the biggest correlation?

- `age` and `balance`
- `day` and `campaign`
- `day` and `pdays`
- `pdays` and `previous`

In [10]:
df.corr(numeric_only=True)

Unnamed: 0,age,balance,day,duration,campaign,pdays,previous
age,1.0,0.097783,-0.00912,-0.004648,0.00476,-0.023758,0.001288
balance,0.097783,1.0,0.004503,0.02156,-0.014578,0.003435,0.016674
day,-0.00912,0.004503,1.0,-0.030206,0.16249,-0.093044,-0.05171
duration,-0.004648,0.02156,-0.030206,1.0,-0.08457,-0.001565,0.001203
campaign,0.00476,-0.014578,0.16249,-0.08457,1.0,-0.088628,-0.032855
pdays,-0.023758,0.003435,-0.093044,-0.001565,-0.088628,1.0,0.45482
previous,0.001288,0.016674,-0.05171,0.001203,-0.032855,0.45482,1.0


From the above `df.corr()` output, we can conclude that `pdays` and `previous` features have the biggest correlation.

### Target encoding
- Now we want to encode the `y` variable.
- Let's replace the values `yes`/`no` with `1`/`0`.

In [11]:
df['y'] = df['y'].map({"yes": 1, "no": 0}).astype(int)
df['y']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['y'] = df['y'].map({"yes": 1, "no": 0}).astype(int)


0        0
1        0
2        0
3        0
4        0
        ..
45206    1
45207    1
45208    1
45209    0
45210    0
Name: y, Length: 45211, dtype: int64

In [12]:
df.head()

Unnamed: 0,age,job,marital,education,balance,housing,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,2143,yes,unknown,5,may,261,1,-1,0,unknown,0
1,44,technician,single,secondary,29,yes,unknown,5,may,151,1,-1,0,unknown,0
2,33,entrepreneur,married,secondary,2,yes,unknown,5,may,76,1,-1,0,unknown,0
3,47,blue-collar,married,unknown,1506,yes,unknown,5,may,92,1,-1,0,unknown,0
4,33,unknown,single,unknown,1,no,unknown,5,may,198,1,-1,0,unknown,0


### Split the data
- Split your data in train/val/test sets with 60%/20%/20% distribution.
- Use Scikit-Learn for that (the `train_test_split` function) and set the seed to `42`.
- Make sure that the target value `y` is not in your dataframe.

In [13]:
def data_split(dataframe, random_state):
    df_full_train, df_test = train_test_split(dataframe, test_size=0.2, random_state=random_state)
    df_train, df_val = train_test_split(df_full_train, test_size=(20/80), random_state=random_state)
    # Creating a target variable (y) from the predictor variables
    y_train = df_train['y'].values
    y_val = df_val['y'].values
    y_test = df_test['y'].values
    
    # Removing the target variable from the datasets
    del df_train['y']
    del df_val['y']
    del df_test['y']

    return df_train, df_val, df_test, y_train, y_val, y_test

In [14]:
df_train, df_val, df_test, y_train, y_val, y_test = data_split(df, random_state=42)

In [15]:
df_train.head()

Unnamed: 0,age,job,marital,education,balance,housing,contact,day,month,duration,campaign,pdays,previous,poutcome
20326,32,technician,single,tertiary,1100,yes,cellular,11,aug,67,1,-1,0,unknown
24301,38,entrepreneur,married,secondary,0,yes,cellular,17,nov,258,1,-1,0,unknown
38618,49,blue-collar,married,secondary,3309,yes,cellular,15,may,349,2,-1,0,unknown
18909,37,housemaid,married,primary,2410,no,cellular,4,aug,315,1,-1,0,unknown
23081,31,self-employed,married,tertiary,3220,no,cellular,26,aug,74,4,-1,0,unknown


In [16]:
df_train.shape, y_train.shape

((27126, 14), (27126,))

## Question 3
Calculate the mutual information score between `y` and other categorical variables in the dataset. Use the training set only.
Round the scores to 2 decimals using `round(score, 2)`.
Which of these variables has the biggest mutual information score?

- `contact`
- `education`
- `housing`
- `poutcome`

In [17]:
# Categorical variables
numerical_columns = df_train.select_dtypes('number')
categorical_columns = df_train.select_dtypes('object')

# Calculating mutual information score between any column and target variable y
def mutual_info_y_score(col):
    return mutual_info_score(col, pd.Series(y_train))

df_train[categorical_columns.columns].apply(mutual_info_y_score).sort_values(ascending=False).apply(lambda x : round(x, 2))

poutcome     0.03
month        0.03
contact      0.01
housing      0.01
job          0.01
education    0.00
marital      0.00
dtype: float64

## Question 4
- Now let's train a logistic regression.
- Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.
- Fit the model on the training dataset.
    - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    - `model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)`
- Calculate the accuracy on the validation dataset and round it to 2 decimal digits.
What accuracy did you get?

- 0.6
- 0.7
- 0.8
- 0.9

In [18]:
def one_hot_encode(dataframe, dv, train):
    feature_dict = dataframe.iloc[:, :-1].to_dict(orient='records')
    if train == True:
        X = dv.fit_transform(feature_dict)
    else:
        X = dv.transform(feature_dict)
    return X

In [19]:
def fit(dataframe, C, dv):
    X = one_hot_encode(dataframe, dv, train=True)
    model = LogisticRegression(solver='liblinear', C=C, max_iter=1000, random_state=42)
    model.fit(X, dataframe.iloc[:, -1].values)
    return model

In [20]:
def predict(model, dataframe, dv):
    X = one_hot_encode(dataframe, dv, train=False)
    y_pred = model.predict(X)
    return accuracy_score(dataframe.iloc[:, -1].values, y_pred)

In [21]:
dv = DictVectorizer(sparse=False)
model = fit(df_train, C=1.0, dv=dv)
acc_full = predict(model, df_val, dv=dv)
round(acc_full, 2)

0.92

## Question 5
Let's find the least useful feature using the feature elimination technique.
Train a model with all these features (using the same parameters as in Q4).
Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
For each feature, calculate the difference between the original accuracy and the accuracy without the feature.
Which of following feature has the smallest difference?

- `age`
- `balance`
- `marital`
- `previous`

> Note: The difference doesn't have to be positive.

In [22]:
features = ['age', 'balance', 'marital', 'previous']
accuracies_diff = []
for col in features:
    dv = DictVectorizer(sparse=False)
    model = fit(df_train.drop([col], axis=1), C=1.0, dv=dv)
    acc = predict(model, df_val.drop([col], axis=1), dv=dv)
    #print(acc_full, acc)
    print(col, acc_full - acc)

age 0.00033178500331776384
balance -0.0002211900022119906
marital 0.00033178500331776384
previous 0.0014377350143772727


## Question 6
Now let's train a regularized logistic regression.
Let's try the following values of the parameter `C`: `[0.01, 0.1, 1, 10, 100]`.
Train models using all the features as in Q4.
Calculate the accuracy on the validation dataset and round it to 3 decimal digits.
Which of these `C` leads to the best accuracy on the validation set?

- 0.01
- 0.1
- 1
- 10
- 100
> Note: If there are multiple options, select the smallest `C`.



In [23]:
C_list = [0.01, 0.1, 1, 10, 100]
for C in C_list:
    dv = DictVectorizer(sparse=False)
    model = fit(df_train, C=C, dv=dv)
    acc = predict(model, df_val, dv=dv)
    print(C, round(acc, 4))

0.01 0.9194
0.1 0.9198
1 0.9207
10 0.9205
100 0.9202
