<a href="https://colab.research.google.com/github/suwisitlk/229352-StatisticalLearning/blob/main/Lab02_Data_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

660110399
สุวิศิษฏ์ ลิขิตวนิชกุล




### Statistical Learning for Data Science 2 (229352)
#### Instructor: Donlapark Ponnoprat

#### [Course website](https://donlapark.pages.dev/229352/)

## Lab #2

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_auc_score, roc_curve

# For Fashion-MNIST
from tensorflow.keras.datasets import fashion_mnist

# For 20 Newsgroups
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

## Part 1: Marketing Campaign Dataset - Manual Data Preprocessing & Logistic Regression

### Load the Marketing Campaign Dataset ([Data Information](https://archive.ics.uci.edu/dataset/222/bank+marketing))

The data is related to direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed (`'yes'`) or not (`'no'`).

In [2]:
bank_url = 'https://raw.githubusercontent.com/donlap/ds352-labs/main/bank.csv'

df = pd.read_csv(bank_url, sep=';', na_values=['unknown'])
df = df.drop(["emp.var.rate", "cons.price.idx", "cons.conf.idx", "euribor3m", "nr.employed"], axis=1)
print("Shape of the dataset:", df.shape)
df.head()

Shape of the dataset: (41188, 16)


Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,261,1,999,0,nonexistent,no
1,57,services,married,high.school,,no,no,telephone,may,mon,149,1,999,0,nonexistent,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,226,1,999,0,nonexistent,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,151,1,999,0,nonexistent,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,307,1,999,0,nonexistent,no


### Data Exploration

In [3]:
print("--- Missing Values Count ---")
print(df.isnull().sum())

--- Missing Values Count ---
age               0
job             330
marital          80
education      1731
default        8597
housing         990
loan            990
contact           0
month             0
day_of_week       0
duration          0
campaign          0
pdays             0
previous          0
poutcome          0
y                 0
dtype: int64
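
Raw counts are easier to judge as a share of the rows. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in, not the bank data) to show how `isnull().mean()` turns the counts into percentages.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the bank data (values are made up).
toy = pd.DataFrame({
    "job": ["admin.", np.nan, "services", "admin."],
    "default": [np.nan, np.nan, "no", "no"],
    "age": [30, 41, 52, 28],
})

# isnull().mean() is the fraction of missing entries per column.
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct.round(1))
```

Applied to `df`, the same expression would make it clear that `default` is missing far more often than any other column.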


In [4]:
print("--- Unique Values for Categorical Columns ---")
for col in df.select_dtypes(include='object').columns:
    print(f"\n'{col}' unique values:")
    print(df[col].value_counts(dropna=False)) # Include NaN counts

--- Unique Values for Categorical Columns ---

'job' unique values:
job
admin.           10422
blue-collar       9254
technician        6743
services          3969
management        2924
retired           1720
entrepreneur      1456
self-employed     1421
housemaid         1060
unemployed        1014
student            875
NaN                330
Name: count, dtype: int64

'marital' unique values:
marital
married     24928
single      11568
divorced     4612
NaN            80
Name: count, dtype: int64

'education' unique values:
education
university.degree      12168
high.school             9515
basic.9y                6045
professional.course     5243
basic.4y                4176
basic.6y                2292
NaN                     1731
illiterate                18
Name: count, dtype: int64

'default' unique values:
default
no     32588
NaN     8597
yes        3
Name: count, dtype: int64

'housing' unique values:
housing
yes    21576
no     18622
NaN      990
Name: count, dtype: int64


### Data Preprocessing

In [5]:
# Map target variable 'y' to 0 (no) and 1 (yes)
df['y_new'] = df['y'].map({'no': 0, 'yes': 1})


# Drop 'duration' due to data leakage
df = df.drop("duration", axis=1)


# Define features (X) and target (y)
X = df.drop(['y','y_new'], axis=1)
y = df['y_new']




In [6]:
# Split the data BEFORE any transformations
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)


# Print data shape
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)


X_train shape: (28831, 14)
X_test shape: (12357, 14)
y_train shape: (28831,)
y_test shape: (12357,)
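
The split above is random, so the row indices and scaled values shown in later outputs change from run to run. A minimal sketch of a reproducible, stratified split follows; the synthetic arrays and the choice of `random_state=42` are assumptions for illustration, not part of the lab.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 rows, 10% positives (made up for illustration).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 100 + [0] * 900)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)
print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

Stratifying may be worth considering here because the target `y` is imbalanced, so an unlucky random split could shift the share of `'yes'` between train and test.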


We will apply `StandardScaler()`, `OrdinalEncoder()`, and `OneHotEncoder()` on a few selected columns.

**1. Numerical Feature: `age` and `campaign` (Standard Scaling)**

In [7]:
num_cols_demo = ['age', 'campaign']

scaler = StandardScaler()

# Fit the scaler ONLY on the training data

X_train[num_cols_demo] = scaler.fit_transform(X_train[num_cols_demo])

X_test[num_cols_demo] = scaler.transform(X_test[num_cols_demo])

X_train[num_cols_demo]

Unnamed: 0,age,campaign
9746,-0.389990,-0.209842
38925,-1.828602,-0.566329
13226,0.377269,-0.566329
13942,-0.677712,-0.209842
2148,1.432251,-0.566329
...,...,...
37645,-0.581805,-0.566329
30634,1.528159,1.929082
14916,-0.581805,0.859620
23870,0.185455,-0.566329


In [8]:
X_train.describe()

Unnamed: 0,age,campaign,pdays,previous
count,28831.0,28831.0,28831.0,28831.0
mean,1.946963e-17,8.872234000000001e-17,962.144948,0.175159
std,1.000017,1.000017,187.729939,0.501038
min,-2.212231,-0.5663289,0.0,0.0
25%,-0.7736198,-0.5663289,999.0,0.0
50%,-0.1981752,-0.2098416,999.0,0.0
75%,0.6649917,0.1466456,999.0,0.0
max,5.556271,14.40613,999.0,7.0


Let's take a look at the transformed `age` and `campaign` features and their statistics.
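
The `std` of 1.000017 in the `describe()` output is expected: `StandardScaler` divides by the population standard deviation (ddof=0), while `describe()` reports the sample standard deviation (ddof=1). The sketch below, on made-up numbers, shows the statistics the scaler learns from the training data and how the same statistics are reused for the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up one-column training and test sets.
train = np.array([[20.0], [30.0], [40.0]])
test = np.array([[50.0]])

scaler_toy = StandardScaler().fit(train)   # learns mean_ and scale_ from train only
print(scaler_toy.mean_, scaler_toy.scale_)

train_scaled = scaler_toy.transform(train)
test_scaled = scaler_toy.transform(test)   # uses the training mean and scale

# Population std of the scaled training data is 1; the sample std is slightly larger.
print(train_scaled.std(ddof=0), train_scaled.std(ddof=1))
```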

**2. Ordinal Feature: `education` (Ordinal Encoding with Imputation)**

- **Imputation**

In [9]:
ord_col_demo = ['education']

imputer_ord = SimpleImputer(strategy='most_frequent')

## Write your code here

X_train[ord_col_demo] = imputer_ord.fit_transform(X_train[ord_col_demo])
X_test[ord_col_demo] = imputer_ord.transform(X_test[ord_col_demo])

X_train[ord_col_demo]

Unnamed: 0,education
9746,basic.4y
38925,professional.course
13226,basic.9y
13942,university.degree
2148,basic.4y
...,...
37645,university.degree
30634,high.school
14916,basic.4y
23870,university.degree
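
To see which value the imputer actually fills in, one can inspect its `statistics_` attribute after fitting. This is a small sketch on made-up frames (`train` and `test` here are hypothetical, not the lab's data).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train = pd.DataFrame({"education": ["high.school", "basic.4y", "high.school", np.nan]})
test = pd.DataFrame({"education": [np.nan, "basic.9y"]})

imp = SimpleImputer(strategy="most_frequent")
imp.fit(train)                      # the mode is learned from the training data only
print(imp.statistics_)              # the fill value for each column

test_filled = imp.transform(test)   # missing test entries get the training mode
print(test_filled)
```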


- **Ordinal Encoding**

In [10]:
# Ordered from lowest to highest education level
education_categories = [
    'illiterate', 'basic.4y', 'basic.6y', 'basic.9y', 'high.school',
    'professional.course', 'university.degree', 'masters', 'doctorate'
]

In [11]:
ordinal_encoder = OrdinalEncoder(categories=[education_categories])

## Write your code here

X_train[ord_col_demo] = ordinal_encoder.fit_transform(X_train[ord_col_demo])
X_test[ord_col_demo] = ordinal_encoder.transform(X_test[ord_col_demo])

X_train[ord_col_demo]

Unnamed: 0,education
9746,1.0
38925,5.0
13226,3.0
13942,6.0
2148,1.0
...,...
37645,6.0
30634,4.0
14916,1.0
23870,6.0
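
With an explicit `categories` list, `OrdinalEncoder` raises an error by default if it meets a level that is not in the list. A hedged sketch of one alternative is shown below: the shortened `levels` list and the choice of `-1` as the code for unseen levels are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Shortened, made-up ordering of education levels.
levels = ["basic.4y", "high.school", "university.degree"]

enc = OrdinalEncoder(
    categories=[levels],
    handle_unknown="use_encoded_value",  # do not raise on unseen levels
    unknown_value=-1,                    # code assigned to unseen levels
)

train = pd.DataFrame({"education": ["high.school", "basic.4y", "university.degree"]})
test = pd.DataFrame({"education": ["university.degree", "doctorate"]})

enc.fit(train)
codes = enc.transform(test)
print(codes)
```

The integer codes follow the position in `levels`, which is why the order of the list matters for an ordinal feature.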



**3. Nominal Feature: `job` (One-Hot Encoding with Imputation)**

- **Imputation**

In [13]:
nom_col_demo = ['job']

imputer_nom = SimpleImputer(strategy='most_frequent')
# fit_transform below learns the most frequent category from the training data only

X_train_imputed_nom_demo = imputer_nom.fit_transform(X_train[nom_col_demo])
X_test_imputed_nom_demo = imputer_nom.transform(X_test[nom_col_demo])

- **Nominal Encoding**

In [14]:
onehot_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

## Write your code here
X_train_onehot_encoded_demo = onehot_encoder.fit_transform(X_train_imputed_nom_demo)
X_test_onehot_encoded_demo = onehot_encoder.transform(X_test_imputed_nom_demo)

In [15]:
X_train_onehot_encoded_demo

array([[0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 0., 1.],
       ...,
       [0., 1., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.]])

In [16]:
print("\nOriginal X_train 'job' head:")
print(X_train[nom_col_demo].iloc[40:45])
print("\nImputed X_train 'job' head (after imputer.transform):")
print(pd.DataFrame(X_train_imputed_nom_demo, columns=nom_col_demo, index=X_train.index).iloc[40:45])
print("\nOne-Hot Encoded X_train 'job' shape:", X_train_onehot_encoded_demo.shape)
print("First 5 rows of One-Hot Encoded X_train 'job':")
print(pd.DataFrame(X_train_onehot_encoded_demo, columns=onehot_encoder.get_feature_names_out(nom_col_demo), index=X_train.index).iloc[40:45])


Original X_train 'job' head:
               job
23340       admin.
14282  blue-collar
794            NaN
14614  blue-collar
9443     housemaid

Imputed X_train 'job' head (after imputer.transform):
               job
23340       admin.
14282  blue-collar
794         admin.
14614  blue-collar
9443     housemaid

One-Hot Encoded X_train 'job' shape: (28831, 11)
First 5 rows of One-Hot Encoded X_train 'job':
       job_admin.  job_blue-collar  job_entrepreneur  job_housemaid  job_management  job_retired  job_self-employed  job_services  job_student  job_technician  job_unemployed
23340         1.0              0.0               0.0            0.0             0.0          0.0                0.0           0.0          0.0             0.0             0.0
14282         0.0              1.0               0.0            0.0             0.0          0.0                0.0           0.0          0.0             0.0             0.0
794           1.0              0.0               0.0            0

In [17]:
X_train = pd.concat([X_train.reset_index(drop=True),
                              pd.DataFrame(X_train_onehot_encoded_demo, columns = onehot_encoder.get_feature_names_out(['job']))], axis=1)

X_test = pd.concat([X_test.reset_index(drop=True),
                              pd.DataFrame(X_test_onehot_encoded_demo, columns = onehot_encoder.get_feature_names_out(['job']))], axis=1)


* Drop the original `job` column, which is now represented by its one-hot columns.

In [18]:
# drop job
X_train = X_train.drop(['job'], axis=1)
X_test = X_test.drop(['job'], axis=1)

* Drop job_services to prevent the dummy variable trap.

In [19]:
# drop job_services
X_train = X_train.drop(['job_services'], axis=1)
X_test = X_test.drop(['job_services'], axis=1)

In [20]:
X_train

Unnamed: 0,age,marital,education,default,housing,loan,contact,month,day_of_week,campaign,pdays,previous,poutcome,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_student,job_technician,job_unemployed
0,-0.389990,married,1.0,no,no,no,telephone,jun,mon,-0.209842,999,0,nonexistent,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,-1.828602,single,5.0,no,no,no,telephone,nov,wed,-0.566329,999,0,nonexistent,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,0.377269,married,3.0,,yes,no,cellular,jul,wed,-0.566329,999,0,nonexistent,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,-0.677712,single,6.0,no,yes,no,cellular,jul,fri,-0.209842,999,0,nonexistent,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1.432251,married,1.0,,yes,no,telephone,may,mon,-0.566329,999,0,nonexistent,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28826,-0.581805,single,6.0,no,yes,no,telephone,aug,tue,-0.566329,999,0,nonexistent,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
28827,1.528159,married,4.0,no,no,no,cellular,may,mon,1.929082,999,0,nonexistent,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
28828,-0.581805,married,1.0,,no,no,cellular,jul,wed,0.859620,999,0,nonexistent,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
28829,0.185455,married,6.0,no,no,no,cellular,aug,fri,-0.566329,999,0,nonexistent,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [21]:
X_test

Unnamed: 0,age,marital,education,default,housing,loan,contact,month,day_of_week,campaign,pdays,previous,poutcome,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_student,job_technician,job_unemployed
0,0.281362,married,6.0,,yes,no,telephone,jun,mon,0.146646,999,0,nonexistent,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.952714,,6.0,no,no,no,cellular,mar,fri,-0.209842,999,0,nonexistent,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,1.240436,divorced,4.0,no,no,yes,telephone,may,thu,0.859620,999,0,nonexistent,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,-1.157250,married,4.0,no,yes,no,telephone,jul,tue,-0.566329,999,0,nonexistent,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,-0.581805,married,6.0,,yes,no,cellular,aug,thu,-0.209842,999,0,nonexistent,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12352,0.089547,married,4.0,no,yes,no,cellular,may,tue,-0.566329,999,1,failure,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12353,-0.485898,married,4.0,,yes,yes,telephone,may,tue,-0.566329,999,0,nonexistent,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12354,0.856807,divorced,4.0,no,yes,no,telephone,jun,fri,6.206928,999,0,nonexistent,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12355,-0.869527,single,6.0,no,no,no,cellular,sep,fri,0.146646,999,0,nonexistent,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### **Exercise 1: Apply All Preprocessing & Train Logistic Regression**

Now, it's your turn to apply these preprocessing steps to *all* relevant columns and then train a Logistic Regression model.

**Instructions:**

1.  Look at the Variable Table in [this link](https://archive.ics.uci.edu/dataset/222/bank+marketing).
2. Make lists for `numerical_features`, `ordinal_features`, and `nominal_features`.
3. Preprocess the features. It is safer to make a copy of `X_train` using:
   ```
   X_train_copy = X_train.copy()
   X_test_copy = X_test.copy()
   ```
   and preprocess `X_train_copy` instead.

   **For nominal features, concat the one-hot encoded features using [`pd.concat(..., axis=1)`](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) and drop the old nominal features from the dataframe.**
4. Train Logistic Regression on the preprocessed `X_train_copy` and `y_train`.
5. Evaluate the Model:
    *   Make predictions on the preprocessed `X_test_copy`.
    *   Print `classification_report` ([Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)). What are the accuracy, average precision, average recall, and average f1-score?
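
The steps above can also be expressed with `ColumnTransformer` and `Pipeline`, which fit every transformer on the training data only. The following is a minimal sketch under stated assumptions, not the intended solution to the exercise: the `toy` frame is randomly generated, the feature lists are shortened, and `max_iter=1000` is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Synthetic stand-in for the bank data; values are random, not real.
rng = np.random.default_rng(0)
n = 400
edu_levels = ["basic.4y", "high.school", "university.degree"]
toy = pd.DataFrame({
    "age": rng.integers(18, 80, n).astype(float),
    "campaign": rng.integers(1, 10, n).astype(float),
    "education": rng.choice(edu_levels, n),
    "job": rng.choice(["admin.", "services", "student"], n),
})
toy.loc[toy.index[::25], "job"] = np.nan          # inject some missing values
y_toy = (toy["age"] + rng.normal(0, 10, n) > 50).astype(int)

numerical_features = ["age", "campaign"]
ordinal_features = ["education"]
nominal_features = ["job"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numerical_features),
    ("ord", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OrdinalEncoder(categories=[edu_levels])),
    ]), ordinal_features),
    ("nom", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), nominal_features),
])

model = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

X_tr, X_te, y_tr, y_te = train_test_split(
    toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)
model.fit(X_tr, y_tr)          # all preprocessing is fitted on X_tr only
y_pred = model.predict(X_te)   # X_te is only transformed, never fitted
print(classification_report(y_te, y_pred))
```

The manual approach requested in the exercise does the same work step by step; the pipeline form mainly reduces the risk of fitting a transformer on the test data by accident.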


## **2. Make feature lists**
`Binary encoding`
*  default
*  housing
*  loan
*  contact

`nominal_features`
*  job
*  marital
*  day_of_week
*  month
*  poutcome

`numerical_features`
*  age
*  campaign
*  duration  -> dropped because of data leakage
*  pdays
*  previous

`ordinal_features`
*  education



## **3. Preprocessing**

### 3.1 Nominal Features
`nominal_features`
*  job
*  marital
*  day_of_week
*  month
*  poutcome

#### 3.1.1 Marital

*  Imputation

In [22]:
# Impute missing values with the most frequent category
nom_col_demo1 = ['marital']

imputer_nom1 = SimpleImputer(strategy='most_frequent')
# fit_transform below learns the most frequent category from the training data only

X_train_imputed_nom_demo1 = imputer_nom1.fit_transform(X_train[nom_col_demo1])
X_test_imputed_nom_demo1 = imputer_nom1.transform(X_test[nom_col_demo1])

In [23]:
# To use value_counts(), convert the NumPy array to a pandas Series
pd.Series(X_train_imputed_nom_demo1.flatten()).value_counts()

married     17524
single       8057
divorced     3250


*   Nominal Encoding

In [24]:
# Nominal Encoding
onehot_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

## Write your code here
X_train_onehot_encoded_demo1 = onehot_encoder.fit_transform(X_train_imputed_nom_demo1)
X_test_onehot_encoded_demo1 = onehot_encoder.transform(X_test_imputed_nom_demo1)

In [25]:
X_train_onehot_encoded_demo1

array([[0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       ...,
       [0., 1., 0.],
       [0., 1., 0.],
       [1., 0., 0.]])

In [26]:
X_train = pd.concat([X_train.reset_index(drop=True),
                              pd.DataFrame(X_train_onehot_encoded_demo1, columns = onehot_encoder.get_feature_names_out(['marital']))], axis=1)

X_test = pd.concat([X_test.reset_index(drop=True),
                              pd.DataFrame(X_test_onehot_encoded_demo1, columns = onehot_encoder.get_feature_names_out(['marital']))], axis=1)


In [27]:
X_train

Unnamed: 0,age,marital,education,default,housing,loan,contact,month,day_of_week,campaign,pdays,previous,poutcome,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_student,job_technician,job_unemployed,marital_divorced,marital_married,marital_single
0,-0.389990,married,1.0,no,no,no,telephone,jun,mon,-0.209842,999,0,nonexistent,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,-1.828602,single,5.0,no,no,no,telephone,nov,wed,-0.566329,999,0,nonexistent,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
2,0.377269,married,3.0,,yes,no,cellular,jul,wed,-0.566329,999,0,nonexistent,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
3,-0.677712,single,6.0,no,yes,no,cellular,jul,fri,-0.209842,999,0,nonexistent,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,1.432251,married,1.0,,yes,no,telephone,may,mon,-0.566329,999,0,nonexistent,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28826,-0.581805,single,6.0,no,yes,no,telephone,aug,tue,-0.566329,999,0,nonexistent,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
28827,1.528159,married,4.0,no,no,no,cellular,may,mon,1.929082,999,0,nonexistent,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
28828,-0.581805,married,1.0,,no,no,cellular,jul,wed,0.859620,999,0,nonexistent,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
28829,0.185455,married,6.0,no,no,no,cellular,aug,fri,-0.566329,999,0,nonexistent,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [28]:
# Drop the original marital column
X_train = X_train.drop(['marital'], axis=1)
X_test = X_test.drop(['marital'], axis=1)

In [29]:
# Drop marital_divorced to prevent the dummy variable trap.
X_train = X_train.drop(['marital_divorced'], axis=1)
X_test = X_test.drop(['marital_divorced'], axis=1)

In [30]:
X_train

Unnamed: 0,age,education,default,housing,loan,contact,month,day_of_week,campaign,pdays,previous,poutcome,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_student,job_technician,job_unemployed,marital_married,marital_single
0,-0.389990,1.0,no,no,no,telephone,jun,mon,-0.209842,999,0,nonexistent,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,-1.828602,5.0,no,no,no,telephone,nov,wed,-0.566329,999,0,nonexistent,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
2,0.377269,3.0,,yes,no,cellular,jul,wed,-0.566329,999,0,nonexistent,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
3,-0.677712,6.0,no,yes,no,cellular,jul,fri,-0.209842,999,0,nonexistent,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,1.432251,1.0,,yes,no,telephone,may,mon,-0.566329,999,0,nonexistent,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28826,-0.581805,6.0,no,yes,no,telephone,aug,tue,-0.566329,999,0,nonexistent,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
28827,1.528159,4.0,no,no,no,cellular,may,mon,1.929082,999,0,nonexistent,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
28828,-0.581805,1.0,,no,no,cellular,jul,wed,0.859620,999,0,nonexistent,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
28829,0.185455,6.0,no,no,no,cellular,aug,fri,-0.566329,999,0,nonexistent,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


*  At this point in the nominal feature preprocessing, two variables have been handled:
   *  job ✅
   *  marital ✅ (both modified directly in `X_train` and `X_test`)
*  From here on, changes are made to `X_train_copy` and `X_test_copy`, as the instructions require and to guard against mistakes. Three nominal features remain:
   *  day_of_week
   *  month
   *  poutcome

#### 3.1.2 day_of_week, month, poutcome

In [31]:
X_train_copy = X_train.copy()
X_test_copy = X_test.copy()

In [32]:
X_train_copy[['day_of_week', 'month', 'poutcome']].isnull().sum()

day_of_week    0
month          0
poutcome       0


In [33]:
# Create dummy variables for the training set
X_train_copy = pd.get_dummies(
    X_train_copy,
    columns=['day_of_week', 'month', 'poutcome'],
    drop_first=True)

In [34]:
# Create dummy variables for the test set
X_test_copy = pd.get_dummies(
    X_test_copy,
    columns=['day_of_week', 'month', 'poutcome'],
    drop_first=True)

In [35]:
X_train_copy

Unnamed: 0,age,education,default,housing,loan,contact,campaign,pdays,previous,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_student,job_technician,job_unemployed,marital_married,marital_single,day_of_week_mon,day_of_week_thu,day_of_week_tue,day_of_week_wed,month_aug,month_dec,month_jul,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,poutcome_nonexistent,poutcome_success
0,-0.389990,1.0,no,no,no,telephone,-0.209842,999,0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,True,False,False,False,False,False,False,True,False,False,False,False,False,True,False
1,-1.828602,5.0,no,no,no,telephone,-0.566329,999,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,False,False,False,True,False,False,False,False,False,False,True,False,False,True,False
2,0.377269,3.0,,yes,no,cellular,-0.566329,999,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,False,False,False,True,False,False,True,False,False,False,False,False,False,True,False
3,-0.677712,6.0,no,yes,no,cellular,-0.209842,999,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,False,False,False,False,False,False,True,False,False,False,False,False,False,True,False
4,1.432251,1.0,,yes,no,telephone,-0.566329,999,0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,True,False,False,False,False,False,False,False,False,True,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28826,-0.581805,6.0,no,yes,no,telephone,-0.566329,999,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,False,False,True,False,True,False,False,False,False,False,False,False,False,True,False
28827,1.528159,4.0,no,no,no,cellular,1.929082,999,0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,True,False,False,False,False,False,False,False,False,True,False,False,False,True,False
28828,-0.581805,1.0,,no,no,cellular,0.859620,999,0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,False,False,False,True,False,False,True,False,False,False,False,False,False,True,False
28829,0.185455,6.0,no,no,no,cellular,-0.566329,999,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,False,False,False,False,True,False,False,False,False,False,False,False,False,True,False


### 3.2 numerical_features (Standard Scaling)
`numerical_features`
*  age ✅  
*  campaign ✅
*  duration  -> drop ; data leakage problem ✅
*  pdays
*  previous

In [37]:
num_cols_demo = ['pdays','previous']

scaler = StandardScaler()

# Fit the scaler ONLY on the training data

X_train_copy[num_cols_demo] = scaler.fit_transform(X_train_copy[num_cols_demo])

X_test_copy[num_cols_demo] = scaler.transform(X_test_copy[num_cols_demo])

X_train_copy[num_cols_demo]

Unnamed: 0,pdays,previous
0,0.196323,-0.349598
1,0.196323,-0.349598
2,0.196323,-0.349598
3,0.196323,-0.349598
4,0.196323,-0.349598
...,...,...
28826,0.196323,-0.349598
28827,0.196323,-0.349598
28828,0.196323,-0.349598
28829,0.196323,-0.349598
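
Almost every row shows the same scaled `pdays` value because, per the dataset description, `pdays = 999` is a sentinel meaning the client was not previously contacted. Scaling the column as is treats 999 as a real number of days. One possible alternative, shown on made-up values and not part of the lab, is to add an indicator column and neutralise the sentinel; the column names `was_contacted` and `pdays_clean` are hypothetical.

```python
import pandas as pd

# Made-up pdays values; 999 marks "not previously contacted".
toy = pd.DataFrame({"pdays": [999, 3, 999, 6, 999]})

# Indicator: 1 if the client was contacted in a previous campaign.
toy["was_contacted"] = (toy["pdays"] != 999).astype(int)

# Replace the sentinel with 0 so it no longer dominates the scale.
# The indicator keeps these rows distinguishable from a true pdays of 0.
toy["pdays_clean"] = toy["pdays"].where(toy["pdays"] != 999, 0)

print(toy)
```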


### 3.3 Binary encoding
`Binary encoding`
*  default
*  housing
*  loan
*  contact


In [38]:
binary_cols = ['default', 'housing', 'loan','contact']
X_train_copy[binary_cols].isnull().sum()

default    6030
housing     689
loan        689
contact       0


In [39]:
for col in binary_cols:
    print(X_train_copy[col].value_counts(), "\n")

default
no     22798
yes        3
Name: count, dtype: int64 

housing
yes    15065
no     13077
Name: count, dtype: int64 

loan
no     23763
yes     4379
Name: count, dtype: int64 

contact
cellular     18235
telephone    10596
Name: count, dtype: int64 



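*  `housing` and `loan` each have 689 missing values in the training set; these are imputed with the most frequent value below.
*  `default` is extremely imbalanced (22,798 `no` vs. 3 `yes`) and has 6,030 missing values, so it is dropped.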

In [40]:
# drop default
X_train_copy = X_train_copy.drop(['default'], axis=1)

In [41]:
X_test_copy = X_test_copy.drop(['default'], axis=1)

In [42]:
# Imputation
yes_no_cols_2 = ['housing', 'loan']

imputer_ord = SimpleImputer(strategy='most_frequent')

X_train_copy[yes_no_cols_2] = imputer_ord.fit_transform(X_train_copy[yes_no_cols_2])
X_test_copy[yes_no_cols_2] = imputer_ord.transform(X_test_copy[yes_no_cols_2])

X_train_copy[yes_no_cols_2].isnull().sum()

housing    0
loan       0


In [43]:
X_train_copy

Unnamed: 0,age,education,housing,loan,contact,campaign,pdays,previous,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_student,job_technician,job_unemployed,marital_married,marital_single,day_of_week_mon,day_of_week_thu,day_of_week_tue,day_of_week_wed,month_aug,month_dec,month_jul,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,poutcome_nonexistent,poutcome_success
0,-0.389990,1.0,no,no,telephone,-0.209842,0.196323,-0.349598,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,True,False,False,False,False,False,False,True,False,False,False,False,False,True,False
1,-1.828602,5.0,no,no,telephone,-0.566329,0.196323,-0.349598,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,False,False,False,True,False,False,False,False,False,False,True,False,False,True,False
2,0.377269,3.0,yes,no,cellular,-0.566329,0.196323,-0.349598,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,False,False,False,True,False,False,True,False,False,False,False,False,False,True,False
3,-0.677712,6.0,yes,no,cellular,-0.209842,0.196323,-0.349598,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,False,False,False,False,False,False,True,False,False,False,False,False,False,True,False
4,1.432251,1.0,yes,no,telephone,-0.566329,0.196323,-0.349598,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,True,False,False,False,False,False,False,False,False,True,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28826,-0.581805,6.0,yes,no,telephone,-0.566329,0.196323,-0.349598,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,False,False,True,False,True,False,False,False,False,False,False,False,False,True,False
28827,1.528159,4.0,no,no,cellular,1.929082,0.196323,-0.349598,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,True,False,False,False,False,False,False,False,False,True,False,False,False,True,False
28828,-0.581805,1.0,no,no,cellular,0.859620,0.196323,-0.349598,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,False,False,False,True,False,False,True,False,False,False,False,False,False,True,False
28829,0.185455,6.0,no,no,cellular,-0.566329,0.196323,-0.349598,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,False,False,False,False,True,False,False,False,False,False,False,False,False,True,False


In [44]:
# Map 'housing' and 'loan' to 0 (no) and 1 (yes)
for col in yes_no_cols_2:
    X_train_copy[col] = X_train_copy[col].map({'yes': 1, 'no': 0})
    X_test_copy[col] = X_test_copy[col].map({'yes': 1, 'no': 0})

In [45]:
X_train_copy

Unnamed: 0,age,education,housing,loan,contact,campaign,pdays,previous,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_student,job_technician,job_unemployed,marital_married,marital_single,day_of_week_mon,day_of_week_thu,day_of_week_tue,day_of_week_wed,month_aug,month_dec,month_jul,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,poutcome_nonexistent,poutcome_success
0,-0.389990,1.0,0,0,telephone,-0.209842,0.196323,-0.349598,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,True,False,False,False,False,False,False,True,False,False,False,False,False,True,False
1,-1.828602,5.0,0,0,telephone,-0.566329,0.196323,-0.349598,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,False,False,False,True,False,False,False,False,False,False,True,False,False,True,False
2,0.377269,3.0,1,0,cellular,-0.566329,0.196323,-0.349598,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,False,False,False,True,False,False,True,False,False,False,False,False,False,True,False
3,-0.677712,6.0,1,0,cellular,-0.209842,0.196323,-0.349598,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,False,False,False,False,False,False,True,False,False,False,False,False,False,True,False
4,1.432251,1.0,1,0,telephone,-0.566329,0.196323,-0.349598,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,True,False,False,False,False,False,False,False,False,True,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28826,-0.581805,6.0,1,0,telephone,-0.566329,0.196323,-0.349598,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,False,False,True,False,True,False,False,False,False,False,False,False,False,True,False
28827,1.528159,4.0,0,0,cellular,1.929082,0.196323,-0.349598,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,True,False,False,False,False,False,False,False,False,True,False,False,False,True,False
28828,-0.581805,1.0,0,0,cellular,0.859620,0.196323,-0.349598,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,False,False,False,True,False,False,True,False,False,False,False,False,False,True,False
28829,0.185455,6.0,0,0,cellular,-0.566329,0.196323,-0.349598,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,False,False,False,False,True,False,False,False,False,False,False,False,False,True,False


In [46]:
# Map the `contact` variable to 0 (cellular) and 1 (telephone)
contact_cols = ['contact']
for col in contact_cols:
    X_train_copy[col] = X_train_copy[col].map({'telephone': 1, 'cellular': 0})
    X_test_copy[col] = X_test_copy[col].map({'telephone': 1, 'cellular': 0})

X_test_copy[contact_cols].value_counts()

contact,count
0,7909
1,4448


In [47]:
X_train_copy

Unnamed: 0,age,education,housing,loan,contact,campaign,pdays,previous,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_student,job_technician,job_unemployed,marital_married,marital_single,day_of_week_mon,day_of_week_thu,day_of_week_tue,day_of_week_wed,month_aug,month_dec,month_jul,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,poutcome_nonexistent,poutcome_success
0,-0.389990,1.0,0,0,1,-0.209842,0.196323,-0.349598,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,True,False,False,False,False,False,False,True,False,False,False,False,False,True,False
1,-1.828602,5.0,0,0,1,-0.566329,0.196323,-0.349598,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,False,False,False,True,False,False,False,False,False,False,True,False,False,True,False
2,0.377269,3.0,1,0,0,-0.566329,0.196323,-0.349598,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,False,False,False,True,False,False,True,False,False,False,False,False,False,True,False
3,-0.677712,6.0,1,0,0,-0.209842,0.196323,-0.349598,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,False,False,False,False,False,False,True,False,False,False,False,False,False,True,False
4,1.432251,1.0,1,0,1,-0.566329,0.196323,-0.349598,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,True,False,False,False,False,False,False,False,False,True,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28826,-0.581805,6.0,1,0,1,-0.566329,0.196323,-0.349598,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,False,False,True,False,True,False,False,False,False,False,False,False,False,True,False
28827,1.528159,4.0,0,0,0,1.929082,0.196323,-0.349598,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,True,False,False,False,False,False,False,False,False,True,False,False,False,True,False
28828,-0.581805,1.0,0,0,0,0.859620,0.196323,-0.349598,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,False,False,False,True,False,False,True,False,False,False,False,False,False,True,False
28829,0.185455,6.0,0,0,0,-0.566329,0.196323,-0.349598,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,False,False,False,False,True,False,False,False,False,False,False,False,False,True,False


In [48]:
# Cast every column (including the boolean dummy columns) to float so the feature matrix has a single numeric dtype
X_train_copy = X_train_copy.astype(float)
X_test_copy = X_test_copy.astype(float)
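
As a quick check of what the cast above does, here is a minimal sketch on a hypothetical toy frame (the column names are borrowed from the table above, but the values are made up for illustration). It shows that `astype(float)` turns `True`/`False` into `1.0`/`0.0` and leaves already-numeric columns as floats.

```python
import pandas as pd

# Hypothetical toy frame: one scaled numeric column and one boolean dummy column
toy = pd.DataFrame({
    "age": [-0.39, 1.43, 0.19],
    "month_may": [True, False, True],
})

# Cast the whole frame to float: True -> 1.0, False -> 0.0
toy_float = toy.astype(float)

print(toy_float.dtypes)
print(toy_float)
```

After the cast, every column has dtype `float64`, which is the same state `X_train_copy` is in when it is displayed in the next cell.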

In [49]:
X_train_copy

Unnamed: 0,age,education,housing,loan,contact,campaign,pdays,previous,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_student,job_technician,job_unemployed,marital_married,marital_single,day_of_week_mon,day_of_week_thu,day_of_week_tue,day_of_week_wed,month_aug,month_dec,month_jul,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,poutcome_nonexistent,poutcome_success
0,-0.389990,1.0,0.0,0.0,1.0,-0.209842,0.196323,-0.349598,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,-1.828602,5.0,0.0,0.0,1.0,-0.566329,0.196323,-0.349598,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
2,0.377269,3.0,1.0,0.0,0.0,-0.566329,0.196323,-0.349598,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,-0.677712,6.0,1.0,0.0,0.0,-0.209842,0.196323,-0.349598,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,1.432251,1.0,1.0,0.0,1.0,-0.566329,0.196323,-0.349598,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28826,-0.581805,6.0,1.0,0.0,1.0,-0.566329,0.196323,-0.349598,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
28827,1.528159,4.0,0.0,0.0,0.0,1.929082,0.196323,-0.349598,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
28828,-0.581805,1.0,0.0,0.0,0.0,0.859620,0.196323,-0.349598,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
28829,0.185455,6.0,0.0,0.0,0.0,-0.566329,0.196323,-0.349598,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


## 4. Model Fitting

In [61]:
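# Baseline model: logistic regression with C=0.1
# (C is the inverse regularization strength, so smaller C means stronger regularization)
model = LogisticRegression(C=0.1)
model.fit(X_train_copy, y_train)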

## 5. Evaluate the Model

In [62]:
# Make predictions on the test set
y_pred = model.predict(X_test_copy)

# Print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.91      0.99      0.94     10981
           1       0.63      0.17      0.27      1376

    accuracy                           0.90     12357
   macro avg       0.77      0.58      0.61     12357
weighted avg       0.87      0.90      0.87     12357
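
The report shows 0.90 accuracy but a recall of only 0.17 for class 1, so most positive cases are missed. As a reminder of how the precision, recall, and f1-score columns are defined, here is a minimal sketch in plain Python. The labels below are hypothetical toy values, not the lab's data, and `binary_metrics` is a helper name introduced only for this illustration.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy labels: 4 positives, of which the model finds only 1
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

precision, recall, f1 = binary_metrics(y_true_toy, y_pred_toy)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case accuracy is 0.60 even though three of the four positives are missed, which mirrors how accuracy can stay high on an imbalanced dataset while recall on the minority class stays low.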



## Part 2: Fashion-MNIST Dataset - Image Classification

### Load Fashion-MNIST Dataset

The Fashion-MNIST dataset consists of 28x28 grayscale images of fashion items.

In [52]:
(fm_X_train, fm_y_train), (fm_X_test, fm_y_test) = fashion_mnist.load_data()

print(f"Fashion-MNIST Train data shape: {fm_X_train.shape}")
print(f"Fashion-MNIST Train labels shape: {fm_y_train.shape}")
print(f"Fashion-MNIST Test data shape: {fm_X_test.shape}")
print(f"Fashion-MNIST Test labels shape: {fm_y_test.shape}")

Fashion-MNIST Train data shape: (60000, 28, 28)
Fashion-MNIST Train labels shape: (60000,)
Fashion-MNIST Test data shape: (10000, 28, 28)
Fashion-MNIST Test labels shape: (10000,)


In [53]:
print(f"First image {fm_X_train[0]}")
print(f"First label {fm_y_train[0]}")

First image [[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   1   0   0  13  73   0
    0   1   4   0   0   0   0   1   1   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   3   0  36 136 127  62
   54   0   0   0   1   3   4   0   0   3]
 [  0   0   0   0   0   0   0   0   0   0   0   0   6   0 102 204 176 134
  144 123  23   0   0   0   0  12  10   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0 155 236 207 178
  107 156 161 109  64  23  77 130  72  15]
 [  0   0   0   0   0   0   0   0   0   0   0   1   0  69 207 223 218 216
  216 163 127 121 122 146 141  88 172  66]
 [  0   0   0   0   0   0   0   0   0   1   1   1   

### Visualize Fashion-MNIST Images

Let's see what these images look like.

In [54]:
fashion_mnist_class_names = [
    'T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
    'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot'
]

# Visualize the images
## Write your code here
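
One possible way to fill in this cell is a small grid of `plt.imshow` panels titled with the class names. The sketch below is only an illustration: `plot_image_grid` is a hypothetical helper, and it is run on random arrays of the same shape as Fashion-MNIST images so that it does not depend on the download. In the lab you would pass `fm_X_train[:10]` and `fm_y_train[:10]` instead.

```python
import numpy as np
import matplotlib.pyplot as plt

class_names = [
    'T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
    'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot'
]


def plot_image_grid(images, labels, names, n_rows=2, n_cols=5):
    """Show the first n_rows * n_cols images in a grid, titled by class name."""
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(2 * n_cols, 2 * n_rows))
    for ax, image, label in zip(axes.ravel(), images, labels):
        ax.imshow(image, cmap="gray")   # grayscale rendering of a 28x28 array
        ax.set_title(names[label])      # integer label -> readable class name
        ax.axis("off")
    fig.tight_layout()
    return fig


# Stand-in data (assumption): random pixels with the Fashion-MNIST image shape
rng = np.random.default_rng(0)
demo_images = rng.integers(0, 256, size=(10, 28, 28), dtype=np.uint8)
demo_labels = np.arange(10)

fig = plot_image_grid(demo_images, demo_labels, class_names)
```

In a notebook the figure renders at the end of the cell; in a plain script, call `plt.show()` afterwards.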



### **Exercise 2: Preprocessing Images (Flatten and Scale)**

Images are 2D arrays (matrices of pixels) and pixel values are integers from 0-255. For Logistic Regression, we need:
*  **Flattening:** Convert each 28x28 image into a 1D array of 784 features.
*  **Scaling:** Normalize pixel values from [0, 255] to [0, 1].

**Instructions:**

1.  **Flatten:** Use the `.reshape()` method (see [documentation](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.reshape.html)). `fm_X_train` has shape `(num_samples, 28, 28)`; reshape it to `(num_samples, 28*28)`, and do the same for `fm_X_test`.
2.  **Scale:** Divide the flattened pixel values by 255.0 to get values between 0 and 1. Store the results in `fm_X_train_scaled` and `fm_X_test_scaled`.
3.  **Train Logistic Regression:**
    *   Initialize `LogisticRegression(solver='saga')`. `saga` is a good solver when both the number of samples and the number of features are large.
    *   Fit the model on your *processed* `fm_X_train_scaled` and `fm_y_train`.
4.  **Make Predictions:** Use `predict()` to make predictions on the *processed* `fm_X_test_scaled`. Store the result in `fm_y_pred`.
5.  **Print Classification Report:** Print `classification_report` ([Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)). What are the accuracy, average precision, average recall, and average f1-score?
6.  **Visualize Misclassifications:**
    *   Find the indices in `fm_X_test` where your model made incorrect predictions (i.e., `fm_y_pred != fm_y_test`).
    *   Select 5 of these misclassified images.
    *   Plot these images (using `plt.imshow`). For each image, print its true label and its predicted label.
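
The flatten, scale, and misclassification-index steps above can be sketched on a small stand-in array. Everything prefixed with `toy_` is hypothetical data made up for illustration; the model-fitting step is left out so the sketch only needs NumPy.

```python
import numpy as np

# Stand-in for the image array (assumption): 6 random 28x28 images with values in 0..255
rng = np.random.default_rng(0)
toy_images = rng.integers(0, 256, size=(6, 28, 28), dtype=np.uint8)

# Flatten: keep the sample axis, merge the two pixel axes into 28*28 = 784 features
toy_flat = toy_images.reshape(toy_images.shape[0], -1)

# Scale: dividing by 255.0 converts to float and maps values into [0, 1]
toy_scaled = toy_flat / 255.0

print(toy_flat.shape)                       # (6, 784)
print(toy_scaled.min(), toy_scaled.max())

# Misclassified indices: positions where prediction and true label differ
toy_y_true = np.array([0, 1, 2, 3, 4, 5])
toy_y_pred = np.array([0, 1, 6, 3, 2, 5])   # made-up predictions
wrong_idx = np.flatnonzero(toy_y_pred != toy_y_true)
print(wrong_idx)                            # [2 4]
```

To plot a misclassified image, index the original unflattened array with these positions, since `plt.imshow` expects a 2D array rather than a 784-long vector.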

In [55]:
# --- YOUR CODE FOR EXERCISE 2 STARTS HERE ---





## Part 3: 20 Newsgroups Dataset - Text Classification

### Load 20 Newsgroups Dataset

The 20 Newsgroups dataset comprises about 18,000 newsgroup posts on 20 topics.

In [56]:
news_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)
news_test = fetch_20newsgroups(subset='test', shuffle=True, random_state=42)

X_train_news, y_train_news = news_train.data, news_train.target
X_test_news, y_test_news = news_test.data, news_test.target

print(f"Number of training documents: {len(X_train_news)}")
print(f"Number of test documents: {len(X_test_news)}")
print(f"Categories: {news_train.target_names}")

Number of training documents: 11314
Number of test documents: 7532
Categories: ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


### Explore Sample Document

In [57]:
# Print the first document and its class
## Write your code here



### Preprocessing: Text Vectorization Demonstration with `TfidfVectorizer`

$$
\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)
$$

Where:

$$
\text{TF}(t, d) = \frac{\text{number of occurrences of word }t\text{ in } d}{\text{total number of words in } d} \quad \text{ and } \quad
\text{IDF}(t, D) = \log\left(\frac{\text{total number of documents in } D}{\text{number of documents in } D \text{ that contain word }t}\right).
$$
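
To make the formula concrete, here is a minimal worked example in plain Python that follows the definitions above literally. The three tokenized documents are hypothetical, and `tf`, `idf`, and `tf_idf` are helper names introduced only for this sketch. Note that `TfidfVectorizer` by default uses a smoothed IDF and L2-normalizes each row, so its numbers will not match this textbook version exactly.

```python
import math

# Hypothetical toy corpus, already lowercased and split into words
docs = [
    ["this", "is", "the", "first", "document"],
    ["this", "document", "is", "the", "second", "document"],
    ["and", "this", "is", "the", "third", "one"],
]


def tf(term, doc):
    """Occurrences of term in doc, divided by the number of words in doc."""
    return doc.count(term) / len(doc)


def idf(term, corpus):
    """log(total documents / documents containing term); term must occur somewhere."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)


def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)


# "document" appears twice among the 6 words of docs[1] and in 2 of 3 documents
score_document = tf_idf("document", docs[1], docs)   # (2/6) * log(3/2)

# "this" appears in every document, so its IDF (and TF-IDF) is 0
score_this = tf_idf("this", docs[0], docs)

print(score_document, score_this)
```

The second score illustrates why TF-IDF downweights words that occur everywhere: a term present in every document carries no information for telling documents apart.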

In [58]:
sample_sentences = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?"
]

vectorizer = TfidfVectorizer(stop_words='english')

# Fit and transform the sample sentences
sample_vec_output_sparse = vectorizer.fit_transform(sample_sentences)

sample_vec_output_dense = sample_vec_output_sparse.toarray()

print(vectorizer.vocabulary_)
print(vectorizer.get_feature_names_out())
print(sample_vec_output_dense)

### **Exercise 3: Apply TF-IDF Vectorization to Full Dataset**

Now, apply `TfidfVectorizer` to the actual training and testing datasets for the 20 Newsgroups classification task.

**Instructions:**

1.  **Initialize `TfidfVectorizer`:**
    *   Initialize `TfidfVectorizer`. Use `stop_words='english'` to remove common words.
2.  **Fit and Transform Training Data:**
    *   Call `fit_transform()` on `X_train_news` to learn the vocabulary and transform the training text into TF-IDF features. Store the result in `X_train_vec`.
3.  **Transform Test Data:**
    *   Call `transform()` on `X_test_news` using the *already fitted* vectorizer. Store the result in `X_test_vec`. **Crucially, do not call `fit_transform()` on the test data!** This would cause data leakage.
4.  **Initialize Logistic Regression:**
    *   Initialize `LogisticRegression(solver='saga')`. `saga` is a good solver when both number of samples and number of features are large.
5.  **Train the Model:**
    *   Fit the model on your `X_train_vec` and `y_train_news`.
6.  **Make Predictions:**
    *   Make predictions using `predict()` on the `X_test_vec`.
7.  **Evaluate the Model:**
    *   Print `classification_report` ([Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)). What are the accuracy, average precision, average recall, and average f1-score?
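
The fit-on-train, transform-on-test pattern in the steps above can be sketched on a tiny corpus. The sentences and labels below are hypothetical stand-ins for `X_train_news` and `X_test_news`, and `max_iter=1000` is an added assumption to give `saga` room to converge; it is not part of the exercise instructions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus: label 0 = space, label 1 = hockey
toy_train = [
    "the rocket reached orbit",
    "the shuttle launch was delayed",
    "the goalie made a save",
    "the team won the hockey game",
]
toy_train_y = [0, 0, 1, 1]
toy_test = ["orbit launch", "hockey goalie"]

toy_vectorizer = TfidfVectorizer(stop_words="english")

# Learn the vocabulary and IDF weights from the training text only
toy_train_vec = toy_vectorizer.fit_transform(toy_train)

# Reuse the fitted vocabulary for the test text (no refitting)
toy_test_vec = toy_vectorizer.transform(toy_test)

toy_clf = LogisticRegression(solver="saga", max_iter=1000)
toy_clf.fit(toy_train_vec, toy_train_y)
toy_pred = toy_clf.predict(toy_test_vec)

print(toy_train_vec.shape, toy_test_vec.shape)
print(toy_pred)
```

Because the test matrix is produced by `transform()`, it has the same number of columns as the training matrix, which is what the classifier expects at prediction time.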

In [None]:
# --- YOUR CODE FOR EXERCISE 3 STARTS HERE ---


