
### Statistical Learning for Data Science 2 (229352)
#### Instructor: Donlapark Ponnoprat

#### [Course website](https://donlapark.pages.dev/229352/)

## Lab #2

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_auc_score, roc_curve

# For Fashion-MNIST
from tensorflow.keras.datasets import fashion_mnist

# For 20 Newsgroups
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

## Part 1: Marketing Campaign Dataset - Manual Data Preprocessing & Logistic Regression

### Load the Marketing Campaign Dataset ([Data Information](https://archive.ics.uci.edu/dataset/222/bank+marketing))

The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed (`'yes'`) or not (`'no'`).

In [3]:
bank_url = 'https://raw.githubusercontent.com/donlap/ds352-labs/main/bank.csv'

df = pd.read_csv(bank_url, sep=';', na_values=['unknown'])
df = df.drop(["emp.var.rate", "cons.price.idx", "cons.conf.idx", "euribor3m", "nr.employed"], axis=1)
print("Shape of the dataset:", df.shape)
df.head()

Shape of the dataset: (41188, 16)


Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,261,1,999,0,nonexistent,no
1,57,services,married,high.school,,no,no,telephone,may,mon,149,1,999,0,nonexistent,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,226,1,999,0,nonexistent,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,151,1,999,0,nonexistent,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,307,1,999,0,nonexistent,no
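
The `na_values=['unknown']` argument in the loading cell is what turns the dataset's `'unknown'` placeholders into `NaN`. A minimal sketch on a made-up three-row CSV (the rows are illustrative and not taken from `bank.csv`; the names `toy_csv`, `raw`, and `parsed` are hypothetical):

```python
import io
import pandas as pd

# Hypothetical miniature CSV in the same ';'-separated layout as bank.csv.
toy_csv = "age;job;default\n56;housemaid;no\n57;services;unknown\n37;unknown;no\n"

# Without na_values, 'unknown' is kept as an ordinary string category.
raw = pd.read_csv(io.StringIO(toy_csv), sep=';')

# With na_values, every 'unknown' cell is parsed as a missing value (NaN).
parsed = pd.read_csv(io.StringIO(toy_csv), sep=';', na_values=['unknown'])

print(raw.isnull().sum().sum())   # total missing cells without na_values
print(parsed.isnull().sum())      # per-column missing counts with na_values
```

Parsing the placeholder as `NaN` at load time lets `isnull()` and `SimpleImputer` treat it as missing without any extra replacement step.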


### Data Exploration

In [4]:
print("--- Missing Values Count ---")
print(df.isnull().sum())

--- Missing Values Count ---
age               0
job             330
marital          80
education      1731
default        8597
housing         990
loan            990
contact           0
month             0
day_of_week       0
duration          0
campaign          0
pdays             0
previous          0
poutcome          0
y                 0
dtype: int64


In [5]:
print("--- Unique Values for Categorical Columns ---")
for col in df.select_dtypes(include='object').columns:
    print(f"\n'{col}' unique values:")
    print(df[col].value_counts(dropna=False)) # Include NaN counts

--- Unique Values for Categorical Columns ---

'job' unique values:
job
admin.           10422
blue-collar       9254
technician        6743
services          3969
management        2924
retired           1720
entrepreneur      1456
self-employed     1421
housemaid         1060
unemployed        1014
student            875
NaN                330
Name: count, dtype: int64

'marital' unique values:
marital
married     24928
single      11568
divorced     4612
NaN            80
Name: count, dtype: int64

'education' unique values:
education
university.degree      12168
high.school             9515
basic.9y                6045
professional.course     5243
basic.4y                4176
basic.6y                2292
NaN                     1731
illiterate                18
Name: count, dtype: int64

'default' unique values:
default
no     32588
NaN     8597
yes        3
Name: count, dtype: int64

'housing' unique values:
housing
yes    21576
no     18622
NaN      990
Name: count, dtype: int64


### Data Preprocessing

In [51]:
# Map target variable 'y' to 0 (no) and 1 (yes)
df['y_new'] = df['y'].map({'yes': 1, 'no': 0})

# 'duration' leaks the outcome (it is only known once the call has ended);
# it is kept here for the demonstrations and dropped in Exercise 1.

# Define features (x) and target (y)
y = df['y_new']
x = df.drop(['y', 'y_new'], axis=1)

# The train/test split is done in the next cell, BEFORE any transformations.

# Inspect the target and the features
print(y)
print(x)



0        0
1        0
2        0
3        0
4        0
        ..
41183    1
41184    0
41185    0
41186    1
41187    0
Name: y_new, Length: 41188, dtype: int64
       age          job  marital            education default housing loan    contact month day_of_week  duration  campaign  pdays  previous     poutcome
0       56    housemaid  married             basic.4y      no      no   no  telephone   may         mon       261         1    999         0  nonexistent
1       57     services  married          high.school     NaN      no   no  telephone   may         mon       149         1    999         0  nonexistent
2       37     services  married          high.school      no     yes   no  telephone   may         mon       226         1    999         0  nonexistent
3       40       admin.  married             basic.6y      no      no   no  telephone   may         mon       151         1    999         0  nonexistent
4       56     services  married          high.school      no      n

We will apply `StandardScaler()`, `OrdinalEncoder()`, and `OneHotEncoder()` on a few selected columns.

In [66]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

y_test

Unnamed: 0,y_new
3326,0
11446,0
12913,0
36343,0
16488,0
...,...
10950,0
30483,0
8733,0
24199,0
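
The `train_test_split` call above uses neither `random_state` nor `stratify`, so each run produces a different split and the share of `'yes'` clients can drift between the two parts. A minimal sketch of a seeded, stratified split on made-up imbalanced data (the arrays and the seed `42` are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced data: 90 negatives and 10 positives (made up for illustration).
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 90 + [1] * 10)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
print("positive rate (train):", y_tr.mean())
print("positive rate (test): ", y_te.mean())
```

Whether to stratify is a design choice; it tends to matter most when one class is rare, as with the `'yes'` class here.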


**1. Numerical Feature: `age` and `campaign` (Standard Scaling)**

In [53]:
num_cols_demo = ['age', 'campaign']

scaler = StandardScaler()

# Fit the scaler ONLY on the training data, then reuse it on the test data
x_train[num_cols_demo] = scaler.fit_transform(x_train[num_cols_demo])
x_test[num_cols_demo] = scaler.transform(x_test[num_cols_demo])

x_test

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome
4509,-0.096924,management,married,high.school,no,yes,yes,telephone,may,tue,52,2.663360,999,0,nonexistent
31107,0.477980,entrepreneur,married,university.degree,no,no,no,cellular,may,wed,139,1.227378,999,0,nonexistent
23656,-0.096924,technician,married,university.degree,,yes,no,cellular,aug,thu,76,-0.567599,999,0,nonexistent
26486,-0.192741,self-employed,married,professional.course,no,no,no,cellular,nov,thu,24,-0.567599,999,1,failure
4940,0.861249,blue-collar,married,basic.9y,no,yes,no,telephone,may,wed,281,-0.567599,999,0,nonexistent
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14868,0.573797,blue-collar,single,basic.4y,,yes,no,telephone,jul,wed,558,1.227378,999,0,nonexistent
34293,-0.096924,services,divorced,high.school,no,no,no,cellular,may,thu,186,-0.567599,999,0,nonexistent
12394,1.148701,blue-collar,married,basic.9y,,no,no,cellular,jul,mon,128,0.150392,999,0,nonexistent
21867,1.819422,retired,married,basic.9y,no,no,no,cellular,aug,wed,351,-0.567599,999,0,nonexistent


Let's take a look at the transformed `age` and `campaign` features and their statistics.

In [38]:
# x_train[num_cols_demo] was scaled in place above, so it already holds the scaled values.
print("\nScaled x_train 'age' and 'campaign' head:")
print(x_train[num_cols_demo].head())

# StandardScaler uses the population standard deviation, hence ddof=0.
print("\nMean of scaled 'age' (train):", x_train['age'].mean())
print("Std Dev of scaled 'campaign' (train):", x_train['campaign'].std(ddof=0))


Scaled x_train 'age' and 'campaign' head:
            age  campaign
11957  0.575958  0.155944
25286  0.768069  0.516781
31221  0.095682 -0.565729
22035  1.344400  0.877617
94     0.191737 -0.565729
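
To see in isolation what "fit on the training data only" means, here is a minimal sketch on made-up ages (the arrays and the name `scaler_toy` are hypothetical). The scaler learns a mean and a standard deviation from the training values and reuses exactly those numbers on the test values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up ages standing in for a training and a test split.
train_age = np.array([[20.0], [30.0], [40.0], [50.0]])
test_age = np.array([[35.0], [60.0]])

scaler_toy = StandardScaler()
train_scaled = scaler_toy.fit_transform(train_age)  # learns mean and std from train only
test_scaled = scaler_toy.transform(test_age)        # reuses the training mean and std

print("learned mean:", scaler_toy.mean_, "learned std:", scaler_toy.scale_)
print("train mean/std after scaling:", train_scaled.mean(), train_scaled.std())
print("test mean after scaling:", test_scaled.mean())  # generally not 0
```

The scaled training column has mean 0 and standard deviation 1 by construction, while the test column usually does not, which is expected. Note that `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas `.std()` defaults to `ddof=1`.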

**2. Ordinal Feature: `education` (Ordinal Encoding with Imputation)**

- **Imputation**

In [68]:
ord_col_demo = ['education']

imputer_ord = SimpleImputer(strategy='most_frequent')

# Fit the imputer ONLY on the training data, then reuse it on the test data
x_train[ord_col_demo] = imputer_ord.fit_transform(x_train[ord_col_demo])
x_test[ord_col_demo] = imputer_ord.transform(x_test[ord_col_demo])

x_train['education']

Unnamed: 0,education
3162,university.degree
24402,basic.6y
12540,university.degree
38397,university.degree
27489,university.degree
...,...
12131,university.degree
31676,basic.6y
38695,university.degree
9163,university.degree


- **Ordinal Encoding**

In [69]:
education_categories = [
    'illiterate', 'basic.4y', 'basic.6y', 'basic.9y', 'high.school',
    'professional.course', 'university.degree', 'masters', 'doctorate'
]

In [70]:
ordinal_encoder = OrdinalEncoder(categories=[education_categories])

# Fit the encoder on the training data, then reuse it on the test data
x_train[ord_col_demo] = ordinal_encoder.fit_transform(x_train[ord_col_demo])
x_test[ord_col_demo] = ordinal_encoder.transform(x_test[ord_col_demo])

x_test

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome
3326,54,blue-collar,divorced,1.0,no,no,no,telephone,may,thu,192,3,999,0,nonexistent
11446,51,management,married,6.0,no,yes,no,telephone,jun,fri,74,1,999,0,nonexistent
12913,30,technician,married,6.0,,no,no,cellular,jul,tue,305,1,999,0,nonexistent
36343,48,technician,married,4.0,no,yes,no,cellular,jun,tue,115,1,999,0,nonexistent
16488,30,blue-collar,single,2.0,,yes,no,cellular,jul,wed,375,8,999,0,nonexistent
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10950,37,technician,single,6.0,no,no,no,telephone,jun,wed,190,3,999,0,nonexistent
30483,27,student,single,4.0,no,no,no,cellular,may,mon,64,2,999,1,failure
8733,38,blue-collar,married,2.0,no,yes,no,telephone,jun,wed,382,3,999,0,nonexistent
24199,40,management,married,2.0,no,no,no,cellular,nov,mon,56,1,999,0,nonexistent


Let's take a look at the imputed and ordinal-encoded `education`.

In [None]:
# 'education' was imputed and encoded in place above, so only the final codes remain in x_train.
print("\nOrdinal-encoded x_train 'education' (rows 20-24):")
print(x_train[ord_col_demo].iloc[20:25])
print("\nCategory order used by the encoder (code 0 first):")
print(ordinal_encoder.categories_[0])
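
The two steps for an ordinal feature (impute, then encode in a stated order) can be sketched on a made-up column. The values, the three-level `order` list, and the `*_toy` names are assumptions for illustration only:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder

# Made-up education values with one missing entry in each split.
toy_train = pd.DataFrame(
    {'education': ['basic.4y', 'high.school', np.nan, 'high.school']}, dtype=object
)
toy_test = pd.DataFrame({'education': [np.nan, 'university.degree']}, dtype=object)

# Step 1: fill missing values with the most frequent TRAINING category.
imputer_toy = SimpleImputer(strategy='most_frequent')
train_imputed = imputer_toy.fit_transform(toy_train)
test_imputed = imputer_toy.transform(toy_test)

# Step 2: map categories to integers in the stated order (lowest level first).
order = ['basic.4y', 'high.school', 'university.degree']
encoder_toy = OrdinalEncoder(categories=[order])
train_codes = encoder_toy.fit_transform(train_imputed)
test_codes = encoder_toy.transform(test_imputed)

print(train_codes.ravel())
print(test_codes.ravel())
```

The missing test value is filled with `'high.school'` because that is the most frequent category in the training column, not in the test column. Categories listed in `order` but absent from the data are allowed, which is why the lab's list can include levels such as `'masters'`.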

**3. Nominal Feature: `job` (One-Hot Encoding with Imputation)**

- **Imputation**

In [13]:
nom_col_demo = ['job']

imputer_nom = SimpleImputer(strategy='most_frequent')

# fit_transform fits on the training data, so a separate fit() call is not needed
x_train[nom_col_demo] = imputer_nom.fit_transform(x_train[nom_col_demo])
x_test[nom_col_demo] = imputer_nom.transform(x_test[nom_col_demo])

x_test

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome
17190,-0.195923,technician,single,6.0,no,no,no,cellular,jul,fri,337,-0.207629,999,0,nonexistent
22319,0.478105,technician,married,5.0,,yes,no,cellular,aug,thu,360,0.862854,999,0,nonexistent
31106,0.092947,self-employed,married,6.0,,no,no,cellular,may,wed,100,-0.207629,999,1,failure
21165,-0.484792,technician,married,5.0,no,yes,no,cellular,aug,mon,70,-0.207629,999,0,nonexistent
24477,-0.677371,management,married,6.0,no,yes,no,cellular,nov,mon,318,-0.564456,999,0,nonexistent
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27413,1.152134,management,divorced,6.0,no,no,no,cellular,nov,fri,185,-0.207629,999,0,nonexistent
39067,-1.640268,blue-collar,single,4.0,no,yes,no,cellular,dec,fri,248,1.219682,999,0,nonexistent
18467,-0.773661,technician,single,5.0,no,yes,yes,cellular,jul,thu,234,-0.564456,999,0,nonexistent
18179,-0.388502,blue-collar,divorced,3.0,no,no,no,telephone,jul,wed,109,1.576510,999,0,nonexistent


- **Nominal Encoding**

In [14]:
onehot_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

# Fit the encoder on the training data, then reuse it on the test data
x_train_onehot = onehot_encoder.fit_transform(x_train[nom_col_demo])
x_test_onehot = onehot_encoder.transform(x_test[nom_col_demo])

x_test_onehot

array([[0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.]])

In [15]:
job_cols = onehot_encoder.get_feature_names_out(['job'])

# reset_index(drop=True) gives both frames the same 0..n-1 index so concat aligns row by row
x_train = pd.concat([x_train.reset_index(drop=True), pd.DataFrame(x_train_onehot, columns=job_cols)], axis=1)
x_test = pd.concat([x_test.reset_index(drop=True), pd.DataFrame(x_test_onehot, columns=job_cols)], axis=1)
x_test

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_services,job_student,job_technician,job_unemployed
0,-0.195923,technician,single,6.0,no,no,no,cellular,jul,fri,337,-0.207629,999,0,nonexistent,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.478105,technician,married,5.0,,yes,no,cellular,aug,thu,360,0.862854,999,0,nonexistent,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,0.092947,self-employed,married,6.0,,no,no,cellular,may,wed,100,-0.207629,999,1,failure,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,-0.484792,technician,married,5.0,no,yes,no,cellular,aug,mon,70,-0.207629,999,0,nonexistent,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,-0.677371,management,married,6.0,no,yes,no,cellular,nov,mon,318,-0.564456,999,0,nonexistent,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12352,1.152134,management,divorced,6.0,no,no,no,cellular,nov,fri,185,-0.207629,999,0,nonexistent,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
12353,-1.640268,blue-collar,single,4.0,no,yes,no,cellular,dec,fri,248,1.219682,999,0,nonexistent,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12354,-0.773661,technician,single,5.0,no,yes,yes,cellular,jul,thu,234,-0.564456,999,0,nonexistent,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
12355,-0.388502,blue-collar,divorced,3.0,no,no,no,telephone,jul,wed,109,1.576510,999,0,nonexistent,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
# 'job' was imputed in place above, so x_train holds the imputed values.
print("\nImputed x_train 'job' (rows 40-44):")
print(x_train[nom_col_demo].iloc[40:45])
print("\nOne-Hot Encoded x_train 'job' shape:", x_train_onehot.shape)
print("Rows 40-44 of One-Hot Encoded x_train 'job':")
print(pd.DataFrame(x_train_onehot, columns=onehot_encoder.get_feature_names_out(nom_col_demo)).iloc[40:45])

### **Exercise 1: Apply All Preprocessing & Train Logistic Regression**

Now, it's your turn to apply these preprocessing steps to *all* relevant columns and then train a Logistic Regression model.

**Instructions:**

1.  Look at the Variable Table in [this link](https://archive.ics.uci.edu/dataset/222/bank+marketing).
2. Make lists for `numerical_features`, `ordinal_features`, and `nominal_features`.
3. Preprocess the features. It is safer to make a copy of `X_train` using:
   ```
   X_train_copy = X_train.copy()
   X_test_copy = X_test.copy()
   ```
   and preprocess `X_train_copy` instead.

   **For nominal features, concat the one-hot encoded features using [`pd.concat(..., axis=1)`](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) and drop the old nominal features from the dataframe.**
4. Train Logistic Regression on the preprocessed `X_train_copy` and `y_train`.
5. Evaluate the Model:
    *   Make predictions on the preprocessed `X_test_copy`.
    *   Print `classification_report` ([Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)). What are the accuracy, average precision, average recall, and average f1-score?


In [117]:
numerical_features = ['age', 'campaign', 'pdays', 'previous']
ordinal_features = ['education']
nominal_features = ['job', 'marital', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'poutcome']

X_train_copy = x_train.copy()
X_test_copy = x_test.copy()

# Drop 'duration' from the copies immediately to prevent data leakage
X_train_copy = X_train_copy.drop('duration', axis=1)
X_test_copy = X_test_copy.drop('duration', axis=1)



In [118]:
scaler = StandardScaler()

# Fit the scaler ONLY on the training data
X_train_copy[numerical_features] = scaler.fit_transform(X_train_copy[numerical_features])
X_test_copy[numerical_features] = scaler.transform(X_test_copy[numerical_features])

x_test

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome
32884,57,technician,married,high.school,no,no,yes,cellular,may,mon,371,1,999,1,failure
3169,55,,married,,,yes,no,telephone,may,thu,285,2,999,0,nonexistent
32206,33,blue-collar,married,basic.9y,no,no,no,cellular,may,fri,52,1,999,1,failure
9403,36,admin.,married,high.school,no,no,no,telephone,jun,fri,355,4,999,0,nonexistent
14020,27,housemaid,married,high.school,no,yes,no,cellular,jul,fri,189,2,999,0,nonexistent
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15908,46,services,married,high.school,no,yes,no,cellular,jul,mon,181,2,999,0,nonexistent
28222,38,services,divorced,high.school,,yes,no,cellular,apr,tue,620,1,2,1,success
14194,26,blue-collar,single,high.school,no,no,yes,telephone,jul,mon,251,2,999,0,nonexistent
19764,51,technician,divorced,professional.course,no,no,yes,cellular,aug,fri,50,3,999,0,nonexistent


In [119]:
imputer_ord = SimpleImputer(strategy='most_frequent')

## Write your code here
X_train_copy[ordinal_features] = imputer_ord.fit_transform(X_train_copy[ordinal_features])
X_test_copy[ordinal_features] = imputer_ord.transform(X_test_copy[ordinal_features])

In [120]:
education_categories = [
    'illiterate', 'basic.4y', 'basic.6y', 'basic.9y', 'high.school',
    'professional.course', 'university.degree', 'masters', 'doctorate'
]

In [121]:
ordinal_encoder = OrdinalEncoder(categories=[education_categories])

## Write your code here
X_train_copy[ordinal_features] = ordinal_encoder.fit_transform(X_train_copy[ordinal_features])
X_test_copy[ordinal_features] = ordinal_encoder.transform(X_test_copy[ordinal_features])

In [122]:
imputer_nom = SimpleImputer(strategy='most_frequent')

# Fit the imputer ONLY on the training data, then reuse it on the test data
X_train_copy[nominal_features] = imputer_nom.fit_transform(X_train_copy[nominal_features])
X_test_copy[nominal_features] = imputer_nom.transform(X_test_copy[nominal_features])

In [123]:
onehot_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

# Fit and transform X_train_copy's nominal features
X_train_onehot_encoded = onehot_encoder.fit_transform(X_train_copy[nominal_features])
# Transform X_test_copy's nominal features
X_test_onehot_encoded = onehot_encoder.transform(X_test_copy[nominal_features])

# Create DataFrame from one-hot encoded features
X_train_onehot_df = pd.DataFrame(X_train_onehot_encoded, columns=onehot_encoder.get_feature_names_out(nominal_features), index=X_train_copy.index)
X_test_onehot_df = pd.DataFrame(X_test_onehot_encoded, columns=onehot_encoder.get_feature_names_out(nominal_features), index=X_test_copy.index)

# Concatenate one-hot encoded features and drop original nominal columns
X_train_copy = pd.concat([X_train_copy.drop(columns=nominal_features), X_train_onehot_df], axis=1)
X_test_copy = pd.concat([X_test_copy.drop(columns=nominal_features), X_test_onehot_df], axis=1)

In [124]:
model = LogisticRegression(solver='liblinear', random_state=42)
model.fit(X_train_copy, y_train)
print(model)

LogisticRegression(random_state=42, solver='liblinear')


In [125]:
y_pred = model.predict(X_test_copy)
print("Predictions made successfully.")

Predictions made successfully.


In [126]:
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.91      0.99      0.95     10968
           1       0.66      0.19      0.29      1389

    accuracy                           0.90     12357
   macro avg       0.78      0.59      0.62     12357
weighted avg       0.88      0.90      0.87     12357
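
In the report above, the macro-averaged recall (0.59) is far below the weighted average (0.90) because class `1` has low recall but little support. A minimal sketch of how the two averages are formed, using made-up labels (`y_true_toy` and `y_pred_toy` are hypothetical):

```python
from sklearn.metrics import classification_report

# Made-up labels: 8 negatives and 2 positives.
y_true_toy = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# output_dict=True returns the report as nested dictionaries instead of text.
report = classification_report(y_true_toy, y_pred_toy, output_dict=True)

recall_0 = report['0']['recall']
recall_1 = report['1']['recall']

# Macro average: plain mean over classes, so the rare class counts as much as the common one.
macro_recall = (recall_0 + recall_1) / 2
# Weighted average: mean weighted by support (number of true samples per class).
weighted_recall = (8 * recall_0 + 2 * recall_1) / 10

print(macro_recall, report['macro avg']['recall'])
print(weighted_recall, report['weighted avg']['recall'])
```

The weighted-average recall always equals the accuracy, which is why a model that mostly predicts the majority class can look strong on that row while the macro average exposes the weak minority class.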



## Part 2: Fashion-MNIST Dataset - Image Classification

### Load Fashion-MNIST Dataset

The Fashion-MNIST dataset consists of 28x28 grayscale images of fashion items.

In [None]:
(fm_X_train, fm_y_train), (fm_X_test, fm_y_test) = fashion_mnist.load_data()

print(f"Fashion-MNIST Train data shape: {fm_X_train.shape}")
print(f"Fashion-MNIST Train labels shape: {fm_y_train.shape}")
print(f"Fashion-MNIST Test data shape: {fm_X_test.shape}")
print(f"Fashion-MNIST Test labels shape: {fm_y_test.shape}")

In [None]:
print(f"First image {fm_X_train[0]}")
print(f"First label {fm_y_train[0]}")

### Visualize Fashion-MNIST Images

Let's see what these images look like.

In [None]:
fashion_mnist_class_names = [
    'T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
    'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot'
]

# Visualize the images
## Write your code here



### **Exercise 2: Preprocessing Images (Flatten and Scale)**

Images are 2D arrays (matrices of pixels) and pixel values are integers from 0-255. For Logistic Regression, we need:
*  **Flattening:** Convert each 28x28 image into a 1D array of 784 features.
*  **Scaling:** Normalize pixel values from [0, 255] to [0, 1].

**Instructions:**

1.   **Flatten:** Use the `.reshape()` method (see [documentation](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.reshape.html)). For `fm_X_train` (shape `(num_samples, 28, 28)`), you want to reshape it to `(num_samples, 28*28)`.
2.  **Scale:** Divide the flattened pixel values by 255.0 to get values between 0 and 1.
3.   **Train Logistic Regression:**
    *   Initialize `LogisticRegression(solver='saga')`. `saga` is a good solver when both number of samples and number of features are large.
    *   Fit the model on your *processed* `fm_X_train_scaled` and `fm_y_train`.
4.   **Make Predictions:** Use `predict()` to make predictions on the *processed* `fm_X_test_scaled`.
5.   **Print Classification Report:** Print `classification_report` ([Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)). What are the accuracy, average precision, average recall, and average f1-score?
6.   **Visualize Misclassifications:**
    *   Find the indices of the images in `fm_X_test` where your model made incorrect predictions (i.e., `fm_y_pred != fm_y_test`).
    *   Select 5 of these misclassified images.
    *   Plot these images (using `plt.imshow`). For each image, print its true label and its predicted label.
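
The flatten and scale steps can be tried on synthetic data before touching the real images. This is a sketch under stated assumptions: the random `images` array is a stand-in with the same per-image shape as Fashion-MNIST, and the names `flat` and `scaled` are hypothetical:

```python
import numpy as np

# Synthetic stand-in for image data: 5 "images" of 28x28 uint8 pixels.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)

# Flatten: keep the sample axis and merge the two pixel axes into one of length 784.
flat = images.reshape(images.shape[0], -1)

# Scale: dividing by 255.0 converts to float and maps [0, 255] to [0, 1].
scaled = flat / 255.0

print(flat.shape, scaled.dtype, scaled.min(), scaled.max())
```

Using `-1` lets NumPy infer the flattened length, which is equivalent to writing `28*28` for these images. Row-major reshaping places the first image row in the first 28 features, the second row in the next 28, and so on.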

In [None]:
# --- YOUR CODE FOR EXERCISE 2 STARTS HERE ---





## Part 3: 20 Newsgroups Dataset - Text Classification

### Load 20 Newsgroups Dataset

The 20 Newsgroups dataset comprises around 18,000 newsgroup posts on 20 topics.

In [None]:
news_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)
news_test = fetch_20newsgroups(subset='test', shuffle=True, random_state=42)

X_train_news, y_train_news = news_train.data, news_train.target
X_test_news, y_test_news = news_test.data, news_test.target

print(f"Number of training documents: {len(X_train_news)}")
print(f"Number of test documents: {len(X_test_news)}")
print(f"Categories: {news_train.target_names}")

### Explore Sample Document

In [None]:
# Print the first document and its class
## Write your code here



### Preprocessing: Text Vectorization Demonstration with `TfidfVectorizer`

$$
\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)
$$

Where:

$$
\text{TF}(t, d) = \frac{\text{number of occurrences of word } t \text{ in } d}{\text{number of words in } d} \quad \text{ and } \quad
\text{IDF}(t, D) = \log\left(\frac{\text{total number of documents in } D}{\text{number of documents in } D \text{ that contain word } t}\right).
$$
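
The formula above can be computed by hand on a tiny corpus. The sketch below follows the textbook definition exactly as written; the three tokenized documents and the helper names `tf`, `idf`, and `tf_idf` are made up for illustration:

```python
import math

# Three made-up tokenized documents.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat", "down"],
    ["the", "cat", "ran"],
]

def tf(term, doc):
    # Occurrences of the term divided by the number of words in the document.
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # Log of (total documents / documents containing the term).
    # Assumes the term occurs in at least one document.
    n_containing = sum(1 for d in corpus if term in d)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

print(tf_idf("the", docs[0], docs))  # in every document, so IDF = log(1) = 0
print(tf_idf("cat", docs[0], docs))  # (1/3) * log(3/2)
print(tf_idf("dog", docs[1], docs))  # (1/4) * log(3/1)
```

A word that appears in every document gets a score of zero, while a word confined to one document scores highest. With its default settings, `TfidfVectorizer` uses raw counts for TF, a smoothed IDF of the form $\ln\frac{1+n}{1+\text{df}(t)} + 1$, and L2-normalizes each row, so its numbers will differ from this hand computation.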