# **Fashion MNIST: Feature Engineering**

***
***

### **Introduction to Feature Engineering**

This notebook demonstrates experimental feature engineering for the Fashion MNIST dataset. While deep neural networks can automatically learn hierarchical features from raw pixel data, explicit feature engineering provides several advantages for analysis:

1. Creates `interpretable features` that help us understand what distinguishes different clothing categories
2. Enables traditional ML algorithms to work with image data through meaningful transformations
3. Facilitates `feature store integration` for consistent access across training and serving
4. Provides insights into underlying data characteristics without full model training
5. Creates features that can be used for `exploratory data analysis` and visualization

The extracted features will be organized into two logical groups:
- `Image features`: Statistical, structural, and dimensionality-based features derived from raw pixels
- `Metadata features`: Class labels, identifiers, and dataset split information

This approach is primarily for experimental analysis rather than direct implementation in our training pipeline.

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
import cv2
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt


***

### **Data Preparation**

Our feature engineering process begins with loading the Fashion MNIST dataset through the `tf.keras.datasets` loader bundled with TensorFlow. The dataset consists of:

- `60,000 training images` (28×28 grayscale pixels)
- `10,000 test images` (28×28 grayscale pixels)
- 10 clothing categories with balanced class distribution

We'll assign each image a class name from the predefined list to make the features more interpretable and facilitate analysis by category. These class names will be included in our metadata features table.

In [None]:
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()

# Class names, indexed by numeric label (0-9)
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
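
As a small illustration of how the numeric labels map to names, the sketch below uses NumPy indexing to translate a whole label array in one step. The array `y_demo` is a hypothetical stand-in for `y_train`, not data taken from the dataset.

```python
import numpy as np

class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

# Hypothetical stand-in for y_train: a few integer labels in [0, 9]
y_demo = np.array([9, 0, 0, 3, 7], dtype=np.uint8)

# Indexing an array of names with the label array maps every label at once
names_demo = np.array(class_names)[y_demo]
print(names_demo.tolist())
```

The same lookup works on the full label array, which avoids a Python-level loop when only the names are needed.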

*** 

### **Dataset Combination**

For feature extraction, we'll first combine the training and test datasets while maintaining split information. This approach allows us to:

1. Process all `70,000 images` with a single feature extraction pipeline
2. Keep track of the original data `split designation` (train/test) as metadata
3. Generate consistent features across both splits
4. Create a unified feature set that can later be separated as needed

This combined approach simplifies the feature extraction workflow while preserving important provenance information.

In [None]:
x_all = np.concatenate([x_train, x_test])
y_all = np.concatenate([y_train, y_test])
split_labels = ['train'] * len(x_train) + ['test'] * len(x_test)
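
To show that the combination is reversible, here is a minimal sketch that recovers each split from the combined array with a boolean mask. The `*_demo` arrays are tiny hypothetical stand-ins for the real data.

```python
import numpy as np

# Hypothetical stand-ins for x_train / x_test (6 and 2 tiny "images")
x_train_demo = np.zeros((6, 28, 28), dtype=np.uint8)
x_test_demo = np.ones((2, 28, 28), dtype=np.uint8)

# Combine exactly as in the cell above, tracking the split per row
x_all_demo = np.concatenate([x_train_demo, x_test_demo])
split_demo = np.array(['train'] * len(x_train_demo) + ['test'] * len(x_test_demo))

# Boolean masks recover each split from the combined array
train_mask = split_demo == 'train'
x_train_back = x_all_demo[train_mask]
x_test_back = x_all_demo[~train_mask]

print(x_train_back.shape, x_test_back.shape)
```

Because concatenation preserves order, the mask selects exactly the rows that came from each original array.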

***

### **Dimensionality Reduction with PCA**

Principal Component Analysis (PCA) is a powerful technique for extracting the most important dimensions from high-dimensional data. For our 28×28 pixel images (784 dimensions), we:

1. Flatten each image into a `784-dimensional vector`
2. Fit PCA on a random subset of `10,000 images` to save memory and computation time
3. Retain the `top 10 principal components` that capture the most variance
4. Transform all 70,000 images to this lower-dimensional representation

PCA components represent directions of maximum variance in the data and often correspond to meaningful visual patterns. These components will be included in our feature set to provide a compact representation of image content.

In [None]:
# Flatten images for PCA
x_flat = x_all.reshape(x_all.shape[0], -1)

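# Fit PCA on a random subset to save memory.
# The seed (42) is an arbitrary choice that makes the subset reproducible.
sample_size = 10000
rng = np.random.default_rng(42)
random_indices = rng.choice(len(x_flat), sample_size, replace=False)
pca = PCA(n_components=10, random_state=42)
pca.fit(x_flat[random_indices])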

# Transform all data
pca_results = pca.transform(x_flat)
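
To make the steps above concrete, the following is a minimal NumPy sketch of the computation behind PCA: center the data, take a singular value decomposition, and project onto the leading directions. It is an illustration under simplifying assumptions, not scikit-learn's implementation, and `x_demo` is synthetic random data rather than real images.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for x_flat: 200 "images" of 784 pixels each
x_demo = rng.integers(0, 256, size=(200, 784)).astype(np.float64)

# 1. Center each pixel column on its mean
mean_vec = x_demo.mean(axis=0)
x_centered = x_demo - mean_vec

# 2. SVD of the centered data; rows of vt are the principal directions
_, singular_values, vt = np.linalg.svd(x_centered, full_matrices=False)
components = vt[:10]

# 3. Share of total variance captured by each of the 10 retained components
explained_var = singular_values**2 / (len(x_demo) - 1)
explained_ratio = explained_var[:10] / explained_var.sum()

# 4. Project every sample onto the retained components
scores = x_centered @ components.T
print(scores.shape)
print(explained_ratio.round(3))
```

Component signs may differ from those returned by scikit-learn, since the direction of each axis is arbitrary. Note also that projecting data with a mean taken from a subset yields scores that are only approximately zero-centered over the full set.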

*** 

### **Feature Extraction Strategy**

Our feature extraction function calculates several types of engineered features for each image:

1. **Statistical Features**:
   - `Mean brightness`: Average pixel value across the image
   - `Standard deviation`: Variation in pixel intensity

2. **Edge Features**:
   - Sobel operators to detect edges in x and y directions
   - `Edge density`: Mean magnitude of edge gradients across the image

3. **Histogram Features**:
   - Distribution of pixel intensities divided into `5 bins`
   - Normalized to sum to 1.0, so each bin holds the fraction of pixels in that intensity range

These handcrafted features capture different aspects of the images that may be useful for classification tasks, particularly when using traditional machine learning models or for interpretability purposes.
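
As a worked example of the histogram features, the sketch below bins a tiny hand-made array the same way (`bins=5`, `range=(0, 255)`). The array `tiny` is hypothetical and chosen so the bin assignment is easy to follow.

```python
import numpy as np

# Hypothetical 2x4 "image" with hand-picked intensities
tiny = np.array([[0, 10, 60, 110],
                 [160, 210, 255, 255]], dtype=np.uint8)

counts, edges = np.histogram(tiny, bins=5, range=(0, 255))
proportions = counts / counts.sum()

print(edges)        # bin edges at 0, 51, 102, 153, 204, 255
print(counts)       # pixels per bin: 2, 1, 1, 1, 3
print(proportions)  # fractions: 0.25, 0.125, 0.125, 0.125, 0.375
```

Each bin is 51 intensity levels wide, and the last bin includes its right edge, so a pixel value of 255 falls into bin 5.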

In [None]:
def extract_features(image):
    # Basic statistics
    mean_brightness = np.mean(image)
    std_deviation = np.std(image)
    
    # Edge detection
    sobelx = cv2.Sobel(image, cv2.CV_64F, 1, 0, ksize=3)
    sobely = cv2.Sobel(image, cv2.CV_64F, 0, 1, ksize=3)
    edge_magnitude = np.sqrt(sobelx**2 + sobely**2)
    edge_density = np.mean(edge_magnitude)
    
    # Histogram features (5 bins)
    hist, _ = np.histogram(image, bins=5, range=(0, 255))
    hist = hist / np.sum(hist)  # Normalize
    
    return {
        'mean_brightness': mean_brightness,
        'std_deviation': std_deviation,
        'edge_density': edge_density,
        'hist_bin1': hist[0],
        'hist_bin2': hist[1],
        'hist_bin3': hist[2],
        'hist_bin4': hist[3],
        'hist_bin5': hist[4]
    }
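
To clarify what `edge_density` measures, here is a pure-NumPy sketch of the 3x3 Sobel gradient magnitude. It is intended to mirror the `cv2.Sobel` calls above, on the assumption that OpenCV's default border handling corresponds to NumPy's `reflect` padding. The helper `sobel_magnitude` and the test image `step` are hypothetical names introduced for this illustration.

```python
import numpy as np

def sobel_magnitude(image):
    """3x3 Sobel gradient magnitude with reflected borders (pure NumPy)."""
    img = np.pad(image.astype(np.float64), 1, mode='reflect')
    # Horizontal gradient: right column minus left column, row weights (1, 2, 1)
    gx = (img[:-2, 2:] + 2 * img[1:-1, 2:] + img[2:, 2:]
          - img[:-2, :-2] - 2 * img[1:-1, :-2] - img[2:, :-2])
    # Vertical gradient: bottom row minus top row, column weights (1, 2, 1)
    gy = (img[2:, :-2] + 2 * img[2:, 1:-1] + img[2:, 2:]
          - img[:-2, :-2] - 2 * img[:-2, 1:-1] - img[:-2, 2:])
    return np.sqrt(gx**2 + gy**2)

# Hypothetical test image: left half black, right half white
step = np.zeros((28, 28), dtype=np.uint8)
step[:, 14:] = 255

mag = sobel_magnitude(step)
edge_density = mag.mean()
print(mag[0, 12:16], edge_density)
```

The kernel weights sum to 4 on each side, so a single black-to-white step produces a magnitude of 4 × 255 = 1020. This is why edge density values can exceed the 0-255 pixel range.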

***

### **Feature Processing Pipeline**

The feature extraction pipeline processes all 70,000 images to generate two distinct feature sets:

1. **Image Features**:
   - Unique `image_id` as primary key
   - Statistical features (mean brightness, standard deviation)
   - Structural features (edge density)
   - Dimensionality reduction features (the first 5 of the 10 fitted PCA components)
   - Histogram features (5 bins representing pixel value distribution)

2. **Metadata Features**:
   - Unique `image_id` as primary key (for joining)
   - Class information (numeric ID and readable name)
   - Dataset split designation (train/test)

This separation keeps pixel-derived features apart from label and split information, so each table has a single purpose. The `image_id` serves as the joining key between these tables.

In [None]:
image_features = []
metadata_features = []

for i in range(len(x_all)):
    # Generate unique image ID
    image_id = f"img_{i}"
    
    # Extract features
    features = extract_features(x_all[i])
    
    # Image features
    image_features.append({
        'image_id': image_id,
        'mean_brightness': features['mean_brightness'],
        'std_deviation': features['std_deviation'],
        'edge_density': features['edge_density'],
        'pca_component_1': pca_results[i][0],
        'pca_component_2': pca_results[i][1],
        'pca_component_3': pca_results[i][2],
        'pca_component_4': pca_results[i][3],
        'pca_component_5': pca_results[i][4],
        'hist_bin1': features['hist_bin1'],
        'hist_bin2': features['hist_bin2'],
        'hist_bin3': features['hist_bin3'],
        'hist_bin4': features['hist_bin4'],
        'hist_bin5': features['hist_bin5']
    })

    
    # Metadata features
    metadata_features.append({
        'image_id': image_id,
        'class_id': int(y_all[i]),
        'class_name': class_names[y_all[i]],
        'data_split': split_labels[i]
    })
    
    # Show progress
    if i % 10000 == 0:
        print(f"Processed {i}/{len(x_all)} images")

Processed 0/70000 images
Processed 10000/70000 images
Processed 20000/70000 images
Processed 30000/70000 images
Processed 40000/70000 images
Processed 50000/70000 images
Processed 60000/70000 images
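
The per-image loop is easy to follow but runs 70,000 Python iterations. As an alternative sketch, the statistical and histogram features can be computed for all images at once with NumPy axis reductions. `x_demo` is a hypothetical stand-in for `x_all`, and the binning assumes `uint8` pixels with the same five bins as above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for x_all: 100 random 28x28 uint8 images
x_demo = rng.integers(0, 256, size=(100, 28, 28), dtype=np.uint8)

# Per-image statistics: reduce over the two pixel axes
mean_brightness = x_demo.mean(axis=(1, 2))
std_deviation = x_demo.std(axis=(1, 2))

# Bin index per pixel: bins are 51 levels wide, and 255 belongs to the last bin
bin_index = np.minimum(x_demo // 51, 4)

# Fraction of each image's pixels in each bin, shape (n_images, 5)
hist = np.stack([(bin_index == b).mean(axis=(1, 2)) for b in range(5)], axis=1)

print(mean_brightness.shape, std_deviation.shape, hist.shape)
```

The Sobel step is left out here because `cv2.Sobel` operates on one image at a time. The loop remains a reasonable choice when edge features are needed.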


***

### **Data Persistence**

After generating our feature sets, we persist them to CSV files for further analysis and potential integration with other systems. The CSV format offers several advantages:

1. `Universal compatibility` with various data analysis tools
2. Easy import into databases or feature stores
3. Human-readable format for inspection and debugging
4. Simple, dependency-free storage for tabular data of this size

These CSV files will be stored in the `./features/` directory and could be uploaded to Google Cloud Storage or imported into BigQuery for further analysis and feature serving.

In [None]:
image_features_df = pd.DataFrame(image_features)
metadata_features_df = pd.DataFrame(metadata_features)

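# Save to CSV, creating the output directory first if it does not exist
import os
os.makedirs('./features', exist_ok=True)
image_features_df.to_csv('./features/image_features.csv', index=False)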
metadata_features_df.to_csv('./features/metadata_features.csv', index=False)
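
Once the two tables are stored separately, they can be recombined on `image_id` when an analysis needs both features and labels. The sketch below shows one way to do this with `DataFrame.merge`. The miniature tables and their values are hypothetical, and only the column names follow the notebook.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
image_df = pd.DataFrame({
    'image_id': ['img_0', 'img_1', 'img_2'],
    'mean_brightness': [90.0, 110.0, 40.0],
})
meta_df = pd.DataFrame({
    'image_id': ['img_0', 'img_1', 'img_2'],
    'class_name': ['Ankle boot', 'T-shirt/top', 'Dress'],
    'data_split': ['train', 'train', 'test'],
})

# validate='one_to_one' raises an error if image_id repeats in either table
joined = image_df.merge(meta_df, on='image_id', how='inner', validate='one_to_one')

# Filter to the training split after joining
train_only = joined[joined['data_split'] == 'train']
print(joined)
```

Validating the key as one-to-one is a cheap guard against accidental row duplication when the tables are regenerated.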

***

### **Feature Exploration**

The output below shows the extracted features for the first 5 images, which lets us inspect the data structure and verify the feature generation process. Key observations:

1. Each image has a unique `image_id` that serves as the primary key
2. Image features include statistical, structural, and PCA-based components
3. Metadata includes both numeric `class_id` and human-readable `class_name`
4. The `data_split` column preserves the original dataset designation

This tabular format transforms unstructured image data into structured features that can be used for visualization, analysis, and traditional machine learning approaches.

In [None]:
print("\nImage Features (first 5 rows):")
print(image_features_df.head())

print("\nMetadata Features (first 5 rows):")
print(metadata_features_df.head())


Image Features (first 5 rows):
  image_id  mean_brightness  std_deviation  edge_density  pca_component_1  \
0    img_0        97.253827     101.792346    192.181315      -133.960372   
1    img_1       107.905612     100.831448    225.351588      1420.510590   
2    img_2        36.558673      49.698752     83.301051      -692.044930   
3    img_3        59.501276      64.849295    136.712344        60.543575   
4    img_4        78.044643     103.843248    190.411598       838.742679   

   pca_component_2  pca_component_3  pca_component_4  pca_component_5  \
0      1634.674432     -1180.050450      -351.047373         9.991461   
1      -425.358405      -224.046959      -361.054253       290.012937   
2     -1123.716759       107.366663      -201.745258       -94.288999   
3      -990.493523       218.350244      -360.377514        43.654988   
4     -1185.264650      -771.009579       227.522258       398.429202   

   hist_bin1  hist_bin2  hist_bin3  hist_bin4  hist_bin5  
0   0.5

*** 

### **Statistical Analysis**

The statistical summary provides insights into the distribution of our engineered features across the entire dataset:

1. **Mean Brightness**: Ranges from ~5 to ~192 with mean around `73`
2. **Standard Deviation**: Ranges from ~17 to ~121 with mean around `82`
3. **Edge Density**: Ranges from ~33 to ~475 with mean around `176`
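4. **PCA Components**: Approximately zero-centered (means within about ±18, against standard deviations of several hundred or more); the small offsets arise because PCA was fitted, and therefore centered, on a 10,000-image subset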
5. **Histogram Bins**: The first bin (darkest pixels) has the highest average proportion (~58%)

These statistics help us understand the range and distribution of our features, which is essential for feature normalization, outlier detection, and interpretation of model results.

In [None]:
print("\nImage Features Statistics:")
print(image_features_df.describe())


Image Features Statistics:
       mean_brightness  std_deviation  edge_density  pca_component_1  \
count     70000.000000   70000.000000  70000.000000     70000.000000   
mean         72.969811      81.649481    176.495030        17.998457   
std          32.134516      20.019305     43.877835      1134.817214   
min           4.943878      16.525658     32.577617     -2026.518314   
25%          47.405293      66.923739    147.085311      -945.189249   
50%          69.336735      84.541055    173.466071         0.600875   
75%          97.354911      98.158086    203.357451       914.400001   
max         191.820153     121.286206    475.178819      2798.851306   

       pca_component_2  pca_component_3  pca_component_4  pca_component_5  \
count     70000.000000     70000.000000     70000.000000     70000.000000   
mean         -6.440443         4.424498        -5.566805         4.030693   
std         886.547650       516.046603       468.677947       412.592766   
min       -1705
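
Since the summary statistics feed into normalization and outlier detection, here is a minimal sketch of z-score standardization that uses statistics from the training rows only. The `feature` column and the `is_train` mask are synthetic, and the threshold of 3 standard deviations is an illustrative assumption rather than a value from this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical feature column (roughly shaped like mean_brightness) and split mask
feature = rng.normal(loc=73.0, scale=32.0, size=1000)
is_train = np.arange(1000) < 800

# Estimate mean and standard deviation from the training rows only
mu = feature[is_train].mean()
sigma = feature[is_train].std()

# Apply the same transformation to every row
z = (feature - mu) / sigma

# Flag values more than 3 standard deviations from the training mean
outliers = np.abs(z) > 3
print(f"mu={mu:.2f}, sigma={sigma:.2f}, flagged={int(outliers.sum())}")
```

Estimating `mu` and `sigma` from the training rows alone keeps test-set information out of the preprocessing step.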

***

### **Class Distribution Verification**

The class distribution confirms that our feature set maintains the balanced nature of the original Fashion MNIST dataset:

- Each of the 10 classes contains exactly `7,000 images` (combining train and test)
- This balanced distribution is important for unbiased model training and evaluation
- The rows are neither alphabetical nor sorted by `class_id`: `value_counts()` orders by count, and with every count tied the row order carries no meaning

This verification step ensures that our feature engineering process preserved the integrity of the dataset structure while transforming the data format.

In [None]:
print("\nClass Distribution:")
print(metadata_features_df['class_name'].value_counts())


Class Distribution:
class_name
Ankle boot     7000
T-shirt/top    7000
Dress          7000
Pullover       7000
Sneaker        7000
Sandal         7000
Trouser        7000
Shirt          7000
Coat           7000
Bag            7000
Name: count, dtype: int64
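
For a listing in `class_id` order, one option is to count the numeric labels directly with `np.bincount`. The sketch below uses a hypothetical balanced label array `y_demo` in place of `y_all`.

```python
import numpy as np

class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

# Hypothetical stand-in for y_all: 7 samples of each class, shuffled
rng = np.random.default_rng(0)
y_demo = rng.permutation(np.repeat(np.arange(10), 7))

# counts[k] is the number of samples with label k
counts = np.bincount(y_demo, minlength=len(class_names))
for class_id, (name, n) in enumerate(zip(class_names, counts)):
    print(f"{class_id}: {name:<12} {n}")

# A balanced dataset has identical counts for every class
is_balanced = counts.min() == counts.max()
print("balanced:", bool(is_balanced))
```

Counting by numeric label gives a deterministic order and makes a missing class show up as an explicit zero.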


***

### **Storage Efficiency**

The file size analysis shows:

- `image_features.csv`: 17.20 MB
- `metadata_features.csv`: 1.69 MB

These CSV files provide an efficient representation of the dataset's essential characteristics:
- The image feature file is smaller than the raw pixel arrays (~17 MB vs ~55 MB for 70,000 × 784 uncompressed `uint8` values)
- The separation into two tables optimizes storage by preventing redundancy
- The files are small enough for easy handling in memory on standard machines

This compact representation facilitates rapid experimentation and analysis while maintaining the most informative aspects of the original images.

In [None]:
import os
print("\nCSV File Sizes:")
print(f"image_features.csv: {os.path.getsize('./features/image_features.csv') / (1024*1024):.2f} MB")
print(f"metadata_features.csv: {os.path.getsize('./features/metadata_features.csv') / (1024*1024):.2f} MB")


CSV File Sizes:
image_features.csv: 17.20 MB
metadata_features.csv: 1.69 MB
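
For comparison with the raw data, the following back-of-the-envelope sketch estimates the size of the pixel arrays. It assumes one `uint8` byte per pixel and no container overhead, so other storage formats will differ. The 17.20 figure is the CSV size reported above.

```python
# Raw pixel storage, assuming one uint8 byte per pixel
n_images = 70_000
bytes_per_image = 28 * 28
raw_bytes = n_images * bytes_per_image
raw_mib = raw_bytes / (1024 * 1024)

# Size of image_features.csv as reported by the cell above
features_mib = 17.20
ratio = features_mib / raw_mib

print(f"raw pixels: {raw_bytes:,} bytes = {raw_mib:.2f} MiB")
print(f"feature CSV is about {ratio:.0%} of the raw pixel size")
```

The division by 1024 × 1024 matches the convention used in the file size cell, so the two figures are directly comparable.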


***

### **Conclusion**

This experimental feature engineering notebook has successfully:

1. Transformed 70,000 Fashion MNIST images into structured, tabular feature sets
2. Created `13 engineered features` capturing statistical, structural, and distributional characteristics
3. Preserved essential metadata including class information and dataset splits
4. Organized features into logical tables with proper relationships
5. Generated analysis-ready CSV files for further exploration

These features provide an alternative representation of the Fashion MNIST dataset that can be used for:

- Exploratory data analysis and visualization
- Training traditional ML models (random forests, gradient boosting, etc.)
- Integration with `Vertex AI Feature Store` for feature serving
- Understanding which characteristics distinguish different clothing categories
- Supplementing deep learning approaches with interpretable features

While our production training pipeline will likely use the raw pixel data for deep learning models, these engineered features provide valuable insights and alternative modeling approaches for experimentation.

***
***