# **Fashion MNIST: Data Normalization**

***
***

### **Introduction**

This notebook orchestrates the execution of our production-grade data normalization pipeline for the Fashion MNIST dataset on Google Cloud Platform. Instead of performing normalization interactively, we implement a cloud-native, reproducible approach that:

1. Downloads our pre-built `normalization script` from Cloud Storage
2. Executes the script against our raw dataset in GCS
3. Outputs normalized data optimized for machine learning workflows

The normalization process follows the recommendations from our data analysis phase, applying `Min-Max [0,1] normalization`: images are converted from uint8 to float32 and divided by 255.0.
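As a minimal sketch of that scaling step (not the code in `main.py`, whose body is not shown here), the function name `min_max_normalize` and the tiny sample array below are hypothetical. Because uint8 pixels are bounded by 0 and 255, dividing by the fixed constant 255.0 is equivalent to Min-Max scaling over the full pixel range.

```python
import numpy as np

def min_max_normalize(images: np.ndarray) -> np.ndarray:
    """Scale uint8 pixel values from [0, 255] to float32 values in [0, 1]."""
    # Cast first so the division happens in float32, not integer arithmetic.
    return images.astype(np.float32) / 255.0

# Hypothetical stand-in for a batch of pixel values.
sample = np.array([[0, 128, 255]], dtype=np.uint8)
scaled = min_max_normalize(sample)
print(scaled.dtype, scaled.min(), scaled.max())
```

The input array is left untouched; `astype` returns a new array, so the raw data can still be inspected after scaling.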

This engineering approach follows ML production best practices by:
- Separating preprocessing logic from exploration notebooks
- Creating reusable, modular normalization components
- Enabling reproducible preprocessing across environments
- Establishing proper error handling and logging
- Generating comprehensive documentation

The main Python module (`main.py`) implements a robust normalization pipeline with both HTTP and Cloud Storage triggers, allowing for flexible integration with various workflow orchestration systems.

***

### **Cloud Environment Setup**

The first code cell below downloads the necessary components for our normalization pipeline:

1. `main.py` - The core normalization module containing:
   - Cloud Function implementations for both HTTP and GCS triggers
   - Robust error handling and comprehensive logging
   - GCS download and upload utilities
   - Normalization logic that converts images to `float32` and scales to `[0,1]`
   - Documentation generation for processed datasets

2. `requirements.txt` - Dependencies needed for the normalization process:
   - `numpy==1.26.0` for numerical operations
   - `google-cloud-storage==2.12.0` for GCS operations
   - `functions-framework==3.5.0` for Cloud Functions local testing

These files are retrieved from the `fashion-mnist-dev` GCS bucket where we maintain our data engineering components. Pulling the code from one shared location keeps preprocessing consistent across environments; tracking changes to the code itself additionally relies on source control or GCS object versioning.

Next, we'll install the required dependencies to ensure our environment has all necessary libraries.

In [1]:
import os
from google.cloud import storage

# Create a local directory for your files
os.makedirs(os.path.expanduser("~/data_normalizer"), exist_ok=True)

# Download the files from GCS
client = storage.Client()
bucket = client.bucket("fashion-mnist-dev")

# Download main.py
blob = bucket.blob("data-normalizer/main.py")
blob.download_to_filename(os.path.expanduser("~/data_normalizer/main.py"))

# Download requirements.txt
blob = bucket.blob("data-normalizer/requirements.txt")
blob.download_to_filename(os.path.expanduser("~/data_normalizer/requirements.txt"))

print("Files downloaded successfully to ~/data_normalizer/")

Files downloaded successfully to ~/data_normalizer/


In [2]:
import subprocess
subprocess.run(["pip", "install", "-r", os.path.expanduser("~/data_normalizer/requirements.txt")])



CompletedProcess(args=['pip', 'install', '-r', '/home/jupyter/data_normalizer/requirements.txt'], returncode=0)
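
A return code of 0 shows that `pip` finished, but not which versions ended up in the active environment. One way to confirm the pins, sketched here under the assumption that `requirements.txt` contains only simple `name==version` lines, is to compare them with the installed package metadata. The helper names `parse_pins` and `find_mismatches` are hypothetical.

```python
from importlib import metadata

def parse_pins(requirements_text: str) -> dict:
    """Return {package: version} for each `name==version` line."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins

def find_mismatches(pins: dict) -> dict:
    """Map each package to (pinned, installed) where the two differ."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches

# The three pins listed for this pipeline.
pins = parse_pins(
    "numpy==1.26.0\n"
    "google-cloud-storage==2.12.0\n"
    "functions-framework==3.5.0\n"
)
print(find_mismatches(pins))  # an empty dict means every pin is satisfied
```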

***

### **Configuration**

After installing dependencies, we'll create a simple runner script that:

1. Imports the `normalize_fashion_mnist` function from our module
2. Configures input and output paths in Google Cloud Storage
3. Executes the normalization process
4. Reports the results in JSON format

The input path points to our raw dataset stored as `fashion_mnist.npz` in the `custom_jobs` directory, and the output is saved to the `custom_jobs_normalized` directory. This mirrors the Cloud Function, which automatically maps each input directory to a normalized counterpart.
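
The directory mapping can be illustrated with a small helper. This is a sketch only: the function name `normalized_output_path` and the rule of appending `_normalized` to the parent directory are assumptions inferred from the two paths used in this notebook, not taken from `main.py`.

```python
def normalized_output_path(input_uri: str, suffix: str = "_normalized") -> str:
    """Map gs://bucket/dir/file.npz to gs://bucket/dir_normalized/."""
    prefix = "gs://"
    if not input_uri.startswith(prefix):
        raise ValueError(f"Expected a gs:// URI, got {input_uri!r}")
    parts = input_uri[len(prefix):].split("/")
    # Need at least bucket, one directory, and a non-empty file name.
    if len(parts) < 3 or not parts[-1]:
        raise ValueError("Expected a URI of the form gs://bucket/directory/file")
    directory = "/".join(parts[:-1])  # bucket plus parent directory
    return f"{prefix}{directory}{suffix}/"

print(normalized_output_path(
    "gs://fashion-mnist-datasets/custom_jobs/fashion_mnist.npz"
))
# gs://fashion-mnist-datasets/custom_jobs_normalized/
```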

The normalization process will:
- Load the raw Fashion MNIST data from the NPZ file
- Apply `Min-Max [0,1] normalization` through the `normalize_dataset` function
- Create a detailed README.md with usage instructions
- Copy the class_names.json file if present
- Save the normalized data in compressed NPZ format with the original split structure preserved
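
The load, normalize, and save steps can be sketched on a small synthetic dataset. The function below is hypothetical (named `normalize_arrays` to avoid confusion with the module's own `normalize_dataset`), and it assumes that image arrays use keys beginning with `X_` while label arrays use `y_`, as the keys `X_train`, `y_train`, `X_test`, and `y_test` in the run log suggest. An in-memory buffer stands in for the NPZ file.

```python
import io
import numpy as np

def normalize_arrays(arrays: dict) -> dict:
    """Scale image arrays to float32 in [0, 1]; pass label arrays through."""
    normalized = {}
    for key, value in arrays.items():
        if key.startswith("X_"):  # assumed naming convention for images
            normalized[key] = value.astype(np.float32) / 255.0
        else:
            normalized[key] = value  # labels stay in their original dtype
    return normalized

# Synthetic stand-in for the raw dataset (8 train and 2 test images).
rng = np.random.default_rng(0)
raw = {
    "X_train": rng.integers(0, 256, size=(8, 28, 28), dtype=np.uint8),
    "y_train": rng.integers(0, 10, size=8, dtype=np.uint8),
    "X_test": rng.integers(0, 256, size=(2, 28, 28), dtype=np.uint8),
    "y_test": rng.integers(0, 10, size=2, dtype=np.uint8),
}

# Write a compressed NPZ archive to memory, then read it back.
buffer = io.BytesIO()
np.savez_compressed(buffer, **normalize_arrays(raw))
buffer.seek(0)
with np.load(buffer) as restored:
    summary = {k: (restored[k].shape, str(restored[k].dtype)) for k in restored.files}
print(summary)
```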

In [3]:
# Create a simple runner script
with open(os.path.expanduser("~/data_normalizer/run.py"), "w") as f:
    f.write("""
import sys
sys.path.append("/home/jupyter/data_normalizer")

from main import normalize_fashion_mnist
import json

# Define your input and output paths
input_path = "gs://fashion-mnist-datasets/custom_jobs/fashion_mnist.npz"
output_path = "gs://fashion-mnist-datasets/custom_jobs_normalized/"

# Run the normalization function
result = normalize_fashion_mnist(input_path, output_path)
print(f"Normalization result: {json.dumps(result, indent=2)}")
""")

print("Runner script created at ~/data_normalizer/run.py")

Runner script created at ~/data_normalizer/run.py


***

### **Execution**

Now we'll execute the normalization process by running our script. The script will:

1. Download the raw Fashion MNIST dataset from Google Cloud Storage
2. Load the compressed NPZ file containing training and test data
3. Convert images from `uint8` (0-255) to `float32` (0.0-1.0) by dividing by 255.0
4. Preserve both dataset splits (training and test) and leave the labels unchanged
5. Create a comprehensive README.md with usage examples and preprocessing details
6. Save the normalized data in compressed NPZ format for efficient storage and loading
7. Upload all artifacts back to Google Cloud Storage

This execution demonstrates the standalone functionality of our normalization module, which could alternatively be deployed as a Cloud Function triggered by storage events or HTTP requests as implemented in the module.
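
The download and upload steps both start from a `gs://` URI, which has to be split into a bucket name and an object name before the storage client can use it. A minimal sketch of that split follows; the helper name `split_gcs_uri` is hypothetical and is not claimed to match the utilities inside `main.py`.

```python
def split_gcs_uri(uri: str) -> tuple:
    """Split gs://bucket/path/to/object into (bucket, object_name)."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"Expected a gs:// URI, got {uri!r}")
    bucket, _, object_name = uri[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"No bucket name found in {uri!r}")
    return bucket, object_name

bucket_name, object_name = split_gcs_uri(
    "gs://fashion-mnist-datasets/custom_jobs/fashion_mnist.npz"
)
print(bucket_name, object_name)
# fashion-mnist-datasets custom_jobs/fashion_mnist.npz
```

The two parts can then be passed to the same client calls used earlier in this notebook: `client.bucket(bucket_name)` followed by `bucket.blob(object_name)`.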

In [4]:
import subprocess
subprocess.run(["python", os.path.expanduser("~/data_normalizer/run.py")])

2025-04-28 19:17:13,467 - main - INFO - Downloading gs://fashion-mnist-datasets/custom_jobs/fashion_mnist.npz to /var/tmp/tmpcrtwd2tx/fashion_mnist.npz
2025-04-28 19:17:14,053 - main - INFO - Download complete
2025-04-28 19:17:14,053 - main - INFO - Loading dataset from /var/tmp/tmpcrtwd2tx/fashion_mnist.npz
2025-04-28 19:17:14,417 - main - INFO - Dataset loaded with keys: dict_keys(['X_train', 'y_train', 'X_test', 'y_test'])
2025-04-28 19:17:14,417 - main - INFO - Normalizing dataset
2025-04-28 19:17:14,512 - main - INFO - Normalized X_train: shape=(60000, 28, 28), dtype=float32
2025-04-28 19:17:14,513 - main - INFO - Kept y_train unchanged: shape=(60000,), dtype=uint8
2025-04-28 19:17:14,530 - main - INFO - Normalized X_test: shape=(10000, 28, 28), dtype=float32
2025-04-28 19:17:14,530 - main - INFO - Kept y_test unchanged: shape=(10000,), dtype=uint8
2025-04-28 19:17:14,530 - main - INFO - Saving normalized dataset to /var/tmp/tmpcrtwd2tx/fashion_mnist_normalized.npz
2025-04-28 19:1

Normalization result: {
  "status": "success",
  "input_path": "gs://fashion-mnist-datasets/custom_jobs/fashion_mnist.npz",
  "output_path": "gs://fashion-mnist-datasets/custom_jobs_normalized/",
  "normalized_file": "gs://fashion-mnist-datasets/custom_jobs_normalized/fashion_mnist_normalized.npz",
  "timestamp": "2025-04-28T19:17:23.926234"
}


2025-04-28 19:17:23,926 - main - INFO - Upload complete
2025-04-28 19:17:23,926 - main - INFO - Created and uploaded README.md to gs://fashion-mnist-datasets/custom_jobs_normalized/README.md


CompletedProcess(args=['python', '/home/jupyter/data_normalizer/run.py'], returncode=0)

***

### **Validation**

After running the normalization process, we'll verify the results by:

1. Listing the contents of the output directory in Google Cloud Storage
2. Confirming the creation of:
   - `fashion_mnist_normalized.npz` - The normalized dataset file
   - `class_names.json` - The class mapping information
   - `README.md` - Documentation with usage instructions

We expect to see these files with appropriate timestamps and sizes. Converting from uint8 to float32 quadruples the uncompressed size of the image arrays (70,000 × 28 × 28 × 4 bytes ≈ 219.5 MB), so the normalized NPZ file may be larger than the original even after compression; the listing below shows the actual size.

This validation step ensures our preprocessing pipeline executed correctly before proceeding to model training. In a production environment, this would be supplemented with additional data quality checks and automated testing.
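
As one example of such a data quality check, the sketch below inspects the arrays themselves rather than only the file listing. The function name `check_normalized` and the `X_` key prefix are assumptions; the checks (float32 dtype, finite values, range within [0, 1]) follow from the normalization described in this notebook.

```python
import numpy as np

def check_normalized(arrays: dict, image_prefix: str = "X_") -> list:
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    for key, value in arrays.items():
        if not key.startswith(image_prefix):
            continue  # labels are not expected to be normalized
        if value.dtype != np.float32:
            problems.append(f"{key}: dtype is {value.dtype}, expected float32")
        if not np.isfinite(value).all():
            problems.append(f"{key}: contains NaN or infinite values")
        if value.size and (value.min() < 0.0 or value.max() > 1.0):
            problems.append(f"{key}: values fall outside [0, 1]")
    return problems

# Hypothetical examples: one correctly normalized array, one left as raw uint8.
good = {"X_train": np.array([0.0, 0.5, 1.0], dtype=np.float32)}
bad = {"X_train": np.array([0, 255], dtype=np.uint8)}
print(check_normalized(good))
print(check_normalized(bad))
```

For the real output, the dictionary could be built from a downloaded copy of the file with `np.load(path)` and its `files` attribute.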

In [5]:
subprocess.run(["gsutil", "ls", "-l", "gs://fashion-mnist-datasets/custom_jobs_normalized/"])

      1058  2025-04-28T19:17:23Z  gs://fashion-mnist-datasets/custom_jobs_normalized/README.md
       106  2025-04-28T19:17:23Z  gs://fashion-mnist-datasets/custom_jobs_normalized/class_names.json
  45395155  2025-04-28T19:17:23Z  gs://fashion-mnist-datasets/custom_jobs_normalized/fashion_mnist_normalized.npz
TOTAL: 3 objects, 45396319 bytes (43.29 MiB)


CompletedProcess(args=['gsutil', 'ls', '-l', 'gs://fashion-mnist-datasets/custom_jobs_normalized/'], returncode=0)

***

### **Cleanup**

As a final step, we'll clean up resources by removing the Cloud Function used for normalization. While our current execution used the local Python module directly, this function was previously deployed to automatically process new datasets uploaded to our bucket.

This cleanup is an important practice in cloud environments to:

1. Avoid unnecessary ongoing resource costs
2. Maintain a clean cloud project environment
3. Prevent potential naming conflicts in future deployments

The Google Cloud SDK command `gcloud functions delete` handles the removal of our serverless function resources. In a production environment, this type of cleanup would typically be managed by Infrastructure as Code (IaC) tools or CI/CD pipelines.

In [6]:
import subprocess

# Delete the Cloud Function
result = subprocess.run(
    ["gcloud", "functions", "delete", "fashion-mnist-normalizer", 
     "--region=us-central1", "--quiet"],
    capture_output=True,
    text=True
)

# Print the output
print("STDOUT:", result.stdout)
print("STDERR:", result.stderr)
print(f"Return code: {result.returncode}")

if result.returncode == 0:
    print("Cloud Function successfully deleted!")
else:
    print("Error deleting Cloud Function. See error messages above.")

STDOUT: 
STDERR: Preparing function...
......done.
Deleting function...
[Service]..........................done
[Artifact Registry]done
Done.
Deleted [projects/fashion-mnist-gcp/locations/us-central1/functions/fashion-mnist-normalizer].

Return code: 0
Cloud Function successfully deleted!


***

### **Conclusion**

In this notebook, we've successfully:

1. Implemented a scalable, reproducible data normalization pipeline leveraging Google Cloud Storage
2. Processed the Fashion MNIST dataset with industry-standard `Min-Max [0,1] normalization`
3. Generated normalized floating-point data optimized for neural network training
4. Created comprehensive documentation for downstream consumers of the dataset
5. Demonstrated proper cloud resource management and cleanup

Our approach leverages several production ML engineering best practices:

- **Modularity**: The normalization logic is encapsulated in a standalone Python module
- **Flexibility**: The implementation supports both interactive execution and serverless deployment
- **Reproducibility**: Normalization is deterministic (a fixed division by 255.0 with no random components), so repeated runs on the same input produce the same output
- **Documentation**: Automatic README generation ensures consumers understand the data format
- **Error handling**: Robust exception handling and logging throughout the pipeline

The normalized dataset is now ready for model training in subsequent notebooks and Vertex AI custom training jobs. The preprocessing decisions implemented here align with our findings from the data analysis phase, specifically the recommendation to use Min-Max normalization for optimal model performance.

***
***