# Data Cleaning & Preprocessing (MNIST Kaggle Dataset)

In this notebook, we will:
- Handle any missing values (if present)
- Normalize pixel values to range [0,1]
- Separate features and target variable
- Reshape images for Deep Learning models (CNN or MLP)
- Prepare data for training and testing

This ensures the dataset is clean and ready for model training.

## Step 1: Set Project Root for Python Imports

In [1]:
import os 
import sys

sys.path.append(os.path.abspath(".."))
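
The cell above assumes the notebook is started from a folder one level below the project root. A minimal sketch of a slightly more defensive variant, under that same assumption, is shown here; the helper name `add_project_root` is hypothetical and not part of the project.

```python
import sys
from pathlib import Path


def add_project_root(levels_up: int = 1) -> Path:
    """Append the directory `levels_up` levels above the working directory to sys.path."""
    root = Path.cwd()
    for _ in range(levels_up):
        root = root.parent
    # Skip the append when the cell is re-run, so sys.path does not fill with duplicates.
    if str(root) not in sys.path:
        sys.path.append(str(root))
    return root


project_root = add_project_root()
```

Re-running this cell leaves `sys.path` unchanged, which is not the case for a bare `sys.path.append`.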

## Step 2: Load MNIST dataset
We first load the MNIST training and test CSV files into pandas DataFrames and preview the first 5 rows.

In [2]:
from src.data import (
    load_train_data,
    load_test_data,
    mis_val_train,
    mis_val_test,
)

df_train = load_train_data(r"D:\Thiru\ML_Projects\MNIST-Handwritten-Digit-Recognition\Data\raw\mnist_train.csv")
df_test = load_test_data(r"D:\Thiru\ML_Projects\MNIST-Handwritten-Digit-Recognition\Data\raw\mnist_test.csv")

Train data loaded successfully 
     label  1x1  1x2  1x3  1x4  1x5  1x6  1x7  1x8  1x9  ...  28x19  28x20  \
0      5    0    0    0    0    0    0    0    0    0  ...      0      0   
1      0    0    0    0    0    0    0    0    0    0  ...      0      0   
2      4    0    0    0    0    0    0    0    0    0  ...      0      0   
3      1    0    0    0    0    0    0    0    0    0  ...      0      0   
4      9    0    0    0    0    0    0    0    0    0  ...      0      0   

   28x21  28x22  28x23  28x24  28x25  28x26  28x27  28x28  
0      0      0      0      0      0      0      0      0  
1      0      0      0      0      0      0      0      0  
2      0      0      0      0      0      0      0      0  
3      0      0      0      0      0      0      0      0  
4      0      0      0      0      0      0      0      0  

[5 rows x 785 columns]
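
The loaders in `src.data` are not shown in this notebook. As a rough sketch of what such a helper might do, assuming it wraps `pandas.read_csv` and prints a preview, consider the following; the function name `load_csv_preview` and the in-memory sample are hypothetical and used only for illustration.

```python
import io

import pandas as pd


def load_csv_preview(source, name="Train", n_rows=5):
    """Read a CSV into a DataFrame and print a short preview."""
    df = pd.read_csv(source)
    print(f"{name} data loaded successfully")
    print(df.head(n_rows))
    return df


# Tiny stand-in for the real file: a label column plus two pixel columns.
sample_csv = io.StringIO("label,1x1,1x2\n5,0,0\n0,0,12\n4,255,0\n")
df_sample = load_csv_preview(sample_csv)
```

Passing a path built relative to the project root, rather than a hard-coded absolute path, would likely make the notebook easier to run on another machine.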


## Step 3: Check for Missing Values

In [3]:
mis_val_train(df_train)
print('\n')
mis_val_test(df_test)

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]  (output truncated)

### Insights
 - Neither dataset contains missing values, so no imputation or row removal is needed.
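
The checks in `src.data` are not shown here. A minimal sketch of a more compact check, assuming the data sits in a pandas DataFrame, reports only the total and the columns that actually contain gaps instead of printing one count per column. The name `count_missing` and the demo frame are hypothetical.

```python
import numpy as np
import pandas as pd


def count_missing(df):
    """Return the total number of missing cells and the non-zero per-column counts."""
    per_column = df.isnull().sum()
    nonzero = {col: int(n) for col, n in per_column.items() if n > 0}
    return int(per_column.sum()), nonzero


# Demo frame with one deliberately missing pixel value.
demo = pd.DataFrame({"label": [5, 0, 4], "1x1": [0, np.nan, 0], "1x2": [0, 0, 0]})
total_missing, columns_with_missing = count_missing(demo)
print(total_missing, columns_with_missing)
```

For a clean dataset such as this one, the total would be 0 and the dictionary empty.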

## Step 4: Separate Features and Target

In [4]:
from src.model import (
    split_x_y,
    normalize_data,
    reshape_for_cnn,
    encode_data,
    save_clean_data,
)

X_train, Y_train, X_test, Y_test = split_x_y(df_train, df_test)
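
The body of `split_x_y` is not shown in this notebook. A minimal sketch of the idea, assuming the target lives in the `label` column and the remaining 784 columns are pixels, could look like this; `split_features_target` is a hypothetical name.

```python
import pandas as pd


def split_features_target(df, target="label"):
    """Split a DataFrame into a feature matrix and a target vector (NumPy arrays)."""
    X = df.drop(columns=[target]).to_numpy()
    y = df[target].to_numpy()
    return X, y


# Two-row demo with two pixel columns instead of 784.
demo = pd.DataFrame({"label": [5, 0], "1x1": [0, 3], "1x2": [7, 0]})
X_demo, y_demo = split_features_target(demo)
print(X_demo.shape, y_demo.shape)
```

Applied to the training and test frames in turn, this would yield the four arrays used in the following steps.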

## Step 5: Normalize Pixel Values (0-255 -> 0-1)

In [5]:
X_train, X_test = normalize_data(X_train, X_test)
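
Normalization here means dividing each pixel intensity by 255, so 0 stays 0 and 255 becomes 1. The implementation of `normalize_data` is not shown; a minimal sketch, assuming 8-bit grayscale input and `float32` output, is given below with the hypothetical name `scale_pixels`.

```python
import numpy as np


def scale_pixels(*arrays, max_value=255.0):
    """Scale each array of pixel intensities to the range [0, 1] as float32."""
    return tuple(np.asarray(a, dtype=np.float32) / max_value for a in arrays)


# Small stand-ins for the train and test feature matrices.
train_px = np.array([[0, 128, 255]], dtype=np.uint8)
test_px = np.array([[64, 0, 255]], dtype=np.uint8)
train_scaled, test_scaled = scale_pixels(train_px, test_px)
```

Casting to `float32` before dividing keeps memory use at half that of the default `float64`.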

## Step 6: Reshape for Deep Learning Models
 - For MLP: already flattened (shape: n_samples, 784)
 - For CNN: reshape to (n_samples, 28, 28, 1)

In [6]:
X_train_cnn, X_test_cnn = reshape_for_cnn(X_train, X_test)
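
The column names (`1x1`, `1x2`, ..., `28x28`) suggest the pixels are stored row by row, which matches NumPy's default row-major reshape. A minimal sketch of the reshape under that assumption follows; `to_image_tensor` is a hypothetical name, not the project's `reshape_for_cnn`.

```python
import numpy as np


def to_image_tensor(flat, height=28, width=28, channels=1):
    """Reshape (n_samples, height*width*channels) into (n_samples, height, width, channels)."""
    return np.asarray(flat).reshape(-1, height, width, channels)


# Two fake "images" whose pixel values equal their flat index, to make the layout visible.
flat = np.arange(2 * 784).reshape(2, 784)
images = to_image_tensor(flat)
print(images.shape)
```

In this demo, flat index 28 of the first sample ends up at row 1, column 0 of the image, confirming the row-major layout.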

## Step 7: One-Hot Encode Target for CNN

In [7]:
Y_train_cnn, Y_test_cnn = encode_data(Y_train, Y_test)
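
One-hot encoding turns each digit label into a length-10 vector with a single 1 at the index of the digit. The body of `encode_data` is not shown; a minimal NumPy sketch, assuming integer labels 0 to 9, is given here under the hypothetical name `one_hot`.

```python
import numpy as np


def one_hot(labels, num_classes=10):
    """Map integer labels of shape (n,) to a one-hot matrix of shape (n, num_classes)."""
    labels = np.asarray(labels, dtype=np.int64)
    # Row k of the identity matrix is the one-hot vector for class k.
    return np.eye(num_classes, dtype=np.float32)[labels]


encoded = one_hot([5, 0, 4])
print(encoded.shape)
```

Taking `argmax` along axis 1 recovers the original integer labels.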

## Step 8: Save the Cleaned Data

In [8]:
save_dir = r"D:\Thiru\ML_Projects\MNIST-Handwritten-Digit-Recognition\Data\processed"

data_dict = {
    "X_train": X_train,
    "X_train_cnn": X_train_cnn,
    "Y_train": Y_train,
    "Y_train_cnn": Y_train_cnn,
    "X_test": X_test,
    "X_test_cnn": X_test_cnn,
    "Y_test": Y_test,
    "Y_test_cnn": Y_test_cnn
}

save_clean_data(save_dir, data_dict)

✅ Saved X_train.npy  →  shape: (60000, 784)
✅ Saved X_train_cnn.npy  →  shape: (60000, 28, 28, 1)
✅ Saved Y_train.npy  →  shape: (60000,)
✅ Saved Y_train_cnn.npy  →  shape: (60000, 10)
✅ Saved X_test.npy  →  shape: (10000, 784)
✅ Saved X_test_cnn.npy  →  shape: (10000, 28, 28, 1)
✅ Saved Y_test.npy  →  shape: (10000,)
✅ Saved Y_test_cnn.npy  →  shape: (10000, 10)

All processed datasets saved in: D:\Thiru\ML_Projects\MNIST-Handwritten-Digit-Recognition\Data\processed
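
The `save_clean_data` helper is not shown in this notebook. A minimal sketch of the same idea, assuming each dictionary entry is written to its own `.npy` file with `numpy.save`, is given below. The name `save_arrays`, the temporary directory, and the dummy arrays are hypothetical.

```python
import tempfile
from pathlib import Path

import numpy as np


def save_arrays(save_dir, arrays):
    """Write each array to <save_dir>/<name>.npy and return the written paths."""
    out_dir = Path(save_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = {}
    for name, array in arrays.items():
        path = out_dir / f"{name}.npy"
        np.save(path, array)
        paths[name] = path
    return paths


# Demonstration with a temporary directory and small dummy arrays.
with tempfile.TemporaryDirectory() as tmp:
    written = save_arrays(tmp, {"X_demo": np.zeros((4, 784)), "Y_demo": np.arange(4)})
    # Reload one file to confirm the round trip preserves the shape.
    reloaded_shape = np.load(written["X_demo"]).shape
```

A later training notebook could then restore each array with `np.load` from the processed folder.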


## Data-Cleaning Summary
 - Checked for missing values: none found.
 - Features normalized to range [0,1].
 - Training features shape for MLP: (60000, 784)
 - Training features shape for CNN: (60000, 28, 28, 1)
 - Target variable 'label' one-hot encoded for CNN: (60000, 10)
 - Data is now ready for model training.
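
The points in this summary can also be checked programmatically before training. The following is a minimal sketch, assuming the array layouts listed above; `check_processed` is a hypothetical helper and the random arrays only stand in for the real data.

```python
import numpy as np


def check_processed(X_flat, X_img, y, y_onehot):
    """Raise AssertionError if the processed arrays are inconsistent with each other."""
    n = X_flat.shape[0]
    assert X_img.shape == (n, 28, 28, 1)
    assert y.shape == (n,) and y_onehot.shape == (n, 10)
    # Pixel values must lie in [0, 1] after normalization.
    assert X_flat.min() >= 0.0 and X_flat.max() <= 1.0
    # The CNN tensor must hold the same pixels as the flat matrix.
    assert np.array_equal(X_img.reshape(n, -1), X_flat)
    # Each one-hot row has a single 1 at the position of the integer label.
    assert np.array_equal(y_onehot.argmax(axis=1), y)
    assert np.allclose(y_onehot.sum(axis=1), 1.0)
    return True


rng = np.random.default_rng(0)
X_flat = rng.random((3, 784), dtype=np.float32)
y = np.array([5, 0, 4])
ok = check_processed(X_flat, X_flat.reshape(-1, 28, 28, 1), y, np.eye(10)[y])
```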