# Data Preparation: MATLAB to HDF5

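This notebook loads the MATLAB (`.mat`) datasets used in the SONNMF experiments, applies the transformations listed below, and saves each result in HDF5 format. A dataset is converted only if its `.h5` file does not already exist.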

**Transformations Applied**

- **Urban**: No transformation required
- **Swimmer**: Transpose
- **Jasper**: Transpose > Reshape to 2D > Remove negatives

All datasets are standardized to (features × samples) orientation for NMF analysis.
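The Jasper pipeline (transpose, reshape to 2D, remove negatives) can be sketched on a toy array. This is only an illustration: the 3D input shape `(rows, cols, bands)` and the values are assumptions, not properties of the real `jasper.mat` file.

```python
import numpy as np

# Hypothetical toy cube: 2 rows x 3 columns x 4 bands (shape and values are assumptions).
cube = np.arange(24, dtype=float).reshape(2, 3, 4) - 5.0

# On a 3D array, .T reverses the axis order: (rows, cols, bands) -> (bands, cols, rows).
transposed = cube.T

# Flatten everything after the first axis, giving a (features x samples) matrix.
X = transposed.reshape(transposed.shape[0], -1)

# Count the negative entries, then set them to zero.
n_negative = int(np.sum(X < 0))
X[X < 0] = 0

print(X.shape, n_negative, X.min())
```

Under these assumptions each row of `X` holds one band and each column one pixel, which matches the (features × samples) orientation described above.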

In [1]:
import os
import numpy as np
import scipy.io
from utils import save_h5

datasets_path = '../datasets'

In [2]:
def load_matlab_file(filepath):
    """Load MATLAB file and return the data dictionary."""
    data = scipy.io.loadmat(filepath)
    return {k: v for k, v in data.items() if not k.startswith('__')}
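The filter matters because `scipy.io.loadmat` returns metadata entries such as `__header__`, `__version__`, and `__globals__` alongside the stored variables. The sketch below applies the same dictionary comprehension to a hand-built stand-in dictionary; the values are made up for illustration.

```python
# Hand-built stand-in for a loadmat result; the values are made up.
raw = {
    "__header__": b"MATLAB 5.0 MAT-file",
    "__version__": "1.0",
    "__globals__": [],
    "X": [[1.0, 2.0], [3.0, 4.0]],
}

# Same filter as load_matlab_file: drop keys that start with "__".
variables = {k: v for k, v in raw.items() if not k.startswith("__")}

print(sorted(variables))
```

Only the actual MATLAB variables remain, so the dictionary that gets saved contains data arrays and no loader metadata.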

### 1. Urban Dataset

In [3]:
urban_mat = load_matlab_file(os.path.join(datasets_path, 'urban.mat'))

if os.path.exists(os.path.join(datasets_path, 'urban.h5')):
    print("Urban dataset already exists in HDF5 format.")
else:
    print("Saving Urban dataset in HDF5 format...")
    save_h5(urban_mat, os.path.join(datasets_path, 'urban.h5'))


Urban dataset already exists in HDF5 format.


### 2. Swimmer Dataset

In [4]:
swimmer_mat = load_matlab_file(os.path.join(datasets_path, 'Swimmer.mat'))

if os.path.exists(os.path.join(datasets_path, 'swimmer.h5')):
    print("Swimmer dataset already exists in HDF5 format.")
else:
    print("Saving Swimmer dataset in HDF5 format...")
    # Transpose to the (features × samples) orientation
    swimmer_mat['X'] = swimmer_mat['X'].T
    save_h5(swimmer_mat, os.path.join(datasets_path, 'swimmer.h5'))


Swimmer dataset already exists in HDF5 format.


### 3. Jasper Dataset

In [5]:
jasper_mat = load_matlab_file(os.path.join(datasets_path, 'jasper.mat'))

if os.path.exists(os.path.join(datasets_path, 'jasper.h5')):
    print("Jasper dataset already exists in HDF5 format.")
else:
    print("Saving Jasper dataset in HDF5 format...")
    # Reverse the axis order, then flatten everything after the first axis into columns
    X = jasper_mat['X'].T
    X = X.reshape(X.shape[0], -1)

    # Report how many entries are negative before setting them to zero
    negative_mask = X < 0
    print(f"Before removing negatives: {np.sum(negative_mask)} negative values found.")
    print(f"Proportion of negative values before removal: {100 * np.mean(negative_mask):.2f}%")
    X[negative_mask] = 0
    jasper_mat['X'] = X
    save_h5(jasper_mat, os.path.join(datasets_path, 'jasper.h5'))

Jasper dataset already exists in HDF5 format.
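Since NMF requires non-negative input and every dataset here is meant to be a 2D (features × samples) matrix, a quick sanity check on each prepared array may be useful. The helper below is a minimal sketch; `check_nmf_input` is a hypothetical name and is not part of `utils`.

```python
import numpy as np

def check_nmf_input(X):
    """Return X's shape, raising ValueError if X is not a finite, non-negative 2D array."""
    X = np.asarray(X)
    if X.ndim != 2:
        raise ValueError(f"expected a 2D array, got {X.ndim} dimension(s)")
    if not np.all(np.isfinite(X)):
        raise ValueError("array contains NaN or infinite values")
    if np.any(X < 0):
        raise ValueError("array contains negative values")
    return X.shape

# Toy matrix standing in for a prepared dataset (values are made up).
n_features, n_samples = check_nmf_input(np.array([[0.0, 1.5, 2.0], [3.0, 0.0, 4.5]]))
print(n_features, n_samples)
```

Calling it on the `X` array of each dataset before saving would catch a missed transpose or leftover negative values early.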
