# Stage 1: Molecular Featurization Pipeline

This notebook implements the first stage of the mass spectrometry data processing pipeline, focusing on molecular feature extraction from SMILES strings.

## Pipeline Overview:

### Responsibilities:
1. **Load Raw Spectral Data**: Read JSONL files containing SMILES and peak data
2. **Extract Molecular Features**:
   - RDKit descriptors (200+ properties)
   - Multiple fingerprint types (Morgan, MACCS, Avalon, etc.)
   - Electronic properties and charge distributions
3. **Post-process Features**: Drop descriptor columns that contain NaN values or have zero variance across the dataset
4. **Save Intermediate Data**: Store processed data for Stage 2
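
The post-processing step (item 3) can be illustrated on a toy matrix. This is a standard-library sketch with a hypothetical helper name (`drop_bad_columns`) and made-up column names. The notebook itself performs the same filtering with NumPy masks in `postprocess_descriptors`.

```python
import math
from statistics import pstdev

def drop_bad_columns(rows, names):
    """Drop columns that contain NaN or have zero variance across all rows."""
    keep = []
    for j in range(len(names)):
        column = [row[j] for row in rows]
        if any(math.isnan(v) for v in column):
            continue  # a single NaN removes the whole column
        if pstdev(column) == 0:
            continue  # a constant column carries no information
        keep.append(j)
    kept_rows = [[row[j] for j in keep] for row in rows]
    kept_names = [names[j] for j in keep]
    return kept_rows, kept_names

rows = [
    [1.0, 5.0, float("nan")],
    [2.0, 5.0, 0.3],
    [4.0, 5.0, 0.1],
]
filtered, kept = drop_bad_columns(rows, ["MolWt", "Const", "Partial"])
print(kept)      # ['MolWt']
print(filtered)  # [[1.0], [2.0], [4.0]]
```

Note that a column is dropped for every molecule as soon as one molecule has a NaN in it, so a single problematic structure can reduce the feature set for the whole dataset.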

### Input:
- `data/input/{dataset_type}/spectral_data.jsonl`

### Output (to temporary directory):
- `data/tmp/{dataset_type}/raw_spectral_data.jsonl` - Original spectral data
- `data/tmp/{dataset_type}/molecular_features.jsonl` - Extracted molecular features
- `data/tmp/{dataset_type}/dataset_config.json` - Configuration metadata
- `data/tmp/{dataset_type}/corrupted_records.jsonl` - Error log
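
As written by the save step, each line of `raw_spectral_data.jsonl` and `molecular_features.jsonl` is a JSON object with a `smiles` key and a `data` key. A minimal sketch of how a downstream stage might read such a file is shown here. `read_stage1_jsonl` is a hypothetical helper, and the in-memory sample with illustrative values stands in for a real file.

```python
import io
import json

def read_stage1_jsonl(handle):
    """Map each SMILES to its `data` payload, skipping blank lines."""
    records = {}
    for line in handle:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        records[record["smiles"]] = record["data"]
    return records

# Two illustrative lines in the raw_spectral_data.jsonl layout.
sample = io.StringIO(
    '{"smiles": "CCO", "data": [{"mz": 31.0, "intensity": 100.0}]}\n'
    '{"smiles": "C", "data": []}\n'
)
spectra = read_stage1_jsonl(sample)
print(spectra["CCO"][0]["mz"])  # 31.0
```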

## 1. Environment Setup

Import required libraries and configure the environment for molecular feature extraction.

In [1]:
#!/usr/bin/env python
# coding: utf-8

# Standard library imports
import os
import json
import logging
import traceback
import random
from collections import defaultdict

# Third-party imports
import numpy as np
import pandas as pd
import psutil
from tqdm import tqdm
from joblib import Parallel, delayed

# RDKit imports
from rdkit import Chem
from rdkit.Chem import Descriptors, rdFingerprintGenerator, rdMolDescriptors, MACCSkeys, AllChem
from rdkit.ML.Descriptors import MoleculeDescriptors
from rdkit.Avalon import pyAvalonTools

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

# Set random seeds for reproducibility
SEED = 42
random.seed(SEED)
np.random.seed(SEED)

print("Environment setup complete")
print("=" * 60)

Environment setup complete


## 2. Configuration

Configuration specific to Stage 1: molecular featurization.

**Key Configuration Sections:**
- `dataset_type`: Specifies which dataset to process
- `paths`: Callables that build input and output paths from the dataset type
- `features`: Controls which molecular features to extract
- `processing`: Parallel processing and error handling settings
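
Path entries in the `paths` section of the configuration are callables that take the dataset type, so one configuration can serve several datasets. A small sketch of how they resolve follows. `EXAMPLE_PATHS` is a trimmed, illustrative copy of that section, and `resolve_paths` is a hypothetical helper.

```python
EXAMPLE_PATHS = {
    "input_file": lambda dtype: f"../data/input/{dtype}/spectral_data.jsonl",
    "temp_dir": lambda dtype: f"../data/tmp/{dtype}",
}

def resolve_paths(paths, dataset_type):
    """Call each path template with the dataset type."""
    return {name: template(dataset_type) for name, template in paths.items()}

resolved = resolve_paths(EXAMPLE_PATHS, "hpj")
print(resolved["input_file"])  # ../data/input/hpj/spectral_data.jsonl
print(resolved["temp_dir"])    # ../data/tmp/hpj
```

Because lambdas are not JSON-serializable, only the plain-data parts of the configuration (such as `features`) can be written to `dataset_config.json`.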

In [2]:
# Stage 1 Configuration
STAGE1_CONFIG = {
    # Dataset configuration
    'dataset_type': 'hpj',  # Change this to process different datasets
    
    # Paths configuration
    'paths': {
        'data_root': '../data',
        'input_dir': lambda dtype: f"../data/input/{dtype}",
        'input_file': lambda dtype: f"../data/input/{dtype}/spectral_data.jsonl",
        'temp_dir': lambda dtype: f"../data/tmp/{dtype}",  # Intermediate storage
        'corrupted_output': lambda dtype: f"../data/tmp/{dtype}/corrupted_records.jsonl",
    },
    
    # Processing configuration
    'processing': {
        'n_jobs': -1,  # Use all available cores
        'log_level': 'INFO',
        'error_log_detail_level': 'standard'
    },
    
    # Feature extraction settings
    'features': {
        'extract_descriptors': True,
        'extract_electronic': True,
        'extract_substructures': False,
        'atom_pair_fp_size': 512,
        
        # Fingerprint configurations
        'fingerprints': {
            'morgan': {'enabled': True, 'radii': [1, 2, 3], 'size': 1024},
            'morgan_feature': {'enabled': True, 'radius': 2, 'size': 1024},
            'maccs': {'enabled': True},
            'topological': {'enabled': True, 'size': 1024},
            'rdkit': {'enabled': True, 'size': 2048},
            'avalon': {'enabled': True, 'size': 1024},
            'pattern': {'enabled': True, 'size': 1024},
            'layered': {'enabled': True, 'size': 2048}
        },
        
        # Count features
        'count_features': {
            'bond_counts': True,
            'atom_counts': True
        }
    }
}

print(f"Configuration loaded for dataset: {STAGE1_CONFIG['dataset_type']}")
print("=" * 60)

Configuration loaded for dataset: hpj


## 3. Directory Setup

Create necessary directories for intermediate data storage.

In [3]:
def setup_directories(dataset_type):
    """Create all necessary directories for Stage 1."""
    temp_dir = STAGE1_CONFIG['paths']['temp_dir'](dataset_type)
    os.makedirs(temp_dir, exist_ok=True)
    logger.info(f"Created directory: {temp_dir}")
    return temp_dir

# Setup directories
print("Setting up directories...")
temp_dir = setup_directories(STAGE1_CONFIG['dataset_type'])
print(f"Temporary directory: {temp_dir}")
print("Directory setup complete")
print("=" * 60)

2025-06-23 07:09:30,430 - INFO - Created directory: ../data/tmp/hpj


Setting up directories...
Temporary directory: ../data/tmp/hpj
Directory setup complete


## 4. Utility Functions

Core utilities for memory monitoring, data conversion, and error handling.
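
The error-handling mixin appends one JSON object per failure to `corrupted_records.jsonl`. The sketch below reproduces that record layout in isolation. `build_error_record` is a hypothetical stand-alone function, not part of the notebook.

```python
import json
import traceback

def build_error_record(record_type, data, error):
    """Build a log record; call inside an `except` block so the traceback is populated."""
    return {
        "type": record_type,
        "data": data,
        "error": str(error),
        "traceback": traceback.format_exc(),
    }

try:
    json.loads("{not valid json")
except json.JSONDecodeError as exc:
    record = build_error_record("DataLoader", {"context": "loading JSON line"}, exc)

line = json.dumps(record)        # one line of the JSONL error log
print(json.loads(line)["type"])  # DataLoader
```

`traceback.format_exc()` only returns a useful trace while an exception is being handled, which is why the helper is meant to be called from within the `except` block.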

In [4]:
# ---------------------- Utilities ---------------------- #
class Utilities:
    @staticmethod
    def get_memory_usage():
        """Get current memory usage in MB."""
        process = psutil.Process(os.getpid())
        mem_info = process.memory_info()
        return f"{mem_info.rss / (1024 * 1024):.1f} MB (RSS), {process.memory_percent():.1f}% of total"

    @staticmethod
    def convert_np_to_list(item):
        """Recursively convert numpy arrays to lists for JSON serialization."""
        if isinstance(item, np.ndarray):
            return item.tolist()
        elif isinstance(item, dict):
            return {k: Utilities.convert_np_to_list(v) for k, v in item.items()}
        elif isinstance(item, list):
            return [Utilities.convert_np_to_list(v) for v in item]
        else:
            return item

# ---------------------- Error Handling Mixin ---------------------- #
class ErrorHandlingMixin:
    """Provides standardized error handling for pipeline components."""
    
    def handle_error(self, error, context="", data=None):
        """Centralized error handling with configurable detail level."""
        message = f"Error in {self.__class__.__name__}"
        if context:
            message += f" ({context})"
        message += f": {error}"
        
        if hasattr(self, 'config') and self.config and 'processing' in self.config:
            detail_level = self.config['processing'].get('error_log_detail_level', 'standard')
        else:
            detail_level = 'standard'
            
        if detail_level == 'full':
            logger.error(message, exc_info=True)
        elif detail_level == 'standard':
            logger.error(message)
        else:
            logger.error(f"Error in {self.__class__.__name__}: {error}")
        
        if hasattr(self, 'config') and self.config and 'paths' in self.config:
            if hasattr(self, 'dataset_type'):
                corrupted_output = self.config['paths']['corrupted_output'](self.dataset_type)
            else:
                corrupted_output = self.config['paths']['corrupted_output'](self.config.get('dataset_type', 'unknown'))
            
            self._log_corrupted_record(
                record_type=self.__class__.__name__,
                data=data if data is not None else {'context': context},
                error=error,
                log_file=corrupted_output
            )
        return None
    
    def _log_corrupted_record(self, record_type, data, error, log_file):
        """Log corrupted records with standardized format."""
        record = {
            'type': record_type,
            'data': data,
            'error': str(error),
            'traceback': traceback.format_exc()
        }
        try:
            with open(log_file, 'a') as fout:
                fout.write(json.dumps(record) + "\n")
        except Exception as e:
            logging.error(f"Failed to log corrupted record: {e}")

# Helper function
def is_valid_smiles(smi, error_log=None):
    """Check if SMILES string is valid, logging failures to error_log if given."""
    exception = None
    try:
        mol = Chem.MolFromSmiles(smi) if isinstance(smi, str) else None
    except Exception as e:
        mol, exception = None, str(e)
    if mol is None and error_log:
        # MolFromSmiles returns None (rather than raising) for unparsable input
        with open(error_log, 'a') as f:
            f.write(json.dumps({"error": "invalid SMILES", "smiles": smi, "exception": exception}) + "\n")
    return mol is not None

print("Utility functions loaded")
print(f"Initial memory usage: {Utilities.get_memory_usage()}")
print("=" * 60)

Utility functions loaded
Initial memory usage: 180.9 MB (RSS), 0.1% of total


## 5. Data Loading

Load raw spectral data from JSONL file.
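
The loader expects one JSON object per line with a `smiles` string and a `peaks` list, assumed here to hold `[mz, intensity]` pairs. A dependency-free sketch of parsing one line follows. `parse_spectrum_line` is hypothetical, and returning `None` for records that lack a SMILES string is an assumption of this sketch.

```python
import json

def parse_spectrum_line(line):
    """Return (smiles, peaks) for one JSONL line, or None if it has no SMILES."""
    record = json.loads(line)
    smiles = record.get("smiles")
    if not isinstance(smiles, str) or not smiles:
        return None
    # Each peak is an [mz, intensity] pair; a missing key yields an empty list.
    peaks = [(float(mz), float(intensity)) for mz, intensity in record.get("peaks", [])]
    return smiles, peaks

print(parse_spectrum_line('{"smiles": "CCO", "peaks": [[31.0, 100.0], [45.0, 52.5]]}'))
# ('CCO', [(31.0, 100.0), (45.0, 52.5)])
print(parse_spectrum_line('{"peaks": [[31.0, 100.0]]}'))
# None
```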

In [5]:
# ---------------------- DataLoader ---------------------- #
class DataLoader(ErrorHandlingMixin):
    def __init__(self, config, dataset_type):
        self.config = config
        self.dataset_type = dataset_type
    
    def load_data(self, input_path):
        """Load spectral data from JSONL file."""
        error_log = self.config['paths']['corrupted_output'](self.dataset_type)
        logging.info(f"Loading spectral data from {input_path}")
        
        data = {}
        with open(input_path, 'r') as f:
            for line in tqdm(f, desc="Loading data"):
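                try:
                    record = json.loads(line)
                    smiles = record.get("smiles")
                    if not smiles:
                        # Without a SMILES string the record cannot be indexed
                        raise ValueError("record has no 'smiles' field")
                    peaks = record.get("peaks", [])
                    df = pd.DataFrame(peaks, columns=["mz", "intensity"]) if peaks else pd.DataFrame(columns=["mz", "intensity"])
                    # Duplicate SMILES overwrite earlier entries
                    data[smiles] = df
                except Exception as e:
                    self.handle_error(e, "loading JSON line", {'line': line.strip()})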
        
        logging.info(f"Loaded {len(data)} records")
        return data

print("Data loader initialized")
print("=" * 60)

Data loader initialized


## 6. Molecular Feature Extraction

Extract chemical descriptors and fingerprints from SMILES strings using RDKit.

### Features Extracted:
- **RDKit Descriptors**: 200+ molecular properties
- **Fingerprints**: Morgan, MACCS, Avalon, RDKit, etc.
- **Electronic Properties**: Partial charges, charge statistics
- **Structural Counts**: Atom and bond type counts
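
The charge statistics are population moments of the per-atom Gasteiger charges $q_1, \dots, q_n$, matching what `extract_electronic_features` computes with `np.mean` and `np.std`:

$$
\bar{q} = \frac{1}{n}\sum_{i=1}^{n} q_i, \qquad
\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (q_i - \bar{q})^2}, \qquad
\text{skew} = \frac{\frac{1}{n}\sum_{i=1}^{n} (q_i - \bar{q})^3}{\sigma^3}, \qquad
\text{kurtosis} = \frac{\frac{1}{n}\sum_{i=1}^{n} (q_i - \bar{q})^4}{\sigma^4}
$$

The kurtosis defined this way is not the excess kurtosis, so a normal distribution gives 3 rather than 0. Skew and kurtosis are both set to 0 when $\sigma = 0$ or the molecule has a single atom.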

In [6]:
# ---------------------- Feature Extraction Functions ---------------------- #
def extract_fingerprint(mol, fp_type, size=1024, radius=2, **kwargs):
    """Generic fingerprint extraction function."""
    try:
        if fp_type == 'morgan':
            generator = rdFingerprintGenerator.GetMorganGenerator(radius=radius, fpSize=size)
        elif fp_type == 'morgan_feature':
            generator = rdFingerprintGenerator.GetMorganGenerator(
                radius=radius, 
                fpSize=size,
                atomInvariantsGenerator=rdFingerprintGenerator.GetMorganFeatureAtomInvGen()
            )
        elif fp_type == 'topological':
            generator = rdFingerprintGenerator.GetTopologicalTorsionGenerator(fpSize=size)
        elif fp_type == 'rdkit':
            return np.array([int(b) for b in Chem.RDKFingerprint(mol, fpSize=size).ToBitString()])
        elif fp_type == 'avalon':
            return np.array([int(b) for b in pyAvalonTools.GetAvalonFP(mol, nBits=size).ToBitString()])
        elif fp_type == 'pattern':
            return np.array([int(b) for b in Chem.PatternFingerprint(mol, fpSize=size).ToBitString()])
        elif fp_type == 'layered':
            return np.array([int(b) for b in Chem.LayeredFingerprint(mol, fpSize=size).ToBitString()])
        elif fp_type == 'maccs':
            return np.array(MACCSkeys.GenMACCSKeys(mol))
        else:
            return None
            
        fp = generator.GetFingerprint(mol)
        return np.array([int(b) for b in fp.ToBitString()])
    except Exception:
        return np.zeros(size, dtype=np.int32)

def extract_electronic_features(mol):
    """Extract electronic features from molecule."""
    features = {}
    electronic_features = {}
    
    # Try to compute Gasteiger charges
    try:
        AllChem.ComputeGasteigerCharges(mol)
        charges = [atom.GetDoubleProp('_GasteigerCharge') if atom.HasProp('_GasteigerCharge') else 0.0 
                  for atom in mol.GetAtoms()]
        
        if charges:
            electronic_features['min_partial_charge'] = float(min(charges))
            electronic_features['max_partial_charge'] = float(max(charges))
            electronic_features['mean_partial_charge'] = float(np.mean(charges))
            electronic_features['charge_std'] = float(np.std(charges))
            
            if len(charges) > 1 and electronic_features['charge_std'] > 0:
                electronic_features['charge_skew'] = float(np.mean(
                    [(c - electronic_features['mean_partial_charge'])**3 for c in charges]
                ) / (electronic_features['charge_std']**3))
                electronic_features['charge_kurtosis'] = float(np.mean(
                    [(c - electronic_features['mean_partial_charge'])**4 for c in charges]
                ) / (electronic_features['charge_std']**4))
            else:
                electronic_features['charge_skew'] = 0.0
                electronic_features['charge_kurtosis'] = 0.0
        else:
            electronic_features.update({k: 0.0 for k in [
                'min_partial_charge', 'max_partial_charge', 
                'mean_partial_charge', 'charge_std', 
                'charge_skew', 'charge_kurtosis'
            ]})
    except Exception:
        electronic_features.update({k: 0.0 for k in [
            'min_partial_charge', 'max_partial_charge', 
            'mean_partial_charge', 'charge_std', 
            'charge_skew', 'charge_kurtosis'
        ]})
    
    # Try to compute other electronic properties
    try:
        # PEOE_VSA_ returns a 0-indexed vector, so PEOE_VSA1 is element 0
        peoe_vsa = rdMolDescriptors.PEOE_VSA_(mol)
        electronic_features['PEOE_VSA1'] = peoe_vsa[0]
        electronic_features['PEOE_VSA2'] = peoe_vsa[1]
        electronic_features['PEOE_VSA3'] = peoe_vsa[2]
        
        crippen_contribs = rdMolDescriptors.GetCrippenContribs(mol)
        logp_values = [c[0] for c in crippen_contribs]
        mr_values = [c[1] for c in crippen_contribs]
        
        electronic_features['max_logp_contrib'] = float(max(logp_values)) if logp_values else 0.0
        electronic_features['min_logp_contrib'] = float(min(logp_values)) if logp_values else 0.0
        electronic_features['max_mr_contrib'] = float(max(mr_values)) if mr_values else 0.0
        electronic_features['min_mr_contrib'] = float(min(mr_values)) if mr_values else 0.0
    except Exception:
        electronic_features.update({k: 0.0 for k in [
            'PEOE_VSA1', 'PEOE_VSA2', 'PEOE_VSA3',
            'max_logp_contrib', 'min_logp_contrib',
            'max_mr_contrib', 'min_mr_contrib'
        ]})
    
    features['electronic_features'] = np.array(list(electronic_features.values()))
    features['electronic_feature_names'] = list(electronic_features.keys())
    return features

def compute_features(smi, config, error_log=None):
    """Compute all molecular features for a given SMILES string."""
    if not is_valid_smiles(smi, error_log):
        logging.warning(f"Skipping invalid SMILES: {smi}")
        return None
    
    try:
        mol = Chem.MolFromSmiles(smi)
        features = {}
        
        # Extract RDKit descriptors
        if config['features']['extract_descriptors']:
            descriptor_names = [desc[0] for desc in Descriptors._descList]
            calc = MoleculeDescriptors.MolecularDescriptorCalculator(descriptor_names)
            features['descriptors'] = np.array(calc.CalcDescriptors(mol))
            features['descriptor_names'] = descriptor_names
        
        # Extract fingerprints
        for fp_name, fp_config in config['features']['fingerprints'].items():
            if fp_config.get('enabled', False):
                if fp_name == 'morgan':
                    for radius in fp_config['radii']:
                        features[f'morgan_fp_r{radius}'] = extract_fingerprint(
                            mol, 'morgan', size=fp_config['size'], radius=radius
                        )
                        if radius == fp_config['radii'][0]:
                            features['morgan_fingerprint'] = features[f'morgan_fp_r{radius}']
                elif fp_name == 'morgan_feature':
                    features['morgan_feature_fp'] = extract_fingerprint(
                        mol, 'morgan_feature', size=fp_config['size'], radius=fp_config['radius']
                    )
                else:
                    features[f'{fp_name}_fingerprint'] = extract_fingerprint(
                        mol, fp_name, size=fp_config.get('size', 1024)
                    )
        
        # Count features
        if config['features']['count_features']['bond_counts']:
            bond_counts = defaultdict(int)
            for bond in mol.GetBonds():
                bond_type = str(bond.GetBondType())
                bond_counts[bond_type] += 1
            features['bond_counts'] = np.array(list(bond_counts.values()))
            features['bond_types'] = list(bond_counts.keys())
            
        if config['features']['count_features']['atom_counts']:
            atom_counts = defaultdict(int)
            for atom in mol.GetAtoms():
                atom_counts[atom.GetSymbol()] += 1
            features['atom_counts'] = np.array(list(atom_counts.values()))
            features['atom_types'] = list(atom_counts.keys())
        
        # Extract electronic features
        if config['features']['extract_electronic']:
            features.update(extract_electronic_features(mol))
            
        return (smi, features)
    except Exception as e:
        if error_log:
            with open(error_log, 'a') as f:
                f.write(json.dumps({"error": f"processing SMILES {smi}", "smiles": smi, "exception": str(e)}) + "\n")
        return None

def postprocess_descriptors(molecular_features):
    """Process descriptors to remove NaN values and zero variance columns."""
    if not molecular_features:
        return molecular_features
        
    # Collect descriptor vectors from every molecule that has them
    valid_smiles = []
    descriptors_list = []
    
    for smi, f in molecular_features.items():
        if 'descriptors' in f:
            descriptors_list.append(f['descriptors'])
            valid_smiles.append(smi)
    
    if not descriptors_list:
        return molecular_features
        
    # Process all descriptors at once
    descriptors_matrix = np.array(descriptors_list)
    
    # Check for NaN values
    nan_mask = ~np.isnan(descriptors_matrix).any(axis=0)
    if not np.all(nan_mask):
        logging.info(f"Dropping {np.sum(~nan_mask)} descriptor(s) with NaN values")
        descriptors_matrix = descriptors_matrix[:, nan_mask]
        original_names = molecular_features[valid_smiles[0]]['descriptor_names']
        new_names = [name for name, keep in zip(original_names, nan_mask) if keep]
        
        for smi in valid_smiles:
            molecular_features[smi]['descriptor_names'] = new_names
    
    # Check for zero variance
    var_mask = np.std(descriptors_matrix, axis=0) > 0
    if not np.all(var_mask):
        logging.info(f"Dropping {np.sum(~var_mask)} descriptor(s) with 0 variance")
        descriptors_matrix = descriptors_matrix[:, var_mask]
        current_names = molecular_features[valid_smiles[0]]['descriptor_names']
        new_names = [name for name, keep in zip(current_names, var_mask) if keep]
        
        for smi in valid_smiles:
            molecular_features[smi]['descriptor_names'] = new_names
    
    # Update the descriptors in the original dictionary
    for i, smi in enumerate(valid_smiles):
        molecular_features[smi]['descriptors'] = descriptors_matrix[i]
    
    # Memory cleanup
    del descriptors_list
    del descriptors_matrix
    
    return molecular_features

print("Feature extraction functions loaded")
print("=" * 60)

Feature extraction functions loaded


## 7. Execute Stage 1: Molecular Featurization

Run the molecular feature extraction pipeline.
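
The execution cell fans `compute_features` out over all SMILES with joblib and keeps only the non-`None` results. The sketch below shows the same map-then-filter pattern using the standard library's `concurrent.futures` in place of joblib, with a hypothetical `toy_features` function standing in for the real featurizer.

```python
from concurrent.futures import ThreadPoolExecutor

def toy_features(smi):
    """Stand-in featurizer: returns (smiles, features) or None on failure."""
    if not smi:
        return None  # mimics an invalid SMILES being skipped
    return smi, {"length": len(smi), "n_carbon": smi.count("C")}

smiles_list = ["CCO", "", "C1CCCCC1"]

# executor.map preserves input order, as joblib.Parallel does.
with ThreadPoolExecutor(max_workers=2) as executor:
    results = list(executor.map(toy_features, smiles_list))

# Keep successes only; failures are counted by difference.
molecular_features = dict(r for r in results if r is not None)
failed = len(smiles_list) - len(molecular_features)
print(sorted(molecular_features))  # ['C1CCCCC1', 'CCO']
print(failed)                      # 1
```

Counting failures by difference, as the notebook does, assumes the input SMILES are unique. Duplicates would collapse into one dictionary key and be reported as failures.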

In [7]:
print("STAGE 1: MOLECULAR FEATURIZATION")
print("=" * 60)

# Validate dataset exists
dataset_type = STAGE1_CONFIG['dataset_type']
input_path = STAGE1_CONFIG['paths']['input_file'](dataset_type)

if not os.path.exists(input_path):
    print(f"ERROR: Dataset '{dataset_type}' not found at {input_path}")
    print(f"Available datasets: {[d for d in os.listdir('../data/input') if os.path.isdir(os.path.join('../data/input', d))]}")
else:
    # Initialize settings
    temp_dir = STAGE1_CONFIG['paths']['temp_dir'](dataset_type)
    corrupted_output = STAGE1_CONFIG['paths']['corrupted_output'](dataset_type)
    
    # Clear corrupted output file
    open(corrupted_output, 'w').close()
    
    print(f"Processing dataset: {dataset_type}")
    print(f"Input file: {input_path}")
    print(f"Temporary directory: {temp_dir}")
    print("")
    
    # Initialize data loader
    data_loader = DataLoader(STAGE1_CONFIG, dataset_type)
    
    # Step 1: Load data
    print("Loading raw spectral data...")
    raw_data = data_loader.load_data(input_path)
    print(f"Memory usage after data loading: {Utilities.get_memory_usage()}")
    print("")
    
    # Step 2: Extract molecular features
    print("Extracting molecular features...")
    smiles_list = list(raw_data.keys())
    print(f"Processing {len(smiles_list)} molecules with {STAGE1_CONFIG['processing']['n_jobs']} parallel jobs")
    
    # Parallel processing of features
    results = Parallel(n_jobs=STAGE1_CONFIG['processing']['n_jobs'])(
        delayed(compute_features)(smi, STAGE1_CONFIG, corrupted_output) 
        for smi in tqdm(smiles_list, desc="Computing molecular features")
    )
    
    # Collect results
    molecular_features = {}
    for result in results:
        if result is not None:
            smi, features = result
            molecular_features[smi] = features
    
    print(f"\nSuccessfully extracted features for {len(molecular_features)} molecules")
    print(f"Failed: {len(smiles_list) - len(molecular_features)} molecules")
    
    # Process descriptors if needed
    if STAGE1_CONFIG['features']['extract_descriptors']:
        print("\nPost-processing descriptors...")
        molecular_features = postprocess_descriptors(molecular_features)
    
    print(f"Memory usage after feature extraction: {Utilities.get_memory_usage()}")
    print("")
    
    # Step 3: Save intermediate data
    print("Saving intermediate data...")
    
    # Convert DataFrames to serializable format for raw_data
    serializable_raw_data = {smiles: df.to_dict(orient='records') for smiles, df in raw_data.items()}
    raw_data = None  # Free memory
    
    # Save raw spectral data
    raw_data_path = os.path.join(temp_dir, 'raw_spectral_data.jsonl')
    with open(raw_data_path, 'w') as f:
        for smiles, peaks in tqdm(serializable_raw_data.items(), desc="Saving raw spectral data"):
            record = {"smiles": smiles, "data": peaks}
            f.write(json.dumps(record) + "\n")
    print(f"Saved raw spectral data to {raw_data_path}")
    serializable_raw_data = None  # Free memory
    
    # Save molecular features
    mol_features_path = os.path.join(temp_dir, 'molecular_features.jsonl')
    with open(mol_features_path, 'w') as f:
        for smiles, features in tqdm(molecular_features.items(), desc="Saving molecular features"):
            record = {"smiles": smiles, "data": Utilities.convert_np_to_list(features)}
            f.write(json.dumps(record) + "\n")
    print(f"Saved molecular features to {mol_features_path}")
    
    # Save dataset configuration
    config_path = os.path.join(temp_dir, 'dataset_config.json')
    with open(config_path, 'w') as f:
        json.dump({
            'dataset_type': dataset_type,
            'stage1_complete': True,
            'num_molecules': len(molecular_features),
            'feature_config': STAGE1_CONFIG['features']
        }, f, indent=2)
    print(f"Saved configuration to {config_path}")
    
    print(f"\nStage 1 complete. All data saved to {temp_dir}")
    print(f"Final memory usage: {Utilities.get_memory_usage()}")
    print("=" * 60)

2025-06-23 07:09:30,479 - INFO - Loading spectral data from ../data/input/hpj/spectral_data.jsonl


STAGE 1: MOLECULAR FEATURIZATION
Processing dataset: hpj
Input file: ../data/input/hpj/spectral_data.jsonl
Temporary directory: ../data/tmp/hpj

Loading raw spectral data...


Loading data: 2720it [00:00, 9886.82it/s]
2025-06-23 07:09:30,763 - INFO - Loaded 2720 records


Memory usage after data loading: 206.7 MB (RSS), 0.2% of total

Extracting molecular features...
Processing 2720 molecules with -1 parallel jobs


Computing molecular features: 100%|██████████| 2720/2720 [00:02<00:00, 937.85it/s] 
2025-06-23 07:09:34,487 - INFO - Dropping 12 descriptor(s) with NaN values
2025-06-23 07:09:34,491 - INFO - Dropping 9 descriptor(s) with 0 variance



Successfully extracted features for 2720 molecules
Failed: 0 molecules

Post-processing descriptors...
Memory usage after feature extraction: 582.9 MB (RSS), 0.4% of total

Saving intermediate data...


Saving raw spectral data: 100%|██████████| 2720/2720 [00:00<00:00, 12548.50it/s]


Saved raw spectral data to ../data/tmp/hpj/raw_spectral_data.jsonl


Saving molecular features: 100%|██████████| 2720/2720 [00:01<00:00, 1623.62it/s]

Saved molecular features to ../data/tmp/hpj/molecular_features.jsonl
Saved configuration to ../data/tmp/hpj/dataset_config.json

Stage 1 complete. All data saved to ../data/tmp/hpj
Final memory usage: 618.1 MB (RSS), 0.5% of total





## 8. Stage 1 Summary

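List every file currently in the temporary directory with its size. The listing covers the whole directory, so it can include files written by other notebooks (for example the `*_3d*` files in the output below) in addition to the four Stage 1 outputs.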

In [8]:
# Display generated files
print("\nSTAGE 1 OUTPUT FILES:")
print("=" * 60)

temp_dir = STAGE1_CONFIG['paths']['temp_dir'](STAGE1_CONFIG['dataset_type'])
if os.path.exists(temp_dir):
    files = sorted(os.listdir(temp_dir))
    for file in files:
        file_path = os.path.join(temp_dir, file)
        if os.path.isfile(file_path):
            size_mb = os.path.getsize(file_path) / (1024 * 1024)
            print(f"  - {file}: {size_mb:.2f} MB")

print("\nThese files will be used as input for Stage 2.")
print("Next: Run 02_spectral_processing_ml_preparation.ipynb")


STAGE 1 OUTPUT FILES:
  - 3d_features.pkl: 10.96 MB
  - corrupted_3d_records.jsonl: 0.00 MB
  - corrupted_records.jsonl: 0.00 MB
  - dataset_config.json: 0.00 MB
  - dataset_config_3d.json: 0.00 MB
  - dataset_metadata_3d.json: 0.00 MB
  - feature_importance_3d.json: 0.00 MB
  - molecular_features.jsonl: 111.92 MB
  - molecular_features_3d.jsonl: 2.21 MB
  - raw_spectral_data.jsonl: 16.40 MB
  - raw_spectral_data_3d.jsonl: 7.40 MB

These files will be used as input for Stage 2.
Next: Run 02_spectral_processing_ml_preparation.ipynb
