## Preprocessing Pipeline

### Consolidate metadata CSV files

The four CBIS-DDSM metadata files, which are provided separately from the images, contain critical clinical information (such as BI-RADS assessment category, pathology, and lesion subtlety) that is not embedded in the DICOM files or their folder names.

- data/metadata/calc_case_description_test_set.csv
- data/metadata/calc_case_description_train_set.csv
- data/metadata/mass_case_description_test_set.csv
- data/metadata/mass_case_description_train_set.csv
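
Before merging, it can help to confirm which columns the files share, since the calcification and mass files describe lesions with different fields. The following is a minimal sketch using only the standard library. `read_header` and `columns_not_shared` are hypothetical helper names, and the two in-memory headers are shortened stand-ins for the real files, not their actual contents.

```python
import csv
import io


def read_header(stream):
    """Return the first CSV row (the column names) from a text stream."""
    return next(csv.reader(stream))


def columns_not_shared(headers):
    """Map each file label to the columns that are missing from at least one other file."""
    shared = set.intersection(*(set(cols) for cols in headers.values()))
    return {name: sorted(set(cols) - shared) for name, cols in headers.items()}


# Illustrative, shortened headers (assumption). For the real files, open each CSV
# with open(path, newline="") and pass the file object to read_header.
calc_csv = io.StringIO("patient_id,calc type,calc distribution,pathology\n")
mass_csv = io.StringIO("patient_id,mass shape,mass margins,pathology\n")

headers = {"calc": read_header(calc_csv), "mass": read_header(mass_csv)}
diff = columns_not_shared(headers)
print(diff)
```

Any column reported here has to be renamed or filled with missing values before the files can be stacked into one table.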

Each row contains the following columns (values from one example row are shown):

- Patient ID: `P_00038` — Unique patient identifier.
- Breast Side: `LEFT` — Left breast.
- Image View: `CC` or `MLO` — Standard cranio-caudal or mediolateral oblique view used in mammography.
- Breast Density: `2`
- Abnormality ID: `1`
- Abnormality Type: `calcification` — Specifically dealing with microcalcifications.
- Calcification Type: `PUNCTATE-PLEOMORPHIC` — Mixed types, suggesting variable morphology.
- Calcification Distribution: `CLUSTERED` — Clustered microcalcifications, often suspicious.
- Assessment: `4` — BI-RADS 4, suspicious abnormality; biopsy usually recommended.
- Pathology: `BENIGN` — Biopsy/pathology confirmed the finding as benign.
- Subtlety: `2` — Fairly subtle (1 = very subtle, 5 = very obvious).
- Image Files:
  - Original Image Path: Full mammogram DICOM.
    Calc-Test_P_00038_LEFT_CC/1.3.6.1.4.1.9590.100.1.2.85935434310203356712688695661986996009/1.3.6.1.4.1.9590.100.1.2.374115997511889073021386151921807063992/000000.dcm
  - Cropped Image Path: Focused region where calcifications are.
    Calc-Test_P_00038_LEFT_CC_1/1.3.6.1.4.1.9590.100.1.2.161465562211359959230647609981488894942/1.3.6.1.4.1.9590.100.1.2.419081637812053404913157930753972718515/000001.dcm
  - ROI Mask Path: Binary mask of the calcifications (region of interest).
    Calc-Test_P_00038_LEFT_CC_1/1.3.6.1.4.1.9590.100.1.2.161465562211359959230647609981488894942/1.3.6.1.4.1.9590.100.1.2.419081637812053404913157930753972718515/000000.dcm
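
In the raw CSV files a path field can be quoted and contain an embedded newline, which is why path values need to be stripped before use. As a small illustration with Python's built-in `csv` module (the sample rows are made up and the paths are shortened, with the UID folders omitted), a quoted field that spans two physical lines is still parsed as a single record:

```python
import csv
import io

# Hypothetical sample: the path in the first data row is quoted and ends with a newline.
sample = (
    "patient_id,cropped image file path\n"
    'P_00038,"Calc-Test_P_00038_LEFT_CC_1/000001.dcm\n"\n'
    "P_00038,Calc-Test_P_00038_LEFT_MLO_1/000001.dcm\n"
)

# Splitting on line breaks miscounts the records, because one field spans two lines.
physical_lines = sample.splitlines()

# A CSV parser respects the quotes and keeps the newline inside the field.
rows = list(csv.DictReader(io.StringIO(sample)))

# Stripping whitespace removes the stray newline from the parsed value.
cleaned = [row["cropped image file path"].strip() for row in rows]

print(len(physical_lines), len(rows))
print(cleaned)
```

The number of physical lines is larger than the number of parsed records plus the header, which is the symptom to look for when a file appears to have more rows than expected.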

The following script reads and merges the four metadata CSV files, rewrites the file paths into a consistent `raw/...` form (removing stray quotes and newlines along the way), adds any lesion-specific columns that are missing, and exports the result to `metadata_master.csv`.

In [None]:
import pandas as pd
import csv
from pathlib import Path

# Input CSV files
input_files = [
    '../data/metadata/calc_case_description_test_set.csv',
    '../data/metadata/calc_case_description_train_set.csv',
    '../data/metadata/mass_case_description_test_set.csv',
    '../data/metadata/mass_case_description_train_set.csv'
]

# Read each CSV; the Python engine tolerates newlines embedded in quoted fields.
dfs = []
for file in input_files:
    df = pd.read_csv(
        file,
        engine="python",        # Use Python engine to properly handle malformed newlines
        quoting=csv.QUOTE_MINIMAL, # Respect quotes
        skip_blank_lines=True   # Optional: Skip totally blank lines
    )
    dfs.append(df)

# Concatenate all together
metadata = pd.concat(dfs, ignore_index=True)

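# Some source files may already spell this column 'breast_density' while others
# use 'breast density'; merge both spellings so the rename creates no duplicate.
if {'breast density', 'breast_density'} <= set(metadata.columns):
    metadata['breast_density'] = metadata['breast_density'].fillna(
        metadata.pop('breast density')
    )

# Rename the raw column headers to snake_case names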
metadata = metadata.rename(columns={
    'patient_id': 'patient_id',
    'left or right breast': 'side',
    'image view': 'view',
    'abnormality id': 'abnormality_id',
    'abnormality type': 'abnormality_type',
    'calc type': 'calc_type',
    'calc distribution': 'distribution',
    'mass shape': 'mass_shape',
    'mass margins': 'mass_margins',
    'breast density': 'breast_density',
    'assessment': 'assessment',
    'pathology': 'pathology',
    'subtlety': 'subtlety',
    'image file path': 'full_mammo_path',
    'cropped image file path': 'cropped_roi_path',
    'ROI mask file path': 'roi_mask_path'
})


# Add missing columns (some rows are calcification, others are mass)
for col in ['calc_type', 'distribution', 'mass_shape', 'mass_margins']:
    if col not in metadata.columns:
        metadata[col] = pd.NA

# Normalize paths
def fix_path(path):
    if pd.isna(path):
        return None
    # Remove stray quotes, then strip surrounding whitespace (including the newline
    # embedded in some "cropped image file path" values) and use forward slashes.
    path = path.replace('"', '').replace('\\', '/').strip()
    parts = Path(path).parts
    if len(parts) < 4:
        return path
    parent_folder = parts[0]
    subfolder = parts[1]
    file_name = parts[-1]
    return f'raw/{parent_folder}/{subfolder}/{file_name}'

# Apply path fixer
metadata['full_mammo_path'] = metadata['full_mammo_path'].apply(fix_path)
metadata['cropped_roi_path'] = metadata['cropped_roi_path'].apply(fix_path)
metadata['roi_mask_path'] = metadata['roi_mask_path'].apply(fix_path)

# Select only needed columns
final_cols = [
    'patient_id', 'breast_density', 'side', 'view', 'abnormality_id',
    'abnormality_type', 'calc_type', 'distribution', 'mass_shape', 'mass_margins',
    'assessment', 'pathology', 'subtlety',
    'full_mammo_path', 'cropped_roi_path', 'roi_mask_path'
]
metadata = metadata[final_cols]

# Save to CSV
metadata.to_csv('../data/metadata/metadata_master.csv', index=False)
print('Master metadata CSV created: metadata_master.csv')

Master metadata CSV created: metadata_master.csv
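
To make the path rewrite concrete, here is a standalone sketch of the same normalization applied to the cropped-image path of the example row. `normalize_dicom_path` is a hypothetical name, and `PurePosixPath` is swapped in for `Path` so that the result does not depend on the operating system. The `raw/` prefix assumes the DICOM files are stored under that folder layout, as the script implies.

```python
from pathlib import PurePosixPath


def normalize_dicom_path(raw_path, prefix="raw"):
    """Clean a CSV path field and keep the case folder, first UID folder, and file name."""
    cleaned = raw_path.replace('"', "").replace("\\", "/").strip()
    parts = PurePosixPath(cleaned).parts
    if len(parts) < 4:
        # Too short to follow the case/UID/UID/file pattern: return it cleaned only.
        return cleaned
    return f"{prefix}/{parts[0]}/{parts[1]}/{parts[-1]}"


# Cropped-image path of the example row, with a trailing newline as found in the CSV.
raw_value = (
    "Calc-Test_P_00038_LEFT_CC_1/"
    "1.3.6.1.4.1.9590.100.1.2.161465562211359959230647609981488894942/"
    "1.3.6.1.4.1.9590.100.1.2.419081637812053404913157930753972718515/"
    "000001.dcm\n"
)

normalized = normalize_dicom_path(raw_value)
print(normalized)
```

Note that the second UID folder is dropped, so the rewritten path only resolves if the files on disk were arranged the same way.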
