# SimpleMIMICCXRDataset for View-Specific X-ray Generation

This notebook documents SimpleMIMICCXRDataset, a custom dataset class built on the PyHealth BaseEHRDataset for the MIMIC-CXR dataset and tailored to a view-specific X-ray generation task. The class reads metadata from a CSV file and builds per-patient records with image paths for use with UniXGen.

### Overview
- Purpose: Prepares MIMIC-CXR data for generating X-ray images from one view (e.g., PA) to another (e.g., LL).
- Data Source: mimiccxr_train_sub_filtered.csv with columns dicom_id, subject_id, study_id, view, and count.
- Output: Structured patient data with image paths for input and target views.

### Usage
- The dataset can be integrated with UniXGen for training a view-specific X-ray generation model.
- Ensure the image_dir points to the MIMIC-CXR JPG files directory.

### Repository Structure
- simple_mimic_cxr_dataset.py: The dataset class implementation.
- view_specific_xray_generation.py: Task function to generate view-specific X-ray samples.
- SimpleMIMICCXRDataset.ipynb: This documentation notebook.

In [2]:
%cd /content/drive/MyDrive/Colab Notebooks/CS598

/content/drive/MyDrive/Colab Notebooks/CS598


In [3]:
!pwd

/content/drive/MyDrive/Colab Notebooks/CS598


### Data Preparation
Before creating the dataset, we preprocess the original MIMIC-CXR metadata to create a filtered subset. The original file, mimiccxr_train_sub_final.csv, does not have a header, so we explicitly define the column names and filter the data to include only specific study_id values. The filtered data is saved as mimiccxr_train_sub_filtered.csv with a header, which is then used by the SimpleMIMICCXRDataset class.

In [4]:
import pandas as pd

# Load the original CSV with explicit column names
df = pd.read_csv(
    "./metadata/mimiccxr_train_sub_final.csv",
    sep=',',
    header=None,  # No header in the original file
    names=['dicom_id', 'subject_id', 'study_id', 'view', 'count'],
    dtype={'study_id': str, 'subject_id': str, 'dicom_id': str}
)
print("Loaded columns:", df.columns.tolist())
print("First few rows:\n", df.head())

# Filter to a subset of study_ids (example)
df_filtered = df[df['study_id'].isin(['57049660', '50163781', '57291897'])]
print("Filtered rows:", len(df_filtered))

# Save with header
df_filtered.to_csv("./metadata/mimiccxr_train_sub_filtered.csv", index=False)
print("Saved filtered CSV with header to ./metadata/mimiccxr_train_sub_filtered.csv")

Loaded columns: ['dicom_id', 'subject_id', 'study_id', 'view', 'count']
First few rows:
                                        dicom_id subject_id  study_id     view  \
0  02aa804e-bde0afdd-112c0b34-7bc16630-4e384014   10000032  50414267       PA   
1  174413ec-4ec4c1f7-34ea26b7-c5f994f8-79ef1962   10000032  50414267  LATERAL   
2  2a2277a9-b0ded155-c0de8eb9-c124d10e-82c5caab   10000032  53189527       PA   
3  e084de3b-be89b11e-20fe3f9f-9c8d8dfe-4cfd202c   10000032  53189527  LATERAL   
4  68b5c4b1-227d0485-9cc38c3f-7b84ab51-4b472714   10000032  53911762       AP   

   count  
0      2  
1      2  
2      2  
3      2  
4      2  
Filtered rows: 6
Saved filtered CSV with header to ./metadata/mimiccxr_train_sub_filtered.csv
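
Because `isin` silently drops any requested `study_id` that is absent from the metadata, a quick coverage check after filtering can catch typos in the ID list. The sketch below is only an illustration: `summarize_filter` is a hypothetical helper, and the toy frame contains just the five rows visible in the loader's debug output further down, so `57049660` is reported as missing here purely because the toy frame omits it.

```python
import pandas as pd


def summarize_filter(df, requested_ids):
    """Return the views found per study and the requested IDs that matched no rows."""
    views_by_study = df.groupby("study_id")["view"].apply(list).to_dict()
    missing = sorted(set(requested_ids) - set(views_by_study))
    return views_by_study, missing


# Toy frame (assumption): only the five rows shown in the debug output, not the full file.
toy = pd.DataFrame(
    {
        "study_id": ["50163781", "50163781", "57291897", "57291897", "57291897"],
        "view": ["LL", "PA", "PA", "PA", "LL"],
    }
)

views_by_study, missing = summarize_filter(toy, ["57049660", "50163781", "57291897"])
print(views_by_study)
print("Requested but not found:", missing)
```

On the real data, the same call would take `df_filtered` and the list passed to `isin`.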


### Dataset Implementation
The SimpleMIMICCXRDataset class is implemented in simple_mimic_cxr_dataset.py. It inherits from PyHealth’s BaseEHRDataset and processes the filtered CSV to create a structured dataset for view-specific X-ray generation. Key features include:

- Loads the CSV and handles header presence dynamically.
- Structures data into patients and visits with image paths.
- Supports tasks like generating samples for PA to LL view conversion.
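
The first feature, header handling, could be sketched roughly as follows. This is not the code from simple_mimic_cxr_dataset.py: the `has_header` name is borrowed from the notes at the end of this notebook, while its body, `load_metadata`, and `EXPECTED_COLUMNS` are assumptions made for illustration. The sketch takes CSV text rather than a path so that it runs without the MIMIC-CXR files.

```python
import io

import pandas as pd

EXPECTED_COLUMNS = ["dicom_id", "subject_id", "study_id", "view", "count"]
ID_DTYPES = {"dicom_id": str, "subject_id": str, "study_id": str}


def has_header(first_line):
    """Treat the first line as a header only if it spells out the expected column names."""
    return [cell.strip() for cell in first_line.split(",")] == EXPECTED_COLUMNS


def load_metadata(csv_text):
    """Load metadata from CSV text, with or without a header row (assumes non-empty text)."""
    first_line = csv_text.splitlines()[0]
    if has_header(first_line):
        return pd.read_csv(io.StringIO(csv_text), dtype=ID_DTYPES)
    # No header: supply the column names explicitly, as in the preprocessing cell.
    return pd.read_csv(
        io.StringIO(csv_text), header=None, names=EXPECTED_COLUMNS, dtype=ID_DTYPES
    )


# Two rows taken from the debug output shown later in this notebook.
plain_text = (
    "98d4dfdf-05f65f15-aad86e48-0b41d552-50c3acc8,12470349,50163781,LL,2\n"
    "ffe94f9e-da29e399-cf910cd1-ff895172-a1257159,12470349,50163781,PA,2\n"
)
header_text = ",".join(EXPECTED_COLUMNS) + "\n" + plain_text

from_header = load_metadata(header_text)
from_plain = load_metadata(plain_text)
print(from_header.shape, from_plain.shape)
```

Comparing the first line against the exact column names is a strict test; a loader that must tolerate renamed columns would need a looser rule.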

Using this class requires pyhealth to be installed. The class source is not reproduced in this notebook; see simple_mimic_cxr_dataset.py in the repository for the code.
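
As a rough illustration of how metadata rows might be grouped into patients and visits, the sketch below rebuilds the nested layout that appears in the debug output later in this notebook (a `p` prefix on patient IDs, an `s` prefix on visit IDs, and image paths under `files/mimic-cxr-jpg/2.0.0/files/`). `build_patients` is a hypothetical helper, not the class's actual method, and the real class may differ in detail.

```python
import pandas as pd


def build_patients(df, image_dir):
    """Group metadata rows into {patient_id: {"patient_id", "visits": [...]}}."""
    patients = {}
    columns = zip(df["subject_id"], df["study_id"], df["dicom_id"], df["view"])
    for subject_id, study_id, dicom_id, view in columns:
        patient_id = f"p{subject_id}"
        visit_id = f"s{study_id}"
        patient = patients.setdefault(
            patient_id, {"patient_id": patient_id, "visits": []}
        )
        # Reuse the visit if this study was already seen for the patient.
        visit = next((v for v in patient["visits"] if v["visit_id"] == visit_id), None)
        if visit is None:
            visit = {"visit_id": visit_id, "events": []}
            patient["visits"].append(visit)
        # Path layout copied from the debug output: .../files/p12/p12470349/s.../<dicom_id>.jpg
        image_path = (
            f"{image_dir}/files/mimic-cxr-jpg/2.0.0/files/"
            f"{patient_id[:3]}/{patient_id}/{visit_id}/{dicom_id}.jpg"
        )
        visit["events"].append(
            {
                "dicom_id": dicom_id,
                "view_position": view,
                "image_path": image_path,
                "study_time": None,
            }
        )
    return patients


rows = pd.DataFrame(
    {
        "dicom_id": [
            "98d4dfdf-05f65f15-aad86e48-0b41d552-50c3acc8",
            "ffe94f9e-da29e399-cf910cd1-ff895172-a1257159",
        ],
        "subject_id": ["12470349", "12470349"],
        "study_id": ["50163781", "50163781"],
        "view": ["LL", "PA"],
    }
)
patients = build_patients(rows, "./images")
print(patients["p12470349"]["visits"][0]["events"][0]["image_path"])
```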

### Task Implementation
The view_specific_xray_generation task function, implemented in view_specific_xray_generation.py, generates samples for the view-specific X-ray generation task. It pairs PA view images as inputs with LL view images as targets when both are available in a visit.
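
The pairing rule described above might look something like the sketch below. It is not the function from view_specific_xray_generation.py. The sample keys are the ones printed in the example cell, but which key holds a path and which holds a view label is a guess, as is the choice to pair every PA image with every LL image in the same visit. `pair_pa_with_ll` and the short file names are hypothetical.

```python
def pair_pa_with_ll(patient):
    """Emit one sample per (PA image, LL image) combination within each visit."""
    samples = []
    for visit in patient["visits"]:
        pa_events = [e for e in visit["events"] if e["view_position"] == "PA"]
        ll_events = [e for e in visit["events"] if e["view_position"] == "LL"]
        # A visit lacking either view contributes nothing.
        for pa in pa_events:
            for ll in ll_events:
                samples.append(
                    {
                        "patient_id": patient["patient_id"],
                        "visit_id": visit["visit_id"],
                        "input_front_view": pa["image_path"],  # assumed: input image path
                        "input_view_position": pa["view_position"],
                        "target_view": ll["view_position"],  # assumed: target view label
                        "target_path": ll["image_path"],
                    }
                )
    return samples


# Toy patient with placeholder paths; the view layout mirrors the two visits in the debug output.
toy_patient = {
    "patient_id": "p12470349",
    "visits": [
        {
            "visit_id": "s50163781",
            "events": [
                {"view_position": "LL", "image_path": "a_ll.jpg"},
                {"view_position": "PA", "image_path": "a_pa.jpg"},
            ],
        },
        {
            "visit_id": "s57291897",
            "events": [
                {"view_position": "PA", "image_path": "b_pa1.jpg"},
                {"view_position": "PA", "image_path": "b_pa2.jpg"},
                {"view_position": "LL", "image_path": "b_ll.jpg"},
            ],
        },
    ],
}
pa_ll_samples = pair_pa_with_ll(toy_patient)
print(len(pa_ll_samples), "samples")
```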

### Example Usage
Below is an example of how to use the SimpleMIMICCXRDataset with the view_specific_xray_generation task to generate samples for training a view-specific X-ray generation model.

In [5]:
from simple_mimic_cxr_dataset import SimpleMIMICCXRDataset
from view_specific_xray_generation import view_specific_xray_generation

# Create the dataset with correct paths
dataset = SimpleMIMICCXRDataset(
    root="./CS598",
    metadata_path="./metadata/mimiccxr_train_sub_filtered.csv",
    image_dir="./images",
    dev=False,
    refresh_cache=False,
)

# Set the task
samples = dataset.set_task(view_specific_xray_generation)

# Print a few samples
for i, sample in enumerate(samples[:3]):
    print(f"Sample {i}:")
    print("Input:", sample["input_front_view"], sample["input_view_position"])
    print("Target:", sample["target_view"], sample["target_path"])
    print()

DEBUG: Entering SimpleMIMICCXRDataset.__init__
DEBUG: Initialized self.patients as {}
DEBUG: Preprocessing CSV data
DEBUG: Reading first few rows of CSV to inspect
DEBUG: First few rows of CSV (raw):
                                        dicom_id  subject_id  study_id view  \
0  98d4dfdf-05f65f15-aad86e48-0b41d552-50c3acc8    12470349  50163781   LL   
1  ffe94f9e-da29e399-cf910cd1-ff895172-a1257159    12470349  50163781   PA   
2  598f53a5-d4c83f02-36b40d44-66692b4c-b2a576c1    12470349  57291897   PA   
3  5e17c48b-c64c9b40-7d5e0d2e-4153058b-a1cec9d6    12470349  57291897   PA   
4  7f677cfd-7497e339-8f689bb3-af23a8d5-72c2fc01    12470349  57291897   LL   

   count  
0      2  
1      2  
2      3  
3      3  
4      3  
DEBUG: Does CSV have a header? True
DEBUG: Loading CSV with default header parsing
DEBUG: Loaded DataFrame shape: (6, 5)
DEBUG: Loaded DataFrame columns: ['dicom_id', 'subject_id', 'study_id', 'view', 'count']
DEBUG: First few rows of loaded DataFrame:
           

Generating samples for view_specific_xray_generation: 100%|██████████| 2/2 [00:00<00:00, 1206.65it/s]

DEBUG: Entering view_specific_xray_generation
DEBUG: Input patients: {'patient_id': 'p12470349', 'visits': [{'visit_id': 's50163781', 'events': [{'dicom_id': '98d4dfdf-05f65f15-aad86e48-0b41d552-50c3acc8', 'view_position': 'LL', 'image_path': './images/files/mimic-cxr-jpg/2.0.0/files/p12/p12470349/s50163781/98d4dfdf-05f65f15-aad86e48-0b41d552-50c3acc8.jpg', 'study_time': None}, {'dicom_id': 'ffe94f9e-da29e399-cf910cd1-ff895172-a1257159', 'view_position': 'PA', 'image_path': './images/files/mimic-cxr-jpg/2.0.0/files/p12/p12470349/s50163781/ffe94f9e-da29e399-cf910cd1-ff895172-a1257159.jpg', 'study_time': None}]}, {'visit_id': 's57291897', 'events': [{'dicom_id': '598f53a5-d4c83f02-36b40d44-66692b4c-b2a576c1', 'view_position': 'PA', 'image_path': './images/files/mimic-cxr-jpg/2.0.0/files/p12/p12470349/s57291897/598f53a5-d4c83f02-36b40d44-66692b4c-b2a576c1.jpg', 'study_time': None}, {'dicom_id': '5e17c48b-c64c9b40-7d5e0d2e-4153058b-a1cec9d6', 'view_position': 'PA', 'image_path': './images/




## Additional Notes and Conclusion

### Integration with UniXGen
- The generated samples are formatted with input_front_view, input_view_position, target_view, and target_path, suitable for UniXGen's view-specific X-ray generation task.
- To use with UniXGen, pass the samples list to the model's data loader or pipeline, ensuring the image paths are accessible.
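
One way to check that image paths are accessible before handing samples to a loader is to split them by whether their files exist. This is a minimal sketch using only the standard library. `split_by_accessibility` is a hypothetical helper, and it assumes that `input_front_view` and `target_path` are the keys holding file paths. The demo uses temporary files rather than MIMIC-CXR images.

```python
import os
import tempfile


def split_by_accessibility(samples, path_keys=("input_front_view", "target_path")):
    """Separate samples whose image files all exist from those with a missing file."""
    usable, broken = [], []
    for sample in samples:
        if all(os.path.isfile(sample[key]) for key in path_keys):
            usable.append(sample)
        else:
            broken.append(sample)
    return usable, broken


with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "present.jpg")
    open(present, "wb").close()  # empty stand-in for an image file
    absent = os.path.join(tmp, "absent.jpg")  # never created
    demo_samples = [
        {"input_front_view": present, "target_path": present},
        {"input_front_view": present, "target_path": absent},
    ]
    usable, broken = split_by_accessibility(demo_samples)

print(len(usable), "usable,", len(broken), "with missing files")
```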

### Scaling the Dataset
- For a larger subset of MIMIC-CXR, ensure the metadata_path CSV maintains the same format (header row with dicom_id, subject_id, etc.).
- If the CSV format changes (e.g., the header row is missing), the has_header logic is meant to adapt, but verify the loaded DataFrame's shape and column names before relying on it.

### Extending View Conversions
- The current task generates samples for PA to LL conversion. To support other conversions (e.g., AP to LL), update the view_specific_xray_generation function to include additional view pairs and logic.
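
One possible shape for such an extension is to drive the pairing from a list of (input view, target view) tuples. The sketch below is an assumption about how that could be organized, not a description of the existing function. `VIEW_PAIRS`, `pair_views`, and the file names are hypothetical. Note that the raw metadata also contains labels such as `LATERAL`, so the pairs must use whatever labels actually appear in the CSV.

```python
# Hypothetical configuration: each tuple is (input view, target view).
VIEW_PAIRS = [("PA", "LL"), ("AP", "LL")]


def pair_views(visit, view_pairs=VIEW_PAIRS):
    """Build samples for every configured view pair present in one visit."""
    by_view = {}
    for event in visit["events"]:
        by_view.setdefault(event["view_position"], []).append(event)

    samples = []
    for input_view, target_view in view_pairs:
        for src in by_view.get(input_view, []):
            for tgt in by_view.get(target_view, []):
                samples.append(
                    {
                        "visit_id": visit["visit_id"],
                        "input_front_view": src["image_path"],
                        "input_view_position": input_view,
                        "target_view": target_view,
                        "target_path": tgt["image_path"],
                    }
                )
    return samples


demo_visit = {
    "visit_id": "s_demo",
    "events": [
        {"view_position": "AP", "image_path": "ap.jpg"},
        {"view_position": "PA", "image_path": "pa.jpg"},
        {"view_position": "LL", "image_path": "ll.jpg"},
    ],
}
multi_view_samples = pair_views(demo_visit)
print([(s["input_view_position"], s["target_view"]) for s in multi_view_samples])
```

Keeping the pairs in a list means a new conversion is a one-line configuration change rather than a new branch in the task function.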

### Conclusion
- The implementation is complete and ready for publication or use with UniXGen. The issues encountered during development (e.g., psubject_id, CSV loading, empty dataset) have been resolved.
- The preprocessing step ensures reproducibility by documenting how the filtered dataset was created from the original MIMIC-CXR metadata.