# Clinical Data Pipeline
Sebastian Quirarte Justo | Nov 2025 | sebastianquirajus@outlook.com

This notebook demonstrates an end-to-end clinical data engineering pipeline using Python and Pandas.  
It simulates the process used in regulated clinical environments (CDISC/SDTM) to transform raw operational files into validated, analysis-ready datasets.

Simulated SDTM-formatted data were obtained from https://cdiscdataset.com/.

Key concepts demonstrated:
- Loading raw clinical datasets
- Initial structural and quality checks
- Cleaning and standardizing data
- Joining clinical domains
- Creating analysis-ready outputs

This notebook is the development version of the pipeline, which will later be refactored into `.py` modules.

### Contents

1. [Imports](#1-imports)
2. [Load SDTM Datasets](#2-load-sdtm-datasets)
3. [Exploratory Analysis and Quality Control](#3-exploratory-analysis-and-quality-control)
4. Domain-Level Cleaning
5. Joining Data
6. Analysis & Outputs
7. Next Steps for Pipeline Deployment

### 1. Imports

```python
import pandas as pd  # data manipulation
import numpy as np  # numerical operations
import matplotlib.pyplot as plt  # data visualization
import seaborn as sns  # data visualization

pd.set_option('display.max_columns', None)
```

### 2. Load SDTM Datasets

The raw SDTM CSV files are stored in the `data/` directory.

Each dataset is read into a pandas DataFrame:

- **DM:** Demographics  
- **AE:** Adverse Events  
- **EX:** Exposure  
- **VS:** Vital Signs 

```python
dm = pd.read_csv("data/DM.csv")
ae = pd.read_csv("data/AE.csv")
ex = pd.read_csv("data/EX.csv")
vs = pd.read_csv("data/VS.csv")
```
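
The four `read_csv` calls above could also be written as a loop that fails early when a file is absent. This is a minimal sketch, not the notebook's actual loader: the helper name `load_domains` is hypothetical, and it assumes each domain is stored as `<DOMAIN>.csv` in a single folder, as in the cell above.

```python
from pathlib import Path

import pandas as pd


def load_domains(data_dir, domains):
    """Read one CSV per domain and return a dict of DataFrames keyed by domain name.

    Hypothetical helper: assumes files are named <DOMAIN>.csv inside data_dir.
    """
    frames = {}
    for domain in domains:
        path = Path(data_dir) / f"{domain}.csv"
        # Stop with a clear message instead of a long pandas traceback.
        if not path.exists():
            raise FileNotFoundError(f"Missing SDTM file: {path}")
        frames[domain] = pd.read_csv(path)
    return frames
```

Calling `load_domains("data", ["DM", "AE", "EX", "VS"])` would return a dictionary with the same layout as the `datasets` dictionary built in section 3.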

### 3. Exploratory Analysis and Quality Control

Quick checks:

- Number of rows and columns  
- Preview the first records  
- Check for missing key variables  
- Confirm expected variables, such as `STUDYID` and `USUBJID`

```python
datasets = {
    "DM": dm,
    "AE": ae,
    "EX": ex,
    "VS": vs
}

for name, df in datasets.items():
    print(f"\n=== {name} ===")
    print("Shape:", df.shape)
    print("Columns:", list(df.columns))
    display(df.head())
```


=== DM ===
Shape: (100, 25)
Columns: ['STUDYID', 'DOMAIN', 'USUBJID', 'SUBJID', 'RFSTDTC', 'RFENDTC', 'RFXSTDTC', 'RFXENDTC', 'RFICDTC', 'RFPENDTC', 'DTHDTC', 'DTHFL', 'SITEID', 'AGE', 'AGEU', 'SEX', 'RACE', 'ETHNIC', 'ARMCD', 'ARM', 'ACTARMCD', 'ACTARM', 'COUNTRY', 'DMDTC', 'DMDY']


Unnamed: 0,STUDYID,DOMAIN,USUBJID,SUBJID,RFSTDTC,RFENDTC,RFXSTDTC,RFXENDTC,RFICDTC,RFPENDTC,DTHDTC,DTHFL,SITEID,AGE,AGEU,SEX,RACE,ETHNIC,ARMCD,ARM,ACTARMCD,ACTARM,COUNTRY,DMDTC,DMDY
0,STUDY001,DM,STUDY001-SUBJ0001,SUBJ0001,2025-06-28,2025-11-02,2025-07-04,2025-11-02,2024-12-16,2025-11-02,,,16,39,YEARS,M,BLACK OR AFRICAN AMERICAN,NOT HISPANIC OR LATINO,ARM2,Treatment 2,ARM2,Treatment 2,ST,2025-07-13,17
1,STUDY001,DM,STUDY001-SUBJ0002,SUBJ0002,2025-09-21,2025-11-03,2025-07-31,2025-10-27,2025-07-22,2025-11-03,,,2,21,YEARS,M,ASIAN,HISPANIC OR LATINO,ARM2,Treatment 2,ARM2,Treatment 2,MK,2025-03-15,17
2,STUDY001,DM,STUDY001-SUBJ0003,SUBJ0003,2025-03-13,2025-10-31,2025-05-21,2025-10-30,2025-01-27,2025-10-31,,,32,48,YEARS,F,AMERICAN INDIAN OR ALASKA NATIVE,HISPANIC OR LATINO,ARM1,Treatment 1,ARM1,Treatment 1,SD,2025-10-22,5
3,STUDY001,DM,STUDY001-SUBJ0004,SUBJ0004,2025-09-30,2025-11-18,2025-05-06,2025-11-14,2025-05-28,2025-11-18,,,19,24,YEARS,F,AMERICAN INDIAN OR ALASKA NATIVE,NOT HISPANIC OR LATINO,ARM3,Treatment 3,ARM3,Treatment 3,UY,2024-12-08,19
4,STUDY001,DM,STUDY001-SUBJ0005,SUBJ0005,2024-12-21,2025-10-31,2025-10-13,2025-11-18,2025-05-08,2025-10-31,,,43,49,YEARS,F,BLACK OR AFRICAN AMERICAN,NOT HISPANIC OR LATINO,ARM2,Treatment 2,ARM2,Treatment 2,TV,2025-05-19,24



=== AE ===
Shape: (285, 27)
Columns: ['STUDYID', 'DOMAIN', 'USUBJID', 'AESEQ', 'AESPID', 'AETERM', 'AEDECOD', 'AEBODSYS', 'AESOC', 'AESEV', 'AETOXGR', 'AESER', 'AEREL', 'AEACN', 'AEOUT', 'AESTDTC', 'AEENDTC', 'AESTDY', 'AEENDY', 'AEDUR', 'AECONTRT', 'AESDTH', 'AESLIFE', 'AESHOSP', 'AESDISAB', 'AESMIE', 'AEACNOTH']


Unnamed: 0,STUDYID,DOMAIN,USUBJID,AESEQ,AESPID,AETERM,AEDECOD,AEBODSYS,AESOC,AESEV,AETOXGR,AESER,AEREL,AEACN,AEOUT,AESTDTC,AEENDTC,AESTDY,AEENDY,AEDUR,AECONTRT,AESDTH,AESLIFE,AESHOSP,AESDISAB,AESMIE,AEACNOTH
0,STUDY001,AE,STUDY001-SUBJ0001,1,AE1,HEADACHE,HEADACHE,CARDIAC DISORDERS,CARDIAC DISORDERS,SEVERE,3,Y,NOT RELATED,DOSE NOT CHANGED,RECOVERED/RESOLVED,2025-10-02,,70,,,N,N,N,N,N,N,
1,STUDY001,AE,STUDY001-SUBJ0001,2,AE2,HEADACHE,HEADACHE,RESPIRATORY DISORDERS,RESPIRATORY DISORDERS,MILD,1,N,NOT RELATED,DOSE REDUCED,RECOVERING/RESOLVING,2025-09-09,,47,,,Y,N,Y,N,N,Y,
2,STUDY001,AE,STUDY001-SUBJ0001,3,AE3,RASH,RASH,NERVOUS SYSTEM DISORDERS,NERVOUS SYSTEM DISORDERS,MILD,1,N,RELATED,DRUG INTERRUPTED,NOT RECOVERED/NOT RESOLVED,2025-08-22,,29,,,N,N,N,Y,N,N,
3,STUDY001,AE,STUDY001-SUBJ0001,4,AE4,VOMITING,VOMITING,RESPIRATORY DISORDERS,RESPIRATORY DISORDERS,MODERATE,2,N,NOT RELATED,NOT APPLICABLE,RECOVERING/RESOLVING,2025-09-21,2025-11-18,59,117.0,58.0,N,N,Y,Y,N,Y,
4,STUDY001,AE,STUDY001-SUBJ0001,5,AE5,FATIGUE,FATIGUE,CARDIAC DISORDERS,CARDIAC DISORDERS,MODERATE,2,N,NOT RELATED,DOSE NOT CHANGED,NOT RECOVERED/NOT RESOLVED,2025-09-28,2025-10-06,66,74.0,8.0,N,N,Y,Y,N,Y,



=== EX ===
Shape: (547, 15)
Columns: ['STUDYID', 'DOMAIN', 'USUBJID', 'EXSEQ', 'EXTRT', 'EXDOSE', 'EXDOSU', 'EXDOSFRM', 'EXROUTE', 'EXDOSFRQ', 'EXSTDTC', 'EXENDTC', 'VISITNUM', 'VISIT', 'EXDY']


Unnamed: 0,STUDYID,DOMAIN,USUBJID,EXSEQ,EXTRT,EXDOSE,EXDOSU,EXDOSFRM,EXROUTE,EXDOSFRQ,EXSTDTC,EXENDTC,VISITNUM,VISIT,EXDY
0,STUDY001,EX,STUDY001-SUBJ0001,1,STUDY DRUG,100,mg,TABLET,ORAL,QD,2025-09-28,2025-09-28,1,BASELINE,1
1,STUDY001,EX,STUDY001-SUBJ0001,2,STUDY DRUG,100,mg,TABLET,ORAL,QD,2025-10-04,2025-10-04,2,WEEK 1,7
2,STUDY001,EX,STUDY001-SUBJ0001,3,STUDY DRUG,100,mg,TABLET,ORAL,QD,2025-10-11,2025-10-11,3,WEEK 2,14
3,STUDY001,EX,STUDY001-SUBJ0001,4,STUDY DRUG,100,mg,TABLET,ORAL,QD,2025-10-25,2025-10-25,4,WEEK 4,28
4,STUDY001,EX,STUDY001-SUBJ0002,1,STUDY DRUG,100,mg,TABLET,ORAL,QD,2024-11-29,2024-11-29,1,BASELINE,1



=== VS ===
Shape: (4500, 16)
Columns: ['STUDYID', 'DOMAIN', 'USUBJID', 'VSSEQ', 'VSTESTCD', 'VSTEST', 'VSORRES', 'VSORRESU', 'VSSTRESC', 'VSSTRESN', 'VSSTRESU', 'VSBLFL', 'VISIT', 'VISITNUM', 'VSDTC', 'VSDY']


Unnamed: 0,STUDYID,DOMAIN,USUBJID,VSSEQ,VSTESTCD,VSTEST,VSORRES,VSORRESU,VSSTRESC,VSSTRESN,VSSTRESU,VSBLFL,VISIT,VISITNUM,VSDTC,VSDY
0,STUDY001,VS,STUDY001-SUBJ0001,1,HEIGHT,Height,174.8,cm,174.8,174.8,cm,,SCREENING,1,2025-03-05,-14
1,STUDY001,VS,STUDY001-SUBJ0001,2,WEIGHT,Weight,114.9,kg,114.9,114.9,kg,,SCREENING,1,2025-03-05,-14
2,STUDY001,VS,STUDY001-SUBJ0001,3,TEMP,Temperature,37.0,C,37.0,37.0,C,,SCREENING,1,2025-03-05,-14
3,STUDY001,VS,STUDY001-SUBJ0001,4,BP,Blood Pressure,142/98,mmHg,142/98,,mmHg,,SCREENING,1,2025-03-05,-14
4,STUDY001,VS,STUDY001-SUBJ0001,5,HR,Heart Rate,90,beats/min,90,90.0,beats/min,,SCREENING,1,2025-03-05,-14
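
The loop above prints shapes, column names, and previews, but it does not itself flag absent key variables or missing values in them. One way to cover those two checks is sketched below. The helper name `check_key_variables` and the choice of `STUDYID`, `DOMAIN`, and `USUBJID` as required variables are assumptions made for illustration, and the example frames are made up rather than taken from the study data.

```python
import pandas as pd

# Assumption: every domain is expected to carry these identifier variables.
KEY_VARIABLES = ("STUDYID", "DOMAIN", "USUBJID")


def check_key_variables(datasets, key_variables=KEY_VARIABLES):
    """Report, per domain, absent key columns and missing values in the present ones."""
    rows = []
    for name, df in datasets.items():
        absent = [col for col in key_variables if col not in df.columns]
        present = [col for col in key_variables if col in df.columns]
        rows.append({
            "domain": name,
            "absent_columns": ", ".join(absent),
            # Total count of NaN/None cells across the key columns that exist.
            "missing_values": int(df[present].isna().sum().sum()),
        })
    return pd.DataFrame(rows)


# Tiny made-up frames, only to show how the helper is called.
example = {
    "DM": pd.DataFrame({
        "STUDYID": ["S1", "S1"],
        "DOMAIN": ["DM", "DM"],
        "USUBJID": ["S1-001", None],  # one missing identifier
    }),
    "AE": pd.DataFrame({
        "STUDYID": ["S1"],
        "USUBJID": ["S1-001"],  # DOMAIN column deliberately left out
    }),
}
report = check_key_variables(example)
print(report)
```

In the notebook, the same helper would be called as `check_key_variables(datasets)`. Any non-empty `absent_columns` entry or non-zero `missing_values` count would be a reason to stop before cleaning and joining.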
