# 02 — Data Preprocessing

This notebook prepares the cleaned dataset (`nlp_ready_df`) for feature engineering.
Steps:
1. Load the pre-cleaned structured dataset and the clinical notes.
2. Clean, truncate, and merge notes.
3. Save the NLP-ready dataset (`data/interim/data_nlp_ready.csv`).
4. Write Radiology, Discharge, and Combined notes to text files for Word2Vec training.


## 0. Imports

In [1]:
import os
import pandas as pd
from src.data_prep import (
    load_cleaned_data,
    clean_text,
    group_notes,
    process_notes_in_parallel,
    save_processed_notes,
    pivot_notes_to_wide,
    combine_notes,
    merge_notes_with_cleaned,
    inspect_dataframes,
    write_radiology_notes_for_w2v,
    write_discharge_notes_for_w2v,
    write_combined_notes_for_w2v,
)
from src.utils import resolve_path

## 1. Load Cleaned Structured Data


In [4]:
df_clean = load_cleaned_data("data/raw/Data_after_Cleaning.csv")
df_clean.head()
print("✅ Cleaned dataset loaded:", df_clean.shape)


✅ Loaded cleaned dataset from C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\data\raw\Data_after_Cleaning.csv with shape (5208, 48)
✅ Cleaned dataset loaded: (5208, 48)


## 2. Group and Process Clinical Notes


In [5]:
# `df_notes` is produced upstream; here it is loaded from the interim CSV
df_notes = pd.read_csv(resolve_path("data/interim/data_full_notes_interim.csv"))
df_notes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303994 entries, 0 to 303993
Data columns (total 58 columns):
 #   Column                                 Non-Null Count   Dtype  
---  ------                                 --------------   -----  
 0   int64_field_0                          303994 non-null  int64  
 1   subject_id                             303994 non-null  int64  
 2   hospital_expire_flag                   303994 non-null  int64  
 3   max_age                                303994 non-null  int64  
 4   los_icu                                303994 non-null  float64
 5   first_hosp_stay                        303994 non-null  bool   
 6   suspected_infection                    303994 non-null  int64  
 7   sofa_score                             303994 non-null  int64  
 8   sepsis3                                303994 non-null  bool   
 9   avg_urineoutput                        303994 non-null  float64
 10  glucose_min                            303994 non-null  

In [5]:
# Group into records
records = group_notes(df_notes)

# Parallel processing
processed = process_notes_in_parallel(records)

# Save long-format grouped notes
nlp_long_df = save_processed_notes(processed, "data/interim/data_trunc_notes_interim.csv")
nlp_long_df.head()


✅ Processed notes saved to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\data\interim\data_trunc_notes_interim.csv


|   | subject_id | note_type_1 | combined_notes |
|---|------------|-------------|----------------|
| 0 | 10002013 | discharge | Name: Unit No: Admission Date: Discharge Da... |
| 1 | 10002013 | radiology | INDICATION: History: with L great toe ulcer a... |
| 2 | 10002155 | discharge | Name: Unit No: Admission Date: Discharge Da... |
| 3 | 10002155 | radiology | INDICATION: woman with known stage IV lung ca... |
| 4 | 10002428 | discharge | Name: No: Admission Date: Discharge Date: ... |
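
The implementation of `group_notes` and `process_notes_in_parallel` lives in `src/data_prep.py` and is not shown in this notebook. As a rough, single-process sketch of the idea, the snippet below concatenates notes per subject and note type and then caps the length. The function name `group_and_truncate`, the input column `text`, and the `max_tokens` cap are assumptions made for illustration, not the project's actual API.

```python
import pandas as pd


def group_and_truncate(df, text_col="text", max_tokens=512):
    """Concatenate notes per (subject_id, note_type_1) and cap the token count.

    Hypothetical helper: the column name `text` and the whitespace-token cap
    are illustrative assumptions, not taken from src/data_prep.py.
    """
    def join_and_cap(texts):
        # Drop missing notes, join the rest, and keep the first max_tokens tokens
        tokens = " ".join(texts.dropna().astype(str)).split()
        return " ".join(tokens[:max_tokens])

    return (
        df.groupby(["subject_id", "note_type_1"], sort=True)[text_col]
        .agg(join_and_cap)
        .reset_index(name="combined_notes")
    )


# Toy data, not real clinical notes
toy_notes = pd.DataFrame({
    "subject_id": [1, 1, 1, 2],
    "note_type_1": ["radiology", "radiology", "discharge", "discharge"],
    "text": ["chest clear", "no effusion", "stable at discharge", None],
})
grouped = group_and_truncate(toy_notes, max_tokens=3)
print(grouped)
```

A subject whose notes are all missing ends up with an empty string rather than `NaN`, which keeps later string operations simple.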


## 3. Pivot to Wide Format

In [6]:
nlp_wide_df = pivot_notes_to_wide(nlp_long_df, "data/interim/data_trunc_notes_wide_interim.csv")
nlp_wide_df.head()


|   | subject_id | Discharge_summary_notes | Radiology_notes |
|---|------------|-------------------------|-----------------|
| 0 | 10002013 | Name: Unit No: Admission Date: Discharge Da... | INDICATION: History: with L great toe ulcer a... |
| 1 | 10002155 | Name: Unit No: Admission Date: Discharge Da... | INDICATION: woman with known stage IV lung ca... |
| 2 | 10002428 | Name: No: Admission Date: Discharge Date: ... | INDICATION: woman admitted to the ICU for pne... |
| 3 | 10003400 | Name: Unit No: Admission Date: Discharge Da... | EXAMINATION: CHEST (PORTABLE AP) INDICATION: H... |
| 4 | 10004720 | Name: Unit No: Admission Date: Discharge Da... | EXAMINATION: CHEST (PORTABLE AP) INDICATION: ... |
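
For readers without access to `src/data_prep.py`, the long-to-wide step can be approximated with `DataFrame.pivot`. This is only a sketch: the helper name `pivot_long_to_wide` is hypothetical, and filling missing note types with an empty string is an assumption about the desired behaviour. The output column names are taken from the preview above.

```python
import pandas as pd


def pivot_long_to_wide(long_df):
    """Turn one row per (subject_id, note type) into one row per subject_id.

    Hypothetical stand-in for pivot_notes_to_wide; pivot() raises a
    ValueError if a subject has two rows for the same note type.
    """
    wide = long_df.pivot(
        index="subject_id", columns="note_type_1", values="combined_notes"
    )
    wide = wide.rename(columns={
        "discharge": "Discharge_summary_notes",
        "radiology": "Radiology_notes",
    })
    wide.columns.name = None  # drop the leftover "note_type_1" axis label
    # Assumption: subjects missing a note type get an empty string
    return wide.fillna("").reset_index()


# Toy data, not real clinical notes
toy_long = pd.DataFrame({
    "subject_id": [1, 1, 2],
    "note_type_1": ["discharge", "radiology", "discharge"],
    "combined_notes": ["dc note a", "rad note a", "dc note b"],
})
toy_wide = pivot_long_to_wide(toy_long)
print(toy_wide)
```

`pivot` is used here instead of `pivot_table` so that duplicate subject and note-type pairs fail loudly rather than being silently aggregated.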


## 4. Combine Notes

In [8]:
nlp_combined_notes_df = combine_notes(nlp_wide_df, "data/interim/data_trunc_notes_combined_interim.csv")
nlp_combined_notes_df.head()


|   | subject_id | combined_notes |
|---|------------|----------------|
| 0 | 10002013 | INDICATION: History: with L great toe ulcer a... |
| 1 | 10002155 | INDICATION: woman with known stage IV lung ca... |
| 2 | 10002428 | INDICATION: woman admitted to the ICU for pne... |
| 3 | 10003400 | EXAMINATION: CHEST (PORTABLE AP) INDICATION: H... |
| 4 | 10004720 | EXAMINATION: CHEST (PORTABLE AP) INDICATION: ... |
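
The combined preview above begins with radiology text, which suggests the radiology note is placed before the discharge summary. The sketch below follows that inferred ordering. The helper name `combine_note_columns` and the single-space separator are assumptions; the real `combine_notes` may differ.

```python
import pandas as pd


def combine_note_columns(wide, cols=("Radiology_notes", "Discharge_summary_notes")):
    """Join the note columns into one string per subject, skipping empty parts.

    Hypothetical stand-in for combine_notes; the column order is inferred
    from the preview, not confirmed by the source module.
    """
    parts = wide[list(cols)].fillna("").astype(str)
    combined = parts.apply(
        lambda row: " ".join(p for p in row if p.strip()), axis=1
    )
    return pd.DataFrame({
        "subject_id": wide["subject_id"],
        "combined_notes": combined,
    })


# Toy data, not real clinical notes
toy_wide_notes = pd.DataFrame({
    "subject_id": [1, 2],
    "Discharge_summary_notes": ["dc a", "dc b"],
    "Radiology_notes": ["rad a", None],
})
toy_combined = combine_note_columns(toy_wide_notes)
print(toy_combined)
```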


## 5. Merge with Structured Data

In [9]:
nlp_ready_df = merge_notes_with_cleaned(
    resolve_path("data/raw/Data_after_Cleaning.csv"),
    nlp_wide_df,
    nlp_combined_notes_df,
    "data/interim/data_nlp_ready.csv",
)

inspect_dataframes(nlp_wide_df, nlp_combined_notes_df, nlp_ready_df)


nlp_wide_df shape: (5208, 3)
nlp_combined_notes_df shape: (5208, 2)
nlp_ready_df shape: (5208, 51)

=== nlp_ready_df Info ===
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5208 entries, 0 to 5207
Data columns (total 51 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   subject_id                             5208 non-null   int64  
 1   hospital_expire_flag                   5208 non-null   int64  
 2   max_age                                5208 non-null   int64  
 3   los_icu                                5208 non-null   float64
 4   first_hosp_stay                        5208 non-null   bool   
 5   suspected_infection                    5208 non-null   int64  
 6   sofa_score                             5208 non-null   int64  
 7   sepsis3                                5208 non-null   bool   
 8   avg_urineoutput                        5208 non-null   float64
 9   glucose_min   
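
The printed shapes are consistent with a one-row-per-subject join on `subject_id`: 48 structured columns, plus 2 note columns from `nlp_wide_df`, plus 1 from `nlp_combined_notes_df`, gives 51. A minimal sketch of such a merge with a built-in cardinality check is shown below. The helper name `merge_on_subject` and the choice of a left join are assumptions, since `merge_notes_with_cleaned` is not shown here.

```python
import pandas as pd


def merge_on_subject(structured, wide, combined):
    """Left-join note columns onto the structured table, one row per subject_id.

    Hypothetical stand-in for merge_notes_with_cleaned. validate="one_to_one"
    makes pandas raise MergeError if subject_id is duplicated on either side.
    """
    merged = structured.merge(
        wide, on="subject_id", how="left", validate="one_to_one"
    )
    merged = merged.merge(
        combined, on="subject_id", how="left", validate="one_to_one"
    )
    if len(merged) != len(structured):
        raise ValueError("row count changed during merge")
    return merged


# Toy data, not real patient records
toy_structured = pd.DataFrame({"subject_id": [1, 2], "sofa_score": [4, 7]})
toy_wide_cols = pd.DataFrame({
    "subject_id": [1, 2],
    "Discharge_summary_notes": ["dc a", "dc b"],
    "Radiology_notes": ["rad a", ""],
})
toy_combined_col = pd.DataFrame({
    "subject_id": [1, 2],
    "combined_notes": ["rad a dc a", "dc b"],
})
toy_merged = merge_on_subject(toy_structured, toy_wide_cols, toy_combined_col)
print(toy_merged.shape)
```

Checking cardinality at merge time catches duplicated subjects before they inflate the row count downstream.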

## 6. Write Corpus Files for Word2Vec Embedding Extraction

In [10]:
# Ensure interim directory exists
os.makedirs("data/interim/w2v_interim", exist_ok=True)

# Write corpora for Word2Vec training
write_radiology_notes_for_w2v(
    nlp_ready_df,
    out_path="data/interim/w2v_interim/w2v_Radiology_notes.txt"
)

write_discharge_notes_for_w2v(
    nlp_ready_df,
    out_path="data/interim/w2v_interim/w2v_Discharge_notes.txt"
)

write_combined_notes_for_w2v(
    nlp_ready_df,
    out_path="data/interim/w2v_interim/w2v_combined_notes.txt"
)

print("✅ Word2Vec corpora saved in data/interim/w2v_interim/")

✅ Word2Vec corpora saved in data/interim/w2v_interim/
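
A common plain-text layout for Word2Vec training is one document per line with whitespace-separated tokens. Whether the `write_*_notes_for_w2v` helpers use exactly this layout is not shown in this notebook, so the sketch below is an assumption. The function name `write_corpus` and the lower-casing step are illustrative only.

```python
import os
import tempfile


def write_corpus(texts, out_path):
    """Write one whitespace-normalised, lower-cased note per line.

    Hypothetical helper: skips missing or blank notes and returns the
    number of lines written.
    """
    os.makedirs(os.path.dirname(out_path) or ".", exist_ok=True)
    n_written = 0
    with open(out_path, "w", encoding="utf-8") as fh:
        for text in texts:
            if not isinstance(text, str):
                continue  # skip None / NaN entries
            # Collapse newlines and repeated spaces so each note stays on one line
            line = " ".join(text.lower().split())
            if line:
                fh.write(line + "\n")
                n_written += 1
    return n_written


# Toy example written to a temporary directory
toy_texts = ["INDICATION:  Chest pain\nrule out PNA", None, "   "]
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "w2v", "toy_notes.txt")
    n_lines = write_corpus(toy_texts, toy_path)
    with open(toy_path, encoding="utf-8") as fh:
        corpus_lines = fh.read().splitlines()
print(n_lines, corpus_lines)
```

With a DataFrame, the same helper could be called as `write_corpus(nlp_ready_df["Radiology_notes"], out_path)`, since it only iterates over the values.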


## 7. Next Steps
- Continue in `03_feature_engineering.ipynb` to generate Word2Vec / BERT embeddings
  and apply the scaling functions from `src/features.py`.
