

# **🔍 Predicting ICD-9 Codes from Clinical Notes with Deep Neural Networks**

## **Dataset:** MIMIC-III
The **MIMIC-III** dataset provides an extensive collection of de-identified clinical data, ideal for building predictive models. In this project, we aim to **predict ICD-9 codes** from **clinical notes** using **deep neural networks**.

### **Project Objective**
The primary objective of this project is to use **deep learning** to predict the **ICD-9 codes** assigned to a hospital admission from its clinical notes. ICD-9 codes classify diagnoses and procedures, and a single admission usually carries several of them, so the task is multi-label text classification on the MIMIC-III dataset.

### **Approach**
1. **Data Preprocessing**: Clean and tokenize the clinical notes.
2. **Model Architecture**: Employ a **deep neural network** to capture patterns in textual data.
3. **Training and Evaluation**: Train the model on MIMIC-III and assess performance metrics.

---

> **Note**: This project is focused on exploring **natural language processing (NLP)** techniques and **deep learning frameworks** like **TensorFlow** and **PyTorch**.

### **Let's dive in! 🚀**

---



## **Data Preprocessing**

In [16]:
import numpy as np
import pandas as pd
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 
import string
import re
import itertools
import pickle
import warnings
warnings.filterwarnings('ignore')

In [17]:
# Install required libraries
# pickle ships with the Python standard library, so the pickle5 backport
# (which fails to build on Python 3.11) is not needed.
!pip install numpy pandas nltk

In [18]:
# Download NLTK stopwords and punkt tokenizer
import nltk
nltk.download('stopwords')
nltk.download('punkt')


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\NETRA\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\NETRA\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## **Load Datasets:**

In [19]:
NOTES = pd.read_csv(r'D:\Final_year_project\final project dataset\final project\NOTEEVENTS-2.csv\NOTEEVENTS-2.csv')

In [20]:
D_ICD_DIAG = pd.read_csv(r'D:\Final_year_project\final project dataset\final project\DIAGNOSES_ICD.xlsx - DIAGNOSES_ICD.xlsx.csv')
D_ICD_PROC = pd.read_csv(r'D:\Final_year_project\final project dataset\final project\PROCEDURES_ICD.xlsx - PROCEDURES_ICD.xlsx.csv')

In [21]:
diagnoses_icd = pd.read_csv(r'D:\Final_year_project\final project dataset\final project\DIAGNOSES_ICD.xlsx - DIAGNOSES_ICD.xlsx.csv')
procedures_icd = pd.read_csv(r'D:\Final_year_project\final project dataset\final project\PROCEDURES_ICD.xlsx - PROCEDURES_ICD.xlsx.csv')

## **Feature Engineering:**

In [22]:
NOTES.columns=NOTES.columns.str.upper()
D_ICD_DIAG.columns=D_ICD_DIAG.columns.str.upper()
D_ICD_PROC.columns=D_ICD_PROC.columns.str.upper()
diagnoses_icd.columns=diagnoses_icd.columns.str.upper()
procedures_icd.columns=procedures_icd.columns.str.upper()

In [23]:
diagnoses_icd

| | ROW_ID | SUBJECT_ID | HADM_ID | SEQ_NUM | ICD9_CODE |
|---|---|---|---|---|---|
| 0 | 1297 | 109 | 172335 | 1.0 | 40301 |
| 1 | 1298 | 109 | 172335 | 2.0 | 486 |
| 2 | 1299 | 109 | 172335 | 3.0 | 58281 |
| 3 | 1300 | 109 | 172335 | 4.0 | 5855 |
| 4 | 1301 | 109 | 172335 | 5.0 | 4254 |
| ... | ... | ... | ... | ... | ... |
| 651042 | 639798 | 97503 | 188195 | 2.0 | 20280 |
| 651043 | 639799 | 97503 | 188195 | 3.0 | V5869 |
| 651044 | 639800 | 97503 | 188195 | 4.0 | V1279 |
| 651045 | 639801 | 97503 | 188195 | 5.0 | 5275 |


In [24]:
KEEP = NOTES[['HADM_ID','CATEGORY','TEXT']]

In [25]:
len(diagnoses_icd['ICD9_CODE'].unique())

6947

In [26]:
KEEP

| | HADM_ID | CATEGORY | TEXT |
|---|---|---|---|
| 0 | 188442.0 | Discharge summary | Admission Date: [**2183-9-25**] Dischar... |
| 1 | 193793.0 | Discharge summary | Admission Date: [**2184-1-16**] Dischar... |
| 2 | 118446.0 | Discharge summary | Admission Date: [**2103-4-11**] ... |
| 3 | 157985.0 | Discharge summary | Admission Date: [**2103-10-7**] Dischar... |
| 4 | 189488.0 | Discharge summary | Admission Date: [**2131-4-2**] D... |
| ... | ... | ... | ... |
| 2083175 | 186787.0 | Discharge summary | Admission Date: [**2198-5-31**] ... |
| 2083176 | 156868.0 | Discharge summary | Admission Date: [**2168-12-29**] Discharg... |
| 2083177 | 156868.0 | Discharge summary | Admission Date: [**2168-12-29**] Discha... |
| 2083178 | 156868.0 | Discharge summary | Admission Date: [**2168-12-29**] Discha... |


In [31]:
KEEP = KEEP.groupby(['HADM_ID']).agg({'TEXT': ' '.join, 'CATEGORY': ' '.join})

In [28]:
KEEP

| HADM_ID | TEXT | CATEGORY |
|---|---|---|
| 100001.0 | Admission Date: [**2117-9-11**] ... | Discharge summary Radiology |
| 100003.0 | Admission Date: [**2150-4-17**] ... | Discharge summary Echo ECG Nursing Nursing Phy... |
| 100006.0 | Admission Date: [**2108-4-6**] Discharg... | Discharge summary Discharge summary Echo ECG R... |
| 100007.0 | Admission Date: [**2145-3-31**] ... | Discharge summary ECG Nursing/other Nursing/ot... |
| 100009.0 | Admission Date: [**2162-5-16**] ... | Discharge summary Echo ECG Radiology Radiology... |
| ... | ... | ... |
| 199993.0 | Admission Date: [**2161-10-23**] Discha... | Discharge summary ECG ECG ECG ECG ECG ECG Radi... |
| 199994.0 | Admission Date: [**2188-7-7**] Discharg... | Discharge summary ECG Radiology Radiology Radi... |
| 199995.0 | Admission Date: [**2137-12-11**] Discha... | Discharge summary Echo ECG ECG ECG ECG ECG Rad... |
| 199998.0 | Admission Date: [**2119-2-18**] ... | Discharge summary Echo ECG ECG Radiology Radio... |


In [32]:
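# HADM_ID became the index after the groupby above, so KEEP['HADM_ID'] raises
# a KeyError; count the unique admissions on the index instead.
KEEP.index.nunique()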

In [None]:
KEEP.to_csv('KEEP.CSV')

In [None]:
diagnoses_dict = {}
for i in range(len(diagnoses_icd)):
    entry = diagnoses_icd.iloc[i]
    hadm = entry['HADM_ID']
    icd = entry['ICD9_CODE']
    if hadm not in diagnoses_dict:
        diagnoses_dict[hadm] = [icd]
    else:
        diagnoses_dict[hadm].append(icd)
        
procedures_dict = {}
for i in range(len(procedures_icd)):
    entry = procedures_icd.iloc[i]
    hadm = entry['HADM_ID']
    icd = entry['ICD9_CODE']
    if hadm not in procedures_dict:
        procedures_dict[hadm] = [icd]
    else:
        procedures_dict[hadm].append(icd)
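
The loops above perform one `iloc` lookup per record, which can be slow on a table with several hundred thousand rows. As a minimal sketch, assuming the same `HADM_ID` and `ICD9_CODE` column names, a `groupby` can build an equivalent mapping in one pass. The helper `codes_by_admission` and the frame `toy_diag` are hypothetical names used only for illustration.

```python
import pandas as pd

def codes_by_admission(df):
    """Map each HADM_ID to the list of its ICD9_CODE values, in row order."""
    # sort=False keeps admissions in order of first appearance, like the loop.
    return df.groupby('HADM_ID', sort=False)['ICD9_CODE'].apply(list).to_dict()

# Toy stand-in for diagnoses_icd (a few values taken from the preview above).
toy_diag = pd.DataFrame({
    'HADM_ID': [172335, 172335, 188195, 172335],
    'ICD9_CODE': ['40301', '486', 'V5869', '5855'],
})

toy_dict = codes_by_admission(toy_diag)
print(toy_dict)
```

Note that `groupby` skips rows whose `HADM_ID` is missing, so the result may differ from the loop if the key column contains NaN values.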

In [None]:
diagnoses_df = pd.DataFrame.from_dict(diagnoses_dict,orient='index')
procedures_df = pd.DataFrame.from_dict(procedures_dict,orient='index')

In [None]:
diagnoses_df.columns = ['DIAG_CODE'+str(i) for i in range(1,len(diagnoses_df.columns)+1)]
diagnoses_df.index.name = 'HADM_ID'
procedures_df.columns = ['PRCD_CODE'+str(i) for i in range(1,len(procedures_df.columns)+1)]
procedures_df.index.name = 'HADM_ID'
codes_df = pd.merge(diagnoses_df, procedures_df, how='outer', on='HADM_ID')

In [None]:
diagnoses_df

In [None]:
diagnoses_df['DIAG_CODES'] = diagnoses_df[diagnoses_df.columns[:]].apply(
    lambda x: ','.join(x.dropna().astype(str)),
    axis=1
)

procedures_df['PROC_CODES'] = procedures_df[procedures_df.columns[:]].apply(
    lambda x: ','.join(x.dropna().astype(str)),
    axis=1
)

In [None]:
diagnoses = diagnoses_df[['DIAG_CODES']]
procedures = procedures_df[['PROC_CODES']]
codes = pd.merge(diagnoses, procedures, how='outer', on='HADM_ID')
codes = codes.dropna()

In [None]:
codes.to_csv('CODES.csv')

In [None]:
# KEEP.set_index('HADM_ID')
merged_df = pd.merge(KEEP, codes, how='left', on='HADM_ID')
merged_df = merged_df.dropna()

In [None]:
merged_df

In [None]:
SAMPLE = merged_df.sample(n=20000)

In [None]:
SAMPLE

In [None]:
SAMPLE.to_csv('SAMPLE_20K.csv')

In [None]:
SAMPLE.columns

In [None]:
SAMPLE

### **Codes to Dict**

In [None]:
sample_ids = SAMPLE.index

In [None]:
flt_diag = diagnoses_icd[diagnoses_icd['HADM_ID'].isin(sample_ids)]
flt_proc = procedures_icd[procedures_icd['HADM_ID'].isin(sample_ids)]

In [None]:
diag_keep = flt_diag['ICD9_CODE'].value_counts()[:300]
proc_keep = flt_proc['ICD9_CODE'].value_counts()[:100]

In [None]:
diag2idx, idx2diag = {},{}
for d in diag_keep.index:
    if d not in diag2idx:
        idx2diag[len(idx2diag)] = d
        diag2idx[d] = len(diag2idx)
        
proc2idx, idx2proc = {},{}
for p in proc_keep.index:
    if p not in proc2idx:
        idx2proc[len(idx2proc)] = p
        proc2idx[p] = len(proc2idx)
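
The two cells above keep the most frequent codes and assign each one an integer label. The same idea can be sketched without pandas, assuming the codes are available as a flat list; `build_label_index` and `toy_codes` are hypothetical names for illustration only.

```python
from collections import Counter

def build_label_index(codes, top_k):
    """Keep the top_k most frequent codes and number them from 0."""
    counts = Counter(codes)
    kept = [code for code, _ in counts.most_common(top_k)]
    code2idx = {code: i for i, code in enumerate(kept)}
    idx2code = {i: code for code, i in code2idx.items()}
    return code2idx, idx2code

# Toy list of diagnosis codes: '4019' appears 3 times, '486' twice, '5855' once.
toy_codes = ['486', '4019', '4019', '5855', '4019', '486']
toy_code2idx, toy_idx2code = build_label_index(toy_codes, top_k=2)
print(toy_code2idx)
```

Codes outside the top `top_k` get no index, which matches how the conversion functions later skip codes that are missing from the mapping.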

In [None]:
with open('diag2idx.pickle','wb') as f:
    pickle.dump(diag2idx,f,pickle.HIGHEST_PROTOCOL)
with open('idx2diag.pickle','wb') as f:
    pickle.dump(idx2diag,f,pickle.HIGHEST_PROTOCOL)
with open('proc2idx.pickle','wb') as f:
    pickle.dump(proc2idx,f,pickle.HIGHEST_PROTOCOL)
with open('idx2proc.pickle','wb') as f:
    pickle.dump(idx2proc,f,pickle.HIGHEST_PROTOCOL)

### **Convert Code Lists**

In [None]:
def diag_code2idx(org_lst):
    coded_lst = []
    for c in org_lst.split(','):
        if c in diag2idx:
            coded_lst.append(diag2idx[c])
    return coded_lst

In [None]:
def proc_code2idx(org_lst):
    coded_lst = []
    for c in org_lst.split(','):
        c_ = int(str(c).split('.')[0])
        if c_ in proc2idx:
            coded_lst.append(proc2idx[c_])
            
    return coded_lst

In [None]:
SAMPLE['CODED_DIAG'] = SAMPLE['DIAG_CODES'].apply(diag_code2idx)
SAMPLE['CODED_PROC'] = SAMPLE['PROC_CODES'].apply(proc_code2idx)
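
`CODED_DIAG` and `CODED_PROC` hold index lists of varying length. A network trained for multi-label prediction usually expects fixed-length targets, so one common option is a multi-hot vector per admission. The sketch below assumes NumPy and a known label count (for example the 300 diagnosis codes kept above); `multi_hot` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def multi_hot(indices, num_labels):
    """Return a float vector with 1.0 at every position listed in indices."""
    vec = np.zeros(num_labels, dtype=np.float32)
    # An explicit integer dtype keeps an empty list valid as an index array.
    vec[np.asarray(indices, dtype=np.intp)] = 1.0
    return vec

toy_target = multi_hot([0, 2], num_labels=4)
empty_target = multi_hot([], num_labels=3)
print(toy_target, empty_target)
```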

In [None]:
SAMPLE

## **General Processing**

In [None]:
# Build the stopword set once instead of on every call
STOP_WORDS = set(stopwords.words("english"))

def remove_stopwords(text):
    """Tokenize text and drop English stopwords; returns a list of tokens."""
    word_tokens = word_tokenize(text)
    return [word for word in word_tokens if word not in STOP_WORDS]

def preprocess(note):
    note = note.replace('\n', ' ')
    note = note.replace('w/', 'with')  # expand the common clinical abbreviation
    note = note.lower()  # lower case
    note = re.sub(r'\d+', '', note)  # remove numbers
    note = note.translate(str.maketrans('', '', string.punctuation))  # remove punctuation
    note = " ".join(note.split())  # collapse repeated whitespace
    note = remove_stopwords(note)
    return note
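
To see what these steps do to a note, here is a dependency-free sketch of the same pipeline on a made-up sentence. It swaps NLTK's tokenizer and stopword list for a whitespace split and a tiny hypothetical stopword set (`TOY_STOP_WORDS`) so it runs without any downloads; the real `preprocess` will drop more words.

```python
import re
import string

# Hypothetical, deliberately tiny stand-in for NLTK's English stopword list.
TOY_STOP_WORDS = {'the', 'with', 'was', 'a', 'of'}

def preprocess_sketch(note):
    """Lowercase, strip digits and punctuation, then drop stopwords."""
    note = note.replace('\n', ' ').replace('w/', 'with').lower()
    note = re.sub(r'\d+', '', note)  # remove numbers
    note = note.translate(str.maketrans('', '', string.punctuation))
    # Whitespace split stands in for word_tokenize in this sketch.
    return [w for w in note.split() if w not in TOY_STOP_WORDS]

toy_note = "Pt admitted w/ chest pain.\nBP was 120/80, HR 72."
toy_tokens = preprocess_sketch(toy_note)
print(toy_tokens)  # ['pt', 'admitted', 'chest', 'pain', 'bp', 'hr']
```

Removing every digit also discards vitals and lab values, which may or may not matter for code prediction.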

In [None]:
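# NOTE: sample_1k_removed, sample_10k_removed and merged_df_removed are not
# defined earlier in this notebook; they must already exist for this cell to run.
sample_1k_removed['NOTE'] = sample_1k_removed['TEXT'].apply(preprocess)
sample_10k_removed['NOTE'] = sample_10k_removed['TEXT'].apply(preprocess)
merged_df_removed['NOTE'] = merged_df_removed['TEXT'].apply(preprocess)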

In [None]:
SAMPLE['NOTE'] = SAMPLE['TEXT'].apply(preprocess)
merged_df['NOTE'] = merged_df['TEXT'].apply(preprocess)

In [None]:
sample_1k_cleaned = sample_1k_removed[['NOTE','CODED_DIAG','CODED_PROC']]
sample_10k_cleaned = sample_10k_removed[['NOTE','CODED_DIAG','CODED_PROC']]
merged_df_cleaned = merged_df_removed[['NOTE','CODED_DIAG','CODED_PROC']]

In [None]:
sample_1k_removed['CODED_NOTE'] = sample_1k_cleaned['NOTE']
sample_10k_removed['CODED_NOTE'] = sample_10k_cleaned['NOTE']
merged_df_removed['CODED_NOTE'] = merged_df_cleaned['NOTE']

In [None]:
sample_10k_removed.to_csv('SAMPLE10K_ALL.csv')
sample_1k_removed.to_csv('SAMPLE1K_ALL.csv')
merged_df_removed.to_csv('ALL.csv')

In [None]:
sample_20k = merged_df_removed.sample(n = 20000) 
sample_20k.to_csv('SAMPLE20K_ALL.csv')

In [None]:
sample_20k

### **Corpus**

In [None]:
TOKENS = pd.read_csv('19908_all_coded_token.csv')

In [None]:
TOKENS

In [None]:
SAMPLE

In [None]:
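# Count how often each token occurs across all notes
corpus = {}
for s in SAMPLE['NOTE']:
    for w in s:
        corpus[w] = corpus.get(w, 0) + 1  # start from 0 so the first occurrence counts as 1
# Sort tokens by descending frequency
corpus = {k: v for k, v in sorted(corpus.items(), key=lambda item: item[1], reverse=True)}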

In [None]:
corpus_slice = dict(itertools.islice(corpus.items(), 10000))

### **Build Dictionary**

In [None]:

word2idx = {'<PAD>': 0, '<UNK>':1}
idx2word = {0: '<PAD>', 1:'<UNK>'}
for c in corpus_slice:
    word2idx[c] = len(word2idx)
    idx2word[len(idx2word)] = c

In [None]:
def note2idx(org_lst):
    coded_lst = []
    for w in org_lst:
        if w in word2idx:
            coded_lst.append(word2idx[w])
        else:
            coded_lst.append(1)  # out-of-vocabulary words map to <UNK>, not <PAD>
    return coded_lst

In [None]:
def note2idx_cap400(org_lst):
    # Encode at most the first 400 tokens; out-of-vocabulary words map to <UNK> (1)
    coded_lst = [word2idx.get(w, 1) for w in org_lst[:400]]
    # Right-pad with <PAD> (0) so every note has exactly 400 entries
    coded_lst += [0] * (400 - len(coded_lst))
    return coded_lst
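
The fixed-length encoding can be checked on a toy vocabulary. This is a minimal sketch of the truncate-then-pad idea with the length as a parameter; `encode_note` and `toy_vocab` are hypothetical names, and the `<PAD>` = 0, `<UNK>` = 1 convention follows `word2idx` above.

```python
PAD_IDX, UNK_IDX = 0, 1

def encode_note(tokens, vocab, max_len):
    """Map tokens to ids, truncating to max_len and right-padding with PAD_IDX."""
    ids = [vocab.get(t, UNK_IDX) for t in tokens[:max_len]]
    return ids + [PAD_IDX] * (max_len - len(ids))

toy_vocab = {'<PAD>': 0, '<UNK>': 1, 'chest': 2, 'pain': 3}

# Shorter than max_len: 'fever' is out of vocabulary, then two pads follow.
short = encode_note(['chest', 'pain', 'fever'], toy_vocab, max_len=5)
# Longer than max_len: only the first five tokens are kept.
long_ = encode_note(['pain'] * 8, toy_vocab, max_len=5)
print(short, long_)  # [2, 3, 1, 0, 0] [3, 3, 3, 3, 3]
```

Every output has exactly `max_len` entries, which is what a batched model input needs.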

In [None]:
sample_20k['CODED_NOTE'] = sample_20k['NOTE'].apply(note2idx_cap400)

In [None]:
sample_20k.to_csv('SAMPLE_2OK.csv')

In [None]:
sample_1k_cleaned['NOTE'] = sample_1k_cleaned['NOTE'].apply(note2idx)
sample_10k_cleaned['NOTE'] = sample_10k_cleaned['NOTE'].apply(note2idx)
merged_df_cleaned['NOTE'] = merged_df_cleaned['NOTE'].apply(note2idx)

In [None]:
sample_10k_cleaned.to_csv('SAMPLE10K.csv')
sample_1k_cleaned.to_csv('SAMPLE1K.csv')
merged_df_cleaned.to_csv('CLEANED.csv')

In [None]:
# Save every lookup table to '<name>.pickle'
dump_dict = {
    'diag2idx': diag2idx, 'idx2diag': idx2diag,
    'proc2idx': proc2idx, 'idx2proc': idx2proc,
    'word2idx': word2idx, 'idx2word': idx2word,
}
for name, obj in dump_dict.items():
    with open(name + '.pickle', 'wb') as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)

In [None]:
with open('corpus.pickle','wb') as f:
    pickle.dump(corpus,f,pickle.HIGHEST_PROTOCOL)

#################################################################

# **Codes**

In [None]:
diagnoses_icd

In [None]:
diagnoses_icd[diagnoses_icd['HADM_ID']==172335]

In [None]:
diagnoses_dict = {}
for i in range(len(diagnoses_icd)):
    entry = diagnoses_icd.iloc[i]
    hadm = entry['HADM_ID']
    icd = entry['ICD9_CODE']
    if hadm not in diagnoses_dict:
        diagnoses_dict[hadm] = [icd]
    else:
        diagnoses_dict[hadm].append(icd)
        
procedures_dict = {}
for i in range(len(procedures_icd)):
    entry = procedures_icd.iloc[i]
    hadm = entry['HADM_ID']
    icd = entry['ICD9_CODE']
    if hadm not in procedures_dict:
        procedures_dict[hadm] = [icd]
    else:
        procedures_dict[hadm].append(icd)

In [None]:
diagnoses_df = pd.DataFrame.from_dict(diagnoses_dict,orient='index')
procedures_df = pd.DataFrame.from_dict(procedures_dict,orient='index')

In [None]:
diagnoses_df.columns = ['DIAG_CODE'+str(i) for i in range(1,len(diagnoses_df.columns)+1)]
diagnoses_df.index.name = 'HADM_ID'
procedures_df.columns = ['PRCD_CODE'+str(i) for i in range(1,len(procedures_df.columns)+1)]
procedures_df.index.name = 'HADM_ID'
codes_df = pd.merge(diagnoses_df, procedures_df, how='outer', on='HADM_ID')

In [None]:
diagnoses_df['DIAG_CODES'] = diagnoses_df[diagnoses_df.columns[:]].apply(
    lambda x: ','.join(x.dropna().astype(str)),
    axis=1
)

procedures_df['PROC_CODES'] = procedures_df[procedures_df.columns[:]].apply(
    lambda x: ','.join(x.dropna().astype(str)),
    axis=1
)

In [None]:
diagnoses = diagnoses_df[['DIAG_CODES']]
procedures = procedures_df[['PROC_CODES']]
codes = pd.merge(diagnoses, procedures, how='outer', on='HADM_ID')

In [None]:
diagnoses_df.to_csv('DIAGNOSES_DF.csv')
procedures_df.to_csv('PROCEDURES_DF.csv')
codes_df.to_csv('CODES_DF.csv')


In [None]:
codes.to_csv('CODES.csv')

# **Notes**

In [None]:
notes = pd.read_csv(r'D:\ICD9CodePredectionUsingMIMICDatasets\data\NOTEEVENTS.csv')
notes_df = notes[['HADM_ID','TEXT']]
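# HADM_ID stays a regular column here; merge matches it against the HADM_ID index of codes
merged_df = pd.merge(notes_df, codes, how='left', on='HADM_ID')
# Drop rows with any missing value (no admission ID, or no diagnosis/procedure codes)
merged_df = merged_df.dropna()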
merged_df.to_csv('FULL_DATA.csv')

In [None]:
### SLICE NOTE
sample_1k = merged_df.sample(n = 1000) 
sample_1k.to_csv('SAMPLE1K.csv')
sample_10k = merged_df.sample(n = 10000) 
sample_10k.to_csv('SAMPLE10K.csv')

In [None]:
notes.columns

In [None]:
merged_df.columns

In [None]:
notes['CATEGORY'].unique()

In [None]:
a = 'test a b c'
b = ('aaa', 'bbb', 'test')


In [None]:
a.startswith(b)

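### **Remove De-identification Placeholders**

In [None]:
# Assumption: full_data is the merged notes/codes table saved above as FULL_DATA.csv
full_data = pd.read_csv('FULL_DATA.csv')
# Strip bracketed spans such as the [**...**] de-identification placeholders
full_data['TEXT'] = full_data['TEXT'].replace(to_replace=r"\[.*?\]", value="", regex=True)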
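
The `\[.*?\]` pattern above removes any bracketed span, not only de-identification placeholders. In the note previews shown earlier the placeholders have the form `[**...**]`, so a narrower pattern may be preferable. The sketch below uses only the standard `re` module on a made-up string; `DEID_PATTERN` and `strip_deid` are hypothetical names.

```python
import re

# Matches only spans that open with '[**' and close with '**]'.
DEID_PATTERN = re.compile(r'\[\*\*.*?\*\*\]')

def strip_deid(text):
    """Remove [**...**] placeholders and collapse the leftover whitespace."""
    text = DEID_PATTERN.sub('', text)
    return ' '.join(text.split())

toy_text = "Admission Date: [**2183-9-25**] seen by Dr. [**Last Name**] [see note]"
cleaned = strip_deid(toy_text)
print(cleaned)  # Admission Date: seen by Dr. [see note]
```

With pandas, the same pattern string could be passed to the `replace(..., regex=True)` call used above.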