  <div>
    <h1 align="center">Exercise 02 - Medical Information Retrieval 2023</h1>
  </div>
  <br />

## NLP Pipeline <a class="anchor" id="first"></a>

In the following weeks, we will explore an NLP pipeline for preprocessing a dataset of clinical patient notes with the goal of classifying them according to 40 different medical specialities, such as:

* Neurology
* Cardiovascular/Pulmonary
* Psychiatry/Psychology
* Obstetrics / Gynecology
* Nephrology
* Radiology
* and so on...

This course covers techniques and tools commonly used in NLP, including text cleaning, tokenization, stemming and lemmatization, stop-word analysis, and part-of-speech tagging, all of which prepare the data for the classification task.

The dataset provided to you contains a collection of medical transcriptions that include information on the patients' conditions. Your task is to extract useful information from these notes in order to classify each note by the medical speciality it belongs to.

By the end of this course, you will have gained a solid understanding of the NLP pipeline for clinical patient note classification and hands-on experience with various NLP techniques and tools. You will also be able to apply this knowledge to real-world NLP problems beyond the medical domain.

The dataset was scraped from https://mtsamples.com. We downloaded it from this Kaggle dataset page: https://www.kaggle.com/datasets/tboyle10/medicaltranscriptions

### Data Exploration

Go through the data and inspect the notes and their medical specialities:

In [2]:
import pandas as pd
import re

Let's inspect the data:

In [11]:
df = pd.read_csv("../DATA/mtsamples_clean.csv")
df.head()


Unnamed: 0,medical_specialty,transcription
0,Allergy / Immunology,"SUBJECTIVE:, This 23-year-old white female pr..."
1,Bariatrics,"PAST MEDICAL HISTORY:, He has difficulty climb..."
2,Bariatrics,"HISTORY OF PRESENT ILLNESS: , I have seen ABC ..."
3,Cardiovascular / Pulmonary,"2-D M-MODE: , ,1. Left atrial enlargement wit..."
4,Cardiovascular / Pulmonary,1. The left ventricular cavity size and wall ...


Feel free to look further for yourself!

In [4]:
### your code ###
df['medical_specialty'].value_counts(normalize=True)

 Surgery                          0.220644
 Consult - History and Phy.       0.103221
 Cardiovascular / Pulmonary       0.074415
 Orthopedic                       0.071014
 Radiology                        0.054611
 General Medicine                 0.051810
 Gastroenterology                 0.046009
 Neurology                        0.044609
 SOAP / Chart / Progress Notes    0.033207
 Obstetrics / Gynecology          0.032006
 Urology                          0.031606
 Discharge Summary                0.021604
 ENT - Otolaryngology             0.019604
 Neurosurgery                     0.018804
 Hematology - Oncology            0.018004
 Ophthalmology                    0.016603
 Nephrology                       0.016203
 Emergency Room Reports           0.015003
 Pediatrics - Neonatal            0.014003
 Pain Management                  0.012402
 Psychiatry / Psychology          0.010602
 Office Notes                     0.010202
 Podiatry                         0.009402
 Dermatolog
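
The shares printed above are relative class frequencies. Surgery alone accounts for roughly 22% of the notes, so a classifier that always predicts the largest class would already reach about that accuracy. The following is a minimal sketch of the same computation in plain Python. The `labels` list is made-up toy data standing in for `df['medical_specialty']`, and the variable names are illustrative assumptions.

```python
from collections import Counter

# Hypothetical toy labels standing in for df['medical_specialty'].
labels = ["Surgery", "Surgery", "Radiology", "Neurology", "Surgery"]

counts = Counter(labels)
total = sum(counts.values())

# Relative frequency per class, most frequent first
# (what value_counts(normalize=True) reports).
shares = {label: n / total for label, n in counts.most_common()}

# Accuracy of a trivial classifier that always predicts the most frequent class.
majority_label, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / total

print(shares)
print(majority_label, baseline_accuracy)
```

Any model trained later in the pipeline should be compared against this majority-class baseline rather than against zero.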

When going through the dataset, it becomes clear that complex regular expressions alone are no longer sufficient. To prepare the dataset for our classification problem, this week we deal with data cleaning and tokenization.

### Data Cleaning

It makes sense to remove punctuation, digits, and stop words, and to convert all text to lowercase. Hint: Start with regex.

In [12]:
### your code ###

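# Replace every non-word character and every digit with a space.
# Note: \W already covers whitespace; lowercasing and stop-word removal are not done in this step.
df['cleaned'] = df['transcription'].apply(lambda row: re.sub(r"[\W\d]", ' ', str(row)))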

In [6]:
df

Unnamed: 0,medical_specialty,transcription,cleaned
0,Allergy / Immunology,"SUBJECTIVE:, This 23-year-old white female pr...",SUBJECTIVE This year old white female pr...
1,Bariatrics,"PAST MEDICAL HISTORY:, He has difficulty climb...",PAST MEDICAL HISTORY He has difficulty climb...
2,Bariatrics,"HISTORY OF PRESENT ILLNESS: , I have seen ABC ...",HISTORY OF PRESENT ILLNESS I have seen ABC ...
3,Cardiovascular / Pulmonary,"2-D M-MODE: , ,1. Left atrial enlargement wit...",D M MODE Left atrial enlargement wit...
4,Cardiovascular / Pulmonary,1. The left ventricular cavity size and wall ...,The left ventricular cavity size and wall ...
...,...,...,...
4994,Allergy / Immunology,"HISTORY:, I had the pleasure of meeting and e...",HISTORY I had the pleasure of meeting and e...
4995,Allergy / Immunology,"ADMITTING DIAGNOSIS: , Kawasaki disease.,DISCH...",ADMITTING DIAGNOSIS Kawasaki disease DISCH...
4996,Allergy / Immunology,"SUBJECTIVE: , This is a 42-year-old white fema...",SUBJECTIVE This is a year old white fema...
4997,Allergy / Immunology,"CHIEF COMPLAINT: , This 5-year-old male presen...",CHIEF COMPLAINT This year old male presen...
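
The `cleaned` column above has punctuation and digits replaced by spaces, but the text is still mixed-case, contains runs of spaces, and keeps its stop words. One possible way to cover the remaining steps is sketched below. The function name `clean_note`, the tiny `STOP_WORDS` set, and the example sentence are all assumptions made for illustration; a real run would use a fuller stop-word list.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "of", "is", "this", "has", "he", "with"}

def clean_note(text):
    """Lowercase, drop punctuation and digits, collapse whitespace, remove stop words."""
    text = str(text).lower()
    # Non-word characters, digits, and underscores become spaces.
    text = re.sub(r"[\W\d_]", " ", text)
    # split() without arguments also collapses runs of spaces.
    words = text.split()
    return " ".join(w for w in words if w not in STOP_WORDS)

# Made-up example sentence in the style of the transcriptions.
example = "HISTORY:, The patient is a 45-year-old male with chest pain."
print(clean_note(example))
```

On the DataFrame, this could be applied with `df['transcription'].apply(clean_note)`. Collapsing whitespace at this stage also avoids empty or whitespace-only tokens later on.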


### Tokenization

Tokenize the preprocessed text and create a bag-of-words representation of the dataset. Hint: Use a tokenizer from NLTK.

In [13]:
### your code ###
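import nltk

# word_tokenize relies on NLTK's Punkt data; if it is missing, download it once
# with nltk.download('punkt') (newer NLTK releases may ask for 'punkt_tab' instead).
df['tokenized'] = df['cleaned'].apply(lambda x: nltk.word_tokenize(str(x)))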

In [14]:
df

Unnamed: 0,medical_specialty,transcription,cleaned,tokenized
0,Allergy / Immunology,"SUBJECTIVE:, This 23-year-old white female pr...",SUBJECTIVE This year old white female pr...,"[SUBJECTIVE, This, year, old, white, female, p..."
1,Bariatrics,"PAST MEDICAL HISTORY:, He has difficulty climb...",PAST MEDICAL HISTORY He has difficulty climb...,"[PAST, MEDICAL, HISTORY, He, has, difficulty, ..."
2,Bariatrics,"HISTORY OF PRESENT ILLNESS: , I have seen ABC ...",HISTORY OF PRESENT ILLNESS I have seen ABC ...,"[HISTORY, OF, PRESENT, ILLNESS, I, have, seen,..."
3,Cardiovascular / Pulmonary,"2-D M-MODE: , ,1. Left atrial enlargement wit...",D M MODE Left atrial enlargement wit...,"[D, M, MODE, Left, atrial, enlargement, with, ..."
4,Cardiovascular / Pulmonary,1. The left ventricular cavity size and wall ...,The left ventricular cavity size and wall ...,"[The, left, ventricular, cavity, size, and, wa..."
...,...,...,...,...
4994,Allergy / Immunology,"HISTORY:, I had the pleasure of meeting and e...",HISTORY I had the pleasure of meeting and e...,"[HISTORY, I, had, the, pleasure, of, meeting, ..."
4995,Allergy / Immunology,"ADMITTING DIAGNOSIS: , Kawasaki disease.,DISCH...",ADMITTING DIAGNOSIS Kawasaki disease DISCH...,"[ADMITTING, DIAGNOSIS, Kawasaki, disease, DISC..."
4996,Allergy / Immunology,"SUBJECTIVE: , This is a 42-year-old white fema...",SUBJECTIVE This is a year old white fema...,"[SUBJECTIVE, This, is, a, year, old, white, fe..."
4997,Allergy / Immunology,"CHIEF COMPLAINT: , This 5-year-old male presen...",CHIEF COMPLAINT This year old male presen...,"[CHIEF, COMPLAINT, This, year, old, male, pres..."


In [18]:
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English
nlp = English()
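# Create a blank Tokenizer with just the English vocab (no prefix, suffix, or infix rules)
tokenizer = Tokenizer(nlp.vocab)

# This overwrites the NLTK tokens from above. Each entry is a spaCy Doc, and runs of
# spaces left by the cleaning step show up as separate whitespace tokens.
df['tokenized'] = df['cleaned'].apply(lambda x: tokenizer(str(x)))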


In [19]:
df

Unnamed: 0,medical_specialty,transcription,cleaned,tokenized
0,Allergy / Immunology,"SUBJECTIVE:, This 23-year-old white female pr...",SUBJECTIVE This year old white female pr...,"(SUBJECTIVE, , This, , year, old, white,..."
1,Bariatrics,"PAST MEDICAL HISTORY:, He has difficulty climb...",PAST MEDICAL HISTORY He has difficulty climb...,"(PAST, MEDICAL, HISTORY, , He, has, difficul..."
2,Bariatrics,"HISTORY OF PRESENT ILLNESS: , I have seen ABC ...",HISTORY OF PRESENT ILLNESS I have seen ABC ...,"(HISTORY, OF, PRESENT, ILLNESS, , I, have, ..."
3,Cardiovascular / Pulmonary,"2-D M-MODE: , ,1. Left atrial enlargement wit...",D M MODE Left atrial enlargement wit...,"( , D, M, MODE, , Left, atrial, enlar..."
4,Cardiovascular / Pulmonary,1. The left ventricular cavity size and wall ...,The left ventricular cavity size and wall ...,"( , The, left, ventricular, cavity, size, a..."
...,...,...,...,...
4994,Allergy / Immunology,"HISTORY:, I had the pleasure of meeting and e...",HISTORY I had the pleasure of meeting and e...,"(HISTORY, , I, had, the, pleasure, of, meet..."
4995,Allergy / Immunology,"ADMITTING DIAGNOSIS: , Kawasaki disease.,DISCH...",ADMITTING DIAGNOSIS Kawasaki disease DISCH...,"(ADMITTING, DIAGNOSIS, , Kawasaki, disease,..."
4996,Allergy / Immunology,"SUBJECTIVE: , This is a 42-year-old white fema...",SUBJECTIVE This is a year old white fema...,"(SUBJECTIVE, , This, is, a, , year, old,..."
4997,Allergy / Immunology,"CHIEF COMPLAINT: , This 5-year-old male presen...",CHIEF COMPLAINT This year old male presen...,"(CHIEF, COMPLAINT, , This, , year, old, m..."
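
The task also asks for a bag-of-words representation, which the cells above do not build yet. A minimal sketch in plain Python follows: each note becomes a vector of word counts over a shared vocabulary. The `tokenized_notes` list is made-up toy data standing in for the rows of `df['tokenized']`, and `bag_of_words` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical token lists standing in for the rows of df['tokenized'].
tokenized_notes = [
    ["chest", "pain", "chest", "xray"],
    ["knee", "pain", "surgery"],
]

# Vocabulary: every distinct token, sorted so the column order is reproducible.
vocabulary = sorted({tok for note in tokenized_notes for tok in note})

def bag_of_words(tokens, vocabulary):
    """Return the count of each vocabulary word in one tokenized note."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

# One row per note, one column per vocabulary word.
matrix = [bag_of_words(note, vocabulary) for note in tokenized_notes]

print(vocabulary)
for row in matrix:
    print(row)
```

With the spaCy output, the tokens would first need converting to strings (for example via each token's `text` attribute) and whitespace tokens would need to be dropped, since the counts are keyed on plain strings. For the full dataset a sparse representation is likely preferable to nested lists.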
