# Medical Text

Dataset: [Medical Text](https://www.kaggle.com/datasets/chaitanyakck/medical-text/data) (Kaggle)

In [38]:
import pandas as pd
import nltk
import spacy
# from spacy import displacy

In [39]:
df = pd.read_csv('data/train.dat', sep="\t", header=None)

In [40]:
df.rename(columns={0:'condition', 1:'abstract'}, inplace=True)
df.head()

Unnamed: 0,condition,abstract
0,4,Catheterization laboratory events and hospital...
1,5,Renal abscess in children. Three cases of rena...
2,2,Hyperplastic polyps seen at sigmoidoscopy are ...
3,5,Subclavian artery to innominate vein fistula a...
4,4,Effect of local inhibition of gamma-aminobutyr...


In [41]:
lexical_df = df.copy() # a df to compute lexical analysis on

## Lexical Analysis
Lexical analysis consists of the following operations:
- Sentence Splitting
- Word Tokenization
- Lemmatization
- Stemming
- POS Tagging

It focuses on the main components of a text (words), and aims to recognize them in relation to the context in which they are used, such as sentences or clauses.

### Sentence Splitting
Sentence splitting identifies where each informative textual fragment (a sentence or clause) begins and ends, however short the fragment is.

To achieve this, it uses orthographic features of words (e.g., uppercase initial letters) and delimiters (e.g., punctuation).
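
To make the role of these features concrete, here is a minimal sketch of a rule-based splitter that relies only on punctuation and uppercase initials. It is not the algorithm behind `nltk.sent_tokenize` (which uses the pre-trained Punkt model); the function name `naive_sentence_split` and the sample text are made up for illustration.

```python
import re

# A boundary is ., ! or ? followed by whitespace and an uppercase letter.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def naive_sentence_split(text):
    """Split text on punctuation + whitespace + uppercase initial (illustrative only)."""
    return [piece for piece in _BOUNDARY.split(text.strip()) if piece]

sample = "Renal abscess is rare. Dr. Smith reviewed three cases."  # hypothetical text
pieces = naive_sentence_split(sample)
print(pieces)
# ['Renal abscess is rare.', 'Dr.', 'Smith reviewed three cases.']
```

The abbreviation "Dr." is wrongly treated as a sentence end, which is the kind of case a trained splitter is meant to handle better than a fixed rule.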

In [42]:
# Split the text into sentences
sentences = lexical_df['abstract'].apply(nltk.sent_tokenize)

In [43]:
sentences[0][2] # Print the third sentence of the first record

'Cardiogenic shock was present in eight patients with infarction of the left anterior descending coronary artery, four with infarction of the right coronary artery, and four with infarction of the circumflex coronary artery.'

In [44]:
lexical_df["sentences"] = sentences # set sentences as df col to save progress

### Word Tokenization
The goal of tokenization is to pinpoint the starting and ending positions of each token, whether it’s a word, a number, or a combination of symbols.

As with sentence splitting, the process relies on orthographic features (e.g., initial capital letters) and delimiters (e.g., punctuation).
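
As a rough illustration of delimiter-based tokenization, the sketch below uses a single regular expression. It is a simplification and not what `nltk.word_tokenize` does internally; `naive_tokenize` and the sample sentence are assumptions made for this example.

```python
import re

# A token is either a run of word characters (optionally joined by "-" or "."
# so that "gamma-aminobutyric" and "2.5" stay whole) or one punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+(?:[-.]\w+)*|[^\w\s]")

def naive_tokenize(sentence):
    """Return the list of tokens found in a sentence (illustrative only)."""
    return TOKEN_PATTERN.findall(sentence)

sample = "Effect of gamma-aminobutyric acid (GABA) in 2.5 mg doses."  # hypothetical text
sample_tokens = naive_tokenize(sample)
print(sample_tokens)
# ['Effect', 'of', 'gamma-aminobutyric', 'acid', '(', 'GABA', ')', 'in', '2.5', 'mg', 'doses', '.']
```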

In [45]:
tokens = []
for record in sentences:
    words = [nltk.word_tokenize(sentence) for sentence in record]
    tokens.append(words)
lexical_df["tokens"] = tokens # set tokenized sentences as df col to save progress

In [46]:
lexical_df.head()

Unnamed: 0,condition,abstract,sentences,tokens
0,4,Catheterization laboratory events and hospital...,[Catheterization laboratory events and hospita...,"[[Catheterization, laboratory, events, and, ho..."
1,5,Renal abscess in children. Three cases of rena...,"[Renal abscess in children., Three cases of re...","[[Renal, abscess, in, children, .], [Three, ca..."
2,2,Hyperplastic polyps seen at sigmoidoscopy are ...,[Hyperplastic polyps seen at sigmoidoscopy are...,"[[Hyperplastic, polyps, seen, at, sigmoidoscop..."
3,5,Subclavian artery to innominate vein fistula a...,[Subclavian artery to innominate vein fistula ...,"[[Subclavian, artery, to, innominate, vein, fi..."
4,4,Effect of local inhibition of gamma-aminobutyr...,[Effect of local inhibition of gamma-aminobuty...,"[[Effect, of, local, inhibition, of, gamma-ami..."


### Lemmatization
Post-tokenization techniques address the morphological analysis of word-tokens.

Lemmatization identifies the base form (lemma) of inflected words, preserving their meaning and grammatical category. For example, the token _liked_ maps to the lemma *like*.

This process minimizes lexical variation by consolidating different forms of the same word into a unified representation.

In [47]:
wnl = nltk.WordNetLemmatizer()
lemmatization = []
for record in lexical_df['tokens']:
    lemmatized_record = []
    for words in record:
        lemmatized_record.append([wnl.lemmatize(word) for word in words]) # update sentences into lemmatized
    lemmatization.append(lemmatized_record)
lexical_df["lemmatization"] = lemmatization

In [48]:
lexical_df.head()

Unnamed: 0,condition,abstract,sentences,tokens,lemmatization
0,4,Catheterization laboratory events and hospital...,[Catheterization laboratory events and hospita...,"[[Catheterization, laboratory, events, and, ho...","[[Catheterization, laboratory, event, and, hos..."
1,5,Renal abscess in children. Three cases of rena...,"[Renal abscess in children., Three cases of re...","[[Renal, abscess, in, children, .], [Three, ca...","[[Renal, abscess, in, child, .], [Three, case,..."
2,2,Hyperplastic polyps seen at sigmoidoscopy are ...,[Hyperplastic polyps seen at sigmoidoscopy are...,"[[Hyperplastic, polyps, seen, at, sigmoidoscop...","[[Hyperplastic, polyp, seen, at, sigmoidoscopy..."
3,5,Subclavian artery to innominate vein fistula a...,[Subclavian artery to innominate vein fistula ...,"[[Subclavian, artery, to, innominate, vein, fi...","[[Subclavian, artery, to, innominate, vein, fi..."
4,4,Effect of local inhibition of gamma-aminobutyr...,[Effect of local inhibition of gamma-aminobuty...,"[[Effect, of, local, inhibition, of, gamma-ami...","[[Effect, of, local, inhibition, of, gamma-ami..."
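
Note that `wnl.lemmatize(word)` is called without a `pos` argument, so every token is treated as a noun. This is why plural nouns are reduced (*polyps* to *polyp*) while a verb form such as *seen* is left unchanged in the output above. One possible refinement, sketched here under the assumption that Penn Treebank tags from `nltk.pos_tag` are available, is to map each tag to the single-letter part of speech that WordNet expects. The helper name `penn_to_wordnet_pos` is hypothetical and the sample tags are written by hand.

```python
def penn_to_wordnet_pos(tag):
    """Map a Penn Treebank tag to a WordNet POS letter ('a', 'v', 'r' or 'n')."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # default to noun, as WordNetLemmatizer does

# Hand-written (word, tag) pairs for illustration
tagged = [("Hyperplastic", "JJ"), ("polyps", "NNS"), ("seen", "VBN")]
pos_args = [(word, penn_to_wordnet_pos(tag)) for word, tag in tagged]
print(pos_args)
# [('Hyperplastic', 'a'), ('polyps', 'n'), ('seen', 'v')]

# Possible use with the lemmatizer defined in this notebook (not executed here):
# wnl.lemmatize(word, pos=penn_to_wordnet_pos(tag))
```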


### Stemming
Stemming, like lemmatization, processes inflected forms, but it reduces them to a stem that may not correspond to a dictionary word.
Unlike lemmatization, it also strips derivational suffixes, so words of different grammatical classes can collapse to a common stem, as with the adjective *probable* and the adverb *probably*.
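
The toy suffix stripper below illustrates why stems are often not dictionary words. It is a deliberately crude sketch and not the Porter algorithm used in the next cell; the suffix list and the name `toy_stem` are assumptions made for this example.

```python
# Suffixes are checked in order, so longer ones must come before shorter ones.
SUFFIXES = ("ization", "ations", "ation", "ies", "ing", "ly", "ed", "es", "s", "y", "e")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

for w in ["artery", "arteries", "Catheterization", "events"]:
    print(w, "->", toy_stem(w))
# artery -> arter
# arteries -> arter
# Catheterization -> catheter
# events -> event
```

Both *artery* and *arteries* collapse to *arter*, which groups the two forms together even though the stem itself is not an English word.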

In [49]:
porterStemmer = nltk.PorterStemmer()
stemming = []
for record in lexical_df['tokens']:
    stemmed_record = [] # stemmed sentences for each record
    for words in record:
        stemmed_record.append([porterStemmer.stem(word) for word in words]) # update sentences into stemmed
    stemming.append(stemmed_record)

lexical_df["stemming"] = stemming

In [50]:
lexical_df.head()

Unnamed: 0,condition,abstract,sentences,tokens,lemmatization,stemming
0,4,Catheterization laboratory events and hospital...,[Catheterization laboratory events and hospita...,"[[Catheterization, laboratory, events, and, ho...","[[Catheterization, laboratory, event, and, hos...","[[catheter, laboratori, event, and, hospit, ou..."
1,5,Renal abscess in children. Three cases of rena...,"[Renal abscess in children., Three cases of re...","[[Renal, abscess, in, children, .], [Three, ca...","[[Renal, abscess, in, child, .], [Three, case,...","[[renal, abscess, in, children, .], [three, ca..."
2,2,Hyperplastic polyps seen at sigmoidoscopy are ...,[Hyperplastic polyps seen at sigmoidoscopy are...,"[[Hyperplastic, polyps, seen, at, sigmoidoscop...","[[Hyperplastic, polyp, seen, at, sigmoidoscopy...","[[hyperplast, polyp, seen, at, sigmoidoscopi, ..."
3,5,Subclavian artery to innominate vein fistula a...,[Subclavian artery to innominate vein fistula ...,"[[Subclavian, artery, to, innominate, vein, fi...","[[Subclavian, artery, to, innominate, vein, fi...","[[subclavian, arteri, to, innomin, vein, fistu..."
4,4,Effect of local inhibition of gamma-aminobutyr...,[Effect of local inhibition of gamma-aminobuty...,"[[Effect, of, local, inhibition, of, gamma-ami...","[[Effect, of, local, inhibition, of, gamma-ami...","[[effect, of, local, inhibit, of, gamma-aminob..."


### POS tagging
Part-of-speech (POS) tagging assigns a grammatical category to each token, such as noun, verb, or adjective.

In [51]:
pos = []
for record in lexical_df['tokens']:
    # tag each tokenized sentence of the record
    pos.append([nltk.pos_tag(sentence_tokens) for sentence_tokens in record])
lexical_df["pos_tagging"] = pos

In [52]:
lexical_df.head()

Unnamed: 0,condition,abstract,sentences,tokens,lemmatization,stemming,pos_tagging
0,4,Catheterization laboratory events and hospital...,[Catheterization laboratory events and hospita...,"[[Catheterization, laboratory, events, and, ho...","[[Catheterization, laboratory, event, and, hos...","[[catheter, laboratori, event, and, hospit, ou...","[[(Catheterization, NNP), (laboratory, NN), (e..."
1,5,Renal abscess in children. Three cases of rena...,"[Renal abscess in children., Three cases of re...","[[Renal, abscess, in, children, .], [Three, ca...","[[Renal, abscess, in, child, .], [Three, case,...","[[renal, abscess, in, children, .], [three, ca...","[[(Renal, JJ), (abscess, NN), (in, IN), (child..."
2,2,Hyperplastic polyps seen at sigmoidoscopy are ...,[Hyperplastic polyps seen at sigmoidoscopy are...,"[[Hyperplastic, polyps, seen, at, sigmoidoscop...","[[Hyperplastic, polyp, seen, at, sigmoidoscopy...","[[hyperplast, polyp, seen, at, sigmoidoscopi, ...","[[(Hyperplastic, JJ), (polyps, NNS), (seen, VB..."
3,5,Subclavian artery to innominate vein fistula a...,[Subclavian artery to innominate vein fistula ...,"[[Subclavian, artery, to, innominate, vein, fi...","[[Subclavian, artery, to, innominate, vein, fi...","[[subclavian, arteri, to, innomin, vein, fistu...","[[(Subclavian, JJ), (artery, NN), (to, TO), (i..."
4,4,Effect of local inhibition of gamma-aminobutyr...,[Effect of local inhibition of gamma-aminobuty...,"[[Effect, of, local, inhibition, of, gamma-ami...","[[Effect, of, local, inhibition, of, gamma-ami...","[[effect, of, local, inhibit, of, gamma-aminob...","[[(Effect, NNP), (of, IN), (local, JJ), (inhib..."
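
A quick way to inspect the tags is to count them. The sketch below does this for one tagged sentence of the first abstract (the same sentence is printed in full in the Shallow Parsing section); the variable names are chosen for this example only.

```python
from collections import Counter

# One entry of lexical_df["pos_tagging"][0], written out by hand
tagged_sentence = [
    ("There", "EX"), ("was", "VBD"), ("one", "CD"), ("in-laboratory", "JJ"),
    ("death", "NN"), ("(", "("), ("shock", "JJ"), ("patient", "NN"),
    ("with", "IN"), ("infarction", "NN"), ("of", "IN"), ("the", "DT"),
    ("left", "JJ"), ("anterior", "JJ"), ("descending", "VBG"),
    ("coronary", "JJ"), ("artery", "NN"), (")", ")"), (".", "."),
]

# Count how often each tag occurs in the sentence
tag_counts = Counter(tag for _, tag in tagged_sentence)
print(tag_counts.most_common(2))
# [('JJ', 5), ('NN', 4)]
```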


### Stop-words Removal
Stop-words are common words that carry little specific meaning on their own, such as articles, prepositions, and conjunctions.
Stop-word removal is usually performed after lexical analysis to avoid inaccuracies in subsequent syntactic or semantic analyses.

In [53]:
# nltk.download('stopwords')

In [54]:
stopwords = set(nltk.corpus.stopwords.words('english')) # a set makes the membership test below fast

stopwords_removal = []

for record in lexical_df['tokens']:
    filtered_record = []
    for sentence in record:
        filtered_sentence = [word for word in sentence if word.lower() not in stopwords]
        filtered_record.append(filtered_sentence)

    stopwords_removal.append(filtered_record)

lexical_df["stopwords_removal"] = stopwords_removal

In [55]:
lexical_df.head()

Unnamed: 0,condition,abstract,sentences,tokens,lemmatization,stemming,pos_tagging,stopwords_removal
0,4,Catheterization laboratory events and hospital...,[Catheterization laboratory events and hospita...,"[[Catheterization, laboratory, events, and, ho...","[[Catheterization, laboratory, event, and, hos...","[[catheter, laboratori, event, and, hospit, ou...","[[(Catheterization, NNP), (laboratory, NN), (e...","[[Catheterization, laboratory, events, hospita..."
1,5,Renal abscess in children. Three cases of rena...,"[Renal abscess in children., Three cases of re...","[[Renal, abscess, in, children, .], [Three, ca...","[[Renal, abscess, in, child, .], [Three, case,...","[[renal, abscess, in, children, .], [three, ca...","[[(Renal, JJ), (abscess, NN), (in, IN), (child...","[[Renal, abscess, children, .], [Three, cases,..."
2,2,Hyperplastic polyps seen at sigmoidoscopy are ...,[Hyperplastic polyps seen at sigmoidoscopy are...,"[[Hyperplastic, polyps, seen, at, sigmoidoscop...","[[Hyperplastic, polyp, seen, at, sigmoidoscopy...","[[hyperplast, polyp, seen, at, sigmoidoscopi, ...","[[(Hyperplastic, JJ), (polyps, NNS), (seen, VB...","[[Hyperplastic, polyps, seen, sigmoidoscopy, m..."
3,5,Subclavian artery to innominate vein fistula a...,[Subclavian artery to innominate vein fistula ...,"[[Subclavian, artery, to, innominate, vein, fi...","[[Subclavian, artery, to, innominate, vein, fi...","[[subclavian, arteri, to, innomin, vein, fistu...","[[(Subclavian, JJ), (artery, NN), (to, TO), (i...","[[Subclavian, artery, innominate, vein, fistul..."
4,4,Effect of local inhibition of gamma-aminobutyr...,[Effect of local inhibition of gamma-aminobuty...,"[[Effect, of, local, inhibition, of, gamma-ami...","[[Effect, of, local, inhibition, of, gamma-ami...","[[effect, of, local, inhibit, of, gamma-aminob...","[[(Effect, NNP), (of, IN), (local, JJ), (inhib...","[[Effect, local, inhibition, gamma-aminobutyri..."


## Syntax Analysis
Syntax analysis consists of:
- Shallow Parsing
- Deep Parsing

### Shallow Parsing
Shallow parsing (chunking) groups POS-tagged tokens into flat syntactic units such as noun phrases. The result is a parse tree in which the POS-tagged tokens are leaf nodes and the chunks are intermediate nodes; the tree expresses hierarchy but does not represent specific relationships between words.


In [56]:
syntax_df = df.copy() # a df to compute syntax analysis on

In [57]:
# Define the grammar and the chunk parser
# This NP rule only matches two consecutive singular proper nouns (NNP NNP)
grammar = "NP: {<NNP><NNP>}"
cp = nltk.RegexpParser(grammar) # chunk parser

# Apply chunking to each record
chunking = []
for record in lexical_df['pos_tagging']:
    chunked_record = [cp.parse(sentence) for sentence in record]

    chunking.append(chunked_record)

# Save the chunking results into the dataframe
syntax_df["shallow_parsing"] = chunking

In [58]:
# Display the dataframe
syntax_df.head()

Unnamed: 0,condition,abstract,shallow_parsing
0,4,Catheterization laboratory events and hospital...,"[[(Catheterization, NNP), (laboratory, NN), (e..."
1,5,Renal abscess in children. Three cases of rena...,"[[(Renal, JJ), (abscess, NN), (in, IN), (child..."
2,2,Hyperplastic polyps seen at sigmoidoscopy are ...,"[[(Hyperplastic, JJ), (polyps, NNS), (seen, VB..."
3,5,Subclavian artery to innominate vein fistula a...,"[[(Subclavian, JJ), (artery, NN), (to, TO), (i..."
4,4,Effect of local inhibition of gamma-aminobutyr...,"[[(Effect, NNP), (of, IN), (local, JJ), (inhib..."


In [60]:
# Display the chunking result for the fifth sentence of the first record
print(syntax_df['shallow_parsing'][0][4])
#syntax_df['shallow_parsing'][0][4]

(S
  There/EX
  was/VBD
  one/CD
  in-laboratory/JJ
  death/NN
  (/(
  shock/JJ
  patient/NN
  with/IN
  infarction/NN
  of/IN
  the/DT
  left/JJ
  anterior/JJ
  descending/VBG
  coronary/JJ
  artery/NN
  )/)
  ./.)
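
The tree above contains no `NP` node because the grammar only matches two consecutive `NNP` tags, and this sentence has none. As a sketch of what a broader rule would capture, the pure-Python chunker below groups optional determiners and adjectives followed by one or more nouns, roughly in the spirit of an NLTK grammar such as `NP: {<DT>?<JJ>*<NN.*>+}`. The function `chunk_noun_phrases` is hypothetical and is not part of NLTK.

```python
def chunk_noun_phrases(tagged):
    """Collect runs of (DT|JJ)* followed by NN* tokens as noun-phrase strings."""
    chunks, current, has_noun = [], [], False
    for word, tag in tagged:
        if tag.startswith("NN"):
            current.append(word)
            has_noun = True
        elif tag in ("DT", "JJ") and not has_noun:
            current.append(word)  # modifier that may precede a noun
        else:
            if has_noun:
                chunks.append(" ".join(current))  # close the finished chunk
            current, has_noun = [], False
            if tag in ("DT", "JJ"):
                current.append(word)  # modifier starts a new candidate chunk
    if has_noun:
        chunks.append(" ".join(current))
    return chunks

# The tagged sentence printed above
tagged_sentence = [
    ("There", "EX"), ("was", "VBD"), ("one", "CD"), ("in-laboratory", "JJ"),
    ("death", "NN"), ("(", "("), ("shock", "JJ"), ("patient", "NN"),
    ("with", "IN"), ("infarction", "NN"), ("of", "IN"), ("the", "DT"),
    ("left", "JJ"), ("anterior", "JJ"), ("descending", "VBG"),
    ("coronary", "JJ"), ("artery", "NN"), (")", ")"), (".", "."),
]
noun_phrases = chunk_noun_phrases(tagged_sentence)
print(noun_phrases)
# ['in-laboratory death', 'shock patient', 'infarction', 'coronary artery']
```

Because *descending* is tagged `VBG`, the phrase "the left anterior descending coronary artery" is cut short to "coronary artery", which shows how chunk quality depends on the POS tags.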


In [61]:
# To draw the parse tree
#syntax_df['shallow_parsing'][1][1].draw()

### Deep Parsing
Unlike _shallow parsing_, _deep parsing_ aims to infer dependency relationships between nodes.
The result is a dependency graph which relates words that are syntactically linked.
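
To illustrate the structure of such a graph, the sketch below stores each word as a `(token, dependency label, head)` triple and derives the root and the dependents of each head. The arcs are written by hand for a made-up sentence and are not spaCy output; following the convention visible in the results further down, the root is its own head. Keying by token text only works here because every word in the sample is distinct.

```python
from collections import defaultdict

# Hand-written arcs for the hypothetical sentence "Three patients had fever"
arcs = [
    ("Three", "nummod", "patients"),
    ("patients", "nsubj", "had"),
    ("had", "ROOT", "had"),
    ("fever", "dobj", "had"),
]

children = defaultdict(list)  # head -> list of dependents
root = None
for token, dep, head in arcs:
    if dep == "ROOT":
        root = token  # the root has no incoming arc from another word
    else:
        children[head].append(token)

print(root)            # had
print(dict(children))  # {'patients': ['Three'], 'had': ['patients', 'fever']}
```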

In [62]:
nlp = spacy.load('en_core_web_sm')

In [63]:
deep_parsing = []
for abstract in syntax_df["abstract"]:
    record_dep = []
    doc = nlp(abstract) # parse the whole abstract in one pass
    for token in doc:
        # tuple of (token, dependency label, head, dependents of the token)
        record_dep.append((token.text, token.dep_, token.head.text, str(list(token.children))))
    deep_parsing.append(record_dep)

In [64]:
syntax_df["deep_parsing"] = deep_parsing

In [65]:
syntax_df.head()

Unnamed: 0,condition,abstract,shallow_parsing,deep_parsing
0,4,Catheterization laboratory events and hospital...,"[[(Catheterization, NNP), (laboratory, NN), (e...","[(Catheterization, compound, events, []), (lab..."
1,5,Renal abscess in children. Three cases of rena...,"[[(Renal, JJ), (abscess, NN), (in, IN), (child...","[(Renal, nsubj, abscess, []), (abscess, ROOT, ..."
2,2,Hyperplastic polyps seen at sigmoidoscopy are ...,"[[(Hyperplastic, JJ), (polyps, NNS), (seen, VB...","[(Hyperplastic, amod, polyps, []), (polyps, ns..."
3,5,Subclavian artery to innominate vein fistula a...,"[[(Subclavian, JJ), (artery, NN), (to, TO), (i...","[(Subclavian, amod, artery, []), (artery, nsub..."
4,4,Effect of local inhibition of gamma-aminobutyr...,"[[(Effect, NNP), (of, IN), (local, JJ), (inhib...","[(Effect, ROOT, Effect, [of, :, study, .]), (o..."


## Semantic Analysis

### Entity Extraction

In [66]:
semantic_df = df.copy() # a df to compute semantic analysis on

In [69]:
entities = []
for abstract in semantic_df["abstract"]:
    doc = nlp(abstract)
    record_entities = [(ent.text, ent.label_) for ent in doc.ents] # collect (text, label) pairs for the record
    entities.append(record_entities) # one list of entities per record

In [70]:
semantic_df["entities"] = entities

In [77]:
# Flatten the list of entities and extract the text and label
all_entities = [(text, label) for record in semantic_df["entities"] for text, label in record]

# Get the unique entities (text, label) pairs
unique_entities = set(all_entities)

In [78]:
# Display the unique entities
unique_entities

{('University of Milan', 'ORG'),
 ('45 degrees', 'QUANTITY'),
 ('ER+/PR+', 'ORG'),
 ('Sixty-seven percent', 'PERCENT'),
 ('565', 'CARDINAL'),
 ('an average of nine years', 'DATE'),
 ('500-2000', 'CARDINAL'),
 ('Citrobacter', 'PRODUCT'),
 ('73.4 +', 'DATE'),
 ('Between 1970 and 1987', 'DATE'),
 ('0.70', 'CARDINAL'),
 ('13.19 +', 'DATE'),
 ('the Normative Aging Study', 'ORG'),
 ('6) minutes', 'TIME'),
 ('3T6', 'DATE'),
 ('157 +', 'DATE'),
 ('NSLS', 'ORG'),
 ('6 to 39 months', 'DATE'),
 ('Neoplasia', 'GPE'),
 ('under 5 years of age', 'DATE'),
 ('446', 'CARDINAL'),
 ('0.90 +', 'DATE'),
 ('E4', 'PERSON'),
 ('Ees', 'GPE'),
 ('the US Public Health Service', 'ORG'),
 ('54.4 years', 'DATE'),
 ('May 17.6%', 'DATE'),
 ('10-60 percent', 'PERCENT'),
 ('56 to 89', 'TIME'),
 ('MEP', 'ORG'),
 ('TTP', 'ORG'),
 ('Nonpigmented', 'ORG'),
 ('OC', 'DATE'),
 ('20.0 +', 'DATE'),
 ('less than 3.5', 'CARDINAL'),
 ('223 mm', 'QUANTITY'),
 ('each morning', 'TIME'),
 ('2-18 years', 'DATE'),
 ('28.6 months', 'DATE'

In [79]:
semantic_df.head()

Unnamed: 0,condition,abstract,entities
0,4,Catheterization laboratory events and hospital...,"[(100, CARDINAL), (100, CARDINAL), (50, CARDIN..."
1,5,Renal abscess in children. Three cases of rena...,"[(Renal, ORG), (Three, CARDINAL), (23, CARDINA..."
2,2,Hyperplastic polyps seen at sigmoidoscopy are ...,"[(Polyps, ORG), (185, CARDINAL), (99, CARDINAL..."
3,5,Subclavian artery to innominate vein fistula a...,"[(Subclavian, NORP), (Sixteen, CARDINAL), (onl..."
4,4,Effect of local inhibition of gamma-aminobutyr...,"[(GABA, ORG), (GABA, ORG), (15, CARDINAL), (2...."
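
Several labels in the output above look doubtful for medical text, for example `('Citrobacter', 'PRODUCT')` and `('Neoplasia', 'GPE')`; `en_core_web_sm` is a general-purpose English model rather than a biomedical one. A simple way to review the results is to count entities per label and list the labels that are unlikely in abstracts. The sketch below does this on a few pairs copied from the output; the choice of "unlikely" labels is an assumption made for this example.

```python
from collections import Counter

# A few (text, label) pairs copied from unique_entities above
sample_entities = [
    ("University of Milan", "ORG"), ("45 degrees", "QUANTITY"),
    ("ER+/PR+", "ORG"), ("Sixty-seven percent", "PERCENT"),
    ("565", "CARDINAL"), ("Citrobacter", "PRODUCT"),
    ("Neoplasia", "GPE"), ("E4", "PERSON"), ("54.4 years", "DATE"),
]

# Number of entities per label
label_counts = Counter(label for _, label in sample_entities)

# Labels assumed to be rare in medical abstracts, worth a manual check
unlikely_labels = {"PRODUCT", "GPE", "PERSON"}
to_review = [(text, label) for text, label in sample_entities if label in unlikely_labels]

print(label_counts["ORG"])  # 2
print(to_review)
# [('Citrobacter', 'PRODUCT'), ('Neoplasia', 'GPE'), ('E4', 'PERSON')]
```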


### Relation Extraction