### Syntax

Syntax is the structure of a language, which is governed by its grammar. Not every ordering of words forms a valid sentence, so we need syntactic analysis for natural languages.

## Table of Contents

* [Parts of Speech Tagging](#pos)
* [Dependency Parsing](#parsing)
* [Named Entity Recognition](#ner)

<a id='pos'></a>

### Parts of Speech Tagging

Parts of speech (POS) are specific lexical categories to which words are assigned, based on their syntactic context and role. Usually, words can fall into one of the following major categories.

* <strong>Nouns</strong>
* <strong>Verbs</strong>
* <strong>Adjectives</strong>
* <strong>Adverbs</strong>

Besides these four major categories, other parts of speech occur frequently in English, including pronouns, prepositions, interjections, conjunctions, and determiners. The process of classifying words and labeling them with POS tags is called parts of speech tagging, or POS tagging. The tags annotate each word with its part of speech, which supports analyses such as narrowing the text down to nouns to see which are the most prominent, word sense disambiguation, and grammar analysis.
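To make the "most prominent nouns" idea concrete, here is a minimal sketch that counts words by tag. It assumes the tagging step has already produced (word, tag) pairs; the pairs below are written by hand for illustration, and the helper name `most_common_by_pos` is hypothetical rather than part of any library.

```python
from collections import Counter

# Hand-written (word, POS tag) pairs, used here only for illustration
tagged = [
    ("preoperative", "ADJ"), ("ganglion", "NOUN"), ("left", "VERB"),
    ("wrist", "NOUN"), (".", "PUNCT"),
    ("postoperative", "ADJ"), ("ganglion", "NOUN"), ("left", "VERB"),
    ("wrist", "NOUN"), (".", "PUNCT"),
    ("excision", "NOUN"), ("ganglion", "NOUN"), (".", "PUNCT"),
]

def most_common_by_pos(pairs, pos, n=3):
    """Return the n most frequent words carrying the given POS tag."""
    counts = Counter(word for word, tag in pairs if tag == pos)
    return counts.most_common(n)

top_nouns = most_common_by_pos(tagged, "NOUN")
print(top_nouns)  # [('ganglion', 3), ('wrist', 2), ('excision', 1)]
```

The same helper works for any tag, so swapping `"NOUN"` for `"VERB"` or `"ADJ"` gives the most frequent verbs or adjectives.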


In [1]:
import pandas as pd

pd.options.display.max_colwidth = None  # show full cell contents; -1 is deprecated in pandas 1.0+

data = pd.read_csv('clinical_notes_cleaned.csv')  # for an Excel file, use pd.read_excel

In [2]:
import nltk
import spacy
import en_core_med7_lg #en_core_web_sm
import re

# The legacy parse/tag/entity keyword arguments are dropped; the pipeline's components load by default
nlp = spacy.load('en_core_med7_lg')

In [3]:
sample_text = data.clean_text.iloc[1]
print (sample_text)

preoperative ganglion left wrist . postoperative ganglion left wrist . excision ganglion . general . estimate blood less 5 ml . successful anesthetic , patient position operating table . tourniquet apply upper arm . extremity preppe usual manner surgical procedure drape . superficial vessel exsanguinate elastic wrap tourniquet inflate usual arm pressure . curved incision make present ganglion dorsal aspect wrist . blunt sharp dissection , dissect underneath extensor tendon stalk appear arise distal radiocapitellar joint dorsal capsule excise along ganglion specimen remove submit . small superficial vessel electrocoagulate instill close skin 4 - 0 prolene , area approximately 6 7 ml 0.25 marcaine epinephrine . jackson - pratt drain insert tourniquet release , keep deflate least 5 10 minute pass activate remove recovery room . dressing apply hand xeroform , 4x4s , abd , kerlix , elastic wrap volar fiberglass splint . tourniquet release . circulation return finger . patient allow awaken l

In [4]:
text_tokenized = nlp(sample_text)

for token in text_tokenized:
    print ("{} ---> {}".format(token,token.pos_))

preoperative ---> ADJ
ganglion ---> NOUN
left ---> VERB
wrist ---> NOUN
. ---> PUNCT
postoperative ---> ADJ
ganglion ---> NOUN
left ---> VERB
wrist ---> NOUN
. ---> PUNCT
excision ---> NOUN
ganglion ---> NOUN
. ---> PUNCT
general ---> PROPN
. ---> PUNCT
estimate ---> VERB
blood ---> NOUN
less ---> ADJ
5 ---> NUM
ml ---> NOUN
. ---> PUNCT
successful ---> ADJ
anesthetic ---> NOUN
, ---> PUNCT
patient ---> ADJ
position ---> NOUN
operating ---> VERB
table ---> NOUN
. ---> PUNCT
tourniquet ---> NOUN
apply ---> VERB
upper ---> ADJ
arm ---> NOUN
. ---> PUNCT
extremity ---> NOUN
preppe ---> ADJ
usual ---> ADJ
manner ---> ADJ
surgical ---> ADJ
procedure ---> NOUN
drape ---> NOUN
. ---> PUNCT
superficial ---> ADJ
vessel ---> NOUN
exsanguinate ---> ADJ
elastic ---> ADJ
wrap ---> NOUN
tourniquet ---> NOUN
inflate ---> VERB
usual ---> ADJ
arm ---> NOUN
pressure ---> NOUN
. ---> PUNCT
curved ---> ADJ
incision ---> NOUN
make ---> VERB
present ---> ADJ
ganglion ---> NOUN
dorsal ---> ADJ
aspect ---> NO

POS tags are commonly used for analysis, feature engineering, or feature selection. In this analysis, let us keep only the nouns, proper nouns, numbers, adjectives, and verbs, along with punctuation.

In [5]:
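def get_selected_pos(text):
    # Tag the text and keep only tokens whose POS is in the selected set
    text_tokenized = nlp(text)
    selected_pos = {'NOUN', 'PROPN', 'NUM', 'ADJ', 'VERB', 'PUNCT'}
    # token.text is used instead of token.string, which was removed in spaCy v3
    selected_words = [token.text for token in text_tokenized if token.pos_ in selected_pos]
    # Join with single spaces and collapse any repeated spaces
    processed_text = re.sub(' +', ' ', " ".join(selected_words))
    return processed_text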

In [6]:
data = data.dropna(subset=['clean_text'])
data.clean_text = data.clean_text.apply(get_selected_pos)

<a id='parsing'></a>

### Dependency Parsing

In dependency parsing, we use dependency-based grammars to analyze and infer both the structure of a sentence and the semantic relationships between its tokens. The basic principle behind a dependency grammar is that, in any sentence, every word except one depends on some other word in the sentence. The word that depends on no other word is called the root of the sentence, and in most cases it is the verb. All the other words are linked to the root, directly or indirectly, and these links are the dependencies.
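The root and head relationships described above can be sketched without a parser. The example below is a hand-annotated toy parse, not model output; the sentence, the labels, the convention that the root points to itself, and the helper names `find_root` and `path_to_root` are all assumptions made for illustration.

```python
# Hand-annotated toy parse (an assumption for illustration, not model output).
# Each entry is (word, index of its head, dependency label).
# By convention in this sketch, the root token is its own head.
parse = [
    ("The", 1, "det"),
    ("surgeon", 2, "nsubj"),
    ("excised", 2, "ROOT"),
    ("the", 4, "det"),
    ("ganglion", 2, "dobj"),
]

def find_root(parse):
    """Return the index of the single token that is its own head."""
    roots = [i for i, (_, head, _) in enumerate(parse) if head == i]
    if len(roots) != 1:
        raise ValueError("a well-formed parse has exactly one root")
    return roots[0]

def path_to_root(parse, i):
    """Follow head links from token i up to the root, collecting the words visited."""
    path = [parse[i][0]]
    while parse[i][1] != i:
        i = parse[i][1]
        path.append(parse[i][0])
    return path

root = find_root(parse)
print(parse[root][0])          # excised
print(path_to_root(parse, 3))  # ['the', 'ganglion', 'excised']
```

Here the determiner "the" reaches the root indirectly through "ganglion", while "surgeon" and "ganglion" are linked to the root verb directly.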

In [13]:
from spacy import displacy

text_tokenized = nlp(data.clean_text.iloc[1])

options = {"compact": True}
displacy.serve(text_tokenized, style="dep", options=options)

#displacy.render(text_tokenized, jupyter=True, 
#                options={'distance': 110,
#                         'arrow_stroke': 2,
#                         'arrow_width': 8})

  "__main__", mod_spec)



Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


<a id='ner'></a>

### Named Entity Recognition (NER)

In any text document, some terms represent specific entities that are more informative than the surrounding words and have a unique context. These are known as named entities: terms that refer to real-world objects such as people, places, and organizations, and that are often denoted by proper names. A naive approach is to look for them among the noun phrases in the text. Named entity recognition (NER), also known as entity chunking or entity extraction, is a popular information extraction technique that identifies and segments the named entities and classifies them under predefined classes.
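The naive noun-phrase approach mentioned above can be sketched as follows. This is a crude heuristic, not how a trained NER model works: it assumes (word, tag) pairs are already available, treats any unbroken run of adjectives, nouns, and proper nouns as a candidate phrase, and assigns no entity class. The sample pairs and the name `candidate_noun_phrases` are hypothetical.

```python
from itertools import groupby

# Hand-written (word, POS tag) pairs, used here only for illustration
tagged = [
    ("tourniquet", "NOUN"), ("apply", "VERB"), ("upper", "ADJ"),
    ("arm", "NOUN"), (".", "PUNCT"),
    ("curved", "ADJ"), ("incision", "NOUN"), ("make", "VERB"),
]

# Tags allowed inside a candidate phrase (a simplifying assumption)
NP_TAGS = {"ADJ", "NOUN", "PROPN"}

def candidate_noun_phrases(pairs):
    """Group consecutive ADJ/NOUN/PROPN tokens into candidate phrases."""
    phrases = []
    for in_phrase, group in groupby(pairs, key=lambda pair: pair[1] in NP_TAGS):
        if in_phrase:
            phrases.append(" ".join(word for word, _ in group))
    return phrases

candidates = candidate_noun_phrases(tagged)
print(candidates)  # ['tourniquet', 'upper arm', 'curved incision']
```

Candidates found this way still have to be filtered and labeled, which is the part a trained NER model handles by predicting both the span and its class.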

In [14]:
#for ent in text_tokenized.ents:
#    print ("{} ---> {}".format(ent.text, ent.label_))
    
displacy.serve(text_tokenized, style="ent")

0.25 ---> STRENGTH
marcaine epinephrine ---> DRUG


Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


In [20]:
text_tokenized_orig = nlp(data.text.iloc[1])
#for ent in text_tokenized_orig.ents:
#    print ("{} ---> {}".format(ent.text, ent.label_))
    
displacy.serve(text_tokenized_orig, style="ent")


Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


In [18]:
# General-purpose English model for comparison; legacy parse/tag/entity keyword arguments dropped
nlp_eng = spacy.load('en_core_web_sm')
text_tokenized_eng = nlp_eng(data.clean_text.iloc[1])
#for ent in text_tokenized_eng.ents:
#    print ("{} ---> {}".format(ent.text, ent.label_))
    
displacy.serve(text_tokenized_eng, style="ent")


Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


In [19]:
text_tokenized_orig_eng = nlp_eng(data.text.iloc[1])
#for ent in text_tokenized_orig_eng.ents:
#    print ("{} ---> {}".format(ent.text, ent.label_))
    
displacy.serve(text_tokenized_orig_eng, style="ent")


Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


In [10]:
data.to_csv("clinical_notes_cleaned_pos.csv",index=False)

### References for further reading

<strong> POS tagging </strong>

https://www.nltk.org/book/ch05.html

<strong> Medical named entity recognition </strong>

https://github.com/kormilitzin/med7

https://github.com/NLPatVCU/medaCy

https://github.com/text-machine-lab/CliNER
