# Stanza: A Python NLP Library for Many Human Languages

## 1. About

- Stanza is the Stanford NLP Group's official Python NLP library. It supports running accurate natural language processing tools on **60+ languages**.
- A collection of **biomedical** and **clinical** English model packages is also available, offering **syntactic analysis** and **named entity recognition (NER)** for biomedical literature text and clinical notes.

- `github`: https://github.com/stanfordnlp/stanza
- `website`: https://stanfordnlp.github.io/stanza/
- `demo`: http://stanza.run/
- `models`: https://huggingface.co/stanfordnlp/stanza-en/tree/main/models
- `Biomedical page`: https://stanfordnlp.github.io/stanza/biomed.html
- `Biomedical demo`: http://stanza.run/bio
- `Biomedical paper`: https://academic.oup.com/jamia/article/28/9/1892/6307885

- Overview of the biomedical and clinical English model packages in the Stanza NLP library. For syntactic analysis, an example output from the CRAFT biomedical pipeline is shown; for named entity recognition, an example output from the i2b2 clinical model is shown.
<img width="500" height="800" align="left" src="https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/jamia/28/9/10.1093_jamia_ocab090/1/ocab090f1.jpeg?Expires=1639110397&Signature=xsV~YOlEvbas3CVB8FaojrucfRH3kMH6fy~JOtiwuHGkmtO0HDh~Qx1fWAn4esbK6Q0MBGpvPBp2Ndh9w0T8NFC6LJ6zWhr3qfzC72Jieb4-TGyHxZdGKDypa~~eA9WUhlcNPmhoRcR5sQ0XTzXy9r--6oOhZZ906RQVtCremvkyteYUTUqsmlXkuMTKP5Y6DmF16VamYMnTGfPU2VUYjCz8Lj0WxlOWd6zIXpNFRlLFqh4lKcRowmnzOp1UgkKKOt8Sv5lhiBabcytp9fb1oo74k83iPTFGdcbeNn9SoMagiVIb2gyBP1Hj4ZX1JD2FK076BqaL6Qn7dmDgxL0hjg__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA">

- Details of the NER datasets: the first 8 are **biomedical** datasets and the last 2 are **clinical** datasets.

|Dataset|Entity Types|Number of Tokens (train/dev/test)|
|---|---|----
|AnatEM|`Anatomy`|153,823/58,785/99,976
|BC5CDR|`Chemical`,`Disease`| 118,170/117,453/124,750
|BC4CHEMD|`Chemical`|891,948/886,324/766,033
|BioNLP13CG|`Amino_acid`, `Anatomical_system`, `Cancer`, `Cell`, `Cellular_component`, `Developing_anatomical_structure`, `Gene_or_gene_product`, `Immaterial_anatomical_entity`, `Multi-tissue_structure`, `Organ`, `Organism`, `Organism_subdivision`, `Organism_substance`, `Pathological_formation`, `Simple_chemical`, `Tissue`|83,467/27,599/52,771
|JNLPBA|`Cell_line`, `Cell_type`, `DNA`, `Protein`, `RNA`|446,890/47,661/101,443
|Linnaeus|`Species`|280,960/93,815/164,653
|NCBI-Disease|`Disease`|135,701/23,969/24,497
|S800|`Species`|147,291/22,217/42,298
|i2b2|`Problem`, `Test`, `Treatment`|106,597/44,672/269,954
|Radiology|`Anatomy`, `Observation`, `Anatomy_modifier`, `Observation_modifier`, `Uncertainty`|30,137/2,423/3,693

## 2. Install

Install Stanza with conda (`conda install -c stanfordnlp stanza`) or with pip (`pip install stanza`).

## 3. Usage

```python
import stanza
```

Use Stanza in 4 steps:
- a. Download the pipeline and models.
- b. Initialize the pipeline and models.
- c. Annotate the text.
- d. Print out the entities.
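
The four steps can be sketched as one function. This is a minimal sketch, not the notebook's own code: `run_ner` and `format_entity` are hypothetical helper names, and the default `mimic` package with the `i2b2` NER model is only an assumed example. The Stanza import sits inside `run_ner`, so the formatter can be tried without Stanza installed.

```python
from types import SimpleNamespace


def format_entity(ent):
    """Render one entity as 'start end TYPE text', like the cells in this section."""
    return f"{ent.start_char} {ent.end_char} {ent.type} {ent.text}"


def run_ner(text, package="mimic", ner_model="i2b2"):
    """Run the four steps on `text` (hypothetical helper; needs Stanza and network access)."""
    import stanza  # imported lazily so format_entity works without Stanza

    processors = {"ner": ner_model}
    # a. download the pipeline and models
    stanza.download("en", package=package, processors=processors)
    # b. initialize the pipeline and models
    nlp = stanza.Pipeline("en", package=package, processors=processors)
    # c. annotate the text
    doc = nlp(text)
    # d. collect one formatted line per entity
    return [format_entity(ent) for ent in doc.entities]


# Stand-in object with the same four attributes, to check the formatter offline.
fake = SimpleNamespace(start_char=16, end_char=29, type="PROBLEM", text="a sore throat")
print(format_entity(fake))
```

With Stanza installed, `run_ner('The patient had a sore throat and was treated with Cepacol lozenges.')` would return the formatted entity lines instead of printing them.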

### 3.1. Clinical: i2b2

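```python
# `mimic` package for tokenize/pos/lemma/depparse, `i2b2` model for NER
nlp = stanza.Pipeline('en', package='mimic', processors={'ner': 'i2b2'})
doc = nlp('The patient had a sore throat and was treated with Cepacol lozenges.')
# Print each entity's character offsets, type, and text
for ent in doc.entities:
    print(ent.start_char, ent.end_char, ent.type, ent.text)
```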

```text
2021-11-08 14:49:31 INFO: Loading these models for language: en (English):
| Processor | Package |
-----------------------
| tokenize  | mimic   |
| pos       | mimic   |
| lemma     | mimic   |
| depparse  | mimic   |
| ner       | i2b2    |

2021-11-08 14:49:32 INFO: Use device: gpu
2021-11-08 14:49:32 INFO: Loading: tokenize
2021-11-08 14:49:35 INFO: Loading: pos
2021-11-08 14:49:35 INFO: Loading: lemma
2021-11-08 14:49:35 INFO: Loading: depparse
2021-11-08 14:49:35 INFO: Loading: ner
2021-11-08 14:49:36 INFO: Done loading processors!

16 29 PROBLEM a sore throat
51 67 TREATMENT Cepacol lozenges
```
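
The two numbers printed for each entity are character offsets into the input string, with the end offset exclusive. The short check below illustrates this using only the sentence and the offsets from the output above; it does not need Stanza.

```python
text = "The patient had a sore throat and was treated with Cepacol lozenges."

# (start_char, end_char, type, text) tuples copied from the output above.
entities = [
    (16, 29, "PROBLEM", "a sore throat"),
    (51, 67, "TREATMENT", "Cepacol lozenges"),
]

for start, end, label, surface in entities:
    # Slicing the input with the offsets recovers the entity text exactly.
    assert text[start:end] == surface
    print(f"{label:<10} {text[start:end]!r}")
```

This makes the offsets convenient for highlighting entities in the original text or for aligning them with other annotations.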


### 3.2. Clinical: Radiology

```python
# Same `mimic` pipeline, with the `radiology` NER model
nlp = stanza.Pipeline('en', package='mimic', processors={'ner': 'radiology'})
text = "Mrs Smith presented to A&E with worsening shortness of breath and ankle swelling. \
On arrival, she was tachypnoeic and hypoxic (oxygen saturation 82% on air). Clinical examination \
revealed reduced breath sounds and dullness to percussion in both lung bases. There was \
also a significant degree of lower limb oedema extending up to the mid-thigh bilaterally."
doc = nlp(text)
for ent in doc.entities:
    print(ent.start_char, ent.end_char, ent.type, ent.text)
```

```text
2021-11-08 14:49:37 INFO: Loading these models for language: en (English):
| Processor | Package   |
-------------------------
| tokenize  | mimic     |
| pos       | mimic     |
| lemma     | mimic     |
| depparse  | mimic     |
| ner       | radiology |

2021-11-08 14:49:37 INFO: Use device: gpu
2021-11-08 14:49:37 INFO: Loading: tokenize
2021-11-08 14:49:37 INFO: Loading: pos
2021-11-08 14:49:37 INFO: Loading: lemma
2021-11-08 14:49:37 INFO: Loading: depparse
2021-11-08 14:49:37 INFO: Loading: ner
2021-11-08 14:49:38 INFO: Done loading processors!

66 71 ANATOMY ankle
72 80 OBSERVATION swelling
118 125 OBSERVATION_MODIFIER hypoxic
188 195 OBSERVATION_MODIFIER reduced
196 209 OBSERVATION breath sounds
214 222 OBSERVATION dullness
226 236 OBSERVATION percussion
240 244 ANATOMY_MODIFIER both
245 249 ANATOMY lung
250 255 ANATOMY_MODIFIER bases
274 285 OBSERVATION_MODIFIER significant
286 292 OBSERVATION_MODIFIER degree
296 301 ANATOMY_MODIFIER lower
302 306 ANATOMY limb
307 313 OBSERVATION oedema
334 343 ANATOMY mid-thigh
344 355 ANATOMY_MODIFIER bilaterally
```
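
With this many entities, grouping the mentions by type is often easier to read than the flat list. The sketch below does that for the lines printed above, which are copied into a string so the example runs without Stanza. With a live pipeline, one would presumably loop over `doc.entities` and use `ent.type` and `ent.text` instead of parsing text.

```python
from collections import defaultdict

# Entity lines copied from the radiology output above: start end TYPE text
output = """\
66 71 ANATOMY ankle
72 80 OBSERVATION swelling
118 125 OBSERVATION_MODIFIER hypoxic
188 195 OBSERVATION_MODIFIER reduced
196 209 OBSERVATION breath sounds
214 222 OBSERVATION dullness
226 236 OBSERVATION percussion
240 244 ANATOMY_MODIFIER both
245 249 ANATOMY lung
250 255 ANATOMY_MODIFIER bases
274 285 OBSERVATION_MODIFIER significant
286 292 OBSERVATION_MODIFIER degree
296 301 ANATOMY_MODIFIER lower
302 306 ANATOMY limb
307 313 OBSERVATION oedema
334 343 ANATOMY mid-thigh
344 355 ANATOMY_MODIFIER bilaterally
"""

by_type = defaultdict(list)
for line in output.splitlines():
    # maxsplit=3 keeps multi-word mentions such as "breath sounds" in one piece.
    start, end, label, surface = line.split(maxsplit=3)
    by_type[label].append(surface)

for label, mentions in by_type.items():
    print(f"{label} ({len(mentions)}): {', '.join(mentions)}")
```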


### 3.3. Biology: NCBI_Disease

```python
# Same `mimic` pipeline, with the `ncbi_disease` NER model
nlp = stanza.Pipeline('en', package='mimic', processors={'ner': 'ncbi_disease'})
pmid = 28483579  # PubMed ID of the abstract below
# The abstract contains literal HTML entities (&lt;, &gt;); they count toward the character offsets
abstract = "This study compared 2-D shear wave elastography (SWE) and transient elastography (TE) for liver \
fibrosis staging in patients with chronic hepatitis B (CHB) infection using liver biopsy as the \
reference standard. Patients with CHB infection who underwent liver biopsy were consecutively included. \
After exclusions, 257 patients were analyzed. Two-dimensional SWE resulted in a significantly higher \
rate of reliable measurements (98.1%, 252/257) than TE (93.0%, 239/257) (p = 0.011). Liver stiffness \
measurements of the two examinations exhibited a strong correlation (r = 0.835, p &lt; 0.001). \
In patients given a confirmed histologic diagnosis, Spearman's rank coefficients were 0.520 in stage \
F0 (p &lt; 0.001), 0.684 in stage F1 (p &lt; 0.001), 0.777 in stage F2 (p &lt; 0.001), 0.672 in stage F3 \
(p &lt; 0.001) and 0.755 in stage F4 (p &lt; 0.001). There were no significant differences between \
the areas under the receiver operating characteristic (ROC) curves of 2-D SWE and TE for liver \
fibrosis staging (all p values &gt; 0.05). Two-dimensional SWE had diagnostic accuracy comparable to \
that of TE for liver fibrosis staging. The measurements that the two techniques provide are not interchangeable."
doc = nlp(abstract)
for ent in doc.entities:
    print(ent.start_char, ent.end_char, ent.type, ent.text)
```

```text
2021-11-08 14:49:38 INFO: Loading these models for language: en (English):
| Processor | Package      |
----------------------------
| tokenize  | mimic        |
| pos       | mimic        |
| lemma     | mimic        |
| depparse  | mimic        |
| ner       | ncbi_disease |

2021-11-08 14:49:38 INFO: Use device: gpu
2021-11-08 14:49:38 INFO: Loading: tokenize
2021-11-08 14:49:38 INFO: Loading: pos
2021-11-08 14:49:39 INFO: Loading: lemma
2021-11-08 14:49:39 INFO: Loading: depparse
2021-11-08 14:49:39 INFO: Loading: ner
2021-11-08 14:49:40 INFO: Done loading processors!

90 104 DISEASE liver fibrosis
130 149 DISEASE chronic hepatitis B
151 154 DISEASE CHB
226 239 DISEASE CHB infection
```


### 3.4. Biology: BC5CDR

```python
# Reuses `pmid` and `abstract` from section 3.3; only the NER model changes
nlp = stanza.Pipeline('en', package='mimic', processors={'ner': 'bc5cdr'})
doc = nlp(abstract)
for ent in doc.entities:
    print(ent.start_char, ent.end_char, ent.type, ent.text)
```

```text
2021-11-08 14:49:40 INFO: Loading these models for language: en (English):
| Processor | Package |
-----------------------
| tokenize  | mimic   |
| pos       | mimic   |
| lemma     | mimic   |
| depparse  | mimic   |
| ner       | bc5cdr  |

2021-11-08 14:49:40 INFO: Use device: gpu
2021-11-08 14:49:40 INFO: Loading: tokenize
2021-11-08 14:49:40 INFO: Loading: pos
2021-11-08 14:49:41 INFO: Loading: lemma
2021-11-08 14:49:41 INFO: Loading: depparse
2021-11-08 14:49:41 INFO: Loading: ner
2021-11-08 14:49:42 INFO: Done loading processors!

90 104 DISEASE liver fibrosis
130 149 DISEASE chronic hepatitis B
151 154 DISEASE CHB
156 165 DISEASE infection
226 239 DISEASE CHB infection
993 1001 DISEASE fibrosis
1115 1123 DISEASE fibrosis
```
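
Both models tag the same abstract, so their outputs can be compared directly as sets of spans. The sketch below uses only the spans printed in sections 3.3 and 3.4 and runs without Stanza. It is one simple way to see where the models agree, not an evaluation of either model.

```python
# (start_char, end_char, text) spans copied from the two outputs above.
ncbi_disease = {
    (90, 104, "liver fibrosis"),
    (130, 149, "chronic hepatitis B"),
    (151, 154, "CHB"),
    (226, 239, "CHB infection"),
}
bc5cdr = {
    (90, 104, "liver fibrosis"),
    (130, 149, "chronic hepatitis B"),
    (151, 154, "CHB"),
    (156, 165, "infection"),
    (226, 239, "CHB infection"),
    (993, 1001, "fibrosis"),
    (1115, 1123, "fibrosis"),
}

shared = sorted(ncbi_disease & bc5cdr)       # spans found by both models
only_bc5cdr = sorted(bc5cdr - ncbi_disease)  # spans found only by bc5cdr
only_ncbi = sorted(ncbi_disease - bc5cdr)    # spans found only by ncbi_disease

print("shared:", shared)
print("only bc5cdr:", only_bc5cdr)
print("only ncbi_disease:", only_ncbi)
```

On this abstract every `ncbi_disease` span is also found by `bc5cdr`, which additionally tags `infection` and two later mentions of `fibrosis`.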


## 4. Time cost

```python
import time
from tqdm import tqdm

# Reuses `abstract` from section 3.3 and the `nlp` pipeline (bc5cdr NER) from section 3.4
abstracts = [abstract] * 50
t1 = time.time()
for text in tqdm(abstracts, ncols=80):
    doc = nlp(text)
t2 = time.time()
print('used', t2 - t1, 'seconds')
```

```text
100%|███████████████████████████████████████████| 50/50 [00:26<00:00,  1.89it/s]

used 26.956512689590454 seconds
```
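
The measurement above works out to roughly 0.54 seconds per abstract. To repeat it for other pipelines or inputs, the loop can be wrapped in a small helper. This is a sketch: `time_pipeline` is a hypothetical name, and it takes any callable, so it can be tried with a stand-in function before passing a real `nlp` pipeline.

```python
import time


def time_pipeline(annotate, texts):
    """Call `annotate` on each text; return (elapsed_seconds, texts_per_second)."""
    start = time.perf_counter()  # monotonic clock suited to measuring intervals
    for text in texts:
        annotate(text)
    elapsed = time.perf_counter() - start
    rate = len(texts) / elapsed if elapsed > 0 else float("inf")
    return elapsed, rate


# Stand-in for a pipeline so the helper runs without Stanza.
elapsed, rate = time_pipeline(str.upper, ["some text"] * 1000)
print(f"{elapsed:.6f} s, {rate:.0f} texts/s")

# Arithmetic for the run reported above: 50 abstracts in 26.956512689590454 s.
reported = 26.956512689590454
print(f"{reported / 50:.3f} s per abstract, {50 / reported:.2f} abstracts/s")
```

With a real pipeline the call would be `time_pipeline(nlp, abstracts)`.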
