# Welcome to Stanza!

![Latest Version](https://img.shields.io/pypi/v/stanza.svg?colorB=bc4545)
![Python Versions](https://img.shields.io/pypi/pyversions/stanza.svg?colorB=bc4545)

Stanza is a Python NLP toolkit that supports 60+ human languages. It is built with highly accurate neural network components that enable efficient training and evaluation with your own annotated data, and offers pretrained models on 100 treebanks. Additionally, Stanza provides a stable, officially maintained Python interface to the Java Stanford CoreNLP toolkit.

In this tutorial, we will demonstrate how to set up Stanza and annotate text with its native neural network NLP models. For the use of the Python CoreNLP interface, please see other tutorials.

## 1. Installing Stanza

Note that Stanza only supports Python 3.6 and above. Installing and importing Stanza are as simple as running the following commands:

In [2]:
# Install; note that the prefix "!" is not needed if you are running in a terminal
!pip install stanza

# Import the package
import stanza

Collecting stanza
  Downloading stanza-1.10.1-py3-none-any.whl.metadata (13 kB)
Collecting emoji (from stanza)
  Downloading emoji-2.14.1-py3-none-any.whl.metadata (5.7 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.3.0->stanza)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.3.0->stanza)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=1.3.0->stanza)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=1.3.0->stanza)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=1.3.0->stanza)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata 
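
If the installation or import fails, a quick check of the interpreter version and of whether the package is visible can narrow things down. The following is a minimal sketch using only the standard library; `check_environment` is a hypothetical helper name, not part of Stanza.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 6)  # minimum version stated in this tutorial


def check_environment():
    """Report whether Python is new enough and whether stanza is importable."""
    return {
        "python_ok": sys.version_info >= MIN_PYTHON,
        "stanza_installed": importlib.util.find_spec("stanza") is not None,
    }


status = check_environment()
print(status)
```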

### More Information

For common troubleshooting, please visit our [troubleshooting page](https://stanfordnlp.github.io/stanfordnlp/installation_usage.html#troubleshooting).

## 2. Downloading Models

You can download models with the `stanza.download` function. The language can be specified with either a full language name (e.g., "english") or a short code (e.g., "en").

By default, models will be saved to your `~/stanza_resources` directory. If you want to save the model files somewhere else, you can pass a `model_dir=your_path` argument (early Stanza releases named this argument `dir`).
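
As a sketch of the custom-path option, the helper below downloads a language's models into a directory of your choice. It assumes the keyword is named `model_dir`, as in recent Stanza releases (early releases called it `dir`); the helper name `download_to` and the folder `~/my_stanza_models` are illustrative only.

```python
from pathlib import Path

# Default location described above
DEFAULT_DIR = Path.home() / "stanza_resources"


def download_to(lang, target_dir=DEFAULT_DIR, verbose=False):
    """Download the default models for `lang` into `target_dir`."""
    import stanza  # imported lazily so this snippet loads even without stanza

    target_dir = Path(target_dir).expanduser()
    # Assumption: recent Stanza releases name this keyword `model_dir`
    stanza.download(lang, model_dir=str(target_dir), verbose=verbose)
    return target_dir


# Example call (needs network access):
# download_to("en", "~/my_stanza_models")
```

A pipeline built afterwards has to be pointed at the same directory; otherwise it looks in the default location.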


In [3]:
# Download an English model into the default directory
print("Downloading English model...")
stanza.download('en')

# Similarly, download a (simplified) Chinese model
# Note that you can use verbose=False to turn off all printed messages
print("Downloading Chinese model...")
stanza.download('zh', verbose=False)

Downloading English model...


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Downloading default packages for language: en (English) ...


Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.10.0/models/default.zip:   0%|          | …

INFO:stanza:Downloaded file to /root/stanza_resources/en/default.zip
INFO:stanza:Finished downloading models and saved to /root/stanza_resources


Downloading Chinese model...


### More Information

Pretrained models are provided for 60+ different languages. For the full list of languages, available models, and corresponding short language codes, please check out the [models page](https://stanfordnlp.github.io/stanza/models.html).


## 3. Processing Text


### Constructing Pipeline

To process a piece of text, you'll need to first construct a `Pipeline` with different `Processor` units. The pipeline is language-specific, so again you'll need to first specify the language (see examples).

- By default, the pipeline will include all processors, including tokenization, multi-word token expansion, part-of-speech tagging, lemmatization, dependency parsing and named entity recognition (for supported languages). However, you can always specify what processors you want to include with the `processors` argument.

- Stanza's pipeline is CUDA-aware: it uses a CUDA device whenever one is available and falls back to the CPU otherwise. You can force the pipeline to run on the CPU by setting `use_gpu=False`.

- Again, you can suppress all printed messages by setting `verbose=False`.
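
The three options above can be combined in a single call. The sketch below collects them into keyword arguments first, so the configuration can be inspected before any model is loaded; `pipeline_kwargs` and `build_pipeline` are hypothetical helper names, while `processors`, `use_gpu` and `verbose` are the `Pipeline` arguments described above.

```python
def pipeline_kwargs(processors=None, use_gpu=None, verbose=False):
    """Collect keyword arguments for stanza.Pipeline, skipping unset options."""
    kwargs = {"verbose": verbose}
    if processors is not None:
        # Pipeline accepts a comma-separated string of processor names
        kwargs["processors"] = ",".join(processors)
    if use_gpu is not None:
        kwargs["use_gpu"] = use_gpu
    return kwargs


def build_pipeline(lang, **options):
    """Build a pipeline for `lang` from the options above."""
    import stanza  # lazy import: only needed when a pipeline is built

    return stanza.Pipeline(lang, **pipeline_kwargs(**options))


# CPU-only tokenization and POS tagging, with logging turned off
cpu_opts = pipeline_kwargs(processors=["tokenize", "pos"], use_gpu=False)
print(cpu_opts)
```

Leaving `use_gpu` unset keeps the default behaviour, in which a GPU is used when one is found.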

In [4]:
# Build an English pipeline, with all processors by default
print("Building an English pipeline...")
en_nlp = stanza.Pipeline('en')

# Build a Chinese pipeline, with customized processor list and no logging, and force it to use CPU
print("Building a Chinese pipeline...")
zh_nlp = stanza.Pipeline('zh', processors='tokenize,lemma,pos,depparse', verbose=False, use_gpu=False)

Building an English pipeline...
Building a Chinese pipeline...


### Annotating Text

Once a pipeline has been constructed, you can annotate a piece of text simply by passing the string to the pipeline object. The pipeline returns a `Document` object, from which the detailed annotations can be accessed. For example:


In [5]:
# Processing English text
en_doc = en_nlp("Barack Obama was born in Hawaii.  He was elected president in 2008.")
print(type(en_doc))
raw_text="君諱泰，字元平，博陵安平人也。九空上襲，姜水濬其長瀾，四履宏開，營口口其曾構。西漢家嗣，見安於夏里，東京元舅，請交於亭伯。自玆以降，慶緒彌隆，庭孕之口，口盈口冕。高祖乗，弱冠有志氣，率性忠烈，後魏釋褐奉朝請光禄大夫、燕州刺史、冀州刺史、左光禄大夫、舞騎大將軍、儀同三司、使持節瀛定相三州諸軍事、定州刺史、侍中、尚書令、司徒公，謚曰靜穆公。曾祖仲哲，後魏龍讓將軍、主客侍郎、鎮遠將軍、營州口口、安平男，謚曰忠。理識沉隱，器懷貞尚，道映時倫，績宣朝伍。祖長瑜，浮陽郡守、太常卿，襲爵安平男。模楷指紳，羽儀廊廟，風移化洽，口屬於分符，禮備樂和，允彰於列棘。父子博，隋户部虞部侍郎、四州刺史。材標國幹，業口書林，效職文昌，百寮傾其雅範，攝官藩部，千里安其惠政。君積潤與原，資芳桂苑，岐嶷夙表，英徽早發。徑寸稱珍，方魏車而更重，盈尺爲寳，况秦城而取貴。孝敬之極，自叶天經，忠亮之規，非緣物獎。洎夫鈎深致遠，王室與銀編並究，屬辭比事，鸞光將鳳艷相輝。仁壽元年，應詔舉，射策甲第。時漢王諒光暦寵命，作牧參野。君以材地兼美，解巾爲漢府典籤。列長裾之賓從，預小山之文藻。高視梧宫，孤標龍岫。及燕謀且發，人馳成軫之辭，吴兵遂舉，家上周丘之策。君深體逆順，妙達機兆，屢陳忠講，因致猜嫌，遂以疾辭，免玆尤费。於是韬光衡泌，閉想簪纓，馳驚九流之宗，迴翔千載之表。氣積星頊，神清林澤。大業中，召補左武衛兵曹，非其好也。乃掛冠獨往，逃難他方。爰屬運初，委身從義，拜通議大夫，尋除監察御史。芬冠執憲，霜簡直筆，志存矯枉，情無屈橈。時以晉陽之地，王業攸基，餘寇未平，嚴城尚警。口奉口巡撫，式光原隰。遵塗北邁，兇黨南侵，君潛運謀猷，星馳表檄。大軍既至，羣袄遂口，特降璽書，深加慰喻，進授輕車都尉。武德五年，轉萬年縣丞。帝城貴要，口謂難繩，城里豪門，尤多私謁。君抗心奉法，正身直道，居職累載，聲譽甚隆。貞觀初，遷洛州長水縣令。中牟德被，遠慚馴雉，重泉政美，有娩翔驚。六年，遷蘇州司馬。同彦威之稱職，詛止文章，類儒宗之納善，實惟忠肅。嗟乎。南康仙駕，未戾於三山，壤口骥足，遠輟於千里。以貞觀十年十一月六日終於官所，年六十有一。夫人隴西李氏，籍慶高門，凝華中谷，貞姿玉映，淑問風揚。粤自移天，來儀君子，懋蘋繫於行潦，諧瑟琴於異室，女圖弘訓，母德馳芳。與善何愆，徂光奄謝。但故鄉絶人，先望遼遠，上下諮謀，改斯宅兆，爰卜邙洛，用定終居。粤以永徽六年十月一日合葬於洛州河南縣平樂鄉華邑里邙山之原。嗚呼哀哉。迺爲銘曰：天齊形勝，投釣開封。長岑博雅，弈葉雕龍。懷金鏘玉，疊構連峰。森梢良梓，磊落喬松。惟祖惟考，道風逾盛。績表遺縑，息孚留詠。懿哉君子，誕膺家慶。具美攸鍾，多能無競。始口窮運，終會昌辰。蘭臺振操，赤縣霑口。牛刀暫屈，驟足俄申。上才方遠，高春遽淪。猗歟令偶，蕙心瓊潔。昔奉齊眉，今歸同穴。口旗縈委，口扉冥滅。去矣佳城，悠哉芳烈。"
# Processing Chinese text
zh_doc = zh_nlp(raw_text)
print(type(zh_doc))

<class 'stanza.models.common.doc.Document'>
<class 'stanza.models.common.doc.Document'>


In [7]:
zh_doc.sentences

[[
   {
     "id": 1,
     "text": "君諱泰",
     "lemma": "君諱泰",
     "upos": "PROPN",
     "xpos": "NNP",
     "head": 0,
     "deprel": "root",
     "start_char": 0,
     "end_char": 3,
     "misc": "SpaceAfter=No"
   },
   {
     "id": 2,
     "text": "，",
     "lemma": "，",
     "upos": "PUNCT",
     "xpos": ",",
     "head": 3,
     "deprel": "punct",
     "start_char": 3,
     "end_char": 4,
     "misc": "SpaceAfter=No"
   },
   {
     "id": 3,
     "text": "字元",
     "lemma": "字元",
     "upos": "NOUN",
     "xpos": "NN",
     "head": 1,
     "deprel": "appos",
     "start_char": 4,
     "end_char": 6,
     "misc": "SpaceAfter=No"
   },
   {
     "id": 4,
     "text": "平",
     "lemma": "平",
     "upos": "PROPN",
     "xpos": "NNP",
     "head": 1,
     "deprel": "appos",
     "start_char": 6,
     "end_char": 7,
     "misc": "SpaceAfter=No"
   },
   {
     "id": 5,
     "text": "，",
     "lemma": "，",
     "upos": "PUNCT",
     "xpos": ",",
     "head": 8,
     "deprel": "punct",


### More Information

For more information on constructing a pipeline and on the different processors, please visit our [pipeline page](https://stanfordnlp.github.io/stanfordnlp/pipeline.html).

## 4. Accessing Annotations

Annotations can be accessed from the returned `Document` object.

A `Document` contains a list of `Sentence`s, and a `Sentence` contains a list of `Token`s and `Word`s. For the most part `Token`s and `Word`s overlap, but some tokens can be divided into multiple words. For instance, the French token `aux` is divided into the words `à` and `les`, whereas in English a word and a token are equivalent. Note that dependency parses are derived over `Word`s.

Additionally, a `Span` object is used to represent annotations that are part of a document, such as named entity mentions.
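
To make the token/word distinction concrete, the sketch below lists the tokens of a sentence that expand into more than one word. It relies on a sentence exposing `tokens` and each token exposing `words` and `text`; the helper name `multiword_tokens` is hypothetical, and the sentence is a hand-built stand-in rather than real model output, so the snippet runs without downloading a French model.

```python
from types import SimpleNamespace as NS


def multiword_tokens(sentence):
    """Return (token text, [word texts]) for tokens that expand to several words."""
    return [
        (token.text, [word.text for word in token.words])
        for token in sentence.tokens
        if len(token.words) > 1
    ]


# Hand-built stand-in for a parsed French sentence (not real model output)
sentence = NS(tokens=[
    NS(text="Il", words=[NS(text="Il")]),
    NS(text="parle", words=[NS(text="parle")]),
    NS(text="aux", words=[NS(text="à"), NS(text="les")]),
    NS(text="enfants", words=[NS(text="enfants")]),
])

print(multiword_tokens(sentence))
```

With a real pipeline, any element of `doc.sentences` can be passed in place of the stand-in.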


The following example iterates over all English sentences and words, and prints the word information one by one:

In [None]:
for i, sent in enumerate(en_doc.sentences):
    print("[Sentence {}]".format(i+1))
    for word in sent.words:
        print("{:12s}\t{:12s}\t{:6s}\t{:d}\t{:12s}".format(
              word.text, word.lemma, word.pos, word.head, word.deprel))
    print("")
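
The `head` value printed for each word is a 1-based index into the sentence's word list, with 0 marking the root. The sketch below turns that index back into the head word's text; `dependency_edges` is a hypothetical helper, and the sentence is a hand-built stand-in with illustrative values rather than model output.

```python
from types import SimpleNamespace as NS


def dependency_edges(sentence):
    """Yield (word, relation, head word) triples; head index 0 marks the root."""
    for word in sentence.words:
        # word.head is a 1-based index into sentence.words
        head_text = sentence.words[word.head - 1].text if word.head > 0 else "ROOT"
        yield word.text, word.deprel, head_text


# Stand-in sentence built by hand (illustrative values, not model output)
sentence = NS(words=[
    NS(text="She", head=2, deprel="nsubj"),
    NS(text="sings", head=0, deprel="root"),
    NS(text=".", head=2, deprel="punct"),
])

edges = list(dependency_edges(sentence))
for text, rel, head in edges:
    print(f"{text} --{rel}--> {head}")
```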

The following example iterates over all extracted named entity mentions and prints out their character spans and types.

In [None]:
print("Mention text\tType\tStart-End")
for ent in en_doc.ents:
    print("{}\t{}\t{}-{}".format(ent.text, ent.type, ent.start_char, ent.end_char))
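
When a document contains many mentions, grouping them by type is often more useful than a flat listing. The sketch below does this with the same `ents`, `text` and `type` attributes used in the loop above; `entities_by_type` is a hypothetical helper, and the stand-in document uses assumed labels rather than verified model output.

```python
from collections import defaultdict
from types import SimpleNamespace as NS


def entities_by_type(doc):
    """Group entity mention texts by entity type, in document order."""
    groups = defaultdict(list)
    for ent in doc.ents:
        groups[ent.type].append(ent.text)
    return dict(groups)


# Stand-in document; the types shown are assumed labels, not verified output
doc = NS(ents=[
    NS(text="Barack Obama", type="PERSON"),
    NS(text="Hawaii", type="GPE"),
    NS(text="2008", type="DATE"),
])

grouped = entities_by_type(doc)
print(grouped)
```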

And similarly for the Chinese text:

In [6]:
for i, sent in enumerate(zh_doc.sentences):
    print("[Sentence {}]".format(i+1))
    for word in sent.words:
        print("{:12s}\t{:12s}\t{:6s}\t{:d}\t{:12s}".format(
              word.text, word.lemma, word.pos, word.head, word.deprel))
    print("")

[Sentence 1]
君諱泰         	君諱泰         	PROPN 	0	root        
，           	，           	PUNCT 	3	punct       
字元          	字元          	NOUN  	1	appos       
平           	平           	PROPN 	1	appos       
，           	，           	PUNCT 	8	punct       
博陵          	博陵          	PROPN 	8	nmod        
安平          	安平          	PROPN 	8	compound    
人           	人           	PART  	1	appos       
也           	也           	SCONJ 	8	mark        
。           	。           	PUNCT 	1	punct       

[Sentence 2]
九空          	九空          	PROPN 	3	nsubj       
上           	上           	NOUN  	1	acl         
襲           	襲           	VERB  	0	root        
，           	，           	PUNCT 	17	punct       
姜           	姜           	PROPN 	17	nsubj       
水           	水           	PROPN 	5	flat:name   
濬           	濬           	PROPN 	5	flat:name   
其           	其           	PRON  	5	appos       
長           	長           	PROPN 	5	appos       
瀾           	瀾           	PROPN 	9	flat:name   
，          

Alternatively, you can directly print a `Word` object to view all its annotations as a Python dict:

In [None]:
word = en_doc.sentences[0].words[0]
print(word)
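
If you need the annotations as data rather than printed text, a word can also be converted explicitly. The sketch below assumes that `Word` offers a `to_dict()` method returning a plain dict; `word_to_json` and `FakeWord` are hypothetical names, and the stand-in reuses the values shown for the word `平` in the Chinese output earlier in this tutorial.

```python
import json


def word_to_json(word, fields=("text", "lemma", "upos", "head", "deprel")):
    """Serialize selected annotations of a word to a JSON string."""
    data = word.to_dict()  # assumption: Word.to_dict() returns a plain dict
    selected = {key: data[key] for key in fields if key in data}
    return json.dumps(selected, ensure_ascii=False)


class FakeWord:
    """Stand-in with the same to_dict() method, so this runs without models."""

    def to_dict(self):
        return {"id": 4, "text": "平", "lemma": "平", "upos": "PROPN",
                "xpos": "NNP", "head": 1, "deprel": "appos"}


print(word_to_json(FakeWord()))
```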

### More Information

For more information on the different data objects, please visit our [data objects page](https://stanfordnlp.github.io/stanza/data_objects.html).

## 5. Resources

Apart from this interactive tutorial, we also provide tutorials on our website that cover a variety of use cases such as how to use different model "packages" for a language, how to use spaCy as a tokenizer, how to process pretokenized text without running the tokenizer, etc. For these tutorials please visit [our Tutorials page](https://stanfordnlp.github.io/stanza/tutorials.html).

Other resources that you may find helpful include:

- [Stanza Homepage](https://stanfordnlp.github.io/stanza/index.html)
- [FAQs](https://stanfordnlp.github.io/stanza/faq.html)
- [GitHub Repo](https://github.com/stanfordnlp/stanza)
- [Reporting Issues](https://github.com/stanfordnlp/stanza/issues)
- [Stanza System Description Paper](http://arxiv.org/abs/2003.07082)
