![Part of Speech](https://dailygenius.com/wp-content/uploads/2014/09/handwriting1.jpg)

Source: https://dailygenius.com/handwriting-helps-learn-graphic/

Penn Treebank POS tag list: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

# Part of Speech

Part of speech (POS) refers to the category a word belongs to. Words in the same category behave similarly in a sentence. For example, "word" is a noun while "run" is a verb. To understand an article well, we need to know the POS of its words.

In NLP, POS is important even though we do not always deal with it directly. Lemmatization, and in some pipelines stemming, relies on POS, but some libraries (e.g. spaCy) handle this for us.
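
To see why lemmatization depends on POS, here is a minimal sketch with a hand-written lookup table. The table `TOY_LEMMAS` and the function `toy_lemmatize` are hypothetical names made up for this illustration; this is not how spaCy implements lemmatization.

```python
# Toy lookup table: (surface form, coarse POS) -> lemma.
# The entries are hand-written for illustration only.
TOY_LEMMAS = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
}


def toy_lemmatize(word, pos):
    """Return the lemma for (word, pos), falling back to the lowercased word."""
    return TOY_LEMMAS.get((word.lower(), pos), word.lower())


examples = [("saw", "VERB"), ("saw", "NOUN"), ("meeting", "VERB"), ("meeting", "NOUN")]
for word, pos in examples:
    print('Word: %s, POS: %s, Lemma: %s' % (word, pos, toy_lemmatize(word, pos)))
```

The same surface form maps to a different base form depending on its tag, which is why a tagger usually runs before a lemmatizer.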

In English, the categories include noun, adjective, conjunction and so on. Sometimes the same word can be both a verb and a noun. In Chinese, the two major groups are content words and function words, which in turn include categories such as noun, adverb and conjunction.
This article shows how to do POS tagging for English (via spaCy) and Chinese (via jieba).

In [12]:
# Captured from https://en.wikipedia.org/wiki/Part_of_speech

article = 'In traditional grammar, a part of speech (abbreviated form: PoS or POS) is \
a category of words (or, more generally, of lexical items) which have similar grammatical properties. '

In [13]:
# Captured from https://zh.wikipedia.org/wiki/%E8%A9%9E%E9%A1%9E

article2 = '詞類是一個語言學術語，是一種語言中詞的語法分類，是以語法特徵\
（包括句法功能和形態變化）為主要依據、兼顧詞彙意義對詞進行劃分的結果。'

### spaCy

In [14]:
import spacy
print('spaCy Version: %s' % (spacy.__version__))
spacy_nlp = spacy.load('en_core_web_sm')

spaCy Version: 2.3.5


"a" is tagged DT, which means determiner. "part" is tagged NN, a noun, while "of" is tagged IN, a preposition.

In [15]:
# Run the spaCy pipeline; each token carries its fine-grained tag in token.tag_
doc = spacy_nlp(article)

print('Original Article: %s' % (article))
print()
for token in doc:
    print('Word: %s, POS: %s' % (token.text, token.tag_))

Original Article: In traditional grammar, a part of speech (abbreviated form: PoS or POS) is a category of words (or, more generally, of lexical items) which have similar grammatical properties. 

Word: In, POS: IN
Word: traditional, POS: JJ
Word: grammar, POS: NN
Word: ,, POS: ,
Word: a, POS: DT
Word: part, POS: NN
Word: of, POS: IN
Word: speech, POS: NN
Word: (, POS: -LRB-
Word: abbreviated, POS: VBN
Word: form, POS: NN
Word: :, POS: :
Word: PoS, POS: NNP
Word: or, POS: CC
Word: POS, POS: NNP
Word: ), POS: -RRB-
Word: is, POS: VBZ
Word: a, POS: DT
Word: category, POS: NN
Word: of, POS: IN
Word: words, POS: NNS
Word: (, POS: -LRB-
Word: or, POS: CC
Word: ,, POS: ,
Word: more, POS: RBR
Word: generally, POS: RB
Word: ,, POS: ,
Word: of, POS: IN
Word: lexical, POS: JJ
Word: items, POS: NNS
Word: ), POS: -RRB-
Word: which, POS: WDT
Word: have, POS: VBP
Word: similar, POS: JJ
Word: grammatical, POS: JJ
Word: properties, POS: NNS
Word: ., POS: .
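
The tags printed above follow the Penn Treebank convention. As a reading aid, here is a small glossary covering only the tags that appear in this output. The dictionary `PENN_TAG_GLOSSES` and the helper `describe_tag` are names invented for this sketch, not part of spaCy.

```python
# Glosses for the Penn Treebank tags that appear in the output above.
PENN_TAG_GLOSSES = {
    "CC": "coordinating conjunction",
    "DT": "determiner",
    "IN": "preposition or subordinating conjunction",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "RB": "adverb",
    "RBR": "adverb, comparative",
    "VBN": "verb, past participle",
    "VBP": "verb, non-3rd person singular present",
    "VBZ": "verb, 3rd person singular present",
    "WDT": "wh-determiner",
    "-LRB-": "left round bracket",
    "-RRB-": "right round bracket",
}


def describe_tag(tag):
    """Look up a gloss; tags outside this small table get a placeholder."""
    return PENN_TAG_GLOSSES.get(tag, "(not in this table)")


# A few (word, tag) pairs copied from the output above.
for word, tag in [("a", "DT"), ("part", "NN"), ("of", "IN"), ("words", "NNS")]:
    print('Word: %s, POS: %s (%s)' % (word, tag, describe_tag(tag)))
```

Inside spaCy itself, `spacy.explain(tag)` returns a similar short description for the tags it knows, so a hand-written table like this is only needed when spaCy is not available.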


### jieba

In [11]:
import jieba
print('jieba Version: %s' % jieba.__version__)

import jieba.posseg as jieba_pos_tagger

jieba Version: 0.42.1


"詞類" is tagged n (noun) while "是" is tagged v (verb).

In [6]:
print('Original Article: %s' % (article2))
print()

# posseg.cut yields pair objects exposing .word and .flag (the POS tag)
words = jieba_pos_tagger.cut(article2)

for word in words:
    print('Word: %s, POS: %s' % (word.word, word.flag))



Original Article: 詞類是一個語言學術語，是一種語言中詞的語法分類，是以語法特徵（包括句法功能和形態變化）為主要依據、兼顧詞彙意義對詞進行劃分的結果。





Word: 詞類, POS: n
Word: 是, POS: v
Word: 一個, POS: m
Word: 語言, POS: n
Word: 學術, POS: n
Word: 語, POS: n
Word: ，, POS: x
Word: 是, POS: v
Word: 一種, POS: m
Word: 語, POS: n
Word: 言中, POS: nr
Word: 詞, POS: n
Word: 的, POS: uj
Word: 語法, POS: n
Word: 分類, POS: vn
Word: ，, POS: x
Word: 是, POS: v
Word: 以, POS: p
Word: 語, POS: n
Word: 法特, POS: ns
Word: 徵, POS: zg
Word: （, POS: x
Word: 包括, POS: v
Word: 句法, POS: n
Word: 功能, POS: n
Word: 和, POS: c
Word: 形態, POS: n
Word: 變化, POS: vn
Word: ）, POS: x
Word: 為, POS: zg
Word: 主要, POS: b
Word: 依據, POS: p
Word: 、, POS: x
Word: 兼顧, POS: v
Word: 詞, POS: n
Word: 彙, POS: zg
Word: 意, POS: ng
Word: 義, POS: nt
Word: 對, POS: p
Word: 詞, POS: n
Word: 進, POS: v
Word: 行, POS: v
Word: 劃, POS: v
Word: 分, POS: q
Word: 的, POS: uj
Word: 結, POS: v
Word: 果, POS: ng
Word: 。, POS: x
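
The jieba flags are terser than the Penn tags. Below is a small glossary for most of the flags seen in this output. The glosses are my reading of the ICTCLAS-style tag set that jieba follows, so treat them as an assumption rather than official definitions; `zg` is left out on purpose because I am not sure of its meaning. `JIEBA_FLAG_GLOSSES` and `describe_flag` are names invented for this sketch.

```python
# Assumed glosses for jieba flags seen in the output above (ICTCLAS-style).
JIEBA_FLAG_GLOSSES = {
    "n": "noun",
    "nr": "person name",
    "ns": "place name",
    "nt": "organization name",
    "ng": "noun morpheme",
    "v": "verb",
    "vn": "verbal noun",
    "m": "numeral",
    "q": "measure word",
    "p": "preposition",
    "c": "conjunction",
    "uj": "particle",
    "b": "distinguishing word",
    "x": "punctuation or non-word symbol",
}


def describe_flag(flag):
    """Look up a gloss; unknown flags (e.g. zg) get a placeholder."""
    return JIEBA_FLAG_GLOSSES.get(flag, "(not in this table)")


# A few (word, flag) pairs copied from the output above.
for word, flag in [("詞類", "n"), ("是", "v"), ("一個", "m"), ("和", "c"), ("徵", "zg")]:
    print('Word: %s, POS: %s (%s)' % (word, flag, describe_flag(flag)))
```

Several tokens in the output, such as 進 and 行 being split apart, suggest that the segmentation of this traditional-character text is imperfect, so the tags on those fragments should be read with caution.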


# Conclusion

POS helps a lot in text pre-processing. For example, knowing the POS of a word helps when performing lemmatization, stemming and stop word removal. These three pre-processing steps will be discussed in a later article. Stay tuned.
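
As a small taste of POS-driven pre-processing, here is a sketch that keeps only words whose Penn tag marks a noun, verb, adjective or adverb. The `(word, tag)` pairs are copied from the spaCy output earlier in this article; the prefix list `CONTENT_TAG_PREFIXES` and the helper `keep_content_words` are assumptions made for this illustration. This is a POS-based filter, not the word-list lookup behind spaCy's `token.is_stop`.

```python
# (word, tag) pairs copied from the spaCy output earlier in the article.
tagged = [
    ("In", "IN"), ("traditional", "JJ"), ("grammar", "NN"), (",", ","),
    ("a", "DT"), ("part", "NN"), ("of", "IN"), ("speech", "NN"),
]

# Assumed rule of thumb: noun, verb, adjective and adverb tags carry content.
CONTENT_TAG_PREFIXES = ("NN", "VB", "JJ", "RB")


def keep_content_words(pairs):
    """Return the words whose tag starts with one of the content prefixes."""
    return [word for word, tag in pairs if tag.startswith(CONTENT_TAG_PREFIXES)]


print(keep_content_words(tagged))
# ['traditional', 'grammar', 'part', 'speech']
```

Prepositions, determiners and punctuation drop out, which is roughly the effect stop word removal aims for, although a real stop word list will not match this filter exactly.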

# Reference

Standard Syntactic Categories: https://cs.nyu.edu/grishman/jet/guide/PennPOS.html