 <div>
<img src="https://edlitera-images.s3.amazonaws.com/new_edlitera_logo.png" width="500"/>
</div>

# `Syntactic characteristics`


* syntactic characteristics allow us to analyze the grammatical structure of a language
    * we analyze the relationships between different words

**Three ways to analyze the relationships that exist between words:**

* `POS tagging`
* `parsing`
* `chunking`

* we won't go over `parsing` and `chunking`, and will instead focus on `POS tagging` 
    <br>
    
    * before we explain what `POS tagging` is, we need to explain what `treebanks` are


## `Treebanks`

* treebanks are text collections that have been parsed and annotated for syntactic structure

* **`treebanks`** are usually built on top of **POS tagged corpora** (we will explain POS tagging in just a bit)

* **`Universal Dependencies`** is a collection of over 100 treebanks in more than 60 languages, and it keeps growing with each release

* for English, the **`Penn Treebank`** has become the de facto standard

## `POS tagging`

* assigning **`POS (Part-Of-Speech) tags`** to words

* a part of speech is a class of words that play a similar syntactic role in a sentence

**Basic POS tags:**

* `N` - noun
* `V` - verb
* `A` - article
* `ADJ` - adjective
* `P` - preposition
* `CON` - conjunction
* `PRO` - pronoun
* `INT` - interjection

**Example:** "She sells dog food"

* She - `PRO` (pronoun)
* sells - `V` (verb)
* dog - `N` (noun)
* food - `N` (noun)

* performing **`POS tagging`** is often hard because human language is very ambiguous
    * some words can belong to more than one part of speech (e.g. the word "object" can be a noun or a verb depending on the context)
    * there are a lot of subcategories (e.g. singular nouns, plural nouns, etc.)
    * different **`treebanks`** have different abbreviations for their **`POS tags`**
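
**Sketch: spotting ambiguous words.** To make the ambiguity concrete, here is a minimal sketch that uses a tiny hand-written lexicon. The `LEXICON` dictionary and both helper functions are hypothetical and made up for illustration; the tags are the basic tags listed above.

```python
# Hypothetical toy lexicon: each word maps to every basic tag it can take.
LEXICON = {
    "she": {"PRO"},
    "object": {"N", "V"},   # "an object" vs. "to object"
    "sells": {"V"},
    "dog": {"N", "V"},      # "a dog" vs. "to dog someone"
    "food": {"N"},
}


def candidate_tags(word):
    """Return the set of possible tags for a word (empty set if unknown)."""
    return LEXICON.get(word.lower(), set())


def ambiguous_words(words):
    """Return the words that have more than one candidate tag."""
    return [w for w in words if len(candidate_tags(w)) > 1]


print(ambiguous_words(["She", "sells", "dog", "food"]))  # ['dog']
```

A lexicon alone can only list the candidates; choosing the right one requires looking at the surrounding words, which is what the tagging algorithms in the next section do.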

### `POS tagging algorithms:`

* two types:
    <br>
    
    * **`rule based algorithms`**
    * **`statistical methods`**

**`Rule based:`**

* depend on dictionaries, lexicons, regular expressions, etc. to predict **`POS tags`**
* the accuracy of purely hand-written rules is limited:
    * an early rule based **`POS tagging`** program (TAGGIT) correctly tagged about **77%** of the **`Brown corpus`** (a standard corpus for testing **`POS tagging`** accuracy)
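
**Sketch: a tiny rule based tagger.** The idea can be illustrated with a few regular expressions that are tried in order. This is only a toy under stated assumptions: the `RULES` list and the `rule_based_tag` function are invented for this example and use the basic tags from above. (NLTK ships a similar ready-made class, `nltk.RegexpTagger`.)

```python
import re

# Hypothetical rules, checked in order; the first full match wins.
RULES = [
    (re.compile(r"the|a|an", re.IGNORECASE), "A"),                    # articles
    (re.compile(r"he|she|it|they|we|you|i", re.IGNORECASE), "PRO"),  # pronouns
    (re.compile(r".*(ing|ed)"), "V"),                                 # verb-like suffixes
    (re.compile(r".*(ous|ful|able)"), "ADJ"),                         # adjective-like suffixes
    (re.compile(r".*"), "N"),                                         # default: noun
]


def rule_based_tag(words):
    """Tag each word with the tag of the first rule that matches it."""
    tagged = []
    for word in words:
        for pattern, tag in RULES:
            if pattern.fullmatch(word):
                tagged.append((word, tag))
                break
    return tagged


print(rule_based_tag(["She", "is", "baking", "a", "wonderful", "cake"]))
# [('She', 'PRO'), ('is', 'N'), ('baking', 'V'), ('a', 'A'), ('wonderful', 'ADJ'), ('cake', 'N')]
```

Note that "is" falls through to the default rule and is wrongly tagged as a noun. Every such gap needs another hand-written rule, which is one reason why purely rule based taggers struggle to reach high accuracy.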

**`Statistical methods:`**

* statistical methods have become steadily more popular over the years
* probably the best-known example is the **`Hidden Markov model`**
* these models predict the tags of ambiguous words much more accurately than standard rule-based systems
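
**Sketch: how a Hidden Markov model picks tags.** In an HMM tagger the tags are the hidden states and the words are the observations; the Viterbi algorithm finds the most probable tag sequence. The sketch below is a minimal illustration, not a trained tagger: all probabilities in `START`, `TRANS` and `EMIT` are invented numbers, and the function name `viterbi` is our own.

```python
TAGS = ["PRO", "V", "N"]

# Made-up probabilities, for illustration only.
START = {"PRO": 0.6, "V": 0.1, "N": 0.3}            # P(first tag)
TRANS = {                                            # P(next tag | current tag)
    "PRO": {"PRO": 0.05, "V": 0.85, "N": 0.10},
    "V":   {"PRO": 0.20, "V": 0.10, "N": 0.70},
    "N":   {"PRO": 0.05, "V": 0.35, "N": 0.60},
}
EMIT = {                                             # P(word | tag)
    "PRO": {"she": 1.0},
    "V":   {"sells": 0.7, "dog": 0.3},
    "N":   {"sells": 0.1, "dog": 0.5, "food": 0.4},
}


def viterbi(words):
    """Return the most probable (word, tag) sequence under the toy model."""
    # best[t][tag] = (probability of the best path ending in tag, previous tag)
    best = [{t: (START[t] * EMIT[t].get(words[0], 0.0), None) for t in TAGS}]
    for word in words[1:]:
        column = {}
        for tag in TAGS:
            column[tag] = max(
                (best[-1][prev][0] * TRANS[prev][tag] * EMIT[tag].get(word, 0.0), prev)
                for prev in TAGS
            )
        best.append(column)

    # Backtrack from the most probable final tag.
    path = [max(TAGS, key=lambda t: best[-1][t][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return list(zip(words, path))


print(viterbi(["she", "sells", "dog", "food"]))
# [('she', 'PRO'), ('sells', 'V'), ('dog', 'N'), ('food', 'N')]
```

In this toy model "dog" can be emitted by both `V` and `N`; the high `V` to `N` transition probability is what tips the decision towards noun. A real tagger estimates all of these probabilities from a tagged corpus.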

**Bonus: `Machine Learning`**

* lately, various **`Machine Learning`** algorithms have become increasingly popular
* most achieve an **accuracy of over 97%**
* using **`Deep Learning`** for **`POS tagging`** shows even more promise than using classic **`Machine Learning`**

# `POS tagging` in `NLTK`

* tagging each word in a text manually would be very cumbersome; thankfully, **`NLTK`** solves that problem for us with a built-in **`POS tagger`**


* we call upon that **`POS tagger`** using  **`nltk.pos_tag()`** 
    <br>
    
    * **`pos_tag()`**  currently uses an averaged perceptron as the tagging algorithm
    * the automatic POS tags generated using  **`pos_tag()`**  are based on the **`Penn Treebank tagset`**

* if you want to POS tag several sentences at once (each given as a list of words), use **`pos_tag_sents()`**

**FYI:** If you don't have **`nltk`** installed:

`conda install nltk` or `pip install nltk`

In [1]:
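# Import nltk so that we can use the POS-tagger
# (pos_tag() also needs the tagger model data; if it is missing, NLTK raises
# a LookupError whose message names the resource to fetch with nltk.download())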

import nltk
from nltk import pos_tag
from nltk import pos_tag_sents

In [2]:
# Then we can create our data
# in the form of a list of words
# and a list of sentences

words = ['Life', 'is', 'what', 'happens', 'when', 'you', 'are', 'busy', 'making', 'other', 'plans']
sents = [["He", "is", "a", "baker"], ["She", "sells", "dog", "food"]]

In [3]:
# Let's do some POS-tagging

tagged_words = pos_tag(words)
tagged_sentences = pos_tag_sents(sents)

In [4]:
# Take a look at the tagged text

tagged_words

[('Life', 'NNP'),
 ('is', 'VBZ'),
 ('what', 'WP'),
 ('happens', 'VBZ'),
 ('when', 'WRB'),
 ('you', 'PRP'),
 ('are', 'VBP'),
 ('busy', 'JJ'),
 ('making', 'VBG'),
 ('other', 'JJ'),
 ('plans', 'NNS')]

In [5]:
# Take a look at the tagged sentences

tagged_sentences

[[('He', 'PRP'), ('is', 'VBZ'), ('a', 'DT'), ('baker', 'NN')],
 [('She', 'PRP'), ('sells', 'VBZ'), ('dog', 'NN'), ('food', 'NN')]]
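
**Sketch: working with the tagged output.** Since the tagger returns plain `(word, tag)` tuples, ordinary Python is enough to filter them. The helper `words_with_tag_prefix` below is our own illustrative function, not part of NLTK; the data is copied from the output shown above so the snippet runs on its own.

```python
# Tagged words copied from the pos_tag() output above.
tagged_words = [
    ('Life', 'NNP'), ('is', 'VBZ'), ('what', 'WP'), ('happens', 'VBZ'),
    ('when', 'WRB'), ('you', 'PRP'), ('are', 'VBP'), ('busy', 'JJ'),
    ('making', 'VBG'), ('other', 'JJ'), ('plans', 'NNS'),
]


def words_with_tag_prefix(tagged, prefix):
    """Return the words whose tag starts with the given prefix."""
    return [word for word, tag in tagged if tag.startswith(prefix)]


# Penn Treebank noun tags all start with "NN", verb tags with "VB".
nouns = words_with_tag_prefix(tagged_words, "NN")
verbs = words_with_tag_prefix(tagged_words, "VB")

print(nouns)  # ['Life', 'plans']
print(verbs)  # ['is', 'happens', 'are', 'making']
```

Matching on the prefix is a convenient way to group the Penn Treebank subcategories (e.g. `NN`, `NNS`, `NNP`) back into one broad class.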

**BONUS: display what a tag represents**
    
* we do it by loading in a tagset and checking what a certain tag means

In [6]:
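# Load NLTK resource files
# (the tagset help data must be available locally; if it is missing,
# the LookupError raised by load() names the resource to download)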

from nltk.data import load

tag_dict = load('help/tagsets/upenn_tagset.pickle')

In [7]:
# Display what a tag represents

tag_dict['PRP'][0]

'pronoun, personal'
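
**Sketch: collapsing Penn Treebank tags into the basic tags.** Because different tagsets use different abbreviations, it is sometimes handy to map the fine-grained Penn Treebank tags back to the basic tags from the beginning of this notebook. The mapping below is a rough, hypothetical approximation written for this example (for instance, `DT` covers all determiners, not only articles, and `IN` also covers subordinating conjunctions).

```python
# Hypothetical, lossy mapping: Penn Treebank tag prefix -> basic tag.
PENN_TO_BASIC = {
    "NN": "N",     # NN, NNS, NNP, NNPS
    "VB": "V",     # VB, VBD, VBG, VBN, VBP, VBZ
    "DT": "A",     # determiners (approximation for articles)
    "JJ": "ADJ",   # JJ, JJR, JJS
    "IN": "P",     # prepositions (also subordinating conjunctions)
    "CC": "CON",   # coordinating conjunctions
    "PRP": "PRO",  # personal and possessive pronouns
    "WP": "PRO",   # wh-pronouns
    "UH": "INT",   # interjections
}


def to_basic(tag):
    """Map a Penn Treebank tag to a basic tag; unknown tags are returned as-is."""
    for prefix, basic in PENN_TO_BASIC.items():
        if tag.startswith(prefix):
            return basic
    return tag


# Tagged sentence copied from the pos_tag_sents() output above.
sentence = [('She', 'PRP'), ('sells', 'VBZ'), ('dog', 'NN'), ('food', 'NN')]
print([(word, to_basic(tag)) for word, tag in sentence])
# [('She', 'PRO'), ('sells', 'V'), ('dog', 'N'), ('food', 'N')]
```

The result matches the hand-tagged "She sells dog food" example from the `POS tagging` section.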

# `Syntactic characteristics cheat sheet`

* these characteristics are directly connected to the grammar of a particular language

* the three procedures we perform are:

    * POS tagging
    * Parsing
    * Chunking

    

### `POS tagging`

* assigning **`POS (Part-Of-Speech) tags`** to words using either **`rule based algorithms`** or **`statistical methods`** (in recent times we also use **`Machine Learning algorithms`**)

* **`NLTK`** has a built-in POS tagger that takes in words and outputs pairs of words and their POS tags
    * the tags are based on the **`Penn Treebank tagset`**
