# Exercise Sheet 2 - Linguistic Concepts
## Learning Objectives

In this sheet we are going to:
- learn more about linguistic structure, analysis and data
- do a few exercises on basic linguistic concepts
- study some functions in `nltk` for linguistic analysis


---
# Pen & Paper Exercises
# 1. Morphology

See lecture slides 16-17.

1. What is the stem of each of the following words: *amusing*, *amusement*, and *amused*?
2. What is the lemma of each of the following words: *amusing*, *amusement*, and *amused*?
3. Name two applications that might use these morphological processes.

---
# 2. Syntax
## 2.1. Part-of-Speech Tagging
Use the POS tag set from the slides to annotate each word in the following sentence with the correct part of speech (see lecture slides 29-30):

> *He had an expensive, but very good lunch at the Thai restaurant with the big windows that is opposite the church.*

---
## 2.2. Context-free Grammars
Define a context-free grammar with production rules for terminal and non-terminal symbols that can be used to analyse/generate the following sentence (see lecture slides 34-35):

> *He had lunch at the restaurant.*
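
Once you have written your grammar on paper, `nltk` can check whether it really derives a sentence. The following is a minimal sketch on a different toy sentence; the grammar, the sentence and the variable names are illustrative assumptions, not the expected solution. `nltk.CFG.fromstring` reads the production rules and `nltk.ChartParser` returns every parse tree the grammar licenses.

```python
import nltk

# Toy grammar (an assumption for illustration): non-terminals on the left,
# quoted terminals on the right-hand side of the lexical rules.
toy_grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det N
    VP -> V NP
    Det -> 'the' | 'a'
    N -> 'dog' | 'cat'
    V -> 'chased'
""")

toy_parser = nltk.ChartParser(toy_grammar)
toy_sentence = "the dog chased a cat".split()

# Collect all parse trees; an empty list means the grammar cannot derive the sentence.
toy_trees = list(toy_parser.parse(toy_sentence))
for tree in toy_trees:
    print(tree)
```

Swapping in your own rules and sentence lets you verify your answer. A word that is missing from the grammar raises a `ValueError` rather than returning an empty result.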

---
## 2.3. Constituency Parsing
Draw the phrase structure tree for the sentence from 2.2 (see lecture slides 35-36):

> *He had lunch at the restaurant.*

---
### 2.3.1. Recursive Grammar and Phrase Structure
Define the grammar rules and draw the phrase structure for the following sentence (see lecture slides 37-38):

> *He had lunch at the restaurant on the corner of the street.*

---
## 2.4. Dependency Parsing

Draw the dependency structure for the following sentence:

> *We ate at a cheap restaurant on the corner of the street.*

See lecture slides 40-41 and the Universal Dependencies website for the [list of relations](https://universaldependencies.org/u/dep/) and [annotation guidelines](https://universaldependencies.org/u/overview/syntax.html). You can also encounter another dependency annotation scheme, [Stanford dependencies](https://nlp.stanford.edu/software/dependencies_manual.pdf).

---
# NLTK
Now we will look at some functions in `nltk` for morphological processing of texts.

In [None]:
import nltk

# Newer NLTK releases may additionally require 'punkt_tab' and 'averaged_perceptron_tagger_eng'.
nltk.download(['punkt', 'wordnet', 'omw-1.4', 'averaged_perceptron_tagger', 'universal_tagset'])

---
# 3. Tokenization

Tokenization is the process of dividing a text into smaller units, i.e. tokens.

`nltk` provides a `word_tokenize` function to tokenize a sentence in a given language.

In [None]:
from nltk.tokenize import word_tokenize

sentence = "Galway has a year-round mild, moist, temperate and changeable climate, due to \
the prevailing winds of the North Atlantic Current together with the Gulf Stream."

tokens = word_tokenize(sentence, language='english')
tokens

`nltk` has many pre-defined tokenizers and you can also create your own. More details about the `tokenize` submodule are available [here](https://www.nltk.org/api/nltk.tokenize.html). 
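
As a minimal sketch of what "creating your own" tokenizer can look like, `RegexpTokenizer` builds a tokenizer from a regular expression. The pattern, the shortened example sentence and the variable names below are assumptions chosen for illustration: the pattern keeps runs of word characters (with internal hyphens) and drops punctuation.

```python
from nltk.tokenize import RegexpTokenizer

# Match a run of word characters, optionally followed by hyphenated parts.
# Everything that does not match (spaces, commas, full stops) is discarded.
word_only_tokenizer = RegexpTokenizer(r"\w+(?:-\w+)*")

example_text = "Galway has a year-round mild, moist climate."
custom_tokens = word_only_tokenizer.tokenize(example_text)
print(custom_tokens)
# ['Galway', 'has', 'a', 'year-round', 'mild', 'moist', 'climate']
```

Unlike `word_tokenize`, this tokenizer does not emit punctuation tokens at all, which may or may not be what a downstream task needs.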

The default tokenizer uses two classes, [PunktSentenceTokenizer](https://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.punkt.PunktSentenceTokenizer) and [TreebankWordTokenizer](https://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.treebank.TreebankWordTokenizer), to split a document into sentences and words respectively (recent NLTK versions use a Treebank-style variant, `NLTKWordTokenizer`, for the word step).

You can also call `dir` on any Python module to see what attributes (classes and functions) are defined inside it. Run the following cell to see the attributes in the `tokenize` subpackage.

In [None]:
dir(nltk.tokenize)

Run the following cell to see the details about the `WhitespaceTokenizer`.

In [None]:
?nltk.tokenize.WhitespaceTokenizer

Run the following cell to tokenize the same sentence using the `WhitespaceTokenizer`.

In [None]:
from nltk.tokenize import WhitespaceTokenizer

tokenizer = WhitespaceTokenizer()
tokenizer.tokenize(sentence)

Do you notice any differences here compared to the default tokenizer?

---

Certain types of text require special consideration during tokenization. Tweets, for example, contain hashtags, mentions and emoji.

Tokenize the following tweet using the default tokenizer and `TweetTokenizer`.

In [None]:
tweet = '''Dr @OmniaHZayed gives us a smashing overview of #SemanticAnalysis Information Extraction she elucidates how it all work & what are the likely challenges. #wordsensedisambiguation 
#homonymy 
#parsing 
#semanticrolelabelling 
#NLP'''

In [None]:
# tokenize using the default tokenizer

In [None]:
# tokenize using TweetTokenizer

---
# 4. Stemming
Stemming is a simplified analysis of word structure that removes word endings (and sometimes beginnings), leaving a common stem. There are many stemming algorithms, with [PorterStemmer](https://www.nltk.org/_modules/nltk/stem/porter.html#PorterStemmer) being one of the most widely used. Run the following cell to see the output from the stemmer for the sentence defined above.

In [None]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for token in tokens:
    print(stemmer.stem(token))
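
To see what "leaving a common stem" means in practice, the sketch below stems a family of morphologically related words. The word list and the variable names are assumptions chosen for illustration.

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# A family of related word forms that should collapse to one stem.
word_family = ["connect", "connected", "connecting", "connection", "connections"]
family_stems = [porter.stem(word) for word in word_family]

for word, stem in zip(word_family, family_stems):
    print(f"{word:12} -> {stem}")
```

All five forms are reduced to `connect`. Note that a stem produced by suffix stripping is not guaranteed to be a dictionary word, as some of the tokens in the output above show.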

---
# 5. Lemmatization
Lemmatization is the linguistic analysis of word structure that maps morphologically related words to a common lemma.

Lemmatization is more complex than stemming, as it depends on correctly identifying the part of speech and meaning of a word in a sentence. Run the following cell to see the output from the [WordNetLemmatizer](https://www.nltk.org/api/nltk.stem.html#nltk.stem.wordnet.WordNetLemmatizer).

In [None]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
for token in tokens:
    print(lemmatizer.lemmatize(token))

# 6. Part-of-Speech Tagging

Finally, `nltk` provides a `pos_tag` function to annotate a sentence using an *off-the-shelf* tagger which uses the [Penn Treebank tagset](https://www.nltk.org/book/ch05.html) for the English language.

**UPenn Tagset**

    CC — coordinating conjunction
    CD — cardinal number
    DT — determiner
    EX — existential there (“there is”, “there exists”)
    FW — foreign word
    IN — preposition/subordinating conjunction
    JJ — adjective (‘big’)
    JJR — adjective, comparative (‘bigger’)
    JJS — adjective, superlative (‘biggest’)
    LS — list marker
    MD — modal ('could', 'will')
    NN — noun, singular (‘desk’)
    NNS — noun plural (‘desks’)
    NNP — proper noun, singular (‘Harrison’)
    NNPS — proper noun, plural (‘Americans’)
    PDT — predeterminer (‘all the kids’)
    POS — possessive ending ('parent’s')
    PRP — personal pronoun ('I', 'he', 'she')
    PRP$ — possessive pronoun ('my', 'his', 'her')
    RB — adverb ('very', 'silently')
    RBR — adverb, comparative ('better')
    RBS — adverb, superlative ('best')
    RP — particle ('give up')
    TO — to-particle ('to go')
    UH — interjection ('errrrrrrrm')
    VB — verb, base form ('take')
    VBD — verb, past tense ('took')
    VBG — verb, gerund/present participle ('taking')
    VBN — verb, past participle ('taken')
    VBP — verb, non-3rd person sing. present ('take')
    VBZ — verb, 3rd person sing. present ('takes')
    WDT — wh-determiner ('which')
    WP — wh-pronoun ('who', 'what')
    WP$ — possessive wh-pronoun ('whose')
    WRB — wh-adverb ('where', 'when')

There is a [more modern and language-independent tagset](http://universaldependencies.org/u/pos/) used within the [Universal Dependencies](http://universaldependencies.org/) framework. Each part of speech also has a set of features (e.g. nouns can have *number, case, gender*; verbs can have *tense, person, number*, etc.), which are described [here](https://universaldependencies.org/u/feat/index.html). You can map the default UPenn tags to a universal tagset by specifying `tagset='universal'` in NLTK's `pos_tag` function. Note that NLTK's `universal` mapping uses an older 12-tag set, so some labels differ from the UPOS names (e.g. `CONJ`, `PRT` and `.` rather than `CCONJ`/`SCONJ`, `PART` and `PUNCT`).

**UPOS tagset**

    ADJ: adjective
    ADP: adposition
    ADV: adverb
    AUX: auxiliary
    CCONJ: coordinating conjunction
    DET: determiner
    INTJ: interjection
    NOUN: noun
    NUM: numeral
    PART: particle
    PRON: pronoun
    PROPN: proper noun
    PUNCT: punctuation
    SCONJ: subordinating conjunction
    SYM: symbol
    VERB: verb
    X: other


The `pos_tag` function uses a so-called [PerceptronTagger](http://www.nltk.org/api/nltk.tag.html#nltk.tag.perceptron.PerceptronTagger). The model was trained on Sections 00-18 of the Wall Street Journal portion of OntoNotes 5. The original implementation comes from Matthew Honnibal (you can read more about it [here](https://explosion.ai/blog/part-of-speech-pos-tagger-in-python)), and it outperforms the earlier maximum entropy POS model in NLTK.


You can also create your own tagger which can **learn** from annotated data. This topic will be covered in detail in Lecture 4 on sequence modeling.
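
As a small preview, and only as a sketch, the simplest tagger that learns from annotated data is a `UnigramTagger`: it memorizes the most frequent tag of each word in the training sentences. The two training sentences, the fallback tag and the variable names below are made-up assumptions for illustration. A real tagger would be trained on an annotated corpus.

```python
from nltk.tag import DefaultTagger, UnigramTagger

# Hand-annotated toy data: each sentence is a list of (word, tag) pairs.
toy_training_data = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("a", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
]

# Words never seen in training fall back to the tag 'NN'.
fallback_tagger = DefaultTagger("NN")
toy_tagger = UnigramTagger(train=toy_training_data, backoff=fallback_tagger)

toy_tagged = toy_tagger.tag(["a", "dog", "sleeps", "quietly"])
print(toy_tagged)
# [('a', 'DT'), ('dog', 'NN'), ('sleeps', 'VBZ'), ('quietly', 'NN')]
```

The unseen word *quietly* receives the fallback tag `NN`, which is wrong for an adverb. Taking the context of a word into account is what the sequence models in Lecture 4 add.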

In [None]:
from nltk.tag import pos_tag
pos_tag(tokens)

In [None]:
pos_tag(tokens, tagset='universal')

### Can you identify any problems with the tagging of the sentence above?