### DS102 | Self-Study Week 4A - Text Mining III - POS Tagging and Lemmatisation
<hr>

## Learning Objectives
At the end of the self-study, you will be able to:

- list tags from a Part-of-Speech tagger related to verbs, nouns, prepositions, adverbs, adjectives and d

- interpret the output of a **Part-of-Speech tagger**

- define a **lemma** as a word that can be found in a dictionary

- identify lemmatisation as a way to find the root form of a word

At the end of the lesson, you will be able to:

- use `pos_tag` from `nltk` to perform Part-of-Speech (POS) tagging of a sentence

- given a word and its POS tag, use `WordNetLemmatizer`'s `lemmatize` from `nltk.stem` to find its corresponding lemma 

### Datasets Required for this Self-Study
1. `loans-descs-1k.csv`

#### Import Libraries

In [None]:
import pandas as pd
import nltk
import re

In [None]:
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import pos_tag, wordnet
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

In [None]:
# Dataset 1: Credits at the end of the notebook
df = pd.read_csv('loans-descs-1k.csv')

Before starting, let's first revise the steps to stem a document.

In [None]:
# Clean the raw text: remove the "Borrower added on <date> >" prefix and any <br> tags
s1 = df['desc'][4]
print(s1)
s1 = re.sub(r'Borrower added on \d+/\d+/\d+ >|<br>', '', s1)
s1 = s1.strip()
print()
print(s1)

In [None]:
ts1 = word_tokenize(s1)
print(ts1)

The Stemming result will be useful for later comparisons.

In [None]:
stemmer = PorterStemmer()
stemmed_words = []
for t in ts1:
    u = stemmer.stem(t)
    stemmed_words.append(u)
print(stemmed_words)

We will consider another algorithm to find the root form of a word, called **Lemmatisation**. Before we start talking about Lemmatisation, we need to first understand **Part-of-Speech (POS) Tagging**. 

### Part-of-Speech (POS) Tagging

POS tagging assigns each word in a sentence to its word **class** (noun, verb, adjective and so on). A word and its tag are commonly represented as a tuple. We will use the tag set of the *Penn Treebank Project*, one of the tagged corpora available through `nltk`, to tag the words in our sentence. To find the POS tags of a list of word tokens, use `nltk.pos_tag(word_tokens)`.

Notice that the result is a list of tuples. The first element of each tuple is the original word from the sentence and the second element is the assigned POS tag. Refer to this [link](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) to understand the meaning of each tag.

In [None]:
tagged_words_by_treebank = nltk.pos_tag(ts1)
print(tagged_words_by_treebank)
# Tags to be aware of: PRP, VBP, VBG, NN, VBN, NNS
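
To get a feel for how this output is structured, the sketch below groups words by their tag. The list `sample_tagged` is typed by hand for illustration, in the same `(word, tag)` shape that `pos_tag` returns; it was not produced by running the tagger, and the variable names are assumptions of this sketch.

```python
from collections import defaultdict

# Hand-written (word, tag) pairs in the same shape that nltk.pos_tag returns.
# They are illustrative only and were not produced by running the tagger.
sample_tagged = [
    ("I", "PRP"),
    ("am", "VBP"),
    ("paying", "VBG"),
    ("two", "CD"),
    ("loans", "NNS"),
    ("and", "CC"),
    ("cards", "NNS"),
]

# Unpack each tuple and collect the words that share a tag.
words_by_tag = defaultdict(list)
for word, tag in sample_tagged:
    words_by_tag[tag].append(word)

# Print one line per tag, sorted alphabetically by tag name.
for tag, words in sorted(words_by_tag.items()):
    print(f"{tag:>4}: {words}")
```

The same loop should work on `tagged_words_by_treebank`, since it has the same list-of-tuples structure.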

Notice that tags sharing the same first letter represent similar classes. In particular,

- the pattern `N[A-Z]+` represents nouns and

- the pattern `V[A-Z]+` represents verbs

Hence, we can take the first character of each tag and convert it to lower case. This first letter of the tag will be used for Lemmatisation.

In [None]:
tagged_words = []
for twt in tagged_words_by_treebank:
    # Get the first element of the tuple, and the first letter of the second element
    # of the tuple.
    tagged_words.append((twt[0], twt[1][0].lower()))
print(tagged_words)
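
One caveat with lower-casing the first letter: WordNet uses the letters `n` (noun), `v` (verb), `a` (adjective) and `r` (adverb), so the shortcut works for nouns and verbs, but a Penn Treebank adjective tag such as `JJ` becomes `j`, which WordNet does not recognise. A possible alternative is an explicit mapping, sketched below. The helper name `penn_to_wordnet` is hypothetical and not part of `nltk`.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, or None if there is no match.

    Hypothetical helper for illustration; it is not part of nltk.
    """
    # Match on two-letter prefixes so that, for example, the particle tag RP
    # is not mistaken for an adverb (RB, RBR, RBS).
    prefixes = (("JJ", "a"), ("NN", "n"), ("VB", "v"), ("RB", "r"))
    for prefix, wordnet_pos in prefixes:
        if tag.startswith(prefix):
            return wordnet_pos
    return None


for tag in ["NNS", "VBG", "JJ", "RB", "PRP"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Tags that return `None` (pronouns, prepositions, punctuation and so on) have no WordNet equivalent, so the word can simply be left unchanged.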

### Lemmatisation

A **lemma** is the form of a word that can be found in the dictionary. Hence, you can think of it as the root form of a word. Given a word and its corresponding tag, we can find the word's root form in English. Because a lemma is a dictionary word, unlike many stems, the result is easier for humans to interpret. Use `WordNetLemmatizer.lemmatize(term, pos=tag)` to find the root form of the word given the source word and its associated POS tag, where `pos` is one of WordNet's part-of-speech letters such as `'n'` (noun) or `'v'` (verb).

Note that if the POS tag is not one that the lemmatiser recognises, a `KeyError` will be raised. For example, the first word, `I`, has the tag `p` and gives the following result:

In [None]:
# lemmatiser = WordNetLemmatizer()
# lemmatiser.lemmatize('I', pos='p') # Uncomment this line to see the KeyError

Hence, wrap the Lemmatisation step in a `try...except` block so the program continues even if a `KeyError` is encountered.

In [None]:
lemmatiser = WordNetLemmatizer()
lemmed_words = []
for tw_pair in tagged_words:
    tw_word, tw_tag = tw_pair[0], tw_pair[1]
    lemm_word = tw_word
    try:
        lemm_word = lemmatiser.lemmatize(tw_word, pos=tw_tag)
    except KeyError:
        # If a KeyError is raised, keep the word unchanged and assume it is already in its root form.
        print("KeyError: " + tw_word)

    lemmed_words.append(lemm_word)
print(lemmed_words)
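
As an alternative to catching the `KeyError`, the tag can be checked before the lemmatiser is called. The following is a minimal sketch of that idea, not the notebook's method. `lemmatise_pairs` and `toy_lemmatize` are hypothetical names; `toy_lemmatize` is a stand-in with two hand-written entries so that the sketch runs without the WordNet data.

```python
# Part-of-speech letters used by WordNet: noun, verb, adjective, adverb,
# and satellite adjective.
WORDNET_POS = {"n", "v", "a", "r", "s"}


def lemmatise_pairs(tagged_pairs, lemmatize):
    """Lemmatise (word, pos) pairs, leaving a word unchanged if its pos is not a WordNet one.

    `lemmatize` is any callable that accepts (word, pos=...), for example the
    `lemmatize` method of a WordNetLemmatizer instance.
    """
    result = []
    for word, pos in tagged_pairs:
        if pos in WORDNET_POS:
            result.append(lemmatize(word, pos=pos))
        else:
            # Unknown tag: assume the word is already in its root form.
            result.append(word)
    return result


def toy_lemmatize(word, pos="n"):
    # Hypothetical stand-in for WordNetLemmatizer.lemmatize with two
    # hand-written entries; any other word is returned unchanged.
    toy_lemmas = {("paying", "v"): "pay", ("loans", "n"): "loan"}
    return toy_lemmas.get((word, pos), word)


pairs = [("I", "p"), ("paying", "v"), ("loans", "n")]
print(lemmatise_pairs(pairs, toy_lemmatize))  # prints ['I', 'pay', 'loan']
```

In this notebook, the real lemmatiser could be passed in instead, for example `lemmatise_pairs(tagged_words, lemmatiser.lemmatize)`.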

The following shows the original form of the sentence, the result after stemming and the result after lemmatisation for easy comparison.

In [None]:
print("%15s   %15s   %15s" % ("Raw", "Stemming", "Lemmatisation"))
print("%15s-- %15s-- %15s" % ("------------", "------------", "--------------"))
# Iterate over every token, including the last one
for i in range(len(stemmed_words)):
    print("%15s   %15s   %15s" % (ts1[i], stemmed_words[i], lemmed_words[i]))

### Further Exploration - Exercise on POS Tagging & Lemmatisation

Pick one of the following 2 texts. Both are also from the `loans-descs-1k.csv` dataset. Perform POS tagging, followed by Lemmatisation, on your selected text.

In [None]:
s2 = "I really need to consolidate my credit card debt so that I can become debt free. The interest is killing me and I'm just not getting anywhere with the balances. Help!"
s3 = "Hello, I just closed on the house of my dreams and I would like to use this loan to pay off my high interest credit cards and build a deck on my home."

In [None]:
# Step 1: First, perform tokenization using word_tokenize
#

In [None]:
# Step 2: Then, perform POS tagging
#

In [None]:
# Step 3: Then, perform Lemmatisation
#

In [None]:
# Example Solution:
s_x = '''This loan request is to consolidate loans to a lower intrest rate with a goal of becoming debt free. I have been employed with the same company for 10 years next month. I pay ahead on my loans. My goal is to be debt free in 10 years, including my home.'''

# Tokenization & POS Tagging
tks = nltk.pos_tag(nltk.word_tokenize(s_x))
print(tks)
print()

# Lemmatisation
final_w = []
for tw_pair in tks:
    tw_word, tw_tag = tw_pair[0], tw_pair[1][0].lower()
    lemm_word = tw_word
    try:
        lemm_word = lemmatiser.lemmatize(tw_word, pos=tw_tag)
    except KeyError:
        pass
    final_w.append(lemm_word)

# Show the result after Lemmatisation
print("%15s %15s" % ("ORIGINAL", "LEMMATISATION"))
for i in range(0, len(final_w)):
    print("%15s %15s" % (tks[i][0], final_w[i]))

For more information on the corpora mentioned, head over to [Chapter 2 of Natural Language Processing with Python](http://www.nltk.org/book/ch02.html).

**Credits**
- [Kaggle](https://www.kaggle.com/wendykan/lending-club-loan-data) for Dataset 1