 <div>
<img src="https://edlitera-images.s3.amazonaws.com/new_edlitera_logo.png" width="500"/>
</div>

# `Morphological characteristics`

* study of the internal structure of words

* **`morpheme` -** smallest element with independent meaning

**Simple words (word = one morpheme):**

* run
* work
* like

**Complex words (word = combination of morphemes):**

* runner
* working
* likely

**Processing morphological characteristics:**

* `Stemming`
* `Lemmatization`

# `Stemming`

* the process of reducing a word to its root (its base morpheme), usually by stripping suffixes
    <br>
    
    * **runner** --> **run**
    <br>
    
    * **working** --> **work**
    <br>
    
    * **likely** --> **like**

* we perform **`stemming`** because the ending of a word usually carries little information that our model needs
    <br>
    
    * incidentally, it can also make our model somewhat more robust to spelling errors in word endings
    * the procedure also involves **converting all strings to lowercase**, so that we don't run into certain problems (such as our model treating the words "ERROR" and "error" as two different words)

* stemming is preceded by **`tokenization`** (we will talk more about tokenization in a bit)
    <br>
    
    * for now, think of **`tokenization`** as breaking down some text into words
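
As a rough illustration of why lowercasing and tokenization come first, here is a minimal sketch in plain Python. The sample string and the punctuation set passed to `strip()` are assumptions made for this example; real tokenizers handle many more cases.

```python
from collections import Counter

# Hypothetical sample text, chosen only for illustration
raw_text = "ERROR: disk full. The error was logged; error count: 3"

# Naive tokenization: split on whitespace only, no normalization
naive_tokens = raw_text.split()

# Strip surrounding punctuation and lowercase each token
normalized_tokens = [token.strip(".,:;!?").lower() for token in raw_text.split()]
normalized_tokens = [token for token in normalized_tokens if token]

# Without normalization, "ERROR:" and "error" are counted as different tokens
print(Counter(naive_tokens)["error"])       # 2
print(Counter(normalized_tokens)["error"])  # 3
```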
    

**Stemming errors**
    <br>
    <br>
* **`overstemming`** - too much of a word is removed, so words with different meanings end up with the same root
    * e.g. reducing universal and university to the same root


* **`understemming`** - too little of a word is removed, so words that should share a root end up with different roots
    * e.g. not reducing alumnus and alumnae to the same root

* it is very important to pick the right algorithm for the job
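
Both kinds of error can be observed with a real stemmer. The sketch below uses NLTK's `PorterStemmer`; the helper name `same_stem` is hypothetical and only exists for this example.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def same_stem(word_a, word_b):
    """Return True if both words are reduced to the same stem."""
    return stemmer.stem(word_a) == stemmer.stem(word_b)

# Different meanings, same stem: an example of stemming too much
print(same_stem("universal", "university"))

# Related words, different stems: an example of stemming too little
print(same_stem("alumnus", "alumnae"))
```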

**Popular stemming algorithms**:
     <br>
     <br>
    
   * **`Porter Stemmer`** - efficient and simple, but not the most precise, and limited to English words
    <br>
    
   * **`Snowball Stemmer (Porter2Stemmer)`** - very popular, more precise over larger datasets than the original Porter Stemmer, not limited to English words
    <br>
    
   * **`Lancaster Stemmer`** - very aggressive, mostly avoided
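
To see which languages the Snowball Stemmer covers in your NLTK installation, you can inspect the `languages` attribute of `SnowballStemmer`. This is a small sketch; the German word is an arbitrary example, and no expected stem is shown because the result depends on the stemmer's rules.

```python
from nltk.stem.snowball import SnowballStemmer

# Tuple of language names accepted by the language parameter
print(SnowballStemmer.languages)

# A stemmer for a language other than English
german_stemmer = SnowballStemmer(language="german")
print(german_stemmer.stem("katzen"))
```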

## Example

* our sentence : ***Life is what happens when you are busy making other plans***

* stemmed version of our sentence: ***['life', 'is', 'what', 'happen', 'when', 'you', 'are', 'busi', 'make', 'other', 'plan']***

# Stemming in `NLTK`

* we will demonstrate how we perform stemming using the three mentioned stemming algorithms

* for our text, let's use a famous quote from John Lennon

    ***Life is what happens when you are busy making other plans***

* in **`NLTK`**, stemmers are accessed through the **`nltk.stem`** package

In [2]:
# First we need to import the three stemmers from nltk 
# and define them

from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import LancasterStemmer

stemmer = PorterStemmer()
# For the Snowball Stemmer you need to define the language parameter
snowball_stemmer = SnowballStemmer(language='english')
lancaster_stemmer = LancasterStemmer()

In [2]:
# Then we can create our data
# in the form of a list of words

text = ['Life', 'is', 'what', 'happens', 'when', 'you', 'are', 'busy', 'making', 'other', 'plans']


In [3]:
# Stemming using Porter Stemmer

stemmed_text_porter = []

for word in text:
    stemmed_word = stemmer.stem(word)
    stemmed_text_porter.append(stemmed_word)

In [4]:
# Stemming result 

print(stemmed_text_porter)

['life', 'is', 'what', 'happen', 'when', 'you', 'are', 'busi', 'make', 'other', 'plan']


In [5]:
# Stemming using Snowball Stemmer

stemmed_text_snowball = []

for word in text:
    stemmed_word = snowball_stemmer.stem(word)
    stemmed_text_snowball.append(stemmed_word)

In [6]:
# Stemming result 

print(stemmed_text_snowball)

['life', 'is', 'what', 'happen', 'when', 'you', 'are', 'busi', 'make', 'other', 'plan']


In [7]:
# Stemming using Lancaster Stemmer

stemmed_text_lancaster = []

for word in text:
    stemmed_word = lancaster_stemmer.stem(word)
    stemmed_text_lancaster.append(stemmed_word)

In [8]:
# Stemming result 

print(stemmed_text_lancaster)

['lif', 'is', 'what', 'hap', 'when', 'you', 'ar', 'busy', 'mak', 'oth', 'plan']


In [9]:
# Let's compare the results

print(stemmed_text_porter)
print(stemmed_text_snowball)
print(stemmed_text_lancaster)

['life', 'is', 'what', 'happen', 'when', 'you', 'are', 'busi', 'make', 'other', 'plan']
['life', 'is', 'what', 'happen', 'when', 'you', 'are', 'busi', 'make', 'other', 'plan']
['lif', 'is', 'what', 'hap', 'when', 'you', 'ar', 'busy', 'mak', 'oth', 'plan']
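
The three loops above can also be written more compactly. The following sketch keeps the stemmers in a dictionary and lists only the words on which they disagree; the names `stemmers`, `results` and `disagreements` are chosen for this example.

```python
from nltk.stem import LancasterStemmer, PorterStemmer
from nltk.stem.snowball import SnowballStemmer

stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer(language="english"),
    "lancaster": LancasterStemmer(),
}

text = ['Life', 'is', 'what', 'happens', 'when', 'you', 'are', 'busy', 'making', 'other', 'plans']

# One list of stems per stemmer, built with a list comprehension
results = {
    name: [stemmer.stem(word) for word in text]
    for name, stemmer in stemmers.items()
}

# Keep only the words for which the stemmers do not all agree
disagreements = [
    (word, {name: stems[i] for name, stems in results.items()})
    for i, word in enumerate(text)
    if len({stems[i] for stems in results.values()}) > 1
]

for word, stems in disagreements:
    print(word, stems)
```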


### Exercise 1 

**Stem the following words using the Porter2Stemmer (Snowball stemmer):**

* university
* magnitude
* poem
* planetary


**You can solve the exercise in two ways:**


**1.** Stem each word on its own by repeating a few lines of code


**2.** Create a list of the words and include the stemming code in a loop

### Solution: 

In [3]:
# Words to stem
text = ['university','magnitude','poem','planetary']

# One result list per stemmer, so that we can compare them
stemmed_text_lancaster = []
stemmed_text_snowball = []
stemmed_text_porter = []

for word in text:
    stemmed_text_lancaster.append(lancaster_stemmer.stem(word))
    stemmed_text_snowball.append(snowball_stemmer.stem(word))
    stemmed_text_porter.append(stemmer.stem(word))

# Print order: Lancaster, Snowball (Porter2), Porter
print(stemmed_text_lancaster)
print(stemmed_text_snowball)
print(stemmed_text_porter)

['univers', 'magnitud', 'poem', 'planet']
['univers', 'magnitud', 'poem', 'planetari']
['univers', 'magnitud', 'poem', 'planetari']


### Exercise 2 

**Use the Porter Stemmer and the Lancaster Stemmer to stem the following string:**

`"That which we call a rose by any other name would smell as sweet"`


**To solve the exercise, convert the string into a list and then create a loop that stems the words in the list.**


**Hint:** You can use the **`split()`** method on the string to create a list of words.
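
A small note on `split()`: calling it with an explicit `' '` separator and calling it without arguments behave differently when the text contains repeated or trailing spaces. The sample string below is made up to show the difference.

```python
# Hypothetical string with a double space and a trailing space
messy_text = "That  which we call a rose "

# An explicit separator keeps empty strings between consecutive spaces
print(messy_text.split(" "))  # ['That', '', 'which', 'we', 'call', 'a', 'rose', '']

# No argument: splits on any run of whitespace and drops empty strings
print(messy_text.split())     # ['That', 'which', 'we', 'call', 'a', 'rose']
```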

### Solution: 

In [4]:
# Convert the string into a list of words
text = "That which we call a rose by any other name would smell as sweet"
text = text.split()

# One result list per stemmer, so that we can compare them
stemmed_text_lancaster = []
stemmed_text_snowball = []
stemmed_text_porter = []

for word in text:
    stemmed_text_lancaster.append(lancaster_stemmer.stem(word))
    stemmed_text_snowball.append(snowball_stemmer.stem(word))
    stemmed_text_porter.append(stemmer.stem(word))

# Print order: Lancaster, Snowball (Porter2), Porter
print(stemmed_text_lancaster)
print(stemmed_text_snowball)
print(stemmed_text_porter)

['that', 'which', 'we', 'cal', 'a', 'ros', 'by', 'any', 'oth', 'nam', 'would', 'smel', 'as', 'sweet']
['that', 'which', 'we', 'call', 'a', 'rose', 'by', 'ani', 'other', 'name', 'would', 'smell', 'as', 'sweet']
['that', 'which', 'we', 'call', 'a', 'rose', 'by', 'ani', 'other', 'name', 'would', 'smell', 'as', 'sweet']


# `Lemmatizing`

* another popular morphological process

* similar to **`stemming`**
    <br>
    
    * also converts a word to its root form

**Advantages:**
    <br>
    
   * the root form of a word is its **`lemma (dictionary form)`** which makes the result an actual word
   <br>
   
       * e.g. stemming will reduce busy to busi, which is not an actual word
       * lemmatization will leave busy as busy
    <br>
    
    
   * meaning and context are preserved

**Disadvantages:**

   * slower (requires looking words up in a dictionary, also called a lexicon)
    <br>
    
   * very sensitive to spelling errors, so requires preprocessing
   
   
   * libraries such as **`NLTK`** require that we tag each word with its part of speech (noun, adjective, verb...)
    <br>
 
   * without **`POS tags`**, the results of lemmatization will be worse
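
To make these disadvantages concrete, here is a toy sketch of dictionary-based lemmatization in plain Python. The lexicon `TOY_LEXICON` and the function `toy_lemmatize` are hypothetical and hold only a few entries; real lemmatizers such as the WordNet one combine a much larger dictionary with morphological rules.

```python
# Hypothetical toy lexicon: (inflected form, POS tag) -> lemma
TOY_LEXICON = {
    ("happens", "v"): "happen",
    ("plans", "n"): "plan",
    ("saw", "v"): "see",
    ("saw", "n"): "saw",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma from the toy lexicon, or the word unchanged if it is missing."""
    return TOY_LEXICON.get((word.lower(), pos), word)

# The POS tag decides which lemma we get for an ambiguous word
print(toy_lemmatize("saw", pos="v"))      # see
print(toy_lemmatize("saw", pos="n"))      # saw

# A misspelled word has no dictionary entry, so it is returned unchanged
print(toy_lemmatize("hapens", pos="v"))   # hapens
```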

**Some popular lemmatization algorithms can be found in:**
    <br>
    
   * `WordNet`
     <br>
    
   * `spaCy` 
     <br>
    
   * `TextBlob`
     <br>
    
   * `TreeTagger`
     <br>
    
   * `Gensim`
     
    

## Example

* our sentence : ***Life is what happens when you are busy making other plans***

* lemmatized version of our sentence: ***['Life', 'be', 'what', 'happen', 'when', 'you', 'be', 'busy', 'make', 'other', 'plan']***

# Lemmatizing in `NLTK`

* we will demonstrate how we perform **lemmatizing** using the **WordNet algorithm** on the following sentence
    <br>
    
    ***Life is what happens when you are busy making other plans***

* **IMPORTANT:** the lemmatizer looks words up in the WordNet corpus and needs to know the part of speech of each word, so we must import the corpus and perform **POS tagging** *before* **lemmatization**

In [5]:
# Import nltk so that we can use the POS-tagger
# Import the lemmatizer and the corpus
# Define the lemmatizer

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()

In [6]:
# Then we can create our data
# in the form of a list of words

text = ['Life', 'is', 'what', 'happens', 'when', 'you', 'are', 'busy', 'making', 'other', 'plans']

In [7]:
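# Let's do some POS-tagging
# (pos_tag needs a tagger model; if it is missing, NLTK raises a LookupError
# that names the resource to fetch with nltk.download())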

tagged_text = nltk.pos_tag(text)

In [8]:
tagged_text

[('Life', 'NNP'),
 ('is', 'VBZ'),
 ('what', 'WP'),
 ('happens', 'VBZ'),
 ('when', 'WRB'),
 ('you', 'PRP'),
 ('are', 'VBP'),
 ('busy', 'JJ'),
 ('making', 'VBG'),
 ('other', 'JJ'),
 ('plans', 'NNS')]

**Converting tags to `Wordnet` tags**

<br>

* the automatic **`POS tags`** produced by **`pos_tag()`** follow the **`Penn Treebank`** tag set



* to perform lemmatization, **`NLTK`** needs **`POS tags`** that the **`Wordnet`** algorithm can use
    * **`Wordnet`** is a famous lexical database of nouns, verbs, adjectives and adverbs

* we can create a simple function that will convert our tags
    <br>
    
    * there are 4 **`Wordnet`** tags: **nouns**, **verbs**, **adjectives** and **adverbs**
    * to lemmatize using  **`NLTK`** we need to transform all of the tags we get using **`nltk.pos_tag()`** into one of these four groups
    * **NOTE:** for tags that are not converted to an equivalent **Wordnet** tag, the lemmatization process will be performed as if we didn't supply any tag at all

In [16]:
# Quick check of how str.startswith() works (it is used in the function below)
tt = "this"
tt.startswith('h')

False

In [9]:
# Create function that converts tags from treebank to wordnet

def convert_pos_tags(tag):

    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return None  
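
The same conversion can also be sketched as a dictionary lookup on the first letter of the Treebank tag. This variant is only an illustration: the name `convert_pos_tag_lookup` is hypothetical, and the string values `'a'`, `'v'`, `'n'` and `'r'` are written out on the assumption that they match `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN` and `wordnet.ADV`, which lets the sketch run without loading the corpus.

```python
# First letter of the Treebank tag -> WordNet POS tag
# (values assumed equal to wordnet.ADJ, wordnet.VERB, wordnet.NOUN, wordnet.ADV)
TREEBANK_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def convert_pos_tag_lookup(tag):
    """Return the WordNet tag for a Treebank tag, or None if there is no equivalent."""
    return TREEBANK_PREFIX_TO_WORDNET.get(tag[:1])

for tag in ["JJ", "VBZ", "NNS", "RB", "WP"]:
    print(tag, convert_pos_tag_lookup(tag))
```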

In [30]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\swaheed\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


True

In [10]:
# Get wordnet tagged text

wordnet_tagged_text = []

for word, tag in tagged_text:
    wordnet_tag = convert_pos_tags(tag)
    wordnet_tagged_text.append((word,wordnet_tag))

In [11]:
wordnet_tagged_text

[('Life', 'n'),
 ('is', 'v'),
 ('what', None),
 ('happens', 'v'),
 ('when', None),
 ('you', None),
 ('are', 'v'),
 ('busy', 'a'),
 ('making', 'v'),
 ('other', 'a'),
 ('plans', 'n')]

* **None** isn't problematic: for those words we call the lemmatizer without a POS tag, in which case it falls back to its default and treats the word as a noun

In [12]:
# Lemmatize using the WordNet algorithm
# Snippet of code you can easily reuse later for your own needs

lemmatized_text_wordnet = []

for word, tag in tagged_text:
    wordnet_tag = convert_pos_tags(tag)
    if wordnet_tag is None:
        lemmatized_word = lemmatizer.lemmatize(word) 
    else:
        lemmatized_word = lemmatizer.lemmatize(word, pos=wordnet_tag) 
    lemmatized_text_wordnet.append(lemmatized_word)

In [13]:
# Print results

print(lemmatized_text_wordnet)

['Life', 'be', 'what', 'happen', 'when', 'you', 'be', 'busy', 'make', 'other', 'plan']


**NOTE:**

* you can also manually add tags


* this is only useful if you want to check one specific word

In [20]:
# Use the lemmatizer to lemmatize the word

lemmatizer.lemmatize("running", pos=wordnet.NOUN)

'running'

### Exercise 3

**Lemmatize the following words using the WordNetLemmatizer:**

* driving
* construction
* are
* early

### Solution:

In [15]:
lemmatizer.lemmatize("driving",pos=wordnet.VERB)

'drive'

In [16]:
lemmatizer.lemmatize("construction",pos=wordnet.NOUN)

'construction'

In [17]:
lemmatizer.lemmatize("are",pos=wordnet.VERB)

'be'

In [18]:
lemmatizer.lemmatize("early",pos=wordnet.ADJ)

'early'

# `Morphological Characteristics Cheat Sheet`

* extracting **`morphemes` (smallest elements with independent meaning)** from some text

* we perform either **`stemming`** or **`lemmatization`**
    * performing both is redundant as they essentially try to do the same thing (reduce each word to a base form)

### `Stemming`

* the process of reducing a word to its root (to its base **`morpheme`**) by using a stemming algorithm
    * removes the ending of the word in most cases

* performed because the endings of words are usually not relevant for understanding them
    * careful - we must not stem a word too much or too little (**`overstemming`** and **`understemming`**)

* three algorithms available in **`NLTK`**:
    <br>
    
    * **`Porter Stemmer`** - the default option
    <br>
    
    * **`Porter 2 Stemmer (Snowball Stemmer)`** - more precise than the Porter Stemmer, available for a number of languages besides English
    <br>
    
    * **`Lancaster Stemmer`** - avoid using it in most cases, stems too much

### `Lemmatization`

* similar to **`stemming`**, but instead reduces the word to its lemma (dictionary form) which preserves meaning and context 

* requires a dictionary (lexicon) to look words up in

* very sensitive to spelling errors

* **`POS tags`** necessary for getting good results

* caution when using **`NLTK`**:
    * it uses the **`Wordnet lemmatization algorithm`**, but the built-in tagger produces **`Treebank tags`**
    * always convert tags before trying to use them or else **`lemmatization`** won't work as it is supposed to
