# Normalizing
In linguistics, a grapheme is the smallest unit of a writing system of any given language.

An individual grapheme may or may not carry meaning by itself, and may or may not correspond to a single phoneme of the spoken language. 

Graphemes include alphabetic letters, typographic ligatures, Chinese characters, numerical digits, punctuation marks, and other individual symbols. A grapheme can also be construed as a graphical sign that independently represents a portion of linguistic material.

# Stemming and lemmatization


For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set.

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.

# Stemming 

Stemming is the process of reducing a word to its stem, the base form that remains once affixes (suffixes and prefixes) are removed. The stem approximates the root of the word but, unlike a lemma, need not be a valid dictionary word.

Stemming is important in natural language understanding (NLU) and natural language processing (NLP).

Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes.
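
To make the "crude heuristic" concrete, here is a minimal sketch of a suffix stripper. The suffix list and the names `SUFFIXES` and `crude_stem` are assumptions made for illustration and do not come from any standard algorithm; real stemmers such as Porter's apply many more rules and conditions.

```python
# Hypothetical suffix list, checked in order (longest first); an illustration only.
SUFFIXES = ("ization", "izing", "izes", "ize", "ing", "es", "s")

def crude_stem(word, min_stem_len=3):
    """Chop the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

# The inflected forms from the text collapse to one stem.
print([crude_stem(w) for w in ["organize", "organizes", "organizing"]])
# The same heuristic also over-stems a word that is not a plural.
print(crude_stem("news"))
```

The three forms of "organize" all reduce to `organ`, which is the desired conflation, while "news" becomes `new`, which shows how chopping word endings can change the meaning.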

[Reference1: Stemming](http://www.cs.odu.edu/~jbollen/IR04/readings/readings5.pdf)


[Reference2: Snowball](http://snowball.tartarus.org/texts/introduction.html) 

In [0]:
# Stemming without NLTK
import re

def stem(phrase):
    # Drop one trailing 's' from each word, but leave words ending in 'ss' intact
    return ' '.join(re.findall(r'^(.*ss|.*?)(s)?$', word)[0][0].strip("'") for word in phrase.lower().split())

In [6]:
stem('roses')

'rose'

In [7]:
stem("Roses are red, Violets are blue!")

'rose are red, violet are blue!'

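The disadvantage is that this function only strips a trailing 's'. It does not handle other plural forms (for example, 'dishes' becomes 'dishe'), and it also alters words that are not plurals (for example, 'was' becomes 'wa'), which sometimes changes the meaning.

For more complex cases, NLTK provides an implementation of the Porter stemming algorithm (see Reference 1).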

In [13]:
# using NLTK for stemming
import nltk
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
' '.join([stemmer.stem(w).strip("'") for w in "dish washer's washed dishes".split()])

'dish washer wash dish'

# Lemmatization

Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

NLTK provides a lemmatizer for this, which uses WordNet in the background.

In [14]:
# importing libraries
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize("better")


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


'better'

**The lemmatizer uses the word's part of speech to pick the right lemma; when `pos` is omitted, it treats the word as a noun. A few examples:**

In [15]:
# pos="a" marks the word as an adjective
lemmatizer.lemmatize("better", pos="a")  

'good'

In [16]:
# pos="a" marks the word as an adjective
lemmatizer.lemmatize("good", pos="a")

'good'

In [17]:
# pos="a" marks the word as an adjective
lemmatizer.lemmatize("goods", pos="a")

'goods'

In [18]:
# pos="n" marks the word as a noun
lemmatizer.lemmatize("goods", pos="n")

'good'

In [19]:
# pos="n" marks the word as a noun
lemmatizer.lemmatize("goodness", pos="n")

'goodness'

In [20]:
# pos="a" marks the word as an adjective
lemmatizer.lemmatize("best", pos="a")

'best'
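
The part-of-speech dependence seen in these cells can be mimicked with a toy lookup table. This is only a sketch: `LEXICON` and `toy_lemmatize` are hypothetical names, and the handful of entries is hand-written for illustration, whereas NLTK's `WordNetLemmatizer` consults WordNet.

```python
# Hypothetical hand-written lexicon keyed by (word, part of speech).
# A real lemmatizer such as NLTK's WordNetLemmatizer consults WordNet instead.
LEXICON = {
    ("better", "a"): "good",
    ("goods", "n"): "good",
    ("dishes", "n"): "dish",
    ("washed", "v"): "wash",
}

def toy_lemmatize(word, pos="n"):
    """Look up (word, pos); fall back to the word unchanged when it is unknown."""
    return LEXICON.get((word.lower(), pos), word)

print(toy_lemmatize("better"))           # noun reading is not in the lexicon
print(toy_lemmatize("better", pos="a"))  # adjective reading maps to its lemma
```

Because the key includes the part of speech, the same surface form can map to different lemmas, or to itself, depending on how it is used.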

**If we are building a search-based application, stemming and lemmatization will improve the recall of documents.**


**If we are developing a search-based chatbot, accuracy is more important, so the chatbot should use the un-normalized text to find the closest match.**
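
The trade-off between recall and accuracy can be sketched with a toy search over three short documents. Everything here is an assumption made for illustration: `strip_s` is a deliberately crude normalizer, `search` is a hypothetical token-overlap matcher, and the documents are invented.

```python
def strip_s(word):
    """Toy normalizer (an assumption for this sketch): drop one trailing 's'."""
    return word[:-1] if word.endswith("s") and not word.endswith("ss") else word

def search(query, docs, normalize=None):
    """Return indices of docs sharing at least one token with the query."""
    norm = normalize or (lambda w: w)
    query_tokens = {norm(w) for w in query.lower().split()}
    return [i for i, doc in enumerate(docs)
            if query_tokens & {norm(w) for w in doc.lower().split()}]

docs = ["red roses for sale", "a rose garden", "the sun rose early"]
print(search("roses", docs))           # exact tokens only
print(search("roses", docs, strip_s))  # normalized tokens
```

Without normalization only the first document matches. With normalization all three match: the second is a useful extra hit (better recall), while the third uses "rose" as a verb and is a false match (lower accuracy).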