# Morphology – Getting Our Feet Wet

Morphology is the study of how words are composed of morphemes. A morpheme is the smallest unit of language that carries meaning. In this chapter, we will discuss stemming and lemmatization, stemmers and lemmatizers for non-English languages, developing a morphological analyzer and a morphological generator using machine learning tools, the use of morphology in search engines, and related concepts.


# Introducing morphology

Morphology studies how tokens (words) are formed from morphemes, the basic
meaning-carrying units of a language. There are two types of morpheme: stems and
affixes (prefixes, suffixes, infixes, and circumfixes).

Stems are also referred to as free morphemes, since they can exist on their own
without any affixes. Affixes are referred to as bound morphemes, since they cannot
stand alone and always attach to a free morpheme. Consider the word unbelievable.
Here, believe is the stem, a free morpheme that can exist on its own. The morphemes
un- and -able are affixes, or bound morphemes. They cannot exist in a free form and
appear only together with a stem.
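
To make the stem/affix distinction concrete, here is a minimal sketch that peels at most one known prefix and one known suffix off a word. The `PREFIXES` and `SUFFIXES` lists and the `split_affixes` helper are hypothetical and deliberately tiny. The sketch also ignores spelling changes (believe drops its final e before -able), so the middle part it returns is only an approximation of the stem.

```python
# Toy affix splitter: illustrative only, not a real morphological analyzer.
# PREFIXES and SUFFIXES are small hypothetical lists chosen for this example.
PREFIXES = ("un", "re", "dis")
SUFFIXES = ("able", "ness", "ing", "ed")


def split_affixes(word):
    """Return (prefix, stem, suffix); an affix is '' when none matches."""
    prefix = next((p for p in PREFIXES if word.startswith(p)), "")
    rest = word[len(prefix):]
    suffix = next((s for s in SUFFIXES if rest.endswith(s)), "")
    stem = rest[: len(rest) - len(suffix)] if suffix else rest
    return prefix, stem, suffix


print(split_affixes("unbelievable"))  # ('un', 'believ', 'able')
print(split_affixes("believe"))       # ('', 'believe', '')
```

Because the matching is purely string-based, a word such as under would be wrongly split into un + der. Avoiding such errors is one reason real analyzers rely on lexicons or learned models.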

Languages are commonly grouped into three morphological types: isolating,
agglutinative, and inflecting. Morphology works differently in each of them. In
isolating languages, words are mostly free morphemes that are not marked for tense
(past, present, or future) or number (singular or plural). Mandarin Chinese is an
example of an isolating language.

In agglutinative languages, several morphemes are joined together within one word,
and each morpheme contributes one distinct piece of information. Turkish is an
example of an agglutinative language.

In inflecting languages, words can also be broken down into simpler units, but a
single unit often expresses several meanings at once (for example, case and number
together). Latin is an example of an inflecting language.

Morphological processes are of the following types: inflection, derivation, semiaffixes
and combining forms, and cliticization. Inflection transforms a word so that it
expresses person, number, tense, gender, case, aspect, or mood. Here, the syntactic
category of the token remains the same (work and worked are both verbs). In
derivation, the syntactic category of the word typically changes as well (the verb
work becomes the noun worker). Semiaffixes are bound morphemes that have word-like
qualities, as in noteworthy, antisocial, and anticlockwise.

# Understanding stemmer

Stemming is the process of obtaining a stem from a word by removing its affixes.
For example, given the word raining, a stemmer returns the stem rain by removing
the suffix -ing. To improve information retrieval, search engines commonly stem
words and store the stems as index terms. Words that share a stem are then treated
as equivalent when a query is matched, which is a kind of query expansion known as
conflation. Martin Porter designed a well-known stemming algorithm, the Porter
stemming algorithm. It replaces or removes common suffixes found in English words.
To perform stemming in NLTK, we instantiate the PorterStemmer class and call its
stem() method. Note that a stem is not always a dictionary word: happiness is
reduced to happi.

In [2]:
import nltk
from nltk.stem import PorterStemmer
stemmerporter = PorterStemmer()
stemmerporter.stem('working')

'work'

In [3]:
stemmerporter.stem('happiness')

'happi'
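
The conflation idea described above can be sketched as a small inverted index keyed by stem rather than by surface form. The mini-corpus `docs` and the `search` helper are hypothetical names used only for this illustration. A real search engine would also apply proper tokenization, stop-word handling, and ranking.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical mini-corpus: document id -> text.
docs = {
    1: "she works from home",
    2: "he worked late",
    3: "working hours vary",
    4: "the weather is nice",
}

# Build an inverted index keyed by stem instead of by surface form.
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().split():
        index[stemmer.stem(token)].add(doc_id)


def search(query):
    """Return the sorted ids of documents sharing a stem with the query word."""
    return sorted(index.get(stemmer.stem(query.lower()), set()))


print(search("work"))     # [1, 2, 3]
print(search("working"))  # [1, 2, 3]
```

Because works, worked, and working all reduce to the stem work, a query for any one of them retrieves the documents containing the others.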

Types of stemmers:

1. PorterStemmer
2. LancasterStemmer
3. RegexpStemmer
4. SnowballStemmer

# LancasterStemmer
The Lancaster stemming algorithm was introduced at Lancaster University. As with
PorterStemmer, NLTK provides a LancasterStemmer class that implements it.
One of the major differences between the two algorithms is that Lancaster stemming
is more aggressive than Porter stemming: it applies its rules repeatedly and
often produces shorter stems.

In [5]:
from nltk.stem import LancasterStemmer
stemmer_lan=LancasterStemmer()
stemmer_lan.stem('working')

'work'

In [8]:
stemmer_lan.stem('happiness')

'happy'
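
A quick way to see how the two stemmers differ is to run them side by side on the same words. The `compare` helper below is a hypothetical convenience function. Apart from working and happiness, the word list is an arbitrary choice, so its outputs are left for the reader to inspect.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()


def compare(word):
    """Return (Porter stem, Lancaster stem) for a single word."""
    return porter.stem(word), lancaster.stem(word)


# Print each word followed by its (Porter, Lancaster) stems.
for word in ("working", "happiness", "maximum", "provision"):
    print(word, compare(word))
```

For working both stemmers return work, while for happiness Porter returns happi and Lancaster returns happy, matching the sessions above.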

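# RegexpStemmer
We can also build our own stemmer in NLTK using RegexpStemmer. It accepts a
regular expression and removes every part of a word that matches it. The pattern
'ing' used here therefore matches anywhere in the word, not only at the end;
anchoring it as 'ing$' restricts the removal to the suffix.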

In [9]:
from nltk.stem import RegexpStemmer
stemmer_regexp=RegexpStemmer('ing')
stemmer_regexp.stem('working')

'work'

In [10]:
stemmer_regexp.stem('happiness')

'happiness'

In [11]:
stemmer_regexp.stem('pairing')

'pair'
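
The following sketch contrasts an unanchored pattern with an anchored one and also uses the `min` parameter of RegexpStemmer, which leaves words shorter than the given length untouched. The variable names `loose` and `strict` are chosen only for this example.

```python
from nltk.stem import RegexpStemmer

# Unanchored pattern: removes 'ing' wherever it occurs in the word.
loose = RegexpStemmer('ing')
# Anchored pattern: removes 'ing' only at the end of the word; min=5 leaves
# words shorter than five characters untouched.
strict = RegexpStemmer('ing$', min=5)

print(loose.stem('singing'))   # 's'
print(strict.stem('singing'))  # 'sing'
print(strict.stem('king'))     # 'king'
```

The unanchored stemmer strips both occurrences of ing from singing, which is rarely what is wanted. The anchored version with a minimum length behaves more like a conventional suffix stripper.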

# SnowballStemmer
SnowballStemmer supports stemming in English and a number of other languages (14
in the NLTK version used here, as the SnowballStemmer.languages output shows; the
'porter' entry selects the original Porter algorithm for English). To perform
stemming with SnowballStemmer, first create an instance for the language in which
stemming needs to be performed. Then call its stem() method.

In [12]:
from nltk.stem import SnowballStemmer
SnowballStemmer.languages

('arabic',
 'danish',
 'dutch',
 'english',
 'finnish',
 'french',
 'german',
 'hungarian',
 'italian',
 'norwegian',
 'porter',
 'portuguese',
 'romanian',
 'russian',
 'spanish',
 'swedish')

In [14]:
spanishstemmer=SnowballStemmer('spanish')
spanishstemmer.stem('comiendo')

'com'

In [15]:
frenchstemmer=SnowballStemmer('french')
frenchstemmer.stem('manger')

'mang'
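
When the language is only known at run time, it can help to check it against SnowballStemmer.languages before creating the stemmer. The `stem_in` helper below is a hypothetical wrapper written for this illustration, not part of NLTK.

```python
from nltk.stem import SnowballStemmer


def stem_in(language, word):
    """Stem `word` with the Snowball stemmer for `language`."""
    if language not in SnowballStemmer.languages:
        raise ValueError(f"unsupported language: {language!r}")
    return SnowballStemmer(language).stem(word)


print(stem_in('english', 'working'))   # 'work'
print(stem_in('spanish', 'comiendo'))  # 'com', as in the session above
```

Creating a new stemmer on every call keeps the sketch short. Code that stems many words would normally create one instance per language and reuse it.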

# Understanding lemmatization

Lemmatization is the process of reducing a word to its lemma, that is, its
dictionary (base) form. Unlike a stem, a lemma is always a valid word, and the
result depends on the part of speech of the input word.
WordNetLemmatizer relies on WordNet's built-in morphy() function for lemmatization.
The input word is returned unchanged if it is not found in WordNet. The pos
argument gives the part-of-speech category of the input word. It defaults to
noun ('n'), which is why 'working' is returned unchanged unless pos='v' is passed.

Consider an example of lemmatization in NLTK:

In [18]:
from nltk.stem import WordNetLemmatizer
lemmatizer_output=WordNetLemmatizer()
lemmatizer_output.lemmatize('working')

'working'

In [19]:
lemmatizer_output.lemmatize('working',pos='v')

'work'

In [20]:
lemmatizer_output.lemmatize('works')

'work'
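
The role of the pos argument can be illustrated with a rough, dictionary-based sketch. It is not NLTK's implementation: `VOCAB`, `RULES`, and `toy_lemmatize` are hypothetical stand-ins for WordNet's lemma lists and suffix rules, reduced to a handful of entries. The idea is to generate candidate forms with suffix rules for the given part of speech and to keep only candidates that are known words for it.

```python
# Simplified, dictionary-based lemmatizer. VOCAB and RULES are tiny
# hypothetical stand-ins for WordNet's lemma lists and suffix rules.
VOCAB = {
    'n': {'work', 'working', 'dog'},
    'v': {'work', 'run'},
}
RULES = {
    'n': [('s', ''), ('ses', 's')],
    'v': [('s', ''), ('ing', ''), ('ing', 'e'), ('ed', '')],
}


def toy_lemmatize(word, pos='n'):
    """Return the shortest known candidate for `word`, or `word` itself."""
    candidates = [word]
    for old, new in RULES[pos]:
        if word.endswith(old):
            candidates.append(word[: len(word) - len(old)] + new)
    known = [c for c in candidates if c in VOCAB[pos]]
    return min(known, key=len) if known else word


print(toy_lemmatize('working'))           # 'working' (a known noun)
print(toy_lemmatize('working', pos='v'))  # 'work'
print(toy_lemmatize('works'))             # 'work'
print(toy_lemmatize('xyzzy'))             # 'xyzzy' (not in the vocabulary)
```

As in the NLTK session, the noun reading keeps working unchanged, the verb reading yields work, and an unknown word is returned as is.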

# Developing a stemmer for non-English language