## Word Normalization
Word normalization is the task of putting words or tokens into a standard format, choosing a single normal form for words with multiple forms, such as USA and US.

### Case folding
Case folding maps everything to lower case, which helps generalization in tasks such as information retrieval or speech recognition. For sentiment analysis and other text classification tasks, however, case can carry useful information, so case folding is generally not done.
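
As a minimal illustration of the trade-off, the sketch below folds a short token list with Python's built-in `str.casefold()`. The helper name `case_fold` and the example tokens are made up for this note. After folding, the country `US` and the pronoun `us` become the same token.

```python
def case_fold(tokens):
    """Return a lower-cased copy of each token (str.casefold is a slightly
    more aggressive variant of str.lower intended for caseless matching)."""
    return [token.casefold() for token in tokens]

tokens = ["The", "US", "helped", "us"]
print(case_fold(tokens))  # ['the', 'us', 'helped', 'us']
```

This is the kind of lost distinction that can matter for classification tasks.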

### Lemmatization
Lemmatization is the task of determining whether two words have the same root, despite their surface differences. For example, *am*, *are*, and *is* share the lemma *be*.

How is it done? The most sophisticated methods for lemmatization involve complete **morphological parsing** of the word. **Morphology** is the study of the way words are built up from smaller meaning-bearing units called **morphemes**. Two broad classes of morphemes can be distinguished:

- **stems**: the central morpheme of the word, supplying the main meaning
- **affixes**: morphemes attached to the stem that add additional meanings of various kinds.
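
To make the stem/affix distinction concrete, here is a deliberately naive sketch that peels at most one prefix and one suffix off a word. The affix lists, the `segment` function, and the `min_stem` threshold are all hypothetical choices for illustration. Real morphological parsers rely on lexicons and much richer rules.

```python
# Toy affix inventories (hypothetical and far from complete).
PREFIXES = ("un", "re")
SUFFIXES = ("ing", "ed", "s")

def segment(word, min_stem=3):
    """Split a word into (prefix, stem, suffix); '' means no affix was found."""
    prefix = suffix = ""
    for p in PREFIXES:
        # Only strip the prefix if a plausible stem remains.
        if word.startswith(p) and len(word) - len(p) >= min_stem:
            prefix, word = p, word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= min_stem:
            suffix, word = s, word[:-len(s)]
            break
    return prefix, word, suffix

for w in ["cats", "unlocked", "replaying"]:
    print(w, "->", segment(w))
```

With these lists, `cats` splits into the stem `cat` plus the suffix `s`, and `unlocked` into `un` + `lock` + `ed`.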

One useful lemmatizer is the `WordNetLemmatizer` from `nltk`.

It uses [WordNet](https://wordnet.princeton.edu/)'s built-in `morphy` function and returns the input word unchanged if the word cannot be found in WordNet.

In [1]:
from nltk.stem import WordNetLemmatizer

In [9]:
wnl = WordNetLemmatizer()
print(wnl.__doc__)


    WordNet Lemmatizer

    Lemmatize using WordNet's built-in morphy function.
    Returns the input word unchanged if it cannot be found in WordNet.

        >>> from nltk.stem import WordNetLemmatizer
        >>> wnl = WordNetLemmatizer()
        >>> print(wnl.lemmatize('dogs'))
        dog
        >>> print(wnl.lemmatize('churches'))
        church
        >>> print(wnl.lemmatize('aardwolves'))
        aardwolf
        >>> print(wnl.lemmatize('abaci'))
        abacus
        >>> print(wnl.lemmatize('hardrock'))
        hardrock
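
The behaviour in the docstring (look the word up, otherwise return it unchanged) can be approximated in a few lines. The sketch below is not NLTK's implementation. It assumes a tiny made-up lexicon and exception list standing in for WordNet's data, and applies noun suffix-detachment rules in the style of `morphy`. The name `toy_lemmatize` is hypothetical.

```python
# Tiny stand-ins for WordNet's noun index and exception list (hypothetical data).
LEXICON = {"dog", "church", "abacus", "aardwolf"}
EXCEPTIONS = {"abaci": "abacus", "aardwolves": "aardwolf"}

# Noun suffix-detachment rules in the style of morphy: (ending, replacement).
NOUN_RULES = [("ses", "s"), ("xes", "x"), ("zes", "z"), ("ches", "ch"),
              ("shes", "sh"), ("men", "man"), ("ies", "y"), ("s", "")]

def toy_lemmatize(word):
    """Return a lemma found in LEXICON, or the word unchanged if none is found."""
    if word in EXCEPTIONS:          # irregular forms are listed explicitly
        return EXCEPTIONS[word]
    if word in LEXICON:             # already a base form
        return word
    for ending, replacement in NOUN_RULES:
        if word.endswith(ending):
            candidate = word[:-len(ending)] + replacement
            if candidate in LEXICON:  # accept only candidates the lexicon knows
                return candidate
    return word                     # unknown word: give it back untouched

for w in ["dogs", "churches", "aardwolves", "abaci", "hardrock"]:
    print(w, "->", toy_lemmatize(w))
```

The lexicon check is what separates this approach from blind suffix stripping: a rule only fires if it produces a known word.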
    


### How does Lemmatization work in spaCy?
spaCy is another NLP library that can perform lemmatization (among many other things).

In [1]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [2]:
# Parse a whitespace-separated list of word forms and print each token's lemma.
sentence = nlp('compute computer computed computing')
for word in sentence:
    print(word.text, word.lemma_)

compute compute
computer computer
computed compute
computing computing


The details of spaCy's lemmatizer are given in this [Stack Overflow Q/A](https://stackoverflow.com/questions/43795249/how-does-spacy-lemmatizer-works); in short, it is based on `_morphy` from `nltk` and adds some punctuation rules.

### Stemming

Stemming is a simpler but cruder method of morphological analysis that mainly consists of chopping off word-final affixes. One of the most widely used stemming algorithms is the Porter stemmer. Another is the Snowball stemmer. You can call both from `nltk`.

In [3]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

plurals = ['caresses', 'flies', 'dies', 'mules', 'denied',
           'died', 'agreed', 'owned', 'humbled', 'sized',
           'meeting', 'stating', 'siezing', 'itemization',
           'sensational', 'traditional', 'reference', 'colonizer',
           'plotted']
# Stem every word in the list (despite the name, not all of them are plurals).
singles = [stemmer.stem(plural) for plural in plurals]
print(singles)

['caress', 'fli', 'die', 'mule', 'deni', 'die', 'agre', 'own', 'humbl', 'size', 'meet', 'state', 'siez', 'item', 'sensat', 'tradit', 'refer', 'colon', 'plot']


Note that you need to specify the language to use the `SnowballStemmer`.

In [5]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer('english')
plurals = ['caresses', 'flies', 'dies', 'mules', 'denied',
           'died', 'agreed', 'owned', 'humbled', 'sized',
           'meeting', 'stating', 'siezing', 'itemization',
           'sensational', 'traditional', 'reference', 'colonizer',
           'plotted']
# Same word list as for the Porter stemmer, so the outputs can be compared.
singles = [stemmer.stem(plural) for plural in plurals]
print(singles)

['caress', 'fli', 'die', 'mule', 'deni', 'die', 'agre', 'own', 'humbl', 'size', 'meet', 'state', 'siez', 'item', 'sensat', 'tradit', 'refer', 'colon', 'plot']


Stemming clearly has some problems: many stems, such as `fli` and `agre`, are not words at all, and stemming sometimes completely changes the meaning of a word, as in `colonizer` -> `colon`.
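
To see where a stem like `fli` comes from, here is a sketch of just the first rule group (step 1a) of the original Porter algorithm, which rewrites plural-like endings. The function name `porter_step_1a` is made up for this note. NLTK's `PorterStemmer` applies several further steps and, by default, some extra rules, so its results can differ from this fragment on other words.

```python
def porter_step_1a(word):
    """Apply only step 1a of the original Porter stemmer (plural endings)."""
    if word.endswith("sses"):
        return word[:-2]   # sses -> ss   (caresses -> caress)
    if word.endswith("ies"):
        return word[:-2]   # ies  -> i    (flies -> fli)
    if word.endswith("ss"):
        return word        # ss   -> ss   (caress -> caress)
    if word.endswith("s"):
        return word[:-1]   # s    -> ''   (mules -> mule)
    return word

for w in ["caresses", "flies", "caress", "mules"]:
    print(w, "->", porter_step_1a(w))
```

The rules look only at the spelling of the ending and never consult a dictionary, which is why the result need not be a real word.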

### Sentence Segmentation
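**Sentence segmentation** is another important step in text processing. The most useful cues for segmenting a text into sentences are punctuation marks such as periods, question marks, and exclamation points. Question marks and exclamation points are relatively unambiguous sentence boundaries, but a period can mark either a sentence boundary or an abbreviation such as Mr. or Inc. Because of this ambiguity, sentence tokenization and word tokenization may be addressed jointly.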
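
The period ambiguity can be illustrated with a small rule-based splitter. This is only a sketch under strong assumptions: the abbreviation list is a hypothetical, very incomplete sample, and `split_sentences` is a name invented for this note. It treats `.`, `?` or `!` followed by whitespace as a candidate boundary and skips the candidate when the token ending there is a known abbreviation.

```python
import re

# Hypothetical, very incomplete abbreviation list.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "inc.", "e.g.", "i.e."}

def split_sentences(text):
    """Split text at ., ? or ! unless the period belongs to an abbreviation."""
    sentences, start = [], 0
    # Candidate boundary: sentence-final punctuation followed by whitespace.
    for match in re.finditer(r"[.?!](?=\s)", text):
        end = match.end()
        last_token = text[start:end].split()[-1].lower()
        if last_token in ABBREVIATIONS:
            continue  # the period is part of an abbreviation, not a boundary
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()  # whatever follows the last boundary
    if tail:
        sentences.append(tail)
    return sentences

text = "Dr. Smith joined Acme Inc. in 2019. Was it a good move? Yes!"
print(split_sentences(text))
# ['Dr. Smith joined Acme Inc. in 2019.', 'Was it a good move?', 'Yes!']
```

A fixed list like this fails when an abbreviation really does end a sentence, which is one reason sentence and word tokenization are often handled together.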