# **Data Normalization (part 2)**

![](https://drive.google.com/uc?export=view&id=1AroOEeHMu8OFM9bCN7AvrjnPc_UvCvsb)

This lesson continues the discussion of data preparation and normalization. What follows are some common rule-based techniques that help bring consistency to how text is handled.

One of the underlying issues with analyses built on word frequency counters or dictionaries is that each form of a word is counted separately. For example, **argue, arguing, argues, argued** would all be distinct keys in our counter even though they are essentially the 'same' word. As we saw in the lessons on tf•idf, word embeddings and word2vec, the more features (usually unique words) there are, the more space (longer vectors) is required to manage each word (i.e. feature).
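To make the problem concrete, here is a minimal sketch using a made-up sentence (the text is an assumption for illustration only). It shows each inflected form of 'argue' ending up as its own key in a counter:

```python
from collections import Counter

# A tiny made-up sentence; every inflected form becomes its own key
text = "they argue we argued she argues while arguing they argue"
counts = Counter(text.split())

for word in ("argue", "argued", "argues", "arguing"):
    print(word, counts[word])

# number of separate features for what is essentially one word
argue_forms = [w for w in counts if w.startswith("argu")]
print(len(argue_forms))
```

The normalization techniques in the rest of this lesson aim to fold such keys together.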



## **Text Cleaning**

![](https://drive.google.com/uc?export=view&id=1H44c5xAatjLQrcgdVR502hUqPTZyJ8jb)

Cleaning usually means removing or intelligently dealing with errors. But cleaning can also involve processing the text so that downstream users of the data can easily tokenize or process it.

For books and other forms of 'printed' text, cleaning can involve removing front and back matter, chapter headings, and page numbers. However, if the text is digitized via OCR (optical character recognition), it's quite possible that additional cleaning (even via machine learning) will be needed to deal with any digital artifacts.
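As a rough sketch of this kind of cleaning, the snippet below drops lines that look like page numbers or chapter headings. The patterns and the sample text are assumptions made up for illustration; a real book will need rules tuned to its own layout.

```python
import re

# Hypothetical patterns: a bare number (optionally prefixed by "Page")
# and a line that only says "Chapter <something>"
PAGE_NUMBER = re.compile(r"^\s*(page\s+)?\d+\s*$", re.IGNORECASE)
CHAPTER_HEADING = re.compile(r"^\s*chapter\s+\w+\s*$", re.IGNORECASE)

def clean_lines(text):
    """Drop blank lines, page numbers and chapter headings, then rejoin."""
    kept = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if PAGE_NUMBER.match(line) or CHAPTER_HEADING.match(line):
            continue  # skip layout lines that carry no content
        kept.append(line.strip())
    return " ".join(kept)

# made-up sample page
sample = "Chapter 3\n\nThe river ran high that spring,\n41\nand the bridge was closed for a week.\n"
print(clean_lines(sample))
```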

It's also possible that if the corpus is large or the text documents are long enough, the analysis is no better off by doing additional cleaning. This is not (usually) true when dealing with 'raw' human text.

## **Human Cleaning**

When you are dealing with text generated by humans (surveys, emails, transcriptions, web pages, recipes, tweets, text messages, chats, etc.) that doesn't go through a rigorous editing process, additional cleaning can be extremely useful. However, it is also extremely difficult. It can include spell correction, making abbreviations consistent (Dr, DR., Doctor, Dr., Doc.), working with text emojis (emoticons) and emojis (😀), and handling poor grammar, improper word usage, sentence structure and inconsistent punctuation (just to name a few).

Most systems that process human-generated text rely on rule sets, as cleaning the text manually is usually too time-consuming, costly and difficult. As a result, an author's use (or misuse) of slang, puns, idioms and spelling will often lead to a wrong analysis of the author's intended meaning.
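One such rule, making the 'doctor' abbreviations consistent, might be sketched as follows. The pattern and the canonical form are assumptions chosen for illustration. Note how blunt the rule is: it would also rewrite 'Dr.' in a street address ('Drive') or 'doc' meaning 'document', which is exactly the kind of wrong analysis rule sets can produce.

```python
import re

# Hypothetical rule: every variant of "doctor" maps to one canonical form.
# The optional trailing period is consumed along with the abbreviation.
DOCTOR = re.compile(r"\b(?:dr|doc|doctor)\b\.?", re.IGNORECASE)

def normalize_titles(text):
    """Replace Dr, DR., Doctor, Dr. and Doc. with 'Doctor'."""
    return DOCTOR.sub("Doctor", text)

result = normalize_titles("Dr Smith, DR. Jones, Doctor Lee, Dr. Wu and Doc. Brown")
print(result)
```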


## **Case normalization, stopwords and cut off lengths**

One of the easiest ways to reduce the number of unique words in your corpus is to simply make them all lowercase. For most situations 'The' and 'the' are equivalent. Another quick method is to drop words that are fewer than 3 characters long. This cleans up leftovers from parsing artifacts (such as single punctuation 'words').

Another effective processing step is to properly handle contractions. Depending on the analysis, 'didn't' and 'did not' should be treated equally. Most contractions can be expanded with a few regular expression rules.
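A minimal sketch of such rules is shown below. The rule list is an illustrative assumption, not a complete set: it leaves out `'s` (which can mean 'is', 'has' or a possessive), it only handles straight apostrophes, and it does not preserve capitalization of the irregular forms.

```python
import re

# A few illustrative rules, applied in order; irregular forms come first
CONTRACTION_RULES = [
    (re.compile(r"\bwon't\b", re.IGNORECASE), "will not"),
    (re.compile(r"\bcan't\b", re.IGNORECASE), "cannot"),
    (re.compile(r"n't\b", re.IGNORECASE), " not"),
    (re.compile(r"'re\b", re.IGNORECASE), " are"),
    (re.compile(r"'ll\b", re.IGNORECASE), " will"),
    (re.compile(r"'ve\b", re.IGNORECASE), " have"),
]

def expand_contractions(text):
    """Apply each rule in turn to expand common contractions."""
    for pattern, replacement in CONTRACTION_RULES:
        text = pattern.sub(replacement, text)
    return text

expanded = expand_contractions("They didn't say we won't go, and you're sure they'll wait.")
print(expanded)
```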

Removing stop words (words so common that they provide no information) is another useful step. Determiners (a, an, another), conjunctions (and, or, but, yet), and prepositions (in, of) are all good candidates. Run the following demo, which illustrates how some simple normalization techniques cut roughly 75K tokens down to fewer than 6K unique words:

In [7]:
import re
import ssl

import nltk
from nltk.corpus import stopwords

def tokenize(text):
    # allow numbers
    # reg = r"['A-Za-z0-9]+-?[A-Za-z0-9']+"

    # exclude numbers; the pattern needs at least two characters,
    # so one-letter words such as 'a' and 'I' are never matched
    reg = r"['A-Za-z]+-?[A-Za-z']+"
    regex = re.compile(reg)
    return regex.findall(text)

def normalize(words):
    # lowercase and strip leading/trailing apostrophes
    return [w.lower().strip("'") for w in words]

def normalization_demo():
    path = "harryPotter.txt"
    with open(path, 'r') as fd:
        text = fd.read()

    # the most basic way to tokenize
    raw = text.split()

    # use a regular expression to tokenize
    words = tokenize(text)
    normalized = normalize(words)

    uniq = set(words)
    uniq_norm = set(normalized)
    # drop words shorter than 3 characters
    uniq_norm_min = set([w for w in uniq_norm if len(w) > 2])

    # work around SSL certificate errors when downloading NLTK data
    try:
        _create_unverified_https_context = ssl._create_unverified_context
    except AttributeError:
        pass
    else:
        ssl._create_default_https_context = _create_unverified_https_context

    nltk.download('stopwords')
    # a set makes the membership test below fast
    stop = set(stopwords.words('english'))

    # the words are already lowercase at this point
    uniq_no_stop = set([w for w in uniq_norm_min if w not in stop])

    # some basic counts of the different techniques
    print(len(raw))    # 78706
    print(len(words))  # 75529 with numbers; 75277 w/out
    print(len(uniq))
    print(len(uniq_norm))
    print(len(uniq_norm_min))
    print(len(uniq_no_stop))
    # print(sorted(uniq_norm_min))

normalization_demo()

78706
75277
7019
6086
6031
5898


[nltk_data] Downloading package stopwords to /Users/mac/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## **Stemming**

![](https://drive.google.com/uc?export=view&id=1JK6jYt-gkmnXKOeQwTSNJUuI_8wvblT4)

Stemming is the process of reducing words to a base or root form, although the result may not be an actual word. A stemmer applies an algorithm in an attempt to get to the root word. One of the easiest transformations is to remove suffixes (e.g. 'ed', 'ing', 'ly'). The process can produce some strange words ('ties' becomes 'ti'). Algorithmic stemming has a rich history in computer science, and there are multiple algorithms for it.
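To see why suffix removal alone produces odd results, here is a deliberately naive suffix stripper. It is a toy written for this lesson (the suffix list and the `min_stem` parameter are assumptions), not any of the published algorithms discussed next:

```python
# Toy suffix list, checked in order; real stemmers use many more rules
SUFFIXES = ("ing", "ed", "ly", "es", "s")

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for word in ["argue", "argued", "argues", "arguing", "running", "easily"]:
    print("{:10s} --> {:s}".format(word, naive_stem(word)))
```

Three of the four 'argue' forms collapse to 'argu', but 'argue' itself is left alone and 'running' becomes 'runn'. Handling cases like these is what the extra rules in real stemmers are for.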

**Porter stemmer**

The Porter Stemmer algorithm (by Martin Porter) is one of the more popular stemmers. NLTK provides an implementation of it as well. Those interested in the details of the algorithm can [read](http://facweb.cs.depaul.edu/mobasher/classes/csc575/papers/porter-algorithm.html) about them.

In [11]:
import nltk

def porter_test(words):
    from nltk.stem.porter import PorterStemmer
    p_stemmer = PorterStemmer()
    for word in words:
        msg = "{:10s} --> {:s}".format(word, p_stemmer.stem(word))
        print(msg)
        
words = 'run runner running ran runs easily fairly children plotted potter'.split()
porter_test(words)

run        --> run
runner     --> runner
running    --> run
ran        --> ran
runs       --> run
easily     --> easili
fairly     --> fairli
children   --> children
plotted    --> plot
potter     --> potter


**Snowball Stemmer**

The [Snowball](https://snowballstem.org/) stemmer ([historical reference](http://snowball.tartarus.org/)) fixes a few of the issues with the Porter algorithm. It was written by the same author. You may also come across references to 'Porter2', which is the same improved algorithm.

In [12]:
def snowball_test(words):
    # Porter2
    # The Snowball Stemmer requires that you pass a language parameter
    stemmer = nltk.stem.snowball.SnowballStemmer(language='english')
    for word in words:
        msg = "{:10s} --> {:s}".format(word, stemmer.stem(word))
        print(msg)
snowball_test(words)

run        --> run
runner     --> runner
running    --> run
ran        --> ran
runs       --> run
easily     --> easili
fairly     --> fair
children   --> children
plotted    --> plot
potter     --> potter


You can also use Snowball to build your own domain-specific stemmer.

**Lancaster Stemmer**

The Lancaster stemmer is a very aggressive stemming algorithm, sometimes to a fault. With Porter and Snowball, the stemmed representations are usually fairly intuitive. With Lancaster, however, many shorter words become totally obfuscated. It is the fastest algorithm of the three and will reduce your working set of words hugely. If you want more distinction between words, it is not the tool you want.

In [13]:
def lancaster_test(words):
    stemmer = nltk.stem.lancaster.LancasterStemmer()
    for word in words:
        msg = "{:10s} --> {:s}".format(word, stemmer.stem(word))
        print(msg)
lancaster_test(words)

run        --> run
runner     --> run
running    --> run
ran        --> ran
runs       --> run
easily     --> easy
fairly     --> fair
children   --> childr
plotted    --> plot
potter     --> pot






> ***Data Scientist Log:*** It's important that you not only run the above code examples, but also read and interpret their results. Test them with your own set of words. How do the different algorithms treat those words?

## **Lemmatization**

![](https://drive.google.com/uc?export=view&id=1heucu8-yXYyOWi5867GXViIVZFmzMt4e)

Unlike stemming, lemmatization attempts to reduce the word while keeping its part of speech. In linguistics, it is the process of grouping together the different inflected forms of a word and treating the set as a single item. Lemmatization looks at surrounding text to determine a given word's part of speech.

A lemma is the form of a word that usually appears in the dictionary and is used to represent the other forms of that word. Lemmatization is the algorithmic process of determining the lemma of a word based on its intended meaning.
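The role of part of speech can be illustrated with a toy lookup table. The table below is hand-written and hypothetical, as are the tag names. Real lemmatizers draw on large lexical resources, but the idea that the same surface form can map to different lemmas is the same:

```python
# Hypothetical, hand-written table keyed by (word, part of speech)
LEMMAS = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("better", "ADJ"): "good",
    ("children", "NOUN"): "child",
    ("ran", "VERB"): "run",
}

def toy_lemmatize(word, pos):
    """Return the lemma for (word, pos), or the lowercased word if unknown."""
    key = (word.lower(), pos)
    return LEMMAS.get(key, word.lower())

print(toy_lemmatize("saw", "VERB"))       # the past tense of 'see'
print(toy_lemmatize("saw", "NOUN"))       # the cutting tool
print(toy_lemmatize("Children", "NOUN"))
```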

**NLTK Lemmatization via Wordnet**

NLTK has a lemmatizer that uses [WordNet](https://wordnet.princeton.edu/), a large lexical database of English nouns, verbs, adjectives and adverbs from Princeton. WordNet also provides synsets, which form a linked network of related words.

The following shows a simple demonstration. Note that `lemmatize` treats every word as a noun unless a `pos` argument is given, which is why 'running' and 'plotted' come back unchanged:

In [16]:
def demo_nltk_lemma(words):
    import nltk
    nltk.download('wordnet')
    lemmer  = nltk.stem.WordNetLemmatizer()
    for word in words:
        msg = "{:10s} --> {:s}".format(word, lemmer.lemmatize(word))
        print(msg)

    # ask for a specific usage
    msg = "{:10s} --> {:s}".format('better', lemmer.lemmatize('better', pos="a"))
    print(msg)
    
demo_nltk_lemma(words)
demo_nltk_lemma(['better'])

run        --> run
runner     --> runner
running    --> running
ran        --> ran
runs       --> run
easily     --> easily
fairly     --> fairly
children   --> child
plotted    --> plotted
potter     --> potter
better     --> good
better     --> better
better     --> good


[nltk_data] Downloading package wordnet to /Users/mac/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/mac/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


**spaCy Lemmatization**

spaCy has opted to provide only lemmatization, with no stemming features. The following shows that when you tokenize a passage of text, each token includes its lemma. The `'en'` shortcut is obsolete as of spaCy v3.0, so the model is loaded by its full name, `en_core_web_sm` (install it first with `python -m spacy download en_core_web_sm`).

In [22]:
def spacy_lemma_demo1():

    import spacy
    # load the model by its full name; the 'en' shortcut raises OSError [E941] in spaCy v3
    nlp = spacy.load('en_core_web_sm')

    # tokens have a lemma_
    doc = nlp("Apples are better than ducks")
    for token in doc:
        print(token.text, '==>', token.lemma_)

spacy_lemma_demo1()

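You can also use the lemmatizer directly. In spaCy v3 it is a pipeline component, and the `spacy.lemmatizer` module no longer exists (importing it raises `ModuleNotFoundError`), so the cell below is a sketch of the v3 equivalent:

In [24]:
def spacy_lemma_demo2():
    import spacy
    nlp = spacy.load('en_core_web_sm')
    # the lemmatizer is a component of the loaded pipeline
    lemmatizer = nlp.get_pipe('lemmatizer')
    # run the pipeline first so the token carries a part-of-speech tag
    doc = nlp('ducks')
    # rule_lemmatize returns a list of candidate lemmas for one token
    l = lemmatizer.rule_lemmatize(doc[0])
    print(l)

spacy_lemma_demo2()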

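You can also build your own lemmatizer and add rules depending on your situation. The cell below is a sketch for spaCy v3, where custom tables are handed to the lemmatizer component through `initialize`, and it assumes a blank English pipeline with the part of speech set by hand:

In [None]:
def spacy_lemma_demo3():
    import spacy
    from spacy.lookups import Lookups
    lookups = Lookups()
    # add a custom conversion for all nouns
    lookups.add_table("lemma_rules", {"noun": [["s", ""]]})
    # a blank pipeline with a rule-based lemmatizer that uses only our table
    nlp = spacy.blank("en")
    lemmatizer = nlp.add_pipe("lemmatizer", config={"mode": "rule"})
    lemmatizer.initialize(lookups=lookups)
    # tokenize only, then set the part of speech by hand (no tagger is loaded)
    doc = nlp.make_doc("ducks")
    doc[0].pos_ = "NOUN"
    lemmas = lemmatizer.rule_lemmatize(doc[0])
    print(lemmas)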

Additionally Gensim, TextBlob, and Stanford's CoreNLP provide lemmatizers.


# **Lesson Assignment**
Although there is no assignment, make sure you understand and learn the concepts taught in this lesson.

<h1><center>The End!</center></h1>