## Text Normalization
Text normalization is a series of steps that clean and standardize textual data into a form that other NLP and analytics systems and applications can consume as input. Tokenization is itself a part of text normalization. Besides tokenization, other techniques include cleaning text, case conversion, correcting spellings, removing stopwords and other unnecessary terms, stemming, and lemmatization.

### Tokenizing
Usually, we tokenize text before or after removing unnecessary characters and symbols
from the data. This choice depends on the problem we are trying to solve and the data
we are dealing with.
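
The effect of that ordering can be seen in a small sketch. It assumes a hypothetical regex tokenizer, `simple_tokenize`, rather than any particular library's tokenizer; the helper names are made up for illustration.

```python
import re
import string

PUNCTUATION = set(string.punctuation)

def simple_tokenize(text):
    # A token is a run of word characters or a single non-space symbol.
    return re.findall(r"\w+|[^\w\s]", text)

def strip_punctuation(text):
    # Delete every ASCII punctuation character from the raw string.
    return text.translate(str.maketrans('', '', string.punctuation))

text = "The close-knit community heard rumors."

# Option 1: tokenize first, then drop punctuation tokens.
tokenize_then_clean = [t for t in simple_tokenize(text) if t not in PUNCTUATION]

# Option 2: remove punctuation first, then tokenize.
clean_then_tokenize = simple_tokenize(strip_punctuation(text))

print(tokenize_then_clean)  # ['The', 'close', 'knit', 'community', 'heard', 'rumors']
print(clean_then_tokenize)  # ['The', 'closeknit', 'community', 'heard', 'rumors']
```

With this toy tokenizer, the hyphenated word ends up as two tokens in one ordering and as a single fused token in the other, which is the kind of difference that should guide the choice.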

### Removing Special Characters
One important task in text normalization is removing unnecessary and special
characters, such as special symbols or the punctuation that occurs in sentences.
This step can be performed either before or after tokenization. The main reason for doing so is
that punctuation and special characters often do not have much significance when we
analyze the text and use it to extract features or information with NLP and ML.


In [15]:
import nltk
import string
import re

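def remove_special_chars(tokens):
    # string.punctuation covers ASCII punctuation only, so a character
    # such as the curly apostrophe (’) is not matched by this pattern
    pattern = re.compile('[{}]'.format(re.escape(string.punctuation)))
    # strip punctuation from each token, then drop tokens left empty
    new_tokens = filter(None, [pattern.sub('', token) for token in tokens])
    return new_tokens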

sample_text = """Hours later, rumors began to spread in the close-knit community that something \
had happened to her son in the nearby river. Kapessa’s older sister heard from friends that \
Kapessa, who could not swim, may have jumped into the water; others alleged that he had been \
pushed."""
words = nltk.word_tokenize(sample_text)

new_words = remove_special_chars(words)
print(list(new_words))

['Hours', 'later', 'rumors', 'began', 'to', 'spread', 'in', 'the', 'closeknit', 'community', 'that', 'something', 'had', 'happened', 'to', 'her', 'son', 'in', 'the', 'nearby', 'river', 'Kapessa', '’', 's', 'older', 'sister', 'heard', 'from', 'friends', 'that', 'Kapessa', 'who', 'could', 'not', 'swim', 'may', 'have', 'jumped', 'into', 'the', 'water', 'others', 'alleged', 'that', 'he', 'had', 'been', 'pushed']
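
The `’` token survives in the output above because `string.punctuation` lists only ASCII punctuation. One possible alternative, sketched here under the assumption that every character in a Unicode punctuation category should be removed, is to check `unicodedata.category`. The function name `remove_unicode_punctuation` is hypothetical.

```python
import unicodedata

def remove_unicode_punctuation(tokens):
    cleaned = []
    for token in tokens:
        # Unicode general categories starting with 'P' are punctuation
        # (for example Pd for dashes, Pf for closing quotes, Po for others).
        stripped = ''.join(
            ch for ch in token
            if not unicodedata.category(ch).startswith('P')
        )
        # Drop tokens that consisted of punctuation only.
        if stripped:
            cleaned.append(stripped)
    return cleaned

print(remove_unicode_punctuation(['Kapessa', '’', 's', 'close-knit', ',']))
# ['Kapessa', 's', 'closeknit']
```

Symbols such as `$` or `+` belong to the `S` categories rather than `P`, so this sketch leaves them alone; whether that is desirable depends on the data.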


### Removing Accented Characters
We might be dealing with accented characters/letters in our text, which matters especially
if we only want to analyze the English language. Hence, we need to make sure that these
characters are converted and standardized into ASCII characters. A simple
example is converting é to e.

In [3]:
import unicodedata

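def remove_accented_chars(text):
    # NFKD splits an accented letter into a base letter plus combining marks;
    # encoding to ASCII with 'ignore' then drops everything that is not ASCII
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text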

str1 = 'Sómě Áccěntěd těxt'
remove_accented_chars(str1)

'Some Accented text'
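
To make the normalization step more concrete, the sketch below shows what NFKD produces for a single accented letter, along with a variant that removes only the combining marks instead of every non-ASCII character. `strip_combining_marks` is a hypothetical helper, not part of the code above.

```python
import unicodedata

# NFKD decomposes 'é' (U+00E9) into a base letter and a combining accent.
decomposed = unicodedata.normalize('NFKD', '\u00e9')
print([unicodedata.name(ch) for ch in decomposed])
# ['LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT']

def strip_combining_marks(text):
    # Keep every character except combining marks, so text in
    # non-Latin scripts is preserved rather than discarded.
    decomposed_text = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed_text if not unicodedata.combining(ch))

print(strip_combining_marks('Sómě Áccěntěd těxt'))  # Some Accented text
print(strip_combining_marks('naïve 東京'))           # naive 東京
```

The ASCII-encoding approach would drop `東京` entirely, so this variant may be preferable when the corpus is not purely English.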

### Expanding Contractions
Contractions are shortened versions of words or syllables, such as can't for cannot.
They exist in both written and spoken forms.

In [8]:
import contractions

contractions.fix("Y'all can't expand contractions I'd think")

'you all cannot expand contractions I would think'
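
A rough idea of how such an expansion could work is a lookup table combined with a regular expression. The sketch below is an assumption for illustration, not the `contractions` package's implementation; `CONTRACTION_MAP` is a tiny made-up table and `expand_contractions` is a hypothetical name.

```python
import re

# Tiny illustrative table; a real one would be far larger.
CONTRACTION_MAP = {
    "can't": "cannot",
    "i'd": "i would",
    "y'all": "you all",
    "won't": "will not",
}

# One alternation over all known contractions, matched case-insensitively.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTION_MAP) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text, mapping=CONTRACTION_MAP):
    def replace(match):
        word = match.group(0)
        expanded = mapping[word.lower()]
        # Preserve a leading capital letter from the original word.
        if word[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded
    return _PATTERN.sub(replace, text)

print(expand_contractions("Y'all can't expand contractions I'd think"))
# You all cannot expand contractions I would think
```

A plain table cannot resolve ambiguous forms (I'd may stand for I would or I had), and this sketch only matches the straight apostrophe, not the curly one.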

### Removing Stopwords
Stopwords are words that have little or no significance. They are usually removed from text during processing so as to retain the words that carry the most significance and context. Stopwords are usually the words that occur most often when you aggregate a corpus of text by single tokens and check their frequencies. Words like a, the, and me are stopwords.
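
The frequency observation can be checked on a small scale with a token count. The sketch below assumes a crude lowercase regex tokenizer and uses a single short passage, so it only hints at what a full corpus would show.

```python
import re
from collections import Counter

sample_text = (
    "Hours later, rumors began to spread in the close-knit community that something "
    "had happened to her son in the nearby river. Kapessa’s older sister heard from friends that "
    "Kapessa, who could not swim, may have jumped into the water; others alleged that he had been "
    "pushed."
)

# Lowercase the text and keep runs of ASCII letters as tokens.
tokens = re.findall(r"[a-z]+", sample_text.lower())
counts = Counter(tokens)

# Collect every token that reaches the highest count.
highest = max(counts.values())
most_frequent = sorted(word for word, count in counts.items() if count == highest)

print(highest)        # 3
print(most_frequent)  # ['that', 'the']
```

Even in this short passage, the most frequent tokens are function words rather than content words.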

In [16]:
import nltk

def remove_stopwords(tokens):
    # build a set once for fast membership tests; the comparison is case-sensitive
    stopwords = set(nltk.corpus.stopwords.words('english'))
    return [token for token in tokens if token not in stopwords]

sample_text = """Hours later, rumors began to spread in the close-knit community that something \
had happened to her son in the nearby river. Kapessa’s older sister heard from friends that \
Kapessa, who could not swim, may have jumped into the water; others alleged that he had been \
pushed."""

print(remove_stopwords(nltk.word_tokenize(sample_text)))

['Hours', 'later', ',', 'rumors', 'began', 'spread', 'close-knit', 'community', 'something', 'happened', 'son', 'nearby', 'river', '.', 'Kapessa', '’', 'older', 'sister', 'heard', 'friends', 'Kapessa', ',', 'could', 'swim', ',', 'may', 'jumped', 'water', ';', 'others', 'alleged', 'pushed', '.']
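
The comparison in the function above is case-sensitive, so a capitalized stopword such as The at the start of a sentence would be kept. A possible variant is sketched below; it assumes a small hand-written stopword set, `SAMPLE_STOPWORDS`, in place of NLTK's list so that it runs without any downloaded corpus.

```python
# Hypothetical, deliberately tiny stopword set for illustration only.
SAMPLE_STOPWORDS = frozenset({'a', 'the', 'me', 'to', 'in', 'that', 'had', 'her'})

def remove_stopwords_casefold(tokens, stopwords=SAMPLE_STOPWORDS):
    # Compare the lowercased token, but return tokens in their original form.
    return [token for token in tokens if token.lower() not in stopwords]

tokens = ['The', 'rumors', 'began', 'to', 'spread', 'in', 'the', 'community']
print(remove_stopwords_casefold(tokens))
# ['rumors', 'began', 'spread', 'community']
```

Lowercasing only for the comparison keeps the original casing of the remaining tokens, which can matter for later steps such as named-entity recognition.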


### Correcting Words

In [35]:
import textblob

incorrect_word = textblob.Word('fianly')
incorrect_word.correct()

'finally'
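
One common way to correct a word like this is to generate candidates within a small edit distance and pick the most frequent known word, in the style of Peter Norvig's well-known spelling corrector. The sketch below is an illustration of that idea only, not TextBlob's code; `WORD_COUNTS` is a made-up toy vocabulary with invented frequencies, and the function names are hypothetical.

```python
import string

# Toy vocabulary with invented counts; a real corrector would derive
# these from a large corpus.
WORD_COUNTS = {'finally': 120, 'final': 80, 'finely': 10, 'family': 150}

def edits1(word):
    # All strings one edit away: delete, transpose, replace, or insert.
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    # Keep only candidates that appear in the vocabulary.
    return {w for w in words if w in WORD_COUNTS}

def correct(word):
    # Prefer the word itself, then distance-1 candidates, then distance-2.
    candidates = (
        known([word])
        or known(edits1(word))
        or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
        or {word}
    )
    return max(candidates, key=lambda w: WORD_COUNTS.get(w, 0))

print(correct('fianly'))  # finally
```

Here `finally`, `final`, and `finely` are all two edits away from `fianly`, so the result depends entirely on the frequency table; with different counts the sketch would choose a different word.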

### Stemming
A morpheme is the smallest meaningful lexical item in a language. A morpheme is not necessarily the same as a word: a morpheme sometimes cannot stand alone, whereas a word, by definition, always can. Word stems are also often known as the <i>base form</i> of a word, and we can create new words by attaching affixes to them in a process known as <i>inflection</i>. The reverse, obtaining the base form of a word from its inflected form, is known as <i>stemming</i>.
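
As a rough illustration of the idea, the sketch below strips a few common English suffixes. It is a deliberately naive, hypothetical `naive_stem` function and not the algorithm used by any real stemmer, which applies many more rules and conditions.

```python
def naive_stem(word, suffixes=('ing', 'ed', 's')):
    for suffix in suffixes:
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

words = ['jumping', 'jumps', 'jumped', 'lying', 'running']
print([naive_stem(w) for w in words])
# ['jump', 'jump', 'jump', 'lying', 'runn']
```

The result `runn` shows why plain suffix stripping is not enough and why a stem is not guaranteed to be a dictionary word.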

In [31]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

print(stemmer.stem('jumping'), stemmer.stem('jumps'), stemmer.stem('jumped'))
print(stemmer.stem('lying'))
print(stemmer.stem('strange'))

jump jump jump
lie
strang


In [32]:
from nltk.stem import LancasterStemmer
stemmer = LancasterStemmer()

print(stemmer.stem('jumping'), stemmer.stem('jumps'), stemmer.stem('jumped'))
print(stemmer.stem('lying'))
print(stemmer.stem('strange'))

jump jump jump
lying
strange


In [34]:
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer('english')

print(stemmer.stem('jumping'), stemmer.stem('jumps'), stemmer.stem('jumped'))
print(stemmer.stem('lying'))
print(stemmer.stem('strange'))

jump jump jump
lie
strang


### Lemmatization 
The process of <i>lemmatization</i> is very similar to stemming: you remove word affixes to
get to a base form of the word. In this case, however, the base form is the <i>root
word</i>, not the <i>root stem</i>. The difference is that the root stem may not always be a lexicographically correct word; that is, it may not be present in the dictionary. The root
word, also known as the <i>lemma</i>, will always be present in the dictionary.

In [37]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# lemmatize nouns
print(lemmatizer.lemmatize('cars', 'n'))
print(lemmatizer.lemmatize('men', 'n'))

# lemmatize verbs
print(lemmatizer.lemmatize('running', 'v'))
print(lemmatizer.lemmatize('ate', 'v'))

# lemmatize adjectives
print(lemmatizer.lemmatize('saddest', 'a'))
print(lemmatizer.lemmatize('fancier', 'a'))

# ineffective lemmatization: the POS tag does not match the word's part of speech
print(lemmatizer.lemmatize('ate', 'n'))
print(lemmatizer.lemmatize('fancier', 'v'))

car
men
run
eat
sad
fancy
ate
fancier
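
Because the result depends on the part-of-speech argument, a lemmatization pipeline usually needs to derive that argument from a tagger's output. The sketch below assumes Penn Treebank style tags (such as `VBD` or `NNS`) and maps them to the single-letter codes used in the calls above; `penn_to_wordnet` is a hypothetical helper, and the tagged pairs are written by hand rather than produced by a tagger.

```python
def penn_to_wordnet(tag):
    # Map a Penn Treebank tag prefix to a WordNet POS code.
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # default to noun

# Hand-written (word, tag) pairs standing in for tagger output.
tagged = [('ate', 'VBD'), ('cars', 'NNS'), ('saddest', 'JJS'), ('quickly', 'RB')]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
# [('ate', 'v'), ('cars', 'n'), ('saddest', 'a'), ('quickly', 'r')]
```

Each resulting code could then be passed as the second argument to `lemmatizer.lemmatize`, so that `ate` is treated as a verb rather than a noun.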


In [39]:
import spacy

nlp = spacy.load('en_core_web_md')

sample_text = """Hours later, rumors began to spread in the close-knit community that something \
had happened to her son in the nearby river. Kapessa’s older sister heard from friends that \
Kapessa, who could not swim, may have jumped into the water; others alleged that he had been \
pushed."""

doc = nlp(sample_text)

lemmas = [token.lemma_ for token in doc]
print(lemmas)

['hour', 'later', ',', 'rumor', 'begin', 'to', 'spread', 'in', 'the', 'close', '-', 'knit', 'community', 'that', 'something', 'have', 'happen', 'to', 'her', 'son', 'in', 'the', 'nearby', 'river', '.', 'Kapessa', '’s', 'old', 'sister', 'hear', 'from', 'friend', 'that', 'Kapessa', ',', 'who', 'could', 'not', 'swim', ',', 'may', 'have', 'jump', 'into', 'the', 'water', ';', 'other', 'allege', 'that', 'he', 'have', 'be', 'push', '.']
