In [16]:
import nltk

# Replacing and Correcting Words

In this chapter, we will go over various word replacement and correction techniques. <br>
The recipes cover the gamut of: 
- linguistic compression, 
- spelling correction, and 
- text normalization. 

<br>All of these methods can be **very useful for preprocessing text before search indexing, document classification, and text analysis.**

## Stemming words - Text Compression

In [41]:
def stem_examples(processor, lemmatize=False):
    """Tokenize the sample text and return it with every token stemmed or lemmatized."""
    text = """
    The PorterStemmer class knows a number of regular word forms and suffixes and uses  this knowledge 
    to transform your input word to a final stem through a series of steps. The resulting stem is often a 
    shorter word, or at least a common form of the word, which has  the same root meaning
    """
    # A lemmatizer exposes lemmatize(); a stemmer exposes stem().
    transform = processor.lemmatize if lemmatize else processor.stem
    return ' '.join(transform(word) for word in nltk.word_tokenize(text))

#### Porter Stemmer

In [24]:
# Using the rule-based Porter stemmer
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

In [25]:
stem_examples(stemmer)

'the porterstemm class know a number of regular word form and suffix and use thi knowledg to transform your input word to a final stem through a seri of step . the result stem is often a shorter word , or at least a common form of the word , which ha the same root mean'

The PorterStemmer class knows a number of regular word forms and suffixes and uses this knowledge to transform your input word to a final stem through a series of steps. The resulting stem is often a shorter word, or at least a common form of the word, which has the same root meaning.
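The suffix stripping described above is easier to see on single words than on a whole paragraph. The following is a minimal sketch; the word list is chosen only for illustration, and two of the three words are taken from the sample text used in this notebook.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Each word is stemmed independently; the stem need not be a dictionary word.
words = ['cooking', 'knowledge', 'series']
stems = [stemmer.stem(word) for word in words]

for word, stem in zip(words, stems):
    print('{} -> {}'.format(word, stem))
```

Note that stems such as "knowledg" and "seri" are not valid English words, which is the main difference from the lemmas produced later in this chapter.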

#### Lancaster Stemmer

In [26]:
from nltk.stem import LancasterStemmer
stemmer = LancasterStemmer()

In [27]:
stem_examples(stemmer)

'the porterstem class know a numb of regul word form and suffix and us thi knowledg to transform yo input word to a fin stem through a sery of step . the result stem is oft a short word , or at least a common form of the word , which has the sam root mean'

#### Snowball Stemmer

The SnowballStemmer class supports 13 non-English languages. It also provides two English stemmers: the original Porter algorithm ('porter') as well as the newer English stemming algorithm ('english'). To use the SnowballStemmer class, create an instance with the name of the language you are using and then call the stem() method.

In [34]:
from nltk.stem import SnowballStemmer

'; '.join(SnowballStemmer.languages)

'danish; dutch; english; finnish; french; german; hungarian; italian; norwegian; porter; portuguese; romanian; russian; spanish; swedish'

In [35]:
stemmer = SnowballStemmer('english')

In [28]:
stem_examples(stemmer)

'the porterstem class know a numb of regul word form and suffix and us thi knowledg to transform yo input word to a fin stem through a sery of step . the result stem is oft a short word , or at least a common form of the word , which has the sam root mean'
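Selecting a non-English language works the same way as the English case. The sketch below uses a Spanish word as an illustration and also shows what happens with a language name that is not in `SnowballStemmer.languages` ('klingon' is a made-up name used only to trigger the error).

```python
from nltk.stem import SnowballStemmer

# The language name passed to the constructor selects the stemming algorithm.
spanish_stemmer = SnowballStemmer('spanish')
hola_stem = spanish_stemmer.stem('hola')
print(hola_stem)

# An unsupported language name is rejected when the stemmer is constructed.
try:
    SnowballStemmer('klingon')
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```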

#### Regex Stemmer

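The RegexpStemmer class takes a single regular expression (either compiled or as a string) and removes every part of the word that matches it. Without an anchor such as `^` or `$`, the match can occur anywhere in the word, not only as a prefix or suffix: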

In [30]:
from nltk.stem import RegexpStemmer
stemmer = RegexpStemmer('ing')

In [31]:
stem_examples(stemmer)

'The PorterStemmer class knows a number of regular word forms and suffixes and uses this knowledge to transform your input word to a final stem through a series of steps . The result stem is often a shorter word , or at least a common form of the word , which has the same root mean'
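Because the pattern `'ing'` is not anchored, it also strips "ing" from the start or middle of a word. The sketch below compares it with an anchored variant; the names `loose` and `suffix_only` and the test words are chosen here for illustration.

```python
from nltk.stem import RegexpStemmer

# Unanchored pattern: every occurrence of "ing" is deleted, wherever it appears.
loose = RegexpStemmer('ing')
# Anchored pattern: only a trailing "ing" is deleted.
suffix_only = RegexpStemmer('ing$')

for word in ['cooking', 'ingleside']:
    print(word, loose.stem(word), suffix_only.stem(word))
```

Anchoring with `$` is usually the safer choice when the goal is to remove a suffix only.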

## Lemmatizing Words
Lemmatization is very similar to stemming, but is more akin to synonym replacement. A lemma is a root word, as opposed to the root stem. So unlike stemming, you are always left with a valid word that means the same thing. However, the word you end up with can be completely different.

In [36]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [40]:
stem_examples(lemmatizer, lemmatize=True)

'The PorterStemmer class know a number of regular word form and suffix and us this knowledge to transform your input word to a final stem through a series of step . The resulting stem is often a shorter word , or at least a common form of the word , which ha the same root meaning'

The WordNetLemmatizer class is a thin wrapper around the wordnet corpus and uses the **morphy()** function of the WordNetCorpusReader class to find a lemma. If no lemma is found, or the word itself is a lemma, the word is returned as is. Unlike with stemming, knowing the part of speech of the word is important. For example, cooking does not return a different lemma unless you specify that the POS is a verb. This is because the default POS is a noun, and as a noun, cooking is its own lemma. On the other hand, cookbooks is a noun with its singular form, cookbook, as its lemma.

### Maximum Text Compression

When maximum text compression matters, we can combine stemming and lemmatization: each token is stemmed first, and the resulting stem is then lemmatized.

In [42]:
from nltk.stem import SnowballStemmer, WordNetLemmatizer
stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

# apply both processes
def max_compression(sentence):
    return [lemmatizer.lemmatize(stemmer.stem(word)) for word in nltk.word_tokenize(sentence)]

In [67]:
text = """
    The PorterStemmer class knows a number of regular word forms and suffixes and uses  this knowledge 
    to transform your input word to a final stem through a series of steps. The resulting stem is often a 
    shorter word, or at least a common form of the word, which has  the same root meaning
    """
compressed = ' '.join(max_compression(text))
print('The original text has a length of {}, and after compression retains a length of {}'
      .format(len(text), len(compressed)))
# multiply by 100 so the printed value really is a percentage
print('This is a reduction of {:0.2f} %'.format(100 * (1 - len(compressed) / len(text))))

The original text has a length of 306, and after compression retains a length of 269
This is a reduction of 12.09 %


## Replacing words with regex

To clean up slang and misspellings, we can use manual replacement rules.

#### Removing Slang

In [89]:
import re

replacement_patterns = [
    (r'ain\'t', 'are not'),
    (r'gonna', 'going to'),
    (r'won\'t', 'will not'),
    (r'can\'t', 'can not'),
    (r'i\'m', 'i am'),
    # generic rule last: any remaining <word>'ll becomes "<word> will"
    (r'(\w+)\'ll', r'\g<1> will')
]

In [90]:
class RegexReplacer(object):
    """Apply a list of (regex, replacement) rules to a text, in order."""
    def __init__(self, patterns=replacement_patterns):
        self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]
        
    def replace(self, text):
        # Rules run sequentially, so earlier replacements are visible to later rules.
        for (pattern, repl) in self.patterns:
            text = pattern.sub(repl, text)
        return text

In [91]:
replacer = RegexReplacer()
replacer.replace("You ain't gonna believe that i can't express how greatfull i am that i'm back")

'You are not going to believe that i can not express how greatfull i am that i am back'
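The rules above are case-sensitive and match anywhere in the text, so "I'm" at the start of a sentence is left untouched. One possible refinement, sketched here with the standard `re` module only, is to add word boundaries and ignore case. The function `replace_slang` and the `u` rule are hypothetical additions for illustration and are not part of the replacer above.

```python
import re

# Hypothetical rule set: \b restricts each match to whole words.
patterns = [
    (r"\bu\b", "you"),          # illustrative rule; must not fire inside "but"
    (r"\bgonna\b", "going to"),
    (r"\bcan't\b", "can not"),
    (r"\bi'm\b", "i am"),
]
# re.IGNORECASE lets the rules match "I'm" as well as "i'm".
compiled = [(re.compile(regex, re.IGNORECASE), repl) for regex, repl in patterns]

def replace_slang(text):
    """Apply every rule in order and return the rewritten text."""
    for pattern, repl in compiled:
        text = pattern.sub(repl, text)
    return text

print(replace_slang("I'm gonna win"))
print(replace_slang("but u can't"))
```

The replacement strings are inserted as written, so a capitalized match such as "I'm" comes back in lowercase; preserving the original case would need a replacement function instead of a fixed string.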

#### Removing redundant characters

In [None]:
import re

class RepeatReplacer(object):
    def __init__(self):
        self.repeate_regexp