# Text Normalization using spaCy

## Steps for Text Normalization

At least three steps are involved in normalizing text:

* ***Segmenting/tokenizing words from running text***
* ***Normalizing word formats***
* * Convert characters to lowercase
* * Expand abbreviations
* * Remove stopwords
* * Lemmatization
* * Stemming
* ***Segmenting sentences in running text***
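
The word-format steps listed above can be illustrated without any NLP library. The sketch below is a minimal example under stated assumptions: the `ABBREVIATIONS` map and the `STOPWORDS` set are tiny hypothetical stand-ins (not spaCy resources), and whitespace splitting stands in for real tokenization.

```python
import string

# Hypothetical, deliberately tiny resources for illustration only.
ABBREVIATIONS = {"usa": "united states of america", "uk": "united kingdom"}
STOPWORDS = {"is", "of", "the", "a", "an", "are", "where"}


def normalize_words(text):
    """Lowercase, strip punctuation, expand abbreviations, drop stopwords."""
    normalized = []
    for raw in text.split():
        # Strip surrounding punctuation, then lowercase.
        word = raw.strip(string.punctuation).lower()
        if not word:
            continue
        # Expand known abbreviations; the expansion may be several words.
        expanded = ABBREVIATIONS.get(word, word)
        for part in expanded.split():
            if part not in STOPWORDS:
                normalized.append(part)
    return normalized


print(normalize_words("Bonn is the part of Germany."))
```

The order matters here: lowercasing happens before the abbreviation lookup so that `USA` and `usa` hit the same key, and stopwords are removed last so that words introduced by an expansion are filtered too.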

* * ***spaCy does not implement stemming***
* * * Reference: https://github.com/explosion/spaCy/issues/327
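
Because spaCy offers no stemmer, stemming has to come from elsewhere (NLTK's Porter stemmer is a common choice). To show how stemming differs from lemmatization, here is a toy suffix-stripping function. It is a hypothetical illustration, not the Porter algorithm, and the `SUFFIXES` tuple is an assumption chosen for this example.

```python
# Suffixes are tried in this order; the list is an assumption for this sketch.
SUFFIXES = ("ing", "es", "ed", "s")


def toy_stem(word, min_stem_length=3):
    """Strip the first matching suffix, keeping at least min_stem_length characters."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= min_stem_length:
            return lowered[: -len(suffix)]
    return lowered


for word in ["living", "peoples", "is", "parts"]:
    print("%s -> %s" % (word, toy_stem(word)))
```

A stem such as `liv` need not be a dictionary word, whereas a lemmatizer aims to return a real base form such as `live`.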

In [1]:
from __future__ import unicode_literals  # must be the first statement in the cell
import spacy
import string
from spacy.en import English
nlp = spacy.load('en')  # load the English language model

In [2]:
parser = English()

In [93]:
class text_normalization:
    def __init__(self,parser):
        self.punctuations = list(string.punctuation)
        self.parser = parser
        
    def word_tokenizer(self,text):
        '''
        This function takes text as input and returns
        a list of tokenized words
        '''
        list_of_tokenized_words = []
        lowercase_words = []
        # parse the data
        parsedData = self.parser(text)
        for token in parsedData:
            # each token exposes its attributes in two forms:
            # with a trailing underscore (token.orth_) the attribute is a string,
            # without it (token.orth) it is an index (int) into spaCy's vocabulary
            lower_case = self.convert_characters_lower_case(token)
            token = token.orth_

            if token in self.punctuations:
                continue
            else:
                lowercase_words.append(lower_case)
                list_of_tokenized_words.append(token)
        print "Original text: \n ", text
        print "================================================================================="
        print "Tokenized sentences are : \n ", self.sent_tokenizer(text)
        print "================================================================================="
        print "List of tokenized words without punctutations: \n " + str(list_of_tokenized_words)  
        print "================================================================================="
        print "Length of words: ", len(list_of_tokenized_words)
        print "================================================================================="
        print "Lowercase words: \n " + str(lowercase_words)
        print "================================================================================="
        return list_of_tokenized_words
        
    def convert_characters_lower_case(self,token):
        '''
        Convert characters of each token to lower case
        '''
        return token.lower_
        
    def sent_tokenizer(self,sentences):
        '''
        This function takes a text as input and
        returns a list of tokenized sentences
        '''
        # uses the module-level `nlp` pipeline loaded with spacy.load('en')
        document = nlp(sentences)
        list_of_tokenized_sentence = []
        for sentence in document.sents:
            list_of_tokenized_sentence.append(sentence)
        
        return list_of_tokenized_sentence
        
    def lemma(self,text):
        '''
        This function returns the list of lemmas
        in the text
        '''
        lemma_list = []
        parsedData = self.parser(text)
        for word in parsedData:
            word = word.lemma_

            if word in self.punctuations:
                continue
            else:
                lemma_list.append(word)
                
        print "Lemma of text is: \n ",lemma_list
        return lemma_list
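
One detail of the class above is worth noting: `string.punctuation` holds single characters, so the membership test only drops tokens that are exactly one punctuation character. A multi-character token such as `...` or `--` would pass through. The helper below sketches a possible stricter check; `is_punctuation` is a hypothetical name, not part of spaCy or of the class above.

```python
import string

PUNCTUATION = set(string.punctuation)  # a set gives constant-time membership tests


def is_punctuation(token_text):
    """Return True if every character of a non-empty token is punctuation."""
    return bool(token_text) and all(ch in PUNCTUATION for ch in token_text)


tokens = ["USA", ",", "...", "--", "living", "."]
kept = [t for t in tokens if not is_punctuation(t)]
print(kept)
```

spaCy tokens also carry an `is_punct` attribute, which may serve the same purpose when working directly with parsed tokens.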
            

In [94]:
text_normalizer = text_normalization(parser)
text ="San Francisco is part of the USA, where many peoples are living. Bonn is the part of Germany"

word_tokenizer = text_normalizer.word_tokenizer(text)
lemma = text_normalizer.lemma(text)

Original text: 
  San Francisco is part of the USA, where many peoples are living. Bonn is the part of Germany
Tokenized sentences are : 
  [San Francisco is part of the USA, where many peoples are living., Bonn is the part of Germany]
List of tokenized words without punctutations: 
 [u'San', u'Francisco', u'is', u'part', u'of', u'the', u'USA', u'where', u'many', u'peoples', u'are', u'living', u'Bonn', u'is', u'the', u'part', u'of', u'Germany']
Length of words:  18
Lowercase words: 
 [u'san', u'francisco', u'is', u'part', u'of', u'the', u'usa', u'where', u'many', u'peoples', u'are', u'living', u'bonn', u'is', u'the', u'part', u'of', u'germany']
Lemma of text is: 
  [u'san', u'francisco', u'be', u'part', u'of', u'the', u'usa', u'where', u'many', u'people', u'be', u'live', u'bonn', u'be', u'the', u'part', u'of', u'germany']
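
To see what lemmatization changed, the token list and the lemma list printed above can be compared position by position. The snippet copies the recorded output as literals (without the `u` prefixes) rather than calling spaCy; `changed_pairs` is a hypothetical helper name.

```python
tokens = ['San', 'Francisco', 'is', 'part', 'of', 'the', 'USA', 'where', 'many',
          'peoples', 'are', 'living', 'Bonn', 'is', 'the', 'part', 'of', 'Germany']
lemmas = ['san', 'francisco', 'be', 'part', 'of', 'the', 'usa', 'where', 'many',
          'people', 'be', 'live', 'bonn', 'be', 'the', 'part', 'of', 'germany']


def changed_pairs(words, lemma_list):
    """Return (word, lemma) pairs where the lemma differs from the lowercased word."""
    return [(w, l) for w, l in zip(words, lemma_list) if w.lower() != l]


for word, lemma in changed_pairs(tokens, lemmas):
    print("%s -> %s" % (word, lemma))
```

In this sample only the verb forms and the plural `peoples` change; every other lemma is simply the lowercased token.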
