In [1]:
'''
A notebook to show various steps of preprocessing text
'''

'\nA notebook to show various steps of preprocessing text\n'

In [2]:
# We will be using various reviews from the file `avengers_infinity_war_reviews.json`

In [3]:
# let's consider review 1 (id 1)
review_1_text = '''Moments that touch the heart are <strong>few</strong> and <strong>far</strong> between in this almost-culmination of a decade of Marvel Comics movies. '''

In [4]:
# lowercase the text
review_1_text = review_1_text.lower()
print(review_1_text)

moments that touch the heart are <strong>few</strong> and <strong>far</strong> between in this almost-culmination of a decade of marvel comics movies. 


In [5]:
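try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2

class MLStripper(HTMLParser):
    def __init__(self):
        # Initialise the base parser (this also calls reset()).
        HTMLParser.__init__(self)
        self.fed = []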

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ' '.join(self.fed)


def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()


In [6]:
# remove html tags from text
review_1_text = strip_tags(review_1_text)
print(review_1_text)

moments that touch the heart are  few  and  far  between in this almost-culmination of a decade of marvel comics movies. 
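The stripped text above keeps doubled spaces where the tags used to be, because `get_data` joins the text fragments with a space. A minimal sketch of one way to tidy that up, assuming a hypothetical helper named `normalize_whitespace` that is not part of the original notebook:

```python
import re

def normalize_whitespace(text):
    # Collapse every run of whitespace into a single space and trim both ends.
    return re.sub(r'\s+', ' ', text).strip()

stripped = 'moments that touch the heart are  few  and  far  between'
print(normalize_whitespace(stripped))
# prints: moments that touch the heart are few and far between
```

Running this after `strip_tags` keeps later token-based steps from seeing empty tokens.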


In [7]:
# let's consider review 3
review_3_text = '''?Avengers: Infinity War? takes you places that most superhero movies don?t ? and where you may not want to go.'''

In [8]:
import re
import string

def remove_punctuation(text):
    # Character class matching any ASCII punctuation mark.
    pattern = re.compile('[{}]'.format(re.escape(string.punctuation)))
    # Strip punctuation from each token and drop tokens that end up empty.
    filtered_tokens = filter(None, [pattern.sub('', token) for token in text.split(' ')])
    return ' '.join(filtered_tokens)

In [9]:
review_3_text = remove_punctuation(review_3_text)
print(review_3_text)

Avengers Infinity War takes you places that most superhero movies dont and where you may not want to go
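Removing punctuation also removes apostrophes, so a contraction such as "don't" ends up as "dont" and can no longer be looked up in a contraction map. One possible ordering, sketched here under the assumption of a deliberately tiny, hypothetical `contraction_map` and a hypothetical helper `expand_then_strip`, is to expand contractions first and strip punctuation second:

```python
import re
import string

# Hypothetical, deliberately tiny contraction map used only for illustration.
contraction_map = {"don't": "do not", "it's": "it is"}

def expand_then_strip(text, mapping):
    # Expand contractions first, while the apostrophes are still present.
    tokens = [mapping.get(token.lower(), token) for token in text.split()]
    # Then drop ASCII punctuation characters from every token.
    pattern = re.compile('[{}]'.format(re.escape(string.punctuation)))
    cleaned = [pattern.sub('', token) for token in tokens]
    # Skip tokens that became empty after stripping.
    return ' '.join(token for token in cleaned if token)

print(expand_then_strip("Most superhero movies don't go there.", contraction_map))
# prints: Most superhero movies do not go there
```

Note that `string.punctuation` only covers ASCII, so typographic quotes and dashes would pass through this sketch untouched.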


In [10]:
# expanding abbreviations
# - standard
# - domain specific

In [11]:
# let's consider review 2
review_2_text = '''Avengers: Infinity War delivers an exciting culmination of the MCU, though it's overstuffed and suffers from certain typical Marvel movie problems.'''

In [12]:
standard_abbreviation_map = {
    "aren't": "are not",
    "can't": "cannot",
    "doesn't": "does not", 
    "don't": "do not",
    "they're": "they are",
    "they've": "they have",
    "it's": "it is"
}

In [13]:
domain_specific_abbreviation_map = {
    'MCU': 'Marvel Cinematic Universe'
}

In [14]:
def expand_abbreviations(text, abbreviation_map):
    # Tokens are compared exactly, so case and attached punctuation matter:
    # "MCU," (with the comma) does not match the key "MCU".
    tokens = text.split(' ')
    updated_tokens = [abbreviation_map.get(token, token) for token in tokens]
    return ' '.join(updated_tokens)

In [15]:
# expand standard abbreviations in review 2
review_2_text = expand_abbreviations(review_2_text, standard_abbreviation_map)
print(review_2_text)

Avengers: Infinity War delivers an exciting culmination of the MCU, though it is overstuffed and suffers from certain typical Marvel movie problems.


In [16]:
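# expand domain-specific abbreviations (here, specific to Marvel movie reviews)
# note: the token in the text is "MCU," with a trailing comma, so it does not match the key 'MCU' and the output is unchanged
review_2_text = expand_abbreviations(review_2_text, domain_specific_abbreviation_map)
print(review_2_text)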

Avengers: Infinity War delivers an exciting culmination of the MCU, though it is overstuffed and suffers from certain typical Marvel movie problems.
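The output above still contains "MCU" because the whitespace split produces the token "MCU," and that does not equal the key. A minimal sketch of a punctuation-tolerant alternative, assuming a hypothetical helper `expand_abbreviations_regex` that matches whole words with a regular expression instead of comparing tokens:

```python
import re

def expand_abbreviations_regex(text, abbreviation_map):
    if not abbreviation_map:
        return text
    # Try longer keys first so they win over shorter keys that are their prefixes.
    keys = sorted(abbreviation_map, key=len, reverse=True)
    alternatives = '|'.join(re.escape(key) for key in keys)
    # The lookarounds require that a match is not glued to another word character,
    # so "MCU," matches while a longer word containing "MCU" would not.
    pattern = re.compile(r'(?<!\w)(?:{})(?!\w)'.format(alternatives))
    return pattern.sub(lambda match: abbreviation_map[match.group(0)], text)

abbreviations = {'MCU': 'Marvel Cinematic Universe', "it's": 'it is'}
text = "an exciting culmination of the MCU, though it's overstuffed"
print(expand_abbreviations_regex(text, abbreviations))
# prints: an exciting culmination of the Marvel Cinematic Universe, though it is overstuffed
```

The match is still case-sensitive, which is usually what you want for acronyms; passing `re.IGNORECASE` would be a separate design decision.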


In [17]:
# expand numbers into their word form (single digits only)

In [18]:
# let's consider review 4
review_4_text = '''There were six singularities before the universe existed. When the universe was formed, the Cosmic Entities like Death and Eternity, turned those singularities into concentrated objects of power called Infinity Stones. There are 6 Infinity Stones or Gems in total: Space Stone, Reality Stone, Power Stone, Mind Stone, Time Stone, and finally Soul Stone '''

In [19]:
number_map = {
    '0': 'zero',
    '1': 'one',
    '2': 'two',
    '3': 'three',
    '4': 'four',
    '5': 'five',
    '6': 'six',
    '7': 'seven',
    '8': 'eight',
    '9': 'nine'
}

In [20]:
review_4_text = expand_abbreviations(review_4_text, number_map)
print(review_4_text)

There were six singularities before the universe existed. When the universe was formed, the Cosmic Entities like Death and Eternity, turned those singularities into concentrated objects of power called Infinity Stones. There are six Infinity Stones or Gems in total: Space Stone, Reality Stone, Power Stone, Mind Stone, Time Stone, and finally Soul Stone 


In [21]:
# synonyms / alternate names

In [22]:
# let's consider review 6
review_6_text = '''Directed by the Russo brothers, the architects behind Cptain Amrica: Civil War and Cptain Amrica: Winter Soldier, Infinity War slyly betrays Cap presenting his and the Avengers’ worldviews as naive and privileged. Instead, it dares to ask what happens if saving the day means taking real, tangible losses — a concept so foreign that it comes in the form of an intergalactic purple titan named Thanos'''

In [23]:
# Captain America is sometimes called "Cap"
synonym_map = {
    'Cap': 'Captain America'
}

In [24]:
review_6_text = expand_abbreviations(review_6_text, synonym_map)
print(review_6_text)

Directed by the Russo brothers, the architects behind Cptain Amrica: Civil War and Cptain Amrica: Winter Soldier, Infinity War slyly betrays Captain America presenting his and the Avengers’ worldviews as naive and privileged. Instead, it dares to ask what happens if saving the day means taking real, tangible losses — a concept so foreign that it comes in the form of an intergalactic purple titan named Thanos


In [25]:
# correcting spelling 

In [26]:
# let's consider review 6 again
review_6_text = '''Directed by the Russo brothers, the architects behind Cptain Amrica: Civil War and Cptain Amrica: Winter Soldier, Infinity War slyly betrays Cap presenting his and the Avengers’ worldviews as naive and privileged. Instead, it dares to ask what happens if saving the day means taking real, tangible losses — a concept so foreign that it comes in the form of an intergalactic purple titan named Thanos'''

In [27]:
spelling_error_map = {
    'Cptain': 'Captain',
    'Amrica': 'America'
}

In [28]:
# note: 'Amrica:' carries a trailing colon, so only 'Cptain' is corrected in the output
review_6_text = expand_abbreviations(review_6_text, spelling_error_map)
print(review_6_text)

Directed by the Russo brothers, the architects behind Captain Amrica: Civil War and Captain Amrica: Winter Soldier, Infinity War slyly betrays Cap presenting his and the Avengers’ worldviews as naive and privileged. Instead, it dares to ask what happens if saving the day means taking real, tangible losses — a concept so foreign that it comes in the form of an intergalactic purple titan named Thanos
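The output above still reads "Captain Amrica:" because the colon is part of the token, and a hand-written error map only covers misspellings someone has already listed. As a rough sketch of a different technique (not what the notebook does), the standard library's `difflib.get_close_matches` can suggest the nearest word from a vocabulary. The `vocabulary` list, the helper names, and the `cutoff` of 0.8 are all assumptions made for this illustration:

```python
import difflib
import string

# Hypothetical vocabulary of known-good words; a real one would be far larger.
vocabulary = ['Captain', 'America', 'Civil', 'War']

def correct_token(token, vocabulary, cutoff=0.8):
    # Split off trailing punctuation so that 'Amrica:' is compared as 'Amrica'.
    core = token.rstrip(string.punctuation)
    trailing = token[len(core):]
    matches = difflib.get_close_matches(core, vocabulary, n=1, cutoff=cutoff)
    # Keep the token unchanged when nothing in the vocabulary is close enough.
    return (matches[0] if matches else core) + trailing

def correct_spelling(text, vocabulary):
    return ' '.join(correct_token(token, vocabulary) for token in text.split(' '))

print(correct_spelling('Cptain Amrica: Civil War', vocabulary))
# prints: Captain America: Civil War
```

The cutoff is a trade-off: a lower value fixes more typos but also rewrites valid words that merely resemble a vocabulary entry.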


In [71]:
# removing stopwords
# - standard
# - domain specific

In [29]:
# let's consider review 5
review_5_text = '''I hate to once again compare the DCEU to Marvel (that's a lie, it's tons of fun), but Marvel takes something that DC has been desperately trying to do and does it 100 times better. The movie is often dark and depressing, but unlike the likes of BvS it is never pessimistic or hopeless. DC has been trying to deliver an edge by dulling their colors and trudging their characters through the muck from the get go, while Marvel realizes that loss and anger and fear are all the more powerful when there is hope and happiness and bravery to go alongside it.'''

In [30]:
def remove_stopwords(text, stopwords):
    # Build the lookup set once instead of once per token.
    stopword_set = set(stopwords)
    tokens = text.split()
    filtered_tokens = [token for token in tokens if token not in stopword_set]
    return ' '.join(filtered_tokens)

In [31]:
english_stopwords = [
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", 
    "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", 
    "such", "that", "the", "their", "then", "there", "these", "they", 
    "this", "to", "was", "will", "with",
]

In [32]:
review_5_text = remove_stopwords(review_5_text, english_stopwords)
print(review_5_text)

I hate once again compare DCEU Marvel (that's lie, it's tons fun), Marvel takes something DC has been desperately trying do does 100 times better. The movie often dark depressing, unlike likes BvS never pessimistic hopeless. DC has been trying deliver edge dulling colors trudging characters through muck from get go, while Marvel realizes loss anger fear all more powerful when hope happiness bravery go alongside it.


In [33]:
# domain specific stopwords
# let's make sure that a review about a Marvel movie does not mention DC or its characters
# note: remove_stopwords compares single tokens, so the two-word entry 'Wonder Woman' can never match
domain_specific_stopwords = [
    'DC', 'Batman', 'Wonder Woman', 'DCEU'
]

In [34]:
review_5_text = remove_stopwords(review_5_text, domain_specific_stopwords)
print(review_5_text)

I hate once again compare Marvel (that's lie, it's tons fun), Marvel takes something has been desperately trying do does 100 times better. The movie often dark depressing, unlike likes BvS never pessimistic hopeless. has been trying deliver edge dulling colors trudging characters through muck from get go, while Marvel realizes loss anger fear all more powerful when hope happiness bravery go alongside it.
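Token-by-token filtering cannot remove a multi-word entry such as 'Wonder Woman', and it misses entries that carry punctuation or different casing. A minimal sketch of a phrase-aware variant for the domain-specific list, assuming a hypothetical helper `remove_stop_phrases` and a case-insensitive match (a choice made for this example, not taken from the notebook):

```python
import re

def remove_stop_phrases(text, stop_phrases):
    if not stop_phrases:
        return text
    # Try longer phrases first so 'Wonder Woman' is tested before shorter entries.
    phrases = sorted(stop_phrases, key=len, reverse=True)
    alternatives = '|'.join(re.escape(phrase) for phrase in phrases)
    # The lookarounds keep 'DC' from matching inside a longer word.
    pattern = re.compile(r'(?<!\w)(?:{})(?!\w)'.format(alternatives), flags=re.IGNORECASE)
    # Remove the phrases, then collapse the whitespace they leave behind.
    return re.sub(r'\s+', ' ', pattern.sub('', text)).strip()

sample = 'Wonder Woman and the DCEU trail Marvel'
print(remove_stop_phrases(sample, ['DC', 'DCEU', 'Wonder Woman']))
# prints: and the trail Marvel
```

This sketch is meant for names and phrases; applying it to general English stopwords would also strip the "it" out of "it's", so the token-based function remains the better fit there.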


In [35]:
# stemming
# let's consider review 4
review_4_text = '''There were six singularities before the universe existed. When the universe was formed, the Cosmic Entities like Death and Eternity, turned those singularities into concentrated objects of power called Infinity Stones. There are 6 Infinity Stones or Gems in total: Space Stone, Reality Stone, Power Stone, Mind Stone, Time Stone, and finally Soul Stone'''

In [36]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
def find_stem_sentence(s):
    return ' '.join([stemmer.stem(token) for token in s.split(' ')])

In [37]:
print(find_stem_sentence(review_4_text))

there were six singular befor the univers existed. when the univers wa formed, the cosmic entiti like death and eternity, turn those singular into concentr object of power call infin stones. there are 6 infin stone or gem in total: space stone, realiti stone, power stone, mind stone, time stone, and final soul stone


In [38]:
# notice how some tokens are reduced to a shorter stem
# singularities ==> singular
# also notice that some stems are not real words
# universe ==> univers

In [39]:
# lemmatization: like stemming, but it makes sure that the reduced form is a meaningful word
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
def find_lemma_sentence(s):
    return ' '.join([lemmatizer.lemmatize(token) for token in s.split(' ')])

In [40]:
print(find_lemma_sentence(review_4_text))

There were six singularity before the universe existed. When the universe wa formed, the Cosmic Entities like Death and Eternity, turned those singularity into concentrated object of power called Infinity Stones. There are 6 Infinity Stones or Gems in total: Space Stone, Reality Stone, Power Stone, Mind Stone, Time Stone, and finally Soul Stone


In [41]:
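# notice how universe (univers after stemming) is kept as universe here
# 'was' still becomes 'wa' because lemmatize() treats every token as a noun unless a pos argument is given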