"""
NLP Text Preprocessing
"""

In Natural Language Processing, the data we provide, and how we process and clean it, is one of the most important factors determining the accuracy of a Machine Learning model or of any insights drawn from the text.

You have probably come across the phrase "Garbage In, Garbage Out."

Every dataset contains noise, which degrades the accuracy of the outcome. We need to clean out as much of it as possible to improve the data quality.


Let's focus on the different steps of cleaning:
1. Basic cleaning
   1. Organize the structure.
   2. Remove anything that will create an error.
2. Unnecessary tokens
   1. Remove the parts of the data that add no value.
3. Prepare the dataset in a proper format to feed it into the ML algorithm.

Let's import a CSV file that contains a list of hotel reviews.

In [17]:
#import the data
import pandas as pd
data = pd.read_csv('tripadvisor_hotel_reviews_.csv')
data.head()

Unnamed: 0,Review,Rating
0,nice hotel expensive parking got good deal sta...,4
1,ok nothing special charge diamond member hilto...,2
2,nice rooms not 4* experience hotel monaco seat...,3
3,"unique, great stay, wonderful time hotel monac...",5
4,"great stay great stay, went seahawk game aweso...",5


The very first step when working with text data is lowercase conversion, which keeps the data consistent. Without it, a model may treat a capitalized word (e.g. 'Hotel') as a different token from its lowercase form ('hotel').

Disadvantage: Lowercasing can sometimes change the meaning of a word.
Example: 'US' is a country, but 'us' is a pronoun.
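
One possible way to soften this drawback is to lowercase everything except a short list of tokens whose casing matters. The sketch below uses only plain Python; `PROTECTED_TOKENS` and `lowercase_except_protected` are hypothetical names introduced for illustration, and the list itself is an assumption that would have to be tuned per dataset.

```python
# Hypothetical set of tokens whose casing carries meaning; extend per dataset.
PROTECTED_TOKENS = {"US", "IT", "WHO"}

def lowercase_except_protected(text, protected=PROTECTED_TOKENS):
    """Lowercase every whitespace-separated word unless it is in `protected`."""
    return " ".join(
        word if word in protected else word.lower()
        for word in text.split()
    )

plain = "The US team gave us a Great deal".lower()
selective = lowercase_except_protected("The US team gave us a Great deal")

print(plain)      # the us team gave us a great deal
print(selective)  # the US team gave us a great deal
```

Note that the match is on whole whitespace-separated words, so a token with attached punctuation such as 'US,' would still be lowercased by this sketch.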

In [18]:
data['reviews_lowercase'] = data['Review'].str.lower()
data.head()

Unnamed: 0,Review,Rating,reviews_lowercase
0,nice hotel expensive parking got good deal sta...,4,nice hotel expensive parking got good deal sta...
1,ok nothing special charge diamond member hilto...,2,ok nothing special charge diamond member hilto...
2,nice rooms not 4* experience hotel monaco seat...,3,nice rooms not 4* experience hotel monaco seat...
3,"unique, great stay, wonderful time hotel monac...",5,"unique, great stay, wonderful time hotel monac..."
4,"great stay great stay, went seahawk game aweso...",5,"great stay great stay, went seahawk game aweso..."


Second step: Remove stopwords such as 'a', 'and', 'to', etc.
These words add little meaning to the text.
Removing them reduces the complexity of the data and leads to a smaller, cleaner dataset.

In [20]:
#import the packages
import nltk
from nltk.corpus import stopwords
import re

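# nltk.download('stopwords')  # uncomment on first run if the corpus is not installed
english_stopwords = set(stopwords.words('english'))  # a set gives fast membership tests
data['reviews_with_no_stopwords'] = data['reviews_lowercase'].apply(lambda review: ' '.join([word for word in review.split() if word not in english_stopwords]))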
data.head()

Unnamed: 0,Review,Rating,reviews_lowercase,reviews_with_no_stopwords
0,nice hotel expensive parking got good deal sta...,4,nice hotel expensive parking got good deal sta...,nice hotel expensive parking got good deal sta...
1,ok nothing special charge diamond member hilto...,2,ok nothing special charge diamond member hilto...,ok nothing special charge diamond member hilto...
2,nice rooms not 4* experience hotel monaco seat...,3,nice rooms not 4* experience hotel monaco seat...,nice rooms 4* experience hotel monaco seattle ...
3,"unique, great stay, wonderful time hotel monac...",5,"unique, great stay, wonderful time hotel monac...","unique, great stay, wonderful time hotel monac..."
4,"great stay great stay, went seahawk game aweso...",5,"great stay great stay, went seahawk game aweso...","great stay great stay, went seahawk game aweso..."
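
Row 2 above shows a side effect worth knowing about: 'not' is on NLTK's English stopword list, so "nice rooms not 4* experience" loses its negation. For sentiment-style tasks it may be worth keeping negations. The sketch below illustrates the idea with a tiny hand-written stopword list (an assumption made so the example runs without downloading any corpus); `NEGATIONS` and `remove_stopwords` are hypothetical names.

```python
# A tiny hand-written stopword list, used here only for illustration.
# NLTK's English list is much longer and also contains negations such as "not".
STOPWORDS = {"a", "an", "and", "the", "to", "was", "not", "no"}

# Hypothetical choice: negations we decide to keep because they flip sentiment.
NEGATIONS = {"not", "no"}

def remove_stopwords(text, stopwords):
    """Drop every whitespace-separated word that appears in `stopwords`."""
    return " ".join(word for word in text.split() if word not in stopwords)

review = "the room was not clean and the staff was not friendly"

print(remove_stopwords(review, STOPWORDS))
# room clean staff friendly
print(remove_stopwords(review, STOPWORDS - NEGATIONS))
# room not clean staff not friendly
```

The same set difference should work with the NLTK list once it has been converted to a set.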


Third step: We can also use regular expressions (regex) to identify, modify, or transform certain patterns in the data. Here we use one to strip punctuation.

In [22]:
# remove punctuation: delete every character that is neither a word character nor whitespace
data['reviews_with_no_stop_and_punct'] = data['reviews_with_no_stopwords'].apply(lambda review: re.sub(r'[^\w\s]','',review))
data.head()

Unnamed: 0,Review,Rating,reviews_lowercase,reviews_with_no_stopwords,reviews_with_no_stop_and_punct
0,nice hotel expensive parking got good deal sta...,4,nice hotel expensive parking got good deal sta...,nice hotel expensive parking got good deal sta...,nice hotel expensive parking got good deal sta...
1,ok nothing special charge diamond member hilto...,2,ok nothing special charge diamond member hilto...,ok nothing special charge diamond member hilto...,ok nothing special charge diamond member hilto...
2,nice rooms not 4* experience hotel monaco seat...,3,nice rooms not 4* experience hotel monaco seat...,nice rooms 4* experience hotel monaco seattle ...,nice rooms 4 experience hotel monaco seattle g...
3,"unique, great stay, wonderful time hotel monac...",5,"unique, great stay, wonderful time hotel monac...","unique, great stay, wonderful time hotel monac...",unique great stay wonderful time hotel monaco ...
4,"great stay great stay, went seahawk game aweso...",5,"great stay great stay, went seahawk game aweso...","great stay great stay, went seahawk game aweso...",great stay great stay went seahawk game awesom...
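
Deleting punctuation outright works when words are already separated by spaces, but it glues words together when a punctuation mark was the only separator. A hedged alternative is to replace punctuation with a space and then collapse repeated whitespace. The sketch below compares the two on a made-up sample string; both function names are hypothetical.

```python
import re

def strip_punctuation(text):
    """Delete every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", "", text)

def punctuation_to_space(text):
    """Replace punctuation with a space, then collapse repeated whitespace."""
    spaced = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", spaced).strip()

sample = "great stay,went again...not 4* though"

print(strip_punctuation(sample))
# great staywent againnot 4 though
print(punctuation_to_space(sample))
# great stay went again not 4 though
```

Which variant is preferable depends on the data: the second one also splits contractions such as "don't" into two pieces, which may or may not be wanted.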


Fourth step: Tokenization
It's a fundamental step in NLP that converts text into smaller units called tokens (here, words).
Tokenization is needed because the steps that follow, such as stemming, lemmatization, and n-gram counting, operate on individual tokens rather than on raw strings.

In [24]:
from nltk.tokenize import word_tokenize
data['tokens'] = data['reviews_with_no_stop_and_punct'].apply(word_tokenize)
data.head()

Unnamed: 0,Review,Rating,reviews_lowercase,reviews_with_no_stopwords,reviews_with_no_stop_and_punct,tokens
0,nice hotel expensive parking got good deal sta...,4,nice hotel expensive parking got good deal sta...,nice hotel expensive parking got good deal sta...,nice hotel expensive parking got good deal sta...,"[nice, hotel, expensive, parking, got, good, d..."
1,ok nothing special charge diamond member hilto...,2,ok nothing special charge diamond member hilto...,ok nothing special charge diamond member hilto...,ok nothing special charge diamond member hilto...,"[ok, nothing, special, charge, diamond, member..."
2,nice rooms not 4* experience hotel monaco seat...,3,nice rooms not 4* experience hotel monaco seat...,nice rooms 4* experience hotel monaco seattle ...,nice rooms 4 experience hotel monaco seattle g...,"[nice, rooms, 4, experience, hotel, monaco, se..."
3,"unique, great stay, wonderful time hotel monac...",5,"unique, great stay, wonderful time hotel monac...","unique, great stay, wonderful time hotel monac...",unique great stay wonderful time hotel monaco ...,"[unique, great, stay, wonderful, time, hotel, ..."
4,"great stay great stay, went seahawk game aweso...",5,"great stay great stay, went seahawk game aweso...","great stay great stay, went seahawk game aweso...",great stay great stay went seahawk game awesom...,"[great, stay, great, stay, went, seahawk, game..."
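
To see what a tokenizer adds over a plain `split()`, the sketch below compares whitespace splitting with a simple regex tokenizer on raw text that still contains punctuation. The regex is only a stand-in chosen so the example needs no NLTK data; it is not how `word_tokenize` is implemented, and both function names are hypothetical.

```python
import re

def whitespace_tokenize(text):
    """Split on runs of whitespace; punctuation stays attached to words."""
    return text.split()

def regex_tokenize(text):
    """Return runs of word characters, or single punctuation marks, as tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

sample = "great stay, went to a seahawk game!"

print(whitespace_tokenize(sample))
# ['great', 'stay,', 'went', 'to', 'a', 'seahawk', 'game!']
print(regex_tokenize(sample))
# ['great', 'stay', ',', 'went', 'to', 'a', 'seahawk', 'game', '!']
```

This also explains a subtlety of the stopword step earlier, which relied on `split()`: a stopword with punctuation attached, such as 'the,', would not match the stopword list and would survive.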


Fifth step: Stemming
An important text preprocessing technique that groups together words that are inflected forms of the same word.
In other words, it is essentially a process of stripping suffixes.
Example: 'climbing', 'climbs', and 'climbed' will all be reduced to 'climb'.
This reduces the number of unique words in our data, in turn reducing its size and complexity.

Drawback: This process involves no linguistic knowledge. Stemmers apply fixed rules to strip suffixes without checking whether the result is still a real word or whether the meaning has changed.

Try the word 'adverse': many off-the-shelf stemmers will return 'advers', which is not a word.

In [25]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
data['stemmed_review'] = data['tokens'].apply(lambda tokens: [stemmer.stem(token) for token in tokens])
data.head()

Unnamed: 0,Review,Rating,reviews_lowercase,reviews_with_no_stopwords,reviews_with_no_stop_and_punct,tokens,stemmed_review
0,nice hotel expensive parking got good deal sta...,4,nice hotel expensive parking got good deal sta...,nice hotel expensive parking got good deal sta...,nice hotel expensive parking got good deal sta...,"[nice, hotel, expensive, parking, got, good, d...","[nice, hotel, expens, park, got, good, deal, s..."
1,ok nothing special charge diamond member hilto...,2,ok nothing special charge diamond member hilto...,ok nothing special charge diamond member hilto...,ok nothing special charge diamond member hilto...,"[ok, nothing, special, charge, diamond, member...","[ok, noth, special, charg, diamond, member, hi..."
2,nice rooms not 4* experience hotel monaco seat...,3,nice rooms not 4* experience hotel monaco seat...,nice rooms 4* experience hotel monaco seattle ...,nice rooms 4 experience hotel monaco seattle g...,"[nice, rooms, 4, experience, hotel, monaco, se...","[nice, room, 4, experi, hotel, monaco, seattl,..."
3,"unique, great stay, wonderful time hotel monac...",5,"unique, great stay, wonderful time hotel monac...","unique, great stay, wonderful time hotel monac...",unique great stay wonderful time hotel monaco ...,"[unique, great, stay, wonderful, time, hotel, ...","[uniqu, great, stay, wonder, time, hotel, mona..."
4,"great stay great stay, went seahawk game aweso...",5,"great stay great stay, went seahawk game aweso...","great stay great stay, went seahawk game aweso...",great stay great stay went seahawk game awesom...,"[great, stay, great, stay, went, seahawk, game...","[great, stay, great, stay, went, seahawk, game..."
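
The "no intelligence" point can be made concrete with a deliberately naive suffix stripper. This is a toy sketch, not the Porter algorithm: `SUFFIXES`, `naive_stem`, and the minimum stem length are all assumptions chosen for illustration.

```python
# Toy rule list, checked in order; real stemmers such as Porter use many
# more rules plus conditions on the remaining stem.
SUFFIXES = ("ing", "ed", "s", "e")

def naive_stem(word, min_stem_length=3):
    """Strip the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_length:
            return stem
    return word

for word in ["Climbing", "Climbs", "Climbed", "Adverse", "News"]:
    print(word, "->", naive_stem(word))
# Climbing -> climb
# Climbs -> climb
# Climbed -> climb
# Adverse -> advers
# News -> new
```

The first three words collapse to one stem as intended, while 'Adverse' becomes a non-word and 'News' turns into a different word altogether.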


Sixth step: Lemmatization
1. This maps each token to its dictionary form (lemma). Unlike stemming, it relies on linguistic knowledge, looking words up in a pre-defined dictionary (WordNet, in the code below).
2. It usually produces meaningful words, but it may leave more unique words in the dataset than stemming does.

In [36]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
data['lemmatized_reviews'] = data['tokens'].apply(lambda review: [lemmatizer.lemmatize(token) for token in review])
data.head()

Unnamed: 0,Review,Rating,reviews_lowercase,reviews_with_no_stopwords,reviews_with_no_stop_and_punct,tokens,stemmed_review,lemmatized_reviews
0,nice hotel expensive parking got good deal sta...,4,nice hotel expensive parking got good deal sta...,nice hotel expensive parking got good deal sta...,nice hotel expensive parking got good deal sta...,"[nice, hotel, expensive, parking, got, good, d...","[nice, hotel, expens, park, got, good, deal, s...","[nice, hotel, expensive, parking, got, good, d..."
1,ok nothing special charge diamond member hilto...,2,ok nothing special charge diamond member hilto...,ok nothing special charge diamond member hilto...,ok nothing special charge diamond member hilto...,"[ok, nothing, special, charge, diamond, member...","[ok, noth, special, charg, diamond, member, hi...","[ok, nothing, special, charge, diamond, member..."
2,nice rooms not 4* experience hotel monaco seat...,3,nice rooms not 4* experience hotel monaco seat...,nice rooms 4* experience hotel monaco seattle ...,nice rooms 4 experience hotel monaco seattle g...,"[nice, rooms, 4, experience, hotel, monaco, se...","[nice, room, 4, experi, hotel, monaco, seattl,...","[nice, room, 4, experience, hotel, monaco, sea..."
3,"unique, great stay, wonderful time hotel monac...",5,"unique, great stay, wonderful time hotel monac...","unique, great stay, wonderful time hotel monac...",unique great stay wonderful time hotel monaco ...,"[unique, great, stay, wonderful, time, hotel, ...","[uniqu, great, stay, wonder, time, hotel, mona...","[unique, great, stay, wonderful, time, hotel, ..."
4,"great stay great stay, went seahawk game aweso...",5,"great stay great stay, went seahawk game aweso...","great stay great stay, went seahawk game aweso...",great stay great stay went seahawk game awesom...,"[great, stay, great, stay, went, seahawk, game...","[great, stay, great, stay, went, seahawk, game...","[great, stay, great, stay, went, seahawk, game..."
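
In the output above, 'rooms' became 'room' but 'went' in row 4 was left unchanged. A likely reason is that `WordNetLemmatizer.lemmatize` takes an optional `pos` argument that defaults to noun, so verbs are only reduced when they are looked up as verbs. The sketch below mimics that behaviour with a toy lookup table; `LEMMA_TABLE` and `toy_lemmatize` are hypothetical and stand in for a real lexicon.

```python
# Hypothetical mini-dictionary mapping (word, part of speech) to a lemma.
# A real lemmatizer consults a full lexicon such as WordNet instead.
LEMMA_TABLE = {
    ("rooms", "n"): "room",
    ("went", "v"): "go",
    ("stayed", "v"): "stay",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos="n"):
    """Look up the lemma for `word` with the given tag; fall back to the word."""
    return LEMMA_TABLE.get((word, pos), word)

print(toy_lemmatize("rooms"))            # room
print(toy_lemmatize("went"))             # went  (looked up as a noun, no entry)
print(toy_lemmatize("went", pos="v"))    # go
print(toy_lemmatize("better", pos="a"))  # good
```

With NLTK, the equivalent would be to pass a part-of-speech tag per token, for example one derived from a POS tagger, rather than relying on the default.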


To sanity-check the output of our preprocessing, we can use n-grams: the most frequent word sequences should look like meaningful phrases rather than noise.

In [41]:
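# flatten the per-review token lists into one list (faster than sum(..., []) on large data)
tokens_grams = [token for review in data['lemmatized_reviews'] for token in review]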

In [42]:
bi_grams = pd.Series(nltk.ngrams(tokens_grams,2)).value_counts()
print(bi_grams)

(great, location)           24
(space, needle)             21
(hotel, monaco)             16
(staff, friendly)           13
(great, view)               12
                            ..
(hotel, dissapointment)      1
(unfortunately, warwick)     1
(agency, unfortunately)      1
(travel, agency)             1
(food, raffle)               1
Name: count, Length: 8183, dtype: int64


In [43]:
tri_grams = pd.Series(nltk.ngrams(tokens_grams,3)).value_counts()
print(tri_grams)

(pike, place, market)         8
(hotel, great, location)      5
(staff, friendly, helpful)    5
(view, space, needle)         5
(room, king, bed)             4
                             ..
(free, travel, central)       1
(cheap, free, travel)         1
(bus, cheap, free)            1
(seattle, bus, cheap)         1
(hotel, right, street)        1
Name: count, Length: 9173, dtype: int64


In [44]:
four_grams = pd.Series(nltk.ngrams(tokens_grams,4)).value_counts()
print(four_grams)

(block, away, pike, place)                      2
(im, pei, designed, hotel)                      2
(definitely, stay, crowne, plaza)               2
(lovely, hotel, great, location)                2
(nice, hotel, husband, stayed)                  2
                                               ..
(store, opposite, noise, airconditionera)       1
(opposite, noise, airconditionera, standard)    1
(noise, airconditionera, standard, arranged)    1
(airconditionera, standard, arranged, stay)     1
(raffle, hotel, right, street)                  1
Name: count, Length: 9272, dtype: int64
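
Because all reviews were flattened into one token list before counting, some of the n-grams above may span the boundary between two reviews. The sketch below, using only the standard library, shows the effect on two made-up reviews and counts n-grams per review instead. The `ngrams` helper here is a hypothetical stand-in for `nltk.ngrams`.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-token tuples found in `tokens`, in order."""
    return list(zip(*(tokens[i:] for i in range(n))))

# Two made-up tokenized reviews standing in for data['lemmatized_reviews'].
reviews = [
    ["great", "location", "friendly", "staff"],
    ["great", "location", "small", "room"],
]

# Pooled tokens: one n-gram bridges the end of review 1 and the start of review 2.
pooled = [token for review in reviews for token in review]
pooled_counts = Counter(ngrams(pooled, 2))

# Per-review counting keeps every n-gram inside a single review.
per_review_counts = Counter(
    gram for review in reviews for gram in ngrams(review, 2)
)

print(pooled_counts[("staff", "great")])         # 1
print(per_review_counts[("staff", "great")])     # 0
print(per_review_counts[("great", "location")])  # 2
```

For frequent phrases the difference is usually small, but the cross-review pairs add noise to the long tail of n-grams that occur only once.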


We can further extend our process with text tagging:
1. Part-of-Speech tagging (each token is tagged with its part of speech)
2. Named-Entity Recognition (identifies places, names, organizations, or other kinds of entities)

In [52]:
import spacy
pos_tagger = spacy.load('en_core_web_sm')
document = pos_tagger(' '.join(tokens_grams))
# print the first 10 tokens with their part-of-speech tags
for token in document[:10]:
    print(token.text,': ',token.pos_)

nice :  ADJ
hotel :  NOUN
expensive :  ADJ
parking :  NOUN
got :  VERB
good :  ADJ
deal :  NOUN
stay :  VERB
hotel :  NOUN
anniversary :  NOUN


In [54]:
from spacy import displacy
displacy.render(document, style='ent', jupyter=True)