# Data Cleaning

## Introduction

This notebook goes through a necessary step of any data science project: data cleaning. Data cleaning is time-consuming and rarely enjoyable, yet it is very important. Keep in mind: "garbage in, garbage out". Feeding dirty data into a model gives results that are meaningless.

Specifically, we'll be walking through:

1. **Getting the data** - in this case, we'll be scraping data from a website
2. **Cleaning the data** - we will walk through popular text pre-processing techniques
3. **Organizing the data** - we will organize the cleaned data in a way that is easy to input into other algorithms

The output of this notebook will be clean, organized data in two standard text formats:

1. **Corpus** - a collection of text
2. **Document-Term Matrix** - word counts in matrix format
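
To make the two formats concrete, here is a minimal sketch using only the standard library. The two documents are made up for illustration and are not taken from the review files; the notebook itself builds the matrix with scikit-learn further down.

```python
from collections import Counter

# Toy corpus: two made-up "documents" (not taken from the review files).
corpus = {
    "doc_a": "great view great staff",
    "doc_b": "great food poor view",
}

# Count the words in each document.
counts = {name: Counter(text.split()) for name, text in corpus.items()}

# The vocabulary is every distinct word, sorted so the column order is stable.
vocabulary = sorted({word for c in counts.values() for word in c})

# Document-term matrix: one row per document, one column per vocabulary word.
dtm = {name: [c[word] for word in vocabulary] for name, c in counts.items()}

print(vocabulary)
for name, row in dtm.items():
    print(name, row)
```

Each row is a document and each column is a word, which is the same layout the pandas dataframe takes later in this notebook.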

## Problem Statement

As a reminder, our goal is to look at reviews of various resorts and note their similarities and differences. Specifically, I'd like to know if "Rivulet_Resort" is better than other resorts, since it is the resort that got me interested in Munnar resorts.

## Getting The Data

To decide which resorts to look into, I went to an online hotel booking site and looked at the top 6 resorts in Munnar. To narrow it down further, I considered only those with a rating above 4.2 and at least 175 reviews. If a resort had more than 175 reviews, I kept the 175 most recent ones.

Actually, for web scraping I used a Java (Selenium) based framework that I recently created. In Python we would use the 'requests' and 'BeautifulSoup' modules.

Java scraper repo: https://github.com/vmsathiya/dataScraper

Review files:

* ktc.txt (KTDC_Tea_County)
* mm.txt (Misty_Mountain)
* mtc.txt (Munnar_Tea_Country)
* rr.txt (Rivulet_Resort)
* sc.txt (Swiss_County)
* tv.txt (Tea_Valley)

In [None]:
# pickle is used below to save the reviews for later use
import pickle

# Short keys for the six resorts (they match the raw file names)
resorts = ['ktc', 'mm', 'mtc', 'rr', 'sc', 'tv']

# Read each resort's raw reviews into a single string, replacing newlines with spaces
reviews = []
for rName in resorts:
    with open("../reviews/rawdata/" + rName + ".txt", "r") as file:
        rComment = file.read().replace('\n', ' ')
        reviews.append(rComment)

# print(reviews)

In [None]:
# Pickle files for later use

for i, rName in enumerate(resorts):
    with open("../reviews/" + rName + ".txt", "wb") as file:
        pickle.dump(reviews[i], file)

In [None]:
# Load pickled files
data = {}
for rName in resorts:
    with open("../reviews/" + rName + ".txt", "rb") as file:
        data[rName] = pickle.load(file)

In [None]:
# Double check to make sure data has been loaded properly
data.keys()

In [None]:
# More checks: each value is one long string, so preview the first 200 characters
data['ktc'][:200]

## Cleaning The Data

When dealing with numerical data, data cleaning often involves removing null values and duplicate data, dealing with outliers, etc. With text data, there are some common data cleaning techniques, which are also known as text pre-processing techniques.

With text data, this cleaning process can go on forever. There's always an exception to every cleaning step. So, we're going to follow the MVP (minimum viable product) approach - start simple and iterate. Here are a bunch of things you can do to clean your data. We're going to execute just the common cleaning steps here and the rest can be done at a later point to improve our results.

**Common data cleaning steps on all text:**
* Make text all lower case
* Remove punctuation
* Remove numerical values
* Remove common nonsensical text (\n)
* Tokenize text
* Remove stop words
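
The steps above can be sketched in a few lines of plain Python. This is only an illustration: `basic_clean`, the sample sentence and the tiny stop-word set are assumptions made for this example, and the notebook itself leaves tokenization and stop-word removal to CountVectorizer later on.

```python
import re
import string

# Tiny illustrative stop-word set (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "was", "were", "is", "in", "of", "to"}

def basic_clean(text):
    """Lowercase, drop newlines, punctuation and number-bearing words, tokenize, remove stop words."""
    text = text.lower()
    text = text.replace("\n", " ")                                    # nonsensical text
    text = re.sub(r"[%s]" % re.escape(string.punctuation), "", text)  # punctuation
    text = re.sub(r"\w*\d\w*", "", text)                              # words containing digits
    tokens = text.split()                                             # whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]

sample = "The staff were GREAT!\nRoom 204 was clean, and the view was lovely."
print(basic_clean(sample))
# ['staff', 'great', 'room', 'clean', 'view', 'lovely']
```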

**More data cleaning steps after tokenization:**
* Stemming / lemmatization
* Parts of speech tagging
* Create bi-grams or tri-grams
* Deal with typos
* And more...

In [None]:
# Let's take a look at our data again
next(iter(data.keys()))

In [None]:
# Notice that our dictionary is currently in key: resort, value: string format
next(iter(data.values()))

In [None]:
print (data)

In [None]:
# pandas can build a dataframe from a dictionary, but it expects each value to be a list or dict.
data_combined = {key: [value] for (key, value) in data.items()}
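
To see why the values are wrapped in lists, here is a small sketch with a hypothetical two-entry dictionary standing in for `data`. It assumes pandas is installed; the keys `aa` and `bb` are made up.

```python
import pandas as pd

# Hypothetical two-resort dictionary standing in for `data`.
toy = {"aa": "nice stay", "bb": "great view"}

# With bare strings as values, pandas cannot infer an index and raises ValueError.
try:
    pd.DataFrame.from_dict(toy)
    scalar_ok = True
except ValueError:
    scalar_ok = False

# Wrapping each value in a one-element list gives every column exactly one row.
wrapped = {key: [value] for key, value in toy.items()}
toy_df = pd.DataFrame.from_dict(wrapped).transpose()
toy_df.columns = ["review"]

print(scalar_ok)
print(toy_df)
```

After the transpose, each key becomes a row label and the single column holds the text, which is the shape the next cell produces for the real reviews.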

In [None]:
# We can either keep it in dictionary format or put it into a pandas dataframe
import pandas as pd
pd.set_option('display.max_colwidth', 2000)
data_df = pd.DataFrame.from_dict(data_combined).transpose()
data_df.columns = ['review']
data_df = data_df.sort_index()
data_df

In [None]:
# Let's take a look at the reviews for ktc
data_df.review.loc['ktc']

In [None]:
# Apply a first round of text cleaning techniques
import re
import string

def clean_text_round1(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation and remove words containing numbers.'''
    text = text.lower()
    # Raw strings keep the regex backslashes from being read as string escapes
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

round1 = clean_text_round1

In [None]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(data_df.review.apply(round1))
data_clean

In [57]:
# Apply a second round of cleaning
def clean_text_round2(text):
    '''Get rid of some additional punctuation and nonsensical text that was missed the first time around.'''
    # Curly quotes and ellipses are not part of string.punctuation
    text = re.sub('[‘’“”…]', '', text)
    text = re.sub('\n', '', text)
    return text

round2 = clean_text_round2

In [58]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(data_clean.review.apply(round2))
data_clean

resort,review
ktc,amazing and beautiful place to stay we went as a family the rooms allocated to us were connected they had taken care of that both the rooms offered scenic views overall the staff were very courteous and friendly also the property was very well maintained and scenic tea county munnar nice facilities and good arrangement staff are very courteous amazing services and gentle behaviour must stay overall a very pleasant stay highly recommended for families excellent stay wonderful property excellent location excellent view and very nicely maintained except for food quality stay was great excellent stay very nice good hotel with location near market near and clean excellent place to stay we had a great time there all the staffs had a happy to help attitude the premises itself is well maintained have beautiful garden and good for a brief walk safe place to stay very easy to find and reach there simple and sober location is very near to city and views are great with garden the room service is great and people are very friendly excellent property at a beautiful location neat and clean room with all modern amenities nicely maintained garden with variety of flowers well decorated dining hall with mouth watering foods pleasant stay had an amazing stay here at ktdc worth every penny very nice location polite staff and great premises will definitely recommend because of the location just walking distance from the market very beautiful premises gives a very graceful and premium ambience billiards was a unique thing you wont find it in many hotels good staff and nice views from the balcony conveniently located to cover devikulam tea estates and mathupetty area good hotel overall a good hotel to stay at neat and clean ktdc need to renovate bathrooms completely worth your money peaceful and serene stay one of the best hotel properties in munnar you might find some better hotels as well but those would be far from munnar town this is in the heart of the town and gives y...
mm,okayish hotel is just fine but view is great but to get view you have to book deluxe room or your stay will be watching a cement wall even in deluxe rooms furniture is trashy and there are lot for cockroaches in toilet nice hotel to stay the hotel location is the best point about this hotel breakfast was excellent room was nice as well stay at mountain courtyard we stayed for one night and view from forest flame is awesome room size is too good and balcony size and coffee table which excited me most just one concern food quality is average and prices are too high very good nice stay nice atmosphere very good food very good room services the experience the hotel ambience coupled with the view of the misty mountain is the most attractive and catchy thing really fitting to the name of the hotel misty mountain total justice to the name tasty food and very hospitable staff north indian food is very tasty in this hotel very hospitable staff specially the manager mr chandrashekhar one poor thing about this hotel is that it doesnt have a bell and every time for room service the person keeps knocking the door massage service set up is terrible however location of the hotel is very ideal view is excellent camp fire on terrace is good great location and great accessibility had a great stay at misty mountains great views from rooftop and secret garden we a group of taken rooms and given a kinda suite for of a family a single room for a couple food is good and sufficient varieties but fiery spicy hot for us but normal by kerala standard recommend this place for a good decent hygienic stay worth for each penny we started there for one day the room and secret garden for fabulous the room which we book was valley view room the view was great from your balcony night dinner buffet was awesome one suggestion please have toothpaste in the room along with complimentary soap and shampoo best for family good view and best service best location to stay in munnar hotel loca...
mtc,well maintained with clean room with great view and greenery all around good service and great food hello everyone prons we stayed at munnar tea country resort in family cottages for three night four days it was really a great experience the staff of the hotel is so friendly i cant tell you in words they were eager to help food was just awesome and price are also less compare to other resorts breakfast was also awesome because it was mixture of north indian and south indian dishes every dish was fantastic resort maintain the greenery very well it is six kilometer away from city so you need transport to hang around in the munnar if you want to go with jeep safari they also provide that i also did spa that was also good the best thing of the resort was morning walk with philips it was really great experience he will tell you about the spices and different kind of trees available in the city it is really helpful for school going kids so if you are planning to stay in munnar so try to make one morning walk with philip he will take you the tea garden and show you different spices on the way he has great knowledge of plant and he also know the history of trees cons if you are planing to go with small kid then there are stairs in some rooms it up to you thank you munnar tea country resort for making our stay fabulous stay is good but no much activity available in the resort spa quality is not that good but price is cheaper they could have added few more process in ayurvedic massage than it will be more effective food quality is good but taste is not up to the mark a stay really enjoyed a very enjoyable stay the staff of the resort are very courteous and helping nature beautiful stay its a place where you can relax by watching beautiful sceneries around and their own wonderful garden well maintained with clean and beautiful room and garden with good service and great food good place to go with whole family to have a very nice time with nature and great food ambienc...
rr,rivelute munnar location was extremely good variety and good quality of food all staff was polite cooperative and always ready excellent stay all things are perfect like food stay and service loved it the resort is located next to a river and probably the best time to stay there is during monsoons or post monsoons however we visited in february end and still loved it because of the hospitality of the people there we got a free upgrade to a villa and it was very beautiful the resort has a few activities but nothing extraordinary has a spa where you can get some ayurvedic massages the staff were very courteous and made us feel at home we thought the food wasnt that great and a little over priced good stay great location location and staff of the hotel were excellent only thing missing was a swimming pool free upgrade to premium suite we booked a royal suite but we got free upgrade to premium one rooms have nice view of river you will enjoy birds singing in the morning you might spot few in your balcony there are other activities in the hotel it is a big property you will definitely enjoy your stay best resort at best prices treks n river is main attraction hotel provides many leisure activities kids will have fun there peaceful and serene location rivulet resort is a wonderful place to stay in munnar away from city hustle bustle near river best place for couples good services rivulet resort and munnar it was a three days trip and the experiences are awesome it will definitely fulfill the thirst of a tour lover the tour is incomplete without staying at the rivulet resort the resort provide some in house activity such as nature walk cycling bamboo rafting and also some indoor outdoor games equipments excellent stay amazing location view staff comfy room and very much recommended for everyone awesome place to relax i like the stay at rivulet the rooms and the view from rooms were awesome quiet and calm place to relax the river and mountain view was good th...
sc,excellent stay awesome experience in swiss county food was good hotel staff was nice enjoyed my stay i loved my stay here the staff is very courteous and helpful the food was was great rooms are very clean and well maintained best hotel in munnar for family i have stayed in this hotel times my family consider the stay in swiss county as a gettogether in our own place all staff are friendly and care us not like customers but like family members from security to the manager each man associated with this property will help us to enjoy each moment in this hotel next time also we will select this place itself for our stay garden near to this hotel is a new attraction amazing stay everything was good with this hotel every room has its own specification view is awesome i am a north indian and i found both north and south taste with delicious recipes i would recommend this hotel to everyone who are planning to visit mumnar munnar family trip we had excellent stay few highlights very friendly and caring staff excellent food nice view from hotel good stay was good with beautiful service and good locality one of the best in town it was our pleasure to stay in swiss county the staff was very courteous and helpful during our whole stay of days the view was breathtaking from the terrace after evening we usually used to stare at the moon from the top while sipping our coffee overall it was a great experience great location not so good food stayed here for nights with my wife during our ride to south india good rooms excellent staff good breakfast the lunchdinner here though wasnt anywhere up to the mark apart from the fish we had on the first day both of us were down with food poisoning the next two days with stale chicken dishes ended up having to stay in munnar an extra night since i wasnt able to last minute without running to the loo warm welcome with wonderful smile thanks to all the staff at swiss county for making our stay so memorable we are really touch...
tv,excellent service and staff the location is really beautiful very green with vast amount of plant species rooms with spectacular view to the green valley the staff was really helpful and understanding we really enjoyed our stay here overall good resort was overall good we had a good time there view from the resort was really awesome cleanliness and room service was also satisfactory taste of food was okay the only negative point of the resort was its little bit away from the town therefore one has to have the lunch and dinner in the resort there is no nearby restaurant or dhaba fantastic dont miss finest resort and value for money food excellent staff very kind nice isolated property best of newly married couples we had a new nice experience to stay ambience is awesome hospitality and hil type resort staff behaviour is awesome hospitality is very courteous good overall good view climate and location good stay but service needs upgrade we visited munnar during february end the rooms are large and clean the spacious balcony is one of the best features the lights in the room are quite dim they should be upgraded the fan was moving very slowly and it was a bit warm so we asked room service to rectify the problem we were provided with a room fan the geyser in one of our rooms wasnt working properly and although it was rectified quickly these are the little things that a hotel can fix in a permanent manner i wasnt provided the wifi password until the next morning and i had to complain twice for it it was needed since there was a network problem these were not big matters but they add up besides these the stay was good one more thing and although it wasnt the fault of the resort the road to the resort seemed to me quite difficult and dangerous excellent stay it was overall a great experience staying there however the rooms are not very nicely maintained good trip the view from resort is really breathtaking splendid tea plantation and deep valley view from our ...


**NOTE:** This data cleaning aka text pre-processing step could go on for a while, but we are going to stop for now. After going through some analysis techniques, if you see that the results don't make sense or could be improved, you can come back and make more edits such as:
* Mark 'cheering' and 'cheer' as the same word (stemming / lemmatization)
* Combine 'thank you' into one term (bi-grams)
* And a lot more...
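
As an illustration of the bi-gram idea, the sketch below merges one chosen word pair into a single token. `join_bigram` is a hypothetical helper written for this example, not part of the notebook or of any library.

```python
def join_bigram(tokens, first, second):
    """Merge each adjacent (first, second) pair into a single 'first_second' token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == first and tokens[i + 1] == second:
            merged.append(first + "_" + second)
            i += 2  # skip both words of the pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = "thank you for the stay thank you".split()
print(join_bigram(tokens, "thank", "you"))
# ['thank_you', 'for', 'the', 'stay', 'thank_you']
```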

## Organizing The Data

I mentioned earlier that the output of this notebook will be clean, organized data in two standard text formats:

1. **Corpus** - a collection of text
2. **Document-Term Matrix** - word counts in matrix format

### Corpus

We already created a corpus in an earlier step. The definition of a corpus is a collection of texts, and they are all put together neatly in a pandas dataframe here.

In [None]:
# Let's take a look at our dataframe
data_df

In [None]:
# Let's add the resorts' full names as well

full_names = ['KTDC Tea County', 'Misty Mountain', 'Munnar Tea Country', 'Rivulet Resort', 'Swiss County', 'Tea Valley']

data_df['full_name'] = full_names
data_df

In [62]:
# Let's pickle it for later use
data_df.to_pickle("../pickle/corpus.pkl")

### Document-Term Matrix

The most common tokenization technique is to break down text into words. We can do this using scikit-learn's CountVectorizer, where every row will represent a different document and every column will represent a different word.

In addition, with CountVectorizer, we can remove stop words. Stop words are common words that add no additional meaning to text such as 'a', 'the', etc.

**TODO:** Experiment with CountVectorizer's parameters, in particular `ngram_range`, `min_df` and `max_df`.
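
As a starting point for exploring those parameters, here is a small sketch on three made-up documents. It assumes scikit-learn is installed; the documents and variable names are illustrative only. `ngram_range` controls how many consecutive words form a term, while `min_df` and `max_df` filter terms by the number of documents they appear in (integers are absolute counts, floats are proportions).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up documents (not taken from the review files).
docs = [
    "great view great staff",
    "great food poor view",
    "great poor food",
]

# ngram_range=(1, 2) keeps single words and adjacent word pairs.
cv_ngrams = CountVectorizer(ngram_range=(1, 2))
cv_ngrams.fit(docs)
ngram_vocab = sorted(cv_ngrams.vocabulary_)

# min_df=2 drops terms found in fewer than 2 documents ('staff');
# max_df=2 drops terms found in more than 2 documents ('great').
cv_df = CountVectorizer(min_df=2, max_df=2)
cv_df.fit(docs)
df_vocab = sorted(cv_df.vocabulary_)

print(ngram_vocab)
print(df_vocab)
```

With only six resort documents in this notebook, small integer thresholds like these are easier to reason about than proportions.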

In [56]:
# We are going to create a document-term matrix using CountVectorizer, and exclude common English stop words
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')
data_cv = cv.fit_transform(data_clean.review)
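# get_feature_names() was removed in newer scikit-learn releases; get_feature_names_out() replaces it
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())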
data_dtm.index = data_clean.index
data_dtm

resort,abbas,able,abode,aboyt,abroad,absence,absent,absolutely,ac,accept,...,yesterday,yogurt,youll,young,youre,yummy,zeal,zero,zone,zoom
ktc,0,0,0,0,0,0,0,2,0,0,...,0,0,0,0,0,0,0,0,0,0
mm,0,0,1,0,0,0,2,1,3,1,...,0,0,1,0,0,1,0,0,1,0
mtc,1,1,0,0,1,0,0,4,0,0,...,0,0,1,1,1,1,0,1,0,0
rr,0,0,2,0,0,0,0,1,0,0,...,0,0,0,0,0,0,1,0,0,0
sc,0,1,0,1,0,1,0,0,1,0,...,0,0,1,0,0,0,0,0,0,0
tv,0,2,0,0,0,0,0,2,1,0,...,1,1,0,0,1,0,0,0,0,1


In [63]:
# Let's pickle it for later use
data_dtm.to_pickle("../pickle/dtm.pkl")

In [64]:
# Let's also pickle the cleaned data (before we put it in document-term matrix format) and the CountVectorizer object
data_clean.to_pickle('../pickle/data_clean.pkl')
# Use a context manager so the file is closed once the object is written
with open("../pickle/cv.pkl", "wb") as file:
    pickle.dump(cv, file)