# Reviews Analyzer

The objective of this analysis is threefold:
1. Word association identification
2. Contextual sentiment extraction
3. Theme/topic assignment to word clusters

The dataset for this analysis consists of [hotel reviews](https://code.google.com/archive/p/dataset/downloads) mined from [Tripadvisor.com](http://tripadvisor.com).

### Let's load the data
Hotel reviews are available for the cities of Beijing, Chicago, Dubai, Las-Vegas, London, Montreal, New-Delhi, New-York-City, San-Francisco and Shanghai, spanning the years 2003 to 2010. For this analysis, the city is set to *chicago* and the year to *2007*. Both settings **can** be changed.

In [None]:
# Load necessary libraries
import numpy as np
import nltk

# Load custom module
import helperFunctions as hf

# Read Hotel Reviews Data into a data frame
dataPath = 'OpinRankDataSet/hotels/'
city = 'chicago'
year = '2007'
hotelReviewCount, reviewsDF = hf.readReviewsData(dataPath, city, year)
print('Done reading data')

### Clean up the text
Cleaning mainly involves removing noisy characters and stopwords, normalizing case, and stemming. A mapping from each stemmed form back to an original word is also created, so that a readable word can be recovered from its stem.
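
As a rough illustration of these cleaning steps, here is a minimal standard-library sketch. It is not the code in `helperFunctions`: the names `TOY_STOPWORDS`, `crude_stem`, `clean_text` and `build_unstem_map` are hypothetical, the stopword list is deliberately tiny, and the suffix stripper is only a stand-in for a real stemmer such as NLTK's `PorterStemmer`.

```python
import re

# Tiny illustrative stopword list (assumption); a real run would use a
# fuller list, for example NLTK's English stopwords.
TOY_STOPWORDS = {"the", "a", "an", "and", "was", "were", "is", "in", "of", "to", "it"}


def crude_stem(word):
    """Toy suffix stripper standing in for a real stemmer (e.g. Porter)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean_text(text):
    """Keep letters only, lowercase, drop stopwords; return a token list."""
    letters_only = re.sub(r"[^A-Za-z]+", " ", text)
    return [t for t in letters_only.lower().split() if t not in TOY_STOPWORDS]


def build_unstem_map(tokens):
    """Map each stem to the first surface form seen for it."""
    unstem = {}
    for token in tokens:
        unstem.setdefault(crude_stem(token), token)
    return unstem


tokens = clean_text("The rooms were CLEAN, and the staff cleaned it daily!")
stems = [crude_stem(t) for t in tokens]
unstem_map = build_unstem_map(tokens)
```

Keeping the first surface form per stem is one simple choice; picking the most frequent surface form would be an equally reasonable alternative.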

In [None]:
# Create combined reviews corpus from all reviews for different applications
combinedCorpus = reviewsDF['FullReview'].str.cat(sep=' ')
corpusOnlyChar = hf.preprocessText(combinedCorpus,onlyChar=True,lower=True,stopw=False,stem=False)
corpusNoStop   = hf.preprocessText(corpusOnlyChar,onlyChar=False,lower=False,stopw=True,stem=False)
corpusStem     = hf.preprocessText(corpusNoStop,onlyChar=False,lower=False,stopw=False,stem=True)
print('Done corpus processing')

# Create the unstem dictionary for the corpus
unstemDict = {}
corpusTrim = np.array(corpusNoStop.split())
hf.unstem(corpusTrim,unstemDict)
corpusTrim = list(corpusTrim)
print('Done creating unstem dictionary')

In [9]:
# Inspect the module search path
import sys
sys.path
import matplotlib
from wordcloud import WordCloud, STOPWORDS

### A quick look at the most frequent terms in the corpus

In [None]:
# Find Similar Words (appearing in similar contexts)
from nltk import ContextIndex
corpusOnlyChar = hf.preprocessText(combinedCorpus,onlyChar=True,lower=True,stopw=False,stem=False)
reviewContextFull3W = ContextIndex(tokens=corpusOnlyChar.split(),context_func=hf.contextFunc3W)
reviewContextFull2W = ContextIndex(tokens=corpusOnlyChar.split(),context_func=hf.contextFunc2W)
reviewContextStop2W = ContextIndex(tokens=corpusNoStop.split(),context_func=hf.contextFunc2W)
word = 'suite'
similarWords = hf.getSimilarWords(word, reviewContextFull3W, reviewContextFull2W, reviewContextStop2W, numWords=20)
print('Done finding similar words')

# Find Collocated Words (words appearing together in a phrase)
from nltk.collocations import BigramCollocationFinder
windowSize = 3
finder = BigramCollocationFinder.from_words(corpusStem.split(), windowSize)
collocationWords = hf.getCollocatedWords(finder,unstemDict,numPairs=10)
print('Done finding collocated words')

# Find Contextual Sentiments
reviewsDF['cleanReview'] = reviewsDF['FullReview'].apply(hf.preprocessText)
mostFreqWords = hf.getMostFrequentWords(reviewsDF['cleanReview'],unstemDict,5)
posTaggedWords = nltk.pos_tag(list(mostFreqWords.index))
hotelWords = [w for w,tag in posTaggedWords if tag == 'NN']
reviewSentiments = hf.getContextualSentiment(reviewsDF['FullReview'][1], domainWords=hotelWords)
print('Done building sentiment analyzer')

# Get similar word clusters along with topic
cleanReview = reviewsDF['FullReview'].apply(hf.preprocessText,stopw=True,minLen=False)
cleanReview = cleanReview.str.cat(sep=' ')
maxClusters = 10
clusterDict = hf.getThemeClusters(cleanReview,mostFreqWords,unstemDict,maxClusters)
print('Done clustering similar terms and assigning topic')
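
To make the collocation step in the cell above more concrete, the sketch below counts word pairs that occur within a sliding window, which is roughly the raw counting that `BigramCollocationFinder.from_words` performs when given a window size. The function `windowed_pairs` and the sample tokens are hypothetical, and how `hf.getCollocatedWords` scores or ranks the pairs is not shown in this notebook, so plain counts are used here as an assumption.

```python
from collections import Counter


def windowed_pairs(tokens, window_size=3):
    """Count ordered pairs (w1, w2) where w2 follows w1 within the window."""
    counts = Counter()
    for i, left in enumerate(tokens):
        # With window_size=3, each word is paired with the next two words.
        for right in tokens[i + 1 : i + window_size]:
            counts[(left, right)] += 1
    return counts


tokens = "front desk staff friendly front desk helpful".split()
pair_counts = windowed_pairs(tokens, window_size=3)
top_pairs = pair_counts.most_common(1)
```

A real collocation finder would typically rank pairs with an association measure rather than raw counts, since very frequent words otherwise dominate the list.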

In [None]:
# Extract the most frequent words (excluding stopwords)
cleanReview = reviewsDF['FullReview'].apply(hf.preprocessText)
mostFreqWords = hf.getMostFrequentWords(cleanReview,unstemDict,5)
print('Here are the 20 most frequent words in the corpus:')
mostFreqWords[:20]

In [None]:
import sys
sys.path.append('wordcloud')
!python wordcloud/setup.py install

### Word Association

Although word association has many connotations, it can be broadly classified into two types:
1.