# Sentiment Analysis (Using Yelp Review Data)

## Introduction

Language is one of the most complicated tools we have. Although we use it every day to communicate, express feelings, and share information, misunderstandings are still common. One reason is that language itself contains a lot of ambiguity. For example, the sentences "Jane saw Mary standing at the bank with an umbrella. She waved to her." raise at least three ambiguities (which kind of bank, who is holding the umbrella, and who waved to whom), and none of them has a definitive answer.

When we consider the actual hidden idea behind language, there are more aspects we have to consider. For example, emotions of speakers, negations, sarcasm, metaphors, and even the social relations between speakers and listeners can easily affect the words being used and the composition of sentences.

However, understanding (or guessing) the true meaning behind a given piece of text is always a challenge, for humans and machines alike. More specifically, many websites provide reviews of products or services, and their customers often rely heavily on these reviews to make decisions. A business that understands how users perceive a product can generate greater value by looking not only at transaction numbers but also at what users actually express.

### Preface

This tutorial introduces simple sentiment analysis methods and gives a short demo of evaluating and predicting Yelp review data. It also explores some existing sentiment libraries and demonstrates them on reviews.


### Tutorial content


We will cover the following topics in this tutorial:
- [Installing the libraries](#Installing-the-libraries)
- [Getting the data](#Getting-the-data)
- [Explaining models and assumptions](#Explaining-models-and-assumptions)
    * [Source-Channel paradigm](#Source-Channel-paradigm)
    * [Sentiment analysis models](#Sentiment-analysis-models)
    * [Naive Bayes Classifier with negation](#Naive-Bayes-Classifier-with-negation)
    * [Vader](#Vader)

## Installing the libraries

Before getting started, you'll need to install the libraries that we will use. You can install nltk using `pip` and then call `nltk.download()` to obtain the required corpora and lexicons:


In [11]:
import nltk
import nltk.tokenize.punkt
from nltk.tokenize import word_tokenize
from nltk.corpus import sentiwordnet as swn
from nltk.corpus import wordnet as wn
from nltk.corpus import subjectivity
from nltk.sentiment import SentimentAnalyzer
from nltk.classify import NaiveBayesClassifier
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk import tokenize
from nltk.sentiment.util import *

import io, time, json
import csv, re, string
import sys

import requests
from bs4 import BeautifulSoup
from yelpAPI import *

from collections import OrderedDict
import codecs


## Getting the data

This tutorial uses the Python package Beautiful Soup to crawl Yelp data, as we did in the first assignment of 15688. However, to provide a better analysis of users, we crawl three levels of data and thereby construct a network between users and restaurants.

The steps are as follows:

1. Crawl 1000 restaurants in Pittsburgh.
2. From these restaurants, crawl the recommended and not-recommended reviews, and mark a link from each reviewer to the restaurant.
    * e.g. 997 recommended reviews found at https://www.yelp.com/biz/gaucho-parrilla-argentina-pittsburgh
    * e.g. 198 not-recommended reviews found at https://www.yelp.com/not_recommended_reviews/gaucho-parrilla-argentina-pittsburgh
3. From these reviewers, crawl all the reviews they left for other restaurants.
4. For the new restaurants obtained in step 3, crawl all of each restaurant's reviews, and mark a link between reviewer and restaurant only for the users we have already seen in step 2.

The reason for the second-level search is that, in the data from step 2, the number of reviews per reviewer is heavily skewed. Most users (80%) reviewed only one of the 1000 restaurants from the first step, so we have a single data point to describe each of them, which is not enough and can cause significant variance.
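As a rough illustration of how that skew can be measured, the sketch below counts reviews per user and reports the share of users with exactly one review. The `links` list is hypothetical toy data, not the crawled data set, and `single_review_share` is a helper name introduced here for illustration only.

```python
from collections import Counter

# Hypothetical (user_id, restaurant_id) links, one per review found in step 2.
links = [
    ("u1", "r1"), ("u2", "r1"), ("u3", "r2"),
    ("u4", "r2"), ("u5", "r3"), ("u5", "r1"),
]

def single_review_share(links):
    """Return the fraction of users who appear in exactly one link."""
    reviews_per_user = Counter(user for user, _ in links)
    if not reviews_per_user:
        return 0.0
    singles = sum(1 for n in reviews_per_user.values() if n == 1)
    return singles / len(reviews_per_user)

# Four of the five toy users have a single review.
print(single_review_share(links))
```

A high share here is the signal that a second level of crawling is needed to collect more reviews per user.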

Step 3 gives us all the reviews written by any user, but not their labels: Yelp distinguishes "recommended" from "not recommended" reviews only on restaurant pages, not on user pages. This is the reason why we need the fourth step, to obtain the "recommended" or "not recommended" label.

After the four crawling steps, we have:
- 1000 restaurants in Pittsburgh
- 35158 users
- 89882 reviews (reviews of the 1000 restaurants left by the 35158 users)
- XXXX additional restaurants, reviewed by those 35158 users, crawled to obtain the review labels

During the crawling process, we sent more than 700,000 requests to Yelp, and three of my machines were banned by Yelp's robot check. The whole crawl can take more than two days.

### Crawler result data example

### user-id-mapping

<pre>
id  userId                  neg pos
1	 Zq31drx-JM2R1MKQg8uJQw	2	0
2	 We1kda5rqra8ClvV34Od4A	4	0
3	 m5lzdUZ00UkQEO-UXzTW9A	2	0
4	 bSvNU2vABlaBi1ooF4KNJg	15	0
5	 MUsXhUuDRzGLkh2l3aNDGA	6	0
6	 yq-MN1tqPA11TNWS7ZrIYQ	1	0
7	 nr97lipURi7lx-uexT3RkA	0	5
8	 qGDIsH6b4GTo37krG8ZfzQ	6	0
9	 9KUJxI5AZqm5K1TBe7lFfg	2	0
10	BLrtsER8kfkNlRWoCXYPqA	0	1
...
</pre>

### review metadata (graph)

<pre>
id    reviewId               userId rating label date
1	 BLkdRAJhSTsqy7bdMwONsg	1	5.0	1	2016-10-31
2	 aATT7y3AkCwyjiiiQpig9w	1	5.0	1	2016-10-26
3	 xjDkNv3JodG04_7Oq2JD-g	1	5.0	1	2016-10-23
4	 Hi10sGSZNxQH3NLyWSZ1oA	1	5.0	1	2016-10-7
5	 bsMGQruRQGgQZK4KA9Q4Aw	1	5.0	1	2016-10-4
6	 1nmIXJFvl0tI4gUMygKT2g	1	5.0	1	2016-10-23
7	 8Wi0srNRAF9hSAzg1qnchw	1	5.0	1	2016-9-18
8	 iipaDtoA1zHFkbAF3Gn5Rg	1	5.0	1	2016-9-8
9	 242-DHMPDzfjYrb45dd9ZQ	1	5.0	1	2016-10-23
10	aomkMAGFrL-gD6cqQjB4bw	1	4.0	1	2016-10-8
...
</pre>

### review content
<pre>
1	aVT6N0mvnM5vmr2_igf1QQ	My wife and I made an unexpected overnight stay in Pittsburgh ... of your way to try Gaucho, you will not be disappointed.
2	GcY4xubTKS2qzAszScim3A	So. Good. Other than the staff ... through but press on. You're welcome.
...
</pre>


### Crawler code snippet

In [9]:
#####################################
# load "recommended" and "non-recommended" reviews for all 
# restaurants given in a file, "argv = subset"
# generate three files
# 1. review content
# 2. metadata
# 3. userId mapping
#####################################

# load 1000 restaurants in Pittsburgh
client = authenticate("inputData/authenticate.json")
businesses = all_restaurants(client, 'Pittsburgh')
print(">>> get restaurants. count: " + str(len(businesses)))

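# write an "index <tab> business id" mapping; the with-block closes the file
# so the data is flushed before the file is re-read below
with io.open('outputData/businessesIdMapping.tsv', 'w', encoding='utf8') as f:
	for cnt, business in enumerate(businesses, 1):
		f.write(str(cnt) + "\t" + business.id + "\n")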

subset = sys.argv[1]

f_reviews_content = io.open('outputData/reviews_content_' + subset + '.tsv', 'w', encoding='utf8')
f_userIdMapping = io.open('outputData/user_id_mapping_' + subset + '.tsv', 'w', encoding='utf8')
f_metaData = io.open('outputData/metaData_' + subset + '.tsv', 'w', encoding='utf8')

userIdDict = {}

# read 1000 restaurants in Pittsburgh
f = open('outputData/businessesIdMapping.tsv', 'r')
businessLines = [line.rstrip('\n\r') for line in f]
cnt = 1

for businessLine in businessLines[int(subset): int(subset) + 100]:
	businessId = businessLine.split('\t')[0]

	print(">>> " + businessId + " extracting recommended reviews from: " + businessLine.split('\t')[1])
	url = 'https://www.yelp.com/biz/' + businessLine.split('\t')[1]
	reviews = extract_reviews(url)
	print(">>> extracted " + str(len(reviews)) + " from " + businessLine.split('\t')[1])

	for review in reviews:

		# added user to user set
		if review['user_id'] not in userIdDict:
			userIdDict[review['user_id']] = {'pos': 0, 'neg': 1}
		else:
			userIdDict[review['user_id']]['neg'] += 1
		f_reviews_content.write(str(cnt) + "\t" + review['review_id'] + "\t" + review['text'] + "\n")
		f_metaData.write(str(cnt) + "\t" + review['user_id'] + "\t" + businessId + "\t" + str(review['rating']) + "\t" + "1\t" + review['date'] + "\n")
		cnt += 1

	print(">>> " + businessId + "  extracting non-recommended reviews from: " + businessLine.split('\t')[1])
	url = '/not_recommended_reviews/' + businessLine.split('\t')[1]
	reviews = extract_unrecommend_reviews(url)
	print(">>> extracted " + str(len(reviews)) + " from " + businessLine.split('\t')[1])

	for review in reviews:
		# added user to user set
		if review['user_id'] not in userIdDict:
			userIdDict[review['user_id']] = {'pos': 1, 'neg': 0}
		else:
			userIdDict[review['user_id']]['pos'] += 1

		f_reviews_content.write(str(cnt) + "\t" + review['review_id'] + "\t" + review['text'] + "\n")
		f_metaData.write(str(cnt) + "\t" + review['user_id'] + "\t" + businessId + "\t" + str(review['rating']) + "\t" + "-1\t" + review['date'] + "\n")
		cnt += 1

print("... finished crawling, extracted " + str(cnt-1) + " reviews in total")

cnt = 1
for userId, count in userIdDict.items():
	f_userIdMapping.write(str(cnt) + "\t" + userId + "\t" + str(count['neg']) + "\t" + str(count['pos']) + "\n")
	cnt += 1


In [7]:
#####################################
# aggregate subset files into three files 
# re-number the reviews
# aggregate userId mapping into one bigger user file
# number users and update metadata
# 1. review content
# 2. metadata
# 3. userId mapping
#####################################

# aggregate users
def aggregateUsers():
	users = OrderedDict()
	countAllUsers = 0
	for i in range(10):
		f = open('outputData/user_id_mapping_' + str(i * 100) + '.tsv', 'r')
		lines = [line.rstrip('\n\r') for line in f]
		print(">>> FILE: " + str(i * 100) + " COUNT: " + str(len(lines)))
		countAllUsers += len(lines)

		for line in lines:
			parts = line.split("\t")
			if parts[1] not in users:
				users[parts[1]] = [int(parts[2]), int(parts[3])]
			else:
				users[parts[1]][0] += int(parts[2])
				users[parts[1]][1] += int(parts[3])

	print(">>> total number of users: " + str(len(users)) + " ...writing to file")
	f_userIdMapping = open('outputData/user_id_mapping_all.tsv', 'w')
	cnt = 1
	for k,v in users.items():
		f_userIdMapping.write(str(cnt) + "\t" + str(k) + "\t" + str(v[0]) + "\t" + str(v[1]) + "\n")
		cnt += 1

def aggregateReviewMetaData():
	f_metaData = open('outputData/metaData_all.tsv', 'w')
	cnt = 1
	for i in range(10):
		f = open('outputData/metaData_' + str(i * 100) + '.tsv', 'r')
		lines = [line.rstrip('\n\r') for line in f]
		print(">>> FILE: " + str(i * 100) + " COUNT: " + str(len(lines)))

		for line in lines:
			parts = line.split("\t")
			items = parts[1:5]
			date = parts[5].replace("Updatedreview", "").split("/")

			items.insert(0, str(cnt))
			items.append(date[2] + "-" + date[0] + "-" + date[1])
			cnt += 1

			f_metaData.write("\t".join(items) + "\n")

def aggregateReviewContents():
	f_reviewContent = open('outputData/reviews_content_all.tsv', 'w')
	cnt = 1
	for i in range(10):
		f = open('outputData/reviews_content_' + str(i * 100) + '.tsv', 'r')
		lines = [line.rstrip('\n\r') for line in f]
		print(">>> FILE: " + str(i * 100) + " COUNT: " + str(len(lines)))

		for line in lines:
			parts = line.split("\t", 1)
			f_reviewContent.write(str(cnt) + "\t" + parts[1] + "\n")
			cnt += 1


In [8]:
#####################################
# get the restaurants reviewed by the 35158 users extracted from the 1000 restaurants
#####################################
offset = sys.argv[1]

# load 1000 restaurant in Pittsburgh
oldBiz = set()
f = open('outputData/businessesIdMapping.tsv', 'r')
businessLines = [line.rstrip('\n\r') for line in f]
for line in businessLines:
	oldBiz.add(line.split("\t")[1])

f = open('outputData/user_id_mapping_all.tsv', 'r')
userLines = [line.rstrip('\n\r') for line in f]
newBiz = OrderedDict()

for line in userLines[int(offset): int(offset) + 500]:
	parts = line.split("\t")

	businesses = getBizFromUserPage(parts[1])
	print(">>> " + parts[0] + "  extracted " + str(len(businesses)) + " from user: " + parts[1])
	for biz in businesses:
		if biz not in oldBiz:
			newBiz[biz] = parts[1]

cnt = 1
f_newBiz = open('outputData/newBiz_' + offset + '.tsv', 'w')
for k, v in newBiz.items():
	f_newBiz.write(str(cnt) + "\t" + v + "\t" + k + "\n")
	cnt += 1


## Explaining models and assumptions

### Source-Channel paradigm

The source-channel paradigm is a framework that intends to capture the relationship between observed data and the underlying source that generated it. For sentiment analysis, we assume each review carries a positive or negative perception (it can be a boolean or a continuous number between two values). The paradigm assumes a prior probability P(S) for a sentiment S, and a channel that produces a document (review) D from a given S with probability P(D|S). Our task is therefore to find the most likely S* given an observed D. The following equation shows the relationship:

<img src='sourceChannelEquation.png'>


And the source-channel paradigm can be visualized as:

<img src='sourceChannel.png'>


### Language models (unigram)

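We represent each review with a unigram (bag-of-words) model: a review is treated as a collection of individual tokens, and word order is ignored.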
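To make the source-channel idea concrete with a unigram model, here is a minimal sketch that picks the sentiment S maximizing P(S) · P(D|S), which by Bayes' rule is the same S that maximizes P(S|D). The training data, the add-one smoothing, and the function names (`log_prior`, `log_likelihood`, `decode`) are all assumptions made for illustration; this is not the classifier trained later in the tutorial.

```python
import math
from collections import Counter

# Toy training data (hypothetical): tokenized reviews grouped by sentiment.
train = {
    "pos": [["great", "food"], ["great", "service"]],
    "neg": [["bad", "food"], ["slow", "service"], ["bad", "service"]],
}

vocab = {w for docs in train.values() for doc in docs for w in doc}
total_docs = sum(len(docs) for docs in train.values())

def log_prior(s):
    """log P(S): share of training reviews with sentiment s."""
    return math.log(len(train[s]) / total_docs)

def log_likelihood(doc, s):
    """log P(D|S) under a unigram model with add-one smoothing."""
    counts = Counter(w for d in train[s] for w in d)
    denom = sum(counts.values()) + len(vocab)
    # Words outside the toy vocabulary are skipped.
    return sum(math.log((counts[w] + 1) / denom) for w in doc if w in vocab)

def decode(doc):
    """Return S* = argmax over S of P(S) * P(D|S), computed in log space."""
    return max(train, key=lambda s: log_prior(s) + log_likelihood(doc, s))

print(decode(["great", "food"]))
print(decode(["bad", "service"]))
```

Working in log space avoids numeric underflow when a review contains many tokens.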

### Sentiment analysis models (negation and amplifiers)

The idea behind this model is that the sentiment score of a review can be computed numerically. There are three main assumptions:
+ Sentiment score of a sentence is the aggregation of words in the sentence.
 
 * This [awesome] movie is the [best] one I [have ever seen].
 
 
+ Sentiment score of a word can be flipped by a negation word.

 * This restaurant does [not] have the good food it claims.
 
 * The soup [could have been] more delicious if...


+ Sentiment score of a word can be amplified or diminished by amplifier tokens.
  
 * The setting of this novel is [very] interesting[!]
 * it was the [WORST] experience I have ever had [in my life].
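
The three assumptions above can be sketched as a toy scorer. Everything in it is hypothetical: the lexicon values, the negation and amplifier lists, and the weights 1.5 and 1.25 are made up for illustration and do not come from any library.

```python
# Hypothetical toy lexicon and modifier lists, for illustration only.
LEXICON = {"good": 1.0, "best": 2.0, "awesome": 2.0, "worst": -2.0, "bad": -1.0}
NEGATIONS = {"not", "never", "no"}
AMPLIFIERS = {"very": 1.5, "really": 1.5}

def score_sentence(tokens):
    """Sum word scores, flipping after a negation and scaling after an amplifier."""
    total = 0.0
    negate = False
    scale = 1.0
    for token in tokens:
        word = token.lower()
        if word in NEGATIONS:
            negate = True
            continue
        if word in AMPLIFIERS:
            scale *= AMPLIFIERS[word]
            continue
        value = LEXICON.get(word, 0.0)
        if value:
            if token.isupper():      # ALL CAPS acts as an amplifier
                value *= 1.5
            if negate:               # assumption 2: negation flips the sign
                value = -value
            total += value * scale   # assumption 1: aggregate word scores
            negate, scale = False, 1.0
    if "!" in tokens:
        total *= 1.25                # assumption 3: punctuation as an amplifier
    return total

print(score_sentence("this is the best steak house".split()))
print(score_sentence("this is not the best steak house".split()))
```

In this sketch a pending negation or amplifier stays active until the next word found in the lexicon, which is a simplification of how scope works in real sentences.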


### Negation


When analyzing the sentiment of reviews, negation words can play a very significant role. Consider the following two simple cases:
* This is the best steak house in Pittsburgh.
* This is not the best steak house in Pittsburgh.

The single word 'not' changes the entire meaning of the sentence, but if we approach the two sentences with a simple unigram, bag-of-words model, we will consider them extremely similar to each other. We must therefore deal with negation words when analyzing reviews. Fortunately, the nltk package provides a very simple way to mark words as negated.
    

In [13]:
test1 = ('This is the best steak house in Pittsburgh.'.split(" "), 'subj')
test2 = ('This is not the best steak house in Pittsburgh.'.split(" "), 'subj')
print(mark_negation(test1))
print(mark_negation(test2))


The function `mark_negation()` finds negation words and appends the suffix `_NEG` to the tokens that fall within their scope. In effect, negated words are treated as different tokens from their plain forms. The previous example therefore produces:

* (['This', 'is', 'the', 'best', 'steak', 'house', 'in', 'Pittsburgh.'], 'subj')
* (['This', 'is', 'not', 'the_NEG', 'best_NEG', 'steak_NEG', 'house_NEG', 'in_NEG', 'Pittsburgh._NEG'], 'subj')
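
In the output above the negation scope runs to the end of the sentence because splitting on spaces leaves the final period attached to `Pittsburgh.`. The sketch below is a simplified, hypothetical re-implementation (not NLTK's code, which uses a richer negation pattern) that shows the scope ending at a punctuation token when punctuation is tokenized separately.

```python
import re

# Simplified patterns for illustration; NLTK's own patterns are more elaborate.
NEGATION = re.compile(r"^(?:not|no|never|.*n't)$", re.IGNORECASE)
CLAUSE_END = re.compile(r"^[.:;!?]$")

def mark_negation_sketch(tokens):
    """Append _NEG to tokens between a negation word and the next clause punctuation."""
    marked, in_scope = [], False
    for tok in tokens:
        if CLAUSE_END.match(tok):
            in_scope = False          # punctuation closes the negation scope
            marked.append(tok)
        elif NEGATION.match(tok):
            in_scope = True           # the negation word itself is left unmarked
            marked.append(tok)
        else:
            marked.append(tok + "_NEG" if in_scope else tok)
    return marked

print(mark_negation_sketch("The soup was not good . The steak was great .".split()))
```

Only `good` is marked in this example; the tokens of the second sentence are left untouched.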


### Build model using nltk.NaiveBayesClassifier with negations

Now that we know how to treat negation in sentences, we can try to model reviews with different ratings. For demonstration purposes, we use a Naive Bayes classifier as a multi-class classifier. We train the model on 450 reviews from each rating category and test it on 50 from each. Note that the original distribution of reviews is very skewed (5-star reviews are the most common). To keep the prior from dominating the result, we construct a balanced training set.

In [None]:
f = open('inputData/metaData_all.tsv', 'r')
metaDataLines = [line.rstrip('\n\r') for line in f]

f = codecs.open('inputData/reviews_content_all.tsv', 'r', 'utf8')
reviewsContentLines = [line.rstrip('\n\r') for line in f]

# put reviews in five different bucket by ratings
buckets = {
	'1.0': [], '2.0': [], '3.0': [], '4.0': [], '5.0': [],
}

for i in range(20000):
	r = reviewsContentLines[i].split("\t")[2]
	tokens = nltk.word_tokenize(r.encode('ascii', 'ignore').decode('ascii'))
	rating = metaDataLines[i].split("\t")[3]
	buckets[rating].append((tokens, rating))

training_docs = []
testing_docs = []

# get reviews from buckets. 
for k, v in buckets.items():
	training_docs.extend(v[0: 450])
	testing_docs.extend(v[450: 500])

sentim_analyzer = SentimentAnalyzer()
all_words_neg = sentim_analyzer.all_words([mark_negation(doc) for doc in training_docs])

unigram_feats = sentim_analyzer.unigram_word_feats(all_words_neg, min_freq=4)
sentim_analyzer.add_feat_extractor(extract_unigram_feats, unigrams=unigram_feats)

training_set = sentim_analyzer.apply_features(training_docs)
test_set = sentim_analyzer.apply_features(testing_docs)

trainer = NaiveBayesClassifier.train
classifier = sentim_analyzer.train(trainer, training_set)

for key,value in sorted(sentim_analyzer.evaluate(test_set).items()):
    print('{0}: {1}'.format(key, value))

<pre>
Number of features: 4703
Evaluating NaiveBayesClassifier results...
Accuracy: 0.484
F-measure [1.0]: 0.556701030928
F-measure [2.0]: 0.354166666667
F-measure [3.0]: 0.229885057471
F-measure [4.0]: 0.387096774194
F-measure [5.0]: 0.771653543307
Precision [1.0]: 0.574468085106
Precision [2.0]: 0.369565217391
Precision [3.0]: 0.27027027027
Precision [4.0]: 0.418604651163
Precision [5.0]: 0.636363636364
Recall [1.0]: 0.54
Recall [2.0]: 0.34
Recall [3.0]: 0.2
Recall [4.0]: 0.36
Recall [5.0]: 0.98
</pre>

The result is not very impressive, but it is reasonable: the model performs much better on the extreme ratings (1 and 5 stars), and rating 3 has the worst performance. This makes sense, because it is always easier to identify a review with a strong opinion than a neutral one. When we feel strongly against or in favor of something, the language we use to describe it differs much more than when we do not feel much about it.

Another way to evaluate the model would be to treat the task as a regression-like problem, because mistaking a 5-star review for a 1-star review should be penalized more than predicting it as a 4-star review. A multi-class evaluation like the one we use does not capture the true error of the model, because it counts every wrong prediction as one error no matter how far the prediction is from the true value. For the purpose of this tutorial, however, we will not go further into how the problem should be evaluated.
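
A small sketch of the difference, using mean absolute error as one possible distance-aware metric (a choice made here for illustration, not one the tutorial itself evaluates). The ratings and predictions are made-up toy values.

```python
def accuracy(y_true, y_pred):
    """Fraction of exact matches; every miss counts the same."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_absolute_error(y_true, y_pred):
    """Average star distance between prediction and truth."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [5, 5, 1, 3]
near_miss = [4, 5, 1, 3]   # one 5-star review predicted as 4 stars
far_miss = [1, 5, 1, 3]    # the same review predicted as 1 star

# Accuracy cannot tell the two predictions apart; the absolute error can.
print(accuracy(y_true, near_miss), accuracy(y_true, far_miss))
print(mean_absolute_error(y_true, near_miss), mean_absolute_error(y_true, far_miss))
```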

### Amplifiers: Capital words


In [8]:
def allCapitalCount(reviewFile, outFile):
    with codecs.open(reviewFile, 'r', 'utf-8') as f:
        data1 = f.readlines()
    # in Python 3, csv.writer needs a text-mode file opened with newline=''
    writer = csv.writer(open(outFile, 'w', newline=''))
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    
    for line in data1:
        # split off the leading review number from the rest of the line
        parts = re.split(r"\s+", line, 1)
        count = parts[0]
        if len(parts) > 1:
            line = parts[1]
            words = nltk.word_tokenize(line)
            word = list()
            countWord = 0
            countAllCapital = 0

            for w in words:
                if w.isupper():
                    countAllCapital += 1
                new_token = regex.sub(u'', w)
                if not new_token == u'':
                    word.append(new_token)
                    countWord += 1
            if countWord > 0:
                percAllCapital = (float(countAllCapital)/countWord)
            else:
                percAllCapital = 0.0
            writer.writerow([count, percAllCapital])

### Amplifiers: Exclamation marks

In [10]:
def excSentenceCount(filename, outputFile):
    with codecs.open(filename, 'r', 'utf-8') as f:
        data1 = f.readlines()
    # in Python 3, csv.writer needs a text-mode file opened with newline=''
    writer = csv.writer(open(outputFile, 'w', newline=''))
    
    for line in data1:
        # split off the leading review number from the rest of the line
        parts = re.split(r"\s+", line, 1)
        count = parts[0]
        if len(parts) > 1:
            line = parts[1]
            tokenized_sentences = nltk.sent_tokenize(line)
            countExc = 0
            countSent = 0
            for sentence in tokenized_sentences:
                countSent += 1
                if '!' in sentence:
                    countExc += 1
            if countSent > 0:
                ratioExcSent = float(countExc)/countSent
            else:
                ratioExcSent = 0.0
            writer.writerow([count, ratioExcSent])


### Using Vader to model polarity

<a href="https://pypi.python.org/pypi/vaderSentiment">VADER</a> (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media, and works well on texts from other domains.

VADER ships with a pre-built sentiment lexicon and a set of rules, so no training is needed: we can use it directly to analyze the polarity of any review.

In [16]:
# example sentences borrowed from http://www.nltk.org/howto/sentiment.html
sid = SentimentIntensityAnalyzer()

ss1 = sid.polarity_scores("VADER is smart, handsome, and funny.")
ss2 = sid.polarity_scores("VADER is smart, handsome, and funny!")
ss3 = sid.polarity_scores("VADER is bad, boring, and funny.")
ss4 = sid.polarity_scores("VADER is bad, boring, and funny!")
for ss in (ss1, ss2, ss3, ss4):
    print(ss)

Output of the program is:

<pre>
{'neg': 0.0, 'neu': 0.254, 'pos': 0.746, 'compound': 0.8316}
{'neg': 0.0, 'neu': 0.248, 'pos': 0.752, 'compound': 0.8439}
{'neg': 0.496, 'neu': 0.256, 'pos': 0.248, 'compound': -0.4404}
{'neg': 0.508, 'neu': 0.25, 'pos': 0.242, 'compound': -0.4926}
</pre>

We can see that adding an exclamation mark makes the positive sentence score more positive and the negative sentence score more negative.

In [18]:
sid = SentimentIntensityAnalyzer()

ss1 = sid.polarity_scores("VADER is SMART, HANDSOME, and FUNNY.")
ss2 = sid.polarity_scores("VADER is smart, handsome, and funny.")
ss3 = sid.polarity_scores("VADER is BAD, BORING, and FUNNY.")
ss4 = sid.polarity_scores("VADER is bad, boring, and funny.")
for ss in (ss1, ss2, ss3, ss4):
    print(ss)


Output of the program is:

<pre>
{'neg': 0.0, 'neu': 0.214, 'pos': 0.786, 'compound': 0.9}
{'neg': 0.0, 'neu': 0.254, 'pos': 0.746, 'compound': 0.8316}
{'neg': 0.523, 'neu': 0.216, 'pos': 0.261, 'compound': -0.5622}
{'neg': 0.496, 'neu': 0.256, 'pos': 0.248, 'compound': -0.4404}
</pre>

We can see that writing some words in ALL CAPITALS has the same effect: the positive sentence scores more positive and the negative sentence scores more negative.

### Apply Vader to Yelp reviews

In [None]:
f = codecs.open('inputData/reviews_content_all.tsv', 'r', 'utf8')
reviewsContent = [line.rstrip('\n\r').split("\t")[2] for line in f]
f = open('inputData/metaData_all.tsv', 'r')
metaDataLines = [line.rstrip('\n\r') for line in f]

sentences = reviewsContent[0:1000]
categories = {
	'1.0': {'cnt': 0, 'neg': 0, 'neu': 0, 'pos': 0},
	'2.0': {'cnt': 0, 'neg': 0, 'neu': 0, 'pos': 0},
	'3.0': {'cnt': 0, 'neg': 0, 'neu': 0, 'pos': 0},
	'4.0': {'cnt': 0, 'neg': 0, 'neu': 0, 'pos': 0},
	'5.0': {'cnt': 0, 'neg': 0, 'neu': 0, 'pos': 0}
}

sid = SentimentIntensityAnalyzer()
for i in range(len(sentences)):
    print(sentences[i][0:50] + " ...")
    ss = sid.polarity_scores(sentences[i])
    print(ss)
    ratings = metaDataLines[i].split("\t")[3]
    categories[ratings]['cnt'] += 1
    categories[ratings]['neg'] += ss['neg']
    categories[ratings]['neu'] += ss['neu']
    categories[ratings]['pos'] += ss['pos']

# iterate from 5.0 down to 1.0 so the printed order is deterministic
for k, v in sorted(categories.items(), reverse=True):
	print(">>> ratings: " + k)
	print("neg: " + str(float(v['neg'])/v['cnt']))
	print("neu: " + str(float(v['neu'])/v['cnt']))
	print("pos: " + str(float(v['pos'])/v['cnt']))

Output of the program is:
<pre>
>>> ratings: 5.0
neg: 0.025626429479
neu: 0.733320203304
pos: 0.241049555273
>>> ratings: 4.0
neg: 0.0287669172932
neu: 0.757466165414
pos: 0.21384962406
>>> ratings: 3.0
neg: 0.045380952381
neu: 0.799952380952
pos: 0.154714285714
>>> ratings: 2.0
neg: 0.0808888888889
neu: 0.796037037037
pos: 0.123111111111
>>> ratings: 1.0
neg: 0.0888181818182
neu: 0.831636363636
pos: 0.0795454545455
</pre>

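We can clearly see that 5-star reviews have the highest average positive score, and that the ordering is consistent across ratings: the average positive score falls and the average negative score rises as the star rating drops.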
Vader is awesome!
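
As a quick sanity check of that ordering, the sketch below copies the average scores from the output above (rounded to four decimals) and verifies that they change monotonically with the star rating. The helper `is_monotonic` is a name introduced here for illustration.

```python
# Average VADER scores per star rating, rounded from the output above.
averages = {
    5: {"neg": 0.0256, "pos": 0.2410},
    4: {"neg": 0.0288, "pos": 0.2138},
    3: {"neg": 0.0454, "pos": 0.1547},
    2: {"neg": 0.0809, "pos": 0.1231},
    1: {"neg": 0.0888, "pos": 0.0795},
}

def is_monotonic(values, increasing=True):
    """True if each value is strictly greater (or smaller) than the previous one."""
    pairs = list(zip(values, values[1:]))
    if increasing:
        return all(a < b for a, b in pairs)
    return all(a > b for a, b in pairs)

stars = sorted(averages)  # 1, 2, 3, 4, 5
pos_by_star = [averages[s]["pos"] for s in stars]
neg_by_star = [averages[s]["neg"] for s in stars]

print(is_monotonic(pos_by_star, increasing=True))   # pos rises with the rating
print(is_monotonic(neg_by_star, increasing=False))  # neg falls with the rating
```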