The Multi-Domain Sentiment Dataset contains product reviews from Amazon.com covering 4 product types (domains): Kitchen, Books, DVDs, and Electronics. Each domain has several thousand reviews, but the exact number varies by domain. Reviews contain star ratings (1 to 5 stars) that can be converted into binary labels if needed.
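
As a rough illustration of the star-to-label conversion mentioned above, the sketch below maps ratings to 1 (positive) or 0 (negative). The helper name `rating_to_label` and the choice of 3 stars as a neutral cut-off are assumptions made for this example; the files used in this notebook are already split into positive and negative reviews, so this step is not needed here.

```python
def rating_to_label(stars, neutral=3.0):
    """Map a 1-5 star rating to 1 (positive), 0 (negative) or None (neutral).

    The neutral threshold is an assumption of this sketch, not something
    fixed by the dataset description.
    """
    if stars > neutral:
        return 1
    if stars < neutral:
        return 0
    return None  # ambiguous rating; the caller decides whether to drop it


ratings = [5.0, 1.0, 3.0, 4.0, 2.0]
labels = [rating_to_label(r) for r in ratings]
print(labels)  # [1, 0, None, 1, 0]
```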

In [1]:
from bs4 import BeautifulSoup

In [73]:
'''
Parse the downloaded html file which contains the data. Positive and negative
reviews have already been distinguished into two separate files
'''

positive_reviews = BeautifulSoup(open('positive.review.html').read(), "lxml")

In [3]:
'''
Make a list of all positive reviews
'''

positive_reviews = positive_reviews.findAll('review_text')

In [4]:
len(positive_reviews) #How many are there?

1000

In [5]:
positive_reviews[0:5] #Let's see some of these reviews

[<review_text>\nI purchased this unit due to frequent blackouts in my area and 2 power supplies going bad.  It will run my cable modem, router, PC, and LCD monitor for 5 minutes.  This is more than enough time to save work and shut down.   Equally important, I know that my electronics are receiving clean power.\n\nI feel that this investment is minor compared to the loss of valuable data or the failure of equipment due to a power spike or an irregular power supply.\n\nAs always, Amazon had it to me in &lt;2 business days\n</review_text>,
 <review_text>\nI ordered 3 APC Back-UPS ES 500s on the recommendation of an employee of mine who used to work at APC. I've had them for about a month now without any problems. They've functioned properly through a few unexpected power interruptions. I'll gladly order more if the need arises.\n\nPros:\n - Large plug spacing, good for power adapters\n - Simple design\n - Long cord\n\nCons:\n - No line conditioning (usually an expensive option\n</review_

In [6]:
negative_reviews = BeautifulSoup(open('negative.review.html').read(), "lxml") # Same parser as for the positive file

In [7]:
negative_reviews = negative_reviews.findAll('review_text')

In [8]:
len(negative_reviews)

1000

In [9]:
stopwords = set(w.rstrip() for w in open('stopwords.txt')) # Set of stopwords from a downloaded list. Removing these helps.

In [10]:
import nltk

In [11]:
from nltk.stem import WordNetLemmatizer

In [12]:
word_lemmatize = WordNetLemmatizer() # The lemmatizer changes plurals to singular, e.g. 'dogs' to 'dog', since both
                                    # carry the same meaning

In [13]:
'''
A function to tokenize and lemmatize a review. Returns a list of tokens
'''

def tokenizer(s):
    s = s.lower()
    tokens = nltk.tokenize.word_tokenize(s)
    tokens = [t for t in tokens if len(t) > 2]
    tokens = [word_lemmatize.lemmatize(t) for t in tokens]
    tokens = [t for t in tokens if t not in stopwords]
    return tokens

In [14]:
word_map = {}
positive_tokenized = []
negative_tokenized = []
index = 0


In [15]:
'''
Here, we read through the list of positive reviews, tokenize them, and add them to
list positive_tokenized. We also take each token and add it to a dictionary,
where the key is the token, the value is the index.
'''


for review in positive_reviews:
    tokens = tokenizer(review.text)
    positive_tokenized.append(tokens)
    for t in tokens:
        if t not in word_map:
            word_map[t] = index
            index = index + 1

In [16]:
for review in negative_reviews:
    tokens = tokenizer(review.text)
    negative_tokenized.append(tokens)
    for t in tokens:
        if t not in word_map:
            word_map[t] = index
            index = index + 1
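
The two loops above do the same bookkeeping, so the idea can be sketched once on toy data. The helper `add_to_vocabulary` and the toy reviews are hypothetical names introduced for this illustration; using `len(vocab)` as the next free index is equivalent to keeping a separate running counter.

```python
def add_to_vocabulary(tokenized_docs, vocab):
    """Give every token not yet in vocab the next free integer index."""
    for tokens in tokenized_docs:
        for t in tokens:
            if t not in vocab:
                vocab[t] = len(vocab)  # indices run 0, 1, 2, ... in order of first appearance
    return vocab


toy_positive = [["great", "battery"], ["great", "price"]]
toy_negative = [["poor", "battery"]]

toy_map = {}
add_to_vocabulary(toy_positive, toy_map)
add_to_vocabulary(toy_negative, toy_map)  # shared tokens such as "battery" keep their index
print(sorted(toy_map.items(), key=lambda kv: kv[1]))
# [('great', 0), ('battery', 1), ('price', 2), ('poor', 3)]
```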

In [21]:
len(positive_tokenized), len(negative_tokenized)

(1000, 1000)

In [23]:
import numpy as np

In [24]:
np.zeros(len(word_map) + 1)

array([ 0.,  0.,  0., ...,  0.,  0.,  0.])

In [25]:
len(word_map) #This essentially contains ALL the tokens combined in positive and negative reviews

11393

In [26]:
index

11393

In [35]:
'''
Now that we have our tokens and indexes, we need to vectorize each review. We first initialize a vector of length
D + 1, where D is the number of distinct tokens (the length of the dictionary). For each token in the passed list,
we look up its index in the dictionary and increment the count in that column. We then normalize the counts so
that they sum to 1. Finally, since positive reviews have label 1 and negative ones 0, we store the label in the
last column and return the vector.
'''

def tokens_to_vector(tokens, label):
    x = np.zeros(len(word_map) + 1)
    for t in tokens:
        i = word_map[t]
        x[i] += 1
    x = x/x.sum()
    x[-1] =  label
    return x
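
A small worked example may make the vector layout easier to see. The sketch below repeats the same steps on a four-word toy vocabulary; `bow_vector` and `toy_vocab` are hypothetical names, and the guard against an all-zero count vector is an addition of this sketch (a review whose tokens were all filtered out would otherwise produce a 0/0 division and NaN entries).

```python
import numpy as np


def bow_vector(tokens, vocab, label):
    """Normalized word counts in the first D slots, class label in the last."""
    x = np.zeros(len(vocab) + 1)
    for t in tokens:
        x[vocab[t]] += 1
    total = x.sum()
    if total > 0:  # guard added in this sketch: skip normalization for empty reviews
        x = x / total
    x[-1] = label
    return x


toy_vocab = {"great": 0, "battery": 1, "price": 2, "poor": 3}
v = bow_vector(["great", "battery", "great"], toy_vocab, 1)
print(v)  # counts 2, 1, 0, 0 become 2/3, 1/3, 0, 0; the last entry is the label 1

empty = bow_vector([], toy_vocab, 0)  # stays all zeros instead of NaN
```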

In [37]:
N = len(positive_tokenized) + len(negative_tokenized)

In [38]:
'''
Now initialize an (N, D+1) matrix, where N is the total number of reviews and D is the number of distinct tokens,
as before. Each returned vector fills all columns of row 'i', the row of the current review.
'''

data = np.zeros((N, (len(word_map) + 1)))
i = 0

In [39]:
for tokens in positive_tokenized:
    xy = tokens_to_vector(tokens, 1)
    data[i, :] = xy
    i = i + 1

In [40]:
for tokens in negative_tokenized:
    xy = tokens_to_vector(tokens, 0)
    data[i, :] = xy
    i = i + 1
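
The same assembly can be sketched on toy rows, which also shows how the label column is later separated from the features. All names and values below are made up for illustration. Note that the rows end up ordered with all positives first, so the data needs to be shuffled before any train/test split.

```python
import numpy as np

# Toy stand-ins for the per-review vectors: 3 feature columns plus a label column.
positive_rows = [np.array([0.5, 0.5, 0.0, 1.0]), np.array([1.0, 0.0, 0.0, 1.0])]
negative_rows = [np.array([0.0, 0.5, 0.5, 0.0])]

# Stack positives first, then negatives, one review per row.
toy_data = np.vstack(positive_rows + negative_rows)

toy_X = toy_data[:, :-1]  # every column except the last: the features
toy_y = toy_data[:, -1]   # the last column: the labels

print(toy_data.shape)  # (3, 4)
print(toy_X.shape)     # (3, 3)
```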

In [44]:
data.shape # Cross check the dimensions of the matrix.

(2000, 11394)

In [45]:
from sklearn.cross_validation import train_test_split # In scikit-learn >= 0.18, import this from sklearn.model_selection
from sklearn.linear_model import LogisticRegression

In [46]:
X = data[:, :-1] # Slice the matrix to get the input data, i.e. All columns except the last one.

In [48]:
X.shape

(2000, 11393)

In [49]:
y = data[:, -1] # The last column

In [52]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0) #Split into training and 
                                                                                                #test sets
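
To make the effect of `test_size = 0.25` concrete: with 2000 reviews, 500 are held out for testing and 1500 are used for training. The sketch below imitates a shuffled hold-out split with plain NumPy. `holdout_split` is a hypothetical helper, not scikit-learn's implementation, and it will not select the same rows as `train_test_split`.

```python
import numpy as np


def holdout_split(X, y, test_size=0.25, seed=0):
    """Shuffle the row indices, then hold out a fraction of them for testing."""
    n = X.shape[0]
    n_test = int(np.ceil(test_size * n))
    perm = np.random.RandomState(seed).permutation(n)
    test_idx, train_idx = perm[:n_test], perm[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


toy_X = np.arange(40, dtype=float).reshape(20, 2)
toy_y = np.array([1] * 10 + [0] * 10)  # ordered labels, as in the notebook's data matrix

X_tr, X_te, y_tr, y_te = holdout_split(toy_X, toy_y)
print(X_tr.shape)  # (15, 2)
print(X_te.shape)  # (5, 2)
```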

In [53]:
model = LogisticRegression()

In [54]:
model.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [55]:
model.score(X_test, y_test) # Performance on test data

0.73599999999999999

In [66]:
model.score(X_train, y_train) # Performance on training data.

0.99199999999999999

In [67]:
predict = model.predict(X_test)

In [68]:
from sklearn.metrics import roc_auc_score

In [69]:
roc_auc_score(y_test, predict) # ROC AUC score, computed here from the hard 0/1 predictions

0.64899814352474228
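
One caveat about this score: ROC AUC measures how well a model ranks positives above negatives, so it is usually computed from continuous scores (for example `model.predict_proba(X_test)[:, 1]`) rather than from hard 0/1 predictions. With hard predictions most pairs are ties and the score reduces to the average of the true positive rate and the true negative rate. The sketch below uses a hypothetical pairwise helper and made-up numbers to show the difference; it is an illustration of the definition, not a replacement for `roc_auc_score`.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
probs = [0.2, 0.3, 0.4, 0.45]                  # made-up scores that rank perfectly
hard = [1 if p >= 0.5 else 0 for p in probs]   # thresholding at 0.5 predicts all zeros

auc_from_probs = pairwise_auc(y_true, probs)
auc_from_hard = pairwise_auc(y_true, hard)
print(auc_from_probs)  # 1.0
print(auc_from_hard)   # 0.5
```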

The gap between the training accuracy (0.992) and the test accuracy (0.736) hints at overfitting. Let's try another, more powerful classifier.

In [70]:
from sklearn.ensemble import AdaBoostClassifier

In [71]:
model_2 = AdaBoostClassifier()
model_2.fit(X_train, y_train)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None)

In [72]:
predict = model_2.predict(X_test)
roc_auc_score(y_test, predict)

0.75409704884450413

In [74]:
model_2.score(X_test, y_test)

0.754

In [75]:
model_2.score(X_train, y_train)

0.84333333333333338

Better. The gap between training accuracy (0.843) and test accuracy (0.754) is much smaller, so the problem of
overfitting has been reduced. Also, our AUC score has increased from 0.649 to 0.754.

This concludes our analysis. Note that much better results can likely be obtained with better feature engineering. For now, this basic implementation performs fairly well.

In [79]:
import numpy
import pandas
import sklearn
import nltk
import bs4

In [82]:
print("Versions at time of writing: ")

print("bs4: {}".format(bs4.__version__))
print("nltk: {}".format(nltk.__version__))
print("numpy: {}".format(numpy.__version__))
print("pandas: {}".format(pandas.__version__))
print("sklearn: {}".format(sklearn.__version__))

Versions at time of writing: 
bs4: 4.4.1
nltk: 3.2.1
numpy: 1.11.1
pandas: 0.18.1
sklearn: 0.17.1
