# Text Mining Amazon Reviews

A major component of our research has been text mining the actual Amazon reviews. Our hope for our analysis was to find trends and relationships that could help us discern some relationship between the reviews' natural language and the likelihood of a recall. This is a complex research topic, not only due to the nature of computational linguistics but also because of the diversity in nature and severity of recalls. 

For a preliminary analysis, we can fetch some of the reviews and perform a sentiment analysis to better understand the text. For the bulk of our natural language analyses and text preprocessing, we utilized the [NLTK Python module](http://www.nltk.org/).

In [2]:
from nltk.classify import NaiveBayesClassifier
import nltk.classify.util
from nltk.corpus import subjectivity
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.util import *
import psycopg2
import pandas as pd
import numpy as np


#Connect to database using your credentials
conn = psycopg2.connect(database=<db name>, user=<user name>, password=<password>, host=<host>, port="5432")

print("Opened database successfully")

cur = conn.cursor()

Opened database successfully


In [3]:
'''
Fetch a sample of recall reviews and non recall reviews
'''
cur.execute('Select review_text from review where product_id in (select product_id from recalledproduct)\
            order by random() limit 500;')

review_texts_recall = pd.DataFrame(cur.fetchall())
review_texts_recall = review_texts_recall[0]

cur.execute('Select review_text from review where product_id not in (select product_id from recalledproduct)\
            order by random() limit 500;')

review_texts_no_recall = pd.DataFrame(cur.fetchall())
review_texts_no_recall = review_texts_no_recall[0]

In [4]:
'''
Preprocess the text.
'''

#remove special characters, numbers, etc.
import re #regular expressions module in Python

review_texts_recall = [re.sub('[^a-zA-Z\s]', ' ',review_texts_recall[i]) for i in range(len(review_texts_recall))]
review_texts_no_recall = [re.sub('[^a-zA-Z\s]', ' ',review_texts_no_recall[i]) for i in range(len(review_texts_no_recall))]

In [5]:
'''
Tokenize the text. NOTE: In NLTK you have to run nltk.download() to run the download client. From
here you will be able to select 'punkt' which gives you access to the NLTK word tokenizer method 
as seen below.
'''
from nltk import word_tokenize

tokens_recall = [word_tokenize(review) for review in review_texts_recall]
tokens_no_recall = [word_tokenize(review) for review in review_texts_no_recall]

In [6]:
'''
Custom methods to split on uppercase letters and then make all lowercase
'''

#split words on uppercase letters
def split_uppercase(tokens):
    tokens_II = np.empty((len(tokens),0)).tolist()
    for review in tokens:
        n = tokens.index(review)
        for word in review:
            split = re.sub(r'([A-Z][a-z])', r' \1', word)
            tokens_II[n].append(split)
    return tokens_II

tokens_recall = split_uppercase(tokens_recall)
tokens_no_recall = split_uppercase(tokens_no_recall)


##Make all text lower case
def make_lowercase(tokens):
    tokens_final = np.empty((len(tokens),0)).tolist()
    for review in tokens:
        n = tokens.index(review)
        for word in review:
            lowercase_word = word.lower()
            tokens_final[n].append(lowercase_word)
    return tokens_final

tokens_recall = make_lowercase(tokens_recall)
tokens_no_recall = make_lowercase(tokens_no_recall)

In [7]:
'''
Stem words and remove stopwords- nltk.download('stopwords')
'''

from nltk.corpus import stopwords
from nltk.stem.lancaster import LancasterStemmer

st = LancasterStemmer()

##Remove stopwords and stem
stopwords = stopwords.words('english')

def stem_tokens(tokens):
    stemmed_token = np.empty((len(tokens),0)).tolist()
    for review in tokens:
        n = tokens.index(review)
        for word in review:
            if word not in stopwords:
                stem = st.stem(word)
                stemmed_token[n].append(stem)
    return stemmed_token
        
tokens_recall = stem_tokens(tokens_recall)
tokens_no_recall = stem_tokens(tokens_no_recall)

Now that we have processed and tokenized the texts, we can apply our analyses to the corpora. For this exercise, we will go through a simple classification model and then fetch testing data from the DB to see how our sentiment analysis fares in predictive analysis. You can see this exercise in more depth at the [sentiment analysis](http://www.nltk.org/howto/sentiment.html) page on nltk.org. 

In [8]:
# Create a list of training docs, each with their corresponding tag- recall or no recall
recall_docs = [[recall_doc, 'recall'] for recall_doc in tokens_recall]
no_recall_docs = [[no_recall_doc, 'no recall'] for no_recall_doc in tokens_no_recall]

recall_train = recall_docs[:int(len(review_texts_recall)/2)]
recall_test = recall_docs[int(len(review_texts_recall)/2):len(review_texts_recall)]
no_recall_train = no_recall_docs[:int(len(review_texts_no_recall)/2)]
no_recall_test = no_recall_docs[int(len(review_texts_no_recall)/2):len(review_texts_recall)]
docs_train = recall_train + no_recall_train
docs_test = recall_test + no_recall_test

In [9]:
#create sentiment analyzer from nltk package
sentim_analyzer = SentimentAnalyzer()

#handle negation for situations like 'not amazing'
all_words_neg = sentim_analyzer.all_words([mark_negation(doc[0]) for doc in docs_train])
len(all_words_neg)

17146

In [10]:
#perform sentiment analysis with unigrams (single tokens) handling the negations
unigram_feats = sentim_analyzer.unigram_word_feats(all_words_neg)

In [11]:
#select features to extract. For this analysis with will use unigram features.
sentim_analyzer.add_feat_extractor(extract_unigram_feats, unigrams=unigram_feats)

In [12]:
#apply the features to training and testing set
training_set = sentim_analyzer.apply_features(docs_train)
testing_set = sentim_analyzer.apply_features(docs_test)

In [13]:
#Create the Naive Bayes classifier and perform classification on the texts
trainer = NaiveBayesClassifier.train
classifier = sentim_analyzer.train(trainer, training_set)

Training classifier


In [14]:
for key,value in sorted(sentim_analyzer.evaluate(testing_set).items()):
    print('{0}: {1}'.format(key, value))

Evaluating NaiveBayesClassifier results...
Accuracy: 0.71
F-measure [no recall]: 0.7216890595009597
F-measure [recall]: 0.6972860125260961
Precision [no recall]: 0.6937269372693727
Precision [recall]: 0.7292576419213974
Recall [no recall]: 0.752
Recall [recall]: 0.668


## Optimizing Your Queries for Your Data Analysis

Maybe we can make our query more specific to handle the time of the review versus the time of the recall, and also let's go ahead and tag based not just on whether the product was reclled but also the classification of the recall (Class I, II or III). 

While this updated method might somewhat improve results, it could still be made much more accurate. One problem worth addressing is the fact that many of the reviews associated with a recall are not negative at all and should not indicate whether it should be recalled. Therefore, it would make sense to further explore how to alter the classification task.

In [28]:
'''
This query selects reviews made within a year of the recall, and also tags them with their classification. Moreover,
it randomly selects 500 rather than just the first 500, so that it is not dependent on the order in which the items were added.
'''
cur.execute('SELECT rv.review_text, rv.product_id from review rv \
            join recalledproduct rp on rv.product_id = rp.product_id\
            join recall rc on rp.recall_id = rc.recall_id\
            join event e on rc.event_id = e.event_id where \
            @ (date_part(\'month\',TIMESTAMP \'epoch\' + rv.unix_review_time * INTERVAL \'1 second\')\
            - date_part(\'month\', e.initiation_date)) <= 12 \
            order by random() limit 500;')

specified_recall = pd.DataFrame(cur.fetchall())

cur.execute('SELECT rv.review_text from review rv \
            where rv.product_id not in (\
            select distinct product_id from recalledproduct)\
            order by random() limit 500;')

specified_no_recall = pd.DataFrame(cur.fetchall())
specified_no_recall['Classification'] = ['No Recall'] * specified_no_recall.shape[0]

print(specified_recall.shape)
print(specified_no_recall.shape)

(500, 2)
(500, 2)


In [29]:
recall_text = specified_recall.iloc[:,0]
no_recall_text = specified_no_recall.iloc[:,0]

In [30]:
'''
Preprocess the text.
'''
#remove special characters, numbers, etc.
review_texts_recall = [re.sub('[^a-zA-Z\s]', ' ',recall_text[i]) for i in range(len(recall_text))]
review_texts_no_recall = [re.sub('[^a-zA-Z\s]', ' ',no_recall_text[i]) for i in range(len(no_recall_text))]

In [31]:
'''
Tokenize the text.
'''
tokens_recall = [word_tokenize(review) for review in review_texts_recall]
tokens_no_recall = [word_tokenize(review) for review in review_texts_no_recall]

In [32]:
'''
Custom methods to split on uppercase letters and then make all lowercase
'''

#split words on uppercase letters
tokens_recall = split_uppercase(tokens_recall)
tokens_no_recall = split_uppercase(tokens_no_recall)

##Make all text lower case
tokens_recall = make_lowercase(tokens_recall)
tokens_no_recall = make_lowercase(tokens_no_recall)

In [33]:
'''
Stem words and remove stopwords- nltk.download('stopwords')
'''
##Remove stopwords and stem using the previously defined Lancaster stemmer
tokens_recall = stem_tokens(tokens_recall)
tokens_no_recall = stem_tokens(tokens_no_recall)

## Sentiment Analysis in NLTK

In [34]:
#Create a list of training docs, each with their corresponding tag- recall or no recall
#This can be updated so that the tags are specified on the classification as fetched in 
#the more advanced query above. For these purposes, we will just replicate the above experiment
#to analyze the difference in the success of the Naive Bayes model after optimizing the
#query for our analysis.
recall_docs = [[recall_doc, 'recall'] for recall_doc in tokens_recall]
no_recall_docs = [[no_recall_doc, 'no recall'] for no_recall_doc in tokens_no_recall]

half_index_r = int(len(recall_docs)/2)
whole_index_r = len(recall_docs)
half_index_nr = int(len(no_recall_docs)/2)
whole_index_nr = len(no_recall_docs)

recall_train = recall_docs[:half_index_r]
recall_test = recall_docs[half_index_r:whole_index_r]
no_recall_train = no_recall_docs[:half_index_nr]
no_recall_test = no_recall_docs[half_index_nr:whole_index_nr]
docs_train = recall_train + no_recall_train
docs_test = recall_test + no_recall_test

In [35]:
#create sentiment analyzer from nltk package
sentim_analyzer = SentimentAnalyzer()

#handle negation for situations like 'not amazing'
all_words_neg = sentim_analyzer.all_words([mark_negation(doc[0]) for doc in docs_train])
len(all_words_neg)

15761

In [36]:
#perform sentiment analysis with unigrams (single tokens) handling the negations
unigram_feats = sentim_analyzer.unigram_word_feats(all_words_neg)

In [37]:
#extract features
sentim_analyzer.add_feat_extractor(extract_unigram_feats, unigrams=unigram_feats)

In [38]:
#apply the features to training and testing set
training_set = sentim_analyzer.apply_features(docs_train)
testing_set = sentim_analyzer.apply_features(docs_test)

In [39]:
#Create the Naive Bayes classifier and perform classification on the texts
trainer = NaiveBayesClassifier.train
classifier = sentim_analyzer.train(trainer, training_set)

Training classifier


In [40]:
for key,value in sorted(sentim_analyzer.evaluate(testing_set).items()):
    print('{0}: {1}'.format(key, value))

Evaluating NaiveBayesClassifier results...
Accuracy: 0.666
F-measure [no recall]: 0.5876543209876542
F-measure [recall]: 0.719327731092437
Precision [no recall]: 0.7677419354838709
Precision [recall]: 0.6202898550724638
Recall [no recall]: 0.476
Recall [recall]: 0.856
