# Product Reviews: Analyzing unstructured text

Data scientists often face data sets that contain unstructured text, such as product reviews, and must apply **natural language processing (NLP) techniques** to make that text useful. **Sentiment analysis** refers to the use of NLP techniques to extract subjective information such as the *polarity of the text*, i.e., whether the author is speaking *positively* or *negatively* about some topic.

Companies often have useful data hidden in large volumes of text, such as:

- **online reviews**
- **social media posts** and **tweets**
- **interactions with customers**, such as emails and call center transcripts

For example, when shopping it can be challenging to decide between products with the same star rating. When this happens, shoppers often sift through the raw text of reviews to understand the strengths and weaknesses of each option.

In this short note, we will show how to use **GraphLab Create's** **`sentiment_analysis`** toolkit to apply pre-trained models to predict sentiment for text data in these situations. More specifically, we are going to automate the task of determining product strengths and weaknesses from review text by following the steps below:

1. Split the provided Amazon review text into sentences and apply a sentiment analysis model
2. Tag documents that mention aspects of interest
3. Extract adjectives from raw text and compare their use in positive and negative reviews
4. Summarize the use of adjectives for tagged documents
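To make the first step concrete, here is a minimal sketch in plain Python of splitting reviews into sentences and stacking them into one row per sentence. It is an illustration only: the regex splitter is far cruder than the spaCy-based `split_by_sentence`, and the names `split_by_sentence_naive` and `stack_sentences` are hypothetical, not part of GraphLab Create.

```python
import re

def split_by_sentence_naive(text):
    """Break text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

def stack_sentences(reviews):
    """Return one row per sentence, remembering which review it came from."""
    rows = []
    for review_id, text in enumerate(reviews):
        for sentence in split_by_sentence_naive(text):
            rows.append({'review_id': review_id, 'sentence': sentence})
    return rows

# Made-up example reviews, not taken from the Amazon data set.
example_reviews = ["The price was right. Battery life is short!", "Great range."]
example_rows = stack_sentences(example_reviews)
```

Keeping the review id next to each sentence is what later allows sentence-level tags to be joined back to review-level fields such as the rating.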


**Important Note:**

***GraphLab Create*** *includes feature engineering objects that leverage **`spaCy`**, a high performance NLP package. Here we use it for extracting parts of speech and parsing reviews into sentences.*

## Fire Up GraphLab Create

In [1]:
import graphlab as gl

## Feature engineering: Applying an NLP pipeline

In [2]:
def nlp_pipeline(reviews, title, aspects):
    
    from graphlab.toolkits.text_analytics import trim_rare_words, split_by_sentence, extract_parts_of_speech, stopwords, PartOfSpeech
    
    print(title)
    
    print('1. Get reviews for this product')
    reviews = reviews.filter_by(title, 'name')

    print('2. Splitting reviews into sentences')
    reviews['sentences'] = split_by_sentence(reviews['review'])
    sentences = reviews.stack('sentences', 'sentence').dropna()

    print('3. Tagging relevant reviews')
    tags = gl.SFrame({'tag': aspects})
    tagger_model = gl.data_matching.autotagger.create(tags, verbose=False)
    tagged = tagger_model.tag(sentences, query_name='sentence', similarity_threshold=.3, verbose=False)\
                         .join(sentences, on='sentence')

    print('4. Extracting adjectives')
    tagged['cleaned']    = trim_rare_words(tagged['sentence'], stopwords=list(stopwords()))
    tagged['adjectives'] = extract_parts_of_speech(tagged['cleaned'], [PartOfSpeech.ADJ])

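    print('5. Predicting sentence-level sentiment')
    # NOTE: features=['review'] scores the full review text, so every sentence
    # taken from the same review receives the same score. Passing
    # features=['sentence'] instead would score each sentence on its own.
    model = gl.sentiment_analysis.create(tagged, target=None, features=['review'])
    tagged['sentiment']  = model.predict(tagged)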
    return tagged
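
The `similarity_threshold=.3` argument in step 3 drops sentence/tag pairs whose match score is too weak. How the autotagger computes that score is not shown in this note, so the sketch below swaps in a simple stand-in: Jaccard similarity over character trigrams, compared against every window of words as long as the aspect. The function names are hypothetical and the scores will not reproduce the ones in the output further down.

```python
def char_ngrams(text, n=3):
    """Set of overlapping character n-grams of the lower-cased text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    """|a & b| / |a | b|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / float(len(union)) if union else 0.0

def best_score(sentence, aspect, n=3):
    """Best trigram similarity between the aspect and any word window."""
    words = sentence.lower().split()
    width = len(aspect.split())
    target = char_ngrams(aspect, n)
    windows = [' '.join(words[i:i + width])
               for i in range(len(words) - width + 1)]
    return max([jaccard(char_ngrams(w, n), target) for w in windows] or [0.0])

def tag_with_threshold(sentences, aspects, threshold=0.3):
    """Keep (sentence, aspect, score) triples whose score meets the threshold."""
    return [(s, a, best_score(s, a))
            for s in sentences for a in aspects
            if best_score(s, a) >= threshold]
```

A lower threshold admits near matches such as "prices" for the aspect "price", at the cost of more false positives.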

In [3]:
reviews = gl.SFrame('./amazon_baby.gl/')


In [4]:
reviews

name,review,rating
Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3.0
Planetwise Wipe Pouch,it came early and was not disappointed. i love ...,5.0
Annas Dream Full Quilt with 2 Shams ...,Very soft and comfortable and warmer than it ...,5.0
Stop Pacifier Sucking without tears with ...,This is a product well worth the purchase. I ...,5.0
Stop Pacifier Sucking without tears with ...,All of my kids have cried non-stop when I tried to ...,5.0
Stop Pacifier Sucking without tears with ...,"When the Binky Fairy came to our house, we didn't ...",5.0
A Tale of Baby's Days with Peter Rabbit ...,"Lovely book, it's bound tightly so you may no ...",4.0
"Baby Tracker&reg; - Daily Childcare Journal, ...",Perfect for new parents. We were able to keep ...,5.0
"Baby Tracker&reg; - Daily Childcare Journal, ...",A friend of mine pinned this product on Pinte ...,5.0
"Baby Tracker&reg; - Daily Childcare Journal, ...",This has been an easy way for my nanny to record ...,4.0


## Focusing on chosen aspects of baby monitors

First, we import the helper functions from the **`helper_util.py`** file.

In [5]:
from helper_util import *

Next, we collect the baby monitor reviews:

In [6]:
reviews = search(reviews, 'monitor')
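
The `search` helper lives in `helper_util.py`, which is not reproduced in this note. As a rough idea of what such a filter might do, here is a hypothetical plain-Python version that keeps rows whose product name contains the keyword, ignoring case. It operates on a list of dicts rather than an SFrame, and the real helper may well behave differently.

```python
def search_by_name(rows, keyword):
    """Keep rows whose 'name' contains the keyword (case-insensitive)."""
    keyword = keyword.lower()
    return [row for row in rows if keyword in row['name'].lower()]

# A few product names taken from the tables in this note.
example_products = [
    {'name': 'Graco ultraclear baby monitor'},
    {'name': 'Planetwise Wipe Pouch'},
    {'name': '900mhz Attachable Monitor'},
]
example_matches = search_by_name(example_products, 'monitor')
```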

In [7]:
reviews.print_rows(num_rows=100,max_row_width=300)

+-------------------------------+-------------------------------+--------+
|              name             |             review            | rating |
+-------------------------------+-------------------------------+--------+
| Baby Monitor - Direct Link... | Considering how horrible t... |  3.0   |
| Graco ultraclear baby monitor | Only being able to compare... |  5.0   |
| Graco ultraclear baby monitor | I am currently looking for... |  1.0   |
| Graco ultraclear baby monitor | Love this monitor!  This m... |  5.0   |
| Graco ultraclear baby monitor | I can't tell you how clear... |  5.0   |
| Graco ultraclear baby monitor | After reading the reviews ... |  5.0   |
| Graco ultraclear baby monitor | I'm a fan of Graco product... |  5.0   |
| Graco ultraclear baby monitor | After trying 2 other monit... |  5.0   |
|   900mhz Attachable Monitor   | First the pros:1.  The ran... |  2.0   |
|   900mhz Attachable Monitor   | I have had my monitor for ... |  5.0   |
|   900mhz Attachable Mon

In [8]:
for review in reviews['review'][0:10]:
    print review, '\n'

Considering how horrible the selection is these days for choosing a quality baby monitor that doesn't transmit to anyone and everyone who may be listening in, intentionally or unintentionally, this seems to be one of the best choices to make.  Your scrambled voice will sound like Donald Duck on a regular analog 46-49 mhz cordless phone or radio scanner, so it's at least a moderate roadblock for the average signal interceptor.  However, why on earth there is no 900 mhz (or higher) digital-spread-spectrum baby monitor on the market is beyond me.  So, this unit is one of the better compromises out there, and at ... on amazon.com, it is amazingly affordable. 

Only being able to compare this product with an old Fisher Price monitor I can say that this monitor is crystal clear. The old Fisher Price one gave out a constant fussy hum and picked up far to much noise around the room (eg the wind blowing past our windows). I'm confused though about the packaging between this Graco Ultra Clear Mo

## Process reviews for the most common product

First, we define the aspects of our current interest:

In [9]:
aspects = ['audio', 'price', 'signal', 'range', 'battery life']

In [7]:
item_a = 'Snuza Baby Monitor, Hero'
reviews_a = nlp_pipeline(reviews, item_a, aspects)

Snuza Baby Monitor, Hero
1. Get reviews for this product
2. Splitting reviews into sentences
3. Tagging relevant reviews
4. Extracting adjectives
5. Predicting sentence-level sentiment




In [8]:
reviews_a

sentence_id,sentence,tag,score,name
23,I know from experience that just seconds can ...,range,0.333333333333,"Snuza Baby Monitor, Hero"
44,If the battery lasted longer or was easier to ...,battery life,0.545454545455,"Snuza Baby Monitor, Hero"
45,It works great and no false alarms but it only ...,battery life,0.454545454545,"Snuza Baby Monitor, Hero"
71,The big problem for me was after the first ...,battery life,0.333333333333,"Snuza Baby Monitor, Hero"
72,I even tried to order one directly from snuza ( ...,battery life,0.454545454545,"Snuza Baby Monitor, Hero"
74,While it works perfect when powered and the ...,battery life,0.357142857143,"Snuza Baby Monitor, Hero"
122,"As a note for future purchasers, read the ...",audio,0.333333333333,"Snuza Baby Monitor, Hero"
123,The audible sound really helps when the kid is in ...,audio,0.333333333333,"Snuza Baby Monitor, Hero"
184,"That, and the price was right. ...",price,1.0,"Snuza Baby Monitor, Hero"
199,But it was well worth the price for peace of mind. ...,price,1.0,"Snuza Baby Monitor, Hero"

review,rating,cleaned,adjectives,sentiment
After having a premature baby and dealing with ...,5.0,baby,{'ADJ': {}},0.99855793166
If the battery lasted longer or was easier to ...,2.0,battery lasted replace,{'ADJ': {}},9.55858128398e-06
If the battery lasted longer or was easier to ...,2.0,works great lasted battery replacement ...,{'ADJ': {'great': 1}},9.55858128398e-06
We loved the simplicity of the snuza! We decided ...,4.0,battery days so.,{'ADJ': {}},0.99451792208
We loved the simplicity of the snuza! We decided ...,4.0,snuza days battery lasted,{'ADJ': {'snuza': 1}},0.99451792208
We loved the simplicity of the snuza! We decided ...,4.0,works angelcare angelcare battery ...,{'ADJ': {'angelcare': 1}},0.99451792208
"Whether it is just a placebo or not, putting ...",5.0,make make audible baby,{'ADJ': {'audible': 1}},0.998973574468
"Whether it is just a placebo or not, putting ...",5.0,audible,{'ADJ': {'audible': 1}},0.998973574468
"As new, first-time parents, we were 99.9% ...",5.0,price,{'ADJ': {}},0.999999969114
"Fabulous item. I'm a first time mom, my ...",5.0,worth price,{'ADJ': {'worth': 1}},0.998414415867


## Comparing to another product

In [8]:
dropdown = get_dropdown(reviews)
display(dropdown)

In [11]:
reviews_a = gl.load_sframe('./reviews_a/')
item_b = dropdown.value
print 'Comparing reviews with \'%s\':\n' % item_b
reviews_b = nlp_pipeline(reviews, item_b, aspects)

Comparing reviews with '900mhz Attachable Monitor':

900mhz Attachable Monitor
1. Get reviews for this product
2. Splitting reviews into sentences
3. Tagging relevant reviews
4. Extracting adjectives
5. Predicting sentence-level sentiment




In [14]:
counts, sentiment, adjectives = get_comparisons(reviews_a, reviews_b, item_a, item_b, aspects)

Comparing the **number of sentences** that mention **each aspect**:

In [15]:
counts

tag,"Snuza Baby Monitor, Hero",900mhz Attachable Monitor
range,9,2
battery life,10,2


Comparing the **sentence-level sentiment** for **each aspect** of **each product**:

In [16]:
sentiment

tag,"Snuza Baby Monitor, Hero",900mhz Attachable Monitor
range,0.993788285437,0.0031214413625
battery life,0.797985423815,0.49800239056


Comparing the **use of adjectives** for **each aspect**:

In [17]:
adjectives

tag,"Snuza Baby Monitor, Hero",900mhz Attachable Monitor
range,"[hard, snuza, angelcare]",[aaas]
battery life,"[great, snuza, angelcare]",[aaas]
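
The three tables above are per-aspect aggregates: a sentence count, a mean sentiment, and the most frequent adjectives. The `get_comparisons` helper itself is not shown, so the following is only a sketch of that kind of aggregation in plain Python, with a hypothetical function name and made-up rows that mimic the `tag`, `sentiment`, and `adjectives` columns.

```python
from collections import Counter, defaultdict

def summarize_by_tag(rows, top_k=3):
    """Per tag: number of sentences, mean sentiment, most common adjectives."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row['tag']].append(row)

    summary = {}
    for tag, items in grouped.items():
        adjective_counts = Counter()
        for item in items:
            # Each 'adjectives' value maps an adjective to its count.
            adjective_counts.update(item['adjectives'])
        summary[tag] = {
            'count': len(items),
            'mean_sentiment': sum(i['sentiment'] for i in items) / float(len(items)),
            'top_adjectives': [adj for adj, _ in adjective_counts.most_common(top_k)],
        }
    return summary

# Made-up rows for illustration only.
example_tagged = [
    {'tag': 'price', 'sentiment': 0.9, 'adjectives': {'worth': 1}},
    {'tag': 'price', 'sentiment': 0.7, 'adjectives': {'worth': 1, 'right': 1}},
    {'tag': 'range', 'sentiment': 0.2, 'adjectives': {}},
]
example_summary = summarize_by_tag(example_tagged)
```

Running the same function on both products and placing the results side by side would give a comparison in the spirit of the tables above.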


## Investigating good and bad sentences

In [18]:
good, bad = get_extreme_sentences(reviews_a)
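
`get_extreme_sentences` comes from `helper_util.py` and is not shown here. One plausible reading, sketched below under that assumption, is that it ranks sentences by predicted sentiment and returns the top and bottom few. The function name and the parameter `k` are hypothetical.

```python
def extreme_sentences(rows, k=2):
    """Return (good, bad): the k highest and k lowest rows by sentiment."""
    ranked = sorted(rows, key=lambda row: row['sentiment'])
    bad = ranked[:k]
    good = ranked[::-1][:k]
    return good, bad

# Made-up rows for illustration only.
example_scored = [
    {'sentence': 'The price was right.', 'sentiment': 0.9},
    {'sentence': 'The battery died fast.', 'sentiment': 0.1},
    {'sentence': 'It arrived on time.', 'sentiment': 0.5},
]
example_good, example_bad = extreme_sentences(example_scored, k=1)
```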

Print **<font color='green'>good sentences</font>** for the **first item**, where <font color='green'>adjectives</font> and <font color='green'>aspects</font> are *highlighted*.

In [20]:
print_sentences(good['highlighted'])

Print **<font color='red'>bad sentences</font>** for the **first item**, where <font color='red'>adjectives</font> and <font color='green'>aspects</font> are *highlighted*.

In [21]:
print_sentences(bad['highlighted'])
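
The highlighting used in these cells can be approximated by wrapping each matched term in a colored `<font>` tag before rendering the string as HTML. The sketch below is an assumption about how such a helper could work, not the code in `helper_util.py`, and the name `highlight_terms` is hypothetical.

```python
import re

def highlight_terms(text, terms, color):
    """Wrap whole-word, case-insensitive matches of terms in a <font> tag."""
    if not terms:
        return text
    # Try longer terms first so 'battery life' wins over 'battery'.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r'\b(?:' + '|'.join(re.escape(t) for t in ordered) + r')\b',
        re.IGNORECASE)
    return pattern.sub(
        lambda m: "<font color='%s'>%s</font>" % (color, m.group(0)), text)
```

For example, `highlight_terms("The price was right.", ['price'], 'green')` wraps only the word "price" and leaves the rest of the sentence untouched.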

## Summarizing sentiment with GraphLab Create's `product_sentiment.create()`

You can also summarize the sentiment of the product reviews we saw earlier with **GraphLab Create's `product_sentiment`** toolkit. The toolkit lets you search for aspects of interest and obtain summaries of the reviews or sentences with the most positive (or negative) predicted sentiment.

In [11]:
reviews_sentiment = gl.product_sentiment.create(reviews, 
                                                target=None, #'rating', 
                                                features=['review'], 
                                                method='auto',
                                                splitby='review')

In [81]:
reviews_sentiment.list_fields()

['features',
 'splitby',
 'feature_column',
 'sentiment_score_column',
 'reviews',
 'num_reviews',
 'review_searcher',
 'sentiment_scorer',
 'review_id_column',
 'method',
 'target']

In [59]:
reviews_sentiment.sentiment_summary(keywords=aspects, groupby='name', k=10, threshold=2)

keyword,name,mean_sentiment,sd_sentiment,review_count
battery life,"Snuza Baby Monitor, Hero",0.663579537023,0.322874099963,7
price,"Snuza Baby Monitor, Hero",0.977244026731,0.0140913309791,4
price,Fisher-Price Aquarium Monitor ...,0.305380831961,0.196483541825,4
range,900mhz Attachable Monitor,0.969815398071,0.00267362259556,2
range,Samsung Ezview Baby Monitor ...,0.626363802641,0.347454141285,4
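
The summary above reports a mean, a standard deviation, and a review count for each keyword and product. A minimal sketch of that aggregation follows. It rests on two assumptions that this note does not confirm: that `threshold` is a minimum number of matching reviews, and that the standard deviation is the population form. The function and parameter names are hypothetical.

```python
import math
from collections import defaultdict

def sentiment_summary_sketch(rows, min_reviews=2):
    """Mean and standard deviation of sentiment per (keyword, name) group."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row['keyword'], row['name'])].append(row['sentiment'])

    summary = []
    for (keyword, name), scores in sorted(groups.items()):
        n = len(scores)
        if n < min_reviews:
            continue  # assumed meaning of the threshold: too few reviews
        mean = sum(scores) / float(n)
        # Population standard deviation (an assumption, see above).
        sd = math.sqrt(sum((s - mean) ** 2 for s in scores) / float(n))
        summary.append({'keyword': keyword, 'name': name,
                        'mean_sentiment': mean, 'sd_sentiment': sd,
                        'review_count': n})
    return summary

# Made-up rows for illustration only.
example_matches_by_keyword = [
    {'keyword': 'price', 'name': 'Monitor A', 'sentiment': 0.9},
    {'keyword': 'price', 'name': 'Monitor A', 'sentiment': 0.7},
    {'keyword': 'range', 'name': 'Monitor A', 'sentiment': 0.5},
]
example_sentiment_summary = sentiment_summary_sketch(example_matches_by_keyword)
```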


In [61]:
reviews_sentiment.sentiment_scorer

Class                           : SentimentAnalysisModel

Data
----
Number of rows                  : 98

Model
-----
Score column                    : None
Features                        : ['review']
Method                          : bow-logistic
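
Reading the method name `bow-logistic` as a bag-of-words logistic regression, the scoring step can be sketched as follows: count the words in a text, take a weighted sum of those counts, and squash the result into (0, 1) with the logistic function. The weights below are invented for illustration and have nothing to do with the pre-trained model's actual parameters.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Count lower-cased word tokens in the text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def predict_sentiment(text, weights, bias=0.0):
    """Logistic score in (0, 1); values above 0.5 lean positive."""
    counts = bag_of_words(text)
    z = bias + sum(weights.get(word, 0.0) * n for word, n in counts.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights, not learned from any data.
example_weights = {'great': 2.0, 'love': 1.5, 'broke': -2.5, 'short': -1.0}
```

Words missing from the weight table contribute nothing, so a text with no known words scores exactly 0.5.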

In [62]:
reviews_sentiment.review_searcher

Class                           : SearchModel

Corpus
------
Number of documents             : 98
Average tokens/document         : {'review': 121.09183673469387, 'name': 3.989795918367347}

Indexing settings
-----------------
BM25 k1                         : 1.5
BM25 b                          : 0.75
TF-IDF threshold                : 0.01

Index
-----
Number of unique tokens indexed : 1787
Preprocessing time (s)          : 0.4926
Indexing time (s)               : 0.8869
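
The `BM25 k1` and `BM25 b` settings above are the two standard parameters of the Okapi BM25 ranking function: `k1` controls how quickly repeated occurrences of a term stop adding to the score, and `b` controls how strongly long documents are penalized. The sketch below implements the textbook formula with those defaults. The IDF variant and the whitespace tokenizer are assumptions, so the numbers will not match the `relevance_score` column produced by GraphLab Create.

```python
import math
from collections import Counter

def bm25_scores(query_terms, documents, k1=1.5, b=0.75):
    """Score each document (non-empty list assumed) against the query terms."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / float(n_docs)

    # Number of documents that contain each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            f = term_freq[term]
            if f == 0:
                continue
            # Smoothed IDF that stays positive (one common variant).
            idf = math.log(1.0 + (n_docs - doc_freq[term] + 0.5)
                           / (doc_freq[term] + 0.5))
            length_norm = 1.0 - b + b * len(tokens) / avg_len
            score += idf * f * (k1 + 1.0) / (f + k1 * length_norm)
        scores.append(score)
    return scores

# Made-up documents for illustration only.
example_docs = ["battery life is short", "great price",
                "battery battery battery life"]
example_scores = bm25_scores(['battery'], example_docs)
```

With equal document lengths, the third document outranks the first because it mentions the query term more often, while the second scores zero because it never mentions it.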

In [65]:
item_a = 'Snuza Baby Monitor, Hero'

In [78]:
most_positive_monitor_reviews = reviews_sentiment.get_most_positive(keywords=aspects, groupby='name', k=3)

most_positive_monitor_reviews = most_positive_monitor_reviews.filter_by(item_a, 'name')

print 'Top-3 positive reviews for \"%s\" per \'aspect\' of interest:\n' % item_a

for review in most_positive_monitor_reviews['review']:
    highlighted_review = highlight(review, aspects, 'green')
    display(HTML(highlighted_review))

print '\nRelevance and sentiment scores for these most positive reviews:\n'
most_positive_monitor_reviews.print_rows(num_rows=30, max_column_width=20, max_row_width=180)

Top-3 positive reviews for "Snuza Baby Monitor, Hero" per 'aspect' of interest:




Relevance and sentiment scores for these most positive reviews:

+-------------+-----------------+--------------+---------------------+-----------------+---------------------+
| __review_id | relevance_score |   keyword    |        review       | sentiment_score |         name        |
+-------------+-----------------+--------------+---------------------+-----------------+---------------------+
|      74     |  4.87557555332  |    price     | Got this for my ... |  0.992401856737 | Snuza Baby Monit... |
|      81     |  4.47594068884  |    price     | My wife and I ar... |  0.988045651138 | Snuza Baby Monit... |
|      66     |   3.3987484123  |    price     | Fabulous item. I... |  0.971710133865 | Snuza Baby Monit... |
|      78     |  7.91497773439  | battery life | I have used the ... |  0.995889919562 | Snuza Baby Monit... |
|      81     |  6.24912353626  | battery life | My wife and I ar... |  0.988045651138 | Snuza Baby Monit... |
|      71     |  4.73998147821  | battery life

In [82]:
most_negative_monitor_reviews = reviews_sentiment.get_most_negative(keywords=aspects, groupby='name', k=3)

most_negative_monitor_reviews = most_negative_monitor_reviews.filter_by(item_a, 'name')

print 'Top-3 negative reviews for \"%s\" per \'aspect\' of interest:\n' % item_a

for review in most_negative_monitor_reviews['review']:
    highlighted_review = highlight(review, aspects, 'green')
    display(HTML(highlighted_review))

print '\nRelevance and sentiment scores for these most negative reviews:\n'
most_negative_monitor_reviews.print_rows(num_rows=30, max_column_width=20, max_row_width=180)

Top-3 negative reviews for "Snuza Baby Monitor, Hero" per 'aspect' of interest:




Relevance and sentiment scores for these most negative reviews:

+-------------+-----------------+--------------+---------------------+-----------------+---------------------+
| __review_id | relevance_score |   keyword    |        review       | sentiment_score |         name        |
+-------------+-----------------+--------------+---------------------+-----------------+---------------------+
|      89     |  3.30281643601  |    price     | I was hesitant a... |  0.956818465185 | Snuza Baby Monit... |
|      66     |   3.3987484123  |    price     | Fabulous item. I... |  0.971710133865 | Snuza Baby Monit... |
|      81     |  4.47594068884  |    price     | My wife and I ar... |  0.988045651138 | Snuza Baby Monit... |
|      83     |  4.68684991156  | battery life | The idea is good... |  0.175443937193 | Snuza Baby Monit... |
|      45     |  5.66478907044  | battery life | If the battery l... |  0.211681056814 | Snuza Baby Monit... |
|      57     |  2.88786216689  | battery life