# Product Reviews: Analyzing unstructured text

Data scientists often face data sets that contain unstructured text, such as product reviews, and must employ **natural language processing (NLP) techniques** to make it useful. **Sentiment analysis** refers to the use of NLP techniques to extract subjective information such as the *polarity of the text*, e.g., whether the author is speaking positively or negatively about some topic.

Companies often have useful information hidden in large volumes of text, such as:

- online reviews
- social media posts and tweets
- interactions with customers, such as emails and call center transcripts

For example, when shopping it can be challenging to decide between products with the same star rating. When this happens, shoppers often sift through the raw text of reviews to understand the strengths and weaknesses of each option.

In this short note, we show how to use **GraphLab Create's `sentiment_analysis`** toolkit to apply pre-trained models that predict sentiment for text data in these situations. More specifically, we automate the task of determining product strengths and weaknesses from review text by following the steps below:

1. Split the provided Amazon review text into sentences and apply a sentiment analysis model
2. Tag documents that mention aspects of interest
3. Extract adjectives from raw text, and compare their use in positive and negative reviews
4. Summarize the use of adjectives for tagged documents
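
To make the first step concrete, here is a minimal plain-Python sketch of splitting a review into sentences and scoring each one. It is only an illustration: the regular expression, the tiny word lists, and the `polarity` function are assumptions made for this example, not GraphLab Create's pre-trained model (which returns a score between 0 and 1 rather than between -1 and 1).

```python
import re

# Toy polarity lexicon; purely illustrative, not a trained model.
POSITIVE = {"love", "great", "good", "clear"}
NEGATIVE = {"sucks", "horrible", "lost", "bad"}

def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def polarity(sentence):
    """Return (positives - negatives) / matched words, or 0.0 if nothing matches."""
    words = re.findall(r"[a-z']+", sentence.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    # float() keeps true division under Python 2 as well.
    return (pos - neg) / float(total) if total else 0.0

review = "I love this monitor. The battery life sucks! Range is good."
sentences = split_sentences(review)

# One (sentence, score) pair per sentence, similar in spirit to stacking
# a list column into one row per sentence.
scores = [(s, polarity(s)) for s in sentences]
for sentence, score in scores:
    print(sentence, score)
```

Scoring at the sentence level matters because a single review can praise one aspect and criticize another.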

**Important Note:**

***GraphLab Create*** *includes feature engineering objects that leverage **`spaCy`**, a high-performance NLP package. Here we use it for extracting parts of speech and parsing reviews into sentences.*

## Fire Up GraphLab Create

In [1]:
import graphlab as gl

## Feature engineering: Applying NLP Pipeline

In [2]:
from graphlab.toolkits.text_analytics import trim_rare_words, split_by_sentence, extract_parts_of_speech, stopwords, PartOfSpeech

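def nlp_pipeline(reviews, title, aspects):
    """Tag the sentences of one product's reviews with aspects, adjectives and sentiment.

    reviews: SFrame with 'name' and 'review' columns; title: product name to keep;
    aspects: list of aspect strings used as tags.
    """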
    print(title)
    
    print('1. Get reviews for this product')
    reviews = reviews.filter_by(title, 'name')

    print('2. Splitting reviews into sentences')
    reviews['sentences'] = split_by_sentence(reviews['review'])
    sentences = reviews.stack('sentences', 'sentence').dropna()

    print('3. Tagging relevant reviews')
    tags = gl.SFrame({'tag': aspects})
    tagger_model = gl.data_matching.autotagger.create(tags, verbose=False)
    tagged = tagger_model.tag(sentences, query_name='sentence', similarity_threshold=.3, verbose=False)\
                         .join(sentences, on='sentence')

    print('4. Extracting adjectives')
    tagged['cleaned']    = trim_rare_words(tagged['sentence'], stopwords=list(stopwords()))
    tagged['adjectives'] = extract_parts_of_speech(tagged['cleaned'], [PartOfSpeech.ADJ])

    print('5. Predicting sentence-level sentiment')
    model = gl.sentiment_analysis.create(tagged, features=['review'])
    tagged['sentiment']  = model.predict(tagged)
    return tagged
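
The tagging step keeps only sentence and aspect pairs whose similarity score reaches `similarity_threshold`. The sketch below illustrates that idea in plain Python with a deliberately simple stand-in score: the share of an aspect's words that appear in the sentence. This scoring rule and the names `tokens` and `tag_sentences` are assumptions for illustration; they are not the autotagger's actual similarity measure.

```python
import re

def tokens(text):
    """Lower-case a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def tag_sentences(sentences, aspects, threshold=0.3):
    """Return (sentence, aspect, score) for every pair scoring >= threshold.

    score = share of the aspect's words that occur in the sentence
    (a stand-in for the real similarity measure).
    """
    matches = []
    for sentence in sentences:
        words = tokens(sentence)
        for aspect in aspects:
            aspect_words = tokens(aspect)
            # float() keeps true division under Python 2 as well.
            score = len(aspect_words & words) / float(len(aspect_words))
            if score >= threshold:
                matches.append((sentence, aspect, score))
    return matches

aspects = ['audio', 'price', 'signal', 'range', 'battery life']
sentences = [
    "Definitely worth the price.",
    "First and foremost the battery life SUCKS.",
    "The battery died after a week.",
    "It arrived on time.",
]
tagged = tag_sentences(sentences, aspects)
for row in tagged:
    print(row)
```

As in the pipeline output, a sentence that matches several aspects yields one row per aspect, and sentences that match none are dropped.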

In [3]:
reviews = gl.SFrame('amazon_baby.gl')

2016-05-09 09:01:43,528 [INFO] graphlab.cython.cy_server, 176: GraphLab Create v1.9 started. Logging: /tmp/graphlab_server_1462773702.log



In [4]:
reviews

| name | review | rating |
| --- | --- | --- |
| Planetwise Flannel Wipes | These flannel wipes are OK, but in my opinion ... | 3.0 |
| Planetwise Wipe Pouch | it came early and was not disappointed. i love ... | 5.0 |
| Annas Dream Full Quilt with 2 Shams ... | Very soft and comfortable and warmer than it ... | 5.0 |
| Stop Pacifier Sucking without tears with ... | This is a product well worth the purchase. I ... | 5.0 |
| Stop Pacifier Sucking without tears with ... | All of my kids have cried non-stop when I tried to ... | 5.0 |
| Stop Pacifier Sucking without tears with ... | When the Binky Fairy came to our house, we didn't ... | 5.0 |
| A Tale of Baby's Days with Peter Rabbit ... | Lovely book, it's bound tightly so you may no ... | 4.0 |
| Baby Tracker® - Daily Childcare Journal, ... | Perfect for new parents. We were able to keep ... | 5.0 |
| Baby Tracker® - Daily Childcare Journal, ... | A friend of mine pinned this product on Pinte ... | 5.0 |
| Baby Tracker® - Daily Childcare Journal, ... | This has been an easy way for my nanny to record ... | 4.0 |


## Focus on chosen aspects about baby monitors

First, we import the helper functions from the **`helper_util.py`** file.

In [5]:
from helper_util import *

Next, we define the aspects we are interested in:

In [6]:
aspects = ['audio', 'price', 'signal', 'range', 'battery life']

Then we collect the most relevant reviews:

In [7]:
reviews = search(reviews, 'monitor')
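
The source of `helper_util.py` is not shown here, so the behavior of `search` can only be inferred from its output. One plausible reading is that it keeps the reviews whose product name mentions the keyword. The sketch below shows that idea on plain dictionaries; `search_rows` and its `column` parameter are hypothetical names for this example, not the helper's real signature.

```python
def search_rows(rows, keyword, column='name'):
    """Keep rows whose `column` contains `keyword`, ignoring case."""
    keyword = keyword.lower()
    return [row for row in rows if keyword in row[column].lower()]

# A few product names taken from the tables in this note.
rows = [
    {'name': 'Planetwise Flannel Wipes', 'rating': 3.0},
    {'name': 'Graco ultraclear baby monitor', 'rating': 5.0},
    {'name': '900mhz Attachable Monitor', 'rating': 2.0},
]
monitors = search_rows(rows, 'monitor')
for row in monitors:
    print(row['name'])
```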

In [8]:
reviews

| name | review | rating |
| --- | --- | --- |
| Baby Monitor - Direct Link Privacy Monitor ... | Considering how horrible the selection is these ... | 3.0 |
| Graco ultraclear baby monitor ... | Only being able to compare this product ... | 5.0 |
| Graco ultraclear baby monitor ... | I am currently looking for a monitor for our ... | 1.0 |
| Graco ultraclear baby monitor ... | Love this monitor! This monitor is so clear... I ... | 5.0 |
| Graco ultraclear baby monitor ... | I can't tell you how clear this monitor is ... | 5.0 |
| Graco ultraclear baby monitor ... | After reading the reviews I was very worried about ... | 5.0 |
| Graco ultraclear baby monitor ... | I'm a fan of Graco products in general, and ... | 5.0 |
| Graco ultraclear baby monitor ... | After trying 2 other monitors, this one is ... | 5.0 |
| 900mhz Attachable Monitor | First the pros:1. The range on this monitor is ... | 2.0 |
| 900mhz Attachable Monitor | I have had my monitor for 8 months now and I love ... | 5.0 |


## Process reviews for the most common product

In [9]:
item_a = 'Infant Optics DXR-5 2.4 GHz Digital Video Baby Monitor with Night Vision'
reviews_a = nlp_pipeline(reviews, item_a, aspects)

Infant Optics DXR-5 2.4 GHz Digital Video Baby Monitor with Night Vision
1. Get reviews for this product
2. Splitting reviews into sentences
3. Tagging relevant reviews
4. Extracting adjectives
5. Predicting sentence-level sentiment


In [10]:
reviews_a

| sentence_id | sentence | tag | score | review | rating |
| --- | --- | --- | --- | --- | --- |
| 2 | It killed our wifi signal then lost it's pairing ... | signal | 1.0 | It killed our wifi signal then lost it's pairing ... | 1.0 |
| 4 | Both audio and video monitor supersedes the ... | audio | 0.5 | I love this video monitor. Both audio and ... | 5.0 |
| 4 | Both audio and video monitor supersedes the ... | price | 0.5 | I love this video monitor. Both audio and ... | 5.0 |
| 5 | The VOX poewr saving helps prevent quick ... | battery life | 0.454545454545 | I love this video monitor. Both audio and ... | 5.0 |
| 11 | Definitely worth the price. ... | price | 0.666666666667 | This is such a great camera. It doesn't pivot ... | 5.0 |
| 17 | I purchased it based on reviews and the appea ... | price | 0.666666666667 | I had bought this monitor back in july when my ... | 1.0 |
| 18 | First and foremost the battery life SUCKS. ... | battery life | 1.0 | I had bought this monitor back in july when my ... | 1.0 |
| 35 | It's good and inexpensive enough that we've dec ... | range | 0.333333333333 | I tried this after having the Lenox monitor. The ... | 4.0 |
| 36 | It's been hard to find a video system that does ... | price | 0.666666666667 | I tried this after having the Lenox monitor. The ... | 4.0 |
| 40 | This is a great camera monitor for the price. ... | price | 0.666666666667 | This is a great camera monitor for the price. ... | 4.0 |

| name | cleaned | adjectives | sentiment |
| --- | --- | --- | --- |
| Infant Optics DXR-5 2.4 GHz Digital Video Baby ... | wifi signal lost it's camera. ... | [] | 0.726841681132 |
| Infant Optics DXR-5 2.4 GHz Digital Video Baby ... | audio video monitor supersedes quality ... | [audio, previous] | 0.95069270188 |
| Infant Optics DXR-5 2.4 GHz Digital Video Baby ... | audio video monitor supersedes quality ... | [audio, previous] | 0.954092282107 |
| Infant Optics DXR-5 2.4 GHz Digital Video Baby ... | vox saving quick battery (which doesn't wanted ... | [vox, quick, which] | 0.830247002167 |
| Infant Optics DXR-5 2.4 GHz Digital Video Baby ... | worth price. | [worth] | 0.999794625299 |
| Infant Optics DXR-5 2.4 GHz Digital Video Baby ... | purchased based reviews price. ... | [] | 0.177435087275 |
| Infant Optics DXR-5 2.4 GHz Digital Video Baby ... | battery life | [] | 0.0887905806345 |
| Infant Optics DXR-5 2.4 GHz Digital Video Baby ... | it's good inexpensive we've decided angelcare ... | [good, inexpensive, sound, total] ... | 0.999940613283 |
| Infant Optics DXR-5 2.4 GHz Digital Video Baby ... | it's hard find video system price, work us. ... | [hard, find] | 0.999923476009 |
| Infant Optics DXR-5 2.4 GHz Digital Video Baby ... | great camera monitor price. ... | [great] | 0.999951337173 |


## Comparing to another product

In [11]:
dropdown = get_dropdown(reviews)
display(dropdown)

In [12]:
item_b = dropdown.value
reviews_b = nlp_pipeline(reviews, item_b, aspects)
counts, sentiment, adjectives = get_comparisons(reviews_a, reviews_b, item_a, item_b, aspects)

VTech Communications Safe & Sound Digital Audio Monitor
1. Get reviews for this product
2. Splitting reviews into sentences
3. Tagging relevant reviews
4. Extracting adjectives
5. Predicting sentence-level sentiment


Comparing the number of sentences that mention each aspect:

In [13]:
counts

| tag | Infant Optics DXR-5 2.4 GHz Digital Video Baby ... | VTech Communications Safe & Sound Digital A ... |
| --- | --- | --- |
| signal | 107 | 15 |
| battery life | 180 | 93 |
| range | 144 | 68 |
| audio | 105 | 27 |
| price | 251 | 69 |
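
A table like this can be produced by counting the tagged sentences per aspect for each product. Since `get_comparisons` lives in `helper_util.py` and is not shown, the following is only a sketch of the idea on plain dictionaries; the function names and the sample rows are made up for illustration.

```python
from collections import Counter

def count_by_tag(tagged_rows):
    """Count how many tagged sentences carry each aspect tag."""
    return Counter(row['tag'] for row in tagged_rows)

def compare_counts(rows_a, rows_b, aspects):
    """Return one (aspect, count_a, count_b) tuple per aspect."""
    counts_a = count_by_tag(rows_a)
    counts_b = count_by_tag(rows_b)
    # A Counter returns 0 for aspects that never occur.
    return [(aspect, counts_a[aspect], counts_b[aspect]) for aspect in aspects]

# Made-up tagged sentences, for illustration only.
rows_a = [{'tag': 'price'}, {'tag': 'price'}, {'tag': 'signal'}]
rows_b = [{'tag': 'price'}, {'tag': 'audio'}]

comparison = compare_counts(rows_a, rows_b, ['audio', 'price', 'signal'])
for aspect, n_a, n_b in comparison:
    print(aspect, n_a, n_b)
```

Raw counts depend on how many reviews each product has, so they are best read alongside the per-aspect sentiment rather than on their own.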


Comparing the sentence-level sentiment for each aspect of each product:

In [14]:
sentiment

| tag | Infant Optics DXR-5 2.4 GHz Digital Video Baby ... | VTech Communications Safe & Sound Digital A ... |
| --- | --- | --- |
| signal | 0.674221231596 | 0.695281129091 |
| battery life | 0.761379749571 | 0.831840808502 |
| range | 0.862246778561 | 0.840649624394 |
| audio | 0.826972950724 | 0.92381971402 |
| price | 0.885832746604 | 0.914282565406 |
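
These values look like per-aspect averages of the sentence-level sentiment scores, although the aggregation inside `get_comparisons` is not shown, so that reading is an assumption. Under that assumption, the computation can be sketched in plain Python as follows; the function name and the sample rows are made up for illustration.

```python
from collections import defaultdict

def mean_sentiment_by_tag(tagged_rows):
    """Average the 'sentiment' value of the rows that share each 'tag'."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in tagged_rows:
        totals[row['tag']] += row['sentiment']
        counts[row['tag']] += 1
    return {tag: totals[tag] / counts[tag] for tag in totals}

# Made-up sentence-level scores, for illustration only.
rows = [
    {'tag': 'price', 'sentiment': 1.0},
    {'tag': 'price', 'sentiment': 0.5},
    {'tag': 'battery life', 'sentiment': 0.25},
]
means = mean_sentiment_by_tag(rows)
for tag, value in sorted(means.items()):
    print(tag, value)
```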


Comparing the use of adjectives for each aspect: