# Analyzing unstructured text in product review data

It's common for companies to have useful data hidden in large volumes of text:

- online reviews
- social media posts and tweets
- interactions with customers, such as emails and call center transcripts

For example, when shopping it can be challenging to decide between products with the same star rating. When this happens, shoppers often sift through the raw text of reviews to understand the strengths and weaknesses of each option.

<img src="ItemC.png">
<img src="ItemD.png">

In this notebook we seek to automate the task of determining product strengths and weaknesses from review text.

1. splitting Amazon review text into sentences and applying a sentiment analysis model
2. tagging documents that mention aspects of interest
3. extract adjectives from raw text, and compare their use in positive and negative reviews
4. summarizing the use of adjectives for tagged documents

GraphLab Create includes feature engineering objects that leverage spaCy, a high performance NLP package. Here we use it for extracting parts of speech and parsing reviews into sentences.

In [3]:
import graphlab as gl

In [4]:
from graphlab.toolkits.text_analytics import trim_rare_words, split_by_sentence, extract_part_of_speech, stopwords, PartOfSpeech

def nlp_pipeline(reviews, title, aspects):

    print(title)
    
    print('1. Get reviews for this product')
    reviews = reviews.filter_by(title, 'name')

    print('2. Splitting reviews into sentences')
    reviews['sentences'] = split_by_sentence(reviews['review'])
    sentences = reviews.stack('sentences', 'sentence').dropna()

    print('3. Tagging relevant reviews')
    tags = gl.SFrame({'tag': aspects})
    tagger_model = gl.data_matching.autotagger.create(tags, verbose=False)
    tagged = tagger_model.tag(sentences, query_name='sentence', similarity_threshold=.3, verbose=False)\
                         .join(sentences, on='sentence')

    print('4. Extracting adjectives')
    tagged['cleaned']    = trim_rare_words(tagged['sentence'], stopwords=list(stopwords()))
    tagged['adjectives'] = extract_part_of_speech(tagged['cleaned'], [PartOfSpeech.ADJ])

    print('5. Predicting sentence-level sentiment')
    model = gl.sentiment_analysis.create(tagged, features=['review'])
    tagged['sentiment']  = model.predict(tagged)
    return tagged

In [5]:
reviews = gl.SFrame('amazon_baby.gl')

2016-04-13 16:48:46,413 [INFO] graphlab.cython.cy_server, 176: GraphLab Create v1.9 started. Logging: /tmp/graphlab_server_1460591325.log


This commercial license of GraphLab Create is assigned to engr@dato.com.


In [6]:
from helper_util import *

## Focus on chosen aspects about baby monitors

In [7]:
aspects = ['audio', 'price', 'signal', 'range', 'battery life']

In [None]:
reviews = search(reviews, 'monitor')

In [8]:
reviews

review,rating,name
This baby monitor has been working for us for ...,4.0,Baby Monitor - Direct Link Privacy Monitor ...
My sister recommended we get this monitor because ...,5.0,Graco ultraclear baby monitor ...
These monitors are absolutly wonderful. ...,5.0,Graco ultraclear baby monitor ...
I have been using this monitor for 10 months ...,5.0,Graco ultraclear baby monitor ...
After trying and returning 3 different ...,5.0,Graco ultraclear baby monitor ...
This monitor has been great! It's always very ...,5.0,Graco ultraclear baby monitor ...
I have used this monitor for three years with my ...,3.0,Graco ultraclear baby monitor ...
"Amazing monitor, I can hear every little sound ...",5.0,Graco ultraclear baby monitor ...
We were giving this monitor as a gift since ...,2.0,Graco ultraclear baby monitor ...
1 is too many stars for the product! ...,1.0,Fisher-price Super- sensitive Nursey Monitor ...


## Process reviews for the most common product

In [9]:
item_a = 'Infant Optics DXR-5 2.4 GHz Digital Video Baby Monitor with Night Vision'
reviews_a = nlp_pipeline(reviews, item_a, aspects)

Infant Optics DXR-5 2.4 GHz Digital Video Baby Monitor with Night Vision
1. Get reviews for this product
2. Splitting reviews into sentences
3. Tagging relevant reviews
4. Extracting adjectives
5. Predicting sentence-level sentiment


In [11]:
reviews_a

sentence_id,sentence,tag,score,review,rating
2,It killed our wifi signal then lost it's pairing ...,signal,1.0,It killed our wifi signal then lost it's pairing ...,1.0
4,Both audio and video monitor supersedes the ...,audio,0.5,I love this video monitor. Both audio and ...,5.0
4,Both audio and video monitor supersedes the ...,price,0.5,I love this video monitor. Both audio and ...,5.0
5,The VOX poewr saving helps prevent quick ...,battery life,0.454545454545,I love this video monitor. Both audio and ...,5.0
11,Definitely worth the price. ...,price,0.666666666667,This is such a great camera. It doesn't pivot ...,5.0
17,I purchased it based on reviews and the appea ...,price,0.666666666667,I had bought this monitor back in july when my ...,1.0
18,First and foremost the battery life SUCKS. ...,battery life,1.0,I had bought this monitor back in july when my ...,1.0
35,It's good and inexpensive enough that we've dec ...,range,0.333333333333,I tried this after having the Lenox monitor. The ...,4.0
36,It's been hard to find a video system that does ...,price,0.666666666667,I tried this after having the Lenox monitor. The ...,4.0
40,This is a great camera monitor for the price. ...,price,0.666666666667,This is a great camera monitor for the price. ...,4.0

name,cleaned,adjectives,sentiment
Infant Optics DXR-5 2.4 GHz Digital Video Baby ...,wifi signal lost it's camera. ...,[],0.726841681132
Infant Optics DXR-5 2.4 GHz Digital Video Baby ...,audio video monitor supersedes quality ...,"[audio, previous]",0.95069270188
Infant Optics DXR-5 2.4 GHz Digital Video Baby ...,audio video monitor supersedes quality ...,"[audio, previous]",0.954092282107
Infant Optics DXR-5 2.4 GHz Digital Video Baby ...,vox saving quick battery (which doesn't wanted ...,"[vox, quick, which]",0.830247002167
Infant Optics DXR-5 2.4 GHz Digital Video Baby ...,worth price.,[worth],0.999794625299
Infant Optics DXR-5 2.4 GHz Digital Video Baby ...,purchased based reviews price. ...,[],0.177435087275
Infant Optics DXR-5 2.4 GHz Digital Video Baby ...,battery life,[],0.0887905806345
Infant Optics DXR-5 2.4 GHz Digital Video Baby ...,it's good inexpensive we've decided angelcare ...,"[good, inexpensive, sound, total] ...",0.999940613283
Infant Optics DXR-5 2.4 GHz Digital Video Baby ...,"it's hard find video system price, work us. ...","[hard, find]",0.999923476009
Infant Optics DXR-5 2.4 GHz Digital Video Baby ...,great camera monitor price. ...,[great],0.999951337173


## Comparing to another product

In [12]:
dropdown = get_dropdown(reviews)
display(dropdown)

In [14]:
item_b = dropdown.value
reviews_b = nlp_pipeline(reviews, item_b, aspects)
counts, sentiment, adjectives = get_comparisons(reviews_a, reviews_b, item_a, item_b, aspects)

VTech Communications Safe &amp; Sound Digital Audio Monitor
1. Get reviews for this product
2. Splitting reviews into sentences
3. Tagging relevant reviews
4. Extracting adjectives
5. Predicting sentence-level sentiment


Comparing the number of sentences that mention each aspect

In [15]:
counts

tag,Infant Optics DXR-5 2.4 GHz Digital Video Baby ...,VTech Communications Safe &amp; Sound Digital A ...
signal,107,15
battery life,180,93
range,144,68
audio,105,27
price,251,69


Comparing the sentence-level sentiment for each aspect of each product

In [16]:
sentiment

tag,Infant Optics DXR-5 2.4 GHz Digital Video Baby ...,VTech Communications Safe &amp; Sound Digital A ...
signal,0.674221231596,0.695281129091
battery life,0.761379749571,0.831840808502
range,0.862246778561,0.840649624394
audio,0.826972950724,0.92381971402
price,0.885832746604,0.914282565406


Comparing the use of adjectives for each aspect

In [17]:
adjectives

tag,Infant Optics DXR-5 2.4 GHz Digital Video Baby ...,VTech Communications Safe &amp; Sound Digital A ...
signal,"[obnoxious, lost, black, pros:-the, camera.-the, ...","[low, good, static, loud]"
battery life,"[dead, short, good, great, plastic, flimsy, ...","[small, nightlight, &#34;optimum, great, ..."
range,"[due, poor, good, entire, good, dark, annoying, ...","[good, good, good, longer, good, reasona ..."
audio,"[good, entire, 2nd, upgrade, slight, over ...","[close, audible, vtech, great, good, reliable, ..."
price,"[motorola, half, great, short, great, clear, ...","[good, reasonable, reasonable, decent, g ..."


## Investigating good and bad sentences

In [19]:
good, bad = get_extreme_sentences(reviews_a)

Print good sentences for the first item, where adjectives and aspects are highlighted.

In [20]:
print_sentences(good['highlighted'])

Print bad sentences for the first item, where adjectives and aspects are highlighted.

In [21]:
print_sentences(bad['highlighted'])

In [None]:
!cat helper_util.py