# Key positive and negative phrase extraction

This code extracts key positive and negative phrases from reviews. 

The main steps in the flow are:

1) Define spaCy Matcher patterns that describe the phrase structures to extract

2) Choose sentiment scorer model:
        - 'keras': Pre-trained Keras LSTM model with GloVe embeddings
        - Lexicon-based models such as 'affin', 'vader' or 'textblob'

3) Choose similarity model:
        - 'hamming': Hamming distance 
        - 'jaccard': Jaccard scorer
        - 'jaro_winkler': Jaro-Winkler scorer
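
To make the similarity options above more concrete, here is a minimal, library-free sketch of two of them. It is an illustration only: the function names are hypothetical, and it assumes Jaccard is computed over lowercased word sets and Hamming over padded character strings, which may differ from what `phraseExtractor` does internally.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity over lowercased word sets (an assumed tokenization)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty phrases are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def hamming_similarity(a, b):
    """1 - normalized Hamming distance over characters.

    Hamming distance is only defined for equal-length strings, so the
    shorter string is padded with spaces here (an assumption of this sketch).
    """
    length = max(len(a), len(b))
    if length == 0:
        return 1.0
    a, b = a.ljust(length), b.ljust(length)
    mismatches = sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))
    return 1 - mismatches / length


# 3 shared words out of 5 distinct words
print(jaccard_similarity("easy to set up", "Simple to set up"))  # 0.6
print(hamming_similarity("easy to use", "easy to usa"))
```

With a threshold such as 0.85, "easy to set up" and "Simple to set up" would not count as similar under this word-level Jaccard, while phrases that differ only in capitalization would.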

In [10]:
import spacy
from spacy.matcher import Matcher
import pandas as pd
from phraseExtractor import phraseExtractor




#### The clean review file has a single row of pre-cleaned text: all reviews have been concatenated into one large corpus.
Assign the combined reviews to the `reviews` string.

In [None]:
data_df = pd.read_csv('../data/clean_review.csv')
reviews = data_df.review[0]

#### Create a spaCy doc based on the reviews so we can use the spaCy Matcher.

In [11]:

nlp = spacy.load('en_core_web_sm')
#create spacy doc of reviews
reviews_doc = nlp(reviews)
#initialize matcher
matcher = Matcher(nlp.vocab)


#### This is where we define the list of patterns to find in the reviews. Each pattern is based on either spaCy part-of-speech (POS) tags or dependency labels.

In [12]:
### INITIALIZE ALL THE PATTERNS ###
allPatterns = {}

#create list of patterns to search using spacy Matcher
pattern1 = [{'POS': 'VERB', 'OP': '+'}, {'POS': 'PART', 'OP': '+'},
            {'POS': 'VERB', 'OP': '+'}, {'POS': 'NOUN', 'OP': '+'}]
allPatterns["VPaVN"] = pattern1


#pattern2 = [{'POS': 'PRON','OP':'+'}, {'POS' : 'VERB','OP':'+'},{'POS': 'PROPN','OP':'+'}]
#matcher.add('Keywords2', None, pattern2)

pattern2 = [{'POS': 'ADJ', 'OP': '+'}, {'POS': 'ADJ', 'OP': '+'},
            {'POS': 'NOUN', 'OP': '+'}]
allPatterns["AdjAdjN"] = pattern2
 
#'OPJ' was a typo for the 'OP' quantifier key
pattern3 = [{'POS': 'ADJ', 'OP': '+'}, {'POS': 'PART', 'OP': '+'},
            {'POS': 'VERB', 'OP': '+'}, {'POS': 'PART', 'OP': '*'}]
allPatterns["AdjPaVPa"] = pattern3
 
 
#'OPJ' was a typo for the 'OP' quantifier key
pattern4 = [{'POS': 'ADJ', 'OP': '+'}, {'POS': 'NOUN', 'OP': '+'},
            {'POS': 'ADP', 'OP': '+'}, {'POS': 'VERB', 'OP': '+'}, {'POS': 'NOUN', 'OP': '+'}]
allPatterns["AdjNAdpVN"] = pattern4
 
#pattern6 = [{'DEP':'nsubj','OPJ':'+'},{'DEP':'ROOT','OP':'+'},{'DEP':'dobj','OP':'+'}]
#matcher.add('Keywords6', None, pattern6)

pattern5 = [{'POS': 'ADV', 'OP': '+'}, {'POS': 'VERB', 'OP': '+'},
            {'POS': 'NOUN', 'OP': '+'}]
            #,{'IS_ASCII':True,'OP':'+'},{'IS_ASCII':True,'OP':'+'}]
allPatterns["AdvVN"] = pattern5

pattern6 = [{'DEP': 'amod', 'OP': '+'}, {'DEP': 'compound', 'OP': '+'},
            {'POS': 'NOUN', 'OP': '+'}]
allPatterns["AmCN"] = pattern6

pattern7 = [{'DEP': 'aux', 'OP': '+'}, {'DEP': 'neg', 'OP': '+'},
            {'DEP': 'ROOT', 'OP': '+'}, {'DEP': 'dobj', 'OP': '+'}]
allPatterns["AuxNegRoot"] = pattern7
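
To illustrate what the quantified POS patterns above select, here is a small library-free sketch that matches a pattern against hand-tagged tokens. The helper names and the sample tags are assumptions made for illustration. Unlike the spaCy Matcher, which can also return shorter and overlapping matches, this sketch returns only the longest non-overlapping match at each position, and it handles only the `POS` key with the `+`, `*` and `?` operators.

```python
import re


def compile_pos_pattern(pattern):
    """Translate [{'POS': tag, 'OP': op}, ...] into a regex over '<TAG>' strings."""
    parts = []
    for spec in pattern:
        op = spec.get("OP", "")  # no OP means exactly one token
        if op not in ("", "+", "*", "?"):
            raise ValueError("unsupported operator: %r" % op)
        parts.append("(?:<%s>)%s" % (re.escape(spec["POS"]), op))
    return re.compile("".join(parts))


def find_phrases(tagged_tokens, pattern):
    """Return the phrases in (word, POS) pairs whose tag sequence fits the pattern."""
    regex = compile_pos_pattern(pattern)
    tag_string = ""
    offset_to_index = {}  # character offset in tag_string -> token index
    for i, (_, pos) in enumerate(tagged_tokens):
        offset_to_index[len(tag_string)] = i
        tag_string += "<%s>" % pos
    offset_to_index[len(tag_string)] = len(tagged_tokens)

    phrases = []
    for match in regex.finditer(tag_string):
        if match.start() == match.end():
            continue  # skip empty matches from all-optional patterns
        first = offset_to_index[match.start()]
        last = offset_to_index[match.end()]
        phrases.append(" ".join(word for word, _ in tagged_tokens[first:last]))
    return phrases


# Hand-assigned tags for illustration; a real tagger may label tokens differently.
tagged = [("very", "ADV"), ("easy", "ADJ"), ("to", "PART"), ("set", "VERB"),
          ("up", "PART"), ("and", "CCONJ"), ("nice", "ADJ"), ("to", "PART"),
          ("have", "VERB")]
adj_pa_v_pa = [{'POS': 'ADJ', 'OP': '+'}, {'POS': 'PART', 'OP': '+'},
               {'POS': 'VERB', 'OP': '+'}, {'POS': 'PART', 'OP': '*'}]

print(find_phrases(tagged, adj_pa_v_pa))  # ['easy to set up', 'nice to have']
```

The trailing `{'POS': 'PART', 'OP': '*'}` is what lets the same pattern cover both "easy to set up" and "nice to have".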

In [13]:
########################
#Define model and params
#########################

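#Initialize phrase extractor
#(this rebinds the name phraseExtractor from the class to an instance,
# so re-run the import cell before running this cell a second time)
phraseExtractor = phraseExtractor()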

#Define printing verbosity
phraseExtractor.verbose = False

#Choose how many phrases to print per class
phraseExtractor.phrase_print_threshold = 20

#Choose similarity scorer  and threshold
phraseExtractor.sim_scorer = "jaccard"
phraseExtractor.sim_threshold = 0.85

#Choose sentiment scorer and threshold
phraseExtractor.sent_scorer = "keras"
phraseExtractor.sent_model_path = "kerasLSTM/kerasModelLSTM"
phraseExtractor.sent_threshold = 0

#Assign the spaCy doc and pipeline to the extractor
phraseExtractor.doc = reviews_doc
phraseExtractor.nlp = nlp

#Attach current matcher to model
phraseExtractor.matcher = matcher
phraseExtractor.allPatterns = allPatterns



In [14]:
#get most used positive and negative phrases 
phraseExtractor.summarize_reviews() 

Loading model..

Extacting matching phrases in doc..

Finding similar phrases..

Processing phrase scores and aggregating..

Positive phrases

Printing most used keywords..

easy to use : 140 AdjPaVPa
easy to set up : 110 AdjPaVPa
many other things : 42 AdjAdjN
normal prime account : 42 AmCN
able to set : 40 AdjPaVPa
great sound quality : 38 AdjAdjN
other smart devices : 36 AdjAdjN
nice to have : 35 AdjPaVPa
my android playing music : 35 AdjAdjN
smart compatible power switches : 34 AdjAdjN
your favorite music kids : 34 AdjAdjN
old handheld transistor radio : 32 AdjAdjN
my favorite radio stations : 30 AdjAdjN
Simple to set up : 30 AdjPaVPa
able to play : 30 AdjPaVPa
able to say : 30 AdjPaVPa
love to play games : 28 VPaVN
big Bluetooth speaker : 28 AdjAdjN
my bluetooth speakers : 26 AdjAdjN
n't know what : 25 AdvVN
full rich sound : 24 AdjAdjN

Negative phrases

Printing most used keywords..

unreliable low quality product : 22 AdjAdjN
difficult to set up : 20 AdjPaVPa
loud enough play m

#### The output looks like this:

----------------------------
Positive phrases

Printing most used keywords..

easy to use : 140 AdjPaVPa

----------------------------

Each line shows a phrase, its aggregated score, and the name of the matcher pattern that detected it. Phrases are ranked from most to least used.
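
The ranking step can be sketched roughly as follows. This is not the `phraseExtractor` implementation: the function names are hypothetical, and the sketch assumes the aggregated score is a plain occurrence count with near-duplicate phrases merged by word-level Jaccard similarity, which the notebook does not confirm.

```python
from collections import Counter


def jaccard(a, b):
    """Word-level Jaccard similarity on lowercased phrases."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def aggregate_phrases(matches, sim_threshold=0.85):
    """Merge similar phrases and rank them by total count.

    matches: list of (phrase, pattern_name) pairs, one per matcher hit.
    Returns a list of (phrase, count, pattern_name), highest count first.
    """
    merged = []  # each entry is [representative phrase, count, pattern name]
    for (phrase, pattern), count in Counter(matches).most_common():
        for entry in merged:
            if jaccard(entry[0], phrase) >= sim_threshold:
                entry[1] += count  # fold into the more frequent phrase
                break
        else:
            merged.append([phrase, count, pattern])
    merged.sort(key=lambda entry: entry[1], reverse=True)
    return [tuple(entry) for entry in merged]


# Made-up matcher hits, for illustration only.
matches = ([("easy to use", "AdjPaVPa")] * 3
           + [("Easy to use", "AdjPaVPa")] * 2
           + [("great sound quality", "AdjAdjN")] * 2)

for phrase, count, pattern in aggregate_phrases(matches):
    print(f"{phrase} : {count} {pattern}")
```

Processing phrases from most to least frequent means the most common spelling becomes the representative for each merged group.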

#### This is just an initial attempt, and the patterns can definitely be improved to extract better phrases.


