# Imports and Setup

In [None]:
!python -m spacy download en_core_web_lg

Collecting en_core_web_lg==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.2.5/en_core_web_lg-2.2.5.tar.gz (827.9MB)
Building wheels for collected packages: en-core-web-lg
  Building wheel for en-core-web-lg (setup.py) ... done
  Created wheel for en-core-web-lg: filename=en_core_web_lg-2.2.5-cp36-none-any.whl size=829180944 sha256=210f51a53277dd059a0d9cf3e4e5ac22e01f5aac4678bf9854b6d057432b8c88
  Stored in directory: /tmp/pip-ephem-wheel-cache-je13dj_1/wheels/2a/c1/a6/fc7a877b1efca9bc6a089d6f506f16d3868408f9ff89f8dbfc
Successfully built en-core-web-lg
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-2.2.5
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')


If you receive an error like `[E050] Can't find model 'en_core_web_lg'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.`, go to `Runtime > Restart and Run All`. The runtime must be restarted before it can see the newly downloaded `en_core_web_lg` model package.

In [None]:
# generic imports
import re
import numpy as np
import pandas as pd
pd.set_option('display.max_colwidth', None)  # None = do not truncate column contents (-1 is deprecated)
pd.set_option('display.max_rows', 1000)

# spacy
import spacy
from spacy.lang.en import English # updated
from spacy import displacy
nlp = spacy.load('en_core_web_lg')

# nltk
import nltk
from nltk.stem import WordNetLemmatizer 
nltk.download('wordnet')

# tf-idf
from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer



[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# **Please set the path below to the directory that contains this notebook**


In [None]:
from google.colab import drive
drive.mount('/gdrive', force_remount=True)

# # sihan
# %cd "../gdrive/My Drive/CZ4045 " 

# chuanxin
%cd "../gdrive/My Drive/Group Work/CZ4045" 

Mounted at /gdrive
/gdrive/My Drive/Group Work/CZ4045


# Helper Functions

Helper function that visualizes the dependency tree of a given input string.

In [None]:
def display_dependency_tree(input_string):
  doc = nlp(input_string)
  # Print the dependency label of the third token (assumes the input has at least three tokens)
  print(doc[2].dep_)
  # Render inline, since this is an interactive Jupyter environment
  displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})

In [None]:
display_dependency_tree("Poor quality microphone")

ROOT


In [None]:
display_dependency_tree("Very comfortable shirt")

ROOT


# Preprocessing of Data

### Load in data
The data consists of reviews for Apple AirPods. We obtained these reviews by manually going through the customer reviews on an Amazon listing for the product. We also manually annotated every review with its important noun-adjective pairs and its overall sentiment.

https://www.amazon.com/Apple-AirPods-Charging-Latest-Model/dp/B07PXGQC1Q/ref=sr_1_4?dchild=1&keywords=airpods&qid=1601361680&sr=8-4#customerReviews

In [None]:
df = pd.read_csv('Q2_folder/data_airpods_reviews.csv')
df.head(3)

Unnamed: 0,Link,Stars,Reviews,Nouns,Adjectives,Sentiment,Remarks
0,https://www.amazon.com/gp/customer-reviews/R31ZK5M0UWDZ4R?ASIN=B07PXGQC1Q,1,"Poor quality microphone. Not suitable for a remote worker taking calls. If your job requires dictation or a high quality mic, go elsewhere","microphone, mic","poor quality, high quality",negative,
1,https://www.amazon.com/gp/customer-reviews/R13AIAX16Y8XXM/ref=cm_cr_getr_d_rvw_ttl?ie=UTF8&ASIN=B07PXGQC1Q,5,"These are the best pair of wireless earphones I ever came across, apple has definitely done it again. Being an apple fan, I love these! They are so easy to connect! Once I connected them to my iPhone, they were seamlessly connected to my other apple devices. All I have to do is select them in the list from the device you need them to connect and you are good to go. Another good thing is turning them on/off is easy, just remove them from the charging case and they turn on and put them back to turn them off, its simple. And like most of the earphones out there, they connect to the last device you used them with, but its easy to switch device. You have the flexibility to just use one piece, if you like and sometimes it does come handy. Another function I like is the automatic ear detection, just remove a piece from you ear and playback will pause and resume once you put it back, this does help when someone drops by for a conversation and you don't need to actually pickup your phone to pause the playback. The battery life is definitely impressive, I need to charge the case every 2 days. And it's easy to carry everywhere, it being so small. Now the sound quality is crisp and connection is smooth, I found that the 2nd gen AirPods are actually better and improved on 1st gen, but its not worth to upgrade, but if you are planning to get your first one, I would definitely say go for the 2nd gen AirPods, I got a pretty sweet deal on amazon and am very happy with my purchase.","pair of earphones, battery life, sound quality, connectino, deal","best, definitely impressive, crisp, smooth, pretty sweet",positive,
2,https://www.amazon.com/gp/customer-reviews/RM7B0QA6K2S1K/ref=cm_cr_arp_d_rvw_ttl?ie=UTF8&ASIN=B07PXGQC1Q,5,"Okay, so I bought these after my third pair of cheaper wireless headphones broke. I was getting tired of ones that attached and had the cord that went around my neck. Hence, I decided to search the internet for the best headphones that would work for listening to music/audio books while working out or doing things around the house. I finally decided to try out the AirPods since they had such great reviews, and because they were on sale. When I received the AirPods, I was surprised at how well they actually fit in my ears. I had worried about them falling out, and though they have fallen out a few times due to my children rough housing and landing on my head, they have worked great for walking around the house and exercising. The sound quality is really amazing, and I am pleasantly surprised at how quickly they charge. I will update this once I've had them longer than a few weeks, but I am extremely impressed with them so far. I was extremely skeptical at first, but I'm very glad I decided to get them!","headphones, reviews, sound quality","best, great, really amazing",positive,


### Generate the review data and split review into sentences and segments
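This is basic data preparation that breaks a long review down into more manageable pieces for analysis: each review is split into sentences, and each sentence is split into segments at commas, exclamation marks and newlines.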

In [None]:
def generate_data(df):
  data = pd.DataFrame()

  # initialise split for sentences
  sent_splitter = English()
  sent_splitter.add_pipe(sent_splitter.create_pipe('sentencizer')) # updated

  # initialise regex splitting criterion for basic punctuation rules
  punc_split = re.compile(',|!|\n')

  for index, row in df.iterrows():
    rating = row.Stars
    review = row.Reviews
    doc = sent_splitter(review)
    sentences = [sent.text.strip() for sent in doc.sents]

    for sentence in sentences:
      # split sentences into segments based on punctuation
      segments = punc_split.split(sentence)
      segments = [segment.strip() for segment in segments if segment.strip()]

      # append into the data
      for segment in segments:
        data = data.append({
            "sentence": sentence,
            "review": review,
            "segment": segment,
            "rating": rating,
            "review_index": index
        }, ignore_index=True)  

  return data

In [None]:
data = generate_data(df)
data.head(3)

Unnamed: 0,rating,review,review_index,segment,sentence
0,1.0,"Poor quality microphone. Not suitable for a remote worker taking calls. If your job requires dictation or a high quality mic, go elsewhere",0.0,Poor quality microphone.,Poor quality microphone.
1,1.0,"Poor quality microphone. Not suitable for a remote worker taking calls. If your job requires dictation or a high quality mic, go elsewhere",0.0,Not suitable for a remote worker taking calls.,Not suitable for a remote worker taking calls.
2,1.0,"Poor quality microphone. Not suitable for a remote worker taking calls. If your job requires dictation or a high quality mic, go elsewhere",0.0,If your job requires dictation or a high quality mic,"If your job requires dictation or a high quality mic, go elsewhere"
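
The segment-splitting step can be tried in isolation. The following is a minimal sketch that reuses the same punctuation pattern as `generate_data`; the helper name `split_segments` is hypothetical and is only here for illustration.

```python
import re

# Same splitting criterion as in generate_data: comma, exclamation mark or newline
punc_split = re.compile(r',|!|\n')

def split_segments(sentence):
    """Split a sentence into non-empty, whitespace-trimmed segments."""
    return [part.strip() for part in punc_split.split(sentence) if part.strip()]

segments = split_segments("If your job requires dictation or a high quality mic, go elsewhere")
print(segments)
# → ['If your job requires dictation or a high quality mic', 'go elsewhere']
```

Note that full stops are not part of the pattern, because sentence boundaries are already handled by the sentencizer.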


### Find noun-adjective pairs
We distinguish between single-word nouns (e.g. connection) and compound nouns made up of more than one word (e.g. battery life), and handle them separately. We use a combination of dependency parsing and POS tagging to filter for noun-adjective pairs.

spaCy POS tagging: https://spacy.io/api/annotation#pos-tagging

spaCy dependency parsing: https://spacy.io/api/annotation#dependency-parsing

<br>
We apply some simple rule-based filtering to find our target nouns and adjectives.

In [None]:
def get_singular_pairs(doc):
    doc = nlp(doc)
    nouns = [tok for tok in doc if tok.dep_ in ['nsubj']]               # Get the nominal subjects (nsubj) in doc
    nouns = [tok for tok in nouns if tok.pos_ not in  ('PRON', 'DET') ] # Drop subjects tagged as pronouns or determiners

    pair_list = []
    if len(nouns) != 0:
      for tok in nouns:
          pair_item_noun, pair_item_adj = False, False # initialize false variables
          noun = doc[tok.i: tok.i+1] # slice up to the token's index itself 

          # Given a nominal subject (usually the noun), we will look for adjectives before and after it (although we expect mostly to be on the left) 
          # In simple cases, this would mean that the noun shares a head with the adjective
          right_adj_list = [adj for adj in noun.root.head.rights if adj.dep_ in ['amod', 'acl', 'acomp']]
          left_adj_list = [adj for adj in noun.root.head.lefts if adj.dep_ in ['amod', 'acl', 'acomp']]
          adj_list = left_adj_list + right_adj_list

          if len(adj_list) != 0:
              pair_item_noun = noun
              pair_item_adj = adj_list[0]

          if pair_item_noun and pair_item_adj:
              pair_list.append(pair_item_noun)
              pair_list.append(pair_item_adj)
              
    return pair_list                      

In [None]:
def get_compound_pairs(doc):
    doc = nlp(doc)
    compounds = [tok for tok in doc if tok.dep_ == 'compound']                        # Get list of compounds in doc
    compounds = [c for c in compounds if c.i == 0 or doc[c.i - 1].dep_ != 'compound'] # Avoid index errors for compounds at start of doc, and avoid compound-compound patterns

    pair_list = []
    if len(compounds) != 0: 
        for tok in compounds:
          pair_item_compound_noun, pair_item_adj = False, False # initialize false variables
          compound_noun = doc[tok.i: tok.head.i + 1] # span from the compound token to its head noun, assuming the compound token precedes its head
          noun = doc[tok.head.i : tok.head.i+1] # the head noun on its own, without the compound modifier

          # Given a nominal subject (usually the noun), we will look for adjectives before and after it (although we expect mostly to be on the left) 
          # In simple cases, this would mean that the compound_noun shares a head with the adjective
          right_adj_list = [adj for adj in noun.root.head.rights if adj.dep_ in ['amod', 'acl', 'acomp']]
          left_adj_list = [adj for adj in noun.root.head.lefts if adj.dep_ in ['amod', 'acl', 'acomp']]
          adj_list = left_adj_list + right_adj_list

          if len(adj_list) != 0:
              pair_item_compound_noun = compound_noun
              pair_item_adj = adj_list[0]

          if pair_item_compound_noun and pair_item_adj:
              pair_list.append(pair_item_compound_noun)
              pair_list.append(pair_item_adj)

    return pair_list                    

In [None]:
data["singular_pairs"] = data['segment'].apply(get_singular_pairs)
data["compound_pairs"] = data['segment'].apply(get_compound_pairs)
print(data['singular_pairs'].describe())
print("")
print(data['compound_pairs'].describe())

count     413
unique    43 
top       [] 
freq      371
Name: singular_pairs, dtype: object

count     413
unique    23 
top       [] 
freq      391
Name: compound_pairs, dtype: object


In [None]:
data.head(3)

Unnamed: 0,rating,review,review_index,segment,sentence,singular_pairs,compound_pairs
0,1.0,"Poor quality microphone. Not suitable for a remote worker taking calls. If your job requires dictation or a high quality mic, go elsewhere",0.0,Poor quality microphone.,Poor quality microphone.,[],"[(quality, microphone), Poor]"
1,1.0,"Poor quality microphone. Not suitable for a remote worker taking calls. If your job requires dictation or a high quality mic, go elsewhere",0.0,Not suitable for a remote worker taking calls.,Not suitable for a remote worker taking calls.,[],[]
2,1.0,"Poor quality microphone. Not suitable for a remote worker taking calls. If your job requires dictation or a high quality mic, go elsewhere",0.0,If your job requires dictation or a high quality mic,"If your job requires dictation or a high quality mic, go elsewhere",[],[]
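
The rule used for compound nouns can be traced by hand without loading a parser. The sketch below is only an illustration: the `Tok` class, the helper names and the dependency labels for "The battery life is impressive" are hand-written assumptions, not real spaCy output. It applies the same idea as `get_compound_pairs`: take the compound token, move to its head noun, then look among the children of that noun's head for a token labelled `amod`, `acl` or `acomp`.

```python
from dataclasses import dataclass

@dataclass
class Tok:
    i: int      # position in the sentence
    text: str
    dep: str    # dependency label
    head: int   # position of the syntactic head

# Hand-annotated parse of "The battery life is impressive" (assumed labels)
toks = [
    Tok(0, "The", "det", 2),
    Tok(1, "battery", "compound", 2),
    Tok(2, "life", "nsubj", 3),
    Tok(3, "is", "ROOT", 3),
    Tok(4, "impressive", "acomp", 3),
]

ADJ_DEPS = {"amod", "acl", "acomp"}

def children(tokens, head_index):
    """All tokens whose head is head_index, excluding the head itself."""
    return [t for t in tokens if t.head == head_index and t.i != head_index]

def find_compound_pairs(tokens):
    pairs = []
    for tok in tokens:
        if tok.dep != "compound":
            continue
        noun = tokens[tok.head]  # head noun of the compound
        adjs = [c for c in children(tokens, noun.head) if c.dep in ADJ_DEPS]
        if adjs:
            phrase = " ".join(t.text for t in tokens[tok.i: noun.i + 1])
            pairs.append((phrase, adjs[0].text))
    return pairs

pairs = find_compound_pairs(toks)
print(pairs)
# → [('battery life', 'impressive')]
```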


### Because the segments are short, we expect a compound pair, when one exists, to be a more informative version of the singular pair (e.g. 'battery life - good' vs 'life - good'), so we use the compound pair when available and fall back to the singular pair otherwise

In [None]:
def combine_pairs(row):
  compound_pair = row['compound_pairs']
  singular_pair = row['singular_pairs']

  if compound_pair:
    return compound_pair
  else:
    return singular_pair

data['combined_pairs'] = data.apply(combine_pairs, axis=1)

In [None]:
data.head(3)

Unnamed: 0,rating,review,review_index,segment,sentence,singular_pairs,compound_pairs,combined_pairs
0,1.0,"Poor quality microphone. Not suitable for a remote worker taking calls. If your job requires dictation or a high quality mic, go elsewhere",0.0,Poor quality microphone.,Poor quality microphone.,[],"[(quality, microphone), Poor]","[(quality, microphone), Poor]"
1,1.0,"Poor quality microphone. Not suitable for a remote worker taking calls. If your job requires dictation or a high quality mic, go elsewhere",0.0,Not suitable for a remote worker taking calls.,Not suitable for a remote worker taking calls.,[],[],[]
2,1.0,"Poor quality microphone. Not suitable for a remote worker taking calls. If your job requires dictation or a high quality mic, go elsewhere",0.0,If your job requires dictation or a high quality mic,"If your job requires dictation or a high quality mic, go elsewhere",[],[],[]


### Clean out the dataframe

In [None]:
# Drop unnecessary columns and rows
ranking_data = data.drop(data[data.astype(str).combined_pairs == '[]'].index)
ranking_data = ranking_data.drop(['singular_pairs', 'compound_pairs'], axis=1)

In [None]:
# Break up the rows with multiple combined_pairs
split_data = pd.DataFrame()

for index, row in ranking_data.iterrows():
  if len(row.combined_pairs) > 2:
    for i in range(2, len(row.combined_pairs), 2):
      new_row = row.copy()
      new_row.combined_pairs = row.combined_pairs[i:i+2]
      split_data = split_data.append(new_row)
    ranking_data.at[index,'combined_pairs'] = row.combined_pairs[0:2]

ranking_data = pd.concat([ranking_data, split_data])
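
The splitting above relies on `combined_pairs` being a flat list of the form `[noun, adjective, noun, adjective, ...]`. As a minimal sketch of that chunking step on its own, using plain strings as hypothetical stand-ins for the spaCy spans and tokens:

```python
def chunk_pairs(flat_pairs):
    """Split a flat [noun, adj, noun, adj, ...] list into [noun, adj] chunks."""
    return [flat_pairs[i:i + 2] for i in range(0, len(flat_pairs), 2)]

# Hypothetical segment that produced two noun-adjective pairs
chunks = chunk_pairs(["quality", "crisp", "connection", "smooth"])
print(chunks)
# → [['quality', 'crisp'], ['connection', 'smooth']]
```

In the cell above, the first chunk stays on the original row and every further chunk becomes a new row copied from it.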

In [None]:
ranking_data.head(3)

Unnamed: 0,rating,review,review_index,segment,sentence,combined_pairs
0,1.0,"Poor quality microphone. Not suitable for a remote worker taking calls. If your job requires dictation or a high quality mic, go elsewhere",0.0,Poor quality microphone.,Poor quality microphone.,"[(quality, microphone), Poor]"
23,5.0,"These are the best pair of wireless earphones I ever came across, apple has definitely done it again. Being an apple fan, I love these! They are so easy to connect! Once I connected them to my iPhone, they were seamlessly connected to my other apple devices. All I have to do is select them in the list from the device you need them to connect and you are good to go. Another good thing is turning them on/off is easy, just remove them from the charging case and they turn on and put them back to turn them off, its simple. And like most of the earphones out there, they connect to the last device you used them with, but its easy to switch device. You have the flexibility to just use one piece, if you like and sometimes it does come handy. Another function I like is the automatic ear detection, just remove a piece from you ear and playback will pause and resume once you put it back, this does help when someone drops by for a conversation and you don't need to actually pickup your phone to pause the playback. The battery life is definitely impressive, I need to charge the case every 2 days. And it's easy to carry everywhere, it being so small. Now the sound quality is crisp and connection is smooth, I found that the 2nd gen AirPods are actually better and improved on 1st gen, but its not worth to upgrade, but if you are planning to get your first one, I would definitely say go for the 2nd gen AirPods, I got a pretty sweet deal on amazon and am very happy with my purchase.",1.0,The battery life is definitely impressive,"The battery life is definitely impressive, I need to charge the case every 2 days.","[(battery, life), impressive]"
27,5.0,"These are the best pair of wireless earphones I ever came across, apple has definitely done it again. Being an apple fan, I love these! They are so easy to connect! Once I connected them to my iPhone, they were seamlessly connected to my other apple devices. All I have to do is select them in the list from the device you need them to connect and you are good to go. Another good thing is turning them on/off is easy, just remove them from the charging case and they turn on and put them back to turn them off, its simple. And like most of the earphones out there, they connect to the last device you used them with, but its easy to switch device. You have the flexibility to just use one piece, if you like and sometimes it does come handy. Another function I like is the automatic ear detection, just remove a piece from you ear and playback will pause and resume once you put it back, this does help when someone drops by for a conversation and you don't need to actually pickup your phone to pause the playback. The battery life is definitely impressive, I need to charge the case every 2 days. And it's easy to carry everywhere, it being so small. Now the sound quality is crisp and connection is smooth, I found that the 2nd gen AirPods are actually better and improved on 1st gen, but its not worth to upgrade, but if you are planning to get your first one, I would definitely say go for the 2nd gen AirPods, I got a pretty sweet deal on amazon and am very happy with my purchase.",1.0,Now the sound quality is crisp and connection is smooth,"Now the sound quality is crisp and connection is smooth, I found that the 2nd gen AirPods are actually better and improved on 1st gen, but its not worth to upgrade, but if you are planning to get your first one, I would definitely say go for the 2nd gen AirPods, I got a pretty sweet deal on amazon and am very happy with my purchase.","[(quality), crisp]"


# Meaningfulness
We will now add some features that we think help determine whether an identified noun-adjective pair is useful.

### Add in review length
Review length can contribute indirectly to the meaningfulness of a noun-adjective pair. Longer reviews typically have more thought put into them, so the noun-adjective pairs drawn from them are more likely to be well considered. Length is measured here as the number of characters in the review.

In [None]:
ranking_data['review_length'] = ranking_data['review'].apply(len)

In [None]:
ranking_data.head(3)

Unnamed: 0,rating,review,review_index,segment,sentence,combined_pairs,review_length
0,1.0,"Poor quality microphone. Not suitable for a remote worker taking calls. If your job requires dictation or a high quality mic, go elsewhere",0.0,Poor quality microphone.,Poor quality microphone.,"[(quality, microphone), Poor]",138
23,5.0,"These are the best pair of wireless earphones I ever came across, apple has definitely done it again. Being an apple fan, I love these! They are so easy to connect! Once I connected them to my iPhone, they were seamlessly connected to my other apple devices. All I have to do is select them in the list from the device you need them to connect and you are good to go. Another good thing is turning them on/off is easy, just remove them from the charging case and they turn on and put them back to turn them off, its simple. And like most of the earphones out there, they connect to the last device you used them with, but its easy to switch device. You have the flexibility to just use one piece, if you like and sometimes it does come handy. Another function I like is the automatic ear detection, just remove a piece from you ear and playback will pause and resume once you put it back, this does help when someone drops by for a conversation and you don't need to actually pickup your phone to pause the playback. The battery life is definitely impressive, I need to charge the case every 2 days. And it's easy to carry everywhere, it being so small. Now the sound quality is crisp and connection is smooth, I found that the 2nd gen AirPods are actually better and improved on 1st gen, but its not worth to upgrade, but if you are planning to get your first one, I would definitely say go for the 2nd gen AirPods, I got a pretty sweet deal on amazon and am very happy with my purchase.",1.0,The battery life is definitely impressive,"The battery life is definitely impressive, I need to charge the case every 2 days.","[(battery, life), impressive]",1488
27,5.0,"These are the best pair of wireless earphones I ever came across, apple has definitely done it again. Being an apple fan, I love these! They are so easy to connect! Once I connected them to my iPhone, they were seamlessly connected to my other apple devices. All I have to do is select them in the list from the device you need them to connect and you are good to go. Another good thing is turning them on/off is easy, just remove them from the charging case and they turn on and put them back to turn them off, its simple. And like most of the earphones out there, they connect to the last device you used them with, but its easy to switch device. You have the flexibility to just use one piece, if you like and sometimes it does come handy. Another function I like is the automatic ear detection, just remove a piece from you ear and playback will pause and resume once you put it back, this does help when someone drops by for a conversation and you don't need to actually pickup your phone to pause the playback. The battery life is definitely impressive, I need to charge the case every 2 days. And it's easy to carry everywhere, it being so small. Now the sound quality is crisp and connection is smooth, I found that the 2nd gen AirPods are actually better and improved on 1st gen, but its not worth to upgrade, but if you are planning to get your first one, I would definitely say go for the 2nd gen AirPods, I got a pretty sweet deal on amazon and am very happy with my purchase.",1.0,Now the sound quality is crisp and connection is smooth,"Now the sound quality is crisp and connection is smooth, I found that the 2nd gen AirPods are actually better and improved on 1st gen, but its not worth to upgrade, but if you are planning to get your first one, I would definitely say go for the 2nd gen AirPods, I got a pretty sweet deal on amazon and am very happy with my purchase.","[(quality), crisp]",1488


### Add in lemmatization of combined_pairs
We reduce the words in each noun-adjective pair to their base forms so that equivalent pairs can be grouped later. For example, `sound` and `sounds` will most likely refer to the same thing.


In [None]:
def lemmatize(combined_pairs):
  lemmatizer = WordNetLemmatizer()
  noun, adj = "", ""

  for tok in combined_pairs[0]:
    if str(tok.lower_).strip() in ('they', 'these', 'it'): 
      noun = noun + str(tok.lemma_) + " "
    else:
      noun = noun + lemmatizer.lemmatize(tok.text, pos = "n") + " "
  noun = noun.strip()
  
  adj = lemmatizer.lemmatize(combined_pairs[1].text, pos = "a")

  return [noun, adj]

In [None]:
ranking_data['lemmatized_pairs'] = ranking_data['combined_pairs'].apply(lemmatize)

In [None]:
ranking_data['lemmatized_pairs'].describe()

count     52                
unique    49                
top       [quality, amazing]
freq      3                 
Name: lemmatized_pairs, dtype: object
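
To illustrate the grouping that lemmatization is meant to enable, here is a minimal sketch using `collections.Counter`. The sample pairs are hypothetical values in the same `[noun, adjective]` shape as `lemmatized_pairs`. Lists are not hashable, so each pair is converted to a tuple before counting.

```python
from collections import Counter

# Hypothetical sample of lemmatized [noun, adjective] pairs
sample_pairs = [
    ["quality", "amazing"],
    ["battery life", "impressive"],
    ["quality", "amazing"],
    ["connection", "smooth"],
]

# Convert each list to a tuple so it can be used as a dictionary key
pair_counts = Counter(tuple(pair) for pair in sample_pairs)

for pair, count in pair_counts.most_common():
    print(pair, count)
```

Pairs that differ only in capitalisation or inflection would be counted separately here, which is why reducing words to a common base form first matters.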

In [None]:
ranking_data.head(3)

Unnamed: 0,rating,review,review_index,segment,sentence,combined_pairs,review_length,lemmatized_pairs
0,1.0,"Poor quality microphone. Not suitable for a remote worker taking calls. If your job requires dictation or a high quality mic, go elsewhere",0.0,Poor quality microphone.,Poor quality microphone.,"[(quality, microphone), Poor]",138,"[quality microphone, Poor]"
23,5.0,"These are the best pair of wireless earphones I ever came across, apple has definitely done it again. Being an apple fan, I love these! They are so easy to connect! Once I connected them to my iPhone, they were seamlessly connected to my other apple devices. All I have to do is select them in the list from the device you need them to connect and you are good to go. Another good thing is turning them on/off is easy, just remove them from the charging case and they turn on and put them back to turn them off, its simple. And like most of the earphones out there, they connect to the last device you used them with, but its easy to switch device. You have the flexibility to just use one piece, if you like and sometimes it does come handy. Another function I like is the automatic ear detection, just remove a piece from you ear and playback will pause and resume once you put it back, this does help when someone drops by for a conversation and you don't need to actually pickup your phone to pause the playback. The battery life is definitely impressive, I need to charge the case every 2 days. And it's easy to carry everywhere, it being so small. Now the sound quality is crisp and connection is smooth, I found that the 2nd gen AirPods are actually better and improved on 1st gen, but its not worth to upgrade, but if you are planning to get your first one, I would definitely say go for the 2nd gen AirPods, I got a pretty sweet deal on amazon and am very happy with my purchase.",1.0,The battery life is definitely impressive,"The battery life is definitely impressive, I need to charge the case every 2 days.","[(battery, life), impressive]",1488,"[battery life, impressive]"
27,5.0,"These are the best pair of wireless earphones I ever came across, apple has definitely done it again. Being an apple fan, I love these! They are so easy to connect! Once I connected them to my iPhone, they were seamlessly connected to my other apple devices. All I have to do is select them in the list from the device you need them to connect and you are good to go. Another good thing is turning them on/off is easy, just remove them from the charging case and they turn on and put them back to turn them off, its simple. And like most of the earphones out there, they connect to the last device you used them with, but its easy to switch device. You have the flexibility to just use one piece, if you like and sometimes it does come handy. Another function I like is the automatic ear detection, just remove a piece from you ear and playback will pause and resume once you put it back, this does help when someone drops by for a conversation and you don't need to actually pickup your phone to pause the playback. The battery life is definitely impressive, I need to charge the case every 2 days. And it's easy to carry everywhere, it being so small. Now the sound quality is crisp and connection is smooth, I found that the 2nd gen AirPods are actually better and improved on 1st gen, but its not worth to upgrade, but if you are planning to get your first one, I would definitely say go for the 2nd gen AirPods, I got a pretty sweet deal on amazon and am very happy with my purchase.",1.0,Now the sound quality is crisp and connection is smooth,"Now the sound quality is crisp and connection is smooth, I found that the 2nd gen AirPods are actually better and improved on 1st gen, but its not worth to upgrade, but if you are planning to get your first one, I would definitely say go for the 2nd gen AirPods, I got a pretty sweet deal on amazon and am very happy with my purchase.","[(quality), crisp]",1488,"[quality, crisp]"


### Add TF-IDF
Term Frequency–Inverse Document Frequency (TF-IDF) is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. It highlights words that are frequent within a document, but discounts words that also appear frequently across the entire corpus.

This is the central part of our definition of meaningfulness. We believe that rarer words give a more meaningful perspective on the product than constantly repeated buzzwords. It also helps surface some outlier observations.
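
To make the weighting concrete, here is a minimal pure-Python sketch on a hypothetical three-document corpus. It assumes the smoothed variant idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation of each document vector, which, as far as we are aware, matches the default settings of scikit-learn's `TfidfVectorizer`. The whitespace tokenizer and the corpus are simplifications for illustration only.

```python
import math
from collections import Counter

# Hypothetical toy corpus: one string per "review"
docs = [
    "poor quality microphone",
    "great sound quality",
    "great battery life",
]

def tfidf(documents):
    """Return one {word: weight} dict per document (smoothed idf, L2-normalised)."""
    tokenized = [d.split() for d in documents]
    n = len(tokenized)
    # Document frequency: number of documents that contain each word
    df = Counter(word for toks in tokenized for word in set(toks))
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = {w: c * (math.log((1 + n) / (1 + df[w])) + 1) for w, c in tf.items()}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        rows.append({w: v / norm for w, v in raw.items()})
    return rows

scores = tfidf(docs)
print(scores[0])
```

In the first document, `poor` and `microphone` occur in only one document, so they receive a higher weight than `quality`, which occurs in two.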

In [None]:
tfIdfVectorizer=TfidfVectorizer()
tfIdf = tfIdfVectorizer.fit_transform(df['Reviews'].tolist())

feature_names = tfIdfVectorizer.get_feature_names()

dense = tfIdf.todense()
denselist = dense.tolist()
tf_df = pd.DataFrame(denselist, columns=feature_names)

In [None]:
tf_df.head()

Unnamed: 0,10,100,100m,11,130,15,1st,20,2019,20mph,220,24,2nd,50,60,63,70,80db,about,above,absolutely,ac,accessories,across,actually,adding,addition,advise,after,again,ahead,aids,air,airpod,airpods,all,allowed,almost,along,already,...,which,while,white,who,whole,will,window,wingtips,wins,wire,wired,wireless,wires,with,within,without,won,wonky,words,work,worked,worker,working,workout,workspace,world,worried,worry,worth,would,wouldn,yeah,year,years,yet,you,your,yours,yourself,zero
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.270254,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.14081,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.060621,0.0,0.0,0.0,0.0,0.0,0.108101,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.060621,0.098777,0.0,0.0,0.0,0.0,0.054051,0.0,0.0,0.0,0.0,0.050029,0.036247,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.042818,0.0,0.0,0.0,0.0,0.0,0.038156,0.0,0.043525,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.042818,0.038156,0.0,0.0,0.0,0.0,0.0,0.233468,0.063171,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.057184,0.0,0.0,0.0,0.0,0.0,0.074019,0.0,0.0,0.0,0.043593,0.0,0.0,0.0,0.0,0.0,0.074979,0.0,0.0,0.0,0.0,0.0,...,0.0,0.054324,0.0,0.0,0.0,0.064171,0.0,0.0,0.0,0.0,0.0,0.057184,0.0,0.032616,0.0,0.0,0.0,0.0,0.0,0.057184,0.081005,0.0,0.060427,0.0,0.0,0.0,0.081005,0.0,0.0,0.057184,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.124205,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.124205,0.0,0.0,0.0,0.074266,0.0,0.0,0.0,0.0,...,0.093782,0.074266,0.124205,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.089178,0.0,0.0,0.0,0.0,0.124205,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.124205,0.0,0.110742,0.101191,0.05315,0.0,0.0,0.0,0.0
4,0.047507,0.0,0.0,0.0,0.0,0.056098,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.062918,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.155774,0.0,0.0,0.047507,0.0,0.062918,...,0.0,0.0,0.0,0.056098,0.0,0.0,0.0,0.062918,0.0,0.0,0.0,0.0,0.0,0.090349,0.0,0.112197,0.0,0.0,0.0,0.039602,0.0,0.0,0.0,0.0,0.0,0.062918,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.053847,0.032782,0.0,0.0,0.0


In [None]:
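def calc_tfidf(row):
  # Average TF-IDF weight of all words in a noun-adjective pair,
  # looked up in the tf_df row that belongs to the pair's review.
  score = 0
  word_list = []

  combined_pairs = row['combined_pairs']
  review_index = row['review_index']

  # combined_pairs[0] holds the noun tokens, combined_pairs[1] the adjective token
  for tok in combined_pairs[0]:
    word_list.append(tok.text)
  word_list.append(combined_pairs[1].text)

  # tf_df columns are lowercased vocabulary terms
  for word in word_list:
    score += tf_df.loc[review_index, word.lower()]
  return score / len(word_list)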
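
The lookup in `calc_tfidf` raises a `KeyError` if a word of the pair is not a column of `tf_df`. This can happen when the vectorizer's vocabulary lacks a token that spaCy produced (for example, scikit-learn's default `TfidfVectorizer` token pattern ignores single-character tokens). The sketch below shows one possible guard, using a plain dictionary in place of a `tf_df` row. The helper name `mean_tfidf`, the toy weights, and the choice to count a missing word as 0 are assumptions for illustration, not part of the original pipeline.

```python
def mean_tfidf(words, weights):
    """Average the TF-IDF weights of `words`; unknown words count as 0.0."""
    if not words:
        return 0.0
    total = sum(weights.get(word.lower(), 0.0) for word in words)
    return total / len(words)

# Toy weights for a single review (hypothetical values, not taken from tf_df)
review_weights = {"quality": 0.2, "microphone": 0.4}

print(mean_tfidf(["quality", "microphone"], review_weights))  # both words known
print(mean_tfidf(["Quality", "mic"], review_weights))         # "mic" is unknown
```

Counting a missing word as 0 lowers the pair's average; skipping it instead would be an equally defensible choice.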

In [None]:
ranking_data['tfidf_score'] = ranking_data.apply(calc_tfidf, axis=1)

In [None]:
ranking_data.head(3)

Unnamed: 0,rating,review,review_index,segment,sentence,combined_pairs,review_length,lemmatized_pairs,tfidf_score
0,1.0,"Poor quality microphone. Not suitable for a remote worker taking calls. If your job requires dictation or a high quality mic, go elsewhere",0.0,Poor quality microphone.,Poor quality microphone.,"[(quality, microphone), Poor]",138,"[quality microphone, Poor]",0.237326
23,5.0,"These are the best pair of wireless earphones I ever came across, apple has definitely done it again. Being an apple fan, I love these! They are so easy to connect! Once I connected them to my iPhone, they were seamlessly connected to my other apple devices. All I have to do is select them in the list from the device you need them to connect and you are good to go. Another good thing is turning them on/off is easy, just remove them from the charging case and they turn on and put them back to turn them off, its simple. And like most of the earphones out there, they connect to the last device you used them with, but its easy to switch device. You have the flexibility to just use one piece, if you like and sometimes it does come handy. Another function I like is the automatic ear detection, just remove a piece from you ear and playback will pause and resume once you put it back, this does help when someone drops by for a conversation and you don't need to actually pickup your phone to pause the playback. The battery life is definitely impressive, I need to charge the case every 2 days. And it's easy to carry everywhere, it being so small. Now the sound quality is crisp and connection is smooth, I found that the 2nd gen AirPods are actually better and improved on 1st gen, but its not worth to upgrade, but if you are planning to get your first one, I would definitely say go for the 2nd gen AirPods, I got a pretty sweet deal on amazon and am very happy with my purchase.",1.0,The battery life is definitely impressive,"The battery life is definitely impressive, I need to charge the case every 2 days.","[(battery, life), impressive]",1488,"[battery life, impressive]",0.03476
27,5.0,"These are the best pair of wireless earphones I ever came across, apple has definitely done it again. Being an apple fan, I love these! They are so easy to connect! Once I connected them to my iPhone, they were seamlessly connected to my other apple devices. All I have to do is select them in the list from the device you need them to connect and you are good to go. Another good thing is turning them on/off is easy, just remove them from the charging case and they turn on and put them back to turn them off, its simple. And like most of the earphones out there, they connect to the last device you used them with, but its easy to switch device. You have the flexibility to just use one piece, if you like and sometimes it does come handy. Another function I like is the automatic ear detection, just remove a piece from you ear and playback will pause and resume once you put it back, this does help when someone drops by for a conversation and you don't need to actually pickup your phone to pause the playback. The battery life is definitely impressive, I need to charge the case every 2 days. And it's easy to carry everywhere, it being so small. Now the sound quality is crisp and connection is smooth, I found that the 2nd gen AirPods are actually better and improved on 1st gen, but its not worth to upgrade, but if you are planning to get your first one, I would definitely say go for the 2nd gen AirPods, I got a pretty sweet deal on amazon and am very happy with my purchase.",1.0,Now the sound quality is crisp and connection is smooth,"Now the sound quality is crisp and connection is smooth, I found that the 2nd gen AirPods are actually better and improved on 1st gen, but its not worth to upgrade, but if you are planning to get your first one, I would definitely say go for the 2nd gen AirPods, I got a pretty sweet deal on amazon and am very happy with my purchase.","[(quality), crisp]",1488,"[quality, crisp]",0.041569


### Word embedding
Word embeddings let us collapse the many positive and negative adjectives that reviewers use into two shared concepts, so that the more distinctive adjectives stand out. Product reviews often contain various synonyms of `good`, such as `amazing`, `fantastic`, and `great`, and the same can be said for `bad`.

Word embeddings help us determine how similar a word is to the common idea of `good` or `bad`, so that sufficiently similar words can be grouped together as carrying the same concept.

After the word embedding step, we group the noun-adjective pairs and count how often each unique pair occurs. Higher-frequency pairs are treated as more meaningful, because a pair that many reviewers mention is likely important to them or a key aspect of the product.

In [None]:
# reference vocabulary
good = nlp.vocab['good']
bad = nlp.vocab['bad']
threshold = 0.6

# function to compute cosine similarity
cosine = lambda v1, v2: np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

def embedding(row):
  noun = row.lemmatized_pairs[0]
  adj = row.lemmatized_pairs[1]
  adj_vocab = nlp.vocab[adj]
  
  similarity_good = cosine(good.vector, adj_vocab.vector)
  similarity_bad = cosine(bad.vector, adj_vocab.vector)

  if max(similarity_good, similarity_bad, threshold) == similarity_good:
    return [noun.lower(), "good"]
  elif max(similarity_good, similarity_bad, threshold) == similarity_bad:
    return [noun.lower(), "bad"]
  else:
    return [noun.lower(), adj.lower()]
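
To make the threshold logic in `embedding` easier to follow, here is a minimal sketch that replaces the spaCy vectors with hand-made 3-dimensional toy vectors. The function name `collapse` and all vector values are hypothetical; only the decision rule (map to `good` or `bad` when the larger similarity reaches the threshold, otherwise keep the adjective) mirrors the cell above.

```python
import math

def cosine(v1, v2):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm

def collapse(adj, adj_vec, good_vec, bad_vec, threshold=0.6):
    """Return 'good' or 'bad' if the adjective is close enough, else the adjective."""
    sim_good = cosine(good_vec, adj_vec)
    sim_bad = cosine(bad_vec, adj_vec)
    if max(sim_good, sim_bad) < threshold:
        return adj  # too far from both reference words: keep as-is
    return "good" if sim_good >= sim_bad else "bad"

# Toy reference vectors (assumed values, not real embeddings)
good_vec = [1.0, 0.0, 0.0]
bad_vec = [0.0, 1.0, 0.0]

print(collapse("great", [0.9, 0.1, 0.1], good_vec, bad_vec))
print(collapse("poor", [0.1, 0.9, 0.2], good_vec, bad_vec))
print(collapse("nifty", [0.3, 0.2, 0.9], good_vec, bad_vec))
```

With these toy vectors, `great` and `poor` are collapsed while `nifty` is kept, which is the same pattern visible in the notebook output.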

In [None]:
ranking_data['word_embedding_pairs'] = ranking_data.apply(embedding, axis=1)

In [None]:
ranking_data['word_embedding_pairs'].describe()

count     52             
unique    43             
top       [quality, good]
freq      8              
Name: word_embedding_pairs, dtype: object

In [None]:
ranking_data.head(3)

Unnamed: 0,rating,review,review_index,segment,sentence,combined_pairs,review_length,lemmatized_pairs,tfidf_score,word_embedding_pairs
0,1.0,"Poor quality microphone. Not suitable for a remote worker taking calls. If your job requires dictation or a high quality mic, go elsewhere",0.0,Poor quality microphone.,Poor quality microphone.,"[(quality, microphone), Poor]",138,"[quality microphone, Poor]",0.237326,"[quality microphone, bad]"
23,5.0,"These are the best pair of wireless earphones I ever came across, apple has definitely done it again. Being an apple fan, I love these! They are so easy to connect! Once I connected them to my iPhone, they were seamlessly connected to my other apple devices. All I have to do is select them in the list from the device you need them to connect and you are good to go. Another good thing is turning them on/off is easy, just remove them from the charging case and they turn on and put them back to turn them off, its simple. And like most of the earphones out there, they connect to the last device you used them with, but its easy to switch device. You have the flexibility to just use one piece, if you like and sometimes it does come handy. Another function I like is the automatic ear detection, just remove a piece from you ear and playback will pause and resume once you put it back, this does help when someone drops by for a conversation and you don't need to actually pickup your phone to pause the playback. The battery life is definitely impressive, I need to charge the case every 2 days. And it's easy to carry everywhere, it being so small. Now the sound quality is crisp and connection is smooth, I found that the 2nd gen AirPods are actually better and improved on 1st gen, but its not worth to upgrade, but if you are planning to get your first one, I would definitely say go for the 2nd gen AirPods, I got a pretty sweet deal on amazon and am very happy with my purchase.",1.0,The battery life is definitely impressive,"The battery life is definitely impressive, I need to charge the case every 2 days.","[(battery, life), impressive]",1488,"[battery life, impressive]",0.03476,"[battery life, impressive]"
27,5.0,"These are the best pair of wireless earphones I ever came across, apple has definitely done it again. Being an apple fan, I love these! They are so easy to connect! Once I connected them to my iPhone, they were seamlessly connected to my other apple devices. All I have to do is select them in the list from the device you need them to connect and you are good to go. Another good thing is turning them on/off is easy, just remove them from the charging case and they turn on and put them back to turn them off, its simple. And like most of the earphones out there, they connect to the last device you used them with, but its easy to switch device. You have the flexibility to just use one piece, if you like and sometimes it does come handy. Another function I like is the automatic ear detection, just remove a piece from you ear and playback will pause and resume once you put it back, this does help when someone drops by for a conversation and you don't need to actually pickup your phone to pause the playback. The battery life is definitely impressive, I need to charge the case every 2 days. And it's easy to carry everywhere, it being so small. Now the sound quality is crisp and connection is smooth, I found that the 2nd gen AirPods are actually better and improved on 1st gen, but its not worth to upgrade, but if you are planning to get your first one, I would definitely say go for the 2nd gen AirPods, I got a pretty sweet deal on amazon and am very happy with my purchase.",1.0,Now the sound quality is crisp and connection is smooth,"Now the sound quality is crisp and connection is smooth, I found that the 2nd gen AirPods are actually better and improved on 1st gen, but its not worth to upgrade, but if you are planning to get your first one, I would definitely say go for the 2nd gen AirPods, I got a pretty sweet deal on amazon and am very happy with my purchase.","[(quality), crisp]",1488,"[quality, crisp]",0.041569,"[quality, crisp]"


# Ranking Algorithm

### Multiplier score for review lengths 
We use the review length as one contributor to the importance of a noun-adjective pair: the multiplier scales linearly from 1 for the shortest review to 1.5 for the longest.

In [None]:
max_review_len = max(ranking_data.review_length)
min_review_len = min(ranking_data.review_length)
print("Min and Max Review Length:", min_review_len, ",", max_review_len)

max_review_len_multiplier = 1.5
min_review_len_multiplier = 1

# longer reviews are more meaningful
def multiplier_len(length):
  scaled = (length - min_review_len) / (max_review_len - min_review_len)
  multiplier = scaled * (max_review_len_multiplier - min_review_len_multiplier) + min_review_len_multiplier

  return multiplier

Min and Max Review Length: 137 , 2152
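
As a quick sanity check of the linear scaling, the sketch below re-creates the length multiplier with the minimum and maximum printed above hard-coded as constants (an assumption: in the notebook they are computed from `ranking_data`). It evaluates the shortest review, a 1488-character review, and the longest review.

```python
MIN_LEN, MAX_LEN = 137, 2152      # values printed by the cell above
MIN_MULT, MAX_MULT = 1.0, 1.5     # multiplier range used in the notebook

def multiplier_len(length):
    """Map a review length linearly from [MIN_LEN, MAX_LEN] to [MIN_MULT, MAX_MULT]."""
    scaled = (length - MIN_LEN) / (MAX_LEN - MIN_LEN)
    return scaled * (MAX_MULT - MIN_MULT) + MIN_MULT

for length in (137, 1488, 2152):
    print(length, round(multiplier_len(length), 4))
```

The 1488-character review lands roughly two thirds of the way along the range, so its multiplier is about 1.34.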


### Multiplier score for review ratings
Review ratings also carry some meaning. However, they are typically less informative, as most consumers do not give much thought to the difference between a 4-star and a 5-star review.

We also believe that extremely high and extremely low ratings carry more meaning than a neutral rating, hence the use of the absolute distance from the midpoint rating.

In [None]:
max_review_rating = max(ranking_data.rating)
min_review_rating = min(ranking_data.rating)
print("Min and Max Review Rating:", min_review_rating, ",", max_review_rating)

max_review_rating_multiplier = 1.2
min_review_rating_multiplier = 1.0

# Extreme ends of ratings are more meaningful
def multiplier_rating(rating):
  scaled = abs(rating - (max_review_rating + min_review_rating)/2) / ((max_review_rating - min_review_rating)/2)
  multiplier = scaled * (max_review_rating_multiplier - min_review_rating_multiplier) + min_review_rating_multiplier

  return multiplier

Min and Max Review Rating: 1.0 , 5.0
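
The effect of the absolute value is easiest to see by evaluating the multiplier for every star rating. The sketch below hard-codes the rating range printed above (an assumption: the notebook derives it from `ranking_data`) and otherwise follows the same formula.

```python
MIN_RATING, MAX_RATING = 1.0, 5.0   # values printed by the cell above
MIN_MULT, MAX_MULT = 1.0, 1.2       # multiplier range used in the notebook

def multiplier_rating(rating):
    """Scale by distance from the midpoint rating, so 1 and 5 stars weigh the same."""
    midpoint = (MAX_RATING + MIN_RATING) / 2
    half_range = (MAX_RATING - MIN_RATING) / 2
    scaled = abs(rating - midpoint) / half_range
    return scaled * (MAX_MULT - MIN_MULT) + MIN_MULT

for rating in (1, 2, 3, 4, 5):
    print(rating, round(multiplier_rating(rating), 2))
```

The result is V-shaped: a 3-star review gets the minimum multiplier, and 1-star and 5-star reviews both get the maximum.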


### Multiplier score for frequency of word embedding
As mentioned, higher-frequency noun-adjective pairs are also meaningful: a pair that crops up often reflects an opinion shared by many users of the product.

In [None]:
# Obtain word embedding pairs frequency 
embedding_pairs_data = pd.DataFrame()
embedding_pairs_data['word_embedding_pairs_string'] = ranking_data['word_embedding_pairs'].apply(lambda x: ",".join(x))
embedding_pair_series = embedding_pairs_data['word_embedding_pairs_string'].value_counts()

max_embedding_count = max(embedding_pair_series.values)
min_embedding_count = min(embedding_pair_series.values)
print("Min and Max Embedding Count:", min_embedding_count, ",", max_embedding_count)

max_embedding_count_multiplier = 1.4
min_embedding_count_multiplier = 1.0

# more frequent appearances are more meaningful
def multiplier_embedding_count(count):
  scaled = (count - min_embedding_count) / (max_embedding_count - min_embedding_count)
  multiplier = scaled * (max_embedding_count_multiplier - min_embedding_count_multiplier) + min_embedding_count_multiplier

  return multiplier

Min and Max Embedding Count: 1 , 8
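
One edge case worth noting: if every pair occurred equally often, the maximum and minimum counts would coincide and the scaling above would divide by zero. The helper below is a hedged sketch of a guarded min-max multiplier. The name `minmax_multiplier` and the decision to fall back to the lower multiplier are assumptions, not something the notebook defines.

```python
def minmax_multiplier(value, lo, hi, mult_lo=1.0, mult_hi=1.4):
    """Linearly map value from [lo, hi] to [mult_lo, mult_hi].

    Falls back to mult_lo when lo == hi, where the scaling is undefined.
    """
    if hi == lo:
        return mult_lo
    scaled = (value - lo) / (hi - lo)
    return scaled * (mult_hi - mult_lo) + mult_lo

# Counts from the notebook output: rarest pair appears once, most common 8 times
print(minmax_multiplier(1, 1, 8))
print(minmax_multiplier(8, 1, 8))
# Degenerate case: all pairs share the same count
print(minmax_multiplier(3, 3, 3))
```

The same helper could serve the length and rating multipliers by passing different bounds.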


### Algorithm
We take the three multipliers above and combine them with the TF-IDF score to get a "meaningfulness" score.

In [None]:
# Our own ranking algorithm 
def ranking_algorithm(row):
  rating = row.rating
  review_length = row.review_length
  tfidf_score = row.tfidf_score
  
  embedding_pairs = ",".join(row.word_embedding_pairs)
  
  # embedding_pairs = row.word_embedding_pairs_string
  embedding_count = embedding_pair_series[embedding_pairs]

  # scaling multiplier for rating 
  review_multiplier_rating = multiplier_rating(rating)

  # scaling multiplier for review_length 
  review_multiplier_len = multiplier_len(review_length)

  # scaling multiplier for embedding count 
  review_multiplier_embedding_count = multiplier_embedding_count(embedding_count)

  output_score = tfidf_score * review_multiplier_rating * review_multiplier_len * review_multiplier_embedding_count

  return output_score

In [None]:
ranking_data['final_score'] = ranking_data.apply(ranking_algorithm, axis=1)
ranking_data.describe()

Unnamed: 0,rating,review_index,review_length,tfidf_score,final_score
count,52.0,52.0,52.0,52.0,52.0
mean,3.884615,16.25,926.576923,0.100877,0.143457
std,1.628831,9.143464,556.856842,0.050636,0.058683
min,1.0,0.0,137.0,0.03476,0.058879
25%,2.75,10.5,478.75,0.063941,0.10185
50%,5.0,16.5,762.0,0.086978,0.137993
75%,5.0,24.0,1457.0,0.125674,0.175006
max,5.0,29.0,2152.0,0.255384,0.306461


### Top 5 Most Meaningful Pairs

In [None]:
ranking_data.sort_values(by=['final_score'], ascending=False).head(5)

Unnamed: 0,rating,review,review_index,segment,sentence,combined_pairs,review_length,lemmatized_pairs,tfidf_score,word_embedding_pairs,final_score
223,5.0,"Baught as a gift, they loved it the sound is awesome and, the feature that pauses music when you take out of 👂 s is pretty nifty as well.",14.0,the feature that pauses music when you take out of 👂 s is pretty nifty as well.,"Baught as a gift, they loved it the sound is awesome and, the feature that pauses music when you take out of 👂 s is pretty nifty as well.","[(feature), nifty]",137,"[feature, nifty]",0.255384,"[feature, nifty]",0.306461
271,5.0,They sound great for the prices. I definitely recommend them. They’re perfect for working out and working. The sound quality is great. The only dislike I have is the sound cancelling. Other then that this is an amazing product.,20.0,The sound quality is great.,The sound quality is great.,"[(quality), great]",227,"[quality, great]",0.173079,"[quality, good]",0.297266
0,1.0,"Poor quality microphone. Not suitable for a remote worker taking calls. If your job requires dictation or a high quality mic, go elsewhere",0.0,Poor quality microphone.,Poor quality microphone.,"[(quality, microphone), Poor]",138,"[quality microphone, Poor]",0.237326,"[quality microphone, bad]",0.284862
222,5.0,"Baught as a gift, they loved it the sound is awesome and, the feature that pauses music when you take out of 👂 s is pretty nifty as well.",14.0,they loved it the sound is awesome and,"Baught as a gift, they loved it the sound is awesome and, the feature that pauses music when you take out of 👂 s is pretty nifty as well.","[(sound), awesome]",137,"[sound, awesome]",0.196022,"[sound, good]",0.235227
301,5.0,"Great airpods! Low latency, good sound quality. Please put in the dissription that these are GENERATION 2 Airpods so people won't get confused to whether they are getting gen 1 or gen 2 airpods. I had to look up my serial number on the apple site to make sure, and these are generation 2 airpods! not generation 1. So this is truly the latest model! And please don't get confused, these are not Pros, but the latest non pro model (generation 2).",23.0,Please put in the dissription that these are GENERATION 2 Airpods so people won't get confused to whether they are getting gen 1 or gen 2 airpods.,Please put in the dissription that these are GENERATION 2 Airpods so people won't get confused to whether they are getting gen 1 or gen 2 airpods.,"[(people), confused]",445,"[people, confused]",0.174431,"[people, confused]",0.225315
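
The first two rows of this table can be re-derived by hand, which is a useful check that the multipliers combine as intended. The sketch below hard-codes the ranges printed earlier (lengths 137 to 2152, ratings 1 to 5, pair counts 1 to 8) and uses the rounded `tfidf_score` values from the table. The pair count of 8 for `[quality, good]` comes from the `describe()` output; the count of 1 for `[feature, nifty]` is an assumption.

```python
def linear(scaled, lo, hi):
    """Map a value in [0, 1] onto the multiplier range [lo, hi]."""
    return scaled * (hi - lo) + lo

def final_score(tfidf, rating, length, pair_count):
    """TF-IDF score times the rating, length, and frequency multipliers."""
    rating_mult = linear(abs(rating - 3.0) / 2.0, 1.0, 1.2)
    len_mult = linear((length - 137) / (2152 - 137), 1.0, 1.5)
    count_mult = linear((pair_count - 1) / (8 - 1), 1.0, 1.4)
    return tfidf * rating_mult * len_mult * count_mult

# Row 223: [feature, nifty], 5 stars, 137 characters, assumed to occur once
print(round(final_score(0.255384, 5.0, 137, 1), 6))
# Row 271: [quality, good], 5 stars, 227 characters, occurs 8 times
print(round(final_score(0.173079, 5.0, 227, 8), 6))
```

Both results should agree with the `final_score` column up to rounding of the inputs. Note how the frequency multiplier lifts row 271 close to row 223 despite its lower TF-IDF score.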


In [None]:
ranking_data_top_5 = ranking_data.sort_values(by=['final_score'], ascending=False)[:5]
ranking_data_top_5.drop(['review_index','segment','sentence','lemmatized_pairs'], axis='columns', inplace=True)
ranking_data_top_5 = ranking_data_top_5[['review','review_length','rating','combined_pairs','word_embedding_pairs','tfidf_score','final_score']]
ranking_data_top_5

Unnamed: 0,review,review_length,rating,combined_pairs,word_embedding_pairs,tfidf_score,final_score
223,"Baught as a gift, they loved it the sound is awesome and, the feature that pauses music when you take out of 👂 s is pretty nifty as well.",137,5.0,"[(feature), nifty]","[feature, nifty]",0.255384,0.306461
271,They sound great for the prices. I definitely recommend them. They’re perfect for working out and working. The sound quality is great. The only dislike I have is the sound cancelling. Other then that this is an amazing product.,227,5.0,"[(quality), great]","[quality, good]",0.173079,0.297266
0,"Poor quality microphone. Not suitable for a remote worker taking calls. If your job requires dictation or a high quality mic, go elsewhere",138,1.0,"[(quality, microphone), Poor]","[quality microphone, bad]",0.237326,0.284862
222,"Baught as a gift, they loved it the sound is awesome and, the feature that pauses music when you take out of 👂 s is pretty nifty as well.",137,5.0,"[(sound), awesome]","[sound, good]",0.196022,0.235227
301,"Great airpods! Low latency, good sound quality. Please put in the dissription that these are GENERATION 2 Airpods so people won't get confused to whether they are getting gen 1 or gen 2 airpods. I had to look up my serial number on the apple site to make sure, and these are generation 2 airpods! not generation 1. So this is truly the latest model! And please don't get confused, these are not Pros, but the latest non pro model (generation 2).",445,5.0,"[(people), confused]","[people, confused]",0.174431,0.225315
