# Aspect Based Sentiment Analysis - Workbook 1

This is the first attempt at extracting aspects from our user comment set. The approach combines code adapted from the previous project with some new approaches taken from the reference below.

Reference documents
1) https://medium.com/analytics-vidhya/aspect-based-sentiment-analysis-a-practical-approach-8f51029bbc4a

Data source: commentsWithRatings_ActivitiesJson.csv - the user comments extracted from activities.json by a rerun of the previous "1. explore" notebook
Aim of this file: extract aspects only, experimenting with n-grams built from adjacent noun pairs and with standard n-gram detection

In [55]:
import pandas as pd
import numpy as np
import collections

from py_files import data_basic_utility as databasic
from py_files import statistics_utility as stats

# NLTK for tokenising, POS tagging, n-grams and stop words
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import sent_tokenize
from nltk.probability import FreqDist
from nltk.util import ngrams
from nltk.corpus import stopwords

# Stick with the spaCy language model used in the previous project
import spacy

# Load English tokeniser, tagger, parser and NER (Named Entity Recognition)
nlp = spacy.load("en_core_web_sm")

dfComments = pd.read_csv("./data_files/commentsWithRatings_ActivitiesJson.csv")
dfComments = dfComments[dfComments["Reason"].notna()]
print(dfComments.info())
dfComments.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2290 entries, 0 to 2290
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   ActivityName      2290 non-null   object 
 1   PreResult         1016 non-null   float64
 2   PostResult        2290 non-null   float64
 3   Domain            2226 non-null   object 
 4   MarginDifference  1016 non-null   float64
 5   Reason            2290 non-null   object 
dtypes: float64(3), object(3)
memory usage: 125.2+ KB
None


Unnamed: 0,ActivityName,PreResult,PostResult,Domain,MarginDifference,Reason
0,"Art, not Apart 2019 Artists",,7.0,Cultural,,Some truly engaging and surprising works and p...
1,"Art, not Apart 2019 Artists",,7.0,Cultural,,"I would have given a higher rating, but this y..."
2,"Art, not Apart 2019 Artists",,7.0,Cultural,,Art Not Apart presents opportunities for exper...
3,"Art, not Apart 2019 Artists",,8.0,Cultural,,"I saw lots of different events, installations ..."
4,"Art, not Apart 2019 Artists",,10.0,Cultural,,"The Diversity of works, the acessibilitu to in..."


In [56]:
lstComments = dfComments["Reason"].to_list()
databasic.listHead(lstComments)

0: [Some truly engaging and surprising works and performances were included in the festival. These gems both piqued my curiosity and unearthed new ideas about unrelated creative possibilities. I found the strength of the works included in the festival was mixed, but the pieces that resonated with me definitely lingered.]
1: [I would have given a higher rating, but this year I was unable to attend the festival in person.]
2: [Art Not Apart presents opportunities for experimentation and new ideas, but naturally within its own context of a one-day festival. I would not have created this piece without the stimulus of Art, Not Apart and the incentive to imagine that it provides.]
3: [I saw lots of different events, installations & art pieces in different formats which broadened my range. The piece I came up with was in a brand new field for me too.]
4: [The Diversity of works, the acessibilitu to incredible spaces]


In [57]:
def sentenceTokenizeWithTags(inputStr):
    if inputStr is None:
        return []
    sentences = sent_tokenize(inputStr.lower())
    lstInput = []
    for sent in sentences:
        lstInput.append(nltk.pos_tag(nltk.word_tokenize(sent)))
    return lstInput

lstCommentTags = list(map(lambda x: sentenceTokenizeWithTags(x), lstComments))
databasic.listHead(lstCommentTags)

0: [[[('some', 'DT'), ('truly', 'NN'), ('engaging', 'VBG'), ('and', 'CC'), ('surprising', 'JJ'), ('works', 'NNS'), ('and', 'CC'), ('performances', 'NNS'), ('were', 'VBD'), ('included', 'VBN'), ('in', 'IN'), ('the', 'DT'), ('festival', 'NN'), ('.', '.')], [('these', 'DT'), ('gems', 'NNS'), ('both', 'DT'), ('piqued', 'VBN'), ('my', 'PRP$'), ('curiosity', 'NN'), ('and', 'CC'), ('unearthed', 'JJ'), ('new', 'JJ'), ('ideas', 'NNS'), ('about', 'IN'), ('unrelated', 'JJ'), ('creative', 'JJ'), ('possibilities', 'NNS'), ('.', '.')], [('i', 'NN'), ('found', 'VBD'), ('the', 'DT'), ('strength', 'NN'), ('of', 'IN'), ('the', 'DT'), ('works', 'NNS'), ('included', 'VBD'), ('in', 'IN'), ('the', 'DT'), ('festival', 'NN'), ('was', 'VBD'), ('mixed', 'JJ'), (',', ','), ('but', 'CC'), ('the', 'DT'), ('pieces', 'NNS'), ('that', 'WDT'), ('resonated', 'VBD'), ('with', 'IN'), ('me', 'PRP'), ('definitely', 'RB'), ('lingered', 'VBD'), ('.', '.')]]]
1: [[[('i', 'NN'), ('would', 'MD'), ('have', 'VB'), ('given', 'VBN'

In this section we attempt to find "comment features" from linked nouns. This technique comes from the Medium article referenced above: it finds two-word terms such as "sound quality" or "battery life", where two adjacent tokens are both tagged NN (noun). Here the check is extended to NNS (plural noun) as well.

The detected terms and their counts are also saved for review.
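
As a minimal sketch of the idea, the snippet below joins adjacent noun pairs in one hand-tagged sentence. The helper name `joinNounPairs` and the sample tags are illustrative assumptions, not part of the notebook; the tags only mimic the `(word, tag)` pairs that `nltk.pos_tag` returns.

```python
def joinNounPairs(taggedSentence):
    """Return 'first_second' for every pair of adjacent tokens tagged NN or NNS."""
    nounTags = {"NN", "NNS"}
    features = []
    # Walk over each token together with the token that follows it
    for (word, tag), (nextWord, nextTag) in zip(taggedSentence, taggedSentence[1:]):
        if tag in nounTags and nextTag in nounTags:
            features.append(word + "_" + nextWord)
    return features


# Hand-written tags, for illustration only
sample = [("the", "DT"), ("sound", "NN"), ("quality", "NN"), ("was", "VBD"), ("great", "JJ")]
print(joinNounPairs(sample))  # ['sound_quality']
```

Note that three nouns in a row produce two overlapping pairs with this rule, which is worth keeping in mind when reading the counts.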

In [58]:
def findFeaturesInComment(inputSentences):
    lstSentences = []
    lstFeaturesFound = []

    def isNoun(taggedWord):
        # Skip single-character tokens such as "i", which the tagger often labels NN
        return len(taggedWord[0]) > 1 and taggedWord[1] in ("NN", "NNS")

    # The input is a list of sentences, where each sentence is a list of (word, POS tag) pairs
    for inputTags in inputSentences:
        lstNewWords = []
        # Reset per sentence so a feature ending one sentence cannot swallow the first word of the next
        isPartOfFeature = False

        # Loop through each word in the sentence with its POS tag
        for i, (word, _) in enumerate(inputTags):
            isLastWord = (i == len(inputTags) - 1)
            if not isLastWord and isNoun(inputTags[i]) and isNoun(inputTags[i+1]):
                # Found a "feature", two nouns together. Join them as an n-gram, also save to features found.
                # Three nouns in a row give two overlapping features (e.g. sandra_donovan, donovan_sound).
                ngram = word + "_" + inputTags[i+1][0]
                lstNewWords.append(ngram)
                lstFeaturesFound.append(ngram)
                isPartOfFeature = True
            elif isPartOfFeature:
                # This word was already emitted as the second half of the previous feature
                isPartOfFeature = False
            else:
                lstNewWords.append(word)

        # Reconstruct the sentence after the n-grams have been detected and replaced
        lstSentences.append(' '.join(lstNewWords))

    finaltxt = ' '.join(lstSentences)
    return finaltxt, lstFeaturesFound


# Loop through all the comment tags and find the features
lstCommentsWithFeatures = []
lstAllFeatures = []
for comment in lstCommentTags:
    commentWithFeatures, lstFeaturesFound = findFeaturesInComment(comment)
    lstCommentsWithFeatures.append(commentWithFeatures)
    lstAllFeatures.extend(lstFeaturesFound)



In [59]:
databasic.listHead(lstCommentsWithFeatures)

0: [some truly engaging and surprising works and performances were included in the festival . these gems both piqued my curiosity and unearthed new ideas about unrelated creative possibilities . i found the strength of the works included in the festival was mixed , but the pieces that resonated with me definitely lingered .]
1: [i would have given a higher rating , but this year i was unable to attend the festival in person .]
2: [art not apart presents opportunities for experimentation and new ideas , but naturally within its own context of a one-day festival . i would not have created this piece without the stimulus of art , not apart and the incentive to imagine that it provides .]
3: [i saw lots of different events , installations & art_pieces in different formats which broadened my range . the piece i came up with was in a brand new field for me too .]
4: [the diversity of works , the acessibilitu to incredible spaces]


In [60]:
print(len(lstAllFeatures))
databasic.listHead(lstAllFeatures)

1546
0: [art_pieces]
1: [fun_creations]
2: [quality_pieces]
3: [art_expression]
4: [street_painting]


In [61]:
featureCounter = collections.Counter(lstAllFeatures)
print(featureCounter)

# print(featureCounter.most_common(20))

Counter({'bluegrass_parkway': 48, 'event_volunteers': 46, 'sandra_donovan': 35, 'donovan_sound': 35, 'afternoon_concert': 33, 'uke_jam': 27, 'blackboard_concert': 27, 'film_festival': 13, 'okay_friendliness': 13, 'pre_show': 12, 'christmas_story': 9, 'truffle_hunters': 9, 'bass_coast': 8, 'art_practice': 7, 'life_experiences': 7, 'sound_shell': 7, 'car_park': 6, 'comfort_zone': 6, 'caravan_park': 6, 'art_work': 5, 'show_entertainment': 5, 'family_event': 5, 'coast_shorts': 5, 'focus_group': 5, 'n/a_friendliness': 5, 'okay_farmers': 5, 'street_gallery': 5, 'christmas_lights': 4, 'christmas_nativity': 4, 'parting_project': 4, 'pitch_document': 4, 'team_uma': 4, 'durga_puja': 4, 'community_members': 4, 'art_works': 4, 'art_making': 3, 'arts_scene': 3, 'arts_community': 3, 'art_community': 3, 'art_collection': 3, 'art_forms': 3, 'christmas_spirit': 3, 'nativity_play': 3, 'vice_versa': 3, 'risk_management': 3, 'covid_restrictions': 3, 'test_sites': 3, 'share_knowledge': 3, 'art_project': 2,
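
Printing the whole `Counter` is hard to scan. A small sketch of a more reviewable view is shown below, using `most_common()` with a minimum-count filter. The four counts are copied from the output above; the `minCount` cut-off is an assumption chosen only for illustration.

```python
import collections

# A few counts copied from the output above
featureCounter = collections.Counter({
    "bluegrass_parkway": 48,
    "event_volunteers": 46,
    "film_festival": 13,
    "art_project": 2,
})

minCount = 5  # assumed cut-off, for illustration only

# most_common() with no argument returns every (feature, count) pair, highest count first
frequentFeatures = [(feature, count) for feature, count in featureCounter.most_common()
                    if count >= minCount]
print(frequentFeatures)
# [('bluegrass_parkway', 48), ('event_volunteers', 46), ('film_festival', 13)]
```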

Next, use the standard NLTK tools to look for bigrams.

In [62]:
# Tokenise on alphabetic words, allowing one internal hyphen or apostrophe
tokenPattern = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"
tokenizer = RegexpTokenizer(tokenPattern)

# Turn the comments into a list of word tokens
lstCommentTokens = list(map(lambda x: tokenizer.tokenize(x), lstCommentsWithFeatures))

# Create the word list and vocabulary then look at statistics
allWords, vocabulary = databasic.createWordsAndVocabForTokenLists(lstCommentTokens)
stats.print_token_stats(lstCommentsWithFeatures, allWords, vocabulary)

Statistics for List of Text Inputs:
  Total number of tokens:  58033
  Vocabulary size:  5167
  Lexical diversity:  0.08903554873951028
  Total number of lists: 2290
  Average description tokens length: 146.5344978165939
  Minimum description tokens length: 0
  Maximum description tokens length: 1685
  Standard deviation of document length: 172.26944439933655
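
One thing to watch in the cell above: `tokenPattern` has no underscore in its character classes, so a joined feature such as `art_pieces` is split back into two tokens. The sketch below shows this with the standard `re` module, using `re.findall` as a stand-in for `RegexpTokenizer` with the same pattern. `featurePattern` is a hypothetical alternative that keeps joined features intact; it is not something the notebook uses.

```python
import re

# Pattern from the notebook: letters, with at most one internal hyphen or apostrophe
tokenPattern = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"

# Hypothetical variant: also allow "_" as a joiner, and allow repeated joiners
featurePattern = r"[a-zA-Z]+(?:[-'_][a-zA-Z]+)*"

text = "installations & art_pieces in different formats"

print(re.findall(tokenPattern, text))
# ['installations', 'art', 'pieces', 'in', 'different', 'formats']

print(re.findall(featurePattern, text))
# ['installations', 'art_pieces', 'in', 'different', 'formats']
```

Whether to keep the features joined depends on the goal: splitting them lets the standard bigram search rediscover the same pairs, while keeping them joined treats each feature as a single vocabulary item.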


In [63]:
stopwordsNltk = sorted(list(stopwords.words('english')))
print(stopwordsNltk)

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in', 'into', 'is', 'isn', "isn't", 'it', "it's", 'its', 'itself', 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she', "she's", 'should', "should've", 'shouldn', "shouldn't", 'so', 'some',

In [64]:
# find frequency distribution of words
documentFrequency = FreqDist(allWords)

# find most common 50 words
documentFrequency.most_common(50)

[('the', 2948),
 ('and', 2161),
 ('to', 1950),
 ('i', 1775),
 ('of', 1433),
 ('a', 1278),
 ('in', 851),
 ('it', 847),
 ('was', 839),
 ('with', 605),
 ('my', 553),
 ('for', 497),
 ('this', 492),
 ('that', 476),
 ('is', 465),
 ('have', 435),
 ('very', 378),
 ('great', 368),
 ('me', 364),
 ('on', 353),
 ('more', 337),
 ('we', 332),
 ('as', 329),
 ('people', 303),
 ('not', 295),
 ('be', 278),
 ('work', 270),
 ('so', 251),
 ('but', 242),
 ('about', 229),
 ('from', 227),
 ('s', 225),
 ('were', 209),
 ('at', 207),
 ('an', 203),
 ('art', 198),
 ('all', 198),
 ('are', 195),
 ('event', 193),
 ('feel', 192),
 ('really', 181),
 ('had', 177),
 ('you', 173),
 ('like', 173),
 ('by', 167),
 ('has', 167),
 ('would', 164),
 ('good', 162),
 ('how', 162),
 ('because', 162)]

In [65]:
# Remove the stop words
lstCommentTokens = list(map(lambda x: databasic.filter_by_words_bsearch(x, stopwordsNltk), lstCommentTokens))

# Recreate the word list and vocabulary then look at statistics.
# Note: lstCommentsWithFeatures (the unfiltered comment strings) is passed again, which is
# presumably why the length statistics below match the previous cell exactly.
allWords, vocabulary = databasic.createWordsAndVocabForTokenLists(lstCommentTokens)
stats.print_token_stats(lstCommentsWithFeatures, allWords, vocabulary)

Statistics for List of Text Inputs:
  Total number of tokens:  29805
  Vocabulary size:  5026
  Lexical diversity:  0.16862942459318905
  Total number of lists: 2290
  Average description tokens length: 146.5344978165939
  Minimum description tokens length: 0
  Maximum description tokens length: 1685
  Standard deviation of document length: 172.26944439933655


In [66]:
bigrams = ngrams(allWords, n = 2)
fdbigram = FreqDist(bigrams)
mostFreqBigrams = fdbigram.most_common(30)
mostFreqBigrams

[(('brockman', 'street'), 49),
 (('bluegrass', 'parkway'), 48),
 (('farmers', 'market'), 47),
 (('friendliness', 'helpfulness'), 47),
 (('helpfulness', 'event'), 47),
 (('event', 'volunteers'), 46),
 (('performances', 'featuring'), 46),
 (('featuring', 'bluegrass'), 46),
 (('entertainment', 'brockman'), 45),
 (('market', 'stalls'), 45),
 (('workshops', 'various'), 41),
 (('various', 'musicians'), 41),
 (('musicians', 'groups'), 40),
 (('big', 'uke'), 35),
 (('uke', 'jam'), 35),
 (('jam', 'sandra'), 35),
 (('sandra', 'donovan'), 35),
 (('donovan', 'donovan'), 35),
 (('donovan', 'sound'), 35),
 (('sound', 'shell'), 35),
 (('feel', 'like'), 34),
 (('afternoon', 'concert'), 33),
 (('parkway', 'great'), 32),
 (('concert', 'wayward'), 32),
 (('wayward', 'earls'), 32),
 (('young', 'people'), 29),
 (('great', 'workshops'), 29),
 (('volunteers', 'great'), 29),
 (('great', 'performances'), 28),
 (('sunday', 'evening'), 28)]
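
Since `allWords` appears to be one flat list across all comments, `ngrams(allWords, n = 2)` would also pair the last word of one comment with the first word of the next. A minimal sketch of counting bigrams per comment instead is shown below. The helper name `countBigramsPerComment` and the sample tokens are assumptions for illustration; in the notebook the input would be `lstCommentTokens`.

```python
from collections import Counter


def countBigramsPerComment(tokenLists):
    """Count adjacent word pairs inside each token list, never across lists."""
    counts = Counter()
    for tokens in tokenLists:
        # zip pairs each token with its successor within the same comment
        counts.update(zip(tokens, tokens[1:]))
    return counts


# Made-up token lists, for illustration only
sampleTokens = [["great", "uke", "jam"], ["uke", "jam", "fun"]]
bigramCounts = countBigramsPerComment(sampleTokens)
print(bigramCounts.most_common(1))  # [(('uke', 'jam'), 2)]
```

With this approach the pair `('jam', 'uke')`, which only exists across the boundary between the two sample comments, is never counted.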