## Aspect-Based Sentiment Analysis: Findings from Natural Language
#### Code File \#2: Baseline Model

Tahmeed Tureen - University of Michigan, Ann Arbor<br>
Python file: <b>data-engin-tureen.ipynb</b> <br>
Description: Code that implements the SemEval baseline model for 2014 Task 4 as discussed in the paper (Pontiki et al.; 2014)

In [6]:
# Load up relevant libraries
import math
import pickle
from collections import Counter
from collections import defaultdict

In [7]:
from sklearn.metrics import f1_score

In [8]:
# Read in Data
train_data = pickle.load(open("pickled_data/pickled_train_data.pkl", "rb"))
print(train_data.shape)

(3156, 7)


In [9]:
test_data = pickle.load(open("pickled_data/pickled_test_data.pkl", "rb"))
print(test_data.shape)

(557, 7)


In [10]:
semEval_test_data = pickle.load(open("pickled_data/semEval_TestData.pkl", "rb"))
semEval_test_data.shape

(1025, 4)

### Baseline Models
The 2014 SemEval 2014 committee proposed the following baseline models:

- **Aspect Term Extraction**: In training, create a "dictionary" (word bank) that will consist of all of the aspect terms that show up in the Training Data. In test, go through each review and pick out aspect terms that are in this word back

- **Aspect Term Polarity**: In training, make a dictionary where the key is the aspect term and value is the sentiment that is the most frequent associated with that term. In test, assign the most frequent sentiment of the aspect term in the training data to the test term.

- **Aspect Category Extraction**: In testing, compare the current review with all of the reviews in the training data and whicherver training review has the highest dice coefficient with the current review, we assign the test review with that category

- **Aspect Category Polarity**: In training, make a dictionary where the key is the Category and value is the sentiment that is the most frequent associated with that term. In test, assign the most frequent sentiment of the Category in the training data to the test Category.

**Some Comments**: These are very naive and funky baseline models, but we use them as our lowest threshold metric and we want our built models to essentially beat these guys!

### Define Evaluation Metrics

Here we define the F1 Score, Precision and Recall for **Term Extraction** as discussed in (Pontiki et al.; 2014)

In [11]:
def F1_SemEval(predictions, truth):
    # need to calculate precision, recall
    intersect_SnG = 0 # intersection of extractions and truths
    cap_S = 0.0 # set of extractions
    cap_G = 0.0 # set of truths
    
    for i in range(len(predictions)):
        current_pred = predictions[i]
        current_truth = truth[i]
#         print(current_pred)
#         print(current_truth)
        
        # numerator for both precision and recall (number of terms in prediction that is also in the truth)
        intersect_SnG += len([term for term in current_pred if term in current_truth])
        cap_S += len(current_pred)
        cap_G += len(current_truth)
#         print("SnG", intersect_SnG)
#         print("S:", cap_S)
#         print("G", cap_G)
        
    # After loop is over we can now calculate the Precision and Recall values
    prec = float(intersect_SnG) / float(cap_S)
    recall = float(intersect_SnG) / float(cap_G)
    
    f1_score = ( 2.0 * prec * recall ) / (prec + recall)    
    return f1_score, prec, recall

### Task 1: Aspect Term Extraction

#### Training
We use our training data (n = 3167)

In [12]:
# Create the Naive word bank
train_aspTerms_bank = []

for term in train_data.Aspect_Term:
    train_aspTerms_bank = train_aspTerms_bank + term

# Get rid of repeated words (not necessary, but good practice to do this)
train_aspTerms_bank = set(train_aspTerms_bank)
print("We have", len(train_aspTerms_bank), "unique aspect terms in our training corpus")

We have 1124 unique aspect terms in our training corpus


#### Testing
- On sibling test split (n = 557)
- On SemEval annotated test data (n = 1025)

In [13]:
# load up SemEval's stopwords
# Stopwords, imported from NLTK (v 2.0.4)
stopwords = set(
    ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves',
     'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their',
     'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was',
     'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the',
     'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against',
     'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in',
     'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why',
     'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only',
     'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now'])

First, go through the sibling test set's reviews

In [14]:
aspect_terms_extraction = []
for rev in test_data.Review:
    tokenized_rev = rev.split() # basic tokenization by whitespace
    current_rev_terms = [] # create container for the current review's extracted terms
    
    # Go through each token in the review then extract if match
    for token in tokenized_rev:
        if token in train_aspTerms_bank and token not in stopwords:
            current_rev_terms.append(token)
            
    aspect_terms_extraction.append(current_rev_terms)

Now, go trough the SemEval annotated test set

In [15]:
aspterm_extr_SemEval = []

for rev in semEval_test_data.Review:
    
    tokenized_rev = rev.split() # basic tokenization by whitespace
    current_rev_terms = [] # container for the current review's extracted terms
    
    # Go through each token
    for token in tokenized_rev:
        if token in train_aspTerms_bank and token not in stopwords:
            current_rev_terms.append(token)
            
    aspterm_extr_SemEval.append(current_rev_terms)

#### Evaluation
Calculate the raw correct/total score and the F1 Scores for both sets of test data

In [16]:
print("Raw correct prop on Sibling Test Set:", sum(aspect_terms_extraction == test_data.Aspect_Term) / test_data.shape[0])
print("F1 Score, Precision, and Recall on Sibling Test Set:", \
      F1_SemEval(aspect_terms_extraction, list(test_data.Aspect_Term)))

Raw correct prop on Sibling Test Set: 0.34470377019748655
F1 Score, Precision, and Recall on Sibling Test Set: (0.5375647668393783, 0.5563002680965148, 0.5200501253132832)


In [17]:
print("Raw correct prop on SemEval Annotated Test Set:", \
      sum(aspterm_extr_SemEval == semEval_test_data.Aspect_Term) / semEval_test_data.shape[0])
print("F1 Score, Precision, and Recall on SemEval Test Set:", \
      F1_SemEval(predictions = aspterm_extr_SemEval, truth = list(semEval_test_data.Aspect_Term)))

Raw correct prop on SemEval Annotated Test Set: 0.34146341463414637
F1 Score, Precision, and Recall on SemEval Test Set: (0.508235294117647, 0.5788667687595712, 0.4529658478130617)


**Observation**
- We note that baseline performs poorly in terms of F1 Scores on both sets of test data. Scores are just above 50%

### Task 2: Aspect Term Polarity

#### Training

In [18]:
aspTerm_PolClassifier = defaultdict(lambda : {'positive' : 0, 'negative' : 0, 'neutral' : 0})

for rev in train_data.itertuples():
    # loop through all of the terms in the Aspect Terms list in  a review
    for index in range(len(rev.Aspect_Term)):
        
        aspTerm_PolClassifier[rev.Aspect_Term[index]][rev.Aspect_Polarity[index]] += 1

In [19]:
len(aspTerm_PolClassifier.keys()) # should be 1124
# max(stats, key=stats.get)

1124

#### Testing

- Testing on the extracted aspect terms from Task 1
- Testing on the ground truth test terms from the sibling test dataset (Assuming all terms were extracted correctly)

In [20]:
## Testing on extracted data
baseline_aspTerm_polarities = []

for terms in aspect_terms_extraction:
    current_review = []
    
    for term in terms:
        
        curr_polarity = max(aspTerm_PolClassifier[term], key = aspTerm_PolClassifier[term].get)
        current_review.append(curr_polarity)
        
    baseline_aspTerm_polarities.append(current_review)

In [21]:
print(sum(baseline_aspTerm_polarities == test_data.Aspect_Polarity) / 557 ) 

0.32854578096947934


In [22]:
# Testing on truth aspect terms in the test data
baseline_aspTerm_polarities_2 = []

for terms in list(test_data.Aspect_Term):
    current_review = []
    
    for term in terms:
        curr_polarity = max(aspTerm_PolClassifier[term], key = aspTerm_PolClassifier[term].get)
        current_review.append(curr_polarity)
        
    baseline_aspTerm_polarities_2.append(current_review)

In [23]:
print(sum(baseline_aspTerm_polarities_2 == test_data.Aspect_Polarity) / 557)

0.6104129263913824


**Observations**: 

- When we run the classifier on our extracted terms, the test accuracy is approximately 32.9%. This makes sense because we also performed poorly on the baseline extraction in Task 1. This performance is conditional on the performance of the previous task.

- When we run the classifier on the test data and go through the extracted terms there, the accuracy performance rises to 61%, which is obviously an improvement but not a great score. (**Note:**) This performance is NOT conditional on Task 1



### Task 3: Aspect-Category Detection

"For every test sentence s, the k most similar to s training sentences are retrieved (as in the SB2 baseline). Then, s is assigned the m most frequent aspect category labels of the k retrieved sentences;
m is the most frequent number of aspect category
labels per sentence among the k sentences." - (Pontiki et al.; 2014)

**Interpretation:** Essentially, what they mean by this is to use the Dice Coefficient to get similarity ratings for a test sentence and all of the training sentences. Then pick the highest rating and assign the category label that is associated with that training sentence to the current test sentence

In [24]:
# Define a dice coefficient function (strucutred after SemEval's code)
# Dice coefficient
def dice_coeff(train_sen, test_sen, stopwords_in = stopwords):
    tokenize = lambda t: set([w for w in t.split() if (w not in stopwords)]) # define tokenize fxn
    train_sen = tokenize(train_sen)
    test_sen = tokenize(test_sen)
    
#     print(train_sen)
#     print(test_sen)
#     print(train_sen.intersection(test_sen))
    
    dice_val = 2.0 * len(train_sen.intersection(test_sen)) / (len(train_sen) + len(test_sen))
    return dice_val

Now, we'll go ahead and extract the categories for the test data

In [41]:
aspCategory_Extraction = []
aspCategory_PolarityTest = [] # also doing Task 4

for test_rev in test_data.Review:
    
    category_label = "other" # category label to assign
    category_polarity = "neutral" # category polarity to assign 
    max_diceCoeff = 0.0 # container for max dice coefficient
    
    for train_rev in train_data.itertuples():
        
        curr_diceCoeff = dice_coeff(train_sen= train_rev.Review, test_sen= test_rev)
        
        if curr_diceCoeff > max_diceCoeff:
            category_label = train_rev.Category
            category_polarity = train_rev.Category_Polarity
    
    aspCategory_Extraction.append(category_label)
    aspCategory_PolarityTest.append(category_polarity)

In [27]:
F1_SemEval(predictions=aspCategory_Extraction, truth= list(test_data.Category))

(0.5502680965147453, 0.5486134313397929, 0.5519327731092437)

In [36]:
print(aspCategory_Extraction[0:5])
print(list(test_data.Category)[0:5])
print(sum(aspCategory_Extraction == test_data.Category) / 557)

['food', 'food', 'food', 'service', 'service']
['ambience', 'food', 'food', 'food', 'service']
0.4254937163375224


The F1 Score associated with the Category classification/extraction is approximately 55% on our sibling test data and the raw accuracy is 42%

### Task 4: Aspect Category Polarity Classification

Same schema as Task 2
#### Training

In [39]:
aspCategory_PolClassifier = defaultdict(lambda : {'positive' : 0, 'negative' : 0, 'neutral' : 0})

for review in train_data.itertuples():
    aspCategory_PolClassifier[review.Category][review.Category_Polarity] += 1

len(aspCategory_PolClassifier.keys()) # should be five!

5

In [40]:
aspCategory_PolClassifier

defaultdict(<function __main__.<lambda>>,
            {'ambience': {'negative': 78, 'neutral': 61, 'positive': 228},
             'food': {'negative': 182, 'neutral': 125, 'positive': 743},
             'other': {'negative': 176, 'neutral': 309, 'positive': 470},
             'price': {'negative': 100, 'neutral': 23, 'positive': 154},
             'service': {'negative': 179, 'neutral': 46, 'positive': 282}})

In [42]:
aspCategory_PolarityTest2 = []

for cat in test_data.Category:
    aspCategory_PolarityTest2.append(max(aspCategory_PolClassifier[cat], key = aspCategory_PolClassifier[cat].get))

In [44]:
print(len(aspCategory_PolarityTest))
print(len(aspCategory_PolarityTest2))

557
557


In [46]:
print(sum(aspCategory_PolarityTest == test_data.Category_Polarity) / 557)
print(sum(aspCategory_PolarityTest == test_data.Category_Polarity))

0.4524236983842011
