# Sentence-level ABSA
Given a sentence, we'll need to identify and annotate the aspect category and the sentiment polarity towards the given category.

Because each sentence can belong to multiple categories at the same time, I have decided to implement a multi-label classification model. The idea behind this is to transform our multi-label problem into a multi-class problem, where my classes are not mutually exclusive.

In order to create only one model for both the category and the sentiment, I am going to add the polarity to each tag. This might not be the best approach to dealing with this task as it increases the dimensionality of our data significantly (from 198 labels to 594).
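
To make the label layout concrete, here is a minimal sketch of the combined-tag encoding. The two-entity, two-attribute vocabulary and the `encode` helper are illustrative assumptions, not the real label set used later in this notebook.

```python
from itertools import product

# Toy vocabulary (assumption for illustration; the real task has 22 entities and 9 attributes)
entities = ["LAPTOP", "BATTERY"]
attributes = ["GENERAL", "QUALITY"]
polarities = ["positive", "negative", "neutral"]

# Category-only tags versus category+polarity tags
category_tags = [f"{e}#{a}" for e, a in product(entities, attributes)]
combined_tags = [f"{c}#{p}" for c, p in product(category_tags, polarities)]

def encode(sentence_tags, all_tags):
    """Return a 0/1 vector with one position per tag (multi-hot encoding)."""
    present = set(sentence_tags)
    return [1 if tag in present else 0 for tag in all_tags]

# A sentence may carry several tags at once
vector = encode(["LAPTOP#GENERAL#positive", "BATTERY#QUALITY#negative"], combined_tags)
print(len(category_tags), len(combined_tags), sum(vector))  # 4 12 2
```

With the full vocabulary the same construction yields 22 × 9 = 198 category tags and 198 × 3 = 594 combined tags.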


In [None]:
%pip install scikit-multilearn

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting scikit-multilearn
  Downloading scikit_multilearn-0.2.0-py3-none-any.whl (89 kB)
     89.4/89.4 kB 2.9 MB/s eta 0:00:00
Installing collected packages: scikit-multilearn
Successfully installed scikit-multilearn-0.2.0


In [None]:
import nltk

nltk.download('stopwords')
nltk.download('omw-1.4')
nltk.download('wordnet')
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk.stem import PorterStemmer

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
from skmultilearn.problem_transform import BinaryRelevance
from skmultilearn.problem_transform import ClassifierChain
from skmultilearn.problem_transform import LabelPowerset
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.utils import class_weight

Extracting Data
From the given XML files, we'll need the sentences, the categories and the polarity of each category. We'll extract those using ElementTree.

In [None]:
import pandas as pd
import numpy as np
import xml.etree.ElementTree as et  # cElementTree was removed in Python 3.9

tree_train=et.parse('Laptops_Train_p1.xml')
root_train=tree_train.getroot()

tree_test=et.parse('Laptops_Test_p1_gold.xml')
root_test=tree_test.getroot()
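
For readers without the data files at hand, the snippet below sketches the XML layout that the extraction code in this notebook relies on (`sentence`, `text`, `Opinions`, and `Opinion` with `category` and `polarity` attributes). The sample sentences and the wrapper elements (`Reviews`, `Review`, `sentences`) are assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-sentence sample; only the element and attribute names
# sentence/text/Opinions/Opinion/category/polarity come from this notebook
sample = """
<Reviews>
  <Review>
    <sentences>
      <sentence>
        <text>The battery lasts all day.</text>
        <Opinions>
          <Opinion category="BATTERY#OPERATION_PERFORMANCE" polarity="positive"/>
        </Opinions>
      </sentence>
      <sentence>
        <text>I bought it last week.</text>
      </sentence>
    </sentences>
  </Review>
</Reviews>
"""

root = ET.fromstring(sample)
pairs = []
for sentence in root.iter("sentence"):
    opinions = sentence.find("Opinions")
    if opinions is None:  # sentence without any annotated category
        continue
    text = sentence.find("text").text
    tags = [op.get("category") + "#" + op.get("polarity") for op in opinions.iter("Opinion")]
    pairs.append((text, tags))

print(pairs)
```

Only the first sentence survives, paired with its single combined tag; the second one has no `Opinions` element and is skipped.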

We generate the categories from the given sets. (Note: I changed HARD_DISK to HARD_DISC, as there seems to have been a misspelling in the files provided.)

In [None]:
entity = ["LAPTOP", "DISPLAY", "KEYBOARD", "MOUSE", "MOTHERBOARD", "CPU", "FANS_COOLING", "PORTS", "MEMORY", "POWER_SUPPLY" , "OPTICAL_DRIVES", "BATTERY", "GRAPHICS", "HARD_DISC", "MULTIMEDIA_DEVICES", "HARDWARE", "SOFTWARE", "OS", "WARRANTY", "SHIPPING", "SUPPORT", "COMPANY"]
attributes = ["GENERAL", "PRICE", "QUALITY", "DESIGN_FEATURES", "OPERATION_PERFORMANCE", "USABILITY", "PORTABILITY", "CONNECTIVITY", "MISCELLANEOUS"]
polarity = ["positive", "negative", "neutral"]
Cat = []
for e in entity:
  for a in attributes:
    for p in polarity:
      Cat.append(e+'#'+a+'#'+p)

I don't need the sentences that don't belong to any category, so I'll ignore those.

I create a row for each label, where the first element is the name of the tag, then for each sentence I add 0 if that sentence doesn't have that tag, and 1 if it does.

In [None]:
def extract_data(root):
  # one row per tag: [tag_name, flag_for_sentence_0, flag_for_sentence_1, ...]
  row_text = [[tag] for tag in Cat]
  tag_index = {tag: idx for idx, tag in enumerate(Cat)} #position of each tag in row_text

  Text = []
  for sentence in root.iter('sentence'):
    if sentence.find('Opinions') is None: #ignore sentences with no category
      continue
    Text.append(sentence.find('text').text)
    for row in row_text:
      row.append(0) #add 0 for all categories for current sentence
    for opinion in sentence.iter('Opinion'): #iterate over all existing categories of the current sentence
      tag = str(opinion.get('category')) + '#' + str(opinion.get('polarity'))
      if tag in tag_index:
        row_text[tag_index[tag]][-1] = 1 #replace the 0 with 1 for each category of the current sentence

  row_text.sort() #sort the tags so we'll have the same order in both data sets
  return Text , row_text

In [None]:
Text_train , row_text_train = extract_data(root_train)
Text_test , row_text_test = extract_data(root_test)


Because we can't make predictions about a tag that doesn't occur in our training set and there's no point in training our model on tags that we won't make predictions about, we'll ignore those.

Therefore, in order to minimise the dimensionality of our data, we'll remove the tags that have no occurrences in either the training set or the test set, keeping only the tags that occur in both.

In [None]:

i = 0
while i < len(row_text_test):
  if {0} == set(row_text_test[i][1:]) or {0} == set(row_text_train[i][1:]):
    row_text_test.pop(i)
    row_text_train.pop(i)
    i = i - 1 
  i += 1
print(len(row_text_test))

105


As you can see, only 105 out of a total of 594 tags have at least one occurrence in both data sets, which is a significant reduction in our data.

Pre-processing our data
We'll do the basic steps of pre-processing our data:
 1. Lowercase the words
 2. Tokenise the text
 3. Remove stop words
 4. Remove punctuation and non-alphabetic characters
 5. Lemmatise the words in the text

I chose lemmatisation for my algorithm because it slightly improves the accuracy of the model; stemming proved too harsh for this task.

Moreover, since negation has a powerful impact on the meaning of our words, we'll also replace the words after a negation with their antonyms. The algorithm for negation uses WordNet to find all the antonyms of each word. A word does not always have antonyms (pronouns and many verbs, for example), and in that case it is not replaced. If there are antonyms, I create a list of all possible antonyms and pick the one with the highest dissimilarity coefficient (1 minus the Wu-Palmer similarity).
 
This step can be improved by replacing only the relevant parts of the sentence with antonyms. In my algorithm, I replace everything after the negation (not, never, etc.) with antonyms, but this is not always needed. In compound sentences, words such as "but" and "whereas" mark the beginning of a new clause, which stops the effect of the negation from extending into it. For example, in "I don't like soup but I love vegetables", the negation "n't" affects only the left clause of the sentence. A method that might refine my algorithm is *conjunction analysis*, as conjunctions are the words that link together different clauses in a sentence.
Also, adding more cases of negation to cover misspellings and prefixes such as "un-" is the next step that I want to take to improve my algorithm.
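
The clause-boundary idea can be sketched as follows. This is only an illustration and not part of the pipeline below: the `CLAUSE_BREAKS` list and the `negation_scope` helper are assumptions, and a real implementation would need a fuller inventory of conjunctions.

```python
NEGATIONS = {"not", "n't", "never"}
# Assumed list of clause-boundary words (hypothetical, deliberately short)
CLAUSE_BREAKS = {"but", "whereas", "although", "however"}

def negation_scope(tokens):
    """Return the indices of tokens that fall inside the scope of a negation.

    The scope starts right after a negation word and ends at the next
    clause-boundary word (or at the end of the sentence).
    """
    in_scope = []
    negated = False
    for idx, token in enumerate(tokens):
        if token in NEGATIONS:
            negated = True
        elif token in CLAUSE_BREAKS:
            negated = False
        elif negated:
            in_scope.append(idx)
    return in_scope

tokens = ["i", "do", "n't", "like", "soup", "but", "i", "love", "vegetables"]
print([tokens[i] for i in negation_scope(tokens)])  # ['like', 'soup']
```

Only the tokens at the returned indices would then be candidates for antonym replacement, so "love" and "vegetables" stay untouched.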

Another improvement that might increase the accuracy of the model is to map all synonyms to a single word, which reduces the number of words in our dictionary.

 

In [None]:
# original code for negation https://gist.github.com/UtkarshRedd/3fbfd354ea7a6f83bd8f9419a27b0543#file-negation_handler-py
def Negation(sentence):
  # sentence is a list of tokens; every token after a negation word is replaced
  # in place by its most dissimilar WordNet antonym, when one exists
  # start at 1: with i = 0, sentence[i-1] would wrap around to the last token
  for i in range(1, len(sentence)):
      if sentence[i-1] in ['not', "n't", "never"]:
        for j in range(i, len(sentence)):
          antonyms = []
          for syn in wordnet.synsets(sentence[j]):
              temp = 0
              # collect the first antonym of every lemma of this sense
              for l in syn.lemmas():
                  if l.antonyms():
                      antonyms.append(l.antonyms()[0].name())
              max_dissimilarity = 0
              for ant in antonyms:
                  # compare the first sense of the current token with the first sense of the candidate
                  word1 = wordnet.synsets(sentence[j])[0]
                  word2 = wordnet.synsets(ant)[0]
                  similarity = word1.wup_similarity(word2)
                  # wup_similarity may return None, so only use numeric results
                  if isinstance(similarity, (int, float)):
                      temp = 1 - similarity
                  if temp > max_dissimilarity:
                      max_dissimilarity = temp
                      sentence[j] = ant
                      sentence[i-1] = '' # drop the negation word once a replacement is made
  while '' in sentence:
      sentence.remove('')
  return sentence

def remove_stopwords(tknzd_text):
  stop_words = set(stopwords.words('english')) #load the stop-word list once instead of once per token
  tokens = [] #list of tokens w/o stopwords
  for token in tknzd_text:
    if token not in stop_words:
      tokens.append(token)
  return tokens

def remove_non_alpha(tknzd_text):

  alpha_tokens = [] #list of tokens that are only alphabetic 
  for token in tknzd_text:
    if token.isalpha():
      alpha_tokens.append(token)
  return alpha_tokens

def lemmatise(tknzd_text):
  lemma_tokens = [] #list of lemmatized tokens
  lemmatizer = WordNetLemmatizer()
  for token in tknzd_text:
    lemma_tokens.append(lemmatizer.lemmatize(token))
  
  lemmatized_text = " ".join(lemma_tokens)
  return lemmatized_text

def preprocess(tokenized_data):
  pp_data = []  #list of preprocessed sentences. This is not tokenized text anymore
  for tknzd_sentence in tokenized_data:
    pp_text = remove_stopwords(tknzd_sentence)
    pp_text = remove_non_alpha(pp_text)
    pp_text = lemmatise(pp_text)
    pp_data.append(pp_text)
  return pp_data


for i in range(len(Text_test)):
  Text_test[i] = Negation(word_tokenize(Text_test[i].lower()))

for i in range(len(Text_train)):
  Text_train[i] = Negation(word_tokenize(Text_train[i].lower()))

preprocessed_data_train = preprocess(Text_train)
Train = preprocessed_data_train
preprocessed_data_test = preprocess(Text_test)
Test = preprocessed_data_test


In order to make our data more easily accessible, we'll transform our XMLs into CSVs.

In [None]:
# one "text" column followed by one 0/1 column per tag
columns_train = {"text": Text_train}
for row in row_text_train:
  columns_train[row[0]] = row[1:] #row[0] is the tag name, the rest are the per-sentence flags
df_train = pd.DataFrame(columns_train)
# Writing dataframe to csv
df_train.to_csv('training_set.csv')



In [None]:
# one "text" column followed by one 0/1 column per tag
columns_test = {"text": Text_test}
for row in row_text_test:
  columns_test[row[0]] = row[1:] #row[0] is the tag name, the rest are the per-sentence flags
df_test = pd.DataFrame(columns_test)
# Writing dataframe to csv
df_test.to_csv('testing_set.csv')


In [None]:
df_test.head()

Unnamed: 0,text,BATTERY#OPERATION_PERFORMANCE#negative,BATTERY#OPERATION_PERFORMANCE#positive,BATTERY#QUALITY#negative,COMPANY#GENERAL#negative,COMPANY#GENERAL#positive,CPU#OPERATION_PERFORMANCE#positive,DISPLAY#DESIGN_FEATURES#negative,DISPLAY#DESIGN_FEATURES#neutral,DISPLAY#DESIGN_FEATURES#positive,...,SOFTWARE#OPERATION_PERFORMANCE#negative,SOFTWARE#OPERATION_PERFORMANCE#positive,SOFTWARE#QUALITY#positive,SOFTWARE#USABILITY#negative,SOFTWARE#USABILITY#positive,SUPPORT#PRICE#negative,SUPPORT#QUALITY#negative,SUPPORT#QUALITY#neutral,SUPPORT#QUALITY#positive,WARRANTY#GENERAL#positive
0,"[well, ,, my, first, apple, computer, and, i, ...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"[works, well, ,, fast, and, no, reboots, .]",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"[waiting, to, install, ms, office, and, see, h...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"[have, always, been, a, pc, guy, ,, but, decid...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"[glad, i, did, so, far, .]",0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Our Model

By testing different **Naive Bayes** and **Logistic Regression** models, I have observed that **ComplementNB** works best for my data sets. **Complement Naive Bayes** is significantly better than the other models as it is particularly suited to imbalanced data and to classes which are not mutually exclusive. **SVC** works as well as **ComplementNB**, but its running time is a significant disadvantage in this case (it took around 2 minutes to run).

For problem transformation from multi-label to multi-class, there are 4 methods:
1. Binary relevance
2. One vs. Rest
3. Classifier chains
4. Label powerset

**Binary relevance** and **One vs. Rest** are not the best methods for my model as they treat each label independently and so ignore correlations between labels, whereas our labels are not mutually exclusive. As we have 105 different tags, **Label powerset** will lead to combinatorial explosion and thus computational infeasibility, because it treats each unique label combination found in the data as a single class.
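
As a rough illustration of what the label powerset transformation does, the sketch below maps each distinct row of a toy 0/1 label matrix to one class id. The matrix values and the `label_powerset` helper are made up for this example; it is not the scikit-multilearn implementation.

```python
# Toy multi-hot label matrix: 4 sentences, 3 tags (made-up values)
Y = [
    [1, 0, 0],
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
]

def label_powerset(rows):
    """Map every distinct label combination to one class id."""
    class_of = {}
    classes = []
    for row in rows:
        key = tuple(row)
        if key not in class_of:
            class_of[key] = len(class_of)  # next unused class id
        classes.append(class_of[key])
    return classes, class_of

classes, class_of = label_powerset(Y)
print(classes)        # [0, 1, 0, 2]
print(len(class_of))  # 3 distinct combinations out of 2**3 = 8 possible ones
```

Every combination seen in training becomes its own class, so with many tags most classes end up with very few examples.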

On the other hand, **Classifier Chain** creates a chain of binary classifiers C0, C1, ..., Cn, where the first classifier is built using the input data and each following classifier is trained on the input data combined with the outputs of the previous classifiers in the chain. This way the method can take label correlations into account. It is a sequential process in which the output of one classifier is used as an input of the next classifier in the chain.
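
The chaining mechanism can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: the base learner is a 1-nearest-neighbour rule chosen for brevity (not ComplementNB), and `ToyClassifierChain` and `nearest_label` are hypothetical names, not the scikit-multilearn API.

```python
def nearest_label(rows, labels, query):
    """Toy base learner: return the label of the closest training row (1-nearest neighbour)."""
    def dist(row):
        return sum((a - b) ** 2 for a, b in zip(row, query))
    best = min(range(len(rows)), key=lambda k: dist(rows[k]))
    return labels[best]

class ToyClassifierChain:
    """Chain of binary classifiers; link j sees the features plus labels 0..j-1."""

    def fit(self, X, Y):
        self.links = []
        for j in range(len(Y[0])):
            # Training input for link j: original features + true values of the earlier labels
            augmented = [list(x) + list(y[:j]) for x, y in zip(X, Y)]
            targets = [y[j] for y in Y]
            self.links.append((augmented, targets))
        return self

    def predict(self, X):
        results = []
        for x in X:
            predicted = []
            for augmented, targets in self.links:
                # Prediction input for link j: features + predictions of the earlier links
                predicted.append(nearest_label(augmented, targets, list(x) + predicted))
            results.append(predicted)
        return results

# Made-up data: 4 samples, 2 features, 2 labels
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
Y = [[0, 0], [1, 1], [1, 0], [0, 1]]
chain = ToyClassifierChain().fit(X, Y)
print(chain.predict([[0, 1]]))  # [[1, 1]]
```

The key point is the growing input: the second link receives the prediction of the first one as an extra feature, which is how label correlations enter the model.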



Feature Selection

We'll implement two different ways of selecting features from our data and we'll see which one works better for our algorithm.

# 1. TF-IDF


We train our first model using TF-IDF.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB , ComplementNB , BernoulliNB , GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
tfidf_vectorizer = TfidfVectorizer()

# we fit the tfidf vectorizer with our train data
X_train_tfidf = tfidf_vectorizer.fit_transform(Train)

# we now apply the same vectorizer, already fitted on the train data, to the test set
X_test_tfidf = tfidf_vectorizer.transform(Test) 
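
For intuition about the weights the vectorizer produces, here is a minimal pure-Python sketch. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalisation, which matches the documented `TfidfVectorizer` defaults, but it splits on whitespace only and so is not a drop-in replacement. The `tfidf` helper and the three toy documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute L2-normalised TF-IDF vectors for whitespace-tokenised documents."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({tok for doc in tokenised for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(tok for doc in tokenised for tok in set(doc))
    # Smoothed inverse document frequency
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    vectors = []
    for doc in tokenised:
        counts = Counter(doc)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # avoid dividing by zero for empty documents
        vectors.append([v / norm for v in raw])
    return vocab, vectors

vocab, vectors = tfidf(["battery great", "battery poor", "screen great"])
weights = dict(zip(vocab, vectors[1]))
print(weights["poor"] > weights["battery"])  # True: the rarer word gets the larger weight
```

Words that appear in many sentences are down-weighted, which is the main difference from the raw counts used in the second model.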

 

In [None]:
y_train = (df_train[df_train.columns[1:]])
y_test = (df_test[df_test.columns[1:]])

In [None]:
classifier_tfidf = ClassifierChain(ComplementNB())
classifier_tfidf.fit(X_train_tfidf, y_train)
predictions_tfidf = classifier_tfidf.predict(X_test_tfidf)

cr_tfidf = classification_report(y_test, predictions_tfidf,zero_division = 0)
print("\n\nClassification Report\n")
print(cr_tfidf)




Classification Report

              precision    recall  f1-score   support

           0       0.00      0.00      0.00         2
           1       0.47      0.53      0.50        17
           2       0.00      0.00      0.00         5
           3       0.20      0.09      0.13        22
           4       0.08      0.06      0.07        16
           5       0.00      0.00      0.00         2
           6       0.00      0.00      0.00         4
           7       0.00      0.00      0.00         2
           8       0.00      0.00      0.00         7
           9       0.00      0.00      0.00         1
          10       0.17      0.33      0.22         3
          11       0.00      0.00      0.00         3
          12       0.00      0.00      0.00         2
          13       0.00      0.00      0.00         3
          14       0.00      0.00      0.00         5
          15       0.00      0.00      0.00         1
          16       0.43      0.21      0.29        14
  

# 2. Word Frequency
We train our second model using CountVectorizer.


In [None]:
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()

# we fit the count vectorizer with our train data
X_train_counts = count_vectorizer.fit_transform(Train)

# we now apply the same vectorizer, already fitted on the train data, to the test set
X_test_counts = count_vectorizer.transform(Test) 


classifier_count = ClassifierChain(ComplementNB())
classifier_count.fit(X_train_counts, y_train)
predictions_count = classifier_count.predict(X_test_counts)


cr_count = classification_report(y_test, predictions_count,zero_division = 0)
print("\n\nClassification Report\n")
print(cr_count)



Classification Report

              precision    recall  f1-score   support

           0       0.00      0.00      0.00         2
           1       0.27      0.53      0.36        17
           2       0.00      0.00      0.00         5
           3       0.14      0.14      0.14        22
           4       0.09      0.12      0.11        16
           5       0.00      0.00      0.00         2
           6       0.00      0.00      0.00         4
           7       0.00      0.00      0.00         2
           8       0.00      0.00      0.00         7
           9       0.00      0.00      0.00         1
          10       0.09      0.33      0.14         3
          11       0.00      0.00      0.00         3
          12       0.00      0.00      0.00         2
          13       0.00      0.00      0.00         3
          14       0.20      0.20      0.20         5
          15       0.00      0.00      0.00         1
          16       0.16      0.36      0.22        14
  

# Conclusion

TF-IDF works slightly better than CountVectorizer.
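
One way to put a single number on this comparison is a micro-averaged F1 score over all (sentence, tag) cells. The sketch below shows the computation in plain Python on made-up labels; the `micro_f1` helper is hypothetical, and scikit-learn's `f1_score` with `average='micro'` computes the same quantity.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over all (sentence, tag) cells of two 0/1 matrices."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    if tp == 0:
        return 0.0  # no correct positive prediction at all
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels for three sentences and three tags
y_true = [[1, 0, 0], [0, 1, 1], [1, 0, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [0, 1, 0]]
print(round(micro_f1(y_true, y_pred), 3))  # 0.571
```

Because micro-averaging pools all tags, frequent tags dominate the score; a macro average would weight every tag equally instead.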

By looking at the classification report, we notice that one of our biggest problems is the imbalanced data (many sentences have the LAPTOP#GENERAL#positive tag). This clearly affects the model, as there are not enough examples of each category to learn from. One way of improving our model is to generate synthetic data to make our data more balanced. Alternatively, we could add weights to our classes to make the smaller ones more important.
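
The weighting idea could start from something like the sketch below, which gives each tag a weight inversely proportional to its number of positive examples. The formula n_samples / (2 × n_positive) is the common "balanced" heuristic applied to the positive class of each binary tag; the helper name and the toy matrix are assumptions, and how such weights are passed to a model depends on the estimator.

```python
def balanced_positive_weights(label_rows):
    """Weight each tag inversely to its number of positive examples.

    Uses n_samples / (2 * n_positive) per tag. Tags with no positives get 0.0.
    """
    n_samples = len(label_rows)
    n_tags = len(label_rows[0])
    weights = []
    for j in range(n_tags):
        positives = sum(row[j] for row in label_rows)
        weights.append(n_samples / (2 * positives) if positives else 0.0)
    return weights

# Made-up label matrix: tag 0 is frequent, tag 1 is rare
labels = [[1, 0], [1, 0], [1, 1], [1, 0]]
print(balanced_positive_weights(labels))  # [0.5, 2.0]
```

The rare tag receives four times the weight of the frequent one, so its few examples count for more during training.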
 
We can also see that negative and neutral tags have a lower F1 score compared to the positive tags. One reason is the low number of examples for those tags, but negation also plays an important role. A better method of handling negation might help in this case.

Implementing a hierarchical model to handle multi-labeling and the polarity of the labels could also be an interesting approach to this problem. Another interesting idea would be to build two models: the first handles the aspect category task, and the second takes the predicted labels and generates their polarity.