<div style='font-size: 3em'>SentiLens - Uncover reviews' hidden emotion</div>

__Prepared by:__ Tina Vu<br>
__Date:__ 20231208<br>

SentiLens employs aspect-based sentiment analysis (ABSA) to extract feature-level insights from e-commerce product reviews, helping consumers make more informed purchasing decisions and improving their experience on the platform.

The model is trained on manually annotated reviews to extract aspects and predict their sentiments. Consumers get a condensed overview of the sentiment towards each product feature without reading through a long list of reviews, which makes the decision-making process more streamlined and user-friendly.

__Approach:__

ABSA

__Phase:__
1. Supervised ABSA (What, How)
2. Unsupervised ABSA
3. Add 'Why' into ABSA

<div style='font-size: 2em'>Phase 1 - Aspect Extraction</div>

**Table of contents**<a id='toc0_'></a>    
- 1. [Import & prepare dataset](#toc1_)    
  - 1.1. [Import data](#toc1_1_)    
  - 1.2. [Preparing dataset for modelling](#toc1_2_)    
    - 1.2.1. [Unified BIO tagging encode](#toc1_2_1_)    

<!-- vscode-jupyter-toc-config
	numbering=true
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

In [1]:
from datasets import load_dataset
import numpy as np
import pandas as pd
import re

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
en_stop_words = set(stopwords.words('english'))
# nltk.download()

from sklearn_crfsuite import CRF



# 1. <a id='toc1_'></a>[Import & prepare dataset](#toc0_)

## 1.1. <a id='toc1_1_'></a>[Import data](#toc0_)

We will load a laptop reviews dataset annotated with aspect terms and their sentiment.

The dataset comes in two parts:
- train: 3,048 records
- test: 800 records

Each record is a sentence with zero, one or multiple aspect terms. Each aspect term has the following features:
- start character index
- end character index
- sentiment / polarity (positive, negative, neutral or conflict)
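
As a quick sanity check of the annotation format described above, the sketch below slices a sentence with the `from` / `to` offsets and compares the result with the annotated term. The record is hand-written to mirror one dataset row, and `check_offsets` is a hypothetical helper, not part of the notebook. It assumes the offsets may be stored as strings and that `to` is exclusive.

```python
# One record shaped like the dataset rows (offsets may be stored as strings).
record = {
    "text": "I charge it at night and skip taking the cord with me "
            "because of the good battery life.",
    "aspects": [
        {"term": "cord", "polarity": "neutral", "from": "41", "to": "45"},
        {"term": "battery life", "polarity": "positive", "from": "74", "to": "86"},
    ],
}

def check_offsets(record):
    """Return (term, sliced_text, matches) for every non-empty aspect."""
    results = []
    for aspect in record["aspects"]:
        if aspect["term"] == "":
            continue  # placeholder entry for sentences without aspect terms
        start, end = int(aspect["from"]), int(aspect["to"])
        sliced = record["text"][start:end]  # 'to' is treated as exclusive
        results.append((aspect["term"], sliced, sliced == aspect["term"]))
    return results

offset_checks = check_offsets(record)
print(offset_checks)
```

Running the same check over every row is a cheap way to spot annotation offsets that do not line up with the text before any tagging is attempted.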

In [2]:
df_train = pd.read_json('data/laptop/train.json')
df_val = pd.read_json('data/laptop/validate.json')
df_train.shape

(3048, 3)

In [3]:
df_train.head()

| | id | text | aspects |
|---|---|---|---|
| 0 | 2339 | I charge it at night and skip taking the cord ... | [{'term': 'cord', 'polarity': 'neutral', 'from... |
| 1 | 812 | I bought a HP Pavilion DV4-1222nr laptop and h... | [{'term': '', 'polarity': '', 'from': 0, 'to':... |
| 2 | 1316 | The tech guy then said the service center does... | [{'term': 'service center', 'polarity': 'negat... |
| 3 | 2328 | I investigated netbooks and saw the Toshiba NB... | [{'term': '', 'polarity': '', 'from': 0, 'to':... |
| 4 | 2193 | The other day I had a presentation to do for a... | [{'term': '', 'polarity': '', 'from': 0, 'to':... |


## 1.2. <a id='toc1_2_'></a>[Preparing dataset for modelling](#toc0_)

We frame the task as Named Entity Recognition (NER), a sequence labeling task: for each sentence, we would like to predict whether each token (word) is part of an aspect term.

To prepare the data for the NER task, we need to label our tokens. Here, I implemented a unified BIO tagging technique which combines aspect boundaries and aspect sentiment.

Word boundaries:
- B: indicates the 1st word in the aspect term
- I: indicates the subsequent word in the aspect term
- O: indicates words that are not part of any aspect term

Aspect sentiment:
- POS: positive
- NEU: neutral
- NEG: negative
- CON: conflict

This BIO labelling technique is more effective at recognizing unigram and n-gram aspect terms than a binary classification (whether a token is part of an aspect). By using a unified approach, we can combine two tasks, aspect extraction and sentiment classification, into one.

In [4]:
# First, I will need to drop some duplicated data in our training dataset, as identified in the EDA process.
df_train.drop_duplicates(subset='text', inplace=True)

# We have removed 12 duplicated records in our training dataset
df_train.shape

(3036, 3)

### 1.2.1. <a id='toc1_2_1_'></a>[BIO tagging encode](#toc0_)

Here, I defined a function to encode our sentences' aspects using a unified BIO tagging technique (<a href='https://arxiv.org/pdf/1811.05082.pdf'>reference</a>) that combines aspect boundaries and aspect sentiment in a single label.

Word boundaries:
- B: indicates the 1st word in the aspect term
- I: indicates the subsequent word in the aspect term
- O: indicates words that are not part of any aspect term

Aspect sentiment:
- POS: positive
- NEU: neutral
- NEG: negative
- CON: conflict

Unified BIO tagging will be like: B-NEU, I-NEU

For example:
['I', 'charge', 'it', 'at', 'night', 'and', 'skip', 'taking', 'the', 'cord', 'with', 'me', 'because', 'of', 'the', 'good', 'battery', 'life', '.']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-NEU', 'O', 'O', 'O', 'O', 'O', 'O', 'B-POS', 'I-POS', 'O']
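
To make the tag scheme concrete, here is a minimal sketch that groups unified BIO tags back into (aspect term, sentiment) pairs. `decode_unified_bio` is a hypothetical helper, not part of the notebook; it assumes tags of the form `B-XXX` / `I-XXX` / `O` and treats an `I-` tag with no open span as the start of a new span.

```python
def decode_unified_bio(tokens, tags):
    """Group unified BIO tags back into (aspect_term, sentiment) pairs."""
    spans, current, sentiment = [], [], None
    for token, tag in zip(tokens, tags):
        boundary, _, label = tag.partition("-")  # 'B-POS' -> ('B', '-', 'POS')
        if boundary == "B" or (boundary == "I" and not current):
            if current:  # close the span that was still open
                spans.append((" ".join(current), sentiment))
            current, sentiment = [token], label
        elif boundary == "I":
            current.append(token)
        else:  # 'O' closes any open span
            if current:
                spans.append((" ".join(current), sentiment))
            current, sentiment = [], None
    if current:  # span that runs to the end of the sentence
        spans.append((" ".join(current), sentiment))
    return spans

tokens = ['I', 'charge', 'it', 'at', 'night', 'and', 'skip', 'taking', 'the', 'cord',
          'with', 'me', 'because', 'of', 'the', 'good', 'battery', 'life', '.']
tags = ['O'] * 9 + ['B-NEU'] + ['O'] * 6 + ['B-POS', 'I-POS', 'O']

decoded = decode_unified_bio(tokens, tags)
print(decoded)
```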

Since the dataset provides full sentences and annotates aspect terms by character index, we cannot run word_tokenize directly on the raw text, for two reasons:
  1. word_tokenize splits punctuation into separate tokens. Re-joining the tokens (using ' '.join(tokens)) therefore adds extra spaces between words and punctuation, which invalidates the character indexes of the aspect terms and makes it very difficult to align character indexes with word indexes.<br>
  E.g. in "I love pizza, cheese." the term indexes are (7,12), (14,20). Re-joining the tokens from word_tokenize (' '.join(tokens)) turns the sentence into "I love pizza__\<extra space\>__, cheese__\<extra space\>__.". The character indexes of the aspect terms now become (7,12), `(15,21)`.
  
  2. word_tokenize tends not to split words joined by special characters (other than spaces and common punctuation such as <,.:;>), while the dataset sometimes annotates those chunks as separate terms.<br>
  E.g. "size/screen" is a single token for word_tokenize, while the dataset annotates it as two separate terms.
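
The index drift from issue #1 can be reproduced without any tokenizer. In the sketch below the token list is hand-written to mimic what a punctuation-splitting tokenizer returns, which is an assumption made to keep the example self-contained.

```python
text = "I love pizza, cheese."
term_spans = [(7, 12), (14, 20)]  # character offsets of "pizza" and "cheese"

# Tokens as a punctuation-splitting tokenizer would return them (hand-written here).
tokens = ["I", "love", "pizza", ",", "cheese", "."]
rejoined = " ".join(tokens)  # a space is now inserted before ',' and '.'

original_terms = [text[start:end] for start, end in term_spans]
rejoined_terms = [rejoined[start:end] for start, end in term_spans]

print(repr(rejoined))
print(original_terms)
print(rejoined_terms)  # the second span no longer lines up with "cheese"
```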

Therefore, I applied the approach below to accommodate these shortcomings:
  1. Add aspect_prefix & aspect_suffix (padded with spaces to overcome issue #2) to the start & end of each aspect term in the sentence (to overcome issue #1), using the from / to char indexes supplied by the dataset
  2. Perform word_tokenize on the new aspect_annotated_sentence
  3. Perform BIO tagging on the sentence tokens
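
Step 1 can be sketched in isolation. The variant below is an alternative to shifting the offsets after each insertion: it inserts the markers from the last aspect to the first, so earlier offsets stay valid. `annotate_aspects` is a hypothetical helper, the marker strings are taken from the function that follows, and the sketch assumes aspect spans do not overlap. Plain `str.split()` stands in for word_tokenize to keep the example self-contained.

```python
def annotate_aspects(text, aspects, prefix=" XXATBXX", suffix="XXATEXX "):
    """Wrap every non-empty aspect term in marker strings, working right to left."""
    annotated = text
    real_aspects = [a for a in aspects if a["term"] != ""]
    # Inserting from the end of the sentence keeps the remaining offsets untouched.
    for aspect in sorted(real_aspects, key=lambda a: int(a["from"]), reverse=True):
        start, end = int(aspect["from"]), int(aspect["to"])
        annotated = (annotated[:start] + prefix + annotated[start:end]
                     + suffix + annotated[end:])
    return annotated

text = "The size/screen is great."
aspects = [
    {"term": "screen", "polarity": "positive", "from": "9", "to": "15"},
    {"term": "size", "polarity": "positive", "from": "4", "to": "8"},
]

annotated = annotate_aspects(text, aspects)
print(annotated.split())  # "size/screen" is now split around the markers
```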

In [5]:
def encode_unified_BIO (x, sentiment_tag=False):
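  '''  This function encodes the aspect terms of one record as BIO tags (optionally unified with sentiment) over the sentence's word tokens
  
  Parameter:
  - X: dataframe row (or dictionary) with
    - text: the raw sentence
    - aspects: dictionary array, each dictionary having
      - term
      - polarity
      - from (term start char index)
      - to (term end char index)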

    For example:
    [
      {'term':'cord', 'polarity':'neutral', 'from': 41, 'to': 45},
      {'term':'battery life', 'polarity':'positive', 'from': 74, 'to': 86}
    ]
  - SENTIMENT_TAG: boolean
    True: if we want to return sentiment polarity with BIO tagging (unified BIO)
    False: if we do not want to return sentiment polarity with BIO tagging (just pure BIO)
    
  Output: 
  - TEXT_TOKENS: array of string
    Sentence tokens
    E.g. ['Boot', 'time', 'is', 'super', 'fast', ',', 'around', 'anywhere', 'from', '35', 'seconds', 'to', '1', 'minute', '.']
    
  - TAGS: array of string
    Unified BIO tags (if the sentiment_tag parameter is set to True) or BIO tags (if sentiment_tag is set to False)
      Aspect boundaries:
        B: beginning of aspect term
        I: subsequent words of aspect term
        O: outside of aspect term
      Sentiment:
        POS: positive
        NEG: negative
        NEU: neutral
        CON: conflict
    E.g. ['B-POS', 'I-POS', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']

  - ASPECT_COMPUTE_PAIRS: tuple of two strings
    1st String combining of all aspect terms as provided by the dataset
    2nd String combining of all aspect terms from annotated BIO tagged using text_token
    E.g.(' Boot time', ' Boot time')

  - IS_INCORRECT_TAGGING: boolean
    True: when ASPECT_COMPUTE_PAIRS is different from each other
    False: when ASPECT_COMPUTE_PAIRS is exactly like each other
    This may return True even when the tags are correct, when special characters in an aspect term cause it to be split into different tokens.
    E.g. aspect_term `15" TV`, computed_term can return `15 " TV`. Thus, they do not match exactly, but the tagging should be okay
  '''

  aspects = x['aspects']
  aspects = sorted(aspects, key=lambda d: int(d['from']))  # sort aspects based on from, some aspects are not sorted: later terms in the sentence sometimes are placed before terms appear earlier. 
  
  text = x['text']

  sentiment_tag_map = {'neutral': '-NEU'
                       ,'conflict':'-CON'
                       ,'positive':'-POS'
                       ,'negative':'-NEG'}
  
  aspect_prefix = ' XXATBXX' # add leading space to break words if they are in the same chunk. E.g. "size/window" --> "size / window"
  aspect_suffix = 'XXATEXX ' # add trailing space

  # these are for validation to ensure the BIO tagging is accurate
  aspect_terms = ''
  aspect_terms_compute = ''

  # we cannot perform word_tokenize directly on the raw text due to:
  # 1. there is no space between punctuation and word, which makes it difficult to calculate word index from char index when concatenating word tokens for word index search
  # 2. terms can be partial of a word token, e.g. "size/screen" is a single token based on word_token, while terms defined this as two separate terms. 
  # Therefore, I applied the below approach:
  # 1. Add aspect_prefix & aspect_suffix to the start & end of each aspect term in the sentence using from, to char index as supplied by the dataset
  # 2. Perform word_tokenize on the new aspect_annotated_sentence
  # 3. Perform BIO tagging on the sentence token

  # 1. Add aspect prefix & suffix to the sentence
  aspect_annotated_sentence = text

  for i, k in enumerate(aspects):
    term = k['term']
    
    if k['term'] != '': # there are empty aspects but still have an empty dict structure, so we only perform tagging for those that has `term` != ''
      # this is for validation purposes only
      aspect_terms += ' ' + term 

      if k['polarity'] == '':
        print(x['id']) # report the record id; the aspect dictionary itself has no 'id' key

      polarity = sentiment_tag_map[k['polarity']] if sentiment_tag == True else '' # get polarity encode
  
      i_from = int(k['from']) + i * (len(aspect_prefix) + len(aspect_suffix) + len(polarity)) # re-calculate from & by shifting them by the length of additional character added for aspect prefix & suffix
      i_to = int(k['to']) + i * (len(aspect_prefix) + len(aspect_suffix) + len(polarity))
      
      aspect_annotated_sentence = aspect_annotated_sentence[:i_from] + aspect_prefix+ polarity + aspect_annotated_sentence[i_from:i_to] + aspect_suffix + aspect_annotated_sentence[i_to:]
  
  # Tokenize aspect annotated sentence
  text_tokens = word_tokenize(aspect_annotated_sentence)

  # Perform BIO tagging
  tags = []
  aspect_start = False
  polarity = ''

  for i,k in enumerate(text_tokens):
    tag = 'O' # default token tag as 'O' outside of aspect term

    if k[:7] == aspect_prefix.strip(): # if we see aspect prefix in a term, update tag as 'B' & set aspect_start as True
      aspect_start = True
      polarity = k[7:11] if sentiment_tag else '' # the 4-char polarity code is only present when sentiment_tag is True
      tag = 'B' + polarity
      
    elif aspect_start == True: # if token does not have aspect_prefix, set tag to I if aspect_start is still True
      tag = 'I' + polarity
      
    else: # if aspect_start is False or if there is no aspect prefix
      tag = 'O'

    polarity_pattern = '.{4}' if sentiment_tag else '' # only strip the polarity code when it was actually inserted
    text_tokens[i] = re.sub(aspect_prefix.strip() + polarity_pattern, '',text_tokens[i]).replace(aspect_suffix.strip(),'') # clean text token by removing aspect prefix, polarity & aspect suffix if any
    tags.append(tag) 
    
    if k[-7:] == aspect_suffix.strip(): # If token contains aspect_suffix, restart aspect_start as False and polarity as empty
      aspect_start = False
      polarity = ''
    
  # This is for validation purposes only
  for i, k in enumerate(tags):
    if k != 'O':
      aspect_terms_compute += ' ' + text_tokens[i]
  
  # This is for validation purposes only
  aspect_compute_pairs = (aspect_terms, aspect_terms_compute)
  is_incorrect_tagging = aspect_terms != aspect_terms_compute
  
  return pd.Series([text_tokens, tags, aspect_compute_pairs, is_incorrect_tagging])

In [7]:
df_train[['text_token','aspect_unified_bio','aspect_compute_pairs','is_incorrect_tagging']] = df_train.apply(lambda x: encode_unified_BIO(x, True), axis=1)
df_val[['text_token','aspect_unified_bio','aspect_compute_pairs','is_incorrect_tagging']] = df_val.apply(lambda x: encode_unified_BIO(x, True), axis=1)

After performing BIO encode, we want to check if there is any misclassification using is_incorrect_tagging.

There are only 20 possibly incorrect BIO labels in df_train and 10 in df_val.

Looking through a few examples, the issues are mainly special characters inside aspect terms that make the comparison strings differ, while the BIO labels themselves are still accurate.
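
One way to reduce these false alarms is to normalize both strings before comparing them. The sketch below is an assumption about what a looser check could look like, not the check used in this notebook: `normalize_for_comparison` and `is_real_mismatch` are hypothetical helpers that drop whitespace and map the tokenizer's `` and '' quote tokens back to a plain double quote.

```python
import re

def normalize_for_comparison(s):
    """Make strings comparable regardless of tokenizer spacing and quote style."""
    # Map the tokenizer's quote tokens (`` and '') back to a plain double quote.
    s = s.replace("``", '"').replace("''", '"')
    # Drop all whitespace so that token boundaries no longer matter.
    return re.sub(r"\s+", "", s)

def is_real_mismatch(pair):
    """True only if the two strings still differ after normalization."""
    expected, computed = pair
    return normalize_for_comparison(expected) != normalize_for_comparison(computed)

pair = (' tech guy service center "sales" team',
        " tech guy service center  '' sales '' team")

print(is_real_mismatch(pair))
```

Records that still differ after normalization are the ones worth inspecting by hand.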

In [53]:
print('# of possible incorrect labels in df_train: ', df_train['is_incorrect_tagging'].sum())
print('# of possible incorrect labels in df_val: ', df_val['is_incorrect_tagging'].sum())
print('\n\n Some examples')

n_samples = 2
sample = df_train[df_train['is_incorrect_tagging']].copy().reset_index()

for i in range(0, n_samples):
  s = sample.iloc[i]
  print(s['text_token'])
  print(s['aspects'])
  print(s['aspect_compute_pairs'])
  print(s['aspect_unified_bio'])
  print(len(s['text_token']))
  print(len(s['aspect_unified_bio']))
  print('\n')

# of possible incorrect labels in df_train:  20
# of possible incorrect labels in df_val:  10


 Some examples
['The', 'tech', 'guy', 'then', 'said', 'the', 'service', 'center', 'does', 'not', 'do', '1-to-1', 'exchange', 'and', 'I', 'have', 'to', 'direct', 'my', 'concern', 'to', 'the', '', "''", 'sales', "''", 'team', ',', 'which', 'is', 'the', 'retail', 'shop', 'which', 'I', 'bought', 'my', 'netbook', 'from', '.']
[{'term': 'service center', 'polarity': 'negative', 'from': '27', 'to': '41'}, {'term': '"sales" team', 'polarity': 'negative', 'from': '109', 'to': '121'}, {'term': 'tech guy', 'polarity': 'neutral', 'from': '4', 'to': '12'}]
(' tech guy service center "sales" team', " tech guy service center  '' sales '' team")
['O', 'B-NEU', 'I-NEU', 'O', 'O', 'O', 'B-NEG', 'I-NEG', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-NEG', 'I-NEG', 'I-NEG', 'I-NEG', 'I-NEG', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
40
40


['I', 'also', 'got'

### Word features
Here we will populate some word features for each token in the sentence, such as:
- word
- stemmed / lemmatized versions of the word
- part of speech (POS) of the word
- sentiment of the word
- context words within a pre-defined window (5 words surrounding the token)
- stemmed / lemmatized context words
- POS of the context words
- sentiment of the context words
- ...

The list is not exhaustive; features are added iteratively as we explore the data in EDA.

In [54]:
# Function to convert sentences into features
def word2features(sent, i, window_size=5): 
    word = sent[i]

    _, pos = zip(*nltk.pos_tag(sent))

    window_size = int((window_size - 1)/ 2 if (window_size % 2) == 1 else window_size / 2)
    
    features = {
        'word.lower()': word.lower(), # word
        'word.index()': i,
        'word.reverseindex()': len(sent) - 1 - i, # reverse index - nth word from end of sentence
        'word.pos': pos[i],
        'word.isstopword()': word in en_stop_words,
        'word[-3:]': word[-3:], # last 3 chars - in case of -ing, -ion, etc.
        'word[-2:]': word[-2:], # last 2 chars
        'word.isupper()': word.isupper(), # is the word in upper case
        'word.istitle()': word.istitle(), # is the first letter of the word in upper case
        'word.isdigit()': word.isdigit(), # is the word full of digit
        'word.isspecialchar()': re.sub(r'[^\w\s]', '', word.lower()) == '', # is punctuation/ special characters (commas included)
    }
    if i > 0:
        for k in range(1, min(window_size, i)+1):
            prev_word = sent[i - k]
            prev_pos = pos[i - k]
            
            features.update({
                f'-{k}:word.lower()': prev_word.lower(),
                f'-{k}:word.pos': prev_pos,
                f'-{k}:word.isstopword()': prev_word in en_stop_words,
                f'-{k}:word.istitle()': prev_word.istitle(),
                f'-{k}:word.isupper()': prev_word.isupper(),
                f'-{k}:word.isspecialchar()': re.sub(r'[^\w\s]', '', prev_word.lower()) == '', # is punctuation/ special characters
            })
    else:
        features['BOS'] = True  # Beginning of sentence

    if i < len(sent) - 1:
        for k in range(1, min(window_size, len(sent) - i - 1)+1):
            next_word = sent[i + k]
            next_pos = pos[i + k]

            features.update({
                f'+{k}:word.lower()': next_word.lower(),
                f'+{k}:word.pos': next_pos,
                f'+{k}:word.isstopword()': next_word in en_stop_words,
                f'+{k}:word.istitle()': next_word.istitle(),
                f'+{k}:word.isupper()': next_word.isupper(),
                f'+{k}:word.isspecialchar()': re.sub(r'[^\w\s]', '', next_word.lower()) == '', # is punctuation/ special characters
            })
    else:
        features['EOS'] = True  # End of sentence

    return features

# Function to convert sentences into feature sequences
def sent2features(sent, window_size=5):
    return [word2features(sent, i, window_size) for i in range(len(sent))]

In [56]:
X_train = [sent2features(sentence, 5) for sentence in df_train['text_token']]
y_train = df_train['aspect_unified_bio']

X_val = [sent2features(sentence, 5) for sentence in df_val['text_token']]
y_val = df_val['aspect_unified_bio']

### Define data processing functions
These functions help us streamline our data preparation and keep it consistent

In [57]:
def prepare_df(df, window_size=5):
  encoded = df.apply(lambda x: encode_unified_BIO(x, True), axis=1) # columns 0..3: tokens, tags, validation pairs, validation flag
  df['text_token'] = encoded[0]
  df['aspect_unified_bio'] = encoded[1]
  X =  [sent2features(sentence, window_size) for sentence in df['text_token']]
  y = df['aspect_unified_bio']
  return X,y

In [58]:
def sentences_to_features(sentences, window_size=5):
  sentences_token = [word_tokenize(sentence) for sentence in sentences]
  X =  [sent2features(sentence, window_size) for sentence in sentences_token]
  return X

# EDA

1. POS:
  - aspect - sentiment - neither
  - context
2. Word form (compact Xx)
3. Word index
4. Sentiment terms around aspects
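
For item 2, a compact word-shape feature can be sketched as follows. `word_shape` is a hypothetical helper, not part of the notebook; it assumes ASCII letters and digits, mapping uppercase letters to `X`, lowercase letters to `x` and digits to `d`, then collapsing repeated symbols.

```python
import re

def word_shape(word, compact=True):
    """Return the shape of a word, e.g. 'Toshiba' -> 'Xx' (compact) or 'Xxxxxxx'."""
    shape = re.sub(r"[A-Z]", "X", word)   # uppercase letters
    shape = re.sub(r"[a-z]", "x", shape)  # lowercase letters (the new 'X' is untouched)
    shape = re.sub(r"[0-9]", "d", shape)  # digits
    if compact:
        # Collapse runs of the same symbol: 'Xxxxxxx' -> 'Xx'
        shape = re.sub(r"(.)\1+", r"\1", shape)
    return shape

for w in ["Toshiba", "HP", "DV4-1222nr", "1-to-1"]:
    print(w, word_shape(w))
```

The compact form keeps the feature space small, which may help the model generalize to product codes it has not seen.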

# Random forest

- Scale the data
- Add regularization

## Convert X_train to tabular format

In [None]:
X_train_rf = pd.DataFrame()

def features2df(sentence_index, X):
  df = pd.DataFrame(X)
  df['nth_sentence'] = sentence_index
  return df

# X_train is a list of sentences, each sentence being a list of per-token feature dictionaries
X_train_rf = pd.concat([features2df(i,X) for i,X in enumerate(X_train)], ignore_index=True)
X_train_rf.head()

In [None]:
X_train_rf.replace(True,1, inplace=True)
X_train_rf.replace(False,0, inplace=True)
X_train_rf.fillna(-1, inplace=True)
X_train_rf.info()

In [None]:
cols = pd.Series(X_train_rf.columns)
drop_cols = cols[cols.str.contains(r'.*word\.lower\(\)', regex=True)]
drop_cols = pd.concat([drop_cols, cols[cols.str.contains(r'.*word\[-\d:\]', regex=True)]])
drop_cols

In [None]:
X_train_rf.drop(columns=drop_cols, inplace=True)
X_train_rf.info()

In [None]:
X_train_rf = pd.get_dummies(X_train_rf, drop_first=True)
X_train_rf.shape

In [None]:
labels = {"B-POS": 0, "I-POS": 1, "B-NEU": 2, "I-NEU": 3, "B-NEG": 4, "I-NEG": 5, "O":6}
y_train_rf = [labels[y] for sentence in y_train for y in sentence]
y_train_rf[:20]

In [None]:
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.ensemble import RandomForestClassifier

pred = cross_val_predict(RandomForestClassifier(n_estimators=20),X=X_train_rf.drop(columns='nth_sentence'), y=y_train_rf, cv=5)

In [None]:
from sklearn.metrics import classification_report
report = classification_report(y_pred=pred, y_true=y_train_rf)
print(report)

# CRF model

In [None]:
# Create and train CRF model
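crf_model = CRF(algorithm='lbfgs', c1=0.1, c2=0.1, max_iterations=100)
try:
  crf_model.fit(X_train, y_train)
except AttributeError:
  # Workaround: some sklearn_crfsuite / scikit-learn version combinations raise
  # an AttributeError at the end of fit(); the error is ignored here on the
  # assumption that training itself has completed.
  pass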

In [None]:
# def post_process_predictions(predictions):
#   for i in range(1, len(predictions)):
#       if predictions[i] == 'I' and predictions[i - 1] == 'O':
#           predictions[i] = 'O'  # Change 'I' to 'O' if not preceded by 'B'
#   return predictions
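
The commented-out idea above can be extended to unified tags. The sketch below is one possible variant, not the notebook's method: instead of discarding an orphan `I-` tag, the hypothetical `repair_bio_sequence` promotes it to a `B-` tag, and it also starts a new span when the sentiment label changes mid-span.

```python
def repair_bio_sequence(tags):
    """Promote 'I' tags that do not continue a matching span to 'B' tags."""
    repaired = []
    previous = "O"
    for tag in tags:
        if tag.startswith("I"):
            label = tag[1:]  # '' for pure BIO, '-POS' etc. for unified BIO
            # An 'I' tag is only valid right after a 'B' or 'I' with the same label.
            if previous == "O" or previous[1:] != label:
                tag = "B" + label
        repaired.append(tag)
        previous = tag
    return repaired

raw_prediction = ["O", "I-POS", "I-POS", "O", "B-NEG", "I-NEU"]
repaired_prediction = repair_bio_sequence(raw_prediction)
print(repaired_prediction)
```

Whether promoting or dropping orphan tags works better is an empirical question that would need to be checked on the validation set.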

# Model evaluation

## Sample visualization

In [None]:
import matplotlib.pyplot as plt
import matplotlib
import random
from highlight_text import HighlightText, ax_text, fig_text

def vizualize_samples (sentences, pred_tags, tags = None):

  fig, ax = plt.subplots(figsize=(30,10))
  font = {'family' : 'arial',
          'size'   : 16}
  matplotlib.rc('font', **font)
  final_text = []
  color = []
  pos_element = {"bbox": {"edgecolor": "Green", "facecolor": "#99FF00", "linewidth": 1.5, "pad": 1}} 
  neu_element = {"bbox": {"edgecolor": "Orange", "facecolor": "Yellow", "linewidth": 1.5, "pad": 1.5}} 
  neg_element = {"bbox": {"edgecolor": "Red", "facecolor": "#FF99CC", "linewidth": 1.5, "pad": 1}}
  
  color_map = {'POS': pos_element, 'NEU': neu_element, 'NEG':neg_element}

  for s in range(0, len(sentences)):
    final_text.append('(P) ')
    chunk = []
    next_tag = ''
    
    for w in range(0, len(sentences[s])):
      # print(w)
      word = sentences[s][w]
      tag = pred_tags[s][w]
      # print(tag[2:])  
      if w + 1 < len(sentences[s]):
        next_tag = pred_tags[s][w+1]
      else:
        next_tag = 'O'
      
      # print(word, tag, next_tag)

      if tag == 'O':
        final_text.append(word)
      elif (tag in ['I-POS','I-NEG', 'I-NEU','B-POS','B-NEG', 'B-NEU']):
        chunk.append(word)

      if next_tag in ['B-POS','B-NEG', 'B-NEU', 'O'] and len(chunk) > 0:
        final_text.append(f'  <{" ".join(chunk)}>  ')
        color.append(color_map[tag[2:]])
        chunk = []

    if tags is not None:
      final_text.append('\n')
      final_text.append('(A) ')
      for w in range(0, len(sentences[s])):
        # print(w)
        word = sentences[s][w]
        tag = tags[s][w]
        if w + 1 < len(sentences[s]):
          next_tag = tags[s][w+1]
        else:
          next_tag = 'O'
          
        if tag == 'O':
          final_text.append(word)
        elif (tag in ['I-POS','I-NEG', 'I-NEU','B-POS','B-NEG', 'B-NEU']):
          chunk.append(word)

        if next_tag in ['B-POS','B-NEG', 'B-NEU', 'O'] and len(chunk) > 0:
          final_text.append(f'  <{" ".join(chunk)}>  ')
          color.append(color_map[tag[2:]]) # colour the actual aspect by its own sentiment
          chunk = []
  


    

    # if tags is not None:
    #   final_text.append('\n')
    #   final_text.append('(A) ')
    #   for w in range(0, len(sentences[s])):
    #     word = sentences[s][w]
    #     tag = tags[s][w]

    #     if tag !='O':
    #       final_text.append('<{}>'.format(word))
    #       if tag in ['I-POS','I-NEG', 'I-NEU']:
    #         # color.append(color[-1])
    #         color.append(pos_element)
    #       else:
    #         color.append ({'color':random.choice(['blue','green','red','magenta'])})
    #     else:
    #       final_text.append(word)
    
    final_text.append('\n---------------\n')


  # Draw once, after all sentences have been collected
  HighlightText(x=0, y=1,
                s=' '.join(final_text),
                highlight_textprops=color,
                ax=ax)

  plt.axis('off')

In [None]:
samples = 10
integer = 285 #random.randint(0,500)
tags = y_val[integer:integer+samples].array
pred_tags = crf_model.predict(X_val[integer:integer+samples])
sentences = [[x['word.lower()'] for x in sentence] for sentence in X_val[integer:integer+samples]]

print(integer)
vizualize_samples(sentences, pred_tags, tags)


In [None]:
i = 0
nth = integer + i
print(nth)
print(df_val.iloc[nth]['text'])

print(df_val.iloc[nth]['aspects'])

print(df_val.iloc[nth]['aspect_compute_pairs'])

print(pred_tags[i])
print(tags[i])


In [None]:
# Test the model with new data
sample = pd.Series(['I was pleasantly surprised by the quality of this laptop for the money. '
            ,'I am not super techy so it may be difficult for me to comment on the technical specifications of the laptop, but I was pleasantly surprised with the quality of the product especially at this price point.'])
sample_text_token = [word_tokenize(sentence) for sentence in sample]
X_sample = [sent2features(sentence, 5) for sentence in sample_text_token]

predicted_labels = crf_model.predict(X_sample)

vizualize_samples(sample_text_token, predicted_labels)


In [None]:
import numpy as np
from sklearn.metrics import classification_report


tags = y_train
pred_tags = crf_model.predict(X_train)


# Create a mapping of labels to indices
labels = {"B-POS": 0, "I-POS": 1, "B-NEU": 2, "I-NEU": 3, "B-NEG": 4, "I-NEG": 5, "O":6}

# Convert the sequences of tags into a 1-dimensional array
predictions = np.array([labels[x] for x in sum([tag for tag in pred_tags],[])])
truths = np.array([labels[x] for x in sum([tag for tag in tags],[])])

# Print out the classification report
print(classification_report(
    truths, predictions,
    target_names= ["B-POS", "I-POS", "B-NEU", "I-NEU", "B-NEG", "I-NEG", "O"])
    )

In [None]:

tags = y_val
pred_tags = crf_model.predict(X_val)



# Convert the sequences of tags into a 1-dimensional array
predictions = np.array([labels[x] for x in sum([tag for tag in pred_tags],[])])
truths = np.array([labels[x] for x in sum([tag for tag in tags],[])])

# Print out the classification report
print(classification_report(
    truths, predictions,
    target_names=["B-POS", "I-POS", "B-NEU", "I-NEU", "B-NEG", "I-NEG", "O"])
    )
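
The token-level report above is dominated by the `O` class, so a span-level score can be a useful complement. The sketch below is an assumption about how such a score could be computed, not something the notebook already does: `extract_spans` and `span_scores` are hypothetical helpers, and a predicted aspect only counts as correct when its boundaries and its sentiment both match exactly.

```python
def extract_spans(tags):
    """Return a set of (start, end, label) for one tagged sentence; end is exclusive."""
    spans, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel 'O' flushes the last span
        boundary, _, tag_label = tag.partition("-")
        continues = boundary == "I" and start is not None and tag_label == label
        if not continues:
            if start is not None:  # close the span that was open
                spans.add((start, i, label))
                start, label = None, None
            if boundary in ("B", "I"):  # open a new span (orphan 'I' starts one too)
                start, label = i, tag_label
    return spans

def span_scores(true_sequences, pred_sequences):
    """Exact-match precision, recall and F1 over aspect spans."""
    tp = fp = fn = 0
    for true_tags, pred_tags in zip(true_sequences, pred_sequences):
        true_spans, pred_spans = extract_spans(true_tags), extract_spans(pred_tags)
        tp += len(true_spans & pred_spans)
        fp += len(pred_spans - true_spans)
        fn += len(true_spans - pred_spans)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

y_true_demo = [["O", "B-POS", "I-POS", "O", "B-NEG"]]
y_pred_demo = [["O", "B-POS", "I-POS", "O", "B-NEU"]]
demo_scores = span_scores(y_true_demo, y_pred_demo)
print(demo_scores)
```

With the notebook's variables, the call would look like `span_scores(y_val, crf_model.predict(X_val))`.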

# Next steps
- Add more features:
  - head words
  - Google Word2Vec cluster_id
  - Stemming / lemmatization
  - word index from beginning & end of the sentence
- Employ pre-trained word embeddings
- Re-train model using rule-based aspect term extraction on larger dataset