## Aspect-Based Sentiment Analysis: Findings from Natural Language
#### Code File \#5: Implementations for our Proposed Models (Task 3)

Tahmeed Tureen - University of Michigan, Ann Arbor<br>
Python file: <b>asba-proposed-model-Task3.ipynb</b> <br>
Description: Code that implements our proposed models/algorithms for SemEval-2014 Task 4 (Pontiki et al., 2014)

In [1]:
# Load up relevant libraries
import math
import pickle
import pandas as pd

In [2]:
# Read in the pickled train, test, and SemEval test sets (context managers close the files)
with open("pickled_data/pickled_train_data.pkl", "rb") as f:
    train_data = pickle.load(f)
print(train_data.shape)
with open("pickled_data/pickled_test_data.pkl", "rb") as f:
    test_data = pickle.load(f)
print(test_data.shape)
with open("pickled_data/semEval_TestData.pkl", "rb") as f:
    semEval_test_data = pickle.load(f)
print(semEval_test_data.shape)

(3156, 9)
(557, 9)
(1025, 5)


In [3]:
full_data = pd.concat([train_data, test_data], ignore_index=True)
print(full_data.shape)

(3713, 9)


### Phase A Task 3
We will use the following classifiers for aspect-category classification:

- Rule-based algorithm (using word embeddings and aspect terms)
- Multinomial Naive Bayes
- Support Vector Machines
- Logistic Regression

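We can take two approaches here: train a multi-class classifier or train a multi-label classifier.

- multi-class: the outcome is a single label per review (1-D)
- multi-label: the outcome is a vector with one binary entry per category, so a review can belong to several categories at once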
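
To make the multi-label encoding concrete, here is a minimal sketch using `MultiLabelBinarizer` on a few made-up reviews. The category names and the `toy_categories` list are illustrative assumptions, not the actual contents of `full_data.Categories`.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy labels: one list of categories per review
toy_categories = [
    ["food"],
    ["food", "service"],
    ["price"],
]

binarizer = MultiLabelBinarizer()
toy_Y = binarizer.fit_transform(toy_categories)

# Columns follow the sorted class names: food, price, service
print(binarizer.classes_)
# Each row is a binary indicator vector; the second review has two 1s
print(toy_Y)
```

A multi-class setup would instead keep a single label per review, so the second review could not be tagged with both `food` and `service`.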

In [4]:
## Load up relevant features from sklearn module
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.pipeline import Pipeline

#### Engineer the training and test sets

We need to convert the outcome into a multi-label format, i.e. one binary indicator column per aspect category.

In [36]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
import numpy as np

In [32]:
np.random.seed(1) # for reproducibility
# Need to convert the Category-Labels into Arrays
ml_Binarizer = MultiLabelBinarizer()
full_Y = ml_Binarizer.fit_transform(full_data.Categories)


full_X = full_data.Review

print(full_Y.shape) # Y's
print(full_X.shape) # reviews

(3713, 5)
(3713,)


In [33]:
# Need to do train - test split
train_X, test_X, train_Y, test_Y = train_test_split(full_X, full_Y, test_size = 0.15) 

print(train_X.shape)
print(test_X.shape)
print(train_Y.shape)
print(test_Y.shape)

(3156,)
(557,)
(3156, 5)
(557, 5)


### Naive Bayes

In [34]:
from sklearn.naive_bayes import MultinomialNB

In [44]:
# Unigram bag-of-words counts feeding one binary Naive Bayes classifier per category
pipeline_NB = Pipeline([('vect', CountVectorizer(stop_words = "english", ngram_range=(1,1))),
                        ('clf', OneVsRestClassifier(MultinomialNB(alpha = 0.1))),])
NB_classifier = pipeline_NB.fit(X = train_X, y = train_Y)
predict_Y = NB_classifier.predict(test_X)
# On multi-label targets, accuracy_score is the exact-match (subset) accuracy
print(round(accuracy_score(y_true=test_Y, y_pred=predict_Y),4))
print(round(f1_score(test_Y, predict_Y, average = 'weighted'), 4))

0.6302
0.8225


In [46]:
pipeline_NB = Pipeline([('vect', CountVectorizer(stop_words = "english", ngram_range=(1,2))),
                        ('clf', OneVsRestClassifier(MultinomialNB(alpha = 0.1))),])
NB_classifier = pipeline_NB.fit(X = train_X, y = train_Y)
predict_Y = NB_classifier.predict(test_X)
print(round(accuracy_score(y_true=test_Y, y_pred=predict_Y),4))
print(round(f1_score(test_Y, predict_Y, average = 'weighted'), 4))

0.6661
0.8256


In [47]:
pipeline_NB = Pipeline([('vect', CountVectorizer(stop_words = "english", ngram_range=(1,3))),
                        ('clf', OneVsRestClassifier(MultinomialNB(alpha = 0.1))),])
NB_classifier = pipeline_NB.fit(X = train_X, y = train_Y)
predict_Y = NB_classifier.predict(test_X)
print(round(accuracy_score(y_true=test_Y, y_pred=predict_Y),4))
print(round(f1_score(test_Y, predict_Y, average = 'weighted'), 4))

0.6715
0.8276


We beat the baseline metric! Let's look at some other statistical learning models.
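
Each cell prints two numbers: accuracy, then weighted F1. On multi-label targets, `accuracy_score` counts a review as correct only if every category matches, which is one reason the accuracy sits well below the F1 score. The sketch below illustrates the gap on made-up arrays; `y_true` and `y_pred` here are hypothetical toy data, not the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical toy data: 4 reviews, 3 categories
y_true = np.array([[1, 0, 0],
                   [1, 1, 0],
                   [0, 0, 1],
                   [0, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [1, 0, 0],   # misses the second category of this review
                   [0, 0, 1],
                   [0, 1, 0]])

# Exact-match accuracy: 3 of the 4 rows are fully correct
subset_acc = accuracy_score(y_true, y_pred)
# Per-category F1, averaged with weights equal to each category's support
weighted_f1 = f1_score(y_true, y_pred, average="weighted")

print(round(subset_acc, 4))   # 0.75
print(round(weighted_f1, 4))  # 0.8667
```

A single missed category costs the whole review under subset accuracy, while the weighted F1 only penalizes the one category that was missed.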

### SVM

In [12]:
from sklearn.svm import LinearSVC

#### unigrams

In [48]:
pipeline_SVM = Pipeline([('vect', CountVectorizer(stop_words= "english", ngram_range=(1,1))),
                        ('clf', OneVsRestClassifier(LinearSVC(penalty='l2')))])
SVM_classifier = pipeline_SVM.fit(X = train_X, y = train_Y)
predict_Y = SVM_classifier.predict(test_X)
print(round(accuracy_score(y_true=test_Y, y_pred=predict_Y),4))
print(round(f1_score(test_Y, predict_Y, average = 'weighted'), 4))

0.7828
0.8851


#### bigrams

In [49]:
pipeline_SVM = Pipeline([('vect', CountVectorizer(stop_words= "english", ngram_range=(1,2))),
                        ('clf', OneVsRestClassifier(LinearSVC(penalty='l2')))])
SVM_classifier = pipeline_SVM.fit(X = train_X, y = train_Y)
predict_Y = SVM_classifier.predict(test_X)
print(round(accuracy_score(y_true=test_Y, y_pred=predict_Y),4))
print(round(f1_score(test_Y, predict_Y, average = 'weighted'), 4))

0.7935
0.8834


#### trigrams

In [50]:
pipeline_SVM = Pipeline([('vect', CountVectorizer(stop_words= "english", ngram_range=(1,3))),
                        ('clf', OneVsRestClassifier(LinearSVC(penalty='l2')))])
SVM_classifier = pipeline_SVM.fit(X = train_X, y = train_Y)
predict_Y = SVM_classifier.predict(test_X)

print(round(accuracy_score(y_true=test_Y, y_pred=predict_Y),4))
print(round(f1_score(test_Y, predict_Y, average = 'weighted'), 4))

0.7882
0.8787


### Logistic Regression

In [16]:
from sklearn.linear_model import LogisticRegression

#### unigrams

In [52]:
pipeline_LogitRegression = Pipeline([('vect', CountVectorizer(stop_words= "english")),
                        ('clf', OneVsRestClassifier(LogisticRegression()))])
LR_classifier = pipeline_LogitRegression.fit(X = train_X, y = train_Y)
predict_Y = LR_classifier.predict(test_X)

print(round(accuracy_score(y_true=test_Y, y_pred=predict_Y),4))
print(round(f1_score(test_Y, predict_Y, average = 'weighted'), 4))

0.7271
0.8636


#### bigrams

In [51]:
pipeline_LogitRegression = Pipeline([('vect', CountVectorizer(stop_words= "english", ngram_range=(1,2))),
                        ('clf', OneVsRestClassifier(LogisticRegression()))])
LR_classifier = pipeline_LogitRegression.fit(X = train_X, y = train_Y)
predict_Y = LR_classifier.predict(test_X)

print(round(accuracy_score(y_true=test_Y, y_pred=predict_Y),4))
print(round(f1_score(test_Y, predict_Y, average = 'weighted'), 4))

0.754
0.8755


#### trigrams

In [53]:
pipeline_LogitRegression = Pipeline([('vect', CountVectorizer(stop_words= "english", ngram_range=(1,3))),
                        ('clf', OneVsRestClassifier(LogisticRegression()))])
LR_classifier = pipeline_LogitRegression.fit(X = train_X, y = train_Y)
predict_Y = LR_classifier.predict(test_X)

print(round(accuracy_score(y_true=test_Y, y_pred=predict_Y),4))
print(round(f1_score(test_Y, predict_Y, average = 'weighted'), 4))

0.754
0.8729


### Rule-Based Approach

- Assume that we have successfully extracted the aspect terms
- Compute the word similarity between each aspect term and each aspect category
- Assign each aspect term the category with the highest similarity
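
The steps above can be sketched as follows. This is only an illustration under stated assumptions: the 3-dimensional vectors in `toy_embeddings`, the category names, and the helper `assign_category` are hypothetical stand-ins for real pretrained word embeddings, and cosine similarity is assumed as the similarity measure.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings; real ones would come from a pretrained model
toy_embeddings = {
    "food":    [0.9, 0.1, 0.0],
    "service": [0.1, 0.9, 0.1],
    "price":   [0.0, 0.2, 0.9],
    "pizza":   [0.8, 0.2, 0.1],
    "waiter":  [0.2, 0.8, 0.0],
    "bill":    [0.1, 0.3, 0.8],
}
toy_categories = ["food", "service", "price"]

def assign_category(aspect_term, categories, embeddings):
    """Return the category whose embedding is most similar to the aspect term."""
    term_vec = embeddings[aspect_term]
    return max(categories,
               key=lambda c: cosine_similarity(term_vec, embeddings[c]))

for term in ["pizza", "waiter", "bill"]:
    print(term, "->", assign_category(term, toy_categories, toy_embeddings))
```

With these toy vectors, `pizza` maps to `food`, `waiter` to `service`, and `bill` to `price`. A real implementation would also need a fallback for aspect terms missing from the embedding vocabulary.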