## Aspect-Based Sentiment Analysis: Findings from Natural Language
#### Code File \#4: Implementations for our Proposed Models (Task 3)

Tahmeed Tureen - University of Michigan, Ann Arbor<br>
Python file: <b>asba-proposed-model-Task3.ipynb</b> <br>
Description: Code that implements our proposed models/algorithms for SemEval-2014 Task 4 (Pontiki et al., 2014)

In [14]:
# Load up relevant libraries
import math
import pickle
import pandas as pd

In [15]:
# Read in Data
train_data = pickle.load(open("pickled_data/pickled_train_data.pkl", "rb"))
print(train_data.shape)
test_data = pickle.load(open("pickled_data/pickled_test_data.pkl", "rb"))
print(test_data.shape)
semEval_test_data = pickle.load(open("pickled_data/semEval_TestData.pkl", "rb"))
print(semEval_test_data.shape)

(3156, 9)
(557, 9)
(1025, 5)


In [107]:
# Combine the train and test sets into one DataFrame so we can re-split them ourselves below

full_data = pd.concat([train_data, test_data], ignore_index=True)
print(full_data.shape)

(3713, 9)


### Phase A Task 3
We will use three classifiers for aspect-category classification:

- Rule Based Algorithm (using word embeddings and aspect terms)
- Multinomial Naive Bayes
- Support Vector Machines

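We can take two approaches here: train a multi-class classifier or train a multi-label classifier.

- multi-class: each review gets exactly one category, so the outcome is a single label
- multi-label: each review can have several categories, so the outcome is a binary indicator vector with one entry per category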
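
To make the difference concrete, here is a minimal sketch on three made-up reviews (the category lists are illustrative, not taken from the dataset, and `toy_categories` / `toy_binarizer` are hypothetical names). A multi-class target stores one label per review, while a multi-label target is a 0/1 indicator matrix with one column per category, which is the form `MultiLabelBinarizer` produces.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical category lists for three reviews (illustrative only)
toy_categories = [["food"], ["food", "service"], ["price"]]

# Multi-class view: exactly one label per review
# (here we simply keep the first category and drop the rest)
multi_class_y = [cats[0] for cats in toy_categories]
print(multi_class_y)           # ['food', 'food', 'price']

# Multi-label view: one 0/1 column per category, several may be set per row
toy_binarizer = MultiLabelBinarizer()
multi_label_y = toy_binarizer.fit_transform(toy_categories)
print(toy_binarizer.classes_)  # ['food' 'price' 'service']
print(multi_label_y)
# [[1 0 0]
#  [1 0 1]
#  [0 1 0]]
```

Note how the multi-class view loses the `service` label of the second review, which is why the multi-label form suits reviews that mention several aspects.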

In [25]:
# Load the relevant classes from scikit-learn
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.pipeline import Pipeline

#### Engineer the training and test sets

We need to convert the outcome (the list of categories attached to each review) into a multi-label indicator matrix.

In [61]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np

In [102]:
# Need to convert the Category-Labels into Arrays
ml_Binarizer = MultiLabelBinarizer()
full_Y = ml_Binarizer.fit_transform(full_data.Categories)
full_X = full_data.Review

print(full_Y.shape) # Y's
print(full_X.shape) # reviews

(3713, 5)
(3713,)


In [103]:
# Need to do train - test split
train_X, test_X, train_Y, test_Y = train_test_split(full_X, full_Y, test_size = 0.15) 

print(train_X.shape)
print(test_X.shape)
print(train_Y.shape)
print(test_Y.shape)

(3156,)
(557,)
(3156, 5)
(557, 5)
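
The split above is random, so unless a seed is set elsewhere the partitions (and the scores that depend on them) can change from run to run. A minimal sketch of how the split could be made reproducible, assuming a fixed `random_state`; the value 42 and the toy arrays `toy_X` / `toy_Y` are arbitrary choices for illustration, not part of the original notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for full_X / full_Y (illustrative only)
toy_X = np.array(["review %d" % i for i in range(20)])
toy_Y = np.arange(40).reshape(20, 2)

# Fixing random_state makes the shuffle deterministic
split_a = train_test_split(toy_X, toy_Y, test_size=0.15, random_state=42)
split_b = train_test_split(toy_X, toy_Y, test_size=0.15, random_state=42)

# Both calls return identical train/test partitions
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)  # True
```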


### Naive Bayes

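#### unigrams

In [184]:
# MultinomialNB is not among the imports above, so load it here
from sklearn.naive_bayes import MultinomialNB

pipeline_NB = Pipeline([('vect', CountVectorizer(stop_words = "english", ngram_range=(1,1))),
                        ('clf', OneVsRestClassifier(MultinomialNB(alpha = 0.1))),])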
NB_classifier = pipeline_NB.fit(X = train_X, y = train_Y)
predict_Y = NB_classifier.predict(test_X)
accuracy_score(y_true=test_Y, y_pred=predict_Y)

0.6588868940754039

#### bigrams

In [185]:
pipeline_NB = Pipeline([('vect', CountVectorizer(stop_words = "english", ngram_range=(1,2))),
                        ('clf', OneVsRestClassifier(MultinomialNB(alpha = 0.1))),])
NB_classifier = pipeline_NB.fit(X = train_X, y = train_Y)
predict_Y = NB_classifier.predict(test_X)
accuracy_score(y_true=test_Y, y_pred=predict_Y)

0.7073608617594255

#### trigrams

In [186]:
pipeline_NB = Pipeline([('vect', CountVectorizer(stop_words = "english", ngram_range=(1,3))),
                        ('clf', OneVsRestClassifier(MultinomialNB(alpha = 0.1))),])
NB_classifier = pipeline_NB.fit(X = train_X, y = train_Y)
predict_Y = NB_classifier.predict(test_X)
accuracy_score(y_true=test_Y, y_pred=predict_Y)

0.7217235188509874

We beat the baseline metric! Let's look at some other statistical learning models.
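
One caveat when reading these numbers: on multi-label indicator arrays, `accuracy_score` computes subset accuracy, so a review counts as correct only if every one of its category flags matches. The sketch below uses small made-up arrays (`toy_true` and `toy_pred` are hypothetical, not model output) to show how this strict score compares with per-label measures such as Hamming loss and micro-averaged F1.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, hamming_loss

# Made-up indicator matrices: 4 reviews x 3 categories (illustrative only)
toy_true = np.array([[1, 0, 0], [1, 0, 1], [0, 1, 0], [0, 0, 1]])
toy_pred = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 1]])

# Subset accuracy: only rows 0 and 2 match exactly, so 2 / 4
subset_acc = accuracy_score(toy_true, toy_pred)
print(subset_acc)          # 0.5

# Hamming loss: 2 wrong flags out of 12
ham = hamming_loss(toy_true, toy_pred)
print(round(ham, 4))       # 0.1667

# Micro F1: 4 true positives, 1 false positive, 1 false negative
micro_f1 = f1_score(toy_true, toy_pred, average="micro")
print(round(micro_f1, 4))  # 0.8
```

Two of the four predictions miss by a single flag, yet subset accuracy scores them as fully wrong, so it tends to understate how many individual labels a model gets right.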

### SVM

In [108]:
from sklearn.svm import LinearSVC

#### unigrams

In [187]:
pipeline_SVM = Pipeline([('vect', CountVectorizer(stop_words= "english", ngram_range=(1,1))),
                        ('clf', OneVsRestClassifier(LinearSVC(penalty='l2')))])
SVM_classifier = pipeline_SVM.fit(X = train_X, y = train_Y)
predict_Y = SVM_classifier.predict(test_X)
accuracy_score(y_true=test_Y, y_pred=predict_Y)

0.7630161579892281

#### bigrams

In [188]:
pipeline_SVM = Pipeline([('vect', CountVectorizer(stop_words= "english", ngram_range=(1,2))),
                        ('clf', OneVsRestClassifier(LinearSVC(penalty='l2')))])
SVM_classifier = pipeline_SVM.fit(X = train_X, y = train_Y)
predict_Y = SVM_classifier.predict(test_X)
accuracy_score(y_true=test_Y, y_pred=predict_Y)

0.7791741472172352

#### trigrams

In [189]:
pipeline_SVM = Pipeline([('vect', CountVectorizer(stop_words= "english", ngram_range=(1,3))),
                        ('clf', OneVsRestClassifier(LinearSVC(penalty='l2')))])
SVM_classifier = pipeline_SVM.fit(X = train_X, y = train_Y)
predict_Y = SVM_classifier.predict(test_X)
accuracy_score(y_true=test_Y, y_pred=predict_Y)

0.7719928186714542

### Logistic Regression

In [150]:
from sklearn.linear_model import LogisticRegression

#### unigrams

In [190]:
pipeline_LogitRegression = Pipeline([('vect', CountVectorizer(stop_words= "english")),
                        ('clf', OneVsRestClassifier(LogisticRegression()))])
LR_classifier = pipeline_LogitRegression.fit(X = train_X, y = train_Y)
predict_Y = LR_classifier.predict(test_X)
accuracy_score(y_true=test_Y, y_pred=predict_Y)

0.7127468581687613

#### bigrams

In [191]:
pipeline_LogitRegression = Pipeline([('vect', CountVectorizer(stop_words= "english", ngram_range=(1,2))),
                        ('clf', OneVsRestClassifier(LogisticRegression()))])
LR_classifier = pipeline_LogitRegression.fit(X = train_X, y = train_Y)
predict_Y = LR_classifier.predict(test_X)
accuracy_score(y_true=test_Y, y_pred=predict_Y)

0.7504488330341114

#### trigrams

In [194]:
pipeline_LogitRegression = Pipeline([('vect', CountVectorizer(stop_words= "english", ngram_range=(1,3))),
                        ('clf', OneVsRestClassifier(LogisticRegression()))])
LR_classifier = pipeline_LogitRegression.fit(X = train_X, y = train_Y)
predict_Y = LR_classifier.predict(test_X)
accuracy_score(y_true=test_Y, y_pred=predict_Y)

0.748653500897666
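
The cells above differ only in the classifier and in the upper bound of `ngram_range` (note that `ngram_range=(1, n)` keeps every n-gram length from 1 up to n, so the "bigram" runs also include unigrams). The comparison could therefore be written as one loop. This is a minimal sketch under stated assumptions: it uses a tiny made-up corpus in place of the SemEval reviews, and `toy_reviews`, `toy_labels`, `classifiers`, and `results` are hypothetical names that do not appear in the original notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Tiny made-up corpus (illustrative only, not the SemEval data)
toy_reviews = [
    "The pasta was delicious", "Our waiter was rude", "Prices are too high",
    "Great pizza and friendly staff", "Cheap and tasty burgers",
    "The service was slow but the bill was small",
    "Amazing dessert", "Expensive menu with rude staff",
]
toy_labels = [["food"], ["service"], ["price"], ["food", "service"],
              ["food", "price"], ["price", "service"], ["food"],
              ["price", "service"]]
toy_Y = MultiLabelBinarizer().fit_transform(toy_labels)

# Factories so that every run gets a fresh, unfitted estimator
classifiers = {
    "NB": lambda: MultinomialNB(alpha=0.1),
    "SVM": lambda: LinearSVC(penalty="l2"),
    "LR": lambda: LogisticRegression(),
}

results = {}
for name, make_clf in classifiers.items():
    for max_n in (1, 2, 3):
        pipeline = Pipeline([
            ("vect", CountVectorizer(stop_words="english",
                                     ngram_range=(1, max_n))),
            ("clf", OneVsRestClassifier(make_clf())),
        ])
        pipeline.fit(toy_reviews, toy_Y)
        # Scored on the training reviews only to keep the sketch short;
        # in the notebook this would be test_X / test_Y
        predictions = pipeline.predict(toy_reviews)
        results[(name, max_n)] = accuracy_score(toy_Y, predictions)

for key, score in sorted(results.items()):
    print(key, round(score, 3))
```

Collecting the scores in one dictionary also makes it easy to tabulate them side by side instead of scrolling between cells.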

### Rule-Based Approach

- Assume that we have successfully extracted the aspect terms
- Compute the word-embedding similarity between each aspect term and each aspect category
- Assign each aspect term the category with the highest similarity
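
The steps above can be sketched as follows. This is a hedged illustration, not the notebook's implementation: cosine similarity is an assumed choice of similarity measure, the 3-dimensional vectors are made up (real ones would come from a pretrained word-embedding model), and `category_vectors`, `term_vectors`, `cosine_similarity`, and `assign_category` are hypothetical names.

```python
import math

# Hypothetical 3-d embeddings, made up for illustration only
category_vectors = {
    "food": [0.9, 0.1, 0.0],
    "service": [0.1, 0.9, 0.1],
    "price": [0.0, 0.1, 0.9],
}
term_vectors = {
    "pasta": [0.8, 0.2, 0.1],
    "waiter": [0.2, 0.8, 0.0],
    "bill": [0.1, 0.2, 0.7],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def assign_category(term):
    """Return the category whose vector is most similar to the term's vector."""
    similarities = {
        category: cosine_similarity(term_vectors[term], vector)
        for category, vector in category_vectors.items()
    }
    return max(similarities, key=similarities.get)

for term in term_vectors:
    print(term, "->", assign_category(term))
# pasta -> food
# waiter -> service
# bill -> price
```

A review with several aspect terms would collect the category of each term, which yields a multi-label prediction comparable to the classifiers above.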