## Aspect-Based Sentiment Analysis: Findings from Natural Language
#### Code File \#4: Implementations for our Proposed Models (Task 4)

Tahmeed Tureen - University of Michigan, Ann Arbor<br>
Python file: <b>asba-proposed-models-Task2.ipynb</b> <br>
Description: Code that implements our proposed models/algorithms for SemEval-2014 Task 4 (Pontiki et al., 2014)

In [1]:
import pickle
import pandas as pd

In [2]:
# Read in Data (context managers close each file handle after loading)
with open("pickled_data/pickled_train_data.pkl", "rb") as f:
    train_data = pickle.load(f)
print(train_data.shape)
with open("pickled_data/pickled_test_data.pkl", "rb") as f:
    test_data = pickle.load(f)
print(test_data.shape)
with open("pickled_data/semEval_TestData.pkl", "rb") as f:
    semEval_test_data = pickle.load(f)
print(semEval_test_data.shape)

(3156, 9)
(557, 9)
(1025, 5)


In [3]:
full_data = pd.concat([train_data, test_data], ignore_index=True)
print(full_data.shape)

(3713, 9)


### Phase A Task 4
We will use three supervised classifiers and one unsupervised algorithm for Category-Polarity classification:

**Supervised**:
- Multinomial Naive Bayes
- Support Vector Machines
- Logistic Regression
- (each is run with unigram, unigram + bigram, and unigram + bigram + trigram features)

**Unsupervised**:
- Rule Based Algorithm (using word embeddings, POS tagging, Dependency Parsing, and Opinion Lexicons)
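
To make the n-gram settings used by the supervised models concrete, here is a minimal sketch of what `CountVectorizer` extracts with `ngram_range=(1,2)`. The example sentence is made up for illustration and is not taken from the dataset. Note that English stop words are removed before n-grams are formed, so a bigram can join two words that were not adjacent in the original text.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical review text, used only for illustration
toy_review = ["the food was great"]

# Same vectorizer settings as the bigram runs in this notebook
toy_vectorizer = CountVectorizer(stop_words="english", ngram_range=(1, 2))
toy_vectorizer.fit(toy_review)

# "the" and "was" are dropped as stop words, then unigrams and bigrams are built
toy_features = sorted(toy_vectorizer.vocabulary_)
print(toy_features)  # ['food', 'food great', 'great']
```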

In [4]:
## Load the relevant classes from scikit-learn
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.pipeline import Pipeline

In [5]:
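# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split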
from sklearn.metrics import accuracy_score
import numpy as np

In [6]:
np.random.seed(1) # for reproducibility
# Need to convert the Category-Labels into Arrays
ml_Binarizer = MultiLabelBinarizer()
full_Y = ml_Binarizer.fit_transform(full_data.Category_Polarities)
full_X = full_data.Review

print(full_Y.shape) # Y's
print(full_X.shape) # reviews

(3713, 3)
(3713,)
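
For readers unfamiliar with `MultiLabelBinarizer`, the sketch below shows how it turns a list of label sets into a binary indicator matrix. The label strings are hypothetical; the actual values stored in `Category_Polarities` may look different.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical label sets, one per review (not the real dataset values)
toy_labels = [
    ["food-positive"],
    ["service-negative", "food-positive"],
    ["price-neutral"],
]

toy_binarizer = MultiLabelBinarizer()
toy_Y = toy_binarizer.fit_transform(toy_labels)

# classes_ holds the sorted label vocabulary, i.e. the column order of toy_Y
print(list(toy_binarizer.classes_))
# Each row is a review; a 1 marks that the label in that column applies
print(toy_Y)
```

The fitted binarizer can map predictions back to label sets with `inverse_transform`.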


In [7]:
# Hold out 15% of the reviews for testing (the shuffle uses the NumPy seed set above)
train_X, test_X, train_Y, test_Y = train_test_split(full_X, full_Y, test_size=0.15)

print(train_X.shape)
print(test_X.shape)
print(train_Y.shape)
print(test_Y.shape)

(3156,)
(557,)
(3156, 3)
(557, 3)


### Naive Bayes

In [8]:
from sklearn.naive_bayes import MultinomialNB

In [9]:
pipeline_NB = Pipeline([('vect', CountVectorizer(stop_words = "english", ngram_range=(1,1))),
                        ('clf', OneVsRestClassifier(MultinomialNB(alpha = 0.1))),])
NB_classifier = pipeline_NB.fit(X = train_X, y = train_Y)
predict_Y = NB_classifier.predict(test_X)
round(accuracy_score(y_true=test_Y, y_pred=predict_Y),4)

0.6499
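
When interpreting these scores, note that `accuracy_score` applied to a multilabel indicator matrix computes subset accuracy: a review counts as correct only if every label column is predicted correctly. The sketch below, on made-up arrays, contrasts this with the more lenient per-label accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up ground truth and predictions for 4 reviews and 3 labels
toy_true = np.array([[1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 1]])
toy_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]])

# Subset accuracy: rows 0, 2 and 3 match exactly, row 1 misses one label
subset_acc = accuracy_score(y_true=toy_true, y_pred=toy_pred)
print(subset_acc)  # 0.75

# Per-label accuracy: 11 of the 12 individual cells agree
per_label_acc = (toy_true == toy_pred).mean()
print(round(per_label_acc, 4))  # 0.9167
```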

In [10]:
pipeline_NB = Pipeline([('vect', CountVectorizer(stop_words = "english", ngram_range=(1,2))),
                        ('clf', OneVsRestClassifier(MultinomialNB(alpha = 0.1))),])
NB_classifier = pipeline_NB.fit(X = train_X, y = train_Y)
predict_Y = NB_classifier.predict(test_X)
round(accuracy_score(y_true=test_Y, y_pred=predict_Y),4)

0.6786

In [11]:
pipeline_NB = Pipeline([('vect', CountVectorizer(stop_words = "english", ngram_range=(1,3))),
                        ('clf', OneVsRestClassifier(MultinomialNB(alpha = 0.1))),])
NB_classifier = pipeline_NB.fit(X = train_X, y = train_Y)
predict_Y = NB_classifier.predict(test_X)

round(accuracy_score(y_true=test_Y, y_pred=predict_Y),4)

0.6715

### Support Vector Machines (SVM)

In [12]:
from sklearn.svm import LinearSVC

In [13]:
pipeline_SVM = Pipeline([('vect', CountVectorizer(stop_words= "english", ngram_range=(1,1))),
                        ('clf', OneVsRestClassifier(LinearSVC(penalty='l2')))])
SVM_classifier = pipeline_SVM.fit(X = train_X, y = train_Y)
predict_Y = SVM_classifier.predict(test_X)

round(accuracy_score(y_true=test_Y, y_pred=predict_Y),4)

0.6517

In [14]:
pipeline_SVM = Pipeline([('vect', CountVectorizer(stop_words= "english", ngram_range=(1,2))),
                        ('clf', OneVsRestClassifier(LinearSVC(penalty='l2')))])
SVM_classifier = pipeline_SVM.fit(X = train_X, y = train_Y)
predict_Y = SVM_classifier.predict(test_X)

round(accuracy_score(y_true=test_Y, y_pred=predict_Y),4)

0.6697

In [15]:
pipeline_SVM = Pipeline([('vect', CountVectorizer(stop_words= "english", ngram_range=(1,3))),
                        ('clf', OneVsRestClassifier(LinearSVC(penalty='l2')))])
SVM_classifier = pipeline_SVM.fit(X = train_X, y = train_Y)
predict_Y = SVM_classifier.predict(test_X)
round(accuracy_score(y_true=test_Y, y_pred=predict_Y),4)

0.6679

### Logistic Regression

In [16]:
from sklearn.linear_model import LogisticRegression

In [17]:
pipeline_LogitRegression = Pipeline([('vect', CountVectorizer(stop_words= "english", ngram_range=(1,1))),
                        ('clf', OneVsRestClassifier(LogisticRegression()))])
LR_classifier = pipeline_LogitRegression.fit(X = train_X, y = train_Y)
predict_Y = LR_classifier.predict(test_X)

round(accuracy_score(y_true=test_Y, y_pred=predict_Y),4)

0.6427

In [18]:
pipeline_LogitRegression = Pipeline([('vect', CountVectorizer(stop_words= "english", ngram_range=(1,2))),
                        ('clf', OneVsRestClassifier(LogisticRegression()))])
LR_classifier = pipeline_LogitRegression.fit(X = train_X, y = train_Y)
predict_Y = LR_classifier.predict(test_X)

round(accuracy_score(y_true=test_Y, y_pred=predict_Y),4)

0.6697

In [19]:
pipeline_LogitRegression = Pipeline([('vect', CountVectorizer(stop_words= "english", ngram_range=(1,3))),
                        ('clf', OneVsRestClassifier(LogisticRegression()))])
LR_classifier = pipeline_LogitRegression.fit(X = train_X, y = train_Y)
predict_Y = LR_classifier.predict(test_X)

round(accuracy_score(y_true=test_Y, y_pred=predict_Y),4)

0.6607
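
The nine cells above repeat one pipeline with different classifiers and n-gram ranges. As a sketch, the same comparison could be written as a single loop. The version below runs on a tiny made-up corpus so that it is self-contained, and it scores on the training texts purely to demonstrate the mechanics; the toy reviews, labels, and the `results` dictionary are assumptions for illustration. Substituting `train_X`, `train_Y`, `test_X`, and `test_Y` from this notebook should mirror the setup above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Hypothetical toy corpus and labels (not the SemEval data)
toy_reviews = [
    "the food was great",
    "terrible service and rude staff",
    "great food but terrible service",
    "the pasta was delicious",
    "rude waiter and slow service",
    "delicious pizza and great food",
]
toy_labels = [["food"], ["service"], ["food", "service"],
              ["food"], ["service"], ["food"]]
toy_Y = MultiLabelBinarizer().fit_transform(toy_labels)

# Factories so that each run gets a fresh, unfitted estimator
classifiers = {
    "NB": lambda: MultinomialNB(alpha=0.1),
    "SVM": lambda: LinearSVC(penalty="l2"),
    "LR": lambda: LogisticRegression(),
}

results = {}
for name, make_clf in classifiers.items():
    for max_n in (1, 2, 3):
        pipeline = Pipeline([
            ("vect", CountVectorizer(stop_words="english", ngram_range=(1, max_n))),
            ("clf", OneVsRestClassifier(make_clf())),
        ])
        pipeline.fit(toy_reviews, toy_Y)
        # Scored on the training texts only to keep the sketch short
        predictions = pipeline.predict(toy_reviews)
        results[(name, max_n)] = round(accuracy_score(toy_Y, predictions), 4)

for (name, max_n), score in sorted(results.items()):
    print(name, "ngram_range=(1,%d)" % max_n, score)
```

Collecting the scores in one dictionary makes it easier to tabulate the classifier and n-gram combinations side by side.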

### Rule-Based Algorithm