# Objective

This notebook compares a few methods for sentiment analysis: RNN with LSTM, Logistic Regression, and Naive Bayes.

LSTM: considers the order of words and uses word embeddings to represent the input text.

Logistic Regression: a bag-of-words approach that uses TF-IDF to represent the input text.

Naive Bayes: also uses TF-IDF for text representation, but the model is quite different from Logistic Regression.
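
To make the difference between the two representations concrete, here is a minimal sketch in plain Python. The helper names `bag_of_words` and `word_sequence` and the toy vocabulary are made up for illustration and are not part of Keras or scikit-learn.

```python
from collections import Counter

def bag_of_words(text):
    """Count word occurrences; the order of the words is discarded."""
    return Counter(text.lower().split())

def word_sequence(text, vocab):
    """Map each word to an integer index; the order of the words is kept."""
    return [vocab[w] for w in text.lower().split()]

a = "good not bad"
b = "bad not good"

# hypothetical toy vocabulary (index 0 is left free for padding)
toy_vocab = {"good": 1, "not": 2, "bad": 3}

print(bag_of_words(a) == bag_of_words(b))  # both texts give the same bag
print(word_sequence(a, toy_vocab))         # [1, 2, 3]
print(word_sequence(b, toy_vocab))         # [3, 2, 1]
```

A bag-of-words model sees the two texts as identical, while a sequence model such as an LSTM receives two different inputs.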

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt

from keras.preprocessing.text import Tokenizer, text_to_word_sequence  # scikit-learn has similar ones. See sklearn.feature_extraction.text
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.utils import resample
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB, ComplementNB, MultinomialNB
from sklearn.preprocessing import LabelEncoder

import itertools

# get version number of some packages
import pkg_resources
print(pkg_resources.get_distribution("keras").version)
print(pkg_resources.get_distribution("scikit-learn").version)

# Import data

In [None]:
data = pd.read_csv('../input/Sentiment.csv')

# Take a look at the data

In [None]:
data.head()

In [None]:
data.groupby('candidate').size().sort_values(ascending=False)

In [None]:
datetime = pd.to_datetime(data['tweet_created'])
print(f"Tweets are created between {datetime.min()} and {datetime.max()}")

# Keep only the necessary columns
For now, let's focus on sentiment analysis only and take only the 'text' and 'sentiment' columns.

In [None]:
data = data[['text','sentiment']].reset_index(drop=True)

# take a look at some random samples and check if the labels are accurate 
idx = np.random.choice(data.index, 5)
for i in idx:
    print(f"{i:5}  {data.loc[i, 'sentiment']:10}: {data.loc[i, 'text']}")

# Text Feature Extraction 

## This shows how keras.preprocessing.text.Tokenizer works

In [None]:
sample_text = data['text'].tolist()[0:1]
print(f"sample_text: {sample_text}")

t = Tokenizer()
t.fit_on_texts(sample_text)

In [None]:
# the vocabulary
t.word_index

In [None]:
text_to_word_sequence(sample_text[0])

In [None]:
t.texts_to_sequences(sample_text)

In [None]:
t.texts_to_matrix(sample_text, mode='tfidf')

The sample_text has 15 unique words. Now limit the number of words in the vocabulary to 12:

In [None]:
t = Tokenizer(num_words=12)
t.fit_on_texts(sample_text)
t.texts_to_sequences(sample_text)

The sequence is shortened because only the most common num_words-1 words will be kept. 
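
The effect of `num_words` can be imitated in a few lines of plain Python. This is only a rough sketch of the idea, not the Keras implementation: the helpers `fit_word_index` and `to_sequence` are hypothetical, and the punctuation filtering that the real Tokenizer performs is left out.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(w for t in texts for w in t.lower().split())
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {w: i + 1 for i, w in enumerate(ranked)}

def to_sequence(text, word_index, num_words=None):
    """Convert a text to indices, dropping words whose index is >= num_words."""
    seq = []
    for w in text.lower().split():
        i = word_index.get(w)
        if i is None:
            continue  # word was never seen during fitting
        if num_words is not None and i >= num_words:
            continue  # word is too rare to be kept
        seq.append(i)
    return seq

toy_texts = ["a a a b b c"]
toy_word_index = fit_word_index(toy_texts)  # {'a': 1, 'b': 2, 'c': 3}
full = to_sequence(toy_texts[0], toy_word_index)
limited = to_sequence(toy_texts[0], toy_word_index, num_words=3)
print(full)     # [1, 1, 1, 2, 2, 3]
print(limited)  # [1, 1, 1, 2, 2]
```

With `num_words=3` only the indices 1 and 2 survive, i.e. the `num_words-1` most common words.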

## Apply Tokenizer to text

Here we limit the vocabulary size to 3000. You can change this value to see how it affects the results.

In [None]:
# choose whether to exclude neutral samples or not; all code below will still work
exclude_neutral = True
if exclude_neutral:
    data = data[data.sentiment != "Neutral"]

# display class labels
display(data.groupby('sentiment').size())

num_classes = len(data['sentiment'].unique())

num_words = 3000   # the maximum number of words to keep, based on word frequency.
transformer = Tokenizer(num_words=num_words)
transformer.fit_on_texts(data['text'].values)
X = transformer.texts_to_sequences(data['text'].values)
X = pad_sequences(X)
Y = pd.get_dummies(data['sentiment'])
class_labels = Y.columns.tolist()  # used for plotting confusion matrix
display(Y.head())
Y = Y.values
print(f"Total {len(transformer.word_index)} words in the vocabulary, {min(num_words - 1, len(transformer.word_index))} most common of them are used")
print(f"X shape {X.shape} \nY shape {Y.shape} ")
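
The tweets have different lengths, so `pad_sequences` brings them to a common length. With its default settings it pads with zeros at the front and, if a `maxlen` is given, also truncates at the front. The sketch below imitates that default behaviour in plain Python; `pad_pre` is a hypothetical helper, not a Keras function.

```python
def pad_pre(sequences, maxlen=None, value=0):
    """Left-pad (and left-truncate) integer sequences to a common length."""
    if maxlen is None:
        maxlen = max(len(s) for s in sequences)
    padded = []
    for s in sequences:
        s = list(s)[-maxlen:]  # keep the last maxlen items
        padded.append([value] * (maxlen - len(s)) + s)
    return padded

print(pad_pre([[5, 2], [7, 1, 4, 3], [9]]))
# [[0, 0, 5, 2], [7, 1, 4, 3], [0, 0, 0, 9]]
```

Index 0 is never assigned to a word by the Tokenizer, which is why it can serve as the padding value.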

# Build and Train LSTM Network

Note that **num_words**, **embed_dim**, **lstm_out**, **batch_size** and the **dropout** rates are hyperparameters. Their values here were chosen largely by intuition, and they can and should be tuned in order to achieve good results. Also note that we are using softmax as the activation function of the output layer. If there are only two target classes, you can use the sigmoid function instead (the softmax function reduces to the sigmoid function in that case).
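
The claim that softmax reduces to sigmoid for two classes can be checked numerically. For two logits z1 and z2, softmax gives exp(z1) / (exp(z1) + exp(z2)) for the first class, which equals 1 / (1 + exp(-(z1 - z2))), i.e. the sigmoid of the difference of the logits. The logit values below are arbitrary.

```python
import math

def softmax(z):
    """Softmax over a list of logits, shifted by the maximum for stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

z = [1.3, -0.4]                    # arbitrary example logits
p_softmax = softmax(z)[0]          # probability of the first class
p_sigmoid = sigmoid(z[0] - z[1])   # sigmoid of the logit difference
print(abs(p_softmax - p_sigmoid) < 1e-12)
```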

In [None]:
embed_dim = 128
lstm_out = 196

model = Sequential()
model.add(Embedding(num_words, embed_dim, input_length=X.shape[1]))
model.add(SpatialDropout1D(0.4))
model.add(LSTM(units=lstm_out, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(64,activation='relu'))  # for more model flexibility and smoother transition from 196 to num_classes
model.add(Dense(num_classes,activation='softmax'))
model.compile(loss='categorical_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])

original_weights = model.get_weights()  # for resetting model weights

print(model.summary())

Split the data into train and test sets.

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0, shuffle=True)
print(f"{X_train.shape[0]:6} samples for train, \n{X_test.shape[0]:6} samples for test")

Train the Network.

In [None]:
batch_size = 128
history = model.fit(X_train, Y_train, 
                    epochs=10, 
                    batch_size=batch_size, 
                    validation_data=(X_test, Y_test),
                    verbose=1)

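# Plot training & validation accuracy values
# (the history key is 'acc' in older Keras versions and 'accuracy' in newer ones)
acc_key = 'acc' if 'acc' in history.history else 'accuracy'
plt.plot(history.history[acc_key])
plt.plot(history.history['val_' + acc_key])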
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.ylim(bottom=0)
plt.show()

Plot confusion matrix

In [None]:
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    # normalize first so that the image and the cell labels show the same values
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    plt.figure(figsize = (6,6))
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=90)
    plt.yticks(tick_marks, classes)
    fmt = '.2f' if normalize else 'd'  # two decimals for rates, integers for counts
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [None]:
Y_pred = model.predict(X_test)
Y_pred_classes = Y_pred.argmax(axis=1)
Y_test_classes = Y_test.argmax(axis=1)
confusion_mtx = confusion_matrix(Y_test_classes, Y_pred_classes) 
plot_confusion_matrix(confusion_mtx, classes=class_labels)

In [None]:
print(classification_report(Y_test_classes, Y_pred_classes, target_names=class_labels))

The model does work because the prediction for each class is clearly better than a random guess. Predicting negative tweets works well, but predicting positive ones (and neutral ones, if they are kept) does not. It is understandable that neutral tweets are difficult to predict because the distinction is quite subtle. The low accuracy for positive tweets, however, may be related to class imbalance: there are many more negative tweets than positive ones in the train data. Let's see if we can improve accuracy by dealing with class imbalance.

In [None]:
# class percentage in train data
sample_composition = Y_train.sum(axis=0)/Y_train.sum()
sample_composition

# Re-train Model with Updated Sample Weight

In [None]:
class_weight = sample_composition.max() / sample_composition

sample_weight = np.ones(Y_train.shape[0])
for i in range(Y_train.shape[1]):
    sample_weight[Y_train[:, i]==1] = class_weight[i] 
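
To see what the weighting above does, here is the same calculation on a tiny made-up label matrix with three samples of one class and one sample of the other. The array `Y_toy` is purely illustrative and unrelated to the tweet data.

```python
import numpy as np

# hypothetical one-hot labels: 3 samples of class 0, 1 sample of class 1
Y_toy = np.array([[1, 0],
                  [1, 0],
                  [1, 0],
                  [0, 1]])

composition_toy = Y_toy.sum(axis=0) / Y_toy.sum()               # [0.75, 0.25]
class_weight_toy = composition_toy.max() / composition_toy      # [1., 3.]

# look up the weight of each sample's class
sample_weight_toy = class_weight_toy[Y_toy.argmax(axis=1)]
print(sample_weight_toy)                                        # [1. 1. 1. 3.]

# total weight per class is now the same for both classes
print(Y_toy.T @ sample_weight_toy)                              # [3. 3.]
```

The majority class keeps weight 1 and each minority sample counts as much as several majority samples, so every class contributes equally to the loss in total.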

In [None]:
model.set_weights(original_weights) # reset model weights

history = model.fit(X_train, Y_train, 
                    sample_weight=sample_weight,
                    epochs=10, 
                    batch_size=batch_size, 
                    validation_data=(X_test, Y_test),
                    verbose=1)

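# Plot training & validation accuracy values
# (the history key is 'acc' in older Keras versions and 'accuracy' in newer ones)
acc_key = 'acc' if 'acc' in history.history else 'accuracy'
plt.plot(history.history[acc_key])
plt.plot(history.history['val_' + acc_key])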
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.ylim(bottom=0)
plt.show()

In [None]:
# plot confusion matrix
Y_pred = model.predict(X_test)
Y_pred_classes = Y_pred.argmax(axis=1)
Y_test_classes = Y_test.argmax(axis=1)
confusion_mtx = confusion_matrix(Y_test_classes, Y_pred_classes) 
plot_confusion_matrix(confusion_mtx, classes=class_labels)

In [None]:
print(classification_report(Y_test_classes, Y_pred_classes, target_names=class_labels))

Recall of the positive class improves, but the overall accuracy actually becomes slightly worse.
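
Recall and accuracy can move in opposite directions, which is easy to see from how both are computed from a confusion matrix. The two matrices below are invented numbers chosen only to illustrate the effect; they are not results from this notebook. `recall_and_accuracy` is a hypothetical helper.

```python
import numpy as np

def recall_and_accuracy(cm):
    """Per-class recall and overall accuracy (rows = true, columns = predicted)."""
    cm = np.asarray(cm, dtype=float)
    recall = np.diag(cm) / cm.sum(axis=1)
    accuracy = np.trace(cm) / cm.sum()
    return recall, accuracy

# hypothetical confusion matrices, class order: [Negative, Positive]
cm_before = [[90, 10],
             [12,  8]]
cm_after = [[80, 20],
            [ 8, 12]]

recall_before, acc_before = recall_and_accuracy(cm_before)
recall_after, acc_after = recall_and_accuracy(cm_after)
print(recall_before[1], acc_before)  # positive recall 0.4, accuracy about 0.817
print(recall_after[1], acc_after)    # positive recall 0.6, accuracy about 0.767
```

More positive tweets are found in the second matrix, but the extra mistakes on the much larger negative class outweigh that gain in the overall accuracy.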

# Prediction Example

In [None]:
twt = ["If Biden gets the nomination, not even Bernie can guarantee Biden progressive voters. Biden has to earn them. He didn't tonight."] 
# another example from twitter ["Biden blatantly lied about his record tonight."]

# vectorize the tweet by the pre-fitted tokenizer instance
twt = transformer.texts_to_sequences(twt)

# pad the tweet to have same length as train samples
twt = pad_sequences(twt, maxlen=X_train.shape[1], dtype='int32', value=0)

sentiment = model.predict(twt, batch_size=1, verbose=1).argmax()
print(f"Sentiment of this tweet is {class_labels[sentiment]}")

# TF-IDF + Logistic Regression

Let's see how well Logistic Regression works for sentiment classification 

In [None]:
# use the fitted tokenizer to get the sequences again without the padding zeros
X = transformer.texts_to_sequences(data['text'].values)

# calculate TF-IDF
X = transformer.sequences_to_matrix(X, mode='tfidf')

# Encode target labels (LogisticRegression does not take OneHotEncoding of target labels)
le = LabelEncoder()
le.fit(data['sentiment'])
Y = le.transform(data['sentiment'])
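
The TF-IDF step can be sketched in plain Python. To my understanding, `mode='tfidf'` in the Keras Tokenizer uses tf = 1 + log(count) and idf = log(1 + N / (1 + df)), where N is the number of documents and df the number of documents that contain the word. Treat the exact formula as an assumption, since it may differ between versions. In this sketch the document frequencies are computed from the sequences passed in, whereas the Tokenizer takes them from the texts it was fitted on. `tfidf_matrix` is a hypothetical helper.

```python
import math

def tfidf_matrix(sequences, num_words):
    """Build a (documents x num_words) TF-IDF matrix from index sequences."""
    n_docs = len(sequences)

    # document frequency: in how many documents each index appears
    doc_freq = {}
    for seq in sequences:
        for j in set(seq):
            doc_freq[j] = doc_freq.get(j, 0) + 1

    matrix = [[0.0] * num_words for _ in sequences]
    for i, seq in enumerate(sequences):
        counts = {}
        for j in seq:
            if j < num_words:  # indices outside the vocabulary are ignored
                counts[j] = counts.get(j, 0) + 1
        for j, c in counts.items():
            tf = 1 + math.log(c)
            idf = math.log(1 + n_docs / (1 + doc_freq[j]))
            matrix[i][j] = tf * idf
    return matrix

toy_docs = [[1, 1, 2], [1, 3]]  # two tiny documents as index sequences
toy_tfidf = tfidf_matrix(toy_docs, num_words=4)
for row in toy_tfidf:
    print([round(v, 3) for v in row])
```

Column 0 stays zero because index 0 is not assigned to any word. Word 1 appears in both documents and therefore gets a lower idf than words 2 and 3, which appear in only one.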

In [None]:
# train test split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0, shuffle=True)
print(f"{X_train.shape[0]:6} samples for train, \n{X_test.shape[0]:6} samples for test")

In [None]:
# fit model
clf = LogisticRegression(class_weight='balanced').fit(X_train, Y_train)

In [None]:
# mean accuracy on test data
clf.score(X_test, Y_test)

In [None]:
# confusion matrix
Y_pred = clf.predict(X_test)
confusion_mtx = confusion_matrix(Y_test, Y_pred) 
plot_confusion_matrix(confusion_mtx, classes=le.classes_)

In [None]:
# precision and recall
print(classification_report(Y_test, Y_pred, target_names=le.classes_))

For this simple case, performance of logistic regression is only slightly lower than RNN with LSTM. 

## Prediction Example for Logistic Regression

In [None]:
twt = ["If Biden gets the nomination, not even Bernie can guarantee Biden progressive voters. Biden has to earn them. He didn't tonight."] 
# another example from twitter ["Biden blatantly lied about his record tonight."]

# get sequence
twt = transformer.texts_to_sequences(twt)

# get tfidf 
twt = transformer.sequences_to_matrix(twt, mode='tfidf')

# predict
sentiment = le.inverse_transform(clf.predict(twt))[0]
print(f"Sentiment of this tweet is {sentiment}")

# TF-IDF + Naive Bayes

Naive Bayes (NB) models are simple and can be extremely fast compared to more sophisticated models. They work quite well in many real-world situations, famously document classification and spam filtering. Let's see how they work in this case.
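
As a rough illustration of the idea behind multinomial Naive Bayes, the sketch below estimates a class prior and smoothed per-class word probabilities, then picks the class with the highest log score. It is a simplified stand-in written for this explanation, not the scikit-learn implementation; the helper names and the toy data are made up.

```python
import math

def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log feature probabilities per class."""
    classes = sorted(set(y))
    n_features = len(X[0])
    log_prior, log_prob = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # additive (Laplace) smoothing avoids zero probabilities
        totals = [sum(r[j] for r in rows) + alpha for j in range(n_features)]
        norm = sum(totals)
        log_prob[c] = [math.log(t / norm) for t in totals]
    return classes, log_prior, log_prob

def predict_multinomial_nb(model, x):
    """Return the class with the highest log prior plus weighted log likelihood."""
    classes, log_prior, log_prob = model
    scores = {c: log_prior[c] + sum(v * lp for v, lp in zip(x, log_prob[c]))
              for c in classes}
    return max(scores, key=scores.get)

# hypothetical toy data: columns are counts of the words ["great", "awful"]
X_nb_toy = [[2, 0], [1, 0], [0, 3], [0, 1]]
y_nb_toy = ["Positive", "Positive", "Negative", "Negative"]

nb_toy = fit_multinomial_nb(X_nb_toy, y_nb_toy)
print(predict_multinomial_nb(nb_toy, [1, 0]))  # Positive
print(predict_multinomial_nb(nb_toy, [0, 2]))  # Negative
```

Training only requires counting, which is why these models are so fast. The "naive" part is the assumption that the features are independent of each other given the class.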

In [None]:
clf_NB = MultinomialNB().fit(X_train, Y_train)

In [None]:
# mean accuracy on test data
clf_NB.score(X_test, Y_test)

In [None]:
# confusion matrix
Y_pred_NB = clf_NB.predict(X_test)
confusion_mtx_NB = confusion_matrix(Y_test, Y_pred_NB) 
plot_confusion_matrix(confusion_mtx_NB, classes=le.classes_)

In [None]:
# precision and recall
print(classification_report(Y_test, Y_pred_NB, target_names=le.classes_))

Results show that Naive Bayes models also work pretty well. 

**Next step: It will be interesting to see how much the accuracy improves if we use word embeddings pre-trained on a large text corpus. Will work on that when I have the chance.**
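
One possible starting point for that next step is to build an embedding matrix whose rows follow the Tokenizer indices. The sketch below assumes the pre-trained vectors are already available as a dict from word to vector; `toy_vectors`, `toy_index` and `build_embedding_matrix` are hypothetical stand-ins, and no real embedding file is loaded here.

```python
import numpy as np

def build_embedding_matrix(word_index, vectors, num_words, embed_dim):
    """Rows follow the tokenizer indices; words without a vector stay all-zero."""
    matrix = np.zeros((num_words, embed_dim))
    for word, i in word_index.items():
        if i < num_words and word in vectors:
            matrix[i] = vectors[word]
    return matrix

# hypothetical stand-ins for transformer.word_index and pre-trained vectors
toy_index = {"good": 1, "bad": 2, "rare": 3}
toy_vectors = {"good": np.array([0.1, 0.2]),
               "bad": np.array([-0.1, -0.2])}

E_toy = build_embedding_matrix(toy_index, toy_vectors, num_words=3, embed_dim=2)
print(E_toy.shape)  # (3, 2)
print(E_toy)
```

Such a matrix could then be used to initialize the Embedding layer; how exactly it is passed depends on the Keras version.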