# Sentiment Analysis

Sentiment analysis (also known as opinion mining or emotion AI) refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice of the customer materials such as reviews and survey responses, online and social media, and healthcare materials for applications that range from marketing to customer service to clinical medicine. (Source: Wikipedia)

## 1. Loading the Dataset

Here we use the <a href="http://ai.stanford.edu/~amaas/data/sentiment/">Large Movie Review Dataset</a> from Stanford, a dataset for binary sentiment classification that contains substantially more data than previous benchmark datasets. It provides 25,000 highly polar movie reviews for training and 25,000 for testing.

In [27]:
# Importing the libraries
import os
import glob
from sklearn.utils import shuffle

In [28]:
# Function for loading the dataset
def load_dataset(data_dir = "./data/imdb-reviews/"):

    # Initializing a dictionary for X_data and y_data
    X_data = {}
    y_data = {}
    
    # Iterating through the "train" and "test"
    for train_or_test in ['train', 'test']:
        
        # Initialize an empty dictionary for the train and test
        X_data[train_or_test] = {}
        y_data[train_or_test] = {}
        
        # Iterate through "positive", "negative"
        for positive_or_negative in ['positive', 'negative']:
            
            # Initialize an empty list for each sentiment
            X_data[train_or_test][positive_or_negative] = []
            y_data[train_or_test][positive_or_negative] = []
            
            # Get the name of all texts in our folder
            file_names = glob.glob(os.path.join(data_dir, train_or_test, positive_or_negative, "*.txt")) 
                
            # Iterate through file names
            for file_name in file_names:

                # Open the (text) file; the reviews are assumed to be UTF-8 encoded
                with open(file_name, encoding="utf-8") as review_file:

                    # Append the review text and its label
                    X_data[train_or_test][positive_or_negative].append(review_file.read())
                    y_data[train_or_test][positive_or_negative].append(positive_or_negative)
                
    return X_data, y_data

In [29]:
# Loading the dataset
X_data, y_data = load_dataset()

In [30]:
# Print the number of reviews per split and sentiment
print("Training set:\n {} Positive / {} Negative\n".format(len(X_data["train"]["positive"]), 
                                                           len(X_data["train"]["negative"])))

print("Test set:\n {} Positive / {} Negative".format(len(X_data["test"]["positive"]), 
                                                   len(X_data["test"]["negative"])))

Training set:
 12500 Positive / 12500 Negative

Test set:
 12500 Positive / 12500 Negative


In [31]:
# Combine the positive and negative reviews into a training set and a test set
X_train = X_data["train"]["positive"] + X_data["train"]["negative"]
y_train = y_data["train"]["positive"] + y_data["train"]["negative"]

X_test = X_data["test"]["positive"] + X_data["test"]["negative"]
y_test = y_data["test"]["positive"] + y_data["test"]["negative"]

In [32]:
# Shuffling the training set and test set
X_train, y_train = shuffle(X_train, y_train)
X_test, y_test = shuffle(X_test, y_test)

In [33]:
print("Training set = {} \nTest set = {}".format(len(X_train), len(X_test)))

Training set = 25000 
Test set = 25000


In [34]:
# Get a small subset of dataset (for speed purposes)
X_train, y_train = X_train[:4000], y_train[:4000]
X_test, y_test = X_test[:1000], y_test[:1000]
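
Because the subset is simply the first 4,000 shuffled reviews, the two classes are not guaranteed to be exactly balanced. A minimal sketch for checking this is shown below. The helper name `label_counts` and the toy labels are assumptions for illustration; in the notebook you would pass `y_train` or `y_test` instead.

```python
from collections import Counter

def label_counts(labels):
    """Return a dict mapping each label to the number of times it occurs."""
    return dict(Counter(labels))

# Toy stand-in for y_train (hypothetical data, not taken from the dataset)
toy_labels = ["positive", "negative", "positive", "positive"]
counts = label_counts(toy_labels)
print(counts)
```

If the counts turn out to be noticeably skewed, one option is to sample the same number of reviews from each class before shuffling.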

## 2. Preprocessing

The second step is preprocessing, which is an essential part of any text model. Specifically, we apply the following steps:
1. Removing HTML tags
2. Lowercasing the text
3. Removing the punctuation
4. Converting to tokens
5. Removing the stopwords
6. Applying a stemmer
7. Applying a lemmatizer
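
To make the effect of the early steps concrete, here is a rough standard-library sketch of lowercasing, punctuation removal, tokenization, and stopword removal. It is not the notebook's implementation: `simple_preprocess` and `TOY_STOPWORDS` are hypothetical names, whitespace splitting stands in for NLTK's tokenizer, and the stopword set is deliberately tiny.

```python
import re

# Hypothetical, deliberately tiny stopword set for illustration only;
# the notebook uses NLTK's English stopword list instead.
TOY_STOPWORDS = {"the", "is", "a", "an", "it", "and", "of", "to"}

def simple_preprocess(text):
    # Lowercase the text
    text = text.lower()
    # Replace every character that is not a letter or digit with a space
    text = re.sub(r"[^a-z0-9]", " ", text)
    # Split on whitespace (a simplification of word_tokenize)
    tokens = text.split()
    # Drop stopwords
    return [token for token in tokens if token not in TOY_STOPWORDS]

tokens = simple_preprocess("The movie is GREAT, and the ending is a surprise!")
print(tokens)
```

Stemming and lemmatization are left out here because they depend on NLTK's language resources.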

In [35]:
# Importing the libraries
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from keras.preprocessing import sequence
import bs4
import numpy as np

Using TensorFlow backend.


In [36]:
# Preprocessing the text
def preprocess_text(text):

    # Removing all HTML tags
    text = bs4.BeautifulSoup(text, "html5lib").get_text().strip()
    
    # Lowercasing the text
    text = text.lower()

    # Removing the punctuation
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)

    # Converting to tokens
    tokens = word_tokenize(text)

    # Removing the stopwords (build the set once instead of once per token)
    english_stopwords = set(stopwords.words("english"))
    tokens = [i_token for i_token in tokens if i_token not in english_stopwords]

    # Apply stemmer
    stemmer = PorterStemmer()
    stemmed = [stemmer.stem(i_token) for i_token in tokens]

    # Apply lemmatizer (first as nouns, then as verbs)
    lemmatizer = WordNetLemmatizer()
    lemmatized = [lemmatizer.lemmatize(i_token, pos="n") for i_token in stemmed]
    lemmatized = [lemmatizer.lemmatize(i_token, pos="v") for i_token in lemmatized]

    return lemmatized

In [37]:
# Preprocess the training set and test set
X_train = [preprocess_text(i) for i in X_train]
X_test = [preprocess_text(i) for i in X_test]

In [38]:
# Combine the training and test sets so the vocabulary covers both
total_dataset = X_train + X_test

In [39]:
# Create word2id and id2word
all_unique_words = np.unique([item for sub_list in total_dataset for item in sub_list])
word2id = {i_token: index for index, i_token in enumerate(all_unique_words)}
id2word = {index: i_token for index, i_token in enumerate(all_unique_words)}
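
One thing to keep in mind: the mapping above assigns id 0 to a real word, while Keras' `pad_sequences` (used later) also pads with 0 by default, so the model cannot tell that word apart from padding. A common alternative, sketched here with hypothetical names (`build_vocab`, `encode`, `PAD_ID`, `UNK_ID`) rather than the notebook's own code, reserves 0 for padding and 1 for unknown words:

```python
PAD_ID = 0  # reserved for padding
UNK_ID = 1  # reserved for words not seen when building the vocabulary

def build_vocab(tokenized_docs):
    """Map each unique token to an id, starting at 2 to keep 0 and 1 free."""
    words = sorted({token for doc in tokenized_docs for token in doc})
    return {word: index for index, word in enumerate(words, start=2)}

def encode(tokens, vocab):
    """Convert tokens to ids, falling back to UNK_ID for unknown tokens."""
    return [vocab.get(token, UNK_ID) for token in tokens]

# Toy documents (hypothetical, already preprocessed)
docs = [["good", "movi"], ["bad", "movi"]]
vocab = build_vocab(docs)
encoded = encode(["good", "unknownword", "movi"], vocab)
print(vocab, encoded)
```

With this scheme the embedding layer would need `len(vocab) + 2` rows, and the vocabulary could be built from the training set alone because unknown test words map to `UNK_ID`.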

In [40]:
# Map each word in the training set to its corresponding id
X_train = [[word2id[i_token] for i_token in sub_list] for sub_list in X_train]

# Map each word in the test set to its corresponding id
X_test = [[word2id[i_token] for i_token in sub_list] for sub_list in X_test]

In [41]:
# Pad sequence
max_words = 500

X_train = sequence.pad_sequences(X_train, maxlen = max_words)
X_test = sequence.pad_sequences(X_test, maxlen = max_words)
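
With its default arguments, `pad_sequences` pads and truncates at the start of each sequence, so the last `max_words` tokens of a long review are the ones kept. The following pure-Python sketch mimics that default behavior for a single sequence; `pad_pre` is a hypothetical helper written for illustration, not part of Keras.

```python
def pad_pre(seq, maxlen, value=0):
    """Mimic the pad_sequences defaults for one sequence (maxlen > 0):
    keep the last `maxlen` items and left-pad shorter sequences with `value`."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

short = pad_pre([5, 8, 2], maxlen=5)
long = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)
print(long)
```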

In [42]:
# Convert labels into 0 and 1
def str_to_int(label):
    if label == "positive":
        return 1
    else:
        return 0
    
y_train = list(map(str_to_int, y_train))
y_test = list(map(str_to_int, y_test))

## 3. Model

Now we are ready to feed the data into our model for training. As you will see, even a simple architecture can reach a reasonably high accuracy.

In [43]:
# Import the libraries
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense
from keras.callbacks import ModelCheckpoint

In [44]:
# Some hyperparameters
embedding_size = 32
lstm_units = 100
batch_size = 128
num_epochs = 10

vocabulary_size = len(all_unique_words)

In [45]:
# Create the model
model = Sequential()
model.add(Embedding(vocabulary_size, embedding_size, input_length = max_words))
model.add(LSTM(units = lstm_units))
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Summary of model
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 500, 32)           831360    
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 101       
Total params: 884,661
Trainable params: 884,661
Non-trainable params: 0
_________________________________________________________________
None
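
The parameter counts in the summary can be reproduced by hand, which is a useful check on the architecture. The sketch below assumes a vocabulary of 25,980 words, the value implied by the embedding layer's 831,360 parameters divided by the embedding size of 32.

```python
vocabulary_size = 25980  # assumption: implied by 831,360 / 32 in the summary
embedding_size = 32
lstm_units = 100

# One embedding vector per word in the vocabulary
embedding_params = vocabulary_size * embedding_size

# An LSTM has 4 gates, each with input weights, recurrent weights, and a bias
lstm_params = 4 * ((embedding_size + lstm_units) * lstm_units + lstm_units)

# One weight per LSTM unit plus a single bias
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Most of the parameters sit in the embedding layer, so the vocabulary size dominates the size of the model.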


In [46]:
# Checkpoint for saving the model
checkpointer = ModelCheckpoint(filepath='./saved model/weights.best.sentiment_analysis.hdf5', 
                               verbose = 1, 
                               save_best_only = True)

# Train the model
model.fit(X_train, 
          y_train,
          validation_data = (X_test, y_test),
          batch_size = batch_size,
          epochs = num_epochs,
          callbacks = [checkpointer], 
          verbose = 1)

Train on 4000 samples, validate on 1000 samples
Epoch 1/10

Epoch 00001: val_loss improved from inf to 0.67256, saving model to ./saved model/weights.best.sentiment_analysis.hdf5
Epoch 2/10

Epoch 00002: val_loss improved from 0.67256 to 0.58436, saving model to ./saved model/weights.best.sentiment_analysis.hdf5
Epoch 3/10

Epoch 00003: val_loss improved from 0.58436 to 0.39578, saving model to ./saved model/weights.best.sentiment_analysis.hdf5
Epoch 4/10

Epoch 00004: val_loss improved from 0.39578 to 0.36961, saving model to ./saved model/weights.best.sentiment_analysis.hdf5
Epoch 5/10

Epoch 00005: val_loss did not improve from 0.36961
Epoch 6/10

Epoch 00006: val_loss did not improve from 0.36961
Epoch 7/10

Epoch 00007: val_loss did not improve from 0.36961
Epoch 8/10

Epoch 00008: val_loss did not improve from 0.36961
Epoch 9/10

Epoch 00009: val_loss did not improve from 0.36961
Epoch 10/10

Epoch 00010: val_loss did not improve from 0.36961


<keras.callbacks.History at 0x1a5ca71dd8>
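
The log shows that `val_loss` last improved at epoch 4, so the remaining six epochs did not produce a better checkpoint. A patience-based stopping rule would have ended training earlier. The sketch below illustrates the idea in plain Python; `early_stopping_epoch` is a hypothetical helper, the first four losses are taken from the log, and the later values are made-up placeholders because the log does not report them. Keras also provides an `EarlyStopping` callback for this purpose.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once `patience` consecutive epochs pass without
    the validation loss improving on the best value seen so far.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses), best_epoch

# Epochs 1-4 come from the training log; the rest are hypothetical placeholders
val_losses = [0.67256, 0.58436, 0.39578, 0.36961, 0.40, 0.42, 0.45]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=2)
print(stop_epoch, best_epoch)
```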

## 4. Evaluation

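Once you have trained your model, it's time to see how well it performs on the test set. Note that the same 1,000 test reviews were used as validation data for checkpointing during training, so this score is not a fully independent estimate.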

In [47]:
# Evaluate your model on the test set
scores = model.evaluate(X_test, y_test, verbose=0)  # returns loss and other metrics specified in model.compile()
print("Test Set Accuracy: {}%".format(scores[1]*100))  # scores[1] should correspond to accuracy if you passed in metrics=['accuracy']

Test Set Accuracy: 83.3%
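
For a sigmoid output, accuracy comes down to thresholding each predicted probability and counting matches with the true labels. The sketch below assumes a threshold of 0.5 and uses toy values; `binary_accuracy` here is a hypothetical pure-Python helper, not the Keras metric itself.

```python
def binary_accuracy(probabilities, labels, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(int(pred == label) for pred, label in zip(predictions, labels))
    return correct / len(labels)

# Toy probabilities and labels (hypothetical, not model output)
toy_probs = [0.91, 0.12, 0.55, 0.40]
toy_labels = [1, 0, 0, 1]
acc = binary_accuracy(toy_probs, toy_labels)
print(acc)
```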


## 5. Prediction

Now you are ready for prediction. You can check the sentiment of any sentence you input.

In [48]:
# Function for prediction
def text_to_predict(unseen_text):
    
    # Preprocess the text
    tokens = preprocess_text(unseen_text)
    
    # Convert the words to ids, skipping words that are not in the vocabulary
    token_ids = [word2id[i_token] for i_token in tokens if i_token in word2id]
    
    # Pad sequences
    padded = sequence.pad_sequences([token_ids], maxlen = max_words)
    
    # Get the predicted probability of the positive class (between 0 and 1)
    prediction = model.predict(padded)[0][0]

    # Print the result, using 0.5 as the decision threshold
    if prediction < 0.5:
        print("The given sentence is negative.")
    else:
        print("The given sentence is positive.")

In [51]:
# Predict an unseen text
unseen_text = "The movie is absolutely terrible. It's not something i would suggest"
text_to_predict(unseen_text)

The given sentence is negative.


In [52]:
# Predict an unseen text
unseen_text = "The movie is absolutely great. Can't wait to watch it again."
text_to_predict(unseen_text)

The given sentence is positive.
