# Sentiment Analysis

### Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
import spacy
import collections
import operator
from tqdm.autonotebook import tqdm
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.sequence import pad_sequences

### Dataset

Loading the [Sentiment140](https://www.kaggle.com/kazanova/sentiment140) dataset, which includes 1.6 million tweets.

In [None]:
%%time
columns = ["target", "ids", "date", "flag", "user", "text"]
encoding = "ISO-8859-1"
dataset = pd.read_csv('dataset/tweets.csv', encoding=encoding, names=columns)

Dropping the columns that are not useful for sentiment classification.

In [None]:
dataset.drop(columns=["ids", "date", "flag", "user"], inplace=True)
dataset.head()

The original dataset labels its classes `0` (negative) and `4` (positive). I am mapping `4` to `1` so that the target becomes a conventional binary label.

In [None]:
# Assign the result back instead of relying on an in-place update of a column
dataset["target"] = dataset["target"].replace({4: 1})

Now let's shuffle the rows of the dataset, since they are originally sorted by class.

In [None]:
dataset = dataset.sample(frac=1).reset_index(drop=True)
# Remove for final training
dataset = dataset.iloc[:1_000_000]
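To see why the shuffle has to come before taking the first million rows, here is a minimal sketch on toy labels using only the standard library. The label list and the seed are assumptions made for illustration; the notebook itself shuffles with `dataset.sample(frac=1)`.

```python
import random

# Toy labels sorted by class, mimicking the order of the original file (assumption).
labels = [0] * 1000 + [1] * 1000

# Slicing the sorted data returns rows from a single class only.
head_sorted = labels[:100]

# Shuffle a copy with a fixed seed so the result is reproducible.
rng = random.Random(42)
shuffled = labels[:]
rng.shuffle(shuffled)
head_shuffled = shuffled[:100]

print(set(head_sorted))    # a single class
print(set(head_shuffled))  # both classes
```

Without the shuffle, a subsample taken from the top of the file would contain no positive tweets at all.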

In [None]:
dataset.head()

### Preprocessing

The dataset is now ready for preprocessing. I will use [spaCy](https://spacy.io), a very popular library for Natural Language Processing.

First, let's load a model that will help us with tokenisation and lemmatisation of the tweets in the dataset.

In [None]:
%%time
nlp = spacy.load("en_core_web_lg")

Piping the tweets through the model returns an iterator. We will use it to loop over the preprocessed tweets and over the tokenized words in each tweet.

In [None]:
# `n_threads` has been ignored since spaCy 2.1 and is no longer accepted in spaCy 3
tweets_iterator = nlp.pipe(dataset.text, batch_size=32)
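A practical consequence of getting an iterator back is that it can be consumed only once. The sketch below shows this with a plain Python generator; `fake_pipe` is a hypothetical stand-in written for this example and is not part of spaCy.

```python
def fake_pipe(texts, batch_size=2):
    """Hypothetical stand-in for a streaming pipeline: yields results lazily."""
    batch = []
    for text in texts:
        batch.append(text)
        if len(batch) == batch_size:
            for item in batch:
                yield item.split()
            batch = []
    for item in batch:  # flush the final partial batch
        yield item.split()

tweets = ["good morning", "so tired today", "love this"]
stream = fake_pipe(tweets)

first_pass = list(stream)   # consumes the generator
second_pass = list(stream)  # nothing is left to yield

print(first_pass)
print(second_pass)
```

This is why the loop further down stores every preprocessed tweet in a list instead of iterating over `tweets_iterator` a second time.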

Let's find out which words are the most popular in the dataset by counting them in a dictionary. This can be useful during optimisation, if we want to drop rare words from the final model to reduce its size. The process can take quite a while, since we are operating on a very large dataset.

In [None]:
words = collections.defaultdict(int)  # lexeme -> number of occurrences
preprocessed_text = []
for tweet in tqdm(tweets_iterator, total=dataset.shape[0]):
    preprocessed_tweet = []
    for token in tweet:
        # Skip stop words
        if token.is_stop:
            continue
        # Look up the vocabulary entry for the token's lemma
        lexeme = nlp.vocab[token.lemma]
        # Keep only lemmas that have a word vector in the model
        if lexeme.has_vector:
            words[lexeme] += 1
            preprocessed_tweet.append(lexeme.text)
    preprocessed_text.append(preprocessed_tweet)

Now let's add a new column to the dataset that contains the preprocessed tweet contents.

In [None]:
dataset['preprocessed'] = preprocessed_text
dataset.head()

In [None]:
# keys = [key for key in words.keys()]
# for word in keys:
#     if words[word] < 2:
#         del words[word]
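Dropping rare words can be sketched on toy data as follows. The tokens and the `min_count` threshold are assumptions for illustration. Building a new dictionary also avoids deleting keys from a dictionary while looping over it.

```python
from collections import Counter

# Toy, already tokenised tweets (illustrative data, not from Sentiment140).
tokenised = [
    ["love", "coffee"],
    ["love", "monday"],
    ["hate", "monday"],
]

# Count every token across all tweets.
counts = Counter(token for tweet in tokenised for token in tweet)

# Keep only words seen at least `min_count` times (threshold is an assumption).
min_count = 2
frequent = {word: n for word, n in counts.items() if n >= min_count}

print(frequent)
```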

Sorting the tokens by their counts, in descending order.

In [None]:
sorted_words = sorted(words.items(), key=operator.itemgetter(1), reverse=True)
print('Top 10 words are:')
for lexeme, count in sorted_words[:10]:
    print(lexeme.text, count)

Now that the tokens are sorted by popularity, we need to create the `embedding_matrix`. The matrix holds the vector that spaCy provides for each token/word. One important detail is the extra entry at index 0, which acts as a placeholder for an empty word. To train the model (and to receive predictions) we feed it an array of word ids for each tweet. These arrays need to be of equal length, because the neural network always expects input of the same size. We use row `0` to represent an empty token, so that shorter inputs can be padded to the shape the network expects.

I will also create a `word_index` dictionary that maps each word to its row index, so that tweets can later be converted into proper neural network input.

In [None]:
word_index = {'': 0}
embedding_matrix = np.zeros((len(sorted_words) + 1, 300))

# sorted_words holds (lexeme, count) pairs; row 0 stays all zeros for padding
for i, (lexeme, _count) in enumerate(sorted_words, start=1):
    word_index[lexeme.text] = i
    embedding_matrix[i] = lexeme.vector
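The same construction on a tiny vocabulary makes the role of row `0` easier to see. This is a minimal sketch: the 3-dimensional vectors and the names `toy_vectors`, `toy_index` and `toy_matrix` are made up for illustration, standing in for the 300-dimensional spaCy vectors.

```python
import numpy as np

# Toy 3-dimensional vectors standing in for the 300-dimensional spaCy ones.
toy_vectors = {"love": [1.0, 0.0, 0.0], "hate": [0.0, 1.0, 0.0]}

toy_index = {"": 0}  # id 0 is reserved for padding
toy_matrix = np.zeros((len(toy_vectors) + 1, 3))
for i, (word, vector) in enumerate(toy_vectors.items(), start=1):
    toy_index[word] = i
    toy_matrix[i] = vector

# Looking up a padded id sequence returns one row per position.
ids = np.array([0, 0, toy_index["love"], toy_index["hate"]])
looked_up = toy_matrix[ids]

print(looked_up.shape)
print(looked_up)
```

The two padded positions map to all-zero rows, so they carry no word information.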

The resulting matrix is a very important part of the model, since it serves as the weight matrix of the model's first layer.

In [None]:
embedding_matrix

### Model

Now that the data is preprocessed, let's design the architecture of our model. I will use [Keras](https://keras.io), a very popular Deep Learning library.

The model is a very simple Convolutional Neural Network consisting of 9 layers.

In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(len(embedding_matrix), 300, weights=[embedding_matrix], input_length=100),
    tf.keras.layers.Conv1D(16, kernel_size=3),
    tf.keras.layers.ReLU(),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(32),
    tf.keras.layers.ReLU(),
    tf.keras.layers.BatchNormalization(),
    # Sigmoid output: binary_crossentropy expects a probability by default
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.summary()
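The parameter counts that `model.summary()` reports can be cross-checked by hand. The sketch below assumes the Keras defaults for `Conv1D` (`"valid"` padding, stride 1) and leaves out the embedding layer, whose size depends on the vocabulary (number of rows times 300). The helper names are hypothetical.

```python
def conv1d_params(kernel_size, in_channels, filters):
    # One kernel per filter plus one bias per filter
    return kernel_size * in_channels * filters + filters

def dense_params(inputs, units):
    # One weight per input/unit pair plus one bias per unit
    return inputs * units + units

def batchnorm_params(features):
    # gamma, beta, moving mean and moving variance for every feature
    return 4 * features

seq_len, embed_dim = 100, 300
conv_out_len = seq_len - 3 + 1  # "valid" padding, stride 1

layer_params = {
    "conv1d": conv1d_params(3, embed_dim, 16),
    "batchnorm_1": batchnorm_params(16),
    "dense_32": dense_params(16, 32),
    "batchnorm_2": batchnorm_params(32),
    "dense_1": dense_params(32, 1),
}

print(conv_out_len)
print(layer_params)
```

The embedding matrix accounts for almost all of the model's size, which is why pruning rare words is the main lever for shrinking it.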

In [None]:
model.compile(
    loss='binary_crossentropy', optimizer="adam", metrics=['accuracy']
)
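By default, Keras's `binary_crossentropy` treats the model output as a probability between 0 and 1. A minimal pure-Python sketch of the formula for a single example shows how the loss behaves. The function names and input values are chosen for illustration only.

```python
import math

def sigmoid(z):
    # Squash a raw score into the (0, 1) range
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, p):
    # p must be a probability strictly between 0 and 1
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

confident_right = binary_crossentropy(1, sigmoid(3.0))
confident_wrong = binary_crossentropy(0, sigmoid(3.0))
unsure = binary_crossentropy(1, sigmoid(0.0))

print(confident_right, unsure, confident_wrong)
```

A confident correct prediction is penalised very little, an undecided one costs `log(2)`, and a confident wrong prediction costs the most.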

### Training

This is the part where we train our model. Let's split our data into training and testing subsets of the original dataset.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    dataset.preprocessed, dataset.target, test_size=0.2, random_state=42
)

Before we feed our data into the model, there is one last thing to do. We need to map the words from each tweet to ids and pad the resulting arrays with zeros, so that the model accepts them as input.

In [None]:
def map_words(row):
    # Words missing from word_index fall back to id 0, the padding entry
    return np.array([word_index.get(word, 0) for word in row])

In [None]:
X_train = X_train.apply(map_words)
X_test = X_test.apply(map_words)
X_train = pad_sequences(X_train, maxlen=100)
X_test = pad_sequences(X_test, maxlen=100)
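The effect of mapping and padding can be sketched in plain Python. This assumes the `pad_sequences` defaults, which pad and truncate at the start of the sequence. `map_and_pad` and `vocab` are hypothetical names used only for this example.

```python
def map_and_pad(tokens, index, maxlen):
    ids = [index.get(token, 0) for token in tokens]  # unknown words -> 0
    ids = ids[-maxlen:]                              # keep the last maxlen ids
    return [0] * (maxlen - len(ids)) + ids           # left-pad with zeros

vocab = {"": 0, "love": 1, "coffee": 2}

short = map_and_pad(["love", "coffee"], vocab, maxlen=4)
long = map_and_pad(["love", "tea", "love", "coffee", "love"], vocab, maxlen=4)

print(short)
print(long)
```

Short tweets gain leading zeros, while tweets longer than `maxlen` lose their earliest tokens.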

In [None]:
print('Train data shapes:', 'X', X_train.shape, 'y', y_train.shape)
print('Test data shapes:', 'X', X_test.shape, 'y', y_test.shape)

Ok, finally! Let's go for it!

In [None]:
epochs = 5
batch_size = 1024

In [None]:
model.fit(
    X_train, y_train,
    batch_size=batch_size,
    epochs=epochs,
    validation_split=0.1,
    verbose=1
)
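It can help to work out how many rows end up in each subset. The sketch below assumes the 1,000,000-row subsample taken earlier, the 20% test split, and the 10% `validation_split`, which Keras takes from the end of the training data.

```python
import math

total_rows = 1_000_000                # size of the subsample taken earlier
test_rows = total_rows * 20 // 100    # test_size=0.2
train_rows = total_rows - test_rows

val_rows = train_rows * 10 // 100     # validation_split=0.1
fit_rows = train_rows - val_rows

# Number of gradient updates per epoch with batch_size=1024
steps_per_epoch = math.ceil(fit_rows / 1024)

print(test_rows, val_rows, fit_rows, steps_per_epoch)
```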

### Testing

In [None]:
_ = model.evaluate(X_test, y_test, batch_size=batch_size)
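The accuracy reported by `evaluate` comes from thresholding the predicted probabilities. Here is a rough sketch of that calculation, assuming the usual 0.5 threshold. The labels and probabilities are made up for illustration.

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    # Turn probabilities into hard 0/1 predictions
    predictions = [1 if p > threshold else 0 for p in y_prob]
    # Count the predictions that match the labels
    correct = sum(int(p == t) for p, t in zip(predictions, y_true))
    return correct / len(y_true)

y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]
acc = binary_accuracy(y_true, y_prob)

print(acc)
```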

### Export

Once training is done and the model is ready, we export it to a file so that it can be converted to `.mlmodel` with `coremltools`.

In [None]:
model.save('SentimentalAnalysis.h5')