# Sentiment classification with Sentiment140 dataset

1. [x] Parse data and access dataset
2. [x] Inspect dataset structure
3. [x] Ensure sentiment target feature encoded as binary
4. [x] Preprocess tweets
5. [x] Split dataset into training, validation and test sets
6. [x] Define 2 classification models
7. [x] Use 3-fold cross-validation to select the best model
8. [x] Train best model on full training and validation sets
9. [x] Measure final model performance on test set
10. [x] References


In [None]:
import csv
import os
import numpy as np
import string
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import matplotlib.pyplot as plt
import pickle
import random
import dask.dataframe as dd
from collections import defaultdict
from dataclasses import dataclass
from zipfile import ZipFile
import pandas as pd
from nltk.corpus import stopwords
import nltk
import re
import multiprocessing
import tensorflow as tf
from sklearn.model_selection import KFold, train_test_split
nltk.download("stopwords") 

In [None]:
print('Max cpu detected: {}'.format(multiprocessing.cpu_count()))
#npartitions = multiprocessing.cpu_count()

In [None]:
if tf.config.list_physical_devices('GPU'):
    print('GPU available ;)')
else:
    print('No GPU detected...when it comes to training please set the accelerator to use a GPU')

In [None]:
data_path = '../input/sentiment140/training.1600000.processed.noemoticon.csv'

@dataclass
class CONFIG():
  """
  """
  col_names = ["target", "ids", "date", "flag", "user", "text"]
  embedding_dim = 300
  maxlen = 50
  vocab_size = 200000
  truncating = 'post'
  padding = 'post'
  oov_token = '<OOV>'
  max_examples = 160000
  training_split = .9

Config = CONFIG()

data = pd.read_csv(data_path,
                   names = CONFIG.col_names,
                   encoding = "ISO-8859-1")

print('Dataset size {}'.format(len(data)))
print('Dataset first five rows:\n{}'.format(data.head()))

In [None]:
fig, (ax1, ax2) = plt.subplots(ncols=2,
                               figsize = (16, 8))

data.target.value_counts().plot(kind = 'bar', ax = ax1)
ax1.set_title('Before remapping')
data.target = data.target.map({0 : 0,
                               4 : 1})
data.target.value_counts().plot(kind = 'bar', ax = ax2)
ax2.set_title('After remapping')
plt.tight_layout()     

In [None]:
# Stop words, stemmer and the regular expressions used during preprocessing
stop_words = stopwords.words('english')
porter = nltk.stem.PorterStemmer()
# Raw strings so that \S and \. reach the regex engine unescaped
remove_links = r"https?:\S+|[0-9]+"
rep_elipse = r"\.{2,}"
stop_list = stop_words + list(string.punctuation)

Let's remove stop words first and lowercase all tokens. We will also reduce ellipses to two dots; people love ellipses and all their meaning!

I will also use stemming to reduce the size of the vocabulary. Stemming truncates words to their roots, which, unlike the output of lemmatisation, do not have to be valid words. This also helps reduce model complexity and the size of the final model, since the embedding layer has fewer words to represent! Moreover, some tweets are not in English.
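
As a rough illustration of the ellipsis reduction described above, together with the link and number removal, the sketch below applies two regular expressions to a made-up tweet. The sample text and variable names are assumptions for illustration only; stemming and stop-word removal are left out so the snippet needs nothing beyond the standard library.

```python
import re

# Hypothetical sample tweet, not taken from the dataset
sample = "Loving the new phone... see http://example.com/review 100 times better!!!"

link_or_number = re.compile(r"https?:\S+|[0-9]+")  # links and runs of digits
ellipsis = re.compile(r"\.{2,}")                   # two or more consecutive dots

no_links = link_or_number.sub("", sample)
cleaned = ellipsis.sub("..", no_links)
cleaned = " ".join(cleaned.split())  # collapse the whitespace left behind by the removals

print(cleaned)
```

The link and the number disappear, while the three dots after "phone" are reduced to two.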

In [None]:
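def preprocess_tweet(tweet: str, 
                     stemmer: object,
                     remove_links_regex,
                     reduce_elipses_regex):
  """
  Strip links, numbers and user handles, drop stop words and punctuation,
  stem the remaining tokens and collapse ellipses to two dots.
  """
  tweet_tokenizer = nltk.tokenize.TweetTokenizer(strip_handles=True,
                                                 reduce_len=True)
  # Remove links and numbers
  tweet = re.sub(remove_links_regex, '', str(tweet))
  # Tokenize 
  tweet = tweet_tokenizer.tokenize(tweet)
  # Lowercase first so capitalised stop words ("The", "I") are removed too
  tokens = [token.lower().strip() for token in tweet]
  # Remove stop words and punctuation, then stem what is left
  tweet = [stemmer.stem(token) for token in tokens if token not in stop_list]
  tweet = ' '.join(tweet)

  # Replace ellipses (two or more dots) with ..
  tweet = re.sub(reduce_elipses_regex, '..', str(tweet))
  
  return tweet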

In [None]:
test = data.head()
test = test['text'].apply(lambda x: preprocess_tweet(x, porter, remove_links, rep_elipse))
print('Before: {}'. format(list(data['text'][:5])))
print('After: {}'. format(list(test[:5])))

In [None]:
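def preprocess_tweet_df(df: pd.DataFrame):
  """
  Apply preprocess_tweet to the 'text' column of df and return df.
  """
  # Use the dataframe passed in rather than the global `data`
  df['text'] = df['text'].map(lambda tweet: preprocess_tweet(tweet, 
                                                             porter,
                                                             remove_links,
                                                             rep_elipse))

  return df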

Let's remove user handles, links and punctuation.

In [None]:
print('Some stop word examples ... {}'.format(list(stop_words[:10])))

Drop unnecessary columns

In [None]:
data = data.drop(['ids', 'date', 'flag','user'], axis = 1)

In [None]:
X, y  = data['text'].tolist(), data.pop('target').to_numpy()

In [None]:
X = [preprocess_tweet(tweet,porter,remove_links,rep_elipse) for tweet in X]

Prepare list of X, y tuples, which I will save to disk to save time later on.

In [None]:
processed = list(zip(X, y))

In [None]:
print('Post preprocessing check: {}'.format(list(X[:5])))

In [None]:
with open('/kaggle/working/sentiment140_processed.pkl', 'wb') as f:
  pickle.dump(processed, f)

Load the preprocessed tweets (it is easier to start from here if you have already saved your preprocessed data!)

In [None]:
# Load processed tweets with target data
with open('/kaggle/working/sentiment140_processed.pkl', 'rb') as f:
  processed = pickle.load(f)

print('Processed data: {}'.format(processed[0]))

X, y = zip(*processed)

In [None]:
print(len(X))
print(len(y))

Split into train/val and test sets.

In [None]:
# Hold out a stratified test set; the remainder is used for training and validation
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    shuffle = True, 
                                                    random_state=1, 
                                                    test_size = 10000,
                                                    stratify = y)

print('Train/val size is {}'.format(len(X_train))) 
print('Test size is {}'.format(len(X_test))) 
print('Example train/val tweet: {}'.format(X_train[:1]))
print('Example test tweet: {}'.format(X_test[:1]))

Tokenise the tweets!

In [None]:
def fit_tokenizer(train_sentences, oov_token, vocab_size):
  """
  """

  tokenizer = Tokenizer(num_words = vocab_size, oov_token = oov_token)

  tokenizer.fit_on_texts(train_sentences)

  return tokenizer

Instantiate tokenizer!

In [None]:
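# Fit on the training/validation tweets only so the test set does not leak into the vocabulary
tokenizer = fit_tokenizer(X_train, oov_token = Config.oov_token, vocab_size = Config.vocab_size)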

word_index = tokenizer.word_index

print('Tokenizer found {} unique tokens (capped at {} when encoding)'.format(len(word_index), Config.vocab_size))
print('<OOV> token successfully placed in vocabulary!' if Config.oov_token in word_index else 'No <OOV> in vocabulary! something went wrong :(')
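
To make the role of the vocabulary cap and the `<OOV>` token more concrete, here is a simplified, pure-Python sketch of what a word-level tokenizer does. It is not the Keras implementation (Keras keeps the full `word_index` and applies the cap when encoding); `build_word_index`, `encode` and the toy tweets are hypothetical names and data used for illustration.

```python
from collections import Counter

def build_word_index(sentences, vocab_size, oov_token="<OOV>"):
    """Map the most frequent words to integer ids; id 1 is reserved for OOV."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    word_index = {oov_token: 1}
    # ids 0 (padding) and 1 (OOV) are reserved, leaving vocab_size - 2 word slots
    for word_id, (word, _) in enumerate(counts.most_common(vocab_size - 2), start=2):
        word_index[word] = word_id
    return word_index

def encode(sentence, word_index, oov_token="<OOV>"):
    """Replace each word by its id, falling back to the OOV id for unknown words."""
    return [word_index.get(word, word_index[oov_token]) for word in sentence.split()]

# Hypothetical, already-stemmed toy tweets
toy_tweets = ["good day", "good movi", "bad day"]
toy_index = build_word_index(toy_tweets, vocab_size=4)

print(toy_index)
print(encode("good bad day", toy_index))
```

With a cap of 4, only the two most frequent words get their own id; the rarer "bad" is encoded with the OOV id.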

Let's check the tweet lengths and use them to inform our choice of `maxlen`.

In [None]:
tweet_lengths = [len(tweet.split()) for tweet in X]  # length in tokens, the unit used by maxlen
plt.figure(figsize=(16,8))
plt.hist(tweet_lengths, 
         bins = 100)
plt.axvline( Config.maxlen, 
            ls = '--',
            c = 'red')
plt.show()

In [None]:
def tokenise_sentences(tweets: list,
                       tokenizer: object,
                       padding: str,
                       truncating: str,
                       maxlen: int):
  """
  """

  tweets = tokenizer.texts_to_sequences(tweets)

  padded_and_trunc_tweets = pad_sequences(sequences = tweets,
                                          maxlen = maxlen,
                                          truncating = truncating,
                                          padding = padding)
  
  return padded_and_trunc_tweets

# Try increasing the maxlen ;)

In [None]:
X_train = tokenise_sentences(X_train, tokenizer, Config.padding, Config.truncating, Config.maxlen) 
X_test = tokenise_sentences(X_test, tokenizer, Config.padding, Config.truncating, Config.maxlen) 

Check dimensions of padded tokenised tweet sequences

In [None]:
print('Tokenised tweets have shape {}'.format(X_train.shape))
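
For intuition, the sketch below mimics what `'post'` padding and `'post'` truncation do to a single sequence. `pad_post` is a hypothetical helper written for this illustration, not part of Keras, and it handles one sequence at a time rather than a batch.

```python
def pad_post(sequence, maxlen, value=0):
    """Cut a sequence to maxlen from the end, then pad with `value` at the end."""
    trimmed = sequence[:maxlen]                          # 'post' truncation drops the tail
    return trimmed + [value] * (maxlen - len(trimmed))   # 'post' padding appends zeros

short_tweet = [5, 3]
long_tweet = [1, 2, 3, 4, 5, 6]

print(pad_post(short_tweet, maxlen=4))
print(pad_post(long_tweet, maxlen=4))
```

Every tweet ends up with exactly `maxlen` ids, which is why the tokenised array has a fixed second dimension.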

Let's define models that leverage an embedding layer. I'll compare two architectures below, partly for explanatory purposes.

In [None]:
def sentiment_classifier_embedding_max(vocab_size, embedding_dim, maxlen):
  """
  """
  model = tf.keras.Sequential([
                               tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length = maxlen),
                               tf.keras.layers.GlobalMaxPooling1D(),
                               tf.keras.layers.Dropout(.3),
                               tf.keras.layers.Dense(32, activation = tf.nn.relu),
                               tf.keras.layers.Dense(1, activation = tf.nn.sigmoid)
  ])

  model.compile(loss = 'binary_crossentropy',
                optimizer = tf.keras.optimizers.Adam(0.001),
                metrics = ['acc'])
  
  return model

In [None]:
def sentiment_classifier_lstm(vocab_size, embedding_dim, maxlen):
  """
  """
  model = tf.keras.Sequential([
                               tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length = maxlen),
                               tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
                               tf.keras.layers.Dropout(.4),
                               tf.keras.layers.Dense(32, activation = tf.nn.relu),
                               tf.keras.layers.Dropout(.4),
                               tf.keras.layers.Dense(16, activation = tf.nn.relu),
                               tf.keras.layers.Dense(1, activation = tf.nn.sigmoid)
  ])

  model.compile(loss = 'binary_crossentropy',
                optimizer = tf.keras.optimizers.Adam(0.001),
                metrics = ['acc'])
  
  return model

In [None]:
# Build the model once to inspect its layers and parameter counts
sentiment_classifier_lstm(Config.vocab_size, Config.embedding_dim, Config.maxlen).summary()

In [None]:
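# Instantiate the early stopping and learning rate reduction callbacks
# restore_best_weights=True recovers the weights from the epoch with the best validation loss
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=2, restore_best_weights=True)
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss',factor=0.1,patience=1,verbose=0,mode='auto')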

To evaluate each architecture I need to train for a reasonable number of epochs and then recover the model with the best validation loss observed. It would not be fair to compare models at the end of all the epochs, as some may overfit and others may not. The accuracy is then averaged over the validation folds so that we can compare model performance. Early stopping ensures models do not overfit prior to comparison, whilst 'Reduce LR on plateau' gives the models a chance to optimise their parameters further.
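
The patience logic behind early stopping can be sketched in a few lines of plain Python. This is an illustrative simplification (it ignores options such as `min_delta`), and the function name and the validation losses are made up.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (epoch at which training stops, epoch with the best validation loss)."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Patience never ran out: training ran for all epochs
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses, one per epoch
losses = [0.60, 0.52, 0.50, 0.51, 0.53, 0.49]
stopped_at, best_at = early_stopping_epoch(losses, patience=2)
print(stopped_at, best_at)
```

With a patience of 2, training stops after two epochs without improvement, and the weights worth keeping are those from the best epoch rather than the last one.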

Convert the targets to NumPy arrays so they can be indexed with the fold indices.

---

In [None]:
y_train = np.array(y_train)
y_test = np.array(y_test)

The cross-validation below will take a while if you want to compare architectures over a number of epochs.

In [None]:
kf = KFold(n_splits=3, 
           shuffle=True, 
           random_state=1)

fold = 0

# takes a while to train ...
epochs = 3

model_eval = defaultdict(list)
# 3-fold cross validation strategy
for train_index, test_index in kf.split(X_train):
    cv_X_train, cv_X_val= X_train[train_index], X_train[test_index]
    cv_y_train, cv_y_val = y_train[train_index], y_train[test_index]
    print('Fold: {}'.format(fold))
    print('# train: {}\n# val: {}'.format(len(train_index), len(test_index)))

    # reinitialise classifers to reset weights
    sentiment_classifier_max_model = sentiment_classifier_embedding_max(Config.vocab_size, Config.embedding_dim, Config.maxlen)
    sentiment_classifier_lstm_model = sentiment_classifier_lstm(Config.vocab_size, Config.embedding_dim, Config.maxlen)

    # train model 1
    history_max = sentiment_classifier_max_model.fit(cv_X_train,
                                             cv_y_train,
                                             validation_data = (cv_X_val, cv_y_val),
                                             callbacks = [early_stopping, reduce_lr],
                                             epochs = epochs,
                                             batch_size = 128)
    # train model 2
    history_lstm = sentiment_classifier_lstm_model.fit(cv_X_train,
                                               cv_y_train,
                                               validation_data = (cv_X_val, cv_y_val),
                                               callbacks = [early_stopping, reduce_lr],
                                               epochs = epochs,
                                               batch_size = 128)


    # make predictions on hold out validation set for each model
    _ , acc = sentiment_classifier_max_model.evaluate(cv_X_val, cv_y_val)
    _ , acc2 = sentiment_classifier_lstm_model.evaluate(cv_X_val, cv_y_val)

    # calculate performance metric
    model_eval['sentiment_classifier_embedding_max'].append(acc)
    model_eval['sentiment_classifier_embedding_lstm'].append(acc2)

    fold += 1


performance1 = np.mean(model_eval['sentiment_classifier_embedding_max']) - np.std(model_eval['sentiment_classifier_embedding_max'])
performance2 = np.mean(model_eval['sentiment_classifier_embedding_lstm']) - np.std(model_eval['sentiment_classifier_embedding_lstm'])

In [None]:
print('Performance of Embedding max model: {}'.format(performance1))
print('Performance of LSTM model: {}'.format(performance2))

Looks like the LSTM model just won out by a fraction.

It would be worth doing error analysis now and figuring out why the model is failing at the remaining ~20%. It could be a language issue for example, or perhaps for a fraction of the tweets it is hard to discern whether they are positive or negative because they are neutral.

Now that we have used 3-fold cross-validation to select the better architecture, let's go ahead and tweak the initial learning rate to push performance a little. Please do not try to run the above cell. I just wanted to show how to integrate cross-validation with TensorFlow workflows, as I find it really helpful. None of the deep learning courses I have attended mention how to do this. To extend the above code, I would create a function that loops through any number of models and collects the performance metrics.
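
One possible shape for such a function is sketched below. It is framework-agnostic and deliberately minimal: `k_fold_indices`, `cross_validate` and `ConstantClassifier` are hypothetical names, and the toy classifier stands in for a real model. With Keras you would wrap each model so that `fit` trains it and `score` returns the validation accuracy.

```python
import random
from statistics import mean

def k_fold_indices(n_samples, n_splits, seed=1):
    """Yield (train_indices, val_indices) pairs for a shuffled k-fold split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i, val_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, val_idx

def cross_validate(model_builders, X, y, n_splits=3):
    """Return {model_name: [accuracy per fold]} for every builder."""
    scores = {name: [] for name in model_builders}
    for train_idx, val_idx in k_fold_indices(len(X), n_splits):
        X_tr, y_tr = [X[i] for i in train_idx], [y[i] for i in train_idx]
        X_val, y_val = [X[i] for i in val_idx], [y[i] for i in val_idx]
        for name, build in model_builders.items():
            model = build()  # fresh model, so nothing carries over between folds
            model.fit(X_tr, y_tr)
            scores[name].append(model.score(X_val, y_val))
    return scores

class ConstantClassifier:
    """Toy stand-in for a real model: always predicts the same label."""
    def __init__(self, label):
        self.label = label

    def fit(self, X, y):
        pass  # nothing to learn

    def score(self, X, y):
        return mean(1.0 if target == self.label else 0.0 for target in y)

# Hypothetical toy data: 12 examples with alternating labels
toy_X = list(range(12))
toy_y = [0, 1] * 6
results = cross_validate({"always_0": lambda: ConstantClassifier(0),
                          "always_1": lambda: ConstantClassifier(1)},
                         toy_X, toy_y)
for name, accuracies in results.items():
    print(name, accuracies)
```

Passing builders (callables) rather than model instances keeps the weight reset explicit: every fold starts from a newly constructed model.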

The reason I subtract the standard deviation of the accuracy is so that models are penalised for the variability of their accuracy across validation folds. This helps choose a robust model.
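
A small worked example of this "mean minus standard deviation" score, using made-up fold accuracies. `robust_score` is a hypothetical helper; it uses the population standard deviation, which is what `np.std` computes by default.

```python
from statistics import mean, pstdev

def robust_score(fold_accuracies):
    """Mean accuracy penalised by its spread across folds."""
    return mean(fold_accuracies) - pstdev(fold_accuracies)

# Two hypothetical models with the same mean accuracy of 0.80
stable_model = [0.80, 0.80, 0.80]
erratic_model = [0.90, 0.80, 0.70]

print(robust_score(stable_model))
print(robust_score(erratic_model))
```

Both models average the same accuracy, but the erratic one is scored lower because its folds disagree.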

It's worth noting that Keras has a [tuner module](https://keras.io/keras_tuner/) that could help you check how the size of the layers affects performance and optimise these hyperparameters.

---

## Tuning the initial learning rate

**Keras callbacks enable us to automate the selection of the starting learning rate** since we can modify it at the end of a batch and monitor how learning rates impact the cost. Here, after every epoch we tweak the learning rate to see which performs well for the task.

The formula specified below increases the learning rate gradually, multiplying it by 10 every 5 epochs.
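
To see the exponential schedule in numbers, the sketch below evaluates it at a few epochs. `scheduled_lr` is a hypothetical helper, and the default `base_lr` of 1e-8 is an assumption chosen to match the x-axis values used for the loss plot; substitute whatever starting rate your schedule uses.

```python
def scheduled_lr(epoch, base_lr=1e-8, epochs_per_decade=5):
    """Learning rate that grows by a factor of 10 every `epochs_per_decade` epochs."""
    return base_lr * 10 ** (epoch / epochs_per_decade)

for epoch in (0, 5, 10, 29):
    print(epoch, scheduled_lr(epoch))
```

Because the growth is exponential, 30 epochs are enough to sweep the learning rate across almost six orders of magnitude.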

In [None]:
def sentiment_classifier_lstm(vocab_size, embedding_dim, maxlen):
  """
  """
  model = tf.keras.Sequential([
                               tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length = maxlen),
                               tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
                               tf.keras.layers.Dropout(.4),
                               tf.keras.layers.Dense(32, activation = tf.nn.relu),
                               tf.keras.layers.Dropout(.4),
                               tf.keras.layers.Dense(16, activation = tf.nn.relu),
                               tf.keras.layers.Dense(1, activation = tf.nn.sigmoid)
  ])

  model.compile(loss = 'binary_crossentropy',
                optimizer = tf.keras.optimizers.Adam(),
                metrics = ['acc'])
  
  return model

In [None]:
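# Start at 1e-8 (matching the learning_rates array used for plotting) and multiply by 10 every 5 epochs
lr_schedule = tf.keras.callbacks.LearningRateScheduler(lambda epoch: 1e-8 * 10**(epoch / 5))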

tune_model = sentiment_classifier_lstm(Config.vocab_size, Config.embedding_dim, Config.maxlen)
# Since the dataset is large ... lets reduce the dataset size
history = tune_model.fit(cv_X_train[:100000],
                         cv_y_train[:100000],
                         validation_data = (cv_X_val[:100000], cv_y_val[:100000]),
                         callbacks = [lr_schedule],
                         epochs = 30, 
                         batch_size = 128)

Let's inspect how the loss varied with the learning rate and use the plot to 'eye-ball' an acceptable initial learning rate ⚡

In [None]:
learning_rates = 1e-8 * (10 ** ((np.arange(30))/5))
best_lr = learning_rates[np.argmin(history.history['loss'])]
plt.figure(figsize = (10,8))
plt.grid(True)
plt.semilogx(learning_rates, history.history["loss"])
plt.tick_params('both', length=8, width=1, which = 'both')
plt.axis([1e-8,.007,0,.79])

plt.axvline(learning_rates[np.argmin(history.history['loss'])],
            ls = '--',
            c = 'red')
plt.title('Optimising learning rate selection')
plt.show()

Looks like the stable sweet spot is between 10^-5 and 10^-4. The loss starts to increase dramatically when the learning rate is above 10^-4.

## Final model architecture and training

After cross-validation, it's important to re-train using both the training and validation data. **Remember to only do this once you have picked your final model based on the validation metrics.**

In [None]:
def sentiment_classifier_lstm(vocab_size, embedding_dim, maxlen, best_lr):
  """
  """
  model = tf.keras.Sequential([
                               tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length = maxlen),
                               tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, name = 'lstm_layer')),
                               tf.keras.layers.Dropout(.4),
                               tf.keras.layers.Dense(32, activation = tf.nn.relu, name = 'dense_layer_1'),
                               tf.keras.layers.Dropout(.4),
                               tf.keras.layers.Dense(16, activation = tf.nn.relu,name = 'dense_layer_2'),
                               tf.keras.layers.Dense(1, activation = tf.nn.sigmoid,name = 'output_layer')
  ])

  model.compile(loss = 'binary_crossentropy',
                optimizer = tf.keras.optimizers.Adam(best_lr),
                metrics = ['acc'])
  
  return model

Now we have our final model architecture and one hyperparameter chosen! Let's train on both the training and validation data before testing final performance on the test set. If the validation loss stagnates for 3 epochs, I will reduce the learning rate. When you want to save models, the `ModelCheckpoint` callback is superb, and you can set it to save only the best model! ♥
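
The learning rate reduction described above can be pictured with the following plain-Python sketch. It is a simplification of what a reduce-on-plateau callback does (no `min_delta`, cooldown or minimum learning rate), and the function name and validation losses are invented for illustration.

```python
def reduce_lr_on_plateau(val_losses, initial_lr, factor=0.1, patience=3):
    """Return the learning rate used in each epoch, cutting it when val loss stalls."""
    lr = initial_lr
    best_loss = float("inf")
    wait = 0  # epochs since the last improvement
    lr_per_epoch = []
    for loss in val_losses:
        lr_per_epoch.append(lr)  # rate in effect during this epoch
        if loss < best_loss:
            best_loss, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                lr *= factor     # applies from the next epoch onwards
                wait = 0
    return lr_per_epoch

# Hypothetical validation losses: three epochs without improvement after epoch 1
lr_history = reduce_lr_on_plateau([0.50, 0.40, 0.41, 0.42, 0.43, 0.39],
                                  initial_lr=1e-3)
print(lr_history)
```

The rate stays constant while the loss keeps improving and is multiplied by `factor` once `patience` epochs pass without a new best.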

In [None]:
final_model = sentiment_classifier_lstm(Config.vocab_size, Config.embedding_dim, Config.maxlen, best_lr)
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss',factor=0.1,patience=3,verbose=0,mode='auto')

# The below will only save if the val accuracy improves. This means we can later load the best model without worrying about overtraining.
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(filepath='/kaggle/working/sentiment140-lstm-model_{epoch:02d}_{acc:.2f}',
                                                               save_weights_only=False,
                                                               monitor='val_acc',
                                                               mode='max',
                                                               save_best_only=True)
history = final_model.fit(X_train[:-5000],
                         y_train[:-5000],
                         validation_data = (X_train[-5000:], y_train[-5000:]),
                         callbacks = [reduce_lr,model_checkpoint_callback],
                         epochs = 10, 
                         batch_size = 128)

## Load final model and check final performance ⭐

In [None]:
final_model.evaluate(X_test, y_test)

## ~ 80 % great!

In a few months I hope to do error analysis and understand the failings of this model. Obviously this should be done before the test set is used to evaluate the model, but this is Kaggle and I am only doing it out of curiosity!

# Closing remarks

I hope you found this notebook useful and maybe learnt a neat trick or two! Please leave any constructive feedback on avenues I could explore to easily improve this notebook.

Happy Kaggling ✌