# **Sentiment Analysis of tweets with Deep Learning**

In this notebook, we finetune a pre-trained Transformer to outperform 🤗's default `pipeline` model on the task of sentiment analysis of tweets. The dataset we'll use is the [Sentiment140](https://huggingface.co/datasets/sentiment140) dataset. Just for fun, we'll also scrape some recent tweets related to Elon Musk and use our finetuned transformer to predict their sentiments. Lastly, we'll also train a bi-LSTM on the same data just for comparison and practice 😃

I will be using TensorFlow to train our model. This notebook was also created in Colab to leverage Google's free GPUs. 
<br>
<br>
**Note:** any text marked with "note to self" is purely a note for my own learning and **may be ignored by the reader** 😄

## **Table of contents**
- [Data pre-processing](#Data-pre-processing)
- [Preparing Training and Testing sets](#Preparing-Training-and-Testing-sets)
- [Transformer](#Time-to-Transform-🤖)
  - [Fine-tuning (training)](#Finetuning)
  - [Prediction](#Prediction-with-Transformer)
  - [Bonus Elon Musk tweets](#Bonus-Elon-Musk-tweets)
- [LSTM](#Bi-LSTM-time)
  - [Training](#Training)
  - [Prediction](#Prediction-with-RNN)
- [Project Summary](#Project-Summary)

In [1]:
import pandas as pd
import numpy as np 
import collections
from sklearn.metrics import accuracy_score
import tensorflow as tf
import time
tf.test.gpu_device_name() # should output '/device:GPU:0'

'/device:GPU:0'

The tweets have been annotated (0 = negative, 4 = positive) and they can be used to detect sentiment. Download the dataset from 🤗's [Dataset Hub](https://huggingface.co/datasets/sentiment140).

In [2]:
pip install datasets


In [3]:
from datasets import load_dataset

dataset = load_dataset('sentiment140', split='train')
dataset

Downloading and preparing dataset sentiment140/sentiment140 (download: 77.59 MiB, generated: 215.36 MiB, post-processed: Unknown size, total: 292.95 MiB) to /root/.cache/huggingface/datasets/sentiment140/sentiment140/1.0.0/f81c014152931b776735658d8ae493b181927de002e706c4d5244ecb26376997...


Dataset sentiment140 downloaded and prepared to /root/.cache/huggingface/datasets/sentiment140/sentiment140/1.0.0/f81c014152931b776735658d8ae493b181927de002e706c4d5244ecb26376997. Subsequent calls will reuse this data.


Dataset({
    features: ['text', 'date', 'user', 'sentiment', 'query'],
    num_rows: 1600000
})

The dataset has an equal number of positive and negative data points.

In [4]:
collections.Counter(dataset['sentiment'])

Counter({0: 800000, 4: 800000})

# Data pre-processing

We now perform 2 very important pre-processing steps. First, we need to change the sentiment values from (0,4) to (0,1). This is important for training the model later on with the cross entropy loss function. 

Note to self: Through several failed training attempts, I learned that for a classification task with n labels, the labels must be integers in \[0, n) for the model to train successfully. Otherwise, training either raises an error or runs but fails to learn anything.
<br>
<br>
We can do this by converting the 🤗 Dataset to a `pandas.DataFrame` and replacing 4 with 1 before converting it back. It is also possible to manipulate the dataset via its own methods [as seen here](https://huggingface.co/docs/datasets/processing.html#processing-data-row-by-row), but this is slower than the pandas way since `Dataset.map` processes the data row by row by default.
<br>
<br>
Built-in methods:

In [5]:
# # Both methods are equivalent here:

# # # Method 1
# # def replace_four(example):
# #   if example['sentiment'] == 4:
# #     example['sentiment'] = 1
# #   return example
# # updated_dataset = dataset.map(replace_four)

# # Method 2
# dataset = dataset.map(lambda example: {'sentiment': 1} if example['sentiment'] == 4 else {'sentiment': 0})

pandas way:

In [6]:
ds = dataset.to_pandas()
ds['sentiment'].unique()

array([0, 4], dtype=int32)

In [7]:
ds['sentiment'] = ds['sentiment'].replace(4,1)
ds['sentiment'].unique()

array([0, 1], dtype=int32)
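
As a dependency-free illustration of the label remapping above, here is a minimal sketch that maps arbitrary integer labels onto the contiguous range 0..n-1. The helper name `remap_labels` and the toy labels are made up for this example; they are not part of the notebook or of any library.

```python
def remap_labels(labels):
    """Map arbitrary integer labels onto 0..n-1, preserving their sorted order."""
    classes = sorted(set(labels))                      # e.g. [0, 4]
    to_index = {c: i for i, c in enumerate(classes)}   # e.g. {0: 0, 4: 1}
    return [to_index[c] for c in labels], to_index

# Toy labels in the Sentiment140 convention (0 = negative, 4 = positive)
raw = [0, 4, 4, 0, 4]
remapped, mapping = remap_labels(raw)
print(mapping)    # {0: 0, 4: 1}
print(remapped)   # [0, 1, 1, 0, 1]
```

For two classes this does the same thing as `replace(4, 1)`; the general form is only useful when there are more label values to compact.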

In [8]:
from datasets import Dataset
dataset = Dataset.from_pandas(ds)

The second pre-processing step is to change the sentiment values' feature type from `Value` to `ClassLabel`. Relevant documentation can be found [here](https://huggingface.co/docs/datasets/processing.html#casting-the-dataset-to-a-new-set-of-features-types-cast).

In [9]:
dataset.features

{'date': Value(dtype='string', id=None),
 'query': Value(dtype='string', id=None),
 'sentiment': Value(dtype='int32', id=None),
 'text': Value(dtype='string', id=None),
 'user': Value(dtype='string', id=None)}

In [10]:
from datasets import ClassLabel
new_features = dataset.features.copy()
new_features['sentiment'] = ClassLabel(names=['negative', 'positive'])

dataset = dataset.cast(new_features)
dataset.features

{'date': Value(dtype='string', id=None),
 'query': Value(dtype='string', id=None),
 'sentiment': ClassLabel(num_classes=2, names=['negative', 'positive'], names_file=None, id=None),
 'text': Value(dtype='string', id=None),
 'user': Value(dtype='string', id=None)}

# Preparing Training and Testing sets

Let's use 1000 data points for testing and 5000 data points for training. I chose a small training set for faster training and thus faster feedback on whether or not the model is indeed learning. If the model needs more data, we can always come back to this step and increase the training set.

In [11]:
shuffled = dataset.shuffle(seed=10) # shuffle once and take both splits from the same permutation
test_data = shuffled.select(range(1000))
test_X = test_data['text']

train_data = shuffled.select(range(1000, 6000))

Check that label proportions are about 50%, which is similar to the entire dataset.

In [12]:
collections.Counter(test_data['sentiment'])

Counter({0: 503, 1: 497})

In [13]:
collections.Counter(train_data['sentiment'])

Counter({0: 2517, 1: 2483})

I plan to use the distilBERT model and so its corresponding tokenizer must be used.

In [14]:
! pip install transformers


In [15]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

There are 3 main NLP pre-processing steps: **Tokenize, Truncate, Pad**. 
<br>
<br>
We can efficiently apply the tokenizer to our dataset via the [`Dataset.map()`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) method, which preserves our data as a Dataset. We can also perform truncation here if the input sequence is too long (specifically, longer than 512 tokens for BERT or distilBERT).

In [16]:
def tokenize_function(example):
    return tokenizer(example["text"], truncation=True) # tokenize 'text' column only

test_data_tokenized = test_data.map(tokenize_function, batched=True)
train_data_tokenized = train_data.map(tokenize_function, batched=True)

**Dynamic padding** pads all sequences in the same batch (instead of the entire training set) to the same length. We can do this via the `DataCollatorWithPadding` class.

We also set `return_tensors="tf"` since we're using TensorFlow.
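
To make the idea of dynamic padding concrete, here is a minimal pure-Python sketch (not the `DataCollatorWithPadding` implementation). The function `pad_batch` and the toy token ids are hypothetical; the point is only that each batch is padded to its own longest sequence, with an attention mask marking real tokens.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in `batch` to the length of the longest one in that batch."""
    longest = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in batch]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in batch]
    return input_ids, attention_mask

# Two toy batches with made-up token ids
batch_a = [[101, 7, 102], [101, 7, 8, 9, 102]]
batch_b = [[101, 102], [101, 5, 102]]

ids_a, mask_a = pad_batch(batch_a)
ids_b, mask_b = pad_batch(batch_b)
print(len(ids_a[0]), len(ids_b[0]))  # 5 3
print(mask_a[0])                     # [1, 1, 1, 0, 0]
```

Because the second batch is only padded to length 3 rather than 5, less computation is wasted on padding tokens than if everything were padded to a global maximum.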

In [17]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf") 

test_data_preproc = test_data_tokenized.to_tf_dataset( # wrap a tf.data.Dataset around our dataset
    columns=["attention_mask", "input_ids"],
    label_cols=["sentiment"],
    shuffle=False, # NOTE: must set as false for validation
    collate_fn=data_collator,
    batch_size=8,
)

train_data_preproc = train_data_tokenized.to_tf_dataset( # wrap a tf.data.Dataset around our dataset
    columns=["attention_mask", "input_ids"],
    label_cols=["sentiment"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)

# Time to Transform 🤖
For most NLP tasks, Huggingface's `pipeline()` provides a default model that allows for quick and easy inferences. Let's first see how the default model performs.
<br>
<br>
The [default `pipeline` model](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english) for sentiment analysis has been fine-tuned on SST-2 (Stanford Sentiment Treebank).

In [18]:
from transformers import pipeline

default_model = pipeline("sentiment-analysis")
data = ["I love you", "I hate you"]
default_model(data)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


[{'label': 'POSITIVE', 'score': 0.9998656511306763},
 {'label': 'NEGATIVE', 'score': 0.9991129040718079}]

In [19]:
result = default_model(test_X) 
result_preproc = [0 if row['label'] == 'NEGATIVE' else 1 for row in result]
accuracy_score(test_data['sentiment'], result_preproc)

0.714

The default model achieves an accuracy of about 71% on our test data.

## Finetuning

As previously mentioned, we will use distilBERT as our choice of model. DistilBERT retains most of the performance of BERT while being much smaller and faster. It is also an [encoder model](https://huggingface.co/course/chapter1/5?fw=tf) which makes it great for sentence classification.
<br>
<br>
Note that we are loading in a **pre-trained** model. We are not training a new model from scratch as it is too expensive and transfer learning works pretty well anyway. Let's aim to finetune a model that outperforms the default model on our Sentiment140 dataset.

In [20]:
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2) # 2 labels for +ve and -ve

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['activation_13', 'vocab_layer_norm', 'vocab_transform', 'vocab_projector']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier', 'classifier', 'dropout_19']
You should probably TRAIN this model on a down-stream task to be able to use i

Let's see how the pre-trained model performs before any finetuning.

In [21]:
preds = model.predict(test_data_preproc)["logits"] 

class_preds = np.argmax(preds, axis=1)
accuracy_score(test_data['sentiment'], class_preds)

0.501

Its poor performance is likely due to the weights of the new classification head being randomly initialized and thus unhelpful for our task, as mentioned [here](https://huggingface.co/course/chapter3/3?fw=tf#:~:text=You%20will%20notice,to%20do%20now.). An accuracy of about 50% on a balanced two-class test set is no better than random guessing.
<br>
<br>
Let's now train our model with TensorFlow. We should explicitly set the optimizer's learning rate to be much lower than the Adam default of 1e-3, since transformers benefit from a lower learning rate (also mentioned [here](https://huggingface.co/course/chapter3/3?fw=tf#:~:text=From%20long%20experience%2C%20though%2C%20we%20know%20that%20transformer%20models%20benefit%20from%20a%20much%20lower%20learning%20rate%20than%20the%20default%20for%20Adam%2C%20which%20is%201e%2D3%2C%20also%20written%20as%2010%20to%20the%20power%20of%20%2D3%2C%20or%200.001.%205e%2D5%20\(0.00005\)%2C%20which%20is%20some%20twenty%20times%20lower%2C%20is%20a%20much%20better%20starting%20point.)). We can also set the learning rate to decay linearly, so that in the later parts of training, as the model nears convergence, it takes smaller steps and is less likely to overshoot the optimum.

In [22]:
from tensorflow.keras.optimizers.schedules import PolynomialDecay
from tensorflow.keras.optimizers import Adam

batch_size = 8
num_epochs = 3
num_train_steps = len(train_data_preproc) * num_epochs # can do len(tf.data.Dataset) to see how many batches there are in 1 epoch
lr_scheduler = PolynomialDecay(
    initial_learning_rate=5e-5, 
    end_learning_rate=0.0, 
    decay_steps=num_train_steps
)

opt = Adam(learning_rate=lr_scheduler)
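
To make the schedule above concrete, here is a small sketch of the linear decay formula (polynomial decay with its default power of 1) written in plain Python. The function `linear_decay` is a hypothetical stand-in for illustration, not the Keras class; the default of 1875 steps assumes 5000 training examples in batches of 8 (625 steps per epoch) over 3 epochs.

```python
def linear_decay(step, initial_lr=5e-5, end_lr=0.0, decay_steps=1875):
    """Learning rate after `step` optimizer steps under a linear decay schedule."""
    step = min(step, decay_steps)              # the rate stays at end_lr once decay is done
    fraction_left = 1 - step / decay_steps     # 1.0 at the start, 0.0 at the end
    return (initial_lr - end_lr) * fraction_left + end_lr

for step in (0, 625, 1250, 1875):              # start, then the end of each epoch
    print(step, linear_decay(step))
```

The rate starts at 5e-5, drops by a third of that value per epoch, and reaches 0 exactly at the last training step.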

Since we're training with TensorFlow, we can call `.compile()` and `.fit()` on our model.

In [23]:
from tensorflow.keras.losses import SparseCategoricalCrossentropy

start = time.time()
model.compile(
    optimizer=opt,
    loss=SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

model.fit(
    train_data_preproc, 
    validation_data=test_data_preproc, 
    epochs=num_epochs
) 
end = time.time()

Epoch 1/3
Epoch 2/3
Epoch 3/3


## Prediction with Transformer

Now, we'll predict on the `test_data` with our fine-tuned transformer model.

In [24]:
preds = model.predict(test_data_preproc)["logits"]

class_preds = np.argmax(preds, axis=1)
accuracy_score(test_data['sentiment'], class_preds)

0.797
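
The prediction cell above takes `argmax` directly over the logits without applying a softmax first. The sketch below, using made-up logits and hypothetical helper names, illustrates why that is safe: softmax preserves the ordering of its inputs, so the winning class is the same either way.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]   # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

# Toy logits for three samples: [negative, positive]
toy_logits = [[-1.2, 0.7], [2.0, -0.5], [0.1, 0.3]]

from_logits = [argmax(row) for row in toy_logits]
from_probs = [argmax(softmax(row)) for row in toy_logits]
print(from_logits)  # [1, 0, 1]
print(from_probs)   # [1, 0, 1]
```

The softmax is only needed when the probabilities themselves are of interest, for example to report a confidence score.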

In [25]:
print(f"Time taken: {(end - start):.2f} seconds")

Time taken: 435.57 seconds


Our finetuned model easily outperforms the default `pipeline` model (about 80% vs. 71% accuracy on our test data). This was achieved with only 5000 training points and less than 10 minutes of computing on a Colab GPU. Thank you, Transfer Learning!


## Bonus Elon Musk tweets

As a bonus test, since the [Sentiment140](https://huggingface.co/datasets/sentiment140) dataset we used contains tweets from 2009, it would be interesting to see how the model performs on more recent tweets. Elon Musk is known to be a controversial figure, so let's analyse the sentiment of tweets mentioning him. We will scrape tweets using the [Twint](https://github.com/twintproject/twint) tool.

In [26]:
! pip3 install --upgrade git+https://github.com/twintproject/twint.git@origin/master#egg=twint


In [27]:
import twint

We need to install `nest_asyncio` to avoid `RuntimeError: This event loop is already running`. [source](https://stackoverflow.com/questions/66920753/running-a-justpy-web-app-on-jupyter-returns-runtimeerror)

In [28]:
! pip install nest_asyncio

import nest_asyncio
nest_asyncio.apply()



We'll scrape 30 tweets related to Elon Musk and use our model to predict their sentiments.
<br>
<br>
Note to self: The following cell has to be run on its own, i.e. "run all before this cell", then "run this cell", then "run all after this cell". For some reason, the scraping tool does not work if we select "run all" for the whole notebook in one shot.

In [29]:
c = twint.Config()
c.Limit = 30
# c.Username = "elonmusk" # to scrape all tweets under a user
c.Search = ['Elon Musk'] # to scrape all tweets under a topic
c.Pandas = True

twint.run.Search(c)

tweets_df = twint.storage.panda.Tweets_df

1495372730123882498 2022-02-20 12:20:16 +0000 <DanRudia> @elonmusk Можешь ли ты меня поздравить с днюхой? 15 лет
1495372722607689731 2022-02-20 12:20:14 +0000 <MinusPlacebo> @spcefinance @Tesla @elonmusk @TeslaOwnersFL Do you get any money to say that...!?? How much is it 😂😂😂🤦‍♂️?
1495372690181533696 2022-02-20 12:20:06 +0000 <j5FaXo3b998MUV7> @FilippVI @elonmusk чё не так ?
1495372687178354690 2022-02-20 12:20:05 +0000 <sophieandrea02> @Samyar71435396 @xdqss @ChinaPumpWXC @BabyTKing @CryptoKojima @BabyDogeCEO @Shibtoken @williamcoit @ShibaArchives @Investments_CEO @elonmusk @BabytkTurkey You should join this chat you will thank me later   https://t.co/TeZU7jEBsB
1495372669239259140 2022-02-20 12:20:01 +0000 <ByWhatMeasure> @samovog @jacecraftmiller @sdteslaowners @elonmusk @Tesla You are misinformed
1495372663178416128 2022-02-20 12:20:00 +0000 <WIONews> .@Tesla Chief Executive @elonmusk, who has a long history of clashing wit

We need to extract the text of the tweets, since Twint also scrapes other related information such as the date and username of each tweet. Then, we need to preprocess the tweets in the same way as our training data for our model to accept them, i.e. with `tokenize_function` and `to_tf_dataset`.

In [30]:
bonus_data = tweets_df[['tweet']].rename(columns={"tweet": "text"}) # rename on a new DataFrame (avoids SettingWithCopyWarning); tokenize_function expects a 'text' column


In [31]:
df = Dataset.from_pandas(bonus_data)

df_tokenized = df.map(tokenize_function, batched=True)
df_preproc = df_tokenized.to_tf_dataset( # wrap a tf.data.Dataset around our dataset
    columns=["attention_mask", "input_ids"],
    shuffle=False, # NOTE: keep the original order so predictions line up with the tweets
    collate_fn=data_collator,
    batch_size=8,
)

preds = model.predict(df_preproc)["logits"]
class_preds = np.argmax(preds, axis=1)

Post-processing the result. For better readability, I have prepended the model predictions to each tweet.

In [32]:
result_preproc = ['NEGATIVE' if val == 0 else 'POSITIVE' for val in class_preds]
# prepend each predicted label to its tweet
final_list = [f"{label}: {tweet}" for label, tweet in zip(result_preproc, bonus_data['text'])]

final_list

['NEGATIVE: @elonmusk Можешь ли ты меня поздравить с днюхой? 15 лет',
 'POSITIVE: @spcefinance @Tesla @elonmusk @TeslaOwnersFL Do you get any money to say that...!?? How much is it 😂😂😂🤦\u200d♂️?',
 'POSITIVE: @FilippVI @elonmusk чё не так ?',
 'POSITIVE: @Samyar71435396 @xdqss @ChinaPumpWXC @BabyTKing @CryptoKojima @BabyDogeCEO @Shibtoken @williamcoit @ShibaArchives @Investments_CEO @elonmusk @BabytkTurkey You should join this chat you will thank me later   https://t.co/TeZU7jEBsB',
 'NEGATIVE: @samovog @jacecraftmiller @sdteslaowners @elonmusk @Tesla You are misinformed',
 'NEGATIVE: .@Tesla Chief Executive @elonmusk, who has a long history of clashing with US safety officials, denied there was a safety issue with the function   https://t.co/LGyrKWrdZL',
 'POSITIVE: @oye_to_usama @elonmusk 😂😂😂',
 'POSITIVE: @saucyboy71 @GeorgeTakei @elonmusk 🙄🙄🙄 And you enjoy your crooked little cult, zombie.',
 'POSITIVE: @Cre

Interestingly, while the model did not predict every tweet correctly, it also does not seem to be complete rubbish. In its defence, some of the tweets are really difficult, e.g. tweets in other languages or tweets containing only usernames (I guess these were replies to other tweets with an image or a gif). With that in mind, I would say the model's performance is to be expected, since:
- the model was trained on old tweets when Elon Musk was not really prominent yet and
- the model was trained on a very small training set of 5000 data points

This shows that it is really important to keep in mind what problem we are trying to solve and what data we are feeding to our models. If we feed poor data, expect poor results.

# Bi-LSTM time

In this section, we train a bi-LSTM model and compare its performance on `test_data` to our finetuned transformer. [This tutorial was used as reference](https://www.tensorflow.org/text/tutorials/text_classification_rnn).
<br>
<br>
We first need to convert the 🤗 Dataset into a `tf.data.Dataset`.

In [33]:
train_X = tf.constant(train_data['text'])
train_Y = tf.constant(train_data['sentiment'])
train_tf = tf.data.Dataset.from_tensor_slices((train_X, train_Y))

We now create batches of data for training. [Note](https://stackoverflow.com/questions/46444018/meaning-of-buffer-size-in-dataset-map-dataset-prefetch-and-dataset-shuffle) on `buffer_size` in `tf.data.Dataset.shuffle()` and `tf.data.Dataset.prefetch()`.

In [34]:
buffer_size = 10000
batch_size = 64
train_tf = train_tf.shuffle(buffer_size).batch(batch_size).prefetch(tf.data.AUTOTUNE) 
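
As a rough mental model of what `buffer_size` means for shuffling, here is a pure-Python sketch of buffer-based shuffling. It is a simplified illustration under my own assumptions, not TensorFlow's implementation, and `buffered_shuffle` is a hypothetical name: a buffer is filled with the first `buffer_size` items, then each emitted item is drawn at random from the buffer and replaced by the next incoming item.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Yield `items` in a randomised order using a fixed-size shuffle buffer (buffer_size >= 1)."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)              # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)     # pick a random slot to emit
        yield buffer[idx]
        buffer[idx] = item                   # refill the slot with the new item
    rng.shuffle(buffer)                      # input exhausted: drain what is left
    yield from buffer

print(list(buffered_shuffle(range(5), buffer_size=1)))             # [0, 1, 2, 3, 4]
print(sorted(buffered_shuffle(range(10), buffer_size=4, seed=0)))  # all ten items, each exactly once
```

With a buffer of 1 nothing is shuffled, while a buffer at least as large as the dataset gives a full uniform shuffle. Since our `buffer_size` of 10000 exceeds the 5000 training examples, we are in the second case.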

In [35]:
# check to make sure train_tf is correct
for text, sentiment in train_tf.take(1):
  print('texts: ', text.numpy()[:3])
  print()
  print('sentiments: ', sentiment.numpy()[:3])

texts:  [b'PNG sucks on FB. '
 b"@PaalSA New Map DVD on Toyota Avensis gave me TMC as well. But it's a shame that the DVD costs the same as a complete TomTom unit "
 b'@3guser im 16 too ']

sentiments:  [0 0 1]


## Training 

To perform the NLP pre-processing tasks of **Tokenize, Truncate & Pad** in TensorFlow, we can use the [`TextVectorization`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization) layer. I believe [dynamic padding](https://www.tensorflow.org/text/tutorials/text_classification_rnn#:~:text=Once%20the%20vocabulary%20is%20set%2C%20the%20layer%20can%20encode%20text%20into%20indices.%20The%20tensors%20of%20indices%20are%200%2Dpadded%20to%20the%20longest%20sequence%20in%20the%20batch%20(unless%20you%20set%20a%20fixed%20output_sequence_length) is also done by default in this layer. We'll set the vocabulary size to 10000.

Note to self: Documentation on the [map](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#:~:text=dataset.map(lambda%20x_int%2C%20y_str%3A%20x_int\)) and [adapt](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization#adapt) methods.



In [36]:
VOCAB_SIZE = 10000
preproc = tf.keras.layers.TextVectorization(max_tokens=VOCAB_SIZE)

preproc.adapt(train_tf.map(lambda text, sentiment: text)) # build the vocabulary from the 'text' values only, not the labels
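
To illustrate roughly what the adapted layer does, here is a simplified pure-Python sketch: build a frequency-ranked vocabulary capped at `max_tokens`, then encode texts into indices padded with 0 to the longest sequence in the batch. The helpers `build_vocab` and `encode_batch` are hypothetical, and the standardisation here (lowercase plus whitespace split) is cruder than the real layer's. The sketch assumes index 0 is reserved for padding and index 1 for out-of-vocabulary tokens.

```python
from collections import Counter

def build_vocab(texts, max_tokens):
    """Keep the most frequent tokens; slots 0 and 1 are reserved for padding and unknowns."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def encode_batch(texts, vocab):
    """Turn texts into index lists, 0-padded to the longest sequence in the batch."""
    index = {tok: i for i, tok in enumerate(vocab)}
    rows = [[index.get(tok, 1) for tok in text.lower().split()] for text in texts]
    longest = max(len(row) for row in rows)
    return [row + [0] * (longest - len(row)) for row in rows]

vocab = build_vocab(["good day", "good night", "bad day"], max_tokens=4)
print(vocab)                                       # ['', '[UNK]', 'good', 'day']
print(encode_batch(["Good night", "bad"], vocab))  # [[2, 1], [1, 0]]
```

Words that fall outside the vocabulary cap ("night" and "bad" here) all collapse to the same unknown index, which is why the vocabulary size is a trade-off between coverage and model size.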

For the model architecture, we will have:
- the [preprocessing layer](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization) to perform NLP preprocessing,
- an [embedding](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding) layer to build word embeddings
- a [bi-directional](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Bidirectional) [LSTM](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM) layer
- 2 [dense](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense) layers to obtain our final output. The first dense layer uses a relu activation function which helps with learning while the second dense layer uses a sigmoid to compute the probability of the sample belonging to the 'positive' class.

Note to self: We set `from_logits=False` in the [loss function](https://www.tensorflow.org/api_docs/python/tf/keras/losses/BinaryCrossentropy#:~:text=By%20default%2C%20we%20assume%20that%20y_pred%20contains%20probabilities%20(i.e.%2C%20values%20in%20%5B0%2C%201%5D) since our model predictions `y_pred` are probabilities due to the sigmoid activation function.
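
To see how the two `from_logits` settings relate, here is a small sketch of binary cross-entropy for a single sample, computed once from a probability and once directly from a logit. The function names are hypothetical; the logit version uses the numerically stable form `max(z, 0) - z*y + log(1 + exp(-|z|))` that the TensorFlow documentation gives for sigmoid cross-entropy with logits.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_probability(y_true, p):
    """Binary cross-entropy when the model outputs a probability (from_logits=False)."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def bce_from_logit(y_true, z):
    """Same loss computed from the raw logit (from_logits=True), in a stable form."""
    return max(z, 0.0) - z * y_true + math.log1p(math.exp(-abs(z)))

for z in (-2.0, 0.0, 1.5):
    for y in (0, 1):
        print(z, y, bce_from_probability(y, sigmoid(z)), bce_from_logit(y, z))
```

Both columns agree up to floating-point error, so the choice is about where the sigmoid is applied (in the last layer or inside the loss), not about a different objective.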

In [37]:
model = tf.keras.Sequential([
    preproc,
    tf.keras.layers.Embedding(
        input_dim=len(preproc.get_vocabulary()), # input_dim is usually size of vocab
        output_dim=64, # dimension of word embeddings
        mask_zero=True # set to true since we are building an RNN which may take variable length input
    ),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)), # same dimension as embeddings
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid') # to calculate P(positive class)
])

model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(), # note: from_logits=False
    optimizer=tf.keras.optimizers.Adam(1e-4),
    metrics=['accuracy']
)

Since our transformer was pre-trained while our RNN will be trained from scratch, I increased the epochs to 10 to allow the model to learn more.

In [38]:
model.fit(train_tf, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f458b008450>

## Prediction with RNN
We will now use our trained RNN to predict on `test_data`. There are 2 ways to do this: either `model.evaluate` or `model.predict`.

`model.evaluate`:

In [39]:
test_X = tf.constant(test_data['text'])
test_Y = tf.constant(test_data['sentiment'])
test_loss, test_acc = model.evaluate(test_X, test_Y)

print('Test Loss:', test_loss)
print('Test Accuracy:', test_acc)

Test Loss: 1.094499945640564
Test Accuracy: 0.7350000143051147


`model.predict`:

In [40]:
test_X = tf.constant(test_data['text'])
test_Y = tf.constant(test_data['sentiment'])
preds = model.predict(test_X)

`model.predict` requires additional post-processing. Since the last layer of our model is a dense layer with sigmoid activation, its outputs are *probabilities*. If the predicted probability of the sample having positive sentiment is greater than or equal to 0.5, we will classify it as positive (class 1). Otherwise, negative (class 0).

In [41]:
result_preproc = (preds.ravel() >= 0.5).astype(int) # flatten the (n, 1) probabilities, then threshold at 0.5

accuracy_score(test_data['sentiment'], result_preproc)

0.735

It seems that the RNN's performance is not too bad and is actually comparable to the default 🤗 sentiment analysis model (about 73.5% vs. 71.4% accuracy). I believe one reason for this is that tweets are generally short, which means the LSTM is able to capture most of the sequential information. This is usually not the case for very long sequences, due to the vanishing gradient problem in RNNs during training and their reliance on *sequential computation*. In contrast, transformers largely avoid this problem as they utilise the [attention](https://arxiv.org/abs/1706.03762) mechanism.
<br>
<br>
Note to self: If we had set `from_logits=True` in the loss function and used a linear activation function in the last layer as in the [tutorial](https://www.tensorflow.org/text/tutorials/text_classification_rnn), the output of our model would have been [a logit](https://www.tensorflow.org/text/tutorials/text_classification_rnn#:~:text=After%20the%20RNN%20has%20converted%20the%20sequence%20to%20a%20single%20vector%20the%20two%20layers.Dense%20do%20some%20final%20processing%2C%20and%20convert%20from%20this%20vector%20representation%20to%20a%20single%20logit%20as%20the%20classification%20output.). In this scenario, as mentioned [here](https://www.tensorflow.org/text/tutorials/text_classification_rnn#:~:text=If%20the%20prediction%20is%20%3E%3D%200.0%2C%20it%20is%20positive%20else%20it%20is%20negative.), if the output was >= 0.0, the prediction would be positive, otherwise negative. For a while, I wondered what was so magical about the value 0 in the [tutorial](https://www.tensorflow.org/text/tutorials/text_classification_rnn#:~:text=If%20the%20prediction%20is%20%3E%3D%200.0%2C%20it%20is%20positive%20else%20it%20is%20negative.). Since there are 2 classes, shouldn't there be 2 logit values per sample? And to determine the class prediction, wouldn't we need to take the argmax of the softmax? Then I remembered we were in the binary classification setting, i.e. our probability prediction for the other class is 1 - (our probability prediction for this class). The reason 0 is the 'magical value' is the [sigmoid function](https://en.wikipedia.org/wiki/Sigmoid_function): it maps any input >= 0 to an output >= 0.5.
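
The reasoning in the note above can be checked numerically. The sketch below uses made-up logits and hypothetical helper names to show that thresholding a logit at 0 gives the same class as thresholding its sigmoid at 0.5, and that a two-logit softmax reduces to a sigmoid of the difference between the logits (which is why a single logit suffices in the binary case).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax_positive(z_neg, z_pos):
    """Probability of the positive class under a two-logit softmax."""
    e_neg, e_pos = math.exp(z_neg), math.exp(z_pos)
    return e_pos / (e_neg + e_pos)

logits = [-3.0, -0.1, 0.0, 0.4, 2.5]
by_logit = [int(z >= 0.0) for z in logits]           # threshold the raw logit at 0
by_prob = [int(sigmoid(z) >= 0.5) for z in logits]   # threshold the probability at 0.5
print(by_logit)  # [0, 0, 1, 1, 1]
print(by_prob)   # [0, 0, 1, 1, 1]

# Two logits carry no more information than their difference
print(softmax_positive(0.3, 1.1), sigmoid(1.1 - 0.3))
```

The last line prints two values that match up to floating-point error.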

# Project Summary

To recap, in this project, we finetuned a pre-trained distilBERT model to outperform the default 🤗 sentiment analysis model on the [Sentiment140](https://huggingface.co/datasets/sentiment140) dataset. We even tested it on some recent tweets related to Elon Musk. We also trained a bi-LSTM neural network on the same task and saw that it performed similarly to the default model.
<br>
<br>
Thank you for opening and reading through this project if you did. I hope this has been as informative for you as it was educational and enriching for me 🤗