Data Loading and Pre-processing

In [None]:
# Load Dataset
import pandas as pd
train_data_url = "https://drive.google.com/uc?export=download&id=19JmVSOZ85vikn5aKna97aL5LM8KtG3T7"
test_data_url = "https://drive.google.com/uc?export=download&id=19EnwRfr6q5lzVB_UpJlGOG3IxgDhYVgP"

df_train = pd.read_csv(train_data_url, encoding='latin-1')[["OriginalTweet","Sentiment"]].rename(columns={'OriginalTweet': 'tweet', 'Sentiment': 'label'})
df_test = pd.read_csv(test_data_url, encoding='latin-1')[["OriginalTweet","Sentiment"]].rename(columns={'OriginalTweet': 'tweet', 'Sentiment': 'label'})



In [None]:
#Checking the training dataset
df_train.head()

Unnamed: 0,tweet,label
0,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,Neutral
1,advice Talk to your neighbours family to excha...,Positive
2,Coronavirus Australia: Woolworths to give elde...,Positive
3,My food stock is not the only one which is emp...,Positive
4,"Me, ready to go at supermarket during the #COV...",Extremely Negative


In [None]:
#Checking the columns of the dataset
df_train.columns

Index(['tweet', 'label'], dtype='object')

In [None]:
#Checking the unique values in label
df_train['label'].value_counts()


label
Positive              11422
Negative               9917
Neutral                7713
Extremely Positive     6624
Extremely Negative     5481
Name: count, dtype: int64
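
The counts above show that the classes are not evenly represented. As a rough sketch of one common way to quantify and offset this, the snippet below computes inverse-frequency class weights from those counts. The notebook itself does not use class weights; the formula `n_samples / (n_classes * class_count)` is the usual "balanced" heuristic, and the variable names are illustrative only.

```python
# Training label counts copied from the value_counts() output above.
label_counts = {
    "Positive": 11422,
    "Negative": 9917,
    "Neutral": 7713,
    "Extremely Positive": 6624,
    "Extremely Negative": 5481,
}

n_samples = sum(label_counts.values())
n_classes = len(label_counts)

# "Balanced" heuristic (an assumption, not part of the notebook):
# rarer classes receive proportionally larger weights.
class_weights = {
    label: n_samples / (n_classes * count)
    for label, count in label_counts.items()
}

for label, weight in class_weights.items():
    share = label_counts[label] / n_samples
    print(f"{label:<20} share={share:.3f} weight={weight:.3f}")
```

Weights like these could be passed to a loss function or classifier that accepts per-class weights, so that errors on the rarer classes cost more during training.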

In [None]:
#Checking the first tweet in training data
df_train['tweet'].iloc[0]

'@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/iFz9FAn2Pa and https://t.co/xX6ghGFzCC and https://t.co/I2NlzdxNo8'

In [None]:
#Checking the unique values in testing sample
df_test['label'].value_counts()

label
Negative              1041
Positive               947
Neutral                619
Extremely Positive     599
Extremely Negative     592
Name: count, dtype: int64

In [None]:
#First let's convert the sentiment labels to numeric values in the training and testing samples
#Create a new column label_num with Extremely Negative = 0, Negative = 1, Neutral = 2, Positive = 3 and Extremely Positive = 4
label_map = {'Extremely Negative': 0, 'Negative': 1, 'Neutral': 2, 'Positive': 3, 'Extremely Positive': 4}
df_train['label_num'] = df_train['label'].map(label_map)
df_test['label_num'] = df_test['label'].map(label_map)


In [None]:
#Let's see the change for the training data
df_train.head()

Unnamed: 0,tweet,label,label_num
0,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,Neutral,2
1,advice Talk to your neighbours family to excha...,Positive,3
2,Coronavirus Australia: Woolworths to give elde...,Positive,3
3,My food stock is not the only one which is emp...,Positive,3
4,"Me, ready to go at supermarket during the #COV...",Extremely Negative,0


In [None]:
#Let's see the change for the testing data
df_test.head()

Unnamed: 0,tweet,label,label_num
0,TRENDING: New Yorkers encounter empty supermar...,Extremely Negative,0
1,When I couldn't find hand sanitizer at Fred Me...,Positive,3
2,Find out how you can protect yourself and love...,Extremely Positive,4
3,#Panic buying hits #NewYork City as anxious sh...,Negative,1
4,#toiletpaper #dunnypaper #coronavirus #coronav...,Neutral,2


In [None]:
#Importing the necessary libraries
import re
import string
import numpy as np

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
def process_tweet(tweet):
    """Process tweet function.
    Input:
        tweet: a string containing a tweet
    Output:
        tweets_clean: a list of words containing the processed tweet

    """
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')
    # remove stock market tickers like $GE
    tweet = re.sub(r'\$\w*', '', tweet)
    # remove old style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    # remove hyperlinks (note: '.*' also removes any text that follows
    # the link on the same line)
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    # remove only the hash sign (#) and keep the hashtag word itself
    tweet = re.sub(r'#', '', tweet)
    # tokenize tweets
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)

    tweets_clean = []
    for word in tweet_tokens:
        if (word not in stopwords_english and  # remove stopwords
                word not in string.punctuation):  # remove punctuation
            # tweets_clean.append(word)
            stem_word = stemmer.stem(word)  # stemming word
            tweets_clean.append(stem_word)

    return tweets_clean

In [None]:
#Checking the function on the second training tweet (index 1); the first tweet contains only handles and links
filtered_tweet = process_tweet(df_train['tweet'].iloc[1])
print(filtered_tweet)

['advic', 'talk', 'neighbour', 'famili', 'exchang', 'phone', 'number', 'creat', 'contact', 'list', 'phone', 'number', 'neighbour', 'school', 'employ', 'chemist', 'gp', 'set', 'onlin', 'shop', 'account', 'poss', 'adequ', 'suppli', 'regular', 'med', 'order']
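
The hyperlink pattern used in `process_tweet` ends with `.*`, so it deletes the link and everything after it on the same line. The sketch below contrasts it with a narrower pattern that stops at the first whitespace. The sample tweet is made up, and `URL_ONLY` is a hypothetical alternative rather than what the notebook uses; switching patterns would change the tokens and therefore the results reported later.

```python
import re

# Pattern used in process_tweet: '.*' runs to the end of the line.
GREEDY_URL = re.compile(r'https?:\/\/.*[\r\n]*')
# Hypothetical alternative: match only the URL itself.
URL_ONLY = re.compile(r'https?://\S+')

sample = "Stock up calmly https://t.co/abc123 and stay safe"  # made-up tweet

greedy_result = GREEDY_URL.sub('', sample)
url_only_result = URL_ONLY.sub('', sample)

print(repr(greedy_result))    # text after the link is lost
print(repr(url_only_result))  # text after the link is kept
```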


In [None]:
#Applying the process_tweet function to create a new column of clean tokens named clean_tweet (training data)
df_train['clean_tweet'] = df_train['tweet'].apply(process_tweet)

In [None]:
#Let's see the change
df_train.head()

Unnamed: 0,tweet,label,label_num,clean_tweet
0,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,Neutral,2,[]
1,advice Talk to your neighbours family to excha...,Positive,3,"[advic, talk, neighbour, famili, exchang, phon..."
2,Coronavirus Australia: Woolworths to give elde...,Positive,3,"[coronaviru, australia, woolworth, give, elder..."
3,My food stock is not the only one which is emp...,Positive,3,"[food, stock, one, empti, ..., pleas, panic, e..."
4,"Me, ready to go at supermarket during the #COV...",Extremely Negative,0,"[readi, go, supermarket, covid, 19, outbreak, ..."


In [None]:
#Applying the process_tweet function to create a new column of clean tokens named clean_tweet (testing data)
df_test['clean_tweet'] = df_test['tweet'].apply(process_tweet)

In [None]:
#Let's see the change
df_test.head()

Unnamed: 0,tweet,label,label_num,clean_tweet
0,TRENDING: New Yorkers encounter empty supermar...,Extremely Negative,0,"[trend, new, yorker, encount, empti, supermark..."
1,When I couldn't find hand sanitizer at Fred Me...,Positive,3,"[find, hand, sanit, fred, meyer, turn, amazon,..."
2,Find out how you can protect yourself and love...,Extremely Positive,4,"[find, protect, love, one, coronaviru]"
3,#Panic buying hits #NewYork City as anxious sh...,Negative,1,"[panic, buy, hit, newyork, citi, anxiou, shopp..."
4,#toiletpaper #dunnypaper #coronavirus #coronav...,Neutral,2,"[toiletpap, dunnypap, coronaviru, coronavirusa..."


In [None]:
#Let's split the dataframes into features and labels
X_train = df_train['clean_tweet']
y_train = df_train['label_num']
X_test = df_test['clean_tweet']
y_test = df_test['label_num']

In [None]:
X_train[:5]

0                                                   []
1    [advic, talk, neighbour, famili, exchang, phon...
2    [coronaviru, australia, woolworth, give, elder...
3    [food, stock, one, empti, ..., pleas, panic, e...
4    [readi, go, supermarket, covid, 19, outbreak, ...
Name: clean_tweet, dtype: object

In [None]:
X_test[:5]

0    [trend, new, yorker, encount, empti, supermark...
1    [find, hand, sanit, fred, meyer, turn, amazon,...
2               [find, protect, love, one, coronaviru]
3    [panic, buy, hit, newyork, citi, anxiou, shopp...
4    [toiletpap, dunnypap, coronaviru, coronavirusa...
Name: clean_tweet, dtype: object

Generation of Word-level Embedding

We will train the Word2Vec model on our dataset using gensim.

In [None]:
#First let's combine the tokenized training and testing tweets into one corpus
tweet_df = pd.concat([X_train, X_test])
len(tweet_df)

44955

In [None]:
# Train the Word2Vec model on the combined corpus of tokenized tweets
from gensim.models import Word2Vec
word_vectors = Word2Vec(sentences=tweet_df, vector_size=100, window=4, min_count=1, workers=4)


In [None]:
#Gensim pre-trained vectors
#import gensim.downloader as api
#word_vectors = api.load("glove-twitter-100")



In [None]:
# Let's create a function to compute the mean word vector for each tweet
def get_mean_word_vector(tokens, model, vector_size):
    """
    Compute the mean of the word vectors for a single tweet.

    Args:
        tokens: a list of tokens from one processed tweet
        model: a trained Word2Vec model that provides an embedding for each word
        vector_size: dimension of each word vector

    Returns:
        mean_vector: the mean of the vectors of all tokens found in the model,
            or a zero vector if none of the tokens are in the vocabulary
    """
    # Keep only the vectors of tokens that are present in the embedding model
    valid_words = [model.wv[word] for word in tokens if word in model.wv]
    # With pre-trained KeyedVectors (e.g. the GloVe option above), index directly:
    #valid_words = [model[word] for word in tokens if word in model]
    if valid_words:  # take the mean of the vectors
        mean_vector = sum(valid_words) / len(valid_words)
        return mean_vector
    else:  # no known tokens: return a zero vector of the same dimension
        mean_vector = np.zeros(vector_size, dtype=np.float32)
        return mean_vector

In [None]:
#Let's compute the mean word vector for our training and testing samples
X_train_embeddings = [get_mean_word_vector(tokens, word_vectors, 100) for tokens in X_train]
X_test_embeddings = [get_mean_word_vector(tokens, word_vectors, 100) for tokens in X_test]

In [None]:
#Let's see the mean word vector for the second tweet (index 1) in the training sample
#X_train_embeddings[1]

In [None]:
#Let's see the mean word vector for the second tweet (index 1) in the testing sample
#X_test_embeddings[1]

In [None]:
len(X_train_embeddings)

41157

In [None]:
len(y_train)

41157

Implementation of Traditional Classifier

For this multi-class classification problem, we will use the Logistic Regression model from scikit-learn.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Initialize our logistic regression classifier
lr_classifier = LogisticRegression(max_iter=2000, random_state=2024)

#Training our logistic regression model on the training embeddings
lr_classifier.fit(X_train_embeddings, y_train)


In [None]:
#Predicting the labels
y_pred_lr = lr_classifier.predict(X_test_embeddings)

In [None]:
#Let's see the whole classification report first
print(classification_report(y_test, y_pred_lr))

              precision    recall  f1-score   support

           0       0.48      0.20      0.28       592
           1       0.39      0.39      0.39      1041
           2       0.41      0.57      0.47       619
           3       0.34      0.50      0.41       947
           4       0.59      0.24      0.34       599

    accuracy                           0.39      3798
   macro avg       0.44      0.38      0.38      3798
weighted avg       0.42      0.39      0.38      3798



In [None]:
#Let's evaluate the logistic regression model's performance
from sklearn.metrics import accuracy_score, f1_score, classification_report
accuracy_lr = accuracy_score(y_test, y_pred_lr)
f1_micro_lr = f1_score(y_test, y_pred_lr, average='micro')
f1_macro_lr = f1_score(y_test, y_pred_lr, average='macro')

print("Logistic Regression:")
print(f"  Accuracy: {accuracy_lr:.4f}")
print(f"  F1 Micro: {f1_micro_lr:.4f}")
print(f"  F1 Macro: {f1_macro_lr:.4f}")

Logistic Regression:
  Accuracy: 0.3944
  F1 Micro: 0.3944
  F1 Macro: 0.3787


Implementation of NN-based classifier

We will build a neural network using PyTorch.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Convert data to PyTorch tensors (stack the embeddings into one array first,
# which is much faster than converting a list of arrays directly)
X_train_tensor = torch.tensor(np.array(X_train_embeddings), dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.long)
X_test_tensor = torch.tensor(np.array(X_test_embeddings), dtype=torch.float32)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.long)

#Wrapping the feature and label tensors in datasets
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)

#Creating data loaders that serve the datasets in mini-batches
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)



In [None]:

# Define neural network model
class SentimentClassifier(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(SentimentClassifier, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

input_dim = 100  # Dimension of the tweet embeddings (input features)
hidden_dim = 50  # Number of units in the hidden layer
output_dim = 5   # Number of sentiment classes

model = SentimentClassifier(input_dim, hidden_dim, output_dim)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
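
For a sense of scale, the number of trainable parameters in this two-layer network can be worked out by hand from the layer sizes above. The sketch below is plain arithmetic (it does not need PyTorch) and, for comparison, counts the parameters of a linear 5-class model on the same 100-dimensional input; the variable names are illustrative.

```python
input_dim, hidden_dim, output_dim = 100, 50, 5

# A fully connected layer with `n_in` inputs and `n_out` outputs holds
# n_in * n_out weights plus n_out biases.
fc1_params = input_dim * hidden_dim + hidden_dim    # first layer
fc2_params = hidden_dim * output_dim + output_dim   # output layer
total_params = fc1_params + fc2_params

# A purely linear 5-class model on the same input, for comparison.
linear_params = input_dim * output_dim + output_dim

print(fc1_params, fc2_params, total_params, linear_params)
```

The network is roughly ten times larger than the linear model, yet still small, which is consistent with the modest gain it shows over logistic regression.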

In [None]:
#Training our model
n_epochs = 50

for epoch in range(n_epochs):
    model.train()
    for batch in train_loader:
        optimizer.zero_grad()
        X_batch, y_batch = batch
        outputs = model(X_batch)
        loss = criterion(outputs, y_batch)
        loss.backward()
        optimizer.step()

In [None]:
#Let's predict the sentiment of each tweet in the testing sample
model.eval()
y_pred_nn = []
with torch.no_grad():
    for X_batch, _ in test_loader:
        outputs = model(X_batch)
        _, predicted = torch.max(outputs, 1)
        y_pred_nn.extend(predicted.numpy())

In [None]:
#Let's see the whole classification report first
print(classification_report(y_test, y_pred_nn))

              precision    recall  f1-score   support

           0       0.45      0.42      0.43       592
           1       0.39      0.42      0.41      1041
           2       0.51      0.46      0.48       619
           3       0.36      0.46      0.40       947
           4       0.54      0.33      0.41       599

    accuracy                           0.42      3798
   macro avg       0.45      0.42      0.43      3798
weighted avg       0.43      0.42      0.42      3798



In [None]:
#Let's evaluate the neural network's performance
y_pred_nn = np.array(y_pred_nn)
accuracy_nn = accuracy_score(y_test, y_pred_nn)
f1_micro_nn = f1_score(y_test, y_pred_nn, average='micro')
f1_macro_nn = f1_score(y_test, y_pred_nn, average='macro')

print("Neural Network:")
print(f"  Accuracy: {accuracy_nn:.4f}")
print(f"  F1 Micro: {f1_micro_nn:.4f}")
print(f"  F1 Macro: {f1_macro_nn:.4f}")

Neural Network:
  Accuracy: 0.4215
  F1 Micro: 0.4215
  F1 Macro: 0.4265


Discussion on Model Performance

Accuracy

From the results above, the accuracy of the logistic regression model is 0.3944 and that of the neural network model is 0.4215. In other words, the NN-based model correctly predicts the sentiment of 42.15% of the test tweets, whereas the logistic regression model predicts only 39.44% correctly.

These accuracy scores suggest that the NN-based model outperforms the logistic regression model. However, the sentiment labels are imbalanced, meaning the classes do not contain equal numbers of tweets, so accuracy alone is not a sufficient basis for comparing the models. The higher accuracy of the NN-based model is therefore only a first indication that it is better suited to this task than logistic regression.



F1 Score

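F1 micro pools the true positives, false positives, and false negatives of all classes before computing the score. In a single-label multi-class problem like this one it is identical to accuracy, which is why both values are 0.3944 for logistic regression and 0.4215 for the neural network. Because it is dominated by the more frequent classes, it is most informative when the classes are balanced. Since we have a class imbalance, we are more interested in the F1 macro score.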
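
To make the relationship between accuracy, F1 micro, and F1 macro concrete, here is a small self-contained sketch in plain Python. The labels are made up and deliberately imbalanced (they are not taken from the dataset), and `f1_scores` is a hypothetical helper written for this illustration, not a library function.

```python
def f1_scores(y_true, y_pred, labels):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    tp = {c: 0 for c in labels}
    fp = {c: 0 for c in labels}
    fn = {c: 0 for c in labels}
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro

# Made-up toy labels: class 0 is frequent, classes 1 and 2 are rare.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 0, 2]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
micro, macro = f1_scores(y_true, y_pred, labels=[0, 1, 2])
print(f"accuracy={accuracy:.4f} micro={micro:.4f} macro={macro:.4f}")
```

In this toy case accuracy and F1 micro coincide, while F1 macro is lower because the two rare classes are each only half recovered.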

F1 macro computes the F1 score for each class independently and then takes the unweighted average, so every class counts equally regardless of its support. In our case, the F1 macro score for the neural network is 0.4265, which indicates more even performance across the classes. The F1 macro score for logistic regression is only 0.3787, which suggests that it is more biased towards the more frequent classes.

Looking at the F1 scores for individual classes in the classification reports above, logistic regression struggles most with the Extremely Negative class (label 0), where it scores only 0.28; this class has the lowest support of all classes in the test set (592 tweets). The neural network, in contrast, scores 0.43 on the same class despite its low support. Comparing the other classes as well, the neural network's F1 scores vary less (0.40 to 0.48) than those of logistic regression (0.28 to 0.47).

Taken together, these findings suggest that the neural network is the more robust and suitable of the two models for sentiment classification of COVID-19 tweets in the presence of class imbalance.
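
The averages in the two classification reports can be approximately reproduced from the per-class rows. The sketch below copies the per-class F1 scores and supports from the reports above; because those values are rounded to two decimals, the results agree with the reported 0.3787 and 0.4265 only approximately. The helper names are illustrative.

```python
# Per-class values copied from the classification reports (classes 0 to 4).
support = [592, 1041, 619, 947, 599]
f1_lr = [0.28, 0.39, 0.47, 0.41, 0.34]  # logistic regression
f1_nn = [0.43, 0.41, 0.48, 0.40, 0.41]  # neural network

def macro_avg(f1):
    """Unweighted mean: every class counts equally."""
    return sum(f1) / len(f1)

def weighted_avg(f1, support):
    """Support-weighted mean: frequent classes count more."""
    return sum(f * s for f, s in zip(f1, support)) / sum(support)

def spread(f1):
    """Gap between the best and worst class."""
    return max(f1) - min(f1)

for name, f1 in [("Logistic Regression", f1_lr), ("Neural Network", f1_nn)]:
    print(f"{name}: macro={macro_avg(f1):.3f} "
          f"weighted={weighted_avg(f1, support):.3f} spread={spread(f1):.2f}")
```

The smaller spread for the neural network is what the comparison of per-class scores refers to when it says the scores vary less.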


Reasons for Lower Accuracy

Although the neural network is the better fit of the two for sentiment analysis of COVID-19 tweets, its accuracy is too low to call it a good model for this task. Below are some of the reasons that contribute to the low accuracy.

1. Class Imbalance: Our dataset is imbalanced, so the model is biased towards the classes with more training samples. This lowers performance on the less frequent classes and affects the overall accuracy.

2. Mean of Word Embeddings: We represent each tweet by the mean of its word embeddings. Averaging discards word order and context, so the model cannot tell how the words in a tweet combine and therefore struggles to distinguish positive from negative usage.

3. Data Size: The Word2Vec model that generates the embedding for each word is trained on only 44,955 tweets, which is a very small corpus for an NLP task. With so little training data, the model cannot reliably capture the context and relationships of each word.
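
The limitation of mean word embeddings in point 2 can be illustrated with a tiny self-contained sketch. The 3-dimensional vectors below are made up and merely stand in for Word2Vec embeddings, and `mean_vector` is a simplified plain-Python stand-in for the averaging step, not the notebook's function.

```python
# Made-up 3-dimensional vectors standing in for learned word embeddings.
toy_vectors = {
    "good": [1.0, 0.0, 2.0],
    "not": [-1.0, 1.0, 0.0],
    "bad": [0.0, -2.0, 1.0],
}

def mean_vector(tokens, vectors, size):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * size
    return [sum(dim) / len(known) for dim in zip(*known)]

# Same words, different order, opposite meaning.
a = mean_vector(["not", "good", "bad"], toy_vectors, 3)
b = mean_vector(["not", "bad", "good"], toy_vectors, 3)

print(a)
print(a == b)  # the two tweets are indistinguishable after averaging
```

Any classifier trained on such averaged vectors receives identical input for both token sequences, so it cannot assign them different sentiments.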
