# Advanced Programme in Deep Learning (Foundations and Applications)
## A Program by IISc and TalentSprint

### Mini Project Notebook: Text classification of coronavirus tweets from the peak COVID-19 period using LSTMs/RNNs/CNNs/BERT


## Learning Objectives

At the end of the mini project, you will be able to:

* preprocess the tweet text
* represent the words using pretrained word embeddings (Word2Vec/GloVe)
* build deep neural networks (RNN, LSTM, GRU, CNN, Bidirectional LSTM/GRU, BERT) to classify the tweets


### Introduction

First, why is sentiment analysis needed for social media?

People around the world use social media more than ever. Sentiment analysis of social media data helps us understand public opinion on topics such as movies, events, politics, and sports, and extract useful insights from it. Businesses also use it for market research and to understand customers' experiences with their products or services.

A natural follow-up question is: why run sentiment analysis on COVID-19 tweets? What about coronavirus tweets could be positive? Sentiment analysis of movie or book reviews is familiar, but what is the purpose of exploring and analyzing this type of data?

The use of social media for communication during times of crisis has increased remarkably in recent years. As mentioned above, analyzing social media data is important because it helps us understand public sentiment. During the coronavirus pandemic, many people took to social media to express their anger, grief, or sadness, while some also spread happiness and positivity. People also used social media to ask their network for help with vaccines or hospitals during this hard time. Experts may be able to address some pandemic-related issues better if they take this social data into account. That is why analyzing this type of data matters: it helps reveal the overall issues people faced.



## Dataset

The given challenge is to build a multiclass classification model to predict the sentiment of Covid-19 tweets. The tweets have been pulled from Twitter and manual tagging has been done. We are given information like Location, Tweet At, Original Tweet, and Sentiment.

The training dataset consists of 36000 tweets and the testing dataset consists of 8955 tweets. There are 5 sentiments namely ‘Positive’, ‘Extremely Positive’, ‘Negative’, ‘Extremely Negative’, and ‘Neutral’ in the sentiment column.

## Description

Each record has the following fields about the tweet and the user who posted it:

1. **UserName:** Twitter handle
2. **ScreenName:** a personal identifier on Twitter, separate from the username
3. **Location:** where in the world the person tweets from
4. **TweetAt:** date of the tweet posted (DD-MM-YYYY)
5. **OriginalTweet:** the tweet itself
6. **Sentiment:** sentiment value
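
Because `TweetAt` is day-first, it is worth parsing it with an explicit format instead of relying on automatic date detection. The sketch below uses only the standard library, and the sample strings are made up for illustration:

```python
from datetime import datetime

# Hypothetical sample values in the DD-MM-YYYY layout described above
samples = ["16-03-2020", "02-04-2020"]

# "%d-%m-%Y" reads the day first, so "02-04-2020" is 2 April, not 4 February
parsed = [datetime.strptime(s, "%d-%m-%Y") for s in samples]
months = [d.month for d in parsed]
print(months)
```

With pandas, the equivalent call would be `pd.to_datetime(train['TweetAt'], format='%d-%m-%Y')`.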



## Problem Statement

To build and implement a multiclass deep neural network model that classifies tweets into Positive/Extremely Positive/Negative/Extremely Negative/Neutral sentiments.

## Grading

Here is a handy link to Kaggle's competition documentation (https://www.kaggle.com/docs/competitions), which includes, among other things, instructions on submitting predictions (https://www.kaggle.com/docs/competitions#making-a-submission).

## Instructions for downloading train and test dataset from Kaggle API are as follows:

### 1. Create an API key in Kaggle.

To do this, go to the competition site on Kaggle at (https://www.kaggle.com/t/15cef0def403469ebbb5db1a67991873) and open your user settings page as follows:

* Click on your profile picture at the top-right corner of the page.

![alt text](https://i.imgur.com/kSLmEj2.png)

* In the popout menu, click the Settings option.

![alt text](https://i.imgur.com/tNi6yun.png)








### 2. Next, scroll down to the API access section and click generate to download an API key (kaggle.json).
![alt text](https://i.imgur.com/vRNBgrF.png)


### 3. Upload your kaggle.json file using the following snippet in a code cell:



In [None]:
from google.colab import files
files.upload()

In [None]:
#If successfully uploaded in the above step, the 'ls' command here should display the kaggle.json file.
%ls

### 4. Install the Kaggle API using the following command


In [None]:
!pip install -U -q kaggle==1.5.8

### 5. Move the kaggle.json file into ~/.kaggle, which is where the API client expects your token to be located:



In [None]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/

In [None]:
# Execute the following command to verify whether the kaggle.json is stored in the appropriate location: ~/.kaggle/kaggle.json
!ls ~/.kaggle

In [None]:
!chmod 600 ~/.kaggle/kaggle.json  # restrict the token file to your user so it stays private on Colab

### 6. Now download the train and test data from Kaggle

**NOTE: If you get a '404 - Not Found' error after running the cell below, it is most likely that the user (whose kaggle.json is uploaded above) has not 'accepted' the rules of the competition and therefore has 'not joined' the competition.**

If you encounter a **401 - Unauthorized** error, download the latest **kaggle.json** by repeating steps 1 & 2.

In [None]:
# If you get a '403 - Forbidden' error, you have most likely not joined the competition.
# !kaggle competitions download -c multi-text-classification-of-coronavirus-tweets
!kaggle competitions download -c to-classify-coronavirus-tweets-during-covid-19

In [None]:
!unzip to-classify-coronavirus-tweets-during-covid-19.zip

## YOUR CODING STARTS FROM HERE

## Import required packages

In [None]:
# Import required packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import nltk
import sklearn
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from gensim.utils import simple_preprocess
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

import tensorflow as tf  # use TensorFlow
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import pad_sequences
from tensorflow.keras.layers import Input, Embedding, Dense, Bidirectional, Dropout, GRU, LSTM, BatchNormalization, MaxPooling1D,Attention, GlobalAveragePooling1D

from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

##   **Stage 1**:  Data Loading and Perform Exploratory Data Analysis

* Load the Dataset


In [None]:
train = pd.read_csv('corona_nlp_train.csv/corona_nlp_train.csv', encoding='latin1')
test = pd.read_csv('corona_nlp_test.csv/corona_nlp_test.csv', encoding='latin1')

In [None]:
train.head()

In [None]:
test.head()

* Check for Missing Values

In [None]:
# missing values in train dataset
train.isnull().sum() / len(train) * 100

In [None]:
# missing values in test dataset
test.isnull().sum() / len(test) * 100

In [None]:
# impute Location missing values with a category 'Not available'
train['Location'] = train['Location'].fillna('Not available')
test['Location'] = test['Location'].fillna('Not available')

In [None]:
train['Location'].isnull().sum(), test['Location'].isnull().sum()

* Visualize the sentiment column values


In [None]:
train['Sentiment'].value_counts() / len(train) * 100

In [None]:
sns.countplot(x='Sentiment', data=train)
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better readability
plt.title('Distribution of Sentiments')
plt.xlabel('Sentiment')
plt.ylabel('Count')
plt.show()

* Visualize top 10 Countries that had the highest tweets using countplot (Tweet count vs Location)


In [None]:
sns.countplot(x='Location', data=train, order=train['Location'].value_counts().iloc[:10].index)
plt.xticks(rotation=90)
plt.title('Top 10 Countries with Highest Tweet Counts')
plt.xlabel('Location')
plt.ylabel('Tweet Count')
plt.show()

* Plotting Pie Chart for the Sentiments in percentage


In [None]:
sentiment_counts = train['Sentiment'].value_counts()
plt.pie(sentiment_counts, labels=sentiment_counts.index, autopct='%.2f%%')
plt.title('Sentiment Distribution in Tweets')
plt.show()

* WordCloud for the Tweets/Text

    * Visualize the most commonly used words in each sentiment using wordcloud
    * Refer to the following [link](https://medium.com/analytics-vidhya/word-cloud-a-text-visualization-tool-fb7348fbf502) for Word Cloud: A Text Visualization tool




In [None]:
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

# Concatenate all tweets for each sentiment
positive_tweets = " ".join(train[train['Sentiment'] == 'Positive']['OriginalTweet'])
negative_tweets = " ".join(train[train['Sentiment'] == 'Negative']['OriginalTweet'])
neutral_tweets = " ".join(train[train['Sentiment'] == 'Neutral']['OriginalTweet'])
extremely_positive_tweets = " ".join(train[train['Sentiment'] == 'Extremely Positive']['OriginalTweet'])
extremely_negative_tweets = " ".join(train[train['Sentiment'] == 'Extremely Negative']['OriginalTweet'])


# Generate word clouds for each sentiment
def generate_wordcloud(text, title):
    # 'text' is one long string holding every tweet of a given sentiment
    wordcloud = WordCloud(width=800, height=400, background_color='white', stopwords=STOPWORDS).generate(text)
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title(title)
    plt.show()

generate_wordcloud(positive_tweets, 'Wordcloud for Positive Tweets')
generate_wordcloud(negative_tweets, 'Wordcloud for Negative Tweets')
generate_wordcloud(neutral_tweets, 'Wordcloud for Neutral Tweets')
generate_wordcloud(extremely_positive_tweets, 'Wordcloud for Extremely Positive Tweets')
generate_wordcloud(extremely_negative_tweets, 'Wordcloud for Extremely Negative Tweets')

##   **Stage 2**: Data Pre-Processing
####  Clean and Transform the data into a specified format


In [None]:
#text preprocessing on the OriginalTweet
train['OriginalTweet'] = train['OriginalTweet'].apply(lambda text:simple_preprocess(text, max_len=300))
test['OriginalTweet'] = test['OriginalTweet'].apply(lambda text:simple_preprocess(text, max_len=300))

In [None]:
# Remove stop words, but keep 'not' because it carries negation
stop_words = set(stopwords.words('english'))
stop_words.remove('not')
train['OriginalTweet'] = train['OriginalTweet'].apply(lambda x: [w for w in x if w not in stop_words])
test['OriginalTweet'] = test['OriginalTweet'].apply(lambda x: [w for w in x if w not in stop_words])
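
Tweets usually contain links, @mentions and hashtags, and a generic tokenizer can leave link fragments such as `https` and `co` in the token lists. An optional extra cleaning step could strip them first. The sketch below is one possible approach, not part of the original pipeline: `strip_tweet_noise` is a hypothetical helper and the example tweet is made up.

```python
import re

def strip_tweet_noise(text):
    """Remove URLs and @mentions, and keep hashtag words without the '#'."""
    text = re.sub(r"https?://\S+", " ", text)   # drop links
    text = re.sub(r"@\w+", " ", text)           # drop user mentions
    text = text.replace("#", " ")               # keep the hashtag word itself
    return re.sub(r"\s+", " ", text).strip()    # collapse whitespace

example = "@user Stock up calmly! https://t.co/abc123 #coronavirus"
cleaned = strip_tweet_noise(example)
print(cleaned)
```

A step like this would have to run on the raw strings, before the `simple_preprocess` cell turns each tweet into a token list.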

##   **Stage 3**: Build the Word Embeddings using pretrained Word2Vec/GloVe (Text Representation)


In [None]:
!wget https://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip

In [None]:
embeddings_index = {}
# Load the 300-dimensional GloVe vectors: each line is a word followed by its coefficients
with open('glove.6B.300d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype=np.float32)
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))
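
To make the file layout concrete, here is a small sketch that parses two made-up lines in the same "word followed by numbers" format. The words and values are invented, and real GloVe 300d lines carry 300 numbers each.

```python
import io
import numpy as np

# Two made-up lines in the GloVe text layout (real vectors have 300 values)
fake_file = io.StringIO("virus 0.1 -0.2 0.3\nstore 0.0 0.5 -0.5\n")

toy_index = {}
for line in fake_file:
    values = line.split()
    # First field is the word, the rest are the vector coefficients
    toy_index[values[0]] = np.asarray(values[1:], dtype=np.float32)

print(toy_index["virus"].shape)
```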

In [None]:
MAX_VOCAB_SIZE = 20000
EMBEDDING_DIM = 300
MAX_SENT_LEN = 45
BATCH_SIZE = 64
N_EPOCHS = 30

tf.random.set_seed(42)
np.random.seed(42)

In [None]:
tokenizer = Tokenizer(num_words=MAX_VOCAB_SIZE, oov_token='<UNK>')
tokenizer.fit_on_texts([' '.join(seq[:MAX_SENT_LEN]) for seq in train['OriginalTweet']])

print("Number of words in vocabulary:", len(tokenizer.word_index))
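
Note that `len(tokenizer.word_index)` reports every word seen during fitting; `num_words` only takes effect later, when texts are converted to sequences. The pure-Python sketch below mimics that indexing scheme on a toy corpus. It is a simplified illustration, not the Keras implementation, and `to_sequence` is a hypothetical helper.

```python
from collections import Counter

# Toy corpus; the real notebook fits on the cleaned training tweets
texts = ["panic buying food", "food prices rise", "food panic"]

counts = Counter(word for t in texts for word in t.split())

# Index 0 is reserved for padding and index 1 for the OOV token,
# so real words start at 2, most frequent first
word_index = {"<UNK>": 1}
for rank, (word, _) in enumerate(counts.most_common(), start=2):
    word_index[word] = rank

def to_sequence(text, num_words):
    # Words missing from the index, or ranked at num_words or beyond, map to <UNK>
    return [word_index[w] if word_index.get(w, num_words) < num_words else 1
            for w in text.split()]

print(to_sequence("food shortage panic", num_words=4))
```

Here `shortage` was never seen and `buying` falls outside the `num_words` cut-off, so both would be encoded as the `<UNK>` index.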

In [None]:
# Add 1 because index 0 is reserved for padding
words_not_found = []
vocab_size = len(tokenizer.word_index) + 1
print('Loaded %s word vectors.' % len(embeddings_index))

# Create a weight matrix for words in the training data
embedding_matrix = np.zeros((vocab_size, EMBEDDING_DIM))
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if (embedding_vector is not None) and len(embedding_vector) > 0:
        embedding_matrix[i] = embedding_vector
    else:
        # words missing from GloVe keep an all-zero row
        words_not_found.append(word)
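
A quick sanity check after building the matrix is the embedding coverage, that is, the share of vocabulary words that actually received a pretrained vector. The sketch below shows the idea with toy stand-ins for `tokenizer.word_index` and `embeddings_index`; all names and values in it are illustrative.

```python
import numpy as np

# Toy stand-ins for tokenizer.word_index and embeddings_index
toy_word_index = {"<UNK>": 1, "food": 2, "covid": 3, "store": 4}
toy_embeddings = {"food": np.ones(3, dtype=np.float32),
                  "store": np.full(3, 0.5, dtype=np.float32)}

matrix = np.zeros((len(toy_word_index) + 1, 3))
missing = []
for word, i in toy_word_index.items():
    vec = toy_embeddings.get(word)
    if vec is not None:
        matrix[i] = vec
    else:
        missing.append(word)  # row stays all zeros

coverage = 1 - len(missing) / len(toy_word_index)
print(f"coverage: {coverage:.0%}, missing: {missing}")
```

In the notebook itself, the same ratio would be `1 - len(words_not_found) / len(tokenizer.word_index)`. Low coverage would suggest that many tweet-specific tokens are absent from the pretrained vocabulary.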

##   **Stage 4**: Build and Train the Deep Recurrent Model using PyTorch/Keras



In [None]:
# Convert the sequence of words to a sequence of indices
X = tokenizer.texts_to_sequences([' '.join(seq[:MAX_SENT_LEN]) for seq in train['OriginalTweet']])
X = pad_sequences(X, maxlen=MAX_SENT_LEN, padding='post', truncating='post')
y = train['Sentiment']
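
The effect of `padding='post'` and `truncating='post'` can be shown with a few lines of plain Python. `pad_post` below is a hypothetical helper that mimics the behaviour for a single sequence; it is not the Keras function itself.

```python
def pad_post(seq, maxlen, pad_value=0):
    """Mimic padding='post', truncating='post' for a single sequence."""
    trimmed = seq[:maxlen]                      # keep the first maxlen tokens
    return trimmed + [pad_value] * (maxlen - len(trimmed))

short = pad_post([5, 8, 2], maxlen=5)           # zeros are appended at the end
long = pad_post([5, 8, 2, 9, 4, 7], maxlen=5)   # tokens past maxlen are dropped
print(short, long)
```

Every row therefore ends up with exactly `MAX_SENT_LEN` entries, which is what the fixed-shape `Input` layer requires.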

In [None]:
# Converting the labels from categorical to numerical
le = LabelEncoder()
y = le.fit_transform(y)

In [None]:
le.classes_
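
`LabelEncoder` assigns codes by the sorted order of the label strings, so the integers are alphabetical and do not reflect sentiment intensity. The standard-library sketch below reproduces that mapping for the five labels:

```python
labels = ["Positive", "Extremely Positive", "Negative",
          "Extremely Negative", "Neutral"]

# The encoder stores the unique labels in sorted order; the position is the code
classes = sorted(set(labels))
label_to_id = {name: i for i, name in enumerate(classes)}
print(label_to_id)
```

This is why 'Extremely Negative' gets code 0 and 'Positive' gets code 4, with 'Neutral' at 3 rather than in the middle.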

In [None]:
# split to train and validation datasets
# X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=42, train_size=34560)

In [None]:
# Define the model
input_layer = Input(shape=(MAX_SENT_LEN,))
embedding_layer = Embedding(vocab_size, EMBEDDING_DIM, weights =[embedding_matrix], trainable=False)(input_layer)
bi_lstm1 = Bidirectional(LSTM(128, return_sequences=True, dropout=0.2))(embedding_layer)
batch_nor1 = BatchNormalization()(bi_lstm1)
bi_lstm2 = Bidirectional(LSTM(128, return_sequences=True, dropout=0.2))(batch_nor1)
batch_nor2 = BatchNormalization()(bi_lstm2)
bi_lstm3 = Bidirectional(LSTM(64, return_sequences=True, dropout=0.2))(batch_nor2)

# Attention mechanism
query = Dense(128, kernel_regularizer='l2')(bi_lstm3)  # Use bi_lstm output as query
query = Dropout(0.2)(query)  # Dropout applied to the query layer
value = Dense(128, kernel_regularizer='l2')(bi_lstm3)  # Use bi_lstm output as value
value = Dropout(0.2)(value)  # Dropout applied to the value layer
attention_layer = Attention()([query, value])
attention_output = GlobalAveragePooling1D()(attention_layer)  # Summarize the attention output

# Add Dense Layers after Attention and Pooling
dense_1 = Dense(64, activation='relu', kernel_regularizer='l2')(attention_output)
dropout1 = Dropout(0.4)(dense_1)  # apply dropout to the dense output

# Fully connected output layer
output_layer = Dense(5, activation='softmax')(dropout1)
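
For intuition, the attention step followed by average pooling can be written out in NumPy. This is a simplified sketch of dot-product attention for a single sample, with the value also acting as the key; it leaves out the batch dimension, masking and dropout, and the sizes are toy values.

```python
import numpy as np

def dot_product_attention(query, value):
    """query: (Tq, d), value: (Tv, d); the value doubles as the key."""
    scores = query @ value.T                          # (Tq, Tv) similarity
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over value steps
    return weights @ value, weights                   # (Tq, d), (Tq, Tv)

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))   # 4 time steps, 8 features (toy sizes)
v = rng.normal(size=(4, 8))

context, weights = dot_product_attention(q, v)
pooled = context.mean(axis=0)  # averaging over time, as global average pooling does
print(context.shape, pooled.shape)
```

Each output step is a weighted mix of all value steps, and pooling then collapses the sequence into one fixed-size vector for the dense layers.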


In [None]:
# Build and compile the model
model = Model(inputs=input_layer, outputs=output_layer)
model.compile(optimizer=Adam(learning_rate=0.001), loss='sparse_categorical_crossentropy', metrics=['accuracy'])

In [None]:
print('Summary of the model')
model.summary()

In [None]:
early_stopping = EarlyStopping(
    monitor='val_accuracy',
    patience=8, restore_best_weights=True)
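
The callback only takes effect when it is passed to `model.fit` through the `callbacks` argument. The patience logic itself can be sketched in plain Python; `stop_epoch` is a hypothetical helper that simplifies the real callback (no `min_delta`, no weight handling), and the accuracy values are made up.

```python
def stop_epoch(val_scores, patience):
    """Return (epoch at which training stops, epoch with the best score), 0-based."""
    best, best_epoch, wait = float("-inf"), None, 0
    for epoch, score in enumerate(val_scores):
        if score > best:
            best, best_epoch, wait = score, epoch, 0   # improvement resets the counter
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_scores) - 1, best_epoch

history_toy = [0.50, 0.58, 0.61, 0.60, 0.61, 0.59]
result = stop_epoch(history_toy, patience=3)
print(result)
```

With `restore_best_weights=True`, the model would be rolled back to the weights from the best epoch rather than keeping those from the last one.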

In [None]:
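# fit the model; early stopping only takes effect when passed as a callback
history = model.fit(X, y,
                    batch_size=BATCH_SIZE,
                    epochs=N_EPOCHS,
                    validation_split=0.2,
                    callbacks=[early_stopping])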
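
Keras takes the `validation_split` fraction from the end of the arrays, before any shuffling. The NumPy sketch below illustrates the resulting slices on a toy array; it mirrors the documented behaviour rather than calling Keras.

```python
import numpy as np

n_samples, validation_split = 10, 0.2
X_toy = np.arange(n_samples)

# The split point is computed once; nothing is shuffled before slicing
split_at = int(n_samples * (1 - validation_split))
train_part, val_part = X_toy[:split_at], X_toy[split_at:]
print(train_part, val_part)
```

If the rows are ordered (for example by date), the tail may not be representative of the whole dataset, in which case a stratified `train_test_split` is a reasonable alternative.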

##   **Stage 5**: Evaluate the Model and get model predictions on the test dataset

* Upload the model predictions to Kaggle after mapping the sentiment column values from numerical codes back to categorical labels







In [None]:
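print('Testing...')
# validation_split=0.2 in model.fit holds out the last 20% of X and y; evaluate on that same tail
split_at = int(len(X) * (1 - 0.2))
model.evaluate(X[split_at:], y[split_at:])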

In [None]:
def predict_class(text):
    '''Predict and print the sentiment class of a single raw tweet (a string)'''

    sentiment_classes = ['Extremely Negative', 'Extremely Positive', 'Negative', 'Neutral',
       'Positive']
    max_len = MAX_SENT_LEN

    # Apply the same cleaning as the training data: tokenize, then drop stop words
    tokens = [w for w in simple_preprocess(text, max_len=300) if w not in stop_words]
    # Transform the text to a sequence of integers; the tokenizer expects a list of texts
    xt = tokenizer.texts_to_sequences([' '.join(tokens[:max_len])])
    # Pad sequences to the same length
    xt = pad_sequences(xt, maxlen=max_len, padding='post', truncating='post')
    # Predict with the trained model
    yt = model.predict(xt).argmax(axis=1)
    # Print the predicted sentiment
    print('The predicted sentiment is', sentiment_classes[yt[0]])

In [None]:
X_test = tokenizer.texts_to_sequences([' '.join(seq[:MAX_SENT_LEN]) for seq in test['OriginalTweet']])
X_test = pad_sequences(X_test, maxlen=MAX_SENT_LEN, padding='post', truncating='post')

In [None]:
# model predictions on the test data
preds = model.predict(X_test)
preds.shape

In [None]:
# Convert probabilities to class labels (0 through 4 for 5 classes)
predicted_classes = np.argmax(preds, axis=1)
predicted_classes.shape
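
As a small worked example of this step, the sketch below takes made-up softmax outputs for three tweets and picks the most probable class per row. The class order follows the alphabetical order used by the label encoder.

```python
import numpy as np

# Made-up softmax outputs for three tweets over the five classes
probs = np.array([
    [0.05, 0.10, 0.15, 0.20, 0.50],
    [0.60, 0.05, 0.25, 0.05, 0.05],
    [0.10, 0.10, 0.10, 0.60, 0.10],
])
class_names = np.array(["Extremely Negative", "Extremely Positive",
                        "Negative", "Neutral", "Positive"])

ids = probs.argmax(axis=1)          # index of the largest value in each row
print(ids, class_names[ids])
```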

In [None]:
test['predicted_sentiment'] = predicted_classes

In [None]:
test["Sentiment"] = le.inverse_transform(test['predicted_sentiment'])

In [None]:
test.head()

In [None]:
le.classes_

In [None]:
test.to_csv('output.csv', index=False)

### Instructions for preparing Kaggle competition predictions


* Get the predictions using the trained model and prepare a csv file
    * The deep network outputs a value for each class; take the class with the maximum value as the prediction using `np.argmax`.

* The predictions (csv) file should contain 2 columns, as in Sample_Submission.csv
  - The first column is the Test_Id, which is treated as the index
  - The second column is the prediction in decoded form (e.g. Positive, Negative, etc.).
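
The two-column layout described above could be produced along the following lines. This is only a sketch: the header names `Test_Id` and `Sentiment`, the ids starting at 0, and the sample predictions are assumptions, so check them against Sample_Submission.csv before submitting.

```python
import csv
import io

# Hypothetical decoded predictions; the real ones come from le.inverse_transform
decoded = ["Positive", "Extremely Negative", "Neutral"]

buffer = io.StringIO()   # stands in for open('submission.csv', 'w', newline='')
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Test_Id", "Sentiment"])   # header names are assumptions
for test_id, label in enumerate(decoded):
    writer.writerow([test_id, label])

print(buffer.getvalue())
```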