<a href="https://colab.research.google.com/github/tharungajula2/Portfolio/blob/main/Team_13_Kaggle1_Final_Submission.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Final Model**


# Advanced Programme in Deep Learning (Foundations and Applications)
## A Program by IISc and TalentSprint

### Mini Project Notebook: Multi-Text Classification of Coronavirus Tweets using Deep Neural Networks (RNNs).


## Learning Objectives

At the end of the mini-hackathon, you will be able to:

* preprocess the text data
* represent the text/words using pretrained word embeddings (Word2Vec/GloVe)
* build a deep neural network (RNN, LSTM, GRU, CNN, bidirectional LSTM/GRU, BERT) to classify the tweets


### Introduction

First, why is sentiment analysis needed for social media?

People from all around the world have been using social media more than ever. Sentiment analysis on social media data helps to understand the wider public opinion about certain topics such as movies, events, politics, sports, and more and gain valuable insights from this social data. Sentiment analysis has some powerful applications. Nowadays it is also used by some businesses to do market research and understand the customer’s experiences for their products or services.

A natural question follows: why run sentiment analysis on COVID-19 tweets? What about coronavirus tweets could be positive? Sentiment analysis is familiar from movie and book reviews, but what is the purpose of exploring and analyzing this kind of data?

The use of social media for communication during times of crisis has increased remarkably in recent years. As mentioned above, analyzing social media data is important because it helps us understand public sentiment. During the coronavirus pandemic, many people took to social media to express their anger, grief, or sadness, while some also spread happiness and positivity. People also used social media to ask their network for help with vaccines or hospitals during this hard time. Many pandemic-related issues could be addressed more effectively if experts took this social data into account. That is why analyzing this type of data is important for understanding the issues people face.



## Dataset

The given challenge is to build a multiclass classification model to predict the sentiment of Covid-19 tweets. The tweets have been pulled from Twitter and manual tagging has been done. We are given information like Location, Tweet At, Original Tweet, and Sentiment.

The training dataset consists of 36000 tweets and the testing dataset consists of 8955 tweets. There are 5 sentiments namely ‘Positive’, ‘Extremely Positive’, ‘Negative’, ‘Extremely Negative’, and ‘Neutral’ in the sentiment column.

## Description

This dataset has the following information about the user who tweeted:

1. **UserName:** Twitter handle
2. **ScreenName:** a personal identifier on Twitter, separate from the username
3. **Location:** where in the world the person tweets from
4. **TweetAt:** date of the tweet posted (DD-MM-YYYY)
5. **OriginalTweet:** the tweet itself
6. **Sentiment:** sentiment value
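
Because `TweetAt` stores dates as DD-MM-YYYY strings, parsing them benefits from an explicit format. The following is a minimal sketch, assuming pandas; the `sample` frame and its two rows are hypothetical stand-ins for the real dataset.

```python
import pandas as pd

# Hypothetical sample rows; the real values come from the training CSV
sample = pd.DataFrame({"TweetAt": ["16-03-2020", "02-04-2020"]})

# Parse day-first dates explicitly so 02-04-2020 is read as 2 April, not 4 February
sample["TweetAt"] = pd.to_datetime(sample["TweetAt"], format="%d-%m-%Y")

print(sample["TweetAt"].dt.month.tolist())
```

Passing `format` avoids relying on pandas to guess whether the day or the month comes first.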



## Problem Statement

To build and implement a multiclass classification deep neural network model to classify between Positive/Extremely Positive/Negative/Extremely Negative/Neutral sentiments

## Grading = 10 Marks

Here is a handy link to Kaggle's competition documentation (https://www.kaggle.com/docs/competitions), which includes, among other things, instructions on submitting predictions (https://www.kaggle.com/docs/competitions#making-a-submission).

## Instructions for downloading train and test dataset from Kaggle API are as follows:

### 1. Create an API key in Kaggle.

To do this, go to the competition site on Kaggle at (https://www.kaggle.com/t/6ff3f8dbf34a4a57af7eac66dded4f31) and open your user settings page. Click Account.

* Click on your profile picture at the top-right corner of the page.

![alt text](https://i.imgur.com/kSLmEj2.png)

* In the popout menu, click the Settings option.

![alt text](https://i.imgur.com/tNi6yun.png)








### 2. Next, scroll down to the API access section and click generate to download an API key (kaggle.json).
![alt text](https://i.imgur.com/vRNBgrF.png)


### 3. Upload your kaggle.json file using the following snippet in a code cell:



In [None]:
#Start

In [None]:
from google.colab import files
files.upload()

In [None]:
#If successfully uploaded in the above step, the 'ls' command here should display the kaggle.json file.
%ls

### 4. Install the Kaggle API using the following command


In [None]:
!pip install -U -q kaggle==1.5.8

### 5. Move the kaggle.json file into ~/.kaggle, which is where the API client expects your token to be located:



In [None]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/

In [None]:
# Execute the following command to verify whether the kaggle.json is stored in the appropriate location: ~/.kaggle/kaggle.json
!ls ~/.kaggle

In [None]:
!chmod 600 /root/.kaggle/kaggle.json # run this command to ensure your Kaggle API token is secure on colab

### 6. Now download the Train and Test Data from Kaggle

**NOTE: If you get a '404 - Not Found' error after running the cell below, it is most likely that the user (whose kaggle.json is uploaded above) has not 'accepted' the rules of the competition and therefore has 'not joined' the competition.**

If you encounter a **401 - Unauthorized** error, download the latest **kaggle.json** by repeating steps 1 and 2.

In [None]:
#If you get a forbidden link, you have most likely not joined the competition.
!kaggle competitions download -c multi-text-classification-of-coronavirus-tweets

In [None]:
!unzip /content/multi-text-classification-of-coronavirus-tweets.zip

## YOUR CODING STARTS FROM HERE

## Import required packages

In [None]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

##   **Stage 1**: Data Loading and Exploratory Data Analysis (1 Point)

* Load the Dataset


In [None]:
# Load the dataset
train_data = pd.read_csv('/content/corona_nlp_train.csv/corona_nlp_train.csv', encoding='ISO-8859-1')
test_data = pd.read_csv('/content/corona_nlp_test.csv/corona_nlp_test.csv', encoding='ISO-8859-1')

# Display the first few rows of each dataframe
print("Train Data")
print(train_data.head())
print("\nTest Data")
print(test_data.head())

* Check for Missing Values

In [None]:


# Handle missing values in 'Location' by filling with 'Unknown'
train_data['Location'] = train_data['Location'].fillna('Unknown')
test_data['Location'] = test_data['Location'].fillna('Unknown')

In [None]:
# Check for missing values in the train data
print("Missing values in train data:")
print(train_data.isnull().sum())

# Check for missing values in the test data
print("\nMissing values in test data:")
print(test_data.isnull().sum())

* Visualize the sentiment column values


In [None]:
# Sentiment mapping based on the actual values in the dataset
sentiment_mapping = {
    'Extremely Positive': 4,
    'Positive': 3,
    'Neutral': 2,
    'Negative': 1,
    'Extremely Negative': 0
}

# Map sentiment values to numeric values
train_data['Sentiment'] = train_data['Sentiment'].map(sentiment_mapping)

# Plot the distribution of sentiment values
plt.figure(figsize=(10, 6))
sns.countplot(x='Sentiment', data=train_data, order=[0, 1, 2, 3, 4])
plt.title('Distribution of Sentiment Values')
plt.xlabel('Sentiment')
plt.ylabel('Count')
plt.xticks(ticks=[0, 1, 2, 3, 4], labels=['Extremely Negative', 'Negative', 'Neutral', 'Positive', 'Extremely Positive'])
plt.show()

* Visualize the top 10 countries with the most tweets using countplot (Tweet count vs Location)


In [None]:
# Convert all location strings to lower case for consistency
train_data['Location'] = train_data['Location'].str.lower()
test_data['Location'] = test_data['Location'].str.lower()

# Define a function to clean and map locations to countries
def clean_location(location):
    if re.search(r'\b(new york|ny|nyc)\b', location):
        return 'united states'
    elif re.search(r'\b(london|england|uk|united kingdom|britain)\b', location):
        return 'united kingdom'
    elif re.search(r'\b(california|ca|los angeles|la|san francisco|sf)\b', location):
        return 'united states'
    elif re.search(r'\b(washington|dc)\b', location):
        return 'united states'
    elif re.search(r'\b(usa|us|u\.s\.?|united states)\b', location):
        return 'united states'
    elif re.search(r'\bindia\b', location):
        return 'india'
    elif re.search(r'\b(australia|sydney|melbourne)\b', location):
        return 'australia'
    elif re.search(r'\b(canada|toronto|vancouver)\b', location):
        return 'canada'
    elif re.search(r'\b(germany|berlin)\b', location):
        return 'germany'
    elif re.search(r'\b(france|paris)\b', location):
        return 'france'
    elif re.search(r'\b(spain|madrid|barcelona)\b', location):
        return 'spain'
    elif re.search(r'\b(italy|rome|milan)\b', location):
        return 'italy'
    elif re.search(r'\b(brazil|rio|são paulo)\b', location):
        return 'brazil'
    elif re.search(r'\b(china|beijing|shanghai)\b', location):
        return 'china'
    elif re.search(r'\b(japan|tokyo)\b', location):
        return 'japan'
    elif re.search(r'\b(mexico|mexico city)\b', location):
        return 'mexico'
    elif re.search(r'\b(atlanta)\b', location):
        return 'united states'
    elif re.search(r'\b(boston)\b', location):
        return 'united states'
    # Add more patterns as needed
    else:
        return location

# Apply the clean_location function
train_data['Location'] = train_data['Location'].apply(clean_location)
test_data['Location'] = test_data['Location'].apply(clean_location)

# Verify the changes
print("Unique locations in train data:")
print(train_data['Location'].unique())

print("\nUnique locations in test data:")
print(test_data['Location'].unique())

# Count the number of tweets per location again after cleaning
top_10_locations = train_data['Location'].value_counts().head(10)

# Plot the top 10 locations
plt.figure(figsize=(12, 6))
sns.countplot(y=train_data[train_data['Location'].isin(top_10_locations.index)]['Location'], order=top_10_locations.index)
plt.title('Top 10 Locations with Highest Tweet Counts')
plt.xlabel('Tweet Count')
plt.ylabel('Location')
plt.show()

* Plotting Pie Chart for the Sentiments in percentage


In [None]:
# Calculate the percentage of each sentiment
sentiment_percentage = train_data['Sentiment'].value_counts(normalize=True) * 100

# Plot the pie chart
plt.figure(figsize=(8, 8))
plt.pie(sentiment_percentage, labels=sentiment_percentage.index, autopct='%1.1f%%', startangle=140)
plt.title('Sentiment Distribution in Percentage')
plt.show()

* WordCloud for the Tweets/Text

    * Visualize the most commonly used words in each sentiment using wordcloud
    * Refer to the following [link](https://medium.com/analytics-vidhya/word-cloud-a-text-visualization-tool-fb7348fbf502) for Word Cloud: A Text Visualization tool




In [None]:
# Function to plot word cloud
def plot_wordcloud(text, title):
    wordcloud = WordCloud(width=800, height=400, random_state=21, max_font_size=110, background_color='white').generate(text)
    plt.figure(figsize=(10, 7))
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.title(title)
    plt.axis('off')
    plt.show()

# Plot word cloud for each sentiment
for sentiment in train_data['Sentiment'].unique():
    text = " ".join(review for review in train_data[train_data['Sentiment'] == sentiment]['OriginalTweet'])
    plot_wordcloud(text, f'Word Cloud for Sentiment {sentiment}')


In [None]:
# Final model executed here


##   **Stage 2**: Data Pre-Processing  (2 Points)

####  Clean and Transform the data into a specified format


In [None]:
import pandas as pd
import numpy as np
import re
import string
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from torch.utils.data import Dataset, DataLoader
import torch

# Preprocess the text data
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r"http\S+|www\S+|https\S+", '', text, flags=re.MULTILINE)
    text = re.sub(r'@\w+', '', text)
    text = re.sub(r'#\w+', '', text)
    text = re.sub(r'[^A-Za-z\s]+', '', text)
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = re.sub(r'\s+', ' ', text).strip()
    return text

train_data['OriginalTweet'] = train_data['OriginalTweet'].apply(preprocess_text)
test_data['OriginalTweet'] = test_data['OriginalTweet'].apply(preprocess_text)

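# Encode the labels
# Sentiment was already mapped to integers 0-4 above; LabelEncoder sorts the
# classes, so the encoded values keep that same 0-4 order
label_encoder = LabelEncoder()
train_data['Sentiment'] = label_encoder.fit_transform(train_data['Sentiment'])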

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(train_data['OriginalTweet'], train_data['Sentiment'], test_size=0.2, random_state=42)

# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')


In [None]:
# Plot word cloud for each sentiment
for sentiment in train_data['Sentiment'].unique():
    text = " ".join(review for review in train_data[train_data['Sentiment'] == sentiment]['OriginalTweet'])
    plot_wordcloud(text, f'Word Cloud for Sentiment {sentiment}')

##   **Stage 3**: Build the Word Embeddings using pretrained Word2vec/Glove (Text Representation) (1 Point)
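
The cells in this notebook tokenize with the BERT tokenizer rather than loading Word2Vec or GloVe vectors. For reference, the sketch below shows one way an embedding matrix could be assembled from pretrained vectors. It assumes NumPy, and the `pretrained` dictionary, the tokenized `tweets`, and the 3-dimensional vectors are all hypothetical placeholders for a real GloVe/Word2Vec file.

```python
import numpy as np

# Hypothetical stand-in for vectors loaded from a GloVe/Word2Vec file
pretrained = {
    "covid": np.array([0.1, 0.2, 0.3]),
    "store": np.array([0.4, 0.5, 0.6]),
}
embedding_dim = 3

# Build a vocabulary from tokenized tweets; index 0 is reserved for padding
tweets = [["covid", "store"], ["store", "panic"]]
vocab = {"<pad>": 0}
for tokens in tweets:
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))

# Rows start as zeros; words missing from the pretrained set stay zero
embedding_matrix = np.zeros((len(vocab), embedding_dim))
for word, idx in vocab.items():
    if word in pretrained:
        embedding_matrix[idx] = pretrained[word]

print(embedding_matrix.shape)
```

A matrix like this could then initialize an embedding layer, for example through `torch.nn.Embedding.from_pretrained` after conversion to a float tensor.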



In [None]:
# YOUR CODE HERE

In [None]:
class TweetDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]

        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=False,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt'
        )

        return {
            'text': text,
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'label': torch.tensor(label, dtype=torch.long)
        }

train_dataset = TweetDataset(X_train.tolist(), y_train.tolist(), tokenizer, max_len=128)
val_dataset = TweetDataset(X_val.tolist(), y_val.tolist(), tokenizer, max_len=128)
test_dataset = TweetDataset(test_data['OriginalTweet'].tolist(), [0] * len(test_data), tokenizer, max_len=128)


##   **Stage 4**: Build and Train the Deep Recurrent Model using Pytorch/Keras (4 Points)



In [None]:
# YOUR CODE HERE

In [None]:
!pip install transformers[torch] --upgrade
!pip install accelerate --upgrade

In [None]:
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=5)

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=2,  # Reduced epochs
    per_device_train_batch_size=8,  # Smaller batch size
    per_device_eval_batch_size=8,  # Smaller batch size
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    evaluation_strategy="epoch"  # newer transformers releases name this argument eval_strategy
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=lambda p: {"accuracy": (p.predictions.argmax(-1) == p.label_ids).mean()}
)

trainer.train()


##   **Stage 5**: Evaluate the Model and get model predictions on the test dataset (2 Points)

* Upload the model predictions to Kaggle after mapping the sentiment column values from numerical back to categorical







In [None]:
# YOUR CODE HERE

In [None]:
# Evaluate the model
trainer.evaluate()

# Make predictions
predictions = trainer.predict(test_dataset)
predicted_classes = predictions.predictions.argmax(axis=-1)

# Map predicted classes back to sentiment labels
reverse_sentiment_mapping = {v: k for k, v in sentiment_mapping.items()}
test_data['Sentiment_Pred'] = predicted_classes
test_data['Sentiment_Pred'] = test_data['Sentiment_Pred'].map(reverse_sentiment_mapping)

# Prepare the submission file
submission = test_data[['UserName', 'Sentiment_Pred']].rename(
    columns={'UserName': 'Test_Id', 'Sentiment_Pred': 'Sentiment'}
)
submission.to_csv('submission.csv', index=False)

print("Submission file saved as 'submission.csv'")


### Instructions for preparing Kaggle competition predictions


* Get the predictions from the trained model and prepare a CSV file
    * The network outputs one score per class; take the class with the highest score as the prediction using `np.argmax`.

* The predictions (CSV) file should contain two columns, as in Sample_Submission.csv
  - The first column is Test_Id, which is treated as the index
  - The second column is the prediction in decoded form (e.g. Positive, Negative).
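
The steps above can be sketched as follows. This is an illustration under stated assumptions, not the notebook's actual output: the `scores` array and `test_ids` are hypothetical, and the label order in `id_to_label` is assumed to match the sentiment mapping used earlier (0 = Extremely Negative through 4 = Extremely Positive).

```python
import numpy as np
import pandas as pd

# Label order assumed to match the mapping used earlier in the notebook
id_to_label = {0: "Extremely Negative", 1: "Negative", 2: "Neutral",
               3: "Positive", 4: "Extremely Positive"}

# Hypothetical model outputs: one row of 5 class scores per test tweet
scores = np.array([
    [0.1, 0.2, 0.1, 2.5, 0.3],
    [1.9, 0.4, 0.2, 0.1, 0.0],
])
test_ids = [1, 2]  # hypothetical Test_Id values

# Pick the highest-scoring class per row, then decode to the text label
pred_ids = np.argmax(scores, axis=1)
submission = pd.DataFrame({
    "Test_Id": test_ids,
    "Sentiment": [id_to_label[int(i)] for i in pred_ids],
})
print(submission)
```

Writing the frame with `submission.to_csv('submission.csv', index=False)` would then produce the two-column file.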