# EDA and Sentiment Analysis on COVID19 Tweets

This notebook is organized as follows:

1. EDA on Covid19 tweets<br> 
    * Plot missing values.
    * Plot unique values.
    * Plot frequency of users tweeting about Corona
    * Plot frequency of locations tweeting about Corona
    * Plot frequency of sources tweeting about Corona
    * Visualizing location-wise top 50 prevalent words
2. Sentiment Analysis on Covid19 Tweets<br>
    * Exploring tweet data
    * Encoding tweets
    * Encoding sentiments
    * Detecting outlier reviews
    * Training, testing and validating
    * Dataloaders and batching
    * Sentiment network with PyTorch
    * Instantiate the network
    * Calculating model's accuracy
    * Testing model on a random covid19 tweet.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import pandas as pd

df = pd.read_csv('/kaggle/input/covid19-tweets/covid19_tweets.csv')

# EDA on Covid19 Tweets

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.describe()

In [None]:
print('Shape of tweets dataframe : {}'.format(df.shape))

In [None]:
df.info()


## Plot Missing Values

In [None]:


import seaborn as sns
import matplotlib.pyplot as plt

def return_missing_values(data_frame):
    # fraction of missing values per column, keeping only columns that have any
    missing_values = data_frame.isnull().sum() / len(data_frame)
    missing_values = missing_values[missing_values > 0]
    return missing_values.sort_values()

def plot_missing_values(data_frame):
    missing_values = return_missing_values(data_frame)
    # two explicit columns avoid a 'Name' label that is both an index level and a column
    missing_values = missing_values.rename_axis('Name').reset_index(name='fraction')
    sns.set(style='whitegrid', color_codes=True)
    sns.barplot(x='Name', y='fraction', data=missing_values)
    plt.ylabel('Fraction of missing values')
    plt.xticks(rotation=90)
    plt.show()
     

In [None]:


return_missing_values(df)

In [None]:
plot_missing_values(df)

### Acceptable color palettes in seaborn (i.e. values we can experiment with for `cmap` below): <br>

```python
'Accent', 'Accent_r', 'Blues', 'Blues_r', 'BrBG', 'BrBG_r', 'BuGn', 'BuGn_r', 'BuPu', 'BuPu_r', 'CMRmap', 'CMRmap_r', 'Dark2', 'Dark2_r', 'GnBu', 'GnBu_r', 'Greens', 'Greens_r', 'Greys', 'Greys_r', 'OrRd', 'OrRd_r', 'Oranges', 'Oranges_r', 'PRGn', 'PRGn_r', 'Paired', 'Paired_r', 'Pastel1', 'Pastel1_r', 'Pastel2', 'Pastel2_r', 'PiYG', 'PiYG_r', 'PuBu', 'PuBuGn', 'PuBuGn_r', 'PuBu_r', 'PuOr', 'PuOr_r', 'PuRd', 'PuRd_r', 'Purples', 'Purples_r', 'RdBu', 'RdBu_r', 'RdGy', 'RdGy_r', 'RdPu', 'RdPu_r', 'RdYlBu', 'RdYlBu_r', 'RdYlGn', 'RdYlGn_r', 'Reds', 'Reds_r', 'Set1', 'Set1_r', 'Set2', 'Set2_r', 'Set3', 'Set3_r', 'Spectral', 'Spectral_r', 'Wistia', 'Wistia_r', 'YlGn', 'YlGnBu', 'YlGnBu_r', 'YlGn_r', 'YlOrBr', 'YlOrBr_r', 'YlOrRd', 'YlOrRd_r', 'afmhot', 'afmhot_r', 'autumn', 'autumn_r', 'binary', 'binary_r', 'bone', 'bone_r', 'brg', 'brg_r', 'bwr', 'bwr_r', 'cividis', 'cividis_r', 'cool', 'cool_r', 'coolwarm', 'coolwarm_r', 'copper', 'copper_r', 'cubehelix', 'cubehelix_r', 'flag', 'flag_r', 'gist_earth', 'gist_earth_r', 'gist_gray', 'gist_gray_r', 'gist_heat', 'gist_heat_r', 'gist_ncar', 'gist_ncar_r', 'gist_rainbow', 'gist_rainbow_r', 'gist_stern', 'gist_stern_r', 'gist_yarg', 'gist_yarg_r', 'gnuplot', 'gnuplot2', 'gnuplot2_r', 'gnuplot_r', 'gray', 'gray_r', 'hot', 'hot_r', 'hsv', 'hsv_r', 'icefire', 'icefire_r', 'inferno', 'inferno_r', 'jet', 'jet_r', 'magma', 'magma_r', 'mako', 'mako_r', 'nipy_spectral', 'nipy_spectral_r', 'ocean', 'ocean_r', 'pink', 'pink_r', 'plasma', 'plasma_r', 'prism', 'prism_r', 'rainbow', 'rainbow_r', 'rocket', 'rocket_r', 'seismic', 'seismic_r', 'spring', 'spring_r', 'summer', 'summer_r', 'tab10', 'tab10_r', 'tab20', 'tab20_r', 'tab20b', 'tab20b_r', 'tab20c', 'tab20c_r', 'terrain', 'terrain_r', 'twilight', 'twilight_r', 'twilight_shifted', 'twilight_shifted_r', 'viridis', 'viridis_r', 'vlag', 'vlag_r', 'winter', 'winter_r'
```


In [None]:
# heatmap representation of missing values

# other colormaps worth trying: plasma, viridis

sns.heatmap(df.isnull(), cbar=True, yticklabels=False, cmap='ocean')

## Plot Unique Values 

In [None]:
def return_unique_values(data_frame):
    unique_dataframe = pd.DataFrame()
    unique_dataframe['Features'] = data_frame.columns
    uniques = []
    for col in data_frame.columns:
        u = data_frame[col].nunique()
        uniques.append(u)
    unique_dataframe['Uniques'] = uniques
    return unique_dataframe

In [None]:
udf = return_unique_values(df)
print(udf)

In [None]:
f, ax = plt.subplots(1, 1, figsize=(10, 5))

sns.barplot(x=udf['Features'], y=udf['Uniques'], alpha=0.8)
plt.title('Bar plot for #unique values in each column')
plt.ylabel('#Unique values', fontsize=12)
plt.xlabel('Features', fontsize=12)
plt.xticks(rotation=90)
plt.show()

## Frequency of users tweeting about Corona

In [None]:
def plot_frequency_charts(df, feature, title, palette):
    # bar chart of the 20 most frequent values of `feature`, annotated with percentages
    f, ax = plt.subplots(1, 1, figsize=(16, 4))
    total = float(len(df))
    # pass the column as x= explicitly; newer seaborn versions reject it as a positional argument
    sns.countplot(x=df[feature], order=df[feature].value_counts().index[:20], palette=palette, ax=ax)

    for p in ax.patches:
        height = p.get_height()
        ax.text(p.get_x() + p.get_width() / 2.,
                height + 3,
                '{:1.2f}%'.format(100 * height / total),
                ha="center")

    plt.title('Frequency of {} tweeting about Corona'.format(feature))
    plt.ylabel('Frequency', fontsize=12)
    plt.xlabel(title, fontsize=12)
    plt.xticks(rotation=90)
    plt.show()
    

In [None]:
plot_frequency_charts(df, 'user_name', 'User Names','Wistia')

## Frequency of locations tweeting about Corona

In [None]:
plot_frequency_charts(df, 'user_location', 'User Locations', 'BuGn_r')

## Frequency of sources tweeting about Corona

In [None]:
plot_frequency_charts(df, 'source','Source', 'vlag')

## Visualizing top 50 words location-wise

In [None]:
from string import punctuation
from nltk.corpus import stopwords

# requires the NLTK stopwords corpus; run nltk.download('stopwords') once if it is missing
english_stopwords = set(stopwords.words('english'))
print(stopwords.words('english')[10:15])

def punctuation_stopwords_removal(sms):
    # filter character by character, e.g. 'hello!' -> ['h', 'e', 'l', 'l', 'o']
    remove_punctuation = [ch for ch in sms if ch not in punctuation]
    # join the characters back into a sentence and split it into words
    remove_punctuation = "".join(remove_punctuation).split()
    # lower-case every word and drop English stop words (set lookup, built once above)
    filtered_sms = [word.lower() for word in remove_punctuation if word.lower() not in english_stopwords]
    return filtered_sms

In [None]:
df.head()

In [None]:
from collections import Counter

def draw_bar_graph_for_text_visualization(df, location):
    # .copy() avoids assigning into a slice of the original dataframe
    tweets_from_loc = df.loc[df.user_location == location].copy()
    tweets_from_loc['text'] = tweets_from_loc['text'].apply(punctuation_stopwords_removal)
    # flatten the per-tweet word lists into a single list of words
    loc_tweet_list = [word for sublist in tweets_from_loc['text'] for word in sublist]
    loc_tweet_count = Counter(loc_tweet_list)
    loc_top_50_words = pd.DataFrame(loc_tweet_count.most_common(50), columns=['word', 'count'])
    fig, ax = plt.subplots(figsize=(16, 6))
    sns.barplot(x='word', y='count',
                data=loc_top_50_words, ax=ax)
    plt.title("Top 50 Prevalent Words in {}".format(location))
    plt.xticks(rotation='vertical')
    

In [None]:
from wordcloud import WordCloud, STOPWORDS



def draw_word_cloud(df, location, title):
    # .copy() avoids assigning into a slice of the original dataframe
    loc_df = df.loc[df.user_location == location].copy()
    loc_df['text'] = loc_df['text'].apply(punctuation_stopwords_removal)
    # join every cleaned word; str(Series) would only contain the truncated printout
    all_words = ' '.join(word for words in loc_df['text'] for word in words)
    word_cloud = WordCloud(
                    background_color='white',
                    stopwords=set(STOPWORDS),
                    max_words=50,
                    max_font_size=40,
                    scale=5,
                    random_state=1).generate(all_words)
    fig = plt.figure(1, figsize=(10, 10))
    plt.axis('off')
    fig.suptitle(title, fontsize=20)
    fig.subplots_adjust(top=2.3)
    plt.imshow(word_cloud)
    plt.show()
    

In [None]:
draw_bar_graph_for_text_visualization(df, 'India')

In [None]:
draw_word_cloud(df, 'India', 'Word Cloud for top 50 prevalent words in India')

In [None]:
draw_bar_graph_for_text_visualization(df, 'United Kingdom')

In [None]:
draw_word_cloud(df, 'United Kingdom', 'Word Cloud for top 50 prevalent words in United Kingdom')

In [None]:
draw_bar_graph_for_text_visualization(df, 'Canada')

In [None]:
draw_word_cloud(df, 'Canada', 'Word Cloud for top 50 prevalent words in Canada')

In [None]:
draw_bar_graph_for_text_visualization(df, 'South Africa')

In [None]:
draw_word_cloud(df, 'South Africa', 'Word Cloud for top 50 prevalent words in South Africa')

In [None]:
draw_bar_graph_for_text_visualization(df, 'Switzerland')

In [None]:
draw_word_cloud(df, 'Switzerland', 'Word Cloud for top 50 prevalent words in Switzerland')

In [None]:
draw_bar_graph_for_text_visualization(df, 'London')

In [None]:
draw_word_cloud(df, 'London', 'Word Cloud for top 50 prevalent words in London')

# Sentiment Analysis on Covid19 Tweets

For this part of the notebook, I will be using the [Covid 19 Indian Sentiments on covid19 and lockdown](https://www.kaggle.com/surajkum1198/twitterdata) dataset.

## Exploring Tweet Data

* Sentiment Analysis on Covid19 Tweets
    * Exploring tweet data
    * Encoding tweets
    * Encoding sentiments
    * Detecting outlier reviews
    * Training, testing and validating
    * Dataloaders and batching
    * Sentiment network with PyTorch
    * Instantiate the network
    * Calculating model's accuracy
    * Testing model on a random covid19 tweet.

In [None]:
sentiment_df = pd.read_csv('/kaggle/input/twitterdata/finalSentimentdata2.csv')

In [None]:
sentiment_df.head()

In [None]:
sentiment_df.columns

In [None]:
sentiment_df['sentiment'].nunique()

In [None]:
sentiment_df.loc[:, 'text'] = sentiment_df['text'].apply(punctuation_stopwords_removal)

In [None]:
reviews_split = []
for i, j in sentiment_df.iterrows():
    reviews_split.append(j['text'])


In [None]:
words = []
for review in reviews_split:
    for word in review:
        words.append(word)


In [None]:
print(words[:20])

## Encoding Tweets
Create an array that contains an integer-encoded version of the words in the reviews. The most frequent word should get the smallest integer value. For example, if 'the' appears most often in the reviews, then assign 'the' : 1.
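
As a rough illustration of this frequency-ranked encoding, here is a minimal sketch on a toy corpus. The names `toy_reviews`, `toy_vocab_to_int` and `toy_encoded` are made up for this example and are not part of the dataset.

```python
from collections import Counter

# Toy tokenised reviews (hypothetical data, for illustration only)
toy_reviews = [["stay", "home", "stay", "safe"], ["stay", "safe"]]

toy_counts = Counter(word for review in toy_reviews for word in review)
# Most frequent word first; indices start at 1 so that 0 stays free for padding
toy_vocab = sorted(toy_counts, key=toy_counts.get, reverse=True)
toy_vocab_to_int = {word: idx for idx, word in enumerate(toy_vocab, 1)}

toy_encoded = [[toy_vocab_to_int[word] for word in review] for review in toy_reviews]
print(toy_vocab_to_int)  # {'stay': 1, 'safe': 2, 'home': 3}
print(toy_encoded)       # [[1, 3, 1, 2], [1, 2]]
```

Starting the numbering at 1 is a deliberate choice: it leaves 0 available as the padding value used later.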

In [None]:
from collections import Counter

counts = Counter(words)
vocab = sorted(counts, key=counts.get, reverse=True)
vocab_to_int = {word:ii for ii, word in enumerate(vocab, 1)}

In [None]:
encoded_reviews = []
for review in reviews_split:
    encoded_reviews.append([vocab_to_int[word] for word in review])


In [None]:
print(len(vocab_to_int))
print(encoded_reviews[:10])

## Encoding Sentiments

For simplicity, I encode the positive sentiment (joy) as 1 and all the remaining sentiments (such as anger and sad) as 0.
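
A compact way to picture this binary encoding, sketched on a handful of made-up labels (`toy_sentiments` is hypothetical and only stands in for the `sentiment` column):

```python
# Hypothetical sentiment labels, for illustration only
toy_sentiments = ["joy", "sad", "anger", "joy"]

# 1 for the positive class ("joy"), 0 for everything else
toy_labels = [1 if sentiment == "joy" else 0 for sentiment in toy_sentiments]
print(toy_labels)  # [1, 0, 0, 1]

# Share of positive examples, useful as a quick class-balance check
positive_share = sum(toy_labels) / len(toy_labels)
print(positive_share)  # 0.5
```

Collapsing several negative emotions into one class keeps the model a binary classifier, at the cost of no longer distinguishing between them.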

In [None]:
labels_to_int = []
for i, j in sentiment_df.iterrows():
    if j['sentiment']=='joy':
        labels_to_int.append(1)
    else:
        labels_to_int.append(0)
    

## Detecting any outlier reviews

This step involves -<br>
1. Getting rid of extremely long/short reviews
2. Padding/truncating the remaining data to maintain a constant review length.
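
The two steps can be sketched with plain Python lists. This is only an illustration under the assumption that short reviews are left-padded with zeros and long ones are cut to their first `seq_length` tokens; `pad_or_truncate` and the toy data are hypothetical names, not the notebook's own function.

```python
def pad_or_truncate(tokens, seq_length):
    """Left-pad with zeros, or keep only the first seq_length tokens."""
    tokens = tokens[:seq_length]
    return [0] * (seq_length - len(tokens)) + tokens

# Hypothetical encoded reviews: one short, one long, one empty
toy_encoded = [[5, 2], [7, 1, 4, 9, 3, 8], []]

# Step 1: drop empty reviews; step 2: force every review to length 4
toy_kept = [review for review in toy_encoded if len(review) > 0]
toy_padded = [pad_or_truncate(review, 4) for review in toy_kept]
print(toy_padded)  # [[0, 0, 5, 2], [7, 1, 4, 9]]
```

Left-padding keeps the real tokens at the end of the sequence, next to the final LSTM step whose output is used for the prediction.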

In [None]:
reviews_len = Counter([len(x) for x in encoded_reviews])
print(max(reviews_len))

In [None]:
print(len(encoded_reviews))

In [None]:
non_zero_idx = [ii for ii, review in enumerate(encoded_reviews) if len(review)!=0]
encoded_reviews = [encoded_reviews[ii] for ii in non_zero_idx]
encoded_labels = np.array([labels_to_int[ii] for ii in non_zero_idx])

In [None]:
print(len(encoded_reviews))
print(len(encoded_labels))

In [None]:
def pad_features(reviews_int, seq_length):
    features = np.zeros((len(reviews_int), seq_length), dtype=int)
    for i, row in enumerate(reviews_int):
        if len(row)!=0:
            features[i, -len(row):] = np.array(row)[:seq_length]
    return features

In [None]:
seq_length = 50
padded_features= pad_features(encoded_reviews, seq_length)
print(padded_features[:2])


## Training, Testing and Validating

In [None]:
split_frac = 0.8
split_idx = int(len(padded_features)*split_frac)

training_x, remaining_x = padded_features[:split_idx], padded_features[split_idx:]
training_y, remaining_y = encoded_labels[:split_idx], encoded_labels[split_idx:]

test_idx = int(len(remaining_x)*0.5)
val_x, test_x = remaining_x[:test_idx], remaining_x[test_idx:]
val_y, test_y = remaining_y[:test_idx], remaining_y[test_idx:]


## Dataloaders and Batching

A neat way to create data-loaders and batch our training, validation and test Tensor datasets is as follows -<br>
```python
train_data = TensorDataset(torch.from_numpy(training_x), torch.from_numpy(training_y))
train_loader = DataLoader(train_data, batch_size=batch_size)
```
This is an alternative to creating a generator function for batching our data into full batches.
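
For comparison, the generator-based alternative mentioned above could look roughly like this. It is a plain-Python sketch; `full_batches` and the toy data are hypothetical and not part of PyTorch.

```python
def full_batches(features, labels, batch_size):
    """Yield (features, labels) slices, dropping a trailing partial batch."""
    n_full = len(features) // batch_size
    for b in range(n_full):
        start = b * batch_size
        yield features[start:start + batch_size], labels[start:start + batch_size]

# Five toy samples with batch_size=2 give two full batches; the fifth sample is dropped
toy_x = [[0, 1], [0, 2], [3, 4], [0, 5], [6, 7]]
toy_y = [1, 0, 0, 1, 1]
batches = list(full_batches(toy_x, toy_y, batch_size=2))
print(len(batches))  # 2
print(batches[0])    # ([[0, 1], [0, 2]], [1, 0])
```

Note that `DataLoader` keeps the final partial batch by default; passing `drop_last=True` gives the full-batches-only behaviour sketched here.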

In [None]:
import torch
from torch.utils.data import TensorDataset, DataLoader

In [None]:
# torch.from_numpy creates a tensor data from n-d array
train_data = TensorDataset(torch.from_numpy(training_x), torch.from_numpy(training_y))
test_data = TensorDataset(torch.from_numpy(test_x), torch.from_numpy(test_y))
valid_data = TensorDataset(torch.from_numpy(val_x), torch.from_numpy(val_y))

batch_size = 1

train_loader = DataLoader(train_data, batch_size=batch_size)
test_loader = DataLoader(test_data, batch_size=batch_size)
valid_loader = DataLoader(valid_data, batch_size=batch_size)

In [None]:
# call the function; the bare function object is always truthy
gpu_available = torch.cuda.is_available()

if gpu_available:
    print('Training on GPU')
else:
    print('GPU not available')

## Sentiment Network with PyTorch
Below are the various layers of our RNN that would perform sentiment analysis -<br>
1. An *embedding layer* that converts our word tokens (integers) into embeddings of a specific size.
2. An *LSTM layer* defined by a hidden_state size and number of layers
3. A fully-connected output layer that maps the LSTM layer outputs to a desired output_size
4. A sigmoid activation layer which turns all outputs into a value 0-1; return only the last sigmoid output as the output of this network.
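
To make step 4 concrete, the sketch below mimics, with plain Python lists instead of tensors, how per-time-step scores are squashed by a sigmoid and how only the last value per tweet is kept. The scores are made-up numbers and the flat, batch-major layout is an assumption chosen to mirror a `(batch_size, seq_length)` reshape.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

batch_size, seq_length = 2, 3
# Hypothetical per-time-step scores from the fully-connected layer,
# flattened to batch_size * seq_length values
flat_scores = [-2.0, 0.0, 2.0, 1.0, 0.0, -1.0]
flat_probs = [sigmoid(z) for z in flat_scores]

# Regroup into one row per tweet, then keep only the final time step
per_tweet = [flat_probs[i * seq_length:(i + 1) * seq_length] for i in range(batch_size)]
last_probs = [row[-1] for row in per_tweet]
print(last_probs)
```

The first tweet ends on a score of 2.0 and the second on -1.0, so the two kept probabilities fall above and below 0.5 respectively.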

In [None]:
import torch.nn as nn

class CovidTweetSentimentAnalysis(nn.Module):
    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, drop_prob=0.2):
        super(CovidTweetSentimentAnalysis, self).__init__()
        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        
        self.embedding_layer = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, dropout=drop_prob, batch_first=True)
        
        self.dropout = nn.Dropout(0.3)
        self.fc = nn.Linear(hidden_dim, output_size)
        self.sig = nn.Sigmoid()
    
    def forward(self, x, hidden):
        # x : batch_size * seq_length * features
        batch_size = x.size(0)
        x = x.long()
        embeds = self.embedding_layer(x)
        lstm_out, hidden = self.lstm(embeds, hidden)
        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)
        out = self.dropout(lstm_out)
        out = self.fc(out)
        sig_out = self.sig(out)
        
        sig_out = sig_out.view(batch_size, -1)
        sig_out = sig_out[:, -1]
        
        return sig_out, hidden
    
    def init_hidden(self, batch_size):
        # create zero-initialised hidden and cell states for the LSTM,
        # with the same dtype as the model parameters
        weights = next(self.parameters()).data
        
        if gpu_available:
            hidden = (weights.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(),
                     weights.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
        else:
            hidden = (weights.new(self.n_layers, batch_size, self.hidden_dim).zero_(),
                     weights.new(self.n_layers, batch_size, self.hidden_dim).zero_())
        return hidden

## Instantiate the network
Here, I will define the model hyper-parameters -<br>

1. `vocab_size` : Size of our vocabulary or the range of values for our input, word tokens.
2. `output_size` : Size of our desired output; the number of class scores we want to output (pos/neg).
3. `embedding_dim` : Number of columns in the embedding lookup table; size of our embeddings.
4. `hidden_dim` : Number of units in the hidden layers of our LSTM cells. Larger values usually give better performance, at the cost of slower training. Common values are 128, 256, 512, etc.
5. `n_layers`: Number of LSTM layers in the network. Typically between 1-3
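
To get a feel for how these hyper-parameters drive model size, the following back-of-the-envelope sketch counts trainable parameters for an embedding layer, a stacked LSTM and a linear output layer. It assumes PyTorch's LSTM layout (four gates, each with input weights, recurrent weights and two bias vectors) and a made-up `vocab_size` of 10,000; the other values are the ones used in this notebook.

```python
def lstm_layer_params(input_dim, hidden_dim):
    # 4 gates, each with input weights, recurrent weights and two bias vectors
    return 4 * (input_dim * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim)

vocab_size = 10_000   # hypothetical; the real value comes from the data
output_size = 1
embedding_dim = 400
hidden_dim = 256
n_layers = 2

embedding_params = vocab_size * embedding_dim
# The first LSTM layer reads embeddings, later layers read hidden states
lstm_params = lstm_layer_params(embedding_dim, hidden_dim) + sum(
    lstm_layer_params(hidden_dim, hidden_dim) for _ in range(n_layers - 1)
)
fc_params = hidden_dim * output_size + output_size
total_params = embedding_params + lstm_params + fc_params
print(embedding_params, lstm_params, fc_params, total_params)
# 4000000 1200128 257 5200385
```

Under these assumptions the embedding table dominates the parameter count, so `vocab_size` and `embedding_dim` matter more for model size than `hidden_dim` or `n_layers`.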

In [None]:
vocab_size = len(vocab_to_int)+1 # +1 for the 0 padding + our word tokens
output_size = 1 # either happy or sad
embedding_dim = 400
hidden_dim = 256
n_layers = 2

In [None]:
net = CovidTweetSentimentAnalysis(vocab_size, output_size, embedding_dim, hidden_dim, n_layers)
print(net)

In [None]:
lr = 0.001
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=lr)

In [None]:
epochs = 4
count = 0
print_every = 100
clip = 5 
if gpu_available:
    net.cuda()

net.train()
for e in range(epochs):
    # initialize lstm's hidden layer 
    h = net.init_hidden(batch_size)
    for inputs, labels in train_loader:
        count += 1
        if gpu_available:
            inputs, labels = inputs.cuda(), labels.cuda()
        h = tuple([each.data for each in h])
        
        # training process
        net.zero_grad()
        outputs, h = net(inputs, h)
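        # outputs already has shape (batch_size,); view(-1) keeps that shape even when batch_size is 1
        loss = criterion(outputs.view(-1), labels.float())
        loss.backward()
        # clip gradients in place to limit exploding gradients
        nn.utils.clip_grad_norm_(net.parameters(), clip)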
        optimizer.step()
        
        # every print_every steps, report the training loss and the mean validation loss
        if count % print_every == 0:
            val_h = net.init_hidden(batch_size)
            val_losses = []
            net.eval()
            for inputs, labels in valid_loader:
                val_h = tuple([each.data for each in val_h])
                if gpu_available:
                    inputs, labels = inputs.cuda(), labels.cuda()
                # evaluate every validation batch, not only the last one
                outputs, val_h = net(inputs, val_h)
                val_loss = criterion(outputs.view(-1), labels.float())
                val_losses.append(val_loss.item())
        
            net.train()
            print("Epoch: {}/{}...".format(e+1, epochs),
                  "Step: {}...".format(count),
                  "Loss: {:.6f}...".format(loss.item()),
                  "Val Loss: {:.6f}".format(np.mean(val_losses)))

## Calculating model's accuracy

The `CovidTweetSentimentAnalysis` model achieved an accuracy of 87.4% on the test set.
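
The accuracy figure is simply the share of test tweets whose rounded sigmoid output matches the label. A minimal sketch of that calculation on made-up numbers (`toy_probs` and `toy_labels` are hypothetical):

```python
# Hypothetical sigmoid outputs and true labels, for illustration only
toy_probs = [0.91, 0.12, 0.55, 0.40]
toy_labels = [1, 0, 0, 0]

# Threshold at 0.5 to turn probabilities into class predictions
toy_preds = [1 if p > 0.5 else 0 for p in toy_probs]
num_correct = sum(int(pred == label) for pred, label in zip(toy_preds, toy_labels))
accuracy = num_correct / len(toy_labels)
print(toy_preds)  # [1, 0, 1, 0]
print(accuracy)   # 0.75
```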

In [None]:
test_losses = []
num_correct = 0

h = net.init_hidden(batch_size)
net.eval()

for inputs, labels in test_loader:
    h = tuple([each.data for each in h])
    if gpu_available:
        inputs, labels = inputs.cuda(), labels.cuda()
    
    outputs, h = net(inputs, h)
    test_loss = criterion(outputs.view(-1), labels.float())
    test_losses.append(test_loss.item())
    pred = torch.round(outputs.squeeze())
    correct_tensor = pred.eq(labels.float().view_as(pred))
    correct = np.squeeze(correct_tensor.numpy()) if not gpu_available else np.squeeze(correct_tensor.cpu().numpy())
    num_correct += np.sum(correct)

# printing average statistics
print("Test loss: {:.3f}".format(np.mean(test_losses)))
    
# accuracy over all test data
test_acc = num_correct/len(test_loader.dataset)
print("Test accuracy: {:.3f}".format(test_acc))

## Testing model on random tweet

To train the sentiment model, I brought a completely different dataset into this notebook. Now that the model is trained, we can use it to predict the sentiment of individual covid19 tweets.
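
A tweet from outside the training data may contain words the vocabulary has never seen, and a plain dictionary lookup would raise a `KeyError` for them. One simple option, assumed here rather than prescribed, is to skip such words. The vocabulary and the helper `encode_known_words` below are hypothetical.

```python
# Hypothetical vocabulary; in the notebook this role is played by vocab_to_int
toy_vocab_to_int = {"sad": 1, "corona": 2, "pandemic": 3}

def encode_known_words(words, vocab):
    """Map words to integer tokens, skipping words the vocabulary has never seen."""
    return [vocab[word] for word in words if word in vocab]

toy_tweet_words = ["sad", "see", "corona", "pandemic", "increasing"]
toy_tokens = encode_known_words(toy_tweet_words, toy_vocab_to_int)
print(toy_tokens)  # [1, 2, 3]
```

Skipping unknown words loses information; reserving a dedicated "unknown" token during training is a common alternative, but it would require rebuilding the vocabulary.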

In [None]:
from string import punctuation

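def tokenize_covid_tweet(tweet):
    # words missing from the training vocabulary are skipped to avoid a KeyError
    test_ints = []
    test_ints.append([vocab_to_int[word] for word in tweet if word in vocab_to_int])
    return test_ints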

In [None]:
def predict_covid_sentiment(net, test_tweet, seq_length=50):
    print('Original Sentence :')
    print(test_tweet)
    
    print('\nAfter removing punctuation and stop-words :')
    test_tweet = punctuation_stopwords_removal(test_tweet)
    print(test_tweet)
    
    print('\nAfter converting pre-processed tweet to tokens :')
    tokenized_tweet = tokenize_covid_tweet(test_tweet)
    print(tokenized_tweet)
    
    print('\nAfter padding the tokens into fixed sequence lengths :')
    padded_tweet = pad_features(tokenized_tweet, seq_length)
    print(padded_tweet)
    
    feature_tensor = torch.from_numpy(padded_tweet)
    batch_size = feature_tensor.size(0)
    
    if gpu_available:
        feature_tensor = feature_tensor.cuda()
    
    h = net.init_hidden(batch_size)
    output, h = net(feature_tensor, h)
    
    predicted_sentiment = torch.round(output.squeeze())
    print('\n==========Predicted Sentiment==========\n')
    if predicted_sentiment == 1:
        print('Happy')
    else:
        print('Sad')
    print('\n==========Predicted Sentiment==========\n')


In [None]:
test_sad_tweet = 'It is very sad to see the corona pandemic increasing at such an alarming rate'
predict_covid_sentiment(net, test_sad_tweet)

In [None]:
test_happy_tweet = 'It is amazing to see that New Zealand reaches 100 days without Covid transmission!'
predict_covid_sentiment(net, test_happy_tweet)