# Coronavirus Tweet Classification

Welcome to the Coronavirus Tweet Classification notebook. Here we analyse 40,000 tweets about the Covid-19 pandemic and predict whether each one is positive, negative or neutral. These predictions give a picture of how people are dealing with the pandemic, which is a first step towards figuring out how to make the world a calmer place.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from collections import Counter
from wordcloud import WordCloud
from keras.optimizers import Adam
from keras.models import Sequential
from keras.utils import to_categorical
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from keras.layers import Dense, Dropout, LSTM, Embedding
from sklearn.preprocessing import Normalizer, LabelEncoder
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
train = pd.read_csv('../input/covid-19-nlp-text-classification/Corona_NLP_train.csv', encoding='latin1')
test = pd.read_csv('../input/covid-19-nlp-text-classification/Corona_NLP_test.csv', encoding='latin1')

## Feature engineering

First we load the data. As seen below, the train and test sets share the same columns: UserName, ScreenName, Location, TweetAt, OriginalTweet and Sentiment.

In [None]:
train

In [None]:
test

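The 'OriginalTweet' feature will be used as X and 'Sentiment' as the target (y). We encode the target with a label encoder so that the classifier receives integer class labels rather than strings. The encoder is fitted on the training labels only and then reused for the test labels, so both splits share the same mapping.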

In [None]:
le = LabelEncoder()
X_train = train['OriginalTweet']
y_train = le.fit_transform(train['Sentiment'])

X_test = test['OriginalTweet']
y_test = le.transform(test['Sentiment'])

Now we create three text representations: Bag of Words, TF-IDF and a Keras Tokenizer. Each one converts the text data (X) into a usable matrix of numbers. We build all three so that we can compare them and see which works best.

In [None]:
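BoW = CountVectorizer()                   # raw word counts per tweet
tfidf = TfidfTransformer()                # reweights the Bag of Words counts
tok = Tokenizer(num_words=50, split=" ")  # word indices, vocabulary capped by num_words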
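
As a rough illustration of how the first two representations differ, the sketch below builds a Bag of Words matrix and a TF-IDF matrix by hand for a toy corpus. It is a simplified stand-in rather than the library code: the tokenisation rule and the smoothed IDF formula follow scikit-learn's default settings as I understand them, and the example sentences and helper names (`tokenize`, `l2_normalise`) are made up for this sketch.

```python
import math
import re
from collections import Counter

# Toy corpus (illustrative sentences, not taken from the dataset).
docs = ["stay home stay safe", "panic buying at the supermarket", "stay calm"]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters,
    # mirroring CountVectorizer's default token pattern.
    return re.findall(r"\b\w\w+\b", text.lower())

# Bag of Words: one column per vocabulary word, one row of raw counts per document.
vocab = sorted({t for doc in docs for t in tokenize(doc)})
counts = []
for doc in docs:
    c = Counter(tokenize(doc))
    counts.append([c[word] for word in vocab])

# TF-IDF, smoothed form: idf = ln((1 + n_docs) / (1 + doc_freq)) + 1,
# then each row is scaled to unit L2 length.
n_docs = len(docs)
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

def l2_normalise(row):
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm else row

tfidf_rows = [l2_normalise([c * w for c, w in zip(row, idf)]) for row in counts]

print(vocab)
print(counts[0])                              # [0, 0, 0, 1, 0, 1, 2, 0, 0]
print([round(v, 3) for v in tfidf_rows[0]])
```

In this toy corpus 'stay' appears in two of the three documents, so it receives a lower IDF weight than 'home', which appears in only one.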

Finally, we apply each of the three methods to the tweets. All three are NLP (Natural Language Processing) techniques that turn text into numbers in different ways: Bag of Words counts how often each word appears, TF-IDF reweights those counts so that words found in many tweets matter less, and the Tokenizer maps each tweet to a sequence of word indices.

In [None]:
X_tr_bow = BoW.fit_transform(X_train)
X_te_bow = BoW.transform(X_test)

X_tr_tfidf = tfidf.fit_transform(X_tr_bow)
X_te_tfidf = tfidf.transform(X_te_bow)

tok.fit_on_texts(X_train)
tok_train = tok.texts_to_sequences(X_train)
X_tr_tok = sequence.pad_sequences(tok_train, maxlen=11, dtype='float32')

tok_test = tok.texts_to_sequences(X_test)
X_te_tok = sequence.pad_sequences(tok_test, maxlen=11, dtype='float32')
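
The padding step is easy to overlook, so here is a small plain-Python sketch of what it does to each tweet. It assumes Keras' default `pad_sequences` behaviour (`padding='pre'`, `truncating='pre'`): short sequences are left-padded with zeros and long ones keep only their last `maxlen` entries. The helper `pad_pre` is hypothetical and exists only for this illustration.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad short sequences; keep only the last `maxlen` items of long ones.

    Hypothetical helper for illustration; assumes maxlen >= 1.
    """
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

short = [4, 9, 2]
long = [5, 1, 7, 3, 8, 6, 2]

print(pad_pre(short, maxlen=5))  # [0, 0, 4, 9, 2]
print(pad_pre(long, maxlen=5))   # [7, 3, 8, 6, 2]
```

With `maxlen=11` and `num_words=50` as in the cell above, each tweet is reduced to at most 11 indices drawn from a very small vocabulary, so much of the original text is discarded before the classifier sees it.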

## Data visualisation

Next, we visualise the data we have prepared. This helps us understand the trends in our data.

Below is a word cloud showing the most common words. 'https' and 'co' dominate because many tweets contain links. Next come 'COVID' and a misspelled version of 'coronavirus', followed by words such as 'grocery' and 'supermarket', which is unsurprising given people's concerns about not being able to go out and buy things.

In [None]:
# Join all training tweets into one string for the word cloud
wordcloud = WordCloud(background_color='white').generate(" ".join(X_train))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
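
Since link fragments such as 'https' and 'co' crowd out more informative words, one option is to strip URLs before counting. The sketch below shows the idea on a few made-up tweets using only the standard library; the regular expressions are simple assumptions and will not catch every URL form. Counts like these could then be passed to `WordCloud.generate_from_frequencies` instead of the raw text.

```python
import re
from collections import Counter

# Made-up example tweets (not from the dataset).
tweets = [
    "Stocking up at the supermarket https://t.co/abc123",
    "Supermarket shelves are empty https://t.co/xyz789",
    "Stay safe everyone",
]

def clean_words(text):
    # Drop anything that looks like a link, then keep alphabetic words only.
    text = re.sub(r"https?://\S+", "", text)
    return re.findall(r"[a-z]+", text.lower())

freq = Counter(word for tweet in tweets for word in clean_words(tweet))

print(freq.most_common(1))  # [('supermarket', 2)]
```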

The final visualisation is a pair of bar charts showing how many tweets were posted on each day of the month and in each month of the year. March has the most tweets, presumably because that was when lockdown started.

In [None]:
# Parse the dates once; dayfirst=True assumes TweetAt is written day-month-year
dates = pd.to_datetime(train['TweetAt'], dayfirst=True)
month = dates.dt.month
day = dates.dt.day

count = Counter(day)
plt.bar(count.keys(), count.values(), color='blue')
plt.xlabel('Day of month')
plt.ylabel('Occurrences')
plt.title('Distribution of the tweets per day')
plt.show()

count = Counter(month)
plt.bar(count.keys(), count.values(), color='red')
plt.xlabel('Month')
plt.ylabel('Occurrences')
plt.title('Distribution of the tweets per month')
plt.show()
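
The counting behind these charts can be sketched with the standard library alone. The example assumes the dates are written day-month-year (for instance `16-03-2020`); that layout is an assumption about the data, and the date strings below are illustrative. Being explicit about the format matters because a string like `02-04-2020` is ambiguous: a month-first parser would read it as 4 February rather than 2 April.

```python
from collections import Counter
from datetime import datetime

# Illustrative date strings; the day-month-year layout is an assumption.
dates = ["16-03-2020", "16-03-2020", "02-04-2020"]
parsed = [datetime.strptime(d, "%d-%m-%Y") for d in dates]

per_day = Counter(dt.day for dt in parsed)
per_month = Counter(dt.month for dt in parsed)

# Sort by key so the bars would appear in calendar order.
days, day_counts = zip(*sorted(per_day.items()))

print(days, day_counts)           # (2, 16) (1, 2)
print(sorted(per_month.items()))  # [(3, 2), (4, 1)]
```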

## Classification

Now, at the climax of our notebook, we train three SGD classifiers, one on each of the three text representations, to predict whether a tweet is positive, negative or neutral.

In [None]:
# Note: eta0 is not used by the 'optimal' learning-rate schedule, so it has no effect here
bow_clf = SGDClassifier(eta0=0.01, learning_rate='optimal', penalty='l1', max_iter=100)
bow_clf.fit(X_tr_bow, y_train)
bow_score = bow_clf.score(X_te_bow, y_test)
print('Bag of Words score: ' + str(bow_score))

tfidf_clf = SGDClassifier(eta0=0.0001, loss='modified_huber', penalty='l1', learning_rate='optimal')
tfidf_clf.fit(X_tr_tfidf, y_train)
tfidf_score = tfidf_clf.score(X_te_tfidf, y_test)
print('TFIDF score:        ' + str(tfidf_score))

tok_clf = SGDClassifier()
tok_clf.fit(X_tr_tok, y_train)
tok_score = tok_clf.score(X_te_tok, y_test)
print('Tokenizer score:    ' + str(tok_score))

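As seen above, the SGD classifier trained on the Bag of Words features has the highest accuracy, with a score of 60%. The TF-IDF classifier is close behind with 59%, while the Tokenizer classifier is far back with 18%. A likely reason for the gap is that the Tokenizer features are short padded sequences of word indices from a very small vocabulary, which give a linear classifier little to work with.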
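
To put these scores in context, it helps to remember that `score` on a scikit-learn classifier reports mean accuracy, and to compare it against a trivial baseline that always predicts the most common class. The sketch below shows both calculations in plain Python; the label lists are made up for illustration and are not the notebook's results.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true labels.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up labels and predictions (not the notebook's results).
y_true = [0, 1, 2, 2, 1, 2]
y_pred = [0, 2, 2, 2, 1, 0]

# Baseline: always predict the most frequent class in y_true.
majority = Counter(y_true).most_common(1)[0][0]
baseline = accuracy(y_true, [majority] * len(y_true))

print(round(accuracy(y_true, y_pred), 2))  # 0.67
print(baseline)                            # 0.5
```

A model that scores below such a baseline is doing worse than ignoring the input entirely, which is a useful sanity check when comparing feature sets.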

In [None]:
pred = bow_clf.predict(X_te_bow)
output = pd.DataFrame({'Real': y_test, 'Prediction': pred})
output.to_csv('submission.csv', index=False)

## Thank you for reading my notebook. 
## If you enjoyed it and found it helpful, please upvote it as it would help me make more of these.