## Covid-19 Tweet Sentiment Analysis using Naive Bayes

In [None]:
# importing necessary libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import re

import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords

In [None]:
# read_csv already returns a DataFrame, so no extra conversion is needed
df = pd.read_csv("/kaggle/input/covid-19-nlp-text-classification/Corona_NLP_train.csv", encoding='latin1')

In [None]:
df.head()

In [None]:
plt.figure(figsize=(10,5))
sns.countplot(x='Sentiment', data=df, order=['Extremely Negative', 'Negative', 'Neutral', 'Positive', 'Extremely Positive'])

In [None]:
# Show the percentage of NaN values in each column

for i in df.columns:
  print(i,"\t-\t", df[i].isna().mean()*100)


> Here ___Location___ has some null values. Since location does not affect our model (we are not using it as a feature in the analysis), we will leave it as it is.
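
> The same per-column percentages can also be computed without a loop. A minimal sketch on a made-up frame (the `toy` data below is hypothetical, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df: one column with missing values, one without
toy = pd.DataFrame({
    "Location": ["London", np.nan, "UK", np.nan],
    "OriginalTweet": ["a", "b", "c", "d"],
})

# isna() gives a boolean frame; the column mean is the fraction of NaN values
nan_pct = toy.isna().mean() * 100
print(nan_pct)
```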


In [None]:
df.info()

> Since our main column for analysis, "OriginalTweet", contains lots of unnecessary stuff like links, hashtags, mentions etc., we have to clean it and extract the content of the tweet. For that I'm using a regex and omitting the particular sequences which resemble links, hashtags, and mentions.

In [None]:
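# Pattern matching mentions, hashtags, other non-alphanumeric characters, and links
# (raw string, so the backslashes reach the regex engine unchanged)
a = re.compile(r"(@[A-Za-z0-9]+)|(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)")

# Replace every match with a space and store the result in a new column
df["CleanedTweet"] = df["OriginalTweet"].apply(lambda t: a.sub(" ", t))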
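
> To see what the pattern removes, here is a small sketch on a made-up tweet (the sample text and the `clean` helper are hypothetical). Unlike the cell above, it also collapses the runs of spaces left behind by the substitutions.

```python
import re

# Same pattern as above: mentions, hashtags, other non-alphanumerics, links
pattern = re.compile(r"(@[A-Za-z0-9]+)|(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)")

def clean(text):
    """Replace every match with a space, then squeeze repeated whitespace."""
    return " ".join(pattern.sub(" ", text).split())

sample = "@shop Prices are up! #covid19 https://t.co/abc123 stay safe"
cleaned = clean(sample)
print(cleaned)
```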


In [None]:
df.head()

> Now that we have our cleaned tweets, we have to convert them into vectors for classification.


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

stop_words = set(stopwords.words('english'))     # set of English stop words (common words that carry little sentiment signal)
vectoriser = TfidfVectorizer(stop_words=None)    # stop_words=None keeps every word; pass list(stop_words) instead to drop them during vectorisation

In [None]:
X_train = vectoriser.fit_transform(df["CleanedTweet"])
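
> A minimal sketch of what the vectoriser does, on a made-up three-document corpus (the `docs` and `new_docs` lists are hypothetical). `fit_transform` learns the vocabulary and returns one row per document; `transform` reuses that vocabulary, so unseen words are ignored and the column count stays the same.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "prices are rising in the supermarket",
    "the supermarket is empty",
    "stay home and stay safe",
]
new_docs = ["supermarket shelves are empty"]  # "shelves" was never seen

vec = TfidfVectorizer()
X = vec.fit_transform(docs)        # learn vocabulary and IDF weights
X_new = vec.transform(new_docs)    # reuse them, no refitting

print(X.shape, X_new.shape)
print(sorted(vec.vocabulary_))
```

> This is why the test set later goes through `transform` rather than `fit_transform`: the classifier expects the same columns it was trained on.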

In [None]:
# Encoding the classes in numerical values

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
y_train = encoder.fit_transform(df['Sentiment'])
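
> `LabelEncoder` assigns integers in sorted (alphabetical) order of the class names, not in order of sentiment strength. A small sketch using the five labels from the count plot shows the mapping and how to decode integer codes back to names:

```python
from sklearn.preprocessing import LabelEncoder

labels = ["Extremely Negative", "Negative", "Neutral",
          "Positive", "Extremely Positive"]

enc = LabelEncoder()
codes = enc.fit_transform(labels)

# classes_ is sorted, so index i is the label encoded as i
mapping = {str(name): code for code, name in enumerate(enc.classes_)}
print(mapping)

# inverse_transform turns integer codes back into label names
decoded = [str(name) for name in enc.inverse_transform([1, 3])]
print(decoded)
```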

In [None]:
from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB()
classifier.fit(X_train, y_train)

In [None]:
# importing the Test dataset for prediction and testing purposes

test_df = pd.read_csv("/kaggle/input/covid-19-nlp-text-classification/Corona_NLP_test.csv", encoding='latin1')

In [None]:
test_df.head()

In [None]:
# Show the percentage of NaN values in each column

for i in test_df.columns:
  print(i,"\t-\t", test_df[i].isna().mean()*100)


> As with the training dataset, we ignore ___Location___ as it has no significance in classification.

In [None]:
# Same cleaning pattern as for the training set (raw string for the regex escapes)
a = re.compile(r"(@[A-Za-z0-9]+)|(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)")

# Replace every match with a space and store the result in a new column
test_df["CleanedTweet"] = test_df["OriginalTweet"].apply(lambda t: a.sub(" ", t))


In [None]:
test_df.head()

In [None]:
X_test = vectoriser.transform(test_df["CleanedTweet"])

In [None]:
y_test = encoder.transform(test_df["Sentiment"])

In [None]:
# Prediction

y_pred = classifier.predict(X_test)

pred_df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
pred_df.head()
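
> Beyond eyeballing the first rows, the actual and predicted labels can be summarised with an accuracy score and a confusion matrix. A sketch on hypothetical label arrays (`actual` and `predicted` stand in for `y_test` and `y_pred`; the values are made up):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical encoded labels standing in for y_test and y_pred
actual    = np.array([0, 1, 2, 2, 3, 4, 4, 1])
predicted = np.array([0, 1, 2, 4, 3, 4, 2, 1])

acc = accuracy_score(actual, predicted)    # fraction of exact matches
cm = confusion_matrix(actual, predicted)   # rows: actual, columns: predicted

print(acc)
print(cm)
```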

> Plotting the ROC (receiver operating characteristic) curve to check how well the classifier performs.

In [None]:
from sklearn import metrics

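# Generate the ROC curve using scikit-learn.
# Note: y_pred holds hard class labels, and pos_label=1 treats encoded class 1 as the positive class.
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred, pos_label=1)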
plt.plot(fpr, tpr)
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.show()

# Measure the area under the curve.  The closer to 1, the "better" the predictions.
print("AUC of the predictions: {0}".format(metrics.auc(fpr, tpr)))

> With an AUC score of 0.64, we can say that the classifier (Naive Bayes) is not that good, but acceptable. The nearer the AUC score is to 1, the better the classifier.
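
> Note that `roc_curve` above receives hard class predictions and treats encoded class 1 as the positive class. For a multi-class problem, one common alternative is a one-vs-rest AUC computed from predicted probabilities. A minimal sketch on a made-up corpus (the texts and labels are hypothetical, and the value it prints is not the score reported above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical three-class toy data
texts = ["great news today", "good supply good prices", "terrible panic buying",
         "awful empty shelves", "store opens at nine", "delivery arrives monday"]
labels = [2, 2, 0, 0, 1, 1]   # 0 = negative, 1 = neutral, 2 = positive

vec = TfidfVectorizer()
X = vec.fit_transform(texts)
clf = MultinomialNB().fit(X, labels)

# predict_proba returns one probability column per class
proba = clf.predict_proba(X)

# One-vs-rest AUC, macro-averaged over the classes
auc = roc_auc_score(labels, proba, multi_class="ovr")
print(auc)
```

> On the real data this would be computed from `classifier.predict_proba(X_test)` and `y_test`, so the score reflects held-out tweets rather than the training texts used in this toy example.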