# Covid19 Tweet Truth Analysis

Coded by Luna McBride

This dataset contains the training, validation, and test CSVs, along with Excel copies of the train and test files, a CSV with the actual test labels, and ERNIE test results. For this analysis, I will be ignoring the Excel files (as they are the same as the CSVs) and the ERNIE results. I will also act as if the test answer file did not exist for the duration of the testing phase, sticking with a basic approach: train, validate, then see what the model decides for the test set.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import nltk #Natural Language Toolkit for Processing
from nltk.corpus import stopwords #Get the Stopwords to Remove

import re #Regular Expressions
import html #Messing with HTML content, like &amp;
import string #String Processing

import tensorflow as tf #Import tensorflow in order to use Keras
from tensorflow.keras.preprocessing.text import Tokenizer #Add the keras tokenizer for tweet tokenization
from tensorflow.keras.preprocessing.sequence import pad_sequences #Add padding to help the Keras Sequencing
import tensorflow.keras.layers as L #Import the layers as L for quicker typing
from tensorflow.keras.optimizers import Adam #Pull the adam optimizer for usage

from tensorflow.keras.losses import SparseCategoricalCrossentropy #Loss function being used
from sklearn.model_selection import train_test_split #Train Test Split

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
twTrain = pd.read_csv("../input/covid19-fake-news-dataset-nlp/Constraint_Train.csv") #Load the tweet (tw) training set
twTrain.head() #Take a peek at the data

In [None]:
twValid = pd.read_csv("../input/covid19-fake-news-dataset-nlp/Constraint_Val.csv") #Load the tweet (tw) validation set
twValid.head() #Take a peek at the data

In [None]:
twTest = pd.read_csv("../input/covid19-fake-news-dataset-nlp/Constraint_Test.csv") #Load the tweet (tw) testing set
twTest.head() #Take a peek at the data

---

# Check for Null Values

In [None]:
print("Training Set:\n", twTrain.isnull().any()) #Check for null values in the training set
print("Validation Set:\n", twValid.isnull().any()) #Check for null values in the validation set
print("Testing Set:\n", twTest.isnull().any()) #Check for null values in the testing set

There are no null values in the dataset.

---

# Data Exploration

In [None]:
print(twTrain["tweet"][0]) #Print a simple tweet example
print(twTrain["tweet"][300]) #Print a more typical tweet example

The dataset appears to mix drier, plain-text tweets with more typical tweets (with hashtags and links). Since the typical tweets exist, I will have to do the usual tweet cleaning.
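
To put a number on how many tweets carry links, mentions, or hashtags, a quick count along these lines could help. This is a minimal sketch: `sample_tweets` is a hypothetical stand-in for `twTrain["tweet"]`, and the hashtag pattern is an assumption of mine rather than something the notebook uses.

```python
import re

# Hypothetical sample; in the notebook this would be twTrain["tweet"]
sample_tweets = [
    "The CDC currently reports 99031 deaths.",
    "Wash your hands! #COVID19 https://t.co/abc123",
    "@someone is this true? https://t.co/xyz789",
]

# Link and mention patterns match the ones in cleanTweets; the hashtag one is an assumption
patterns = {
    "link": re.compile(r"http\S+"),
    "mention": re.compile(r"@\w+"),
    "hashtag": re.compile(r"#\w+"),
}

# Count how many tweets contain at least one match of each pattern
feature_counts = {
    name: sum(1 for tweet in sample_tweets if pattern.search(tweet))
    for name, pattern in patterns.items()
}
print(feature_counts)
```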

In [None]:
print("Training Labels:\n", twTrain["label"].value_counts()) #See the training labels
print("Validation Labels:\n", twValid["label"].value_counts()) #See the validation labels

The labels appear to be pretty balanced in number. I will need to get dummies for these to turn real and fake into 1 and 0, but because the labels are balanced, the model should pick up on both of them without too much difficulty.
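
Balance is easier to judge as a fraction than as a raw count. The sketch below uses a hypothetical `labels` series in place of `twTrain["label"]`, so its numbers are illustrative only.

```python
import pandas as pd

# Hypothetical labels; in the notebook this would be twTrain["label"]
labels = pd.Series(["real", "fake", "real", "fake", "real"])

# normalize=True returns each label's share of the rows instead of its raw count
label_share = labels.value_counts(normalize=True)
print(label_share)
```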

---

# Tweet Processing

In [None]:
punctuations = string.punctuation #List of punctuations to remove
print(punctuations) #See the punctuations the string library has

STOP = stopwords.words("english") #Get the NLTK stopwords
print(STOP) #See what NLTK considers stopwords

In [None]:
#CleanTweets: parses the tweets and removes mentions, links, punctuation, and stop words (digits are kept).
#Input: the list of tweets that need parsing
#Output: the parsed tweets
def cleanTweets(tweetParse):
    for i in range(0,len(tweetParse)):
        tweet = tweetParse[i] #Putting the tweet into a variable so that it is not calling tweetParse[i] over and over
        tweet = html.unescape(tweet) #Converts leftover HTML entities, such as &amp;, back into plain characters
        tweet = re.sub(r"@\w+", " ", tweet) #Completely removes @'s, as other peoples' usernames mean nothing
        tweet = re.sub(r"http\S+", " ", tweet) #Removes links, as links provide no data in tweet analysis in themselves
        
        tweet = "".join([char for char in tweet if char not in punctuations]) #Removes the punctuation defined above
        tweet = tweet.lower() #Turning the tweets lowercase real quick for later use
    
        tweetWord = tweet.split() #Splits the tweet into individual words
        tweetParse[i] = " ".join([word for word in tweetWord if word not in STOP]) #Keeps only the words that are not stop words
        
    return tweetParse #Returns the parsed tweets

This code is reworked from my original coronavirus tweet sentiment analysis from earlier in the pandemic (https://www.kaggle.com/lunamcbride24/coronavirus-tweet-processing). I have changed it to use NLTK instead of spaCy, since the NLTK stopwords do not require building a spaCy model. I have also used the string library to get punctuation instead of a bulky hard-coded list, and removed the number remover, as I feel that numbers may be a key factor here (especially with the name Covid-19, which would otherwise lose the 19 and become just covid, which has a different connotation). These were things I wanted to change about the original after playing with Keras for TripAdvisor reviews (https://www.kaggle.com/lunamcbride24/hotel-review-keras-classification-project).

This may be a note to myself, but I did both of those projects half a year ago. This is why you should keep your code well-commented.
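
To make the cleaning steps concrete, here is a minimal single-tweet sketch of the same pipeline. The `clean_one` helper, the example tweet, and the tiny `stop_words` set are all hypothetical; the small set stands in for the NLTK list so the sketch runs without downloading NLTK data.

```python
import html
import re
import string

# Small hypothetical stop word set, standing in for the NLTK English list
stop_words = {"the", "is", "a", "of", "to", "and", "in"}

def clean_one(tweet):
    tweet = html.unescape(tweet)            # "&amp;" becomes "&"
    tweet = re.sub(r"@\w+", " ", tweet)     # drop @mentions
    tweet = re.sub(r"http\S+", " ", tweet)  # drop links
    # Drop punctuation; digits are kept, so "Covid-19" becomes "Covid19"
    tweet = "".join(ch for ch in tweet if ch not in string.punctuation)
    words = tweet.lower().split()
    return " ".join(w for w in words if w not in stop_words)

example = "@WHO says Covid-19 is spreading &amp; masks help https://t.co/abc123 #StaySafe"
cleaned = clean_one(example)
print(cleaned)
```

Note that the 19 survives as part of "covid19", which is the reason for dropping the number remover.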

In [None]:
twTrain["cleanTweet"] = cleanTweets(twTrain["tweet"].copy()) #Clean the training tweets
twTrain.head() #Take a look at the dataset

In [None]:
twValid["cleanTweet"] = cleanTweets(twValid["tweet"].copy()) #Clean the validation tweets
twValid.head() #Take a peek at the dataset

In [None]:
twTest["cleanTweet"] = cleanTweets(twTest["tweet"].copy()) #Clean the testing tweets
twTest.head() #Take a peek at the dataset

---

# Check for Post-Processing Blank Tweets

In [None]:
print("Training: \n", twTrain.loc[twTrain["cleanTweet"] == ""]) #Check for Training Blank Tweets
print("Validation: \n", twValid.loc[twValid["cleanTweet"] == ""]) #Check for Validation Blank Tweets
print("Testing: \n", twTest.loc[twTest["cleanTweet"] == ""]) #Check for Testing Blank Tweets

In [None]:
print(twTrain["tweet"][300]) #Print a more typical tweet example
print(twTrain["cleanTweet"][300]) #Print the tweet after processing to show link and stopword removal

There were no blank tweets created in any set. Tweets can become blank if they were just user names and links, so I just needed to make sure.
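
A slightly stricter variant of this check would also catch strings that contain only whitespace. The sketch below runs on a hypothetical dataframe and is not what the cells above do.

```python
import pandas as pd

# Hypothetical cleaned tweets; the second and third carry no words
df = pd.DataFrame({"cleanTweet": ["masks work", "", "   "]})

# strip() first so whitespace-only strings count as blank too
is_blank = df["cleanTweet"].str.strip().eq("")
print(df[is_blank])
```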

---

# Label Encoding

Interestingly, the get_dummies function in pandas will create encoded labels, since this is a binary classification problem. The real column it creates has 1 for real and 0 for not real, which necessarily means fake in this case. That makes it equivalent to label encoding here.

In [None]:
dummyTrain = pd.get_dummies(twTrain["label"], dtype = int) #Get the dummies for the training set as 0/1 integers (pandas 2.0+ defaults to booleans)
print(dummyTrain) #Show the dummies

That real column shows the encoded values for real vs fake. I will be taking the real column as the encoded values.
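
As a quick sanity check on that equivalence, the sketch below compares the "real" dummy column against a direct comparison on a hypothetical label series. It passes `dtype=int` because pandas 2.0 and later return boolean dummies by default.

```python
import pandas as pd

labels = pd.Series(["real", "fake", "fake", "real"])  # hypothetical labels

dummies = pd.get_dummies(labels, dtype=int)  # one 0/1 column per label value
encoded = (labels == "real").astype(int)     # direct binary encoding

print(dummies)
print(encoded.equals(dummies["real"]))
```

Either form gives 1 for real and 0 for fake; the dummy route just carries a redundant "fake" column along with it.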

In [None]:
twTrain["encodedLabel"] = dummyTrain["real"] #Get the encoded labels from the "real" dummies
twTrain.head() #Take a peek at the data

In [None]:
twValid["encodedLabel"] = pd.get_dummies(twValid["label"], dtype = int)["real"] #Get the encoded labels for the validation set as 0/1 integers
twValid.head() #Take a peek at the data

---

# Tokenizing and Padding

In [None]:
trainClean = twTrain["cleanTweet"].copy() #Get the training clean tweets
testClean = twTest["cleanTweet"].copy() #Get the testing clean tweets
validClean = twValid["cleanTweet"].copy() #Get the validation clean tweets

trVaClean = pd.concat([trainClean, validClean], ignore_index = True) #Combine the training and validation tweets (Series.append was removed in pandas 2.0)
allCleanTweet = pd.concat([trVaClean, testClean], ignore_index = True) #Combine all of the tweets into one series
print(len(allCleanTweet)) #Print the length to show they are all together

In [None]:
token = Tokenizer() #Initialize the tokenizer (set here so all of the datasets are in the same tokenizer)
token.fit_on_texts(allCleanTweet) #Fit the tokenizer to all of the tweets

In [None]:
#TokenizeTweet: turn the tweets into tokens for Keras to use
#Input: a set of tweets
#Output: a set of padded sequences representing the tweets
def tokenizeTweet(tweets):
    texts = token.texts_to_sequences(tweets) #Convert the tweets into sequences for keras to use
    texts = pad_sequences(texts, padding='post') #Pad with trailing zeros up to the longest sequence in this set
    
    return texts #Return the padded sequences
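
To illustrate what the tokenizing and post-padding steps do, here is a pure-Python sketch. It is a simplified stand-in, not the Keras implementation: `build_word_index` and `to_padded` are hypothetical helpers, and the `maxlen` argument is an assumption added to show one way of giving every set the same width.

```python
def build_word_index(texts):
    # Rank words by frequency; index 0 is reserved for padding
    counts = {}
    for text in texts:
        for word in text.split():
            counts[word] = counts.get(word, 0) + 1
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {word: i + 1 for i, word in enumerate(ranked)}

def to_padded(texts, word_index, maxlen=None):
    seqs = [[word_index[w] for w in t.split() if w in word_index] for t in texts]
    if maxlen is None:
        maxlen = max(len(s) for s in seqs)  # pad to the longest sequence in this call
    # Keep the last maxlen tokens, then pad the end with zeros
    return [s[-maxlen:] + [0] * (maxlen - len(s)) for s in seqs]

corpus = ["covid19 cases rise", "covid19 vaccine", "masks"]
word_index = build_word_index(corpus)

train_pad = to_padded(["covid19 cases rise", "masks"], word_index)
valid_pad = to_padded(["covid19 vaccine"], word_index)                # width 2
valid_pad_fixed = to_padded(["covid19 vaccine"], word_index, maxlen=3)  # width 3
print(train_pad, valid_pad, valid_pad_fixed)
```

Because each call pads to its own longest sequence, the training, validation, and test arrays can end up with different widths. Passing a shared `maxlen` (which `pad_sequences` also accepts) would give all three the same shape.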

In [None]:
texts = tokenizeTweet(twTrain["cleanTweet"].copy()) #Collect the tokenized tweet sequences
twTrain["tweetSequence"] = list(texts) #Add this data to the dataframe
twTrain.head() #Take a peek at the dataset

In [None]:
textsValid = tokenizeTweet(twValid["cleanTweet"].copy()) #Collect tokenized tweet sequences
twValid["tweetSequence"] = list(textsValid) #Add this data to the dataframe
twValid.head() #Take a peek at the dataset

In [None]:
textsTest = tokenizeTweet(twTest["cleanTweet"].copy()) #Collect tokenized tweet sequences
twTest["tweetSequence"] = list(textsTest) #Add this data to the dataframe
twTest.head() #Take a peek at the dataset

---

# Model Training

In [None]:
size = len(token.word_index) + 1 #Set the number of words for the size

tf.keras.backend.clear_session() #Clear any previous model building

epoch = 3 #Number of runs through the data
batchSize = 32 #The number of items in each batch
outputDimensions = 16 #The size of each word's embedding vector
units = 256 #Number of units in each direction of the LSTM

model = tf.keras.Sequential([ #Start the sequential model, doing one layer after another in a sequence
    L.Embedding(size, outputDimensions, input_length = texts.shape[1]), #Map each word index to a trainable vector of size outputDimensions
    L.Bidirectional(L.LSTM(units, return_sequences = True)), #Make it so the model looks both forward and backward at the data
    L.GlobalMaxPool1D(), #Take the max values over time
    L.Dropout(0.3), #Randomly zero 30% of the inputs during training to prevent overfitting
    L.Dense(64, activation="relu"), #Create a hidden dense layer
    L.Dropout(0.3), #Randomly zero 30% of the inputs during training to prevent overfitting
    L.Dense(3) #Output layer of raw logits (3 units, though the binary labels only use classes 0 and 1)
])


model.compile(loss = SparseCategoricalCrossentropy(from_logits = True), #Compile the model with a SparseCategorical loss function
              optimizer = 'adam', metrics = ['accuracy'] #Add an adam optimizer and collect the accuracy along the way
             )

history = model.fit(texts, twTrain["encodedLabel"], epochs = epoch, validation_split = 0, batch_size = batchSize) #Fit the model to the data

---

# Validate

In [None]:
predict = np.argmax(model.predict(textsValid), axis = -1) #Predict labels by taking the largest logit (predict_classes is gone from newer TensorFlow releases)
loss, accuracy = model.evaluate(textsValid, twValid["encodedLabel"]) #Get the loss and accuracy on the validation set

#Print the loss and accuracy
print("Validation Loss: ", loss)
print("Validation Accuracy: ", accuracy)
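
Because the model is compiled with `from_logits = True` and its last layer has no activation, the values it outputs are raw scores rather than probabilities. The sketch below uses hypothetical logits to show how they relate to class predictions.

```python
import numpy as np

# Hypothetical logits for three tweets, one column per output unit
logits = np.array([[ 2.0, -1.0, -3.0],
                   [-0.5,  1.5, -2.0],
                   [ 0.1,  0.3, -4.0]])

# Softmax turns logits into probabilities; subtracting the row max keeps exp() stable
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

# The predicted class is the column with the largest logit (softmax keeps the ordering)
predicted = logits.argmax(axis=1)
print(predicted)
print(probs.round(3))
```

Softmax is only needed when the probabilities themselves are of interest; for picking a label, the argmax over the logits gives the same answer.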

In [None]:
pd.set_option("display.max_colwidth", 1000) #Show as much of the tweet as possible

validLabel = twValid["encodedLabel"].copy() #Get the encoded labels (1 for real, 0 for fake)
validLabel = pd.DataFrame(validLabel) #Convert to a dataframe to hold more data
validLabel["predictions"] = predict #Add the predictions to the dataframe
validLabel["tweet"] = twValid["tweet"].copy() #Add the original tweet for comparison sake
validLabel.head() #Compare

This is just in case someone is interested in going line by line. Of the ones showing in my dashboard (which is very cropped), the second tweet was flagged as real despite being fake. Its wording does seem a bit more reasonable, so it probably could have fooled me too.

Note: both this and the test predictions are displayed in full at the bottom of the notebook for ease of access.
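
For anyone who would rather jump straight to the mistakes than read line by line, filtering for disagreements is one option. The sketch uses a hypothetical `valid_demo` dataframe with the same columns as `validLabel`.

```python
import pandas as pd

# Hypothetical stand-in for the validLabel dataframe built above
valid_demo = pd.DataFrame({
    "encodedLabel": [1, 0, 1, 0],
    "predictions":  [1, 1, 1, 0],
    "tweet": ["tweet a", "tweet b", "tweet c", "tweet d"],
})

# Keep only the rows where the prediction disagrees with the label
misses = valid_demo[valid_demo["encodedLabel"] != valid_demo["predictions"]]
print(misses)
```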

---

# Test Set Predictions

In [None]:
predictTest = np.argmax(model.predict(textsTest), axis = -1) #Predict labels by taking the largest logit (predict_classes is gone from newer TensorFlow releases)

In [None]:
tweetTest = twTest["tweet"].copy() #Get the original tweets
tweetTest = pd.DataFrame(tweetTest) #Put the tweets into a dataframe
tweetTest["prediction"] = predictTest #Add in the predictions
tweetTest = tweetTest[["prediction", "tweet"]] #Change column order to line up with the validation dataframe's order
tweetTest.head() #Show the tests

The ones displayed do seem to make sense in context. The "President Trump Asked What He Would Do If He Were To Catch The Coronavirus https://t.co/3MEWhusRZI #donaldtrump #coronavirus" tweet has less to do with the virus itself or truth claims, which is a bit odd, but the rest make sense. The full test and validation sets are shown below for those who want to look deeper. A 91% accuracy on the validation set is very good, so I can reasonably assume the model should be fairly accurate on the test set.
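
A single accuracy figure hides which kind of mistake the model makes. One way to break it down on the validation set is to count each (actual, predicted) pair, sketched here with hypothetical labels and predictions rather than the notebook's real results.

```python
from collections import Counter

# Hypothetical labels and predictions (1 = real, 0 = fake)
actual    = [1, 0, 1, 0, 1, 0]
predicted = [1, 1, 1, 0, 0, 0]

# Count each (actual, predicted) pair: (0, 1) is a fake tweet flagged as real
confusion = Counter(zip(actual, predicted))
accuracy = (confusion[(1, 1)] + confusion[(0, 0)]) / len(actual)

print(dict(confusion))
print(accuracy)
```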

---

# Tweets with Predictions: Full Data

## Validation

In [None]:
pd.set_option("display.max_rows", 10000) #Show as much as possible
validLabel #Show the validation set

## Test

In [None]:
tweetTest #Show the test set