


# **Rob Boswell**

# **Assignment 3: Projects in Advanced ML**

https://github.com/roboswell/Advanced_Machine_Learning

---








## **Data Source:**  Shahi, Gautam Kishore, Anne Dirkson, and Tim A. Majchrzak. "An exploratory study of covid-19 misinformation on twitter." Online Social Networks and Media 22 (2021): 100104.

## Can you use the following data to build....?
1. A model with an embedding layer and dense layers (but with no layers meant for sequential data)
2. A model using Conv1D layers
3. A model with one sequential layer (LSTM or GRU)
4. A model with stacked sequential layers (LSTM or GRU)
5. A model with bidirectional sequential layers

### After choosing a model, feed it some realistic tweets that are not from your training data to see if it returns meaningful/useful results.







---

**Question 1: Present examples of tweets from the dataset that demonstrate real information or misinformation.**

---



## The code below shows a summary of some real and fake COVID-19 tweets in the training set.

In [None]:
# Source: Fighting an Infodemic: COVID-19 Fake News Dataset, https://github.com/diptamath/covid_fake_news, https://arxiv.org/abs/2011.03327

import pandas as pd
trainingdata=pd.read_csv("https://raw.githubusercontent.com/diptamath/covid_fake_news/main/data/Constraint_Train.csv", usecols = ['tweet','label'])
testdata=pd.read_csv("https://raw.githubusercontent.com/diptamath/covid_fake_news/main/data/english_test_with_labels.csv", usecols = ['tweet','label'])

trainingdata

Unnamed: 0,tweet,label
0,The CDC currently reports 99031 deaths. In gen...,real
1,States reported 1121 deaths a small rise from ...,real
2,Politically Correct Woman (Almost) Uses Pandem...,fake
3,#IndiaFightsCorona: We have 1524 #COVID testin...,real
4,Populous states can generate large case counts...,real
...,...,...
6415,A tiger tested positive for COVID-19 please st...,fake
6416,"""Autopsies prove that COVID-19 is"" a blood...",fake
6417,_A post claims a COVID-19 vaccine has already ...,fake
6418,Aamir Khan Donate 250 Cr. In PM Relief Cares Fund,fake




---

**Question 2: Discuss the dataset in general terms and describe why building a predictive model using this data might be practically useful.  Who could benefit from a model like this? Explain.**

---

### **Answer:** The dataset contains 8,560 tweets, each previously labelled as either "real" (i.e., true) or "fake" (i.e., incorrect or misleading). As the counts printed by the code below show, there are 3,360 "real" tweets in the training set and 1,120 in the test set, along with 3,060 "fake" tweets in the training set and 1,020 in the test set. Thus, about 52% of the tweets are real and 48% are fake.

### A highly accurate model built on this data could help Twitter identify and remove false or misleading tweets about COVID-19, which pose a potential health threat both to the people who read them and to anyone those readers pass the misinformation on to.

### Using such a model to monitor spikes in fake tweets could also help health professionals on Twitter move quickly to counter misinformation by spreading accurate tweets, in the hope that far more users will be exposed to correct information about COVID-19 than to fake information. Thus, Twitter (the company), Twitter users, and the people with whom those users share information about COVID-19 would all stand to benefit from such a model.
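The class shares quoted in this answer can be checked with a few lines of arithmetic. This is a minimal sketch in which the four label counts are hard-coded from this notebook's printed output rather than recomputed from the CSV files.

```python
# Label counts copied from this notebook's printed output (not recomputed here).
counts = {
    "train": {"real": 3360, "fake": 3060},
    "test": {"real": 1120, "fake": 1020},
}

total = sum(sum(split.values()) for split in counts.values())
real_total = sum(split["real"] for split in counts.values())
fake_total = sum(split["fake"] for split in counts.values())

real_share = real_total / total
fake_share = fake_total / total

print(f"total tweets: {total}")
print(f"real: {real_total} ({real_share:.1%}), fake: {fake_total} ({fake_share:.1%})")
```

The classes are close to balanced, so plain accuracy is a reasonable first metric for comparing models on this data.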

---


In [None]:
print(len(trainingdata[trainingdata['label'] == 'real']))
print(len(testdata[testdata['label'] == 'real']))


3360
1120


In [None]:
print(len(trainingdata[trainingdata['label'] == 'fake']))
print(len(testdata[testdata['label'] == 'fake']))

3060
1020


## Define Preprocessor

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

# Build vocabulary from training text data
tokenizer = Tokenizer(num_words=25000)
tokenizer.fit_on_texts(trainingdata.tweet)

# preprocessor tokenizes words and makes sure all documents have the same length.
# Note: max_words is kept in the signature but is not used inside the function;
# the vocabulary size is fixed by Tokenizer(num_words=...) above.
def preprocessor(data, maxlen, max_words):

    sequences = tokenizer.texts_to_sequences(data)
    X = pad_sequences(sequences, maxlen=maxlen)

    return X
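
To make the padding step concrete, the sketch below reimplements in plain Python what `pad_sequences` does with its default settings (`padding='pre'`, `truncating='pre'`, `value=0`): short sequences are left-padded with zeros, and long ones keep only their last `maxlen` tokens. The helper name `pad_pre` and the toy token ids are hypothetical and only for illustration; the notebook itself keeps using the Keras function.

```python
def pad_pre(seq, maxlen, value=0):
    """Mimic pad_sequences defaults: pad on the left, truncate from the left."""
    kept = list(seq)[-maxlen:]                     # keep the last `maxlen` tokens
    return [value] * (maxlen - len(kept)) + kept   # left-pad with `value`

short_seq = [12, 7, 301]                 # toy token ids, shorter than maxlen
long_seq = [5, 9, 2, 44, 8, 13, 21]      # toy token ids, longer than maxlen

padded_short = pad_pre(short_seq, maxlen=5)
padded_long = pad_pre(long_seq, maxlen=5)

print(padded_short)
print(padded_long)
```

With `maxlen=45`, any tweet longer than 45 tokens therefore loses its opening words rather than its closing ones.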

## Prepare Train and Test Data

In [None]:
# tokenize and pad X data
X_train = preprocessor(trainingdata.tweet, maxlen=45, max_words=25000)
X_test = preprocessor(testdata.tweet, maxlen=45, max_words=25000)

# one-hot encode Y data (get_dummies orders the columns alphabetically: fake, real)
y_train = pd.get_dummies(trainingdata.label)
y_test = pd.get_dummies(testdata.label)

In [None]:
print(X_train.shape)
print(X_test.shape)

(6420, 45)
(2140, 45)


## Model 1:

In [None]:
from tensorflow.keras.layers import Dense, Embedding,Flatten
from tensorflow.keras.layers import SimpleRNN, LSTM
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import ModelCheckpoint, ReduceLROnPlateau

# replace this model with the architectures from the task description
model = Sequential()
model.add(Embedding(25000, 32, input_length=45))
model.add(LSTM(32))
model.add(Flatten())
model.add(Dense(2, activation='softmax'))

mc = ModelCheckpoint('best_model.h5', monitor='val_acc', mode='max', verbose=1, save_best_only=True)
red_lr= ReduceLROnPlateau(monitor='val_acc', patience=2,verbose=1,factor=0.05)

model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

history = model.fit(X_train, y_train,
                    epochs=20,
                    batch_size=32,
                    callbacks=[mc,red_lr],
                    validation_split=0.2)

Epoch 1/20

Epoch 00001: val_acc improved from -inf to 0.90576, saving model to best_model.h5
Epoch 2/20

Epoch 00002: val_acc improved from 0.90576 to 0.93069, saving model to best_model.h5
Epoch 3/20

Epoch 00003: val_acc did not improve from 0.93069
Epoch 4/20

Epoch 00004: val_acc did not improve from 0.93069

Epoch 00004: ReduceLROnPlateau reducing learning rate to 5.0000002374872565e-05.
Epoch 5/20

Epoch 00005: val_acc improved from 0.93069 to 0.93146, saving model to best_model.h5
Epoch 6/20

Epoch 00006: val_acc did not improve from 0.93146
Epoch 7/20

Epoch 00007: val_acc did not improve from 0.93146

Epoch 00007: ReduceLROnPlateau reducing learning rate to 2.5000001187436284e-06.
Epoch 8/20

Epoch 00008: val_acc did not improve from 0.93146
Epoch 9/20

Epoch 00009: val_acc did not improve from 0.93146

Epoch 00009: ReduceLROnPlateau reducing learning rate to 1.2500000821091816e-07.
Epoch 10/20

Epoch 00010: val_acc did not improve from 0.93146
Epoch 11/20

Epoch 00011: val_a
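
The learning rates in the log above can be reproduced by hand. `ReduceLROnPlateau` multiplies the current rate by `factor` each time it fires. Assuming the optimizer started from the Keras RMSprop default of 0.001 (the notebook never sets the rate explicitly), three reductions with `factor=0.05` give the logged values up to float32 rounding.

```python
initial_lr = 1e-3   # assumed: Keras default for RMSprop, not set in the notebook
factor = 0.05       # from the ReduceLROnPlateau call for Model 1

lrs = [initial_lr]
for _ in range(3):
    lrs.append(lrs[-1] * factor)   # each plateau multiplies the rate by `factor`

for step, lr in enumerate(lrs):
    print(f"after {step} reduction(s): {lr:.3e}")
```

After the third reduction the rate is about 1.25e-07, so later epochs can barely move the weights, which is consistent with `val_acc` staying flat from that point on.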

## Model 2:

In [None]:
from tensorflow.keras.layers import Bidirectional
import tensorflow as tf

with tf.device('/device:GPU:0'):
# replace this model with the architectures from the task description
  model = Sequential()
  model.add(Embedding(25000, 16, input_length=45))
  model.add(Bidirectional(LSTM(32, dropout=0.3, recurrent_dropout=0.3, return_sequences=True))) 
  model.add(Bidirectional(LSTM(32, dropout=0.3, recurrent_dropout=0.3, return_sequences=True)))
  model.add(Bidirectional(LSTM(32, dropout=0.3, recurrent_dropout=0.3)))  
  model.add(Flatten())
  model.add(Dense(2, activation='softmax'))

  mc = ModelCheckpoint('best_model.h5', monitor='val_acc', mode='max', verbose=1, save_best_only=True)
  red_lr= ReduceLROnPlateau(monitor='val_acc', patience=2,verbose=1,factor=0.1)

  model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

  history = model.fit(X_train, y_train,
                      epochs=10,
                      batch_size=16,
                      callbacks=[mc,red_lr],
                      validation_split=0.2)

Epoch 1/10

Epoch 00001: val_acc improved from -inf to 0.90966, saving model to best_model.h5
Epoch 2/10

Epoch 00002: val_acc improved from 0.90966 to 0.92368, saving model to best_model.h5
Epoch 3/10

Epoch 00003: val_acc improved from 0.92368 to 0.92445, saving model to best_model.h5
Epoch 4/10

Epoch 00004: val_acc improved from 0.92445 to 0.93069, saving model to best_model.h5
Epoch 5/10

Epoch 00005: val_acc did not improve from 0.93069
Epoch 6/10

Epoch 00006: val_acc did not improve from 0.93069

Epoch 00006: ReduceLROnPlateau reducing learning rate to 0.00010000000474974513.
Epoch 7/10

Epoch 00007: val_acc did not improve from 0.93069
Epoch 8/10

Epoch 00008: val_acc did not improve from 0.93069

Epoch 00008: ReduceLROnPlateau reducing learning rate to 1.0000000474974514e-05.
Epoch 9/10

Epoch 00009: val_acc did not improve from 0.93069
Epoch 10/10

Epoch 00010: val_acc did not improve from 0.93069

Epoch 00010: ReduceLROnPlateau reducing learning rate to 1.0000000656873453e-

## Model 3:

In [None]:
from tensorflow.keras import layers

with tf.device('/device:GPU:0'):
# replace this model with the architectures from the task description
  model = Sequential()
  model.add(layers.Embedding(25000, 150, input_length=45))
  model.add(layers.Conv1D(32, 7, activation='relu')) 
  model.add(layers.AveragePooling1D(5)) 
  model.add(layers.GRU(128, dropout=0.3, recurrent_dropout=0.3, return_sequences=True))
  model.add(layers.GRU(128, dropout=0.3, recurrent_dropout=0.3, return_sequences=True))
  model.add(layers.GRU(128, dropout=0.3, recurrent_dropout=0.3))  
  model.add(Flatten())
  model.add(layers.Dense(2, activation='softmax'))

  mc = ModelCheckpoint('best_model.h5', monitor='val_acc', mode='max', verbose=1, save_best_only=True)
  red_lr= ReduceLROnPlateau(monitor='val_acc', patience=2,verbose=1,factor=0.05)

  model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

  history = model.fit(X_train, y_train,
                      epochs=20,
                      batch_size=50,
                      callbacks=[mc,red_lr],
                      validation_split=0.2)

Epoch 1/20

Epoch 00001: val_acc improved from -inf to 0.90576, saving model to best_model.h5
Epoch 2/20

Epoch 00002: val_acc improved from 0.90576 to 0.91978, saving model to best_model.h5
Epoch 3/20

Epoch 00003: val_acc did not improve from 0.91978
Epoch 4/20

Epoch 00004: val_acc did not improve from 0.91978

Epoch 00004: ReduceLROnPlateau reducing learning rate to 5.0000002374872565e-05.
Epoch 5/20

Epoch 00005: val_acc did not improve from 0.91978
Epoch 6/20

Epoch 00006: val_acc did not improve from 0.91978

Epoch 00006: ReduceLROnPlateau reducing learning rate to 2.5000001187436284e-06.
Epoch 7/20

Epoch 00007: val_acc did not improve from 0.91978
Epoch 8/20

Epoch 00008: val_acc did not improve from 0.91978

Epoch 00008: ReduceLROnPlateau reducing learning rate to 1.2500000821091816e-07.
Epoch 9/20

Epoch 00009: val_acc did not improve from 0.91978
Epoch 10/20

Epoch 00010: val_acc did not improve from 0.91978

Epoch 00010: ReduceLROnPlateau reducing learning rate to 6.250000

## Model 4:

In [None]:
from tensorflow.keras import layers

with tf.device('/device:GPU:0'):
# replace this model with the architectures from the task description
  model = Sequential()
  model.add(layers.Embedding(25000, 50, input_length=45))
  model.add(layers.Conv1D(60, 5, activation='relu')) 
  model.add(layers.AveragePooling1D(3)) 
  model.add(layers.GRU(128, dropout=0.2, recurrent_dropout=0.2, return_sequences=True))
  model.add(layers.GRU(128, dropout=0.2, recurrent_dropout=0.2))  
  model.add(Flatten())
  model.add(layers.Dense(2, activation='softmax'))

  mc = ModelCheckpoint('best_model.h5', monitor='val_acc', mode='max', verbose=1, save_best_only=True)
  red_lr= ReduceLROnPlateau(monitor='val_acc', patience=2,verbose=1,factor=0.05)

  model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

  history = model.fit(X_train, y_train,
                      epochs=20,
                      batch_size=20,
                      callbacks=[mc,red_lr],
                      validation_split=0.2)

Epoch 1/20

Epoch 00001: val_acc improved from -inf to 0.91277, saving model to best_model.h5
Epoch 2/20

Epoch 00002: val_acc improved from 0.91277 to 0.93302, saving model to best_model.h5
Epoch 3/20

Epoch 00003: val_acc did not improve from 0.93302
Epoch 4/20

Epoch 00004: val_acc did not improve from 0.93302

Epoch 00004: ReduceLROnPlateau reducing learning rate to 5.0000002374872565e-05.
Epoch 5/20

Epoch 00005: val_acc did not improve from 0.93302
Epoch 6/20

Epoch 00006: val_acc did not improve from 0.93302

Epoch 00006: ReduceLROnPlateau reducing learning rate to 2.5000001187436284e-06.
Epoch 7/20

Epoch 00007: val_acc did not improve from 0.93302
Epoch 8/20

Epoch 00008: val_acc did not improve from 0.93302

Epoch 00008: ReduceLROnPlateau reducing learning rate to 1.2500000821091816e-07.
Epoch 9/20

Epoch 00009: val_acc did not improve from 0.93302
Epoch 10/20

Epoch 00010: val_acc did not improve from 0.93302

Epoch 00010: ReduceLROnPlateau reducing learning rate to 6.250000

## Model 5:

In [None]:
from tensorflow.keras.layers import GRU
from tensorflow.keras.layers import Dense, Embedding,Flatten
from tensorflow.keras.layers import SimpleRNN, LSTM
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import ModelCheckpoint, ReduceLROnPlateau
from tensorflow.keras.layers import Bidirectional
import tensorflow as tf


with tf.device('/device:GPU:0'):

  model = Sequential()
  model.add(Embedding(25000, 26, input_length=45))
  model.add(Bidirectional(LSTM(128, dropout=0.15, recurrent_dropout=0.15, return_sequences=True))) 
  model.add(Bidirectional(LSTM(128, dropout=0.15, recurrent_dropout=0.15, return_sequences=True)))
  model.add(Bidirectional(GRU(32, dropout=0.15, recurrent_dropout=0.15, return_sequences=True)))
  model.add(Bidirectional(GRU(32, dropout=0.15, recurrent_dropout=0.15)))  
  model.add(Flatten())
  model.add(Dense(2, activation='softmax'))

  mc = ModelCheckpoint('best_model.h5', monitor='val_acc', mode='max', verbose=1, save_best_only=True)
  red_lr= ReduceLROnPlateau(monitor='val_acc', patience=2, verbose=1, factor=0.05)

  model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

  history = model.fit(X_train, y_train,
                      epochs=10,
                      batch_size=32,
                      callbacks=[mc,red_lr],
                      validation_split=0.2)

Epoch 1/10

Epoch 00001: val_acc improved from -inf to 0.91511, saving model to best_model.h5
Epoch 2/10

Epoch 00002: val_acc improved from 0.91511 to 0.92212, saving model to best_model.h5
Epoch 3/10

Epoch 00003: val_acc improved from 0.92212 to 0.93847, saving model to best_model.h5
Epoch 4/10

Epoch 00004: val_acc did not improve from 0.93847
Epoch 5/10

Epoch 00005: val_acc did not improve from 0.93847

Epoch 00005: ReduceLROnPlateau reducing learning rate to 5.0000002374872565e-05.
Epoch 6/10

Epoch 00006: val_acc did not improve from 0.93847
Epoch 7/10

Epoch 00007: val_acc did not improve from 0.93847

Epoch 00007: ReduceLROnPlateau reducing learning rate to 2.5000001187436284e-06.
Epoch 8/10

Epoch 00008: val_acc did not improve from 0.93847
Epoch 9/10

Epoch 00009: val_acc did not improve from 0.93847

Epoch 00009: ReduceLROnPlateau reducing learning rate to 1.2500000821091816e-07.
Epoch 10/10

Epoch 00010: val_acc did not improve from 0.93847
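
With `validation_split=0.2`, Keras holds out the last 20% of the 6,420 training rows, so every `val_acc` logged above is a fraction of 1,284 tweets. The sketch below converts the best accuracies visible in the (partly truncated) logs back into counts of correctly classified validation tweets. Rounding to whole tweets rests on the assumption that the logged values are plain accuracy over that split.

```python
n_train_rows = 6420
n_val = n_train_rows * 20 // 100   # rows held out by validation_split=0.2

# Best val_acc visible in each model's training log above.
best_val_acc = {
    "Model 1": 0.93146,
    "Model 2": 0.93069,
    "Model 3": 0.91978,
    "Model 4": 0.93302,
    "Model 5": 0.93847,
}

correct = {name: round(acc * n_val) for name, acc in best_val_acc.items()}
for name, k in correct.items():
    print(f"{name}: {k}/{n_val} correct ({k / n_val:.5f})")
```

The gap between Model 5 and Model 4 is 7 tweets, and between Model 5 and Model 3 it is 24, so the ranking rests on small differences that a different split or random seed could plausibly change.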


## **Question:** Discuss which models performed better and point out relevant hyper-parameter values for successful models.

## **Answer:** My 5th model performed the best (93.847% validation accuracy). It combined two bidirectional LSTM layers (128 units each) with two bidirectional GRU layers (32 units each). This model, like all of my models, began with an embedding layer; in the 5th model the embedding dimension was 26. I also used dropout (0.15) and recurrent dropout (0.15) to try to reduce overfitting. The lower dropout rate compared to my other models may have contributed to the higher accuracy, although normally the opposite might be expected.

## My fourth model was my second-highest performer, with validation accuracy reaching 93.302%. It had a max_words value of 25,000 (as did model 5) but an embedding dimension of 50. It appears that a smaller embedding dimension may contribute to higher validation accuracy on this data. I used a Conv1D layer (60 filters, kernel size 5) followed by average pooling (pool size 3) to shorten the sequence before stacking two GRU layers with 128 units each. I also used dropout (0.2) and recurrent dropout (0.2) in both GRU layers.

## Lastly, my worst-performing model, model 3, achieved a validation accuracy of only 91.978%. It also had a max_words value of 25,000 and an input_length of 45, and used a Conv1D layer (32 filters, kernel size 7) and average pooling (pool size 5) to shorten the sequence before stacking. I then included 3 GRU layers back to back, each with 128 units, dropout (0.30), and recurrent dropout (0.30). It is notable that more stacked layers, more units per layer, and a higher dropout rate did not result in higher accuracy. This may suggest that simpler models perform better on this dataset. Bigger is not always better.
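
The effect of the convolution and pooling layers in Models 3 and 4 can be checked with simple arithmetic. The sketch assumes the Keras defaults that the notebook's calls rely on (`padding='valid'` and stride 1 for `Conv1D`, stride equal to the pool size for `AveragePooling1D`); the helper names are illustrative only.

```python
def conv1d_out_len(n_in, kernel_size):
    # 'valid' padding with stride 1: the kernel must fit entirely inside the sequence
    return n_in - kernel_size + 1

def pool1d_out_len(n_in, pool_size):
    # 'valid' padding with stride == pool_size: leftover steps at the end are dropped
    return (n_in - pool_size) // pool_size + 1

seq_len = 45  # maxlen used for Models 1 to 5

m3_steps = pool1d_out_len(conv1d_out_len(seq_len, kernel_size=7), pool_size=5)
m4_steps = pool1d_out_len(conv1d_out_len(seq_len, kernel_size=5), pool_size=3)

print(f"Model 3: GRU stack sees {m3_steps} time steps")
print(f"Model 4: GRU stack sees {m4_steps} time steps")
```

Under these assumptions the GRU layers in Model 3 see only 7 time steps and those in Model 4 see 13, so these layers mainly shorten the sequence the recurrent layers process rather than reduce the number of trainable parameters.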

In [None]:
# format y_pred as labels 
y_pred = model.predict(X_test).argmax(axis=1)
predicted_labels = [y_test.columns[i] for i in y_pred]
predicted_labels[0:5]

['real', 'fake', 'fake', 'real', 'real']

## Submitting My Best Model (Model 5)

In [None]:
%%capture
# install aimodelshare library (the %%capture cell magic must be the first line of the cell)
! pip install aimodelshare --upgrade --extra-index-url https://test.pypi.org/simple/

In [None]:
import aimodelshare as ai
from aimodelshare.aimsonnx import model_to_onnx

In [None]:
# save preprocessor
ai.export_preprocessor(preprocessor,"")

In [None]:
# save model in onnx format
onnx_model = model_to_onnx(model, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("onnx_model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

INFO:tensorflow:Assets written to: /tmp/assets


In [None]:
# set credentials for modeltoapi function 
# make sure you have uploaded your credentials.txt file
from aimodelshare.aws import set_credentials
api_url = "https://wvr23l2z9i.execute-api.us-east-1.amazonaws.com/prod/m"

set_credentials(apiurl=api_url,credential_file="credentials.txt", type="submit_model", manual=False)

AI Model Share login credentials set successfully.
AWS credentials set successfully.


In [None]:
# submit model and predictions to competition
ai.submit_model("onnx_model.onnx",
                api_url,
                prediction_submission=predicted_labels,
                preprocessor="preprocessor.zip")

'Your model has been submitted as model version 75'

In [None]:
# check leaderboard
data=ai.get_leaderboard(api_url, verbose=3)
ai.leaderboard.stylize_leaderboard(data)

Unnamed: 0,accuracy,f1_score,precision,recall,ml_framework,transfer_learning,deep_learning,model_type,depth,num_params,bidirectional_layers,conv1d_layers,dense_layers,embedding_layers,flatten_layers,globalmaxpooling1d_layers,lstm_layers,maxpooling1d_layers,simplernn_layers,relu_act,sigmoid_act,softmax_act,tanh_act,loss,optimizer,model_config,username,version
0,95.09%,95.09%,95.07%,95.12%,keras,False,True,Sequential,3,161922,,,1,1,1.0,,,,,,,1.0,,str,RMSprop,"{'name': 'sequential', 'layers...",hpeters,66
1,95.09%,95.09%,95.07%,95.12%,keras,False,True,Sequential,3,161922,,,1,1,1.0,,,,,,,1.0,,str,RMSprop,"{'name': 'sequential', 'layers...",hpeters,67
2,95.00%,94.99%,94.97%,95.02%,keras,False,True,Sequential,5,1081482,1.0,,2,1,,,1.0,,,1.0,,1.0,1.0,str,RMSprop,"{'name': 'sequential_29', 'lay...",kagenlim,61
3,94.86%,94.85%,94.84%,94.87%,keras,False,True,Sequential,5,1035746,,,2,1,,,2.0,,,1.0,,1.0,2.0,str,RMSprop,"{'name': 'sequential_3', 'laye...",kagenlim,19
4,94.77%,94.76%,94.74%,94.78%,keras,False,True,Sequential,9,1313030,,,2,1,1.0,,1.0,,4.0,,3.0,,4.0,str,RMSprop,"{'name': 'sequential_1', 'laye...",kka2120,69
5,94.58%,94.57%,94.57%,94.57%,keras,False,True,Sequential,5,1070202,,,2,1,,,2.0,,,1.0,,1.0,2.0,str,RMSprop,"{'name': 'sequential_4', 'laye...",kagenlim,60
6,94.49%,94.47%,94.47%,94.48%,keras,False,True,Sequential,3,161282,,,1,1,1.0,,,,,,,1.0,,str,RMSprop,"{'name': 'sequential', 'layers...",newusertest,4
7,94.35%,94.34%,94.32%,94.37%,keras,False,True,Sequential,6,148066,,2.0,1,1,1.0,,,1.0,,2.0,,1.0,,str,RMSprop,"{'name': 'sequential_72', 'lay...",prajseth,40
8,94.25%,94.24%,94.24%,94.24%,keras,False,True,Sequential,3,98818,,,1,1,,,1.0,,,,,1.0,1.0,str,RMSprop,"{'name': 'sequential_78', 'lay...",prajseth,41
9,94.21%,94.19%,94.18%,94.21%,keras,False,True,Sequential,3,402690,,,1,1,,,1.0,,,,1.0,,1.0,str,RMSprop,"{'name': 'sequential_5', 'laye...",xc2303_xc,63


In [None]:
# Get best model architecture and view model summary, change version arg as needed

bestmodel = ai.aimsonnx.instantiate_model(api_url, version=66)

bestmodel.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 60, 16)            160000    
_________________________________________________________________
flatten (Flatten)            (None, 960)               0         
_________________________________________________________________
dense (Dense)                (None, 2)                 1922      
Total params: 161,922
Trainable params: 161,922
Non-trainable params: 0
_________________________________________________________________


In [None]:
# Compare two model versions to see diffs
ai.aimsonnx.compare_models(api_url, version_list=[66,75]) 



Unnamed: 0,Model_66_Layer,Model_66_Shape,Model_66_Params,Model_75_Layer,Model_75_Shape,Model_75_Params
0,Embedding,"(None, 60, 16)",160000.0,Embedding,"(None, 45, 26)",650000
1,Flatten,"(None, 960)",0.0,Bidirectional,"(None, 45, 256)",158720
2,Dense,"(None, 2)",1922.0,Bidirectional,"(None, 45, 256)",394240
3,,,,Bidirectional,"(None, 45, 64)",55680
4,,,,Bidirectional,"(None, 64)",18816
5,,,,Flatten,"(None, 64)",0
6,,,,Dense,"(None, 2)",130


# **Question: Explain how the [best] model's structure is different from your best model.**



---

### **Answer:** I have printed the Keras model.summary() (as well as compare_models()) above. The best model (version 66) differs from my model (version 75) in several ways. Unlike my model, it uses no bidirectional layers and no stacked recurrent layers; according to both model.summary() and compare_models(), it consists only of an embedding layer followed by a flatten layer and a dense output layer. Whereas my model has maxlen = 45, the best model has maxlen = 60. While my model used an embedding dimension of 26, the best model in the class used 16. As a result, my embedding layer has far more parameters (650,000) than the best model's (160,000). Because an embedding layer has max_words × embedding dimension parameters, the best model's max_words value must be 160,000/16 = 10,000, whereas my model had a max_words value of 25,000.
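
The parameter counts in the two summaries can be reproduced from the layer shapes. An embedding layer stores one vector of length `embed_dim` per vocabulary entry, so its size is `vocab_size * embed_dim` and the sequence length does not enter. The helper names below are illustrative; the shapes are taken from the `model.summary()` and `compare_models()` output above.

```python
def embedding_params(vocab_size, embed_dim):
    # one embed_dim-long vector per vocabulary entry
    return vocab_size * embed_dim

def dense_params(n_inputs, n_units):
    # weight matrix plus one bias per unit
    return n_inputs * n_units + n_units

# Version 66: Embedding -> Flatten (60 * 16 = 960 features) -> Dense(2)
v66_embed = embedding_params(10_000, 16)
v66_dense = dense_params(60 * 16, 2)

# Version 75: embedding layer of Model 5
v75_embed = embedding_params(25_000, 26)

# Vocabulary size implied by version 66's embedding layer
implied_vocab_v66 = 160_000 // 16

print(v66_embed, v66_dense, v66_embed + v66_dense)
print(v75_embed, implied_vocab_v66)
```

The total for version 66 matches the 161,922 parameters reported in its summary.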

### compare_models() above reports output widths of 256 for my first two recurrent layers and 64 for my last two, even though I defined them with 128, 128, 32, and 32 units, respectively. This is not an error: each layer is wrapped in Bidirectional, which runs a forward and a backward copy of the layer and concatenates their outputs, so the reported width is twice the number of units I specified.
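
A `Bidirectional` wrapper runs one copy of the wrapped layer forward and one backward and, with the Keras default `merge_mode='concat'`, concatenates their outputs, so both the output width and the parameter count are twice those of a single direction. The sketch below reproduces the `compare_models()` figures for version 75 from the standard LSTM and GRU parameter formulas. The GRU formula assumes `reset_after=True` (the TF2 Keras default), and the helper names are illustrative.

```python
def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and one bias vector
    return 4 * (units * (input_dim + units) + units)

def gru_params(input_dim, units):
    # 3 gates; assumes reset_after=True, which keeps two bias vectors per gate
    return 3 * (units * (input_dim + units) + 2 * units)

# (layer name, parameter function, units), in the order used in Model 5
stack = [
    ("Bidirectional(LSTM(128))", lstm_params, 128),
    ("Bidirectional(LSTM(128))", lstm_params, 128),
    ("Bidirectional(GRU(32))", gru_params, 32),
    ("Bidirectional(GRU(32))", gru_params, 32),
]

input_dim = 26  # embedding dimension feeding the first recurrent layer
param_counts, output_widths = [], []
for name, param_fn, units in stack:
    params = 2 * param_fn(input_dim, units)  # forward copy + backward copy
    width = 2 * units                        # the two outputs are concatenated
    param_counts.append(params)
    output_widths.append(width)
    print(f"{name}: output width {width}, params {params}")
    input_dim = width                        # next layer sees the concatenated output
```

The computed widths (256, 256, 64, 64) and parameter counts agree with the Model_75 columns in the comparison table.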

---



# **Question:** Fit the best model from the leader board to training data and evaluate it on test data to complete your report. 

# Complete your report by feeding your model some realistic tweets to see if it returns meaningful/useful results (these tweets can be found online or you can create them yourself).

---

### **Answer:** First, I will reconstruct the best model, training and testing it on the same data used for my models above. Then I will test it on 10 imitation tweets that I wrote, to see how accurately the model classifies them.



## Define Preprocessor

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

# Build vocabulary from training text data
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(trainingdata.tweet)

# preprocessor tokenizes words and makes sure all documents have the same length.
# Note: max_words is kept in the signature but is not used inside the function;
# the vocabulary size is fixed by Tokenizer(num_words=...) above.
def preprocessor(data, maxlen, max_words):

    sequences = tokenizer.texts_to_sequences(data)
    X = pad_sequences(sequences, maxlen=maxlen)

    return X

In [None]:
# tokenize and pad X data
X_train = preprocessor(trainingdata.tweet, maxlen=60, max_words=10000)
X_test = preprocessor(testdata.tweet, maxlen=60, max_words=10000)

# one-hot encode Y data (get_dummies orders the columns alphabetically: fake, real)
y_train = pd.get_dummies(trainingdata.label)
y_test = pd.get_dummies(testdata.label)

In [None]:
print(X_train.shape)
print(X_test.shape)

(6420, 60)
(2140, 60)


In [None]:
from tensorflow.keras.layers import GRU

with tf.device('/device:GPU:0'):
# replace this model with the architectures from the task description
  model = Sequential()
  model.add(Embedding(10000, 16, input_length=60))
  model.add(Flatten())
  model.add(Dense(2, activation='softmax'))

  mc = ModelCheckpoint('best_model.h5', monitor='val_acc', mode='max', verbose=1, save_best_only=True)
  red_lr= ReduceLROnPlateau(monitor='val_acc', patience=2, verbose=1,factor=0.05)

  model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

  history = model.fit(X_train, y_train,
                      epochs=100,
                      batch_size=30,
                      callbacks=[mc,red_lr],
                      validation_split=0.2)

Epoch 1/100

Epoch 00001: val_acc improved from -inf to 0.83255, saving model to best_model.h5
Epoch 2/100

Epoch 00002: val_acc improved from 0.83255 to 0.88629, saving model to best_model.h5
Epoch 3/100

Epoch 00003: val_acc improved from 0.88629 to 0.91745, saving model to best_model.h5
Epoch 4/100

Epoch 00004: val_acc improved from 0.91745 to 0.93069, saving model to best_model.h5
Epoch 5/100

Epoch 00005: val_acc did not improve from 0.93069
Epoch 6/100

Epoch 00006: val_acc improved from 0.93069 to 0.93146, saving model to best_model.h5
Epoch 7/100

Epoch 00007: val_acc improved from 0.93146 to 0.93692, saving model to best_model.h5
Epoch 8/100

Epoch 00008: val_acc did not improve from 0.93692
Epoch 9/100

Epoch 00009: val_acc did not improve from 0.93692

Epoch 00009: ReduceLROnPlateau reducing learning rate to 5.0000002374872565e-05.
Epoch 10/100

Epoch 00010: val_acc did not improve from 0.93692
Epoch 11/100

Epoch 00011: val_acc improved from 0.93692 to 0.93847, saving mode

### **Answer a):** Based on the training data provided for this assignment (with 20% held out for validation), it appears that the best model only achieves a validation accuracy score of 93.93%, at least when I set the batch_size equal to 30.

### **Answer b):** 

### As seen below, I have created a series of tweets (5 real and 5 fake). My results are described further below the following code: 

In [None]:
# Fake example #1

print(model.predict(preprocessor(["COVID is fake news. It's nothing more than the common flu. This is just anti-Trump propoganda from the Radical Left."], maxlen=60, max_words=10000)))

[[9.99988317e-01 1.16371575e-05]]


In [None]:
# Fake example #2

print(model.predict(preprocessor(["COVID-19 is no more deadly than the flu. Don't believe what the 'experts' are telling you. Don't wear a mask if you don't feel like it!"], maxlen=60, max_words=10000)))

[[9.991449e-01 8.550276e-04]]


In [None]:
# Fake example #3

print(model.predict(preprocessor(["Don't let the Antifa radicals convince you that COVID is dangerous. They just want to destroy our economy by making people stay home so that they can steal the election."], maxlen=60, max_words=10000)))

[[9.9931133e-01 6.8871636e-04]]


In [None]:
# Fake example #4

print(model.predict(preprocessor(["COVID-19 is China's attempt to take over the world. They have been developing biological weapons for decades to unleash on the US. They will kill their own people to do it if necessary."], maxlen=60, max_words=10000)))

[[0.9611879 0.0388121]]


In [None]:
# Fake example #5

print(model.predict(preprocessor(["COVID-19 mRNA is actually not a vaccine at all, but rather an operating system that will convert our bodies into zombies."], maxlen=60, max_words=10000)))

[[0.98063135 0.01936869]]


In [None]:
# Real example #1

print(model.predict(preprocessor(["COVID-19 was the 3rd leading cause of death in the US in 2020, with heart disease and cancer being even deadlier."], maxlen=60, max_words=10000)))

[[0.9864965  0.01350354]]


In [None]:
# Real example #2

print(model.predict(preprocessor(["There is no evidence to back up the claim that COVID-19 increases the chances of women having miscarriages."], maxlen=60, max_words=10000)))

[[0.99835336 0.00164668]]


In [None]:
# Real example #3

print(model.predict(preprocessor(["People with cancer, kidney disease, lung diseases, dementia, diabetes, or liver disease are more likely to become seriously ill from COVID-19."], maxlen=60, max_words=10000)))

[[0.454971 0.545029]]


In [None]:
# Real example #4

print(model.predict(preprocessor(["Children have been impacted less harmfully by COVID infections, on average, than adults."], maxlen=60, max_words=10000)))

[[0.51612514 0.48387483]]


In [None]:
# Real example #5

print(model.predict(preprocessor(["People who have substance abuse problems are more likely to experience severe COVID-19 symtoms, if infected, than those who do not."], maxlen=60, max_words=10000)))

[[0.04058149 0.95941854]]


## **Results:** 

### The following analysis assumes that the first output column (index **0**) corresponds to **fake** and the second (index **1**) to **real**. This is the alphabetical column order that pd.get_dummies produces for the labels, and it is the same order used earlier to build predicted_labels from y_test.columns:

### Taking the class with predicted probability greater than 0.5 as the prediction, the model correctly identifies all 5 of my fake tweets as fake. However, it correctly identifies only 2 of my 5 real tweets (real examples #3 and #5) as real. This corresponds to an overall accuracy of 70% (7 of 10), at least when tested on my "unseen" sample.
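
The tally can be reproduced from the printed probability rows. The sketch assumes the output columns follow the column order of `pd.get_dummies`, which sorts the labels alphabetically (`fake`, then `real`); this is the same order the notebook relies on when it builds `predicted_labels` from `y_test.columns`. The probabilities are copied from the cell outputs above.

```python
classes = ["fake", "real"]  # assumed column order: alphabetical, as pd.get_dummies produces

# (intended label, model output row) for the ten hand-written tweets
examples = [
    ("fake", [9.99988317e-01, 1.16371575e-05]),
    ("fake", [9.991449e-01, 8.550276e-04]),
    ("fake", [9.9931133e-01, 6.8871636e-04]),
    ("fake", [0.9611879, 0.0388121]),
    ("fake", [0.98063135, 0.01936869]),
    ("real", [0.9864965, 0.01350354]),
    ("real", [0.99835336, 0.00164668]),
    ("real", [0.454971, 0.545029]),
    ("real", [0.51612514, 0.48387483]),
    ("real", [0.04058149, 0.95941854]),
]

# Predicted class = column with the larger probability (argmax)
predicted = [classes[probs.index(max(probs))] for _, probs in examples]
n_correct = sum(pred == truth for pred, (truth, _) in zip(predicted, examples))
accuracy = n_correct / len(examples)

for (truth, probs), pred in zip(examples, predicted):
    print(f"intended={truth:4s} predicted={pred:4s} p(real)={probs[1]:.4f}")
print(f"accuracy: {n_correct}/{len(examples)} = {accuracy:.0%}")
```

Real examples #3 and #4 both sit close to 0.5, so those two predictions are low-confidence either way; the clear misses are real examples #1 and #2, which the model labels fake with high confidence.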