<a href="https://colab.research.google.com/github/weswest/MSDS422/blob/main/MSDS_422_Assignment9.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 0 Project Overview

This workbook focuses on the Disaster Tweets...

https://www.kaggle.com/hritikchaturvedi/disaster-prediction-roberta-large/notebook?scriptVersionId=87180937

# Workbook Structure

TKTKTK

## Considerations for analysis vs EDA

TKTKTK



## Overall layout

TKTKTK


# 0 Setup


## 0.1 Setup - Load Libraries

In [1]:
!pip install transformers



In [2]:

from transformers import RobertaTokenizer, TFRobertaModel, RobertaConfig 
from tensorflow.keras.layers import Input, Dropout, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam, SGD
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.initializers import TruncatedNormal
from tensorflow.keras.losses import CategoricalCrossentropy
from tensorflow.keras.metrics import CategoricalAccuracy
from tensorflow.keras.utils import to_categorical
import tensorflow as tf
from sklearn.model_selection import train_test_split

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image

import re
import string

import pathlib
import warnings
warnings.filterwarnings('ignore')

import os
import io

pd.set_option('display.max_columns', None)

In [3]:
#from kerastuner.tuners import RandomSearch

def set_seed(seed=422):
    np.random.seed(seed)
    tf.random.set_seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    os.environ['TF_DETERMINISTIC_OPS'] = '1'
set_seed()

## 0.2 Setup - Operating Environment
This code allows the Colab notebook to access my Google Drive files. 

In [4]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
import os
try:
  os.chdir("drive/My Drive/MSDS/422/NLPTweets")
except:
  pass

Mounted at /content/drive


## 0.3 Setup - Read in Data
Note: the Kaggle dataset already splits the housing data into "train" and "test" sets.  This assignment allows us to ignore the test set for now

In [5]:
train_df = pd.read_csv('Data/train.csv')
train_df.name = 'Training Set'
test_df = pd.read_csv('Data/test.csv')
test_df.name = 'Test Set'

In [6]:
dfs = [train_df, test_df]

for df in dfs:
  obs = df.shape[0]
  tot = df.shape[1]
  numeric = df.select_dtypes(include=np.number).shape[1]
  categorical = df.select_dtypes(exclude=np.number).shape[1]
  print('In {} we have {} observations, {} variables: {} numeric and {} categorical'.format(df.name, obs, tot, numeric, categorical))



In Training Set we have 7613 observations, 5 variables: 2 numeric and 3 categorical
In Test Set we have 3263 observations, 4 variables: 1 numeric and 3 categorical


In [7]:
train_df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


## 0.4 Set up functions for later reference



# 1 EDA

In [8]:
train_df.isnull().sum()

id             0
keyword       61
location    2533
text           0
target         0
dtype: int64

### With that many null values in location, is our DV balanced?

Answer: looks like the distribution is pretty consistent, with or without location

In [9]:
train_df[train_df.location.notnull()].target.value_counts()

0    2884
1    2196
Name: target, dtype: int64

In [10]:
train_df[train_df.location.isnull()].target.value_counts()


0    1458
1    1075
Name: target, dtype: int64

### Where are the locations?

Answer: Some strong concentrations, but it descends into noise pretty fast.

Good justification for deleting location entirely

In [11]:
train_df[train_df.location.notnull()].location.value_counts()

USA                    104
New York                71
United States           50
London                  45
Canada                  29
                      ... 
MontrÌ©al, QuÌ©bec       1
Montreal                 1
ÌÏT: 6.4682,3.18287      1
Live4Heed??              1
Lincoln                  1
Name: location, Length: 3341, dtype: int64

# 2. Prepare the Data

## 2.1 Drop Location

In [12]:
train_df.drop("location", axis = 1, inplace = True)
test_df.drop("location", axis = 1, inplace = True)

## 2.2 Clean up the tweets

In [13]:
# Remove urls

def remove_URL(text):
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'',text)

train_df['text'] = train_df['text'].apply(lambda x : remove_URL(x))
test_df['text'] = test_df['text'].apply(lambda x : remove_URL(x))

In [14]:
# Remove html

example = """<div>
<h1>Real or Fake</h1>
<p>Kaggle </p>
<a href="https://www.kaggle.com/c/nlp-getting-started">getting started</a>
</div>"""

def remove_html(text):
    html=re.compile(r'<.*?>')
    return html.sub(r'',text)
    
print(remove_html(example))

train_df['text'] = train_df['text'].apply(lambda x : remove_html(x))
test_df['text'] = test_df['text'].apply(lambda x : remove_html(x))


Real or Fake
Kaggle 
getting started



In [15]:
# Remove emoji

# Reference : https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b
def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

remove_emoji("Omg another Earthquake 😔😔")

train_df['text'] = train_df['text'].apply(lambda x: remove_emoji(x))
test_df['text'] = test_df['text'].apply(lambda x: remove_emoji(x))

In [16]:
# Get rid of other issues

def cleaner(tweet):
  # Acronyms and miswritten words
  tweet = re.sub(r"Typhoon-Devastated", "typhoon devastated", tweet)
  tweet = re.sub(r"TyphoonDevastated", "typhoon devastated", tweet)
  tweet = re.sub(r"typhoondevastated", "typhoon devastated", tweet)
  tweet = re.sub(r"MH370", "Malaysia Airlines Flight", tweet)
  tweet = re.sub(r"MH", "Malaysia Airlines Flight", tweet)
  tweet = re.sub(r"mh370", "Malaysia Airlines Flight", tweet)
  tweet = re.sub(r"year-old", "years old", tweet)
  tweet = re.sub(r"yearold", "years old", tweet)
  tweet = re.sub(r"yr old", "years old", tweet)
  tweet = re.sub(r"PKK", "Kurdistan Workers Party", tweet)
  tweet = re.sub(r"MP", "madhya pradesh", tweet)
  tweet = re.sub(r"rly", "railway", tweet)
  tweet = re.sub(r"CDT", "Central Daylight Time", tweet)
  tweet = re.sub(r"sensorsenso", "sensor senso", tweet)
  tweet = re.sub(r"pm", "", tweet)
  tweet = re.sub(r"PM", "", tweet)
  tweet = re.sub(r"nan", " ", tweet)
  tweet = re.sub(r"terrorismturn", "terrorism turn", tweet)
  tweet = re.sub(r"epicente", "epicenter", tweet)
  tweet = re.sub(r"epicenterr", "epicenter", tweet)
  tweet = re.sub(r"WAwildfire", "Washington Wildfire", tweet)
  tweet = re.sub(r"prebreak", "pre break", tweet)
  tweet = re.sub(r"nowplaying", "now playing", tweet)
  tweet = re.sub(r"RT", "retweet", tweet)
  tweet = re.sub(r"EbolaOutbreak", "Ebola Outbreak", tweet)
  tweet = re.sub(r"LondonFire", "London Fire", tweet)
  tweet = re.sub(r"IDFire", "Idaho Fire", tweet)
  tweet = re.sub(r"withBioterrorism&use", "with Bioterrorism & use", tweet)
  tweet = re.sub(r"NASAHurricane", "NASA Hurricane", tweet)
  tweet = re.sub(r"withweapons", "with weapons", tweet)
  tweet = re.sub(r"NuclearPower", "Nuclear Power", tweet)
  tweet = re.sub(r"WhiteTerrorism", "White Terrorism", tweet)
  tweet = re.sub(r"MyanmarFlood", "Myanmar Flood", tweet)
  tweet = re.sub(r"ExtremeWeather", "Extreme Weather", tweet)

  # Special characters
  tweet = re.sub(r"%20", " ", tweet)
  tweet = re.sub(r"%", " ", tweet)
  tweet = re.sub(r"@", " ", tweet)
  tweet = re.sub(r"#", " ", tweet)
  tweet = re.sub(r"'", " ", tweet)
  tweet = re.sub(r"\x89û_", " ", tweet)
  tweet = re.sub(r"\x89ûò", " ", tweet)
  tweet = re.sub(r"16yr", "16 year", tweet)
  tweet = re.sub(r"re\x89û_", " ", tweet)
  tweet = re.sub(r"\x89û", " ", tweet)
  tweet = re.sub(r"\x89Û", " ", tweet)
  tweet = re.sub(r"re\x89Û", "re ", tweet)
  tweet = re.sub(r"re\x89û", "re ", tweet)
  tweet = re.sub(r"\x89ûª", "'", tweet)
  tweet = re.sub(r"\x89û", " ", tweet)
  tweet = re.sub(r"\x89ûò", " ", tweet)
  tweet = re.sub(r"\x89Û_", "", tweet)
  tweet = re.sub(r"\x89ÛÒ", "", tweet)
  tweet = re.sub(r"\x89ÛÓ", "", tweet)
  tweet = re.sub(r"\x89ÛÏWhen", "When", tweet)
  tweet = re.sub(r"\x89ÛÏ", "", tweet)
  tweet = re.sub(r"China\x89Ûªs", "China's", tweet)
  tweet = re.sub(r"let\x89Ûªs", "let's", tweet)
  tweet = re.sub(r"\x89Û÷", "", tweet)
  tweet = re.sub(r"\x89Ûª", "", tweet)
  tweet = re.sub(r"\x89Û\x9d", "", tweet)
  tweet = re.sub(r"å_", "", tweet)
  tweet = re.sub(r"\x89Û¢", "", tweet)
  tweet = re.sub(r"\x89Û¢åÊ", "", tweet)
  tweet = re.sub(r"fromåÊwounds", "from wounds", tweet)
  tweet = re.sub(r"åÊ", "", tweet)
  tweet = re.sub(r"åÈ", "", tweet)
  tweet = re.sub(r"JapÌ_n", "Japan", tweet)    
  tweet = re.sub(r"Ì©", "e", tweet)
  tweet = re.sub(r"å¨", "", tweet)
  tweet = re.sub(r"SuruÌ¤", "Suruc", tweet)
  tweet = re.sub(r"åÇ", "", tweet)
  tweet = re.sub(r"å£3million", "3 million", tweet)
  tweet = re.sub(r"åÀ", "", tweet)

  # Contractions
  tweet = re.sub(r"he's", "he is", tweet)
  tweet = re.sub(r"there's", "there is", tweet)
  tweet = re.sub(r"We're", "We are", tweet)
  tweet = re.sub(r"That's", "That is", tweet)
  tweet = re.sub(r"won't", "will not", tweet)
  tweet = re.sub(r"they're", "they are", tweet)
  tweet = re.sub(r"Can't", "Cannot", tweet)
  tweet = re.sub(r"wasn't", "was not", tweet)
  tweet = re.sub(r"don\x89Ûªt", "do not", tweet)
  tweet = re.sub(r"aren't", "are not", tweet)
  tweet = re.sub(r"isn't", "is not", tweet)
  tweet = re.sub(r"What's", "What is", tweet)
  tweet = re.sub(r"haven't", "have not", tweet)
  tweet = re.sub(r"hasn't", "has not", tweet)
  tweet = re.sub(r"There's", "There is", tweet)
  tweet = re.sub(r"He's", "He is", tweet)
  tweet = re.sub(r"It's", "It is", tweet)
  tweet = re.sub(r"You're", "You are", tweet)
  tweet = re.sub(r"I'M", "I am", tweet)
  tweet = re.sub(r"Im", "I am", tweet)
  tweet = re.sub(r"shouldn't", "should not", tweet)
  tweet = re.sub(r"wouldn't", "would not", tweet)
  tweet = re.sub(r"i'm", "I am", tweet)
  tweet = re.sub(r"I\x89Ûªm", "I am", tweet)
  tweet = re.sub(r"I'm", "I am", tweet)
  tweet = re.sub(r"Isn't", "is not", tweet)
  tweet = re.sub(r"Here's", "Here is", tweet)
  tweet = re.sub(r"you've", "you have", tweet)
  tweet = re.sub(r"you\x89Ûªve", "you have", tweet)
  tweet = re.sub(r"we're", "we are", tweet)
  tweet = re.sub(r"what's", "what is", tweet)
  tweet = re.sub(r"couldn't", "could not", tweet)
  tweet = re.sub(r"we've", "we have", tweet)
  tweet = re.sub(r"it\x89Ûªs", "it is", tweet)
  tweet = re.sub(r"doesn\x89Ûªt", "does not", tweet)
  tweet = re.sub(r"It\x89Ûªs", "It is", tweet)
  tweet = re.sub(r"Here\x89Ûªs", "Here is", tweet)
  tweet = re.sub(r"who's", "who is", tweet)
  tweet = re.sub(r"I\x89Ûªve", "I have", tweet)
  tweet = re.sub(r"y'all", "you all", tweet)
  tweet = re.sub(r"can\x89Ûªt", "cannot", tweet)
  tweet = re.sub(r"would've", "would have", tweet)
  tweet = re.sub(r"it'll", "it will", tweet)
  tweet = re.sub(r"we'll", "we will", tweet)
  tweet = re.sub(r"wouldn\x89Ûªt", "would not", tweet)
  tweet = re.sub(r"We've", "We have", tweet)
  tweet = re.sub(r"he'll", "he will", tweet)
  tweet = re.sub(r"Y'all", "You all", tweet)
  tweet = re.sub(r"Weren't", "Were not", tweet)
  tweet = re.sub(r"Didn't", "Did not", tweet)
  tweet = re.sub(r"they'll", "they will", tweet)
  tweet = re.sub(r"they'd", "they would", tweet)
  tweet = re.sub(r"DON'T", "DO NOT", tweet)
  tweet = re.sub(r"That\x89Ûªs", "That is", tweet)
  tweet = re.sub(r"they've", "they have", tweet)
  tweet = re.sub(r"i'd", "I would", tweet)
  tweet = re.sub(r"should've", "should have", tweet)
  tweet = re.sub(r"You\x89Ûªre", "You are", tweet)
  tweet = re.sub(r"where's", "where is", tweet)
  tweet = re.sub(r"Don\x89Ûªt", "Do not", tweet)
  tweet = re.sub(r"we'd", "we would", tweet)
  tweet = re.sub(r"i'll", "I will", tweet)
  tweet = re.sub(r"weren't", "were not", tweet)
  tweet = re.sub(r"They're", "They are", tweet)
  tweet = re.sub(r"Can\x89Ûªt", "Cannot", tweet)
  tweet = re.sub(r"you\x89Ûªll", "you will", tweet)
  tweet = re.sub(r"I\x89Ûªd", "I would", tweet)
  tweet = re.sub(r"let's", "let us", tweet)
  tweet = re.sub(r"it's", "it is", tweet)
  tweet = re.sub(r"can't", "can not", tweet)
  tweet = re.sub(r"cant", "can not", tweet)
  tweet = re.sub(r"don't", "do not", tweet)
  tweet = re.sub(r"dont", "do not", tweet)
  tweet = re.sub(r"you're", "you are", tweet)
  tweet = re.sub(r"i've", "I have", tweet)
  tweet = re.sub(r"that's", "that is", tweet)
  tweet = re.sub(r"i'll", "I will", tweet)
  tweet = re.sub(r"doesn't", "does not", tweet)
  tweet = re.sub(r"i'd", "I would", tweet)
  tweet = re.sub(r"didn't", "did not", tweet)
  tweet = re.sub(r"ain't", "am not", tweet)
  tweet = re.sub(r"you'll", "you will", tweet)
  tweet = re.sub(r"I've", "I have", tweet)
  tweet = re.sub(r"Don't", "do not", tweet)
  tweet = re.sub(r"I'll", "I will", tweet)
  tweet = re.sub(r"I'd", "I would", tweet)
  tweet = re.sub(r"Let's", "Let us", tweet)
  tweet = re.sub(r"you'd", "You would", tweet)
  tweet = re.sub(r"It's", "It is", tweet)
  tweet = re.sub(r"Ain't", "am not", tweet)
  tweet = re.sub(r"Haven't", "Have not", tweet)
  tweet = re.sub(r"Could've", "Could have", tweet)
  tweet = re.sub(r"youve", "you have", tweet)  
  tweet = re.sub(r"donå«t", "do not", tweet)

  return tweet
train_df['text'] = train_df['text'].apply(lambda s : cleaner(s))
test_df['text'] = test_df['text'].apply(lambda s : cleaner(s))

In [17]:
# Remove punctuation

def remove_punct(text):
    table=str.maketrans('','',string.punctuation)
    return text.translate(table)

example="I am a #king"
print(remove_punct(example))

train_df['text']= train_df['text'].apply(lambda x : remove_punct(x))
test_df['text']= test_df['text'].apply(lambda x : remove_punct(x))

I am a king


In [18]:
# Remove multiple spaces

train_df['text'] = train_df['text'].str.replace('   ', ' ')
train_df['text'] = train_df['text'].str.replace('     ', ' ')
train_df['text'] = train_df['text'].str.replace('\xa0 \xa0 \xa0', ' ')
train_df['text'] = train_df['text'].str.replace('  ', ' ')
train_df['text'] = train_df['text'].str.replace('—', ' ')
train_df['text'] = train_df['text'].str.replace('–', ' ')

test_df['text'] = test_df['text'].str.replace('   ', ' ')
test_df['text'] = test_df['text'].str.replace('     ', ' ')
test_df['text'] = test_df['text'].str.replace('\xa0 \xa0 \xa0', ' ')
test_df['text'] = test_df['text'].str.replace('  ', ' ')
test_df['text'] = test_df['text'].str.replace('—', ' ')
test_df['text'] = test_df['text'].str.replace('–', ' ')

## 2.3 Prepare the data for modeling

In [19]:
# Select required columns
data = train_df[['text', 'target']]

# Set your model output as categorical and save in new label col
data['target_label'] = pd.Categorical(train_df['target'])

# Transform your output to numeric
data['target'] = data['target_label'].cat.codes

# 3. Build Models

## 3.1 Load Model Transformer (RoBERTa)

In [20]:
### --------- Setup Roberta ---------- ###

model_name = 'roberta-base'

# Max length of tokens
max_length = 45

# Load transformers config and set output_hidden_states to False
config = RobertaConfig.from_pretrained(model_name)
config.output_hidden_states = False

# Load Roberta tokenizer
tokenizer = RobertaTokenizer.from_pretrained(pretrained_model_name_or_path = model_name, config = config)

# Load the Roberta model
transformer_roberta_model = TFRobertaModel.from_pretrained(model_name, config = config)

Some layers from the model checkpoint at roberta-base were not used when initializing TFRobertaModel: ['lm_head']
- This IS expected if you are initializing TFRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFRobertaModel were initialized from the model checkpoint at roberta-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaModel for predictions without further training.


## 3.2 Add more layers

In [21]:
### ------- Build the model ------- ###

# Load the MainLayer
roberta = transformer_roberta_model.layers[0]

# Build your model input
input_ids = Input(shape=(max_length,), name='input_ids', dtype='int32')
inputs = {'input_ids': input_ids}

# Load the Transformers RoBERTa model as a layer in a Keras model
roberta_model = roberta(inputs)[1]
dropout = Dropout(config.hidden_dropout_prob, name='pooled_output')
pooled_output = dropout(roberta_model, training=False)

dropout_1 = Dropout(config.hidden_dropout_prob, name='pooled_output_1')
pooled_output_1 = dropout(pooled_output, training=False)

# Then build your model output
targets = Dense(units=len(data.target_label.value_counts()), kernel_initializer=TruncatedNormal(stddev=config.initializer_range), name='target')(pooled_output_1)
outputs = {'target': targets}

# And combine it all in a model object
# Note: building 3 models since there will be three iterations of testing
model1 = Model(inputs=inputs, outputs=outputs, name='RoBERTa_Binary_Classifier1')
model2 = Model(inputs=inputs, outputs=outputs, name='RoBERTa_Binary_Classifier2')
model3 = Model(inputs=inputs, outputs=outputs, name='RoBERTa_Binary_Classifier3')



In [22]:
# Take a look at the model
model1.summary()

Model: "RoBERTa_Binary_Classifier1"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_ids (InputLayer)         [(None, 45)]         0           []                               
                                                                                                  
 roberta (TFRobertaMainLayer)   TFBaseModelOutputWi  124645632   ['input_ids[0][0]']              
                                thPoolingAndCrossAt                                               
                                tentions(last_hidde                                               
                                n_state=(None, 45,                                                
                                768),                                                             
                                 pooler_output=(Non                      

In [23]:
# Take a look at the model
model2.summary()

Model: "RoBERTa_Binary_Classifier2"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_ids (InputLayer)         [(None, 45)]         0           []                               
                                                                                                  
 roberta (TFRobertaMainLayer)   TFBaseModelOutputWi  124645632   ['input_ids[0][0]']              
                                thPoolingAndCrossAt                                               
                                tentions(last_hidde                                               
                                n_state=(None, 45,                                                
                                768),                                                             
                                 pooler_output=(Non                      

In [24]:
# Take a look at the model
model3.summary()

Model: "RoBERTa_Binary_Classifier3"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_ids (InputLayer)         [(None, 45)]         0           []                               
                                                                                                  
 roberta (TFRobertaMainLayer)   TFBaseModelOutputWi  124645632   ['input_ids[0][0]']              
                                thPoolingAndCrossAt                                               
                                tentions(last_hidde                                               
                                n_state=(None, 45,                                                
                                768),                                                             
                                 pooler_output=(Non                      

## 3.3 Train the model

### 3.3.1 Train Model 1 - Base Model

In [25]:
### ------- Train the model ------- ###

optimizer = Adam(learning_rate=6e-05,epsilon=1e-08,decay=0.01,clipnorm=1.0)
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint("Wk9Model1.h5", save_best_only=True)
early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)


# Set loss and metrics
loss = {'target': CategoricalCrossentropy(from_logits = True)}

# Compile the model
model1.compile(optimizer = optimizer, loss = loss, metrics = ['accuracy'])

# Ready output data for the model
y_target = to_categorical(data['target'])

# Tokenize the input (takes some time)
x_train = tokenizer(
            text=data['text'].to_list(),
            add_special_tokens=True,
            max_length=max_length,
            truncation=True,
            padding=True, 
            return_tensors='tf',
            return_token_type_ids = False,
            return_attention_mask = True,
            verbose = True)

# Fit the model
history1 = model1.fit(
    x={'input_ids': x_train['input_ids']},
    y={'target': y_target},
    validation_split=0.25,
    batch_size=64,
    epochs=50,
    callbacks=[checkpoint_cb, early_stopping_cb],
    verbose=1)

model1.save('Wk9Model1')   # Commented out so that a run doesn't overwrite the model


Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50




INFO:tensorflow:Assets written to: Wk9Model1/assets


INFO:tensorflow:Assets written to: Wk9Model1/assets


In [26]:
x_test = tokenizer(
          text=test_df['text'].to_list(),
          add_special_tokens=True,
          max_length=max_length,
          truncation=True,
          padding=True, 
          return_tensors='tf',
          return_token_type_ids = False,
          return_attention_mask = True,
          verbose = True)

label_predicted1 = model1.predict(x={'input_ids': x_test['input_ids']},)
label_pred_max1=[np.argmax(i) for i in label_predicted1['target']]
output1 = pd.DataFrame({'id': test_df.id, 'target': label_pred_max1})
output1.to_csv('Wk9Model1Take2.csv', index=False)

### 3.3.2 Train Model 2 - Decayed learning rate

In [27]:
### ------- Train the model ------- ###

optimizer = Adam(learning_rate=6e-05,epsilon=1e-08,decay=0.01,clipnorm=1.0)
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint("Wk9Model2.h5", save_best_only=True)
early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
lr_reduction_cb = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', patience = 3, verbose = 1, factor = 0.3)


# Set loss and metrics
loss = {'target': CategoricalCrossentropy(from_logits = True)}

# Compile the model
model2.compile(optimizer = optimizer, loss = loss, metrics = ['accuracy'])

# Ready output data for the model
y_target = to_categorical(data['target'])

# Tokenize the input (takes some time)
x_train = tokenizer(
            text=data['text'].to_list(),
            add_special_tokens=True,
            max_length=max_length,
            truncation=True,
            padding=True, 
            return_tensors='tf',
            return_token_type_ids = False,
            return_attention_mask = True,
            verbose = True)

# Fit the model
history2 = model2.fit(
    x={'input_ids': x_train['input_ids']},
    y={'target': y_target},
    validation_split=0.25,
    batch_size=64,
    epochs=50,
    callbacks=[checkpoint_cb, early_stopping_cb, lr_reduction_cb],
    verbose=1)

model2.save('Wk9Model2')   # Commented out so that a run doesn't overwrite the model


Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 5: ReduceLROnPlateau reducing learning rate to 1.7999999545281754e-05.
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 8: ReduceLROnPlateau reducing learning rate to 5.399999645305797e-06.
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 11: ReduceLROnPlateau reducing learning rate to 1.6199999208765803e-06.
Epoch 12/50




INFO:tensorflow:Assets written to: Wk9Model2/assets


INFO:tensorflow:Assets written to: Wk9Model2/assets


In [28]:
x_test = tokenizer(
          text=test_df['text'].to_list(),
          add_special_tokens=True,
          max_length=max_length,
          truncation=True,
          padding=True, 
          return_tensors='tf',
          return_token_type_ids = False,
          return_attention_mask = True,
          verbose = True)

label_predicted2 = model2.predict(x={'input_ids': x_test['input_ids']},)
label_pred_max2=[np.argmax(i) for i in label_predicted2['target']]
output2 = pd.DataFrame({'id': test_df.id, 'target': label_pred_max1})
output2.to_csv('Wk9Model2Take2.csv', index=False)

### 3.3.3 Train Model 3 - Different Optimizer

In [29]:
### ------- Train the model ------- ###

optimizer = SGD(learning_rate=6e-05, momentum = 0, clipnorm=1.0)
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint("Wk9Model3.h5", save_best_only=True)
early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)


# Set loss and metrics
loss = {'target': CategoricalCrossentropy(from_logits = True)}

# Compile the model
model3.compile(optimizer = optimizer, loss = loss, metrics = ['accuracy'])

# Ready output data for the model
y_target = to_categorical(data['target'])

# Tokenize the input (takes some time)
x_train = tokenizer(
            text=data['text'].to_list(),
            add_special_tokens=True,
            max_length=max_length,
            truncation=True,
            padding=True, 
            return_tensors='tf',
            return_token_type_ids = False,
            return_attention_mask = True,
            verbose = True)

# Fit the model
history3 = model3.fit(
    x={'input_ids': x_train['input_ids']},
    y={'target': y_target},
    validation_split=0.25,
    batch_size=64,
    epochs=50,
    callbacks=[checkpoint_cb, early_stopping_cb],
    verbose=1)

model3.save('Wk9Model3')   # Commented out so that a run doesn't overwrite the model


Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50




INFO:tensorflow:Assets written to: Wk9Model3/assets


INFO:tensorflow:Assets written to: Wk9Model3/assets


In [30]:
x_test = tokenizer(
          text=test_df['text'].to_list(),
          add_special_tokens=True,
          max_length=max_length,
          truncation=True,
          padding=True, 
          return_tensors='tf',
          return_token_type_ids = False,
          return_attention_mask = True,
          verbose = True)

label_predicted3 = model3.predict(x={'input_ids': x_test['input_ids']},)
label_pred_max3=[np.argmax(i) for i in label_predicted3['target']]
output3 = pd.DataFrame({'id': test_df.id, 'target': label_pred_max1})
output3.to_csv('Wk9Mode31Take2.csv', index=False)

# 4. Predict