#BERT
What is done in this notebook?
- A BERT model is loaded from TensorFlow Hub
- Twitter data is loaded from the NLTK Twitter Corpus
- BERT is combined with a classifier for sentiment analysis
- The model is trained and BERT is fine-tuned
- The model is saved and used to classify tweets.

Main source: https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/text/classify_text_with_bert.ipynb#scrollTo=6IwI_2bcIeX8

In addition to https://colab.research.google.com/github/tensorflow/tpu/blob/master/tools/colab/bert_finetuning_with_cloud_tpus.ipynb#scrollTo=7wzwke0sxS6W
this Colab is largely inspired and guided by https://pypi.org/project/bert-for-tf2/ for simplifying imports and similar setup.

# Imports


In [None]:
!pip install -q tensorflow-text
!pip install -q tf-models-official

import os
import shutil

import random
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
from official.nlp import optimization
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

tf.get_logger().setLevel('ERROR')



#Sentiment140
Imports the dataset and prepares it for use in the model.

---
Uploads a zip file containing the Sentiment140 dataset to Google Colab.




In [None]:
from google.colab import files
uploaded = files.upload()

Saving trainingdata.zip to trainingdata.zip


In [None]:
!unzip trainingdata.zip

Archive:  trainingdata.zip
  inflating: training.1600000.processed.noemoticon.csv  


Reads the dataset. The dataset uses the labels 0 and 4, where 0 represents negative and 4 represents positive sentiment. Label 4 is converted to 1 so that the labels are binary.


In [None]:
import io
import pandas as pd

df = pd.read_csv('training.1600000.processed.noemoticon.csv',encoding='latin-1',usecols=[0,5],names=['sentiment','text','text2','text3','text4','tweets'])

df.head()

df['sentiment'] = df['sentiment'].replace(4,1)
tweetslist = df.values.tolist()
random.shuffle(tweetslist)

Unnamed: 0,sentiment,tweets
0,0,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,is upset that he can't update his Facebook by ...
2,0,@Kenichan I dived many times for the ball. Man...
3,0,my whole body feels itchy and like its on fire
4,0,"@nationwideclass no, it's not behaving at all...."
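
The label conversion above can be checked on a small example. This is a minimal sketch on a hypothetical toy DataFrame (the rows are invented for illustration and are not taken from Sentiment140); it only shows that `replace(4, 1)` leaves the negative label untouched and maps the positive label to 1.

```python
import pandas as pd

# Hypothetical toy frame with Sentiment140-style labels (0 = negative, 4 = positive).
toy = pd.DataFrame({
    "sentiment": [0, 4, 4, 0],
    "tweets": ["bad day", "great news", "love it", "so tired"],
})

# Map the positive label 4 to 1 so that the labels become binary {0, 1}.
toy["sentiment"] = toy["sentiment"].replace(4, 1)

labels = toy["sentiment"].tolist()
print(labels)
```

Binary {0, 1} labels are what a single-unit classifier trained with binary cross-entropy expects.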


#Building the model
These are the methods used for building the model.

---
Method for building the model itself and its layers.

In [None]:
def build_classifier_model(batch):

  #Loads the pre-trained BERT-model and the corresponding preprocessor
  tfhub_handle_encoder = hub.load('https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3')
  tfhub_handle_preprocess = hub.load("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/1")

  #Builds the layers of the BERT-model
  text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='inputs',batch_size=batch)
  preprocessing_layer = hub.KerasLayer(tfhub_handle_preprocess, name='preprocessing')
  encoder_inputs = preprocessing_layer(text_input)
  encoder = hub.KerasLayer(tfhub_handle_encoder, trainable=True, name='BERT_encoder')
  outputs = encoder(encoder_inputs)
  net = outputs['pooled_output']
  net = tf.keras.layers.Dropout(0.1)(net)
  net = tf.keras.layers.Dense(1, activation='gelu', name='classifier')(net) 
  
  return tf.keras.Model(inputs=text_input, outputs=net)

Defines the loss function, which in this case is binary cross-entropy.

In [None]:
def lossfunction():
  return tf.keras.losses.BinaryCrossentropy(from_logits=True)
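
With `from_logits=True`, the loss treats the model output as an unnormalized score (a logit) and applies the sigmoid internally. The following NumPy sketch illustrates that computation under the standard definition of binary cross-entropy; the helper names `sigmoid` and `bce_from_logits` are hypothetical and are not part of Keras.

```python
import numpy as np

def sigmoid(z):
    """Map logits to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def bce_from_logits(y_true, logits):
    """Mean binary cross-entropy computed directly from logits.

    Uses the numerically stable form max(z, 0) - z*y + log(1 + exp(-|z|)).
    """
    z = np.asarray(logits, dtype=np.float64)
    y = np.asarray(y_true, dtype=np.float64)
    per_example = np.maximum(z, 0.0) - z * y + np.log1p(np.exp(-np.abs(z)))
    return per_example.mean()

logits = np.array([2.0, -1.0, 0.0])
labels = np.array([1.0, 0.0, 1.0])

loss = bce_from_logits(labels, logits)

# The same value via the textbook formula on sigmoid probabilities.
probs = sigmoid(logits)
naive = -(labels * np.log(probs) + (1 - labels) * np.log(1 - probs)).mean()

# A probability above 0.5 corresponds to a logit above 0.
predicted = (probs > 0.5).astype(int)
print(loss, naive, predicted)
```

The sketch also shows why class predictions should be taken from the sigmoid of the output: a probability threshold of 0.5 is equivalent to a logit threshold of 0.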


Defines epochs and optimizer of the model.

In [None]:
def define_epochs(trainingdata,e):
  epochs = e
  steps_per_epoch = tf.data.experimental.cardinality(trainingdata).numpy()
  num_train_steps = steps_per_epoch * epochs
  num_warmup_steps = int(0.1*num_train_steps)

  init_lr = 3e-5
  optimizer = optimization.create_optimizer(init_lr=init_lr,
                                            num_train_steps=num_train_steps,
                                            num_warmup_steps=num_warmup_steps,
                                            optimizer_type='adamw')
  return epochs, optimizer
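
To make the step arithmetic concrete, the sketch below recomputes the step counts in plain Python and evaluates a learning-rate schedule at a few steps. It assumes linear warmup followed by linear decay to zero, which is how I understand `create_optimizer` to behave; treat the schedule shape as an assumption. The helper names and the figure of about 57,600 training examples per fold (5% of 1.6 million tweets, 90% of that kept after the validation split, 4/5 of that used for training in each of 5 folds) are illustrative.

```python
def steps_and_warmup(num_examples, batch_size, epochs, warmup_fraction=0.1):
    """Return (steps_per_epoch, num_train_steps, num_warmup_steps)."""
    # A batched dataset without drop_remainder has ceil(n / batch_size) batches.
    steps_per_epoch = -(-num_examples // batch_size)
    num_train_steps = steps_per_epoch * epochs
    num_warmup_steps = int(warmup_fraction * num_train_steps)
    return steps_per_epoch, num_train_steps, num_warmup_steps

def learning_rate(step, init_lr, num_train_steps, num_warmup_steps):
    """Assumed schedule: linear warmup to init_lr, then linear decay to 0."""
    if step < num_warmup_steps:
        return init_lr * step / num_warmup_steps
    return init_lr * max(0.0, 1.0 - step / num_train_steps)

# Illustrative numbers: about 57,600 examples per fold, batch size 32, 3 epochs.
steps_per_epoch, total_steps, warmup_steps = steps_and_warmup(57_600, 32, 3)
print(steps_per_epoch, total_steps, warmup_steps)

for step in (0, 270, 540, 2700, 5400):
    print(step, learning_rate(step, 3e-5, total_steps, warmup_steps))
```

Under these assumptions the first 10% of the steps ramp the learning rate up to `init_lr`, and it then falls linearly until training ends.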

Final compilation of the model.

In [None]:
def compile_model(optimizer, loss):
  classifier_model.compile(optimizer=optimizer,
                          loss=loss)

#Execution of the cross-validation and corresponding fine-tuning
This is the main body of code for the model.
First, the final operations on the data are performed and the validation data is split from the rest of the data.
Second, the cross-validation is run, which includes fine-tuning and evaluating each of the *K* models.
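
The stratified K-fold split described above can be illustrated on toy data. This is a minimal sketch with hypothetical inputs (`X_toy`, `y_toy`) and a fixed `random_state` added only for reproducibility; it shows that each test fold keeps the class ratio of the full set.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data: 20 examples, 10 per class.
X_toy = np.array([f"tweet {i}" for i in range(20)])
y_toy = np.array([0, 1] * 10)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_summaries = []
for fold, (train_idx, test_idx) in enumerate(skf.split(X_toy, y_toy), start=1):
    # Share of positive labels in this fold's test set.
    pos_share = float(y_toy[test_idx].mean())
    fold_summaries.append((len(train_idx), len(test_idx), pos_share))
    print(fold, len(train_idx), len(test_idx), pos_share)
```

Each example lands in exactly one test fold, so every model is evaluated on data it was not trained on.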


In [None]:
#Defines some basic parameters for the model. Number of epochs, Batch size of the input data and the K-value of the cross-validation
NUM_OF_EPOCHS = 3
K = 5
BATCH_SIZE = 32

#Labelling for the evaluation. Not necessary for functionality.
sentiment = ['Negative', 'Positive']

#Splits data into X - examples, and y - labels. Also limits the data used to 5% of the original dataset size.
X = np.array([x[1] for x in tweetslist])
y = np.array([x[0] for x in tweetslist])
X, Xg, y, yg = train_test_split(X,y,train_size=0.05, test_size=0.95, stratify=y)

#Splits the validation data from the main data and converts it into a Tensorflow dataset.
X, Xval, y, yval = train_test_split(X,y,train_size=0.9,test_size=0.1,stratify=y)

valdata = tf.data.Dataset.from_tensor_slices((Xval,yval))
valdata = valdata.batch(BATCH_SIZE).prefetch(tf.data.experimental.AUTOTUNE)


#Main cross-validation loop that is run K times
skf = StratifiedKFold(shuffle=True,n_splits=K)
n = 0
for train, test in skf.split(X, y):

  n += 1
  print(n)
  classifier_model = None

  #Uses the provided indices to build the train and test sets and converts them into TensorFlow datasets
  traindata = tf.data.Dataset.from_tensor_slices((X[train], y[train]))
  testdata = tf.data.Dataset.from_tensor_slices((X[test], y[test]))
 
  traindata = traindata.batch(BATCH_SIZE).prefetch(tf.data.experimental.AUTOTUNE)
  testdata = testdata.batch(BATCH_SIZE).prefetch(tf.data.experimental.AUTOTUNE)


  #Builds model
  classifier_model = build_classifier_model(BATCH_SIZE)

  #Defines the rest of the necessary components: loss, epochs and optimizer
  loss = lossfunction()
  
  epochs, optimizer = define_epochs(traindata,NUM_OF_EPOCHS)

  compile_model(optimizer, loss)

  #Displays the final model build including its layers. Can be viewed in the output below.
  print(classifier_model.summary())
  

  #Fine-tunes the model
  classifier_model.fit(x=traindata, epochs=epochs, validation_data=valdata)

  #Evaluation
  #The model makes its predictions on the test data.
  predictions = classifier_model.predict(testdata)

  #The model outputs logits (the loss uses from_logits=True), so a sigmoid maps them to probabilities. Probabilities above 0.5 are class 1, the rest are class 0.
  predictions = (tf.sigmoid(predictions).numpy() > 0.5).astype(int).ravel()

  #Final evaluation
  print(classification_report(y[test], predictions,target_names= sentiment))

  #Saves the model
  classifier_model.save('BERT' + str(n))

1
Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
inputs (InputLayer)             [(32,)]              0                                            
__________________________________________________________________________________________________
preprocessing (KerasLayer)      {'input_word_ids': ( 0           inputs[0][0]                     
__________________________________________________________________________________________________
BERT_encoder (KerasLayer)       {'encoder_outputs':  109482241   preprocessing[0][0]              
                                                                 preprocessing[0][1]              
                                                                 preprocessing[0][2]              
____________________________________________________________________________________________




The code below this point concerns saving and loading an already fine-tuned model and should not affect the results of the thesis.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!zip -r '/content/drive/MyDrive/BERT1.zip' '/content/BERT1'
!zip -r '/content/drive/MyDrive/BERT2.zip' '/content/BERT2'
!zip -r '/content/drive/MyDrive/BERT3.zip' '/content/BERT3'
!zip -r '/content/drive/MyDrive/BERT4.zip' '/content/BERT4'
!zip -r '/content/drive/MyDrive/BERT5.zip' '/content/BERT5'

  adding: content/BERT1/ (stored 0%)
  adding: content/BERT1/assets/ (stored 0%)
  adding: content/BERT1/assets/vocab.txt (deflated 53%)
  adding: content/BERT1/saved_model.pb (deflated 93%)
  adding: content/BERT1/variables/ (stored 0%)
  adding: content/BERT1/variables/variables.index (deflated 82%)
  adding: content/BERT1/variables/variables.data-00000-of-00001 (deflated 14%)
  adding: content/BERT2/ (stored 0%)
  adding: content/BERT2/assets/ (stored 0%)
  adding: content/BERT2/assets/vocab.txt (deflated 53%)
  adding: content/BERT2/saved_model.pb (deflated 93%)
  adding: content/BERT2/variables/ (stored 0%)
  adding: content/BERT2/variables/variables.index (deflated 82%)
  adding: content/BERT2/variables/variables.data-00000-of-00001 (deflated 14%)
  adding: content/BERT3/ (stored 0%)
  adding: content/BERT3/assets/ (stored 0%)
  adding: content/BERT3/assets/vocab.txt (deflated 53%)
  adding: content/BERT3/saved_model.pb (deflated 93%)
  adding: content/BERT3/variables/ (stored 0%

In [None]:
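filepath = ''
#compile=False skips restoring the custom AdamW optimizer, which is not needed for inference
loaded_model = tf.keras.models.load_model(filepath, compile=False)

#The classifier has a single output unit, so argmax would always return 0.
#Classes are instead obtained by thresholding the sigmoid of the logits.
pred = (tf.sigmoid(classifier_model.predict(testdata)).numpy() > 0.5).astype(int).ravel()
print(pred)
pred2 = (tf.sigmoid(loaded_model.predict(testdata)).numpy() > 0.5).astype(int).ravel()
print(pred2)

#Compares the in-memory model with the reloaded one; identical predictions give perfect scores
print(classification_report(pred, pred2, target_names=sentiment))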