# BERT.ipynb

### Designation: Model Generation Script

    Purpose: train a single BERT model on our training dataset ('Tweets.csv') with a 90:10 train-test split.

- Requirements:
    
    Packages: tensorflow, pandas, matplotlib, transformers, scikit-learn (imported as sklearn), os (part of the Python standard library)

    Datasets (csv's): Tweets.csv

- This program will require an internet connection, as it will download the model and tokenizer from the HuggingFace model repository.

- Saved model-weight name (output): bertBasic.h5
    - Please note: files referenced (input and output) are expected at the folder level, except where a cell gives an explicit relative path (e.g. '../Dataset/Tweets.csv').

### A note on BERT, TensorFlow and GPU:

- This is a BERT model, which means it predicts each word from ALL the words around it in the sentence, to the left and to the right. Aside from achieving high accuracy, this has the following consequences:

    - It has a LOT of weights to train (bert-base-uncased has roughly 110 million parameters), so a GPU is strongly recommended, even though I cannot strictly 'require' one.

    - It is very prone to overfitting, so the way to tune this model is to run 1 epoch on as many unique examples as possible.

    - It will take up a lot of storage. Training speed can even differ depending on whether the files sit on a hard drive or on an SSD.

- For TensorFlow:
    
    - This notebook was developed with TensorFlow 2.8+ and CUDA toolkit 11.6, with the corresponding cuDNN and other required NVIDIA libraries, on Python 3.8.10.

    - Training time reference: 12 minutes (GPU) for 1 epoch, batch size 32, 90:10 train-test ratio, on an RTX 3060, Ryzen 5 5600X, 8 GB RAM assigned, WSL 2 with Ubuntu 20.04 LTS.



## 1. TensorFlow Standalone Setup

- To get TensorFlow working with the correct device, we set it up first, before we load the model.

- Setting 'useCPU' to True hides the GPU from CUDA and forces TensorFlow to run on the CPU only.
    
    - The later cells contain many 'with tf.device('/GPU:0'):' blocks. Forcing CPU-only mode will not make them raise an error:
        
        when no GPU is detected, TensorFlow falls back to the CPU even where the code asks for the GPU.

            Tip: Restart the kernel before running the program (again), so that no variables collide and RAM/VRAM is freed for the new run.

In [12]:
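useCPU = False  # Choose whether to use CPU or GPU for running the program

import os
if useCPU:
    # Hide all CUDA devices; set before TensorFlow is imported so it is sure to take effect
    os.environ['CUDA_VISIBLE_DEVICES'] = '-1'
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))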

Num GPUs Available:  1


## 2. Importing, downloading, and Building the model

- The model and tokenizer are quite large, so make sure there is enough free space on the C drive and that a fast, reliable internet connection is available.

- We request from Hugging Face a BERT tokenizer (AutoTokenizer will work too) and a TensorFlow BERT model with a sequence classification head attached to it.

    - We also specify that this is a 3-class classification: each tweet gets exactly one of three labels.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification
from transformers import InputExample, InputFeatures
with tf.device('/GPU:0'):
    # num_labels=3 attaches a 3-way classification head (negative / neutral / positive).
    # Each tweet has exactly one label, so no multi-label problem_type is set.
    model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)
    # The tokenizer has no classification head, so it takes no num_labels argument.
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model.summary()

## 3. Read in dataset: Tweets.csv (our dataset for training purposes), and clean up the dataset

- The textID and selected_text columns are not needed for this program, so we drop them.

In [None]:
dataset = pd.read_csv('../Dataset/Tweets.csv', encoding='ISO-8859-1')
dataset_drop = dataset.drop(['textID', 'selected_text'], axis=1)
dataset_drop

### 3.1. Extract the dataset's label column and encode it as integer categories.

- 0 is negative, 1 is neutral, 2 is positive.

In [None]:
datasetSentimentEncode = dataset_drop['sentiment'].apply(lambda c: 0 if c == 'negative' else (1 if c=='neutral' else 2))
datasetSentimentEncode
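
Note that the lambda above sends every value that is not 'negative' or 'neutral' to 2, including typos and missing values. As a minimal sketch, assuming we would rather fail loudly on an unexpected label, the same encoding can be written with an explicit lookup table. `LABEL_TO_ID` and `encode_sentiment` are hypothetical names, not part of the notebook.

```python
# Hypothetical helper: a strict version of the 0/1/2 sentiment encoding.
LABEL_TO_ID = {"negative": 0, "neutral": 1, "positive": 2}

def encode_sentiment(label):
    """Return the integer class for a sentiment string, or raise on unknown input."""
    try:
        return LABEL_TO_ID[label]
    except (KeyError, TypeError):
        raise ValueError(f"unexpected sentiment label: {label!r}")

encoded = [encode_sentiment(s) for s in ["negative", "neutral", "positive"]]
print(encoded)  # [0, 1, 2]
```

With pandas, this could be applied as `dataset_drop['sentiment'].apply(encode_sentiment)`.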

## 4. Compiling Training/Test split dataframes

- We use a seeded 90:10 split (random_state=21), and name the dataframe columns 'DATA_COLUMN' and 'LABEL_COLUMN' so that the loader functions in section 5 can use them.

In [None]:
from sklearn.model_selection import train_test_split
xtrain,xtest,ytrain,ytest = train_test_split(dataset_drop['text'].astype(str), datasetSentimentEncode, test_size=0.1, random_state=21)
trainDF = pd.DataFrame()
testDF = pd.DataFrame()
trainDF['DATA_COLUMN'] = xtrain
trainDF['LABEL_COLUMN'] = ytrain
testDF['DATA_COLUMN'] = xtest
testDF['LABEL_COLUMN'] = ytest
trainDF,testDF
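
To make the effect of `test_size=0.1` and a fixed seed concrete, here is a small pure-Python sketch of a seeded 90:10 split. It is an illustration only: `seeded_split` is a hypothetical helper and does not reproduce scikit-learn's shuffling, so the rows it picks will differ from those chosen by `train_test_split`. The test share is rounded up, which I believe matches scikit-learn's handling of a fractional `test_size`.

```python
import math
import random

def seeded_split(items, test_size=0.1, seed=21):
    """Shuffle a copy of items with a fixed seed and cut off the test share."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # same seed -> same order on every run
    n_test = math.ceil(test_size * len(shuffled))
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(25))                      # stand-in for 25 tweets
train, test = seeded_split(rows)
print(len(train), len(test))  # 22 3
```

Calling `seeded_split(rows)` again returns the same two lists, which is what "seeded" means here.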

## 5. Converting dataframes into supported input format for the AI

- The conversion methods are defined below.

- The conversion first turns the dataframe into a list of InputExample objects.

    - InputExample: a Hugging Face-provided container object for one data example.

- It then converts the list of InputExample objects into a TensorFlow Dataset (tf.data.Dataset) holding the token-ID features and the labels.

    - The tokenizer converts the plain text into sequences of integers (token IDs) that the model understands.
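
To show the shape of what the tokenizer produces, here is a toy sketch of padding, truncation and the attention mask. The word IDs, the whitespace splitting and the helper name `toy_encode` are made up for illustration; the real BERT tokenizer uses WordPiece and a vocabulary of about 30,000 entries. Only the layout (`[CLS]` ... `[SEP]`, padding on the right, a mask of 1s and 0s) is meant to carry over.

```python
# Toy vocabulary: the special-token IDs match bert-base-uncased, the word IDs are invented.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
             "good": 1, "morning": 2, "twitter": 3}

def toy_encode(text, max_length=8):
    """Mimic the output layout of a BERT tokenizer for a single sentence."""
    words = text.lower().split()[: max_length - 2]       # leave room for [CLS] and [SEP]
    ids = [TOY_VOCAB["[CLS]"]]
    ids += [TOY_VOCAB.get(w, TOY_VOCAB["[UNK]"]) for w in words]
    ids.append(TOY_VOCAB["[SEP]"])
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [TOY_VOCAB["[PAD]"]] * n_pad,   # pad on the right
        "attention_mask": [1] * len(ids) + [0] * n_pad,    # 1 = real token, 0 = padding
        "token_type_ids": [0] * max_length,                # single sentence -> all zeros
    }

encoded = toy_encode("Good morning Twitter")
print(encoded["input_ids"])       # [101, 1, 2, 3, 102, 0, 0, 0]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]
```

In the notebook the same three keys are produced per tweet, only with `max_length=128` instead of 8.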



In [None]:
def convert_data_to_examples(train, test, DATA_COLUMN, LABEL_COLUMN): 
  train_InputExamples = train.apply(lambda x: InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this case
                                                          text_a = x[DATA_COLUMN], 
                                                          text_b = None,
                                                          label = x[LABEL_COLUMN]), axis = 1)

  validation_InputExamples = test.apply(lambda x: InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this case
                                                          text_a = x[DATA_COLUMN], 
                                                          text_b = None,
                                                          label = x[LABEL_COLUMN]), axis = 1)
  
  return train_InputExamples, validation_InputExamples
  
def convert_examples_to_tf_dataset(examples, tokenizer, max_length=128):
    features = [] # -> will hold InputFeatures to be converted later

    for e in examples:
        # Documentation is really strong for this method, so please take a look at it
        input_dict = tokenizer.encode_plus(
            e.text_a,
            add_special_tokens=True,
            max_length=max_length, # truncates if len(s) > max_length
            return_token_type_ids=True,
            return_attention_mask=True,
            padding="max_length", # pads to the right by default; replaces the deprecated pad_to_max_length=True
            truncation=True
        )

        input_ids, token_type_ids, attention_mask = (input_dict["input_ids"],
            input_dict["token_type_ids"], input_dict['attention_mask'])

        features.append(
            InputFeatures(
                input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, label=e.label
            )
        )

    def gen():
        for f in features:
            yield (
                {
                    "input_ids": f.input_ids,
                    "attention_mask": f.attention_mask,
                    "token_type_ids": f.token_type_ids,
                },
                f.label,
            )

    return tf.data.Dataset.from_generator(
        gen,
        ({"input_ids": tf.int32, "attention_mask": tf.int32, "token_type_ids": tf.int32}, tf.int64),
        (
            {
                "input_ids": tf.TensorShape([None]),
                "attention_mask": tf.TensorShape([None]),
                "token_type_ids": tf.TensorShape([None]),
            },
            tf.TensorShape([]),
        ),
    )

### 5.1. The call of the functions, and batching

- We will transform both train and test data separately.

In [None]:
DATA_COLUMN = 'DATA_COLUMN'
LABEL_COLUMN = 'LABEL_COLUMN'
train_InputExamples, validation_InputExamples = convert_data_to_examples(trainDF, testDF, DATA_COLUMN, LABEL_COLUMN)
with tf.device('/GPU:0'):
    train_data = convert_examples_to_tf_dataset(list(train_InputExamples), tokenizer)
    train_data = train_data.shuffle(100).batch(32)

    validation_data = convert_examples_to_tf_dataset(list(validation_InputExamples), tokenizer)
    validation_data = validation_data.batch(32)
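
For reference, `.batch(32)` groups consecutive examples and leaves a smaller final batch when the dataset size is not a multiple of 32. A minimal pure-Python sketch of that grouping follows; `batched` is a hypothetical helper and the dataset size of 70 is arbitrary.

```python
import math

def batched(items, batch_size=32):
    """Yield consecutive chunks of at most batch_size items, like Dataset.batch."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

examples = list(range(70))                 # stand-in for 70 tokenized tweets
sizes = [len(b) for b in batched(examples)]
print(sizes)                               # [32, 32, 6]
print(math.ceil(len(examples) / 32))       # 3
```

The number of steps Keras reports per epoch during training corresponds to this batch count.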

## 6. Train the model

- We compile the model with the Adam optimizer and specify the learning rate here.

- We then run model.fit() (the training command) and specify the number of epochs here.

In [None]:
with tf.device('/GPU:0'):
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0), 
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), 
              metrics=[tf.keras.metrics.SparseCategoricalAccuracy('accuracy')])

    model.fit(train_data, epochs=1, validation_data=validation_data)
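
Because the model returns raw logits and the labels are the integers 0, 1 and 2, the loss is sparse categorical cross-entropy with `from_logits=True`: a softmax over the three logits followed by the negative log-probability of the true class. A small worked sketch in plain Python, with made-up logits (`sparse_ce_from_logits` is a hypothetical name, not a TensorFlow function):

```python
import math

def sparse_ce_from_logits(logits, label):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    m = max(logits)                                   # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# A classifier that scores all three classes equally loses ln(3) per example.
print(round(sparse_ce_from_logits([0.0, 0.0, 0.0], 2), 4))  # 1.0986
# A confident, correct prediction gives a loss close to zero.
print(sparse_ce_from_logits([-2.0, 0.0, 6.0], 2) < 0.01)    # True
```

As a rough yardstick, a loss near ln(3), about 1.10, means the model is doing no better than guessing uniformly among the three sentiments.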

## 7. (Optional) Save Weights

- If the training result is satisfactory, uncomment the code below and run the cell.
    - It takes the model weights that are still held in the kernel's memory and saves them to bertBasic.h5.

In [None]:
#model.save_weights('bertBasic.h5')

## 8. (Deprecated) Load Weights and get predictions on the train set (seeded)

- This part of the code can run without retraining, provided bertBasic.h5 has already been saved.
    - To use it, run every cell up to, but not including, section 6 (Train the model), then uncomment and run this cell.

In [None]:
# # load model

# trained_model = TFBertForSequenceClassification.from_pretrained(
#     'bert-base-uncased', num_labels=3)

# trained_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0), 
#                       loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), 
#                       metrics=[tf.keras.metrics.SparseCategoricalAccuracy('accuracy')])

# trained_model.load_weights('bertBasic.h5')

# preds = trained_model.predict(train_data)
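
The prediction output holds logits, one row of three scores per tweet, rather than sentiment names. A minimal sketch of turning such rows back into labels, assuming the 0/1/2 encoding from section 3.1. The logits below are invented, and `ID_TO_LABEL` and `decode_predictions` are hypothetical names.

```python
ID_TO_LABEL = {0: "negative", 1: "neutral", 2: "positive"}

def decode_predictions(logit_rows):
    """Pick the highest-scoring class in each row and map it to its sentiment name."""
    labels = []
    for row in logit_rows:
        best = max(range(len(row)), key=lambda i: row[i])   # argmax over the three logits
        labels.append(ID_TO_LABEL[best])
    return labels

fake_logits = [[2.1, 0.3, -1.0], [-0.5, 1.7, 0.2], [-1.2, 0.1, 3.4]]
print(decode_predictions(fake_logits))  # ['negative', 'neutral', 'positive']
```

With real output this might be called as `decode_predictions(preds.logits.tolist())`, assuming the installed transformers version exposes the logits as a NumPy array under `preds.logits`.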