### Text Pre-Processing for Feature-Based Training
- BERT (Bidirectional Encoder Representations from Transformers)
- BERT is built on the Transformer, an architecture whose attention mechanism learns contextual relations between words (or sub-words) in a text.
- The original Transformer has two separate components: an encoder that reads the text input and a decoder that produces a prediction for the task.
- Because BERT's goal is to generate a language model, only the encoder is needed.
- The Transformer encoder reads the entire sequence of words at once rather than left to right or right to left, which is why BERT is considered bidirectional.

- Resources: 
    - https://keras.io/examples/nlp/text_extraction_with_bert/#preprocess-the-data
    - https://collab.its.virginia.edu/access/lessonbuilder/item/2079372/group/25e8dc9b-3e66-4249-ae38-3a124dead1e4/Module%207:%20Seque_%2010_12%20-%2010_18/7.5%20%20Video%20+%20Qu_%20and%20Attention/M7.5_%20NLP%20with%20Attention,%20Transformers,%20and%20BERT.pdf
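To make the attention idea from the bullets above concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is an illustration only, not BERT's implementation: a real encoder first maps each token through learned query, key, and value projections and uses many attention heads, whereas this sketch assumes the raw toy vectors serve as queries, keys, and values. The function names and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over a whole sequence at once."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Score this position against every position, left and right alike.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted average of all value vectors.
        outputs.append([
            sum(wj * v[i] for wj, v in zip(w, values))
            for i in range(len(values[0]))
        ])
    return outputs, weights


# Toy 2-dimensional "embeddings" for a 3-token sequence (assumed values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
```

Every row of `weights` assigns a non-zero weight to every position in the sequence, which is the sense in which the encoder reads the entire input at once.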

![image.png](attachment:8aa7a389-5aeb-485d-9884-f0761f6097af.png)

Two-Phase training:

1.) Pretrain to understand language

2.) Fine-tune to learn a specific task


In [None]:
#load dataset, tokenizer, model from pretrained model/vocab


### Setup

In [21]:
# import libraries
import json
import os
import random
import re
import string

import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from tensorflow.keras import layers
from tokenizers import BertWordPieceTokenizer, Tokenizer
from transformers import BertConfig, BertTokenizer, TFBertModel

# to make this notebook's output stable across runs
np.random.seed(42)
tf.random.set_seed(42)

In [5]:
configuration = BertConfig()  # default parameters and configuration for BERT
max_len = 384  # maximum sequence length (in tokens) fed to BERT
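A fixed `max_len` matters because BERT expects every input in a batch to have the same length. The sketch below shows one common way to bring a list of token ids to that length, assuming simple truncation and right-padding. The helper name `pad_to_max_len` is hypothetical, and the ids in the usage line are example values (101, 102, and 0 are the `[CLS]`, `[SEP]`, and `[PAD]` ids in the `bert-base-uncased` vocabulary).

```python
def pad_to_max_len(input_ids, max_len, pad_id=0):
    """Truncate or right-pad token ids and build the matching attention mask."""
    # Assumption: naive truncation; this can cut off a trailing [SEP] token.
    input_ids = list(input_ids[:max_len])
    attention_mask = [1] * len(input_ids)  # 1 marks a real token
    padding = max_len - len(input_ids)
    # 0 in the mask tells the model to ignore the padded positions.
    return input_ids + [pad_id] * padding, attention_mask + [0] * padding


# Short example with max_len=8 so the result is easy to read.
ids, mask = pad_to_max_len([101, 7592, 2088, 102], max_len=8)
print(ids)   # [101, 7592, 2088, 102, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```

With the notebook's setting, the same call would use `max_len=384`.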

### Setup BERT tokenizer

In [9]:
# Download the slow pretrained tokenizer and save its files (including vocab.txt) locally
slow_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
save_path = "bert_base_uncased/"
os.makedirs(save_path, exist_ok=True)  # no error if the directory already exists
slow_tokenizer.save_pretrained(save_path)

('bert_base_uncased/tokenizer_config.json',
 'bert_base_uncased/special_tokens_map.json',
 'bert_base_uncased/vocab.txt',
 'bert_base_uncased/added_tokens.json')

In [10]:
# Load the fast tokenizer from saved file
tokenizer = BertWordPieceTokenizer("bert_base_uncased/vocab.txt", lowercase=True)
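For intuition about what a WordPiece tokenizer does with `vocab.txt`, here is a rough sketch of greedy longest-match-first sub-word splitting. It is a simplified illustration under stated assumptions, not the library's implementation: the function name `wordpiece` and the toy vocabulary are hypothetical, and the real tokenizer also handles lowercasing, punctuation, and special tokens.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into the longest sub-word pieces found in vocab."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # "##" marks a continuation piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (assumed for illustration; the real vocab.txt is far larger).
toy_vocab = {"pharma", "##cy", "##c", "##y", "refill", "##s"}
print(wordpiece("pharmacy", toy_vocab))  # ['pharma', '##cy']
print(wordpiece("refills", toy_vocab))   # ['refill', '##s']
print(wordpiece("xyz", toy_vocab))       # ['[UNK]']
```

Splitting rare words into known pieces is what lets a fixed vocabulary cover text it has never seen as whole words.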

### Load Data & Split into Test, Train, Validate

In [34]:
#import data
data = pd.read_csv('pharmacy_dataset_reduced.csv')
#data.head()
#len(data)

In [81]:
#Split the dataset into train, val, test

# to make this notebook's output stable across runs
np.random.seed(42)
tf.random.set_seed(42)

# Shuffle the dataset so the split is not biased by row order, then cut it at
# 50% and 75%: train gets the first 50%, test the next 25%, validate the last 25%.
train, test, validate = np.split(
    data.sample(frac=1, random_state=42),
    [int(.5*len(data)), int(.75*len(data))],
)
#len(train)
#len(validate)
#len(test)
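As a quick check on the split arithmetic, the sketch below reproduces how the two cut indices passed to `np.split` translate into subset sizes. It uses only plain Python, and the helper names `split_points` and `split_sizes` are hypothetical. Because `int()` truncates, the test and validate subsets can differ by a row when the dataset length is not a multiple of 4.

```python
def split_points(n, fractions=(0.5, 0.75)):
    """Cut indices for n rows, truncated the same way int(.5*len(data)) is."""
    return [int(f * n) for f in fractions]


def split_sizes(n):
    """Sizes of the three chunks obtained by cutting n rows at split_points(n)."""
    first, second = split_points(n)
    # Chunks are rows [0:first], [first:second], and [second:n].
    return first, second - first, n - second


print(split_sizes(100))  # (50, 25, 25)
print(split_sizes(10))   # (5, 2, 3)
```

The three sizes always sum to the number of rows, so no row is dropped or duplicated by the split.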

### Preprocess Data