# Bidirectional Encoder Representations from Transformers (BERT)

## Training BERT for Binary Text Classification using TensorFlow

Configuring your Python TensorFlow environment is quite straightforward:

* Clone the BERT GitHub repository onto your computer by running the following command in your terminal: `git clone https://github.com/google-research/bert.git`
* Download the pre-trained BERT model files from the official BERT GitHub page. Each archive contains the model weights (a TensorFlow checkpoint), the vocabulary file, and the configuration file that describes the model's hyperparameters. Save the archive in the directory where you cloned the GitHub repository and extract it there. The following links provide access to some of the available models:
    * BERT-Large, Uncased (Whole Word Masking): 24-layer, 1024-hidden, 16-heads, 340M parameters: https://storage.googleapis.com/bert_models/2019_05_30/wwm_uncased_L-24_H-1024_A-16.zip
    * BERT-Large, Cased (Whole Word Masking): 24-layer, 1024-hidden, 16-heads, 340M parameters: https://storage.googleapis.com/bert_models/2019_05_30/wwm_cased_L-24_H-1024_A-16.zip
    * BERT-Base, Multilingual Cased: 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters: https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip
    * BERT-Base, Multilingual (Not recommended, use Multilingual Cased instead): 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters: https://storage.googleapis.com/bert_models/2018_11_03/multilingual_L-12_H-768_A-12.zip

The full list of models is available on GitHub: https://github.com/google-research/bert.git
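
The download-and-extract step can also be scripted. The following is a minimal sketch using only the Python standard library; the helper names `extract_archive` and `download_and_extract` and the `dest_dir` argument are our own illustrative choices, not part of the BERT repository.

```python
import urllib.request
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Unpack a model archive into dest_dir and return the names of its members."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


def download_and_extract(url, dest_dir):
    """Download a zip archive from url into dest_dir, then unpack it there."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    zip_path = dest / url.rsplit("/", 1)[-1]   # keep the archive's own file name
    urllib.request.urlretrieve(url, zip_path)  # network access happens here
    return extract_archive(zip_path, dest)
```

For example, calling `download_and_extract('https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip', '.')` from inside the cloned repository would fetch and unpack the multilingual cased model listed above.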

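The model names are marked "cased" or "uncased". Uncased models lowercase the text before tokenization, while cased models preserve the original casing, which helps when letter case carries information relevant to the task. In our example, we downloaded the BERT-Base, Cased model (also available from the GitHub page), which extracts to the `cased_L-12_H-768_A-12` folder.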

To use BERT, we must convert our data into the format that the repository's `run_classifier.py` script expects. For training and validation, the script reads TSV files with four columns and no header row:

* Column 0: A unique identifier for each row
* Column 1: An integer label for the row (class labels: 0, 1, 2, 3, etc.)
* Column 2: A column with the same letter in every row; the script expects it but does not use it
* Column 3: The text samples we want to classify
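
To make the four-column layout concrete, here is a small sketch that builds two rows and renders them as tab-separated text. The review texts and labels are made up for illustration.

```python
import io

import pandas as pd

# Two hypothetical rows in the four-column layout: id, label, dummy letter, text
rows = pd.DataFrame({
    "id": [0, 1],
    "label": [1, 0],
    "alpha": ["a", "a"],
    "text": ["Great food and friendly staff.", "Slow service, would not return."],
})

# Write to an in-memory buffer with tabs as separators, no index and no header
buffer = io.StringIO()
rows.to_csv(buffer, sep="\t", index=False, header=False)
tsv_text = buffer.getvalue()
print(tsv_text)
```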

In this example, we work with the Yelp Reviews Polarity dataset, which we load with the pandas library. The dataset is designed for binary sentiment classification: it contains 560,000 highly polarized Yelp reviews for training and 38,000 for testing, mostly about restaurants and local services. It is derived from the Yelp Dataset Challenge 2015 data. The dataset can be accessed at https://www.tensorflow.org/datasets/catalog/yelp_polarity_reviews. For additional details, please visit http://www.yelp.com/dataset.

We need to create a folder named "data" within the "bert" directory (the directory where BERT was cloned) to hold three files: train.tsv, dev.tsv, and test.tsv (TSV stands for tab-separated values). Both train.tsv and dev.tsv contain all four columns and no header row, while test.tsv contains a header row followed by only two columns: the row ID and the text to classify.

We also create a folder called "bert_output" within the "bert" directory, where the fine-tuned model will be saved. The extracted pre-trained BERT model should be stored in the "bert" directory as well.
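
A minimal sketch of creating these folders with `pathlib` follows. The helper name `make_layout` is hypothetical, and the path passed to it should point to wherever the repository was cloned.

```python
from pathlib import Path


def make_layout(bert_dir):
    """Create the 'data' and 'bert_output' folders inside bert_dir and return their paths."""
    bert = Path(bert_dir)
    data_dir = bert / "data"
    output_dir = bert / "bert_output"
    for folder in (data_dir, output_dir):
        # exist_ok=True makes the call safe to repeat
        folder.mkdir(parents=True, exist_ok=True)
    return data_dir, output_dir
```

Calling `make_layout('../../bert')` would match the relative paths used in the notebook cells of this section.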

Let’s prepare our data. 



In [3]:
# Import the necessary libraries
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Use pandas read_csv function to load the Yelp Reviews Polarity dataset into a DataFrame
# The dataset is stored in the file 'train.csv' and 'test.csv' with comma-separated values
# Assign column names 'label' and 'text' to the respective columns in the DataFrames
df_bert_train = pd.read_csv('../data/datasets/yelp_review_polarity_csv/train.csv', names=['label', 'text'])
df_bert_test = pd.read_csv('../data/datasets/yelp_review_polarity_csv/test.csv', names=['label', 'text'])

# Display the first five rows of the training DataFrame to verify the data import
df_bert_train.head()



In [4]:
# Instantiate a LabelEncoder object
labelencoder = LabelEncoder()

# Fit the LabelEncoder on the training labels and convert them to integer-encoded labels
df_bert_train['label'] = labelencoder.fit_transform(df_bert_train['label'])
# Reuse the fitted encoder on the test labels so both sets share the same mapping
df_bert_test['label'] = labelencoder.transform(df_bert_test['label'])

# Show the first five rows of the DataFrame, displaying the transformed 'label' column
df_bert_train.head()

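
The following sketch shows what the encoder does on a handful of labels, assuming the raw Yelp polarity labels are the integers 1 (negative) and 2 (positive); the sample values are illustrative. It also shows the fitted encoder being reused on a second set of labels so that both sets share one mapping.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical raw labels (assumption: 1 = negative, 2 = positive)
train_labels = [1, 2, 2, 1]
test_labels = [2, 1]

encoder = LabelEncoder()
# fit_transform learns the sorted set of classes and maps each one to 0..n-1
encoded_train = encoder.fit_transform(train_labels)
# transform reuses the mapping learned from the training labels
encoded_test = encoder.transform(test_labels)

print(encoder.classes_.tolist())
print(encoded_train.tolist(), encoded_test.tolist())
```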


In [5]:
# Construct a new DataFrame for training in compliance with BERT's specifications
df_bert_train = pd.DataFrame({
    # Row identifiers: a sequence of integers from 0 to len(df_bert_train) - 1
    'id': range(len(df_bert_train)),
    # Integer-encoded labels from the existing df_bert_train DataFrame
    'label': df_bert_train['label'],
    # Dummy column with the same letter 'a' in every row
    'alpha': ['a'] * df_bert_train.shape[0],
    # Review text, with newline characters replaced by spaces
    'text': df_bert_train['text'].replace(r'\n', ' ', regex=True)
})

# Showcase the first five rows of the newly established DataFrame
df_bert_train.head()



In [6]:
# Construct a new DataFrame for testing in compliance with BERT's specifications
df_bert_test = pd.DataFrame({
    # Row identifiers: a sequence of integers from 0 to len(df_bert_test) - 1
    'id': range(len(df_bert_test)),
    # Integer-encoded labels from the existing df_bert_test DataFrame
    'label': df_bert_test['label'],
    # Dummy column with the same letter 'a' in every row
    'alpha': ['a'] * df_bert_test.shape[0],
    # Review text, with newline characters replaced by spaces
    'text': df_bert_test['text'].replace(r'\n', ' ', regex=True)
})

# Showcase the first five rows of the newly established DataFrame
df_bert_test.head()


In [7]:
# Split the train set further into train and dev (development/validation) sets
df_bert_train, df_bert_dev = train_test_split(df_bert_train, test_size=0.01)

# Display the first five rows of each set (train, test, and dev) by calling the head() method
df_bert_train.head(), df_bert_test.head(), df_bert_dev.head()
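
The split above is random, so each run produces a different dev set. As a hedged variation, the sketch below fixes `random_state` for reproducibility and uses `stratify` to keep the label proportions the same in both parts. The toy DataFrame is made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the prepared training DataFrame
toy = pd.DataFrame({
    "id": range(1000),
    "label": [0, 1] * 500,
    "alpha": ["a"] * 1000,
    "text": [f"review {i}" for i in range(1000)],
})

# Hold out 1% of the rows as a dev set, keeping the class balance in both parts
train_part, dev_part = train_test_split(
    toy, test_size=0.01, random_state=42, stratify=toy["label"]
)

print(len(train_part), len(dev_part))
print(dev_part["label"].value_counts().to_dict())
```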



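In [None]:
# Save the DataFrames to .tsv format, as required by BERT
import os
# Create the target folder first; to_csv raises an OSError if it does not exist
os.makedirs('../../bert/data', exist_ok=True)
df_bert_train.to_csv('../../bert/data/train.tsv', sep='\t', index=False, header=False)
df_bert_dev.to_csv('../../bert/data/dev.tsv', sep='\t', index=False, header=False)
# test.tsv holds only the row ID and the text, preceded by a header row
df_bert_test[['id', 'text']].to_csv('../../bert/data/test.tsv', sep='\t', index=False, header=True)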
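
Before starting a long training run, it can be worth checking that every row of a generated file has the expected number of tab-separated columns. The sketch below is one way to do this, assuming the consumer splits rows on tabs and treats quote characters literally; the helper name `rows_with_wrong_width` is our own.

```python
import csv
import io


def rows_with_wrong_width(tsv_text, expected_columns):
    """Return the 1-based numbers of rows whose column count differs from expected_columns."""
    # QUOTE_NONE means quote characters are kept as ordinary text
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        number
        for number, row in enumerate(reader, start=1)
        if len(row) != expected_columns
    ]


# Hypothetical content: the second row contains a stray tab inside the text
sample = "0\t1\ta\tGood place\n1\t0\ta\tBad\tplace\n"
print(rows_with_wrong_width(sample, expected_columns=4))
```

To check a real file, its content can be read with `open(path, encoding='utf-8').read()` and passed to the function.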

In the terminal, we change to the “bert” folder and run the following command:

    python run_classifier.py \
      --task_name=cola \
      --do_train=true \
      --do_eval=true \
      --do_predict=true \
      --data_dir=./data/ \
      --vocab_file=./cased_L-12_H-768_A-12/vocab.txt \
      --bert_config_file=./cased_L-12_H-768_A-12/bert_config.json \
      --init_checkpoint=./cased_L-12_H-768_A-12/bert_model.ckpt \
      --max_seq_length=128 \
      --train_batch_size=32 \
      --learning_rate=2e-5 \
      --num_train_epochs=3.0 \
      --output_dir=./bert_output/ \
      --do_lower_case=False
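
When the same command must be run with several models or folders, it can help to assemble it in Python. The sketch below rebuilds the command shown above from a dictionary of flags; the helper name `build_command` and its arguments are illustrative, while the flag names and values are copied from the command.

```python
import shlex


def build_command(model_dir, data_dir="./data/", output_dir="./bert_output/"):
    """Assemble the run_classifier.py call as a list of arguments."""
    flags = {
        "task_name": "cola",
        "do_train": "true",
        "do_eval": "true",
        "do_predict": "true",
        "data_dir": data_dir,
        "vocab_file": f"{model_dir}/vocab.txt",
        "bert_config_file": f"{model_dir}/bert_config.json",
        "init_checkpoint": f"{model_dir}/bert_model.ckpt",
        "max_seq_length": 128,
        "train_batch_size": 32,
        "learning_rate": 2e-5,
        "num_train_epochs": 3.0,
        "output_dir": output_dir,
        "do_lower_case": "False",
    }
    return ["python", "run_classifier.py"] + [
        f"--{name}={value}" for name, value in flags.items()
    ]


command = build_command("./cased_L-12_H-768_A-12")
# shlex.join renders the list as a single shell-safe string for inspection
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from inside the "bert" folder.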


## BERT Extractive Summarizer

In [11]:
# Import the Summarizer class from the summarizer module
from summarizer import Summarizer

# Define a multi-line string variable 'body' containing text
body = '''
Learning algorithms work on the basis that strategies, algorithms, and inferences that worked well in the past are likely to continue working well in the future. These inferences can sometimes be obvious, such as "since the sun rose every morning for the last 10,000 days, it will probably rise tomorrow morning as well". Other times, they can be more nuanced, such as "X% of families have geographically separate species with color variants, so there is a Y% chance that undiscovered black swans exist".
Machine learning programs can perform tasks without being explicitly programmed to do so. It involves computers learning from data provided so that they carry out certain tasks. For simple tasks assigned to computers, it is possible to program algorithms telling the machine how to execute all steps required to solve the problem at hand; on the computer's part, no learning is needed. For more advanced tasks, it can be challenging for a human to manually create the needed algorithms. In practice, it can turn out to be more effective to help the machine develop its own algorithm, rather than having human programmers specify every needed step. 
The discipline of machine learning employs various approaches to teach computers to accomplish tasks where no fully satisfactory algorithm is available. In cases where vast numbers of potential answers exist, one approach is to label some of the correct answers as valid. This can then be used as training data for the computer to improve the algorithm(s) it uses to determine correct answers. For example, to train a system for the task of digital character recognition, the MNIST dataset of handwritten digits has often been used.

'''

# Instantiate the Summarizer class
model = Summarizer()

# Summarize the 'body' text; min_length=60 excludes sentences shorter than 60 characters from the summary
result = model(body, min_length=60)

# Store the summary text in the 'full' variable (join() returns a string result unchanged)
full = ''.join(result)

# Print the final summarized text
print(full)

Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Learning algorithms work on the basis that strategies, algorithms, and inferences that worked well in the past are likely to continue working well in the future. For more advanced tasks, it can be challenging for a human to manually create the needed algorithms. For example, to train a system for the task of digital character recognition, the MNIST dataset of handwritten digits has often been used.
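
The documentation of the `bert-extractive-summarizer` package describes its approach as embedding each sentence, clustering the embeddings, and keeping the sentences closest to the cluster centroids. The sketch below illustrates only that selection step, under our own assumptions: the two-dimensional vectors stand in for real BERT sentence embeddings, and the helper name `pick_representatives` is hypothetical rather than part of the package.

```python
import numpy as np
from sklearn.cluster import KMeans


def pick_representatives(embeddings, num_sentences):
    """Return the sorted indices of the sentences closest to each k-means centroid."""
    vectors = np.asarray(embeddings, dtype=float)
    kmeans = KMeans(n_clusters=num_sentences, n_init=10, random_state=0).fit(vectors)
    chosen = set()
    for center in kmeans.cluster_centers_:
        # Euclidean distance from every sentence vector to this centroid
        distances = np.linalg.norm(vectors - center, axis=1)
        chosen.add(int(np.argmin(distances)))
    # Sorting keeps the selected sentences in their original order
    return sorted(chosen)


# Toy "embeddings" for six sentences that form two well-separated groups
toy_embeddings = [[0, 0], [1, 0], [2, 0], [10, 10], [11, 10], [12, 10]]
selected = pick_representatives(toy_embeddings, num_sentences=2)
print(selected)
```

Joining the sentences at the selected indices, in order, gives the extractive summary.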


## Question Answering

In [3]:
import torch
from transformers import BertForQuestionAnswering
from transformers import BertTokenizer

# Load a BERT model that has already been fine-tuned for question answering on SQuAD
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

# Load the pre-trained BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

# Define the question and context (body) as strings
question = '''What is Machine Learning?'''
body = ''' Learning algorithms work on the basis that strategies, algorithms, and inferences that worked well in the past are likely to continue working well in the future. These inferences can sometimes be obvious, such as "since the sun rose every morning for the last 10,000 days, it will probably rise tomorrow morning as well". Other times, they can be more nuanced, such as "X% of families have geographically separate species with color variants, so there is a Y% chance that undiscovered black swans exist".
Machine learning programs can perform tasks without being explicitly programmed to do so. It involves computers learning from data provided so that they carry out certain tasks. For simple tasks assigned to computers, it is possible to program algorithms telling the machine how to execute all steps required to solve the problem at hand; on the computer's part, no learning is needed. For more advanced tasks, it can be challenging for a human to manually create the needed algorithms. In practice, it can turn out to be more effective to help the machine develop its own algorithm, rather than having human programmers specify every needed step. 
The discipline of machine learning employs various approaches to teach computers to accomplish tasks where no fully satisfactory algorithm is available. In cases where vast numbers of potential answers exist, one approach is to label some of the correct answers as valid. This can then be used as training data for the computer to improve the algorithm(s) it uses to determine correct answers. For example, to train a system for the task of digital character recognition, the MNIST dataset of handwritten digits has often been used. '''

# Encode the question and context using the tokenizer
encoding = tokenizer.encode_plus(text=question, text_pair=body)

# Extract the input IDs (vocabulary indices of the tokens) and the token type IDs
# (0 for question tokens, 1 for context tokens)
inputs = encoding['input_ids']
token_type_ids = encoding['token_type_ids']

# Convert input IDs to tokens
tokens = tokenizer.convert_ids_to_tokens(inputs)

# Obtain the start and end scores for the answer span from the BERT model
# (gradients are not needed for inference)
with torch.no_grad():
    start_scores, end_scores = model(input_ids=torch.tensor([inputs]), token_type_ids=torch.tensor([token_type_ids]), return_dict=False)

# Find the indices of the highest start and end scores
start_index = torch.argmax(start_scores)
end_index = torch.argmax(end_scores)

# Extract the answer tokens from the token list
answer = ' '.join(tokens[start_index:end_index+1])

# Initialize an empty string for the corrected answer
corrected_answer = ''

# Iterate through each word in the answer and correct subword tokens
for word in answer.split():
    # If it's a subword token, remove the '##' prefix and append to the corrected answer
    if word[0:2] == '##':
        corrected_answer += word[2:]
    # Otherwise, add a space before the word and append to the corrected answer
    else:
        corrected_answer += ' ' + word

# Print the final corrected answer
print(corrected_answer)


 computers learning from data provided so that they carry out certain tasks . for simple tasks assigned to computers , it is possible to program algorithms telling the machine how to execute all steps required to solve the problem at hand ; on the computer ' s part , no learning is needed . for more advanced tasks , it can be challenging for a human to manually create the needed algorithms . in practice , it can turn out to be more effective to help the machine develop its own algorithm , rather than having human programmers specify every needed step . the discipline of machine learning employs various approaches to teach computers to accomplish tasks where no fully satisfactory algorithm is available
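
Taking the two `argmax` values independently does not guarantee a sensible span: the end index can fall before the start index, and the span can be very long, as the answer above suggests. One common remedy, sketched here under our own assumptions, is to search for the pair of indices with the highest combined score subject to `start <= end` and a maximum length. The names `best_span`, `merge_wordpieces`, and `max_answer_tokens` are hypothetical, and the scores in the example are made up.

```python
def best_span(start_scores, end_scores, max_answer_tokens=30):
    """Return (start, end) with the highest summed score, start <= end, and bounded length."""
    best, best_score = (0, 0), float("-inf")
    for start, start_score in enumerate(start_scores):
        # Only consider end positions at or after start, within the length limit
        last = min(start + max_answer_tokens, len(end_scores))
        for end in range(start, last):
            score = start_score + end_scores[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best


def merge_wordpieces(tokens):
    """Join WordPiece tokens into words, gluing '##' continuations to the previous token."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]
        else:
            words.append(token)
    return " ".join(words)


# Made-up scores where the independent argmax would give start=1 and end=0
start = [0.1, 5.0, 0.2, 0.3]
end = [4.0, 0.1, 3.0, 0.2]
span = best_span(start, end)
print(span)
print(merge_wordpieces(["token", "##ization", "works"]))
```

With the model above, the inputs would be `start_scores[0].tolist()` and `end_scores[0].tolist()`, and the tokens between the returned indices would be passed to `merge_wordpieces`.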
