# Training a BERT model

### Import packages

Import Python packages and display the Azure Machine Learning SDK version.

In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

import azureml.core
from azureml.core import Workspace

# check core SDK version number
print("Azure ML SDK Version: ", azureml.core.VERSION)

Azure ML SDK Version:  1.38.0


### Connect to workspace

Create a workspace object from the existing workspace. `Workspace.from_config()` reads the file **config.json** and loads the details into an object named `ws`.

In [2]:
# load workspace configuration from the config.json file in the current folder.
ws = Workspace.from_config()
print(ws.name, ws.location, ws.resource_group, sep='\t')

docs-ws	francecentral	openclassrooms
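
As far as the SDK documentation describes it, `Workspace.from_config()` looks for a small JSON file with three keys. The sketch below shows one way such a file could be generated. The identifier values are placeholders, and `write_workspace_config` is a hypothetical helper, not part of the SDK.

```python
import json
import os
import tempfile

def write_workspace_config(folder, subscription_id, resource_group, workspace_name):
    """Write a config.json with the three keys that Workspace.from_config() reads."""
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    path = os.path.join(folder, "config.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)
    return path

# Placeholder values; replace them with the details of your own workspace.
demo_folder = tempfile.mkdtemp()
config_path = write_workspace_config(
    demo_folder, "<subscription-id>", "<resource-group>", "<workspace-name>"
)
with open(config_path, encoding="utf-8") as f:
    print(f.read())
```

In practice the file is usually downloaded from the Azure portal, so writing it by hand is only needed when that download is not available.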


### Create experiment

Create an experiment to track the runs in your workspace. A workspace can have multiple experiments.

In [12]:
experiment_name = 'bert_sentiment_analysis'

from azureml.core import Experiment
exp = Experiment(workspace=ws, name=experiment_name)

### Create or attach an existing compute resource

In [3]:
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
import os

# choose a name for your cluster
compute_name = os.environ.get("AML_COMPUTE_CLUSTER_NAME", "cpu-cluster")
# environment variables are read as strings, so convert the node counts to int
compute_min_nodes = int(os.environ.get("AML_COMPUTE_CLUSTER_MIN_NODES", 0))
compute_max_nodes = int(os.environ.get("AML_COMPUTE_CLUSTER_MAX_NODES", 4))

# This example uses CPU VM. For using GPU VM, set SKU to STANDARD_NC6
vm_size = os.environ.get("AML_COMPUTE_CLUSTER_SKU", "STANDARD_D2_V2")


if compute_name in ws.compute_targets:
    compute_target = ws.compute_targets[compute_name]
    if compute_target and type(compute_target) is AmlCompute:
        print("found compute target: " + compute_name)
else:
    print("creating new compute target...")
    provisioning_config = AmlCompute.provisioning_configuration(vm_size = vm_size,
                                                                min_nodes = compute_min_nodes, 
                                                                max_nodes = compute_max_nodes)

    # create the cluster
    compute_target = ComputeTarget.create(ws, compute_name, provisioning_config)
    
    # can poll for a minimum number of nodes and for a specific timeout. 
    # if no min node count is provided it will use the scale settings for the cluster
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)
    
    # For a more detailed view of current AmlCompute status, use get_status()
    print(compute_target.get_status().serialize())

found compute target: cpu-cluster



## Explore data


In [4]:
from azureml.core import Dataset

# create a TabularDataset from a CSV file in the default datastore
datastore = ws.get_default_datastore()
# the file contains 1,600 tweets
csv_path = [(datastore, 'UI/02-17-2022_021854_UTC/selected_tweets.csv')]
dataset = Dataset.Tabular.from_delimited_files(path=csv_path)

# load the TabularDataset to pandas DataFrame
df_tweet = dataset.to_pandas_dataframe()

print(df_tweet.shape)
df_tweet.head()



(1600, 2)


,polarity,tweet
0,0,wants to compete! i want hard competition! i w...
1,0,It seems we are stuck on the ground in Amarill...
2,0,where the f are my pinking shears? rarararrrar...
3,0,0ff t0 tHE MEEtiN.. i HAtE WhEN PPl V0lUNtEER...
4,4,@ reply me pls


## Train on a remote cluster

For this task, submit the job to run on the remote training cluster:
* Create a directory
* Save training and testing dataset to directory
* Create a training script
* Create a script run configuration
* Submit the job 

### Create a directory

Create a directory to deliver the necessary code from your computer to the remote resource.

In [5]:
import os

script_folder = os.path.join(os.getcwd(), "bert_training")
os.makedirs(script_folder, exist_ok=True)



In [6]:
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df_tweet, test_size=0.1, random_state=0)

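# save both splits as CSV files in the script folder so they are uploaded with the training script
df_train.to_csv(os.path.join(script_folder, 'train.csv'), index=False)
df_test.to_csv(os.path.join(script_folder, 'test.csv'), index=False)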
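
As a quick sanity check on the split sizes, here is a small arithmetic sketch. It assumes the test share is rounded up to a whole number of rows, which is an assumption about how `train_test_split` treats a fractional `test_size`. It also anticipates the 20% validation hold-out and the batch size of 16 used by the training script later in this notebook. `split_sizes` is a hypothetical helper.

```python
import math

def split_sizes(n_rows, test_fraction):
    """Return (n_train, n_test), rounding the held-out share up to whole rows."""
    n_test = math.ceil(n_rows * test_fraction)
    return n_rows - n_test, n_test

n_train, n_test = split_sizes(1600, 0.1)   # notebook split: 10% test
n_fit, n_val = split_sizes(n_train, 0.2)   # train.py split: 20% validation
batches_per_pass = math.ceil(n_fit / 16)   # batch size used in train.py

print(n_train, n_test)     # 1440 160
print(n_fit, n_val)        # 1152 288
print(batches_per_pass)    # 72
```

Because the training dataset is built with `.repeat(2)`, each Keras epoch would run through about twice that many batches, that is 144 steps.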

### Create a training script

To submit the job to the cluster, first create a training script. Run the following code to create the training script called `train.py` in the directory you just created. 

In [20]:
%%writefile $script_folder/train.py

import argparse
import os
import pandas as pd
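import numpy as np
import tensorflow as tf
# needed for the train/validation split further down
from sklearn.model_selection import train_test_split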
# Hide GPU from visible devices
tf.config.set_visible_devices([], 'GPU')

import warnings
warnings.filterwarnings('ignore')

# let the user pass 3 parameters: the training file, the test file, and the number of epochs
parser = argparse.ArgumentParser()
parser.add_argument('--train-file', type=str, dest='trainfile', help='path to the training CSV file')
parser.add_argument('--test-file', type=str, dest='testfile', help='path to the test CSV file')
parser.add_argument('--epochs', type=int, dest='e', default=1, help='number of training epochs')
args = parser.parse_args()

####################################load the source file###################################
df_train = pd.read_csv(args.trainfile, sep=',', encoding='utf-8')
df_test = pd.read_csv(args.testfile, sep=',', encoding='utf-8')
print("training size:", len(df_train))
print("testing size:", len(df_test))

X_train = df_train.tweet.values
y_train = df_train.polarity.replace(4,1)
X_test = df_test.tweet.values
y_test = df_test.polarity.replace(4,1)

########################Creating a vocabulary index and converting it to sequences#####################
from transformers import BertTokenizer,TFBertForSequenceClassification
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def preprocess(X):
    import re
    def text_clean(text):
        temp = text.lower()
        temp = re.sub("@[A-Za-z0-9_]+","", temp)
        temp = re.sub("#[A-Za-z0-9_]+","", temp)
        temp = re.sub(r"http\S+", "", temp)
        temp = re.sub(r"www\.\S+", "", temp)
        temp = re.sub("[0-9]","", temp)
        return temp
    X_cleaned = [text_clean(text) for text in X]
    return X_cleaned

# The tokenizer's encode_plus function tokenizes the raw input, adds the special tokens, and pads the vector
def convert_example_to_feature(text):
    return bert_tokenizer.encode_plus(text,
            add_special_tokens = True, # add [CLS], [SEP]
            max_length = 128, # sequence length used here (BERT itself accepts up to 512 tokens)
            padding = 'max_length', # add [PAD] tokens up to max_length (replaces the deprecated pad_to_max_length)
            truncation = True, # cut inputs that are longer than max_length
            return_attention_mask = True, # add attention mask to not focus on pad tokens
          )
# The following helper functions transform the raw data into a format ready to feed into the BERT model
def map_example_to_dict(input_ids, attention_masks, token_type_ids, label):
    return {
      "input_ids": input_ids,
      "token_type_ids": token_type_ids,
      "attention_mask": attention_masks,
    }, label
def encode_examples(X,y):
    # prepare list, so that we can build up final TensorFlow dataset from slices.
    input_ids_list = []
    token_type_ids_list = []
    attention_mask_list = []
    label_list = []
    for text, label in zip(X, y):
        bert_input = convert_example_to_feature(text)
        input_ids_list.append(bert_input['input_ids'])
        token_type_ids_list.append(bert_input['token_type_ids'])
        attention_mask_list.append(bert_input['attention_mask'])
        label_list.append([label])
    return tf.data.Dataset.from_tensor_slices((input_ids_list, attention_mask_list, token_type_ids_list, label_list)).map(map_example_to_dict)

# train dataset
X_train, X_validation, y_train, y_validation = train_test_split(X_train, y_train, test_size=0.2, random_state=0)
ds_train_encoded = encode_examples(preprocess(X_train), y_train).shuffle(100).batch(16).repeat(2)
ds_val_encoded = encode_examples(preprocess(X_validation), y_validation).batch(16)
# test dataset
ds_test_encoded = encode_examples(preprocess(X_test), y_test).batch(16)
########################training and evaluating the model###################################
from azureml.core import Run

# get hold of the current run
run = Run.get_context()

# recommended learning rate for Adam 5e-5, 3e-5, 2e-5
learning_rate = 3e-5
# model initialization
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased')
# choosing Adam optimizer
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate, epsilon=1e-08)
# the labels are not one-hot encoded, so use sparse categorical cross-entropy and accuracy
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])

model.fit(ds_train_encoded, epochs=args.e, validation_data=ds_val_encoded)

loss, acc = model.evaluate(ds_test_encoded, verbose=0)
print("accuracy: {:5.2f}%".format(100 * acc))

run.log('epochs', args.e)
run.log('accuracy', acc)

#############save the model in two parts, one for keras, one for pipeline#####################
# note file saved in the outputs folder is automatically uploaded into experiment record
os.makedirs('./outputs', exist_ok=True)
# Save the model :
model.save_pretrained("outputs/bert_model", saved_model=True)

if os.path.exists('outputs/bert_model') :
    print("model saved")


Overwriting /mnt/batch/tasks/shared/LS_root/mounts/clusters/calcbert/code/Users/lei.xiaofan/sentiment/bert_training/train.py
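
To make the padding and attention-mask step in `convert_example_to_feature` more concrete, here is a toy sketch in plain Python. It is not the tokenizer's implementation: `pad_with_mask` is a hypothetical helper and the two word-piece ids in the example are only illustrative. The special-token ids (101 for `[CLS]`, 102 for `[SEP]`, 0 for `[PAD]`) are those of the `bert-base-uncased` vocabulary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids in bert-base-uncased

def pad_with_mask(token_ids, max_length):
    """Wrap ids in [CLS]/[SEP], truncate or pad to max_length, and build the mask."""
    # keep room for the two special tokens when truncating
    body = list(token_ids)[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    n_real = len(input_ids)
    n_pad = max_length - n_real
    return {
        "input_ids": input_ids + [PAD_ID] * n_pad,
        "token_type_ids": [0] * max_length,            # single-sentence input
        "attention_mask": [1] * n_real + [0] * n_pad,  # 0 marks padding to ignore
    }

example = pad_with_mask([7592, 2088], max_length=6)  # illustrative word-piece ids
print(example["input_ids"])       # [101, 7592, 2088, 102, 0, 0]
print(example["attention_mask"])  # [1, 1, 1, 1, 0, 0]
```

Every encoded example has the same length, which is what allows `tf.data.Dataset.from_tensor_slices` in the training script to stack the examples into regular tensors.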


### Configure the training job

Configure the ScriptRunConfig by specifying:

* The directory that contains your scripts. All the files in this directory are uploaded into the cluster nodes for execution.
* The compute target. In this case, you will use the AmlCompute cluster you created.
* The training script name, train.py.
* An environment that contains the libraries needed to run the script.
* Arguments required by the training script.


In [9]:
from azureml.core.environment import Environment
from azureml.core.conda_dependencies import CondaDependencies

env=Environment.get(workspace=ws, name='train-P7-bert', version=6)
"""
# to install required packages
env = Environment('train-P7-bert')
cd = CondaDependencies.create(pip_packages=['azureml-dataset-runtime[pandas,fuse]','azureml-defaults','tensorflow==2.8.0', 'transformers==4.17.0','huggingface_hub' ],
conda_packages = ['pip','python==3.9.7','scikit-learn==1.0.2'])

env.python.conda_dependencies = cd

# Register environment to re-use later
env.register(workspace = ws)
"""

Then, create the ScriptRunConfig by specifying the training script, compute target and environment.

In [16]:
from azureml.core import ScriptRunConfig


args = ['--train-file', 'train.csv','--test-file', 'test.csv', '--epochs',1]

src = ScriptRunConfig(source_directory=script_folder,
                      script='train.py', 
                      arguments=args,
                      compute_target=compute_target,
                      environment=env)
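
The `arguments` list is handed to `train.py` on the command line, where `argparse` maps each flag to an attribute. The sketch below rebuilds the same parser outside Azure ML to show that mapping. `build_parser` is a hypothetical wrapper, and the sketch assumes the integer `1` reaches the script as the string `'1'`.

```python
import argparse

def build_parser():
    """Recreate the argument parser defined in train.py."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--train-file', type=str, dest='trainfile', help='path to the training CSV file')
    parser.add_argument('--test-file', type=str, dest='testfile', help='path to the test CSV file')
    parser.add_argument('--epochs', type=int, dest='e', default=1, help='number of training epochs')
    return parser

# command-line values always arrive as strings; argparse converts --epochs to int
cli_args = ['--train-file', 'train.csv', '--test-file', 'test.csv', '--epochs', '1']
parsed = build_parser().parse_args(cli_args)
print(parsed.trainfile, parsed.testfile, parsed.e)  # train.csv test.csv 1
```

The relative file names are expected to resolve because both CSV files were saved into the source directory that is uploaded together with the script.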

### Submit the job to the cluster

Run the experiment by submitting the ScriptRunConfig object.

In [21]:
run = exp.submit(config=src)
run

Experiment,Id,Type,Status,Details Page,Docs Page
bert_sentiment_analysis,bert_sentiment_analysis_1647877648_c5c49869,azureml.scriptrun,Starting,Link to Azure Machine Learning studio,Link to Documentation
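
The job runs asynchronously, so the status shown right after submission is `Starting`. The SDK offers `run.wait_for_completion(show_output=True)` to block until the run finishes. The sketch below shows the same idea as a generic polling loop. `wait_until_finished` is a hypothetical helper, and the set of terminal states is an assumption based on the status values the SDK reports.

```python
import time

TERMINAL_STATES = {"Completed", "Failed", "Canceled"}  # assumed terminal statuses

def wait_until_finished(get_status, poll_seconds=30.0, timeout_seconds=3600.0):
    """Call get_status() repeatedly until it returns a terminal state or time runs out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run still in state {status!r} after {timeout_seconds} s")
        time.sleep(poll_seconds)

# Demonstration with a fake status source; with a real run, pass run.get_status instead.
fake_statuses = iter(["Starting", "Running", "Finalizing", "Completed"])
final = wait_until_finished(lambda: next(fake_statuses), poll_seconds=0.0)
print(final)  # Completed
```

Checking the returned state before moving on helps avoid registering a model from a run that failed or was canceled.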


## Register model



Once the run has completed, register the trained model in the workspace.

In [28]:
# register model 
model = run.register_model(model_name='bert_model', model_path='outputs/bert_model',
                        tags={'Training context':'Script'},
                    properties={'Accuracy': run.get_metrics()['accuracy']})
print('bert model', model.name, model.id, model.version, sep='\t')

bert model	bert_model	bert_model:2	2
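
`run.get_metrics()['accuracy']` raises a `KeyError` if the run has not logged the metric yet, for example when registration is attempted before training ends. A defensive variant could build the `properties` dictionary first. `build_model_properties` is a hypothetical helper, and converting the value to a string is an assumption based on the way properties are displayed as strings when the model is loaded later in this notebook.

```python
def build_model_properties(metrics, metric_name="accuracy"):
    """Return a properties dict for model registration, or fail with a clear message."""
    if metric_name not in metrics:
        raise ValueError(
            f"metric {metric_name!r} not found; has the run finished and logged it?"
        )
    # store the value as a string, matching how properties are displayed
    return {metric_name.capitalize(): str(metrics[metric_name])}

# sample dictionary shaped like the result of run.get_metrics()
sample_metrics = {"epochs": 1, "accuracy": 0.78125}
print(build_model_properties(sample_metrics))  # {'Accuracy': '0.78125'}
```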


In [35]:
model.get_model_path(model_name='bert_model',version=1,_workspace=ws)

'azureml-models/bert_model/1/bert_model'

In [26]:
from azureml.core.model import Model
Model(ws, 'bert_model',version=1)

Model(workspace=Workspace.create(name='docs-ws', subscription_id='403c34e4-adde-4596-85b1-272a798d7ef2', resource_group='openclassrooms'), name=bert_model, id=bert_model:1, version=1, tags={'Training context': 'Script'}, properties={'Accuracy': '0.78125'})
