Now we will continue on the [Conversation AI](https://conversationai.github.io/) dataset seen in [week 4 homework and lab](https://github.com/MIDS-scaling-up/v2/tree/master/week04). 
We shall use a version of pytorch BERT for classifying comments found at [https://github.com/huggingface/pytorch-pretrained-BERT](https://github.com/huggingface/pytorch-pretrained-BERT).  

The original implementation of BERT is optimised for TPU. Google released some amazing performance improvements on TPU over GPU, for example, see [here](https://medium.com/@ranko.mosic/googles-bert-nlp-5b2bb1236d78) - *BERT relies on massive compute for pre-training ( 4 days on 4 to 16 Cloud TPUs; pre-training on 8 GPUs would take 40–70 days).*. In response, Nvidia released [apex](https://devblogs.nvidia.com/apex-pytorch-easy-mixed-precision-training/), which gave mixed precision training. Weights are stored in float32 format, but calculations, like forward and backward propagation happen in float16 - this allows these calculations to be made with a [4X speed up](https://github.com/huggingface/pytorch-pretrained-BERT/issues/149).  

We shall apply BERT to the problem for classifiying toxicity, using apex from Nvidia. We shall compare the impact of hardware by running the model on a V100 and P100 and comparing the speed and accuracy in both cases.   

This script relies heavily on an existing [Kaggle kernel](https://www.kaggle.com/yuval6967/toxic-bert-plain-vanila) from [yuval r](https://www.kaggle.com/yuval6967). 
  
*Disclaimer: the dataset used contains text that may be considered profane, vulgar, or offensive.*

In [1]:
import sys, os
import numpy as np 
import pandas as pd 
import torch
import torch.nn as nn
import torch.utils.data
import torch.nn.functional as F
from sklearn.metrics import roc_auc_score
%load_ext autoreload
%autoreload 2
%matplotlib inline
from tqdm import tqdm, tqdm_notebook
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import warnings
warnings.filterwarnings(action='once')
import pickle
from apex import amp
import shutil

In [2]:
# Let's activate CUDA for GPU based operations
device=torch.device('cuda')
torch.cuda.empty_cache()

Change the PATH variable to whereever your `week06/hw` directory is located.  
**For the final run we would like you to have a train_size of at least 1 Million rows, and a valid size of at least 500K rows. When you first run the script, feel free to work with a reduced train and valid size for speed.** 

In [19]:
# In bert we need all inputs to have the same length, we will use the first 220 characters. 
class bert_training():
    def __init__(self,train_size,val_size,
                max_seq_length=220,seed=1234,parent_dir_path='/root/v2/week06/hw',
                bert_tf_model='uncased_L-12_H-768_A-12'):
        self._MAX_SEQUENCE_LENGTH = max_seq_length
        self._SEED = seed
        self._PATH = parent_dir_path
        self._DATA_DIR = os.path.join(self._PATH, "data")
        self._WORK_DIR = os.path.join(self._PATH, "workingdir")
        self._train_size=train_size
        self._val_size=val_size
        self._BERT_MODEL_PATH = os.path.join(self._DATA_DIR, bert_tf_model)
        self._tokenizer = BertTokenizer.from_pretrained(self._BERT_MODEL_PATH, cache_dir=None,do_lower_case=True)

    def tf_to_pytorch_model(self,pytorch_model_bin='pytorch_model.bin'):
        BERT_MODEL_PATH = os.path.join(self._DATA_DIR, self._BERT_MODEL_PATH)
        convert_tf_checkpoint_to_pytorch.convert_tf_checkpoint_to_pytorch(
                                    os.path.join(self._BERT_MODEL_PATH, 'bert_model.ckpt'),
                                    os.path.join(self._BERT_MODEL_PATH, 'bert_config.json'), 
                                    os.path.join(self._WORK_DIR, pytorch_model_bin))

        shutil.copyfile(os.path.join(BERT_MODEL_PATH, 'bert_config.json'), \
                        os.path.join(self._WORK_DIR, 'bert_config.json'))
        # This is the Bert configuration file
        bert_config = BertConfig(os.path.join(self._WORK_DIR, 'bert_config.json'))
        return bert_config
    
#     def convert_lines(self,example):
#         tokenizer = self._tokenizer
#         max_seq_length = self._MAX_SEQUENCE_LENGTH-2
#         all_tokens = []
#         longer = 0
#         for text in tqdm_notebook(example):
#             tokens_a = tokenizer.tokenize(text)
#             if len(tokens_a)>max_seq_length:
#                 tokens_a = tokens_a[:max_seq_length]
#                 longer += 1
#             one_token = tokenizer.convert_tokens_to_ids(["[CLS]"]+tokens_a+["[SEP]"])+[0] * (max_seq_length - len(tokens_a))
#             all_tokens.append(one_token)
#         print(longer)
#         return np.array(all_tokens)
    
    def predict_from_pretrained_model(self):
        bert = BertModel.from_pretrained(self._WORK_DIR).cuda()
        bert_output = bert(torch.tensor([input_ids]).cuda())
        return bert_output

    def tokenize(self,text):
        tokens = self._tokenizer.tokenize(text)
        tokens_bert = ["[CLS]"] + tokens + ["[SEP]"]
        input_ids = self._tokenizer.convert_tokens_to_ids(tokens_bert)
        return input_ids,tokens,tokens_bert
    

    def initialize_model_for_training(self,num_labels,EPOCHS=1,model_seed=21000,lr=2e-5,batch_size=32,
                                      accumulation_steps=2):
        # Setup model parameters
        np.random.seed(model_seed)
        torch.manual_seed(model_seed)
        torch.cuda.manual_seed(model_seed)
        torch.backends.cudnn.deterministic = True

        # Empty cache
        torch.cuda.empty_cache()

        model = BertForSequenceClassification.from_pretrained(self._WORK_DIR,cache_dir=None,num_labels=num_labels)
        model.zero_grad()
        model = model.to(device)
        param_optimizer = list(model.named_parameters())
        no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
        optimizer_grouped_parameters = [
            {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
            {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
            ]
        train = train_dataset
        num_train_optimization_steps = int(EPOCHS*len(train)/batch_size/accumulation_steps)
        optimizer = BertAdam(optimizer_grouped_parameters,
                             lr=lr,
                             warmup=0.05,
                             t_total=num_train_optimization_steps)

        model, optimizer = amp.initialize(model, optimizer, opt_level="O1",verbosity=0)
        model=model.train()
        return model,optimizer,EPOCHS

    def run_training(self,model,train,optimizer,EPOCHS=1,batch_size=32,accumulation_steps=2):
        tq = tqdm_notebook(range(EPOCHS))
        for epoch in tq:
            train_loader = torch.utils.data.DataLoader(train, batch_size=batch_size, shuffle=True)
            avg_loss = 0.
            avg_accuracy = 0.
            lossf=None
            tk0 = tqdm_notebook(enumerate(train_loader),total=len(train_loader),leave=False)
            optimizer.zero_grad()   # Bug fix - thanks to @chinhuic
            for i,(x_batch, y_batch) in tk0:
                y_pred = model(x_batch.to(device), attention_mask=(x_batch>0).to(device), labels=None)
                loss =  F.binary_cross_entropy_with_logits(y_pred,y_batch.to(device))
                with amp.scale_loss(loss, optimizer) as scaled_loss:
                    scaled_loss.backward()
                if (i+1) % accumulation_steps == 0:             # Wait for several backward steps
                    optimizer.step()                            # Now we can do an optimizer step
                    optimizer.zero_grad()
                if lossf:
                    lossf = 0.98*lossf+0.02*loss.item()
                else:
                    lossf = loss.item()
                tk0.set_postfix(loss = lossf)
                avg_loss += loss.item() / len(train_loader)
                avg_accuracy += torch.mean(((torch.sigmoid(y_pred[:,0])>0.5) == (y_batch[:,0]>0.5).to(device)).to(torch.float) ).item()/len(train_loader)
            tq.set_postfix(avg_loss=avg_loss,avg_accuracy=avg_accuracy)
            return model
       
    def predict(self,model,X_val,batch_size=32):
        for param in model.parameters():
            param.requires_grad=False
        model.eval()
        valid_preds = np.zeros((len(X_val)))
        valid = torch.utils.data.TensorDataset(torch.tensor(X_val,dtype=torch.long))
        valid_loader = torch.utils.data.DataLoader(valid, batch_size=batch_size, shuffle=False)

        tk0 = tqdm_notebook(valid_loader)
        for i,(x_batch,)  in enumerate(tk0):
            pred = model(x_batch.to(device), attention_mask=(x_batch>0).to(device), labels=None)
            valid_preds[i*batch_size:(i+1)*batch_size]=pred[:,0].detach().cpu().squeeze().numpy()
        return valid_preds
    
    def compute_auc_score(self,y, predictions):
        return roc_auc_score(y, predictions)
   


This should be the files you downloaded earlier when you ran `download.sh`

In [20]:
# train_size = 10000
# valid_size = 5000

train_size = 1000000
valid_size = 500000

bert_obj = bert_training(train_size=train_size,val_size=val_size) # create an instance of the setup class
DATA_DIR = bert_obj._DATA_DIR
WORK_DIR = bert_obj._WORK_DIR
os.listdir(DATA_DIR)

['download.sh',
 'cased_L-12_H-768_A-12',
 'test.csv',
 'train.csv',
 'uncased_L-12_H-768_A-12']

We shall install pytorch BERT implementation.   
If you would like to experiment with or view any code (purely optional, and not graded :) ), you can copy the files from the repo https://github.com/huggingface/pytorch-pretrained-BERT  

In [16]:
%%capture
from pytorch_pretrained_bert import convert_tf_checkpoint_to_pytorch
from pytorch_pretrained_bert import BertTokenizer, BertForSequenceClassification,BertAdam
from pytorch_pretrained_bert.modeling import BertModel
from pytorch_pretrained_bert import BertConfig

We shall now load the model. When you run this, comment out the `capture` command to understand the archecture.

In [7]:
%%capture
bert_config1 = bert_obj.tf_to_pytorch_model()

In [6]:
# %%capture
# # Translate model from tensorflow to pytorch
# BERT_MODEL_PATH = os.path.join(model1._DATA_DIR, 'uncased_L-12_H-768_A-12')
# convert_tf_checkpoint_to_pytorch.convert_tf_checkpoint_to_pytorch(
#                             os.path.join(BERT_MODEL_PATH, 'bert_model.ckpt'),
#                             os.path.join(BERT_MODEL_PATH, 'bert_config.json'), 
#                             os.path.join(model1._WORK_DIR, 'pytorch_model.bin'))

# shutil.copyfile(os.path.join(BERT_MODEL_PATH, 'bert_config.json'), \
#                 os.path.join(model1._WORK_DIR, 'bert_config.json'))
# # This is the Bert configuration file
# bert_config2 = BertConfig(os.path.join(model1._WORK_DIR, 'bert_config.json'))

Bert needs a special formatting of sentences, so we have a sentence start and end token, as well as separators.   
Thanks to this [script](https://www.kaggle.com/httpwwwfszyc/bert-in-keras-taming) for a fast convertor of the sentences.

In [18]:
def convert_lines(example, max_seq_length,tokenizer):
    max_seq_length -=2
    all_tokens = []
    longer = 0
    for text in tqdm_notebook(example):
        tokens_a = tokenizer.tokenize(text)
        if len(tokens_a)>max_seq_length:
            tokens_a = tokens_a[:max_seq_length]
            longer += 1
        one_token = tokenizer.convert_tokens_to_ids(["[CLS]"]+tokens_a+["[SEP]"])+[0] * (max_seq_length - len(tokens_a))
        all_tokens.append(one_token)
    print(longer)
    return np.array(all_tokens)


Now we load the BERT tokenizer and convert the sentences.

In [15]:
%%time
SEED = bert_obj._SEED
train_all = pd.read_csv(os.path.join(DATA_DIR, "train.csv")).sample(train_size+valid_size,random_state=SEED)
print('loaded %d records' % len(train_all))

# Make sure all comment_text values are strings
train_all['comment_text'] = train_all['comment_text'].astype(str) 

sequences =bert_obj.convert_lines(train_all["comment_text"].fillna("DUMMY_VALUE"))
train_all=train_all.fillna(0)

loaded 1500000 records


HBox(children=(IntProgress(value=0, max=1500000), HTML(value='')))


33724
CPU times: user 27min 52s, sys: 8.28 s, total: 28min
Wall time: 27min 51s


In [143]:
# %%time
# tokenizer = BertTokenizer.from_pretrained(BERT_MODEL_PATH, cache_dir=None,do_lower_case=True)
# train_all = pd.read_csv(os.path.join(DATA_DIR, "train.csv")).sample(train_size+valid_size,random_state=SEED)
# print('loaded %d records' % len(train_all))

# # Make sure all comment_text values are strings
# train_all['comment_text'] = train_all['comment_text'].astype(str) 

# sequences = convert_lines(train_all["comment_text"].fillna("DUMMY_VALUE"),MAX_SEQUENCE_LENGTH,tokenizer)
# train_all=train_all.fillna(0)

Let us look at how the tokenising works in BERT, see below how it recongizes misspellings - words the model never saw. 

In [21]:
train_all[["comment_text", 'target']].head()

Unnamed: 0,comment_text,target
458232,It's difficult for many old people to keep up ...,0.0
272766,She recognized that her tiny-handed husband is...,0.166667
339129,"HPHY76,\nGood for you for thinking out loud, w...",0.0
773565,And I bet that in the day you expected your Je...,0.5
476233,Kennedy will add a much needed and scientifica...,0.0


Lets tokenize some text (I intentionally mispelled some words to check berts subword information handling)

In [22]:
text = 'Hi, I am learning new things in w251 about deep learning the cloud and teh edge.'
input_ids,tokens,tokens_bert = bert_obj.tokenize(text)
' '.join(tokens)

'hi , i am learning new things in w ##25 ##1 about deep learning the cloud and te ##h edge .'

Added start and end token and convert to ids. This is how it is fed into BERT.

In [23]:
# tokens1 = ["[CLS]"] + tokens1 + ["[SEP]"]
' '.join(tokens_bert)
' '.join(map(str, input_ids))

'[CLS] hi , i am learning new things in w ##25 ##1 about deep learning the cloud and te ##h edge . [SEP]'

'101 7632 1010 1045 2572 4083 2047 2477 1999 1059 17788 2487 2055 2784 4083 1996 6112 1998 8915 2232 3341 1012 102'

When BERT converts this sentence to a torch tensor below is shape of the stored tensors.  
We have 12 input tensors, while the sentence tokens has length 23; where are can you see the 23 tokens in the tensors ?... **Feel free to post in slack or discuss in class**

In [24]:
# put input on gpu and make prediction
bert_output = bert_obj.predict_from_pretrained_model()
print('Sentence tokens {}'.format(tokens))
print('Number of tokens {}'.format(len(tokens)))
print('Tensor shapes : {}'.format([b.cpu().detach().numpy().shape for b in bert_output[0]]))
print('Number of torch tensors : {}'.format(len(bert_output[0])))

Sentence tokens ['hi', ',', 'i', 'am', 'learning', 'new', 'things', 'in', 'w', '##25', '##1', 'about', 'deep', 'learning', 'the', 'cloud', 'and', 'te', '##h', 'edge', '.']
Number of tokens 21
Tensor shapes : [(1, 23, 768), (1, 23, 768), (1, 23, 768), (1, 23, 768), (1, 23, 768), (1, 23, 768), (1, 23, 768), (1, 23, 768), (1, 23, 768), (1, 23, 768), (1, 23, 768), (1, 23, 768)]
Number of torch tensors : 12


As it is a binary problem, we change our target to [0,1], instead of float.   
We also split the dataset into a training and validation set, 

In [25]:
train_all['target']=(train_all['target']>=0.5).astype(float)
# Training data - sentences
X = sequences[:train_size] 
# Target - the toxicity. 
y = train_all[['target']].values[:train_size]
X_val = sequences[train_size:]                
y_val = train_all[['target']].values[train_size:]

In [26]:
test_df=train_all.tail(valid_size).copy()
train_df=train_all.head(train_size)

**From here on in we would like you to run BERT.**   
**Please do rely on the script available -  [Kaggle kernel](https://www.kaggle.com/yuval6967/toxic-bert-plain-vanila) from [yuval r](https://www.kaggle.com/yuval6967) - for at least the first few steps up to training and prediction.**


**1)**   
**Load the training set to a training dataset. For this you need to load the X sequences and y objects to torch tensors**   
**You can use `torch.utils.data.TensorDataset` to input these into a train_dataset.**

In [27]:
# Training data creations
train_dataset = torch.utils.data.TensorDataset(torch.tensor(X,dtype=torch.long), torch.tensor(y,dtype=torch.float))

**2)**  
**Set your learning rate and batch size; and optionally random seeds if you want reproducable results**   
**Load your pretrained BERT using `BertForSequenceClassification`**   
**Initialise the gradients and place the model on cuda, set up your optimiser and decay parameters**
**Initialise the model with `apex` (we imprted this as `amp`) for mixed precision training**

In [31]:
# Initialize the model for training 
model1,optimizer1,epochs = bert_obj.initialize_model_for_training(y.shape[1],EPOCHS=1)

**3)**  
**Start training your model by iterating through batches in a single epoch of the data**

In [32]:
%%time
# Train the model
model1=bert_obj.run_training(model1,train_dataset,optimizer1,EPOCHS=epochs)

HBox(children=(IntProgress(value=0, max=1), HTML(value='')))

HBox(children=(IntProgress(value=0, max=31250), HTML(value='')))

CPU times: user 1h 23min 39s, sys: 28min 40s, total: 1h 52min 20s
Wall time: 1h 52min 15s


**4)**  
**Store your trained model to disk, you will need it if you choose section 8C.**

In [34]:
output_model_file = WORK_DIR+"/bert_pytorch_v100a_train"+str(train_size)+".bin"
torch.save(model1.state_dict(), output_model_file)

**5)**   
**Now make a prediction for your validation set.**  

In [37]:
# The following 2 lines are not needed but show how to download the model for prediction
output_model_file = WORK_DIR+"/bert_pytorch_v100a_train"+str(train_size)+".bin"
model = BertForSequenceClassification(bert_config1,num_labels=y.shape[1])
model.load_state_dict(torch.load(output_model_file))
model.to(device)

IncompatibleKeys(missing_keys=[], unexpected_keys=[])

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): FusedLayerNorm(torch.Size([768]), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): FusedLayerNorm(torch.Size([768]), eps=1e-12, eleme

In [38]:
%%time
predictions=bert_obj.predict(model,X_val)

HBox(children=(IntProgress(value=0, max=15625), HTML(value='')))

CPU times: user 12min 20s, sys: 3min 26s, total: 15min 46s
Wall time: 15min 42s


**6)**  
**In the yuval's kernel he get a metric based on the metric for the jigsaw competition - it is quite complicated. Instead, we would like you to measure the `AUC`, similar to how you did in homework 04. You can compare the results to HW04**  
*A tip, if your score is lower than homework 04 something is wrong....*

In [39]:
predictions = torch.sigmoid(torch.tensor(predictions)).numpy()
auc_score = bert_obj.compute_auc_score(y_val,predictions)
print("AUC score = " , round(auc_score,5))

AUC score =  0.97004


**7)**  
**Can you show/print the validation sentences predicted with the highest and lowest toxicity ?**

In [40]:
idx_most_toxic=predictions.argsort()[-5:][::-1]
for index,row in enumerate(idx_most_toxic):
    print("Sentence: ",index+1)
    print("True Target Value:",train_all.iloc[row].target,"Predicted Target Value:",round(predictions[row],4))
    print()
    print(train_all.iloc[row].comment_text)
    print("-"*100)


Sentence:  1
True Target Value: 1.0 Predicted Target Value: 0.9995

"An idea that seemed plausible and gathered a lot of excitement when it was proposed decades ago lost its traction as the 2008 economic crash took its toll on the RTD system."

This is absolute hogwash. The northwest rail line was dead in the water before the 2008 crash ever happened. Most of the same factors that make the line extremely unlikely to be built in most of our lifetimes existed at the time Fastracks was being proposed. So why was it part of the plan? Because Crooked Cal Marsella knew he needed a big turnout from Boulder and the northwest metro area in order to win at the ballot box.

My frustration is not that the northwest line won't be built, it's that RTD didn't follow through on its Bus Rapid Transit promises (no, the Flatiron Flyer isn't BRT - it's the BV, BF and BX with a new paint job) and continues to make bus service cuts in this area. And if we contact our RTD rep about it, she's too dimwitted to

In [41]:
idx_least_toxic=predictions.argsort()[:5]
for index,row in enumerate(idx_least_toxic):
    print("Sentence: ",index+1)
    print("True Target Value:",train_all.iloc[row].target,"Predicted Target Value:",round(predictions[row],4))
    print()
    print(train_all.iloc[row].comment_text)
    print("-"*100)


Sentence:  1
True Target Value: 0.0 Predicted Target Value: 0.0001

Mr. Speaker,

Please inform our Russian friends that no one in Canada votes for the Prime Minister except the people in the Prime Minister's electoral riding.

Prime Minister Trudeau received  almost 52% of the popular vote in his riding, a majority of votes cast.
----------------------------------------------------------------------------------------------------
Sentence:  2
True Target Value: 1.0 Predicted Target Value: 0.0001

In a business world (Donaldland) bottom line is $$$$. Lots of $$$$$. So why not deal with Russia businesslike? Not politics but busine$$. That's what's happening. Hey!! Mr.Day One Trump, show us the $$$. Words are cheap. Promises are cheap. Bankrupcies is just a tool we in business use.  Flying to Florida this weekend?? Lol, you know Mar A lago is just a secret palace where deals get made. Lol. You voted for this clown.
--------------------------------------------------------------------------

**8)**  
**Pick only one of the below items and complete it. The last two will take a good amount of time (and partial success on them is fine), so proceed with caution on your choice of items :)** 
  
  
**A. Can you train on two epochs ?**

**B. Can you change the learning rate and improve validation score ?**
   
**C. Make a prediction on the test data set with your downloaded model and submit to Kaggle to see where you score on public LB - check out [Abhishek's](https://www.kaggle.com/abhishek) script - https://www.kaggle.com/abhishek/pytorch-bert-inference . Note, you will need to fork Abhisheks kernel, swap out the weights to your downloaded weights and commit the kernel. When finalised and you get the output, there is a button to submit to the competition**  
  
**D. Get BERT running on the tx2 for a sample of the data.** 
  
**E. Finally, and very challenging -- the `BertAdam` optimiser proved to be suboptimal for this task. There is a better optimiser for this dataset in this script [here](https://www.kaggle.com/cristinasierra/pretext-lstm-tuning-v3). Check out the `custom_loss` function. Can you implement it ? It means getting under the hood of the `BertForSequenceClassification` at the source repo and implementing a modified version locally .  `https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling.py`**

### A. Training on 2 epochs

In [43]:
%%time
# Initialize 
model2,optimizer2,epochs = bert_obj.initialize_model_for_training(y.shape[1],EPOCHS=2)

# Train and save the model
model2=bert_obj.run_training(model2,train_dataset,optimizer2,EPOCHS=epochs)
output_model_file = WORK_DIR+"/bert_pytorch_v100a_train"+str(train_size)+"_epochs_"+str(epochs)+".bin"
torch.save(model.state_dict(), output_model_file)

# Make the predictions
predictions = bert_obj.predict(model2,X_val)
predictions = torch.sigmoid(torch.tensor(predictions)).numpy() # add a final sigmoid layer
auc_score = bert_obj.compute_auc_score(y_val,predictions)
print("AUC score = " , round(auc_score,5))


HBox(children=(IntProgress(value=0, max=2), HTML(value='')))

HBox(children=(IntProgress(value=0, max=31250), HTML(value='')))

IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)

IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)

IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)



HBox(children=(IntProgress(value=0, max=15625), HTML(value='')))

AUC score =  0.96963
CPU times: user 1h 40min 53s, sys: 30min 9s, total: 2h 11min 3s
Wall time: 2h 10min 48s


### B. Change the learning rate

In [45]:
%%time
# Set learning rate 
lr = 1e-5

# Initialize 
model3,optimizer3,epochs = bert_obj.initialize_model_for_training(y.shape[1],EPOCHS=1,lr=lr)

# Train and save the model
model3=bert_obj.run_training(model3,train_dataset,optimizer3,EPOCHS=epochs)
output_model_file = WORK_DIR+"/bert_pytorch_v100a_train"+str(train_size)+"_lr_"+str(lr)+"_epochs_"+str(epochs)+".bin"
torch.save(model.state_dict(), output_model_file)

# Make the predictions
predictions = bert_obj.predict(model3,X_val)
predictions = torch.sigmoid(torch.tensor(predictions)).numpy() # add a final sigmoid layer
auc_score = bert_obj.compute_auc_score(y_val,predictions)
print("AUC score = " , round(auc_score,5))

HBox(children=(IntProgress(value=0, max=1), HTML(value='')))

HBox(children=(IntProgress(value=0, max=31250), HTML(value='')))

IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)

IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)



HBox(children=(IntProgress(value=0, max=15625), HTML(value='')))

AUC score =  0.96916
CPU times: user 1h 44min 38s, sys: 28min 15s, total: 2h 12min 53s
Wall time: 2h 12min 40s
