## Import Libraries

In [None]:
#pip install smaberta
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import random
import torch
import pickle
random.seed(1)
np.random.seed(1)
torch.manual_seed(1)
torch.cuda.manual_seed(1)

import sys
sys.path.append('../smaberta')
from smaberta import TransformerModel

### Loading Data

Load the training data, stored in CSV format, using Pandas. Any format should work as long as it provides some form of text and the accompanying labels.

In [None]:
#change file path as needed
train_df = pd.read_csv("../data/cmv_train.csv")

### Previewing data

In [None]:
#Just to get an idea of what this dataset looks like
print(len(train_df.label.values))

In [None]:
train_df.head()

In [None]:
print(train_df.text[:10].tolist(), train_df.label[:10].tolist())

### Learning Parameters
These are the training arguments I will use to train the classifier. The current values are simple placeholders; I may ultimately tune them with a grid search, a random search with cross-validation, or some other approach.

In [None]:
lr = 1e-3
epochs = 2
print("Learning Rate ", lr)
print("Train Epochs ", epochs)
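
The random search mentioned above could start from something like the sketch below. It only draws candidate settings; it does not train anything. The function name `sample_configs`, the learning-rate range, and the epoch choices are all assumptions made for illustration, not values taken from this project.

```python
import math
import random

def sample_configs(n_trials, lr_range=(1e-5, 1e-3), epoch_choices=(1, 2, 3, 4), seed=1):
    """Draw n_trials random (learning rate, epochs) settings.

    Learning rates are sampled log-uniformly so that each order of
    magnitude inside lr_range is equally likely. The default ranges
    are illustrative assumptions.
    """
    rng = random.Random(seed)  # local RNG, leaves the global seed untouched
    lo, hi = (math.log10(v) for v in lr_range)
    configs = []
    for _ in range(n_trials):
        configs.append({
            "learning_rate": 10 ** rng.uniform(lo, hi),
            "num_train_epochs": rng.choice(epoch_choices),
        })
    return configs

configs = sample_configs(5)
for cfg in configs:
    print(cfg)
```

Each sampled dictionary would then be used to initialise and train one model, with the settings compared on held-out data rather than on the training set.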

### Initialise model

The following notes are based on advice from TowardsDataScience:
1. The first argument selects the RoBERTa architecture.
2. The second argument names the initialisation point, as provided by Huggingface [here](https://huggingface.co/transformers/pretrained_models.html). Examples - roberta-base, roberta-large, gpt2-large...
3. The tokenizer accepts free-form text and transforms it into a sequence of tokens suitable as input to the transformer. The transformer processes these tokens and passes its representation to the classifier head, which maps it into the label space.
4. The number of labels is specified below so that the classification head is initialised appropriately. Change it to match your classification task.
5. The training arguments set above (`epochs` and `lr`) are passed to the model initialisation below.
6. Pass in the remaining training arguments. Note especially the output directory, where the model is saved and the training logs are written. `overwrite_output_dir` is a safeguard in case you are rerunning the experiment. Similarly, if you rerun the same experiment with different parameters, you might not want to reprocess the input every time: it is cached the first time, so you may be able to reuse it (`reprocess_input_data`). `fp16` sets the floating-point precision, which you choose according to the GPUs available to you; it should affect speed and memory use, not the classification result.

In [None]:
model = TransformerModel('roberta', 'roberta-base', num_labels=25, reprocess_input_data=True, num_train_epochs=epochs, learning_rate=lr, 
                  output_dir='./saved_model/', overwrite_output_dir=True, fp16=False)
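
To make the overwrite safeguard from the notes above concrete, here is a minimal sketch of the general idea. It is not how smaberta implements `overwrite_output_dir`; the helper `prepare_output_dir` is a hypothetical name written only for illustration.

```python
from pathlib import Path

def prepare_output_dir(output_dir, overwrite=False):
    """Create output_dir, refusing to reuse a non-empty one unless overwrite is True."""
    path = Path(output_dir)
    # A directory that already holds files probably contains an earlier run
    if path.is_dir() and any(path.iterdir()) and not overwrite:
        raise FileExistsError(
            f"{path} already contains files; pass overwrite=True to reuse it"
        )
    path.mkdir(parents=True, exist_ok=True)
    return path

# Example call (commented out so nothing is created by accident):
# prepare_output_dir('./saved_model/', overwrite=True)
```

The point of the check is that a rerun with different parameters cannot silently replace an earlier saved model unless that is requested explicitly.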

### Run Training

In [None]:
model.train(train_df['text'], train_df['label'])
#To see more in-depth logs, set the flag show_running_loss=True on the train call

### Loading the Saved Model

In [None]:
model = TransformerModel('roberta', 'roberta-base',  num_labels=25, location="./saved_model/")

### Evaluate on Test Data

In [None]:
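#test_df is not loaded anywhere above; load the held-out test set (same columns as train_df) before running this cell
result, model_outputs, wrong_predictions = model.evaluate(test_df['text'], test_df['label'])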
preds = np.argmax(model_outputs, axis = 1)

In [None]:
len(test_df), len(preds)

In [None]:
correct = 0
labels = test_df['label'].tolist()
for i in range(len(labels)):
    if preds[i] == labels[i]:
        correct+=1

accuracy = correct/len(labels)
print("Accuracy: ", accuracy)
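
With 25 labels, a single accuracy figure can hide labels the model rarely gets right. The sketch below breaks accuracy down per true label. The helper name `per_label_accuracy` and the toy lists are assumptions; in the notebook the inputs would be `labels` and `preds` from the cells above.

```python
from collections import Counter

def per_label_accuracy(labels, preds):
    """Return {label: fraction of examples with that true label predicted correctly}."""
    totals = Counter(labels)
    hits = Counter(l for l, p in zip(labels, preds) if l == p)
    return {label: hits[label] / totals[label] for label in sorted(totals)}

# Toy data standing in for test_df['label'].tolist() and preds
example_labels = [0, 0, 1, 1, 1, 2]
example_preds = [0, 1, 1, 1, 0, 2]
print(per_label_accuracy(example_labels, example_preds))
```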

In [None]:
#Use a context manager so the file is closed after writing
with open("../model_outputs.pkl", "wb") as f:
    pickle.dump(model_outputs, f)

### Final Steps

The tutorial I used to gather this code ended with a few use cases that do not apply to this project. I still need to write the final code that applies the trained model to label the remaining dataset.
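
As a starting point for that labelling step, the sketch below shows one possible shape for it. How the trained model is called on unlabelled text is not shown in this notebook, so the model call is left as a placeholder: `predict_fn` is a hypothetical callable that takes a list of strings and returns one label per string, and `label_in_batches` is likewise a name made up for illustration.

```python
def label_in_batches(texts, predict_fn, batch_size=32):
    """Apply predict_fn to texts in chunks and return one label per text.

    predict_fn is a placeholder: any callable that takes a list of strings
    and returns a sequence of labels of the same length.
    """
    labels = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        batch_labels = list(predict_fn(batch))
        if len(batch_labels) != len(batch):
            raise ValueError("predict_fn must return one label per input text")
        labels.extend(batch_labels)
    return labels

# Dummy predictor so the sketch runs without a trained model
def dummy_predict(batch):
    return [len(text) % 2 for text in batch]

print(label_in_batches(["a", "bb", "ccc"], dummy_predict, batch_size=2))
```

Processing in batches keeps memory use bounded when the remaining dataset is large, and the length check catches a predictor that drops or duplicates rows.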