# Plaidbot Training

This notebook lets you configure and train a BERT-based model that predicts the author of a Slack message. Follow each instruction closely.

Because training uses BERT models, we recommend upgrading to a higher Google Colab tier so that you can use a GPU, which makes training significantly faster. If you have upgraded, make sure to set the runtime to use a GPU.
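
As a quick sanity check on the runtime, the sketch below reports whether PyTorch can see a GPU. The helper name `pick_device` is hypothetical and not part of `src`; it only assumes that PyTorch may or may not be installed, and falls back to `'cpu'` in either failure case.

```python
def pick_device():
    """Return 'cuda:0' when PyTorch can see a GPU, otherwise 'cpu'."""
    try:
        # Imported lazily so the check still works if PyTorch is missing
        import torch
    except ImportError:
        return 'cpu'
    return 'cuda:0' if torch.cuda.is_available() else 'cpu'


device = pick_device()
print(f'Training device: {device}')
```

The returned string matches the values expected by `model_opts.device` in the Options section, so it could be assigned there instead of a hard-coded value.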

## Setup

In [None]:
!pip install transformers

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Ensure this path corresponds to the 'src' folder that you copied to Google Drive
!cp -r '/content/drive/MyDrive/Colab Notebooks/plaidbot_v2/src' .

In [None]:
from datetime import datetime
from src.options.model_options import ModelOptions
from src.options.prepro_options import PreproOptions
from src.training.pick_users.run_pick_users import run_pick_users
from src.training.select_data.run_select_data import run_select_data
from src.training.train.run_training import run_training
from src.training.train.run_evaluation import run_evaluation

## Options

The defaults are fine for most of these options, but pay close attention to the comments, as some options require credentials or customization.


In [None]:
prepro_opts = PreproOptions()

# File path options
prepro_opts.message_folder = '/content/drive/MyDrive/Colab Notebooks/plaidbot_v2/messages' # Folder containing Slack message folders
prepro_opts.user_filename = 'users.json' # File from Slack where user info is stored. This should not need changing
prepro_opts.selected_folders = [
    'general',
    'random',
    # add desired folders that you wish to use messages from for training
]

# Filtering Options
prepro_opts.min_date = datetime(2018, 1, 1) # Earliest message date; older messages are filtered out
prepro_opts.min_num_words = 3 # Minimum number of words in a message. Messages with fewer words than this are filtered out
prepro_opts.max_messages = 100000 # Maximum number of messages to use for training and testing
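
To make the filtering options concrete, here is a minimal sketch of how filters like these could be applied. It is not the code in `src`: `filter_messages` is a hypothetical helper, and the `ts` and `text` fields are assumed to follow the Slack export format, where `ts` is a string holding an epoch timestamp.

```python
from datetime import datetime


def filter_messages(messages, min_date, min_num_words, max_messages):
    """Keep messages that are recent enough and long enough, up to a cap."""
    kept = []
    for msg in messages:
        # Stop once the cap on the number of messages is reached
        if len(kept) >= max_messages:
            break
        # Assumed Slack export format: 'ts' is an epoch timestamp stored as a string
        sent_at = datetime.fromtimestamp(float(msg['ts']))
        if sent_at < min_date:
            continue
        # Count whitespace-separated words
        if len(msg['text'].split()) < min_num_words:
            continue
        kept.append(msg)
    return kept


sample = [
    {'ts': '1496275200.000100', 'text': 'this one is too old'},    # mid-2017
    {'ts': '1559347200.000200', 'text': 'too short'},               # mid-2019, 2 words
    {'ts': '1559347300.000300', 'text': 'this message is kept'},    # mid-2019, 4 words
]
kept = filter_messages(sample, datetime(2018, 1, 1), 3, 100000)
print([m['text'] for m in kept])
```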

In [None]:
model_opts = ModelOptions()

# Defaults should be fine for most of these, unless you want to do your own fine-tuning
model_opts.bert_model_name = 'distilbert-base-uncased' # BERT model name. Must work with DistilBertForSequenceClassification
model_opts.max_len = 200 # Max characters per message
model_opts.val_size = 0.2 # Proportion of messages used for validation
model_opts.num_epochs = 2 # Number of training epochs
model_opts.batch_size = 8 # Batch size for data loader
model_opts.learning_rate = 2e-5 # Learning rate of optimizer

# Device used for training. If using a GPU, use 'cuda:0', otherwise use 'cpu'
model_opts.device = 'cuda:0' 

model_opts.saved_model_name = 'username/model-name' # Saved model name for HuggingFace
model_opts.auth_token = 'hugging-face-auth-token-goes-here' # HuggingFace access token, available at https://huggingface.co/settings/tokens
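
Pasting an access token directly into a notebook makes it easy to leak when the notebook is shared. One possible alternative, sketched below, reads the token from an environment variable. The helper `read_hf_token` and the variable name `HF_TOKEN` are assumptions for illustration, not something `src` requires.

```python
import os


def read_hf_token(env_var='HF_TOKEN'):
    """Return the token stored in the given environment variable."""
    # env_var is an assumed name; use whatever variable you set yourself
    token = os.environ.get(env_var, '').strip()
    if not token:
        raise RuntimeError(f'Set the {env_var} environment variable first')
    return token


# Possible usage once the variable is set:
# model_opts.auth_token = read_hf_token()
```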

## Pick Users and Data

Run these cells, and save a copy of the user dictionary somewhere temporarily. You will need it when deploying your model.

In [None]:
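# Named after the options they are assigned to below:
# a user ID -> integer dict and an integer -> user name dict
user_id_int_dict, user_int_name_dict = run_pick_users(prepro_opts)

In [None]:
prepro_opts.user_id_int_dict = user_id_int_dict
model_opts.user_int_name_dict = user_int_name_dict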
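
One way to keep the user dictionary for deployment is to write it to a JSON file. The sketch below uses a made-up example dictionary and hypothetical helpers `save_user_dict` and `load_user_dict`; it assumes the dictionary maps integer labels to user names. JSON stores object keys as strings, so the loader converts them back to integers.

```python
import json
import os
import tempfile

# Made-up example of an integer label -> user name mapping
example_dict = {0: 'alice', 1: 'bob', 2: 'carol'}


def save_user_dict(mapping, path):
    """Write the mapping to a JSON file."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(mapping, f, indent=2)


def load_user_dict(path):
    """Read the mapping back, restoring integer keys."""
    with open(path, encoding='utf-8') as f:
        raw = json.load(f)
    # JSON turned the integer keys into strings; convert them back
    return {int(label): name for label, name in raw.items()}


# A temporary directory is used here; a Google Drive path would persist across sessions
path = os.path.join(tempfile.gettempdir(), 'user_int_name_dict.json')
save_user_dict(example_dict, path)
restored = load_user_dict(path)
print(restored)
```

At deployment time, a predicted class index can then be looked up directly, for example `restored[1]`.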

In [None]:
train_messages, test_messages = run_select_data(prepro_opts)

## Train and Evaluate

In [None]:
model = run_training(train_messages, model_opts)

In [None]:
run_evaluation(model, test_messages, model_opts)

## Save the model to HuggingFace

In [None]:
from huggingface_hub import notebook_login
notebook_login()

In [None]:
# Unwrap the underlying model and upload it under the name set in model_opts.saved_model_name
base_model = model.get_inner_model().get_base_model()
base_model.push_to_hub(model_opts.saved_model_name)