# Fine Tuning Models - Using Custom Data
> Fine-tuning using your own data

In this notebook, we'll use https://huggingface.co/transformers/custom_datasets.html as a guide for our work.  The notebook headers mirror those of Notebook 3.  However, in this notebook, we'll use our own custom data, available in the `workshop-files` subdirectory.  Some code has already been provided from Notebook 2.  Other code we will write together.  See the solutions notebook if you fall behind!

# 0. Preliminaries
You can use the following code to mount your drive and cd into the relevant directory.  Uncomment the git clone command if you don't have the `deep-learning-intensive` repo already cloned.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
%cd drive/MyDrive
#!git clone https://github.com/vanderbilt-data-science/deep-learning-intensive.git
%cd deep-learning-intensive

# 1.  Installing Required Packages
Note that this is mostly required if you're on Google Colab.

In [None]:
! pip install transformers
! pip install datasets

# 2. Importing Packages for Use

In [None]:
import glob

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

from datasets import load_dataset, load_metric, Dataset
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
import torch

# 3. Load Data
## Read in data and convert to dataframe

In [None]:
#get filenames list
filenames = glob.glob('workshop-files/*.txt')

#read file contents
file_contents = []
for file in filenames:
    with open(file, 'r') as f:
        file_contents.append(f.read())

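#convert to df (os.path.basename strips the directory on any OS, unlike splitting on '/')
import os
tinfo_df = pd.DataFrame({'filename':[os.path.basename(fname) for fname in filenames], 'text':file_contents})
#article id is the filename without its extension, e.g. '551293.txt' -> 551293
tinfo_df['article_id'] = tinfo_df['filename'].apply(lambda x: int(x.split('.')[0]))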

#read author csv
author_df = pd.read_csv('workshop-files/author_data.csv')

#join
full_df = pd.merge(author_df, tinfo_df, on='article_id')
full_df.head()

Unnamed: 0,last_name,first_name,age,years_of_journalism,college major,article_id,filename,text
0,west,enrique,56,12,humanities,551293,551293.txt,"The rain and wind abruptly stopped, but the sk..."
1,braun,damien,43,13,humanities,373587,373587.txt,She patiently waited for his number to be call...
2,osborn,ellie,22,2,engineering,597061,597061.txt,The chair sat in the corner where it had been ...
3,vega,cierra,67,34,science,434648,434648.txt,The computer wouldn't start. She banged on the...
4,cantrell,alden,53,23,science,532970,532970.txt,Do you really listen when you are talking with...
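`pd.merge` performs an inner join by default, so any article without a matching author row (or the reverse) is dropped without warning. The sketch below shows one way to check for that before merging. It uses plain Python sets rather than the notebook's dataframes, and the filenames and ids (including `999999`) are illustrative assumptions, not the real workshop files.

```python
import os

# Hypothetical stand-ins for the glob results and the ids in author_data.csv.
filenames = ['workshop-files/551293.txt',
             'workshop-files/373587.txt',
             'workshop-files/999999.txt']
author_ids = [551293, 373587, 597061]

# Same parsing as the notebook: drop the directory, then the extension.
article_ids = [int(os.path.basename(name).split('.')[0]) for name in filenames]

# An inner join keeps only ids that appear on both sides.
matched = sorted(set(article_ids) & set(author_ids))
texts_without_author = sorted(set(article_ids) - set(author_ids))
authors_without_text = sorted(set(author_ids) - set(article_ids))

print('kept by the join:     ', matched)
print('texts with no author: ', texts_without_author)
print('authors with no text: ', authors_without_text)
```

If either of the last two lists is non-empty for the real data, `full_df` will have fewer rows than expected.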


## Add training labels and split column
Note that our data currently doesn't have any training labels, so I'll make some up here and add them to the dataframe.  I'll also add a split column (0 for training, 1 for validation).

In [None]:
#create training labels
label_dict = {0:'elle', 1:'people'}
labels = [0]*10 + [1]*10
full_df['labels'] = pd.Series(labels).sample(frac=1, random_state=2345).reset_index(drop=True)

#create split labels
splits = [0]*15 + [1]*5
full_df['split'] = pd.Series(splits).sample(frac=1, random_state=2323).reset_index(drop=True)

#view
full_df.head()

Unnamed: 0,last_name,first_name,age,years_of_journalism,college major,article_id,filename,text,labels,split
0,west,enrique,56,12,humanities,551293,551293.txt,"The rain and wind abruptly stopped, but the sk...",0,0
1,braun,damien,43,13,humanities,373587,373587.txt,She patiently waited for his number to be call...,1,0
2,osborn,ellie,22,2,engineering,597061,597061.txt,The chair sat in the corner where it had been ...,0,0
3,vega,cierra,67,34,science,434648,434648.txt,The computer wouldn't start. She banged on the...,0,0
4,cantrell,alden,53,23,science,532970,532970.txt,Do you really listen when you are talking with...,1,1
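Because the labels and the split flags are shuffled independently, nothing guarantees that both classes are evenly represented in each split. The sketch below illustrates a quick balance check. It is an assumption-laden stand-in: it shuffles with Python's `random` module, so the resulting order will not match the pandas `sample` calls above, only the overall counts.

```python
import random
from collections import Counter

# Same made-up label and split pools as the notebook cell.
labels = [0] * 10 + [1] * 10
splits = [0] * 15 + [1] * 5

# Independent seeded shuffles (not the same permutation pandas produces).
random.Random(2345).shuffle(labels)
random.Random(2323).shuffle(splits)

# Count how many examples of each label landed in each split.
per_split = Counter(zip(splits, labels))
for (split, label), count in sorted(per_split.items()):
    print(f"split={split} label={label} count={count}")
```

With only five validation rows, a lopsided validation split is quite possible, which is worth keeping in mind when reading any accuracy number later on.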


# 4. Load Tokenizer

In [None]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokenizer.name_or_path

'bert-base-cased'

# 5. Tokenize Inputs and Convert to PyTorch Dataset

In [None]:
#create tokenized representations
train_encodings = tokenizer(full_df.query('split==0')['text'].tolist(), truncation=True, padding='longest')
val_encodings = tokenizer(full_df.query('split==1')['text'].tolist(), truncation=True, padding='longest')
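To make `padding='longest'` concrete, here is a minimal plain-Python sketch of what it produces: every sequence in one tokenizer call is right-padded to the length of the longest sequence in that call, and an attention mask marks real tokens with 1 and padding with 0. The helper `pad_to_longest` and the token ids are hypothetical; `pad_id=0` is an assumption matching BERT's `[PAD]` id.

```python
def pad_to_longest(batch, pad_id=0):
    """Right-pad each list of token ids to the longest length in the batch."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

# Made-up token ids for two "sentences" of different lengths.
toy_batch = [[101, 7, 8, 9, 102], [101, 7, 102]]
padded = pad_to_longest(toy_batch)

print(padded['input_ids'])
print(padded['attention_mask'])
```

Since the training and validation texts are tokenized in separate calls, each set is padded to its own longest sequence, so the two sets may end up with different sequence lengths. `truncation=True` additionally caps every sequence at the model's maximum length.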

In [None]:
#helpers for class size and class names
no_classes = len(full_df.query('split==0')['labels'].unique())
train_classes = [label_dict[class_ind] for class_ind in range(no_classes)]

In [None]:
#Create custom Datasets Class
class ArticlesDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

#Create datasets from encodings
train_dataset = ArticlesDataset(train_encodings, full_df.query('split==0')['labels'].tolist())
val_dataset = ArticlesDataset(val_encodings, full_df.query('split==1')['labels'].tolist())
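The tokenizer returns its output column-wise (one list per field, covering all examples), while `Trainer` asks the dataset for one example at a time. The sketch below mirrors the `__getitem__` logic of `ArticlesDataset` without torch so the reshaping is easy to inspect. `PlainArticlesDataset` and the toy encodings are hypothetical stand-ins, and it returns plain lists where the real class returns tensors.

```python
class PlainArticlesDataset:
    """Torch-free stand-in for ArticlesDataset (lists instead of tensors)."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # Pick row idx out of every column, then attach that row's label.
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)

# Made-up encodings for two examples.
toy_encodings = {
    'input_ids': [[101, 7, 102], [101, 8, 102]],
    'attention_mask': [[1, 1, 1], [1, 1, 1]],
}
toy = PlainArticlesDataset(toy_encodings, [0, 1])

print(len(toy))
print(toy[1])
```

The key name `labels` matters: it is the argument name the sequence classification model expects when it computes the loss.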

# 6. Split Data
Already done above, via the `split` column!  Whoo!

# 7. Create Model for Task

In [None]:
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=no_classes, id2label=label_dict)
model.name_or_path

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

'bert-base-cased'

# 8. Setup arguments for training

In [None]:
training_args = TrainingArguments("test_trainer",
                                 logging_strategy='epoch')
training_args

TrainingArguments(output_dir=test_trainer, overwrite_output_dir=False, do_train=False, do_eval=False, do_predict=False, evaluation_strategy=IntervalStrategy.NO, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, lr_scheduler_type=SchedulerType.LINEAR, warmup_ratio=0.0, warmup_steps=0, logging_dir=runs\May26_02-10-13_PROVL-CX0L7Y2, logging_strategy=IntervalStrategy.EPOCH, logging_first_step=False, logging_steps=500, save_strategy=IntervalStrategy.STEPS, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level=O1, fp16_backend=auto, fp16_full_eval=False, local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=[], dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=0, past_index=-1, run_na

# 9. Train model (without computing metrics)

In [None]:
#trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=val_dataset)

In [None]:
#trainer.train()

# 10. Train model using evaluation metric

In [None]:
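#note: load_metric is deprecated in newer releases of datasets; if it is unavailable,
#the separate evaluate package offers evaluate.load("accuracy") as its replacement
metric = load_metric("accuracy")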

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics
)
trainer.train()

Step,Training Loss
2,0.7796
4,0.696
6,0.6423


TrainOutput(global_step=6, training_loss=0.7059771219889323, metrics={'train_runtime': 3.5836, 'train_samples_per_second': 1.674, 'total_flos': 3830988719700.0, 'epoch': 3.0})
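The `global_step=6` and the logged steps 2, 4, and 6 follow directly from the dataset size and the default arguments printed earlier. The arithmetic below is a sketch that assumes a single device, `gradient_accumulation_steps=1`, and `dataloader_drop_last=False`, all of which appear in the `TrainingArguments` output above.

```python
import math

num_train_examples = 15  # rows with split == 0
batch_size = 8           # per_device_train_batch_size (default)
num_epochs = 3           # num_train_epochs (default)

# The last, partial batch is kept, so round up.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# logging_strategy='epoch' logs once at the end of every epoch.
logged_steps = [steps_per_epoch * epoch for epoch in range(1, num_epochs + 1)]

print(total_steps)   # 6
print(logged_steps)  # [2, 4, 6]
```

Six optimizer steps is very little training, which is consistent with the loss only moving from about 0.78 to 0.64.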

# 11. Additional Exercises with `Trainer`
## Evaluate

In [None]:
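#note: this scores the model on the training set; call trainer.evaluate() with no
#argument to use the eval_dataset (val_dataset) passed to the Trainer instead
trainer.evaluate(train_dataset)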

{'eval_loss': 0.6205865144729614,
 'eval_accuracy': 0.8,
 'eval_runtime': 0.368,
 'eval_samples_per_second': 40.759,
 'epoch': 3.0}

## Predict

In [None]:
trainer.predict(train_dataset)

PredictionOutput(predictions=array([[-0.29494038, -0.5705005 ],
       [-0.59375405, -0.46901992],
       [-0.2581168 , -0.4892554 ],
       [-0.20425233, -0.61265355],
       [-0.5602172 , -0.5579459 ],
       [-0.2200632 , -0.5284079 ],
       [-0.55317354, -0.5999937 ],
       [-0.4260145 , -0.45691708],
       [-0.36270788, -0.3824813 ],
       [-0.42646652, -0.30376187],
       [-0.37042508, -0.28922018],
       [-0.1676437 , -0.5701666 ],
       [-0.6241655 , -0.407779  ],
       [-0.54866135, -0.4822002 ],
       [-0.5887574 , -0.47976074]], dtype=float32), label_ids=array([0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1], dtype=int64), metrics={'test_loss': 0.6205865144729614, 'test_accuracy': 0.8, 'test_runtime': 0.3172, 'test_samples_per_second': 47.291})
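The reported `test_accuracy` of 0.8 can be reproduced by hand from the printed output, which is a useful check on what `compute_metrics` does: take the argmax of each logit row and compare it with the label. The sketch below copies the logits and `label_ids` from the output above and uses plain Python in place of `np.argmax` and the accuracy metric.

```python
# Logits and labels copied from the PredictionOutput above.
logits = [
    (-0.29494038, -0.5705005), (-0.59375405, -0.46901992),
    (-0.2581168, -0.4892554), (-0.20425233, -0.61265355),
    (-0.5602172, -0.5579459), (-0.2200632, -0.5284079),
    (-0.55317354, -0.5999937), (-0.4260145, -0.45691708),
    (-0.36270788, -0.3824813), (-0.42646652, -0.30376187),
    (-0.37042508, -0.28922018), (-0.1676437, -0.5701666),
    (-0.6241655, -0.407779), (-0.54866135, -0.4822002),
    (-0.5887574, -0.47976074),
]
label_ids = [0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1]

# argmax over the class axis: index of the larger logit in each row.
predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]

# Accuracy is the fraction of rows where prediction and label agree.
num_correct = sum(pred == label for pred, label in zip(predictions, label_ids))
accuracy = num_correct / len(label_ids)

print(predictions)
print(num_correct, accuracy)  # 12 0.8
```

Note that these are training examples, so this figure says nothing about how the model does on held-out text.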

## Save Model

In [None]:
trainer.save_model('bert-magazine-classifier')

## Use as pretrained

In [None]:
mag_classifier = pipeline('text-classification', model='bert-magazine-classifier')
mag_classifier(full_df['text'].tolist())

[{'label': 'elle', 'score': 0.5684574842453003},
 {'label': 'people', 'score': 0.531143069267273},
 {'label': 'elle', 'score': 0.5575286746025085},
 {'label': 'elle', 'score': 0.6007044315338135},
 {'label': 'elle', 'score': 0.5512022972106934},
 {'label': 'people', 'score': 0.5005678534507751},
 {'label': 'elle', 'score': 0.5764812231063843},
 {'label': 'elle', 'score': 0.5117030143737793},
 {'label': 'elle', 'score': 0.5077250003814697},
 {'label': 'elle', 'score': 0.5049431920051575},
 {'label': 'people', 'score': 0.5306376218795776},
 {'label': 'people', 'score': 0.5202901363372803},
 {'label': 'elle', 'score': 0.5992936491966248},
 {'label': 'people', 'score': 0.5538864135742188},
 {'label': 'elle', 'score': 0.5338598489761353},
 {'label': 'people', 'score': 0.5166091918945312},
 {'label': 'elle', 'score': 0.5768876075744629},
 {'label': 'people', 'score': 0.5069248676300049},
 {'label': 'elle', 'score': 0.5140368342399597},
 {'label': 'people', 'score': 0.527222216129303}]
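The `score` values appear to be softmax probabilities of the raw logits, mapped to names through the `id2label` dictionary saved with the model. As a rough check, the sketch below applies a softmax to the first row of logits from the Predict output, which corresponds to the first article since that row is in the training split. The `softmax` helper is written out here for illustration and is not the pipeline's own code.

```python
import math

def softmax(row):
    """Convert a list of logits into probabilities that sum to 1."""
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

id2label = {0: 'elle', 1: 'people'}    # label_dict from the notebook
first_logits = [-0.29494038, -0.5705005]  # first row of the Predict output

probs = softmax(first_logits)
best = max(range(len(probs)), key=probs.__getitem__)
result = {'label': id2label[best], 'score': probs[best]}

print(result)
```

The result is `elle` with a score of roughly 0.568, in line with the first entry in the pipeline output. Scores this close to 0.5 indicate the model is barely more confident than a coin flip, which is expected given the made-up labels and six training steps.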