# Fine Tuning Models - Using Custom Data
> Fine-tuning using your own data

In this notebook, we'll use https://huggingface.co/transformers/custom_datasets.html as a guide for our work.  The notebook headers mirror those of Notebook 3.  However, in this notebook, we'll use our own custom data, available in our `workshop-files` subdirectory.  Some code has already been provided from Notebook 2; other code we will write together.  See the solutions notebook if you fall behind!

# 0. Preliminaries
You can use the following code to mount your drive and cd into the relevant directory.  Uncomment the git clone command if you don't have the `deep-learning-intensive` repo already cloned.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
%cd drive/MyDrive
#!git clone https://github.com/vanderbilt-data-science/deep-learning-intensive.git
%cd deep-learning-intensive

# 1.  Installing Required Packages
Note that this is mostly required if you're on Google Colab.

In [None]:
! pip install transformers
! pip install datasets

# 2. Importing Packages for Use

In [None]:
import glob

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

from datasets import load_dataset, load_metric, Dataset
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
import torch

# 3. Load Data
## Read in data and convert to dataframe

In [None]:
import os

#get filenames list
filenames = glob.glob('workshop-files/*.txt')

#read file contents
file_contents = []
for file in filenames:
    with open(file, 'r') as f:
        file_contents.append(f.read())

#convert to df
#os.path.basename strips the directory using the current OS's path separator
#(splitting on '\\' only works on Windows and fails on Colab)
tinfo_df = pd.DataFrame({'filename':[os.path.basename(fname) for fname in filenames], 'text':file_contents})
tinfo_df['article_id'] = tinfo_df['filename'].apply(lambda x: int(x.split('.')[0]))

#read author csv
author_df = pd.read_csv('workshop-files/author_data.csv')

#join
full_df = pd.merge(author_df, tinfo_df, on='article_id')
full_df.head()
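
The article id used for the join is parsed from each file name. A minimal sketch of that parsing step in isolation, assuming files are named like `12.txt` (as the `int(...)` conversion implies). `article_id_from_path` is a hypothetical helper, not part of the workshop code, and the example paths are made up:

```python
import ntpath
import os

def article_id_from_path(path):
    """Return the integer article id encoded in a file name such as '12.txt'."""
    # ntpath.basename treats both '/' and '\\' as separators,
    # so the same call works for Colab-style and Windows-style paths.
    base = ntpath.basename(path)
    # Drop the extension and convert the remaining stem to an integer.
    stem, _ext = os.path.splitext(base)
    return int(stem)

# Made-up example paths, one per separator style
example_ids = [article_id_from_path(p)
               for p in ['workshop-files/3.txt', 'workshop-files\\12.txt']]
print(example_ids)
```

A file whose stem is not a number would raise a `ValueError` here, which may be the desired behavior when the join depends on clean ids.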

## Add training labels and split column
Note that our data currently doesn't have any training labels, so I'll make some up here and add them to the dataframe.  I'll also add a split column.

In [None]:
#create training labels
label_dict = {0:'elle', 1:'people'}
labels = [0]*10 + [1]*10
full_df['labels'] = pd.Series(labels).sample(frac=1, random_state=2345).reset_index(drop=True)

#create split labels
splits = [0]*15 + [1]*5
full_df['split'] = pd.Series(splits).sample(frac=1, random_state=2323).reset_index(drop=True)

#view
full_df.head()
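
Once the `split` column exists, the texts and labels for each subset can be pulled out with `query`. The sketch below uses a made-up toy dataframe with the same column names as `full_df`, and it assumes that `split == 0` marks training rows and `split == 1` marks validation rows; the variable names are illustrative only:

```python
import pandas as pd

# Toy stand-in for full_df with the same column names (contents are made up)
toy_df = pd.DataFrame({
    'text': [f'article text {i}' for i in range(20)],
    'labels': [0] * 10 + [1] * 10,
    'split': [0] * 15 + [1] * 5,
})

# Assumption: split == 0 is the training subset, split == 1 the validation subset
train_df = toy_df.query('split == 0')
val_df = toy_df.query('split == 1')

# Plain Python lists are a convenient input format for a tokenizer
train_texts, train_labels = train_df['text'].tolist(), train_df['labels'].tolist()
val_texts, val_labels = val_df['text'].tolist(), val_df['labels'].tolist()

print(len(train_texts), len(val_texts))
```

Unlike this ordered toy data, the notebook shuffles both the labels and the split assignment, so each subset there will generally contain a mix of both classes.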

# 4. Load Tokenizer

In [None]:
#load tokenizer - you can use the same one as in the previous notebook example
#tokenizer = 

# 5. Tokenize Inputs and Convert to PyTorch Dataset

In [None]:
#create tokenized representations
#train_encodings = 
#val_encodings = 

In [None]:
#helpers for class size and class names
no_classes = len(full_df.query('split==0')['labels'].unique())
train_classes = [label_dict[class_ind] for class_ind in range(no_classes)]

In [None]:
#Create custom Datasets Class
class ArticlesDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

#Create datasets from encodings
#train_dataset = 
#val_dataset = 

# 6. Split Data
Already done above!  Whoo!

# 7. Create Model for Task

In [None]:
#Here, you'll create your model for sequence classification using BERT.  Pass in two additional parameters:
#num_labels (find the appropriate variable with the number of classes in this notebook) and
#id2label (this is a dictionary, also in this notebook, that maps label ids to their names)
#model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", ...)
#model.name_or_path
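
The two extra arguments can be assembled from `label_dict` before the model is created. A minimal sketch, assuming the keyword names `num_labels` and `id2label` that `from_pretrained` forwards to the model configuration; `label2id` is an optional extra added here, and `model_kwargs` is a hypothetical name:

```python
# Same mapping as defined earlier in the notebook
label_dict = {0: 'elle', 1: 'people'}

# Keyword arguments to forward to the model configuration
model_kwargs = {
    'num_labels': len(label_dict),
    'id2label': label_dict,
    # Reverse mapping from label name to integer id
    'label2id': {name: idx for idx, name in label_dict.items()},
}
print(model_kwargs)

# Intended use (downloads weights, so it is left commented out in this sketch):
# model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", **model_kwargs)
```

Passing the label names at creation time means that saved checkpoints and pipelines can report `elle` and `people` rather than generic `LABEL_0` and `LABEL_1`.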

# 8. Set Up Arguments for Training

In [None]:
training_args = TrainingArguments("test_trainer",
                                 logging_strategy='epoch')
training_args

# 9. Train model (without computing metrics)
We won't run this one - we'll just use the next set of code to train and evaluate!

In [None]:
#trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=val_dataset)

In [None]:
#trainer.train()

# 10. Train model using evaluation metric

In [None]:
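#note: load_metric is deprecated in later releases of `datasets`; there, evaluate.load("accuracy") from the `evaluate` package takes its place
metric = load_metric("accuracy")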

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
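
For reference, the accuracy reported by `compute_metrics` is the fraction of examples whose highest-scoring logit matches the label. Below is a NumPy-only sketch of the same calculation on made-up logits. `accuracy_from_logits` is a hypothetical helper for illustration, not a replacement for the metric object above:

```python
import numpy as np

def accuracy_from_logits(logits, labels):
    """Fraction of rows whose argmax over the class axis equals the label."""
    predictions = np.argmax(logits, axis=-1)
    return {'accuracy': float((predictions == np.asarray(labels)).mean())}

# Made-up logits for four examples and two classes
toy_logits = np.array([[2.0, -1.0],
                       [0.1, 0.3],
                       [1.5, 0.2],
                       [-0.5, 0.4]])
toy_labels = np.array([0, 1, 1, 1])

result = accuracy_from_logits(toy_logits, toy_labels)
print(result)  # three of the four predictions match the labels
```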

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics
)
trainer.train()

# 11. Additional Exercises with `Trainer`
## Evaluate

In [None]:
trainer.evaluate(train_dataset)

## Predict

In [None]:
trainer.predict(train_dataset)
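
`trainer.predict` returns an object whose `predictions` field holds raw logits rather than class names. A minimal sketch of turning such logits into label names and probabilities, using made-up logits in place of real model output; `logits_to_labels` is a hypothetical helper:

```python
import numpy as np

# Same mapping as defined earlier in the notebook
label_dict = {0: 'elle', 1: 'people'}

def logits_to_labels(logits):
    """Return (label name, probability) for the top class of each row."""
    logits = np.asarray(logits, dtype=float)
    # Subtract the row maximum before exponentiating for numerical stability
    shifted = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
    pred_ids = probs.argmax(axis=-1)
    return [(label_dict[int(i)], float(p[i])) for i, p in zip(pred_ids, probs)]

# Made-up logits standing in for trainer.predict(...).predictions
toy_logits = [[2.0, 0.0], [-1.0, 1.0]]
decoded = logits_to_labels(toy_logits)
print(decoded)
```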

## Save Model

In [None]:
trainer.save_model('bert-magazine-classifier')

## Use as pretrained

In [None]:
mag_classifier = pipeline('text-classification', model='bert-magazine-classifier')
mag_classifier(full_df['text'].tolist())