# CIS6930 Week 11 More Topics



---
Preparation: Go to `Runtime > Change runtime type` and choose `GPU` for the hardware accelerator.

In [None]:
gpu_info = !nvidia-smi -L
gpu_info = "\n".join(gpu_info)
if gpu_info.find("failed") >= 0:
    print("Not connected to a GPU")
else:
    print(gpu_info)
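
The `!` shell escape only works inside IPython or Colab. A minimal sketch of the same check in plain Python, assuming `nvidia-smi` is on the `PATH` whenever a GPU driver is installed (the helper name `list_gpus` is hypothetical):

```python
import shutil
import subprocess


def list_gpus():
    """Return the output of `nvidia-smi -L`, or None if no GPU is reachable."""
    if shutil.which("nvidia-smi") is None:
        return None  # driver tools are not installed
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True
    )
    if result.returncode != 0:
        return None  # e.g. the tool exists but no device is attached
    return result.stdout.strip()


gpu_info = list_gpus()
print(gpu_info if gpu_info else "Not connected to a GPU")
```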

## Preparation

For this notebook, we use Hugging Face's `transformers` library.

In [None]:
!pip install transformers

In [None]:
import transformers

In [None]:
from transformers import TrainingArguments

## Twitter Classification Dataset (Again!)

In Week 6, we created a custom dataset for the Twitter dataset (see [the Google Colab notebook](https://colab.research.google.com/drive/1DZN-Bo2HBnPQPm4jrQzEIchhHdN682qP?usp=sharing)).


In [None]:
# https://www.kaggle.com/crowdflower/twitter-airline-sentiment
# License CC BY-NC-SA 4.0
!gdown --id 1BS_TIqm7crkBRr8p6REZrMv4Uk9_-e6W

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

from torch.utils.data import Dataset, TensorDataset, DataLoader

# Loading dataset
df = pd.read_csv("Tweets.csv")

# Label encoder
le = LabelEncoder()
y = le.fit_transform(df["airline_sentiment"].values)
df["label"] = y

# Split into 60% train, 20% valid, 20% test
train_df, test_df = train_test_split(
    df, test_size=0.2, random_state=1)

train_df, valid_df = train_test_split(
    train_df, test_size=0.25, random_state=1)  # 0.25 x 0.8 = 0.2

print(len(train_df), len(valid_df), len(test_df))
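
The second call uses `test_size=0.25` because it operates on the 80% that remains after the first split. The arithmetic behind the 60/20/20 split can be checked with exact fractions (a worked example only; it does not call `train_test_split`):

```python
from fractions import Fraction

test_frac = Fraction(1, 5)            # first split: test_size=0.2
rest = 1 - test_frac                  # 4/5 of the data remains
valid_frac = Fraction(1, 4) * rest    # second split: test_size=0.25 of the rest
train_frac = rest - valid_frac

print(train_frac, valid_frac, test_frac)  # 3/5 1/5 1/5
```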

In [None]:
df.columns
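
`LabelEncoder` assigns integer ids to the sorted unique class names. A pure-Python sketch of that mapping, assuming the sentiment values are `negative`, `neutral`, and `positive` (the sample list below is made up for illustration):

```python
labels = ["neutral", "positive", "negative", "negative", "neutral"]

classes = sorted(set(labels))                        # sorted unique class names
label_to_id = {name: i for i, name in enumerate(classes)}
encoded = [label_to_id[name] for name in labels]

print(label_to_id)  # {'negative': 0, 'neutral': 1, 'positive': 2}
print(encoded)      # [1, 2, 0, 0, 1]
```

With the fitted encoder from the earlier cell, `le.classes_` holds the sorted names and `le.inverse_transform` maps ids back to strings.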

In [None]:
class TweetDataset(Dataset):
    def __init__(self,
                 df,
                 tokenizer,
                 max_length=256):
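        self.df = df.copy()  # copy so the caller's DataFrame is not modified
        input_ids = []
        for text in self.df["text"].tolist():
            d = tokenizer(text,
                          max_length=max_length,
                          padding="max_length",
                          truncation=True,  # cut texts longer than max_length
                          return_tensors="pt")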
            for k, v in d.items():
                # To remove unnecessary list
                d[k] = v.squeeze(0)
            input_ids.append(d)

        self.df["input_ids"] = input_ids

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        return {**self.df.iloc[idx]["input_ids"],
                "labels": self.df.iloc[idx]["label"]}

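Every example must have the same length so that examples can be stacked into a batch. The following toy sketch shows the idea behind `padding="max_length"` combined with truncation. It uses a hypothetical whitespace tokenizer and made-up ids, not the real BERT tokenizer:

```python
PAD_ID = 0  # assumed padding id for this toy vocabulary


def encode(text, vocab, max_length):
    """Map whitespace tokens to ids, then truncate or pad to max_length."""
    ids = [vocab.setdefault(tok, len(vocab) + 1) for tok in text.lower().split()]
    ids = ids[:max_length]                       # truncation
    mask = [1] * len(ids)                        # 1 marks a real token
    pad = max_length - len(ids)
    return {"input_ids": ids + [PAD_ID] * pad,   # padding
            "attention_mask": mask + [0] * pad}


vocab = {}
short_ex = encode("great flight", vocab, max_length=4)
long_ex = encode("my flight was delayed for hours", vocab, max_length=4)
print(short_ex)  # {'input_ids': [1, 2, 0, 0], 'attention_mask': [1, 1, 0, 0]}
print(long_ex)   # {'input_ids': [3, 2, 4, 5], 'attention_mask': [1, 1, 1, 1]}
```
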
### Trainer!

So far, we have used a hand-written training function. `transformers` provides the `Trainer` class, which handles the training loop and can be customized through `TrainingArguments`.


In [None]:
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments
)


tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)


train_dataset = TweetDataset(train_df, tokenizer, max_length=256)
valid_dataset = TweetDataset(valid_df, tokenizer, max_length=256)
test_dataset = TweetDataset(test_df, tokenizer, max_length=256)


# https://huggingface.co/transformers/main_classes/trainer.html#trainingarguments
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=100,
)

# Build the Trainer from the model, training arguments, and datasets
trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=valid_dataset           # evaluation dataset
)

trainer.train()
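
`warmup_steps` and `logging_steps` are counted in optimizer steps, not epochs, so it helps to know how many steps a run contains. A back-of-the-envelope sketch, assuming a single device, no gradient accumulation, and a hypothetical training set of 8,000 examples (the real size is printed by the split cell earlier):

```python
import math

num_examples = 8000      # hypothetical; use len(train_dataset) in practice
batch_size = 16          # per_device_train_batch_size
num_epochs = 3           # num_train_epochs
warmup_steps = 500
logging_steps = 100

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch)                 # 500
print(total_steps)                     # 1500
print(warmup_steps / total_steps)      # fraction of training spent warming up
print(total_steps // logging_steps)    # 15
```

Under these assumptions, warmup covers the first third of training, which is worth keeping in mind when the dataset is small.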

In [None]:
%ls

In [None]:
%ls results/checkpoint-500
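
`Trainer` saves checkpoints into `output_dir` as `checkpoint-<step>` directories. A small sketch for locating the most recent one with the standard library (the helper name `latest_checkpoint` is hypothetical, and the demo uses a temporary directory instead of a real training run):

```python
import re
import tempfile
from pathlib import Path


def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> subdirectory with the largest step, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    candidates = []
    for path in Path(output_dir).iterdir():
        match = pattern.match(path.name)
        if path.is_dir() and match:
            # Compare steps as integers so that 1500 sorts after 500
            candidates.append((int(match.group(1)), path))
    return max(candidates)[1] if candidates else None


# Demo on a throwaway directory that mimics the layout of ./results
with tempfile.TemporaryDirectory() as tmp:
    for step in (500, 1000, 1500):
        (Path(tmp) / f"checkpoint-{step}").mkdir()
    print(latest_checkpoint(tmp).name)  # checkpoint-1500
```

The returned path should be usable with `AutoModelForSequenceClassification.from_pretrained(...)` to reload the fine-tuned weights.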

### Visualizing the experiment with TensorBoard

As the name suggests, TensorBoard was originally developed as part of the TensorFlow framework, but it now works with PyTorch and other frameworks.

https://www.tensorflow.org/tensorboard

You can use TensorBoard in Google Colab by running the following two lines.


In [None]:
%load_ext tensorboard
%tensorboard --logdir logs