# "astroGPT"
> "How to fine-tune a GPT-2 model with fastai and 🤗 Hugging Face to generate daily horoscopes"

- toc: true
- categories: [gpt2, huggingface, nlp]
- show_tags: true

## Introduction

In the last [post](https://stevhliu.github.io/satsuma/transformer/gpt2/huggingface/nlp/2020/09/08/transformer-gpt2.html), I shared some notes on understanding the Transformer, GPT-2 and different decoding strategies for generating text. Now for something a little more practical (and fun! 😊), let's fine-tune GPT-2 to generate horoscopes.

> twitter: https://twitter.com/huggingface/status/1302289493915992066?s=20

## Data

To start, we need some horoscopes to fine-tune the model with. The only publicly available dataset I could find was from a similar [project](https://github.com/dsnam/markovscope) that pulled horoscopes from The New York Post. However, it didn't contain as much text as I had hoped for, so I built my own scraper to collect data from [Horoscope.com](https://www.horoscope.com/us/index.aspx). The dataset contains a year's worth of horoscopes for each of the twelve zodiac signs across several categories (daily, love, wellness, career). Once we have the data, we can create a training and validation set using an 80/20 split (`all_text` contains all the text).

```python
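import numpy as np

num = int(0.8*len(all_text))

# shuffle the row indices without replacement so the two sets never overlap
idxs = np.random.permutation(len(all_text))
idxs_train = idxs[:num]
idxs_val = idxs[num:]

# named df_train/df_valid to match the code used further down
df_train = all_text.iloc[idxs_train]
df_valid = all_text.iloc[idxs_val]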
```

## Training the fastai way

![fastai_hf.png](https://raw.githubusercontent.com/stevhliu/ingolmo/master/images/fastai_hf.png)

The next step is to fine-tune GPT-2 on the horoscopes. [fastai](https://docs.fast.ai/) makes this part really simple, and offers amazing utilities like the learning rate finder, discriminative fine-tuning and 1cycle training. If you aren't familiar with these concepts, I highly recommend taking the fastai course, [*Practical Deep Learning for Coders*](https://course.fast.ai/index.html), by Jeremy Howard and Sylvain Gugger 👏. 

I'll just provide a brief summary of these ideas here, so I don't spoil anything in case you're interested in the course!

### Learning rate finder

In the past, selecting an optimal learning rate has been challenging. If your learning rate is too high it will blow your training up, but if it is too low the model will train very, very slowly. So in 2015, Leslie Smith came up with the learning rate finder (see this [paper](https://arxiv.org/abs/1506.01186) for the details). The idea is simple: begin training the model with a small learning rate and then gradually increase it until the loss gets worse and worse (we record and plot the loss after each mini-batch to generate a chart). All we have to do then is pick a learning rate where the loss is decreasing. The general rule of thumb is to pick an order of magnitude less than the point where things begin to get worse.
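
To make the procedure concrete, here is a minimal sketch of a learning rate range test on a toy problem. It is not fastai's implementation: the one-weight model, the fixed four-point batch, the target weight of 3.0 and the "stop when the loss is 4x the best" threshold are all assumptions chosen for illustration.

```python
import math

def lr_range_test(start_lr=1e-5, end_lr=10.0, num_steps=100):
    """Toy range test: fit y = 3x by gradient descent while the learning rate grows."""
    xs = [-1.0, -0.5, 0.5, 1.0]   # hypothetical fixed batch
    w = 0.0                       # single weight to learn (true value is 3.0)
    # constant multiplier that takes start_lr to end_lr in num_steps steps
    growth = (end_lr / start_lr) ** (1 / (num_steps - 1))
    lr, best = start_lr, math.inf
    lrs, losses = [], []
    for _ in range(num_steps):
        loss = sum((w * x - 3.0 * x) ** 2 for x in xs) / len(xs)
        grad = sum(2 * (w * x - 3.0 * x) * x for x in xs) / len(xs)
        lrs.append(lr)
        losses.append(loss)
        best = min(best, loss)
        if loss > 4 * best:       # the loss has started to blow up, so stop
            break
        w -= lr * grad            # one gradient descent step
        lr *= growth              # raise the learning rate for the next step
    return lrs, losses

lrs, losses = lr_range_test()
turning_point = lrs[losses.index(min(losses))]
suggested_lr = turning_point / 10   # rule of thumb: one order of magnitude lower
print(f"loss bottoms out near lr={turning_point:.3g}; try lr={suggested_lr:.3g}")
```

Plotting `losses` against `lrs` on a log scale would give the kind of chart the learning rate finder produces.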

### Discriminative fine-tuning

Discriminative learning rates were introduced in [ULMFiT](https://arxiv.org/abs/1801.06146) by Jeremy Howard and Sebastian Ruder. The idea is based on the observation that not every layer of the neural net should be trained with the same learning rate. The earliest layers of a neural net detect basic patterns, while the later layers learn more complex things. In other words, different layers of the neural net represent different levels of semantic complexity. So if the early layers are already pretty good at recognizing these simple patterns, then we shouldn't change their weights too much (set a lower learning rate). But for the more complex patterns in the later layers, we use a higher learning rate so that they train and learn better.
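
As a rough illustration, the sketch below spreads learning rates across layer groups with a constant multiplier, so the earliest group gets the lowest rate and the last group the highest. The helper `spread_lrs` and the group names are hypothetical; the endpoints `1e-7` and `1e-5` are borrowed from the `slice` used in the training code further down.

```python
def spread_lrs(lowest, highest, n_groups):
    """Return n_groups learning rates spaced by a constant multiplier."""
    if n_groups == 1:
        return [highest]
    step = (highest / lowest) ** (1 / (n_groups - 1))
    return [lowest * step ** i for i in range(n_groups)]

# Hypothetical layer groups, ordered from earliest to latest.
groups = ["embeddings", "early blocks", "late blocks", "output head"]
for name, lr in zip(groups, spread_lrs(1e-7, 1e-5, len(groups))):
    print(f"{name:>12}: lr={lr:.2e}")
```

In a real training loop, each of these rates would be assigned to the optimizer parameter group that holds the corresponding layers.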

### 1cycle training

This is another amazing [idea](https://arxiv.org/abs/1708.07120) from Leslie Smith for training neural nets. Simply put, you begin training with a lower learning rate and then gradually increase it to some maximum value. This way, you avoid letting your training get out of hand and instead discover increasingly better parameters. Once you get to a sweet spot on the loss landscape, you lower the learning rate again to really home in on the best parameters. Combined, this *warming up* and *cooling down* is known as 1cycle training and it allows you to train faster and more accurately.
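
The warm-up and cool-down can be sketched as a function from the training step to a learning rate. This is a simplified schedule, not the exact one any library uses: the cosine shape and the values of `pct_start`, `div` and `div_final` are assumptions picked for illustration.

```python
import math

def one_cycle_lr(step, total_steps, lr_max, pct_start=0.25, div=25.0, div_final=1e5):
    """Learning rate at `step`: cosine warm-up to lr_max, then cosine cool-down."""
    def cos_interp(start, end, pct):
        # moves smoothly from start (pct=0) to end (pct=1)
        return start + (end - start) * (1 - math.cos(math.pi * pct)) / 2

    warmup_steps = pct_start * total_steps
    if step < warmup_steps:
        # warming up: from lr_max/div to lr_max
        return cos_interp(lr_max / div, lr_max, step / warmup_steps)
    # cooling down: from lr_max to lr_max/div_final
    pct = (step - warmup_steps) / (total_steps - warmup_steps)
    return cos_interp(lr_max, lr_max / div_final, pct)

schedule = [one_cycle_lr(s, 100, 1e-3) for s in range(101)]
print(f"start={schedule[0]:.1e}, peak={max(schedule):.1e}, end={schedule[-1]:.1e}")
```

The schedule rises for the first quarter of training, peaks at `lr_max`, and then decays for the remaining steps.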

## fastai and 🤗 Hugging Face 

To get started, download the pre-trained GPT-2 model from [Hugging Face](https://huggingface.co/gpt2).

```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

pretrained_weights = 'gpt2'
tokenizer = GPT2TokenizerFast.from_pretrained(pretrained_weights)
model = GPT2LMHeadModel.from_pretrained(pretrained_weights)
```

Next, build a `Transform` that tokenizes the text. Luckily for us, Sylvain has already demonstrated how we can use fastai with Hugging Face and provided a [tutorial](https://docs.fast.ai/tutorial.transformers) that we can follow.

```python
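from fastai.text.all import *

# assumes the horoscope text lives in a 'text' column of each DataFrame
all_text = np.concatenate([df_train['text'].values, df_valid['text'].values])

class TransformersTokenizer(Transform):
    "Wrap a Hugging Face tokenizer so fastai can turn text into ids and ids back into text"
    def __init__(self, tokenizer): self.tokenizer = tokenizer
    def encodes(self, x): 
        toks = self.tokenizer.tokenize(x)
        return tensor(self.tokenizer.convert_tokens_to_ids(toks))
    def decodes(self, x): return TitledStr(self.tokenizer.decode(x.cpu().numpy()))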
```

Then you specify the training and validation sets.

```python
splits = [list(range_of(df_train)), list(range(len(df_train), len(all_text)))]
```

Put together everything we need to handle our data:

* all the text data ✅
* the GPT-2 tokenizer ✅
* the training and validation splits ✅
* set the dataloader type to `LMDataLoader` because we're using a language model ✅

```python
tls = TfmdLists(all_text, TransformersTokenizer(tokenizer), splits=splits, dl_type=LMDataLoader)
```

Create a `DataLoader` and set the batch size and sequence length. You may have to adjust your batch size depending on your GPU memory. GPT-2 was trained on sequences of size 1024 so we will just keep it as it is.

```python
bs,sl = 2,1024
dls = tls.dataloaders(bs=bs, seq_len=sl)
```
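
To see what the sequence length controls, here is a simplified sketch of how a language-model data loader can cut one long token stream into training pairs, where each target is the input shifted by one token. It is not fastai's `LMDataLoader` implementation; `lm_batches` and the toy token ids are hypothetical.

```python
def lm_batches(token_ids, seq_len):
    """Yield (input, target) pairs where target is input shifted by one token."""
    # The last token has no successor, so stop one short of the end.
    for start in range(0, len(token_ids) - 1, seq_len):
        x = token_ids[start : start + seq_len]
        y = token_ids[start + 1 : start + seq_len + 1]
        x = x[: len(y)]   # trim the final window so input and target align
        yield x, y

stream = list(range(10))   # stand-in for the concatenated token ids
for x, y in lm_batches(stream, seq_len=4):
    print(x, "->", y)
```

With `seq_len=1024`, each window would hold 1024 tokens, which is why the batch size usually has to stay small on a single GPU.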

To get Hugging Face to work with fastai, we add a minor modification to the training loop with a callback. The Hugging Face model returns a tuple (see [here](https://huggingface.co/transformers/model_doc/gpt2.html#gpt2lmheadmodel)) that contains the predictions followed by some additional outputs. We only want the predictions, so the callback keeps the first element and drops the rest.

```python
class DropOutput(Callback):
  def after_pred(self): self.learn.pred = self.pred[0]
```

Then create a `Learner` object which contains the data, model, loss function, the custom callback and a metric for evaluating the language model. We use mixed precision to train faster and save some memory.

```python
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), cbs=[DropOutput], metrics=[Perplexity()]).to_fp16()
```

Use the learning rate finder to discover a good learning rate.

```python
learn.lr_find()
```

Once you've picked a good learning rate, train the model with 1cycle training, passing a `slice` to `lr_max` to use discriminative learning rates.

```python
learn.fit_one_cycle(2, 5e-3)

learn.fit_one_cycle(10, lr_max=slice(1e-7, 1e-5))

learn.fit_one_cycle(20, lr_max=slice(1e-7, 1e-5))
```

Here, you basically train until the validation loss stops improving. With my setup, I was able to get the validation loss down to 2.64 and the perplexity to 14.04. This took roughly 2.5 hours on one of Colab's GPUs.
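
The two numbers above are directly related: perplexity is the exponential of the average cross-entropy loss. A quick check, using a hypothetical helper named `perplexity`:

```python
import math

def perplexity(cross_entropy_loss):
    """Perplexity is exp(average cross-entropy loss)."""
    return math.exp(cross_entropy_loss)

print(round(perplexity(2.64), 2))
```

This lands very close to the reported 14.04; the small gap presumably comes from the loss being rounded to two decimals.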

And just like that you have a fine-tuned GPT-2 model! 🥳

## Generating horoscopes 🔮

After you're done fine-tuning your model, you can upload it to the Hugging Face Model Hub so that everyone can easily use it. Just follow the steps [here](https://huggingface.co/transformers/model_sharing.html).

Now for the fun part: load the model and generate your horoscope!

```python
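import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("stevhliu/astroGPT")
# AutoModelForCausalLM replaces the deprecated AutoModelWithLMHead
model = AutoModelForCausalLM.from_pretrained("stevhliu/astroGPT")

# fall back to the CPU when no GPU is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.eval()
model.to(device)

# input the date as Mon DD, YYYY
input_ids = tokenizer.encode('Sep 23, 2020', return_tensors='pt').to(device)

sample_output = model.generate(input_ids,
                               do_sample=True, 
                               max_length=75,
                               top_k=20, 
                               top_p=0.97)

# generate returns a batch of sequences, so decode the first (and only) one
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))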
```
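
The `top_k` and `top_p` arguments restrict which tokens can be sampled at each step. The sketch below shows the idea on a made-up four-word distribution: keep the `top_k` most likely tokens, then keep the smallest set of those whose cumulative probability reaches `top_p`, and renormalize. It is a simplified illustration, not the library's implementation, and `top_k_top_p_filter` and the toy vocabulary are hypothetical.

```python
import random

def top_k_top_p_filter(probs, top_k, top_p):
    """Return a renormalized distribution over the tokens that survive both filters."""
    # top-k: keep only the k most likely tokens, then renormalize
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    mass = sum(p for _, p in ranked)
    ranked = [(token, p / mass) for token, p in ranked]
    # top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

# Made-up next-token probabilities for illustration only.
next_token_probs = {"the": 0.5, "a": 0.3, "stars": 0.15, "zebra": 0.05}
filtered = top_k_top_p_filter(next_token_probs, top_k=3, top_p=0.7)
choice = random.choices(list(filtered), weights=list(filtered.values()))[0]
print(filtered, "->", choice)
```

With `top_k=20` and `top_p=0.97` as in the generation call, far more candidates survive, which keeps the horoscopes varied while still cutting off the least likely tokens.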
