#Tell Me a Story
Children's story generator using a [GPT-2](https://openai.com/blog/better-language-models/) network fine-tuned on children's stories from Project Gutenberg (the Children's Book Test corpus, distributed with [bAbI](https://research.fb.com/downloads/babi/)). This notebook only works in full on [Colab](https://colab.research.google.com), because it saves the resulting model to the user's Google Drive.

In [0]:
! pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/fd/f9/51824e40f0a23a49eab4fcaa45c1c797cbf9761adedd0b558dab7c958b34/transformers-2.1.1-py3-none-any.whl (311kB)
Collecting regex
  Downloading https://files.pythonhosted.org/packages/e3/8e/cbf2295643d7265e7883326fb4654e643bfc93b3a8a8274d8010a39d8804/regex-2019.11.1-cp36-cp36m-manylinux1_x86_64.whl (643kB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/1f/8e/ed5364a06a9ba720fddd9820155cc57300d28f5f43a6fd7b7e817177e642/sacremoses-0.0.35.tar.gz (859kB)
Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/14/3d/efb655a670b98f62ec32d66954e1109f403db4d937c50d779a75b9763a29/sentencepiece-0.1.83-cp36-cp36m-manylinux1_x86_64.whl (1.0MB)

In [0]:
! rm -rf transformers
! git clone https://github.com/huggingface/transformers.git

Cloning into 'transformers'...
remote: Enumerating objects: 221, done.
remote: Counting objects: 100% (221/221), done.
remote: Compressing objects: 100% (117/117), done.
remote: Total 12628 (delta 133), reused 154 (delta 104), pack-reused 12407
Receiving objects: 100% (12628/12628), 6.63 MiB | 5.71 MiB/s, done.
Resolving deltas: 100% (9221/9221), done.
Note: checking out '3ddce1d74cda5be47704381e657ee22ce5a5fc7b'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>



In [0]:
# get the data
import re
from requests import get

url = "http://www.thespermwhale.com/jaseweston/babi/CBTest.tgz"
fn = re.search("[^/]+$", url).group(0)
response = get(url)
response.raise_for_status()  # stop here on an HTTP error instead of saving an error page
with open(fn, "wb") as f:
  f.write(response.content)

! tar xfz {fn} 2> /dev/null
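
The download cell above takes the archive name from the last path segment of the URL and unpacks it with the shell's `tar`. As a rough sketch of the same two steps in pure Python, assuming the archive is a trusted gzip-compressed tarball, something like the following could be used. The helper names `archive_name` and `extract_archive` are illustrative and not part of the notebook.

```python
import tarfile
from pathlib import Path
from urllib.parse import urlparse


def archive_name(url):
    """Return the last path segment of a URL, e.g. 'CBTest.tgz'."""
    return Path(urlparse(url).path).name


def extract_archive(archive_path, dest="."):
    """Unpack a gzip-compressed tarball into dest (hypothetical helper).

    Only use this on archives you trust: tar members can carry
    unexpected paths.
    """
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest)
    return dest
```

With the notebook's variables, `extract_archive(fn)` would unpack into the current directory, much like `tar xfz`, except that errors raise an exception rather than being discarded via `2> /dev/null`.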

In [0]:
# munge it
from pathlib import PosixPath
from os import makedirs

data_path = PosixPath("/content/CBTest/data")
makedirs(data_path/"train", exist_ok=True)
makedirs(data_path/"valid", exist_ok=True)
cbt_train_file = data_path/"train/cbt_train_cleaned.txt"
cbt_valid_file = data_path/"valid/cbt_valid_cleaned.txt"
# we don't care about test, so add it to the train set
! echo "Cleaning training data"
! cat {data_path}/cbt_train.txt {data_path}/cbt_test.txt | tqdm | grep -v _BOOK_TITLE | \
perl -pe 's/-L[CS]B-.*?-R[CS]B-//g; s/-L[CS]B-.*$// if ! /-R[CS]B-/; s/^.*-R[CS]B-// if ! /-L[CS]B-/; s/-LRB-/(/g; s/-RRB-/)/g;' \
> {cbt_train_file}
! echo "Cleaning validation data"
! grep -v _BOOK_TITLE {data_path}/cbt_valid.txt | tqdm | \
perl -pe 's/-L[CS]B-.*?-R[CS]B-//g; s/-L[CS]B-.*$// if ! /-R[CS]B-/; s/^.*-R[CS]B-// if ! /-L[CS]B-/; s/-LRB-/(/g; s/-RRB-/)/g;' \
 > {cbt_valid_file}

Cleaning training data
280165it [00:00, 494072.12it/s]
Cleaning validation data
12742it [00:00, 408201.81it/s]
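
The `perl` one-liner in the cleaning cell is dense, so here is a sketch of what it appears to do, written as plain Python. It assumes the Penn Treebank-style tokens in the corpus: `-LSB-`/`-RSB-` and `-LCB-`/`-RCB-` for square and curly brackets, `-LRB-`/`-RRB-` for parentheses. The function names are illustrative and not part of the notebook.

```python
import re


def clean_line(line):
    """Mirror the perl substitutions from the cleaning cell, in order."""
    # 1. Drop complete square/curly bracketed spans.
    line = re.sub(r"-L[CS]B-.*?-R[CS]B-", "", line)
    # 2. An opening bracket with no closer left: drop to end of line.
    if not re.search(r"-R[CS]B-", line):
        line = re.sub(r"-L[CS]B-.*$", "", line, count=1)
    # 3. A closing bracket with no opener left: drop from the line start.
    if not re.search(r"-L[CS]B-", line):
        line = re.sub(r"^.*-R[CS]B-", "", line, count=1)
    # 4. Turn round-bracket tokens back into literal parentheses.
    return line.replace("-LRB-", "(").replace("-RRB-", ")")


def clean_lines(lines):
    """Skip book-title marker lines (the `grep -v` step) and clean the rest."""
    for line in lines:
        if "_BOOK_TITLE" not in line:
            yield clean_line(line)
```

For example, `clean_line("start -LCB- unclosed")` returns `"start "`, since the unmatched opening token and everything after it are dropped.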


In [0]:
! python transformers/examples/run_lm_finetuning.py \
    --output_dir=model \
    --model_type=gpt2 \
    --model_name_or_path=gpt2 \
    --do_train \
    --train_data_file={cbt_train_file} \
    --do_eval \
    --eval_data_file={cbt_valid_file} \
    --per_gpu_train_batch_size=2 \
    --per_gpu_eval_batch_size=2

11/26/2019 18:44:04 - INFO - transformers.file_utils -   https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-config.json not found in cache or force_download set to True, downloading to /tmp/tmpcb9b66ei
100% 176/176 [00:00<00:00, 135822.91B/s]
11/26/2019 18:44:04 - INFO - transformers.file_utils -   copying /tmp/tmpcb9b66ei to cache at /root/.cache/torch/transformers/4be02c5697d91738003fb1685c9872f284166aa32e061576bbe6aaeb95649fcf.085d5f6a8e7812ea05ff0e6ed0645ab2e75d80387ad55c1ad9806ee70d272f80
11/26/2019 18:44:04 - INFO - transformers.file_utils -   creating metadata file for /root/.cache/torch/transformers/4be02c5697d91738003fb1685c9872f284166aa32e061576bbe6aaeb95649fcf.085d5f6a8e7812ea05ff0e6ed0645ab2e75d80387ad55c1ad9806ee70d272f80
11/26/2019 18:44:04 - INFO - transformers.file_utils -   removing temp file /tmp/tmpcb9b66ei
11/26/2019 18:44:04 - INFO - transformers.configuration_utils -   loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-config

In [0]:
import random

seed = random.randint(1,50000)
length = 500
prompt = "Once upon a time, Alice fell down a well, where she met a very curious rabbit."
! python transformers/examples/run_generation.py \
  --model_type=gpt2 \
  --model_name_or_path="./model" \
  --prompt="{prompt}" \
  --length={length} \
  --seed={seed}

11/26/2019 20:51:03 - INFO - transformers.tokenization_utils -   Model name './model' not found in model shortcut name list (gpt2, gpt2-medium, gpt2-large, distilgpt2). Assuming './model' is a path or url to a directory containing tokenizer files.
11/26/2019 20:51:03 - INFO - transformers.tokenization_utils -   loading file ./model/vocab.json
11/26/2019 20:51:03 - INFO - transformers.tokenization_utils -   loading file ./model/merges.txt
11/26/2019 20:51:03 - INFO - transformers.tokenization_utils -   loading file ./model/added_tokens.json
11/26/2019 20:51:03 - INFO - transformers.tokenization_utils -   loading file ./model/special_tokens_map.json
11/26/2019 20:51:03 - INFO - transformers.tokenization_utils -   loading file ./model/tokenizer_config.json
11/26/2019 20:51:03 - INFO - transformers.configuration_utils -   loading configuration file ./model/config.json
11/26/2019 20:51:03 - INFO - transformers.configuration_utils -   Model config {
  "attn_pdrop": 0.1,
  "embd_pdrop": 0.1,


In [0]:
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


In [0]:
drive_dir = "/content/drive/My Drive/Colab Notebooks/models/childrens-stories_fine-tuned_gpt2"
! rm -rf "{drive_dir}"
# -p also creates missing parent folders such as "models"
! mkdir -p "{drive_dir}"
! cp /content/model/*.json /content/model/*.txt /content/model/*.bin "{drive_dir}"
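
The copy step above can also be done without shell commands. The following is a minimal sketch, assuming the same three file patterns as the `cp` line and the same "replace whatever is already there" behaviour as the `rm -rf`. The name `copy_model_files` is hypothetical.

```python
import shutil
from pathlib import Path

# Same patterns as the `cp` command: config/tokenizer JSON, merges, weights.
MODEL_FILE_PATTERNS = ("*.json", "*.txt", "*.bin")


def copy_model_files(model_dir, dest_dir, patterns=MODEL_FILE_PATTERNS):
    """Replace dest_dir with the top-level model files from model_dir."""
    src = Path(model_dir)
    dest = Path(dest_dir)
    if dest.exists():
        shutil.rmtree(dest)  # like `rm -rf`: discard any previous copy
    dest.mkdir(parents=True)  # also creates missing parent folders
    copied = []
    for pattern in patterns:
        for path in sorted(src.glob(pattern)):
            if path.is_file():  # skip directories such as checkpoints
                shutil.copy2(path, dest / path.name)
                copied.append(path.name)
    return copied
```

In the notebook this would be called as `copy_model_files("/content/model", drive_dir)` after Drive is mounted.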

#Inference from Drive

In [0]:
drive_dir = "/content/drive/My Drive/Colab Notebooks/models/childrens-stories_fine-tuned_gpt2"

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


In [0]:
! pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/fd/f9/51824e40f0a23a49eab4fcaa45c1c797cbf9761adedd0b558dab7c958b34/transformers-2.1.1-py3-none-any.whl (311kB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/1f/8e/ed5364a06a9ba720fddd9820155cc57300d28f5f43a6fd7b7e817177e642/sacremoses-0.0.35.tar.gz (859kB)
Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/14/3d/efb655a670b98f62ec32d66954e1109f403db4d937c50d779a75b9763a29/sentencepiece-0.1.83-cp36-cp36m-manylinux1_x86_64.whl (1.0MB)
Collecting regex
  Downloading https://files.pythonhosted.org/packages/e3/8e/cbf2295643d7265e7883326fb4654e643bfc93b3a8a8274d8010a39d8804/regex-2019.11.1-cp36-cp36m-manylinux1_x86_64.whl (643kB)

In [0]:
! rm -rf transformers
# fork whose generation script is enhanced to use `past` (GPT-2's cached key/value states)
! git clone https://github.com/thisisrandy/transformers.git

Cloning into 'transformers'...
remote: Enumerating objects: 70, done.
remote: Counting objects: 100% (70/70), done.
remote: Compressing objects: 100% (40/40), done.
remote: Total 12479 (delta 38), reused 52 (delta 30), pack-reused 12409
Receiving objects: 100% (12479/12479), 6.56 MiB | 20.47 MiB/s, done.
Resolving deltas: 100% (9130/9130), done.
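
The generation cells in this notebook interpolate the prompt into a shell command as `--prompt="{prompt}"`, which would likely break if the prompt itself contained a double quote. One possible alternative, sketched here, is to build the argument list in Python and run it without a shell. The flags are the ones used in the notebook; the helper name `build_generation_command` is an assumption for illustration.

```python
import random
import sys


def build_generation_command(model_path, prompt, length, seed=None):
    """Build an argv list for run_generation.py (hypothetical helper)."""
    if seed is None:
        # Same seed range as the notebook cells.
        seed = random.randint(1, 50000)
    return [
        sys.executable,
        "transformers/examples/run_generation.py",
        "--model_type=gpt2",
        f"--model_name_or_path={model_path}",
        f"--prompt={prompt}",  # passed as one argument, so no shell quoting
        f"--length={length}",
        f"--seed={seed}",
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Because no shell is involved, quotes and spaces in the prompt or in a Drive path such as `drive_dir` need no escaping.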


In [0]:
import random

seed = random.randint(1,50000)
length = 1000
prompt = "Once upon a time, Alice fell down a well, where she met a very curious rabbit."
! python transformers/examples/run_generation.py \
  --model_type=gpt2 \
  --model_name_or_path="{drive_dir}" \
  --prompt="{prompt}" \
  --length={length} \
  --seed={seed}

11/24/2019 17:57:02 - INFO - transformers.tokenization_utils -   Model name '/content/drive/My Drive/Colab Notebooks/models/childrens-stories_fine-tuned_gpt2_1000' not found in model shortcut name list (gpt2, gpt2-medium, gpt2-large, distilgpt2). Assuming '/content/drive/My Drive/Colab Notebooks/models/childrens-stories_fine-tuned_gpt2_1000' is a path or url to a directory containing tokenizer files.
11/24/2019 17:57:03 - INFO - transformers.tokenization_utils -   loading file /content/drive/My Drive/Colab Notebooks/models/childrens-stories_fine-tuned_gpt2_1000/vocab.json
11/24/2019 17:57:03 - INFO - transformers.tokenization_utils -   loading file /content/drive/My Drive/Colab Notebooks/models/childrens-stories_fine-tuned_gpt2_1000/merges.txt
11/24/2019 17:57:03 - INFO - transformers.tokenization_utils -   loading file /content/drive/My Drive/Colab Notebooks/models/childrens-stories_fine-tuned_gpt2_1000/added_tokens.json
11/24/2019 17:57:03 - INFO - transformers.tokenization_utils -  