# Sample Steam Store Descriptions with GPT-2
Code inspired by https://github.com/woctezuma/sample-steam-descriptions

## Setting the GPT-2 model

Install the Python package

Reference: https://github.com/minimaxir/gpt-2-simple

In [1]:
!pip install gpt_2_simple

Collecting gpt_2_simple
  Downloading https://files.pythonhosted.org/packages/b6/cf/4003c7d85425af353e15d938bc0d87a0bdedd6b00229e1f7808c2524b518/gpt_2_simple-0.2.tar.gz
Building wheels for collected packages: gpt-2-simple
  Building wheel for gpt-2-simple (setup.py) ... done
  Stored in directory: /root/.cache/pip/wheels/51/d0/bd/293c80200f60bcd75a0f4028684e55e959da3a2727858d98a0
Successfully built gpt-2-simple
Installing collected packages: gpt-2-simple
Successfully installed gpt-2-simple-0.2


Import the required modules

In [0]:
import gpt_2_simple as gpt2
from datetime import datetime
from google.colab import files

## Downloading GPT-2

In [3]:
gpt2.download_gpt2()

Fetching checkpoint: 1.00kit [00:00, 330kit/s]                                                      
Fetching encoder.json: 1.04Mit [00:00, 49.3Mit/s]                                                   
Fetching hparams.json: 1.00kit [00:00, 323kit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 498Mit [00:09, 54.4Mit/s]                                  
Fetching model.ckpt.index: 6.00kit [00:00, 2.95Mit/s]                                               
Fetching model.ckpt.meta: 472kit [00:00, 35.9Mit/s]                                                 
Fetching vocab.bpe: 457kit [00:00, 37.1Mit/s]                                                       


## Mounting Google Drive

In [0]:
gpt2.mount_gdrive()

## Uploading a Training Text File to Colaboratory

#### Either get the data yourself

This is currently not possible on Google Colab, because you need either:
-   the app details (slow to download),
-   or aggregate.json (stored with Git LFS, which is not installed on Google Colab).

#### Or get a data snapshot from me

In [5]:
!curl -O https://raw.githubusercontent.com/woctezuma/sample-steam-descriptions/master/data/concatenated_store_descriptions.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 43.1M  100 43.1M    0     0  36.8M      0  0:00:01  0:00:01 --:--:-- 36.8M
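
Before fine-tuning, it can be worth confirming that the download completed and that the file is not empty. The following is a minimal sketch using only the standard library; the helper name `summarize_text_file` is hypothetical and not part of gpt-2-simple, and the file name is the one used in the `curl` command above.

```python
import os


def summarize_text_file(path):
    """Return (size in bytes, total lines, non-empty lines) for a text file."""
    size_bytes = os.path.getsize(path)
    n_lines = 0
    n_non_empty = 0
    # errors="replace" avoids a crash if a few bytes are not valid UTF-8.
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            n_lines += 1
            if line.strip():
                n_non_empty += 1
    return size_bytes, n_lines, n_non_empty


# Assumed file name, matching the download step above.
data_path = "concatenated_store_descriptions.txt"
if os.path.exists(data_path):
    size_bytes, n_lines, n_non_empty = summarize_text_file(data_path)
    print(f"{size_bytes / 1e6:.1f} MB, {n_lines} lines, {n_non_empty} non-empty")
else:
    print(f"{data_path} not found; run the download cell first.")
```

A size far below the roughly 43 MB reported by `curl` would suggest a truncated download.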


## Finetune GPT-2

In [0]:
file_name='concatenated_store_descriptions.txt'

In [0]:
sess = gpt2.start_tf_sess()

gpt2.finetune(sess,
              run_name='descriptions',
              dataset=file_name,
              steps=1000,
              restore_from='fresh',   # change to 'latest' to resume training
              print_every=10,   # number of steps between progress printouts
              sample_every=200,   # number of steps between demo samples
              save_every=500   # number of steps between checkpoint saves
              )

Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Use tf.cast instead.
Instructions for updating:
Use tf.random.categorical instead.
Instructions for updating:
Use tf.cast instead.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
Loading checkpoint models/117M/model.ckpt
Instructions for updating:
Use standard file APIs to check for files with this prefix.
INFO:tensorflow:Restoring parameters from models/117M/model.ckpt


  0%|          | 0/1 [00:00<?, ?it/s]

Loading dataset...


In [0]:
gpt2.copy_checkpoint_to_gdrive()

## Load a Trained Model Checkpoint

In [0]:
gpt2.copy_checkpoint_from_gdrive()

In [0]:
sess = gpt2.start_tf_sess()

gpt2.load_gpt2(sess,
               run_name='descriptions')

## Generate Text From The Trained Model

In [0]:
num_samples = 3
num_batches = 3  # passed as batch_size below: number of samples generated in parallel, which gives a large speedup; nsamples should be a multiple of it

gpt2.generate(sess,
              nsamples=num_samples,
              batch_size=num_batches)
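
The relationship between `nsamples` and `batch_size` can be sketched as follows. To my understanding, gpt-2-simple expects `nsamples` to be a multiple of `batch_size`, so that every parallel batch is full; the helper `plan_batches` below is hypothetical and only illustrates that arithmetic.

```python
def plan_batches(nsamples, batch_size):
    """Return how many parallel batches are needed to produce nsamples samples."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if nsamples % batch_size != 0:
        raise ValueError(
            f"nsamples ({nsamples}) should be a multiple of batch_size ({batch_size})"
        )
    return nsamples // batch_size


print(plan_batches(3, 3))  # 1: all three samples are generated in one pass
print(plan_batches(6, 3))  # 2: two passes of three samples each
```

With the values used in this notebook (3 samples, batch size 3), all samples come out of a single pass.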

In [0]:
gpt2.generate(sess,
              nsamples=num_samples,
              batch_size=num_batches,
              prefix='Half-Life 3 is the long-awaited sequel in the Half-Life franchise developed by Valve')

In [0]:
gpt2.generate(sess,
              nsamples=num_samples,
              batch_size=num_batches,
              prefix='Spelunky 2 is the sequel to the most acclaimed rogue-like platformer of all time')