# Sample Steam Reviews with GPT-2
Code inspired by https://github.com/woctezuma/sample-steam-reviews-with-gpt-2

## Setting the GPT-2 model

Install the Python package

Reference: https://github.com/minimaxir/gpt-2-simple

In [1]:
!pip install gpt_2_simple

Collecting gpt_2_simple
  Downloading https://files.pythonhosted.org/packages/bc/7d/1ea4c2a54ecdda5e57e45686e5cdf1ccc45809841ab50c89bc63638c5553/gpt_2_simple-0.5.tar.gz
Collecting toposort (from gpt_2_simple)
  Downloading https://files.pythonhosted.org/packages/e9/8a/321cd8ea5f4a22a06e3ba30ef31ec33bea11a3443eeb1d89807640ee6ed4/toposort-1.5-py2.py3-none-any.whl
Building wheels for collected packages: gpt-2-simple
  Building wheel for gpt-2-simple (setup.py) ... done
  Stored in directory: /root/.cache/pip/wheels/0a/0d/50/166d4caecc4bb1820ce1b7d8e68ce12f9839c919a5c530cc60
Successfully built gpt-2-simple
Installing collected packages: toposort, gpt-2-simple
Successfully installed gpt-2-simple-0.5 toposort-1.5


Import the packages

In [0]:
import gpt_2_simple as gpt2
from datetime import datetime
from google.colab import files

## Downloading GPT-2

Choose between `117M` and `345M` models

In [0]:
# model_name = '117M'
model_name = '345M'

Download

In [4]:
gpt2.download_gpt2(model_name=model_name)

Fetching checkpoint: 1.00kit [00:00, 257kit/s]                                                      
Fetching encoder.json: 1.04Mit [00:00, 42.4Mit/s]                                                   
Fetching hparams.json: 1.00kit [00:00, 300kit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 1.42Git [00:22, 62.7Mit/s]                                 
Fetching model.ckpt.index: 11.0kit [00:00, 2.14Mit/s]                                               
Fetching model.ckpt.meta: 927kit [00:00, 39.0Mit/s]                                                 
Fetching vocab.bpe: 457kit [00:00, 30.9Mit/s]                                                       
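
The progress lines above list the seven files fetched for the checkpoint. As a quick sanity check before fine-tuning, a sketch like the following reports any that are missing. It assumes the files land under `models/<model_name>/`, which matches the `models/345M/model.ckpt` path printed in the fine-tuning log; `missing_model_files` is a hypothetical helper, not part of `gpt_2_simple`.

```python
import os

# File names copied from the download progress output.
EXPECTED_FILES = [
    'checkpoint',
    'encoder.json',
    'hparams.json',
    'model.ckpt.data-00000-of-00001',
    'model.ckpt.index',
    'model.ckpt.meta',
    'vocab.bpe',
]

def missing_model_files(model_name, models_dir='models'):
    """Return the expected checkpoint files that are absent on disk.

    The models/<model_name>/ layout is an assumption based on the
    fine-tuning log, not a documented guarantee.
    """
    model_dir = os.path.join(models_dir, model_name)
    return [name for name in EXPECTED_FILES
            if not os.path.isfile(os.path.join(model_dir, name))]

# An empty list means every expected file was found.
print(missing_model_files('345M'))
```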


## Uploading a Training Text File to Colaboratory

### Either get the data yourself

In [5]:
!curl -O https://raw.githubusercontent.com/woctezuma/sample-steam-reviews-with-gpt-2/master/export_review_data.py

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  7198  100  7198    0     0  33324      0 --:--:-- --:--:-- --:--:-- 33324


In [6]:
!curl -O https://raw.githubusercontent.com/woctezuma/sample-steam-reviews-with-gpt-2/master/requirements.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100    37  100    37    0     0    213      0 --:--:-- --:--:-- --:--:--   213


In [7]:
!pip install -r requirements.txt

Collecting steamreviews==0.8.0 (from -r requirements.txt (line 1))
  Downloading https://files.pythonhosted.org/packages/c9/2c/556162233faa4c854f66d5f3e4a4495dc294c72e897711aa83c6fa742a86/steamreviews-0.8.0-py3-none-any.whl
Collecting langdetect==1.0.7 (from -r requirements.txt (line 2))
  Downloading https://files.pythonhosted.org/packages/59/59/4bc44158a767a6d66de18c4136c8aa90491d56cc951c10b74dd1e13213c9/langdetect-1.0.7.zip (998kB)
     |████████████████████████████████| 1.0MB 9.2MB/s 
Building wheels for collected packages: langdetect
  Building wheel for langdetect (setup.py) ... done
  Stored in directory: /root/.cache/pip/wheels/ec/0c/a9/1647275e7ef5014e7b83ff30105180e332867d65e7617ddafe
Successfully built langdetect
Installing collected packages: steamreviews, langdetect
Successfully installed langdetect-1.0.7 steamreviews-0.8.0


In [0]:
app_id = 203770 # Crusader Kings II (for Artifact, use 583950)

# num_days = 28*3 # slightly less than 3 months
num_days = -1 # if negative, then no time limit

In [9]:
from export_review_data import apply_workflow_for_app_id

apply_workflow_for_app_id(app_id,
                          num_days=num_days)

[appID = 203770] expected #reviews = 24077
Number of queries 150 reached. Cooldown: 310 seconds
#reviews = 24071
Filtering out reviews which were not written in english.
#reviews = 24071
Filtering out reviews with strictly fewer than 150 characters.
#reviews = 11011
[review n°49661536] https://steamcommunity.com/profiles/76561198173075417/recommended/203770/
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
░░░░░░████░░████░░░████░░░░░░
░░░░░░░████░░████░░░░░░░░░░░░░
░░░░░░████░░████░░░████░░░░░░
░░░░░░░████▄▄████░░░████░░░░░░
░░░░░░██████████░░░████░░░░░░
░░░░░░░████▀▀████░░░████░░░░░░
░░░░░░░████░░████░░░████░░░░░░
░░░░░░████░░████░░░████░░░░░░
░░░░░░░████░░████░░░████░░░░░░
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
_______________$
 ___________$$$$$_________________________$$
 ________$$$$$$$$$$$______________________$$
 ______$$$$$$$$$$$$$$$____________________$$
 _____$$$$$$$$$$$$$$$$____________________$$
 ____$$$$$$$$$$$$$$$$$$$__________________$$
 ___$$$$$$$$$$$$$$$$$$$$$_________________$$
 __$$$$$$$$$$$$
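
The log above reports two filters: reviews not written in English are dropped first, then reviews with strictly fewer than 150 characters. The sketch below mirrors that order on a plain list of strings. It is not the code of `export_review_data.py`: `filter_reviews` and `looks_english` are hypothetical names, and `looks_english` is a crude ASCII-ratio stand-in so that the example runs without dependencies. In the notebook, a predicate built on `langdetect` (installed above) could be passed instead.

```python
def looks_english(text, min_ascii_ratio=0.9):
    """Crude stand-in for a language detector (assumption, illustration only)."""
    if not text:
        return False
    ascii_chars = sum(1 for ch in text if ord(ch) < 128)
    return ascii_chars / len(text) >= min_ascii_ratio

def filter_reviews(reviews, min_chars=150, is_english=looks_english):
    """Keep English reviews, then drop those with strictly fewer than min_chars characters."""
    english = [review for review in reviews if is_english(review)]
    return [review for review in english if len(review) >= min_chars]

reviews = [
    'Great game.',                  # English, but too short
    'A' * 150,                      # exactly 150 characters: kept
    'Это отличная игра. ' * 10,     # long enough, rejected by the stand-in detector
]
kept = filter_reviews(reviews)
print(len(kept))
```

The threshold is inclusive on purpose: "strictly fewer than 150" means a review of exactly 150 characters survives.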

### Or get a data snapshot from me

Snapshots are provided only as examples, because the recommended way is to run the code above for the game of your choice.

In [0]:
!mkdir -p data/

## Either Artifact (only the recent English reviews):
# !curl -O https://raw.githubusercontent.com/woctezuma/sample-steam-reviews-with-gpt-2/master/data/with_delimiters/583950.txt
# !mv 583950.txt data/

## Or Crusader Kings II (all the English reviews):
# !curl -O https://raw.githubusercontent.com/wiki/woctezuma/sample-steam-reviews-with-gpt-2/data/with_delimiters/203770.txt
# !mv 203770.txt data/
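
The snapshot paths contain `with_delimiters`, and the generation cells further down use `<|startoftext|>` as a prefix and `<|endoftext|>` as the truncation marker. Assuming this means each review in the training file is wrapped in that token pair (the exact layout written by `export_review_data.py` is not shown here), a training file could be assembled roughly as follows; `write_training_file` is a hypothetical helper.

```python
import os
import tempfile

START_TOKEN = '<|startoftext|>'
END_TOKEN = '<|endoftext|>'

def write_training_file(reviews, file_name):
    """Write each review wrapped in start/end tokens (assumed layout)."""
    folder = os.path.dirname(file_name)
    if folder:
        os.makedirs(folder, exist_ok=True)
    with open(file_name, 'w', encoding='utf-8') as f:
        for review in reviews:
            f.write(START_TOKEN + review.strip() + END_TOKEN + '\n')

# Demonstration in a temporary folder, to avoid touching the real data/ folder.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'data', 'example.txt')
    write_training_file(['First review.', '  Second review.  '], path)
    with open(path, encoding='utf-8') as f:
        content = f.read()

print(content)
```

Explicit delimiters let the model learn where a review starts and stops, which is what makes prefix-based generation and truncation meaningful later on.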

## Finetune GPT-2

In [0]:
file_name = 'data/' + str(app_id) + '.txt'

run_name = model_name + '_reviews_' + str(app_id)

In [12]:
sess = gpt2.start_tf_sess()

gpt2.finetune(sess,
              run_name=run_name,
              dataset=file_name,
              model_name=model_name,
              steps=1000,
              restore_from='fresh', # change to 'latest' to resume training
              print_every=10,       # how many steps between printing progress
              sample_every=200,     # how many steps to print a demo sample
              save_every=500        # how many steps between saving checkpoint              
              )

Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Use tf.cast instead.
Instructions for updating:
Use tf.random.categorical instead.
Instructions for updating:
Use tf.cast instead.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
Loading checkpoint models/345M/model.ckpt
Instructions for updating:
Use standard file APIs to check for files with this prefix.
INFO:tensorflow:Restoring parameters from models/345M/model.ckpt


  0%|          | 0/1 [00:00<?, ?it/s]

Loading dataset...


100%|██████████| 1/1 [00:10<00:00, 10.78s/it]


dataset has 1776101 tokens
Training...
[10 | 23.78] loss=3.41 avg=3.41
[20 | 39.55] loss=3.51 avg=3.46
[30 | 55.78] loss=3.01 avg=3.31
[40 | 72.56] loss=3.17 avg=3.27
[50 | 89.01] loss=2.93 avg=3.20
[60 | 105.15] loss=3.03 avg=3.17
[70 | 121.22] loss=3.13 avg=3.17
[80 | 137.35] loss=2.85 avg=3.13
[90 | 153.60] loss=2.81 avg=3.09
[100 | 169.93] loss=2.65 avg=3.04
[110 | 186.23] loss=2.58 avg=3.00
[120 | 202.52] loss=2.78 avg=2.98
[130 | 218.79] loss=2.87 avg=2.97
[140 | 234.97] loss=2.67 avg=2.95
[150 | 251.21] loss=3.26 avg=2.97
[160 | 267.51] loss=2.68 avg=2.95
[170 | 283.86] loss=2.97 avg=2.95
[180 | 300.14] loss=3.18 avg=2.97
[190 | 316.38] loss=3.05 avg=2.97
[200 | 332.60] loss=2.84 avg=2.96
 the game is very complex, the learning curve is very steep, and it takes quite a few hours to become adept at it.
After you understand the basics and how it works, you may feel like you can start playing anything and everything. Of course this is not accurate by any stretch of the imagination,
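
Each progress line in the log has the form `[step | elapsed seconds] loss=... avg=...`, so the training pace can be read off directly. The sketch below parses two lines copied from the output above and estimates the time per step; `parse_progress` is a hypothetical helper, and the extrapolation to 1000 steps assumes a constant pace and ignores pauses for sampling and checkpoint saving.

```python
import re

# Matches lines such as "[10 | 23.78] loss=3.41 avg=3.41".
LINE_RE = re.compile(r'\[(\d+) \| ([\d.]+)\] loss=([\d.]+) avg=([\d.]+)')

def parse_progress(line):
    """Return the fields of a progress line as a dict, or None if it does not match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    step, elapsed, loss, avg = match.groups()
    return {'step': int(step), 'elapsed': float(elapsed),
            'loss': float(loss), 'avg': float(avg)}

first = parse_progress('[10 | 23.78] loss=3.41 avg=3.41')
last = parse_progress('[200 | 332.60] loss=2.84 avg=2.96')

seconds_per_step = (last['elapsed'] - first['elapsed']) / (last['step'] - first['step'])
print('{:.2f} s/step'.format(seconds_per_step))
print('{:.0f} min for 1000 steps'.format(1000 * seconds_per_step / 60))
```

With the two lines above, this works out to about 1.6 seconds per step, so the 1000 requested steps take roughly 27 minutes of pure training time.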

## Save a Trained Model Checkpoint

In [0]:
# gpt2.mount_gdrive()

In [0]:
# gpt2.copy_checkpoint_to_gdrive(run_name=run_name)

## Load a Trained Model Checkpoint

In [0]:
# gpt2.mount_gdrive()

In [0]:
# gpt2.copy_checkpoint_from_gdrive(run_name=run_name)

## Generate Text From The Trained Model

In [0]:
temperature = 1.0 # Default: 0.7 ; a higher temperature helps avoid copying training text, especially if your dataset is small
top_k = 40        # Default: 0   ; Recommended: 40  ; ignored if top_p > 0.0
top_p = 0.9       # Default: 0.0 ; Recommended: 0.9 ; no need for top_k if top_p > 0.0
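
To make the three settings concrete, here is a generic, dependency-free illustration of how temperature, top-k and top-p (nucleus) truncation reshape a next-token distribution. It is a sketch of the general technique, not the internal code of `gpt_2_simple`, whose truncation details may differ; `softmax` and `truncate_distribution` are hypothetical names. Following the comments above, `top_p` takes precedence over `top_k` when it is positive.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def truncate_distribution(logits, temperature=1.0, top_k=0, top_p=0.0):
    """Return {token_index: probability} after truncation and renormalization."""
    probs = softmax(logits, temperature)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_p > 0.0:
        # Nucleus: smallest set of most likely tokens whose mass reaches top_p.
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
    elif top_k > 0:
        # Top-k: keep the k most likely tokens.
        kept = order[:top_k]
    else:
        kept = order
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# Toy distribution with probabilities 0.5, 0.3, 0.15, 0.05.
logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
print(truncate_distribution(logits, top_p=0.9))
print(truncate_distribution(logits, top_k=2))
```

On the toy distribution, `top_p=0.9` keeps the first three tokens (cumulative mass 0.95), whereas `top_k=2` keeps only the first two, whatever their mass.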

In [0]:
num_samples = 3
num_batches = 3 # Passed as batch_size: samples are generated in parallel, giving a massive speedup; nsamples must be a multiple of it.

In [19]:
gen_texts_A = gpt2.generate(sess,
              run_name=run_name,
              nsamples=num_samples,
              batch_size=num_batches,              
              temperature=temperature,
              top_k=top_k,
              top_p=top_p,
              truncate='<|endoftext|>',                            
              return_as_list=True)

print('\n\n--- SEPARATOR ---\n\n'.join(gen_texts_A))

We were expecting something like Lord of the Rings, but we weren't surprised. It's a very sandboxish gameplay and the depth is quite excellent. It has a very huge learning curve, but once you learn how to play the game, it's fantastic.
I might be spoiled by Paradox, but you will definitely like this game. It can be quite complex, but it is very rewarding when you understand the basics. Even if you don't, playing it will keep you busy for many years. If you haven't played the base game or a similar game, I highly recommend it.
It isn't an exact replica of the Total War series, but it's very close. Battles are a bit too intricate for most people to handle, but the concept of managing your personal armies is good. A more realistic version of Crusader kings 2 is also highly recommended. You can tailor your army even further, with various improvements and additions. It's fantastic and it's also very difficult to learn. Just be prepared to wait for a large fortune (Puerto Rican divorcees are

In [20]:
gen_texts_B = gpt2.generate(sess,
              run_name=run_name,
              nsamples=num_samples,
              batch_size=num_batches,
              temperature=temperature,
              top_k=top_k,
              top_p=top_p,                            
              prefix='<|startoftext|>I love',
              truncate='<|endoftext|>',
              return_as_list=True)

print('\n\n--- SEPARATOR ---\n\n'.join(gen_texts_B))

<|startoftext|>I love this game, but would recommend it with caution.
If you want to just play as a count and form the kingdom of France, that is fine, but then you can play as some random German tribes, or to put a ruler in a westerland and name him Duke of Württemberg, although that will happen again. And don't buy it if you've played another Paradox game, or can get the dlc's faster, but don't put off getting them as well.
Also, you could change the time zone of the game, while that is also fine, you need to make sure that your steam client is not set to the UK time zone, otherwise you might be losing your game, as different time zones can have different political issues. If you do try it, it might work for awhile.
It's a great game, but I wouldn't recommend it as a gift to give to friends. Also, I can't recommend it to buy without the DLC's, so they have to be purchased separately from the base game as well.

--- SEPARATOR ---

<|startoftext|>I love the game and with the Expansion 
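
The samples above keep the `<|startoftext|>` marker that was passed in the prefix, while `truncate` cuts each sample at `<|endoftext|>`. If the marker is unwanted in the final text, it can be stripped afterwards. A minimal sketch, with a hypothetical `clean_sample` helper:

```python
START_TOKEN = '<|startoftext|>'
END_TOKEN = '<|endoftext|>'

def clean_sample(text, start_token=START_TOKEN, end_token=END_TOKEN):
    """Strip a leading start token and cut at the first end token, if present."""
    if text.startswith(start_token):
        text = text[len(start_token):]
    end = text.find(end_token)
    if end != -1:
        text = text[:end]
    return text.strip()

sample = '<|startoftext|>I love this game, but would recommend it with caution.'
print(clean_sample(sample))
```

In the notebook, this could be applied to a whole batch with `[clean_sample(t) for t in gen_texts_B]` before writing the samples to a file.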

In [21]:
gen_texts_C = gpt2.generate(sess,
              run_name=run_name,
              nsamples=num_samples,
              batch_size=num_batches,
              temperature=temperature,
              top_k=top_k,
              top_p=top_p,                            
              prefix='<|startoftext|>I hate',
              truncate='<|endoftext|>',
              return_as_list=True)

print('\n\n--- SEPARATOR ---\n\n'.join(gen_texts_C))

<|startoftext|>I hate this game. The interface is clunky, the is a very large learning curve.
If you are not willing to take the time to learn the game it will be a hard slog to learn.
Update, I understand the interface issues and I have manually created a video tutorial for those who are interested in how to play it.

--- SEPARATOR ---

<|startoftext|>I hate the tutorial.
The game is simple and easy.
When you play it, you just wait, almost 100 hours of gameplay never even gets started.
The game doesn't take long, it can be fast paced or slow.
You can almost spend a day in the game itself, only to get bored.
When you want to play a different type of game, try Crusader Kings 2 because it's on the same level with Total War and is a bit easier to learn.
I'm addicted!

--- SEPARATOR ---

<|startoftext|>I hate this game.  I've spent hundreds of hours in it playing, modding, and teaching myself everything I know, and it still doesn't understand how to play it.  I've seen more info than I've 

In [22]:
gen_texts_D = gpt2.generate(sess,
              run_name=run_name,
              nsamples=num_samples,
              batch_size=num_batches,
              temperature=temperature,
              top_k=top_k,
              top_p=top_p,                            
              prefix='<|startoftext|>Please',
              truncate='<|endoftext|>',
              return_as_list=True)

print('\n\n--- SEPARATOR ---\n\n'.join(gen_texts_D))

<|startoftext|>Please wait a few weeks for me to decide I want to play this game because the price isn't worth it and it has kind of sucked my life away.
If you like RTS type games, like Europa Universalis you won't like this game.
I wish I had a little more of my life and had it because I can't get it, but if it does help you keep your attention and start a family or do some hard riding and backstabs then I can recommend this game to you.
Recommended for people who have a love of history.
Buy, let me just say if you like Strategy games then it should be the most popular FPS game.
Just like Civilization you can adopt other history-based facets of your character.
I didn't play as a Mongol, but I liked the history aspect, and the Mongol Tocchu and his bronze waves were a nice change of pace from the Europa Universalis series.
Be warned though it's a daunting game, the learning curve is pretty high, but once you figure out the basics it's a fun game.
A game that can remind you a bit of ta

In [23]:
gen_texts_E = gpt2.generate(sess,
              run_name=run_name,
              nsamples=num_samples,
              batch_size=num_batches,
              temperature=temperature,
              top_k=top_k,
              top_p=top_p,                            
              prefix='<|startoftext|>This game has near infinite replay value',
              truncate='<|endoftext|>',
              return_as_list=True)

print('\n\n--- SEPARATOR ---\n\n'.join(gen_texts_E))

<|startoftext|>This game has near infinite replay value, as long as you can stay with it for a few months before your character dies, your future becomes uncertain. As a new character you'll keep trying to keep your marriage healthy, while the first ruler you play as makes mistakes and dukes are weak or evil. There is so much to keep track of and watch the relationships between your characters, their children and surrounding dukes. It's a unique and interesting take on the whole medieval (and indeed medieval history) world.
If you find your love of historical re-playability, political intrigue and history soothing, this game may just be for you!
9.8/10 is recommended to anyone who enjoys using their imagination and imagination to rule through tough times.
But don't let the silly, discordant, and complex nature of the game scare you away. You'll have a blast in the magic of the role playing and backstabbing that is CK2.

--- SEPARATOR ---

<|startoftext|>This game has near infinite repl

## Copy the Generated Text to Google Drive

In [0]:
temperature_suffixe = '_temperature_' + str(temperature)

In [0]:
if top_p > 0.0:
  file_name_suffixe = temperature_suffixe + '_top_p_' + str(top_p)
elif top_k > 0:
  file_name_suffixe = temperature_suffixe + '_top_k_' + str(top_k)
else:
  file_name_suffixe = temperature_suffixe

In [26]:
output_file_name = 'output_' + str(app_id) + file_name_suffixe + '.md'

print(output_file_name)

output_203770_temperature_1.0_top_p_0.9.md


In [0]:
# Section titles paired with the corresponding generated samples, in output order
sections = [
  ('Reviews generated unconditionally', gen_texts_A),
  ('Reviews starting with I love', gen_texts_B),
  ('Reviews starting with I hate', gen_texts_C),
  ('Reviews starting with Please', gen_texts_D),
  ('Reviews starting with This game has near infinite replay value', gen_texts_E),
]

with open(output_file_name, 'w') as f:

  f.write('## Game\n\n')
  f.write('[<img alt="game name" src="https://steamcdn-a.akamaihd.net/steam/apps/{}/header.jpg" width="150">](https://store.steampowered.com/app/{})\n\n'.format(app_id, app_id))

  # One numbered block quote per generated review
  for (title, gen_texts) in sections:
    f.write('## {}\n\n'.format(title))
    for (i, gen_text) in enumerate(gen_texts):
      f.write('{}.\n\n'.format(i+1))
      f.write('> {}\n\n'.format(gen_text))

In [0]:
gpt2.mount_gdrive()

In [29]:
import shutil

shutil.copyfile(output_file_name, '/content/drive/My Drive/' + output_file_name)

'/content/drive/My Drive/output_203770_temperature_1.0_top_p_0.9.md'