Sources:

*   https://pypi.org/project/gpt-2-simple/#description
*   https://medium.com/@stasinopoulos.dimitrios/a-beginners-guide-to-training-and-generating-text-using-gpt2-c2f2e1fbd10a
*   https://colab.research.google.com/drive/1VLG8e7YSEwypxU-noRNhsv5dW4NfTGce#scrollTo=VHdTL8NDbAh3
*   https://github.com/ak9250/gpt-2-colab
*   https://www.aiweirdness.com/d-and-d-character-bios-now-making-19-03-15/
*   https://minimaxir.com/2019/09/howto-gpt2/





[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/zawemi/GS2DIT/blob/main/Class%203/gpt_2_shakespeare.ipynb#scrollTo=4tIUvFbLMUuE)

#Let's teach AI to write like Shakespeare 🎓

##Installing the model

In [1]:
#install the library we'll use today
!pip install gpt-2-simple

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


##Generating text with the basic model

###Importing and loading necessary components

In [2]:
#import what we need
import gpt_2_simple as gpt2 #for gpt-2 (our AI model)
import os #lets us work with files and folders
import requests #helps us download files from the internet

In [3]:
#and let's download our AI model (the default 124M version)
gpt2.download_gpt2()   # model is saved in the current directory under models/124M/

Fetching checkpoint: 1.05Mit [00:00, 720Mit/s]                                                      
Fetching encoder.json: 1.05Mit [00:00, 3.75Mit/s]
Fetching hparams.json: 1.05Mit [00:00, 897Mit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 498Mit [00:51, 9.62Mit/s]
Fetching model.ckpt.index: 1.05Mit [00:00, 652Mit/s]                                                
Fetching model.ckpt.meta: 1.05Mit [00:00, 4.31Mit/s]
Fetching vocab.bpe: 1.05Mit [00:00, 5.12Mit/s]
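The download takes close to a minute, so it can be worth skipping when the files are already in place. Below is a minimal sketch of such a check. The helper name `missing_model_files` is hypothetical, and the file list is copied from the download log above, so treat it as an assumption for other model sizes or library versions.

```python
import os

# File names copied from the download log above (an assumption for other models).
EXPECTED_FILES = [
    "checkpoint",
    "encoder.json",
    "hparams.json",
    "model.ckpt.data-00000-of-00001",
    "model.ckpt.index",
    "model.ckpt.meta",
    "vocab.bpe",
]

def missing_model_files(model_name="124M", model_dir="models"):
    """Return the expected files that are not present in model_dir/model_name."""
    folder = os.path.join(model_dir, model_name)
    return [name for name in EXPECTED_FILES
            if not os.path.isfile(os.path.join(folder, name))]

# In the notebook this could guard the download cell:
# if missing_model_files("124M"):
#     gpt2.download_gpt2(model_name="124M")
```

An empty list means every file from the log is present and the download cell can be skipped.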


In [4]:
#starting the session so we can play with the gpt-2 model
sess = gpt2.start_tf_sess()

In [5]:
#load the downloaded base model from models/124M so we can use it
gpt2.load_gpt2(sess, run_name='124M', checkpoint_dir='models')

Loading checkpoint models/124M/model.ckpt


###Text generation

In [6]:
#the prefix is the opening text that the model will continue
prefix = "Is there a second Earth?"

In [7]:
#generate text that continues the prefix (length is counted in tokens, so the output can stop mid-sentence)
gpt2.generate(sess, run_name='124M', checkpoint_dir='models', prefix=prefix, length=50)

Is there a second Earth?

Fact: The only possible Earth-like planet is found in the gas giant planet Jovian. When it's orbiting a nearby star, it would move very slowly, but that's a problem for the solar system, since it's so
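The sample above stops in the middle of a sentence because `length` limits the number of generated tokens. One way to tidy such output is to cut it after the last sentence terminator. This is a minimal sketch, not part of gpt-2-simple: `trim_to_last_sentence` is a hypothetical helper, and it assumes the text has been captured as a string (for example with `return_as_list=True`).

```python
import re

def trim_to_last_sentence(text):
    """Cut text after its last '.', '!' or '?'; return it unchanged if none is found."""
    matches = list(re.finditer(r"[.!?]", text))
    if not matches:
        return text
    return text[:matches[-1].end()]

# Possible use in the notebook (needs the loaded model, so it is left commented out):
# text = gpt2.generate(sess, run_name='124M', checkpoint_dir='models',
#                      prefix=prefix, length=50, return_as_list=True)[0]
# print(trim_to_last_sentence(text))
```

The rule is deliberately crude: abbreviations such as "e.g." also count as sentence ends, so the result may be shorter than expected.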


##Generating text with improved (finetuned) model

**IMPORTANT**
<br>Restart the runtime (Runtime -> Restart runtime) before running the cells below, so that fine-tuning starts in a fresh TensorFlow session.

###Importing and loading necessary components

In [1]:
#import what we need
import gpt_2_simple as gpt2 #for gpt-2 (our AI model)
import os #lets us work with files and folders
import requests #helps us download files from the internet

In [2]:
#download the Nietzsche texts
!wget "https://s3.amazonaws.com/text-datasets/nietzsche.txt"

--2023-03-22 12:46:30--  https://s3.amazonaws.com/text-datasets/nietzsche.txt
Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.161.216, 52.216.244.14, 52.217.72.102, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.161.216|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 600901 (587K) [text/plain]
Saving to: ‘nietzsche.txt.2’


2023-03-22 12:46:30 (8.40 MB/s) - ‘nietzsche.txt.2’ saved [600901/600901]
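The log shows wget saving `nietzsche.txt.2`, which is what it does when `nietzsche.txt` already exists from an earlier run. A download that is skipped when the file is already present avoids these duplicates. The sketch below uses the `os` and `requests` imports from the cell above. Both function names are hypothetical, and the `fetch` parameter exists only so the logic can be tried without network access.

```python
import os

def fetch_with_requests(url):
    """Download url and return its body as text (needs the requests package)."""
    import requests  # imported lazily so download_if_missing works without it
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text

def download_if_missing(url, file_name, fetch=fetch_with_requests):
    """Save url to file_name unless the file exists; return True if it was downloaded."""
    if os.path.isfile(file_name):
        return False
    text = fetch(url)
    with open(file_name, "w", encoding="utf-8") as f:
        f.write(text)
    return True

# Possible use in the notebook:
# download_if_missing("https://s3.amazonaws.com/text-datasets/nietzsche.txt", "nietzsche.txt")
```

Running the cell a second time then leaves the existing file untouched instead of creating a numbered copy.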



In [3]:
#download a Game of Thrones text from https://www.kaggle.com/datasets/khulasasndh/game-of-thrones-books?select=001ssb.txt and rename it to got1.txt
!gdown "1CrL1wde_NGO68i5Prd_UNA_oW0cGQsxg&confirm=t"
!mv /content/001ssb.txt /content/got1.txt

Downloading...
From: https://drive.google.com/uc?id=1CrL1wde_NGO68i5Prd_UNA_oW0cGQsxg&confirm=t
To: /content/001ssb.txt
  0% 0.00/1.63M [00:00<?, ?B/s]100% 1.63M/1.63M [00:00<00:00, 127MB/s]


In [4]:
#let's download a file with Shakespeare's plays (the tinyshakespeare dataset) and rename it to shakespeare.txt
!wget "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
!mv /content/input.txt /content/shakespeare.txt

--2023-03-22 12:46:38--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2023-03-22 12:46:38 (188 MB/s) - ‘input.txt’ saved [1115394/1115394]



In [5]:
#starting the session so we can play with the gpt-2 model
sess = gpt2.start_tf_sess()

###Teaching our model

In [6]:
#fine-tuning teaches the model to write in the style of the training text
#this cell trains on got1.txt; pass 'shakespeare.txt' instead to make the model write like Shakespeare
#it takes a lot of time (~15min)...
gpt2.finetune(sess, 'got1.txt', steps=500)   # steps is the maximum number of training steps

Loading checkpoint models/124M/model.ckpt
Loading dataset...


100%|██████████| 1/1 [00:01<00:00,  1.84s/it]


dataset has 433157 tokens
Training...
[1 | 7.18] loss=3.55 avg=3.55
[2 | 9.26] loss=3.50 avg=3.52
[3 | 11.35] loss=3.44 avg=3.50
[4 | 13.44] loss=3.15 avg=3.41
[5 | 15.54] loss=3.25 avg=3.38
[6 | 17.64] loss=3.26 avg=3.36
[7 | 19.74] loss=3.22 avg=3.34
[8 | 21.87] loss=3.22 avg=3.32
[9 | 23.98] loss=3.14 avg=3.30
[10 | 26.10] loss=3.05 avg=3.27
[11 | 28.23] loss=3.20 avg=3.27
[12 | 30.36] loss=3.18 avg=3.26
[13 | 32.50] loss=3.16 avg=3.25
[14 | 34.63] loss=3.19 avg=3.25
[15 | 36.77] loss=3.02 avg=3.23
[16 | 38.91] loss=3.15 avg=3.23
[17 | 41.06] loss=3.09 avg=3.22
[18 | 43.21] loss=3.09 avg=3.21
[19 | 45.36] loss=3.15 avg=3.21
[20 | 47.53] loss=3.08 avg=3.20
[21 | 49.70] loss=3.12 avg=3.19
[22 | 51.87] loss=3.14 avg=3.19
[23 | 54.05] loss=3.05 avg=3.19
[24 | 56.23] loss=3.13 avg=3.18
[25 | 58.42] loss=3.11 avg=3.18
[26 | 60.61] loss=3.07 avg=3.17
[27 | 62.80] loss=3.21 avg=3.18
[28 | 65.00] loss=3.08 avg=3.17
[29 | 67.20] loss=3.08 avg=3.17
[30 | 69.41] loss=3.05 avg=3.16
[31 | 71.63] 
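Each complete log line above appears to follow the pattern `[step | elapsed seconds] loss=... avg=...`. The sketch below parses such lines and recomputes the `avg` column as a bias-corrected exponential moving average of the loss. The decay of 0.99 is an assumption; it was chosen because it reproduces the logged values, not because the source states it. The function names are hypothetical.

```python
import re

# Matches complete log lines only; a truncated line such as "[31 | 71.63] " is skipped.
LOG_LINE = re.compile(r"\[(\d+) \| ([\d.]+)\] loss=([\d.]+) avg=([\d.]+)")

def parse_log(text):
    """Return (step, seconds, loss, avg) tuples for every complete log line."""
    return [(int(s), float(t), float(l), float(a))
            for s, t, l, a in LOG_LINE.findall(text)]

def running_average(losses, decay=0.99):
    """Bias-corrected exponential moving average of the losses (decay is assumed)."""
    weighted_sum, weight = 0.0, 0.0
    averages = []
    for loss in losses:
        weighted_sum = weighted_sum * decay + loss
        weight = weight * decay + 1.0
        averages.append(weighted_sum / weight)
    return averages

log = """[1 | 7.18] loss=3.55 avg=3.55
[2 | 9.26] loss=3.50 avg=3.52
[3 | 11.35] loss=3.44 avg=3.50
[4 | 13.44] loss=3.15 avg=3.41
[31 | 71.63] """

rows = parse_log(log)
recomputed = [round(a, 2) for a in running_average([row[2] for row in rows])]
print(recomputed)
```

For these four steps the recomputed values agree with the logged `avg` column. Because the average smooths out step-to-step noise, it is the easier column to watch when judging whether training is still improving.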

###Text generation

In [16]:
prefix = "Is there a second Earth?"

In [17]:
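#generate with the fine-tuned model (default run 'run1'); a FailedPreconditionError here usually means the weights are not loaded in this session, so let fine-tuning finish first or call gpt2.load_gpt2(sess, run_name='run1')
gpt2.generate(sess, prefix=prefix, length=150)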

FailedPreconditionError: ignored

###Saving model to Google Drive (optional)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
gpt2.copy_checkpoint_to_gdrive(run_name='run1')

You can find more texts on Project Gutenberg, e.g.:
https://www.gutenberg.org/cache/epub/1597/pg1597.txt
<br><br>
You can download them to Colab using code similar to the cells below.

In [None]:
#!wget https://www.gutenberg.org/cache/epub/1597/pg1597.txt

--2023-03-21 14:49:16--  https://www.gutenberg.org/cache/epub/1597/pg1597.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 329071 (321K) [text/plain]
Saving to: ‘pg1597.txt’


2023-03-21 14:49:22 (800 KB/s) - ‘pg1597.txt’ saved [329071/329071]



In [None]:
#!wget https://www.gutenberg.org/files/98/98-0.txt

--2023-02-22 13:25:10--  https://www.gutenberg.org/files/98/98-0.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 807231 (788K) [text/plain]
Saving to: ‘98-0.txt’


2023-02-22 13:25:12 (718 KB/s) - ‘98-0.txt’ saved [807231/807231]
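Project Gutenberg files wrap the book in a header and a licence footer, and the model would learn that boilerplate along with the story. The sketch below keeps only the text between the `*** START OF ...` and `*** END OF ...` marker lines. The exact marker wording varies between files, so the patterns are an assumption to check against each download, and `strip_gutenberg_boilerplate` is a hypothetical helper name.

```python
import re

# Marker wording is an assumption; inspect the downloaded file if nothing matches.
START = re.compile(r"^\*\*\* ?START OF (THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE)
END = re.compile(r"^\*\*\* ?END OF (THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE)

def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END marker lines, or all of it if absent."""
    start = START.search(text)
    begin = start.end() if start else 0
    end = END.search(text, begin)
    finish = end.start() if end else len(text)
    return text[begin:finish].strip()

sample = (
    "Title and licence header\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "\n"
    "Once upon a time.\n"
    "\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "Licence footer\n"
)
print(strip_gutenberg_boilerplate(sample))
```

In the notebook, the cleaned text could be written to a new file and that file passed to `gpt2.finetune` in place of the raw download.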



In [None]:
#https://github.com/matt-dray/tng-stardate/tree/master/data/scripts