In [1]:
!pip install gpt-2-simple



# Borrowing GPT2

Language models such as GPT-2, GPT-3, GPT-4, and the ChatGPT models are trained on enormous amounts of text and are very large. It isn't realistic for us to train a model anywhere near that size or sophistication, but we can borrow a pretrained model and repurpose it for our own use.

## Download Model

The model itself is fairly large: the download is roughly 500MB, and that is the smallest GPT-2 model (124M parameters). The larger ones are impractical to work with unless we have enterprise-scale hardware.


In [1]:
import gpt_2_simple as gpt2
import os
import requests

model_name = "124M"
if not os.path.isdir(os.path.join("models", model_name)):
	print(f"Downloading {model_name} model...")
	gpt2.download_gpt2(model_name=model_name)   # model is saved into current directory under /models/124M/

2023-03-27 16:19:17.768938: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
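
To check the "roughly 500MB" figure on your own machine, a small helper like the one below can total the files in the model folder. This is only a sketch: `directory_size_mb` is a name made up here, not part of gpt-2-simple, and it assumes the model was saved under `models/124M` as in the cell above.

```python
import os

def directory_size_mb(path):
    """Return the total size of all regular files under `path`, in MB (10**6 bytes)."""
    total_bytes = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            # Skip broken symlinks and anything that is not a regular file.
            if os.path.isfile(file_path):
                total_bytes += os.path.getsize(file_path)
    return total_bytes / 1_000_000

model_dir = os.path.join("models", "124M")  # location assumed from the download cell
if os.path.isdir(model_dir):
    print(f"{model_dir}: {directory_size_mb(model_dir):.0f} MB")
```

If the folder does not exist yet, the snippet prints nothing rather than failing.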


## Finetune Model

We can take the model and tailor it to our use by giving it some additional text to fine-tune on.

In [2]:
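file_name = "shakespeare.txt"
if not os.path.isfile(file_name):
	url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
	data = requests.get(url)
	data.raise_for_status()   # stop here if the download failed

	with open(file_name, 'w', encoding='utf-8') as f:
		f.write(data.text)

# Start the TensorFlow session first; finetune() needs it.
sess = gpt2.start_tf_sess()

# Uncomment to fine-tune on the Shakespeare text (slow without a GPU).
#gpt2.finetune(sess,
#              file_name,
#              model_name=model_name,
#              steps=1000)   # steps is max number of training steps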



### Generate Text

With the model downloaded and a fine-tuned checkpoint saved under `checkpoint/run1`, we can generate some new text.
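
The `temperature` argument in the call below controls how adventurous the sampling is. As a rough illustration (this is not gpt-2-simple's own code), temperature is commonly applied by dividing the model's raw scores by it before the softmax, so values below 1 concentrate probability on the likeliest tokens. The function name and the toy scores are made up for this sketch.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities after scaling them by 1/temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.1]  # invented scores for three candidate tokens
for t in (1.0, 0.7, 0.2):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures push more of the probability onto the top-scoring token, which tends to make the generated text more predictable.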

In [4]:
sessA = gpt2.start_tf_sess()
gpt2.load_gpt2(sessA)   # loads the latest checkpoint from checkpoint/run1 by default
gpt2.generate(sessA,
              model_name=model_name,
              length=100,
              temperature=0.7,
              nsamples=5,
              batch_size=5,
              prefix="Where for art thou")

2023-03-27 16:15:47.318110: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:357] MLIR V1 optimization pass is not enabled


Loading checkpoint checkpoint/run1/model-1
INFO:tensorflow:Restoring parameters from checkpoint/run1/model-1
This article is about the legendary character. You may be looking for Kudri. You may be looking for

Kudri is an iconic character in Fallout 4 and Fallout: New Vegas. He is a human male whose name means "sister" and who took the name "Kudri" from his mother, Mary.

Contents show]

Biography Edit

Kudri was born in the Kudri village of Anvil, a stone's throw from the continent of Tamriel, and grew up in the ruins of an abandoned town. After his mother's death, his father moved to the same town, after which the rest of the family members refused to give him any of their children. Despite the obvious kinship of his mother and father, Kudri's half-human half-human half-boy half-boy, who is named Kudri, was never brought up by his grandmother. He was raised by his son, who is named Kudri, and adopted by his grandmother, who taught him to read and write. He was neglected by his grandm

## Alternate Tuning Data

We can also fine-tune the model on different data, and the fine-tuning will adjust it to "sound more like" the type of text we feed it. Here we use recent post titles from Reddit.

In [7]:
!pip install praw

Collecting praw
  Downloading praw-7.7.0-py3-none-any.whl (189 kB)
Collecting prawcore<3,>=2.1
  Downloading prawcore-2.3.0-py3-none-any.whl (16 kB)
Collecting update-checker>=0.18
  Downloading update_checker-0.18.0-py3-none-any.whl (7.0 kB)
Collecting websocket-client>=0.54.0
  Downloading websocket_client-1.5.1-py3-none-any.whl (55 kB)
Installing collected packages: websocket-client, update-checker, prawcore, praw
Successfully installed praw-7.7.0 prawcore-2.3.0 update-checker-0.18.0 websocket-client-1.5.1


In [3]:
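import os
import praw

# Define user agent
user_agent = "gpt_scrape"

# Create an instance of reddit class.
# Credentials are read from environment variables so they never appear in the
# notebook; the variable names below are our own choice.
reddit = praw.Reddit(username=os.environ["REDDIT_USERNAME"],
                     password=os.environ["REDDIT_PASSWORD"],
                     client_id=os.environ["REDDIT_CLIENT_ID"],
                     client_secret=os.environ["REDDIT_CLIENT_SECRET"],
                     user_agent=user_agent
)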

In [8]:
# Create sub-reddit instance
subreddit_name = "wallstreetbets"
subreddit = reddit.subreddit(subreddit_name)
# Printing subreddit info
print(subreddit.display_name)

wallstreetbets


In [9]:
titles=[]
scores=[]
ids=[]

for submission in subreddit.new(limit=200):
    titles.append(submission.title)
    scores.append(submission.score) #upvotes
    ids.append(submission.id)

In [6]:
# open file in write mode; UTF-8 so emoji and other symbols in titles are written safely
with open('reddit_dl.txt', 'w', encoding='utf-8') as fp:
    for item in titles:
        # write each item on a new line
        fp.write(f"{item}\n")
print('Done')

Done
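
Raw post titles can contain stray whitespace, embedded line breaks, or repeats, and all of that ends up in the fine-tuning file. A minimal sketch of a cleaning pass that could run before writing the file is shown here. `clean_titles` is a hypothetical helper, not part of praw or gpt-2-simple, and the sample titles are invented.

```python
def clean_titles(titles):
    """Normalize whitespace, then drop empty and duplicate titles (order preserved)."""
    seen = set()
    cleaned = []
    for title in titles:
        # split() with no arguments collapses every run of whitespace,
        # including newlines, so each title stays on one line.
        text = " ".join(title.split())
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned

sample = ["  To the moon  ", "To the moon", "", "Line one\nline two"]
print(clean_titles(sample))
```

Passing `clean_titles(titles)` to the writing loop instead of `titles` would keep one title per line in `reddit_dl.txt`.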


In [7]:
gpt2.finetune(sess,
              "reddit_dl.txt",
              model_name=model_name,
              steps=100)   # steps is max number of training steps

2023-03-27 16:20:01.731489: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:357] MLIR V1 optimization pass is not enabled


Loading checkpoint checkpoint/run1/model-1
INFO:tensorflow:Restoring parameters from checkpoint/run1/model-1
Loading dataset...


100%|██████████| 1/1 [00:00<00:00, 125.25it/s]


dataset has 2995 tokens
Training...
[2 | 53.82] loss=4.67 avg=4.67
[3 | 97.58] loss=4.45 avg=4.56
[4 | 142.24] loss=4.16 avg=4.42
[5 | 187.35] loss=3.90 avg=4.29
[6 | 227.30] loss=3.63 avg=4.16
[7 | 264.45] loss=3.27 avg=4.01
[8 | 303.41] loss=3.17 avg=3.88
[9 | 341.77] loss=2.88 avg=3.75
[10 | 385.16] loss=3.08 avg=3.68
[11 | 428.29] loss=2.47 avg=3.55
[12 | 478.04] loss=2.15 avg=3.42
[13 | 541.21] loss=2.02 avg=3.29
[14 | 584.61] loss=1.95 avg=3.18
[15 | 628.79] loss=1.45 avg=3.05
[16 | 678.53] loss=1.28 avg=2.93
interrupted
Saving checkpoint/run1/model-16
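
The `avg` column in the log above is not a plain mean of the losses. The printed values are consistent with a bias-corrected exponential moving average using a decay of 0.99, which is what the sketch below assumes. The function name is illustrative, and the losses are copied from the first four log lines.

```python
def running_average(losses, decay=0.99):
    """Bias-corrected exponential moving average of a sequence of losses."""
    weighted_sum = 0.0   # decayed sum of the losses seen so far
    weight_total = 0.0   # decayed count, which corrects the early-step bias
    averages = []
    for loss in losses:
        weighted_sum = weighted_sum * decay + loss
        weight_total = weight_total * decay + 1.0
        averages.append(weighted_sum / weight_total)
    return averages

logged_losses = [4.67, 4.45, 4.16, 3.90]  # loss values from the log above
print([round(a, 2) for a in running_average(logged_losses)])
```

Rounded to two decimals, this gives 4.67, 4.56, 4.42, and 4.29, matching the logged `avg` values. The second number in each bracket appears to be elapsed seconds, which puts each step at roughly 37 to 63 seconds on this machine.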


In [None]:
gpt2.generate(sess, model_name=model_name, length=100, temperature=0.7, nsamples=5, batch_size=5, prefix="The best way to learn statistics is")