# Practical Necromancy

We're going to resurrect [Flinders Petrie](https://en.wikipedia.org/wiki/Flinders_Petrie).


![](https://upload.wikimedia.org/wikipedia/commons/5/5a/Flinders_Petrie.jpg)

This demo is based on the Colab notebook by Max Woolf (see his [2019 blog post](https://minimaxir.com/2019/09/howto-gpt2/)).

We're using an older model because (a) we don't always need the latest flashy thing, (b) the older model can be explored, and (c) we should try to understand the foundational technologies before jumping to the latest flashy thing. A and C are really aspects of the same argument, I suppose.

I also want you to see what generative 'ai' looks like when we pull back the curtain and remove the chatbot and its illusion of intelligence.

# Preliminaries

In [1]:
!pip install -q gpt-2-simple

In [2]:

import gpt_2_simple as gpt2
from datetime import datetime
gpt2.download_gpt2(model_name="124M") 

#124 million parameters! State of the art, only a few short years ago. 
# Now, something that you can run in a notebook.


Fetching checkpoint: 1.05Mit [00:00, 3.26Git/s]                                                     
Fetching encoder.json: 1.05Mit [00:00, 2.98Mit/s]                                                   
Fetching hparams.json: 1.05Mit [00:00, 4.33Git/s]                                                   
Fetching model.ckpt.data-00000-of-00001: 498Mit [01:05, 7.55Mit/s]                                  
Fetching model.ckpt.index: 1.05Mit [00:00, 1.51Git/s]                                               
Fetching model.ckpt.meta: 1.05Mit [00:00, 4.28Mit/s]                                                
Fetching vocab.bpe: 1.05Mit [00:00, 4.75Mit/s]                                                      


Now we're going to make a *big* assumption. We're going to assume that the published works of Flinders Petrie contain something meaningful about the essence of the man. 

We're going to grab these from Project Gutenberg (a repository of public domain literature).

In [3]:
import urllib.request
import os

# List of URLs to download
urls = [
    "https://www.gutenberg.org/cache/epub/7386/pg7386.txt",
    "https://www.gutenberg.org/cache/epub/70049/pg70049.txt",
    "https://www.gutenberg.org/cache/epub/52570/pg52570.txt",
    "https://www.gutenberg.org/cache/epub/63311/pg63311.txt",
    "https://www.gutenberg.org/cache/epub/56095/pg56095.txt"
]

# Loop through each URL and download the file
for url in urls:
    # Extract the filename from the URL
    filename = url.split('/')[-1]
    
    try:
        print(f"Downloading {filename}...")
        urllib.request.urlretrieve(url, filename)
        print(f"Successfully downloaded {filename}")
    except Exception as e:
        print(f"Error downloading {filename}: {e}")

print("All downloads completed!")

Downloading pg7386.txt...
Successfully downloaded pg7386.txt
Downloading pg70049.txt...
Successfully downloaded pg70049.txt
Downloading pg52570.txt...
Successfully downloaded pg52570.txt
Downloading pg63311.txt...
Successfully downloaded pg63311.txt
Downloading pg56095.txt...
Successfully downloaded pg56095.txt
All downloads completed!


In [7]:
# join those files together
def concatenate_files(output_filename="petrie.txt"):
    # List of specific files to concatenate
    files_to_concat = [
        "pg7386.txt",
        "pg70049.txt", 
        "pg52570.txt",
        "pg63311.txt",
        "pg56095.txt"
    ]
    
    with open(output_filename, 'w', encoding='utf-8') as outfile:
        for filename in files_to_concat:
            if os.path.exists(filename):
                print(f"Adding {filename} to {output_filename}")
                with open(filename, 'r', encoding='utf-8') as infile:
                    outfile.write(infile.read())
                    outfile.write('\n')
            else:
                print(f"Warning: {filename} not found")
    
    print(f"Files concatenated into {output_filename}")

# Run the concatenation
file_name = "petrie.txt"
concatenate_files(file_name)

Adding pg7386.txt to petrie.txt
Adding pg70049.txt to petrie.txt
Adding pg52570.txt to petrie.txt
Adding pg63311.txt to petrie.txt
Adding pg56095.txt to petrie.txt
Files concatenated into petrie.txt


Double check 'petrie.txt' now to see what you've got.
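
One thing you will likely notice when you look is that each Project Gutenberg file carries licence text before and after the book itself, and the model will learn that too. Here is a minimal sketch of how you might trim it and get a rough sense of corpus size. It assumes each file wraps the book between `*** START OF ...` and `*** END OF ...` marker lines (check your own downloads). The function names and the sample string are made up for illustration. Because it only looks for the first pair of markers, you would apply it to each downloaded file before concatenating, not to `petrie.txt` as a whole.

```python
import re

# Assumption: Project Gutenberg plain-text files mark the book's start and end
# with lines beginning "*** START OF" and "*** END OF". Verify in your files.
START_RE = re.compile(r"^\*\*\* ?START OF.*$", re.MULTILINE)
END_RE = re.compile(r"^\*\*\* ?END OF.*$", re.MULTILINE)


def strip_gutenberg_boilerplate(text):
    """Return the text between the first START and END markers, if both exist."""
    start = START_RE.search(text)
    end = END_RE.search(text)
    if start and end and start.end() < end.start():
        return text[start.end():end.start()].strip()
    return text  # markers not found: leave the text untouched


def summarise(text):
    """Rough corpus size: characters, whitespace-separated words, lines."""
    return {
        "chars": len(text),
        "words": len(text.split()),
        "lines": text.count("\n") + 1,
    }


# A made-up miniature 'file', standing in for one of the downloads.
sample = (
    "The Project Gutenberg eBook of Something\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK SOMETHING ***\n"
    "The pottery of this period is rough.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK SOMETHING ***\n"
    "License text follows here.\n"
)

body = strip_gutenberg_boilerplate(sample)
print(body)
print(summarise(body))
```

Whether to strip the licence text is a choice about what counts as 'Petrie'; leaving it in means the ghost will occasionally speak in legalese.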

Hey, it's all just data, right? You mean you're worried about representation? Comprehensiveness? Balance? Nah, bro, moar data will just do the trick.

Now we'll add another layer of culture, of memory, of voice on top of the 'frog dna' of the original model. If it was good enough for Jurassic Park, it's good enough for us.

This next step might take a while, even with the short-trained parameters I've set below. We'll check in later. Keep an eye on the 'loss' value. When it starts to flatline (the descent is no longer steep), further training teaches the model little that is new, and with this little data it increasingly just memorizes the text: it is starting to overfit. [Chantal Brousseau has a good explanation at the Programming Historian](https://programminghistorian.org/en/lessons/interrogating-national-narrative-gpt#gradient-descent-explained). (By the way, she wrote that tutorial as part of her coursework here at Carleton!)
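
To make 'starting to flatline' a little more concrete, here is a rough sketch of one way you could spot a plateau in a series of loss values: smooth them with a moving average and look for the first point where the improvement drops below a threshold. The function names, the window, the tolerance and the loss values are all invented for illustration; none of this is part of gpt-2-simple, and a flat training loss is only a hint, not proof, of overfitting.

```python
def moving_average(values, window):
    """Average of each consecutive run of `window` values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def first_plateau(losses, window=3, tolerance=0.03):
    """Index (in the smoothed series) where the loss first improves by less
    than `tolerance` compared with the previous smoothed value.
    Returns None if the loss never flattens by this measure."""
    smoothed = moving_average(losses, window)
    for i in range(1, len(smoothed)):
        if smoothed[i - 1] - smoothed[i] < tolerance:
            return i
    return None


# Made-up loss values, NOT from a real training run.
losses = [3.9, 3.5, 3.2, 3.0, 2.9, 2.85, 2.84, 2.84, 2.83]

print(first_plateau(losses))           # where this toy curve flattens
print(first_plateau([5, 4, 3, 2, 1]))  # still dropping steadily, so None
```

The window and tolerance are judgment calls: a wider window ignores more of the step-to-step noise, a smaller tolerance waits longer before calling it flat.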

# Retraining

'You shall know a word by the company it keeps' (J.R. Firth) - all generative AI is just statistical associations of words with other words across various contexts. This is why these models are inherently conservative and uncreative; they contain within themselves enormous pressure towards the _mean_ patterns, the _average_ patterns.
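
To see 'the company it keeps' at its most bare-bones, here is a toy sketch that counts which word follows which in a scrap of text, then generates new text by sampling from those counts. GPT-2 is vastly more elaborate (it works on sub-word tokens and looks at much longer contexts), so treat this as an illustration of the underlying idea only. The sentence and the function names are invented for the example.

```python
import random
from collections import Counter, defaultdict


def build_bigram_counts(text):
    """For every word, count which words follow it and how often."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def generate(counts, start, length=8, seed=0):
    """Walk the counts: repeatedly pick a next word in proportion to how
    often it followed the current word in the source text."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length):
        followers = counts.get(out[-1])
        if not followers:
            break  # this word was never followed by anything
        choices, weights = zip(*followers.items())
        out.append(rng.choices(choices, weights=weights)[0])
    return " ".join(out)


# An invented sentence, not a quotation from Petrie.
toy = "the pot was found in the tomb and the pot was broken in the tomb"
counts = build_bigram_counts(toy)

print(counts["the"].most_common())  # the 'company' that 'the' keeps
print(generate(counts, "the"))
```

Everything the generator can say is a recombination of what it was fed, weighted towards the most frequent pairings, which is the pull towards the average in miniature.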

Remember when we took an image model, and threw away its categories and used that image model with a final bit of smooshing of our own categories to create an image classifier? We're doing something similar here, though the analogy is loose: rather than swapping out only a final layer, fine-tuning nudges the model's existing sense of 'this word tends to go with that word in this context' towards Petrie's work. That last round of training will therefore tend to guide the ultimate generation of words. Strong signals in Petrie's text are things that we can now find, using this approach.

In [None]:
# warning: this WILL TAKE QUITE SOME TIME, and the success of this will in part depend on the quality of
# your computing hardware. If you look at the top right of this page where it says Python 3 (ipykernel) there's a circle.
# If that circle is blacked-out, it means your machine is calculating (as does the [*] at left). Remain patient.

sess = gpt2.start_tf_sess()

gpt2.finetune(sess,
              dataset=file_name, # this is where it grabs your data
              model_name='124M',
              steps=100, # in a proper training session you'd run at least 1000
              restore_from='fresh',
              run_name='run1',
              print_every=10, # and you'd print out much less frequently
              sample_every=20, # and sample every once in a while
              save_every=50 # and probably save every 500
              ) # with these settings and this little data, you're going to get 'overfitting' - why might that be a problem? The word itself tells you...


# optional-but-helpful parameters for gpt2.finetune:
# restore_from: Set to fresh to start training from the base GPT-2, or set to latest to restart training from an existing checkpoint.
# sample_every: Number of steps to print example output
# print_every: Number of steps to print training progress.
# learning_rate: Learning rate for the training. (default 1e-4, can lower to 1e-5 if you have <1MB input data)
# run_name: subfolder within checkpoint to save the model. This is useful if you want to work with multiple models (will also need to specify run_name when loading the model)
# overwrite: Set to True if you want to continue finetuning an existing model (w/ restore_from='latest') without creating duplicate copies.


## AS THIS BLOCK RUNS it will periodically pause and generate some text given the current state of training.
## It will print this out and you'll get a sense of how it is improving. With my default settings, it will 
## give you some stats about how well the model is training (loss, avg values; you want the loss values to drop smoothly), 
## and then every 20 steps it'll sample.

## Seance Time
Ok! That took a _long_ time. Let's prompt this ghostly shade of Petrie to see what he thinks about labour issues in archaeology:

In [9]:
gpt2.generate(sess,
              model_name='124M',
              prefix="Local workers are",  ## Ghostly Petrie will start with this phrase and sample likely tokens to follow it
              length=100, ## Ghostly Petrie will only generate the next 100 tokens
              temperature=0.7, ## higher values let Ghostly Petrie choose from a wider range of probabilities; you could dial this down towards 0 (say 0.1) to see what the ur-text of his published writing might be (the absolutely most probable tokens)
              top_p=0.9, ## another probability setting: only sample from the most probable tokens whose combined probability reaches 0.9
              nsamples=5, ## since this is all about probability, we'll do this five times.
              batch_size=5
              )

# I've left the results of my own run below so you can get a sense of the BEHAVIOUR SPACE of this model (a concept we
# encounter in agent-based modeling); when you run the cell these results will disappear and your own will turn up
# when it finishes doing all the math!

Local workers are getting a lot of attention now, and it is the first time that this is dealt with in a proper way. The main problem is that most of the local work is done by the Littors, and not by the town
employees. It is very troublesome to have a town who cannot get all the workers together, and is
not a suitable place for any kind of organisation.

We have already noticed that many towns have been largely built on the old stock of
trou
Local workers are also expected to be given some work time at work, as well as a chance to work with other workers.

The union will also have to be prepared to take on certain types of positions, such as inspectors.
The position of workers should be based on the type of work done.

The same is true for the workmen, who will be expected to perform the most important work.
The union will need to be prepared to take on the most demanding positions.

Work
Local workers are going to the trouble of dealing with them, and even some of them will do so. The
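
The `temperature` and `top_p` settings in the generate cell above are easier to grasp with numbers in front of you. What follows is a conceptual sketch, not gpt-2-simple's own code: it turns some invented scores ('logits') for four candidate next words into probabilities at different temperatures, then applies a top_p cut-off. The words, the scores and the function names are all made up for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities. Low temperature sharpens the
    distribution towards the top score; high temperature flattens it."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most-probable tokens whose probabilities
    add up to at least `top_p`, then renormalise so they sum to 1."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


tokens = ["workmen", "paid", "digging", "banana"]  # invented candidates
logits = [2.0, 1.5, 1.0, -1.0]                     # invented scores

for t in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])

# With temperature 0.7 and top_p 0.9, the unlikely 'banana' is never sampled.
filtered = top_p_filter(softmax_with_temperature(logits, 0.7), 0.9)
print({tokens[i]: round(p, 3) for i, p in filtered.items()})
```

Try changing the temperature in your own copy of the generate cell and compare: lower values should give you more repetitive, 'safer' Petrie, higher values a stranger one.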

## What do you see?
So copy that last block of code and try exploring this ghostly echo of Petrie's writing. Write some observations down: do we learn anything about Petrie's general sensibility this way?


## Don't Look Behind The Curtain
EVERY so-called 'AI' is doing this under the hood; it's just that the more recent ones have many more layers of modeling to constrain the probabilities of what-comes-next. It's all just mad echoes bouncing around the computer, recombining in various ways.

The _illusion_ of intelligence emerges from the use of a chat bot interface rather than this 'completion' interface we're using here: we're programmed to see humanity in things that look human (except for the sociopaths amongst us). 

These models can be further tuned by producing outputs that a human trainer (ill-paid, overworked, and exploited) judges 'more human/less human'. NOT TRUE/FALSE, not RIGHT/WRONG, but 'meh, that sounds human / doesn't sound quite as good'. The big AI models therefore are highly crafted to SOUND LIKE A PERSON: they are, quite literally, bullshit machines in the sense of that guy you hear talking out of his arsehole in a bar somewhere.

Do you really want to use such things to speak for you? 