## Finetune CLIP on archaeological objects

October 8. Shawn Graham

This notebook downloads images from Open Context, reshapes the metadata into captions, and then uses Damian Stewart's `huggingface_finetune_clip.py` to retrain the `openai/clip-vit-base-patch32` model (see [his repo](https://github.com/damian0815/finetune-clip-huggingface/blob/main/huggingface_finetune_clip_runner.ipynb); we forked a copy [too](https://github.com/shawngraham/finetune-clip-huggingface)). Other CLIP versions can be used, but so far the ones I've tried take too much memory to run in the free Colab tier. [Here's our GitHub repo, by the way](https://github.com/XLabCU/embedded_image_search).

The code block under 'old descriptions' builds captions from separate metadata fields, then downloads the images and reshapes the results. Use the later block for a better download and captions instead.

In [None]:
!pip install pandas requests

Jump down to the better download/captions block and ignore the next bit.

## old descriptions

In [None]:
import requests
import pandas as pd

url = 'https://raw.githubusercontent.com/opencontext/archaeology-images-ai/main/json_data/artifact_images_w_descriptions.json'
data = requests.get(url).json()
df = pd.json_normalize(data)  # convert json to pandas DataFrame

In [None]:
df.rename(columns={'image_file__uri': 'image'}, inplace=True)
df

Unnamed: 0,image,media__uri,image_genre,image_type,subject__item_class__label,context___1,context___2,context___3,time_range,has_type,consists_of,origin_place,has_taxonomic_identifier,has_anatomical_identification,temporal_coverage,project_specific_descriptions
0,https://iiif.archivelab.org/iiif/opencontext-1...,https://opencontext.org/media/a9cedbad-e25b-4f...,archaeology,artifact,Object,Asia,Turkey,Domuztepe,6500 BCE to 5500 BCE,seals (artifacts),rock (inorganic material),,,,,"Artifact Name: Stamp Seal \n Material: Stone, ..."
1,https://iiif.archivelab.org/iiif/opencontext-1...,https://opencontext.org/media/1bbbca07-82f3-46...,archaeology,artifact,Object,Asia,Turkey,Domuztepe,6500 BCE to 5500 BCE,seals (artifacts),soapstone,,,,,Artifact Name: Stamp Seal \n Material: Steatit...
2,https://iiif.archivelab.org/iiif/opencontext-1...,https://opencontext.org/media/2062e3fa-41e2-d7...,archaeology,artifact,Object,Asia,Turkey,Domuztepe,6500 BCE to 5500 BCE,seals (artifacts),rock (inorganic material),,,,,"Artifact Name: Stamp Seal \n Material: Stone, ..."
3,https://iiif.archivelab.org/iiif/opencontext-1...,https://opencontext.org/media/2dc18114-4ddf-7c...,archaeology,artifact,Object,Asia,Turkey,Domuztepe,6500 BCE to 5500 BCE,pendants (jewelry),chert,,,,,Artifact Name: Pendant \n Material: Chert/Flint
4,https://iiif.archivelab.org/iiif/opencontext-1...,https://opencontext.org/media/d7e8b4e5-be3b-44...,archaeology,artifact,Object,Asia,Turkey,Domuztepe,6500 BCE to 5500 BCE,nails (fasteners),iron (metal),,,,,Artifact Name: Nail \n Material: Iron \n Dispo...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
72357,https://artiraq.org/static/opencontext/pettegr...,https://opencontext.org/media/d3620d27-cb41-44...,archaeology,artifact,Object,Europe,Greece,Corinthia,,,,,,,,"Chronotype: Fineware, Late Helladic I-IIA \n Z..."
72358,https://artiraq.org/static/opencontext/pettegr...,https://opencontext.org/media/41f03708-baa5-4d...,archaeology,artifact,Object,Europe,Greece,Corinthia,,,,,,,,"Chronotype: Fineware, Late Helladic I-IIA \n Z..."
72359,https://artiraq.org/static/opencontext/pettegr...,https://opencontext.org/media/453e04b2-7905-4e...,archaeology,artifact,Object,Europe,Greece,Corinthia,,,,,,,,"Chronotype: Fineware, Late Helladic I-IIA \n Z..."
72360,https://artiraq.org/static/opencontext/interna...,https://opencontext.org/media/1840719d-2934-48...,archaeology,artifact,Object,Off World,International Space Station,Zvezda Service Module,,icons (devotional images); religions and reli...,,,,,,Location: Top center \n Item type: Icon \n Sec...


In [None]:
import os
import random
import requests

download_dir = 'images' # directory where you want to store images
random_images = df.sample(500, random_state=67)  # pick 500 random rows

os.makedirs(download_dir, exist_ok=True)
url_errors=[]
for _, row in random_images.iterrows():
    url = row['image']
    # Get the extension for the image file from the URL
    extension = url.split('.')[-1]
    # Get the UUID for the media file
    media_uuid = row['media__uri'].split('/')[-1]

    # Make a unique file_name from the UUID of the Open Context media resource. This has
    # the advantage of making sure that the image files can be easily looked up on
    # Open Context itself.
    file_name = f'{media_uuid}.{extension}'
    file_path = os.path.join(download_dir, file_name)

    try:
        response = requests.get(url, timeout=5)
        response.raise_for_status()
    except requests.exceptions.RequestException:  # Timeout is a subclass of RequestException
        print(f'An error occurred while fetching: {url}')
        url_errors.append(url)
        continue

    with open(file_path, 'wb') as img_file:
        img_file.write(response.content)

An error occurred while fetching: https://iiif.archivelab.org/iiif/opencontext-22-f-7-7005-7312-1-p-3jpg/full/675,/0/default.jpg
An error occurred while fetching: https://artiraq.org/static/opencontext/poggio-civitate/preview/photos/20030129PROFILE.jpg
An error occurred while fetching: https://iiif.archivelab.org/iiif/opencontext-22-d-6-188-16-1-p-4jpg/full/675,/0/default.jpg
An error occurred while fetching: https://iiif.archivelab.org/iiif/opencontext-24-19900011exteriorbtjpg/full/675,/0/default.jpg


In [None]:
import re
import string
def create_metadata(df, dir_name):
    # Make a copy of the dataframe to avoid modifying the original one
    df = df.copy()

    # fill NaN values with empty string
    df.fillna('', inplace=True)

    # append values of required columns into new 'caption' column
    df['caption'] = 'A photograph of ' + df['consists_of'].astype(str) \
                    + ', ' + df['project_specific_descriptions'].astype(str) \
                    + ' dating to ' + df['time_range'].astype(str) \
                    + ' from ' + df['context___1'].astype(str) \
                    + ', ' + df['context___2'].astype(str) \
                    + ', ' + df['context___3'].astype(str)
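    # Series.replace only matches whole values, so use .str.replace for substrings
    df['caption'] = df['caption'].str.replace('\n', ' ', regex=False)
    df['caption'] = df['caption'].str.replace('False', ' ', regex=False)
    df['caption'] = df['caption'].str.replace('True', ' ', regex=False)
    # Remove all other punctuation (re.escape keeps characters such as ] and \ literal)
    df['caption'] = df['caption'].apply(lambda x: re.sub('[{}]'.format(re.escape(string.punctuation)), ' ', x)).str.strip()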

    # Rewrite 'image' column to just contain the filename
    df['image'] = df.apply(lambda row: f"{row['media__uri'].split('/')[-1]}.{row['image'].split('.').pop()}", axis=1)

    # reshaping data to contain only 'image' and 'caption'
    df = df[['image', 'caption']].copy()  # copy so the assignment below does not touch a view
    df.loc[:, 'image'] = dir_name + '/' + df['image'].astype(str)

    return df

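# NOTE: train_images and test_images are not defined in the cells above (the download
# cell only creates random_images); define them before running this cell.
df = create_metadata(train_images, 'images')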
with open('train.json', 'w') as file:
    df.to_json(file, orient='records', lines=True)

testdf = create_metadata(test_images, 'testing')
with open('test.json', 'w') as file:
    testdf.to_json(file, orient='records', lines=True)

# use this code block for a better download & captions

In [None]:
# better captions download
import os
import json
import requests
import pandas as pd
from urllib.parse import urlparse
from urllib.request import urlretrieve
from urllib.error import HTTPError, URLError
from sklearn.model_selection import train_test_split
import concurrent.futures

# Load JSON from remote URL
url = "https://raw.githubusercontent.com/opencontext/archaeology-images-ai/main/json_data/artifact_images_w_sentence_captions.json"
response = requests.get(url)
data = response.json()

# Randomly select records
df = pd.DataFrame(data)
train_df, rem_df = train_test_split(df, train_size=2000, random_state=24)
test_df = rem_df.sample(50, random_state=42)

def download_and_rename(row, folder):
    os.makedirs(folder, exist_ok=True)
    uri = row['image_file__uri']
    # Check if uri exists and is a string
    if uri and isinstance(uri, str):
        uuid = row['media__uuid']
        caption = row['caption']
        parse_object = urlparse(uri)
        _, ext = os.path.splitext(parse_object.path)
        # Make sure uuid and ext are strings
        if not isinstance(uuid, str):
            uuid = str(uuid)
        if isinstance(ext, bytes):
            ext = ext.decode("utf-8")
        new_image_name = uuid + ext
        new_image_path = os.path.join(folder, new_image_name)

        try:
            urlretrieve(uri, new_image_path)
            return {"image": new_image_path, "caption": caption}

        except (HTTPError, URLError) as error:
            print(f"Download error for URL {uri}")
            print(error)
            return None
    else:
        return None

# Writing to 'jsonl' files
def write_to_jsonl(new_data, jsonl_file):
    with open(jsonl_file, 'w') as file:
        for json_dict in new_data:
            # download_and_rename returns None for failed downloads; skip those
            if json_dict is None:
                continue
            line = json.dumps(json_dict)
            file.write(line + "\n")

# Process train and test data
with concurrent.futures.ThreadPoolExecutor() as executor:
    train_data = list(executor.map(download_and_rename, [row for _, row in train_df.iterrows()], ['images']*len(train_df)))
    test_data = list(executor.map(download_and_rename, [row for _, row in test_df.iterrows()], ['testing']*len(test_df)))

# Write train/test data to jsonl files
write_to_jsonl(train_data, 'train.json')
write_to_jsonl(test_data, 'test.json')

Download error for URL https://iiif.archivelab.org/iiif/opencontext-16-250jpg/full/675,/0/default.jpg
HTTP Error 404: NOT FOUND
Download error for URL https://iiif.archivelab.org/iiif/opencontext-16-251jpg/full/675,/0/default.jpg
HTTP Error 404: NOT FOUND
Download error for URL https://artiraq.org/static/opencontext/poggio-civitate/preview/photos/20020001PROFILE.jpg
HTTP Error 404: Not Found
Download error for URL https://artiraq.org/static/opencontext/poggio-civitate/preview/photos/19790199BOTTOM.jpg
HTTP Error 404: Not Found
Download error for URL https://iiif.archivelab.org/iiif/opencontext-16-262jpg/full/675,/0/default.jpg
HTTP Error 404: NOT FOUND
Download error for URL https://artiraq.org/static/opencontext/poggio-civitate/preview/photos/20020015HEAD.jpg
HTTP Error 404: Not Found
Download error for URL https://artiraq.org/static/opencontext/poggio-civitate/preview/photos/19780245BOTTOM.jpg
HTTP Error 404: Not Found
Download error for URL https://iiif.archivelab.org/iiif/openconte

## Other sources of archaeological imagery?

Let's try the MET.

Departments 3, 10, and 13 are 'ancient near east', 'egypt', and 'greek and roman'.

In [None]:
!pip install jsonlines

In [None]:
#this block queries the api, makes the json, figures out the path to download, and writes the captions
import requests
import json
import jsonlines
import random
from concurrent.futures import ThreadPoolExecutor

# Function to fetch object data
def fetch_object_data(object_id):
    object_response = requests.get(f"{base_url}objects/{object_id}")
    return object_response.json()

# Define base URL for the Met's API
base_url = 'https://collectionapi.metmuseum.org/public/collection/v1/'

# Define our search term
search_term = 'archaeology'

allowed_departments = ["Ancient Near Eastern Art", "Egyptian Art", "Greek and Roman Art"]

# Generate the search URL
search_url = f"{base_url}search?q={search_term}"

# Make the GET request to the Met's API search endpoint
response = requests.get(search_url)

# Parse the response as JSON
data = response.json()

# Get a random sample of 1000 object IDs, if there are at least 1000 object IDs.
# Otherwise, get all object IDs.
object_ids_sample = random.sample(data['objectIDs'], min(1000, len(data['objectIDs'])))

# Open the jsonlines file in write mode
with jsonlines.open('METoutput.json', mode='w') as writer:
    # Use a ThreadPoolExecutor for parallel requests
    with ThreadPoolExecutor(max_workers=5) as executor:
        # Fetch all object data in parallel
        for object_data in executor.map(fetch_object_data, object_ids_sample):
            # If object's department in allowed departments and there's an image for this object
            if (object_data.get('department') in allowed_departments) and object_data.get('primaryImage'):
                # Create a list with all components of the caption
                caption_components = [
                    object_data['title'],
                    f"a {object_data['objectName']}" if object_data.get('objectName') else None,
                    f"from the {object_data['culture']}" if object_data.get('culture') else None,
                    f"dating to the {object_data['period']}" if object_data.get('period') else None,
                    object_data['dynasty'] if object_data.get('dynasty') else None,
                    object_data['reign'] if object_data.get('reign') else None,
                    f"({object_data['objectDate']})" if object_data.get('objectDate') else None,
                    f"created by {object_data['artistDisplayName']}" if object_data.get('artistDisplayName') else None,
                    f"in {object_data['country']}" if object_data.get('country') else None,
                    object_data['region'] if object_data.get('region') else None
                ]

                # Remove None elements from the list
                caption_components = [component for component in caption_components if component is not None]

                # Create the caption
                caption = ', '.join(caption_components) + '.'

                # Create the record
                record = {
                    'image': object_data['primaryImage'],
                    'caption': caption
                }

                # Write to jsonlines file
                writer.write(record)

In [None]:
!pip install retry

Collecting retry
  Downloading retry-0.9.2-py2.py3-none-any.whl (8.0 kB)
Collecting py<2.0.0,>=1.4.26 (from retry)
  Downloading py-1.11.0-py2.py3-none-any.whl (98 kB)
Installing collected packages: py, retry
Successfully installed py-1.11.0 retry-0.9.2


In [None]:
#this block does the downloading and fixes the paths to the local folder in the json
import os
import json
import jsonlines
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from sklearn.model_selection import train_test_split
import requests
from retry import retry

# Function to download images
@retry(tries=3, delay=2)
def download_image(image_url, local_path):
    try:
        # timeout (in seconds) keeps a stalled connection from hanging a worker thread
        response = requests.get(image_url, timeout=30)
        response.raise_for_status()
        with open(local_path, 'wb') as f:
            f.write(response.content)
    except requests.exceptions.RequestException as err:
        print("Requests Error-URL {0}: {1}".format(image_url, str(err)))
        raise  # re-raise so that @retry can try again

def process_lines(lines, dataset):
    # List to store records
    records = []

    # Create the directory if it doesn't exist
    os.makedirs(dataset, exist_ok=True)

    for line in lines:
        # Parse the line as JSON
        data = json.loads(line)

        # Define the local path
        image_url = data['image']
        local_filename = image_url.split('/')[-1]  # Use the last part of the URL as the filename
        local_path = os.path.join(dataset, local_filename)

        # Append this task to the list
        records.append((image_url, local_path, data['caption']))

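    def try_download(record):
        # Return True if the image was saved, False if every retry failed
        try:
            download_image(record[0], record[1])
            return True
        except Exception:
            return False

    # Create ThreadPoolExecutor
    with ThreadPoolExecutor(max_workers=10) as executor:
        # Download images in parallel and note which ones succeeded
        succeeded = list(executor.map(try_download, records))

    # Open the corresponding jsonl file in write mode
    with jsonlines.open(f'{dataset}_output.json', mode='w') as writer:
        # Write only the records whose image actually downloaded
        for (_, local_path, caption), ok in zip(records, succeeded):
            if ok:
                writer.write({'image': local_path, 'caption': caption})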

# Read lines from the METoutput.json file (JSON Lines format)
with open('METoutput.json', 'r') as f:
    lines = f.readlines()

# Split into train and test sets
train_lines, test_lines = train_test_split(lines, test_size=0.20)

# Process training and test sets
process_lines(train_lines, 'METtrain')
process_lines(test_lines, 'METtest')

So the next thing to do would be to append the metadata from METtrain_output.json to the train.json file.

In [None]:
# Open 'train.json' in append mode and 'METtrain_output.json' in read mode
with open('train.json', 'a') as train_file, open('METtrain_output.json', 'r') as met_file:
    # Iterate over the lines in met_file
    for line in met_file:
        # Write each line to train_file
        train_file.write(line)

With that done, we can run the fine-tuning code block (under 'Retrain CLIP') to train on both Open Context and MET images.

There aren't a lot of MET images. That seems to be because many records are missing the 'department' descriptor, so the code skips them. I still need to track down why that's happening.
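One way to investigate, sketched here under assumptions rather than tested against the live API: the Met's search endpoint accepts `departmentId` and `hasImages` query parameters, so the filtering could be requested up front instead of being done on each object's `department` field. The helper name `build_search_urls` is hypothetical, and the sketch only builds the URLs (one per department, on the assumption that `departmentId` takes a single ID); it makes no network requests.

```python
from urllib.parse import urlencode

BASE_URL = 'https://collectionapi.metmuseum.org/public/collection/v1/'

# Department IDs mentioned earlier in this notebook: 3, 10 and 13
DEPARTMENT_IDS = [3, 10, 13]

def build_search_urls(search_term, department_ids=DEPARTMENT_IDS):
    """Return one search URL per department, restricted to objects with images."""
    urls = []
    for dept_id in department_ids:
        query = urlencode({
            'departmentId': dept_id,
            'hasImages': 'true',
            'q': search_term,
        })
        urls.append(f"{BASE_URL}search?{query}")
    return urls

for url in build_search_urls('archaeology'):
    print(url)
```

Each URL could then be fetched with `requests.get(url).json()` and the resulting object IDs passed to `fetch_object_data`, as in the cell above. Whether this returns more usable records than the current approach is something to check, not a given.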

If you get any kind of error with the MET json, run it through https://jsonlines.org/validator/ to identify the problem.
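A quick local check can catch the same problems without leaving the notebook. The sketch below is an assumption about what "valid" means here (every non-empty line parses as a JSON object with `image` and `caption` keys); `find_bad_lines` is a hypothetical helper, not part of any library.

```python
import json

def find_bad_lines(path, required_keys=('image', 'caption')):
    """Return (line_number, reason) pairs for lines that fail the checks."""
    problems = []
    with open(path, 'r', encoding='utf-8') as f:
        for number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as err:
                problems.append((number, f'invalid JSON: {err.msg}'))
                continue
            if not isinstance(record, dict):
                problems.append((number, 'not a JSON object'))
                continue
            missing = [key for key in required_keys if key not in record]
            if missing:
                problems.append((number, f'missing keys: {missing}'))
    return problems
```

Calling `find_bad_lines('METtrain_output.json')` or `find_bad_lines('train.json')` returns an empty list when every line passes.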

# Set up to train

In [None]:
!pip install torchvision datasets Pillow
!pip install -q git+https://github.com/huggingface/transformers
!pip install accelerate -U

In [None]:
# test loading it back in
from datasets import load_dataset
dataset = load_dataset("json", data_files="train.json")
print(f"first image: {dataset['train'][0]['image']}, caption: '{dataset['train'][0]['caption']}'")


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

first image: testing/866566da-3114-4407-8ea4-50838814820f.jpg, caption: 'An image of an archaeological artifact found at Tell en-Nasbeh, a place in Palestinian Authority which is more generally located in Asia. The artifact has a general classification of lithics and mainly consists of chert;  flint (rock). Additional attributes that describe the artifact include: Condition: Good 
 Category Type: Lithic 
 Material: Flint 
 Subcatagory: Lithic -- Tool 
 Manufacture: Handmade'


# Retrain CLIP


In [None]:
!git clone https://github.com/damian0815/finetune-clip-huggingface.git

Cloning into 'finetune-clip-huggingface'...
remote: Enumerating objects: 19, done.
remote: Counting objects: 100% (19/19), done.
remote: Compressing objects: 100% (14/14), done.
remote: Total 19 (delta 6), reused 17 (delta 4), pack-reused 0
Receiving objects: 100% (19/19), 13.79 KiB | 1.53 MiB/s, done.
Resolving deltas: 100% (6/6), done.


In [None]:
!mkdir -p results

In [None]:
repo_id =  "openai/clip-vit-base-patch32"
#repo_id = "laion/CLIP-ViT-H-14-laion2B-s32B-b79K" # this requires too much memory for the free Google tier, but I'll bet it gives good results
output_folder = "results"
batch_size = 50
num_train_epochs = 30
out_json = "train.json"

In [None]:
print(f"Finetuning {repo_id} for {num_train_epochs} epochs with batch size {batch_size}, and then saving output to {output_folder}.")
!python -W ignore finetune-clip-huggingface/huggingface_finetune_clip.py \
    --output_dir {output_folder} \
    --model_name_or_path {repo_id} \
    --train_file {out_json} \
    --image_column image \
    --overwrite_output_dir=True \
    --max_seq_length=77 \
    --num_train_epochs={num_train_epochs} \
    --caption_column caption \
    --remove_unused_columns=False \
    --do_train \
    --per_device_train_batch_size={batch_size} \
    --learning_rate="5e-5" --warmup_steps="2" --weight_decay 0.2
print("--\nDONE")
print(f"If it worked, trained data should be in {output_folder}")

Finetuning openai/clip-vit-base-patch32 for 30 epochs with batch size 50, and then saving output to results.
{'loss': 1.425, 'learning_rate': 3.7217659137577005e-05, 'epoch': 7.69}
{'loss': 0.5322, 'learning_rate': 2.4383983572895276e-05, 'epoch': 15.38}
{'loss': 0.3858, 'learning_rate': 1.1550308008213554e-05, 'epoch': 23.08}
{'train_runtime': 2907.0394, 'train_samples_per_second': 33.364, 'train_steps_per_second': 0.671, 'train_loss': 0.6757516635992589, 'epoch': 30.0}
100% 1950/1950 [48:27<00:00,  1.49s/it]
***** train metrics *****
  epoch                    =       30.0
  train_loss               =     0.6758
  train_runtime            = 0:48:27.03
  train_samples_per_second =     33.364
  train_steps_per_second   =      0.671
--
DONE
If it worked, trained data should be in results
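The numbers in this log can be cross-checked with a little arithmetic. The sketch below infers the training-set size from the reported throughput (an inference from the log, not a figure stated anywhere in this notebook) and confirms that it accounts for the 1950 optimizer steps, assuming a single device and no gradient accumulation.

```python
import math

# Values copied from the training log above
train_runtime = 2907.0394          # seconds
samples_per_second = 33.364
num_train_epochs = 30
batch_size = 50

# Samples processed over the whole run, divided by epochs, estimates the dataset size
estimated_examples = round(samples_per_second * train_runtime / num_train_epochs)

# One optimizer step per batch; the last batch of an epoch may be partial
steps_per_epoch = math.ceil(estimated_examples / batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(estimated_examples, steps_per_epoch, total_steps)
```

The logged epochs (7.69, 15.38, 23.08) are consistent with the loss being reported at steps 500, 1000, and 1500, each divided by 65 steps per epoch.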


In [None]:
!zip -r archaeai.zip results/pytorch_model.bin results/config.json

  adding: results/pytorch_model.bin (deflated 7%)
  adding: results/config.json (deflated 46%)


In [None]:
from google.colab import files
files.download("archaeai.zip")


In [None]:
from google.colab import files
files.download("results/pytorch_model.bin")


# Next

The next step is to use your fine-tuned model. [This notebook](https://colab.research.google.com/drive/1eYcYeygkoe-4fqLYNZW0JdvxWk_go_56#scrollTo=ftqZ03HZLVLC) uses one of our fine-tuned models and shows you what to do. If you go to [the repo](https://huggingface.co/sgraham/archae-ai/tree/main) where our model lives, you'll also see which supplementary JSON files you need to copy and arrange to work with your own.
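As a rough illustration of that step, here is a minimal sketch using the `transformers` CLIP classes. It assumes the fine-tuned weights and `config.json` are in the `results` folder, and it loads the processor from the original `openai/clip-vit-base-patch32` checkpoint on the assumption that preprocessing was unchanged by fine-tuning. `score_captions` is a hypothetical helper and is not taken from the linked notebook.

```python
def score_captions(image_path, captions, model_dir="results",
                   processor_id="openai/clip-vit-base-patch32"):
    """Return (caption, probability) pairs for one image, best match first."""
    # Imported here so the function can be defined before the packages are installed
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    # Assumption: model_dir holds the fine-tuned weights plus config.json
    model = CLIPModel.from_pretrained(model_dir)
    processor = CLIPProcessor.from_pretrained(processor_id)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=captions, images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # logits_per_image has shape (1, number of captions)
    probs = outputs.logits_per_image.softmax(dim=1)[0].tolist()
    return sorted(zip(captions, probs), key=lambda pair: pair[1], reverse=True)
```

A call might look like `score_captions('testing/example.jpg', ['a stamp seal made of stone', 'an iron nail'])`, where the file name is a placeholder for one of your own test images.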