The goal is to fine-tune SmolLM-135M ('smol') to return useful metadata for categorizing archaeological reports in Britain.

The source data came from the Archaeology Data Service: 1173 report metadata records returned by a simple search for 'Roman'. The 'Bibliography' and 'Url' columns were removed. The training data is in `ads-roman-result.csv`.

Guidance from https://mikulskibartosz.name/fine-tune-small-language-model. Code pair-programmed with Gemini 2 and Claude Sonnet models.

Look: I never said I was a good programmer. Rather, I start with what I have, where I want to go, I look up tutorials, I adapt as I can, and I use the models to understand the errors and find (limited) solutions that I understand.

In [None]:
!wget https://gist.githubusercontent.com/shawngraham/d71c21640e1597d90c02123c290c9472/raw/372b865372872fdd78bf1f222bd2299b3217863b/ads-roman-result.csv

--2024-12-21 14:41:07--  https://gist.githubusercontent.com/shawngraham/d71c21640e1597d90c02123c290c9472/raw/372b865372872fdd78bf1f222bd2299b3217863b/ads-roman-result.csv
Resolving gist.githubusercontent.com (gist.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to gist.githubusercontent.com (gist.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1221046 (1.2M) [text/plain]
Saving to: ‘ads-roman-result.csv’


2024-12-21 14:41:07 (16.8 MB/s) - ‘ads-roman-result.csv’ saved [1221046/1221046]



## Turn the data into training data

We need to turn the data into examples of what we want smol to eventually do. The following code reads the comma-delimited results from ADS (which use semicolons _within_ columns to indicate lists) into jsonl data with example prompt and response text.

(The hardest part of all of this is finding, and formatting, training data).

The following code block will parse the csv and map the information in the desired format. It will write `processed_data.json` which is the csv data represented as json, and then `training_data.jsonl` which is in the json lines format, and is what the fine-tuner requires with a system prompt/question and the training data formatted as an answer.
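To make the target format concrete, here is a minimal sketch of how one semicolon-delimited cell might end up as a line of `training_data.jsonl`. The sample cell and the shortened prompt wording are made up for illustration; the real code below builds a longer instruction from every column.

```python
import json

# Hypothetical cell contents, shaped like the ADS export (not a real record)
period_subject = "Period:ROMAN;Subject:Ditch;Subject:Pit;Period:IRON AGE"

# Split the cell on semicolons, then sort the items by their prefix
items = [item.strip() for item in period_subject.split(";") if item.strip()]
periods = [i[len("Period:"):] for i in items if i.startswith("Period:")]
subjects = [i[len("Subject:"):] for i in items if i.startswith("Subject:")]

# One training example: a single "text" field holding prompt and answer
response = {"subjects": subjects, "periods": periods}
example = {
    "text": "<|system|>You are a helpful archaeological assistant.\n"
            "<|user|>Please identify the metadata in this report: ...\n"
            f"<|assistant|>{json.dumps(response)}<|endoftext|>"
}

# In a jsonl file, each example is one JSON object on its own line
jsonl_line = json.dumps(example)
print(jsonl_line)
```

Because `json.dumps` escapes the newlines inside the prompt, each example stays on a single physical line, which is what the jsonl reader later in the notebook relies on.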

In [None]:
# for use with ADS csv download file that mixes comma delimited fields with semicolon delimited lists
import pandas as pd
import json
import re

def parse_delimited_field(field, delimiter=';'):
    """Parse semicolon-delimited fields into lists, cleaning empty entries"""
    if pd.isna(field):
        return []
    items = [item.strip() for item in str(field).split(delimiter)]
    return [item for item in items if item]

def parse_location_data(location_str):
    """Parse location string into structured format"""
    if pd.isna(location_str):
        return {}

    location_data = {}
    parts = location_str.split(';')

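    for part in parts:
        part = part.strip()
        if part.startswith('EPSG'):  # Handle EPSG coordinates, e.g. EPSG:27700:<value>
            # Check this first: EPSG entries also contain ':' and would
            # otherwise be swallowed by the generic key:value branch
            pieces = part.split(':', 2)
            if len(pieces) == 3:
                location_data[f'EPSG_{pieces[1]}'] = pieces[2].strip()
        elif ':' in part:
            key, value = part.split(':', 1)
            location_data[key.strip()] = value.strip()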

    return location_data

def parse_period_subject(field):
    """Parse period and subject information into categorized lists"""
    if pd.isna(field):
        return {'periods': [], 'subjects': []}

    periods = []
    subjects = []

    items = parse_delimited_field(field)
    for item in items:
        if item.startswith('Period:'):
            periods.append(item.replace('Period:', '').strip())
        elif item.startswith('Subject:'):
            subjects.append(item.replace('Subject:', '').strip())

    return {
        'periods': periods,
        'subjects': subjects
    }

def parse_identifiers(identifier_str):
    """Parse identifier string into structured format"""
    if pd.isna(identifier_str):
        return {}

    identifiers = {}
    parts = parse_delimited_field(identifier_str)

    for part in parts:
        if ':' in part:
            key, value = part.split(':', 1)
            identifiers[key.strip()] = value.strip()

    return identifiers

def transform_row_to_json(row):
    """Transform a single row into structured JSON format"""
    return {
        'title': row['Title'],
        'description': row['Description'],
        'location': parse_location_data(row['Location']),
        'period_subject': parse_period_subject(row['PeriodSubjectIntervention']),
        'identifiers': parse_identifiers(row['Indentifiers']),
        'people': parse_delimited_field(row['People'])
    }

# this is for creating a dataset for finetuning smol
def format_for_training(entry):
    """Format the JSON entry into training format"""
    # Create instruction from available data
    instruction = (
        f"Please identify the metadata that describes the work recounted in this archaeological report from {entry['location'].get('Named Location', 'unknown location')}: "
        f"{entry['title']} {entry['description']} {entry['location']} {entry['period_subject']} {entry['people']} {entry['identifiers']}"
    )

    # Create response using structured data
    response = {
        "location": {
            "civil_parish": entry['location'].get('Civil Parish', ''),
            "admin_county": entry['location'].get('Admin County', ''),
        },
        # These keys sit at the top level, alongside "location"
        "subjects": entry['period_subject']['subjects'],
        "periods": entry['period_subject']['periods'],
        "work_conducted_by": entry['people'],
        "identifiers": entry['identifiers'],
    }

    return {
        "text": f"<|system|>You are a helpful archaeological assistant trained to identify appropriate metadata from archaeological reports.\n"
                f"<|user|>{instruction}\n"
                f"<|assistant|>{json.dumps(response)}<|endoftext|>"
    }

def process_archaeological_csv(input_file, output_json="processed_data.json", output_training="training_data.jsonl"):
    """Process archaeological CSV file into JSON and training format"""
    try:
        # Read CSV using pandas, handle quote char issues
        try:
            df = pd.read_csv(input_file,
                               quotechar='"',
                               escapechar='\\',
                               encoding='utf-8',
                               on_bad_lines='warn')
        except Exception as e:
             print(f"Error during initial read with quotes:\n{e}\n trying without quotes")
             try:
                  df = pd.read_csv(input_file,
                                   encoding='utf-8',
                                   on_bad_lines='warn')
             except Exception as e:
                  print(f"Error during initial read without quotes:\n{e}")
                  raise e

        # Remove leading/trailing spaces from column names
        df.columns = df.columns.str.strip()

        # Debug: Print column names and first row
        print("\nAvailable columns in CSV:")
        for col in df.columns:
            print(f"- {col}")

        print("\nFirst row of data:")
        print(df.iloc[0].to_dict())

        # Save raw CSV content for debugging
        with open('debug_raw.txt', 'w', encoding='utf-8') as f:
            with open(input_file, 'r', encoding='utf-8') as src:
                f.write(src.read())

        print("\nSaved raw CSV content to debug_raw.txt for inspection")

        # Print shape of dataframe
        print(f"\nDataFrame shape: {df.shape}")

        # Print first few lines of raw file
        print("\nFirst few lines of raw file:")
        with open(input_file, 'r', encoding='utf-8') as f:
            print(f.readline())  # Header
            print(f.readline())  # First data row

        processed_data = df.apply(transform_row_to_json, axis=1).tolist()

        training_data = [format_for_training(entry) for entry in processed_data]

        # Output processed data as json
        with open(output_json, "w") as f:
            json.dump(processed_data, f, indent=2)

        # Output training data as jsonl
        with open(output_training, "w") as f:
            for entry in training_data:
                json.dump(entry, f)
                f.write('\n')

        return processed_data, training_data

    except Exception as e:
        print(f"\nError during processing:")
        print(f"Type of error: {type(e)}")
        print(f"Error message: {str(e)}")
        if 'df' in locals():
            print("\nDataFrame Info:")
            print(df.info())
        raise e


# run the code
if __name__ == "__main__":
    try:
        processed_data, training_data = process_archaeological_csv('ads-roman-result.csv')

        # Print example of processed data
        print("\nExample of processed JSON:")
        print(json.dumps(processed_data[0], indent=2))

        print("\nExample of training format:")
        print(training_data[0]['text'])

    except Exception as e:
        print(f"Error: {str(e)}")





Available columns in CSV:
- Title
- Description
- Location
- PeriodSubjectIntervention
- Indentifiers
- People

First row of data:
{'Title': 'Images and CAD Data from the Island and PSS Soil Bund Sites phase of Archaeological Mitigation Work at the Houghton Regis North 1 Development, Bedfordshire, 2021', 'Description': 'This collection comprises images and CAD data from archaeological work by Albion Archaeology, in advance of development at Houghton Regis North 1 (HRN1) or Linmere, north of Houghton Regis, Bedfordshire. This particular phase of work was undertaken between 21st June and 23rd August 2021, and comprised archaeological monitoring and a Strip, Map and Sample on the Island and PSS Soil Bund Sites (AIA3C) area of the proposed development site.', 'Location': 'Civil Parish:Chalton;Named Location:Island and PSS Soil Bund Sites;Grid Ref:TL036256;District:Central Bedfordshire;Admin County:Bedfordshire;Country:England;Civil Parish:Houghton Regis;Named Location:Linmere;EPSG:27700:5

Now we need to get the SmolLM-135M model. The code below will download it from Hugging Face, a site that functions as a repository for models and data. You need to make an account there and get an access token for the API. Once you have your token, click on the 'key' icon ('secrets') at the left of the screen and create a new secret called `HF_TOKEN`. Paste your token in. Now you can access these models *and* push your fine-tuned version to your own user account for use later.
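If you ever need the token value in your own code, a small sketch like this one reads it back. `get_hf_token` is a hypothetical helper name, and the fallback to an environment variable is an assumption for running outside Colab; inside Colab, `userdata.get` may raise an error if the secret is missing or the notebook has not been granted access to it.

```python
import os

def get_hf_token():
    """Return the Hugging Face token from Colab secrets, else from the environment."""
    try:
        # google.colab is only importable inside a Colab runtime
        from google.colab import userdata
        return userdata.get("HF_TOKEN")
    except ImportError:
        # Outside Colab: assume the token was exported as an environment variable
        return os.environ.get("HF_TOKEN")

token = get_hf_token()
print("token found" if token else "no token set")
```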

In [None]:
## now we get set up with a model

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU when no GPU is attached
model_name = "HuggingFaceTB/SmolLM-135M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
eos_string = tokenizer.decode([tokenizer.eos_token_id])

The next block installs some more packages that we will need.

In [None]:
%%capture
!pip install datasets
!pip install -U bitsandbytes

This block sets up the fine-tuning: it takes our dataset and splits it into training and test sets (so we can get a sense of how well the model is learning). The arguments for _how_ to train the model can be adjusted and explored. This block also sets the trainer running. I sometimes find it helpful to copy the output (the training-quality statistics) and the training arguments into a model like Claude Sonnet and ask it to 'interpret these results then suggest more effective settings', just to see.
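Before adjusting the arguments, it can help to work out what they imply. The sketch below is plain arithmetic using the values from the training cell and the training-set size it prints; it assumes a single GPU, and the Trainer's own rounding of steps may differ slightly between versions of `transformers`.

```python
import math

# Values copied from the training cell; the example count is from its output
train_examples = 806
per_device_train_batch_size = 16
gradient_accumulation_steps = 2
num_train_epochs = 50
warmup_steps = 500

# Examples that contribute to each optimizer update (single GPU assumed)
effective_batch = per_device_train_batch_size * gradient_accumulation_steps

# Approximate optimizer updates; Trainer's own rounding may differ slightly
batches_per_epoch = math.ceil(train_examples / per_device_train_batch_size)
updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
total_updates = updates_per_epoch * num_train_epochs

print(f"effective batch size: {effective_batch}")
print(f"approx. updates per epoch: {updates_per_epoch}")
print(f"approx. total updates: {total_updates}")
print(f"warmup share of training: {warmup_steps / total_updates:.0%}")
```

If the run goes the full 50 epochs, the warmup takes up a large share of it, and a run stopped early may never leave the warmup phase at all. That is one setting worth experimenting with.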

In [None]:
import os
import pandas as pd
import json
from datasets import Dataset
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments, AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM-135M"

# Load base model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

def load_and_preprocess_data(data_path, data_type='jsonl'):
    """
    Load archaeological reports and preprocess them for training
    """
    texts = []
    if data_type == 'jsonl':
        if not os.path.exists(data_path):
           raise FileNotFoundError(f"Error: The file '{data_path}' could not be found.")
        with open(data_path, 'r', encoding='utf-8') as f:
           for line in f:
              try:
                  entry = json.loads(line)
                  texts.append(entry['text'])
              except json.JSONDecodeError:
                 print(f"Warning: Could not decode line as JSON: {line.strip()}")
    elif data_type == 'csv':
        if not os.path.exists(data_path):
           raise FileNotFoundError(f"Error: The file '{data_path}' could not be found.")
        df = pd.read_csv(data_path)
        print(f"Shape of df: {df.shape}") # check shape of dataframe
        print(f"first 5 rows of df: {df.head()}") # check dataframe content
        texts = df['text'].tolist()
    else:
        raise ValueError("Invalid data_type. Must be 'jsonl' or 'csv'.")

    return Dataset.from_pandas(pd.DataFrame({'text': texts}))

def tokenize_function(examples):
    tokenized = tokenizer(examples["text"], truncation=True, padding="max_length", max_length=300, return_tensors="pt") # modified max_length
    return tokenized


# Load and preprocess the dataset
data_path="training_data.jsonl"
data_type="jsonl"
dataset = load_and_preprocess_data(data_path, data_type)
print(f"first row of dataset before mapping: {dataset[0]}") # check content before mapping

# Tokenize the dataset
tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=dataset.column_names)
print(f"first row of dataset after mapping: {tokenized_dataset[0]}") # check content after mapping

tokenized_dataset = tokenized_dataset.train_test_split(test_size=0.05)


# Create Data Collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
)


# Training Arguments
args = TrainingArguments(
    output_dir="SmolLM",
    evaluation_strategy="steps",
    eval_steps=50,
    learning_rate=1e-5,
    per_device_train_batch_size=16,       # Increased batch size if memory allows
    per_device_eval_batch_size=16,
    num_train_epochs=50,
    weight_decay=0.01,
    save_strategy="steps",
    save_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    logging_dir="./logs",                 # Add logging
    logging_steps=50,
    fp16=True,                           # Enable mixed precision training if available
    gradient_accumulation_steps=2,        # Accumulate gradients for larger effective batch size
    warmup_steps=500,                    # Add warmup steps
    seed=42,                             # Set random seed for reproducibility
    report_to="none",                    # without this, colab logs the run with 'weights and biases' service, which requires an api etc
)

# Trainer
trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
)
print(f"Training Dataset Length {len(tokenized_dataset['train'])}") # checking training data length
print(f"Validation Dataset Length {len(tokenized_dataset['test'])}") # checking test data length

# Do the Training
trainer.train()

first row of dataset before mapping: {'text': '<|system|>You are a helpful archaeological assistant trained to categorize archaeological reports.\n<|user|>Please categorize this archaeological report metadata from Linmere: This collection comprises images and CAD data from archaeological work by Albion Archaeology, in advance of development at Houghton Regis North 1 (HRN1) or Linmere, north of Houghton Regis, Bedfordshire. This particular phase of work was undertaken between 21st June and 23rd August 2021, and comprised archaeological monitoring and a Strip, Map and Sample on the Island and PSS Soil Bund Sites (AIA3C) area of the proposed development site.\n<|assistant|>{"subjects": ["Sherd", "Field Observation (Monitoring)", "Field Boundary", "Excavations (Archaeology)--England", "Post Hole", "Quarry", "Archaeology", "Strip Map And Sample", "Ridge And Furrow", "Pit", "Ditch"], "periods": ["POST MEDIEVAL", "-800 - 1800", "ROMAN", "IRON AGE"], "work_conducted_by": ["Creator:Albion Archa

Map:   0%|          | 0/849 [00:00<?, ? examples/s]

first row of dataset after mapping: {'input_ids': [44, 108, 9690, 108, 46, 2683, 359, 253, 5356, 13753, 11173, 7018, 288, 31239, 13753, 4631, 30, 198, 44, 108, 4093, 108, 46, 10180, 31239, 451, 13753, 1378, 11566, 429, 9565, 42596, 42, 669, 3854, 15658, 3265, 284, 24218, 940, 429, 13753, 746, 411, 19726, 285, 26287, 28, 281, 4408, 282, 1421, 418, 42674, 3192, 271, 2601, 216, 33, 365, 15416, 62, 33, 25, 355, 9565, 42596, 28, 4081, 282, 42674, 3192, 271, 28, 37201, 21225, 30, 669, 1542, 5239, 282, 746, 436, 14999, 826, 216, 34, 33, 302, 4019, 284, 216, 34, 35, 7212, 4053, 216, 34, 32, 34, 33, 28, 284, 15556, 13753, 5293, 284, 253, 41529, 28, 11563, 284, 21530, 335, 260, 5378, 284, 377, 6690, 15960, 27969, 25730, 365, 49, 7854, 35, 51, 25, 1557, 282, 260, 5433, 1421, 2530, 30, 198, 44, 108, 520, 9531, 108, 46, 39428, 19541, 99, 1799, 9523, 47415, 84, 1002, 476, 5468, 35300, 365, 41155, 23849, 476, 5468, 44842, 1002, 476, 32223, 565, 491, 365, 37318, 899, 25, 423, 47063, 1002, 476, 10360, 



Training Dataset Length 806
Validation Dataset Length 43


Step,Training Loss,Validation Loss
5,No log,3.317705
10,44.372200,3.066287
15,44.372200,2.665973
20,33.430600,2.267325
25,33.430600,1.931662
30,27.542900,1.750549


TrainOutput(global_step=30, training_loss=35.11522216796875, metrics={'train_runtime': 251.3707, 'train_samples_per_second': 32.064, 'train_steps_per_second': 0.119, 'total_flos': 1176436922803200.0, 'train_loss': 35.11522216796875, 'epoch': 7.627450980392156})

Test! In this next block, I can paste in the archaeological site description after where it says `...return json:`

```
prompt = "<|system|>You are a helpful archaeological assistant trained to
categorize archaeological reports.\n<|user|>Please categorize this
archaeological report metadata; return json: This collection comprises images
and CAD data from archaeological work by GrahamCo Archaeology, in advance of
development at Claygate North 1 (CLYGT1). This particular phase of work was
undertaken between 21st June and 23rd August 2021, and comprised archaeological
monitoring and a Strip, Map and Sample on the area of the proposed development
site. Roman period Samian ware from the 1st and second centuries were
recovered. \n"
```
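One thing to try: the training examples put the JSON answer directly after an `<|assistant|>` tag, so a test prompt that ends with that tag may give the model a clearer cue. This is a sketch under that assumption, not a tested recipe; `build_prompt` is a hypothetical helper, and the wording is copied from the training format and the later test prompt.

```python
SYSTEM_PROMPT = (
    "You are a helpful archaeological assistant trained to identify "
    "appropriate metadata from archaeological reports."
)

def build_prompt(description):
    """Wrap a report description in the same tags used in training_data.jsonl."""
    return (
        f"<|system|>{SYSTEM_PROMPT}\n"
        f"<|user|>Please identify the metadata that describes the work "
        f"recounted in this archaeological report; return json: {description}\n"
        f"<|assistant|>"  # the training examples put the JSON right after this tag
    )

prompt = build_prompt("Roman period Samian ware was recovered during a watching brief.")
print(prompt)
```

Building every prompt through one function also keeps the test prompts from drifting away from the wording used in the training data.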


In [None]:
# test out before saving
trained_model = trainer.model

prompt = "<|system|>You are a helpful archaeological assistant trained to categorize archaeological reports.\n<|user|>Please categorize this archaeological report metadata; return json: This collection comprises images and CAD data from archaeological work by GrahamCo Archaeology, in advance of development at Claygate North 1 (CLYGT1). This particular phase of work was undertaken between 21st June and 23rd August 2021, and comprised archaeological monitoring and a Strip, Map and Sample on the area of the proposed development site. Roman period Samian ware from the 1st and second centuries were recovered. \n"

input_ids = tokenizer.encode(prompt, return_tensors="pt", add_special_tokens=False).to(device)

generated_ids = trained_model.generate(
    input_ids,
    max_new_tokens=300,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id
)

generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(generated_text)

If you want to try retraining, you should run this next block then go back to the training block and rerun. But if you want to save your model, DO NOT run this block.

In [None]:
#clean up and free memory
# remove # to run this:

#import gc
#del model
#del tokenizer
#del trainer
#del tokenized_dataset
#del dataset
#gc.collect()
#torch.cuda.empty_cache()

This code block will push your model to your account on Hugging Face. Change where it says `your-username`.

In [None]:

# save to Hugging Face.
# it will ask for your token and then, like a GitHub push, upload the model
# to your space, i.e. huggingface.co/your-username/name-of-your-new-model-as-specified-below

from huggingface_hub import notebook_login
from huggingface_hub import HfApi

# First, login to Hugging Face
notebook_login()

# Save the model and all necessary files
trainer.save_model("./SmolLM")

# Save the tokenizer alongside the model
tokenizer.save_pretrained("./SmolLM")

# Push to hub with a model card
model.push_to_hub("your-username/Smol_archae_metadata_model",
    token=True,  # reuse the token stored by notebook_login(); replaces the deprecated use_auth_token
    model_card_kwargs={
        "language": "en",
        "license": "mit",
        "datasets": ["custom archaeology dataset"],
    }
)

# Push the tokenizer configuration
tokenizer.push_to_hub("your-username/Smol_archae_metadata_model")

# Initialize the Hugging Face API
api = HfApi()




Here's some code to load your model from Hugging Face, i.e., if you were coming back to the computer after a hiatus and didn't want to retrain everything from scratch.

In [None]:
# Smol135 versus your fine-tuned model

# here we're going to load both models from Hugging Face and test them against the same prompt

# so you can see the difference that fine-tuning the original smol-135 makes
# of course, the result might not yet be much good, but if you paid for more memory/time on the gpu it'd get better
# and also, if you did more with the training data - and more data - to make sure it was how you wanted it...

import torch
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import json

def generate_response_pipeline(pipe, input_text, max_length=500):
    """
    Generates a response using a Hugging Face pipeline
    """
    # do_sample=False asks for greedy (deterministic) decoding,
    # which is what temperature=0 was meant to achieve
    response = pipe(input_text, max_length=max_length, do_sample=False)[0]['generated_text']

    return response

def generate_response(model, tokenizer, input_text, max_length=500):
    """
    Generates a response using the model
    """

    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(**inputs, max_length=max_length, do_sample=False)

    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

def load_and_prepare_model(model_path, model_id=None, use_merged = True):
    """
    Loads and prepares either the merged model or the LoRA model
    """
    if use_merged:
         # Load the merged model directly
        model = AutoModelForCausalLM.from_pretrained(
            model_path,
            device_map="auto",
            torch_dtype=torch.float16
        )
    else:
        # Load base model and LoRA weights
        model = AutoModelForCausalLM.from_pretrained(
            model_id,
             device_map="auto",
            torch_dtype=torch.float16
        )
        model = PeftModel.from_pretrained(model, model_path)

    tokenizer = AutoTokenizer.from_pretrained(model_path) if use_merged else AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token = tokenizer.eos_token
    return model, tokenizer


def main():
    # Model names
    base_model_id = "HuggingFaceTB/SmolLM-135M"
    fine_tuned_model_id = "sgraham/merged_archae_metadata_model"  # Replace with your actual model repo

     # Load the base model
    base_model, base_tokenizer = load_and_prepare_model(model_path=base_model_id, model_id=base_model_id, use_merged = True)

   # Load the fine-tuned model
    ft_model, ft_tokenizer = load_and_prepare_model(model_path=fine_tuned_model_id, model_id=base_model_id, use_merged = True)


    # Load the pipeline
    ft_pipeline = pipeline("text-generation", model=fine_tuned_model_id, device_map="auto", torch_dtype=torch.float16)


    # Test example (ensure you format as used in fine-tuning)
    prompt = """<|system|>You are a helpful archaeological assistant trained to identify appropriate metadata from archaeological reports.\n<|user|>Please identify the metadata that describes the work recounted in this archaeological report; return json: This collection comprises images and CAD from an archaeological evaluation and watching brief, undertaken by Cotswold Archaeology in August 2018, at Hewmar House, 120 London Road, Gloucester, Gloucestershire. Four archaeological evaluation trenches were excavated and four geotechnical test pits were  observed. Despite the proximity of the site to Wotton Roman cemetery, no evidence for any in situ burials, or indeed any Roman activity, was identified in any of the excavated trenches or test pits. It is likely that the site lay beyond the southern boundary of the cemetery and formed  part of the agricultural hinterland of both Roman and medieval Gloucester until the  construction of Hillfield Villa (later Hewmar House) in the early 19th century. Three linear  garden features, probably planting trenches, associated with Hillfield Villa and a large undated ditch were identified. Evidence for possible quarrying was also identified throughout the site. Periods: POST MEDIEVAL, 1800 - 1850, UNCERTAIN. Subjects: Archaeology, Evaluation, DITCH, GARDEN FEATURE, TRIAL TRENCH.\n"""

    # Generate and compare
    print(f"Input Prompt: {prompt}")

    base_response = generate_response(base_model, base_tokenizer, prompt)
    print(f"\nBase Model Response:\n{base_response}")

    #ft_response_direct = generate_response(ft_model, ft_tokenizer, prompt)
    #print(f"\nFine-tuned Model Response (Direct):\n{ft_response_direct}")

    ft_response_pipeline = generate_response_pipeline(ft_pipeline, prompt)
    print(f"\nFine-tuned Model Response (Pipeline):\n{ft_response_pipeline}")


if __name__ == "__main__":
    main()

This last bit is a pipeline for passing rows from a csv to your newly fine-tuned model for archaeological metadata extraction. Data via ADS again.

In [None]:

## get some test data
# csv where all of the fields have been smooshed into a single column
!wget https://gist.githubusercontent.com/shawngraham/15c7cf3e2982d645b0c03c745f12e6bf/raw/b06b6333aa14dd7d40bb14aac79b3434db3afdd0/test.csv


In [None]:

import pandas as pd
import torch
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

def generate_response_pipeline(pipe, input_text, max_length=500):
    """
    Generates a response using a Hugging Face pipeline
    """
    # do_sample=False asks for greedy (deterministic) decoding,
    # which is what temperature=0.0 was meant to achieve
    response = pipe(input_text, max_length=max_length, do_sample=False)[0]['generated_text']
    return response

def generate_response(model, tokenizer, input_text, max_length=500):
    """
    Generates a response using the model
    """

    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(**inputs, max_length=max_length, do_sample=False)

    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

def load_and_prepare_model(model_path, model_id=None, use_merged = True):
    """
    Loads and prepares either the merged model or the LoRA model
    """
    if use_merged:
         # Load the merged model directly
        model = AutoModelForCausalLM.from_pretrained(
            model_path,
            device_map="auto",
            torch_dtype=torch.float16
        )
    else:
        # Load base model and LoRA weights
        model = AutoModelForCausalLM.from_pretrained(
            model_id,
             device_map="auto",
            torch_dtype=torch.float16
        )
        model = PeftModel.from_pretrained(model, model_path)

    tokenizer = AutoTokenizer.from_pretrained(model_path) if use_merged else AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token = tokenizer.eos_token
    return model, tokenizer


def main():
   # Model names
    base_model_id = "HuggingFaceTB/SmolLM-135M"
    fine_tuned_model_id = "sgraham/merged_archae_metadata_model"  # Replace with your actual model repo

     # Load the base model
    base_model, base_tokenizer = load_and_prepare_model(model_path=base_model_id, model_id=base_model_id, use_merged = True)

   # Load the fine-tuned model
    ft_model, ft_tokenizer = load_and_prepare_model(model_path=fine_tuned_model_id, model_id=base_model_id, use_merged = True)


    # Load the pipeline
    ft_pipeline = pipeline("text-generation", model=fine_tuned_model_id, device_map="auto", torch_dtype=torch.float16)

    # Read the CSV file
    df = pd.read_csv("test.csv")

    # Iterate over the rows of the DataFrame
    for index, row in df.iterrows():
        prompt = f"""<|system|>You are a helpful archaeological assistant trained to identify appropriate metadata from archaeological reports.\n<|user|>Please identify the metadata that describes the work recounted in this archaeological report; return json: {row['description']}\n"""

        # Generate and compare
        print(f"Input Prompt: {prompt}")

        #base_response = generate_response(base_model, base_tokenizer, prompt)
        #print(f"\nBase Model Response:\n{base_response}")

        #ft_response_direct = generate_response(ft_model, ft_tokenizer, prompt)
        #print(f"\nFine-tuned Model Response (Direct):\n{ft_response_direct}")

        ft_response_pipeline = generate_response_pipeline(ft_pipeline, prompt)
        print(f"\nFine-tuned Model Response (Pipeline):\n{ft_response_pipeline}")

if __name__ == "__main__":
    main()
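The pipeline hands back the prompt followed by whatever the model generated, so to use the metadata you still have to dig the JSON out of that text. Here is a minimal sketch of one way to do it, using only the standard library. `extract_first_json` is a hypothetical helper, and it assumes the model actually produced a well-formed JSON object somewhere after the prompt, which a small model will not always manage.

```python
import json

def extract_first_json(generated_text, prompt=""):
    """Return the first parseable JSON value after the prompt, or None."""
    # Drop the echoed prompt so braces inside it are not mistaken for output
    if prompt and generated_text.startswith(prompt):
        text = generated_text[len(prompt):]
    else:
        text = generated_text

    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value and ignores trailing text
            obj, _ = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one
            start = text.find("{", start + 1)
    return None

# Made-up output, shaped like a response from the fine-tuned model
sample_prompt = "<|user|>Please identify the metadata; return json: ...\n"
sample_output = sample_prompt + '{"periods": ["ROMAN"], "subjects": ["Ditch"]} and so on'
print(extract_first_json(sample_output, sample_prompt))
```

Rows where this returns `None` are worth collecting separately: they show where the model is failing to produce usable output.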