<a href="https://colab.research.google.com/github/rchejfec/IRPP-oasis-llm-automation-ratings-guide/blob/main/HarnessingAI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Step 1 - Set up and data prep

The easiest way to get started, especially if you're new to Python or LLMs, is to use Google Colab. It's free and relatively easy to use. Just log in with (or create) a Google account, open this file, and upload the datasets included here. You can also use your Google Drive, and there are many tutorials online if you get stuck.

If you don't use Colab or a similar development environment, you'll have to manage your own dependencies, versions, and keys. That can be a bit daunting if you're not used to it, but it's not much more complicated.

In [None]:
# If you're using Google Colab, most likely you only need to install DSPy

!pip install dspy-ai -q

# if you're not using Google Colab, ensure the following are installed.
# (json, os, re and time ship with Python and don't need installing.)
# !pip install google-generativeai pandas numpy

In [None]:
import dspy
import google.generativeai as genai
import pandas as pd
import numpy as np
import os
import re
import time
import json

We'll first get a list of all unique skills and work activities in ESDC's OaSIS, which you can download from this repo or here. If you want to replace it with some other source (O*NET, for example), just ensure the final result is a list of dicts that follows the structure of `items_to_rate`.
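
As a rough illustration of that structure, here is a minimal sketch that reshapes a different table into the same list of `{'Name': ..., 'Description': ...}` records. The column names `Element Name` and `Element Description` and both rows are placeholders invented for this example; they are not the actual layout of O*NET or any other source.

```python
import pandas as pd

# Placeholder table standing in for an alternative source.
# Column names and rows are invented for illustration only.
other_source_df = pd.DataFrame({
    "Element Name": ["Example skill", "Example activity"],
    "Element Description": ["A short description of the skill.",
                            "A short description of the activity."],
})

# Rename to the two keys the rest of the notebook expects,
# then convert each row into a dict.
example_items = (
    other_source_df
    .rename(columns={"Element Name": "Name",
                     "Element Description": "Description"})
    [["Name", "Description"]]
    .to_dict(orient="records")
)

print(example_items[0])
```

Anything that ends up in this shape should be usable in place of `items_to_rate` in the later steps.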

In [None]:
# Set the working directory so that you can use relative paths.
os.chdir("/content/drive/MyDrive/Colab Projects/Harnessing AI") # update with your own

try:
  # OaSIS Guide:
  guide_df = pd.read_csv("Data/OaSIS_Guide_2023.csv")

  # Extract skills & work activities
  items_to_rate_df = guide_df[(guide_df['Structure type'] == "Descriptor") &
                          (guide_df['Category'].isin(["Skills", "Work Activities"]))]

  # Convert to dict for easier wrangling
  items_to_rate = items_to_rate_df[['Name', 'Description']].to_dict(orient="records")

  # Remove unnecessary information that might interfere with prompts.
  pattern_to_remove = r"\s*This descriptor is measured by .*? level on a scale of \d+-\d+\.?\s*$"

  for item in items_to_rate:
    original = item.get("Description")
    cleaned = re.sub(pattern_to_remove, "", original, flags=re.IGNORECASE).strip()
    item["Description"] = cleaned

  print(f"Extracted dictionary with {len(items_to_rate)} items with {list(items_to_rate[0].keys())} as keys.")

except Exception as e:
  print(f"Failed to make a dictionary from input table. Ensure you have uploaded the file and that file names are aligned.\
  \nError message: {e}")

Extracted dictionary with 99 items with ['Name', 'Description'] as keys.
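
To see what the cleanup pattern in the cell above does, here is a small sketch that applies it to a single string. The sample description is invented to exercise the pattern and is not copied from the OaSIS guide.

```python
import re

# Same pattern as in the data prep cell.
pattern_to_remove = r"\s*This descriptor is measured by .*? level on a scale of \d+-\d+\.?\s*$"

# Invented description, written only to exercise the pattern.
sample = ("Communicating ideas in writing. "
          "This descriptor is measured by proficiency level on a scale of 1-5.")

# The trailing measurement sentence is removed; the rest is left untouched.
cleaned_sample = re.sub(pattern_to_remove, "", sample, flags=re.IGNORECASE).strip()
print(cleaned_sample)
```

Descriptions that don't end with that sentence pass through unchanged, since `re.sub` only replaces what the pattern matches.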


Next, we'll import a list of different phrasings to use when prompting the LLM. These come from Appendix B of the study, with some slight modifications. I omit versions 7 and 9, since I couldn't find an easy way to handle changing those parameters on the fly across different model providers. I also incorporate the name and (sometimes) the description of the skills and work activities into the prompt using tags (e.g. `{item_name}`). The goal here was to make it easier for the LLM to parse the question, as some models are better than others at processing longer, formally structured requests.

You can download the prompts from `Data/prompt_templates.json`. Edit the file in a text editor to remove, modify, or add phrasings. For example, while only some of the original study's prompts included a description of the skill or activity, you could ensure all of them do by adding the tag `{item_descriptor}` (the placeholder the code in Step 3 looks for) where it makes sense.
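
Judging from how the loop in Step 3 reads the file, each entry needs an `id`, a `template`, and an `expects_explanation` flag, and descriptions are inserted through the `{item_descriptor}` placeholder. The sketch below shows what two entries could look like and how a template gets filled in. The ids and wording are invented for illustration and are not the study's phrasings.

```python
import json

# Two invented entries that mimic the keys the Step 3 loop reads.
example_json = """
[
  {"id": "example_1",
   "expects_explanation": false,
   "template": "Rate the automatability of '{item_name}' on a scale of 1 to 5."},
  {"id": "example_2",
   "expects_explanation": true,
   "template": "'{item_name}' is described as: {item_descriptor} Rate its automatability on a scale of 1 to 5 and explain why."}
]
"""

example_prompts = json.loads(example_json)

# Fill the placeholders of the second template with a sample item.
filled = example_prompts[1]["template"].format(
    item_name="Writing",
    item_descriptor="Communicating ideas in writing.",
)
print(filled)
```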

In [None]:
try:
  with open("Data/prompt_templates.json", "r") as f:
    prompts = json.load(f)
    print(f"Imported {len(prompts)} prompts.")
except Exception as e:
  print(f"Failed to import list of prompts. Ensure you have uploaded the file and that file names are aligned.\
  \nError message: {e}")

Imported 10 prompts.


### Step 2 - Configure model via DSPy

The next step is to initialize and configure our model and define the structure of our prompts and expected response.

I used Google's Gemma-3-1b-it and got my API key through Google AI Studio. You could just as easily use one of OpenAI's models by finding the exact name of the model (it should look like `provider/model-name`, as in `gemini/gemma-3-1b-it` below) and getting yourself an API key. The same goes for a Llama model served through Groq, for example, or virtually any other major model.

It is important that you don't share or accidentally publish your API key. That's why the code below uses Google Colab's *secrets* feature, which makes things pretty easy. Just click on the key icon on the left navigation bar, add your key, and give it a name. Remember to update `API_KEY_NAME` below too. If you'd rather load your key locally, you can check this out for some resources.
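
If you are running outside Colab, one common option is to read the key from an environment variable. The following is a minimal sketch under that assumption; `load_api_key_from_env` is a hypothetical helper name, and it presumes you have already set a variable called `GOOGLE_API_KEY` in your shell or system settings.

```python
import os

def load_api_key_from_env(key_name="GOOGLE_API_KEY"):
    """Return the key stored in the environment variable `key_name`,
    or None if the variable is unset or blank."""
    value = os.environ.get(key_name, "").strip()
    return value or None

local_key = load_api_key_from_env("GOOGLE_API_KEY")
print("Key found." if local_key else "Key not found; set the environment variable first.")
```

The returned value could then be assigned to `API_KEY` in the placeholder section of the next cell. Avoid pasting the key itself into the notebook.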

In [None]:
# Define the name and version of the LLM (Defaults to Gemma)
LLM_PROVIDER_MODEL_STRING = "gemini/gemma-3-1b-it"

# Initialize variables
API_KEY = None
llm = None
API_KEY_NAME = "GOOGLE_API_KEY"   # update with your own name if different

# Load API Key
# If running in Google Colab, set your API key in the "Secrets" tab.
try:
  from google.colab import userdata
  API_KEY = userdata.get(API_KEY_NAME)
  if API_KEY:
      print(f"{API_KEY_NAME} loaded from Colab secrets.")
except ImportError:
  print("Could not import Colab userdata")
  pass

# If not using Google Colab, make sure to implement your key loading here
# #######
# #######

# Set model parameters
if API_KEY:
  try:
      model_config = {
          "temperature": 0.1,     # change the desired temperature
          "max_tokens": 300}      # change the desired number of max tokens
      llm = dspy.LM(
          model=LLM_PROVIDER_MODEL_STRING,
          api_key=API_KEY,
          temperature=model_config["temperature"],
          max_tokens= model_config["max_tokens"],
          timeout=60,             # number of seconds before it retries. lengthen if your connection is spotty
          num_retries=3           # number of times it should retry after timing out
          )

      print(f"DSPy configured to use model: {LLM_PROVIDER_MODEL_STRING}")
  except Exception as e:
      print(f"ERROR: Could not configure DSPy with {LLM_PROVIDER_MODEL_STRING}. Exception: {e}")
      llm = None
else:
  print(f"ERROR: {API_KEY_NAME} not found. Please set it in Colab secrets OR as an environment variable.")

# Configure DSPy with the chosen LLM
if llm:
    dspy.settings.configure(lm=llm)
    print(f"DSPy global LM successfully configured. Active LM uses model: {llm.model if hasattr(llm, 'model') else LLM_PROVIDER_MODEL_STRING}")
else:
    print("CRITICAL ERROR: LLM was not configured. DSPy operations will fail. Please check your LLM_PROVIDER_MODEL_STRING and API key setup.")

GOOGLE_API_KEY loaded from Colab secrets.
DSPy configured to use model: gemini/gemma-3-1b-it
DSPy global LM successfully configured. Active LM uses model: gemini/gemma-3-1b-it


Next, we define the DSPy signature, which acts as a kind of recipe for your prompts. I called it `AutomatabilityRatingSignature`, and as you can see below, it provides a high-level description of the expected behaviour, the full text of the request (`full_request_text`) as an input, and a numeric `rating` and an `explanation` as outputs. This isn't the prompt itself but rather a skeleton for the prompt. DSPy will use this, combined with the parameters we set in the previous cell, to convert our request into something the model's API can understand.

To tell DSPy to use this signature when prompting the LLM, we use it inside a Predict module (called `generate_rating` below). At this point, we could already start firing questions at the model. For example:

```
# calling generate_rating with a question about a skill
generate_rating(
    full_request_text="Rate the automatability of the skill ~Writing~ "
                      "in the context of advancements in generative AI "
                      "in the next 5-10 years")

# returns something like
#   rating: 4
#   explanation: Writing is a complex skill that can be automated to a significant
#   degree, but it still requires considerable human oversight...
```




In [None]:
# Edit the signature if you want to modify the LLM's behaviour.
class AutomatabilityRatingSignature(dspy.Signature):
    """Given a specific request about a skill or work activity, provide a numerical rating for its automatability and an explanation if requested."""

    full_request_text = dspy.InputField(
        desc="The complete, formatted prompt text that asks for the automatability rating and may request an explanation."
        )
    rating = dspy.OutputField(
        desc="A single numerical rating on a scale of 1 to 5 (e.g., 1, 2, 3, 4, or 5)."
        )
    explanation = dspy.OutputField(
        desc="A brief explanation for the rating. This may be empty if the prompt did not request an explanation."
        )

if dspy.settings.lm is None:
    print("CRITICAL ERROR: LLM not configured in dspy.settings. Please run the LLM configuration cell.")
else:
    generate_rating = dspy.Predict(AutomatabilityRatingSignature)

### Step 3 - Prompting the LLM

Now that everything is set up, we can move on to actually prompting the LLM, which turns out to be surprisingly simple. In essence, for every item in `items_to_rate` (a skill or work activity), the code iterates through all 10 `prompts`, inserts the name and description (when relevant) into the template, and calls our `generate_rating()` function with the resulting text as the prompt. It then saves the output along with some identifying information in a new list called `all_results`.
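
Before spending API calls, it can help to dry-run the formatting step on its own. The sketch below is a simplified stand-in for the loop in the next cell, not a replacement for it: `build_requests` is a hypothetical helper, and the two items and two templates are invented. It relies on the fact that `str.format` ignores keyword arguments a template doesn't use, so both values can always be passed.

```python
def build_requests(items, prompt_list):
    """Return one formatted request per (item, prompt) pair,
    without calling any model."""
    requests = []
    for item in items:
        for prompt in prompt_list:
            template = prompt.get("template")
            if not template:
                continue  # skip entries with no template, as the main loop does
            requests.append({
                "item_name": item["Name"],
                "prompt_id": prompt.get("id"),
                # Keyword arguments the template doesn't use are ignored.
                "full_request_text": template.format(
                    item_name=item["Name"],
                    item_descriptor=item["Description"],
                ),
            })
    return requests

# Invented inputs, for illustration only.
demo_items = [
    {"Name": "Writing", "Description": "Communicating ideas in writing."},
    {"Name": "Example activity", "Description": "A short description."},
]
demo_prompts = [
    {"id": "p1", "template": "Rate the automatability of {item_name}."},
    {"id": "p2", "template": "{item_name}: {item_descriptor} Rate its automatability."},
]

demo_requests = build_requests(demo_items, demo_prompts)
print(len(demo_requests))
print(demo_requests[1]["full_request_text"])
```

Running the same helper on the real `items_to_rate` and `prompts` should give one request per item and prompt pair, which is a quick way to spot a misspelled placeholder before the long run starts.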

In [None]:
# Flag to signal breaking out of loops when the model can't be reached.
stop_processing_flag = False

# Empty list in which results will be saved
all_results = []

if 'generate_rating' not in locals():
    print("Error: The 'generate_rating' DSPy predictor is not initialized. Please check step 2.")
else:
    for item_info in items_to_rate:
        item_name = item_info.get('Name')
        item_descriptor = item_info.get('Description')

        # print(f"\nProcessing: '{item_name}'")     # uncomment for debugging or tracking progress

        for prompt_info in prompts:
            prompt_id = prompt_info.get('id')
            prompt_template = prompt_info.get('template')

            if not prompt_template:
                print(f"Skipping prompt due to missing template: {prompt_info}")
                continue

            # Format the prompt - fill in the placeholders in the template
            try:
                if "{item_descriptor}" in prompt_template:
                    current_full_request_text = prompt_template.format(
                        item_name=item_name,
                        item_descriptor=item_descriptor
                    )
                else:
                    current_full_request_text = prompt_template.format(
                        item_name=item_name
                    )
            except KeyError as e:
                print(f"KeyError during formatting for item '{item_name}' with prompt ID '{prompt_id}'. Placeholder: {e}. Skipping.")
                continue

            # Call the DSPy Predictor
            try:
                # This sends the request to the LLM
                prediction = generate_rating(
                    full_request_text=current_full_request_text)

                # The prediction object will have attributes corresponding to the OutputFields
                llm_rating_raw = prediction.rating
                llm_explanation = prediction.explanation if hasattr(prediction, 'explanation') else ""

                result_entry = {
                    'item_name': item_name,
                    'prompt_id': prompt_id,
                    'expects_explanation_flag': prompt_info.get('expects_explanation'),
                    'llm_raw_rating_output': llm_rating_raw,
                    'llm_explanation_output': llm_explanation,
                }

                all_results.append(result_entry)

            except Exception as e:
                error_message = str(e)
                print(f"ERROR during LLM call for item '{item_name}' with prompt ID '{prompt_id}': {error_message}")

                # Store error information (keep the prompt ID so the failure can be traced)
                all_results.append({
                    'item_name': item_name,
                    'prompt_id': prompt_id,
                    'error': error_message
                })

                # Check if the error is model not found (404)
                if "404" in error_message:
                    print("This seems to be a critical API or model configuration error. Stopping further processing.")
                    stop_processing_flag = True # Signal to stop all processing
                    break

            # Rate Limiting - add a sleep timer to avoid hitting API rate limits.
            time.sleep(2.2) # Sleep for 2.2 seconds (at most ~27 requests per minute)

        if stop_processing_flag:
            break # Critical error found, breaking out of both loops

    print("\n--- Processing Complete ---")
    print(f"Total results collected: {len(all_results)}")
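
The pause at the end of each iteration sets an upper bound on how fast requests go out. As a back-of-the-envelope check, the sketch below relates the pause to the request rate and to the total run time. It assumes the model's own response time is zero, which understates the real duration. The helper names are made up for this sketch, and the rate limit that applies to you depends on your provider and plan.

```python
def max_requests_per_minute(sleep_seconds):
    """Upper bound on requests per minute for a given pause between calls."""
    return 60.0 / sleep_seconds

def min_total_minutes(n_items, n_prompts, sleep_seconds):
    """Lower bound on total run time, counting only the pauses."""
    return n_items * n_prompts * sleep_seconds / 60.0

# 2.2-second pause, 99 items and 10 prompts, as in this notebook.
print(round(max_requests_per_minute(2.2), 1))
print(round(min_total_minutes(99, 10, 2.2), 1))
```

With the values used here, that comes to roughly 27 requests per minute and a bit over 36 minutes of pauses for the full run.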

### Step 4 - Store and process the results

The last step is to take the list containing all of our results and turn it into datasets that we can work with. We'll use pandas for this, one of the most popular and powerful data libraries for Python. You can learn more about it here:

We create two resulting tables: `full_table_df`, which contains all 990 individual ratings, and `summary_table_df`, which aggregates the responses from the different prompts by skill or work activity, calculating the mean score, the standard deviation, and the range of ratings. For further exploration, it also records the LLM's explanation for the lowest and highest recorded scores.
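
To make the aggregation concrete, here is a small sketch of the same group-and-summarize pattern on a toy table. The ratings are made up purely for illustration and are not results from the study or from any model.

```python
import pandas as pd

# Made-up ratings: three prompts per item.
toy_df = pd.DataFrame({
    "item_name": ["Writing", "Writing", "Writing",
                  "Example activity", "Example activity", "Example activity"],
    "llm_raw_rating_output": [4, 5, 3, 2, 2, 1],
})

# One row per item with the mean, sample standard deviation, min and max.
toy_summary = toy_df.groupby("item_name")["llm_raw_rating_output"].agg(
    average_rating="mean",
    std_dev="std",
    min="min",
    max="max",
).reset_index()

# Human-readable range, e.g. "3 - 5".
toy_summary["range"] = (
    toy_summary["min"].astype(int).astype(str)
    + " - "
    + toy_summary["max"].astype(int).astype(str)
)

print(toy_summary)
```

Note that pandas' `std` is the sample standard deviation (it divides by n - 1), so an item with a single valid rating gets `NaN` in that column.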



In [None]:
# Convert results to DataFrame
results_df = pd.DataFrame(all_results)
results_df["llm_raw_rating_output"] = pd.to_numeric(results_df["llm_raw_rating_output"],
                                                    errors = "coerce")

# Filter for desired columns
full_table_df = results_df[['item_name',
                         'prompt_id',
                         'llm_raw_rating_output',
                         'llm_explanation_output']].copy()

# Group by skill or work activity for aggregation
grouped_by_item = full_table_df.groupby('item_name')

# Calculate mean, std deviation, min and max
summary_stats = grouped_by_item['llm_raw_rating_output'].agg(
    average_rating = 'mean',
    std_dev = 'std',
    min = 'min',
    max = 'max',
).reset_index()

# Create the 'range' column
summary_stats['range'] = summary_stats.apply(
    lambda row: f"{int(row['min'])} - {int(row['max'])}"
    if pd.notna(row['min']) and pd.notna(row['max']) else "N/A",
    axis=1
)

# Extract explanation for Min and Max scores
def get_min_max_explanations(group):

    min_val = group['llm_raw_rating_output'].min()
    max_val = group['llm_raw_rating_output'].max()

    min_explanation = "N/A"
    max_explanation = "N/A"

    if pd.notna(min_val):
        # Get the first row where llm_raw_rating_output matches min_val
        min_row = group[group['llm_raw_rating_output'] == min_val].iloc[0]
        min_explanation = min_row['llm_explanation_output'] if pd.notna(min_row['llm_explanation_output']) else "N/A"

    if pd.notna(max_val):
        # Get the first row where llm_raw_rating_output matches max_val
        max_row = group[group['llm_raw_rating_output'] == max_val].iloc[0]
        max_explanation = max_row['llm_explanation_output'] if pd.notna(max_row['llm_explanation_output']) else "N/A"

    if pd.notna(min_val) and pd.notna(max_val) and min_val == max_val:
        return f"Min/Max ({int(min_val)}): {min_explanation}"
    else:
        return f"Min ({int(min_val) if pd.notna(min_val) else 'N/A'}): {min_explanation}\n----------\nMax ({int(max_val) if pd.notna(max_val) else 'N/A'}): {max_explanation}"

min_max_explanations_series = grouped_by_item.apply(get_min_max_explanations, include_groups=False)
min_max_explanations_df = min_max_explanations_series.reset_index(name='min_max_explanations')

# Merge with summary_stats
summary_table_df = pd.merge(summary_stats, min_max_explanations_df, on='item_name')
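
One thing to watch for: `pd.to_numeric(..., errors="coerce")` in the cell above turns any reply that isn't a bare number (for example `"4/5"` or `"Rating: 4"`) into `NaN`. If many ratings go missing that way, a more forgiving parser may help. The sketch below is one possible approach, not part of the original workflow; `extract_rating` is a hypothetical helper that keeps the first standalone digit from 1 to 5.

```python
import math
import re

def extract_rating(raw):
    """Return the first standalone digit 1-5 found in `raw` as a float,
    or NaN if there is none. Multi-digit numbers such as 10 are ignored."""
    if raw is None:
        return math.nan
    match = re.search(r"(?<!\d)[1-5](?!\d)", str(raw))
    return float(match.group()) if match else math.nan

for reply in ["4", "Rating: 3 out of 5", "4/5", "n/a", "10"]:
    print(repr(reply), "->", extract_rating(reply))
```

It could be applied with `results_df["llm_raw_rating_output"].map(extract_rating)` on the raw model output, in place of `pd.to_numeric`. Be aware that it takes the first match, so a reply like `"3.5"` would be read as 3.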

In [None]:
# We save the two resulting tables as CSV files.
# If you're on Google Colab, make sure to download these before
# restarting your session or you'll lose them.

full_table_df.to_csv("full_table_df.csv", index=False, encoding='utf-8-sig')

summary_table_df.to_csv("summary_table_df.csv", index=False, encoding='utf-8-sig')