# Workshop: Structured information extraction with LLMs

## Installation

Switch the Colab runtime to a T4 GPU (Runtime > Change runtime type).

Install required Python packages.

We use:

- `llama-cpp-python` for running the LLM
- `instructor` for structured outputs
- `polars` for data manipulation

`llama-cpp-python` needs to be installed with GPU support.


In [None]:
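# Instructions for GPU install: https://github.com/abetlen/llama-cpp-python/issues/576#issuecomment-2379861701
# These variables only matter if pip has to build llama-cpp-python from source;
# the pre-built CUDA wheel installed below does not need them.
import os
os.environ["CMAKE_ARGS"] = "-DGGML_CUDA=on"  # replaces the older -DLLAMA_CUBLAS=on flag
os.environ["FORCE_CMAKE"] = "1"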

Check that the GPU is available.

In [None]:
!nvidia-smi

Mon Mar 17 14:36:48 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   65C    P8             12W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [None]:
!pip install --no-cache-dir llama-cpp-python==0.3.4 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu122
!pip install instructor==1.7.2
!pip install polars==1.20.0

Looking in indexes: https://pypi.org/simple, https://abetlen.github.io/llama-cpp-python/whl/cu122
Collecting llama-cpp-python==0.3.4
  Downloading https://github.com/abetlen/llama-cpp-python/releases/download/v0.3.4-cu122/llama_cpp_python-0.3.4-cp311-cp311-linux_x86_64.whl (445.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 445.2/445.2 MB 83.0 MB/s eta 0:00:00
Collecting diskcache>=5.6.1 (from llama-cpp-python==0.3.4)
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45.5/45.5 kB 26.1 MB/s eta 0:00:00
Installing collected packages: diskcache, llama-cpp-python
Successfully installed diskcache-5.6.3 llama-cpp-python-0.3.4
Collecting instructor==1.7.2
  Downloading instructor-1.7.2-py3-none-any.whl.metadata (18 kB)
Collecting jiter<0.9,>=0.6.1 (from instructor==1.7.2)
  Downloading jiter-0.

## Run an LLM locally

We will run an LLM on Google Colab using llama.cpp, a highly optimized C++ inference engine for LLMs, through its Python bindings `llama-cpp-python`. It makes it possible to run LLMs on laptop-grade hardware.

First, we download an open-weights model from the Hugging Face Hub: **Hermes-3-Llama-3.2-3B**, a fine-tuned version of Meta's Llama 3.2 with 3 billion parameters. The developers at Nous Research have quantized it to 4-bit precision, which makes the model faster and smaller at the cost of some accuracy. We will use it throughout the workshop. Larger models are typically better, so if something doesn't work, it may be because the model is too small.
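
To get a feel for what quantization buys us, here is a back-of-the-envelope size estimate. It assumes exactly 3 billion parameters and the same bit width for every weight, both of which are simplifications. The helper name `estimate_size_gb` is ours and not part of any library.

```python
def estimate_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 3e9  # assumption: "3 billion parameters", rounded

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: ~{estimate_size_gb(n_params, bits):.1f} GB")
```

The GGUF file we download is about 2 GB, somewhat more than the 4-bit estimate. This is likely because the parameter count is rounded and because the Q4_K_M scheme keeps some tensors at higher precision.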


In [None]:
import llama_cpp
from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

# Load model:
# https://huggingface.co/NousResearch/Hermes-3-Llama-3.2-3B-GGUF

llama = llama_cpp.Llama.from_pretrained(
    repo_id="NousResearch/Hermes-3-Llama-3.2-3B-GGUF",
    filename="Hermes-3-Llama-3.2-3B.Q4_K_M.gguf",
    tokenizer=llama_cpp.llama_tokenizer.LlamaHFTokenizer.from_pretrained(
      "NousResearch/Hermes-3-Llama-3.2-3B"
    ),
    n_gpu_layers=-1,
    chat_format="chatml",
    n_ctx=8192,
    draft_model=LlamaPromptLookupDecoding(num_pred_tokens=2),
    logits_all=True,
    verbose=False,
)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/50.3k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/444 [00:00<?, ?B/s]

Hermes-3-Llama-3.2-3B.Q4_K_M.gguf:   0%|          | 0.00/2.02G [00:00<?, ?B/s]

llama_new_context_with_model: n_ctx_per_seq (8192) < n_ctx_train (131072) -- the full capacity of the model will not be utilized


Let's test our model with a text classification task.

In [None]:
response = llama.create_chat_completion_openai_v1(
    messages=[
        {"role": "user", "content": "Classify the following news article as either World, Sports, Business or Sci/Tech: `Apple announces new iPhone with revolutionary AI features`"},
    ],
)

print(response.choices[0].message.content)


The news article "Apple announces new iPhone with revolutionary AI features" would be classified as Sci/Tech.


The answer is correct, but it's not structured. We can't easily use it in data analysis. We could ask the model to only return the category, but it may still add extra comments, formatting or other text. Instead, we can force it to always return a valid JSON object.

## Structured outputs with instructor

Structured information extraction works best with structured outputs. We will use the `instructor` package for that. This is possible because llama-cpp-python and instructor both follow the OpenAI API specification, which has become a de facto industry standard.

In [None]:
import instructor

create = instructor.patch(
    create=llama.create_chat_completion_openai_v1,
    mode=instructor.Mode.JSON_SCHEMA,
)

### Alternative: Connect with OpenAI API

Uncomment the code chunk below and place your OpenAI API key in it. Alternatively, rewrite this chunk to connect to another LLM API that is compatible with instructor. See the instructor [documentation](https://python.useinstructor.com/integrations/) for more details.

In [None]:
#import instructor
#from openai import OpenAI
#from functools import partial

#client = instructor.from_openai(OpenAI(api_key="<your-api-key>"))
# Simplify the function to create a chat completion
#create = partial(client.chat.completions.create, model="gpt-4o-mini")

### Text classification

Let's do the same text classification task, but now with structured output.

In [None]:
from typing import Literal
from pydantic import BaseModel

class NewsCategory(BaseModel):
    category: Literal["World", "Sports", "Business", "Sci/Tech"]


news_category = create(
    messages=[
        {
            "role": "user",
            "content": "Classify this news article into a category: `Apple announces new iPhone with revolutionary AI features`",
        }
    ],
    response_model=NewsCategory,
)

print(news_category)


category='Sci/Tech'
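
The response model is a plain pydantic class. As far as we understand, instructor derives a JSON schema from it to describe the allowed output and then validates whatever JSON comes back. The sketch below shows both steps without calling an LLM; the two JSON strings are hand-written stand-ins for model output.

```python
from typing import Literal

from pydantic import BaseModel, ValidationError


class NewsCategory(BaseModel):
    category: Literal["World", "Sports", "Business", "Sci/Tech"]


# The JSON schema lists the allowed values for `category`.
schema = NewsCategory.model_json_schema()
print(schema["properties"]["category"]["enum"])

# A well-formed answer parses into a typed Python object.
ok = NewsCategory.model_validate_json('{"category": "Sci/Tech"}')
print(ok.category)

# An answer outside the allowed labels is rejected instead of silently accepted.
try:
    NewsCategory.model_validate_json('{"category": "Technology"}')
    rejected = False
except ValidationError:
    rejected = True
print(rejected)
```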


### Named entity recognition

Let's try named entity recognition.

In [None]:
from typing import List

class Entity(BaseModel):
    text: str
    type: Literal["PERSON", "ORGANIZATION", "LOCATION", "DATE", "OTHER"]

class NamedEntities(BaseModel):
    entities: List[Entity]

entities = create(
    messages=[
        {
            "role": "user",
            "content": "Extract named entities from this text: 'John Smith visited Microsoft headquarters in Seattle last Tuesday'"
        }
    ],
    response_model=NamedEntities,
)

print(entities)


entities=[Entity(text='John Smith', type='PERSON'), Entity(text='Microsoft', type='ORGANIZATION'), Entity(text='headquarters', type='LOCATION'), Entity(text='Seattle', type='LOCATION'), Entity(text='last Tuesday', type='DATE')]
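
Small models sometimes return entities that are not literally in the text, so a cheap sanity check is to look each one up in the source. This is a minimal sketch using plain string search; `locate_entities` is a hypothetical helper, and the `(text, type)` pairs are copied by hand from the output above.

```python
def locate_entities(source_text, extracted):
    """Return (entity_text, entity_type, start_offset) triples.

    start_offset is -1 when the entity text does not occur verbatim.
    """
    return [(ent, typ, source_text.find(ent)) for ent, typ in extracted]


source_text = "John Smith visited Microsoft headquarters in Seattle last Tuesday"
extracted = [
    ("John Smith", "PERSON"),
    ("Microsoft", "ORGANIZATION"),
    ("headquarters", "LOCATION"),
    ("Seattle", "LOCATION"),
    ("last Tuesday", "DATE"),
]

located = locate_entities(source_text, extracted)
for ent, typ, start in located:
    status = "missing" if start == -1 else f"found at offset {start}"
    print(f"{ent!r} ({typ}): {status}")
```

A check like this only catches entities that were invented or paraphrased. It cannot tell whether the type is right: note that 'headquarters' is tagged as a LOCATION above.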


### Aspect-based sentiment analysis

Let's try aspect-based sentiment analysis.

In [None]:
from typing import Literal
from pydantic import BaseModel

class Sentiment(BaseModel):
    aspect: str
    polarity: Literal["positive", "neutral", "negative"]


sentiment = create(
    messages=[
        {
            "role": "user",
            "content": "Analyze the following review with aspect-based sentiment analysis: `The cheeseburger was delicious`",
        }
    ],
    response_model=Sentiment,
)

print(sentiment)

aspect='cheeseburger' polarity='positive'
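
Reviews often mention more than one aspect, while the `Sentiment` model above holds exactly one. One possible extension, sketched here as an assumption rather than as part of the workshop code, is to wrap it in a list. The class name `ReviewSentiments` and the JSON string are made up for illustration; the JSON stands in for an LLM response.

```python
from typing import List, Literal

from pydantic import BaseModel


class Sentiment(BaseModel):
    aspect: str
    polarity: Literal["positive", "neutral", "negative"]


class ReviewSentiments(BaseModel):
    sentiments: List[Sentiment]


# Hand-written stand-in for what a model might return for the review
# "The cheeseburger was delicious but the service was slow".
raw = """
{"sentiments": [
    {"aspect": "cheeseburger", "polarity": "positive"},
    {"aspect": "service", "polarity": "negative"}
]}
"""

review = ReviewSentiments.model_validate_json(raw)
for s in review.sentiments:
    print(s.aspect, s.polarity)
```

Passing `response_model=ReviewSentiments` to `create` should then yield one entry per aspect, although a small model may still miss some.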


### Free form tasks

We can ask the LLM to perform any extraction, transformation or other reasoning task. Let's try a prompt to comprehensively analyze a news article.

In [None]:
from pydantic import Field

class NewsAnalysis(BaseModel):
    language: str = Field(description="Language of the article as a two letter code")
    topic: Literal["World", "Sports", "Business", "Sci/Tech"]
    topic_detail: str = Field(description="Detailed categorization of the topic")
    summary: str = Field(description="One sentence summary of the article")
    entities: List[Entity]

article = """
The Tito & Friends group of companies continues its shopping spree in the market \
research industry. After acquiring the full-service institutes dcore from Munich \
and Mindtake from Vienna, the Hamburg Institute for Applied Data Analysis in \
Market Research (IfaD) has now also become part of the group at the turn of the year. \
With the sale, founder and previous managing director Martin Cyrus is retiring.
"""
# Translated excerpt from: https://www.marktforschung.de/marktforschung/a/ifad-wird-von-tito-friends-uebernommen/

analysis = create(
    messages=[
        {
            "role": "user",
            "content": f"Analyze the following news article: {article}",
        }
    ],
    response_model=NewsAnalysis,
)

print(analysis)


language='de' topic='Business' topic_detail='Market Research Industry' summary='The Tito & Friends group of companies continues its shopping spree in the market research industry. After acquiring the full-service institutes dcore from Munich and Mindtake from Vienna, the Hamburg Institute for Applied Data Analysis in Market Research (IfaD) has now also become part of the group at the turn of the year.' entities=[Entity(text='Tito & Friends', type='ORGANIZATION'), Entity(text='dcore', type='ORGANIZATION'), Entity(text='Mindtake', type='ORGANIZATION'), Entity(text='Hamburg Institute for Applied Data Analysis in Market Research', type='ORGANIZATION'), Entity(text='Martin Cyrus', type='PERSON')]


With the multi-task approach we can bundle a whole NLP pipeline into a single prompt. However, it makes evaluation and debugging more difficult.

## Few-shot learning

We can add examples to the prompt to improve the model's performance. Let's go back to the text classification example and create a system prompt with examples.

In [None]:
categories = ["World", "Sports", "Business", "Sci/Tech"]

examples = [
    {
        "text": "UN Security Council passes resolution on global peace initiative",
        "category": "World"
    },
    {
        "text": "Major tech company reports record quarterly earnings",
        "category": "Business"
    },
    {
        "text": "Scientists develop breakthrough quantum computing technology",
        "category": "Sci/Tech"
    },
    {
        "text": "Local team wins national championship in dramatic final",
        "category": "Sports"
    }
]

examples_joined = "\n".join(str(example) for example in examples)

system_prompt = f"""
You are a text classifier. You are given a news article and you need to classify it into one of the following categories:
{", ".join(categories)}

Examples:
{examples_joined}
"""

print(system_prompt)


You are a text classifier. You are given a news article and you need to classify it into one of the following categories:
World, Sports, Business, Sci/Tech

Examples:
{'text': 'UN Security Council passes resolution on global peace initiative', 'category': 'World'}
{'text': 'Major tech company reports record quarterly earnings', 'category': 'Business'}
{'text': 'Scientists develop breakthrough quantum computing technology', 'category': 'Sci/Tech'}
{'text': 'Local team wins national championship in dramatic final', 'category': 'Sports'}



Now we can use the system prompt to classify another news article.

In [None]:
news_category = create(
    messages=[
        {
            "role": "system",
            "content": system_prompt,
        },
        {
            "role": "user",
            "content": "Breakthrough AI technology for market research revealed at GOR 25",
        }
    ],
    response_model=NewsCategory,
)

print(news_category)

category='Business'
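
Another common way to supply examples is as alternating user and assistant turns instead of a block of text in the system prompt. The sketch below only builds the message list; `build_few_shot_messages` is a hypothetical helper, and writing the assistant turns as JSON is our assumption so that they resemble the structured answers we expect. Whether this works better than the system-prompt variant is something to measure, not assume.

```python
import json


def build_few_shot_messages(system_prompt, examples, text):
    """Build chat messages with one user/assistant pair per example."""
    messages = [{"role": "system", "content": system_prompt}]
    for example in examples:
        messages.append({"role": "user", "content": example["text"]})
        messages.append(
            {
                "role": "assistant",
                "content": json.dumps({"category": example["category"]}),
            }
        )
    # The article we actually want classified goes last.
    messages.append({"role": "user", "content": text})
    return messages


demo_examples = [
    {"text": "Local team wins national championship in dramatic final", "category": "Sports"},
    {"text": "Major tech company reports record quarterly earnings", "category": "Business"},
]

demo_messages = build_few_shot_messages(
    "Classify the news article as World, Sports, Business or Sci/Tech.",
    demo_examples,
    "Breakthrough AI technology for market research revealed at GOR 25",
)

for message in demo_messages:
    print(message["role"], "|", message["content"])
```

The resulting list can be passed as `messages` to `create` together with `response_model=NewsCategory`.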


## Evaluation

We can evaluate the performance of the model by comparing the predicted category to the true category. Here, we can run the experiment with the AG News dataset. The dataset contains 127,600 news articles classified into 4 categories: World, Sports, Business, and Science/Technology. Let's see how well the model performs on this dataset. It's available on Hugging Face: <https://huggingface.co/datasets/fancyzhx/ag_news>. We download it as a polars DataFrame. Polars is similar to pandas, but faster.


In [None]:
import polars as pl

splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}
agnews_test = pl.read_parquet('hf://datasets/fancyzhx/ag_news/' + splits['test'])


Text labels are easier to work with than a category ID number, because they are more descriptive and help the LLM understand the task. We can map the category ID to a text label.

In [None]:
label_map = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}
agnews_test = agnews_test.with_columns(pl.col('label').replace_strict(label_map).alias('category'))

print(agnews_test.group_by("category").agg(examples=pl.len()))


shape: (4, 2)
┌──────────┬──────────┐
│ category ┆ examples │
│ ---      ┆ ---      │
│ str      ┆ u32      │
╞══════════╪══════════╡
│ Business ┆ 1900     │
│ Sci/Tech ┆ 1900     │
│ World    ┆ 1900     │
│ Sports   ┆ 1900     │
└──────────┴──────────┘


Let's run the model on a random sample of 100 articles from the test set. At the speed we get on the Colab T4 (about 1.5 articles per second), classifying all 7,600 test articles would take well over an hour. A more powerful GPU would shorten this considerably.

In [None]:
agnews_test_sample = agnews_test.sample(n=100, seed=42)

In [None]:
from tqdm import tqdm

def get_response(text):
    return create(
        messages=[
            {
                "role": "system",
                "content": system_prompt
            },
            {"role": "user", "content": text}
        ],
        response_model=NewsCategory,
    )

responses = [
    get_response(text) for text in tqdm(agnews_test_sample['text'])
]


100%|██████████| 100/100 [01:06<00:00,  1.51it/s]


In [None]:
import numpy as np
from sklearn.metrics import accuracy_score

y_true = agnews_test_sample['category'].to_numpy()
y_pred = np.array([response.category for response in responses])

acc = accuracy_score(y_true=y_true, y_pred=y_pred)

print(f"Accuracy: {acc:.2f}")

Accuracy: 0.72


The accuracy is a start but not satisfactory. This highlights the importance of evaluation. If we had just looked at a few examples, we might have falsely concluded that the model is always correct.
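
With only 100 articles, the accuracy estimate itself is uncertain. One way to quantify this is a confidence interval for a proportion; the sketch below uses the Wilson score interval at roughly 95% confidence (z = 1.96). The function name `wilson_interval` is ours, and the calculation treats the 100 articles as an independent random sample.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# 72 correct out of 100, matching the accuracy reported above.
low, high = wilson_interval(successes=72, n=100)
print(f"95% interval: {low:.2f} to {high:.2f}")
```

The interval spans roughly 0.63 to 0.80, so small differences between prompt variants cannot be distinguished reliably on a sample of this size.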

There are many ways to improve the model's performance:

- Describe the different labels in more detail
- Add more examples
- Use prompting techniques, such as Chain of Thought
- Use a higher precision model (less quantization)
- Use a larger model
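
As one example from the list, Chain of Thought can be combined with structured outputs by adding a free-text field before the label, so that the model writes its reasoning first. This is a sketch of the idea, not a tested improvement; the class name `NewsCategoryWithReasoning` and the field description are our own.

```python
from typing import Literal

from pydantic import BaseModel, Field


class NewsCategoryWithReasoning(BaseModel):
    # Declared first so that it also comes first in the JSON schema.
    reasoning: str = Field(
        description="Short explanation of which category fits best and why"
    )
    category: Literal["World", "Sports", "Business", "Sci/Tech"]


cot_schema = NewsCategoryWithReasoning.model_json_schema()
print(list(cot_schema["properties"]))
```

The class can be passed as `response_model` in `get_response`. Because `response.category` is still available, the evaluation code stays the same. Expect slower generation, since the model produces more tokens per article.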

## Extracting information from images

Multimodal LLMs can take images as input and generate structured output, just like with text input.

Our local model does not have vision capability, so we need to switch to another model, such as gpt-4o-mini. There are also open weights models that support vision, such as [gemma-3-27b-it](https://huggingface.co/google/gemma-3-27b-it).

Instructor also supports audio. See the [documentation](https://python.useinstructor.com/concepts/multimodal/#usage).

As an example, let's set up a response model and a prompt to describe someone's clothes.

![Example image: Fashion model](https://images.unsplash.com/photo-1523559695898-5262d13a600c?q=80&w=2427&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D)

In [None]:
from typing import Literal
import openai
import instructor
from google.colab import userdata
from pydantic import BaseModel

class Item(BaseModel):
    name: str
    color: Literal["red", "blue", "green", "brown", "yellow", "orange", "purple", "pink", "gray", "black", "white"]

class Outfit(BaseModel):
    hat: Item | None
    top: Item
    bottom: Item
    shoes: Item
    accessories: list[Item]


client = instructor.from_openai(openai.OpenAI(api_key=userdata.get('OPENAI_API_KEY')))
url = "https://images.unsplash.com/photo-1523559695898-5262d13a600c?q=80&w=2427&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"

image = instructor.Image.from_url(url)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant. You are given an image and need to identify what the person is wearing."
        },
        {
            "role": "user",
            "content": [
                "Here is the image",
                image
            ],
        },
    ],
    response_model=Outfit
)

print(resp)

hat=None top=Item(name='shirt', color='blue') bottom=Item(name='pants', color='white') shoes=Item(name='sneakers', color='gray') accessories=[Item(name='sunglasses', color='black'), Item(name='bracelet', color='black')]
