# Structured information extraction

## Installation

Install required Python packages.

We use:

- `llama-cpp-python` for running the LLM
- `instructor` for structured outputs
- `polars` for data manipulation


In [1]:
# !pip install llama-cpp-python
# !pip install instructor
# !pip install polars

## Model

We will run an LLM on Google Colab using llama.cpp, a highly optimized C++ inference engine for LLMs. It makes it possible to run LLMs on laptop-grade hardware.

First, we download an open source model from the Hugging Face Hub: **Hermes-3-Llama-3.2-3B**, a fine-tuned version of Meta's Llama 3.2 with 3 billion parameters. The developers at Nous Research have quantized it to 4-bit precision, which makes the model faster and smaller at the cost of some accuracy. We will use it throughout the workshop. Larger models are typically better, so if something doesn't work, it may be because the model is too small.
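
To make the trade-off between size and accuracy concrete, here is a minimal sketch of symmetric 4-bit quantization in plain Python. It is a toy illustration, not the scheme used in the `Q4_K_M` file; the function names, the example weights and the assumption of exactly 4 or 16 bits per weight are ours.

```python
def estimate_size_gb(n_params, bits_per_weight):
    """Rough weight size in gigabytes, ignoring metadata and scale overhead."""
    return n_params * bits_per_weight / 8 / 1e9


def quantize_4bit(weights):
    """Map floats to integer codes in [-7, 7] that share one scale factor."""
    largest = max(abs(w) for w in weights)
    scale = largest / 7 if largest > 0 else 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [code * scale for code in codes]


weights = [0.12, -0.5, 0.33, 0.7, -0.26]  # made-up example weights
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(f"16-bit: {estimate_size_gb(3e9, 16):.1f} GB")
print(f"4-bit:  {estimate_size_gb(3e9, 4):.1f} GB")
print(f"codes: {codes}, max rounding error: {max_error:.3f}")
```

Under these assumptions, 3 billion weights shrink from about 6 GB to about 1.5 GB, and each restored weight can differ from the original by up to half a quantization step.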


In [2]:
import llama_cpp
from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

# Load model:
# https://huggingface.co/NousResearch/Hermes-3-Llama-3.2-3B-GGUF

llama = llama_cpp.Llama.from_pretrained(
    repo_id="NousResearch/Hermes-3-Llama-3.2-3B-GGUF",
    filename="Hermes-3-Llama-3.2-3B.Q4_K_M.gguf",
    tokenizer=llama_cpp.llama_tokenizer.LlamaHFTokenizer.from_pretrained(
      "NousResearch/Hermes-3-Llama-3.2-3B"
    ),
    n_gpu_layers=-1,
    chat_format="chatml",
    n_ctx=8192,
    draft_model=LlamaPromptLookupDecoding(num_pred_tokens=2),
    logits_all=True,
    verbose=False,
)



Let's test our model with a text classification task.

In [3]:
response = llama.create_chat_completion_openai_v1(
    messages=[
        {"role": "user", "content": "Classify the following news article as either World, Sports, Business or Sci/Tech: `Apple announces new iPhone with revolutionary AI features`"},
    ],
)

print(response.choices[0].message.content)


The news article "Apple announces new iPhone with revolutionary AI features" would be classified as Sci/Tech.


The answer is correct, but it's not structured. We can't easily use it in data analysis. We could ask the model to only return the category, but it may still add extra comments, formatting or other text. Instead, we can force it to always return a valid JSON object.
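
The difference matters as soon as the reply has to be parsed by code. The sketch below contrasts a naive keyword search over free text with parsing a JSON reply; the helper names and the example replies are made up for illustration.

```python
import json

CATEGORIES = ["World", "Sports", "Business", "Sci/Tech"]


def parse_free_text(reply):
    """Return the single category mentioned in a free-text reply, else None."""
    found = [c for c in CATEGORIES if c.lower() in reply.lower()]
    return found[0] if len(found) == 1 else None


def parse_json(reply):
    """Return the category from a JSON reply such as {"category": "Sports"}."""
    category = json.loads(reply)["category"]
    if category not in CATEGORIES:
        raise ValueError(f"Unexpected category: {category}")
    return category


# A chatty reply that contains several category names is ambiguous.
chatty = "This is Sci/Tech news, although it also affects the business world."
print(parse_free_text(chatty))  # None
print(parse_json('{"category": "Sci/Tech"}'))  # Sci/Tech
```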

## Structured outputs with instructor

Structured information extraction works best with structured outputs. We will use the `instructor` package for that. This is possible because llama.cpp and instructor both support the OpenAI API specification, which has become a de facto industry standard.

In [4]:

import instructor

create = instructor.patch(
    create=llama.create_chat_completion_openai_v1,
    mode=instructor.Mode.JSON_SCHEMA,
)

### Text classification

Let's do the same text classification task, but now with structured output.

In [5]:
from typing import Literal
from pydantic import BaseModel

class NewsCategory(BaseModel):
    category: Literal["World", "Sports", "Business", "Sci/Tech"]


news_category = create(
    messages=[
        {
            "role": "user",
            "content": "Classify this news article into a category: `Apple announces new iPhone with revolutionary AI features`",
        }
    ],
    response_model=NewsCategory,
)

print(news_category)


category='Sci/Tech'
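
The `Literal` type is what restricts the answer to the four categories. As a rough illustration of the validation step, the sketch below checks two hand-written JSON strings against the same model definition using Pydantic directly, without calling the LLM. It assumes Pydantic v2 (`model_validate_json`); the JSON strings are made up.

```python
from typing import Literal

from pydantic import BaseModel, ValidationError


class NewsCategory(BaseModel):
    category: Literal["World", "Sports", "Business", "Sci/Tech"]


# A reply that matches the schema is parsed into a typed object.
valid = NewsCategory.model_validate_json('{"category": "Sports"}')
print(valid.category)

# A category outside the Literal is rejected instead of silently accepted.
try:
    NewsCategory.model_validate_json('{"category": "Politics"}')
    rejected = False
except ValidationError:
    rejected = True
print(rejected)
```

The JSON schema derived from the model can be inspected with `NewsCategory.model_json_schema()`.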


### Named entity recognition

Let's try named entity recognition.

In [6]:
from typing import List

class Entity(BaseModel):
    text: str
    type: Literal["PERSON", "ORGANIZATION", "LOCATION", "DATE", "OTHER"]
    
class NamedEntities(BaseModel):
    entities: List[Entity]

entities = create(
    messages=[
        {
            "role": "user", 
            "content": "Extract named entities from this text: 'John Smith visited Microsoft headquarters in Seattle last Tuesday'"
        }
    ],
    response_model=NamedEntities,
)

print(entities)


entities=[Entity(text='John Smith', type='PERSON'), Entity(text='Microsoft', type='ORGANIZATION'), Entity(text='headquarters', type='LOCATION'), Entity(text='Seattle', type='LOCATION'), Entity(text='last Tuesday', type='DATE')]
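
Extracted entities are worth sanity-checking, because the model can return spans that are debatable (such as `headquarters` as a location above) or that do not appear in the input at all. Below is a minimal sketch of such a check, using plain dictionaries with the same `text` and `type` fields as the `Entity` model. The helper name and the deliberately wrong third entity are hypothetical.

```python
def find_unsupported_entities(text, entities):
    """Return the entities whose text does not occur verbatim in the input."""
    return [entity for entity in entities if entity["text"] not in text]


text = "John Smith visited Microsoft headquarters in Seattle last Tuesday"
extracted = [
    {"text": "John Smith", "type": "PERSON"},
    {"text": "Microsoft", "type": "ORGANIZATION"},
    {"text": "Redmond", "type": "LOCATION"},  # made-up error for illustration
]

unsupported = find_unsupported_entities(text, extracted)
print(unsupported)  # [{'text': 'Redmond', 'type': 'LOCATION'}]
```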


### Aspect-based sentiment analysis

Let's try aspect-based sentiment analysis.

In [7]:
from typing import Literal
from pydantic import BaseModel

class Sentiment(BaseModel):
    aspect: str
    polarity: Literal["positive", "neutral", "negative"]


sentiment = create(
    messages=[
        {
            "role": "user",
            "content": "Analyze the following review with aspect-based sentiment analysis: `The cheeseburger was delicious`",
        }
    ],
    response_model=Sentiment,
)

print(sentiment)

aspect='cheeseburger' polarity='positive'
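
A review often covers more than one aspect, while the model above holds a single aspect and polarity. One possible extension, sketched here under the assumption of Pydantic v2, wraps the same `Sentiment` model in a list. The `Sentiments` class and the example JSON are hypothetical and were not produced by the LLM.

```python
from typing import List, Literal

from pydantic import BaseModel


class Sentiment(BaseModel):
    aspect: str
    polarity: Literal["positive", "neutral", "negative"]


class Sentiments(BaseModel):
    sentiments: List[Sentiment]


# Hand-written JSON in the shape the wrapper model expects.
raw = """
{"sentiments": [
    {"aspect": "cheeseburger", "polarity": "positive"},
    {"aspect": "service", "polarity": "negative"}
]}
"""
result = Sentiments.model_validate_json(raw)
for item in result.sentiments:
    print(item.aspect, item.polarity)
```

Such a wrapper could then be passed as `response_model=Sentiments` to `create`.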


### Free form tasks

We can ask the LLM to perform any extraction, transformation or other reasoning task. Let's try a prompt to comprehensively analyze a news article.

In [8]:

from pydantic import Field

class NewsAnalysis(BaseModel):
    language: str = Field(description="Language of the article as a two letter code")
    topic: Literal["World", "Sports", "Business", "Sci/Tech"]
    topic_detail: str = Field(description="Detailed categorization of the topic")
    summary: str = Field(description="One sentence summary of the article")
    entities: List[Entity]

article = """
In the three state elections in eastern Germany in September, the far-right Alternative for Germany (AfD) won more votes than ever before, even though the party adopted particularly extreme positions in those states and has been classified as right-wing extremist by the Office for the Protection of the Constitution, Germany's domestic intelligence agency.
However, this does not bother the party's supporters in the least. According to the latest ARD Deutschlandtrend survey, 84% of AfD voters agree with the statement: "I don't care that the AfD has been labeled partly right-wing extremist, as long as it addresses the right issues."
For this survey, the pollsters from infratest-dimap questioned a representative sample of 1321 Germans eligible to vote from October 7 to 9.
Overall, two-thirds of respondents said they believe that a strong AfD is a danger to democracy and the rule of law in Germany. Many politicians also share this view. However, opinions begin to diverge when it comes to the question of how best to combat the AfD.
"""
# Excerpt from: https://www.dw.com/en/germans-divided-over-far-right-afd-ban/a-70465031

analysis = create(
    messages=[
        {
            "role": "user",
            "content": f"Analyze the following news article: {article}",
        }
    ],
    response_model=NewsAnalysis,
)

print(analysis)


language='de' topic='World' topic_detail='German Politics' summary="The far-right Alternative for Germany (AfD) has won more votes than ever before in three state elections in eastern Germany in September, despite being classified as right-wing extremist by Germany's domestic intelligence agency." entities=[Entity(text='Alternative for Germany', type='ORGANIZATION'), Entity(text='eastern Germany', type='LOCATION'), Entity(text='September', type='DATE'), Entity(text='right-wing extremist', type='OTHER'), Entity(text='Office for the Protection of the Constitution', type='ORGANIZATION'), Entity(text='ARD Deutschlandtrend', type='ORGANIZATION'), Entity(text='infratest-dimap', type='ORGANIZATION'), Entity(text='1321 Germans', type='OTHER'), Entity(text='AfD voters', type='OTHER'), Entity(text='right issues', type='OTHER'), Entity(text='right-wing extremist', type='OTHER'), Entity(text='strong AfD', type='OTHER'), Entity(text='democracy', type='OTHER'), Entity(text='rule of law', type='OTHER

With the multi-task approach we can bundle a whole NLP pipeline into a single prompt. However, it makes evaluation and debugging more difficult.
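
One way to keep a multi-task prompt testable is to score each field separately against a hand-labeled reference. Below is a minimal sketch with plain dictionaries; the function name, the field list and the reference record are assumptions for illustration.

```python
def score_fields(predicted, expected, fields):
    """Return a dict mapping each field name to True if the prediction matches."""
    return {field: predicted.get(field) == expected.get(field) for field in fields}


# The predicted values are copied from the output above;
# the expected record is a hypothetical hand-made label.
predicted = {"language": "de", "topic": "World", "topic_detail": "German Politics"}
expected = {"language": "en", "topic": "World", "topic_detail": "German Politics"}

scores = score_fields(predicted, expected, ["language", "topic", "topic_detail"])
print(scores)  # {'language': False, 'topic': True, 'topic_detail': True}
```

Scoring per field shows which sub-task fails, here the language code, instead of a single pass or fail for the whole record.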

## Few-shot learning

We can add examples to the prompt to improve the model's performance. Let's go back to the text classification example and create a system prompt with examples.

In [9]:
categories = ["World", "Sports", "Business", "Sci/Tech"]

examples = [
    {
        "text": "UN Security Council passes resolution on global peace initiative",
        "category": "World"
    },
    {
        "text": "Major tech company reports record quarterly earnings",
        "category": "Business"
    }, 
    {
        "text": "Scientists develop breakthrough quantum computing technology",
        "category": "Sci/Tech"
    },
    {
        "text": "Local team wins national championship in dramatic final",
        "category": "Sports"
    }
]

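# Join the examples first: a backslash inside an f-string expression
# is a syntax error before Python 3.12.
examples_text = "\n".join(str(example) for example in examples)

system_prompt = f"""
You are a text classifier. You are given a news article and you need to classify it into one of the following categories:
{", ".join(categories)}

Examples:
{examples_text}
"""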

print(system_prompt)


You are a text classifier. You are given a news article and you need to classify it into one of the following categories:
World, Sports, Business, Sci/Tech

Examples:
{'text': 'UN Security Council passes resolution on global peace initiative', 'category': 'World'}
{'text': 'Major tech company reports record quarterly earnings', 'category': 'Business'}
{'text': 'Scientists develop breakthrough quantum computing technology', 'category': 'Sci/Tech'}
{'text': 'Local team wins national championship in dramatic final', 'category': 'Sports'}
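
An alternative to listing the examples in the system prompt is to present them as earlier conversation turns, with each example text as a user message and its label as the assistant's reply. This is a sketch of that variation, not the approach used in this notebook; the helper name and the JSON-formatted assistant replies are assumptions.

```python
import json


def build_few_shot_messages(instruction, examples, text):
    """Build a chat message list with examples as user/assistant turns."""
    messages = [{"role": "system", "content": instruction}]
    for example in examples:
        messages.append({"role": "user", "content": example["text"]})
        # The assistant turn mirrors the JSON shape we expect back.
        messages.append(
            {"role": "assistant", "content": json.dumps({"category": example["category"]})}
        )
    messages.append({"role": "user", "content": text})
    return messages


examples = [
    {"text": "Local team wins national championship in dramatic final", "category": "Sports"},
    {"text": "Major tech company reports record quarterly earnings", "category": "Business"},
]
messages = build_few_shot_messages(
    "Classify the news article as World, Sports, Business or Sci/Tech.",
    examples,
    "Scientists develop breakthrough quantum computing technology",
)
print(len(messages))  # 6
```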



Now we can use the system prompt to classify another news article.

In [10]:
news_category = create(
    messages=[
        {
            "role": "system",
            "content": system_prompt,
        },
        {
            "role": "user",
            "content": "Breakthrough AI technology for market research revealed at GOR 25",
        }
    ],
    response_model=NewsCategory,
)

print(news_category)

category='Business'


## Evaluation

We can evaluate the performance of the model by comparing the predicted category to the true category. Here, we can run the experiment with the AG News dataset. The dataset contains 127,600 news articles classified into 4 categories: World, Sports, Business, and Science/Technology. Let's see how well the model performs on this dataset. It's available on Hugging Face: <https://huggingface.co/datasets/fancyzhx/ag_news>. We download it as a polars DataFrame. Polars is similar to pandas, but faster.


In [11]:
import polars as pl

splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}
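# Note: this loads the 'train' split, even though the variable is named
# agnews_test. Nothing is trained in this notebook, so the split only
# determines which articles are evaluated; use splits['test'] to evaluate
# on the test split instead.
agnews_test = pl.read_parquet('hf://datasets/fancyzhx/ag_news/' + splits['train'])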


Text labels are easier to work with than a category ID number, because they are more descriptive and help the LLM understand the task. We can map the category ID to a text label.

In [12]:
label_map = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}
agnews_test = agnews_test.with_columns(pl.col('label').replace_strict(label_map).alias('category'))

print(agnews_test.head())


shape: (5, 3)
┌─────────────────────────────────┬───────┬──────────┐
│ text                            ┆ label ┆ category │
│ ---                             ┆ ---   ┆ ---      │
│ str                             ┆ i64   ┆ str      │
╞═════════════════════════════════╪═══════╪══════════╡
│ Wall St. Bears Claw Back Into … ┆ 2     ┆ Business │
│ Carlyle Looks Toward Commercia… ┆ 2     ┆ Business │
│ Oil and Economy Cloud Stocks' … ┆ 2     ┆ Business │
│ Iraq Halts Oil Exports from Ma… ┆ 2     ┆ Business │
│ Oil prices soar to all-time re… ┆ 2     ┆ Business │
└─────────────────────────────────┴───────┴──────────┘


Let's run the model on a random sample of 100 articles. Running it on the entire dataset would take a long time in a Colab notebook; on a powerful GPU it would finish much faster.

In [13]:
agnews_test_sample = agnews_test.sample(n=100, seed=42)

In [14]:
from tqdm import tqdm

def get_response(text):
    return create(
        messages=[
            {
                "role": "system",
                "content": system_prompt
            },
            {"role": "user", "content": text}
        ],
        response_model=NewsCategory,
    )

responses = [
    get_response(text) for text in tqdm(agnews_test_sample['text'])
]


100%|██████████| 100/100 [00:58<00:00,  1.71it/s]


In [15]:
predicted_labels = [response.category for response in responses]
expected_labels = agnews_test_sample['category'].to_list()

accuracy = sum(
    pred == exp
    for pred, exp in zip(predicted_labels, expected_labels)
) / len(expected_labels)

print(f"Accuracy: {accuracy:.2f}")

Accuracy: 0.67


An accuracy of 0.67 is a start, but it is not satisfactory. This highlights the importance of evaluation: if we had only looked at a few examples, we might have wrongly concluded that the model is always correct.
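
A sample of 100 articles also leaves the estimate itself uncertain. As a rough guide, a normal-approximation 95% confidence interval for the measured accuracy can be computed as follows. The function name is ours, and the approximation is only a sketch that assumes independent, randomly sampled articles.

```python
import math


def accuracy_interval(accuracy, n, z=1.96):
    """Normal-approximation confidence interval for an accuracy estimate."""
    margin = z * math.sqrt(accuracy * (1 - accuracy) / n)
    return max(0.0, accuracy - margin), min(1.0, accuracy + margin)


low, high = accuracy_interval(0.67, 100)
print(f"95% CI: {low:.2f} to {high:.2f}")  # 95% CI: 0.58 to 0.76
```

With an interval this wide, small differences between two prompt variants on the same 100 articles should be interpreted with care.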

There are many ways to improve the model's performance:

- Describe the different labels in more detail
- Add more examples
- Use prompting techniques, such as Chain of Thought
- Use a higher-precision model (less quantization)
- Use a larger model
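
As a sketch of the first suggestion, the label descriptions can be kept in a dictionary and rendered into the system prompt. The descriptions below are our own wording, not taken from the AG News documentation, and whether they improve accuracy would need to be measured with the same evaluation as above.

```python
# Hypothetical label descriptions, written for illustration only.
label_descriptions = {
    "World": "international politics, conflicts and diplomacy",
    "Sports": "matches, athletes, teams and tournaments",
    "Business": "companies, markets, earnings and the economy",
    "Sci/Tech": "science, software, hardware and the internet",
}

# Render one bullet line per label.
description_lines = "\n".join(
    f"- {label}: {description}" for label, description in label_descriptions.items()
)

detailed_system_prompt = (
    "You are a text classifier. Classify the news article into exactly one "
    "of the following categories:\n"
    f"{description_lines}"
)

print(detailed_system_prompt)
```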