# Exercise 2 - Generating JSON Structured Output

In this exercise we'll extend our zero-shot classifier into something a bit more useful for everyday software. We'll use our LLM to process the complaint text into JSON containing the order information. Specifically, our JSON will contain the following information:

- User's First Name.
- User's Last Name.
- The department the complaint is about.
- A properly formatted order number.

It's not hard to imagine such a tool being used in a real-world application where we want to automatically parse out complaint information and dispatch it to the correct department. As you'll see, even a relatively weak LLM such as Qwen2-0.5B does a surprisingly good job of extracting information for us.

This exercise is a bit more involved than the last one. You'll see that most of the work involves defining Pydantic classes to describe the desired JSON output. We can then simply use `outlines.generate.json` to ensure we get structured outputs exactly as we want them.

## Loading the model

As before we need to import the necessary libraries and initialize our model and tokenizer.

**Reminder:** If you are not running on an Apple Silicon device, change the `device` value.

In [1]:
import json
import outlines
import torch
from transformers import AutoTokenizer
from textwrap import dedent

In [2]:
model_name = "Qwen/Qwen2-0.5B-Instruct"
model = outlines.models.transformers(
    model_name,
    device='mps',
    model_kwargs={
        'torch_dtype': torch.bfloat16,
        'trust_remote_code': True
    })
tokenizer = AutoTokenizer.from_pretrained(model_name)
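If you would rather not hard-code the `device` value, it can be chosen at runtime. The snippet below is a minimal sketch, not part of the original exercise: the helper name `pick_device` and the preference order (MPS, then CUDA, then CPU) are assumptions, and the availability checks fall back to CPU if `torch` is missing or too old to expose them.

```python
def pick_device(has_mps: bool, has_cuda: bool) -> str:
    """Pick a device string, preferring Apple Silicon, then CUDA, then CPU."""
    if has_mps:
        return "mps"
    if has_cuda:
        return "cuda"
    return "cpu"


try:
    import torch

    # Both checks return a plain bool describing the current machine.
    has_mps = torch.backends.mps.is_available()
    has_cuda = torch.cuda.is_available()
except (ImportError, AttributeError):
    # No torch, or a torch build without these checks: assume CPU only.
    has_mps = has_cuda = False

device = pick_device(has_mps, has_cuda)
print(device)
```

The resulting string could then be passed as the `device` argument when loading the model.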

And of course load the example data as well.

In [3]:
with open("../examples.json", 'r') as fin:
    complaint_data = json.load(fin)

## Main Exercise: Defining our Pydantic Models

If you aren't familiar with [Pydantic](https://docs.pydantic.dev/latest/), it is a library that uses Python classes to define what are essentially complex types that can then be easily converted into JSON or a standard Python dictionary. Pydantic classes are a great way to easily describe complex structures to use with Outlines.
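To make the dictionary and JSON conversion concrete, here is a small sketch that assumes Pydantic v2 is installed. The `Customer` model and its fields are hypothetical and unrelated to the exercise; they only illustrate how a class definition doubles as a validated type.

```python
from pydantic import BaseModel, ValidationError


class Customer(BaseModel):
    # Hypothetical model used only for illustration.
    name: str
    loyalty_points: int


customer = Customer(name="Jane", loyalty_points=42)

# Convert the validated object to a plain dictionary and to a JSON string.
as_dict = customer.model_dump()
as_json = customer.model_dump_json()
print(as_dict)
print(as_json)

# Values that do not fit the declared types are rejected at construction time.
try:
    Customer(name="Jane", loyalty_points="lots")
except ValidationError as err:
    print(f"rejected with {err.error_count()} validation error(s)")
```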

The bulk of the work in this exercise involves filling out these classes. Where necessary, a few notes have been added to help you get started.

Here is some useful information to help you:

The `ComplaintData` should consist of:

- `first_name`: a string representing the first name.

- `last_name`: a string representing the last name.

- `order_number`: an ID that starts with 'A', 'D', or 'Z', followed by 2 digits, a '-', and then 4 more digits.

- `department`: can be any of the following: "clothing", "electronics", "kitchen" or "automotive".
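The `order_number` format described above can be tried out as a plain regular expression before it is attached to a model. This is a minimal sketch using only the standard library; the helper name `is_valid_order_number` and the sample IDs are illustrative, and `re.fullmatch` is used on the assumption that the whole string must match the pattern.

```python
import re

# One of A, D or Z, then two digits, a dash, and four more digits.
ORDER_NUMBER_PATTERN = re.compile(r"[ADZ][0-9]{2}-[0-9]{4}")


def is_valid_order_number(candidate: str) -> bool:
    """Return True when the entire string is a well-formed order number."""
    return ORDER_NUMBER_PATTERN.fullmatch(candidate) is not None


# Try a well-formed ID and a few malformed ones.
for sample in ["D12-3456", "B12-3456", "A1-2345", "Z99-00001"]:
    print(sample, is_valid_order_number(sample))
```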



In [4]:
from pydantic import BaseModel, Field
from enum import Enum


class Department(str, Enum):
    clothing = "clothing"
    electronics = "electronics"
    kitchen = "kitchen"
    automotive = "automotive"

class ComplaintData(BaseModel):
    first_name: str
    last_name: str
    order_number: str = Field(pattern=r'[ADZ][0-9]{2}-[0-9]{4}')
    department: Department
    
complaint_processor = outlines.generate.json(model, ComplaintData)

## Prompt

Similar to the last exercise, the prompt is a series of messages. Note that we do specify how the output should look.

In [5]:
def create_prompt(complaint):
    complaint_messages = [
        {
        'role': 'user',
        'content': f"""
        You are a complaint processing assistant. Your aim is to process complaints and return the following information in this JSON format:
        {{
            'first_name': <first name>,
            'last_name': <last name>,
            'order_number': <order number has the following format (A|D|Z)XX-XXXX>,
            'department': <{"|".join([e.value for e in Department])}>,
        }}
        """},
        {'role': 'assistant',
         'content': "I understand and will process the complaints in the JSON format you described"
        },
        {'role': 'user',
        'content': complaint['message']
        }
    ]
    complaint_prompt = tokenizer.apply_chat_template(complaint_messages, tokenize=False)
    return complaint_prompt
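The format description inside the prompt relies on two small Python details: doubled braces in an f-string render as literal braces, and the department options are built by joining the enum values. The sketch below isolates just that piece using only the standard library; the variable names `department_options` and `schema_hint` are illustrative.

```python
from enum import Enum


class Department(str, Enum):
    clothing = "clothing"
    electronics = "electronics"
    kitchen = "kitchen"
    automotive = "automotive"


# Enum members iterate in definition order, so the options appear as declared.
department_options = "|".join(e.value for e in Department)

# Doubled braces produce literal braces; single braces interpolate a value.
schema_hint = f"{{'department': <{department_options}>}}"
print(schema_hint)
```

Building the list from the enum keeps the prompt in step with the model definition if a department is added later.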

## Exercise - Create the complaint processor

The final bit of this exercise is to use our Pydantic model to create a `complaint_processor` that automatically extracts this information from a complaint. This can be done using `outlines.generate.json`.

In [6]:
complaint_processor = outlines.generate.json(model, ComplaintData)

Finally, we can test it out and see how our processor does!

In [7]:
results = []
for complaint in complaint_data[0:10]:
    prompt = create_prompt(complaint)
    result = complaint_processor(prompt)
    results.append(result)


In [8]:
idx = 3
complaint_data[idx]['message']

'Hi, my name is Jane Doeandcommn.I recently ordered a stylish black pair of headphones from your store, and unfortunately, they failed to work on the first try. Despite troubleshooting instructions in the manual, there seems to be no response from the device.The order number is D12-3456'

In [9]:
results[idx].model_dump_json()

'{"first_name":"Jane","last_name":"Doe","order_number":"D12-3456","department":"electronics"}'
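Because the output is ordinary JSON, downstream code can consume it with the standard library alone. The following sketch shows one way the dispatch step mentioned in the introduction might look; the `DEPARTMENT_QUEUES` mapping, its queue names, and the `route_complaint` helper are all hypothetical.

```python
import json

# The JSON string shown above, copied verbatim.
raw = '{"first_name":"Jane","last_name":"Doe","order_number":"D12-3456","department":"electronics"}'

# Hypothetical mapping from department to a support queue name.
DEPARTMENT_QUEUES = {
    "clothing": "clothing-support",
    "electronics": "electronics-support",
    "kitchen": "kitchen-support",
    "automotive": "automotive-support",
}


def route_complaint(raw_json: str) -> tuple[str, dict]:
    """Parse a processed complaint and look up the queue for its department."""
    record = json.loads(raw_json)
    queue = DEPARTMENT_QUEUES[record["department"]]
    return queue, record


queue, record = route_complaint(raw)
print(f"Order {record['order_number']} for {record['first_name']} {record['last_name']} -> {queue}")
```

Since the department is constrained to the four enum values, the dictionary lookup here should not encounter an unknown key for output produced by the processor.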