# Exercise 3: Generating Complaints and the Structured Generation Workflow

For this exercise we're going to work on a much more involved example. When creating this workshop I didn't want to have to hand write 50 complaints, so I decided to let an LLM do that for me! Typically we would think of a complaint as *unstructured* but, as you'll see in this exercise, there's almost always an advantage to using structured generation.

The code for this exercise is a fair bit more involved than the last two, but don't worry, you only need to work on a small part of this project.

We're also going to learn about the [Structured Generation Workflow](https://blog.dottxt.co/coding-for-structured-generation.html), which makes it easier to iteratively develop structured applications using LLMs.

In [1]:
import json
import outlines
from transformers import AutoTokenizer
import torch
from textwrap import dedent
from enum import Enum
import re
import random

**Note:** change the `DEVICE` if you are using a non-Apple Silicon device.

In [2]:
MODEL_NAME = "Qwen/Qwen2-0.5B-Instruct"
# Change to 'cuda' or 'cpu' if not using Apple Silicon
DEVICE='mps'
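If you'd rather not edit `DEVICE` by hand, here is a minimal sketch of picking it automatically. The `pick_device` helper is hypothetical (it is not part of the workshop code), and the preference order of `mps`, then `cuda`, then `cpu` is just an assumption that mirrors the default above.

```python
def pick_device(mps_available: bool, cuda_available: bool) -> str:
    """Hypothetical helper: choose a device name from availability flags."""
    if mps_available:
        return "mps"   # Apple Silicon
    if cuda_available:
        return "cuda"  # NVIDIA GPU
    return "cpu"       # fallback that works everywhere


# In the notebook, where torch is already imported, this could be used as:
# DEVICE = pick_device(torch.backends.mps.is_available(),
#                      torch.cuda.is_available())
print(pick_device(mps_available=False, cuda_available=False))  # cpu
```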

For consistency, we'll be using the Enum from the last exercise.

In [3]:
class Department(str, Enum):
    clothing = "clothing"
    electronics = "electronics"
    kitchen = "kitchen"
    automotive = "automotive"

DEFAULT_DEPTS = [dept.name for dept in list(Department)]
DEFAULT_DEPTS

['clothing', 'electronics', 'kitchen', 'automotive']

## Step 1 - Draft Structure

Our `ComplaintGenerator` builds complaints by breaking the complaint down into 3 steps with accompanying methods.

- `intro_structure` contains the name of the person
- `complaint_structure` contains the body of the complaint
- `order_number_structure` gives the order number in several different ways.

For this exercise we'll focus on the **intro_structure** and fill out the rest as time permits.


All structured generation tasks, just like normal machine learning tasks, should start with some *examples of real data*. Let's take a look at a few complaints (pretending for this exercise that these aren't generated and are real):

In [4]:
with open("../examples.json",'r') as fin:
    complaint_data = json.loads(fin.read())

Let's look at some example intros:

In [5]:
example_intros = [complaint['message'][0:10] for complaint in complaint_data]
set(example_intros)

{"Hello, I'm", 'Hi! This i'}

As we can see there are only two intros (in real life we would of course expect many more), but this helps us start to imagine the patterns we would like to generate.

We'll use this to draft a version of the `intro_structure` method as our first step for structured generation.
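Before looking at the full class, here is a small sketch of the basic idea: take the greetings we observed and join them into one regular expression with `|`. The `greetings` list and sample strings are illustrative only and deliberately leave out the name part of the pattern.

```python
import re

# Greeting fragments based on the prefixes observed in the example data.
greetings = [r"Hi! This is", r"Hello, I'm"]

# Join the alternatives into a single group: (Hi! This is|Hello, I'm)
greeting_pattern = rf"({'|'.join(greetings)})"

# Illustrative samples (not taken from the dataset).
samples = ["Hi! This is Emily", "Hello, I'm Frank", "Good morning"]

# re.match only succeeds if the pattern matches at the start of the string.
matches = [bool(re.match(greeting_pattern, s)) for s in samples]
print(matches)  # [True, True, False]
```

The same `'|'.join(...)` idiom scales to as many variations as you find in your data.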

In [6]:
from copy import deepcopy
class ComplaintGenerator:

    def __init__(self, model_name, departments=DEFAULT_DEPTS):
        self.model_name = model_name
        self.departments = departments
        self._model = None
        self._tokenizer = None
        self._intro_generator = None
        self._complaint_generator = None
        self._order_number_generator = None
    ####################################
    # Structured Generation Section
    #
    @property
    def intro_structure(self):
        # TODO - find and fix the bug
        possible_intros = [
            r'(Hi! This is [A-Z][a-z]{3,10} [A-z][a-z]{3,10})',
            r'(Hey, my name is [A-z][a-z]{3,10} [A-Z][a-z]{3,10})',
            r'(Hello, I\'m [A-Z][a-z]{3,10} [A-z][a-z]{3,10})'
        ]
        return rf"({'|'.join(possible_intros)})\."        

    @property
    def complaint_structure(self):
        # TODO - implement the complaint structure
        return r''

    @property
    def order_number_structure(self):
        # TODO - implement a few variations on order number
        possible_order_numbers = [
             r'',
             r'',
             r''
         ]
        return rf"({'|'.join(possible_order_numbers)})"
    #
    #
    ####################################

    
    @property
    def intro_generator(self):
        if self._intro_generator is None:
            self._intro_generator = outlines.generate.regex(
                self.model, self.intro_structure
            )
        return self._intro_generator
        
    @property
    def complaint_generator(self):
        if self._complaint_generator is None:
            self._complaint_generator = outlines.generate.regex(self.model, self.complaint_structure)
        return self._complaint_generator

    @property
    def order_number_generator(self):
        if self._order_number_generator is None:
            self._order_number_generator = outlines.generate.regex(
                self.model, 
                self.order_number_structure)
        return self._order_number_generator
    
    @property
    def model(self):
        print("getting model")
        if self._model is None:
            print("loading model")
            self._model = outlines.models.transformers(
                    self.model_name,
                    device=DEVICE,
                    model_kwargs={
                        'torch_dtype': torch.bfloat16,
                        'trust_remote_code': True
                    })
        return self._model

    @property
    def tokenizer(self):
        if self._tokenizer is None:
            print("loading tokenizer")
            self._tokenizer = AutoTokenizer.from_pretrained(self.model_name)
        return self._tokenizer
        
    def generate_complaint(self):
        prompt_messages = self._start_messages()
        prompt_messages.append(self._intro_prompt())
        prompt_intro = self.tokenizer.apply_chat_template(
            prompt_messages,
            tokenize=False
        )
        print("Generating intro")
        intro_result = self.intro_generator(prompt_intro)
        prompt_messages.append({
            "role": "assistant",
            "content": intro_result
        })
        print("Generating complaint")
        department = random.choice(self.departments)
        prompt_messages.append(self._complaint_prompt(department))
        prompt_complaint = self.tokenizer.apply_chat_template(
            prompt_messages,
            tokenize=False
        )
        complaint_result = self.complaint_generator(prompt_complaint)
        prompt_messages.append({
            "role": "assistant",
            "content": complaint_result
        })
        prompt_messages.append(self._order_number_prompt())
        prompt_order_number = self.tokenizer.apply_chat_template(
            prompt_messages,
            tokenize=False
        )
        print("Generating order number")
        order_number_result = self.order_number_generator(prompt_order_number)

        final_message = intro_result + complaint_result + order_number_result
        return {
            "message": final_message,
            "order_number": self.parse_order_number(order_number_result),
            "department": department
        }
    
    def parse_order_number(self, message):
        """
        We want to extract the order number so that we can 
        send it back with the response to use for validation later.
        """
        number_only = r'((A|D|Z)[0-9]{6})|((A|D|Z)[0-9]{2}-[0-9]{4})'
        order_number = re.search(number_only, message)[0]
        if not ("-" in order_number):
            order_number = f"{order_number[0:3]}-{order_number[3:]}"
        return order_number
        
    def _start_messages(self):
        """
        These are the starting prompt messages, since we'll be
        appending to these messages, we'd like to return a 
        copy of them.
        """
        prompt_messages = [{
            "role": "user",
            "content": dedent("""
            You are an agent designed to create simulated customer complaints. The
            complaints are essentially short text messages that describe a customer,
            their problem, and provide an order number.
        
            You will build the complaint in parts based on the user request. The
            complaint will be about a product from a specified department, but you
            will not mention the department name directly.
        
            For example, if you are asked about something from the 'kitchen' department 
            you might mention a 'knife' but you won't mention the department.
            """)
        },{ 
            "role": "assistant",
            "content": dedent("""
            I understand the task, and will wait for you to instruct me on
            next steps.
            """)
        }]
        return deepcopy(prompt_messages)

    def _intro_prompt(self):
        intro_prompt = {
            "role": "user",
            "content": "Start the message with a short intro stating the customer's name."
        }
        return deepcopy(intro_prompt)

    def _complaint_prompt(self, department):
        complaint_message = {
            "role":"user", 
            "content": dedent(f"""
                            Good! Now write a short description of the problem with an item from the {department} department,
                            but don't mention the actual name of the department the product comes from!
                            """)
        }
        return deepcopy(complaint_message)

    def _order_number_prompt(self):
        order_number_message = {
            "role": "user",
            "content": dedent("""
            Finally, add a statement about the order number which starts with letter 'A', 'D' or 'Z' and consists of 6 digits after.
            """)
        }
        return deepcopy(order_number_message)
        
    

    
    

In [7]:
complainer = ComplaintGenerator(MODEL_NAME)
# complainer.generate_complaint()
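One thing worth noting about `parse_order_number`: `re.search` returns `None` when nothing matches, so indexing the result with `[0]` would raise a `TypeError`. Below is a sketch of a standalone variant that returns `None` in that case. The name `normalize_order_number` is hypothetical, and I'm assuming the same two order number shapes used in the class (six digits, or two digits, a hyphen, and four digits).

```python
import re
from typing import Optional

# A, D or Z, then two digits, an optional hyphen, and four more digits.
ORDER_NUMBER = re.compile(r"[ADZ][0-9]{2}-?[0-9]{4}")


def normalize_order_number(message: str) -> Optional[str]:
    """Return the first order number as 'X12-3456', or None if absent."""
    match = ORDER_NUMBER.search(message)
    if match is None:
        return None
    # Drop any existing hyphen, then re-insert it after the third character.
    raw = match.group(0).replace("-", "")
    return f"{raw[:3]}-{raw[3:]}"


print(normalize_order_number("This is order A123456"))         # A12-3456
print(normalize_order_number("The order number is A20-0134"))  # A20-0134
print(normalize_order_number("I lost my receipt"))             # None
```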

## Step 2 - Verify Structure 

We can now test that this structure matches the real data we have. To start, we're only going to test the `intro_structure` property by verifying that it matches *all* of the examples in our dataset.

In [8]:
all([re.search(complainer.intro_structure, complaint['message'])
     for complaint in complaint_data])

True

This is a great place to catch bugs in structured generation. Since all cases matched, we can at least be sure that our structure covers every example we've seen so far.
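A single `True` or `False` doesn't tell us *which* examples fail, so when the check doesn't pass it can help to list the offenders. Here is a minimal sketch of that idea. The `find_unmatched` helper, the toy pattern, and the toy messages are all hypothetical, and I'm using `re.match` (anchored at the start of the string) rather than `re.search`, since an intro should appear at the beginning of a message.

```python
import re


def find_unmatched(pattern, messages):
    """Return the messages that the pattern fails to match at their start."""
    compiled = re.compile(pattern)
    return [m for m in messages if compiled.match(m) is None]


# Toy pattern and messages, for illustration only.
toy_pattern = r"(Hi! This is|Hello, I'm) "
toy_messages = [
    "Hi! This is Emily Clark. My blender broke.",
    "Hello, I'm Frank Swift. The jacket arrived torn.",
    "Greetings, Sam here. My order never came.",
]

unmatched = find_unmatched(toy_pattern, toy_messages)
print(unmatched)  # ['Greetings, Sam here. My order never came.']
```

An empty list means every example is covered; anything left over is a pattern you haven't drafted yet.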

## Step 3 - Generate Structure

The next step is to generate examples of our structure to further test whether or not we're really solving the problem we're after. Rather than run the model right now, we'll use an example I generated earlier:

In [9]:
# Normally we would do the following...
# example_generation = complainer.generate_complaint()
example_generation = {
 'message': 'Hi! This is Emily andbuyerser.I recently ordered a laptop with an extended warranty, but upon arrival, I noticed a malfunctioning trackpad. Despite numerous attempts at troubleshooting, the issue persists, greatly hindering my everyday use.This is order A12-3456',
 'order_number': 'A12-3456',
 'department': 'electronics'
}

## Step 4 - Inspect Output

Uh oh! Look at the name output! `Emily andbuyerser` is not a name that I would expect and doesn't match the expected output!

Now it's *your turn* to fix it!

When you've found the bug you can continue on to the next sections:

- Finish the `complaint_structure`, repeating this process
- Finish the `order_number_structure`, repeating this process
- If you have time, generate some new complaints!

To help you get started, we can see that the current structure *does* match the unexpected example output. A good sign that you have fixed the problem is that this erroneous response will no longer match the structure defined in `intro_structure`.

In [10]:
re.search(complainer.intro_structure, example_generation['message'])

<re.Match object; span=(0, 30), match='Hi! This is Emily andbuyerser.'>
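One way to make this check repeatable is to keep bad generations around as regression cases and confirm that a revised structure rejects them. The sketch below uses a toy order number structure rather than the intro, so it won't spoil the exercise. The `still_matches` helper and both patterns are hypothetical, and I'm assuming `re.fullmatch` is appropriate because each toy string is meant to be a complete value.

```python
import re


def still_matches(pattern, bad_outputs):
    """Return the known-bad outputs that the pattern would still accept."""
    return [text for text in bad_outputs if re.fullmatch(pattern, text)]


# Toy structures: the loose one allows any number of digits,
# the strict one requires exactly six.
loose = r"[ADZ][0-9]+"
strict = r"[ADZ][0-9]{6}"

# A generation we decided was wrong (too few digits).
bad_outputs = ["A12"]

print(still_matches(loose, bad_outputs))   # ['A12']
print(still_matches(strict, bad_outputs))  # []
```

When the returned list is empty, the revised structure no longer allows any of the outputs you've flagged as wrong.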

If you get the rest figured out, feel free to generate some examples of your own!

In [11]:
complaints = [complainer.generate_complaint() for _ in range(50)]
complaints

loading tokenizer
Generating intro
getting model
loading model


Compiling FSM index for all state transitions: 100%|█| 71/71 [00:02<00:00, 32.38it/s
We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)


Generating complaint
getting model
Generating order number
getting model
Generating intro
Generating complaint
Generating order number
Generating intro
Generating complaint
Generating order number
Generating intro
Generating complaint
Generating order number
Generating intro
Generating complaint
Generating order number
Generating intro
Generating complaint
Generating order number
Generating intro
Generating complaint
Generating order number
Generating intro
Generating complaint
Generating order number
Generating intro
Generating complaint
Generating order number
Generating intro
Generating complaint
Generating order number
Generating intro
Generating complaint
Generating order number
Generating intro
Generating complaint
Generating order number
Generating intro
Generating complaint
Generating order number
Generating intro
Generating complaint
Generating order number
Generating intro
Generating complaint
Generating order number
Generating intro
Generating complaint
Generating order numb

[{'message': "Hello, I'm Timothy Smith.I recently ordered the XYZ Cookware set, and it broke on me on the first day. The pieces were only just beginning to come apart when I went to fix it, and added to that, the handle was way too stiff and could not be put back together.The order number is A20-0134",
  'order_number': 'A20-0134',
  'department': 'kitchen'},
 {'message': "Hello, I'm Frank Swift.I recently ordered a lot of the made in the kitchen items from this supplier, but\nI did not received most items. I would love your help, if you could provide me a list of all the made in the kitchen items and how many I ordered.\nThe product I ordered can be a.This is order A46-2562",
  'order_number': 'A46-2562',
  'department': 'kitchen'},
 {'message': 'Hey, my name is Sarah Jones.I recently ordered the Sizzle Bolognese Steak\nwith pickled onions, garlic mashed potatoes, sour cream, and jalapenos.\nI always love the flavor of this special sauce,\nbut for some weird reason, I can barley taste

In [12]:
#with open("../examples.json", 'w') as fout:
#    fout.write(json.dumps(complaints))