# Strict JSON v2
A strict JSON framework for LLM outputs that fixes problems json.loads() cannot solve
- Handles JSON outputs containing multiple ' or " or { or } or \ characters, or unmatched braces/brackets, any of which may break json.loads()
- Updated: 4 Jan 2024 [New: Support for OpenAI JSON Mode, Functions]
- Created: 28 Oct 2023
- Video tutorial: https://www.youtube.com/watch?v=IjTUKAciTCg
- Collaborators welcome

## How do I use this? 
1. Replace ```<YOUR API KEY HERE>``` in ```os.environ['OPENAI_API_KEY'] = '<YOUR API KEY HERE>'``` with your own OpenAI API key (https://platform.openai.com/account/api-keys)
2. Copy and paste ```strict_json``` and ```strict_function``` from Strict_JSON_v2.ipynb 
3. Use the functions as needed (Note: a Python package/library is planned for easier usage)

## How does it work?
- Extracts JSON values as strings using a regex that splits keys from values (delimiters are added around each ```key``` to make ```###key###```)
- By default, uses ```ast.literal_eval``` to convert each string to the best-matching literal (e.g. int, string, dict). Set ```literal_eval = False``` when calling ```strict_json``` to keep output fields as strings
- Checks that the LLM outputs every JSON field; if a field is missing, the error message is fed back to the LLM so it can iteratively correct its generation (default: 3 tries)
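
The three steps above can be sketched without calling OpenAI. This is a minimal illustration, not the library code itself: `parse_reply`, `fake_llm` and `generate` are hypothetical names, and `fake_llm` is a stand-in that returns an incomplete reply first and a complete one once it sees an error message.

```python
import ast
import re

DELIM = '###'

def parse_reply(reply, keys, literal_eval=True):
    # Check: every expected key must appear wrapped in the delimiter
    for key in keys:
        if f'{DELIM}{key}{DELIM}' not in reply:
            raise ValueError(f'{DELIM}{key}{DELIM} not in json output')
    # Split on the delimited keys (same pattern as used in strict_json)
    pattern = fr",*\s*['|\"]{DELIM}([^#]*){DELIM}['|\"]: "
    parts = [p for p in re.split(pattern, reply[1:-1]) if p != '']
    # Drop the surrounding quotes from quoted values
    parts = [p[1:-1] if p[0] in '\'"' else p for p in parts]
    result = dict(zip(parts[0::2], parts[1::2]))
    # Best-effort conversion of each value to a Python literal
    if literal_eval:
        for key, value in result.items():
            try:
                result[key] = ast.literal_eval(value)
            except (ValueError, SyntaxError):
                pass  # not a literal, keep the raw string
    return result

def fake_llm(system_prompt):
    # Hypothetical stand-in for the OpenAI call
    if 'Error message' in system_prompt:
        return "{'###Sentiment###': 'positive', '###Words###': 7}"
    return "{'###Sentiment###': 'positive'}"  # missing the Words key

def generate(keys, num_tries=3):
    error_msg = ''
    for _ in range(num_tries):
        reply = fake_llm('You are a classifier' + error_msg)
        try:
            return parse_reply(reply, keys)
        except ValueError as e:
            # Feed the failure back so the next attempt can correct it
            error_msg = f'\n\nResult: {reply}\n\nError message: {e}'
    return {}

result = generate(['Sentiment', 'Words'])
print(result)
```

The first attempt fails the key check, so the second attempt is made with the error message appended to the system prompt.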

# Library
- These cells are the key cells to copy/paste for StrictJSON v2 to work (update your own OpenAI API Key)
- Kindly do this until I finally make this into a package (coming soon)

In [1]:
import os
import openai
import json
import re
import ast
from openai import OpenAI

#API Keys
os.environ['OPENAI_API_KEY'] = '<YOUR API KEY HERE>'
openai.api_key = os.environ['OPENAI_API_KEY']

In [2]:
# for functions using strict_json
def strict_json(system_prompt, user_prompt, output_format, delimiter = '###',
                  model = 'gpt-3.5-turbo', temperature = 0, num_tries = 3, verbose = False, literal_eval = True, openai_json_mode = False, **kwargs):
    ''' Ensures that OpenAI will always adhere to the desired output JSON format defined in output_format. 
    Uses rule-based iterative feedback to ask GPT to self-correct.
    Keeps trying up to num_tries if it does not. Returns empty JSON if unable to after num_tries iterations.
    
    Inputs:
    - system_prompt: String. Write in whatever you want GPT to become. e.g. "You are a \<purpose in life\>"
    - user_prompt: String. The user input. Later, when we use it as a function, this is the function input
    - output_format: JSON. JSON format with the key as the output key, and the value as the output description
    - delimiter: String. This is the delimiter to surround the keys. With delimiter ###, key becomes ###key###
    - model: String. The OpenAI model to use for json generation
    - temperature: Float (default: 0) The temperature of the openai model, the higher the more variable the output (lowest = 0)
    - num_tries: Integer (default: 3) The number of tries to iteratively prompt GPT to generate correct json format
    - verbose: Boolean (default: False). Whether or not to print out the system prompt, user prompt, GPT response
    - literal_eval: Boolean (default: True). Whether or not to do ast.literal_eval for output fields
    - openai_json_mode: Boolean (default: False). Whether or not to use OpenAI JSON Mode
    - **kwargs: Dict. Additional arguments you would like to pass on to OpenAI Chat Completion API
    
    Output:
    - res: Dict. The JSON output of the model. Returns {} if unable to output correct JSON
    '''
    client = OpenAI()
    
    # If OpenAI JSON mode is selected, then just let OpenAI do the processing
    if openai_json_mode:
        # if an unsupported model is given, default to gpt-3.5-turbo-1106
        if model not in ['gpt-4-1106-preview', 'gpt-3.5-turbo-1106']:
            model = 'gpt-3.5-turbo-1106'
            
        response = client.chat.completions.create(
            temperature = temperature,
            model=model,
            response_format={"type": "json_object"},
            messages=[
                {"role": "system", "content": str(system_prompt) + "\nOutput in the following json format: "+str(output_format)},
                {"role": "user", "content": str(user_prompt)}
            ],
            **kwargs
        )
        res = response.choices[0].message.content
        try:
            loaded_json = json.loads(res)
        except Exception as e:
            loaded_json = {}
        return loaded_json
        
    # Otherwise, implement JSON parsing using Strict JSON
    else:
        # start off with no error message
        error_msg = ''

        for i in range(num_tries):

            # make the output format keys with a unique identifier
            new_output_format = {}
            for key in output_format.keys():
                new_output_format[f'{delimiter}{key}{delimiter}'] = output_format[key]
            output_format_prompt = f'''\nYou are to output the following in json format: {new_output_format}
    You must use "{delimiter}{{key}}{delimiter}" to enclose each {{key}} and change values based on context'''

            # Use OpenAI to get a response (reuses the client created at the top of the function)
            response = client.chat.completions.create(
              temperature = temperature,
              model=model,
              messages=[
                {"role": "system", "content": str(system_prompt) + output_format_prompt + error_msg},
                {"role": "user", "content": str(user_prompt)}
              ],
              **kwargs
            )

            res = response.choices[0].message.content

            if verbose:
                print('System prompt:', str(system_prompt) + output_format_prompt + error_msg)
                print('\nUser prompt:', str(user_prompt))
                print('\nGPT response:', res)

            # try-catch block to ensure output format is adhered to
            try:
                # check key appears for each element in the output
                for key in new_output_format.keys():
                    # if output field missing, raise an error
                    if key not in res: raise Exception(f"{key} not in json output")

                # if all is good, we then extract out the fields
                # Use regular expressions to extract keys and values
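                # re.escape keeps delimiters that contain regex metacharacters (e.g. ||| or $$$) literal
                pattern = fr",*\s*['|\"]{re.escape(delimiter)}([^#]*){re.escape(delimiter)}['|\"]: "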

                matches = re.split(pattern, res[1:-1])

                # remove null matches
                my_matches = [match for match in matches if match !='']

                # remove the ' from the value matches
                curated_matches = [match[1:-1] if match[0] in '\'"' else match for match in my_matches]

                # create a dictionary
                end_dict = {}
                for i in range(0, len(curated_matches), 2):
                    end_dict[curated_matches[i]] = curated_matches[i+1]

                # try to do some parsing via literal_eval
                if literal_eval:
                    res = end_dict
                    for key in end_dict.keys():
                        try:
                            end_dict[key] = ast.literal_eval(end_dict[key])
                        except Exception as e:
                            # if there is an error in literal processing, do nothing as it is not of the form of a literal
                            continue 

                return end_dict

            except Exception as e:
                error_msg = f"\n\nResult: {res}\n\nError message: {str(e)}\nYou must use \"{delimiter}{{key}}{delimiter}\" to enclose each {{key}}."
                print("An exception occurred:", str(e))
                print("Current invalid json format:", res)

        return {}

In [3]:
# for functions using strict_json
class strict_function:
    def __init__(self, fn_description = 'Output a reminder to define this function in a happy way', 
                 output_format = {'output': 'sentence'}, 
                 examples = None,
                 input_type = None, 
                 output_type = None,
                 **kwargs):
        ''' 
        Creates an LLM-based function using fn_description and outputs JSON based on output_format. 
        Optionally, can define the function based on examples (list of Dict containing input and output variables for each example)
        Optionally, can also force input/output variables to a particular type with input_type and output_type dictionary respectively.
        
        Inputs (compulsory):
        - fn_description: String. Function description to describe process of transforming input variables to output variables
        - output_format: Dict. Dictionary containing output variable names and a description for each variable. There must be at least one output variable
           
        Inputs (optional):
        - examples: Dict or List[Dict]. Examples in Dictionary form with the input and output variables (list if more than one)
        - input_type: Dict. Dictionary containing input variable names as keys and mapping functions as values (need not contain all variables)
        - output_type: Dict. Dictionary containing output variable names as keys and mapping functions as values (need not contain all variables)
        If input_type or output_type is omitted, values are converted to the best-fitting datatype by default
        - kwargs: Dict. Additional arguments you would like to pass on to the strict_json function
        
        ## Example
        fn_description = 'Output the sum of var1 and var2'
        output_format = {'output': 'sum of two numbers'}
        examples = [{'var1': 5, 'var2': 6, 'output': 11}, {'var1': 2, 'var2': 4, 'output': 6}]
        input_type = {'var1': int, 'var2': int}
        output_type = {'output': int}
        
        ## Advanced Conversion of list-based outputs
        - If your output field is of the form of a list, you can ensure strict type conversion of each element using a lambda function
        - Examples
            - For strings, lambda x: [str(y) for y in x]
            - For integers, lambda x: [int(y) for y in x]
        '''
        
        # Compulsory variables
        self.fn_description = fn_description
        self.output_format = output_format
        
        # Optional variables
        self.input_type = input_type
        self.output_type = output_type
        self.examples = examples
        self.kwargs = kwargs
        
        if self.examples is not None:
            self.fn_description += '\nExamples:\n' + str(examples)
        
    def __call__(self, *args, **kwargs):
        ''' Describes the function, and inputs the relevant parameters as either unnamed variables (args) or named variables (kwargs)
        If there is any variable that needs to be strictly converted to a datatype, put mapping function in input_type or output_type
        
        Inputs:
        - *args: Tuple. Unnamed input variables of the function. Will be processed to var1, var2 and so on based on order in the tuple
        - **kwargs: Dict. Named input variables of the function
        
        Output:
        - res: Dict. JSON containing the output variables'''
        
        # Do the merging of args and kwargs
        for num, arg in enumerate(args):
            kwargs['var'+str(num+1)] = arg
        
        # Do the input type conversion (optional)
        if self.input_type is not None:
            for key in kwargs:
                if key in self.input_type:
                    try:
                        kwargs[key] = self.input_type[key](kwargs[key])
                    except Exception as e: continue

        # do the function. 
        res = strict_json(system_prompt = self.fn_description,
                        user_prompt = kwargs,
                        output_format = self.output_format, 
                        **self.kwargs)
    
        # Do the output type conversion (optional)
        if self.output_type is not None:
            for key in res:
                if key in self.output_type:
                    try:
                        res[key] = self.output_type[key](res[key])      
                    except Exception as e: continue
                
        return res

# 1. Basic Generation

- **system_prompt**: Write in whatever you want GPT to become. "You are a \<purpose in life\>"
- **user_prompt**: The user input. Later, when we use it as a function, this is the function input
- **output_format**: JSON of output variables in a dictionary, with the key as the output key, and the value as the output description
    - The output keys will be preserved exactly, while GPT will generate content to match the description of the value as best as possible

#### Example Usage
```python
res = strict_json(system_prompt = 'You are a classifier',
                    user_prompt = 'It is a beautiful and sunny day',
                    output_format = {'Sentiment': 'Type of Sentiment',
                                    'Adjectives': 'List of adjectives',
                                    'Words': 'Number of words'})
                                    
print(res)
```

#### Example output
```{'Sentiment': 'positive', 'Adjectives': ['beautiful', 'sunny'], 'Words': 7}```

In [4]:
res = strict_json(system_prompt = 'You are a classifier',
                    user_prompt = 'It is a beautiful and sunny day',
                    output_format = {'Sentiment': 'Type of Sentiment',
                                    'Adjectives': 'List of adjectives',
                                    'Words': 'Number of words'})
print(res)

{'Sentiment': 'Positive', 'Adjectives': ['beautiful', 'sunny'], 'Words': 7}


## Easy to split into corresponding elements

In [5]:
res['Sentiment']

'Positive'

In [6]:
res['Adjectives']

['beautiful', 'sunny']

In [7]:
res['Words']

7

# 2. Advanced Generation
- More advanced demonstration involving code that would typically break ```json.loads()```
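
To see why such output is a problem, here is a small sketch using a hand-written reply (hypothetical, not real model output). Single-quoted keys and values are not valid JSON, so `json.loads()` rejects the string outright.

```python
import json

# Hypothetical LLM reply: single quotes plus a raw newline inside the code
reply = "{'Python': 'def func_sum(p):\n    return sum(p)'}"

try:
    json.loads(reply)
    parsed = True
except json.JSONDecodeError:
    # JSON requires double-quoted property names and strings
    parsed = False

print('json.loads succeeded:', parsed)
```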

#### Example Usage
```python
res = strict_json(system_prompt = 'You are a code generator, generating code to fulfil a task',
                    user_prompt = 'Given array p, output a function named func_sum to return its sum',
                    output_format = {'Elaboration': 'How you would do it',
                                     'C': 'Code',
                                    'Python': 'Code'})
                                    
print(res)
```

#### Example output
```{'Elaboration': 'To calculate the sum of an array, we can iterate through each element of the array and add it to a running total. Finally, we return the total as the result.', ```

```'C': 'int func_sum(int p[], int size) {\\n    int sum = 0;\\n    for (int i = 0; i < size; i++) {\\n        sum += p[i];\\n    }\\n    return sum;\\n}', ```

```'Python': 'def func_sum(p):\\n    sum = 0\\n    for num in p:\\n        sum += num\\n    return sum'}```


In [None]:
res = strict_json(system_prompt = 'You are a code generator, generating code to fulfil a task',
                    user_prompt = 'Given array p, output a function named func_sum to return its sum',
                    output_format = {'Elaboration': 'How you would do it',
                                     'C': 'Code',
                                    'Python': 'Code'})
                                    
print(res)

## Easy to split into corresponding elements

In [None]:
res['Elaboration']

In [None]:
print(res['C'])

In [None]:
res['Python']

In [None]:
## we can even run the Python code (potentially risky due to prompt injection attacks when running unverified code)
p = [1, 2, 3, 4, 5]
exec(res['Python'].replace('\\n', '\n'))
try:
    print('The output sum is', func_sum(p))
except Exception as e:
    print('An exception occurred')
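
If you do run generated code, one option is to keep it out of your own globals. The sketch below assumes a hard-coded string in place of real model output (the `generated` value is hypothetical) and executes it in a separate dictionary. This only avoids name clashes; it is not a security sandbox, so the prompt injection warning above still applies.

```python
# Hypothetical model output, hard-coded so the sketch runs without an API call
generated = 'def func_sum(p):\\n    total = 0\\n    for num in p:\\n        total += num\\n    return total'

# Turn the escaped newlines back into real newlines, as in the cell above
code = generated.replace('\\n', '\n')

# Execute in a dedicated namespace instead of the notebook globals
namespace = {}
exec(code, namespace)

# Only call the function if the generated code actually defined it
func = namespace.get('func_sum')
total = func([1, 2, 3, 4, 5]) if callable(func) else None
print('The output sum is', total)
```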

# 3. Strict JSON Functions
- Enhances ```strict_json()``` with a function-like interface for repeated use of modular LLM-based functions
- Inputs (compulsory):
    - **fn_description** - Function description to describe process of transforming input variables to output variables
    - **output_format** - Dictionary containing output variables names and description for each variable. There must be at least one output variable
- Inputs (optional):
    - **examples** - Examples in Dictionary form with the input and output variables (list if more than one)
    - **input_type** - Dictionary containing input variable names as keys and mapping functions as values (need not contain all variables)
    - **output_type** - Dictionary containing output variable names as keys and mapping functions as values (need not contain all variables)
    - **kwargs** - Additional arguments you would like to pass on to the ```strict_json``` function
        
- Outputs:
    JSON of output variables in a dictionary (similar to ```strict_json```)
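
As a rough sketch of what happens to the inputs before the LLM sees them: positional arguments are renamed to `var1`, `var2` and so on, and any mapping function in `input_type` is applied on a best-effort basis. The helper name `merge_inputs` is hypothetical; it mirrors the logic of `strict_function.__call__` without making an API call.

```python
def merge_inputs(args, kwargs, input_type=None):
    merged = dict(kwargs)
    # Positional arguments become var1, var2, ... in order
    for num, arg in enumerate(args):
        merged['var' + str(num + 1)] = arg
    # Apply the optional mapping functions; keep the raw value on failure
    if input_type is not None:
        for key in merged:
            if key in input_type:
                try:
                    merged[key] = input_type[key](merged[key])
                except Exception:
                    continue
    return merged

inputs = merge_inputs(('3', 'four'), {'style': 'happy'},
                      input_type={'var1': int, 'var2': int})
print(inputs)
```

Here `'3'` is converted to the integer 3, while `'four'` cannot be converted by `int` and is passed through unchanged.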
    
#### Example Usage 1 (Description only)
```python
# Construct the function: var1 will be first input variable, var2 will be second input variable and so on
fn = strict_function(fn_description = 'Output a sentence with words var1 and var2 in the style of var3', 
                     output_format = {'output': 'sentence'})

# Use the function
fn('ball', 'dog', 'happy')
```

#### Example Output 1
```{'output': 'The happy dog chased the ball.'}```

#### Example Usage 2 (Examples only)
```python
# Construct the function: infer pattern from just examples without description (here it is multiplication)
fn = strict_function(fn_description = 'Map input to output based on examples', 
                     output_format = {'output': 'final answer'}, 
                     examples = [{'var1': 3, 'var2': 2, 'output': 6}, 
                                 {'var1': 5, 'var2': 3, 'output': 15}, 
                                 {'var1': 7, 'var2': 4, 'output': 28}])

# Use the function
fn(2, 10)
```

#### Example Output 2
```{'output': 20}```

#### Example Usage 3 (Description, Examples and Type Forcing)
```python
# Construct the function: var1 will be first input variable, var2 will be second input variable and so on
fn = strict_function(fn_description = 'Output the sum and difference of var1 and var2', 
                 output_format = {'sum': 'sum of two numbers', 'difference': 'absolute difference of two numbers'}, 
                 examples = {'var1': 2, 'var2': 4, 'sum': 6, 'difference': '2'}, 
                 input_type = {'var1': int, 'var2': int},           # optional
                 output_type = {'sum': int, 'difference': str})     # optional

# Use the function
fn(3, 4)
```

#### Example Output 3
```{'sum': 7, 'difference': '1'}```

In [None]:
# basic configuration with no variable names (recommended)
# var1 will be first input variable, var2 will be second input variable and so on
fn = strict_function(fn_description = 'Output a sentence with words var1 and var2 in the style of var3', 
                     output_format = {'output': 'sentence'})
fn('ball', 'dog', 'happy')

In [None]:
# basic configuration with variable names
fn = strict_function(fn_description = 'Output a sentence with "obj" and "entity" in the style of "emotion"', 
                     output_format = {'output': 'sentence'})
fn(obj = 'ball', entity = 'dog', emotion = 'happy')

In [None]:
# infer pattern from just examples without description (here it is multiplication)
fn = strict_function(fn_description = 'Map input to output based on examples', 
                     output_format = {'output': 'final answer'}, 
                     examples = [{'var1': 3, 'var2': 2, 'output': 6}, 
                                 {'var1': 5, 'var2': 3, 'output': 15}, 
                                 {'var1': 7, 'var2': 4, 'output': 28}])
fn(2, 10)

In [None]:
# multiple outputs and examples without strict typing. allows very flexible input types and output types (recommended)
fn = strict_function(fn_description = 'Output the sum and difference of var1 and var2', 
                 output_format = {'sum': 'sum of two numbers', 'difference': 'absolute difference of two numbers'}, 
                 examples = {'var1': 2, 'var2': 4, 'sum': 6, 'difference': 2})
fn(3, 4)

In [None]:
# multiple outputs and examples and strict typing. converts difference into a string
fn = strict_function(fn_description = 'Output the sum and difference of var1 and var2', 
                 output_format = {'sum': 'sum of two numbers', 'difference': 'absolute difference of two numbers'}, 
                 examples = {'var1': 2, 'var2': 4, 'sum': 6, 'difference': '2'}, 
                 input_type = {'var1': int, 'var2': int},           # optional
                 output_type = {'sum': int, 'difference': str})     # optional
fn(3, 4)

In [None]:
# multiple outputs without strict typing. allows very flexible input types and output types (recommended)
fn = strict_function(fn_description = '''Output the sum of var1 and var2, 
                     generate a poem in style of var3 and code in var4''', 
                 output_format = {'sum': 'sum of two numbers', 
                'poem': 'poem about the two numbers',
                'code': 'code to do the sum of two numbers'})
fn('three', 4, 'happy', 'Python')

In [None]:
# multiple outputs with strict typing. Converts sum into a string. Unspecified types are converted to best fit
fn = strict_function(fn_description = '''Output the sum of var1 and var2, 
                     generate a poem in style of var3 and code in var4''', 
                 output_format = {'sum': 'sum of two numbers', 
                'poem': 'poem about the two numbers',
                'code': 'code to do the sum of two numbers'},
                 input_type = {'var1': str, 'var2': int},           # optional
                 output_type = {'sum': str})                        # optional
fn('three', 4, 'haiku', 'Assembly')

## Advanced Conversion of list-based outputs
- If your output field is of the form of a list, you can ensure strict type conversion of each element using a lambda function
- Examples
    - For strings, lambda x: [str(y) for y in x]
    - For integers, lambda x: [int(y) for y in x]
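
A quick sketch of these lambdas applied to a hand-written result dictionary (the `raw` values are made up for illustration, not real model output):

```python
output_type = {'primes': lambda x: [str(y) for y in x],
               'evens': lambda x: [int(y) for y in x]}

# Hypothetical parsed output before type conversion
raw = {'primes': [5, 7, 11, 13, 17],
       'evens': ['12', '14', '16', '18', '20']}

# Apply each mapping function to its field, leaving other fields untouched
converted = {key: output_type[key](value) if key in output_type else value
             for key, value in raw.items()}
print(converted)
```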

In [None]:
# multiple outputs with strict typing. shows how to do list[str] conversion using lambda functions
fn = strict_function(fn_description = '''Output 5 prime numbers after var1, output 5 even numbers after var2''', 
                 output_format = {'primes': 'list of primes', 'evens': 'list of evens'},
                 input_type = {'var1': int, 'var2': int},                               # optional
                 output_type = {'primes': lambda x: [str(y) for y in x],
                               'evens': lambda x: [int(y) for y in x]})           # optional
fn(4, 10)

# 4. Integrating with OpenAI JSON Mode
- If you want to use the OpenAI JSON Mode (which is pretty good btw), you can simply add in ```openai_json_mode = True``` in ```strict_json``` or ```strict_function```
- Note that the model must be one of ```gpt-4-1106-preview``` or ```gpt-3.5-turbo-1106```. We will set it to ```gpt-3.5-turbo-1106``` by default if you provide an invalid model
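
The two safeguards in JSON Mode (model fallback, and returning `{}` when the reply cannot be parsed) can be sketched as follows. The helper names `resolve_model` and `load_or_empty` are hypothetical and only illustrate the behaviour described above.

```python
import json

SUPPORTED_MODELS = ['gpt-4-1106-preview', 'gpt-3.5-turbo-1106']

def resolve_model(model):
    # Fall back to gpt-3.5-turbo-1106 when the model is not in the supported list
    return model if model in SUPPORTED_MODELS else 'gpt-3.5-turbo-1106'

def load_or_empty(text):
    # Return {} rather than raising when the reply is not valid JSON
    try:
        return json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return {}

print(resolve_model('gpt-4'))
print(load_or_empty('{"Sentiment": "positive"}'))
print(load_or_empty("{'Sentiment': 'positive'}"))
```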

#### Example Usage
```python
res = strict_json(system_prompt = 'You are a classifier',
                    user_prompt = 'It is a beautiful and sunny day',
                    output_format = {'Sentiment': 'Type of Sentiment',
                                    'Adjectives': 'List of adjectives',
                                    'Words': 'Number of words'},
                    openai_json_mode = True) # Toggle this to True
                                    
print(res)
```

#### Example output
```{'Sentiment': 'positive', 'Adjectives': ['beautiful', 'sunny'], 'Words': 6}```

In [None]:
res = strict_json(system_prompt = 'You are a classifier',
                    user_prompt = 'It is a beautiful and sunny day',
                    output_format = {'Sentiment': 'Type of Sentiment',
                                    'Adjectives': 'List of adjectives',
                                    'Words': 'Number of words'},
                   openai_json_mode = True) # Toggle this to True
print(res)

In [None]:
fn = strict_function(fn_description = 'Output a sentence with words var1 and var2 in the style of var3', 
                     output_format = {'output': 'sentence'},
                    openai_json_mode = True) # Toggle this to True
fn('ball', 'dog', 'happy')

# Optional: Under the hood (Explanation of how strict_json works)
- When given the output JSON format, it adds a delimiter (default: ###) to enclose the key of the JSON.
- Example output JSON provided: ```{'Sentiment': 'Type of Sentiment'}```
- Example output JSON interpreted by Strict JSON: ```{'###Sentiment###': 'Type of Sentiment'}```
- We then process the JSON format by using regex to search for the delimiter to extract the keys and values
- Note: Change the delimiter to whatever is not present in your dataset
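
One way to choose a delimiter is to scan a sample of your data and take the first candidate that never occurs in it. This is only a sketch: `pick_delimiter` and its candidate list are hypothetical and not part of Strict JSON.

```python
def pick_delimiter(texts, candidates=('###', '@@@', '%%%')):
    # Return the first candidate that does not occur in any sample text
    for candidate in candidates:
        if not any(candidate in text for text in texts):
            return candidate
    raise ValueError('every candidate delimiter appears in the data')

samples = ['### Oh what is this doing here', 'import numpy as np']
delimiter = pick_delimiter(samples)
print(delimiter)
```

The chosen value can then be passed as the `delimiter` argument of `strict_json`.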

In [None]:
# a very difficult chunk of text for json.loads() to parse (it will fail)
res = '''{'###Question of the day###': 'What is the 'x' in dx/dy?', 
'###Code Block 1###': '#include <stdio.h>\nint main(){\nint x = 'a'; return 0;\n}',
'###Another Code###': 'import numpy as np
### Oh what is this doing here
print("It can handle so many quotations ' \\" and backslashes and unexpected curly braces { } You don't even need to match }!")',
'###Some characters###': '~!@#$%^&*()_+-'"{}[];?><,.'}'''

In [None]:
# change this to whatever is not common in your dataset
delimiter = '###'

In [None]:
# Use regular expressions to extract keys and values
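# re.escape keeps delimiters that contain regex metacharacters (e.g. ||| or $$$) literal
pattern = fr",*\s*['|\"]{re.escape(delimiter)}([^#]*){re.escape(delimiter)}['|\"]: "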

matches = re.split(pattern, res[1:-1])

# remove null matches
my_matches = [match for match in matches if match !='']

print(my_matches)

In [None]:
# remove the ' from the value matches
curated_matches = [match[1:-1] if match[0] in '\'"' else match for match in my_matches]

print(curated_matches)

In [None]:
len(curated_matches)

In [None]:
# create a dictionary
end_dict = {}
for i in range(0, len(curated_matches), 2):
    end_dict[curated_matches[i]] = curated_matches[i+1]
    
print(end_dict)

In [None]:
for key, value in end_dict.items():
    print('Key:', key)
    print('Value:', value)
    print('#####')