# Strict JSON Usage Tutorial
A strict JSON framework for LLM outputs that fixes problems ```json.loads()``` cannot solve
- Works for JSON outputs containing multiple ' or " or { or } or \ characters, or unmatched braces/brackets, any of which may break ```json.loads()```
- Updated: 8 Jan 2024
- Repo: https://github.com/tanchongmin/strictjson
- Collaborators welcome

## How do I use this? 
1. Download package via command line ```pip install strictjson```
2. Set up your OpenAI API Key. Refer to ```Tutorial.ipynb``` for how to do it for Jupyter Notebooks.
3. Import the required functions from ```strictjson``` and use them!

## How does it work?
- Wraps each ```key``` in delimiters to make ```###key###```, then uses a regex on those delimiters to split the LLM output into keys and values, with each value extracted as a string
- By default, uses ```ast.literal_eval``` to best match each value string to a literal (e.g. int, string, dict). Set ```literal_eval = False``` when calling ```strict_json``` to keep the output fields as strings
- Checks that the LLM has output every JSON field; if any field is missing, it feeds an error message back to the LLM so it can iteratively correct its generation (default: 3 tries)
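
The literal-matching step can be sketched in isolation. This is a minimal illustration of the idea rather than the library's actual code: the helper name `to_literal` is hypothetical, and falling back to the raw string when the text is not a valid literal is an assumption.

```python
import ast

def to_literal(text):
    """Return the Python literal that `text` spells out, or `text` itself if it is not one."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError, TypeError):
        # Not a valid literal (e.g. free-form prose), so keep the raw string
        return text

print(to_literal("7"))                       # parsed as an int
print(to_literal("['beautiful', 'sunny']"))  # parsed as a list of strings
print(to_literal("Positive"))                # not a literal, stays a string
```

`ast.literal_eval` only evaluates literals (numbers, strings, lists, dicts and similar), so unlike `eval` it does not execute arbitrary expressions found in the LLM output.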

# Setup Guide

## Step 1: Install StrictJSON

In [1]:
# !pip install strictjson

## Step 2: Set up OpenAI API Key

In [2]:
#Python way to set up OpenAI API Keys
import os
os.environ['OPENAI_API_KEY'] = '<YOUR API KEY HERE>'

## Step 3: Import required functions

In [3]:
from strictjson import *

# 1. Basic Generation

- **system_prompt**: Write in whatever you want GPT to become. "You are a \<purpose in life\>"
- **user_prompt**: The user input. Later, when we use it as a function, this is the function input
- **output_format**: JSON of output variables in a dictionary, with the key as the output key, and the value as the output description
    - The output keys will be preserved exactly, while GPT will generate content to match the description of the value as best as possible

#### Example Usage
```python
res = strict_json(system_prompt = 'You are a classifier',
                    user_prompt = 'It is a beautiful and sunny day',
                    output_format = {'Sentiment': 'Type of Sentiment',
                                    'Adjectives': 'List of adjectives',
                                    'Words': 'Number of words'})
                                    
print(res)
```

#### Example output
```{'Sentiment': 'Positive', 'Adjectives': ['beautiful', 'sunny'], 'Words': 7}```

In [4]:
res = strict_json(system_prompt = 'You are a classifier',
                    user_prompt = 'It is a beautiful and sunny day',
                    output_format = {'Sentiment': 'Type of Sentiment',
                                    'Adjectives': 'List of adjectives',
                                    'Words': 'Number of words'})
print(res)

{'Sentiment': 'Positive', 'Adjectives': ['beautiful', 'sunny'], 'Words': 7}


## Easy to split into corresponding elements

In [5]:
res['Sentiment']

'Positive'

In [6]:
res['Adjectives']

['beautiful', 'sunny']

In [7]:
res['Words']

7

# 2. Advanced Generation
- More advanced demonstration involving code that would typically break ```json.loads()```

#### Example Usage
```python
res = strict_json(system_prompt = 'You are a code generator, generating code to fulfil a task',
                    user_prompt = 'Given array p, output a function named func_sum to return its sum',
                    output_format = {'Elaboration': 'How you would do it',
                                     'C': 'Code',
                                    'Python': 'Code'})
                                    
print(res)
```

#### Example output
```{'Elaboration': 'To calculate the sum of an array, we can iterate through each element of the array and add it to a running total.', ```

```'C': 'int func_sum(int p[], int size) {\n    int sum = 0;\n    for (int i = 0; i < size; i++) {\n        sum += p[i];\n    }\n    return sum;\n}', ```

```'Python': 'def func_sum(p):\n    sum = 0\n    for num in p:\n        sum += num\n    return sum'}```


In [8]:
res = strict_json(system_prompt = 'You are a code generator, generating code to fulfil a task',
                    user_prompt = 'Given array p, output a function named func_sum to return its sum',
                    output_format = {'Elaboration': 'How you would do it',
                                     'C': 'Code',
                                    'Python': 'Code'})
                                    
print(res)

{'Elaboration': 'To calculate the sum of an array, we can iterate through each element of the array and add it to a running total.', 'C': 'int func_sum(int p[], int size) {\n    int sum = 0;\n    for (int i = 0; i < size; i++) {\n        sum += p[i];\n    }\n    return sum;\n}', 'Python': 'def func_sum(p):\n    return sum(p)'}


## Easy to split into corresponding elements

In [9]:
res['Elaboration']

'To calculate the sum of an array, we can iterate through each element of the array and add it to a running total.'

In [10]:
print(res['C'])

int func_sum(int p[], int size) {
    int sum = 0;
    for (int i = 0; i < size; i++) {
        sum += p[i];
    }
    return sum;
}


In [11]:
print(res['Python'])

def func_sum(p):
    return sum(p)


In [12]:
## we can even run the Python code (potentially risky due to prompt injection attacks when running unverified code)
p = [1, 2, 3, 4, 5]
exec(res['Python'])
try:
    print('The output sum is', func_sum(p))
except Exception as e:
    print('An exception occurred:', e)

The output sum is 15
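
If you do run generated code, one small precaution is to execute it in its own namespace instead of the notebook's globals, so it cannot silently overwrite your variables. The sketch below assumes the generated code is the string shown (copied from the output above). It is not a sandbox and does not make untrusted code safe to run.

```python
# Stand-in for res['Python'] so the example runs without calling the LLM
generated_code = 'def func_sum(p):\n    return sum(p)'

# Run the code in a separate dictionary rather than the current globals
namespace = {}
exec(generated_code, namespace)

# Only call the function if the generated code actually defined it
func = namespace.get('func_sum')
if callable(func):
    result = func([1, 2, 3, 4, 5])
    print('The output sum is', result)
else:
    print('func_sum was not defined by the generated code')
```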


# 3. Strict JSON Functions
- Enhances ```strict_json()``` with a function-like interface for repeated use of modular LLM-based functions
- Inputs (compulsory):
    - **fn_description** - Function description to describe process of transforming input variables to output variables
    - **output_format** - Dictionary containing output variable names and a description for each variable. There must be at least one output variable
- Inputs (optional):
    - **examples** - Examples in Dictionary form with the input and output variables (list if more than one)
    - **input_type** - Dictionary containing input variable names as keys and mapping functions as values (need not contain all variables)
    - **output_type** - Dictionary containing output variable names as keys and mapping functions as values (need not contain all variables)
    - **kwargs** - Additional arguments you would like to pass on to the ```strict_json``` function
        
- Outputs:
    JSON of output variables in a dictionary (similar to ```strict_json```)
    
#### Example Usage 1 (Description only)
```python
# Construct the function: var1 will be first input variable, var2 will be second input variable and so on
fn = strict_function(fn_description = 'Output a sentence with words var1 and var2 in the style of var3', 
                     output_format = {'output': 'sentence'})

# Use the function
fn('ball', 'dog', 'happy')
```

#### Example Output 1
```{'output': 'The happy dog chased the ball.'}```

#### Example Usage 2 (Examples only)
```python
# Construct the function: infer pattern from just examples without description (here it is multiplication)
fn = strict_function(fn_description = 'Map input to output based on examples', 
                     output_format = {'output': 'final answer'}, 
                     examples = [{'var1': 3, 'var2': 2, 'output': 6}, 
                                 {'var1': 5, 'var2': 3, 'output': 15}, 
                                 {'var1': 7, 'var2': 4, 'output': 28}])

# Use the function
fn(2, 10)
```

#### Example Output 2
```{'output': 20}```

#### Example Usage 3 (Description, Examples and Type Forcing)
```python
# Construct the function: var1 will be first input variable, var2 will be second input variable and so on
fn = strict_function(fn_description = 'Output the sum and difference of var1 and var2', 
                 output_format = {'sum': 'sum of two numbers', 'difference': 'absolute difference of two numbers'}, 
                 examples = {'var1': 2, 'var2': 4, 'sum': 6, 'difference': '2'}, 
                 input_type = {'var1': int, 'var2': int},           # optional
                 output_type = {'sum': int, 'difference': str})     # optional

# Use the function
fn(3, 4)
```

#### Example Output 3
```{'sum': 7, 'difference': '1'}```

In [13]:
# basic configuration with no variable names (recommended)
# var1 will be first input variable, var2 will be second input variable and so on
fn = strict_function(fn_description = 'Output a sentence with words var1 and var2 in the style of var3', 
                     output_format = {'output': 'sentence'})
fn('ball', 'dog', 'happy')

{'output': 'The happy dog chased the ball.'}

In [14]:
# basic configuration with variable names
fn = strict_function(fn_description = 'Output a sentence with "obj" and "entity" in the style of "emotion"', 
                     output_format = {'output': 'sentence'})
fn(obj = 'ball', entity = 'dog', emotion = 'happy')

{'output': 'The dog is happy with the ball.'}

In [15]:
# infer pattern from just examples without description (here it is multiplication)
fn = strict_function(fn_description = 'Map input to output based on examples', 
                     output_format = {'output': 'final answer'}, 
                     examples = [{'var1': 3, 'var2': 2, 'output': 6}, 
                                 {'var1': 5, 'var2': 3, 'output': 15}, 
                                 {'var1': 7, 'var2': 4, 'output': 28}])
fn(2, 10)

{'output': 20}

In [16]:
# multiple outputs and examples without strict typing. allows very flexible input types and output types (recommended)
fn = strict_function(fn_description = 'Output the sum and difference of var1 and var2', 
                 output_format = {'sum': 'sum of two numbers', 'difference': 'absolute difference of two numbers'}, 
                 examples = {'var1': 2, 'var2': 4, 'sum': 6, 'difference': 2})
fn(3, 4)

{'sum': 7, 'difference': 1}

In [17]:
# multiple outputs and examples and strict typing. converts difference into a string
fn = strict_function(fn_description = 'Output the sum and difference of var1 and var2', 
                 output_format = {'sum': 'sum of two numbers', 'difference': 'absolute difference of two numbers'}, 
                 examples = {'var1': 2, 'var2': 4, 'sum': 6, 'difference': '2'}, 
                 input_type = {'var1': int, 'var2': int},           # optional
                 output_type = {'sum': int, 'difference': str})     # optional
fn(3, 4)

{'sum': 7, 'difference': '1'}

In [18]:
# multiple outputs without strict typing. allows very flexible input types and output types (recommended)
fn = strict_function(fn_description = '''Output the sum of var1 and var2, 
                     generate a poem in style of var3 and code in var4''', 
                 output_format = {'sum': 'sum of two numbers', 
                'poem': 'poem about the two numbers',
                'code': 'code to do the sum of two numbers'})
fn('three', 4, 'happy', 'Python')

{'sum': 7,
 'poem': 'In a world so happy and bright, where three meets four in pure delight.',
 'code': 'sum = 3 + 4'}

In [19]:
# multiple outputs with strict typing. Converts sum into a string. Unspecified types are converted to best fit
fn = strict_function(fn_description = '''Output the sum of var1 and var2, 
                     generate a poem in style of var3 and code in var4''', 
                 output_format = {'sum': 'sum of two numbers', 
                'poem': 'poem about the two numbers',
                'code': 'code to do the sum of two numbers'},
                 input_type = {'var1': str, 'var2': int},           # optional
                 output_type = {'sum': str})                        # optional
fn('three', 4, 'haiku', 'Assembly')

{'sum': '7',
 'poem': 'Three and four meet\nIn a dance of numbers pure\nTheir sum, a delight',
 'code': 'ADD var1, var2'}

## Advanced Conversion of list-based outputs
- If your output field is a list, you can ensure strict type conversion of each element using a lambda function
- Examples
    - For strings, lambda x: [str(y) for y in x]
    - For integers, lambda x: [int(y) for y in x]

In [20]:
# multiple outputs with strict typing. shows how to do list[str] conversion using lambda functions
fn = strict_function(fn_description = '''Output 5 prime numbers after var1, output 5 even numbers after var2''', 
                 output_format = {'primes': 'list of primes', 'evens': 'list of evens'},
                 input_type = {'var1': int, 'var2': int},                               # optional
                 output_type = {'primes': lambda x: [str(y) for y in x],
                               'evens': lambda x: [int(y) for y in x]})           # optional
fn(4, 10)

{'primes': ['5', '7', '11', '13', '17'], 'evens': [12, 14, 16, 18, 20]}

# 4. Type specificity hints
- Generally, ```strict_json``` will infer the data type automatically for you for the output fields
- However, if you would like very specific data types, or to better enforce data types (due to long context etc.), you can just insert data type hints of the form ```type: <data_type>``` into the output field description
- This ```<data_type>``` can be the same as Pydantic, or json schema, or simply plain text to guide the LLM
- Note: This is not strict conversion. If you would like strict conversion, use ```input_type``` and ```output_type```, which convert the data types using rule-based functions outside of the LLM

#### Example Usage
```python
res = strict_json(system_prompt = 'You are a classifier',
                    user_prompt = 'It is a beautiful and sunny day',
                    output_format = {'Sentiment': 'Type of Sentiment, type: enum["Positive", "Negative"]',
                                    'Adjectives': 'List of adjectives, type: List[str]',
                                    'Words': 'Number of words, type: int'})
                                    
print(res)
```

#### Example output
```{'Sentiment': 'Positive', 'Adjectives': ['beautiful', 'sunny'], 'Words': 7}```

In [21]:
res = strict_json(system_prompt = 'You are a classifier',
                    user_prompt = 'It is a beautiful and sunny day',
                    output_format = {'Sentiment': 'Type of Sentiment, type: enum["Positive", "Negative"]',
                                    'Adjectives': 'List of adjectives, type: List[str]',
                                    'Words': 'Number of words, type: int'})

print(res)

{'Sentiment': 'Positive', 'Adjectives': ['beautiful', 'sunny'], 'Words': 7}
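
Because type hints only guide the LLM, you may want to check the returned types yourself. The sketch below is one possible check: the helper `check_types` is hypothetical and not part of ```strictjson```, and the result dictionary is copied from the output above so the example runs without an API call.

```python
def check_types(result, expected):
    """Return the list of fields whose value is not an instance of the expected type."""
    return [key for key, typ in expected.items()
            if not isinstance(result.get(key), typ)]

res_example = {'Sentiment': 'Positive', 'Adjectives': ['beautiful', 'sunny'], 'Words': 7}
expected_types = {'Sentiment': str, 'Adjectives': list, 'Words': int}

# An empty list means every listed field has the expected type
print(check_types(res_example, expected_types))
```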


# 5. Integrating with OpenAI JSON Mode
- If you want to use the OpenAI JSON Mode (which is pretty good btw), you can simply add in ```openai_json_mode = True``` in ```strict_json``` or ```strict_function```
- Note that the model must be one of ```gpt-4-1106-preview``` or ```gpt-3.5-turbo-1106```. We will set it to ```gpt-3.5-turbo-1106``` by default if you provide an invalid model
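
The model fallback described above amounts to a simple membership check. The sketch below only illustrates that rule; the names `JSON_MODE_MODELS` and `resolve_json_mode_model` are hypothetical and are not taken from the library's source.

```python
# Models named in this tutorial as supporting JSON mode
JSON_MODE_MODELS = ('gpt-4-1106-preview', 'gpt-3.5-turbo-1106')
DEFAULT_JSON_MODE_MODEL = 'gpt-3.5-turbo-1106'

def resolve_json_mode_model(model):
    """Keep a supported model as-is, otherwise fall back to the default."""
    return model if model in JSON_MODE_MODELS else DEFAULT_JSON_MODE_MODEL

print(resolve_json_mode_model('gpt-4-1106-preview'))  # supported, kept
print(resolve_json_mode_model('gpt-4'))               # unsupported, replaced by the default
```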

#### Example Usage
```python
res = strict_json(system_prompt = 'You are a classifier',
                    user_prompt = 'It is a beautiful and sunny day',
                    output_format = {'Sentiment': 'Type of Sentiment',
                                    'Adjectives': 'List of adjectives',
                                    'Words': 'Number of words'},
                    openai_json_mode = True) # Toggle this to True
                                    
print(res)
```

#### Example output
```{'Sentiment': 'positive', 'Adjectives': ['beautiful', 'sunny'], 'Words': 6}```

In [22]:
res = strict_json(system_prompt = 'You are a classifier',
                    user_prompt = 'It is a beautiful and sunny day',
                    output_format = {'Sentiment': 'Type of Sentiment',
                                    'Adjectives': 'List of adjectives',
                                    'Words': 'Number of words'},
                   openai_json_mode = True) # Toggle this to True
print(res)

{'Sentiment': 'Positive', 'Adjectives': ['beautiful', 'sunny'], 'Words': 6}


In [23]:
fn = strict_function(fn_description = 'Output a sentence with words var1 and var2 in the style of var3', 
                     output_format = {'output': 'sentence'},
                    openai_json_mode = True) # Toggle this to True
fn('ball', 'dog', 'happy')

{'output': 'The ball made the dog happy.'}

# Optional: Under the hood (Explanation of how strict_json works)
- When given the output JSON format, it adds a delimiter (default: ###) to enclose the key of the JSON.
- Example output JSON provided: ```{'Sentiment': 'Type of Sentiment'}```
- Example output JSON interpreted by Strict JSON: ```{'###Sentiment###': 'Type of Sentiment'}```
- We then process the JSON format by using regex to search for the delimiter to extract the keys and values
- Note: Change the delimiter to whatever is not present in your dataset
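
The key-wrapping step can be sketched as a one-line dictionary comprehension. The helper name `wrap_keys` is hypothetical and this is an illustration of the idea, not the library's implementation.

```python
def wrap_keys(output_format, delimiter='###'):
    """Enclose every key of `output_format` in `delimiter`, leaving the descriptions as-is."""
    return {f'{delimiter}{key}{delimiter}': description
            for key, description in output_format.items()}

print(wrap_keys({'Sentiment': 'Type of Sentiment'}))
```

Passing a different `delimiter` shows why the choice matters: the delimiter must not appear in the keys or be confusable with your data, otherwise the later regex split becomes ambiguous.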

In [24]:
# a very difficult chunk of text for json.loads() to parse (it will fail)
res = '''{'###Question of the day###': 'What is the 'x' in dx/dy?', 
'###Code Block 1###': '#include <stdio.h>\nint main(){\nint x = 'a'; return 0;\n}',
'###Another Code###': 'import numpy as np
### Oh what is this doing here
print("It can handle so many quotations ' \\" and backslashes and unexpected curly braces { } You don't even need to match }!")',
'###Some characters###': '~!@#$%^&*()_+-'"{}[];?><,.'}'''
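
To confirm that ```json.loads()``` rejects this kind of text, here is a minimal check on a shortened, single-field version of the string above. The variable names are illustrative only.

```python
import json

# Single quotes around keys/values and an unescaped quote inside the value
tricky = '''{'Question': 'What is the 'x' in dx/dy?'}'''

try:
    json.loads(tricky)
    parsed_ok = True
except json.JSONDecodeError as err:
    parsed_ok = False
    print('json.loads failed:', err.msg)
```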

In [25]:
# change this to whatever is not common in your dataset
delimiter = '###'

In [26]:
import re
# Use regular expressions to extract keys and values
pattern = fr",*\s*['\"]{delimiter}([^#]*){delimiter}['\"]: "

matches = re.split(pattern, res[1:-1])

# remove null matches
my_matches = [match for match in matches if match !='']

print(my_matches)

['Question of the day', "'What is the 'x' in dx/dy?'", 'Code Block 1', "'#include <stdio.h>\nint main(){\nint x = 'a'; return 0;\n}'", 'Another Code', '\'import numpy as np\n### Oh what is this doing here\nprint("It can handle so many quotations \' \\" and backslashes and unexpected curly braces { } You don\'t even need to match }!")\'', 'Some characters', '\'~!@#$%^&*()_+-\'"{}[];?><,.\'']


In [27]:
# remove the ' from the value matches
curated_matches = [match[1:-1] if match[0] in '\'"' else match for match in my_matches]

print(curated_matches)

['Question of the day', "What is the 'x' in dx/dy?", 'Code Block 1', "#include <stdio.h>\nint main(){\nint x = 'a'; return 0;\n}", 'Another Code', 'import numpy as np\n### Oh what is this doing here\nprint("It can handle so many quotations \' \\" and backslashes and unexpected curly braces { } You don\'t even need to match }!")', 'Some characters', '~!@#$%^&*()_+-\'"{}[];?><,.']


In [28]:
len(curated_matches)

8

In [29]:
# create a dictionary
end_dict = {}
for i in range(0, len(curated_matches), 2):
    end_dict[curated_matches[i]] = curated_matches[i+1]
    
print(end_dict)

{'Question of the day': "What is the 'x' in dx/dy?", 'Code Block 1': "#include <stdio.h>\nint main(){\nint x = 'a'; return 0;\n}", 'Another Code': 'import numpy as np\n### Oh what is this doing here\nprint("It can handle so many quotations \' \\" and backslashes and unexpected curly braces { } You don\'t even need to match }!")', 'Some characters': '~!@#$%^&*()_+-\'"{}[];?><,.'}


In [30]:
for key, value in end_dict.items():
    print('Key:', key)
    print('Value:', value)
    print('#####')

Key: Question of the day
Value: What is the 'x' in dx/dy?
#####
Key: Code Block 1
Value: #include <stdio.h>
int main(){
int x = 'a'; return 0;
}
#####
Key: Another Code
Value: import numpy as np
### Oh what is this doing here
print("It can handle so many quotations ' \" and backslashes and unexpected curly braces { } You don't even need to match }!")
#####
Key: Some characters
Value: ~!@#$%^&*()_+-'"{}[];?><,.
#####
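
The cells above can be folded into a single helper. This is a sketch under a few assumptions rather than the library's actual parser: the name `parse_delimited` is hypothetical, the delimiter is escaped with `re.escape`, keys are matched non-greedily with `.*?` instead of `[^#]*` so other delimiters work too, and quotes are stripped from values only.

```python
import re

def parse_delimited(text, delimiter='###'):
    """Split a dict-like string whose keys are wrapped in `delimiter` into a real dict."""
    d = re.escape(delimiter)   # escape in case the delimiter has regex metacharacters
    quote = "['\"]"            # character class matching a single or double quote
    pattern = rf",*\s*{quote}{d}(.*?){d}{quote}: "
    # Drop the outer braces, split on the key pattern, and discard empty pieces
    parts = [p for p in re.split(pattern, text.strip()[1:-1]) if p != '']
    result = {}
    # After splitting, the pieces alternate: key, value, key, value, ...
    for key, value in zip(parts[0::2], parts[1::2]):
        value = value.strip()
        # Remove one layer of matching surrounding quotes from the value, if present
        if len(value) >= 2 and value[0] in '\'"' and value[-1] == value[0]:
            value = value[1:-1]
        result[key] = value
    return result

sample = "{'###Question###': 'What is the 'x' in dx/dy?', '###Braces###': 'unmatched } and { here'}"
parsed = parse_delimited(sample)
print(parsed)
```

Values come back as strings; converting them to other literals is a separate step.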
