# LLM Strict JSON v2 (Text) Framework
- Created by John Tan Chong Min
- Updated: 2 Nov 2023
    - Changed to accept both ' and " quotation marks around the ###key###: ```fr",*\s*['|\"]{delimiter}([^#]*){delimiter}['|\"]: "```
    - Changed to strip both ' and " quotation marks when extracting the value: ``` curated_matches = [match[1:-1] if match[0] in '\'"' else match for match in my_matches]```
    - Added ```You must use "{delimiter}{{key}}{delimiter}" to enclose the each {{key}}.``` to the system prompt and to the error message, to cue GPT to wrap each key in the delimiter, which it sometimes omits
- Created: 28 Oct 2023
- Collaborators welcome

In [60]:
import os
import openai
import json
import re

#API Keys
os.environ['OPENAI_API_TOKEN'] = '<YOUR_API_KEY_HERE>'
openai.api_key = os.environ['OPENAI_API_TOKEN']

# Strict Text Formatting
- Use when you want to force the function output to be in JSON format
- Helps a lot with minimizing unnecessary explanations from ChatGPT, and with ensuring all output fields are present
- Better than vanilla Strict JSON if the output may contain ' or " or \, which can break ```json.loads()```
- All you need is the function ```strict_text()```

## Key Guideline: Bare Minimum, Functional Concept
- "Fit everything into a string, because it works"
- You will get everything back as a string, which you can then convert to int, float, code or array as you like
- With ```strict_text()```, you can get any kind of answer, including those with lots of ' or " or { or } or \
- You don't even need to match brackets { or quotation marks ' in the json fields for this to work
- Fewer features than vanilla Strict JSON (such as list-based constraining, dynamic inputs), but you can always just type it out in system prompt yourself
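
As a rough illustration of the "convert it yourself" step, the sketch below takes a dictionary shaped like a ```strict_text()``` result and casts its string values. The keys ```Count``` and ```Score``` are hypothetical and only there for the example; ```ast.literal_eval``` is one possible way to turn a stringified list back into a list without calling ```eval()```.

```python
import ast

# Hypothetical strict_text() result: every value arrives as a string
res = {"Count": "10", "Score": "0.75", "Numbers": "['One', 'two', 'three']"}

count = int(res["Count"])                    # string -> int
score = float(res["Score"])                  # string -> float
numbers = ast.literal_eval(res["Numbers"])   # string -> list of strings

print(count, score, numbers)
```

If GPT returns something that is not a valid Python literal, ```int()```, ```float()``` and ```ast.literal_eval()``` raise an exception (```ValueError``` or ```SyntaxError```), so it may be worth wrapping the conversion in a try-except block.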

In [64]:
def strict_text(system_prompt, user_prompt, output_format, delimiter = '###',
                  model = 'gpt-3.5-turbo', temperature = 0, num_tries = 3, verbose = False):
    ''' Ensures that OpenAI will always adhere to the desired output json format. 
    Uses rule-based iterative feedback to ask GPT to self-correct.
    Keeps trying up to num_tries if it does not. Returns an empty dictionary if still unable to after num_tries iterations.'''

    # start off with no error message
    error_msg = ''
    
    for i in range(num_tries):
        
        # make the output format keys with a unique identifier
        new_output_format = {}
        for key in output_format.keys():
            new_output_format[f'{delimiter}{key}{delimiter}'] = output_format[key]
        output_format_prompt = f'''\nYou are to output the following in json format: {new_output_format}
You must use "{delimiter}{{key}}{delimiter}" to enclose the each {{key}}.'''
                    
        # Use OpenAI to get a response (openai.ChatCompletion is the interface of openai SDK versions before 1.0)
        response = openai.ChatCompletion.create(
          temperature = temperature,
          model=model,
          messages=[
            {"role": "system", "content": system_prompt + output_format_prompt + error_msg},
            {"role": "user", "content": str(user_prompt)}
          ]
        )
        
        res = response['choices'][0]['message']['content']

        if verbose:
            print('System prompt:', system_prompt + output_format_prompt + error_msg)
            print('\nUser prompt:', str(user_prompt))
            print('\nGPT response:', res)
        
        # try-catch block to ensure output format is adhered to
        try:
            # check key appears for each element in the output
            for key in new_output_format.keys():
                # if output field missing, raise an error
                if key not in res: raise Exception(f"{key} not in json output")
                
            # if all is good, we then extract out the fields
            # Use regular expressions to extract keys and values
            pattern = fr",*\s*['|\"]{delimiter}([^#]*){delimiter}['|\"]: "

            matches = re.split(pattern, res[1:-1])

            # remove null matches
            my_matches = [match for match in matches if match !='']

            # remove the ' or " from the value matches
            curated_matches = [match[1:-1] if match[0] in '\'"' else match for match in my_matches]

            # create a dictionary
            end_dict = {}
            for i in range(0, len(curated_matches), 2):
                end_dict[curated_matches[i]] = curated_matches[i+1]

            return end_dict

        except Exception as e:
            error_msg = f"\n\nResult: {res}\n\nError message: {str(e)}\nYou must use \"{delimiter}{{key}}{delimiter}\" to enclose the each {{key}}."
            print("An exception occurred:", str(e))
            print("Current invalid json format:", res)
         
    return {}

## Overall Open-ended generation (vanilla)
- **system_prompt**: Describe the role you want GPT to take on, e.g. "You are a \<purpose in life\>"
- **user_prompt**: The user input. Later, when we use it as a function, this is the function input
- **output_format**: JSON format with the key as the output key, and the value as the output description
    - The output keys will be preserved exactly, while GPT will generate content to match the description of the value as best as possible

#### Example Usage
```python
res = strict_text(system_prompt = 'You are a classifier',
                    user_prompt = 'It is a beautiful day',
                    output_format = {"Sentiment": "Type of Sentiment",
                                    "Tense": "Type of Tense"})
                                    
print(res)
```

#### Example output
```{'Sentiment': 'Positive', 'Tense': 'Present'}```
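
One possible way to treat the user prompt as a function input is to fix the system prompt and output format once and return a callable. This is only a sketch: ```make_text_function``` and ```fake_strict_text``` are hypothetical names, and the stub stands in for ```strict_text()``` so the example runs without an API call.

```python
def make_text_function(system_prompt, output_format, strict_text_fn):
    '''Return a function that maps a user prompt to a dict of string outputs.
    strict_text_fn is assumed to accept the same keyword arguments as strict_text().'''
    def text_function(user_prompt):
        return strict_text_fn(system_prompt = system_prompt,
                              user_prompt = user_prompt,
                              output_format = output_format)
    return text_function

# Stub standing in for strict_text() (assumption: same keyword arguments)
def fake_strict_text(system_prompt, user_prompt, output_format):
    return {key: f'<{key} of: {user_prompt}>' for key in output_format}

classify = make_text_function('You are a classifier',
                              {"Sentiment": "Type of Sentiment", "Tense": "Type of Tense"},
                              fake_strict_text)
print(classify('It is a beautiful day'))
```

Passing the real ```strict_text``` in place of the stub would give a reusable classifier function.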


In [65]:
res = strict_text(system_prompt = 'You are a classifier',
                    user_prompt = 'It is a beautiful day',
                    output_format = {"Sentiment": "Type of Sentiment",
                                    "Tense": "Type of Tense"})
print(res)

{'Sentiment': 'Positive', 'Tense': 'Present'}


In [66]:
text = '''
One, two, three, four, five,
Once I caught a fish alive,
Six, seven, eight, nine, ten,
Then I let it go again.
Why did you let it go?
Because it bit my finger so.
Which finger did it bite?
This little finger on my right'''

In [67]:
# Open-ended information extraction from text
res = strict_text(system_prompt = 'You are a friendly assistant meant to extract information from text', 
                    user_prompt = text,
                    output_format = {"Summary": "Summarize the text in 10 words", "Entity Caught": "name of entity caught", 
                                 "Finger Bitten": "finger which was bitten", "Numbers": "List of numbers"})
print(res)

{'Summary': 'Once caught fish alive, let go, bit finger', 'Entity Caught': 'fish', 'Finger Bitten': 'little finger', 'Numbers': "['One', 'two', 'three', 'four', 'five', 'Six', 'seven', 'eight', 'nine', 'ten']"}


# StrictJSON v2 (StrictText)

## Overall Open-ended generation (advanced)
- Constrain multiple GPT outputs to a JSON format for consistency
- All JSON value fields will be strings, easily referenced via dictionary key lookup
- Able to handle code and unstructured formats that would typically break ```json.loads()```
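
To make the last point concrete, here is a small sketch of the kind of reply that trips up ```json.loads()```. The ```reply``` string is made up for illustration: it contains unescaped inner quotes and a raw newline inside a value, as generated code often does.

```python
import json

# Made-up model reply: unescaped " inside the value, plus a raw newline
reply = '''{"Python": "print("hello")
print('world')"}'''

try:
    json.loads(reply)
    parsed_ok = True
except json.JSONDecodeError as e:
    parsed_ok = False
    print('json.loads() failed:', e)
```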

#### Example Usage
```python
res = strict_text(system_prompt = 'You are a code generator, generating code to fulfil a task',
                    user_prompt = 'Sum all elements in a given array p',
                    output_format = {"Elaboration": "How you would do it",
                                     "C": "Code in C",
                                    "Python": "Code in Python"})
                                    
print(res)
```

#### Example output
```{'Elaboration': 'To sum all elements in a given array, you can iterate through each element of the array and keep adding them to a running total.', ```

```'C': 'int sum = 0;\\nfor (int i = 0; i < n; i++) {\\n    sum += p[i];\\n}', ```

```'Python': 'sum = 0\\nfor i in range(len(p)):\\n    sum += p[i]'}```


In [72]:
res = strict_text(system_prompt = 'You are a code generator, generating code to fulfil a task',
                    user_prompt = 'Sum all elements in a given array p',
                    output_format = {"Elaboration": "How you would do it",
                                     "C": "Code in C",
                                    "Python": "Code in Python"})
                                    
print(res)

{'Elaboration': 'To sum all elements in a given array, you can iterate through each element of the array and keep adding them to a running total.', 'C': 'int sum = 0;\\nfor (int i = 0; i < n; i++) {\\n    sum += p[i];\\n}', 'Python': 'sum = 0\\nfor i in range(len(p)):\\n    sum += p[i]'}


In [73]:
print(res['Elaboration'])

To sum all elements in a given array, you can iterate through each element of the array and keep adding them to a running total.


In [74]:
print(res['C'].replace('\\n','\n'))

int sum = 0;
for (int i = 0; i < n; i++) {
    sum += p[i];
}


In [75]:
print(res['Python'].replace('\\n','\n'))

sum = 0
for i in range(len(p)):
    sum += p[i]


## Under the hood (explanation of how it works; you do not need to run this part)
- When given the output JSON format, ```strict_text()``` adds a delimiter (default: ###) around each key of the JSON.
- Example output JSON provided: ```{'Sentiment': 'Type of Sentiment'}```
- Example output JSON interpreted by Strict JSON v2: ```{'###Sentiment###': 'Type of Sentiment'}```
- We then process GPT's reply by using regex to search for the delimiter and extract the keys and values
- Note: Change the delimiter to a string that does not appear in your dataset. The regex inserts the delimiter unescaped and excludes # from keys via ```[^#]*```, so avoid regex metacharacters such as $ or * in the delimiter unless you escape them
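
If you do want a delimiter other than ###, the sketch below shows one way the same split-on-keys idea could be made delimiter-agnostic. It is a variation, not the code used in ```strict_text()```: the helper name ```parse_delimited``` is hypothetical, ```re.escape``` is added so that delimiters such as $$$ are matched literally, and a lazy ```(.*?)``` replaces ```[^#]*``` for the key.

```python
import re

def parse_delimited(res, delimiter = '###'):
    '''Split a reply of the form {'<d>key<d>': 'value', ...} into a dict of strings.'''
    d = re.escape(delimiter)  # match the delimiter literally, even if it has regex metacharacters
    pattern = rf",*\s*['\"]{d}(.*?){d}['\"]: "
    # drop the outer braces, split on the keys, and remove empty pieces
    parts = [p for p in re.split(pattern, res[1:-1]) if p != '']
    # strip one surrounding quote pair from each quoted piece
    parts = [p[1:-1] if p[0] in '\'"' else p for p in parts]
    # pair up alternating keys and values
    return {parts[i]: parts[i+1] for i in range(0, len(parts) - 1, 2)}

reply = '''{'$$$Question$$$': 'What is the 'x' in dx/dy?', '$$$Code$$$': 'int x = 'a'; // costs $5'}'''
parsed = parse_delimited(reply, delimiter = '$$$')
print(parsed)
```

As with the original pattern, this assumes each key is followed by a colon and a single space, and that the delimiter never appears next to a quotation mark inside a value.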

In [83]:
# a very difficult chunk of text for json.loads() to parse (it will fail)
res = '''{'###Question of the day###': 'What is the 'x' in dx/dy?', 
'###Code Block 1###': '#include <stdio.h>\nint main(){\nint x = 'a'; return 0;\n}',
'###Another Code###': 'import numpy as np
### Oh what is this doing here
print("It can handle so many quotations ' \\" and backslashes and unexpected curly braces { } You don't even need to match }!")',
'###Some characters###': '~!@#$%^&*()_+-'"{}[];?><,.'}'''

In [84]:
# change this to whatever is not common in your dataset
delimiter = '###'

In [85]:
# Use regular expressions to extract keys and values
pattern = fr",*\s*['|\"]{delimiter}([^#]*){delimiter}['|\"]: "

matches = re.split(pattern, res[1:-1])

# remove null matches
my_matches = [match for match in matches if match !='']

print(my_matches)

['Question of the day', "'What is the 'x' in dx/dy?'", 'Code Block 1', "'#include <stdio.h>\nint main(){\nint x = 'a'; return 0;\n}'", 'Another Code', '\'import numpy as np\n### Oh what is this doing here\nprint("It can handle so many quotations \' \\" and backslashes and unexpected curly braces { } You don\'t even need to match }!")\'', 'Some characters', '\'~!@#$%^&*()_+-\'"{}[];?><,.\'']


In [86]:
# remove the surrounding ' or " from the value matches
curated_matches = [match[1:-1] if match[0] in '\'"' else match for match in my_matches]

print(curated_matches)

['Question of the day', "What is the 'x' in dx/dy?", 'Code Block 1', "#include <stdio.h>\nint main(){\nint x = 'a'; return 0;\n}", 'Another Code', 'import numpy as np\n### Oh what is this doing here\nprint("It can handle so many quotations \' \\" and backslashes and unexpected curly braces { } You don\'t even need to match }!")', 'Some characters', '~!@#$%^&*()_+-\'"{}[];?><,.']


In [87]:
len(curated_matches)

8

In [88]:
# create a dictionary
end_dict = {}
for i in range(0, len(curated_matches), 2):
    end_dict[curated_matches[i]] = curated_matches[i+1]
    
print(end_dict)

{'Question of the day': "What is the 'x' in dx/dy?", 'Code Block 1': "#include <stdio.h>\nint main(){\nint x = 'a'; return 0;\n}", 'Another Code': 'import numpy as np\n### Oh what is this doing here\nprint("It can handle so many quotations \' \\" and backslashes and unexpected curly braces { } You don\'t even need to match }!")', 'Some characters': '~!@#$%^&*()_+-\'"{}[];?><,.'}


In [82]:
for key, value in end_dict.items():
    print('Key:', key)
    print('Value:', value)
    print('#####')

Key: Question of the day
Value: What is the 'x' in dx/dy?
#####
Key: Code Block 1
Value: #include <stdio.h>
int main(){
int x = 'a'; return 0;
}
#####
Key: Another Code
Value: import numpy as np
### Oh what is this doing here
print("It can handle so many quotations ' \" and backslashes and unexpected curly braces { } You don't even need to match }!")
#####
Key: Some characters
Value: ~!@#$%^&*()_+-'"{}[];?><,.
#####
