In [1]:
import textwrap
from lyon_common import create_chain, report_on_message
from question_generator_model import SingleSelection, Code, AnyQuestion, FillInBlank, MultipleSelection
from langchain.output_parsers import PydanticOutputParser
from pydantic import ValidationError
import json
from json.decoder import JSONDecodeError

**NOTE:** I have updated `lyon_common.py` so that the `create_chain` function accepts a few more arguments:

- `model_name`: now defaults to the GPT-3.5 Turbo model released on 2023-11-06.
- `model_kwargs`: a dictionary of OpenAI-specific parameters. The default now instructs OpenAI to use the [new `json_object` response format](https://platform.openai.com/docs/guides/text-generation/json-mode), which constrains token selection so that at each step only valid JSON can be produced.
- `verbose`: a new keyword argument that defaults to `False`.

This notebook shows how we can use these updates to generate valid, Pydantic-ready JSON.
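
As a rough sketch of what those defaults amount to, the snippet below collects the settings described above into a plain dictionary. `build_chain_settings` is a hypothetical helper, not the actual `create_chain` from `lyon_common.py`, and the model identifier `gpt-3.5-turbo-1106` is my assumption for the 2023-11-06 release. Only the `response_format` payload follows OpenAI's documented JSON mode.

```python
import json

# Assumed defaults; the real ones live in lyon_common.py and may differ.
DEFAULT_MODEL_NAME = "gpt-3.5-turbo-1106"
DEFAULT_MODEL_KWARGS = {"response_format": {"type": "json_object"}}


def build_chain_settings(model_name=DEFAULT_MODEL_NAME, model_kwargs=None, verbose=False):
    """Hypothetical helper: collect the keyword arguments described in the note."""
    if model_kwargs is None:
        # Copy the nested dict so callers cannot mutate the shared default
        model_kwargs = {key: dict(value) for key, value in DEFAULT_MODEL_KWARGS.items()}
    return {"model_name": model_name, "model_kwargs": model_kwargs, "verbose": verbose}


settings = build_chain_settings()
print(json.dumps(settings, indent=2))
```

Passing `model_kwargs={}` to such a helper would switch JSON mode off again, which can be handy when comparing outputs with and without it.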

In [28]:
def create_system_prompt_1(pydantic_object):
    common_system_prompt = textwrap.dedent("""
    You are a smart, helpful teaching assistant chatbot named AcademiaGPT.

    You are an expert Python programmer with 15+ years of experience and have used all the most popular
    libraries for data analysis, machine learning, and artificial intelligence.

    You assist professors that teach courses about Python, data science, and machine learning
    to college and university students.

    Your task is to help professors produce practice questions to help students solidify 
    their understanding of specific topics.

    You must always generate questions that have more than one option as solution for the MultipleSelection question type.

    In your conversations with a professor you will be given a topic (string) and an
    expected difficulty level (integer).
    
    The difficulty will be a number between 1 and 3, with 1 corresponding to a request 
    for an easy question, and 3 for the most difficult question.

    An example of a difficulty 1 question is given below.
    
    "How would you reverse the order of the following list in Python
    
    ```python
    a = [1, 'hi', 3, 'there']
    ```
    and save the result in an object `b`"?

    
    An example of a difficulty 3 question is given below.
    
    "You are given a 3-dimensional NumPy array as specified below:
    
    ```
    A = np.array([[[0.0, 1.0], [2.0, 3.0]], [[4.0, 5.0], [6.0, 7.0]]])
    ```
    
    Create a variable `idx` (defined as a tuple) that you could use to select the `4.0` element of this array.
    
    For example,
    ```
    idx = (0, 0, 0)
    ```
    would select the `0.0` element of the array."

    
    
    If the professor asks you for a question and does not specify either a new topic 
    or a new difficulty, you must use the previous topic or difficulty.

 
    Occasionally the professor may ask you to produce a similar question,
    or to make the question more difficult or easier. You need to assist the professor with
    such requests. You must use the previous topic to do this.
    
    If the professor asks for more than one question in a single message, you need to apologize and
    inform them that you can only generate one question at a time. You also need to ask the professor to
    send a new message with the topic and difficulty to generate a new question.

    If the topic is available in the given information, refer to it when responding to the professor.
    You are encouraged to use any tools available to look up relevant information, but only
    if necessary.

    You will apologize if you're unable to generate an output that meets the professor's requirements.

    Your responses must always exactly match the specified JSON format with no extra words or content.

    You must always produce exactly one JSON object.
    
    {format_instructions}
    """)

    parser = PydanticOutputParser(pydantic_object=pydantic_object)
    return common_system_prompt.format(format_instructions=parser.get_format_instructions())
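
The templating pattern used in `create_system_prompt_1` can be tried without LangChain. The sketch below uses a made-up `fake_instructions` string in place of `parser.get_format_instructions()`. It illustrates that `textwrap.dedent` strips the common indentation and that braces inside the substituted value are left untouched by `str.format`, since only the template itself is scanned for placeholders.

```python
import textwrap

template = textwrap.dedent("""
    You must always produce exactly one JSON object.

    {format_instructions}
    """)

# Stand-in for parser.get_format_instructions(); the real text is much longer
fake_instructions = 'Return JSON matching this schema: {"type": "object"}'

prompt = template.format(format_instructions=fake_instructions)
print(prompt)
```

A literal brace written directly in the template would need to be doubled (`{{` and `}}`) to survive the `format` call.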
    

In [26]:
def generate_and_parse_question(pydantic_model, query):
    rag_chain = create_chain(create_system_prompt(pydantic_model), temperature=0.1, verbose=True, model_name="gpt-4-1106-preview")
    parser = PydanticOutputParser(pydantic_object=pydantic_model)

    # Errors raised by the chain call itself are not caught: without a response there is nothing to fall back to
    response = rag_chain(query)
    report_on_message(response)  # print a summary of what was produced
    raw_output = response["output"]

    try:
        return parser.parse(raw_output)
    except (ValidationError, ValueError) as parse_error:
        # The parser is expected to report failures as a ValueError subclass (OutputParserException)
        print(f"Could not parse output as {pydantic_model.__name__}: {parse_error}")

    try:
        # Fall back to a plain dict when the output is valid JSON but does not match the model
        return json.loads(raw_output)
    except JSONDecodeError as json_error:
        # Calling json.loads again here would only raise again, so return the raw text
        print(f"JSON decoding error: {json_error}")
        return raw_output

In [27]:
generate_and_parse_question(MultipleSelection, "topic: pandas groupby, difficulty: 3")



> Entering new AgentExecutor chain...

Invoking: `search_course_content` with `pandas groupby`


[Document(page_content="We're going to use this example data set to demonstrate the three steps in split apply combine. To begin, we'll start with the split step. In order to ask pandas to split the data for us, we use the group by method of a data frame. You see here that we're calling DF dot group by and we're passing the string A. This instructs pandas to construct groups of our data using the values from the A column. This is the most basic and often most used form of the group by method to split on the values of a single column. We can check the type of this GBA object. And we see here a very long type name but we're just going to refer to this as a group by for short. Once we have a group by object, there are a few things we can do with it. One thing we could do is we could ask to get the subset of data for a particular group. So here we're goin

{'description': 'Question where user is presented a prompt in `question_text` and \na list of `choices`. They are supposed to provide all answers that\napply (`solution`)\n\nAll questions must have a minimum of 3 options\n\nExamples\n--------\n{\n  "question_text": "Given a DataFrame `df` with a datetime column \'timestamp\', how would you group the data into 1-hour intervals and sum the values in a column \'value\'?",\n  "difficulty": 3,\n  "topics": ["pandas", "groupby"],\n  "choices": [\n    "df.groupby(df[\'timestamp\'].dt.hour).sum()",\n    "df.resample(\'H\', on=\'timestamp\').sum()",\n    "df.groupby(pd.Grouper(key=\'timestamp\', freq=\'H\')).sum()",\n    "df.groupby(\'timestamp\').sum()"\n  ],\n  "solution": [1, 2]\n}'}

In [28]:
def create_system_prompt(pydantic_object):
    common_system_prompt = textwrap.dedent("""
    You are a smart, helpful teaching assistant chatbot named AcademiaGPT.

    You are an expert Python programmer with 15+ years of experience and have used all the most popular
    libraries for data analysis, machine learning, and artificial intelligence.

    You assist professors that teach courses about Python, data science, and machine learning
    to college and university students.

    Your task is to help professors produce practice questions to help students solidify 
    their understanding of specific topics.

    You must always generate questions that have more than one option as solution for the MultipleSelection question type.

    In your conversations with a professor you will be given a topic (string) and an
    expected difficulty level (integer).
    
    The difficulty will be a number between 1 and 3, with 1 corresponding to a request 
    for an easy question, and 3 for the most difficult question.

    If the professor asks you for a question and does not specify either a new topic 
    or a new difficulty, you must use the previous topic or difficulty.

    Occasionally the professor may ask you to produce a similar question,
    or to make the question more difficult or easier. You need to assist the professor with
    such requests. You must use the previous topic to do this.
    
    If the professor asks for more than one question in a single message, you need to apologize and
    inform them that you can only generate one question at a time. You also need to ask the professor to
    send a new message with the topic and difficulty to generate a new question.

    If the topic is available in the given information, refer to it when responding to the professor.
    You are encouraged to use any tools available to look up relevant information, but only
    if necessary.

    You will apologize if you're unable to generate an output that meets the professor's requirements.

    Your responses must always exactly match the specified JSON format with no extra words or content.

    You must always produce exactly one JSON object.

    Your responses should always be consistent.
    
    {format_instructions}
    """)

    parser = PydanticOutputParser(pydantic_object=pydantic_object)
    return common_system_prompt.format(format_instructions=parser.get_format_instructions())
    

In [30]:
generate_and_parse_question(MultipleSelection, "topic: pandas groupby\n difficulty: 3")



> Entering new AgentExecutor chain...

Invoking: `search_course_content` with `pandas groupby`


[Document(page_content="We're going to use this example data set to demonstrate the three steps in split apply combine. To begin, we'll start with the split step. In order to ask pandas to split the data for us, we use the group by method of a data frame. You see here that we're calling DF dot group by and we're passing the string A. This instructs pandas to construct groups of our data using the values from the A column. This is the most basic and often most used form of the group by method to split on the values of a single column. We can check the type of this GBA object. And we see here a very long type name but we're just going to refer to this as a group by for short. Once we have a group by object, there are a few things we can do with it. One thing we could do is we could ask to get the subset of data for a particular group. So here we're goin

{'description': 'Question where user is presented a prompt in `question_text` and \na list of `choices`. They are supposed to provide all answers that\napply (`solution`)\n\nAll questions must have a minimum of 3 options\n\nExamples\n--------\n{\n  "question_text": "Given a DataFrame `df` with a datetime column \'timestamp\', how would you group the entries by year using pandas?",\n  "difficulty": 3,\n  "topics": ["pandas", "groupby", "datetime"],\n  "choices": [\n    "df.groupby(df[\'timestamp\'].dt.year)",\n    "df.groupby(pd.Grouper(key=\'timestamp\', freq=\'A\'))",\n    "df.groupby(\'timestamp\').resample(\'A\')",\n    "df.set_index(\'timestamp\').groupby(pd.TimeGrouper(\'A\'))"\n  ],\n  "solution": [1]\n}'}
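
The question embedded in this last output lists a single index in `solution`, although the system prompt asks for more than one correct option on MultipleSelection questions. A small check along the lines below could flag that before a question is used. It is only a sketch: `check_multiple_selection` is a hypothetical helper, the field names are taken from the output above, and treating `solution` as zero-based indices into `choices` is my assumption.

```python
def check_multiple_selection(question):
    """Return a list of problems found in a MultipleSelection-style dict."""
    problems = []
    choices = question.get("choices", [])
    solution = question.get("solution", [])
    if len(choices) < 3:
        problems.append("fewer than 3 choices")
    if len(solution) < 2:
        problems.append("solution should list more than one choice")
    # Assumes zero-based indices into `choices`
    if any(not isinstance(i, int) or not 0 <= i < len(choices) for i in solution):
        problems.append("solution index out of range")
    return problems


# Question copied from the output above
question = {
    "question_text": "Given a DataFrame `df` with a datetime column 'timestamp', how would you group the entries by year using pandas?",
    "difficulty": 3,
    "topics": ["pandas", "groupby", "datetime"],
    "choices": [
        "df.groupby(df['timestamp'].dt.year)",
        "df.groupby(pd.Grouper(key='timestamp', freq='A'))",
        "df.groupby('timestamp').resample('A')",
        "df.set_index('timestamp').groupby(pd.TimeGrouper('A'))",
    ],
    "solution": [1],
}

print(check_multiple_selection(question))
```

An empty list would mean the question passed all three checks. The same rules could instead live in the Pydantic model as validators, so that `parser.parse` rejects such a response directly.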