## Question Type examples

In this notebook we work with the four Jupyteach question types: `SingleSelection`, `MultipleSelection`, `Code`, and `FillInBlank`.

We will import the corresponding Pydantic classes from `question_generator_model.py`.

We will then create a `langchain.output_parsers.PydanticOutputParser` for each question type and demonstrate how to use each of these.

In [44]:
from question_generator_model import (
    MultipleSelection, 
    SingleSelection, 
    Code, 
    FillInBlank
)

Here is an example of each question type:

> Note: you should also read the docstrings for more details on how each question type works.

> Note 2: You don't have to create these class instances yourself. LangChain will do this for you when we create a `PydanticOutputParser` for each of the question types.

In [45]:
ms_question = MultipleSelection(
    question_text="What are some possible consequences of a learning rate that is too large?",
    difficulty=2,
    choices=[
        "The algorithm never converges",
        "The algorithm becomes unstable",
        "Learning is stable, but very slow"
    ],
    solution=[0, 1],
    topics=["optimization", "gradient descent"]
)
ms_question

What are some possible consequences of a learning rate that is too large?

- [x] The algorithm never converges
- [x] The algorithm becomes unstable
- [ ] Learning is stable, but very slow


In [46]:
ss_question = SingleSelection(
    question_text="""What does `.loc` do?

Below is an example of how it might be used

```python
df.loc[1995, "NorthEast"]
```""",
    difficulty=2,
    topics=["pandas", "loc", "indexing"],
    choices=[
        "The `.loc` method allows a user to select rows/columns by name",
        "The `.loc` method allows a  user to select rows/columns by their position",
        "The `.loc` method is for aggregating data"
    ],
    solution=0
)
ss_question

What does `.loc` do?

Below is an example of how it might be used

```python
df.loc[1995, "NorthEast"]
```

- [x] The `.loc` method allows a user to select rows/columns by name
- [ ] The `.loc` method allows a  user to select rows/columns by their position
- [ ] The `.loc` method is for aggregating data
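For a `SingleSelection` question, `solution` is a single integer index into `choices`. The sketch below shows the kind of sanity check one might run on such a question. The function name `check_single_selection` and the default minimum of 2 choices are assumptions made for illustration; the real Pydantic class may enforce its own rules.

```python
from typing import List


def check_single_selection(
    choices: List[str], solution: int, min_choices: int = 2
) -> None:
    """Raise ValueError if the choices/solution pair is not usable."""
    if len(choices) < min_choices:
        raise ValueError(
            f"need at least {min_choices} choices, got {len(choices)}"
        )
    # solution must be a valid zero-based index into choices
    if not 0 <= solution < len(choices):
        raise ValueError(
            f"solution {solution} is not an index into {len(choices)} choices"
        )


# Passes silently: index 0 points at the first of three choices
check_single_selection(
    ["select by name", "select by position", "aggregate data"], solution=0
)
```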


In [47]:
fib_question = FillInBlank(
    question_text='''\
Suppose you have already executed the following code:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
b = np.array([10, 42])
```

Fill in the blanks below to solve the matrix equation $Ax = b$ for $x$
''',
    difficulty=2,
    topics=["linear algebra", "regression", "numpy"],
    starting_code='''\
from scipy.linalg import ___X

x = ___X(A, ___X)''',
    solution=["solve", "solve", "b"],
    setup_code='''\
import numpy as np

A = np.array([[1, 2], [3, 4]])
b = np.array([10, 42])
''',
    test_code="assert np.allclose(x, [22, -6])"
)
fib_question

Suppose you have already executed the following code:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
b = np.array([10, 42])
```

Fill in the blanks below to solve the matrix equation $Ax = b$ for $x$


```python
from scipy.linalg import ___X

x = ___X(A, ___X)
```

**Solution**

`[solve, solve, b]`

**Rendered Solution**

```python
from scipy.linalg import solve

x = solve(A, b)
```

**Test Suite**

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
b = np.array([10, 42])


from scipy.linalg import solve

x = solve(A, b)

assert np.allclose(x, [22, -6])
```
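The rendered solution and test suite above follow a simple pattern: each `___X` blank is replaced, left to right, by the matching entry of `solution`, and the result is placed between `setup_code` and `test_code`. Here is a minimal sketch of that pattern. The helper names `fill_blanks` and `build_test_suite` are hypothetical, and the actual implementation in `question_generator_model.py` may differ.

```python
from typing import List

BLANK = "___X"


def fill_blanks(starting_code: str, solution: List[str]) -> str:
    """Replace each blank, left to right, with the matching solution string."""
    parts = starting_code.split(BLANK)
    n_blanks = len(parts) - 1
    if n_blanks != len(solution):
        raise ValueError(f"expected {n_blanks} answers, got {len(solution)}")
    filled = parts[0]
    for answer, rest in zip(solution, parts[1:]):
        filled += answer + rest
    return filled


def build_test_suite(setup_code: str, filled_code: str, test_code: str) -> str:
    """Join setup, filled-in code, and tests, separated by blank lines."""
    return "\n\n".join([setup_code, filled_code, test_code])


starting_code = "from scipy.linalg import ___X\n\nx = ___X(A, ___X)"
filled = fill_blanks(starting_code, ["solve", "solve", "b"])
print(filled)
```

Splitting on the blank marker, rather than calling `str.replace` repeatedly, keeps an answer that happens to contain `___X` from being treated as another blank.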

In [48]:
code_question = Code(
    question_text='''How would you create a `DatetimeIndex` starting on January 1, 2022 and ending on June 1, 2022 with the values taking every hour in between?

Save this to a variable called `dates`''',
    difficulty=2,
    topics=["pandas", "dates"],
    starting_code="dates = ...",
    setup_code="import pandas as pd",
    test_code='''\
assert dates.sort_values()[0].strftime("%Y-%m-%d") == "2022-01-01"
assert dates.sort_values()[-1].strftime("%Y-%m-%d") == "2022-06-01"
assert dates.shape[0] == 3625''',
    solution='dates = pd.date_range("2022-01-01", "2022-06-01", freq="h")'
)
code_question

How would you create a `DatetimeIndex` starting on January 1, 2022 and ending on June 1, 2022 with the values taking every hour in between?

Save this to a variable called `dates`

```python
dates = ...
```

**Solution**

```python
dates = pd.date_range("2022-01-01", "2022-06-01", freq="h")
```

**Test Suite**

```python
import pandas as pd

dates = pd.date_range("2022-01-01", "2022-06-01", freq="h")

assert dates.sort_values()[0].strftime("%Y-%m-%d") == "2022-01-01"
assert dates.sort_values()[-1].strftime("%Y-%m-%d") == "2022-06-01"
assert dates.shape[0] == 3625
```
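The expected length of 3625 in the test suite can be checked by hand: there are 151 days between January 1 and June 1, 2022, which gives 151 × 24 = 3624 hourly steps, plus one because `pd.date_range` includes both endpoints by default. The short sketch below confirms the arithmetic using only the standard library.

```python
from datetime import datetime

start = datetime(2022, 1, 1)
end = datetime(2022, 6, 1)

# Number of whole hours between the two endpoints
hours_between = int((end - start).total_seconds() // 3600)

# An hourly range that includes both endpoints has one more timestamp
n_timestamps = hours_between + 1
print(hours_between, n_timestamps)  # 3624 3625
```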

## Create PydanticOutputParser

Now we will create a `langchain.output_parsers.PydanticOutputParser` for each question type.

These objects have two main purposes:

1. They provide a set of format instructions that will be embedded into the system prompt. We can get them via the `.get_format_instructions()` method
2. They know how to parse the return value from the LLM into an instance of our Pydantic class
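To make these two purposes concrete, here is a conceptual sketch that uses only the standard library. It is not how LangChain implements the parser: `SingleSelectionSketch` is a hypothetical dataclass standing in for the Pydantic class, the `format_instructions` string is a shortened placeholder for what `.get_format_instructions()` returns, and the real parser also validates the data through Pydantic.

```python
import json
from dataclasses import dataclass
from typing import List

# Placeholder for parser.get_format_instructions(); the real text is much longer
format_instructions = (
    'Reply with JSON like {"question_text": "...", "choices": ["..."], "solution": 0}'
)

# Purpose 1: embed the instructions into the system prompt
system_template = "You write practice questions.\n\n{format_instructions}\n"
system_prompt = system_template.format(format_instructions=format_instructions)


@dataclass
class SingleSelectionSketch:
    """Hypothetical stand-in for the Pydantic `SingleSelection` class."""

    question_text: str
    choices: List[str]
    solution: int


def parse_reply(text: str) -> SingleSelectionSketch:
    """Purpose 2: turn the model's JSON reply into a typed object."""
    return SingleSelectionSketch(**json.loads(text))


reply = (
    '{"question_text": "What does `.loc` do?", '
    '"choices": ["Select by label", "Select by position"], "solution": 0}'
)
question = parse_reply(reply)
print(question.choices[question.solution])  # Select by label
```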

In [3]:
from langchain.output_parsers import PydanticOutputParser
from question_generator_model import (
    MultipleSelection, 
    SingleSelection, 
    Code, 
    FillInBlank
)

In [2]:
# create output parsers

ms_parser = PydanticOutputParser(pydantic_object=MultipleSelection)
code_parser = PydanticOutputParser(pydantic_object=Code)
ss_parser = PydanticOutputParser(pydantic_object=SingleSelection)
fib_parser = PydanticOutputParser(pydantic_object=FillInBlank)


Let's see the format instructions for the `fib_parser`:

In [63]:
print(fib_parser.get_format_instructions())

The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:
```
{"description": "    Question type where the student is given a main question and then\n    a code block with \"blanks\" (represented by `___X` in the source).\n    The student must provide one string per blank. Correctness is evaluated\n    based on a Python test suite based on the following template:\n\n    \n    ```python\n    {setup_code}\n\n    {code_block_with_blanks_filled_in}\n\n    {test_code}\n    ```\n\n    There must be at least one `___X` (one blank) in `starting_code`\n\n\n    Examples\n    --------\n    {\n      \"question_text\

## Example for each question type

In [52]:
common_system_prompt = """You are a smart, helpful teaching assistant chatbot named Callisto.

You assist professors that teach courses about Python, data science, and machine learning
to graduate students.

You have 5+ years of experience writing Python code to do a variety of tasks. 

Your responses typically include examples of datasets or code snippets.

For each message you will be given two inputs

topic: string
difficulty: integer

Your task is to produce practice questions to help students solidify their understanding of the provided topic

The difficulty will be a number between 1 and 3, with 1 corresponding to a request for an easy question, and 3 for the most difficult question.

If the user asks you for another question and does not specify either a new topic or a new difficulty, you must use the previous topic or difficulty

Your responses must always exactly match the specified format with no extra words or content.


{format_instructions}
"""

In [53]:
from typing import List, Dict, Any

import dotenv
dotenv.load_dotenv("/home/jupyteach-msda/jupyteach-ai/.env")

from langchain.prompts.chat import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate,
    MessagesPlaceholder,
)
from langchain.schema import LLMResult
from langchain.chains import LLMChain
from langchain.memory import ConversationBufferMemory
from langchain.output_parsers import PydanticOutputParser
from langchain.chat_models import ChatOpenAI


class JupyteachQuestionChain(LLMChain):
    """
    Subclass of `LLMChain` that is necessary for memory and `PydanticOutputParser` to work at the same time.
    
    Notice that we set `ConversationBufferMemory.output_key` to `"original_text_response"`
    and we use `"original_text_response"` as a key in `create_outputs` below.
    """

    def create_outputs(self, llm_result: LLMResult) -> List[Dict[str, Any]]:
        out = super().create_outputs(llm_result)
        return [
            {**d, "original_text_response": g[0].text}
             for (d, g) in zip(out, llm_result.generations)
        ]


def build_llm_for_pydantic_model(model_class):
    parser = PydanticOutputParser(pydantic_object=model_class)
    system = SystemMessagePromptTemplate.from_template(common_system_prompt)
    human = HumanMessagePromptTemplate.from_template("{input}")
    
    prompt = ChatPromptTemplate(
        messages = [system, MessagesPlaceholder(variable_name="history"), human],
        partial_variables={"format_instructions": parser.get_format_instructions()},
        # output_parser=parser,
    )
    
    model = ChatOpenAI(temperature=0.4)
    
    memory = ConversationBufferMemory(
        memory_key="history", 
        return_messages=True,
        output_key="original_text_response",
    )
    return JupyteachQuestionChain(
        memory=memory,
        llm=model,
        prompt=prompt,
        output_parser=parser,
        output_key="question",
        return_final_only=False,
    )
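As we read the `create_outputs` override above, its job is to keep the raw text of each LLM reply next to the parsed question, so that the memory (which is configured with `output_key="original_text_response"`) has a string to store in the conversation history. The sketch below reproduces just that merging step with stand-in objects. `attach_raw_text` is a hypothetical name, and `SimpleNamespace` is used here only to mimic a generation object with a `.text` attribute.

```python
from types import SimpleNamespace
from typing import Any, Dict, List


def attach_raw_text(
    parsed_outputs: List[Dict[str, Any]], generations: List[List[Any]]
) -> List[Dict[str, Any]]:
    """Add the first generation's raw text to each parsed output dict."""
    return [
        {**d, "original_text_response": g[0].text}
        for d, g in zip(parsed_outputs, generations)
    ]


# Stand-ins: one parsed output and the matching list of generations
parsed_outputs = [{"question": "<parsed SingleSelection object>"}]
generations = [[SimpleNamespace(text='{"question_text": "..."}')]]

merged = attach_raw_text(parsed_outputs, generations)
print(sorted(merged[0].keys()))  # ['original_text_response', 'question']
```

With this layout, callers read the parsed object from `"question"` while the history keeps the original JSON text.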

In [59]:
parser = PydanticOutputParser(pydantic_object=SingleSelection)
system = SystemMessagePromptTemplate.from_template(common_system_prompt)
human = HumanMessagePromptTemplate.from_template("{input}")
print(parser.get_format_instructions())

The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:
```
{"description": "    Question where user is presented a prompt in `question_text` and \n    a list of `choices`. They are supposed to provide the single best\n    answer (`solution`) as an integer, which is the index into `choices.\n\n    All questions must have a minimum of 2 options\n\n    Examples\n    --------\n    {\n      \"question_text\": \"What does `.loc` do?\n\nBelow is an example of how it might be used\n\n```python\ndf.loc[1995, \"NorthEast\"]\n```\",\n      \"difficulty\": 2,\n      \"topics\": [\"pandas\", \"loc\", \"indexing\"]

In [67]:
ss_chain = build_llm_for_pydantic_model(SingleSelection)
q_ss = ss_chain.invoke(input="difficulty: 1\ntopic: scikit-learn LinearRegression")
q_ss["question"]

What is the purpose of the `fit` method in scikit-learn's LinearRegression?

- [x] To train the linear regression model on the given training data
- [ ] To make predictions using the trained linear regression model
- [ ] To evaluate the performance of the linear regression model
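The input string passed to `invoke` follows the `difficulty` / `topic` layout described in the system prompt. Because the string is typed by hand, a misspelled key is easy to miss. A small helper like the one below can build it consistently; `build_request` is a hypothetical name and is not part of the Jupyteach code.

```python
def build_request(topic: str, difficulty: int) -> str:
    """Build the 'difficulty: N' / 'topic: ...' input expected by the prompt."""
    # The system prompt defines difficulty as a number between 1 and 3
    if difficulty not in (1, 2, 3):
        raise ValueError("difficulty must be 1, 2, or 3")
    return f"difficulty: {difficulty}\ntopic: {topic}"


request = build_request("scikit-learn LinearRegression", difficulty=1)
print(request)
```

The result could then be passed along as `ss_chain.invoke(input=request)`.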


In [68]:
ms_chain = build_llm_for_pydantic_model(MultipleSelection)
q_ms = ms_chain.invoke(input="difficulty: 3\ntopic: pandas vs numpy")
q_ms["question"]

What are the main differences between pandas and numpy?

- [x] Pandas is primarily used for data manipulation and analysis, while numpy is used for numerical computing and mathematical operations.
- [x] Pandas provides a DataFrame object that allows for easy handling of structured data, while numpy provides multi-dimensional arrays for efficient storage and manipulation of homogeneous data.
- [x] Pandas has built-in support for handling missing data, while numpy does not.
- [ ] Numpy is faster than pandas for numerical computations because it is implemented in C.
- [ ] Both pandas and numpy are open-source libraries for Python.


In [36]:
code_chain = build_llm_for_pydantic_model(Code)
q_code = code_chain.invoke(input="difficulty: 2\ntopic: pandas reshaping")
q_code["question"]

How would you reshape the following DataFrame `df` so that the columns become rows and the rows become columns?

```
   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9
```

Save the reshaped DataFrame to a variable called `df_reshaped`

```python
df_reshaped = ...
```

**Solution**

```python
df_reshaped = df.T
```

**Test Suite**

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

df_reshaped = df.T

assert df_reshaped.shape == (3, 3)
assert df_reshaped.columns.tolist() == ['A', 'B', 'C']
assert df_reshaped.index.tolist() == [0, 1, 2]
assert df_reshaped.values.tolist() == [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

In [43]:
fib_chain = build_llm_for_pydantic_model(FillInBlank)
q_fib = fib_chain.invoke(input="difficulty: 3\ntopic: python for loops")
q_fib["question"]

Write a Python for loop that iterates over a list of numbers and prints the square of each number.

```python
numbers = [1, 2, 3, 4, 5]

for number in numbers:
    ___X
```

**Solution**

`[print(number ** 2)]`

**Rendered Solution**

```python
numbers = [1, 2, 3, 4, 5]

for number in numbers:
    print(number ** 2)
```

**Test Suite**

```python


numbers = [1, 2, 3, 4, 5]

for number in numbers:
    print(number ** 2)

import io
import sys

# Redirect stdout to capture printed output
stdout = sys.stdout
sys.stdout = io.StringIO()

numbers = [1, 2, 3, 4, 5]

for number in numbers:
    print(number ** 2)

# Get the printed output
output = sys.stdout.getvalue()

# Reset stdout
sys.stdout = stdout

# Check if the output is correct
assert output == '1\n4\n9\n16\n25\n'
```

In [None]:
#Steph

In [27]:
ss_chain = build_llm_for_pydantic_model(SingleSelection)
q_ss = ss_chain.invoke(input="difficulty: 1\ntopic: python for loops")
q_ss["question"]

What is the purpose of a for loop in Python?

- [ ] To repeat a block of code a specific number of times
- [x] To iterate over a sequence of elements
- [ ] To define a function


In [29]:
ss_chain = build_llm_for_pydantic_model(SingleSelection)
q_ss = ss_chain.invoke(input="difficulty: 3\ntopic: groupby")
q_ss["question"]

What is the purpose of the `groupby` function in pandas?

- [x] To split a DataFrame into groups based on specified criteria
- [ ] To combine multiple DataFrames into a single DataFrame
- [ ] To sort a DataFrame in ascending order


In [None]:
###Prompt1

In [30]:
prompt1 = """For each message you will be given three inputs

number of questions to generate: integer
topic: string
difficulty: integer

"""

In [31]:
parser = PydanticOutputParser(pydantic_object=SingleSelection)
system = SystemMessagePromptTemplate.from_template(prompt1)
human = HumanMessagePromptTemplate.from_template("{input}")
print(parser.get_format_instructions())

The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:
```
{"description": "    Question where user is presented a prompt in `question_text` and \n    a list of `choices`. They are supposed to provide the single best\n    answer (`solution`) as an integer, which is the index into `choices.\n\n    All questions must have a minimum of 3 options\n\n    Examples\n    --------\n    {\n      \"question_text\": \"What does `.loc` do?\n\nBelow is an example of how it might be used\n\n```python\ndf.loc[1995, \"NorthEast\"]\n```\",\n      \"difficulty\": 2,\n      \"topics\": [\"pandas\", \"loc\", \"indexing\"]

In [None]:
fib_chain = build_llm_for_pydantic_model(FillInBlank)
q_fib = fib_chain.invoke(input="difficulty: 3\ntopic: python for loops")
q_fib["question"]