In [1]:
from notebook_helpers import mprint

# Prompt Templates

A prompt template is a reusable way to build prompts for a chat model, based on predefined use cases.

# DocSearchTemplate

A document search template is an object that searches a `DocumentIndex` based on a query and inserts `n_docs` documents into the prompt, along with instructions asking the model to answer the question using the information in those documents.

In [2]:
from llm_workflow.indexes import ChromaDocumentIndex
from llm_workflow.prompt_templates import DocSearchTemplate

doc_index = ChromaDocumentIndex()
prompt_template = DocSearchTemplate(doc_index=doc_index, n_docs=1)

Here's the default prompt-template used by `DocSearchTemplate`:

In [3]:
mprint(prompt_template.template)


Answer the question at the end of the text as truthfully and accurately as possible, based on the following information provided.

Here is the information:

```
{{documents}}
```

Here is the question:

```
{{prompt}}
```


Let's add documents to our document index.

**If we pass a list of documents to `doc_index`, the `__call__` method will pass the list to the `add()` method. If we pass a string or `Document` to `doc_index`, the `__call__` method will pass the value to the `search()` method.**

In [4]:
from llm_workflow.base import Document

docs = [
    Document(
        content="The greatest basketball player of all time is Michael Jordan",
        metadata={'id': 1}
    ),
    Document(
        content="The greatest three point shooter of all time is Steph Curry.",
        metadata={'id': 0}
    ),
    Document(
        content="The greatest hockey player of all time is Wayne Gretzky.",
        metadata={'id': 2}
    ),
]
# passing list[Document] is equivalent to calling `doc_index.add(docs)`
doc_index(docs)

In [5]:
# passing a string (or Document) is equivalent to calling `doc_index.search(value)`
doc_index("Who is the greatest 3-point shooter of all time?", n_results=1)

[Document(content='The greatest three point shooter of all time is Steph Curry.', metadata={'id': 0, 'distance': 0.35710838437080383})]
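
The call-based dispatch described above (a list goes to `add()`, a string or `Document` goes to `search()`) can be sketched with a toy class. This is a minimal illustration only: `ToyDocument`, `ToyIndex`, and the word-overlap ranking are hypothetical names and logic, not the `ChromaDocumentIndex` implementation, which uses embedding distances.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDocument:
    """Hypothetical stand-in for `Document`."""
    content: str
    metadata: dict = field(default_factory=dict)


class ToyIndex:
    """Hypothetical in-memory index; NOT how ChromaDocumentIndex works internally."""

    def __init__(self):
        self._docs = []

    def add(self, docs):
        self._docs.extend(docs)

    def search(self, value, n_results=3):
        # Naive relevance: number of words shared by the query and each document.
        query = value.content if isinstance(value, ToyDocument) else value
        words = set(query.lower().split())
        ranked = sorted(
            self._docs,
            key=lambda d: len(words & set(d.content.lower().split())),
            reverse=True,
        )
        return ranked[:n_results]

    def __call__(self, value, n_results=3):
        # A list is treated as documents to add; anything else is treated as a query.
        if isinstance(value, list):
            self.add(value)
            return None
        return self.search(value, n_results=n_results)


index = ToyIndex()
index([
    ToyDocument("Michael Jordan played basketball"),
    ToyDocument("Wayne Gretzky played hockey"),
])
results = index("Who played hockey", n_results=1)
print(results[0].content)  # Wayne Gretzky played hockey
```

Dispatching on the argument type keeps the calling convention uniform, so the same object can be used both when loading documents and when querying.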

Now, let's construct our prompt. The `DocSearchTemplate` object will retrieve the most relevant document (from the `ChromaDocumentIndex` object) based on the value we send it, and then inject that document into the prompt. Because we set `n_docs=1` above, it will only include one Document.

In [6]:
prompt = prompt_template("Who is the greatest 3-point shooter of all time?")
mprint(prompt)


Answer the question at the end of the text as truthfully and accurately as possible, based on the following information provided.

Here is the information:

```
The greatest three point shooter of all time is Steph Curry.
```

Here is the question:

```
Who is the greatest 3-point shooter of all time?
```
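
As a rough sketch of the substitution step shown above, the `{{documents}}` and `{{prompt}}` placeholders can be filled with plain string replacement. The helper `fill_template` and the abbreviated `TEMPLATE` below are assumptions made for illustration; the source does not show how `DocSearchTemplate` performs the substitution internally.

```python
# Abbreviated version of the default template; the real one also wraps
# each placeholder in a fenced block.
TEMPLATE = """Here is the information:

{{documents}}

Here is the question:

{{prompt}}
"""


def fill_template(template, documents, prompt):
    """Hypothetical helper: join the retrieved documents and fill both placeholders."""
    joined = "\n\n".join(documents)
    return template.replace("{{documents}}", joined).replace("{{prompt}}", prompt)


filled = fill_template(
    TEMPLATE,
    documents=["The greatest three point shooter of all time is Steph Curry."],
    prompt="Who is the greatest 3-point shooter of all time?",
)
print(filled)
```

With `n_docs` greater than 1, the retrieved documents would simply be joined into the same `{{documents}}` slot.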


---

# PythonObjectMetadataTemplate 

In [7]:
from llm_workflow.base import Workflow
from llm_workflow.openai import OpenAIChat
from llm_workflow.prompt_templates import PythonObjectMetadataTemplate
import pandas as pd
from notebook_helpers import mprint

credit = pd.read_csv('/code/tests/test_data/data/credit.csv')

prompt_template = PythonObjectMetadataTemplate(objects={'my_credit_df': credit})
model = OpenAIChat()
workflow = Workflow(tasks=[prompt_template, model])

prompt = "Create a graph using @my_credit_df and plotly express of checking account balance and " \
    "the duration of the loan.?"
response = workflow(prompt)
mprint(response)

To create a graph using the `my_credit_df` DataFrame and Plotly Express, you can use the following code:

```python
import plotly.express as px

fig = px.scatter(my_credit_df, x='checking_balance', y='months_loan_duration')
fig.show()
```

This code will create a scatter plot with the checking account balance on the x-axis and the duration of the loan on the y-axis.

### Variables Used:

In [8]:
prompt_template._extracted_variables_last_call

{'my_credit_df'}
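
One way to picture how `@my_credit_df` in the prompt maps to a registered object is a regular-expression scan for `@name` tokens, intersected with the names that were registered. The function `extract_references` is hypothetical; the library's actual matching rules are not shown in this notebook.

```python
import re


def extract_references(prompt, known_names):
    """Hypothetical helper: return the registered names mentioned as @name in the prompt."""
    mentioned = set(re.findall(r"@(\w+)", prompt))
    return mentioned & set(known_names)


registered = {"my_credit_df"}
used = extract_references(
    "Create a graph using @my_credit_df and plotly express", registered
)
print(used)  # {'my_credit_df'}

# A prompt without any @name reference matches nothing.
print(extract_references("thank you", registered))  # set()
```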

### Prompt Template Used:

In [9]:
mprint(workflow.history()[0].prompt)


Answer the question at the end of the text as truthfully and accurately as possible. Use the metadata of the python objects as appropriate. Tailor your response according to the most relevant objects. Don't use the if they don't appear relevant.

Here is the metadata:

```
A pd.DataFrame `my_credit_df` that contains the following numeric and non-numeric columns:


Here are the numeric columns and corresponding summary statistics:

                       count      mean          std    min     25%     50%  \
months_loan_duration  1000.0    20.903    12.058814    4.0    12.0    18.0   
amount                1000.0  3271.258  2822.736876  250.0  1365.5  2319.5   
percent_of_income     1000.0     2.973     1.118715    1.0     2.0     3.0   
years_at_residence    1000.0     2.845     1.103718    1.0     2.0     3.0   
age                   1000.0    35.546    11.375469   19.0    27.0    33.0   
existing_loans_count  1000.0     1.407     0.577654    1.0     1.0     1.0   
dependents            1000.0     1.155     0.362086    1.0     1.0     1.0   

                          75%      max  
months_loan_duration    24.00     72.0  
amount                3972.25  18424.0  
percent_of_income        4.00      4.0  
years_at_residence       4.00      4.0  
age                     42.00     75.0  
existing_loans_count     2.00      4.0  
dependents               1.00      2.0  

Here are the non-numeric columns and corresponding value counts:

`checking_balance`: {'unknown': 394, '< 0 DM': 274, '1 - 200 DM': 269, '> 200 DM': 63}
`credit_history`: {'good': 530, 'critical': 293, 'poor': 88, 'very good': 49, 'perfect': 40}
`purpose`: {'furniture/appliances': 473, 'car': 337, 'business': 97, 'education': 59, 'renovations': 22, 'car0': 12}
`savings_balance`: {'< 100 DM': 603, 'unknown': 183, '100 - 500 DM': 103, '500 - 1000 DM': 63, '> 1000 DM': 48}
`employment_duration`: {'1 - 4 years': 339, '> 7 years': 253, '4 - 7 years': 174, '< 1 year': 172, 'unemployed': 62}
`other_credit`: {'none': 814, 'bank': 139, 'store': 47}
`housing`: {'own': 713, 'rent': 179, 'other': 108}
`job`: {'skilled': 630, 'unskilled': 200, 'management': 148, 'unemployed': 22}
`phone`: {'no': 596, 'yes': 404}
`default`: {'no': 700, 'yes': 300}


Use both the numeric and non-numeric columns as appropriate.
```

----

Here is the question:

```
Create a graph using @my_credit_df and plotly express of checking account balance and the duration of the loan.?
```


In [10]:
print(f"Total Cost:           ${model.cost:.5f}")
print(f"Total Tokens:          {model.total_tokens:,}")
print(f"Total Prompt Tokens:   {model.input_tokens:,}")
print(f"Total Response Tokens: {model.response_tokens:,}")

Total Cost:           $0.00103
Total Tokens:          942
Total Prompt Tokens:   856
Total Response Tokens: 86


---

In [11]:
from llm_workflow.base import Workflow
from llm_workflow.openai import OpenAIChat
from llm_workflow.prompt_templates import PythonObjectMetadataTemplate
import pandas as pd
from notebook_helpers import mprint

credit = pd.read_csv('/code/tests/test_data/data/credit.csv')

prompt_template = PythonObjectMetadataTemplate()
model = OpenAIChat()
workflow = Workflow(tasks=[prompt_template, model])

# we can add objects in the constructor or dynamically
prompt_template.add_object('my_df', credit)

prompt = "Using @my_df, which columns are going to be the best predictors of `default`? " \
    "Use these columns to create a logistic regression model using statsmodels in python. " \
    "Process numeric and non-numeric columns appropriately. " \
    "Print the summary of all of the coefficients."

response = workflow(prompt)
mprint(response)

To determine the best predictors of `default` using the `my_df` DataFrame, we can use the logistic regression model from the statsmodels library in Python. We will include both numeric and non-numeric columns in the model.

First, we need to process the non-numeric columns appropriately by encoding them into dummy variables. Then, we can create the logistic regression model and print the summary of all the coefficients.

Here's a sample code to achieve this:

```python
import pandas as pd
import statsmodels.api as sm

# Process non-numeric columns by encoding them into dummy variables
my_df_processed = pd.get_dummies(my_df, columns=['checking_balance', 'credit_history', 'purpose', 'savings_balance', 'employment_duration', 'other_credit', 'housing', 'job', 'phone'])

# Define the independent variables (predictors) and the dependent variable
X = my_df_processed.drop('default', axis=1)
y = my_df['default']

# Add a constant to the independent variables
X = sm.add_constant(X)

# Create the logistic regression model
logit_model = sm.Logit(y, X)

# Fit the model
logit_result = logit_model.fit()

# Print the summary of all the coefficients
print(logit_result.summary())
```

This code processes the non-numeric columns by encoding them into dummy variables, creates the logistic regression model, fits the model, and then prints the summary of all the coefficients.

The best predictors of `default` can be determined by examining the coefficients in the summary output. The columns with higher absolute coefficient values are likely to be the best predictors of `default`.

### Variables Used:

In [12]:
prompt_template._extracted_variables_last_call

{'my_df'}

### Prompt Template Used:

In [13]:
mprint(workflow.history()[0].prompt)


Answer the question at the end of the text as truthfully and accurately as possible. Use the metadata of the python objects as appropriate. Tailor your response according to the most relevant objects. Don't use the if they don't appear relevant.

Here is the metadata:

```
A pd.DataFrame `my_df` that contains the following numeric and non-numeric columns:


Here are the numeric columns and corresponding summary statistics:

                       count      mean          std    min     25%     50%  \
months_loan_duration  1000.0    20.903    12.058814    4.0    12.0    18.0   
amount                1000.0  3271.258  2822.736876  250.0  1365.5  2319.5   
percent_of_income     1000.0     2.973     1.118715    1.0     2.0     3.0   
years_at_residence    1000.0     2.845     1.103718    1.0     2.0     3.0   
age                   1000.0    35.546    11.375469   19.0    27.0    33.0   
existing_loans_count  1000.0     1.407     0.577654    1.0     1.0     1.0   
dependents            1000.0     1.155     0.362086    1.0     1.0     1.0   

                          75%      max  
months_loan_duration    24.00     72.0  
amount                3972.25  18424.0  
percent_of_income        4.00      4.0  
years_at_residence       4.00      4.0  
age                     42.00     75.0  
existing_loans_count     2.00      4.0  
dependents               1.00      2.0  

Here are the non-numeric columns and corresponding value counts:

`checking_balance`: {'unknown': 394, '< 0 DM': 274, '1 - 200 DM': 269, '> 200 DM': 63}
`credit_history`: {'good': 530, 'critical': 293, 'poor': 88, 'very good': 49, 'perfect': 40}
`purpose`: {'furniture/appliances': 473, 'car': 337, 'business': 97, 'education': 59, 'renovations': 22, 'car0': 12}
`savings_balance`: {'< 100 DM': 603, 'unknown': 183, '100 - 500 DM': 103, '500 - 1000 DM': 63, '> 1000 DM': 48}
`employment_duration`: {'1 - 4 years': 339, '> 7 years': 253, '4 - 7 years': 174, '< 1 year': 172, 'unemployed': 62}
`other_credit`: {'none': 814, 'bank': 139, 'store': 47}
`housing`: {'own': 713, 'rent': 179, 'other': 108}
`job`: {'skilled': 630, 'unskilled': 200, 'management': 148, 'unemployed': 22}
`phone`: {'no': 596, 'yes': 404}
`default`: {'no': 700, 'yes': 300}


Use both the numeric and non-numeric columns as appropriate.
```

----

Here is the question:

```
Using @my_df, which columns are going to be the best predictors of `default`? Use these columns to create a logistic regression model using statsmodels in python. Process numeric and non-numeric columns appropriately. Print the summary of all of the coefficients.
```


In [14]:
print(f"Total Cost:           ${model.cost:.5f}")
print(f"Total Tokens:          {model.total_tokens:,}")
print(f"Total Prompt Tokens:   {model.input_tokens:,}")
print(f"Total Response Tokens: {model.response_tokens:,}")

Total Cost:           $0.00155
Total Tokens:          1,219
Total Prompt Tokens:   883
Total Response Tokens: 336


---

In [15]:
workflow("thank you")

"You're welcome! If you have any more questions or need further assistance, feel free to ask. Good luck with your logistic regression model!"

In [16]:
# no metadata is added here, even though we are using the same prompt template (the prompt references no objects)
workflow.history()[-1].prompt

'thank you'

In [17]:
print(f"Total Cost:           ${model.cost:.5f}")
print(f"Total Tokens:          {model.total_tokens:,}")
print(f"Total Prompt Tokens:   {model.input_tokens:,}")
print(f"Total Response Tokens: {model.response_tokens:,}")

Total Cost:           $0.00284
Total Tokens:          2,476
Total Prompt Tokens:   2,112
Total Response Tokens: 364
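
The totals printed above appear to be cumulative for the `model` object, so the usage of the "thank you" call alone can be recovered by subtracting the In [14] totals from the In [17] totals. The numbers below are copied from those two outputs; the interpretation in the closing note is an assumption, not something the notebook states.

```python
# Cumulative totals copied from the In [14] and In [17] outputs.
after_second_call = {"cost": 0.00155, "total": 1219, "prompt": 883, "response": 336}
after_third_call = {"cost": 0.00284, "total": 2476, "prompt": 2112, "response": 364}

# Usage attributable to the "thank you" call only.
last_call = {
    key: after_third_call[key] - after_second_call[key]
    for key in after_second_call
}

print(f"Cost:            ${last_call['cost']:.5f}")  # $0.00129
print(f"Total tokens:    {last_call['total']:,}")    # 1,257
print(f"Prompt tokens:   {last_call['prompt']:,}")   # 1,229
print(f"Response tokens: {last_call['response']:,}") # 28
```

The two-word prompt still accounts for 1,229 prompt tokens, which is close to the 1,219 total tokens of the previous exchange. A plausible reading is that the earlier messages are resent as conversation history, so follow-up calls are not free even when no metadata is injected.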


---