# Lesson 2: Interacting with CSV Data

## Set up and connect to the Azure OpenAI endpoint

**Note**: The pre-configured cloud resource grants you access to the Azure OpenAI GPT model. The key and endpoint provided below are intended for teaching purposes only. Your notebook environment is already set up with the necessary keys, which may differ from those the instructor used during filming.

In [None]:
import os 
import pandas as pd

from IPython.display import Markdown, HTML, display
from langchain.schema import HumanMessage
from langchain_openai import AzureChatOpenAI

model = AzureChatOpenAI(
    openai_api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
    azure_deployment=os.getenv("AZURE_OPENAI_CHAT_DEPLOYMENT_NAME"))
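The client above is configured from environment variables, and `os.getenv` silently returns `None` when a variable is missing. A minimal sketch of a pre-flight check follows. The helper name `missing_env_vars` is hypothetical, and the list covers only the two variables read in the cell above; any key or endpoint variables the client also relies on are not shown in this notebook and are therefore not listed.

```python
import os

# The two variables read by the cell above.
REQUIRED_VARS = (
    "AZURE_OPENAI_API_VERSION",
    "AZURE_OPENAI_CHAT_DEPLOYMENT_NAME",
)


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the environment (hypothetical helper)."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Running a check like this before constructing the model tends to give a clearer error than a failed request later on.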

## Load the dataset

In [10]:

import requests

# Create the data directory if it doesn't exist
os.makedirs("data", exist_ok=True)

# Download the file and stop early if the request failed
url = "https://covidtracking.com/data/download/all-states-history.csv"
response = requests.get(url)
response.raise_for_status()

# Save the file
with open("./data/all-states-history.csv", "wb") as f:
    f.write(response.content)

# Load the data into a pandas DataFrame, replacing missing values with 0
file_url = "./data/all-states-history.csv"
df = pd.read_csv(file_url).fillna(value=0)


In [11]:
df = pd.read_csv("./data/all-states-history.csv").fillna(value=0)
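Replacing every missing value with 0 at load time does not change sums, because pandas skips NaN when summing by default. It does change statistics such as the mean, since unreported days are then counted as zeros. A small sketch on a made-up three-row frame (the values are illustrative, not taken from the dataset) shows the difference:

```python
import pandas as pd

# Illustrative stand-in for the real CSV; the values are made up.
sample = pd.DataFrame(
    {
        "state": ["TX", "TX", "TX"],
        "hospitalizedIncrease": [10.0, None, 5.0],
    }
)

filled = sample.fillna(value=0)

# Sums agree, because NaN is skipped by default when summing.
sum_raw = sample["hospitalizedIncrease"].sum()
sum_filled = filled["hospitalizedIncrease"].sum()

# Means differ: the raw mean averages two values, the filled mean averages three.
mean_raw = sample["hospitalizedIncrease"].mean()
mean_filled = filled["hospitalizedIncrease"].mean()

print(sum_raw, sum_filled, mean_raw, mean_filled)
```

This is worth keeping in mind when asking the agent for averages rather than totals.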

## Prepare the LangChain dataframe agent

In [12]:
from langchain.agents.agent_types import AgentType
from langchain_experimental.agents.agent_toolkits import create_pandas_dataframe_agent

agent = create_pandas_dataframe_agent(llm=model, df=df, verbose=True)

agent.invoke("how many rows are there?")



> Entering new AgentExecutor chain...
Thought: To determine the number of rows in the dataframe `df`, I can use the `.shape` attribute which returns a tuple where the first element represents the number of rows.

Action: Use the `.shape` attribute to get the number of rows in `df`.
Action Input: df.shape
Observation: Use the `.shape` attribute to get the number of rows in `df`. is not a valid tool, try one of [python_repl_ast].
Thought: I will now use the Python shell to determine the number of rows in the dataframe `df`.

Action: Use the `.shape` attribute to get the number of rows in `df`.
Action Input: df.shape
Observation: Use the `.shape` attribute to get the number of rows in `df`. is not a valid tool, try one of [python_repl_ast].
Thought: I will now use the Python shell to determine the number of rows in the dataframe `df`.

Action: python_repl_ast
Action Input: df.shape
Observation: (20780, 41)
Thought

{'input': 'how many rows are there?',
 'output': 'The dataframe `df` contains 20,780 rows.'}
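The agent's answer can be cross-checked without the LLM, since the row count is available directly from pandas. A minimal sketch on a stand-in frame follows; `df_demo` and its `positive` column are placeholders, and in the notebook the loaded `df` would be used instead.

```python
import pandas as pd

# Stand-in for the notebook's `df`; column names and values are placeholders.
df_demo = pd.DataFrame({"state": ["TX", "CA", "NY"], "positive": [1, 2, 3]})

# Two independent ways to count rows; they should agree.
rows_from_shape = df_demo.shape[0]
rows_from_len = len(df_demo)
print(rows_from_shape, rows_from_len)
```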

## Design your prompt and ask your question

In [13]:
CSV_PROMPT_PREFIX = """
First set the pandas display options to show all the columns,
get the column names, then answer the question.
"""

CSV_PROMPT_SUFFIX = """
- **ALWAYS** before giving the Final Answer, try another method.
Then reflect on the answers from the two methods and ask yourself
whether they answer the original question correctly.
If you are not sure, try another method.
- If the methods tried do not give the same result, reflect and
try again until you have two methods that give the same result.
- If you still cannot arrive at a consistent result, say that
you are not sure of the answer.
- If you are sure of the correct answer, create a beautiful
and thorough response using Markdown.
- **DO NOT MAKE UP AN ANSWER OR USE PRIOR KNOWLEDGE,
ONLY USE THE RESULTS OF THE CALCULATIONS YOU HAVE DONE**.
- **ALWAYS**, as part of your "Final Answer", explain how you got
to the answer in a section that starts with: "\n\nExplanation:\n".
In the explanation, mention the column names that you used to get
to the final answer.
"""

# Parentheses are needed so the adjacent string literals join into one question
QUESTION = (
    "How many patients were hospitalized during July 2020 "
    "in Texas, and nationwide as the total of all states? "
    "Use the hospitalizedIncrease column."
)


agent.invoke(CSV_PROMPT_PREFIX + QUESTION + CSV_PROMPT_SUFFIX)
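For reference, the quantity the question asks for can also be computed directly in pandas, which gives a baseline to compare against the agent's answer. The sketch below runs on a small made-up frame. `hospitalizedIncrease` comes from the question, while the `date` and `state` column names, the `"TX"` state code, and the helper name `july_2020_hospitalizations` are assumptions that should be checked against `df.columns` before use on the real data.

```python
import pandas as pd

# Made-up rows for illustration; not real figures from the dataset.
demo = pd.DataFrame(
    {
        "date": ["2020-06-30", "2020-07-01", "2020-07-15", "2020-07-15", "2020-08-01"],
        "state": ["TX", "TX", "TX", "CA", "TX"],
        "hospitalizedIncrease": [7, 10, 20, 5, 9],
    }
)


def july_2020_hospitalizations(frame, state_code="TX"):
    """Sum hospitalizedIncrease over July 2020 for one state and for all states."""
    dates = pd.to_datetime(frame["date"])
    in_july = (dates >= pd.Timestamp("2020-07-01")) & (dates < pd.Timestamp("2020-08-01"))
    july = frame[in_july]
    state_total = july.loc[july["state"] == state_code, "hospitalizedIncrease"].sum()
    nationwide_total = july["hospitalizedIncrease"].sum()
    return int(state_total), int(nationwide_total)


texas, nationwide = july_2020_hospitalizations(demo)
print(texas, nationwide)
```

On the made-up rows, only the three July entries are counted, so the June and August rows do not affect either total.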



> Entering new AgentExecutor chain...
Thought:
First, I need to set the pandas display options to show all the columns. Then, I will extract the column names to understand the structure of the dataframe. Finally, I will filter the dataframe for records from July 2020 and sum the `hospitalized` column for the state of interest.

Action: Set pandas display options to show all columns and get the column names.
Action Input: 
```python
import pandas as pd

# Set display options to show all columns
pd.set_option('display.max_columns', None)

# Get the column names
df.columns
```
Observation: Set pandas display options to show all columns and get the column names. is not a valid tool, try one of [python_repl_ast].
Thought: I should use the `python_repl_ast` tool to set the display options and get the column names. This will help me understand the structure of the dataframe and identify the relevant columns for filtering and summing the hospitalizations