# Azure OpenAI - AlphaCentauri Project
## Step 1: Setting up your Azure & Langchain

In [7]:
import os
import pandas as pd
from IPython.display import Markdown, HTML, display
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv(override=True)


True

In [9]:

from openai import AzureOpenAI

# Azure OpenAI Configuration (CORRECT endpoint from Azure Portal)
#endpoint = "https://js-alphacentauri-resource.cognitiveservices.azure.com/"
endpoint ="https://js-alphacentauri-resource.openai.azure.com/"
deployment = "gpt-4.1-mini"
api_version = "2024-12-01-preview"

# Get API key from environment
subscription_key = os.getenv("AZURE_OPENAI_API_KEY")

# Create Azure OpenAI client
client = AzureOpenAI(
    api_version=api_version,
    azure_endpoint=endpoint,
    api_key=subscription_key,
)

print("✅ Azure OpenAI client configured")
print(f"Endpoint: {endpoint}")
print(f"Deployment: {deployment}")
print(f"API Version: {api_version}")


✅ Azure OpenAI client configured
Endpoint: https://js-alphacentauri-resource.openai.azure.com/
Deployment: gpt-4.1-mini
API Version: 2024-12-01-preview


In [11]:
# Test the connection
response = client.chat.completions.create(
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant.",
        },
        {
            "role": "user",
            "content": "Say hello and tell me you're working! Tell me not to worry, everything will be fine.",
        }
    ],
    model=deployment
)

print(response.choices[0].message.content)


Hello! I'm here and working for you. Don't worry, everything will be fine! If you need any help, just let me know.


### LangChain Integration


In [24]:
from langchain_openai import AzureChatOpenAI
from langchain.schema import HumanMessage, SystemMessage

# Create LangChain model
model = AzureChatOpenAI(
    openai_api_version=api_version,
    azure_deployment=deployment,
    azure_endpoint=endpoint,
    api_key=subscription_key
)

# Test with LangChain
messages = [
    SystemMessage(content="You are a helpful assistant."),
    HumanMessage(content="Translate 'Hello, how are you?' to French and Spanish.")
]

response = model.invoke(messages)
print(response.content)


Sure! 

- French: Bonjour, comment ça va ?  
- Spanish: Hola, ¿cómo estás?


### Streaming Response Example


In [29]:
v_messages = [
    SystemMessage(content="You are a helpful assistant."),
    HumanMessage(content="Translate this sentence from English "
    "to French and Spanish. I like red cars and "
    "blue houses, but my dog is yellow.")
]   

## alternative way to stream the response
    # messages=[
    #     {
    #         "role": "system",
    #         "content": "You are a helpful assistant.",
    #     },
    #     {
    #         "role": "user",
    #         "content": "Translate this sentence from English "
    # "to French and Spanish. I like red cars and "
    # "blue houses, but my dog is yellow.",
    #     }
    # ],

# Stream the output for better user experience
response = client.chat.completions.create(
    stream=True,
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant.",
        },
        {
            "role": "user",
            "content": "Translate this sentence from English "
    "to French, West Vlaams and Spanish. I like red cars and "
    "blue houses, but my dog is yellow.",
        }
    ],
    model=deployment,
)

print("Streaming response:")
for update in response:
    if update.choices:
        print(update.choices[0].delta.content or "", end="")


Streaming response:
Sure! Here is the translation of the sentence "I like red cars and blue houses, but my dog is yellow." in French, West Vlaams, and Spanish:

French: J'aime les voitures rouges et les maisons bleues, mais mon chien est jaune.  
West Vlaams: Ik häd van rode wagens en blauwe huzen, moar mien hond is geel.  
Spanish: Me gustan los coches rojos y las casas azules, pero mi perro es amarillo.

### 🔍 Debug: Verify Deployment Name


In [27]:
model.invoke((v_messages))

AIMessage(content="Sure! Here is the translation:\n\nFrench: J'aime les voitures rouges et les maisons bleues, mais mon chien est jaune.  \nSpanish: Me gustan los coches rojos y las casas azules, pero mi perro es amarillo.", additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 48, 'prompt_tokens': 41, 'total_tokens': 89, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4.1-mini-2025-04-14', 'system_fingerprint': 'fp_3dcd5944f5', 'id': 'chatcmpl-CWlOQGWGdgQng9dd1MG9julAyDZP8', 'service_tier': None, 'prompt_filter_results': [{'prompt_index': 0, 'content_filter_results': {'hate': {'filtered': False, 'severity': 'safe'}, 'jailbreak': {'filtered': False, 'detected': False}, 'self_harm': {'filtered': False, 'severity': 'safe'}, 'sexual': {'filtered': False, 'severity': 'safe'}, '

## Step 2: Interacting with CSV Data

In [30]:
# Create LangChain model
model = AzureChatOpenAI(
    openai_api_version=api_version,
    azure_deployment=deployment,
    azure_endpoint=endpoint,
    api_key=subscription_key
)


### Load the dataset
- baseline model -->
- could use 
  a) fine tuning for sql tasks 
  b) RAG (database as a source)
    - use Langchain agents to connect to SQL Database or CSV file
    - via Azure OpenAI Assistants API (function calling + code interpreter) (stateful management + short term memory)
    - via Azure OpenAI Fucntion Calling: to perform tasks based on your questions
    - Vai Native Database API

In [31]:
#read in the data from the csv file
df = pd.read_csv("./data/all-states-history.csv").fillna(value = 0)

In [36]:
#check if the file is read 
df.head()

Unnamed: 0,date,state,death,deathConfirmed,deathIncrease,deathProbable,hospitalized,hospitalizedCumulative,hospitalizedCurrently,hospitalizedIncrease,...,totalTestResults,totalTestResultsIncrease,totalTestsAntibody,totalTestsAntigen,totalTestsPeopleAntibody,totalTestsPeopleAntigen,totalTestsPeopleViral,totalTestsPeopleViralIncrease,totalTestsViral,totalTestsViralIncrease
0,2021-03-07,AK,305.0,0.0,0,0.0,1293.0,1293.0,33.0,0,...,1731628.0,0,0.0,0.0,0.0,0.0,0.0,0,1731628.0,0
1,2021-03-07,AL,10148.0,7963.0,-1,2185.0,45976.0,45976.0,494.0,0,...,2323788.0,2347,0.0,0.0,119757.0,0.0,2323788.0,2347,0.0,0
2,2021-03-07,AR,5319.0,4308.0,22,1011.0,14926.0,14926.0,335.0,11,...,2736442.0,3380,0.0,0.0,0.0,481311.0,0.0,0,2736442.0,3380
3,2021-03-07,AS,0.0,0.0,0,0.0,0.0,0.0,0.0,0,...,2140.0,0,0.0,0.0,0.0,0.0,0.0,0,2140.0,0
4,2021-03-07,AZ,16328.0,14403.0,5,1925.0,57907.0,57907.0,963.0,44,...,7908105.0,45110,580569.0,0.0,444089.0,0.0,3842945.0,14856,7908105.0,45110


#### <span style="color: red">SOME ATTENTION REQUIRED</span> 
create_pandas_dataframe_agent() maak een “agent” die:
- Je dataframe (df) kent
- Je prompt interpreteert (“How many rows are there?”)
- Daarvoor zelf Python-code schrijft
- En die code uitvoert in een “Python REPL” (read–eval–print-loop) in jouw proces.
De “REPL” is in feite een mini-Python-console binnen jouw proces.
En de LLM (of een kwaadaardige prompt) kan daarin eender welke code uitvoeren.
Niet alleen veilige dingen zoals len(df), maar ook bijvoorbeeld:

--> gezien we controle hebben over de data hier kunnen we **allow_dangerous_code=True** toevoegen, MAAR GEBRUIK DIT NIET IN PRODUCTIE!


In [38]:
from langchain.agents.agent_types import AgentType
from langchain_experimental.agents.agent_toolkits import create_pandas_dataframe_agent

agent = create_pandas_dataframe_agent(llm=model,df=df,verbose=True, allow_dangerous_code=True)

agent.invoke("how many rows are there?")



[1m> Entering new AgentExecutor chain...[0m
[32;1m[1;3mThought: To determine the number of rows in the dataframe `df`, I can use the `len()` function or check the shape of the dataframe.

Action: python_repl_ast
Action Input: len(df)[0m[36;1m[1;3m20780[0m[32;1m[1;3mThought: The number of rows in the dataframe `df` is 20780.
Final Answer: There are 20780 rows in the dataframe `df`.[0m

[1m> Finished chain.[0m


{'input': 'how many rows are there?',
 'output': 'There are 20780 rows in the dataframe `df`.'}

In [39]:
agent.invoke("What can you tell me about the data?")



[1m> Entering new AgentExecutor chain...[0m
[32;1m[1;3mThought: I should analyze the dataframe structure and the data it contains. From the `print(df.head())` output, I can see the columns and a few first rows. The data contains COVID-19 related statistics, like deaths, hospitalization, testing results, and increases in numbers for various US states on a certain date. The date column shows the date of the data, and the state column shows the state abbreviation. There are numerical columns representing counts related to deaths (confirmed, probable, increases), hospitalizations, ICU cases, testing negative, positive, recoveries, etc. It appears to be a time series or snapshot of COVID-19 data by state.

Thought: I will summarize the key points about the data to provide a clear answer.

Final Answer: The dataframe `df` contains COVID-19 related data for different US states on specific dates. Each row corresponds to a state and date and includes various statistics such as the total n

{'input': 'What can you tell me about the data?',
 'output': 'The dataframe `df` contains COVID-19 related data for different US states on specific dates. Each row corresponds to a state and date and includes various statistics such as the total number of deaths (confirmed and probable), increases in deaths, hospitalizations (current, cumulative, increases), ICU and ventilator usage, testing results (negative, positive, antibody, antigen, viral tests), recoveries, and other related metrics. The data appears to be detailed state-level daily or periodic reports capturing both cumulative numbers and daily changes across multiple COVID-19 indicators.'}

In [40]:
CSV_PROMPT_PREFIX = """
First set the pandas display options to show all the columns,
get the column names, then answer the question.
"""

CSV_PROMPT_SUFFIX = """
- **ALWAYS** before giving the Final Answer, try another method.
Then reflect on the answers of the two methods you did and ask yourself
if it answers correctly the original question.
If you are not sure, try another method.
- If the methods tried do not give the same result,reflect and
try again until you have two methods that have the same result.
- If you still cannot arrive to a consistent result, say that
you are not sure of the answer.
- If you are sure of the correct answer, create a beautiful
and thorough response using Markdown.
- **DO NOT MAKE UP AN ANSWER OR USE PRIOR KNOWLEDGE,
ONLY USE THE RESULTS OF THE CALCULATIONS YOU HAVE DONE**.
- **ALWAYS**, as part of your "Final Answer", explain how you got
to the answer on a section that starts with: "\n\nExplanation:\n".
In the explanation, mention the column names that you used to get
to the final answer.
"""

QUESTION = "How may patients were hospitalized during July 2020" 
"in Texas, and nationwide as the total of all states?"
"Use the hospitalizedIncrease column" 


agent.invoke(CSV_PROMPT_PREFIX + QUESTION + CSV_PROMPT_SUFFIX)



[1m> Entering new AgentExecutor chain...[0m
[32;1m[1;3mQuestion: How many patients were hospitalized during July 2020?

Thought: I need to first set the display options to show all columns in pandas, then check all the column names to identify the columns related to hospitalization. After that, I should filter the dataframe to only include rows from July 2020 and sum the relevant hospitalization column(s). I will try two methods by using different hospitalization columns and compare the results.

Action: python_repl_ast
Action Input: |
    import pandas as pd
    pd.set_option('display.max_columns', None)
    colnames = df.columns.tolist()
    colnames[0m[36;1m[1;3mSyntaxError: invalid syntax (<unknown>, line 1)[0m[32;1m[1;3mThought: It seems from the observation that the input was interpreted as multi-line code with a syntax error possibly due to the environment. I will try to run the commands one by one and check the column names again.

Action: python_repl_ast
Action Inp

{'input': '\nFirst set the pandas display options to show all the columns,\nget the column names, then answer the question.\nHow may patients were hospitalized during July 2020\n- **ALWAYS** before giving the Final Answer, try another method.\nThen reflect on the answers of the two methods you did and ask yourself\nif it answers correctly the original question.\nIf you are not sure, try another method.\n- If the methods tried do not give the same result,reflect and\ntry again until you have two methods that have the same result.\n- If you still cannot arrive to a consistent result, say that\nyou are not sure of the answer.\n- If you are sure of the correct answer, create a beautiful\nand thorough response using Markdown.\n- **DO NOT MAKE UP AN ANSWER OR USE PRIOR KNOWLEDGE,\nONLY USE THE RESULTS OF THE CALCULATIONS YOU HAVE DONE**.\n- **ALWAYS**, as part of your "Final Answer", explain how you got\nto the answer on a section that starts with: "\n\nExplanation:\n".\nIn the explanation, 