# Skill 2: Data Analysis from a CSV File

For a Smart Search Engine or virtual assistant to answer any question about your corporate documents, it must understand tabular data: sources with tables, rows, and columns of numbers.
This is a different problem than simply looking for the most similar results. The usual pipeline of indexing, embedding, running a cosine-similarity semantic search, retrieving the top results, and summarizing an answer doesn't really apply here.
We are now dealing with tables in which rows and columns are related to each other, and answering a question may require all of the data, not just the top results.
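
To make the contrast concrete, here is a minimal sketch with a made-up four-row table (the column names and values are illustrative assumptions, not taken from the notebook's dataset). A retrieval step that keeps only a few rows produces an incomplete aggregate, while the same aggregate over the full table is correct.

```python
import pandas as pd

# Hypothetical toy table; names and numbers are invented for illustration.
tickets = pd.DataFrame({
    "account": ["Acme", "Acme", "Globex", "Acme"],
    "hours_open": [5.0, 12.0, 3.0, 8.0],
})

# A retrieval-style pipeline would hand the LLM only the k "best" rows.
# Here we mimic that by keeping the 2 rows with the highest hours_open.
top_k = tickets.nlargest(2, "hours_open")

# Aggregating over the retrieved subset misses rows that were not retrieved.
partial_total = top_k.loc[top_k["account"] == "Acme", "hours_open"].sum()

# Aggregating over the whole table gives the real answer.
full_total = tickets.loc[tickets["account"] == "Acme", "hours_open"].sum()

print("Total from top-k rows:", partial_total)
print("Total from all rows:  ", full_total)
```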

In this notebook, the goal is to show how to deal with this kind of use case. The dataset loaded in the cells below is a CSV export of service desk tickets with 175,866 rows and 31 columns, one row per ticket, with fields such as the account (`Cuenta`), the product (`Producto`), priority, and SLA times.

Imagine that many documents in a data lake are tabular, or that your use case is to ask questions in natural language to an LLM that needs to get its context from a CSV file or even a SQL database in order to answer. A GPT Smart Search Engine must know how to deal with these sources, understand the data, and answer accordingly.

In [2]:
import os
import pandas as pd
from langchain_openai import AzureChatOpenAI
from langchain.agents.agent_types import AgentType
from langchain_experimental.agents.agent_toolkits import create_csv_agent, create_pandas_dataframe_agent

from common.prompts import CSV_PROMPT_PREFIX

from IPython.display import Markdown, HTML, display  

from dotenv import load_dotenv
load_dotenv("credentials.env")

def printmd(string):
    display(Markdown(string))

In [None]:
# Set the ENV variables that Langchain needs to connect to Azure OpenAI
os.environ["OPENAI_API_VERSION"] = os.environ["AZURE_OPENAI_API_VERSION"]

## Download the dataset and load it into Pandas Dataframe

In [None]:
os.makedirs("data",exist_ok=True)

In [44]:
# The leading "!" runs wget as a shell command; without it the cell raises a SyntaxError
!wget "https://blobstoragedorak.blob.core.windows.net/cybersecurity/Tickets_ServiceDesk.xlsx - Info de tickets.csv" -P ./data/

In [4]:
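file_url = "./data/SD_Tickets - 2023_nocap.csv"
#file_url = "https://raw.githubusercontent.com/C4r105V4rg4S/TESTCSV/main/Tickets.csv"
# NOTE: accented column names come out garbled below (e.g. 'TÃ©cnico'), which suggests
# the file may actually be UTF-8; if so, encoding='utf-8' would decode them correctly.
# fillna(0) also writes 0 into text columns, not only numeric ones.
df = pd.read_csv(file_url, encoding='ISO-8859-1', low_memory=False).fillna(value=0)
print("Rows and Columns:", df.shape)
df.head()  # not rendered: only the last expression in a cell is displayed
print(df.dtypes)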

Rows and Columns: (175866, 31)
ID de la solicitud                  int64
Cuenta                             object
Asunto                             object
TÃ©cnico                           object
Grupo de soporte asignado          object
Producto                           object
CategorÃ­a 3                       object
Hora de creaciÃ³n                  object
Hora de finalizaciÃ³n              object
Estado de solicitud                object
Creado por                         object
Incidente de seguridad             object
Impacto                            object
Urgencia                           object
Prioridad                          object
Estado vencido                       bool
CÃ³digo de cierre de solicitud     object
Tiempo de resoluciÃ³n de SLA       object
Tiempo de respuesta de SLA         object
Tiempo transcurrido 1              object
Tipo de Incidencia                 object
Tipo de ticket                     object
Nombre de SLA                      object
Hor

In [57]:
df.columns

Index(['ID de la solicitud', 'Cuenta', 'Asunto', 'TÃ©cnico',
       'Grupo de soporte asignado', 'Producto', 'CategorÃ­a 3',
       'Hora de creaciÃ³n', 'Hora de finalizaciÃ³n', 'Estado de solicitud',
       'Creado por', 'Incidente de seguridad', 'Impacto', 'Urgencia',
       'Prioridad', 'Estado vencido', 'CÃ³digo de cierre de solicitud',
       'Tiempo de resoluciÃ³n de SLA', 'Tiempo de respuesta de SLA',
       'Tiempo transcurrido 1', 'Tipo de Incidencia', 'Tipo de ticket',
       'Nombre de SLA', 'Hora de resoluciÃ³n', 'Hrs SLA',
       'Tiempo transcurrido', 'NotificaciÃ³n', 'Unnamed: 27', 'Unnamed: 28',
       'Unnamed: 29', 'Unnamed: 30'],
      dtype='object')
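
The garbled accents in these column names look like UTF-8 text that was decoded as ISO-8859-1. As a hedged sketch of one way to repair such names after loading, the helper below re-encodes a string as Latin-1 and decodes it as UTF-8, leaving the name untouched when that round trip fails. `repair_mojibake` is a hypothetical helper written for this illustration, and the sample names are a small subset of the columns listed above.

```python
def repair_mojibake(name: str) -> str:
    """Undo UTF-8 text that was decoded as ISO-8859-1 (Latin-1)."""
    try:
        return name.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The name was not mis-decoded in this way; keep it unchanged.
        return name

# Column names as they appear after reading with encoding='ISO-8859-1'
# (written with escapes so the invisible soft hyphen is explicit).
raw_columns = ["T\u00c3\u00a9cnico", "Categor\u00c3\u00ada 3", "Cuenta", "Unnamed: 27"]

fixed_columns = [repair_mojibake(c) for c in raw_columns]

# 'Unnamed: N' is the name pandas assigns to a column whose header cell is empty.
named_columns = [c for c in fixed_columns if not c.startswith("Unnamed:")]

print(fixed_columns)
print(named_columns)
```

On the real dataframe this could be applied with `df.columns = [repair_mojibake(c) for c in df.columns]`. If the file really is UTF-8, re-reading it with `encoding='utf-8'` would likely be simpler, since it also fixes the cell values.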

## CSV/Pandas Agent

LLMs are great for building question-answering systems over various types of data sources. In this section we'll go over how to build Q&A systems over data stored in one or more CSV files. As in the last notebook, the key to working with tabular CSV files is to give an LLM access to tools/experts for querying and interacting with the data.
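
As a rough mental model of such a tool, the sketch below evaluates a pandas expression string against a dataframe and returns the result as text, so the caller can read it and decide on a next step. `run_pandas_query` and the toy data are hypothetical, made up for this illustration; this is not LangChain's implementation.

```python
import pandas as pd

def run_pandas_query(df: pd.DataFrame, query: str) -> str:
    """Evaluate a pandas expression against `df` and return the result as text.

    WARNING: eval() runs arbitrary code. This is only a sketch of the idea;
    never expose it to untrusted input without sandboxing.
    """
    namespace = {"df": df, "pd": pd}
    try:
        return str(eval(query, namespace))
    except Exception as exc:
        # Returning the error as text lets the caller (an LLM) retry with a fix.
        return f"{type(exc).__name__}: {exc}"

toy = pd.DataFrame({"Cuenta": ["Procesar", "Procesar", "Otro"]})

# An LLM would generate the query string; here we write it by hand.
count_result = run_pandas_query(toy, "df[df['Cuenta'] == 'Procesar'].shape[0]")
error_result = run_pandas_query(toy, "df['missing_column']")

print(count_result)
print(error_result)
```

Because the tool executes generated code, the agent created later in this notebook has to be opted in explicitly with `allow_dangerous_code=True`.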

In [58]:
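# Let's delve into a challenging question that demands a multi-step solution.
# The path to solving it might not be immediately clear.
# When examining the dataframe above, even a human might struggle to determine which columns are pertinent.
#
# English translation of the questions below:
# - How many tickets does the customer Procesar have?
# - What are the top 5 Procesar tickets that stayed open the longest, from creation date to completion date?
# - Which technologies does Procesar have tickets for?
# - How many tickets does each technology have?
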
QUESTION = """
- Cuantos tickets tiene el cliente Procesar?
- Cual es el top 5 de tickets del cliente Procesar que han tenido el mayor tiempo abierto desde su fecha de creación hasta su fecha de finalización?
- De que tecnologias tiene tickets el cliente Procesar? 
- Cuantos tickets tiene cada tecnología? 
"""

In [59]:
# First we load our LLM
COMPLETION_TOKENS = 1000
llm = AzureChatOpenAI(deployment_name=os.environ["GPT4o_DEPLOYMENT_NAME"], 
                      temperature=0.5, max_tokens=COMPLETION_TOKENS)

Now we need our agent and our expert/tool.  
LangChain provides out-of-the-box agents that we can use to answer questions over tabular data in a CSV file.

In [61]:
agent_executor = create_pandas_dataframe_agent(llm=llm,
                                               df=df,
                                               verbose=True,
                                               agent_type="openai-tools",
                                               allow_dangerous_code=True)

In [45]:
agent_executor

AgentExecutor(verbose=True, agent=RunnableMultiActionAgent(runnable=RunnableAssign(mapper={
  agent_scratchpad: RunnableLambda(lambda x: format_to_openai_tool_messages(x['intermediate_steps']))
})
| ChatPromptTemplate(input_variables=['agent_scratchpad', 'input'], input_types={'agent_scratchpad': typing.List[typing.Union[langchain_core.messages.ai.AIMessage, langchain_core.messages.human.HumanMessage, langchain_core.messages.chat.ChatMessage, langchain_core.messages.system.SystemMessage, langchain_core.messages.function.FunctionMessage, langchain_core.messages.tool.ToolMessage]]}, messages=[SystemMessage(content='\n- First set the pandas display options to show all the columns, get the column names, then answer the question.\n- **ALWAYS** before giving the Final Answer, try another method. Then reflect on the answers of the two methods you did and ask yourself if it answers correctly the original question. If you are not sure, try another method.\n- If the methods tried do not give the

In [62]:
%%time
try:
    display(Markdown(agent_executor.invoke(QUESTION)['output']))
except Exception as e:
    print(e)



> Entering new AgentExecutor chain...

Invoking: `python_repl_ast` with `{'query': "df[df['Cuenta'] == 'Procesar'].shape[0]"}`


25380
Invoking: `python_repl_ast` with `{'query': "df[df['Cuenta'] == 'Procesar'].sort_values(by='Tiempo transcurrido 1', ascending=False).head(5)"}`


        ID de la solicitud    Cuenta  \
134319             1199801  Procesar   
65109               964376  Procesar   
79104              1068887  Procesar   
12752               157690  Procesar   
100092              403861  Procesar   

                                                   Asunto  \
134319    RevisiÃ³n de report mensual diciembre 2023 SIEM   
65109       PROCESAR | SIEM SEGUIMIENTO | SesiÃ³n Interna   
79104   PROCESAR || Agregar alarma Discovery:.PROCESAR...   
12752                        Cambio de Ruteo SDWAN-CONSAR   
100092        Â Cambios DNS con afectaciÃ³n AV TrendMicro   

                              TÃ©cnic

### Respuestas:

1. **Cantidad de tickets del cliente Procesar**: 
   - El cliente Procesar tiene **25,380** tickets.

2. **Top 5 de tickets del cliente Procesar con mayor tiempo abierto**:
   - Aquí están los detalles de los 5 tickets con mayor tiempo abierto:

| ID de la solicitud | Cuenta  | Asunto                                                | Técnico                   | Grupo de soporte asignado | Producto              | Categoría 3               | Hora de creación | Hora de finalización | Estado de solicitud | Creado por                | Incidente de seguridad | Impacto               | Urgencia | Prioridad | Estado vencido | Código de cierre de solicitud | Tiempo de resolución de SLA | Tiempo de respuesta de SLA | Tiempo transcurrido 1 | Tipo de Incidencia | Tipo de ticket  | Nombre de SLA            | Hora de resolución | Hrs SLA | Tiempo transcurrido | Notificación | Unnamed: 27 | Unnamed: 28 | Unnamed: 29 | Unnamed: 30 |
|--------------------|---------|-------------------------------------------------------|---------------------------|--------------------------|-----------------------|---------------------------|------------------|----------------------|---------------------|--------------------------|------------------------|----------------------|----------|-----------|----------------|------------------------------|-----------------------------|---------------------------|------------------------|--------------------|-----------------|--------------------------|--------------------|---------|---------------------|--------------|-------------|-------------|-------------|-------------|
| 1199801            | Procesar| Revisión de report mensual diciembre 2023 SIEM        | Carolina Tinoco Barranco  | IT - Orión               | Siem - Logrhythm      | Análisis Otro             | 10/01/2024 12:10 | No asignado          | Solucionado         | api.user                 | No                     | 4- Menor/Localizado | 4- Baja  | Baja      | False          | No asignado                  | 00:00:00                    | 01:00:00                  | 23:59:56              | No asignado        | Orden de Trabajo | ITS-PROCESAR-SOL-BAJO-24 | 11/01/2024 12:10  | 24.0    | 82.95               | N3           | 10.0        | 1.0         | 2024.0      | 10/01/2024  |
| 964376             | Procesar| PROCESAR | SIEM SEGUIMIENTO | Sesión Interna        | Carolina Tinoco Barranco  | IT - Orión               | Siem - Logrhythm      | Reuniones y Conferencias  | 08/11/2023 10:33 | No asignado          | Solucionado         | api.user                 | No                     | 4- Menor/Localizado | 4- Baja  | Baja      | False          | No asignado                  | 00:00:00                    | 00:00:00                  | 23:58:40              | No asignado        | Orden de Trabajo | KCS-PROCESAR-SOL-BAJO-24 | 10/11/2023 10:32  | 24.0    | 83.66               | FALSE        | 0.0         | 0.0         | 0.0         | 0           |
| 1068887            | Procesar| PROCESAR || Agregar alarma Discovery:.PROCESAR...     | Carolina Tinoco Barranco  | IT - Orión               | Siem - Logrhythm      | Análisis Otro             | 05/12/2023 14:57 | No asignado          | Solucionado         | api.user                 | No                     | 4- Menor/Localizado | 4- Baja  | Baja      | False          | No asignado                  | 00:00:00                    | 01:00:00                  | 23:56:01              | No asignado        | Orden de Trabajo | ITS-PROCESAR-SOL-BAJO-24 | 06/12/2023 14:53  | 24.0    | 104.21              | False        | 0.0         | 0.0         | 0.0         | 0           |
| 157690             | Procesar| Cambio de Ruteo SDWAN-CONSAR                         | Vanessa Nicole Resendiz   | IT - Orión               | Firewall - Checkpoint | ABC de Pentest            | 11/04/2023 10:53 | 16/04/2023 10:56     | Cerrado             | Osvaldo Juarez Lopez     | No                     | 4

CPU times: total: 1.08 s
Wall time: 17.9 s


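The agent answered quickly (about 18 seconds) and the ticket count comes straight from a filter on `Cuenta`. Note, however, that it ranked the "top 5" by sorting `Tiempo transcurrido 1`, which is an `object` (text) column, so the order is lexicographic rather than a true ranking by duration. Let's add some prompt engineering to improve the answer.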
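
Ranking tickets by how long they stayed open is safest when the duration is computed from the two timestamp columns instead of sorting a text column. The sketch below assumes day/month/year timestamps and a `No asignado` placeholder for missing values, as seen in the output above; the sample rows are invented, and the accented column names assume the headers have been decoded correctly.

```python
import pandas as pd

# Hypothetical sample rows that mimic the timestamp format shown above.
sample = pd.DataFrame({
    "ID de la solicitud": [1, 2, 3, 4],
    "Hora de creación": ["10/01/2024 12:10", "11/04/2023 10:53",
                         "05/12/2023 14:57", "01/02/2023 08:00"],
    "Hora de finalización": ["No asignado", "16/04/2023 10:56",
                             "06/12/2023 14:53", "01/03/2023 08:00"],
})

fmt = "%d/%m/%Y %H:%M"
created = pd.to_datetime(sample["Hora de creación"], format=fmt, errors="coerce")
# errors="coerce" turns unparseable values such as 'No asignado' into NaT.
finished = pd.to_datetime(sample["Hora de finalización"], format=fmt, errors="coerce")

# Duration in hours; rows without a completion time become NaN.
sample["hours_open"] = (finished - created).dt.total_seconds() / 3600

# nlargest skips NaN values, so tickets without a completion time are not ranked.
top = sample.nlargest(2, "hours_open")[["ID de la solicitud", "hours_open"]]
print(top)
```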

### Improving the Prompt

We can now give it an improved prompt. In `prompts.py` you can find `CSV_PROMPT_PREFIX`.

In [63]:
CSV_PROMPT_PREFIX

'\n- First set the pandas display options to show all the columns, get the column names, then answer the question.\n- **ALWAYS** before giving the Final Answer, try another method. Then reflect on the answers of the two methods you did and ask yourself if it answers correctly the original question. If you are not sure, try another method.\n- If the methods tried do not give the same result, reflect and try again until you have two methods that have the same result. \n- If you still cannot arrive to a consistent result, say that you are not sure of the answer.\n- If you are sure of the correct answer, create a beautiful and thorough response using Markdown.\n- **DO NOT MAKE UP AN ANSWER OR USE PRIOR KNOWLEDGE, ONLY USE THE RESULTS OF THE CALCULATIONS YOU HAVE DONE**. \n- **ALWAYS**, as part of your "Final Answer", explain how you got to the answer on a section that starts with: "\n\nExplanation:\n". In the explanation, mention the column names that you used to get to the final answer. \
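
The "try another method" instruction in this prefix amounts to computing the same quantity in two independent ways and only trusting it when both agree. A minimal sketch of that idea in plain pandas, using an invented four-row table:

```python
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame({"Cuenta": ["Procesar", "Otro", "Procesar", "Procesar"]})

# Method 1: build a boolean mask and count the True values.
count_by_filter = int((toy["Cuenta"] == "Procesar").sum())

# Method 2: count every account, then look up the one we care about.
count_by_value_counts = int(toy["Cuenta"].value_counts().get("Procesar", 0))

if count_by_filter == count_by_value_counts:
    verdict = f"Both methods agree: {count_by_filter}"
else:
    verdict = "Methods disagree; the answer is not reliable"

print(verdict)
```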

In [64]:
agent_executor = create_pandas_dataframe_agent(llm=llm,
                                               df=df,
                                               prefix=CSV_PROMPT_PREFIX,
                                               verbose=True,
                                               agent_type="openai-tools",
                                               allow_dangerous_code=True)

In [48]:
agent_executor

AgentExecutor(verbose=True, agent=RunnableMultiActionAgent(runnable=RunnableAssign(mapper={
  agent_scratchpad: RunnableLambda(lambda x: format_to_openai_tool_messages(x['intermediate_steps']))
})
| ChatPromptTemplate(input_variables=['agent_scratchpad', 'input'], input_types={'agent_scratchpad': typing.List[typing.Union[langchain_core.messages.ai.AIMessage, langchain_core.messages.human.HumanMessage, langchain_core.messages.chat.ChatMessage, langchain_core.messages.system.SystemMessage, langchain_core.messages.function.FunctionMessage, langchain_core.messages.tool.ToolMessage]]}, messages=[SystemMessage(content='\n- First set the pandas display options to show all the columns, get the column names, then answer the question.\n- **ALWAYS** before giving the Final Answer, try another method. Then reflect on the answers of the two methods you did and ask yourself if it answers correctly the original question. If you are not sure, try another method.\n- If the methods tried do not give the

In [65]:
%%time
try:
    display(Markdown(agent_executor.invoke(QUESTION)['output']))
except Exception as e:
    print(e)



> Entering new AgentExecutor chain...

Invoking: `python_repl_ast` with `{'query': "import pandas as pd\npd.set_option('display.max_columns', None)\ndf.columns.tolist()"}`


['ID de la solicitud', 'Cuenta', 'Asunto', 'TÃ©cnico', 'Grupo de soporte asignado', 'Producto', 'CategorÃ\xada 3', 'Hora de creaciÃ³n', 'Hora de finalizaciÃ³n', 'Estado de solicitud', 'Creado por', 'Incidente de seguridad', 'Impacto', 'Urgencia', 'Prioridad', 'Estado vencido', 'CÃ³digo de cierre de solicitud', 'Tiempo de resoluciÃ³n de SLA', 'Tiempo de respuesta de SLA', 'Tiempo transcurrido 1', 'Tipo de Incidencia', 'Tipo de ticket', 'Nombre de SLA', 'Hora de resoluciÃ³n', 'Hrs SLA', 'Tiempo transcurrido', 'NotificaciÃ³n', 'Unnamed: 27', 'Unnamed: 28', 'Unnamed: 29', 'Unnamed: 30']
Invoking: `python_repl_ast` with `{'query': "df['Cuenta'].value_counts().to_dict()"}`


{'Procesar': 25380, 'Kio Cyber Security': 16762, 'Grupo Axo': 13215, 'KIO 

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


[32;1m[1;3m
Invoking: `python_repl_ast` with `{'query': "df_procesar['Producto'].value_counts()"}`


[0m[36;1m[1;3mSiem - Logrhythm            4992
Firewall - Checkpoint       4209
Dbf/Dam - Imperva           3449
Ips - Checkpoint            3165
Firewall                    2623
                            ... 
Smartreporting                 1
Enlace                         1
Microsoft Word                 1
Installation Manual            1
Unidad De Disco (Driver)       1
Name: Producto, Length: 87, dtype: int64

### Análisis de Tickets del Cliente "Procesar"

#### 1. **Cantidad de Tickets del Cliente "Procesar"**
El cliente "Procesar" tiene un total de **25,380 tickets**.

#### 2. **Top 5 de Tickets con Mayor Tiempo Abierto**
Aquí están los 5 tickets del cliente "Procesar" que han tenido el mayor tiempo abierto desde su fecha de creación hasta su fecha de finalización:

| ID de la Solicitud | Tiempo Transcurrido (hrs) |
|--------------------|---------------------------|
| 971                | 5001.89                   |
| 1160               | 5000.99                   |
| 1170               | 5000.94                   |
| 1418               | 4999.80                   |
| 1476               | 4999.55                   |

#### 3. **Tecnologías con Tickets del Cliente "Procesar"**
El cliente "Procesar" tiene tickets relacionados con las siguientes tecnologías:

- Siem - Logrhythm
- Firewall - Checkpoint
- Dbf/Dam - Imperva
- Ips - Checkpoint
- Firewall
- ... (y otras 82 tecnologías más)

#### 4. **Cantidad de Tickets por Tecnología**
A continuación se muestra la cantidad de tickets que tiene cada tecnología para el cliente "Procesar":

- **Siem - Logrhythm**: 4992 tickets
- **Firewall - Checkpoint**: 4209 tickets
- **Dbf/Dam - Imperva**: 3449 tickets
- **Ips - Checkpoint**: 3165 tickets
- **Firewall**: 2623 tickets
- ... (y otras tecnologías con menor cantidad de tickets)

---

### 

Explanation:
Para llegar a estas conclusiones, se realizaron los siguientes pasos:
1. **Cantidad de Tickets**: Se utilizó el método `value_counts()` sobre la columna `Cuenta` para contar el número de tickets por cliente.
2. **Top 5 de Tickets con Mayor Tiempo Abierto**: Se filtraron los tickets del cliente "Procesar" y se utilizó `nlargest()` en la columna `Tiempo transcurrido`.
3. **Tecnologías con Tickets**: Se analizaron los valores únicos en la columna `Producto` para el cliente "Procesar".
4. **Cantidad de Tickets por Tecnología**: Se utilizó `value_counts()` sobre la columna `Producto` para contar el número de tickets por tecnología.

Estas operaciones se realizaron asegurando la consistencia y exactitud de los datos antes de llegar a las conclusiones finales.

CPU times: total: 844 ms
Wall time: 1min 14s


## Evaluation
Let's check the agent's counts by computing the same numbers directly with pandas.

In [15]:
# Recompute the counts directly; column names are taken from df.columns above
procesar = df[df['Cuenta'] == 'Procesar']
procesar_ticket_count = len(procesar)
procesar_tickets_by_product = procesar['Producto'].value_counts()

In [16]:
print("Procesar tickets:", procesar_ticket_count)
print(procesar_tickets_by_product.head())

If these numbers match the agent's answer (25,380 tickets for Procesar, led by Siem - Logrhythm with 4,992), the counts are correct.

**Note**: The point of this check is to confirm that the model is NOT making up information or using prior knowledge, but is instead using only the results of its calculations on this CSV file. That's what we need!

# Summary

So, we just solved the problem of asking questions in natural language about tabular data stored in a CSV file.
This approach also shows that it is NOT necessary to dump database data into a CSV file and index it in a search engine; you don't even need the approach above with a CSV dump. With the agents framework, the best engineering decision is to interact directly with the data source API, without replicating the data in order to ask questions of it. Remember, LLMs can do SQL very well.


**Note**: We don't recommend using a pandas agent to answer questions from tabular data. It is not fast and it makes too many parsing mistakes. We recommend using SQL (see next notebook).
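
To give a flavor of the SQL route, the sketch below runs the kind of query an LLM could generate for "how many tickets does Procesar have per technology?". It loads an invented three-row table into an in-memory SQLite database purely so the example is self-contained; the table name `tickets` is an assumption, and in a real deployment the data would already live in the database.

```python
import sqlite3
import pandas as pd

# Hypothetical toy rows for illustration only.
toy = pd.DataFrame({
    "Cuenta": ["Procesar", "Procesar", "Otro"],
    "Producto": ["Firewall", "Siem - Logrhythm", "Firewall"],
})

# Load the rows into an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
toy.to_sql("tickets", conn, index=False)

# A parameterized query keeps the account name out of the SQL text.
sql = """
SELECT Producto, COUNT(*) AS n_tickets
FROM tickets
WHERE Cuenta = ?
GROUP BY Producto
ORDER BY Producto
"""
rows = conn.execute(sql, ("Procesar",)).fetchall()
conn.close()

print(rows)
```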

# NEXT
We can see that GPT-3.5, and even better GPT-4, is powerful and can translate a natural language question into the right Python steps to query CSV data loaded into a pandas dataframe.
That's pretty amazing. However, the question remains: **Do I need to dump all the data from my original sources (databases, ERP systems, CRM systems) in order to make it searchable by a Smart Search Engine?**

The next notebook answers this question by implementing a Question->SQL process that gets the information from a SQL database, eliminating the need to dump it out.