# Query a CSV File using OpenAI LLM with LangChain

This project demonstrates how to query a CSV file in natural language using an OpenAI LLM and LangChain. It builds a retrieval-based pipeline: the CSV rows are loaded as documents, embedded as vectors, and the rows most relevant to each question are passed to the model to generate an answer.

---

## Features

- **CSV Parsing**: Automatically reads and processes the CSV file.
- **OpenAI LLM Integration**: Uses an OpenAI language model to interpret natural language queries and generate answers.
- **Vectorization**: Converts each CSV row into a vector embedding so that relevant rows can be found by similarity search.
- **LangChain Framework**: Leverages LangChain's tools to simplify the retrieval and response generation process.

---

## Prerequisites

1. **Python**: Ensure Python 3.8 or above is installed.
2. **OpenAI API Key**: Obtain an API key from [OpenAI](https://platform.openai.com/).
3. **Required Libraries**: Install the necessary dependencies:
   ```bash
   pip install langchain openai pandas tiktoken faiss-cpu python-dotenv
   ```

In [20]:
import os
import pandas as pd

import openai

from dotenv import load_dotenv
load_dotenv()

from langchain.document_loaders import csv_loader
from langchain.indexes import VectorstoreIndexCreator
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings


In [14]:
# Retrieve the API key from the .env file (loaded by load_dotenv above)
openai_api_key = os.getenv('OPENAI_API_KEY')

# Set the API key on the openai module
openai.api_key = openai_api_key
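
If the `.env` file is missing or the variable name is misspelled, `os.getenv` returns `None` and the problem only surfaces later, at the first API call. A minimal sketch of an early check, assuming a hypothetical helper named `require_env` (it is not part of the original notebook or of any library):

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it in the shell."
        )
    return value


# Hypothetical usage with a dummy variable so the example runs without a real key.
os.environ["EXAMPLE_API_KEY"] = "sk-dummy"
print(require_env("EXAMPLE_API_KEY"))
```

In the notebook, the same check could wrap `OPENAI_API_KEY` before the key is assigned.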

In [15]:
# Load the data
loader = csv_loader.CSVLoader(file_path='data/titanic.csv')
loader

<langchain_community.document_loaders.csv_loader.CSVLoader at 0x1f6ef7d81c0>
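
`CSVLoader` turns each CSV row into one document whose text lists the row as `column: value` lines. The sketch below approximates that behaviour with the standard library only, to show what the embedding step receives. The function `rows_to_documents` and the reduced sample are assumptions for illustration, not LangChain code.

```python
import csv
import io
from typing import Dict, List

# Two rows from the Titanic data; only a few columns are kept for brevity.
SAMPLE_CSV = """PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,22.0
2,1,1,female,38.0
"""


def rows_to_documents(csv_text: str) -> List[Dict]:
    """Approximate CSVLoader: one document per row, formatted as 'column: value' lines."""
    documents = []
    for row_number, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append({"page_content": content, "metadata": {"row": row_number}})
    return documents


docs = rows_to_documents(SAMPLE_CSV)
print(len(docs))
print(docs[0]["page_content"])
```

Because every row becomes its own document, no single document describes the dataset as a whole.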

In [22]:
# Initialize the embedding model
embedding_model = OpenAIEmbeddings()

# Initialize the index creator with the embedding model
index_creator = VectorstoreIndexCreator(embedding=embedding_model)

docsearch = index_creator.from_loaders(loaders=[loader])
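
The index stores one embedding per row and, at query time, returns the rows whose vectors are closest to the query vector. The sketch below illustrates that lookup with made-up 3-dimensional vectors and cosine similarity. Both are assumptions: real OpenAI embeddings have far more dimensions, and the metric actually used depends on the vector store.

```python
import math
from typing import Dict, List, Tuple

# Toy vectors standing in for real embeddings (values are invented).
TOY_INDEX: Dict[str, List[float]] = {
    "row 0: male, age 22, third class": [0.9, 0.1, 0.0],
    "row 1: female, age 38, first class": [0.1, 0.9, 0.0],
    "row 2: female, age 26, third class": [0.6, 0.6, 0.1],
}


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: List[float], index: Dict[str, List[float]], k: int) -> List[Tuple[str, float]]:
    """Return the k entries most similar to the query, best match first."""
    scored = [(text, cosine_similarity(query, vector)) for text, vector in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


results = top_k([1.0, 0.0, 0.0], TOY_INDEX, k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```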

In [32]:
# Create the RetrievalQA chain with the default OpenAI completion model
chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),  # an LLM instance, not the class
    chain_type='stuff',  # place all retrieved rows into a single prompt
    retriever=docsearch.vectorstore.as_retriever(),  # fetch relevant rows from the vector index
    input_key='question'  # callers pass {'question': ...}
)
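
With `chain_type='stuff'`, the retrieved documents are concatenated into a single prompt together with the question. A rough sketch of that idea follows; the template wording and the helper `build_stuff_prompt` are illustrative assumptions, not LangChain's actual prompt.

```python
from typing import List

# Hypothetical template; LangChain ships its own wording.
PROMPT_TEMPLATE = (
    "Use the following context to answer the question. "
    "If the answer is not in the context, say you don't know.\n\n"
    "{context}\n\nQuestion: {question}\nAnswer:"
)


def build_stuff_prompt(question: str, retrieved_docs: List[str]) -> str:
    """Join every retrieved document into one context block and fill the template."""
    context = "\n\n".join(retrieved_docs)
    return PROMPT_TEMPLATE.format(context=context, question=question)


prompt = build_stuff_prompt(
    "What is the maximum Fare?",
    ["PassengerId: 1\nFare: 7.25", "PassengerId: 2\nFare: 71.2833"],
)
print(prompt)
```

The model sees only the rows placed in the context, so a question about the whole dataset is answered from that subset.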

In [41]:
query = "What are the columns in the Titanic dataset?"

# Run the chain
response = chain({'question': query})

print(response['result'])

 The columns in the Titanic dataset are PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, and Embarked.


In [44]:
query = "What is the maximum Age in the Titanic dataset?"

# Run the chain
response = chain({'question': query})

print(response['result'])

 The maximum Age in the Titanic dataset is 71.


In [None]:
query = "how many pclass are there in the titanic dataset?"

# Run the chain
response = chain({'question': query})

print(response['result'])

In [49]:
query = "what is the maximum Fare in the titanic dataset?"

# Run the chain
response = chain({'question': query})

print(response['result'])

 I don't know.


In [45]:
query = "What is the Average Age in the Titanic dataset?"

# Run the chain
response = chain({'question': query})

print(response['result'])

 The average age in the Titanic dataset cannot be determined solely based on the given information, as there are missing values for age in some of the passenger records.


In [46]:
data = pd.read_csv('data/titanic.csv')
data.head()

| Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


In [51]:
query = "distinct Cabin names in the titanic dataset?"

# Run the chain
response = chain({'question': query})

print(response['result'])


There are 28 distinct Cabin names in the Titanic dataset. 


In [52]:
data['Cabin'].nunique()

147
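
The chain reported 28 distinct cabins while pandas counts 147. A likely explanation is that the retriever passes only a few matching rows to the model, so the model never sees the full column. Aggregates such as maximum, mean, and distinct counts are therefore better computed directly on the data. The sketch below does this with the standard library on the five rows shown by `data.head()` above, reduced to the columns it uses, so the numbers describe that sample only and not the full dataset.

```python
import csv
import io
import statistics

# The five rows from data.head(), limited to the columns used below.
SAMPLE_CSV = """PassengerId,Pclass,Age,Fare,Cabin
1,3,22.0,7.25,
2,1,38.0,71.2833,C85
3,3,26.0,7.925,
4,1,35.0,53.1,C123
5,3,35.0,8.05,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Skip empty cells so missing values do not break the numeric conversions.
ages = [float(r["Age"]) for r in rows if r["Age"]]
fares = [float(r["Fare"]) for r in rows if r["Fare"]]
cabins = {r["Cabin"] for r in rows if r["Cabin"]}
classes = {r["Pclass"] for r in rows}

summary = {
    "max_age": max(ages),
    "mean_age": statistics.mean(ages),
    "max_fare": max(fares),
    "distinct_pclass": len(classes),
    "distinct_cabins": len(cabins),
}
print(summary)
```

On the full DataFrame, the pandas equivalents are `data['Age'].max()`, `data['Age'].mean()` (which skips missing values by default), `data['Fare'].max()`, and `data['Pclass'].nunique()`.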