#  05-FAQRAGSystem

- Author: [Taylor(Jihyun Kim)](https://github.com/Taylor0819)
- Design: 
- Peer Review: 
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/99-TEMPLATE/00-BASE-TEMPLATE-EXAMPLE.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/99-TEMPLATE/00-BASE-TEMPLATE-EXAMPLE.ipynb)

## Overview
In this tutorial, you’ll learn how to build a `Neo4j` graph using the Titanic dataset and run Cypher queries through the LangChain library. Aimed at beginners, it covers:

- Data Loading & Preprocessing: Convert the Titanic CSV to a Pandas DataFrame, handle missing values, and prepare fields for graph modeling.
- Graph Modeling in Neo4j: Create passenger nodes and relationships (e.g., SAME_TICKET) to represent key Titanic connections.
- Running Cypher Queries: Retrieve passenger info, filter by class or survival, and calculate statistics.


### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [Load Titanic Data](#load-titanic-data)
- [Consider the Data with Arrows.app](#consider-the-data-with-arrowsapp)
- [Data Restructure](#data-restructure)
- [Neo4j Database Connection](#neo4j-database-connection)
- [Usage Example](#usage-example)


### References

- [LangChain Neo4j](https://python.langchain.com/docs/integrations/graphs/neo4j_cypher)
- [Neo4j Arrows](https://neo4j.com/labs/arrows)
----

## Environment Setup
Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]** 

- `langchain-opentutorial` is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials.
- You can check out the [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) repository for more details.


In [33]:
%%capture --no-stderr
%pip install langchain-opentutorial

In [34]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain_neo4j",
        "langchain_openai"
    ],
    verbose=False,
    upgrade=False,
)

In [35]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "05-FAQRAGSystem", 
        "NEO4J_URI": "",
        "NEO4J_USERNAME": "",
        "NEO4J_PASSWORD": "",
    }
)

Environment variables have been set successfully.


In [36]:
from dotenv import load_dotenv

load_dotenv(override=True)

True

In [37]:
import os
import nest_asyncio

# Allow async
nest_asyncio.apply()

## Load Titanic Data

### Data Preparation

In this tutorial, we will use the following csv file:

- Download Link: [Kaggle Titanic Dataset](https://www.kaggle.com/datasets/brendan45774/test-file/data)
- Author : brendan45774 (kaggle ID)
- File name: "tested.csv"
- File path: "../data/tested.csv"

There are two ways to obtain the dataset:

1. Download directly from the Kaggle link above
2. Use the Python code below to automatically download via Kaggle API

In [38]:
# Download the dataset zip and extract the sample CSV file into the ../data directory
import os
import requests
import zipfile

def download_csv(url, zip_path, extract_dir):
    """
    Downloads a zip archive from the given URL and extracts its contents.

    Args:
        url (str): The URL of the zip archive that contains the CSV file
        zip_path (str): The full path (including file name) where the zip file will be temporarily saved
        extract_dir (str): The directory path where the contents will be extracted
    """
    try:
        # Ensure the directory exists
        os.makedirs(os.path.dirname(zip_path), exist_ok=True)

        # Download the file
        response = requests.get(url, stream=True)
        response.raise_for_status()  # Raise an error for bad status codes

        # Save the file to the specified path
        with open(zip_path, "wb") as file:
            for chunk in response.iter_content(chunk_size=8192):
                file.write(chunk)

        # Extract the zip file
        with zipfile.ZipFile(zip_path, 'r') as zip_ref:
            zip_ref.extractall(extract_dir)

        print(f"File downloaded and extracted to {extract_dir}")
        
        # Delete the temporary zip file (optional)
        os.remove(zip_path)
        print(f"Temporary zip file deleted: {zip_path}")
        
    except Exception as e:
        # Report the failure; nothing was saved successfully in this case
        print(f"An error occurred: {e}")

# Configuration for the dataset download
url = "https://www.kaggle.com/api/v1/datasets/download/brendan45774/test-file"
zip_path = "../data/data.zip"
extract_dir = "../data"

# Download and extract the CSV file
download_csv(url, zip_path, extract_dir)


File downloaded and extracted to ../data
Temporary zip file deleted: ../data/data.zip


In [39]:
import pandas as pd

file_path = "../data/tested.csv"

# Read the tested.csv file
df = pd.read_csv(file_path)

# Inspect the data structure and a few sample rows
print("=== DataFrame Info ===")
df.info()

print("=== Sample data ===")
df.head()

=== DataFrame Info ===
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Survived     418 non-null    int64  
 2   Pclass       418 non-null    int64  
 3   Name         418 non-null    object 
 4   Sex          418 non-null    object 
 5   Age          332 non-null    float64
 6   SibSp        418 non-null    int64  
 7   Parch        418 non-null    int64  
 8   Ticket       418 non-null    object 
 9   Fare         417 non-null    float64
 10  Cabin        91 non-null     object 
 11  Embarked     418 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 39.3+ KB
=== Sample data ===


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 0 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 1 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S |
| 2 | 894 | 0 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 0 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 1 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |


### Column Descriptions
Key column descriptions:
- `PassengerId` : Unique identifier for each passenger
- `Survived` : Survival status (0 = No, 1 = Yes)
- `Pclass` : Ticket class (1, 2, 3)
- `Name` : Passenger name
- `Sex` : Gender
- `Age` : Age in years
- `SibSp` : Number of siblings/spouses aboard
- `Parch` : Number of parents/children aboard
- `Ticket` : Ticket number
- `Fare` : Passenger fare
- `Cabin` : Cabin number
- `Embarked` : Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

## Consider the Data with Arrows.app
When converting a complete tabular dataset like a passenger manifest into a graph, it may seem simple to create nodes for passengers, tickets, and departure points while turning the remaining columns into properties. 

However, the flexibility of the graph structure requires careful consideration of how to categorize data into nodes, relationships, or properties. The way the data is structured may vary depending on the types of queries you plan to run on the graph.

To assist with this, Neo4j provides `Arrows.app`, a tool that allows you to visualize relationships across the graph before uploading any specific data. With [arrows.app](https://arrows.app), you can explore and experiment with different ways to model the data. To demonstrate this, I will present an example graph that represents a complex data structure.

### Defining the Relationship Categories
The first step is to define the categories of relationships we are interested in.
This tutorial uses three relationships: MARRIED_TO, SIBLING_TO, and PARENT_OF.
![explanation-01](../assets/05-faqragsystem-flow-explanation-01.png)


Both MARRIED_TO and SIBLING_TO would imply the same relationship in the other direction between the same nodes. 

PARENT_OF would imply a reverse relationship of CHILD_OF.
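
The direction rules above can be sketched in a few lines of Python. This is only an illustration and not part of the original notebook: the `expand_edges` helper and the `(source, relationship, target)` tuple layout are assumptions made for this example.

```python
# Hypothetical helper: expand directed edges with the relationships
# they imply in the opposite direction.
SYMMETRIC = {"MARRIED_TO", "SIBLING_TO"}  # same relationship type both ways
INVERSE = {"PARENT_OF": "CHILD_OF"}       # different type in reverse


def expand_edges(edges):
    """Return the input edges plus every implied reverse edge, without duplicates."""
    expanded = []
    seen = set()
    for source, rel, target in edges:
        candidates = [(source, rel, target)]
        if rel in SYMMETRIC:
            candidates.append((target, rel, source))
        elif rel in INVERSE:
            candidates.append((target, INVERSE[rel], source))
        for edge in candidates:
            if edge not in seen:
                seen.add(edge)
                expanded.append(edge)
    return expanded


# Passenger IDs 1, 2, 3 are placeholders, not rows from the dataset
for edge in expand_edges([(1, "MARRIED_TO", 2), (1, "PARENT_OF", 3)]):
    print(edge)
```

Neo4j stores every relationship with a direction, but a Cypher pattern can ignore direction when matching, so storing a symmetric relationship once is often enough in practice.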

## Data Restructure
Our main goal is to analyze family relationships. Let’s assume that most traveling families shared the same ticket number, but we also want to avoid mistakenly matching children as spouses of their parents. 

To handle this, we’ll rely on age data to distinguish adult relationships from child relationships. Fortunately, the passenger manifest provides relatively complete ticket and age columns, which will help us create more accurate relationships (such as MARRIED_TO, SIBLING_TO, or PARENT_OF) without mixing up parents and children.

In [40]:
# Optional: parse last names from "Name" if helpful
df["LastName"] = df["Name"].apply(lambda x: x.split(",")[0].strip())

# Drop rows where Age is null
df.dropna(subset=["Age"], inplace=True)

# Convert "Ticket" to a string type to maintain consistency
df["Ticket"] = df["Ticket"].astype(str)


In [41]:
# Check the preprocessed data
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | LastName |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 0 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q | Kelly |
| 1 | 893 | 1 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S | Wilkes |
| 2 | 894 | 0 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q | Myles |
| 3 | 895 | 0 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S | Wirz |
| 4 | 896 | 1 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S | Hirvonen |


### Generate Relationship Edges
1. Sibling Relationships
2. Marriage Relationships
3. Parent-Child Relationships
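
One possible way to generate these three edge types is sketched below. It is a heuristic under stated assumptions, not the notebook's own method: passengers are treated as related only when they share both a ticket and a last name, and the `ADULT_AGE` and `PARENT_GAP` thresholds, the helper names, and the sample records are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

ADULT_AGE = 18   # assumed age from which a passenger can be a spouse
PARENT_GAP = 18  # assumed minimum age gap between parent and child


def classify_pair(a, b):
    """Classify two related passengers as parent/child, spouses, or siblings."""
    older, younger = (a, b) if a["Age"] >= b["Age"] else (b, a)
    if older["Age"] - younger["Age"] >= PARENT_GAP:
        rel = "PARENT_OF"
    elif younger["Age"] >= ADULT_AGE and a["Sex"] != b["Sex"]:
        rel = "MARRIED_TO"
    else:
        rel = "SIBLING_TO"
    return (older["PassengerId"], rel, younger["PassengerId"])


def build_edges(records):
    """Group passengers by ticket and classify every same-surname pair."""
    by_ticket = defaultdict(list)
    for rec in records:
        by_ticket[rec["Ticket"]].append(rec)
    edges = []
    for group in by_ticket.values():
        for a, b in combinations(group, 2):
            if a["LastName"] == b["LastName"]:
                edges.append(classify_pair(a, b))
    return edges


# Made-up records for illustration only (not rows from tested.csv)
sample = [
    {"PassengerId": 1, "LastName": "Doe", "Sex": "male", "Age": 40.0, "Ticket": "T1"},
    {"PassengerId": 2, "LastName": "Doe", "Sex": "female", "Age": 38.0, "Ticket": "T1"},
    {"PassengerId": 3, "LastName": "Doe", "Sex": "female", "Age": 10.0, "Ticket": "T1"},
    {"PassengerId": 4, "LastName": "Doe", "Sex": "male", "Age": 8.0, "Ticket": "T1"},
    {"PassengerId": 5, "LastName": "Roe", "Sex": "male", "Age": 30.0, "Ticket": "T2"},
]
for edge in build_edges(sample):
    print(edge)
```

With the DataFrame prepared earlier, the same function could be called as `build_edges(df.to_dict("records"))`, since rows with a missing `Age` have already been dropped. The thresholds will misclassify some pairs (for example, siblings with a large age gap), so the results should be treated as approximate.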

## Neo4j Database Connection
First, install the Neo4j graph database. This tutorial is based on Neo4j Desktop.

[Neo4j Desktop Installation link](https://neo4j.com/docs/operations-manual/current/installation/)

**[Note]**

You can set up Neo4j in several ways:

1. [`Neo4j Desktop`](https://neo4j.com/docs/operations-manual/current/installation/): A desktop application for local development

2. [`Neo4j Sandbox`](https://neo4j.com/sandbox/): A free, cloud-based platform for working with graph databases

3. [`Docker`](https://neo4j.com/docs/operations-manual/current/docker/): Run Neo4j in a container using the official Neo4j Docker image

In [49]:
import os

os.environ["NEO4J_URI"] = "bolt://localhost:7687"
os.environ["NEO4J_USERNAME"] = "taylor"
os.environ["NEO4J_PASSWORD"] = "titanic12"

### Import the CSV File into Neo4j Desktop
For details, see [How to import a CSV file](https://neo4j.com/docs/getting-started/appendix/tutorials/guide-import-desktop-csv/#csv-location).

We will import a CSV file into the Neo4j Desktop by adding it to the `import` folder.

To open the folder in Finder, hover over the three dots to the right of the started DBMS and select Open folder, then Import.
![explanation-02](../assets/05-faqragsystem-flow-explanation-02.png)

You can drag and drop files directly into this folder to add them.
![explanation-03](../assets/05-faqragsystem-flow-explanation-03.png)


Let's verify the data import using Neo4j Browser.
![explanation-04](../assets/05-faqragsystem-flow-explanation-04.png)

We will use a simple Cypher query to verify that the data is accessible. The query below counts the rows in the tested.csv file and returns the total. If the file can be read from the `import` folder, the result shows the row count.

`LOAD CSV FROM 'file:///tested.csv' AS row RETURN count(row);`

![explanation-05](../assets/05-faqragsystem-flow-explanation-05.png)
![explanation-06](../assets/05-faqragsystem-flow-explanation-06.png)

It has been successfully loaded!


In [50]:
# Try to connect to the Neo4j graph
from langchain_neo4j import Neo4jGraph

try:
    graph = Neo4jGraph(
        url=os.environ["NEO4J_URI"],
        username=os.environ["NEO4J_USERNAME"],
        password=os.environ["NEO4J_PASSWORD"]
    )
    print("Neo4j connection successful!")
except Exception as e:
    print(f"Connection error: {str(e)}")

Neo4j connection successful!
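
Once the connection works, the CSV rows could be turned into nodes with a `LOAD CSV WITH HEADERS` statement sent through `graph.query`. The sketch below is one possible approach and not the notebook's own code: the `Passenger` label, the property names, and the `load_passengers` helper are assumptions made for this example.

```python
# Cypher statement that creates (or updates) one Passenger node per CSV row.
# The Passenger label and the property names are assumptions for this sketch.
LOAD_PASSENGERS = """
LOAD CSV WITH HEADERS FROM 'file:///tested.csv' AS row
MERGE (p:Passenger {passengerId: toInteger(row.PassengerId)})
SET p.name = row.Name,
    p.sex = row.Sex,
    p.age = toFloat(row.Age),
    p.survived = toInteger(row.Survived),
    p.pclass = toInteger(row.Pclass),
    p.ticket = row.Ticket
"""

COUNT_PASSENGERS = "MATCH (p:Passenger) RETURN count(p) AS passengers"


def load_passengers(graph):
    """Run the import statement, then return the node count reported by the database."""
    graph.query(LOAD_PASSENGERS)
    return graph.query(COUNT_PASSENGERS)


# Usage, given the Neo4jGraph object created in the cell above:
# print(load_passengers(graph))
```

Using `MERGE` on `passengerId` keeps the import idempotent, so running the cell twice should not create duplicate nodes.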


## Usage Example
Explore the Titanic data through Cypher queries, using LangChain.
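
As a minimal sketch of such a query, the helper below filters passengers by class and survival through the `query` method of a `Neo4jGraph` object. It assumes the database already contains `Passenger` nodes with `name`, `age`, `pclass`, and `survived` properties; that schema and the `survivors_in_class` helper are hypothetical and are not created by the cells shown in this tutorial.

```python
# Parameterized Cypher query: survivors of a given ticket class, youngest first.
# The Passenger label and property names are assumptions for this sketch.
SURVIVORS_BY_CLASS = """
MATCH (p:Passenger)
WHERE p.pclass = $pclass AND p.survived = 1
RETURN p.name AS name, p.age AS age
ORDER BY p.age
LIMIT $limit
"""


def survivors_in_class(graph, pclass, limit=5):
    """Return up to `limit` surviving passengers of the given class as a list of dicts."""
    return graph.query(SURVIVORS_BY_CLASS, params={"pclass": pclass, "limit": limit})


# Usage, given a connected Neo4jGraph object named `graph`:
# for row in survivors_in_class(graph, pclass=3):
#     print(row["name"], row["age"])
```

Passing values through `params` instead of formatting them into the query string avoids quoting mistakes. For natural-language questions, the LangChain Neo4j reference listed above describes `GraphCypherQAChain`, which generates Cypher with an LLM.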