To generate a text2cypher dataset from your own data, refer to: https://github.com/tomasonjo/text2cypher

# Initial Set Up

In [1]:
from langchain_community.graphs import Neo4jGraph
from langchain_groq import ChatGroq
from langchain.chains import GraphCypherQAChain

import os
from dotenv import load_dotenv 
load_dotenv()

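# These keys were originally read in Colab with userdata.get('GROQ_API') and userdata.get('HF_API').
# Assumption: outside Colab, the Groq key is stored in the .env file under the same name.
groq_api_key = os.getenv("GROQ_API")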

#https://demo.neo4jlabs.com:7473/browser/
URL = os.getenv("NEO4J_URL")
database = os.getenv("NEO4J_DATABASE")
password = os.getenv("NEO4J_PASSWORD")

In [2]:
graph = Neo4jGraph(url=URL, database=database, username=database, password=password)
print(graph.schema)



Node properties:
Director {name: STRING}
Movie {id: INTEGER, title: STRING}
Synopsis {seqId: INTEGER, text: STRING, textEmbedding: LIST}
Actor {name: STRING}
Relationship properties:

The relationships:
(:Movie)-[:HAS_SYNOPSIS]->(:Synopsis)
(:Movie)-[:DIRECTED_BY]->(:Director)
(:Movie)-[:ACTED_BY]->(:Actor)


New update from LangChain (09/05/24): the `enhanced_schema` parameter samples values from the database and includes them in the schema passed to the LLM, so that it can generate more accurate Cypher statements.

https://python.langchain.com/v0.1/docs/integrations/graphs/neo4j_cypher/#enhanced-schema-information

In [5]:
graph = Neo4jGraph(url=URL, database=database, username=database, password=password, enhanced_schema=True)
print(graph.schema)



Node properties:
- **Director**
  - `name`: STRING Example: "봉준호"
- **Movie**
  - `id`: INTEGER Example: "1"
  - `title`: STRING Example: "플란다스의 개"
- **Synopsis**
  - `seqId`: INTEGER Example: "0"
  - `text`: STRING Example: "대학 시간강사인 고윤주(이성재)는 이번에도 교수직 추천에서 보기 좋게 떨어진다. 돈 잘 버"
- **Actor**
  - `name`: STRING Example: "이성재, 배두나, 변희봉, 김호정, 김뢰하, 고수희, 김진구, 임상수, 성정선, 조재하, "
Relationship properties:

The relationships:
(:Movie)-[:HAS_SYNOPSIS]->(:Synopsis)
(:Movie)-[:DIRECTED_BY]->(:Director)
(:Movie)-[:ACTED_BY]->(:Actor)


## Generate QA_Dataset

In [6]:
!pip install langchain-google-genai==2.0.1



In [2]:
import os
import json
from typing import List

from pydantic import BaseModel, Field  # langchain_core.pydantic_v1 is deprecated
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_community.graphs import Neo4jGraph

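# Fail early with a clear message instead of a TypeError when the key is missing
if not os.getenv("GOOGLE_API_KEY"):
    raise RuntimeError("GOOGLE_API_KEY is not set")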

llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash-001", timeout=60)




In [4]:
query_types = {
    "Simple Retrieval Queries": "These queries focus on basic data extraction, retrieving nodes or relationships based on straightforward criteria such as labels, properties, or direct relationships. Examples include fetching all nodes labeled as 'Person' or retrieving relationships of a specific type like 'EMPLOYED_BY'. Simple retrieval is essential for initial data inspections and basic reporting tasks. Always limit the number of results if more than one row is expected from the questions by saying 'first 3' or 'top 5' elements",
    "Complex Retrieval Queries": "These advanced queries use the rich pattern-matching capabilities of Cypher to handle multiple node types and relationship patterns. They involve sophisticated filtering conditions and logical operations to extract nuanced insights from interconnected data points. An example could be finding all 'Person' nodes who work in a 'Department' with over 50 employees and have at least one 'REPORTS_TO' relationship. Always limit the number of results if more than one row is expected from the questions by saying 'first 3' or 'top 5' elements",
    "Simple Aggregation Queries": "Simple aggregation involves calculating basic statistical metrics over properties of nodes or relationships, such as counting the number of nodes, averaging property values, or determining maximum and minimum values. These queries summarize data characteristics and support quick analytical conclusions. Always limit the number of results if more than one row is expected from the questions by saying 'first 3' or 'top 5' elements",
    "Pathfinding Queries": "Specialized in exploring connections between nodes, these queries are used to find the shortest path, identify all paths up to a certain length, or explore possible routes within a network. They are essential for applications in network analysis, routing, logistics, and social network exploration. Always limit the number of results if more than one row is expected from the questions by saying 'first 3' or 'top 5' elements",
    "Complex Aggregation Queries": "The most sophisticated category, these queries involve multiple aggregation functions and often group results over complex subgraphs. They calculate metrics like average number of reports per manager or total sales volume through a network, supporting strategic decision making and advanced reporting. Always limit the number of results if more than one row is expected from the questions by saying 'first 3' or 'top 5' elements",
    "Verbose query": "These queries are characterized by their explicit and detailed specifications about the data retrieval process and the exact information needed. They involve elaborate instructions for navigating through complex data structures, specifying precise criteria for inclusion, exclusion, and sorting of data points. Verbose queries typically require the breakdown of each step in the querying process, from the initial identification of relevant data nodes and relationships to the intricate filtering and sorting mechanisms that must be applied. Always limit the number of results if more than one row is expected from the questions by saying 'first 3' or 'top 5' elements",
    "Evaluation query": "This query type focuses on retrieving specific pieces of data from complex databases with precision. Use clear and detailed instructions to extract relevant information, such as movie titles, product names, or employee IDs, depending on the context. Always ask for a single property or item, titled intuitively based on the data retrieved (e.g., Movie Titles Featuring Tom Cruise). Limit the results to a specific number like 'first 3' or 'top 5' to keep the output concise and focused.",
    "Multi-step Queries": "Multistep queries in a graph database involve executing several operations or traversals to derive the answer. These queries typically combine different data elements by following multiple relationships and filtering nodes at various steps to reach a final result. They often require joining data from various parts of the schema, aggregating results, or applying multiple conditions to uncover complex insights that are not immediately apparent from a single node or relationship"
}

In [None]:
prompt_template = """Your task is to generate 100 questions that are directly related to a specific graph schema in Neo4j. Each question should target distinct aspects of the schema, such as relationships between nodes, properties of nodes, or characteristics of node types. Ensure that the questions vary in complexity, covering basic, intermediate, and advanced queries.
Imagine you are a user at a company that needs to present all the types of questions that the graph can answer.
You have to be very diligent at your job. Make sure you will accomplish a diversity of questions, ranging from various complexities.

Avoid ambiguous questions. For clarity, an ambiguous question is one that can be interpreted in multiple ways or does not have a straightforward answer based on the schema. For example, avoid asking, "What is related to this?" without specifying the node type or relationship.
Please design each question to yield a limited number of results, specifically between 3 and 10 results. This will ensure that the queries are precise and suitable for detailed analysis and training.
The goal of these questions is to create a dataset for training AI models to convert natural language queries into Cypher queries effectively.
It is vital that the database contains information that can answer the question!
Never write any assumptions, just the questions!!!
Make sure to generate 100 questions!

Make sure to create questions for the following graph schema:{input}\n
Here are some example nodes and relationship values: {values}.
Don't use any values that aren't found in the schema or in provided values.
{query_type}
Also, do not ask questions that there is no way to answer based on the schema or provided example values.
Find good questions that will test the capabilities of graph answering.
The output should be 1 question per row. Example output format:
What movies did Tom Cruise act in?
Which product made the most revenue?
Who is the manager of the team that completed the most projects last year?
Generated questions:"""

In [None]:
from langchain_core.prompts.prompt import PromptTemplate

prompt = PromptTemplate(
    input_variables=["input", "values", "query_type"], template=prompt_template
)

chain = prompt | llm

In [None]:
import re

def remove_enumeration(text):
    # This regular expression matches numbers followed by a dot and an optional space at the start of a string
    return re.sub(r'^\d+\.\s?', '', text).strip()
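
The generation step later in this notebook splits the model response on newlines, and its output shows a Markdown header row (`## Neo4j Graph Schema Questions:`) ending up in the question list. The following is a minimal sketch of a stricter line parser, assuming that header lines always start with `#`. `parse_questions` is a hypothetical helper and is not part of the original pipeline.

```python
import re
from typing import List


def remove_enumeration(text: str) -> str:
    # Same logic as the notebook helper: strip a leading "12. " style prefix
    return re.sub(r'^\d+\.\s?', '', text).strip()


def parse_questions(raw: str) -> List[str]:
    """Turn a raw LLM response into a clean list of questions (hypothetical helper)."""
    questions = []
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and Markdown headers such as "## Neo4j Graph Schema Questions:"
        if not line or line.startswith("#"):
            continue
        questions.append(remove_enumeration(line))
    return questions


sample = (
    "## Neo4j Graph Schema Questions:\n"
    '1. Who directed the movie with the id "198"?\n'
    "\n"
    '2. Which movies feature the actor "이성재"?'
)
print(parse_questions(sample))
```

With a parser like this, the later `questions.iloc[1:]` step that drops the header row would no longer be needed.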

In [None]:
all_questions = []

print(database)
graph = Neo4jGraph(
    url=URL,
    database=database,
    username=database,
    password=password,
    enhanced_schema=True,
    sanitize=True,
    timeout=30,
)
schema = graph.schema
for query_type, description in query_types.items():
    print(query_type)
    instructions = f"{query_type}: {description}"
    # Sample a few 3-hop paths so the prompt contains real values from the graph
    values = graph.query(
        """
        MATCH (n)
        WHERE rand() > 0.6
        WITH n LIMIT 2
        CALL {
            WITH n
            MATCH p=(n)-[*3..3]-()
            RETURN p LIMIT 1
        }
        RETURN p
        """
    )

    try:  # the LLM call sometimes times out
        questions = chain.invoke(
            {"input": schema, "query_type": instructions, "values": values}
        )
        all_questions.extend(
            [
                {"question": remove_enumeration(el), "type": query_type, "database": database}
                for el in questions.content.split("\n")
                if el and "## 100" not in el
            ]
        )
    except Exception as e:
        # Skip this query type, but report the failure instead of hiding it
        print(f"Skipping {query_type}: {e}")
        continue

neo4j
Simple Retrieval Queries
Complex Retrieval Queries
Simple Aggregation Queries
Pathfinding Queries
Complex Aggregation Queries
Verbose query
Evaluation query
Multi-step Queries


In [None]:
import pandas as pd
df = pd.DataFrame.from_records(all_questions)
df

Unnamed: 0,question,type,database
0,## Neo4j Graph Schema Questions:,Simple Retrieval Queries,neo4j
1,What are the titles of the movies directed by ...,Simple Retrieval Queries,neo4j
2,"Which movies feature the actor ""이성재""?",Simple Retrieval Queries,neo4j
3,"What is the synopsis of the movie ""플란다스의 개""?",Simple Retrieval Queries,neo4j
4,"Who directed the movie with the id ""198""?",Simple Retrieval Queries,neo4j
...,...,...,...
1119,Which movies have synopses that contain the wo...,Multi-step Queries,neo4j
1120,Which movies have synopses that contain the wo...,Multi-step Queries,neo4j
1121,Which movies have synopses that contain the wo...,Multi-step Queries,neo4j
1122,Which movies have synopses that contain the wo...,Multi-step Queries,neo4j


In [None]:
df.drop_duplicates(subset='question').to_csv('gemini_questions.csv', index=False)

In [None]:
import os
from typing import List, Union

import pandas as pd
from langchain.chains.graph_qa.cypher_utils import CypherQueryCorrector, Schema
from langchain_community.graphs import Neo4jGraph
from langchain_core.messages import (
    AIMessage,
    SystemMessage,
    ToolMessage,
)
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    MessagesPlaceholder,
)
from pydantic import BaseModel  # langchain_core.pydantic_v1 is deprecated
from langchain_core.runnables import RunnablePassthrough
# from langchain_google_vertexai import ChatVertexAI

# LLMs
# os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "credentials.json"
# llm = ChatVertexAI(model_name="gemini-1.5-pro")

system = """Given an input question, convert it to a Cypher query.
To translate a question into a Cypher query, please follow these steps:

1. Carefully analyze the provided graph schema to understand what nodes, relationships, and properties are available. Pay attention to the node labels, relationship types, and property keys.
2. Identify the key entities and relationships mentioned in the natural language question. Map these to the corresponding node labels, relationship types, and properties in the graph schema.
3. Think through how to construct a Cypher query to retrieve the requested information step-by-step. Focus on:
   - Identifying the starting node(s)
   - Traversing the necessary relationships
   - Filtering based on property values
   - Returning the requested information
Feel free to use multiple MATCH, WHERE, and RETURN clauses as needed.
4. Explain how your Cypher query will retrieve the necessary information from the graph to answer the original question. Provide this explanation inside <explanation> tags.
5. Once you have finished explaining, construct the Cypher query inside triple backticks ```cypher```.

Remember, the goal is to construct a Cypher query that will retrieve the relevant information to answer the question based on the given graph schema.
Carefully map the entities and relationships in the question to the nodes, relationships, and properties in the schema.
Additional instructions:
1. **Array Length**: Always use `size(array)` instead of `length(array)` to get the number of elements in an array.
2. **Implicit aggregations**: Always use intermediate WITH clause when performing aggregations
3. **Target Neo4j version is 5**: Use Cypher syntax for Neo4j version 5 and above. Do not use any deprecated syntax.
"""

# Generate Cypher statement based on natural language input
cypher_template = """Based on the Neo4j graph schema below, write a Cypher query that would answer the user's question:
{schema}

Question: {question}"""  # noqa: E501

cypher_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system),
        ("human", cypher_template),
    ]
)

cypher_response = (
    cypher_prompt
    | llm
    | StrOutputParser()
)

In [None]:
questions = pd.read_csv("gemini_questions.csv")
# Drop the first row: it is the Markdown header the model emitted, not a question
questions = questions.iloc[1:].reset_index(drop=True)
questions.head()


Unnamed: 0,question,type,database
0,What are the titles of the movies directed by ...,Simple Retrieval Queries,neo4j
1,"Which movies feature the actor ""이성재""?",Simple Retrieval Queries,neo4j
2,"What is the synopsis of the movie ""플란다스의 개""?",Simple Retrieval Queries,neo4j
3,"Who directed the movie with the id ""198""?",Simple Retrieval Queries,neo4j
4,"List the names of all actors who acted in ""주노명...",Simple Retrieval Queries,neo4j


In [None]:
schemas = {}
all_schemas = []

graph = Neo4jGraph(
    url=URL,
    database=database,
    username=database,
    password=password,
    enhanced_schema=True,
    sanitize=True,
)
schema = graph.schema
schemas[database] = schema
all_schemas.append(
    {
        "database": database,
        "schema": schema,
        "structured_schema": graph.structured_schema,
    }
)

df_schemas = pd.DataFrame.from_records(all_schemas)
df_schemas.to_csv("text2cypher_schemas.csv", index=False)



In [None]:
schemas = pd.read_csv('text2cypher_schemas.csv')
# Map each database name to its schema string
schema_dict = dict(zip(schemas['database'], schemas['schema']))

In [None]:
schemas

Unnamed: 0,database,schema,structured_schema
0,neo4j,Node properties:\n- **Director**\n - `name`: ...,{'node_props': {'Director': [{'property': 'nam...


In [None]:
schema_dict

{'neo4j': 'Node properties:\n- **Director**\n  - `name`: STRING Example: "봉준호"\n- **Movie**\n  - `title`: STRING Example: "플란다스의 개"\n  - `id`: INTEGER Example: "1"\n- **Synopsis**\n  - `seqId`: INTEGER Example: "0"\n  - `text`: STRING Example: "대학 시간강사인 고윤주(이성재)는 이번에도 교수직 추천에서 보기 좋게 떨어진다. 돈 잘 버"\n- **Actor**\n  - `name`: STRING Example: "이성재, 배두나, 변희봉, 김호정, 김뢰하, 고수희, 김진구, 임상수, 성정선, 조재하, "\nRelationship properties:\n\nThe relationships:\n(:Movie)-[:HAS_SYNOPSIS]->(:Synopsis)\n(:Movie)-[:DIRECTED_BY]->(:Director)\n(:Movie)-[:ACTED_BY]->(:Actor)'}

In [None]:
import re
def extract_cypher(text):
    # Capture everything between ```cypher and the closing ```, across multiple lines
    pattern = r"```cypher\n(.*?)\n```"
    match = re.search(pattern, text, re.DOTALL)

    if match:
        # Return the extracted query
        return match.group(1).strip()
    else:
        # No fenced cypher block was found
        return None

def extract_explanation(text):
    pattern = re.compile(r'<explanation>(.*?)</explanation>', re.DOTALL)
    match = pattern.search(text)
    if match:
        explanation_content = match.group(1).strip()
        return explanation_content
    else:
        return None
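
As a usage illustration, the sketch below runs both extractors on a hypothetical model response written in the format the system prompt asks for. It is a self-contained variant, not the notebook's exact code: the pattern here tolerates any whitespace after the `cypher` tag, which is an assumption about how model output may vary, and the fence is built from a `FENCE` constant only to keep the example easy to embed.

```python
import re

FENCE = "`" * 3  # three backticks

CYPHER_BLOCK = re.compile(FENCE + r"cypher\s*(.*?)" + FENCE, re.DOTALL)
EXPLANATION = re.compile(r"<explanation>(.*?)</explanation>", re.DOTALL)


def extract_cypher(text):
    # Return the query inside the fenced cypher block, or None if there is none
    match = CYPHER_BLOCK.search(text)
    return match.group(1).strip() if match else None


def extract_explanation(text):
    # Return the text inside <explanation> tags, or None if the tags are missing
    match = EXPLANATION.search(text)
    return match.group(1).strip() if match else None


# Hypothetical response, following the format requested by the system prompt
response = (
    "<explanation>\nMatch the movie by title, then follow HAS_SYNOPSIS.\n</explanation>\n"
    + FENCE + "cypher\n"
    + 'MATCH (m:Movie {title: "플란다스의 개"})-[:HAS_SYNOPSIS]->(s:Synopsis)\n'
    + "RETURN s.text\n"
    + FENCE
)

print(extract_explanation(response))
print(extract_cypher(response))
print(extract_cypher("I cannot answer this from the schema."))
```

Rows where the extractor returns `None` are the ones later flagged as `no_cypher`.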

In [None]:
cypher_responses = []

In [None]:
import time
for i, row in questions.iterrows():
    if i % 50 == 0:
        print(i)
    schema = schema_dict[row["database"]]
    try:
        output = cypher_response.invoke({"question": row["question"], "schema": schema})
        cypher_responses.append(
            {
                "question": row["question"],
                "database": row["database"],
                "output": output,
                "type": row["type"],
                "cypher": extract_cypher(output),
                "explanation": extract_explanation(output)
            }
        )
    except Exception as e:
        time.sleep(2)
        output = cypher_response.invoke({"question": row["question"], "schema": schema})
        cypher_responses.append(
            {
                "question": row["question"],
                "database": row["database"],
                "output": output,
                "type": row["type"],
                "cypher": extract_cypher(output),
                "explanation": extract_explanation(output)
            }
        )



0




ResourceExhausted: 429 Resource has been exhausted (e.g. check quota).

In [None]:
import time

def invoke_with_backoff(question, schema, retries=5, backoff=2):
    for attempt in range(retries):
        try:
            return cypher_response.invoke({"question": question, "schema": schema})
        except Exception:
            if attempt < retries - 1:
                time.sleep(backoff ** attempt)  # exponential backoff: 1, 2, 4, 8 seconds
            else:
                raise

for i, row in questions.iterrows():
    if i % 50 == 0:
        print(i)
    schema = schema_dict[row["database"]]
    try:
        output = invoke_with_backoff(row["question"], schema)
        cypher_responses.append(
            {
                "question": row["question"],
                "database": row["database"],
                "output": output,
                "type": row["type"],
                "cypher": extract_cypher(output),
                "explanation": extract_explanation(output)
            }
        )
    except Exception as e:
        print(f"Failed after retries: {e}")


0
Failed after retries: 429 Resource has been exhausted (e.g. check quota).
50
Failed after retries: 429 Resource has been exhausted (e.g. check quota).
Failed after retries: 429 Resource has been exhausted (e.g. check quota).
Failed after retries: 429 Resource has been exhausted (e.g. check quota).
100
Failed after retries: 429 Resource has been exhausted (e.g. check quota).
150
Failed after retries: 429 Resource has been exhausted (e.g. check quota).
Failed after retries: 429 Resource has been exhausted (e.g. check quota).
200
Failed after retries: 429 Resource has been exhausted (e.g. check quota).
250
Failed after retries: 429 Resource has been exhausted (e.g. check quota).
Failed after retries: 429 Resource has been exhausted (e.g. check quota).
Failed after retries: 429 Resource has been exhausted (e.g. check quota).
300
Failed after retries: 429 Resource has been exhausted (e.g. check quota).
Failed after retries: 429 Resource has been exhausted (e.g. check quota).
350
Failed after retries: 429 Resource has been exhausted (e.g. check quota).
Failed after retries: 429 Resource has been exhausted (e.g. check quota).
400
Failed after retries: 429 Resource has been exhausted (e.g. check quota).
Failed after retries: 429 Resource has been exhausted (e.g. check quota).
450
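
Backoff alone still left a number of requests failing with 429 errors. One complementary approach is to pace requests on the client side so the quota is hit less often. The sketch below is a minimal limiter that enforces a minimum interval between calls. `MinIntervalLimiter` is a hypothetical helper, and the right interval depends on your own quota, which this notebook does not state.

```python
import time


class MinIntervalLimiter:
    """Blocks so that consecutive calls are at least `min_interval` seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        # clock and sleep are injectable so the limiter can be tested without waiting
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first call

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Assumed pacing for illustration only: at most one request every 4 seconds
limiter = MinIntervalLimiter(min_interval=4.0)
```

In the generation loop, calling `limiter.wait()` immediately before each `invoke_with_backoff(...)` call would apply the pacing, with backoff still acting as a fallback.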


In [None]:
results = pd.DataFrame.from_records(cypher_responses)

In [None]:
results

Unnamed: 0,question,database,output,type,cypher,explanation
0,What are the titles of the movies directed by ...,neo4j,<explanation>\nThe query will first match all ...,Simple Retrieval Queries,MATCH (m:Movie)-[:DIRECTED_BY]->(d:Director {n...,The query will first match all movies that hav...
1,"Which movies feature the actor ""이성재""?",neo4j,<explanation>\nThis query finds all movies tha...,Simple Retrieval Queries,MATCH (m:Movie)-[:ACTED_BY]->(a:Actor)\nWHERE ...,This query finds all movies that have a relati...
2,"What is the synopsis of the movie ""플란다스의 개""?",neo4j,<explanation>\nThis query first matches the mo...,Simple Retrieval Queries,"MATCH (m:Movie {title: ""플란다스의 개""})-[:HAS_SYNOP...",This query first matches the movie with the ti...
3,"Who directed the movie with the id ""198""?",neo4j,<explanation>\nThis query will first find the ...,Simple Retrieval Queries,MATCH (m:Movie {id: 198})<-[:DIRECTED_BY]-(d:D...,This query will first find the movie with the ...
4,"List the names of all actors who acted in ""주노명...",neo4j,<explanation>\nThis Cypher query will first ma...,Simple Retrieval Queries,"MATCH (m:Movie {title: ""주노명 베이커리""})<-[:ACTED_B...",This Cypher query will first match all movies ...
...,...,...,...,...,...,...
473,Which movies have synopses that contain the wo...,neo4j,<explanation>This query first finds all movies...,Multi-step Queries,MATCH (m:Movie)<-[:ACTED_BY]-(a:Actor {name: '...,This query first finds all movies that satisfy...
474,Which movies have synopses that contain the wo...,neo4j,<explanation>This query retrieves movies that ...,Multi-step Queries,MATCH (m:Movie)<-[:ACTED_BY]-(a:Actor)\nWHERE ...,This query retrieves movies that meet a comple...
475,Which movies have synopses that contain the wo...,neo4j,<explanation>This query will first find all mo...,Multi-step Queries,"MATCH (m:Movie)\nWHERE m.title IN [""플란다스의 개""]\...",This query will first find all movies that hav...
476,Which movies have synopses that contain the wo...,neo4j,<explanation>\nThis Cypher query retrieves mov...,Multi-step Queries,,This Cypher query retrieves movies that satisf...


In [None]:


def create_graph(database):
    return Neo4jGraph(
        url=URL,
        username=database,
        password=password,
        database=database,
        refresh_schema=False,
        timeout=10,
    )

syntax_error = []
returns_results = []
timeouts = []
not_possible = []
last_graph = ""
for i, row in results.iterrows():
    if i % 100 == 0:
        print(i)

    # To avoid a new driver for every request
    if row["database"] != last_graph:
        last_graph = row["database"]
        print(last_graph)
        graph = create_graph(row["database"])
    # No usable Cypher: extraction found no query, or the model only wrote a comment
    if not isinstance(row["cypher"], str) or row["cypher"].startswith("//"):
        returns_results.append(False)
        syntax_error.append(False)
        timeouts.append(False)
        not_possible.append(True)
    else:
        not_possible.append(False)
        try:
            data = graph.query(row["cypher"])
            returns_results.append(bool(data))
            syntax_error.append(False)
            timeouts.append(False)
        except ValueError as e:
            if "Generated Cypher Statement is not valid" in str(e):
                syntax_error.append(True)
                print(f"Syntax error in Cypher query: {e}")
            else:
                syntax_error.append(False)
                print(f"Other ValueError: {e}")
            returns_results.append(False)
            timeouts.append(False)
        except Exception as e:
            returns_results.append(False)
            syntax_error.append(False)
            # Note: every non-ValueError exception is recorded as a timeout,
            # whether or not it carries the transaction-timeout error code
            timeouts.append(True)
            if (
                getattr(e, "code", None)
                != "Neo.ClientError.Transaction.TransactionTimedOutClientConfiguration"
            ):
                # On unexpected errors, recreate the graph object
                try:
                    graph._driver.close()
                except Exception:
                    pass
                graph = create_graph(row["database"])

0
neo4j
100
200
300
400
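
The validation loop above keeps four parallel lists in sync and records every unexpected exception as a timeout. The sketch below shows one possible refactoring into a single function that returns all flags for one query. `classify_cypher` and the extra `other_error` flag are assumptions added for illustration, and `run_query` stands in for `graph.query`.

```python
from typing import Any, Callable, Dict

TIMEOUT_CODE = "Neo.ClientError.Transaction.TransactionTimedOutClientConfiguration"


def classify_cypher(cypher: Any, run_query: Callable[[str], Any]) -> Dict[str, bool]:
    """Run one generated query and describe the outcome as boolean flags."""
    flags = {
        "no_cypher": False,
        "syntax_error": False,
        "timeout": False,
        "other_error": False,  # not in the original loop; separates non-timeout failures
        "returns_results": False,
    }
    # Missing queries (None or NaN) and comment-only answers are not executed
    if not isinstance(cypher, str) or cypher.startswith("//"):
        flags["no_cypher"] = True
        return flags
    try:
        flags["returns_results"] = bool(run_query(cypher))
    except ValueError as e:
        if "Generated Cypher Statement is not valid" in str(e):
            flags["syntax_error"] = True
        else:
            flags["other_error"] = True
    except Exception as e:
        if getattr(e, "code", None) == TIMEOUT_CODE:
            flags["timeout"] = True
        else:
            flags["other_error"] = True
    return flags


# Usage with a stand-in for graph.query that always returns one row
print(classify_cypher("MATCH (m:Movie) RETURN m.title LIMIT 3", lambda q: [{"m.title": "x"}]))
```

Collecting one dictionary per row and passing the list to `pd.DataFrame.from_records` would give the same four columns plus `other_error`, without the risk of the lists drifting out of step.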




In [None]:
syntax_error

[False, False, False, ..., False]


In [None]:
results["syntax_error"] = syntax_error
results["timeout"] = timeouts
results["returns_results"] = returns_results
results["no_cypher"] = not_possible

In [None]:
results.to_csv('text2cypher_gemini.csv', index=False)


In [None]:
results

Unnamed: 0,question,database,output,type,cypher,explanation,syntax_error,timeout,returns_results,no_cypher
0,What are the titles of the movies directed by ...,neo4j,<explanation>\nThe query will first match all ...,Simple Retrieval Queries,MATCH (m:Movie)-[:DIRECTED_BY]->(d:Director {n...,The query will first match all movies that hav...,False,False,True,False
1,"Which movies feature the actor ""이성재""?",neo4j,<explanation>\nThis query finds all movies tha...,Simple Retrieval Queries,MATCH (m:Movie)-[:ACTED_BY]->(a:Actor)\nWHERE ...,This query finds all movies that have a relati...,False,False,False,False
2,"What is the synopsis of the movie ""플란다스의 개""?",neo4j,<explanation>\nThis query first matches the mo...,Simple Retrieval Queries,"MATCH (m:Movie {title: ""플란다스의 개""})-[:HAS_SYNOP...",This query first matches the movie with the ti...,False,False,True,False
3,"Who directed the movie with the id ""198""?",neo4j,<explanation>\nThis query will first find the ...,Simple Retrieval Queries,MATCH (m:Movie {id: 198})<-[:DIRECTED_BY]-(d:D...,This query will first find the movie with the ...,False,False,False,False
4,"List the names of all actors who acted in ""주노명...",neo4j,<explanation>\nThis Cypher query will first ma...,Simple Retrieval Queries,"MATCH (m:Movie {title: ""주노명 베이커리""})<-[:ACTED_B...",This Cypher query will first match all movies ...,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...
473,Which movies have synopses that contain the wo...,neo4j,<explanation>This query first finds all movies...,Multi-step Queries,MATCH (m:Movie)<-[:ACTED_BY]-(a:Actor {name: '...,This query first finds all movies that satisfy...,False,False,False,False
474,Which movies have synopses that contain the wo...,neo4j,<explanation>This query retrieves movies that ...,Multi-step Queries,MATCH (m:Movie)<-[:ACTED_BY]-(a:Actor)\nWHERE ...,This query retrieves movies that meet a comple...,False,False,False,False
475,Which movies have synopses that contain the wo...,neo4j,<explanation>This query will first find all mo...,Multi-step Queries,"MATCH (m:Movie)\nWHERE m.title IN [""플란다스의 개""]\...",This query will first find all movies that hav...,False,False,False,False
476,Which movies have synopses that contain the wo...,neo4j,<explanation>\nThis Cypher query retrieves mov...,Multi-step Queries,,This Cypher query retrieves movies that satisf...,False,False,False,True


In [None]:
# Count True/False values for each boolean result column
for column in ["syntax_error", "timeout", "returns_results", "no_cypher"]:
    print(f"Distribution for {column}:\n", results[column].value_counts())

Distribution for syntax_error:
 syntax_error
False    478
Name: count, dtype: int64
Distribution for timeout:
 timeout
False    389
True      89
Name: count, dtype: int64
Distribution for returns_results:
 returns_results
False    312
True     166
Name: count, dtype: int64
Distribution for no_cypher:
 no_cypher
False    473
True       5
Name: count, dtype: int64
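
The four distributions above share the same denominator of 478 rows, so the counts can be turned into rates for easier comparison. The short sketch below does this arithmetic with the `True` counts copied from the printed output. No `True` row appears for `syntax_error`, so its count is taken as 0.

```python
# True counts copied from the value_counts() output above
total = 478
true_counts = {
    "syntax_error": 0,
    "timeout": 89,
    "returns_results": 166,
    "no_cypher": 5,
}

# Share of rows for which each flag is True
rates = {name: count / total for name, count in true_counts.items()}

for name, rate in rates.items():
    print(f"{name}: {true_counts[name]}/{total} = {rate:.1%}")
```

Keep in mind that the `timeout` flag is also set for other unexpected exceptions in the validation loop, so its rate is an upper bound on real timeouts.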


# Testing

In [None]:
model = ChatGroq(temperature=0, model_name="llama3-8b-8192", groq_api_key=groq_api_key)
chain = GraphCypherQAChain.from_llm(graph=graph, llm=model, verbose=True)

In [None]:
questions = ["Who is the oldest director?",
             "Find all directors who have directed a movie in Spanish language.",
             "Give me 5 movies where a director has also acted?",
             "List all movies with an IMDb rating greater than 5 that have been directed by a director born in China."
             ]

# POSSIBLE CORRECT CYPHER QUERY
# 1. MATCH (d:Director) WHERE d.born IS NOT NULL RETURN d ORDER BY d.born ASC LIMIT 1
# 2. MATCH (d:Director)-[:DIRECTED]->(m:Movie) WHERE 'Spanish' IN m.languages RETURN d.name
# 3. MATCH (d:Director)-[:ACTED_IN]->(m:Movie) WHERE exists{ (d)-[:DIRECTED]->(m) } RETURN m.title AS MovieTitle, m.movieId AS MovieID LIMIT 5
# 4. MATCH (m:Movie)<-[:DIRECTED]-(d:Director) WHERE m.imdbRating > 5 AND d.bornIn = 'China' RETURN m

for q in questions:
    print("\n", q)
    try:
        result = chain.invoke(q)['result']
        print(result)
    except Exception as e:
        # Report the failure instead of silently skipping the question
        print(f"Failed: {e}")


 Who is the oldest director?


> Entering new GraphCypherQAChain chain...
Generated Cypher:
MATCH (d:Director)-[:born]->(b) RETURN d.name AS director, b AS birthdate ORDER BY b ASC LIMIT 1;
Full Context:
[]

> Finished chain.
I don't know the answer.

 Find all directors who have directed a movie in Spanish language.


> Entering new GraphCypherQAChain chain...
Generated Cypher:
MATCH (d:Director)-[:DIRECTED]->(m:Movie)-[:IN_GENRE]->(g:Genre)<-[:IN_GENRE]-(m2:Movie)<-[:LANGUAGES]-(l:Language) WHERE l.name = 'Spanish' RETURN d;
Full Context:
[]

> Finished chain.
I don't know the answer.

 Give me 5 movies where a director has also acted?


> Entering new GraphCypherQAChain chain...
Generated Cypher:
MATCH (d:Director)-[:ACTED_IN]->(m:Movie) RETURN m LIMIT 5;
Full Context:
[{'m': {'languages': ['English'], 'year': 1919, 'imdbId': '0009932', 'runtime'

New update from LangChain (09/05/2024): combine the `validate_cypher` parameter, which validates and corrects relationship directions in the generated Cypher, with the `enhanced_schema` parameter to get the best results.
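
Several of the failed queries in the previous run use relationship types that the reference queries never mention, such as `[:born]` or `[:LANGUAGES]`. The sketch below is a hypothetical, regex-based check for that kind of mismatch. It is not part of LangChain and is not how `validate_cypher` works internally, and the set of known types is an assumption taken from the reference queries in this notebook; in practice it would come from `graph.schema`.

```python
import re

# Relationship types assumed from the reference queries in this notebook;
# replace with the types listed in graph.schema for your database.
KNOWN_REL_TYPES = {"DIRECTED", "ACTED_IN", "IN_GENRE", "RATED"}

# Matches the type in patterns such as [:DIRECTED] or [r:RATED].
REL_TYPE_PATTERN = re.compile(r"\[\s*\w*\s*:\s*([A-Za-z_]\w*)")

def unknown_relationship_types(cypher, known=KNOWN_REL_TYPES):
    """Return relationship types used in `cypher` that are not in `known`."""
    return set(REL_TYPE_PATTERN.findall(cypher)) - set(known)

generated = (
    "MATCH (d:Director)-[:born]->(b) "
    "RETURN d.name AS director, b AS birthdate ORDER BY b ASC LIMIT 1;"
)
print(unknown_relationship_types(generated))
```

A non-empty result is a cheap signal that the query will return nothing, before it is ever sent to the database.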

In [None]:
model = ChatGroq(temperature=0, model_name="llama3-8b-8192", groq_api_key = groq_api_key)
chain = GraphCypherQAChain.from_llm(graph=graph, llm=model, verbose=True, validate_cypher = True)

In [None]:
questions = ["Who is the oldest director?",
             "Find all directors who have directed a movie in Spanish language.",
             "Give me 5 movies where a director has also acted?",
             "List all movies with an IMDb rating greater than 5 that have been directed by a director born in China."
             ]

for q in questions:
    print("\n", q)
    try:
        result = chain.invoke(q)['result']
        print(result)
    except:
        pass


 Who is the oldest director?


> Entering new GraphCypherQAChain chain...
Generated Cypher:

Full Context:
[]

> Finished chain.
I don't know the answer.

 Find all directors who have directed a movie in Spanish language.


> Entering new GraphCypherQAChain chain...
Generated Cypher:

Full Context:
[]

> Finished chain.
I don't know the answer.

 Give me 5 movies where a director has also acted?


> Entering new GraphCypherQAChain chain...
Generated Cypher:
MATCH (d:Director)-[:ACTED_IN]->(m:Movie) RETURN m LIMIT 5;
Full Context:
[{'m': {'languages': ['English'], 'year': 1919, 'imdbId': '0009932', 'runtime': 12, 'imdbRating': 6.1, 'movieId': '72626', 'countries': ['USA'], 'imdbVotes': 503, 'title': 'Billy Blazes, Esq.', 'url': 'https://themoviedb.org/movie/53516', 'tmdbId': '53516', 'plot': 'Billy Blazes confronts Crooked Charley, who has been rulin

# Fine Tuning using Unsloth

The code below is from the Unsloth repository: https://colab.research.google.com/drive/135ced7oHytdxu3N2DNe1Z0kqjyYIkDXp?usp=sharing

In [1]:
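# -y skips the confirmation prompt, which cannot be answered from a notebook cell
!pip uninstall -y xformers
# Pin xformers to a version compatible with the installed PyTorch (this note targets PyTorch 2.5.1).
# Keep comments on their own line: pip parses a trailing "#" on a !pip line as a requirement.
!pip install xformers==0.0.20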


In [None]:
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia

Looking in indexes: https://download.pytorch.org/whl/cu121
Collecting torch==2.2.2+cu121
  Downloading https://download.pytorch.org/whl/cu121/torch-2.2.2%2Bcu121-cp310-cp310-linux_x86_64.whl (757.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 757.3/757.3 MB 2.0 MB/s eta 0:00:00
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch==2.2.2+cu121)
  Downloading https://download.pytorch.org/whl/cu121/nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.7/23.7 MB 27.4 MB/s eta 0:00:00
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch==2.2.2+cu121)
  Downloading https://download.pytorch.org/whl/cu121/nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 823.6/823.6 kB 21.9 MB/s eta 0:00:00
Collecting nvidia-cuda-cu

In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "MLP-KTLim/llama-3-Korean-Bllossom-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

ModuleNotFoundError: No module named 'unsloth'

In [None]:
!pip install triton==2.3.0


Collecting triton==2.3.0
  Downloading triton-2.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.4 kB)
Downloading triton-2.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (168.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 168.1/168.1 MB 7.3 MB/s eta 0:00:00
Installing collected packages: triton
  Attempting uninstall: triton
    Found existing installation: triton 2.0.0
    Uninstalling triton-2.0.0:
      Successfully uninstalled triton-2.0.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torch 2.0.1 requires triton==2.0.0; platform_system == "Linux" and platform_machine == "x86_64", but you have triton 2.3.0 which is incompatible.
torchaudio 2.2.2+cu121 requires torch==2.2.2, but you have torch 2.0.1 which is incompatible.
torchvision 0.17.2+cu121 requires torch=

Adding LoRA adapters

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.4 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
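
The LoRA settings above fix the number of trainable parameters: each adapted linear layer gains two low-rank matrices, adding `r * (in_features + out_features)` weights. The sketch below works this out under assumed Llama-3 8B dimensions (hidden size 4096, key/value projection width 1024, MLP width 14336, 32 layers). None of these are stated in this notebook, and the checkpoint loaded here is only presumed to share them.

```python
# Assumed Llama-3 8B dimensions (not stated in this notebook).
HIDDEN = 4096          # model hidden size
KV_DIM = 1024          # 8 key/value heads x 128 head dim (grouped-query attention)
INTERMEDIATE = 14336   # MLP inner size
NUM_LAYERS = 32
RANK = 16              # r passed to get_peft_model

# (in_features, out_features) of each targeted projection
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features, out_features, rank):
    """LoRA adds A (rank x in) and B (out x rank) per adapted linear layer."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")  # 41,943,040
```

Under these assumptions the total agrees with the "Number of trainable parameters = 41,943,040" line in the Unsloth training banner further down.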


## Create a dataset

In [None]:
prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    # The same instruction (including the graph schema) is used for every example
    instruction = f"Convert text to cypher query based on this schema: {graph.schema}"
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for question, cypher in zip(inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = prompt.format(instruction, question, cypher) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }

Load data from HuggingFace

In [None]:
from datasets import load_dataset
dataset = load_dataset("yahma/alpaca-cleaned", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)
dataset

Dataset({
    features: ['output', 'input', 'instruction', 'text'],
    num_rows: 51760
})

Load our own data

We use the synthetic GPT-4 Turbo dataset from the text2cypher repository: https://github.com/tomasonjo/text2cypher/blob/main/datasets/synthetic_gpt4turbo_demodbs/text2cypher_gpt4turbo.csv

In [None]:
import pandas as pd

df = pd.read_csv('/content/text2cypher_gpt4turbo.csv')
df = df[(df['database'] == 'recommendations') & (df['syntax_error'] == False) & (df['timeout'] == False)]
df

Unnamed: 0,question,cypher,type,database,syntax_error,timeout,returns_results,false_schema
7275,What are the top 5 movies with a runtime great...,MATCH (m:Movie)\nWHERE m.runtime > 120\nRETURN...,Simple Retrieval Queries,recommendations,False,False,True,
7276,List the first 3 genres with movies having an ...,MATCH (m:Movie)-[:IN_GENRE]->(g:Genre)\nWHERE ...,Verbose query,recommendations,False,False,True,
7277,List the first 5 directors who have a biograph...,MATCH (d:Director)\nWHERE d.bio IS NOT NULL\nR...,Simple Retrieval Queries,recommendations,False,False,True,
7278,Which 3 movies have the most detailed plot des...,"MATCH (m:Movie)\nRETURN m.title, m.plot\nORDER...",Simple Retrieval Queries,recommendations,False,False,True,
7279,Show the top 5 actors who have acted in movies...,MATCH (a:Actor)-[:ACTED_IN]->(m:Movie)<-[:DIRE...,Simple Retrieval Queries,recommendations,False,False,True,
...,...,...,...,...,...,...,...,...
8067,Which movies have been acted in by more than 1...,MATCH (a:Actor)-[:ACTED_IN]->(m:Movie)\nWITH m...,Complex Retrieval Queries,recommendations,False,False,True,
8068,Find all movies where the director has directe...,MATCH (d:Director)-[:DIRECTED]->(m:Movie)\nWIT...,Complex Retrieval Queries,recommendations,False,False,False,
8069,Find all movies that have a plot mentioning 'h...,MATCH (m:Movie)\nWHERE m.plot CONTAINS 'hero'\...,Complex Retrieval Queries,recommendations,False,False,True,
8070,Which movies have been rated the highest by us...,"MATCH (u:User)-[r:RATED]->(m:Movie)\nWITH u, c...",Complex Retrieval Queries,recommendations,False,False,True,


In [None]:
# Keep only the two columns needed for training and rename them to match the prompt fields.
# rename() without inplace returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
df = df[['question','cypher']].rename(columns={'question': 'input','cypher':'output'})
df = df.reset_index(drop=True)
df


Unnamed: 0,input,output
0,What are the top 5 movies with a runtime great...,MATCH (m:Movie)\nWHERE m.runtime > 120\nRETURN...
1,List the first 3 genres with movies having an ...,MATCH (m:Movie)-[:IN_GENRE]->(g:Genre)\nWHERE ...
2,List the first 5 directors who have a biograph...,MATCH (d:Director)\nWHERE d.bio IS NOT NULL\nR...
3,Which 3 movies have the most detailed plot des...,"MATCH (m:Movie)\nRETURN m.title, m.plot\nORDER..."
4,Show the top 5 actors who have acted in movies...,MATCH (a:Actor)-[:ACTED_IN]->(m:Movie)<-[:DIRE...
...,...,...
757,Which movies have been acted in by more than 1...,MATCH (a:Actor)-[:ACTED_IN]->(m:Movie)\nWITH m...
758,Find all movies where the director has directe...,MATCH (d:Director)-[:DIRECTED]->(m:Movie)\nWIT...
759,Find all movies that have a plot mentioning 'h...,MATCH (m:Movie)\nWHERE m.plot CONTAINS 'hero'\...
760,Which movies have been rated the highest by us...,"MATCH (u:User)-[r:RATED]->(m:Movie)\nWITH u, c..."


In [None]:
from datasets import Dataset
dataset = Dataset.from_pandas(df)
dataset = dataset.map(formatting_prompts_func, batched = True)
dataset



Map:   0%|          | 0/762 [00:00<?, ? examples/s]

Dataset({
    features: ['input', 'output', 'text'],
    num_rows: 762
})

In [None]:
dataset[0]

{'input': 'What are the top 5 movies with a runtime greater than 120 minutes?',
 'output': 'MATCH (m:Movie)\nWHERE m.runtime > 120\nRETURN m\nORDER BY m.runtime DESC\nLIMIT 5',
 'text': 'Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nConvert text to cypher query based on this schema: Node properties:\nMovie {posterEmbedding: LIST, url: STRING, runtime: INTEGER, revenue: INTEGER, budget: INTEGER, plotEmbedding: LIST, imdbRating: FLOAT, released: STRING, countries: LIST, languages: LIST, plot: STRING, imdbVotes: INTEGER, imdbId: STRING, year: INTEGER, poster: STRING, movieId: STRING, tmdbId: STRING, title: STRING}\nGenre {name: STRING}\nUser {userId: STRING, name: STRING}\nActor {url: STRING, bornIn: STRING, bio: STRING, died: DATE, born: DATE, imdbId: STRING, name: STRING, poster: STRING, tmdbId: STRING}\nDirector {url: STRING, bornIn: STRING, born: DATE, d

# Train the model

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # max_steps = 60,
        num_train_epochs=1,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)



Map (num_proc=2):   0%|          | 0/762 [00:00<?, ? examples/s]

In [None]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
5.594 GB of memory reserved.


In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 762 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 95
 "-____-"     Number of trainable parameters = 41,943,040


Step,Training Loss
1,1.297
2,1.2773
3,1.2661
4,1.1609
5,1.0193
6,0.8402
7,0.6771
8,0.5366
9,0.399
10,0.2963
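
The "Total batch size = 8" and "Total steps = 95" figures in the banner follow from the `TrainingArguments` above. The sketch below reproduces them, assuming the trainer takes one optimizer step per `gradient_accumulation_steps` batches and floor-divides the remainder. That convention matches the banner, but other `transformers` releases may round differently.

```python
import math

num_examples = 762        # rows in the training dataset
per_device_batch = 2      # per_device_train_batch_size
grad_accum = 4            # gradient_accumulation_steps
num_gpus = 1
epochs = 1                # num_train_epochs

# Examples consumed per optimizer step
effective_batch = per_device_batch * grad_accum * num_gpus

# Mini-batches the dataloader yields in one epoch
batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_gpus))

# Assumption: optimizer steps per epoch use floor division (matches the banner)
steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
total_steps = math.ceil(epochs * steps_per_epoch)

print(effective_batch, batches_per_epoch, total_steps)  # 8 381 95
```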


In [None]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

1065.0767 seconds used for training.
17.75 minutes used for training.
Peak reserved memory = 7.344 GB.
Peak reserved memory for training = 0.0 GB.
Peak reserved memory % of max memory = 49.797 %.
Peak reserved memory for training % of max memory = 0.0 %.
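
The percentages above are plain ratios of the rounded GB values. The sketch below repeats the arithmetic with the printed numbers; `to_gb` and `percent_of` are illustrative helper names, not part of PyTorch or Unsloth.

```python
def to_gb(num_bytes):
    """Convert bytes to GB the same way the cell above does (1024**3 bytes per GB)."""
    return round(num_bytes / 1024 / 1024 / 1024, 3)

def percent_of(part_gb, total_gb):
    """Express part_gb as a percentage of total_gb, rounded to 3 decimals."""
    return round(part_gb / total_gb * 100, 3)

peak_gb = 7.344     # "Peak reserved memory" printed above
total_gb = 14.748   # "Max memory" of the Tesla T4 printed earlier

print(percent_of(peak_gb, total_gb))  # 49.797
```

Because `used_memory_for_lora` is the peak minus `start_gpu_memory`, a value of 0.0 GB means the baseline equalled the peak when this cell ran. That is what happens if the baseline is recorded after memory has already peaked.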


# Inference

In [None]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    prompt.format(
        f"Convert text to cypher query based on this schema: {graph.schema}", # instruction
        "Who is the oldest director?", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Convert text to cypher query based on this schema: Node properties:
Movie {posterEmbedding: LIST, url: STRING, runtime: INTEGER, revenue: INTEGER, budget: INTEGER, plotEmbedding: LIST, imdbRating: FLOAT, released: STRING, countries: LIST, languages: LIST, plot: STRING, imdbVotes: INTEGER, imdbId: STRING, year: INTEGER, poster: STRING, movieId: STRING, tmdbId: STRING, title: STRING}
Genre {name: STRING}
User {userId: STRING, name: STRING}
Actor {url: STRING, bornIn: STRING, bio: STRING, died: DATE, born: DATE, imdbId: STRING, name: STRING, poster: STRING, tmdbId: STRING}
Director {url: STRING, bornIn: STRING, born: DATE, died: DATE, tmdbId: STRING, imdbId: STRING, name: STRING, poster: STRING, bio: STRING}
Person {url: STRING, died: DATE, bornIn: STRING, born: DATE, imdbId: STRING, name: STRING, p

# Save the Fine-tuned Model

Local Saving

In [None]:
# model.save_pretrained("lora_model") # Local saving
# tokenizer.save_pretrained("lora_model")

Online Saving to HuggingFace

In [None]:
# hf_api must be a Hugging Face token with write access to the target repository

model.push_to_hub("projectwilsen/llama3_text2cypher_recommendations", token = hf_api)
tokenizer.push_to_hub("projectwilsen/llama3_text2cypher_recommendations", token = hf_api)

README.md:   0%|          | 0.00/580 [00:00<?, ?B/s]



adapter_model.safetensors:   0%|          | 0.00/168M [00:00<?, ?B/s]

Saved model to https://huggingface.co/projectwilsen/llama3_text2cypher_recommendations


# Load Finetuned Model from HuggingFace

In [None]:
from unsloth import FastLanguageModel

max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "projectwilsen/llama3_text2cypher_recommendations", # the fine-tuned adapter repo pushed above
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

adapter_config.json:   0%|          | 0.00/732 [00:00<?, ?B/s]

Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.


==((====))==  Unsloth: Fast Llama patching release 2024.4
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.25.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


adapter_model.safetensors:   0%|          | 0.00/168M [00:00<?, ?B/s]

In [None]:
inputs = tokenizer(
[
    prompt.format(
        f"Convert text to cypher query based on this schema: {graph.schema}", # instruction
        "Who is the oldest director?", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
result = tokenizer.batch_decode(outputs)
response = result[0].split("### Response:")[1].split("###")[0].strip().replace("<|end_of_text|>", "")
print(response)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


MATCH (d:Director)
WHERE d.born IS NOT NULL
RETURN d.name, d.born
ORDER BY d.born ASC
LIMIT 1
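
The one-line parsing above relies on the Alpaca-style prompt layout: everything after `### Response:` and before the next `###` is the generated Cypher. The sketch below wraps the same logic in a hypothetical helper, `extract_cypher`, that fails loudly when the marker is missing. The sample string is made up for illustration.

```python
RESPONSE_MARKER = "### Response:"
EOS_TEXT = "<|end_of_text|>"

def extract_cypher(decoded):
    """Return the text after the response marker, without EOS or later sections."""
    if RESPONSE_MARKER not in decoded:
        raise ValueError("response marker not found in model output")
    tail = decoded.split(RESPONSE_MARKER, 1)[1]
    # Drop anything the model generated after starting a new "###" section
    tail = tail.split("###", 1)[0]
    return tail.replace(EOS_TEXT, "").strip()

sample = (
    "### Instruction:\nConvert text to cypher query\n\n"
    "### Input:\nWho is the oldest director?\n\n"
    "### Response:\nMATCH (d:Director)\nRETURN d.name\nLIMIT 1<|end_of_text|>"
)
print(extract_cypher(sample))
```

In the notebook this would be called as `extract_cypher(result[0])` on the decoded model output.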


# Evaluating

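Unsloth is not integrated with LangChain, so `GraphCypherQAChain` cannot be used directly. Instead, the steps are wired up manually: generate the Cypher with the fine-tuned model, run it against the graph, and pass the result to an LLM that writes the final answer.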

In [None]:
from langchain_groq import ChatGroq
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from google.colab import userdata
groq_api_key = userdata.get('GROQ_API')

CYPHER_QA_TEMPLATE = """You convert context to a final answer. Understand the question, the context, then generate result.
Here is an example:

Question: Who are the directors of Harry Potter 1 and 8?
Context: [{{d.name: Chris Columbus, d.born: 10 September 1958}},{{d.name: David Yates, d.born: 8 October 1963}}]
Helpful Answer: Chris Columbus and David Yates are the directors of Harry Potter 1 and 8.

Follow this example when generating answers.
Answer in short, don't hallucinate!
Question: {question}
Context: {context}
Helpful Answer:
"""

qa_prompt = ChatPromptTemplate.from_template(CYPHER_QA_TEMPLATE)
output_parser = StrOutputParser()
llm = ChatGroq(temperature=0, model_name="llama3-8b-8192", groq_api_key = groq_api_key)
chain = qa_prompt | llm | output_parser

context = graph.query(response)
question = 'Who is the oldest director?'

chain.invoke({"context":context , "question":question})

'Georges Méliès'
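
The flow used above (generate Cypher, query the graph, summarise the rows) can be wrapped in one function that also reports failures instead of stopping the run. This is a minimal sketch: `answer_question` is a hypothetical helper, and the three callables are stand-ins so the example runs without a GPU or a database.

```python
def answer_question(question, generate_cypher, run_query, summarize):
    """Generate Cypher, run it, and summarise the rows; report failures instead of raising."""
    cypher = generate_cypher(question)
    if not cypher:
        return {"cypher": "", "context": [], "answer": "No Cypher was generated."}
    try:
        context = run_query(cypher)
    except Exception as exc:  # e.g. a Cypher syntax error raised by the driver
        return {"cypher": cypher, "context": [], "answer": f"Query failed: {exc}"}
    answer = summarize({"question": question, "context": context})
    return {"cypher": cypher, "context": context, "answer": answer}

# Stand-ins for the fine-tuned model, the graph, and the QA chain
demo = answer_question(
    "Who is the oldest director?",
    generate_cypher=lambda q: "MATCH (d:Director) RETURN d.name ORDER BY d.born LIMIT 1",
    run_query=lambda c: [{"d.name": "Georges Méliès"}],
    summarize=lambda payload: payload["context"][0]["d.name"],
)
print(demo["answer"])
```

With the objects from this notebook, the call would look like `answer_question(q, generate_cypher_query, graph.query, chain.invoke)`, since `chain.invoke` already accepts a dict with `question` and `context` keys.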

In [None]:
questions = ["Who is the oldest director?",
             "Find all directors who have directed a movie in Spanish language.",
             "Give me 5 movies where a director has also acted?",
             "List all movies with an IMDb rating greater than 5 that have been directed by a director born in China."
             ]

def generate_cypher_query(question):
    """Generate a Cypher query for `question` with the fine-tuned model."""
    inputs = tokenizer(
    [
        prompt.format(
            f"Convert text to cypher query based on this schema: {graph.schema}", # instruction
            question, # input
            "", # output - leave this blank for generation!
        )
    ], return_tensors = "pt").to("cuda")

    outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
    result = tokenizer.batch_decode(outputs)
    # Keep only the text after "### Response:" and drop the end-of-text token
    cypher_query = result[0].split("### Response:")[1].split("###")[0].strip().replace("<|end_of_text|>", "")
    return cypher_query

for q in questions:
    print("\n",q)
    cypher_query = generate_cypher_query(q)
    print(cypher_query)
    context = graph.query(cypher_query)
    print('context: ', context)
    result = chain.invoke({"context":context , "question":q})
    print(result)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.



 Who is the oldest director?
MATCH (d:Director)
WHERE d.born IS NOT NULL
RETURN d
ORDER BY d.born
LIMIT 1
context:  [{'d': {'bornIn': 'Paris, France', 'tmdbId': '11523', 'imdbId': '0617588', 'born': neo4j.time.Date(1861, 12, 8), 'name': 'Georges Méliès', 'bio': 'Georges Méliès, full name Marie-Georges-Jean Méliès, was a French illusionist and filmmaker famous for leading many technical and narrative developments in the earliest days of cinema.  One of the first filmmakers to use multiple exposures, time-lapse photography, tracking shots, dissolves, and hand-painted color in his work, Méliès pioneered effects that would define cinematic special effects for decades to come...', 'died': neo4j.time.Date(1938, 1, 21), 'poster': 'https://image.tmdb.org/t/p/w440_and_h660_face/ba3Kfc01Dbigt41lyuFoZR7gmv1.jpg', 'url': 'https://themoviedb.org/person/11523'}}]


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Georges Méliès is the oldest director.

 Find all directors who have directed a movie in Spanish language.
MATCH (d:Director)-[:DIRECTED]->(m:Movie)
WHERE 'Spanish' IN m.languages
RETURN d.name, collect(m.title) AS movies
context:  [{'d.name': 'Alejandro Jodorowsky', 'movies': ['Topo, El', 'Fando and Lis (Fando y Lis)']}, {'d.name': 'Alfonso Arau', 'movies': ['Like Water for Chocolate (Como agua para chocolate)']}, {'d.name': 'Abel Ferrara', 'movies': ['King of New York']}, {'d.name': 'Nacho Vigalondo', 'movies': ['Timecrimes (Cronocrímenes, Los)']}, {'d.name': 'Luis Buñuel', 'movies': ['Tristana', 'Nazarin (Nazarín)', 'Simon of the Desert (Simón del desierto)', 'Viridiana', 'Exterminating Angel, The (Ángel exterminador, El)']}, {'d.name': 'Mikhail Kalatozov', 'movies': ['I Am Cuba (Soy Cuba/Ya Kuba)']}, {'d.name': 'Les Blank', 'movies': ['Burden of Dreams']}, {'d.name': 'Juan Piquer Simón', 'movies': ['Pieces (Mil gritos tiene la noche) (One Thousand Cries Has the Night)']}, {'d.name'

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Here is the answer:

The directors who have directed a movie in Spanish language are:

* Alejandro Jodorowsky
* Alfonso Arau
* Abel Ferrara
* Nacho Vigalondo
* Luis Buñuel
* Mikhail Kalatozov
* Les Blank
* Juan Piquer Simón
* Pedro Almodóvar
* Gregory Nava
* Luis Puenzo
* Barbet Schroeder
* Martin Campbell
* Steven Soderbergh
* Luis Mandoki
* Fernando Trueba
* Guillermo del Toro
* Fernando E. Solanas
* Tomás Gutiérrez Alea
* Jorge Fons
* Álex de la Iglesia
* Alfonso Cuarón
* Alejandro Amenábar
* Juan José Campanella
* Julio Medem
* Walter Salles
* José Luis Cuerda
* Juan Carlos Fresnadillo
* Fabián Bielinsky
* Juan Pablo Rebella
* Agustín Díaz Yanes
* Alejandro Agresti
* Fernando León de Aranoa
* Alejandro González Iñárritu
* Dunia Ayaso
* Félix Sabroso
* Sebastián Cordero
* Joshua Marston
* Daniel Sánchez Arévalo
* Damián Szifrón
* J.A. Bayona
* Patricia Riggen
* Isidro Ortiz
* Luis Piedrahita
* Rodrigo Sopeña
* Jaume Balagueró
* Christian Molina
* Cary Joji Fukunaga
* Gustavo Taretto

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Based on the provided information, here are 5 movies where a director has also acted:

1. **Safety Last! (1923)** - Harold Lloyd, the director, also acted in the film.
2. **The Kid Brother (1927)** - Harold Lloyd, the director, also acted in the film.
3. **The Golem (1920)** - Paul Wegener, the director, also acted in the film.
4. **The Freshman (1925)** - Bob Kortman, the director, also acted in the film.
5. **Billy Blazes, Esq. (1919)** - unknown director, but the film features a director-actor, possibly the director of the film.

Please note that the information provided does not include detailed information about the directors' acting roles in the films.

 List all movies with an IMDb rating greater than 5 that have been directed by a director born in China.
MATCH (d:Director {bornIn: 'China'})-[:DIRECTED]->(m:Movie)
WHERE m.imdbRating > 5
RETURN m
context:  [{'m': {'languages': ['Cantonese', ' Mandarin'], 'year': 1991, 'imdbId': '0102293', 'runtime': 91, 'imdbRating': 7.1, 'movieI