# Hybrid Search Demo

## Create Graph Schema

Since our data is stored in a **TigerGraph** instance, whether on-premises or in the cloud, we need to configure the connection settings. The recommended approach is to use **environment variables**, for example by setting them with the `export` command in the shell. Here, for demonstration purposes, we set them within Python through the `os.environ` mapping.

In [1]:
import os
os.environ["TG_HOST"] = "http://127.0.0.1"
os.environ["TG_USERNAME"] = "tigergraph"
os.environ["TG_PASSWORD"] = "tigergraph"
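
When the variables are set in the shell instead, it can help to check that none of them is missing before connecting. The following is a minimal sketch, assuming the three variable names used above are the only ones required; `missing_settings` is a hypothetical helper and not part of TigerGraphX.

```python
import os

# Variable names taken from the cell above; treating them as the full
# required set is an assumption made for this sketch.
REQUIRED_VARS = ("TG_HOST", "TG_USERNAME", "TG_PASSWORD")

def missing_settings(env=None):
    """Return the names of required connection variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]

# Use an explicit mapping so the example does not depend on the real environment
example_env = {"TG_HOST": "http://127.0.0.1", "TG_USERNAME": "tigergraph"}
print(missing_settings(example_env))
```

Calling `missing_settings()` with no argument checks the real process environment.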

Since the graph has already been created in **TigerGraph**, you should retrieve it using the `Graph.from_db` method, which only requires the graph name.

In [2]:
from tigergraphx import Graph
G = Graph.from_db(graph_name="KGRec")

## Graph-based Similarity Search

To conduct the **graph-based similarity search**, you should run a pre-installed query that identifies similar songs based on user download history by leveraging the relationships in the **KGRec** graph.

```sql
CREATE OR REPLACE QUERY graph_based_similarity_search(
  VERTEX<User> input,
  UINT k = 10
) FOR GRAPH KGRec SYNTAX V3 {
  OrAccum @visited;
  SumAccum<DOUBLE> @sum_score;

  Users = {input};

  Songs =
    SELECT t
    FROM (s:Users) -[e:downloaded]- (t)
    POST-ACCUM
      t.@visited = TRUE
  ;

  SimilarSongs =
    SELECT t
    FROM (s:Songs) -[e:similar_to]- (t)
    WHERE t.@visited == FALSE
    ACCUM t.@sum_score += e.score
    ORDER BY t.@sum_score DESC
    LIMIT k
  ;

  PRINT SimilarSongs;
}
```
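
To make the accumulator logic easier to follow, here is a plain-Python sketch of the same two-hop computation on a toy in-memory graph. It is not TigerGraph code: the function name, the data layout, and the toy edges are all hypothetical, and `similar_to` is assumed to be undirected, as the `-[e:similar_to]-` pattern suggests.

```python
from collections import defaultdict

def graph_based_similarity(downloads, similar_to, user, k=10):
    """Toy re-implementation of the traversal performed by the GSQL query."""
    # Songs the user downloaded; plays the role of the @visited flag
    visited = set(downloads.get(user, ()))
    # Accumulated edge scores per candidate song; plays the role of @sum_score
    sum_score = defaultdict(float)
    for a, b, score in similar_to:
        # Treat each edge as undirected and walk it from downloaded songs only
        for src, dst in ((a, b), (b, a)):
            if src in visited and dst not in visited:
                sum_score[dst] += score
    # ORDER BY @sum_score DESC, then LIMIT k
    ranked = sorted(sum_score.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]

downloads = {"u1": ["s1", "s2"]}
similar_to = [
    ("s1", "s3", 0.5),
    ("s2", "s3", 0.25),
    ("s2", "s4", 0.5),
    ("s1", "s2", 0.9),  # both endpoints already downloaded, so it is ignored
]
top = graph_based_similarity(downloads, similar_to, "u1", k=2)
print(top)
```

A song that is similar to several downloaded songs (here `s3`) collects one score contribution per connecting edge, which is why it ranks above `s4`.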

Then, you should execute the `graph_based_similarity_search` query for **user 17418216** with `k=4`, retrieving the **top 4 songs** ranked by their accumulated similarity score to the songs the user has downloaded.

In [3]:
graph_search_results = G.run_query("graph_based_similarity_search", params={"input": 17418216, "k": 4})
for result in graph_search_results:
    for key, songs in result.items():
        for song in songs:
            print(song)

{'v_id': '4425', 'v_type': 'Song', 'attributes': {'id': 4425, 'description': "Thousand Foot Krutch vocalist Trevor McNevan -LRB- from NewReleaseTuesday -RRB- : `` This is another firecracker , more of an adrenaline rock song .\\nI could n't help but picture NASCAR drivers flying by on the track to this .\\nI love big , anthemic songs that are calls to action - so this one is case and point . ''", '@sum_score': 4.51074226856731, '@visited': False}}
{'v_id': '5148', 'v_type': 'Song', 'attributes': {'id': 5148, 'description': "TFK frontman/songwriter Trevor McNevan had the idea for this song for some time .\\nHe told NewReleaseTuesday : `` Although it 's in the same vein as some of our other high-octane songs , like ` Fire It Up , ' it 's quite different .\\nI wanted it to have that U2 Vertigo type vibe ; that big stadium energy with single notes on the main guitar riff , instead of chords . ''\\nThis was a challenge for McNevan to sing as its one of the highest songs vocally he 's writte

## Vector-based Similarity Search

Now, let's perform the **vector-based similarity search**. You should continue using **user 17418216**, first retrieving the songs they have downloaded using the `G.get_neighbors` method. Then, you should obtain the embeddings of these songs using the `G.fetch_nodes` method.

In [4]:
import numpy as np
df = G.get_neighbors(start_nodes=17418216, start_node_type="User", edge_types="downloaded")
song_ids = set(df['id'])
songs = G.fetch_nodes(song_ids, vector_attribute_name="emb_1", node_type="Song")

Then, you should compute the **user's embedding** by averaging the embeddings of the songs they have downloaded. This embedding represents the user's overall musical preferences.

In [5]:
embeddings = np.array(list(songs.values()))
user_embedding = np.mean(embeddings, axis=0)
print(embeddings.shape)
print(user_embedding.shape)

(59, 1536)
(1536,)
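
The averaging step can be illustrated on toy vectors without NumPy. This is only a sketch of what `np.mean(embeddings, axis=0)` computes; `mean_embedding` and the toy data are hypothetical.

```python
def mean_embedding(vectors):
    """Element-wise average of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same length")
    # Average each coordinate across all vectors
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

toy_song_embeddings = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
toy_user_embedding = mean_embedding(toy_song_embeddings)
print(toy_user_embedding)
```

The result has the same length as each input vector, mirroring how 59 song vectors of length 1536 collapse into a single user vector of length 1536.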


Now, you should perform a **vector-based similarity search** to find the **top 4 most similar songs** to the user's embedding. The query searches for songs whose embeddings (`emb_1`) are closest to the computed **user embedding**, and each result includes the **song's ID and description**.

In [6]:
vector_search_results = G.search(
    data=user_embedding.tolist(),
    vector_attribute_name="emb_1",
    node_type="Song",
    limit=4,
    return_attributes=["id", "description"]
)
for node in vector_search_results:
    print(node)

{'id': 7103, 'distance': 0.08542156, 'description': "Frontman and songwriter Conor O'Brien told The Daily Telegraph that this atheist anthem started out `` pretty mental , drum and bass electronica , lyrics about cities crumbling and people dying , sounds of fire and apocalyptic thing .\\n`` After rehearsing the song with his band , it gradually transformed into `` a pretty straightforward folk rock song about smiling into the void . ''\\nRegarding the song 's meaning , O'Brien told The Sun it , `` is a sort of ode to meaninglessness , that absolute void that we all feel at some stage in our lives , if not every single day .\\nIt sort of proposes that this void is the very thing that binds us all together as human beings on this planet . ''\\nO'Brien described this song to The Sun , as a `` tragi-comedy . ''\\nadding that he `` was very much influenced by Slaughterhouse-Five by Kurt Vonnegut - that feeling of deep , dark moments being placed alongside compulsive hilarity .\\n`` America
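
Conceptually, the search ranks songs by the distance between each stored vector and the query vector. The source does not state which distance metric `emb_1` is configured with, so the sketch below assumes cosine distance and uses a brute-force scan over toy data. A production vector index usually relies on approximate nearest-neighbour structures rather than a full scan. The function names and data are hypothetical.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def top_k(query, vectors_by_id, k):
    """Return the k (id, distance) pairs closest to the query, nearest first."""
    scored = [(node_id, cosine_distance(query, vec)) for node_id, vec in vectors_by_id.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]

toy_vectors = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [1.0, 1.0],
}
nearest = top_k([1.0, 0.0], toy_vectors, k=2)
print([node_id for node_id, _ in nearest])
```

As in the real results, a smaller distance means a closer match, which is why the hybrid step later inverts distances before combining them with graph scores.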

## Hybrid Search

Now, you will **combine graph-based and vector-based recommendations** using a **hybrid search**. The process consists of three key steps:

1. **Extract and Normalize Scores**  
   - Retrieve recommendations from both **graph-based** and **vector-based** methods.  
   - Min-max normalize **graph scores** to the range [0, 1], and invert the normalized **vector distances** so that a higher value always means a better match.

2. **Merge and Fill Missing Values**  
   - Combine both sets of results into a unified dataset.  
   - Handle missing scores by setting defaults and aligning descriptions.

3. **Compute and Rank Hybrid Scores**  
   - Compute the **hybrid score** using a weighted combination:  
     $$ \text{hybrid\_score} = \alpha \times \text{graph\_score\_norm} + (1 - \alpha) \times \text{vector\_score\_norm} $$
   - Rank the results and retrieve the **top 4 recommendations**.
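
The three steps above can be sketched on toy numbers before applying them to the real results. This is a dependency-free illustration rather than the notebook's implementation; `min_max`, `hybrid_scores`, and the toy scores are hypothetical.

```python
def min_max(values, invert=False):
    """Scale values to [0, 1]; return zeros when there is nothing to scale."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    if invert:
        # For distances: the smallest value maps to 1, the largest to 0
        return [(hi - v) / (hi - lo) for v in values]
    return [(v - lo) / (hi - lo) for v in values]

def hybrid_scores(graph_scores, vector_distances, alpha=0.5):
    """Return {song_id: hybrid_score} from per-song graph scores and vector distances."""
    ids_g, ids_v = list(graph_scores), list(vector_distances)
    g_norm = dict(zip(ids_g, min_max([graph_scores[i] for i in ids_g])))
    v_norm = dict(zip(ids_v, min_max([vector_distances[i] for i in ids_v], invert=True)))
    # A song missing from one result list contributes 0 for that component
    return {
        i: alpha * g_norm.get(i, 0.0) + (1 - alpha) * v_norm.get(i, 0.0)
        for i in set(ids_g) | set(ids_v)
    }

toy_graph = {"s1": 4.0, "s2": 2.0, "s3": 3.0}
toy_vector = {"s3": 0.1, "s4": 0.3}
scores = hybrid_scores(toy_graph, toy_vector, alpha=0.5)
for song_id, score in sorted(scores.items(), key=lambda item: (-item[1], item[0])):
    print(song_id, score)
```

With `alpha = 0.5`, a song that appears in only one result list can score at most 0.5, while a song found by both methods (here `s3`) can rank higher. Raising `alpha` shifts the weight toward the graph-based score.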

In [7]:
import pandas as pd

# Extract graph-based recommendations
graph_recs = []
for result in graph_search_results:
    if isinstance(result, dict):  # Ensure result is a dictionary
        for key, songs in result.items():
            if isinstance(songs, list):  # Ensure songs is a list
                for song in songs:
                    if isinstance(song, dict) and 'attributes' in song:
                        graph_recs.append({
                            "id": int(song.get('v_id', 0)),  # Default ID to 0 if missing
                            "graph_score": song['attributes'].get('@sum_score', 0),  # Default to 0 if missing
                            "description": song['attributes'].get('description', 'No description available')  # Default description
                        })

# Extract vector-based recommendations
vector_recs = [
    {
        "id": int(node.get("id", 0)),  # Default ID to 0 if missing
        "vector_distance": node.get("distance", 1.0),  # Default max distance to 1.0
        "description": node.get("description", "No description available")  # Default description
    }
    for node in vector_search_results
]

# Convert to DataFrame (explicit columns keep the steps below valid even if a result list is empty)
df_graph = pd.DataFrame(graph_recs, columns=["id", "graph_score", "description"])
df_vector = pd.DataFrame(vector_recs, columns=["id", "vector_distance", "description"])

# Convert `id` column to int before merging
df_graph['id'] = df_graph['id'].astype(int)
df_vector['id'] = df_vector['id'].astype(int)

# Normalize Graph Scores (min-max); fall back to 0 when the frame is empty or all scores are equal
graph_range = 0 if df_graph.empty else df_graph['graph_score'].max() - df_graph['graph_score'].min()
if graph_range > 0:
    df_graph['graph_score_norm'] = (df_graph['graph_score'] - df_graph['graph_score'].min()) / graph_range
else:
    df_graph['graph_score_norm'] = 0.0  # Default normalization

# Normalize Vector Scores (inverse because lower is better), with the same fallback
vector_range = 0 if df_vector.empty else df_vector['vector_distance'].max() - df_vector['vector_distance'].min()
if vector_range > 0:
    df_vector['vector_score_norm'] = (df_vector['vector_distance'].max() - df_vector['vector_distance']) / vector_range
else:
    df_vector['vector_score_norm'] = 0.0  # Default normalization

# Merge both DataFrames
df_merged = pd.merge(df_graph, df_vector, on='id', how='outer')

# Fill missing scores and descriptions
df_merged['graph_score_norm'] = df_merged['graph_score_norm'].fillna(0)
df_merged['vector_score_norm'] = df_merged['vector_score_norm'].fillna(0)
df_merged['description_x'] = df_merged['description_x'].fillna(df_merged['description_y'])
df_merged = df_merged.rename(columns={"description_x": "description"}).drop(columns=["description_y"])

# Compute Hybrid Score with weight α = 0.5
alpha = 0.5
df_merged['hybrid_score'] = alpha * df_merged['graph_score_norm'] + (1 - alpha) * df_merged['vector_score_norm']

# Sort by Hybrid Score and select top 4
df_sorted = df_merged.sort_values(by='hybrid_score', ascending=False).head(4)

# Print results one by one
for _, row in df_sorted.iterrows():
    print(f"ID: {row['id']}")
    print(f"Hybrid Score: {row['hybrid_score']:.4f}")
    print(f"Description: {row['description']}\n" + "-" * 80)

ID: 4425
Hybrid Score: 0.5000
Description: Thousand Foot Krutch vocalist Trevor McNevan -LRB- from NewReleaseTuesday -RRB- : `` This is another firecracker , more of an adrenaline rock song .\nI could n't help but picture NASCAR drivers flying by on the track to this .\nI love big , anthemic songs that are calls to action - so this one is case and point . ''
--------------------------------------------------------------------------------
ID: 5996
Hybrid Score: 0.5000
Description: Frontman Justin Pierre told Alternative Press that the genesis of this song harks back to 2007 : `` The original idea for this song came while we were recording Even If It Kills Me .\nI had a few lines for verses and part of the chorus , but I was n't sure where it was going .\nThere was n't enough time to explore it back then , so we saved it for this record .\nI had this strange image in my head of two people sitting on the roof of a house at night in the fall , shivering slightly and silently together ; the

## Vector Search for QA System

A **QA (Question-Answering) system** is an AI-powered solution that processes user queries, retrieves relevant information from structured or unstructured data, and generates accurate, context-aware responses.

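To perform **vector search** for a QA system, you should first convert the **user query** into an embedding using an embedding model. In this example, we will use OpenAI's **`text-embedding-ada-002`** model, which returns 1536-dimensional vectors, the same dimensionality as the song embeddings printed earlier. The `openai` client reads the API key from the `OPENAI_API_KEY` environment variable.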

In [8]:
import openai

def get_question_embedding(question, model="text-embedding-ada-002"):
    """Convert a question into an embedding (List[float]) using OpenAI API."""
    try:
        response = openai.embeddings.create(input=[question], model=model)
        return response.data[0].embedding  # Returns the embedding as List[float]
    except Exception as e:
        print(f"Error generating embedding: {e}")
        return None  # Return None if there is an error

question = 'Are there any songs in the dataset that mention a specific genre (e.g., "rock," "jazz," "pop") in their descriptions?'
embedding = get_question_embedding(question)

2025-03-11 22:15:02,661 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
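
Because `get_question_embedding` returns `None` when the API call fails, it is worth validating the result before passing it to the search. A minimal sketch follows. `is_valid_embedding` is a hypothetical helper, and the default of 1536 dimensions is an assumption based on the embedding shape printed earlier.

```python
def is_valid_embedding(embedding, expected_dim=1536):
    """True if `embedding` is a list of `expected_dim` numbers, False otherwise (including None)."""
    return (
        isinstance(embedding, list)
        and len(embedding) == expected_dim
        and all(isinstance(x, (int, float)) for x in embedding)
    )

# A failed API call (None) and a wrong-sized vector are both rejected
print(is_valid_embedding(None))
print(is_valid_embedding([0.0] * 3))
print(is_valid_embedding([0.0] * 1536))
```

In the notebook, a check such as `if not is_valid_embedding(embedding): raise ValueError(...)` before the next cell would surface the problem at its source instead of inside the search call.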


Now, you should perform a **vector-based search** to find the **top 10 most similar songs** to the given query embedding. The search compares the query embedding with **song embeddings** stored under the attribute `"emb_1"`, and then retrieves the most relevant results.

In [9]:
retrieved_songs = G.search(
    data=embedding,
    vector_attribute_name="emb_1",
    node_type="Song",
    limit=10,
)
print(retrieved_songs)

[{'id': 1728, 'distance': 0.2234166, 'description': "According to vocalist/guitarist James Petralli , White Denim 's songs are the musical manifestations of abstract paintings or philosophical tracts .\\nHe explained to UK music magazine the NME : `` The things that I like to read are generally abstract .\\nI like patterns , I like reading poetry and avant-garde prose and I 'm more interested in musical patterns in literature than I am in long-form narratives .\\nI look at paintings and try to visualise an object or image , then assimilate how that makes me feel into a series of phrases and try to make it musical . ''\\nPetralli told the NME the song is about , `` creating work and weighing its importance . ''\\nThis meditative hoedown is loosely based on some excerpts from The Blue and Brown Books by Austrian philosopher Ludwig Wittgenstein .\\nPetralli told the NME : `` The lyric-writing process is like an excavation , I 'm trying to pull words and melodies out of what 's already the

## Generate an LLM Prompt for QA

Once you have retrieved relevant songs, you should generate a **structured prompt** to provide context for a **Large Language Model (LLM)** to answer the user's query. The following function formats the retrieved song descriptions into a structured input for the LLM.

In [10]:
def generate_llm_prompt(question, retrieved_songs):
    """Generate a structured prompt for an LLM to answer a question using retrieved song descriptions."""
    
    prompt_template = """You are an expert in analyzing song descriptions and answering user queries based on provided song data.

### Task:
Answer the following question based on the retrieved song descriptions. Use the given information to generate a relevant, concise, and insightful response.

### Question:
{question}

### Retrieved Songs:
{retrieved_songs}

Each song entry consists of:
- **id**: A unique identifier for the song.
- **description**: A textual description of the song.

### Instructions:
1. **Analyze** the descriptions to find relevant information related to the question.
2. **Synthesize** an answer using the most relevant songs.
3. **Provide explanations** or insights if necessary.
4. **Avoid speculation** beyond the provided descriptions.

### Response:
"""

    # Format the retrieved songs as a structured string
    song_entries = "\n".join(
        [f"- id: {song['id']}\n Description: {song['description']}" for song in retrieved_songs]
    )

    return prompt_template.format(question=question, retrieved_songs=song_entries)

llm_prompt = generate_llm_prompt(question, retrieved_songs)

# Print the generated prompt
print(llm_prompt)

You are an expert in analyzing song descriptions and answering user queries based on provided song data.

### Task:
Answer the following question based on the retrieved song descriptions. Use the given information to generate a relevant, concise, and insightful response.

### Question:
Are there any songs in the dataset that mention a specific genre (e.g., "rock," "jazz," "pop") in their descriptions?

### Retrieved Songs:
- id: 1728
 Description: According to vocalist/guitarist James Petralli , White Denim 's songs are the musical manifestations of abstract paintings or philosophical tracts .\nHe explained to UK music magazine the NME : `` The things that I like to read are generally abstract .\nI like patterns , I like reading poetry and avant-garde prose and I 'm more interested in musical patterns in literature than I am in long-form narratives .\nI look at paintings and try to visualise an object or image , then assimilate how that makes me feel into a series of phrases and try to

## Querying OpenAI with the Generated Prompt

Next, you should send the **generated LLM prompt** to OpenAI's **GPT-4** model to get an AI-generated response.

In [11]:
def chat_with_openai(llm_prompt, model="gpt-4"):
    """Send the LLM prompt to OpenAI's API and get a response using the new OpenAI API (>=1.0.0)."""
    try:
        client = openai.OpenAI()  # New API requires initializing a client
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are a helpful assistant that analyzes song descriptions."},
                {"role": "user", "content": llm_prompt}
            ],
            temperature=0.7
        )
        return response.choices[0].message.content
    except Exception as e:
        print(f"Error querying OpenAI: {e}")
        return None

response = chat_with_openai(llm_prompt)
print(response)

2025-03-11 22:15:18,574 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
Yes, there are several songs in the dataset that mention a specific genre in their descriptions. 

Song with id: 3937 is described by Kid Rock as a "bluesy roadhouse shuffle", indicating a blues genre. 

Song with id: 3095 draws on reggae-inspired genres like dancehall and dub. This song seems to incorporate elements of dancehall and dub, which are sub-genres of reggae.

Song with id: 504 has Justin Timberlake attempting to put his own spin on Radiohead's electronic rock, suggesting that this song belongs to the electronic rock genre.

Song with id: 525 spans sounds of folk, country, rock, and Americana. It is described as a rock song by the Court Yard Hounds.

Song with id: 92 uses a sample from a 1984 hip-hop single, indicating that it might belong to the hip-hop genre.

Song with id: 63 by Brooklyn quintet Friends is described as a pop band, and the song's lyrical

## Hybrid Search for QA System

Now, you should enhance the **QA system** with **hybrid search**: take the top 5 songs from vector search, then expand them with up to 5 graph neighbors, so that the LLM sees both the semantically closest songs and songs related to them in the graph.

In [12]:
retrieved_songs = G.search(
    data=embedding,
    vector_attribute_name="emb_1",
    node_type="Song",
    limit=5,
)
print(retrieved_songs)

[{'id': 3095, 'distance': 0.222947, 'description': "This song draws on reggae-inspired genres like dancehall and dub to illustrate an eccentric young woman .\\n`` I love the picture of what this girl is like because I do believe , much like the narrator believes , that there is a person for every person , '' Kirsten Bush told CMT News .\\n`` There might even be more than one , but I do believe in it .\\nI 'm a hopeless and helpless romantic .\\nWhen that song started to unfold , we got to the bridge of it , and I was referencing dancehall , like hyip-dibi-dibi-dibi , hyip-dibi-dibi-dibi . ''\\nJennifer Nettles told CMT News that she embraced the chance to write lyrics based on the dub rhythm of the actual words .\\n`` The lyrics themselves can just be fun words that sound really ` riki-tiki ' to say , '' she noted .\\n`` It does n't have to always have to make narrative sense but just have fun with having fun words .\\nIf you remember the Sugar Hill Gang -LSB- a pioneering hip hop grou

Next, you should retrieve **graph-based recommendations** for the previously retrieved songs. The GSQL query **`get_neighbors`** (not to be confused with the `G.get_neighbors` Python method used earlier) finds similar songs by traversing **`similar_to`** edges from the input songs, accumulating similarity scores, and returning the top `k` results. Below is the query definition:

```sql
CREATE OR REPLACE QUERY get_neighbors(
  SET<VERTEX<Song>> input,
  UINT k = 10
) FOR GRAPH KGRec SYNTAX V3 {
  OrAccum @visited;
  SumAccum<DOUBLE> @sum_score;
  Songs = {input};
  Songs =
    SELECT s
    FROM Songs:s
    POST-ACCUM
      s.@visited = TRUE
  ;
  SimilarSongs =
    SELECT t
    FROM (s:Songs) -[e:similar_to]- (t)
    WHERE t.@visited == FALSE
    ACCUM t.@sum_score += e.score
    ORDER BY t.@sum_score DESC
    LIMIT k
  ;
  PRINT SimilarSongs;
}
```

Now, run the following Python code to execute the query and retrieve similar songs:

In [13]:
retrieved_song_ids = [song["id"] for song in retrieved_songs]
neighbors = G.run_query("get_neighbors", params={"input": retrieved_song_ids, "k": 5})
print(neighbors)

[{'SimilarSongs': [{'v_id': '5362', 'v_type': 'Song', 'attributes': {'id': 5362, 'description': "Vocalist/guitarist James Petralli told UK music magazine the NME about this Rock-Operaesque tune , which veers into prog territory : `` I do n't think we 'll ever go full-on prog but all of us like to be challenged in that way .\\nWe listen to a lot of Soft Machine and a lot of Yes .\\nProg music is challenging and hilarious and we 've always wanted to do something which walks that line .\\n` Anvil Everything ' was our attempt to fulfil a bit of a prog fantasy , to see if we could make it acceptable . ''", '@sum_score': 0.7105263157894737, '@visited': False}}, {'v_id': '6306', 'v_type': 'Song', 'attributes': {'id': 6306, 'description': "It took just three takes to lay down this cut .\\n`` Bob Seger called me and told me this song kicks ass , '' Rock told Detroit Free Press .\\n`` Need I say more ?\\n`` He added : `` There is nothing - no pop song , no rap song , no soul song , no country so

Now, you should merge the **vector search results** with the **graph search results** while ensuring that no duplicate songs are included.

In [14]:
# Index vector search results by song ID (dict of id -> {id, description})
combined_results = {song["id"]: {
    "id": song["id"],
    "description": song["description"]
} for song in retrieved_songs}

# Add graph search results (ensuring no duplicates)
for song in neighbors[0]["SimilarSongs"]:
    song_id = int(song["v_id"])
    if song_id not in combined_results:  # Avoid duplicates
        combined_results[song_id] = {
            "id": song_id,
            "description": song["attributes"]["description"]
        }

# Convert the merged dictionary back to a list format
retrieved_songs_combined = list(combined_results.values())
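
The same merge can be wrapped in a function and run on toy data to make the ordering and deduplication behaviour easy to see. This is a sketch that mirrors the result shapes shown above; `merge_results` and the toy records are hypothetical.

```python
def merge_results(vector_hits, graph_hits):
    """Merge two result lists by song ID, keeping vector hits first and skipping duplicates."""
    combined = {
        hit["id"]: {"id": hit["id"], "description": hit["description"]}
        for hit in vector_hits
    }
    for hit in graph_hits:
        song_id = int(hit["v_id"])  # graph results carry the ID as a string
        if song_id not in combined:
            combined[song_id] = {
                "id": song_id,
                "description": hit["attributes"]["description"],
            }
    # Dicts preserve insertion order, so vector hits come before graph-only hits
    return list(combined.values())

toy_vector_hits = [
    {"id": 1, "distance": 0.1, "description": "first"},
    {"id": 2, "distance": 0.2, "description": "second"},
]
toy_graph_hits = [
    {"v_id": "2", "attributes": {"description": "second (graph copy)"}},
    {"v_id": "3", "attributes": {"description": "third"}},
]
merged = merge_results(toy_vector_hits, toy_graph_hits)
print([song["id"] for song in merged])
```

When a song appears in both lists, the vector-search entry is kept and the graph copy is dropped.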

Finally, you should generate a new **LLM prompt** using the combined results and query OpenAI again.

In [15]:
llm_prompt = generate_llm_prompt(question, retrieved_songs_combined)

# Print the generated prompt
print(llm_prompt)

You are an expert in analyzing song descriptions and answering user queries based on provided song data.

### Task:
Answer the following question based on the retrieved song descriptions. Use the given information to generate a relevant, concise, and insightful response.

### Question:
Are there any songs in the dataset that mention a specific genre (e.g., "rock," "jazz," "pop") in their descriptions?

### Retrieved Songs:
- id: 3095
 Description: This song draws on reggae-inspired genres like dancehall and dub to illustrate an eccentric young woman .\n`` I love the picture of what this girl is like because I do believe , much like the narrator believes , that there is a person for every person , '' Kirsten Bush told CMT News .\n`` There might even be more than one , but I do believe in it .\nI 'm a hopeless and helpless romantic .\nWhen that song started to unfold , we got to the bridge of it , and I was referencing dancehall , like hyip-dibi-dibi-dibi , hyip-dibi-dibi-dibi . ''\nJenn

In [16]:
response = chat_with_openai(llm_prompt)
print(response)

2025-03-11 22:15:44,068 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
Yes, several songs in the dataset mention specific genres in their descriptions. 

- Song 3095 draws on reggae-inspired genres like dancehall and dub.
- Song 525 spans sounds of folk, country, rock, and Americana.
- Song 3937 is described as a bluesy roadhouse shuffle.
- Song 504 is Justin Timberlake's attempt to put his own spin on Radiohead's electronic rock.
- Song 5362 is described as a rock-opera tune that veers into prog territory.
- Song 6306 is described as blues-based rock 'n' roll.
- Song 4586 is inspired by artists such as Blondie, Peter Gabriel, R.E.M., and The Clash which suggests a mix of pop, rock and new wave influences.

These descriptions indicate that a variety of genres are represented in the dataset.


## Drop Graph

Once testing is complete, you can **drop the KGRec graph** to clean up resources. This removes the graph from the TigerGraph instance, so skip this step if you still need it.

In [17]:
G.drop_graph()

2025-03-11 22:15:45,726 - tigergraphx.core.managers.schema_manager - INFO - Dropping graph: KGRec...
2025-03-11 22:15:49,041 - tigergraphx.core.managers.schema_manager - INFO - Graph dropped successfully.


---