# Advanced Retrieval with LangChain

In the following notebook, we'll explore various methods of advanced retrieval using LangChain!

We'll touch on:

- Naive Retrieval
- Best-Matching 25 (BM25)
- Multi-Query Retrieval
- Parent-Document Retrieval
- Contextual Compression (a.k.a. Rerank)
- Ensemble Retrieval
- Semantic chunking

We'll also discuss how these methods impact performance on our set of documents with a simple RAG chain.

There will be two breakout rooms:

- 🤝 Breakout Room Part #1
  - Task 1: Getting Dependencies!
  - Task 2: Data Collection and Preparation
  - Task 3: Setting Up Qdrant!
  - Task 4-10: Retrieval Strategies
- 🤝 Breakout Room Part #2
  - Activity: Evaluate with Ragas

# 🤝 Breakout Room Part #1

## Task 1: Getting Dependencies!

We're going to need a few specific LangChain community packages, like OpenAI (for our [LLM](https://platform.openai.com/docs/models) and [Embedding Model](https://platform.openai.com/docs/guides/embeddings)) and Cohere (for our [Reranker](https://cohere.com/rerank)).

We'll also provide our OpenAI key, as well as our Cohere API key.

In [1]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")

In [2]:
os.environ["COHERE_API_KEY"] = getpass.getpass("Cohere API Key:")

## Task 2: Data Collection and Preparation

We'll be using our Use Case Data once again - this time the structured data available through the CSV!

### Data Preparation

We want to make sure all our documents have the relevant metadata for the various retrieval strategies we're going to be applying today.

In [3]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from datetime import datetime, timedelta

loader = CSVLoader(
    file_path="./data/Projects_with_Domains.csv",
    metadata_columns=[
      "Project Title",
      "Project Domain",
      "Secondary Domain",
      "Description",
      "Judge Comments",
      "Score",
      "Project Name",
      "Judge Score"
    ]
)

synthetic_usecase_data = loader.load()

for doc in synthetic_usecase_data:
    doc.page_content = doc.metadata["Description"]

Let's look at an example document to see if everything worked as expected!

In [4]:
synthetic_usecase_data[0]

Document(metadata={'source': './data/Projects_with_Domains.csv', 'row': 0, 'Project Title': 'InsightAI 1', 'Project Domain': 'Security', 'Secondary Domain': 'Finance / FinTech', 'Description': 'A low-latency inference system for multimodal agents in autonomous systems.', 'Judge Comments': 'Technically ambitious and well-executed.', 'Score': '85', 'Project Name': 'Project Aurora', 'Judge Score': '9.5'}, page_content='A low-latency inference system for multimodal agents in autonomous systems.')

## Task 3: Setting Up Qdrant!

Now that we have our documents, let's create a Qdrant VectorStore with the collection name "Synthetic_Usecases".

We'll leverage OpenAI's [`text-embedding-3-small`](https://openai.com/blog/new-embedding-models-and-api-updates) because it's a very powerful (and low-cost) embedding model.

> NOTE: We'll be creating additional vectorstores where necessary, but this pattern is still extremely useful.

In [5]:
from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Qdrant.from_documents(
    synthetic_usecase_data,
    embeddings,
    location=":memory:",
    collection_name="Synthetic_Usecases"
)

## Task 4: Naive RAG Chain

Since we're focusing on the "R" in RAG today - we'll create our Retriever first.

### R - Retrieval

This naive retriever treats each project description as a document and uses cosine similarity to fetch the 10 most relevant documents.

> NOTE: We're choosing `10` as our `k` here to provide enough documents for our reranking process later

In [6]:
naive_retriever = vectorstore.as_retriever(search_kwargs={"k" : 10})
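
To make the ranking step concrete, here is a minimal sketch of cosine-similarity top-k retrieval over toy vectors. The helper names `cosine_similarity` and `top_k` and the 2-dimensional vectors are illustrative assumptions; the actual retriever delegates this work to Qdrant and uses OpenAI embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]

# Toy 2-dimensional "embeddings" (hypothetical values, for illustration only).
toy_docs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
toy_query = [0.9, 0.1]
print(top_k(toy_query, toy_docs, k=2))
```

The query points mostly along the first axis, so the first toy document ranks highest and the diagonal one comes second.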

### A - Augmented

We're going to go with a standard prompt for our simple RAG chain today! Nothing fancy here, we want this to mostly be about the Retrieval process.

In [7]:
from langchain_core.prompts import ChatPromptTemplate

RAG_TEMPLATE = """\
You are a helpful and kind assistant. Use the context provided below to answer the question.

If you do not know the answer, or are unsure, say you don't know.

Query:
{question}

Context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)

### G - Generation

We're going to leverage `gpt-4.1-nano` as our LLM today, as - again - we want this to largely be about the Retrieval process.

In [8]:
from langchain_openai import ChatOpenAI

chat_model = ChatOpenAI(model="gpt-4.1-nano")

### LCEL RAG Chain

We're going to use LCEL to construct our chain.

> NOTE: This chain will be exactly the same across the various examples with the exception of our Retriever!

In [9]:
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

naive_retrieval_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the naive_retriever
    {"context": itemgetter("question") | naive_retriever, "question": itemgetter("question")}
    # "context"  : is assigned to a RunnablePassthrough object (will not be called or considered in the next step)
    #              by getting the value of the "context" key from the previous step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)
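
If the LCEL pipe syntax is hard to follow, the same data flow can be sketched in plain Python. This is only an illustration under stated assumptions: `run_rag_chain` and the three `toy_*` functions are hypothetical stand-ins for the retriever, the prompt template, and the chat model, not LangChain APIs.

```python
def run_rag_chain(inputs, retrieve_fn, format_prompt_fn, llm_fn):
    """Plain-Python equivalent of the three chain steps."""
    # Step 1: build {"context": ..., "question": ...} from the input dict.
    question = inputs["question"]
    state = {"context": retrieve_fn(question), "question": question}
    # Step 2: RunnablePassthrough.assign(context=...) keeps the dict and
    # re-assigns "context" to its own value, so nothing changes here.
    state = {**state, "context": state["context"]}
    # Step 3: format the prompt, call the model, and carry the context along.
    prompt = format_prompt_fn(question=state["question"], context=state["context"])
    return {"response": llm_fn(prompt), "context": state["context"]}

# Hypothetical stand-ins so the sketch runs without any API keys.
def toy_retrieve(question):
    return [f"document about {question}"]

def toy_format(question, context):
    return f"Query: {question} Context: {context}"

def toy_llm(prompt):
    return prompt.upper()

result = run_rag_chain({"question": "security"}, toy_retrieve, toy_format, toy_llm)
print(result)
```

Returning the retrieved context next to the response is what later lets us inspect (and evaluate) what each retriever actually fetched.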

Let's see how this simple chain does on a few different prompts.

> NOTE: You might think that we've cherry picked prompts that showcase the individual skill of each of the retrieval strategies - you'd be correct!

In [10]:
naive_retrieval_chain.invoke({"question" : "What is the most common project domain?"})["response"].content

'Based on the provided data, the most common project domain appears to be "Healthcare / MedTech," which is mentioned multiple times in different projects.'

In [11]:
naive_retrieval_chain.invoke({"question" : "Were there any usecases about security?"})["response"].content

'Yes, there are use cases related to security in the provided context. Specifically, one project titled "WealthifyAI 3" is described as a medical imaging solution that improves early diagnosis through vision transformers, with a note indicating it is well-structured and scalable, having good potential for commercialization. Additionally, there is a project called "Pathfinder 24" in the Healthcare / MedTech domain with a secondary focus on Security, which involves an AI-powered platform optimizing logistics routes for sustainability.'

In [23]:
naive_retrieval_chain.invoke({"question" : "What did judges have to say about the fintech projects?"})["response"].content

'The judges\' comments on the fintech projects were generally positive. For example, one project described as "WealthifyAI" received the comment: "Comprehensive and technically mature approach." Another project, "Pathfinder 27," was praised for "Excellent code quality and use of open-source libraries," with a high judge score of 9.8. Overall, the judges highlighted the strength, innovation, and real-world impact of the fintech projects, indicating favorable reviews.'

Overall, this is not bad! Let's see if we can make it better!

## Task 5: Best-Matching 25 (BM25) Retriever

Taking a step back in time - [BM25](https://www.nowpublishers.com/article/Details/INR-019) is based on [Bag-Of-Words](https://en.wikipedia.org/wiki/Bag-of-words_model), which is a sparse representation of text.

In essence, it's a way to compare how similar two pieces of text are based on the words they both contain.
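
For intuition, here is a minimal sketch of a textbook BM25 scoring function in plain Python. The function name `bm25_scores`, the whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, and the toy corpus are all assumptions made for illustration; real implementations differ in details such as the IDF variant and tokenization.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a textbook BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue  # terms missing from the document contribute nothing
            # Rare terms get a higher inverse document frequency.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Longer-than-average documents are penalised through b.
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

# Hypothetical mini-corpus, for illustration only.
toy_corpus = [
    "legal compliance assistant for contracts",
    "medical imaging for early diagnosis",
    "route planning for logistics",
]
toy_scores = bm25_scores("legal contracts", toy_corpus)
best = max(range(len(toy_corpus)), key=lambda i: toy_scores[i])
print(best, toy_scores)
```

Only the first toy document shares any tokens with the query, so it is the only one with a non-zero score. That exact-token behaviour is the main contrast with embedding-based retrieval.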

This retriever is very straightforward to set up! Let's see it happen down below!


In [13]:
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(synthetic_usecase_data)

We'll construct the same chain - only changing the retriever.

In [14]:
bm25_retrieval_chain = (
    {"context": itemgetter("question") | bm25_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at the responses!

In [15]:
bm25_retrieval_chain.invoke({"question" : "What is the most common project domain?"})["response"].content

'Based on the provided data, it appears that the most common project domain is not explicitly stated. However, from the sample entries, the project domains include Productivity Assistants, Legal / Compliance, Data / Analytics, and Healthcare / MedTech. Since the data provided is limited and does not specify the overall distribution, I cannot determine with certainty which domain is most common across all projects.\n\nIf you have access to the full dataset or additional information, I can help analyze it further.'

In [16]:
bm25_retrieval_chain.invoke({"question" : "Were there any usecases about security?"})["response"].content

'Based on the provided context, there are no specific use cases related to security mentioned.'

In [21]:
bm25_retrieval_chain.invoke({"question" : "Which projects mention legal stuff?"})["response"].content
# bm25_retrieval_chain.invoke({"question" : "What did judges have to say about the fintech projects?"})["response"].content

'The project that mentions legal stuff is "TaskFlow 13," which has the domain "Legal / Compliance."'

It's not clear whether this is better or worse. If only we had a way to test this! (SPOILERS: We do, and the second half of the notebook will cover it.)

#### ❓ Question #1:

Give an example query where BM25 is better than embeddings and justify your answer.

##### ✅ Answer
I created the query "Which projects mention legal stuff?"
I think this would be better with Best-Matching 25 because BM25 prioritizes exact token overlap, and it would look for exactly "legal". In this case it found "Legal / Compliance". Embeddings may return semantically related items, but not an exact match.



## Task 6: Contextual Compression (Using Reranking)

Contextual Compression is a fairly straightforward idea: We want to "compress" our retrieved context into just the most useful bits.

There are a few ways we can achieve this - but we're going to look at a specific example called reranking.

The basic idea here is this:

- We retrieve lots of documents that are very likely related to our query vector
- We "compress" those documents into a smaller set of *more* related documents using a reranking algorithm.
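
The two steps above can be sketched as a retrieve-then-rerank function. This is a toy illustration, not how Cohere's model works: `rerank` and `overlap_score` are hypothetical names, and the word-overlap scorer merely stands in for a learned relevance model.

```python
def rerank(query, candidates, score_fn, top_n=3):
    """Re-score candidates against the query and keep the top_n best."""
    ranked = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return ranked[:top_n]

def overlap_score(query, doc):
    """Toy relevance score: fraction of query words found in the document."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words)

# Pretend these came back from a first-stage retriever with a large k
# (made-up strings, for illustration only).
first_stage = [
    "a route planner for logistics",
    "a fraud detection model for fintech payments",
    "a fintech budgeting assistant",
    "a medical imaging pipeline",
]
compressed = rerank("fintech fraud detection", first_stage, overlap_score, top_n=2)
print(compressed)
```

The first stage can afford to be generous (a large `k`) because the second stage discards everything outside the `top_n` most relevant candidates.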

We'll be leveraging Cohere's Rerank model for our reranker today!

All we need to do is the following:

- Create a basic retriever
- Create a compressor (reranker, in this case)

That's it!

Let's see it in the code below!

In [18]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

compressor = CohereRerank(model="rerank-v3.5")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=naive_retriever
)

Let's create our chain again, and see how this does!

In [24]:
contextual_compression_retrieval_chain = (
    {"context": itemgetter("question") | compression_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [25]:
contextual_compression_retrieval_chain.invoke({"question" : "What is the most common project domain?"})["response"].content

'Based on the provided data, the most common project domain among the listed projects appears to be "Healthcare / MedTech," as it is mentioned in one of the examples. However, since the dataset is limited and only a few entries are shown, I cannot determine the overall most common project domain with certainty. Could you provide more data or specify if you\'d like analysis of the complete dataset?'

In [26]:
contextual_compression_retrieval_chain.invoke({"question" : "Were there any usecases about security?"})["response"].content

'Based on the provided context, there are no specific use cases explicitly related to security. The use cases mentioned focus on federated learning to improve privacy in healthcare applications, but there is no direct mention of security use cases.'

In [27]:
contextual_compression_retrieval_chain.invoke({"question" : "What did judges have to say about the fintech projects?"})["response"].content

'The judges\' comments about the fintech projects were generally positive. For example, the project "Pathfinder 27" received praise for its "excellent code quality and use of open-source libraries," and scored a high judge score of 9.8. Additionally, the project "PlanPilot 35" was recognized for being "a clever solution with measurable environmental benefit" and received a judge score of 8.4. Overall, the judges appreciated the quality, innovation, and environmental considerations of the fintech-related projects.'

We'll need to rely on something like Ragas to help us get a better sense of how this is performing overall - but it "feels" better!

## Task 7: Multi-Query Retriever

Typically in RAG we have a single query - the one provided by the user.

What if we had... more than one query?

In essence, a Multi-Query Retriever works by:

1. Taking the original user query and creating `n` number of new user queries using an LLM.
2. Retrieving documents for each query.
3. Using all unique retrieved documents as context
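
Steps 2 and 3 can be sketched in a few lines of plain Python. Everything here is a hypothetical stand-in: the reformulations are hand-written rather than generated by an LLM, and `toy_retrieve` is a keyword lookup over placeholder document ids rather than a vector search.

```python
def multi_query_retrieve(queries, retrieve_fn):
    """Run every query through retrieve_fn and return the unique union, in first-seen order."""
    seen = set()
    merged = []
    for query in queries:
        for doc in retrieve_fn(query):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

# Hypothetical keyword index with placeholder document ids.
toy_index = {
    "security": ["doc_a", "doc_b"],
    "cybersecurity": ["doc_a", "doc_c"],
    "privacy": ["doc_d"],
}

def toy_retrieve(query):
    """Return the documents indexed under any word of the query."""
    return [doc for word in query.lower().split() for doc in toy_index.get(word, [])]

# In the real retriever these reformulations would come from the LLM.
reformulations = [
    "projects about security",
    "projects about cybersecurity",
    "projects about privacy",
]
print(multi_query_retrieve(reformulations, toy_retrieve))
```

No single reformulation finds all four documents, but their deduplicated union does, which is the recall benefit this strategy is after.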

So, how hard is it to set up? Not bad! Let's see it down below!



In [28]:
from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=naive_retriever, llm=chat_model
) 

In [29]:
multi_query_retrieval_chain = (
    {"context": itemgetter("question") | multi_query_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [30]:
multi_query_retrieval_chain.invoke({"question" : "What is the most common project domain?"})["response"].content

'The most common project domain in the provided data appears to be "Healthcare / MedTech," which is mentioned multiple times across different projects.'

In [31]:
multi_query_retrieval_chain.invoke({"question" : "Were there any usecases about security?"})["response"].content

'Yes, there are usecases related to security. Specifically, one example is the project "SecureNest," which involves a document summarization and retrieval system for enterprise knowledge bases. It received high praise for its comprehensive and technically mature approach, indicating its relevance to security and compliance concerns in enterprise environments.'

In [32]:
multi_query_retrieval_chain.invoke({"question" : "What did judges have to say about the fintech projects?"})["response"].content

'Judges generally had positive comments about the fintech projects. For example, one judge described a project as "A clever solution with measurable environmental benefit," indicating appreciation for innovative and impactful ideas. Another noted that a project was "Technically ambitious and well-executed," showing recognition of technical strength. Overall, the comments highlighted the projects\' creativity, solid validation, and potential for real-world impact in the fintech domain.'

#### ❓ Question #2:

Explain how generating multiple reformulations of a user query can improve recall.

##### ✅ Answer

- It can generate phrases or synonyms that overcome vocabulary mismatches across documents.
- Different phrasings map to different embedding neighborhoods, catching items a single vector might miss.


## Task 8: Parent Document Retriever

The Parent Document Retriever uses a "small-to-big" strategy that works as follows:

1. Each un-split "document" will be designated as a "parent document" (You could use larger chunks of document as well, but our data format allows us to consider the overall document as the parent chunk)
2. Store those "parent documents" in a memory store (not a VectorStore)
3. We will chunk each of those documents into smaller documents, and associate them with their respective parents, and store those in a VectorStore. We'll call those "child chunks".
4. When we query our Retriever, we will do a similarity search comparing our query vector to the "child chunks".
5. Instead of returning the "child chunks", we'll return their associated "parent chunks".

Okay, maybe that was a few steps - but the basic idea is this:

- Search for small documents
- Return big documents

The intuition is that we're likely to find the most relevant information by limiting the amount of semantic information that is encoded in each embedding vector - but we're likely to miss relevant surrounding context if we only use that information.
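
The "search small, return big" idea can be sketched without any libraries. This is a minimal illustration under stated assumptions: `build_small_to_big` and `retrieve_parents` are hypothetical helpers, chunks are split by word count, and a word-overlap score stands in for the embedding similarity search.

```python
def build_small_to_big(parents, chunk_size):
    """Split each parent into child chunks of chunk_size words and remember each child's parent id."""
    docstore = dict(enumerate(parents))  # parent id -> full parent text
    children = []                        # (child text, parent id) pairs
    for parent_id, text in docstore.items():
        words = text.split()
        for start in range(0, len(words), chunk_size):
            children.append((" ".join(words[start:start + chunk_size]), parent_id))
    return docstore, children

def retrieve_parents(query, docstore, children, k=2):
    """Rank child chunks by a toy word-overlap score, then return their unique parents."""
    query_words = set(query.lower().split())
    ranked = sorted(
        children,
        key=lambda child: len(query_words & set(child[0].lower().split())),
        reverse=True,
    )
    parent_ids = []
    for _, parent_id in ranked[:k]:
        if parent_id not in parent_ids:  # several children may share one parent
            parent_ids.append(parent_id)
    return [docstore[parent_id] for parent_id in parent_ids]

# Made-up parent documents, for illustration only.
toy_parents = [
    "federated learning toolkit improving privacy in healthcare",
    "route optimizer reducing emissions for logistics fleets",
]
toy_docstore, toy_children = build_small_to_big(toy_parents, chunk_size=3)
print(retrieve_parents("privacy in healthcare", toy_docstore, toy_children, k=2))
```

Both of the best-matching child chunks belong to the first parent, so the full parent text is returned once rather than as two fragments.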

Let's start by creating our "parent documents" and defining a `RecursiveCharacterTextSplitter`.

In [33]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient, models

parent_docs = synthetic_usecase_data
child_splitter = RecursiveCharacterTextSplitter(chunk_size=750)

We'll need to set up a new Qdrant vectorstore - and we'll use another useful pattern to do so!

> NOTE: We are manually defining our embedding dimension, you'll need to change this if you're using a different embedding model.

In [34]:
from langchain_qdrant import QdrantVectorStore

client = QdrantClient(location=":memory:")

client.create_collection(
    collection_name="full_documents",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

parent_document_vectorstore = QdrantVectorStore(
    collection_name="full_documents", embedding=OpenAIEmbeddings(model="text-embedding-3-small"), client=client
)

Now we can create our `InMemoryStore` that will hold our "parent documents" - and build our retriever!

In [35]:
store = InMemoryStore()

parent_document_retriever = ParentDocumentRetriever(
    vectorstore = parent_document_vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

By default, this is empty as we haven't added any documents - let's add some now!

In [36]:
parent_document_retriever.add_documents(parent_docs, ids=None)

We'll create the same chain we did before - but substitute our new `parent_document_retriever`.

In [37]:
parent_document_retrieval_chain = (
    {"context": itemgetter("question") | parent_document_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's give it a whirl!

In [38]:
parent_document_retrieval_chain.invoke({"question" : "What is the most common project domain?"})["response"].content

'Based on the provided data, the project domains mentioned include "Productivity Assistants," "Healthcare / MedTech," "Creative / Design / Media," and "Security." There isn\'t enough data to determine a single most common project domain definitively, as the sample is limited. However, from the examples given, "Productivity Assistants" appears as a project domain. If more data were available, we could provide a more accurate answer.'

In [39]:
parent_document_retrieval_chain.invoke({"question" : "Were there any usecases about security?"})["response"].content

'Based on the provided context, there are no specific use cases mentioned related to security.'

In [40]:
parent_document_retrieval_chain.invoke({"question" : "What did judges have to say about the fintech projects?"})["response"].content

'The judges had positive comments about the fintech projects. Specifically, they described the projects as having promising or clever ideas, being technically mature or ambitious, and demonstrating measurable environmental benefits.'

Overall, the performance *seems* largely the same. We can leverage a tool like Ragas to more effectively answer the question about the performance.

## Task 9: Ensemble Retriever

In brief, an Ensemble Retriever simply takes two or more retrievers and combines their retrieved documents using a rank-fusion algorithm.

In this case - we're using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm.

Setting it up is as easy as providing a list of our desired retrievers - and the weights for each retriever.
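
To show what the fusion step does with those weights, here is a minimal sketch of weighted Reciprocal Rank Fusion. The function name `weighted_rrf` and the placeholder document ids are assumptions, and the constant `c=60` is the value suggested in the RRF paper; treat it as an assumption rather than a statement about the library's internals.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse ranked lists: each document earns weight / (rank + c) per list it appears in."""
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += weight / (rank + c)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Placeholder rankings from two hypothetical retrievers.
sparse_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_b", "doc_d", "doc_a"]
fused = weighted_rrf([sparse_ranking, dense_ranking], weights=[0.5, 0.5])
print(fused)
```

Because only ranks are used, retrievers with incomparable score scales (BM25 scores versus cosine similarities, for example) can be combined without any normalisation. Documents that rank well in several lists rise to the top.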

In [41]:
from langchain.retrievers import EnsembleRetriever

retriever_list = [bm25_retriever, naive_retriever, parent_document_retriever, compression_retriever, multi_query_retriever]
equal_weighting = [1/len(retriever_list)] * len(retriever_list)

ensemble_retriever = EnsembleRetriever(
    retrievers=retriever_list, weights=equal_weighting
)

We'll pack *all* of these retrievers together in an ensemble.

In [42]:
ensemble_retrieval_chain = (
    {"context": itemgetter("question") | ensemble_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at our results!

In [43]:
ensemble_retrieval_chain.invoke({"question" : "What is the most common project domain?"})["response"].content

'The most common project domain in the provided data appears to be "Healthcare / MedTech," as it is mentioned multiple times among the projects listed.'

In [44]:
ensemble_retrieval_chain.invoke({"question" : "Were there any usecases about security?"})["response"].content

'Yes, there are usecases related to security in the provided context. Specifically, the project titled "WealthifyAI 3" focuses on security, described as a "medical imaging solution improving early diagnosis through vision transformers." Additionally, "SecureNest 28" and "SecureNest 49" also have security-related aspects; the former involves a hardware-aware model quantization benchmark suite, and the latter is about a document summarization and retrieval system for enterprise knowledge bases, which can be related to security and confidentiality of information.'

In [45]:
ensemble_retrieval_chain.invoke({"question" : "What did judges have to say about the fintech projects?"})["response"].content

'Judges had a variety of comments about the fintech projects. For example, they described some projects as having a clever solution with measurable environmental benefits, such as the project "TrendLens 6." Another fintech-related project, "SecureNest 28," was noted by judges as being conceptually strong, although it needed more benchmarking results. Overall, the judges appreciated innovative approaches and strong execution, but specific opinions varied depending on the project.'

## Task 10: Semantic Chunking

While this is not a retrieval method - it *is* an effective way of increasing retrieval performance on corpora that have clean semantic breaks in them.

Essentially, Semantic Chunking is implemented by:

1. Embedding all sentences in the corpus.
2. Combining or splitting sequences of sentences according to their semantic similarity, using one of a number of [possible thresholding methods](https://python.langchain.com/docs/how_to/semantic-chunker/):
  - `percentile`
  - `standard_deviation`
  - `interquartile`
  - `gradient`
3. Each sequence of related sentences is kept as a document!

Let's see how to implement this!

We'll use the `percentile` thresholding method for this example. It calculates the distances between consecutive sentences, then splits the sequence wherever a distance exceeds a given percentile of all distances.
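
The percentile rule can be sketched with toy vectors and the standard library. This is a simplified illustration, not the library's implementation: `semantic_chunks`, `percentile`, and `cosine_distance` are hypothetical helpers, the default of 95 is an assumed value, and the sentence vectors are hand-made instead of coming from an embedding model.

```python
import math

def cosine_distance(a, b):
    """One minus the cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norm

def percentile(values, pct):
    """Percentile with linear interpolation between the closest ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100
    lower, upper = math.floor(pos), math.ceil(pos)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)

def semantic_chunks(sentences, vectors, pct=95):
    """Start a new chunk wherever the distance to the previous sentence exceeds the percentile."""
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks

# Two toy "topics": the first two vectors point one way, the last two another.
toy_sentences = ["A1.", "A2.", "B1.", "B2."]
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]
print(semantic_chunks(toy_sentences, toy_vectors))
```

The only large jump in distance sits between the second and third sentences, so that is where the sequence is split.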

In [46]:
from langchain_experimental.text_splitter import SemanticChunker

semantic_chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile"
)

Now we can split our documents.

In [47]:
semantic_documents = semantic_chunker.split_documents(synthetic_usecase_data[:20])

Let's create a new vector store.

In [48]:
semantic_vectorstore = Qdrant.from_documents(
    semantic_documents,
    embeddings,
    location=":memory:",
    collection_name="Synthetic_Usecase_Data_Semantic_Chunks"
)

We'll use naive retrieval for this example.

In [49]:
semantic_retriever = semantic_vectorstore.as_retriever(search_kwargs={"k" : 10})

Finally we can create our classic chain!

In [50]:
semantic_retrieval_chain = (
    {"context": itemgetter("question") | semantic_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

And view the results!

In [51]:
semantic_retrieval_chain.invoke({"question" : "What is the most common project domain?"})["response"].content

'Based on the provided data, the most common project domain appears to be "Developer Tools / DevEx," which is mentioned twice. Other domains such as "Creative / Design / Media," "Productivity Assistants," "Customer Support / Helpdesk," "Legal / Compliance," and "QA / Testing / Validation" are each mentioned once. Therefore, the most common project domain in this dataset is "Developer Tools / DevEx."'

In [52]:
semantic_retrieval_chain.invoke({"question" : "Were there any usecases about security?"})["response"].content

'Yes, there are use cases related to security. Specifically, one project titled "BioForge" is in the Security domain and involves a medical imaging solution improving early diagnosis through vision transformers. Additionally, "InsightAI" focuses on a low-latency inference system for multimodal agents in autonomous systems, which is also related to security.'

In [53]:
semantic_retrieval_chain.invoke({"question" : "What did judges have to say about the fintech projects?"})["response"].content

'The judges had generally positive comments about the fintech projects. For example, one project described as a federated learning toolkit improving privacy in healthcare applications was praised as having a "comprehensive and technically mature approach." Another similar project was noted as "technically ambitious and well-executed." Additionally, a project focused on a bioinformatics pipeline leveraging transformers for genome annotation received comments highlighting it as "a forward-looking idea with solid supporting data." Overall, the judges appreciated the technical ambition, clarity, and potential impact of the fintech-related projects.'

#### ❓ Question #3:

If sentences are short and highly repetitive (e.g., FAQs), how might semantic chunking behave, and how would you adjust the algorithm?

##### ✅ Answer

A lot of repetitive sentences tend to look the same, so the semantic distance between them is tiny. The chunks tend to conglomerate into one large, redundant chunk.

- We could add rule-based anchors and split on headings, bullet points, or specific text.
- We could overlap chunks.
- We could switch the threshold to standard deviation, or another thresholding method.


# 🤝 Breakout Room Part #2

#### 🏗️ Activity #1

Your task is to evaluate the various Retriever methods against each other.

You are expected to:

1. Create a "golden dataset"
 - Use Synthetic Data Generation (powered by Ragas, or otherwise) to create this dataset
2. Evaluate each retriever with *retriever specific* Ragas metrics
 - Semantic Chunking is not considered a retriever method and will not be required for marks, but you may find it useful to do a "semantic chunking on" vs. "semantic chunking off" comparison
3. Compile these in a list and write a small paragraph about which is best for this particular data and why.

Your analysis should factor in:
  - Cost
  - Latency
  - Performance

> NOTE: This is **NOT** required to be completed in class. Please spend time in your breakout rooms creating a plan before moving on to writing code.

##### HINTS:

- LangSmith provides detailed information about latency and cost.
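
If you want a rough local latency number alongside LangSmith, a small timing helper is one option. This is a sketch under stated assumptions: `time_retriever` and `fake_retrieve` are hypothetical names, and in the notebook you would pass something like `naive_retriever.invoke` in place of the stand-in.

```python
import statistics
import time

def time_retriever(retrieve_fn, queries, repeats=3):
    """Return call count, mean, and max wall-clock seconds per call of retrieve_fn."""
    durations = []
    for _ in range(repeats):
        for query in queries:
            start = time.perf_counter()
            retrieve_fn(query)
            durations.append(time.perf_counter() - start)
    return {
        "calls": len(durations),
        "mean_s": statistics.mean(durations),
        "max_s": max(durations),
    }

# Hypothetical stand-in so the sketch runs without a vector store.
def fake_retrieve(query):
    return [query.upper()]

report = time_retriever(fake_retrieve, ["q1", "q2"], repeats=2)
print(report)
```

Wall-clock timings like these include network round trips to the embedding and reranking APIs, so repeat the measurement a few times before comparing retrievers.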

In [56]:
# Golden dataset generation with Ragas (synthetic)
# - Uses already-loaded `synthetic_usecase_data` (LangChain Documents) as corpus
# - Generates a small testset for quick iteration; adjust `testset_size` as needed

import os
import pandas as pd

# Prefer local ragas checkout if present (via env var or parent search), then inspect version and import API
import sys
from pathlib import Path

api_ok = False
tried_paths = []

# 1) Honor explicit env var if provided (set to absolute path of ragas/src)
_env_src = os.getenv("RAGAS_SRC")
if _env_src:
    p = Path(_env_src).expanduser().resolve()
    tried_paths.append(str(p))
    if p.is_dir() and str(p) not in sys.path:
        sys.path.insert(0, str(p))

# 2) Try common relative locations: ./ragas/src, ../ragas/src, ../../ragas/src
cwd = Path.cwd()
for candidate in [cwd/"ragas"/"src", cwd.parent/"ragas"/"src", cwd.parent.parent/"ragas"/"src"]:
    candidate = candidate.resolve()
    tried_paths.append(str(candidate))
    if candidate.is_dir() and str(candidate) not in sys.path:
        sys.path.insert(0, str(candidate))

# Import ragas (namespace packages may not have __file__)
import importlib
_ragas = importlib.import_module("ragas")
ragas_origin = getattr(_ragas, "__file__", None) or getattr(getattr(_ragas, "__spec__", None), "origin", None)
print("ragas resolved origin:", ragas_origin)
print("sys.path candidates tried:\n - " + "\n - ".join(tried_paths))

try:
    from ragas.testset.generator import TestsetGenerator as _TG
    from ragas.testset.evolutions import simple, reasoning, multi_context
    TestsetGenerator = _TG
    api_ok = True
except Exception as _e:
    print("ragas.testset import failed, trying ragas.tesset:", _e)
    try:
        from ragas.tesset.generator import TestsetGenerator as _TG
        from ragas.tesset.evolutions import simple, reasoning, multi_context
        TestsetGenerator = _TG
        api_ok = True
        print("Imported from ragas.tesset.*")
    except Exception as _e2:
        print("ragas.tesset import also failed:", _e2)
        # Fallback: try loading directly from file paths under candidate roots for both names
        import importlib.util as _ilu
        TestsetGenerator = None
        simple = reasoning = multi_context = None
        for _base in list(dict.fromkeys(tried_paths + sys.path)):
            try:
                _base_path = Path(_base)
                for _pkg in ("testset", "tesset"):
                    _gen = _base_path / "ragas" / _pkg / "generator.py"
                    _evo = _base_path / "ragas" / _pkg / "evolutions.py"
                    if _gen.is_file() and _evo.is_file():
                        _spec_gen = _ilu.spec_from_file_location(f"ragas.{_pkg}.generator", str(_gen))
                        _mod_gen = _ilu.module_from_spec(_spec_gen)
                        assert _spec_gen.loader is not None
                        _spec_gen.loader.exec_module(_mod_gen)

                        _spec_evo = _ilu.spec_from_file_location(f"ragas.{_pkg}.evolutions", str(_evo))
                        _mod_evo = _ilu.module_from_spec(_spec_evo)
                        assert _spec_evo.loader is not None
                        _spec_evo.loader.exec_module(_mod_evo)

                        TestsetGenerator = getattr(_mod_gen, "TestsetGenerator", None)
                        simple = getattr(_mod_evo, "simple", None)
                        reasoning = getattr(_mod_evo, "reasoning", None)
                        multi_context = getattr(_mod_evo, "multi_context", None)
                        if TestsetGenerator and simple and reasoning and multi_context:
                            api_ok = True
                            print(f"Loaded ragas.{_pkg} from:", _gen.parent)
                            break
                if api_ok:
                    break
            except Exception as _inner:
                # Try next candidate
                continue
        if not api_ok:
            raise ImportError(
                "Could not import ragas.testset/tesset from site-packages or local ragas/src.\n"
                "Set RAGAS_SRC to the absolute path of your ragas/src directory and rerun.\n"
                f"Paths tried: {tried_paths}"
            )

from langchain_openai import ChatOpenAI

# Ensure OPENAI_API_KEY present (the notebook earlier collects it via getpass)
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY must be set to generate synthetic data."

if not api_ok:
    raise ImportError(
        "Your installed ragas does not include the synthetic testset API (ragas.testset.*). "
        "Please upgrade ragas in this environment (e.g., `uv add \"ragas>=0.1.14\"`) or install the generation extras."
    )

# Ensure documents have a stable filename key (helps some generators)
for _d in synthetic_usecase_data:
    if "filename" not in _d.metadata:
        _d.metadata["filename"] = _d.metadata.get("source", "projects_with_domains.csv")

# Build generator using existing OpenAI stack
# Reuse OpenAI embeddings defined earlier as `embeddings`
# Use lightweight chat models for cost; you can swap to larger models if desired
_generator_llm = ChatOpenAI(model="gpt-4.1-nano")
_critic_llm = ChatOpenAI(model="gpt-4.1-nano")

generator = TestsetGenerator.from_langchain(
    _generator_llm,
    _critic_llm,
    embeddings,
)

# Generate with a balanced distribution of question types
testset_size = 30
_distributions = {simple: 0.6, reasoning: 0.3, multi_context: 0.1}

testset = generator.generate_with_langchain_docs(
    synthetic_usecase_data,
    test_size=testset_size,
    distributions=_distributions,
)

# Preview and persist
df = testset.to_pandas()
print(df.head(5))
print(f"\nGenerated {len(df)} synthetic Q/A pairs.")

output_path = "./data/golden_dataset.jsonl"
testset.to_jsonl(output_path)
print(f"Saved golden dataset to {output_path}")

ragas resolved origin: None
sys.path candidates tried:
 - /Users/tylerwelsh-personal/Dev/AIE8/09_Advanced_Retrieval/ragas/src
 - /Users/tylerwelsh-personal/Dev/AIE8/ragas/src
 - /Users/tylerwelsh-personal/Dev/ragas/src
ragas.testset import failed, trying ragas.tesset: No module named 'ragas.testset'
ragas.tesset import also failed: No module named 'ragas.tesset'


ImportError: Could not import ragas.testset/tesset from site-packages or local ragas/src.
Set RAGAS_SRC to the absolute path of your ragas/src directory and rerun.
Paths tried: ['/Users/tylerwelsh-personal/Dev/AIE8/09_Advanced_Retrieval/ragas/src', '/Users/tylerwelsh-personal/Dev/AIE8/ragas/src', '/Users/tylerwelsh-personal/Dev/ragas/src']