# Tensorlake + Qdrant + LangGraph RAG Demo with Academic Research Papers

Learn more about Qdrant and Tensorlake on the [Tensorlake docs](https://tlake.link/qdrant-tensorlake).

Prefer a video walkthrough? Check out this [YouTube tutorial](https://www.youtube.com/watch?v=Segv3wI1PdM).

## Setup and Dependencies

In [None]:
!pip install tensorlake qdrant-client sentence-transformers pandas numpy langgraph langsmith langchain-openai

In [None]:
%env TENSORLAKE_API_KEY=YOUR_TENSORLAKE_API_KEY
%env QDRANT_API_KEY=YOUR_QDRANT_API_KEY
%env QDRANT_DATABASE_URL=YOUR_QDRANT_DATABASE_URL
%env OPENAI_API_KEY=YOUR_OPENAI_API_KEY
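The later cells read these keys with `os.getenv`, which returns `None` without complaint when a variable is unset. A minimal sketch of a pre-flight check, assuming the four variable names above; `missing_env_vars` is a hypothetical helper, not part of any of the libraries used here.

```python
import os

# Names taken from the %env cell above.
REQUIRED_VARS = [
    "TENSORLAKE_API_KEY",
    "QDRANT_API_KEY",
    "QDRANT_DATABASE_URL",
    "OPENAI_API_KEY",
]

def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]

missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```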

In [None]:
# Tensorlake DocAI setup
from tensorlake.documentai import (
    DocumentAI,
    EnrichmentOptions,
    ParsingOptions,
    StructuredExtractionOptions,
    ChunkingStrategy,
    TableOutputMode,
    TableParsingFormat,
    ParseStatus,
)

# Qdrant client setup
from qdrant_client import QdrantClient
from qdrant_client.http import models
from qdrant_client.http.models import Filter, FieldCondition, MatchValue, MatchText

# LangGraph agent setup
from langgraph.prebuilt import create_react_agent

# Helper packages
from pydantic import BaseModel, Field
from sentence_transformers import SentenceTransformer
import pandas as pd
import numpy as np
from typing import List, Dict, Any
import re
from uuid import uuid4
import time
import os

## Parse all of the documents with Tensorlake
1. Set up your Tensorlake Client
2. Create two lists to store structured data and chunks
3. Create a list of the file URLs

### Initialize the Tensorlake DocAI Client

In [None]:
doc_ai = DocumentAI(api_key=os.getenv('TENSORLAKE_API_KEY'))

In [None]:
all_structured_data = []
all_chunks = []

files = [
    "https://pub-226479de18b2493f96b64c6674705dd8.r2.dev/CHI_13.pdf",
    "https://pub-226479de18b2493f96b64c6674705dd8.r2.dev/CSCW_14_1.pdf",
    "https://pub-226479de18b2493f96b64c6674705dd8.r2.dev/CSCW_14_2.pdf",
    "https://pub-226479de18b2493f96b64c6674705dd8.r2.dev/CSCW_14_3.pdf",
    "https://pub-226479de18b2493f96b64c6674705dd8.r2.dev/ICER_11.pdf",
    "https://pub-226479de18b2493f96b64c6674705dd8.r2.dev/ICER_12_2.pdf",
    "https://pub-226479de18b2493f96b64c6674705dd8.r2.dev/ICER_13.pdf",
    "https://pub-226479de18b2493f96b64c6674705dd8.r2.dev/ITICSE_13.pdf",
    "https://pub-226479de18b2493f96b64c6674705dd8.r2.dev/Koli_14.pdf",
    "https://pub-226479de18b2493f96b64c6674705dd8.r2.dev/SIGCSE_13.pdf",
    "https://pub-226479de18b2493f96b64c6674705dd8.r2.dev/SarahEsper_ResearchExam.pdf",
    "https://pub-226479de18b2493f96b64c6674705dd8.r2.dev/UCSDTechReport_11.pdf"
]

### Define your JSON Schema for Structured Data Extraction

In [None]:
class Author(BaseModel):
    """Author information for a research paper"""
    name: str = Field(description="Full name of the author")
    affiliation: str = Field(description="Institution or organization affiliation")

class Conference(BaseModel):
    """Conference or journal information"""
    name: str = Field(description="Name of the conference or journal")
    year: str = Field(description="Year of publication")
    location: str = Field(description="Location of the conference or journal publication")

class Reference(BaseModel):
    """Reference to another publication"""
    author_names: List[str] = Field(description="List of author names for this reference")
    title: str = Field(description="Title of the referenced publication")
    publication: str = Field(description="Name of the publication venue (journal, conference, etc.)")
    year: str = Field(description="Year of publication")

class ResearchPaper(BaseModel):
    """Complete schema for extracting research paper information"""
    authors: List[Author] = Field(description="List of authors with their affiliations. Authors will be listed below the title and above the main text of the paper. Authors will often be in multiple columns and there may be multiple authors associated to a single affiliation.")
    conference_journal: Conference = Field(description="Conference or journal information")
    title: str = Field(description="Title of the research paper")
    abstract: str = Field(description="Abstract or summary of the paper")
    keywords: List[str] = Field(description="List of keywords associated with the paper")
    acm_classification: str = Field(description="ACM classification code or category")
    general_terms: List[str] = Field(description="List of general terms or categories")
    acknowledgments: str = Field(description="Acknowledgments section")
    references: List[Reference] = Field(description="List of references cited in the paper")

# Convert to JSON schema for Tensorlake
json_schema = ResearchPaper.model_json_schema()

### Define a function for extracting data for each research paper

In [None]:
def process_research_paper(file_url):
    doc_structured_data = []
    doc_chunks = []

    # Configure parsing options
    parsing_options = ParsingOptions(
        chunking_strategy=ChunkingStrategy.SECTION,
        table_parsing_strategy=TableParsingFormat.TSR,
        table_output_mode=TableOutputMode.MARKDOWN,
    )
    # Create structured extraction options with the JSON schema
    structured_extraction_options = [StructuredExtractionOptions(
        schema_name="ResearchPaper",
        json_schema=json_schema,
    )]
    # Create enrichment options
    enrichment_options = EnrichmentOptions(
        figure_summarization=True,
        figure_summarization_prompt="Summarize the figure beyond the caption by describing the data as it relates to the context of the research paper.",
        table_summarization=True,
        table_summarization_prompt="Summarize the table beyond the caption by describing the data as it relates to the context of the research paper.",
    )

    # Parse the document
    parse_id = doc_ai.parse(file_url, parsing_options, structured_extraction_options, enrichment_options)
    print(f"Started parsing job: {parse_id} for document {file_url}")
    result = doc_ai.wait_for_completion(parse_id)

    if result:
        print(f"Job {parse_id} completed successfully for {file_url}")
        if result.structured_data:
            print(f"Extracted {len(result.structured_data)} structured data items")
        if result.chunks:
            print(f"Extracted {len(result.chunks)} chunks")

    # Extract structured data and chunks from the result
    if result and result.structured_data:
        # Add metadata to structured data
        structured_data = result.structured_data
        doc_structured_data.append(structured_data)
        print(f"Extracted structured data for {file_url}")
    else:
        print(f"No structured data found for {file_url}")

    if result and result.chunks:
        # Process document chunks
        chunks = result.chunks
        doc_chunks.extend(chunks)
        print(f"Extracted {len(chunks)} chunks for {file_url}")
    else:
        print(f"No chunks found for {file_url}")

    print(f"Processed {file_url}")

    # Return structured data and chunks
    return doc_structured_data, doc_chunks

### Parse each of the files

In [None]:
for file in files:
  print(f"Processing file: {file}")

  # Process the research paper
  structured_data, chunks = process_research_paper(file)

  # Store results
  all_structured_data.append(structured_data)
  all_chunks.append(chunks)
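After the loop, `all_chunks` holds one list of chunks per entry in `files`, in the same order. A small sketch of a sanity check that reports how many chunks each file produced; `chunk_counts` is a hypothetical helper, and the example inputs are toy stand-ins for `files` and `all_chunks`.

```python
def chunk_counts(file_urls, chunks_per_file):
    """Pair each file name with the number of chunks parsed from it."""
    return {
        url.rsplit("/", 1)[-1]: len(chunks)
        for url, chunks in zip(file_urls, chunks_per_file)
    }

# Toy stand-ins; in the notebook, pass `files` and `all_chunks` instead.
example_files = ["https://example.com/a.pdf", "https://example.com/b.pdf"]
example_chunks = [["c1", "c2", "c3"], []]

for name, count in chunk_counts(example_files, example_chunks).items():
    flag = "" if count else "  <- no chunks, check the parse job"
    print(f"{name}: {count} chunks{flag}")
```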

## Upload the points to Qdrant

### Initialize the Qdrant Client

In [None]:
# Initialize Qdrant client (Cloud version)
qdrant_client = QdrantClient(
    url=os.getenv('QDRANT_DATABASE_URL'),
    api_key=os.getenv('QDRANT_API_KEY')
)

# Initialize sentence transformer for embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')

### Step 1: Create the collection if it doesn't exist

In [None]:
# Create the collection if it doesn't exist
collection_name = "research_paper_example"
if not qdrant_client.collection_exists(collection_name=collection_name):
    qdrant_client.create_collection(
        collection_name=collection_name,
        vectors_config=models.VectorParams(size=384, distance=models.Distance.COSINE)
    )
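The collection is created with `size=384` to match the embedding width of `all-MiniLM-L6-v2`, and it ranks results by cosine distance. As a rough illustration of what that metric measures, here is a plain-Python sketch on toy vectors; `cosine_similarity` is a hypothetical helper and is not how Qdrant computes scores internally.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # same direction, different length
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # orthogonal vectors
```

Because the score depends only on direction, two chunks with similar wording score highly regardless of the magnitude of their embeddings.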

### Step 2: Create the Embeddings and Payloads and Upsert to Qdrant
Using the structured data returned by Tensorlake, build a payload to attach to every chunk of each document.

In [None]:
# Flatten all chunks and match them with their corresponding structured data
all_points = []

for doc_idx, (structured_data_list, chunks) in enumerate(zip(all_structured_data, all_chunks)):
    if not chunks:
        print(f"No chunks found for document {doc_idx}")
        continue

    # Get the structured data for this document (assuming first item in list)
    structured_data = structured_data_list[0][0] if structured_data_list else None

    # Extract metadata from structured data
    authors = []
    author_names = []  # For searchable text field
    references = []
    conference_name = ""
    conference_year = ""
    conference_location = ""
    title = ""
    keywords = []

    if structured_data:
        print("Found structured data")
        # Extract author information
        if 'authors' in structured_data.data:
            print(f"Extracting {len(structured_data.data['authors'])} authors")
            for author in structured_data.data['authors']:
                print(f"Processing author: {author}")
                if isinstance(author, dict):
                    author_name = author.get('name', '')
                    author_affiliation = author.get('affiliation', '')
                    authors.append(f"{author_name} ({author_affiliation})")
                    author_names.append(author_name)  # For searchable text field

        # Extract conference information
        if 'conference_journal' in structured_data.data:
            print("Extracting conference information")
            conf = structured_data.data['conference_journal']
            print(f"Processing conference: {conf}")
            if isinstance(conf, dict):
                conference_name = conf.get('name', '')
                conference_year = conf.get('year', '')
                conference_location = conf.get('location', '')
                print(f"Conference: {conference_name} ({conference_year}) at {conference_location}")

        # Extract other metadata
        title = structured_data.data.get('title', '')
        print(f"Title: {title}")
        keywords = structured_data.data.get('keywords', [])
        print(f"Keywords: {keywords}")

        # Extract references
        if 'references' in structured_data.data:
            print(f"Extracting {len(structured_data.data['references'])} references")
            for ref in structured_data.data['references']:
                if isinstance(ref, dict):
                    ref_authors = ref.get('author_names', [])
                    ref_title = ref.get('title', '')
                    ref_publication = ref.get('publication', '')
                    ref_year = ref.get('year', '')
                    references.append({
                        "authors": ref_authors,
                        "title": ref_title,
                        "publication": ref_publication,
                        "year": ref_year
                    })

    # Extract the markdown chunks and table and figure summaries for this document
    texts = [chunk.content for chunk in chunks]

    # Create embeddings for all of the chunks and summaries
    vectors = model.encode(texts).tolist()

    for i, data in enumerate(texts):
        # Enhanced payload with structured data
        payload = {
            "content": data,
            "document_index": doc_idx,
            # Structured data fields for filtering
            "title": title,
            "authors": authors,  # List of "Name (Affiliation)" strings
            "author_names": author_names,  # List of just names for easier filtering
            "conference_name": conference_name,
            "conference_year": conference_year,
            "conference_location": conference_location,
            "keywords": keywords,
            "references": references,  # List of reference dicts
            # Create searchable text fields
            "authors_text": " ".join(author_names),  # For author search (just names)
            "authors_full": " ".join(authors),  # Full author info with affiliations
            "conference_text": f"{conference_name} {conference_year}",  # For conference search
        }

        all_points.append(models.PointStruct(
            id=str(uuid4()),  # Unique ID
            vector=vectors[i],
            payload=payload
        ))

if not all_points:
    raise ValueError("No points to upload. Ensure your parsing worked and chunks were generated.")

In [None]:
# Upsert into Qdrant
qdrant_client.upsert(collection_name=collection_name, points=all_points)
print(f"Inserted {len(all_points)} chunks into Qdrant with enhanced metadata")

Inserted 566 chunks into Qdrant with enhanced metadata
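The cell above sends every point in a single `upsert` call, which is fine for a few hundred chunks. For a larger corpus it may be worth splitting the points into batches. A minimal sketch, assuming a batch size of 100 chosen purely for illustration; `batched` is a hypothetical helper.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# In the notebook this might be used as (not executed here):
#   for batch in batched(all_points, 100):
#       qdrant_client.upsert(collection_name=collection_name, points=batch)
sizes = [len(batch) for batch in batched(list(range(566)), 100)]
print(sizes)
```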


### Step 3: Create a Qdrant Index for relevant searches based on structured data extraction

In [None]:
# Create index for the joined author-name string
qdrant_client.create_payload_index(
    collection_name=collection_name,
    field_name="authors_text",
    field_schema="keyword",
)

# Create index for conference names
qdrant_client.create_payload_index(
    collection_name=collection_name,
    field_name="conference_name",
    field_schema="keyword",
)

# Create index for conference years
qdrant_client.create_payload_index(
    collection_name=collection_name,
    field_name="conference_year",
    field_schema="keyword",
)

# Create index for the list of individual author names
qdrant_client.create_payload_index(
    collection_name=collection_name,
    field_name="author_names",
    field_schema="keyword",
)

# Create index for keywords
qdrant_client.create_payload_index(
    collection_name=collection_name,
    field_name="keywords",
    field_schema="keyword",
)

# Create index for Title
qdrant_client.create_payload_index(
    collection_name=collection_name,
    field_name="title",
    field_schema="keyword",
)

print(f"Created indices for {collection_name} collection")

## Query and filter your Qdrant collection

### Search the Qdrant collection with a query

In [None]:
points = qdrant_client.query_points(
    collection_name=collection_name,
    query=model.encode("Does computer science education improve problem solving skills?").tolist(),
    limit=3,
).points

for point in points:
    print(point.payload.get('title', 'Unknown'), "score:", point.score)

CodeSpells: Bridging Educational Language Features with Industry-Standard Languages score: 0.57552844
CHILDREN'S PERCEPTIONS OF WHAT COUNTS AS A PROGRAMMING LANGUAGE score: 0.55624765
Experience Report: an AP CS Principles University Pilot score: 0.54369175


### Filter the Qdrant collection


In [None]:
search_results = qdrant_client.query_points(
    collection_name=collection_name,
    query_filter=models.Filter(
        must=[
            models.FieldCondition(
                key="author_names",
                match=models.MatchValue(
                    value="William G. Griswold",
                ),
            )
        ]
    ),
    search_params=models.SearchParams(exact=False),
    limit=3,
)
points = search_results.points

print(f"Found {len(points)} results:")

for point in points:
    print(f" - {point.payload.get('title', 'Unknown')} | {point.payload.get('authors_text', 'Unknown')}")

Found 3 results:
 - CodeSpells: Embodying the Metaphor of Wizardry for Programming | Sarah Esper Stephen R. Foster William G. Griswold
 - CODESPELLS: HOW TO DESIGN QUESTS TO TEACH JAVA CONCEPTS * | Sarah Esper Samantha R. Wood Stephen R. Foster Sorin Lerner William G. Griswold
 - CodeSpells: Bridging Educational Language Features with Industry-Standard Languages | Sarah Esper Stephen R. Foster William G. Griswold Carlos Herrera Wyatt Snyder
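The filter succeeds because `author_names` is stored as a list and a value match on a keyword field passes when any element equals the value exactly. The sketch below approximates that behaviour in plain Python to show why a partial name, or the joined `authors_text` string, would not match. `keyword_match` is a hypothetical helper for illustration only, not Qdrant's implementation.

```python
def keyword_match(field_value, wanted):
    """Approximate an exact-value match: scalars must equal `wanted`;
    lists match if any element equals `wanted`."""
    if isinstance(field_value, list):
        return any(item == wanted for item in field_value)
    return field_value == wanted

payload = {
    "author_names": ["Sarah Esper", "Stephen R. Foster", "William G. Griswold"],
    "authors_text": "Sarah Esper Stephen R. Foster William G. Griswold",
}

print(keyword_match(payload["author_names"], "William G. Griswold"))  # full name is a list element
print(keyword_match(payload["author_names"], "Griswold"))             # partial name is not
print(keyword_match(payload["authors_text"], "William G. Griswold"))  # joined string is a different value
```

Matching on part of a name would likely need a full-text index on `authors_text` combined with `MatchText`, which is imported at the top of the notebook but not used.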


### Filter, then search the Qdrant Collection

In [None]:
points = qdrant_client.query_points(
    collection_name=collection_name,
    query=model.encode("Does computer science education improve problem solving skills?").tolist(),
    query_filter=models.Filter(
        must=[
            models.FieldCondition(
                key="author_names",
                match=models.MatchValue(
                    value="William G. Griswold",
                ),
            )
        ]
    ),
    limit=3,
).points

for point in points:
    print(point.payload.get('title', 'Unknown'), point.payload.get('conference_name', 'Unknown'), "score:", point.score)

CodeSpells: Bridging Educational Language Features with Industry-Standard Languages Koli Calling '14 score: 0.57552844
CODESPELLS: HOW TO DESIGN QUESTS TO TEACH JAVA CONCEPTS Consortium for Computing Sciences in Colleges score: 0.4907498
CodeSpells: Bridging Educational Language Features with Industry-Standard Languages Koli Calling '14 score: 0.4823265


# Integrate with a LangGraph Agent

## Create simple tools that will query Qdrant

This tool queries Qdrant without filtering.

In [None]:
def query_qdrant(question):
  """This function will query the Qdrant vector database for the question and return relevant markdown chunks with their respective metadata of where the chunk was found, including title and authors of the paper, what conference it was published at, and the year it was published."""
  print(f"Asking {question}")
  search_results = qdrant_client.query_points(
      collection_name=collection_name,
      query=model.encode(question).tolist(),
      limit=3,
  )
  print(f"Found {len(search_results.points)} results:")
  for point in search_results.points:
    print(f" - {point.payload.get('title', 'Unknown')} (Score: {point.score:.4f})")
  return search_results.points
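The tool returns the raw points, so the agent receives each full payload, including the reference list. One possible variation is to reduce each hit to a short, citation-friendly string before handing it to the model. A minimal sketch, assuming the payload keys defined earlier in this notebook; `format_hit` is a hypothetical helper and the example payload contains toy values.

```python
def format_hit(payload, score, max_chars=200):
    """Render one search hit as a short string with title, authors, venue, and a snippet."""
    title = payload.get("title") or "Unknown title"
    authors = ", ".join(payload.get("author_names") or []) or "Unknown authors"
    venue_parts = (payload.get("conference_name"), payload.get("conference_year"))
    venue = " ".join(part for part in venue_parts if part) or "Unknown venue"
    snippet = (payload.get("content") or "")[:max_chars]
    return f"{title} | {authors} | {venue} | score={score:.4f}\n{snippet}"

# Toy payload using the same keys as the points uploaded above.
example_payload = {
    "title": "Example Paper",
    "author_names": ["A. Author", "B. Author"],
    "conference_name": "ExampleConf",
    "conference_year": "2013",
    "content": "Toy chunk text.",
}
print(format_hit(example_payload, 0.5))
```

In the tool, this could be applied as `[format_hit(p.payload, p.score) for p in search_results.points]` before returning.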

This tool filters first, then runs the query against the filtered points.

In [None]:
def filtered_qdrant_query(question, filter_field, filter_value):
  """If the question mentions a person's name, assume it is an Author Name. If the question mentions a conference where papers are published, assume it is the Conference Name. If the question mentions a year, assume it is the Conference Year. This function will first filter the Qdrant vector database based on the filter_field and filter_value, then query using the question and return relevant markdown chunks with their respective metadata of where the chunk was found, including title and authors of the paper, what conference it was published at, and the year it was published. Filtered fields can be one of: author_names, title, conference_name, conference_year, or keywords"""
  print(f"Asking {question} by first filtering {filter_field} by {filter_value}")
  search_results = qdrant_client.query_points(
      collection_name=collection_name,
      query_filter=models.Filter(
          must=[
              models.FieldCondition(
                  key=filter_field,
                  match=models.MatchValue(
                      value=filter_value,
                  ),
              )
          ]
      ),
      query=model.encode(question).tolist(),
      search_params=models.SearchParams(exact=False),
      limit=3,
  )
  print(f"Found {len(search_results.points)} results:")
  for point in search_results.points:
    print(f" - {point.payload.get('title', 'Unknown')} (Score: {point.score:.4f})")
  return search_results.points
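Since the agent chooses `filter_field` and `filter_value` itself, it can pass a field that has no index or a year as an integer. The schema stores `year` as a string and every index is a keyword index, so `2013` would not equal `"2013"`. A minimal sketch of a guard that could run at the top of the tool; `normalize_filter` and `ALLOWED_FILTER_FIELDS` are hypothetical names, with the field list taken from the docstring above.

```python
ALLOWED_FILTER_FIELDS = {
    "author_names",
    "title",
    "conference_name",
    "conference_year",
    "keywords",
}

def normalize_filter(filter_field, filter_value):
    """Validate the field name and coerce the value to a stripped string."""
    if filter_field not in ALLOWED_FILTER_FIELDS:
        raise ValueError(
            f"Unsupported filter field {filter_field!r}; "
            f"expected one of {sorted(ALLOWED_FILTER_FIELDS)}"
        )
    return filter_field, str(filter_value).strip()

print(normalize_filter("conference_year", 2013))
```

Raising a `ValueError` with the allowed names gives the agent a message it can use to retry with a valid field.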

## Create the LangGraph agent with the tools

In [None]:
agent = create_react_agent(
    model="openai:gpt-4o-mini",
    tools=[query_qdrant, filtered_qdrant_query],
    # A static prompt that never changes
    prompt="Answer the question asked using the data retrieved from either the query_qdrant or filtered_qdrant_query tool. In your response, always include metadata of the research paper where the information was found. The metadata will be available in the data from the tool."
)

## Ask the agent questions

In [None]:
question = "Does computer science education improve problem solving skills?"

result = agent.invoke({"messages": [{"role": "user", "content": f"{question}"}]})

print(result["messages"][-1].content)

Asking Does computer science education improve problem solving skills?
Found 3 results:
 - CodeSpells: Bridging Educational Language Features with Industry-Standard Languages (Score: 0.5755)
 - CHILDREN'S PERCEPTIONS OF WHAT COUNTS AS A PROGRAMMING LANGUAGE (Score: 0.5562)
 - Experience Report: an AP CS Principles University Pilot (Score: 0.5437)
Computer science education has been shown to improve problem-solving skills, particularly through structured programs and innovative teaching methods. For instance, there are studies emphasizing the incorporation of programming languages such as Scratch and Java in educational curricula, which encourage problem-solving abilities among students. The introduction of educational programming environments allows students to engage in computational thinking and improve their ability to analyze problems and create solutions.

Here are some insights drawn from relevant academic papers:

1. In the paper titled **"CodeSpells: Bridging Educational Langua

In [None]:
question = "Does William G. Griswold think computer science education improve problem solving skills?"

result = agent.invoke({"messages": [{"role": "user", "content": f"{question}"}]})

print(result["messages"][-1].content)

Asking Does William G. Griswold think computer science education improve problem solving skills? by first filtering author_names by William G. Griswold
Found 3 results:
 - CodeSpells: Bridging Educational Language Features with Industry-Standard Languages (Score: 0.5294)
 - CodeSpells: Bridging Educational Language Features with Industry-Standard Languages (Score: 0.5065)
 - On the Nature of Fires and How to Spark Them When You’re Not There (Score: 0.5038)
William G. Griswold, in collaboration with other researchers, has presented research indicating that computer science education can significantly influence not only programming skills but also broader problem-solving abilities. In the paper titled "CodeSpells: Bridging Educational Language Features with Industry-Standard Languages," they discuss an educational initiative aimed at engaging students in programming and altering their perception of what it means to be a computer scientist. Specifically, the curriculum developed and teste

In [None]:
question = "What are the key findings in papers published in 2013?"

result = agent.invoke({"messages": [{"role": "user", "content": f"{question}"}]})

print(result["messages"][-1].content)

Asking key findings by first filtering conference_year by 2013
Found 3 results:
 - On the Nature of Fires and How to Spark Them When You’re Not There (Score: 0.3649)
 - CodeSpells: Embodying the Metaphor of Wizardry for Programming (Score: 0.3294)
 - From Competition to Metacognition: Designing Diverse, Sustainable Educational Games (Score: 0.3256)
Here are some key findings from papers published in 2013:

1. **Title:** On the Nature of Fires and How to Spark Them When You’re Not There
   - **Authors:** Sarah Esper, Stephen R. Foster, William G. Griswold
   - **Conference:** SIGCSE
   - **Location:** Denver, Colorado, USA
   - **Key Findings:** This research discussed the grounded theory on CS0 and CS1 education, emphasizing the role of gamification and active learning in informal learning spaces.
   - **Citation Information:** [SIGCSE 2013]

2. **Title:** CodeSpells: Embodying the Metaphor of Wizardry for Programming
   - **Authors:** Sarah Esper, Stephen R. Foster, William G. Griswol