# wiki3 Knowledge Graph Extraction Pipeline

Extract entities and relations from Wikipedia articles with an LLM, attach (mock) embeddings to each entity, and export the result as JSON for browser-based querying with DuckDB-Wasm.

## Setup

In [1]:
import { ChatOpenAI } from "npm:@langchain/openai";
import { WikipediaQueryRun } from "npm:@langchain/community/tools/wikipedia_query_run";
import { RecursiveCharacterTextSplitter } from "npm:@langchain/textsplitters";
import { ChatPromptTemplate } from "npm:@langchain/core/prompts";
import { z } from "npm:zod";

console.log("Dependencies loaded");

Dependencies loaded


## Wikipedia Loader

In [2]:
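const wikipediaFetcher = new WikipediaQueryRun();
const articleTitle = "Albert Einstein";
// invoke() is the current entry point for LangChain tools; call() is the older, deprecated form.
// The tool returns truncated page content (4000 characters in the run below).
const rawContent = await wikipediaFetcher.invoke(articleTitle);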

console.log(`Fetched: ${articleTitle}`);
console.log(`Content length: ${rawContent.length} characters`);
console.log(`Preview: ${rawContent.substring(0, 300)}...`);

Fetched: Albert Einstein
Content length: 4000 characters
Preview: Page: Albert Einstein
Summary: Albert Einstein (14 March 1879 – 18 April 1955) was a German-born theoretical physicist best known for developing the theory of relativity. Einstein also made important contributions to quantum theory. His mass–energy equivalence formula E = mc2, which arises from spec...


## Configure LLM for Knowledge Graph Extraction

In [3]:
const llm = new ChatOpenAI({
  model: "gpt-5-mini",
  temperature: 1,
  apiKey: Deno.env.get("OPENAI_API_KEY"),
});

// Define the extraction schema using Zod
const extractionSchema = z.object({
  entities: z.array(
    z.object({
      id: z.string().describe("Unique entity identifier (e.g., person_1, org_2)"),
      label: z.string().describe("Entity name or label"),
      type: z.enum(["Person", "Organization", "Place", "Concept", "Event", "Work"]),
      description: z.string().describe("Brief entity description from text"),
    })
  ).describe("List of entities extracted from the text"),
  relations: z.array(
    z.object({
      source_id: z.string().describe("Source entity ID"),
      target_id: z.string().describe("Target entity ID"),
      relation_type: z.string().describe("Type of relationship (e.g., BORN_IN, WORKED_AT, DISCOVERED)"),
      description: z.string().describe("Description of the relationship"),
    })
  ).describe("List of relationships between entities"),
});

// Use withStructuredOutput for reliable JSON extraction
const structuredLlm = llm.withStructuredOutput(extractionSchema);

console.log("LLM and schema configured with withStructuredOutput");

LLM and schema configured with withStructuredOutput
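
The schema constrains the shape of each entity and relation, but it cannot guarantee that every `source_id` and `target_id` refers to an entity in the same response. A minimal sketch of such a check follows, using plain TypeScript types that mirror the Zod schema above. The `findDanglingRelations` helper and the sample data are hypothetical and not part of the pipeline.

```typescript
// Plain-TypeScript mirrors of the entity and relation shapes in the Zod schema.
interface ExtractedEntity {
  id: string;
  label: string;
  type: "Person" | "Organization" | "Place" | "Concept" | "Event" | "Work";
  description: string;
}

interface ExtractedRelation {
  source_id: string;
  target_id: string;
  relation_type: string;
  description: string;
}

// Hypothetical helper: return the relations whose endpoints are not both
// present in the entity list of the same extraction result.
function findDanglingRelations(
  entities: ExtractedEntity[],
  relations: ExtractedRelation[],
): ExtractedRelation[] {
  const knownIds = new Set(entities.map((e) => e.id));
  return relations.filter(
    (r) => !knownIds.has(r.source_id) || !knownIds.has(r.target_id),
  );
}

// Illustrative data only, not real model output.
const sampleEntities: ExtractedEntity[] = [
  { id: "person_1", label: "Albert Einstein", type: "Person", description: "Physicist" },
  { id: "place_2", label: "Germany", type: "Place", description: "Country" },
];
const sampleRelations: ExtractedRelation[] = [
  { source_id: "person_1", target_id: "place_2", relation_type: "BORN_IN", description: "Born in Germany" },
  { source_id: "person_1", target_id: "org_9", relation_type: "WORKED_AT", description: "Target is missing" },
];

const dangling = findDanglingRelations(sampleEntities, sampleRelations);
console.log(dangling.map((r) => r.relation_type));
```

The extraction loop later in this notebook skips such relations silently; a check like this one would make the number of skipped relations visible.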


## Extract Knowledge Graph from Text

In [4]:
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1024,
  chunkOverlap: 128,
});

const chunks = await splitter.splitText(rawContent);
console.log(`Split into ${chunks.length} chunks`);
console.log(`Chunk 1 preview: ${chunks[0].substring(0, 150)}...`);

Split into 7 chunks
Chunk 1 preview: Page: Albert Einstein
Summary: Albert Einstein (14 March 1879 – 18 April 1955) was a German-born theoretical physicist best known for developing the t...
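
To make `chunkSize` and `chunkOverlap` concrete, here is a simplified fixed-window splitter. It only illustrates how overlap works: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, so its chunk boundaries and counts differ. The `slidingWindowChunks` name is hypothetical.

```typescript
// Simplified sketch: fixed-size windows in which consecutive windows share
// `chunkOverlap` characters. This is NOT the algorithm used by
// RecursiveCharacterTextSplitter.
function slidingWindowChunks(
  text: string,
  chunkSize: number,
  chunkOverlap: number,
): string[] {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error("Require chunkSize > 0 and 0 <= chunkOverlap < chunkSize");
  }
  const step = chunkSize - chunkOverlap; // how far each window advances
  const result: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    result.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this window reached the end
  }
  return result;
}

// Filler text with the same length as the fetched article content.
const filler = "x".repeat(4000);
const windows = slidingWindowChunks(filler, 1024, 128);
console.log(windows.length, windows.map((w) => w.length));
```

With these settings each window advances by 896 characters, so a 4000-character input yields 5 fixed windows, whereas the separator-aware splitter above produced 7 chunks.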


In [5]:
const extractionPrompt = ChatPromptTemplate.fromMessages([
  ["system", `You are an expert at extracting knowledge graphs from text.
Extract entities and relationships from the provided text.
Assign unique IDs to each entity using the format: type_number (e.g., "person_1", "org_2", "place_3").
Identify relationships between entities that appear in the same context.
Focus on factual relationships like: BORN_IN, WORKED_AT, DISCOVERED, FOUNDED, RECEIVED, MARRIED_TO, DEVELOPED, etc.`],
  ["human", "Extract entities and relationships from this text:\n\n{text}"]
]);

const extractionChain = extractionPrompt.pipe(structuredLlm);
console.log("Extraction chain created");

Extraction chain created


In [6]:
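const allEntities = new Map(); // dedup key (lowercased label) -> entity with a global ID
const allRelations = [];
// Cap on how many chunks are sent to the LLM in this run
const processChunks = 2;
const chunkCount = Math.min(processChunks, chunks.length);

for (let i = 0; i < chunkCount; i++) {
  console.log(`Processing chunk ${i + 1}/${chunkCount}...`);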
  try {
    const result = await extractionChain.invoke({
      text: chunks[i],
    });

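    // Deduplicate entities by lowercased label: the first occurrence wins
    // and receives a new global ID that replaces the chunk-local one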
    for (const entity of result.entities) {
      const key = entity.label.toLowerCase();
      if (!allEntities.has(key)) {
        allEntities.set(key, {
          id: `${entity.type.toLowerCase()}_${allEntities.size + 1}`,
          label: entity.label,
          type: entity.type,
          description: entity.description,
        });
      }
    }

    // Map relations to deduplicated entity IDs. Relations whose endpoints
    // are missing from this chunk's entity list are skipped.
    for (const relation of result.relations) {
      const sourceEntity = result.entities.find(e => e.id === relation.source_id);
      const targetEntity = result.entities.find(e => e.id === relation.target_id);

      if (sourceEntity && targetEntity) {
        const sourceKey = sourceEntity.label.toLowerCase();
        const targetKey = targetEntity.label.toLowerCase();

        allRelations.push({
          // Both keys were registered by the entity loop above, so these
          // lookups always succeed; no fallback to chunk-local IDs is needed.
          source_id: allEntities.get(sourceKey).id,
          target_id: allEntities.get(targetKey).id,
          relation_type: relation.relation_type,
          description: relation.description,
        });
      }
    }

    console.log(`  Extracted ${result.entities.length} entities, ${result.relations.length} relations`);
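  } catch (error) {
    // A thrown value is not guaranteed to be an Error instance
    const message = error instanceof Error ? error.message : String(error);
    console.error(`  Error processing chunk ${i + 1}:`, message);
  }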
}

console.log(`\nTotal unique entities: ${allEntities.size}`);
console.log(`Total relations: ${allRelations.length}`);

Processing chunk 1/2...
  Extracted 9 entities, 10 relations
Processing chunk 2/2...
  Extracted 21 entities, 24 relations

Total unique entities: 28
Total relations: 34
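
Deduplication above keys on `label.toLowerCase()`, so labels that differ only in whitespace or Unicode form remain separate entities. A slightly more forgiving key might look like the following sketch. `normalizeLabel` is a hypothetical helper, and its rules are an assumption about which labels should count as the same.

```typescript
// Hypothetical dedup key: fold Unicode compatibility forms, trim, collapse
// internal whitespace, and lowercase.
function normalizeLabel(label: string): string {
  return label
    .normalize("NFKC")
    .trim()
    .replace(/\s+/g, " ")
    .toLowerCase();
}

const labelVariants = ["Albert Einstein", "  albert   einstein ", "ALBERT EINSTEIN"];
const uniqueKeys = new Set(labelVariants.map(normalizeLabel));
console.log(uniqueKeys);
```

This still treats aliases as distinct: "Germany" and "German Empire" both appear in the relations below, and string normalization alone would not merge them.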


## Inspect Extracted Knowledge Graph

In [7]:
console.log("\n=== ENTITIES ===");
for (const [, entity] of allEntities) {
  console.log(`[${entity.id}] ${entity.label} (${entity.type})`);
  console.log(`  → ${entity.description}`);
}


=== ENTITIES ===
[person_1] Albert Einstein (Person)
  → German-born theoretical physicist (14 March 1879 – 18 April 1955), best known for developing the theory of relativity.
[place_2] Germany (Place)
  → Country of birth (Albert Einstein described as German-born).
[concept_3] theory of relativity (Concept)
  → Theory developed by Albert Einstein; includes special relativity and general relativity.
[concept_4] quantum theory (Concept)
  → Field of physics to which Einstein made important contributions.
[work_5] E = mc2 (mass–energy equivalence formula) (Work)
  → Mass–energy equivalence formula attributed to Einstein; arises from special relativity and is described as 'the world's most famous equation'.
[concept_6] special relativity (Concept)
  → Part of the theory of relativity from which the mass–energy equivalence formula E = mc2 arises.
[event_7] 1921 Nobel Prize in Physics (Event)
  → Award received by Albert Einstein in 1921 for his services to theoretical physics and especial

In [8]:
console.log("\n=== RELATIONS ===");
const entitiesArr = Array.from(allEntities.values());
for (const relation of allRelations) {
  const source = entitiesArr.find(e => e.id === relation.source_id);
  const target = entitiesArr.find(e => e.id === relation.target_id);
  console.log(`${source?.label} --[${relation.relation_type}]--> ${target?.label}`);
}


=== RELATIONS ===
Albert Einstein --[BORN_IN]--> Germany
Albert Einstein --[BORN_ON]--> Albert Einstein
Albert Einstein --[DIED_ON]--> Albert Einstein
Albert Einstein --[DEVELOPED]--> theory of relativity
Albert Einstein --[CONTRIBUTED_TO]--> quantum theory
Albert Einstein --[DEVELOPED]--> E = mc2 (mass–energy equivalence formula)
E = mc2 (mass–energy equivalence formula) --[ARISES_FROM]--> special relativity
Albert Einstein --[RECEIVED]--> 1921 Nobel Prize in Physics
Albert Einstein --[DISCOVERED]--> law of the photoelectric effect
Albert Einstein --[HAS_OCCUPATION]--> theoretical physicist
Albert Einstein --[BORN_IN]--> German Empire
Albert Einstein --[MOVED_TO]--> Switzerland
Albert Einstein --[RENOUNCED_CITIZENSHIP]--> German citizenship
Albert Einstein --[SUBJECT_OF]--> Kingdom of Württemberg
Albert Einstein --[ENROLLED_IN]--> Swiss federal polytechnic school in Zurich
Albert Einstein --[GRADUATED_FROM]--> Swiss federal polytechnic school in Zurich
Swiss federal polytechnic schoo
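
The listing above includes self-referential edges such as `Albert Einstein --[BORN_ON]--> Albert Einstein`, where source and target resolved to the same entity. One possible post-processing step, sketched here with a hypothetical `cleanRelations` helper, drops self-loops and exact duplicate triples. Whether self-loops should be dropped or repaired is a design decision the notebook does not make.

```typescript
interface GraphRelation {
  source_id: string;
  target_id: string;
  relation_type: string;
  description: string;
}

// Hypothetical helper: remove self-loops and repeated
// (source, relation_type, target) triples, keeping the first occurrence.
function cleanRelations(relations: GraphRelation[]): GraphRelation[] {
  const seen = new Set<string>();
  const kept: GraphRelation[] = [];
  for (const r of relations) {
    if (r.source_id === r.target_id) continue; // self-loop
    const tripleKey = JSON.stringify([r.source_id, r.relation_type, r.target_id]);
    if (seen.has(tripleKey)) continue; // duplicate triple
    seen.add(tripleKey);
    kept.push(r);
  }
  return kept;
}

// Illustrative data only.
const rawRelations: GraphRelation[] = [
  { source_id: "person_1", target_id: "place_2", relation_type: "BORN_IN", description: "first mention" },
  { source_id: "person_1", target_id: "person_1", relation_type: "BORN_ON", description: "self-loop" },
  { source_id: "person_1", target_id: "place_2", relation_type: "BORN_IN", description: "repeat from another chunk" },
  { source_id: "person_1", target_id: "concept_3", relation_type: "DEVELOPED", description: "kept" },
];

const cleaned = cleanRelations(rawRelations);
console.log(cleaned.map((r) => `${r.source_id} -[${r.relation_type}]-> ${r.target_id}`));
```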

## Generate Embeddings

In [10]:
// Generate deterministic mock embeddings using the Web Crypto API.
// These are placeholders derived from a hash and carry no semantic meaning.
async function getMockEmbedding(text: string): Promise<number[]> {
  const encoder = new TextEncoder();
  const data = encoder.encode(text);
  const hashBuffer = await crypto.subtle.digest("SHA-256", data);
  const hashArray = new Uint8Array(hashBuffer);

  // Tile the 32 hash bytes out to 384 dimensions, mapping each byte
  // from [0, 255] to [-1, 1). The values repeat every 32 positions.
  const embedding: number[] = [];
  for (let i = 0; i < 384; i++) {
    embedding.push((hashArray[i % hashArray.length] - 128) / 128);
  }
  return embedding;
}

const entityEmbeddings = new Map();
for (const [, entity] of allEntities) {
  entityEmbeddings.set(entity.id, {
    ...entity,
    embedding: await getMockEmbedding(entity.label + entity.description),
  });
}

console.log(`Generated embeddings for ${entityEmbeddings.size} entities`);

Generated embeddings for 28 entities
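
Once real embeddings replace the mock ones, entities can be ranked against a query vector by cosine similarity. The sketch below shows the computation; `cosineSimilarity` is a hypothetical helper and the vectors are toy values. With the SHA-256 mock above, scores would only reflect hash bytes, not meaning.

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|), in the range [-1, 1].
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Convention chosen for this sketch: a zero vector is similar to nothing.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const parallel = cosineSimilarity([1, 2, 3], [2, 4, 6]);
const orthogonal = cosineSimilarity([1, 0], [0, 1]);
const opposite = cosineSimilarity([1, 0], [-1, 0]);
console.log(parallel, orthogonal, opposite);
```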


## Export to DuckDB-Wasm Format

In [11]:
const duckdbFormat = {
  metadata: {
    source: articleTitle,
    extracted_at: new Date().toISOString(),
    entity_count: allEntities.size,
    relation_count: allRelations.length,
  },
  entities: Array.from(entityEmbeddings.values()),
  relations: allRelations,
};

console.log("\n=== EXPORT ===");
console.log(JSON.stringify(duckdbFormat, null, 2).substring(0, 500) + "...");


=== EXPORT ===
{
  "metadata": {
    "source": "Albert Einstein",
    "extracted_at": "2025-12-18T00:41:41.666Z",
    "entity_count": 28,
    "relation_count": 34
  },
  "entities": [
    {
      "id": "person_1",
      "label": "Albert Einstein",
      "type": "Person",
      "description": "German-born theoretical physicist (14 March 1879 – 18 April 1955), best known for developing the theory of relativity.",
      "embedding": [
        0.734375,
        0.9765625,
        -0.03125,
        -0.484375,
     ...
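
The export above is a single nested JSON object. Tabular engines generally load row-oriented data more easily, so one option is to emit entities and relations as newline-delimited JSON, one record per line, which is a layout DuckDB's JSON reader supports. The sketch below covers only the serialization. `toNdjson` and the sample rows are hypothetical, and registering the text with DuckDB-Wasm is left out.

```typescript
// Hypothetical helper: serialize rows as newline-delimited JSON (one object per line).
function toNdjson(rows: Record<string, unknown>[]): string {
  return rows.map((row) => JSON.stringify(row)).join("\n");
}

// Illustrative rows with shortened embeddings; real rows would carry 384 values.
const entityRows = [
  { id: "person_1", label: "Albert Einstein", type: "Person", embedding: [0.5, -0.25] },
  { id: "place_2", label: "Germany", type: "Place", embedding: [0.125, 0.75] },
];
const relationRows = [
  { source_id: "person_1", target_id: "place_2", relation_type: "BORN_IN" },
];

const entitiesNdjson = toNdjson(entityRows);
const relationsNdjson = toNdjson(relationRows);
console.log(entitiesNdjson);
console.log(relationsNdjson);
```

Keeping entities and relations in separate files maps naturally onto two tables, with `source_id` and `target_id` acting as foreign keys into the entity table.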
