# Phase 2-3: Building a Small Knowledge Graph [LLM]

Use an LLM to extract triples from text and build a knowledge graph from them.

## Goals

- Try triple extraction with an LLM
- Summarize how it differs from NLP-based triple extraction

## References

- [LangChain Docs](https://docs.langchain.com/oss/python/langchain/overview)

## Implementation

In [48]:
import os
from dotenv import load_dotenv

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

from typing import List, cast
from pydantic import BaseModel, Field

import networkx as nx
import matplotlib.pyplot as plt

import json
from pathlib import Path

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
load_dotenv()

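# Do not wrap os.getenv in str(): str(None) is the string "None",
# which would make the check below pass even when the variable is unset.
OPENAI_MODEL = os.getenv("OPENAI_MODEL")
if OPENAI_MODEL is None:
    raise ValueError("The OPENAI_MODEL environment variable is not set.")
print(f"OpenAI Model : {OPENAI_MODEL}")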

OpenAI Model : gpt-5-nano-2025-08-07


### Sample text

In [None]:
texts = [
    "Logistic Regression is a model widely used for binary classification.",
    "Random Forest is an ensemble model of many decision trees.",
    "Transformer is a model based on the attention mechanism.",
    "GPT is a Transformer-based model that uses only the decoder block.",
]

### Triple extraction

This step uses `with_structured_output`, so the output schema is defined as Pydantic models.

In [11]:
class Triple(BaseModel):
    subject: str = Field(description="The subject (noun phrase)")
    verb: str = Field(description="The predicate verb in lemma form")
    object: str = Field(description="The object (noun phrase)")

class TripleList(BaseModel):
    triples: List[Triple]

In [39]:
llm = ChatOpenAI(
    model=OPENAI_MODEL,
    temperature=0,
)

structured_llm = llm.with_structured_output(
    schema=TripleList,
    method="json_schema",
)

prompt = ChatPromptTemplate(
    [
        ("system", 
        """
You extract factual SVO triples and return them ONLY in structured JSON.

Normalization rules:
1. Subjects, verbs, and objects must be normalized to lemma form.
2. Multi-word expressions must be converted to snake_case.
   - Example: "binary classification" → "binary_classification"
   - Example: "Logistic Regression" → "logistic_regression"
3. Predicates must be verbs in lemma form only.
4. Articles and determiners ("a", "the") must be removed.
5. Do NOT include adjectives unless they are semantically necessary.
6. Output must strictly follow the provided JSON schema.
7. Do NOT output any text outside JSON.

Extraction rules:
- Extract simple factual SVO relations (subject–verb–object).
- If the verb includes a preposition (“used for”), convert it into a single snake_case predicate (“use_for”).
- Avoid inventing facts. Only extract what is explicitly stated.

Example input:
"Logistic Regression is a model widely used for binary classification."

Expected normalized triples:
[
  {{"subject": "logistic_regression", "verb": "be", "object": "model"}},
  {{"subject": "logistic_regression", "verb": "use_for", "object": "binary_classification"}}
]
    """
        ),
        ("human", "{text}"),
    ]
)

chain = prompt | structured_llm
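
The normalization rules exist only in the prompt, so nothing guarantees the model follows them. Below is a minimal sketch of a post-hoc check, assuming the snake_case convention described in the prompt. `SNAKE_CASE`, `ARTICLES`, `is_normalized`, and `find_violations` are hypothetical names, not part of LangChain, and treating "an" as an article to reject is an assumption beyond the prompt's "a" and "the". The check works on plain dicts with the same keys as `Triple` and cannot verify lemma form.

```python
import re

# Hypothetical pattern for the snake_case convention described in the prompt:
# lowercase alphanumeric words joined by single underscores.
SNAKE_CASE = re.compile(r"[a-z0-9]+(?:_[a-z0-9]+)*")

# Assumption: "an" is rejected along with the "a" and "the" named in the prompt.
ARTICLES = {"a", "an", "the"}


def is_normalized(term: str) -> bool:
    """Return True if `term` is snake_case and does not start with an article."""
    if SNAKE_CASE.fullmatch(term) is None:
        return False
    return term.split("_")[0] not in ARTICLES


def find_violations(triples: list[dict[str, str]]) -> list[tuple[int, str, str]]:
    """List (index, field, value) for every field that breaks the rules."""
    return [
        (i, field, triple[field])
        for i, triple in enumerate(triples)
        for field in ("subject", "verb", "object")
        if not is_normalized(triple[field])
    ]


# Illustrative input: the second triple deliberately breaks two rules.
sample = [
    {"subject": "logistic_regression", "verb": "use_for", "object": "binary_classification"},
    {"subject": "Random Forest", "verb": "be", "object": "the_ensemble_model"},
]
violations = find_violations(sample)
print(violations)
```

A check like this could run on each extraction result before the triples are added to the graph, so that differently spelled node names do not silently become separate nodes.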

### Extracting triples from one sample text

In [None]:
res = cast(TripleList, chain.invoke({"text": texts[0]}))

In [41]:
for t in res.triples:
    print(f"S:{t.subject}, V:{t.verb}, O:{t.object}")

S:logistic_regression, V:be, O:model
S:logistic_regression, V:use_for, O:binary_classification


In [42]:
res

TripleList(triples=[Triple(subject='logistic_regression', verb='be', object='model'), Triple(subject='logistic_regression', verb='use_for', object='binary_classification')])

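### Extracting triples from all the sample texts

Chain the extraction chain and a converter function with the pipe operator (`|`) to build a pipeline.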

In [45]:
def triple_converter(res: TripleList):
    return [
        {"subject": t.subject, "verb": t.verb, "object": t.object}
        for t in res.triples
    ]

pipeline = chain | triple_converter

triples = []
for text in texts:
    print(text)
    triples.extend(pipeline.invoke({"text": text}))


Logistic Regression is a model widely used for binary classification.
Random Forest is an ensemble model of many decision trees.
Transformer is a model based on the attention mechanism.
GPT is a Transformer-based model that uses only the decoder block.


In [46]:
triples

[{'subject': 'logistic_regression', 'verb': 'be', 'object': 'model'},
 {'subject': 'logistic_regression',
  'verb': 'use_for',
  'object': 'binary_classification'},
 {'subject': 'random_forest', 'verb': 'be', 'object': 'ensemble_model'},
 {'subject': 'transformer', 'verb': 'be', 'object': 'model'},
 {'subject': 'model', 'verb': 'base_on', 'object': 'attention_mechanism'},
 {'subject': 'gpt', 'verb': 'be', 'object': 'transformer_based_model'},
 {'subject': 'gpt', 'verb': 'use', 'object': 'decoder_block'}]
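
The list above is already an edge list: each triple can be read as a directed edge from subject to object, labeled with the verb. Below is a minimal, dependency-free sketch of that reading, assuming the triples printed above are copied verbatim. `build_adjacency` is a hypothetical helper, and a graph library such as networkx (imported at the top of the notebook) could hold the same edges.

```python
from collections import defaultdict

# Triples copied from the output above.
triples = [
    {"subject": "logistic_regression", "verb": "be", "object": "model"},
    {"subject": "logistic_regression", "verb": "use_for", "object": "binary_classification"},
    {"subject": "random_forest", "verb": "be", "object": "ensemble_model"},
    {"subject": "transformer", "verb": "be", "object": "model"},
    {"subject": "model", "verb": "base_on", "object": "attention_mechanism"},
    {"subject": "gpt", "verb": "be", "object": "transformer_based_model"},
    {"subject": "gpt", "verb": "use", "object": "decoder_block"},
]


def build_adjacency(triples):
    """Map each subject to its outgoing (verb, object) edges."""
    adjacency = defaultdict(list)
    for t in triples:
        adjacency[t["subject"]].append((t["verb"], t["object"]))
    return dict(adjacency)


graph = build_adjacency(triples)
# Every distinct subject or object becomes one node.
nodes = {t["subject"] for t in triples} | {t["object"] for t in triples}

print(len(nodes), "nodes,", len(triples), "edges")
for subject, edges in graph.items():
    for verb, obj in edges:
        print(f"{subject} -[{verb}]-> {obj}")
```

In this output `model` appears both as an object and as a subject, because the third sentence produced `model base_on attention_mechanism` instead of attaching the relation to `transformer`. In the resulting graph that edge hangs off the shared `model` node, which is worth keeping in mind when comparing against other extraction approaches.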

In [None]:
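output_path = Path("../data/phase_2_outputs/triples_llm.json")
# Create the output directory first so open() does not fail if it is missing.
output_path.parent.mkdir(parents=True, exist_ok=True)
with open(output_path, "w", encoding="utf-8") as f:
    json.dump(triples, f, ensure_ascii=False, indent=2)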
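
To confirm that a file saved this way can be consumed by a later step, here is a minimal round-trip sketch. It writes to a temporary directory rather than the notebook's `../data/phase_2_outputs` path, and the two-triple `sample_triples` list is illustrative data taken from the output above.

```python
import json
import tempfile
from pathlib import Path

# Illustrative subset of the extracted triples.
sample_triples = [
    {"subject": "gpt", "verb": "be", "object": "transformer_based_model"},
    {"subject": "gpt", "verb": "use", "object": "decoder_block"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "triples_llm.json"
    # Same json.dump arguments as the save cell above.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample_triples, f, ensure_ascii=False, indent=2)
    # Read the file back and compare with what was written.
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

print(loaded == sample_triples)
```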