In [1]:
from llama_index.core import Document


1. You can write your own custom loader; you only need to define the format that the extraction step produces.
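
A minimal sketch of the "extract" half of such a loader, using only the standard library. The function name `extract_records`, the `position` metadata key, and the sample data are hypothetical; the `title` and `content` field names mirror the JSON file loaded later in this notebook. Each record it returns has the shape you would pass on as `Document(text=..., metadata=...)`.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def extract_records(path: Path) -> list[dict[str, Any]]:
    """Read a JSON file holding a list of articles and return one record per article."""
    articles = json.loads(path.read_text(encoding="utf-8"))
    records = []
    for position, article in enumerate(articles):
        records.append(
            {
                # Assumed schema: every article has "title" and "content" keys.
                "text": f"{article['title']}\n{article['content']}",
                # Hypothetical metadata layout; adjust to what you need downstream.
                "metadata": {"file_name": path.name, "position": position},
            }
        )
    return records


# Demo on a throwaway file with made-up sample data.
sample = [
    {"title": "First", "content": "Hello"},
    {"title": "Second", "content": "World"},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.json"
    sample_path.write_text(json.dumps(sample), encoding="utf-8")
    records = extract_records(sample_path)

print(records[0]["text"])
print(records[1]["metadata"])
```

Keeping the extraction step free of any LlamaIndex import makes it easy to check on its own before wrapping the records in `Document` objects.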

In [2]:
example_doc = Document.example()
print(example_doc)

Doc ID: 8a919def-6d6c-413a-a7c2-12b4ba6eaee3
Text: Context LLMs are a phenomenal piece of technology for knowledge
generation and reasoning. They are pre-trained on large amounts of
publicly available data. How do we best augment LLMs with our own
private data? We need a comprehensive toolkit to help perform this
data augmentation for LLMs.  Proposed Solution That's where LlamaIndex
comes in. Ll...


In [3]:
content = example_doc.get_content()
print(content)

Context
LLMs are a phenomenal piece of technology for knowledge generation and reasoning.
They are pre-trained on large amounts of publicly available data.
How do we best augment LLMs with our own private data?
We need a comprehensive toolkit to help perform this data augmentation for LLMs.

Proposed Solution
That's where LlamaIndex comes in. LlamaIndex is a "data framework" to help
you build LLM  apps. It provides the following tools:

Offers data connectors to ingest your existing data sources and data formats
(APIs, PDFs, docs, SQL, etc.)
Provides ways to structure your data (indices, graphs) so that this data can be
easily used with LLMs.
Provides an advanced retrieval/query interface over your data:
Feed in any LLM input prompt, get back retrieved context and knowledge-augmented output.
Allows easy integrations with your outer application framework
(e.g. with LangChain, Flask, Docker, ChatGPT, anything else).
LlamaIndex provides tools for both beginner users and advanced users.
Ou

In [4]:
type(example_doc)

llama_index.core.schema.Document

In [5]:
type(content)

str

In [8]:
example_doc

Document(id_='8a919def-6d6c-413a-a7c2-12b4ba6eaee3', embedding=None, metadata={'filename': 'README.md', 'category': 'codebase'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text_resource=MediaResource(embeddings=None, data=None, text='\nContext\nLLMs are a phenomenal piece of technology for knowledge generation and reasoning.\nThey are pre-trained on large amounts of publicly available data.\nHow do we best augment LLMs with our own private data?\nWe need a comprehensive toolkit to help perform this data augmentation for LLMs.\n\nProposed Solution\nThat\'s where LlamaIndex comes in. LlamaIndex is a "data framework" to help\nyou build LLM  apps. It provides the following tools:\n\nOffers data connectors to ingest your existing data sources and data formats\n(APIs, PDFs, docs, SQL, etc.)\nProvides ways to structure your data (indices, graphs) so that this data can be\neasily used with LLMs.

## Worked example:
```
import json
from llama_index.core import Document
from llama_index.core.node_parser import JSONNodeParser
```
### 1. Your raw data
json_data = { ... } # your JSON content

### 2. Wrap it in a Document and attach metadata
```
doc = Document(
    text=json.dumps(json_data, ensure_ascii=False),
    metadata={
        "file_name": "cathay_social_dashboard.json",
        "category": "social_media_analysis",
        "author": "yu yuna",  # taken from your report's author
        "data_year": "2022"  # taken from your data's date range
    }
)
```
### 3. Feed it to the parser
```
parser = JSONNodeParser()
nodes = parser.get_nodes_from_documents([doc])
```
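
To build some intuition for what a JSON-aware parser has to do, here is a rough standard-library sketch that flattens nested JSON into one line per leaf value, prefixed by its key path. This is an illustration only, not `JSONNodeParser`'s actual implementation; the function name `flatten_json` and the sample data are hypothetical.

```python
from typing import Any, Iterator


def flatten_json(value: Any, path: tuple[str, ...] = ()) -> Iterator[str]:
    """Yield one 'key path + leaf value' line per leaf in nested JSON data."""
    if isinstance(value, dict):
        # Descend into each key, extending the path.
        for key, child in value.items():
            yield from flatten_json(child, path + (str(key),))
    elif isinstance(value, list):
        # List items share the path of the list itself.
        for child in value:
            yield from flatten_json(child, path)
    else:
        # Leaf: emit the accumulated path followed by the value.
        yield " ".join(path + (str(value),))


sample = {"report": {"year": 2022, "tags": ["social", "dashboard"]}}
lines = list(flatten_json(sample))
print("\n".join(lines))
```

Flattened lines like these keep each value next to the keys that give it meaning, which is the property you want to preserve when structured data is turned into text chunks.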

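To make the resulting nodes more useful, consider adding the following controls to the metadata settings:

Exclude unnecessary metadata from the text:
If you don't want fields such as author to appear in the text block that the LLM reads (they waste tokens), you can set: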
```
doc.excluded_llm_metadata_keys = ["author", "file_name"]
```
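
The effect of that setting can be simulated in plain Python. The sketch below reuses the `metadata_template='{key}: {value}'` and `metadata_separator='\n'` values shown in the Document repr above; the function name `render_for_llm`, the placeholder text, and the blank line joining metadata to text are assumptions made for illustration, not the library's code. On a real document, `doc.get_content(metadata_mode=MetadataMode.LLM)` (with `MetadataMode` imported from `llama_index.core.schema`) should let you inspect the actual result.

```python
def render_for_llm(
    text: str,
    metadata: dict[str, str],
    excluded_keys: list[str],
    template: str = "{key}: {value}",
    separator: str = "\n",
) -> str:
    """Simulate the text an LLM would see once excluded metadata keys are dropped."""
    visible = [
        template.format(key=key, value=value)
        for key, value in metadata.items()
        if key not in excluded_keys
    ]
    metadata_str = separator.join(visible)
    # Assumption: metadata and text are joined by one blank line.
    return f"{metadata_str}\n\n{text}" if metadata_str else text


metadata = {
    "file_name": "cathay_social_dashboard.json",
    "category": "social_media_analysis",
    "author": "yu yuna",
    "data_year": "2022",
}
rendered = render_for_llm(
    '{"demo": true}', metadata, excluded_keys=["author", "file_name"]
)
print(rendered)
```

Only `category` and `data_year` survive in the rendered text, so the excluded fields no longer cost tokens in the prompt.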

In [9]:
! pip3 install llama-index-embeddings-huggingface



In [11]:
! pip3 install ipywidgets

Collecting ipywidgets
  Downloading ipywidgets-8.1.8-py3-none-any.whl.metadata (2.4 kB)
Collecting widgetsnbextension~=4.0.14 (from ipywidgets)
  Downloading widgetsnbextension-4.0.15-py3-none-any.whl.metadata (1.6 kB)
Collecting jupyterlab_widgets~=3.0.15 (from ipywidgets)
  Downloading jupyterlab_widgets-3.0.16-py3-none-any.whl.metadata (20 kB)
Downloading ipywidgets-8.1.8-py3-none-any.whl (139 kB)
Downloading jupyterlab_widgets-3.0.16-py3-none-any.whl (914 kB)
Downloading widgetsnbextension-4.0.15-py3-none-any.whl (2.2 MB)
Installing collected packages: widgetsnbextension, jupyterlab_widgets, ipywidgets
Successf

In [10]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-m3")


  from .autonotebook import tqdm as notebook_tqdm
The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.
0it [00:00, ?it/s]


In [None]:
emb_test = embed_model.get_text_embedding("這是一堂關於LlamaIndex的教學課程")
emb_test
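
`get_text_embedding` returns a plain list of floats, so two embeddings can be compared with cosine similarity. A minimal standard-library sketch follows; the three toy vectors are made up and stand in for real model output, which would be much longer.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, not real embeddings.
vec_a = [1.0, 0.0, 1.0]
vec_b = [1.0, 0.0, 1.0]
vec_c = [0.0, 1.0, 0.0]

same = cosine_similarity(vec_a, vec_b)
orthogonal = cosine_similarity(vec_a, vec_c)
print(same, orthogonal)
```

Identical directions score close to 1 and unrelated (orthogonal) directions score 0, which is the basis on which a vector index ranks nodes against a query.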

In [14]:
from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader(input_dir="./data/json").load_data()

In [15]:
documents

[Document(id_='46336f3f-8870-4069-b3d1-f0fa810ae5b8', embedding=None, metadata={'file_path': '/Users/dingtseng/Documents/python_rag_class/demo_1/llamaIndex-tutorial/data/json/techorange-2024-10-16.json', 'file_name': 'techorange-2024-10-16.json', 'file_type': 'application/json', 'file_size': 40751, 'creation_date': '2025-12-18', 'last_modified_date': '2025-01-03'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text_resource=MediaResource(embeddings=None, data=None, text='[\n    {\n        "title": "【台灣陷入尷尬處境】半導體耗電量狂飆又得淨零轉型，核電會是最佳解？",\n        "content": "台灣企業滿足了全球高達 68% 的晶片製造供應，隨之而來的是大量能源需求——根據綠色和平組織的預測，2030 年台灣半導體製造產業之耗電量，將會相當於 2021 年紐西蘭全島的耗電量的兩倍，其中將有 82% 的需求來自台積電。\\n台灣為何陷入能源危機、與淨零目標遙遙無期？\\nAI 時代用

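Method A:
```
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import JSONNodeParser

# 1. Parse the documents into nodes
parser = JSONNodeParser()
nodes = parser.get_nodes_from_documents(documents)

# 2. Build the index directly from the nodes (note: no .from_documents here)
index = VectorStoreIndex(nodes, embed_model=embed_model)
```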

Method B:
```
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import JSONNodeParser

parser = JSONNodeParser()

# Specify the transformations when building the index
index = VectorStoreIndex.from_documents(
    documents,
    embed_model=embed_model,
    transformations=[parser]  # use your JSON parser for splitting
)
```

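Summary: which one should I choose?
- If you want a transparent pipeline: choose Method A. You run parser.get_nodes_from_documents(documents) yourself, so you can also print(len(nodes)) to see how many chunks the files were split into before indexing.

- If you want concise code: choose Method B. This is the more recommended approach in newer versions of LlamaIndex; it combines "splitting" and "indexing" into a single pipeline.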