# Improving Answer Accuracy for Paper Translation
https://gpt-index.readthedocs.io/en/latest/examples/metadata_extraction/MetadataExtraction_LLMSurvey.html

### Defining the Metadata Extractors

In [5]:
from llama_index import ServiceContext
from llama_index.schema import MetadataMode

In [6]:
from llama_index.node_parser import SimpleNodeParser
from llama_index.node_parser.extractors import (
    MetadataExtractor, 
    SummaryExtractor, 
    QuestionsAnsweredExtractor,
    TitleExtractor,
    KeywordExtractor,
    EntityExtractor
)

from llama_index.text_splitter import TokenTextSplitter

In [7]:
text_splitter = TokenTextSplitter(separator=' ', chunk_size=256, chunk_overlap=128)
node_parser = SimpleNodeParser(text_splitter=text_splitter)
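
To make the `chunk_size=256, chunk_overlap=128` setting concrete, here is a minimal sketch of overlapping windows. It is not `TokenTextSplitter`'s implementation: it assumes the text is already a list of tokens and ignores the tokenizer and separator handling the library applies. `sliding_window_chunks` is a hypothetical helper introduced only for illustration.

```python
def sliding_window_chunks(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that overlap by chunk_overlap.

    Illustrative only; this is an assumption about the general idea,
    not the splitter's real algorithm.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


# Dummy "tokens" standing in for a tokenized document.
words = [f"w{i}" for i in range(600)]
chunks = sliding_window_chunks(words, chunk_size=256, chunk_overlap=128)
print([len(c) for c in chunks])  # [256, 256, 256, 216]
```

With the overlap set to half the chunk size, each window advances by 128 tokens, so most tokens appear in two consecutive chunks. That roughly doubles the number of nodes, and therefore the number of LLM calls the extractors make, compared with non-overlapping chunks.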

### Loading the Data and Running the Extractor

In [8]:
from src.XMLUtils import DocumentReader

In [9]:
base_path = "/home/paper_translator/data"
document_name = (
    "Learning_Transferable_Visual_Models_From_Natural_Language_Supervision"
)
document_path = f"{base_path}/documents/{document_name}"
xml_path = f"{document_path}/{document_name}.tei.xml"

In [10]:
reader = DocumentReader()
docs = reader.load_data(xml_path=xml_path)



In [10]:
print(docs[0].get_content())

Pre-training methods which learn directly from raw text have revolutionized NLP over the last few years (Dai & Le, 2015;Peters et al., 2018;Howard & Ruder, 2018;Radford et al., 2018;Devlin et al., 2018;Raffel et al., 2019).


In [11]:
orig_nodes = node_parser.get_nodes_from_documents(docs)

In [12]:
nodes = orig_nodes[20:28]

In [13]:
print(nodes[3].get_content(metadata_mode=MetadataMode.ALL))

section No.: 2.2.
section title: Creating a Sufficiently Large Dataset
pdf_title: Learning Transferable Visual Models From Natural Language Supervision
pdf_idno: arXiv:2103.00020v1[cs.CV]
pdf_lang: en
pdf_published: 26 Feb 2021
pdf_authors: ['Alec Radford', 'Jong Wook Kim', 'Chris Hallacy', 'Aditya Ramesh', 'Gabriel Goh', 'Sandhini Agarwal', 'Girish Sastry', 'Amanda Askell', 'Pamela Mishkin', 'Jack Clark', 'Gretchen Krueger', 'Ilya Sutskever']

2017), and YFCC100M (Thomee et al., 2016). While MS-COCO and Visual Genome are high quality crowd-labeled datasets, they are small by modern standards with approximately 100,000 training photos each. By comparison, other computer vision systems are trained on up to 3.5 billion Instagram photos (Mahajan et al., 2018). YFCC100M, at 100 million photos, is a possible alternative, but the
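
The printed node consists of one `key: value` line per metadata entry, a blank line, and then the chunk text. A rough sketch of that layout follows, assuming only what is visible in the output above. `render_with_metadata` is a hypothetical helper, not a llama_index function.

```python
def render_with_metadata(metadata, text):
    """Prefix text with 'key: value' lines, mimicking the printed node layout.

    Hypothetical helper for illustration; the format is inferred from the
    notebook output, not taken from the library source.
    """
    metadata_str = "\n".join(f"{key}: {value}" for key, value in metadata.items())
    if not metadata_str:
        return text  # nothing to prepend
    return f"{metadata_str}\n\n{text}"


rendered = render_with_metadata(
    {
        "section No.": "2.2.",
        "section title": "Creating a Sufficiently Large Dataset",
        "pdf_lang": "en",
    },
    "2017), and YFCC100M (Thomee et al., 2016).",
)
print(rendered)
```

Because the metadata is part of the string the model sees, every extra field takes up space in a 256-token chunk's prompt, which is worth keeping in mind when several extractors are stacked.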


### Running the Metadata Extractors

In [11]:
from src.translator.llama_cpp import create_llama_cpp_model

In [12]:
model_path = "/home/paper_translator/data/models/ELYZA-japanese-Llama-2-7b-fast-instruct-q4_K_M.gguf"
llm = create_llama_cpp_model(package_name="llama_index", model_path=model_path)

ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1660 Ti, compute capability 7.5
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /home/paper_translator/data/models/ELYZA-japanese-Llama-2-7b-fast-instruct-q4_K_M.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q4_K     [  4096, 45043,     1,     1 ]
llama_model_loader: - tensor    1:              blk.0.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    2:              blk.0.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    3:              blk.0.attn_v.weight q6_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    4:         blk.0.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    6:            blk.0.f

In [14]:
metadata_extractor_1 = MetadataExtractor(
    extractors=[
        TitleExtractor(llm=llm),
        QuestionsAnsweredExtractor(questions=3, llm=llm),
        SummaryExtractor(llm=llm, summaries=["prev", "self"]),
        KeywordExtractor(llm=llm),
        EntityExtractor(prediction_threshold=0.5)
    ],
    in_place=False,
)
node_parser_1 = SimpleNodeParser(text_splitter=text_splitter, metadata_extractor=metadata_extractor_1)
nodes_1 = node_parser_1.get_nodes_from_documents(docs)
print(nodes_1[3].get_content(metadata_mode=MetadataMode.ALL))
