# Metadata Extraction and Augmentation with Marvin

This notebook walks through using [`Marvin`](https://github.com/PrefectHQ/marvin) to extract and augment metadata from text. Marvin uses an LLM to identify and extract the metadata. Metadata can be anything from additional and enhanced questions and answers to business object identification and elaboration. This notebook demonstrates pulling out and elaborating on sports supplement information from a CSV document.

Note: You will need to supply a valid OpenAI API key below to run this notebook.

## Setup

In [None]:
%pip install llama-index-llms-openai
%pip install llama-index-extractors-marvin

In [None]:
# !pip install marvin

In [None]:
from llama_index.core import SimpleDirectoryReader
from llama_index.llms.openai import OpenAI
from llama_index.core.node_parser import TokenTextSplitter
from llama_index.extractors.marvin import MarvinMetadataExtractor

In [None]:
import nest_asyncio

nest_asyncio.apply()

In [None]:
import os

os.environ["OPENAI_API_KEY"] = "sk-..."
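
The cell above writes the key directly into the notebook. As a hedged alternative, the sketch below reads the key from the environment and falls back to an interactive prompt. The helper name `resolve_api_key` and its parameters are hypothetical and are not part of Marvin, LlamaIndex, or the OpenAI client; only the `OPENAI_API_KEY` variable name comes from this notebook.

```python
import os
from getpass import getpass


def resolve_api_key(env=os.environ, prompt=getpass):
    """Return an OpenAI API key from `env`, prompting only if it is missing.

    `resolve_api_key` is a hypothetical helper written for this example.
    """
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        # getpass hides the typed value so it does not end up in cell output
        key = prompt("OpenAI API key: ").strip()
    if not key:
        raise ValueError("No OpenAI API key provided")
    return key


# Possible usage in place of the hard-coded assignment:
# os.environ["OPENAI_API_KEY"] = resolve_api_key()
```

Passing `env` and `prompt` as parameters keeps the helper easy to exercise without touching the real environment or blocking on input.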

In [None]:
documents = SimpleDirectoryReader("data").load_data()

# limit the document text to its first 10,000 characters to keep the run short
documents[0].text = documents[0].text[:10000]

In [None]:
import marvin

from pydantic import BaseModel, Field

marvin.settings.openai.api_key = os.environ["OPENAI_API_KEY"]
marvin.settings.openai.chat.completions.model = "gpt-4o"


class SportsSupplement(BaseModel):
    name: str = Field(..., description="The name of the sports supplement")
    description: str = Field(
        ..., description="A description of the sports supplement"
    )
    pros_cons: str = Field(
        ..., description="The pros and cons of the sports supplement"
    )

In [None]:
# construct a text splitter to split the text into chunks for processing
# extraction takes a while; a larger chunk_size produces fewer chunks and so reduces processing time
# file size is a factor too, of course
node_parser = TokenTextSplitter(
    separator=" ", chunk_size=512, chunk_overlap=128
)

# create metadata extractor
metadata_extractor = MarvinMetadataExtractor(
    marvin_model=SportsSupplement
)  # let's extract custom entities for each node.

# run the splitter and the extractor together in an ingestion pipeline to get nodes
from llama_index.core.ingestion import IngestionPipeline

pipeline = IngestionPipeline(transformations=[node_parser, metadata_extractor])

nodes = pipeline.run(documents=documents, show_progress=True)

Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 41.49it/s]
Extracting marvin metadata: 100%|██████████| 9/9 [00:22<00:00,  2.46s/it]
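
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window sketch. It is not `TokenTextSplitter`'s actual implementation (which works on tokenizer output and separators); `split_with_overlap` is a hypothetical function that treats any list items as tokens. It only illustrates why larger chunks mean fewer nodes, and therefore fewer extraction calls.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split `tokens` into windows of `chunk_size` that share `chunk_overlap` items.

    Hypothetical illustration only; not the LlamaIndex splitter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Stand-in for a tokenized document of 1,000 tokens (an assumption for the demo)
tokens = list(range(1000))
print(len(split_with_overlap(tokens, chunk_size=512, chunk_overlap=128)))
print(len(split_with_overlap(tokens, chunk_size=1024, chunk_overlap=128)))
```

Each chunk becomes one node, and the extractor processes every node, so the chunk count is a rough proxy for how long the extraction step runs.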


In [None]:
from pprint import pprint

# print the metadata of the first five nodes
for node in nodes[:5]:
    pprint(node.metadata)

{'creation_date': '2024-08-07',
 'file_name': 'Sports Supplements.csv',
 'file_path': '/data001/home/dongwoo.jeong/llama_index/docs/docs/examples/metadata_extraction/data/Sports '
              'Supplements.csv',
 'file_size': 62403,
 'file_type': 'text/csv',
 'last_modified_date': '2024-08-07',
 'marvin_metadata': {'description': 'L-arginine alpha-ketoglutarate is a '
                                    'supplement often used to improve peak '
                                    'power output and strength–power during '
                                    'weight training. A 2006 study by Campbell '
                                    'et al. found that AAKG supplementation '
                                    'improved maximum effort 1-repetition '
                                    'bench press and Wingate peak power '
                                    'performance.',
                     'name': 'AAKG',
                     'pros_cons': 'Pros: Improves peak power output and '
 