![NVIDIA](images/nvidia.png)

# Document Tagging

In this notebook you'll extend your structured data generation skills by learning how to extract data from long-form text and tag it according to a schema you specify.

---

## Objectives

By the time you complete this notebook you will:

- Be able to construct Pydantic classes that represent collections of other Pydantic classes.
- Be able to perform extraction and tagging against long-form text.

---

## Imports

In [None]:
from typing import List
from pprint import pprint

from langchain_nvidia_ai_endpoints import ChatNVIDIA
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.pydantic_v1 import BaseModel, Field

---

## Create a Model Instance

In [None]:
base_url = 'http://llama:8000/v1'
model = 'meta/llama-3.1-8b-instruct'
llm = ChatNVIDIA(base_url=base_url, model=model, temperature=0)

---

## Document Tagging

Given what you already know about creating Pydantic specifications for structured data generation, you will find it easy to extend this skill to extracting and tagging data from long-form text.

To learn the technique, let's assume that we want to extract the name of any piece of fruit mentioned in a piece of text. As before, we begin by defining a schema for our data and instantiating a parser. The parser supplies format instructions for the prompt and then parses the LLM's response into structured data.

In [None]:
class Fruit(BaseModel):
    """The name of a piece of fruit."""

    name: str = Field(description="The name of the piece of fruit")

In [None]:
parser = JsonOutputParser(pydantic_object=Fruit)

In [None]:
format_instructions = parser.get_format_instructions()

In [None]:
template = ChatPromptTemplate.from_messages([
    ("system", "You are an AI that generates JSON and only JSON according to the instructions provided to you."),
    ("human", (
        "Generate JSON about the user input according to the provided format instructions.\n" +
        "Input: {input}\n" +
        "Format instructions {format_instructions}")
    )
])

In [None]:
template_with_format_instructions = template.partial(format_instructions=format_instructions)

In [None]:
chain = template_with_format_instructions | llm | parser

Now we do something slightly different from the previous notebook. Instead of providing a single entity to be transformed into structured data, we provide free-form text.

Given the simplicity of the following statement, it should come as no surprise that our chain can identify and capture the single piece of fruit mentioned.

In [None]:
chain.invoke({"input": "An apple fell from the tree."})
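`JsonOutputParser` returns a plain Python `dict` rather than a `Fruit` instance. If you want a typed, validated object, one option is to pass that dict to the Pydantic class yourself. The sketch below assumes a hand-written `result` dict standing in for the chain's output, and imports from `pydantic` directly (rather than `langchain_core.pydantic_v1`) so that it runs on its own.

```python
from pydantic import BaseModel, Field


class Fruit(BaseModel):
    """The name of a piece of fruit."""

    name: str = Field(description="The name of the piece of fruit")


# Hypothetical stand-in for the chain's return value (not real model output).
result = {"name": "apple"}

# Validate the dict against the schema to obtain a typed object.
fruit = Fruit(**result)
print(fruit.name)
```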

---

## Lists of Structured Data

To extract and tag multiple data entities from free-form text, the missing ingredient is the ability to specify that, rather than capturing a single entity from a given piece of text, we wish to extract a **list** of some defined entity.

Using Pydantic along with Python's `typing.List`, this is straightforward: we create a new Pydantic class, with a helpful docstring, that contains a `List` of another Pydantic class.

In [None]:
from typing import List

In [None]:
class Fruits(BaseModel):
    """The names of fruits"""
    fruits: List[Fruit]
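To see what the nested specification buys us, here is a minimal sketch, independent of the LLM, showing that Pydantic converts each nested dict in a list into a `Fruit` instance. The `data` dict is a hand-written example, and the classes are imported from `pydantic` directly so the snippet is self-contained.

```python
from typing import List

from pydantic import BaseModel, Field


class Fruit(BaseModel):
    """The name of a piece of fruit."""

    name: str = Field(description="The name of the piece of fruit")


class Fruits(BaseModel):
    """The names of fruits"""

    fruits: List[Fruit]


# Hand-written example data in the shape the Fruits schema describes.
data = {"fruits": [{"name": "apple"}, {"name": "banana"}]}

# Each nested dict is validated and converted into a Fruit instance.
basket = Fruits(**data)
names = [fruit.name for fruit in basket.fruits]
print(names)
```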

With the list-bearing `Fruits` class now at our disposal, we can construct a parser and chain as usual.

In [None]:
parser = JsonOutputParser(pydantic_object=Fruits)

In [None]:
format_instructions = parser.get_format_instructions()

In [None]:
template_with_format_instructions = template.partial(format_instructions=format_instructions)

In [None]:
chain = template_with_format_instructions | llm | parser

Now, when we pass a longer piece of text mentioning multiple pieces of fruit, the chain extracts and tags them all.

In [None]:
chain.invoke({"input": "An apple fell from the tree. It hit the ground right next to a banana peel."})
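Because the parser returns a plain dict, nothing guarantees that an LLM's response actually matches the `Fruits` schema. One way to guard against this, sketched below under the assumption that you validate the dict yourself, is to catch Pydantic's `ValidationError`. The helper name `to_fruits` and both input dicts are hypothetical illustrations, not part of the notebook's chain.

```python
from typing import List, Optional

from pydantic import BaseModel, Field, ValidationError


class Fruit(BaseModel):
    """The name of a piece of fruit."""

    name: str = Field(description="The name of the piece of fruit")


class Fruits(BaseModel):
    """The names of fruits"""

    fruits: List[Fruit]


def to_fruits(raw: dict) -> Optional[Fruits]:
    """Return a validated Fruits object, or None if raw does not fit the schema."""
    try:
        return Fruits(**raw)
    except ValidationError:
        return None


# A well-formed dict validates; one missing the required "name" key does not.
good = to_fruits({"fruits": [{"name": "apple"}, {"name": "banana"}]})
bad = to_fruits({"fruits": [{"label": "apple"}]})
print(good is not None, bad is None)
```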

---

## Exercise: Document Tagging for the Apollo 11 Story

Below is an account of the Apollo 11 landing. Your goal for this exercise is to extract and tag several entities from within the account.

Specifically, you should extract and tag the following:
- Details about the entire landing, which will include:
    - A list of any crew members mentioned in the account. For each crew member you should capture their:
        - name
        - role during the mission
    - A list of parts and modules belonging to any spacecraft mentioned in the account. For each part of a spacecraft extracted you should capture its:
        - name
        - the specific part or module of the spacecraft that it is
    - A list of any significant quotes made during the account. For each significant quote you should extract and tag:
        - the quote itself
        - the name of the speaker of the quote

Feel free to jump right in if you'd like. If you prefer, you can also expand the _Walkthrough_ section below for step by step guidance on this exercise.

In [None]:
apollo_story = """
On July 20, 1969, Apollo 11, the first manned mission to land on the Moon, successfully touched down in the Sea of Tranquility. \
The crew consisted of Neil Armstrong, who served as the mission commander, \
Edwin 'Buzz' Aldrin, the lunar module pilot, and Michael Collins, the command module pilot.

The spacecraft consisted of two main parts: the command module Columbia and the lunar module Eagle. \
As Armstrong stepped onto the lunar surface, he famously declared, "That's one small step for man, one giant leap for mankind."

Buzz Aldrin also descended onto the Moon's surface, where he and Armstrong conducted experiments and collected samples. \
Michael Collins remained in lunar orbit aboard Columbia, ensuring the successful return of his fellow astronauts.

The mission was a pivotal moment in space exploration and remains a significant achievement in human history.
"""

### Your Work Here

---

## Walkthrough

### Define Crew Member Details

Following the guidelines above, create a class that represents the details of a given crew member.

Feel free to check out the Solution below if you get stuck.

### Your Work Here

### Solution

In [None]:
class CrewMember(BaseModel):
    """Details of a crew member"""
    name: str = Field(description="Name of the crew member")
    role: str = Field(description="Role of the crew member in the mission")

### Define Spacecraft Details

Following the guidelines above, create a class that represents the details of the spacecraft mentioned in the account.

Feel free to check out the Solution below if you get stuck.

### Your Work Here

### Solution

In [None]:
class SpacecraftDetail(BaseModel):
    """Details of the spacecraft"""
    name: str = Field(description="Name of the spacecraft")
    part: str = Field(description="Specific part or module of the spacecraft")

### Define Significant Quotes

Following the guidelines above, create a class that represents the details of any significant quote made in the account.

Feel free to check out the Solution below if you get stuck.

### Your Work Here

### Solution

In [None]:
class SignificantQuote(BaseModel):
    """Details of a significant quote"""
    quote: str = Field(description="The quote")
    speaker: str = Field(description="Name of the person who said the quote")

### Define Combined Details About the Entire Landing

Create a class for the combined details of the Apollo 11 mission. It should contain lists of the other three classes you created above.

Feel free to check out the Solution below if you get stuck.

### Your Work Here

### Solution

In [None]:
class Apollo11Details(BaseModel):
    """Combined details of the Apollo 11 mission"""
    crew_members: List[CrewMember]
    spacecraft_details: List[SpacecraftDetail]
    significant_quotes: List[SignificantQuote]
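The format instructions a parser hands to the LLM are built from the JSON schema of the Pydantic class, so it can be useful to inspect that schema directly. The sketch below redefines the classes using `pydantic` imports so that it runs on its own, and it assumes either Pydantic v1 (`schema()`) or v2 (`model_json_schema()`) is installed.

```python
from typing import List

from pydantic import BaseModel, Field


class CrewMember(BaseModel):
    """Details of a crew member"""
    name: str = Field(description="Name of the crew member")
    role: str = Field(description="Role of the crew member in the mission")


class SpacecraftDetail(BaseModel):
    """Details of the spacecraft"""
    name: str = Field(description="Name of the spacecraft")
    part: str = Field(description="Specific part or module of the spacecraft")


class SignificantQuote(BaseModel):
    """Details of a significant quote"""
    quote: str = Field(description="The quote")
    speaker: str = Field(description="Name of the person who said the quote")


class Apollo11Details(BaseModel):
    """Combined details of the Apollo 11 mission"""
    crew_members: List[CrewMember]
    spacecraft_details: List[SpacecraftDetail]
    significant_quotes: List[SignificantQuote]


# Pydantic v2 exposes model_json_schema(); v1 exposes schema().
if hasattr(Apollo11Details, "model_json_schema"):
    schema = Apollo11Details.model_json_schema()
else:
    schema = Apollo11Details.schema()

# The top-level properties are the three list fields defined above.
print(sorted(schema["properties"]))
```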

### Create the Extraction Chain

With all the data classes defined, it's time to create a chain that uses `JsonOutputParser` in conjunction with our LLM instance to perform the actual extraction and tagging.

Feel free to check out the Solution below if you get stuck.

### Your Work Here

### Solution

In [None]:
parser = JsonOutputParser(pydantic_object=Apollo11Details)

format_instructions = parser.get_format_instructions()

template_with_format_instructions = template.partial(format_instructions=format_instructions)

chain = template_with_format_instructions | llm | parser

### Invoke the Extraction Chain

All that's left to do now is invoke your chain with the `apollo_story` account above.

Feel free to check out the Solution below if you get stuck.

### Your Work Here

### Solution

In [None]:
apollo_details = chain.invoke({"input": apollo_story})

In [None]:
pprint(apollo_details)
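Since the chain's result is an ordinary nested dict, it can be processed with plain Python. The sketch below uses a hand-written `sample` dict in the shape of `Apollo11Details`, with values taken from the story text. It is an illustration of the expected structure, not actual model output, which may differ in wording.

```python
# Hand-written sample in the shape of Apollo11Details (NOT real model output).
sample = {
    "crew_members": [
        {"name": "Neil Armstrong", "role": "mission commander"},
        {"name": "Buzz Aldrin", "role": "lunar module pilot"},
        {"name": "Michael Collins", "role": "command module pilot"},
    ],
    "spacecraft_details": [
        {"name": "Columbia", "part": "command module"},
        {"name": "Eagle", "part": "lunar module"},
    ],
    "significant_quotes": [
        {
            "quote": "That's one small step for man, one giant leap for mankind.",
            "speaker": "Neil Armstrong",
        },
    ],
}

# Map each crew member's name to their role.
roles = {member["name"]: member["role"] for member in sample["crew_members"]}

# Group quotes by speaker.
quotes_by_speaker = {}
for entry in sample["significant_quotes"]:
    quotes_by_speaker.setdefault(entry["speaker"], []).append(entry["quote"])

print(roles["Michael Collins"])
print(len(quotes_by_speaker["Neil Armstrong"]))
```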

---

## Summary

This notebook concludes this section on structured data generation, which we hope you'll agree is a powerful tool with a great number of applications.

Beyond generating structured data as an end in itself, LLMs can generate structured data that indicates when and how an application ought to invoke other, potentially non-LLM, functionality. We call this technique tool use, and in the next section you'll learn how to create tools and integrate them into LLM interactions via agents.