# Nodes

- Nodes represent "chunks" of source Documents, whether that is a text chunk, an image, or more.
- They also contain metadata and relationship information with other nodes and index structures.
- You may also choose to "parse" source Documents into Nodes through our NodeParser classes.
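
To make the bullets above concrete, here is a minimal sketch of what a node carries: text, metadata, and links to its source document and neighbouring nodes. `ToyNode` and `link_nodes` are hypothetical names used only for illustration; they are not LlamaIndex classes, and the real `TextNode` has more fields.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyNode:
    """Hypothetical stand-in for a node: a chunk plus metadata and links."""
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)
    relationships: Dict[str, str] = field(default_factory=dict)
    node_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def link_nodes(nodes: List[ToyNode], source_id: str) -> List[ToyNode]:
    """Record the source document and previous/next neighbours on each node."""
    for i, node in enumerate(nodes):
        node.relationships["source"] = source_id
        if i > 0:
            node.relationships["previous"] = nodes[i - 1].node_id
        if i < len(nodes) - 1:
            node.relationships["next"] = nodes[i + 1].node_id
    return nodes


# Three toy chunks that all come from the same (hypothetical) document id.
chunks = [
    ToyNode(text=t, metadata={"file_name": "paul_essay.txt"})
    for t in ["first chunk", "second chunk", "third chunk"]
]
link_nodes(chunks, source_id="doc-1")
```

The first node has a `next` link but no `previous` link, which mirrors the `SOURCE` and `NEXT` relationships visible in the `TextNode` output further down.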

In [1]:
from llama_index.core import SimpleDirectoryReader
# Load the essay as a list of Document objects (one per input file)
documents = SimpleDirectoryReader(input_files=["paul_essay.txt"]).load_data()

In [2]:
print(documents)

[Document(id_='efb8eabf-eb54-4bac-96be-d56a323832ee', embedding=None, metadata={'file_path': 'paul_essay.txt', 'file_name': 'paul_essay.txt', 'file_type': 'text/plain', 'file_size': 75176, 'creation_date': '2024-07-15', 'last_modified_date': '2024-03-30'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text='Before college the two main things I worked on, outside of school, were writing and programming. I didn\'t write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.\r\n\r\nThe first programs I tried writing were on the IBM 1401 that our school district used for what was then called "

In [33]:
from llama_index.core.node_parser import SentenceSplitter
# Fixed-size chunking: chunks of up to 512 tokens, with 20 tokens of overlap
parser = SentenceSplitter(chunk_size=512, chunk_overlap=20)
nodes = parser.get_nodes_from_documents(documents)
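
The effect of `chunk_size` and `chunk_overlap` can be sketched with a toy splitter. This is a simplified illustration, not the `SentenceSplitter` algorithm: it counts whitespace-separated words rather than tokens and ignores sentence boundaries, which the real splitter tries to respect. `chunk_words` is a hypothetical helper.

```python
from typing import List


def chunk_words(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into windows of `chunk_size` words that share `chunk_overlap` words."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


toy_text = " ".join(f"w{i}" for i in range(10))
toy_chunks = chunk_words(toy_text, chunk_size=4, chunk_overlap=1)
for chunk in toy_chunks:
    print(chunk)
```

Each chunk repeats the last word of the previous one, so context that straddles a boundary is not lost entirely.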

In [34]:
nodes

[TextNode(id_='56cf3a09-aadc-4ba9-ae4c-6c154f3c3fa8', embedding=None, metadata={'file_path': 'paul_essay.txt', 'file_name': 'paul_essay.txt', 'file_type': 'text/plain', 'file_size': 75176, 'creation_date': '2024-07-15', 'last_modified_date': '2024-03-30'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='efb8eabf-eb54-4bac-96be-d56a323832ee', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'file_path': 'paul_essay.txt', 'file_name': 'paul_essay.txt', 'file_type': 'text/plain', 'file_size': 75176, 'creation_date': '2024-07-15', 'last_modified_date': '2024-03-30'}, hash='07211b0b7cbaae77a0544f309c9c1dc910211048cb7946ea1267c46346d42fe3'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='bb29f613-2a7f-4011

In [35]:
print(nodes[1].get_content())

[1]

The first of my friends to get a microcomputer built it himself. It was sold as a kit by Heathkit. I remember vividly how impressed and envious I felt watching him sitting in front of it, typing programs right into the computer.

Computers were expensive in those days and it took me years of nagging before I convinced my father to buy one, a TRS-80, in about 1980. The gold standard then was the Apple II, but a TRS-80 was good enough. This was when I really started programming. I wrote simple games, a program to predict how high my model rockets would fly, and a word processor that my father used to write at least one book. There was only room in memory for about 2 pages of text, so he'd write 2 pages at a time and then print them out, but it was a lot better than a typewriter.

Though I liked programming, I didn't plan to study it in college. In college I was going to study philosophy, which sounded much more powerful. It seemed, to my naive high school self, to be the study of th

In [36]:
print(nodes[2].get_content())

All you had to do was teach SHRDLU more words.

There weren't any classes in AI at Cornell then, not even graduate classes, so I started trying to teach myself. Which meant learning Lisp, since in those days Lisp was regarded as the language of AI. The commonly used programming languages then were pretty primitive, and programmers' ideas correspondingly so. The default language at Cornell was a Pascal-like language called PL/I, and the situation was similar elsewhere. Learning Lisp expanded my concept of a program so fast that it was years before I started to have a sense of where the new limits were. This was more like it; this was what I had expected college to do. It wasn't happening in a class, like it was supposed to, but that was ok. For the next couple years I was on a roll. I knew what I was going to do.

For my undergraduate thesis, I reverse-engineered SHRDLU. My God did I love working on that program. It was a pleasing bit of code, but what made it even more exciting was my 

In [37]:
print(nodes[3].get_content())

By which I mean the sort of AI in which a program that's told "the dog is sitting on the chair" translates this into some formal representation and adds it to the list of things it knows.

What these programs really showed was that there's a subset of natural language that's a formal language. But a very proper subset. It was clear that there was an unbridgeable gap between what they could do and actually understanding natural language. It was not, in fact, simply a matter of teaching SHRDLU more words. That whole way of doing AI, with explicit data structures representing concepts, was not going to work. Its brokenness did, as so often happens, generate a lot of opportunities to write papers about various band-aids that could be applied to it, but it was never going to get us Mike.

So I looked around to see what I could salvage from the wreckage of my plans, and there was Lisp. I knew from experience that Lisp was interesting for its own sake and not just for its association with AI,

# SemanticSplitterNodeParser
- Instead of chunking text with a fixed chunk size, the semantic splitter adaptively picks the breakpoints between sentences using embedding similarity.
- This ensures that a "chunk" contains sentences that are semantically related to each other.
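
The idea can be sketched in plain Python: measure the distance between the embeddings of adjacent sentences and start a new chunk wherever that distance exceeds a percentile cutoff. This is an illustration under simplifying assumptions, not the library's implementation: the embeddings are hand-written 2-D vectors, the sentence grouping controlled by `buffer_size` is omitted, and `semantic_chunks` is a hypothetical helper.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; larger means the two vectors are less alike."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def percentile(values: Sequence[float], pct: float) -> float:
    """Percentile with linear interpolation between the two closest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def semantic_chunks(sentences: List[str],
                    embeddings: List[Sequence[float]],
                    threshold_pct: float = 95) -> List[str]:
    """Group sentences, breaking where adjacent embeddings are unusually far apart."""
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []
    distances = [cosine_distance(embeddings[i], embeddings[i + 1])
                 for i in range(len(sentences) - 1)]
    cutoff = percentile(distances, threshold_pct)
    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > cutoff:  # topic shift: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


# Toy data: two sentences about one topic, then two about another.
toy_sentences = ["Cats purr.", "Cats nap.", "Lisp has macros.", "Lisp has lists."]
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_result = semantic_chunks(toy_sentences, toy_embeddings)
print(toy_result)
```

Because the cutoff is relative to the document's own distances, a higher percentile produces fewer, larger chunks, and a lower one produces more, smaller chunks.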

In [11]:
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# With no arguments, HuggingFaceEmbedding() loads its default model (BAAI/bge-small-en).
# To pin a specific model instead:
# embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

embed_model = HuggingFaceEmbedding()
splitter = SemanticSplitterNodeParser(
    buffer_size=1, breakpoint_percentile_threshold=95, embed_model=embed_model
)

In [12]:
SemanticSplitter_nodes = splitter.get_nodes_from_documents(documents)

In [13]:
print(SemanticSplitter_nodes[1].get_content())

The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.

In [16]:
print(SemanticSplitter_nodes[2].get_content())

The language we used was an early version of Fortran. You had to type programs on punch cards, then stack them in the card reader and press a button to load the program into memory and run it. The result would ordinarily be to print something on the spectacularly loud printer.

I was puzzled by the 1401. I couldn't figure out what to do with it. And in retrospect there's not much I could have done with it. The only form of input to programs was data stored on punched cards, and I didn't have any data stored on punched cards. The only other option was to do things that didn't rely on any input, like calculate approximations of pi, but I didn't know enough math to do anything interesting of that type. So I'm not surprised I can't remember any programs I wrote, because they can't have done much. My clearest memory is of the moment I learned it was possible for programs not to terminate, when one of mine didn't. On a machine without time-sharing, this was a social as well as a technical er

In [18]:
print(SemanticSplitter_nodes[3].get_content())

It wasn't happening in a class, like it was supposed to, but that was ok. For the next couple years I was on a roll. I knew what I was going to do.

For my undergraduate thesis, I reverse-engineered SHRDLU. My God did I love working on that program. It was a pleasing bit of code, but what made it even more exciting was my belief — hard to imagine now, but not unique in 1985 — that it was already climbing the lower slopes of intelligence.

I had gotten into a program at Cornell that didn't make you choose a major. You could take whatever classes you liked, and choose whatever you liked to put on your degree. I of course chose "Artificial Intelligence." When I got the actual physical diploma, I was dismayed to find that the quotes had been included, which made them read as scare-quotes. At the time this bothered me, but now it seems amusingly accurate, for reasons I was about to discover.

I applied to 3 grad schools: MIT and Yale, which were renowned for AI at the time, and Harvard, whi

In [23]:
import os
from getpass import getpass

In [24]:
HF_TOKEN = getpass()

 ········


In [25]:
# Create the LLM client backed by the Hugging Face Inference API
from llama_index.llms.huggingface import HuggingFaceInferenceAPI
llm = HuggingFaceInferenceAPI(model_name="mistralai/Mixtral-8x7B-Instruct-v0.1", token=HF_TOKEN)
llm

HuggingFaceInferenceAPI(callback_manager=<llama_index.core.callbacks.base.CallbackManager object at 0x0000018952580E90>, system_prompt=None, messages_to_prompt=<function messages_to_prompt at 0x00000189328B6A20>, completion_to_prompt=<function default_completion_to_prompt at 0x000001893292D120>, output_parser=None, pydantic_program_mode=<PydanticProgramMode.DEFAULT: 'default'>, query_wrapper_prompt=None, model_name='mistralai/Mixtral-8x7B-Instruct-v0.1', token='<redacted>', timeout=None, headers=None, cookies=None, task=None, context_window=3900, num_output=256, is_chat_model=False, is_function_calling_model=False)

In [38]:
from llama_index.core import VectorStoreIndex
from llama_index.core.response.notebook_utils import display_source_node

In [39]:
vector_index = VectorStoreIndex(SemanticSplitter_nodes, embed_model=embed_model)
query_engine = vector_index.as_query_engine(llm=llm)
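
Conceptually, a vector index answers a query by embedding it, ranking the stored nodes by similarity to that embedding, and passing the best matches to the LLM as context. The ranking step can be sketched as follows. The vectors are made up and `top_k` is a hypothetical helper, not part of `VectorStoreIndex`.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float],
          node_vecs: List[Sequence[float]],
          k: int = 2) -> List[int]:
    """Return the indices of the k node vectors most similar to the query."""
    scored = sorted(
        ((cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(node_vecs)),
        reverse=True,
    )
    return [idx for _, idx in scored[:k]]


# Toy embeddings for three nodes and one query.
toy_node_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
toy_query_vec = [1.0, 0.1]
best = top_k(toy_query_vec, toy_node_vecs, k=2)
print(best)
```

Since retrieval works per node, how the text was chunked determines what context the LLM sees, which is why the two splitters are compared on the same query below.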

In [40]:
response = query_engine.query(
    "Tell me about the author's programming journey through childhood to college"
)

In [41]:
print(str(response))


The author, Paul Graham, began his programming journey in 9th grade when he tried writing programs on the IBM 1401 that his school district used for data processing. He used an early version of Fortran and had to type programs on punch cards, which were then loaded into memory and run. However, he found it difficult to figure out what to do with the 1401 as he didn't have any data stored on punch cards and didn't know enough math to do anything interesting.

His programming journey took a turn when microcomputers became available. He was impressed and envious when one of his friends built a microcomputer himself using a kit from Heathkit. It wasn't until a few years later that he convinced his father to buy a TRS-80, which he used to write simple games, a program to predict how high his model rockets would fly, and a word processor that his father used to write at least one book.

In college, Paul initially planned to study philosophy, but he found the courses to be boring. He then sw

In [45]:
# Baseline for comparison: index the fixed-size SentenceSplitter nodes
base_vector_index = VectorStoreIndex(nodes, embed_model=embed_model)
base_query_engine = base_vector_index.as_query_engine(llm=llm)

In [46]:
base_response = base_query_engine.query(
    "Tell me about the author's programming journey through childhood to college"
)

In [47]:
print(str(base_response))


The author, Paul Graham, began his programming journey in 9th grade when he was 13 or 14. He tried writing programs on the IBM 1401 that his school district used for data processing. However, he struggled to figure out what to do with it due to the lack of input options. He couldn't remember any programs he wrote during that time, except for one that didn't terminate, causing a social error.

With the advent of microcomputers, everything changed for him. His friend, who owned a Heathkit microcomputer, could type programs directly into the computer, leaving Paul feeling impressed and envious. It took years of nagging before his father bought him a TRS-80 in about 1980. During this time, he wrote simple games, a program to predict how high his model rockets would fly, and a word processor that his father used to write at least one book.

In college, Paul initially planned to study philosophy, thinking it was the study of ultimate truths. However, he found the subject boring and decided 