# Indexes - Structuring documents so LLMs can work with them

## Document Loaders
Document loaders are easy ways to import data from other sources. They share functionality with [OpenAI Plugins](https://openai.com/blog/chatgpt-plugins), specifically the [retrieval plugin](https://github.com/openai/chatgpt-retrieval-plugin).

See the [big list](https://python.langchain.com/en/latest/modules/indexes/document_loaders.html) of document loaders. There are a bunch more on [Llama Index](https://llamahub.ai/) as well.

In [4]:
from langchain.document_loaders import TextLoader
# import langchain 
# langchain.debug = True

In [12]:
# TextLoader - Load text from a file
text_loader = TextLoader(file_path="../data/sample_loader.txt", encoding="utf-8")
data = text_loader.load()
data

[Document(metadata={'source': '../data/sample_loader.txt'}, page_content='The Project Gutenberg eBook of The Works of Edgar Allan Poe — Volume 2\n    \nThis ebook is for the use of anyone anywhere in the United States and\nmost other parts of the world at no cost and with almost no restrictions\nwhatsoever. You may copy it, give it away or re-use it under the terms\nof the Project Gutenberg License included with this ebook or online\nat www.gutenberg.org. If you are not located in the United States,\nyou will have to check the laws of the country where you are located\nbefore using this eBook.\n\nTitle: The Works of Edgar Allan Poe — Volume 2\n\nAuthor: Edgar Allan Poe\n\nRelease date: April 1, 2000 [eBook #2148]\n                Most recently updated: May 19, 2024\n\nLanguage: English\n\nCredits: David Widger\n\n\n*** START OF THE PROJECT GUTENBERG EBOOK THE WORKS OF EDGAR ALLAN POE — VOLUME 2 ***\n\n\n\n\nThe Works of Edgar Allan Poe\n\nby Edgar Allan Poe\n\nThe Raven Edition\n\nVOLU

In [13]:
print(data[0].page_content[1855:1984])

s little back
      library, or book-closet, _au troisième_, No. 33, _Rue Dunôt,
      Faubourg St. Germain_. For one hour at lea
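
To make the shape of the loader output concrete, here is a minimal sketch of what a text loader does conceptually: read a file and wrap it in an object that carries `page_content` and `metadata`, matching the fields in the output above. `SimpleDocument` and `load_text` are hypothetical names for illustration, not LangChain's classes, and the demo writes a throwaway file so it runs without `../data/sample_loader.txt`.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loader's document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text(file_path, encoding="utf-8"):
    """Read the whole file into one document and record where it came from."""
    text = Path(file_path).read_text(encoding=encoding)
    return [SimpleDocument(page_content=text, metadata={"source": str(file_path)})]


# Demo with a temporary file (an assumption, so the sketch is self-contained).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("Title: The Works of Edgar Allan Poe\n", encoding="utf-8")
    docs = load_text(sample)

print(len(docs))
print(docs[0].page_content[:5])
```

As in the notebook, the result is a list with one document per file, and slicing `page_content` works like slicing any string.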


## Text Splitters
Often your document is too long (like a book) for your LLM, so you need to split it into chunks. Text splitters help with this.

There are many ways to split text into chunks; experiment with [different ones](https://python.langchain.com/en/latest/modules/indexes/text_splitters.html) to see which works best for you.

In [14]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [29]:
text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size = 150,
    chunk_overlap  = 20,
)
texts = text_splitter.create_documents([data[0].page_content])
texts

[Document(page_content='The Project Gutenberg eBook of The Works of Edgar Allan Poe — Volume 2\n    \nThis ebook is for the use of anyone anywhere in the United States and'),
 Document(page_content='most other parts of the world at no cost and with almost no restrictions\nwhatsoever. You may copy it, give it away or re-use it under the terms'),
 Document(page_content='of the Project Gutenberg License included with this ebook or online\nat www.gutenberg.org. If you are not located in the United States,'),
 Document(page_content='you will have to check the laws of the country where you are located\nbefore using this eBook.'),
 Document(page_content='Title: The Works of Edgar Allan Poe — Volume 2\n\nAuthor: Edgar Allan Poe'),
 Document(page_content='Release date: April 1, 2000 [eBook #2148]\n                Most recently updated: May 19, 2024\n\nLanguage: English\n\nCredits: David Widger'),
 Document(page_content='*** START OF THE PROJECT GUTENBERG EBOOK THE WORKS OF EDGAR ALLAN POE — VOL

In [30]:
print (f"You have {len(texts)} documents")

You have 4923 documents


In [31]:
print ("Preview:")
print (texts[0].page_content, "\n")
print (texts[1].page_content)

Preview:
The Project Gutenberg eBook of The Works of Edgar Allan Poe — Volume 2
    
This ebook is for the use of anyone anywhere in the United States and 

most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms


There are many ways to split text, and the right one depends on your retrieval strategy and application design. Check out more splitters [here](https://python.langchain.com/docs/modules/data_connection/document_transformers/).
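
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sliding-window splitter in plain Python. It is only a sketch: `split_with_overlap` is a hypothetical helper that cuts at fixed character offsets, whereas `RecursiveCharacterTextSplitter` tries to break on separators such as paragraphs and newlines first, so its chunks will differ.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one. That repeated context is what the overlap buys you, at the cost of storing some text twice.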

## Retrievers
Retrievers are an easy way to combine documents with language models.

There are many different types of retrievers; the most widely supported is the VectorStoreRetriever.

In [47]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.embeddings import OllamaEmbeddings

In [48]:
embeddings = OllamaEmbeddings(model='llama3')

In [49]:
db = FAISS.from_documents(texts[:10], embeddings)
db

<langchain_community.vectorstores.faiss.FAISS at 0x310889050>

In [50]:
retriever = db.as_retriever()
retriever

VectorStoreRetriever(tags=['FAISS', 'OllamaEmbeddings'], vectorstore=<langchain_community.vectorstores.faiss.FAISS object at 0x310889050>)

In [52]:
docs = retriever.get_relevant_documents("what types of things did the author want to build?")

In [54]:
len(docs), docs

(4,
 [Document(page_content='you will have to check the laws of the country where you are located\nbefore using this eBook.'),
  Document(page_content='Title: The Works of Edgar Allan Poe — Volume 2\n\nAuthor: Edgar Allan Poe'),
  Document(page_content='Release date: April 1, 2000 [eBook #2148]\n                Most recently updated: May 19, 2024\n\nLanguage: English\n\nCredits: David Widger'),
  Document(page_content='most other parts of the world at no cost and with almost no restrictions\nwhatsoever. You may copy it, give it away or re-use it under the terms')])
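
Conceptually, a vector store retriever embeds the query, scores it against every stored chunk, and returns the top `k`. The sketch below mimics that flow in plain Python under loud assumptions: `embed` is a toy word-count embedding standing in for a real model, and `retrieve` is a hypothetical helper that does a brute-force scan rather than using FAISS. The default `k=4` is chosen to mirror the four results shown above.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts. A real model returns dense floats."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, chunks, k=4):
    """Return the k chunks most similar to the query, best match first."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]


chunks = [
    "Title: The Works of Edgar Allan Poe",
    "Author: Edgar Allan Poe",
    "Language: English",
    "Credits: David Widger",
]
top = retrieve("who is the author", chunks, k=1)
print(top)  # ['Author: Edgar Allan Poe']
```

A retriever always returns its top `k` chunks, even when none of them is a good match, so it is worth inspecting the results for queries that have little to do with the indexed text.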

## VectorStores

In [55]:
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.embeddings import OllamaEmbeddings

# Get embedding engine ready
embeddings = OllamaEmbeddings(model='llama3')

In [56]:
print (f"You have {len(texts)} documents")

You have 4923 documents


In [57]:
embedding_list = embeddings.embed_documents([text.page_content for text in texts[:10]])

In [58]:
print (f"You have {len(embedding_list)} embeddings")
print (f"Here's a sample of one: {embedding_list[0][:3]}...")

You have 10 embeddings
Here's a sample of one: [-3.612340211868286, 0.40659618377685547, 1.3247272968292236]...
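
The sample values above are clearly not unit length (the first component alone has a magnitude above 3). Some similarity metrics are sensitive to vector length, so a common step is L2 normalization, after which a plain dot product equals cosine similarity. The sketch below is illustrative only: `l2_normalize` and `dot` are hypothetical helpers, vector `a` reuses just the three printed components of a much longer embedding, and vector `b` is made up. Whether you need this step depends on your vector store and distance metric.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length; return a copy unchanged if it is all zeros."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        return list(vector)
    return [x / norm for x in vector]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Toy data: a is the three sample components printed above, b is invented.
a = [-3.612340211868286, 0.40659618377685547, 1.3247272968292236]
b = [-3.0, 0.5, 1.0]

# After normalization, the dot product is the cosine similarity.
cosine = dot(l2_normalize(a), l2_normalize(b))
print(round(cosine, 3))
```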


# Workshop exercise:

1. Find a manual for an electronic device online (like a microwave, a TV, or a phone)
2. Convert it to text
3. Save it to a text file under the data directory
4. Load the text file
5. Split the text into chunks
6. Embed the chunks
7. Store the chunks in a vector store
8. Retrieve a chunk based on a query
9. Ask the model a question about the manual