## Retrieval augmented generation

In retrieval augmented generation (RAG), an LLM retrieves contextual documents from an external dataset as part of its execution.

This is useful when we want to ask questions about specific documents (e.g., our PDFs or a set of videos).
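The retrieve-then-ask idea can be sketched with a toy example. This is only an illustration under loud assumptions: the corpus is made up, `retrieve` and `build_prompt` are hypothetical helper names, and plain word overlap stands in for whatever retrieval method a real RAG system would use. No LLM is called; the sketch stops at building the prompt.

```python
import re

# Toy corpus standing in for an external dataset (contents are made up).
CORPUS = [
    "Pret A Manger started as a single sandwich shop in London.",
    "Whisper is a speech recognition model that transcribes audio.",
    "Notion pages can be exported as Markdown files.",
]

def tokenize(text):
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, corpus, k=1):
    """Return the k documents sharing the most words with the question."""
    q = tokenize(question)
    ranked = sorted(corpus, key=lambda doc: len(q & tokenize(doc)), reverse=True)
    return ranked[:k]

def build_prompt(question, corpus, k=1):
    """Place the retrieved context ahead of the question."""
    context = "\n".join(retrieve(question, corpus, k))
    return f"Context:\n{context}\n\nQuestion: {question}"

prompt = build_prompt("Where did Pret A Manger start?", CORPUS)
print(prompt)
```

The resulting prompt would then be sent to the model, so the answer can draw on the retrieved text rather than only on what the model already knows.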

# Loaders

## Loaders deal with the specifics of accessing and converting data

### Accessing
#### Websites
#### Databases
#### YouTube
#### arXiv
....
### Data Types
#### PDF
#### HTML
#### JSON
#### Word, PowerPoint

## Loaders return a list of "Document" objects

### The data can be public or proprietary, unstructured or structured (e.g., databases)
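To make the "list of Document objects" idea concrete, here is a minimal sketch of what a loader does. The `Document` dataclass and `load_text_file` function below are hypothetical stand-ins written for this example, not the LangChain classes; they only mirror the shape seen in this notebook (a `page_content` string plus a `metadata` dictionary).

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile

@dataclass
class Document:
    """Hypothetical stand-in for a loader's Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def load_text_file(path):
    """Toy loader: one Document per blank-line-separated block of a text file."""
    text = Path(path).read_text(encoding="utf-8")
    blocks = [b.strip() for b in text.split("\n\n") if b.strip()]
    return [
        Document(page_content=block, metadata={"source": str(path), "block": i})
        for i, block in enumerate(blocks)
    ]

# Write a small sample file so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "notes.txt"
    sample.write_text("First block.\n\nSecond block.\n", encoding="utf-8")
    docs = load_text_file(sample)

print(len(docs))
print(docs[0].page_content, docs[0].metadata["block"])
```

Whatever the source or file type, downstream steps only need to deal with this one uniform shape.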

In [76]:
# !pip install openai
# !pip install python-dotenv
# !pip install langchain
# !pip install langchain_community
# !pip install pypdf

In [4]:
import os
import openai
import sys
sys.path.append("../..")

In [86]:
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key  = os.environ['OPENAI_API_KEY']

In [6]:
# Document loaders live in the langchain_community package installed above.
from langchain_community.document_loaders import PyPDFLoader

In [15]:
loader = PyPDFLoader("sample_data/Pret.pdf")
pages = loader.load()
page = pages[0]
print(page.page_content[:10])

The
Histor


### The output above shows one word per line. This happens with many PDFs and is likely due to the way the text is extracted: PDF files can have complex layouts, and the extraction process sometimes captures text in a format that doesn't preserve its original flow.

## Possible Causes:

### Line Breaks and Whitespace: The PDF might contain line breaks, carriage returns, or irregular spacing that are preserved during text extraction, causing the text to appear on separate lines.

### Text Extraction Method: The library or method used for extracting text might not handle the PDF layout correctly, especially if the PDF has multiple columns, images, or other complex formatting.

In [17]:
print(page.page_content[:30])

The
History
of
Pret
A
Manger:



In [23]:
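def reprocess_text(text):
    # split() with no arguments splits on any run of whitespace (spaces, tabs, newlines),
    # so joining the pieces with single spaces restores a normal flow of words.
    cleaned_text = ' '.join(text.split())
    return cleaned_text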

reprocessed_page_content = reprocess_text(page.page_content)
print(reprocessed_page_content[:100])

The History of Pret A Manger: From a Single Shop to a Global Phenomenon Introduction Pret A Manger, 
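The same cleanup can be applied to every page while keeping each page's metadata. The sketch below is an assumption-laden illustration: `Page` is a hypothetical stand-in for the loaded page objects, `reprocess_pages` is a made-up helper name, and the sample text only imitates the one-word-per-line extraction shown in this notebook.

```python
from dataclasses import dataclass, field

@dataclass
class Page:
    """Hypothetical stand-in for a loaded page: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def reprocess_text(text):
    """Collapse every run of whitespace (including newlines) into one space."""
    return " ".join(text.split())

def reprocess_pages(pages):
    """Return new pages with cleaned text and a copy of the original metadata."""
    return [
        Page(page_content=reprocess_text(p.page_content), metadata=dict(p.metadata))
        for p in pages
    ]

# Sample pages imitating one-word-per-line PDF extraction.
raw_pages = [
    Page("The\nHistory\nof\nPret\nA\nManger:", {"source": "sample_data/Pret.pdf", "page": 0}),
    Page("A\nSimple\n\nIdea", {"source": "sample_data/Pret.pdf", "page": 1}),
]
clean_pages = reprocess_pages(raw_pages)
print(clean_pages[0].page_content)
```

Returning new objects instead of editing in place keeps the raw extraction available for comparison. Note that this approach also removes genuine paragraph breaks, which may or may not matter for later splitting.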


In [20]:
# pages = loader.load()

In [32]:
len(pages)

6

In [37]:
page = pages[0]
page.metadata, page.page_content[:50]

({'source': 'sample_data/Pret.pdf', 'page': 0},
 'The\nHistory\nof\nPret\nA\nManger:\nFrom\na\nSingle\nShop\nt')

In [38]:
print(reprocessed_page_content[:50])

The History of Pret A Manger: From a Single Shop t


In [39]:
page.metadata

{'source': 'sample_data/Pret.pdf', 'page': 0}

# YouTube Loading

In [41]:
# !pip install yt-dlp -U

In [42]:
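# GenericLoader combines a blob loader (here: YouTube audio) with a parser (here: Whisper transcription).
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers import OpenAIWhisperParser
from langchain_community.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader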

In [None]:
# !pip install -U yt-dlp
# !pip install pydub
# !pip install youtube-dl

# Web Pages

In [78]:
from langchain_community.document_loaders import WebBaseLoader

In [79]:
loader = WebBaseLoader("https://github.com/basecamp/handbook/blob/master/37signals-is-you.md")

In [80]:
docs = loader.load()

In [81]:
docs[0].metadata

{'source': 'https://github.com/basecamp/handbook/blob/master/37signals-is-you.md',
 'title': 'File not found · GitHub',
 'description': 'Basecamp Employee Handbook. Contribute to basecamp/handbook development by creating an account on GitHub.',
 'language': 'en'}

In [83]:
print(docs[0].page_content[:200])

[blank lines omitted]
File not found · GitHub
[blank lines omitted]
Skip to content

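# The loaded page content can be very sparse, so it needs post-processing before use. In this run the URL also resolved to GitHub's "File not found" page, as the title in the metadata shows.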
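One simple form of post-processing is to drop empty lines and collapse repeated spaces. The sketch below is a minimal illustration, not part of any loader: `squeeze_whitespace` is a hypothetical helper name, and the sample string only imitates the sparse page content from the web page cell.

```python
import re

def squeeze_whitespace(text):
    """Trim each line, drop empty lines, and collapse runs of spaces or tabs."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)

# Sample string imitating sparse, mostly blank page content.
raw = "\n\n\n\nFile not found · GitHub\n\n\n\n   Skip   to content\n\n\n"
print(squeeze_whitespace(raw))
```

In the notebook this could be applied to `docs[0].page_content`. Unlike joining everything with spaces, it keeps one line break between the surviving lines, which preserves a little of the page structure.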

# Notion

## Follow the steps here for an example Notion site such as this one.

### Duplicate the page into your own Notion space and export it as Markdown or CSV.
### Unzip the export and save it as a folder that contains the Markdown file for the Notion page.

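In [73]:
from langchain_community.document_loaders import NotionDirectoryLoader
# NotionDirectoryLoader expects the folder that holds the exported Markdown files.
# Here the path points at a single file instead, which is the likely reason the result is empty.
loader = NotionDirectoryLoader("sample_data/README.md")
loader.load()

[]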
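An empty result like this is easy to miss, so it can help to check the export folder before handing it to a loader. The sketch below assumes the loader looks for `.md` files under a directory; `list_markdown_files` is a hypothetical helper written for this example, and the folder it scans is a throwaway one created on the spot.

```python
from pathlib import Path
import tempfile

def list_markdown_files(export_dir):
    """Return the sorted .md files under export_dir, or raise if it is not a folder."""
    root = Path(export_dir)
    if not root.is_dir():
        raise NotADirectoryError(f"expected a folder of exported pages, got: {root}")
    return sorted(root.glob("**/*.md"))

# Build a throwaway export folder so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    export = Path(tmp) / "notion_export"
    export.mkdir()
    (export / "README.md").write_text("# Example page\n", encoding="utf-8")

    found = [p.name for p in list_markdown_files(export)]
    try:
        list_markdown_files(export / "README.md")
        file_rejected = False
    except NotADirectoryError:
        file_rejected = True

print(found, file_rejected)
```

Failing loudly on a file path turns a silent empty list into an error that points at the actual mistake.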

In [71]:
# len(docs)

0

In [None]:
# print(docs[0].page_content[0:100])

In [None]:
# docs[0].metadata