## Retrieval augmented generation

In retrieval augmented generation (RAG), an LLM retrieves contextual documents from an external dataset as part of its execution.

This is useful when we want to ask questions about specific documents (e.g., our PDFs or a set of videos).
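The retrieve-then-generate flow can be sketched without any library. This is a toy illustration only, not how LangChain implements retrieval: the word-overlap scoring, the helper names `tokenize`, `retrieve`, and `build_prompt`, and the sample texts are all assumptions made for the example.

```python
import re

def tokenize(text):
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents, k=1):
    """Return the k documents sharing the most words with the question.

    Word overlap is a stand-in for a real retriever (e.g. embedding search).
    """
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]

def build_prompt(question, context_docs):
    """Place the retrieved text ahead of the question so the LLM can use it."""
    context = "\n\n".join(context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Hypothetical sample texts, made up for this sketch.
documents = [
    "The first shop opened in London.",
    "Whisper transcribes audio into text.",
]
question = "Where did the first shop open?"
prompt = build_prompt(question, retrieve(question, documents))
print(prompt)  # this string would then be sent to the LLM
```

In a real pipeline the retriever would search an index built from the loaded documents, which is why loading comes first.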

# Loaders

## Loaders deal with the specifics of accessing and converting data

### Accessing
#### Websites
#### Databases
#### YouTube
#### arXiv
....
### Data Types
#### PDF
#### HTML
#### JSON
#### Word, PowerPoint

## Returns a list of "Document" objects

### public, proprietary, unstructured, structured, Databases
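To make the "list of Document objects" idea concrete, here is a minimal sketch. `SimpleDocument` and `load_texts` are hypothetical stand-ins written for this example, not LangChain's classes; they only mirror the two attributes used later in this notebook, `page_content` and `metadata`.

```python
from dataclasses import dataclass, field

@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loader's Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def load_texts(named_texts):
    """Turn {source_name: text} into a list of SimpleDocument objects."""
    return [
        SimpleDocument(page_content=text, metadata={"source": name})
        for name, text in named_texts.items()
    ]

# Made-up inputs; a real loader would read these from files, URLs, etc.
docs = load_texts({"notes.txt": "First note.", "faq.html": "Common questions."})
for doc in docs:
    print(doc.metadata["source"], len(doc.page_content))
```

Whatever the source type, downstream steps can then treat every item the same way: read the text from `page_content` and the provenance from `metadata`.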

In [26]:
# !pip install openai
# !pip install python-dotenv
# !pip install langchain
# !pip install langchain_community
# !pip install pypdf

Collecting pypdf
  Downloading pypdf-4.3.1-py3-none-any.whl.metadata (7.4 kB)
Downloading pypdf-4.3.1-py3-none-any.whl (295 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 295.8/295.8 kB 1.9 MB/s eta 0:00:00
Installing collected packages: pypdf
Successfully installed pypdf-4.3.1


In [8]:
import os
import openai
import sys
sys.path.append("../..")

In [14]:
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key  = os.environ['OPENAI_API_KEY']
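Indexing `os.environ` directly raises a bare `KeyError` when the variable is missing. A small helper can give a clearer message. This is only a sketch: `require_env` and the variable name `EXAMPLE_API_KEY` are hypothetical, introduced for illustration.

```python
import os

def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file or shell")
    return value

# Demo with a made-up variable so the sketch runs anywhere.
os.environ.setdefault("EXAMPLE_API_KEY", "demo-value")
print(require_env("EXAMPLE_API_KEY"))
```

In this notebook the call would be `require_env("OPENAI_API_KEY")`, made after `load_dotenv` has run.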

In [23]:
from langchain.document_loaders import PyPDFLoader

In [27]:
loader = PyPDFLoader("sample_data/The_History_of Pret A_Manger_ From_a_Single_Shop_to_a_Global_Phenomenon.pdf")

In [28]:
pages = loader.load()

In [29]:
len(pages)

6

In [30]:
page = pages[0]

In [33]:
print(page.page_content[:10])

The
Histor


In [34]:
page.metadata

{'source': 'sample_data/The_History_of Pret A_Manger_ From_a_Single_Shop_to_a_Global_Phenomenon.pdf',
 'page': 0}
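Since each page is a separate document with its own metadata, a quick per-page summary helps spot empty or badly extracted pages. The sketch below assumes only that each item has `page_content` and `metadata` attributes; `summarize_pages` is a hypothetical helper, and the stand-in pages are made up so the code runs without a PDF.

```python
from types import SimpleNamespace

def summarize_pages(pages, preview_chars=40):
    """Return one dict per page: page number, character count, short preview."""
    summaries = []
    for page in pages:
        text = page.page_content
        summaries.append({
            "page": page.metadata.get("page"),
            "chars": len(text),
            # Collapse newlines and repeated spaces so the preview fits one line.
            "preview": " ".join(text.split())[:preview_chars],
        })
    return summaries

# Stand-in objects; with the notebook's `pages` list you would call
# summarize_pages(pages) directly.
fake_pages = [
    SimpleNamespace(page_content="Line one\nline two", metadata={"page": 0}),
    SimpleNamespace(page_content="", metadata={"page": 1}),
]
for row in summarize_pages(fake_pages):
    print(row)
```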

# YouTube Loading

In [40]:
!pip install yt-dlp -U



In [35]:
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader

In [47]:
# ! pip install yt_dlp
# ! pip install pydub
# !yt-dlp -U
# !pip install yt-dlp
# !pip install youtube-dl



In [51]:
!pip install yt-dlp

In [55]:
import yt_dlp
url="https://www.youtube.com/watch?v=jGwO_UgTS7I"
save_dir="sample_data/youtube/"
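# Download the audio with YoutubeAudioLoader, then transcribe it with Whisper.
# loader = GenericLoader(
#     YoutubeAudioLoader([url], save_dir),
#     OpenAIWhisperParser()
# )
# docs = loader.load()

# Alternative: download the audio directly with yt-dlp.
# Note that download() returns a status code, not documents.
# status = yt_dlp.YoutubeDL({'format': 'bestaudio'}).download([url])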
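If the audio is fetched with yt-dlp directly, the download options can be built separately from the download itself. The following is a sketch under assumptions: `build_audio_options` and `download_audio` are hypothetical helpers, the `m4a` codec is an arbitrary choice, and the audio conversion step needs ffmpeg installed. Only the options are built when the block runs; nothing is downloaded.

```python
import os

def build_audio_options(save_dir, codec="m4a"):
    """Build a yt-dlp options dict that downloads audio only into save_dir."""
    return {
        "format": "bestaudio/best",
        # Output filename template: <save_dir>/<video title>.<extension>
        "outtmpl": os.path.join(save_dir, "%(title)s.%(ext)s"),
        # Convert the download to the requested audio codec (requires ffmpeg).
        "postprocessors": [
            {"key": "FFmpegExtractAudio", "preferredcodec": codec}
        ],
    }

def download_audio(url, save_dir):
    """Download one video's audio and return yt-dlp's status code."""
    import yt_dlp  # imported here so building options works without yt-dlp
    os.makedirs(save_dir, exist_ok=True)
    with yt_dlp.YoutubeDL(build_audio_options(save_dir)) as ydl:
        return ydl.download([url])

options = build_audio_options("sample_data/youtube/")
print(options["outtmpl"])
```

The downloaded audio files would still need to be transcribed (for example with a Whisper-based parser) before they become text documents.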

In [83]:
from langchain.document_loaders import WebBaseLoader

In [85]:
loader = WebBaseLoader("https://github.com/basecamp/handbook/blob/master/37signals-is-you.md")

In [87]:
docs = loader.load()

In [88]:
docs[0].metadata

{'source': 'https://github.com/basecamp/handbook/blob/master/37signals-is-you.md',
 'title': 'File not found · GitHub',
 'description': 'Basecamp Employee Handbook. Contribute to basecamp/handbook development by creating an account on GitHub.',
 'language': 'en'}
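The `title` value above, 'File not found · GitHub', suggests the loader fetched an error page, not the handbook text. A loader will happily return such a page as a document, so a quick check on the metadata can catch it. This is a heuristic sketch: `looks_like_missing_page` and its marker strings are assumptions for illustration.

```python
def looks_like_missing_page(metadata, markers=("file not found", "page not found", "404")):
    """Heuristically flag a loaded document whose title suggests an error page."""
    title = str(metadata.get("title", "")).lower()
    return any(marker in title for marker in markers)

meta = {"title": "File not found · GitHub", "language": "en"}
print(looks_like_missing_page(meta))
```

With the notebook's data the call would be `looks_like_missing_page(docs[0].metadata)`.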

In [64]:
print(docs[0].page_content[:1000])

File not found · GitHub

Skip to content

Navigation Menu

Toggle navigation

            Sign in

        Product

Actions
        Automate any workflow

Packages
        Host and manage packages

Security
        Find and fix vulnerabilities

Codespaces
        Instant dev environments

GitHub Copilot
        Write better code with AI

Code review
        Manage code changes

Issues
        Plan and track work

Discussions
        Collaborate outside of code

Explore

      All features

      Documentation

      GitHub Skills

      Blog

        Solutions

By size

      Enterprise

      Teams


# The loaded page is mostly whitespace and navigation text, so it needs to be processed before use
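Web pages loaded this way tend to carry long runs of blank lines and indented menu text. One simple first processing step is to collapse that whitespace. The sketch below uses only the standard library; `collapse_whitespace` is a hypothetical helper, and the `raw` string is a shortened, made-up sample in the style of the page text.

```python
import re

def collapse_whitespace(text):
    """Strip each line, drop empty lines, and squeeze runs of spaces or tabs."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)

raw = "\n\n\nFile not found · GitHub\n\n\n   Sign in\n      \n\nActions\n        Automate any workflow\n"
print(collapse_whitespace(raw))
```

This only removes whitespace. Dropping navigation entries such as "Sign in" would need a further step, for example selecting the main content element of the HTML before extracting text.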

# Notion

## Follow steps here for an example Notion site such as this one.

### Duplicate the page into your own Notion space and export it as Markdown or CSV.
### Unzip the export and save it as a folder that contains the Markdown file for the Notion page.

In [69]:
from langchain.document_loaders import NotionDirectoryLoader
loader = NotionDirectoryLoader("docs/Notion_DB")
docs = loader.load()

In [77]:
# len(docs)

In [76]:
# print(docs[0].page_content[0:100])

In [91]:
# docs[0].metadata
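Conceptually, loading an exported folder means reading each Markdown file into a document with its path as metadata. The sketch below illustrates that idea with the standard library only. It is not `NotionDirectoryLoader`'s implementation: `load_markdown_dir` is a hypothetical helper, and the demo writes a made-up file to a temporary folder so the code runs without a Notion export.

```python
import tempfile
from pathlib import Path

def load_markdown_dir(directory):
    """Read every .md file under a directory into document-like dicts."""
    docs = []
    for path in sorted(Path(directory).rglob("*.md")):
        docs.append({
            "page_content": path.read_text(encoding="utf-8"),
            "metadata": {"source": str(path)},
        })
    return docs

# Demo on a temporary folder holding one made-up Markdown page.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "page.md").write_text("# Title\n\nBody text.", encoding="utf-8")
    demo_docs = load_markdown_dir(tmp)

print(len(demo_docs), demo_docs[0]["page_content"][:7])
```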