### Load Source Data

In [1]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("speech.txt")
text_documents = loader.load()

text_documents

[Document(page_content="Training a large language model involves several steps. First, we gather a massive dataset of text, which can include books, websites, and other written material. Next, we preprocess this data by cleaning it, removing any irrelevant information, and converting it into a format that can be used for training. After that, we initialize the model with random weights and begin the training process. During training, the model is presented with a sequence of words and asked to predict the next word in the sequence. The model's predictions are compared to the actual next word, and the difference is used to update the model's weights. This process is repeated many times, with the model seeing the same data multiple times, until it is able to accurately predict the next word in a sequence. Finally, we evaluate the model on a separate dataset to ensure that it is able to generalize to new data.", metadata={'source': 'speech.txt'})]

In [2]:
import os
from dotenv import load_dotenv
load_dotenv()

os.environ['OPENAI_API_KEY'] = os.getenv("OPENAI_API_KEY")

In [3]:
# Web base loader
from langchain_community.document_loaders import WebBaseLoader
import bs4

loader = WebBaseLoader(web_path=("https://lilianweng.github.io/posts/2023-06-23-agent/", ), 
                       bs_kwargs=dict(parse_only=bs4.SoupStrainer(
                           class_=("post-title", "post-content", "post-headeer")
                       )),)

# Load the document
document = loader.load()

document

[Document(page_content='\n      LLM Powered Autonomous Agents\n    Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.\nAgent System Overview#\nIn a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:\n\nPlanning\n\nSubgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.\nReflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistakes and refine them for future steps, thereby improving the quality of final results.\n\n\nMemory\n\nShort-term memory: I would consider all the in-context learning (S

In [4]:
# PDF loader
from langchain_community.document_loaders import PyPDFLoader
import bs4

loader = PyPDFLoader("The Impatient Woman's Guide to Getting Pregnant.pdf")
# Load the document
document = loader.load()

document

[Document(page_content='', metadata={'source': "The Impatient Woman's Guide to Getting Pregnant.pdf", 'page': 0}),
 Document(page_content='T ha nk you f or  dow nl oa di ng t hi s  S i m on &  S c hus t e r  e book.\nG e t  a  F R E E  e book w he n you j oi n our  m a i l i ng l i s t . P l us , ge t  upda t e s  on\nne w  r e l e a s e s , de a l s , r e c om m e nde d r e a ds , a nd m or e  f r om  S i m on &\nS c hus t e r . C l i c k be l ow  t o s i gn up a nd s e e  t e r m s  a nd c ondi t i ons .\nC L I C K  H E R E  T O  S I G N  U P\nA l r e a dy a  s ubs c r i be r ?  P r ovi de  your  e m a i l  a ga i n s o w e  c a n r e gi s t e r  t hi s\ne book a nd s e nd you m or e  of  w ha t  you l i ke  t o r e a d. Y ou w i l l  c ont i nue  t o\nr e c e i ve  e xc l us i ve  of f e r s  i n your  i nbox.', metadata={'source': "The Impatient Woman's Guide to Getting Pregnant.pdf", 'page': 1}),
 Document(page_content='', metadata={'source': "The Impatient Woman's Guide to Gettin

### Transform Source Data

In [5]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

documents = text_splitter.split_documents(document)

documents[1]

Document(page_content='C o n t e n t s\nF or e w or d\nI nt r oduc t i on\n1:  W ha t  t o D o B e f or e  Y ou S t a r t  T r yi ng\n2:  D i e t a r y S uppl e m e nt s , E xe r c i s e , a nd t he  S O S  D i e t\n3:  N ot  t he  K i nd of  E gg T ha t  C om e s  i n a  C a r t on:  H ow  t o F i gur e  O ut  W he n\nY ou A r e  O vul a t i ng\n4:  H or i z ont a l  M a m bo:  T he  B e s t  T i m e  t o D a nc e  t he  B a by D a nc e\n5:  C hoos i ng t he  S e x of  Y our  B a by\n6:  N ot  a s  S c a r y a s  Y ou’ ve  H e a r d:  T he  R e a l  S t a t s  on A ge  a nd F e r t i l i t y\n7:  “ I f  O ne  M or e  P e r s on T e l l s  M e  t o ‘ J us t  R e l a x’  . . .” :  T he  P s yc hol ogi c a l\nS i de  of  G e t t i ng P r e gna nt\n8:  P e e i ng on M or e  S t i c ks :  T he  T w o- W e e k W a i t  a nd W he n t o T e s t  f or\nP r e gna nc y\n9:  S a d E ndi ngs :  M i s c a r r i a ge\n10:  W he n t o S e e  Y our  D oc t or  a nd W ha t  t o E xpe c t  W he n Y ou D

In [13]:
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma, FAISS 

db = FAISS.from_documents(documents[:5], OpenAIEmbeddings())

RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}

In [None]:
faiss.D