<a href="https://colab.research.google.com/github/raphaelroosewelt/langChain/blob/main/Building_AI_Applications_with_LangChain_and_GPT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# To perform this analysis, we need to install the following packages:
#### **openai**: For interacting with OpenAI's API.
#### **langchain**: A framework for developing applications with generative AI.
#### **langchain-openai** and **langchain-community**: LangChain integration packages; here they provide the OpenAI models and the DuckDB vector store.
#### **langgraph**: A package to orchestrate LLM systems.
#### **tiktoken**: OpenAI's tokenizer, which encodes strings into the tokens OpenAI models use; useful for estimating token usage.
#### **duckdb**: We will use DuckDB as a vector database.

In [None]:
# We tested the code-along with the following package versions
!pip install openai==1.63.2 \
             langchain==0.3.19 \
             langchain-core==0.3.40 \
             langchain-openai==0.3.6 \
             langchain-community==0.3.18 \
             langgraph==0.2.74 \
             tiktoken==0.9.0 \
             typing_extensions==4.12.2 \
             duckdb==1.2.0
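
Since the code-along was tested against specific versions, it can help to confirm what is actually installed before running the rest of the notebook. The following is a minimal sketch using only the standard library; `PINNED` and `find_version_mismatches` are illustrative names, not part of the code-along.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install cell
PINNED = {
    "openai": "1.63.2",
    "langchain": "0.3.19",
    "langchain-core": "0.3.40",
    "langchain-openai": "0.3.6",
    "langchain-community": "0.3.18",
    "langgraph": "0.2.74",
    "tiktoken": "0.9.0",
    "typing_extensions": "4.12.2",
    "duckdb": "1.2.0",
}

def find_version_mismatches(pinned):
    """Return {package: (expected, installed)} for every package that differs.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches

# Report anything that does not match the tested versions
for name, (expected, installed) in find_version_mismatches(PINNED).items():
    print(f"{name}: expected {expected}, found {installed}")
```

A mismatch is not necessarily a problem, but it is the first thing to check if an import or API call behaves differently from what the notebook shows.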



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
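# Import ReadTheDocsLoader (it lives in langchain-community in LangChain 0.3)
from langchain_community.document_loaders import ReadTheDocsLoader

# Create a loader for the `scikit-learn-docs` folder
loader = ReadTheDocsLoader(path="/content/drive/My Drive/1AOj1d2_hWwvcMotz9ziY0UHOPasDXQpS")

# Load the data
raw_documents = loader.load()

# Sanity check: zero documents means no files were found under the path
print(f"Loaded {len(raw_documents)} documents")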

In [None]:
# Import RecursiveCharacterTextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Create the text splitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=2500,
    chunk_overlap=200
)

# Split the documents
documents = splitter.split_documents(raw_documents)

In [None]:
# Import tiktoken
import tiktoken

# Create an encoder
tokenizer = tiktoken.encoding_for_model("text-embedding-3-large")

# Count tokens in each document
tokens_per_document = [len(tokenizer.encode(doc.page_content)) for doc in documents]

# Calculate the sum of all token counts
sum_of_tokens = sum(tokens_per_document)

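# Calculate a cost estimate. The 0.13 is a price in dollars per 1M tokens
# (the text-embedding-3-large rate assumed here); adjust it if you embed
# with a different model.
estimated_cost = sum_of_tokens / 1_000_000 * 0.13
print(f"Estimated cost is {estimated_cost}")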

Estimated cost is 0.0
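
The estimate is plain arithmetic: total tokens divided by one million, multiplied by the price per million tokens. Because the f-string prints the full float, an output of exactly `0.0` means the token total was zero, most likely because `documents` was empty when the cell ran. A minimal sketch with made-up token counts follows; the function name and the counts are illustrative, and the 0.13 rate is simply the one used in the cell.

```python
def estimate_embedding_cost(tokens_per_document, price_per_million_tokens):
    """Return the estimated embedding cost in dollars for a list of token counts."""
    total_tokens = sum(tokens_per_document)
    return total_tokens / 1_000_000 * price_per_million_tokens

# Hypothetical token counts for three chunks (1,000,000 tokens in total)
example_counts = [400_000, 350_000, 250_000]
print(estimate_embedding_cost(example_counts, 0.13))

# With no documents the total is zero, so the estimate is exactly 0.0
print(estimate_embedding_cost([], 0.13))
```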


In [None]:
from google.colab import userdata
import os

os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")

In [None]:
# Create the DuckDB connection
import duckdb
conn = duckdb.connect("embeddings.db")

# Import the DuckDB vectorstore
from langchain_community.vectorstores import DuckDB

# Import OpenAIEmbeddings
from langchain_openai.embeddings import OpenAIEmbeddings

# Create the embedding function
embedding_function = OpenAIEmbeddings(
    model="text-embedding-ada-002",
    chunk_size=500,
)

# Create a database from the documents and embedding function
db = DuckDB.from_documents(
    documents=documents,
    embedding=embedding_function,
    connection=conn,
    table_name="embeddings",
)

InvalidInputException: Invalid Input Error: Need a DataFrame with at least one column
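
The `InvalidInputException` is consistent with an empty `documents` list: with nothing to insert, the table handed to DuckDB has no columns. The earlier `Estimated cost is 0.0` output points the same way. Below is a minimal guard, assuming that diagnosis is right; `require_documents` is a hypothetical helper, not part of LangChain or DuckDB.

```python
def require_documents(documents, source="the loader path"):
    """Return `documents` unchanged, or raise early if there is nothing to embed."""
    if not documents:
        raise ValueError(
            f"No documents to embed. Check that {source} points to a folder "
            "containing the downloaded documentation files."
        )
    return documents

# Usage in the notebook (hypothetical): validate before building the store
# documents = require_documents(documents)
# db = DuckDB.from_documents(
#     documents=documents,
#     embedding=embedding_function,
#     connection=conn,
#     table_name="embeddings",
# )
```

Failing early this way surfaces the real problem (no files were loaded) instead of a database error several cells later, and it avoids calling the embeddings API with nothing to embed.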

To use the OpenAI API, you'll need an API key. If you don't already have one, you can create one on the OpenAI website.
In Colab, add the key to the secrets manager under the "🔑" icon in the left panel and name it `OPENAI_API_KEY`. Then pass the key to the SDK by copying it into the `OPENAI_API_KEY` environment variable, which is what the `userdata.get("OPENAI_API_KEY")` cell does:
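
A minimal sketch of that step that also runs outside Colab. It assumes the key is either already in the environment or available from a secrets lookup; `set_openai_key` and its `secret_getter` parameter are illustrative names, not part of the OpenAI or LangChain APIs.

```python
import os

def set_openai_key(secret_getter=None, name="OPENAI_API_KEY"):
    """Make sure the API key is in the environment and return it.

    An existing environment variable wins; otherwise `secret_getter(name)`
    is called if one was given (for example Colab's `userdata.get`).
    """
    key = os.environ.get(name)
    if not key and secret_getter is not None:
        key = secret_getter(name)
    if not key:
        raise RuntimeError(f"{name} is not set; add it to the Colab secrets manager.")
    os.environ[name] = key
    return key

# In Colab, use the secrets manager; elsewhere, rely on the environment only
try:
    from google.colab import userdata
    secret_getter = userdata.get
except ImportError:
    secret_getter = None

# Usage (uncomment in the notebook):
# set_openai_key(secret_getter)
```

The OpenAI SDK and `langchain-openai` read `OPENAI_API_KEY` from the environment by default, so no key needs to appear in the notebook itself.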