<a href="https://colab.research.google.com/github/raphaelroosewelt/langChain/blob/main/_Building_AI_Applications_with_LangChain_and_GPT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To perform this analysis, we need to install the following packages:
- **openai**: For interacting with OpenAI's API.
- **langchain**: A framework for developing applications with generative AI.
- **langchain-openai** and **langchain-community**: LangChain extension modules for OpenAI and DuckDB functionality.
- **langgraph**: A package for orchestrating LLM systems.
- **tiktoken**: A tokenizer that encodes strings into the tokens used by OpenAI models, useful for estimating token usage.
- **duckdb**: We will use DuckDB as a vector database.

In [1]:
# We tested the code-along with the following package versions
!pip install openai==1.63.2 \
             langchain==0.3.19 \
             langchain-core==0.3.40 \
             langchain-openai==0.3.6 \
             langchain-community==0.3.18 \
             langgraph==0.2.74 \
             tiktoken==0.9.0 \
             unstructured[all-docs] \
             typing_extensions==4.12.2 \
             duckdb==1.2.0 > /dev/null 2>&1

In [2]:
# Reinstall numpy and pandas with compatible versions
!pip install numpy==1.24.4 pandas==2.1.4 --force-reinstall --no-cache-dir > /dev/null 2>&1

## Task 1: Load Data

To embed and store data, we need to provide LangChain with `Document` objects. This can be easily achieved using LangChain's [Document Loaders](https://python.langchain.com/docs/concepts/document_loaders/). For our project, we will use `DirectoryLoader` together with `UnstructuredHTMLLoader` to load the scikit-learn documentation. The documentation files are located in the `scikit-learn-docs` folder, which contains all the HTML files from the scikit-learn documentation (https://scikit-learn.org/dev/versions.html).

Our goal is to load these HTML files as `Document` objects. `DirectoryLoader` walks the directory and hands every file matching the glob pattern to `UnstructuredHTMLLoader`, which strips out the HTML tags and converts the content into a `Document`. By the end of this task, we will have a variable `raw_documents` that contains a list of `Document` objects, with each `Document` corresponding to an HTML file.

Note: In this step, we are not loading the documents into a database; we are simply loading them into a list.

### Instructions
1. Import `DirectoryLoader` and `UnstructuredHTMLLoader` from `langchain.document_loaders`.
2. Create the loader, pointing to the `scikit-learn-docs` directory.
3. Load the data into `raw_documents` by calling `loader.load()`.
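
To make the idea of a document loader concrete, here is a minimal standard-library sketch of the same transformation: read HTML, drop the tags, and wrap the remaining text with its source path. `SimpleDocument` and `html_to_document` are hypothetical names used only for illustration; they are not part of LangChain, and the real loaders handle far more cases.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for LangChain's Document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class _TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.parts.append(text)


def html_to_document(html: str, source: str) -> SimpleDocument:
    """Strip tags from an HTML string and keep the source path as metadata."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return SimpleDocument("\n\n".join(parser.parts), {"source": source})


doc = html_to_document(
    "<html><body><h1>sklearn.svm</h1><p>Support vector machines.</p></body></html>",
    source="api/sklearn.svm.html",
)
print(doc.page_content)
```

Each loaded page ends up as plain text plus a `source` entry in its metadata, which is the same shape as the `Document` objects printed later in this notebook.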

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
from google.colab import userdata
import os
os.environ["TokenAgentKey"] = userdata.get("TokenAgentKey")

In [5]:
import os

# Base path
path = '/content/drive/My Drive/Docs/scikit-learn-docs'

# List the contents
for root, dirs, files in os.walk(path):
    print(f"📁 {root}")
    for file in files:
        print(f"   └── {file}")

📁 /content/drive/My Drive/Docs/scikit-learn-docs
📁 /content/drive/My Drive/Docs/scikit-learn-docs/api
   └── sklearn.kernel_approximation.html
   └── sklearn.covariance.html
   └── sklearn.ensemble.html
   └── sklearn.base.html
   └── deprecated.html
   └── sklearn.semi_supervised.html
   └── sklearn.neighbors.html
   └── sklearn.metrics.html
   └── sklearn.linear_model.html
   └── sklearn.kernel_ridge.html
   └── sklearn.utils.html
   └── sklearn.gaussian_process.html
   └── sklearn.tree.html
   └── sklearn.isotonic.html
   └── sklearn.calibration.html
   └── sklearn.multiclass.html
   └── sklearn.frozen.html
   └── sklearn.neural_network.html
   └── sklearn.feature_selection.html
   └── sklearn.datasets.html
   └── sklearn.random_projection.html
   └── sklearn.decomposition.html
   └── sklearn.discriminant_analysis.html
   └── sklearn.svm.html
   └── index.html
   └── sklearn.inspection.html
   └── sklearn.manifold.html
   └── sklearn.cluster.html
   └── sklearn.html
   └── sklearn.e

## Task 2: Slice the documents into smaller chunks

In the previous step, we turned each HTML file into a Document. These files may be very long and potentially too large to embed fully. It's also a good practice to avoid embedding large documents:
- Long documents often contain several concepts. Retrieval will be easier if each concept is indexed separately.
- Retrieved documents will be injected into a prompt, so keeping them short will keep the prompt small(ish).

LangChain has a collection of tools to do this: [Text Splitters](https://python.langchain.com/docs/concepts/text_splitters/). In our case, we'll use the simplest one: the [Recursive Character Text Splitter](https://python.langchain.com/docs/how_to/recursive_text_splitter/). It recursively reduces the input by splitting it by paragraph, then by sentence, then by word, as needed, until each chunk is small enough.
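
The recursive idea can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's implementation: it ignores `chunk_overlap`, and the function name `recursive_split` and its default separator list are assumptions made for the example.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters.

    The coarsest separator is tried first; a finer separator is used only
    for pieces that are still too long.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small neighbouring pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # This piece alone is too long: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


chunks = recursive_split("alpha beta\n\ngamma delta epsilon\n\nzeta", chunk_size=12)
print(chunks)  # ['alpha beta', 'gamma delta', 'epsilon', 'zeta']
```

The first and last paragraphs fit within 12 characters and stay whole, while the middle paragraph is too long and is split again on spaces.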

### Instructions
1. Import the `RecursiveCharacterTextSplitter` from `langchain.text_splitter`.
2. Create a text splitter configured with `chunk_size=5000` and `chunk_overlap=200`.  
   _These values are arbitrary, and you'll need to try different ones to see which best serve your use case._
3. Split the `raw_documents` and store them as `documents`, using the `.split_documents()` method.

In [6]:
from langchain.document_loaders import DirectoryLoader, UnstructuredHTMLLoader

loader = DirectoryLoader(
    path='/content/drive/My Drive/Docs/scikit-learn-docs',
    glob='**/*.html',
    loader_cls=UnstructuredHTMLLoader
)

raw_documents = loader.load()
print(f"Total loaded documents: {len(raw_documents)}")



Total loaded documents: 663


In [7]:
print(f"Total documents: {len(raw_documents)}")
print(raw_documents[:1])  # Show the first item (if any)

Total documents: 663
[Document(metadata={'source': '/content/drive/My Drive/Docs/scikit-learn-docs/api/sklearn.kernel_approximation.html'}, page_content='sklearn.kernel_approximation#\n\nApproximate kernel feature maps based on Fourier transforms and count sketches.\n\nUser guide. See the Kernel Approximation section for further details.\n\nAdditiveChi2Sampler Approximate feature map for additive chi2 kernel. Nystroem Approximate a kernel map using a subset of the training data. PolynomialCountSketch Polynomial kernel approximation via Tensor Sketch. RBFSampler Approximate a RBF kernel feature map using random Fourier features. SkewedChi2Sampler Approximate feature map for "skewed chi-squared" kernel.\n\nprevious\n\nisotonic_regression\n\nnext\n\nAdditiveChi2Sampler')]


In [8]:
# Import RecursiveCharacterTextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Create the text splitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=5000,
    chunk_overlap=200
)

# Split the documents
documents = splitter.split_documents(raw_documents)

## Task 3: Count Tokens and Estimate Embedding Cost

We're now ready to embed our documents. Before we proceed, it's important to understand the size of our data and estimate the cost of embedding. We'll use the [`tiktoken`](https://github.com/openai/tiktoken) library for this purpose. `tiktoken` encodes text into tokens and decodes tokens back into text, which lets us count the tokens in each of our documents.

> 💡 To better understand what a token is in the context of GPT, visit [OpenAI's Tokenizer page](https://platform.openai.com/tokenizer) to see how text translates into tokens.

You can find the pricing for different models on OpenAI's [pricing page](https://platform.openai.com/docs/pricing).

### Instructions
1. Import the `tiktoken` library.
2. Create a tokenizer for the `text-embedding-3-small` model using the `.encoding_for_model()` method.
3. Count the tokens in each document using the `.encode()` method.
4. Calculate the total number of tokens.
5. Estimate the cost. The `text-embedding-3-small` model costs `$0.02` per 1M tokens (`text-embedding-3-large` costs `$0.13`).
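
The cost estimate in the last step is a single proportion: total tokens divided by one million, times the price per million tokens. The sketch below works through it with a hypothetical total of 250,000 tokens; the helper name `estimate_embedding_cost` is an assumption made for illustration.

```python
def estimate_embedding_cost(total_tokens: int, usd_per_million_tokens: float) -> float:
    """Return the estimated embedding cost in US dollars."""
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical token total, priced with the two rates mentioned in this task.
example_tokens = 250_000
cost_small = estimate_embedding_cost(example_tokens, 0.02)  # text-embedding-3-small
cost_large = estimate_embedding_cost(example_tokens, 0.13)  # text-embedding-3-large
print(f"small: ${cost_small:.4f}, large: ${cost_large:.4f}")
```

For the same text, the larger model is 6.5 times more expensive, so it is worth running this estimate before choosing a model.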

In [9]:
# Import tiktoken
import tiktoken

# Create an encoder
tokenizer = tiktoken.encoding_for_model("text-embedding-3-small")  # any embedding model works here; see https://platform.openai.com/docs/pricing

# Count tokens in each document
tokens_per_document = [len(tokenizer.encode(doc.page_content)) for doc in documents]

# Calculate the sum of all token counts
sum_of_tokens = sum(tokens_per_document)

# Calculate a cost estimate
estimated_cost = sum_of_tokens/1_000_000 * 0.02
print(f"Estimated cost is {estimated_cost}")

Estimated cost is 0.005041759999999999


## Task 4: Embed the Documents and Store Embeddings in the Vector Database

We are now ready to embed our documents. Since embedding incurs a cost, we will save the embeddings into a database. LangChain simplifies this process using a [Vector Store](https://python.langchain.com/docs/concepts/vectorstores/).

There are many vector stores to choose from (see the [full list](https://python.langchain.com/docs/integrations/vectorstores/)). Today, we will use [DuckDB](https://duckdb.org/), but you can use any other as they share the same interface in LangChain. Each vector store has unique features (like metadata filtering), so explore them to find the best fit for your use case.

DuckDB is a local analytical database management system designed for fast execution of complex queries. It is particularly well-suited for analytical workloads and can be embedded directly into your application. In addition to its traditional database capabilities, DuckDB supports vector operations, such as similarity search, making it a versatile choice for storing and querying document embeddings. Furthermore, it comes pre-installed in this online notebook environment.

### Instructions
1. Import `duckdb` and create a database connection using `duckdb.connect`. Store the database in a single file (e.g., `"embeddings.db"`), which you can pass to the `connect` function.
2. Import `DuckDB` from `langchain_community.vectorstores`.
3. Import `OpenAIEmbeddings` from `langchain_openai`.
4. Create the embedding function. Set the model to `"text-embedding-3-small"` and set a chunk size of `500`. Setting the chunk size is necessary because we have too many documents to embed in a single request.
5. Create a database from our documents using `DuckDB.from_documents()`. Pass the documents, embedding function, the previously created DuckDB connection, and set the table name to `"embeddings"`.  
   **Warning: Executing this will embed thousands of documents and will cost about $0.005042**
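
To illustrate what the chunk size in step 4 does, here is a small sketch of batching a list of texts so that no single request carries more than 500 of them. This assumes `chunk_size` is the maximum number of texts per embedding request; the `batched` helper is hypothetical and only mimics that behaviour without calling any API.

```python
from typing import Iterator, List, Sequence


def batched(texts: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of texts, each with at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])


# 1,200 hypothetical chunks would be sent as three requests.
texts = [f"chunk {i}" for i in range(1200)]
batch_sizes = [len(batch) for batch in batched(texts, 500)]
print(batch_sizes)  # [500, 500, 200]
```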

In [11]:
import os
# Replace with the value of your key
#os.environ["TokenAgentKey"] = "TokenAgentKey"

In [15]:
# Create the DuckDB connection
import duckdb
conn = duckdb.connect("embeddings.db")

# Import the DuckDB vectorstore
from langchain_community.vectorstores import DuckDB

# Import OpenAIEmbeddings
from langchain_openai.embeddings import OpenAIEmbeddings

# Create the embedding function
embedding_function = OpenAIEmbeddings(
    model="text-embedding-3-small",
    chunk_size=500,
    openai_api_key=os.environ["TokenAgentKey"],  # key loaded from the Colab secrets manager earlier
)

To use the OpenAI API, you'll need an API key. If you don't already have one, you can create one on the OpenAI website.
In Colab, add the key to the secrets manager under the "🔑" in the left panel. This notebook stores it under the name `TokenAgentKey` and reads it with `userdata.get()`; the key is then passed to the SDK through the `openai_api_key` argument.

In [16]:
# Create a database from the documents and embedding function
db = DuckDB.from_documents(
    documents=documents,  # the chunks produced by the text splitter
    embedding=embedding_function,
    connection=conn,
    table_name="embeddings",
)

## Task 5: Query the Vector Database

Now that we have a vector database, we can query it. A vector database stores embeddings (vectors) and allows searching through them using the K-Nearest Neighbors algorithm (or a variation of it). When we query it, the following steps will occur:
1. Embed the text query to obtain a vector. It is crucial that this embedding is made using the same embedding technique that was used to embed the documents.
2. Calculate the distance (or similarity) between the query vector and all other vectors.
3. Sort results by similarity.
4. Return the most similar documents.
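
The steps above can be sketched as a brute-force search in plain Python. The three-dimensional vectors are hypothetical stand-ins for real embeddings, and cosine similarity is used here as one common choice of similarity measure; a vector store may use a different metric or an approximate index.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vector, stored, k=2):
    """Score every stored vector against the query and return the k best."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical (text, embedding) pairs; real embeddings have many more dimensions.
stored = [
    ("KNeighborsClassifier", [0.9, 0.1, 0.0]),
    ("RandomForestClassifier", [0.1, 0.9, 0.0]),
    ("NearestNeighbors", [0.8, 0.2, 0.1]),
]
query_vector = [1.0, 0.0, 0.0]  # stands in for the embedded query text

nearest = top_k(query_vector, stored, k=2)
for score, text in nearest:
    print(f"{score:.3f}  {text}")
```

Scoring every stored vector is fine for a few thousand documents; larger collections usually rely on an index to avoid the full scan.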

To do this with LangChain, we can use the `.similarity_search()` method of the database.

### Instructions
1. Call the `similarity_search` method on `db` with the search query as a parameter. Store the results in `results`.
2. Display the results.

In [18]:
# Call the `similarity_search` method on `db`
results = db.similarity_search("oi")

# Show the results
results[0]

Document(metadata={'source': '/content/drive/My Drive/Docs/scikit-learn-docs/modules/generated/oas-function.html', '_similarity_score': 0.26869213082374366}, page_content='oas#\n\nprevious\n\nledoit_wolf_shrinkage\n\nnext\n\nshrunk_covariance\n\nOn this page\n\nThis Page\n\nShow Source')

## Task 6: Create a LangGraph Graph

In this step, we will create a LangGraph graph to handle the retrieval and generation of answers based on our vector database. LangGraph allows us to define a sequence of operations (or states) that our data will go through. We will define two main functions: `retrieve` and `generate`. The `retrieve` function will query our vector database to get the most relevant documents, and the `generate` function will use a language model to generate an answer based on these documents.

### Instructions
1. We saved you some time and already included the necessary imports.

2. Create the prompt template and chat model:
    - Use `hub.pull` to pull the prompt template named `"rlm/rag-prompt"`. This will download the publicly defined prompt template: https://smith.langchain.com/hub/rlm/rag-prompt.
    - Initialize the chat model using `init_chat_model` with the model name `"gpt-4.1-nano"` and specify the model provider as `"openai"`. Set the `temperature` to `0`.

3. Set up your State structure:
    - Define a class `State` that inherits from `TypedDict`.
    - The `State` class should have three fields: `context` (a list of `Document` objects), `question` (a string), and `answer` (a string).

4. Define the `retrieve` function which will be used as a `Node` in your graph:
    - Create a function named `retrieve` that takes a `state` parameter of type `State`.
    - Inside the function, query the vector database using the `similarity_search` method with the question from the state.
    - Return a dictionary with the key `"context"` and the retrieved documents as the value.

5. Define the `generate` function which will be used as a `Node` in your graph:
    - Create a function named `generate` that takes a `state` parameter of type `State`.
    - Inside the function, concatenate the content of the documents in a single string.
    - Use the prompt template to create messages by invoking it with a dictionary containing the concatenated context and the question.
    - Generate a response by invoking the chat model with the messages.
    - Return a dictionary with the key `"answer"` and the generated response as the value.

6. Build and compile the graph using the state and the functions:
    - Initialize a `StateGraph` object with the `State` class.
    - Add a sequence of the `retrieve` and `generate` functions to the graph builder.
    - Add an edge from `START` to the `"retrieve"` node.
    - Compile the graph using the `compile` method of the graph builder.

By following these steps, you will create a LangGraph graph that can retrieve relevant documents from the vector database and generate answers using a language model.
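
As a mental model for how state flows through such a graph, the sketch below runs two functions in sequence and merges each function's returned dictionary into a shared state. It is a toy illustration in plain Python, not LangGraph's implementation; `run_sequence`, `toy_retrieve`, and `toy_generate` are hypothetical names, and the retrieval and generation steps are stubbed out.

```python
from typing import Callable, Dict, List

ToyState = Dict[str, object]
ToyNode = Callable[[ToyState], ToyState]


def run_sequence(nodes: List[ToyNode], initial_state: ToyState) -> ToyState:
    """Run nodes in order, merging each node's partial update into the state."""
    state = dict(initial_state)
    for node in nodes:
        state.update(node(state))
    return state


def toy_retrieve(state: ToyState) -> ToyState:
    # Hypothetical stand-in for db.similarity_search(state["question"]).
    return {"context": ["KNeighborsClassifier implements k-nearest neighbors voting."]}


def toy_generate(state: ToyState) -> ToyState:
    # Hypothetical stand-in for the prompt template and chat model call.
    context = "\n\n".join(state["context"])
    return {"answer": f"Based on: {context}"}


final_state = run_sequence(
    [toy_retrieve, toy_generate],
    {"question": "How do I run k-NN?"},
)
print(final_state["answer"])
```

Each node returns only the keys it is responsible for, which is why `retrieve` returns `"context"` and `generate` returns `"answer"` in the instructions above.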

In [33]:
# Add the necessary imports
import os
from langchain import hub
from langchain_core.documents import Document
from typing_extensions import List, TypedDict
from langchain.chat_models import init_chat_model
from langgraph.graph import START, StateGraph

# Create the prompt template and chat model
prompt = hub.pull("rlm/rag-prompt")
llm = init_chat_model(
    model="gpt-4.1-nano",
    model_provider="openai",
    openai_api_key=os.environ["TokenAgentKey"],  # key loaded from the Colab secrets manager earlier
    temperature=0,
)

# Set up your State structure
class State(TypedDict):
    question: str
    context: List[Document]
    answer: str

# Define the retrieve function
def retrieve(state: State):
    retrieved_documents = db.similarity_search(
        state["question"]
    )
    return {
        "context": retrieved_documents,
    }

# Define the generate function
def generate(state: State):
    context = "\n\n".join([
        doc.page_content
        for doc in state["context"]
    ])
    messages = prompt.invoke({
        "question": state["question"],
        "context": context,
    })
    response = llm.invoke(messages)
    return {
        "answer": response.content,
    }

# Build and compile the graph using the state and the functions
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, 'retrieve')
graph = graph_builder.compile()



In [35]:
# Invoke the graph with your question
response = graph.invoke({
    "question": "How can I do k-nearest with Scikit Learn? Answer like a Pirate.",
})

# Display the answer
from IPython.display import display, Markdown
display(Markdown(response["answer"]))

Arrr, to do k-nearest in Scikit Learn, ye use the `KNeighborsClassifier` or `KNeighborsRegressor` for classification and regression, respectively. First, ye fit the model with `fit(X, y)` where X be yer data and y be the labels, then call `kneighbors` to find the closest mates. Ye can also use `kneighbors_graph` to see the neighbor connections, aye!