https://www.youtube.com/watch?v=tBJ-CTKG2dM&t=801s&ab_channel=JamesBriggs

In [15]:
import requests

In [16]:
res = requests.get("https://python.langchain.com/en/latest/")
res

<Response [200]>

In [17]:
from bs4 import BeautifulSoup
import urllib.parse
import html
import re

In [18]:
domain = "https://python.langchain.com/"
domain_full = domain+"en/latest/"

In [19]:
soup = BeautifulSoup(res.text, 'html.parser')

# Find all links to local pages on the website

local_links = []
for link in soup.find_all('a',href=True):
    href=link['href']
    if href.startswith(domain) or href.startswith('./') \
        or href.startswith('/') or href.startswith('modules') \
        or href.startswith('user_cases'):
        local_links.append(urllib.parse.urljoin(domain_full,href))

# Find the main content using CSS selectors
main_content = soup.select('body main')[0]

# Extract the HTML code of the main content
main_content_html = str(main_content)

# Extract the plaintext of the main content
main_content_text = main_content.get_text()

# Remove all HTML tags
main_content_text = re.sub(r'<[^>]+>','',main_content_text)

# Remove extract white space
main_content_text = ' '.join(main_content_text.split())

# Replace HTML entities with their corresponding characters
main_content_text = html.unescape(main_content_text)

print(main_content_text)

.rst .pdf Welcome to LangChain Contents Getting Started Modules Use Cases Reference Docs LangChain Ecosystem Additional Resources Welcome to LangChain# LangChain is a framework for developing applications powered by language models. We believe that the most powerful and differentiated applications will not only call out to a language model via an API, but will also: Be data-aware: connect a language model to other sources of data Be agentic: allow a language model to interact with its environment The LangChain framework is designed with the above principles in mind. This is the Python specific portion of the documentation. For a purely conceptual guide to LangChain, see here. For the JavaScript documentation, see here. Getting Started# Checkout the below guide for a walkthrough of how to get started using LangChain to create an Language Model application. Getting Started Documentation Modules# There are several main modules that LangChain provides support for. For each module we provid

In [20]:
def scrape(url: str):
    res = requests.get(url)
    if res.status_code != 200:
        print(f"{res.status_code} for '{url}'")
        return None
    soup = BeautifulSoup(res.text, 'html.parser')

    # Find all links to local pages on the website
    local_links = []
    for link in soup.find_all('a',href=True):
        href=link['href']
        if href.startswith(domain) or href.startswith('./') \
            or href.startswith('/') or href.startswith('modules') \
            or href.startswith('user_cases'):
            local_links.append(urllib.parse.urljoin(domain_full,href))

    # Find the main content using CSS selectors
    main_content = soup.select('body main')[0]

    # Extract the HTML code of the main content
    main_content_html = str(main_content)

    # Extract the plaintext of the main content
    main_content_text = main_content.get_text()

    # Remove all HTML tags
    main_content_text = re.sub(r'<[^>]+>','',main_content_text)

    # Remove extract white space
    main_content_text = ' '.join(main_content_text.split())

    # Replace HTML entities with their corresponding characters
    main_content_text = html.unescape(main_content_text)

    # Return as JSON
    return {
        "url":url,
        "text":main_content_text
    }, local_links

In [21]:
links = ["https://python.langchain.com/en/latest/"]
scraped = set()
data = []

while True:
    if len(links) == 0:
        print("Complete")
        break
    url = links[0]
    print(url)
    res = scrape(url)
    scraped.add(url)
    if res is not None:
        page_content, local_links = res
        data.append(page_content)
        # add new links to links list
        links.extend(local_links)
        # remove duplicates
        links = list(set(links))
    # remove links 
    links = [link for link in links if link not in scraped]

https://python.langchain.com/en/latest/
https://python.langchain.com/en/latest/modules/agents/tools/examples/bing_search.html
https://python.langchain.com/en/latest/modules/models/llms/examples/async_llm.html
https://python.langchain.com/en/latest/modules/models/chat.html
https://python.langchain.com/en/latest/modules/chains/generic/transformation.html
https://python.langchain.com/en/latest/modules/models/llms/getting_started.html
https://python.langchain.com/en/latest/modules/chains/examples/llm_summarization_checker.html
https://python.langchain.com/en/latest/modules/prompts/example_selectors/examples/custom_example_selector.html
https://python.langchain.com/en/latest/modules/chains/index_examples/qa_with_sources.html
https://python.langchain.com/en/latest/modules/agents/tools/examples/requests.html
https://python.langchain.com/en/latest/tracing.html
https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/chatgpt_loader.html
https://python.langchain.com/en/la

In [22]:
import tiktoken

In [23]:
tokenizer = tiktoken.get_encoding('p50k_base')

# create the length function
def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

In [24]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 20,
    length_function = tiktoken_len,
    separators=["\n\n", "\n", " ",""]
)

Process the data into more chunks using this approach.

In [25]:
from uuid import uuid4
from tqdm.auto import tqdm

chunks = []

for idx, record in enumerate(tqdm(data)):
    texts = text_splitter.split_text(record['text'])
    chunks.extend([{
        'id': str(uuid4()),
        'text': texts[i],
        'chunk': i,
        'url': record ['url']
    } for i in range(len(texts))])

100%|██████████| 323/323 [00:02<00:00, 137.73it/s]


Our chunks are ready so now we move onto embedding and indexing everything

# Initialize Embedding Model

We use text-embedding-ada-002 as the embedding model. We can embed text like so.

In [26]:
import openai
import os
from dotenv import load_dotenv

load_dotenv()
# environment variables

True

In [27]:
# Initialize openai API key
openai.api_key = os.getenv("OPENAI_API_KEY")

embed_model = "text-embedding-ada-002"

res = openai.Embedding.create(
    input=[
        "Sample document text goes here",
        "there will be several phrases in each batch"
    ], engine=embed_model
)

In the response res we will find a JSON-like object containing our new embeddings within the 'data' field.

In [28]:
res.keys()

dict_keys(['object', 'data', 'model', 'usage'])

Inside 'data' we will find two records, one for each of the two sentences we just embedded. Each vector embedding contains 1536 dimensions - we may just switch to another method of embedding from the other video

In [29]:
len(res['data'])

2

In [30]:
len(res['data'][0]['embedding']), len(res['data'][1]['embedding'])

(1536, 1536)

 We will apply this same embedding logic to the langchain docs dataset we've just scraped. But before doing so we must create a place to store the embeddings.

# Initializing the Index

Now we need to place to store these embeddings and enable a effecient vector search through them all. To do that we use Pinecone, we can get a free API key and enter it below where we will initialize our connection to Pinecone and create a new index.

In [31]:
import pinecone

In [32]:
index_name = 'bloom'

# Initialize connection to pinecone
pinecone.init(
    api_key=os.getenv("PINECONE_API_KEY"),
    environment="us-central1-gcp"
)

# check if index already exists (it shouldn't if this is first time)
if index_name not in pinecone.list_indexes():
    # if does not exist, create index
    pinecone.create_index(
        index_name,
        dimension=len(res['data'][0]['embedding']),
        metric='dotproduct'
    )

# Connect to index
index = pinecone.Index(index_name)

# view index stats
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 4},
                'Reseach The Coca Cola Company business model': {'vector_count': 52},
                'test-user-1@openai.com': {'vector_count': 4},
                'test-user-2@openai.com': {'vector_count': 4},
                'test-user-3@openai.com': {'vector_count': 8}},
 'total_vector_count': 72}

 We can see the index is currenlty empty with a total_vector_count of 0. We can begin populating it with OpenAI text-embedding-ada-002 built embeddings like so

In [33]:
from tqdm.auto import tqdm
import datetime
from time import sleep

batch_size = 100 # how many embedding we create and insert at once

for i in tqdm(range(0, len(chunks), batch_size)):
    # find end of batch
    i_end = min(len(chunks), i+batch_size)
    meta_batch = chunks[i:i_end]
    # get ids
    ids_batch = [x['id'] for x in meta_batch]
    # get texts to encode
    texts = [x['text'] for x in meta_batch]
    # create embeddings (try-except added to avoice RateLimitError)
    try:
        res = openai.Embedding.create(inpute=texts, engine=embed_model)
    except:
        done = False
        while not done:
            sleep(5)
            try:
                res = openai.Embedding.create(input=texts, engine=embed_model)
                done = True
            except:
                pass
    embeds = [record['embedding'] for record in res['data']]
    # cleanup metadata
    meta_batch = [{
        'text': x['text'],
        'chunk': x['chunk'],
        'url': x['url'] 
    } for x in meta_batch]
    to_upsert = list(zip(ids_batch, embeds, meta_batch))
    # upsert to Pinecone
    index.upsert(vectors=to_upsert)
    

100%|██████████| 16/16 [01:45<00:00,  6.61s/it]


Now we've added all of our langchain docs to the index. With that we can move on to retrieve and then answer generation using GPT-4

# Retrieval

To search through our documents we first need to create a query vector xq. Using xq we retrieve the most relevant chunks from the LancChain docs, like so:

In [34]:
# After first load you can now just start from here:
# POST INITITAL LOAD

import pinecone
import openai
import os
from dotenv import load_dotenv

load_dotenv()
# environment variables

# Initialize openai API key
openai.api_key = os.getenv("OPENAI_API_KEY")

embed_model = "text-embedding-ada-002"

In [35]:
# POST INITITAL LOAD

index_name = 'bloom'

# Initialize connection to pinecone
pinecone.init(
    api_key=os.getenv("PINECONE_API_KEY"),
    environment="us-central1-gcp"
)

# check if index already exists (it shouldn't if this is first time)
if index_name not in pinecone.list_indexes():
    # if does not exist, create index
    pinecone.create_index(
        index_name,
        dimension=len(res['data'][0]['embedding']),
        metric='dotproduct'
    )

# Connect to index
index = pinecone.Index(index_name)

# view index stats
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 1555},
                'Reseach The Coca Cola Company business model': {'vector_count': 52},
                'test-user-1@openai.com': {'vector_count': 4},
                'test-user-2@openai.com': {'vector_count': 4},
                'test-user-3@openai.com': {'vector_count': 8}},
 'total_vector_count': 1623}

In [36]:
query = "How many namespaces can you have in an index??"

res = openai.Embedding.create(
    input=[query],
    engine=embed_model
)

# retrieve from Pinecone
xq = res['data'][0]['embedding']

# get relevant contexts (including the questions)
res = index.query(xq, top_k=5, include_metadata=True)

res

{'matches': [{'id': 'a0daca02-4813-4bb5-b6bb-e303b388a296',
              'metadata': {'chunk': 2.0,
                           'text': '24626,2006,8155,7381,2,3,15965,872,9626,10008,7,1922,5784,3995,19130,2261,14763,6304,2008,18192,927,14678,4531,14,82,16514,3692,109,1513,899,879,2226,2751,1854,1931,156,8524,2426,721,1021,904,1423,4415,988,3030,426,5684,1411,23,867,2685,4720,1300,504,567,6974,9,184,26,469,2238,5,1648,109,1127,450,6708,5318,1002,258,3392,1991,4,29,212,2,375,537,1046,314,1720,78,890,1861,1,1172,2275,129,29,632,274,599,731,1305,392,307,536,592,87,113,762,845,2552,3788,220,669,3,750,1174,601,310,611,27,54,49,398,51,',
                           'url': 'https://python.langchain.com/en/latest/modules/agents/tools/examples/requests.html'},
              'score': 0.756360114,
              'values': []},
             {'id': '21e87429-9bfe-4c17-90de-1b936d5014b1',
              'metadata': {'chunk': 2.0,
                           'text': 'from langchain.indexes import '
     

With retrieval complete, we move on to fedding these into GPT-4 to produce answers.

# Retrieval Augmented Generation

GPT-4 is currently accessed via the ChatCompletions endpoint of OpenAI. To add the information we retrieved into the model, we need to pass it into our user prompts alongside our original query. We can do that like so:

In [37]:
# get list of retrieved text
contexts = [item['metadata']['text'] for item in res['matches']]

augmented_query = "\n\n---\n\n".join(contexts)+"\n\n-----\n\n"+query

In [38]:
print(augmented_query)

24626,2006,8155,7381,2,3,15965,872,9626,10008,7,1922,5784,3995,19130,2261,14763,6304,2008,18192,927,14678,4531,14,82,16514,3692,109,1513,899,879,2226,2751,1854,1931,156,8524,2426,721,1021,904,1423,4415,988,3030,426,5684,1411,23,867,2685,4720,1300,504,567,6974,9,184,26,469,2238,5,1648,109,1127,450,6708,5318,1002,258,3392,1991,4,29,212,2,375,537,1046,314,1720,78,890,1861,1,1172,2275,129,29,632,274,599,731,1305,392,307,536,592,87,113,762,845,2552,3788,220,669,3,750,1174,601,310,611,27,54,49,398,51,

---

from langchain.indexes import VectorstoreIndexCreator index = VectorstoreIndexCreator().from_loaders([loader]) Running Chroma using direct local API. Using DuckDB in-memory for database. Data will be transient. Now that the index is created, we can use it to ask questions of the data! Note that under the hood this is actually doing a few steps as well, which we will cover later in this guide. query = "What did the president say about Ketanji Brown Jackson" index.query(query) " The preside

Now we ask the question:

In [39]:
# system message to 'prime' the model
primer = f"""You are Q&A bot. A highly intelligent system that answers user questions based on the information provided by the user above each question. If the information can not be found in the information provided by the user you truthfully say "I don't know"."""

In [40]:
res = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": primer},
        {"role": "user", "content": augmented_query}
    ]
)

To display this response nicely, we will display it in markdown

In [41]:
from IPython.display import Markdown

display(Markdown(res['choices'][0]['message']['content']))

I don't know.

Let's compare this to a non-augmented query...

In [42]:
res = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": primer},
        {"role": "user", "content": query}
    ]
)
display(Markdown(res['choices'][0]['message']['content']))

I don't know.

What if we dropped the "I don't know" part of the primer?

In [43]:
res = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are Q&A bot. A highly intelligent system that answers user questions"},
        {"role": "user", "content": query}
    ]
)
display(Markdown(res['choices'][0]['message']['content']))

The number of namespaces you can have in an index varies depending on the system or platform being used. In general, there is no strict limit, but best practices recommend keeping the number of namespaces manageable and organized to ensure optimal performance and maintainability.