# Web Page Indexing and Vectorization 👀

This Jupyter notebook contains a script that performs indexing and vectorization of web page contents. The primary purpose of this script is to crawl through a specified web page, extract the textual contents, and subsequently store these contents as vector objects in a database.

The vectorized content can then be used in a Retrieval-Augmented Generation (RAG) flow to answer questions with a Large Language Model (LLM). Because the answers are grounded in the indexed pages, the system can give detailed, context-aware responses about the web page.

The notebook is structured in a step-by-step manner, guiding you through the process of web page crawling, text extraction, vectorization, and storage in a database. Each step is accompanied by detailed explanations and code snippets to provide a comprehensive understanding of the process.

## Web Crawler and Content Extractor

This code implements a web crawler and content extractor that:

1. Extracts URLs from the given HTML content, filtering for the same domain and validating the URLs. ✅
2. Crawls a website starting from a given URL, iteratively processing and extracting links from each page. ✅
3. Returns a list of HTML documents extracted from the website. ✅

The code displays the source URL of each processed page and the total number of pages in the extracted content.
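
The three steps above can be sketched with the standard library alone. This is a minimal illustration, not the `Crawler` class used in this notebook: `LinkParser`, `same_site_links`, `crawl`, and the in-memory `SITE` dictionary are all hypothetical names, and `fetch` stands in for a real HTTP request.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def same_site_links(html, page_url, hostname):
    """Resolve links against page_url and keep those on the same host."""
    parser = LinkParser()
    parser.feed(html)
    links = set()
    for href in parser.hrefs:
        absolute = urljoin(page_url, href).split("#")[0]  # drop fragments
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc == hostname:
            links.add(absolute)
    return links


def crawl(start_url, fetch, max_pages=200):
    """Breadth-first crawl; `fetch` is a hypothetical callable url -> HTML."""
    hostname = urlparse(start_url).netloc
    queue, visited, pages = deque([start_url]), {start_url}, {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        pages[url] = html
        for link in sorted(same_site_links(html, url, hostname)):
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return pages


# Toy in-memory "site" standing in for real HTTP requests (assumption).
SITE = {
    "https://example.com/": '<a href="/a">A</a> <a href="https://other.org/">x</a>',
    "https://example.com/a": '<a href="/">home</a> <a href="/b#top">B</a> <a href="mailto:x@y.z">m</a>',
    "https://example.com/b": "<p>no links</p>",
}
pages = crawl("https://example.com/", fetch=lambda u: SITE.get(u, ""))
print(sorted(pages))
```

The `visited` set is what keeps the crawl from looping when pages link back to each other, and `max_pages` bounds the total work on large sites.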

In [4]:
from dotenv import load_dotenv
load_dotenv()

True

In [2]:
import sys
sys.path.append('..')
from src.paginx.page_crawler.crawler import Crawler

In [3]:

crawler = Crawler("https://vectrix.ai")
crawler.extract()

2024-05-04 18:12:38,686 - src.paginx.page_crawler.crawler - INFO - Starting extraction pipeline for site: vectrix.ai
2024-05-04 18:12:38,687 - langchain_community.document_loaders.async_html - INFO - fake_useragent not found, using default user agent.To get a realistic header for requests, `pip install fake_useragent`.
Fetching pages: 100%|##########| 1/1 [00:00<00:00,  2.23it/s]
2024-05-04 18:12:39,682 - src.paginx.page_crawler.crawler - INFO - Visiting the following links: ['https://vectrix.ai/platform', 'http://www.vectrix.ai/projects/products', 'https://vectrix.ai/blog', 'https://vectrix.ai/about-us', 'http://www.vectrix.ai/projects/advice', 'https://vectrix.ai/contact-us', 'https://vectrix.ai/career', 'http://www.vectrix.ai/projects/projects', 'https://vectrix.ai/Services']
2024-05-04 18:12:39,683 - langchain_community.document_loaders.async_html - INFO - fake_useragent not found, using default user agent.To get a realistic header for requests, `pip install fake_useragent`.
Fetchi

['{"title": "Vectrix - Platform", "author": null, "hostname": null, "date": null, "fingerprint": "fd1d589b7d1b0dc2", "id": null, "license": null, "comments": null, "raw_text": "Build the best generative AI applications on the planet with our building blocks Our platform is your foundation for building unique generative AI applications that can enhance customer service, increase sales, and automate tasks. Dive into the extensive capabilities of our adaptable platform and discover how it can cater to your specific use case, integrating the transformative potential of AI directly into your business operations. How does it work? We empower organizations with AI by merging industry-leading technology with our unique proprietary components and customizations.We utilize open technologies like OpenAI and Dialogflow and Google Cloud, among others, and boost their potential with our custom elements.These elements include our integration blocks, AI model fine-tuning and customization module, anal

In [None]:
from urllib.parse import urljoin, urlparse
from langchain_community.document_loaders import AsyncHtmlLoader
from bs4 import BeautifulSoup
import validators
from trafilatura import extract

def is_valid_url(url_string: str) -> bool:
    # URLs containing "slide-template" are treated as invalid
    if "slide-template" in url_string.lower():
        return False
    # validators.url returns True on success and a falsy failure object otherwise
    return bool(validators.url(url_string))

def prepend_url(base_url: str, link: str) -> str:
    # Resolve relative links against the base URL; absolute links pass through
    return urljoin(base_url, link)

def strip_query_string(url: str) -> str:
    parsed = urlparse(url)
    return parsed.scheme + "://" + parsed.netloc + parsed.path

def extract_site_urls(html: str, site_name: str, url: str) -> list:
    '''
    Extract the URLs from the HTML content,
    keeping only links on the same domain.
    Returns a list of strings.
    '''
    soup = BeautifulSoup(html, "html.parser")
    links = [link.get("href") for link in soup.find_all("a") if link.get("href") is not None]

    # Remove empty links and mailto links
    links = [link for link in links if len(link) > 1 and not link.startswith("mailto")]

    # Turn relative links into absolute URLs
    links = [prepend_url(url, link) for link in links]
    # Remove links that are not from the same site
    links = [link for link in links if site_name in link]
    # Check for valid URLs
    links = [link for link in links if is_valid_url(link)]
    # Remove everything after the # sign
    links = [link.split("#")[0] for link in links]
    # Remove duplicates
    links = list(set(links))    
    return links    


def get_site_contents(url:str, max_pages:int = 200, check_query_strings: bool = False, startswith: str = None) -> list:
    '''
    Crawl a site starting from `url`, following links on the same site.
    Returns a list of JSON strings, one per page, produced by trafilatura's extract.
    '''
    site_name = url.split("//")[1].split("/")[0].replace("www.", "")
    print("Starting extraction pipeline for site: ", site_name)

    visited_links = []

    loader = AsyncHtmlLoader(url)
    index_page = loader.load()
    visited_links.append(url)

    html = index_page[0].page_content


    links = extract_site_urls(html, site_name, url)
    processed_pages = []

    while len(links) > 0 :
        print("Visiting the following links: ", links)
        if startswith:
            links = [link for link in links if link.startswith(startswith)]
        other_pages = AsyncHtmlLoader(links, ignore_load_errors=True)
        try:
            docs = other_pages.load()
        except Exception as e:
            print("Error loading the page: ", e)
            docs = []  # nothing was loaded for this batch; avoids an undefined or stale `docs`
        processed_pages.extend(docs)
        if len(processed_pages) > max_pages:
            print("Maximum number of pages reached")
            break
        visited_links.extend(strip_query_string(link) for link in links)
        print("Number of pages processed: ", len(processed_pages))
        for doc in docs:
            # Extract same-site links from each downloaded page
            links.extend(extract_site_urls(doc.page_content, site_name, url))

        # Remove visited links
        links = [link for link in links if strip_query_string(link) not in visited_links]

        print("Number of links to visit: ", len(links))

    processed_pages.extend(index_page)

    print("Download finished. Extracting content from the pages.")
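    # Apply the extract method to each element of the list
    docs_transformed = [extract(doc.page_content, output_format="json", include_comments=False) for doc in processed_pages]
    # extract returns None when no main content is found; drop those pages
    docs_transformed = [doc for doc in docs_transformed if doc is not None]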
    
    #html2text = Html2TextTransformer()
    #docs_transformed = html2text.transform_documents(processed_pages)

    return docs_transformed

url = "https://vectrix.ai"  # same site as in the Crawler example above
scraped_pages = get_site_contents(url, max_pages=1000, startswith=url)
print("Number of pages scraped: ", len(scraped_pages))
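
To illustrate why the crawler compares links with their query strings stripped, here is a small standard-library sketch. `normalize` is a hypothetical helper, not part of the notebook: it resolves a link against a base URL with `urljoin`, then drops the query string and fragment so that variants of one page collapse to a single key.

```python
from urllib.parse import urljoin, urlparse


def normalize(base_url: str, link: str) -> str:
    """Hypothetical helper: absolute URL without query string or fragment."""
    parsed = urlparse(urljoin(base_url, link))
    return f"{parsed.scheme}://{parsed.netloc}{parsed.path}"


base = "https://vectrix.ai/blog/"
variants = ["/platform", "/platform?utm_source=x", "../platform#pricing"]

# All three spellings point at the same page, so the set holds one entry.
unique = {normalize(base, v) for v in variants}
print(unique)
```

Without this normalization, tracking parameters or anchors would make the same page look like several unvisited links.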

## Data Preprocessing and Chunking
In this step we will split all the extracted web pages into logical chunks. 

➡️ We will use the [trafilatura](https://trafilatura.readthedocs.io/en/latest/) library to extract the main content of the web pages. It returns a JSON object with the following attributes:
- `title`: The title of the page
- `author`: The author of the page, in most cases this will be empty
- `hostname`: The hostname of the page
- `date`: The date of the page
- `fingerprint`: A fingerprint of the page
- `id`: The id of the page, most of the time this will be empty
- `license`: The license of the page, most of the time this will be empty
- `comments`: The comments of the page, most of the time this will be empty
- `raw_text`: The raw text of the page, with HTML and visual elements removed
- `language`: The language of the page
- `image`: The images of the page, contains the URLs
- `pagetype`: Always set to website
- `source`: Main URL of the website
- `source-hostname`: Hostname of the website
- `excerpt`: An excerpt of the page
- `categories`: The categories of the page
- `tags`: The tags of the page
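
As a rough sketch of how such a record can be consumed, the snippet below parses one JSON string, cleans up the text, and keeps a few metadata fields. The `sample` record is hypothetical and shortened; only its field names come from the list above. Unlike the notebook's own `extract_metadata`, this version also drops keys whose value is empty.

```python
import json

# Hypothetical, shortened record shaped like the extractor output (assumption).
sample = json.dumps({
    "title": "Vectrix - Platform",
    "author": None,
    "hostname": None,
    "raw_text": "Build the best   generative AI applications\n on the planet",
    "source": "https://vectrix.ai/platform",
})

page = json.loads(sample)

# Collapse runs of whitespace so the splitter sees clean text.
text = " ".join(page["raw_text"].split())

# Keep only the metadata keys that are present and non-empty.
wanted = ["title", "hostname", "source", "excerpt"]
metadata = {key: page[key] for key in wanted if page.get(key) is not None}

print(text)
print(metadata)
```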

➡️ We will pipe this to another splitter to further cut the sections into smaller chunks if they are too large. For this we use LangChain's `RecursiveCharacterTextSplitter`.

➡️  Also we will attach an LLM to the chain to ignore chunks that are not relevant, for example: navigation bars, footers, etc.
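
To make the idea of size-bounded chunking concrete, here is a simplified stand-in written with the standard library only. It is not the LangChain splitter: `split_into_chunks` is a hypothetical function that packs whole sentences greedily and measures size in words, whereas the notebook's splitter measures size in tokens.

```python
import re


def split_into_chunks(text, max_words=50):
    """Greedily pack sentences into chunks of at most max_words words.

    A single sentence longer than max_words becomes its own oversized chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "One two three. Four five six seven. Eight nine. Ten."
chunks = split_into_chunks(text, max_words=5)
print(chunks)
```

Splitting on sentence boundaries keeps each chunk readable on its own, which tends to matter later when chunks are shown to an LLM as context.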



### Chunking and metadata extraction
Using the functions below, we extract the metadata and divide the text into chunks.

In [None]:
import json
from langchain_text_splitters import RecursiveCharacterTextSplitter

def extract_metadata(pages: list) -> list:
    '''
    Collect the metadata that the Trafilatura library extracted from each page
    '''
    keys =  ['title', 'hostname', 'image', 'source', 'source-hostname', 'excerpt']
    metadata = []

    for page in pages:
        page = json.loads(page)
        metadata.append({key: page[key] for key in keys if key in page})

    return metadata

def ner_processing(content: list, metadatas: list, chunk_size: int = 1000) -> list:
    '''
    Split the content into chunks of a certain size;
    Inputs:
    - content: list of strings
    - metadatas: list of dictionaries
    - chunk_size: int (optional), default 1000

    Returns a list of dictionaries
    '''

    # Splitting the content into chunks
    text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        model_name="gpt-4",
        chunk_size=chunk_size,
        chunk_overlap=0,
    )
    return text_splitter.create_documents(content, metadatas=metadatas)



metadata = extract_metadata(scraped_pages)
content = [' '.join(json.loads(page)["raw_text"].split()) for page in scraped_pages]
chunks = ner_processing(content, metadata)
#chunks = [chunk.dict() for chunk in chunks]
len(chunks)

In [None]:
print(chunks[2].metadata)

### NER Extraction Pipeline
Here we will use LangChain and an LLM to extract the named entities from the text.

In [None]:
#from langchain_openai import ChatOpenAI
from langchain_community.llms import Replicate
from langchain_core.pydantic_v1 import BaseModel, Field
from typing import Optional, List
from langchain.output_parsers import PydanticOutputParser
from langchain_core.prompts import PromptTemplate


#llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
#llm = ChatGroq(temperature=0, model_name="llama3-70b-8192")
#llm = Ollama(model_name="llama3-70b-8192", temperature=0)

llm = Replicate(
    model="meta/meta-llama-3-70b-instruct",
    model_kwargs={"temperature": 0},
)

#llm = ChatAnthropic(model='claude-3-sonnet-20240229')




class Entity(BaseModel):
    entity_type: Optional[str] = Field(description="The type of the entity, for example 'person', 'location', 'organization' etc.")
    entity_name: Optional[str] = Field(description="The name of the entity, for example 'John Doe', 'New York', 'Apple Inc.' etc.")

# Define your desired data structure.
class NERExtraction(BaseModel):
    entity_list: List[Entity] = Field(description="List of entities extracted from the text")
    language: str = Field(description="The language of the text")
    category: str = Field(description="The subject that this text is about")


# Set up a parser + inject instructions into the prompt template.
parser = PydanticOutputParser(pydantic_object=NERExtraction)

prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)


chain = prompt | llm | parser

In [None]:
response = chain.invoke({"query": chunks[2].page_content})
print(response.json(indent=2))
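
Conceptually, the parser step at the end of the chain turns the model's JSON reply into typed objects and fails loudly when the reply is malformed. The sketch below imitates that with standard-library dataclasses; it is a stand-in for illustration, not `PydanticOutputParser`. `EntityRecord`, `NERRecord`, `parse_ner`, and the `raw_reply` string are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EntityRecord:
    entity_type: Optional[str]
    entity_name: Optional[str]


@dataclass
class NERRecord:
    entity_list: List[EntityRecord]
    language: str
    category: str


def parse_ner(raw: str) -> NERRecord:
    """Parse a JSON reply; raises ValueError or KeyError on malformed input."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    entities = [
        EntityRecord(item.get("entity_type"), item.get("entity_name"))
        for item in data["entity_list"]
    ]
    return NERRecord(entities, data["language"], data["category"])


# Hypothetical model reply following the schema above (assumption).
raw_reply = (
    '{"entity_list": [{"entity_type": "organization", "entity_name": "Vectrix"}],'
    ' "language": "en", "category": "AI platform"}'
)
parsed = parse_ner(raw_reply)
print(parsed.entity_list[0].entity_name, parsed.language)
```

Because LLM replies do not always follow the requested format, callers typically wrap the parse in a `try`/`except` and skip or retry the chunk on failure.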

In [None]:
import nest_asyncio
import asyncio, uuid
from tqdm.notebook import tqdm

nest_asyncio.apply()

async def process_page_content(chunk, semaphore):
    async with semaphore:
        try:
            response = await chain.ainvoke({"query": chunk.page_content})
            chunk.metadata['uuid'] = str(uuid.uuid4())
            chunk.metadata['NER'] = response.dict()
            return chunk
        except Exception as e:
            print(f"An error occurred: {e}")
            return None  # or some other value indicating failure

async def main():
    semaphore = asyncio.Semaphore(5)  # Limit concurrency to 5
    tasks = [asyncio.create_task(process_page_content(chunk, semaphore)) for chunk in chunks]

    chunks_with_responses = []
    # Collect results as they finish; failed chunks come back as None
    for future in tqdm(asyncio.as_completed(tasks), total=len(tasks), desc="Processing tasks"):
        chunk = await future
        chunks_with_responses.append(chunk)

    return chunks_with_responses

# Get the current event loop
loop = asyncio.get_event_loop()

# Run the main function using the current event loop
results = loop.run_until_complete(main())

# remove all results that are None
results = [result for result in results if result is not None]
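
The semaphore pattern used above can be tried in isolation, without an LLM. In this sketch `fake_llm_call` is a hypothetical stand-in for `chain.ainvoke` that sleeps briefly and records how many calls are in flight, so the effect of the limit can be observed. It uses `asyncio.run`, which assumes no event loop is already running; inside a notebook, `nest_asyncio` or `await run_all(...)` would be needed instead.

```python
import asyncio


async def fake_llm_call(text, semaphore, state):
    """Hypothetical stand-in for chain.ainvoke; tracks concurrent calls."""
    async with semaphore:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # simulate network latency
        state["active"] -= 1
        return text.upper()


async def run_all(texts, limit=5):
    semaphore = asyncio.Semaphore(limit)
    state = {"active": 0, "peak": 0}
    # gather returns results in the same order as the inputs
    results = await asyncio.gather(
        *(fake_llm_call(t, semaphore, state) for t in texts)
    )
    return results, state["peak"]


results, peak = asyncio.run(run_all([f"chunk {i}" for i in range(20)], limit=5))
print(len(results), "results, peak concurrency:", peak)
```

All tasks are created up front, but at most `limit` of them are inside the `async with` block at any moment, which is what keeps the request rate against the model provider bounded.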

In [None]:
print(results[2].json(indent=2))

In [None]:
# Show the memory usage of this notebook
import os
import psutil
process = psutil.Process(os.getpid())
print("Memory used: ", process.memory_info().rss / 1024 ** 2, "MB")

## Storing the results in a PostgreSQL database

In [None]:
from langchain_core.documents import Document
from langchain_cohere import CohereEmbeddings
from langchain_postgres import PGVector

connection = "postgresql+psycopg://postgres:mysecretpassword@localhost/paginx"
collection_name = url
embeddings = CohereEmbeddings()

In [None]:
#vectorstore.drop_tables()

In [None]:
vectorstore = PGVector(
    embeddings=embeddings,
    collection_name=collection_name,
    connection=connection,
    use_jsonb=True,
)

In [None]:
vectorstore.add_documents(results, ids=[result.metadata["uuid"] for result in results])

In [None]:
# Take the second of the three closest chunks
result = vectorstore.similarity_search("When was the company founded?", k=3)[1]

In [None]:
print(result.page_content)
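
For intuition about what a similarity search does, the toy sketch below ranks stored texts by cosine similarity to a query vector. It is an illustration only: the three-dimensional vectors and the `store` list are made up, real embeddings have hundreds or thousands of dimensions, and the metric a given vector store uses depends on its configuration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, store, k=3):
    """Return the k texts whose vectors are most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Made-up (text, vector) pairs standing in for embedded chunks (assumption).
store = [
    ("about the company", [0.9, 0.1, 0.0]),
    ("pricing details", [0.0, 1.0, 0.0]),
    ("contact page", [0.1, 0.0, 0.9]),
]
print(top_k([1.0, 0.0, 0.0], store, k=2))
```

In a RAG flow, the texts returned by such a search are what gets passed to the LLM as context for answering the question.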