## How to build a web crawler 
Acquiring data in text form is the first step in using embeddings. This tutorial creates a new data set by crawling the OpenAI website, a technique that you can also use for your own company or personal website.

This crawler will start from the root URL passed in at the bottom of the code below, visit each page, find additional links, and visit those pages as well (as long as they have the same root domain).

To begin, import the required packages, set the base URL, and define a subclass of HTMLParser that collects hyperlinks.

In [1]:
import requests
import re
import urllib.request
from bs4 import BeautifulSoup
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urlparse, unquote
import os

# Regex pattern to match a URL
HTTP_URL_PATTERN = r'^https?://.+'

domain = "openai.com" # <- put your domain to be crawled
full_url = "https://openai.com/" # <- put your domain to be crawled with https or http

# Create a class to parse the HTML and get the hyperlinks
class HyperlinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        # Create a list to store the hyperlinks
        self.hyperlinks = []

    # Override the HTMLParser's handle_starttag method to get the hyperlinks
    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)

        # If the tag is an anchor tag and it has an href attribute, add the href attribute to the list of hyperlinks
        if tag == "a" and "href" in attrs:
            self.hyperlinks.append(attrs["href"])
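
To see what the parser collects without making any network request, it can be fed a small HTML string directly. The snippet below repeats the class so that it runs on its own; `sample_html` is made-up markup used only for illustration.

```python
from html.parser import HTMLParser

class HyperlinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hyperlinks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Only anchor tags that actually carry an href are recorded
        if tag == "a" and "href" in attrs:
            self.hyperlinks.append(attrs["href"])

# Hypothetical markup: two links, one anchor without href, one image
sample_html = """
<p>See <a href="/about">About</a>, <a href="https://example.com/">elsewhere</a>,
and <a name="top">an anchor without href</a>. <img src="/logo.png"></p>
"""

parser = HyperlinkParser()
parser.feed(sample_html)
print(parser.hyperlinks)  # ['/about', 'https://example.com/']
```

The relative link comes back exactly as written in the page; turning it into an absolute URL is left to a later step.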

The next function takes a URL as an argument, opens the URL, and reads the HTML content. Then, it returns all the hyperlinks found on that page.

In [2]:
# Function to get the hyperlinks from a URL
def get_hyperlinks(url):

    # Try to open the URL and read the HTML
    try:
        # Open the URL and read the HTML
        with urllib.request.urlopen(url) as response:

            # If the response is not HTML, return an empty list
            if not response.info().get('Content-Type').startswith("text/html"):
                return []

            # Decode the HTML
            html = response.read().decode('utf-8')
    except Exception as e:
        print(e)
        return []

    # Create the HTML Parser and then Parse the HTML to get hyperlinks
    parser = HyperlinkParser()
    parser.feed(html)

    return parser.hyperlinks
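
One edge case in `get_hyperlinks`: if a server sends no Content-Type header, `.get('Content-Type')` returns `None` and the `.startswith` call raises an `AttributeError`, which the `except` clause then prints before returning an empty list. A small helper can make that case explicit. This is only a sketch, and `is_html` is a hypothetical name that does not appear in the original notebook.

```python
def is_html(content_type):
    """Return True only when a Content-Type header value denotes HTML.

    Tolerates a missing header (None) and mixed-case values.
    """
    return bool(content_type) and content_type.lower().startswith("text/html")

print(is_html("text/html; charset=utf-8"))  # True
print(is_html("application/pdf"))           # False
print(is_html(None))                        # False
```

Inside `get_hyperlinks`, the check would then read `if not is_html(response.info().get('Content-Type')): return []`.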

The goal is to crawl and index only the content that lives under the OpenAI domain. To do that, the next function calls get_hyperlinks and filters out any URLs that are not part of the specified domain.

In [3]:
# Function to get the hyperlinks from a URL that are within the same domain
def get_domain_hyperlinks(local_domain, url):
    clean_links = []
    for link in set(get_hyperlinks(url)):
        clean_link = None

        # If the link is a URL, check if it is within the same domain
        if re.search(HTTP_URL_PATTERN, link):
            # Parse the URL and check if the domain is the same
            url_obj = urlparse(link)
            if url_obj.netloc == local_domain:
                clean_link = link

        # If the link is not a URL, check if it is a relative link
        else:
            if link.startswith("/"):
                link = link[1:]
            elif link.startswith("#") or link.startswith("mailto:"):
                continue
            clean_link = "https://" + local_domain + "/" + link

        if clean_link is not None:
            if clean_link.endswith("/"):
                clean_link = clean_link[:-1]
            clean_links.append(clean_link)

    # Return the list of hyperlinks that are within the same domain
    return list(set(clean_links))
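
The function above treats every non-absolute link as relative to the site root. As an alternative sketch (not the notebook's method), the standard library's `urljoin` can resolve a link against the page it was found on, and `urldefrag` can drop the `#fragment` part so that the same page is not queued twice. `normalize_link` is a hypothetical helper name.

```python
from urllib.parse import urldefrag, urljoin, urlparse

def normalize_link(page_url, link, local_domain):
    """Resolve `link` against `page_url`.

    Returns an absolute URL without fragment or trailing slash,
    or None when the link is not HTTP(S) or leaves `local_domain`.
    """
    absolute, _fragment = urldefrag(urljoin(page_url, link))
    parsed = urlparse(absolute)
    if parsed.scheme not in ("http", "https"):
        return None  # mailto:, tel:, javascript:, ...
    if parsed.netloc != local_domain:
        return None  # external site
    return absolute.rstrip("/")

page = "https://openai.com/research/"
for href in ["whisper", "/about/", "#content", "mailto:hi@example.com", "https://example.com/x"]:
    print(href, "->", normalize_link(page, href, "openai.com"))
```

With this approach, `#content` resolves to the page itself, so fragment-only variants of the same page collapse into a single URL.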

The crawl function is the final step in the web scraping task setup. It keeps track of the visited URLs to avoid repeating the same page, which might be linked across multiple pages on a site. It also extracts the raw text from a page without the HTML tags, and writes the text content into a local .txt file specific to the page.
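
The bookkeeping described here (a queue of pages to visit plus a set of pages already seen) can be tried on an in-memory link graph before touching the network. In this sketch, `site` is a made-up graph and `crawl_order` is a hypothetical helper. It takes pages from the left end of the deque, which gives a breadth-first order; the `crawl` function in the next cell calls `queue.pop()` instead, which takes the most recently added URL.

```python
from collections import deque

def crawl_order(start, links):
    """Return the order in which pages are visited, each exactly once."""
    queue = deque([start])
    seen = {start}          # pages already queued, to avoid repeats
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in links.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

# Hypothetical site: pages link back to each other, as real sites do
site = {
    "/": ["/about", "/research"],
    "/about": ["/", "/research"],
    "/research": ["/research/whisper", "/"],
    "/research/whisper": ["/about"],
}
print(crawl_order("/", site))
# ['/', '/about', '/research', '/research/whisper']
```

Even though the pages link to each other in cycles, the `seen` set guarantees that the loop terminates and that no page is visited twice.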

In [4]:
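def get_valid_filename(url):
    # Build a file name from the URL path only: split on "/", drop empty segments,
    # decode percent-escapes, and join the segments with underscores.
    # Note: the query string is ignored, and the root URL yields an empty name.
    parsed_url = urlparse(url)
    path_segments = [segment for segment in parsed_url.path.split('/') if segment]
    unquoted_segments = [unquote(segment) for segment in path_segments]
    return '_'.join(unquoted_segments)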

def crawl(url):
    # Parse the URL and get the domain
    local_domain = urlparse(url).netloc

    # Create a queue to store the URLs to crawl
    queue = deque([url])

    # Create a set to store the URLs that have already been seen (no duplicates)
    seen = set([url])

    # Create directories if they don't exist
    os.makedirs("text/" + local_domain, exist_ok=True)
    os.makedirs("processed", exist_ok=True)

    # While the queue is not empty, continue crawling
    while queue:

        # Take the next URL; pop() removes the most recently added one, so the crawl runs depth-first
        url = queue.pop()
        print(url) # for debugging and to see the progress

        # Generate a valid file name from the URL
        file_name = get_valid_filename(url)
        file_path = os.path.join("text", local_domain, file_name + ".txt")

        # Save text from the url to a <url>.txt file
        with open(file_path, "w", encoding="UTF-8") as f:

            # Get the text from the URL using BeautifulSoup
            soup = BeautifulSoup(requests.get(url).text, "html.parser")

            # Get the text but remove the tags
            text = soup.get_text()

            # If the page requires JavaScript, skip writing its text (the crawl itself continues)
            if ("You need to enable JavaScript to run this app." in text):
                print("Unable to parse page " + url + " due to JavaScript being required")
            else:
                # Otherwise, write the text to the file in the text directory
                f.write(text)

        # Get the hyperlinks from the URL and add them to the queue
        for link in get_domain_hyperlinks(local_domain, url):
            if link not in seen:
                queue.append(link)
                seen.add(link)

crawl(full_url)

https://openai.com/
https://openai.com/about
https://openai.com/research/whisper
https://openai.com/research?topics=open-source
https://openai.com/research/dall-e-3-system-card
https://openai.com/research/dall-e-3-system-card#content
https://openai.com/research?authors=openai
https://openai.com/research?models=dall-e-3
https://openai.com/research/frontier-ai-regulation
https://openai.com/research/frontier-ai-regulation#content
https://openai.com/research?topics=community
https://openai.com/research/vpt
https://openai.com/research?authors=peter-zhokhov
https://openai.com/research?authors=jie-tang
https://openai.com/research?authors=bowen-baker
https://openai.com/research?authors=raul-sampedro
https://openai.com/research?authors=jeff-clune
https://openai.com/research/vpt#content
https://openai.com/research?authors=ilge-akkaya
https://openai.com/research?authors=joost-huizinga
https://openai.com/research?topics=human-feedback
https://openai.com/research?topics=reinforcement-learning
https
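
One thing to watch in the output above: `get_valid_filename` builds the file name from the URL path alone, so URLs that differ only in their query string (for example the various `research?topics=...` and `research?authors=...` pages) map to the same file and overwrite each other, and the root URL maps to an empty name. The sketch below shows one possible way to keep such pages apart. `url_to_filename` and the `"index"` fallback are assumptions for illustration, not part of the original notebook.

```python
import re
from urllib.parse import unquote, urlparse

def url_to_filename(url):
    """Derive a file-system-safe name from a URL, including its query string."""
    parsed = urlparse(url)
    segments = [unquote(s) for s in parsed.path.split("/") if s]
    name = "_".join(segments) or "index"   # assumed fallback for the root URL
    if parsed.query:
        name += "_" + unquote(parsed.query)
    # Replace anything outside a conservative character set with "_"
    return re.sub(r"[^A-Za-z0-9._-]", "_", name)

print(url_to_filename("https://openai.com/"))
print(url_to_filename("https://openai.com/research/whisper"))
print(url_to_filename("https://openai.com/research?topics=open-source"))
```

Note that any later code that derives titles from file names would need to account for the changed naming scheme.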

## Building an embeddings index


CSV is a common format for storing embeddings. You can use this format with Python by converting the raw text files (which are in the text directory) into Pandas data frames. Pandas is a popular open source library that helps you work with tabular data (data stored in rows and columns).

Blank lines can clutter the text files and make them harder to process. A simple function can remove those lines and tidy up the files.

In [5]:
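def remove_newlines(serie):
    # regex=False: treat each pattern as a literal string rather than a regular expression
    serie = serie.str.replace('\n', ' ', regex=False)    # real newline characters
    serie = serie.str.replace('\\n', ' ', regex=False)   # literal backslash-n sequences
    serie = serie.str.replace('  ', ' ', regex=False)    # collapse double spaces (two passes)
    serie = serie.str.replace('  ', ' ', regex=False)
    return serie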

Converting the text to CSV requires looping through the text files in the text directory created earlier. After opening each file, remove the extra spacing and append the modified text to a list. Then, add the text with the new lines removed to an empty Pandas data frame and write the data frame to a CSV file.

Extra spacing and newlines can clutter the text and complicate the embeddings process. The code used here removes some of them, but you may find third-party libraries or other methods useful for getting rid of more unnecessary characters.
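
As one example of such an alternative, a single regular expression can collapse any run of whitespace (spaces, tabs, newlines) into one space. This is a plain-Python sketch that works on individual strings rather than on a Pandas column, and `collapse_whitespace` is a hypothetical helper name.

```python
import re

def collapse_whitespace(text):
    """Replace literal backslash-n sequences and all whitespace runs with single spaces."""
    text = text.replace("\\n", " ")           # literal "\n" left over from escaping
    return re.sub(r"\s+", " ", text).strip()  # any run of whitespace becomes one space

raw = "OpenAI\n\n  Research \\n index\t page "
print(collapse_whitespace(raw))  # OpenAI Research index page
```

Unlike two fixed passes over double spaces, this handles runs of any length in one step.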

In [6]:
import pandas as pd

# Create a list to store the text files
texts=[]

# Get all the text files in the text directory
for file in os.listdir("text/" + domain + "/"):

    # Open the file and read the text
    with open("text/" + domain + "/" + file, "r", encoding="UTF-8") as f:
        text = f.read()

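        # Build a title from the file name: drop the first 11 characters and the last 4 (".txt"),
        # then replace - and _ with spaces and remove "#update".
        # The 11-character slice assumes a domain prefix such as "openai.com_" in the file name;
        # names written by get_valid_filename have no such prefix, so adjust the slice
        # (for example file[:-4]) if the titles come out truncated or empty.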
        texts.append((file[11:-4].replace('-',' ').replace('_', ' ').replace('#update',''), text))

# Create a dataframe from the list of texts
df = pd.DataFrame(texts, columns = ['fname', 'text'])

# Set the text column to be the raw text with the newlines removed
df['text'] = df.fname + ". " + remove_newlines(df.text)
df.to_csv('processed/scraped.csv')
df.head()



Unnamed: 0,fname,text
0,,. OpenAI CloseSearch Submit Skip to main c...
1,,. About CloseSearch Submit Skip to main co...
2,,. Safety & responsibility CloseSearch Subm...
3,,. Product CloseSearch Submit Skip to main ...
4,,. Pricing CloseSearch Submit Skip to main ...


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 441 entries, 0 to 440
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   fname   441 non-null    object
 1   text    441 non-null    object
dtypes: object(2)
memory usage: 7.0+ KB


### Tokenization

Tokenization is the next step after saving the raw text into a CSV file. This process splits the input text into tokens by breaking down the sentences and words.

A helpful rule of thumb is that one token corresponds to roughly 4 characters of common English text, or about ¾ of a word (so 100 tokens ≈ 75 words).
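
The rule of thumb translates into simple arithmetic. The helpers below are only a rough estimate for sizing purposes, not a substitute for a real tokenizer, and both function names are hypothetical.

```python
def estimate_tokens(text):
    """Rough token estimate: about 4 characters per token, at least 1."""
    return max(1, len(text) // 4)

def estimate_words(n_tokens):
    """Rough word estimate: about 3/4 of a word per token."""
    return round(n_tokens * 0.75)

print(estimate_tokens("x" * 400))  # 100
print(estimate_words(100))         # 75
```

Actual counts depend on the tokenizer and on the text; non-English text and code usually deviate from this ratio.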

The API has a limit on the maximum number of input tokens for embeddings. To stay below the limit, the text in the CSV file needs to be broken down into multiple rows. The existing length of each row will be recorded first to identify which rows need to be split.
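
A minimal sketch of such a split is shown below, assuming sentences are separated by ". " and that a token-counting function is supplied by the caller. `split_into_chunks` and its parameters are hypothetical names; in the example, a word count stands in for a real tokenizer so that the snippet runs without extra packages.

```python
def split_into_chunks(text, max_tokens, count_tokens):
    """Group sentences into chunks whose token counts stay within max_tokens.

    A single sentence longer than max_tokens is kept as its own
    (oversized) chunk rather than being dropped.
    """
    sentences = [s.strip().rstrip(".") for s in text.split(". ")]
    sentences = [s for s in sentences if s]

    chunks, current, current_tokens = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Start a new chunk when adding this sentence would exceed the budget
        if current and current_tokens + n > max_tokens:
            chunks.append(". ".join(current) + ".")
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append(". ".join(current) + ".")
    return chunks

# Stand-in counter (assumption): one token per whitespace-separated word
word_count = lambda s: len(s.split())

text = "One two three. Four five. Six seven eight nine. Ten."
for chunk in split_into_chunks(text, max_tokens=5, count_tokens=word_count):
    print(chunk)
# One two three. Four five.
# Six seven eight nine. Ten.
```

With a real tokenizer, `count_tokens` would return the length of the encoded token list instead of a word count.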

The new model, text-embedding-ada-002, replaces five separate models for text search, text similarity, and code search, and outperforms our previous most capable model, Davinci, at most tasks, while being priced 99.8% lower.

Embeddings are numerical representations of concepts converted to number sequences, which make it easy for computers to understand the relationships between those concepts.

Stronger performance: text-embedding-ada-002 outperforms all the old embedding models on text search, code search, and sentence similarity tasks, and gets comparable performance on text classification.

In [11]:
!pip install tiktoken



In [12]:
!pip show tiktoken

Name: tiktoken
Version: 0.1.2
Summary: 
Home-page: 
Author: 
Author-email: 
License: 
Location: D:\AI Training\MyVirtualEnv\env\Lib\site-packages
Requires: blobfile, regex, requests
Required-by: 


In [2]:
!where python

D:\AI Training\MyVirtualEnv\env\Scripts\python.exe
C:\Users\sonya\AppData\Local\Programs\Python\Python311\python.exe
C:\Users\sonya\AppData\Local\Microsoft\WindowsApps\python.exe


In [3]:
import tiktoken

# Load the cl100k_base tokenizer which is designed to work with the ada-002 model
tokenizer = tiktoken.get_encoding("cl100k_base")

df = pd.read_csv('processed/scraped.csv', index_col=0)
df.columns = ['title', 'text']

# Tokenize the text and save the number of tokens to a new column
df['n_tokens'] = df.text.apply(lambda x: len(tokenizer.encode(x)))

# Visualize the distribution of the number of tokens per row using a histogram
df.n_tokens.hist()

ModuleNotFoundError: No module named 'tiktoken'

In [1]:
import sys
sys.executable

'C:\\Users\\sonya\\AppData\\Local\\Programs\\Python\\Python311\\python.exe'
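
The outputs above point to a likely cause of the ModuleNotFoundError: `pip show` reports tiktoken under the virtual environment at `D:\AI Training\MyVirtualEnv\env`, while `sys.executable` shows the kernel running from a different interpreter. One common way to avoid this kind of mismatch is to invoke pip through the interpreter that the kernel itself uses. The sketch below only builds that command; `pip_install_command` is a hypothetical helper, and it assumes pip is available for the kernel's interpreter.

```python
import sys

def pip_install_command(package):
    """Build a pip command bound to the interpreter running this code."""
    return [sys.executable, "-m", "pip", "install", package]

command = pip_install_command("tiktoken")
print(command)

# To actually run the installation from the notebook (requires network access):
# import subprocess
# subprocess.check_call(command)
```

After installing this way, restarting the kernel before importing the package again is usually a good idea.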