<a href="https://colab.research.google.com/github/rjslvn/personal/blob/master/QnA_Bot_with_Embeddings_and_Scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# This code is designed to scrape a website, process the text, and then use OpenAI's API to answer questions based on the scraped text.

Click the Play button at the top left of each code block to run it (make sure to do this for all 13 steps).

You may need to run this command first to install the dependencies:


```
!pip install bs4 tiktoken "openai<1" numpy pandas pypdf2 requests tqdm
```



**Step 1: Import Necessary Libraries**

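This step imports the Python libraries used in the script: libraries for handling PDF files, making HTTP requests, parsing HTML, manipulating data, and interacting with OpenAI's API. The same cell also sets the URL-matching regex and the OpenAI API key, prompts for the start URL, and defines the `HyperlinkParser` class that collects links from a page.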

In [None]:
################################################################################
### Step 1
################################################################################
import PyPDF2
from io import BytesIO
import requests
import re
import urllib.request
from bs4 import BeautifulSoup
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urlparse
import os
import pandas as pd
import tiktoken
import openai
import numpy as np
from openai.embeddings_utils import distances_from_embeddings, cosine_similarity
import time
from tqdm import tqdm
time.sleep(1)
# Regex pattern to match an absolute http:// or https:// URL
HTTP_URL_PATTERN = r'^https?://.+'
# Read the API key from the OPENAI_API_KEY environment variable instead of hardcoding it
openai.api_key = os.getenv("OPENAI_API_KEY")
# Prompt the user for a start URL; its domain becomes the root domain to crawl
start_url = input("Enter the start URL (e.g. https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average): ")
url_obj = urlparse(start_url)
domain = url_obj.netloc
local_domain = domain

# Create a class to parse the HTML and get the hyperlinks
class HyperlinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hyperlinks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.hyperlinks.append(attrs["href"])

full_url = start_url


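**Step 2: Define Variables and Functions**

The Step 1 cell already set up the URL regex, the OpenAI API key, and the `HyperlinkParser` class. This step adds `get_hyperlinks`, which opens a URL, returns an empty list if the response is not HTML or the request fails, and otherwise feeds the page to `HyperlinkParser` to collect every `href`.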
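To see what the link parser produces on its own, here is a minimal sketch that feeds it an HTML string instead of a live page. The class mirrors the `HyperlinkParser` from the notebook; `SAMPLE_HTML` is a made-up page used only for illustration.

```python
from html.parser import HTMLParser


class HyperlinkParser(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hyperlinks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.hyperlinks.append(attrs["href"])


# Hypothetical sample page, used only for illustration
SAMPLE_HTML = """
<html><body>
  <a href="/about">About</a>
  <a href="https://example.com/docs/">Docs</a>
  <a name="no-href">Anchor without href</a>
  <p>No link here</p>
</body></html>
"""

parser = HyperlinkParser()
parser.feed(SAMPLE_HTML)
links = parser.hyperlinks
print(links)
```

Only `<a>` tags that carry an `href` attribute are collected, and the values are returned exactly as written in the page, so relative links still need to be resolved afterwards.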

In [None]:



################################################################################
### Step 2
################################################################################

# Function to get the hyperlinks from a URL
def get_hyperlinks(url):
    
    # Try to open the URL and read the HTML
    try:
        # Open the URL and read the HTML
        with urllib.request.urlopen(url) as response:

            # If the response is not HTML, return an empty list
            if not response.info().get('Content-Type').startswith("text/html"):
                return []
            
            # Decode the HTML
            html = response.read().decode('utf-8')
    except Exception as e:
        print(e)
        return []

    # Create the HTML Parser and then Parse the HTML to get hyperlinks
    parser = HyperlinkParser()
    parser.feed(html)

    return parser.hyperlinks



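**Step 3: Filter Hyperlinks by Domain**

This step defines `get_domain_hyperlinks`, which keeps only the links from `get_hyperlinks` that stay on the start domain. Absolute links are kept when their host contains the domain, relative links are turned into absolute ones, `#` anchors and `mailto:` links are skipped, and trailing slashes and duplicates are removed.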
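The crawler only follows links that stay on the start domain. As an alternative sketch of that filtering, the standard library's `urljoin` can resolve relative links against the page they came from. Note that this is not the notebook's exact logic: `normalize_links` is a hypothetical helper, it requires an exact host match rather than a substring match, and it resolves relative links against the page URL rather than the domain root.

```python
from urllib.parse import urljoin, urlparse


def normalize_links(page_url, hrefs):
    """Resolve hrefs against page_url and keep only same-host http(s) links."""
    host = urlparse(page_url).netloc
    kept = set()
    for href in hrefs:
        # Skip in-page anchors and mail links
        if href.startswith("#") or href.startswith("mailto:"):
            continue
        absolute = urljoin(page_url, href)
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc == host:
            # Drop a trailing slash so the same page is not listed twice
            kept.add(absolute.rstrip("/"))
    return sorted(kept)


result = normalize_links(
    "https://example.com/docs/intro",
    ["/about", "guide", "#top", "mailto:someone",
     "https://other.org/x", "https://example.com/docs/"],
)
print(result)
```

Here the link to `other.org` is dropped, `guide` is resolved relative to `/docs/`, and the trailing slash on `/docs/` is removed.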

In [None]:
################################################################################
### Step 3
################################################################################

# Function to get the hyperlinks from a URL that are within the same domain
def get_domain_hyperlinks(local_domain, url):
    clean_links = []
    for link in set(get_hyperlinks(url)):
        clean_link = None

        # If the link is a URL, check if it is within the same domain or subdomain
        if re.search(HTTP_URL_PATTERN, link):
            # Parse the URL and check if the domain or subdomain is the same
            url_obj = urlparse(link)
            if local_domain in url_obj.netloc:
                clean_link = link

        # If the link is not a URL, check if it is a relative link
        else:
            if link.startswith("/"):
                link = link[1:]  # drop the leading slash; it is re-added when the URL is rebuilt below
            elif link.startswith("#") or link.startswith("mailto:"):
                continue
            clean_link = "https://" + local_domain + "/" + link

        if clean_link is not None:
            if clean_link.endswith("/"):
                clean_link = clean_link[:-1]
            clean_links.append(clean_link)

    # Return the list of hyperlinks that are within the same domain or subdomain
    return list(set(clean_links))




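**Step 4: Crawl Pages and Save Text**

This step crawls the site one level deep. `crawl` processes the start URL and every same-domain link found on it, `process_url` extracts the text from each page (with PyPDF2 for PDFs and BeautifulSoup for HTML), and `save_text_to_file` writes it to `text/<domain>/`.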
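The crawl in this notebook goes only one level deep from the start URL. If a deeper crawl is ever needed, one possible approach is a breadth-first traversal with a depth limit, using the `deque` that Step 1 already imports. The sketch below runs on an in-memory link graph; `LINKS` and `crawl_bfs` are hypothetical names, and in practice `get_links` would be a function such as `get_domain_hyperlinks`.

```python
from collections import deque

# Hypothetical link graph standing in for real pages: URL -> links found on it
LINKS = {
    "https://example.com": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/c", "https://example.com"],
    "https://example.com/b": [],
    "https://example.com/c": ["https://example.com/d"],
    "https://example.com/d": [],
}


def crawl_bfs(start_url, get_links, max_depth=1):
    """Visit pages breadth-first, at most max_depth links away from start_url."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # do not expand links beyond the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


order_depth1 = crawl_bfs("https://example.com", lambda u: LINKS.get(u, []), max_depth=1)
order_depth2 = crawl_bfs("https://example.com", lambda u: LINKS.get(u, []), max_depth=2)
print(order_depth1)
print(order_depth2)
```

The `seen` set keeps a page from being queued twice, which matters on real sites where pages link back to each other.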

In [None]:

################################################################################
### Step 4
################################################################################

def sanitize_filename(filename):
    invalid_chars = '\\/:*?"<>|'
    for char in invalid_chars:
        filename = filename.replace(char, '_')
    return filename


def crawl(url):
    # Parse the URL and get the domain
    local_domain = urlparse(url).netloc

    # Create a directory to store the text files
    if not os.path.exists("text/"):
        os.mkdir("text/")

    if not os.path.exists("text/" + local_domain + "/"):
        os.mkdir("text/" + local_domain + "/")

    # Create a directory to store the csv files
    if not os.path.exists("processed"):
        os.mkdir("processed")

    # Get the hyperlinks from the URL (one level deep)
    one_level_links = get_domain_hyperlinks(local_domain, url)

    # Process the initial URL
    process_url(url, local_domain)

    # Process each link one level deep
    for link in one_level_links:
        process_url(link, local_domain)

def process_url(url, local_domain):
    print(url)  # for debugging and to see the progress

    if url.lower().endswith(".pdf"):
        response = requests.get(url)
        pdf_content = BytesIO(response.content)
        try:
            pdf_reader = PyPDF2.PdfReader(pdf_content)
            pdf_text = ""
            for page_num in range(len(pdf_reader.pages)):
                pdf_text += pdf_reader.pages[page_num].extract_text()
            save_text_to_file(url, local_domain, pdf_text)
        except PyPDF2.errors.PdfReadError:
            print(f"Unable to parse PDF at {url}")
    else:
        # Get the text from the URL using BeautifulSoup
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        text = soup.get_text()
        save_text_to_file(url, local_domain, text)

def save_text_to_file(url, local_domain, text):
    # Save text from the url to a <url>.txt file
    with open("text/" + local_domain + "/" + sanitize_filename(url[8:]) + ".txt", "w", encoding="UTF-8") as f:
        # Warn when the page requires JavaScript; the saved text is then only the placeholder message
        if "You need to enable JavaScript to run this app." in text:
            print("Unable to parse page " + url + " due to JavaScript being required")

        # Write the text to the file in the text directory
        f.write(text)

crawl(full_url)



**Step 5: Clean the Text and Build a DataFrame**

This step reads the text files saved by the crawl, removes newline characters and repeated spaces, and collects the results in a DataFrame that is written to `processed/scraped.csv`.
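The text saved by the crawl still contains raw newlines and runs of spaces, which the next cell cleans up before building the DataFrame. As a rough illustration on a plain string, the sketch below uses a regular expression instead of the chained replacements in the notebook. `clean_text` is a hypothetical helper, and unlike the notebook's version it also collapses tabs and trims the ends.

```python
import re


def clean_text(text):
    """Replace real and escaped newlines with spaces, then collapse whitespace runs."""
    text = text.replace("\\n", " ")    # literal backslash-n sequences
    text = re.sub(r"\s+", " ", text)   # newlines, tabs, repeated spaces
    return text.strip()


raw = "ARIMA models\n\n  are used in\\ntime   series.\n"
cleaned = clean_text(raw)
print(cleaned)
```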

In [None]:

################################################################################
### Step 5
################################################################################

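def remove_newlines(serie):
    # regex=False treats each pattern as a literal string
    serie = serie.str.replace('\n', ' ', regex=False)
    serie = serie.str.replace('\\n', ' ', regex=False)
    serie = serie.str.replace('  ', ' ', regex=False)
    serie = serie.str.replace('  ', ' ', regex=False)
    return serie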


################################################################################
### Step 6
################################################################################

# Create a list to store the text files
texts=[]

# Get all the text files in the text directory
for file in os.listdir("text/" + local_domain + "/"):

    # Open the file and read the text
    with open("text/" + local_domain + "/" + file, "r", encoding="UTF-8") as f:
        text = f.read()

        # Strip the domain prefix and the ".txt" suffix from the file name, then replace - and _ with spaces and drop "#update".
        prefix_len = len(local_domain) + 1
        texts.append((file[prefix_len:-4].replace('-',' ').replace('_', ' ').replace('#update',''), text))

# Create a dataframe from the list of texts
df = pd.DataFrame(texts, columns = ['fname', 'text'])

# Set the text column to be the raw text with the newlines removed
df['text'] = df.fname + ". " + remove_newlines(df.text)
df.to_csv('processed/scraped.csv')
df.head()



**Step 6: Process the Text**

This step starts processing the scraped text. It reloads `processed/scraped.csv`, counts the tokens in each row, and plots the distribution so you can see which texts are too long and will need to be split into chunks.

In [None]:
################################################################################
### Step 7
################################################################################

# Load the cl100k_base tokenizer which is designed to work with the ada-002 model
tokenizer = tiktoken.get_encoding("cl100k_base")

df = pd.read_csv('processed/scraped.csv', index_col=0)
df.columns = ['title', 'text']

# Tokenize the text and save the number of tokens to a new column
df['n_tokens'] = df.text.apply(lambda x: len(tokenizer.encode(x)))

# Visualize the distribution of the number of tokens per row using a histogram
df.n_tokens.hist()



**Step 7: Tokenize the Text**

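The text is tokenized with the `cl100k_base` encoding, which breaks it into subword tokens rather than whole words, and the number of tokens per row is saved to the `n_tokens` column of the DataFrame. The cell below then uses those counts to split long texts.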



In [None]:
################################################################################
### Step 8
################################################################################

max_tokens = 500

# Function to split the text into chunks of a maximum number of tokens
def split_into_many(text, max_tokens = max_tokens):

    # Split the text into sentences
    sentences = text.split('. ')

    # Get the number of tokens for each sentence
    n_tokens = [len(tokenizer.encode(" " + sentence)) for sentence in sentences]
    
    chunks = []
    tokens_so_far = 0
    chunk = []

    # Loop through the sentences and tokens joined together in a tuple
    for sentence, token in zip(sentences, n_tokens):

        # If adding the current sentence would exceed the max number of tokens, add the
        # (non-empty) chunk to the list of chunks and reset the chunk and tokens so far
        if chunk and tokens_so_far + token > max_tokens:
            chunks.append(". ".join(chunk) + ".")
            chunk = []
            tokens_so_far = 0

        # If the number of tokens in the current sentence is greater than the max number of 
        # tokens, go to the next sentence
        if token > max_tokens:
            continue

        # Otherwise, add the sentence to the chunk and add the number of tokens to the total
        chunk.append(sentence)
        tokens_so_far += token + 1
        
    # Add the last chunk to the list of chunks
    if chunk:
        chunks.append(". ".join(chunk) + ".")

    return chunks
    

shortened = []

# Loop through the dataframe
for _, row in df.iterrows():

    # If the text is missing (NaN after loading from CSV), go to the next row
    if pd.isna(row['text']):
        continue

    # If the number of tokens is greater than the max number of tokens, split the text into chunks
    if row['n_tokens'] > max_tokens:
        shortened += split_into_many(row['text'])

    # Otherwise, add the text to the list of shortened texts
    else:
        shortened.append(row['text'])



**Step 8: Split Text into Chunks**

If a text has more than `max_tokens` (500) tokens, it is split at sentence boundaries into chunks of roughly `max_tokens` tokens or fewer; shorter texts are kept whole. A single sentence that is longer than `max_tokens` on its own is dropped.
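The greedy packing behind the chunking can be tried without the tokenizer. The sketch below is a simplified version, not the notebook's exact function: `count_tokens` is a stand-in that counts whitespace-separated words instead of calling `tokenizer.encode`, `split_into_chunks` is a hypothetical name, and the limit is set to 8 so the effect is visible on a short string.

```python
def count_tokens(text):
    """Stand-in for len(tokenizer.encode(text)): counts whitespace-separated words."""
    return len(text.split())


def split_into_chunks(text, max_tokens=8):
    """Greedily pack sentences into chunks of at most max_tokens tokens each."""
    chunks, current, used = [], [], 0
    for sentence in text.split(". "):
        n = count_tokens(sentence)
        if n > max_tokens:
            continue  # a single oversized sentence is skipped
        if current and used + n > max_tokens:
            # The current chunk is full: close it and start a new one
            chunks.append(". ".join(current) + ".")
            current, used = [], 0
        current.append(sentence)
        used += n
    if current:
        chunks.append(". ".join(current) + ".")
    return chunks


text = "One two three. Four five six seven. Eight nine. Ten eleven twelve thirteen fourteen"
chunks = split_into_chunks(text, max_tokens=8)
for chunk in chunks:
    print(chunk)
```

The first two sentences fit together in 7 tokens. Adding the third would make 9, so a new chunk starts there.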



**Step 9: Create a DataFrame**

A DataFrame is created from the shortened texts. The number of tokens for each text is calculated and added to the DataFrame.


In [None]:
################################################################################
### Step 9
################################################################################

df = pd.DataFrame(shortened, columns = ['text'])
df['n_tokens'] = df.text.apply(lambda x: len(tokenizer.encode(x)))
df.n_tokens.hist()




**Step 10: Get Embeddings**

This step involves getting embeddings for the text. Embeddings are a way of representing text in a numerical format that can be understood by machine learning models.
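The embedding cell makes one API request per chunk, so a larger crawl can run into the rate limits mentioned in its comments. One common way to cope is to retry with exponential backoff. The sketch below shows the idea with the standard library only; `with_backoff` and `flaky_embed` are hypothetical names, and the flaky function merely simulates a request that fails twice before succeeding.

```python
import time


def with_backoff(func, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt seconds and retry."""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)


# Hypothetical flaky call standing in for an embedding request
calls = {"n": 0}


def flaky_embed():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return [0.1, 0.2, 0.3]


delays = []  # record the waits instead of actually sleeping
embedding = with_backoff(flaky_embed, sleep=delays.append)
print(embedding, delays)
```

Passing `sleep=delays.append` keeps the example instant. With the default `time.sleep`, the same call would pause for 1 and then 2 seconds before the third attempt.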


In [None]:
################################################################################
### Step 10
################################################################################

# Note that you may run into rate limit issues depending on how many files you try to embed
# Please check out our rate limit guide to learn more on how to handle this: https://platform.openai.com/docs/guides/rate-limits

df['embeddings'] = df.text.apply(lambda x: openai.Embedding.create(input=x, engine='text-embedding-ada-002')['data'][0]['embedding'])
df.to_csv('processed/embeddings.csv')
# The embeddings are reloaded from this CSV and parsed back into arrays in Step 11




**Step 11: Load Embeddings**

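The embeddings are loaded from `processed/embeddings.csv` back into the DataFrame. Because the CSV stores each embedding as a string, the strings are converted to NumPy arrays and stacked into a matrix. As a sanity check, the cell also computes the cosine similarity of every chunk to the first chunk and prints the five most similar rows.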
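Two ideas in this step can be shown on toy data: an embedding read from a CSV arrives as a string and has to be parsed back into numbers, and cosine similarity then compares two vectors by the angle between them. The sketch below uses only the standard library and made-up two-dimensional vectors; real embeddings are much longer.

```python
import ast
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Embeddings come back from a CSV as strings; literal_eval turns them into lists
stored = ["[1.0, 0.0]", "[0.0, 1.0]", "[3.0, 4.0]"]
vectors = [ast.literal_eval(s) for s in stored]

sim_same = cosine_similarity(vectors[0], vectors[0])        # identical direction
sim_orthogonal = cosine_similarity(vectors[0], vectors[1])  # unrelated direction
sim_partial = cosine_similarity(vectors[0], vectors[2])     # partly aligned
print(sim_same, sim_orthogonal, sim_partial)
```

`ast.literal_eval` only parses Python literals, which makes it a safer choice than `eval` for data read from a file.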


In [None]:
################################################################################
### Step 11
################################################################################

import ast

# Load the embeddings from the csv file
df = pd.read_csv('processed/embeddings.csv', index_col=0)
# Convert the embeddings from strings to numpy arrays (literal_eval only parses literals, unlike eval)
df['embeddings'] = df.embeddings.apply(lambda x: np.array(ast.literal_eval(x)))

# Convert embeddings into a matrix for cosine similarity computation
embeddings_matrix = np.vstack(df['embeddings'].values)

print(embeddings_matrix.shape)

# Calculate the cosine similarity between the embeddings
def custom_cosine_similarity(a, b):
    a = np.array(a)
    b = np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

df['similarity'] = df.embeddings.apply(lambda x: custom_cosine_similarity(df.embeddings[0], x))

# Sort the dataframe by similarity
df = df.sort_values('similarity', ascending=False)

# Print the top 5 most similar texts
print(df.head(5))




**Step 12: Answer Questions**

This step involves answering questions based on the most similar context from the DataFrame texts. It uses OpenAI's API to generate the answers.
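The context passed to the model is built by taking the chunks closest to the question and stopping once a token budget is used up. The sketch below shows that selection on made-up data; `build_context` is a hypothetical, simplified counterpart of the notebook's `create_context`, and the distances and token counts are invented for illustration.

```python
def build_context(chunks, max_len=20, separator="\n\n###\n\n"):
    """chunks: list of (distance, n_tokens, text). Returns the joined context."""
    selected = []
    cur_len = 0
    # Closest chunks first
    for distance, n_tokens, text in sorted(chunks, key=lambda c: c[0]):
        cur_len += n_tokens + 4  # small allowance for the separator
        if cur_len > max_len:
            break
        selected.append(text)
    return separator.join(selected)


# Hypothetical chunks with made-up distances and token counts
chunks = [
    (0.30, 6, "ARIMA combines AR and MA terms."),
    (0.10, 7, "ARIMA stands for autoregressive integrated moving average."),
    (0.50, 9, "Seasonal variants are called SARIMA."),
]
context = build_context(chunks, max_len=22)
print(context)
```

With a budget of 22, the two closest chunks use 11 and then 21 tokens, and the third would push the total to 34, so it is left out.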



In [None]:
################################################################################
### Step 12
################################################################################

# Save the dataframe to a csv file
df.to_csv('processed/similarities.csv')

# Print a message to indicate that the process is complete
print("Process complete. The similarities have been saved to 'processed/similarities.csv'.")

# Load the embeddings from the CSV file
df = pd.read_csv('processed/embeddings.csv', index_col=0)


# Convert the embeddings from strings to numpy arrays (literal_eval only parses literals, unlike eval)
import ast
df['embeddings'] = df.embeddings.apply(lambda x: np.array(ast.literal_eval(x)))

# Convert embeddings into a matrix for cosine similarity computation
embeddings_matrix = np.vstack(df['embeddings'].values)

# Function to get the row indices of the documents most similar to a query
def get_similar_documents(query_embedding, embeddings_matrix, top_n=5):
    query_embedding = np.array(query_embedding)
    # Cosine similarity between the query and every row of the matrix
    norms = np.linalg.norm(embeddings_matrix, axis=1) * np.linalg.norm(query_embedding)
    similarity_scores = embeddings_matrix @ query_embedding / norms
    sorted_indices = np.argsort(similarity_scores)[::-1]
    return sorted_indices[:top_n]

# Function to get the query embedding
def get_query_embedding(query):
    return openai.Embedding.create(input=query, engine='text-embedding-ada-002')['data'][0]['embedding']

def answer_question(df, question, model="text-davinci-003", max_len=1800, max_tokens=150, stop_sequence=None, debug=False):
    """
    Answer a question based on the most similar context from the dataframe texts or provide an open-ended response
    """
    context = create_context(
        question,
        df,
        max_len=max_len,
    )
    # If debug, print the context that will be sent to the model
    if debug:
        print("Context:\n" + context)
        print("\n\n")

    try:
        # Create a completion using the question and context
        response = openai.Completion.create(
            engine=model,
            prompt=f"Answer the question based on the context below, and if the question can't be answered based on the context, provide an open-ended response.\n\nContext: {context}\n\n---\n\nQuestion: {question}\nAnswer:",
            temperature=0.5,
            max_tokens=max_tokens,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0,
            stop=stop_sequence,
        )
        return response.choices[0].text.strip()
    except Exception as e:
        print(e)
        return ""

def create_context(
    question, df, max_len=1800, size="ada"
):
    """
    Create a context for a question by finding the most similar context from the dataframe
    """

    # Get the embeddings for the question
    q_embeddings = openai.Embedding.create(input=question, engine='text-embedding-ada-002')['data'][0]['embedding']

    # Get the distances from the embeddings
    df['distances'] = distances_from_embeddings(q_embeddings, df['embeddings'].values, distance_metric='cosine')


    returns = []
    cur_len = 0

    # Sort by distance and add the text to the context until the context is too long
    for i, row in df.sort_values('distances', ascending=True).iterrows():
        
        # Add the length of the text to the current length
        cur_len += row['n_tokens'] + 4
        
        # If the context is too long, break
        if cur_len > max_len:
            break
        
        # Else add it to the text that is being returned
        returns.append(row["text"])

    # Return the context
    return "\n\n###\n\n".join(returns)



**Step 13: Interactive Question-Answering**

This final step allows for interactive question-answering. It prompts the user to enter a question, and then uses the previous steps to generate and print an answer.
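The control flow of the question loop can be tried without calling the API by passing the questions and the answering function in as arguments. This is only a sketch: `run_qa_loop` and `fake_answer` are hypothetical names, and in the notebook the answers come from `answer_question`.

```python
def run_qa_loop(questions, answer_fn):
    """Answer each question until 'exit' is seen; returns the transcript."""
    history = []
    for question in questions:
        if question.strip().lower() == "exit":
            break
        history.append((question, answer_fn(question)))
    return history


# Hypothetical stand-in for the model call
def fake_answer(question):
    return f"(answer to: {question})"


transcript = run_qa_loop(["What is ARIMA?", "exit", "never asked"], fake_answer)
print(transcript)
```

Anything after `exit` is never processed, which matches how the interactive loop ends the session.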

In [None]:

################################################################################
### Step 13
################################################################################
conversation_history = ""

while True:
    user_question = input("Enter your question (type 'exit' to quit): ")
    if user_question.lower() == 'exit':
        break
    
    conversation_history += f"\n\nQuestion: {user_question}"
    answer = answer_question(df, question=user_question, debug=False)
    conversation_history += f"\nAnswer: {answer}"
    print(answer)
