Welcome to WebBot, a notebook that scrapes a website's content and then builds an AI chatbot that answers questions from the scraped content.
## Note: 
1. You can either use the already scraped data and the vector database provided along with this file, or create your own updated versions.

2. When running the scraping code, interrupt it yourself after 2-3 minutes; otherwise it will keep crawling until it runs out of links.
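
Instead of interrupting the kernel by hand, the crawl can be given a page budget and a time budget. The sketch below is only an illustration, not the notebook's own code: `bounded_crawl`, `max_pages`, `max_seconds`, and the injected `fetch_links` callable are hypothetical names. In this notebook, `fetch_links` would wrap a function such as `scrape_page`.

```python
import time
from collections import deque

def bounded_crawl(start_url, fetch_links, max_pages=50, max_seconds=180.0):
    """Breadth-first crawl that stops once a page or time budget is spent.

    fetch_links(url) is a caller-supplied function (hypothetical here) that
    scrapes one page and returns the links found on it.
    """
    deadline = time.monotonic() + max_seconds
    queue = deque([start_url])
    visited = []          # pages actually scraped, in order
    seen = {start_url}    # every URL ever queued, to avoid duplicates

    while queue and len(visited) < max_pages and time.monotonic() < deadline:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

# Usage with a fake in-memory site, so no network access is needed.
fake_site = {"a": ["b", "c"], "b": ["a", "d"], "c": ["d"], "d": []}
pages = bounded_crawl("a", lambda u: fake_site.get(u, []), max_pages=3)
print(pages)
```

A queue gives a predictable breadth-first order, so a short run covers the pages closest to the start URL first.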

In [26]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
import time
import os

# Set to store visited URLs to avoid duplicates
visited_urls = set()
seen_texts = set()

def extract_main_keyword(url):
    """Extract the main keyword from a URL, like 'botpenguin' from 'https://botpenguin.com'."""
    domain = urlparse(url).netloc
    # Split domain by dots and select the second last part, which usually represents the main name
    parts = domain.split('.')
    if len(parts) > 1:
        return parts[-2]  # Get the second last part as the main keyword
    return domain

def is_valid_url(url, main_keyword):
    """Check if a URL contains the main keyword, allowing subdomains and different TLDs."""
    return main_keyword in urlparse(url).netloc

def get_all_links(url, soup, main_keyword):
    """Extract all links containing the main keyword."""
    links = []
    for link in soup.find_all('a', href=True):
        full_link = urljoin(url, link['href'])
        # Check if the link is valid and belongs to the same keyword-based domain
        if full_link not in visited_urls and is_valid_url(full_link, main_keyword):
            links.append(full_link)
    return links

def extract_text(soup):
    """Extract text from all relevant tags except headers and footers."""
    # Remove header and footer elements
    for header in soup.find_all(['header', 'footer']):
        header.decompose()  # Remove the header/footer from the soup object

    # Extract text from relevant tags
    texts = []
    for tag in soup.find_all(['p', 'div', 'span', 'li']):
        text = tag.get_text(strip=True)
        if text:  # Only add non-empty text
            texts.append(text)
    
    return texts

def remove_existing_file(file_name='output.txt'):
    """Remove the existing file if it exists."""
    if os.path.exists(file_name):
        os.remove(file_name)
        print(f"{file_name} has been removed.")

def write_text_to_file(text, file_name='output.txt'):
    """Append text to a file."""
    with open(file_name, 'a', encoding='utf-8') as file:
        file.write(text + '\n')

def scrape_page(url, main_keyword):
    """Extract desired data from the page without repeatedly scraping headers, footers, or already seen content."""
    print(f"Scraping: {url}")
    # Mark the URL as visited up front so a page that fails is not retried
    visited_urls.add(url)
    heading = f"URL : {url}"
    write_text_to_file(heading)
    try:
        # Add headers to mimic a browser request
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        }
        # A timeout keeps one unresponsive page from stalling the whole crawl
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')
        
        # Exclude footer and aside sections, which tend to repeat across pages
        for tag in soup(['footer', 'aside']):
            tag.decompose()  # Remove these elements from the soup
        
        # Extract text from specific tags, avoiding duplicates
        paragraphs = soup.find_all(['p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'ul', 'ol', 'li'])

        for para in paragraphs:
            text = para.get_text(strip=True)
            # Skip if the text has been seen before to reduce repetition
            if text and text not in seen_texts:
                print(text)  # Optional: to see the text being processed
                write_text_to_file(text)  # Append text to the file
                seen_texts.add(text)  # Mark this text as seen

        # Mark the URL as visited
        visited_urls.add(url)

        # Return all links found on the current page
        return get_all_links(url, soup, main_keyword)

    except requests.exceptions.RequestException as e:
        print(f"Error scraping {url}: {e}")
        return []

def crawl_and_scrape(base_url):
    """Crawl the website starting from the base URL, scraping each page."""
    main_keyword = extract_main_keyword(base_url)
    urls_to_scrape = set([base_url])

    while urls_to_scrape:
        # Get a URL to scrape
        current_url = urls_to_scrape.pop()
        
        # Scrape the page and get new links
        new_links = scrape_page(current_url, main_keyword)
        
        # Add new links to the queue for scraping
        urls_to_scrape.update(new_links)

        # Add a delay to avoid overwhelming the server
        time.sleep(1)
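
Note that `is_valid_url` uses a substring test on the host name, so a host such as `notbotpenguin.example.org` would also be accepted. If that is too loose, one possible alternative is to compare host names directly. The sketch below is an assumption, not part of the original notebook: `is_same_site` is a hypothetical helper, and unlike the original it does not accept the same name under a different TLD.

```python
from urllib.parse import urlparse

def is_same_site(url, base_url):
    """Return True if url's host equals base_url's host or is a subdomain of it."""
    host = (urlparse(url).hostname or "").lower()
    base = (urlparse(base_url).hostname or "").lower()
    if base.startswith("www."):
        base = base[4:]  # treat www.example.com and example.com as one site
    if not host or not base:
        return False
    return host == base or host.endswith("." + base)

print(is_same_site("https://help.botpenguin.com/docs", "https://www.botpenguin.com"))
print(is_same_site("https://notbotpenguin.example.org/", "https://www.botpenguin.com"))
```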




In [29]:

def CreateNewScrapeFile():
    # Read the starting URL from the user
    base_url = input("Enter the website URL (e.g., https://www.botpenguin.com): ").strip()

    # Ensure the URL starts with http:// or https://
    if not base_url.startswith(('http://', 'https://')):
        base_url = 'https://' + base_url

    # Remove the existing output file if it exists
    remove_existing_file()

    # Start the crawling and scraping process
    crawl_and_scrape(base_url)


def load_existing_ScrapedData():
    print("Existing Output.txt file will be loaded.")

choice = input("Would you like to (1) Scrape data from website or (2) load an existing one? Enter 1 or 2: ")
    
if choice == '1':
    CreateNewScrapeFile()
elif choice == '2':
    load_existing_ScrapedData()
else:
    print("Invalid choice. Please enter 1 or 2.")

Existing Output.txt file will be loaded.


Here we just install some necessary modules; the rest we will see along the way!

In [31]:
!pip install --upgrade --quiet langchain langchain-community langchainhub langchain-openai chromadb faiss-cpu requests beautifulsoup4





Replace the placeholder below with your OpenAI API key.

In [32]:

import os
os.environ["OPENAI_API_KEY"] = "<Your OpenAi Key>"
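
Hard-coding the key in a cell makes it easy to leak when the notebook is shared. A possible alternative, sketched here under assumptions, is to read the key from the environment and only prompt when it is missing. `ensure_api_key` is a hypothetical helper, and `DEMO_API_KEY` and `sk-demo` are placeholder values used so the example runs without a real key.

```python
import os
from getpass import getpass

def ensure_api_key(var_name="OPENAI_API_KEY", prompt_fn=getpass):
    """Return the key from the environment, prompting only if it is missing."""
    key = os.environ.get(var_name)
    if not key:
        key = prompt_fn(f"Enter {var_name}: ").strip()
        os.environ[var_name] = key
    return key

# Demo with a stubbed prompt, so nothing is read from the keyboard.
os.environ.pop("DEMO_API_KEY", None)
key = ensure_api_key("DEMO_API_KEY", prompt_fn=lambda _msg: " sk-demo ")
print(key == os.environ["DEMO_API_KEY"])
```

In the notebook itself, calling `ensure_api_key()` with no arguments would prompt for the key without echoing it.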


In [33]:
# Using 'with' statement ensures the file is closed automatically
with open('output.txt', 'r', encoding='utf-8') as file:
    # Read the entire content of the file
    content = file.read()

# Print the number of characters read from the file
print(len(content))

157842


In [34]:
from langchain.text_splitter import CharacterTextSplitter
import re
def split_with_character_text_splitter(content):
    text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len
    )
    return text_splitter.split_text(content)


splits = split_with_character_text_splitter(content)



Created a chunk of size 1058, which is longer than the specified 1000
Created a chunk of size 1050, which is longer than the specified 1000
Created a chunk of size 1305, which is longer than the specified 1000
Created a chunk of size 1629, which is longer than the specified 1000
Created a chunk of size 1273, which is longer than the specified 1000
Created a chunk of size 1004, which is longer than the specified 1000
Created a chunk of size 1475, which is longer than the specified 1000
Created a chunk of size 1457, which is longer than the specified 1000
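
These warnings suggest that some single lines in `output.txt` are longer than `chunk_size`. As far as I understand, `CharacterTextSplitter` only splits at the given separator, so a line with no `\n` inside it is kept whole. One possible workaround, sketched below, is to hard-wrap overlong lines before splitting. `wrap_long_lines` is a hypothetical helper built on the standard-library `textwrap` module, not something the notebook defines.

```python
import textwrap

def wrap_long_lines(text, max_len=1000):
    """Break any line longer than max_len so no line exceeds max_len characters."""
    wrapped = []
    for line in text.split("\n"):
        if len(line) <= max_len:
            wrapped.append(line)
        else:
            # Prefer breaking at whitespace; words longer than max_len are cut.
            wrapped.extend(textwrap.wrap(line, width=max_len, break_long_words=True))
    return "\n".join(wrapped)

# Demo on synthetic text: one short line followed by one 1500-character line.
sample = "short line\n" + "word " * 300
cleaned = wrap_long_lines(sample, max_len=100)
print(max(len(line) for line in cleaned.split("\n")))
```

Running the scraped `content` through such a helper before `split_with_character_text_splitter` should keep every piece within the limit, at the cost of breaking a few long lines mid-sentence.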


In [35]:
len(splits)

202

In [36]:
from pprint import pprint

pprint(splits[:5])

['URL : https://www.botpenguin.com\n'
 'Why BotPenguinProductSolutionsPricingPartnersResources\n'
 'Why BotPenguin\n'
 'Product\n'
 'Solutions\n'
 'Pricing\n'
 'Partners\n'
 'Resources\n'
 'Engage, Converse and Convertyour visitors using AI Chatbot Agent\n'
 'Generate 10x more leads, solve up to 80% customer queries, engage 70% more '
 'visitorsto earn 90% more revenue by automating business communication.\n'
 'URL : https://www.botpenguin.com/chatbot-pricing\n'
 'Honest, Transparent and AffordableChatbot pricing\n'
 'No hidden costNo Markup cost (Meta charges)Get started for FREEGet FREE '
 'Green Tick Verification\n'
 'No hidden cost\n'
 'No Markup cost (Meta charges)\n'
 'Get started for FREE\n'
 'Get FREE Green Tick Verification\n'
 'Baby Plan\n'
 '$0\n'
 'Messages1,000Conversations100Chatbot1\n'
 'Messages1,000\n'
 'Conversations100\n'
 'Chatbot1\n'
 'What do you get?',
 'No Markup cost (Meta charges)\n'
 'Get started for FREE\n'
 'Get FREE Green Tick Verification\n'
 'Baby Plan\n

In [37]:
from langchain_core.documents import Document

# Wrap each text chunk in a Document so it can be embedded and indexed
docs = split_with_character_text_splitter(content)
documents = [Document(page_content=doc) for doc in docs]

Created a chunk of size 1058, which is longer than the specified 1000
Created a chunk of size 1050, which is longer than the specified 1000
Created a chunk of size 1305, which is longer than the specified 1000
Created a chunk of size 1629, which is longer than the specified 1000
Created a chunk of size 1273, which is longer than the specified 1000
Created a chunk of size 1004, which is longer than the specified 1000
Created a chunk of size 1475, which is longer than the specified 1000
Created a chunk of size 1457, which is longer than the specified 1000


In [38]:
import os
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()


In [39]:
from langchain_community.vectorstores import FAISS 

def create_new_vectorstore():
   
    # Create a new vector store from documents
    vectorstore = FAISS.from_documents(documents, embeddings)
    vectorstore.save_local("vectorstore")
    print("New vector store created and saved locally.")
    return vectorstore


def load_existing_vectorstore():
    # Load an existing vector store
    vectorstore = FAISS.load_local("vectorstore", embeddings, allow_dangerous_deserialization=True)
    print("Existing vector store loaded.")
    return vectorstore

choice = input("Would you like to (1) create a new vector database or (2) load an existing one? Enter 1 or 2: ")
    
if choice == '1':
    vectorstore=create_new_vectorstore()
elif choice == '2':
    vectorstore = load_existing_vectorstore()
else:
    print("Invalid choice. Please enter 1 or 2.")

Existing vector store loaded.
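
The choice between creating and loading could also be made automatically, based on whether the saved folder exists. This is only a sketch of that idea: `load_or_create` is a hypothetical helper, and the two callables stand in for `load_existing_vectorstore` and `create_new_vectorstore`.

```python
import os
import tempfile

def load_or_create(path, load_fn, create_fn):
    """Call load_fn(path) if the directory exists, otherwise create_fn(path)."""
    if os.path.isdir(path):
        return load_fn(path)
    return create_fn(path)

# Demo with stub callables and a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    existing = load_or_create(tmp, lambda p: "loaded", lambda p: "created")
    missing = load_or_create(os.path.join(tmp, "nope"), lambda p: "loaded", lambda p: "created")
print(existing, missing)
```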


In [40]:
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 6})

In [41]:
retrieved_docs = retriever.invoke("What is BotPenguin?")

In [42]:
len(retrieved_docs)

6
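
Conceptually, a similarity retriever with `k=6` embeds the query, ranks the stored vectors by closeness, and returns the six nearest chunks. The toy sketch below shows that ranking step in plain Python. The 2-D vectors, the texts, and the use of Euclidean distance are assumptions for illustration; the real store relies on FAISS indexes and OpenAI embeddings.

```python
import math

def top_k(query_vec, items, k=6):
    """Return the k texts whose vectors are closest to query_vec.

    items is a list of (text, vector) pairs; distance is plain Euclidean.
    """
    ranked = sorted(items, key=lambda item: math.dist(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]

toy_index = [
    ("pricing page", (0.9, 0.1)),
    ("partner program", (0.2, 0.8)),
    ("release notes", (0.5, 0.5)),
]
print(top_k((1.0, 0.0), toy_index, k=2))
```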

In [43]:
print(retrieved_docs[4].page_content)

This will remove the manual process of creating any White-Labeled or Reseller account.
Video Link:https://drive.google.com/drive/folders/19KilDEeSwU5Bz05t5e1SU6O4tWiYwUxF?usp=drive_link
Help Link:https://help.botpenguin.com/partner-documentation/botpenguin-partner-onboarding/signup-as-a-botpenguin-partner
✅MS Teams
A New Platform for Bot Creation has been added in BotPenguin now our customers are also able to add Bots for MS Teams.Ability to train the bot with the AI CapabilitiesAbility to to access inboxAbility to access setting related to the bot.
A New Platform for Bot Creation has been added in BotPenguin now our customers are also able to add Bots for MS Teams.
Ability to train the bot with the AI Capabilities
Ability to to access inbox
Ability to access setting related to the bot.
✅Zoho Commerce Integration


In [44]:
from langchain_core.prompts import PromptTemplate

template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Write with simple language in Paul Graham style. Write at least 5 sentences.

{context}

Question: {question}

Helpful Answer:"""

prompt = PromptTemplate.from_template(template)
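
The template has two placeholders, `{context}` and `{question}`, which are filled in at query time. The sketch below shows the substitution with plain `str.format`, which I believe mirrors the default formatting of `PromptTemplate`. `demo_template` is a shortened, hypothetical copy of the template, and the context string is a placeholder.

```python
demo_template = """Use the following pieces of context to answer the question at the end.

{context}

Question: {question}

Helpful Answer:"""

filled = demo_template.format(
    context="(retrieved chunks would go here)",  # placeholder context
    question="What is BotPenguin?",
)
print(filled)
```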

In [45]:
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

# llm = ChatOpenAI(model_name="gpt-4-0125-preview", temperature=0)
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
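
The chain reads left to right: the question goes to the retriever (whose results are joined by `format_docs`) and is also passed through unchanged, the resulting dictionary fills the prompt, the prompt goes to the model, and the parser returns a plain string. The sketch below spells out the same flow as ordinary function calls. `fake_retrieve`, `fake_llm`, `join_chunks`, and `run_chain` are hypothetical stand-ins, so nothing here calls OpenAI.

```python
def join_chunks(chunks):
    # Same idea as format_docs, but on plain strings instead of Documents.
    return "\n\n".join(chunks)

def fake_retrieve(question):
    # Stand-in for the retriever: returns fixed chunks regardless of input.
    return ["chunk one", "chunk two"]

def fake_llm(prompt_text):
    # Stand-in for the chat model: reports how long the prompt was.
    return f"(model saw {len(prompt_text)} characters)"

def run_chain(question):
    inputs = {
        "context": join_chunks(fake_retrieve(question)),  # retriever | format_docs
        "question": question,                             # RunnablePassthrough()
    }
    prompt_text = "{context}\n\nQuestion: {question}\n\nHelpful Answer:".format(**inputs)
    return fake_llm(prompt_text)

print(run_chain("What is BotPenguin?"))
```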

In [46]:
rag_chain.invoke("What is BotPenguin?")

'BotPenguin is an omnichannel platform that allows clients to be present on all social media platforms 24/7, 365 days a year. It helps in solving customer queries easily and efficiently. With a user-friendly interface, BotPenguin does not require technical knowledge or coding skills to set up chatbots. As a BotPenguin Partner, you can earn high incentives and have the flexibility to set your pricing. The platform integrates with popular CRM systems like Zoho, Agile CRM, and Bitrix24, making it easy to handle customer support in one place.'

In [47]:
for chunk in rag_chain.stream("What is Access Management?"):
    print(chunk, end="", flush=True)

Access Management refers to the process of controlling and managing user access to certain features, functionalities, or data within a system or platform. This includes setting permissions, roles, and restrictions for different users or user groups to ensure that they only have access to the resources they need. Access Management also involves handling authentication, authorization, and user roles to maintain security and privacy within the system. It allows administrators to regulate who can view, edit, or delete specific information or perform certain actions within the platform. Overall, Access Management is crucial for maintaining data integrity, security, and user experience.

In [48]:
from langchain_core.runnables import RunnableParallel

rag_chain_from_docs = (
    RunnablePassthrough.assign(context=(lambda x: format_docs(x["context"])))
    | prompt
    | llm
    | StrOutputParser()
)


rag_chain_with_source = RunnableParallel(
    {"context": retriever, "question": RunnablePassthrough()}
).assign(answer=rag_chain_from_docs)

rag_chain_with_source.invoke("What is Access Management?")

{'context': [Document(page_content='Activated plans visible to customers; archived plans remain hidden.\nFeature availability dynamically controlled via toggles.\nManage consumption limits and pricing for add-ons.\nInvoices now include detailed plan and settlement information.\nCustomer and Agency Management\nAgencies must configure billing details and can now view and download invoices.Settlement handled based on message/conversation consumption or days used.New tabs for subscriptions, billing, and configurations added.Agencies can purchase plans and add-ons directly from their dashboard.Retention period defined for customer accounts after agency plan expiry.Renewal emails and consumption notifications implemented.\nAgencies must configure billing details and can now view and download invoices.\nSettlement handled based on message/conversation consumption or days used.\nNew tabs for subscriptions, billing, and configurations added.\nAgencies can purchase plans and add-ons directly fro

In [49]:
def format_to_markdown(data):
    # Render the question and answer as Markdown; the retrieved context
    # in data["context"] is not shown
    return f"Question:{data['question']}\n\nAnswer:\n{data['answer']}"
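
The commented-out `Sources` suffix hints that a source list was once planned. One way it could look is sketched below, listing the first line of each retrieved chunk. `format_with_sources` is a hypothetical variant, `Doc` is a minimal stand-in for a LangChain `Document`, and the sample data is made up.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document (only page_content is used)."""
    page_content: str

def format_with_sources(data):
    lines = [f"Question: {data['question']}", "", "Answer:", data["answer"], "", "Sources:"]
    for i, doc in enumerate(data["context"], start=1):
        first_line = doc.page_content.split("\n")[0]  # first line of the chunk
        lines.append(f"{i}. {first_line}")
    return "\n".join(lines)

sample = {
    "question": "What is X?",
    "answer": "X is a placeholder.",
    "context": [Doc("URL : https://example.com/a\nbody"), Doc("Second chunk\nmore")],
}
print(format_with_sources(sample))
```

Because chunks are cut by size, the first line is a `URL :` heading only when a chunk happens to start at a page boundary, so this is a rough hint rather than a reliable citation.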

In [50]:
from IPython.display import display, Markdown

res = rag_chain_with_source.invoke("What are various plans available at BotPenguin?")
display(Markdown(format_to_markdown(res)))

Question:What are various plans available at BotPenguin?

Answer:
At BotPenguin, there are different plans available for users. The plans include the Baby Plan, King Plan, and more. The Baby Plan is like a trial with no time limit, allowing users to take their time to size up. The King Plan allows users to create chatbots for websites, Facebook, and Telegram, along with other features. Additionally, there are options to add more agents at additional charges. Users can also create custom plans for their customers and manage them accordingly.

In [51]:
def ask(q):
    res = rag_chain_with_source.invoke(q)
    return display(Markdown(format_to_markdown(res)))

In [52]:
ask("How can i use BotPenguin in my Business?")

Question:How can i use BotPenguin in my Business?

Answer:
To use BotPenguin in your business, you can start by becoming a partner through the BotPenguin Partnership Program. Depending on your preference and business model, you can choose to be an affiliate partner, white label partner, or implementation partner. Once you have partnered with BotPenguin, you can integrate the platform into your business by following the provided guidelines and steps. This will allow you to leverage the benefits of BotPenguin's omnichannel platform, easy-to-use chatbot setup, and seamless integration with popular CRM platforms. By using BotPenguin in your business, you can enhance customer support, increase efficiency, and drive growth.

In [53]:
ask("How can i contact with Botpenguin, provide Contact Details if possible?")

Question:How can i contact with Botpenguin, provide Contact Details if possible?

Answer:
You can contact BotPenguin through their headquarters in Mohali, India, located at 303, C-184, Third Floor, Sector 75, Mohali, Punjab 160071. Additionally, you can reach their sales office in Illinois, USA, at 2323 N Pulaski Rd, Chicago, IL 60639, United States. For more information, you can visit their website at https://help.botpenguin.com/product-updates/release-updates/april-24-releases.

In [54]:
ask("What are the various ratings that Botpenguin have obtained?")

Question:What are the various ratings that Botpenguin have obtained?

Answer:
BotPenguin has received positive ratings on platforms like G2, Capterra, Trustpilot, and Good Firms. Customers have praised the platform for its ability to execute in a timely manner, its practicality for sales and marketing objectives, and its robust support for all queries and issues. The platform has been rated highly for its ease of use, integrations with popular platforms, and transparency in its incentive structure. Overall, BotPenguin has garnered positive feedback from customers and partners across various platforms.