### Build an EVUniverse KB from Blog articles

Although LLMs are powerful, they do not know about information they were not trained on. If you want to use an LLM to answer questions about documents it was not trained on, you have to give it information about those documents.

The idea is that, for every question you want to ask ChatGPT, you first run a retrieval step to fetch the relevant documents, then pass those documents along with the original question to the language model so it can generate a response.
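The retrieve-then-generate flow can be sketched without calling any model. This is a minimal illustration, not the notebook's actual implementation: the helper names (`tokenize`, `retrieve`, `build_prompt`) are hypothetical, the two documents are toy examples, and simple word overlap stands in for the embedding-based retrieval used in later steps.

```python
import re

def tokenize(text):
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents, top_k=1):
    """Rank documents by how many words they share with the question."""
    q_words = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q_words & tokenize(d)), reverse=True)
    return ranked[:top_k]

def build_prompt(question, context_docs):
    """Combine the retrieved documents and the question into one prompt string."""
    context = "\n\n".join(context_docs)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Toy documents for illustration only.
documents = [
    "Electric cars do not need oil changes because they have no internal combustion engine.",
    "Home charging stations are usually installed in a garage or driveway.",
]
question = "Do electric cars need oil changes?"

top_docs = retrieve(question, documents)
prompt = build_prompt(question, top_docs)
print(prompt)
```

The resulting `prompt` string is what would be sent to the language model in place of the bare question.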

**Full Q&A Solution Procedure:**

 1. Prerequisites: Import libraries, set API key (if needed)
 2. Collect: We crawl the few hundred EVUniverse Help articles from their website.
 3. Chunk: Documents are split into short, semi-self-contained sections to be embedded
 4. Embed: Each section is embedded with the OpenAI API
 5. Question Answer System: Build a question answer system with your embeddings
 6. Ask Questions
 
This notebook concentrates on Step 2: collecting the data.

#### 2. Collect: Build an EVUniverse KB from Blogs with LlamaIndex and Playwright


In [None]:
# Delete the KB Data Directory.  This is mostly for testing and makes it easier to clean things up.

import os                                   # for accessing files and Env Variables
import shutil
project_dir = "../../evuniverse/"

DEBUG = False

def resetEVUniverseKB(delete_this_dir, debug = DEBUG):


    # Delete the directory that stores the text files
    try:
        if os.path.exists(delete_this_dir):
            if debug:
                print("Found Dir", os.path.abspath(delete_this_dir), "\n")
            if os.path.abspath(delete_this_dir) != os.path.abspath(project_dir):
                shutil.rmtree(delete_this_dir)
                if debug:
                    print("Deleted Dir", os.path.abspath(delete_this_dir), "\n")
            else:
                print("Cannot delete:", os.path.abspath(delete_this_dir), "It's the main project dir:", os.path.abspath(project_dir), "\n")
    except Exception as ex:
        print("Unable to delete directory:", delete_this_dir, "Exception: ", ex)
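The guard in `resetEVUniverseKB` only refuses a path that is exactly the project directory. Deleting a parent of the project directory would remove it as well. A minimal sketch of a stricter check follows, assuming a hypothetical helper `is_safe_to_delete` and example paths that are not part of the notebook.

```python
import os

def is_safe_to_delete(target_dir, protected_dir):
    """Return True only if deleting target_dir cannot remove protected_dir."""
    target = os.path.abspath(target_dir)
    protected = os.path.abspath(protected_dir)
    # If target is the protected directory or one of its ancestors,
    # the common path of the two is target itself.
    return os.path.commonpath([target, protected]) != target

# Example paths are hypothetical.
inside = is_safe_to_delete("/work/evuniverse/processed_data", "/work/evuniverse")
same = is_safe_to_delete("/work/evuniverse", "/work/evuniverse")
parent = is_safe_to_delete("/work", "/work/evuniverse")
print(inside, same, parent)
```

Only the first call, which targets a subdirectory, is reported as safe.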


In [None]:
# Retrieve all EVUniverse Help Center articles
# This uses the LlamaIndex and Playwright packages in the background

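import os                                   # for accessing files and Env Variables
import llama_index
from llama_index import download_loader

def crawlDomainBlogs(domain, debug):

    # Crawl the KB for data.
    # download_loader() requires the name of the loader class to fetch, and it
    # returns a class that must be instantiated before load_data() is called.
    # The loader name is not given here, so these two calls will not run as written.
    loader = download_loader()
    docs = loader.load_data()

    if debug:
        # Report the domain that was passed in rather than the global URL_PREFIX
        print("crawlDomainBlogs: We found", len(docs), "documents in", domain, "KB\n\n")
        # print("crawlDomainBlogs: First Document: ", docs[0], "\n")

    return docs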
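A crawler that is limited to one domain has to decide which of the links it discovers belong to that domain. The sketch below shows one way to make that decision with the standard library. It is not what the loader does internally: `filter_domain_links` is a hypothetical helper, and the `/blogs/news` path is an assumed example.

```python
from urllib.parse import urljoin, urlparse

def filter_domain_links(base_url, hrefs):
    """Resolve hrefs against base_url and keep unique links on the same host."""
    base_host = urlparse(base_url).netloc
    kept = []
    for href in hrefs:
        absolute = urljoin(base_url, href).split("#")[0]  # drop any fragment
        if urlparse(absolute).netloc == base_host and absolute not in kept:
            kept.append(absolute)
    return kept

links = filter_domain_links(
    "https://www.evuniverse.com/",
    [
        "/blogs/news",                                # relative link (assumed path)
        "https://www.evuniverse.com/blogs/news#top",  # same page with a fragment
        "https://example.com/other",                  # different host
        "mailto:info@example.com",                    # not a web page
    ],
)
print(links)
```

Relative links are resolved first, so the duplicate with a fragment collapses into the same URL and the off-domain entries are dropped.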

#### 3. Convert stripped text to CSV format

CSV is a common format for storing data along with its metadata. You can use this format from Python by converting the raw text files (which are in the text directory) into pandas DataFrames. pandas is a popular open-source library for working with tabular data stored in rows and columns.
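As a small illustration of that round trip, the sketch below builds a one-row DataFrame with the same `title` and `text` columns, writes it as CSV to an in-memory buffer, and reads it back. The buffer is an assumption made to keep the example free of side effects; the notebook itself writes to a file.

```python
import io
import pandas as pd

rows = [
    (
        "Stay Charged: Your Ultimate Guide to Electric Car Maintenance",
        "Like any vehicle, an electric car needs regular care.",
    ),
]
df = pd.DataFrame(rows, columns=["title", "text"])

# Write to an in-memory buffer instead of a file.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Read the CSV back; the text field contains a comma, so it is quoted on write.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored.shape)
```

Passing `index=False` keeps the row index out of the file, so the columns read back are exactly `title` and `text`.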



#### Each crawled document looks something like this:

Home
»
Stay Charged: Your Ultimate Guide to Electric Car Maintenance

Stay Charged: Your Ultimate Guide To Electric Car Maintenance
June 01, 2023
EV Universe
EV Universe
As the wave of electric vehicle (EV) enthusiasm continues to surge, you might be contemplating joining the electric revolution. When we talk about electric car maintenance, we mean the regular checks and care to keep your EV running at its best. Like any vehicle, an electric car needs regular care. However, you’ll be glad to know that maintenance for electric cars tends to be simpler and less frequent than for their gasoline-powered counterparts. With fewer moving parts, electric vehicles skip the need for oil changes, spark plug replacements, or timing belt adjustments – all thanks to the absence of an internal combustion engine.

 

electric car maintenance i7 front pty EVU

 

But don’t be fooled – electric cars aren’t entirely maintenance-free. In this guide, we’re going to delve into everything you need to know about maintaining an electric car. So buckle up, and let’s take this informative ride together!

 

What Is Electric Car Maintenance?
When we talk about electric car maintenance, we’re referring to the regular checks and care needed to keep your electric vehicle (EV) running smoothly and efficiently. Just like with any vehicle, maintenance is crucial. However, the good news for EV owners is that electric car maintenance is usually simpler and less frequent than it is for gasoline-powered vehicles.

 

Electric vehicles have fewer moving parts than their conventional counterparts. Without an internal combustion engine (ICE), there’s no need for oil changes, spark plug replacements, or timing belt adjustments. This makes maintenance for electric cars generally less complex and less frequent compared to ICE cars. If you’re interested in a more in-depth comparison between electric car and ICE car maintenance, we recommend you check out our article on How EV Maintenance Compares to That for ICE Cars.
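The sample above suggests a layout of breadcrumb lines, the title, a publication date, author lines, and then the body. A minimal sketch for pulling those parts apart follows. It assumes every document has a date line in the `June 01, 2023` style directly after the title, which is only confirmed for this one sample, and `split_header` is a hypothetical helper.

```python
from datetime import datetime

def split_header(raw_text, date_format="%B %d, %Y"):
    """Return (title, date, body) by locating the first line that parses as a date."""
    lines = [line.strip() for line in raw_text.splitlines() if line.strip()]
    for i, line in enumerate(lines):
        try:
            published = datetime.strptime(line, date_format).date()
        except ValueError:
            continue
        title = lines[i - 1] if i > 0 else ""
        body = " ".join(lines[i + 1:])  # still includes the author lines
        return title, published, body
    # No date line found: treat everything as body text.
    return "", None, " ".join(lines)

sample = """Home
»
Stay Charged: Your Ultimate Guide to Electric Car Maintenance

Stay Charged: Your Ultimate Guide To Electric Car Maintenance
June 01, 2023
EV Universe
EV Universe
As the wave of electric vehicle (EV) enthusiasm continues to surge, you might be contemplating joining the electric revolution."""

title, published, body = split_header(sample)
print(title)
print(published)
```

The line just before the date is taken as the title, which skips the breadcrumb lines without having to match them explicitly.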


In [None]:
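import pandas as pd

# Blank lines can clutter the text and make it harder to process.
# A simple function removes them and tidies up the text.
def remove_newlines(text):
    # `text` is a pandas Series of strings
    text = text.str.replace('\n', ' ', regex=False)      # real newline characters
    text = text.str.replace('\\n', ' ', regex=False)     # literal backslash-n sequences
    text = text.str.replace(r' {2,}', ' ', regex=True)   # collapse runs of spaces
    return text


def populateDF(docs, debug = DEBUG):

    # Build (title, text) rows from the extracted documents.
    # Assumption: each document exposes its content as `doc.text`, and a title
    # is read from `doc.extra_info` when the loader provides one (empty otherwise).
    texts = [((getattr(doc, 'extra_info', None) or {}).get('title', ''), doc.text) for doc in docs]

    # Create a dataframe from the documents extracted
    df = pd.DataFrame(texts, columns = ['title', 'text'])

    # Remove blank lines from the text.
    df['text'] = remove_newlines(df.text)

    return df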
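Replacing two spaces with one in a single pass leaves longer runs of spaces only partly collapsed. A minimal sketch of the same cleanup on a plain string is shown below, assuming a hypothetical helper `normalize_whitespace` that uses one regular expression for all whitespace runs.

```python
import re

def normalize_whitespace(text):
    """Collapse newlines, literal backslash-n sequences, and repeated spaces."""
    text = text.replace("\\n", " ")    # literal backslash followed by n
    text = re.sub(r"\s+", " ", text)   # any run of whitespace, including newlines
    return text.strip()

raw = "Stay Charged\n\n \n\nWhat Is Electric Car Maintenance?\\nWhen we talk about electric car maintenance"
cleaned = normalize_whitespace(raw)
print(cleaned)

# A single-pass replacement of double spaces leaves two spaces behind here.
single_pass = "a    b".replace("  ", " ")
print(repr(single_pass), repr(normalize_whitespace("a    b")))
```

The same pattern can be applied to a pandas Series with `Series.str.replace(r"\s+", " ", regex=True)`.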

In [None]:
import os

# Collect
processed_data_dir = "../processed_data/"
URL_PREFIX = "https://www.evuniverse.com/"     #For URL of Supporting Evidence
DEBUG = False


def storeRawDataAsCSV(domain = URL_PREFIX, processed_data_dir = processed_data_dir, debug = DEBUG):
    
    docs = crawlDomainBlogs(domain, debug)  
    
    # Store Data in CSV format.
    df = populateDF(docs, debug)
    
    # Create the directory that stores the CSV file (no error if it already exists)
    os.makedirs(processed_data_dir, exist_ok=True)

    # Export the dataframe df to a CSV file called scraped.csv
    df.to_csv(os.path.join(processed_data_dir, 'scraped.csv'))
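Two details of the output step are easy to get wrong: creating a directory that may already exist, and joining a directory that ends in a separator with a file name. The sketch below illustrates both with the standard library. It writes into a temporary directory, which is an assumption made so that the example does not touch the project tree.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Trailing separator, like "../processed_data/" in the notebook.
    out_dir = os.path.join(tmp, "processed_data", "")

    os.makedirs(out_dir, exist_ok=True)
    os.makedirs(out_dir, exist_ok=True)   # repeating the call does not raise

    # os.path.join does not add a second separator after the trailing one.
    csv_path = os.path.join(out_dir, "scraped.csv")
    with open(csv_path, "w", encoding="utf-8") as fh:
        fh.write("title,text\n")

    file_exists = os.path.isfile(csv_path)
    has_double_separator = (os.sep * 2) in csv_path

print(file_exists, has_double_separator)
```

Plain string concatenation with `'/scraped.csv'` would instead produce a path with a doubled separator when the directory already ends in one.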


In [None]:
storeRawDataAsCSV(processed_data_dir = "../processed_data1/", debug = True)

In [None]:
resetEVUniverseKB(delete_this_dir = "../processed_data1/", debug = True)