In [5]:
# First draft of the prompt. The condensed version in the next cell reassigns `task` and is the one used in the run.
task = """Role and Goal: "VPC QuestGen" specializes in generating high-quality Q&A pairs from AWS VPC documentation. The primary goal is to provide detailed and technically accurate explanations. When given a URL, the AI uses its web browsing capability to access and analyze the content of the provided link. The focus is on generating complete Q&A pairs that offer in-depth insights and comprehensive explanations, steering clear of partial or example-based answers.

Task Simplification: The AI focuses on a single URL at a time provided later in this prompt. It uses the OpenAI API's web browsing feature to directly access the link, read the content, and generate Q&A pairs based on the information available on the page.

Style and Format: The AI maintains a uniform format for the Q&A pairs, emphasizing clarity and thoroughness in the answers. It is tasked with providing high-quality Q&A pairs based on the web content accessed. Responses include the URL for transparency and traceability. The AI avoids displaying extraneous content, sticking strictly to generating Q&A pairs from the browsed content.

Quality Assurance: The inclusion of the URL in responses enables source verification, ensuring the relevance and reliability of the information used. The AI is programmed to provide fully explanatory answers that cover all aspects of the query, based on the information available on the browsed page.

Interaction Style: The AI adopts a formal, technical tone. It clearly states any limitations and ensures that responses are comprehensive and based solely on the content of the accessed URL.

Personalization: The AI is optimized for efficiency, focusing on generating thorough and relevant Q&A pairs using information from the browsed URL. It does not engage in tasks beyond this scope.

Error Handling: If the AI determines that the page lacks sufficient information to create high-quality Q&A pairs, it will return no response, adhering to the principle of quality over quantity in information delivery."""

In [8]:
task = """VPC QuestGen's Role: Specialize in creating detailed, technically accurate Q&A pairs from AWS VPC documentation, using a URL provided later in this prompt. It employs the OpenAI API's web browsing feature to access and analyze linked content for generating comprehensive Q&A pairs.

Task Focus: Handle one URL at a time, using the information from the link to produce Q&A pairs that are clear, thorough, and formatted uniformly. Responses include the URL for source verification.

Quality and Style: The AI generates high-quality answers, maintaining a formal, technical tone. It avoids irrelevant content, focusing solely on the browsed URL's content. Responses cover all aspects of the query, ensuring relevance and reliability.

Response Format: Each response should follow a strict format - (QUESTION: ..., ANSWER: ... ). Each Q&A pair must include the source URL at the end. Questions are to be concise and directly related to the content, while answers provide in-depth explanations.

Limitations and Error Handling: The AI clearly states any limitations in the information available. If the page lacks enough data to create quality Q&A pairs, the AI will not provide a response, prioritizing quality over quantity."""
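
The prompt above tells the model to browse the URL itself. As far as I know, the Chat Completions call used later in this notebook only receives the prompt text and does not fetch pages, so the answers may not be grounded in the linked page. A minimal sketch of one workaround, assuming the documentation pages are plain HTML: download the page locally, strip it to visible text, and paste that text into the prompt. The names `html_to_text`, `fetch_page_text`, `page_prompt`, and the `max_chars` cutoff are hypothetical helpers, not part of the original notebook.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Return the visible text of an HTML document, one chunk per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def fetch_page_text(url, timeout=30):
    """Download a page and reduce it to text (needs network access)."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        html = resp.read().decode(charset, errors="replace")
    return html_to_text(html)


def page_prompt(task, url, page_text, max_chars=12000):
    """Build a prompt that carries the page text; max_chars is an assumed cutoff."""
    return (
        "I am going to provide instructions on how to answer. "
        f"Please follow them strictly: {task}\n\nURL: {url}\n\n"
        f"PAGE CONTENT:\n{page_text[:max_chars]}"
    )
```

Because `process_vpc_links` calls its prompt function as `custom_prompt_func(task, url)`, a wrapper such as `lambda task, url: page_prompt(task, url, fetch_page_text(url))` could be passed in place of `scraping_prompt`. Long pages may still exceed the model's context window, so the character cutoff would need tuning.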

In [7]:
# ----- Import Libraries -----
import pandas as pd
import openai
from dotenv import load_dotenv
import os

In [4]:
# ----- User Settings -----
input_file_path = '../06_Data/Capstone_Data/Classified_VPC_Links.csv'
output_file_path = '../06_Data/Capstone_Data/Documentation_QA_Pairs.csv'
openai_api_key_env_var = "OPENAI_KEY"
max_tokens = 1000

In [12]:
# ----- Function Definitions -----
def get_openai_response(url, custom_prompt_func, task):
    """Build the prompt for one URL and return the model's reply, or None on failure."""
    prompt = custom_prompt_func(task, url)
    try:
        # Legacy interface: openai.ChatCompletion exists only in openai<1.0
        response = openai.ChatCompletion.create(
            model="gpt-4-1106-preview",  # Use the chat model
            messages=[{"role": "system", "content": "You are a helpful assistant."},
                      {"role": "user", "content": prompt}],
            max_tokens=max_tokens,  # Cap the reply length (set in User Settings)
        )
        # Chat models return the text under choices[0].message.content
        return response['choices'][0]['message']['content']
    except Exception as e:
        print(f"Error in getting response: {e}")
        return None

def scraping_prompt(task, url):
    return f"I am going to provide instructions on how to answer. Please follow them strictly: {task}, URL: {url}"

def process_vpc_links(file_path, output_path, task, custom_prompt_func=scraping_prompt, num_links=None):
    """Query the model once per link in the CSV and save the replies in a 'Response' column."""
    links_df = pd.read_csv(file_path)
    if num_links is not None:
        # Copy the slice so that adding a column below does not trigger SettingWithCopyWarning
        links_df = links_df.head(num_links).copy()

    responses = []

    # enumerate gives a 1-based counter that does not depend on the DataFrame index
    for count, (_, row) in enumerate(links_df.iterrows(), start=1):
        url = row['LINK']
        response = get_openai_response(url, custom_prompt_func, task)
        responses.append(response)  # None when the request failed
        print(f"Processed link {count}/{len(links_df)}")

    links_df['Response'] = responses
    # The raw reply stays in a single column; it is not split into Question/Answer columns here.

    links_df.to_csv(output_path, index=False)
    print(f"Processed links with responses saved to: {output_path}")

# ----- Script Execution -----
# Load the API key from the .env file (variable name and file paths come from User Settings)
load_dotenv()
openai.api_key = os.getenv(openai_api_key_env_var)

# Process links and get responses; set num_links to limit the run while testing
process_vpc_links(input_file_path, output_file_path, task, num_links=2)

Processed link 1/2
Processed link 2/2
Processed links with responses saved to: ../06_Data/Capstone_Data/Documentation_QA_Pairs.csv
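
The saved `Response` column holds the raw model reply. A minimal sketch of how it might be split into separate question and answer fields, assuming the model really follows the `QUESTION: ... ANSWER: ...` layout requested in the prompt. The regular expression and the helper names `parse_qa_pairs` and `responses_to_rows` are illustrative assumptions, and real replies may need a more forgiving pattern.

```python
import re

# Assumed layout: each pair starts with "QUESTION:" followed by "ANSWER:".
# The answer runs until the next "QUESTION:" or the end of the text.
QA_PATTERN = re.compile(
    r"QUESTION:\s*(?P<question>.+?)\s*ANSWER:\s*(?P<answer>.+?)(?=\s*QUESTION:|\Z)",
    re.DOTALL,
)


def parse_qa_pairs(response_text):
    """Return a list of (question, answer) tuples found in one reply."""
    # Failed requests are stored as None (NaN after a CSV round trip)
    if not isinstance(response_text, str):
        return []
    pairs = []
    for match in QA_PATTERN.finditer(response_text):
        question = match.group("question").strip(" ,\n")
        answer = match.group("answer").strip(" ,\n")
        pairs.append((question, answer))
    return pairs


def responses_to_rows(link_response_pairs):
    """Flatten (link, reply) pairs into one dict per Q&A pair."""
    rows = []
    for link, reply in link_response_pairs:
        for question, answer in parse_qa_pairs(reply):
            rows.append({"LINK": link, "Question": question, "Answer": answer})
    return rows
```

With the DataFrame from the run above, `pd.DataFrame(responses_to_rows(zip(links_df['LINK'], links_df['Response'])))` would give one row per Q&A pair. Any source URL the model appends after an answer stays inside the `Answer` text in this sketch.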
