## Validation dataset
This notebook parses the FAQ from the ROSA workshop documents into question-answer pairs and saves them as a dataset for fine-tuning or validating text generation models.

In [3]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

In [4]:
# Generated through successive iterations of ChatGPT code, with minor edits

with open('../data/external/rosaworkshop/14-faq.md', 'r') as f:
    contents = f.read()

# Split the text into lines
lines = contents.split('\n')

# Initialize variables to store the current question and its answer
current_question = None
current_answer = None

question_answer = {}

# Loop through each line of the text
for line in lines:
    # Check if the line starts with '###', indicating a new question
    if line.startswith('###'):
        # If there was a previous question, store it and its answer
        if current_question and current_answer:
            question_answer[current_question] = current_answer
        # Set the new question as the current question
        current_question = line[4:]
        # Reset the current answer
        current_answer = None
    # Skip every heading line, including the question heading itself
    if line.startswith('#'):
        continue
    # Otherwise, a non-empty line is part of the answer to the current question
    elif line != '':
        # Append the answer to the current answer (if any)
        if current_answer:
            current_answer += '\n' + line
        else:
            current_answer = line

# Store the last question and its answer (if any)
if current_question and current_answer:
    question_answer[current_question] = current_answer


In [5]:
validation_set = pd.DataFrame(question_answer.items(), columns=['Question', 'Answer'])
validation_set

Unnamed: 0,Question,Answer
0,What is Red Hat OpenShift Service on AWS (ROSA)?,"Red Hat Openshift Service on AWS (ROSA) is a fully-managed turnkey application platform that allows you to focus on what matters most, delivering value to your customers by building and deploying applications. Red Hat SRE experts manage the underlying platform so you don’t have to worry about the complexity of infrastructure management."
1,Where can I go to get more information/details?,- [ROSA Webpage](https://www.openshift.com/products/amazon-openshift)\n- [ROSA Workshop](https://www.rosaworkshop.io)\n- [ROSA Documentation](https://docs.openshift.com/rosa/welcome/index.html)
2,What are the benefits of Red Hat OpenShift Service on AWS (Key Features)?,"- **Native AWS service:** Access and use Red Hat OpenShift on demand with a self-service on-boarding experience through the AWS management console.\n- **Flexible, consumption-based pricing:** Scale as per your business needs and pay as you go with flexible pricing with an on-demand hourly or annual billing model.\n- **Single bill for Red Hat OpenShift & AWS usage:** Customers will receive a single bill from AWS for both Red Hat OpenShift and AWS consumption.\n- **Fully integrated support experience:** Installation, management, maintenance, and upgrades are performed by Red Hat site reliability engineers (SRE) with joint Red Hat and Amazon support and a 99.95% SLA.\n- **AWS service integration:** AWS has a robust portfolio of cloud services, such as compute, storage, networking, database, analytics, and machine learning, which are directly accessible via Red Hat OpenShift Service on AWS. This makes it easier to build, operate, and scale globally on demand through a familiar management interface.\nAdditional key features of Red Hat OpenShift Service on AWS:\n- **Maximum Availability:** Deploy clusters across multiple Availability Zones in supported Regions to maximize availability to maintain high availability for your most demanding mission-critical applications and data.\n- **Cluster node scaling:** Easily add or remove compute nodes to match resource demand\n- **Optimized clusters:** Choose from memory-optimized, compute-optimized, or general purpose EC2 instance types, with clusters sized to meet your needs. 
See [AWS compute types](https://docs.openshift.com/rosa/rosa_architecture/rosa_policy_service_definition/rosa-service-definition.html#rosa-sdpolicy-aws-compute-types_rosa-service-definition).\n- **Global availability:** Please refer to the [product regional availability page](https://docs.openshift.com/rosa/rosa_architecture/rosa_policy_service_definition/rosa-service-definition.html#rosa-sdpolicy-regions-az_rosa-service-definition) page for an up-to-date view of where Red Hat OpenShift Service on AWS is available."
3,What are the differences between Red Hat OpenShift Service on AWS and Kubernetes?,"Everything you need to deploy and manage containers is bundled with ROSA, including container management, automation (Operators), networking, load balancing, service mesh, CI/CD, firewall, monitoring, registry, authentication, and authorization capabilities. These components are tested together for unified operations as a complete platform. Automated cluster operations, including over-the-air platform upgrades, further enhance your Kubernetes experience."
4,What exactly am I responsible for and what is Red Hat / AWS responsible for?,"In short, anything that is related to deploying the cluster or keeping the cluster running will be Red Hat’s or AWS’s responsibility, and anything relating to the applications, users, or data is the customers responsibility. Please see our [responsibility matrix](https://docs.openshift.com/rosa/rosa_architecture/rosa_policy_service_definition/rosa-policy-responsibility-matrix.html) for more details."
...,...,...
60,What features are upcoming for ROSA?,The current ROSA roadmap can be seen at: [https://red.ht/rosa-roadmap](https://red.ht/rosa-roadmap)
61,What kind of instances are supported for worker nodes?,"See [AWS compute types](https://docs.openshift.com/rosa/rosa_architecture/rosa_policy_service_definition/rosa-service-definition.html#rosa-sdpolicy-aws-instance-types_rosa-service-definition) in the service definition for the up to date list of supported instance types. Additionally, spot instances are supported."
62,"Does ROSA support an air-gapped, disconnected environment where the ROSA cluster does not have internet access?","No, the ROSA cluster must have egress to the internet to access our registry, S3, send metrics etc. The service requires a number of [egress endpoints](https://docs.openshift.com/rosa/rosa_install_access_delete_clusters/rosa_getting_started_iam/rosa-aws-prereqs.html#osd-aws-privatelink-firewall-prerequisites). Ingress can be limited to PrivateLink (for Red Hat SRE) and VPN or similar for customer access."
63,Is node autoscaling available?,Yes. Autoscaling allows you to automatically adjust the size of the cluster based on the current workload. See [About autoscaling nodes on a cluster](https://docs.openshift.com/rosa/rosa_cluster_admin/rosa_nodes/rosa-nodes-about-autoscaling-nodes.html) in the documentation for more details.


In [6]:
validation_set.to_csv('../data/processed/validation_data.csv')
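
By default, `DataFrame.to_csv` also writes the row index as an unnamed first column, so reading the file back with `pd.read_csv` yields an extra `Unnamed: 0` column. The sketch below shows two common ways to handle this, using an in-memory buffer and made-up rows instead of the real file; which option suits the downstream code is a judgment call.

```python
import io

import pandas as pd

# Made-up rows standing in for the real validation set.
df = pd.DataFrame({'Question': ['Q1', 'Q2'], 'Answer': ['A1', 'A2']})

# Default behavior: the index is written as an unnamed first column.
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)
with_index = pd.read_csv(buffer)

# Option 1: tell read_csv that the first column is the index.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

# Option 2: do not write the index at all.
buffer_no_index = io.StringIO()
df.to_csv(buffer_no_index, index=False)
buffer_no_index.seek(0)
no_index = pd.read_csv(buffer_no_index)

print(list(with_index.columns))
print(list(restored.columns))
print(list(no_index.columns))
```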

## Create question-answer pairs from the documentation dataset

In [9]:
import markdown

# Open the Markdown file and read its contents
with open("../data/external/rosaworkshop/1-account_setup.md", "r") as file:
    md_text = file.read()

In [10]:
md_text

'There are currently two supported credential methods when creating a ROSA cluster. One method uses an IAM user with the *AdministratorAccess* policy (only for the account using ROSA). The other, more recent, and **recommended** method uses AWS STS. Please see the section "[ROSA with STS Explained](15-sts_explained.md)" for a detailed explanation. In this workshop we will only be using the STS method.\n\n## Prerequisites\n\nPlease review the prerequisites found in the documentation at [Prerequisites for ROSA w/STS](https://docs.openshift.com/rosa/rosa_planning/rosa-sts-aws-prereqs.html) before getting started.\n\n\nYou will need the following pieces of information from your AWS account:\n\n- AWS IAM User\n- AWS Access Key ID\n- AWS Secret Access Key\n\n### A Red Hat account\nIf you do not have a Red Hat account, create one here <https://console.redhat.com/>. Accept the required terms and conditions. Then check your email for a verification link.\n\n### Install the AWS CLI\n[Install the
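
Because the whole file is read into one string, it can help to see how the text breaks down by heading before sending it to a model. The following is a dependency-free sketch using only `re`; `split_by_heading` is a hypothetical helper, not part of this notebook or of LangChain, and it does not treat `#` lines inside fenced code blocks specially. The sample is a shortened excerpt of the text shown above.

```python
import re


def split_by_heading(md):
    """Split Markdown into (heading, body) pairs.

    Text before the first heading is returned with an empty heading.
    """
    sections = []
    heading, body = '', []
    for line in md.split('\n'):
        match = re.match(r'^(#{1,6})\s+(.*)$', line)
        if match:
            # Close the section collected so far and start a new one.
            sections.append((heading, '\n'.join(body).strip()))
            heading, body = match.group(2).strip(), []
        else:
            body.append(line)
    sections.append((heading, '\n'.join(body).strip()))
    # Drop sections that have neither a heading nor any text.
    return [(h, b) for h, b in sections if h or b]


sample = (
    "In this workshop we will only be using the STS method.\n\n"
    "## Prerequisites\n\n"
    "You will need the following pieces of information from your AWS account:\n\n"
    "- AWS IAM User\n- AWS Access Key ID\n- AWS Secret Access Key\n\n"
    "### A Red Hat account\n"
    "If you do not have a Red Hat account, create one here <https://console.redhat.com/>.\n"
)
sections = split_by_heading(sample)
for heading, body in sections:
    print(repr(heading), len(body))
```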

In [None]:
import os
import re
import time

import nltk
import pandas as pd
from dotenv import load_dotenv, find_dotenv

from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain.chains.question_answering import load_qa_chain
from langchain.docstore.document import Document
from langchain.document_loaders import TextLoader, DirectoryLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.text_splitter import MarkdownTextSplitter
from langchain.vectorstores import Chroma

pd.set_option('display.max_colwidth', None)

# Load environment variables from credentials.env and enable LangChain tracing
load_dotenv(find_dotenv("credentials.env"), override=True)
os.environ["LANGCHAIN_TRACING"] = "true"

In [None]:
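# Note: `docs` and `query` are not defined earlier in this notebook, so this
# unexecuted cell needs them set first (a list of Documents and a question string).
chain = load_qa_with_sources_chain(OpenAI(temperature=0), chain_type="stuff")
answer = chain({"input_documents": docs, "question": query}, return_only_outputs=True)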

In [66]:
from langchain.prompts import PromptTemplate
from langchain.llms import OpenAI

llm = OpenAI(temperature=0, max_tokens=-1)
prompt = PromptTemplate(
    input_variables=["md"],
    template="{md} \n List and describe in detail the 15 major points covered in this guide. Write 100 words for each point",
)
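
To see what text the model actually receives, the template above can be filled in by hand. This sketch assumes the template uses LangChain's default Python-style `{name}` placeholders, which plain `str.format` approximates; `sample_md` is a made-up stand-in for `md_text`.

```python
# Same template string as in the PromptTemplate above.
template = (
    "{md} \n List and describe in detail the 15 major points covered in this guide. "
    "Write 100 words for each point"
)

# Hypothetical short document used in place of the full md_text.
sample_md = "## Prerequisites\n\nPlease review the prerequisites before getting started."

# The document text is substituted first and the instruction follows it.
filled = template.format(md=sample_md)
print(filled)
```

Since the entire document is placed ahead of the instruction, the prompt and the completion together have to fit within the model's context window.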

In [67]:
llm

OpenAI(cache=None, verbose=False, callback_manager=<langchain.callbacks.shared.SharedCallbackManager object at 0x7f15fa1d9610>, client=<class 'openai.api_resources.completion.Completion'>, model_name='text-davinci-003', temperature=0.0, max_tokens=-1, top_p=1, frequency_penalty=0, presence_penalty=0, n=1, best_of=1, model_kwargs={}, openai_api_key=None, openai_api_base=None, openai_organization=None, batch_size=20, request_timeout=None, logit_bias={}, max_retries=6, streaming=False, allowed_special=set(), disallowed_special='all')

In [68]:
from langchain.chains import LLMChain
chain = LLMChain(llm=llm, prompt=prompt)

In [69]:
ans = chain.run(md_text)

In [70]:
print(ans)

.

1. Credential Methods: There are two supported credential methods when creating a ROSA cluster. The first method uses an IAM user with the *AdministratorAccess* policy and the second, more recent, and recommended method uses AWS STS.
2. Prerequisites: Before getting started, it is important to review the prerequisites found in the documentation at [Prerequisites for ROSA w/STS](https://docs.openshift.com/rosa/rosa_planning/rosa-sts-aws-prereqs.html). This includes having a Red Hat account, installing the AWS CLI, enabling ROSA, installing the ROSA CLI, and installing the OpenShift CLI.
3. Configure the AWS CLI: After installing the AWS CLI, it is important to configure it with the correct AWS Access Key ID, AWS Secret Access Key, default region, and output format.
4. Ensure the ELB Service Role Exists: It is important to make sure that the service role for ELB already exists, otherwise the cluster deployment could fail. As such, it is important to check for the role and create it if