Import necessary libraries and modules.

In [None]:
!pip install -r requirements.txt

In [3]:
import pandas as pd
import numpy as np
import time
import os
import uuid
from pinecone import Pinecone, ServerlessSpec
from langchain_pinecone import PineconeVectorStore
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.chains import RetrievalQA
from langchain_core.prompts import PromptTemplate

Read the OpenAI and Pinecone API keys from environment variables.

In [4]:
openai_key = os.getenv('OPENAI_API_KEY')
pinecone_key = os.getenv('PINECONE_API_KEY')
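
`os.getenv` returns `None` when a variable is missing, so a missing key only surfaces later, at the first API call. A minimal sketch of failing early instead; `require_env` is a hypothetical helper name, not part of any library used here.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value


# Possible use in this notebook (left commented out so the sketch runs without keys):
# openai_key = require_env("OPENAI_API_KEY")
# pinecone_key = require_env("PINECONE_API_KEY")
```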

Prepare the dataset for further processing: drop the unused columns, keep the first 50 rows, create unique IDs with the `uuid` library, add an empty 'Vector' column, rename the remaining columns, and move 'ID' to the front.

In [5]:
columns = ["ID", "Title", "Text", "Vector"]
data = pd.read_csv('technology_data.csv')
data = data.drop(columns=['description','url','category'])
data = data.head(50)
data['ID'] = data.apply(lambda x: uuid.uuid4(), axis=1)
data['Vector'] = np.nan
data = data.rename(columns={'headlines':'Title','content':'Text'})
data.insert(0, 'ID', data.pop('ID'))
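
`uuid.uuid4()` returns `uuid.UUID` objects. In this notebook they become plain strings through the CSV round trip in the following cells, which suits Pinecone, whose record IDs are strings. A minimal sketch of producing string IDs directly, in case the save and reload step is skipped; `make_ids` is a hypothetical helper.

```python
import uuid


def make_ids(n: int) -> list[str]:
    """Generate n random UUID4 identifiers as plain strings."""
    return [str(uuid.uuid4()) for _ in range(n)]


ids = make_ids(50)
print(ids[0])          # a 36-character string; the value differs on every run
print(len(set(ids)))   # number of distinct IDs generated
```

With the DataFrame above, this could be assigned as `data['ID'] = make_ids(len(data))`.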

Save the prepared data as a tab-separated file.

In [6]:
knowledge_base = 'knowledge_base.csv'
data.to_csv(knowledge_base, sep='\t',index=False)

Read the tab-separated file back and populate the 'Vector' column with OpenAI embeddings of the 'Text' column.

In [20]:
df_base = pd.read_csv(knowledge_base, sep='\t')

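# Embedding client; embed_documents returns one vector per input text.
embeddings = OpenAIEmbeddings(
    openai_api_key=openai_key,
    model='text-embedding-3-small',
)
# embed_documents expects a list of strings, so convert the Series explicitly.
df_base['Vector'] = embeddings.embed_documents(df_base['Text'].tolist())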
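
The index created in the next step takes its dimension from the first row's vector, so every embedding has to share that length (1536 by default for `text-embedding-3-small`). A small sanity check could look like the sketch below; `embedding_dimension` is a hypothetical helper, and toy 3-dimensional vectors stand in for real embeddings.

```python
def embedding_dimension(vectors: list[list[float]]) -> int:
    """Return the shared length of all vectors, raising if the list is empty or ragged."""
    if not vectors:
        raise ValueError("no vectors to check")
    dim = len(vectors[0])
    for i, vec in enumerate(vectors):
        if len(vec) != dim:
            raise ValueError(f"vector {i} has length {len(vec)}, expected {dim}")
    return dim


# Toy vectors; in the notebook the input could be df_base['Vector'].tolist().
toy = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
print(embedding_dimension(toy))  # 3
```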

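Initialize a Pinecone client with the API key, then create a serverless index if one with this name does not already exist. The index dimension is taken from the length of the first embedding, and cosine similarity is used as the metric.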

In [8]:
clientPine = Pinecone(api_key=pinecone_key)

index_name = "rag-db"
if index_name not in clientPine.list_indexes().names():
    clientPine.create_index(
        name=index_name,
        dimension=len(df_base.loc[0,'Vector']),
        metric='cosine',
        spec=ServerlessSpec(
            cloud="aws",
            region="us-east-1"
        ))
    time.sleep(1)
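
A fixed one-second sleep may not be long enough for a new index to become ready. One alternative is to poll a readiness check with a timeout. The sketch below is generic: `wait_until` is a hypothetical helper, and the readiness check is passed in as a callable.

```python
import time
from typing import Callable


def wait_until(is_ready: Callable[[], bool], timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll `is_ready` until it returns True or `timeout` seconds elapse.

    Returns True if the check succeeded, False if the timeout was reached.
    """
    deadline = time.monotonic() + timeout
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

With the Pinecone client, the check could plausibly be `lambda: clientPine.describe_index(index_name).status['ready']`, a pattern that appears in Pinecone's examples; treat that as an assumption to verify against the installed client version.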

Upsert the data into the Pinecone index. Each row of the base DataFrame (`df_base`) becomes one record holding the ID, the embedding values, and the title and text as metadata.

In [9]:
index = clientPine.Index(index_name)

upsert_data = []

for idx, row in df_base.iterrows():
    item = {
        'id': row['ID'],
        "metadata": {
            "title": row['Title'],
            "text": row['Text']
        },
        "values": row['Vector']
    }
    upsert_data.append(item)

index.upsert(vectors=upsert_data, namespace='knowledge_base')

{'upserted_count': 50}
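
Fifty records fit comfortably in a single request, but larger datasets are usually upserted in batches. A minimal batching sketch follows; `batched` is a hypothetical helper and the batch size is an arbitrary assumption, not a Pinecone requirement.

```python
from typing import Iterator, Sequence


def batched(items: Sequence[dict], batch_size: int) -> Iterator[Sequence[dict]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Toy records in the same shape as `upsert_data`.
records = [{"id": str(i), "values": [0.0, 1.0], "metadata": {"title": f"t{i}"}} for i in range(5)]
for batch in batched(records, batch_size=2):
    print([r["id"] for r in batch])
# prints ['0', '1'], then ['2', '3'], then ['4']
```

In the notebook, each batch could be passed to `index.upsert(vectors=batch, namespace='knowledge_base')`.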

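Wrap the index in a LangChain `PineconeVectorStore`. It uses the same embedding model and namespace as the upsert above, so queries are embedded the same way as the stored documents. By default the store reads each document's text from the `text` metadata field, which matches the metadata written during the upsert.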

In [29]:
vector_store = PineconeVectorStore(
    index_name='rag-db',
    embedding=OpenAIEmbeddings(
        openai_api_key=openai_key,
        model='text-embedding-3-small',
    ),
    pinecone_api_key=pinecone_key,
    namespace="knowledge_base",
)

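Define a helper that assembles the prompt template from three parts: a system prompt wrapped in `<<SYS>>` markers, an instruction block that holds the retrieved context, and a question/answer block.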

In [23]:
def get_prompt(instruction, examples, new_system_prompt):
    SYSTEM_PROMPT = B_SYS + new_system_prompt + E_SYS
    prompt_template =  SYSTEM_PROMPT + instruction  + "\n" + examples
    return prompt_template

B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
sys_prompt = """\
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."""


instruction = """CONTEXT:\n\n {context}\n
"""

examples = """
Q: {question}
A: """
template = get_prompt(instruction, examples, sys_prompt)
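
The assembled template must contain exactly the placeholders that the chain will fill in, here `{context}` and `{question}`. A quick way to check this before building the `PromptTemplate` is to parse the template with the standard library. This is a sketch: `placeholders` is a hypothetical helper, and `demo_template` is a shortened stand-in for the real template.

```python
from string import Formatter


def placeholders(template: str) -> set[str]:
    """Return the names of all `{name}` fields in a format-style template."""
    return {field for _, field, _, _ in Formatter().parse(template) if field}


# A shortened stand-in for the notebook's template.
demo_template = "<<SYS>>\nBe helpful.\n<</SYS>>\n\nCONTEXT:\n\n {context}\n\nQ: {question}\nA: "
print(sorted(placeholders(demo_template)))  # ['context', 'question']
```

In the notebook, `placeholders(template)` should equal `{"context", "question"}`.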

Construct a prompt template for the Question-Answering (QA) Chain.

In [24]:
QA_CHAIN_PROMPT = PromptTemplate(
    input_variables=["context", "question"],
    template=template,
)

We perform a similarity search using the `similarity_search` method of the `vector_store` object. The method takes the following parameters:
- `query`: The query string representing the user query.
- `k`: The number of nearest neighbors to retrieve (in this case, 3).

In [25]:
query = "Apple has released iOS 17.3"
doc = vector_store.similarity_search(
    query,
    k=3
)
print(doc)

[Document(page_content='Apple has released iOS 17.3, the latest version of its mobile operating system for iPhones. One of the key new features is ‘Stolen Device Protection’, which adds extra security measures to protect users’ data if their phone gets stolen. This should be high on every iPhone user’s list to enable, as it better protects your information without any effort on your part.\nWhen you turn on Stolen Device Protection, your iPhone will put limits on certain settings changes when it’s not in a familiar location like your home or workplace. If a thief unlocks your phone and tries to alter these settings, they’ll be required to authenticate with Face ID or Touch ID first. So even if they have your passcode, they can’t modify protected settings without also duplicating your biometrics – a near-impossible task.', metadata={'title': 'What is Apple’s new ‘Stolen Device Protection’ for iPhones and how to turn it on'}), Document(page_content='Apple is continuing to open up iOS to c
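
Conceptually, the search embeds the query and returns the `k` stored vectors with the highest cosine similarity to it. The sketch below shows that idea as an exact brute-force search over toy 2-dimensional vectors. It is an illustration only, not how Pinecone implements search, and `cosine_similarity`, `top_k`, and `toy_index` are hypothetical names.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b: dot(a, b) / (|a| * |b|).

    Assumes neither vector is all zeros.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    """Return the k (id, score) pairs with the highest cosine similarity to `query`."""
    scored = [(vec_id, cosine_similarity(query, vec)) for vec_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


toy_index = {
    "ios": [1.0, 0.0],
    "android": [0.8, 0.6],
    "switch": [0.0, 1.0],
}
for vec_id, score in top_k([1.0, 0.0], toy_index, k=2):
    print(vec_id, round(score, 3))
# prints "ios 1.0", then "android 0.8"
```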

Set up the question-answering system: a GPT-4 chat model and a `RetrievalQA` chain of type "stuff" that retrieves documents from the vector store and fills them into the prompt defined above.

In [26]:
llm = ChatOpenAI(model_name='gpt-4', max_tokens=488,
                 temperature=0,
                 model_kwargs={"stop": ["\nQ:", "\nA:"]},
                 api_key=openai_key)

qa_chain = RetrievalQA.from_chain_type(
    llm,
    chain_type="stuff",
    retriever=vector_store.as_retriever(),
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT},
)
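
The "stuff" chain type puts all retrieved documents into a single prompt. Roughly, it joins their texts into one context string and fills the template with that context and the question. The sketch below mimics this with plain strings. It is not LangChain's implementation, `stuff_documents` is a hypothetical name, and the blank-line separator is an assumption.

```python
def stuff_documents(texts: list[str], question: str, template: str, separator: str = "\n\n") -> str:
    """Join all retrieved texts into one context string and fill the prompt template."""
    context = separator.join(texts)
    return template.format(context=context, question=question)


demo_template = "CONTEXT:\n\n{context}\n\nQ: {question}\nA: "
retrieved = ["First retrieved passage.", "Second retrieved passage."]
prompt = stuff_documents(retrieved, "What changed?", demo_template)
print(prompt)
```

Because every retrieved document lands in one prompt, the number of documents returned by the retriever directly affects prompt length; it can be set with `vector_store.as_retriever(search_kwargs={"k": 3})`.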

Send a query to the QA chain to check that it answers from the indexed data.

In [27]:
query = "What about Nintendo new screen?"
# The chain returns a dict; the generated answer is under the 'result' key.
answer = qa_chain.invoke(query)['result']
print(answer)

According to the prediction by Omdia analyst Hiroshi Hayase, the successor to the Nintendo Switch, often referred to as the "Nintendo Switch 2," could feature an 8-inch LCD screen. This would be a significant increase from the 6.2-inch and 7-inch displays found on the original Switch and Switch OLED models respectively. However, Nintendo has not officially confirmed these details yet.
