<a href="https://colab.research.google.com/github/tractorjuice/MLOpsAIKB/blob/main/Building_MLOps_AI_Body_of_Knowledge_Part_4_Query_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MLOps AI Body of Knowledge Using Langchain & OpenAI
## Part 4, query the vector database using ChatGPT

This example shows how to create and query an internal knowledge base using ChatGPT.

This does not require a GPU runtime.

## Set Up


Mount Google Drive

In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [2]:
import os

KB_FOLDER = "/content/gdrive/MyDrive/MLOpsKB"  # Google drive folder to save the knowledgebase
YT_DATASTORE = os.path.join(KB_FOLDER, "youtube/datastore")  # Sub-directory for YouTube FAISS datastore files
YT_AUDIO_FOLDER = os.path.join(KB_FOLDER, "youtube/audio")  # Sub-directory for audio files
TRANSCRIPTS_FOLDER = os.path.join(YT_AUDIO_FOLDER, "transcripts")  # Sub-directory for transcripts of audio files
TRANSCRIPTS_TEXT_FOLDER = os.path.join(TRANSCRIPTS_FOLDER, "text")  # Sub-directory for text of audio files
TRANSCRIPTS_WHISPER_FOLDER = os.path.join(TRANSCRIPTS_FOLDER, "whisper_chunks")  # Sub-directory for Whisper chunks of audio files

# Create the knowledge base folder and its sub-directories if they do not already exist
for folder in (
    KB_FOLDER,
    YT_DATASTORE,
    YT_AUDIO_FOLDER,
    TRANSCRIPTS_FOLDER,
    TRANSCRIPTS_TEXT_FOLDER,
    TRANSCRIPTS_WHISPER_FOLDER,
):
    os.makedirs(folder, exist_ok=True)

Load required dependencies

In [None]:
!pip install -q langchain
!pip install -q openai
!pip install -q tiktoken

Use Pinecone or FAISS for the Vector Database

In [4]:
vectorstore = 'FAISS' # Set to 'Pinecone' or 'FAISS' for the vector database

In [5]:
if vectorstore == 'Pinecone':
    !pip install -q pinecone-client
    from langchain.vectorstores import Pinecone
    from tqdm.autonotebook import tqdm
    import pinecone

    # initialize pinecone
    pinecone.init(
        api_key="",  # find at app.pinecone.io
        environment="us-west4-gcp-free"  # next to api key in console
        )

    index_name = "knowledge" # Put your Pinecone index name here
    name_space = "mlopskb" # Put your Pinecone namespace here

else:
    !pip install -q faiss-cpu
    from langchain.vectorstores import FAISS



Set up OPENAI_API_KEY and choose the model

In [8]:
import os
os.environ["OPENAI_API_KEY"] = "" # Add your OpenAI key here

#MODEL = "gpt-3"
#MODEL = "gpt-3.5-turbo"
#MODEL = "gpt-3.5-turbo-0613"
#MODEL = "gpt-3.5-turbo-16k"
MODEL = "gpt-3.5-turbo-16k-0613"
#MODEL = "gpt-4"
#MODEL = "gpt-4-0613"
#MODEL = "gpt-4-32k-0613"
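
Hard-coding the key in the cell above makes it easy to leak when the notebook is shared. As a minimal sketch, assuming an interactive session, the key could instead be read from the environment and prompted for only when it is missing. The helper name `resolve_api_key` and its `prompt_fn` parameter are hypothetical and are not part of the original notebook.

```python
import os
from getpass import getpass


def resolve_api_key(env_var="OPENAI_API_KEY", prompt_fn=getpass):
    """Return the key from the environment, prompting only when it is unset or empty."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # prompt_fn defaults to getpass so the key is not echoed to the screen
        key = prompt_fn(f"Enter {env_var}: ").strip()
        if not key:
            raise ValueError(f"{env_var} is required")
        os.environ[env_var] = key  # make it visible to libraries that read the variable
    return key
```

Calling `resolve_api_key()` once before creating the embeddings would leave `OPENAI_API_KEY` set for the rest of the session.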

# Query using the vector store with ChatGPT integration

Set up access to the Pinecone or FAISS vector database

In [9]:
from langchain.embeddings.openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

In [10]:
if vectorstore == 'Pinecone':
    vector_store = Pinecone.from_existing_index(index_name, embeddings, namespace=name_space)

else:
    # Open datastore
    from langchain.vectorstores import FAISS
    if os.path.exists(f"{YT_DATASTORE}"):
        vector_store = FAISS.load_local(
            f"{YT_DATASTORE}",
            OpenAIEmbeddings()
            )
    else:
        print(f"Missing files. Upload index.faiss and index.pkl to {YT_DATASTORE} first")
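
The cell above only checks that the datastore folder exists, and the setup cell creates that folder, so the check can pass even when the index files are absent. A small sketch of a stricter check follows; it assumes the two file names mentioned in the message above, and the helper name `missing_index_files` is hypothetical.

```python
import os

# File names taken from the "Missing files" message in the cell above.
REQUIRED_INDEX_FILES = ("index.faiss", "index.pkl")


def missing_index_files(folder, required=REQUIRED_INDEX_FILES):
    """Return the required file names that are not present in `folder`."""
    return [
        name for name in required
        if not os.path.isfile(os.path.join(folder, name))
    ]
```

An empty list from `missing_index_files(YT_DATASTORE)` would mean both files are in place and the datastore can be loaded.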


Set up the prompt

In [11]:
from langchain.prompts.chat import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate,
)

system_template="""
    You are MLOpsGPT, an MLOps specialist bot.
    You use examples from MLOps in your answers.
    Your language should be simple enough for a 12-year-old to understand.
    If you do not know the answer to a question, do not make information up. Instead, ask a follow-up question in order to gain more context.
    Use a mix of technical and colloquial UK English to create an accessible and engaging tone.
    Use the following pieces of context to answer the user's question.
    Take note of the sources and include them in the answer in the format: "SOURCES: source1 source2". Use "SOURCES" in capital letters regardless of the number of sources.
----------------
{summaries}
"""
messages = [
    SystemMessagePromptTemplate.from_template(system_template),
    HumanMessagePromptTemplate.from_template("{question}")
]
prompt = ChatPromptTemplate.from_messages(messages)
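
To make the two placeholders concrete, here is a plain-Python sketch of how `{summaries}` and `{question}` might be filled once chunks have been retrieved. It does not use LangChain; the `Content:`/`Source:` layout of each chunk is an assumption for illustration, and `stuff_documents` and `build_messages` are hypothetical names. The two sample chunks are taken from the output shown later in this notebook.

```python
# Shortened stand-in for the system template above (illustration only).
SYSTEM_TEMPLATE = (
    "Use the following pieces of context to answer the user's question.\n"
    "----------------\n"
    "{summaries}"
)


def stuff_documents(docs):
    """Join retrieved chunks into one block of text, one Content/Source pair per chunk."""
    return "\n\n".join(
        f"Content: {d['text']}\nSource: {d['source']}" for d in docs
    )


def build_messages(question, docs):
    """Return (role, text) pairs: the filled system prompt, then the user's question."""
    system_text = SYSTEM_TEMPLATE.format(summaries=stuff_documents(docs))
    return [("system", system_text), ("human", question)]


docs = [
    {"text": "co-hosts on the MLOps community podcast.", "source": "30"},
    {"text": "in the MLOps community.", "source": "3771"},
]
messages_sketch = build_messages("Who leads the MLOps Community?", docs)
```

Every retrieved chunk ends up in a single system message, which is why a larger-context model can help when more sources are returned.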

Initialise the LLM API

In [12]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQAWithSourcesChain

chain_type_kwargs = {"prompt": prompt}
llm = ChatOpenAI(model_name=MODEL, temperature=0)  # Modify model_name if you have access to GPT-4
chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vector_store.as_retriever(search_type="mmr", search_kwargs={"k": 3}), # Use MMR search and return 3 video sources
    return_source_documents=True,
    chain_type_kwargs=chain_type_kwargs
)
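
The retriever above uses `search_type="mmr"` (maximal marginal relevance), which trades off relevance to the query against similarity to chunks that have already been picked. The following is a rough pure-Python sketch of that idea, not LangChain's implementation; the function names, the `lambda_mult` weighting parameter, and the toy vectors are all illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Return indices of up to k documents chosen by maximal marginal relevance."""
    remaining = list(range(len(doc_vecs)))
    selected = []
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Penalise closeness to the most similar already-selected document
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 1.0]
docs = [[1.0, 0.02], [1.0, 0.05], [0.0, 1.0]]  # docs 0 and 1 are near-duplicates

print(mmr_select(query, docs, k=2, lambda_mult=0.5))  # -> [1, 2]
print(mmr_select(query, docs, k=2, lambda_mult=1.0))  # -> [1, 0]
```

With `lambda_mult=1.0` the selection is by relevance alone and returns both near-duplicates; with `0.5` the second pick switches to the chunk that adds something new.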

#### Use the chain to query

Print the sources so we can find the YouTube videos
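
The cells below build each link from two metadata fields. As a small sketch, assuming `source_url` holds the YouTube video ID and `source` holds the start time in seconds (as the `?t=` parameter in those cells suggests), the link construction could be wrapped in a helper. The name `video_link` is hypothetical.

```python
def video_link(metadata):
    """Build a timestamped YouTube link from a chunk's metadata."""
    video_id = metadata["source_url"]  # field name used by the cells below
    # Accept "3771", 3771 or 3771.0 and truncate to whole seconds
    start_seconds = int(float(metadata["source"]))
    return f"https://youtu.be/{video_id}?t={start_seconds}"
```

For example, `video_link({"source_url": "P68tSuuc010", "source": 3771.0})` gives `https://youtu.be/P68tSuuc010?t=3771`.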

In [13]:
query = "How can I learn about MLOps?"
result = chain(query)

In [14]:
print(result['question'])
print(result['answer'])

source_documents = result['source_documents']
for index, document in enumerate(source_documents):
    print(f"\nSource {index + 1}:")
    print(f"Video title: {document.metadata['title']}")
    print(f"Video author: {document.metadata['author']}")
    print(f"Source video: https://youtu.be/{document.metadata['source_url']}?t={int(document.metadata['source'])}")
    print(f"Content: {document.page_content}")

How can I learn about MLOps?
If you want to learn about MLOps, there are a few ways you can go about it. One option is to do some online research. You can search for articles, tutorials, and videos that explain what MLOps is and how it works. Another option is to join online communities or forums where people discuss MLOps. This can be a great way to ask questions and learn from others who are already experienced in the field.

There are also dedicated websites and platforms that offer courses and training on MLOps. These courses can provide you with a structured learning path and hands-on experience. Some popular platforms for learning MLOps include Coursera, Udemy, and edX.

Additionally, attending conferences, webinars, and workshops related to MLOps can be a valuable learning experience. These events often feature talks and presentations from experts in the field, giving you the opportunity to learn from their knowledge and experiences.

Remember, learning about MLOps is an ongoing

In [15]:
query = "what are the key components of MLOps?"
result = chain(query)

In [16]:
print(result['question'])
print(result['answer'])

source_documents = result['source_documents']
for index, document in enumerate(source_documents):
    print(f"\nSource {index + 1}:")
    print(f"Video title: {document.metadata['title']}")
    print(f"Video author: {document.metadata['author']}")
    print(f"Source video: https://youtu.be/{document.metadata['source_url']}?t={int(document.metadata['source'])}")
    print(f"Content: {document.page_content}")

what are the key components of MLOps?
The key components of MLOps are:

1. Data Management: This involves collecting, storing, and organizing the data that is used to train and test machine learning models. Good data management ensures that the data is accurate, complete, and properly labeled.

2. Model Development: This is the process of creating and fine-tuning machine learning models. It involves selecting the right algorithms, training the models on the data, and evaluating their performance.

3. Model Deployment: Once a model is developed, it needs to be deployed so that it can be used in real-world applications. This involves setting up the necessary infrastructure, such as servers or cloud platforms, and integrating the model into the existing software systems.

4. Monitoring and Maintenance: After a model is deployed, it needs to be continuously monitored to ensure that it is performing as expected. This includes monitoring its accuracy, detecting any drift or degradation in pe

In [17]:
query = "Who leads the MLOps Community?"
result = chain(query)

In [18]:
print(result['question'])
print(result['answer'])

source_documents = result['source_documents']
for index, document in enumerate(source_documents):
    print(f"\nSource {index + 1}:")
    print(f"Video title: {document.metadata['title']}")
    print(f"Video author: {document.metadata['author']}")
    print(f"Source video: https://youtu.be/{document.metadata['source_url']}?t={int(document.metadata['source'])}")
    print(f"Content: {document.page_content}")

Who leads the MLOps Community?
The MLOps Community is led by a group of people who are passionate about MLOps and its development. One of the key figures in the MLOps Community is the co-host of the MLOps Community podcast. She has been very active in the MLOps community and is dedicated to sharing knowledge and fostering collaboration among MLOps practitioners. While she is not the sole leader, she plays an important role in driving the community forward. (

Source 1:
Video title: The Next Million AI Apps // Mark Huang // LLMs in Pod Con Part 2 Workshop Day 1
Video author: MLOps.community
Source video: https://youtu.be/P68tSuuc010?t=3771
Content: in the MLOps community.

Source 2:
Video title: Evaluation // Panel 1 // Large Language Models in Production Conference Part 2
Video author: MLOps.community
Source video: https://youtu.be/e0ZLqfus_TY?t=30
Content: co-hosts on the MLOps community podcast.

Source 3:
Video title: Evaluation // Panel 1 // Large Language Models in Production Conf