In [1]:
# Project 
"""
***Stack***: OPL (OpenAI, Pinecone, LangChain) / OCL (OpenAI, Chromadb, LangChain)

***Goal***
- Why build our own pipeline instead of simply using ChatGPT?
    - we must respect data privacy policies and should not feed our own private documents to a public service
    - with ChatGPT, knowledge is limited to September 2021

***Learning***
- How can LLMs learn new knowledge?
    - Fine-tuning on a training set: the most natural way to teach the model new knowledge, but it can be time-consuming and expensive. It also builds long-term memory, which is not always necessary
    - Model inputs: inserting the knowledge into an input message. For example, we can send an entire book or PDF document to the model as an input message and then ask questions on topics found in that message. This is a good way to build short-term memory for the model
    When we have a large corpus of text, model inputs become difficult to use because each model is limited to a maximum number of tokens, which in most cases is around 4000. For example, we cannot simply send the text of a 500-page document to the model because it would exceed the maximum number of tokens the model supports
    Note: the recommended approach is to use model inputs with embedding-based search. Embeddings are simple to implement and work especially well with questions

***Pipeline***
- 1. Prepare the document (once per document): build the search data
a) Load the data into LangChain documents
b) Split the documents into chunks: short, mostly self-contained sections
c) Embed the chunks into numeric vectors using an embedding model such as OpenAI's text-embedding-ada-002
d) Save the chunks and the embeddings to a vector database such as Pinecone, Chroma, Milvus or Qdrant

- 2. Search (once per query)
a) Embed the user's question.
b) Using the question's embedding and the chunk embeddings, rank the vectors by similarity (cosine similarity or Euclidean distance) to the question's embedding. The nearest vectors represent the chunks most similar to the question

- 3. Ask (once per query)
a) Insert the question and the most relevant chunks (from step 2(b)) into a message to a GPT model.
b) Return GPT's answer.
"""

# Note
"""
the above technique is also called retrieval-augmented generation because we retrieve relevant information from an external knowledge base and give that information to our LLM.
the external knowledge base is our window into the world beyond the training data. A practical example is worth more than 1000 words

the general paradigm we are using is ReAct, which combines advances in reasoning and acting to enable language models to solve various language reasoning and decision-making tasks

# What is RAG
- RAG stands for Retrieval-Augmented Generation; it is a technique that combines an LLM with a way to search for information
- this lets the model look at material such as private documents or databases while it is generating text
- RAG helps overcome knowledge limits, makes answers more factual, and helps the model handle complex questions
- In a RAG system, external data is retrieved and then passed to the LLM during the generation step

# What is Chroma db
- Chroma is an open-source vector store that can run in memory, making it a good fit for small to medium-sized projects
- we don't need a separate server or hosting for it
- we don't need to manage indexes and namespaces, which saves time and effort
"""

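The search step of the pipeline (2b) can be illustrated without any external service. The following is a minimal sketch, assuming tiny hand-written 3-dimensional vectors in place of real embeddings; the names `cosine_similarity`, `rank_chunks`, `toy_chunks` and `toy_question` are hypothetical and are not part of the notebook's code.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_chunks(question_vec, chunk_vecs, k=2):
    """Return the ids of the k chunks whose vectors are most similar to the question."""
    scored = sorted(
        chunk_vecs.items(),
        key=lambda item: cosine_similarity(question_vec, item[1]),
        reverse=True,  # highest similarity first
    )
    return [chunk_id for chunk_id, _ in scored[:k]]

# Toy "embeddings" (made-up values, for illustration only)
toy_chunks = {
    "transactions": [0.9, 0.1, 0.0],
    "window_functions": [0.1, 0.9, 0.1],
    "data_types": [0.0, 0.2, 0.9],
}
toy_question = [0.8, 0.3, 0.0]

top = rank_chunks(toy_question, toy_chunks, k=2)
print(top)
```

A vector database performs the same ranking, but over many high-dimensional vectors and usually with an approximate index rather than a full sort.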

In [1]:
# Technologies
# Document Loaders: Transform loaders | public | proprietary dataset or service loaders
# link: https://python.langchain.com/docs/modules/data_connection/document_loaders/

In [1]:
# Environmental variables
import os
from dotenv import find_dotenv, load_dotenv

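try:
    file = '../LLM/.env'
    keys = find_dotenv(file, raise_error_if_not_found=True)
    load_dotenv(keys, override=True)
    # print(os.environ.get('OPENAI_API_KEY'))
    print("Initialization successful!")
except Exception as exc:
    # catch Exception rather than using a bare except, and keep the original error as the cause
    print("Not initialized!")
    raise RuntimeError("Could not load the .env file; initialize again.") from exc

Initialization successful!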


In [3]:
# Requirements.txt # need to restart the kernel
"""
pypdf
docx2txt
tqdm
wikipedia
pinecone-client
chromadb
"""


In [20]:
# Libraries
from langchain_openai import ChatOpenAI

from langchain.chains import RetrievalQA, ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

from langchain.prompts import ChatPromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate

# document loaders
from langchain_community import document_loaders
from langchain.text_splitter import RecursiveCharacterTextSplitter # By default it tries to split on "\n\n", then "\n", then " ", then ""
import tiktoken

# vector database
import pinecone
from langchain_community.vectorstores import Pinecone
from langchain_openai import OpenAIEmbeddings
from pinecone import PodSpec

from langchain_community.vectorstores import Chroma

##### Langchain Documents

In [3]:
# Load the document
def load_document(file):
    """Load a single file into a list of LangChain documents; each document holds the page content and metadata (for PDFs, including the page number)."""
    # The loaders come from the module-level `document_loaders` import, so no local import is needed here.
    name, ext = os.path.splitext(file) # file.split('/')[-2], file.split('/')[-1]

    # Map each extension to its loader class; a loader is only instantiated for the matching extension.
    loaders = {
        '.pdf': document_loaders.PyPDFLoader,
        '.docx': document_loaders.Docx2txtLoader,
        '.txt': document_loaders.TextLoader,
        '.csv': document_loaders.CSVLoader,
        '.py': document_loaders.PythonLoader,
        '.html': document_loaders.BSHTMLLoader, # or UnstructuredHTMLLoader
    } # `file` is a URL of the file or a file path in a file system

    if ext not in loaders:
        print("Extension doesn't exist!")
        return None

    print(f"Loading the '{file}'")
    loader = loaders[ext](file)
    data = loader.load_and_split() if ext == '.pdf' else loader.load() # returns a list of LangChain documents
    return data # PDF data is split by pages, so an index can be used to display a specific page

# Load all the documents : https://python.langchain.com/docs/modules/data_connection/document_loaders/file_directory
def load_all_documents(dirpath, ext='.pdf'):
    """Accept a directory path and a file extension (default '.pdf') and return the list of documents (page_content, metadata)."""
    # The original relied on a global `file` to pick the extension; it is now an explicit argument.
    loaders = {
        '.pdf': (document_loaders.PyPDFLoader, {}),
        '.docx': (document_loaders.Docx2txtLoader, {}),
        '.txt': (document_loaders.TextLoader, {'autodetect_encoding': True}),
        '.csv': (document_loaders.CSVLoader, {}),
        '.py': (document_loaders.PythonLoader, {}),
    }

    if ext not in loaders:
        print("Extension doesn't exist!")
        return None

    loader_cls, loader_kwargs = loaders[ext]
    # glob restricts the directory scan to files with the chosen extension;
    # silent_errors=True can be added to skip files that could not be loaded
    loader = document_loaders.DirectoryLoader(dirpath, glob=f"**/*{ext}", show_progress=True, use_multithreading=True, loader_cls=loader_cls, loader_kwargs=loader_kwargs)

    data = loader.load_and_split() if ext == '.pdf' else loader.load()
    print(f"Documents: {len(data)}")
    return sorted(data, key=lambda x: x.page_content.split('\n')[0]) # sorted by the first line of each document (usually its title)

# load from wikipedia
def load_wikipedia_documents(query, lang='en', load_max_docs=2):
    """Accept three arguments: query (the search text), lang (the language of the articles) and load_max_docs (the maximum number of documents to return)."""
    print("Loading from Wikipedia; this can take a while depending on load_max_docs=%s..." % load_max_docs)
    loader = document_loaders.WikipediaLoader(query=query, lang=lang, load_max_docs=load_max_docs)
    data = loader.load()
    return data


# driver code: files
dirpath = "../files/MySQL"
file = "../files/MySQL/C3-WK01-DY02-PracticeExercise.pdf"
page = 2

# one document at a time
data = load_document(file)
print(f"Total '{len(data)}' Pages in the '{file.split('/')[-1]}'")
# print(f'There are "{len(data[page].page_content)}" characters at the {page} page.')
# print("Metadata:", data[page].metadata)
# print(f"Page {page}: {data[page].page_content}")

# list of documents
# data = load_all_documents(dirpath)
# data

# driver code: wikipedia
# data = load_wikipedia_documents('LLM(Large Language Models)')
# print(data[0].page_content)


Loading the '../files/MySQL/C3-WK01-DY02-PracticeExercise.pdf'
Total '4' Pages in the 'C3-WK01-DY02-PracticeExercise.pdf'


### Chunking

In [4]:
# document chunking
def data_chunks(data, chunk_size=256, chunk_overlap=0):
    """Accept a list of LangChain documents plus chunk_size and chunk_overlap (both counted in characters) and return the chunks"""
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    chunks = text_splitter.split_documents(data)
    # print(f"Chunk Size: {chunk_size}")
    # print(f"Chunk Overlap: {chunk_overlap}")
    # print(f"Total Chunks: {len(chunks)}")
    return chunks

# embedding costs: tokens 
def embedding_costs(chunks, model='text-embedding-3-small', price=0.02):
    """Accept the chunks and calculate the embedding cost. By default it assumes model='text-embedding-3-small' with price=0.02 (USD per 1M tokens)."""
    enc = tiktoken.encoding_for_model(model)
    total_tokens = sum(len(enc.encode(page.page_content)) for page in chunks)
    embed_cost = total_tokens / 1_000_000 * price # the price is quoted per 1M tokens
    # print(f"Total Tokens: {total_tokens}")
    # print(f"Embedding Cost in USD: {embed_cost:.6f}")
    return total_tokens, round(embed_cost, 6)

# document-chunks-tokens-embeddings calculator
def document_chunks_tokens_embeds_calculator(data, chunk_size=256, chunk_overlap=0, model='text-embedding-3-small', price=0.02):
    """Accept five arguments
    data: list of LangChain documents
    chunk_size & chunk_overlap: pass chunk_overlap=None to overlap by roughly half a chunk
    model: the embedding model
    price: USD per 1M tokens
    Prints chunk_size, chunk_overlap, chunks, total_chunks, total_tokens, embed_costs
    and, once the user confirms, returns chunk_size, chunk_overlap, chunks, total_chunks, total_tokens, embed_costs
    """
    if chunk_overlap is None:
        chunk_overlap = chunk_size // 2 + 1

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, # worth experimenting with larger values
        chunk_overlap=chunk_overlap, # overlap between chunks keeps some continuity between them
        length_function=len # how the length of a chunk is measured
        # The default counts characters; because LLMs work with tokens,
        # a token counter could be used here instead
    )
    chunks = text_splitter.split_documents(data)
    enc = tiktoken.encoding_for_model(model)
    tokens = [enc.encode(page.page_content) for page in chunks]
    total_tokens = sum(map(len, tokens))
    embed_cost = total_tokens / 1_000_000 * price # the price is quoted per 1M tokens

    print(f"Chunk Size: {chunk_size}")
    print(f"Chunk Overlap: {chunk_overlap}")
    print(f"Chunks: {chunks[:5]}")
    print(f"Total Chunks: {len(chunks)}")
    print(f"Tokens: {tokens[:5]}")
    print(f"Total Tokens: {total_tokens}")
    print(f"Embedding Cost in USD: {embed_cost:.6f}")

    user = input("Do you want to continue? [Y/N | y/n]: ")
    if user in ('y', 'Y'):
        return chunk_size, chunk_overlap, chunks, len(chunks), total_tokens, round(embed_cost, 6)
    else:
        raise Exception("The user stopped the process before embedding...")


# driver code: chunks
# one document at a time
# data = load_document(file)
# print(f"Total '{len(data)}' Pages in the '{file.split('/')[-1]}'")

# chunks = data_chunks(data)
# print(f"There are {len(chunks)} chunks.")
# print(chunks[2].page_content)

# total_tokens, embed_costs = embedding_costs(chunks)
# print(f"Total Tokens: {total_tokens}")
# print(f"Embedding Cost in USD: {embed_costs:.6f}")

# chunk_size, chunk_overlap, chunks, total_chunks, total_tokens, embed_costs = document_chunks_tokens_embeds_calculator(data)
    
# Use-Case: Customize 
# data = load_document(file)
# print(f"Total '{len(data)}' Pages in the '{file.split('/')[-1]}'")
# document_chunks_tokens_embeds_calculator(data) # chunk_size, chunk_overlap, chunks, total_chunks, total_tokens, embed_costs
# document_chunks_tokens_embeds_calculator(data, chunk_size=10, chunk_overlap=5) # chunk_size, chunk_overlap, chunks, total_chunks, total_tokens, embed_costs
# document_chunks_tokens_embeds_calculator(data, chunk_size=256, chunk_overlap=None) # chunk_size, chunk_overlap, chunks, total_chunks, total_tokens, embed_costs
# document_chunks_tokens_embeds_calculator(data, chunk_size=1) # chunk_size, chunk_overlap, chunks, total_chunks, total_tokens, embed_costs
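To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-window character splitter. It is a deliberately simpler technique than `RecursiveCharacterTextSplitter`, which also tries to split on separators; the function name `sliding_window_chunks` and the sample string are hypothetical.

```python
def sliding_window_chunks(text, chunk_size=10, chunk_overlap=3):
    """Split text into fixed-size character windows; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks

sample = "SELECT * FROM accounts;"
pieces = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
for piece in pieces:
    print(repr(piece))
```

With an overlap of 3, the last three characters of each piece reappear at the start of the next one, which is what keeps some continuity between neighbouring chunks.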


### Vector Database: Pinecone

In [5]:
# Upload to vector database like Pinecone for fast retrieval through similarity score
def insert_or_fetch_pinecone_embeddings(index_name, chunks):
    """
    Create the index if it doesn't exist, embed the chunks and add both the chunks and the embeddings to the Pinecone index. If the index already exists, just load the vector store from it (the chunks are not embedded again).
    
    The function takes two arguments: 
    index_name: vector database index name
    chunks: list of LangChain document chunks
    """
    pc = pinecone.Pinecone()
    embeddings = OpenAIEmbeddings(model='text-embedding-3-small', dimensions=1536)

    if index_name in pc.list_indexes().names():
        print("Index exists! Fetching...", end=' ')
        vector_store = Pinecone.from_existing_index(index_name, embeddings)
        print("Completed...")
    else:
        print("Creating Index %s and embeddings..." %index_name, end=' ')
        pc.create_index(
            name=index_name,
            dimension=1536,
            metric='cosine',
            spec=PodSpec(
                environment='gcp-starter'
            )
        )
        vector_store = Pinecone.from_documents(chunks, embeddings, index_name=index_name)
        print("Successfully upserted the chunks...")
    return vector_store

# Delete the existing index or all the indexes
def delete_pinecone_index(index_name=None):
    """Delete the named index, or every index when index_name is None."""
    pc = pinecone.Pinecone()
    indexes = pc.list_indexes().names()

    if not indexes:
        print("There are no indexes available to delete...")
    elif index_name is None:
        print("Deleting all the indexes: %s" % indexes, end=' ')
        for index in indexes:
            pc.delete_index(index)
        print("Done...")
    elif index_name not in indexes:
        print("Index doesn't exist!")
    else:
        print(f"Deleting the index: {index_name}", end=' ')
        pc.delete_index(index_name)
        print("Done...")

# Asking Questions & Getting the Answers using similarity search 
def question_answer_bot(vector_store, question, model='gpt-3.5-turbo', temperature=1, search_type='similarity', k=3):
    llm = ChatOpenAI(model=model, temperature=temperature)
    retriever = vector_store.as_retriever(search_type=search_type, search_kwargs={'k': k}) # use the search_type argument instead of a hard-coded value
    chain = RetrievalQA.from_chain_type(llm=llm, chain_type='stuff', retriever=retriever) # chain_type='stuff' puts the text of all retrieved documents into a single prompt
    answer = chain.invoke(question) # returns a dict with the keys 'query' and 'result'
    return answer

# Driver code
# data = load_all_documents(dirpath)
# chunks = data_chunks(data)
# # document_chunks_tokens_embeds_calculator(chunks) # Calculate the Embedding costs

# # Ensuring there is no index already created since we are on free-tier plan
# delete_pinecone_index()

# # Creating the index or returning the existing vector store
# index_name = 'glca-da-mysql-practice-exercises'
# vector_store = insert_or_fetch_pinecone_embeddings(index_name, chunks) # will return existing vector_store if index already exists or will create new index + upsert embeddings 

# # Asking Questions & Getting the Answers using similarity search 
# question = """What is the whole document about?"""
# answer = question_answer_bot(vector_store, question)
# print(answer)
# print(answer['query'])
# print(answer['result'])
"""The document provides information about SQL transactions, including the structure of a transaction (BEGIN TRANSACTION, SQL STATEMENTS, SAVEPOINT, COMMIT or ROLLBACK), and it also covers the purpose of using the * symbol in a SELECT command in SQL. Additionally, it seems to include a practice exercise related to SQL."""

'The document provides information about SQL transactions, including the structure of a transaction (BEGIN TRANSACTION, SQL STATEMENTS, SAVEPOINT, COMMIT or ROLLBACK), and it also covers the purpose of using the * symbol in a SELECT command in SQL. Additionally, it seems to include a practice exercise related to SQL.'
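The `chain_type='stuff'` setting used in `question_answer_bot` means that all retrieved chunks are placed into one prompt. The following is a rough sketch of that idea in plain Python; the prompt wording and the name `build_stuffed_prompt` are assumptions made for illustration and are not LangChain's actual template.

```python
def build_stuffed_prompt(question, chunks):
    """Concatenate every retrieved chunk into one context block, then append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following pieces of context to answer the question.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Hypothetical retrieved chunks, standing in for the retriever's top-k results
retrieved = [
    "A transaction starts with BEGIN TRANSACTION.",
    "COMMIT makes the changes permanent; ROLLBACK undoes them.",
]
prompt = build_stuffed_prompt("How does a transaction end?", retrieved)
print(prompt)
```

Because every chunk ends up in the same message, the combined length of the `k` chunks plus the question has to stay within the model's token limit.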

### Driver code

In [11]:
data = load_all_documents(dirpath)
chunks = data_chunks(data)
# document_chunks_tokens_embeds_calculator(chunks) # Calculate the Embedding costs

# Ensuring there is no index already created since we are on free-tier plan
# delete_pinecone_index()

# Creating the index or returning the existing vector store
index_name = 'glca-da-mysql-practice-exercises'
vector_store = insert_or_fetch_pinecone_embeddings(index_name, chunks) # will return existing vector_store if index already exists or will create new index + upsert embeddings 


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:02<00:00,  5.55it/s]


Documents: 49
Index exists! Fetching... Completed...


In [None]:
# Extend the knowledge base with Wikipedia articles on the same topic (this adds documents for retrieval; it does not train the model)
# note: insert_or_fetch_pinecone_embeddings only upserts chunks when the index does not exist yet
# data = load_wikipedia_documents("MySQL", load_max_docs=50) # https://en.wikipedia.org/wiki/MySQL
# chunks = data_chunks(data)
# document_chunks_tokens_embeds_calculator(chunks)
# index_name='glca-da-mysql-practice-exercises'
# vector_store = insert_or_fetch_pinecone_embeddings(index_name, chunks)
# print("Documents added successfully.....")

# data = load_wikipedia_documents("Stored procedure", load_max_docs=500)
# data = load_wikipedia_documents("Database trigger", load_max_docs=500)
# data = load_wikipedia_documents("Cursor (databases)", load_max_docs=500)
# data = load_wikipedia_documents("View (SQL)", load_max_docs=500)
# data = load_wikipedia_documents("Information schema", load_max_docs=500)


### User Interface

In [14]:
q = 1
print("Write down the question or Quit/Exit")

while True:
    question = input(f"Prompt {q}: ")

    # an empty prompt or any spelling of q/quit/exit ends the session
    if question.strip().lower() in ('', 'q', 'quit', 'exit'):
        print("\nThank you for visiting us....")
        break
    
    answer = question_answer_bot(vector_store, question, k=10) # answer['query'] holds the question
    print(answer['result'])
    q += 1
    print('\n')


Write down the question or Quit/Exit


Prompt 1:  What are the main objectives of this document?


The main objectives of the document are to cover theory questions related to SQL, including concepts like Data Control Language (DCL) commands, Transaction Control (TCL) commands, and the process of executing SQL scripts. It also aims to provide practice exercises on these concepts to enhance understanding and knowledge in SQL programming.





Prompt 2:  do you see anything special about this document


Yes, there seems to be a practice exercise related to SQL queries, particularly focusing on finding Savings Account numbers that have corresponding AddonCredit card transactions.





Prompt 3:  does this document helps in practicing sql queries for a beginner to get a job as data analyst and mention why?


Yes, this document can be helpful for practicing SQL queries as a beginner aiming to get a job as a data analyst. It provides practice exercises that cover a variety of SQL concepts such as creating databases, writing queries with subqueries, and retrieving specific information from databases. By working through these exercises, beginners can improve their SQL skills, gain hands-on experience in querying databases, and become more proficient in handling data. This practical experience is valuable for aspiring data analysts as it can demonstrate their ability to work with databases and extract meaningful insights from data, which are essential skills for the role.





Prompt 4:  what are the hard concepts


I'm sorry, but I cannot provide specific details on the hard concepts in the practice exercise as the content is proprietary to Great Learning. If you have any specific questions or need help understanding any SQL concepts, feel free to ask!





Prompt 5:  what are the imp concepts of this documents for a fresher to understanding the sql query for the first time


For a fresher looking to understand SQL queries for the first time, here are some important concepts from the provided document:

1. **SQL Commands**: Understanding the different types of SQL commands like Data Definition Language (DDL) commands, which are used to define database structures, and Data Manipulation Language (DML) commands, which are used to manipulate data.

2. **Data Types in SQL**: Familiarizing yourself with the various data types available in SQL, such as character strings, numeric values, and date/time values.

3. **Executing SQL Scripts**: Knowing the two main ways to execute an SQL script - pasting it into the command line or running it from a file, and understanding the process for each.

4. **Functions**: Being aware of functions like `now()`, `curdate()`, `curtime()`, `current_timestamp()` that are used to display the current date and time in SQL.

5. **Indexing**: Understanding the purpose of using "INDEX" in SQL, and how it contributes to faster retrieval of 

Prompt 6:  can you tell me those concepts which the difficulty level is higher for a fresher to understand and interpret


For a fresher, the concepts related to window functions in SQL might be a bit challenging to understand and interpret initially due to their complexity. Window functions are used to perform calculations across a set of rows related to the current row and can involve advanced concepts like partitions, frames, and ordering within the dataset. It might take some time and practice to grasp these concepts fully.





Prompt 7:  does this document provide the knowledge about the windows functions and examples to understand


Yes, the document provides information about window functions in SQL, including the different types of window functions and their examples. It also explains the purpose of the "OVER()" clause in window functions. If you have any specific questions about window functions or examples, feel free to ask for more details.





Prompt 8:  q


Thank you for visiting us....


<hr>

### Vector Database: Chromadb

In [6]:
# create embeddings and vector store in chromadb
def create_embeddings_chroma(chunks, persist_directory='./chromadb'):
    """Create the embeddings (OpenAIEmbeddings class), save them in a Chroma database and return the vector store object"""
    print("Started Creating Embeddings...", end=' ')
    embeddings = OpenAIEmbeddings(model='text-embedding-3-small', dimensions=1536)
    vector_store = Chroma.from_documents(chunks, embeddings, persist_directory=persist_directory)
    print("Done...")
    return vector_store

# loading embeddings 
def load_embeddings_chroma(persist_directory='./chromadb'):
    """Load the existing Chroma database and return the vector store object"""
    embeddings = OpenAIEmbeddings(model='text-embedding-3-small', dimensions=1536)
    vector_store = Chroma(persist_directory=persist_directory, embedding_function=embeddings)
    return vector_store

# Driver Code
# data = load_all_documents("../files/MySQL")
# chunks = data_chunks(data)
# vector_store = create_embeddings_chroma(chunks, './Sessions/chromadb')
# db = load_embeddings_chroma('./Sessions/chromadb')
# question = """What is the whole document about?"""
# answer = question_answer_bot(vector_store, question)
# print(answer['query'])
# print(answer['result'])

#### Adding Memory (Chat History) - for follow up questions

In [18]:
llm = ChatOpenAI(model='gpt-3.5-turbo', temperature=0, streaming=True)
vector_store = load_embeddings_chroma('./Sessions/chroma_db')
retriever = vector_store.as_retriever(search_type='similarity', search_kwargs={'k':5})
# ConversationBufferMemory stores and manages the conversation history within the LangChain application.
# memory_key='chat_history' labels the memory; the stored conversation is retrieved under the key 'chat_history'.
memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)

system_template = r"""
Use the following pieces of context to answer the user's question.
If you don't find the answer in the provided context, just respond "I don't know."
----------------------------------
Context: ```{context}```

"""
user_template = """
Question: ```{question}```
Chat History: ```{chat_history}```
"""

messages=[
    SystemMessagePromptTemplate.from_template(system_template),
    HumanMessagePromptTemplate.from_template(user_template)
]

qa_prompt = ChatPromptTemplate.from_messages(messages)

conversation_retrieval_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    memory=memory,
    chain_type='stuff',
    combine_docs_chain_kwargs={'prompt': qa_prompt},
    verbose=True
)

# Driver Code
print(qa_prompt)

input_variables=['chat_history', 'context', 'question'] messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context'], template='\nUse the following pieces of context to answer the user\'s question.\nIf you don\'t find the answer in the provided context, just respond "I don\'t know."\n----------------------------------\nContext: ```{context}```\n\n')), HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['chat_history', 'question'], template='\nQuestion: ```{question}```\nChat History: ```{chat_history}```\n'))]
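How a buffer memory feeds earlier turns back into the prompt can be sketched in plain Python. This is only an illustration of the idea, assuming a hypothetical `ToyBufferMemory` class and `render_user_message` helper; it is not how `ConversationBufferMemory` is implemented.

```python
class ToyBufferMemory:
    """Keeps every (question, answer) turn in the order it happened."""

    def __init__(self):
        self.turns = []

    def save(self, question, answer):
        self.turns.append((question, answer))

    def as_text(self):
        """Render the stored turns as alternating Human/Assistant lines."""
        lines = []
        for question, answer in self.turns:
            lines.append(f"Human: {question}")
            lines.append(f"Assistant: {answer}")
        return "\n".join(lines)

def render_user_message(question, memory):
    """Fill a user template with the new question and the stored history."""
    return f"Question: {question}\nChat History: {memory.as_text()}"

memory = ToyBufferMemory()
memory.save("How many types of subqueries we have?", "Four.")
message = render_user_message("Which one is hardest?", memory)
print(message)
```

Since the buffer grows with every turn, a long conversation eventually competes with the retrieved context for space in the prompt.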


In [8]:
def ask_question(question, model='gpt-3.5-turbo', temperature=0, vector_store=None, k=5):
    """Take five arguments
    question: str
    model: 'gpt-3.5-turbo' by default
    temperature: ranges from 0 (most deterministic) to 2 (most creative)
    vector_store: a Chroma object; loaded from './Sessions/chroma_db' when None
    k: number of chunks to retrieve

    Returns the response dict (keys 'query' and 'result')
    """
    if vector_store is None: # load at call time rather than when the function is defined
        vector_store = load_embeddings_chroma('./Sessions/chroma_db')
    llm = ChatOpenAI(model=model, temperature=temperature, streaming=True)
    retriever = vector_store.as_retriever(search_type='similarity', search_kwargs={'k': k})
    chain = RetrievalQA.from_chain_type(llm=llm, chain_type='stuff', retriever=retriever)
    response = chain.invoke({'query': question}) # RetrievalQA expects the input key 'query'
    return response

def ask_with_memory(question, vector_store=None, chat_history=None, model='gpt-3.5-turbo', temperature=0, k=5):
    """Take six arguments
    question: str
    vector_store: a Chroma object; loaded from './Sessions/chroma_db' when None
    chat_history: list of (question, answer) tuples; a new list is started when None
    model: 'gpt-3.5-turbo' by default
    temperature: ranges from 0 (most deterministic) to 2 (most creative)
    k: number of chunks to retrieve

    Returns the response dict and the updated chat_history
    """
    if vector_store is None:
        vector_store = load_embeddings_chroma('./Sessions/chroma_db')
    if chat_history is None: # avoid a mutable default that would be shared across calls
        chat_history = []
    llm = ChatOpenAI(model=model, temperature=temperature, streaming=True)
    retriever = vector_store.as_retriever(search_type='similarity', search_kwargs={'k': k})
    conversation_retrieval_chain = ConversationalRetrievalChain.from_llm(llm, retriever)
    response = conversation_retrieval_chain.invoke({'question': question, 'chat_history': chat_history})
    chat_history.append((question, response['answer']))
    return response, chat_history
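Functions that accumulate chat history in a list argument are sensitive to how Python handles default values: a default is evaluated once, when the function is defined, so a list default is shared by every call that omits the argument. A minimal sketch of the difference, using hypothetical helper names:

```python
def append_shared(item, history=[]):
    """The default list is created once and reused by every call."""
    history.append(item)
    return history

def append_fresh(item, history=None):
    """A None sentinel gives each call its own new list."""
    if history is None:
        history = []
    history.append(item)
    return history

first = append_shared("q1")
second = append_shared("q2")      # same list object as `first`
fresh_first = append_fresh("q1")
fresh_second = append_fresh("q2") # independent list

print(first is second, second)
print(fresh_first, fresh_second)
```

The same rule applies to any default that is expensive or stateful, such as a vector store loaded inside a function signature: it is built once at definition time, not once per call.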


In [9]:
# Loading documents
data = load_all_documents("../files/MySQL")
chunks = data_chunks(data)
vector_store = create_embeddings_chroma(chunks, './Sessions/chroma_db')
db = load_embeddings_chroma('./Sessions/chroma_db')

100%|██████████| 12/12 [00:01<00:00, 10.12it/s]


Documents: 49
Started Creating Embeddings... Done...


In [15]:
# Only the last assignment takes effect; to ask another question, re-run the cell with that line last.
question = """What is the whole document about?"""
question = """How many types of subqueries we have?"""
question = """can you quote the number of subqueries we have?"""
question = """How many types of window functions we have in sql?"""
question = """which are the above topics represents high difficulty level for a fresher to understand and interpret"""
# The chain built above carries its own memory, so only the question is passed in.
response = conversation_retrieval_chain.invoke({'question': question})
print(response)
print(response['question'])
print(response['answer'])



> Entering new LLMChain chain...
Prompt after formatting:
Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:

Human: which are the above topics represents high difficulty level for a fresher to understand and interpret
Assistant: Based on the provided context, the topics from "GLCA SQL Week3 Day 2 Practice Exercise" are likely to represent a higher difficulty level for a fresher to understand and interpret compared to the topics from "GLCA SQL Week3 Day 1 Practice Exercise."
Human: What is the whole document about?
Assistant: The topic of the document mentioned in the conversation is related to SQL practice exercises from Week 3, specifically Day 2.
Human: How many types of subqueries we have?
Assistant: There are four types of subqueries mentioned in the provided context:
1. Single-row sub-query
2. Multi-row sub-query
3. Multi-column sub-query
4. Correla

In [16]:
# chat history key
for history in response['chat_history']:
    print(history)

content='which are the above topics represents high difficulty level for a fresher to understand and interpret'
content='Based on the provided context, the topics from "GLCA SQL Week3 Day 2 Practice Exercise" are likely to represent a higher difficulty level for a fresher to understand and interpret compared to the topics from "GLCA SQL Week3 Day 1 Practice Exercise."'
content='What is the whole document about?'
content='The topic of the document mentioned in the conversation is related to SQL practice exercises from Week 3, specifically Day 2.'
content='How many types of subqueries we have?'
content='There are four types of subqueries mentioned in the provided context:\n1. Single-row sub-query\n2. Multi-row sub-query\n3. Multi-column sub-query\n4. Correlated sub-query'
content='can you quote the number of subqueries we have?'
content='Based on the provided context, there are four types of subqueries mentioned:\n1. Single-row sub-query\n2. Multi-row sub-query\n3. Multi-column sub-query

### Prompt Engineering

In [19]:
db = load_embeddings_chroma('./Sessions/chroma_db')
question = """How many types of commands we have in sql?"""
question = """What type of triggers we do have in sql?"""
response = conversation_retrieval_chain.invoke({'question': question}) # only the last question assigned above is sent
response['question']
response['answer']



> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
System: 
Use the following pieces of context to answer the user's question.
If you don't find the answer in the provided context, just respond "I don't know."
----------------------------------
Context: ```specific SQL implementation being used, but the most commonly used data 
types include: CHAR, VA RCHAR, INT, FLOAT, DOUBLE, DATE, TIME, 
DATETIME, and TIMESTAMP  
 
2. What is the difference between Char and Varchar data types in SQL?

Section – A: Theory Questions  
 
1. How many data types are available in SQL?  
There are several data types in SQL, including character strings, numeric values, 
and date/time values. The exact number of data types can vary depending on the

3. Explain the concept of SQL commands and how many types . 
A SQL command is a set of instructions written in the SQL language that is used 
to interact with a database. These co

"I don't know."