# Basic QA in LangChain
- Embed document
- Create vector store
- Query vector store with qa chain
- Parse results

Note: This uses LangChain 0.3.7, which is no longer supported but is the latest version available on Kaggle. LangChain 1.x differs significantly and uses LCEL (the LangChain Expression Language).

In [6]:
%pip install -q transformers langchain sentence-transformers pypdf faiss-cpu langchain-community torch

Note: you may need to restart the kernel to use updated packages.


#### Import Dependencies

In [None]:
import langchain
import re
import os
import torch
import json
from langchain.vectorstores import FAISS
from langchain.document_loaders import PyPDFLoader, UnstructuredMarkdownLoader, UnstructuredHTMLLoader  # Assumes all three loaders exist
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import HuggingFacePipeline
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from sentence_transformers import SentenceTransformer
from langchain.chains import ConversationalRetrievalChain
from langchain.schema import Document
print(langchain.__version__)

ImportError: cannot import name 'Tensor' from 'torch' (unknown location)

In [None]:
def write_output_to_file(output, filename: str):
    # Ensure the output directory exists
    out_dir = "../out/"
    os.makedirs(out_dir, exist_ok=True)
    
    # Define the full file path
    file_path = os.path.join(out_dir, filename)
    
    # Write the output to the file
    with open(file_path, "w") as file:
        file.write(str(output))
    
    print(f"Output successfully written to {file_path}")

#### Load, Clean, and Split Documents

In [None]:
# Function to clean text (to remove unwanted line breaks within sentences)
def clean_text(text):
    return re.sub(r'(?<!\n)\n(?!\n)', ' ', text)

# Function to load documents based on file type
def load_documents(file_path):
    _, file_extension = os.path.splitext(file_path)
    
    if file_extension.lower() == '.pdf':
        loader = PyPDFLoader(file_path)
        print("Loading PDF document...")
    elif file_extension.lower() == '.md':
        loader = UnstructuredMarkdownLoader(file_path)
        print("Loading Markdown document...")
    elif file_extension.lower() == '.html':
        loader = UnstructuredHTMLLoader(file_path)
        print("Loading HTML document...")
    else:
        raise ValueError("Unsupported file format. Please provide a PDF, Markdown, or HTML file.")
    
    documents = loader.load()
    cleaned_documents = [Document(page_content=clean_text(doc.page_content)) for doc in documents]
    return cleaned_documents
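
To make the effect of the regex in `clean_text` concrete, here is a small standalone sketch. The sample string is made up for illustration and does not come from the syllabus. A single newline inside a sentence is replaced by a space, while a blank-line paragraph break is left alone.

```python
import re

def clean_text(text):
    # Replace a lone newline (one not adjacent to another newline) with a space
    return re.sub(r'(?<!\n)\n(?!\n)', ' ', text)

# Made-up sample: one line break inside a sentence, one paragraph break
sample = "Students will submit HW\nassignments online.\n\nNext paragraph."
cleaned = clean_text(sample)
print(repr(cleaned))
```

The lookbehind and lookahead together make sure only isolated newlines match, so paragraph structure survives the cleaning step.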

In [None]:
# Load the document
file_path = "/kaggle/input/course-bot-data/bain_syllabus.pdf"  # Change this to the path of your PDF, Markdown, or HTML file
documents = load_documents(file_path)

In [None]:
# Set some params
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200

# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)
split_docs = text_splitter.split_documents(documents)

print(f"Total chunks created: {len(split_docs)}")

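print("Sample chunks:")
for i, doc in enumerate(split_docs[:5]):
    chunk_text = f"Chunk {i + 1}:\n{doc.page_content}\n"
    print(chunk_text)
    # Write each sample chunk to its own file so earlier chunks are not overwritten
    write_output_to_file(chunk_text, f"chunk_{i + 1}.out")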
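
To build intuition for how `CHUNK_SIZE` and `CHUNK_OVERLAP` interact, here is a simplified sketch using a fixed-width sliding window. This is not how `RecursiveCharacterTextSplitter` works internally (it tries to split on separators such as paragraph and line breaks first). The helper name `sliding_window_chunks` is hypothetical and exists only for this illustration.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Hypothetical helper: fixed-width chunks where neighbors share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # How far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

demo_chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(demo_chunks)
```

With a size of 4 and an overlap of 2, each chunk repeats the last two characters of the previous one, which is the same reason a larger overlap produces more chunks for the same document.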

#### Create Embeddings and Vector Store

In [None]:
# Initialize embeddings and vector store
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
vector_store = FAISS.from_documents(split_docs, embeddings)

Loading PDF document...
Total chunks created: 22
Sample chunks:
Chunk 1:
I. Course Description 1. Course Summary a. PHY 161/PHYS 215 General Physics I is an algebra-based introduction to mechanics,  thermodynamics, and waves. Topics include motion in one and two dimensions,  Newton’s laws of motion, equilibrium, work, energy, momentum, rotational motion,  gravity, heat, waves, and sound. Examples from medicine and biology will be  included whenever possible. 2. College Credit Hours (Dual-Enrollment) a. This course is dual enrolled with PHYS 215 General Physics I at Francis Marion  University (FMU) and taught by a GSSM instructor. Students will each have a FMU  transcript with their overall grade earned in this course. Students may earn up to 4  college credit hours depending on their grade and the transfer policies of their  college/university. Refer to the Dual Enrollment FAQ in the Course Catalog for  more information. 3. Learning Outcomes a. Upon completion of this course, students 



Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Question: What will students use to submit homework assignments?
Query embedding (first 5 values): [0.010682711377739906, -0.009580872021615505, -0.0028102046344429255, -0.06568533182144165, -0.029421314597129822]...

Retrieved Context 1 (Chunk Index: 5):
a. Students will submit HW assignments and complete in-class tests using WebAssign, an  online platform used by many universities that provides students with instant feedback  on problem responses along with helpful tutorials.  b. How to sign up: See “WebAssign Registration” module on Canvas for help with signing up  for a WebAssign account and the class key code. 3. Needed Supplies 1. To the class, students should, at minimum, bring…  (1) Writing utensils with notebook, printed notes, or tablets/iPad.  (a) Note - Students should NOT use their phone or laptop in class.  (b) Note - Students should NOT wear headphones during class.  2. To the lab, students should bring…  (1) One person per group should bring a laptop if possible. I have

  answer = qa_chain({"question": question, "chat_history": chat_history})
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)


Answer: Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

a. Students will submit HW assignments and complete in-class tests using WebAssign, an  online platform used by many universities that provides students with instant feedback  on problem responses along with helpful tutorials.  b. How to sign up: See “WebAssign Registration” module on Canvas for help with signing up  for a WebAssign account and the class key code. 3. Needed Supplies 1. To the class, students should, at minimum, bring…  (1) Writing utensils with notebook, printed notes, or tablets/iPad.  (a) Note - Students should NOT use their phone or laptop in class.  (b) Note - Students should NOT wear headphones during class.  2. To the lab, students should bring…  (1) One person per group should bring a laptop if possible. I have a few classroom  laptops as well. This will be used for Google Docs/Google Sheets/L

In [None]:
# Grab/declare tokenizer and model (used transformers below)
model_path = "/kaggle/input/llama-3.2/transformers/3b-instruct/1" # Use with models loaded into Kaggle
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

In [None]:
# Declare LangChain Q&A chain using a Hugging Face pipeline
llm_pipeline = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=100, temperature=0.7)
llm = HuggingFacePipeline(pipeline=llm_pipeline)

# Set up retrieval-based QA chain with vector store as the retriever
retriever = vector_store.as_retriever(search_kwargs={"k": 3})
qa_chain = ConversationalRetrievalChain.from_llm(llm=llm, retriever=retriever)

In [None]:
# Debug function to track the retrieval process
def debug_retrieval(question):
    print(f"Question: {question}")
    
    # Print query embedding to verify unique representation
    query_embedding = embeddings.embed_query(question)
    print(f"Query embedding (first 5 values): {query_embedding[:5]}...\n")

    # Retrieve relevant documents and print context chunks
    retrieval_result = retriever.get_relevant_documents(question)
    for i, doc in enumerate(retrieval_result):
        chunk_index = split_docs.index(doc) if doc in split_docs else -1  # Find the index of the document in split_docs
        print(f"Retrieved Context {i + 1} (Chunk Index: {chunk_index}):\n{doc.page_content}\n")

    return retrieval_result

In [None]:
# Ask a question
chat_history = []
question = 'What will students use to submit homework assignments?'

# Retrieve the context
#retrieval_result = retriever.get_relevant_documents(question)
retrieval_result = debug_retrieval(question) # Use debug retrieval to get print statements
context = " ".join([doc.page_content for doc in retrieval_result])

# Generate the answer using the QA chain we declared earlier
answer = qa_chain({"question": question, "chat_history": chat_history})
print("Answer:", answer['answer'])

In [None]:
def parse_response(response: str) -> dict:
    # Use regex to find matches for section headers and their contents
    matches = re.findall(r'([A-Z][a-zA-Z]*):\s(.*?)(?=\n[A-Z]|$)', response, re.DOTALL)

    return {title: content.strip() for title, content in matches} 

parsed_output = parse_response(answer['answer'])
for section, text in parsed_output.items():
    print(f"{section}: {text}\n")

Experiment with the RetrievalQA class instead of the ConversationalRetrievalChain. Note that both are deprecated, but they are what runs on Kaggle.

In [8]:
from langchain.chains import RetrievalQA
# Set up retrieval-based QA chain
retriever = vector_store.as_retriever(search_kwargs={"k": 3})
#qa_chain = ConversationalRetrievalChain.from_llm(llm=llm, retriever=retriever)
qa_chain = RetrievalQA.from_chain_type(llm=llm,retriever=retriever)
# Ask a question
chat_history = []
question = "Can I turn in homework late?"

# Retrieve the context
retrieval_result = retriever.get_relevant_documents(question)
#retrieved_result = debug_retrieval(question)
context = " ".join([doc.page_content for doc in retrieval_result])

# Generate the answer
response = qa_chain({"query": question})

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Each answer from the ConversationalRetrievalChain is a dictionary with three keys:
1. question
2. chat_history
3. answer

With the RetrievalQA chain, the keys are:
1. query
2. result

Here the response is simply two strings. The result string contains the full prompt followed by the generated text, so it has several sections: the instructions and retrieved context, then Question, then Helpful Answer.
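
Since the result string bundles the prompt with the generated text, one way to isolate the answer is to keep only what follows the last "Helpful Answer:" label. The sketch below is a minimal illustration under that assumption. The helper name `extract_helpful_answer`, the exact marker text, and the sample string are all hypothetical and not taken from an actual chain run.

```python
def extract_helpful_answer(result_text, marker="Helpful Answer:"):
    """Hypothetical helper: return the text after the last occurrence of `marker`."""
    _, found, tail = result_text.rpartition(marker)
    # If the marker is missing, fall back to the whole string
    return tail.strip() if found else result_text.strip()

# Made-up result string shaped like the sections described above
sample_result = (
    "Use the following pieces of context to answer the question at the end.\n\n"
    "Some retrieved context.\n\n"
    "Question: Can I turn in homework late?\n"
    "Helpful Answer: This is a made-up answer used only for illustration."
)
helpful = extract_helpful_answer(sample_result)
print(helpful)
```

In the notebook this would be applied to `response["result"]`. Using the last occurrence of the marker guards against the label also appearing inside the retrieved context.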

Exploring the cosine similarity of contexts for a question we know works:

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Get the embedding of the query
query = "What will students use to submit homework assignments?"
query_embedding = embeddings.embed_query(query)

# Retrieve all chunk embeddings from the vector store (FAISS in this case)
# Extracting all chunks and embeddings for manual similarity calculation
all_chunk_embeddings = vector_store.index.reconstruct_n(0, len(split_docs))

# Calculate cosine similarity between the query embedding and each chunk
similarity_scores = cosine_similarity([query_embedding], all_chunk_embeddings).flatten()

# Pair each chunk with its similarity score
chunk_scores = list(zip(split_docs, similarity_scores))

# Sort chunks by similarity score in descending order
sorted_chunk_scores = sorted(chunk_scores, key=lambda x: x[1], reverse=True)

# Print the top 5 most similar chunks
print("Top 5 most relevant chunks:")
for i, (doc_chunk, score) in enumerate(sorted_chunk_scores[:5]):
    print(f"Chunk {i + 1} - Similarity Score: {score:.4f}")
    print(f"Content: {doc_chunk.page_content}\n")
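
For reference, the quantity that `cosine_similarity` computes can be sketched in plain Python. The vectors below are tiny made-up examples, not real embeddings, and the function name `cosine` is hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 2-D vectors standing in for a query and three chunks
query_vec = [1.0, 0.0]
chunk_vecs = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}
scores = {name: cosine(query_vec, vec) for name, vec in chunk_vecs.items()}
for name, score in scores.items():
    print(f"{name}: {score:.4f}")
```

The score depends only on direction, not length, which is why the vector `[2.0, 0.0]` scores the same as the query itself would.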


In [None]:
# Questions to run through the retriever
questions = ['What topics are included in this course?',
             'Through what university is this course dual-enrolled?',
             'Please list the learning outcomes for the course.',
             'Will this course incorporate inquiry-based activities?',
             'Are there any free e-textbooks provided to students?',
             'How will students submit labs?',
             "How big will lab groups be?",
             'What is the primary text for the course?',
             'What will students use to submit homework assignments?',
             'What supplies do students need to bring to class?',
             'What weighting is given to the Final Exam?',
             'Please list 3 keys for success.',
             'List any prerequisites or co-requisites for the course.',
             'Describe the course lab tardiness policy.',
             'Are students allowed to wear headphones?',
             'How many homework assignments do students have to complete?',
             "What is the course policy on missed/late assignments?",
             "Describe how tests/exams are administered.",
             'What is the name of the Director of the Center for Academic Success?',
             'If a student engages in plagiarism/cheating, what will happen?',
             'List the dates of the exams for the course.',
             ]

In [None]:
import matplotlib.pyplot as plt
def debug_retrieval2(question):
    print(f"Question: {question}")
    
    # Print query embedding to verify unique representation
    query_embedding = embeddings.embed_query(question)
    print(f"Query embedding (first 5 values): {query_embedding[:5]}...\n")

    # Retrieve relevant documents and print context chunks
    retrieval_result = retriever.get_relevant_documents(question)
    chunk_index = -1  # Default when nothing is retrieved
    for i, doc in enumerate(retrieval_result):
        chunk_index = split_docs.index(doc) if doc in split_docs else -1  # Find the index of the document in split_docs
        print(f"Retrieved Context {i + 1} (Chunk Index: {chunk_index}):\n{doc.page_content}\n")

    # chunk_index refers to the last retrieved document (the only one when k=1)
    return retrieval_result, chunk_index


def debug_answers(question):
    # Assumes qa_chain is the ConversationalRetrievalChain (takes "question" and "chat_history")
    retrieval_result = debug_retrieval(question)
    context = " ".join([doc.page_content for doc in retrieval_result])
    print("Retrieved Context:")
    for i, doc in enumerate(retrieval_result):
        print(f"Context {i + 1}:\n{doc.page_content}\n")
    # Generate the answer
    answer = qa_chain({"question": question, "chat_history": chat_history})
    print("Answer:", answer['answer'])

# Use a retriever that returns only the top chunk (k=1)
retriever = vector_store.as_retriever(search_kwargs={"k": 1})

chunks_used = []
for question in questions:
    chunks_used.append(debug_retrieval2(question)[1])
    #debug_retrieval2(question)

        
    
plt.hist(chunks_used)
plt.title('Histogram of Context Chunks Used')
plt.show()
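
As a complement to the histogram, exact per-chunk counts can be tallied with `collections.Counter`. The list below is made-up example data standing in for `chunks_used`, and `-1` marks a retrieved document that was not found in `split_docs`.

```python
from collections import Counter

# Made-up chunk indices standing in for chunks_used
example_chunks_used = [5, 5, 0, 12, 5, -1, 0]

usage = Counter(example_chunks_used)
# List chunks from most to least frequently retrieved
for chunk_idx, count in usage.most_common():
    print(f"Chunk {chunk_idx}: retrieved {count} time(s)")
```

Exact counts make it easier to spot a single chunk that dominates retrieval, which histogram binning can hide.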

## Hyperparameter Tuning
- Experiment with chunk size -- smaller means more specific but could miss info
- Experiment with chunk overlap
- Implement semantic splitting to split chunks on obvious sections. Use nltk/spacy?
- Experiment with different vector stores: FAISS (fast) vs. Chroma vs. Weaviate
- Experiment with k, number of retrieved documents (try k=5, 3, etc.)
- Try adjusting the similarity threshold that determines when a chunk counts as relevant to the query
- Study query embeddings and potentially add preprocessing
- Further preprocessing of input documents
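
One way to organize the experiments listed above is to enumerate the combinations up front. The sketch below only builds the grid of settings. The candidate values are arbitrary examples, and actually rebuilding the splitter, vector store, and retriever for each setting is left out.

```python
from itertools import product

# Arbitrary example values for illustration
chunk_sizes = [500, 1000]
chunk_overlaps = [100, 200]
k_values = [1, 3, 5]

# Keep only settings where the overlap is smaller than the chunk size
grid = [
    {"chunk_size": size, "chunk_overlap": overlap, "k": k}
    for size, overlap, k in product(chunk_sizes, chunk_overlaps, k_values)
    if overlap < size
]

print(f"{len(grid)} configurations to try")
for config in grid[:3]:
    print(config)
```

Each dictionary could then drive one run of the splitting, embedding, and retrieval cells, with the retrieved chunk indices recorded per configuration for comparison.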