# Question Answering Project
This project builds a question answering system on top of a pre-trained language model. Given a question, the system retrieves relevant passages from the loaded documents and generates an answer grounded in that context.

In [1]:
from dotenv import load_dotenv, find_dotenv
import os
load_dotenv(find_dotenv(), override=True)
api_key = os.getenv("OPENAI_API_KEY")

In [2]:
def load_document(file):
    """Load a .pdf or .docx file into a list of LangChain documents."""
    _, extension = os.path.splitext(file)
    if extension == ".pdf":
        # Loaders live in langchain_community in current LangChain releases
        from langchain_community.document_loaders import PyPDFLoader
        loader = PyPDFLoader(file)
    elif extension == ".docx":
        from langchain_community.document_loaders import Docx2txtLoader
        loader = Docx2txtLoader(file)
    else:
        raise ValueError(f"Unsupported file extension: {extension}")
    print(f"Loading document from {file}")
    return loader.load()

In [3]:
data = load_document("files/constitution.pdf")
print(data[1].page_content)  # Print content of the second page

Loading document from files/constitution.pdf
C O N S T I T U T I O N O F T H E U N I T E D S T A T E S  
 
 
 
 
We the People of the United States, in Order to form a 
more perfect Union, establish Justice, insure domestic 
Tranquility, provide for the common defence, promote 
the general Welfare, and secure the Blessings of Liberty to 
ourselves and our Posterity, do ordain and establish this 
Constitution for the United States of America  
 
 
Article.  I. 
SECTION. 1 
All legislative Powers herein granted shall be vested in a 
Congress of the United States, which shall consist of a Sen- 
ate and House of Representatives. 
SECTION. 2 
The House of Representatives shall be composed of Mem- 
bers chosen every second Year by the People of the several 
States, and the Electors in each State shall have the Qualifi- 
cations requisite for Electors of the most numerous Branch 
of the State Legislature. 
No Person shall be a Representative who shall not have 
attained to the Age of twenty f

In [4]:
print(f"Document has {len(data)} pages.")
print(f"There are {len(data[0].metadata)} metadata fields on the first page.")

Document has 19 pages.
There are 13 metadata fields on the first page.


In [5]:
data_docx = load_document("files/Sam_Villasmith_Resume_2025 _Dev.docx")
print(data_docx[0].page_content)  # Print content of the first page

Loading document from files/Sam_Villasmith_Resume_2025 _Dev.docx
SAMUEL VILLA-SMITH, MBA

Senior Software Engineer

📧 svillasmith2@gmail.com | 📱 (806) 440-2215 | 🏠 Fritch, TX

🔗 https://www.linkedin.com/in/samuel-villa-smith-mbaa803a0109  | 🌐 https://github.com/samvillasmith | 



PROFESSIONAL SUMMARY

Experienced Senior Software Engineer with strong background in secure cloud-native applications and full-stack development. Data-driven PhD student in Information Technology with expertise in AI, Machine Learning, and Natural Language Processing (NLP). Combines technical expertise with business acumen to architect and develop robust, security-first web and mobile solutions. AWS Solutions Architect certified with proven experience in implementing defensive security measures and optimizing application performance.





TECHNICAL SKILLS

Development: React, TypeScript, Next.js, Node.js, Tailwind CSS, Shadcn UI, T3 Stack, Full- Stack Development

Data & AI: Advanced Analytics, Data Visualiza

## Basic Educational Overview of Technologies Used

## Service Loaders

In [6]:
def load_From_wikipedia(query, lang="en", load_max_docs=3):
    from langchain_community.document_loaders import WikipediaLoader
    print(f"Loading document from Wikipedia for query: {query}")
    loader = WikipediaLoader(query=query, lang=lang, load_max_docs=load_max_docs)
    data = loader.load()
    return data

In [7]:
data_wiki = load_From_wikipedia("Artificial Intelligence")
print(data_wiki[0].page_content)  # Print content of the first Wikipedia article

Loading document from Wikipedia for query: Artificial Intelligence
Artificial intelligence (AI) is the capability of computational systems to perform tasks typically associated with human intelligence, such as learning, reasoning, problem-solving, perception, and decision-making. It is a field of research in computer science that develops and studies methods and software that enable machines to perceive their environment and use learning and intelligence to take actions that maximize their chances of achieving defined goals.
High-profile applications of AI include advanced web search engines (e.g., Google Search); recommendation systems (used by YouTube, Amazon, and Netflix); virtual assistants (e.g., Google Assistant, Siri, and Alexa); autonomous vehicles (e.g., Waymo); generative and creative tools (e.g., language models and AI art); and superhuman play and analysis in strategy games (e.g., chess and Go). However, many AI applications are not perceived as AI: "A lot of cutting edge A

In [8]:
def chunk_data(data, chunk_size=256, chunk_overlap=0):
    from langchain_text_splitters import RecursiveCharacterTextSplitter
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len
    )
    chunks = text_splitter.split_documents(data)
    return chunks

In [9]:
chunks = chunk_data(data)

In [10]:
print(f"Document has {len(chunks)} chunks after splitting.")

Document has 236 chunks after splitting.
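
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch that cuts a string into fixed-size character windows. It is not what `RecursiveCharacterTextSplitter` does internally: the real splitter prefers to break on separators such as paragraph and line boundaries, so its chunks will differ. The helper name `chunk_text` is hypothetical and used only for illustration.

```python
def chunk_text(text, chunk_size=256, chunk_overlap=0):
    """Split text into fixed-size character windows with optional overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each new window starts this many characters after the previous one
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghij"
print(chunk_text(sample, chunk_size=4, chunk_overlap=0))
print(chunk_text(sample, chunk_size=4, chunk_overlap=2))
```

With no overlap, consecutive chunks share no characters; with an overlap of 2, each chunk repeats the last two characters of the previous one, which can help keep a sentence that straddles a boundary retrievable from either side.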


In [11]:
print(chunks[9])  # Print the tenth chunk (index 9)

page_content='Representative; and until such enumeration shall be made, 
the State of New Hampshire shall be entitled to chuse 
three, Massachusetts eight, Rhode-Island and Providence 
Plantations one, Connecticut five, New-York six, New' metadata={'producer': 'Adobe PDF Library 23.1.125', 'creator': 'Acrobat PDFMaker 23 for Word', 'creationdate': '2023-04-10T12:53:44-04:00', 'company': '', 'created': 'D:20030612', 'lastsaved': 'D:20230409', 'moddate': '2023-04-10T13:09:52-04:00', 'sourcemodified': 'D:20230410165309', 'title': 'constitution_pdf2', 'source': 'files/constitution.pdf', 'total_pages': 19, 'page': 1, 'page_label': '2'}


## Vector Stores

In [12]:
from pinecone import Pinecone
pc = Pinecone() 

In [13]:
from pinecone import ServerlessSpec

index_name = "qa-docs"
if index_name not in pc.list_indexes().names():
    print(f"Creating index: {index_name}")
    pc.create_index(
        name=index_name,
        dimension=1536,  # every vector stored in this index must have this length
        metric="cosine",
        spec=ServerlessSpec(
            cloud="aws",
            region="us-east-1"
        )
    )
    print(f"Index {index_name} created.")
else:
    print(f"Index {index_name} already exists.")

Creating index: qa-docs
Index qa-docs created.


In [14]:
index = pc.Index(index_name)
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {},
 'total_vector_count': 0,
 'vector_type': 'dense'}

### Upserting Vectors into Pinecone Index
When upserting vectors into a Pinecone index, make sure the index already exists, that each vector has the same dimension as the index (1536 here), and that each vector is paired with a unique ID. Below is an example of how to upsert vectors into a Pinecone index.

In [15]:
import random
# Create 5 random vectors, each of dimension 1536, with 5 unique IDs
vectors = [[random.random() for _ in range(1536)] for _ in range(5)]
ids = list('abcde')
# Pinecone accepts a list of (id, vector) tuples
index.upsert(vectors=list(zip(ids, vectors)))

{'upserted_count': 5}
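
A small pre-flight check can catch malformed input before it is sent to the index. The sketch below is one possible way to do that, assuming the 1536-dimension index used in this notebook. The helper `build_upsert_payload` is hypothetical and not part of the Pinecone client; it only builds the list of `(id, vector)` tuples passed to `index.upsert`.

```python
import random

DIMENSION = 1536  # assumed to match the dimension of the index created earlier


def build_upsert_payload(ids, vectors, dimension=DIMENSION):
    """Validate IDs and vectors, then return a list of (id, vector) tuples."""
    if len(ids) != len(vectors):
        raise ValueError("ids and vectors must have the same length")
    if len(set(ids)) != len(ids):
        raise ValueError("ids must be unique")
    for vector_id, vector in zip(ids, vectors):
        if len(vector) != dimension:
            raise ValueError(
                f"vector {vector_id!r} has length {len(vector)}, expected {dimension}"
            )
    return list(zip(ids, vectors))


payload = build_upsert_payload(
    list("abcde"),
    [[random.random() for _ in range(DIMENSION)] for _ in range(5)],
)
print(len(payload), len(payload[0][1]))
```

The resulting `payload` could then be passed as `index.upsert(vectors=payload)`.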

### Updating Pinecone vectors 
To update vectors in a Pinecone index, you can use the `upsert` method. If the vector ID already exists in the index, the existing vector will be updated with the new vector data. Here's an example of how to update vectors in a Pinecone index:

In [16]:
index.upsert(vectors=[("a", [0.1]*1536), ("b", [0.2]*1536)])  # Example vectors

{'upserted_count': 2}
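
The insert-or-overwrite behavior described above can be modeled with a plain dictionary. This is only an in-memory illustration of the semantics, not Pinecone code; the function `upsert` and the variable `store` here are hypothetical stand-ins.

```python
def upsert(store, records):
    """Insert new IDs and overwrite existing ones; report how many records were written."""
    for vector_id, values in records:
        store[vector_id] = list(values)
    return {"upserted_count": len(records)}


store = {}
upsert(store, [("a", [0.9, 0.9]), ("b", [0.8, 0.8])])
# "a" already exists, so it is overwritten; "c" is new, so it is inserted
result = upsert(store, [("a", [0.1, 0.1]), ("c", [0.3, 0.3])])
print(result)
print(sorted(store))
print(store["a"])
```

Note that the returned count covers every record written, whether it was an insert or an update.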

### Fetching a vector by ID
To fetch a vector by its ID from a Pinecone index, you can use the `fetch` method. This method retrieves the vector associated with the specified ID. Below is an example of how to fetch a vector by its ID:

In [17]:
index.fetch(ids=["a", "b"])

FetchResponse(namespace='', vectors={}, usage={'read_units': 1})
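
The empty `vectors={}` in the response above is consistent with Pinecone being eventually consistent: records written a moment ago may not be visible to reads yet. One way to handle this in a notebook is to poll until the data appears. The sketch below is a generic polling helper; `wait_until` is a hypothetical name, and the commented usage line assumes the `index` object from this notebook.

```python
import time


def wait_until(check, timeout=30.0, interval=1.0):
    """Call check() repeatedly until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Hypothetical usage against the index above:
# wait_until(lambda: index.describe_index_stats().total_vector_count >= 5)
```

The helper returns `False` instead of raising when the timeout expires, so the caller can decide whether a missing record is an error.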

### Deleting vectors by ID
To delete vectors from a Pinecone index by their IDs, you can use the `delete` method. This method removes the vectors associated with the specified IDs from the index. Below is an example of how to delete vectors by their IDs: 

In [18]:
index.delete(ids=["a", "b"])

{}

In [19]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {},
 'total_vector_count': 0,
 'vector_type': 'dense'}

In [20]:
index.fetch(ids=['c'])

FetchResponse(namespace='', vectors={}, usage={'read_units': 1})

### Querying Pinecone Index
To query a Pinecone index, use the `query` method, which returns the `top_k` vectors most similar to a given query vector. The cells below first clear and repopulate the index, then run a query:

In [21]:
index.delete(ids=["b", "c", "d", "e"])

{}

In [22]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {},
 'total_vector_count': 0,
 'vector_type': 'dense'}

In [23]:
# Create 5 random vectors, each of dimension 1536, with 5 unique IDs
vectors = [[random.random() for _ in range(1536)] for _ in range(5)]
ids = list('abcde')
# Pinecone accepts a list of (id, vector) tuples
index.upsert(vectors=list(zip(ids, vectors)))

{'upserted_count': 5}

In [24]:
query_vector = [random.random() for _ in range(1536)]  # A random query vector of dimension 1536
# Query the index for the top 3 most similar vectors
index.query(vector=query_vector, top_k=3)

{'matches': [], 'namespace': '', 'usage': {'read_units': 1}}
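
Conceptually, a query against an index with the `cosine` metric ranks stored vectors by cosine similarity to the query vector and returns the best `top_k`. The brute-force sketch below illustrates that ranking on tiny 2-dimensional vectors. It is not how Pinecone implements search internally, and the names `cosine_similarity`, `top_k`, and `records` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, records, k=3):
    """Return the k (id, score) pairs with the highest cosine similarity to query."""
    scored = [(vid, cosine_similarity(query, vec)) for vid, vec in records.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


records = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
matches = top_k([1.0, 0.1], records, k=2)
print([vid for vid, _ in matches])
```

Here the query points almost along the first axis, so `a` ranks first, followed by `c`, which lies at 45 degrees between the axes.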

## Namespaces
Namespaces in Pinecone allow you to organize your vectors into separate groups. This can be useful for managing different datasets or applications within the same Pinecone index. When upserting, querying, or deleting vectors, you can specify a namespace to operate within that specific group.
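
One way to picture namespaces is as separate dictionaries inside the same index, with the empty string as the default namespace. The toy class below models only that scoping behavior in memory. `TinyIndex` is a hypothetical illustration, not the Pinecone client, and its return values are simplified.

```python
from collections import defaultdict


class TinyIndex:
    """In-memory model of an index whose records are grouped by namespace."""

    def __init__(self):
        self._namespaces = defaultdict(dict)

    def upsert(self, vectors, namespace=""):
        for vector_id, values in vectors:
            self._namespaces[namespace][vector_id] = values

    def fetch(self, ids, namespace=""):
        # Only the requested namespace is searched
        bucket = self._namespaces.get(namespace, {})
        return {vid: bucket[vid] for vid in ids if vid in bucket}

    def delete(self, ids=None, delete_all=False, namespace=""):
        if delete_all:
            self._namespaces.pop(namespace, None)
            return
        bucket = self._namespaces.get(namespace, {})
        for vid in ids or []:
            bucket.pop(vid, None)


toy = TinyIndex()
toy.upsert([("x", [0.1]), ("y", [0.2])], namespace="test-namespace")
toy.upsert([("w", [0.3])], namespace="test-namespace-2")
print(toy.fetch(["x", "w"]))                         # default namespace: nothing found
print(toy.fetch(["x"], namespace="test-namespace"))  # found in its own namespace
```

The first fetch returns nothing because neither ID was written to the default namespace; the second succeeds because it names the namespace the record was written to.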

In [25]:
vectors = [[random.random() for _ in range(1536)] for _ in range(3)]  # 3 vectors of dimension 1536
ids = list('xyz')  # 3 unique IDs
index.upsert(vectors=list(zip(ids, vectors)), namespace="test-namespace")

{'upserted_count': 3}

In [26]:
vectors = [[random.random() for _ in range(1536)] for _ in range(2)]  # 2 vectors of dimension 1536
ids = list('wv')  # 2 unique IDs
index.upsert(vectors=list(zip(ids, vectors)), namespace="test-namespace-2")

{'upserted_count': 2}

In [27]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {},
 'total_vector_count': 0,
 'vector_type': 'dense'}

In [28]:
# This won't work: no namespace is given, so fetch looks in the default namespace, where neither ID exists
index.fetch(ids=['x', 'w'])

FetchResponse(namespace='', vectors={}, usage={'read_units': 1})

In [29]:
# This will work because we specify the namespace
index.fetch(ids=['x'], namespace="test-namespace")

FetchResponse(namespace='test-namespace', vectors={}, usage={'read_units': 1})

In [30]:
# This also applies when deleting vectors
index.delete(ids=['x'], namespace="test-namespace")

{}

In [31]:
# To delete all vectors in a namespace and the namespace itself
index.delete(delete_all=True, namespace="test-namespace-2")

{}

In [32]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {},
 'total_vector_count': 0,
 'vector_type': 'dense'}

In [33]:
index.delete(delete_all=True, namespace="test-namespace")

{}

In [34]:
index.delete(ids=["a", "b", "c", "d", "e"])

{}

In [35]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {},
 'total_vector_count': 0,
 'vector_type': 'dense'}

## RAG (Retrieval-Augmented Generation)
Retrieval-Augmented Generation (RAG) is a technique that combines retrieval-based methods with generative models to improve the quality of generated responses. In a RAG system, relevant documents or passages are retrieved from a knowledge base based on the input query, and these retrieved documents are then used to inform the generation of the final response.
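
The retrieve-then-generate flow can be sketched without any external services. In the toy example below, retrieval is a simple word-overlap ranking standing in for vector similarity search, and the generation step is left out: the assembled prompt is what would be sent to a language model. All names (`tokenize`, `retrieve`, `build_prompt`) and the prompt wording are assumptions made for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(question, passages, k=2):
    """Rank passages by how many words they share with the question; keep the top k."""
    q_words = tokenize(question)
    ranked = sorted(passages, key=lambda p: len(q_words & tokenize(p)), reverse=True)
    return ranked[:k]


def build_prompt(question, context_passages):
    """Combine the retrieved passages and the question into one prompt string."""
    context = "\n\n".join(context_passages)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


passages = [
    "All legislative Powers herein granted shall be vested in a Congress of the United States.",
    "Machine learning is a field of study in artificial intelligence.",
    "The House of Representatives shall be composed of Members chosen every second Year.",
]
toy_question = "Who holds legislative powers?"
top_passages = retrieve(toy_question, passages, k=1)
prompt = build_prompt(toy_question, top_passages)
print(prompt)
```

In the rest of this notebook, the retrieval step is handled by a Pinecone vector store and the generation step by an OpenAI chat model.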

## Embedding and Uploading to a Vector Database

In [43]:
def get_or_create_vector_store(index_name, chunks):
    from langchain_pinecone import PineconeVectorStore
    from langchain_openai import OpenAIEmbeddings
    from pinecone import ServerlessSpec

    embeddings = OpenAIEmbeddings()

    if index_name in pc.list_indexes().names():
        print(f"Index {index_name} exists. Fetching existing embeddings.")

        # Connect to the existing index without re-embedding the chunks
        vector_store = PineconeVectorStore.from_existing_index(
            index_name=index_name,
            embedding=embeddings
        )
        
        # Get stats
        index = pc.Index(index_name)
        stats = index.describe_index_stats()
        print(f"Loaded {stats.total_vector_count} vectors from index {index_name}.")
    else:
        print(f"Creating index {index_name} and inserting embeddings.")
        pc.create_index(
            name=index_name,
            dimension=1536,
            metric="cosine",
            spec=ServerlessSpec(
                cloud="aws",
                region="us-east-1"
            )
        )
        
        # Use PineconeVectorStore for creation
        vector_store = PineconeVectorStore.from_documents(
            documents=chunks,
            embedding=embeddings,  
            index_name=index_name
        )
        print(f"Inserted {len(chunks)} vectors into index {index_name}.")
    
    return vector_store

In [37]:
def delete_index(index_name):
    if index_name in pc.list_indexes().names():
        print(f"Deleting index: {index_name}")
        pc.delete_index(index_name)
        print(f"Index {index_name} deleted.")
    else:
        print(f"Index {index_name} does not exist.")

In [45]:
delete_index("qa-docs")

Deleting index: qa-docs
Index qa-docs deleted.


# Application Implementation

In [46]:
data = load_document("files/constitution.pdf")
chunks = chunk_data(data)

Loading document from files/constitution.pdf


In [47]:
index_name = "qa-docs"
vector_store = get_or_create_vector_store(index_name, chunks)

Creating index qa-docs and inserting embeddings.
Inserted 236 vectors into index qa-docs.


## Asking and Answering Questions

Note: This is a single-turn question answering system. It keeps no memory or chat history between questions.

In [48]:
def ask_and_get_answer(vector_store, question):
    # ChatOpenAI is imported from langchain_openai; the langchain.chat_models path is deprecated
    from langchain_openai import ChatOpenAI
    from langchain.chains import RetrievalQA

    llm = ChatOpenAI(model="gpt-4o", temperature=0.2)
    retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 5})
    chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)
    answer = chain.invoke(question)
    return answer

In [49]:
question = "What is the purpose of the Constitution?"
answer = ask_and_get_answer(vector_store, question)


In [50]:
print(answer)

{'query': 'What is the purpose of the Constitution?', 'result': 'The purpose of the Constitution is to establish the framework for the government of the United States, ensure justice, provide for the common defense, promote the general welfare, and secure the blessings of liberty for the people and their future generations.'}
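
The printed value above is a dictionary with `query` and `result` keys, so printing it shows both. To display only the answer text, a small helper can pull out the `result` field. This is a minimal sketch; `extract_result` is a hypothetical name, and the sample dictionary uses placeholder text rather than real model output.

```python
def extract_result(answer):
    """Return the answer text from a chain response, or the value itself as a string."""
    if isinstance(answer, dict):
        return answer.get("result", "")
    return str(answer)


# Placeholder response with the same shape as the output shown above
sample_answer = {"query": "example question", "result": "example answer text"}
print(extract_result(sample_answer))
```

In this notebook it could be used as `print(extract_result(answer))`.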


In [52]:
import time
i = 1
print("Enter quit or exit to stop.")
while True:
    question = input(f"Question {i}: ")
    if question.lower() in ["quit", "exit"]:
        break
    start_time = time.time()
    answer = ask_and_get_answer(vector_store, question)
    end_time = time.time()
    print(f"Answer: {answer}")
    print(f"Time taken: {end_time - start_time:.2f} seconds\n")
    i += 1

Enter quit or exit to stop.
Answer: {'query': 'explain the concept of the federal government ', 'result': 'The federal government is a system of governance in which power and authority are divided between a central government and individual states or provinces. In the context of the United States, the federal government refers to the national government, which is established by the U.S. Constitution. It is responsible for managing national affairs and has specific powers and responsibilities that are distinct from those of state governments.\n\nKey features of the U.S. federal government include:\n\n1. **Separation of Powers**: The federal government is divided into three branches: the legislative branch (Congress), the executive branch (headed by the President), and the judicial branch (the Supreme Court and other federal courts). Each branch has its own distinct powers and responsibilities, providing a system of checks and balances to prevent any one branch from becoming too powerful

In [58]:
data_wiki2 = load_From_wikipedia("Machine Learning")
chunks = chunk_data(data_wiki2)

Loading document from Wikipedia for query: Machine Learning


In [59]:
def add_documents_to_index(index_name, chunks):
    from langchain_pinecone import PineconeVectorStore
    from langchain_openai import OpenAIEmbeddings
    
    embeddings = OpenAIEmbeddings()
    vector_store = PineconeVectorStore.from_existing_index(
        index_name=index_name,
        embedding=embeddings
    )
    
    # Add new documents
    vector_store.add_documents(chunks)
    print(f"Added {len(chunks)} documents to {index_name}")
    
    return vector_store

In [60]:
add_documents_to_index(index_name, chunks)

Added 66 documents to qa-docs


<langchain_pinecone.vectorstores.PineconeVectorStore at 0x21d1d7eb820>

In [62]:
print("Enter quit or exit to stop.")
while True:
    question = input(f"Question {i}: ")
    if question.lower() in ["quit", "exit"]:
        break
    start_time = time.time()
    answer = ask_and_get_answer(vector_store, question)
    end_time = time.time()
    print(f"Answer: {answer}")
    print(f"Time taken: {end_time - start_time:.2f} seconds\n")
    i += 1

Enter quit or exit to stop.
Answer: {'query': 'what is ML?', 'result': 'Machine learning (ML) is a field of study in artificial intelligence that focuses on the development and study of statistical algorithms that can learn from data and generalize to unseen data. This allows the algorithms to perform tasks without explicit instructions. ML finds applications in various fields such as natural language processing, computer vision, speech recognition, email filtering, agriculture, and medicine. It is also used in business for predictive analytics. The foundations of machine learning are based on statistics and mathematical optimization methods.'}
Time taken: 3.35 seconds

Answer: {'query': 'Tell me about proprietary software with free and open-source editions', 'result': 'Proprietary software with free and open-source editions refers to software that is available in both a proprietary version, which is typically sold or licensed under restrictive terms, and an open-source version, which 