# Question Answering Project - Basic Overview of Technology
This project builds a question answering system on top of a pre-trained language model. Given a context passage and a question, the system generates an answer grounded in that context.

In [6]:
from dotenv import load_dotenv, find_dotenv
import os
load_dotenv(find_dotenv(), override=True)
api_key = os.getenv("OPENAI_API_KEY")
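`os.getenv` returns `None` without complaint when the variable is missing, so a typo in the `.env` file only surfaces later as an authentication error. A minimal sketch of a fail-fast check follows; `require_env` and `mask` are hypothetical helpers written for this note, not part of `python-dotenv`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set; check your .env file.")
    return value


def mask(secret: str, visible: int = 4) -> str:
    """Hide all but the last few characters so the value is safer to print."""
    return "*" * max(len(secret) - visible, 0) + secret[-visible:]
```

In the notebook this could be used as `api_key = require_env("OPENAI_API_KEY")`, with `print(mask(api_key))` as a quick check that avoids echoing the full key.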

In [7]:
def load_document(file):
    """Load a .pdf or .docx file into a list of LangChain documents."""
    name, extension = os.path.splitext(file)
    extension = extension.lower()  # Treat ".PDF" and ".pdf" alike
    if extension == ".pdf":
        from langchain.document_loaders import PyPDFLoader
        loader = PyPDFLoader(file)
    elif extension == ".docx":
        from langchain.document_loaders import Docx2txtLoader
        loader = Docx2txtLoader(file)
    else:
        raise ValueError(f"Unsupported file extension: {extension}")
    print(f"Loading document from {file}")
    data = loader.load()
    return data
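As more formats are added, the `if`/`elif` chain in `load_document` grows with them. One alternative is a lookup table keyed by lowercase extension. The sketch below only resolves the loader's class name so that it runs without LangChain installed; `LOADER_NAMES` and `pick_loader_name` are hypothetical names introduced here.

```python
import os

# Hypothetical table: lowercase extension -> name of the loader class used above.
LOADER_NAMES = {
    ".pdf": "PyPDFLoader",
    ".docx": "Docx2txtLoader",
}


def pick_loader_name(file: str) -> str:
    """Return the loader class name for a path, ignoring the case of the extension."""
    _, extension = os.path.splitext(file)
    try:
        return LOADER_NAMES[extension.lower()]
    except KeyError:
        raise ValueError(f"Unsupported file extension: {extension}") from None
```

Supporting a new format then means adding one entry to the table rather than another branch.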

In [8]:
data = load_document("files/constitution.pdf")
print(data[1].page_content)  # Print content of the second page

Loading document from files/constitution.pdf
C O N S T I T U T I O N O F T H E U N I T E D S T A T E S  
 
 
 
 
We the People of the United States, in Order to form a 
more perfect Union, establish Justice, insure domestic 
Tranquility, provide for the common defence, promote 
the general Welfare, and secure the Blessings of Liberty to 
ourselves and our Posterity, do ordain and establish this 
Constitution for the United States of America  
 
 
Article.  I. 
SECTION. 1 
All legislative Powers herein granted shall be vested in a 
Congress of the United States, which shall consist of a Sen- 
ate and House of Representatives. 
SECTION. 2 
The House of Representatives shall be composed of Mem- 
bers chosen every second Year by the People of the several 
States, and the Electors in each State shall have the Qualifi- 
cations requisite for Electors of the most numerous Branch 
of the State Legislature. 
No Person shall be a Representative who shall not have 
attained to the Age of twenty f

In [9]:
print(f"Document has {len(data)} pages.")
print(f"There are {len(data[0].metadata)} metadata fields on the first page.")

Document has 19 pages.
There are 13 metadata fields on the first page.


In [12]:
data_docx = load_document("files/Sam_Villasmith_Resume_2025 _Dev.docx")
print(data_docx[0].page_content)  # Docx2txtLoader returns the whole file as a single document

Loading document from files/Sam_Villasmith_Resume_2025 _Dev.docx
SAMUEL VILLA-SMITH, MBA

Senior Software Engineer

📧 svillasmith2@gmail.com | 📱 (806) 440-2215 | 🏠 Fritch, TX

🔗 https://www.linkedin.com/in/samuel-villa-smith-mbaa803a0109  | 🌐 https://github.com/samvillasmith | 



PROFESSIONAL SUMMARY

Experienced Senior Software Engineer with strong background in secure cloud-native applications and full-stack development. Data-driven PhD student in Information Technology with expertise in AI, Machine Learning, and Natural Language Processing (NLP). Combines technical expertise with business acumen to architect and develop robust, security-first web and mobile solutions. AWS Solutions Architect certified with proven experience in implementing defensive security measures and optimizing application performance.





TECHNICAL SKILLS

Development: React, TypeScript, Next.js, Node.js, Tailwind CSS, Shadcn UI, T3 Stack, Full- Stack Development

Data & AI: Advanced Analytics, Data Visualiza

## Service Loaders

In [13]:
def load_from_wikipedia(query, lang="en", load_max_docs=3):
    from langchain.document_loaders import WikipediaLoader
    print(f"Loading document from Wikipedia for query: {query}")
    loader = WikipediaLoader(query=query, lang=lang, load_max_docs=load_max_docs)
    data = loader.load()
    return data

In [14]:
data_wiki = load_from_wikipedia("Artificial Intelligence")
print(data_wiki[0].page_content)  # Print content of the first Wikipedia article

Loading document from Wikipedia for query: Artificial Intelligence
Artificial intelligence (AI) is the capability of computational systems to perform tasks typically associated with human intelligence, such as learning, reasoning, problem-solving, perception, and decision-making. It is a field of research in computer science that develops and studies methods and software that enable machines to perceive their environment and use learning and intelligence to take actions that maximize their chances of achieving defined goals.
High-profile applications of AI include advanced web search engines (e.g., Google Search); recommendation systems (used by YouTube, Amazon, and Netflix); virtual assistants (e.g., Google Assistant, Siri, and Alexa); autonomous vehicles (e.g., Waymo); generative and creative tools (e.g., language models and AI art); and superhuman play and analysis in strategy games (e.g., chess and Go). However, many AI applications are not perceived as AI: "A lot of cutting edge A

In [15]:
def chunk_data(data, chunk_size=256, chunk_overlap=0):
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len
    )
    chunks = text_splitter.split_documents(data)
    return chunks
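To see what `chunk_size` and `chunk_overlap` control, here is a simplified sliding-window splitter in plain Python. It is only an illustration of the two parameters: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraph and line boundaries, which this sketch does not attempt. The name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int = 256, chunk_overlap: int = 0) -> list[str]:
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks
```

With no overlap the windows tile the text; with overlap, the tail of each chunk is repeated at the head of the next, which can help keep a sentence that straddles a boundary retrievable from either side.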

In [16]:
chunks = chunk_data(data)

In [19]:
print(f"Document has {len(chunks)} chunks after splitting.")

Document has 236 chunks after splitting.


In [23]:
print(chunks[9])  # Print the tenth chunk (index 9)

page_content='Representative; and until such enumeration shall be made, 
the State of New Hampshire shall be entitled to chuse 
three, Massachusetts eight, Rhode-Island and Providence 
Plantations one, Connecticut five, New-York six, New' metadata={'producer': 'Adobe PDF Library 23.1.125', 'creator': 'Acrobat PDFMaker 23 for Word', 'creationdate': '2023-04-10T12:53:44-04:00', 'company': '', 'created': 'D:20030612', 'lastsaved': 'D:20230409', 'moddate': '2023-04-10T13:09:52-04:00', 'sourcemodified': 'D:20230410165309', 'title': 'constitution_pdf2', 'source': 'files/constitution.pdf', 'total_pages': 19, 'page': 1, 'page_label': '2'}


## Vector Stores

In [2]:
from pinecone import Pinecone
# With no api_key argument, the client reads PINECONE_API_KEY from the environment
pc = Pinecone()

In [5]:
from pinecone import ServerlessSpec

index_name = "qa-docs"
if index_name not in pc.list_indexes().names():
    print(f"Creating index: {index_name}")
    pc.create_index(
        name=index_name,
        dimension=1536,  # Must match the length of the vectors stored in this index
        metric="cosine",
        spec=ServerlessSpec(
            cloud="aws",
            region="us-east-1"
        )
    )
    print(f"Index {index_name} created.")
else:
    print(f"Index {index_name} already exists.")

Creating index: qa-docs
Index qa-docs created.
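The index dimension is fixed at creation, and every vector sent to it must have exactly that many values. The value 1536 matches the output size of OpenAI's `text-embedding-ada-002` and `text-embedding-3-small`; which embedding model this project uses is an assumption, since the cells above do not name one. A small pre-flight check, sketched below with the hypothetical helper `check_dimensions`, can catch a mismatch locally before a request is sent.

```python
INDEX_DIMENSION = 1536  # assumed to equal the `dimension` passed to create_index


def check_dimensions(records, expected=INDEX_DIMENSION):
    """Raise ValueError naming the first (id, values) pair whose length is wrong."""
    for vector_id, values in records:
        if len(values) != expected:
            raise ValueError(
                f"Vector {vector_id!r} has {len(values)} values, expected {expected}"
            )
    return True
```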


In [7]:
index = pc.Index(index_name)
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {},
 'total_vector_count': 0,
 'vector_type': 'dense'}

### Upserting Vectors into Pinecone Index
Each vector passed to `upsert` needs a unique string ID and a list of values whose length equals the index dimension (1536 here). The example below upserts five random vectors as `(id, vector)` tuples.

In [10]:
import random
# Create 5 random vectors, each of dimension 1536
vectors = [[random.random() for _ in range(1536)] for _ in range(5)]
ids = list('abcde')  # 5 unique IDs
# Pinecone expects a list of (id, vector) tuples
index.upsert(vectors=list(zip(ids, vectors)))

{'upserted_count': 5}
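Five vectors fit comfortably in one request, but the 236 chunks produced earlier, or a larger corpus, are usually sent in batches because request size is limited. The generator below is a generic sketch; the name `batched` and the default batch size of 100 are assumptions, not values taken from this notebook.

```python
from itertools import islice


def batched(records, batch_size=100):
    """Yield successive lists of at most batch_size items from any iterable."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Each batch could then be passed to the same call used above, for example `for batch in batched(zip(ids, vectors)): index.upsert(vectors=batch)`.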

### Updating Pinecone vectors 
Updating uses the same `upsert` method: if a vector ID already exists in the index, its values are overwritten with the new data. The example below overwrites the vectors stored under `a` and `b`:

In [11]:
index.upsert(vectors=[("a", [0.1]*1536), ("b", [0.2]*1536)])  # Example vectors

{'upserted_count': 2}
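The insert-or-overwrite behavior can be pictured with a plain dictionary. The sketch below is an in-memory model for illustration only and says nothing about how Pinecone stores data; the function name `upsert` is reused here for a hypothetical local helper.

```python
def upsert(store: dict, records) -> int:
    """Insert new IDs and overwrite existing ones; return how many records were written."""
    count = 0
    for vector_id, values in records:
        store[vector_id] = list(values)  # same key: old values are replaced
        count += 1
    return count


store = {}
first = upsert(store, [(vector_id, [1.0, 1.0]) for vector_id in "abcde"])
second = upsert(store, [("a", [0.1, 0.1]), ("b", [0.2, 0.2])])
```

The second call reports two records written, yet the store still holds five entries, because `a` and `b` were replaced rather than added.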

### Fetching a vector by ID
The `fetch` method retrieves vectors by ID. The example below fetches `a` and `b`, which now hold the values written by the update above:

In [12]:
index.fetch(ids=["a", "b"])

FetchResponse(namespace='', vectors={'a': Vector(id='a', values=[0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 

### Deleting vectors by ID
The `delete` method removes the vectors with the given IDs from the index. The example below deletes `a` and `b`, leaving three of the original five vectors:

In [13]:
index.delete(ids=["a", "b"])

{}

In [14]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {'': {'vector_count': 3}},
 'total_vector_count': 3,
 'vector_type': 'dense'}

In [15]:
index.fetch(ids=['c'])

FetchResponse(namespace='', vectors={'c': Vector(id='c', values=[0.204195708, 0.933384538, 0.955573142, 0.687194824, 0.480368823, 0.0152939521, 0.88135618, 0.0481844731, 0.534945786, 0.472482055, 0.47474122, 0.273595393, 0.678864, 0.526243627, 0.370931864, 0.245773956, 0.914251685, 0.727751732, 0.114832327, 0.875365, 0.309205383, 0.536275566, 0.959496617, 0.558822036, 0.416250348, 0.620590806, 0.494170338, 0.106574051, 0.85168159, 0.981046915, 0.184834421, 0.179612383, 0.693224669, 0.0970473886, 0.868960679, 0.101248957, 0.779321134, 0.352455765, 0.091754362, 0.595546842, 0.132745624, 0.140526801, 0.755977213, 0.457343072, 0.619590044, 0.630564272, 0.0619884618, 0.895795405, 0.467174858, 0.630901337, 0.608219206, 0.327438682, 0.024047019, 0.133987859, 0.24483113, 0.344723523, 0.731977165, 0.96343106, 0.837494433, 0.721877575, 0.358941615, 0.855726957, 0.80548656, 0.715823233, 0.416341335, 0.835415781, 0.421025783, 0.207307503, 0.848284364, 0.218429953, 0.903052151, 0.909516931, 0.97119

### Querying Pinecone Index
The `query` method searches the index for the vectors most similar to a given query vector. The cells below first clear the remaining vectors and upsert a fresh set of five, then query for the top 3 matches:

In [16]:
index.delete(ids=["c", "d", "e"])  # "a" and "b" were already deleted above

{}

In [17]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {},
 'total_vector_count': 0,
 'vector_type': 'dense'}

In [18]:
# Create 5 random vectors, each of dimension 1536
vectors = [[random.random() for _ in range(1536)] for _ in range(5)]
ids = list('abcde')  # 5 unique IDs
# Pinecone expects a list of (id, vector) tuples
index.upsert(vectors=list(zip(ids, vectors)))

{'upserted_count': 5}

In [20]:
query_vector = [random.random() for _ in range(1536)]  # A random query vector of dimension 1536
# Query the index for the top 3 most similar vectors
index.query(vector=query_vector, top_k=3)

{'matches': [{'id': 'c', 'score': 0.766354382, 'values': []},
             {'id': 'a', 'score': 0.756619394, 'values': []},
             {'id': 'e', 'score': 0.756510854, 'values': []}],
 'namespace': '',
 'usage': {'read_units': 1}}
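All three scores sit close to 0.75, which is what random data should produce. For independent components drawn uniformly from [0, 1), the expected product of two components is 1/4 and the expected square of one component is 1/3, so in high dimensions the cosine similarity is roughly (n/4) / (n/3) = 0.75. The brute-force sketch below reproduces this with plain Python; it ranks by exact cosine similarity and is not a description of how Pinecone searches internally. The helpers `cosine_similarity` and `top_k` are hypothetical names.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, vectors_by_id, k=3):
    """Return the k (id, score) pairs with the highest cosine similarity to query."""
    scored = [
        (vector_id, cosine_similarity(query, values))
        for vector_id, values in vectors_by_id.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


rng = random.Random(0)  # seeded so the run is repeatable
stored = {vector_id: [rng.random() for _ in range(1536)] for vector_id in "abcde"}
query = [rng.random() for _ in range(1536)]
matches = top_k(query, stored)
```

Because random vectors are all about equally similar to one another, the ranking here carries no meaning; real embeddings spread the scores out so that the top matches are genuinely closer to the query.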

## Namespaces
Namespaces in Pinecone allow you to organize your vectors into separate groups. This can be useful for managing different datasets or applications within the same Pinecone index. When upserting, querying, or deleting vectors, you can specify a namespace to operate within that specific group.
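One way to picture namespaces is as a dictionary of dictionaries, where the outer key is the namespace name and the empty string is the default. The class below is an in-memory model for illustration only; `NamespacedStore` and its methods are hypothetical and merely mirror the argument names used in this notebook.

```python
from collections import defaultdict


class NamespacedStore:
    """Toy model: each namespace is an independent id -> values mapping."""

    def __init__(self):
        self._namespaces = defaultdict(dict)

    def upsert(self, records, namespace=""):
        for vector_id, values in records:
            self._namespaces[namespace][vector_id] = list(values)

    def fetch(self, ids, namespace=""):
        # Only the requested namespace is searched; others are invisible.
        bucket = self._namespaces.get(namespace, {})
        return {vector_id: bucket[vector_id] for vector_id in ids if vector_id in bucket}

    def delete(self, ids=None, delete_all=False, namespace=""):
        if namespace not in self._namespaces:
            return
        if delete_all:
            del self._namespaces[namespace]
            return
        bucket = self._namespaces[namespace]
        for vector_id in ids or []:
            bucket.pop(vector_id, None)
        if not bucket:
            del self._namespaces[namespace]  # an empty namespace disappears

    def counts(self):
        return {name: len(bucket) for name, bucket in self._namespaces.items()}
```

In this model, an ID stored under one namespace cannot be fetched or deleted through another, and omitting the `namespace` argument always means the default namespace.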

In [21]:
vectors = [[random.random() for _ in range(1536)] for _ in range(3)]  # 3 vectors of dimension 1536
ids = list('xyz')  # 3 unique IDs
index.upsert(vectors=list(zip(ids, vectors)), namespace="test-namespace")

{'upserted_count': 3}

In [32]:
vectors = [[random.random() for _ in range(1536)] for _ in range(2)]  # 2 vectors of dimension 1536
ids = list('wv')  # 2 unique IDs
index.upsert(vectors=list(zip(ids, vectors)), namespace="test-namespace-2")

{'upserted_count': 2}

In [24]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {'': {'vector_count': 5},
                'test-namespace': {'vector_count': 3},
                'test-namespace-2': {'vector_count': 2}},
 'total_vector_count': 10,
 'vector_type': 'dense'}

In [25]:
# Returns nothing: without a namespace argument, fetch looks only in the default namespace
index.fetch(ids=['x', 'w'])

FetchResponse(namespace='', vectors={}, usage={'read_units': 1})

In [26]:
# This will work because we specify the namespace
index.fetch(ids=['x'], namespace="test-namespace")

FetchResponse(namespace='test-namespace', vectors={'x': Vector(id='x', values=[0.302782059, 0.0229782499, 0.0920657888, 0.164872706, 0.969619751, 0.192568123, 0.18780975, 0.872923613, 0.31419012, 0.971828222, 0.127541482, 0.282759309, 0.884009421, 0.734989941, 0.0820429549, 0.855031252, 0.943318069, 0.518241346, 0.12030521, 0.722625315, 0.736760378, 0.151344463, 0.159780368, 0.046338845, 0.577729344, 0.286768913, 0.0496764518, 0.873149931, 0.00705067255, 0.325902015, 0.995396435, 0.0904139504, 0.0126104169, 0.980078459, 0.381761074, 0.280359834, 0.538507, 0.813257337, 0.481419712, 0.359915912, 0.438147098, 0.731691658, 0.5145244, 0.629424214, 0.341347158, 0.429759383, 0.428638428, 0.919464, 0.16889967, 0.34036231, 0.783960342, 0.61053884, 0.167912588, 0.165352046, 0.0050615971, 0.853727698, 0.956301689, 0.45242852, 0.721725941, 0.818641365, 0.987665594, 0.044940453, 0.609688342, 0.299595237, 0.25013116, 0.432191521, 0.955679595, 0.0513980351, 0.662623644, 0.081861, 0.793734, 0.90854787

In [28]:
# This also applies when deleting vectors
index.delete(ids=['x'], namespace="test-namespace")

{}

In [33]:
# To delete all vectors in a namespace and the namespace itself
index.delete(delete_all=True, namespace="test-namespace-2")

{}

In [34]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {'': {'vector_count': 5}, 'test-namespace': {'vector_count': 2}},
 'total_vector_count': 7,
 'vector_type': 'dense'}