In [1]:
import os
from dotenv import load_dotenv
load_dotenv()

if os.environ.get('OPENAI_API_KEY'):  # .get() avoids a KeyError when the key is missing
    print("API Key is set.")

API Key is set.


In [2]:
from langchain_openai import ChatOpenAI


In [3]:
llm = ChatOpenAI(model="gpt-5-nano", temperature=0)

## **RAG IMPLEMENTATION WITH PDFs**

#### **STEP 1: Extracting Text from PDFs**

In [4]:
from langchain_community.document_loaders import PyPDFLoader


loader = PyPDFLoader("./Docs/fabric-onelake.pdf")

docs = loader.load()

docs

[Document(metadata={'producer': 'Microsoft Learn PDF 1.0.25309.01', 'creator': 'Microsoft Learn', 'creationdate': '2025-12-12T17:01:10+00:00', 'title': 'fabric onelake | Microsoft Learn', 'moddate': '2025-12-12T17:01:10+00:00', 'source': './Docs/fabric-onelake.pdf', 'total_pages': 382, 'page': 0, 'page_label': '1'}, page_content='Tell us about your PDF experience.\nOneLake in Microsoft Fabric\ndocumentation\nOneLake is a single, unified, logical data lake for the whole organization. OneLake comes\nautomatically with every Microsoft Fabric tenant with no infrastructure to manage.\nAbout OneLake\nｅOVERVIEW\nWhat is OneLake?\nOneLake security\nOneLake catalog\nOneLake access and APIs\n｀DEPLOY\nImplement medallion lakehouse architecture\nｂGET STARTED\nCreate a lakehouse with OneLake\nOneLake file explorer\nFind data in the OneLake catalog\nUse Iceberg tables in OneLake\nOneLake shortcuts\nｐCONCEPT\nWhat are shortcuts?\nｂGET STARTED\nCreate a shortcut\nｃHOW-TO GUIDE'),
 Document(metadata={'

### **Creating Custom Metadata for PDF Chunks**

In [5]:
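# Replace the loader's metadata on every page with a compact custom dict.
# Note: this discards loader fields such as 'page' and 'total_pages'.
for doc in docs:
    doc.metadata = {"source": "fabric-onelake.pdf",
                    "developer": "Microsoft"}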

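If page numbers are needed later (for example, to cite where an answer came from), the custom fields can be merged with selected loader fields instead of replacing them. This is a minimal sketch: plain dicts stand in for `Document.metadata`, and `merge_metadata` and its `keep` parameter are hypothetical names, not part of LangChain.

```python
def merge_metadata(original, custom, keep=("page", "page_label")):
    """Return the custom metadata plus selected fields from the loader's metadata."""
    kept = {key: original[key] for key in keep if key in original}
    # Custom values win if a key appears in both dicts.
    return {**kept, **custom}

# Stand-in for the metadata PyPDFLoader attached to the first page.
loader_metadata = {
    "source": "./Docs/fabric-onelake.pdf",
    "total_pages": 382,
    "page": 0,
    "page_label": "1",
}
custom = {"source": "fabric-onelake.pdf", "developer": "Microsoft"}

merged = merge_metadata(loader_metadata, custom)
print(merged)
```

Inside the notebook loop this would read `doc.metadata = merge_metadata(doc.metadata, custom)`.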

#### **STEP 2: Splitting the Document into CHUNKS**

In [7]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100)

chunks = splitter.split_documents(docs)
chunks

[Document(metadata={'source': 'fabric-onelake.pdf', 'developer': 'Microsoft'}, page_content='Tell us about your PDF experience.\nOneLake in Microsoft Fabric\ndocumentation\nOneLake is a single, unified, logical data lake for the whole organization. OneLake comes\nautomatically with every Microsoft Fabric tenant with no infrastructure to manage.\nAbout OneLake\nｅOVERVIEW\nWhat is OneLake?\nOneLake security\nOneLake catalog\nOneLake access and APIs\n｀DEPLOY\nImplement medallion lakehouse architecture\nｂGET STARTED\nCreate a lakehouse with OneLake\nOneLake file explorer\nFind data in the OneLake catalog\nUse Iceberg tables in OneLake\nOneLake shortcuts\nｐCONCEPT\nWhat are shortcuts?\nｂGET STARTED\nCreate a shortcut\nｃHOW-TO GUIDE'),
 Document(metadata={'source': 'fabric-onelake.pdf', 'developer': 'Microsoft'}, page_content='Access shortcuts\nOneLake and Azure integration\nｃHOW-TO GUIDE\nIntegrate OneLake with Azure Databricks\nIntegrate OneLake with Azure HDInsight\nIntegrate OneLake with

In [8]:
chunks[0].metadata

{'source': 'fabric-onelake.pdf', 'developer': 'Microsoft'}
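
To see how `chunk_size` and `chunk_overlap` interact, here is a deliberately simplified fixed-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to break on separators such as paragraphs and newlines); `sliding_chunks` is a hypothetical helper that only illustrates the overlap arithmetic.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_chunks(text, chunk_size=10, chunk_overlap=3)
for piece in pieces:
    print(piece)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last three characters of one piece reappear at the start of the next. The overlap reduces the chance that a sentence split across a boundary is lost to retrieval.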

#### **STEP 3: Creating Embeddings for the Chunks**

In [9]:
from langchain_openai import OpenAIEmbeddings

embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")

#### **STEP 4: Create and Store Embeddings in Vector Store**

In [10]:
from langchain_community.vectorstores import Chroma

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model
)

#### **STEP 5: Semantic Search**

In [11]:
vectorstore.similarity_search("Why do we need OneLake?", k=3)

[Document(metadata={'source': 'fabric-onelake.pdf', 'developer': 'Microsoft'}, page_content='OneLake, the OneDrive for data\nArticle• 07/25/2024\nOneLake is a single, unified, logical data lake for your whole organization. A data Lake\nprocesses large volumes of data from various sources. Like OneDrive, OneLake comes\nautomatically with every Microsoft Fabric tenant and is designed to be the single place\nfor all your analytics data. OneLake brings customers:\nOne data lake for the entire organization\nOne copy of data for use with multiple analytical engines\nBefore OneLake, it was easier for customers to create multiple lakes for different\nbusiness groups rather than collaborating on a single lake, even with the extra overhead\nof managing multiple resources. OneLake focuses on removing these challenges by\nimproving collaboration. Every customer tenant has exactly one OneLake. There can\nnever be more than one and if you have Fabric, there can never be zero. Every Fabric\ntenant au
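
Conceptually, a similarity search embeds the query and ranks the stored vectors by closeness to it. The sketch below shows that ranking with cosine similarity on hand-made 3-dimensional vectors. The vectors, texts, and the `top_k` helper are invented for illustration; Chroma's actual index and default distance metric may differ.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k):
    """Return the texts of the k stored vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Toy "vector store": (text, embedding) pairs with made-up embeddings.
store = [
    ("OneLake overview", [0.9, 0.1, 0.0]),
    ("Shortcut how-to", [0.1, 0.9, 0.0]),
    ("Security settings", [0.0, 0.2, 0.9]),
]
query_vec = [0.8, 0.2, 0.1]  # made-up embedding of the query

results = top_k(query_vec, store, k=2)
print(results)
```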

#### **Talk to the LLM**

In [19]:
context = vectorstore.similarity_search("What is AI?", k=3)
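
A similarity search always returns the `k` nearest chunks, whether or not any of them is actually relevant to the question. One hedged way to guard against that is to drop hits whose distance exceeds a cutoff before building the prompt. The sketch below assumes hits arrive as `(text, distance)` pairs where a lower distance means a closer match; the sample distances, the `0.5` cutoff, and `filter_hits` are all hypothetical.

```python
def filter_hits(hits, max_distance):
    """Keep only hits at or below max_distance (lower distance = closer match)."""
    return [text for text, distance in hits if distance <= max_distance]

# Made-up retrieval results: (chunk text, distance to the query).
hits = [
    ("chunk about topic A", 0.31),
    ("chunk about topic B", 0.47),
    ("loosely related chunk", 0.92),
]

relevant = filter_hits(hits, max_distance=0.5)
if relevant:
    context_text = "\n\n".join(relevant)
else:
    context_text = "(no relevant context found)"
print(context_text)
```

With Chroma, `vectorstore.similarity_search_with_score(query, k=3)` returns `(Document, score)` pairs that could feed such a filter. A sensible cutoff depends on the embedding model and distance metric, so it is best chosen by inspecting scores for known good and bad queries.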

In [22]:
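# Pass only the chunk text to the model, not the repr of the Document objects.
context_text = "\n\n".join(doc.page_content for doc in context)
response = llm.invoke(f"What is AI? You can answer using the following context:\n{context_text}")
print(response.content)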

AI, or artificial intelligence, is the capability of computational systems to perform tasks that are typically associated with human intelligence. This includes learning, reasoning, problem-solving, perception, and decision-making. It’s a field of computer science that studies and develops methods and software that let machines perceive their environment and use learning and intelligence to take actions aimed at achieving defined goals.

Key points:
- AI spans technologies like neural networks, statistics, operations research, and economics, and it draws on insights from psychology, linguistics, philosophy, neuroscience, and more.
- Some researchers aim for artificial general intelligence (AGI): AI that can perform virtually any cognitive task as well as a human.

Common applications:
- Web search engines (e.g., Google Search)
- Recommendation systems (YouTube, Amazon, Netflix)
- Virtual assistants (Siri, Google Assistant, Alexa)
- Autonomous vehicles (Waymo)
- Generative and creative 