In [None]:
# Document Loaders

In [None]:
# Example 1 - Text Doc loader

In [4]:
!pip install -U langchain langchain-community pypdf beautifulsoup4

Collecting langchain-community
  Downloading langchain_community-0.3.27-py3-none-any.whl.metadata (2.9 kB)
Collecting pypdf
  Downloading pypdf-5.9.0-py3-none-any.whl.metadata (7.1 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain-community)
  Downloading pydantic_settings-2.10.1-py3-none-any.whl.metadata (3.4 kB)
Collecting httpx-sse<1.0.0,>=0.4.0 (from langchain-community)
  Downloading httpx_sse-0.4.1-py3-none-any.whl.metadata (9.4 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading marshmallow-3.26.1-py3-none-any.whl.metadata (7.3 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB)
Collecting python-dotenv>=0.21.0 (from pydantic-settings<3.0.0,>=2.4.0->langchai

In [9]:
from langchain_community.document_loaders import TextLoader

# utf-8-sig drops a leading byte-order mark (BOM) if the file has one
loader = TextLoader("/content/sample.txt", encoding="utf-8-sig")  # Make sure the file is uploaded
docs = loader.load()

print(docs[0].page_content)  # Show first document's content


Detailed Learning Guide: AI, ML, Deep Learning, GenAI, Prompt Engineering, Copilot, LangChain, Agentic AI, and RAG
1. Quick Overview: AI, ML, Deep Learning, and Generative AI
1.1 Artificial Intelligence (AI)
·       Definition: AI is the overarching concept of machines that can perform tasks that typically require human intelligence, such as reasoning, decision-making, and perception.
·       Types: Rule-based systems, expert systems, search algorithms, and adaptive software.
·       Key Attributes: Problem-solving, logical deduction, language understanding, and planning.
1.2 Machine Learning (ML)
·       Definition: A subset of AI where algorithms learn from data to make predictions or decisions without explicit programming.
·       Categories:
o   Supervised Learning: Learning from labeled data (e.g., spam detection).
o   Unsupervised Learning: Finding patterns in unlabeled data (e.g., clustering).
o   Reinforcement Learning: Agents learn by interacting with the environment and re
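
Text files saved by some editors start with a byte-order mark (BOM). Decoded as plain UTF-8, the BOM survives as an invisible U+FEFF character at the start of `page_content`. The sketch below uses only the standard library to show the difference between the `utf-8` and `utf-8-sig` codecs; the sample bytes are made up for illustration and are not read from `sample.txt`.

```python
# Hypothetical sample: the bytes of a small UTF-8 text file that starts with a BOM.
raw = b"\xef\xbb\xbfDetailed Learning Guide"

plain = raw.decode("utf-8")      # keeps the BOM as a leading U+FEFF character
clean = raw.decode("utf-8-sig")  # strips a leading BOM if one is present

print(plain.startswith("\ufeff"))  # True: the BOM is still in the text
print(clean)                       # Detailed Learning Guide

# utf-8-sig is also safe for files that have no BOM.
no_bom = b"no bom here".decode("utf-8-sig")
print(no_bom)
```

`TextLoader` takes an `encoding` argument, so passing `encoding="utf-8-sig"` is one way to apply the same idea when loading the file.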

In [10]:
# Example 2 - Text loader: page content and metadata

In [11]:
from langchain_community.document_loaders import TextLoader

# utf-8-sig drops a leading byte-order mark (BOM) if the file has one
loader = TextLoader("/content/sample.txt", encoding="utf-8-sig")  # Upload your .txt file to /content
docs = loader.load()

print("📄 Page Content:\n", docs[0].page_content)
print("🗂️ Metadata:\n", docs[0].metadata)


📄 Page Content:
Detailed Learning Guide: AI, ML, Deep Learning, GenAI, Prompt Engineering, Copilot, LangChain, Agentic AI, and RAG
1. Quick Overview: AI, ML, Deep Learning, and Generative AI
1.1 Artificial Intelligence (AI)
·       Definition: AI is the overarching concept of machines that can perform tasks that typically require human intelligence, such as reasoning, decision-making, and perception.
·       Types: Rule-based systems, expert systems, search algorithms, and adaptive software.
·       Key Attributes: Problem-solving, logical deduction, language understanding, and planning.
1.2 Machine Learning (ML)
·       Definition: A subset of AI where algorithms learn from data to make predictions or decisions without explicit programming.
·       Categories:
o   Supervised Learning: Learning from labeled data (e.g., spam detection).
o   Unsupervised Learning: Finding patterns in unlabeled data (e.g., clustering).
o   Reinforcement Learning: Agents learn by interacting with the e

In [12]:
# CSV loader

In [15]:
from langchain_community.document_loaders import CSVLoader

loader = CSVLoader(file_path="/content/data.csv")  # Upload your .csv
docs = loader.load()

print("📄 First Row Content:\n", docs[0].page_content)
print("🗂️ Metadata:\n", docs[0].metadata)


📄 First Row Content:
 SO43701: SO43704
1: 1
2019-07-01: 2019-07-01
Christy Zhu: Julio Ruiz
christy12@adventure-works.com: julio1@adventure-works.com
Mountain-100 Silver, 44: Mountain-100 Black, 48
3399.99: 3374.99
271.9992: 269.9992
🗂️ Metadata:
 {'source': '/content/data.csv', 'row': 0}
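
In the output above, the keys (`SO43701`, `Christy Zhu`, ...) are themselves data values. This suggests `data.csv` has no header row, so its first record was used as the column names. `CSVLoader` reads rows with `csv.DictReader`, and the standard-library sketch below reproduces the effect. The sample rows are shortened from the output, and the field names `order_id`, `line_no`, and `customer` are hypothetical labels, not the file's real schema.

```python
import csv
import io

# Hypothetical two-row file with no header line.
raw = "SO43701,1,Christy Zhu\nSO43704,1,Julio Ruiz\n"

# Default behaviour: the first line is consumed as field names, so one record is lost.
rows_default = list(csv.DictReader(io.StringIO(raw)))

# Explicit field names (hypothetical labels): every line is treated as data.
fieldnames = ["order_id", "line_no", "customer"]
rows_named = list(csv.DictReader(io.StringIO(raw), fieldnames=fieldnames))

print(len(rows_default), rows_default[0])
print(len(rows_named), rows_named[0])
```

With the loader, the equivalent would likely be `CSVLoader(file_path="/content/data.csv", csv_args={"fieldnames": [...]})`, since `csv_args` is forwarded to `csv.DictReader`. The list must contain the file's actual column names.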


In [16]:
# PDF loader

In [17]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("/content/example.pdf")  # Upload your .pdf
pages = loader.load()

print("📄 Page 1 Content:\n", pages[0].page_content[:500])
print("🗂️ Metadata:\n", pages[0].metadata)


📄 Page 1 Content:
 
🗂️ Metadata:
 {'producer': 'Adobe PDF Library 17.0', 'creator': 'Adobe InDesign 19.3 (Windows)', 'creationdate': '2024-04-15T20:37:40+01:00', 'moddate': '2024-04-15T20:37:52+01:00', 'trapped': '/False', 'source': '/content/example.pdf', 'total_pages': 118, 'page': 0, 'page_label': '1'}
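
The "Page 1 Content" above is empty. One possible cause is a cover page that consists of an image with no extractable text, though the notebook does not confirm this. A small helper can skip such pages. The sketch assumes only that each page object has a `page_content` string, as the loader's documents do; `first_page_with_text` and the stand-in pages are illustrative names, not part of LangChain.

```python
from types import SimpleNamespace


def first_page_with_text(pages, min_chars=1):
    """Return (index, page) for the first page whose stripped text has
    at least min_chars characters, or None if no page qualifies."""
    for index, page in enumerate(pages):
        if len(page.page_content.strip()) >= min_chars:
            return index, page
    return None


# Stand-in objects for illustration; with the loader, pass the list from loader.load().
pages_demo = [
    SimpleNamespace(page_content=" ", metadata={"page": 0}),
    SimpleNamespace(page_content="Chapter 1", metadata={"page": 1}),
]

found = first_page_with_text(pages_demo)
print(found[0], found[1].page_content)
```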


In [14]:
# Web page loader

In [18]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://docs.databricks.com/aws/en/machine-learning/")
docs = loader.load()

print("📄 Web Page Content:\n", docs[0].page_content[:500])
print("🗂️ Metadata:\n", docs[0].metadata)




📄 Web Page Content:
 AI and machine learning on Databricks | Databricks Documentation
Skip to main contentGet startedDevelopersReferenceRelease notesResourcesSupportKnowledge BaseCommunityTrainingFeedbackEnglishEnglish日本語PortuguêsAWSAzureGCPSAPTry DatabricksAI and machine learningOn this pageAI and machine learning on Databricks
Build, deploy, and manage AI and machine learning applications with Mosaic AI, an integrated platform that unifies the entire AI lifecycle from data preparation to production monitoring.
For
🗂️ Metadata:
 {'source': 'https://docs.databricks.com/aws/en/machine-learning/', 'title': 'AI and machine learning on Databricks | Databricks Documentation', 'description': 'Build AI and machine learning applications on Databricks using unified data and ML platform capabilities.', 'language': 'en'}
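
The preview above runs navigation labels together ("Skip to main contentGet startedDevelopers..."). The page's site chrome is included, and adjacent text nodes are joined without a separator. The sketch below uses only the standard library to show one way to skip chrome-like tags and join text with spaces. It is not how `WebBaseLoader` works internally (the loader relies on BeautifulSoup and accepts BeautifulSoup options such as `bs_kwargs`). The class name, the tag list, and the sample HTML are all assumptions for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping tags that usually hold site chrome."""

    SKIP_TAGS = {"script", "style", "nav", "header", "footer"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.parts.append(text)


def html_to_text(html):
    """Return the visible text of html, with text nodes joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Made-up sample page for illustration.
sample_html = (
    "<nav><a>Get started</a><a>Developers</a></nav>"
    "<h1>AI and machine learning</h1><p>Build and deploy models.</p>"
)
cleaned = html_to_text(sample_html)
print(cleaned)
```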


In [19]:
# Now let's use the document loaders with an LLM to extract information from a knowledge base

In [21]:
# Example 1 - Text Loader

In [24]:
!pip install -U langchain langchain-openai langchain-community pypdf beautifulsoup4 striprtf


Collecting langchain-openai
  Downloading langchain_openai-0.3.28-py3-none-any.whl.metadata (2.3 kB)
Collecting striprtf
  Downloading striprtf-0.0.29-py3-none-any.whl.metadata (2.3 kB)
Downloading langchain_openai-0.3.28-py3-none-any.whl (70 kB)
Downloading striprtf-0.0.29-py3-none-any.whl (7.9 kB)
Installing collected packages: striprtf, langchain-openai
Successfully installed langchain-openai-0.3.28 striprtf-0.0.29


In [22]:
from google.colab import userdata
import os

os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")


In [26]:
# Load the document and set up the model and prompt

In [25]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.document_loaders import TextLoader

# Load file
loader = TextLoader("/content/sample.txt", encoding="utf-8-sig")  # utf-8-sig drops a leading BOM if present
docs = loader.load()
document_text = docs[0].page_content

# Setup GPT model
llm = ChatOpenAI(model="gpt-4o", temperature=0.2)

# Setup prompt template
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant that summarizes documents."),
    ("human", "Summarize the following content:\n\n{input_text}")
])




In [27]:
# generate summary

In [28]:
chain = prompt | llm

response = chain.invoke({"input_text": document_text})

print("📄 Summary:\n", response.content)


📄 Summary:
 The document is a detailed learning guide covering various AI-related topics, including AI, ML, Deep Learning, Generative AI, Prompt Engineering, Copilot, LangChain, Agentic AI, and Retrieval-Augmented Generation (RAG). Here's a summary of each section:

1. **AI, ML, Deep Learning, and Generative AI**:
   - **AI**: Machines performing tasks requiring human intelligence, including reasoning and decision-making.
   - **ML**: AI subset where algorithms learn from data to make predictions or decisions. Includes supervised, unsupervised, and reinforcement learning.
   - **Deep Learning**: A branch of ML using neural networks with many layers to process complex data.
   - **Generative AI**: Models that create new content mimicking human-generated data, using techniques like GANs and VAEs.

2. **Prompt Engineering**:
   - Involves crafting structured inputs (prompts) for generative models to elicit specific responses.
   - Techniques include zero-shot, few-shot, chain-of-thought, 

In [29]:
# Example 2 - PDF Loader

In [30]:

# load doc


In [39]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.document_loaders import PyPDFLoader

# Load file
loader = PyPDFLoader("/content/example.pdf")
docs = loader.load()
# Summarize a single page: index 10 is the 11th page of the PDF
document_text = docs[10].page_content

# Setup GPT model
llm = ChatOpenAI(model="gpt-4o", temperature=0.2)

# Setup prompt template
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant that summarizes documents."),
    ("human", "Summarize the following content:\n\n{input_text}")
])


In [40]:
# generate summary

In [41]:
chain = prompt | llm

response = chain.invoke({"input_text": document_text})

print("📄 Summary:\n", response.content)


📄 Summary:
 The document provides a comparison of various AI models, including DBRX, Instruct GPT-3.57, GPT-48, Claude 3, Gemini 1.0 Pro, Gemini 1.5 Pro, Mistral Medium, and Mistral Large, across different benchmarks. The benchmarks include MT Bench, MMLU 5-shot, HellaSwag 10-shot, HumanEval 0-shot, GSM8k CoT maj@1, and WinoGrande 5-shot.

- **MT Bench**: Scores range from 8.05 to 9.03, with Mistral Large scoring the highest.
- **MMLU 5-shot**: Scores range from 70.0% to 86.8%, with Claude 3 achieving the highest score.
- **HellaSwag 10-shot**: Scores range from 84.7% to 95.4%, with Claude 3 again scoring the highest.
- **HumanEval 0-shot (Programming)**: Scores range from 38.4% to 84.9%, with Gemini 1.5 Pro scoring the highest.
- **GSM8k CoT maj@1 (5-shot)**: Scores range from 57.1% to 95.0%, with Claude 3 scoring the highest.
- **WinoGrande 5-shot**: Scores range from 81.6% to 88.0%, with Mistral Large scoring the highest.

The table indicates the quality of DBRX Instruct and other l

In [42]:
# Example 3 - CSV Loader

In [45]:
# load doc & generate summary

In [46]:
from langchain_community.document_loaders import CSVLoader
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

loader = CSVLoader(file_path="/content/data.csv")  # Upload file
docs = loader.load()
csv_text = "\n".join([doc.page_content for doc in docs])

llm = ChatOpenAI(model="gpt-4o", temperature=0.3)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant that summarizes CSV data."),
    ("human", "Summarize the key insights from the following CSV data:\n\n{input_text}")
])

chain = prompt | llm
response = chain.invoke({"input_text": csv_text})
print("📊 CSV Summary:\n", response.content)


📊 CSV Summary:
 The CSV data contains sales transactions for various customers, all handled by Christy Zhu. The key insights from the data are as follows:

1. **Product Details**: The primary product sold is the "Mountain-100 Silver, 44", with a consistent price of $3399.99 and a tax amount of $271.9992. Other products mentioned include "Road-150 Red", "Mountain-100 Black", and "Road-650 Black/Red", with varying sizes and prices.

2. **Sales Dates**: The transactions span from July 1, 2019, to December 31, 2019. Each transaction is recorded with a specific sales order number (e.g., SO43701 to SO45265).

3. **Customer Information**: Each transaction involves a different customer, identified by their name and email address. This suggests a diverse customer base with no repeat customers within the dataset.

4. **Product Variations**: While the "Mountain-100 Silver, 44" is the most common product, there are variations in color and size for other products, such as "Road-150 Red" and "Mounta
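
The cell above joins every CSV row into a single prompt. The prompt therefore grows with the file and may eventually exceed the model's context window. One hedged alternative is to group rows into batches under a size budget and summarize each batch separately. The sketch below counts characters as a rough proxy for tokens; `batch_by_chars` and the `max_chars` value are illustrative, not part of LangChain.

```python
def batch_by_chars(lines, max_chars):
    """Group lines into newline-joined batches of at most max_chars characters.

    A single line longer than max_chars becomes its own oversized batch.
    """
    batches, current, size = [], [], 0
    for line in lines:
        extra = len(line) + (1 if current else 0)  # +1 for the joining newline
        if current and size + extra > max_chars:
            # The current batch is full: emit it and start a new one.
            batches.append("\n".join(current))
            current, size = [], 0
            extra = len(line)
        current.append(line)
        size += extra
    if current:
        batches.append("\n".join(current))
    return batches


# Made-up rows; with the loader, pass [doc.page_content for doc in docs].
rows = ["order: A1", "order: A2", "order: A3"]
batches = batch_by_chars(rows, max_chars=20)
print(batches)
```

Each batch could then be passed to `chain.invoke({"input_text": batch})`, and the partial summaries combined in a final call.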

In [None]:
# Example 4 - Web Page Loader

In [48]:
# Load the web page and generate a summary

In [47]:
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

url = "https://docs.databricks.com/aws/en/machine-learning/"
loader = WebBaseLoader(url)
docs = loader.load()
web_text = docs[0].page_content

llm = ChatOpenAI(model="gpt-4o", temperature=0.3)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant that summarizes web content."),
    ("human", "Summarize the following webpage content:\n\n{input_text}")
])

chain = prompt | llm
response = chain.invoke({"input_text": web_text})
print("🌐 Web Summary:\n", response.content)


🌐 Web Summary:
 The Databricks documentation on AI and machine learning provides an overview of how to build, deploy, and manage AI and machine learning applications using Mosaic AI, an integrated platform that supports the entire AI lifecycle from data preparation to production monitoring.

Key features and tools include:

1. **Generative AI Applications**: Develop enterprise-grade generative AI applications such as fine-tuned large language models (LLMs) and AI agents. Tools like AI Playground, Agent Bricks, and the Mosaic AI Agent Framework facilitate prototyping, building, and deploying these applications.

2. **Classic Machine Learning Models**: Use automated tools and collaborative environments to create machine learning models. Features include AutoML for model building, Databricks Runtime for ML with pre-configured clusters, and MLflow for tracking and managing the model lifecycle.

3. **Deep Learning Models**: Develop deep learning models using built-in frameworks like PyTorch

In [49]:
# End of the examples.