# Load Documents Using LangChain for Different Sources


In [7]:
pip install jq



In [8]:
!pip install --user "langchain-community==0.2.1"




In [9]:
pip install pypdf



In [10]:
pip install pymupdf



In [11]:
pip install unstructured



In [12]:

%%capture

!pip install --user "markdown"


In [13]:

!pip install --user "docx2txt==0.8"
!pip install --user "beautifulsoup4==4.12.3"


Collecting docx2txt==0.8
  Downloading docx2txt-0.8.tar.gz (2.8 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: docx2txt
  Building wheel for docx2txt (setup.py) ... done
  Created wheel for docx2txt: filename=docx2txt-0.8-py3-none-any.whl size=3960 sha256=07d897a29d05dd1966d424fde59d200384c29c6fc6a89ea7053644200b9e8019
  Stored in directory: /root/.cache/pip/wheels/0f/0e/7a/3094a4ceefe657bff7e12dd9592a9d5b6487ef4338ace0afa6
Successfully built docx2txt
Installing collected packages: docx2txt
Successfully installed docx2txt-0.8
Collecting beautifulsoup4==4.12.3
  Downloading beautifulsoup4-4.12.3-py3-none-any.whl.metadata (3.8 kB)
Downloading beautifulsoup4-4.12.3-py3-none-any.whl (147 kB)
Installing collected packages: beautifulsoup4
Successfully installed beautifulsoup4-4.12.3


Each client provides their data in different formats: some in PDFs, others in Word documents, CSV files, or even HTML webpages. Manually loading and parsing each document type is not only time-consuming but also prone to errors. Your goal is to streamline this process, making it efficient and error-free.

To achieve this, you'll use LangChain’s powerful document loaders. These loaders allow you to read and convert various file formats into a unified document structure that can be easily processed. For example, you'll load client policy documents from text files, financial reports from PDFs, marketing strategies from Word documents, and product reviews from JSON files. By the end of this lab, you will have a robust pipeline that can handle any new file formats clients might send, saving you valuable time and effort.

 - Understand how to use `TextLoader` to load text files.
 - Learn how to load PDFs using `PyPDFLoader` and `PyMuPDFLoader`.
 - Use `UnstructuredMarkdownLoader` to load Markdown files.
 - Load JSON files with `JSONLoader` using jq schemas.
 - Process CSV files with `CSVLoader` and `UnstructuredCSVLoader`.
 - Load webpage content using `WebBaseLoader`.
 - Load Word documents using `Docx2txtLoader`.
 - Utilize `UnstructuredFileLoader` for various file types.



In [1]:
# You can also use this section to suppress warnings generated by your code:

def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

from pprint import pprint
import json
from pathlib import Path
import nltk
from langchain_community.document_loaders import TextLoader
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.document_loaders import UnstructuredMarkdownLoader
from langchain_community.document_loaders import JSONLoader
from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain_community.document_loaders.csv_loader import UnstructuredCSVLoader
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.document_loaders import Docx2txtLoader
from langchain_community.document_loaders import UnstructuredFileLoader

nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

In [2]:
!wget "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/Ec5f3KYU1CpbKRp1whFLZw/new-Policies.txt"

--2025-05-25 22:11:32--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/Ec5f3KYU1CpbKRp1whFLZw/new-Policies.txt
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.45.118.108
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.45.118.108|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6363 (6.2K) [text/plain]
Saving to: ‘new-Policies.txt.1’


2025-05-25 22:11:32 (2.29 GB/s) - ‘new-Policies.txt.1’ saved [6363/6363]



Next, we will use the `TextLoader` class to load the file.


In [3]:
loader = TextLoader("new-Policies.txt")
loader

<langchain_community.document_loaders.text.TextLoader at 0x7bdfa95f6690>

In [4]:
data = loader.load()

Let's display the loaded data.

`loader.load()` returns a list of `Document` objects. Each `Document` has a `page_content` attribute (the text) and a `metadata` attribute (here, the source file name).


In [5]:
data

[Document(metadata={'source': 'new-Policies.txt'}, page_content="1. Code of Conduct\n\nOur Code of Conduct establishes the core values and ethical standards that all members of our organization must adhere to. We are committed to fostering a workplace characterized by integrity, respect, and accountability.\n\nIntegrity: We commit to the highest ethical standards by being honest and transparent in all our dealings, whether with colleagues, clients, or the community. We protect sensitive information and avoid conflicts of interest.\n\nRespect: We value diversity and every individual's contribution. Discrimination, harassment, or any form of disrespect is not tolerated. We promote an inclusive environment where differences are respected, and everyone is treated with dignity.\n\nAccountability: We are responsible for our actions and decisions, complying with all relevant laws and regulations. We aim for continuous improvement and report any breaches of this code, supporting investigations

In [6]:
pprint(data[0].page_content[:1000])

('1. Code of Conduct\n'
 '\n'
 'Our Code of Conduct establishes the core values and ethical standards that '
 'all members of our organization must adhere to. We are committed to '
 'fostering a workplace characterized by integrity, respect, and '
 'accountability.\n'
 '\n'
 'Integrity: We commit to the highest ethical standards by being honest and '
 'transparent in all our dealings, whether with colleagues, clients, or the '
 'community. We protect sensitive information and avoid conflicts of '
 'interest.\n'
 '\n'
 "Respect: We value diversity and every individual's contribution. "
 'Discrimination, harassment, or any form of disrespect is not tolerated. We '
 'promote an inclusive environment where differences are respected, and '
 'everyone is treated with dignity.\n'
 '\n'
 'Accountability: We are responsible for our actions and decisions, complying '
 'with all relevant laws and regulations. We aim for continuous improvement '
 'and report any breaches of this code, supporting i
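
To make the shape of the loader output concrete, here is a minimal sketch of a helper that summarizes a list of documents. The `summarize` function and the `SimpleNamespace` stand-in are illustrative and not part of LangChain; the helper only assumes that each item exposes `page_content` and `metadata`, as the `Document` shown above does.

```python
from types import SimpleNamespace


def summarize(docs, preview_chars=40):
    """Return one small summary dict per document-like object."""
    summaries = []
    for doc in docs:
        summaries.append({
            "source": doc.metadata.get("source"),
            "n_chars": len(doc.page_content),
            "preview": doc.page_content[:preview_chars],
        })
    return summaries


# Stand-in for the list returned by loader.load(); illustrative only.
example_docs = [
    SimpleNamespace(
        page_content="1. Code of Conduct",
        metadata={"source": "new-Policies.txt"},
    )
]

summary = summarize(example_docs)
print(summary[0]["source"], summary[0]["n_chars"])
```

In the notebook, the corresponding call would be `summarize(data)` on the list returned by `loader.load()`.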

### Load from PDF files

Sometimes, we may have files in PDF format that we want to load for processing.

LangChain provides several classes for loading PDFs. Here, we introduce two classes: `PyPDFLoader` and `PyMuPDFLoader`.

#### PyPDFLoader

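Load the PDF using `PyPDFLoader` into a list of documents. `loader.load()` returns one document per page, with the page number stored in the metadata. `load_and_split()`, used below, additionally passes the pages through a text splitter, so a long page may come back as more than one document.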


In [7]:
pdf_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/Q81D33CdRLK6LswuQrANQQ/instructlab.pdf"

loader = PyPDFLoader(pdf_url)

pages = loader.load_and_split()

In [8]:
print(pages[0])

page_content='LAB: L ARGE -SCALE ALIGNMENT FOR CHATBOTS
MIT-IBM Watson AI Lab and IBM Research
Shivchander Sudalairaj∗
Abhishek Bhandwaldar∗
Aldo Pareja∗
Kai Xu
David D. Cox
Akash Srivastava∗,†
*Equal Contribution, †Corresponding Author
ABSTRACT
This work introduces LAB (Large-scale Alignment for chatBots), a novel method-
ology designed to overcome the scalability challenges in the instruction-tuning
phase of large language model (LLM) training. Leveraging a taxonomy-guided
synthetic data generation process and a multi-phase tuning framework, LAB sig-
nificantly reduces reliance on expensive human annotations and proprietary mod-
els like GPT-4. We demonstrate that LAB-trained models can achieve compet-
itive performance across several benchmarks compared to models trained with
traditional human-annotated or GPT-4 generated synthetic data. Thus offering a
scalable, cost-effective solution for enhancing LLM capabilities and instruction-
following behaviors without the drawbacks of cata

In [9]:
# Display the first three documents
for p, page in enumerate(pages[:3]):
    print(f"page number {p+1}")
    print(page)

page number 1
page_content='LAB: L ARGE -SCALE ALIGNMENT FOR CHATBOTS
MIT-IBM Watson AI Lab and IBM Research
Shivchander Sudalairaj∗
Abhishek Bhandwaldar∗
Aldo Pareja∗
Kai Xu
David D. Cox
Akash Srivastava∗,†
*Equal Contribution, †Corresponding Author
ABSTRACT
This work introduces LAB (Large-scale Alignment for chatBots), a novel method-
ology designed to overcome the scalability challenges in the instruction-tuning
phase of large language model (LLM) training. Leveraging a taxonomy-guided
synthetic data generation process and a multi-phase tuning framework, LAB sig-
nificantly reduces reliance on expensive human annotations and proprietary mod-
els like GPT-4. We demonstrate that LAB-trained models can achieve compet-
itive performance across several benchmarks compared to models trained with
traditional human-annotated or GPT-4 generated synthetic data. Thus offering a
scalable, cost-effective solution for enhancing LLM capabilities and instruction-
following behaviors without the dra
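
Because `load_and_split()` passes the pages through a text splitter, the returned list may hold more than one document per PDF page. The sketch below counts documents per page. It assumes each document's metadata carries a `page` key, as the `PyPDFLoader` output should; `chunks_per_page` and the stand-in objects are illustrative names, not LangChain APIs.

```python
from collections import Counter
from types import SimpleNamespace


def chunks_per_page(docs):
    """Count how many documents share each `page` value in their metadata."""
    return Counter(doc.metadata.get("page") for doc in docs)


# Stand-ins for split output: page 0 produced two chunks, page 1 one chunk.
example_chunks = [
    SimpleNamespace(page_content="first part of page 0", metadata={"page": 0}),
    SimpleNamespace(page_content="second part of page 0", metadata={"page": 0}),
    SimpleNamespace(page_content="all of page 1", metadata={"page": 1}),
]

counts = chunks_per_page(example_chunks)
print(dict(counts))
```

In the notebook, `chunks_per_page(pages)` would show whether any page was split into several documents.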

#### PyMuPDFLoader

`PyMuPDFLoader` is typically among the fastest of the PDF parsing options. It provides detailed metadata about the PDF and its pages, and returns one document per page.


In [10]:
loader = PyMuPDFLoader(pdf_url)
loader

<langchain_community.document_loaders.pdf.PyMuPDFLoader at 0x7bdf65dc5210>

In [11]:
data = loader.load()
print(data[0])

page_content='LAB: LARGE-SCALE ALIGNMENT FOR CHATBOTS
MIT-IBM Watson AI Lab and IBM Research
Shivchander Sudalairaj∗
Abhishek Bhandwaldar∗
Aldo Pareja∗
Kai Xu
David D. Cox
Akash Srivastava∗,†
*Equal Contribution, †Corresponding Author
ABSTRACT
This work introduces LAB (Large-scale Alignment for chatBots), a novel method-
ology designed to overcome the scalability challenges in the instruction-tuning
phase of large language model (LLM) training. Leveraging a taxonomy-guided
synthetic data generation process and a multi-phase tuning framework, LAB sig-
nificantly reduces reliance on expensive human annotations and proprietary mod-
els like GPT-4. We demonstrate that LAB-trained models can achieve compet-
itive performance across several benchmarks compared to models trained with
traditional human-annotated or GPT-4 generated synthetic data. Thus offering a
scalable, cost-effective solution for enhancing LLM capabilities and instruction-
following behaviors without the drawbacks of catast

The `metadata` attribute reveals that `PyMuPDFLoader` provides more detailed metadata information than `PyPDFLoader`.
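
One way to check this is to compare the metadata keys of the two loaders' first documents. The helper below is a small sketch; `extra_metadata_keys` is a hypothetical name, and the two dictionaries are illustrative placeholders rather than real loader output, since the actual keys depend on the library version and the PDF.

```python
def extra_metadata_keys(base_metadata, other_metadata):
    """Return the keys found in other_metadata but not in base_metadata, sorted."""
    return sorted(set(other_metadata) - set(base_metadata))


# Illustrative placeholder dictionaries, not real loader output.
pypdf_meta = {"source": "example.pdf", "page": 0}
pymupdf_meta = {
    "source": "example.pdf",
    "page": 0,
    "total_pages": 1,
    "title": "example",
    "author": "example",
}

extra = extra_metadata_keys(pypdf_meta, pymupdf_meta)
print(extra)
```

In the notebook, the equivalent comparison would be `extra_metadata_keys(pages[0].metadata, data[0].metadata)`, where `pages` comes from `PyPDFLoader` and `data` from `PyMuPDFLoader`.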


### Load from Markdown files

Sometimes, our file source might be in Markdown format.

LangChain provides the `UnstructuredMarkdownLoader` to load content from Markdown files.


In [12]:
!wget 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/eMSP5vJjj9yOfAacLZRWsg/markdown-sample.md'

--2025-05-25 22:11:49--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/eMSP5vJjj9yOfAacLZRWsg/markdown-sample.md
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.45.118.108
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.45.118.108|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3398 (3.3K) [text/markdown]
Saving to: ‘markdown-sample.md.1’


2025-05-25 22:11:49 (1.19 GB/s) - ‘markdown-sample.md.1’ saved [3398/3398]



In [13]:
markdown_path = "markdown-sample.md"
loader = UnstructuredMarkdownLoader(markdown_path)
loader

data = loader.load()

data

[Document(metadata={'source': 'markdown-sample.md'}, page_content='An h1 header\n\nParagraphs are separated by a blank line.\n\n2nd paragraph. Italic, bold, and monospace. Itemized lists look like:\n\nthis one\n\nthat one\n\nthe other one\n\nNote that --- not considering the asterisk --- the actual text content starts at 4-columns in.\n\nBlock quotes are written like so.\n\nThey can span multiple paragraphs, if you like.\n\nUse 3 dashes for an em-dash. Use 2 dashes for ranges (ex., "it\'s all in chapters 12--14"). Three dots ... will be converted to an ellipsis. Unicode is supported. ☺\n\nAn h2 header\n\nHere\'s a numbered list:\n\nfirst item\n\nsecond item\n\nthird item\n\nNote again how the actual text starts at 4 columns in (4 characters from the left side). Here\'s a code sample:\n\n# Let me re-iterate ...\nfor i in 1 .. 10 { do-something(i) }\n\nAs you probably guessed, indented 4 spaces. By the way, instead of indenting the block, you can use delimited blocks, if you like:\n\n~~~

### Load from JSON files

The `JSONLoader` uses a specified [jq schema](https://en.wikipedia.org/wiki/Jq_(programming_language)) to parse JSON files. It relies on the `jq` Python package, which we installed earlier.


In [14]:
!wget 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/hAmzVJeOUAMHzmhUHNdAUg/facebook-chat.json'

--2025-05-25 22:11:52--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/hAmzVJeOUAMHzmhUHNdAUg/facebook-chat.json
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 198.23.119.245
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|198.23.119.245|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2167 (2.1K) [application/json]
Saving to: ‘facebook-chat.json.1’


2025-05-25 22:11:52 (720 MB/s) - ‘facebook-chat.json.1’ saved [2167/2167]



In [15]:
file_path='facebook-chat.json'
data = json.loads(Path(file_path).read_text())
pprint(data)

{'image': {'creation_timestamp': 1675549016, 'uri': 'image_of_the_chat.jpg'},
 'is_still_participant': True,
 'joinable_mode': {'link': '', 'mode': 1},
 'magic_words': [],
 'messages': [{'content': 'Bye!',
               'sender_name': 'User 2',
               'timestamp_ms': 1675597571851},
              {'content': 'Oh no worries! Bye',
               'sender_name': 'User 1',
               'timestamp_ms': 1675597435669},
              {'content': 'No Im sorry it was my mistake, the blue one is not '
                          'for sale',
               'sender_name': 'User 2',
               'timestamp_ms': 1675596277579},
              {'content': 'I thought you were selling the blue one!',
               'sender_name': 'User 1',
               'timestamp_ms': 1675595140251},
              {'content': 'Im not interested in this bag. Im interested in the '
                          'blue one!',
               'sender_name': 'User 1',
               'timestamp_ms': 1675595109305},
   

We use `JSONLoader` to load data from the JSON file. Because a JSON file can contain many nested attribute-value pairs, we tell the loader which values to extract by passing a jq expression as `jq_schema`.

For example, to load the `content` of every message in the file, set `jq_schema='.messages[].content'`.


In [16]:
loader = JSONLoader(
    file_path=file_path,
    jq_schema='.messages[].content',
    text_content=False)

data = loader.load()
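
To see what the jq expression `.messages[].content` selects, it can help to write the same selection in plain Python. This is only a sketch of the selection step, using a trimmed copy of the structure printed earlier (two of the messages); it does not reproduce how `JSONLoader` builds its `Document` objects.

```python
import json

# A trimmed copy of the chat structure printed above (two of the messages).
raw = json.loads("""
{
  "messages": [
    {"content": "Bye!", "sender_name": "User 2", "timestamp_ms": 1675597571851},
    {"content": "Oh no worries! Bye", "sender_name": "User 1", "timestamp_ms": 1675597435669}
  ]
}
""")

# Plain-Python counterpart of the jq expression `.messages[].content`:
# iterate over the `messages` array and pick the `content` field of each item.
contents = [message["content"] for message in raw["messages"]]
print(contents)
```

Each selected value should then end up as the `page_content` of its own document in the list returned by `loader.load()`.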

### Load from CSV files
CSV files are a common format for storing tabular data. The `CSVLoader` provides a convenient way to read and process this data.


In [17]:
!wget 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IygVG_j0M87BM4Z0zFsBMA/mlb-teams-2012.csv'
loader = CSVLoader(file_path='mlb-teams-2012.csv')
data = loader.load()

--2025-05-25 22:11:58--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IygVG_j0M87BM4Z0zFsBMA/mlb-teams-2012.csv
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.63.118.104
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.63.118.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 848 [text/csv]
Saving to: ‘mlb-teams-2012.csv.1’


2025-05-25 22:11:59 (340 MB/s) - ‘mlb-teams-2012.csv.1’ saved [848/848]



In [18]:
data

[Document(metadata={'source': 'mlb-teams-2012.csv', 'row': 0}, page_content='Team: Nationals\n"Payroll (millions)": 81.34\n"Wins": 98'),
 Document(metadata={'source': 'mlb-teams-2012.csv', 'row': 1}, page_content='Team: Reds\n"Payroll (millions)": 82.20\n"Wins": 97'),
 Document(metadata={'source': 'mlb-teams-2012.csv', 'row': 2}, page_content='Team: Yankees\n"Payroll (millions)": 197.96\n"Wins": 95'),
 Document(metadata={'source': 'mlb-teams-2012.csv', 'row': 3}, page_content='Team: Giants\n"Payroll (millions)": 117.62\n"Wins": 94'),
 Document(metadata={'source': 'mlb-teams-2012.csv', 'row': 4}, page_content='Team: Braves\n"Payroll (millions)": 83.31\n"Wins": 94'),
 Document(metadata={'source': 'mlb-teams-2012.csv', 'row': 5}, page_content='Team: Athletics\n"Payroll (millions)": 55.37\n"Wins": 94'),
 Document(metadata={'source': 'mlb-teams-2012.csv', 'row': 6}, page_content='Team: Rangers\n"Payroll (millions)": 120.51\n"Wins": 93'),
 Document(metadata={'source': 'mlb-teams-2012.csv', '

When you load data from a CSV file, the loader typically creates a separate `Document` object for each row of data in the CSV.
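
The row-per-document behavior can be mimicked with the standard `csv` module. The sketch below is not the `CSVLoader` implementation; `rows_to_docs` is a hypothetical helper and the inline sample assumes the file puts a space after each comma. Under that assumption it reproduces the literal quotes seen around `"Payroll (millions)"` in the output above, and shows that the standard `skipinitialspace` option removes them.

```python
import csv
import io

# Assumed layout of the first lines of the file (space after each comma).
SAMPLE = (
    '"Team", "Payroll (millions)", "Wins"\n'
    '"Nationals", 81.34, 98\n'
    '"Reds", 82.20, 97\n'
)


def rows_to_docs(text, source="sample.csv", **csv_args):
    """Turn each CSV row into a dict with `page_content` and `metadata`."""
    reader = csv.DictReader(io.StringIO(text), **csv_args)
    docs = []
    for i, row in enumerate(reader):
        content = "\n".join(f"{k.strip()}: {v.strip()}" for k, v in row.items())
        docs.append({"page_content": content,
                     "metadata": {"source": source, "row": i}})
    return docs


default_docs = rows_to_docs(SAMPLE)
clean_docs = rows_to_docs(SAMPLE, skipinitialspace=True)
print(default_docs[0]["page_content"])
print(clean_docs[0]["page_content"])
```

`CSVLoader` accepts a `csv_args` dictionary that is forwarded to the CSV reader, so a similar option could be tried there if the quoted headers are unwanted.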


#### UnstructuredCSVLoader

In contrast to `CSVLoader`, which treats each row as an individual document with headers defining the data, `UnstructuredCSVLoader` considers the entire CSV file as a single unstructured table element. This approach is beneficial when you want to analyze the data as a complete table rather than as separate entries.


In [19]:
loader = UnstructuredCSVLoader(
    file_path="mlb-teams-2012.csv", mode="elements"
)
data = loader.load()
data[0].page_content

'Team "Payroll (millions)" "Wins" Nationals 81.34 98 Reds 82.20 97 Yankees 197.96 95 Giants 117.62 94 Braves 83.31 94 Athletics 55.37 94 Rangers 120.51 93 Orioles 81.43 93 Rays 64.17 90 Angels 154.49 89 Tigers 132.30 88 Cardinals 110.30 88 Dodgers 95.14 86 White Sox 96.92 85 Brewers 97.65 83 Phillies 174.54 81 Diamondbacks 74.28 81 Pirates 63.43 79 Padres 55.24 76 Mariners 81.97 75 Mets 93.35 74 Blue Jays 75.48 73 Royals 60.91 72 Marlins 118.07 69 Red Sox 173.18 69 Indians 78.43 68 Twins 94.08 66 Rockies 78.06 64 Cubs 88.19 61 Astros 60.65 55'

In [20]:
print(data[0].metadata["text_as_html"])

<table><tr><td>Team</td><td>"Payroll (millions)"</td><td>"Wins"</td></tr><tr><td>Nationals</td><td>81.34</td><td>98</td></tr><tr><td>Reds</td><td>82.20</td><td>97</td></tr><tr><td>Yankees</td><td>197.96</td><td>95</td></tr><tr><td>Giants</td><td>117.62</td><td>94</td></tr><tr><td>Braves</td><td>83.31</td><td>94</td></tr><tr><td>Athletics</td><td>55.37</td><td>94</td></tr><tr><td>Rangers</td><td>120.51</td><td>93</td></tr><tr><td>Orioles</td><td>81.43</td><td>93</td></tr><tr><td>Rays</td><td>64.17</td><td>90</td></tr><tr><td>Angels</td><td>154.49</td><td>89</td></tr><tr><td>Tigers</td><td>132.30</td><td>88</td></tr><tr><td>Cardinals</td><td>110.30</td><td>88</td></tr><tr><td>Dodgers</td><td>95.14</td><td>86</td></tr><tr><td>White Sox</td><td>96.92</td><td>85</td></tr><tr><td>Brewers</td><td>97.65</td><td>83</td></tr><tr><td>Phillies</td><td>174.54</td><td>81</td></tr><tr><td>Diamondbacks</td><td>74.28</td><td>81</td></tr><tr><td>Pirates</td><td>63.43</td><td>79</td></tr><tr><td>Padres</
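
The `text_as_html` value keeps the table structure, so it can be turned back into rows when needed. The following is a minimal sketch using only the standard library; `TableRows` and `html_table_to_rows` are hypothetical names, and the sample string is a shortened copy of the output above.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def html_table_to_rows(html):
    """Parse an HTML table string into a list of rows (lists of cell strings)."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


sample_html = (
    '<table><tr><td>Team</td><td>"Payroll (millions)"</td><td>"Wins"</td></tr>'
    "<tr><td>Nationals</td><td>81.34</td><td>98</td></tr></table>"
)
rows = html_table_to_rows(sample_html)
print(rows)
```

In the notebook, the input would be `data[0].metadata["text_as_html"]`.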

### Load from URLs/websites

A common way to fetch and parse an HTML or XML page is to combine `requests` with the `BeautifulSoup` package. Used this way, however, we get the whole parsed markup back, and extracting just the text takes extra steps.

The following code uses `BeautifulSoup` to parse a web page. Let's look at what it returns.


In [21]:
import requests
from bs4 import BeautifulSoup

url = 'https://www.ibm.com/topics/langchain'
response = requests.get(url)

soup = BeautifulSoup(response.content, 'html.parser')
print(soup.prettify())

<!DOCTYPE HTML>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="en" name="languageCode"/>
  <meta content="us" name="countryCode"/>
  <meta content="What Is LangChain?" name="searchTitle"/>
  <meta content="Data Platform SDRs" name="focusArea"/>
  <title>
   What Is LangChain? | IBM
  </title>
  <script data-routing="program=131558,environment=1281329,tier=publish" defer="defer" src="https://rum.hlx.page/.rum/@adobe/helix-rum-js@%5E2/dist/rum-standalone.js" type="text/javascript">
  </script>
  <link href="/content/dam/adobe-cms/default-images/icon-16x16.png" rel="icon" sizes="16x16"/>
  <link href="/content/dam/adobe-cms/default-images/icon-32x32.png" rel="icon" sizes="32x32"/>
  <link href="/content/dam/adobe-cms/default-images/icon-150x150.png" rel="icon" sizes="150x150"/>
  <link href="/content/dam/adobe-cms/default-images/icon-192x192.png" rel="icon" sizes="192x192"/>
  <link href="/content/dam/adobe-cms/default-images/icon-512x512.png" rel="icon" sizes="512x51

The printed output shows that the parsed page contains not only the text content but also a large number of HTML tags and external links, which are unnecessary if we only want the text of the page.

LangChain's `WebBaseLoader` addresses this. It is designed to extract all text from HTML webpages and convert it into a document format suitable for further processing.


In [22]:
loader = WebBaseLoader("https://www.ibm.com/topics/langchain")
data = loader.load()
data

#### Load from multiple web pages

You can load multiple webpages in a single call by passing a list of URLs to the loader. This returns a list of documents in the same order as the URLs provided.


In [23]:
loader = WebBaseLoader(["https://www.ibm.com/topics/langchain", "https://www.redhat.com/en/topics/ai/what-is-instructlab"])
data = loader.load()
data

[Document(metadata={'source': 'https://www.ibm.com/topics/langchain', 'title': 'What Is LangChain? | IBM', 'description': 'LangChain is an open source orchestration framework for the development of applications using large language models (LLMs), like chatbots and virtual agents.\u202f', 'language': 'en'}, page_content="\n\n\n\n\n\n\n\n\nWhat Is LangChain? | IBM\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n                    \n\n\n\n  \n    What is LangChain?\n\n\n\n\n\n\n    \n\n\n                \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nAI Agents\n\n\n\nWelcome\n\n\n\n\n\nCaret right\n\nIntroduction\n\n\n\n\nOverview\n\n\n\n\nAI agents vs AI assistants\n\n\n\n\nAgentic AI\n\n\n\n\nAgentic AI vs generative AI\n\n\n\n\nTypes of AI agents\n\n\n\n\n\n\n\nCaret right\n\nComponents\n\n\n\n\nOverview\n\n\n\n\nPerception\n\n\n\n\nReasoning\n\n\n\n\n
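
The `page_content` above contains long runs of newlines and indentation left over from the page layout. A small post-processing step can tidy this before further use. The sketch below is one simple option, with `tidy_whitespace` as a hypothetical helper name; it drops blank lines and trims each remaining line.

```python
def tidy_whitespace(text):
    """Drop blank lines and strip leading/trailing spaces from the rest."""
    return "\n".join(line.strip() for line in text.splitlines() if line.strip())


# Shortened imitation of the whitespace pattern seen in the output above.
sample = "\n\n\n\nWhat Is LangChain? | IBM\n\n\n\n   \n\n  \n    What is LangChain?\n\n\nAI Agents\n"
tidied = tidy_whitespace(sample)
print(tidied)
```

In the notebook, this could be applied as `tidy_whitespace(data[0].page_content)`.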

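### Load from Word files

`Docx2txtLoader` loads a Word `.docx` file into a `Document`, relying on the `docx2txt` package installed earlier.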


In [24]:
!wget "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/94hiHUNLZdb0bLMkrCh79g/file-sample.docx"
loader = Docx2txtLoader("file-sample.docx")
data = loader.load()
data

--2025-05-25 22:16:00--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/94hiHUNLZdb0bLMkrCh79g/file-sample.docx
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 198.23.119.245
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|198.23.119.245|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1311881 (1.3M) [application/vnd.openxmlformats-officedocument.wordprocessingml.document]
Saving to: ‘file-sample.docx’


2025-05-25 22:16:00 (11.5 MB/s) - ‘file-sample.docx’ saved [1311881/1311881]



[Document(metadata={'source': 'file-sample.docx'}, page_content='Demonstration of DOCX support in calibre\n\nThis document demonstrates the ability of the calibre DOCX Input plugin to convert the various typographic features in a Microsoft Word (2007 and newer) document. Convert this document to a modern ebook format, such as AZW3 for Kindles or EPUB for other ebook readers, to see it in action.\n\nThere is support for images, tables, lists, footnotes, endnotes, links, dropcaps and various types of text and paragraph level formatting.\n\nTo see the DOCX conversion in action, simply add this file to calibre using the “Add Books” button and then click “Convert”.  Set the output format in the top right corner of the conversion dialog to EPUB or AZW3 and click “OK”.\n\n\n\nText Formatting\n\nInline formatting\n\nHere, we demonstrate various types of inline text formatting and the use of embedded fonts.\n\nHere is some bold, italic, bold-italic, underlined and struck out  text. Then, we hav

### Load from Unstructured Files

Sometimes, we need to load content from various text sources and formats without writing a separate loader for each one. Additionally, when a new file format emerges, we want to save time by not having to write a new loader for it. `UnstructuredFileLoader` addresses this need by supporting the loading of multiple file types. Currently, `UnstructuredFileLoader` can handle text files, PowerPoints, HTML, PDFs, images, and more.


In [25]:
loader = UnstructuredFileLoader("new-Policies.txt")
data = loader.load()
data

[Document(metadata={'source': 'new-Policies.txt'}, page_content="1. Code of Conduct\n\nOur Code of Conduct establishes the core values and ethical standards that all members of our organization must adhere to. We are committed to fostering a workplace characterized by integrity, respect, and accountability.\n\nIntegrity: We commit to the highest ethical standards by being honest and transparent in all our dealings, whether with colleagues, clients, or the community. We protect sensitive information and avoid conflicts of interest.\n\nRespect: We value diversity and every individual's contribution. Discrimination, harassment, or any form of disrespect is not tolerated. We promote an inclusive environment where differences are respected, and everyone is treated with dignity.\n\nAccountability: We are responsible for our actions and decisions, complying with all relevant laws and regulations. We aim for continuous improvement and report any breaches of this code, supporting investigations

We can also load a `.md` file.


In [26]:
loader = UnstructuredFileLoader("markdown-sample.md")
data = loader.load()
data

[Document(metadata={'source': 'markdown-sample.md'}, page_content='An h1 header\n\nParagraphs are separated by a blank line.\n\n2nd paragraph. Italic, bold, and monospace. Itemized lists look like:\n\nthis one\n\nthat one\n\nthe other one\n\nNote that --- not considering the asterisk --- the actual text content starts at 4-columns in.\n\nBlock quotes are written like so.\n\nThey can span multiple paragraphs, if you like.\n\nUse 3 dashes for an em-dash. Use 2 dashes for ranges (ex., "it\'s all in chapters 12--14"). Three dots ... will be converted to an ellipsis. Unicode is supported. ☺\n\nAn h2 header\n\nHere\'s a numbered list:\n\nfirst item\n\nsecond item\n\nthird item\n\nNote again how the actual text starts at 4 columns in (4 characters from the left side). Here\'s a code sample:\n\n# Let me re-iterate ...\nfor i in 1 .. 10 { do-something(i) }\n\nAs you probably guessed, indented 4 spaces. By the way, instead of indenting the block, you can use delimited blocks, if you like:\n\n~~~

#### Multiple files with different formats

In [27]:
files = ["markdown-sample.md", "new-Policies.txt"]
loader = UnstructuredFileLoader(files)
data = loader.load()
data

[Document(metadata={'source': ['markdown-sample.md', 'new-Policies.txt']}, page_content='An h1 header\n\nParagraphs are separated by a blank line.\n\n2nd paragraph. Italic, bold, and monospace. Itemized lists look like:\n\nthis one\n\nthat one\n\nthe other one\n\nNote that --- not considering the asterisk --- the actual text content starts at 4-columns in.\n\nBlock quotes are written like so.\n\nThey can span multiple paragraphs, if you like.\n\nUse 3 dashes for an em-dash. Use 2 dashes for ranges (ex., "it\'s all in chapters 12--14"). Three dots ... will be converted to an ellipsis. Unicode is supported. ☺\n\nAn h2 header\n\nHere\'s a numbered list:\n\nfirst item\n\nsecond item\n\nthird item\n\nNote again how the actual text starts at 4 columns in (4 characters from the left side). Here\'s a code sample:\n\n# Let me re-iterate ...\nfor i in 1 .. 10 { do-something(i) }\n\nAs you probably guessed, indented 4 spaces. By the way, instead of indenting the block, you can use delimited block
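
In the output above, `metadata['source']` is a list and the contents of both files appear in a single `Document`. If one document per file is preferred, a plain loop over the paths is one option. The sketch below uses a hypothetical `load_each` helper and a `FakeLoader` stand-in so that it runs without any files; neither is part of LangChain.

```python
from types import SimpleNamespace


def load_each(paths, loader_factory):
    """Load every path with its own loader and return a flat list of documents."""
    docs = []
    for path in paths:
        docs.extend(loader_factory(path).load())
    return docs


class FakeLoader:
    """Stand-in loader so the sketch runs without files; not a LangChain class."""

    def __init__(self, path):
        self.path = path

    def load(self):
        return [SimpleNamespace(page_content=f"contents of {self.path}",
                                metadata={"source": self.path})]


docs = load_each(["markdown-sample.md", "new-Policies.txt"], FakeLoader)
sources = [doc.metadata["source"] for doc in docs]
print(sources)
```

In the notebook, the corresponding call would be `load_each(files, UnstructuredFileLoader)`.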