# Load Documents Using LangChain for Different Sources


In [None]:
pip install jq

In [None]:
!pip install --user "langchain-community==0.2.1"


In [None]:
pip install pypdf

In [None]:
pip install pymupdf

In [None]:
pip install unstructured

In [None]:

%%capture

!pip install --user "markdown"


In [None]:

!pip install --user "docx2txt==0.8"
!pip install --user "beautifulsoup4==4.12.3"


Each client provides data in a different format: some send PDFs, others send Word documents, CSV files, or even HTML webpages. Manually loading and parsing each document type is time-consuming and error-prone. Your goal is to streamline this process so that it is efficient and reliable.

To achieve this, you'll use LangChain’s powerful document loaders. These loaders allow you to read and convert various file formats into a unified document structure that can be easily processed. For example, you'll load client policy documents from text files, financial reports from PDFs, marketing strategies from Word documents, and product reviews from JSON files. By the end of this lab, you will have a robust pipeline that can handle any new file formats clients might send, saving you valuable time and effort.

 - Understand how to use `TextLoader` to load text files.
 - Learn how to load PDFs using `PyPDFLoader` and `PyMuPDFLoader`.
 - Use `UnstructuredMarkdownLoader` to load Markdown files.
 - Load JSON files with `JSONLoader` using jq schemas.
 - Process CSV files with `CSVLoader` and `UnstructuredCSVLoader`.
 - Load webpage content using `WebBaseLoader`.
 - Load Word documents using `Docx2txtLoader`.
 - Utilize `UnstructuredFileLoader` for various file types.



In [None]:
# You can also use this section to suppress warnings generated by your code:

def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

from pprint import pprint
import json
from pathlib import Path
import nltk
from langchain_community.document_loaders import TextLoader
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.document_loaders import UnstructuredMarkdownLoader
from langchain_community.document_loaders import JSONLoader
from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain_community.document_loaders.csv_loader import UnstructuredCSVLoader
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.document_loaders import Docx2txtLoader
from langchain_community.document_loaders import UnstructuredFileLoader

nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')

In [None]:
!wget "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/Ec5f3KYU1CpbKRp1whFLZw/new-Policies.txt"

Next, we will use the `TextLoader` class to load the file.


In [None]:
loader = TextLoader("new-Policies.txt")
loader

In [None]:
data = loader.load()

Let's look at what was loaded.

`load()` returns a list of `Document` objects. Each `Document` has a `page_content` attribute that holds the text and a `metadata` attribute that describes where the text came from.


In [None]:
data

In [None]:
pprint(data[0].page_content[:1000])
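
To make the shape of that object concrete, here is a minimal standard-library sketch of a text loader. `SimpleDocument` and `load_text` are hypothetical names used only for illustration; they are not part of LangChain, and the real `TextLoader` and `Document` classes do more than this. The sample file and its contents are also made up so that the sketch runs without any downloads.

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text(path: str, encoding: str = "utf-8") -> list[SimpleDocument]:
    """Read a whole text file into a single document and record its source."""
    text = Path(path).read_text(encoding=encoding)
    return [SimpleDocument(page_content=text, metadata={"source": path})]


# Demo on a throwaway file (made-up content, not the lab's policy file).
with tempfile.TemporaryDirectory() as tmp:
    sample_path = str(Path(tmp) / "sample-policy.txt")
    Path(sample_path).write_text("1. Code of Conduct\nBe respectful.", encoding="utf-8")
    docs = load_text(sample_path)

print(len(docs))
print(docs[0].page_content)
print(docs[0].metadata)
```

As in the cells above, the result is a list, so the text is reached with `docs[0].page_content` rather than `docs.page_content`.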

### Load from PDF files

Sometimes, we may have files in PDF format that we want to load for processing.

LangChain provides several classes for loading PDFs. Here, we introduce two classes: `PyPDFLoader` and `PyMuPDFLoader`.

#### PyPDFLoader

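Load the PDF using `PyPDFLoader` into a list of documents, where each document contains page content and metadata with the page number. Note that `load_and_split()` first loads one document per page and then applies a default text splitter, so a long page may be divided into more than one document; use `load()` if you want exactly one document per page.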


In [None]:
pdf_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/Q81D33CdRLK6LswuQrANQQ/instructlab.pdf"

loader = PyPDFLoader(pdf_url)

pages = loader.load_and_split()

In [None]:
print(pages[0])

In [None]:
# Display the first 3 documents; PyPDFLoader stores the 0-based page index in metadata
for doc in pages[:3]:
    print(f"page index: {doc.metadata['page']}")
    print(doc)

#### PyMuPDFLoader

`PyMuPDFLoader` is the fastest of the PDF parsing options. It provides detailed metadata about the PDF and its pages, and returns one document per page.


In [None]:
loader = PyMuPDFLoader(pdf_url)
loader

In [None]:
data = loader.load()
print(data[0])

Comparing the `metadata` attributes shows that `PyMuPDFLoader` provides more detailed metadata than `PyPDFLoader`.


### Load from Markdown files

Sometimes, our file source might be in Markdown format.

LangChain provides the `UnstructuredMarkdownLoader` to load content from Markdown files.


In [None]:
!wget 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/eMSP5vJjj9yOfAacLZRWsg/markdown-sample.md'

In [None]:
markdown_path = "markdown-sample.md"
loader = UnstructuredMarkdownLoader(markdown_path)
loader

data = loader.load()

data

### Load from JSON files

`JSONLoader` uses a [jq schema](https://en.wikipedia.org/wiki/Jq_(programming_language)) that you specify to select which parts of a JSON file to load. It relies on the `jq` Python package, which we installed earlier.


In [None]:
!wget 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/hAmzVJeOUAMHzmhUHNdAUg/facebook-chat.json'

In [None]:
file_path='facebook-chat.json'
data = json.loads(Path(file_path).read_text())
pprint(data)

We use `JSONLoader` to load data from the JSON file. Because a JSON file can contain many attribute-value pairs, we pass a `jq_schema` that selects the specific attribute we want to load.

For example, to load the `content` of every message in the file, we set `jq_schema='.messages[].content'`.


In [None]:
loader = JSONLoader(
    file_path=file_path,
    jq_schema='.messages[].content',
    text_content=False)

data = loader.load()
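
To see what the expression `.messages[].content` selects, the sketch below walks the same path in plain Python. The sample JSON is hypothetical and only assumes the overall shape used above (a top-level `messages` list whose items each have a `content` field); it is an illustration of the path, not of how `JSONLoader` or `jq` are implemented.

```python
import json

# Hypothetical sample with the shape assumed above: a top-level "messages"
# list whose items each carry a "content" field.
raw = """
{
  "title": "Sample chat",
  "messages": [
    {"sender_name": "User 1", "content": "Hi there"},
    {"sender_name": "User 2", "content": "Hello! How are you?"}
  ]
}
"""

chat = json.loads(raw)

# Plain-Python reading of the jq expression '.messages[].content':
#   .messages  -> take the value stored under "messages"
#   []         -> iterate over every item of that list
#   .content   -> take the "content" field of each item
contents = [message["content"] for message in chat["messages"]]

for position, text in enumerate(contents, start=1):
    print(position, text)
```

Each selected value becomes its own document when loaded with `JSONLoader`, so a chat with two messages yields two documents.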

### Load from CSV files
CSV files are a common format for storing tabular data. The `CSVLoader` provides a convenient way to read and process this data.


In [None]:
!wget 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IygVG_j0M87BM4Z0zFsBMA/mlb-teams-2012.csv'
loader = CSVLoader(file_path='mlb-teams-2012.csv')
data = loader.load()

In [None]:
data

When you load data from a CSV file, the loader typically creates a separate `Document` object for each row of data in the CSV.
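
The row-per-document idea can be sketched with the standard library alone. The function `rows_to_documents`, the sample rows, and the exact `column: value` layout are assumptions for illustration; compare the output with what `CSVLoader` printed above rather than treating this as its implementation.

```python
import csv
import io

# Hypothetical sample rows; column names and values are illustrative only.
csv_text = """Team,Payroll (millions),Wins
Nationals,81.34,98
Reds,82.20,97
"""


def rows_to_documents(handle, source: str) -> list[dict]:
    """Turn each CSV row into one document-like dict."""
    documents = []
    for row_index, row in enumerate(csv.DictReader(handle)):
        # The header supplies the keys, so every row is self-describing.
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append(
            {"page_content": content, "metadata": {"source": source, "row": row_index}}
        )
    return documents


csv_docs = rows_to_documents(io.StringIO(csv_text), source="sample.csv")
for doc in csv_docs:
    print(doc["metadata"], repr(doc["page_content"]))
```

Because each row carries its own column names, a single document can be read or embedded without needing the rest of the table.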


#### UnstructuredCSVLoader

In contrast to `CSVLoader`, which treats each row as an individual document with headers defining the data, `UnstructuredCSVLoader` considers the entire CSV file as a single unstructured table element. This approach is beneficial when you want to analyze the data as a complete table rather than as separate entries.


In [None]:
loader = UnstructuredCSVLoader(
    file_path="mlb-teams-2012.csv", mode="elements"
)
data = loader.load()
data[0].page_content

In [None]:
print(data[0].metadata["text_as_html"])
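
For contrast with the row-per-document approach, the following sketch keeps the whole table together as one document and stores an HTML rendering in its metadata. The helper `rows_to_html`, the sample data, and the exact markup are hypothetical; the HTML produced by `UnstructuredCSVLoader` may be structured differently.

```python
import csv
import io
from html import escape

# Hypothetical sample table; values are illustrative only.
csv_text = """Team,Wins
Nationals,98
Reds,97
"""


def rows_to_html(rows: list[list[str]]) -> str:
    """Render rows as a bare HTML table, escaping cell text."""
    html_rows = []
    for row in rows:
        cells = "".join(f"<td>{escape(cell)}</td>" for cell in row)
        html_rows.append(f"<tr>{cells}</tr>")
    return "<table>" + "".join(html_rows) + "</table>"


rows = list(csv.reader(io.StringIO(csv_text)))

# One document for the entire table: flattened text plus the HTML form.
table_doc = {
    "page_content": "\n".join(" ".join(row) for row in rows),
    "metadata": {"text_as_html": rows_to_html(rows)},
}

print(table_doc["page_content"])
print(table_doc["metadata"]["text_as_html"])
```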

### Load from URL/Website files

The `BeautifulSoup` package is a common choice for loading and parsing HTML or XML files, but using it directly has some limitations.

The following code uses `BeautifulSoup` to parse a web page. Let's see what those limitations are.


In [None]:
import requests
from bs4 import BeautifulSoup

url = 'https://www.ibm.com/topics/langchain'
response = requests.get(url)

soup = BeautifulSoup(response.content, 'html.parser')
print(soup.prettify())

The printed output shows that `prettify()` returns the full markup: a lot of HTML tags and external links appear alongside the page text, which is unnecessary when we only want the text content of the page.

LangChain's `WebBaseLoader` addresses this limitation.

`WebBaseLoader` is designed to extract all text from HTML webpages and convert it into a document format suitable for further processing.


In [None]:
loader = WebBaseLoader("https://www.ibm.com/topics/langchain")
data = loader.load()
data
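
The difference between raw markup and extracted text can be illustrated with the standard library's `html.parser`. This is a rough sketch under the assumption that "text content" means visible text outside `<script>` and `<style>`; the `TextExtractor` class and the sample HTML are hypothetical, and this is not a description of how `WebBaseLoader` works internally.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML string, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Made-up page used only to exercise the sketch.
sample_html = """
<html><head><title>What is LangChain?</title><style>p {color: red;}</style></head>
<body><h1>LangChain</h1><p>An <a href="https://example.com">open source</a> framework.</p>
<script>console.log("ignored");</script></body></html>
"""

page_text = html_to_text(sample_html)
print(page_text)
```

The tags, the link target, and the script body are gone; only the text a reader would see remains, which is the form that downstream steps such as splitting and embedding usually need.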

#### Load from multiple web pages

You can load several webpages in one call by passing a list of URLs to the loader. It returns a list of documents in the same order as the URLs provided.


In [None]:
loader = WebBaseLoader(["https://www.ibm.com/topics/langchain", "https://www.redhat.com/en/topics/ai/what-is-instructlab"])
data = loader.load()
data

### Load from Word files


In [None]:
!wget "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/94hiHUNLZdb0bLMkrCh79g/file-sample.docx"
loader = Docx2txtLoader("file-sample.docx")
data = loader.load()
data

### Load from Unstructured Files

Sometimes, we need to load content from various text sources and formats without writing a separate loader for each one. Additionally, when a new file format emerges, we want to save time by not having to write a new loader for it. `UnstructuredFileLoader` addresses this need by supporting the loading of multiple file types. Currently, `UnstructuredFileLoader` can handle text files, PowerPoints, HTML, PDFs, images, and more.


In [None]:
loader = UnstructuredFileLoader("new-Policies.txt")
data = loader.load()
data

We can also load a `.md` file with the same loader.


In [None]:
loader = UnstructuredFileLoader("markdown-sample.md")
data = loader.load()
data

#### Multiple files with different formats

In [None]:
files = ["markdown-sample.md", "new-Policies.txt"]
loader = UnstructuredFileLoader(files)
data = loader.load()
data
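
The idea of one entry point for many formats can be sketched as a small dispatch table. Everything here is hypothetical: the `LOADERS` registry, the `load_any` function, and the choice to dispatch on file suffix are assumptions for illustration and do not describe how `UnstructuredFileLoader` detects or parses file types.

```python
from pathlib import Path
import csv
import json
import tempfile


def load_plain_text(path: Path) -> str:
    return path.read_text(encoding="utf-8")


def load_json_text(path: Path) -> str:
    # Re-serialize so the output has a stable, readable layout.
    return json.dumps(json.loads(path.read_text(encoding="utf-8")), indent=2)


def load_csv_text(path: Path) -> str:
    with path.open(newline="", encoding="utf-8") as handle:
        return "\n".join(", ".join(row) for row in csv.reader(handle))


# Hypothetical registry: file suffix -> function that returns the file's text.
LOADERS = {
    ".txt": load_plain_text,
    ".md": load_plain_text,
    ".json": load_json_text,
    ".csv": load_csv_text,
}


def load_any(paths):
    """Return one document-like dict per file, in the order given."""
    documents = []
    for raw_path in paths:
        path = Path(raw_path)
        loader = LOADERS.get(path.suffix.lower())
        if loader is None:
            raise ValueError(f"No loader registered for {path.suffix!r} files")
        documents.append(
            {"page_content": loader(path), "metadata": {"source": str(path)}}
        )
    return documents


# Demo on throwaway files with made-up content.
with tempfile.TemporaryDirectory() as tmp:
    note = Path(tmp) / "note.md"
    table = Path(tmp) / "teams.csv"
    note.write_text("# Title\nSome text.", encoding="utf-8")
    table.write_text("Team,Wins\nReds,97\n", encoding="utf-8")
    mixed_docs = load_any([note, table])

for doc in mixed_docs:
    print(Path(doc["metadata"]["source"]).name, repr(doc["page_content"]))
```

Supporting a new format in this sketch means adding one function and one registry entry, which is the maintenance benefit the section describes.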