### This notebook demonstrates commonly used Loaders and Splitters
#### In LangChain, a Document is a simple structure with two fields:
- **page_content (string)**: This field contains the raw text of the document.
- **metadata (dictionary)**: This field stores additional metadata about the text, such as the source URL, author, or any other relevant information.
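
To make the two fields concrete without installing anything, here is a minimal sketch that uses a hypothetical stand-in class. `SimpleDocument` is an assumption for illustration only; it is not LangChain's own `Document` class, and the sample text and metadata values are made up.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class SimpleDocument:
    """Hypothetical stand-in that mirrors the two fields described above."""

    page_content: str  # the raw text of the document
    metadata: Dict[str, Any] = field(default_factory=dict)  # extra info about the text


# Made-up content and metadata, used only for illustration.
doc = SimpleDocument(
    page_content="LangChain loaders return lists of documents.",
    metadata={"source": "example.txt", "author": "unknown"},
)

print(doc.page_content)
print(doc.metadata["source"])
```

Loaders return a list of such objects, which is why the cells in this notebook index into the result with `[0]` before reading `page_content` or `metadata`.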

In [None]:
from langchain_community.document_loaders import TextLoader

# Load a document from a text file using TextLoader.
loader = TextLoader("./loaders-samples/sample.txt")
document = loader.load()
print(document)


In [None]:
document[0].page_content

In [None]:
document[0].metadata

### Types of Document Loaders in LangChain
#### LangChain offers three main types of Document Loaders:
- **Transform Loaders**: These loaders handle different input formats and transform them into the Document format. For instance, consider a CSV file named "data.csv" with columns for "name" and "age". Using the CSVLoader, you can load the CSV data into Documents.
- **Public Dataset or Service Loaders**: LangChain provides loaders for popular public sources, allowing quick retrieval and creation of Documents. For example, the WikipediaLoader can load content from Wikipedia.
- **Proprietary Dataset or Service Loaders**: These loaders are designed to handle proprietary sources that may require additional authentication or setup. For instance, a loader could be created specifically for loading data from an internal database or an API with proprietary access.
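
As a rough illustration of what a transform loader does, the sketch below turns each row of the "name"/"age" CSV into one document-like dictionary using only the standard library. The in-memory CSV text, the `"column: value"` layout of `page_content`, and the `source`/`row` metadata keys are assumptions made for this sketch, not a statement of how CSVLoader is implemented.

```python
import csv
import io

# Hypothetical in-memory stand-in for the "data.csv" file mentioned above.
csv_text = "name,age\nAlice,30\nBob,25\n"

rows_as_documents = []
for row_number, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
    rows_as_documents.append(
        {
            # One "column: value" line per column of the row.
            "page_content": "\n".join(f"{key}: {value}" for key, value in row.items()),
            # Assumed metadata layout: where the row came from and its index.
            "metadata": {"source": "data.csv", "row": row_number},
        }
    )

for doc in rows_as_documents:
    print(doc)
```

The point is the shape of the transformation: one input record becomes one unit of text plus a small dictionary that records its origin.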

#### Transform Loader example

In [None]:
# CSVLoader

from langchain_community.document_loaders import CSVLoader

# Load data from a CSV file using CSVLoader
loader = CSVLoader("../csv/HR-Employee-Attrition.csv")
documents = loader.load()

# Access the content and metadata of each document
for document in documents:
  content = document.page_content
  metadata = document.metadata

  print(content)
  print("-----------")

#### PyPDFLoader
Loads each page of the PDF as one document.

In [18]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("../pdf/CV (1).pdf")
pages = loader.load()

In [None]:
for cnt, page in enumerate(pages, start=1):
  print("---- Document #", cnt)
  print(page.page_content.strip())

#### WebBaseLoader
WebBaseLoader loads all the text from HTML web pages into Documents that can be used downstream.
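
To show what "all text from an HTML page" means in practice, here is a small standard-library sketch that pulls the visible text out of an HTML string and wraps it in a document-like dictionary. It is only an illustration under stated assumptions: the `TextExtractor` class, the sample HTML, and the `https://example.com/` source are hypothetical, and this is not how WebBaseLoader itself is implemented.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


# Hypothetical page used only for this illustration.
html = (
    "<html><head><title>Demo</title><style>p{color:red}</style></head>"
    "<body><h1>Hello</h1><p>World of loaders.</p></body></html>"
)

parser = TextExtractor()
parser.feed(html)
parser.close()

page = {
    "page_content": "\n".join(parser.parts),
    "metadata": {"source": "https://example.com/"},
}
print(page["page_content"])
```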

In [24]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://www.ibm.com/")
data = loader.load()

In [None]:
data[0].page_content

In [None]:
# Strip leading/trailing whitespace and collapse double newlines for basic formatting
formatted_text = data[0].page_content.strip().replace("\n\n", "\n")  # Replace double newlines with single newlines

print(formatted_text)

In [None]:
# Use regular expressions for more comprehensive cleaning:
import re

# Remove unnecessary whitespace and multiple newlines
cleaned_text = re.sub(r"[ \t]+", " ", formatted_text)  # Collapse runs of spaces and tabs (\s would also swallow newlines)
cleaned_text = re.sub(r"\n{3,}", "\n\n", cleaned_text)  # Allow at most two consecutive newlines

print(cleaned_text)
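
Regex cleaning is easier to reason about on a tiny input. The sketch below uses a hypothetical helper, `clean_whitespace`, and a made-up sample string (neither comes from the page loaded above) to collapse horizontal whitespace first and then cap consecutive newlines. Note that `\s` also matches newlines, so a pattern such as `\s+` would flatten paragraph breaks as well.

```python
import re


def clean_whitespace(text: str) -> str:
    """Collapse runs of spaces/tabs and cap consecutive newlines at two."""
    text = re.sub(r"[ \t]+", " ", text)     # spaces and tabs only; newlines survive
    text = re.sub(r" ?\n ?", "\n", text)    # drop single spaces hugging a newline
    text = re.sub(r"\n{3,}", "\n\n", text)  # at most one blank line in a row
    return text.strip()


# Hypothetical sample, not taken from any loaded page.
sample = "Title   \n\n\n\n  First   paragraph.\n \nSecond\tparagraph.  "
print(repr(clean_whitespace(sample)))
```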

### JSON Loader

In [29]:
from langchain_community.document_loaders import JSONLoader
import json
from pathlib import Path
from pprint import pprint

file_path = "./loaders-samples/sample.json"
data = json.loads(Path(file_path).read_text())


In [None]:
pprint(data)

In [31]:
loader = JSONLoader(
  file_path="./loaders-samples/sample.json",
  jq_schema=".employees[].email",
  text_content=False
)

data = loader.load()

In [None]:
data
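
The `jq_schema` expression `.employees[].email` reads as: take the `employees` array, iterate over its items, and keep each item's `email`. A plain-Python sketch of that selection is shown here. The JSON below is a hypothetical sample, since the contents of `sample.json` are not shown in this notebook.

```python
import json

# Hypothetical sample; the real sample.json may look different.
raw = """
{
  "employees": [
    {"name": "Ada", "email": "ada@example.com"},
    {"name": "Linus", "email": "linus@example.com"}
  ]
}
"""
parsed = json.loads(raw)

# Plain-Python equivalent of the jq expression ".employees[].email".
emails = [employee["email"] for employee in parsed["employees"]]
print(emails)
```

Each selected value would then become the text of one document, so a file with two employees should yield two documents.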

### Public Dataset or Service Loaders

#### Wikipedia Loader

In [33]:
from langchain_community.document_loaders import WikipediaLoader

# Load content from Wikipedia using WikipediaLoader
loader = WikipediaLoader("Machine_learning")
document = loader.load()

In [None]:
document[0].page_content

In [None]:
document[0].metadata

#### IMSDb Movie Script Loader

In [36]:
from langchain_community.document_loaders import IMSDbLoader
loader = IMSDbLoader("https://imsdb.com/scripts/BlacKkKlansman.html")

data = loader.load()

In [None]:
# Keep the first 5,000 characters and strip surrounding whitespace
formatted_text = data[0].page_content[:5000].strip()

# Print the formatted text
print(formatted_text)

#### YouTubeLoader

In [None]:
%pip install --upgrade --quiet  youtube-transcript-api

In [43]:
from langchain_community.document_loaders import YoutubeLoader

loader = YoutubeLoader.from_youtube_url(
  "https://www.youtube.com/watch?v=QsYGlZkevEg", add_video_info=False
)

data = loader.load()

In [None]:
# Keep the first 5,000 characters and strip surrounding whitespace
formatted_text = data[0].page_content[:5000].strip()

# Print the formatted text
print(formatted_text)

#### Add video info and language preferences
- **language**: A list of language codes in descending priority, `en` by default.
- **translation**: A translation preference; an available transcript can be translated into your preferred language.
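
"Descending priority" means the first code in the list that has a transcript wins. The sketch below illustrates only that rule. `pick_language` is a hypothetical helper written for this notebook, not part of YoutubeLoader or youtube-transcript-api.

```python
from typing import Iterable, List, Optional


def pick_language(preferred: List[str], available: Iterable[str]) -> Optional[str]:
    """Return the first preferred code that is available, or None."""
    available_set = set(available)
    for code in preferred:  # earlier entries have higher priority
        if code in available_set:
            return code
    return None


print(pick_language(["en", "id"], ["id", "fr"]))  # "en" is missing, so "id" is chosen
print(pick_language(["en", "id"], ["de"]))        # nothing matches
```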

In [None]:
loader = YoutubeLoader.from_youtube_url(
    "https://www.youtube.com/watch?v=IkfPtvA6RmA",
    add_video_info=True,
    language=["en", "id"],
    translation="en",
)
ytdata = loader.load()

In [None]:
ytdata

In [None]:
# Keep the first 5,000 characters and strip surrounding whitespace
formatted_text = ytdata[0].page_content[:5000].strip()

# Print the formatted text
print(formatted_text)

### Text Splitters

Once you've loaded documents, you'll often want to transform them to better suit your application. The simplest example is splitting a long document into smaller chunks that fit into your model's context window. LangChain has a number of built-in document transformers that make it easy to split, combine, filter, and otherwise manipulate documents.

When you work with long pieces of text, you need to split that text into chunks. As simple as this sounds, there is a lot of potential complexity here. Ideally, you want to keep semantically related pieces of text together. What "semantically related" means can depend on the type of text. This notebook showcases several ways to do that.

At a high level, text splitters work as follows:
- Split the text up into small, semantically meaningful chunks (often sentences).
- Start combining these small chunks into a larger chunk until you reach a certain size (as measured by some function).
- Once you reach that size, make that chunk its own piece of text and then start creating a new chunk of text with some overlap (to keep context between chunks).

That means there are 2 different axes along which you can customize your text splitter:
- How the text is split
- How the chunk size is measured
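
The steps and the two customization axes above can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's implementation: `split_and_merge` is a hypothetical function, and its parameter names are chosen here only to echo the splitter options used later in this notebook.

```python
from typing import Callable, List


def split_and_merge(
    text: str,
    separator: str = " ",
    chunk_size: int = 15,
    chunk_overlap: int = 5,
    length_function: Callable[[str], int] = len,
) -> List[str]:
    """Split on `separator`, then greedily merge pieces into overlapping chunks."""
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks: List[str] = []
    current: List[str] = []

    for piece in pieces:
        candidate = separator.join(current + [piece])
        if current and length_function(candidate) > chunk_size:
            # The chunk is full: emit it, then keep only trailing pieces as overlap.
            chunks.append(separator.join(current))
            while current and (
                length_function(separator.join(current)) > chunk_overlap
                or length_function(separator.join(current + [piece])) > chunk_size
            ):
                current.pop(0)
        # A single piece longer than chunk_size is kept whole in this sketch.
        current.append(piece)

    if current:
        chunks.append(separator.join(current))
    return chunks


text = "one two three four five six seven eight nine ten"

# Axis 1: how the text is split (here: on spaces).
# Axis 2: how the size is measured (characters first, then words).
by_chars = split_and_merge(text, chunk_size=15, chunk_overlap=5)
by_words = split_and_merge(
    text, chunk_size=4, chunk_overlap=1, length_function=lambda s: len(s.split())
)
print(by_chars)
print(by_words)
```

Swapping `length_function` changes only how "full" is decided; the splitting and merging logic stays the same.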

In [65]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
  separator="\n\n",
  chunk_size=200,
  chunk_overlap=20,
  length_function=len,
  is_separator_regex=False,
)

In [66]:
loader = WebBaseLoader("https://www.ibm.com/")
data = loader.load()

In [None]:
chunks = text_splitter.split_text(data[0].page_content)
len(chunks)

In [None]:
for chunk in chunks:
  print(chunk)
  print("---")

In [None]:
documents = text_splitter.create_documents([data[0].page_content])
len(documents)

In [None]:
for doc in documents:
  print(doc)
  print("---")

#### RecursiveCharacterTextSplitter
This text splitter is the recommended one for generic text. It is parameterized by a list of characters. It tries to split on them in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""]. This has the effect of trying to keep all paragraphs (and then sentences, and then words) together as long as possible, as those would generically seem to be the strongest semantically related pieces of text.

- How the text is split: By a list of characters.
- How the chunk size is measured: By number of characters.
- The RecursiveCharacterTextSplitter class uses the chunk_size and chunk_overlap parameters to split the text into chunks of the specified size and overlap. Its split_text method recursively splits the text on successive separators until the length of each split is less than chunk_size.
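
The recursive idea can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: `recursive_split` is a hypothetical function, it measures size with `len`, and it leaves out chunk overlap entirely, so it is not a reimplementation of RecursiveCharacterTextSplitter.

```python
from typing import List, Sequence


def recursive_split(
    text: str,
    chunk_size: int,
    separators: Sequence[str] = ("\n\n", "\n", " ", ""),
) -> List[str]:
    """Split with the coarsest separator first, recursing on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        return [text]  # nothing left to split on; return the text as-is
    separator, remaining = separators[0], separators[1:]
    if separator == "":
        # Last resort: cut into fixed-size character windows.
        return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]

    chunks: List[str] = []
    buffer = ""
    for piece in text.split(separator):
        if not piece:
            continue
        if len(piece) > chunk_size:
            # Flush what we have, then retry this piece with finer separators.
            if buffer:
                chunks.append(buffer)
                buffer = ""
            chunks.extend(recursive_split(piece, chunk_size, remaining))
        elif not buffer:
            buffer = piece
        elif len(buffer) + len(separator) + len(piece) <= chunk_size:
            buffer = buffer + separator + piece  # still fits: keep pieces together
        else:
            chunks.append(buffer)
            buffer = piece
    if buffer:
        chunks.append(buffer)
    return chunks


sample = (
    "Loaders read files.\n\n"
    "Splitters cut long text into small overlapping chunks.\n\n"
    "Done."
)
result = recursive_split(sample, chunk_size=30)
for chunk in result:
    print(repr(chunk))
```

Short paragraphs survive intact, while the long middle paragraph falls through to the space separator and is rebuilt word by word up to the size limit.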

In [62]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

rectext_splitter = RecursiveCharacterTextSplitter(
  # Set a really small chunk size, just to show.
  chunk_size=100,
  chunk_overlap=20,
  length_function=len,
  is_separator_regex=False,
)

In [63]:
texts = rectext_splitter.create_documents([data[0].page_content])

In [None]:
for text in texts:
  print(text)
  print("---")