## Commonly Used LangChain Document Loaders and Text Splitters

#### In LangChain, a Document is a simple structure with two fields:
- `page_content (string)`: This field contains the raw text of the document.
- `metadata (dictionary)`: This field stores additional metadata about the text, such as the source URL, author, or any other relevant information.
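
To make the two fields concrete, here is a minimal sketch that uses a plain dataclass as a stand-in. `SimpleDocument` is a hypothetical name used only for illustration; it is not LangChain's own `Document` class, and the sample text and path are made up.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in that mirrors the two fields described above."""

    page_content: str                             # raw text of the document
    metadata: dict = field(default_factory=dict)  # e.g. source path, author


doc = SimpleDocument(
    page_content="Lorem ipsum is placeholder text.",
    metadata={"source": "docs/sample.txt"},
)

print(doc.page_content)
print(doc.metadata["source"])
```

The loaders in the rest of this notebook return lists of objects with this same shape, which is why `document[0].page_content` and `document[0].metadata` are the two attributes accessed throughout.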

In [7]:
from langchain_community.document_loaders import TextLoader
 
# Load text data from a file using TextLoader
loader = TextLoader("docs/sample.txt")
document = loader.load()
print(document)

[Document(metadata={'source': 'docs/sample.txt'}, page_content='The Lorem ipum filling text is used by graphic designers, programmers and printers with the aim of occupying the spaces of a website, an advertising product or an editorial production whose final text is not yet ready.\n\nThis expedient serves to get an idea of the finished product that will soon be printed or disseminated via digital channels.\n\nIn order to have a result that is more in keeping with the final result, the graphic designers, designers or typographers report the Lorem ipsum text in respect of two fundamental aspects, namely readability and editorial requirements.\n\nThe choice of font and font size with which Lorem ipsum is reproduced answers to specific needs that go beyond the simple and simple filling of spaces dedicated to accepting real texts and allowing to have hands an advertising/publishing product, both web and paper, true to reality.\n\nIts nonsense allows the eye to focus only on the graphic lay

In [3]:
document[0].page_content

'The Lorem ipum filling text is used by graphic designers, programmers and printers with the aim of occupying the spaces of a website, an advertising product or an editorial production whose final text is not yet ready.\n\nThis expedient serves to get an idea of the finished product that will soon be printed or disseminated via digital channels.\n\nIn order to have a result that is more in keeping with the final result, the graphic designers, designers or typographers report the Lorem ipsum text in respect of two fundamental aspects, namely readability and editorial requirements.\n\nThe choice of font and font size with which Lorem ipsum is reproduced answers to specific needs that go beyond the simple and simple filling of spaces dedicated to accepting real texts and allowing to have hands an advertising/publishing product, both web and paper, true to reality.\n\nIts nonsense allows the eye to focus only on the graphic layout objectively evaluating the stylistic choices of a project, 

In [6]:
print(document[0].metadata)

{'source': 'docs/sample.txt'}


### Types of Document Loaders in LangChain

#### LangChain offers three main types of Document Loaders:

- `Transform Loaders`: These loaders handle different input formats and transform them into the Document format. For instance, consider a CSV file named "data.csv" with columns for "name" and "age". Using the CSVLoader, you can load the CSV data into Documents.
- `Public Dataset or Service Loaders`: LangChain provides loaders for popular public sources, allowing quick retrieval and creation of Documents. For example, the WikipediaLoader can load content from Wikipedia.
- `Proprietary Dataset or Service Loaders`: These loaders are designed to handle proprietary sources that may require additional authentication or setup. For instance, a loader could be created specifically for loading data from an internal database or an API with proprietary access.
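
As a rough illustration of what a transform loader does, the sketch below turns each row of a small CSV into a block of "column: value" lines using only the standard library. The inline CSV text and the `rows_to_texts` helper are assumptions made for illustration; this is not CSVLoader's implementation.

```python
import csv
import io


def rows_to_texts(csv_text: str) -> list[str]:
    """Render each CSV row as newline-separated 'column: value' pairs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return ["\n".join(f"{key}: {value}" for key, value in row.items()) for row in reader]


# Hypothetical file contents, matching the "name"/"age" example above.
sample_csv = "name,age\nAlice,30\nBob,25\n"

texts = rows_to_texts(sample_csv)
for text in texts:
    print(text)
    print("------")
```

Each row becomes one self-contained piece of text, which is the property that matters downstream: a row can be embedded or retrieved without the rest of the file.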

### Transform Loader example

In [8]:
# CSVLoader

from langchain_community.document_loaders import CSVLoader
 
# Load data from a CSV file using CSVLoader
loader = CSVLoader("docs/sale.csv")
documents = loader.load()
 
# Access the content and metadata of each document
for document in documents:
    content = document.page_content
    metadata = document.metadata
 
    # Process the content and metadata
    print(content)
    print("------")

saleid: 191228
saledate: 2013-12-31
customerid: 1134
tax: 0
shipping: 44.37
------
saleid: 191229
saledate: 2013-12-31
customerid: 14958
tax: 0
shipping: 9.95
------
saleid: 191230
saledate: 2013-12-31
customerid: 3275
tax: 0
shipping: 9.95
------
saleid: 191231
saledate: 2014-01-01
customerid: 17448
tax: 0
shipping: 20.11
------
saleid: 191232
saledate: 2014-01-01
customerid: 11852
tax: 0
shipping: 9.95
------
saleid: 191233
saledate: 2014-01-01
customerid: 5230
tax: 0
shipping: 9.95
------
saleid: 191234
saledate: 2014-01-02
customerid: 14956
tax: 0
shipping: 9.95
------
saleid: 191235
saledate: 2014-01-02
customerid: 10767
tax: 0
shipping: 46.34
------
saleid: 191236
saledate: 2014-01-02
customerid: 8907
tax: 2.32
shipping: 9.95
------
saleid: 191237
saledate: 2014-01-02
customerid: 12110
tax: 0
shipping: 18.75
------
saleid: 191238
saledate: 2014-01-02
customerid: 653
tax: 0
shipping: 39.05
------
saleid: 191239
saledate: 2014-01-02
customerid: 14473
tax: 0
shipping: 9.95
------
sa

### PyPDFLoader
`PyPDFLoader` loads each page of the PDF as a separate Document.

In [10]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("docs/Software-Engineer-CV.pdf")
pages = loader.load()

In [11]:
# Number the pages from 1 and print each page's text
for cnt, page in enumerate(pages, start=1):
    print("---- Document #", cnt)
    print(page.page_content.strip())


---- Document # 1
Name: Sunil Sharma                              Mobile: +91 9898989898  
 
Designation: Senior Technical Lead                      Mail Id: sunil.sharma @gmail.com  
 
Objective:   
Experienced S enior Software Developer with 1 2 years of hands -on expertise in 
designing, developing, and delivering high -quality software solutions.  
Proven track record of successfully leading and collaborating with cross -functional 
teams to deliver projects on time and within budget. Seeking to leverage my technical 
skills and leadership experience to contribute to innovative software projects.  
Education:  
Bachelor in Engineering in Electronics and Communication  
K.L.N.  College of Information Technology, Madurai - 2007  
Professional Summary:  
• 12 years  of experience in Software Development in C on  Linux Environment . 
• Over 5 years of programming  experience as an Oracle PL/SQL  developer in 
Analysis, Design and Implementation of business application using Oracle DBMS

### WebBaseLoader
This section shows how to use `WebBaseLoader` to load all the text from an HTML webpage into Documents that can be used downstream.

In [12]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://www.ibm.com/")
data = loader.load()

USER_AGENT environment variable not set, consider setting it to identify your requests.


In [13]:
data[0].page_content

"\n\n\n\n\n\n\n\n\nIBM - Australia\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nReinvent customer service with AI\n\n\n\n\n\n \n\n\n  \n  \n      Learn how to achieve ultimate ROI from AI with custom workflows and Salesforce\n  \n\n\n\n\n    \n\n\n\n\n\nRead the State of Salesforce report\n\n\nExplore AI for customer service\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\r\n                                    \n\n\n  \n  \n      Latest news\n  \n\n\n\n\n    \n\n\r\n                                \n\nInteract with watsonx Assistant\n\n\n\n\nMelbourne: Integration Labs Comes to You on 24th September\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nRecommended for you\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n    \n\n    \n    \n\n\n\n        \n        \n    \n\n        Take advantage of our current deals and promotions to save today\n        \n    \n\n\n\n    \n\n\n\n        \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n    \n\n    \n   

In [14]:
# Basic cleanup: trim leading/trailing whitespace, then replace each pair of newlines with one.
# A single pass only halves long runs of newlines, so many blank lines remain.
formatted_text = data[0].page_content.strip().replace("\n\n", "\n")

print(formatted_text)

IBM - Australia





























Reinvent customer service with AI


 

  
  
      Learn how to achieve ultimate ROI from AI with custom workflows and Salesforce
  


    


Read the State of Salesforce report

Explore AI for customer service












                                    

  
  
      Latest news
  


    

                                
Interact with watsonx Assistant


Melbourne: Integration Labs Comes to You on 24th September








Recommended for you










    
    
    

        
        
    
        Take advantage of our current deals and promotions to save today
        
    

    

        









    
    
    

        
        
    
        Get 40% off your subscription
        
    

    

        









    
    
    

        
        
    
        Save 10% on SPSS Statistics subscription
        
    

    

        









    
    
    

        
        
    
        Save 30% with your subscription
        
    

    

 

In [15]:
# Use a regular expression for more thorough cleaning
import re

# \s matches spaces, tabs, and newlines, so this collapses every run of
# whitespace (including paragraph breaks) into a single space
cleaned_text = re.sub(r"\s+", " ", formatted_text)

print(cleaned_text)

IBM - Australia Reinvent customer service with AI Learn how to achieve ultimate ROI from AI with custom workflows and Salesforce Read the State of Salesforce report Explore AI for customer service Latest news Interact with watsonx Assistant Melbourne: Integration Labs Comes to You on 24th September Recommended for you Take advantage of our current deals and promotions to save today Get 40% off your subscription Save 10% on SPSS Statistics subscription Save 30% with your subscription Browse our technology From our flagship products for enterprise hybrid cloud infrastructure to next-generation AI, security and storage solutions, find the answer to your business challenge. View all products Shop special offers and discounts AI & machine learning Use IBM Watsonx’s AI or build your own machine learning models Analytics Aggregate and analyze large datasets Compute & servers Run workloads on hybrid cloud infrastructure Databases Store, query and analyze structured data DevOps Manage infrastru
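
Collapsing all whitespace, as above, also removes every line boundary. If those boundaries matter downstream (for example, when splitting on newlines later), one alternative is to tidy each line and drop the empty ones. A minimal sketch, assuming a hypothetical helper named `tidy_whitespace` and a made-up sample string that imitates the scraped page:

```python
import re


def tidy_whitespace(text: str) -> str:
    """Collapse runs of spaces/tabs within each line and drop blank lines."""
    lines = (re.sub(r"[ \t\xa0]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


# Made-up sample that imitates the spacing of the scraped page.
raw = "\n\n\nIBM - Australia\n\n\n   \n  \n      Latest news\n  \n\n\r\n    Recommended for you\n"

print(tidy_whitespace(raw))
```

This keeps one item per line instead of a single long line, at the cost of treating every non-empty line as its own unit.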

### JSON Loader

In [66]:
#!pip install jq

In [18]:
from langchain_community.document_loaders import JSONLoader

import json
from pathlib import Path
from pprint import pprint

file_path='docs/sample.json'
data = json.loads(Path(file_path).read_text())

In [92]:
pprint(data)

{'employees': [{'email': 'shyamjaiswal@gmail.com', 'name': 'Shyam'},
               {'email': 'bob32@gmail.com', 'name': 'Bob'},
               {'email': 'jai87@gmail.com', 'name': 'Jai'}]}


In [19]:
loader = JSONLoader(
    file_path="docs/sample.json", 
    jq_schema=".employees[].email", 
    text_content=False)

data = loader.load()

In [98]:
data

[Document(page_content='shyamjaiswal@gmail.com', metadata={'source': '/Users/Manas/FreshersProjects/GenAI-Learning/Self-Learning/sample.json', 'seq_num': 1}),
 Document(page_content='bob32@gmail.com', metadata={'source': '/Users/Manas/FreshersProjects/GenAI-Learning/Self-Learning/sample.json', 'seq_num': 2}),
 Document(page_content='jai87@gmail.com', metadata={'source': '/Users/Manas/FreshersProjects/GenAI-Learning/Self-Learning/sample.json', 'seq_num': 3})]
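
For readers unfamiliar with jq, the expression `.employees[].email` selects the `email` field of every element in the `employees` array. The sketch below performs the same selection in plain Python, with the JSON shown earlier inlined so the block runs on its own; the numbering in the loop simply imitates the `seq_num` values seen in the output.

```python
import json

# Same structure as docs/sample.json, inlined so the example is self-contained.
raw_json = """
{"employees": [
    {"email": "shyamjaiswal@gmail.com", "name": "Shyam"},
    {"email": "bob32@gmail.com", "name": "Bob"},
    {"email": "jai87@gmail.com", "name": "Jai"}
]}
"""

parsed = json.loads(raw_json)

# Plain-Python equivalent of the jq expression `.employees[].email`.
emails = [employee["email"] for employee in parsed["employees"]]

for seq_num, email in enumerate(emails, start=1):
    print(seq_num, email)
```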

## Public Dataset or Service Loaders

### Wikipedia Loader

In [9]:
from langchain_community.document_loaders import WikipediaLoader
 
# Load content from Wikipedia using WikipediaLoader
loader = WikipediaLoader("Machine_learning")
document = loader.load()

In [10]:
document[0].page_content

'Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data, and thus perform tasks without explicit instructions. Recently, generative artificial neural networks have been able to surpass many previous approaches in performance.Machine learning approaches have been applied to many fields including large language models, computer vision, speech recognition, email filtering, agriculture, and medicine, where it is too costly to develop algorithms to perform the needed tasks. ML is known in its application across business problems under the name predictive analytics. Although not all machine learning is statistically based, computational statistics is an important source of the field\'s methods.\nThe mathematical foundations of ML are provided by mathematical optimization (mathematical programming) methods. Data mining is a related (parallel) field of study, 

In [11]:
document[0].metadata

{'title': 'Machine learning',
 'summary': "Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data, and thus perform tasks without explicit instructions. Recently, generative artificial neural networks have been able to surpass many previous approaches in performance.Machine learning approaches have been applied to many fields including large language models, computer vision, speech recognition, email filtering, agriculture, and medicine, where it is too costly to develop algorithms to perform the needed tasks. ML is known in its application across business problems under the name predictive analytics. Although not all machine learning is statistically based, computational statistics is an important source of the field's methods.\nThe mathematical foundations of ML are provided by mathematical optimization (mathematical programming) methods. Data mining

### IMSDb Movie Script Loader

In [14]:
from langchain_community.document_loaders import IMSDbLoader

loader = IMSDbLoader("https://imsdb.com/scripts/BlacKkKlansman.html")

data = loader.load()

In [15]:
# Keep the first 5,000 characters and trim leading/trailing whitespace
formatted_text = data[0].page_content[:5000].strip()

# Print the formatted text
print(formatted_text)

BLACKKKLANSMAN
                         
                         
                         
                         
                                      Written by

                          Charlie Wachtel & David Rabinowitz

                                         and

                              Kevin Willmott & Spike Lee








                         FADE IN:
                         
          SCENE FROM "GONE WITH THE WIND"
                         
          Scarlett O'Hara, played by Vivian Leigh, walks through the
          Thousands of injured Confederate Soldiers pulling back to
          reveal the Famous Shot of the tattered Confederate Flag in
          "Gone with the Wind" as The Max Stein Music Score swells from
          Dixie to Taps.
                         
                                   BEAUREGARD- KLAN NARRATOR (O.S.)
                       They say they may have lost the
                       Battle but they didn't 

### YoutubeLoader

In [17]:
#!pip install --upgrade --quiet  youtube-transcript-api

In [25]:
from langchain_community.document_loaders import YoutubeLoader

loader = YoutubeLoader.from_youtube_url(
    "https://www.youtube.com/watch?v=QsYGlZkevEg", add_video_info=False
)

data = loader.load()

In [26]:
# Print the loaded documents (a list with one Document holding the transcript)
print(data)

[Document(page_content='LADIES AND GENTLEMEN, PEDRO PASCAL! [ CHEERS AND APPLAUSE ] >> THANK YOU, THANK YOU. THANK YOU VERY MUCH. I\'M SO EXCITED TO BE HERE. THANK YOU. I SPENT THE LAST YEAR SHOOTING A SHOW CALLED "THE LAST OF US" ON HBO. FOR SOME HBO SHOES, YOU GET TO SHOOT IN A FIVE STAR ITALIAN RESORT SURROUNDED BY BEAUTIFUL PEOPLE, BUT I SAID, NO, THAT\'S TOO EASY. I WANT TO SHOOT IN A FREEZING CANADIAN FOREST WHILE BEING CHASED AROUND BY A GUY WHOSE HEAD LOOKS LIKE A GENITAL WART. IT IS AN HONOR BEING A PART OF THESE HUGE FRANCHISEs LIKE "GAME OF THRONES" AND "STAR WARS," BUT I\'M STILL GETTING USED TO PEOPLE RECOGNIZING ME. THE OTHER DAY, A GUY STOPPED ME ON THE STREET AND SAYS, MY SON LOVES "THE MANDALORIAN" AND THE NEXT THING I KNOW, I\'M FACE TIMING WITH A 6-YEAR-OLD WHO HAS NO IDEA WHO I AM BECAUSE MY CHARACTER WEARS A MASK THE ENTIRE SHOW. THE GUY IS LIKE, DO THE MANDO VOICE, BUT IT\'S LIKE A BEDROOM VOICE. WITHOUT THE MASK, IT JUST SOUNDS PORNY. PEOPLE WALKING BY ON THE STR

#### Add video info and language preferences
- `language`: a list of language codes in descending priority; `en` by default.
- `translation`: the language code to translate an available transcript into.
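
The "descending priority" idea can be sketched without the library: walk the preferred codes in order and take the first one that is available. `pick_language` and the list of available transcript languages are hypothetical and exist only for illustration; this is not YoutubeLoader's internal logic.

```python
from typing import Optional


def pick_language(preferred: list[str], available: list[str]) -> Optional[str]:
    """Return the first preferred code that is available, or None."""
    for code in preferred:
        if code in available:
            return code
    return None


# Hypothetical set of transcript languages for some video.
available_transcripts = ["id", "fr"]

print(pick_language(["en", "id"], available_transcripts))
print(pick_language(["en"], available_transcripts))
```

When no preferred language is available, a translation preference is what allows an existing transcript to be used anyway.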

In [104]:
loader = YoutubeLoader.from_youtube_url(
    "https://www.youtube.com/watch?v=1W8o0F_l6hA",
    add_video_info=True,
    language=["en", "id"],
    translation="en",
)
ytdata = loader.load()

In [105]:
ytdata

[Document(page_content='We don\'t have to look far to see how humanity\'s fight against climate change might play out. Science fiction authors from around the world, writing from different traditions, coalesce on one point. A divided world, unable to fight the common threat. In Cixin Liu\'s "The Three-Body Problem" trilogy, the threat of aliens from a dying planet invading to conquer Earth does not bring humanity together. Instead, a potent combination of fear, jingoism and competition for scarce resources among countries fractures the world into competing geoeconomic blocs centered around three powers: the United States, China and Europe. And none of these blocs succeed in staving off the alien invasion with, as you can imagine, disastrous consequences for the survival of humanity. Sorry for the spoilers about the books. (Laughter) In George Orwell\'s "1984," the world’s three superstates -- Oceania, Eurasia and East Asia -- fight each other in perpetuity in a disputed area located mo

In [106]:
# Keep the first 5,000 characters and trim leading/trailing whitespace
formatted_text = ytdata[0].page_content[:5000].strip()

# Print the formatted text
print(formatted_text)

We don't have to look far to see how humanity's fight against climate change might play out. Science fiction authors from around the world, writing from different traditions, coalesce on one point. A divided world, unable to fight the common threat. In Cixin Liu's "The Three-Body Problem" trilogy, the threat of aliens from a dying planet invading to conquer Earth does not bring humanity together. Instead, a potent combination of fear, jingoism and competition for scarce resources among countries fractures the world into competing geoeconomic blocs centered around three powers: the United States, China and Europe. And none of these blocs succeed in staving off the alien invasion with, as you can imagine, disastrous consequences for the survival of humanity. Sorry for the spoilers about the books. (Laughter) In George Orwell's "1984," the world’s three superstates -- Oceania, Eurasia and East Asia -- fight each other in perpetuity in a disputed area located mostly around parts of Africa 

## Text Splitters

Once you've loaded documents, you'll often want to transform them to better suit your application. The simplest example is splitting a long document into smaller chunks that fit into your model's context window. LangChain has a number of built-in document transformers that make it easy to split, combine, filter, and otherwise manipulate documents.

When you want to deal with long pieces of text, it is necessary to split up that text into chunks. As simple as this sounds, there is a lot of potential complexity here. Ideally, you want to keep the semantically related pieces of text together. What "semantically related" means could depend on the type of text. This notebook showcases several ways to do that.

At a high level, text splitters work as follows:

- Split the text up into small, semantically meaningful chunks (often sentences).
- Start combining these small chunks into a larger chunk until you reach a certain size (as measured by some function).
- Once you reach that size, make that chunk its own piece of text and then start creating a new chunk of text with some overlap (to keep context between chunks).

That means there are two different axes along which you can customize your text splitter:

- How the text is split
- How the chunk size is measured
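
The split, merge, and overlap steps above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not LangChain's implementation: `split_with_overlap` is a hypothetical helper, overlap is carried over as whole pieces rather than partial text, and a single piece longer than `chunk_size` is kept as-is.

```python
from typing import Callable


def split_with_overlap(
    text: str,
    separator: str = "\n\n",
    chunk_size: int = 200,
    chunk_overlap: int = 20,
    length_function: Callable[[str], int] = len,
) -> list[str]:
    """Greedy split-then-merge sketch with whole-piece overlap."""
    # Axis 1: how the text is split.
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []

    for piece in pieces:
        candidate = separator.join(current + [piece])
        # Axis 2: how the chunk size is measured.
        if current and length_function(candidate) > chunk_size:
            chunks.append(separator.join(current))
            # Carry trailing pieces forward while they fit the overlap budget.
            carried: list[str] = []
            for prev in reversed(current):
                if length_function(separator.join([prev] + carried)) > chunk_overlap:
                    break
                carried.insert(0, prev)
            # Drop carried pieces that would push the next chunk over the limit.
            while carried and length_function(separator.join(carried + [piece])) > chunk_size:
                carried.pop(0)
            current = carried
        current.append(piece)

    if current:
        chunks.append(separator.join(current))
    return chunks


sample = "aaaa\n\nbbbb\n\ncccc\n\ndddd"
for chunk in split_with_overlap(sample, chunk_size=10, chunk_overlap=4):
    print(repr(chunk))
```

Swapping `length_function` for a token counter changes how size is measured without touching the splitting logic, which is why the two axes can be customized independently.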

In [165]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=200,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)

In [166]:
loader = WebBaseLoader("https://www.ibm.com/")
data = loader.load()

In [173]:
chunks = text_splitter.split_text(data[0].page_content)
len(chunks)

22

In [169]:
for chunk in chunks:
    print(chunk)
    print('----')

IBM - India

Building trust with Responsible AI

 


  
  
      Being responsible means being trustworthy. Can you truly rely on your AI solution?
  


    

Learn Responsible AI
----
Learn Responsible AI


Meet watsonx.governance


See how watsonx is enhancing the fan experience at this year’s GRAMMYs

Hybrid cloud can help unlock the power of GenAI

Recommended for you
----
Move data of any size across any distance
----
Create, manage, secure, and socialize your APIs
        
    

    

        

    

    

        
        
    

        Manage and protect your mobile workforce
----
Save 10% on SPSS Statistics subscription
        
    

    

        


                

  
    Browse our technology
----
From our flagship products for enterprise hybrid cloud infrastructure to next-generation AI, security and storage solutions, find the answer to your business challenge.
----
View all products
                
            

                Shop special offers and discounts
----
A

In [174]:
documents = text_splitter.create_documents([data[0].page_content])
len(documents)

22

In [176]:
for doc in documents:
    print(doc)
    print('----')

page_content='IBM - India\n\nBuilding trust with Responsible AI\n\n \n\n\n  \n  \n      Being responsible means being trustworthy. Can you truly rely on your AI solution?\n  \n\n\n    \n\nLearn Responsible AI'
----
page_content='Learn Responsible AI\n\n\nMeet watsonx.governance\n\n\nSee how watsonx is enhancing the fan experience at this year’s GRAMMYs\n\nHybrid cloud can help unlock the power of GenAI\n\nRecommended for you'
----
page_content='Move data of any size across any distance'
----
page_content='Create, manage, secure, and socialize your APIs\n        \n    \n\n    \n\n        \n\n    \n\n    \n\n        \n        \n    \n\n        Manage and protect your mobile workforce'
----
page_content='Save 10% on SPSS Statistics subscription\n        \n    \n\n    \n\n        \n\n\n                \n\n  \n    Browse our technology'
----
page_content='From our flagship products for enterprise\xa0hybrid cloud infrastructure\xa0to next-generation AI, security and storage solutions, find t

## RecursiveCharacterTextSplitter

This text splitter is the recommended one for generic text. It is parameterized by a list of separators and tries to split on them in order until the chunks are small enough. The default list is `["\n\n", "\n", " ", ""]`. This has the effect of keeping paragraphs (and then sentences, and then words) together as long as possible, since those tend to be the most strongly related pieces of text.
- How the text is split: by list of characters.
- How the chunk size is measured: by number of characters.
- `chunk_size` and `chunk_overlap` control the result: `split_text` recursively splits the text on successive separators until each piece is no longer than `chunk_size`, and consecutive chunks can share up to `chunk_overlap` characters.
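
The recursive idea can be illustrated with a short standalone sketch. It is a simplification under stated assumptions, not the library's code: `recursive_split` and `merge_pieces` are hypothetical helpers, chunk overlap is omitted entirely, and size is measured with `len`.

```python
def merge_pieces(pieces: list[str], separator: str, chunk_size: int) -> list[str]:
    """Greedily join adjacent pieces while the result fits in chunk_size."""
    merged: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                merged.append(current)
            current = piece
    if current:
        merged.append(current)
    return merged


def recursive_split(
    text: str,
    chunk_size: int,
    separators: tuple[str, ...] = ("\n\n", "\n", " ", ""),
) -> list[str]:
    """Split on the coarsest separator; recurse with finer ones for oversized pieces."""
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, finer = separators[0], separators[1:]
    chunks: list[str] = []
    small: list[str] = []
    for piece in text.split(separator):
        if not piece.strip():
            continue  # skip empty or whitespace-only pieces
        if len(piece) <= chunk_size:
            small.append(piece)
        else:
            # Flush the pieces that already fit, then split the oversized one further.
            chunks.extend(merge_pieces(small, separator, chunk_size))
            small = []
            chunks.extend(recursive_split(piece, chunk_size, finer))
    chunks.extend(merge_pieces(small, separator, chunk_size))
    return chunks


sample = (
    "First paragraph.\n\n"
    "Second paragraph is quite a bit longer than the first.\n\n"
    "Third."
)
for chunk in recursive_split(sample, chunk_size=30):
    print(repr(chunk))
```

Paragraphs that already fit are left whole, and only the oversized paragraph is broken down at word boundaries, which is the behavior the separator ordering is meant to produce.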

In [186]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

rectext_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)

In [187]:
texts = rectext_splitter.create_documents([data[0].page_content])

In [188]:
for text in texts:
    print(text)
    print("-----")

page_content='IBM - India'
-----
page_content='Building trust with Responsible AI'
-----
page_content='Being responsible means being trustworthy. Can you truly rely on your AI solution?'
-----
page_content='Learn Responsible AI\n\n\nMeet watsonx.governance'
-----
page_content='See how watsonx is enhancing the fan experience at this year’s GRAMMYs'
-----
page_content='Hybrid cloud can help unlock the power of GenAI\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nRecommended for you'
-----
page_content='Move data of any size across any distance'
-----
page_content='Create, manage, secure, and socialize your APIs'
-----
page_content='Manage and protect your mobile workforce'
-----
page_content='Save 10% on SPSS Statistics subscription'
-----
page_content='Browse our technology'
-----
page_content='From our flagship products for enterprise\xa0hybrid cloud infrastructure\xa0to next-generation AI,'
-----
page_content='next-generation AI, security and storage solutions, find the answer to your busines