# Document Loading

Document loading in LangChain is the process of reading data from sources such as files, databases, and web pages into Document objects. Each Document holds the text content (page_content) and metadata such as the source URL or title. Later steps such as text analysis and summarization then work on these Document objects.

Source: https://python.langchain.com/docs/integrations/document_loaders
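
To make the shape of a Document concrete, here is a minimal stand-in built with a dataclass. It is not the library class (that one is `langchain_core.documents.Document`); the name `SketchDocument` and the sample values are made up for illustration, and only the two fields described above are modelled.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SketchDocument:
    """Hypothetical stand-in for a LangChain Document: text plus metadata."""

    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


doc = SketchDocument(
    page_content="Lorem ipsum dolor sit amet.",
    metadata={"source": "test.txt"},  # illustrative value, not a real path
)

print(doc.page_content)
print(doc.metadata["source"])
```

The loaders in the cells below return lists of objects with this same pair of fields, which is why `documents[0].page_content` and `documents[0].metadata` work on every one of them.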

In [4]:
pip install -qU langchain_community

Note: you may need to restart the kernel to use updated packages.


In [11]:

from langchain_community.document_loaders import TextLoader

loader = TextLoader(r"D:\Project\Generative AI\Productimate.io\Modules\RAG\Indexing\test.txt")
document = loader.load()
print(document[0])

page_content='Lorem ipsum dolor sit amet, consectetur adipiscing elit. Fusce augue ex, vulputate ac nibh in, venenatis imperdiet ante. Nulla ut accumsan nibh. Sed enim mi, eleifend sit amet tellus et, tempor congue risus. In consequat, purus congue mollis posuere, urna ex scelerisque lorem, sit amet commodo purus arcu ut lacus. Nullam sed porttitor velit, sit amet placerat mauris. Sed et convallis dui, nec ornare ante. Vestibulum dolor justo, commodo vel sapien ut, ultricies tristique magna. Pellentesque tortor sem, commodo nec finibus at, tincidunt a ipsum. Praesent turpis sem, tempor ut ex non, gravida commodo nunc. Duis leo orci, laoreet eu ultrices nec, auctor at magna.

Interdum et malesuada fames ac ante ipsum primis in faucibus. Nam ultrices sed mi ut maximus. Praesent tincidunt semper mauris. Aenean varius neque nulla. Orci varius natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Sed ac bibendum odio. Cras ipsum velit, aliquet vel elit at, placerat euis

In [6]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://www.espn.com/")
documents = loader.load()
documents[0]

USER_AGENT environment variable not set, consider setting it to identify your requests.


Document(metadata={'source': 'https://www.espn.com/', 'title': 'ESPN - Serving Sports Fans. Anytime. Anywhere.', 'description': 'Visit ESPN for live scores, highlights and sports news. Stream exclusive games on ESPN+ and play fantasy sports.', 'language': 'en'}, page_content="\n\n\n\n\n\n\n\n\nESPN - Serving Sports Fans. Anytime. Anywhere.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n        Skip to main content\n    \n\n        Skip to navigation\n    \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n<\n\n>\n\n\n\n\n\n\n\n\n\nMenuESPN\n\n\n\n\n\nscores\n\n\n\nNFLNBANHLMLBWNBASoccerMMAMore SportsBoxingNCAACricketF1GamingGolfHorseLLWSNASCARNLLNBA G LeagueNBA Summer LeagueNCAAFNCAAMNCAAWNWSLOlympicsPLLProfessional WrestlingRacingRN BBRN FBRugbySports BettingTennisTGLUFLX GamesEditionsFantasyWatchESPN BETESPN+\n\n\n\n\n\n\n\n\n\n\n
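
The USER_AGENT warning above can probably be silenced by setting the environment variable before the loader is imported. This is a sketch under that assumption: the value `my-rag-notebook/0.1` is a placeholder, and setting the variable ahead of any `langchain_community` import is simply the safe order, since the exact moment the library reads it may vary by version.

```python
import os

# Set the variable before importing any langchain_community loader.
# "my-rag-notebook/0.1" is a placeholder; use a string that identifies you.
# setdefault keeps an existing value if one is already configured.
os.environ.setdefault("USER_AGENT", "my-rag-notebook/0.1")

print(os.environ["USER_AGENT"])
```

In a notebook, this cell would need to run before the first `from langchain_community.document_loaders import ...` cell, which usually means restarting the kernel.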

In [13]:

# checking out the metadata
documents[0].metadata

{'source': 'https://www.espn.com/',
 'title': 'ESPN - Serving Sports Fans. Anytime. Anywhere.',
 'description': 'Visit ESPN for live scores, highlights and sports news. Stream exclusive games on ESPN+ and play fantasy sports.',
 'language': 'en'}

In [12]:

# load multiple pages
loader = WebBaseLoader(["https://www.espn.com/", "https://python.langchain.com/docs/integrations/document_loaders/web_base/"])
documents = loader.load()
documents[1].page_content

'\n\n\n\n\nWebBaseLoader | 🦜️🔗 LangChain\n\n\n\n\n\n\nSkip to main contentWe are growing and hiring for multiple roles for LangChain, LangGraph and LangSmith.  Join our team!IntegrationsAPI ReferenceMoreContributingPeopleError referenceLangSmithLangGraphLangChain HubLangChain JS/TSv0.3v0.3v0.2v0.1💬SearchProvidersAnthropicAWSGoogleHugging FaceMicrosoftOpenAIMoreProvidersAbsoAcreomActiveloop Deep LakeADS4GPTsAerospikeAgentQLAI21 LabsAimAINetworkAirbyteAirtableAlchemyAleph AlphaAlibaba CloudAnalyticDBAnnoyAnthropicAnyscaleApache Software FoundationApache DorisApifyAppleArangoDBArceeArcGISArgillaArizeArthurArxivAscendAskNewsAssemblyAIAstra DBAtlasAwaDBAWSAZLyricsAzure AIBAAIBagelBagelDBBaichuanBaiduBananaBasetenBeamBeautiful SoupBibTeXBiliBiliBittensorBlackboardbookend.aiBoxBrave SearchBreebs (Open Knowledge)Bright DataBrowserbaseBrowserlessByteDanceCassandraCerebrasCerebriumAIChaindeskChromaClarifaiClearMLClickHouseClickUpCloudflareClovaCnosDBCogneeCogniSwitchCohereCollege Conf
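
The page_content returned for web pages is full of long newline runs. One possible cleanup, sketched with the standard library only, is to collapse them before further processing. The helper name `collapse_whitespace` is hypothetical, and the sample string is a shortened imitation of the ESPN output above rather than the real page.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Collapse runs of blank lines and stray indentation in scraped text."""
    # Drop spaces and tabs that sit directly around a newline.
    text = re.sub(r"[ \t]*\n[ \t]*", "\n", text)
    # Reduce three or more consecutive newlines to a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


# Shortened imitation of scraped output, for illustration only.
raw = (
    "\n\n\n\nESPN - Serving Sports Fans.\n\n\n\n"
    "        Skip to main content\n    \n\n        Skip to navigation\n"
)

print(collapse_whitespace(raw))
```

With loaded documents, the same function could be applied to each `doc.page_content` in a loop.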

## Loading CSV

In [15]:

from langchain_community.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(file_path=r"D:\Project\Generative AI\Productimate.io\Modules\RAG\Indexing\Air_Quality.csv", content_columns=["Name"])
documents = loader.load()
len(documents)

18025
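
The count of 18025 is one document per CSV row. The sketch below imitates that row-to-document mapping with the standard `csv` module so the idea can be checked without the real file. The sample rows and the `Measure` and `Data Value` columns are invented, plain dicts stand in for Document objects, and the "column: value" text format is an assumption about how the loader renders a row.

```python
import csv
import io

# Made-up rows for illustration; the real Air_Quality.csv is not reproduced here.
sample_csv = io.StringIO(
    "Name,Measure,Data Value\n"
    "Ozone (O3),Mean,30.1\n"
    "Fine particles (PM 2.5),Mean,9.4\n"
)

content_columns = ["Name"]  # same restriction as in the cell above

rows_as_docs = []
for row_number, row in enumerate(csv.DictReader(sample_csv)):
    # Each row becomes one record: "column: value" lines as the text,
    # plus the row index as metadata.
    text = "\n".join(f"{col}: {row[col]}" for col in content_columns)
    rows_as_docs.append({"page_content": text, "metadata": {"row": row_number}})

print(len(rows_as_docs))
print(rows_as_docs[0]["page_content"])
```

Restricting the content to one column keeps each document short, at the cost of dropping the other columns from the text that later steps will see.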

## Loading PDF


In [17]:
!pip install pypdf

# Load Moby Dick into LangChain documents, one per PDF page
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader(r"D:\Project\Generative AI\Productimate.io\Modules\RAG\Indexing\moby-dick.pdf")
documents = loader.load()

print("number of pages: ", len(documents))

# Print the first 500 characters of the page at index 7, where Chapter I begins
print("First 500 characters of the page at index 7:")
print(documents[7].page_content[:500])

Collecting pypdf
  Using cached pypdf-5.6.1-py3-none-any.whl.metadata (7.2 kB)
Downloading pypdf-5.6.1-py3-none-any.whl (304 kB)
Installing collected packages: pypdf
Successfully installed pypdf-5.6.1
number of pages:  468
First 500 characters of the page at index 7:
CHAPTER I.
LOOMINGS
Call me Ishmael. Some years ago—never mind how long precisely—having little
or no money in my purse, and nothing particular to interest me on shore, I thought
I would sail about a little and see the watery part of the world. It is a way I have
of driving oﬀ the spleen, and regulating the circulation. Whenever I ﬁnd myself
growing grim about the mouth; whenever it is a damp, drizzly November in my
soul; whenever I ﬁnd myself involuntarily pausing before coﬃn warehouses, and
br
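
The cell above hard-codes index 7 to reach the start of the novel. A small search can locate such a page instead of guessing the index. This is a sketch with a hypothetical helper, `first_page_containing`, and stand-in page texts; with the real loader the list would be `[doc.page_content for doc in documents]`.

```python
from typing import Optional


def first_page_containing(pages: list[str], marker: str) -> Optional[int]:
    """Return the zero-based index of the first page whose text contains marker."""
    for index, text in enumerate(pages):
        if marker in text:
            return index
    return None


# Stand-in page texts, for illustration only.
pages = ["Title page", "Contents", "CHAPTER I.\nLOOMINGS\nCall me Ishmael."]

print(first_page_containing(pages, "CHAPTER I."))
```

On a real book the table of contents may also contain the marker, so a more specific string (for example the chapter title plus its first words) is likely to be a safer choice.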


In [18]:
pip install --upgrade --quiet html2text

Note: you may need to restart the kernel to use updated packages.


In [19]:
from langchain_community.document_loaders import AsyncHtmlLoader
from langchain_community.document_transformers import Html2TextTransformer

loader = AsyncHtmlLoader(["https://www.espn.com/"])
documents = loader.load()
html2text = Html2TextTransformer()
documents2 = html2text.transform_documents(documents)
print(documents2[0].page_content)

USER_AGENT environment variable not set, consider setting it to identify your requests.


Skip to main content  Skip to navigation

<

>

Menu

## ESPN

  *   *   *   * scores

  * NFL
  * NBA
  * NHL
  * MLB
  * WNBA
  * Soccer
  * MMA
  * More Sports

    * Boxing
    * NCAA
    * Cricket
    * F1
    * Gaming
    * Golf
    * Horse
    * LLWS
    * NASCAR
    * NLL
    * NBA G League
    * NBA Summer League
    * NCAAF
    * NCAAM
    * NCAAW
    * NWSL
    * Olympics
    * PLL
    * Professional Wrestling
    * Racing
    * RN BB
    * RN FB
    * Rugby
    * Sports Betting
    * Tennis
    * TGL
    * UFL
    * X Games

  * Editions
  * Fantasy
  * Watch
  * ESPN BET
  * ESPN+

##

  * Subscribe Now
  * NBA Finals Game 7: Pacers vs. Thunder

  * The Ultimate Fighter

  * UFC 317: Topuria vs. Oliveira (Jun. 28, PPV)

## Quick Links

  * NCAA Baseball Tournament

  * NBA Playoffs

  * NBA Draft

  * WNBA Season Schedule

  * Where To Watch

  * Today's Top Odds

  * ESPN Radio: Listen Live

## Favorites

Manage Favorites

## Customize ESPN

Create AccountLog In

## Fanta