## **Document Loaders In LangChain**

#### **TextLoader**

In [5]:
from langchain_community.document_loaders import TextLoader, CSVLoader

text_loader = TextLoader("nvda_news_1.txt")
text_loader.load()



In [6]:
type(text_loader)

langchain_community.document_loaders.text.TextLoader

In [7]:
text_loader.file_path

'nvda_news_1.txt'

#### **CSVLoader**

In [8]:
csv_loader = CSVLoader(file_path="movies.csv",source_column="title")
csv_data = csv_loader.load()
csv_data

[Document(metadata={'source': 'K.G.F: Chapter 2', 'row': 0}, page_content='movie_id: 101\ntitle: K.G.F: Chapter 2\nindustry: Bollywood\nrelease_year: 2022\nimdb_rating: 8.4\nstudio: Hombale Films\nlanguage_id: 3\nbudget: 1\nrevenue: 12.5\nunit: Billions\ncurrency: INR'),
 Document(metadata={'source': 'Doctor Strange in the Multiverse of Madness', 'row': 1}, page_content='movie_id: 102\ntitle: Doctor Strange in the Multiverse of Madness\nindustry: Hollywood\nrelease_year: 2022\nimdb_rating: 7\nstudio: Marvel Studios\nlanguage_id: 5\nbudget: 200\nrevenue: 954.8\nunit: Millions\ncurrency: USD'),
 Document(metadata={'source': 'Thor: The Dark World', 'row': 2}, page_content='movie_id: 103\ntitle: Thor: The Dark World\nindustry: Hollywood\nrelease_year: 2013\nimdb_rating: 6.8\nstudio: Marvel Studios\nlanguage_id: 5\nbudget: 165\nrevenue: 644.8\nunit: Millions\ncurrency: USD'),
 Document(metadata={'source': 'Thor: Ragnarok', 'row': 3}, page_content='movie_id: 104\ntitle: Thor: Ragnarok\nindus

In [11]:
csv_data[1].page_content

'movie_id: 102\ntitle: Doctor Strange in the Multiverse of Madness\nindustry: Hollywood\nrelease_year: 2022\nimdb_rating: 7\nstudio: Marvel Studios\nlanguage_id: 5\nbudget: 200\nrevenue: 954.8\nunit: Millions\ncurrency: USD'

In [12]:
csv_data[1].metadata

{'source': 'Doctor Strange in the Multiverse of Madness', 'row': 1}
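The output above suggests how each CSV row becomes a document: every column is rendered as a `column: value` line in `page_content`, and the value of `source_column` is stored under `metadata['source']` together with the row index. The snippet below is a minimal plain-Python sketch of that mapping, not CSVLoader's actual implementation. The helper name `rows_to_documents` and the three-column sample are assumptions made for illustration, and plain dictionaries stand in for LangChain's Document objects.

```python
import csv
import io

# Two rows taken from the output above, trimmed to three columns for brevity.
SAMPLE_CSV = """movie_id,title,industry
101,K.G.F: Chapter 2,Bollywood
102,Doctor Strange in the Multiverse of Madness,Hollywood
"""


def rows_to_documents(csv_text, source_column):
    """Hypothetical helper: build one dict per CSV row in the shape shown above."""
    documents = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_index, row in enumerate(reader):
        # Each column becomes a "name: value" line in the page content.
        page_content = "\n".join(f"{column}: {value}" for column, value in row.items())
        # The chosen column is used as the document's source; the row index is kept too.
        metadata = {"source": row[source_column], "row": row_index}
        documents.append({"page_content": page_content, "metadata": metadata})
    return documents


docs = rows_to_documents(SAMPLE_CSV, source_column="title")
print(docs[1]["metadata"])
print(docs[1]["page_content"])
```

Using a descriptive column such as the title as the source makes it easier to trace a retrieved chunk back to the row it came from.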

#### **UnstructuredURLLoader**

LangChain's UnstructuredURLLoader uses the unstructured Python library internally to load content from URLs.

In [13]:
# Install the required libraries; libmagic is used for file type detection
!pip3 install unstructured libmagic python-magic python-magic-bin

Collecting unstructured
  Using cached unstructured-0.18.18-py3-none-any.whl.metadata (25 kB)
Collecting libmagic
  Using cached libmagic-1.0-py3-none-any.whl
Collecting python-magic
  Using cached python_magic-0.4.27-py2.py3-none-any.whl.metadata (5.8 kB)
Collecting python-magic-bin
  Using cached python_magic_bin-0.4.14-py2.py3-none-win_amd64.whl.metadata (710 bytes)
Collecting nltk (from unstructured)
  Using cached nltk-3.9.2-py3-none-any.whl.metadata (3.2 kB)
Collecting beautifulsoup4 (from unstructured)
  Using cached beautifulsoup4-4.14.2-py3-none-any.whl.metadata (3.8 kB)
Collecting unstructured-client (from unstructured)
  Using cached unstructured_client-0.42.3-py3-none-any.whl.metadata (23 kB)
Collecting soupsieve>1.2 (from beautifulsoup4->unstructured)
  Using cached soupsieve-2.8-py3-none-any.whl.metadata (4.6 kB)
Collecting webencodings (from html5lib->unstructured)
  Using cached webencodings-0.5.1-py2.py3-none-any.whl.metadata (2.1 kB)
Collecting joblib (from nltk->unst

& was unexpected at this time.
The value specified in an AutoRun registry key could not be parsed.


In [15]:
from langchain_community.document_loaders import UnstructuredURLLoader

In [16]:
loader = UnstructuredURLLoader(
    urls = [
        "https://www.moneycontrol.com/news/business/banks/hdfc-bank-re-appoints-sanmoy-chakrabarti-as-chief-risk-officer-11259771.html",
        "https://www.moneycontrol.com/news/business/markets/market-corrects-post-rbi-ups-inflation-forecast-icrr-bet-on-these-top-10-rate-sensitive-stocks-ideas-11142611.html"
    ]
)

In [17]:
data = loader.load()
len(data)

2

In [18]:
data[0].page_content[0:100]

'English\n\nHindi\n\nGujarati\n\nSpecials\n\nHello, Login\n\nHello, Login\n\nLog-inor Sign-Up\n\nMy Account\n\nMy Pro'

In [19]:
data[0].metadata

{'source': 'https://www.moneycontrol.com/news/business/banks/hdfc-bank-re-appoints-sanmoy-chakrabarti-as-chief-risk-officer-11259771.html'}

#### **Text Splitters**

Why do we need text splitters in the first place?

LLMs have token limits, so a large text must be split into smaller chunks, each of which fits under the limit. LangChain provides several text splitter classes for this.
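To make the idea concrete, here is a minimal plain-Python sketch that packs whitespace-separated words into chunks of at most `max_words` words. It assumes that a word count is an acceptable crude stand-in for a token count, which real tokenizers do not guarantee. The function name `split_by_words` and the limit of 5 are illustrative choices, not part of LangChain.

```python
def split_by_words(text, max_words):
    """Split text into chunks of at most max_words whitespace-separated words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    # Step through the word list in strides of max_words and rejoin each slice.
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


sample = (
    "Interstellar is a 2014 epic science fiction film co-written, "
    "directed, and produced by Christopher Nolan."
)
chunks = split_by_words(sample, max_words=5)
for chunk in chunks:
    print(repr(chunk))
```

Because the split happens on word boundaries, no word is cut in half, but a chunk can still end in the middle of a sentence.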

In [20]:
# Sample text taken from Wikipedia

text = """Interstellar is a 2014 epic science fiction film co-written, directed, and produced by Christopher Nolan. 
It stars Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin, Ellen Burstyn, Matt Damon, and Michael Caine. 
Set in a dystopian future where humanity is embroiled in a catastrophic blight and famine, the film follows a group of astronauts who travel through a wormhole near Saturn in search of a new home for humankind.

Brothers Christopher and Jonathan Nolan wrote the screenplay, which had its origins in a script Jonathan developed in 2007 and was originally set to be directed by Steven Spielberg. 
Kip Thorne, a Caltech theoretical physicist and 2017 Nobel laureate in Physics,[4] was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar. 
Cinematographer Hoyte van Hoytema shot it on 35 mm movie film in the Panavision anamorphic format and IMAX 70 mm. Principal photography began in late 2013 and took place in Alberta, Iceland, and Los Angeles. 
Interstellar uses extensive practical and miniature effects, and the company Double Negative created additional digital effects.

Interstellar premiered in Los Angeles on October 26, 2014. In the United States, it was first released on film stock, expanding to venues using digital projectors. The film received generally positive reviews from critics and grossed over $677 million worldwide ($715 million after subsequent re-releases), making it the tenth-highest-grossing film of 2014. 
It has been praised by astronomers for its scientific accuracy and portrayal of theoretical astrophysics.[5][6][7] Interstellar was nominated for five awards at the 87th Academy Awards, winning Best Visual Effects, and received numerous other accolades."""

#### **Manual approach of splitting the text into chunks**

In [21]:
# Suppose the LLM's limit is 100. Treating characters as a rough stand-in for tokens,
# the simplest approach is to slice off the first 100 characters

text[0:100]

'Interstellar is a 2014 epic science fiction film co-written, directed, and produced by Christopher N'
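The slice above keeps only the first 100 characters and cuts the last word in half ("Christopher N"). Covering the whole text means repeating the slice in a loop. The snippet below is a minimal sketch of that, assuming a fixed character budget per chunk; the helper name `chunk_by_characters` and the short sample string are illustrative and not from the notebook.

```python
def chunk_by_characters(text, chunk_size):
    """Split text into consecutive slices of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Each slice starts where the previous one ended, so no text is lost.
    return [
        text[start:start + chunk_size]
        for start in range(0, len(text), chunk_size)
    ]


# A 250-character stand-in for the longer passage used in the notebook.
sample_text = "abcdefghij" * 25
chunks = chunk_by_characters(sample_text, chunk_size=100)
for chunk in chunks:
    print(len(chunk), repr(chunk[:20]))
```

This covers the full text, but it still splits wherever the character count runs out, so words and sentences can be broken across chunks.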