## UnstructuredURLLoader

LangChain's UnstructuredURLLoader internally uses the unstructured Python library to load content from URLs.

https://unstructured-io.github.io/unstructured/introduction.html

https://pypi.org/project/unstructured/#description

In [None]:
# Install the required libraries; libmagic is used for file type detection
# !pip3 install unstructured libmagic python-magic python-magic-bin

In [None]:
from langchain.document_loaders import UnstructuredURLLoader

loader = UnstructuredURLLoader(
    urls = [
        "https://en.wikipedia.org/wiki/One_Piece",
        "https://www.moneycontrol.com/news/business/banks/hdfc-bank-re-appoints-sanmoy-chakrabarti-as-chief-risk-officer-11259771.html",
        "https://www.moneycontrol.com/news/business/markets/market-corrects-post-rbi-ups-inflation-forecast-icrr-bet-on-these-top-10-rate-sensitive-stocks-ideas-11142611.html"
    ]
)

In [23]:
data = loader.load()
data  # a list of Document objects, one per URL; len(data) == 3

[Document(metadata={'source': 'https://en.wikipedia.org/wiki/One_Piece'}, page_content='One Piece\n\nАдыгэбзэ\n\nالعربية\n\nAragonés\n\nAsturianu\n\nAzərbaycanca\n\nবাংলা\n\nBikol Central\n\nБългарски\n\nBrezhoneg\n\nCatalà\n\nČeština\n\nDansk\n\nالدارجة\n\nDeutsch\n\nEesti\n\nΕλληνικά\n\nEspañol\n\nEsperanto\n\nفارسی\n\nFrançais\n\nGalego\n\n한국어\n\nHausa\n\nHawaiʻi\n\nՀայերեն\n\nBahasa Indonesia\n\nItaliano\n\nעברית\n\nJawa\n\nქართული\n\nKurdî\n\nLadin\n\nLatviešu\n\nLëtzebuergesch\n\nMagyar\n\nМакедонски\n\nمصرى\n\nBahasa Melayu\n\nМонгол\n\nမြန်မာဘာသာ\n\nNederlands\n\nनेपाली\n\n日本語\n\nNapulitano\n\nOccitan\n\nOʻzbekcha / ўзбекча\n\nPapiamentu\n\nPolski\n\nPortuguês\n\nРусский\n\nShqip\n\nSicilianu\n\nSlovenčina\n\nSlovenščina\n\nکوردی\n\nСрпски / srpski\n\nSrpskohrvatski / српскохрватски\n\nSvenska\n\nTagalog\n\nதமிழ்\n\nTaqbaylit\n\nไทย\n\nTürkçe\n\nУкраїнська\n\nVèneto\n\nTiếng Việt\n\n吴语\n\n粵語\n\n中文\n\nBatak Mandailing\n\nEdit links\n\nFrom Wikipedia, the free encyclopedia\n\nJap

In [26]:
data[0].page_content[0:33]  # string: the first 33 characters of the first document

'One Piece\n\nАдыгэбзэ\n\nالعربية\n\nAra'

In [27]:
data[0].metadata #dictionary of metadata for the first document

{'source': 'https://en.wikipedia.org/wiki/One_Piece'}

In [29]:
# Check the types; a notebook cell displays only the value of its last expression
type(data)  # list
type(data[0].page_content[0:33])  # str
type(data[0].metadata)  # dict

dict

## RecursiveCharacterTextSplitter

In [1]:
# Sample text taken from Wikipedia

text = """One Piece (stylized in all caps) is a Japanese manga series written and illustrated by Eiichiro Oda. 
It has been serialized in Shueisha's shōnen manga magazine Weekly Shōnen Jump since July 1997, with its chapters compiled in 110 tankōbon volumes as of November 2024. 
The series follows the adventures of Monkey D. Luffy and his crew, the Straw Hat Pirates, as he explores the Grand Line in search of the mythical treasure known as the "One Piece" to become the next King of the Pirates.

The manga spawned a media franchise, having been adapted into a festival film by Production I.G, and an anime series by Toei Animation, which began broadcasting in 1999. 
Additionally, Toei has developed fourteen animated feature films, one original video animation, and thirteen television specials. 
Several companies have developed various types of merchandising and media, such as a trading card game and numerous video games. 
The manga series was licensed for an English language release in North America and the United Kingdom by Viz Media and in Australia by Madman Entertainment. 
The anime series was licensed by 4Kids Entertainment for an English-language release in North America in 2004 before the license was dropped and subsequently acquired by Funimation in 2007. 
Netflix released a live action TV series adaptation in 2023.

One Piece has received praise for its storytelling, world-building, art, characterization, and humour. 
It has received many awards and is ranked by critics, reviewers, and readers as one of the best manga of all time. 
By August 2022, it had over 516.6 million copies in circulation in 61 countries and regions worldwide, making it the best-selling manga series in history, and the best-selling comic series printed in a book volume. 
Several volumes of the manga have broken publishing records, including the highest initial print run of any book in Japan. 
In 2015 and 2022, One Piece set the Guinness World Record for "the most copies published for the same comic book series by a single author". 
It was the best-selling manga for eleven consecutive years from 2008 to 2018 and is the only manga that had an initial print of volumes of above 3 million continuously for more than 10 years, as well as the only one that had achieved more than 1 million copies sold in all of its over 100 published tankōbon volumes. 
One Piece is the only manga whose volumes have ranked first every year in Oricon's weekly comic chart existence since 2008."""

In [2]:
text

'One Piece (stylized in all caps) is a Japanese manga series written and illustrated by Eiichiro Oda. \nIt has been serialized in Shueisha\'s shōnen manga magazine Weekly Shōnen Jump since July 1997, with its chapters compiled in 110 tankōbon volumes as of November 2024. \nThe series follows the adventures of Monkey D. Luffy and his crew, the Straw Hat Pirates, as he explores the Grand Line in search of the mythical treasure known as the "One Piece" to become the next King of the Pirates.\n\nThe manga spawned a media franchise, having been adapted into a festival film by Production I.G, and an anime series by Toei Animation, which began broadcasting in 1999. \nAdditionally, Toei has developed fourteen animated feature films, one original video animation, and thirteen television specials. \nSeveral companies have developed various types of merchandising and media, such as a trading card game and numerous video games. \nThe manga series was licensed for an English language release in Nor

In [3]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

r_splitter = RecursiveCharacterTextSplitter(
    separators = ["\n\n", "\n", " "],  # separators tried in order (the library default is ["\n\n", "\n", " ", ""])
    chunk_size = 200,  # maximum size of each chunk, as measured by length_function
    chunk_overlap = 0,  # size of the overlap between consecutive chunks, used to preserve context
    length_function = len  # function that measures size; len counts characters, but any token counter can be passed instead
)
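
The `length_function` argument decides what "size" means for `chunk_size` and `chunk_overlap`. As a rough sketch, the hypothetical `word_count` helper below (not part of LangChain) measures length in whitespace-separated words instead of characters; a real token counter from a tokenizer library could be swapped in the same way.

```python
def word_count(text: str) -> int:
    """Hypothetical length function: counts whitespace-separated words."""
    return len(text.split())


sample = "One Piece is a Japanese manga series"
print(len(sample))         # length measured in characters
print(word_count(sample))  # length measured in words

# Assumed usage (not run here): passing length_function=word_count to
# RecursiveCharacterTextSplitter would make chunk_size a limit in words.
```

With a word-based length function, a `chunk_size` of 200 would mean at most 200 words per chunk rather than 200 characters, so the value usually needs to be adjusted when the length function changes.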

In [4]:
chunks = r_splitter.split_text(text)

for chunk in chunks:
    print(len(chunk))

100
166
198
20
169
129
128
156
189
60
102
114
199
14
122
140
199
116
123


**Detailed Workflow**

In [5]:
first_split = text.split("\n\n")[0]
first_split

'One Piece (stylized in all caps) is a Japanese manga series written and illustrated by Eiichiro Oda. \nIt has been serialized in Shueisha\'s shōnen manga magazine Weekly Shōnen Jump since July 1997, with its chapters compiled in 110 tankōbon volumes as of November 2024. \nThe series follows the adventures of Monkey D. Luffy and his crew, the Straw Hat Pirates, as he explores the Grand Line in search of the mythical treasure known as the "One Piece" to become the next King of the Pirates.'

In [6]:
len(first_split)

489

In [7]:
second_split = first_split.split("\n")
second_split

['One Piece (stylized in all caps) is a Japanese manga series written and illustrated by Eiichiro Oda. ',
 "It has been serialized in Shueisha's shōnen manga magazine Weekly Shōnen Jump since July 1997, with its chapters compiled in 110 tankōbon volumes as of November 2024. ",
 'The series follows the adventures of Monkey D. Luffy and his crew, the Straw Hat Pirates, as he explores the Grand Line in search of the mythical treasure known as the "One Piece" to become the next King of the Pirates.']

In [8]:
for split in second_split:
    print(len(split))

101
167
219
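
The first two lengths here (101 and 167) are one character larger than the first two chunk lengths printed by the splitter (100 and 166). A likely explanation, assuming the splitter's default whitespace stripping is in effect, is that each line ends with a trailing space that is removed from the final chunk. A minimal check on the first sentence:

```python
# First piece of the text, including the trailing space left before the newline
piece = "One Piece (stylized in all caps) is a Japanese manga series written and illustrated by Eiichiro Oda. "

print(len(piece))          # length including the trailing space
print(len(piece.strip()))  # length after stripping surrounding whitespace
```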


The third piece (219 characters) exceeds the chunk size of 200, so the splitter splits it further using the third separator, ' ' (a space).

In [9]:
second_split[2]

'The series follows the adventures of Monkey D. Luffy and his crew, the Straw Hat Pirates, as he explores the Grand Line in search of the mythical treasure known as the "One Piece" to become the next King of the Pirates.'

Splitting this on spaces (i.e. second_split[2].split(" ")) separates it into individual words. The splitter then merges those words back together into chunks whose size is as close to 200 as possible without exceeding it.
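
The split-and-merge step can be sketched in plain Python. This is a simplified illustration, not LangChain's actual implementation: the `merge_words` helper is hypothetical, it assumes `chunk_overlap = 0`, and it ignores details such as separator handling and whitespace stripping.

```python
def merge_words(words, chunk_size):
    """Greedily join words with single spaces so each chunk stays within chunk_size characters."""
    chunks, current = [], ""
    for word in words:
        candidate = word if not current else current + " " + word
        if len(candidate) <= chunk_size or not current:
            # The word still fits (or it is the first word of a new chunk)
            current = candidate
        else:
            # Adding the word would exceed the limit: close the chunk and start a new one
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


sentence = ('The series follows the adventures of Monkey D. Luffy and his crew, '
            'the Straw Hat Pirates, as he explores the Grand Line in search of the '
            'mythical treasure known as the "One Piece" to become the next King of the Pirates.')

for chunk in merge_words(sentence.split(" "), chunk_size=200):
    print(len(chunk), repr(chunk))
```

Under these assumptions the sketch produces two chunks of 198 and 20 characters, which matches the third and fourth lengths printed by the splitter earlier.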