# Text Loading
This project processes data in several phases before it can be used for embedding.<br>
**Text loading** is one of the important phases: it loads data from the documents or URLs provided by the client. This can be done with web-scraping tools or text loaders.
<br>
<br>
For this project I used the LangChain library.<br>
***

In [None]:
#Installing langchain
!pip install langchain langchain-community

Collecting langchain-community
  Downloading langchain_community-0.3.20-py3-none-any.whl.metadata (2.4 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain-community)
  Downloading pydantic_settings-2.8.1-py3-none-any.whl.metadata (3.5 kB)
Collecting httpx-sse<1.0.0,>=0.4.0 (from langchain-community)
  Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading marshmallow-3.26.1-py3-none-any.whl.metadata (7.3 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB)
Collecting python-dotenv>=0.21.0 (from pydantic-settings<3.0.0,>=2.4.0->langchain-community)
  Downloading python_dotenv-1.1.0-py3-none-any.whl.metadata (24 kB

***
The LangChain library includes several loaders that can be used to load text from a provided document.<br>
To learn more about these loaders, see the official documentation: **[Link](https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/)** 💁‍♂️
<br>
<br>
Some examples of document loaders are:

In [None]:
#Example 1. Default text loader
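#Loaders live in langchain-community (installed above); the langchain.document_loaders path is deprecated
from langchain_community.document_loaders import TextLoader as tl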

In [None]:
text = tl("/content/drive/MyDrive/text/kanji.txt")
data = text.load()
data

[Document(metadata={'source': '/content/drive/MyDrive/text/kanji.txt'}, page_content='Kanji (漢字, pronounced [kaɲdʑi] ⓘ) are logographic Chinese characters, adapted from Chinese script, used in the writing of Japanese.[1] They were made a major part of the Japanese writing system during the time of Old Japanese and are still used, along with the subsequently-derived syllabic scripts of hiragana and katakana.[2][3] The characters have Japanese pronunciations; most have two, with one based on the Chinese sound. A few characters were invented in Japan by constructing character components derived from other Chinese characters. After the Meiji Restoration, Japan made its own efforts to simplify the characters, now known as shinjitai, by a process similar to China\'s simplification efforts, with the intention to increase literacy among the general public. Since the 1920s, the Japanese government has published character lists periodically to help direct the education of its citizenry through t

***
The document loader returns a list of `Document` objects built from the text extracted from the source.<br>
Each `Document` has two attributes:
- Metadata
- Page content
<br>

**Metadata** holds the source path or link of the document the text was loaded from. It is useful for showing the client where the data was extracted from. **Page content**, on the other hand, holds the actual data. This data is used to create embeddings for our project.
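
The two-part structure described here can be sketched without any library. The `Doc` class and the `describe` helper below are hypothetical stand-ins written for illustration; they are not LangChain's own classes, and only mirror the `metadata` and `page_content` attribute names.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a loaded document (not LangChain's class)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def describe(doc: Doc, preview_chars: int = 40) -> str:
    """Return a one-line summary: where the text came from and how it starts."""
    source = doc.metadata.get("source", "<unknown source>")
    preview = doc.page_content[:preview_chars].replace("\n", " ")
    return f"{source}: {preview}"


# A loader returns a list of documents, so index into it or loop over it.
docs = [Doc(page_content="Kanji are logographic Chinese characters.",
            metadata={"source": "kanji.txt"})]

for doc in docs:
    print(describe(doc))
```

Keeping the source next to the text is what later lets an answer be traced back to the file or link it came from.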
***

In [None]:
#Retrieving file path using metadata
print("File path 📁:\n"+ data[0].metadata["source"])

#Retrieving file content using page-content
print("\nFile content 📄:\n"+ data[0].page_content)

File path 📁:
/content/drive/MyDrive/text/kanji.txt

File content 📄:
Kanji (漢字, pronounced [kaɲdʑi] ⓘ) are logographic Chinese characters, adapted from Chinese script, used in the writing of Japanese.[1] They were made a major part of the Japanese writing system during the time of Old Japanese and are still used, along with the subsequently-derived syllabic scripts of hiragana and katakana.[2][3] The characters have Japanese pronunciations; most have two, with one based on the Chinese sound. A few characters were invented in Japan by constructing character components derived from other Chinese characters. After the Meiji Restoration, Japan made its own efforts to simplify the characters, now known as shinjitai, by a process similar to China's simplification efforts, with the intention to increase literacy among the general public. Since the 1920s, the Japanese government has published character lists periodically to help direct the education of its citizenry through the myriad Chinese c

In [None]:
#Example 2. CSV loader
from langchain_community.document_loaders.csv_loader import CSVLoader as cl

In [None]:
csv = cl("/content/drive/MyDrive/text/computer_games_dataset.csv")
data2 = csv.load()
data2[0]

Document(metadata={'source': '/content/drive/MyDrive/text/computer_games_dataset.csv', 'row': 0}, page_content='Name: A-Men 2\nDeveloper: Bloober Team\nProducer: Bloober Team\nGenre: Adventure, Puzzle\nOperating System: Microsoft Windows\nDate Released: June 24, 2015')

***
The CSV loader works much like the default text loader, but loads data from a CSV file, producing one document per row.<br>
Unlike the default text loader, the CSV loader accepts an additional parameter called `source_column`. It can be used to change the metadata's default source: pass the name of the column whose value should be used as the source.
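
To make the effect of `source_column` concrete, here is a rough standard-library sketch of the same idea. It is not the CSV loader's actual implementation; `rows_to_records`, the in-memory sample, and the `games.csv` path are assumptions made up for illustration, and only the sample row's values come from the output above.

```python
import csv
import io

# Tiny in-memory CSV standing in for a real file.
SAMPLE = io.StringIO(
    "Name,Developer,Genre\n"
    "A-Men 2,Bloober Team,\"Adventure, Puzzle\"\n"
)


def rows_to_records(handle, path, source_column=None):
    """Turn each CSV row into a dict with 'page_content' and 'metadata'."""
    records = []
    for index, row in enumerate(csv.DictReader(handle)):
        # One "column: value" line per field, like the page_content shown above.
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        # Default source is the file path; source_column swaps in a cell value.
        source = row[source_column] if source_column else path
        records.append({"page_content": content,
                        "metadata": {"source": source, "row": index}})
    return records


records = rows_to_records(SAMPLE, "games.csv", source_column="Name")
print(records[0]["metadata"])  # {'source': 'A-Men 2', 'row': 0}
```

Without `source_column`, every record would carry the file path as its source, which makes it hard to tell rows apart.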
***

In [None]:
csv = cl("/content/drive/MyDrive/text/computer_games_dataset.csv", source_column = "Name")
data2 = csv.load()
data2[0].metadata["source"]

'A-Men 2'

In [None]:
#Printing all Grand Theft Auto games
for i in data2:
  if "grand theft" in i.metadata["source"].lower():
    print(i.metadata["source"])

Grand Theft Auto
Grand Theft Auto: London 1961
Grand Theft Auto: London 1969
Grand Theft Auto 2
Grand Theft Auto III
Grand Theft Auto IV
Grand Theft Auto IV: The Lost and Damned
Grand Theft Auto: The Ballad of Gay Tony
Grand Theft Auto: San Andreas
Grand Theft Auto: Vice City
Grand Theft Auto V


***
These examples show how data can be loaded from documents using LangChain. <br>
For our main project we need to load data from the URLs provided by the user. This can be done with the URL loader, `UnstructuredURLLoader`, which LangChain provides on top of the `unstructured` library.<br><br>
*To learn more about the URL loader, see this* **[Documentation](https://python.langchain.com/v0.1/docs/integrations/document_loaders/url/)**
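
For intuition about what loading text from a web page involves, the sketch below strips tags from an HTML string using only the standard library. It is a simplified illustration and not how `unstructured` works internally; `TextExtractor`, `html_to_text`, and the sample HTML are hypothetical names and data chosen for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-empty text that is outside script/style blocks.
        if text and self._skip_depth == 0:
            self.parts.append(text)


def html_to_text(html: str) -> str:
    """Return the page's text fragments joined by blank lines."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.parts)


sample_html = ("<html><head><style>p {color: red}</style></head>"
               "<body><h1>Headline</h1><p>Body text.</p>"
               "<script>var x = 1;</script></body></html>")
print(html_to_text(sample_html))
```

A sketch like this keeps every text fragment on the page, including menus and banners, so the extracted content usually needs cleaning before it is embedded.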
***

In [None]:
#Installing library
!pip install unstructured

Collecting unstructured
  Downloading unstructured-0.17.2-py3-none-any.whl.metadata (24 kB)
Collecting filetype (from unstructured)
  Downloading filetype-1.2.0-py2.py3-none-any.whl.metadata (6.5 kB)
Collecting python-magic (from unstructured)
  Downloading python_magic-0.4.27-py2.py3-none-any.whl.metadata (5.8 kB)
Collecting emoji (from unstructured)
  Downloading emoji-2.14.1-py3-none-any.whl.metadata (5.7 kB)
Collecting python-iso639 (from unstructured)
  Downloading python_iso639-2025.2.18-py3-none-any.whl.metadata (14 kB)
Collecting langdetect (from unstructured)
  Downloading langdetect-1.0.9.tar.gz (981 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 981.5/981.5 kB 13.1 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting rapidfuzz (from unstructured)
  Downloading rapidfuzz-3.13.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting backoff (from unstructured)
  Downloa

In [None]:
from langchain_community.document_loaders import UnstructuredURLLoader as ul

In [None]:
urls=["https://www.moneycontrol.com/news/business/tata-motors-mahindra-gain-certificates-for-production-linked-payouts-11281691.html",
      "https://www.moneycontrol.com/news/business/tata-motors-launches-punch-icng-price-starts-at-rs-7-1-lakh-11098751.html",
      "https://www.moneycontrol.com/news/business/stocks/buy-tata-motors-target-of-rs-743-kr-choksey-11080811.html"]

loader2 = ul(urls = urls)
data3 = loader2.load()
data3[0].metadata

{'source': 'https://www.moneycontrol.com/news/business/tata-motors-mahindra-gain-certificates-for-production-linked-payouts-11281691.html'}

In [None]:
data3[0].page_content

"English\n\nHindi\n\nGujarati\n\nSpecials\n\nHello, Login\n\nHello, Login\n\nLog-inor Sign-Up\n\nMy Account\n\nMy Profile\n\nMy Portfolio\n\nMy Watchlist\n\nMy Alerts\n\nMy Messages\n\nPrice Alerts\n\nMy Profile\n\nMy PRO\n\nMy Portfolio\n\nMy Watchlist\n\nMy Alerts\n\nMy Messages\n\nPrice Alerts\n\nLogout\n\nLoans up to ₹50 LAKHS\n\nFixed Deposits\n\nCredit CardsLifetime Free\n\nCredit Score\n\nChat with Us\n\nDownload App\n\nFollow us on:\n\nNetwork 18\n\nGo Ad-Free\n\nMy Alerts\n\n>->MC_ENG_DESKTOP/MC_ENG_NEWS/MC_ENG_BUSINESS_AS/MC_ENG_ROS_NWS_BUS_AS_ATF_728\n\nMoneycontrol\n\nGo PRO@₹1/dayPRO\n\nMoneycontrol PRO\n\nAdvertisement\n\nRemove Ad\n\nBusiness\n\nMarkets\n\nStocks\n\nEconomy\n\nCompanies\n\nTrends\n\nIPO\n\nOpinion\n\nEV Special\n\nHomeNewsBusinessTata Motors, Mahindra gain certificates for production-linked payouts\n\nTrending Topics\n\nMutual FundsTrump Tariffs News LiveTrump Tariffs Exemption Indian RupeeRatan Tata Will\n\nTata Motors, Mahindra gain certificates for pr