### Data Ingestion (https://python.langchain.com/docs/integrations/document_loaders/)

In [3]:
from langchain_community.document_loaders.text import TextLoader

In [6]:
loader1 = TextLoader("E:/Agentic AI/25-05-2025/DataIngestion/Speech.txt")
loader1

<langchain_community.document_loaders.text.TextLoader at 0x2297e888a50>

In [8]:
text_doc = loader1.load()
text_doc

[Document(metadata={'source': 'E:/Agentic AI/25-05-2025/DataIngestion/Speech.txt'}, page_content='New Delhi : 31.01.2025\n\nDownload : Speeches ADDRESS BY THE HON’BLE PRESIDENT OF INDIA, SMT. DROUPADI MURMU TO PARLIAMENT(161.72 KB)\nHon’ble Members,\n\n1. It gives me immense pleasure to address this session of Parliament.\n\nJust two months ago, we celebrated the 75th anniversary of adoption of our Constitution, and only a few days ago, the Indian Republic completed 75 years of its journey. This occasion will elevate India&#39;s pride as the mother of democracy to new heights. On behalf of all the citizens of the country, I pay tribute to Babasaheb Ambedkar and all the framers of the Constitution.\n\nHon’ble Members,\n\n2. The historic festival of Mahakumbh is also underway in the country. Mahakumbh is a festival of India&#39;s cultural tradition and social consciousness. Millions of devotees from across the country and the world have taken the holy dip at Prayagraj. I express my sorro
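
The loaded text above still contains HTML character references such as `&#39;`, which suggests the file was saved from a web page. A minimal sketch of decoding them with the standard library follows; the helper name `clean_text` is hypothetical, and the sample string is a fragment copied from the output above.

```python
import html


def clean_text(raw: str) -> str:
    """Decode HTML character references (e.g. &#39;) left in loaded text."""
    return html.unescape(raw)


# Fragment copied from the TextLoader output above.
sample = "This occasion will elevate India&#39;s pride as the mother of democracy"
cleaned = clean_text(sample)
print(cleaned)
```

The same helper could be applied to each document's `page_content` before splitting, assuming the entities are unwanted downstream.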

### PDF

In [5]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("E:/Agentic AI/25-05-2025/DataIngestion/game_of_thrones_master_dialogue_s3s8.pdf")

docs = loader.load()
docs

[Document(metadata={'producer': 'macOS Version 10.15.4 (Build 19E266) Quartz PDFContext', 'creator': 'Final Draft 11', 'creationdate': "D:20200412013934Z00'00'", 'title': 'Master Dialogue for Game of Thrones Seasons 3 through 8', 'author': 'David J. Peterson', 'moddate': "D:20200412013934Z00'00'", 'source': 'E:/Agentic AI/25-05-2025/DataIngestion/game_of_thrones_master_dialogue_s3s8.pdf', 'total_pages': 196, 'page': 0, 'page_label': '1'}, page_content='(CONTINUED)\nGAME OF THRONES #302\nMASTER DOCUMENT\nLanguage Translations\nDavid J. Peterson\nRevised 8/7/12\nKEY:\n(Title of Associated .mp3 File)\nCHARACTER NAME\nEnglish dialogue as written.\nTRANSLATION\nOfficial transcription for closed captioning and subtitles.\nPHONETIC\nfo-NE-tik REN-dur-ing\nLiteral translation.\n-------------------------------------------------------------------\nEXT. ASTAPOR - APPROACH TO PRESENTATION AREA - DAY2.14 2.14\n(s3e2sc2-14_1.mp3)\nKRAZNYS\nTell the Westerosi whore that these Unsullied have been stan

### Web-based loader

In [10]:
from langchain_community.document_loaders import WebBaseLoader
import bs4

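web_loader = WebBaseLoader(
    web_paths=("https://python.langchain.com/docs/concepts/output_parsers/",),
    # parse_only keeps only elements that carry one of these CSS classes.
    # If the target page has no such elements, page_content comes back empty,
    # as in the output below.
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-title", "post-content", "post-date")
        )
    ),
)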


In [11]:
webdoc = web_loader.load()
webdoc

[Document(metadata={'source': 'https://python.langchain.com/docs/concepts/output_parsers/'}, page_content='')]
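
The empty `page_content` above is what a class filter produces when no element on the page carries the requested classes. The sketch below imitates that filtering with the standard library's `html.parser`. It is a stand-in for illustration, not how `bs4.SoupStrainer` is implemented, and the sample HTML with its `title` and `markdown` class names is made up.

```python
from html.parser import HTMLParser

# Tags without a closing tag; skipped so nesting depth stays balanced.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ClassTextExtractor(HTMLParser):
    """Collect text only from elements carrying one of the wanted CSS classes."""

    def __init__(self, wanted_classes):
        super().__init__()
        self.wanted = set(wanted_classes)
        self.depth = 0  # greater than 0 while inside a matching element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.wanted.intersection(classes):
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.parts.append(data.strip())


def extract(html_text, wanted_classes):
    """Return the text found inside elements matching wanted_classes."""
    parser = ClassTextExtractor(wanted_classes)
    parser.feed(html_text)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical page: none of its classes start with "post-".
page = (
    '<article><h1 class="title">Output parsers</h1>'
    '<div class="markdown"><p>Parsers turn model output into structured data.</p></div>'
    "</article>"
)

print(repr(extract(page, ("post-title", "post-content", "post-date"))))
print(repr(extract(page, ("markdown",))))
```

With the real loader, the usual remedies are to inspect the page's markup and pass class names that actually occur in it, or to drop `parse_only` and load the whole page.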

### Arxiv loader

In [13]:
from langchain_community.document_loaders import ArxivLoader

rp_doc = ArxivLoader(query="1706.03762", load_max_docs=2)

content_rp = rp_doc.load()
content_rp

[Document(metadata={'Published': '2023-08-02', 'Title': 'Attention Is All You Need', 'Authors': 'Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin', 'Summary': 'The dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks in an encoder-decoder configuration. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Transformer, based\nsolely on attention mechanisms, dispensing with recurrence and convolutions\nentirely. Experiments on two machine translation tasks show these models to be\nsuperior in quality while being more parallelizable and requiring significantly\nless time to train. Our model achieves 28.4 BLEU on the WMT 2014\nEnglish-to-German translation task, improving over the existing best results,\nincluding ensembles by over 2 BLEU. On the WMT 2014 English-to-French\ntr

### Wikipedia loader

In [15]:
from langchain_community.document_loaders import WikipediaLoader

wiki_load = WikipediaLoader(query="MS Dhoni", load_max_docs=2).load()

print(wiki_load)

[Document(metadata={'title': 'MS Dhoni', 'summary': "Mahendra Singh Dhoni ( ; born 7 July 1981) is an Indian professional cricketer who plays as a right-handed batter and a wicket-keeper. Widely regarded as one of the most prolific wicket-keeper batsmen and captains and one of the greatest ODI batsmen, he represented the Indian cricket team and was the captain of the side in limited overs formats from 2007 to 2017 and in test cricket from 2008 to 2014. Dhoni has captained the most international matches and is the most successful Indian captain. He has led India to victory in the 2007 ICC World Twenty20, the 2011 Cricket World Cup, and the 2013 ICC Champions Trophy, being the only captain to win three different limited overs ICC tournaments. He also led the teams that won the Asia Cup in 2010, 2016 and was a member of the title winning squad in 2018.\nBorn in Ranchi, Dhoni made his first class debut for Bihar in 1999. He made his debut for the Indian cricket team on 23 December 2004 in 

## Text Splitters

In [16]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)

final_doc = text_splitter.split_documents(content_rp)

In [17]:
final_doc

[Document(metadata={'Published': '2023-08-02', 'Title': 'Attention Is All You Need', 'Authors': 'Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin', 'Summary': 'The dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks in an encoder-decoder configuration. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Transformer, based\nsolely on attention mechanisms, dispensing with recurrence and convolutions\nentirely. Experiments on two machine translation tasks show these models to be\nsuperior in quality while being more parallelizable and requiring significantly\nless time to train. Our model achieves 28.4 BLEU on the WMT 2014\nEnglish-to-German translation task, improving over the existing best results,\nincluding ensembles by over 2 BLEU. On the WMT 2014 English-to-French\ntr

In [18]:
final_doc[0]

Document(metadata={'Published': '2023-08-02', 'Title': 'Attention Is All You Need', 'Authors': 'Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin', 'Summary': 'The dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks in an encoder-decoder configuration. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Transformer, based\nsolely on attention mechanisms, dispensing with recurrence and convolutions\nentirely. Experiments on two machine translation tasks show these models to be\nsuperior in quality while being more parallelizable and requiring significantly\nless time to train. Our model achieves 28.4 BLEU on the WMT 2014\nEnglish-to-German translation task, improving over the existing best results,\nincluding ensembles by over 2 BLEU. On the WMT 2014 English-to-French\ntra

In [19]:
final_doc[1]

Document(metadata={'Published': '2023-08-02', 'Title': 'Attention Is All You Need', 'Authors': 'Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin', 'Summary': 'The dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks in an encoder-decoder configuration. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Transformer, based\nsolely on attention mechanisms, dispensing with recurrence and convolutions\nentirely. Experiments on two machine translation tasks show these models to be\nsuperior in quality while being more parallelizable and requiring significantly\nless time to train. Our model achieves 28.4 BLEU on the WMT 2014\nEnglish-to-German translation task, improving over the existing best results,\nincluding ensembles by over 2 BLEU. On the WMT 2014 English-to-French\ntra

In [20]:
print(final_doc[2])

page_content='based solely on attention mechanisms, dispensing with recurrence and convolutions
entirely. Experiments on two machine translation tasks show these models to
be superior in quality while being more parallelizable and requiring significantly
less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-
to-German translation task, improving over the existing best results, including
ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task,' metadata={'Published': '2023-08-02', 'Title': 'Attention Is All You Need', 'Authors': 'Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin', 'Summary': 'The dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks in an encoder-decoder configuration. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, t
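
The splitter above is configured with `chunk_size=500` and `chunk_overlap=50`: the first caps the length of each chunk, and the second is how many characters neighbouring chunks may share. The sketch below shows the idea with plain fixed-size windows. It is a simplification: `RecursiveCharacterTextSplitter` splits on separators first, so its chunk boundaries will differ. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(text, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk begins with the last three characters of the previous one, which is the context that overlap is meant to preserve across chunk boundaries.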

#### CharacterTextSplitter

In [21]:
from langchain_text_splitters import CharacterTextSplitter 

char_text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=500,
    chunk_overlap=50
)

websplit = char_text_splitter.split_documents(wiki_load)



In [22]:
websplit

[Document(metadata={'title': 'MS Dhoni', 'summary': "Mahendra Singh Dhoni ( ; born 7 July 1981) is an Indian professional cricketer who plays as a right-handed batter and a wicket-keeper. Widely regarded as one of the most prolific wicket-keeper batsmen and captains and one of the greatest ODI batsmen, he represented the Indian cricket team and was the captain of the side in limited overs formats from 2007 to 2017 and in test cricket from 2008 to 2014. Dhoni has captained the most international matches and is the most successful Indian captain. He has led India to victory in the 2007 ICC World Twenty20, the 2011 Cricket World Cup, and the 2013 ICC Champions Trophy, being the only captain to win three different limited overs ICC tournaments. He also led the teams that won the Asia Cup in 2010, 2016 and was a member of the title winning squad in 2018.\nBorn in Ranchi, Dhoni made his first class debut for Bihar in 1999. He made his debut for the Indian cricket team on 23 December 2004 in 
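
`CharacterTextSplitter` works from a single separator (`"\n"` above) rather than a list of fallbacks. A rough sketch of that approach is to split on the separator and then greedily pack the pieces into chunks of at most `chunk_size` characters. The function `split_on_separator` is hypothetical and leaves out overlap handling for brevity, so it is not the library's implementation.

```python
def split_on_separator(text: str, separator: str, chunk_size: int) -> list[str]:
    """Split on separator, then greedily pack pieces into chunks <= chunk_size."""
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece:
            continue  # skip empty pieces produced by consecutive separators
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # A single piece longer than chunk_size is kept whole.
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


lines = "alpha\nbeta\ngamma\ndelta"
packed = split_on_separator(lines, separator="\n", chunk_size=11)
print(packed)
# ['alpha\nbeta', 'gamma\ndelta']
```

Because a piece is never cut in the middle, a line longer than `chunk_size` yields an oversized chunk. That is the main trade-off of splitting on one separator only.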

### JSON loader

In [29]:
from langchain_community.document_loaders import JSONLoader

json_doc = JSONLoader(
    file_path="E:/Agentic AI/25-05-2025/DataIngestion/sample.json",
    jq_schema=".",
    text_content=False  # False allows non-string results (dicts, lists), which are serialized to JSON; True requires the jq result to already be a string
)

loaded_doc = json_doc.load()
print(loaded_doc)

[Document(metadata={'source': 'E:\\Agentic AI\\25-05-2025\\DataIngestion\\sample.json', 'seq_num': 1}, page_content='{"product_id": "SKU123456", "product_name": "Motorola Edge Fusion", "brand": "Motorola", "category": "Smartphones", "price": {"currency": "USD", "value": 649.99, "discount": {"type": "percentage", "value": 10}, "final_price": 584.99}, "availability": "in_stock", "ratings": {"average_rating": 4.5, "total_reviews": 324}, "product_details": {"description": "Experience blazing speed and vibrant visuals with the Motorola Edge Fusion, equipped with a Snapdragon 888 processor and a 6.7-inch OLED display.", "specs": {"display": "6.7-inch OLED", "processor": "Qualcomm Snapdragon 888", "ram": "8GB", "storage": "128GB", "battery": "4500mAh", "camera": {"rear": "50MP + 12MP + 8MP", "front": "16MP"}, "os": "Android 12", "network": "5G"}}, "images": ["https://example.com/images/motorola_edge_fusion_front.jpg", "https://example.com/images/motorola_edge_fusion_back.jpg"], "shipping": {"
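
With `jq_schema="."` the whole JSON document is selected, so the loader produces a single `Document` whose `page_content` is the serialized object, as the output above shows. The sketch below reproduces that shape with the standard library's `json` module. The `record` is a trimmed copy of fields visible in the output, and treating `json.dumps` as equivalent to the loader's serialization is an assumption based on that output.

```python
import json

# Trimmed copy of fields visible in the loader output above.
record = {
    "product_id": "SKU123456",
    "product_name": "Motorola Edge Fusion",
    "price": {"currency": "USD", "value": 649.99},
}

# Selecting the whole document (comparable to jq_schema=".") leaves one
# object to serialize into a single string.
whole = json.dumps(record)
print(whole)

# A narrower selection, comparable in spirit to jq_schema=".product_name",
# yields a plain string instead of a serialized object.
name_only = record["product_name"]
print(name_only)
```

A narrower schema usually gives cleaner chunks for retrieval, since the text splitter then sees natural-language fields rather than JSON punctuation.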