## **DATA INGESTION**
Data ingestion is the process of collecting and loading data from different sources into a system so that an AI/LLM can understand and use it.

**Why Is Data Ingestion Necessary?**
- LLMs do NOT know your data.
- LLMs are trained on public, generic data.

They cannot access your private data on their own, so that data must be loaded into the AI system explicitly.
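
To make the idea concrete, the sketch below mimics what a document loader does, using only the standard library. `SimpleDocument` and `load_text_file` are hypothetical stand-ins, not LangChain classes; they only mirror the `page_content` plus `metadata` shape that the loaders in this notebook return.

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loader's document record."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path):
    """Read one UTF-8 text file and wrap it in a single-item list."""
    text = Path(path).read_text(encoding="utf-8")
    return [SimpleDocument(page_content=text, metadata={"source": str(path)})]


# Demo with a throwaway file so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "notes.txt"
    sample_path.write_text("Private notes the model has never seen.", encoding="utf-8")
    sample_docs = load_text_file(sample_path)

print(sample_docs[0].page_content)
print(sample_docs[0].metadata)
```

Every loader in the sections that follow does a richer version of the same job: fetch the raw content, then return it as a list of documents with the text and its source metadata.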

---
**Document Loaders: https://docs.langchain.com/oss/python/integrations/document_loaders**

---

#### **Text Loader**

In [1]:
# Import TextLoader to load data from a .txt file
from langchain_community.document_loaders import TextLoader

# Create a TextLoader object and pass the text file path
loader = TextLoader("cosmics.txt")

# Load the text file content into LangChain Document
documents = loader.load()
documents

[Document(metadata={'source': 'cosmics.txt'}, page_content='Cosmics: Their Presence in the Galaxy and Their Relation with the Sound “Om”\n\nThe universe is vast, mysterious, and filled with forces that go far beyond what human senses can directly perceive. From ancient civilizations to modern astrophysics, humans have tried to understand the nature of the cosmos and its underlying energy. The term cosmics broadly refers to cosmic entities, energies, radiations, and phenomena that exist throughout the galaxy. These cosmics play a crucial role in shaping stars, planets, life, and even human consciousness. Interestingly, ancient spiritual traditions, especially from India, describe the universe as originating from a primordial sound known as Om. This essay explores the presence of cosmics in the galaxy and their deep symbolic and scientific connection with the sound Om.\n\nUnderstanding Cosmics and Cosmic Presence\n\nCosmics can be understood as universal forces and energies present throu

#### **PDF Loader**

In [2]:
# Import PyPDFLoader to read PDF files
from langchain_community.document_loaders import PyPDFLoader

# Create a PDF loader object by providing the PDF file path
loader = PyPDFLoader("Deep Learning with Python - François Chollet - Manning (2018).pdf")

# Load the PDF content
# Each page of the PDF is loaded as a separate document
pages = loader.load()
pages

ValueError: File path Deep Learning with Python - François Chollet - Manning (2018).pdf is not a valid file or url
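
The `ValueError` above means the loader could not find a file at the given path. The following is a minimal sketch of a pre-check, assuming the PDF is expected relative to the notebook's working directory. `find_file` is a hypothetical helper, and the loader import is deferred so that nothing is loaded when the file is missing.

```python
from pathlib import Path


def find_file(path_str):
    """Return a resolved Path if the file exists, otherwise None."""
    path = Path(path_str).expanduser()
    return path.resolve() if path.is_file() else None


pdf_name = "Deep Learning with Python - François Chollet - Manning (2018).pdf"
pdf_path = find_file(pdf_name)

if pdf_path is None:
    # Show where Python actually looked, which usually points to the fix
    print(f"Not found: {pdf_name!r} (working directory: {Path.cwd()})")
    pages = []
else:
    # Imported here so the check above works even without the package installed
    from langchain_community.document_loaders import PyPDFLoader

    pages = PyPDFLoader(str(pdf_path)).load()

print(len(pages))
```

If the file lives elsewhere, passing its absolute path to `PyPDFLoader` avoids any dependence on the working directory.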

#### **Web-Based Loader**

In [None]:
# Import WebBaseLoader to load data from websites
from langchain_community.document_loaders import WebBaseLoader

# Create loader for a single web page
loader = WebBaseLoader("https://en.wikipedia.org/wiki/Artificial_intelligence")
pages = loader.load()
pages

USER_AGENT environment variable not set, consider setting it to identify your requests.
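
The message above is a warning, not an error. Assuming it is raised when the loader module is imported and that it reads the `USER_AGENT` environment variable (as the message suggests), one way to avoid it is to set the variable before the import, for example in the first cell of the notebook. The value below is a placeholder.

```python
import os

# Set a descriptive User-Agent before importing the loader module.
# "ingestion-notebook/0.1" is a placeholder value, not a required format.
os.environ.setdefault("USER_AGENT", "ingestion-notebook/0.1")

print(os.environ["USER_AGENT"])
```

`setdefault` leaves any value that is already set in the environment untouched.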




**Loading multiple web pages**

In [None]:
# Pass a list of URLs to load several pages with one loader
urls = [
    "https://en.wikipedia.org/wiki/Machine_learning",
    "https://en.wikipedia.org/wiki/Deep_learning",
]
loader = WebBaseLoader(urls)
pages = loader.load()
pages

[Document(metadata={'source': 'https://en.wikipedia.org/wiki/Machine_learning', 'title': 'Machine learning - Wikipedia', 'language': 'en'}, page_content='\n\n\n\nMachine learning - Wikipedia\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nJump to content\n\n\n\n\n\n\n\nMain menu\n\n\n\n\n\nMain menu\nmove to sidebar\nhide\n\n\n\n\t\tNavigation\n\t\n\n\nMain pageContentsCurrent eventsRandom articleAbout WikipediaContact us\n\n\n\n\n\n\t\tContribute\n\t\n\n\nHelpLearn to editCommunity portalRecent changesUpload fileSpecial pages\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSearch\n\n\n\n\n\n\n\n\n\n\n\nSearch\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nAppearance\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nDonate\n\nCreate account\n\nLog in\n\n\n\n\n\n\n\n\nPersonal tools\n\n\n\n\n\nDonate Create account Log in\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nContents\nmove to sidebar\nhide\n\n\n\n\n(Top)\n\n\n\n\n\n1\nHistory\n\n\n\n\n\n\n\n\n2\nRelationships to other fields\n\n\n\n\nTog
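
The raw `page_content` above is dominated by runs of newlines and tabs left over from the page layout. A minimal sketch of a cleanup step, using only the standard library; `collapse_whitespace` is a hypothetical helper and the sample string is a shortened excerpt of the output above.

```python
import re


def collapse_whitespace(text):
    """Trim each line and squeeze runs of blank lines into one."""
    # Tabs and repeated spaces become a single space
    text = re.sub(r"[ \t]+", " ", text)
    # Strip leading/trailing spaces on every line
    lines = [line.strip() for line in text.splitlines()]
    text = "\n".join(lines)
    # Three or more newlines collapse to one blank line
    return re.sub(r"\n{3,}", "\n\n", text).strip()


raw = "\n\n\n\nMachine learning - Wikipedia\n\n\n\n\nJump to content\n\n\t\tNavigation\n\t\n\n\nMain page"
print(collapse_whitespace(raw))
```

The same function could be applied to the `page_content` of each loaded page before the text is passed to later steps.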

**CUSTOM WEB DATA EXTRACTION USING BEAUTIFULSOUP**

In [None]:
# bs4 is used to parse and filter HTML content from web pages
import bs4

loader = WebBaseLoader(
    web_path="https://www.ibm.com/think/topics/deep-learning#763338456",
    # bs_kwargs passes keyword arguments directly to BeautifulSoup
    bs_kwargs=dict(
        # SoupStrainer parses only the matching HTML elements,
        # which improves performance and removes unwanted content
        parse_only=bs4.SoupStrainer(
            class_=("side-nav list",)
            # Alternative: bs4.SoupStrainer("p") keeps paragraphs only
        )
    ),
)
pages = loader.load()
pages

[Document(metadata={'source': 'https://www.ibm.com/think/topics/deep-learning#763338456', 'title': 'Caret right'}, page_content='\n\nMachine learning\n\n\n\nWelcome\n\n\n\n\n\nCaret right\n\nIntroduction\n\n\n\n\nOverview\n\n\n\n\nMachine learning types\n\n\n\n\nMachine learning algorithms\n\n\n\n\n\n\n\nCaret right\n\nData science for machine learning\n\n\n\n\nStatistical machine learning\n\n\n\n\nLinear algebra for machine learning\n\n\n\n\nUncertainty quantification\n\n\n\n\nBias variance tradeoff\n\n\n\n\nBayesian Statistics\n\n\n\n\n\n\n\nCaret right\n\nFeature Engineering\n\n\n\n\nOverview\n\n\n\n\nFeature selection\n\n\n\n\nFeature extraction\n\n\n\n\nVector embedding\n\n\n\n\nLatent space\n\n\n\n\n\nCaret right\n\nDimensionality reduction\n\n\n\n\nPrincipal component analysis\n\n\n\n\nLinear discriminant analysis\n\n\n\n\n\n\nUpsampling\n\n\n\n\nDownsampling\n\n\n\n\nSynthetic data\n\n\n\n\nData leakage\n\n\n\n\n\n\n\nCaret right\n\nSupervised learning\n\n\n\n\nOverview\n\n\n\n

#### **Wikipedia Loader**

**WIKIPEDIA LOADER (SINGLE QUERY)**

In [None]:
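# Import WikipediaLoader to load articles from Wikipedia
from langchain_community.document_loaders import WikipediaLoader

# Search Wikipedia for the query and load the matching articles
loader = WikipediaLoader(query="Virat Kohli")
pages = loader.load()
pages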

[Document(metadata={'title': 'Virat Kohli', 'summary': "Virat Kohli (born 5 November 1988) is an Indian international cricketer and the former all-format captain of the Indian national cricket team. He is a right-handed batter and occasional right-arm medium pace bowler. Considered one of the greatest all-format batsmen in the history of cricket, he has been nicknamed the King, the Chase Master, and the Run Machine for his skills, records and ability to lead his team to victory. Kohli has the most centuries in ODIs and the second-most centuries in international cricket with 85 tons across all formats. He is also the leading run-scorer in the Indian Premier League. Kohli is the most successful Test captain of India with most wins and 3 consecutive Test mace retainments. He is the only batter to earn 900+ rating points across all 3 formats.\nKohli was the captain of the 2008 U19 World Cup winning team and was a crucial member of the teams that won 2011 ODI World Cup, 2013 Champions Troph

**WIKIPEDIA LOADER (LIMITED DOCUMENTS)**

In [None]:
# load_max_docs limits how many matching articles are loaded
loader = WikipediaLoader(query="Mahendra Singh Dhoni", load_max_docs=3)
pages = loader.load()
pages

[Document(metadata={'title': 'MS Dhoni', 'summary': "Mahendra Singh  Dhoni ([məˈɦeːnd̪ɾə ˈsɪŋɡʱ ˈd̪ʱoːniː] ; born 7 July 1981) is an Indian professional cricketer who plays as a right-handed batter and a wicket-keeper. Widely regarded as one of the most prolific wicket-keeper batsmen and captains, he represented the Indian cricket team and was the captain of the side in limited overs formats from 2007 to 2017 and in Test cricket from 2008 to 2014. Dhoni has captained the most international matches and is the most successful Indian captain. He has led India to victory in the 2007 ICC World Twenty20, the 2011 Cricket World Cup, and the 2013 ICC Champions Trophy, being the only captain to win three different limited overs ICC tournaments. He also led the teams that won the Asia Cup in 2010 and 2016, and he was a member of the title winning squad in 2018.\nBorn in Ranchi, Dhoni made his first class debut for Bihar in 1999. He made his debut for the Indian cricket team on 23 December 2004 i

**WIKIPEDIA LOADER (MULTIPLE QUERIES)**

In [None]:
# Create a list of search queries
queries = [
    "Virat Kohli",
    "Mahendra Singh Dhoni",
    "Sachin Tendulkar"
]

# Empty list to store all Wikipedia documents
all_pages = []

# Loop through each query and load Wikipedia content
for query in queries:
    loader = WikipediaLoader(query=query, load_max_docs=2)   # Limit number of documents per query
    pages = loader.load()
    all_pages.extend(pages)

# Display all loaded Wikipedia documents
all_pages
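
When results from several related queries are combined, the same article may appear more than once. A minimal sketch of de-duplicating by title follows. Plain dicts stand in for the loaded documents, the sample data is illustrative rather than real loader output, and `dedupe_by_title` is a hypothetical helper.

```python
def dedupe_by_title(records):
    """Keep the first record seen for each title, preserving order."""
    seen_titles = set()
    unique = []
    for record in records:
        title = record["metadata"].get("title")
        if title in seen_titles:
            continue
        seen_titles.add(title)
        unique.append(record)
    return unique


# Plain dicts stand in for loaded documents (illustrative data, not real output)
sample_records = [
    {"metadata": {"title": "Virat Kohli"}, "page_content": "..."},
    {"metadata": {"title": "India national cricket team"}, "page_content": "..."},
    {"metadata": {"title": "India national cricket team"}, "page_content": "..."},
]

unique_records = dedupe_by_title(sample_records)
print([r["metadata"]["title"] for r in unique_records])
```

With real documents, the same idea applies by reading the title from each document's `metadata` instead of a dict key.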

#### **arXiv Loader**

In [None]:
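# Import ArxivLoader to load papers from arXiv
from langchain_community.document_loaders import ArxivLoader

# Query by arXiv identifier; load_max_docs caps the number of papers returned
loader = ArxivLoader(query="1706.03762", load_max_docs=3)
papers = loader.load()
papers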

[Document(metadata={'Published': '2023-08-02', 'Title': 'Attention Is All You Need', 'Authors': 'Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin', 'Summary': 'The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation 

In [None]:
# Count the loaded papers; the identifier query matched a single paper here
len(papers)

1