---------------------------------
#### Ingesting Data into Our RAG Workflow
------------------------------------

- **LlamaHub** is a comprehensive library that extends the capabilities of the **LlamaIndex** framework.
- It includes over **180 connectors**, also known as **data readers** or **data loaders**, for seamless integration of external data sources.
- **Connectors** allow easy extraction of data from **databases**, **APIs**, **files**, and **websites**, converting it into **LlamaIndex Document objects**.
- Users don't need to write **custom parsers** for different data types, as connectors standardize data ingestion.
- If needed, users can create and contribute their own **custom connectors** to the library.
- **LlamaHub** enables access to diverse data sources with minimal code.
- **Document objects** generated from connectors can be parsed into **nodes** and indexed as required.
- The unified output means core business logic doesn't need to handle various data types, as the framework abstracts this complexity.
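
The connector idea described above can be sketched in a few lines. This is a library-free illustration, not LlamaIndex's actual implementation: `Doc` is a hypothetical stand-in for a LlamaIndex Document object, and `CsvRowReader` is a made-up connector whose `load_data` method mirrors the method name used by the readers in this notebook.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LlamaIndex Document (text plus metadata)."""
    text: str
    metadata: dict = field(default_factory=dict)


class CsvRowReader:
    """Hypothetical connector: produces one Doc per CSV row."""

    def load_data(self, csv_text: str, source: str = "inline") -> list[Doc]:
        rows = csv.DictReader(io.StringIO(csv_text))
        docs = []
        for row_number, row in enumerate(rows, start=1):
            # Flatten the row into "column: value" lines so downstream code only sees text.
            text = "\n".join(f"{key}: {value}" for key, value in row.items())
            docs.append(Doc(text=text, metadata={"source": source, "row": row_number}))
        return docs


sample = "city,landmark\nRome,Colosseum\nAthens,Parthenon\n"
docs = CsvRowReader().load_data(sample, source="landmarks.csv")
for doc in docs:
    print(doc.metadata, "->", doc.text.replace("\n", "; "))
```

Whatever the source format, the calling code only ever handles objects with `text` and `metadata`, which is the point of the unified output.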


In [25]:
%pip install llama-index-readers-web

Note: you may need to restart the kernel to use updated packages.


In [26]:
from llama_index.readers.web import SimpleWebPageReader

In [27]:
urls = ["https://docs.llamaindex.ai"]

In [28]:
documents = SimpleWebPageReader().load_data(urls)

In [29]:
from pprint import pprint

In [30]:
# Function to pretty print document content and metadata
def pretty_print_documents(documents):
    for doc in documents:
        print(f"Document ID: {doc.id_}")
        print("Metadata:")
        pprint(doc.metadata, indent=4)
        print("\nContent:")
        print(doc.text)
        print("="*80)  # separator between documents

# Call the function
#pretty_print_documents(documents)

In [31]:
for doc in documents:
    print(doc.text)


<!doctype html>
<html lang="en" class="no-js">
  <head>
    
      <meta charset="utf-8">
      <meta name="viewport" content="width=device-width,initial-scale=1">
      
      
      
        <link rel="canonical" href="https://docs.llamaindex.ai/en/stable/">
      
      
      
        <link rel="next" href="getting_started/concepts/">
      
      
      <link rel="icon" href="_static/assets/LlamaLogoBrowserTab.png">
      <meta name="generator" content="mkdocs-1.6.1, mkdocs-material-9.5.41">
    
    
      
        <title>LlamaIndex - LlamaIndex</title>
      
    
    
      <link rel="stylesheet" href="assets/stylesheets/main.0253249f.min.css">
      
        
        <link rel="stylesheet" href="assets/stylesheets/palette.06af60db.min.css">
      
      


    
    
      
    
    
      
        
        
        <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
        <link rel="stylesheet" href="https://fonts.googleapis.com/css?family=Roboto:300,300i,4

When loading data, `SimpleWebPageReader` iterates over the list of URLs it is given.

For each URL, it performs a web request to fetch the page content. The response arrives as HTML and is returned unchanged by default, which is why the output above is raw markup. Setting the `html_to_text` flag to `True` strips the HTML tags and converts the page into more readable plain text.

This HTML-to-text conversion requires the `html2text` package, which must be installed first.
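
To make the effect of HTML-to-text conversion concrete, here is a rough sketch that strips tags with Python's standard-library `html.parser`. It is a stand-in technique, not what `SimpleWebPageReader` does internally (the reader relies on `html2text`); the names `TextExtractor` and `strip_tags` are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def strip_tags(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = """<!doctype html>
<html><head><title>LlamaIndex - LlamaIndex</title>
<style>body { margin: 0; }</style></head>
<body><h1>Welcome</h1><p>Build <b>context-augmented</b> apps.</p></body></html>"""

print(strip_tags(page))
```

With the real reader, the equivalent step is passing `html_to_text=True` when constructing `SimpleWebPageReader`, once `html2text` is installed.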

#### Using **SimpleDirectoryReader**
- Use **SimpleDirectoryReader** for quick and simple bulk data ingestion across multiple file formats.
- It's ideal for fast setup or when you have a **simple use case**.
- **SimpleDirectoryReader** adapts automatically to the file types it encounters, so one reader covers mixed content.
- You can point it to a **folder** or a **list of files** for easy data loading.
- It can load a folder containing **PDFs**, **Word docs**, **plain text files**, and **CSVs** with minimal effort.


In [32]:
from llama_index.core import SimpleDirectoryReader

In [33]:
reader = SimpleDirectoryReader(
    input_dir = "files",
    recursive = True
)
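
The `recursive = True` argument in the cell above asks the reader to descend into subfolders of `files`. The difference between a flat and a recursive listing can be illustrated with `pathlib` alone. This is only a sketch of the concept in a temporary folder with made-up file names, not the reader's own file-discovery code.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "notes.txt").write_text("top-level file")
    (root / "archive").mkdir()
    (root / "archive" / "old_notes.txt").write_text("nested file")

    # Non-recursive: only files directly inside the folder.
    flat = sorted(p.name for p in root.glob("*") if p.is_file())
    # Recursive: files in the folder and in every subfolder.
    deep = sorted(p.name for p in root.rglob("*") if p.is_file())

print("flat:", flat)
print("deep:", deep)
```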

In [34]:
documents = reader.load_data()


A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.1.2 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "D:\ANACONDA\Lib\site-packages\ipykernel_launcher.py", line 17, in <module>
    app.launch_new_instance()
  File "D:\ANACONDA\Lib\site-packages\traitlets\config\application.py", line 1075, in launch_instance
    app.start()
  File "D:\ANACONDA\Lib\site-packages\ipykernel\kernelapp.py", line 701, in start
    self.io_loop.start()
  File "D:\ANACONDA\Lib\site-packages\tornado\platform\asyncio.py", line 205, in star

ImportError: 
A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.1.2 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.



ImportError: `llama-index-readers-file` package not found
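
The output above points at two environment problems: a NumPy 2.x install that clashes with a module compiled against NumPy 1.x, and a missing `llama-index-readers-file` package. A small pre-flight check, sketched here with the standard-library `importlib.metadata`, can report both before the reader is run. The helper name `installed_version` is hypothetical.

```python
from importlib import metadata


def installed_version(distribution: str):
    """Return the installed version of a pip distribution, or None if absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Distribution names taken from the error output above.
for name in ("numpy", "llama-index-readers-file"):
    version = installed_version(name)
    print(f"{name}: {version or 'not installed'}")
```

If NumPy reports a 2.x version, the message itself suggests downgrading to `numpy<2` or upgrading the affected module; if `llama-index-readers-file` is reported as not installed, installing it with pip should resolve the second error.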

In [15]:
len(documents)

1

In [13]:
for doc in documents:
    print(doc.metadata)

{'file_path': 'D:\\gridflowAI\\00-ALL COURSES REPO\\Innovative-AI\\01-GenAI\\06-LLMs\\code\\05-various LLMs\\10-LlamaIndex\\01-vector-stores\\files\\sample_document1.txt', 'file_name': 'sample_document1.txt', 'file_type': 'text/plain', 'file_size': 612, 'creation_date': '2024-09-07', 'last_modified_date': '2024-09-07'}
{'file_path': 'D:\\gridflowAI\\00-ALL COURSES REPO\\Innovative-AI\\01-GenAI\\06-LLMs\\code\\05-various LLMs\\10-LlamaIndex\\01-vector-stores\\files\\sample_document2.txt', 'file_name': 'sample_document2.txt', 'file_type': 'text/plain', 'file_size': 556, 'creation_date': '2024-09-07', 'last_modified_date': '2024-09-07'}


In [14]:
from pprint import pprint

In [15]:
# Function to pretty print document content and metadata
def pretty_print_documents(documents):
    for doc in documents:
        print(f"Document ID: {doc.id_}")
        print("Metadata:")
        pprint(doc.metadata, indent=4)
        print("\nContent:")
        print(doc.text)
        print("="*80)  # separator between documents

# Call the function
pretty_print_documents(documents)

Document ID: bb691967-23a4-4084-88b9-cc6308d8f6e6
Metadata:
{   'creation_date': '2024-09-07',
    'file_name': 'sample_document1.txt',
    'file_path': 'D:\\gridflowAI\\00-ALL COURSES '
                 'REPO\\Innovative-AI\\01-GenAI\\06-LLMs\\code\\05-various '
                 'LLMs\\10-LlamaIndex\\01-vector-stores\\files\\sample_document1.txt',
    'file_size': 612,
    'file_type': 'text/plain',
    'last_modified_date': '2024-09-07'}

Content:
In ancient Rome, the city of Rome itself was the heart of the vast Roman Empire. It was known for its grand architecture, including iconic structures like the Colosseum and the Pantheon. The Romans were skilled engineers and builders, creating an extensive network of roads, aqueducts, and bridges that connected their far-reaching territories. The Roman Republic, with its Senate and elected officials, gave rise to the famous Roman legions, which conquered vast lands and brought them under Roman rule. The Roman civilization's influence on art

In [16]:
from llama_index.core import SummaryIndex
from llama_index.core.node_parser import SimpleNodeParser

In [17]:
parser = SimpleNodeParser.from_defaults()

In [18]:
nodes = parser.get_nodes_from_documents(documents)

In [19]:
index = SummaryIndex(nodes)

In [20]:
query_engine = index.as_query_engine()

In [21]:
question = "Do Dogs provide love to owners"

response = query_engine.query(question)
print(response)

Dogs have been known to provide comfort, protection, and unwavering love to their owners.


In [22]:
question = "What was known as heart of Roman Empire?"

response = query_engine.query(question)
print(response)

The city of Rome itself was known as the heart of the vast Roman Empire.


**SimpleDirectoryReader** automatically chooses a suitable reader for each file type.
- You don't need to specify file types; it detects formats like **PDF**, **DOCX**, **CSV**, and **plain text** based on file extensions.
- It then selects the appropriate reader to extract the content into **Document objects**.
- For **plain text files**, it simply reads the text content.
- For **binary files** like PDFs and Office documents, it relies on additional libraries, such as **PyPDF** for PDFs, to extract the text; **Pillow** is used for image files.
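
Extension-based reader selection, as described in the list above, can be sketched as a lookup table from file suffix to extraction function. This is a simplified illustration under stated assumptions, not the library's code: the `READERS` registry, the function names, and the plain-text fallback for unknown suffixes are all hypothetical.

```python
import csv
import io
from pathlib import Path


def read_plain_text(raw: bytes) -> str:
    return raw.decode("utf-8")


def read_csv(raw: bytes) -> str:
    rows = csv.reader(io.StringIO(raw.decode("utf-8")))
    return "\n".join(", ".join(row) for row in rows)


# Hypothetical registry: file suffix -> function that turns bytes into text.
READERS = {".txt": read_plain_text, ".csv": read_csv}


def extract_text(file_name: str, raw: bytes) -> str:
    suffix = Path(file_name).suffix.lower()
    # Fall back to plain-text decoding for unknown suffixes (an assumption of this sketch).
    reader = READERS.get(suffix, read_plain_text)
    return reader(raw)


print(extract_text("sample_document1.txt", b"Rome was the heart of the empire."))
print(extract_text("table.CSV", b"city,landmark\nRome,Colosseum\n"))
```

Supporting a new format in this design means adding one entry to the registry, while the calling code stays unchanged.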
