# Data Ingestion
**Data ingestion** is the process of collecting raw data from different sources and loading it into a system for downstream processing. In a RAG pipeline, it means **reading content from PDF, DOCX, and XLSX files and from web pages, and converting it to plain text**.

In [1]:
!sudo apt-get update
!sudo apt-get install -y poppler-utils


In [12]:
!pip install -U unstructured-pytesseract
!pip install python-docx




In [3]:
!pip install -U langchain langchain-community unstructured pdf2image pytesseract python-docx openpyxl pdfminer.six pi_heif unstructured_inference

Collecting langchain-community
  Downloading langchain_community-0.3.24-py3-none-any.whl.metadata (2.5 kB)
Collecting unstructured
  Downloading unstructured-0.17.2-py3-none-any.whl.metadata (24 kB)
Collecting pdf2image
  Downloading pdf2image-1.17.0-py3-none-any.whl.metadata (6.2 kB)
Collecting pytesseract
  Downloading pytesseract-0.3.13-py3-none-any.whl.metadata (11 kB)
Collecting python-docx
  Downloading python_docx-1.1.2-py3-none-any.whl.metadata (2.0 kB)
Collecting pdfminer.six
  Downloading pdfminer_six-20250506-py3-none-any.whl.metadata (4.2 kB)
Collecting pi_heif
  Downloading pi_heif-0.22.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.5 kB)
Collecting unstructured_inference
  Downloading unstructured_inference-1.0.2-py3-none-any.whl.metadata (5.3 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain-commu

In [4]:
from langchain_community.document_loaders import (
    UnstructuredPDFLoader,
    UnstructuredWordDocumentLoader,
    UnstructuredExcelLoader,
    WebBaseLoader
)




In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Reading PDF files

In [6]:
pdf_loader = UnstructuredPDFLoader("/content/drive/MyDrive/Zeta Workshop/dataset/Avinash-CV.pdf", mode="elements")
elements = pdf_loader.load()


In [9]:
page_data = {}
for el in elements:
    # Group element text by source page; print(el.metadata) to inspect other keys
    page = el.metadata.get("page_number")
    page_data.setdefault(page, []).append(el.page_content)


In [10]:
for page in page_data:
    print(f"Page {page}:")
    print(page_data[page])

Page 1:
['Dr. Avinash Kumar Singh', 'Hyderabad, India | +91-9005722861 | avinashkumarsingh1986@gmail.com | http://avinashkumarsingh.in', 'Profile', 'With over 14 years in Al, I have evolved through roles as an ML Researcher, Engineer, Product Manager, and now as Chief Al Scientist. | have led the development and deployment of deep learning-based computer vision and NLP models on platforms like AWS, GCP, Humanoid Robots, Edge Devices like Jetson Nano, Raspberry Pi, and NXP boards. My expertise extends to tackling challenges in concurrency, security, and latency. My academic journey, enriched by a Ph.D. and postdoctoral research, provides a profound understanding of neural networks across diverse data types, while my industrial experience ensures practical AI solutions are deployed effectively, serving real users. This unique blend of research and industry expertise enables me to lead in crafting and delivering impactful Al innovations, driving business transformation and societal advanc
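
Each entry in `page_data` is a list of element strings, but later pipeline stages often want one text block per page. The following is a minimal sketch under that assumption; `join_pages` is a hypothetical helper, not part of LangChain or Unstructured, and the sample data is a shortened stand-in for the output above.

```python
def join_pages(page_data, sep="\n"):
    """Collapse each page's list of element strings into one text block."""
    return {
        page: sep.join(text.strip() for text in texts if text and text.strip())
        for page, texts in page_data.items()
    }

# Sample shaped like `page_data` above (values shortened for illustration)
sample_page_data = {
    1: ["Dr. Avinash Kumar Singh", "Profile", "  "],
}
page_text = join_pages(sample_page_data)
print(page_text[1])
```

Whitespace-only elements are skipped, so OCR noise does not produce empty lines in the joined text.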

# Reading Word files

In [13]:
# docx_loader = UnstructuredWordDocumentLoader("/content/drive/MyDrive/Zeta Workshop/dataset/Avinash-CV.docx")
# docx_docs = docx_loader.load()

from docx import Document

def extract_docx_structure(docx_path):
    doc = Document(docx_path)
    structure = []

    for i, para in enumerate(doc.paragraphs):
        text = para.text.strip()
        style = para.style.name

        if not text:
            continue

        element = {
            "paragraph_num": i + 1,
            "text": text,
            "style": style
        }

        # Identify heading level
        if style.startswith("Heading"):
            element["heading_level"] = style

        structure.append(element)

    return structure

doc_structure = extract_docx_structure("/content/drive/MyDrive/Zeta Workshop/dataset/Avinash-CV.docx")

for item in doc_structure:
    print(f"\nParagraph {item['paragraph_num']}")
    print(f"Style: {item['style']}")
    if "heading_level" in item:
        print(f"Heading: {item['heading_level']}")
    print(f"Text: {item['text'][:200]}")



Paragraph 1
Style: Contact
Text: Hyderabad, India | +91-9005722861 | avinashkumarsingh1986@gmail.com	 | http://avinashkumarsingh.in

Paragraph 3
Style: Heading 1
Heading: Heading 1
Text: With over 14 years in AI, I have evolved through roles as an ML Researcher, Engineer, Product Manager, and now as Chief AI Scientist. I have led the development and deployment of deep learning-based c

Paragraph 6
Style: Heading 2
Heading: Heading 2
Text: AI Consultant & corporate trainer | robaita, Hyderabad, India | SeP 24 – CONT…

Paragraph 7
Style: List Bullet
Text: Mentored and empowered 1,500+ students and working professionals from diverse backgrounds. Led immersive sessions on Fine-tuning Large Language Models (LLMs), designing robust Retrieval-Augmented Gene

Paragraph 8
Style: List Bullet
Text: Engineered and deployed a cutting-edge visual language model-based computer vision system to drastically reduce loose picking of Apple iPhones in a warehouse environment, achieving a significant reduc
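
The list returned by `extract_docx_structure` is flat, while chunking usually works better on sections. One possible way to regroup it, sketched here with a hypothetical helper `group_by_heading` and made-up sample rows in the same dictionary shape, is to attach each paragraph to the most recent heading.

```python
def group_by_heading(structure):
    """Attach each non-heading paragraph to the most recent heading."""
    sections = []
    current = {"heading": None, "paragraphs": []}
    for item in structure:
        if "heading_level" in item:
            # Close the running section before starting a new one
            if current["heading"] is not None or current["paragraphs"]:
                sections.append(current)
            current = {"heading": item["text"], "paragraphs": []}
        else:
            current["paragraphs"].append(item["text"])
    if current["heading"] is not None or current["paragraphs"]:
        sections.append(current)
    return sections

# Illustrative rows in the shape produced by extract_docx_structure
sample_structure = [
    {"paragraph_num": 1, "text": "Contact line", "style": "Contact"},
    {"paragraph_num": 6, "text": "AI Consultant", "style": "Heading 2",
     "heading_level": "Heading 2"},
    {"paragraph_num": 7, "text": "Mentored students", "style": "List Bullet"},
    {"paragraph_num": 8, "text": "Deployed a vision system", "style": "List Bullet"},
]
sections = group_by_heading(sample_structure)
for section in sections:
    print(section["heading"], "->", len(section["paragraphs"]), "paragraph(s)")
```

The grouping is only as reliable as the styles in the file: in the output above, a body paragraph carries the `Heading 1` style, so it would be treated as a heading here.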

# Reading Excel/CSV files

In [18]:
excel_loader = UnstructuredExcelLoader("/content/drive/MyDrive/Zeta Workshop/dataset/headcount_2025.xlsx")
excel_docs = excel_loader.load()
excel_docs[0].page_content[:200]

'Name Department Position Salary (INR) 2025 Increase Sarah Verma Finance Financial Analyst 899000 11.78% Aditya Verma Sales Sales Manager 772000 10.18% Vihaan Gupta Marketing Digital Marketer 753000 3.'

In [19]:
# Load Excel file
import pandas as pd
file_path = "/content/drive/MyDrive/Zeta Workshop/dataset/headcount_2025.xlsx"
xls = pd.ExcelFile(file_path)

# List all sheets
print("Sheets:", xls.sheet_names)

Sheets: ['Sheet1']


In [20]:
# Loop through every sheet and preview it
tables = {}
for sheet in xls.sheet_names:
    print(f"\n--- Sheet: {sheet} ---")
    df = xls.parse(sheet)
    print(df.head())  # Show the first five rows

    # Keep each sheet as a list of row dicts for later use
    tables[sheet] = df.to_dict(orient="records")



--- Sheet: Sheet1 ---
           Name Department             Position  Salary (INR) 2025 Increase
0   Sarah Verma    Finance    Financial Analyst        899000        11.78%
1  Aditya Verma      Sales        Sales Manager        772000        10.18%
2  Vihaan Gupta  Marketing     Digital Marketer        753000          3.7%
3  Kiara Mishra      Legal        Legal Manager        753000        13.39%
4  Atharv Gupta  Marketing  Marketing Executive       1116000         3.25%
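
The flat string from `UnstructuredExcelLoader` loses the link between a value and its column. A possible alternative, sketched below, is to render each row record as `column: value` pairs. `row_to_text` is a hypothetical helper, and the two records are copied from the preview above in the shape that `df.to_dict(orient="records")` returns; the value types are an assumption.

```python
def row_to_text(record):
    """Render one row as 'column: value' pairs so headers travel with values."""
    return "; ".join(f"{column}: {value}" for column, value in record.items())

# Two rows from the preview above (types assumed: salary as int, increase as str)
records = [
    {"Name": "Sarah Verma", "Department": "Finance",
     "Position": "Financial Analyst", "Salary (INR)": 899000,
     "2025 Increase": "11.78%"},
    {"Name": "Aditya Verma", "Department": "Sales",
     "Position": "Sales Manager", "Salary (INR)": 772000,
     "2025 Increase": "10.18%"},
]
row_texts = [row_to_text(record) for record in records]
print(row_texts[0])
```

Each resulting string can then be treated as its own small document, which keeps a retrieved row self-describing.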


# Reading Webpages

In [47]:
web_loader = WebBaseLoader("https://robaita.com/about/")
web_docs = web_loader.load()

In [48]:
web_docs[0]

Document(metadata={'source': 'https://robaita.com/about/', 'title': 'About | Robaita', 'description': 'Hi there! Thank you so much for visiting our page. We are Robaita, which stands for Robotics and Artificial Intelligence Training Academy…', 'language': 'en'}, page_content="About | RobaitaSkip to contentRobaitaBlogAboutCoursesGithubHomepageLinkedInAboutHi there!\nThank you so much for visiting our page. We are Robaita, which stands for Robotics and Artificial Intelligence Training Academy. Our website is designed to spread awareness about robotics and artificial intelligence. Our mission is to support and uplift underprivileged students and aspiring individuals who are interested in these fields. We are passionate about open source and teaching. All of our source codes, blogs, and materials are freely available for anyone to use. Let's learn and grow together. \nॐ सह नाववतु । \nसह नौ भुनक्तु । \nसह वीर्यं करवावहै । \nतेजस्वि नावधीतमस्तु मा विद्विषावहै । \nॐ शान्तिः शान्तिः शान्तिः ॥©

In [49]:
print(web_docs[0].metadata['source'])
print(web_docs[0].metadata['title'])
print(web_docs[0].metadata['description'])
print(web_docs[0].page_content)

https://robaita.com/about/
About | Robaita
Hi there! Thank you so much for visiting our page. We are Robaita, which stands for Robotics and Artificial Intelligence Training Academy…
About | RobaitaSkip to contentRobaitaBlogAboutCoursesGithubHomepageLinkedInAboutHi there!
Thank you so much for visiting our page. We are Robaita, which stands for Robotics and Artificial Intelligence Training Academy. Our website is designed to spread awareness about robotics and artificial intelligence. Our mission is to support and uplift underprivileged students and aspiring individuals who are interested in these fields. We are passionate about open source and teaching. All of our source codes, blogs, and materials are freely available for anyone to use. Let's learn and grow together. 
ॐ सह नाववतु । 
सह नौ भुनक्तु । 
सह वीर्यं करवावहै । 
तेजस्वि नावधीतमस्तु मा विद्विषावहै । 
ॐ शान्तिः शान्तिः शान्तिः ॥© 2024 by Robaita. All rights reserved.Theme by LekoArts
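
The scraped `page_content` carries trailing spaces and blank lines. A light cleanup pass can normalize that before chunking. This is a minimal sketch using only the standard library; `clean_page_text` is a hypothetical helper and `raw` is a shortened stand-in for the page text above.

```python
import re

def clean_page_text(text):
    """Trim each line, drop empty lines, and collapse runs of spaces and tabs."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)

# Shortened stand-in for web_docs[0].page_content
raw = "Hi there! \nThank you   for visiting. \n\n\nLet's learn and grow together. \n"
print(clean_page_text(raw))
```

This does not separate navigation labels that were fused during extraction (for example `RobaitaSkip to content`); fixing those would need filtering at the HTML level rather than on the extracted text.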


