
# Data Ingestion

## What is Data Ingestion?
Data ingestion is the process of **collecting data from multiple sources and moving it into a storage system or processing framework** where it can be analyzed, transformed, or used by applications.

---

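## Types of Data Ingestion
1. **Batch Ingestion** – Data is collected and loaded in chunks at scheduled intervals (for example, hourly or nightly).  
2. **Real-Time (Streaming) Ingestion** – Data is ingested continuously, as each event arrives.  
3. **Hybrid (Lambda Architecture)** – Combines a batch layer for complete historical data with a streaming layer for low-latency updates.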
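
The difference between the first two modes can be sketched in plain Python. This is a minimal illustration, not a production pattern: `sensor_readings`, `ingest_batch`, and `ingest_stream` are hypothetical names, and the "source" is a generator of fake readings.

```python
from itertools import islice
from typing import Iterable, Iterator


def sensor_readings(n: int) -> Iterator[dict]:
    """Hypothetical source: yields n fake sensor readings one at a time."""
    for i in range(n):
        yield {"sensor_id": i % 3, "value": 20.0 + i}


def ingest_batch(source: Iterable[dict], batch_size: int) -> list[list[dict]]:
    """Batch ingestion: collect records into fixed-size chunks before loading."""
    it = iter(source)
    batches = []
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            break
        batches.append(chunk)  # a real pipeline would write the chunk to storage here
    return batches


def ingest_stream(source: Iterable[dict]) -> Iterator[dict]:
    """Streaming ingestion: handle each record as soon as it arrives."""
    for record in source:
        yield {**record, "ingested": True}


batches = ingest_batch(sensor_readings(10), batch_size=4)
print([len(b) for b in batches])  # chunk sizes

first = next(ingest_stream(sensor_readings(10)))
print(first)
```

The batch function returns nothing until a whole chunk is ready, while the streaming generator hands back the first record immediately. A hybrid design would run both paths over the same source.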

---

## Steps in Data Ingestion
1. Connect to Data Source  
2. Extract Data  
3. Transform (optional)  
4. Load Data  
5. Monitor & Manage  
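
The five steps above can be sketched end to end with only the standard library. This is an illustrative sketch under stated assumptions: the inline `RAW_CSV` data, the `scores` table, and the `extract` / `transform` / `load` function names are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical inline data standing in for a real source (file, API, queue).
RAW_CSV = "name,score\nasha,81\nben,\nchen,94\n"


def extract(text: str) -> list[dict]:
    """Steps 1-2: connect to the source and pull raw rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows: list[dict]) -> list[tuple]:
    """Step 3: drop rows with a missing score and cast types."""
    return [(r["name"], int(r["score"])) for r in rows if r["score"]]


def load(rows: list[tuple], conn: sqlite3.Connection) -> int:
    """Step 4: write the cleaned rows to the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS scores (name TEXT, score INTEGER)")
    conn.executemany("INSERT INTO scores VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
raw = extract(RAW_CSV)
loaded = load(transform(raw), conn)

# Step 5: a basic monitoring check comparing rows read with rows loaded.
print(f"extracted={len(raw)} loaded={loaded} dropped={len(raw) - loaded}")
stored = conn.execute("SELECT name, score FROM scores ORDER BY name").fetchall()
conn.close()
```

Keeping each step in its own function makes it easier to swap the source or the target later without touching the rest of the pipeline.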

---

## Tools & Technologies
- Batch: Apache NiFi, Talend, Apache Airflow (workflow orchestration)  
- Streaming: Apache Kafka, AWS Kinesis, Spark Streaming  
- Storage: Snowflake, BigQuery, Hadoop (HDFS), Databricks


## Key Characteristics of Data Ingestion

- Scalability
  - Should handle large and growing datasets from multiple sources.
  - Example: An IoT system where millions of sensor readings are ingested per second.

- Flexibility
  - Must support different data formats (CSV, JSON, XML, Parquet, logs, etc.).
  - Should work with structured, semi-structured, and unstructured data.

- Speed (Latency)
  - Can operate in batch mode (minutes to hours) or real-time streaming (milliseconds).
  - Critical for applications like fraud detection or stock trading.

- Reliability & Fault Tolerance
  - System should recover from failures without losing data.
  - Uses checkpoints, retries, and acknowledgments to ensure delivery.
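
As one illustration of the reliability point, here is a minimal retry-with-backoff sketch. The helper `ingest_with_retries` and the failing source `flaky_fetch` are hypothetical, and real systems usually combine retries with checkpoints and acknowledgments rather than relying on retries alone.

```python
import time


def ingest_with_retries(fetch, max_attempts=3, base_delay=0.01):
    """Call fetch() until it succeeds, sleeping longer after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except ConnectionError as exc:
            if attempt == max_attempts:
                raise  # give up and surface the error to the caller
            delay = base_delay * 2 ** (attempt - 1)
            print(f"attempt {attempt} failed ({exc}); retrying in {delay:.2f}s")
            time.sleep(delay)


# Hypothetical flaky source: fails twice, then returns data.
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("source unavailable")
    return [{"id": 1}, {"id": 2}]


records = ingest_with_retries(flaky_fetch)
print(records)
```

The delay doubles after every failed attempt, which gives a struggling source time to recover instead of hammering it.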

---


In [None]:
# Install required packages (sqlite3 ships with Python, so it needs no install)
!pip install pandas requests

In [None]:
import pandas as pd
import requests
import sqlite3

## 1. Ingest from a CSV File

In [None]:
# Example CSV file from a public dataset
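url = "https://people.sc.fsu.edu/~jburkardt/data/csv/hw_200.csv"
# skipinitialspace=True strips spaces after commas so they don't end up in column names
csv_data = pd.read_csv(url, skipinitialspace=True)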

print("CSV Data (first 5 rows):")
csv_data.head()

## 2. Ingest from an API

In [None]:
# Example API: JSONPlaceholder (fake posts)
api_url = "https://jsonplaceholder.typicode.com/posts"
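response = requests.get(api_url, timeout=10)  # avoid hanging on a slow server
response.raise_for_status()  # fail fast on HTTP 4xx/5xx instead of parsing an error body
api_data = pd.DataFrame(response.json())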

print("API Data (first 5 rows):")
api_data.head()

## 3. Ingest from a Database (SQLite Demo)

In [None]:
# Save CSV data into SQLite DB and read it back
conn = sqlite3.connect("sample.db")
csv_data.to_sql("students", conn, if_exists="replace", index=False)

db_data = pd.read_sql("SELECT * FROM students LIMIT 5", conn)
conn.close()  # release the database file once the data is in memory
print("Database Data (first 5 rows):")
db_data

## What is a Kafka Cluster?
A Kafka cluster is a group of servers (called brokers) that work together to handle real-time data streams. The brokers store incoming messages durably and move them efficiently between systems.


## Why Use Kafka?
- Real-time data streaming
- High performance (a well-sized cluster can handle millions of messages per second)
- Scalable (add more brokers easily)
- Reliable (data is replicated for safety)
- Decouples systems (producers and consumers don’t need to know each other)





## Kafka Cluster Terms (Quick Points)
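- Broker: Kafka server that stores and serves data.
- Topic: Named stream of messages (loosely, like a folder).
- Partition: Ordered, append-only slice of a topic; partitions allow parallel processing.
- Producer: Client that writes messages to a topic.
- Consumer: Client that reads messages from a topic.
- Consumer Group: Set of consumers that share a topic's partitions; each partition is read by one consumer in the group.
- Offset: Sequential position of a message within a partition.
- Replication: Copies of each partition kept on multiple brokers for fault tolerance.
- ZooKeeper / KRaft: Cluster coordination and metadata management; KRaft is Kafka's built-in replacement for ZooKeeper.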
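
To make partitions, offsets, and consumer groups concrete, here is a toy in-memory model. It is not the Kafka client API: `ToyTopic` and `ToyConsumerGroup` are hypothetical classes, and the CRC32 hash is only a stand-in for Kafka's own key partitioner.

```python
from collections import defaultdict
import zlib


class ToyTopic:
    """In-memory stand-in for a topic; each partition is an append-only list."""

    def __init__(self, num_partitions: int):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key: str, value: str) -> tuple[int, int]:
        # Same key -> same partition, so per-key ordering is preserved.
        p = zlib.crc32(key.encode()) % len(self.partitions)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1  # (partition, offset)


class ToyConsumerGroup:
    """Tracks one committed offset per partition for the whole group."""

    def __init__(self, topic: ToyTopic):
        self.topic = topic
        self.offsets = defaultdict(int)

    def poll(self, partition: int) -> list[str]:
        log = self.topic.partitions[partition]
        new = log[self.offsets[partition]:]
        self.offsets[partition] = len(log)  # "commit" the new position
        return new


topic = ToyTopic(num_partitions=2)
p1, o1 = topic.produce("user-1", "login")
p2, o2 = topic.produce("user-1", "click")

group = ToyConsumerGroup(topic)
print(group.poll(p1))  # both messages, in produce order
print(group.poll(p1))  # nothing new since the last poll
```

Because the group remembers its offset per partition, a second poll returns only messages produced after the first one. A second, independent group would start from offset 0 and see every message again.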
