API Data Streaming Pipeline

A comprehensive real-time data streaming solution that ingests API data and processes it through a robust data engineering pipeline.

Overview

This project demonstrates an end-to-end data engineering pipeline that ingests data from APIs, processes it in real-time, and stores it for analysis. The architecture implements a scalable, fault-tolerant system using modern data engineering tools and technologies.

The pipeline follows these key steps:

Fetch data from external APIs
Stream the data through Kafka
Process the streams with Spark
Store processed data in Cassandra
Orchestrate workflows with Airflow

Technology Stack

Docker: Containerization platform for consistent deployment across environments
Apache Airflow: Workflow orchestration tool for scheduling and monitoring data pipelines
Apache Kafka: Distributed streaming platform for high-throughput, fault-tolerant data ingestion
Apache Spark: Unified analytics engine for large-scale data processing
Apache Cassandra: NoSQL database for handling high-velocity data with no single point of failure
PostgreSQL: Relational database for structured data storage and Airflow metadata

Architecture

Data Flow:

Ingestion Layer: External APIs → Kafka topics
Processing Layer: Kafka → Spark Streaming
Storage Layer: Processed data → Cassandra (NoSQL) and PostgreSQL (SQL)
Orchestration Layer: Airflow manages the entire workflow

Prerequisites

Docker and Docker Compose
Python 3.8+
Internet connection (for API access)
Basic understanding of data engineering concepts

Project Structure

├── dags/
│   └── kafka_stream.py         # Airflow DAG for API data ingestion
├── scripts/
│   └── entrypoint.sh           # Container initialization script
├── docker-compose.yml          # Container orchestration
├── requirements.txt            # Python dependencies
├── spark_stream.py             # Spark streaming job
├── .env                        # Environment variables (git-ignored)
└── README.md                   # Project documentation

Setup and Installation

1. Clone the repository

git clone https://github.com/shreyashreddyk/api-data-streaming.git
cd api-data-streaming

2. Install dependencies

pip install -r requirements.txt

3. Make the entrypoint script executable

chmod +x scripts/entrypoint.sh

4. Create .env file with your API credentials

API_KEY=your_api_key_here
API_SECRET=your_api_secret_here

5. Start the containers

docker-compose up -d

This command spins up the following containers:

Zookeeper (configuration service for Kafka)
Kafka Broker (message broker)
Schema Registry (metadata service)
Control Center (Kafka monitoring UI)
Spark Master & Worker (distributed processing)
Cassandra (NoSQL database)
Airflow Webserver & Scheduler
PostgreSQL (relational database)

Running the Pipeline

1. Access Airflow UI

Navigate to http://localhost:8080 in your browser to access the Airflow web interface.

2. Trigger the data ingestion DAG

In the Airflow UI, locate the api_data_pipeline DAG and click the "Play" button to trigger it. This will:

Call external APIs to fetch data
Stream the data into Kafka topics
Schedule the data processing workflow

3. Monitor Kafka streams

Access the Kafka Control Center at http://localhost:9021 to monitor topics, consumers, and data flow through the system.

4. Submit the Spark streaming job

spark-submit --master spark://localhost:7077 spark_stream.py

This starts the Spark streaming application that consumes data from Kafka, processes it according to business logic, and writes results to Cassandra.

5. Verify data in Cassandra

Connect to the Cassandra cluster and check that data is being stored correctly:

docker exec -it cassandra cqlsh -u cassandra -p cassandra localhost 9042

Then run queries to view your data:

DESCRIBE keyspaces;
USE api_data;
SELECT * FROM processed_data LIMIT 10;

Monitoring

Spark UI: Available at http://localhost:9090
Kafka Control Center: Available at http://localhost:9021
Airflow Dashboard: Available at http://localhost:8080

Troubleshooting

Common Issues

Container startup failures
- Check container logs: docker-compose logs [service_name]
- Verify that ports are not already in use
API rate limiting
- Implement appropriate backoff strategies in your API calls
- Check API response codes and handle accordingly
Kafka connection issues
- Ensure Zookeeper is running properly
- Check network connectivity between containers
Spark job failures
- Review Spark UI for detailed error messages
- Verify JAR dependencies are correctly specified

References

Ganiyu, Y. (2023). "Realtime Data Engineering Project With Airflow, Kafka, Spark, Cassandra and Postgres."
Apache Kafka Documentation: https://kafka.apache.org/documentation/
Apache Spark Documentation: https://spark.apache.org/docs/latest/
Apache Airflow Documentation: https://airflow.apache.org/docs/
Apache Cassandra Documentation: https://cassandra.apache.org/doc/latest/

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
dags		dags
script		script
Data engineering architecture.png		Data engineering architecture.png
LICENSE		LICENSE
README.md		README.md
docker-compose.yml		docker-compose.yml
requirements.txt		requirements.txt
spark-stream-2.py		spark-stream-2.py
spark-stream.py		spark-stream.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

API Data Streaming Pipeline

Overview

Technology Stack

Architecture

Prerequisites

Project Structure

Setup and Installation

1. Clone the repository

2. Install dependencies

3. Make the entrypoint script executable

4. Create .env file with your API credentials

5. Start the containers

Running the Pipeline

1. Access Airflow UI

2. Trigger the data ingestion DAG

3. Monitor Kafka streams

4. Submit the Spark streaming job

5. Verify data in Cassandra

Monitoring

Troubleshooting

Common Issues

References

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

API Data Streaming Pipeline

Overview

Technology Stack

Architecture

Prerequisites

Project Structure

Setup and Installation

1. Clone the repository

2. Install dependencies

3. Make the entrypoint script executable

4. Create .env file with your API credentials

5. Start the containers

Running the Pipeline

1. Access Airflow UI

2. Trigger the data ingestion DAG

3. Monitor Kafka streams

4. Submit the Spark streaming job

5. Verify data in Cassandra

Monitoring

Troubleshooting

Common Issues

References

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages