This README provides instructions for setting up a data engineering project that simulates data generation with Python, publishes it to Apache Kafka, processes it with Apache Spark, and stores the results in Amazon S3. All services are orchestrated and run in Docker containers.
The Smart City Data Engineering project aims to simulate and process real-time data streams from various sources such as vehicles, GPS devices, weather stations, and traffic cameras. The project utilizes Apache Kafka for message queuing and distribution, Apache Spark for stream processing, and Amazon S3 for data storage.
Simulate data streams for the following topics:
vehicle
gps
weather
traffic_camera
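A data simulator for one of these topics can be sketched as below. The record fields, value ranges, broker address, and topic constant are all assumptions for illustration, not the project's actual schema; the serialization helper shows the JSON-to-bytes step a Kafka producer needs.

```python
import json
import random
import uuid
from datetime import datetime, timezone

# Hypothetical broker address and topic name -- adjust to your setup.
KAFKA_BROKER = "localhost:9092"
VEHICLE_TOPIC = "vehicle"

def generate_vehicle_record(vehicle_id):
    """Simulate one telemetry reading for a vehicle (fields are illustrative)."""
    return {
        "id": str(uuid.uuid4()),
        "vehicle_id": vehicle_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "speed_kmh": round(random.uniform(0, 120), 1),
        "fuel_level": round(random.uniform(0, 100), 1),
        "location": {
            "latitude": round(random.uniform(51.0, 52.0), 6),
            "longitude": round(random.uniform(-0.5, 0.5), 6),
        },
    }

def serialize(record):
    """Kafka producers send bytes, so encode the record as UTF-8 JSON."""
    return json.dumps(record).encode("utf-8")
```

With a client library such as kafka-python, the loop that feeds the topic would look roughly like `KafkaProducer(bootstrap_servers=KAFKA_BROKER).send(VEHICLE_TOPIC, serialize(generate_vehicle_record("veh-1")))`, repeated on an interval for each simulated source.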
Process the data streams using Apache Spark to perform real-time analytics, aggregation, and transformation.
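One representative aggregation, averaging speed over fixed time windows, can be sketched in plain Python before translating it into Spark Structured Streaming. The `timestamp` and `speed_kmh` field names are hypothetical and should be aligned with the actual record schema.

```python
from collections import defaultdict
from datetime import datetime

def window_start(ts, window_seconds=60):
    """Floor a timestamp to the start of its fixed-size window."""
    epoch = ts.timestamp()
    return datetime.fromtimestamp(epoch - epoch % window_seconds)

def average_speed_by_window(records, window_seconds=60):
    """Group records into fixed windows and average their speed."""
    sums = defaultdict(lambda: [0.0, 0])
    for rec in records:
        ts = datetime.fromisoformat(rec["timestamp"])
        bucket = sums[window_start(ts, window_seconds)]
        bucket[0] += rec["speed_kmh"]
        bucket[1] += 1
    return {start: total / count for start, (total, count) in sums.items()}
```

In Spark this logic corresponds to a streaming query along the lines of `df.groupBy(window(col("timestamp"), "1 minute")).avg("speed_kmh")`, with Spark handling windowing, state, and late data instead of the manual dictionary above.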
Store the processed data in Amazon S3 for long-term storage and analysis.
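One common way to lay out processed output in S3 is Hive-style partitioned keys, so downstream query engines can prune by date. The `processed` prefix and `.json` suffix below are assumptions, not the project's actual layout.

```python
from datetime import datetime

def s3_object_key(topic, ts, batch_id, prefix="processed"):
    """Build a Hive-style partitioned key, e.g.
    processed/vehicle/year=2024/month=06/day=01/batch-0001.json
    (prefix and file naming are hypothetical)."""
    return (
        f"{prefix}/{topic}/"
        f"year={ts.year:04d}/month={ts.month:02d}/day={ts.day:02d}/"
        f"batch-{batch_id:04d}.json"
    )
```

The actual upload would then use an S3 client such as boto3, e.g. `boto3.client("s3").put_object(Bucket=bucket_name, Key=key, Body=payload)`, with the bucket name supplied by your configuration.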
All services will be containerized using Docker for easy deployment and management.
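The Kafka side of the stack might be declared in `docker-compose.yml` roughly as sketched below; the image tags and listener settings are assumptions to adapt to the project's actual compose file.

```yaml
version: "3.8"
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.4.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
  kafka:
    image: confluentinc/cp-kafka:7.4.0
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
```

The Spark master, Spark workers, and the Python simulator would be added as further services in the same file so the whole pipeline starts with a single `docker-compose up`.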
To run this project you need Python 3, pip, Docker, and Docker Compose installed on your machine. Then create a virtual environment, install the dependencies, and start the services:
$ python3 -m venv env
$ source env/bin/activate
$ pip install -r requirements.txt
$ docker-compose up --build
You have now set up a data engineering project that simulates data generation with Python, streams it through Apache Kafka, processes it with Apache Spark, and stores the results in Amazon S3. Because every service runs in a Docker container, the entire pipeline is easy to deploy and manage.
Feel free to customize and extend the project to incorporate additional data sources, processing logic, or storage destinations as needed for your smart city application.
For more information on Docker, Apache Kafka, Apache Spark, and Amazon S3, refer to their respective documentation: