Data engineering project with web scraping and API calls using MySQL, Pandas, and Google Cloud.
Gans is a fictional company providing electric scooters for rent. To distribute their scooters efficiently for customers, they need data about the cities of operation, such as population, weather forecasts, and flight arrival times from nearby airports.
The goal of this project is to establish a data engineering pipeline in the cloud (Google Cloud Platform). Information is collected using web scraping and API calls and is continuously updated via cloud scheduling. A set of tables in a relational Cloud SQL instance serves the data, keeping it accessible and up to date at all times.
To accomplish these tasks, I have compiled the following resources for this project:

- A Python package for implementing the pipeline in this repository
- A technical documentation describing the Python package and its setup locally and in the cloud
- An article about establishing a data engineering project on the Google Cloud Platform
- Set up a local MySQL database (see the connection sketch after this list)
  Tools: MySQL Workbench, Python, SQLAlchemy, mysql-connector-python
- Collect static data on cities and airports using web scraping (scraping sketch below)
  Tools: Python, Pandas, BeautifulSoup
- Collect dynamic data on weather and flights using web API calls (API-call sketch below)
  Tools: Python, Pandas, Requests
- Implement the pipeline locally (entry-point sketch below)
  Tools: Python, functions-framework
- Deploy the pipeline on Google Cloud Platform
  Tools (Google Cloud services): Cloud Functions, Cloud SQL, Cloud Scheduler, Secret Manager
- Document the findings by writing an article
  Tools: Medium.com
- Write a technical documentation of the implementation
  Tools: MkDocs
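As a sketch of the first step, connecting to the local MySQL database with SQLAlchemy and the mysql-connector-python driver might look as follows; the credentials, host, and example rows are placeholders, not values from this repository:

```python
# Minimal sketch: connecting to a local MySQL database with SQLAlchemy
# and the mysql-connector-python driver. Credentials, host, and the
# example rows below are placeholders.
import pandas as pd
from sqlalchemy import create_engine

# "mysql+mysqlconnector" selects the mysql-connector-python driver.
engine = create_engine("mysql+mysqlconnector://user:password@127.0.0.1:3306/gans")

# Write a small DataFrame to a table and read it back.
cities = pd.DataFrame({"city_name": ["Berlin", "Hamburg"]})
cities.to_sql("cities", con=engine, if_exists="append", index=False)
print(pd.read_sql("SELECT * FROM cities", con=engine))
```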
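The static collectors (cities.py, airports.py) rely on web scraping. Below is a minimal BeautifulSoup sketch, assuming Wikipedia as the source and a simple infobox layout; the URL and parsing logic are illustrative, not the package's exact code:

```python
# Minimal sketch: scraping a city's population from Wikipedia.
# The URL and the infobox parsing are assumptions, not the exact
# logic of pipeline/cities.py.
import requests
from bs4 import BeautifulSoup

def scrape_population(city: str) -> str:
    url = f"https://en.wikipedia.org/wiki/{city}"
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # City pages carry the population in the infobox table: find the
    # "Population" header row and return the next data cell.
    header = soup.find("th", string=lambda s: s and "Population" in s)
    return header.find_next("td").get_text(strip=True)

print(scrape_population("Berlin"))
```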
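The dynamic collectors use web APIs; flight data is the kind of feed typically served through RapidAPI, while weather forecasts are available from providers such as OpenWeatherMap. A sketch of a forecast call follows; the choice of endpoint and the API key are assumptions:

```python
# Minimal sketch: fetching a 5-day weather forecast via a web API.
# The OpenWeatherMap endpoint is one possible provider; the API key
# is a placeholder.
import pandas as pd
import requests

API_KEY = "your-api-key"  # placeholder

def fetch_forecast(lat: float, lon: float) -> pd.DataFrame:
    url = "https://api.openweathermap.org/data/2.5/forecast"
    params = {"lat": lat, "lon": lon, "appid": API_KEY, "units": "metric"}
    response = requests.get(url, params=params)
    response.raise_for_status()
    # Flatten the JSON response into tabular records for Pandas.
    records = [
        {"time": item["dt_txt"], "temperature": item["main"]["temp"]}
        for item in response.json()["list"]
    ]
    return pd.DataFrame(records)

print(fetch_forecast(52.52, 13.40).head())
```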
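functions-framework lets the same HTTP entry point run locally and on Google Cloud Functions. A minimal sketch; the function name is illustrative and not necessarily the one in main.py:

```python
# Minimal sketch of an HTTP-triggered entry point, runnable locally
# with functions-framework and deployable unchanged to Cloud Functions.
# The function name is illustrative.
import functions_framework

@functions_framework.http
def update_database(request):
    # In the real pipeline this would call the weather and flights
    # collectors and write the results to the Cloud SQL instance.
    return "Database updated."
```

Locally, this can be served with `functions-framework --target update_database` and tested at http://localhost:8080; deployed as a Cloud Function, the same handler can then be triggered on a schedule by Cloud Scheduler.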
- MySQL Workbench
- Pandas
- BeautifulSoup
- Requests
- RapidAPI
- Google Cloud Services
- MkDocs
- GitHub Pages
- Medium.com
Running the Python pipeline for the first time automatically creates the full SQL schema, as defined in `pipeline/create_database.sql`.
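A sketch of how such a script can be executed through SQLAlchemy on first run; the naive split on `;` is a simplification, and the actual logic lives in pipeline/database.py:

```python
# Minimal sketch: executing the schema script on first run.
# Splitting on ";" is a simplification of what database.py does.
from pathlib import Path
from sqlalchemy import create_engine, text

engine = create_engine("mysql+mysqlconnector://user:password@127.0.0.1:3306/")
script = Path("pipeline/create_database.sql").read_text()

with engine.begin() as connection:
    for statement in script.split(";"):
        if statement.strip():
            connection.execute(text(statement))
```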
├── pipeline <- Source code of the Python package
│ │
│ ├── database.py <- Database class as user interface for all operations
│ │
│ ├── create_database.sql <- SQL script for creating the database structure
│ │
│ ├── cities.py <- Internal functions for collecting static data using web scraping
│ ├── airports.py
│ │
│ ├── weather.py <- Internal functions for collecting dynamic data using APIs
│ └── flights.py
│
├── docs <- MkDocs documentation of the Python package 'pipeline'
│
├── requirements.txt <- Dependencies for reproducing the pipeline environment
│
├── example.env <- Environment variables for sensitive data
│
└── main.py <- Google Cloud Functions script
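Sensitive values (database credentials, API keys) stay out of the code: locally they come from environment variables (see example.env), and in the cloud from Secret Manager. A sketch under those assumptions; the variable and secret names are placeholders, not those used by this package:

```python
# Minimal sketch: resolving a credential locally from an environment
# variable and in the cloud from Secret Manager. Variable and secret
# names are placeholders.
import os

def get_db_password() -> str:
    # Local run: the value comes from the environment (see example.env).
    password = os.getenv("DB_PASSWORD")
    if password is not None:
        return password
    # Cloud run: fetch the latest version of the secret instead.
    from google.cloud import secretmanager

    client = secretmanager.SecretManagerServiceClient()
    name = "projects/your-project/secrets/db-password/versions/latest"
    response = client.access_secret_version(name=name)
    return response.payload.data.decode("utf-8")
```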