In this project, I performed data analytics on Uber trip data using Google Cloud Platform (GCP), Python, a Compute Engine instance, the Mage data pipeline tool, BigQuery, and Looker Studio. The goal was to build a modern data engineering pipeline and use it to analyze and gain insights from the Uber dataset.
The implemented architecture leverages the following technologies:
- Programming Language: Python
- Google Cloud Platform (GCP):
  - Google Cloud Storage: used for storing the dataset
  - Compute Engine instance: deployed to run the data processing workloads
  - BigQuery: used for data storage and querying
  - Looker Studio: used as the visualization and reporting tool
- Modern Data Pipeline Tool: Mage, integrated to streamline data ingestion and transformation tasks (a minimal block sketch follows this list).
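To give a flavor of how a Mage block in such a pipeline can look, here is a minimal data loader sketch. It reads the project CSV over HTTP; the raw-file URL is derived from the dataset link below and is an assumption, and in the deployed pipeline the file would more likely be read from Google Cloud Storage instead.

```python
import io

import pandas as pd
import requests

# Mage injects these decorators at runtime; the guard mirrors Mage's block template.
if 'data_loader' not in dir():
    from mage_ai.data_preparation.decorators import data_loader
if 'test' not in dir():
    from mage_ai.data_preparation.decorators import test


@data_loader
def load_uber_data(*args, **kwargs):
    """Load the Uber/TLC trip records CSV into a pandas DataFrame."""
    # Assumed raw URL corresponding to the repository file linked below.
    url = (
        'https://raw.githubusercontent.com/darshilparmar/'
        'uber-etl-pipeline-data-engineering-project/main/data/uber_data.csv'
    )
    response = requests.get(url)
    response.raise_for_status()
    return pd.read_csv(io.StringIO(response.text))


@test
def test_output(output, *args) -> None:
    # Basic sanity checks on the loaded frame.
    assert output is not None, 'The output is undefined'
    assert len(output) > 0, 'The DataFrame is empty'
```

Downstream transformer and data exporter blocks would follow the same decorator pattern before the data lands in BigQuery for querying and Looker Studio reporting.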
I also contributed to the open-source Mage project by collaborating with the Mage AI team; it can be found on GitHub at: https://github.com/mage-ai/mage-ai.
The project utilized the TLC Trip Record Data, which includes yellow and green taxi trip records. The dataset encompasses various fields such as pick-up and drop-off dates/times, locations, trip distances, fares, rate types, payment types, and passenger counts. The specific dataset used in the project can be accessed at: https://github.com/darshilparmar/uber-etl-pipeline-data-engineering-project/blob/main/data/uber_data.csv.
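As a quick sketch of working with these fields in Python, the snippet below parses the pick-up and drop-off timestamps and summarizes a few columns. The column names are assumptions based on the TLC yellow-taxi schema referenced below and may need adjusting if the CSV differs.

```python
import pandas as pd

# Hypothetical local copy of uber_data.csv downloaded from the repository linked above.
df = pd.read_csv('uber_data.csv')

# Parse the pick-up and drop-off timestamps (names follow the TLC yellow-taxi schema).
df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])
df['tpep_dropoff_datetime'] = pd.to_datetime(df['tpep_dropoff_datetime'])

# Quick summaries of the trip distance, fare, and passenger-count fields.
print(df[['trip_distance', 'fare_amount', 'passenger_count']].describe())

# Average fare by payment type, one of the simple insights the pipeline can surface.
print(df.groupby('payment_type')['fare_amount'].mean())
```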
Additional information about the dataset can be found in the following resources:
- NYC TLC Website: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
- Data Dictionary (yellow taxi): https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf