
yeha98555/google-maps-analysis-pipeline

 
 


Taiwan Travel Attractions Analysis Data Pipeline

Introduction

This project continues a team-managed project that was originally hosted under a different account. The original served a broader scope; this fork concentrates on analyzing and visualizing travel attractions in Taiwan. After forking, the parts that did not serve this goal were removed to keep the application focused.

Features

  • Data Pipeline
    • Data Collection: Uses a crawler to gather attraction information from Google Maps reviews, for attractions taken from TripAdvisor lists.
    • Data Organization: Structures the collected data into a defined schema within a data warehouse, with source (src), operational data store (ods), fact, and dimension (dim) tables. This layering keeps raw and cleaned data separate and supports analysis and trend identification.
  • Cloud Infrastructure
  • Data Visualization
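
The layering described under Data Organization can be sketched in plain Python. This is a minimal illustration, not the project's actual schema: the field names (`place_name`, `rating`, `text`), the helper functions, and the cleaning rules are assumptions invented for the example, and the real pipeline keeps these layers in a data warehouse rather than in in-memory lists.

```python
from dataclasses import dataclass

# Hypothetical raw records, standing in for the src layer (crawler output as-is).
SRC_REVIEWS = [
    {"place_name": " Taipei 101 ", "rating": "5", "text": "Great view"},
    {"place_name": "Taipei 101", "rating": "4", "text": "Crowded"},
    {"place_name": "Sun Moon Lake", "rating": "", "text": "No rating given"},
    {"place_name": "Sun Moon Lake", "rating": "5", "text": "Beautiful"},
]


@dataclass(frozen=True)
class FactReview:
    attraction_id: int  # foreign key into the attraction dimension
    rating: int
    text: str


def build_ods(src_rows):
    """ods layer: trim names, cast ratings, and drop rows without a rating."""
    ods = []
    for row in src_rows:
        if not row["rating"].strip():
            continue
        ods.append({
            "place_name": row["place_name"].strip(),
            "rating": int(row["rating"]),
            "text": row["text"].strip(),
        })
    return ods


def build_dim_attraction(ods_rows):
    """dim layer: one surrogate id per distinct attraction name."""
    dim = {}
    for row in ods_rows:
        dim.setdefault(row["place_name"], len(dim) + 1)
    return dim


def build_fact_reviews(ods_rows, dim_attraction):
    """fact layer: one row per review, referencing the dimension by id."""
    return [
        FactReview(dim_attraction[row["place_name"]], row["rating"], row["text"])
        for row in ods_rows
    ]


ods_rows = build_ods(SRC_REVIEWS)
dim_attraction = build_dim_attraction(ods_rows)
fact_reviews = build_fact_reviews(ods_rows, dim_attraction)
```

Keeping the src rows untouched means the ods cleaning rules can be changed and re-run without crawling again, and the fact rows stay small because attraction attributes live once in the dimension.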

Prerequisites

Before running this project, you must have the following installed:

  • Terraform (v1.8.5 or later)
  • Docker (v26.1.4 or later)

Installation

Setup Terraform

Refer to the detailed instructions in Terraform README for setting up Terraform.

Setup Airflow

For setting up Airflow, follow the steps provided in Airflow README.

Tech Stack

Technologies used in this project

  • Google Cloud: Provides the computing and storage resources, specifically using Google Cloud Storage, BigQuery and Cloud Functions.
  • Terraform: Manages the infrastructure as code.
  • Airflow: Orchestrates and schedules the data pipeline workflows.
  • Python: Used for scripting and data manipulation tasks, with key libraries including:
    • Pandas: For data manipulation and analysis.
    • PyArrow: For efficient data storage and retrieval.
    • SQLAlchemy: For database interaction.
    • Psycopg2-binary: For PostgreSQL database connectivity.
    • jieba: For Chinese text segmentation.
    • SnowNLP: For sentiment analysis of Chinese text.

Project Structure

.
├── .git                         // Folder for Git version control system
├── .vscode                      // Visual Studio Code configuration folder
├── airflow
|   ├── config                   // Configuration files for Airflow settings and environment variables
|   ├── dags                     // Contains Directed Acyclic Graphs (DAGs) for Airflow to schedule and run tasks
|   ├── plugins                  // Custom plugins for extending Airflow's built-in functionalities (automatically generated)
|   ├── logs                     // Logs generated by Airflow during DAG execution (automatically generated)
|   ├── utils
|   |   ├── common.py            // Contains common utility functions used across different modules
|   |   ├── gcp.py               // Google Cloud Platform specific utilities, including BigQuery and GCS operations
|   |   ├── email_callback.py    // Functions to handle email notifications on task failures or retries
|   |   └── config.yml           // Configuration file that stores environment and service settings
|   ├── variables                // Stores variables and configurations used across different Airflow tasks
|   ├── gcp_keyfile.json         // Google Cloud Platform service account key for Airflow
|   ├── crawler_gcp_keyfile.json // Additional GCP service account key for specific crawling tasks
|   ├── .env                     // Environment variables for Airflow setup
|   ├── .env.example             // .env template file, storing sample environment variables
|   ├── airflow.Dockerfile       // Dockerfile to build Airflow container image
|   ├── .dockerignore            // Specifies files to ignore during Docker build process
|   ├── docker-compose.yml       // Docker compose file to run Airflow in containers
|   ├── .gitignore               // Git ignore configuration, specifying files that don't need version control
|   └── README.md                // Airflow description file
├── terraform
|   ├── src                      // Contains the source code for Cloud Functions
|   ├── generated                // Stores zipped packages of the src content after running Terraform (automatically generated)
|   ├── main.tf                  // Main Terraform configuration file
|   ├── output.tf                // Terraform output configuration
|   ├── variables.tf             // Terraform variables definition
|   ├── .terraform.lock.hcl      // Terraform lock file for dependencies version locking
|   ├── .gitignore               // Git ignore configuration, specifying files that don't need version control
|   └── README.md                // Terraform description file
├── .gitignore                   // Git ignore configuration, specifying files that don't need version control
└── README.md                    // Project description file
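
As an illustration of the kind of helper that `utils/email_callback.py` is described as holding, the sketch below builds a failure notification from the context dictionary that Airflow passes to task callbacks. It is not the project's code: the function name `build_failure_email`, the subject format, and the recipient address are hypothetical, and the message is only constructed, not sent. A `SimpleNamespace` stands in for the task instance so the example runs without Airflow installed.

```python
from email.message import EmailMessage
from types import SimpleNamespace


def build_failure_email(context, recipient):
    """Build (but do not send) a notification for a failed task.

    `context` is the dict passed to a task callback; only the keys and
    attributes read below are assumed to be present.
    """
    ti = context["task_instance"]
    message = EmailMessage()
    message["To"] = recipient
    message["Subject"] = (
        f"[Airflow] {ti.dag_id}.{ti.task_id} failed (try {ti.try_number})"
    )
    message.set_content(
        f"Exception: {context.get('exception')}\nLog: {ti.log_url}"
    )
    return message


# Hypothetical stand-in for a real callback context.
fake_context = {
    "task_instance": SimpleNamespace(
        dag_id="reviews_pipeline",
        task_id="load_ods",
        try_number=2,
        log_url="http://localhost:8080/log",
    ),
    "exception": ValueError("empty batch"),
}
notice = build_failure_email(fake_context, "data-team@example.com")
```

In a DAG, a function that takes only `context` would be attached through a task's `on_failure_callback` or `on_retry_callback` argument; the recipient would then come from configuration rather than a parameter.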
