Google Maps Scraper

Update params.yaml with your parameters, run the scraper to collect Google Maps places and reviews, and upload the results to Google Cloud Storage (GCS).

Usage

Use locally

  1. Install dependencies:
make install
  2. Set environment variables:
export GOOGLE_APPLICATION_CREDENTIALS="path/to/crawler_gcp_keyfile.json"
export GCS_BUCKET_NAME="your-bucket-name"
export GCS_BLOB_NAME="your-blob-name"
  3. Add your parameters to params.yaml, then get the results by running:
make run
  4. Clean the repo:
make clean
  5. Clean the repo and the results:
make clean_all
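
The environment variables set above can be checked before calling make run, so a missing value fails early instead of partway through a scrape. The following is a minimal sketch, not part of this repo: the function name find_env_problems is hypothetical, and it assumes GOOGLE_APPLICATION_CREDENTIALS should point to an existing key file.

```python
from pathlib import Path

# Variable names taken from the export commands in this README.
REQUIRED_VARS = (
    "GOOGLE_APPLICATION_CREDENTIALS",
    "GCS_BUCKET_NAME",
    "GCS_BLOB_NAME",
)


def find_env_problems(env):
    """Return a list of problems found in `env`; an empty list means it looks usable.

    `env` is any mapping of variable names to values, e.g. os.environ.
    Hypothetical helper for illustration only.
    """
    # Report every variable that is unset or empty.
    problems = [f"{name} is not set" for name in REQUIRED_VARS if not env.get(name)]

    # Assumption: the credentials variable must point to a file that exists.
    keyfile = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if keyfile and not Path(keyfile).is_file():
        problems.append(f"key file not found: {keyfile}")

    return problems
```

Calling find_env_problems(os.environ) after import os and printing each returned message is enough for a quick check in the same shell session.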

Use in a Docker container

  1. Build the Docker image:
docker build -t gmaps-scraper .
  2. Run the Docker container:
docker run -it --rm -m 4g --shm-size=2g \
  -v "$(pwd)/crawler_gcp_keyfile.json:/app/crawler_gcp_keyfile.json" \
  -e GCS_BUCKET_NAME="your-bucket-name" \
  -e GCS_BLOB_NAME="your-blob-name" \
  gmaps-scraper
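
If the container is started from a script rather than typed by hand, the same docker run command can be assembled as an argument list. This is a rough sketch under stated assumptions: build_docker_run_args is a hypothetical helper, not part of this repo, and it drops -it because a script usually has no interactive terminal attached.

```python
from pathlib import Path


def build_docker_run_args(keyfile, bucket, blob, image="gmaps-scraper"):
    """Build the argv list for the `docker run` command shown in this README.

    Hypothetical helper for illustration only. The image name, memory limits,
    and mount target mirror the README; `-it` is omitted for non-interactive use.
    """
    # Docker bind mounts need an absolute host path.
    keyfile_path = Path(keyfile).resolve()
    return [
        "docker", "run", "--rm",
        "-m", "4g",          # memory limit for the container
        "--shm-size=2g",     # shared memory size
        "-v", f"{keyfile_path}:/app/crawler_gcp_keyfile.json",
        "-e", f"GCS_BUCKET_NAME={bucket}",
        "-e", f"GCS_BLOB_NAME={blob}",
        image,
    ]
```

The list can be passed to subprocess.run(args, check=True). Passing a list instead of a shell string means paths with spaces need no extra quoting.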

Use in Airflow

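  1. Build the Docker image:
docker build -t gmaps-scraper .
  2. Add a Docker proxy service to your Airflow docker-compose file. The operator in the next step connects to it at tcp://docker-proxy:2375.

  3. Add a DockerOperator to your DAG: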

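from airflow.providers.docker.operators.docker import DockerOperator
from docker.types import Mount

run_scraper = DockerOperator(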
    task_id="e_gmaps-scraper",
    image="gmaps-scraper",
    api_version="auto",
    auto_remove=True,
    environment={
        "GCS_BUCKET_NAME": "your-bucket-name",
        "GCS_BLOB_NAME": "your-blob-name",
    },
    command="make run",
    mounts=[
        Mount(
            source="<your-gcp-keyfile>",  # local path
            target="/app/crawler_gcp_keyfile.json",
            type="bind",
            read_only=True,
        ),
    ],
    mount_tmp_dir=False,
    mem_limit="4g",  # maximum memory the container may use: 4 GB
    shm_size="2g",  # shared memory size: 2 GB
    docker_url="tcp://docker-proxy:2375",
    network_mode="bridge",
)

TODO

  • Upload to GCS
  • Pack as Docker Image
  • Run in Airflow
  • Get more detailed Google Maps Places info
  • Filter time to get Google Maps Reviews
  • Refactor Code

Reference