Skip to content

singer-io/tap-google-search-console

master
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

tap-google-search-console

This is a Singer tap that produces JSON-formatted data following the Singer spec.

This tap:

  • Pulls raw data from the Google Search Console API
  • Extracts the following resources:
    • Sites
    • Sitemaps
    • Performance Reports
      • Custom (Summary for Site and Search Type by any combination of Date + Country, Device, Page, Query)
      • Date (Summary for Site and Search Type by Date)
      • Country (Summary for Site and Search Type by Date and Country)
      • Device (Summary for Site and Search Type by Date and Device)
      • Page (Summary for Site and Search Type by Date and Page)
      • Query (Summary for Site and Search Type by Date and Query)
  • Outputs the schema for each resource
  • Incrementally pulls data based on the input state

Streams

sites (GET)

sitemaps (GET)

performance_report_country (POST)

  • Performance Report Description
  • Endpoint: https://www.googleapis.com/webmasters/v3/sites/{site_url}/searchAnalytics/query
  • Primary keys: site_url, search_type, date, country
  • Foreign keys: site_url
  • Replication strategy: Incremental (query filtered based on date)
    • Filters: site_url, searchType, startDate (bookmark), endDate (current date)
    • Sort by: date ASC (when date is a dimension, results are sorted by date ascending)
    • Bookmark: date (date-time)
  • Transformations: Fields camelCase to snake_case, denest dimensions key/values, remove keys list node

performance_report_custom (POST)

  • Performance Report Description
  • Endpoint: https://www.googleapis.com/webmasters/v3/sites/{site_url}/searchAnalytics/query
  • Primary keys: site_url, search_type, date, dimensions_hash_key
    • Dimensions: date (required), country, device, page, query (based on catalog selection)
    • dimensions_hash_key: MD5 hash key of ordered list of selected dimension values
  • Foreign keys: site_url
  • Replication strategy: Incremental (query filtered based on date)
    • Filters: site_url, searchType, startDate (bookmark), endDate (current date)
    • Sort by: date ASC (when date is a dimension, results are sorted by date ascending)
    • Bookmark: date (date-time)
  • Transformations: Fields camelCase to snake_case, denest dimensions key/values, remove keys list node

performance_report_date (POST)

performance_report_device (POST)

  • Performance Report Description
  • Endpoint: https://www.googleapis.com/webmasters/v3/sites/{site_url}/searchAnalytics/query
  • Primary keys: site_url, search_type, date, device
  • Foreign keys: site_url
  • Replication strategy: Incremental (query filtered based on date)
    • Filters: site_url, searchType, startDate (bookmark), endDate (current date)
    • Sort by: date ASC (when date is a dimension, results are sorted by date ascending)
    • Bookmark: date (date-time)
  • Transformations: Fields camelCase to snake_case, denest dimensions key/values, remove keys list node

performance_report_page (POST)

  • Performance Report Description
  • Endpoint: https://www.googleapis.com/webmasters/v3/sites/{site_url}/searchAnalytics/query
  • Primary keys: site_url, search_type, date, page
  • Foreign keys: site_url
  • Replication strategy: Incremental (query filtered based on date)
    • Filters: site_url, searchType, startDate (bookmark), endDate (current date)
    • Sort by: date ASC (when date is a dimension, results are sorted by date ascending)
    • Bookmark: date (date-time)
  • Transformations: Fields camelCase to snake_case, denest dimensions key/values, remove keys list node

performance_report_query (keyword) (POST)

  • Performance Report Description
  • Endpoint: https://www.googleapis.com/webmasters/v3/sites/{site_url}/searchAnalytics/query
  • Primary keys: site_url, search_type, date, query
  • Foreign keys: site_url
  • Replication strategy: Incremental (query filtered based on date)
    • Filters: site_url, searchType, startDate (bookmark), endDate (current date)
    • Sort by: date ASC (when date is a dimension, results are sorted by date ascending)
    • Bookmark: date (date-time)
  • Transformations: Fields camelCase to snake_case, denest dimensions key/values, remove keys list node

Authentication

The Google Search Console Setup & Authentication Google Doc provides instructions show how to configure the Google Search Console for your domain and website URLs, configure Google Cloud to authorize/verify your domain ownership, generate an API key (client_id, client_secret), authenticate and generate a refresh_token, and prepare your tap config.json with the necessary parameters.

Quick Start

  1. Install

    Clone this repository, and then install using setup.py. We recommend using a virtualenv:

    > virtualenv -p python3 venv
    > source venv/bin/activate
    > python setup.py install
    OR
    > cd .../tap-google-search-console
    > pip install .
  2. Dependent libraries The following dependent libraries were installed.

    > pip install target-json
    > pip install target-stitch
    > pip install singer-tools
    > pip install singer-python
  3. Create your tap's config.json file. Include the client_id, client_secret, refresh_token, site_urls (website URL properties in a comma delimited list; do not include the domain-level property in the list), start_date (UTC format), user_agent (tap name with the api user email address) and request_timeout (the timeout for the requests. Default: 300).

    {
        "client_id": "YOUR_CLIENT_ID",
        "client_secret": "YOUR_CLIENT_SECRET",
        "refresh_token": "YOUR_REFRESH_TOKEN",
        "site_urls": "https://example.com, https://www.example.com, http://example.com, http://www.example.com, sc-domain:example.com",
        "start_date": "2019-01-01T00:00:00Z",
        "user_agent": "tap-google-search-console <api_user_email@example.com>",
        "request_timeout": 300
    }

    Optionally, also create a state.json file. currently_syncing is an optional attribute used for identifying the last object to be synced in case the job is interrupted mid-stream. The next run would begin where the last job left off. Only the performance_reports uses a bookmark. The date-time bookmark is stored in a nested structure based on the endpoint, site, and sub_type.

    {
      "currently_syncing": "sitemaps",
      "bookmarks": {
        "performance_report_custom": {
          "sc-domain:example.com": {
            "web": "2020-04-18T00:00:00.000000Z",
            "image": "2020-04-18T00:00:00.000000Z",
            "video": "2020-04-08T00:00:00.000000Z"
          },
          "https://example.com": {
            "web": "2020-04-18T00:00:00.000000Z",
            "image": "2020-04-18T00:00:00.000000Z",
            "video": "2020-04-08T00:00:00.000000Z"
          },
          "http://example.com": {
            "web": "2020-04-13T00:00:00.000000Z"
          }
        },
        "performance_report_date": {
          "http://www.example.com": {
            "web": "2020-04-18T00:00:00.000000Z",
            "image": "2020-04-18T00:00:00.000000Z",
            "video": "2020-04-18T00:00:00.000000Z"
          },
          "sc-domain:example.com": {
            "web": "2020-04-18T00:00:00.000000Z",
            "image": "2020-04-18T00:00:00.000000Z",
            "video": "2020-04-18T00:00:00.000000Z"
          },
          "https://www.example.com": {
            "web": "2020-04-18T00:00:00.000000Z",
            "image": "2020-04-18T00:00:00.000000Z",
            "video": "2020-04-18T00:00:00.000000Z"
          },
          "https://example.com": {
            "web": "2020-04-18T00:00:00.000000Z",
            "image": "2020-04-18T00:00:00.000000Z",
            "video": "2020-04-18T00:00:00.000000Z"
          },
          "http://example.com": {
            "web": "2020-04-18T00:00:00.000000Z",
            "image": "2020-04-18T00:00:00.000000Z",
            "video": "2020-04-18T00:00:00.000000Z"
          }
        },
        "performance_report_device": {
          "sc-domain:example.com": {
            "web": "2020-04-18T00:00:00.000000Z",
            "image": "2020-04-18T00:00:00.000000Z",
            "video": "2020-04-08T00:00:00.000000Z"
          },
          "https://example.com": {
            "web": "2020-04-18T00:00:00.000000Z",
            "image": "2020-04-18T00:00:00.000000Z",
            "video": "2020-04-08T00:00:00.000000Z"
          },
          "http://example.com": {
            "web": "2020-04-16T00:00:00.000000Z"
          }
        },
        "performance_report_page": {
          "sc-domain:example.com": {
            "web": "2020-04-18T00:00:00.000000Z",
            "image": "2020-04-18T00:00:00.000000Z",
            "video": "2020-04-08T00:00:00.000000Z"
          },
          "https://example.com": {
            "web": "2020-04-18T00:00:00.000000Z",
            "image": "2020-04-18T00:00:00.000000Z",
            "video": "2020-04-08T00:00:00.000000Z"
          },
          "http://example.com": {
            "web": "2020-04-16T00:00:00.000000Z"
          }
        },
        "performance_report_query": {
          "sc-domain:example.com": {
            "web": "2020-04-18T00:00:00.000000Z",
            "image": "2020-04-18T00:00:00.000000Z",
            "video": "2020-04-08T00:00:00.000000Z"
          },
          "https://example.com": {
            "web": "2020-04-18T00:00:00.000000Z",
            "image": "2020-04-18T00:00:00.000000Z",
            "video": "2020-04-08T00:00:00.000000Z"
          },
          "http://example.com": {
            "web": "2020-04-13T00:00:00.000000Z"
          }
        },
        "performance_report_country": {
          "sc-domain:example.com": {
            "web": "2020-04-18T00:00:00.000000Z",
            "image": "2020-04-18T00:00:00.000000Z",
            "video": "2020-04-08T00:00:00.000000Z"
          },
          "https://example.com": {
            "web": "2020-04-18T00:00:00.000000Z",
            "image": "2020-04-18T00:00:00.000000Z",
            "video": "2020-04-08T00:00:00.000000Z"
          },
          "http://example.com": {
            "web": "2020-04-16T00:00:00.000000Z"
          }
        }
      }
    }
  4. Run the Tap in Discovery Mode This creates a catalog.json for selecting objects/fields to integrate:

    tap-google-search-console --config config.json --discover > catalog.json

    See the Singer docs on discovery mode here.

  5. Run the Tap in Sync Mode (with catalog) and write out to state file

    For Sync mode:

    > tap-google-search-console --config tap_config.json --catalog catalog.json > state.json
    > tail -1 state.json > state.json.tmp && mv state.json.tmp state.json

    To load to json files to verify outputs:

    > tap-google-search-console --config tap_config.json --catalog catalog.json | target-json > state.json
    > tail -1 state.json > state.json.tmp && mv state.json.tmp state.json

    To pseudo-load to Stitch Import API with dry run:

    > tap-google-search-console --config tap_config.json --catalog catalog.json | target-stitch --config target_config.json --dry-run > state.json
    > tail -1 state.json > state.json.tmp && mv state.json.tmp state.json
  6. Test the Tap

    While developing the Google Search Console tap, the following utilities were run in accordance with Singer.io best practices: Pylint to improve code quality:

    > pylint tap_google_search_console -d missing-docstring -d logging-format-interpolation -d too-many-locals -d too-many-arguments

    Pylint test resulted in the following score:

    Your code has been rated at 9.82/10.

    To check the tap and verify working:

    > tap-google-search-console --config tap_config.json --catalog catalog.json | singer-check-tap > state.json
    > tail -1 state.json > state.json.tmp && mv state.json.tmp state.json

    Check tap resulted in the following:

    The output is valid.
    It contained 42192 messages for 8 streams.
    
          8 schema messages
      42118 record messages
        66 state messages
    
    Details by stream:
    +----------------------------+---------+---------+
    | stream                     | records | schemas |
    +----------------------------+---------+---------+
    | sites                      | 5       | 1       |
    | sitemaps                   | 4       | 1       |
    | performance_report_date    | 1170    | 1       |
    | performance_report_page    | 3640    | 1       |
    | performance_report_country | 7371    | 1       |
    | performance_report_query   | 11335   | 1       |
    | performance_report_device  | 639     | 1       |
    | performance_report_custom  | 17954   | 1       |
    +----------------------------+---------+---------+

Copyright © 2019 Stitch