
Releases: GoogleCloudPlatform/gcpdiag

0.79

27 May 18:43
@Y0Q Y0Q

New Lints

  • lb/bp/2025_003: new rule: best-practice check for the connection draining setting on load balancer backend services.
  • lb/bp/2025_002: new rule: best-practice check for the backend service timeout on load balancers.
  • interconnect/warn/2025_001: new rule: check for Interconnect MTU mismatch.

New Runbooks

  • interconnect/bgp-down-flap: Interconnect BGP down/flap runbook
  • gce/vm-creation: GCE VM creation runbook
  • gce/guestos-bootup: Guest OS bootup issues runbook

New Queries

  • orgpolicy._get_available_org_constraints: List all the org constraints available for a particular resource. Takes resource_id (the resource ID) and resource_type (project or organization), returns a list of available org policy constraints, and raises utils.GcpApiError on API errors.
  • billing.get_billing_info: Get billing information for a project, caching the result.
  • orgpolicy.get_all_project_org_policies: List all the org policies set for a particular resource. Takes project_id (the project ID), returns a dictionary of PolicyConstraint objects keyed by constraint name, and raises utils.GcpApiError on API errors.
  • network.get_router_by_name
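
The "caching the result" behaviour noted for billing.get_billing_info can be pictured with a minimal sketch. This is not gcpdiag's implementation: `fetch_billing_info` and the dictionary it returns are hypothetical stand-ins for a real API call, and `functools.lru_cache` is only one possible way to memoize a lookup per project.

```python
import functools

# Counts how many times the (hypothetical) backend is actually hit.
api_calls = 0


@functools.lru_cache(maxsize=None)
def fetch_billing_info(project_id: str) -> dict:
    """Hypothetical stand-in for a billing lookup; not the gcpdiag API."""
    global api_calls
    api_calls += 1
    # A real query would call the Cloud Billing API here.
    return {"project_id": project_id, "billing_enabled": True}


first = fetch_billing_info("my-project")
second = fetch_billing_info("my-project")  # served from the cache
```

With this approach the second call returns the same cached object, so repeated rules or runbook steps that ask about the same project do not trigger extra API requests.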

New Features

  • Implement --test-release flag in gcpdiag docker
  • Improved output message quality for all runbooks.
  • Add Bundle execution usage details to internal + external docs
  • Add exceptions when constructing API endpoints for different services.
  • Add markdownlint pre-commit and graphviz dependencies
  • Add OSSF Scorecard GitHub Action and update pre-commit hooks

Fixes

  • Update job_id parameter.
  • Update service name parameter.
  • Update output messages for interconnect and dataproc runbooks.
  • Update cluster_name parameter.
  • Disambiguate name parameter for GKE and GCF runbooks; fix GCF failed deployments template bug.
  • Update dataflow/dataproc jinja templates.
  • [Bundles] Fix missing runbook_name error.
  • [gcpdiag runbook cli] Fix missing json report error.
  • Fix missing import errors.
  • Fix report.run_start_time error.
  • Update pipenv to use the latest version to fix import errors.
  • Update GitHub Actions workflow to use a newer Python version to fix tests.
  • Create exception for missing runbook parameters.
  • Update README.md.
  • Fix dataproc runbook parameter bug.
  • Improve Runbooks Response handling.
  • Use the diagnostic engines runbook loader for tests.
  • Update ops agent onboarding parameters.
  • Update artifact config.
  • Update artifact upload version.
  • Use scope instead of region wording in the unhealthy backends runbook.
  • Fix OSSF Scorecard action filename in config; re-enable pylint now that sub-dependency setuptools is fixed (pypa/setuptools#4892).
  • Migrate info logs to debug logs for messages with PII / SPII.
  • Disable pylint.
  • Deprecate unused gh-pages GitHub Action (https://github.com/GoogleCloudPlatform/gcpdiag/actions/runs/13902316282/job/38976483921).
  • Improve the message when HC logs are not enabled.
  • github: Bump jinja2 from 3.1.5 to 3.1.6.

Full Changelog: v0.78...v0.79

0.78

04 Mar 23:08

0.78 (2025-03-04)

New Lints

  • gke/warn/2025_001: new rule: GKE external LB services are successfully created without encountering IP allocation failures due to external IPv6 subnet configurations.
  • asm/warn/2025_002: new rule: Upstream connection established successfully with no protocol errors
  • asm/warn/2025_001: new rule: ASM: Envoy doesn't report connection failure
  • gke/err/2025_001: new rule: GKE cluster complies with the serial port logging organization policy.

New Runbooks

  • gcf/failed-deployments: Cloud Run Functions runbook that helps users check the reasons for failed deployments of Gen2 cloud functions
  • nat/public-nat-ip-allocation-failed: Public NAT IP allocation failed runbook
  • dataproc/spark-job-failures: Dataproc Spark job runbook
  • lb/latency: new runbook: Load Balancer Latency v1

New Queries

  • gce.get_global_operations: Returns global operations object in the given project id.
  • cloudasset.search_all_resources: Searches all cloud asset inventory in the project.
  • lb.get_load_balancer_type_name: Returns a human-readable name for the given load balancer type.
  • dataproc.list_auto_scaling_policies: Lists all autoscaling policies in the given project and region.

New Features

  • pre-commit codespell check to eliminate typos in the repository
  • Step class attribute interpolation in step names
  • Check deprecated parameters in runbooks
  • Migrate repository to Python 3.12
  • output/api_output.py: implement API output module
  • Create threading.local() in op.py for data isolation.
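
As a rough illustration of why threading.local() gives data isolation, the sketch below stores a value per thread and shows that other threads (including the main one) never see it. The names `run_step`, `current_step`, and `results` are hypothetical and are not taken from op.py.

```python
import threading

# Each thread sees its own attributes on this object.
_local = threading.local()


def run_step(name: str, results: dict) -> None:
    """Store per-thread context, then read it back from the same thread."""
    _local.current_step = name
    results[name] = _local.current_step


results: dict = {}
threads = [
    threading.Thread(target=run_step, args=(f"step-{i}", results))
    for i in range(3)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The main thread never set the attribute, so it is absent here.
main_has_step = hasattr(_local, "current_step")
```

Each worker reads back exactly the value it wrote, which is the property that lets concurrent runbook executions keep their context separate.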

Fixes

  • Update gke/err/2024_003 to check for container.defaultNodeServiceAccount role
  • Update formatting/style of all runbooks.
  • Fix runbook functionality to properly detect pod IP exhaustion and node IP exhaustion
  • Explicitly handle HTTP 401 errors - Add step_error for exceptions caused by GcpApiError - Handle edge case where
  • Improve error handling for iam.roles()

v0.77

14 Nov 12:27

0.77 (2024-11-13)

New Lint Rules

  • gke/err/2024_002: gke webhook failure endpoints not available
  • gke/warn/2024_007: GKE cluster in a dual-stack with external IPv6 access

New Runbooks

  • lb/ssl-certificate: Runbook for troubleshooting LB SSL certificates issues
  • gke/node-unavailability: Identifies the reasons for a GKE node being unavailable

New Queries

  • gke.get_cluster: Retrieve a single GKE cluster using its project, region, and cluster name.
  • dns.find_dns_records: Resolves DNS records for a given domain and returns a set of IP addresses.
  • lb.get_ssl_certificate: Returns object matching certificate name and region
  • lb.get_target_https_proxies: Retrieves the list of all TargetHttpsProxy resources, regional and global, available to the specified project.
  • lb.get_forwarding_rule: Returns the specified ForwardingRule resource.

Enhancements

  • Functionality to auto suggest correct runbook names for misspelled runbooks
  • Updated docker images to ubuntu:24.04 (python 3.12)
  • Updated devcontainer to python 3.12
  • Migrated crm queries from v1 to v3
  • gce/vm-performance: Added PD performance health check
  • gce/vm-performance: Implemented disk average_io_latency check
  • Removed apis_utils.batch_execute_all call from orgpolicy query
  • Enabled gcpdiag.dev page indexing
  • Reduced API retries to 3 attempts
  • Fixed START_TIME_UTC inconsistency and the "Error parsing date string" failure
  • pubsub/pull-subscription-delivery: removed cold cache checks
  • Add functionality to disable query caching for edge cases
  • Improve error handling within gcpdiag library to raise errors for handling rather than exiting.
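
The auto-suggest enhancement above can be approximated with fuzzy string matching. The sketch below uses Python's standard `difflib.get_close_matches`; whether gcpdiag uses this function is an assumption, and `KNOWN_RUNBOOKS` and `suggest_runbook` are hypothetical names (the runbook IDs themselves come from these release notes).

```python
import difflib

# Runbook names taken from these release notes; the lookup itself is illustrative.
KNOWN_RUNBOOKS = [
    "gce/vm-performance",
    "gke/node-unavailability",
    "lb/ssl-certificate",
    "pubsub/pull-subscription-delivery",
]


def suggest_runbook(name: str, cutoff: float = 0.6) -> list[str]:
    """Return known runbook names that closely match a possibly misspelled one."""
    return difflib.get_close_matches(name.lower(), KNOWN_RUNBOOKS, n=3, cutoff=cutoff)


suggestions = suggest_runbook("gce/vm-performence")
```

The best match is listed first, so a CLI could print it as a "did you mean" hint; the `cutoff` value controls how aggressive the suggestions are.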

Fixes

  • lb.get_backend_service: Improved calls to fetch global backend
  • Added project_id parameters for the runbook tests without valid project ids

Deprecation

  • Flag --project: Full deprecation in runbook command to allow multiple project ids/numbers to be specified via --parameter

v0.76

01 Oct 17:12

0.76 (2024-10-1)

New Lint Rules

  • dataproc/warn/2024_005: Investigates if Data Fusion version is compatible with Dataproc version from the CDAP Preferences settings

New Runbooks

  • pubsub/pull-subscription-delivery: Investigates common Cloud Pub/Sub pull delivery issues related to delivery latency, quotas, pull rate and throughput rate

New Queries

  • pubsub.get_subscription: Retrieves a single pubsub subscription resource
  • apis.is_all_enabled: Check if a list of services are enabled on a given project
  • gke.get_release_schedule: Fetch GKE cluster release schedule

Enhancements

  • make new-rule: A make rule with a cookiecutter recipe to generate new lint rule templates
  • gce.get_gce_public_images: Improved gce_stub query to correctly fetch all image licenses during test.
  • Runbooks metrics generation for Google Internal Users
  • New flag --reason: argument primarily used by Google internal users to specify the rationale for executing the tool
  • Bundles: A runbook feature to allow execution of a collection of steps
  • Runbook operation (op.add_metadata) to create or retrieve metadata related to steps

Fixes

  • Enforce explicit parameter configuration in gce generalized steps.
  • dataflow/dataflow-permission: Refactored runbook to dataflow/job-permission
  • dataflow/bp/2024_002: Fixed resource filtering bug for forwarding rule (internal LB)
  • gce/vm-performance: Fixed disk performance benchmark lookup

Deprecation

  • apis_utils.batch_list_all: Replaced by apis_utils.multi_list_all
  • Flag --project: Soft deprecation in runbook command to allow multiple project ids/numbers to be specified via --parameter
  • Deprecated pre-commit hook gke-eol-file

v0.75

03 Sep 17:14

0.75 (2024-9-2)

New Lint Rules

  • bigquery/WARN/2024_005: Checks BigQuery table does not exceed number of partition modifications
    to a column partitioned table
  • bigquery/WARN/2024_006: Checks BigQuery job does not exceed tabledata.list bytes
    per second per project
  • dataflow/ERR/2024_006: Checks Dataflow job does not fail during execution due
    to resource exhaustion in zone
  • datafusion/WARN/2024_004: Checks Data Fusion version is compatible with Dataproc
    version from the corresponding compute profiles
  • gke/WARN/2024_003: Checks Ingress traffic is successful if service is correctly mapped
  • gke/WARN/2024_004: Checks Ingress is successful if backendconfig crd is correctly mapped
  • gke/WARN/2024_005: Checks GKE Ingress successfully routes external traffic to NodePort service
  • gce/BP_EXT/2024_002: Calculate a GCE VM's IOPS and Throughput Limits

New Runbooks

  • lb/unhealthy-backends: Diagnose Unhealthy Backends of a Load Balancer
  • gke/resource-quota: Diagnose quota-related issues on GKE clusters.
  • gce/vm-performance: Diagnose GCE VM performance
  • gke/image-pull: Diagnose image pull failures on GKE clusters.
  • gke/node-auto-repair: Root-cause analysis (RCA) of node auto-repair incidents
  • gke/gke-ip-masq-standard: Diagnose IP Masquerading issues on GKE clusters
  • dataflow/dataflow-permission: Diagnose Permission required for cluster creation and operation

New Query

  • lb.get_backend_service: Fetch instances matching compute backend service name and/or region
  • lb.get_backend_service_health: Fetch compute backend service health data
  • generic_api/datafusion: Re-implementation of how to call and test generic apis

Enhancements

  • cloudrun/service-deployment: 2 additional checks for image not found and image permissions failure
  • bigquery/WARN/2022_001: Updated lint rule diagnostic steps documentation
  • Implement ignorecase for input parameters
  • gce/ssh and gce/serial-log-analyzer: Include Auth failure checks in runbooks
  • Updated GKE version End of Life tracker
  • New API Stub for Recommender API

Fixes

  • gce/vm-termination: Made vm name and zone mandatory fields
  • Updated dependencies:
    • aiohttp: 3.9.5 -> 3.10.3
    • attrs: 23.2.0 -> 24.2.0
    • cachetools: 5.3.3 -> 5.4.0
    • certifi: 2024.6.2 -> 2024.7.4
    • exceptiongroup: 1.2.1 -> 1.2.2
    • google-api-python-client: 2.134.0 -> 2.141.0
    • google-auth: 2.30.0 -> 2.33.0
    • google-auth-oauthlib: 1.2.0 -> 1.2.1
    • importlib-resources: 6.4.0 -> 6.4.2
    • protobuf: 5.27.2 -> 5.27.3
    • pyyaml: 6.0.1 -> 6.0.2
    • soupsieve: 2.5 -> 2.6
  • Fix lint output and GCE query functions for multi-region resources
  • Removed deprecated option skip_delete from TF code

v0.74

11 Jul 15:51

Full Changelog: v0.68...v0.74

0.74 (2024-7-10)

Fixes

  • Re-roll of v0.72 after correcting pip module issue with the docker image build

New Lint Rule

  • datafusion/warn_2024_002 Data Fusion instance is in a running state

New Runbook

  • dataproc/cluster_creation Dataproc cluster creation diagnostic tree

0.73 (2024-7-8)

New Feature

  • Added search command to scan the docstrings of lint rules or runbooks for
    matching keywords
  • Added runbook check step outcomes: step_ok, step_failed, etc.
  • Added a zonal endpoint in osconfig library. It returns inventories for all VMs under a certain zone

Fixes

  • Create runbook report regardless of the number of failed steps
  • Improve introductory error message for new runbooks
  • Update lint command API return value for display of resources in each rule
  • General spelling corrections
  • Add documentation for runbook operator methods
  • Remove unneeded google path reference in loading template block content
  • Update runbook name validation
  • Handle when gcloud command is not installed when running runbook generator
  • Allow to query logs for each test data separately in logs_stub
  • Update GKE EOL date
  • Relax constraints on location of end steps in runbook
  • Update pip dependencies; security fix for pdoc
  • Added monitoring to the list of supported products runbook steps
  • generic_api/datafusion apis.make_request() re-implementation
  • Update and improve runbook error handling

New Lint Rule

  • gke/err_2024_001_psa_violations Checking for no Pod Security Admission violations in the project
  • bigquery/warn_2024_002_invalid_external_connection BigQuery external
    connection with Cloud SQL does not fail
  • pubsub/err_2024_003_snapshot_creation_fails snapshot creation fails if
    backlog is too old
  • pubsub/err_2024_002_vpc_sc_new_subs_create_policy_violated check for
    pubsub error due to organization policy
  • bigquery/warn_2024_0003 BigQuery job does not fail due to Maximum API requests per user per method exceeded

New Runbook

  • gce/ops_agent Ops Agent Onboarding runbook
  • gcp/serial_log_analyzer runbook to analyse known issues logged into Serial Console logs
  • vertex/workbench_instance_stuck_in_provisioning Runbook to Troubleshoot Issue: Vertex AI Workbench Instance Stuck in Provisioning State
  • cloudrun/service_deployment Cloud Run deployment runbook
  • gke/ip_exhaustion gke ip exhaustion runbook
  • dataflow/failed_streaming_pipeline Diagnostic checks for failed Dataflow Streaming Pipelines
  • nat/out_of_resources vm external ip connectivity runbook

v0.67

21 Nov 18:24

0.67 (2023-10-17)

Fixes

  • Updating GKE EOL file and snapshot
  • Rewording message triggering internal leak test

New Command and Rules

  • Runbook POC with ssh runbook and terraform scripts

New rules

  • GKE cluster has workload identity enabled
  • Splunk job uses valid certificate

Full Changelog: v0.66...v0.67

gcpdiag 0.71

19 Apr 18:30

0.71 (2024-4-17)

New lint rules

  • datafusion/err_2024_001_delete_operation_failing datafusion
    deletion operation
  • gce/err_2024_003_vm_secure_boot_failures GCE Lint rule for boot
    failures for Shielded VM
  • gce/bp_2024_001_legacy_monitoring_agent GCE Legacy Monitoring Agent
    is not installed
  • gce/bp_2024_002_legacy_logging_agent GCE Legacy Logging Agent is not
    installed
  • gce/bp_ext_2024_001_no_public_ip.py GCE SSH in Browser: SSH Button
    Disabled
  • pubsub/bp_2024_001_ouma_less_one_day Oldest Unacked Message Age
    Value less than 24 hours
  • bigquery/err_2024_001_query_too_complex query is too complex
  • bigquery/warn_2024_001_imports_or_query_appends_per_table table
    exceeds limit for imports or query appends

New query

New runbook

  • gce/vm_termination assist investigating underlying reasons behind
    termination or reboot
  • gke/cluster_autoscaler GKE Cluster autoscaler error messages check

New features

  • Add cache bypass option for runbook steps
  • Add runbook starter code generator; updates to code generator
  • Add API for runbook command

Fixes

  • Add mock data for datafusion API testing
  • Correct runbook documentation generation output
  • Improve runbook operator functions usage
  • Add dataflow and other components to supported runbook component list
  • Remove duplicate vm_termination.py script
  • Add jinja templates to docker image on cloud shell
  • Correct argv passed for parsing in runbook command
  • Add pipenv and git checks to help beginners get started easily with the runbook
    generator
  • Update idna in pipenv to address CVE-2024-3651 (moderate severity)
  • SSH runbook enhancements
  • Runbook fixes: catch missing template errors, include project id when no
    parameters

gcpdiag 0.70

19 Apr 18:29

0.70 (2024-3-27)

New lint rules

  • pubsub/ERR_2024_001 bq subscription table not found
  • composer/WARN_2024_001 low scheduler cpu usage
  • datafusion/WARN_2024_001 data fusion version
  • composer/WARN_2024_002 worker pod eviction
  • gce/ERR_2024_002 performance
  • notebooks/ERR_2024_001 executor explicit project permissions
  • dataflow/WARN_2024_001 dataflow operation ongoing
  • dataflow/ERR_2024_001 dataflow gce quotas
  • dataflow/WARN_2024_002 dataflow streaming appliance commit failed
  • dataflow/ERR_2024_002 dataflow key commit
  • gke/WARN_2024_001 cluster nap limits prevent autoscaling

New query

  • datafusion_cdap API query implementation - provides CDAP profile metadata

Fixes

  • Updated pipenv packages, Pipfile.lock dependencies
  • Updated GitHub Actions workflow versions to stop warnings about node v10
  • Refactor Runbook: Implemented a modular, class-based design to facilitate a
    more configurable method for tree construction.

v0.69

26 Feb 20:43

0.69 (2024-2-21)

New feature

  • add universe_domain for Trusted Partner Client (TPC)

New rules

  • asm/WARN_2024_001 Webhook failed
  • lb/BP_2024_002 Check if global access is on for the regional iLB
  • pubsub/WARN_2024_003 Pub/Sub rule: CMEK - Topic Permissions
  • dataproc/WARN_2024_001 dataproc check hdfs safemode status
  • dataproc/WARN_2024_002 dataproc hdfs write issues
  • gce/ERR_2024_001 GCE rule: Snapshot creation rate limit
  • lb/BP_2024_001 session affinity enabled on load balancer
  • pubsub/WARN_2024_002 GCS subscription has the apt permissions
  • dataflow/ERR_2023_010 missing required field
  • pubsub/WARN_2024_001 DLQ Subscription has apt permissions

Fixes

  • Update Pull Request and Merge to only run when an update was committed
  • Creating a github action Workflow to automatically update the gke/eol.yaml file
  • Update gke/eol.yaml file

Full Changelog: https://github.com/GoogleCloudPlatform/gcpdiag/commits/v0.69