Releases: GoogleCloudPlatform/gcpdiag
0.79
New Lints
- lb/bp/2025_003: new rule: lint rule for best practices for load balancer backend service connection draining setting.
- lb/bp/2025_002: new rule: Lint rule for backend service timeout best practice on load balancer.
- interconnect/warn/2025_001: interconnect rule: check interconnect MTU mismatch
New Runbooks
- interconnect/bgp-down-flap: interconnect BGP down flap runbook
- gce/vm-creation: GCE VM Creation runbook
- gce/guestos-bootup: Guest OS bootup issues
New Queries
- orgpolicy._get_available_org_constraints: Lists all the org constraints available for a particular resource. Takes resource_id (the resource ID) and resource_type (the resource type, project or organization); returns a list of available org policy constraints; raises utils.GcpApiError on API errors.
- billing.get_billing_info: Get Billing Information for a project, caching the result.
- orgpolicy.get_all_project_org_policies: Lists all the org policies set for a particular resource. Takes project_id (the project ID); returns a dictionary of PolicyConstraint objects, keyed by constraint name; raises utils.GcpApiError on API errors.
- network.get_router_by_name
New Features
- Implement --test-release flag in gcpdiag docker
- Improved output message quality for all runbooks.
- Add Bundle execution usage details to internal + external docs
- Adding exceptions in constructing API endpoints for different services.
- Add markdownlint precommit and graphviz dependencies
- Add ossf scorecard Github Action and update pre-commit hooks
Fixes
- Update job_id parameter.
- Update service name parameter.
- Update output messages for interconnect and dataproc runbooks.
- Update cluster_name parameter.
- Disambiguate name parameter for GKE and GCF runbooks; fix GCF failed deployments template bug.
- Update dataflow/dataproc jinja templates.
- [Bundles] Fix missing runbook_name error.
- [gcpdiag runbook cli] Fix missing json report error.
- Fix missing import errors.
- Fix report.run_start_time error.
- Update pipenv to use the latest version to fix import errors.
- Update Github Actions workflow to use a newer python version to fix tests.
- Create exception for missing runbook parameters.
- Update README.md.
- Fix dataproc runbook parameter bug.
- Improve Runbooks Response handling.
- Use the diagnostic engines runbook loader for tests.
- Update ops agent onboarding parameters.
- Update artifact config.
- Update artifact upload version.
- Use scope instead of region wording in the unhealthy backends runbook.
- Fix ossf scorecard action filename in config; re-enable pylint now that sub dependency setuptools is fixed (pypa/setuptools#4892 (comment)).
- Migrate info logs to debug logs for messages with PII / SPII.
- Disable pylint.
- Deprecate unused gh-pages github action (https://github.com/GoogleCloudPlatform/gcpdiag/actions/runs/13902316282/job/38976483921).
- Improve the message when HC logs are not enabled.
- github: Bump jinja2 from 3.1.5 to 3.1.6.
Full Changelog: v0.78...v0.79
0.78
0.78 (2025-03-04)
New Lints
- gke/warn/2025_001: new rule: GKE external LB services are successfully created without encountering IP allocation failures due to external IPv6 subnet configurations.
- asm/warn/2025_002: new rule: Upstream connection established successfully with no protocol errors
- asm/warn/2025_001: new rule: ASM: Envoy doesn't report connection failure
- gke/err/2025_001: GKE cluster complies with the serial port logging organization policy.
New Runbooks
- gcf/failed-deployments: Cloud Run Functions runbook to help users check the reasons for failed deployments of Gen2 cloud functions
- nat/public-nat-ip-allocation-failed: public nat ip allocation failed runbook
- dataproc/spark-job-failures: Dataproc Spark Job Runbook
- lb/latency: new runbook: Load Balancer Latency v1
New Queries
- gce.get_global_operations: Returns global operations object in the given project id.
- cloudasset.search_all_resources: Searches all cloud asset inventory in the project.
- lb.get_load_balancer_type_name: Returns a human-readable name for the given load balancer type.
- dataproc.list_auto_scaling_policies: Lists all autoscaling policies in the given project and region.
New Features
- Pre-commit codespell check to eliminate typos in the repository
- Step class attributes interpolation in step names
- Check deprecated parameters in runbooks
- Migrate repository to python 3.12
- output/api_output.py: implement api output module
- Create threading.local() in op.py for data isolation.
Fixes
- Update gke/err/2024_003 to check for container.defaultNodeServiceAccount role
- Update formatting/style of all runbooks.
- Fix runbook functionality to properly detect pod IP exhaustion and node IP exhaustion
- Explicitly handle HTTP 401 errors
- Add step_error for exceptions caused by GcpApiError
- Handle edge case where
- Improve error handling for iam.roles()
v0.77
0.77 (2024-11-13)
New Lint Rules
- gke/err/2024_002: gke webhook failure endpoints not available
- gke/warn/2024_007: GKE cluster in a dual-stack with external IPv6 access
New Runbooks
- lb/ssl-certificate: Runbook for troubleshooting LB SSL certificates issues
- gke/node-unavailability: Identifies the reasons for a GKE node being unavailable
New Queries
- gke.get_cluster: Retrieve a single GKE cluster using its project, region, and cluster name.
- dns.find_dns_records: Resolves DNS records for a given domain and returns a set of IP addresses.
- lb.get_ssl_certificate: Returns object matching certificate name and region
- lb.get_target_https_proxies: Retrieves the list of all TargetHttpsProxy resources, regional and global, available to the specified project.
- lb.get_forwarding_rule: Returns the specified ForwardingRule resource.
Enhancements
- Functionality to auto suggest correct runbook names for misspelled runbooks
- Updated docker images to ubuntu:24.04 (python 3.12)
- Updated devcontainer to python 3.12
- Migrated crm queries from v1 to v3
- gce/vm-performance: Added PD performance health check
- gce/vm-performance: Implemented disk average_io_latency check
- Removed apis_utils.batch_execute_all call from orgpolicy query
- Enabled gcpdiag.dev page indexing
- Reduced API retries to 3 attempts
- Fixed the START_TIME_UTC inconsistency and the "Error parsing date string" error
- pubsub/pull-subscription-delivery: removed cold cache checks
- Add functionality to disable query caching for edge cases
- Improve error handling within gcpdiag library to raise errors for handling rather than exiting.
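As an illustration of the first enhancement above (suggesting a correct runbook name for a misspelled one), here is a minimal sketch using Python's standard difflib module. It is not gcpdiag's implementation: the function name suggest_runbook, the KNOWN_RUNBOOKS list, and the 0.6 similarity cutoff are all assumptions made for this example.

```python
import difflib

# Hypothetical registry; the names are runbooks mentioned in these release notes.
KNOWN_RUNBOOKS = [
    "gce/vm-performance",
    "gke/node-unavailability",
    "lb/ssl-certificate",
    "pubsub/pull-subscription-delivery",
]


def suggest_runbook(name, known=KNOWN_RUNBOOKS, cutoff=0.6):
    """Return the closest known runbook name, or None if nothing is similar enough."""
    # get_close_matches ranks candidates by similarity ratio and drops
    # anything scoring below the cutoff.
    matches = difflib.get_close_matches(name, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    # A misspelled name ("perfomance") should map back to the real runbook.
    print(suggest_runbook("gce/vm-perfomance"))
```

A higher cutoff makes suggestions more conservative; a lower one risks proposing unrelated runbooks.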
Fixes
- lb.get_backend_service: Improved calls to fetch global backend
- Added project_id parameters for the runbook tests without valid project ids
Deprecation
- Flag --project: Full deprecation in runbook command to allow multiple project ids/numbers to be specified via --parameter
v0.76
0.76 (2024-10-1)
New Lint Rules
- dataproc/warn/2024_005: Investigates if Data Fusion version is compatible with Dataproc version from the CDAP Preferences settings
New Runbooks
- pubsub/pull-subscription-delivery: Investigates common Cloud Pub/Sub pull delivery issues related to delivery latency, quotas, pull rate and throughput rate
New Queries
- pubsub.get_subscription: Retrieves a single pubsub subscription resource
- apis.is_all_enabled: Check if a list of services are enabled on a given project
- gke.get_release_schedule: Fetch GKE cluster release schedule
Enhancements
- make new-rule: A make rule with a cookiecutter recipe to generate new lint rule templates
- gce.get_gce_public_images: Improved gce_stub query to correctly fetch all image licenses during test.
- Runbooks metrics generation for Google Internal Users
- New flag --reason: argument primarily used by Google internal users to specify the rationale for executing the tool
- Bundles: A runbook feature to allow execution of a collection of steps
- Runbook operation (op.add_metadata) to create or retrieve metadata related to steps
Fixes
- Enforce explicit parameter configuration in gce generalized steps.
- dataflow/dataflow-permission: Refactored runbook to dataflow/job-permission
- dataflow/bp/2024_002: Fixed resource filtering bug for forwarding rule (internal LB)
- gce/vm-performance: Fixed disk performance benchmark lookup
Deprecation
- apis_utils.batch_list_all: Replaced by apis_utils.multi_list_all
- Flag --project: Soft deprecation in runbook command to allow multiple project ids/numbers to be specified via --parameter
- Deprecated pre-commit hook gke-eol-file
v0.75
0.75 (2024-9-2)
New Lint Rules
- bigquery/WARN/2024_005: Checks BigQuery table does not exceed number of partition modifications to a column partitioned table
- bigquery/WARN/2024_006: Checks BigQuery job does not exceed tabledata.list bytes per second per project
- dataflow/ERR/2024_006: Checks Dataflow job does not fail during execution due to resource exhaustion in zone
- datafusion/WARN_2024_004: Checks Data Fusion version is compatible with Dataproc version from the corresponding compute profiles
- gke/WARN/2024_003: Checks Ingress traffic is successful if service is correctly mapped
- gke/WARN/2024_004: Checks Ingress is successful if backendconfig crd is correctly mapped
- gke/WARN/2024_005: Checks GKE Ingress successfully routes external traffic to NodePort service
- gce/BP_EXT/2024_002: Calculate a GCE VM's IOPS and Throughput Limits
New Runbooks
- lb/unhealthy-backends: Diagnose Unhealthy Backends of a Load Balancer
- gke/resource-quota: Diagnose quota-related issues for GKE clusters.
- gce/vm-performance: Diagnose GCE VM performance
- gke/image-pull: Diagnose Image Pull Failures related to GKE clusters.
- gke/node-auto-repair: RCA Node auto-repaired incidents
- gke/gke-ip-masq-standard: Diagnose IP Masquerading issues on GKE clusters
- dataflow/dataflow-permission: Diagnose Permission required for cluster creation and operation
New Query
- lb.get_backend_service: Fetch instances matching compute backend service name and/or region
- lb.get_backend_service_health: Fetch compute backend service health data
- generic_api/datafusion: Re-implementation of how to call and test generic apis
Enhancements
- cloudrun/service-deployment: 2 additional checks for image not found and image permissions failure
- bigquery/WARN/2022_001: Updated lint rule diagnostic steps documentation
- Implement ignorecase for input parameters
- gce/ssh and gce/serial-log-analyzer: Include Auth failure checks in runbooks
- Updated GKE version End of Life tracker
- New API Stub for Recommender API
Fixes
- gce/vm-termination: Made vm name and zone mandatory fields
- Updated dependencies:
  - aiohttp: 3.9.5 -> 3.10.3
  - attrs: 23.2.0 -> 24.2.0
  - cachetools: 5.3.3 -> 5.4.0
  - certifi: 2024.6.2 -> 2024.7.4
  - exceptiongroup: 1.2.1 -> 1.2.2
  - google-api-python-client: 2.134.0 -> 2.141.0
  - google-auth: 2.30.0 -> 2.33.0
  - google-auth-oauthlib: 1.2.0 -> 1.2.1
  - importlib-resources: 6.4.0 -> 6.4.2
  - protobuf: 5.27.2 -> 5.27.3
  - pyyaml: 6.0.1 -> 6.0.2
  - soupsieve: 2.5 -> 2.6
- Fix lint output and GCE query functions for multi-region resources
- Removed deprecated option skip_delete from TF code
v0.74
Full Changelog: v0.68...v0.74
0.74 (2024-7-10)
Fixes
- Re-roll of v0.72 after correcting pip module issue with the docker image build
New Lint Rule
- datafusion/warn_2024_002 Data Fusion instance is in a running state
New Runbook
- dataproc/cluster_creation Dataproc cluster creation diagnostic tree
0.73 (2024-7-8)
New Feature
- Added search command to scan the docstrings for lint rules or runbooks to match keywords
- Added runbook check step outcome: step_ok, step_failed, etc.
- Added a zonal endpoint in osconfig library. It returns inventories for all VMs under a certain zone
Fixes
- Create runbook report regardless of the number of failed steps
- Improve introductory error message for new runbooks
- Update lint command API return value for display of resources in each rule
- General spelling corrections
- Add documentation for runbook operator methods
- Remove unneeded google path reference in loading template block content
- Update runbook name validation
- Handle when gcloud command is not installed when running runbook generator
- Allow to query logs for each test data separately in logs_stub
- Update GKE EOL date
- Relax constraints on location of end steps in runbook
- Update pip dependencies; security fix for pdoc
- Added monitoring to the list of supported products runbook steps
- generic_api/datafusion apis.make_request() re-implementation
- Update and improve runbook error handling
New Lint Rule
- gke/err_2024_001_psa_violations Checking for no Pod Security Admission violations in the project
- bigquery/warn_2024_002_invalid_external_connection BigQuery external connection with Cloud SQL does not fail
- pubsub/err_2024_003_snapshot_creation_fails snapshot creation fails if backlog is too old
- pubsub/err_2024_002_vpc_sc_new_subs_create_policy_violated check for pubsub error due to organization policy
- bigquery/warn_2024_0003 BigQuery job does not fail due to Maximum API requests per user per method exceeded
New Runbook
- gce/ops_agent Ops Agent Onboarding runbook
- gcp/serial_log_analyzer runbook to analyse known issues logged into Serial Console logs
- vertex/workbench_instance_stuck_in_provisioning Runbook to Troubleshoot Issue: Vertex AI Workbench Instance Stuck in Provisioning State
- cloudrun/service_deployment Cloud Run deployment runbook
- gke/ip_exhaustion gke ip exhaustion runbook
- dataflow/failed_streaming_pipeline Diagnostic checks for failed Dataflow Streaming Pipelines
- nat/out_of_resources vm external ip connectivity runbook
v0.67
0.67 (2023-10-17)
Fixes
- Updating GKE EOL file and snapshot
- Rewording message triggering internal leak test
New Command and Rules
- Runbook POC with ssh runbook and terraform scripts
New rules
- GKE cluster has workload identity enabled
- Splunk job uses valid certificate
Full Changelog: v0.66...v0.67
gcpdiag 0.71
0.71 (2024-4-17)
New lint rules
- datafusion/err_2024_001_delete_operation_failing datafusion deletion operation
- gce/err_2024_003_vm_secure_boot_failures GCE Lint rule for boot failures for Shielded VM
- gce/bp_2024_001_legacy_monitoring_agent GCE Legacy Monitoring Agent is not installed
- gce/bp_2024_002_legacy_logging_agent GCE Legacy Logging Agent is not installed
- gce/bp_ext_2024_001_no_public_ip.py GCE SSH in Browser: SSH Button Disabled
- pubsub/bp_2024_001_ouma_less_one_day Oldest Unacked Message Age Value less than 24 hours
- bigquery/err_2024_001_query_too_complex query is too complex
- bigquery/warn_2024_001_imports_or_query_appends_per_table table exceeds limit for imports or query appends
New query
- osconfig: "OS management tools that can be used for patch management, patch compliance, and configuration management on VM instances." https://cloud.google.com/compute/docs/osconfig/rest
New runbook
- gce/vm_termination assist investigating underlying reasons behind termination or reboot
- gke/cluster_autoscaler GKE Cluster autoscaler error messages check
New features
- Add cache bypass option for runbook steps
- Add runbook starter code generator; updates to code generator
- Add API for runbook command
Fixes
- Add mock data for datafusion API testing
- Correct runbook documentation generation output
- Improve runbook operator functions usage
- Add dataflow and other components to supported runbook component list
- Remove duplicate vm_termination.py script
- Add jinja templates to docker image on cloud shell
- correct argv passed for parsing in runbook command
- Adding pipenv and git checks to help beginners get started easily on runbook generator
- Update idna pipenv CVE-2024-3651 Moderate severity
- SSH runbook enhancements
- Runbook fixes: catch missing template errors, include project id when no parameters
gcpdiag 0.70
0.70 (2024-3-27)
New lint rules
- pubsub/ERR_2024_001 bq subscription table not found
- composer/WARN_2024_001 low scheduler cpu usage
- datafusion/WARN_2024_001 data fusion version
- composer/WARN_2024_002 worker pod eviction
- gce/ERR_2024_002 performance
- notebooks/ERR_2024_001 executor explicit project permissions
- dataflow/WARN_2024_001 dataflow operation ongoing
- dataflow/ERR_2024_001 dataflow gce quotas
- dataflow/WARN_2024_002 dataflow streaming appliance commit failed
- dataflow/ERR_2024_002 dataflow key commit
- gke/WARN_2024_001 cluster nap limits prevent autoscaling
New query
- datafusion_cdap API query implementation - provides CDAP profile metadata
Fixes
- Updated pipenv packages, Pipenv.lock dependencies
- Updated github action workflow versions to stop warnings about node v10 and v10
- Refactor Runbook: Implemented a modular, class-based design to facilitate a more configurable method for tree construction.
v0.69
0.69 (2024-2-21)
New feature
- add universe_domain for Trusted Partner Client (TPC)
New rules
- asm/WARN_2024_001 Webhook failed
- lb/BP_2024_002 Check if global access is on for the regional iLB
- pubsub/WARN_2024_003 Pub/Sub rule: CMEK - Topic Permissions
- dataproc/WARN_2024_001 dataproc check hdfs safemode status
- dataproc/WARN_2024_002 dataproc hdfs write issues
- gce/ERR_2024_001 GCE rule: Snapshot creation rate limit
- lb/BP_2024_001 session affinity enabled on load balancer
- pubsub/WARN_2024_002 GCS subscription has the apt permissions
- dataflow/ERR_2023_010 missing required field
- pubsub/WARN_2024_001 DLQ Subscription has apt permissions
Fixes
- Update Pull Request and Merge to only run when an update was committed
- Creating a github action Workflow to automatically update the gke/eol.yaml file
- Update gke/eol.yaml file
Full Changelog: https://github.com/GoogleCloudPlatform/gcpdiag/commits/v0.69