Operation archive init job should wait for sys bundle to be healthy #100

Open
achulkov2 opened this issue Jan 10, 2024 · 0 comments
Labels
backlog Backlog

Comments

Collaborator

achulkov2 commented Jan 10, 2024

During cluster initialization in any of the e2e tests, the init-job-op-archive pod logs the following error. The job is retried eventually, but the retry backoff adds to the run time of the already very slow tests.

achulkov2@nebius-yt-dev:~$ kubectl logs yt-scheduler-init-job-op-archive-btqb6  -nquerytrackeraco
++ export YT_DRIVER_CONFIG_PATH=/config/client.yson
++ YT_DRIVER_CONFIG_PATH=/config/client.yson
+++ /usr/bin/ytserver-all --version
+++ head -c4
++ export YTSAURUS_VERSION=23.1
++ YTSAURUS_VERSION=23.1
++ /usr/bin/init_operation_archive --force --latest --proxy http-proxies.querytrackeraco.svc.cluster.local
2024-01-10 19:37:20,124 - INFO - Transforming archive from 48 to 48 version
2024-01-10 19:37:20,134 - INFO - Mounting table //sys/operations_archive/jobs
Traceback (most recent call last):
  File "/usr/bin/init_operation_archive", line 749, in <module>
    main()
  File "/usr/bin/init_operation_archive", line 744, in main
    force=args.force,
  File "/usr/bin/init_operation_archive", line 731, in run
    transform_archive(client, next_version, target_version, force, archive_path, shard_count=shard_count)
  File "/usr/bin/init_operation_archive", line 639, in transform_archive
    mount_table(client, path)
  File "/usr/bin/init_operation_archive", line 55, in mount_table
    client.mount_table(path, sync=True)
  File "/usr/local/lib/python3.7/dist-packages/yt/wrapper/client_impl_yandex.py", line 1394, in mount_table
    freeze=freeze, sync=sync, target_cell_ids=target_cell_ids)
  File "/usr/local/lib/python3.7/dist-packages/yt/wrapper/dynamic_table_commands.py", line 524, in mount_table
    response = make_request("mount_table", params, client=client)
  File "/usr/local/lib/python3.7/dist-packages/yt/wrapper/driver.py", line 126, in make_request
    client=client)
  File "<decorator-gen-3>", line 2, in make_request
  File "/usr/local/lib/python3.7/dist-packages/yt/wrapper/common.py", line 422, in forbidden_inside_job
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/yt/wrapper/http_driver.py", line 301, in make_request
    client=client)
  File "/usr/local/lib/python3.7/dist-packages/yt/wrapper/http_helpers.py", line 455, in make_request_with_retries
    return RequestRetrier(method=method, url=url, **kwargs).run()
  File "/usr/local/lib/python3.7/dist-packages/yt/wrapper/retries.py", line 79, in run
    return self.action()
  File "/usr/local/lib/python3.7/dist-packages/yt/wrapper/http_helpers.py", line 410, in action
    _raise_for_status(response, request_info)
  File "/usr/local/lib/python3.7/dist-packages/yt/wrapper/http_helpers.py", line 290, in _raise_for_status
    raise error_exc
yt.common.YtResponseError: Error committing transaction 1-44d-10001-b753
    Error committing transaction 1-44d-10001-b753 at cell 65726e65-ad6b7562-10259-79747361
        No healthy tablet cells in bundle "sys"

***** Details:
Received HTTP response with error    
    origin          yt-scheduler-init-job-op-archive-btqb6 on 2024-01-10T19:37:20.203965Z    
    url             http://http-proxies.querytrackeraco.svc.cluster.local/api/v4/mount_table    
    request_headers {
                      "User-Agent": "Python wrapper 0.13-dev-5f8638fc66f6e59c7a06708ed508804986a6579f",
                      "Accept-Encoding": "gzip, identity",
                      "X-Started-By": "{\"pid\"=17;\"user\"=\"root\";}",
                      "X-YT-Header-Format": "<format=text>yson",
                      "Content-Type": "application/x-yt-yson-text",
                      "X-YT-Correlation-Id": "d71f4e98-4f2880b3-9213c0d0-9a5a9336"
                    }    
    response_headers {
                      "Content-Length": "1242",
                      "X-YT-Response-Message": "Error committing transaction 1-44d-10001-b753",
                      "X-YT-Response-Code": "1",
                      "X-YT-Response-Parameters": {},
                      "X-YT-Trace-Id": "c0235705-98e9c7a-369cf397-97d28dd7",
                      "X-YT-Error": "{\"code\":1,\"message\":\"Error committing transaction 1-44d-10001-b753\",\"attributes\":{\"host\":\"hp-0.http-proxies.querytrackeraco.svc.cluster.local\",\"pid\":1,\"tid\":12837479201307132255,\"fid\":18446447647636925386,\"datetime\":\"2024-01-10T19:37:20.202367Z\",\"trace_id\":\"c0235705-98e9c7a-369cf397-97d28dd7\",\"span_id\":1636727892750608515,\"cluster_id\":\"Native(Name=test-ytsaurus)\",\"path\":\"//sys/operations_archive/jobs\"},\"inner_errors\":[{\"code\":1,\"message\":\"Error committing transaction 1-44d-10001-b753 at cell 65726e65-ad6b7562-10259-79747361\",\"attributes\":{\"host\":\"hp-0.http-proxies.querytrackeraco.svc.cluster.local\",\"pid\":1,\"tid\":12837479201307132255,\"fid\":18446447647636925386,\"datetime\":\"2024-01-10T19:37:20.202206Z\",\"trace_id\":\"c0235705-98e9c7a-369cf397-97d28dd7\",\"span_id\":1636727892750608515},\"inner_errors\":[{\"code\":1,\"message\":\"No healthy tablet cells in bundle \\\"sys\\\"\",\"attributes\":{\"request_id\":\"dc5643d9-124e57a5-cf4b0583-8753d056\",\"connection_id\":\"6b2e13-a3e8b3e0-314a5f40-69069dfd\",\"verification_mode\":\"none\",\"realm_id\":\"65726e65-ad6b7562-10259-79747361\",\"timeout\":30000,\"method\":\"CommitTransaction\",\"address\":\"ms-0.masters.querytrackeraco.svc.cluster.local:9010\",\"encryption_mode\":\"optional\",\"service\":\"TransactionSupervisorService\"}}]}]}",
                      "X-YT-Request-Id": "93a09617-71caa1ec-cbfe7e46-922f5a1f",
                      "Content-Type": "application/json",
                      "Cache-Control": "no-store",
                      "X-YT-Proxy": "hp-0.http-proxies.querytrackeraco.svc.cluster.local",
                      "Authorization": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
                    }    
    params          {
                      "suppress_transaction_coordinator_sync": false,
                      "path": "//sys/operations_archive/jobs",
                      "freeze": false,
                      "mutation_id": "124ef88f-86123fd-62afd823-512f2084",
                      "retry": false
                    }    
    transparent     True
Error committing transaction 1-44d-10001-b753    
    origin          hp-0.http-proxies.querytrackeraco.svc.cluster.local on 2024-01-10T19:37:20.202367Z (pid 1, tid b227e3515560815f, fid fffef266ed3d2bca)    
    trace_id        c0235705-98e9c7a-369cf397-97d28dd7    
    span_id         1636727892750608515    
    cluster_id      Native(Name=test-ytsaurus)    
    path            //sys/operations_archive/jobs
Error committing transaction 1-44d-10001-b753 at cell 65726e65-ad6b7562-10259-79747361    
    origin          hp-0.http-proxies.querytrackeraco.svc.cluster.local on 2024-01-10T19:37:20.202206Z (pid 1, tid b227e3515560815f, fid fffef266ed3d2bca)    
    trace_id        c0235705-98e9c7a-369cf397-97d28dd7    
    span_id         1636727892750608515
No healthy tablet cells in bundle "sys"    
    origin          yt-scheduler-init-job-op-archive-btqb6 on 2024-01-10T19:37:20.204007Z    
    request_id      dc5643d9-124e57a5-cf4b0583-8753d056    
    connection_id   6b2e13-a3e8b3e0-314a5f40-69069dfd    
    verification_mode none    
    realm_id        65726e65-ad6b7562-10259-79747361    
    timeout         30000    
    method          CommitTransaction    
    address         ms-0.masters.querytrackeraco.svc.cluster.local:9010    
    encryption_mode optional    
    service         TransactionSupervisorService

We should wait for the tablet cells to be healthy before running the init job.
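
One way to picture the proposed wait is a small polling helper that runs before `init_operation_archive`. This is only a sketch under stated assumptions, not the operator's actual implementation: `wait_for_bundle_health` and its parameters are hypothetical names, and the idea that the bundle exposes a health attribute at `//sys/tablet_cell_bundles/sys/@health` whose healthy value is `"good"` is an assumption that should be checked against the YTsaurus documentation.

```python
import time
from typing import Callable


def wait_for_bundle_health(
    get_health: Callable[[], str],
    timeout: float = 300.0,
    poll_interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Poll get_health() until it returns "good"; raise TimeoutError otherwise.

    get_health is a hypothetical callback that returns the current health
    string of the tablet cell bundle. "good" as the healthy value is an
    assumption, not something confirmed by the log above.
    """
    deadline = clock() + timeout
    while True:
        health = get_health()
        if health == "good":
            return
        if clock() >= deadline:
            raise TimeoutError(f"bundle health is still {health!r} after {timeout} s")
        sleep(poll_interval)


# Hypothetical wiring with the yt Python wrapper (the attribute path is an assumption):
#
#   import yt.wrapper as yt
#   client = yt.YtClient(proxy="http-proxies.querytrackeraco.svc.cluster.local")
#   wait_for_bundle_health(lambda: client.get("//sys/tablet_cell_bundles/sys/@health"))
```

Passing the health check as a callback keeps the waiting logic independent of the client, so the same loop could be driven from the init job script or reimplemented in the operator before the job is created.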

nadya002 added the backlog label on Jan 12, 2024