Error of Persistence Max QPS Reached for List Operations #3900

Open
JosefWN opened this issue Jan 16, 2021 · 18 comments

@JosefWN

JosefWN commented Jan 16, 2021

See misplaced issue: uber/cadence-web#227

@JosefWN JosefWN changed the title frontend.ListMaxQPS default too low frontend.visibilityListMaxQPS default too low Jan 16, 2021
@longquanzheng
Collaborator

Hi, we increased this in #3753 and released it in https://github.com/uber/cadence/releases/tag/v0.17.0

Please check it out and let us know if that meets your expectations. :D

@JosefWN
Author

JosefWN commented Jan 22, 2021

Ah, looks good, missed that one. I can close this issue then!

@JosefWN JosefWN closed this as completed Jan 22, 2021
@frtelg

frtelg commented Apr 8, 2021

Unfortunately, this issue is still causing problems for me. I am running Cadence locally using the docker-compose.yml auto setup. I have recently upgraded all my containers to the latest versions. After running a workflow and then navigating to the Cadence GUI, I get the following error when I navigate to my domain's workflows: Persistence Max QPS Reached for List Operations. This is fixed by creating a custom dynamic config file, mounting the folder it is located in into the container, and changing the DYNAMIC_CONFIG_FILE_PATH environment variable of the Cadence server container to point at that file.

I would have expected this to no longer be necessary?
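For reference, a minimal custom dynamic config file of the kind described above could look like the following (the keys and `- value:` format match the snippet shared later in this thread; the QPS values and file path are illustrative):

```yaml
# custom-config/development.yaml (mounted into the container via a volume;
# DYNAMIC_CONFIG_FILE_PATH must point at this file)
frontend.visibilityListMaxQPS:
- value: 10000
frontend.esVisibilityListMaxQPS:
- value: 10000
```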

@longquanzheng
Collaborator

longquanzheng commented Apr 8, 2021

@frtelg I have tested with my master-auto-setup image (updated two days ago) and it works fine.
Here is my image info:

$docker images ubercadence/server:master-auto-setup
REPOSITORY           TAG                 IMAGE ID       CREATED      SIZE
ubercadence/server   master-auto-setup   b68016029480   2 days ago   351MB

So make sure you upgrade the ubercadence/server:master-auto-setup image by running the command:
docker pull ubercadence/server:master-auto-setup

As we said in https://github.com/uber/cadence/tree/master/docker#using-a-released-image (we probably should make it clearer), ubercadence/server:master-auto-setup is a constantly changing image, rebuilt from the latest commit on our master branch. You can use a released image if you want something stable.

@frtelg

frtelg commented Apr 8, 2021

@longquanzheng I have pulled the latest version of the image before testing it. I will test it again tomorrow; maybe I was still using an older version after all. I'll let you know.

@longquanzheng
Collaborator

@frtelg
I just updated my image to the current latest and it still works:

$docker images ubercadence/server:master-auto-setup
REPOSITORY           TAG                 IMAGE ID       CREATED        SIZE
ubercadence/server   master-auto-setup   7890845d4a29   17 hours ago   351MB

I only ran the helloworld sample. Can you also check whether your workflow is calling the List API?

@frtelg

frtelg commented Apr 9, 2021

I have tested again, using the following steps:

  1. docker pull ubercadence/server:master-auto-setup
  2. docker-compose up
  3. Then start my application. It is a basic Spring Boot application, and my workflow is a HelloWorld-style workflow in this case:
public interface GreetingWorkflow {
    String TASK_LIST = "Example";

    @WorkflowMethod(executionStartToCloseTimeoutSeconds = 360, taskList = TASK_LIST)
    void greet();

    @SignalMethod
    void changeName(String name);

    @SignalMethod
    void terminate();

    @QueryMethod
    String getCurrentName();
}
  4. The application initializes the WorkflowService, WorkflowClient and WorkerFactory using Spring Beans and then starts the Workers.
  5. The workflow is then started through a REST call. I don't think the List API is involved in any of this?
  6. Then I check the Cadence GUI and unfortunately get the error: Persistence Max QPS Reached for List Operations.

This is the cadence server docker:

376996c5e71c   ubercadence/server:master-auto-setup   "/docker-entrypoint.…"   7 minutes ago   Up 7 minutes   0.0.0.0:7933-7935->7933-7935/tcp, 0.0.0.0:7939->7939/tcp                                                                                     cadence_cadence_1

After this, I stop the containers and the application, and I add the following to the deployment.yaml file:

frontend.visibilityListMaxQPS:
- value: 10000
frontend.esVisibilityListMaxQPS:
- value: 10000

When I retest then, the GUI works as expected.

You can check out the application if you want, it is in my github: https://github.com/frtelg/cadence-spring-boot.

@longquanzheng
Collaborator

I can't reproduce it. For a time I saw it and thought it was an issue in the WebUI, but then I could not reproduce it anymore...

@longquanzheng
Collaborator

@frtelg do the released docker-compose files help?

@frtelg

frtelg commented Apr 13, 2021

@longquanzheng it is not really clear to me what files you are referring to. I have used the default docker-compose from the cadence project. My docker-compose file looks like this:

version: '3'
services:
  cassandra:
    image: cassandra:3.11
    ports:
      - "9042:9042"
  statsd:
    image: graphiteapp/graphite-statsd
    ports:
      - "8080:80"
      - "2003:2003"
      - "8125:8125"
      - "8126:8126"
  cadence:
    image: ubercadence/server:master-auto-setup
    ports:
     - "7933:7933"
     - "7934:7934"
     - "7935:7935"
     - "7939:7939"
    environment:
      - "CASSANDRA_SEEDS=cassandra"
      - "STATSD_ENDPOINT=statsd:8125"
      - "DYNAMIC_CONFIG_FILE_PATH=custom-config/development.yaml"
    depends_on:
      - cassandra
      - statsd
    volumes:
      - "./config:/etc/cadence/custom-config"
  cadence-web:
    image: ubercadence/web:latest
    environment:
      - "CADENCE_TCHANNEL_PEERS=cadence:7933"
    ports:
      - "8088:8088"
    depends_on:
      - cadence

@longquanzheng
Collaborator

longquanzheng commented Apr 16, 2021

@frtelg I understand this is annoying. I have opened PR #4138
and also built an image so that you can test before the PR lands:
ubercadence/qlong-server:master-04-15-2021-auto-setup
Can you try using it with the log level set to debug, to see why the requests are rate limited?
(The default log level is info:

level: {{ default .Env.LOG_LEVEL "info" }}
)

And let me know whether you see debug logs like

{"level":"debug","ts":"2021-04-15T23:30:50.086-0700","msg":"List API request consumed QPS token","service":"cadence-frontend","wf-domain-name":"samples-domain","name":"github.com/uber/cadence/common/persistence.(*visibilitySamplingClient).ListClosedWorkflowExecutions","logging-call-at":"visibilitySamplingClient.go:328"}

and

{"level":"debug","ts":"2021-04-15T19:00:21.956-0700","msg":"List API request is being sampled","service":"cadence-frontend","wf-domain-name":"samples-domain","name":"github.com/uber/cadence/common/persistence.(*visibilitySamplingClient).ListClosedWorkflowExecutions","logging-call-at":"visibilitySamplingClient.go:326"}

If they are not from your application, we will have a clue how to fix it.

@frtelg

frtelg commented Apr 19, 2021

@longquanzheng the supplied container version is not working:

cadence_1      | 2021/04/19 06:55:02 gocql: unable to dial control conn 172.24.0.2: dial tcp 172.24.0.2:9042: connect: connection refused
cadence_1      | 2021/04/19 06:55:02 cassandra schema version compatibility check failed: unable to create CQL Client: gocql: unable to create session: control: unable to connect to initial hosts: dial tcp 172.24.0.2:9042: connect: connection refused

This is my docker-compose.yml:

version: '3'
services:
  cassandra:
    image: cassandra:3.11
    ports:
      - "9042:9042"
  statsd:
    image: graphiteapp/graphite-statsd
    ports:
      - "8080:80"
      - "2003:2003"
      - "8125:8125"
      - "8126:8126"
  cadence:
    image: ubercadence/qlong-server:master-04-15-2021-auto-setup
    ports:
     - "7933:7933"
     - "7934:7934"
     - "7935:7935"
     - "7939:7939"
    environment:
      - "CASSANDRA_SEEDS=cassandra"
      - "STATSD_ENDPOINT=statsd:8125"
#      - "DYNAMIC_CONFIG_FILE_PATH=custom-config/development.yaml"
      - "LOG_LEVEL=debug"
    depends_on:
      - cassandra
      - statsd
    volumes:
      - "./config:/etc/cadence/custom-config"
  cadence-web:
    image: ubercadence/web:latest
    environment:
      - "CADENCE_TCHANNEL_PEERS=cadence:7933"
    ports:
      - "8088:8088"
    depends_on:
      - cadence

@longquanzheng
Collaborator

@frtelg Sorry, that error was totally my bad when building the customized image. I forgot to add the auto-setup argument.
At the same time, I happened to have a local Cassandra running on my laptop, so I didn't catch it.

Can you try this one:
ubercadence/qlong-server:master-04-20-2021-auto-setup

LMK. Thanks

@longquanzheng
Collaborator

@frtelg I finally reproduced this stably myself.
I will work on fixing it.
(Screenshots of the error attached.)

@longquanzheng longquanzheng reopened this Apr 24, 2021
@longquanzheng
Collaborator

longquanzheng commented Apr 24, 2021

^ I think I have root-caused the issue. I think I got the repro because I updated my web image.

TL;DR

There is a change in the WebUI that always makes 2 requests on the default page, so that it can show both open and closed workflows. However, our rate limiting uses a bucket size of only 1, even though the refill rate is 10, so it rejects requests at a very high rate. Note that this is mostly only an issue in local docker-compose.
To mitigate, users can select the closed or open view themselves and ignore the error for now.


There is a change in the WebUI such that by default it tries to get both open and closed workflows, so the default page has to make at least two List requests.

However, it looks like the rate limiting doesn't work as we expected, or we didn't configure it correctly. Even though MaxQPS defaults to 10, that is only the refill rate; it doesn't allow 2 requests at the same time. There are a couple of ways to fix this:

  • The Web UI does some retrying
  • The backend configures the token bucket with a bigger initial size. It looks like it's controlled by this numOfPriority, which is only 1 for the List API:
    rateLimiter := p.rateLimitersForList.getRateLimiter(domain, numOfPriorityForList, p.config.VisibilityListMaxQPS(domain))

    In other words, we are using a token bucket as a leaky bucket.

To mitigate, users can select the closed or open view themselves and ignore the error for now.

@just-at-uber Do you think we can implement retry logic in the WebUI? I think it's useful in many ways. Even though we could potentially add an initial-size configuration for rate limiting, it's still good to have some retry in the WebUI when talking to the Cadence frontend.
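The burst-vs-refill mismatch described above can be illustrated with a small, self-contained Go sketch. This is not Cadence's actual limiter code; it is a simplified token bucket with a burst capacity of 1 and a refill rate of 10 tokens per second, showing why the second of two back-to-back List requests gets rejected:

```go
package main

import (
	"fmt"
	"time"
)

// tokenBucket is a simplified stand-in for a rate limiter: refill happens
// continuously at `rate` tokens/sec, but the bucket can hold at most `burst`.
type tokenBucket struct {
	tokens   float64
	burst    float64
	rate     float64 // tokens added per second
	lastSeen time.Time
}

func newTokenBucket(rate, burst float64) *tokenBucket {
	return &tokenBucket{tokens: burst, burst: burst, rate: rate, lastSeen: time.Now()}
}

// allow consumes a token if one is available, refilling based on elapsed time.
func (b *tokenBucket) allow() bool {
	now := time.Now()
	b.tokens += now.Sub(b.lastSeen).Seconds() * b.rate
	if b.tokens > b.burst {
		b.tokens = b.burst // burst size caps how many tokens can accumulate
	}
	b.lastSeen = now
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

func main() {
	// Refill rate 10/sec (the MaxQPS default) but a burst capacity of only 1:
	// two back-to-back List calls, like the Web UI's open + closed requests,
	// cannot both get a token.
	limiter := newTokenBucket(10, 1)
	fmt.Println(limiter.allow()) // true: first request passes
	fmt.Println(limiter.allow()) // false: second immediate request is rejected
}
```

With a burst size of 1, the limiter effectively behaves like a leaky bucket: even well below 10 QPS on average, any two requests arriving closer than ~100 ms apart will see the second one throttled.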

@frtelg

frtelg commented Apr 26, 2021

Thanks! Great that you managed to find the bug. I have not yet found the time to retest it.

@just-at-uber
Contributor

I think retry logic here is good to have anyway for this screen, in case the API fails. Ideally the server should handle a higher load by default.

@longquanzheng
Collaborator

@just-at-uber yeah, I agree that the server should also improve. I took a look, but currently none of the rate limiting in the server allows any bursting. It may take more effort to introduce it (and new configuration).

@longquanzheng longquanzheng changed the title frontend.visibilityListMaxQPS default too low Error of Persistence Max QPS Reached for List Operations May 21, 2021