We randomly see a RedisTimeoutError #6162

Open
remy727 opened this issue Jan 11, 2024 · 65 comments

Comments

@remy727

remy727 commented Jan 11, 2024

I know this error has already been discussed a lot, but we randomly see a RedisTimeoutError. Our app is deployed on Heroku and uses Redis Cloud.

Jan 11 15:09:35 our-app app[worker] ERROR pid=3 tid=pwb ERROR: Error fetching job: Waited 3.0 seconds
Jan 11 15:09:35 our-app app[worker] WARN pid=3 tid=pwb WARN: RedisClient::ReadTimeoutError: Waited 3.0 seconds
Jan 11 15:09:35 our-app app[worker] WARN pid=3 tid=pwb WARN: 

This is our config:

# frozen_string_literal: true

options_hsh = {
  url: ENV["REDIS_URL"] || "redis://127.0.0.1:6379",
  ssl_params: {
    verify_mode: OpenSSL::SSL::VERIFY_NONE
  }
}

Sidekiq.configure_server do |config|
  config.redis = options_hsh
end

Sidekiq.configure_client do |config|
  config.redis = options_hsh
end

We also checked Redis latency:

min: 326, max: 335, avg: 328.04 (45 samples) -- 15.31 seconds range
min: 325, max: 359, avg: 332.84 (44 samples) -- 15.20 seconds range
min: 325, max: 330, avg: 327.07 (45 samples) -- 15.26 seconds range
min: 325, max: 330, avg: 327.44 (45 samples) -- 15.28 seconds range
min: 325, max: 331, avg: 327.49 (45 samples) -- 15.30 seconds range
min: 325, max: 333, avg: 327.56 (45 samples) -- 15.30 seconds range
min: 325, max: 336, avg: 327.44 (45 samples) -- 15.27 seconds range
min: 325, max: 356, avg: 328.60 (45 samples) -- 15.32 seconds range
min: 325, max: 334, avg: 327.44 (45 samples) -- 15.27 seconds range
min: 324, max: 333, avg: 327.73 (45 samples) -- 15.30 seconds range
min: 325, max: 335, avg: 327.78 (45 samples) -- 15.30 seconds range
min: 325, max: 336, avg: 327.20 (45 samples) -- 15.28 seconds range
min: 325, max: 354, avg: 328.80 (45 samples) -- 15.34 seconds range
min: 325, max: 332, avg: 327.58 (45 samples) -- 15.29 seconds range
min: 325, max: 340, avg: 328.36 (44 samples) -- 15.00 seconds range
min: 325, max: 359, avg: 329.70 (44 samples) -- 15.04 seconds range
min: 325, max: 947, avg: 355.10 (41 samples) -- 15.04 seconds range
min: 386, max: 1809, avg: 590.72 (25 samples) -- 15.09 seconds range
min: 371, max: 1681, avg: 461.85 (33 samples) -- 15.65 seconds range
min: 374, max: 857, avg: 420.20 (35 samples) -- 15.14 seconds range
  • Ruby version: 3.2.2
  • Rails version: 7.1.2
  • Sidekiq / Pro / Enterprise version(s): Sidekiq 7.2
@mperham
Collaborator

mperham commented Jan 11, 2024

What version is the Redis server?

@mperham
Collaborator

mperham commented Jan 11, 2024

Can you include the output from redis-cli info?

@remy727
Author

remy727 commented Jan 11, 2024

Sure.

# Server
redis_version:7.2.0
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:0000000000000000000000000000000000000000
redis_mode:standalone
os:Linux 5.4.0-1071-aws x86_64
arch_bits:64
multiplexing_api:epoll
gcc_version:7.5.0
process_id:12089385
run_id:f28c63be14220b56c1c14028bab584487d4d0fb1
tcp_port:13144
server_time_usec:1705005037000000
uptime_in_seconds:121430
uptime_in_days:1
hz:10
lru_clock:0
config_file:

# Clients
connected_clients:7
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:5
maxclients:256
cluster_connections:0

# Memory
used_memory:4135064
used_memory_human:3.94M
used_memory_rss:4135064
used_memory_peak:4319784
used_memory_peak_human:4.11M
used_memory_lua:40960
maxmemory_policy:noeviction
mem_fragmentation_ratio:1
mem_allocator:jemalloc-5.3.0

# Persistence
loading:0
rdb_changes_since_last_save:16636
rdb_bgsave_in_progress:0
rdb_last_save_time:1704883611
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:0
rdb_current_bgsave_time_sec:-1
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok

# Stats
total_connections_received:50
total_commands_processed:97020
instantaneous_ops_per_sec:4
total_net_input_bytes:18699852
total_net_output_bytes:223692477
instantaneous_input_kbps:0.45
instantaneous_output_kbps:2.58
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:5
evicted_keys:0
keyspace_hits:16247
keyspace_misses:235563
pubsub_channels:0
pubsub_patterns:0
pubsubshard_channels:0
latest_fork_usec:0
migrate_cached_sockets:0
total_forks:0
total_error_replies:0
total_eviction_exceeded_time:0
current_eviction_exceeded_time:0
total_active_defrag_time:0
current_active_defrag_time:0

# Replication
role:master
connected_slaves:0
master_repl_offset:0
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0

# CPU
used_cpu_sys:0.00
used_cpu_user:0.00
used_cpu_sys_children:0.00
used_cpu_user_children:0.00
used_cpu_sys_main_thread:0.00
used_cpu_user_main_thread:0.00

# Cluster
cluster_enabled:0

# Keyspace
db0:keys=15,expires=11,avg_ttl=28671441902

@mperham
Collaborator

mperham commented Jan 11, 2024

I'm assuming that is your local Redis and not a production Redis. "db0:keys=15" is basically empty. I need to see production.

@remy727
Author

remy727 commented Jan 11, 2024

This is production. I am using Redis Cloud.

@remy727
Author

remy727 commented Jan 11, 2024

[screenshot: Redis Cloud metrics]

@mperham
Collaborator

mperham commented Jan 11, 2024

Well that certainly looks lightly used. I can't explain why you'd see any errors. Make sure the redis-client gem is up to date.
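
For example, to confirm what a given process is actually running (standard RubyGems API; this is just an illustrative snippet):

# From a console or one-off script inside the worker process.
puts Gem.loaded_specs["redis-client"]&.version
# If it's behind, `bundle update redis-client` and redeploy.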

@remy727
Author

remy727 commented Jan 11, 2024

Confirmed we are using 0.19.1.

@pelargir

pelargir commented Jan 22, 2024

We're having the same problem. A lightly used Heroku instance running Sidekiq with a Redis instance hosted on Redis Cloud. Every few minutes we get new errors in the log:

Jan 21 13:23:15 staging app/worker.1 pid=2 tid=20be INFO: Redis is online, 3.303780886111781 sec downtime
Jan 21 13:23:15 staging app/worker.1 pid=2 tid=20by INFO: Redis is online, 3.302901866962202 sec downtime
Jan 21 13:27:44 staging app/worker.1 pid=2 tid=1x3u ERROR: Error fetching job: Waited 3.0 seconds
Jan 21 13:27:44 staging app/worker.1 pid=2 tid=1x3u WARN: RedisClient::ReadTimeoutError: Waited 3.0 seconds
Jan 21 13:27:44 staging app/worker.1 pid=2 tid=1x3u WARN: 
Jan 21 13:27:44 staging app/worker.1 pid=2 tid=206e ERROR: Error fetching job: Waited 3.0 seconds
Jan 21 13:27:44 staging app/worker.1 pid=2 tid=206e WARN: RedisClient::ReadTimeoutError: Waited 3.0 seconds
Jan 21 13:27:44 staging app/worker.1 pid=2 tid=206e WARN: 
Jan 21 13:27:44 staging app/worker.1 pid=2 tid=2096 ERROR: Error fetching job: Waited 3.0 seconds
Jan 21 13:27:44 staging app/worker.1 pid=2 tid=2096 WARN: RedisClient::ReadTimeoutError: Waited 3.0 seconds
Jan 21 13:27:44 staging app/worker.1 pid=2 tid=2096 WARN: 
Jan 21 13:27:44 staging app/worker.1 pid=2 tid=20by ERROR: Error fetching job: Waited 3.0 seconds
Jan 21 13:27:44 staging app/worker.1 pid=2 tid=20by WARN: RedisClient::ReadTimeoutError: Waited 3.0 seconds
Jan 21 13:27:44 staging app/worker.1 pid=2 tid=20by WARN: 

This is our redis-cli info output for our Redis Cloud instance:

# Server
redis_version:7.2.0
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:0000000000000000000000000000000000000000
redis_mode:standalone
os:Linux 5.4.0-1049-aws x86_64
arch_bits:64
multiplexing_api:epoll
gcc_version:7.5.0
process_id:12105563
run_id:cd3398a50315a259bb36537848cffd8bace34995
tcp_port:12516
server_time_usec:1705939219000000
uptime_in_seconds:86046
uptime_in_days:0
hz:10
lru_clock:0
config_file:

# Clients
connected_clients:0
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0
maxclients:30
cluster_connections:0

# Memory
used_memory:3066544
used_memory_human:2.92M
used_memory_rss:3066544
used_memory_peak:3597648
used_memory_peak_human:3.43M
used_memory_lua:35840
maxmemory_policy:noeviction
mem_fragmentation_ratio:1
mem_allocator:jemalloc-5.3.0

# Persistence
loading:0
rdb_changes_since_last_save:58378
rdb_bgsave_in_progress:0
rdb_last_save_time:1705853176
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:0
rdb_current_bgsave_time_sec:-1
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok

# Stats
total_connections_received:141
total_commands_processed:325380
instantaneous_ops_per_sec:0
total_net_input_bytes:29456100
total_net_output_bytes:231400860
instantaneous_input_kbps:0.07
instantaneous_output_kbps:2.54
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:69
evicted_keys:0
keyspace_hits:50243
keyspace_misses:434611
pubsub_channels:0
pubsub_patterns:0
pubsubshard_channels:0
latest_fork_usec:0
migrate_cached_sockets:0
total_forks:0
total_error_replies:0
total_eviction_exceeded_time:0
current_eviction_exceeded_time:0
total_active_defrag_time:0
current_active_defrag_time:0

# Replication
role:master
connected_slaves:0
master_repl_offset:0
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0

# CPU
used_cpu_sys:0.00
used_cpu_user:0.00
used_cpu_sys_children:0.00
used_cpu_user_children:0.00
used_cpu_sys_main_thread:0.00
used_cpu_user_main_thread:0.00

# Cluster
cluster_enabled:0

# Keyspace
db0:keys=39,expires=36,avg_ttl=17470728473

From our Gemfile:

    redis (5.0.8)
      redis-client (>= 0.17.0)
    redis-client (0.19.1)
    sidekiq (7.2.1)
      redis-client (>= 0.19.0)

Our initializer:

Sidekiq.configure_server do |config|
  config.redis = { ssl_params: { verify_mode: OpenSSL::SSL::VERIFY_NONE } }
end

Sidekiq.configure_client do |config|
  config.redis = { ssl_params: { verify_mode: OpenSSL::SSL::VERIFY_NONE } }
end

Our config:

:max_retries: 7
:queues:
  - default
  - low

I've contacted Redis Cloud support but they haven't provided any helpful guidance yet.

@mperham
Collaborator

mperham commented Jan 22, 2024

Ruby and OpenSSL versions in use?

@pelargir

ruby 3.3.0 (2023-12-25 revision 5124f9ac75)
OpenSSL 1.1.1f 31 Mar 2020

I also ran redis-cli --latency and got:

min: 24, max: 566, avg: 38.93 (1081 samples)

I'm starting to wonder if it was a zombie process. I created a Redis instance on Render and pointed my config there instead and have not been seeing the errors. I'll point back at Redis Cloud and see if the errors come back...

@mperham
Collaborator

mperham commented Jan 22, 2024

I would recommend trying OpenSSL 3. Mixing modern versions (Ruby 3.3, redis 7.2) with legacy versions like OpenSSL 1.1 may cause subtle issues. That version is almost 4 years old now.
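
A minimal way to check which OpenSSL your Ruby is actually linked against (standard openssl stdlib constants; output varies by platform):

require "openssl"

puts RUBY_DESCRIPTION                  # which Ruby build is in use
puts OpenSSL::OPENSSL_VERSION          # OpenSSL version Ruby was compiled against
puts OpenSSL::OPENSSL_LIBRARY_VERSION  # OpenSSL version loaded at runtime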

@pelargir

As soon as I switched back to Redis Cloud I started getting the errors again:

2024-01-22T16:39:27.504660+00:00 app[worker.1]: pid=2 tid=206y class=ActiveStorage::PurgeJob jid=2526685e9b6fdb2d4caf744a elapsed=0.059 INFO: done
2024-01-22T16:41:01.945720+00:00 app[worker.1]: pid=2 tid=20sm ERROR: Error fetching job: Waited 3.0 seconds
2024-01-22T16:41:01.946268+00:00 app[worker.1]: pid=2 tid=20sm WARN: RedisClient::ReadTimeoutError: Waited 3.0 seconds
2024-01-22T16:41:01.946843+00:00 app[worker.1]: pid=2 tid=20sm WARN:
2024-01-22T16:41:01.947817+00:00 app[worker.1]: pid=2 tid=20t6 ERROR: Error fetching job: Waited 3.0 seconds
2024-01-22T16:41:01.947818+00:00 app[worker.1]: pid=2 tid=20t6 WARN: RedisClient::ReadTimeoutError: Waited 3.0 seconds
2024-01-22T16:41:01.948097+00:00 app[worker.1]: pid=2 tid=20t6 WARN:
2024-01-22T16:41:01.948898+00:00 app[worker.1]: pid=2 tid=20tq ERROR: Error fetching job: Waited 3.0 seconds
2024-01-22T16:41:01.948964+00:00 app[worker.1]: pid=2 tid=20tq WARN: RedisClient::ReadTimeoutError: Waited 3.0 seconds
2024-01-22T16:41:01.949192+00:00 app[worker.1]: pid=2 tid=20tq WARN:

The only difference is Render has a TLS URL whereas Redis Cloud provides a non-TLS URL. Could this have anything to do with SSL verify mode being set to "none" when a non-TLS URL is being used?

@remy727
Author

remy727 commented Jan 22, 2024

The only difference is Render has a TLS URL whereas Redis Cloud provides a non-TLS URL. Could this have anything to do with SSL verify mode being set to "none" when a non-TLS URL is being used?

No. I had the same issue when I tried with Heroku Data for Redis which provides a TLS URL.

@remy727
Author

remy727 commented Jan 22, 2024

I would recommend trying OpenSSL 3. Mixing modern versions (Ruby 3.3, redis 7.2) with legacy versions like OpenSSL 1.1 may cause subtle issues. That version is almost 4 years old now.

@mperham, my OpenSSL version is 3.2.0 but I have the same issue.

@pelargir

After upgrading to the heroku-22 stack with OpenSSL 3.0.2, the problem appears to have gone away.

Perhaps Redis Cloud is running something on their end that does not play nicely with older versions of OpenSSL, whereas Render and Heroku Redis aren't.

@mperham thanks for your help.

@pelargir

pelargir commented Jan 22, 2024

I spoke too soon. The error is back again. It took an hour or so to start happening. I restarted the dyno and it went away. I'll wait another hour and see if it comes back.

@pelargir

It came back after 30 mins this time. Not sure what's going on. Same logs as before:

2024-01-22T23:24:20.474062+00:00 app[worker.1]: pid=2 tid=206y ERROR: Error fetching job: Waited 3.0 seconds
2024-01-22T23:24:20.474209+00:00 app[worker.1]: pid=2 tid=206y WARN: RedisClient::ReadTimeoutError: Waited 3.0 seconds
2024-01-22T23:24:20.474391+00:00 app[worker.1]: pid=2 tid=206y WARN:
2024-01-22T23:24:20.475225+00:00 app[worker.1]: pid=2 tid=20tq ERROR: Error fetching job: Waited 3.0 seconds
2024-01-22T23:24:20.475369+00:00 app[worker.1]: pid=2 tid=20tq WARN: RedisClient::ReadTimeoutError: Waited 3.0 seconds
2024-01-22T23:24:20.475636+00:00 app[worker.1]: pid=2 tid=20tq WARN:
2024-01-22T23:24:20.476392+00:00 app[worker.1]: pid=2 tid=20ve ERROR: Error fetching job: Waited 3.0 seconds
2024-01-22T23:24:20.476465+00:00 app[worker.1]: pid=2 tid=20ve WARN: RedisClient::ReadTimeoutError: Waited 3.0 seconds
2024-01-22T23:24:20.476694+00:00 app[worker.1]: pid=2 tid=20ve WARN:
2024-01-22T23:24:20.478731+00:00 app[worker.1]: pid=2 tid=20vy ERROR: Error fetching job: Waited 3.0 seconds
2024-01-22T23:24:20.478911+00:00 app[worker.1]: pid=2 tid=20vy WARN: RedisClient::ReadTimeoutError: Waited 3.0 seconds
2024-01-22T23:24:20.479083+00:00 app[worker.1]: pid=2 tid=20vy WARN:

@samk1

samk1 commented Jan 25, 2024

Experiencing the same or a similar issue here. We're using Heroku Redis, also on the heroku-22 stack, with the exact same versions as originally reported. Our error message is

RedisClient::ReadTimeoutError Waited 2.0 seconds

which is different to the 3.0 seconds everyone else is reporting

Ruby version: 3.2.2
Rails version: 7.1.2
Sidekiq / Pro / Enterprise version(s): Sidekiq 7.2.0

However, our redis-client version is 0.19.1.

I've raised it with Heroku support and they've sent me here. It is happening very rarely, so it's really just noise (bearing in mind this is an application with very low utilisation).

Here is the output from redis-cli info:

# Server
redis_version:7.0.14
redis_git_sha1:c1d92a69
redis_git_dirty:1
redis_build_id:b1de837cc15df969
redis_mode:standalone
os:Linux 5.15.0-1049-aws x86_64
arch_bits:64
monotonic_clock:POSIX clock_gettime
multiplexing_api:epoll
atomicvar_api:c11-builtin
gcc_version:9.4.0
process_id:4
process_supervised:no
run_id:a64abd3b6e3e88ecfd416eb1d2954d26c1238338
tcp_port:29459
server_time_usec:1706154004108726
uptime_in_seconds:4840594
uptime_in_days:56
hz:10
configured_hz:10
lru_clock:11655188
executable:/usr/lib/redis/7.0/bin/redis-server
config_file:/etc/redis/heroku-redis.conf
io_threads_active:0

# Clients
connected_clients:5
cluster_connections:0
maxclients:40
client_recent_max_input_buffer:20617
client_recent_max_output_buffer:20504
blocked_clients:2
tracking_clients:0
clients_in_timeout_table:2

# Memory
used_memory:3384600
used_memory_human:3.23M
used_memory_rss:13467648
used_memory_rss_human:12.84M
used_memory_peak:3685064
used_memory_peak_human:3.51M
used_memory_peak_perc:91.85%
used_memory_overhead:1804346
used_memory_startup:534480
used_memory_dataset:1580254
used_memory_dataset_perc:55.45%
allocator_allocated:3545968
allocator_active:4493312
allocator_resident:7188480
total_system_memory:16456396800
total_system_memory_human:15.33G
used_memory_lua:44032
used_memory_vm_eval:44032
used_memory_lua_human:43.00K
used_memory_scripts_eval:4928
number_of_cached_scripts:4
number_of_functions:0
number_of_libraries:0
used_memory_vm_functions:32768
used_memory_vm_total:76800
used_memory_vm_total_human:75.00K
used_memory_functions:184
used_memory_scripts:5112
used_memory_scripts_human:4.99K
maxmemory:52428800
maxmemory_human:50.00M
maxmemory_policy:noeviction
allocator_frag_ratio:1.27
allocator_frag_bytes:947344
allocator_rss_ratio:1.60
allocator_rss_bytes:2695168
rss_overhead_ratio:1.87
rss_overhead_bytes:6279168
mem_fragmentation_ratio:4.01
mem_fragmentation_bytes:10105344
mem_not_counted_for_evict:36024
mem_replication_backlog:1048592
mem_total_replication_buffers:1086712
mem_clients_slaves:38136
mem_clients_normal:70746
mem_cluster_links:0
mem_aof_buffer:2048
mem_allocator:jemalloc-5.2.1
active_defrag_running:0
lazyfree_pending_objects:0
lazyfreed_objects:0

# Persistence
loading:0
async_loading:0
current_cow_peak:0
current_cow_size:0
current_cow_size_age:0
current_fork_perc:0.00
current_save_keys_processed:0
current_save_keys_total:0
rdb_changes_since_last_save:5624547
rdb_bgsave_in_progress:0
rdb_last_save_time:1701313410
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:0
rdb_current_bgsave_time_sec:-1
rdb_saves:0
rdb_last_cow_size:802816
rdb_last_load_keys_expired:0
rdb_last_load_keys_loaded:0
aof_enabled:1
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:0
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_rewrites:7
aof_rewrites_consecutive_failures:0
aof_last_write_status:ok
aof_last_cow_size:1073152
module_fork_in_progress:0
module_fork_last_cow_size:0
aof_current_size:28595763
aof_base_size:124430
aof_pending_rewrite:0
aof_buffer_length:0
aof_pending_bio_fsync:0
aof_delayed_fsync:0

# Stats
total_connections_received:163387
total_commands_processed:44273118
instantaneous_ops_per_sec:16
total_net_input_bytes:4346006451
total_net_output_bytes:1504233864
total_net_repl_input_bytes:0
total_net_repl_output_bytes:504714686
instantaneous_input_kbps:1.34
instantaneous_output_kbps:0.93
instantaneous_input_repl_kbps:0.00
instantaneous_output_repl_kbps:0.83
rejected_connections:0
sync_full:1
sync_partial_ok:0
sync_partial_err:0
expired_keys:69298
expired_stale_perc:0.00
expired_time_cap_reached_count:0
expire_cycle_cpu_milliseconds:61829
evicted_keys:0
evicted_clients:0
total_eviction_exceeded_time:0
current_eviction_exceeded_time:0
keyspace_hits:2016167
keyspace_misses:2208268
pubsub_channels:1
pubsub_patterns:0
pubsubshard_channels:0
latest_fork_usec:516
total_forks:8
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0
total_active_defrag_time:0
current_active_defrag_time:0
tracking_total_keys:0
tracking_total_items:0
tracking_total_prefixes:0
unexpected_error_replies:0
total_error_replies:3
dump_payload_sanitizations:0
total_reads_processed:36964298
total_writes_processed:33642871
io_threaded_reads_processed:0
io_threaded_writes_processed:0
reply_buffer_shrinks:9707
reply_buffer_expands:3740

# Replication
role:master
connected_slaves:1
slave0:ip=10.0.191.150,port=11789,state=online,offset=504714441,lag=0
master_failover_state:no-failover
master_replid:6056d1f9af710dbc8b88c1eb948d192cf3ac8cf9
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:504714441
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:503649833
repl_backlog_histlen:1064609

# CPU
used_cpu_sys:2214.261827
used_cpu_user:5403.988730
used_cpu_sys_children:0.015823
used_cpu_user_children:0.011875
used_cpu_sys_main_thread:2159.962719
used_cpu_user_main_thread:5395.535469

# Modules

# Errorstats
errorstat_NOAUTH:count=3

# Cluster
cluster_enabled:0

# Keyspace
db0:keys=1136,expires=1126,avg_ttl=14160610755

Very similar Sidekiq config, except we're using Sidekiq Pro:

# frozen_string_literal: true

Sidekiq.configure_server do |config|
  config.super_fetch!
  config.reliable_scheduler!

  # Need to configure SSL to VERIFY_NONE to work with redis 6 on Heroku
  # https://stackoverflow.com/questions/65834575/how-to-enable-tls-for-redis-6-on-sidekiq
  config.redis = { ssl_params: { verify_mode: OpenSSL::SSL::VERIFY_NONE } }
end

Sidekiq.configure_client do |config|
  # Need to configure SSL to VERIFY_NONE to work with redis 6 on Heroku
  # https://stackoverflow.com/questions/65834575/how-to-enable-tls-for-redis-6-on-sidekiq
  config.redis = { ssl_params: { verify_mode: OpenSSL::SSL::VERIFY_NONE } }
end

@dan-jensen

Experiencing the same issue on a self-managed server running Ubuntu 22.04. Like others, we're seeing this intermittently. We get a burst of 5-15 exception reports from our app, then the problem disappears for another week or so. Logs show CPU and memory are NOT under load when this problem occurs.

Here is the redis-cli info:

# Server
redis_version:7.2.4
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:44edf16bebe6fed2
redis_mode:standalone
os:Linux 5.15.0-84-generic x86_64
arch_bits:64
monotonic_clock:POSIX clock_gettime
multiplexing_api:epoll
atomicvar_api:c11-builtin
gcc_version:11.4.0
process_id:2140443
process_supervised:systemd
run_id:2cd2d22e1e896a2755a48e51b75b96e8b571b655
tcp_port:6379
server_time_usec:1707529719360288
uptime_in_seconds:1587085
uptime_in_days:18
hz:10
configured_hz:10
lru_clock:13030903
executable:/usr/bin/redis-server
config_file:/etc/redis/redis.conf
io_threads_active:0
listener0:name=tcp,bind=127.0.0.1,bind=-::1,port=6379

# Clients
connected_clients:9
cluster_connections:0
maxclients:10000
client_recent_max_input_buffer:20523
client_recent_max_output_buffer:0
blocked_clients:5
tracking_clients:0
clients_in_timeout_table:5
total_blocking_keys:1
total_blocking_keys_on_nokey:0

# Memory
used_memory:2223584
used_memory_human:2.12M
used_memory_rss:9883648
used_memory_rss_human:9.43M
used_memory_peak:2449608
used_memory_peak_human:2.34M
used_memory_peak_perc:90.77%
used_memory_overhead:1200687
used_memory_startup:866056
used_memory_dataset:1022897
used_memory_dataset_perc:75.35%
allocator_allocated:2363464
allocator_active:2703360
allocator_resident:5783552
total_system_memory:8323088384
total_system_memory_human:7.75G
used_memory_lua:39936
used_memory_vm_eval:39936
used_memory_lua_human:39.00K
used_memory_scripts_eval:312
number_of_cached_scripts:1
number_of_functions:0
number_of_libraries:0
used_memory_vm_functions:32768
used_memory_vm_total:72704
used_memory_vm_total_human:71.00K
used_memory_functions:184
used_memory_scripts:496
used_memory_scripts_human:496B
maxmemory:0
maxmemory_human:0B
maxmemory_policy:noeviction
allocator_frag_ratio:1.14
allocator_frag_bytes:339896
allocator_rss_ratio:2.14
allocator_rss_bytes:3080192
rss_overhead_ratio:1.71
rss_overhead_bytes:4100096
mem_fragmentation_ratio:4.45
mem_fragmentation_bytes:7661888
mem_not_counted_for_evict:0
mem_replication_backlog:0
mem_total_replication_buffers:0
mem_clients_slaves:0
mem_clients_normal:140399
mem_cluster_links:0
mem_aof_buffer:0
mem_allocator:jemalloc-5.3.0
active_defrag_running:0
lazyfree_pending_objects:0
lazyfreed_objects:0

# Persistence
loading:0
async_loading:0
current_cow_peak:0
current_cow_size:0
current_cow_size_age:0
current_fork_perc:0.00
current_save_keys_processed:0
current_save_keys_total:0
rdb_changes_since_last_save:246
rdb_bgsave_in_progress:0
rdb_last_save_time:1707529507
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:0
rdb_current_bgsave_time_sec:-1
rdb_saves:5272
rdb_last_cow_size:1007616
rdb_last_load_keys_expired:0
rdb_last_load_keys_loaded:1581
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_rewrites:0
aof_rewrites_consecutive_failures:0
aof_last_write_status:ok
aof_last_cow_size:0
module_fork_in_progress:0
module_fork_last_cow_size:0

# Stats
total_connections_received:525
total_commands_processed:7283763
instantaneous_ops_per_sec:2
total_net_input_bytes:420025646
total_net_output_bytes:61687497
total_net_repl_input_bytes:0
total_net_repl_output_bytes:0
instantaneous_input_kbps:0.10
instantaneous_output_kbps:0.01
instantaneous_input_repl_kbps:0.00
instantaneous_output_repl_kbps:0.00
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:18570
expired_stale_perc:0.00
expired_time_cap_reached_count:0
expire_cycle_cpu_milliseconds:229262
evicted_keys:0
evicted_clients:0
total_eviction_exceeded_time:0
current_eviction_exceeded_time:0
keyspace_hits:794864
keyspace_misses:317229
pubsub_channels:0
pubsub_patterns:0
pubsubshard_channels:0
latest_fork_usec:732
total_forks:5272
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0
total_active_defrag_time:0
current_active_defrag_time:0
tracking_total_keys:0
tracking_total_items:0
tracking_total_prefixes:0
unexpected_error_replies:0
total_error_replies:7
dump_payload_sanitizations:0
total_reads_processed:5411013
total_writes_processed:5410491
io_threaded_reads_processed:0
io_threaded_writes_processed:0
reply_buffer_shrinks:8665
reply_buffer_expands:8442
eventloop_cycles:20543014
eventloop_duration_sum:4680801201
eventloop_duration_cmd_sum:78258250
instantaneous_eventloop_cycles_per_sec:11
instantaneous_eventloop_duration_usec:175
acl_access_denied_auth:0
acl_access_denied_cmd:0
acl_access_denied_key:0
acl_access_denied_channel:0

# Replication
role:master
connected_slaves:0
master_failover_state:no-failover
master_replid:c0659ec6f47274cc82baf25f4411dbe40e2b8263
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:0
second_repl_offset:-1
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0

# CPU
used_cpu_sys:2012.975291
used_cpu_user:2544.071009
used_cpu_sys_children:21.285451
used_cpu_user_children:32.497599
used_cpu_sys_main_thread:2012.578427
used_cpu_user_main_thread:2543.784237

# Modules

# Errorstats
errorstat_NOAUTH:count=6
errorstat_NOSCRIPT:count=1

# Cluster
cluster_enabled:0

# Keyspace
db0:keys=2005,expires=2000,avg_ttl=86398957318

@mperham
Collaborator

mperham commented Feb 10, 2024

I've noticed that no one posted a backtrace. Can anyone supply a backtrace or is it swallowed in this case?

@dan-jensen

dan-jensen commented Feb 10, 2024

@mperham this is what our app reported to Sentry:

RedisClient::ReadTimeoutError
Waited 3.0 seconds

redis-client (0.19.1) lib/redis_client/ruby_connection/buffered_io.rb in block in fill_buffer at line 140
redis-client (0.19.1) lib/redis_client/ruby_connection/buffered_io.rb in loop at line 122
redis-client (0.19.1) lib/redis_client/ruby_connection/buffered_io.rb in fill_buffer at line 122
redis-client (0.19.1) lib/redis_client/ruby_connection/buffered_io.rb in ensure_remaining at line 114
redis-client (0.19.1) lib/redis_client/ruby_connection/buffered_io.rb in getbyte at line 85
redis-client (0.19.1) lib/redis_client/ruby_connection/resp3.rb in parse at line 113
redis-client (0.19.1) lib/redis_client/ruby_connection/resp3.rb in load at line 50
redis-client (0.19.1) lib/redis_client/ruby_connection.rb in block in read at line 96
redis-client (0.19.1) lib/redis_client/ruby_connection/buffered_io.rb in with_timeout at line 48
redis-client (0.19.1) lib/redis_client/ruby_connection.rb in read at line 96
redis-client (0.19.1) lib/redis_client/connection_mixin.rb in call at line 31
redis-client (0.19.1) lib/redis_client.rb in block (2 levels) in blocking_call at line 318
redis-client (0.19.1) lib/redis_client/middlewares.rb in call at line 16
redis-client (0.19.1) lib/redis_client.rb in block in blocking_call at line 317
redis-client (0.19.1) lib/redis_client.rb in ensure_connected at line 677
redis-client (0.19.1) lib/redis_client.rb in blocking_call at line 316
redis-client (0.19.1) lib/redis_client/decorator.rb in blocking_call at line 26
sidekiq (7.2.0) lib/sidekiq/fetch.rb in block in retrieve_work at line 47
sidekiq (7.2.0) lib/sidekiq/capsule.rb in block in redis at line 102
connection_pool (2.4.1) lib/connection_pool.rb in block (2 levels) in with at line 110
connection_pool (2.4.1) lib/connection_pool.rb in handle_interrupt at line 109
connection_pool (2.4.1) lib/connection_pool.rb in block in with at line 109
connection_pool (2.4.1) lib/connection_pool.rb in handle_interrupt at line 106
connection_pool (2.4.1) lib/connection_pool.rb in with at line 106
sidekiq (7.2.0) lib/sidekiq/capsule.rb in redis at line 99
sidekiq (7.2.0) lib/sidekiq/component.rb in redis at line 28
sidekiq (7.2.0) lib/sidekiq/fetch.rb in retrieve_work at line 47
sidekiq (7.2.0) lib/sidekiq/processor.rb in get_one at line 87
sidekiq (7.2.0) lib/sidekiq/processor.rb in fetch at line 99
sidekiq (7.2.0) lib/sidekiq/processor.rb in process_one at line 81
sidekiq (7.2.0) lib/sidekiq/processor.rb in run at line 72
sidekiq (7.2.0) lib/sidekiq/component.rb in watchdog at line 10
sidekiq (7.2.0) lib/sidekiq/component.rb in block in safe_thread at line 19

@mperham
Collaborator

mperham commented Feb 10, 2024

Ugh, that looks like a bog standard read timeout from an overloaded Redis. I can't see any other hints about what's wrong. I'm still flummoxed.

@le0pard
Contributor

le0pard commented Feb 17, 2024

RedisClient::ReadTimeoutError: Unknown Error (hiredis_read) (RedisClient::ReadTimeoutError)
  from hiredis-client (0.20.0) lib/redis_client/hiredis_connection.rb:95:in `_read'
  from hiredis-client (0.20.0) lib/redis_client/hiredis_connection.rb:95:in `read'
  from redis-client (0.20.0) lib/redis_client/connection_mixin.rb:31:in `call'
  from redis-client (0.20.0) lib/redis_client.rb:318:in `block (2 levels) in blocking_call'
  from redis-client (0.20.0) lib/redis_client/middlewares.rb:16:in `call'
  from redis-client (0.20.0) lib/redis_client.rb:317:in `block in blocking_call'
  from redis-client (0.20.0) lib/redis_client.rb:677:in `ensure_connected'
  from redis-client (0.20.0) lib/redis_client.rb:316:in `blocking_call'
  from redis-client (0.20.0) lib/redis_client/decorator.rb:26:in `blocking_call'
  from sidekiq (7.2.1) lib/sidekiq/fetch.rb:47:in `block in retrieve_work'
  from sidekiq (7.2.1) lib/sidekiq/capsule.rb:102:in `block in redis'
  from connection_pool (2.4.1) lib/connection_pool.rb:110:in `block (2 levels) in with'
  from connection_pool (2.4.1) lib/connection_pool.rb:109:in `handle_interrupt'
  from connection_pool (2.4.1) lib/connection_pool.rb:109:in `block in with'
  from connection_pool (2.4.1) lib/connection_pool.rb:106:in `handle_interrupt'
  from connection_pool (2.4.1) lib/connection_pool.rb:106:in `with'
  from sidekiq (7.2.1) lib/sidekiq/capsule.rb:99:in `redis'
  from sidekiq (7.2.1) lib/sidekiq/component.rb:28:in `redis'
  from sidekiq (7.2.1) lib/sidekiq/fetch.rb:47:in `retrieve_work'
  from sidekiq (7.2.1) lib/sidekiq/processor.rb:87:in `get_one'
  from sidekiq (7.2.1) lib/sidekiq/processor.rb:99:in `fetch'
  from sidekiq (7.2.1) lib/sidekiq/processor.rb:81:in `process_one'
  from sidekiq (7.2.1) lib/sidekiq/processor.rb:72:in `run'
  from sidekiq (7.2.1) lib/sidekiq/component.rb:10:in `watchdog'
  from sidekiq (7.2.1) lib/sidekiq/component.rb:19:in `block in safe_thread'

We have been using Sidekiq in the same environment for a long time, and this issue started only after updating to Sidekiq 7. Maybe it is the new redis-client gem (we had no such issues with redis.rb), maybe it is something related to the new capsules feature, or maybe the new Redis protocol used by redis-client behaves worse than the old one... It is not clear where the issue is. Maybe making the Redis client "pluggable" in Sidekiq (i.e. letting you choose between redis-client and redis.rb) would help narrow down where it comes from.

Right now, it looks like the best workaround for this issue is not to update to Sidekiq 7.

[screenshot]

  • Redis is not overloaded, and neither is Sidekiq (CPU load for both is under 10%)
  • Redis 7.0.14
  • No TLS connection
  • Latest stable OpenSSL, Ruby, Rails, and redis-client gem

@netwire88

Any update on this? We can confirm this started happening as well after upgrading to Sidekiq 7 / Redis 7. We are using Redis Cloud by Redis Labs.

@mperham
Collaborator

mperham commented Apr 10, 2024

@casperisfine do you have any advice for tracking down random read timeouts in redis-client?

@casperisfine
Contributor

Not particularly, no. You have to experience it yourself and know TCP / SSL etc. to poke at it and figure out what's wrong. It's not really possible to debug these things by proxy.

What's really not clear to me is why it would have started with Sidekiq 7. There was of course a bunch of change in the Redis client, but fundamentally they don't work very differently. Timeouts are unfortunately a common thing, especially on public clouds, so I wonder if it's not just a reporting difference. Were previous versions of Sidekiq reporting these errors or just swallowing them and retrying?

Also, redis-rb retries commands by default, but IIRC Sidekiq configures redis-client with 1 retry too, right?

Again, it's super hard to tell from reports like this alone, and I don't see any common point across them. I see some TLS, some no-TLS, some hiredis-client, some Ruby driver, so that's pretty much all the code paths in redis-client. I often write bugs, but I have a hard time believing all of these configurations are similarly bugged.
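
For anyone who wants to poke at the retry knob mentioned above outside of Sidekiq, here is a minimal standalone redis-client sketch (the URL and retry count are illustrative assumptions, not what Sidekiq configures internally):

require "redis-client"

# After a connection error, reconnect and retry the command once before
# surfacing the exception to the caller.
config = RedisClient.config(
  url: ENV.fetch("REDIS_URL", "redis://127.0.0.1:6379"),
  reconnect_attempts: 1
)

client = config.new_client
client.call("PING") # => "PONG" when the server is reachable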

@mperham
Collaborator

mperham commented Apr 10, 2024

Good questions.

I've also seen compatibility issues with OpenSSL 3 and Ruby 3, along with actual Redis server bugs with OpenSSL 3. Make sure you're upgrading those stack bits to get network fixes. These errors are "below" Sidekiq in the stack and sadly I don't really have any insight into what's causing the issue.

@kikikiblue

I also have the same problem. My app is on Heroku. Versions are as follows:

ruby 3.3.1
rails 6.1.7.7
sidekiq 7.2.4
redis (5.2.0)
      redis-client (>= 0.22.0)
    redis-client (0.22.1)
      connection_pool

Heroku Data for Redis VERSION 7.0.15
openssl version on heroku-22 stack : openssl: '3.0.10+quic'

This is the backtrace

/app/vendor/bundle/ruby/3.3.0/gems/redis-client-0.22.1/lib/redis_client/ruby_connection/buffered_io.rb:214:in `block in fill_buffer'
<internal:kernel>:187:in `loop'
/app/vendor/bundle/ruby/3.3.0/gems/redis-client-0.22.1/lib/redis_client/ruby_connection/buffered_io.rb:197:in `fill_buffer'
/app/vendor/bundle/ruby/3.3.0/gems/redis-client-0.22.1/lib/redis_client/ruby_connection/buffered_io.rb:187:in `ensure_remaining'
/app/vendor/bundle/ruby/3.3.0/gems/redis-client-0.22.1/lib/redis_client/ruby_connection/buffered_io.rb:152:in `getbyte'
/app/vendor/bundle/ruby/3.3.0/gems/redis-client-0.22.1/lib/redis_client/ruby_connection/resp3.rb:113:in `parse'
/app/vendor/bundle/ruby/3.3.0/gems/redis-client-0.22.1/lib/redis_client/ruby_connection/resp3.rb:50:in `load'
/app/vendor/bundle/ruby/3.3.0/gems/redis-client-0.22.1/lib/redis_client/ruby_connection.rb:98:in `block in read'
/app/vendor/bundle/ruby/3.3.0/gems/redis-client-0.22.1/lib/redis_client/ruby_connection/buffered_io.rb:114:in `with_timeout'
/app/vendor/bundle/ruby/3.3.0/gems/redis-client-0.22.1/lib/redis_client/ruby_connection.rb:98:in `read'
/app/vendor/bundle/ruby/3.3.0/gems/redis-client-0.22.1/lib/redis_client/connection_mixin.rb:31:in `call'
/app/vendor/bundle/ruby/3.3.0/gems/redis-client-0.22.1/lib/redis_client.rb:340:in `block (2 levels) in blocking_call'
/app/vendor/bundle/ruby/3.3.0/gems/redis-client-0.22.1/lib/redis_client/middlewares.rb:16:in `call'
/app/vendor/bundle/ruby/3.3.0/gems/redis-client-0.22.1/lib/redis_client.rb:339:in `block in blocking_call'
/app/vendor/bundle/ruby/3.3.0/gems/redis-client-0.22.1/lib/redis_client.rb:699:in `ensure_connected'
/app/vendor/bundle/ruby/3.3.0/gems/redis-client-0.22.1/lib/redis_client.rb:338:in `blocking_call'
/app/vendor/bundle/ruby/3.3.0/gems/redis-client-0.22.1/lib/redis_client/decorator.rb:26:in `blocking_call'
/app/vendor/bundle/ruby/3.3.0/gems/sidekiq-7.2.4/lib/sidekiq/fetch.rb:47:in `block in retrieve_work'
/app/vendor/bundle/ruby/3.3.0/gems/sidekiq-7.2.4/lib/sidekiq/capsule.rb:102:in `block in redis'
/app/vendor/bundle/ruby/3.3.0/gems/connection_pool-2.4.1/lib/connection_pool.rb:110:in `block (2 levels) in with'
/app/vendor/bundle/ruby/3.3.0/gems/connection_pool-2.4.1/lib/connection_pool.rb:109:in `handle_interrupt'
/app/vendor/bundle/ruby/3.3.0/gems/connection_pool-2.4.1/lib/connection_pool.rb:109:in `block in with'
/app/vendor/bundle/ruby/3.3.0/gems/connection_pool-2.4.1/lib/connection_pool.rb:106:in `handle_interrupt'
/app/vendor/bundle/ruby/3.3.0/gems/connection_pool-2.4.1/lib/connection_pool.rb:106:in `with'
/app/vendor/bundle/ruby/3.3.0/gems/sidekiq-7.2.4/lib/sidekiq/capsule.rb:99:in `redis'
/app/vendor/bundle/ruby/3.3.0/gems/sidekiq-7.2.4/lib/sidekiq/component.rb:28:in `redis'
/app/vendor/bundle/ruby/3.3.0/gems/sidekiq-7.2.4/lib/sidekiq/fetch.rb:47:in `retrieve_work'
/app/vendor/bundle/ruby/3.3.0/gems/sidekiq-7.2.4/lib/sidekiq/processor.rb:87:in `get_one'
/app/vendor/bundle/ruby/3.3.0/gems/sidekiq-7.2.4/lib/sidekiq/processor.rb:99:in `fetch'
/app/vendor/bundle/ruby/3.3.0/gems/sidekiq-7.2.4/lib/sidekiq/processor.rb:81:in `process_one'
/app/vendor/bundle/ruby/3.3.0/gems/sidekiq-7.2.4/lib/sidekiq/processor.rb:72:in `run'
/app/vendor/bundle/ruby/3.3.0/gems/sidekiq-7.2.4/lib/sidekiq/component.rb:10:in `watchdog'
/app/vendor/bundle/ruby/3.3.0/gems/sidekiq-7.2.4/lib/sidekiq/component.rb:19:in `block in safe_thread'

but connection error messages was repeated so I downgraded to Sidekiq 6.5.5.
it looks like it's been resolved.

I am sorry for not providing any clues for the solution.
I appreciate your hard work and look forward to a solution.

@axlekb

axlekb commented May 9, 2024

Not sure how much this will help, but I figured I would add some details that might give someone some clues.

This week I upgraded 3 small Heroku apps to Sidekiq 7.2.4 using Redis Cloud 30MB and 100MB with Redis 7.2.3. All are Ruby 3.3.1, Rails 7.1.3.2. All use 1 worker dyno each with 5 concurrency.

App 1:

  • High volume of Sidekiq Jobs triggered by ecommerce webhooks. 100MB Redis Cloud. 8 sidekiq queues
  • Zero RedisTimeoutError encountered

App 2:

  • Staging for App 1. Single user, mostly. 30MB Redis Cloud. Scheduler kicks off Sidekiq jobs that run for ~10 seconds every 10 minutes. 8 sidekiq queues
  • Infrequent RedisTimeoutError: a few every couple hours. Timeouts do not seem correlated to scheduler frequency.
  • Recent Timeouts: 8:30a, 11:22p, 6:28p, 6:28p, 3:11p, 3:11p, 3:11p, 3:11p, 1:58p, 12:39p, 12:39p, 8:39a

App 3:

  • Different app, different Gemfile, but 80% the same. 30MB Redis Cloud. High usage 9-5. Jobs mainly triggered by user actions: ~ a job every minute or so. 5 Sidekiq queues.
  • Frequent RedisTimeoutError, a few per hour. More frequent during low throughput.
  • Recent Timeouts (all am): 10:05, 10:03, 8:18, 8:17, 7:27, 7:26, 7:26, 7:26, 7:12, 6:36, 6:36, 5:43, 5:43, 5:43, 5:40, 5:40, 5:40, 5:40, 5:40, 5:22, 5:22, 5:22, 5:15, 5:09

@casperisfine
Contributor

@mperham I don't remember if we discussed this before.

But it just hit me that many of these reports might simply be due to redis-client's default timeout being much stricter than the old redis 4.x gem's. It used to be 5 seconds (which is a bit bonkers); for redis-client it's just 1 second.

That generous timeout probably papered over lots of transient issues with various cloud providers, especially slow ones.

Maybe it would make sense to set a higher default timeout in Sidekiq, since it's much less latency-sensitive than a web server.

@mperham
Collaborator

mperham commented May 27, 2024

@casperisfine yeah, especially if the CPU is saturated and only switching threads every 100ms. I will get this into Sidekiq 7.3.

@pmichaeljones
Contributor

Hi Mike. Do you have a timeframe for when you'll be pushing 7.3 to RubyGems?

@mperham
Collaborator

mperham commented Jun 4, 2024

@pmichaeljones You can pick up the change in 7.x with this:

Sidekiq.configure_server do |cfg|
  cfg.redis = { timeout: 3, ... }
end

The 7.3 milestone is in the issues.

@axlekb

axlekb commented Jun 4, 2024

Forgot to update: Upgraded to redis-client 0.22.2 and the timeouts on the free 30MB Redis Cloud plan went away.

Update: I upgraded from Redis Cloud 30MB ($0) to Redis Cloud 100MB ($10) and the timeouts went away. I do have a ticket open with Redis Cloud regarding this and will report back when it reaches a resolution.

@stanhu
Contributor

stanhu commented Jun 4, 2024

Upgraded to redis-client 0.22.2 and the timeouts on the free 30MB Redis Cloud plan went away.

That's great, thanks. Based on redis-rb/redis-client#197 (comment) this suggests that increasing the BRPOP timeout from 3 to 4 seconds might be making a difference. With redis-client 0.22.1, the timeout is 3 because of conn.read_timeout + TIMEOUT (1 + 2), but with redis-client 0.22.2 conn.read_timeout is automatically added, so the timeout is now 4 seconds.

This might be giving credence to the timeout theory in #6162 (comment) and 1c2a37d.

With Sidekiq 7.3, I believe the timeout will actually now be 3 + 3 + 2 = 8, so I think we'd want to drop the extra conn.read_timeout.

stanhu added a commit to stanhu/sidekiq that referenced this issue Jun 4, 2024
redis-client v0.22.2 automatically adds `read_timeout` to the socket
read timeout, so we no longer need to manually add this. Prior this
commit, the total socket read timeout crept up to 8 seconds:

```
conn.read_timeout + conn.read_timeout + TIMEOUT = 3 + 3 + 2 = 8
```

Now it is 5 seconds:

```
conn.read_timeout + TIMEOUT = 3 + 2 = 5
```

This relates to sidekiq#6162.
mperham pushed a commit that referenced this issue Jun 4, 2024
@pmichaeljones
Contributor

pmichaeljones commented Jun 7, 2024

Upgraded to redis-client 0.22.2 from redis-client 0.21.0 and I haven't seen any RedisClient::ReadTimeoutError: Waited 3.0 seconds errors during the last 24 hours. Previously there were at least 3-5 per day. 🤞

@rafaelbiriba

Just to add more info:
I am using redis-client 0.22.1 and I still see RedisClient::ReadTimeoutError (3 seconds) at least 2 times per day.

@kikikiblue

I upgraded to redis-client 0.22.2 five days ago and it looks solved.
Thanks!!

@pmichaeljones
Contributor

It's been 3 days since I upgraded to redis-client 0.22.2 and I haven't had another error since. Looks promising!

@netwire88

netwire88 commented Jun 12, 2024

Unfortunately even after upgrading to redis-client 0.22.2 + sidekiq 7.2.4 on Redis Cloud 1GB running Redis Stack 7.2.3, it's not working. We are now getting RedisClient::ReadTimeoutError: Waited 4.0 seconds

@rafaelbiriba

Same here!
With redis-client 0.22.2 + sidekiq 7.2.4 running Redis on Heroku, we are getting RedisClient::ReadTimeoutError: Waited 4.0 seconds.

@jeremyhaile

jeremyhaile commented Jun 14, 2024

It is happening to us as well: RedisClient::ReadTimeoutError: Waited 3.0 seconds. We are using Heroku Data for Redis.

I also contacted Heroku support and was sent here. We're using sidekiq 7.2.4, redis-client 0.22.1. Our heroku app is on the heroku-22 stack.

@jeremyhaile

@mperham here is our raw stack trace:

RedisClient::ReadTimeoutError: Waited 3.0 seconds (RedisClient::ReadTimeoutError)
  from redis_client/ruby_connection/buffered_io.rb:214:in `block in fill_buffer'
  from redis_client/ruby_connection/buffered_io.rb:197:in `loop'
  from redis_client/ruby_connection/buffered_io.rb:197:in `fill_buffer'
  from redis_client/ruby_connection/buffered_io.rb:187:in `ensure_remaining'
  from redis_client/ruby_connection/buffered_io.rb:152:in `getbyte'
  from redis_client/ruby_connection/resp3.rb:113:in `parse'
  from redis_client/ruby_connection/resp3.rb:50:in `load'
  from redis_client/ruby_connection.rb:98:in `block in read'
  from redis_client/ruby_connection/buffered_io.rb:114:in `with_timeout'
  from redis_client/ruby_connection.rb:98:in `read'
  from redis_client/connection_mixin.rb:31:in `call'
  from redis_client.rb:340:in `block (2 levels) in blocking_call'
  from redis_client/middlewares.rb:16:in `call'
  from redis_client.rb:339:in `block in blocking_call'
  from redis_client.rb:699:in `ensure_connected'
  from redis_client.rb:338:in `blocking_call'
  from redis_client/decorator.rb:26:in `blocking_call'
  from sidekiq/fetch.rb:47:in `block in retrieve_work'
  from sidekiq/capsule.rb:102:in `block in redis'
  from connection_pool.rb:110:in `block (2 levels) in with'
  from connection_pool.rb:109:in `handle_interrupt'
  from connection_pool.rb:109:in `block in with'
  from connection_pool.rb:106:in `handle_interrupt'
  from connection_pool.rb:106:in `with'
  from sidekiq/capsule.rb:99:in `redis'
  from sidekiq/component.rb:28:in `redis'
  from sidekiq/fetch.rb:47:in `retrieve_work'
  from sidekiq/processor.rb:87:in `get_one'
  from sidekiq/processor.rb:99:in `fetch'
  from sidekiq/processor.rb:81:in `process_one'
  from sidekiq/processor.rb:72:in `run'
  from sidekiq/component.rb:10:in `watchdog'
  from sidekiq/component.rb:19:in `block in safe_thread'

@mperham
Collaborator

mperham commented Jun 14, 2024

@jeremyhaile what's your process concurrency? The only other cause of timeouts could be if your CPUs are way overloaded. Reduce your concurrency.

@jeremyhaile

< 100 connections (from several servers/pools). Low activity. Yeah, I'm not sure what's causing it. Perhaps we need to move off of Heroku. Or perhaps there's a bug somewhere...

@kikikiblue

kikikiblue commented Jun 15, 2024

@jeremyhaile Your environment is exactly the same as mine, except your redis-client version is 0.22.1.
How about upgrading your redis-client?

redis-client (0.22.2)
sidekiq (7.2.4)
heroku-22 stack
heroku data for redis

And my sidekiq.rb Redis config is like this:
{url: "#{ENV["REDIS_TLS_URL"]}/0", ssl_params: { verify_mode: OpenSSL::SSL::VERIFY_NONE }, timeout: 3}

I upgraded redis-client on Jun 5th and the timeout error hasn't appeared since.

@jeremyhaile

@kikikiblue thanks for the idea. We updated to 0.22.2 yesterday which seems like it also increases the default read timeout to 4s. Unfortunately we've seen the error 3 times today already, with a 4s timeout error now. Our load is almost nothing today too, as most of our load is during weekdays.

@le0pard
Contributor

le0pard commented Jun 16, 2024

In my case, no version update of the sidekiq or redis-client gems helped. What helped was changing the config:

SIDEKIQ_REDIS_OPTIONS = {
  network_timeout: 3,
  pool_timeout: 2,
  reconnect_attempts: [0.1, 0.25, 0.75], # I think this is the major change that removed the RedisTimeoutError errors
  size: Integer(ENV.fetch('RAILS_MAX_THREADS', 10))
}.freeze

Sidekiq.configure_client do |config|
  config.redis = SIDEKIQ_REDIS_OPTIONS
end

Sidekiq.configure_server do |config|
  config.redis = SIDEKIQ_REDIS_OPTIONS
end

@stanhu
Contributor

stanhu commented Jun 18, 2024

Oh, right, as mentioned above: For those of you having trouble, can you see if increasing the connection pool size helps? Perhaps you're running into #5929 (comment)

Also, do your jobs call Sidekiq.redis?
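
For clarity, the kind of pattern I mean is a job that borrows a connection from the pool inside perform, e.g. this hypothetical job:

class TouchLastSeenJob
  include Sidekiq::Job

  def perform(user_id)
    # Checks a connection out of Sidekiq's pool for the duration of the block;
    # anything slow inside the block keeps that connection checked out.
    Sidekiq.redis do |conn|
      conn.call("SET", "user:#{user_id}:last_seen", Time.now.to_i)
    end
  end
end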

@rafaelbiriba

For reference: running redis-client 0.22.2 + sidekiq 7.2.4 gives me the RedisClient::ReadTimeoutError 3-4x more often than before with 0.22.1
(but it could be a coincidence).

@le0pard can you confirm if your suggestion does improve the situation?

@le0pard
Contributor

le0pard commented Jun 18, 2024

@stanhu nope, my code does not use Sidekiq.redis. I have a separate connection for that.

@rafaelbiriba I can repeat it, in case that was not clear from my first message: the errors disappeared only after I arrived at these Redis settings in Sidekiq (both server and client); no version updates helped with the errors.

Now (90-day range in our error system, filtered by Redis errors):

[screenshot]

The situation 4 months ago:

[screenshot]

@stanhu
Contributor

stanhu commented Jun 18, 2024

the errors disappeared only after I arrived at these Redis settings in Sidekiq (both server and client); no version updates helped with the errors

To be clear, the default concurrency value is 5, and you've doubled the Redis pool size from 5 to 10, correct? This suggests to me that for some reason there aren't enough connections in the pool for those threads. It would be nice to understand why that's the case because that would explain why #6158 might be needed.

As I mentioned in #5929 (comment):

  • In Sidekiq 6, there was one connection pool of size concurrency + 5.
  • In Sidekiq 7, a capsule has its own connection pool of size concurrency, and there is an internal Sidekiq connection pool with a fixed size of 10.

By increasing the pool size to 10, I think you're effectively doing what Sidekiq 6 did before.
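
If anyone wants to experiment, here's a minimal sketch of that "Sidekiq 6 style" sizing, assuming the size: option is passed through to the capsule pool as in le0pard's config above (numbers are illustrative):

concurrency = Integer(ENV.fetch("SIDEKIQ_CONCURRENCY", 5))

Sidekiq.configure_server do |config|
  config.concurrency = concurrency
  # Roughly mirrors Sidekiq 6's pool sizing of concurrency + 5.
  config.redis = { size: concurrency + 5 }
end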

@le0pard
Contributor

le0pard commented Jun 18, 2024

Increasing the pool alone did not help; what also helped was setting reconnect_attempts with an array. The pool size was just the first suggestion I tried at the beginning, but on its own it did not remove the errors.
