Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix cluster consistency-check test #7754

Merged

Conversation

oranagra
Copy link
Member

@oranagra oranagra commented Sep 7, 2020

This test was failing from time to time see discussion at the bottom of #7635
This was probably due to timing, the DEBUG SLEEP executed by redis-cli
didn't sleep for enough time.

This commit changes:

  1. use SET-ACTIVE-EXPIRE instead of DEBUG SLEEP
  2. reduce many after sleeps with retry loops to speed up the test.
  3. add many comment explaining the different steps of the test and
    it's purpose.
  4. config appendonly before populating the volatile keys, so that they'll
    be part of the AOF command stream rather than the preamble RDB portion.

other complications: recently kill_instance switched from SIGKILL to
SIGTERM, and this would sometimes fail since there was an AOFRW running
in the background. now we wait for it to end before attempting the kill.

This test was failing from time to time see discussion at the bottom of redis#7635
This was probably due to timing, the DEBUG SLEEP executed by redis-cli
didn't sleep for enough time.

This commit changes:
1) use SET-ACTIVE-EXPIRE instead of DEBUG SLEEP
2) reduce many `after` sleeps with retry loops to speed up the test.
3) add many comment explaining the different steps of the test and
   it's purpose.

other complications: recently kill_instance switched from SIGKILL to
SIGTERM, and this would sometimes fail since there was an AOFRW running
in the background. now we wait for it to end before attempting the kill.
@trevor211
Copy link
Collaborator

While the fix LGTM, I got this error when I run the test on your branch:

Testing unit: 14-consistency-check.tcl
17:08:22> (init) Restart killed instances: OK
17:08:22> Cluster nodes are reachable: OK
17:08:22> Cluster nodes hard reset: OK
17:08:22> Cluster Join and auto-discovery test: OK
17:08:26> Before slots allocation, all nodes report cluster failure: OK
17:08:26> Create a 5 nodes cluster: OK
17:08:30> Cluster should start ok: OK
17:08:30> Cluster is writable: OK
17:08:30> Slave expired keys is loaded when restarted: appendonly=no: OK
17:08:33> Slave expired keys is loaded when restarted: appendonly=yes: OK
Cleaning up...
killing stale instance 19845
killing stale instance 19850
killing stale instance 19855
killing stale instance 19860
killing stale instance 19865
killing stale instance 19870
killing stale instance 19875
killing stale instance 19880
killing stale instance 19885
killing stale instance 19895
killing stale instance 19900
killing stale instance 19905
killing stale instance 19910
killing stale instance 19915
killing stale instance 19920
killing stale instance 19925
killing stale instance 19930
killing stale instance 19935
killing stale instance 19940
killing stale instance 19973
no files matched glob pattern "*/err.txt"
    while executing
"glob */err.txt"
    (procedure "log_crashes" line 19)
    invoked from within
"log_crashes"
    (procedure "cleanup" line 7)
    invoked from within
"cleanup"
    (procedure "main" line 8)
    invoked from within
"main"
Cleaning up...
killing stale instance 19845
killing stale instance 19850
killing stale instance 19855
killing stale instance 19860
killing stale instance 19865
killing stale instance 19870
killing stale instance 19875
killing stale instance 19880
killing stale instance 19885
killing stale instance 19895
killing stale instance 19900
killing stale instance 19905
killing stale instance 19910
killing stale instance 19915
killing stale instance 19920
killing stale instance 19925
killing stale instance 19930
killing stale instance 19935
killing stale instance 19940
killing stale instance 19973
no files matched glob pattern "*/err.txt"
    while executing
"glob */err.txt"
    (procedure "log_crashes" line 19)
    invoked from within
"log_crashes"
    (procedure "cleanup" line 7)
    invoked from within
"cleanup"
    invoked from within
"if {[catch main e]} {
    puts $::errorInfo
    if {$::pause_on_error} pause_on_error
    cleanup
    exit 1
}"
    (file "tests/cluster/run.tcl" line 24)

@oranagra oranagra closed this Sep 7, 2020
@oranagra oranagra reopened this Sep 7, 2020
@oranagra
Copy link
Member Author

oranagra commented Sep 7, 2020

@trevor211 sorry, that's a small fuckup i merged yesterday.. fix is in #7752.

@oranagra oranagra merged commit b491d47 into redis:unstable Sep 7, 2020
@oranagra oranagra deleted the fix_cluster_consistency_check_test branch September 7, 2020 15:06
oranagra added a commit to oranagra/redis that referenced this pull request Sep 9, 2020
This test was failing from time to time see discussion at the bottom of redis#7635
This was probably due to timing, the DEBUG SLEEP executed by redis-cli
didn't sleep for enough time.

This commit changes:
1) use SET-ACTIVE-EXPIRE instead of DEBUG SLEEP
2) reduce many `after` sleeps with retry loops to speed up the test.
3) add many comment explaining the different steps of the test and
   it's purpose.
4) config appendonly before populating the volatile keys, so that they'll
   be part of the AOF command stream rather than the preamble RDB portion.

other complications: recently kill_instance switched from SIGKILL to
SIGTERM, and this would sometimes fail since there was an AOFRW running
in the background. now we wait for it to end before attempting the kill.

(cherry picked from commit b491d47)
oranagra added a commit that referenced this pull request Sep 10, 2020
This test was failing from time to time see discussion at the bottom of #7635
This was probably due to timing, the DEBUG SLEEP executed by redis-cli
didn't sleep for enough time.

This commit changes:
1) use SET-ACTIVE-EXPIRE instead of DEBUG SLEEP
2) reduce many `after` sleeps with retry loops to speed up the test.
3) add many comment explaining the different steps of the test and
   it's purpose.
4) config appendonly before populating the volatile keys, so that they'll
   be part of the AOF command stream rather than the preamble RDB portion.

other complications: recently kill_instance switched from SIGKILL to
SIGTERM, and this would sometimes fail since there was an AOFRW running
in the background. now we wait for it to end before attempting the kill.

(cherry picked from commit b491d47)
JackieXie168 pushed a commit to JackieXie168/redis that referenced this pull request Nov 4, 2020
This test was failing from time to time see discussion at the bottom of redis#7635
This was probably due to timing, the DEBUG SLEEP executed by redis-cli
didn't sleep for enough time.

This commit changes:
1) use SET-ACTIVE-EXPIRE instead of DEBUG SLEEP
2) reduce many `after` sleeps with retry loops to speed up the test.
3) add many comment explaining the different steps of the test and
   it's purpose.
4) config appendonly before populating the volatile keys, so that they'll
   be part of the AOF command stream rather than the preamble RDB portion.

other complications: recently kill_instance switched from SIGKILL to
SIGTERM, and this would sometimes fail since there was an AOFRW running
in the background. now we wait for it to end before attempting the kill.
jschmieg pushed a commit to memKeyDB/memKeyDB that referenced this pull request Nov 6, 2020
This test was failing from time to time see discussion at the bottom of redis#7635
This was probably due to timing, the DEBUG SLEEP executed by redis-cli
didn't sleep for enough time.

This commit changes:
1) use SET-ACTIVE-EXPIRE instead of DEBUG SLEEP
2) reduce many `after` sleeps with retry loops to speed up the test.
3) add many comment explaining the different steps of the test and
   it's purpose.
4) config appendonly before populating the volatile keys, so that they'll
   be part of the AOF command stream rather than the preamble RDB portion.

other complications: recently kill_instance switched from SIGKILL to
SIGTERM, and this would sometimes fail since there was an AOFRW running
in the background. now we wait for it to end before attempting the kill.

(cherry picked from commit b491d47)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants