Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add update test for repair table script #2838

Merged
merged 1 commit into from
Jan 28, 2021

Conversation

mkindahl
Copy link
Contributor

@mkindahl mkindahl commented Jan 18, 2021

This commit creates an update repair test that breaks a few tables for
pre-2.0 versions to ensure that the repair script actually fixes them.
The integrity check for the update tests already contain a check that
dimension slices are valid, so there is no need to add a test for that.

In addition, the commit fixes two bugs in the repair scripts that could
prevent an update in rare circumstances.

For the 1.7.1--1.7.2 repair script: if there were several missing
dimension slices in different hypertables with the same column name,
the repair script would be confused on what contraint had what type and
generate an error.

For the 2.0.0-rc1--2.0.0-rc2 repair script: if a partition constraint
was broken, it would generate an error rather than repairing the
dimension slices because BIGINT_MIN would be cast to a double float
and then an attempt would be made to cast it back to bigint, causing
an overflow error.

Fixes #2824

@mkindahl mkindahl force-pushed the repair_script_tests branch 2 times, most recently from 25d220f to a705c96 Compare January 18, 2021 12:48
@codecov
Copy link

codecov bot commented Jan 18, 2021

Codecov Report

Merging #2838 (0ff29c3) into master (126f1c8) will increase coverage by 0.14%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2838      +/-   ##
==========================================
+ Coverage   90.07%   90.22%   +0.14%     
==========================================
  Files         212      212              
  Lines       34772    34725      -47     
==========================================
+ Hits        31322    31329       +7     
+ Misses       3450     3396      -54     
Impacted Files Coverage Δ
tsl/src/nodes/gapfill/planner.c 96.89% <ø> (-0.02%) ⬇️
src/plan_expand_hypertable.c 94.32% <100.00%> (+0.06%) ⬆️
src/plan_partialize.c 97.91% <100.00%> (+0.04%) ⬆️
src/planner.c 93.54% <100.00%> (ø)
tsl/src/nodes/decompress_chunk/decompress_chunk.c 94.05% <100.00%> (+0.50%) ⬆️
tsl/src/nodes/decompress_chunk/qual_pushdown.c 91.33% <100.00%> (+0.72%) ⬆️
src/import/planner.c 70.30% <0.00%> (+11.12%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0e86bbe...0ff29c3. Read the comment docs.

@mkindahl mkindahl force-pushed the repair_script_tests branch 7 times, most recently from d610732 to bb9de19 Compare January 20, 2021 11:26
@mkindahl mkindahl marked this pull request as ready for review January 20, 2021 11:39
@mkindahl mkindahl requested a review from a team as a code owner January 20, 2021 11:39
@mkindahl mkindahl requested review from pmwkaa, k-rus, svenklemm and erimatnor and removed request for a team January 20, 2021 11:39
@mkindahl mkindahl force-pushed the repair_script_tests branch 3 times, most recently from 1166dbf to 9be2f04 Compare January 21, 2021 08:55
Copy link
Member

@svenklemm svenklemm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont think its a good idea to drop constraints in the update test workflow. If you really need to drop them then your test should be completely separate from that workflow.

@mkindahl
Copy link
Contributor Author

I dont think its a good idea to drop constraints in the update test workflow. If you really need to drop them then your test should be completely separate from that workflow.

I am not sure what you're suggesting.

We want to test the repair that is being done as part of the update, so we should run ALTER EXTENSION to test that all works. To delete some dimension slices it is necessary to remove the foreign key constraint but it is not possible to re-add it until after the update is done.

It is possible to mirror the existing update test script with only this test (essentially doing it the same way) and create a bunch of new docker containers, but that would double the execution time with no obvious advantage.

Copy link
Contributor

@gayyappan gayyappan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good to me. I agree with Sven that it is better to have a separate test path for this outside of the usual update tests. The metadata fix is ideally a one off task and not needed across upgrades.

ORDER BY dimension_slice_id LIMIT 1
);

DELETE FROM _timescaledb_catalog.dimension_slice WHERE id IN (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggest removing the metadata for all the slices for one of the cases (worst case scenario).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -16,59 +16,66 @@ WITH
-- All dimension slices that are mentioned in the chunk_constraint
-- table but are missing from the dimension_slice table.
missing_slices AS (
SELECT dimension_slice_id,
SELECT ch.hypertable_id,
di.id as dimension_id,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest fixing indentation (probably a tabs vs spaces issue).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ran untabify on the files.

Copy link
Contributor

@erimatnor erimatnor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't look too much in detail at the test itself, but I have a high-level concern/comment.

If the previous update script didn't work for all cases, do we know that the update always failed? The case I am concerned about is a successful update without a full repair. In that case we need another repair for 2.0.0--2.0.1.

# We need to run the post repair script to make sure that the
# constraint is on the clean rerun as well since the setup script can
# remove it.
if [[ "${TEST_VERSION}" > "v6" ]] || [[ "${TEST_VERSION}" = "v6" ]]; then
Copy link
Contributor

@erimatnor erimatnor Jan 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the repair test be tied to specific TimescaleDB versions? I.e., we want to make sure that any TimescaleDB versions that proceed the repair update script will have the error, but no other. Is this handled by "v6"? I worry about the case when we also run repair tests for future upgrades where the issue no longer exists and the update script won't run (e.g., 2.0.1 -> 2.1.0)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check just make sure that we restore the constraint if it was dropped. The constraint is always dropped in v6 or later, hence needs to be restored in v6 and later. Inside the repair setup, there is a check that we are upgrading from a version before 2.0.0, which guarantees that the latest repair is done. If we are upgrading from 2.0.1 to 2.1.0, no dimension slices will be dropped, but the foreign key constraint still need to be restored.

This is because there is no easy way to check what we're upgrading from and to, only what version we are currently running as.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I follow completely, but it sounds like the repair test activates only in cases we are upgrading from a pre 2.0 release.

One concern I still have is that we now do a separate CI run for the repair, which means we run all update tests from 2.0.1 to future releases multiple times, since the repair doesn't happen there. It seems somewhat unnecessary to run these update tests again. Or am I missing something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, repair tests are only activated when upgrading from pre-2.0 to 2.0. The check has to be updated when merging with 1.7 branch since repair tests are only running on 1.7.1 to later.

No, you got it right. Right now we run all update tests with and without repair because there are repairs that happen during update. For branches where there is no repair, we need to disable these tests, but that is a later problem. For the time being, we will at least upgrade from 1.7.x branch, and there will always be a repair done until 1.7 EOL.

@mkindahl
Copy link
Contributor Author

mkindahl commented Jan 22, 2021

Didn't look too much in detail at the test itself, but I have a high-level concern/comment.

If the previous update script didn't work for all cases, do we know that the update always failed? The case I am concerned about is a successful update without a full repair. In that case we need another repair for 2.0.0--2.0.1.

Both of the repairs pick out all dimension slices that are missing, but handle them incorrectly and can then trigger an error while processing them (this is what the changes in the commit do), so it does not seem likely that it can have missed some dimension slices.

We do not have any "post-update" checks in the update scripts though: is this something that we need to add? Right now we only have integrity checks in the update test post checks.

@mkindahl mkindahl force-pushed the repair_script_tests branch 3 times, most recently from 650a4e7 to fc74199 Compare January 25, 2021 09:55
@mkindahl mkindahl force-pushed the repair_script_tests branch 4 times, most recently from 8cb5ea0 to d2ca58e Compare January 27, 2021 10:52
Copy link
Contributor

@erimatnor erimatnor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving, but have some concerns that we are running some update tests multiple times.

@@ -27,9 +30,9 @@ jobs:
- name: Checkout TimescaleDB
uses: actions/checkout@v2

- name: Update tests ${{ matrix.pg }}
- name: Update tests ${{ matrix.pg }} ${{ matrix.kind }}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- name: Update tests ${{ matrix.pg }} ${{ matrix.kind }}}
- name: Update tests ${{ matrix.pg }} ${{ matrix.kind }}

include:
- pg: 11.10
pg_major: 11
- pg: 12.5
pg_major: 12
- opt: "-r"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to run this for both PG versions? Maybe it is enough to just run with PG12?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are optimizations to do, but I would prefer to defer that to a separate PR.

# We need to run the post repair script to make sure that the
# constraint is on the clean rerun as well since the setup script can
# remove it.
if [[ "${TEST_VERSION}" > "v6" ]] || [[ "${TEST_VERSION}" = "v6" ]]; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I follow completely, but it sounds like the repair test activates only in cases we are upgrading from a pre 2.0 release.

One concern I still have is that we now do a separate CI run for the repair, which means we run all update tests from 2.0.1 to future releases multiple times, since the repair doesn't happen there. It seems somewhat unnecessary to run these update tests again. Or am I missing something?

runs-on: 'ubuntu-18.04'
strategy:
matrix:
pg: ["11.10","12.5"]
opt: ["", "-r"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think restricting the repair tests to upgrades between 1.7.xx -> 2.0 , 2.0.1 is sufficient and we don't need to run this for the entire matrix. I don't mind deferring that change to a later PR. But do think it is necessary so that update tests don't unnecessarily consume too much time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. I picked this as an easy solution for now. We can tune it over time and should probably also avoid running a full test when we try the repair.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also note that the actual breaking of the dimension slices only happen if you upgrade from pre-2.0. There are checks in the SQL files that control this.

Copy link
Contributor

@gayyappan gayyappan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good. I think this should have a follow up PR to restrict the repair tests to specific timescale version upgrades. It is not necessary to test the repair script against all TS versions to verify that it works as expected.

Copy link
Member

@svenklemm svenklemm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is supposed to operate on different versions i dont think this should piggyback on the existing versioning in the update tests. I would suggest removing the toplevel matrix option and instead start the repair tests in scripts/test_updates_pg1[12].sh for the versions where it is needed

@mkindahl
Copy link
Contributor Author

Since this is supposed to operate on different versions i dont think this should piggyback on the existing versioning in the update tests. I would suggest removing the toplevel matrix option and instead start the repair tests in scripts/test_updates_pg1[12].sh for the versions where it is needed

The repair runs on all updates from pre-2.0 updates to 2.0.x so even if the check is moved into the shell scripts, it will execute for all tests, with the exception of 2.0.0 to 2.0.1 (all others to 2.0.x). There are other optimizations that I think we should do to avoid duplicating tests, but I would prefer to take this as a separate PR since the extra test being executed is not a significant problem and we have changes that we want to have in 2.0.1.

The commit fixes two bugs in the repair scripts that could
prevent an update in rare circumstances.

For the 1.7.1--1.7.2 repair script: if there were several missing
dimension slices in different hypertables with the same column name,
the repair script would be confused on what constraint had what type
and generate an error.

For the 2.0.0-rc1--2.0.0-rc2 repair script: if a partition constraint
was broken, it would generate an error rather than repairing the
dimension slices because BIGINT_MIN would be cast to a double float and
then an attempt would be made to cast it back to bigint, causing an
overflow error.

This commit also creates an update repair test that breaks a few tables
for pre-2.0 versions to ensure that the repair script actually fixes
them.  The integrity check for the update tests already contain a check
that dimension slices are valid, so there is no need to add a test for
that.

This commit adds an extra dimension in the workflow to test updates
with repair and run that separately. It also changes the update test
scripts to by default run without repair tests and add the additional
option `-r` for running repair tests in addition to the normal tests.

Fixes timescale#2824
@mkindahl mkindahl merged commit c716325 into timescale:master Jan 28, 2021
@mkindahl mkindahl deleted the repair_script_tests branch January 28, 2021 14:04
svenklemm added a commit that referenced this pull request Jan 28, 2021
This maintenance release contains bugfixes since the 2.0.0 release. We deem it
high priority for upgrading.

In particular the fixes contained in this maintenance release address issues
in continuous aggregates, compression, JOINs with hypertables and when
upgrading from previous versions.

**Bugfixes**
* #2772 Always validate existing database and extension
* #2780 Fix config enum entries for remote data fetcher
* #2806 Add check for dropped chunk on update
* #2828 Improve cagg watermark caching
* #2838 Fix catalog repair in update script
* #2842 Do not mark job as started when setting next_start field
* #2845 Fix continuous aggregate privileges during upgrade
* #2851 Fix nested loop joins that involve compressed chunks
* #2860 Fix projection in ChunkAppend nodes
* #2861 Remove compression stat update from update script
* #2865 Apply volatile function quals at decompresschunk node
* #2866 Avoid partitionwise planning of partialize_agg
* #2868 Fix corruption in gapfill plan
* #2874 Fix partitionwise agg crash due to uninitialized memory

**Thanks**
* @alex88 for reporting an issue with joined hypertables
* @brian-from-quantrocket for reporting an issue with extension update and dropped chunks
* @dhodyn for reporting an issue when joining compressed chunks
* @markatosi for reporting a segfault with partitionwise aggregates enabled
* @PhilippJust for reporting an issue with add_job and initial_start
* @sgorsh for reporting an issue when using pgAdmin on windows
* @WarriorOfWire for reporting the bug with gapfill queries not being
  able to find pathkey item to sort
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add update test for repair dimension slices
4 participants