Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Limit lists for duplicate merging of cases #9054

Closed
10 tasks done
Tracked by #7734
StefanKock opened this issue May 3, 2022 · 8 comments · Fixed by #11459 or #11472
Closed
10 tasks done
Tracked by #7734

Limit lists for duplicate merging of cases #9054

StefanKock opened this issue May 3, 2022 · 8 comments · Fixed by #11459 or #11472
Assignees
Labels
backend Affects the web backend cases change A change of an existing feature (ticket type) performance Issues that are meant to increase the performance of the application qa-verified Issue has been tested and verified by QA vaadin-app Affects the Vaadin application

Comments

@StefanKock
Copy link
Contributor

StefanKock commented May 3, 2022

Problem Description

Data used for merging is currently only limited based on the report date filter, but still has to go through big parts of the available data.
See #8697 (comment)

Proposed Change

  • Limit CaseFacadeEjb.getCasesForDuplicateMerging method to fetch 100 entries.
  • When 100 entries are returned display a warning below the grid in the UI: "The returned data has been limited to < placeholder > entries. You can adjust the creation date filter or add other filter criteria to drill down on this."
  • In addition show a drop-down that allows the user to define the limit: 100, 500, 1000, none
  • Re-evaluate the performance of this with the database used in Analyse slow system behaviour when creating an event participant / trying a case merge [2] #8697. If the performance is still bad, provide an explain analyze via https://explain.dalibo.com/) If the performance is acceptable (max 10 seconds waiting time), the currently available warning popup shown when entering the merge view can be removed (including Strings.infoMergeFiltersHint)

Acceptance Criteria

Implementation Details

  • LEFT OUTER JOIN LOCATION location3_ ON person2_.address_id=location3_.id
  • LEFT OUTER JOIN Region region4_ ON case0_.responsibleRegion_id=region4_.id
  • LEFT OUTER JOIN Region region5_ ON case0_.region_id=region5_.id
  • LEFT OUTER JOIN users user7_ ON case0_.reportingUser_id=user7_.id
  • LEFT OUTER JOIN Facility facility8_ ON case0_.healthFacility_id=facility8_.id
  • LEFT OUTER JOIN Region region10_ ON case1_.responsibleRegion_id=region10_.id
  • LEFT OUTER JOIN Region region11_ ON case1_.region_id=region11_.id
  • get rid of distinct
  • introduce index to cases sort by creation date DESC

Additional Information

  • After merging a duplicate, the list will be re-loaded, potentially causing a long delay. If we don't manage to improve the performance of the query itself we would at least have to optimize this aspect.
@StefanKock StefanKock added contacts cases backend Affects the web backend change A change of an existing feature (ticket type) performance Issues that are meant to increase the performance of the application labels May 3, 2022
@MartinWahnschaffe MartinWahnschaffe changed the title Limit lists for duplicate merging Limit lists for duplicate merging or cases and contacts May 23, 2022
@markusmann-vg markusmann-vg changed the title Limit lists for duplicate merging or cases and contacts Limit lists for duplicate merging for cases and contacts May 25, 2022
@markusmann-vg markusmann-vg changed the title Limit lists for duplicate merging for cases and contacts Limit lists for duplicate merging for cases and contacts [2] May 25, 2022
@AndyBakcsy-she AndyBakcsy-she added this to Backlog in SORMAS Team 2 - DEV - Iteration Backlog via automation Aug 11, 2022
@sergiupacurariu sergiupacurariu self-assigned this Aug 22, 2022
@sergiupacurariu sergiupacurariu moved this from Backlog to In Progress in SORMAS Team 2 - DEV - Iteration Backlog Aug 22, 2022
@sergiupacurariu sergiupacurariu moved this from In Progress to Backlog in SORMAS Team 2 - DEV - Iteration Backlog Aug 26, 2022
@sergiupacurariu sergiupacurariu removed their assignment Aug 26, 2022
@MateStrysewske MateStrysewske self-assigned this Sep 1, 2022
@MateStrysewske MateStrysewske moved this from Backlog to In Progress in SORMAS Team 2 - DEV - Iteration Backlog Sep 1, 2022
@StefanKock
Copy link
Contributor Author

Investigated the failing query (no space left on disk). The problem is that this query makes a Case x Case crossjoin that results probably in > 4 billion lines for 65k cases.

Executed sql query
SELECT DISTINCT case0_.id AS col_0_0_,
                case1_.id AS col_1_0_,
                case0_.creationDate AS col_2_0_
FROM cases case0_
CROSS JOIN cases case1_
LEFT OUTER JOIN Person person2_ ON case0_.person_id=person2_.id
LEFT OUTER JOIN LOCATION location3_ ON person2_.address_id=location3_.id
LEFT OUTER JOIN Region region4_ ON case0_.responsibleRegion_id=region4_.id
LEFT OUTER JOIN Region region5_ ON case0_.region_id=region5_.id
LEFT OUTER JOIN Symptoms symptoms6_ ON case0_.symptoms_id=symptoms6_.id
LEFT OUTER JOIN users user7_ ON case0_.reportingUser_id=user7_.id
LEFT OUTER JOIN Facility facility8_ ON case0_.healthFacility_id=facility8_.id
LEFT OUTER JOIN Person person9_ ON case1_.person_id=person9_.id
LEFT OUTER JOIN Region region10_ ON case1_.responsibleRegion_id=region10_.id
LEFT OUTER JOIN Region region11_ ON case1_.region_id=region11_.id
LEFT OUTER JOIN Symptoms symptoms12_ ON case1_.symptoms_id=symptoms12_.id
WHERE case0_.deleted=FALSE
  AND case1_.deleted=FALSE
  AND case0_.creationDate>$1
  AND case0_.creationDate<$2
  AND case0_.deleted=$3
  AND (unaccent(user7_.firstName) ILIKE unaccent($4)=TRUE
       OR unaccent(user7_.lastName) ILIKE unaccent($5)=TRUE
       OR unaccent(user7_.userName) ILIKE unaccent($6)=TRUE)
  AND similarity(((person2_.firstName||$7)||person2_.lastName),((person9_.firstName||$8)||person9_.lastName))>0.65
  AND case0_.disease=case1_.disease
  AND region4_.id=region10_.id
  AND (region5_.id IS NULL
       OR region5_.id=region11_.id)
  AND abs(date_part($9, case0_.reportDate)-date_part($10, case1_.reportDate))<=2592000
  AND (person2_.sex IS NULL
       OR person9_.sex IS NULL
       OR person2_.sex=$11
       OR person9_.sex=$12
       OR person2_.sex=person9_.sex)
  AND (person2_.birthdate_dd IS NULL
       OR person2_.birthdate_mm IS NULL
       OR person2_.birthdate_yyyy IS NULL
       OR person9_.birthdate_dd IS NULL
       OR person9_.birthdate_mm IS NULL
       OR person9_.birthdate_yyyy IS NULL
       OR person2_.birthdate_dd=person9_.birthdate_dd
       AND person2_.birthdate_mm=person9_.birthdate_mm
       AND person2_.birthdate_yyyy=person9_.birthdate_yyyy)
  AND (symptoms6_.onsetDate IS NULL
       OR symptoms12_.onsetDate IS NULL
       OR abs(date_part($13, symptoms6_.onsetDate)-date_part($14, symptoms12_.onsetDate))<=2592000)
  AND (case0_.creationDate<case1_.creationDate
       OR case1_.creationDate<=$15
       OR case1_.creationDate>=$16)
  AND case0_.id<>case1_.id
ORDER BY case0_.creationDate DESC
LIMIT $17

Idea: Only run the query for one case in case0_ collection to analyse that query, and then test where the sweet spot is (probably less than ten cases in one query). Search for duplicates until the desired limit is reached or all cases where checked.

@MateStrysewske
Copy link
Contributor

Current status and some Explain Analyse links:

Getting rid of the person comparison conditions (i.e. name, sex, birth date) and instead comparing by person ID basically makes the query execute instantly with the performance database from ~ 2 months ago (85,000 cases) because it drastically reduces the subset of data that needs to be filtered. However, this is not an option until we have a dedicated person duplicate merging, and we would then probably run into performance problems while implementing that.

Here's the Explain Analyse of the current state of the query:

https://explain.dalibo.com/plan/PgR8

@MartinWahnschaffe and I had a closer look at it and have identified the potential problem that the limit is only applied at the end of the query, i.e. the whole subset of data is considered before, which is very problematic because we're doing a cross join. One reason could be that the query above adds some sorting in between. However, even getting rid of this sorting by adjusting the query according to the next Explain Analyse doesn't really improve this situation (even though it drastically reduces the size of the plan and definitely make sense to implement anyway):

https://explain.dalibo.com/plan/8oI

The proposal to only apply the whole query to a very small subset of cases is very likely not practical because it will result in a huge number of queries being executed in most scenarios where the limit is not reached quickly.

@StefanKock StefanKock moved this from In Progress to Waiting in SORMAS Team 2 - DEV - Iteration Backlog Oct 14, 2022
@StefanKock StefanKock added the discuss Discussion about whether or not the issue should or can be implemented required label Oct 14, 2022
@markusmann-vg markusmann-vg added this to Backlog in SORMAS Team 2 - DEV - Iteration Backlog via automation Jan 27, 2023
@markusmann-vg markusmann-vg removed the discuss Discussion about whether or not the issue should or can be implemented required label Jan 27, 2023
@markusmann-vg markusmann-vg changed the title Limit lists for duplicate merging for cases and contacts [2] INVESTIGATION: Limit lists for duplicate merging for cases and contacts Jan 27, 2023
MartinWahnschaffe added a commit that referenced this issue Feb 8, 2023
SORMAS Team 4 - TEC - Iteration Backlog automation moved this from Review to Testing Feb 8, 2023
@StefanKock StefanKock added this to the Iteration 2023-01 - 1.80.0 milestone Feb 9, 2023
@MartinWahnschaffe MartinWahnschaffe changed the title Limit lists for duplicate merging for cases Limit lists for duplicate merging of cases Feb 9, 2023
@StefanKock StefanKock added this to Backlog in SORMAS Team 2 - DEV - Iteration Backlog via automation Feb 9, 2023
@StefanKock StefanKock moved this from Backlog to Testing in SORMAS Team 2 - DEV - Iteration Backlog Feb 9, 2023
@MartinWahnschaffe
Copy link
Contributor

Note: I have checked the duplicate detection on https://test-de.sormas.netzlink.com. Before the update it took 1 minute or more to open the view. Now it's just a few seconds, so there seems to be a positive tendency :-)

@adinaflorea9
Copy link

I am reopening the issue to the the fact that an error appears when filtering across directories and then deleting the inputted information. Reported it in #11470 but it was observed that it might have been caused by this ticket.

@adinaflorea9 adinaflorea9 reopened this Feb 9, 2023
SORMAS Team 2 - DEV - Iteration Backlog automation moved this from Testing to In Progress Feb 9, 2023
SORMAS Team 4 - TEC - Iteration Backlog automation moved this from Testing to In Progress Feb 9, 2023
MartinWahnschaffe added a commit that referenced this issue Feb 9, 2023
MartinWahnschaffe added a commit that referenced this issue Feb 9, 2023
SORMAS Team 2 - DEV - Iteration Backlog automation moved this from In Progress to Testing Feb 9, 2023
SORMAS Team 4 - TEC - Iteration Backlog automation moved this from In Progress to Testing Feb 9, 2023
@abrudanancuta
Copy link
Contributor

Validated on test.de, test.de2. test.de1 with the version 1.80.0-SNAPSHOT (a118b4a)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend Affects the web backend cases change A change of an existing feature (ticket type) performance Issues that are meant to increase the performance of the application qa-verified Issue has been tested and verified by QA vaadin-app Affects the Vaadin application
Projects
None yet
7 participants