Refactoring for safer olm and megolm session usage #5380

BillCarsonFr · 2022-02-28T22:57:41Z

Rework following an investigation into strange rolling UISI errors.
Logs contained a lot of BAD_MAC errors on both olm and megolm.

Olm BAD_MAC errors can have almost any cause: using a wrong or desynchronized session, concurrent access, or simply false-positive logs produced while iterating over other known sessions (though a proliferation of sessions might itself be suspicious).

For megolm, the more probable causes are concurrent decryption and a desync between cache and store, as the current code was not very safe.

The code has been refactored to create specific classes that handle the session caches, moving this responsibility out of the realm crypto store. Access is guarded by the synchronized directive.
Concurrent decryption attempts are guarded by a mutex (one per olm/megolm session).
Access to the olm account is now protected against concurrent access.
Refactored the addInboundGroupSession logic (there were some crash reports of an NPE there, and the code could cause the cache to get out of sync).
Better logging.
EDIT: Cleaner unwedging with better logs. Unwedging is now done at the end of the sync, with some synchronisation added.
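
To illustrate the two guards described above (synchronized access to a session cache, plus one lock per session for decryption), here is a minimal sketch. It is not the code from this PR: `SessionHolder`, `SessionCache` and `loadFromStore` are hypothetical names, the "decryption" is a placeholder, and it uses a JDK `ReentrantLock` to stay dependency-free, whereas the commit list mentions a mutex on suspend functions.

```kotlin
import java.util.concurrent.locks.ReentrantLock
import kotlin.concurrent.withLock

// Hypothetical stand-in for a wrapped olm/megolm session: some ratchet state plus its own lock.
class SessionHolder(val sessionId: String) {
    private val lock = ReentrantLock()
    var messageIndex = 0
        private set

    // Placeholder decryption: the state update must never interleave between threads.
    fun decrypt(payload: String): String = lock.withLock {
        messageIndex++
        "plain($payload)#$messageIndex"
    }
}

// Hypothetical cache kept outside the persistent store; every access is synchronized.
class SessionCache(private val loadFromStore: (String) -> SessionHolder?) {
    private val cache = HashMap<String, SessionHolder>()

    // Returns the cached holder, loading it from the store at most once per session id.
    fun get(sessionId: String): SessionHolder? = synchronized(cache) {
        cache[sessionId] ?: loadFromStore(sessionId)?.also { cache[sessionId] = it }
    }
}
```

Because every caller receives the same `SessionHolder` instance from the cache, the per-session lock actually serializes decryption; handing out fresh copies loaded from the store would defeat it.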

Added a new E2E sanity test checking that the following work:

Proper key sharing to a set of users in an e2e room
Proper key backup restore from a new session
Proper key forwarding
Accept and replace existing session with a better forwarded one
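
The last item can be sketched as follows, under the assumption that a "better" forwarded session is one that can decrypt from an earlier message index. `InboundSession`, `InboundSessionStore` and `addOrReplace` are hypothetical names for illustration only, not the SDK's types.

```kotlin
// Hypothetical, simplified view of an inbound megolm session: only the fields needed here.
data class InboundSession(val sessionId: String, val firstKnownIndex: Int)

class InboundSessionStore {
    private val sessions = HashMap<String, InboundSession>()

    // Returns true when the candidate was stored: either it is new,
    // or it can decrypt from an earlier index than the session already held.
    @Synchronized
    fun addOrReplace(candidate: InboundSession): Boolean {
        val existing = sessions[candidate.sessionId]
        if (existing != null && existing.firstKnownIndex <= candidate.firstKnownIndex) {
            return false // keep the existing session, it covers at least as much history
        }
        sessions[candidate.sessionId] = candidate
        return true
    }

    @Synchronized
    fun get(sessionId: String): InboundSession? = sessions[sessionId]
}
```

Doing the comparison and the write under the same lock keeps two concurrent forwards from replacing a better session with a worse one.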

EDIT Resurrected unwedging test

github-actions · 2022-02-28T23:53:21Z

Unit Test Results

  96 files +  4   96 suites +4 1m 15s ⏱️ -5s
172 tests +10 172 ✔️ +10 0 💤 ±0 0 ❌ ±0
564 runs +40 564 ✔️ +40 0 💤 ±0 0 ❌ ±0

Results for commit 96b5174. ± Comparison against base commit bcdf004.

This pull request removes 2 and adds 12 tests. Note that renamed tests count towards both.

im.vector.app.features.onboarding.OnboardingViewModelTest ‑ given failure when handling display name updates action then emits failure event
im.vector.app.features.onboarding.OnboardingViewModelTest ‑ when handling display name updates action then updates user display name and emits name updated event

im.vector.app.features.onboarding.OnboardingViewModelTest ‑ given a selected picture when handling save selected profile picture then updates upstream avatar and completes personalization
im.vector.app.features.onboarding.OnboardingViewModelTest ‑ given no selected picture when saving selected profile picture then emits failure event
im.vector.app.features.onboarding.OnboardingViewModelTest ‑ given upstream failure when handling display name update then emits failure event
im.vector.app.features.onboarding.OnboardingViewModelTest ‑ given upstream update avatar fails when saving selected profile picture then emits failure event
im.vector.app.features.onboarding.OnboardingViewModelTest ‑ when handling display name update then updates upstream user display name
im.vector.app.features.onboarding.OnboardingViewModelTest ‑ when handling profile picture selected then updates selected picture state
im.vector.app.features.onboarding.OnboardingViewModelTest ‑ when handling profile picture skipped then completes personalization
im.vector.app.features.onboarding.UriFilenameResolverTest ‑ given a non content schema Uri when querying file name then returns last segment
im.vector.app.features.onboarding.UriFilenameResolverTest ‑ given a non hierarchical Uri when querying file name then is null
im.vector.app.features.onboarding.UriFilenameResolverTest ‑ given content schema Uri with backing content when querying file name then returns display name column
…

♻️ This comment has been updated with latest results.

bmarty

That's amazing work. I am not sure I understand all the code related to the crypto part. I am glad that we have some unit tests. And you added lots of unit tests, which is also very nice. Thanks!
I would be more confident if we could have a review from someone in the crypto team. WDYT? @poljar maybe?

...x-sdk-android/src/androidTest/java/org/matrix/android/sdk/internal/crypto/E2eeSanityTests.kt

...-sdk-android/src/androidTest/java/org/matrix/android/sdk/internal/crypto/PreShareKeysTest.kt

matrix-sdk-android/src/main/java/org/matrix/android/sdk/internal/crypto/MXOlmDevice.kt

...-sdk-android/src/main/java/org/matrix/android/sdk/internal/crypto/model/OlmSessionWrapper.kt

poljar

Left some small nits; logic-wise, nothing seems to be wrong.

matrix-sdk-android/src/main/java/org/matrix/android/sdk/internal/crypto/OlmSessionStore.kt

poljar · 2022-03-04T10:03:37Z

matrix-sdk-android/src/main/java/org/matrix/android/sdk/internal/crypto/OlmSessionStore.kt

+        return internalGetAllSessions(deviceKey)
+    }
+
+    private fun internalGetAllSessions(deviceKey: String): MutableList<String> {


Is there a point to this being a separate method? Why not inline it into getDeviceSessionIds?

Mmm, initially it was also called from another method (one that was already synchronized), so I created this. But it looks like there is no point now.

poljar · 2022-03-04T10:05:59Z

matrix-sdk-android/src/main/java/org/matrix/android/sdk/internal/crypto/OlmSessionStore.kt

+    }
+
+    /**
+     * Retrieve an end-to-end session between the logged-in user and another


Again, olm sessions are between devices.

fixed (and some more in crypto store interface)

matrix-sdk-android/src/main/java/org/matrix/android/sdk/internal/crypto/OlmSessionStore.kt

...src/main/java/org/matrix/android/sdk/internal/crypto/algorithms/megolm/MXMegolmEncryption.kt

matrix-sdk-android/src/main/java/org/matrix/android/sdk/internal/crypto/MXOlmDevice.kt

Co-authored-by: poljar <poljar@termina.org.uk>

michaelkaye

I've recently commented out some flaky integration tests that are in this area: #5449 - there seemed to be some sort of issue between two things opening a realm db at the same time. Do you think these concurrent access changes you mention might have helped there - is it worth un-ignoring those tests?

Also, it might be good to run the nightly integration tests against a PR like this that introduces more integration tests - otherwise we won't compile and run them until the night after merge.

You can do that in Actions -> "Nightly build" (in left sidebar) -> then in the blue bar, select this branch to run the workflow against.

It doesn't run them automatically any more as it takes a good long time and we don't want to slow down the PR merge time too much.

michaelkaye · 2022-03-08T15:51:13Z

matrix-sdk-android/src/androidTest/java/org/matrix/android/sdk/common/TestConstants.kt

@@ -23,7 +23,7 @@ object TestConstants {
    const val TESTS_HOME_SERVER_URL = "http://10.0.2.2:8080"

    // Time out to use when waiting for server response.
-    private const val AWAIT_TIME_OUT_MILLIS = 30_000


I found myself doing this in response to slow integration tests in #5459 - do you have any more information on why 60s is better?

(we changed 30s -> 60s here, but GitHub is not showing me both sides of the diff atm)

When running tests with several sessions, everything gets a lot slower (as in the E2E sanity test): each session is slower to sync, etc.
Every time we test something with await/waitWithLatch, it uses that value as the default timeout, and tests then fail while waiting for something to come back from the sync.
It also happens when running all tests. I wonder whether we fail to clean up properly between tests?
Previously I used the ANDROIDX_TEST_ORCHESTRATOR option to ensure that every test starts from a clean state, but it looks like it's not working anymore.

We definitely don't clean up the server between tests in a single junit run, but will clean up between CI builds generally (and if you just run a demo server locally, then that won't be cleaned up unless you explicitly do so)
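
As a rough illustration of why a single default timeout affects every latch-based wait in the tests, here is a minimal sketch. `awaitLatch` is a hypothetical helper written for this example, not the SDK's `waitWithLatch`, and the constant simply mirrors the 60s value discussed in this thread.

```kotlin
import java.util.concurrent.CountDownLatch
import java.util.concurrent.TimeUnit

// Assumed default, mirroring the 60s value discussed above.
const val AWAIT_TIME_OUT_MILLIS = 60_000L

// Hypothetical helper: hands a latch to the block and fails if it is not released in time.
fun awaitLatch(timeoutMillis: Long = AWAIT_TIME_OUT_MILLIS, block: (CountDownLatch) -> Unit) {
    val latch = CountDownLatch(1)
    block(latch)
    // await() returns false when the timeout elapses before countDown() is called.
    check(latch.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
        "Timed out after $timeoutMillis ms"
    }
}
```

Any call site that omits `timeoutMillis` inherits the default, so slow multi-session syncs fail at that one shared limit unless the constant is raised or the call passes its own value.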

bmarty

Let's merge this PR, 1.4.4 has been released.

BillCarsonFr marked this pull request as ready for review March 1, 2022 16:14

BillCarsonFr force-pushed the feature/bca/crypto_fix_rolling_uisi branch from 1fbe434 to 44599ed Compare March 2, 2022 08:59

bmarty reviewed Mar 2, 2022

BillCarsonFr requested a review from poljar March 4, 2022 08:40

poljar approved these changes Mar 4, 2022

BillCarsonFr and others added 24 commits March 4, 2022 19:21

Extract olm cache store

10ea166

Protect olm session from concurrent access

33f9bc5

protect olm account access

9df5f17

Fix test compilation

87d9308

Clean megolm import code

24c51ea

Added e2ee sanity tests

c97de48

Improve inbound group session cache + mutex

9b3c5d2

protect race on prekey + logs

ade16a0

better logs

9eb0473

test forward better key

11e8881

cleaning

2f665dd

clean test

122e785

extract test to dedicated class

2d9beb6

Clean ensure olm, fix unwedging, better logs

f238739

dispatch network calls to io

078ed1b

resurrect unwedge test + cleaning

b7bf39b

Use loggerTag

87de51b

avoid duplicate userId on key download

49d33f3

use mutex on suspend and not synchronized

6546f98

clean log level

714e1d7

fix test

ada83d0

code review cleaning

5d952fe

better comment

7616e2d

Co-authored-by: poljar <poljar@termina.org.uk>

Better comment

31d3fe3

Co-authored-by: poljar <poljar@termina.org.uk>

BillCarsonFr and others added 3 commits March 4, 2022 19:21

Better comment

99a07af

Co-authored-by: poljar <poljar@termina.org.uk>

Code review cleaning

db84c67

Save valid backup key before downloading keys

3c931d6

BillCarsonFr force-pushed the feature/bca/crypto_fix_rolling_uisi branch from 9172e74 to 3c931d6 Compare March 4, 2022 18:21

michaelkaye reviewed Mar 8, 2022

Fix ktlint

96b5174

bmarty approved these changes Mar 10, 2022

bmarty merged commit ce4ad88 into develop Mar 10, 2022

bmarty deleted the feature/bca/crypto_fix_rolling_uisi branch March 10, 2022 10:13

BillCarsonFr mentioned this pull request Apr 29, 2022

olm try to decrypt with recent session first #5872

Merged

12 tasks
