Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate attachments from IndexedDB to file system #2129

Merged
merged 41 commits into from Mar 26, 2018

Conversation

gasi-signal
Copy link
Contributor

@gasi-signal gasi-signal commented Mar 15, 2018

  • Generate random file names.
  • Generate random file paths that prevent too many files per folder using fan-out.
  • Create attachment directory in user data folder.
  • Investigate operating system file indexing on:
  • Save incoming attachment files to disk
    • On received
    • On sync
  • Save outgoing attachments files to disk before sending
  • Display attachments either from disk or memory in attachment view. Be robust to multiple render passes.
    • Test that missing attachment on disk doesn’t break app. Behavior: Message is displayed without attachment.
  • Delete attachment files when message is deleted.

Relates to #1589.

@gasi-signal gasi-signal force-pushed the create-schema-version-message-index branch 2 times, most recently from 14aef88 to 55b7350 Compare March 16, 2018 18:37
@gasi-signal gasi-signal force-pushed the migrate-attachments-to-file-system branch 2 times, most recently from d8d3199 to a677b31 Compare March 19, 2018 21:38
@gasi-signal gasi-signal force-pushed the create-schema-version-message-index branch from 55b7350 to bde8436 Compare March 19, 2018 23:30
@gasi-signal gasi-signal changed the base branch from create-schema-version-message-index to development March 20, 2018 18:35
@gasi-signal gasi-signal force-pushed the migrate-attachments-to-file-system branch 3 times, most recently from 0aef6a2 to 47df1d5 Compare March 20, 2018 23:00
@gasi-signal gasi-signal changed the title WIP: Migrate attachments from IndexedDB to file system Migrate attachments from IndexedDB to file system Mar 20, 2018
Copy link
Contributor Author

@gasi-signal gasi-signal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Annotated PR for your convenience, @scottnonnenberg-signal 😄

!js/views/settings_view.js
!js/backup.js
!js/database.js
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just alphabetized it (using Sublime Sort Lines) and found some dupes.

const toArrayBuffer = require('to-arraybuffer');


const PATH = 'attachments.noindex';
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the piece that helps prevent indexing on macOS.


const prefix = name.slice(0, 2);
return path.join(prefix, name);
};
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies, this file was written before we had our discussion about module.exports so I haven’t converted it yet. Also, it’s nice to have => for some short functions without return which would require to have function keyword.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries, I'll amend my previous comment: exports.xxxx is totally reasonable since export xxx is the future. :0)

I'll get my preferred order in files when we move to that via export function xxx.

@@ -656,10 +657,13 @@
profileKey = storage.get('profileKey');
}

const loadData = Attachment.loadData(migrationContext.readAttachmentData);
const attachmentsWithData =
await Promise.all(messageWithSchema.attachments.map(loadData));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to surgically add places where we have to upgrade messages or load attachment metadata. I’d appreciate you thinking about places that I might have missed (and are therefore not in this PR) based on your extensive experience with the codebase — thanks 👀🙇

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are only a few potential sources which could create a new message:

  1. incoming, from contact
  2. incoming, sent to contact but from your own device
  3. import
  4. sending, created locally

There are a lot of places where 'sync' messages are sent, but those will never have attachments. Though perhaps you do want to upgrade their schema as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that helps. I think I got (1) onMessageReceived, (2) onSentMessage, (4) Conversation::sendMessage. Let me look at (3) imported messages (Update: On hold until I get the app startup migration working.)

Re: sync messages. Good point. Indeed, right now they don’t have attachments but we may have other schema upgrades in the future. I’ll see if I can fit this in here, or at least do Message::initializeSchema so they’ll be picked up by our background task in the future.

initialize: function(attributes) {
if (_.isObject(attributes)) {
this.set(TypedMessage.initializeSchemaVersion(attributes));
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had to add if guard for when we call new Whisper.Message().

@@ -427,6 +444,7 @@
}
}
message.set({
schemaVersion : dataMessage.schemaVersion,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a place we have to look out for when adding new properties. Haven’t had time to think about how to make this less likely to be forgotten in the future.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about in the constructor of the Message object?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added Signal.Types.Message.initializeSchema to Whisper.Message::initialize constructor. But sadly that won’t cover handleDataMessage as it mutates Whisper.Message through this set call. In my experience, code is much easier to reason about when the constructor is the only place to construct a data type and it ensures the data type is valid after successful execution (Make Invalid States Impossible to Represent).

In fact, when you work with immutable data types, there simply is no such a thing as a set after the fact to mutate an existing ‘instance’. Yes, you can have set later but it returns a new data type. This means required fields (non-Maybes) have to be set in the constructor. Functions that do validation and data type construction are often called smart constructors. This article gives a good overview: https://github.com/paf31/purescript-book/blob/cf0b510052495e884059667e0585156cf68e354b/text/chapter14.md#smart-constructors

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm. Maybe it would make sense to use a different technique to go from message to dataMessage? What keys don't we want to come through? What did you see when you spent some time with this part of the code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, ideally we’d only use constructors as the sole way to create an instance of a model so there is a single point to determine what data is valid vs not. Right now we have a mixture of new Whisper.Message(…) and later various set calls which makes it harder to reason about the final state of Message. For specifics, I’d have to spend more time with it but at a high-level I see a few problematic patterns:

  • We treat incoming, outgoing, verified-change, group update, end session, etc. messages all the same, i.e. as Whisper.Message. I understand we want to render them at the same time and persist them as a union (all properties flat and explicit type field) in messages store at the storage level, but it’d be nice to make a distinction at the app level. This would allow us to break down handleDataMessage into shorter functions, e.g. handleIncomingMessage, handleOutgoingMessage, handleVerifiedChangeMessage, etc. Happy to share how this would be modeled if we had support for union types.
  • We interweave pure construction of the data type as well as side-effects such as Message::save and Conversation::save. It’d be nice to break the two apart so we could test the side-effect free data type construction using unit tests.

What keys don't we want to come through?

For Whisper.Message specifically we seem to be doing fairly well to pick data vs rejecting it. Going through my own database, I found dataMessage: null in type: 'outgoing' which seems to be astray. Other than that, it’d be simply nice if there was a single new Whisper.Message(…) call that showed all the properties we initialize with.


// hasData :: Attachment -> Boolean
exports.hasData = attachment =>
attachment.data instanceof ArrayBuffer || ArrayBuffer.isView(attachment.data);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One example where module.exports would make the function longer. But not a huge deal 😄

BTW, this definition of data is what window.crypto.subtle expects for encryption.


const attachmentWithoutData = Object.assign({}, attachment, { path });
delete attachmentWithoutData.data;
return attachmentWithoutData;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grr, forgot to use omit again. Can fix it before merging.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not a big deal, I'm just honestly surprised that it's not second-nature for you! :0)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn’t start using Lodash until recently, that’s why. Deleting fields from a data type is also not something you do in statically typed languages 😉

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed it. Thanks for the tip. This is one example where yes, we could say that comment is not worth it but I appreciate you having made it and I learned something new 😄

'`attachment.data` must be an `ArrayBuffer` or `ArrayBufferView`; got: ' +
typeof attachment.data
));
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This helped improve our error messaging in case of problems and highlights the error sooner.

package.json Outdated
"spectron": "^3.8.0"
"spectron": "^3.8.0",
"string-to-arraybuffer": "^1.0.0",
"tmp": "^0.0.33"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I guess tmp is now not needed since you added it to dependencies.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed it from devDependencies

Copy link
Contributor

@scottnonnenberg-signal scottnonnenberg-signal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, great work. Some design things to think about, some questions about dependencies. And a strong recommendation to avoid ` in strings unless we really need it. :0)

// getPath :: AbsolutePath -> AbsolutePath
exports.getPath = (userDataPath) => {
if (!isString(userDataPath)) {
throw new TypeError('`userDataPath` must be a string');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm. The use of ` to denote a name is bit confusing when scanning this text, since that character is used for strings as well. I don't thing we get enough value out of it here to justify that cost.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you clarify what the particular concern is? The rationale is to use the Markdown syntax we’re so used to as developers for denoting identifiers. I feel '\'userDataPath\' must be a string is noisier and just 'userDataPath must be a string' can get confusing once we have more generic arguments such as 'data must be a string', etc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Backticks are harder for me to scan and understand. It's a twitch thing, and makes the code take more time to parse. IMO, markdown syntax is only useful when it gets converted to markdown.

I'm not really worried about a confusing error message like 'data must be a string' because we'll have the stack and therefore line numbers to help us out with addressing the error. If we're really looking for clarity, we might want to put the name of the method in the error.

@@ -656,10 +657,13 @@
profileKey = storage.get('profileKey');
}

const loadData = Attachment.loadData(migrationContext.readAttachmentData);
const attachmentsWithData =
await Promise.all(messageWithSchema.attachments.map(loadData));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are only a few potential sources which could create a new message:

  1. incoming, from contact
  2. incoming, sent to contact but from your own device
  3. import
  4. sending, created locally

There are a lot of places where 'sync' messages are sent, but those will never have attachments. Though perhaps you do want to upgrade their schema as well?

const attachments = this.get('attachments');
const deleteData =
Attachment.deleteData(migrationContext.deleteAttachmentData);
await Promise.all(attachments.map(deleteData));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the last line of a function, how do we decide to use return or await? Is it because you explicitly don't want to make the results of those deleteData() calls available to any caller?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I want onDestroy to have return type void. I’ll add an explicit return; to make it clear 👍

@@ -427,6 +444,7 @@
}
}
message.set({
schemaVersion : dataMessage.schemaVersion,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about in the constructor of the Message object?

this.revokeImageUrl();
const attachments = this.get('attachments');
const deleteData =
Attachment.deleteData(migrationContext.deleteAttachmentData);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In terms of API, I would like to have the final deleteData available to me here as an end-user, instead of needing to assemble it with context. I can see why we arrange things this way at declaration, keeping things pure, keeping things easy to test and compose. But I think we could simplify things here by having things be pre-assembled.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, that’d be nice. I will have to think about how to best expose it since the regular upgrade pipeline just needs the writeAttachmentData function on its own.

@@ -0,0 +1,105 @@
const fse = require('fs-extra');
const path = require('path');
const stringToArrayBuffer = require('string-to-arraybuffer');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks to me like we're pulling this in for a very simple function:

function str2ab (str) {
	var array = new Uint8Array(str.length);
	for(var i = 0; i < str.length; i++) {
		array[i] = str.charCodeAt(i);
	}
	return array.buffer
}

I feel like the risk we take on in adding another dependency (even if a dev dependency) is not worthwhile.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!


const prefix = name.slice(0, 2);
return path.join(prefix, name);
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries, I'll amend my previous comment: exports.xxxx is totally reasonable since export xxx is the future. :0)

I'll get my preferred order in files when we move to that via export function xxx.

@@ -77,6 +78,7 @@
"spellchecker": "^3.4.4",
"testcheck": "^1.0.0-rc.2",
"tmp": "^0.0.33",
"to-arraybuffer": "^1.0.1",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you talk about the need for this dependency?

Unlike string-to-arraybuffer, at least this function is has a nice performance optimization we probably wouldn't have written otherwise: https://github.com/jhiesey/to-arraybuffer/blob/6502d9850e70ba7935a7df4ad86b358fc216f9f0/index.js#L3-L27


const attachmentWithoutData = Object.assign({}, attachment, { path });
delete attachmentWithoutData.data;
return attachmentWithoutData;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not a big deal, I'm just honestly surprised that it's not second-nature for you! :0)

Copy link
Contributor Author

@gasi-signal gasi-signal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review and the valuable feedback, especially pointing out where we create, send, and receive messages.

Re: to-arraybuffer. I need that to convert from Node.js Buffer (fs.readFile) to ArrayBuffer used in our UI.

@@ -656,10 +657,13 @@
profileKey = storage.get('profileKey');
}

const loadData = Attachment.loadData(migrationContext.readAttachmentData);
const attachmentsWithData =
await Promise.all(messageWithSchema.attachments.map(loadData));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that helps. I think I got (1) onMessageReceived, (2) onSentMessage, (4) Conversation::sendMessage. Let me look at (3) imported messages (Update: On hold until I get the app startup migration working.)

Re: sync messages. Good point. Indeed, right now they don’t have attachments but we may have other schema upgrades in the future. I’ll see if I can fit this in here, or at least do Message::initializeSchema so they’ll be picked up by our background task in the future.

@gasi-signal
Copy link
Contributor Author

@gasi gasi force-pushed the migrate-attachments-to-file-system branch from 642f376 to ead5d04 Compare March 26, 2018 14:53
@gasi-signal
Copy link
Contributor Author

Copy link
Contributor

@scottnonnenberg-signal scottnonnenberg-signal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! That attachment API is now much cleaner at all the call sites. :0)

Allows us to use `fs` with promises. Turns out it’s already a
transitive dependency.
Useful for creating random temporary directories for testing.
Used for testing attachment data migration.
Using hashes, we get the benefit of deduplication but if a user receives two
messages with the same attachment, deleting one would delete it for both since
they are only stored once. To avoid the complexity of tracking number of
references, we simply generate random file names similar to iMessage on MacOS
(?) and Signal Android.
We already use `tmp`.
Allows us to pass in Electron/Node.js specific functions. This allows better
unit testing in Mocha where we don’t have access to Electron APIs.
@gasi-signal gasi-signal force-pushed the migrate-attachments-to-file-system branch from a86ac5d to 7eaa6ef Compare March 26, 2018 20:43
@gasi-signal gasi-signal merged commit 7eaa6ef into development Mar 26, 2018
gasi-signal added a commit that referenced this pull request Mar 26, 2018
- [x] Generate random file names.
- [x] Generate random file paths that prevent too many files per folder using
      fan-out.
- [x] Create attachment directory in user data folder.
- [x] Investigate operating system file indexing on:
  - [x] Windows: Confirmed that `AppData` is not indexed by default.
  - [x] macOS: Confirmed that `~/Library` files are not indexed by default.
        Searching system files using Spotlight requires multi-step opt-in:
        https://lifehacker.com/5711409/how-to-search-for-hidden-packaged-and-system-files-in-os-x.
        More info https://apple.stackexchange.com/a/92785.
        Added `.noindex` suffix to `attachments` folder.
  - [x] Linux: n/a
- [x] Save incoming attachment files to disk
  - [x] On received
  - [x] On sync
- [x] Save outgoing attachments files to disk before sending
- [x] Display attachments either from disk or memory in attachment view.
      Be robust to multiple render passes.
  - [x] Test that missing attachment on disk doesn’t break app.
        Behavior: Message is displayed without attachment.
- [x] Delete attachment files when message is deleted.

Relates to #1589.
@gasi-signal gasi-signal deleted the migrate-attachments-to-file-system branch March 26, 2018 20:51
@gasi-signal
Copy link
Contributor Author

gasi-signal commented Mar 26, 2018

Rebased (no changes) and merged (before CI). Will address any unlikely CI issues after all three PRs are in development.

scottnonnenberg-signal added a commit that referenced this pull request Apr 5, 2018
Update to electron 1.8.4 (#2186)

Migrate all attachments from IndexedDB to file system in the background (#2208, #2193, #2165, #2162, #2129)

Save attachments to disk when importing Chrome app export (#2212)

Fixed: Read receipts setting would not be synchronized along with re-link (#2218)

Fixed: Clicking conversation in left pane when already selected would remove focus on message composition field - thanks
@colefranz! (#2032)

Fixed: Searching for the phone number of an existing contact, then selecting 'start conversation' would erase contact details (#2191)

Fixed: Selecting Settings menu option multiple times would open multiple instances of settings view - thanks @navdeepsinghkhalsa (#2167)

Dev:
  - Relax Node.js version requirements (#2203)
  - Fix a few typos in documentation - thanks @Vinnl (#2171)
  - Update issue template to mention that translation should be via Transifex (#2157)
gasi pushed a commit to gasi/Signal-Desktop that referenced this pull request Apr 6, 2018
Update to electron 1.8.4 (signalapp#2186)

Migrate all attachments from IndexedDB to file system in the background (signalapp#2208, signalapp#2193, signalapp#2165, signalapp#2162, signalapp#2129)

Save attachments to disk when importing Chrome app export (signalapp#2212)

Fixed: Read receipts setting would not be synchronized along with re-link (signalapp#2218)

Fixed: Clicking conversation in left pane when already selected would remove focus on message composition field - thanks
@colefranz! (signalapp#2032)

Fixed: Searching for the phone number of an existing contact, then selecting 'start conversation' would erase contact details (signalapp#2191)

Fixed: Selecting Settings menu option multiple times would open multiple instances of settings view - thanks @navdeepsinghkhalsa (signalapp#2167)

Dev:
  - Relax Node.js version requirements (signalapp#2203)
  - Fix a few typos in documentation - thanks @Vinnl (signalapp#2171)
  - Update issue template to mention that translation should be via Transifex (signalapp#2157)
scottnonnenberg-signal added a commit that referenced this pull request Apr 11, 2018
v1.7.0-beta.3

Update to electron 1.8.4 (#2186)

Migrate all attachments from IndexedDB to file system in the background (#2208, #2193, #2165, #2162, #2129)

Save attachments to disk when importing Chrome app export (#2212)

Fixed: Read receipts setting would not be synchronized along with re-link (#2218)

Fixed: Clicking conversation in left pane when already selected would remove focus on message composition field - thanks
@colefranz! (#2032)

Fixed: Searching for the phone number of an existing contact, then selecting 'start conversation' would erase contact details (#2191)

Fixed: Selecting Settings menu option multiple times would open multiple instances of settings view - thanks @navdeepsinghkhalsa (#2167)

Dev:
  - Relax Node.js version requirements (#2203)
  - Fix a few typos in documentation - thanks @Vinnl (#2171)
  - Update issue template to mention that translation should be via Transifex (#2157)
scottnonnenberg-signal added a commit that referenced this pull request Apr 11, 2018
Note: this release is equivalent to v1.7.0-beta.3. The changes listed below are compared to the previous production release, v1.6.1.

Update to electron 1.8.4 (#2186)

Migrate all attachments from IndexedDB to file system in the background (#2208, #2193, #2165, #2162, #2129)
Save attachments to disk when importing Chrome app export (#2212)

New option in settings: delete all application data (383e02e, #2144 and #2153)

Remove all configuration in database when we discover we are unlinked (9acb189 and 1c6d91b)

Delete everything in database when we link with a different phone number from previous link (9acb189)

Windows: Delete all data on uninstall (c855597)

Fixed: Read receipts setting would not be synchronized along with re-link (#2218)

Fixed: Clicking conversation in left pane when already selected would remove focus on message composition field - thanks @colefranz! (#2032)

Fixed: Searching for the phone number of an existing contact, then selecting 'start conversation' would erase contact details (#2191)

Fixed: Selecting Settings menu option multiple times would open multiple instances of settings view - thanks @navdeepsinghkhalsa (#2167)

Dev:
  - Redact file paths in anything that goes to the log on disk (#2110)
  - When top-level process errors happen, don't show dialog with stack trace (#2110)
  - Add `nsp` to CI runs (fd056e1)
  - Add eslint-plugin-mocha to disallow exclusive tests using *.only (#2110)
  - Preparation for encrypted backups (cea42bd)
  - Updates to structure of exported data - messages.zip, flat list of attachments (6d8f4b7)
  - Relax Node.js version requirements (#2203)
  - Fix a few typos in documentation - thanks @Vinnl (#2171)
  - Update issue template to mention that translation should be via Transifex (#2157)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants