
Import from wallabag asynchronously #1611

Closed
nicosomb opened this issue Jan 21, 2016 · 7 comments

@nicosomb
Member

As we can see in #1598, importing massive files is not possible (we set the limit to 20M on v2.wallabag.org, but that's not a good solution).
We need to implement RabbitMQ as in #1581.
We need to refactor the JSON export in wallabag v1 so that it only exports article URLs, not content.
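
For illustration, the asynchronous part amounts to publishing one small message per article during the upload request and letting a background worker fetch and store the content. Below is a minimal sketch of the producer side using php-amqplib; the queue name, message shape, and the `$urlsFromUpload` / `$userId` variables are invented for the example and are not wallabag's actual implementation.

```php
<?php
// Producer side of an asynchronous import (sketch only).
require __DIR__ . '/vendor/autoload.php';

use PhpAmqpLib\Connection\AMQPStreamConnection;
use PhpAmqpLib\Message\AMQPMessage;

$connection = new AMQPStreamConnection('localhost', 5672, 'guest', 'guest');
$channel = $connection->channel();

// Durable queue so pending import jobs survive a broker restart.
$channel->queue_declare('wallabag.import.v1', false, true, false, false);

// One message per article URL: the web request stays fast, and a worker
// consuming this queue fetches and stores the content later.
foreach ($urlsFromUpload as $url) {
    $message = new AMQPMessage(
        json_encode(['userId' => $userId, 'url' => $url]),
        ['delivery_mode' => AMQPMessage::DELIVERY_MODE_PERSISTENT]
    );
    $channel->basic_publish($message, '', 'wallabag.import.v1');
}

$channel->close();
$connection->close();
```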

@nicosomb nicosomb added this to the 2.0.0 milestone Jan 21, 2016
@jcharaoui
Contributor

I'd like to suggest that Wallabag keep import/export of content, for a simple reason. With really large collections, some articles could be years old, and the URLs to access them may no longer work down the road. The website could be completely offline, or the CMS may have changed and the old permalinks not migrated to the new platform.

Also, to solve the problem of loading big export files into memory, another solution would be to use a streaming JSON parser such as https://github.com/salsify/jsonstreamingparser. This would allow loading only one entry into memory at a time for processing, making the process much faster and less resource-intensive.
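
As a rough sketch of that idea (namespace and listener method names follow the library's README and may differ between versions; `importEntry()` stands in for the real import logic):

```php
<?php
// Streaming import sketch with salsify/jsonstreamingparser: the listener is
// fed parse events, so only the entry currently being built is kept in memory.
require __DIR__ . '/vendor/autoload.php';

use JsonStreamingParser\Listener\ListenerInterface;
use JsonStreamingParser\Parser;

class EntryListener implements ListenerInterface
{
    private int $depth = 0;
    private array $entry = [];
    private ?string $currentKey = null;

    public function startDocument(): void {}
    public function endDocument(): void {}
    public function whitespace(string $whitespace): void {}
    public function startArray(): void { ++$this->depth; }
    public function endArray(): void { --$this->depth; }

    public function startObject(): void
    {
        ++$this->depth;
        if (2 === $this->depth) {   // top-level array of entries => each entry object sits at depth 2
            $this->entry = [];
        }
    }

    public function endObject(): void
    {
        if (2 === $this->depth) {
            $this->importEntry($this->entry);   // persist one entry, then forget it
        }
        --$this->depth;
    }

    public function key(string $key): void { $this->currentKey = $key; }

    public function value($value): void
    {
        // Only scalar fields at the entry level are captured; nested structures
        // (e.g. tags) are skipped here for brevity.
        if (2 === $this->depth && null !== $this->currentKey) {
            $this->entry[$this->currentKey] = $value;
        }
    }

    private function importEntry(array $entry): void
    {
        // Placeholder: insert into the database or publish to the import queue.
    }
}

$stream = fopen('wallabag-v1-export.json', 'rb');
$parser = new Parser($stream, new EntryListener());
$parser->parse();
fclose($stream);
```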

@tcitworld
Member

> I'd like to suggest that Wallabag keep import/export of content, for a simple reason.

We can't retrieve content from Pocket; however, the solution you suggest may help us keep content from JSON files.

> We need to refactor the JSON export in wallabag v1 so that it only exports article URLs, not content.

We should make this a choice, then. If the export with content fails, then export just URLs and metadata.

@jcharaoui
Contributor

> We should make this a choice, then. If the export with content fails, then export just URLs and metadata.

That's one option, but it would also be feasible for the export process to keep track of which entries have been written out to disk and resume in case of a PHP timeout error. Regardless, unless I'm mistaken, the export process (database to file) is by nature much faster than the import (file to database), because the import has to run multiple database queries per entry. So making the v1 -> v2 JSON import more efficient seems to me like a bigger priority than refactoring the v1 export code.
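
To make the resume idea concrete, here is a toy sketch of a checkpointed export. The table/column names and the checkpoint file are invented, and it writes one JSON object per line precisely so that a re-run after a timeout can simply append where the previous run stopped.

```php
<?php
// Resumable export sketch: remember the id of the last entry written,
// so a re-run continues from there instead of starting over.
$pdo = new PDO('sqlite:wallabag-v1.sqlite');

$checkpointFile = 'export.checkpoint';
$lastId = is_file($checkpointFile) ? (int) file_get_contents($checkpointFile) : 0;

$out = fopen('export.jsonl', 'ab');   // one JSON object per line, append mode

$stmt = $pdo->prepare('SELECT id, url, title, content FROM entries WHERE id > :lastId ORDER BY id');
$stmt->execute(['lastId' => $lastId]);

while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
    fwrite($out, json_encode($row) . "\n");
    // Checkpoint after each entry so a timeout loses at most one row of work.
    file_put_contents($checkpointFile, (string) $row['id']);
}

fclose($out);
```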

@tcitworld
Member

> So making the v1 -> v2 JSON import more efficient seems to me like a bigger priority than refactoring the v1 export code.

A number of users have also encountered timeout/memory issues while exporting JSON from v1, so it remains a concern too.

@jcharaoui
Contributor

> > So making the v1 -> v2 JSON import more efficient seems to me like a bigger priority than refactoring the v1 export code.
>
> A number of users have also encountered timeout/memory issues while exporting JSON from v1, so it remains a concern too.

Agreed!

@nicosomb nicosomb modified the milestones: 2.0.0, 2.1.0 Mar 29, 2016
@nicosomb nicosomb changed the title [v2] Import from wallabag asynchronously Import from wallabag asynchronously Aug 28, 2016
@j0k3r j0k3r closed this as completed Sep 19, 2016
@HLFH

HLFH commented Sep 20, 2016

Hi @j0k3r @nicosomb. Is asynchronous export easier to implement now that you have implemented the asynchronous import feature? But I guess it won't be part of the 2.1.0 milestone.

@j0k3r
Member

j0k3r commented Sep 20, 2016

Easier, maybe.
But we still need to figure out how to ping the user once the export is done.
