Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added PDF endpoint #696

Merged
merged 9 commits into from
Nov 29, 2016
Merged

Added PDF endpoint #696

merged 9 commits into from
Nov 29, 2016

Conversation

Pchelolo
Copy link
Contributor

I've already almost lost this code once in my local branches, creating a PR to avoid loosing it again.

We have 2 issues:

  1. Not clear how to generate an etag since we don't know the current article revision
  2. swagger-ui download link is not perfect, I'll create an upstream PR to fix it. There were major issues with swagger-ui for this use-case before, but it has been fixed recently, and I've updated our swagger-ui forks to pull in the fixes, so it works now.

cc @wikimedia/services

@Pchelolo
Copy link
Contributor Author

Create a PR for swagger-ui to show the download link nicer

method: get
uri: 'https://pdf-electron.wmflabs.org/pdf?accessKey=secret&url=https://{{domain}}/wiki/{title}'
- return_response:
return:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for a new step, the return block can be put directly in the previous block.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, you say it's done, but I still see this as being a two-step procedure - one to fetch the pdf, the other to return it.

- get_pdf_from_backend:
request:
method: get
uri: 'https://pdf-electron.wmflabs.org/pdf?accessKey=secret&url=https://{{domain}}/wiki/{title}'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would probably be better to contact the instance present in BetaCluster (deployment-pdfrender.deployment-prep.eqiad.wmflabs) since that one will match the one that will be in production.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understand it's not exposed publicly right now? Should we assing a domain to it, something like pdfrender-beta.wmflabs.org ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, scratch my earlier comment. We need to parametrise two things here:

  • the service's URI (much like we do with Parsoid and all the others)
  • the accessKey query attribute, as it changes from one environment to the other; it also needs to be encodeURIComponent-ed since in production it contains unsafe characters

description: The PDF render of an article
schema:
type: file
# TODO: Add an etag somehow. How???
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can get the latest rev ID from storage, can't we?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now in the revision-table-access-check-filter we indeed go to the revision table, so we know the latest rev id. However, when we switch to the restriction-table-access-check-filter, we wouldn't know the latest revision any more. Obtaining it would require one more round-trip to Cassandra, which is not ideal. It's still not a huge deal since the PDF generation is very slow and a round-trip to cassandra could be done in parallel.

Other approach is to make MW emit an ETag (or just some custom header with a revision number) and make the service pass it through.

Yet another approach is to apply the new printable styles to Parsoid HTML and render the PDF based on Parsoid HTML and preserve the etag, but that's actually not a solution since Parsoid HTML doesn't have the table of contents and in general the output PDF is much worse.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that doing another round trip to Cassandra is not such a big penalty, given that we don't expect much traffic here and the amount of time it takes to actually render a page in pdf.

@d00rman
Copy link
Contributor

d00rman commented Nov 1, 2016

Minor nits and comments, otherwise LGTM overall, but we probably don't want to deploy this before the service hits production.

@Pchelolo
Copy link
Contributor Author

Pchelolo commented Nov 1, 2016

Ye, for sure don't deploy yet, wait until it's deployed in prod. I've created a PR to avoid loosing the code

Also, I wanna continue the discussion about adding an optional? {/size} parameter with an enum of page sizes which to print on. Create a task for it perhaps?

@d00rman d00rman added the API label Nov 24, 2016
@@ -32,6 +32,8 @@ default_project: &default_project
cache_control: s-maxage=86400, max-age=86400
# 10 days Varnish caching, one day client-side
purged_cache_control: s-maxage=864000, max-age=86400
# Cache PDF for 5 minutes since it's not purged
pdf_cache_control: s-maxage=300, max-age=300
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be 10 minutes probably.

@d00rman
Copy link
Contributor

d00rman commented Nov 24, 2016

Also, I wanna continue the discussion about adding an optional? {/size} parameter with an enum of page sizes which to print on. Create a task for it perhaps?

The service can accept a window height and width, so this would be possible. The defaults in prod are set to 1024x768, so yeah, we could discuss both the default options and the need for size, even though this would probably mean cache fragmentation.

@d00rman
Copy link
Contributor

d00rman commented Nov 24, 2016

LGTM in general, modulo minor comments (in-lined).

@Pchelolo
Copy link
Contributor Author

Pchelolo commented Nov 24, 2016

@d00rman Updated to address your comments. Would require a puppet patch to set up the options before deployment. Additionally I've upgraded our swagger-ui fork with this PR: swagger-api/swagger-ui#2465 so it should whow the correct link now and properly download the file.

pdf:
# Cache PDF for 5 minutes since it's not purged
cache_control: s-maxage=600, max-age=600
uri: https://pdf-electron.wmflabs.org/pdf
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have kind of gotten into the habit of putting only the protocol://host:port part in the parameters, so let's keep the consistency with the other config options. Additionally, if we ever decide to use the other formats supported by the service (png, jpg), we won't need to change the options in the config.

@d00rman
Copy link
Contributor

d00rman commented Nov 25, 2016

Puppet config change patch is here

@Pchelolo
Copy link
Contributor Author

@d00rman Done!

@d00rman
Copy link
Contributor

d00rman commented Nov 25, 2016

👍

@gwicke gwicke merged commit 812c4dd into wikimedia:master Nov 29, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
3 participants