Pdf2TextFilter error handling #704

Open
JulienPalard opened this issue Apr 13, 2022 · 2 comments

Comments

@JulienPalard
Contributor

I wanted to use urlwatch like this:

---
url: https://donneespubliques.meteofrance.fr/donnees_libres/bulletins/BCM/202205.pdf
filter:
  - pdf2text
---
url: https://donneespubliques.meteofrance.fr/donnees_libres/bulletins/BCM/202206.pdf
filter:
  - pdf2text
---
url: https://donneespubliques.meteofrance.fr/donnees_libres/bulletins/BCM/202207.pdf
filter:
  - pdf2text

to wait for the next bulletins, but when a bulletin is not available yet, meteofrance.fr redirects to an HTML page, leading to an exception:

Traceback (most recent call last):
  File "urlwatch/handler.py", line 120, in process
    data = FilterBase.process(filter_kind, subfilter, self, data)
  File "urlwatch/filters.py", line 188, in process
    return filtercls(state.job, state).filter(data, subfilter)
  File "urlwatch/filters.py", line 399, in filter
    return '\n\n'.join(pdftotext.PDF(io.BytesIO(data), password=subfilter.get('password', '')))
pdftotext.Error: poppler error creating document

Maybe the pdf2text filter needs some kind of error-handling option. What do you think?
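
To make the idea concrete, here is a minimal sketch of what such an option could do, not how urlwatch implements anything. The helpers `looks_like_pdf` and `pdf_to_text_or_fallback` and the `fallback` parameter are hypothetical names; the converter is passed in as a callable so the sketch runs without pdftotext installed.

```python
from typing import Callable


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload appears to carry a PDF header."""
    # PDF files start with the marker "%PDF-"; lenient readers accept it
    # anywhere in the first 1024 bytes, so the same window is searched here.
    return b"%PDF-" in data[:1024]


def pdf_to_text_or_fallback(
    data: bytes,
    convert: Callable[[bytes], str],
    fallback: str = "",
) -> str:
    """Run convert(data) only on PDF-looking input, else return fallback."""
    if not looks_like_pdf(data):
        # For example, the HTML "not available" page served after a redirect.
        return fallback
    try:
        return convert(data)
    except Exception:
        # Broad on purpose for this sketch; real code would catch the
        # converter's own error type (pdftotext.Error in the traceback).
        return fallback


# Hypothetical stand-in for the HTML page returned instead of the PDF.
html_page = b"<!DOCTYPE html><html><body>unavailable</body></html>"
result = pdf_to_text_or_fallback(
    html_page, convert=lambda d: "text", fallback="(no PDF yet)"
)
print(result)  # prints "(no PDF yet)"
```

With pdftotext available, `convert` could wrap the same expression that appears in the traceback, i.e. joining `pdftotext.PDF(io.BytesIO(data))` with blank lines.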

@thp
Owner

thp commented Apr 18, 2022

Hm, it seems like the page returns a "proper" 404 status code when the PDF isn't yet available.

I do wonder why the filter is run in this case. Have you maybe set things up to ignore 404 errors?

@JulienPalard
Contributor Author

No, they don't return a proper 404; in fact, it really depends on the request:

$ curl -I https://donneespubliques.meteofrance.fr/donnees_libres/bulletins/BCM/202207.pdf
HTTP/1.1 404 Not Found
...

$ curl -i https://donneespubliques.meteofrance.fr/donnees_libres/bulletins/BCM/202207.pdf
HTTP/1.1 302 Found
Date: Tue, 19 Apr 2022 20:08:54 GMT
Location: https://donneespubliques.meteofrance.fr/?fond=donnee_indisponible
...

And in urlwatch, in the job's retrieve function, we're hitting the second case:

(Pdb) p response
<Response [200]>
(Pdb) p response.url
'https://donneespubliques.meteofrance.fr/?fond=donnee_indisponible'
(Pdb) p response.history
[<Response [302]>]
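
Given a response like the one in the Pdb session, the redirect could in principle be detected before any filter runs. The following is only a sketch under assumptions: `redirected_away_from_pdf` is a hypothetical helper, the fake responses are stand-ins built with `SimpleNamespace`, and the `text/html` content type of the landing page is assumed, not observed above.

```python
from types import SimpleNamespace


def redirected_away_from_pdf(response) -> bool:
    """Heuristic: was the request redirected to something that is not a PDF?

    Expects an object with `history` and `headers` attributes, as a
    requests.Response provides.
    """
    was_redirected = bool(response.history)
    content_type = response.headers.get("Content-Type", "")
    # Ignore parameters such as "; charset=utf-8" when comparing.
    media_type = content_type.split(";")[0].strip().lower()
    return was_redirected and media_type != "application/pdf"


# Stand-in mimicking the Pdb output: a 200 reached through one 302.
unavailable = SimpleNamespace(
    status_code=200,
    url="https://donneespubliques.meteofrance.fr/?fond=donnee_indisponible",
    history=[SimpleNamespace(status_code=302)],
    headers={"Content-Type": "text/html; charset=utf-8"},  # assumed value
)

# Stand-in for a bulletin that is served directly as a PDF.
available = SimpleNamespace(
    status_code=200,
    url="https://donneespubliques.meteofrance.fr/donnees_libres/bulletins/BCM/202205.pdf",
    history=[],
    headers={"Content-Type": "application/pdf"},
)

print(redirected_away_from_pdf(unavailable))  # prints True
print(redirected_away_from_pdf(available))    # prints False
```

When calling requests directly, passing `allow_redirects=False` to `requests.get` would surface the 302 itself instead of the final 200; whether urlwatch exposes that setting per job is not something this thread establishes.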
