JsonItemExporter puts lone comma in the output if encoder fails #3090

Closed
tlinhart opened this issue Jan 25, 2018 · 4 comments · Fixed by #5952

Comments

@tlinhart

If JsonItemExporter is unable to encode the item, it still writes a delimiter (comma) to the output file. Here is a sample spider:

# -*- coding: utf-8 -*-
import datetime
import scrapy

class DummySpider(scrapy.Spider):
    name = 'dummy'
    start_urls = ['http://example.org/']

    def parse(self, response):
        yield {'date': datetime.date(2018, 1, 1)}
        yield {'date': datetime.date(1234, 1, 1)}
        yield {'date': datetime.date(2019, 1, 1)}

Encoding the second item fails:

2018-01-25 09:05:57 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method ?.item_scraped of <scrapy.extensions.feedexport.FeedExporter object at 0x7fcbfbd81250>>
Traceback (most recent call last):
  File "/home/pasmen/SW/Python/data-sci-env/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 150, in maybeDeferred
    result = f(*args, **kw)
  File "/home/pasmen/SW/Python/data-sci-env/local/lib/python2.7/site-packages/pydispatch/robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "/home/pasmen/SW/Python/data-sci-env/local/lib/python2.7/site-packages/scrapy/extensions/feedexport.py", line 224, in item_scraped
    slot.exporter.export_item(item)
  File "/home/pasmen/SW/Python/data-sci-env/local/lib/python2.7/site-packages/scrapy/exporters.py", line 130, in export_item
    data = self.encoder.encode(itemdict)
  File "/usr/lib/python2.7/json/encoder.py", line 207, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python2.7/json/encoder.py", line 270, in iterencode
    return _iterencode(o, 0)
  File "/home/pasmen/SW/Python/data-sci-env/local/lib/python2.7/site-packages/scrapy/utils/serialize.py", line 22, in default
    return o.strftime(self.DATE_FORMAT)
ValueError: year=1234 is before 1900; the datetime strftime() methods require year >= 1900

The output looks like this:

[
{"date": "2018-01-01"},
,
{"date": "2019-01-01"}
]

This is not a valid JSON file: both json.load() and jq fail to parse it.
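As a quick check of the parsing failure, here is a minimal sketch that feeds the output shown above to the standard library parser. The variable names (`broken`, `parsed`, `error`) are made up for illustration; only `json.loads` and `json.JSONDecodeError` come from the standard library.

```python
import json

# The feed output from above, reproduced as a string (illustrative only).
broken = '[\n{"date": "2018-01-01"},\n,\n{"date": "2019-01-01"}\n]'

try:
    json.loads(broken)
    parsed = True
    error = None
except json.JSONDecodeError as exc:
    # The lone comma leaves the parser expecting a value that never comes.
    parsed = False
    error = exc

print("parsed:", parsed)
print("error:", error)
```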

I think the problem is in the export_item method of the JsonItemExporter class, which writes the comma before encoding the item. The correct approach would be to encode the item first (together with any other operations that can fail) and only then write the delimiter and the encoded data together, so a failed item leaves nothing in the output.
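The suggestion could look roughly like the sketch below. This is not Scrapy's actual code: `SafeJsonListWriter` is a hypothetical stand-in that uses plain `json.dumps` instead of Scrapy's encoder, and a complex number stands in for a value that cannot be serialized. The only point it illustrates is the ordering: encode first, write the delimiter afterwards.

```python
import io
import json


class SafeJsonListWriter:
    """Hypothetical stand-in for a JSON list exporter (not Scrapy's class)."""

    def __init__(self, file):
        self.file = file
        self.first_item = True

    def start_exporting(self):
        self.file.write(b"[\n")

    def export_item(self, item):
        # Encode before touching the file: if this raises,
        # no delimiter has been written and the output stays valid.
        data = json.dumps(item).encode("utf-8")
        if self.first_item:
            self.first_item = False
        else:
            self.file.write(b",\n")
        self.file.write(data)

    def finish_exporting(self):
        self.file.write(b"\n]")


buf = io.BytesIO()
writer = SafeJsonListWriter(buf)
writer.start_exporting()
items = [{"date": "2018-01-01"}, {"date": 1 + 2j}, {"date": "2019-01-01"}]
for item in items:
    try:
        writer.export_item(item)
    except TypeError:
        # complex numbers are not JSON serializable; the item is skipped
        pass
writer.finish_exporting()

result = json.loads(buf.getvalue())
print(result)
```

Because the failing item raises before any write happens, the remaining two items are separated by exactly one comma and the file parses cleanly.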

@ghost ghost added the bug label Feb 6, 2018
@ghost

ghost commented Feb 6, 2018

Thanks @tlinhart - will take a look at making a fix to that right away. :)

@gekco

gekco commented Mar 2, 2018

Hi, I am looking for a beginner issue to start working with. Any chance I can pick this up and start working on it?

@kmike
Member

kmike commented Mar 10, 2018

hey @gekco! this is almost fixed in #3111, the remaining issue is extra tests which are executed unintentionally.

@Gallaecio
Member

Gallaecio commented Nov 7, 2022

Marking as a good first issue as finishing #3111 may be easy based on the feedback provided there.

adnan-awan added a commit to adnan-awan/scrapy that referenced this issue Jun 15, 2023
adnan-awan added a commit to adnan-awan/scrapy that referenced this issue Jul 10, 2023
adnan-awan added a commit to adnan-awan/scrapy that referenced this issue Jul 11, 2023
adnan-awan added a commit to adnan-awan/scrapy that referenced this issue Jul 12, 2023
kmike pushed a commit that referenced this issue Jul 22, 2023
…5952)

* Partial fix for #3090 - only addresses JSON feeds.

* Adding test case for #3090 to Json Exporter

* Changing the deliberate-fail JSON example to a complex

* Further tightening JsonItemExporter behaviour to prevent corruption.

Based on Mikhail's observation that to_bytes can fail also, leading
to the same dangling comma as the failure to encode to JSON.

Added a new test case to avoid reversion.

* [scrapy] JsonItemExporter puts lone comma in the output if encoder fails

- Add initial changes from cathal's PR
- #3090

* [scrapy] JsonItemExporter puts lone comma in the output if encoder fails

- Handle exception not to add empty item.
- #3090

* [scrapy] JsonItemExporter puts lone comma in the output if encoder fails

- Add comment for handling the exception
- #3090

* [scrapy] JsonItemExporter puts lone comma in the output if encoder fails

- Remove unused import
- #3090

* [scrapy] JsonItemExporter puts lone comma in the output if encoder fails

- Fix invalid json issue
- #3090

* [scrapy] JsonItemExporter puts lone comma in the output if encoder fails

- Perform CR changes
- #3090

---------

Co-authored-by: Cathal Garvey <cathalgarvey@cathalgarvey.me>