
ItemPipeline with multiple JsonLinesItemExporter instances: files' BufferedWriters don't get flushed #4786

Closed
codekoriko opened this issue Sep 14, 2020 · 3 comments

@codekoriko

I wanted each Item type to be exported into a separate .jsonl file.

For this I adapted the PerYearXmlExportPipeline example given in the Item Exporters section of the docs.

The issue I was hitting was that the BufferedWriter of my second exporter instance wasn't getting flushed before the crawl terminated.

My ItemPartitionedJsonLineExportPipeline code:

```python
from datetime import datetime

from scrapy.exporters import JsonLinesItemExporter


class ItemPartitionedJsonLineExportPipeline:

    def open_spider(self, spider):
        self.partioning_exporter = {}

    def close_spider(self, spider):
        for exporter in self.partioning_exporter.values():
            exporter.file.flush()

    def _exporter_for_item(self, item, spider):
        # Lazily create one exporter (and one output file) per item class.
        item_class = type(item).__name__
        if item_class not in self.partioning_exporter:
            feed_dir = spider.settings.get('FEED_DIR', '').rstrip('/')
            time_str = datetime.utcnow().replace(microsecond=0).isoformat()
            f = open(f"{feed_dir}/{spider.name}/{item_class}_{time_str}.jsonl", 'wb')
            exporter = JsonLinesItemExporter(f)
            exporter.start_exporting()
            self.partioning_exporter[item_class] = exporter
        return self.partioning_exporter[item_class]

    def process_item(self, item, spider):
        exporter = self._exporter_for_item(item, spider)
        exporter.export_item(item)
        return item
```

Expected behavior:
I expected both item types to be written into their respective .jsonl files, or an exception to be raised.

Actual behavior:

- My first item type got its 50 entries written correctly.
- My second item type, which has only one entry to be written, never gets written into its .jsonl file.
- No exception is raised.

My Solution

Manually flushing the buffer of each of my exporter instances in my close_spider method:

```python
    def close_spider(self, spider):
        for exporter in self.partioning_exporter.values():
            exporter.file.flush()
```

The JsonLinesItemExporter class doesn't implement a finish_exporting() method of its own, but even for the other exporters that do have one, none of them actually flush the buffer there.

My guess is that the buffers are automatically getting flushed at exit somehow, but I don't get why my second exporter instance's is not. Maybe because only a few items are generated?
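For context (an illustration, not from the thread): CPython's `open(..., 'wb')` returns a `BufferedWriter` with a default buffer of 8 KiB, and small writes sit in memory until the buffer fills or the file is flushed or closed. That is consistent with the 50-entry file reaching disk while the single-entry one did not. A minimal stdlib sketch:

```python
import os
import tempfile

# Sketch, assuming CPython's default 8 KiB write buffer: a small write to a
# file opened in binary mode stays in the BufferedWriter until flush/close.
path = os.path.join(tempfile.mkdtemp(), "items.jsonl")
f = open(path, "wb")
f.write(b'{"id": 1}\n')                 # 10 bytes: far below the buffer size
size_before = os.path.getsize(path)     # nothing has reached the disk yet
f.close()                               # close() flushes the buffer
size_after = os.path.getsize(path)
print(size_before, size_after)
```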

@Gallaecio
Member

Gallaecio commented Sep 16, 2020

Related to #4575

@Gallaecio
Member

@psychonaute Can you confirm that the issue with your code is that you are not closing the file? (#4829, we need to update the documentation).

If that is the issue, and there’s already #4575 to support splitting items into different output files by item class, I think we can close this issue.
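A sketch of the cleanup this comment points at: close each exporter's file in close_spider() instead of only flushing it, since close() both flushes the buffer and releases the handle. JsonLinesItemExporter keeps the file it was given on `exporter.file`; the stand-in class below is hypothetical, used only so the snippet runs without Scrapy installed.

```python
import os
import tempfile

class FakeExporter:
    """Hypothetical stand-in for JsonLinesItemExporter: like the real class,
    it keeps the open file object on `self.file`."""
    def __init__(self, file):
        self.file = file

    def export_item(self, item):
        self.file.write(b'{"n": %d}\n' % item)

# One exporter per item class, mirroring the pipeline from the issue.
path = os.path.join(tempfile.mkdtemp(), "OtherItem.jsonl")
partioning_exporter = {"OtherItem": FakeExporter(open(path, "wb"))}
partioning_exporter["OtherItem"].export_item(1)

# The fix: in close_spider(), close every file (close() implies a flush).
for exporter in partioning_exporter.values():
    exporter.file.close()

with open(path, "rb") as f:
    data = f.read()        # the single entry is now safely on disk
```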

@codekoriko
Author

yeah sure, the 'item_classes' key in FEEDS works wonders 👍
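For readers landing here later: the built-in approach referenced above is the item_classes filter in the FEEDS setting, the outcome of #4575 (available in newer Scrapy versions, 2.6+ as far as I know). A minimal sketch, assuming a project with hypothetical ProductItem and ReviewItem classes in myproject.items:

```python
# settings.py — sketch, assuming Scrapy >= 2.6; ProductItem and ReviewItem
# are hypothetical item classes, not from this issue's code.
FEEDS = {
    "exports/products.jsonl": {
        "format": "jsonlines",
        "item_classes": ["myproject.items.ProductItem"],
    },
    "exports/reviews.jsonl": {
        "format": "jsonlines",
        "item_classes": ["myproject.items.ReviewItem"],
    },
}
```

With this, Scrapy routes each scraped item to the feed whose item_classes matches, and it handles flushing and closing the files itself.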
