I wanted each Item type to be exported into a separate .jsonl file.
For this I adapted the PerYearXmlExportPipeline example given in the Item Exporters section of the docs.
The issue I was hitting was that the BufferedWriter of my second exporter instance wasn't getting flushed before the crawl terminated.
My ItemPartitionedJsonLineExportPipeline code:
```python
import os
from datetime import datetime

from scrapy.exporters import JsonLinesItemExporter


class ItemPartitionedJsonLineExportPipeline:

    def open_spider(self, spider):
        self.partioning_exporter = {}

    def close_spider(self, spider):
        for exporter in self.partioning_exporter.values():
            exporter.file.flush()

    def _exporter_for_item(self, item, spider):
        item_class = type(item).__name__
        if item_class not in self.partioning_exporter:
            feed_dir = spider.settings.get('FEED_DIR', '').rstrip('/')
            time_str = datetime.utcnow().replace(microsecond=0).isoformat()
            # ensure the per-spider output directory exists before opening
            os.makedirs(f"{feed_dir}/{spider.name}", exist_ok=True)
            f = open(f"{feed_dir}/{spider.name}/{item_class}_{time_str}.jsonl", 'wb')
            exporter = JsonLinesItemExporter(f)
            exporter.start_exporting()
            self.partioning_exporter[item_class] = exporter
        return self.partioning_exporter[item_class]

    def process_item(self, item, spider):
        exporter = self._exporter_for_item(item, spider)
        exporter.export_item(item)
        return item
```
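For context, a pipeline like the one above is enabled through the project settings. A sketch, in which `myproject.pipelines` is a hypothetical module path and `FEED_DIR` is a custom setting read by the pipeline via `spider.settings.get()` (it is not a built-in Scrapy setting):

```python
# settings.py -- sketch; 'myproject.pipelines' is a hypothetical module path.
# FEED_DIR is a custom setting consumed by the pipeline, not built into Scrapy.
ITEM_PIPELINES = {
    'myproject.pipelines.ItemPartitionedJsonLineExportPipeline': 300,
}
FEED_DIR = 'exports'
```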
Expected behavior:
I expected both item types to be written into their respective .jsonl files, or an exception to be raised.
Actual behavior:
My first item type got its 50 entries written correctly.
My second item type, which has only one entry to be written, never gets written into its .jsonl file.
No exception is raised.
My Solution
Manually flushing the buffer of each of my exporter instances in my close_spider function:

```python
    def close_spider(self, spider):
        for exporter in self.partioning_exporter.values():
            exporter.file.flush()
```
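A fuller fix (my assumption of the intended cleanup, sketched here with a stand-in exporter class rather than Scrapy itself) is to finish each exporter and then close its file, since `close()` both flushes the buffer and releases the OS file handle:

```python
import os
import tempfile

class DummyExporter:
    """Stand-in for a Scrapy item exporter; like Scrapy's exporters, it keeps
    a reference to its output file and exposes a finish_exporting() hook."""
    def __init__(self, file):
        self.file = file
    def finish_exporting(self):
        pass  # a JSON-lines exporter writes no footer, so nothing to do here

def close_exporters(exporters):
    """Finish each exporter and close its file: close() flushes the buffer
    to disk and releases the file handle, so no buffered data can be lost."""
    for exporter in exporters.values():
        exporter.finish_exporting()
        exporter.file.close()

path = os.path.join(tempfile.mkdtemp(), 'OnlyItem.jsonl')
f = open(path, 'wb')
f.write(b'{"title": "only entry"}\n')  # one small write: sits in the buffer
close_exporters({'OnlyItem': DummyExporter(f)})
# after close_exporters(), the single line is guaranteed to be on disk
```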
The JsonLinesItemExporter class doesn't override the finish_exporting() function, but even the other exporters that do implement it never actually flush the buffer. My guess is that the buffers somehow get flushed automatically at exit, but I don't understand why my second exporter instance's buffer isn't. Maybe because only a few items are generated?
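That guess can be checked with plain Python file objects. Bytes written to a binary file sit in the `io.BufferedWriter` buffer (typically 8 KiB) until the buffer overflows or the file is flushed or closed; a file receiving ~50 items can cross the buffer size and look fine on disk, while a one-item file stays entirely in memory. A minimal demonstration:

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'buffering_demo.jsonl')
f = open(path, 'wb')               # returns an io.BufferedWriter
f.write(b'{"id": 1}\n')            # far below the default buffer size
assert os.path.getsize(path) == 0  # nothing on disk yet: data is in the buffer

# Writing past the buffer size (io.DEFAULT_BUFFER_SIZE, typically 8 KiB)
# forces an implicit flush -- which is why a file receiving ~50 items can
# look fine on disk while a one-item file stays empty.
f.write(b'x' * (io.DEFAULT_BUFFER_SIZE + 1))
assert os.path.getsize(path) > 0   # buffer overflowed, bytes reached disk

f.close()                          # close() flushes whatever remains
```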
@psychonaute Can you confirm that the issue with your code is that you are not closing the file? (#4829 — we need to update the documentation.)
If that is the issue, and given there's already #4575 to support splitting items into different output files by item class, I think we can close this issue.
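For reference, if I recall correctly the feature direction tracked in #4575 eventually shipped as feed item filtering: Scrapy 2.6 added an `item_classes` option to entries in the `FEEDS` setting, so per-item-class output files no longer need a custom pipeline. A sketch (the `myproject.items.*` paths are hypothetical):

```python
# settings.py -- sketch assuming Scrapy >= 2.6, which added the item_classes
# feed option; the 'myproject.items.*' paths are hypothetical.
FEEDS = {
    'exports/%(name)s/BookItem.jsonl': {
        'format': 'jsonlines',
        'item_classes': ['myproject.items.BookItem'],
    },
    'exports/%(name)s/AuthorItem.jsonl': {
        'format': 'jsonlines',
        'item_classes': ['myproject.items.AuthorItem'],
    },
}
```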