Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSV item exporter does not flush on close #4829

Closed
scientes opened this issue Oct 5, 2020 · 9 comments · Fixed by #5008
Closed

CSV item exporter does not flush on close #4829

scientes opened this issue Oct 5, 2020 · 9 comments · Fixed by #5008

Comments

@scientes
Copy link

scientes commented Oct 5, 2020

Description

The csv item exporter does not flush on close.

Steps to Reproduce

  1. Create pipeline with an CSV item exporter
  2. only export few items (one or two)
  3. Let spider close/finish

Expected behavior: All exported items are in the CSV Files

Actual behavior: Files with few items are empty (not even a header)

Reproduces how often: 100%

Versions

Scrapy : 2.3.0
lxml : 4.5.2.0
libxml2 : 2.9.10
cssselect : 1.1.0
parsel : 1.6.0
w3lib : 1.22.0
Twisted : 20.3.0
Python : 3.8.2 (default, Jul 16 2020, 14:00:26) - [GCC 9.3.0]
pyOpenSSL : 19.1.0 (OpenSSL 1.1.1f 31 Mar 2020)
cryptography : 2.8
Platform : Linux-5.4.0-7642-generic-x86_64-with-glibc2.29

Example pipline.py

from itemadapter import ItemAdapter
from scrapy.exporters import CsvItemExporter

class CricketcrawlerPipeline:
    def open_spider(self, spider):
        self.name_to_exporter = {}

    def close_spider(self, spider):
        for exporter in self.name_to_exporter.values():
            exporter.finish_exporting()

    def process_item(self, item, spider):
        exporter = self._exporter_for_item(item)
        exporter.export_item(item)
        return item

    def _exporter_for_item(self, item):
        name = item['name']
        if name not in self.name_to_exporter:
            f = open(f'{item.folder}/{name}.csv', 'wb')
            exporter = CsvItemExporter(f)
            exporter.start_exporting()
            self.name_to_exporter[name] = exporter
        return self.name_to_exporter[name]
@Gallaecio
Copy link
Member

They get an open file. I think it makes sense for the caller to be responsible of closing the file (just as it opened it), and not the item exporter.

@scientes scientes changed the title CSV item exporter doe not flush on close CSV item exporter does not flush on close Oct 5, 2020
@scientes
Copy link
Author

scientes commented Oct 5, 2020

To be totally honest i forgot about closing the file, that works as well. But for all other people who might also forget it, calling a flush on closing the CSVexporter doesn't hurt in my opinion. It just makes sure that all items are written to disc on closing the Exporter and the file-pointer is still usable after that so it shouldn't break anything.

@Gallaecio
Copy link
Member

Closing flushes, so proper calling code would not really benefit from this change. However, it is technically possible (though very rare, probably) for the calling code to use the same file object with multiple item exporters, in which case flushing could negatively impact performance.

I don’t expect great penalties from such a change. But I see no real benefit either. I’m -0.5 on adding flushing.

@scientes
Copy link
Author

scientes commented Oct 5, 2020

Well i'm more or less neutral on adding it, knowing that it could have a negative impact. Close the issue if you want to.

@wRAR
Copy link
Member

wRAR commented Oct 6, 2020

-0.5 from me too, for the same reasons

@GeorgeA92
Copy link
Contributor

In general case (with FEEDS setting or with -o scrapy command line tool parameter) scrapy use FileFeedStorage objects for maintaining (open/close) files for item exporters and it's controlled by FeedExporter extension.

ItemExporter code sample from scrapy docs (@scientes 's code probably based on it) doesn't use FileFeedStorage and it doesn' close files - this code... is not complete.

This issue , #4786 and, #4784 probably have the same origin.

@wRAR
Copy link
Member

wRAR commented Oct 16, 2020

That PerYearXmlExportPipeline is indeed problematic.

@scientes
Copy link
Author

Yes i've used the docs as basis. Maybe we should Update the documentation to close the files.

@wRAR
Copy link
Member

wRAR commented Oct 30, 2020

Another consideration, the file-like object that is passed to the exporter doesn't necessarily support the flush method.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants