Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add feed_slot_closed and feed_exporter_closed signals to work with FeedExporter #5876

Merged
merged 9 commits into from
Apr 26, 2023
Merged

Add feed_slot_closed and feed_exporter_closed signals to work with FeedExporter #5876

merged 9 commits into from
Apr 26, 2023

Conversation

guillermo-bondonno
Copy link
Contributor

The signals will be sent when a slot and the FeedExporter are closed respectively.

This allows for the use of extensions related to feeds (like email feeds or slack notifications) that may require them to be used without the need for a custom FeedExporter.

Copy link
Member

@Gallaecio Gallaecio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not keep the original deferred_list list, store DeferredList(deferred_list) in a variable, and add the feed_exporter_closed signal as a callback to that DeferredList(deferred_list)?

@codecov
Copy link

codecov bot commented Mar 27, 2023

Codecov Report

Merging #5876 (f875265) into master (d60b4ed) will decrease coverage by 0.06%.
The diff coverage is 93.28%.

❗ Current head f875265 differs from pull request most recent head 286707b. Consider uploading reports for the commit 286707b to get more accurate results

@@            Coverage Diff             @@
##           master    #5876      +/-   ##
==========================================
- Coverage   88.84%   88.78%   -0.06%     
==========================================
  Files         162      162              
  Lines       11055    11175     +120     
  Branches     1800     1817      +17     
==========================================
+ Hits         9822     9922     +100     
- Misses        954      966      +12     
- Partials      279      287       +8     
Impacted Files Coverage Δ
scrapy/mail.py 77.77% <0.00%> (-0.88%) ⬇️
scrapy/pipelines/files.py 72.00% <50.00%> (+0.09%) ⬆️
scrapy/utils/misc.py 96.52% <60.00%> (-1.35%) ⬇️
scrapy/utils/python.py 88.75% <81.25%> (-2.22%) ⬇️
scrapy/signalmanager.py 81.81% <83.33%> (+1.81%) ⬆️
scrapy/core/downloader/handlers/__init__.py 89.83% <87.50%> (-2.90%) ⬇️
scrapy/core/downloader/tls.py 97.22% <87.50%> (+0.16%) ⬆️
scrapy/core/downloader/contextfactory.py 85.24% <90.00%> (-2.26%) ⬇️
scrapy/core/engine.py 83.68% <90.90%> (-0.51%) ⬇️
scrapy/statscollectors.py 89.28% <92.85%> (-2.88%) ⬇️
... and 20 more

@guillermo-bondonno
Copy link
Contributor Author

guillermo-bondonno commented Mar 27, 2023

Why not keep the original deferred_list list, store DeferredList(deferred_list) in a variable, and add the feed_exporter_closed signal as a callback to that DeferredList(deferred_list)?

@Gallaecio Something like

async def close_spider(self, spider):
      closed_slots = []
      for slot in self.slots:
          d = self._close_slot(slot, spider)
          closed_slots.append(d)

      deferred_list = DeferredList(closed_slots)
      deferred_list.addCallback(lambda _: self.crawler.signals.send_catch_log_deferred(signals.feed_exporter_closed))

      await maybe_deferred_to_future(deferred_list)

?
This was part of a task by @VMRuiz which was pretty well defined, so I followed closely, but yes this works and does make sense to me at least

@VMRuiz
Copy link
Contributor

VMRuiz commented Mar 29, 2023

Why not keep the original deferred_list list, store DeferredList(deferred_list) in a variable, and add the feed_exporter_closed signal as a callback to that DeferredList(deferred_list)?

Previously, the code assumed that only the slots in self.slots needed to be closed when close_spider was called. However, with the new async approach, there may be pending processes that were started but have not yet finished closing, and thus have not raised the close feed signal yet as expected. To handle this, we register these pending processes in pending_deferreds, so they can be awaited if needed during the closing process.

By getting rid of the previous approach and relying only on awaiting self.pending_deferreds, we can refactor all this new logic inside the _close_slot method, as shown in the code here:

self.pending_deferreds.append(d)
d.addCallback(
lambda _: self.crawler.signals.send_catch_log(
signals.feed_slot_closed, slot=slot
)
)
d.addBoth(lambda _: self.pending_deferreds.remove(d))

The current commit centralices all the new logic inside _close_slot and simplifies the awaiting code in the close_spider method. The alternative would be something like this:

async def close_spider(self, spider):
    deferred_list = []
    for slot in self.slots:
        d = self._close_slot(slot, spider)
        deferred_list.append(d)
        
    # Await previous pending deferreds + current deferreds
    deferred_list += self.pending_deferreds
    if deferred_list:
        await maybe_deferred_to_future(DeferredList(deferred_list))

    # Send FEED_EXPORTER_CLOSED signal
    await maybe_deferred_to_future(
        self.crawler.signals.send_catch_log_deferred(signals.feed_exporter_closed)
    )
    
def item_scraped(self, item, spider):
    [...]
    d = self._close_slot(slot, spider)
    self.pending_deferreds.append(d)
    d.addBoth(lambda _: self.pending_deferreds.remove(d))
    [...]
    
 def _close_slot(self, slot, spider):
    [...]
    d.addCallback(
            lambda _: self.crawler.signals.send_catch_log(
                signals.feed_slot_closed, slot=slot
            )
    )
    return d

Copy link
Member

@Gallaecio Gallaecio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I did not realize another method (item_scraped) was in the picture.

The approach looks good to me, then.

Next step: Please, check on the CI issues; we have some flaky tests, but at least some of the issues seem legit.

docs/topics/signals.rst Outdated Show resolved Hide resolved
docs/topics/signals.rst Outdated Show resolved Hide resolved
docs/topics/signals.rst Outdated Show resolved Hide resolved
scrapy/extensions/feedexport.py Outdated Show resolved Hide resolved
scrapy/extensions/feedexport.py Outdated Show resolved Hide resolved
@jerrodcodecov
Copy link

Jerrod from Codecov here --> I'm noticing you guys are seeing this error a lot on Codecov. Did you change the git behavior or CI uploading step at all recently?

❗ Current head 63e6b46 differs from pull request most recent head a0b7ea9.
Consider uploading reports for the commit a0b7ea9 to get more accurate results```

@Gallaecio
Copy link
Member

Gallaecio commented Mar 30, 2023

Did you change the git behavior or CI uploading step at all recently?

Looking at https://github.com/scrapy/scrapy/blame/master/.github/workflows/tests-ubuntu.yml, I would say the last change was 3 years ago.

docs/topics/signals.rst Outdated Show resolved Hide resolved
docs/topics/signals.rst Outdated Show resolved Hide resolved
@guillermo-bondonno
Copy link
Contributor Author

Let me know if you had something else else in mind

@Gallaecio
Copy link
Member

Gallaecio commented Apr 24, 2023

For deprecation using

def create_deprecated_class(
may be best.

@Gallaecio
Copy link
Member

The original issue reported here, while minor, is still pending 😅

@Gallaecio Gallaecio merged commit b50c032 into scrapy:master Apr 26, 2023
@Gallaecio
Copy link
Member

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants