[MRG+1] Make data_path work when outside project (used by HttpCacheMiddleware and Deltafetch plugin) #1581
Current coverage is 83.31% (diff: 100%)

@@            master   #1581   diff @@
==========================================
  Files          161     161
  Lines         8717    8719     +2
  Methods          0       0
  Messages         0       0
  Branches      1283    1284     +1
==========================================
+ Hits          7253    7264    +11
+ Misses        1215    1204    -11
- Partials       249     251     +2
Hey Elias, the http cache should definitely work with relative paths even for stand-alone spiders, so +1 for implementing this.
Hmm how's that? Is there code that relies on
It's a little misnamed but I think it makes sense as a utility function. "Give me this path relative to the project if I am in one" seems quite re-usable.
None in Scrapy itself, I meant it as a theoretical argument, since we can't say for sure that no user is using it. :)
+1, so if we agree it's okay to change this function, maybe we can even rename it. Perhaps
Btw, my current workaround to be able to use
If this change is applied then such projects can suddenly start caching. I'd rather see this follow the API stability policy
Hm, I'm not sure I can follow you @Digenis
This PR will have no impact on http caching in projects. Scrapy will always cache as soon as This PR will only affect the http cache behaviour for stand-alone spiders (those that you run through
I typed the wrong word at some point, see edits. I use scrapyd for deployment. If I try To cache without scrapy.cfg I do think it's fair for people However I am aware of the issue and I don't have affected code
@Digenis so the backwards-incompatible change users can face in practice is that with
Exactly.
A nice catch about scrapyd. It sounds like a bug fix though; I think we should make this change, but add a note about it.
ping! I think the PR is almost ready, and it fixes an important problem - you can't use http cache with CrawlerRunner or CrawlerProcess without a Scrapy project (right?)
Right, I still think that this is a bug fix. I agree with @jdemaeyer, I think it's reasonable to expect that if
@eliasdorneles , I'm ok with including this for 1.2
472c02f to 8e4947e
So, @redapple and I wrote tests for this, and while we were at it we found that Deltafetch is also using that data_path function -- so this fix will apply to that as well (PR title updated to reflect it). Also, this will fix cache for
Something I should mention, we revised the change in data_path to also prepend
if inside_project():
    path = join(project_data_dir(), path)
else:
    path = join('.scrapy', path)
so, before it raised NotConfigured and now it creates a directory in CWD/.scrapy/<path>.
And the goal is to allow httpcache and deltafetch (and similars) to run even outside a scrapy project.
It is a bit backwards incompatible, scripts running in readonly filesystem now will fail because directory can't be created.
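The behaviour described in this comment can be sketched in isolation, without importing Scrapy. This is a minimal stand-in, not Scrapy's actual code: `resolve_data_path` and its `project_dir` parameter are hypothetical names introduced here, and `project_dir=None` stands for "not inside a project".

```python
import os
import tempfile


def resolve_data_path(path, project_dir=None, createdir=False):
    """Hypothetical stand-in for the revised lookup discussed in this thread.

    project_dir is an assumption of this sketch: None means "outside a
    project"; otherwise it is the directory that holds the project.
    """
    if not os.path.isabs(path):
        if project_dir is not None:
            # Inside a project: keep data under the project's .scrapy dir.
            path = os.path.join(project_dir, '.scrapy', path)
        else:
            # Outside a project: fall back to .scrapy relative to the CWD.
            path = os.path.join('.scrapy', path)
    # Absolute paths skip both branches and are returned unchanged.
    if createdir and not os.path.exists(path):
        # This is the step that would fail on a read-only filesystem.
        os.makedirs(path)
    return path


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        # Outside a project: a relative path under .scrapy, nothing created.
        print(resolve_data_path('httpcache'))
        # "Inside" a project rooted at tmp: the directory is created.
        print(resolve_data_path('httpcache', project_dir=tmp, createdir=True))
```

The directory is only created when `createdir` is set, so in this sketch the read-only concern applies to callers that ask for the directory to exist, not to the path lookup itself.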
Right, it's indeed a bit backwards incompatible, there is a PR to mention it in the release notes: #2298
About the case of scripts running in readonly filesystem, is that a huge concern?
It looks minor to me, because for the breaking change to happen the script would need to be 1) running scrapy outside a project and 2) using the settings to enable http cache or some other plugin that would try to use the data dir (in which case, the expected behavior would still be for the settings to be used and any existing data previously stored in .scrapy to be used).
So, still a bug fix, don't you think?
Fair point. Looks good then.
with inside_a_project() as proj_path:
    expected = os.path.join(proj_path, '.scrapy', 'somepath')
    self.assertEquals(expected, data_path('somepath'))
    self.assertEquals('/absolute/path', data_path('/absolute/path'))
why is the test for abs path inside project only?
If it makes a difference then it should be tested out and inside.
Right, lemme add it to the test outside project as well!
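One way to cover both cases raised in this review is to run the same absolute-path assertion in each context. The sketch below is self-contained and does not use Scrapy's test helpers; `fake_data_path` and its `project_dir` argument are hypothetical stand-ins for the function under test, with `project_dir=None` meaning "outside a project".

```python
import os
import unittest


def fake_data_path(path, project_dir=None):
    """Hypothetical stand-in: project_dir=None means 'outside a project'."""
    if os.path.isabs(path):
        # Absolute paths are returned untouched in either context.
        return path
    base = os.path.join(project_dir, '.scrapy') if project_dir else '.scrapy'
    return os.path.join(base, path)


class AbsolutePathTest(unittest.TestCase):
    def test_absolute_path_unchanged_in_both_contexts(self):
        absolute = os.path.abspath('somewhere')
        # Run the identical check outside (None) and inside a project.
        for project_dir in (None, os.path.abspath('proj')):
            with self.subTest(project_dir=project_dir):
                self.assertEqual(absolute,
                                 fake_data_path(absolute, project_dir))
```

Using `subTest` keeps the two contexts in one test while still reporting which one failed.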
Update 1.2 release notes with data_path changes from #1581
I ran into an issue trying to enable cache with scrapy runspider: the middleware gets disabled because it tries to create the httpcache directory inside a project data directory.

The utils.project.data_path currently assumes that you're inside the project dir (maybe reasonably, since it's in the project module?).

I know this solution breaks compatibility a little bit, but IMO it would be solving a bug.

Btw, I noticed that this data_path() function is only being used by the HttpCacheMiddleware -- does it make sense to keep it as an util function?