Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] Make data_path work when outside project (used by HttpCacheMiddleware and Deltafetch plugin) #1581

Merged
merged 4 commits into from Sep 30, 2016

Conversation

eliasdorneles
Copy link
Member

I ran into an issue trying to enable cache with scrapy runspider: the middleware gets disabled because it tries to create the httpcache directory inside a project data directory.

The utils.project.data_path currently assumes that you're inside the project dir (maybe reasonably, since it's in the project module?).

I know this solution breaks compatibility a little bit, but IMO it would be solving a bug.

Btw, I noticed that this data_path() function is only being used by the HttpCacheMiddleware -- does it make sense to keep it as an util function?

@codecov-io
Copy link

codecov-io commented Nov 3, 2015

Current coverage is 83.31% (diff: 100%)

Merging #1581 into master will increase coverage by 0.10%

@@             master      #1581   diff @@
==========================================
  Files           161        161          
  Lines          8717       8719     +2   
  Methods           0          0          
  Messages          0          0          
  Branches       1283       1284     +1   
==========================================
+ Hits           7253       7264    +11   
+ Misses         1215       1204    -11   
- Partials        249        251     +2   

Powered by Codecov. Last update 86eb865...2d932c1

@jdemaeyer
Copy link
Contributor

Hey Elias,

the http cache should definitely work with relative paths even for stand-alone spiders, so +1 for implementing this.

I know this solution breaks compatibility

Hmm how's that? Is there code that relies on data_path() raising an exception if not inside a project?

Btw, I noticed that this data_path() function is only being used by the HttpCacheMiddleware -- does it make sense to keep it as an util function?

It's a little misnamed but I think it makes sense as a utility function. "Give me this path relative to the project if I am in one" seems quite re-usable.

@eliasdorneles
Copy link
Member Author

Hmm how's that? Is there code that relies on data_path() raising an exception if not inside a project?

None in Scrapy itself, I meant it as a theoretical argument, since we can't say for sure that no user is using it. :)

It's a little misnamed but I think it makes sense as a utility function. "Give me this path relative to the project if I am in one" seems quite re-usable.

+1, so if we agree it's okay to change this function, maybe we can even rename it. Perhaps project_path_join or something.

@eliasdorneles
Copy link
Member Author

Btw, my current workaround to be able to use -s HTTPCACHE_ENABLED=1 with scrapy runspider is to create an empty scrapy.cfg file to trick Scrapy to think it's inside a project...

@Digenis
Copy link
Member

Digenis commented Nov 6, 2015

I know this solution breaks compatibility

Hmm how's that? Is there code that relies on data_path() raising an exception if not inside a project?

data_path() raises NotConfigured
which then bubbles up to the httpcache __init__.
This in effect disables caching in the absence of a data project path.
I assume people do their testing using the httpcache
and then they expect it not to be disabled in production (scrapyd).

If this change is applied then such projects can suddenly start caching.

I'd rather see this follow the API stability policy
and have a deprecation warning first.
When someone configures a relative cache dir without a project,
data_path(), before raising NotConfigured, it should warn
that such relative path will be being resolved by the next stable release.

@jdemaeyer
Copy link
Contributor

Hm, I'm not sure I can follow you @Digenis

If this change is applied then such projects can suddenly start caching.

This PR will have no impact on http caching in projects. Scrapy will always cache as soon as HTTPCACHE_ENABLED is set to True, independent of whether HTTPCACHE_DIR is a relative or an absolute path (and independent of whether that paths exists before running the spider). When people don't want the http cache in their production environment, they need to set HTTPCACHE_ENABLED = False in that environment's local_settings.py.

This PR will only affect the http cache behaviour for stand-alone spiders (those that you run through scrapy runspider instead of scrapy crawl), in that they can now also use relative paths. That someone explicitly sets HTTPCACHE_ENABLED = True in their Spider.custom_settings and then finds the http cache disabled (because they're not in a project) is definitely a bug and not part of an API in my eyes.

@Digenis
Copy link
Member

Digenis commented Nov 6, 2015

I typed the wrong word at some point, see edits.

I use scrapyd for deployment.
scrapy.utils.conf uses os.path
and scrapyd-deploy never added scrapy.cfg as a package resource to the generated setup.py.
So the most popular way to deploy projects (scrapyd) has been "project-less" so far.

If I try HTTPCACHE_DIR = './httpcache/' + some stuff
2015-11-06 22:30:19 [scrapy] WARNING: Disabled HttpCacheMiddleware: Unable to find scrapy.cfg file to infer project data dir
Yes, it is a warning that should worry the project maintainer
but it speaks only about being outside of a project
not misusage of the middleware.

To cache without scrapy.cfg
I can either pass to scrapy -s HTTPCACHE_DIR="$(readlink -f .)/httpcache"
or in settings: HTTPCACHE_DIR = os.path.join(os.getcwd(), 'httpcache')

I do think it's fair for people
to be able to define a cache directory as simply as possible in a project-less spider
but I expect such a change to introduce special pitfalls
for someone who moves a spider in or out of a project.
I think the cache <-> no cache problem is better than cache on ./ <-> cache on ../../.scrapy/.
Especially with a misnamed function (should have been more like "project_data_dir")
and a documentation (for httpcache) that doesn't clearly states where one is expected to find the cache.

However I am aware of the issue and I don't have affected code
so this all is just my 2 cents.

@kmike
Copy link
Member

kmike commented Feb 3, 2016

@Digenis so the backwards-incompatible change users can face in practice is that with HTTPCACHE_ENABLED=True cache will be enabled when using scrapyd? Did I get this right?

@Digenis
Copy link
Member

Digenis commented Feb 3, 2016

Exactly.

@kmike
Copy link
Member

kmike commented Feb 3, 2016

A nice catch about scrapyd. It sounds like a bug fix though; I think we should make this change, but add a note about it.

@kmike kmike mentioned this pull request Feb 17, 2016
@kmike
Copy link
Member

kmike commented Sep 9, 2016

ping! I think the PR is almost ready, and it fixes an important problem - you can't use http cache with CrawlerRunner or CrawlerProcess without a Scrapy project (right?)

@eliasdorneles
Copy link
Member Author

Right, I still think that this is a bug fix.
It's not impossible to use http cache with them, there is an ugly workaround I mentioned above (creating a scrapy.cfg file in the current/root directory).

I agree with @jdemaeyer, I think it's reasonable to expect that if HTTPCACHE_ENABLED=True Scrapy should try to use the cache and I'm not keen on doing a cycle of deprecation just for this change.
Is there another way we could alleviate the problem for that case, like logging a message or something?

@eliasdorneles
Copy link
Member Author

@redapple @kmike do you think if we document the backward incompatibility, we could get this in 1.2 as well?

@redapple
Copy link
Contributor

@eliasdorneles , I'm ok with including this for 1.2
I don't know if I can explain the backwards incompatiblity myself, can you give it a shot for #2216 ?
Also, can this feature have a test for it?

@redapple redapple added this to the v1.2 milestone Sep 22, 2016
@eliasdorneles
Copy link
Member Author

Sure, I'll see if I can add a test and describe in a comment on #2216 (@redapple the branch is on your repo, I'm unable to commit on it).

@eliasdorneles eliasdorneles force-pushed the fix-util-function-to-work-outside-project-dir branch from 472c02f to 8e4947e Compare September 29, 2016 13:17
@eliasdorneles eliasdorneles changed the title Make HttpCacheMiddleware work when outside project Make data_path work when outside project (used by HttpCacheMiddleware and Deltafetch plugins) Sep 29, 2016
@eliasdorneles eliasdorneles changed the title Make data_path work when outside project (used by HttpCacheMiddleware and Deltafetch plugins) Make data_path work when outside project (used by HttpCacheMiddleware and Deltafetch plugin) Sep 29, 2016
@eliasdorneles
Copy link
Member Author

So, @redapple and I wrote tests for this, and while we were at it we found that Deltafetch is also uding that data_path function -- so this fix will apply to that as well (PR title updated to reflect it).

Also, this will fix cache for scrapy shell besides scrapy runspider (consistent with the current behavior scrapy shell working inside a project if the setting is enabled).
Thanks @redapple for the help 👍

@redapple redapple changed the title Make data_path work when outside project (used by HttpCacheMiddleware and Deltafetch plugin) [MRG+1] Make data_path work when outside project (used by HttpCacheMiddleware and Deltafetch plugin) Sep 30, 2016
@eliasdorneles
Copy link
Member Author

Something I should mention, we revised the change in data_path to also prepend .scrapy when running outside a project.

eliasdorneles added a commit that referenced this pull request Sep 30, 2016
if inside_project():
path = join(project_data_dir(), path)
else:
path = join('.scrapy', path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so, before it raised NotConfigured and now it creates a directory in CWD/.scrapy/<path>.

And the goal is to allow httpcache and deltafetch (and similars) to run even outside a scrapy project.

It is a bit backwards incompatible, scripts running in readonly filesystem now will fail because directory can't be created.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, it's indeed a bit backwards incompatible, there is a PR to mention it in the release notes: #2298

About the case of scripts running in readonly filesystem, is that a huge concern?
It looks minor to me, because for the breaking change to happen the script would need to be 1) running scrapy outside a project and 2) using the settings to enabling http cache or some other plugin that would try to use the data dir (in which case, the expected behavior would still be for the settings to be used and any existing data previously stored in .scrapy to be used).

So, still a bug fix, don't you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point. Looks good then.

with inside_a_project() as proj_path:
expected = os.path.join(proj_path, '.scrapy', 'somepath')
self.assertEquals(expected, data_path('somepath'))
self.assertEquals('/absolute/path', data_path('/absolute/path'))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is the test for abs path inside project only?

If it makes a difference then it should be tested out and inside.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, lemme add it to the test outside project as well!

@dangra dangra merged commit bca374d into master Sep 30, 2016
@dangra dangra deleted the fix-util-function-to-work-outside-project-dir branch September 30, 2016 18:23
dangra added a commit that referenced this pull request Sep 30, 2016
Update 1.2 release notes with data_path changes from #1581
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants