
[MRG+1] Fix for #685 Add Google Cloud Storage Feed Export #3608

Merged
20 commits merged into scrapy:master on Jul 16, 2020

Conversation

ejulio
Contributor

@ejulio ejulio commented Feb 1, 2019

Hey.
This is just a draft to add GCS feed export.
I followed the same behavior as the FilesPipeline, so I'm just using the GCS_PROJECT_ID setting, and the developer must provide the credentials through environment variables (see https://googleapis.github.io/google-cloud-python/latest/core/config.html#authentication for more information).

About the credentials above, I think we could also add a new setting with a path to the credentials file and initialize the client using https://googleapis.github.io/google-cloud-python/latest/core/auth.html#service-accounts. I think this might be easier than setting environment variables when deploying.
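
A minimal usage sketch under this draft's behavior (project id, bucket and path below are hypothetical; note the scheme is changed to gs:// later in this PR):

# settings.py -- sketch only, assuming the draft's gcs:// scheme and the GCS_PROJECT_ID setting.
# Credentials are picked up from the environment, e.g.:
#   export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
GCS_PROJECT_ID = 'my-gcp-project'
FEED_URI = 'gcs://my-bucket/exports/%(name)s-%(time)s.csv'
FEED_FORMAT = 'csv'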

Besides your thoughts, the list of to-dos is:

  • Add documentation
  • Add gcs schema
  • Add bucket policy options

Please note that I went with unit tests here, but I'm open to the idea of integration tests. Do you have any thoughts?

Thanks for your time reviewing this PR 😄

(edit) Closes #685

@codecov

codecov bot commented Feb 1, 2019

Codecov Report

Merging #3608 into master will increase coverage by 0.03%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #3608      +/-   ##
==========================================
+ Coverage   84.80%   84.83%   +0.03%     
==========================================
  Files         163      163              
  Lines        9987    10014      +27     
  Branches     1487     1487              
==========================================
+ Hits         8469     8495      +26     
  Misses       1254     1254              
- Partials      264      265       +1     
Impacted Files Coverage Δ
scrapy/extensions/feedexport.py 85.71% <100.00%> (+1.09%) ⬆️
scrapy/settings/default_settings.py 98.72% <100.00%> (+0.01%) ⬆️
scrapy/utils/test.py 54.11% <100.00%> (+4.76%) ⬆️
scrapy/utils/trackref.py 82.85% <0.00%> (-2.86%) ⬇️


class GCSFeedStorage(BlockingFeedStorage):

    project_id = None
Contributor

@victor-torres victor-torres Feb 4, 2019

It doesn't seem right to store this information as class attributes.
Shouldn't you be defining those values in the __init__ method?

Contributor Author

@ejulio ejulio Feb 4, 2019

You mean to make it private?
I don't see any difference in:

class A:
    attribute = None
    def __init__(self, a):
        self.attribute = a

class A:
    def __init__(self, a):
        self.attribute = a

Actually, I prefer the first one because it makes things explicit (there is an attribute called attribute).

Contributor

@victor-torres victor-torres Feb 4, 2019

No, I mean to declare them as instance attributes instead of class attributes.
This is how we are doing in S3FeedStorage, for example.

Class attributes are really useful when storing constants or helper values, things that are shared across multiple instances of the same class.

In this case, I don't think you should reuse the storage settings.

In worst-case scenarios, it can also lead to undesirable mutability: mutating a shared (mutable) class attribute through one instance changes the value seen by every other instance.

In a nutshell, you need those values in order to initialize the instance, so there's no reason to keep an unused default value as a class attribute.
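
A small illustration of that pitfall (not code from this PR, just a sketch of why shared mutable class attributes are risky):

class Storage:
    options = []                      # class attribute, shared by all instances

    def add_option(self, opt):
        self.options.append(opt)      # mutates the shared list in place

a, b = Storage(), Storage()
a.add_option('acl')
print(b.options)                      # ['acl'] -- b sees the change made through a

class SaferStorage:
    def __init__(self):
        self.options = []             # instance attribute, one list per instance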

client = Client(project=self.project_id)
bucket = client.get_bucket(self.bucket_name)
blob = bucket.blob(self.blob_name)
blob.upload_from_file(file)
Contributor

@victor-torres victor-torres Feb 4, 2019

Could you check for optional ACL/policy options for the bucket? I believe we have something similar in the media pipelines.
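
For reference, a rough sketch of how a predefined ACL could be passed at upload time (project, bucket, path and ACL value below are placeholders, not part of this PR):

from google.cloud.storage import Client

client = Client(project='my-gcp-project')
bucket = client.get_bucket('my-bucket')
blob = bucket.blob('exports/items.csv')
with open('items.csv', 'rb') as f:
    # google-cloud-storage accepts a predefined ACL on upload,
    # e.g. 'publicRead' or 'projectPrivate'.
    blob.upload_from_file(f, predefined_acl='publicRead')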

Contributor Author

@ejulio ejulio Feb 4, 2019

👍 Added to the todo list

@ejulio
Contributor Author

ejulio commented Feb 8, 2019

@victor-torres and @stav:
Just updated the PR with some of your suggestions.

@ejulio ejulio changed the title WIP: Fix for #685 Add Google Cloud Storage Feed Export Fix for #685 Add Google Cloud Storage Feed Export Feb 9, 2019
tests/test_feedexport.py (review thread resolved)
Contributor

@victor-torres victor-torres left a comment

Looks good to me.

@ejulio
Contributor Author

ejulio commented Feb 11, 2019

@dangra, @kmike and @lopuhin, can I have a word from you regarding this PR?
Especially the integration/unit test part :D

@victor-torres
Contributor

victor-torres commented Feb 26, 2019

@ejulio, I believe we need to cover the GOOGLE_APPLICATION_CREDENTIALS environment variable and the Service Account credentials in JSON format provided by Google Cloud.

I believe you have tested your scripts locally, but we'll need to specify the credentials somehow when deploying to remote platforms such as Scrapy Cloud.

We still need to define how to pass these credentials using spider settings. One suitable way would be to encode them using Base64 (or a similar encoding) and define that value as a setting in Scrapy Cloud. At runtime we would decode the Base64 value, write it to a file, and point the environment variable at that file. This would avoid having to version the credentials file.
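
A rough sketch of that idea (the GCS_CREDENTIALS_B64 setting name below is hypothetical, not something this PR adds):

import base64
import os
import tempfile

def apply_gcs_credentials(settings):
    encoded = settings.get('GCS_CREDENTIALS_B64')   # hypothetical setting
    if not encoded:
        return
    # Write the decoded service-account JSON to a temporary file and point
    # the Google client libraries at it via the standard env variable.
    fd, path = tempfile.mkstemp(suffix='.json')
    with os.fdopen(fd, 'wb') as f:
        f.write(base64.b64decode(encoded))
    os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = path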

What do you think?

@ejulio
Contributor Author

ejulio commented Feb 26, 2019

@victor-torres , I thought about it.
But since GCS for IMAGES_STORE currently only works by reading the credentials from a JSON file, I kept the same behavior.
I think this is something we can work on, but I'm not sure it belongs in this PR, since it would make sense to make this authentication option available for IMAGES_STORE as well.

* URI scheme: ``gcs``
* Example URIs:

  * ``gcs://mybucket/path/to/export.csv``
Contributor

@victor-torres victor-torres Feb 26, 2019

gsutil uses gs:// as the URI prefix. See here: https://cloud.google.com/storage/docs/gsutil.

Contributor Author

@ejulio ejulio Feb 27, 2019

That's a good one.
For the media pipelines it is also gs: https://docs.scrapy.org/en/latest/topics/media-pipeline.html#id1
I'll change it 😄

Contributor Author

@ejulio ejulio Feb 27, 2019

@victor-torres , I updated the scheme!

scrapy/extensions/feedexport.py Show resolved Hide resolved
@victor-torres
Contributor

victor-torres commented Feb 28, 2019

I'll take a better look tomorrow. Thanks for updating.

Contributor

@victor-torres victor-torres left a comment

LGTM

@ejulio
Contributor Author

ejulio commented Mar 6, 2019

@dangra can you review this PR as well?

docs/topics/feed-exports.rst (review thread resolved)
docs/topics/settings.rst (outdated; review thread resolved)
@Gallaecio Gallaecio changed the title Fix for #685 Add Google Cloud Storage Feed Export [MRG+1] Fix for #685 Add Google Cloud Storage Feed Export Mar 22, 2019
scrapy/settings/default_settings.py (outdated; review thread resolved)
@alepuccetti

alepuccetti commented Nov 4, 2019

Any update on this feature?

@PabloReyes

PabloReyes commented Feb 9, 2020

Great work on this PR. Is there any update on it?

@ejulio
Contributor Author

ejulio commented Feb 10, 2020

@kmike @Gallaecio , should I fix the conflicts to merge it?

@Gallaecio
Member

Gallaecio commented Feb 10, 2020

Solved. I don't think it was blocking review, though. There's just lots of stuff to review 😅

@PabloReyes

PabloReyes commented Feb 12, 2020

Thanks @Gallaecio and @ejulio! There seem to be some issues now with tests.mock. I've seen in other places you're using unittest.mock and that seems to work OK. I don't know whether they are two completely different libraries with different purposes, though.

@Gallaecio
Member

Gallaecio commented Feb 12, 2020

It looks like, since 92ffd2f#diff-498cf53d35427897613cdfc4b76fc6ea, mock should be imported from unittest directly in each test file, instead of importing it from tests/__init__.py.
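
That is, each test module would import it directly, along the lines of:

# in each test module, instead of importing mock from tests/__init__.py:
from unittest import mock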

@ejulio
Contributor Author

ejulio commented Feb 12, 2020

I can work on the fixes, but only if it moves forward to be merged.
As it's been flagged for merging for 6+ months...

@Gallaecio
Member

Gallaecio commented Feb 12, 2020

That's why I did not want to bother you, there's no guarantee it will get merged any sooner 🙁

@elacuesta
Member

elacuesta commented Mar 19, 2020

Hi @ejulio, this is awesome work 🚀

I was trying this myself but I was getting the following exception:

Traceback (most recent call last):
  File "/.../scrapy/venv-scrapy/bin/scrapy", line 11, in <module>
    load_entry_point('Scrapy', 'console_scripts', 'scrapy')()
  File "/.../scrapy/scrapy/cmdline.py", line 145, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/.../scrapy/scrapy/cmdline.py", line 99, in _run_print_help
    func(*a, **kw)
  File "/.../scrapy/scrapy/cmdline.py", line 153, in _run_command
    cmd.run(args, opts)
  File "/.../scrapy/scrapy/commands/runspider.py", line 87, in run
    self.crawler_process.crawl(spidercls, **opts.spargs)
  File "/.../scrapy/scrapy/crawler.py", line 176, in crawl
    crawler = self.create_crawler(crawler_or_spidercls)
  File "/.../scrapy/scrapy/crawler.py", line 209, in create_crawler
    return self._create_crawler(crawler_or_spidercls)
  File "/.../scrapy/scrapy/crawler.py", line 214, in _create_crawler
    return Crawler(spidercls, self.settings)
  File "/.../scrapy/scrapy/crawler.py", line 64, in __init__
    self.extensions = ExtensionManager.from_crawler(self)
  File "/.../scrapy/scrapy/middleware.py", line 53, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "/.../scrapy/scrapy/middleware.py", line 35, in from_settings
    mw = create_instance(mwcls, settings, crawler)
  File "/.../scrapy/scrapy/utils/misc.py", line 146, in create_instance
    return objcls.from_crawler(crawler, *args, **kwargs)
  File "/.../scrapy/scrapy/extensions/feedexport.py", line 242, in from_crawler
    o = cls(crawler.settings)
  File "/.../scrapy/scrapy/extensions/feedexport.py", line 227, in __init__
    if not self._storage_supported(self.urifmt):
  File "/.../scrapy/scrapy/extensions/feedexport.py", line 309, in _storage_supported
    self._get_storage(uri)
  File "/.../scrapy/scrapy/extensions/feedexport.py", line 328, in _get_storage
    return self._get_instance(self.storages[urlparse(uri).scheme], uri)
  File "/.../scrapy/scrapy/extensions/feedexport.py", line 322, in _get_instance
    *args, **kwargs)
  File "/.../scrapy/scrapy/utils/misc.py", line 150, in create_instance
    return objcls(*args, **kwargs)
TypeError: __init__() missing 2 required positional arguments: 'project_id' and 'acl'

I took the liberty of merging master into this branch to solve the conflict in tests/test_feedexport.py, and now I'm able to upload files just fine 👌.
Not sure exactly what happened there, I didn't dig too much, perhaps something related to #3858.

Besides that, the new tests are not being executed in Travis, since google-cloud-storage is not installed. Could you update that? Thanks!

wRAR
wRAR approved these changes Jun 19, 2020
@wRAR
Contributor

wRAR commented Jun 19, 2020

@ejulio I'm ready to merge this after you add google-cloud-storage to tests/requirements-py3.txt (I tried to push to your branch, but it looks like GitHub doesn't allow me to).
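
For illustration, that change is roughly a one-line addition (unpinned here; the actual requirements file may pin a version):

# tests/requirements-py3.txt
google-cloud-storage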

@elacuesta
Member

elacuesta commented Jun 20, 2020

I added google-cloud-storage to the "pinned" tox environment, and all tests pass. However, the docs build fails because the GCS_PROJECT_ID setting is already defined in docs/topics/media-pipeline.rst.

@wRAR
Contributor

wRAR commented Jun 26, 2020

Should we merge this? I think everything is fine now.

@kmike kmike merged commit 07470e1 into scrapy:master Jul 16, 2020
2 checks passed
Development

Successfully merging this pull request may close these issues.

Google Cloud Storage Support (Storage backends)