
S3FeedStorage supports custom endpoint of object storage. #4998

Merged
merged 3 commits into scrapy:master from the s3-feed-exporter branch on Jul 14, 2021

Conversation

@bkayranci (Contributor) commented Feb 21, 2021

The Scrapy documentation says that Scrapy supports other object storages via AWS_ENDPOINT_URL. Unfortunately, a custom object storage endpoint is not supported in the feed exporter.

Related Links

@bkayranci force-pushed the s3-feed-exporter branch 3 times, most recently from 23c6390 to 2683e94 on February 21, 2021 15:43
@codecov codecov bot commented Feb 21, 2021

Codecov Report

Merging #4998 (df9ffb8) into master (016c7e9) will decrease coverage by 4.10%.
The diff coverage is 100.00%.

❗ Current head df9ffb8 differs from pull request most recent head 44ac8af. Consider uploading reports for the commit 44ac8af to get more accurate results

```diff
@@            Coverage Diff             @@
##           master    #4998      +/-   ##
==========================================
- Coverage   88.19%   84.09%   -4.11%
==========================================
  Files         162      162
  Lines       10497    10498       +1
  Branches     1517     1517
==========================================
- Hits         9258     8828     -430
- Misses        965     1409     +444
+ Partials      274      261      -13
```
| Impacted Files | Coverage Δ |
|---|---|
| scrapy/extensions/feedexport.py | 91.03% <100.00%> (-4.29%) ⬇️ |
| scrapy/core/http2/stream.py | 27.01% <0.00%> (-64.37%) ⬇️ |
| scrapy/pipelines/images.py | 28.07% <0.00%> (-62.29%) ⬇️ |
| scrapy/core/http2/agent.py | 36.14% <0.00%> (-60.25%) ⬇️ |
| scrapy/core/downloader/handlers/http2.py | 43.42% <0.00%> (-56.58%) ⬇️ |
| scrapy/core/http2/protocol.py | 34.17% <0.00%> (-49.25%) ⬇️ |
| scrapy/utils/ssl.py | 53.65% <0.00%> (-17.08%) ⬇️ |
| scrapy/utils/asyncgen.py | 83.33% <0.00%> (-16.67%) ⬇️ |
| scrapy/core/downloader/contextfactory.py | 75.92% <0.00%> (-11.12%) ⬇️ |
| scrapy/utils/test.py | 50.00% <0.00%> (-10.94%) ⬇️ |

... and 7 more

@Gallaecio (Member) left a comment

Thanks!

Review comment on docs/topics/feed-exports.rst (outdated, resolved)
@edgarrmondragon commented

Hi @Gallaecio. Is something blocking the merge on this one? If so, perhaps I can address any issues in case @bkayranci is not available 😃

@Gallaecio Gallaecio merged commit fcc6bec into scrapy:master Jul 14, 2021
@Gallaecio (Member) commented

It looks good to me, and @wRAR's feedback has been addressed, so I think we can merge.

@lljr commented Aug 26, 2021

The documentation lacks usage info... do I need only AWS_ENDPOINT_URL set to my S3-like URL in the settings.py file? I guess I'll need the AWS secret and access key variables as well?

@bkayranci (Contributor, Author) commented

> The documentation lacks usage info... do I need only AWS_ENDPOINT_URL set to my S3-like URL in the settings.py file? I guess I'll need the AWS secret and access key variables as well?

The documentation covers the access and secret keys; please follow the documentation for the access keys.

I'm sharing an example usage of the endpoint_url option with you. The changes in PR #4998 pass the variable to the function via kwargs.
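
For illustration, a simplified sketch of how an endpoint_url kwarg can be forwarded to the botocore client; this is not the exact Scrapy code, and the credential values and endpoint below are placeholders:

```python
import botocore.session

# Simplified sketch: the feed storage builds a botocore S3 client and forwards
# extra kwargs such as endpoint_url to it (all values below are placeholders).
session = botocore.session.get_session()
client = session.create_client(
    's3',
    aws_access_key_id='my-access-key',
    aws_secret_access_key='my-secret-key',
    endpoint_url='https://nyc3.digitaloceanspaces.com',  # custom S3-compatible endpoint
)
```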

@bkayranci (Contributor, Author) commented

> Thanks for the response @bkayranci ... It's still not clear how this all works together. I'm aware that the access key and the secret can be set in settings.py. I'm not sure how to set up AWS_ENDPOINT_URL. Should this be set in settings.py, and how?
>
> The reason for my question is that I tried using Linode's Object Storage. I set AWS_ENDPOINT_URL to Linode's object storage URL. After I deploy, the file is not created...
>
> Do I need to manually set boto with AWS_ENDPOINT_URL? I ask because the function linked in your response contains several parameters, including the ones for the access key and secret...
>
> How does it all tie together? I've never used boto and it's barely mentioned in the documentation.
>
> I'm asking how this all works together because it's confusing to set the access key and secret in a variable in settings.py but then see that there are parameters for these same ones in create_client.

You can set the AWS_ENDPOINT_URL variable in settings.py.

You can find how to set the URL in the documentation of your cloud provider. For example, DigitalOcean provides an S3-compatible API.

I set the AWS_ENDPOINT_URL variable in settings.py as follows:

```python
## settings.py
...
AWS_ENDPOINT_URL = 'https://nyc3.digitaloceanspaces.com'
...
```

Now you are able to use it as a storage backend (see the docs).
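
For completeness, a minimal settings.py sketch that combines the endpoint with the access keys and a FEEDS entry; the key values, bucket name, and path below are placeholders:

```python
## settings.py (minimal sketch; all values below are placeholders)
AWS_ACCESS_KEY_ID = 'my-access-key'
AWS_SECRET_ACCESS_KEY = 'my-secret-key'
AWS_ENDPOINT_URL = 'https://nyc3.digitaloceanspaces.com'

FEEDS = {
    's3://mybucket/path/to/export.csv': {
        'format': 'csv',
    },
}
```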

@lljr commented Aug 30, 2021

@bkayranci Thanks for the response. My bad, I deleted the old comment because I thought I had solved the issue. Anyway, I'm not able to post to the custom endpoint. I was able to trigger the action when I set the S3 URI in FEEDS. However, I got this error: botocore.exceptions.ClientError: An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.

How do I trigger the post to my custom endpoint? Do I set the URI in FEEDS with the path to the file I want to store?

@bkayranci (Contributor, Author) commented

> @bkayranci Thanks for the response. My bad, I deleted the old comment because I thought I had solved the issue. Anyway, I'm not able to post to the custom endpoint. I was able to trigger the action when I set the S3 URI in FEEDS. However, I got this error: botocore.exceptions.ClientError: An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.
>
> How do I trigger the post to my custom endpoint? Do I set the URI in FEEDS with the path to the file I want to store?

I did not try setting the endpoint in the URI. My URI includes only the bucket name and the file path, like "s3://mybucket/path/to/export.csv".

Also, are you sure that your environment includes the changes of #4998?

@lljr commented Aug 30, 2021

> Also, are you sure that your environment includes the changes of #4998?

Do you mean if I'm up to date? I'm using v2.5.0...

> I did not try setting the endpoint in the URI. My URI includes only the bucket name and the file path, like "s3://mybucket/path/to/export.csv".

Yes, that's what I mean. When I do this it triggers the post, but it errors with an incorrect access key. I'll try to make a new bucket and use its new keys instead. So far this is what I have and it works, but it errors with the last error I just mentioned:

```python
import configparser
import os

config = configparser.ConfigParser()
config_file = os.path.join(os.path.dirname(__file__), 'env.conf')
config.read(config_file)
AWS_ACCESS_KEY_ID = config['credentials']['AWS_ACCESS_KEY_ID']
AWS_SECRET_ACCESS_KEY = config['credentials']['AWS_SECRET_ACCESS_KEY']
AWS_ENDPOINT_URL = 'my-linode-bucket-test.us-southeast-1.linodeobjects.com'
AWS_USE_SSL = True
AWS_VERIFY = True

# HERE FOR DEBUGGING PURPOSES
DEPTH_LIMIT = 1
FEEDS = {
    # 'epidemiology-data.json': {
    #     'format': 'json',
    #     'encoding': 'utf-8',
    #     'indent': 4,
    #     'overwrite': True,
    # },
    's3://epidemiology/data.json': {
        'format': 'json',
        'encoding': 'utf-8',
        'indent': 4,
        'overwrite': True,
    },
}
```

@wRAR (Member) commented Aug 30, 2021

> Do you mean if I'm up to date? I'm using v2.5.0...

Then it indeed doesn't.

@lljr commented Aug 30, 2021

@wRAR how do I download the latest? I used python3 -m pip install git+https://github.com/scrapy/scrapy.git?

@bkayranci (Contributor, Author) commented Aug 30, 2021

> Do you mean if I'm up to date? I'm using v2.5.0...
>
> Then it indeed doesn't.

So, @lljr is sending to AWS with his secrets. @lljr, please make sure to remove (regenerate) your token; it could be leaked.

@wRAR Is there anything you expect from me so it can be included in the new version?

@lljr commented Aug 30, 2021

@bkayranci I didn't ... I read them from a local file

@bkayranci (Contributor, Author) commented

> @bkayranci I didn't ... I read them from a local file
>
> botocore.exceptions.ClientError: An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.

It means your endpoint was not passed to the function, so your requests were sent to AWS, and AWS could now know your token.
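
To debug, here is a minimal sketch that exercises the endpoint and credentials directly with botocore, outside Scrapy; the endpoint URL, bucket, and key below are assumptions/placeholders:

```python
import botocore.session

# Check the custom endpoint and credentials outside Scrapy.
# The endpoint URL, bucket and key below are placeholders/assumptions.
session = botocore.session.get_session()
client = session.create_client(
    's3',
    aws_access_key_id='my-access-key',
    aws_secret_access_key='my-secret-key',
    endpoint_url='https://us-southeast-1.linodeobjects.com',
)
# If this request still ends up at AWS, the endpoint_url is not being applied.
client.put_object(Bucket='mybucket', Key='test.json', Body=b'{}')
```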

@lljr commented Aug 30, 2021

@bkayranci Yes, OK, so maybe it's just that Linode is not compatible? I read here

[cluster-id] needs to be set, therefore my endpoint URL is wrong. I was able to fix this, and in the stacktrace I can see

```
2021-08-30 12:53:45 [botocore.auth] DEBUG: CanonicalRequest:
PUT
/epidemiology-ni/epidemiology-data-ni.json
```

but it errors with botocore.exceptions.ClientError: An error occurred (InvalidAccessKeyId) when calling the PutObject operation: Unknown. Maybe this is not compatible with Linode? Let me know your thoughts... BTW the access key is set (given by Linode).

@wRAR (Member) commented Aug 30, 2021

> how do I download the latest? I used python3 -m pip install git+https://github.com/scrapy/scrapy.git?

That should work.

> Is there anything you expect from me so it can be included in the new version?

No, it was merged and so will be included in the next release.

@lljr commented Aug 30, 2021

It works. I had to write the credentials into the S3 URI like in the documentation, like this: s3://aws_key:aws_secret@mybucket/path/to/export.csv.
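
A minimal settings.py sketch of that workaround, with the credentials embedded in the feed URI; the key, secret, bucket, path, and endpoint are placeholders:

```python
## settings.py (sketch of the URI-credentials workaround; all values are placeholders)
AWS_ENDPOINT_URL = 'https://us-southeast-1.linodeobjects.com'

FEEDS = {
    # Credentials are embedded directly in the feed URI.
    's3://aws_key:aws_secret@mybucket/path/to/export.csv': {
        'format': 'csv',
    },
}
```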

This won't work:

```python
AWS_ACCESS_KEY_ID = config['credentials']['AWS_ACCESS_KEY_ID']
AWS_SECRET_ACCESS_KEY = config['credentials']['AWS_SECRET_ACCESS_KEY']
```

Any idea why? Or at least, could you help by telling me how to debug? I used to have the AWS CLI installed on my computer. I deleted the ~/.aws folder since it's no longer in use. I think boto looks into this folder first... I'm not sure, but even after deleting this folder it would throw the same invalid secret key error.

EDIT: I tested my changes with scrapyd and it won't export the feed to the AWS bucket.

Successfully merging this pull request may close these issues.

Custom Endpoint for S3FeedStorage
S3 FeedStorage does not support custom S3 like storages