Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add FTPFileStore to FilesPipeline #3961

Merged
merged 20 commits into from Jan 24, 2020
Merged

Conversation

OmarFarrag
Copy link
Contributor

@OmarFarrag OmarFarrag commented Aug 18, 2019

In this PR, a FTPFileStore class is added to scrapy.pipelines.files to add support for saving files on FTP servers.

I added close_spider function to the FilesPipeline and from it call close_connection function of the store object, if exists. This allows opening the FTP connection only once at initialization and closing it when the spider is closed (This is different from scrapy.extensions.feedexports where the connection is opened and closed every time an item is exported).

There are two things I need feedback on them. First, if the username and password should be supplied in the file_store setting or have separated settings fields for them. Second thing if a more exception handling is needed when opening the connection.

After resolving these things, will write tests and documentation :)

CLOSES #3928

@codecov
Copy link

codecov bot commented Aug 18, 2019

Codecov Report

No coverage uploaded for pull request base (master@c544c0d). Click here to learn what that means.
The diff coverage is 89.52%.

@@            Coverage Diff            @@
##             master    #3961   +/-   ##
=========================================
  Coverage          ?   83.82%           
=========================================
  Files             ?      166           
  Lines             ?     9820           
  Branches          ?     1464           
=========================================
  Hits              ?     8232           
  Misses            ?     1335           
  Partials          ?      253
Impacted Files Coverage Δ
scrapy/dupefilters.py 98.11% <ø> (ø)
scrapy/commands/genspider.py 83.72% <ø> (ø)
scrapy/pipelines/media.py 97.29% <ø> (ø)
scrapy/pqueues.py 100% <ø> (ø)
scrapy/utils/display.py 31.25% <ø> (ø)
scrapy/spiders/init.py 100% <ø> (ø)
scrapy/commands/check.py 23.18% <ø> (ø)
scrapy/utils/template.py 100% <ø> (ø)
scrapy/statscollectors.py 92.15% <ø> (ø)
scrapy/exceptions.py 91.3% <ø> (ø)
... and 128 more

@webtekindo
Copy link

webtekindo commented Aug 19, 2019

def __init__(self, uri):
    assert uri.startswith('ftp://')
    u = urlparse(uri)
    self.ftp = FTP()
    self.ftp.connect(u.hostname, u.port or '21')
    self.ftp.login(u.username, u.password)
    self.basedir = u.path + '/'
    ftp_makedirs_cwd(self.ftp, self.basedir)

def persist_file(self, path, buf, info, meta=None, headers=None):
    buf.seek(0)
    filename = path.split('/')[-1]
    ftp_makedirs_cwd(self.ftp, self.basedir + '/'.join(path.split('/')[:-1]))
    return threads.deferToThread(
        self.ftp.storbinary,
        'STOR %s' % filename,
        buf
    )

I'm not sure if this is the best solution but something like that will allow to use custom file_path.

Copy link
Member

@Gallaecio Gallaecio left a comment

Looks promising. I’ve left a few comments.

Please, avoid unrelated line removals and indentation additions.

scrapy/pipelines/files.py Outdated Show resolved Hide resolved
scrapy/pipelines/files.py Outdated Show resolved Hide resolved
scrapy/pipelines/files.py Outdated Show resolved Hide resolved
@OmarFarrag
Copy link
Contributor Author

OmarFarrag commented Aug 19, 2019

@Gallaecio I have considered your comments and modified the code ..
I have a note, the ImagesPipeline inherits from the FilesPipeline and it implements from_settings method just the same except for getting the store_uri, I could add the ftp username and password extraction to that method as well, but I suggest creating one method in FilesPipeline that does this implementation and takes store identification parameter and this method would be called by the two from_settings methods of FilesPipeline and ImagesPipeline

scrapy/pipelines/files.py Outdated Show resolved Hide resolved
scrapy/pipelines/files.py Outdated Show resolved Hide resolved
@OmarFarrag
Copy link
Contributor Author

OmarFarrag commented Aug 19, 2019

@Gallaecio I do believe this implementation is nearly the same as scrapy.extensions.feedexport.FTPFeedStorage and may be refactored

Copy link
Member

@Gallaecio Gallaecio left a comment

Yes, it looks like you could create a function for persisting a file in scrapy.utils.ftp, and have it shared by both. It would also be great if you could implement FTP active mode support as well.

scrapy/pipelines/files.py Outdated Show resolved Hide resolved
Copy link
Member

@Gallaecio Gallaecio left a comment

Great work!

scrapy/pipelines/files.py Outdated Show resolved Hide resolved
scrapy/pipelines/files.py Outdated Show resolved Hide resolved
scrapy/utils/ftp.py Outdated Show resolved Hide resolved
@OmarFarrag
Copy link
Contributor Author

OmarFarrag commented Aug 22, 2019

@Gallaecio I added settings extraction in ImagesPipeline, just the same as in FilesPipeline, and I do also believe this could be refactored. In both cases, I will start with the tests and docs now.

@webtekindo
Copy link

webtekindo commented Aug 28, 2019

We very often get an error:

error_perm: 530 Sorry, the maximum number of clients (10) for this user are already connected.

Can we somehow keep only one FTP connection instead of connecting each time? That will also improve performance.

@Gallaecio
Copy link
Member

Gallaecio commented Aug 29, 2019

We could make the number of FTP clients configurable. Whether to do that as part of these changes or in a later pull request, I’m not sure.

@OmarFarrag
Copy link
Contributor Author

OmarFarrag commented Sep 6, 2019

@Gallaecio @webtekindo Okay, once this PR is merged, I will open another PR and add this option.

@Gallaecio I am working on the test now but there's something I wanna understand, the line uri = os.environ.get('ENVVAR'), how to know the value of the env var. If the test is to be written like for S3 and GCS, it would be FTP_TEST_FILE_URI , but in test_feedexport.FTPFeedStorageTest , FEEDTEST_FTP_URI is used.

@OmarFarrag OmarFarrag requested a review from Gallaecio Sep 6, 2019
@Gallaecio
Copy link
Member

Gallaecio commented Sep 16, 2019

I think FTP_TEST_FILE_URI is OK for this test.

Copy link
Member

@Gallaecio Gallaecio left a comment

Left a few comments on the documentation. Sorry for being very picky 😅

docs/topics/media-pipeline.rst Outdated Show resolved Hide resolved
docs/topics/media-pipeline.rst Outdated Show resolved Hide resolved
docs/topics/media-pipeline.rst Outdated Show resolved Hide resolved
docs/topics/media-pipeline.rst Outdated Show resolved Hide resolved
docs/topics/media-pipeline.rst Outdated Show resolved Hide resolved
docs/topics/media-pipeline.rst Outdated Show resolved Hide resolved
@OmarFarrag
Copy link
Contributor Author

OmarFarrag commented Sep 28, 2019

Left a few comments on the documentation. Sorry for being very picky 😅

Totally okay of course 😄 Actually it's first time working with .rst files so I kinda had no idea what I was doing 😆

docs/topics/media-pipeline.rst Outdated Show resolved Hide resolved
Co-Authored-By: Adrián Chaves <adrian@chaves.io>
wRAR
wRAR approved these changes Nov 12, 2019
scrapy/utils/ftp.py Show resolved Hide resolved
scrapy/utils/ftp.py Outdated Show resolved Hide resolved
scrapy/utils/ftp.py Outdated Show resolved Hide resolved
@ghost
Copy link

ghost commented Jan 8, 2020

+1

@ghost
Copy link

ghost commented Jan 8, 2020

Is there any option to get this feature before official merge?

Thanks

@Gallaecio
Copy link
Member

Gallaecio commented Jan 10, 2020

@sauraCactus You could install the corresponding branch of @OmarFarrag’s repository with the feature.

@kmike
Copy link
Member

kmike commented Jan 24, 2020

I haven't checked how it works, but the code looks good to me - merging, relying on @OmarFarrag to get it right :) Thanks @OmarFarrag, @Gallaecio and @wRAR!

@kmike kmike merged commit 5f407cf into scrapy:master Jan 24, 2020
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Save images on remote server
5 participants