Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add copy_if_newer parameter to functions in copy.py #22

Merged
merged 13 commits into from Mar 29, 2017

Conversation

gpcimino
Copy link

Added copy_if_newer parameter to all functions in copy.py.
Files are copied only if destination files don't exist or sources are newer then destination files.
Directories are always copied.
In case time of modification for files cannot be determined files are safely copied.

Moreover I added geturl method to osfs returning "file://" + self.getsyspath(path).

@lurch
Copy link
Contributor

lurch commented Feb 10, 2017

Can you explain what exactly "files are safely copied" means?
Also, I see that some of the code-comments refer to In case time of modification for destination files cannot be determined - does it also handle the case where the modification times of the source files can't be determined?
In some FTP clients I've used, there's an option to "overwrite if file-sizes are different" (in case the timestamps are unreliable) - would that be a useful fallback here?

@gpcimino
Copy link
Author

just wants to add that now cop_file, copy_dir and copy_fs have a return value containing the copied files

@willmcgugan
Copy link
Member

I can see why that would be useful, and the implementation looks good. But I'm not so keen on add this.

The main reason is the overloading of the copy methods. copy_if_newer may seem like small addition to the signature, but it is on equal footing with other flags that might be useful. And if we add those we end up with a function that is harder to test and maintain.

I think it would be best to keep copy_fs as a "do one thing and do it well" method. Which is admittedly a departure from FS1 which tried to pack in as much functionality as possible.

I'd like to suggest either implementing a new function (e.g. copy_fs_if_newer), or by using a more generic mechanism. A custom Walker may be able to do this.

For instance...

copy_fs(foo_fs, bar_fs, walker=NewerFilesWalker(foo_fs, bar_fs))

Always open to other viewpoints and I appreciate your contributions. :)

@gpcimino
Copy link
Author

@lurch: "files are safely copied" means that in case func _source_is_newer() cannot figure out the modification time of both files src, dst, the file is copied. Of course we can make this even more complex considering the file size as well. As first implementation i decided to consider the modification time only.

@gpcimino
Copy link
Author

@will: implementing a new function copy_fs_if_newer() will end up in code copy and paste which is bad.
An option could be to have 2 public functions (i know public/private in python is not enforced) copy_fs() and copy_fs_if_newer(). Those functions will call an unique private _copy_fs() which will have all the parameters (included copy_if_newer). This should simplify the exposed interface to the final user but avoid code copy and paste.

Using a custom walker could be a neat implementation, but from the user point of view won't be as clear as a simple function parameter like copy_if_newer.

@willmcgugan
Copy link
Member

@gpcimino There is another consideration that occurs to me. In _source_is_newer you call getinfo twice. Which is reasonable, but getinfo may require a request for network filesystems. Two requests per file copied would likely result in a slow copy.

A custom copy_fs and copy_dir could get all the info for each directory which would be more efficient.

I concur that if there is going to be a lot of common functionality then it would be worth factoring out a private function there.

Using a custom walker could be a neat implementation, but from the user point of view won't be as clear as a simple function parameter like copy_if_newer.

True, but maybe the copy_fs_if_newer could just invoke copy_fs(foo_fs, bar_fs, walker=NewerFilesWalker). Then we have a NewerFilesWalker we can use elsewhere. Just a suggestion...

@lurch
Copy link
Contributor

lurch commented Feb 10, 2017

"files are safely copied" means that in case func _source_is_newer() cannot figure out the modification time of both files src, dst, the file is copied.

Ahh, I guess in that case "In case time of modification for files cannot be determined files are always copied." would be much clearer.

Regarding @willmcgugan 's suggestions, perhaps you could combine both approaches into the best of both worlds, i.e.

def copy_fs_if_newer(src_fs, dst_fs, other_options...):
    return copy_fs(src_fs, dst_fs, walker=NewerFilesWalker(src_fs, dst_fs, other_options...))

Gives you a simple interface, and doesn't involve copy'n'pasting code :)

@lurch
Copy link
Contributor

lurch commented Feb 10, 2017

Haha, snap 😆

@gpcimino
Copy link
Author

gpcimino commented Feb 12, 2017

@lurch, @willmcgugan: if we decide to go for NewerFilesWalker implementation then how do we cope with the single file copy method?

def copy_file(src_fs, src_path, dst_fs, dst_path):
…

This method (obviously) doesn't get a walker as input parameter, so how do we handle the case copy_if_newer?

Thinking more generally, what should be the responsibility of a Walker object? IMHO it should descend (aka walk) one file system tree. It shouldn’t access external file systems objects to perform some kind of chekcs (e.g. check the file modification time).

IMHO, the proper position for the copy_if_newer flag is in the functions signatures inside the copy module (copy_file, copy_fs, etc). Those functions get 2 file systems in input, and they are properly positioned to execute actions on both those file systems (e.g. compare the modification time of a file).

So what is the solution?

A simple solution could be, use the (ugly) python features *args: copy_file will become:

def copy_file(src_fs, src_path, dst_fs, dst_path, *args):

where args can handle any parameter (e.g. copy_if_newer). IMHO this Python feature make code really unreadable, but in some scenario I have to admit is extremely powerful.

If we want to consider a more disruptive refactoring ;-) and if we go back to the old good times of OOP, some OOP guru (may be was Grady Booch?) was used to say: “if your problem gets more complex add a class!” So, let’s add a class. We can refactor the functions of the copy module in a copy class. Something like:

class copy(object):
	def __init__(self, src_fs, dst_fs, copy_if_newer=False, walker=None):
		self._src_fs = src_fs
		self._dst_fs = dst_fs
		self._walker = walker
		self._copy_if_newer=copy_if_newer

	def set_copy_if_newer(self, value):
		self._copy_if_newer = value

	def set_walker(self, walker):
		self._walker = walker

	def copy_file(self,src_path, dst_fs, dst_path):
		pass

	def copy_structure():
		pass

In this case we don’t have the problem of too many parameters for the copy_x functions anymore.
Moreover a class here makes a lot of sense, because the Copy object could be reused multiple time, in the same snip of code. Immagine something like that:

backup= Copy("osfs:///data/", "osfs:///storage/")
backup.copy_file("/logs/error.log", "/error/error.log")
backup.set_copy_if_newer(True)
backup.copy_file("/logs/access.log", "/access/access.log")
...

Moreover, we can immagie that in future the Copy class can growth to achieve rsync like functionalities. For instance we might want to implement a very useful mirroring (between filesystems) function. Or we might want to develop a copy_dir based on regex, where dst_path is calculated from src_path using a regex, thus implementing a path transformation (see below, the example of backup copy refactored in one line, code not tested!):

backup= Copy("osfs:///data/", "osfs:///storage/")
backup.copy_dir_with_transform("/", "\/.+\/(?P<type>.+)\.log, "/{type}/{type}.log")

I understand that this second scenario (refactor to a class) will require a lot of effort in refactoring tests and also documentation.
Please consider this only a Sunday morning rant ;-)

What do you think guys?

Ciao
GP

@gpcimino
Copy link
Author

@willmcgugan,

did you have the chance to think about this PR?
Thanks
GP

@willmcgugan
Copy link
Member

@gpcimino Sorry, should have replied earlier.

My thinking there is that it could be implemented in a fairly straightforward manner, without many changes. On reflection, an overloaded Walker may be more trouble than its worth.

Here's the current main loop in copy_dir.

    with manage_fs(src_fs, writeable=False) as src_fs:
        with manage_fs(dst_fs, create=True) as dst_fs:
            with src_fs.lock(), dst_fs.lock():
                dst_fs.makedir(_dst_path, recreate=True)
                for dir_path, dirs, files in walker.walk(src_fs, _src_path):
                    copy_path = combine(
                        _dst_path,
                        frombase(_src_path, dir_path)
                    )
                    for info in dirs:
                        dst_fs.makedir(
                            info.make_path(copy_path),
                            recreate=True
                        )
                    for info in files:
                        copy_file(
                            src_fs,
                            info.make_path(dir_path),
                            dst_fs,
                            info.make_path(copy_path)
                        )

If you changed the walker.walk to walker.info you have an info object for the modification times of the source filesystem. You could then call scandir on the destination filesystem to get info for the destination.

Then its just a matter of adding a condition around the copy_file method.

This has the advantage that you can get the info for the directories on the src and dst in one call, which is quite efficient. It will be much faster for FTP or network filesystems.

Does that make sense?

@gpcimino
Copy link
Author

Hi,

i agree with your comments. Just to be sure, with the implementation you suggested, what is the interface of the copy.py module? Shall we keep two public methods copy_dir and copy_dir_if_newer?
Is this true for the other methods:

def copy_file(...):
...
def copy_file_if_newer(..):
...
def copy_fs(...):
...
def copy_fs_if_newer(...):
...

Thanks
GP

@willmcgugan
Copy link
Member

Great!

I think copy_dir_if_newer and copy_fs_if_newer makes sense and has parity with the existing methods. I probably wouldn't implement copy_file_if_newer though unless you have a particular use case for it?

@coveralls
Copy link

coveralls commented Feb 27, 2017

Coverage Status

Coverage decreased (-0.7%) to 99.096% when pulling 1e4ace8 on gpcimino:feature/copy-if-newer into de3e3a9 on PyFilesystem:master.

@gpcimino
Copy link
Author

gpcimino commented Feb 27, 2017

Will,

here is the new PR. I tested on Win7 with Python 3.5.1 and all tests passed (also the new ones I added).
On Travis apparently one test is failing on Python version 2.7 and 3.5.
I will push some more tests to increase code coverage, which decerased due to lack of tests for copy_file_if_newer().

For the new implementation I tried to follow your suggestions. However, since I am not calling walker.walk anymore, the body of copy_dir_if_newer changed significantly. I would be grateful if you could have a look at the code, may be you can spot some nasty hidden bug!

I agree with you that copy_file_if_newer is not so important, if a client need to copy a single file he probably doesn’t need to use pyfilesystem2. However I left copy_file_if_newer method for symmetry. Form user prospective It could be useful in a scenario when a client is calling multiple operations on a given file system, such as a sequence of copy_dir and copy_file, in this case the single copy_if_newer could be handy. I would definitely leave it in!
Ciao
GP

@coveralls
Copy link

coveralls commented Feb 27, 2017

Coverage Status

Coverage decreased (-0.3%) to 99.413% when pulling c3e6fc0 on gpcimino:feature/copy-if-newer into de3e3a9 on PyFilesystem:master.

@willmcgugan
Copy link
Member

Thanks @gpcimino I'm in the middle of moving house, but I'll have a look at that very soon. First impressions are positive, but I may have some super-pedantic suggestions.

@gpcimino
Copy link
Author

gpcimino commented Mar 1, 2017

Will,

i still have to add the doc strings of the new methods. Once your (gently!) code review will be completed i can add the doc strings to the final version of the code.
GP

Copy link
Member

@willmcgugan willmcgugan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some queries for you.

fs/copy.py Outdated
return True
else:
nmsp = ('details', 'modified')
return src_fs.getinfo(src_path, nmsp).modified > dst_fs.getinfo(dst_path, nmsp).modified
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid that the modified attribute may be None if the filesystem doesn't store that information.

See https://pyfilesystem2.readthedocs.io/en/latest/info.html#details-namespace

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will ad a check just before the return statement, to catch the case modified == None

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

fs/copy.py Outdated
if not dst_fs.exists(dst_path):
return True
else:
nmsp = ('details', 'modified')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what nmsp stands for there. I'd like to avoid abbreviations wherever possible, unless there is a president. I don't mind verbose_variable_names.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nmsp stands for namespace. I used an abbrevition to keep line 201 under 100 chars.
I will change it to namespace.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

fs/copy.py Outdated
@@ -101,13 +155,14 @@ def copy_dir(src_fs, src_path, dst_fs, dst_path, walker=None):
in ``src_fs``. Set this if you only want to consider a sub-set
of the resources in ``src_fs``.
:type walker: :class:`~fs.walk.Walker`
:returns: a list with the copied files dst_path.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about this. Seems sensible, but a copy can potentially copy millions of files, which may be an issue on some platforms.

Do you have a use case for it? I'd be tempted to drop it unless there is a tangible need for it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I add this feature exactly to support my use case. Once you have copied some files, you may want to know which files you copied, to do something on top of them.
In my case, i have a data acquisition pipeline, where the copy file is just the first step. After the copy, i need to execute actions on the files, and of course know which files have been copies is super useful!

Said that, I didn’t think to the fact the list of copied files can grow enormously.
IMHO the best implementation would be to pass to copy_dir_* and to copy_fs_* a call back function. Then, is up to the user to decide what to do inside the callback. For instance one simple implementation could be just count the copied files and sum the total bytes copied. A second option could be to collect all the files in list (as I am doing now).

The signature of the callback functions could be same as copy file:

Def callback(src_fs, src_path, dst_fs, dst_path):
…

Do you like it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So would the callback actually do the copying? I'm thinking of something like this:

copy_fs(src_fs, dst_fst, copy_file=copy_file, walker=None)

With copy.copy_file as the default, functionality remains the same. But for your use case, you pass a specialised function that stores the copied files in a list. Is that what you had in mind?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is to keep the copy instaruction inside the copy_* method as usual.
The callback will be triggered only if the copy happens (e.g. if src is newer than dst).
Than is up to the user to define what to do in the callback.
If you look at my lastest commit in this PR branch, you can find the code already working.

Possible scenarios we can handle (code not tested):

  1. print what's going on:
def echo(src_fs, src_path, dst_fs, dst_path):
       src_url = open_fs(src_fs).geturl(src_path)
       dst_url = open_fs(dst_fs).geturl(dst_path)
       print("file " + src_url+ " copied to " +dst_url  )

copy_fs_if_newer(source_root, target_root, walker, callback=echo)

  1. counting:
num_file = 0 
def count(src_fs, src_path, dst_fs, dst_path):
     num_file += 1
     #we could also compute the total size of copied files in MB

copy_fs_if_newer(source_root, target_root, walker, callback=echo)

print("Copied " + str(num_file) + " files")
  1. collect all copied files:
files_copied = []
def accumulate(src_fs, src_path, dst_fs, dst_path):
    url = open_fs(dst_fs).geturl(dst_path)
    files_copied.append(url)

copy_fs_if_newer(source_root, target_root, walker, callback=accumulate)

for f in files_copied:
    #do something with f...

Do this make sense?
GP

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Works for me. Only suggestion I would have is to rename callback to a more descriptive on_copy.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

fs/copy.py Outdated

def copy_file_if_newer(src_fs, src_path, dst_fs, dst_path):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add docs?

Why return the dst_path? If its just to indicate copied or not copied, maybe a boolean would be sufficient.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment above! ;-)
No worries, I will add the docs to all functions once we agree on stable code base!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think maybe a boolean to indicate whether the file was copied or not. But let errors bubble?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, i will remove the try/except block

fs/copy.py Outdated
dst_fs.makedir(copy_path, recreate=True)
if copy_info.is_file:
if dir_path in dst_state:
if copy_info.modified <= dst_state[dir_path].modified:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment re modified attribute, which can be None.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

fs/copy.py Outdated
with manage_fs(dst_fs, create=True) as dst_fs:
with src_fs.lock(), dst_fs.lock():
dst_fs.makedir(_dst_path, recreate=True)
nmsp = ('details', 'modified')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More verbose variable names here please (nmsp, fp and i).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -302,6 +302,9 @@ def getsyspath(self, path):
sys_path = self._to_sys_path(path)
return sys_path

def geturl(self, path, purpose='download'):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This didn't occur to me to do, but I like it. It isn't clear from the docs (https://pyfilesystem2.readthedocs.io/en/latest/reference/base.html#fs.base.FS.geturl) but it should throw a NoURL for any value of purpose other than 'download'

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

self._write_file(dst_file1)

time.sleep(1) #sleep 1 sec to ensure dst_file1 is older
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any way to do this without the sleeps?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I add sleeps because on trevis, tests failed in unpredictive way. For some version of python some test falied once, then the second time they succced. This is strange... so i decided to add a sleep in case was a race on file object. But probably it is not, because the behaivoe is the same with or without sleep.

I will remove the sleep, still some test will fail, do you have an idea why?
Note that py3.5 tests are failing. On my workstation i use py35 on windows and tests are all green!

fs/copy.py Outdated
else:
nmsp = ('details', 'modified')
return src_fs.getinfo(src_path, nmsp).modified > dst_fs.getinfo(dst_path, nmsp).modified
except Exception as ex:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it is a good idea to catch all exceptions here. There are variety of errors that could occur here, and you risk interpreting them incorrectly...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please suggest a hierachy of exceptions to catch?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I see its a private method, so I'm probably being too pedantic. I guess if there is any error, you would want default to attempting a copy?

In that case, you might just want to change it to catch FSError which is the base for all filesystem errors.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, in case of problems (eg cannot find the modified time), i just execute the copy (i.e. return True).
Im my last commit on this PR branch i catch the FSError exception insted of Exception

fs/copy.py Outdated
@@ -56,13 +88,35 @@ def copy_file(src_fs, src_path, dst_fs, dst_path):
if src_fs is dst_fs:
# Same filesystem, so we can do a potentially optimized copy
src_fs.copy(src_path, dst_path, overwrite=True)
return dst_path
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather this didn't return anything. It looks like there is only a single possible return value anyway...

Copy link
Author

@gpcimino gpcimino Mar 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we catch the copy failed event? Something like:

def copy_file(src_fs, src_path, dst_fs, dst_path):
try:
    with manage_fs(src_fs, writeable=False) as src_fs:
     ....
     return True
except:
     return False

We can return a boolean to report on the success of the copy.
Again here we should use more detailed Exception handling.
What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think its probably best just to let the errors bubble up. It lets the caller do more finely grained error handling.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

@coveralls
Copy link

coveralls commented Mar 15, 2017

Coverage Status

Coverage decreased (-0.4%) to 99.316% when pulling 39bb737 on gpcimino:feature/copy-if-newer into de3e3a9 on PyFilesystem:master.

@gpcimino
Copy link
Author

Will,

we are almost done 😄 !!!
I updated the PR with the rename of callback in on_copy.
If you consider all the code modifications closed, i will update the docstrings.
Then you can merge the PR.
GP

@willmcgugan
Copy link
Member

@gpcimino Looks good to me! 👍

@gpcimino
Copy link
Author

Will,

i just push the docstring update to the PR branch.
I also add the check in osfs.geturl(), as per your request!
If you can merge the PR to the master branch would be very good!
Ciao
GP

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.5%) to 99.295% when pulling ac6bca1 on gpcimino:feature/copy-if-newer into de3e3a9 on PyFilesystem:master.

@willmcgugan
Copy link
Member

Hi GC.

Regarding that test fail. I'm wondering if you could explicitly set the time with os.utime a few seconds ahead, rather than letting it set the current time. You might be able to avoid the sleep with TravisCI.

I'm not sure there is much you can do to debug on Travis, except commit and wait for any fails.

That said, if it turns out to be tricky. I wouldn't have any objections if you put the sleep back in to fix it for now, and we can come up with a more elegant solution later. Just to get this PR merged... :-)

@coveralls
Copy link

coveralls commented Mar 27, 2017

Coverage Status

Coverage decreased (-0.2%) to 99.515% when pulling 3b85c19 on gpcimino:feature/copy-if-newer into de3e3a9 on PyFilesystem:master.

@gpcimino
Copy link
Author

Will,

i fixed the errors on TravisCI adding a delay to the utime of the files used in copy_x_if_newer().
All tests are now green!
GP

@willmcgugan
Copy link
Member

That's great. Thanks for sticking with it. :-)

@willmcgugan willmcgugan merged commit 737edcf into PyFilesystem:master Mar 29, 2017
@willmcgugan
Copy link
Member

Hi GP. I added some tests to bring coverage back up. I also made a few changes to your code for formatting and brevity. The tests all pass, but would you mind checking this PR over to make sure I haven't broken anything you rely on...

#34

Will

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants