Add copy_if_newer parameter to functions in copy.py #22

gpcimino · 2017-02-10T15:02:21Z

Added copy_if_newer parameter to all functions in copy.py.
Files are copied only if destination files don't exist or sources are newer then destination files.
Directories are always copied.
In case time of modification for files cannot be determined files are safely copied.

Moreover I added geturl method to osfs returning "file://" + self.getsyspath(path).

…d to osfs.

lurch · 2017-02-10T15:12:54Z

Can you explain what exactly "files are safely copied" means?
Also, I see that some of the code-comments refer to In case time of modification for destination files cannot be determined - does it also handle the case where the modification times of the source files can't be determined?
In some FTP clients I've used, there's an option to "overwrite if file-sizes are different" (in case the timestamps are unreliable) - would that be a useful fallback here?

gpcimino · 2017-02-10T15:13:27Z

just wants to add that now cop_file, copy_dir and copy_fs have a return value containing the copied files

willmcgugan · 2017-02-10T15:25:57Z

I can see why that would be useful, and the implementation looks good. But I'm not so keen on add this.

The main reason is the overloading of the copy methods. copy_if_newer may seem like small addition to the signature, but it is on equal footing with other flags that might be useful. And if we add those we end up with a function that is harder to test and maintain.

I think it would be best to keep copy_fs as a "do one thing and do it well" method. Which is admittedly a departure from FS1 which tried to pack in as much functionality as possible.

I'd like to suggest either implementing a new function (e.g. copy_fs_if_newer), or by using a more generic mechanism. A custom Walker may be able to do this.

For instance...

copy_fs(foo_fs, bar_fs, walker=NewerFilesWalker(foo_fs, bar_fs))

Always open to other viewpoints and I appreciate your contributions. :)

gpcimino · 2017-02-10T15:31:22Z

@lurch: "files are safely copied" means that in case func _source_is_newer() cannot figure out the modification time of both files src, dst, the file is copied. Of course we can make this even more complex considering the file size as well. As first implementation i decided to consider the modification time only.

gpcimino · 2017-02-10T15:37:09Z

@will: implementing a new function copy_fs_if_newer() will end up in code copy and paste which is bad.
An option could be to have 2 public functions (i know public/private in python is not enforced) copy_fs() and copy_fs_if_newer(). Those functions will call an unique private _copy_fs() which will have all the parameters (included copy_if_newer). This should simplify the exposed interface to the final user but avoid code copy and paste.

Using a custom walker could be a neat implementation, but from the user point of view won't be as clear as a simple function parameter like copy_if_newer.

willmcgugan · 2017-02-10T16:12:17Z

@gpcimino There is another consideration that occurs to me. In _source_is_newer you call getinfo twice. Which is reasonable, but getinfo may require a request for network filesystems. Two requests per file copied would likely result in a slow copy.

A custom copy_fs and copy_dir could get all the info for each directory which would be more efficient.

I concur that if there is going to be a lot of common functionality then it would be worth factoring out a private function there.

Using a custom walker could be a neat implementation, but from the user point of view won't be as clear as a simple function parameter like copy_if_newer.

True, but maybe the copy_fs_if_newer could just invoke copy_fs(foo_fs, bar_fs, walker=NewerFilesWalker). Then we have a NewerFilesWalker we can use elsewhere. Just a suggestion...

lurch · 2017-02-10T16:12:36Z

"files are safely copied" means that in case func _source_is_newer() cannot figure out the modification time of both files src, dst, the file is copied.

Ahh, I guess in that case "In case time of modification for files cannot be determined files are always copied." would be much clearer.

Regarding @willmcgugan 's suggestions, perhaps you could combine both approaches into the best of both worlds, i.e.

def copy_fs_if_newer(src_fs, dst_fs, other_options...):
    return copy_fs(src_fs, dst_fs, walker=NewerFilesWalker(src_fs, dst_fs, other_options...))

Gives you a simple interface, and doesn't involve copy'n'pasting code :)

lurch · 2017-02-10T16:13:41Z

Haha, snap 😆

gpcimino · 2017-02-12T13:20:30Z

@lurch, @willmcgugan: if we decide to go for NewerFilesWalker implementation then how do we cope with the single file copy method?

def copy_file(src_fs, src_path, dst_fs, dst_path):
…

This method (obviously) doesn't get a walker as input parameter, so how do we handle the case copy_if_newer?

Thinking more generally, what should be the responsibility of a Walker object? IMHO it should descend (aka walk) one file system tree. It shouldn’t access external file systems objects to perform some kind of chekcs (e.g. check the file modification time).

IMHO, the proper position for the copy_if_newer flag is in the functions signatures inside the copy module (copy_file, copy_fs, etc). Those functions get 2 file systems in input, and they are properly positioned to execute actions on both those file systems (e.g. compare the modification time of a file).

So what is the solution?

A simple solution could be, use the (ugly) python features *args: copy_file will become:

def copy_file(src_fs, src_path, dst_fs, dst_path, *args):

where args can handle any parameter (e.g. copy_if_newer). IMHO this Python feature make code really unreadable, but in some scenario I have to admit is extremely powerful.

If we want to consider a more disruptive refactoring ;-) and if we go back to the old good times of OOP, some OOP guru (may be was Grady Booch?) was used to say: “if your problem gets more complex add a class!” So, let’s add a class. We can refactor the functions of the copy module in a copy class. Something like:

class copy(object):
	def __init__(self, src_fs, dst_fs, copy_if_newer=False, walker=None):
		self._src_fs = src_fs
		self._dst_fs = dst_fs
		self._walker = walker
		self._copy_if_newer=copy_if_newer

	def set_copy_if_newer(self, value):
		self._copy_if_newer = value

	def set_walker(self, walker):
		self._walker = walker

	def copy_file(self,src_path, dst_fs, dst_path):
		pass

	def copy_structure():
		pass

In this case we don’t have the problem of too many parameters for the copy_x functions anymore.
Moreover a class here makes a lot of sense, because the Copy object could be reused multiple time, in the same snip of code. Immagine something like that:

backup= Copy("osfs:///data/", "osfs:///storage/")
backup.copy_file("/logs/error.log", "/error/error.log")
backup.set_copy_if_newer(True)
backup.copy_file("/logs/access.log", "/access/access.log")
...

Moreover, we can immagie that in future the Copy class can growth to achieve rsync like functionalities. For instance we might want to implement a very useful mirroring (between filesystems) function. Or we might want to develop a copy_dir based on regex, where dst_path is calculated from src_path using a regex, thus implementing a path transformation (see below, the example of backup copy refactored in one line, code not tested!):

backup= Copy("osfs:///data/", "osfs:///storage/")
backup.copy_dir_with_transform("/", "\/.+\/(?P<type>.+)\.log, "/{type}/{type}.log")

I understand that this second scenario (refactor to a class) will require a lot of effort in refactoring tests and also documentation.
Please consider this only a Sunday morning rant ;-)

What do you think guys?

Ciao
GP

gpcimino · 2017-02-22T22:58:25Z

@willmcgugan,

did you have the chance to think about this PR?
Thanks
GP

willmcgugan · 2017-02-23T13:57:36Z

@gpcimino Sorry, should have replied earlier.

My thinking there is that it could be implemented in a fairly straightforward manner, without many changes. On reflection, an overloaded Walker may be more trouble than its worth.

Here's the current main loop in copy_dir.

    with manage_fs(src_fs, writeable=False) as src_fs:
        with manage_fs(dst_fs, create=True) as dst_fs:
            with src_fs.lock(), dst_fs.lock():
                dst_fs.makedir(_dst_path, recreate=True)
                for dir_path, dirs, files in walker.walk(src_fs, _src_path):
                    copy_path = combine(
                        _dst_path,
                        frombase(_src_path, dir_path)
                    )
                    for info in dirs:
                        dst_fs.makedir(
                            info.make_path(copy_path),
                            recreate=True
                        )
                    for info in files:
                        copy_file(
                            src_fs,
                            info.make_path(dir_path),
                            dst_fs,
                            info.make_path(copy_path)
                        )

If you changed the walker.walk to walker.info you have an info object for the modification times of the source filesystem. You could then call scandir on the destination filesystem to get info for the destination.

Then its just a matter of adding a condition around the copy_file method.

This has the advantage that you can get the info for the directories on the src and dst in one call, which is quite efficient. It will be much faster for FTP or network filesystems.

Does that make sense?

gpcimino · 2017-02-26T21:29:34Z

Hi,

i agree with your comments. Just to be sure, with the implementation you suggested, what is the interface of the copy.py module? Shall we keep two public methods copy_dir and copy_dir_if_newer?
Is this true for the other methods:

def copy_file(...):
...
def copy_file_if_newer(..):
...
def copy_fs(...):
...
def copy_fs_if_newer(...):
...

Thanks
GP

willmcgugan · 2017-02-26T23:25:51Z

Great!

I think copy_dir_if_newer and copy_fs_if_newer makes sense and has parity with the existing methods. I probably wouldn't implement copy_file_if_newer though unless you have a particular use case for it?

…en src file is newer

coveralls · 2017-02-27T16:04:44Z

Coverage decreased (-0.7%) to 99.096% when pulling 1e4ace8 on gpcimino:feature/copy-if-newer into de3e3a9 on PyFilesystem:master.

gpcimino · 2017-02-27T16:29:35Z

Will,

here is the new PR. I tested on Win7 with Python 3.5.1 and all tests passed (also the new ones I added).
On Travis apparently one test is failing on Python version 2.7 and 3.5.
I will push some more tests to increase code coverage, which decerased due to lack of tests for copy_file_if_newer().

For the new implementation I tried to follow your suggestions. However, since I am not calling walker.walk anymore, the body of copy_dir_if_newer changed significantly. I would be grateful if you could have a look at the code, may be you can spot some nasty hidden bug!

I agree with you that copy_file_if_newer is not so important, if a client need to copy a single file he probably doesn’t need to use pyfilesystem2. However I left copy_file_if_newer method for symmetry. Form user prospective It could be useful in a scenario when a client is calling multiple operations on a given file system, such as a sequence of copy_dir and copy_file, in this case the single copy_if_newer could be handy. I would definitely leave it in!
Ciao
GP

coveralls · 2017-02-27T16:44:49Z

Coverage decreased (-0.3%) to 99.413% when pulling c3e6fc0 on gpcimino:feature/copy-if-newer into de3e3a9 on PyFilesystem:master.

willmcgugan · 2017-03-01T10:15:54Z

Thanks @gpcimino I'm in the middle of moving house, but I'll have a look at that very soon. First impressions are positive, but I may have some super-pedantic suggestions.

gpcimino · 2017-03-01T11:17:34Z

Will,

i still have to add the doc strings of the new methods. Once your (gently!) code review will be completed i can add the doc strings to the final version of the code.
GP

willmcgugan

Some queries for you.

willmcgugan · 2017-03-02T20:38:09Z

fs/copy.py

+            return True
+        else:
+            nmsp = ('details', 'modified')
+            return src_fs.getinfo(src_path, nmsp).modified > dst_fs.getinfo(dst_path, nmsp).modified


I'm afraid that the modified attribute may be None if the filesystem doesn't store that information.

See https://pyfilesystem2.readthedocs.io/en/latest/info.html#details-namespace

I will ad a check just before the return statement, to catch the case modified == None

willmcgugan · 2017-03-02T20:38:28Z

fs/copy.py

+        if not dst_fs.exists(dst_path):
+            return True
+        else:
+            nmsp = ('details', 'modified')


Not sure what nmsp stands for there. I'd like to avoid abbreviations wherever possible, unless there is a president. I don't mind verbose_variable_names.

nmsp stands for namespace. I used an abbrevition to keep line 201 under 100 chars.
I will change it to namespace.

willmcgugan · 2017-03-02T20:40:54Z

fs/copy.py

@@ -101,13 +155,14 @@ def copy_dir(src_fs, src_path, dst_fs, dst_path, walker=None):
        in ``src_fs``. Set this if you only want to consider a sub-set
        of the resources in ``src_fs``.
    :type walker: :class:`~fs.walk.Walker`
+    :returns: a list with the copied files dst_path.


I'm not sure about this. Seems sensible, but a copy can potentially copy millions of files, which may be an issue on some platforms.

Do you have a use case for it? I'd be tempted to drop it unless there is a tangible need for it.

I add this feature exactly to support my use case. Once you have copied some files, you may want to know which files you copied, to do something on top of them.
In my case, i have a data acquisition pipeline, where the copy file is just the first step. After the copy, i need to execute actions on the files, and of course know which files have been copies is super useful!

Said that, I didn’t think to the fact the list of copied files can grow enormously.
IMHO the best implementation would be to pass to copy_dir_* and to copy_fs_* a call back function. Then, is up to the user to decide what to do inside the callback. For instance one simple implementation could be just count the copied files and sum the total bytes copied. A second option could be to collect all the files in list (as I am doing now).

The signature of the callback functions could be same as copy file:

Def callback(src_fs, src_path, dst_fs, dst_path): …

Do you like it?

So would the callback actually do the copying? I'm thinking of something like this:

copy_fs(src_fs, dst_fst, copy_file=copy_file, walker=None)

With copy.copy_file as the default, functionality remains the same. But for your use case, you pass a specialised function that stores the copied files in a list. Is that what you had in mind?

The idea is to keep the copy instaruction inside the copy_* method as usual.
The callback will be triggered only if the copy happens (e.g. if src is newer than dst).
Than is up to the user to define what to do in the callback.
If you look at my lastest commit in this PR branch, you can find the code already working.

Possible scenarios we can handle (code not tested):

print what's going on:

def echo(src_fs, src_path, dst_fs, dst_path): src_url = open_fs(src_fs).geturl(src_path) dst_url = open_fs(dst_fs).geturl(dst_path) print("file " + src_url+ " copied to " +dst_url ) copy_fs_if_newer(source_root, target_root, walker, callback=echo)

counting:

num_file = 0 def count(src_fs, src_path, dst_fs, dst_path): num_file += 1 #we could also compute the total size of copied files in MB copy_fs_if_newer(source_root, target_root, walker, callback=echo) print("Copied " + str(num_file) + " files")

collect all copied files:

files_copied = [] def accumulate(src_fs, src_path, dst_fs, dst_path): url = open_fs(dst_fs).geturl(dst_path) files_copied.append(url) copy_fs_if_newer(source_root, target_root, walker, callback=accumulate) for f in files_copied: #do something with f...

Do this make sense?
GP

Yes. Works for me. Only suggestion I would have is to rename callback to a more descriptive on_copy.

willmcgugan · 2017-03-02T20:42:52Z

fs/copy.py


+def copy_file_if_newer(src_fs, src_path, dst_fs, dst_path):


Can you add docs?

Why return the dst_path? If its just to indicate copied or not copied, maybe a boolean would be sufficient.

See comment above! ;-)
No worries, I will add the docs to all functions once we agree on stable code base!

I think maybe a boolean to indicate whether the file was copied or not. But let errors bubble?

OK, i will remove the try/except block

willmcgugan · 2017-03-02T20:44:52Z

fs/copy.py

+                        dst_fs.makedir(copy_path, recreate=True)
+                    if copy_info.is_file:
+                        if dir_path in dst_state:
+                            if copy_info.modified <= dst_state[dir_path].modified:


Same comment re modified attribute, which can be None.

willmcgugan · 2017-03-02T20:45:52Z

fs/copy.py

+        with manage_fs(dst_fs, create=True) as dst_fs:
+            with src_fs.lock(), dst_fs.lock():
+                dst_fs.makedir(_dst_path, recreate=True)
+                nmsp = ('details', 'modified')


More verbose variable names here please (nmsp, fp and i).

willmcgugan · 2017-03-02T20:50:39Z

fs/osfs.py

@@ -302,6 +302,9 @@ def getsyspath(self, path):
        sys_path = self._to_sys_path(path)
        return sys_path

+    def geturl(self, path, purpose='download'):


This didn't occur to me to do, but I like it. It isn't clear from the docs (https://pyfilesystem2.readthedocs.io/en/latest/reference/base.html#fs.base.FS.geturl) but it should throw a NoURL for any value of purpose other than 'download'

willmcgugan · 2017-03-02T20:54:59Z

tests/test_copy.py

+            self._write_file(dst_file1)
+
+            time.sleep(1) #sleep 1 sec to ensure dst_file1 is older


Any way to do this without the sleeps?

I add sleeps because on trevis, tests failed in unpredictive way. For some version of python some test falied once, then the second time they succced. This is strange... so i decided to add a sleep in case was a race on file object. But probably it is not, because the behaivoe is the same with or without sleep.

I will remove the sleep, still some test will fail, do you have an idea why?
Note that py3.5 tests are failing. On my workstation i use py35 on windows and tests are all green!

willmcgugan · 2017-03-02T20:57:44Z

fs/copy.py

+        else:
+            nmsp = ('details', 'modified')
+            return src_fs.getinfo(src_path, nmsp).modified > dst_fs.getinfo(dst_path, nmsp).modified
+    except Exception as ex:


I don't think it is a good idea to catch all exceptions here. There are variety of errors that could occur here, and you risk interpreting them incorrectly...

Could you please suggest a hierachy of exceptions to catch?

Actually, I see its a private method, so I'm probably being too pedantic. I guess if there is any error, you would want default to attempting a copy?

In that case, you might just want to change it to catch FSError which is the base for all filesystem errors.

Yes, in case of problems (eg cannot find the modified time), i just execute the copy (i.e. return True).
Im my last commit on this PR branch i catch the FSError exception insted of Exception

willmcgugan · 2017-03-02T20:59:15Z

fs/copy.py

@@ -56,13 +88,35 @@ def copy_file(src_fs, src_path, dst_fs, dst_path):
            if src_fs is dst_fs:
                # Same filesystem, so we can do a potentially optimized copy
                src_fs.copy(src_path, dst_path, overwrite=True)
+                return dst_path


I'd rather this didn't return anything. It looks like there is only a single possible return value anyway...

Should we catch the copy failed event? Something like:

def copy_file(src_fs, src_path, dst_fs, dst_path): try: with manage_fs(src_fs, writeable=False) as src_fs: .... return True except: return False

We can return a boolean to report on the success of the copy.
Again here we should use more detailed Exception handling.
What do you think?

I think its probably best just to let the errors bubble up. It lets the caller do more finely grained error handling.

coveralls · 2017-03-15T17:12:36Z

Coverage decreased (-0.4%) to 99.316% when pulling 39bb737 on gpcimino:feature/copy-if-newer into de3e3a9 on PyFilesystem:master.

gpcimino · 2017-03-17T16:35:45Z

Will,

we are almost done 😄 !!!
I updated the PR with the rename of callback in on_copy.
If you consider all the code modifications closed, i will update the docstrings.
Then you can merge the PR.
GP

willmcgugan · 2017-03-17T17:43:51Z

@gpcimino Looks good to me! 👍

gpcimino · 2017-03-22T10:03:05Z

Will,

i just push the docstring update to the PR branch.
I also add the check in osfs.geturl(), as per your request!
If you can merge the PR to the master branch would be very good!
Ciao
GP

coveralls · 2017-03-22T10:04:42Z

Coverage decreased (-0.5%) to 99.295% when pulling ac6bca1 on gpcimino:feature/copy-if-newer into de3e3a9 on PyFilesystem:master.

willmcgugan · 2017-03-26T20:01:19Z

Hi GC.

Regarding that test fail. I'm wondering if you could explicitly set the time with os.utime a few seconds ahead, rather than letting it set the current time. You might be able to avoid the sleep with TravisCI.

I'm not sure there is much you can do to debug on Travis, except commit and wait for any fails.

That said, if it turns out to be tricky. I wouldn't have any objections if you put the sleep back in to fix it for now, and we can come up with a more elegant solution later. Just to get this PR merged... :-)

coveralls · 2017-03-27T13:51:21Z

Coverage decreased (-0.2%) to 99.515% when pulling 3b85c19 on gpcimino:feature/copy-if-newer into de3e3a9 on PyFilesystem:master.

gpcimino · 2017-03-27T13:55:17Z

Will,

i fixed the errors on TravisCI adding a delay to the utime of the files used in copy_x_if_newer().
All tests are now green!
GP

willmcgugan · 2017-03-29T18:01:57Z

That's great. Thanks for sticking with it. :-)

willmcgugan · 2017-03-29T19:47:54Z

Hi GP. I added some tests to bring coverage back up. I also made a few changes to your code for formatting and brevity. The tests all pass, but would you mind checking this PR over to make sure I haven't broken anything you rely on...

#34

Will

Giampaolo Cimino added 2 commits February 10, 2017 15:43

Add copy_if_newer parameter to functions in copy.py. Add geturl metho…

9be260c

…d to osfs.

added doc string for return values of copy functions

a9ef43b

copy_if_newer param removed, implemented new methods for copy only wh…

1e4ace8

…en src file is newer

test methods to cover copy_file_if_newer

c3e6fc0

willmcgugan requested changes Mar 2, 2017

View reviewed changes

Giampaolo Cimino added 4 commits March 7, 2017 11:16

improvements requested for pull request

c6cc1ec

Merge branch 'callback' into feature/copy-if-newer

fbbc7ba

remove exception handling from copy_file* methods

733ce9b

remove tmp file

39bb737

renamed callback in on_copy

50d0a4b

docstring added for copy module and raise exception in geturl in osfs

ac6bca1

gpcimino mentioned this pull request Mar 24, 2017

FTP problems with proftp server #24

Closed

Giampaolo Cimino added 3 commits March 27, 2017 12:18

fix test_copy for PY2 compatibility and Travis error on file time

a6517a2

apply delay to file utime to be sure if_newer is detected

8a66771

add comments and code clean up

3b85c19

willmcgugan merged commit 737edcf into PyFilesystem:master Mar 29, 2017

		self._write_file(dst_file1)

		time.sleep(1) #sleep 1 sec to ensure dst_file1 is older

Add copy_if_newer parameter to functions in copy.py #22

Add copy_if_newer parameter to functions in copy.py #22

Conversation

gpcimino commented Feb 10, 2017

lurch commented Feb 10, 2017

gpcimino commented Feb 10, 2017

willmcgugan commented Feb 10, 2017

gpcimino commented Feb 10, 2017

gpcimino commented Feb 10, 2017

willmcgugan commented Feb 10, 2017

lurch commented Feb 10, 2017

lurch commented Feb 10, 2017

gpcimino commented Feb 12, 2017 • edited

gpcimino commented Feb 22, 2017

willmcgugan commented Feb 23, 2017

gpcimino commented Feb 26, 2017

willmcgugan commented Feb 26, 2017

coveralls commented Feb 27, 2017 • edited

gpcimino commented Feb 27, 2017 • edited

coveralls commented Feb 27, 2017 • edited

willmcgugan commented Mar 1, 2017

gpcimino commented Mar 1, 2017

willmcgugan left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gpcimino Mar 6, 2017 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

coveralls commented Mar 15, 2017 • edited

gpcimino commented Mar 17, 2017

willmcgugan commented Mar 17, 2017

gpcimino commented Mar 22, 2017

coveralls commented Mar 22, 2017

willmcgugan commented Mar 26, 2017

coveralls commented Mar 27, 2017 • edited

gpcimino commented Mar 27, 2017

willmcgugan commented Mar 29, 2017

willmcgugan commented Mar 29, 2017

gpcimino commented Feb 12, 2017 •

edited

coveralls commented Feb 27, 2017 •

edited

gpcimino commented Feb 27, 2017 •

edited

coveralls commented Feb 27, 2017 •

edited

gpcimino Mar 6, 2017 •

edited

coveralls commented Mar 15, 2017 •

edited

coveralls commented Mar 27, 2017 •

edited