
Gdrive Implementation #12

Closed
wants to merge 21 commits into from

Conversation


@segator segator commented Feb 12, 2017

Hi,
This is my first attempt at implementing a Google Drive backend, using the implementation from @mkhon as a base.

I modified @mkhon's code to add the following features:

  • Full S3QL Google Drive implementation
  • Google Drive error handling
  • Batch requests (better performance when writing/deleting small files, avoids bans for making too many requests)
  • OAuth client integrated with S3QL
  • Avoid unnecessary requests
  • MD5 checksum read/write

I don't expect you to accept the changes now; I just want to know whether I'm on the right track. I want to refactor some things before merging into main.

About the OAuth auth:
I modified your OAuth utility for Google Storage.
I added a parameter --oauth_type that lets you choose whether to generate a token for Google Storage or Google Drive, and I also added the possibility to use your own client ID/secret.
You should modify your client ID to accept Google Drive, because right now you must use your own client ID to generate a token.

The idea is that the OAuth client generates a refresh token, and when you mount S3QL you set the following values:
user: your client_id
password: client_secret:refreshToken
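For illustration only, here is a hedged sketch of how such credentials might be laid out in ~/.s3ql/authinfo2; the gdrive:// URL prefix and the section name are assumptions for this example, not something this PR is guaranteed to use:

[gdrive-fs]
storage-url: gdrive://my-s3ql-folder
backend-login: <your client_id>
backend-password: <your client_secret>:<your refresh token>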

Let me know what you think about this implementation. I'm not an expert in Python, so I suppose a lot of things could be done better.

After some long tests against Google Drive:
- Added retrying of requests when the status code is >= 500
- All threads now share the same access token; before, a token was obtained for every thread (fewer requests)
- Removed the ugly warnings about mem_cache by disabling the discovery cache
- Sometimes Google returns a 4XX or 5XX error on upload even though the file was uploaded correctly, so after a retry some block files end up duplicated and fsck then fails with a duplicate-key error. In case of a retry we therefore try to remove the object before uploading (see the sketch after this list).
- Some commented-out code of an unfinished batch-delete implementation (multiple deletes in a single request), used when deleting by id in cases like copy, move, or failed uploads, to avoid duplicate objects in the backend
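A hedged sketch of the delete-before-upload idea from the retry bullet above; _list_files and _delete_by_id mirror helper names that show up in the diff hunks further down, while MAX_RETRIES, _do_upload and TemporaryUploadError are made-up placeholders:

# Hedged sketch, not the PR's actual code: before a retried upload, remove any
# copy that a previously "failed" attempt may have left behind, so the backend
# never ends up with two files for the same key.
def _upload_with_retry(self, key, data):
    for attempt in range(MAX_RETRIES):
        if attempt > 0:
            # A 4XX/5XX response does not guarantee the upload failed server-side.
            for f in self._list_files(key, "id"):
                self._delete_by_id(f['id'])
        try:
            return self._do_upload(key, data)
        except TemporaryUploadError:
            continue
    raise TemporaryUploadError('upload of %s kept failing' % key)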
@Nikratio
Collaborator

Thanks a lot for your contribution, and apologies for the lack of a response! I want to look at this, but I just have much less time available to spend on S3QL than in the past.

@segator
Author

segator commented Feb 21, 2017

Hi @Nikratio, I fully understand, don't worry. The implementation isn't finished yet, but I'd like your approval to continue in this direction.

- Disabled the Google request logs
- Simplified the list request
- Fixed duplicated objects when uploading (sometimes S3QL calls open_write without checking whether the file exists, and on Google Drive you can have two files with the same name)
- Google Drive is sometimes slow to update listings after files are deleted, so a file can appear to exist when it doesn't; the delete method therefore accepts a 404 error as a valid delete. Added this exception to the retry handling.
- SSL errors are handled as temporary errors
- Avoid querying the root path ID from every thread (only the first time)
- Fixed duplicated objects when updating metadata
@szepeviktor
Collaborator

Is this PR for Google Drive or http://www.g-technology.com/products/g-drive ?

@segator
Author

segator commented May 27, 2017

Google Drive of course

@szepeviktor
Collaborator

Thanks.

@Nikratio
Collaborator

Sorry for the delay! Are you still interested in finishing this? Then I will take a look.

@Nikratio Nikratio self-assigned this Aug 24, 2017
@segator
Author

segator commented Aug 24, 2017

Yes, of course, I'd be happy to collaborate on your project. The implementation is pretty stable; I've had 3 instances without an FS crash for the last 4 months.
Some code probably needs to be refactored to be clearer (this is probably my first "hard" coding with Python).

The only error I don't know the cause of is that sometimes, when retrying uploads, S3QL duplicates the file.
I added controls to avoid this but it still happens. So when you do fsck you get duplicate ID errors, which are easily fixed by removing one of the duplicated files (they are exactly equal).
Anyway, this should be fixed before integrating into the main branch.

Some extras I added:
s3qlstat now has a Prometheus exporter, so you can get nice reports, see here:
https://snapshot.raintank.io/dashboard/snapshot/74VnTwHw6s1ZNK6MZdjs6lsdo78NMxNU
s3qlstat /path --prometheus_exporter --prometheus_port
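For context, a minimal sketch of how such an exporter could be built on the prometheus_client library; the metric names, the get_stats() callback and the default port are assumptions, not the PR's actual code:

# Hedged sketch: publish a couple of s3qlstat-style numbers as Prometheus gauges.
import time
from prometheus_client import Gauge, start_http_server

entries = Gauge('s3ql_fs_entries', 'Number of filesystem entries')
after_dedup = Gauge('s3ql_size_after_dedup_bytes', 'Data size after deduplication')

def run_exporter(get_stats, port=9399):
    # get_stats() is assumed to return a dict with the values s3qlstat prints.
    start_http_server(port)
    while True:
        stats = get_stats()
        entries.set(stats['entries'])
        after_dedup.set(stats['after_dedup'])
        time.sleep(30)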

The S3QL OAuth utility now supports Google Drive in addition to Google Storage.

@Nikratio
Collaborator

Could you explain what you mean by "s3ql duplicate the file"? What fsck errors do you get specifically?

@Nikratio Nikratio left a comment

I now took a look at the code, and this looks like a good start that could be merged eventually. So it would be great if you could continue to work on this. I have left more detailed comments in the code. Note that this is just a first pass, I'm pretty sure I'll have more to say in a few more iterations.

@@ -134,7 +134,9 @@ def main():
 'requests',
 'defusedxml',
 'dugong >= 3.4, < 4.0',
-'llfuse >= 1.0, < 2.0' ]
+'llfuse >= 1.0, < 2.0',
+'google-api-python-client >= 1.4.2']
Collaborator

Really required? Note to self: come back to that later

Author

I suppose you mean google-api-python-client; it's the official Google Drive SDK. Yes, of course it's necessary, unless you want to reinvent the wheel and implement all the API requests with your own HTTP library.

Collaborator

The problem with these official SDKs is that they are often designed to do much more than what an application like S3QL wants a library to do. For example, S3QL needs control over threading, connections, retrying requests, etc. If the SDK does all these things automatically, that's super convenient for quick experimentation, but S3QL would not work very well. That's the reason why S3QL isn't using the official Amazon S3 or Google Storage SDKs either.

@@ -8,6 +8,9 @@

 from .logging import logging, setup_logging, QuietError
 from .parse_args import ArgumentParser
+from oauth2client.client import OAuth2WebServerFlow
+from oauth2client.client import AccessTokenCredentials
+from oauth2client.client import OOB_CALLBACK_URN
Collaborator

Why is this necessary? The existing code already does OAuth without any additional modules, so why can't it work with Google Drive?

Author

@segator segator Aug 25, 2017

@mkhon commented in their PR: "This was actually the point where I decided not to reuse the s3c implementation and to use the Google Python API instead: S3QL's OAuth2 implementation uses the 'OAuth 2.0 for devices' flow, and the Google Drive API scope is not allowed for that flow: https://developers.google.com/identity/protocols/OAuth2ForDevices#allowedscopes"
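For reference, a hedged sketch of the installed-app ("out of band") flow that those oauth2client imports make possible; the client id/secret are placeholders, and this is not necessarily how the PR wires it up:

# Hedged sketch: obtain a Google Drive refresh token with oauth2client's
# installed-app flow instead of the "OAuth 2.0 for devices" flow.
from oauth2client.client import OAuth2WebServerFlow, OOB_CALLBACK_URN

flow = OAuth2WebServerFlow(client_id='your-client-id.apps.googleusercontent.com',   # placeholder
                           client_secret='your-client-secret',                      # placeholder
                           scope='https://www.googleapis.com/auth/drive',
                           redirect_uri=OOB_CALLBACK_URN)

print('Open this URL and authorize access:', flow.step1_get_authorize_url())
code = input('Enter the authorization code: ')
credentials = flow.step2_exchange(code)
print('Refresh token:', credentials.refresh_token)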

Collaborator

I think in that case I'd rather switch to a different flow for both Google Storage and Google Drive. Or is there a good reason to use different flows?


return parser.parse_args(args)
options = parser.parse_args(args)
if options.client_id == '':
Collaborator

Can you give an example of a situation where you'd want to use a different client id?

Author

Google Drive limits the maximum number of requests per minute/day depending on the application, so you might have a Google application with a higher cap (this is my case).


options = parse_args(args)
setup_logging(options)
def googleDrive(options):
Collaborator

Please follow the conventions of the existing S3QL code (which in turn mostly follows Python PEP8). This means CamelCase only for classes and under_scores for everything else. Also, method/function names should typically be verbs so you can tell what they do. From the name, there's no way to tell what the googleDrive function does.

Author

You're right, pending refactor.

@@ -182,6 +182,12 @@ def add_storage_url(self):
type=storage_url_type,
help='Storage URL of the backend that contains the file system')

def add_oauth(self):
self.add_argument("--oauth_type", metavar='<oauth_type>',default="google-storage",type=oauth_type,
Collaborator

I think it's better to instead require the user to pass a storage url, and infer the correct backend from that. At some point we may want to generate tokens that are specific to the given bucket (or even prefix).

Author

I'm not sure I understand. What you looked at in the code is the parameter for choosing the oauth_type (storage/drive) when generating the user/password for your specific Google Drive/Storage account.

Collaborator

Yes. Drop that parameter, and require the user to specify the storage URL. Then you can look at the prefix (gs:// or gdrive://) to determine whether the token is for Google Drive or Google Storage.
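A hedged sketch of the dispatch being suggested here; the gdrive:// prefix and the exact scope strings are assumptions for illustration:

# Hedged sketch: infer the OAuth scope from the storage URL instead of an
# --oauth_type option.
def scope_for_storage_url(storage_url):
    if storage_url.startswith('gs://'):
        return 'https://www.googleapis.com/auth/devstorage.full_control'
    elif storage_url.startswith('gdrive://'):
        return 'https://www.googleapis.com/auth/drive'
    raise ValueError('OAuth tokens are only supported for gs:// and gdrive:// storage URLs')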

'''Make chunk property'''
return 'ref({0}.{1})'.format(k, i)

def _encode_metadata(self, metadata):
Collaborator

please add docstring

Author

right

properties[k] = v
return properties

def _decode_metadata(self, f):
Collaborator

This looks like you really should be using thaw_basic_mapping() and freeze_basic_mapping() instead (from s3ql.common).
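A hedged sketch of what using those helpers might look like; storing the frozen mapping in a single file property named s3ql_metadata is an assumption for illustration, not the PR's code:

# Hedged sketch: serialize metadata with S3QL's own helpers instead of a
# hand-rolled per-key encoding.
from s3ql.common import freeze_basic_mapping, thaw_basic_mapping

def _encode_metadata(self, metadata):
    '''Pack *metadata* into a single (hypothetical) Drive file property.'''
    return {'s3ql_metadata': freeze_basic_mapping(metadata).decode('utf-8')}

def _decode_metadata(self, f):
    '''Rebuild the dict stored by _encode_metadata from a Drive file resource.'''
    return thaw_basic_mapping(f['properties']['s3ql_metadata'].encode('utf-8'))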

Author

I will check it, I didn't know about this

def open_write(self, key, metadata=None, is_compressed=False):
log.debug("key: {0}".format(key))
if metadata is None:
metadata = dict()
Collaborator

If metadata is never None, why does _encode_metadata check for it?

Author

Are you sure it's never None? If you are, the check could be removed.

Collaborator

No, what I'm saying is that here you ensure that metadata will never be None from here on. But then you call another function (sorry, I can't see the context at the moment; it may be something like _encode_metadata) that checks whether metadata is None and in that case returns None, defeating the check above.

log.debug("")
for f in self._list_files(self.folder, "id"):
self._delete_by_id(f['id'])
'''
Collaborator

What's happening here? Why is this all commented out?

Author

It was an attempt to implement delete_multi, but it is very hard because Google Drive does not process all of the requests as one block; most of them can fail. What do you do in that case? Re-execute until there are zero pending/failed requests? You would probably get banned for making too many requests.

Collaborator

I do not understand what you mean by "because gdrive no process all of them as block, most of them could fail". If there is no way to implement delete_multi, just don't implement it. The higher layers will fall back to repeated delete calls instead.

I expect that the server will send you warnings to slow down before banning you completely. If the method raises the right exceptions, S3QL will honor these warnings and everything should work fine. If the google-api-python-client module doesn't give you access to these early warnings then... we shouldn't use it.
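A hedged sketch of what "raising the right exceptions" could look like if the SDK is kept: map the Drive API's rate-limit and server-error responses into the backend's temporary-failure check (the method name follows the _is_temp_failure segator mentions later; the status-code mapping is an assumption):

# Hedged sketch: treat Drive's slow-down and server-error responses as
# temporary failures so S3QL backs off and retries instead of giving up.
import ssl
from googleapiclient.errors import HttpError

def is_temp_failure(self, exc):
    if isinstance(exc, HttpError):
        status = exc.resp.status
        # 403 carries rateLimitExceeded/userRateLimitExceeded, 429 is "too many
        # requests", 5xx is transient server trouble. A real implementation
        # should also inspect the error reason, since 403 can mean other things.
        return status in (403, 429) or status >= 500
    # The PR also treats SSL errors as temporary (see the earlier changelog).
    return isinstance(exc, ssl.SSLError)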

found = True
self._delete_by_id(f['id'])
if not found:
if force or is_retry:
Collaborator

Why ignore the error on retry?

Author

Google Drive doesn't always work "in sync": sometimes you get an error deleting a file, but on the next retry you find that the file doesn't exist anymore. It's weird; I added this condition to avoid S3QL crashing when this kind of thing happens.
When Google scales their infrastructure up/down it's common to get a 500 error (it's in their documentation); I guess when this happens you can get a 500 error even though the request was processed correctly.

Collaborator

Alright. Could you please put this into the code as a comment?
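A hedged sketch of how that rationale could be recorded next to the check; found, force and is_retry follow the snippet above, NoSuchObject is S3QL's standard missing-key exception, and the rest is illustrative:

# Hedged sketch built around the snippet above.
if not found:
    if force or is_retry:
        # Drive listings are only eventually consistent: a delete can report an
        # error (often a 5XX while Google scales its infrastructure up or down)
        # even though it was actually applied, so on a retry the object may
        # legitimately be gone already. Treat "not found" as success here.
        return
    raise NoSuchObject(key)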

@Nikratio
Collaborator

Friendly ping. Do you think you'll have time to resolve the open issues in the near future? If not, I'll close this pull request for now.

@segator
Author

segator commented Dec 29, 2018

Hey, I would like to have Google Drive integrated, but you asked me to avoid using the Google SDK, which from my point of view would be bad practice, so for that reason I abandoned the other points you commented on. Anyway, some people are using my implementation successfully without issues.

@Nikratio
Collaborator

Thanks for the quick reply! Are you referring to this comment?

I expect that the server will send you warnings to slow down before banning you completely. If the method raises the right exceptions, S3QL will honor these warnings and everything should work fine. If the google-api-python-client module doesn't give you access to these early warnings then... we shouldn't use it.

If so, then it's not clear to me why this means that you can't use the official sdk. Why can't you simply not implement delete_multi as I suggested?

If you are referring to the logging-related comment: can't you depend on a fixed version of the SDK?

@segator
Author

segator commented Dec 29, 2018

I'm not sure I understand correctly (sorry, my English is quite basic).
Google Drive already sends special HTTP responses when it bans you for "limit exceeded" (too many requests per second, or exceeding the daily maximum).

Those errors are already handled by _is_temp_failure (if I remember correctly).

Anyway, if Google changes anything in their API and we use their SDK, we only need to change the SDK version (depending on the changes); if they change the API and we implement the client ourselves, we will need to adapt it.
And the other, more important point: why reinvent the wheel?

Regards

@Nikratio
Collaborator

You misunderstood. You can use the official SDK as long as you can solve the issues described in the comments.

@segator
Author

segator commented Dec 29, 2018

OK, sorry. I will go through all the comments again (hopefully this month) and try to fix all of them. Google Drive works pretty stable and fast with S3QL; I have an S3QL mountpoint that has been running for more than a year and a half without issues, and another has been upgraded to use a persistent cache and works very nicely too.
Thanks for your work!

@Nikratio
Collaborator

Closing this for now as discussed above. Please do reopen once the open issues are resolved!

@Nikratio Nikratio closed this Feb 11, 2019
@eleaner

eleaner commented Jun 21, 2020

Guys, on behalf of the less cash-heavy users of S3QL, would you be able to revive the discussion?

I guess a lot of us here would love to see it running.

@stickenhoffen

Yeah I agree; Google Drive would be an amazing addition.

@eleaner

eleaner commented Nov 6, 2020

@stickenhoffen
Sadly, not much interest here.
