
Gdrive Implementation #12

Closed
wants to merge 21 commits into from

Conversation


@segator segator commented Feb 12, 2017

Hi,
This is my first attempt at implementing a Google Drive backend, using the implementation from @mkhon as a base.

I modified @mkhon's code to add the following features:

  • Full S3QL Google Drive implementation
  • Google Drive error handling
  • Batch requests (better performance when writing/deleting small files, avoids bans for making too many requests)
  • OAuth client integrated with S3QL
  • Avoid unnecessary requests
  • MD5 checksum read/write

I don't expect you to accept the changes now; I just want to know whether I'm on the right track. I want to refactor some things before merging into main.

About the OAuth auth:
I modified your OAuth utility for Google Storage.
I added a parameter --oauth_type that lets you choose whether to generate a token for Google Storage or Google Drive, and I also added the possibility to use your own client ID/secret.
You should modify your client ID to accept Google Drive, because right now you must use your own client ID to generate a token.

The idea is that the OAuth client generates a refresh token, and when you mount S3QL you set the following values:
user: your client_id
password: client_secret:refreshToken
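For illustration only, here is a hedged sketch of how such credentials might be laid out in ~/.s3ql/authinfo2; the gdrive:// URL prefix and the section name are assumptions for this example, not something this PR is guaranteed to use:

[gdrive-fs]
storage-url: gdrive://my-s3ql-folder
backend-login: <your client_id>
backend-password: <your client_secret>:<your refresh token>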

Let me know what you think about this implementation. I'm not an expert in Python, so I suppose a lot of things could be done better.

After some long tests against Google Drive:
- Added retrying of requests when the status code is >= 500
- All threads now share the same access token; before, a token was obtained for every thread (fewer requests)
- Removed the ugly warnings about mem_cache by disabling the discovery cache
- Sometimes Google returns a 4XX or 5XX error on upload even though the file was uploaded correctly, so after a retry some block files end up duplicated and fsck then fails with a duplicate-key error. In case of a retry we therefore try to remove the object before uploading (see the sketch after this list).
- Some commented-out code of an unfinished batch-delete implementation (multiple deletes in a single request), used when deleting by id in cases like copy, move, or failed uploads, to avoid duplicate objects in the backend
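A hedged sketch of the delete-before-upload idea from the retry bullet above; _list_files and _delete_by_id mirror helper names that show up in the diff hunks further down, while MAX_RETRIES, _do_upload and TemporaryUploadError are made-up placeholders:

# Hedged sketch, not the PR's actual code: before a retried upload, remove any
# copy that a previously "failed" attempt may have left behind, so the backend
# never ends up with two files for the same key.
def _upload_with_retry(self, key, data):
    for attempt in range(MAX_RETRIES):
        if attempt > 0:
            # A 4XX/5XX response does not guarantee the upload failed server-side.
            for f in self._list_files(key, "id"):
                self._delete_by_id(f['id'])
        try:
            return self._do_upload(key, data)
        except TemporaryUploadError:
            continue
    raise TemporaryUploadError('upload of %s kept failing' % key)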
@Nikratio
Collaborator

Thanks a lot for your contribution, and apologies for the lack of a response! I want to look at this, but I just have much less time available to spend on S3QL than in the past.

@segator
Author

segator commented Feb 21, 2017

Hi @Nikratio, I fully understand, don't worry. The implementation isn't finished yet, but I'd like your approval to continue in this direction.

- Disabled the Google request logs
- Simplified the list request
- Fixed duplicated objects when uploading (sometimes S3QL calls open_write without checking whether the file exists, and on Google Drive you can have two files with the same name)
- Google Drive is sometimes slow to update listings after files are deleted, so a file can appear to exist when it doesn't; the delete method therefore accepts a 404 error as a valid delete. Added this exception to the retry handling.
- SSL errors are handled as temporary errors
- Avoid querying the root path ID from every thread (only the first time)
- Fixed duplicated objects when updating metadata
@szepeviktor
Collaborator

Is this PR for Google Drive or http://www.g-technology.com/products/g-drive ?

@segator
Author

segator commented May 27, 2017

Google Drive of course

@szepeviktor
Collaborator

Thanks.

@Nikratio
Collaborator

Sorry for the delay! Are you still interested in finishing this? Then I will take a look.

@Nikratio Nikratio self-assigned this Aug 24, 2017
@segator
Author

segator commented Aug 24, 2017

Yes, of course, I'd be happy to collaborate on your project. The implementation is pretty stable; I've had 3 instances without an FS crash for the last 4 months.
Some code probably needs to be refactored to be clearer (this is probably my first "hard" coding with Python).

The only error I don't know the cause of is that sometimes, when retrying uploads, S3QL duplicates the file.
I added controls to avoid this but it still happens. So when you do fsck you get duplicate ID errors, which are easily fixed by removing one of the duplicated files (they are exactly equal).
Anyway, this should be fixed before integrating into the main branch.

Some extras I added:
s3qlstat now has a Prometheus exporter, so you can get nice reports, see here:
https://snapshot.raintank.io/dashboard/snapshot/74VnTwHw6s1ZNK6MZdjs6lsdo78NMxNU
s3qlstat /path --prometheus_exporter --prometheus_port
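For context, a minimal sketch of how such an exporter could be built on the prometheus_client library; the metric names, the get_stats() callback and the default port are assumptions, not the PR's actual code:

# Hedged sketch: publish a couple of s3qlstat-style numbers as Prometheus gauges.
import time
from prometheus_client import Gauge, start_http_server

entries = Gauge('s3ql_fs_entries', 'Number of filesystem entries')
after_dedup = Gauge('s3ql_size_after_dedup_bytes', 'Data size after deduplication')

def run_exporter(get_stats, port=9399):
    # get_stats() is assumed to return a dict with the values s3qlstat prints.
    start_http_server(port)
    while True:
        stats = get_stats()
        entries.set(stats['entries'])
        after_dedup.set(stats['after_dedup'])
        time.sleep(30)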

The S3QL OAuth utility now supports Google Drive in addition to Google Storage.

@Nikratio
Collaborator

Could you explain what you mean by "s3ql duplicate the file"? What fsck errors do you get specifically?

@Nikratio Nikratio left a comment

I now took a look at the code, and this looks like a good start that could be merged eventually. So it would be great if you could continue to work on this. I have left more detailed comments in the code. Note that this is just a first pass, I'm pretty sure I'll have more to say in a few more iterations.

@@ -134,7 +134,9 @@ def main():
 'requests',
 'defusedxml',
 'dugong >= 3.4, < 4.0',
-'llfuse >= 1.0, < 2.0' ]
+'llfuse >= 1.0, < 2.0',
+'google-api-python-client >= 1.4.2']
Collaborator

Really required? Note to self: come back to that later

Author

I suppose you mean google-api-python-client; it's the official Google Drive SDK. Yes, of course it's necessary, unless you want to reinvent the wheel and implement all the API requests with your own HTTP library.

Collaborator

The problem with these official SDKs is that they are often designed to do much more than what an application like S3QL wants a library to do. For example, S3QL needs control over threading, connections, retrying requests, etc. If the SDK does all these things automatically, that's super convenient for quick experimentation, but S3QL would not work very well. That's the reason why S3QL isn't using the official Amazon S3 or Google Storage SDKs either.

@@ -8,6 +8,9 @@

 from .logging import logging, setup_logging, QuietError
 from .parse_args import ArgumentParser
+from oauth2client.client import OAuth2WebServerFlow
+from oauth2client.client import AccessTokenCredentials
+from oauth2client.client import OOB_CALLBACK_URN
Collaborator

Why is this necessary? The existing code already does OAuth without any additional modules, so why can't it work with Google Drive?

Author

@segator segator Aug 25, 2017

@mkhon commented in their PR: "This was actually the point where I decided not to reuse the s3c implementation and to use the Google Python API instead: S3QL's OAuth2 implementation uses the 'OAuth 2.0 for devices' flow, and the Google Drive API scope is not allowed for that flow: https://developers.google.com/identity/protocols/OAuth2ForDevices#allowedscopes"
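For reference, a hedged sketch of the installed-app ("out of band") flow that those oauth2client imports make possible; the client id/secret are placeholders, and this is not necessarily how the PR wires it up:

# Hedged sketch: obtain a Google Drive refresh token with oauth2client's
# installed-app flow instead of the "OAuth 2.0 for devices" flow.
from oauth2client.client import OAuth2WebServerFlow, OOB_CALLBACK_URN

flow = OAuth2WebServerFlow(client_id='your-client-id.apps.googleusercontent.com',   # placeholder
                           client_secret='your-client-secret',                      # placeholder
                           scope='https://www.googleapis.com/auth/drive',
                           redirect_uri=OOB_CALLBACK_URN)

print('Open this URL and authorize access:', flow.step1_get_authorize_url())
code = input('Enter the authorization code: ')
credentials = flow.step2_exchange(code)
print('Refresh token:', credentials.refresh_token)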

Collaborator

I think in that case I'd rather switch to a different flow for both Google Storage and Google Drive. Or is there a good reason to use different flows?


return parser.parse_args(args)
options = parser.parse_args(args)
if options.client_id == '':
Collaborator

Can you give an example of a situation where you'd want to use a different client id?

Author

Google Drive limits the maximum number of requests per minute/day depending on the application, so you might have a Google application with a higher cap (this is my case).


options = parse_args(args)
setup_logging(options)
def googleDrive(options):
Collaborator

Please follow the conventions of the existing S3QL code (which in turn mostly follows Python PEP8). This means CamelCase only for classes and under_scores for everything else. Also, method/function names should typically be verbs so you can tell what they do. From the name, there's no way to tell what the googleDrive function does.

Author

You're right, pending refactor.

@@ -182,6 +182,12 @@ def add_storage_url(self):
type=storage_url_type,
help='Storage URL of the backend that contains the file system')

def add_oauth(self):
self.add_argument("--oauth_type", metavar='<oauth_type>',default="google-storage",type=oauth_type,
Collaborator

I think it's better to instead require the user to pass a storage url, and infer the correct backend from that. At some point we may want to generate tokens that are specific to the given bucket (or even prefix).

Author

I'm not sure I understand. What you looked at in the code is the parameter for choosing the oauth_type (storage/drive) when generating the user/password for your specific Google Drive/Storage account.

Collaborator

Yes. Drop that parameter, and require the user to specify the storage URL. Then you can look at the prefix (gs:// or gdrive://) to determine whether the token is for Google Drive or Google Storage.
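A hedged sketch of the dispatch being suggested here; the gdrive:// prefix and the exact scope strings are assumptions for illustration:

# Hedged sketch: infer the OAuth scope from the storage URL instead of an
# --oauth_type option.
def scope_for_storage_url(storage_url):
    if storage_url.startswith('gs://'):
        return 'https://www.googleapis.com/auth/devstorage.full_control'
    elif storage_url.startswith('gdrive://'):
        return 'https://www.googleapis.com/auth/drive'
    raise ValueError('OAuth tokens are only supported for gs:// and gdrive:// storage URLs')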

'''Make chunk property'''
return 'ref({0}.{1})'.format(k, i)

def _encode_metadata(self, metadata):
Collaborator

please add docstring

Author

right

properties[k] = v
return properties

def _decode_metadata(self, f):
Collaborator

This looks like you really should be using thaw_basic_mapping() and freeze_basic_mapping() instead (from s3ql.common).
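A hedged sketch of what using those helpers might look like; storing the frozen mapping in a single file property named s3ql_metadata is an assumption for illustration, not the PR's code:

# Hedged sketch: serialize metadata with S3QL's own helpers instead of a
# hand-rolled per-key encoding.
from s3ql.common import freeze_basic_mapping, thaw_basic_mapping

def _encode_metadata(self, metadata):
    '''Pack *metadata* into a single (hypothetical) Drive file property.'''
    return {'s3ql_metadata': freeze_basic_mapping(metadata).decode('utf-8')}

def _decode_metadata(self, f):
    '''Rebuild the dict stored by _encode_metadata from a Drive file resource.'''
    return thaw_basic_mapping(f['properties']['s3ql_metadata'].encode('utf-8'))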

Author

I will check it, I didn't know about this

def open_write(self, key, metadata=None, is_compressed=False):
log.debug("key: {0}".format(key))
if metadata is None:
metadata = dict()
Collaborator

If metadata is never None, why does _encode_metadata check for it?

Author

Are you sure it's never None? If you are, the check could be removed.

Collaborator

No, what I'm saying is that here you ensure that metadata will never be None from here on. But then you call another function (sorry, I can't see the context at the moment; it may be something like _encode_metadata) that checks whether metadata is None and in that case returns None, defeating the check above.

log.debug("")
for f in self._list_files(self.folder, "id"):
self._delete_by_id(f['id'])
'''
Collaborator

What's happening here? Why is this all commented out?

Author

It was an attempt to implement delete_multi, but it is very hard because Google Drive does not process all of the requests as one block; most of them can fail. What do you do in that case? Re-execute until there are zero pending/failed requests? You would probably get banned for making too many requests.

Collaborator

I do not understand what you mean by "because gdrive no process all of them as block, most of them could fail". If there is no way to implement delete_multi, just don't implement it. The higher layers will fall back to repeated delete calls instead.

I expect that the server will send you warnings to slow down before banning you completely. If the method raises the right exceptions, S3QL will honor these warnings and everything should work fine. If the google-api-python-client module doesn't give you access to these early warnings then... we shouldn't use it.
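A hedged sketch of what "raising the right exceptions" could look like if the SDK is kept: map the Drive API's rate-limit and server-error responses into the backend's temporary-failure check (the method name follows the _is_temp_failure segator mentions later; the status-code mapping is an assumption):

# Hedged sketch: treat Drive's slow-down and server-error responses as
# temporary failures so S3QL backs off and retries instead of giving up.
import ssl
from googleapiclient.errors import HttpError

def is_temp_failure(self, exc):
    if isinstance(exc, HttpError):
        status = exc.resp.status
        # 403 carries rateLimitExceeded/userRateLimitExceeded, 429 is "too many
        # requests", 5xx is transient server trouble. A real implementation
        # should also inspect the error reason, since 403 can mean other things.
        return status in (403, 429) or status >= 500
    # The PR also treats SSL errors as temporary (see the earlier changelog).
    return isinstance(exc, ssl.SSLError)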

found = True
self._delete_by_id(f['id'])
if not found:
if force or is_retry:
Collaborator

Why ignore the error on retry?

Author

Google Drive doesn't always work "in sync": sometimes you get an error deleting a file, but on the next retry you find that the file doesn't exist anymore. It's weird; I added this condition to avoid S3QL crashing when this kind of thing happens.
When Google scales their infrastructure up/down it's common to get a 500 error (it's in their documentation); I guess when this happens you can get a 500 error even though the request was processed correctly.

Collaborator

Alright. Could you please put this into the code as a comment?
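A hedged sketch of how that rationale could be recorded next to the check; found, force and is_retry follow the snippet above, NoSuchObject is S3QL's standard missing-key exception, and the rest is illustrative:

# Hedged sketch built around the snippet above.
if not found:
    if force or is_retry:
        # Drive listings are only eventually consistent: a delete can report an
        # error (often a 5XX while Google scales its infrastructure up or down)
        # even though it was actually applied, so on a retry the object may
        # legitimately be gone already. Treat "not found" as success here.
        return
    raise NoSuchObject(key)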

@Nikratio
Collaborator

Friendly ping. Do you think you'll have time to resolve the open issues in the near future? If not, I'll close this pull request for now.

@segator
Author

segator commented Dec 29, 2018

Hey, I would like to have Google Drive integrated, but you asked me to avoid using the Google SDK, which from my point of view would be bad practice, so for that reason I abandoned the other points you commented on. Anyway, some people are using my implementation successfully without issues.

@Nikratio
Collaborator

Thanks for the quick reply! Are you referring to this comment?

I expect that the server will send you warnings to slow down before banning you completely. If the method raises the right exceptions, S3QL will honor these warnings and everything should work fine. If the google-api-python-client module doesn't give you access to these early warnings then... we shouldn't use it.

If so, then it's not clear to me why this means that you can't use the official sdk. Why can't you simply not implement delete_multi as I suggested?

If you are referring to the logging-related comment: can't you depend on a fixed version of the SDK?

@segator
Author

segator commented Dec 29, 2018

I'm not sure I understand correctly (sorry, my English is quite basic).
Google Drive already sends special HTTP responses when it bans you for "limit exceeded" (too many requests per second, or exceeding the daily maximum).

Those errors are already handled by _is_temp_failure (if I remember correctly).

Anyway, if Google changes anything in their API and we use their SDK, we only need to change the SDK version (depending on the changes); if they change the API and we implement the client ourselves, we will need to adapt it.
And the other, more important point: why reinvent the wheel?

Regards

@Nikratio
Collaborator

You misunderstood. You can use the official SDK as long as you can solve the issues described in the comments.

@segator
Author

segator commented Dec 29, 2018

OK, sorry. I will go through all the comments again (hopefully this month) and try to fix all of them. Google Drive works pretty stable and fast with S3QL; I have an S3QL mountpoint that has been running for more than a year and a half without issues, and another has been upgraded to use a persistent cache and works very nicely too.
Thanks for your work!

@Nikratio
Collaborator

Closing this for now as discussed above. Please do reopen once the open issues are resolved!

@Nikratio Nikratio closed this Feb 11, 2019
@eleaner

eleaner commented Jun 21, 2020

Guys, on behalf of the less cash-heavy users of S3QL, would you be able to revive the discussion?

I guess a lot of us here would love to see it running.

@stickenhoffen

Yeah I agree; Google Drive would be an amazing addition.

@eleaner

eleaner commented Nov 6, 2020

@stickenhoffen
Sadly, not much interest here.
