New Maintainer Wanted :-) #9
Hi, @maelle I am interested in learning more about contributing to the package. I used OneKP data for research in my master's thesis and would love to help if possible. (To be pedantic, I used the OneKP public release data and not the OneKP R package.)
In terms of the codebase, my experience with R has been simple scripts here and there so far, and as such, I may or may not be qualified to hit the ground running on day one. If this is not an issue, then I am happy to try.
I tried to use this R package a few months back and had some issues documented here. I could not get the R package working and ended up writing this Python script that automates the data scraping process from the OneKP public release data. So, I am familiar with the data and automating its retrieval from the web. (Please note: as the data is hosted on Google Drive, I did end up running into Google Drive API/access limitations.)
A few questions I have are:
1. Is data migration the primary goal?
a. Is the data being moved to an in-house data hosting setup?
b. If not, do you have a short list of potential candidates in mind? BackBlaze, Wasabi, NextCloud, an FTP setup?
c. Regardless of the service provider of choice, egress would be an issue with maintaining a project like this, right? So, hosting the data on the web in an accessible manner might come with some recurring costs from the data hosting service. Does rOpenSci sponsor this? If so, do you have documentation on how to set up recurring charges?
Sorry if I missed anything in the documentation you posted. |
👋 @RijanDhakal1010! Thank you for volunteering!
The goal would not be to migrate the data, but to contact OneKP maintainers to see what's the current best way to access their data: is it Google Drive, or something else?
Thanks for asking. There's no such guideline with numbers.
I'm happy to answer more questions! |
Hello @maelle, Yes, that cleared up my confusion. I used the package recently to retrieve some data and the two biggest issues were:
I think I will try to get in touch with the previous maintainer and/or the OneKP team to explore online storage options. This seems doable. Happy to get started. Sincerely, |
👋 @RijanDhakal1010! I think it makes more sense to contact the onekp team rather than the previous package maintainer. Please post again when you know more / when you're more sure you want to become the maintainer so that I might give you access to this repository. Thanks so much! |
👋 @RijanDhakal1010 were you able to contact the onekp team? No worries if not. |
Hi @maelle, I did email the OneKP team but have not heard back from them. I am not sure if I ended up in their spam folder, if it's a defunct email address, or if they have not had the time to get back to me. I will send a new round of emails and see if I can hear back from them. I am more closely connected with folks who did auxiliary work on the OneKP project. If I do not hear back in some time, I can try to get in touch with the OneKP team through them. What do you think? |
sounds like a great strategy, thanks so much for your efforts and for the update! |
Hi @maelle, Some updates. I got a chance to talk to the principal scientist behind the OneKP project and he is happy to help maintain the accessibility of the data for this project. I learned only after talking to him that the OneKP project and the rOpenSci OneKP R package are two completely separate projects that have not had much "direct" interaction. That is not an issue, but it has left me with a few queries which can only be answered from the rOpenSci side.
The OneKP data is a few hundred gigabytes, and as of now this R package is talking to a copy of the data in a Google Drive account/folder. How familiar are you with this setup? Do you know who has been footing the bill for hosting this data on Google Drive? If rOpenSci has been funding this, can the funding be diverted to an alternative data-hosting resource more suited to this package's goals? Or has the data been sitting in a free Google account for non-profits? If no funding has been allocated so far, can it be allocated now? (Funding does not necessarily have to be capital and could be rOpenSci's in-house computing resources, if available.)
I also found out that the biologist from OneKP had a database server set up for the data, but their academic organization had the servers shut down for cybersecurity reasons pertaining to university policy. So, on-campus FTP/SFTP servers from the original OneKP scientists are most likely not an option.
The original genomic data for the OneKP project does sit on Cyverse, which could be a makeshift alternative for hosting this data but would most likely require some significant changes in the codebase to switch from something like Google Drive. I say makeshift because this is a back-end dependency which may or may not be as reliable as Google Drive (the current problems with Google Drive notwithstanding). What do you think? Sincerely, |
Thanks a ton @RijanDhakal1010!! I don't know anything about the original Google Drive setup, I'd recommend contacting @arendsee directly. Sorry to not be of more help! |
@maelle No worries! Will do! |
@RijanDhakal1010 do you need an invitation to rOpenSci slack workspace? If so to which email address? Cc @yabellini |
@maelle I do need an invitation. Please send it to rijan_dhakal@outlook.com. Thank you! |
Thank you! Note that invites are sent more or less weekly. |
Hi @RijanDhakal1010, so a bit of history. When I first implemented Then the transition to Google Drive happened and the A year later the package blew up again, see #3. And I hacked a solution in the suspiciously named commit "Fix #3 - possibly lose portability to windows". There I used a system call to curl. That was definitely not a good idea. It might be a good idea to ditch Google Drive entirely and try to interface with cyverse. This might be a lot of work. You could ask the cyverse people if they have an API. In addition to all the complexity and bugs it causes, Google Drive is blocked in several countries. |
Oh, and Rijan you are an awesome person! I think you are on the right path and it is great that you have been in contact with OneKP team. It is easy to make packages like onekp, but it is much harder to inherit and maintain them. |
Hi @arendsee , Thank you for reaching out! I can absolutely respect and understand why you had to use google drive. The issues notwithstanding, google's scale is definitely a plus point. I think Cyverse will have to be the route to go for the back-end. I believe they do have a CLI tool called iRODS for interactive and automated data retrieval (could be wrong). I will get back to this thread once I get a chance to search/read a bit more into Cyverse's APIs. Sincerely, |
Hi @arendsee, when you said you initially had an FTP server as the backend, was it the Cyverse SFTP API by any chance? If yes, then I might have accidentally re-invented the wheel but if no then it seems Cyverse does provide a currently stable I have not run into any rate limits so far interacting with Cyverse via SFTP, but I have emailed them to see if they have any rate limit policies that might hinder this idea (hopefully not!). If not, then I think this is a viable solution to move forward with. The SFTP interface provides an easy way for people to anonymously access the public folder, and R seems to have a functional SFTP interaction package. So, we could use SFTP within R and Cyverse's public folders to replace Google Drive. Sincerely, |
Hi @RijanDhakal1010, no, I never worked with the Cyverse API. Looks like you are onto a good solution! |
Hi @arendsee , Awesome! I also just heard back from Cyverse. They said they do not throttle egress so long as the number of concurrent connections is kept reasonable. So this works out as a viable solution to replace google drive. Thank you! Sincerely, |
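Editor's sketch: the constraint reported above (no egress throttling as long as concurrent connections stay reasonable) amounts to putting a hard cap on parallel downloads. The following is a minimal Python illustration of that idea only; it is not code from the onekp package, which is written in R. The names `fetch_all` and `max_connections` and the default of 4 are hypothetical, since Cyverse gave no specific number in this thread, and `fetch` stands in for whatever function actually retrieves one file.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def fetch_all(
    items: Iterable[T],
    fetch: Callable[[T], R],
    max_connections: int = 4,  # hypothetical default; no limit was quoted in the thread
) -> List[R]:
    """Apply `fetch` to every item with at most `max_connections` running at once.

    Results are returned in the same order as `items`.
    """
    if max_connections < 1:
        raise ValueError("max_connections must be at least 1")
    # The pool size is the concurrency cap: no more than `max_connections`
    # calls to `fetch` are ever in flight at the same time.
    with ThreadPoolExecutor(max_workers=max_connections) as pool:
        return list(pool.map(fetch, items))
```

A caller would pass a list of remote paths and a function that downloads one path, for example `fetch_all(paths, download_one, max_connections=2)`, where `download_one` is a placeholder for the real transfer routine.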
Hi @maelle, I think I have everything I need to start changing the back-end of the package from Google Drive to Cyverse. Do you want me to take over the repo, or work on it on my end and then make a pull request? Sincerely, |
@RijanDhakal1010 I've now invited you to the rOpenSci GitHub organization, and to a team with admin access to this repository! Sorry I hadn't done it earlier. Thanks so much for all your work on this! For info we recently created a cheatsheet for maintainers of rOpenSci packages: https://devdevguide.netlify.app/maintenance_cheatsheet.html |
@maelle , Got it and accepted. Thank you! |
@RijanDhakal1010 could you please update DESCRIPTION to change the maintainer? (removing the "cre" role from the previous maintainer, adding yourself with roles "aut" and "cre"). Thanks a lot! |
Hi @maelle, Apologies for the delay! I was able to clone the original repo to my machine (I have read rights) but I am not able to publish the changes. The specific error being Sorry this was not brought to your attention earlier; I have been working with the old code on a local fork and only just realized it now. |
No worries, I've sent you a new invitation to the GitHub organization and a GitHub team with admin access to this repo. Note that you will need to have enabled 2FA; see https://docs.github.com/en/authentication/securing-your-account-with-two-factor-authentication-2fa/configuring-two-factor-authentication |
@RijanDhakal1010 did you end up getting access? |
Hi @maelle, I do have access to the repo and the data. I apologize, but ever since I signed up here my professional responsibilities expanded somewhat unexpectedly and I was unable to debug my local branch of the repo as much as I wanted to. But I have a rudimentary framework for how I want to apply the changes required by the package. Right now, the biggest issue is not the code so much as the backend data. This package is a way to access the data published for the OneKP project, which itself was run by a consortium of scientists. The Google Drive backend as implemented right now hosts data that is not insignificantly different from the data that is in the public domain. The great thing about the public domain data is that it is hosted by an accredited research organization with reasonably generous access/egress. Switching to it will have great benefits in the long run but comes at the cost of making any new changes to the package non-backwards-compatible. I am of the mind that the 30-ish species that are missing from the public domain data are worth the cost of switching to a better-backed source. If that does not go against rOpenSci policy, then I am happy to get the ball rolling in that direction. Once again, I apologize for my tardiness here! Sincerely, |
Hi @RijanDhakal1010!
It's your package so you are the one to decide! For what it's worth, to me your arguments sound perfectly good! Cheers |
Or new maintainer team. 😸
If you're interested, please comment in the issue.
For more info, see
Cc @ropensci/admin @arendsee