New Maintainer Wanted :-) #9

Open
maelle opened this issue Dec 6, 2022 · 28 comments

@maelle
Member

maelle commented Dec 6, 2022

Or new maintainer team. 😸

⚠️ Ideally the new maintainer would look for a better way to access OneKP data than Google Drive ⚠️

If you're interested, please comment in the issue.
For more info, see

Cc @ropensci/admin @arendsee

@VectorFrankenstein

VectorFrankenstein commented Dec 21, 2022

Hi, @maelle

I am interested in learning more about contributing to the package.

I used OneKP data for research in my master's thesis and would love to help if possible. (To be pedantic, I used the OneKP public release data and not the onekp R package.)

In terms of the codebase, my experience with R has been simple scripts here and there so far, and as such, I may or may not be qualified to hit the ground running on day one. If this is not an issue, then I am happy to try.

I tried to use this R package a few months back and had some issues documented here. I could not get the R package working and ended up writing this Python script that automates scraping the OneKP public release data. So, I am familiar with the data and with automating its retrieval from the web. (Please note: as the data is hosted on Google Drive, I did end up running into Google Drive API/access limitations.)

A few questions I have are:

1. Is data migration the primary goal?

a. Is the data being moved to an in-house data hosting setup?

b. If not, do you have a short list of potential candidates in mind? BackBlaze, Wasabi, NextCloud, an FTP setup?

c. Regardless of the service provider of choice, egress would be an issue with maintaining a project like this, right? So, hosting the data on the web in an accessible manner might come with some recurring costs from the data hosting service. Does rOpenSci sponsor this? If so, do you have documentation on how to set up recurring charges?

2. Is there any etiquette I should make myself familiar with before trying to commit to the project? (For example, is there a minimum number of hours a maintainer should be available each week? Or deadlines on submitted issues?)

Sorry if I missed anything in the documentation you posted.

@maelle
Member Author

maelle commented Jan 2, 2023

👋 @RijanDhakal1010! Thank you for volunteering!

1. Is data migration the primary goal?
a. Is the data being moved to an in-house data hosting setup?
b. If not, do you have a short list of potential candidates in mind? BackBlaze, Wasabi, NextCloud, an FTP setup?
c. Regardless of the service provider of choice, egress would be an issue with maintaining a project like this, right? So, hosting the data on the web in an accessible manner might come with some recurring costs from the data hosting service. Does rOpenSci sponsor this? If so, do you have documentation on how to set up recurring charges?

The goal would not be to migrate the data, but to contact the OneKP maintainers to find out what the current best way to access their data is: Google Drive, or something else?

Is there any etiquette I should make myself familiar with before trying to commit to the project? (For example, is there a minimum number of hours a maintainer should be available each week? Or deadlines on submitted issues?)

Thanks for asking. There's no such guideline with numbers.

  • Here the first priority would be to find a better way to access the data, and update the code and tests accordingly. I'd be happy to provide general package maintenance guidance as needed, since you mentioned you're new to developing packages. You could start with https://r-pkgs.org/whole-game.html
  • Then, as far as regular maintenance is concerned, it's fine to let issues build up a bit before tackling them. I'd recommend watching the repository to be notified of issues. When I open an issue about docs building problems, for instance, I send reminders every few weeks. Once a year we send a package maintainer survey to check on the maintenance status (we started last year; the survey was open for a few weeks). Does this help?

I'm happy to answer more questions!

@VectorFrankenstein

Hello @maelle,

Yes, that cleared up my confusion.

I used the package recently to retrieve some data and the two biggest issues were:

  1. A warning about a deprecated dependency
  2. Noticeably slow retrieval of data (700 MB of data took 3.5 hours on a 1000 Mbps home internet connection)
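
For a sense of scale, here is a quick back-of-the-envelope check of the second point. It is only a sketch: it assumes "700 mbs" means 700 megabytes in decimal units and that the link really delivers its nominal 1000 megabits per second. Under those assumptions the download ran at roughly 0.44 Mbps, where the link alone would have needed about 5.6 seconds.

```python
# Back-of-the-envelope check of the transfer reported above.
# Assumptions: 700 megabytes (decimal), nominal 1000 megabit/s link.
size_mb = 700          # reported download size, in megabytes
elapsed_hours = 3.5    # reported wall-clock time
link_mbps = 1000       # nominal link speed, in megabits per second

elapsed_s = elapsed_hours * 3600
size_megabits = size_mb * 8                   # 1 byte = 8 bits
observed_mbps = size_megabits / elapsed_s     # throughput actually achieved
ideal_s = size_megabits / link_mbps           # time at full link speed

print(f"observed throughput: {observed_mbps:.2f} Mbps")
print(f"time at full link speed: {ideal_s:.1f} s")
print(f"fraction of link used: {observed_mbps / link_mbps:.4%}")
```

So the home connection was almost certainly not the bottleneck; the slowdown sits somewhere between the package and the Google Drive backend.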

I think I will try to get in touch with the previous maintainer and/or the OneKP team to explore online storage options.

This seems doable. Happy to get started.

Sincerely,
Rijan

@maelle
Member Author

maelle commented Jan 2, 2023

👋 @RijanDhakal1010!

I think it makes more sense to contact the onekp team rather than the previous package maintainer.

Please post again when you know more / when you're more sure you want to become the maintainer so that I might give you access to this repository.

Thanks so much!

@maelle
Member Author

maelle commented Feb 7, 2023

👋 @RijanDhakal1010 were you able to contact the onekp team? No worries if not.

@VectorFrankenstein

Hi @maelle, I did email the OneKP team but have not heard back from them. Not sure if I ended up in their spam folder, if it's a defunct email, or if they have not had the time to get back to me.

I will send a new round of emails and see if I can hear back from them. I am more closely connected with folks who did auxiliary work on the OneKP project. If I do not hear back in some time, I can try to get in touch with the OneKP team through them.

What do you think?

@maelle
Member Author

maelle commented Feb 7, 2023

Sounds like a great strategy, thanks so much for your efforts and for the update!

@VectorFrankenstein

Hi @maelle,

Some updates.

I got a chance to talk to the principal scientist behind the OneKP project, and he is happy to help maintain the accessibility of the data for this project.

I learned only after talking to him (the OneKP principal scientist) that the OneKP project and the rOpenSci onekp R package are two completely separate projects that have not had much "direct" interaction. This is not an issue, but it has left me with a few questions that can only be answered from the rOpenSci side.

The OneKP data is a few hundred gigabytes, and as of now this R package is talking to a copy of the data in a Google Drive account/folder. How familiar are you with this setup? Do you know who has been footing the bill for hosting this data on Google Drive? If rOpenSci has been funding this, can the funding be diverted to an alternative data-hosting resource more suited to this package's goals? Or has the data been sitting in a free Google account for non-profits? If no funding has been allocated so far, can it be allocated now? (Funding does not necessarily have to be money; it could be rOpenSci's in-house computing resources, if available.)

I also found out that the biologist from OneKP had a database server set up for the data, but their academic organization had the servers shut down for cybersecurity reasons pertaining to university policy. So, on-campus FTP/SFTP servers from the original OneKP scientists are most likely not an option.

The original genomic data for the OneKP project does sit on Cyverse, which could be a makeshift alternative for hosting this data but would most likely require some significant changes in the codebase to switch from something like Google Drive. I say makeshift because this is a back-end dependency which may or may not be as reliable as Google Drive (the current problems with Google Drive notwithstanding).

What do you think?

Sincerely,
Rijan

@maelle
Member Author

maelle commented Feb 9, 2023

Thanks a ton @RijanDhakal1010!! I don't know anything about the original Google Drive setup, I'd recommend contacting @arendsee directly. Sorry to not be of more help!

@VectorFrankenstein

@maelle No worries! Will do!

@maelle
Member Author

maelle commented Feb 9, 2023

@RijanDhakal1010 do you need an invitation to the rOpenSci Slack workspace? If so, to which email address? Cc @yabellini

@VectorFrankenstein

@maelle I do need an invitation. Please send it to rijan_dhakal@outlook.com. Thank you!

@maelle
Member Author

maelle commented Feb 9, 2023

Thank you! Note that invites are sent more or less weekly.

@arendsee
Contributor

Hi @RijanDhakal1010, so a bit of history. When I first implemented onekp it used the old FTP server and everything was awesome. I talked briefly to the database manager on the cyverse side. Eric Carpenter, I believe his name was. But the FTP site was working fine, so I was not motivated to change.

Then the transition to Google Drive happened and the onekp package blew up. You can check out the comments in #2 for a bit of context. joelnitta found a workaround. That was back in 2019.

A year later the package blew up again, see #3. And I hacked a solution in the suspiciously named commit "Fix #3 - possibly lose portability to windows". There I used a system call to curl. That was definitely not a good idea.

It might be a good idea to ditch Google Drive entirely and try to interface with cyverse. This might be a lot of work. You could ask the cyverse people if they have an API. In addition to all the complexity and bugs it causes, Google Drive is blocked in several countries.

@arendsee
Contributor

Oh, and Rijan you are an awesome person! I think you are on the right path and it is great that you have been in contact with OneKP team. It is easy to make packages like onekp, but it is much harder to inherit and maintain them.

@VectorFrankenstein

Hi @arendsee ,

Thank you for reaching out!

I can absolutely respect and understand why you had to use Google Drive. The issues notwithstanding, Google's scale is definitely a plus.

I think Cyverse will have to be the route to go for the back-end. I believe they do have a CLI tool called iRODS for interactive and automated data retrieval (could be wrong).

I will get back to this thread once I get a chance to search/read a bit more into Cyverse's APIs.

Sincerely,
Rijan

@VectorFrankenstein

Hi @arendsee,

When you said you initially had an FTP server as the backend, was it the Cyverse SFTP API by any chance? If yes, then I might have accidentally reinvented the wheel; if not, it seems Cyverse does provide a currently stable SFTP interface to their public data folders. I have been using curl to interact with the data, and so far the data transfer has been fast and convenient.

I have not run into any rate limits so far interacting with Cyverse via SFTP, but I have emailed them to see if they have any rate limit policies that might hinder this idea (hopefully not!). If not, then I think this is a viable solution to move forward with. The SFTP interface provides an easy way for people to anonymously access the public folder, and R seems to have a functional SFTP interaction package. So, we could use SFTP within R and Cyverse's public folders to replace Google Drive.
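
To make the curl-over-SFTP idea concrete, here is a minimal sketch of how such a download command could be assembled. Everything specific in it is an assumption for illustration: the host `sftp.example.org`, the remote path, and the `anonymous` user with an empty password are placeholders, not Cyverse's actual settings, and `build_sftp_curl_command` is a hypothetical helper. It also assumes a curl build with SFTP support. The sketch only builds and prints the command; it does not contact any server.

```python
import shlex
from urllib.parse import quote


def build_sftp_curl_command(host, remote_path, dest, user="anonymous"):
    """Build a curl argument list that fetches one file over SFTP.

    Hypothetical helper: host, path, and the anonymous login are
    placeholders to be replaced with whatever the data host documents.
    """
    # curl treats the path after the host as absolute for sftp:// URLs.
    url = f"sftp://{host}/{quote(remote_path.lstrip('/'))}"
    # "user:" passes an empty password (assumed for anonymous access).
    return ["curl", "--user", f"{user}:", "--output", dest, url]


# Placeholder host and path, for illustration only.
cmd = build_sftp_curl_command(
    host="sftp.example.org",
    remote_path="/public/onekp/example_species.fa.gz",
    dest="example_species.fa.gz",
)
print(shlex.join(cmd))
```

Keeping the command as an argument list means it could later be handed to something like `subprocess.run(cmd, check=True)` without going through a shell, which avoids shell-specific quoting.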

Sincerely,
Rijan

@arendsee
Contributor

arendsee commented Mar 2, 2023

Hi @RijanDhakal1010, no, I never worked with the Cyverse API. Looks like you are onto a good solution!

@VectorFrankenstein

Hi @arendsee ,

Awesome! I also just heard back from Cyverse. They said they do not throttle egress so long as the number of concurrent connections is kept reasonable. So this works out as a viable solution to replace google drive.
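
One way to keep the number of concurrent connections reasonable is to cap it in the client. The sketch below shows the general pattern with a bounded thread pool. It is an illustration under assumptions: `fetch_all` and `fake_fetch` are hypothetical names, the limit of 2 is arbitrary (no specific number was given), and the download function is a stand-in so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(paths, fetch_one, max_connections=4):
    """Apply fetch_one to every path with at most max_connections in flight."""
    with ThreadPoolExecutor(max_workers=max_connections) as pool:
        # pool.map keeps results in input order and re-raises the first
        # exception when its result is reached.
        return list(pool.map(fetch_one, paths))


def fake_fetch(path):
    # Stand-in for a real download; returns a label instead of file contents.
    return f"fetched {path}"


results = fetch_all(["a.fa", "b.fa", "c.fa"], fake_fetch, max_connections=2)
print(results)
```

With a real download function in place of `fake_fetch`, `max_connections` becomes the single knob for staying within whatever limit the data host considers reasonable.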

Thank you!

Sincerely,
Rijan

@VectorFrankenstein

Hi @maelle,

I think I have everything I need to start changing the back-end of the package from Google Drive to Cyverse. Do you want me to take over the repo, or work on it on my end and then make a pull request?

Sincerely,
Rijan

@maelle
Member Author

maelle commented Mar 2, 2023

@RijanDhakal1010 I've now invited you to the rOpenSci GitHub organization, and to a team with admin access to this repository! Sorry I hadn't done it earlier. Thanks so much for all your work on this!

For info we recently created a cheatsheet for maintainers of rOpenSci packages: https://devdevguide.netlify.app/maintenance_cheatsheet.html

@VectorFrankenstein

@maelle, got it and accepted. Thank you!

@maelle
Member Author

maelle commented Apr 7, 2023

@RijanDhakal1010 could you please update DESCRIPTION to change the maintainer? (removing the "cre" role from the previous maintainer, adding yourself with roles "aut" and "cre"). Thanks a lot!

@VectorFrankenstein

Hi @maelle, apologies for the delay! I was able to clone the original repo to my machine (I have read rights), but I am not able to push my changes. The specific error is: github permission denied class=Ssh (23); code=Eof (-20). Is this something on rOpenSci's end or mine?

Sorry this was not brought to your attention earlier; I have been working with the old code on a local fork and only just noticed it.

@maelle
Member Author

maelle commented Apr 7, 2023

No worries, I've sent you a new invitation to the GitHub organization and a GitHub team with admin access to this repo. Note that you will need to have enabled 2FA; see https://docs.github.com/en/authentication/securing-your-account-with-two-factor-authentication-2fa/configuring-two-factor-authentication

@maelle
Member Author

maelle commented Nov 7, 2023

@RijanDhakal1010 did you end up getting access?

@VectorFrankenstein

Hi @maelle,

I do have access to the repo and the data. I apologize, but ever since I signed up here my professional responsibilities have expanded somewhat unexpectedly, and I have been unable to debug my local branch of the repo as much as I wanted to. I do have a rudimentary framework for how I want to apply the changes the package requires.

Right now, the biggest issue is not the code so much as the backend data. This package is a way to access the data published for the OneKP project, which itself was run by a consortium of scientists. The Google Drive backend as implemented right now hosts data that is not insignificantly different from the data that is in the public domain. The great thing about the public domain data is that it is hosted by an accredited research organization with reasonably generous access/egress. Switching to it will have great benefits in the long run, but comes at the cost of making any new changes to the package backwards-incompatible.

I am of the mind that losing the 30-ish species that are missing from the public domain data is an acceptable cost of switching to a better-backed source. If that does not go against rOpenSci policy, then I am happy to get the ball rolling in that direction.

Once again, I apologize for my tardiness here!

Sincerely,
Rijan

@maelle
Member Author

maelle commented Nov 7, 2023

Hi @RijanDhakal1010!
Congrats on the job expansion!

I am of the mind that losing the 30-ish species that are missing from the public domain data is an acceptable cost of switching to a better-backed source. If that does not go against rOpenSci policy, then I am happy to get the ball rolling in that direction.

It's your package so you are the one to decide! For what it's worth, to me your arguments sound perfectly good!

Cheers
