Robust scientific practice that is not open? #344

Closed
fkiraly opened this issue Mar 19, 2019 · 16 comments
Assignees: KirstieJane
Labels: enhancement (New feature or request), reproducibility-book (Content for reproducibility book)

Comments

fkiraly commented Mar 19, 2019

Summary

Proper scientific practice is worthwhile to pursue in general - and I can easily imagine that the outputs of the Turing Way would be very useful to any organisation engaged in practical data science.

Now, robust, valid, and reproducible research is not necessarily the same as open research: an organisation may be interested in empirical truth without necessarily wanting to broadcast its best findings to the entire world (unlike, say, a typical member of the academic community).

More precisely, data and data scientific artefacts may be sensitive, often for commercial, ethical, or privacy reasons.

Currently, the Turing Way seems to be mostly (exclusively?) assuming the academic ideal of openness - which, paradoxically, is de facto shunned by mainstream academia for various reasons, for which this margin is perhaps too small.

Non-academic entities (e.g., government, industry), on the other hand, are very strongly interested in the ideal of robustness/validity without necessarily wanting, or being able, to embrace openness, for the reasons stated above.

I personally think success lies down the second route - instead of appealing to powerful academics' conscience (which may or may not exist), present an argument that makes itself through the tangible benefits of an empirically robust workflow (by definition, "empirically robust" means the benefits are real).

In any case, I think the Turing Way should not shun closed, or semi-open, research simply because it is closed. I'd be strongly in favour of outlining best practices and processes for both, and of making explicit the various trade-offs involved (e.g., IP ownership vs the benefits of external quality checks).

What needs to be done?

Think about this.

Who can help?

The community, with helpful or critical consideration.


Updates

@KirstieJane (Collaborator)

Thanks @fkiraly!

I’m sad that the Turing Way appears to be shunning closed research!! That’s absolutely not the goal.

Can you point to where we do that? Obviously the chapter on Open Research is focused on open research, but I think the goal of building a private BinderHub very explicitly supports closed research (including, for example, research that stays closed until the research team chooses to open it).

fkiraly commented Mar 19, 2019

Well, it looks like there's no chapter planned on closed research... or on processes to ensure quality and reproducibility in that setting. Or is there?

"shunned" is perhaps a too strong word, agreed :-)

fkiraly commented Mar 19, 2019

Perhaps it's down to content in chapters that are yet to come into existence - so it's entirely possible that I'm not getting the full picture of a thing that doesn't exist yet.

KirstieJane commented Mar 19, 2019

I think every chapter (except open research) supports closed or open research!

I don’t want to include a specific chapter on closed research - I don’t think that helps anyone all that much - but all the chapters should include options for free (educational discount) private GitHub and Travis accounts, so that no part of version control, CI, testing, or reproducible computational environments requires public code/data.

fkiraly commented Mar 19, 2019

Very non-hypothetical scenario (one that I know you are familiar with): sensitive patient records need to be analyzed in a safe haven. The analyses live in the safe haven, as does purpose-written but generalizable code for the methods and for data preparation/cleaning. The data cannot be moved, and only parts of the code may be declassified. What does reproducible research mean in that case?
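
To make this concrete, here is a minimal sketch of such a split, assuming a tabular analysis in Python - all module names, paths, and column names below are hypothetical. The point is that the generalizable method code carries no reference to the sensitive data and could be reviewed and declassified on its own, while the data-access code never leaves the haven.

```python
# Illustrative only: a hypothetical module layout for a safe-haven analysis.
import pandas as pd
from sklearn.linear_model import LogisticRegression


# --- methods.py: generalizable, a candidate for declassification ---
def fit_risk_model(features: pd.DataFrame, outcome: pd.Series) -> LogisticRegression:
    """Fit a model; this code contains no reference to the sensitive records."""
    model = LogisticRegression(max_iter=1000)
    model.fit(features, outcome)
    return model


# --- pipeline.py: stays classified, never leaves the safe haven ---
def load_patient_records() -> pd.DataFrame:
    """Read the sensitive records; the path and schema are haven-internal."""
    return pd.read_csv("/safe_haven/patient_records.csv")


def run_analysis() -> LogisticRegression:
    records = load_patient_records()
    features = records.drop(columns=["outcome"])
    return fit_risk_model(features, records["outcome"])
```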

Wouldn't this deserve extra discussion?

@KirstieJane (Collaborator)

This is explicitly planned for one of the champions project case studies.

(I have to go offline. Please be assured that this is a point that is baked into the whole purpose of the book.)

fkiraly commented Mar 19, 2019

Well, I know :-) that's why I'm asking.

@ikosmidis

Just a few extra thoughts:

I like the idea of an "open research" chapter. I suggest (and this is relevant to this thread), if not already planned/done, giving a clear, careful definition of what "open" means in expressions like "open data", "open hardware", "open access", and the like.

My working definition, also close to the point @fkiraly made, is that "open" can only be defined relative to the context in which it is used, especially as far as research goes.

For example, my academic work (papers, code, data, and so on) is open in the sense of being openly accessible to the world (hmm, publishers!). I have, though, engaged in research and delivered data-analytic pipelines that:

  • are not openly accessible to the world
  • use openly accessible tools and methods to analyze closed data (in silos or safe havens)
  • keep the data, code, repositories, and everything else needed to set up the workflow "open" only within the organisations and teams who directly benefit (commercially or otherwise) from their outputs and who may want to modify them later without my input.

So tools that encourage "open research" are still very much useful for semi-open/closed research.

If not already done, I feel that the Turing Way should recognise this explicitly and provide a definition of "open" in one of its early chapters. It may also be worth checking whether some of the statements containing the word "open" can be put in terms of "reproducible" and "replicable"; the latter two have an almost immutable definition across the various contexts in which data science is useful/required.

@KirstieJane (Collaborator)

Thanks @ikosmidis. The Open Research chapter is already written and merged into the chapters.

Really happy to receive any feedback on it!

@KirstieJane (Collaborator)

There's an issue open for me to write an introduction and motivation chapter: #161.

With all the workshops we've run and the Turing events for Health and TPS this month, I haven't had time to sit down and incorporate those points. But most of the points in your and @fkiraly's comments will go in there (or are already in the open research chapter!)

@KirstieJane KirstieJane added enhancement New feature or request reproducibility-book Content for reproducibility book labels Mar 20, 2019
@KirstieJane KirstieJane self-assigned this Mar 20, 2019
@pherterich (Collaborator)

The research data management chapter #196 has a section on when data cannot be open. It's not super detailed yet, mainly because we wanted to get the main points in first, giving you a starting point from which to point out where exactly you feel more detail is needed. Any comments on that will be super helpful for understanding how much detail you expect us to provide in certain areas.

As @KirstieJane mentioned, the chapter should be followed by a case study on data that couldn't be shared and how @LouiseABowler still made the research reproducible.

Our definition of reproducibility just says "same data and same code = same results", and the chapter links this to Open Research because that is simply the easiest way to get access to data and code, where possible. We'll review to ensure that this chapter is inclusive; any concrete pointers are welcome. I authored both chapters mentioned, so I might have some blind spots when looking at them.
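
As a minimal illustration of that definition (the analysis function below is a hypothetical stand-in), the check is mechanical: run the same code on the same data twice and compare fingerprints of the results. Note that nothing in the check requires the data or the code to be public.

```python
# A minimal sketch of "same data + same code = same results":
# run a deterministic analysis twice and compare output fingerprints.
import hashlib
import json


def analyse(data):
    """A deterministic analysis: fixed code, no hidden randomness."""
    return {"n": len(data), "mean": sum(data) / len(data)}


def fingerprint(result):
    """Canonical JSON serialisation, then a SHA-256 digest."""
    blob = json.dumps(result, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


data = [1.0, 2.0, 3.0]
assert fingerprint(analyse(data)) == fingerprint(analyse(data))
```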

fkiraly commented Mar 20, 2019

Dear @pherterich, the research data management chapter does indeed address a number of the points - thanks for pointing this out. I was only looking at the completed chapters and not at the pull requests; my mistake.

Still, I think it might be worth thinking about different audiences and different scenarios.
I believe, @KirstieJane, that you have carefully thought about this in your internal discussions, but on the other hand, in my personal opinion, it does shine through a bit whom you primarily write for.

More precisely, i.m.o., it is crystal clear that this is the stylized young academic whom you are (helping a lot, but also) nudging a bit towards the ethical imperative with your narrative index finger.

You make a lot of arguments about how openness and reproducibility benefit such an academic researcher - who will be, in essence, a "provider" of data science from a societal standpoint.
But it can also make sense for non-academic "end users" of data science to champion openness, solid science, and reproducibility, even without embracing openness entirely in all processes.

Rephrasing this: you may like to think about how a non-academic end user of data science would be reading and reacting to your thoughts.

Some examples of what such an audience may or may not find interesting, assuming you're an industrial/governmental/clinical end user:
  • What should an internal set of robustness standards, against which to hold data science work, look like?
  • When should you go open, and what should you keep secure? What are the trade-offs and interactions?
  • How can you leverage, say, the benefits of the independent quality checks that are common in the academic community? How should you support, and work with, the right kind of research quality initiative so that everyone wins?

You do mention scikit-learn, for example, which I think gives you an interesting case study for these considerations. I personally think there's an entirely new translational interaction model there, but you might not have to go that far into original research to make some points along these lines.
(though, of course, no one forces you)

KirstieJane commented Mar 20, 2019

Sorry @fkiraly - I think I'm lost. Where are you seeing these arguments? And the mention of sklearn?

(on reflection, I think maybe you're reading them in the chapters? Not something that I've written? You mean "you" in the plural?)

fkiraly commented Mar 20, 2019

With the last paragraph, I was referring to the existing open research chapter, which was written by @r-j-arnold, not you (I believe?):
https://github.com/alan-turing-institute/the-turing-way/blob/master/chapters/open_research.md
More precisely, to the "open software" paragraph, where sklearn is explicitly mentioned.

I think it's really nice, and it - almost - brings home the point of why, even if you operate within a locally closed environment, open software in the data science ecosystem is still good for you (and why it therefore makes sense to support it).

@KirstieJane (Collaborator)

Gotcha - thanks @fkiraly.

If you’d like to open a PR to that chapter you’re super welcome to! (That’s the goal!)

I’m wondering, though, if there’s a separate chapter we could do that’s an “interview” with you (or a group of folks), where we capture the tensions of open vs closed working and specifically conclude with @ikosmidis’ point that “open” should be defined relative to your needs.

What do you think?

It can be a traditional chapter too, of course, but as these will be quite strong opinions, I like the idea of presenting them as such: different points of view for different situations.

@dingaaling (Collaborator)

Hello from 2023 @fkiraly! Hope all is well with you, many months later. The TTW Core team is currently reviewing issues in our backlog to determine how they can best be taken forward.

It looks like this issue is no longer being actioned, so we'll be closing it for now - however, please feel free to re-open it to continue working on it at any point! If you have any questions or concerns, please comment below ✨ Thank you!

Outstanding Issues + Pull Requests Review automation moved this from Outstanding Issues to Done Feb 7, 2023