Should at least some of the inflammation data pass the `detect_problems` function? #170

richford opened this issue Oct 23, 2015 · 3 comments



@richford richford commented Oct 23, 2015

So we have a bunch of csv files with inflammation data. In 05-cond and 06-func, we develop some tests to tell whether or not our data is suspicious. One of the tests concerns the max behavior and one of the tests concerns the min behavior. But when we run detect_problems() on all of the files in inflammation*.csv we see that none of the csv files results in the "Seems OK!" print statement.

I just taught this lesson and some of the students stopped me to ask "Wait, so none of our data is okay?" I think it was demotivating for some of them (as it would be in the real world if you realized that all of your data were bad).

Should we consider changing either the datasets or the conditional tests in detect_problems() so that at least some of the inflammation data is considered OK?
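For readers without the lesson open, here is a rough sketch of the kind of checks being discussed. It is not the lesson's actual `detect_problems()`: the helper name `classify_inflammation`, the choice of day 20, and the exact comparisons are assumptions made for illustration. It also returns the message instead of printing it so it can be tested without the csv files.

```python
def classify_inflammation(rows):
    """Return the message a detect_problems-style check would report.

    rows: one list of daily inflammation readings per patient.
    Hypothetical helper; the thresholds below are assumptions.
    """
    # Per-day (column-wise) maximum and minimum across all patients.
    daily_max = [max(day) for day in zip(*rows)]
    daily_min = [min(day) for day in zip(*rows)]

    # Assumed "max" test: maxima start at 0 and equal exactly 20 on day 20,
    # which suggests a suspiciously regular ramp.
    if len(daily_max) > 20 and daily_max[0] == 0 and daily_max[20] == 20:
        return "Suspicious looking maxima!"

    # Assumed "min" test: every day has at least one patient reading zero.
    if sum(daily_min) == 0:
        return "Minima add up to zero!"

    return "Seems OK!"


# Toy data made up for this sketch (not taken from inflammation*.csv).
ramp = [list(range(40))]        # maxima rise by exactly 1 per day
flat_zero = [[0] * 40, [1] * 40]  # one patient never shows inflammation
plausible = [[1] * 40, [2] * 40]

for name, rows in [("ramp", ramp), ("flat_zero", flat_zero), ("plausible", plausible)]:
    print(name, "->", classify_inflammation(rows))
```

With checks of this shape, a file is reported as "Seems OK!" only if it trips neither branch, which is the outcome none of the lesson's files currently reach.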


@tbekolay tbekolay commented Oct 24, 2015

I definitely agree that this outcome is pretty sad depending on how you present it. I find that I often motivate the Python lesson by saying that it's a pretty common thing for a colleague to give you data and for you to be poking through it, so I'm often tempted to present this as the situation in which you get these csv files. However, since we're kind of trying to find some fraud here, it makes more sense to present it more adversarially, which in itself is kind of a bummer.

Personally, I find it difficult to discuss this data in general, since it just seems like numbers that I can't really attach to the real world. The Lessons subcommittee has been having some meetings (which you're welcome to attend!) talking about an overhaul of this lesson, and one of the issues we've discussed is the data set. We're looking at other data sets like the GapMinder data, but I think we're pretty unanimous in not using the inflammation data moving forward.

Of course, that's a long-term answer, and it'll be a while before those overhauled lessons are ready. So what should we do right now? I think that it's relatively difficult to cook up some fake data, and I can't think of anything less interesting than typing numbers into a spreadsheet to make sure they pass some checks. If someone wants to do that, I would tip my hat to them, but I think a more important short-term solution would be to add a callout explaining the data set. I'd recommend that this callout even include a reasonable context, e.g.,

You work with the Reproducibility Initiative and a colleague has been unable to reproduce a study that uses this set of inflammation data. They ask if you could look at these data files and see if you notice anything strange about them. The data files contain ...

Do you think this would alleviate your concerns?


@richford richford commented Oct 27, 2015

I like these suggestions a lot, both the long-term solution of using other datasets and the short-term solution of motivating the existing dataset with a callout. Thanks.

statkclee pushed a commit to statkclee/python-novice-inflammation that referenced this issue Jan 4, 2016

@gvwilson gvwilson commented Jul 31, 2016

@richford any chance you'd have time to PR this?
