when to present Pandas/matplotlib #70

Closed
stuckyb opened this issue Jul 22, 2016 · 30 comments

@stuckyb

stuckyb commented Jul 22, 2016

I understand the motivation for covering pandas and matplotlib -- I'm guessing the idea is to give learners "powerful tools" right from the get-go. However, if this is really for "people who have never programmed before", I wonder whether throwing in all of that in the middle of the day will leave students feeling empowered or overwhelmed. Just learning the basics of a general-purpose programming language, like Python, for the first time, in a single day, is a lot to cover. I'm not suggesting cutting this entirely, but I wonder if it might work better as a "putting it all together" project at the end, if there is time. And maybe also cut it down to a really small set of basic Pandas/matplotlib functionality to keep from overwhelming students -- perhaps only what is needed to complete some relatively simple data analysis task.

@gvwilson
Contributor

I agree that it's a lot - see #40 - but I also think it's important to give learners something "real" before lunch. If the lesson was built around image manipulation (Guzdial and Ericson's "media-first" approach) we'd be OK, but part of our remit was to get people to data analysis as quickly as possible. We can move "looping over data sets" to the afternoon, but it's not really enough to make a difference. The only other option I see is to drop Pandas and stick to NumPy, but that doesn't really make things simpler, and we lose all the statistics :-(

@gvwilson gvwilson self-assigned this Jul 25, 2016
@thomasballinger

thomasballinger commented Jul 25, 2016

I'm bringing my hands-on bias to the table here, but I think the topics discussed below (variable scope, test-driven development, and programming style) are not useful to near-beginners who have not written much code yet. For those with a little more experience, who inevitably find their way into these courses, these topics are very useful, and are what make sitting through the rest of the content worthwhile so they don't feel like they've wasted their time. I think there's very real value to this given the difficulty of getting just the right audience, and in fact I find it's usually these too-advanced-for-the-class folks that end up doing things with content taught in lessons. But if we really believe we're helping beginners, then these are the topics I nominate to cut.

Variable scope: the common initial mental model of scope, dynamic scope, mostly works. If we're not writing programs complicated enough, it isn't necessary to know that there's anything other than a global variable.

Test-driven development: although the way it's introduced here is pretty slick (add asserts outside of the function), I find manual testing to be easier to pick up for beginners.
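To make the "asserts outside of the function" pattern concrete, here is a minimal sketch. The function `span` and its checks are hypothetical and chosen only for illustration; they are not taken from the lesson.

```python
# Hypothetical example: the checks live outside the function and state
# what the function is expected to do.

def span(values):
    """Return the difference between the largest and smallest value."""
    return max(values) - min(values)

# Asserts placed outside the function body. In a test-first workflow
# these would be written before span() itself.
assert span([1, 2, 3]) == 2
assert span([5]) == 0
assert span([-1, 1]) == 2
```

Manual testing would instead mean calling `span` by hand and reading the printed result, which is the approach described above as easier for beginners to pick up.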

Programming style: The implicit programming practice in this lesson seems useful, but the explicit lessons are not useful to beginners.

I think of SWC as teaching folks who already write code to write better code. In courses where this is the goal, these topics are terrifically useful. But I think they all require some programming experience to pick up.

@abostroem
Contributor

My vote is for option b) Cut the stuff on programming style, defensive programming, and maybe even variable scope for the following reasons:

  • My experience was that I had a hard time understanding things like defensive programming and variable scope before I had had some experience programming.
  • I believe our audience is looking for the skills to get up and running on their research and two essential skills are data manipulation (whether in Numpy or Pandas, from data or simulation) and plotting. My goal is for students to leave with the barebones skills required to do their research - even if it isn't the most efficient or elegant and the ability to expand their knowledge as needed without me.
  • These lessons could be included as extras - to be taught if the host wants to devote more time to Python or if the group is slightly more experienced.

@stuckyb
Author

stuckyb commented Jul 25, 2016

Of the 3 topics that thomasballinger identified as non-essential (variable scope, test-driven development in defensive programming, programming style), I'd say that variable scope is most useful to true beginners. If we want them to write functions, then understanding scope really is important. And to the extent that beginners need to understand it, I don't think it's that difficult to teach, either: the basic concept is the difference between local and global variables, which is straightforward. If at all possible, I'd leave that in.
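A minimal sketch of the local/global distinction described here. The names `scale`, `scaled`, and `result` are made up for illustration.

```python
scale = 2  # global variable: visible everywhere in this script

def scaled(value):
    result = value * scale  # 'result' is local; 'scale' is read from the global scope
    return result

print(scaled(3))

# 'result' only exists while scaled() is running, so using it here fails.
try:
    print(result)
except NameError as error:
    print("result is not visible outside the function:", error)
```

The point for a beginner is only that names created inside a function stay inside it, which is what lets functions be reused without clobbering each other's variables.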

I agree that defensive programming and test-driven development are probably not so useful for "people who have never programmed before", in part because, as abostroem pointed out, it's hard to really see the value in these ideas until you've had some programming experience (specifically, negative experiences from not using them).

Still, I agree with thomasballinger that these more advanced units could be the most useful part of the day for learners who come with some programming background. So I like the idea of keeping them as extras, either to be taught if time permits or to be included if the class consists of learners with some experience.

@mboisson

Hi Greg,
I guess it depends on what the primary purpose of the training is, and what the primary purpose of the people attending the training is. What kind of program are they trying to write?

If it is mostly calls to external libraries, maybe defensive programming is not important.
If it is mostly to write an algorithm, maybe plotting is not important....

I usually try to adapt my teaching on the fly, although sometimes I discover later that I skipped an important part from a previous section, and then backtrack to that specific thing.

Maxime

@justbennet
Contributor

One thing to consider might be whether we should change the question a bit and prioritize according to the following question:

-- What do we absolutely want the learner to leave the workshop having done at least once for themselves?

Our discussion sometimes seems very instructor-centric, and I try to use the above phrasing as a sanity check on myself. I start with a list of what I want everyone to have done by workshop's end, then I keep a few more topics on tap, often not new but just ways of combining what we have already done.

Most of the people with whom I have discussed this seem to want more practice with less material. I remain very skeptical that a really meaningful coverage of functions is going to happen, so I would opt to spend more time clearing up possible confusion about usage of basic variables and programming logic in simple scripts.

@kblin

kblin commented Jul 26, 2016

I agree this really depends on the target audience and the overall content of the course. Are you going to cover R/ggplot in another module? Then you probably don't need to cover pandas/matplotlib. Is your main focus on data viz and statistics? Then possibly the more advanced programming topics are less important.

@stuckyb
Author

stuckyb commented Jul 26, 2016

I can't speak for Greg or any of the other folks designing these new lessons, but to try to focus the discussion a bit: The target audience has already been broadly identified as "people who have never programmed before", and two stated goals of the lesson (see Greg's comment above) are: 1) "to give learners something "real" before lunch"; and 2) "to get people to data analysis as quickly as possible".

So I take all of that to mean that at least some pandas/matplotlib material needs to stay. (Of course, individual instructors can and should customize their coverage for the unique requirements of their class.)

@justbennet
Contributor

Would it make sense to create the Pandas/matplotlib material that you want to have completed and work backward from there to fill in missing understanding? Quite a lot of people will be reading more code than they write, too, so that might be worth considering as a presentation method as well as a planning method. Provide the finished product at the start, then decode it.

Just a thought.

@andreww

andreww commented Jul 26, 2016

I think the best bet is probably to drop the three things identified by @thomasballinger. However, I wonder if another useful way to think about this may be to consider (roughly) how the content of this lesson should differ from the content of a (hypothetical?) lesson where the learners have some programming experience. This may help us focus on what stage in a programmer's development teaching particular topics could reasonably make a difference with the constraint of teaching in a short workshop style.

@stuckyb
Author

stuckyb commented Jul 26, 2016

I agree that test-driven development and programming style (2 of the 3 items identified by @thomasballinger) could be dropped from the "core" curriculum and recast as optional material. However, I still think that if we wish to teach functions, then we need at least some discussion of scope. The concept of local variables is a key part of understanding how functions work (and how they help organize code). Perhaps the material covering scope could be abbreviated and moved to the section introducing functions, but I don't think it'd be a good idea to drop it entirely.

@abostroem
Contributor

@justbennet that was the approach we tried with the Inflammation lesson: start with a show and tell (this is what you can do with Python), then dig into the details. In practice students really want to follow along, and many instructors felt they wanted to give the background knowledge as they went. For this reason we opted to start at the beginning with the gapminder lesson. Your idea could still be useful for lesson planning (do you see this as different from the lesson objectives?).

@justbennet
Contributor

Yes, I think planning and objectives are different, though I think ideally, they would be very closely aligned. To my mind, lesson planning is much like writing a script, whereas the objectives are more like the notes one might take. I rewrite the script for every workshop, ahead of time, to adjust timing and emphasis (and sometimes topics) based on the anticipated audience, but the overall objectives remain more stable.

Like I said, just thoughts off the top of my head. The problem is to cut back to something that can definitely be got through in the allotted two half-days, then provide a dim sum of additional topics, no?

@mickley

mickley commented Jul 27, 2016

Well, my impression is that, intentionally or not, this lesson seems less applied than some of the others (e.g., the main Python and R lessons). I'm not sure that's a bad thing, though it may frustrate some learners who are chomping at the bit to get started on their own stuff.

At least in my branch of science I feel that I regularly see people who are adept at making plots, doing statistics, and getting data in and out. But they don't actually know how to program, and they waste lots of time and create rough code as a result. I think a lot of people in science are teaching the practical/applied skills, but unless one takes a programming class or really sits down to carefully learn programming there's no opportunity to learn this stuff.

Learning things like loops, conditionals, simple error checking using print(), commenting and making code readable to someone else etc. are really critical skills many of us take for granted that many scientists don't ever learn. Perhaps the defensive programming and style don't need to be as in depth though. I think just using if and print gets you a very long way--assert is more advanced.
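A rough sketch of the contrast drawn here between checking with `if` and `print` and checking with `assert`. The functions `mean` and `mean_strict` are hypothetical names used only for this illustration.

```python
def mean(values):
    # Simple error check using only 'if' and 'print'.
    if len(values) == 0:
        print("warning: no values given, returning None")
        return None
    return sum(values) / len(values)

def mean_strict(values):
    # The same check written as an assertion: the program stops with an
    # AssertionError instead of printing a message and carrying on.
    assert len(values) > 0, "need at least one value"
    return sum(values) / len(values)

print(mean([1, 2, 3]))
print(mean([]))
```

The `if`/`print` version needs nothing beyond conditionals and printing, which is why it may suit a first lesson; the `assert` version is shorter but introduces a new statement and a new kind of failure.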

The other thing I think is nice about this is that many of these are transferable language-independent skills. You can go take these same ideas and apply them to R or any number of languages. I think that's worth stressing. And pandas/matplotlib don't fit that paradigm.

@jeremycg

jeremycg commented Jul 27, 2016

I might be wrong, but I think that one of Software Carpentry's goals is to act like a Trojan horse - people come to the workshops because they know they need to learn R or Python and once there, they learn these, but also things like version control, testing and programming style that they didn't know they needed to learn, but which are incredibly important (and evidence-backed).

For this reason, I'm hesitant to recommend dropping these parts in favour of matplotlib and pandas. If there is a need to get to plotting etc before lunch, it might be worth showing examples. In a recent (non-SWC) course I taught, the best feedback I got was walking through real world notebooks - I walked the learners through the IPython notebook from Buzzfeed about tennis match fixing and the article here in the first section and the kaggle example dataset for the titanic data in the end. Learners did not necessarily follow every part of the code, but a quick run through can show the data types, graphs and analysis possible, and was highly motivating according to the feedback.

@gvwilson
Contributor

One way to see this lesson plan more clearly might be to compare it with an alternative. I've therefore put together https://swcarpentry.github.io/python-novice-images/design/, which uses an images-first approach to Python (drawing from Ericson and Guzdial's media-first computing work at Georgia Tech). It doesn't cover Pandas or data analysis, but it does cover loops, conditionals, functions, and NumPy array notation. Please have a look and add comments here on this issue if it sparks any new thoughts.

@kevin-vilbig

It is better to have a little too much material in the lesson than not enough material, as long as it was organized in such a way that the most important concepts were covered first and that the stuff that is less general, particularly the stuff that @thomasballinger mentions, is covered at the end. It's nice to have, but those are lessons that students may have to learn the hard way, or give them links so that they can look it over after the live portion of the workshop. Maybe we should make it more clear that those parts are Optional or Intermediate? I approach these workshops like an editor looking at journalistic copy. I thought that the stuff at the end was already intended to be droppable if time didn't permit.

The rest of this is a little more general about how I approach teaching these workshops.

We can't really leave it up to naive participants to tell us what they want to know, because the whole point is that they don't know what they don't know. [1] I don't expect our learners to come out the other side of a two day beginner workshop as "competent" programmers. I expect them coming out the other side knowing what questions to ask of the legions of knowledge-bases available. I expect them to know where to begin, and to learn a little bit of the basic concepts so that they know the difference between a variable assignment and a function call. The goal of these workshops is to move people from Unconscious Incompetence to Conscious Incompetence rather than higher up on the stages of competence. Expecting more in a few hours is folly.

[1] https://en.wikipedia.org/wiki/Four_stages_of_competence

This is why I work to impart confidence in interactive computing, a willingness to make "mistakes", and to defuse the fear of "breaking things." Often when a student asks me, "What happens when you do X?" I say, "I don't know, let's try it." I do this even when I really know what will happen. Modeling the method of interactive computing is the most important thing that we do. We should be busting the harmful myth of the virtuoso genius programmer ("Sometimes, I dream in code".... ugh) who whips out perfect code and let people really understand that they can and should be trying things themselves. "Always yield to the Hands-On Imperative."

@tbekolay
Contributor

It is better to have a little too much material in the lesson than not enough material, as long as it was organized in such a way that the most important concepts were covered first and that the stuff that is less general, particularly the stuff that @thomasballinger mentions, is covered at the end.

This is also the approach that we went with in python-novice-inflammation, and in practice it really doesn't work well. It's intimidating to both instructors and students to see a lot of material at the outset, then confusing when material either doesn't get covered or (even worse) gets hastily introduced by instructors who want to get through everything. Less is definitely more.

As such, I'm also +1 on the proposal to move defensive programming and test-driven development to an intermediate lesson, though I would keep variable scope in. Right now, I feel like we recognize that few intermediate workshops get run, and so we take a more broad approach in the novice lesson. However, I think it would improve novice workshops to move material like this to an intermediate lesson, even if it doesn't get taught as much as we'd like. If a workshop gets through all the novice material, there's no reason they couldn't move on to the intermediate material afterward.

I've therefore put together https://swcarpentry.github.io/python-novice-images/design/, which uses an images-first approach to Python.

Personally I find the NumPy / Pandas lesson far more useful for scientists, which is our stated audience, though for the general audience the images-first approach is compelling. In an ideal world, I would have us maintain separate lessons, one using NumPy / Pandas for scientists and one using images or other media for certain audiences (e.g., scientists from disciplines without tabular data). And also lessons using more domain-specific tools that are a level of abstraction above NumPy / Pandas. But I think given our unideal world with limited resources, a NumPy / Pandas lesson is the most immediately useful to our target audience. Of course, I would support other people starting up those other lessons.

I feel that I regularly see people who are adept at making plots, doing statistics, and getting data in and out. But they don't actually know how to program, and they waste lots of time and create rough code as a result.

This is a bit off-topic at this point, but since I've heard this sentiment from other instructors as well (sorry that I'm unloading it all on this one quote!) I think it's important to call this out as being demotivating to scientists who are trying their best. Being adept at making plots and doing statistics and mangling data is programming, full stop.

As instructors, we should be mindful of the steep learning curve that programming presents. At workshops, it's our job to figure out where people are on that learning curve and push them up as far as they're comfortable. Progressing takes time, and you can't skip ahead because each concept builds on previous concepts. What set of concepts would someone need to know to say that they've started programming? Why would a readable script with no for loops be less of a program than one that defines classes and has loops but is poorly factored and unreadable? We want to convince students that we can save them time if they trudge up the learning curve with us. Drawing an arbitrary "you're a programmer" line on that curve does more harm than good.

@abostroem
Contributor

Huge +1 to everything @tbekolay said.

I also wanted to add that the lesson scope we are defining for this novice lesson would also be good for a half day with a group who has programmed before but is new to Python.

@justbennet
Contributor

I concur with @abostroem and @tbekolay in this. I think that, whatever the final topic selection, someone should be able to have a reasonable expectation of actually finishing it in six hours, preferably without too much rushing.

I also agree there should be some separate grab bag of extra topics, suitably labeled as such, from which to draw if you're so lucky as to get through the core material. Is there a place in the SWC universe for small, independent 'units' like that?

For some specific suggestions about the core material, I would work backward from looping over datasets, and take the elements of that and back-propagate.

When talking about libraries, don't use math (except maybe as an aside), use the glob library; when talking about lists, use lists of filenames as an example instead of numbers; in the section on variables and assignments, use file or folder names for string examples.

By doing so, all the same principles of naming things, making lists, and assigning values get demonstrated, but with semantic things that are going to be used later. I believe that helps to unify the material, making it easier to process. I think that also sets up the section on Looping over datasets to be a kind of 'Ah, ha' moment, because now many of the seemingly disparate topics presented previously come together in a recognizably useful way. They're led to some kind of synthesis.

When you get to writing a function, why not write a function to do something they've seen, like read tabular data and print summary statistics for a set of files? (For this, I created a set of gapminder files, each with one year of data.) All the same principles and techniques can be used, but again, I think it will be easier to process because they are used in service of the same 'task', namely getting some useful statistics from some files of gapminder data.
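The progression suggested here (glob, a list of filenames, then a function applied to each file) could look roughly like the sketch below. It sticks to the standard library so it runs on its own. The file names, the `gdp` column, and the numbers are invented placeholders, not the per-year gapminder files mentioned above.

```python
import csv
import glob
import os
import tempfile

def summarize(filename, column):
    """Return (min, mean, max) of one numeric column in a CSV file."""
    with open(filename, newline="") as handle:
        values = [float(row[column]) for row in csv.DictReader(handle)]
    return min(values), sum(values) / len(values), max(values)

# Write two tiny made-up files so the sketch is self-contained.
data_dir = tempfile.mkdtemp()
samples = {
    "gdp_1952.csv": [("A", 100.0), ("B", 300.0)],
    "gdp_1957.csv": [("A", 200.0), ("B", 400.0)],
}
for name, rows in samples.items():
    with open(os.path.join(data_dir, name), "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["country", "gdp"])
        writer.writerows(rows)

# A list of filenames built with glob, then a loop over that list.
filenames = sorted(glob.glob(os.path.join(data_dir, "gdp_*.csv")))
for filename in filenames:
    low, average, high = summarize(filename, "gdp")
    print(os.path.basename(filename), low, average, high)
```

Strings, lists, a library call, a loop, and a function all appear, and each one serves the same task of summarizing a set of data files.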

@kevin-vilbig

kevin-vilbig commented Jul 29, 2016

I don't give students access to the lesson notes until after the lesson and the expectations of the instructors should be managed during training.

I also agree there should be some separate grab bag of extra topics, suitably labeled as such, from which to draw if you're so lucky as to get through the core material. Is there a place in the SWC universe for small, independent 'units' like that?

@justbennet That's what I'm saying. All it would take to assuage these concerns is to mark those sections as Optional Material rather than fragmenting the repositories.

@gvwilson
Contributor

gvwilson commented Aug 1, 2016

there should be some separate grab bag of extra topics, suitably labeled as such, from which to draw if you're so lucky as to get through the core material.

We've had such in the past (see e.g. the https://github.com/swcarpentry/shell-extras repository, or the extra episodes from "Writing Data" onwards in http://swcarpentry.github.io/r-novice-gapminder/); the problem is that nobody maintains the extra material.

@gvwilson
Contributor

gvwilson commented Aug 1, 2016

I've taken another run at this by drafting a third possibility: an introduction to Python using plotting as the running example. The design notes are at https://swcarpentry.github.io/python-novice-plotting/design/, and I think this (a) gets us to NumPy and Pandas while (b) doing graphical things early on to keep people engaged and (c) not overwhelming learners with too much about style, defensive programming, etc. Comments here on the python-novice-plotting approach?

@iglpdc

iglpdc commented Aug 1, 2016

As such, I'm also +1 on the proposal to move defensive programming and test-driven development to an intermediate lesson, though I would keep variable scope in. Right now, I feel like we recognize that few intermediate workshops get run, and so we take a more broad approach in the novice lesson. However, I think it would improve novice workshops to move material like this to an intermediate lesson, even if it doesn't get taught as much as we'd like. If a workshop gets through all the novice material, there's no reason they couldn't move on to the intermediate material afterward.

Totally agree. In general, I think we should do this with most of our lessons. Our current novice lessons have enough material for a novice and an intermediate lesson in them.

I think one of the reasons the core lessons are so long is that they come from the old screencasts by @gvwilson, which aimed to be much more comprehensive and by no means could be covered in less than 3 hours. But with 3-hour lessons as the norm, we should cut them mercilessly. :)

@stuckyb
Author

stuckyb commented Aug 1, 2016

@gvwilson, I read through the design notes for both of your lesson proposals (the image-focused version and the plotting-focused version), and of the two, I like the plotting-focused lesson better because I think the material will be of more immediate practical application for (most of) our students. I also think the lesson you've outlined could address most of the concerns voiced on this thread.

I know what you've shown us are early design notes, but here are a few quick comments:

  1. I'd suggest beginning the day with a motivating example to show the students what they'll be learning and what they'll be able to do by the end of the day. This is the same as my suggestion in "motivating example" (#69) for the present lesson.
  2. I think the section on conditionals should continue with the plotting theme rather than using an image-based example.
  3. Whether/when to discuss variable scope has been an issue on this thread, so I'll say that I'd include it in the section "Writing Functions".
  4. Finally, I'd suggest that if the lesson goes this route, effort is made to keep the running stats/plotting example relatively simple, and that the "pieces" learned in each section are cumulative so that they build up to a final product at the end. Since a major goal is to give students general programming principles that they can apply in a variety of situations, I think it is very important that students don't have to devote a great deal of effort to understanding the details of numpy, etc.

Overall, though, I think a plotting-driven lesson could be a good way to go.

@lexnederbragt
Contributor

lexnederbragt commented Aug 2, 2016

From the argumentation above, what I support most is the notion that this lesson is supposed to get researchers/scientists started with data analysis in Python (including plotting). In that context, test-driven development in defensive programming and programming style are less of a priority. I suspect many beginners don't get the idea behind TDD before they have programmed for a bit, and both subjects suffer from 'what do I get out of investing in learning it' (our “teach most immediately useful first” approach).

Variable scope is supposed to take 25 minutes (10 for teaching, 15 for exercises) but that seems much too long an estimate for what is the current content.

As for intermediate lessons, I think that for each 'total-beginner' lesson (such as this one), there can (should?) be a corresponding 'using what you learned more effectively' (good practices, getting things done with less pain) lesson. These can be shorter (three hours) and would for Python include TDD and programming style/documentation, plus perhaps finding and using libraries, and importing functions from your own code in scripts.

Finally, I also liked https://swcarpentry.github.io/python-novice-plotting/design/ as a solution.

@justbennet
Contributor

Greg, I like https://swcarpentry.github.io/python-novice-plotting/design/ quite a lot! I think that looks very promising. It's a good introduction to the topics, it provides a comprehensible cognitive framework from which to hang all the ideas, it provides a final goal for the lessons that can be related to real work in a variety of areas, and above all, there's a meaningful point to everything.

Let me know if you would like me to work on implementing any of this.

@zonca

zonca commented Aug 9, 2016

I also like the plotting-first lesson, I myself have a version of the python lesson where I replaced the inflammation numpy example with analyzing mosquito data taken from the intermediate lesson, see https://github.com/zonca/swcarpentry-workshop-pandas.

I find numpy too low level for most users, pandas is a better abstraction level for people analyzing heterogeneous data.

@abostroem
Contributor

When I surveyed the community about how they teach Python (show and tell first, then get into the nitty gritty, i.e. the inflammation lesson, vs. intro material first), ~50% of people wanted intro material first and ~20% were neutral. Can some of those people who find the inflammation-example-first approach problematic to teach weigh in on the plotting-first lesson design for this lesson?

@gvwilson
Contributor

gvwilson commented Sep 6, 2016

Closing this one because the lesson's design has been updated significantly - we will open a new free discussion issue. Thank you all for your guidance.

@gvwilson gvwilson closed this as completed Sep 6, 2016
rgaiacs pushed a commit to rgaiacs/swc-python-novice-gapminder that referenced this issue May 6, 2017