Reading Sheets Import and Process prototype #2523
Conversation
@kevinrobinson I just realized I wasn't rerunning the tests when I reopened this, so I don't think these failures are from this branch. That aside, when you have a chance, would you take a look and see if this looks suitable for the reading sheets import?
@edavidsonsawyer This is awesome progress! 👍 Apologies on the delay in not noticing this earlier in the week. I'm assuming this is built towards the format in

First, let's update the format to use a new sheet called

Second, let's remove any computed values (eg, how many levels of growth). The new format should discourage this more explicitly, but either way if there is anything that's computed we don't need to import it, as we can just re-compute ourselves. There are sometimes subtle differences here, and removing that will just simplify a bit.

Third, there are validations on the

Also, I think it's great that this PR punts the "import" step for now, which will have to handle the "sync" process where data points change over time. We can do that separately, so focusing on the processor class seems like great scoping. 👍
@edavidsonsawyer re test failures, this is what I see here: https://travis-ci.org/studentinsights/studentinsights/jobs/561886950#L3144

So these are just linter failures; you can fix most of the minor things like whitespace by running
```ruby
reset_counters!

# parse
rows = []
StreamingCsvTransformer.from_text(@log, file_text).each_with_index do |row, index|

#last row before data is the header
string = (@header_rows_count > 1) ? file_text.lines[@header_rows_count-1..-1].join : file_text
```
Can you say more about moving this before the CSV is parsed, here rather than afterward in `flat_map_rows`?
Sure thing. `StreamingCsvTransformer` needs a string that can be parsed as a csv. This is here to trim header rows that we aren't going to use later to reference data points, assuming the last line before the data begins is the "real" header. I haven't been able to find a cleaner way to specify which row in the csv to treat as the header. `flat_map_rows` just takes a csv row as an argument; it needs to reference the header data.

An alternative approach would be to define the headers explicitly in code rather than trying to get them from the csv.
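For reference, here's a minimal sketch of that header-trimming idea in plain Ruby. The helper name, the sample `file_text`, and the hard-coded header count are illustrative assumptions, not the exact code in this PR:

```ruby
require 'csv'

# Sketch only: drop everything above the "real" header row so the CSV parser
# can treat the last header line as the header.
def trimmed_csv_text(file_text, header_rows_count)
  return file_text if header_rows_count <= 1
  file_text.lines[header_rows_count - 1..-1].join
end

file_text = "instructions for teachers,,\nstudent_id,benchmark,score\n111,F_AND_P,C\n"
rows = CSV.parse(trimmed_csv_text(file_text, 2), headers: true)
rows.each { |row| puts row['student_id'] } # prints "111"
```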
Ah got it, thanks! 👍 Yeah, so the problem is that in that template the actual "rows" start in the second row, since the first row has other stuff that's helpful for people entering data.

The new reading template format reverses that: the first row is hidden from people, so it makes for simple parsing, and then some formats have a few rows without any real data, with explanations for people about how to use the sheet, etc.

So this makes total sense as a workaround, and then we should be able to remove it with the new format.
That's great! Since this is likely to be sensitive to formatting changes, there are probably a few pieces here that might need to change with new formats. That shouldn't be a problem as long as the final template works well for everyone using it.
Yep, let me know if you want to chat more. It's possible there will be some drift in the format from now until September, but I'm guessing it'll just be things like column names, or maybe skipping a different number of header rows, or something like that.
Here are our notes talking through the cases for update. I think we can do the "importer" work without doing this first, and just add the data points we find.

Step 1: just add the data points we find. That won't handle running the import twice - we'll get extra data points, but that's okay as a good first step.

Step 2:

Step 3: After that it might be:

Those cases might be as simple as: "delete everything that ever came from this form." But there might be important edge cases we want to handle, like the teacher deleting a row because a student changed homeroom. And we'll have to translate those into simple guidance for educators too. Alternately, we might just want to say "if the student is in the sheet, update their records for that time period in the sheet."

We don't have to handle all those edge cases - simple is better! We could say "if you do X, Y will happen" and "here are two edge cases that would break, but we're not handling for now."
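As a rough illustration of the "delete everything that ever came from this form, then re-add it" option, here's a minimal sketch using plain in-memory data; the function name, hash keys, and `source_key` tag are all hypothetical, not Insights' actual schema:

```ruby
# Sketch: replace every data point previously produced by this sheet with the
# rows we just parsed, so re-running the import stays idempotent.
def replace_data_points_from_source(existing, parsed_rows, source_key:)
  kept = existing.reject { |point| point[:source_key] == source_key }
  kept + parsed_rows.map { |row| row.merge(source_key: source_key) }
end

existing = [
  { student_id: 111, score: 'C', source_key: 'sheet:3rd-grade' },
  { student_id: 222, score: 'F', source_key: 'manual-entry' }
]
parsed_rows = [{ student_id: 111, score: 'D' }]

# Running this twice yields the same result; rows deleted from the sheet
# simply disappear from the next import.
updated = replace_data_points_from_source(existing, parsed_rows, source_key: 'sheet:3rd-grade')
```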
@edavidsonsawyer FYI I pushed some other commits on #2544 related to the processor code. You should be able to pull them all in, but if there are conflicts you're not sure how to resolve, just ping me and I am happy to help with merging!
@kevinrobinson thanks. I merged that branch in, so I don't think that will be a problem here. I'm rewriting the spec for this to make it more useful and robust, and we should stop seeing these failures.
Okay, sounds good! Let me know if there's anything else you need, and either way I'll probably check in early next week to see what else needs doing to start trying this out with fall benchmarks!
Hi @kevinrobinson, this should be ready for a look.
Hey @edavidsonsawyer! This looks like good progress towards step 1. I left comments inline, let me know if you want to chat more. I'll still plan on spending some time Friday morning checking in with where we're at and then seeing if I can start trying to deploy step 1 incrementally.
Also, I tried to fix the flaky tests on master, so merging should resolve that. Thanks for working around the noise and feel free to ping if it comes up again. 👍
@kevinrobinson I've made the changes recommended above. I also gave this a shot using a test account with the fetcher and the template sheet, and it seemed to work well with a handful of students across the different grades.
@edavidsonsawyer Great! Merging and going to try kicking the tires here. 👍 🎉
Who is this PR for?
Educators
What problem does this PR fix?
Reading data from schools is kept in spreadsheets that must be manually moved for the data to reach Insights.
What does this PR do?
Introduces Importer and Processor classes that will fetch sheets from a Google Drive folder and process the data so it can be added to Insights.
This assumes the sheets within a grade follow a consistent template, but the specific template can be flexible between grades.
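As a rough sketch of the shape of that flow (the class name, collaborators, and method signatures below are illustrative assumptions, not the exact API in this PR):

```ruby
# Sketch only: fetch each grade's sheet from a Drive folder, then hand its
# text to a grade-specific processor that returns rows ready for Insights.
class ReadingSheetsImportSketch
  def initialize(fetcher:, processors_by_grade:)
    @fetcher = fetcher                      # assumed to wrap the Google Drive fetch
    @processors_by_grade = processors_by_grade
  end

  def import
    @fetcher.fetch_sheets.flat_map do |sheet|
      processor = @processors_by_grade.fetch(sheet.grade)
      processor.process(sheet.csv_text)     # rows parsed from that grade's template
    end
  end
end
```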
Checklists
Does this PR use tests to help verify we can deploy these changes quickly and confidently?