
Reading Sheets Import and Process prototype #2523

Merged
merged 40 commits into studentinsights:master on Sep 20, 2019

Conversation

@edavidsonsawyer
Collaborator

edavidsonsawyer commented Jul 18, 2019

Who is this PR for?

Educators

What problem does this PR fix?

Reading data from schools is kept in spreadsheets that must be manually moved for the data to reach Insights

What does this PR do?

Introduces Importer and Processor classes that fetch sheets from a Google Drive folder and process the data so it can be added to Insights.

This assumes the sheets within a grade follow a consistent template, but the specific template can be flexible between grades.

Checklists

  • Core

Does this PR use tests to help verify we can deploy these changes quickly and confidently?

  • Included specs for changes
  • Manual testing made more sense here
@edavidsonsawyer

Collaborator Author

edavidsonsawyer commented Jul 23, 2019

@kevinrobinson I just realized I wasn't rerunning the tests when I reopened this, so I don't think these failures are from this branch.

That aside, when you have a chance, would you take a look and see if this looks suitable for the reading sheets import?

@kevinrobinson

Contributor

kevinrobinson commented Jul 26, 2019

@edavidsonsawyer This is awesome progress! 👍 Apologies for the delay; I didn't notice this earlier in the week.

I'm assuming this is built towards the format in example 3 (Somerville 2nd grade reading), is that right? If so, I think there are three good next steps here, and then let's merge this in, and hopefully the timing will align so we have some production data ready then too. Let me know if you want to chat more over email or video too!

First, let's update the format to use a new sheet called dev SPS Reading Benchmarks (7/19/19) that is added in drive. This is the proposed new format for the coming year for benchmarks, and I marked the older reading examples as deprecated now. You'll notice there's a hidden row at the top with explicit computer-friendly column keys, and then there are two rows for the humans using the sheets :) So after parsing the CSV, the processor code can just skip the first two rows when processing. So let's update this to use that sheet format, and you could update the processor to be able to handle all the tabs at once, or just starting with one tab/grade is great too.

Second, let's remove any computed values (eg, how many levels of growth). The new format should discourage this more explicitly, but either way if there is anything that's computed we don't need to import it as we can just re-compute ourselves. There are sometimes subtle differences here, and removing that will just simplify a bit.

Third, there are validations on the ReadingBenchmarkDataPoint model that only allow adding assessment_key values that are whitelisted. That's a small step to help us stay sane when adding a lot of different values. So the test here is awesome for testing the processor code, but the new processor code introduces new assessment_key values that aren't whitelisted, so if we tried rows.each {|row| ReadingBenchmarkDataPoint.create!(row) } many would fail (eg, for new values like growth_f_and_p). So if the processor has code to add new data points, let's add them to the whitelist in ReadingBenchmarkDataPoint too. And the tests can check this by doing something like the above and then making assertions on the created records.
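The whitelist idea above can be sketched in plain Ruby; note the key names below are illustrative assumptions, not the real list that lives in the ReadingBenchmarkDataPoint model, and `partition_by_whitelist` is a made-up helper for illustration:

```ruby
require 'set'

# Hypothetical whitelist of allowed assessment_key values; the real list
# lives in validations on the ReadingBenchmarkDataPoint model.
VALID_ASSESSMENT_KEYS = Set.new(%w[
  dibels_dorf_wpm
  dibels_dorf_acc
  f_and_p_english
]).freeze

# Split rows into those whose assessment_key is whitelisted and those that
# would fail the model validation on create!.
def partition_by_whitelist(rows)
  rows.partition { |row| VALID_ASSESSMENT_KEYS.include?(row[:assessment_key]) }
end

valid, invalid = partition_by_whitelist([
  { assessment_key: 'f_and_p_english', value: 'C' },
  { assessment_key: 'growth_f_and_p', value: 2 } # computed value, not whitelisted
])
```

A spec could assert that every key the processor emits lands in the valid partition.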

Also, I think it's great that this PR punts the "import" step for now, which will have to handle the "sync" process where data points change over time. We can do that separately, so focusing on the processor class seems like great scoping. 👍

@kevinrobinson

Contributor

kevinrobinson commented Jul 26, 2019

@edavidsonsawyer re test failures, this is what I see here: https://travis-ci.org/studentinsights/studentinsights/jobs/561886950#L3144

[Screenshot: Travis CI linter failures, Jul 26 2019]

So these are just linter failures; you can fix most of the minor things like whitespace by running rubocop -a, and if it can't fix them automatically it will show you where the issue is.

db/schema.rb (review thread, resolved)
reset_counters!

# parse
# last row before data is the header
string = (@header_rows_count > 1) ? file_text.lines[@header_rows_count-1..-1].join : file_text
rows = []
StreamingCsvTransformer.from_text(@log, string).each_with_index do |row, index|


kevinrobinson Jul 26, 2019

Contributor

Can you say more about moving this before the CSV is parsed, here rather than afterward in flat_map_rows?


edavidsonsawyer Jul 28, 2019

Author Collaborator

Sure thing. StreamingCsvTransformer needs a string that can be parsed as a CSV. This is here to trim header rows that we aren't going to use later to reference data points, assuming the last line before the data begins is the "real" header. I haven't been able to find a cleaner way to specify which row in the CSV to treat as the header. flat_map_rows just takes a CSV row as an argument, and it needs to reference the header data.

An alternative approach would be to define the headers explicitly in code rather than trying to get them from the csv.
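A minimal sketch of that trimming step, assuming the last header row is the one to keep (the helper name `trim_header_rows` is made up for illustration, mirroring the one-liner in the diff above):

```ruby
require 'csv'

# Drop any header rows before the "real" header, which we assume is the
# last line before the data begins, then the text parses cleanly as CSV.
def trim_header_rows(file_text, header_rows_count)
  return file_text if header_rows_count <= 1
  file_text.lines[header_rows_count - 1..-1].join
end

raw = "Fall 2019 benchmarks (please enter levels below)\nstudent_name,f_and_p\nEric,C\n"
trimmed = trim_header_rows(raw, 2)
rows = CSV.parse(trimmed, headers: true)
```

With the new template's hidden computer-friendly header row, this workaround should go away entirely.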


kevinrobinson Jul 29, 2019

Contributor

Ah got it, thanks! 👍 Yeah, so the problem is that in that template the actual "rows" start in the second row, since the first row has other stuff that's helpful for people entering data, got it.

The new reading template format reverses that: the first row is hidden from people, so it makes for simple parsing, and then some formats have a few rows without any real data, with explanations for people about how to use the sheet, etc.

So this makes total sense as a workaround, and then we should be able to remove it with the new format.


edavidsonsawyer Jul 29, 2019

Author Collaborator

That's great! Since this is likely to be sensitive to formatting changes, there are probably a few pieces here that might need to change with new formats. That shouldn't be a problem as long as the final template works well for everyone using it.


kevinrobinson Jul 29, 2019

Contributor

Yep, let me know if you want to chat more. It's possible there will be some drift in the format from now until September, but I'm guessing it'll just be things like column names, or maybe skipping a different number of header rows, or something like that.

@kevinrobinson

Contributor

kevinrobinson commented Aug 23, 2019

Here are our notes from talking through the cases for updates. I think we can do the "importer" work without doing this first, and just add the data points we find.

step 1

class MegaReadingImporter
  def import
    # get all the tabs
    # get all the rows in the tab, where row is like `(eric, ORF WPM, 7, fall)`

    # just add them all to the database
    rows.each {|row| ReadingBenchmarkDataPoint.create!(row) }
  end
end

That won't handle if we run it twice - we'll get extra data points, but that's okay as a good first step.

step 2

# Importer is in charge of everything!
class MegaReadingImporter
  def import
    # get all the tabs
    # get all the records in the tab

    # it just gives the syncer all the rows as new active record models
    # but it doesn't delete or handle anything else, it still is just creating all the records twice
    # but does that through the syncer, tracks stats and stuff
  end
end

step 3

After that it might be:

# Importer is in charge of everything!
class MegaReadingImporter
  def import
    # get all the tabs
    # get all the records in the tab

    # it knows how to call the syncer in the right ways
    # use the syncer to do smarter syncing for different cases
  end
end

Those cases might be as simple as: "delete everything that ever came from this form." But there might be important edge cases we want to handle, like the teacher deleting a row because a student changed homeroom. And we'll have to translate those to simple guidance for educators too.

Alternately, we might just want to say "if the student is in the sheet, update their records for that time period in the sheet."

We don't have to handle all those edge cases, simple is better! We could say "if you do X, Y will happen" and "here are two edge cases that would break, but we're not handling now."
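As a toy illustration of the idempotent "delete everything that ever came from this form" strategy mentioned above, here's a sketch in plain Ruby with an in-memory store; the class and method names are invented for illustration and aren't the real syncer API:

```ruby
# Hypothetical syncer sketch: re-running an import for the same source
# replaces that source's records instead of duplicating them.
class FakeSyncer
  attr_reader :records

  def initialize
    @records = []
  end

  # Delete everything that came from this source, then insert the new rows,
  # so running the same import twice is safe.
  def replace_all_from_source!(source_key, new_rows)
    @records.reject! { |r| r[:source_key] == source_key }
    new_rows.each { |row| @records << row.merge(source_key: source_key) }
  end
end

syncer = FakeSyncer.new
rows = [{ student: 'eric', key: 'f_and_p_english', value: 'C' }]
syncer.replace_all_from_source!('fall_2019_sheet', rows)
syncer.replace_all_from_source!('fall_2019_sheet', rows) # second run adds no duplicates
```

Edge cases like a teacher deleting a row would still need explicit handling on top of this.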

@kevinrobinson

Contributor

kevinrobinson commented Sep 11, 2019

@edavidsonsawyer FYI I pushed some other commits on #2544 related to the processor code. You should be able to pull them all in, but if there's conflicts you're not sure how to resolve just ping me and I am happy to help with merging!

@edavidsonsawyer

Collaborator Author

edavidsonsawyer commented Sep 12, 2019

@kevinrobinson thanks. I merged that branch in, so I don't think that will be a problem here. I'm rewriting the spec for this to make it more useful and robust, and we should stop seeing these failures.

@kevinrobinson

Contributor

kevinrobinson commented Sep 12, 2019

Okay, sounds good! Let me know if there's anything else you need, and either way I'll probably check in early next week to see what else needs doing to start trying this out with fall benchmarks!

@edavidsonsawyer

Collaborator Author

edavidsonsawyer commented Sep 17, 2019

Hi @kevinrobinson this should be ready for a look.

Contributor

kevinrobinson left a comment

Hey @edavidsonsawyer! This looks like good progress towards step 1. I left comments inline, let me know if you want to chat more. I'll still plan on spending some time Friday morning checking in with where we're at and then seeing if I can start trying to deploy step 1 incrementally.

Also, I tried to fix the flaky tests on master, so merging should resolve that. Thanks for working around the noise and feel free to ping if it comes up again. 👍

@edavidsonsawyer

Collaborator Author

edavidsonsawyer commented Sep 19, 2019

@kevinrobinson I've made the changes recommended above. I also gave this a shot using a test account with the fetcher and the template sheet, and it seemed to work well with a handful of students across the different grades.

@kevinrobinson

Contributor

kevinrobinson commented Sep 20, 2019

@edavidsonsawyer Great! Merging and going to try kicking the tires here. 👍 🎉

@kevinrobinson kevinrobinson merged commit 8509478 into studentinsights:master Sep 20, 2019
1 check passed
continuous-integration/travis-ci/pr The Travis CI build passed
@kevinrobinson kevinrobinson mentioned this pull request Sep 20, 2019