Skip to content

Conversation

@towerofnix
Copy link
Contributor

Resolves

Resolves #4329.

Proposed Changes

Ports scratch-flash's CSV import code to scratch-gui.

Reason for Changes

Compatibility with Scratch 2.0 projects which export list data or expect to import data in a particular format.

Also, the code is Cleaner.™

Test Coverage

Not tested yet. While I've tested this with some basic file importing, I would greatly prefer to implement some unit tests for this, since the code is designed to be testable and this is the sort of utility function where confirming a particular input gives a particular output (i.e, tests, obviously) would be particularly useful.

Copy link
Contributor

@paulkaplan paulkaplan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So part of the reason to use a module for CSV parsing is that it can get pretty complicated, with lots of edge cases. If we are going to do CSV parsing on our own, we need to add a lot of unit tests. Could you add unit tests covering some different CSV situations? Including the ones that were the reason for changing this in the first place? You can find examples of how to do it in the test/unit directory

@towerofnix
Copy link
Contributor Author

@paulkaplan I've added unit tests that confirm that confirm all the behavior of the code in this PR. A couple of the tests, particularly "does not do anything with quotation marks", might seem sort of controversial (since it means we aren't following CSV specification), but they're there so that 3.0 behavior matches 2.0, so we can be sure that no related 2.0 projects are broken in 3.0. Happy to add more tests if there are other edge cases you'd like to be certain about.

@paulkaplan
Copy link
Contributor

@towerofnix thanks for the patience on this. I brought this issue up with the whole team, and I think we should maintain the use of an official CSV parsing library, and make sure lists exported from scratch can be re-imported into scratch correctly. Here is the reasoning:

  1. If people are getting data from spreadsheet apps like google sheets/excel, or other data sources that call themselves CSVs, they will use the real CSV encoding. If we have our own parser that doesn't implement the CSV spec, we will break all the other data sources out there.
  2. We are really concerned about maintainability. We already have experience trying to maintain our own parsers for ill-defined specs (I'm looking at you, SVG files 👀) and it can be a big problem to maintain that over time. It is better to use whatever the most accepted standard is, which in this case is the well-defined CSV spec.
  3. There is definitely an issue with the fact that you cannot export a list containing a quote and re-import that list into scratch. This is something you could actually help us with, could we use the CSV parser to encode a CSV file correctly when exporting (link to how Papaparse does this).

In summary, instead of changing this direction, lets export CSV files correctly. How does that sound?

@towerofnix
Copy link
Contributor Author

@paulkaplan No worries, and thanks for the discussion and detailed response. As a whole, I agree - just some aside thoughts here.

I can only think of one situation where using the old CSV parser (i.e. the code in this PR) would be better, and that's for importing data exported from a program that targets 2.0's CSV parser, e.g. a tool a user wrote for their 2.0 project or Scratch 2.0 itself. (The same reasoning goes the other way around, e.g. exporting data which a user-created external tool expects to be in the non-standard / simple pseudo-CSV format.) But I think either of these cases is very rare, and we gain much more from exporting proper CSV - i.e, the ability to consistently import with papaparse and compatibility with external tools that expect standards-compliant CSV (which there are a lot of).

I actually didn't consider that exporting data properly would solve the problem just as well, and better, than changing the way data is imported, setting aside the case I described above. Definitely agreed that it'll serve a lot more use to keep papaparse and change exporting.

It looks like we could use Papa.unparse to do the conversion here; then we wouldn't have to depend on "guessing" the CSV spec for exporting either.

@paulkaplan
Copy link
Contributor

@towerofnix thanks again for putting in the thinking on this issue. I see what you mean now when you said "compatibility issue", it is true that if we change the importing we will break the import/export cycle if lists were exported previously (I suspect this is the origin of the original issue, since the person is trying to import a file that looks like it was generated by scratch). This is definitely pointing out all the more reason to use standard file types and be as strict/universal as possible about generating them.

One other possible solution is that we could use the csv parser only for files called .csv, and use a simple line breaker for .txt files. We actually do not pretend to export csv files, we specifically call them .txt files (probably for exactly this reason, they are not csv files!) This would kind of solve all the problems. @kchadha what do you think of that?

@towerofnix
Copy link
Contributor Author

towerofnix commented Mar 13, 2019

One other possible solution is that we could use the csv parser only for files called .csv, and use a simple line breaker for .txt files.

In this regard, should we just use the code that this PR has, when importing .txt files? After all it's already implemented and unit-tested to maintain compatibility with 2.0 projects. It's a sizable chunk of code for a relatively obscure feature, but since it's all already written (and objectively the most compliant way to go, being the same code 2.0 uses), I'd think we may as well use it.

@t-karcher
Copy link

Will this change also fix #4694 ?

@towerofnix
Copy link
Contributor Author

@t-karcher actually while working on this a few days ago (my code's not pushed yet) I was wondering if we should implement that. I agree that it would be useful and it wouldn't be too hard (most code is already present on my local repo).

@paulkaplan Thoughts on making entering zero (and(?) any invalid/out-of-range value) make it import the whole CSV file as though it were a plain text file, importing every line unmodified? That was the behavior in Scratch 2.0. (Also already implemented here for txt import, and on my local repo this is easy to add to CSV import too.)

@kchadha
Copy link
Contributor

kchadha commented Mar 29, 2019

One other possible solution is that we could use the csv parser only for files called .csv, and use a simple line breaker for .txt files. We actually do not pretend to export csv files, we specifically call them .txt files (probably for exactly this reason, they are not csv files!) This would kind of solve all the problems. @kchadha what do you think of that?

Sorry for the late response! @paulkaplan, I think I'm okay with that solution. I think we generally try to avoid handling how we parse things using only the file extension, but I think in this case, 1) there isn't a better way to differentiate the two formats, and 2) using "the wrong" parser because the file extension is inaccurate shouldn't cause any huge problems.

@towerofnix
Copy link
Contributor Author

@kchadha @paulkaplan I've pushed changes to make non-txt files import with Papaparse instead. (This also implements @t-karcher's suggestion of making a zero/invalid column index return the full original file, as 2.0 behaved, even with non-txt files.) The individual commit diff probably isn't too useful - you might be better off reviewing the full PR changes.

Exporting files to always save as proper CSV (with Papa.unparse) hasn't been done yet - that's still todo!

@towerofnix
Copy link
Contributor Author

Alright, pushed Papa.unparse.

@thisandagain
Copy link
Contributor

@paulkaplan @kchadha Bump

@picklesrus
Copy link
Contributor

We took a look at this today and decided the problem we want to solve is that Scratch is not being consistent with itself in importing and exporting of lists. We think solving that problem would probably take a different approach that starts on the export side I'm closing this in favor of solving #6565 instead.

@picklesrus picklesrus closed this Jan 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Import .txt file to list doesn't work well

7 participants