Better handle loaded CSVs #41
|
The impetus for this change is that our canonical country dataset now includes a value with commas ("Bonaire, Sint Eustatius, and Saba"). Using the existing implementation to load that CSV results in broken data for that row. |
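To make the failure mode concrete, here is a minimal Python sketch (not the code in hive.py) that contrasts delimiter-only splitting with a quote-aware parser. The two-column sample and the function names are assumptions made up for illustration; the real country dataset has more fields.

```python
import csv
import io

# Hypothetical two-column sample; the real country dataset has more fields.
SAMPLE = 'iso_code,name\nBQ,"Bonaire, Sint Eustatius, and Saba"\nFR,France\n'


def split_naively(text):
    """Split each line on every comma, ignoring quotes (delimiter-only parsing)."""
    return [line.split(",") for line in text.splitlines()]


def parse_with_quotes(text):
    """Parse with the standard csv module, which honors double-quoted fields."""
    return list(csv.reader(io.StringIO(text)))


naive_rows = split_naively(SAMPLE)
quoted_rows = parse_with_quotes(SAMPLE)

# The naive split breaks the quoted name into several fields.
print(len(naive_rows[1]))
# The quote-aware parser keeps the name as a single field.
print(quoted_rows[1])
```

The delimiter-only approach behaves like a loader that treats every comma as a field boundary, which is why the row for this country ends up with too many columns.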
xabriel
left a comment
I'd like to see a unit test reproducing the issue, but I can see that the hive.py script is heavily coupled to HDFS, so perhaps that would be difficult without a proper testing framework. So I will leave it to your judgement whether to add a test.
As discussed elsewhere, I do suggest we invest in a pytest setup sometime soon.
|
Well, once again, you turn out to be very right about testing 😅 It turns out I had used a previous version of the country dataset in the CSV loading test, so I just updated it to use the new version. It's still not ideal for testing since it doesn't contain a full variety of data types, but it's good enough in the short term. In addition to containing a quoted value, this new version of the dataset also happens to include a boolean field, and the updated test revealed that the serde I switched to loads all fields as strings! That was in the documentation, but I had just failed to notice. So now this is a bit of a quandary, since neither method, new or old, works very well. Probably the best solution is to give up on Hive altogether and use Spark to support loading CSVs, but that's more work. Let me think a bit more about it. |
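As an illustration of the "everything is a string" problem, the sketch below uses Python's csv module, which also returns every field as a string, and then casts columns explicitly. This is an analogy rather than Hive or Wmfdata-Python code; the `is_protected` column, `parse_bool`, and `load_typed` are hypothetical names.

```python
import csv
import io

# Hypothetical sample with one boolean column; the column names are assumptions.
SAMPLE = 'iso_code,name,is_protected\nBQ,"Bonaire, Sint Eustatius, and Saba",false\n'


def parse_bool(value):
    """Convert a textual boolean to bool; bool("false") would wrongly be True."""
    lowered = value.strip().lower()
    if lowered in ("true", "1"):
        return True
    if lowered in ("false", "0"):
        return False
    raise ValueError(f"not a boolean: {value!r}")


def load_typed(text, casts):
    """Read CSV rows as dicts and apply a per-column cast where one is given."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        rows.append({key: casts.get(key, str)(val) for key, val in raw.items()})
    return rows


# Without casting, the boolean column arrives as the string "false".
untyped = list(csv.DictReader(io.StringIO(SAMPLE)))
# With an explicit cast, it becomes a real boolean.
typed = load_typed(SAMPLE, {"is_protected": parse_bool})
```

A string-only loader leaves this kind of conversion to the caller, which is the extra work a typed loader would avoid.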
Spark is where we are putting a lot of energy in terms of making sure that we can use the resources on the dse-k8s cluster as well as YARN/Hadoop. So I think there would be considerable value if wmfdata-python were to support Spark. |
|
@tullis I agree with you, but just in case you weren't aware, Wmfdata-Python does support Spark pretty extensively. It's just that it doesn't have a Spark-based CSV loading function. Anyway, the best thing you could do to help would be to nudge Emil to prioritize fixing this issue (T327983) with a Spark-based replacement 😊 |
|
Ah, thanks @nshahquinn - I wasn't aware of that. Thanks for putting me straight. |
|
I've updated this to something more limited but actually effective (adding an option to specify backslash escaping). It doesn't fully solve the underlying problem, but it's a useful interim step. I've set this as work in progress for now because I need to add some tests 😊 |
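For readers unfamiliar with backslash escaping, here is a rough Python sketch of the idea: the file has no quoted fields, and a comma inside a value is preceded by a backslash. It is only an analogy for the option described above, not the hive.py change itself, and it assumes the input file has already been written in that escaped form. The sample data and function name are hypothetical.

```python
import csv
import io

# Hypothetical sample: commas inside the name are escaped with a backslash.
SAMPLE = "BQ,Bonaire\\, Sint Eustatius\\, and Saba\nFR,France\n"


def parse_backslash_escaped(text):
    """Parse unquoted CSV in which a backslash escapes the following character."""
    reader = csv.reader(
        io.StringIO(text),
        quoting=csv.QUOTE_NONE,  # quote characters get no special treatment
        escapechar="\\",         # a backslash makes the next character literal
    )
    return list(reader)


rows = parse_backslash_escaped(SAMPLE)
# The escaped commas stay inside the name field.
print(rows[0])
```

The trade-off is that the data has to be exported with escapes rather than quotes, which is why this only partly addresses files like the quoted country dataset.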
Unlike ROW FORMAT DELIMITED, Hive's CSV serde can handle quoted fields.