This repository was archived by the owner on Nov 4, 2024. It is now read-only.

Better handle loaded CSVs #41

Closed
nshahquinn wants to merge 1 commit into main from load_csv

Conversation

@nshahquinn
Member

Unlike ROW FORMAT DELIMITED, Hive's CSV serde can handle quoted fields.
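
To illustrate the difference, here is a hypothetical pair of Hive table definitions (the table and column names are made up; the serde class is Hive's bundled `OpenCSVSerde`):

```sql
-- ROW FORMAT DELIMITED splits on every delimiter, quoted or not:
CREATE TABLE countries_delimited (name STRING, iso_code STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

-- The CSV serde understands RFC-4180-style quoting:
CREATE TABLE countries_csv (name STRING, iso_code STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "\"");
```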

@nshahquinn
Member Author

The impetus for this change is that our canonical country dataset now includes a value with commas ("Bonaire, Sint Eustatius, and Saba"). Using the existing implementation to load that CSV results in broken data for that row.
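
A minimal sketch of the breakage in plain Python (the row contents are illustrative): naive delimiter splitting, which is effectively what `ROW FORMAT DELIMITED` does, shatters the quoted field, while a quote-aware CSV parser keeps it intact.

```python
import csv
import io

# A hypothetical row from the country dataset; the quoted name contains commas.
line = '"Bonaire, Sint Eustatius, and Saba",BQ\n'

# What ROW FORMAT DELIMITED effectively does: split on every comma.
naive = line.strip().split(",")

# What a quote-aware CSV parser does.
parsed = next(csv.reader(io.StringIO(line)))

print(naive)   # the quoted field is broken into pieces
print(parsed)  # two fields, as intended
```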

Contributor

@xabriel left a comment


I'd like to see a unit test reproducing the issue, but I can see that the hive.py script is heavily coupled to HDFS, so that would perhaps be difficult without a proper testing framework. I'll leave the decision about adding a test to your judgement.

As discussed elsewhere, I do suggest we invest in a pytest setup sometime soon.

@nshahquinn
Member Author

Well, once again, you turn out to be very right about testing 😅

Since it turns out I had used a previous version of the country dataset in the CSV loading test, I just updated it to use the new version. It's still not ideal for testing since it doesn't contain a full variety of data types, but it's good in the short-term.

In addition to containing a quoted value, this new version of the dataset also happens to include a boolean field. So the updated test revealed that the serde I switched to loads all fields as strings! That was in the documentation, but I had just failed to notice.
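
Because the serde returns every column as a string, a caller would have to cast values back to their declared types after loading. A minimal sketch of that idea (the column names, types, and helper are all hypothetical, not part of wmfdata):

```python
def cast_row(row, types):
    """Cast string fields, as loaded by the CSV serde, to their declared types."""
    casters = {
        "boolean": lambda s: s.lower() == "true",
        "int": int,
        "string": str,
    }
    return {col: casters[t](row[col]) for col, t in types.items()}

# Illustrative row: every value arrives as a string.
row = {"name": "Aruba", "is_protected": "false", "population": "106445"}
types = {"name": "string", "is_protected": "boolean", "population": "int"}
print(cast_row(row, types))
```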

So, now this is a bit of a quandary since neither method, new or old, works very well. Probably the best solution is to give up on Hive altogether and use Spark to support loading CSVs, but that's more work. Let me think a bit more about it.

@tullis
Contributor

tullis commented Jan 26, 2023

> Probably the best solution is to give up on Hive altogether and use Spark to support loading CSVs, but that's more work.

Now that you mention it, I think that targeting Spark instead of Hive for wmfdata-python would be extremely useful.

Spark is where we are putting a lot of energy in terms of making sure that we can use the resources on the dse-k8s cluster as well as YARN/Hadoop. So I think there would be considerable value if wmfdata-python were to support Spark.

@nshahquinn
Member Author

nshahquinn commented Jan 26, 2023

@tullis I agree with you, but just in case you weren't aware, wmfdata-python does support Spark pretty extensively. It's just that it doesn't have a Spark-based CSV loading function.

Anyway, the best thing you could do to help would be to nudge Emil to prioritize fixing this issue (T327983) with a Spark-based replacement 😊

@tullis
Contributor

tullis commented Jan 27, 2023

Ah, thanks @nshahquinn - I wasn't aware of that. Thanks for putting me straight.

@nshahquinn nshahquinn marked this pull request as draft February 2, 2023 02:29
@nshahquinn
Member Author

I've updated this to something more limited but actually effective (adding an option to specify backslash escaping). It doesn't fully solve the underlying problem, but it's a useful interim step that allows load_csv to handle everything that Hive's underlying ROW FORMAT DELIMITED can.
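
One way to picture the interim fix: if fields are written with backslash escapes rather than quotes, Hive's `ROW FORMAT DELIMITED ... ESCAPED BY '\\'` can split them correctly. A sketch using Python's csv module to produce such a file (the exact option exposed by load_csv is not shown here):

```python
import csv
import io

buf = io.StringIO()
# QUOTE_NONE disables quoting entirely; the escapechar protects embedded commas.
writer = csv.writer(buf, quoting=csv.QUOTE_NONE, escapechar="\\")
writer.writerow(["Bonaire, Sint Eustatius, and Saba", "BQ"])

escaped = buf.getvalue()
print(escaped)  # commas inside the field are now backslash-escaped
```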

I've set this as work in progress for now because I need to add some tests 😊

@nshahquinn nshahquinn closed this Nov 4, 2024
