This repository was archived by the owner on Nov 4, 2024. It is now read-only.

Better handle loaded CSVs #41

Closed
nshahquinn wants to merge 1 commit into main from load_csv

Conversation

@nshahquinn
Member

Unlike ROW FORMAT DELIMITED, Hive's CSV serde can handle quoted fields.
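
To illustrate the difference, here is a hypothetical pair of Hive table definitions (the table and column names are made up; the serde class is Hive's bundled `OpenCSVSerde`):

```sql
-- ROW FORMAT DELIMITED splits on every delimiter, quoted or not:
CREATE TABLE countries_delimited (name STRING, iso_code STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

-- The CSV serde understands RFC-4180-style quoting:
CREATE TABLE countries_csv (name STRING, iso_code STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "\"");
```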

@nshahquinn
Member Author

The impetus for this change is that our canonical country dataset now includes a value with commas ("Bonaire, Sint Eustatius, and Saba"). Using the existing implementation to load that CSV results in broken data for that row.
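
A minimal sketch of the breakage in plain Python (the row contents are illustrative): naive delimiter splitting, which is effectively what `ROW FORMAT DELIMITED` does, shatters the quoted field, while a quote-aware CSV parser keeps it intact.

```python
import csv
import io

# A hypothetical row from the country dataset; the quoted name contains commas.
line = '"Bonaire, Sint Eustatius, and Saba",BQ\n'

# What ROW FORMAT DELIMITED effectively does: split on every comma.
naive = line.strip().split(",")

# What a quote-aware CSV parser does.
parsed = next(csv.reader(io.StringIO(line)))

print(naive)   # the quoted field is broken into pieces
print(parsed)  # two fields, as intended
```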

Contributor

@xabriel left a comment


I'd like to see a unit test reproducing the issue, but I can see that the hive.py script is heavily coupled to HDFS, so that would perhaps be difficult without a proper testing framework. I'll leave the decision about adding a test to your judgement.

As discussed elsewhere, I do suggest we invest in a pytest setup sometime soon.

@nshahquinn
Member Author

Well, once again, you turn out to be very right about testing 😅

Since it turns out I had used a previous version of the country dataset in the CSV loading test, I just updated it to use the new version. It's still not ideal for testing since it doesn't contain a full variety of data types, but it's good in the short-term.

In addition to containing a quoted value, this new version of the dataset also happens to include a boolean field. So the updated test revealed that the serde I switched to loads all fields as strings! That was in the documentation, but I had just failed to notice.
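
Because the serde returns every column as a string, a caller would have to cast values back to their declared types after loading. A minimal sketch of that idea (the column names, types, and helper are all hypothetical, not part of wmfdata):

```python
def cast_row(row, types):
    """Cast string fields, as loaded by the CSV serde, to their declared types."""
    casters = {
        "boolean": lambda s: s.lower() == "true",
        "int": int,
        "string": str,
    }
    return {col: casters[t](row[col]) for col, t in types.items()}

# Illustrative row: every value arrives as a string.
row = {"name": "Aruba", "is_protected": "false", "population": "106445"}
types = {"name": "string", "is_protected": "boolean", "population": "int"}
print(cast_row(row, types))
```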

So, now this is a bit of a quandary since neither method, new or old, works very well. Probably the best solution is to give up on Hive altogether and use Spark to support loading CSVs, but that's more work. Let me think a bit more about it.

@tullis
Contributor

tullis commented Jan 26, 2023

> Probably the best solution is to give up on Hive altogether and use Spark to support loading CSVs, but that's more work.

Now that you mention it, I think that targeting Spark instead of Hive for wmfdata-python would be extremely useful.

Spark is where we are putting a lot of energy in terms of making sure that we can use the resources on the dse-k8s cluster as well as YARN/Hadoop. So I think there would be considerable value if wmfdata-python were to support Spark.

@nshahquinn
Member Author

nshahquinn commented Jan 26, 2023

@tullis I agree with you, but just in case you weren't aware, wmfdata-python does support Spark pretty extensively. It's just that it doesn't have a Spark-based CSV loading function.

Anyway, the best thing you could do to help would be to nudge Emil to prioritize fixing this issue (T327983) with a Spark-based replacement 😊

@tullis
Contributor

tullis commented Jan 27, 2023

Ah, thanks @nshahquinn - I wasn't aware of that. Thanks for putting me straight.

@nshahquinn nshahquinn marked this pull request as draft February 2, 2023 02:29
@nshahquinn
Member Author

I've updated this to something more limited but actually effective (adding an option to specify backslash escaping). It doesn't fully solve the underlying problem, but it's a useful interim step that allows load_csv to handle everything that Hive's underlying ROW FORMAT DELIMITED can.
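
One way to picture the interim fix: if fields are written with backslash escapes rather than quotes, Hive's `ROW FORMAT DELIMITED ... ESCAPED BY '\\'` can split them correctly. A sketch using Python's csv module to produce such a file (the exact option exposed by load_csv is not shown here):

```python
import csv
import io

buf = io.StringIO()
# QUOTE_NONE disables quoting entirely; the escapechar protects embedded commas.
writer = csv.writer(buf, quoting=csv.QUOTE_NONE, escapechar="\\")
writer.writerow(["Bonaire, Sint Eustatius, and Saba", "BQ"])

escaped = buf.getvalue()
print(escaped)  # commas inside the field are now backslash-escaped
```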

I've set this as work in progress for now because I need to add some tests 😊

@nshahquinn nshahquinn closed this Nov 4, 2024
