postgresql JSON_ARRAY_ELEMENTS equivalent #47

beatboxa opened this Issue · 10 comments

3 participants


This SerDe is great Roberto!

I've got a quick question: is there an equivalent to PostgreSQL's json_array_elements function? If you're not familiar with the function, it explodes an array into its elements. There are a few examples here (as well as other PostgreSQL JSON functions).

Example: if you had one record, stored in an array field as follows:

[[a, b], [c, d], e]

then json_array_elements would result in 3 records:

a | b
c | d
e |

which could then be split into multiple columns using this SerDe.

This function is particularly useful for arrays that contain multiple columns, since it allows you to 'explode' first and parse columns later (as opposed to parsing columns and then exploding, which produces many unwanted combinations of values when you only want the combinations where the positions are equal).
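For readers who haven't used it, the behavior of json_array_elements can be sketched in a few lines of Python (a rough analogy to the PostgreSQL function, not its actual implementation):

```python
import json

def json_array_elements(doc):
    """Mimic PostgreSQL's json_array_elements: yield one record per
    top-level element of a JSON array, preserving any nesting
    inside each element."""
    for element in json.loads(doc):
        yield element

rows = list(json_array_elements('[["a", "b"], ["c", "d"], "e"]'))
# rows == [['a', 'b'], ['c', 'd'], 'e']  -- one array in, 3 records out
```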

Any thoughts on this?


I see, but I think you'd be better off doing that with a UDF/UDTF rather than in the SerDe.
I think the SerDe should just translate whatever data structure JSON has into the Hive equivalent; if you want to 'massage' it into something else, you should do it with something you can easily turn on and off.
Whatever you do in the SerDe is executed for every query and can't easily be turned off.
In this case, I think your [a,b], [c,d] should instead be declared as structs/objects, like
[{ f1 => a, f2 => b }, { f1 => c, f2 => d }, { f1 => e }]
so that you can refer to them easily using explode/lateral view.
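As a sketch of the reshaping being suggested, here is a hypothetical "structify" helper in plain Python (the field names f1/f2 are placeholders, not anything the SerDe defines):

```python
# Hypothetical helper: rewrite positional arrays as objects with
# named fields, so the resulting Hive structs can be addressed
# by name with LATERAL VIEW explode() instead of by position.
def structify(arrays, field_names=("f1", "f2")):
    return [dict(zip(field_names, arr)) for arr in arrays]

structify([["a", "b"], ["c", "d"], ["e"]])
# -> [{'f1': 'a', 'f2': 'b'}, {'f1': 'c', 'f2': 'd'}, {'f1': 'e'}]
```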


Interesting... I never thought of this, but I guess there are two perspectives on the Hive equivalent :) And I totally understand your perspective as well.

Suppose you have the following:

[Transaction Number: 1
  [Item Array:
    [Item Number: 1
     UPC code: 11111111
     Quantity: 2]
    [Item Number: 2
     UPC code: 33333333
     Quantity: 9]]]
[Transaction Number: 2
  [Item Array:
    [Item Number: 1
     UPC code: 55555555
     Quantity: 1]]]

When using this SerDe, this will translate to the columns:

Transaction Number:
Item Number:
UPC Code:
Quantity:

And unfortunately, we've lost the fact that these last 3 columns (originally represented by a single array of values) are all related--that position #1 of Item Number corresponds to position #1 of Quantity. These 3 columns all represent the same array in the original, and we've lost that fact.

My take is that even if a physical row is defined by a distinct "Transaction Number", I think the data actually represents a record as being an Item within each Transaction. So I see this as actually being 3 records, not 2.

One way to think about it is that the ITEM_NUMBER, UPC_CODE, and QUANTITY fields are nested within the array in the original JSON--not the other way around, so JSON_ARRAY_ELEMENTS results in a closer representation of the original JSON because it first splits the array into multiple records; and then splits these records into their fields (just like the original JSON).

By splitting the fields first and the array second, we technically can reconstruct them, but it adds a redundant step and relies on external knowledge (knowing these arrays are related)--particularly if you have multiple arrays. I believe this is why PostgreSQL implemented json_array_elements specifically for JSON arrays, in addition to having the standard array functions (~Hive UDFs).
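To illustrate the fields-first problem: once the SerDe has produced one parallel array per field, reassembling the items relies on positional zipping, which is exactly the external knowledge described above. A minimal Python sketch:

```python
# Fields-first output: one parallel array per field. The fact that
# index i of every array describes the same item is external
# knowledge that must be re-applied positionally by the query writer.
item_number = [1, 2]
upc_code = ["11111111", "33333333"]
quantity = [2, 9]

records = list(zip(item_number, upc_code, quantity))
# records == [(1, '11111111', 2), (2, '33333333', 9)]
```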

If we had, for example, another array in the JSON as follows (that was at a different granularity):

[Transaction Number: 3
  [Item Array:
    [Item Number: 1
     UPC code: zzz
     Quantity: 4]]
  [Coupon Array:
    [Coupon Number: 1
     Coupon Code: 123]
    [Coupon Number: 2
     Coupon Code: 456]]]

And we used the SerDe, we'd end up with 2 new fields (in addition to the original Item fields):

Coupon Number:

Coupon Code:

But, in this example, the position of Coupon Number and Coupon Code only correspond to one another, and they don't correspond to the Item Array columns above. Unfortunately, this relationship between fields has been lost within the scope of the Hive table we've constructed.

This is all just my feedback. Your SerDe is great--just wanted to give you feedback for a potential improvement or additional functionality that could be very useful to maintain the data as close to the original state as possible.


Oh cool! Yes, personally, I'd find it much more useful in the SerDe...It can certainly be done outside with array UDF's (such as the one you mentioned), but doing this in the SerDe would save processing steps & required space, which can make a difference at scale.

I read some blog posts on how to manually extract a given position of an array in your SerDe, but as you can imagine, this may not scale well when the array has many values (50+) and varying lengths (i.e. each array can have anywhere from 1 to 100+ positions). I've also seen a bunch of posts from various people attempting to do this, with the answer always being "this functionality doesn't exist in Hive today; you need to use custom SerDes and UDFs."

json_array_elements was a lifesaver in PostgreSQL, and it would definitely be much appreciated in this SerDe, so that it effectively becomes the de facto, flexible JSON-to-Hive parser.

Two different methods I can think of would be:
1) Don't split the array into fields at all. Simply allow the user to store the entire array (including all fields) in a single Hive column, like the struct in your example, but without having to explicitly define the number of records and array positions. This allows the user to subsequently explode the array with a native Hive function. This is similar to the method I'd use in PostgreSQL: I'd store each entire array's JSON text in a single column, then use json_array_elements to explode it, and then the '->' operator to extract fields. For multiple arrays (which each define a different level of granularity), I'd fill in nulls for columns that were part of a different array (effectively exploding each array and then unioning/appending the datasets together).

2) Allow the user to do both the array 'EXPLODE' and subsequent column parsing in one step. This would be really cool if it was part of this SerDe!
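A rough Python sketch of what option 2 would amount to (a hypothetical helper in plain Python rather than SerDe code), using the transaction example from earlier in the thread:

```python
transaction = {
    "Transaction Number": 1,
    "Item Array": [
        {"Item Number": 1, "UPC code": "11111111", "Quantity": 2},
        {"Item Number": 2, "UPC code": "33333333", "Quantity": 9},
    ],
}

def explode_and_parse(txn, array_field):
    """Explode the nested array and parse its fields in one step,
    repeating the parent's scalar columns on every output row."""
    parent = {k: v for k, v in txn.items() if k != array_field}
    return [{**parent, **item} for item in txn[array_field]]

rows = explode_and_parse(transaction, "Item Array")
# Two rows; each carries "Transaction Number" plus one item's fields.
```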

Thanks for taking the time and effort to at least assess the idea, and have a great weekend!


So, I added the feature of handling arrays in the SerDe; the code is in the development branch, with some docs in the README.


I have the same ETL problem as @beatboxa encountered. Hope to use this feature in the near future!
BTW, where can I get the STRUCTIFY UDF mentioned earlier?


I just added the capability to have records mapped to JSON arrays, so you can map a table to an array
[ [a1, a2], [b1, b2], [c1, c2] ]
using a table
create table mytable ( a array, b array, c array ) ...

but reading the thread, and the postgres doc, it looks like you're asking basically for what LATERAL VIEW EXPLODE already does, except you'd like to do it on the whole array.

I see the examples above, and it just looks to me like JSON arrays were used where JSON objects should have been. Can you send me a sample of the JSON you'd like parsed, and how you'd like the table to look?


My data looks like the below; it is an order containing multiple items:

[["SKU_001","Product Name A","Category1",10.5,1],["SKU002","Product Name B","Category1",20,3]]

The size of the nested array is not fixed, which means there may be more than 2 items in the example.

And I wish to transform it into such

SKU | Product Name | Category | Unit Price | Product Quantity
SKU_001 | Product Name A | Category1 | 10.5 | 1
SKU002 | Product Name B | Category1 | 20 | 3

As you mentioned, it may be best to use a UDF/UDTF to do the job. I tried LATERAL VIEW EXPLODE but failed.
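For what it's worth, the requested mapping can be sketched in plain Python (the column names are taken from the desired table above; this is an illustration of the transformation, not SerDe code):

```python
import json

order = ('[["SKU_001","Product Name A","Category1",10.5,1],'
         '["SKU002","Product Name B","Category1",20,3]]')
columns = ["SKU", "Product Name", "Category",
           "Unit Price", "Product Quantity"]

# One output row per inner array; positions map onto column names.
rows = [dict(zip(columns, item)) for item in json.loads(order)]
# rows[0]["SKU"] == "SKU_001"; rows[1]["Product Quantity"] == 3
```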
