Commit 231b315: fixed conflict
rcongiu committed Aug 30, 2015
2 parents f5d416a + a6aa602
Showing 22 changed files with 335 additions and 35 deletions.
59 changes: 56 additions & 3 deletions README.md
@@ -14,6 +14,18 @@ Features:
* nested data structures are also supported.
* modular to support multiple versions of CDH

BINARIES
----------
GitHub no longer allows uploading binaries to a repository.
Since many people have been asking me for binaries by private email,
I have uploaded prebuilt jars here:

http://www.congiu.net/hive-json-serde/

so you don't need to compile your own. There are versions for
CDH4 and CDH5.


COMPILE
---------

@@ -37,17 +49,25 @@ json-serde/target/json-serde-VERSION-jar-with-dependencies.jar
```




```bash
$ mvn package

# If you want to compile the serde against a different
# version of the cloudera libs, use -D:
$ mvn -Dcdh.version=0.9.0-cdh3u4c-SNAPSHOT package
```



Hive 0.14.0 and 1.0.0
-----------

Compile with
```
mvn -Pcdh5 -Dcdh5.hive.version=1.0.0 clean package
```


EXAMPLES
------------

@@ -71,6 +91,9 @@ gold
yellow
```

If you have complex JSON it can become tedious to create the table
by hand. I recommend [hive-json-schema](https://github.com/quux00/hive-json-schema) to build your schema from the data.


### Nested structures

@@ -135,6 +158,32 @@ ALTER TABLE json_table SET SERDEPROPERTIES ( "ignore.malformed.json" = "true");
it will not make the query fail, and the above record will be returned as
NULL null null
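The fallback described above can be sketched as follows. This is a hypothetical, simplified illustration of the `ignore.malformed.json` behavior, not the SerDe's actual code; the trivial `startsWith` check stands in for real JSON parsing:

```java
import java.util.Arrays;
import java.util.List;

public class MalformedJsonDemo {

    static final boolean IGNORE_MALFORMED = true; // ignore.malformed.json

    // Stand-in for deserialize(): returns the row's columns, or an
    // all-null row when parsing fails and the flag is set.
    static List<Object> deserialize(String line) {
        try {
            if (!line.trim().startsWith("{")) {
                throw new RuntimeException("not a JSON object: " + line);
            }
            // ... real JSON parsing would happen here ...
            return Arrays.asList("value1", 2, 3.0);
        } catch (RuntimeException e) {
            if (IGNORE_MALFORMED) {
                return Arrays.asList(null, null, null); // NULL null null
            }
            throw e; // without the flag, the query fails
        }
    }

    public static void main(String[] args) {
        System.out.println(deserialize("{\"a\":1}"));
        System.out.println(deserialize("garbage line"));
    }
}
```

With the flag unset, the malformed record would instead abort the query.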


### UNIONTYPE support (PLEASE READ IF YOU USE IT)

A uniontype is a field that can hold values of different types, like a
union in C. Hive stores a 'tag' that is essentially the index of the
datatype: for instance, if you create a uniontype<int,string,float>,
the tag is 0 for int, 1 for string and 2 for float (see
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-UnionTypes).

JSON data does not store any such tag, so the SerDe has to infer it:
it checks, in order, whether the data is compatible with each of the
declared types. So, THE ORDER MATTERS. Say you define a field f as
UNIONTYPE<int,string> and your JSON has
```json
{ "f": "123" } // parsed as int, since int precedes string in the
               // definition and "123" can be parsed to a number
{ "f": "asv" } // parsed as string
```
That is, a number in a string returns a tag of 0 and an int rather
than a string.
Note that complex union types may not be very efficient, since the
SerDe may try to parse the same data in several ways; the feature
exists mainly to cope with inconsistent JSON.
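The ordered probing can be sketched like this. It is a minimal illustration of the idea only; the real SerDe uses Hive ObjectInspectors, not these hypothetical probe functions:

```java
import java.util.List;
import java.util.function.Function;

public class UnionTagDemo {
    // Hypothetical probes standing in for the declared union types:
    // each one either converts the raw JSON value or throws.
    static final List<Function<String, Object>> PROBES = List.of(
            Integer::parseInt,  // tag 0: int
            s -> s              // tag 1: string (always succeeds)
    );

    // Mimics the SerDe's strategy: try each declared type in order
    // and return the index (tag) of the first one that accepts the value.
    static byte tagOf(String raw) {
        for (byte i = 0; i < PROBES.size(); i++) {
            try {
                PROBES.get(i).apply(raw);
                return i;
            } catch (Exception ignored) {
                // incompatible with this type, try the next one
            }
        }
        throw new IllegalArgumentException("no matching type for " + raw);
    }

    public static void main(String[] args) {
        System.out.println(tagOf("123")); // 0: int wins, declared first and parseable
        System.out.println(tagOf("asv")); // 1: falls through to string
    }
}
```

Swapping the order of the probes (string first) would make every value match tag 0, which is why the declaration order matters.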




### MAPPING HIVE KEYWORDS

Sometimes JSON data has attributes named like reserved words in Hive.
@@ -208,6 +257,10 @@ Versions:
support for array records,
fixed handling of null in arrays #54,
refactored Timestamp Handling
* 1.2 (2014/06) Refactored to multimodule for CDH5 compatibility
* 1.3 (2014/09/08) fixed #80, #82, #84, #85
* 1.3.5 (2015/08/30) Added UNIONTYPE support (#53), made CDH5 default, handle
empty array where an empty object should be (#112)



2 changes: 1 addition & 1 deletion json-serde-cdh4-shim/pom.xml
@@ -5,7 +5,7 @@
<groupId>org.openx.data</groupId>
<artifactId>json-serde-parent</artifactId>
<relativePath>../pom.xml</relativePath>
<version>1.3</version>
<version>1.3.5</version>
</parent>
<modelVersion>4.0.0</modelVersion>

2 changes: 1 addition & 1 deletion json-serde-cdh5-shim/pom.xml
@@ -5,7 +5,7 @@
<groupId>org.openx.data</groupId>
<artifactId>json-serde-parent</artifactId>
<relativePath>../pom.xml</relativePath>
<version>1.3</version>
<version>1.3.5</version>
</parent>
<modelVersion>4.0.0</modelVersion>

2 changes: 1 addition & 1 deletion json-serde/pom.xml
@@ -5,7 +5,7 @@
<groupId>org.openx.data</groupId>
<artifactId>json-serde-parent</artifactId>
<relativePath>../pom.xml</relativePath>
<version>1.3</version>
<version>1.3.5</version>
</parent>
<modelVersion>4.0.0</modelVersion>

19 changes: 12 additions & 7 deletions json-serde/src/main/java/org/openx/data/jsonserde/JsonSerDe.java
@@ -22,8 +22,7 @@
import java.util.Properties;
import org.apache.hadoop.hive.serde2.SerDe;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.*;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoFactory;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils;
@@ -32,11 +31,7 @@
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.hive.serde2.SerDeStats;
import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.MapObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector.Category;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.StructField;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.BooleanObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.ByteObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.DoubleObjectInspector;
@@ -159,7 +154,6 @@ public Object deserialize(Writable w) throws SerDeException {

// Try parsing row into JSON object
Object jObj = null;


try {
String txt = rowText.toString().trim();
@@ -335,6 +329,8 @@ Object serializeField(Object obj,
case STRUCT:
result = serializeStruct(obj, (StructObjectInspector)oi, null);
break;
case UNION:
result = serializeUnion(obj, (UnionObjectInspector)oi);
break;
}
return result;
}
@@ -365,6 +361,15 @@ private JSONArray serializeList(Object obj, ListObjectInspector loi) {
return ar;
}

/**
* Serializes a Union
*/
private Object serializeUnion(Object obj, UnionObjectInspector oi) {
if(obj == null) return null;

return serializeField(obj, oi.getObjectInspectors().get(oi.getTag(obj)));
}

/**
* Serializes a Hive map&lt;&gt; using a JSONObject.
*
@@ -11,21 +11,20 @@
*======================================================================*/
package org.openx.data.jsonserde.objectinspector;


import java.util.ArrayList;
import java.util.EnumMap;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector.PrimitiveCategory;
import org.apache.hadoop.hive.serde2.objectinspector.UnionObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.AbstractPrimitiveJavaObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.typeinfo.ListTypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.MapTypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.PrimitiveTypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.StructTypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.*;
import org.openx.data.jsonserde.objectinspector.primitive.JavaStringByteObjectInspector;
import org.openx.data.jsonserde.objectinspector.primitive.JavaStringDoubleObjectInspector;
import org.openx.data.jsonserde.objectinspector.primitive.JavaStringFloatObjectInspector;
@@ -93,6 +92,17 @@ public static ObjectInspector getJsonObjectInspectorFromTypeInfo(
fieldObjectInspectors, options);
break;
}
case UNION:{
UnionTypeInfo unionTypeInfo = (UnionTypeInfo) typeInfo;

List<ObjectInspector> ois = new LinkedList<ObjectInspector>();
for( TypeInfo ti : unionTypeInfo.getAllUnionObjectTypeInfos()) {
ois.add(getJsonObjectInspectorFromTypeInfo(ti, options));
}
result = getJsonUnionObjectInspector(ois, options);
break;
}

default: {
result = null;
}
@@ -102,12 +112,33 @@ public static ObjectInspector getJsonObjectInspectorFromTypeInfo(
return result;
}


static HashMap<ArrayList<Object>, JsonUnionObjectInspector> cachedJsonUnionObjectInspector
= new HashMap<ArrayList<Object>, JsonUnionObjectInspector>();

public static JsonUnionObjectInspector getJsonUnionObjectInspector(
List<ObjectInspector> ois,
JsonStructOIOptions options) {
ArrayList<Object> signature = new ArrayList<Object>();
signature.add(ois);
signature.add(options);
JsonUnionObjectInspector result = cachedJsonUnionObjectInspector
.get(signature);
if (result == null) {
result = new JsonUnionObjectInspector(ois, options);
cachedJsonUnionObjectInspector.put(signature, result);
}
return result;
}

/*
* Caches Struct Object Inspectors
*/
static HashMap<ArrayList<Object>, JsonStructObjectInspector> cachedStandardStructObjectInspector
= new HashMap<ArrayList<Object>, JsonStructObjectInspector>();


public static JsonStructObjectInspector getJsonStructObjectInspector(
List<String> structFieldNames,
List<ObjectInspector> structFieldObjectInspectors,
@@ -173,13 +204,13 @@ public static JsonMapObjectInspector getJsonMapObjectInspector(
= new EnumMap<PrimitiveCategory, AbstractPrimitiveJavaObjectInspector>(PrimitiveCategory.class);

static {
primitiveOICache.put(PrimitiveCategory.STRING, new JavaStringJsonObjectInspector());
primitiveOICache.put(PrimitiveCategory.BYTE, new JavaStringByteObjectInspector());
primitiveOICache.put(PrimitiveCategory.SHORT, new JavaStringShortObjectInspector());
primitiveOICache.put(PrimitiveCategory.INT, new JavaStringIntObjectInspector());
primitiveOICache.put(PrimitiveCategory.LONG, new JavaStringLongObjectInspector());
primitiveOICache.put(PrimitiveCategory.FLOAT, new JavaStringFloatObjectInspector());
primitiveOICache.put(PrimitiveCategory.DOUBLE, new JavaStringDoubleObjectInspector());
primitiveOICache.put(PrimitiveCategory.TIMESTAMP, new JavaStringTimestampObjectInspector());
}

@@ -61,7 +61,12 @@ public Object getStructFieldData(Object data, StructField fieldRef) {
// somehow we have the object parsed already
return getStructFieldDataFromList((List) data, fieldRef );
} else if (data instanceof JSONArray) {
return getStructFieldDataFromList(((JSONArray) data).getAsArrayList(), fieldRef );
JSONArray ja = (JSONArray) data;
// see #113: some people complain of receiving bad JSON,
// sometimes getting [] instead of {} for an empty field.
// this line should help them
if(ja.length() == 0 ) return null;
return getStructFieldDataFromList(ja.getAsArrayList(), fieldRef );
} else {
throw new Error("Data is not JSONObject but " + data.getClass().getCanonicalName() +
" with value " + data.toString()) ;
@@ -0,0 +1,75 @@
package org.openx.data.jsonserde.objectinspector;

import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.objectinspector.*;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
import org.openx.data.jsonserde.json.JSONArray;
import org.openx.data.jsonserde.json.JSONObject;

import java.util.List;

/**
* Created by rcongiu on 8/29/15.
*/
public class JsonUnionObjectInspector implements UnionObjectInspector {
JsonStructOIOptions options;
private List<ObjectInspector> ois;


public JsonUnionObjectInspector(List<ObjectInspector> ois, JsonStructOIOptions opts) {
this.ois = ois;
options = opts;
}


@Override
public List<ObjectInspector> getObjectInspectors() {
return ois;
}


/*
* This method looks at the object and finds which object inspector should be used.
*/
@Override
public byte getTag(Object o) {
if (o == null) return 0;
for (byte i = 0; i < ois.size(); i++) {
ObjectInspector oi = ois.get(i);

switch(oi.getCategory()) {
case LIST: if(o instanceof JSONArray) return i; else break;
case STRUCT: if(o instanceof JSONObject) return i; else break;
case MAP: if(o instanceof JSONObject) return i; else break;
case UNION: return i;

case PRIMITIVE: {
PrimitiveObjectInspector poi = (PrimitiveObjectInspector) oi;
try {
// try to parse it, return if able to
poi.getPrimitiveJavaObject(o);
return i;
} catch (Exception ex) { continue;}
}
default :throw new Error("Object Inspector " + oi.toString() + " Not supported for object " + o.toString());
}
}
throw new Error("No suitable Object Inspector found for object " + o.toString() + " of class " + o.getClass().getCanonicalName());
}

@Override
public Object getField(Object o) {
return o;
}

@Override
public String getTypeName() {
return ObjectInspectorUtils.getStandardUnionTypeName(this);

}

@Override
public Category getCategory() {
return Category.UNION;
}
}
@@ -48,6 +48,12 @@ public byte get(Object o) {
}
}

@Override
public Object getPrimitiveJavaObject(Object o)
{
return get(o);
}

@Override
public Object create(byte value) {
return (value);
@@ -48,6 +48,11 @@ public double get(Object o) {
}
}

@Override
public Object getPrimitiveJavaObject(Object o) {
return get(o);
}

@Override
public Object create(double value) {
return value;