Merge branch 'release/1.1.7'
fixes #31, static record breaks nested structures
rcongiu committed Sep 30, 2013
2 parents 41b5085 + 1a4a503 commit 1f925db
Showing 6 changed files with 175 additions and 41 deletions.
51 changes: 34 additions & 17 deletions README.txt → README.md
@@ -1,4 +1,6 @@
JsonSerde - a read/write SerDe for JSON Data
================================================

AUTHOR: Roberto Congiu <rcongiu@yahoo.com>

Serialization/Deserialization module for Apache Hadoop Hive
@@ -12,22 +14,27 @@ Features:
* nested data structures are also supported.

COMPILE
---------

Use maven to compile the serde.

```bash
$ mvn package

If you want to compile the serde against a different version of the cloudera libs,
use -D:
mvn -Dcdh.version=0.9.0-cdh3u4c-SNAPSHOT package
# If you want to compile the serde against a different
# version of the cloudera libs, use -D:
$ mvn -Dcdh.version=0.9.0-cdh3u4c-SNAPSHOT package
```


EXAMPLES
------------

Example scripts with simple sample data are in src/test/scripts. Here are some excerpts:

* Query with complex fields like arrays
### Query with complex fields like arrays

```sql
CREATE TABLE json_test1 (
one boolean,
three array<string>,
@@ -41,11 +48,14 @@ hive> select three[1] from json_test1;

gold
yellow
```


* Nested structures
### Nested structures

You can also define nested structures:

```sql
add jar ../../../target/json-serde-1.0-SNAPSHOT-jar-with-dependencies.jar;

CREATE TABLE json_nested_test (
@@ -55,14 +65,17 @@ CREATE TABLE json_nested_test (
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;

-- data : {"country":"Switzerland","languages":["German","French","Italian"],"religions":{"catholic":[10,20],"protestant":[40,50]}}
-- data : {"country":"Switzerland","languages":["German","French",
-- "Italian"],"religions":{"catholic":[10,20],"protestant":[40,50]}}

LOAD DATA LOCAL INPATH 'nesteddata.txt' OVERWRITE INTO TABLE json_nested_test ;

select * from json_nested_test; -- result: Switzerland ["German","French","Italian"] {"catholic":[10,20],"protestant":[40,50]}
select languages[0] from json_nested_test; -- result: German
select religions['catholic'][0] from json_nested_test; -- result: 10
```

* MALFORMED DATA
### MALFORMED DATA

The default behavior on malformed data is to throw an exception.
For example, for malformed JSON like
@@ -72,12 +85,14 @@ you get:
Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: Row is not a valid JSON Object - JSONException: Expected a ':' after a key at 32 [character 33 line 1]

This may not be desirable if you have a few bad lines you wish to ignore. If so, you can do:
```sql
ALTER TABLE json_table SET SERDEPROPERTIES ( "ignore.malformed.json" = "true");
```

The query will then not fail, and the above record will be returned as
NULL null null

* MAPPING HIVE KEYWORDS
### MAPPING HIVE KEYWORDS

Sometimes JSON data has attributes named after reserved words in Hive.
For instance, you may have a JSON attribute named 'timestamp', which is a reserved word
@@ -95,7 +110,7 @@ Notice the "mapping.ts", that means: take the column 'ts' and read into it the
JSON attribute named "timestamp"
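A minimal sketch of such a table declaration (table name is made up here; the `mapping.ts` property is the one described above):

```sql
CREATE TABLE ts_test (
  ts timestamp
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( "mapping.ts" = "timestamp" )
STORED AS TEXTFILE;
```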


# ARCHITECTURE
### ARCHITECTURE

For the JSON encoding/decoding, I am using a modified version of Douglas Crockford's JSON library:
https://github.com/douglascrockford/JSON-java
@@ -115,25 +130,27 @@ match hive table declaration.
More detailed explanation on my blog:
http://www.congiu.com/articles/json_serde

# CONTRIBUTING
### CONTRIBUTING

I am using gitflow for the release cycle.


* THANKS
### THANKS

Thanks to Douglas Crockford for the liberal license for his JSON library, and thanks to
my employer OpenX and my boss Michael Lum for letting me open source the code.



Versions:
1.0: initial release
1.1: fixed some string issues
1.1.1 (2012/07/03): fixed Map Adapter (get and put would call themselves...ooops)
1.1.2 (2012/07/26): Fixed issue with columns that are not mapped into JSON, reported by Michael Phung
1.1.4 (2012/10/04): Fixed issue #13, problem with floats, Reported by Chuck Connell
1.1.6 (2013/07/10): Fixed issue #28, error after 'alter table add columns'
* 1.0: initial release
* 1.1: fixed some string issues
* 1.1.1 (2012/07/03): fixed Map Adapter (get and put would call themselves...ooops)
* 1.1.2 (2012/07/26): Fixed issue with columns that are not mapped into JSON, reported by Michael Phung
* 1.1.4 (2012/10/04): Fixed issue #13, problem with floats, Reported by Chuck Connell
* 1.1.6 (2013/07/10): Fixed issue #28, error after 'alter table add columns'
* 1.1.7 (TBD): Fixed issue #25 (timestamp support), fixed parametrized build,
  Fixed issue #31 (static member shouldn't be static)



10 changes: 5 additions & 5 deletions pom.xml
@@ -4,7 +4,7 @@

<groupId>org.openx.data</groupId>
<artifactId>json-serde</artifactId>
<version>1.1.6</version>
<version>1.1.7</version>
<packaging>jar</packaging>

<name>openx-json-serde</name>
@@ -86,15 +86,15 @@
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.hadoop.hive</groupId>
<groupId>org.apache.hive</groupId>
<artifactId>hive-serde</artifactId>
<version>0.8.0-cdh4a1-SNAPSHOT</version>
<version>${cdh.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.hadoop.hive</groupId>
<groupId>org.apache.hive</groupId>
<artifactId>hive-exec</artifactId>
<version>0.8.0-cdh4a1-SNAPSHOT</version>
<version>${cdh.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
51 changes: 39 additions & 12 deletions src/main/java/org/openx/data/jsonserde/JsonSerDe.java
@@ -13,14 +13,13 @@

package org.openx.data.jsonserde;

import java.sql.Timestamp;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde.Constants;
import org.apache.hadoop.hive.serde2.SerDe;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
@@ -53,6 +52,11 @@
import org.openx.data.jsonserde.objectinspector.JsonObjectInspectorFactory;
import org.openx.data.jsonserde.objectinspector.JsonStructOIOptions;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde.Constants;
import org.apache.hadoop.hive.serde2.typeinfo.PrimitiveTypeInfo;

/**
* Properties:
* ignore.malformed.json = true/false : malformed json will be ignored
@@ -111,8 +115,8 @@ public void initialize(Configuration conf, Properties tbl) throws SerDeException
}
assert (columnNames.size() == columnTypes.size());

stats = new SerDeStats();
stats = new SerDeStats();

// Create row related objects
rowTypeInfo = (StructTypeInfo) TypeInfoFactory
.getStructTypeInfo(columnNames, columnTypes);
@@ -154,6 +158,7 @@ public Object deserialize(Writable w) throws SerDeException {

// Try parsing row into JSON object
JSONObject jObj = null;


try {
jObj = new JSONObject(rowText.toString()) {
@@ -166,8 +171,31 @@ public Object deserialize(Writable w) throws SerDeException {
* java.lang.Object)
*/
@Override
public JSONObject put(String key, Object value)
throws JSONException {
public JSONObject put(String key, Object value) throws JSONException {

try {
if (columnNames.contains(key) &&
rowTypeInfo.getStructFieldTypeInfo(key).getCategory().equals(PrimitiveObjectInspector.Category.PRIMITIVE) &&
((PrimitiveTypeInfo) rowTypeInfo.getStructFieldTypeInfo(key))
.getPrimitiveCategory().equals(PrimitiveObjectInspector.PrimitiveCategory.TIMESTAMP) ) {
if(value instanceof String) {
value = Timestamp.valueOf((String)value);
} else if (value instanceof Float ) {
value = new Timestamp( (long) (((Float)value).floatValue() * 1000));
} else if ( value instanceof Integer) {
value = new Timestamp( ((Integer)value).longValue() * 1000);
} else if ( value instanceof Long) {
value = new Timestamp( ((Long)value).longValue() * 1000);
} else if ( value instanceof Double) {
value = new Timestamp( ((Double)value).longValue() * 1000);
} else {
throw new JSONException("I don't know how to convert to timestamp a field of type " + value.getClass());
}
}
} catch (IllegalArgumentException e) {
throw new JSONException("Timestamp " + value + " improperly formatted.");
}

return super.put(key.toLowerCase(), value);
}
};
@@ -192,7 +220,7 @@ public ObjectInspector getObjectInspector() throws SerDeException {

/**
* We serialize to Text
* @return
* @return
*
* @see org.apache.hadoop.io.Text
*/
@@ -225,7 +253,7 @@ public Writable serialize(Object obj, ObjectInspector objInspector) throws SerDe

Text t = new Text(serializer.toString());

serializedDataSize = t.getBytes().length;
serializedDataSize = t.getBytes().length;
return t;
}

@@ -243,9 +271,6 @@ private String getSerializedFieldName( List<String> columnNames, int pos, Struct
* Serializing means getting every field, and setting the appropriate
* JSONObject field. Actual serialization is done at the end when
* the whole JSON object is built
* @param serializer
* @param obj
* @param structObjectInspector
*/
private JSONObject serializeStruct( Object obj,
StructObjectInspector soi, List<String> columnNames) {
@@ -405,7 +430,7 @@ public void onMalformedJson(String msg) throws SerDeException {

@Override
public SerDeStats getSerDeStats() {
if(lastOperationSerialize) {
if(lastOperationSerialize) {
stats.setRawDataSize(serializedDataSize);
} else {
stats.setRawDataSize(deserializedDataSize);
@@ -435,5 +460,7 @@ private Map<String, String> getMappings(Properties tbl) {
}
return mps;
}



}
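The timestamp coercion introduced in the `put` override above (strings parsed via `Timestamp.valueOf`, numbers treated as seconds since the epoch) can be sketched standalone, outside Hive. This is a simplified illustration, not the serde's exact code: it collapses the per-type `Float`/`Integer`/`Long`/`Double` branches into one `Number` branch, so unlike the `Double` branch above it keeps fractional seconds.

```java
import java.sql.Timestamp;

public class TimestampCoercion {

    // Coerce a parsed JSON value to a Timestamp: strings must be in
    // JDBC escape format ("yyyy-mm-dd hh:mm:ss[.f...]"); numbers are
    // taken as seconds since the epoch and scaled to milliseconds.
    static Timestamp coerce(Object value) {
        if (value instanceof String) {
            return Timestamp.valueOf((String) value);
        } else if (value instanceof Number) {
            return new Timestamp((long) (((Number) value).doubleValue() * 1000));
        }
        throw new IllegalArgumentException(
            "don't know how to convert " + value.getClass() + " to timestamp");
    }

    public static void main(String[] args) {
        System.out.println(coerce(10L));                    // 10 s after the epoch
        System.out.println(coerce("2013-09-30 12:00:00"));  // JDBC-formatted string
    }
}
```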
@@ -15,7 +15,6 @@
import java.util.List;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.StandardStructObjectInspector;
import static org.apache.hadoop.hive.serde2.objectinspector.StandardStructObjectInspector.LOG;
import org.apache.hadoop.hive.serde2.objectinspector.StructField;
import org.openx.data.jsonserde.json.JSONException;
import org.openx.data.jsonserde.json.JSONObject;
@@ -100,7 +99,7 @@ public Object getStructFieldDataFromJsonObject(JSONObject data, StructField fiel
}


static List<Object> values = new ArrayList<Object>();


/**
* called to map from hive to json
@@ -114,7 +113,8 @@ protected String getJsonField(StructField fr) {
return fr.getFieldName();
}
}


List<Object> values = new ArrayList<Object>();
@Override
public List<Object> getStructFieldsDataAsList(Object o) {
JSONObject jObj = (JSONObject) o;
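The one-line change above, dropping `static` from the `values` list, is the heart of the fix for #31. A hypothetical minimal repro (class names made up, not the serde's actual inspectors) of why a static mutable field breaks nested structures: every instance shares one list, so filling the inner struct's values clobbers the outer struct's mid-traversal.

```java
import java.util.ArrayList;
import java.util.List;

class BrokenInspector {
    // shared by ALL instances -- nested inspectors stomp on each other
    static List<Object> values = new ArrayList<Object>();

    List<Object> fill(Object... fields) {
        values.clear();
        for (Object f : fields) values.add(f);
        return values;
    }
}

class FixedInspector {
    // one list per instance, as in the fixed code
    List<Object> values = new ArrayList<Object>();

    List<Object> fill(Object... fields) {
        values.clear();
        for (Object f : fields) values.add(f);
        return values;
    }
}

public class StaticFieldDemo {
    public static void main(String[] args) {
        List<Object> outer = new BrokenInspector().fill("country", "languages");
        new BrokenInspector().fill("catholic");   // clears the shared list
        System.out.println(outer);                // parent data lost

        List<Object> outer2 = new FixedInspector().fill("country", "languages");
        new FixedInspector().fill("catholic");    // touches only its own list
        System.out.println(outer2);               // parent data intact
    }
}
```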
21 changes: 17 additions & 4 deletions src/test/java/org/openx/data/jsonserde/JsonSerDeTest.java
@@ -103,11 +103,10 @@ public void testDeserialize() throws Exception {

JSONObject result = (JSONObject) instance.deserialize(w);
assertEquals(result.get("four"), "poop");

assertTrue(result.get("three") instanceof JSONArray);

assertTrue(((JSONArray) result.get("three")).get(0) instanceof String);
assertEquals(((JSONArray) result.get("three")).get(0), "red");
assertTrue( ((JSONArray)result.get("three")).get(0) instanceof String );
assertEquals( ((JSONArray)result.get("three")).get(0),"red");
}

// {"one":true,"three":["red","yellow",["blue","azure","cobalt","teal"],"orange"],"two":19.5,"four":"poop"}
@@ -155,6 +154,20 @@ public void testDeserialize2Initializations() throws Exception {
}


@Test
public void testDeserializePartialFieldSet() throws Exception {
Writable w = new Text("{\"missing\":\"whocares\",\"one\":true,\"three\":[\"red\",\"yellow\",[\"blue\",\"azure\",\"cobalt\",\"teal\"],\"orange\"],\"two\":19.5,\"four\":\"poop\"}");
JsonSerDe instance = new JsonSerDe();
initialize(instance);
JSONObject result = (JSONObject) instance.deserialize(w);
assertEquals(result.get("four"),"poop");

assertTrue( result.get("three") instanceof JSONArray);

assertTrue( ((JSONArray)result.get("three")).get(0) instanceof String );
assertEquals( ((JSONArray)result.get("three")).get(0),"red");
}

/**
* Test of getSerializedClass method, of class JsonSerDe.
*/
