
Add LZO Thrift read support in Presto-hive. #75

Merged

Conversation

Collaborator

@Yaliang Yaliang commented Mar 14, 2017

I want to expose this code in the twitter fork's repo first, to make it possible to discuss the implementation with the team and get some feedback. The DummyClass can somehow be bypassed; we may be able to read from the thrift blob and transform it into a Presto block directly. That way it may be more meaningful to upstream, since it would no longer need elephant-bird.

* See the License for the specific language governing permissions and
* limitations under the License.
*/
package com.twitter.elephantbird.hive.serde;
Collaborator

why elephantbird in the package name? This class is generic and not dependent on EB.

Collaborator Author

That's because dal-sync was using elephant-bird's hive SerDe. Later I realized it isn't worth using their SerDe class: since the elephant-bird dependency is not in hive, I cannot use hive-cli to alter the SerDe class. Meanwhile, I don't want to put dal-sync through one more round of initial backfill, which last time caused dal to fail all of its clients' requests, and some people got paged.

Collaborator

Well, we should move this class to its proper home (without elephantbird) before we ship, since it's basically naming technical debt, which is much easier to fix before we go live than after.

Collaborator Author

Yes, completely right. Let's wait for dal to finish their fix.

* See the License for the specific language governing permissions and
* limitations under the License.
*/
package com.facebook.presto.hive;
Collaborator

Could we either commit this upstream or put it in the twitter package? Same for ThriftHiveRecordCursor.java.

Collaborator Author

Sure, let's put them in the twitter package first.

@Yaliang Yaliang changed the title Add LZO Thrift support in Hive. Add LZO Thrift read support in Presto-hive. Mar 16, 2017
@Yaliang
Collaborator Author

Yaliang commented Mar 29, 2017

+cc @maosongfu

@Yaliang
Collaborator Author

Yaliang commented Mar 31, 2017

Upstream modified BackgroundHiveSplitLoader to support a customized InputFormat. I will take a look at whether we can adapt to that approach; since we need a FileSystem to read the index file, it may be a little different. I searched for the annotation on the customized InputFormat; it's from uber/hoodie.

@Yaliang
Collaborator Author

Yaliang commented Apr 13, 2017

Ready for review. @billonahill @dabaitu

Collaborator

@billonahill billonahill left a comment

Please provide unit tests for the 7 new classes, especially tests for how we read fields from thrift objects.

throws IOException
{
if (bucketHandle.isPresent()) {
throw new PrestoException(StandardErrorCode.NOT_SUPPORTED, "Bucketed table in ThriftGeneralInputFormat is not yet supported");
Collaborator

Please include info in the exception message to indicate which table is not supported. Also, what is a bucketed table? We should make the exception clearer to a user without detailed knowledge of the code.

Collaborator Author

I basically followed this line:

https://github.com/twitter-forks/presto/blob/twitter-master/presto-hive/src/main/java/com/facebook/presto/hive/BackgroundHiveSplitLoader.java#L327

I think a bucketed table means the rows are pre-sampled based on some columns in order to save query time. Since we are using external tables for everything, simply raising NOT_SUPPORTED should be fine. Here is the hive wiki:

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL+BucketedTables

Collaborator

I see. It's basically a field-partitioned table. We should include the table name in the exception message.
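A minimal sketch of the clearer message, assuming a table identifier (here called table) is in scope at this point and using String.format:

if (bucketHandle.isPresent()) {
    throw new PrestoException(StandardErrorCode.NOT_SUPPORTED,
            format("Bucketed table %s is not supported by ThriftGeneralInputFormat", table));
}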


while (hiveFileIterator.hasNext()) {
LocatedFileStatus file = hiveFileIterator.next();
if (!ThriftGeneralInputFormat.lzoSuffixFilter.accept(file.getPath())) {
Collaborator

Instead of exposing a public static member of this class with an accept method, could we expose a static method on ThriftGeneralInputFormat like isLzoFile? That makes the expected usage more straightforward and the implementation more flexible to change.

Collaborator Author

Sure, but what I am doing mixes the logic from LzoInputFormat in elephant-bird with LzoIndex in hadoop-lzo. I am not quite sure whether exposing either a static method isLzoFile or a static member lzoSuffixFilter is the best way to filter the files. We could also return an empty list if the passed file path shows it is not an LZO file.

Collaborator

I'm not familiar enough with those classes OTTOMH to comment on the mixing, but I think it's better to support a boolean method than to return an empty list.
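A sketch of that boolean method, assuming the standard hadoop-lzo ".lzo" suffix is the deciding factor:

public static boolean isLzoFile(Path path)
{
    return path.getName().endsWith(".lzo");
}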

while (hiveFileIterator.hasNext()) {
LocatedFileStatus file = hiveFileIterator.next();
if (!ThriftGeneralInputFormat.lzoSuffixFilter.accept(file.getPath())) {
continue;
Collaborator

What's the use case for why non-lzo files would be passed to addLzoThriftSplitsToQueue?

Collaborator Author

The index file will be passed.

this.root = root;
}

public ThriftFieldIdResolver initialize(Properties schema)
Collaborator

Since this method is not initializing the instance but instead returning a new one, it should be a static factory method and the above constructors should be private. Same for the next method:

public static ThriftFieldIdResolver newResolver(Properties schema)
public static ThriftFieldIdResolver newNestedResolver(JsonNode root, int hiveIndex)

Collaborator Author

I will think about these more carefully. With newNestedResolver(JsonNode root, int hiveIndex) we lose any solution other than a JSON field in the schema. The purpose of making ThriftFieldIdResolver an interface with HiveThriftFieldIdResolver as an implementation was to allow a more flexible solution in the future. Say we later make a service call to DAL to resolve the thrift field id: we could just call initialize to set some context and return the resolver itself as a delegate to DAL, always using the same object.

Collaborator

I think the interface/impl pattern is fine, just don't mix in the builder pattern as well via instance methods on the impl. The interface should have just what's needed by its consumers (i.e., short getThriftId(int hiveIndex)). The builder methods should not be included and should be handled separately as static factory methods. When changing the implementation, a new factory method on a new class can be used, but the downstream consumers would remain unaffected since the interface contract they care about stays consistent.

Collaborator Author

You are right; I was thinking the same way, that we should separate builder and instance methods. I agree on the initial constructor, but not on how to build the nested resolver. The nested resolver builder must be part of the interface, because the nested resolver relies on the instantiated root resolver. If we had to hold the JsonNode root ourselves, we just wouldn't need this class at all.

Collaborator

+1 for the refactor.
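For reference, a minimal sketch of the interface shape this refactor converges on (method names taken from this thread and the tests below, not necessarily the final code):

public interface ThriftFieldIdResolver
{
    short getThriftId(int hiveIndex);

    ThriftFieldIdResolver getNestedResolver(int hiveIndex);
}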

try {
return new HiveThriftFieldIdResolver(objectMapper.readTree(jsonData));
}
catch (Exception e) {
Collaborator

Why catch and swallow Exception here and below? Can we be more specific about which types of exceptions we want to ignore? This is too loose, in that any kind of bug or exception will change what's returned, probably without a log warning. Is this code path really expected?

Collaborator Author

I think it should be fine. It's saying: try to read this JSON property; if it's not present or is corrupted, no worries, fall back to a default id resolver which simply does hiveIndex + 1.

Another place that uses a similar try/catch for JSON is here:

https://github.com/twitter-forks/presto/blob/twitter-master/presto-record-decoder/src/main/java/com/facebook/presto/decoder/json/JsonRowDecoder.java#L62

Collaborator

This pattern (of catching Exception) is really bad, because it encapsulates anything that could possibly go wrong and assigns a single solution to it. If you want to handle the case where the json is not present, handle it with a null check. Same for the case where it might be corrupted: handle it by catching the expected exception that would be thrown in that case (JsonProcessingException?) and logging it.

What if you had corrupt json in the system due to a bug in dalsync? The current approach would mask that forever as an expected condition (you have to assume we're not running in production with debug logging enabled, and even if we were, the scenario I describe is worthy of error logging).
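A minimal sketch of that narrower handling, reusing names that appear elsewhere in this diff (THRIFT_FIELD_ID_JSON, PLUSONE, objectMapper, log); Jackson's readTree throws an IOException (JsonProcessingException) on corrupt input:

String jsonData = schema.getProperty(THRIFT_FIELD_ID_JSON);
if (jsonData == null) {
    return PLUSONE; // mapping absent: expected case, no logging needed
}
try {
    return new HiveThriftFieldIdResolver(objectMapper.readTree(jsonData));
}
catch (IOException e) {
    log.warn(e, "Corrupt thrift field id JSON, falling back to the default resolver: %s", jsonData);
    return PLUSONE;
}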

Collaborator Author

I will add the null check.
For the rest, let's break this conversation into more specific questions:

  1. Should we have a default hiveIndex + 1 behavior?
  2. At what level should we log which types of exceptions, given that the resolver will be called very frequently?

Collaborator

If you encounter what we think is a bug, we should log it and ideally processing should fail. If it's an expected case, like an unsupported type, we should proceed without logging.

For the hiveIndex + 1 thing, why do we do that currently? Is that because the id space is offset by 1?

Collaborator Author

Yes. It's because the id space is offset by 1.

super(new MultiInputFormat());
}

private void initialize(FileSplit split, JobConf job) throws IOException
Collaborator

Can this be part of the constructor?

Collaborator Author

No, the initialize method is called from the getRecordReader method. There is nothing specific to ThriftGenericRow, even though we only use ThriftGenericRow. Also, in general the constructor of an InputFormat is often called without parameters, so we had better leave the initialize method as is.

Collaborator

If there's a valid reason why object creation and initialization need to happen at different times in the code, then it's fine to keep it as is.

setInputFormatInstance(new MultiInputFormat(new TypeRef(thriftClass) {}));
}
catch (ClassNotFoundException e) {
throw new RuntimeException("Failed getting class for " + thriftClassName);
Collaborator

Is there a better presto exception we can throw here than RuntimeException?


Collaborator

Can we imitate https://github.com/twitter-forks/presto/blob/twitter-master/presto-hive/src/main/java/com/facebook/presto/hive/HiveUtil.java#L184 instead and throw PrestoException, which is a type of RuntimeException? Throwing a plain RuntimeException should be avoided. PrestoException was created as a specific type of runtime exception that can be handled specifically and can carry more contextual information. For example, our error reports break down the types of presto exceptions.

I don't see the advantage of using RuntimeException over PrestoException in this case.

Collaborator

Can we throw PrestoException here?
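A sketch of what that could look like, following the HiveUtil pattern linked above (HIVE_SERDE_NOT_FOUND is an assumption about which HiveErrorCode fits best; chaining e preserves the cause):

catch (ClassNotFoundException e) {
    throw new PrestoException(HIVE_SERDE_NOT_FOUND, "Failed getting class for " + thriftClassName, e);
}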

throws IOException
{
job.setBoolean("elephantbird.mapred.input.bad.record.check.only.in.close", false);
job.setFloat("elephantbird.mapred.input.bad.record.threshold", 0.0f);
Collaborator

We used to have to configure these often with EB in map reduce. We should make these configurable by table.

Collaborator Author

Sure, I will allow JobConf to override these.

case TType.MAP:
return readMap(iprot);
default:
TProtocolUtil.skip(iprot, type);
Collaborator

Why skip if we don't find a known type? Is this an error condition that should be logged? What type(s) do we expect here?

Collaborator Author

I simply followed the scrooge-generated code, which skips any unmatched types and field ids. Maybe we can log it, but most code I have seen simply skips all unexpected conditions. What do you think?

Collaborator

We might not want to log in this case (maybe debug logging?), since it could be verbose if there are many unsupported fields.

default:
TProtocolUtil.skip(iprot, type);
}
return null;
Collaborator

Can we move this into the default clause to make it explicit? Also what does returning null mean? Could we instead throw a TypeNotFoundException or something like that? Typically returning null is an anti-pattern.

Collaborator Author

Returning null means there is a field here but we cannot understand what it is, so we mark it as null and continue. It may be because the file is corrupt or something. Yes, we could just throw an exception and stop. Do you think we should prevent this loose conversion?

Collaborator

If the contract we're providing is that we return null when we can't handle the type, then we should stick with this. It would still be good to include the return in the default block though, so it's clear that default should a) skip and b) return null.
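That is, something like:

default:
    // unknown/unsupported type: skip the value and surface it as null
    TProtocolUtil.skip(iprot, type);
    return null;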

Collaborator Author

Will move it.

}

Configuration targetConfiguration = hdfsEnvironment.getConfiguration(file.getPath());
JobConf targetJob = new JobConf(targetConfiguration);
Collaborator

Working with Hadoop configs can actually be expensive; they're notoriously inefficient. If there's a way we can get these configs once outside of the loop, we should.

Collaborator Author

  1. I think it's better to keep it in the loop, since the path changes for each file and we should keep the logic correct.
  2. For Presto, hdfsEnvironment.getConfiguration(file.getPath()) will always return the same Configuration. I think this is also the reason why we have to use a soft link to dwrev files. Code:
    https://github.com/twitter-forks/presto/blob/twitter-master/presto-hive/src/main/java/com/facebook/presto/hive/HiveHdfsConfiguration.java#L66

Collaborator

That's fine then. Let's just be aware of it and note it as something to keep an eye on w.r.t. performance.

return (short) (hiveIndex + 1);
}

Short thriftId = thriftIds.get(Integer.valueOf(hiveIndex));
Collaborator

@billonahill billonahill Apr 14, 2017

Integer.valueOf is probably not needed here due to auto-boxing. Same for below in the else statement.
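For example, the lookup auto-boxes the int key on its own:

Short thriftId = thriftIds.get(hiveIndex); // int auto-boxes to Integer for the map lookup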

else {
chunkLength = index.alignSliceEndToIndex(offset + chunkLength, length) - offset;
}
log.debug("lzo split: %s (%s:%s)", path, offset, offset + chunkLength);
Collaborator

After testing can we remove this line? I suspect it would make it so debug logging is saturated with these and not useful for anything else. Have you found that to be the case?

Collaborator Author

Yes. I use this line for manual verification. It will be removed later.

while (chunkOffset >= blockLocation.getLength()) {
// allow overrun for lzo compressed file for intermediate blocks
if (!isLzoCompressedFile(filePath) || blockLocation.getOffset() + blockLocation.getLength() >= length) {
checkState(chunkOffset == blockLocation.getLength(), "Error splitting blocks");
Collaborator

Please add more context to the "Error splitting blocks" message to better describe what error was encountered.
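For example (a sketch reusing the variables from the snippet above; Guava's checkState supports %s templating):

checkState(chunkOffset == blockLocation.getLength(),
        "Error splitting blocks: chunk offset %s does not match block length %s for file %s",
        chunkOffset, blockLocation.getLength(), filePath);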

@@ -535,13 +543,21 @@ private void addToHiveSplitSourceRoundRobin(List<Iterator<HiveSplit>> iteratorLi
Map<Integer, HiveType> columnCoercions)
throws IOException
{
Path filePath = new Path(path);
Collaborator

This class has a lot of mixed-in code now, which will make merging from master difficult and risky in the future. We should either contribute this code back to presto (preferred), or make it more merge-friendly if our plan is to not contribute it. In that case, is there a way to minimize changes to this class to only a few one-liners that call methods added to the bottom in a block, or maybe methods in a twitter-specific class? Or maybe adding method calls with empty bodies and then subclassing is possible?

Collaborator Author

I plan to contribute back to OSS since they are working on LZO text which could benefit from this change.

Collaborator Author

Included in the upstream PR prestodb#7916


The code added in this file is to handle lzo/lzop decompression, right? Would switching to the airlift lib help get rid of our additions here? Or is it that our lzop files might not all be indexed, so we need to handle that here?

Collaborator Author

The code added in this method performs flexible splitting; it is not for decompression. If the index file is not present, line 592 will make it return the whole file as a single split.


This index being present vs absent is only with Thrift files?

Collaborator Author

The more general truth is that the index comes with lzop files; it is not aware of the data format. That's the reason we want to contribute the splittable lzo patch to upstream: they have support for lzo text and should be able to benefit from our patch.

@@ -279,6 +310,21 @@ static boolean isSplittable(InputFormat<?, ?> inputFormat, FileSystem fileSystem
}
}

public static boolean isLzoCompressedFile(Path filePath)
Collaborator

Unless we plan to contribute this code back, please move these methods to a twitter-specific class for merge-friendliness.

Collaborator Author

I plan to contribute it to upstream.

import static com.facebook.presto.hive.HiveUtil.checkCondition;
import static com.google.common.base.MoreObjects.toStringHelper;

public class HiveThriftFieldIdResolver
Collaborator

please make a unit test for this class

Collaborator Author

Added TestHiveThriftFieldIdResolver


iprot.readStructEnd();
}

private Object readElem(TProtocol iprot, byte type) throws TException
Collaborator

Can you create a unit test for this class/method that constructs a thrift object with all supported (and unsupported) types and then asserts that it reads (or fails appropriately) for each?

Collaborator Author

@Yaliang Yaliang Apr 26, 2017

It should be covered in testLZOThrift.

oprot.writeStructEnd();
}

private void writeField(Map.Entry<Short, Object> field, TProtocol oprot)
Collaborator

Could we change this signature to more clearly convey what's in the Entry by passing the key and value instead? This reads much more clearly:

private void writeField(short thriftFieldId, Object fieldValue, TProtocol oprot)

Collaborator Author

Good point. Will address.
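The call site would then unpack the entry once, e.g. (assuming the fields live in a Map<Short, Object>, here called fieldMap):

for (Map.Entry<Short, Object> field : fieldMap.entrySet()) {
    writeField(field.getKey(), field.getValue(), oprot);
}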

private static final String LAZY_BINARY_SERDE = "org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe";
private static final String THRIFT_GENERIC_ROW = ThriftGenericRow.class.getName();
private static final Set<String> THRIFT_SERDE_CLASS_NAMES = ImmutableSet.<String>builder()
.add("com.facebook.presto.twitter.hive.thrift.ThriftGeneralSerDe")
Collaborator

These classes are already dependencies, so we should do ThriftGeneralSerDe.class.getName().
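That is:

.add(ThriftGeneralSerDe.class.getName())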

return Optional.empty();
}

setPropertyIfUnset(schema, "elephantbird.mapred.input.bad.record.check.only.in.close", Boolean.toString(false));
Collaborator

These settings should be configurable, ideally per datatype. We should expect that bad records exist.

Collaborator Author

What kind of configurability are you expecting here? It's currently configurable at the table level: if the table's properties contain these settings (users can set them with hive-cli), it will respect the values from the table.

Collaborator

I didn't realize that value could be set using hive-cli; that's great.
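For reference, a plausible shape for the setPropertyIfUnset helper used above (a sketch, not necessarily the exact implementation); the table's own properties win because the default is applied only when the key is absent:

private static void setPropertyIfUnset(Properties schema, String key, String value)
{
    if (schema.getProperty(key) == null) {
        schema.setProperty(key, value);
    }
}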

@Yaliang
Collaborator Author

Yaliang commented Apr 27, 2017

With commit d0d2ae3, we now have an integrated unit test for lzo thrift.

[presto-hive (yaliangw/addThriftInHive)] $ pmvn test -Dtest=TestHiveFileFormats
...
[INFO] ------------------------------------------------------------------------
[INFO] Building presto-hive 0.170-tw-0.32
[INFO] ------------------------------------------------------------------------
...
-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Running com.facebook.presto.hive.TestHiveFileFormats
...
Results :

Tests run: 40, Failures: 0, Errors: 0, Skipped: 0
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
...

The reason I called it integrated is that the test actually walks through the whole process, including ThriftHiveRecordCursorProvider, ThriftHiveRecordCursor, ThriftGeneralDeserializer, ThriftGenericRow, HiveThriftFieldIdResolverFactory, and HiveThriftFieldIdResolver. I will add a separate unit test to verify that HiveThriftFieldIdResolver correctly understands the JSON string of the hive-id-to-thrift-id mapping.

@Yaliang
Collaborator Author

Yaliang commented Apr 27, 2017

I have submitted a separate PR prestodb#7916 upstream for splitting splittable lzo. Support for the LZO/LZOP codecs was just merged this morning. Hopefully we'll get their review soon.

@Yaliang
Collaborator Author

Yaliang commented May 3, 2017

Ready for another round of review. @billonahill @dabaitu

@@ -238,6 +262,13 @@ static String getInputFormatName(Properties schema)
return name;
}

public static String getSerializationClassName(Properties schema)
Collaborator

This class has deps on com.facebook.presto.twitter.hive.thrift.ThriftGeneralInputFormat and references elephantbird, so I assume you're not planning on contributing this to OSS. If that's the case, can we put these new methods in another twitter-specific class? That way we leave less of a footprint on this class and can more easily merge from OSS in the future.

Collaborator Author

Let me clarify the changes in this class. The change at line 195 must stay in this class, since it calls getRecordReader directly and we get no chance to rename the property once we use HiveMultiInputFormat directly and remove com.facebook.presto.twitter.hive.thrift.ThriftGeneralInputFormat, which is just a mirror of EB's HiveMultiInputFormat. The change to HiveMultiInputFormat has been merged in EB, so we will remove lines 249 to 252 once EB makes a new release. The method getSerializationClassName could be moved somewhere outside this class. The rest of the changes are included in PR prestodb#7916, so they will go to OSS.

private static final Logger log = Logger.get(HiveThriftFieldIdResolverFactory.class);
private static final ObjectMapper objectMapper = new ObjectMapper();
public static final String THRIFT_FIELD_ID_JSON = "thrift.field.id.json";
public static final ThriftFieldIdResolver PLUSONE = new HiveThriftFieldIdResolver(null);
Collaborator

Why PLUSONE? Can we give this a more descriptive name?

Collaborator Author

Because it does the shift ThriftId = HiveId plus one.
Any suggestions?

Collaborator

@billonahill billonahill May 3, 2017

I see, I missed that. Instead of naming it based on its current implementation details, it seems more natural to name it based on what it is: HIVE_THRIFT_FIELD_ID_RESOLVER. That makes it clearer to the reader what class is returned by default.

@Test
public class TestHiveThriftFieldIdResolver
{
private static final Map<String, Object> STRUCT_FIELD_ID_AS_MAP = ImmutableMap.of(
Collaborator

You could use Map<String, Short> and avoid the cast.

Collaborator Author

I get this error:

incompatible types: inference variable V has incompatible bounds
[ERROR] equality constraints: java.lang.Short
[ERROR] lower bounds: java.lang.Integer

Collaborator

I thought auto-boxing would handle that for you. I guess not...
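Java auto-boxes int to Integer only; there is no implicit int-to-Short conversion, so the explicit cast is required either way:

Map<String, Short> bad = ImmutableMap.of("id", 2);         // does not compile: 2 boxes to Integer, not Short
Map<String, Short> ok = ImmutableMap.of("id", (short) 2);  // fine: explicit narrowing, then boxing to Short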

"1", (short) 2,
"id", (short) 4);

private static final Map<String, Object> LIST_FIELD_ID_AS_MAP = ImmutableMap.of(
Collaborator

why are we mixing string and short values in the same map?

Collaborator Author

The value for 0 can be any map. It represents the nested structure for the element's thrift id mapping.

Collaborator

got it

"0", "{}",
"id", (short) 5);

private static final Map<String, Object> VERBOSE_PRIMARY_FIELD_ID_AS_MAP = ImmutableMap.of(
Collaborator

Map<String, Short>

public void testDefaultResolver()
throws Exception
{
ThriftFieldIdResolver plusOne = resolverFactory.createResolver(new Properties());
Collaborator

Please use a more descriptive name than plusOne, like resolver.

{
ThriftFieldIdResolver plusOne = resolverFactory.createResolver(new Properties());

assertEquals(plusOne.getThriftId(0), 1);
Collaborator

why are these values found when using a resolverFactory initialized with nothing in it?

Collaborator Author

Because the default non-informative resolver just does hiveId + 1 as the result.

assertEquals(plusOne.getThriftId(0), 1);
assertEquals(plusOne.getThriftId(10), 11);
assertEquals(plusOne.getThriftId(5), 6);
assertEquals(plusOne.getNestedResolver(2), plusOne);
Collaborator

why does getNestedResolver return the same resolver?

Collaborator Author

Because the default non-informative resolver should return itself: its nested resolver must also be the non-informative resolver.

@@ -78,6 +80,9 @@
import static com.facebook.presto.hive.HiveSessionProperties.getMaxSplitSize;
import static com.facebook.presto.hive.HiveUtil.checkCondition;
import static com.facebook.presto.hive.HiveUtil.getInputFormat;
import static com.facebook.presto.hive.HiveUtil.getLzopIndexPath;
import static com.facebook.presto.hive.HiveUtil.isLzopCompressedFile;
import static com.facebook.presto.hive.HiveUtil.isLzopIndexFile;

Is it easy to switch to the airlift lzo decompressor now? If not, can you add a note to do that in the future?
I don't know the technical details, but from the website description the airlift decompressor is supposed to be faster.

Collaborator Author

Right, that's potentially an optimization we can do. Unfortunately, if we want to switch to the airlift decompressor, we have to either write airlift's codec class name into the LZO thrift data file or pull the lzo decompression logic out of EB, especially this line: https://github.com/twitter/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantbird/mapreduce/input/LzoRecordReader.java#L84


@@ -91,6 +93,11 @@
TEXTFILE(LazySimpleSerDe.class.getName(),
TextInputFormat.class.getName(),
HiveIgnoreKeyTextOutputFormat.class.getName(),
new DataSize(8, Unit.MEGABYTE)),

how did you estimate this writer size?

Collaborator Author

It's based on HiveIgnoreKeyTextOutputFormat, which simply writes binary as I understand it. We do not use the write functionality, but the unit test requires a valid StorageFormat for the test file format.

this.root = root;
}

public short getThriftId(int hiveIndex)

Can you add some comments about how thrift ids could be discontinuous and hive ids are continuous to help us follow the logic of index mapping?

Collaborator Author

Will do.
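Something along these lines (the ids here are hypothetical, purely to illustrate the mapping):

// Hive column indexes are always continuous (0, 1, 2, ...), while thrift
// field ids may be sparse (e.g. 1, 3, 7 after fields are deprecated), so
// the resolver maps each hive index to its thrift id: {0 -> 1, 1 -> 3, 2 -> 7}.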


Very nice!

recordReader.close();
}
catch (IOException e) {
throw Throwables.propagate(e);

why not just
throw new RuntimeException(e);

Collaborator Author

Good catch.

@Yaliang
Collaborator Author

Yaliang commented May 4, 2017

@dabaitu Updated. Will you take a look again?

@dabaitu dabaitu left a comment

👍 ^1000 for coding this.


@Yaliang Yaliang merged commit 71b47c6 into twitter-forks:twitter-master May 4, 2017
@Yaliang Yaliang deleted the yaliangw/addThriftInHive branch May 4, 2017 22:54
luohao pushed a commit that referenced this pull request May 1, 2019
Port original PR (#75) to the prestosql code base:

* 3c9b124
* 9a65316
* 7bb2994
* 489d3d2
* d0d2ae3
* d526964