This repository has been archived by the owner on Oct 8, 2019. It is now read-only.

DataChunk* precondition failing because of malformed(?) date #71

Closed
robinkraft opened this issue Jul 11, 2012 · 8 comments
@robinkraft
Contributor

This happens on feature/deliver when running preprocessing:

Caused by: java.lang.AssertionError: Assert failed: (or (not date) (string? (first date)))
    at forma.thrift$DataChunk_STAR_.doInvoke(thrift.clj:278)
    at clojure.lang.RestFn.invoke(RestFn.java:497)
    at clojure.lang.Var.invoke(Var.java:431)
    at clojure.lang.AFn.applyToHelper(AFn.java:178)
    at clojure.lang.Var.applyTo(Var.java:532)
    at cascalog.ClojureCascadingBase.applyFunction(ClojureCascadingBase.java:68)
    at cascalog.ClojureMap.operate(ClojureMap.java:34)
    at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:86)
    ... 112 more

Thrift is getting called in predicate/chunkifier:

https://github.com/reddmetrics/forma-clj/blob/feature/deliver/src/clj/forma/hadoop/predicate.clj#L157

I'm running this command:

hadoop jar forma-0.2.0-SNAPSHOT-standalone.jar forma.hadoop.jobs.preprocess.PreprocessStatic "gadm" "/user/hadoop/gadm.txt" "s3n://pailbucket/cmr/" "500" :CMR
@sritchie
Contributor

What does (first date) mean here?


@ghost ghost assigned eightysteele Jul 14, 2012
@eightysteele
Contributor

@sritchie, yeah dude, so date is an optional argument and (first date) is just grabbing the date value. Totally a hack! Going to refactor this namespace to use :keys and :or instead in #110.
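
For readers unfamiliar with the two styles mentioned here, the following is a minimal sketch of an optional argument taken as a rest parameter, next to a keyword-argument version using :keys and :or. The names `chunk-rest` and `chunk-kw` and the maps they return are hypothetical stand-ins for illustration, not the actual forma.thrift code.

```clojure
;; Hypothetical stand-in for a constructor with an optional date taken
;; as a rest argument. `date` is nil when omitted, or a one-element seq
;; when supplied, which is why the precondition uses `(first date)`.
(defn chunk-rest [dataset value & date]
  {:pre [(or (not date) (string? (first date)))]}
  {:dataset dataset :value value :date (first date)})

;; Keyword-argument variant: `date` is bound directly to the value and
;; defaults to nil when the :date key is absent.
(defn chunk-kw [dataset value & {:keys [date] :or {date nil}}]
  {:pre [(or (nil? date) (string? date))]}
  {:dataset dataset :value value :date date})

(chunk-rest "ndvi" 1)                   ; date omitted
(chunk-rest "ndvi" 1 "2005-12-01")      ; date supplied positionally
(chunk-kw "ndvi" 1 :date "2005-12-01")  ; date supplied by keyword
```

In the keyword version the precondition can test the value itself, so there is no wrapping seq to unpack.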

@robinkraft
Contributor Author

I ran into the Assert failed: (or (not date) (string? (first date))) error again today. It seems to have something to do with how nil values are handled by Thrift and Cascalog; I have a workaround at the end of this note. I'd say this is a high priority, since we can't do any static data preprocessing until it's fixed.

It gets pretty complicated in the static preprocessing, so I haven't been able to replicate the exact state of the preprocessing workflow when it fails (with a call to predicate/chunkifier).

But all is not lost. Something is going wrong with a nullable !date field generated when we use sparse-windower as a source. The nullable !date field is passed into chunkifier:

(chunkifier ?dataset !date ?s-res ?t-res ?h ?v ?id ?val :> ?tile-chunk)

That should pass the (or (not date)) part of the precondition, but it clearly doesn't. Is this an issue with Thrift? Cascalog? Cascading? No idea, but this causes the same exception:

(??- (let [src [["ndvi" (thrift/ModisPixelLocation* "500" 28 8 0 0) 1 "16" nil]]]
       (<- [?dc]
           (src ?name ?loc ?val ?t-res !date)
           (thrift/DataChunk* ?name ?loc ?val ?t-res !date :> ?dc))))
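
One plausible reading of this failure, assuming DataChunk* takes date as a rest argument (the RestFn and doInvoke frames in the stack trace and the (first date) in the precondition are consistent with that): a nil passed positionally is wrapped into the one-element seq (nil). That seq is truthy, so (not date) is false, and (string? (first date)) is false as well. The sketch below reproduces this in plain Clojure with a hypothetical function `chunk*`; it does not involve Thrift or Cascalog, so it only illustrates the argument-handling part.

```clojure
;; Hypothetical stand-in; not the forma.thrift implementation.
(defn chunk* [dataset value & date]
  {:pre [(or (not date) (string? (first date)))]}
  {:dataset dataset :value value :date (first date)})

;; Returns true if calling the zero-argument function f
;; throws an AssertionError, false otherwise.
(defn assertion-fails? [f]
  (try
    (f)
    false
    (catch AssertionError _ true)))

;; Date omitted: the rest argument is nil, so the precondition passes.
(assertion-fails? #(chunk* "ndvi" 1))      ;; => false

;; Explicit nil: the rest argument is (nil), so the precondition fails.
(assertion-fails? #(chunk* "ndvi" 1 nil))  ;; => true
```

Under that assumption, omitting the argument entirely and passing nil in its place are not equivalent.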

One fix would be to use thrift/DataChunk* and thrift/ModisChunkLocation* directly with those values, leaving out the !date field, which we don't need anyway for static data:

(thrift/ModisChunkLocation* ?s-res ?h ?v ?id chunk-size :> ?tile-loc)
(thrift/DataChunk* ?dataset ?tile-loc ?val ?t-res :> ?tile-chunk)

robinkraft added a commit that referenced this issue Jul 19, 2012
fix for #71  by replacing call to chunkifier with direct use of thrift API
@robinkraft
Contributor Author

Fixed with pull request #141

@sritchie
Contributor

Can I see a test for this?

@robinkraft
Contributor Author

A test for what exactly? This whole preprocessing process? Or just the piece that was breaking?


@sritchie
Contributor

The piece that was breaking -- I'm realizing that the precondition is better, I'm just not convinced that I understand why the precondition was failing with !date. I'd like to understand it before we move on.

@robinkraft
Contributor Author

This will replicate the error. There's no test of the chunkifier at the moment, but I can add one tomorrow.

(??- (let [src [["ndvi" (thrift/ModisPixelLocation* "500" 28 8 0 0) 1 "16" nil]]]
       (<- [?dc]
           (src ?name ?loc ?val ?t-res !date)
           (thrift/DataChunk* ?name ?loc ?val ?t-res !date :> ?dc))))
