-
Notifications
You must be signed in to change notification settings - Fork 706
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TypeDescriptor for Option[String] does not make sense. #1375
Comments
I think |
At one point we only allowed I thought Option[primitive] to try side step these issues. The decoding stuff can be a bit more of a pain. Is it defined if part of an Option[CaseClass] exists? i guess its None, or should the others be null's? |
I'm not sure if we should abort on Option[String] -- it's true that TSV can't differentiate between Some("") and None -- but that's just what you get w/ TSVs. I think this is expected / the right behavior right? If we were going to do anything defensive, I'd say it'd be to disallow Some("") (throw in the write path) |
I don't like throwing exceptions at all, but there is a good point here: what is this macro doing? It is called We could use a subclass of this with the TypedText sources, something like TextTypeDescriptor (something similar to what was done with DBTypeDescriptor). This subclass could have the more constrained semantics? Might be good to settle this issue before we publish a public version of the code with some possibly ambiguous behavior. What solutions sound best now? |
I think TypedText.tsv shouldn't allow Option[String] because then it can't obey the round-trip law. Maybe we should have
[Slightly off topic:] We might also throw in a TSV TypedSink that allows any type and uses the toString method to serialize things that aren't primitives, options, or case-classes. This could be useful for debugging. If we have something like this, the user will never be surprised by what happens. |
I'm in favor of not allowing I know that TSV is not really a production data format, but the extra code required for the user to do an explicit conversion from |
Given its not a production format and the end users this will target are for either debugging (in which case best effort write but write as much as you can seems the ideal), or for data scientists importing into another system? I'd be infavor of asking our data scientists who use scalding a lot for input here as the users |
I agree with Gabriel. It makes sense to force the user to decide how to handle the None case explicitly when they write their output to TSV. However, once they load that TSV again, it makes more sense to me to read in the empty case as Some("") rather than None. It seems more intuitive to think of a "null string" as an empty one. |
I think I disagree with Nathan and Gabriel on this. Here's one of the top things on my Scalding wishlist. val myPipe = {
TypedPipe.from(List(
(4, None, Some(7)),
(3, Some("a"), None)
)
)
}
myPipe.save(TypedOsv("myFile.osv")) For ad hoc analysis, I would strongly prefer to get output like this
rather than this:
I'd be willing to sacrifice round-trip ability on Strings if I can reduce the amount of .getOrElse lines I have to add to all my ad hoc jobs. |
@sid-kap , would the last of your four classes be able to do what I described in my previous comment? |
Yes, that's what I had in mind. |
What about this translation scheme:
So in Chris's example, you would get:
... but if they were
|
I'd be fine with @sid-kap 's approach, which would have a strict sink for people who want to be explicit and a looser sink for people who just want to do their ad hoc sampling quickly, and don't ever need to fulfill a round trip. I'd also be totally fine with Gabriel's solution for Strings, assuming that (Some(5), None, Some(6)) would get converted to 5,,6. With Gabriel's approach, there only needs to be one type of sink. |
Slightly off topic: That reminds me, I noticed that our TypedText classes don't support quoted values. @csaid Do you think we should support quoted values? It's part of the CSV spec, so I think we should allow it. (https://tools.ietf.org/html/rfc4180) |
Supporting quotes might conflict with Gabriel's suggestion. |
I'd be happy with Gabriel's solution even if it meant losing the ability to work with quoted values. I never deal with them. |
Quotes are a minefield, i believe though our TypedText stuff does support them in so much as the underlying cascading decoders do. quote handling always gets iffy in csv type data. Usually best to avoid... |
@csaid keep in mind the current TypedText.osv does not write "Some(a)" or "None" it does the "right" thing. What I'm concerned with here is this: who is reading the data if we go with a non-round-trip format? Aren't we just sowing confusing and giving people more rope? If you want debug, why not do: My preference is:
So, I don't like the multiple options for sinks/sources and a different behavior for sources which are also sinks. As for quoting, I think this could be an interesting way to go, but as Ian says, we'd probably have to abandon using cascading's parsing and put our own logic in. Practical question: do people have a compelling example of wanting to use Lastly: note we can support |
[I'm on vacation right now, so I won't be able to respond quickly to everything in this thread.] Those of us consuming data in a non-round-trip way are analyzing CSV files in R or Python and are willing to tolerate a little ambiguity around rare edge cases in exchange for ease of use. If we want to avoid multiple sinks and quoting, then my vote is to stick with the current behavior. Relatedly, I'd find it very useful if we could get to a point where something like this (which is very common after a .leftJoin) val myPipe = {
TypedPipe.from(List(
(1, Some(Some("foo"), None)),
(2, Some(None, Some(12))),
(3, None)
)
)
} could get written out as
without having to manually flatten things with a .map and a lot of .getOrElse statements. You all know much better than me, but I suspect that allowing Option[String] makes it easier to get to this point, right? |
@csaid Is that last line If we support this, then you get ambiguous output, which is exactly what we don't want. Currently you can't do two layers of nesting Option for this reason. I don't think dealing with the nesting that happen in joins here is the best solution because often we won't write out after those operations. Have you seen |
This is a very interesting discussion. From an ease of use standpoint, I would prefer both the empty string and None to write out to the same thing in csv, and round trip it back into None.
Frequently, my ad hoc code has to deal with optional things (like a field in an upstream table that isn't set, and I want to save that state to an intermediate table).
For me, the empty vs NA seems totally ambiguous and not well supported by cvs. I wouldn't trust my downstream analysis to do the right thing. So if I every cared about empty strings, I would probably just map them to something else (like "-") to avoid confusion. |
I agree with @joshualande -- lets not forget that the whole point is to write some data to a TSV, users of TSV need to be aware that None and "" mean the same thing in TSV, and probably expect and rely on that, this is true in many languages (java libraries usually encode null as "" in TSV, python libraries do the same, and so on). |
I'm also not in favor of making a "clever" or more fully featured TSV format (ones with quotes, escaping, and all that). It just adds complexity, and at the end the format is still a bad format. What about strings that contain the delimiter (tab) character? What about strings with newlines in them? I think users of this format understand that it's a very simple format and doesn't handle things like that. |
Agree with Josh/Alex on this issue. val myLeftPipe = {
TypedPipe.from(List(
(1, "a"),
(2, "b")
)
)
}
val myRightPipe = {
TypedPipe.from(List(
(1, Some("x"), Some("y")),
(1, None, Some("z"))
)
)
}
val g1 = myLeftPipe.groupBy(_._1)
val g2 = myRightPipe.groupBy(_._1)
val multiJoined = MultiJoin.left(g1, g2).toTypedPipe This gives me a TypedPipe[(Int, ((Int, String), Option[(Int, Option[String], Some[String])]))]. I'm looking for a simple way to get from that to something flat like this: Sorry if this is a bit off-topic. |
The ambiguity is easy to explain, if you write: so when we read |
Oh yeah, I understand that ambiguity, and I think several of us are saying it's ok to tolerate it. For some reason, I thought Oscar was talking about a different ambiguity caused by flattening the results of a .leftJoin. |
So, it sounds like everyone wants to keep the ambiguity. I hear your point Alex that this thing is already a bit crappy: in that we are not 100% sure (and doubt) that it handles strings that contain the delimiter. To me, that just makes me want to fix that issue too. I really hate mine-fields, and I hate errors at runtime that could have been turned into compile-time errors. We could implement another function (probably with a macro), that flattens trait GetOrElseEmpty[T] {
type U // result type
def unbox(t: T): U
} Seems like it would work:
Then we could keep the code exact, but then have a method like:
|
TypedText doesn't correctly handle strings that contain the delimiter. I think the Hadoop text reader/writers should take care of the mines involved with quoting and strings that contain commas or newlines. |
@sid-kap can you write some tests that exercise this? I'm in favor of fixing. Do most TSV, CSV and 1-SV parsers understand the |
The "official" CSV format allows However, the official TSV format does not allow tabs in the fields, and therefore (I assume) has no need for quoting. (I'm guessing newlines are also not allowed.) See http://www.iana.org/assignments/media-types/text/tab-separated-values, or https://en.wikipedia.org/wiki/Tab-separated_values. I can't find any documentation on OSV. Sure, I can write some tests for this. I think property-based tests for roundtrip-ability would be a good idea. |
So, we could by default throw if the user gives us a string containing the delimiter in the case of tsv or osv, and in the case of csv do quoting. |
That sounds good. |
Personally, I don't like the added complexity. Given that this format has traps, I think it's much better to keep it incredibly simple, so that the traps are easily understood, instead of bolting on fixes until we've created a very confusing format. There is not official CSV / TSV format, just lots of formats that separate things with commas and tabs. Some do string escaping -- do they all escape in exactly the same manor? Why don't we just encourge typed-json for human readable string based formats, and leave this one as simple as possible? |
There is a MIME types for text/csv which has guidelines on how to escape strings with quotes. If we follow the MIME specification, there shouldn't be any ambiguity. If we don't want to add support for quotes, the simple solution would be to fail on strings that contains a comma, newline, or double-quote. |
The thing about TSV/CSV is that they are highly portable. The main reason we didn't like them in the past is that there was not a good schema, but this typed code requires you to write a schema and it fails if the data is bad. It is hard to see how this is much different than thrift to me (except in thrift the IDL language is not the language you are writing, which is hardly a feature). Given that excel, R, python, etc... all consume and produce TSV/CSV with minimal effort, I think it is good to support it as correctly as possible, and indeed in many cases it is a fine choice (assuming you have a clear schema, which we do here). I don't like complexity for the user, but I don't mind (a certain amount) of complexity for the system implementer: that's the goal. Do heavy lifting once, and then make an API that is easy to use correctly. If we got this right, TSVs would be much safer to use without ambiguity, and I think that would be a win. |
Yes, but I think it's worth verifying that R, python, pig, hadoop's i/o formats, and others actually use the spec described there. |
I still prefer to have the user make the conversion from @csaid and @joshualande may have made this mistake already without even realizing that they made it, which is not to imply that they are bad data scientists (I think they are both great!), but rather to imply that everybody makes mistakes when given inexact tools to work with. To quote Dijkstra:
|
I think this was discussed above, but is the any chance we can make TypedTsv round-trip safe, but add a second write-only source that writes out the data for the easiest processing in R/python? |
@isnotinvain Pig's CSVLoader docs also supports the same quoting style (according to http://pig.apache.org/docs/r0.8.1/api/org/apache/pig/piggybank/storage/CSVLoader.html). Unfortunately, Hadoop's TextDelimited scheme doesn't properly support newlines in the middle of fields, but there is a Hadoop CSV scheme that someone wrote that handles this properly (https://github.com/datascienceinc/cascading.csv). We could add this as a dependency and use this instead of TextDelimited. |
Adding a dependency on another project is something we'd be pretty slow to do, lots of maintainance down the track. If its licensed permissively enough and has short enough source it might make sense to copy it in tho. |
It looks like it has dependencies on apache commons too. Doubt we'd want that in scalding core. I guess it starts becoming a question of do we want these edge cases covered enough for all the work likely here? |
I made support for nesting and removed the ability to read |
In case it helps anybody else looking through this, you can convert from an Option to a String that can be easily read into R using the mkString function, i.e. scala> Option(1L).mkString
res0: String = 1
scala> None.mkString
res1: String = "" This is handy if you need to convert the Options before saving to Csv/Tsv, i.e. case class Data(name:String, value:Option[Long], country:String)
val data = List(
Data("Fred",Some(1L),"US"),
Data("Sally", None, "CA")
)
val results = {
TypedPipe.from(data)
.map { case Data(name, value, country) => (name, value.mkString, country) }
.save(TypedCsv("test.csv"))
} Writes out
|
We should abort when we hit Option[String]. When writing an Option[String] to a text file, a None is indistinguishable from
""
(empty string).Right now, I guess if it is empty, we are returning None, so if someone writes
Some("")
, then we would readNone
, which is obviously bad news.If I am right, this means our tests are weak, because we should be testing round-trips of this kind.
The text was updated successfully, but these errors were encountered: