Add check_string function that is more generic, thanks to Encodings #2

ScottPJones · 2015-06-08T12:35:31Z

No description provided.

ScottPJones · 2015-06-08T12:37:15Z

@tkelman Could you take a look at this please?
This shows the greater generality possible, by using Encodings.
This is why, after you've looked, if you agree, I want to put Encodings into base in a PR, and then change PR #11575 to use this much more generic version, based on Encodings.

tkelman · 2015-06-08T14:07:53Z

This looks like it's just pasting in the UTF8 version of check_string into the same function, doesn't look much more general than JuliaLang/julia#11575 at all since there isn't any code reuse.

tkelman · 2015-06-08T14:09:11Z

I do think optional arguments as boolean keyword arguments is better than C-style flags in a UInt though. Though the convention for keyword arguments is typically lowercase.

ScottPJones · 2015-06-08T14:12:20Z

@tkelman As I said, if Encodings.jl can be put in Base (which helps everybody trying to deal with conversions... it would be a basis for people adding support for other encodings, such as CP1252, EUC, or SJIS, etc. in packages), then this version would replace what's currently in #11575.
No issue of code reuse, this more generic version (which is what I thought you wanted), would become the one-and-only function, in Base.

tkelman · 2015-06-08T14:21:36Z

Right. I am comparing this strictly relative to #11575, and this doesn't look any better. And it requires encodings to be added to base first, which as I've said before I'm neutral on. I haven't seen a strong enough use case for needing to look at different endianness of strings in base.

tkelman · 2015-06-08T14:23:14Z

Put another way, we can fix the bugs and try to improve performance (subject to code size tradeoffs) in the functionality that exists in base right now, but for new stuff with esoteric encodings, experimenting with it in a package (aka right here) for a while is the right place to do things.

ScottPJones · 2015-06-08T14:38:04Z

I'm sorry, but I don't think you are quite getting it. The only encodings used by this new version of check_strings are the same as before, so there is no increase in base.
This simply makes the function more generic, at no cost to performance, to allow it to be extended in packages outside of base, as I indicated above.

tkelman · 2015-06-08T14:47:51Z

I'm looking at the two versions of the code, and this doesn't look more generic to me. It looks like you added extra encoding inputs, and pasted the UTF8 version of the code into the same method instead of having one UTF8 method and one non-UTF8 method. The latter is fine for base, especially since it doesn't need to add any extra code first. Generality is not about reducing the method count, it's about doing more with less - here you've just moved code around, I don't see the benefit.

ScottPJones · 2015-06-08T14:53:22Z

You're missing it then.
This version can accept not just a Vector, but also AbstractVector and AbstractArray of different CodeUnit sizes, as well as an AbstractString. In less code, it handles many more cases. That seems a benefit to me...

tkelman · 2015-06-08T15:03:44Z

I did see that part, which smells like a feature - but so far not a use case for it. This is a good place to work on that, to come up with a convincing case.

ScottPJones · 2015-06-08T15:05:07Z

This allows it to handle things like a repeated string or substring efficiently.

ScottPJones · 2015-06-08T16:52:31Z

OK, I've fixed up the comments.
The added flexibility of this approach will be needed to fix some of the issues such as: #11463, #11501, #11502, which is why I want to replace the code in #11575 with this code.

quinnj · 2015-06-08T16:57:32Z

src/CheckStrings.jl

+
+using Encodings
+
+export check_string


Let's just call it check; I'm a big fan of avoiding _ where possible and letting the arguments play apart of indicating what a method does, i.e. check(e::Encoding, dat) == I'm checking the encoding/validity of dat.

I worry though that check might be too general a name... this is really just to check string encodings, stored in various ways...

Ideally in the future, we'd have a Strings module in Base and we could just have Strings.check() and leave it unexported.

Yes, I saw a PR about making a Strings module in Base... Although I have some problems with that (not the concept, but maybe the implementation [mainly because I've got a branch where I'm experimenting with moving a lot of the stuff in strings.jl into separate files, which just have the most basic string functionality, just enough to be able to support """ strings, for documentation, earlier in the build])
It's not that I don't want to export it, I think it should be, because people could extend it for many other string encodings.

ScottPJones · 2015-06-08T17:01:30Z

@quinnj Fine by me... I'd planned on asking if the name should be more generic as well...
Do you get my points of why I think this extensible version should replace the one I did earlier (before your nice Strings.jl package here)? This seems incredibly much cleaner and powerful to me...

quinnj · 2015-06-08T17:05:37Z

src/CheckStrings.jl

+        ch, pos = next(dat, pos)
+        totalchar += 1
+        if ch > 0x7f
+            if E <: UTF8


Any time I find myself doing if X <: Y, it screams for use of dispatch. I think this is probably why @tkelman isn't super impressed by this vs. JuliaLang/julia#11575, since you've basically just added this if statement here and combined the two methods. A better solution would be to find a way to further split out the internals of check that could be dispatched on according to Encoding. This leads to nice hierarchy of methods with the most generic at the top level, and Encoding-specific methods at the lowest and hopefully smallest level.

OK, I put it as one method, because reviewers had previously asked to do just that... (the back and forth makes my head hurt!) 😀

I'm not saying there should be two separate check methods, I'm saying that the if E <: UTF8 is an immediate flag to me that whatever is within that if block should actually be pulled out as a method check_internal(::Type{UTF8}, ...), and it looks like in this case, a generic fallback for check_internal{T}(::Type{T}, ...). That becomes more extensible and general when another encoding doesn't have to reimplement (i.e. copy) all the other code from check, they just have to implement check_internal. This is the beauty of multiple dispatch + types as values in Julia.

Yes, that sounds like a very good idea, I'll try to experiment with how to do that, and make sure it is still efficient...

quinnj · 2015-06-08T17:26:52Z

@ScottPJones, here's something to think about.

I really kind of hate how much string code we have in Base that interacts with Vector{UInt8}, Vector{UInt16}, etc. It's probably the most used and tampered with type-internals of any type in Base. Ideally, we could define a very minimal set of methods that would need to interact with stored bytes, and abstract everything else out.

Indeed, the new String{E} type in this package, regardless of encoding, points to raw allocated bytes as opposed to a mix of UInt16, UInt32, etc.

What are your thoughts on consolidating all methods to operating on UInt8s and what would be a minimal abstraction to interact with those bytes?

ScottPJones · 2015-06-08T17:40:12Z

I'm not sure how that would be all that good... I would think you'd want the compiler to know the size of the code units (1, 2, or 4 bytes), typically with aligned access to 2 or 4 byte values...
Also important if you want to vectorize (at least, with the stuff I used to do by hand)

ScottPJones · 2015-06-14T00:15:37Z

@quinnj I've been trying to adapt the checkstring code to not just use Encodings, but also to directly use the String types.
What I want to know is what you'd intended for:

logical access to the characters of a string, regardless of the underlying encoding (i.e. like AbstractString, where you use start(), endof(), and next() to go through the characters, one at a time (not through the underlying code units).
logical access to the code units of a string, i.e. UInt8, UInt16, or UInt32.
At this level, I would hope to have any endian issues already dealt with,
so that only the encoding issues (like, 1-4 code units for UTF8, or 1 or 2 code units for UTF16),
need to be dealt with.
Maybe the start(), endof() are always in terms of bytes? Could there be a nextcodeunit(), like
next()?
access to the low level storage representation, i.e. array of UInt8...
This would be used for things like you've already done, like fast == using memcmp() if types are consistent.

I was also wondering, what do you think about String having a "validated" bit?

ScottPJones · 2015-06-16T11:00:40Z

Bump @quinnj Have you had a chance to think about any of the questions I raised this weekend? (I realize you've been very busy with mmap, which is a great thing also)

quinnj · 2015-06-16T11:19:46Z

Hey @ScottPJones, yeah, I've been busy with getting stuff ready for JuliaCon, so I trying to refrain from doing much more work on this (hopefully we can have a larger discussion or working session at JuliaCon on this! Maybe on Wednesday during the Hackathon).

1&2) I actually took some time to finally soak in the long, but where-much-progress-was-made JuliaLang/julia#9297 and I think things finally started to settle there. I.e. getindex(s::String, i) returns the i-th code point/character. There was also support added for doing something fancy like "hey there"[1+ 1byte] and "hey there"[1 + 1grapheme]. Those are maybe a little fancy, as opposed to just having getcodeunit(s::String, i) and getgrapheme(s::String, i) methods.

So far, I've only done getindex and getcodeunit for DirectIndexedEncoding strings since they're easy 😛.

My one worry with providing easy access to the Array of UInt8 is that it will get overused and we're back to the ugly situation of having an explosion of string methods operating on Vector{UInt8}. This is as opposed to only providing ways to index characters/code units that provide lower-level access, but avoid turning a string into a full blown Array type, which has led to some of the performance penalties we have with current Base strings. I'm sure it wouldn't be terrible to have a bytes(s::String) method that gave you the raw bytes, but I'd love to explore other ways of doing whatever you're trying to do with those bytes.

For the validated bit, I'm wondering how it will get used. I imagine you could have the normal string constructors do validation on construction, with "unsafe" methods that would allow direct construction without validation (if you know you're getting valid code points from somewhere) for performance. But then how does the bit get used later? Can certain operations only be done on a validated string? Do you "show" them differently?

ScottPJones · 2015-06-16T11:35:16Z

Right, about 3) I'd really want that sort of thing only accessible from the Strings module.
Unfortunately (and this is one of my biggest complaints about julia, from a software engineering viewpoint), julia gives no way of protecting your implementation, so people are always poking around in the private bits (look at all the places in Base and in packages that look at .data of a string!)
The software engineering issues may require drunken discussion at the Muddy Charles sometime during JuliaCon, which may end up with me getting tossed right into the very muddy Charles!

ScottPJones · 2015-06-16T11:43:14Z

About a validated bit... I'm not sure about it, I'm just bouncing ideas around... there are so many optimizations possible when you know strings are validated, but if you also want to support read-only access to strings that you just point to in memory (that are in the middle of a record you got from ZMQ, or from Java, or ODBC, etc.), you might be able to validate them, and then set the bit, to avoid the overhead of calling checkstring. You actually might need 3 states, come to think of it... unknown, valid, invalid,
or maybe better to store a whole set of flags, like those returned from checkstring, which would also help a lot when doing conversions (maybe need to store some of the returned sizes as well... but we'd need to be very careful... we don't want the size of the String object to get too big... (I don't currently know what the allocation sizes are in Julia, with the boxing etc.). Also, for short strings, at some point, we'd really want to have these strings integrated into Base, using ideas like Stefan's bytevec.

quinnj reviewed Jun 8, 2015
View reviewed changes

Add check_string function that is more generic, thanks to Encodings

040b647

ScottPJones force-pushed the spj/checkstring branch from 3c83c17 to 040b647 Compare June 9, 2015 02:34

Improve documentation

cce6b37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add check_string function that is more generic, thanks to Encodings #2

Add check_string function that is more generic, thanks to Encodings #2

ScottPJones commented Jun 8, 2015

ScottPJones commented Jun 8, 2015

tkelman commented Jun 8, 2015

tkelman commented Jun 8, 2015

ScottPJones commented Jun 8, 2015

tkelman commented Jun 8, 2015

tkelman commented Jun 8, 2015

ScottPJones commented Jun 8, 2015

tkelman commented Jun 8, 2015

ScottPJones commented Jun 8, 2015

tkelman commented Jun 8, 2015

ScottPJones commented Jun 8, 2015

ScottPJones commented Jun 8, 2015

quinnj Jun 8, 2015

ScottPJones Jun 8, 2015

quinnj Jun 9, 2015

ScottPJones Jun 10, 2015

ScottPJones commented Jun 8, 2015

quinnj Jun 8, 2015

ScottPJones Jun 8, 2015

quinnj Jun 9, 2015

ScottPJones Jun 10, 2015

quinnj commented Jun 8, 2015

ScottPJones commented Jun 8, 2015

ScottPJones commented Jun 14, 2015

ScottPJones commented Jun 16, 2015

quinnj commented Jun 16, 2015

ScottPJones commented Jun 16, 2015

ScottPJones commented Jun 16, 2015

Add check_string function that is more generic, thanks to Encodings #2

Are you sure you want to change the base?

Add check_string function that is more generic, thanks to Encodings #2

Conversation

ScottPJones commented Jun 8, 2015

ScottPJones commented Jun 8, 2015

tkelman commented Jun 8, 2015

tkelman commented Jun 8, 2015

ScottPJones commented Jun 8, 2015

tkelman commented Jun 8, 2015

tkelman commented Jun 8, 2015

ScottPJones commented Jun 8, 2015

tkelman commented Jun 8, 2015

ScottPJones commented Jun 8, 2015

tkelman commented Jun 8, 2015

ScottPJones commented Jun 8, 2015

ScottPJones commented Jun 8, 2015

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ScottPJones commented Jun 8, 2015

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

quinnj commented Jun 8, 2015

ScottPJones commented Jun 8, 2015

ScottPJones commented Jun 14, 2015

ScottPJones commented Jun 16, 2015

quinnj commented Jun 16, 2015

ScottPJones commented Jun 16, 2015

ScottPJones commented Jun 16, 2015