Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add check_string function that is more generic, thanks to Encodings #2

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

ScottPJones
Copy link
Collaborator

No description provided.

@ScottPJones
Copy link
Collaborator Author

@tkelman Could you take a look at this please?
This shows the greater generality possible, by using Encodings.
This is why, after you've looked, if you agree, I want to put Encodings into base in a PR, and then change PR #11575 to use this much more generic version, based on Encodings.

@tkelman
Copy link

tkelman commented Jun 8, 2015

This looks like it's just pasting in the UTF8 version of check_string into the same function, doesn't look much more general than JuliaLang/julia#11575 at all since there isn't any code reuse.

@tkelman
Copy link

tkelman commented Jun 8, 2015

I do think optional arguments as boolean keyword arguments is better than C-style flags in a UInt though. Though the convention for keyword arguments is typically lowercase.

@ScottPJones
Copy link
Collaborator Author

@tkelman As I said, if Encodings.jl can be put in Base (which helps everybody trying to deal with conversions... it would be a basis for people adding support for other encodings, such as CP1252, EUC, or SJIS, etc. in packages), then this version would replace what's currently in #11575.
No issue of code reuse, this more generic version (which is what I thought you wanted), would become the one-and-only function, in Base.

@tkelman
Copy link

tkelman commented Jun 8, 2015

Right. I am comparing this strictly relative to #11575, and this doesn't look any better. And it requires encodings to be added to base first, which as I've said before I'm neutral on. I haven't seen a strong enough use case for needing to look at different endianness of strings in base.

@tkelman
Copy link

tkelman commented Jun 8, 2015

Put another way, we can fix the bugs and try to improve performance (subject to code size tradeoffs) in the functionality that exists in base right now, but for new stuff with esoteric encodings, experimenting with it in a package (aka right here) for a while is the right place to do things.

@ScottPJones
Copy link
Collaborator Author

I'm sorry, but I don't think you are quite getting it. The only encodings used by this new version of check_strings are the same as before, so there is no increase in base.
This simply makes the function more generic, at no cost to performance, to allow it to be extended in packages outside of base, as I indicated above.

@tkelman
Copy link

tkelman commented Jun 8, 2015

I'm looking at the two versions of the code, and this doesn't look more generic to me. It looks like you added extra encoding inputs, and pasted the UTF8 version of the code into the same method instead of having one UTF8 method and one non-UTF8 method. The latter is fine for base, especially since it doesn't need to add any extra code first. Generality is not about reducing the method count, it's about doing more with less - here you've just moved code around, I don't see the benefit.

@ScottPJones
Copy link
Collaborator Author

You're missing it then.
This version can accept not just a Vector, but also AbstractVector and AbstractArray of different CodeUnit sizes, as well as an AbstractString. In less code, it handles many more cases. That seems a benefit to me...

@tkelman
Copy link

tkelman commented Jun 8, 2015

I did see that part, which smells like a feature - but so far not a use case for it. This is a good place to work on that, to come up with a convincing case.

@ScottPJones
Copy link
Collaborator Author

This allows it to handle things like a repeated string or substring efficiently.

@ScottPJones
Copy link
Collaborator Author

OK, I've fixed up the comments.
The added flexibility of this approach will be needed to fix some of the issues such as: #11463, #11501, #11502, which is why I want to replace the code in #11575 with this code.


using Encodings

export check_string
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's just call it check; I'm a big fan of avoiding _ where possible and letting the arguments play apart of indicating what a method does, i.e. check(e::Encoding, dat) == I'm checking the encoding/validity of dat.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I worry though that check might be too general a name... this is really just to check string encodings, stored in various ways...

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally in the future, we'd have a Strings module in Base and we could just have Strings.check() and leave it unexported.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I saw a PR about making a Strings module in Base... Although I have some problems with that (not the concept, but maybe the implementation [mainly because I've got a branch where I'm experimenting with moving a lot of the stuff in strings.jl into separate files, which just have the most basic string functionality, just enough to be able to support """ strings, for documentation, earlier in the build])
It's not that I don't want to export it, I think it should be, because people could extend it for many other string encodings.

@ScottPJones
Copy link
Collaborator Author

@quinnj Fine by me... I'd planned on asking if the name should be more generic as well...
Do you get my points of why I think this extensible version should replace the one I did earlier (before your nice Strings.jl package here)? This seems incredibly much cleaner and powerful to me...

ch, pos = next(dat, pos)
totalchar += 1
if ch > 0x7f
if E <: UTF8
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any time I find myself doing if X <: Y, it screams for use of dispatch. I think this is probably why @tkelman isn't super impressed by this vs. JuliaLang/julia#11575, since you've basically just added this if statement here and combined the two methods. A better solution would be to find a way to further split out the internals of check that could be dispatched on according to Encoding. This leads to nice hierarchy of methods with the most generic at the top level, and Encoding-specific methods at the lowest and hopefully smallest level.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I put it as one method, because reviewers had previously asked to do just that... (the back and forth makes my head hurt!) 😀

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not saying there should be two separate check methods, I'm saying that the if E <: UTF8 is an immediate flag to me that whatever is within that if block should actually be pulled out as a method check_internal(::Type{UTF8}, ...), and it looks like in this case, a generic fallback for check_internal{T}(::Type{T}, ...). That becomes more extensible and general when another encoding doesn't have to reimplement (i.e. copy) all the other code from check, they just have to implement check_internal. This is the beauty of multiple dispatch + types as values in Julia.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that sounds like a very good idea, I'll try to experiment with how to do that, and make sure it is still efficient...

@quinnj
Copy link
Owner

quinnj commented Jun 8, 2015

@ScottPJones, here's something to think about.

I really kind of hate how much string code we have in Base that interacts with Vector{UInt8}, Vector{UInt16}, etc. It's probably the most used and tampered with type-internals of any type in Base. Ideally, we could define a very minimal set of methods that would need to interact with stored bytes, and abstract everything else out.

Indeed, the new String{E} type in this package, regardless of encoding, points to raw allocated bytes as opposed to a mix of UInt16, UInt32, etc.

What are your thoughts on consolidating all methods to operating on UInt8s and what would be a minimal abstraction to interact with those bytes?

@ScottPJones
Copy link
Collaborator Author

I'm not sure how that would be all that good... I would think you'd want the compiler to know the size of the code units (1, 2, or 4 bytes), typically with aligned access to 2 or 4 byte values...
Also important if you want to vectorize (at least, with the stuff I used to do by hand)

@ScottPJones
Copy link
Collaborator Author

@quinnj I've been trying to adapt the checkstring code to not just use Encodings, but also to directly use the String types.
What I want to know is what you'd intended for:

  1. logical access to the characters of a string, regardless of the underlying encoding (i.e. like AbstractString, where you use start(), endof(), and next() to go through the characters, one at a time (not through the underlying code units).
  2. logical access to the code units of a string, i.e. UInt8, UInt16, or UInt32.
    At this level, I would hope to have any endian issues already dealt with,
    so that only the encoding issues (like, 1-4 code units for UTF8, or 1 or 2 code units for UTF16),
    need to be dealt with.
    Maybe the start(), endof() are always in terms of bytes? Could there be a nextcodeunit(), like
    next()?
  3. access to the low level storage representation, i.e. array of UInt8...
    This would be used for things like you've already done, like fast == using memcmp() if types are consistent.

I was also wondering, what do you think about String having a "validated" bit?

@ScottPJones
Copy link
Collaborator Author

Bump @quinnj Have you had a chance to think about any of the questions I raised this weekend? (I realize you've been very busy with mmap, which is a great thing also)

@quinnj
Copy link
Owner

quinnj commented Jun 16, 2015

Hey @ScottPJones, yeah, I've been busy with getting stuff ready for JuliaCon, so I trying to refrain from doing much more work on this (hopefully we can have a larger discussion or working session at JuliaCon on this! Maybe on Wednesday during the Hackathon).

1&2) I actually took some time to finally soak in the long, but where-much-progress-was-made JuliaLang/julia#9297 and I think things finally started to settle there. I.e. getindex(s::String, i) returns the i-th code point/character. There was also support added for doing something fancy like "hey there"[1+ 1byte] and "hey there"[1 + 1grapheme]. Those are maybe a little fancy, as opposed to just having getcodeunit(s::String, i) and getgrapheme(s::String, i) methods.

So far, I've only done getindex and getcodeunit for DirectIndexedEncoding strings since they're easy 😛.

  1. My one worry with providing easy access to the Array of UInt8 is that it will get overused and we're back to the ugly situation of having an explosion of string methods operating on Vector{UInt8}. This is as opposed to only providing ways to index characters/code units that provide lower-level access, but avoid turning a string into a full blown Array type, which has led to some of the performance penalties we have with current Base strings. I'm sure it wouldn't be terrible to have a bytes(s::String) method that gave you the raw bytes, but I'd love to explore other ways of doing whatever you're trying to do with those bytes.

For the validated bit, I'm wondering how it will get used. I imagine you could have the normal string constructors do validation on construction, with "unsafe" methods that would allow direct construction without validation (if you know you're getting valid code points from somewhere) for performance. But then how does the bit get used later? Can certain operations only be done on a validated string? Do you "show" them differently?

@ScottPJones
Copy link
Collaborator Author

Right, about 3) I'd really want that sort of thing only accessible from the Strings module.
Unfortunately (and this is one of my biggest complaints about julia, from a software engineering viewpoint), julia gives no way of protecting your implementation, so people are always poking around in the private bits (look at all the places in Base and in packages that look at .data of a string!)
The software engineering issues may require drunken discussion at the Muddy Charles sometime during JuliaCon, which may end up with me getting tossed right into the very muddy Charles!

@ScottPJones
Copy link
Collaborator Author

About a validated bit... I'm not sure about it, I'm just bouncing ideas around... there are so many optimizations possible when you know strings are validated, but if you also want to support read-only access to strings that you just point to in memory (that are in the middle of a record you got from ZMQ, or from Java, or ODBC, etc.), you might be able to validate them, and then set the bit, to avoid the overhead of calling checkstring. You actually might need 3 states, come to think of it... unknown, valid, invalid,
or maybe better to store a whole set of flags, like those returned from checkstring, which would also help a lot when doing conversions (maybe need to store some of the returned sizes as well... but we'd need to be very careful... we don't want the size of the String object to get too big... (I don't currently know what the allocation sizes are in Julia, with the boxing etc.). Also, for short strings, at some point, we'd really want to have these strings integrated into Base, using ideas like Stefan's bytevec.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants