Use u8 and [u8], not char and str for printing methods on Console #161

Closed
wants to merge 1 commit

Conversation

@ftxqxd commented Jun 4, 2015

Console’s printing methods don’t work with Unicode strings or characters, so it doesn’t make sense to use the char (Unicode scalar value) and str (UTF-8 string) types. This changes these methods to use u8 and [u8] instead. The ergonomic impact of this is minimal (although it breaks just about all code, unfortunately), due to the existence of byte string and character literals (b"foo" and b'c'). Additionally, this makes it possible to specify special characters directly inside byte strings: for example, printing b"90\xf8" (0xf8 is chars::GRADE) now displays 90°, whereas it previously displayed 90├© because strings are UTF-8.
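
To illustrate the encoding point (plain Rust, nothing tcod-specific): a byte string gives the single glyph index 0xf8, while a UTF-8 string encodes ° as two bytes, which is why the old behaviour produced two wrong glyphs.

fn main() {
    // Byte string: \xf8 is a single byte (chars::GRADE in libtcod's layout).
    let bytes: &[u8] = b"90\xf8";
    assert_eq!(bytes, &[b'9', b'0', 0xf8]);

    // UTF-8 string: '°' (U+00B0) is encoded as the two bytes 0xc2 0xb0, which a
    // byte-oriented printing routine renders as two unrelated glyphs.
    let utf8 = "90°";
    assert_eq!(utf8.as_bytes(), &[b'9', b'0', 0xc2, 0xb0]);
}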

@tomassedovic (Owner)

The ergonomic impact seems much higher to me, but yeah, this is a problem. I absolutely agree that we should allow b'c' and b"foo", but I'd like to keep the ability to pass in normal chars and strings as well (with the understanding that you mustn't pass in non-ASCII values).

What do you think of accepting something like Into<u8> and Into<[u8]>, potentially with the char and str implementations calling std::ascii::AsciiExt::is_ascii during the conversion?

(we'd have to define a different trait because the types and the Into trait are in a different crate)

That conversion would have a performance hit, but you'd get a runtime error and you could always make it fast by using the ASCII types directly. I.e. easy for people getting started but not costing folks who know what they're doing.
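
For illustration, a minimal sketch of what such a conversion trait could look like (the trait and method names here are invented, not existing tcod-rs API):

// Hypothetical conversion trait, sketched for illustration only.
pub trait ToTcodChar {
    fn to_tcod_char(&self) -> u8;
}

impl ToTcodChar for u8 {
    fn to_tcod_char(&self) -> u8 {
        // Already a libtcod glyph index; pass it straight through.
        *self
    }
}

impl ToTcodChar for char {
    fn to_tcod_char(&self) -> u8 {
        // Runtime check: only plain ASCII maps 1:1 onto libtcod glyphs.
        assert!(self.is_ascii(), "non-ASCII char passed to a printing method");
        *self as u8
    }
}

The string side would follow the same pattern for str and [u8], checking is_ascii and handing over the raw bytes.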

@zsparal (Collaborator) commented Jun 4, 2015

Since libtcod actually does support Unicode printing functions (docs), we could statically dispatch the printing based on what argument it gets. I can't test this right now, but something like this should work (the mutable trait object syntax might be incorrect; I can't remember if you need a second &):

pub trait PrintString { 
    fn print(&self, con: &mut Console);
}

impl PrintString for [u8] {
    fn print(&self, con: &mut Console) {
        // Print ASCII using the ffi
    }
}

impl PrintString for str {
    fn print(&self, con: &mut Console) {
        // Print Unicode using the ffi
    }
}

impl PrintString for String {
    fn print(&self, con: &mut Console) {
        // Print Unicode using the ffi
    }
}

Using this, the definition for the actual print function would look something like this:

fn print<T>(&mut self, x: i32, y: i32, text: &T) where T: PrintString + ?Sized {
    assert!(x >= 0 && y >= 0);
    text.print(self);
}

This could also be repeated for char and u8.

@ftxqxd (Author) commented Jun 4, 2015

@tomassedovic Yeah, I was just thinking about doing that. Potentially (perhaps later in a different PR, though) the char/str → u8/[u8] conversion implementation could also manually map specific Unicode glyphs to their equivalents in libtcod’s character set, and only if no such equivalent exists would it throw a runtime error.
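
Purely to illustrate that mapping idea (the function name is made up; 0xf8 = chars::GRADE comes from the opening post):

// Sketch: map a char to a libtcod glyph index, returning None when no
// equivalent exists so the caller can decide whether that is a runtime error.
fn char_to_glyph(c: char) -> Option<u8> {
    match c {
        c if c.is_ascii() => Some(c as u8),
        '°' => Some(0xf8), // chars::GRADE in the default libtcod layout
        // ...further glyph mappings would go here...
        _ => None,
    }
}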

One question, though: the ASCII control characters display differently in libtcod’s character set. These characters could easily be supported using char/&str, because they have the same representation as a C/byte string, but that’s potentially confusing, because they don’t display the same in Unicode, which is what char and &str are supposed to represent. So should a char/&str → u8/[u8] conversion method allow control characters?

@Gustorn I did try to use the _utf functions from the FFI, but they just didn’t work in my testing (e.g., é displayed as ú), and I had to manually convert &str to UTF-32 anyway, which is an O(n) operation.

@zsparal (Collaborator) commented Jun 4, 2015

@P1start Does libtcod really expect UTF-32 strings? That's a bit curious... Also, are you sure you were using a Unicode font for the Root console?

@ftxqxd (Author) commented Jun 4, 2015

@Gustorn Are Unicode fonts available for libtcod? I was under the impression that only 256-character terminal.png-like fonts were supported. And I’m pretty sure they’re UTF-32 because the C API expects const wchar_t *, which is equivalent to *const i32 in Rust (wchar_t is 32 bits, at least on my platform).

Also, I just realised that a static-dispatch-based API (fn print<T: AsTcodStr> or whatever) isn’t going to work, since that would make Console not object safe, which is a big deal. You could use dynamic dispatch (fn set_char(x: i32, y: i32, glyph: &ToTcodStr) or something similar), but then you’d have to box/reference every string or character, making the ergonomic impact even worse, probably.
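
A minimal illustration of the object-safety problem, with a made-up trait name:

// A trait with a generic method has no single vtable entry for that method,
// so the compiler refuses to build a trait object for it.
trait NotObjectSafe {
    fn print<T: AsRef<str>>(&mut self, text: T);
}

// fn draw_all(consoles: &mut [Box<NotObjectSafe>]) {}
// ^ error: the trait `NotObjectSafe` cannot be made into an object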

@zsparal (Collaborator) commented Jun 4, 2015

Here's a forum topic about an (apparently) full Unicode bitmap font. As for wchar_t and UTF: to hold a decoded Unicode code point you need a type of at least 21 bits (32 in practice), since code points range up to U+10FFFF; in UTF-8 a single code point takes 1–4 bytes (the original spec allowed up to 6, but RFC 3629 caps it at 4). As a side note, the libtcod implementation is actually not that great here, since wchar_t is compiler-specific and can be as small as 8 bits...
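
A quick illustration of the byte counts involved (plain Rust, nothing tcod-specific):

fn main() {
    // UTF-8 lengths for a few code points (1–4 bytes per RFC 3629).
    assert_eq!('A'.len_utf8(), 1);  // U+0041
    assert_eq!('°'.len_utf8(), 2);  // U+00B0
    assert_eq!('€'.len_utf8(), 3);  // U+20AC
    assert_eq!('𝄞'.len_utf8(), 4);  // U+1D11E
    // A decoded code point itself fits comfortably in 32 bits:
    assert_eq!('𝄞' as u32, 0x1D11E);
}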

Object safety is actually a really good point, one I hadn't considered. To be completely honest, I haven't really seen a use case for trait objects here, so object safety doesn't seem like a terribly big deal to me: the only function where Console types are actually interchangeable is blit; other than that I can't see why you'd store a Vec<Box<Console>>, for example.

@ftxqxd (Author) commented Jun 4, 2015

@Gustorn Looks like that ‘only’ supports the Basic Multilingual Plane, but I now see that you can indeed support more than just 256 characters. I guess I now know how I was using the _utf functions incorrectly: the only difference is that they support 2^32 characters instead of just 256; they don’t necessarily map to Unicode values unless your font file is specifically organised that way.

One of the first things I wrote with tcod involved using Console as a trait object, in order to make sure another trait was object safe. It went something like this:

trait Draw {
    fn draw(&self, con: &mut Console);
}

And then there was going to be a Vec<Box<Draw>> somewhere which would be iterated over to draw every drawable object, but I never really got that far. It’s not a super compelling example, but it’s a good idea anyway to keep traits object safe whenever possible.

@zsparal (Collaborator) commented Jun 4, 2015

Yeah, unfortunately that is a valid example, even though I would argue that drawing on Offscreen consoles would be good enough: it makes very simple prototypes harder to write (no direct drawing to Root), but 90% of the time you would draw on Offscreen consoles and blit them to Root anyway. These are the times when I actually miss C++: a templated print wouldn't cause any problems there.

I still think that using u8 as the default is not the greatest idea: it has a significant ergonomic impact (either byte literals or as_bytes (?)), and it's much more natural to work with String and &str. Another option would be to get rid of every function inside the trait (except for con) and define everything as free functions. Since there's no actual UFCS in Rust, this would be pretty inconvenient (even though it would mirror the C API quite faithfully):

root.print(args...); 
// Would become
console::print(root, args...);

@zsparal (Collaborator) commented Jun 4, 2015

So here's some testing I've done with object safety, and I'm happy to say we can actually have the generic print and keep Console object safe. Here's the link to my little test: commit. Unfortunately I don't really have time today, but I think I'll implement it tomorrow.
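
The linked commit isn't reproduced here; the following is just one standard way a generic print can coexist with object safety, using a where Self: Sized bound so the generic method simply isn't available on trait objects (names are illustrative):

trait Console {
    // Object-safe: takes only concrete types, so it gets a vtable entry.
    fn put_char(&mut self, x: i32, y: i32, glyph: u8);

    // Generic, but opted out of trait objects via `Self: Sized`;
    // callers with a concrete console type can still use it.
    fn print<T: AsRef<str>>(&mut self, x: i32, y: i32, text: T)
    where
        Self: Sized,
    {
        for (i, b) in text.as_ref().bytes().enumerate() {
            self.put_char(x + i as i32, y, b);
        }
    }
}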

@tomassedovic (Owner)

I'm sorry, but I've had little time to participate today and I'll be completely off for the next few days. If you come up with something that doesn't require the u8 and [u8] literals, feel free to merge it.

I'll be back on Monday so if it doesn't come in by then, I'll read through this thread again and get this fixed one way or another.

Thanks both of you.

@zsparal (Collaborator) commented Jun 12, 2015

Proposing to close this since #163 was merged.

@tomassedovic (Owner)

Agreed, the unicode concerns should be addressed by #163. Note that we still don't use b'@' and b"foo" directly, but that can be addressed in another pull request.

Thanks @P1start for starting this! If you feel we should re-open this, feel free to comment here.

@zsparal (Collaborator) commented Jun 12, 2015

I'll reopen this since I did a bit of research regarding the literals.

After a bit of experimentation I'm not sure we should accept ASCII literals, and here's why. With the current implementation we choose the printing function based on Rust's is_ascii function, which only returns true if the given characters are part of the original (non-extended) ASCII table. This means that for the example in the opening post (b"90\xf8"), is_ascii would return false, which would result in us trying to use the UTF-8 printing function with an invalid byte (the degree character is represented by two bytes in UTF-8). This is most certainly not the behaviour one would expect.
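
The failure mode described above, in miniature (plain checks, no tcod calls; 0xf8 is chars::GRADE from the opening post):

fn main() {
    let bytes: &[u8] = b"90\xf8";
    // The slice fails the ASCII check...
    assert!(!bytes.is_ascii());
    // ...and it isn't valid UTF-8 either, so routing it to the UTF printing
    // function would hand libtcod an invalid byte sequence.
    assert!(std::str::from_utf8(bytes).is_err());
}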

Now the current runtime dispatch could be combined with a custom trait, but I'm not sure introducing arbitrary rules for the different string types is such a good idea. Let's say we have a trait TcodString that controls whether a value should be converted to UTF-8 or not. A possible implementation for the different string (or string-like) types:

  • &str -> convert if not is_ascii
  • String -> convert if not is_ascii
  • &[u8] -> never convert

This is probably close to the behaviour people would expect: since libtcod doesn't really care about encoding, using ASCII literals, even from the extended ASCII table, should give correct results. The two regular string types are at least already guaranteed to be valid UTF-8, but we can skip the conversion when they are in the original ASCII range (this is what we're doing now).

I can probably implement it in a way that still accepts user-defined string types, provided they implement AsRef<str>. Without negative trait bounds, also accepting AsRef<[u8]> is impossible, I think (it would have to be a concrete implementation for &[u8]).
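
A rough sketch of the rules above, with invented trait and method names (not the eventual tcod-rs API):

trait TcodString {
    /// Returns the raw bytes if they can go to the plain printing function,
    /// or None if the UTF printing path should be used instead.
    fn ascii_bytes(&self) -> Option<&[u8]>;
}

impl TcodString for str {
    fn ascii_bytes(&self) -> Option<&[u8]> {
        // &str: use the UTF path unless it is plain ASCII.
        if self.is_ascii() { Some(self.as_bytes()) } else { None }
    }
}

impl TcodString for String {
    fn ascii_bytes(&self) -> Option<&[u8]> {
        // String: same rule as &str.
        self.as_str().ascii_bytes()
    }
}

impl TcodString for [u8] {
    fn ascii_bytes(&self) -> Option<&[u8]> {
        // &[u8]: never convert; the bytes are libtcod glyph indices as-is,
        // extended-ASCII values like 0xf8 included.
        Some(self)
    }
}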
