New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add support of grapheme clusters #520

Merged
merged 17 commits into from May 1, 2018

Conversation

Projects
None yet
6 participants
@jmuk
Copy link
Contributor

jmuk commented Feb 13, 2018

now next/prev_grapheme_offset works correctly rather than
falling back to codepoints.

fixes #298

add support of grapheme clusters
now next/prev_grapheme_offset works correctly rather than
falling back to codepoints.

fixes #298

@googlebot googlebot added the cla: yes label Feb 13, 2018

@cmyr
Copy link
Member

cmyr left a comment

Cool, thanks for this! A few little nits but otherwise looks good.

}

fn is_boundary(s: &String, offset: usize) -> bool {
if let Ok(r) = GraphemeCursor::new(offset, s.len(), true).is_boundary(s, 0) {

This comment has been minimized.

@cmyr

cmyr Feb 14, 2018

Member

you can call .ok() on any Result<T> to convert it to an Option<T>, discarding the error.

let mut cursor = Cursor::new(self, offset);
cursor.is_boundary::<GraphemeMetric>()
}

// graphemes should probably be developed as a cursor-based interface

This comment has been minimized.

@cmyr

cmyr Feb 14, 2018

Member

We can get rid of this comment now. :)

@@ -145,6 +147,51 @@ impl Metric<RopeInfo> for BaseMetric {
}
}

#[derive(Clone, Copy)]
pub struct GraphemeMetric(());

This comment has been minimized.

@cmyr

cmyr Feb 14, 2018

Member

You can declare a unit struct without any braces: pub struct GraphemeMetric;.

@raphlinus

This comment has been minimized.

Copy link
Contributor

raphlinus commented Feb 14, 2018

I'm thrilled to see progress on this, and hate to throw cold water on the idea, but unfortunately trying to treat grapheme clusters as a metric doesn't quite work. The underlying reason is that grapheme cluster counting is fundamentally not a monoid homomorphism, as awesome as it would be if that were the case. Here's how to write a test case that will fail:

Create a string of RIS code points, let's say \u{1F1E6} that doesn't fit in a single leaf node (ie > 1kb). Now test whether the offset at len(first leaf) + 4 bytes is a boundary. The GraphemeMetric will always say no, but that would be wrong if the number of code points in the first leaf is odd. If the number of code points in the first leaf is even, edit the rope (usually taking the substring from offset 4 will work, unless the first leaf is small), so that one or the other would fail.

Because the size of graphemes is not bounded, and because boundaries can in the worst case depend on unbounded amounts of context, the boundary detection must be able to handle bringing in an arbitrary number of leaves as pre-context.

Bottom line, this needs to be implemented directly in {prev,next}_grapheme_offset with some attention to leaf boundaries. It is, sadly, not trivial to test - this is one reason I'd like to see some infrastructure for property testing developed.

Thanks for the attempt, and I'm happy to provide guidance to see it through.

@mqzry

This comment has been minimized.

Copy link
Member

mqzry commented Mar 13, 2018

I am taking a stab at this. Did I understand it correctly that what is needed is:

  • Instantiate a GraphemeCursor
  • create a chunk iterators that walk in reverse and normal order to handle GraphemeIncomplete::PreContext or GraphemeIncomplete::NextChunk results
  • Call the right function on the grapheme cursor for example is_boundary
  • Handle any Err(GraphemeIncomplete) return value properly
@jmuk

This comment has been minimized.

Copy link
Contributor

jmuk commented Mar 13, 2018

Yup, actually I was working on that. Some prototype is working on my local machine although it needs some refinements (sorry for my slow progress).

However, I've found a few issues on unicode-segmentation side, which cause panic. I'll send a PR to them first. Stay tuned.

@jmuk

This comment has been minimized.

Copy link
Contributor

jmuk commented Mar 13, 2018

jmuk added some commits Feb 27, 2018

update Cargo.toml
Currently this points to my local github repository since
some fixes are necessary.
@jmuk

This comment has been minimized.

Copy link
Contributor

jmuk commented Mar 17, 2018

So I pushed my version of the code. Please take a look, if that's okay. I'm fine to hold this until the problem I've seen in unicode-segmentation are fixed.

@jmuk

This comment has been minimized.

Copy link
Contributor

jmuk commented Mar 20, 2018

This PR actually relies on unicode-rs/unicode-segmentation#40 too. I can create workarounds if that turned out unwanted.

jmuk added some commits Mar 27, 2018

@mqzry

This comment has been minimized.

Copy link
Member

mqzry commented Apr 29, 2018

What is the status for this PR? All relevant PRs seem to be merged.

@raphlinus

This comment has been minimized.

Copy link
Contributor

raphlinus commented Apr 29, 2018

Good question. I think unicode-segmentation needs to be published to crates.io. @Manishearth?

@Manishearth

This comment has been minimized.

Copy link

Manishearth commented Apr 29, 2018

@raphlinus

This comment has been minimized.

Copy link
Contributor

raphlinus commented Apr 29, 2018

Thanks @Manishearth! Would you like to resolve the conflicts, @jmuk? If not, we can take care of it.

Incidentally, this will help a lot in dealing with CRLF line endings, right now the cursor can get between the two which causes all kinds of unexpected behavior.

jmuk added some commits Apr 30, 2018

@jmuk

This comment has been minimized.

Copy link
Contributor

jmuk commented Apr 30, 2018

@raphlinus done!

@raphlinus
Copy link
Contributor

raphlinus left a comment

Looks great, thanks, and thanks for your patience.

@@ -363,13 +366,13 @@ impl Rope {

// graphemes should probably be developed as a cursor-based interface

This comment has been minimized.

@raphlinus

raphlinus Apr 30, 2018

Contributor

This comment can be removed, as you've actually done it :)

This comment has been minimized.

@jmuk

jmuk Apr 30, 2018

Contributor

removed!

}
// GraphemeIncomplete::NextChunk error will not happen, since this cursor thinks the end of
// the current leaf as the end of the string. Can a cursor know the total length of
// the tree?

This comment has been minimized.

@raphlinus

raphlinus Apr 30, 2018

Contributor

To answer this question, this should be available at cursor.root.len(). In my first reading, I'm not sure whether this would allow you to simplify the following logic.

This comment has been minimized.

@jmuk

jmuk Apr 30, 2018

Contributor

Okay -- yet cursor.root isn't visible here in rope.rs. I added total_len() to tree.rs, and it helps to simplify the code; at least I think so. PTAL.

[[package]]
name = "unicode-segmentation"
version = "1.2.0"
source = "git+https://github.com/jmuk/unicode-segmentation.git?branch=fix_for_xi#1d7d062b38e6e5299ed0e5750f6890230c6cd0fa"

This comment has been minimized.

@raphlinus

raphlinus Apr 30, 2018

Contributor

I think I'd like to wait for unicode-segmentation to release the fixed version, as (among other things), this would prevent releasing xi-rope as a crate. I know it's been a long time since we've done that, but we're discussing doing a release before long.

let mut c = GraphemeCursor::new(pos, l.len() + leaf_offset, true);
let mut next_boundary = c.next_boundary(&l, leaf_offset);
while let Err(GraphemeIncomplete::PreContext(_)) = next_boundary {
let (pl, poffset) = self.prev_leaf()?;

This comment has been minimized.

@raphlinus

raphlinus Apr 30, 2018

Contributor

We should update the minimum rust version in README, as ? for Option is not supported in 1.20. A quick test shows that 1.22 should still work.

This comment has been minimized.

@jmuk

jmuk Apr 30, 2018

Contributor

It seems the README had been updated already to require Rust 1.22; as 3c53388

assert_eq!(None, a.next_grapheme_offset(17));
}

#[test]

This comment has been minimized.

@raphlinus

raphlinus Apr 30, 2018

Contributor

Excellent, thanks!

jmuk added some commits Apr 30, 2018

@cmyr

cmyr approved these changes May 1, 2018

Copy link
Member

cmyr left a comment

Played around with this, feels good.

@raphlinus raphlinus merged commit 6fd9768 into xi-editor:master May 1, 2018

2 checks passed

cla/google All necessary CLAs are signed
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@raphlinus

This comment has been minimized.

Copy link
Contributor

raphlinus commented May 1, 2018

Many thanks!

@jmuk

This comment has been minimized.

Copy link
Contributor

jmuk commented May 1, 2018

Yay!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment