Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Capi capture names #282

Merged
merged 1 commit into from Sep 27, 2016
Merged

Conversation

davidblewett
Copy link
Contributor

Refs #279.

This isn't ready, but I need some help. I'm stumped by how Rust is handling strings. In my tests with these changes, I see this when using the API:

Entering rure_iter_capture_names_next
Iterator last_index: 0
capture_names length: 1
last: 1
Selected name: last
CString cast: 'last'
Entering rure_iter_capture_names_next
Iterator last_index: 1
capture_names length: 1
> /home/dblewett/.virtualenvs/rure/site-packages/rure/regex.py(172)find_captures()
-> match = ffi.new('rure_match *')
(Pdb) all_names
['l          ']

I'm not sure if this implementation is correct and the Python FFI is not working, or if the implementation is flawed. What I get back from each call to rure_iter_next_captures is either an empty string (when using ffi.new('char[]', [])), a single character (when using ffi.new('char[]', '')), or a single character followed by padding (when using ffi.new('char[]', ' ' * 20)).

@BurntSushi
Copy link
Member

Hmm, it's hard to tell what exactly is going wrong... Here's my advice:

  • Forget about Python for the moment and write a C unit test. There are many existing C unit tests here: https://github.com/rust-lang-nursery/regex/blob/master/regex-capi/ctest/test.c (When you define a new one, make sure to add it to the main function at the bottom.)
  • The C iterator should precisely match the semantics of the Rust CaptureNames iterator. This means yielding an iteration even for unnamed capture groups. You could return true for those captures, but ensure the string given back to the caller is zero-length.
  • You should be able to embed the CaptureNames Rust iterator type directly into your IterCaptureNames type (perhaps even using CaptureNames directly). That way, you don't have to re-implement the iterator.
  • The API generally uses length based strings instead of C strings. C strings are undoubtedly more convenient though, and for this particular use case, I think it's probably the right choice.

Sorry I can't diagnose your precise problem, but I'd attack this by ignoring Python and trying to make the code a bit simpler.

Thank you for working on this!

@davidblewett
Copy link
Contributor Author

Okay, point taken about not using Python's cffi. I changed to calling the CaptureNames iterator, and added a unit test. What I don't get now is that the test fails on the 2nd invocation of rure_iter_capture_names_next (when trying to print the string returned).

kell% LD_LIBRARY_PATH="../target/debug:$LD_LIBRARY_PATH" ./test
PASSED: test_is_match
PASSED: test_shortest_match
PASSED: test_find
snowman
PASSED: test_captures
PASSED: test_iter
Entering rure_iter_capture_names_next
Iterator last_index: 0
capture_names length: 3
month: 2
day: 3
year: 1
Selected name:
CString cast: ''
[test_iter_capture_names] Received name '(null)'
Entering rure_iter_capture_names_next
Iterator last_index: 1
capture_names length: 3
month: 2
day: 3
year: 1
Selected name: year
CString cast: 'year'
[test_iter_capture_names] Third name? 'True'
zsh: segmentation fault (core dumped)  LD_LIBRARY_PATH="../target/debug:$LD_LIBRARY_PATH" ./test

@davidblewett
Copy link
Contributor Author

davidblewett commented Sep 13, 2016

@BurntSushi : I think I'm finally grokking the lifetime aspect of this change. However, I think to fully support it is going to require more changes than I'm comfortable making:

  • The ffi_fun macro doesn't appear to support the lifetime specifier:
   Compiling rure v0.1.1 (file:///home/dblewett/src/regex/regex-capi)
src/rure.rs:237:35: 237:36 error: no rules expected the token `<`
src/rure.rs:237     fn rure_iter_capture_names_new<'a>(
                                                  ^
  • I tried just manually specifying the usual no_mangle, pub extern dance, but got this compiler warning:
   Compiling rure v0.1.1 (file:///home/dblewett/src/regex/regex-capi)
src/rure.rs:238:5: 247:6 warning: generic functions must be mangled, #[warn(no_mangle_generic_items)] on by default
src/rure.rs:238     pub extern fn rure_iter_capture_names_new<'a>(

This change does not fix the segfault reported in #282 (comment).

@BurntSushi
Copy link
Member

That warning is there for good reason. Lifetime parameters are type parameters, and you can't expose generics in a C ABI. You have to remove it. (You might do it by using bytes::CaptureNames<'static>, which will require a transmute when you build it.)

I'm not sure about your other problems. I'll need to have a closer look for that, which I can do later today thanks to your C test. :-)

@BurntSushi
Copy link
Member

Just to say a bit more about 'static... When it comes to exposing a C ABI, we completely lose all of the guarantees enforced by the Rust compiler, which means it is now up to the C programmer to maintain proper lifetimes. For example, one of the first things I looked for in your docs was something like this (and was very pleased to see it):

An rure_iter_capture_names value may not outlive its corresponding rure, and should be freed before its corresponding rure is freed.

This is precisely what that 'a lifetime enforced at compile time inside of Rust. By forcing it to be 'static, we're explicitly throwing away all of Rust's guarantees and putting them on the C programmer.

(Technically, I don't think an rure_iter_capture_names needs to actually be freed before its corresponding rure, but it certainly can't be used after its corresponding rure is freed.)

Result::Err(err) => return false
};
println!("CString cast: '{}'", cs.to_str().unwrap());
*capture_name = *cs.into_raw();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the problem is here; the CString is getting freed when the unsafe block completes, thus the C unit test triggers a segfault trying to read the memory (similar to https://users.rust-lang.org/t/correct-way-to-implement-a-function-which-returns-a-c-string/315/3 ). I think the solution is to use a Box, but I'm unsure of the details of how to do that.

@davidblewett
Copy link
Contributor Author

@BurntSushi : Figured the problem out; it all boiled down to 4702e69 it appears. The C unit test now passes correctly. Will test in Python's cffi next.

@davidblewett
Copy link
Contributor Author

I've rebased my rure-python branch on top of this, and it's working great. https://github.com/davidblewett/regex/tree/rure-python

@davidblewett
Copy link
Contributor Author

@BurntSushi : Would like to get this merged; it appears everything looks in place. Tests are passing, and I've cleaned up the debug statements I had in from earlier. Tried to condense the code back down so that it is easier to follow as well.

I have a functioning Python binding I would like to issue a PR for that is based on this.

@BurntSushi
Copy link
Member

@davidblewett This all looks good. Would you mind squashing this down to one commit and rebase it on to current master with a descriptive commit message? After that it should be good to go. Thanks so much for doing this! Very nice work. :-)

I have a functioning Python binding I would like to issue a PR for that is based on this.

A PR to this repo? Or somewhere else?

@davidblewett
Copy link
Contributor Author

Yeah, I can squash it. Will try to do that tomorrow.

The branch I've been working on Python support is here: davidblewett/regex@capi-capture-names...davidblewett:rure-python .

@BurntSushi
Copy link
Member

To be clear, I'm not sure I'm comfortable with adding other language bindings. Maintaining a canonical C API is one thing, but opening the door to every other language doesn't seem like a good idea. :(

* Add new `rure_iter_capture_names` struct
  - Opaque pointer encapsulates access to:
  - Underyling Rust iterator
  - Each capture group name CString
* Add functions for instantiating the iterator and processing:
  - `rure_iter_capture_names_new`
  - `rure_iter_capture_names_next`
  - `rure_iter_capture_names_free`
* Track CString objects handed out, and free them when called.
* Add unit test for new functions
@davidblewett
Copy link
Contributor Author

@BurntSushi : went ahead and squashed the commits to a single and updated the PR.

@davidblewett
Copy link
Contributor Author

davidblewett commented Sep 26, 2016

To be clear, I'm not sure I'm comfortable with adding other language bindings.

Yeah, I can understand that. I'm trying to find the best route to do this ( I asked on the web forum awhile back: https://users.rust-lang.org/t/binding-regex-from-python/7253 ).

The problem that I'm running into is that I couldn't figure out how to have a simple "wrapper" crate of the rure crate. Every time I tried, the C functions always returned garbage (see https://github.com/davidblewett/rure-python/blob/master/rure_wrapper/src/lib.rs ). Was trying to do something similar to the FST bindings here: https://github.com/jbaiter/python-rust-fst .

By having it inside the rure crate itself, it simplified a lot of the Python packaging side. I suppose what I can do is instead of trying to use the rust_setuptools / RustDistribution stuff to try to automate building the crate, I can just dictate that you need to drop the rure static library in a given directory and we'll build the wheel based on that.

@BurntSushi BurntSushi merged commit 51f3194 into rust-lang:master Sep 27, 2016
@BurntSushi
Copy link
Member

All right, all set! Thanks so much for your patience and pushing through this.

I'm not sure what the in-crate/out-crate issue is, but if you open a new issue with more details I'd be happy to help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants