Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[READY] Fix IndexError exception from C++ #453

Merged
merged 6 commits into from
Apr 18, 2016

Conversation

micbou
Copy link
Collaborator

@micbou micbou commented Apr 9, 2016

This PR fixes the IndexError exception returned by the FilterAndCandidatesWrap function when a candidate or the query contains non-ascii characters. Its error message depends on the OS:

  • on Windows: invalid bitset<N> position
  • on Linux: bitset::set: __position (which is xxx) >= _Nb (which is 128)
  • on OS X: bitset set argument out of range

It is caused by the LetterBitsetFromString function trying to set a bit to the Bitset object with a character index that is out of range. This is solved by only setting bits for character indices satisfying 0 ≤ i < 128.

We add two tests for the LetterBitsetFromString function: one test to check the lower and upper character bounds and another to check that non-ascii characters are ignored. These tests are failing without this change. For example, on Windows:

[ RUN      ] LetterBitsetFromStringTest.Boundaries
unknown file: error: C++ exception with description "invalid bitset<N> position" thrown in the test body.
[  FAILED  ] LetterBitsetFromStringTest.Boundaries (1 ms)
[ RUN      ] LetterBitsetFromStringTest.IgnoreNonAsciiCharacters
unknown file: error: C++ exception with description "invalid bitset<N> position" thrown in the test body.
[  FAILED  ] LetterBitsetFromStringTest.IgnoreNonAsciiCharacters (0 ms)

Fixes #177 and ycm-core/YouCompleteMe#2103.


This change is Reviewable

@micbou micbou force-pushed the letter-bitset-from-string branch from 3613041 to dc3ecd1 Compare April 9, 2016 02:06
@coveralls
Copy link

Coverage Status

Coverage remained the same at 84.534% when pulling dc3ecd1 on micbou:letter-bitset-from-string into 09f2164 on Valloric:master.

@coveralls
Copy link

Coverage Status

Coverage remained the same at 84.534% when pulling dc3ecd1 on micbou:letter-bitset-from-string into 09f2164 on Valloric:master.

@puremourning
Copy link
Member

Reviewing the code a bit I find a number of other places where the code assumes that each char letter can be used to index into LetterNodeListMap::letters_. I expect that we will have to fix all of them too.

For example in cpp/ycm/LetterNodeListMap.cpp:

std::list< LetterNode * > &LetterNodeListMap::operator[] ( char letter ) {
  int letter_index = IndexForChar( letter );
  std::list< LetterNode * > *list = letters_[ letter_index ];
  ...
}
...
std::list< LetterNode * > *LetterNodeListMap::ListPointerAt( char letter ) {
  return letters_[ IndexForChar( letter ) ];
}
...
bool LetterNodeListMap::HasLetter( char letter ) const {
  return letters_[ IndexForChar( letter ) ] != NULL;
}

according to the docs this could breach the requirement that letter here be < NUM_LETTERS.

I suspect that what is happening is that this is always reading or writing past the end of the array, which could lead to undefined behaviour nuclear war. I could be wrong. Running under valgrind would be a cheap way to test this.


Review status: 0 of 2 files reviewed at latest revision, 2 unresolved discussions.


cpp/ycm/Candidate.cpp, line 69 [r1] (raw file):
char_index should be declared where it is initialised (i.e. on line 71). In general, variables should be initialised when they are declared (e.g. int char_index = 0) but redundant assignments should be avoided too (so int char_index = IndexForChar( letter )). Good style also to declare close to where it is used (within the smallest reasonable scope).


cpp/ycm/Candidate.cpp, line 72 [r1] (raw file):
I personally think that it is more natural to read (and write): if ( char_index >= 0 && char_index < NUM_LETTERS )

There is a sort of old-school fallacy that in order to prevent accidental use of the assignment operator, programmers would (heinously IMO) obfuscate their code with if ( 0 == myInt ), rather than if ( myInt == 0 ).

Anyway, I always feel like the variable (lvalue) should be of the left, because we normally say "is this apple red" not "is red this apple".


Comments from Reviewable

@micbou micbou force-pushed the letter-bitset-from-string branch from dc3ecd1 to 83b91d1 Compare April 9, 2016 13:56
@micbou
Copy link
Collaborator Author

micbou commented Apr 9, 2016

Good catch. I'll try to write tests for these functions.


Reviewed 1 of 2 files at r1.
Review status: 0 of 2 files reviewed at latest revision, 2 unresolved discussions.


cpp/ycm/Candidate.cpp, line 69 [r1] (raw file):
Done.


cpp/ycm/Candidate.cpp, line 72 [r1] (raw file):
I wrote it this way to be closer to the mathematical form: 0 ≤ i < NUM_LETTERS. But let's be consistent and always write the variable to the left.


Comments from Reviewable

@micbou micbou force-pushed the letter-bitset-from-string branch from 83b91d1 to 3ca5e3c Compare April 9, 2016 14:08
@micbou
Copy link
Collaborator Author

micbou commented Apr 9, 2016

Review status: 0 of 2 files reviewed at latest revision, 3 unresolved discussions.


cpp/ycm/Candidate.cpp, line 70 [r3] (raw file):
I changed char_index to letter_index because this name is used in other places.


Comments from Reviewable

@puremourning
Copy link
Member

I think this fixes ycm-core/YouCompleteMe#278 too (along with my upcoming PR for unicode)

@micbou micbou changed the title [READY] Fix IndexError exception from C++ [WIP] Fix IndexError exception from C++ Apr 9, 2016
@micbou
Copy link
Collaborator Author

micbou commented Apr 9, 2016

This fixes the bitset exception, not the filename unicode support.


Reviewed 1 of 2 files at r1, 1 of 1 files at r3.
Review status: all files reviewed at latest revision, 3 unresolved discussions.


Comments from Reviewable

@puremourning
Copy link
Member

Right, the combination of the 2 should fix it, though I think :)


Reviewed 2 of 2 files at r1, 1 of 1 files at r3.
Review status: all files reviewed at latest revision, 3 unresolved discussions.


Comments from Reviewable

@micbou
Copy link
Collaborator Author

micbou commented Apr 9, 2016

I mean, even with our fixes, completing filenames with non-ascii characters will not be possible and I think this is what issue ycm-core/YouCompleteMe#278 is about.


Review status: all files reviewed at latest revision, 3 unresolved discussions.


Comments from Reviewable

@puremourning
Copy link
Member

Well the report it self talks about fixing the bitset exception. If we fixed the filter to support unicode, then the filename completer would also continue to work. The filename completer itself handles unicode fine (at least after my PR) :)


Review status: all files reviewed at latest revision, 1 unresolved discussion.


Comments from Reviewable

@Valloric
Copy link
Member

:lgtm: from me if @puremourning is fine with it too. I think he's now officially the unicode expert as it pertains to ycmd. :)


Review status: all files reviewed at latest revision, 2 unresolved discussions.


cpp/ycm/Candidate.cpp, line 70 [r3] (raw file):
Fine with this, though it looks wierd that we now have letter_index = IndexForChar. Maybe update the function name too?

I'm fine with leaving it the way you have it too.


cpp/ycm/tests/LetterBitsetFromString_test.cpp, line 60 [r3] (raw file):
Maybe jam a chinese letter in there too. They tend to use more that two bytes in UTF-8.


Comments from Reviewable

@coveralls
Copy link

Coverage Status

Coverage remained the same at 84.534% when pulling c07d4c5 on micbou:letter-bitset-from-string into 09f2164 on Valloric:master.

@micbou
Copy link
Collaborator Author

micbou commented Apr 10, 2016

So I added tests for the LetterNode class: one with ascii characters and another with non-ascii characters.

As expected, the non-ascii test was failing by returning out of bounds pointers instead of a NULL pointer. Fixed by checking that the letter index is in range in LetterNodeListMap methods.


Reviewed 2 of 4 files at r5, 4 of 4 files at r6.
Review status: 6 of 7 files reviewed at latest revision, 6 unresolved discussions.


cpp/ycm/Candidate.cpp, line 70 [r3] (raw file):
You want to rename it to IndexForLetter? I don't mind.


cpp/ycm/LetterNodeListMap.cpp, line 30 [r6] (raw file):
@puremourning I wrote the condition as 0 <= i && i < n instead of i >= 0 && i < n because it is done this way in the IsUppercase function, just above.


cpp/ycm/LetterNodeListMap.h, line 35 [r6] (raw file):
Should we inline these functions?


cpp/ycm/LetterNodeListMap.h, line 42 [r6] (raw file):
@Valloric This method is never used. Should we remove it?


cpp/ycm/LetterNodeListMap.h, line 48 [r6] (raw file):
This method too.


cpp/ycm/tests/LetterBitsetFromString_test.cpp, line 60 [r3] (raw file):
is a 3-bytes character and all chinese characters seem to be 3-bytes (ø is 2-bytes). We could use the 𐍈 character: it is 4-bytes. You know what: I'll update the string to uni¢𐍈d€ (¢ is 2-bytes) so that we use 1, 2, 3, and 4-bytes characters in the same string (there are no characters with more than 4-bytes in UTF-8).


Comments from Reviewable

@coveralls
Copy link

Coverage Status

Coverage remained the same at 84.534% when pulling db27875 on micbou:letter-bitset-from-string into 09f2164 on Valloric:master.

@puremourning
Copy link
Member

Reviewed 2 of 4 files at r5, 2 of 4 files at r6, 1 of 1 files at r7.
Review status: all files reviewed at latest revision, 9 unresolved discussions.


cpp/ycm/LetterNodeListMap.cpp, line 71 [r7] (raw file):
I think this leaks memory (who calls the delete for this new?), and it is difficult to come up with a scheme where the return value from this call is non-const and doesn't leak. (one option would be something like static std::list< LetterNode *>() out_of_range_result; return out_of_range_result, but this would lead to strange behaviour if someone used that result and modified it)

As it happens, I think it is reasonable to have a precondition that operator[] takes a value which strictly must be InAsciiRange. We can add an assert( IsInAsciiRange( letter_index ) ) or just throw an exception (as we typically don't test with debug builds).


cpp/ycm/LetterNodeListMap.h, line 35 [r6] (raw file):
I'm not sure how well inlining exported symbols works, TBH. But inlining IsInAsciiRange could be done.


cpp/ycm/LetterNodeListMap.h, line 42 [r6] (raw file):
If we have dead code, I think we should remove it (personally).


cpp/ycm/LetterNodeListMap.h, line 48 [r6] (raw file):
ditto


cpp/ycm/tests/LetterNode_test.cpp, line 30 [r7] (raw file):
I don't think this needs to be on the heap, does it? Even if it does, who deletes it?

OK a small memory leak in our test isn't a big deal, but it means the destructor code (which is still code we should test) isn't run.


cpp/ycm/tests/LetterNode_test.cpp, line 65 [r7] (raw file):
also probably leaks. Can probably just be LetterNode node( "un...." );


Comments from Reviewable

@puremourning
Copy link
Member

Reviewed 1 of 4 files at r6.
Review status: all files reviewed at latest revision, 9 unresolved discussions.


Comments from Reviewable

@Valloric
Copy link
Member

Review status: all files reviewed at latest revision, 10 unresolved discussions.


cpp/ycm/Candidate.cpp, line 70 [r3] (raw file):
Sure, why not.


cpp/ycm/LetterNode.cpp, line 30 [r7] (raw file):
Ha, at least I was honest about it. :D


cpp/ycm/LetterNodeListMap.cpp, line 30 [r6] (raw file):
Nice. The formatting in IsUppercase was intentional; it more clearly shows that letter needs to be between A and Z.

The formatting works here just as well.


cpp/ycm/LetterNodeListMap.cpp, line 71 [r7] (raw file):
This is tricky, I agree. And yes, as written, it leaks.

I'm not against having an assert.


cpp/ycm/LetterNodeListMap.h, line 35 [r6] (raw file):
Let's leave it up to the compiler.


cpp/ycm/LetterNodeListMap.h, line 42 [r6] (raw file):
Sure, kill it.


cpp/ycm/LetterNodeListMap.h, line 48 [r6] (raw file):
Kill it with fire.


cpp/ycm/tests/LetterBitsetFromString_test.cpp, line 60 [r3] (raw file):
Sounds good.


cpp/ycm/tests/LetterNode_test.cpp, line 30 [r7] (raw file):
Leaks are bad in tests too because it means you can't run leak detectors on your test suite.

And I agree, this shouldn't be on the heap.


Comments from Reviewable

Test ASCII boundaries and unicode characters in
LetterBitsetFromString function.
@micbou
Copy link
Collaborator Author

micbou commented Apr 10, 2016

Reviewed 1 of 1 files at r7, 6 of 7 files at r8.
Review status: 7 of 8 files reviewed at latest revision, 7 unresolved discussions.


cpp/ycm/Candidate.cpp, line 70 [r3] (raw file):
Done.


cpp/ycm/LetterNodeListMap.cpp, line 71 [r7] (raw file):
I decided to use the at method from Boost for the [] operator and the ListPointerAt method.


cpp/ycm/LetterNodeListMap.h, line 42 [r6] (raw file):
Done.


cpp/ycm/LetterNodeListMap.h, line 48 [r6] (raw file):
Done.


cpp/ycm/tests/LetterNode_test.cpp, line 30 [r7] (raw file):
Done.


cpp/ycm/tests/LetterNode_test.cpp, line 65 [r7] (raw file):
Done.


Comments from Reviewable

@micbou
Copy link
Collaborator Author

micbou commented Apr 10, 2016

Should we bump the core version for these changes?


Review status: 7 of 8 files reviewed at latest revision, 7 unresolved discussions.


Comments from Reviewable

@coveralls
Copy link

Coverage Status

Coverage remained the same at 84.534% when pulling a281202 on micbou:letter-bitset-from-string into 09f2164 on Valloric:master.

@coveralls
Copy link

Coverage Status

Coverage remained the same at 84.534% when pulling a281202 on micbou:letter-bitset-from-string into 09f2164 on Valloric:master.

@Valloric
Copy link
Member

Yes, core version should be bumped.


Review status: 7 of 8 files reviewed at latest revision, 6 unresolved discussions.


cpp/ycm/LetterNodeListMap.cpp, line 71 [r7] (raw file):
What do we do in higher levels of the stack when the C++ layer throws an exception? We should make sure throwing here works.


Comments from Reviewable

@micbou
Copy link
Collaborator Author

micbou commented Apr 11, 2016

Version bumped.


Reviewed 1 of 1 files at r9, 1 of 1 files at r10.
Review status: all files reviewed at latest revision, 6 unresolved discussions.


cpp/ycm/LetterNodeListMap.cpp, line 71 [r7] (raw file):
This exception will never get to the Python layer because candidates containing non-ascii (nonprintable) characters are discarded. I checked the behavior without discarding such candidates and we get a IndexError: array<>: index out of range exception in Python so this is working as expected.

If we want to be extra cautious, we could add the FilterAndSortCandidates tests from PR #455 to this PR.


Comments from Reviewable

@coveralls
Copy link

Coverage Status

Coverage remained the same at 84.534% when pulling f58c41c on micbou:letter-bitset-from-string into 09f2164 on Valloric:master.

@micbou micbou changed the title [WIP] Fix IndexError exception from C++ [READY] Fix IndexError exception from C++ Apr 11, 2016
@Valloric
Copy link
Member

Awesome stuff!


Reviewed 1 of 4 files at r5, 1 of 1 files at r7, 6 of 7 files at r8, 1 of 1 files at r9, 1 of 1 files at r10.
Review status: all files reviewed at latest revision, 6 unresolved discussions.


cpp/ycm/LetterNodeListMap.cpp, line 71 [r7] (raw file):
Nah, we're good.


Comments from Reviewable

@vheon
Copy link
Contributor

vheon commented Apr 11, 2016

Reviewed 1 of 4 files at r5, 1 of 1 files at r7, 6 of 7 files at r8, 1 of 1 files at r9.
Review status: all files reviewed at latest revision, 6 unresolved discussions.


Comments from Reviewable

@puremourning
Copy link
Member

:lgtm:


Reviewed 6 of 7 files at r8, 1 of 1 files at r9, 1 of 1 files at r10.
Review status: all files reviewed at latest revision, 2 unresolved discussions.


cpp/ycm/LetterNodeListMap.h, line 35 [r6] (raw file):
Not sure the compiler can inline it if the definition is in the TU rather than the header. (Well, it can inline it in the TU, but not elsewhere, which I think is the point of the question). In any case, it is early optimisation.


Comments from Reviewable

@Valloric
Copy link
Member

Thanks for the PR! :)

@homu r+


Review status: all files reviewed at latest revision, all discussions resolved.


Comments from Reviewable

@homu
Copy link
Contributor

homu commented Apr 18, 2016

📌 Commit f58c41c has been approved by Valloric

homu added a commit that referenced this pull request Apr 18, 2016
[READY] Fix IndexError exception from C++

This PR fixes the `IndexError` exception returned by the `FilterAndCandidatesWrap` function when a candidate or the query contains non-ascii characters. Its error message depends on the OS:
 - on Windows: `invalid bitset<N> position`
 - on Linux: `bitset::set: __position (which is xxx) >= _Nb (which is 128)`
 - on OS X: `bitset set argument out of range`

It is caused by the `LetterBitsetFromString` function trying to set a bit to the `Bitset` object with a character index that is out of range. This is solved by only setting bits for character indices satisfying `0 ≤ i < 128`.

We add two tests for the `LetterBitsetFromString` function: one test to check the lower and upper character bounds and another to check that non-ascii characters are ignored. These tests are failing without this change. For example, on Windows:
```
[ RUN      ] LetterBitsetFromStringTest.Boundaries
unknown file: error: C++ exception with description "invalid bitset<N> position" thrown in the test body.
[  FAILED  ] LetterBitsetFromStringTest.Boundaries (1 ms)
[ RUN      ] LetterBitsetFromStringTest.IgnoreNonAsciiCharacters
unknown file: error: C++ exception with description "invalid bitset<N> position" thrown in the test body.
[  FAILED  ] LetterBitsetFromStringTest.IgnoreNonAsciiCharacters (0 ms)
```

Fixes #177 and ycm-core/YouCompleteMe#2103.

<!-- Reviewable:start -->
---
This change is [<img src="https://reviewable.io/review_button.svg" height="35" align="absmiddle" alt="Reviewable"/>](https://reviewable.io/reviews/valloric/ycmd/453)
<!-- Reviewable:end -->
@homu
Copy link
Contributor

homu commented Apr 18, 2016

⌛ Testing commit f58c41c with merge bc32373...

@homu
Copy link
Contributor

homu commented Apr 18, 2016

☀️ Test successful - status

@homu homu merged commit f58c41c into ycm-core:master Apr 18, 2016
homu added a commit that referenced this pull request Apr 21, 2016
[READY] Definitely fix IndexError exception from C++

PR #453 did not completely fix the IndexError issue, but moved it to other parts of the C++ code. The exception is now raised by the [`NodeListForLetter` method in `QueryMatchResult`](https://github.com/Valloric/ycmd/blob/master/cpp/ycm/Candidate.cpp#L96) when the query contains non-ascii characters. Since we don't support this kind of query, we want to deal with it as early as possible: in the `CandidatesForQueryAndType` and `FilterAndSortCandidates` functions. We don't check for non-ascii characters but for nonprintable ones as [we already do for candidates](https://github.com/Valloric/ycmd/blob/master/cpp/ycm/CandidateRepository.cpp#L136).

Added two tests for the `CandidatesForQuery` function: `EmptyCandidatesForUnicode` is failing without this PR but not `EmptyCandidatesForNonPrintable` because nonprintable candidates are ignored.

No tests for `FilterAndSortCandidates` because it will be tested by the Python layer in PR #455.

Core version bumped.

<!-- Reviewable:start -->
---
This change is [<img src="https://reviewable.io/review_button.svg" height="35" align="absmiddle" alt="Reviewable"/>](https://reviewable.io/reviews/valloric/ycmd/463)
<!-- Reviewable:end -->
homu added a commit that referenced this pull request Apr 24, 2016
[READY] Fix issues with multi-byte characters

## Summary

This change introduces more general support for non-ASCII characters in buffers handled by YCMD.

In ycmd's public API, all offsets are byte offsets into the UTF-8 encoded buffers. We also assume (because, we have no other choice) that files stored on disk are also UTF-8 encoded. Internally, almost all of ycmd's functionality operates on unicode strings (python 2 `unicode()` and python 3 `str()` objects, transparently via `future`). Many of the downstream completion engines expect unicode code points as the offsets in their APIs. One special case is the `ycm_core` library (identifier completer and clang completer), which requires instances of the _native_ `str` type. All strings used within the c++ using `boost::python` require passing through `ToCppStringCompatible`

Previously, we were largely just assuming that `code point == byte offset` - i.e. all buffers contained only ASCII characters. This worked up to a point, but more by luck than judgement in a number of places.

## References

In combination with a YCM change and PR #453, I hope this:

- fixes #109
- fixes ycm-core/YouCompleteMe#2096
- fixes ycm-core/YouCompleteMe#2088
- fixes ycm-core/YouCompleteMe#2069
- fixes ycm-core/YouCompleteMe#2066
- fixes ycm-core/YouCompleteMe#1378

## Overview of changes

The changes fall into the following areas:

- Providing access to and conversion to/from code points and byte offsets (`request_wrap.py`)
- Changing certain algorithms/features to work entirely in codepoint space when they are trying to operate on logical 'characters' within the buffer (see known issues for why this isn't perfect, but probably most of the way there)
- Changing the completers to convert between the external (on both sides) and internal representations by using the shortcuts provided in `request_wrap.py`
- Adding tests for each of the completers for both completions and subcommands

## Completer-specific notes

Pretty much all of the completers I tested required some changes:
- clang uses utf-8 and byte offsets, but had some bugs with the `GetDoc` parsing stuff
- OmniSharp speaks codepoint offsets
- Tern speaks codepoint offsets
- JediHTTP speaks codepoint offsets
- tsserver speaks codepoint offsets
- gocode speaks byte offsets
- racer i did not test

## Further work / Known issues

- we act blissfully ignorant of the case where a unicode character consumes multiple code points (such as where there is a modifier after the code point)
- when typing a unicode character, we still get an exception from `bitset` (see #453 for that fix)
- the filtering and sorting system is 100% designed for ASCII only, and it is not in the scope of this PR to change that. Currently after any filtering operation, words containing non-ASCII characters are excluded.
- I did not get round to testing rust using racer
- there are further changes required to YouCompleteMe client (a further PR is coming for that)

<!-- Reviewable:start -->
---
This change is [<img src="https://reviewable.io/review_button.svg" height="35" align="absmiddle" alt="Reviewable"/>](https://reviewable.io/reviews/valloric/ycmd/455)
<!-- Reviewable:end -->
@micbou micbou deleted the letter-bitset-from-string branch May 1, 2016 17:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Error with non-ascii character in file name
6 participants