-
Notifications
You must be signed in to change notification settings - Fork 921
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add filter to DataExtractor that removes some of control chars #886
Conversation
Is this related to #570? |
@rocket-turtle, correct, I see random spec failures, though. 856.23 failed here https://travis-ci.org/sunspot/sunspot/builds/347727961 |
Solr itself can handle these chars perfectly, but the XML transport can't, so there is no way to fix this at the Solr side using FilterFactory or anything else because it never reaches there. Solr fails on parsing the received XML that contains any of the characters I've blacklisted. |
@heaven please rebase your branch against master. Lets make sure that tests are actually passing |
@serggl, the branch is in sync with the master. I also replaced the list of characters with a regular expression, which should be more efficient. |
@heaven I wonder if we could measure index performance somehow: with and without your patch |
@serggl There will be as much overhead as a simple regular expression could cause. Not sure how to make it more efficient, perhaps we can check if a string contains one of those blacklisted characters before running We have this in production and everything seems good so far, errors also have gone from the log. We run indexing from a background process so I can't say if it became slower. |
one more thing that I did not noticed before is that there are no specs that prove that this fix does actually work. Can you please add some? |
…ML transport doesn't support. Fixes XML parser error in Solr org.apache.solr.common.SolrException: Illegal character ((CTRL-CHAR, code 11)) Relevant issues: sunspot#570
…ld be more efficient on larger strings.
@serggl hi, I just added some tests. |
I spent a little bit of time looking for ways to escape these control characters since Solr itself is fine with them. It looks like that's only allowed in XML 1.1, is inconsistently supported, and would probably impose more of a performance impact than just stripping them out. I'm good with this PR. It also doesn't look like there is a valid escape sequence for a null byte even in XML 1.1 so there would still be something to remove. |
@mlh758 are you good merging this? |
I am, yes. |
XML transport doesn't support some of the control chars.
Fixes XML parser error in Solr: