Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add filter to DataExtractor that removes some of control chars #886

Merged
merged 4 commits into from
Jun 14, 2019

Conversation

heaven
Copy link
Contributor

@heaven heaven commented Mar 1, 2018

XML transport doesn't support some of the control chars.

Fixes XML parser error in Solr:

org.apache.solr.common.SolrException: Illegal character ((CTRL-CHAR, code
11))

@rocket-turtle
Copy link
Contributor

Is this related to #570?

@heaven
Copy link
Contributor Author

heaven commented Mar 5, 2018

@rocket-turtle, correct, I see random spec failures, though.

856.23 failed here https://travis-ci.org/sunspot/sunspot/builds/347727961
but succeed here https://travis-ci.org/sunspot/sunspot/builds/347725061

@heaven
Copy link
Contributor Author

heaven commented Mar 5, 2018

Solr itself can handle these chars perfectly, but the XML transport can't, so there is no way to fix this at the Solr side using FilterFactory or anything else because it never reaches there. Solr fails on parsing the received XML that contains any of the characters I've blacklisted.

@serggl
Copy link
Collaborator

serggl commented Mar 22, 2018

@heaven please rebase your branch against master. Lets make sure that tests are actually passing

@heaven
Copy link
Contributor Author

heaven commented Mar 22, 2018

@serggl, the branch is in sync with the master. I also replaced the list of characters with a regular expression, which should be more efficient.

@serggl
Copy link
Collaborator

serggl commented Mar 22, 2018

@heaven I wonder if we could measure index performance somehow: with and without your patch

@heaven
Copy link
Contributor Author

heaven commented Mar 22, 2018

@serggl There will be as much overhead as a simple regular expression could cause. Not sure how to make it more efficient, perhaps we can check if a string contains one of those blacklisted characters before running gsub, but I think ruby may do this optimization for us.

We have this in production and everything seems good so far, errors also have gone from the log. We run indexing from a background process so I can't say if it became slower.

@serggl
Copy link
Collaborator

serggl commented Mar 23, 2018

one more thing that I did not noticed before is that there are no specs that prove that this fix does actually work. Can you please add some?

…ML transport doesn't support.

Fixes XML parser error in Solr
org.apache.solr.common.SolrException: Illegal character ((CTRL-CHAR, code
11))

Relevant issues: sunspot#570
@heaven
Copy link
Contributor Author

heaven commented Jun 10, 2019

@serggl hi, I just added some tests.

@mlh758
Copy link
Collaborator

mlh758 commented Jun 13, 2019

I spent a little bit of time looking for ways to escape these control characters since Solr itself is fine with them. It looks like that's only allowed in XML 1.1, is inconsistently supported, and would probably impose more of a performance impact than just stripping them out. I'm good with this PR. It also doesn't look like there is a valid escape sequence for a null byte even in XML 1.1 so there would still be something to remove.

@serggl
Copy link
Collaborator

serggl commented Jun 14, 2019

@mlh758 are you good merging this?

@mlh758
Copy link
Collaborator

mlh758 commented Jun 14, 2019

I am, yes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants