Add filter to DataExtractor that removes some of control chars #886

heaven · 2018-03-01T11:46:59Z

XML transport doesn't support some of the control chars.

Fixes XML parser error in Solr:

org.apache.solr.common.SolrException: Illegal character ((CTRL-CHAR, code 11))

rocket-turtle · 2018-03-05T08:03:11Z

Is this related to #570?

heaven · 2018-03-05T08:07:44Z

@rocket-turtle, correct. I see random spec failures, though.

856.23 failed here https://travis-ci.org/sunspot/sunspot/builds/347727961
but succeeded here https://travis-ci.org/sunspot/sunspot/builds/347725061

heaven · 2018-03-05T08:18:24Z

Solr itself can handle these chars perfectly, but the XML transport can't, so there is no way to fix this on the Solr side using a FilterFactory or anything else, because the data never gets that far. Solr fails while parsing the received XML if it contains any of the characters I've blacklisted.
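For context, the XML 1.0 specification only allows a fixed set of code points in a document, which is why code 11 (vertical tab) is rejected by the parser. The following is a minimal sketch of that rule, not code from this PR; the method names `xml10_char?` and `illegal_xml10_codepoints` are hypothetical.

```ruby
# Returns true when the code point is allowed by the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
def xml10_char?(codepoint)
  case codepoint
  when 0x9, 0xA, 0xD then true
  when 0x20..0xD7FF, 0xE000..0xFFFD, 0x10000..0x10FFFF then true
  else false
  end
end

# Hypothetical helper: list the distinct code points in a string that an
# XML 1.0 parser would reject.
def illegal_xml10_codepoints(text)
  text.each_codepoint.reject { |cp| xml10_char?(cp) }.uniq
end

# Tab (9) is allowed, vertical tab (11) is not.
p illegal_xml10_codepoints("tab\tis fine, vertical tab\vis not") # => [11]
```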

serggl · 2018-03-22T08:28:19Z

@heaven please rebase your branch against master. Let's make sure that the tests are actually passing.

heaven · 2018-03-22T15:24:51Z

@serggl, the branch is in sync with the master. I also replaced the list of characters with a regular expression, which should be more efficient.
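As an illustration of the regular-expression approach, here is a minimal sketch. The exact pattern used by the patch is not quoted in this thread, so the constant `ILLEGAL_XML_CONTROL_CHARS` and the method `strip_illegal_control_chars` are assumptions; the pattern covers the C0 control characters that XML 1.0 forbids.

```ruby
# Hypothetical names; the pattern used by the actual patch may differ.
# Matches every character below U+0020 except tab (0x09), line feed (0x0A)
# and carriage return (0x0D), which XML 1.0 does allow.
ILLEGAL_XML_CONTROL_CHARS = /[\x00-\x08\x0B\x0C\x0E-\x1F]/

# Removes the forbidden control characters from string values and returns
# any other kind of value unchanged.
def strip_illegal_control_chars(value)
  return value unless value.is_a?(String)

  value.gsub(ILLEGAL_XML_CONTROL_CHARS, '')
end

p strip_illegal_control_chars("foo\vbar\u0000baz\n") # => "foobarbaz\n"
p strip_illegal_control_chars(42)                    # => 42
```

A single character class lets the regex engine scan the string once, whereas a list of characters would typically need one pass per entry.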

serggl · 2018-03-22T20:05:16Z

@heaven I wonder if we could measure index performance somehow: with and without your patch

heaven · 2018-03-22T20:30:50Z

@serggl There will be as much overhead as a simple regular expression causes. I'm not sure how to make it more efficient. Perhaps we could check whether a string contains one of the blacklisted characters before running gsub, but I think Ruby may do this optimization for us.

We have this in production and everything seems good so far; the errors have also gone from the log. We run indexing from a background process, so I can't say whether it became slower.
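One way to answer the performance question would be a small benchmark that compares an unconditional `gsub` with the "check first" guard mentioned above. This is only a sketch under assumptions: the pattern, the method names and the sample strings are hypothetical, and the timings depend on the Ruby version and the data.

```ruby
require 'benchmark'

# Hypothetical pattern: C0 control characters that XML 1.0 forbids.
CONTROL_CHAR_PATTERN = /[\x00-\x08\x0B\x0C\x0E-\x1F]/

# Always runs gsub, which returns a new string even when nothing matches.
def strip_always(text)
  text.gsub(CONTROL_CHAR_PATTERN, '')
end

# Runs gsub only when the pattern matches; otherwise returns the same object.
def strip_if_needed(text)
  text =~ CONTROL_CHAR_PATTERN ? text.gsub(CONTROL_CHAR_PATTERN, '') : text
end

clean = 'plain text without control characters ' * 200
dirty = clean + "\v"

Benchmark.bm(22) do |bm|
  bm.report('always, clean')    { 500.times { strip_always(clean) } }
  bm.report('if needed, clean') { 500.times { strip_if_needed(clean) } }
  bm.report('always, dirty')    { 500.times { strip_always(dirty) } }
  bm.report('if needed, dirty') { 500.times { strip_if_needed(dirty) } }
end
```

Note the trade-off: the guarded version avoids allocating a copy for clean strings but scans dirty strings twice, and it returns the original object rather than a copy, which matters if callers mutate the result.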

serggl · 2018-03-23T06:37:54Z

One more thing that I did not notice before is that there are no specs that prove that this fix actually works. Can you please add some?

heaven · 2019-06-10T12:11:04Z

@serggl hi, I just added some tests.

mlh758 · 2019-06-13T13:45:48Z

I spent a little bit of time looking for ways to escape these control characters since Solr itself is fine with them. It looks like that's only allowed in XML 1.1, is inconsistently supported, and would probably impose more of a performance impact than just stripping them out. I'm good with this PR. It also doesn't look like there is a valid escape sequence for a null byte even in XML 1.1 so there would still be something to remove.

serggl · 2019-06-14T11:23:10Z

@mlh758 are you good merging this?

mlh758 · 2019-06-14T13:56:57Z

I am, yes.

serggl added the needs-tests label Mar 25, 2018

heaven added 3 commits March 5, 2019 15:24

Add filter to DataExtractor that removes some of control chars that X…

4be5c78

…ML transport doesn't support. Fixes XML parser error in Solr org.apache.solr.common.SolrException: Illegal character ((CTRL-CHAR, code 11)) Relevant issues: sunspot#570

Add support for earlier ruby versions (below 2.1)

abcea30

Replace the list of blacklisted chars with a regular expression. Shou…

f645aa7

…ld be more efficient on larger strings.

heaven force-pushed the data-extractor branch from a1578df to f645aa7 Compare March 5, 2019 13:29

heaven mentioned this pull request May 16, 2019

Sunspot Solr Reindexing failing due to illegal characters #570

Closed

Add tests

ce0f712

serggl removed the needs-tests label Jun 14, 2019

serggl merged commit 66d3bfe into sunspot:master Jun 14, 2019

serggl added a commit that referenced this pull request Jul 5, 2019

[ci skip] bump gem version to accomodate #886, #878, #944, #941, #930, …

87ec0aa

…#936, #927, #923, #921

heaven deleted the data-extractor branch July 7, 2019 11:38

mlh758 mentioned this pull request Jul 30, 2019

Invalid UTF-8 character crashes reindexing #688

Closed

alagos mentioned this pull request Sep 21, 2020

Fixes UTF8 chars being stripped from SOLR visfleet/sunspot#1

Merged
