Correct the handling of alternate graphics in Title proper fields (display and indexing) #2591

JoelleDosimont · 2021-12-09T13:34:27Z

Describe the bug

When a document has alternate graphics (e.g. a title in Cyrillic characters), the "value" field is repeated to add the transliteration. The "language" sub-element is used to determine the language and the graphic (script).

There are problems in two cases:

Case 1: the "language" sub-element is used in only one of the proper title's values (most of the migrated records are in this case).

The titles are displayed correctly.
The proper title value that lacks the "language" sub-element is not indexed, so a search on this title returns no result.

Case 2: the "language" sub-element is used in both of the proper title's values.

The titles are displayed incorrectly: in the detailed view of the document, the title appears 3 times (see screenshot below).
Both of the proper title's values are indexed.

To Reproduce

Case 1

Search for the terms "Frantsuzsko-russkii slovar".
See that the search returns no result even though the following record exists: https://ilsdev.test.rero.ch/professional/records/documents/detail/252

Case 2

Display the detail view of a document in Cyrillic where the "language" sub-element is set for both the original graphic and the transliteration (e.g. https://ilsdev.test.rero.ch/professional/records/documents/detail/712).
See that there are 3 entries for the title.

Expected behavior

For a title, if the value is repeated for transliteration (same language, different graphic), the title should be displayed as follows:

First: the original graphic (e.g. for Russian: in Cyrillic).
Below: the transliteration.

All values are indexed, regardless of the presence of the Language sub-element.
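
The expected behavior above can be sketched in Python as follows. This is only an illustration, not the RERO ILS implementation: the record layout (a list of entries with `value` and an optional `language`), the language code `rus-cyrl`, the Cyrillic spelling of the sample title, and all function names are assumptions made for the example. The script is guessed from the characters themselves, so the ordering does not depend on the `language` sub-element being present.

```python
import unicodedata

# Hypothetical proper title: the transliteration lacks "language" (case 1).
SAMPLE_TITLE = [
    {"value": "Frantsuzsko-russkii slovar"},
    {"value": "Французско-русский словарь", "language": "rus-cyrl"},
]


def is_latin(text):
    """Return True if every alphabetic character in text is a Latin letter."""
    letters = [ch for ch in text if ch.isalpha()]
    return all("LATIN" in unicodedata.name(ch, "") for ch in letters)


def display_order(values):
    """Return each distinct value once: original graphic first, then transliteration."""
    seen = set()
    unique = []
    for entry in values:
        if entry["value"] not in seen:
            seen.add(entry["value"])
            unique.append(entry["value"])
    # sorted() is stable and False sorts before True, so non-Latin values come first.
    return sorted(unique, key=is_latin)


def indexable_values(values):
    """Return every value, whether or not it carries a "language" sub-element."""
    return [entry["value"] for entry in values]


if __name__ == "__main__":
    for line in display_order(SAMPLE_TITLE):
        print(line)
```

Deduplicating before sorting is what keeps a title from being listed more than twice in this sketch, and `indexable_values` deliberately ignores `language` so that a search on the transliteration can match.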

Context

server: ilsdev.test.rero.ch
version: v1.10.0

Screenshots

Expected display :

Additional context

On 07/12/2021, the title was displayed differently: 1 entry for the original graphic, 2 entries for the transliteration.
On 09/12/2021: 2 entries for the original graphic and 1 entry for the transliteration, in the wrong order (the original graphic should come first).

Some complex fields have formatted content in a special field `_text`. This field is no longer indexed, but all the related fields are indexed. The legacy `series` document field configuration is removed. The `&` character is removed from `tokenize_on_chars`. The tokenize characters come from the following Unicode categories, as defined in the Elasticsearch source code: Ps, Pe, Po, Pc, Pd, Pi, Pf. `&` is excluded from the Po category. Unfortunately, Elasticsearch does not support exceptions. The `unicategories` Python package was used to generate the list from the categories. Note that some Unicode characters such as `"\ud836\ude8b"` have been removed because they cause Elasticsearch errors.

Data migration instructions: an Elasticsearch server-side document reindexing is needed.

* Closes: rero#3050.
* Closes: rero#2591.
* Closes: rero#2730.
* Closes: rero#2972.
* Closes: rero#2050.
* Closes: rero#3027.

Co-Authored-by: Johnny Mariéthoz <Johnny.Mariethoz@rero.ch>

Some complex fields have formatted content in a special field `_text`. This field is no longer indexed, but all the related fields are indexed. The legacy `series` document field configuration is removed. The `&` character is removed from `tokenize_on_chars`. The tokenize characters come from the following Unicode categories, as defined in the Elasticsearch source code: Ps, Pe, Po, Pc, Pd, Pi, Pf. `&` is excluded from the Po category. Unfortunately, Elasticsearch does not support exceptions. The `unicategories` Python package was used to generate the list from the categories. Note that some Unicode characters such as `"\ud836\ude8b"` have been removed because they cause Elasticsearch errors.

Data migration instructions: a complete document re-indexing is needed.

* Fixes several mapping configurations coming from the facets configuration.
* Closes: rero#3050.
* Closes: rero#2591.
* Closes: rero#2730.
* Closes: rero#2972.
* Closes: rero#3027.

Co-Authored-by: Johnny Mariéthoz <Johnny.Mariethoz@rero.ch>
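
The character list described in the commit message could be generated along these lines. This is a sketch only: it uses the standard library `unicodedata` module instead of the `unicategories` package named above, and the tokenizer name `custom_punctuation` and both function names are hypothetical. The `char_group` tokenizer type and its `tokenize_on_chars` parameter are Elasticsearch's own. The real configuration may list further characters or groups.

```python
import sys
import unicodedata

# Punctuation categories named in the commit message.
PUNCTUATION_CATEGORIES = {"Ps", "Pe", "Po", "Pc", "Pd", "Pi", "Pf"}


def tokenize_on_chars(exclude=("&",)):
    """List every character in the punctuation categories, minus the exclusions."""
    excluded = set(exclude)
    chars = []
    for codepoint in range(sys.maxunicode + 1):
        char = chr(codepoint)
        if unicodedata.category(char) in PUNCTUATION_CATEGORIES and char not in excluded:
            chars.append(char)
    return chars


def tokenizer_settings(chars):
    """Build a hypothetical analysis fragment for a char_group tokenizer."""
    return {
        "tokenizer": {
            "custom_punctuation": {  # hypothetical name, not from the source
                "type": "char_group",
                "tokenize_on_chars": chars,
            }
        }
    }


if __name__ == "__main__":
    chars = tokenize_on_chars()
    print(len(chars), "&" in chars)
```

Characters that the server rejects, such as the one written `"\ud836\ude8b"` in the commit message (U+1DA8B as a surrogate pair), can be passed in `exclude` alongside `&`.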

JoelleDosimont added the labels f: professional ui (Professional interface), f: public ui (Public interface, as opposed to the professional interface), triage, bug (Breaks something but is not blocking) and p-Low (Low priority) on Dec 9, 2021

PascalRepond removed the triage label Feb 18, 2022

JoelleDosimont added the label p-High (High priority, to be solved in the next 2-3 months) and removed the label p-Low (Low priority) on Jun 2, 2022

JoelleDosimont changed the title ~~Correct the display of title when the value "language" is used (for Alternate Graphics)~~ Correct the handling of alternate graphics in Title proper fields (display and indexing) Jun 2, 2022

jma mentioned this issue Aug 17, 2022

search: fix document index configuration #3030

Merged

jma closed this as completed in #3030 Aug 25, 2022
