Correct the handling of alternate graphics in Title proper fields (display and indexing) #2591

Closed
JoelleDosimont opened this issue Dec 9, 2021 · 0 comments · Fixed by #3030
Labels: bug (Breaks something but is not blocking), f: professional ui (Professional interface), f: public ui (Public interface, as opposed to the professional interface), p-High (High priority, to be solved in the next 2-3 months)

JoelleDosimont commented Dec 9, 2021

Describe the bug

When a document has an alternate graphic representation (e.g. a title in Cyrillic characters), the "value" field is repeated to add the transliteration. The "language" sub-element identifies the language and the graphic (script) of each value.

There are problems in two cases:

  1. The "language" sub-element is used in only one of the proper title's values (most migrated records are in this case).
     • The titles are displayed correctly.
     • The proper title value that lacks the "language" sub-element is not indexed, so a search on that title fails.
  2. The "language" sub-element is used in both of the proper title's values.
     • The titles are displayed incorrectly: in the detailed view of the document, the title appears 3 times (see screenshot below).
     • Both proper title values are indexed.
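
To make the two cases concrete, here is a minimal sketch of what the repeated values might look like and how a record could be sorted into one case or the other. The key names `value` and `language` follow the wording of this issue; the list layout, the language codes, the Cyrillic title, and the `classify` helper are illustrative assumptions, not the actual RERO ILS data model.

```python
from typing import Dict, List

# Hypothetical proper-title fragments (assumed shape, illustrative codes).
CASE_1 = [
    {"value": "Французско-русский словарь", "language": "rus-cyrl"},
    {"value": "Frantsuzsko-russkii slovar"},  # no "language" sub-element
]
CASE_2 = [
    {"value": "Французско-русский словарь", "language": "rus-cyrl"},
    {"value": "Frantsuzsko-russkii slovar", "language": "rus-latn"},
]


def classify(values: List[Dict[str, str]]) -> str:
    """Tell which case of this issue a list of proper-title values matches."""
    if len(values) < 2:
        return "not repeated"
    tagged = sum(1 for v in values if v.get("language"))
    if tagged == len(values):
        return "case 2"  # every value carries a language
    if tagged == 0:
        return "untagged"  # no value carries a language
    return "case 1"  # only some values carry a language


print(classify(CASE_1), classify(CASE_2))
```

A helper like this could be used to estimate how many migrated records fall into case 1 before and after a fix.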

To Reproduce

Case 1

  1. Search for the terms "Frantsuzsko-russkii slovar".
  2. See that the search returns no result even though the following record exists: https://ilsdev.test.rero.ch/professional/records/documents/detail/252

Case 2

  1. Display the detail view of a document in Cyrillic in which the "language" value is set for both the original graphic and the transliteration (e.g. https://ilsdev.test.rero.ch/professional/records/documents/detail/712).
  2. See that there are 3 entries for the Title.

Expected behavior

For a Title whose value is repeated for transliteration (same language, different graphic), the Title should display:

  • First: the original graphic (e.g. for Russian: in Cyrillic).
  • Below: the transliteration.

All values should be indexed, regardless of the presence of the "language" sub-element.
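
The expected behavior can be sketched as two small functions: one that orders values for display and one that collects every value for indexing. This is only an illustration under stated assumptions: it treats any value written entirely in Latin letters as the transliteration (a heuristic that works even when "language" is missing), and the function names and dictionary layout are hypothetical rather than taken from the project.

```python
import unicodedata
from typing import Dict, List


def is_latin(text: str) -> bool:
    """True if the text has letters and all of them are Latin-script letters."""
    letters = [c for c in text if c.isalpha()]
    return bool(letters) and all(
        unicodedata.name(c, "").startswith("LATIN") for c in letters
    )


def display_order(values: List[Dict[str, str]]) -> List[str]:
    """Original graphic first, Latin transliteration below (stable sort)."""
    return sorted((v["value"] for v in values), key=is_latin)


def index_terms(values: List[Dict[str, str]]) -> List[str]:
    """Every non-empty value is indexed, with or without "language"."""
    return [v["value"] for v in values if v.get("value")]


title = [
    {"value": "Frantsuzsko-russkii slovar"},
    {"value": "Французско-русский словарь", "language": "rus-cyrl"},
]
print(display_order(title))
print(index_terms(title))
```

Sorting on a boolean key keeps the input order within each group, so a record with a single value, or with two values in the same script, is displayed unchanged.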

Context

Screenshots

[screenshot: current display]
Expected display:
[screenshot: expected display]

Additional context

On 07/12/2021, the display of the Title was different: 1 entry for the original graphic, 2 entries for the transliteration.
On 09/12/2021: 2 entries for the original graphic and 1 entry for the transliteration, in the wrong order (the original graphic should be first).

@JoelleDosimont added the labels f: professional ui (Professional interface), f: public ui (Public interface, as opposed to the professional interface), triage, bug (Breaks something but is not blocking) and p-Low (Low priority) on Dec 9, 2021
@JoelleDosimont added the label p-High (High priority, to be solved in the next 2-3 months) and removed the label p-Low (Low priority) on Jun 2, 2022
@JoelleDosimont changed the title from "Correct the display of title when the value "language" is used (for Alternate Graphics)" to "Correct the handling of alternate graphics in Title proper fields (display and indexing)" on Jun 2, 2022
jma added a commit to jma/rero-ils that referenced this issue Aug 17, 2022
Some complex fields have formatted content in a special field `_text`.
This field is no longer indexed, but all the related fields are
indexed. The legacy `series` document field configuration is removed.

Remove the `&` character from `tokenize_on_chars`. The tokenize
characters come from the following Unicode categories as defined in the
Elasticsearch source code: Ps, Pe, Po, Pc, Pd, Pi, Pf.
`&` is excluded from the Po category. Unfortunately, Elasticsearch does
not support exceptions, so the `unicategories` Python package has been
used to generate the list from the categories. Note that some Unicode
characters such as `"\ud836\ude8b"` have been removed as they cause
Elasticsearch errors.

Data Migration Instructions: Elasticsearch server-side document
reindexing is needed.

* Closes: rero#3050.
* Closes: rero#2591.
* Closes: rero#2730.
* Closes: rero#2972.
* Closes: rero#2050.
* Closes: rero#3027.

Co-Authored-by: Johnny Mariéthoz <Johnny.Mariethoz@rero.ch>
jma added a commit to jma/rero-ils that referenced this issue Aug 17, 2022
Some complex fields have formatted content in a special field `_text`.
This field is no longer indexed, but all the related fields are
indexed. The legacy `series` document field configuration is removed.

Remove the `&` character from `tokenize_on_chars`. The tokenize
characters come from the following Unicode categories as defined in the
Elasticsearch source code: Ps, Pe, Po, Pc, Pd, Pi, Pf.
`&` is excluded from the Po category. Unfortunately, Elasticsearch does
not support exceptions, so the `unicategories` Python package has been
used to generate the list from the categories. Note that some Unicode
characters such as `"\ud836\ude8b"` have been removed as they cause
Elasticsearch errors.

Data Migration Instructions: a complete document re-indexing is needed.

* Closes: rero#3050.
* Closes: rero#2591.
* Closes: rero#2730.
* Closes: rero#2972.
* Closes: rero#2050.
* Closes: rero#3027.

Co-Authored-by: Johnny Mariéthoz <Johnny.Mariethoz@rero.ch>
jma added a commit that referenced this issue Aug 22, 2022
Some complex fields have formatted content in a special field `_text`.
This field is no longer indexed, but all the related fields are
indexed. The legacy `series` document field configuration is removed.

Remove the `&` character from `tokenize_on_chars`. The tokenize
characters come from the following Unicode categories as defined in the
Elasticsearch source code: Ps, Pe, Po, Pc, Pd, Pi, Pf.
`&` is excluded from the Po category. Unfortunately, Elasticsearch does
not support exceptions, so the `unicategories` Python package has been
used to generate the list from the categories. Note that some Unicode
characters such as `"\ud836\ude8b"` have been removed as they cause
Elasticsearch errors.

Data Migration Instructions: a complete document re-indexing is needed.

* Closes: #3050.
* Closes: #2591.
* Closes: #2730.
* Closes: #2972.
* Closes: #2050.
* Closes: #3027.

Co-Authored-by: Johnny Mariéthoz <Johnny.Mariethoz@rero.ch>
jma added a commit to jma/rero-ils that referenced this issue Aug 24, 2022
Some complex fields have formatted content in a special field `_text`.
This field is no longer indexed, but all the related fields are
indexed. The legacy `series` document field configuration is removed.

Remove the `&` character from `tokenize_on_chars`. The tokenize
characters come from the following Unicode categories as defined in the
Elasticsearch source code: Ps, Pe, Po, Pc, Pd, Pi, Pf.
`&` is excluded from the Po category. Unfortunately, Elasticsearch does
not support exceptions, so the `unicategories` Python package has been
used to generate the list from the categories. Note that some Unicode
characters such as `"\ud836\ude8b"` have been removed as they cause
Elasticsearch errors.

Data Migration Instructions: a complete document re-indexing is needed.

* Closes: rero#3050.
* Closes: rero#2591.
* Closes: rero#2730.
* Closes: rero#2972.
* Closes: rero#3027.

Co-Authored-by: Johnny Mariéthoz <Johnny.Mariethoz@rero.ch>
jma added a commit to jma/rero-ils that referenced this issue Aug 24, 2022
Some complex fields have formatted content in a special field `_text`.
This field is no longer indexed, but all the related fields are
indexed. The legacy `series` document field configuration is removed.

Remove the `&` character from `tokenize_on_chars`. The tokenize
characters come from the following Unicode categories as defined in the
Elasticsearch source code: Ps, Pe, Po, Pc, Pd, Pi, Pf.
`&` is excluded from the Po category. Unfortunately, Elasticsearch does
not support exceptions, so the `unicategories` Python package has been
used to generate the list from the categories. Note that some Unicode
characters such as `"\ud836\ude8b"` have been removed as they cause
Elasticsearch errors.

Data Migration Instructions: a complete document re-indexing is needed.

* Fixes several mapping configurations coming from the facets
  configuration.
* Closes: rero#3050.
* Closes: rero#2591.
* Closes: rero#2730.
* Closes: rero#2972.
* Closes: rero#3027.

Co-Authored-by: Johnny Mariéthoz <Johnny.Mariethoz@rero.ch>
@jma jma closed this as completed in #3030 Aug 25, 2022
jma added a commit that referenced this issue Aug 25, 2022
Some complex fields have formatted content in a special field `_text`.
This field is no longer indexed, but all the related fields are
indexed. The legacy `series` document field configuration is removed.

Remove the `&` character from `tokenize_on_chars`. The tokenize
characters come from the following Unicode categories as defined in the
Elasticsearch source code: Ps, Pe, Po, Pc, Pd, Pi, Pf.
`&` is excluded from the Po category. Unfortunately, Elasticsearch does
not support exceptions, so the `unicategories` Python package has been
used to generate the list from the categories. Note that some Unicode
characters such as `"\ud836\ude8b"` have been removed as they cause
Elasticsearch errors.

Data Migration Instructions: a complete document re-indexing is needed.

* Fixes several mapping configurations coming from the facets
  configuration.
* Closes: #3050.
* Closes: #2591.
* Closes: #2730.
* Closes: #2972.
* Closes: #3027.

Co-Authored-by: Johnny Mariéthoz <Johnny.Mariethoz@rero.ch>
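
For readers unfamiliar with the `tokenize_on_chars` change described in the commit message above, the following sketch shows one way to build an explicit character list from the named Unicode categories while leaving out `&`. It is not the project's code: the commit used the `unicategories` package, whereas this sketch uses the standard-library `unicodedata` module instead. The restriction to the Basic Multilingual Plane is an assumption made here to avoid characters like the `"\ud836\ude8b"` pair mentioned in the commit, and the function names and the settings fragment are hypothetical.

```python
import unicodedata
from typing import Dict, FrozenSet, List

# Punctuation categories listed in the commit message.
PUNCT_CATEGORIES = {"Ps", "Pe", "Po", "Pc", "Pd", "Pi", "Pf"}


def tokenize_chars(
    exclude: FrozenSet[str] = frozenset("&"),
    max_code_point: int = 0xFFFF,  # assumption: stay within the BMP
) -> List[str]:
    """List punctuation characters by category, minus the excluded ones."""
    chars = []
    for code_point in range(max_code_point + 1):
        ch = chr(code_point)
        if unicodedata.category(ch) in PUNCT_CATEGORIES and ch not in exclude:
            chars.append(ch)
    return chars


def char_group_tokenizer(chars: List[str]) -> Dict[str, object]:
    """Hypothetical settings fragment for a char_group tokenizer."""
    return {"type": "char_group", "tokenize_on_chars": chars}


chars = tokenize_chars()
print(len(chars), "&" in chars)
```

Because the list is explicit, keeping `&` inside tokens only requires omitting it from the generated characters, which is the workaround the commit describes for the lack of exceptions.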