Just first regex hit is shown if multiple regex patterns match the same input string #1897

Closed
lfcnassif opened this issue Sep 26, 2023 · 8 comments · Fixed by #1900
@lfcnassif
Member

Reported on #1745

@lfcnassif added the bug label on Sep 26, 2023
@wladimirleite
Member

I can take a look at this, @lfcnassif.

@lfcnassif
Member Author

If you have available time @tc-wleite, that would be great! Thank you very much for all your volunteer work on this project!

@wladimirleite
Member

I was able to reproduce and fix the issue reported by @paulobreim (#1745 (comment)).
If a string matched more than one regex, only the first matching regex was considered.

It doesn't seem to be the same situation reported by @milcent-CVM (#1745 (comment)).

@milcent-CVM, can you provide a couple of sample strings that should match the regex you created?
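
The first-hit-only behavior described above can be illustrated with a minimal sketch. It uses `java.util.regex` rather than dk.brics.automaton, and the pattern names `DIGITS` and `ALNUM` are hypothetical; this is not IPED's actual code, only an illustration of the difference between stopping at the first matching pattern and reporting every pattern that matches.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class MultiRegexDemo {
    // Hypothetical named patterns, for illustration only.
    static final Map<String, Pattern> PATTERNS = new LinkedHashMap<>();
    static {
        PATTERNS.put("DIGITS", Pattern.compile("[0-9]+"));
        PATTERNS.put("ALNUM", Pattern.compile("[0-9a-z]+"));
    }

    // Buggy variant: stops as soon as one pattern matches the whole input.
    static List<String> firstHitOnly(String input) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Pattern> e : PATTERNS.entrySet()) {
            if (e.getValue().matcher(input).matches()) {
                hits.add(e.getKey());
                break; // later patterns are never checked
            }
        }
        return hits;
    }

    // Fixed variant: records every pattern that matches the whole input.
    static List<String> allHits(String input) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Pattern> e : PATTERNS.entrySet()) {
            if (e.getValue().matcher(input).matches()) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(firstHitOnly("12345")); // prints [DIGITS]
        System.out.println(allHits("12345"));      // prints [DIGITS, ALNUM]
    }
}
```
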

@milcent-CVM

Sure! Thank you!

Let me post the regex pattern that got the most hits. It is still evolving (I am developing it on regex101.com): it doesn't yet account for all Latin characters in a person's name, and it expects case-insensitive matching, which is the default in IPED:

\b(?:\n*)+(?:\s*\-?)?(?:[^\-]relatora?)+(?:\s*:?\s*)(?:diretora?|presidente|)?\s*(\w+é*(?:(?: +\w+)+))|(\w+(?:(?: +\w+é*)+))(?:\r*\t*\n*)(?:diretora?)\-(?:relatora?)\b

Below I have joined several excerpts from the documents into one piece of text that should produce many hits, OK?

"""
RELATOR :
WLADIMIR CASTELO BRANCO CASTRO

DURVAL JOSÉ SOLEDADE SANTOS
Diretor-Relator

JOSÉ LUIZ OSORIO DE ALMEIDA FILHO
RELATOR : Diretor Durval José Soledade Santos

Dos fatos
Lei nº 6.404/1976.

Diretor Relator: Alexandre Costa Rangel

Relatório de Julgamento (1251150) SEI 19957.010729/2019-31 / pg. 1

Data do julgamento: 23/06/2020

Relator: Diretor Henrique Machado

Acusados:

LEONARDO BRUNET MENDES DE MORAES

Diretor-Relator

FRANCISCO AUGUSTO DA COSTA E SILVA

Presidente

RELATÓRIO

Relator: Leonardo Brunet Mendes De Moraes

DOS FATOS

WLADIMIR CASTELO BRANCO CASTRO
Diretor-Relator

Presidente da Sessão

RELATÓRIO

Relator : Diretor Wladimir Castelo Branco Castro

Rio de Janeiro, 04 de abril de 2007.

Maria Helena de Santana
Diretora-Relatora

Marcelo Fernandez Trindade
Presidente da Sessão de Julgamento

Rio de Janeiro, 18 de dezembro de 2007.

Durval Soledade
Diretor-Relator

Maria Helena dos Santos Fernandes de Santana
Participaram do julgamento os Diretores Marcos Barbosa Pinto, Relator, Durval Soledade, Sergio Weguelin e a Presidente da CVM, Maria Helena dos Santos Fernandes de Santana.

Rio de Janeiro, 28 de agosto de 2007.

Marcos Barbosa Pinto
Diretor-Relator

Maria Helena dos Santos Fernandes de Santana
Rio de Janeiro, 21 de agosto de 2007.

Eli Loria
Diretor-Relator e Presidente da Sessão de Julgamento
"""

@wladimirleite
Member

@milcent-CVM, using the regex and the text posted, regex101 is not showing any matches (screenshot below).
Can you check if I am doing something different from what you are?

image

@milcent-CVM

milcent-CVM commented Sep 26, 2023

@tc-wleite, this is probably because regex101 is case-sensitive by default, while IPED is not (at least according to RegexConfig.txt).
The regex that works in regex101 is the one below:

\b(?:\n*)+(?:\s*\-?)?(?:[^\-]Relatora?|RELATORA?)+(?:\s*:?\s*)(?:Diretora?|DIRETORA?|Presidente|PRESIDENTE)?\s*(\w+(?:(?: +\w+é*É*)+))|(\w+(?:(?: +\w+é*É*)+))(?:\r*\t*\n*)(?:Diretora?|DIRETORA?)\-(?:Relatora?|RELATORA?)\b

image
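
The case-sensitivity difference described above can be reproduced with a small sketch. It uses `java.util.regex` (not IPED or dk.brics.automaton), and the class and constant names are made up for illustration. It shows that a lowercase pattern misses uppercase text unless matching is case-insensitive or the input is lowercased first.

```java
import java.util.Locale;
import java.util.regex.Pattern;

public class CaseDemo {
    // Same pattern fragment, compiled with and without case-insensitive flags.
    static final Pattern SENSITIVE = Pattern.compile("relatora?");
    static final Pattern INSENSITIVE =
            Pattern.compile("relatora?", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

    // Returns true if the pattern occurs anywhere in the text.
    static boolean found(Pattern p, String text) {
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "RELATOR : WLADIMIR CASTELO BRANCO CASTRO";
        System.out.println(found(SENSITIVE, text));   // prints false
        System.out.println(found(INSENSITIVE, text)); // prints true
        // Lowercasing the input is an alternative to case-insensitive flags.
        System.out.println(found(SENSITIVE, text.toLowerCase(Locale.ROOT))); // prints true
    }
}
```
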

@wladimirleite
Member

@milcent-CVM, it seems that the syntax used by regex101 is not the same as the one used by IPED (which relies on dk.brics.automaton).
Please take a look at:
https://www.brics.dk/automaton/faq.html
https://www.brics.dk/automaton/doc/dk/brics/automaton/RegExp.html

Trying simpler expressions in IPED first should help.
Alternatively, you can use a small standalone program to test the dk.brics.automaton library directly:

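import dk.brics.automaton.*;

public class Test {
    public static void main(String[] args) {
        // Build an automaton from the pattern using dk.brics.automaton syntax.
        RegExp r = new RegExp("([^\\-]relatora?)");
        Automaton a = r.toAutomaton();
        // Lowercase the input to mimic case-insensitive matching.
        String input = " relator".toLowerCase();
        // run() returns true only if the whole input string is accepted.
        System.out.println(a.run(input));
    }
}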

This program prints "true".
But if I change the expression to "(?:[^\\-]relatora?)", it no longer matches.
However, in regex101, it does:
image
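
Given the behavior reported above, one possible workaround is to rewrite non-capturing groups `(?:...)` as plain groups `(...)` before handing a pattern to dk.brics.automaton. The sketch below is only an assumption about how that could be done: `stripNonCapturing` is a hypothetical helper, not part of IPED or dk.brics.automaton, and the naive replacement would also alter an escaped literal such as `\(?:`. The sanity check uses `java.util.regex`, which accepts both forms.

```java
import java.util.regex.Pattern;

public class GroupRewriteDemo {
    // Hypothetical helper: replaces every "(?:" with "(".
    // Naive on purpose; it does not handle an escaped "\(" followed by "?:".
    static String stripNonCapturing(String regex) {
        return regex.replace("(?:", "(");
    }

    public static void main(String[] args) {
        String original = "(?:[^\\-]relatora?)";
        String rewritten = stripNonCapturing(original);
        System.out.println(rewritten); // prints ([^\-]relatora?)

        // In java.util.regex both forms accept the same input.
        System.out.println(Pattern.matches(original, " relator"));  // prints true
        System.out.println(Pattern.matches(rewritten, " relator")); // prints true
    }
}
```

Other constructs in the posted pattern may also differ between the two syntaxes, so the rewritten expression should still be tested piece by piece against dk.brics.automaton itself.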

@milcent-CVM

milcent-CVM commented Sep 27, 2023 via email

@lfcnassif changed the title from "Custom user regex patterns in RegexConfig.txt could be ignored" to "Just first regex hit is shown if multiple regex patterns match the same input string" on Oct 4, 2023