BUG #2

Open
leoalenc opened this issue Apr 16, 2022 · 2 comments

@leoalenc

leoalenc commented Apr 16, 2022

Describe the bug
I followed the documentation (https://udon2.github.io/basic/) and tried to parse a simple conllu file (see example below from https://github.com/EmilStenstrom/conllu/blob/master/README.md) using the IDLE Shell 3.8.10:

import udon2
# assumes that you have a file named 'test.conllu' with a valid CoNLL-U format
roots = udon2.ConllReader.read_file('test.conllu')

It didn't work. Instead, it caused a crash of the shell:


================================ RESTART: Shell ================================
>>> 

The following error was reported:

Failed to split a line of the CONLL-U format: Resource temporarily unavailable

To Reproduce

1   The     the    DET    DT   Definite=Def|PronType=Art   4   det     _   _
2   quick   quick  ADJ    JJ   Degree=Pos                  4   amod    _   _
3   brown   brown  ADJ    JJ   Degree=Pos                  4   amod    _   _
4   fox     fox    NOUN   NN   Number=Sing                 5   nsubj   _   _
5   jumps   jump   VERB   VBZ  Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   0   root    _   _
6   over    over   ADP    IN   _                           9   case    _   _
7   the     the    DET    DT   Definite=Def|PronType=Art   9   det     _   _
8   lazy    lazy   ADJ    JJ   Degree=Pos                  9   amod    _   _
9   dog     dog    NOUN   NN   Number=Sing                 5   nmod    _   SpaceAfter=No
10  .       .      PUNCT  .    _                           5   punct   _   _

Expected behavior
The code above should work according to the documentation: https://udon2.github.io/basic/

Screenshots
[Screenshot: bug_udon2]

Environment specs (please complete the following information):

  • OS: Ubuntu 20.04.4 LTS
  • Python version: Python 3.8.10
  • UDon2 version: '0.1.0'
  • Precompiled?: Yes

Additional context

@leoalenc leoalenc added the bug Something isn't working label Apr 16, 2022
@dkalpakchi
Collaborator

dkalpakchi commented Apr 17, 2022

Hi!

Thank you for the detailed bug report! I can confirm that I'm able to reproduce the bug. The problem is that the fields in the input file are separated by spaces rather than by tab characters, which I believe the CoNLL-U format requires (https://universaldependencies.org/format.html). That is why udon2 could not read the given CoNLL-U file. If you try the file below instead, it should work just fine.

1	The	the	DET	DT	Definite=Def|PronType=Art	4	det	_	_
2	quick	quick	ADJ	JJ	Degree=Pos	4	amod	_	_
3	brown	brown	ADJ	JJ	Degree=Pos	4	amod	_	_
4	fox	fox	NOUN	NN	Number=Sing	5	nsubj	_	_
5	jumps	jump	VERB	VBZ	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	0	root	_	_
6	over	over	ADP	IN	_	9	case	_	_
7	the	the	DET	DT	Definite=Def|PronType=Art	9	det	_	_
8	lazy	lazy	ADJ	JJ	Degree=Pos	9	amod	_	_
9	dog	dog	NOUN	NN	Number=Sing	5	nmod	_	SpaceAfter=No
10	.	.	PUNCT	.	_	5	punct	_	_
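
For reference, here is a minimal sketch of how a space-separated file like the one in the report could be converted to tab-separated form before it is passed to udon2. The helper name `spaces_to_tabs` is hypothetical and not part of udon2. It assumes that no field contains internal whitespace, which holds for the example above but not for every CoNLL-U file, since the specification permits spaces inside FORM, LEMMA and MISC.

```python
def spaces_to_tabs(text: str) -> str:
    """Rewrite whitespace-separated CoNLL-U word lines as tab-separated lines.

    Assumption: no field contains whitespace, so splitting on any run of
    whitespace recovers exactly the 10 CoNLL-U fields.
    """
    out = []
    for line in text.splitlines():
        # Blank lines separate sentences; keep them as empty lines.
        if not line.strip():
            out.append("")
            continue
        # Comment lines (e.g. "# text = ...") pass through unchanged.
        if line.startswith("#"):
            out.append(line)
            continue
        fields = line.split()
        if len(fields) != 10:
            raise ValueError(
                f"expected 10 fields, got {len(fields)}: {line!r}"
            )
        out.append("\t".join(fields))
    return "\n".join(out) + "\n"
```

Reading `test.conllu`, passing its contents through this function and writing the result to a new file should give a file equivalent to the tab-separated one above.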

I'm not entirely sure if I'm reading the formatting section too literally and spaces should be allowed, so I would be grateful if you could provide any link, where it is said that spaces and tabs can be used interchangeably (if you have such a link).

What I can definitely agree with is that the error message should be better. In fact, I tried the regular Python shell (all other environment specs being the same) and it gives the error message "Failed to split a line of the CONLL-U format: Success" and then exits from the shell. I don't know why IDLE doesn't show this error message, although, again, the message is too cryptic. Depending on whether the spaces should be allowed, I'll come up with a better message.
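
Until the message is improved, a check on the caller's side can point at the offending line. The sketch below is only an illustration of that idea: `first_malformed_line` is a hypothetical helper, not a udon2 function, and it only counts tab-separated fields on word lines without validating their contents.

```python
from typing import Iterable, Optional, Tuple


def first_malformed_line(lines: Iterable[str]) -> Optional[Tuple[int, int]]:
    """Return (line_number, field_count) for the first word line that does
    not have exactly 10 tab-separated fields, or None if all lines pass."""
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        # Skip blank sentence separators and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        n_fields = len(line.split("\t"))
        if n_fields != 10:
            return lineno, n_fields
    return None
```

Calling it on the open file object for `test.conllu` before `udon2.ConllReader.read_file('test.conllu')` would report line 1 with a single field for the space-separated file from the report, which makes the missing tabs easy to spot.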

Please let me know what your thoughts on the matter are.

@leoalenc
Author

leoalenc commented Apr 17, 2022

roots = udon2.ConllReader.read_file('test.conllu')

Many thanks for your response! You're right, the specification of the CoNLL-U format of the Universal Dependencies documentation reads:

We use a revised version of the CoNLL-X format called CoNLL-U. Annotations are encoded in plain text files [...] with three types of lines:

Word lines containing the annotation of a word/token in 10 fields separated by single tab characters [...].

I tested the same example with fields separated by tabs and everything worked perfectly. I was misled by the fact that the Python conllu library does not complain about spaces separating fields. The test file at hand was produced by copying and pasting the example from their documentation. This library, however, does separate fields with single tabs when it serializes parsed CoNLL-U data. The visualization tool at https://urd2.let.rug.nl/~kleiweg/conllu/ also accepts files with either spaces or tabs separating fields, which was another source of error for a newbie like myself.
I'm very glad that I will be able to work with CoNLL-U files with udon2 in a treebank project I've just started.
BTW, the error message I reported previously was displayed in the bash shell from which I started the IDLE shell.

@dkalpakchi dkalpakchi self-assigned this Dec 28, 2022