Full unicode support #50

Closed
aalsinat opened this issue Dec 14, 2017 · 4 comments

@aalsinat

aalsinat commented Dec 14, 2017

I'd like to know how to parse full Unicode strings. \w is not enough for parsing lines like

2785-599 São Domingo

I've been trying to use /'regEx'/u or even \p{L}, like in Python, but it doesn't seem to work.
Using string match, it works:

address = '2785-599 São Domingo'

But when the instance is created a new error arises:

'ascii' codec can't encode character

Appreciate your help.

@aalsinat aalsinat changed the title Full unicode Full unicode support Dec 14, 2017
@igordejanovic
Member

textX uses Python's re module for regex matches. Whatever you can match using the re module you can use in textX between slashes: /..../.
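As a quick check outside textX, the sample line can be tried against the re module directly. This is only a minimal sketch for Python 3; the pattern and the variable names are illustrative assumptions, not something textX requires.

```python
import re

line = '2785-599 São Domingo'

# Python 3 str patterns are Unicode-aware by default, so \w matches 'ã'.
pattern = re.compile(r'(\d+-\d+)\s+(\w+\s+\w+)')
match = pattern.match(line)

num, name = match.groups()
print(num)   # 2785-599
print(name)  # São Domingo
```

If a pattern behaves as expected here, the same pattern text should be usable between the slashes of a textX regex match.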

Assuming you are trying to parse a file with multiple records like the one you provided, you could do something like this:

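# -*- coding: utf-8 -*-
from textx import metamodel_from_str

# Raw string, so \d, \w and \s reach textX as written.
grammar = r'''
Model: records*=Record;
Record: num=/\d+-\d+/ name=/\w+\s+\w+/;
'''
mm = metamodel_from_str(grammar)

# In Python 3 every str literal is already Unicode.
input = '''
2785-599 São Domingo
2785-599 São Domingo
'''

model = mm.model_from_str(input)
print(model.records)          # all parsed Record objects
print(model.records[0].name)  # name of the first record
print(model.records[0].num)   # number of the first record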

Note that the input text must be Unicode; textX accepts Unicode only. The example above is for Python 3, where str is already Unicode and \w matches Unicode word characters by default.
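When the records come from a file or a socket they usually arrive as bytes, so they need decoding before they are handed to the parser. The following is a minimal Python 3 sketch; it assumes the data is UTF-8 encoded, and the names raw and text are illustrative.

```python
# Bytes as they might arrive from a file or socket (assumed UTF-8 encoded).
raw = '2785-599 São Domingo\n'.encode('utf-8')

# Decode explicitly so the parser receives text (str), not bytes.
text = raw.decode('utf-8')

print(type(raw).__name__)   # bytes
print(type(text).__name__)  # str
print(text.strip())
```

For files, io.open(path, encoding='utf-8') returns Unicode text on both Python 2 and Python 3, which avoids a separate decoding step.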

@aalsinat
Author

aalsinat commented Dec 14, 2017

First of all, thanks for your quick response! I forgot to tell you that I am using Python 2.7. With this version, using:

  • re.U
  • re.UNICODE

makes the \w sequence dependent on the Unicode character database.
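The effect of the flag can be seen by comparing ASCII-only and Unicode-aware matching. The sketch below runs on Python 3, where Unicode matching is the default for str patterns, so re.ASCII is used here as a stand-in for Python 2's default \w behaviour; that substitution is an assumption made for illustration.

```python
import re

# '\u00e3' is the precomposed character 'ã'.
name = 'S\u00e3o Domingo'

# re.ASCII restricts \w to [a-zA-Z0-9_], similar to Python 2 without re.UNICODE.
ascii_words = re.findall(r'\w+', name, re.ASCII)

# Default for str patterns in Python 3 (same as passing re.UNICODE).
unicode_words = re.findall(r'\w+', name)

print(ascii_words)    # ['S', 'o', 'Domingo']
print(unicode_words)  # ['São', 'Domingo']
```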

@igordejanovic
Member

For Python 2 you can do this:

# -*- coding: utf-8 -*-
from textx import metamodel_from_str

# Raw string, so the regex escapes reach textX as written.
grammar = r'''
Model: records*=Record;
Record: num=/\d+-\d+/ name=/(?u)\w+\s+\w+/;
'''
mm = metamodel_from_str(grammar)

input = u'''
2785-599 São Domingo
2785-599 São Domingo
'''

model = mm.model_from_str(input)
print(model.records[0].name)
print(model.records[0].num)

Notice the (?u) at the beginning of the name regex match; it turns on the re.UNICODE flag for that match.
There is also a u prefix at the beginning of the input string, which makes it a unicode literal.
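The (?u) inline flag is also accepted on Python 3, where it is redundant for str patterns, so the same regex text can in principle be shared between both versions. A minimal sketch using only the re module (the names plain and inline are illustrative):

```python
import re

plain = re.compile(r'\w+\s+\w+')
inline = re.compile(r'(?u)\w+\s+\w+')

# On Python 3 both compiled patterns carry the UNICODE flag...
print(bool(plain.flags & re.UNICODE))   # True
print(bool(inline.flags & re.UNICODE))  # True

# ...and both match the accented name in full.
name = 'São Domingo'
print(plain.match(name).group())
print(inline.match(name).group())
```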

There was a slight problem with printing exceptions on syntax errors with unicode chars in Python 2. It should be fixed now on the master branch, so use that version.
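For context, the "'ascii' codec can't encode character" message is what Python reports whenever non-ASCII text is encoded with the ASCII codec, which Python 2 does implicitly in several places. The sketch below reproduces that class of error explicitly on Python 3. It is not a description of the actual textX bug, and the message text is a hypothetical example rather than textX's real error format.

```python
# Hypothetical error message containing a non-ASCII character.
message = 'Syntax error near: S\u00e3o'

try:
    message.encode('ascii')
except UnicodeEncodeError as err:
    error_text = str(err)
    print(error_text)

# Escaping unencodable characters (or encoding as UTF-8) avoids the failure.
safe_bytes = message.encode('ascii', errors='backslashreplace')
print(safe_bytes)
```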

@aalsinat
Author

Great! Great job! Thank you so much for your help!
