Full unicode support #50

aalsinat · 2017-12-14T09:30:28Z

I'd like to know how to parse full Unicode strings. \w is not enough for parsing lines like:

2785-599 São Domingo

I've been trying to use /'regEx'/u or even \p{L} like in Python, but it doesn't seem to work.
Using a string match, it works:

address = '2785-599 São Domingo'

But when the instance is created a new error arises:

'ascii' codec can't encode character

Appreciate your help.


igordejanovic · 2017-12-14T14:06:13Z

textX uses the Python re module for regex matches. Whatever you can match using the re module you can use in textX between slashes: /..../.
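Since the regex matches are handed to re, a pattern can be tried out with re alone before it goes into a grammar. A minimal sketch for Python 3, where \w is Unicode-aware for str patterns by default (the names NUM, NAME and line are only for illustration):

```python
import re

# The same two patterns that would go between slashes in a textX grammar.
NUM = re.compile(r'\d+-\d+')
NAME = re.compile(r'\w+\s+\w+')

line = '2785-599 São Domingo'

# Match the number at the start of the line.
num = NUM.match(line)

# Match the name, starting right after the single space that follows the number.
name = NAME.match(line, num.end() + 1)

print(num.group())
print(name.group())
```

If the pattern matches here, the same pattern should behave the same way between slashes in the grammar.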

Assuming you are trying to parse a file with multiple records like the one you provided, you could do something like this:

# -*- coding: utf-8 -*-
from textx import metamodel_from_str

# Raw string, so \d, \w and \s reach the regex engine without escape warnings.
grammar = r'''
Model: records*=Record;
Record: num=/\d+-\d+/ name=/\w+\s+\w+/;
'''
mm = metamodel_from_str(grammar)

input = '''
2785-599 São Domingo
2785-599 São Domingo
'''

model = mm.model_from_str(input)
print(model.records)
print(model.records[0].name)
print(model.records[0].num)

Note that input text should be interpreted as unicode. textX accepts unicode only. The example above is for Python 3.
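If the records come from a file or from raw bytes, they have to be decoded first. A rough sketch, assuming the data is UTF-8 encoded (the temporary file is only there to make the example self-contained); the resulting string is what you would pass to model_from_str:

```python
import io
import os
import tempfile

# Bytes as they would sit on disk, assuming UTF-8 encoding.
raw = u'2785-599 São Domingo\n'.encode('utf-8')

# Option 1: decode the bytes explicitly.
text_from_bytes = raw.decode('utf-8')

# Option 2: let io.open decode while reading (works on Python 2 and 3).
fd, path = tempfile.mkstemp(suffix='.txt')
try:
    with os.fdopen(fd, 'wb') as f:
        f.write(raw)
    with io.open(path, encoding='utf-8') as f:
        text_from_file = f.read()
finally:
    os.remove(path)

# Both results are unicode text, not bytes.
print(text_from_bytes == text_from_file)
```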

aalsinat · 2017-12-14T15:08:33Z

First of all, thanks for your quick response! I forgot to tell you that I am using Python 2.7. With this version, passing the re.U (re.UNICODE) flag makes the \w sequence dependent on the Unicode character properties database.

igordejanovic · 2017-12-14T17:58:50Z

For Python 2 you can do this:

# -*- coding: utf-8 -*-
from textx import metamodel_from_str

grammar = '''
Model: records*=Record;
Record: num=/\d+-\d+/ name=/(?u)\w+\s+\w+/;
'''
mm = metamodel_from_str(grammar)

input = u'''
2785-599 São Domingo
2785-599 São Domingo
'''

model = mm.model_from_str(input)
print(model.records[0].name)
print(model.records[0].num)

Notice the (?u) at the beginning of the name regex match. This turns on the re.UNICODE flag for that match.
There is also a u prefix on the input string, which makes it a unicode string.
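To see what the flag changes, here is a small sketch using re directly. It is written for Python 3, where str patterns are already Unicode-aware, so re.ASCII is used to imitate the ASCII-only \w that Python 2 gives you without re.UNICODE:

```python
import re

text = u'São Domingo'

# ASCII-only \w (similar to Python 2 without re.UNICODE): stops before 'ã'.
ascii_match = re.match(r'\w+', text, re.ASCII)

# Inline (?u) flag, as used in the grammar above: \w follows Unicode rules.
inline_match = re.match(r'(?u)\w+', text)

# Passing re.UNICODE as an argument has the same effect as the inline (?u).
flag_match = re.match(r'\w+', text, re.UNICODE)

print(ascii_match.group())
print(inline_match.group())
print(flag_match.group())
```

The inline form is the one that fits textX, since the grammar only gives you the pattern text between the slashes.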

There was a slight problem with printing exceptions on syntax errors with unicode chars in Python 2. It should be fixed now on the master branch, so use that version.

aalsinat · 2017-12-14T18:07:55Z

Great! Great job! Thank you so much for your help!

aalsinat changed the title ~~Full unicode~~ Full unicode support Dec 14, 2017

igordejanovic closed this as completed Feb 10, 2018
