spanish usage #75

angelo337 · 2016-12-21T12:53:39Z

Hi there,
Is it possible to use RASA in Spanish, with the Spanish MITIE model?
If so, could you please point me to some resources on the changes needed?
Thanks,
Angelo

amn41 · 2016-12-21T15:23:28Z

The Spanish MITIE models are here. If you unzip them and find the feature extractor file, you should use that as your mitie_file. If you find that the tokenizer isn't working perfectly for Spanish, we can address that.
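As a rough illustration of the suggestion above, the following sketch writes out a config that points mitie_file at the Spanish feature extractor. Only the mitie_file key is named in this thread; the other keys (backend, data, path), the extractor file name, and all paths are assumptions and may differ in your rasa_nlu version.

```python
import json

# Hypothetical config: only `mitie_file` is confirmed by this thread.
# The remaining keys, the file name, and the paths are assumptions.
config = {
    "backend": "mitie",
    "mitie_file": "/path/to/spanish/total_word_feature_extractor.dat",
    "data": "/path/to/training_data.json",
    "path": "/path/to/models",
}

# Serialize the config; save this text as config.json for the train command.
config_json = json.dumps(config, indent=2)
print(config_json)
```

The resulting file would then be passed to training with python -m rasa_nlu.train -c config.json, as in the commands shown later in this thread.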

angelo337 · 2016-12-21T19:46:47Z

I just downloaded that model and put all that information in the config file, but I am getting the error below.
Could you please point out how to fix it?
Thanks

creangel@creangel_hadoop:~/Downloads/mitie/rasa_nlu$ time python -m rasa_nlu.train -c config.json
Training to recognize 4 categories: 'saludo', 'restaurante_busqueda', 'afirmacion', 'despedida'
Train classifier
extracting text features
now do training
num training samples: 63
C: 200 f-score: 0.709677
C: 400 f-score: 0.709677
C: 300 f-score: 0.709677
C: 100 f-score: 0.709677
C: 0.01 f-score: 0.612903
C: 600 f-score: 0.709677
C: 1400 f-score: 0.709677
C: 3000 f-score: 0.709677
C: 5000 f-score: 0.709677
C: 2550 f-score: 0.709677
C: 1325 f-score: 0.709677
C: 712.5 f-score: 0.709677
C: 406.25 f-score: 0.709677
C: 253.125 f-score: 0.709677
C: 176.562 f-score: 0.709677
C: 138.281 f-score: 0.709677
C: 119.141 f-score: 0.709677
C: 109.57 f-score: 0.709677
C: 104.785 f-score: 0.709677
C: 102.393 f-score: 0.709677
C: 101.196 f-score: 0.709677
C: 100.598 f-score: 0.709677
C: 100.299 f-score: 0.709677
best C: 100.598
test on train:
20 0 0 0
0 8 0 0
0 0 21 0
0 0 0 14

overall accuracy: 1
Training time: 429 seconds.
df.number_of_classes(): 4

Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "build/bdist.linux-x86_64/egg/rasa_nlu/train.py", line 65, in <module>
File "build/bdist.linux-x86_64/egg/rasa_nlu/train.py", line 59, in do_train
File "build/bdist.linux-x86_64/egg/rasa_nlu/trainers/mitie_trainer.py", line 25, in train
File "build/bdist.linux-x86_64/egg/rasa_nlu/trainers/mitie_trainer.py", line 42, in train_entity_extractor
File "build/bdist.linux-x86_64/egg/rasa_nlu/trainers/mitie_trainer.py", line 31, in start_and_end
IndexError: list index out of range

amn41 · 2016-12-21T20:14:36Z

looks like there's an error picking up one of your entities. I can't tell if this is a bug or a problem with your data without seeing it.

Please try training intents only (e.g. removing any entities from your training data), and then add them back one by one until you trigger this error. Then please post here the training example which causes the error.

angelo337 · 2016-12-22T00:23:52Z

Hi there,
I just tried your suggestion and it worked like a charm. I figured out my mistake: I started counting from 1 instead of 0 in my sentences.
It is fixed now.
thanks
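For anyone hitting the same off-by-one mistake, a small check like the one below can flag entity annotations whose offsets do not match the text. This is a minimal sketch, assuming each training example is a dict with a "text" field and an "entities" list whose items carry 0-based "start", exclusive "end", and "value" fields. The helper name check_entity_offsets and the sample sentence are made up for illustration.

```python
def check_entity_offsets(example):
    """Return (expected value, actual slice) pairs for entities whose offsets are wrong."""
    problems = []
    text = example["text"]
    for ent in example.get("entities", []):
        # With 0-based start and exclusive end, the slice should equal the value.
        span = text[ent["start"]:ent["end"]]
        if span != ent["value"]:
            problems.append((ent["value"], span))
    return problems


# Offsets counted from 0: the slice matches the value.
good = {
    "text": "busco un restaurante mexicano",
    "entities": [{"start": 21, "end": 29, "value": "mexicano", "entity": "cocina"}],
}

# Offsets counted from 1: the slice is shifted by one character.
bad = {
    "text": "busco un restaurante mexicano",
    "entities": [{"start": 22, "end": 30, "value": "mexicano", "entity": "cocina"}],
}

print(check_entity_offsets(good))
print(check_entity_offsets(bad))
```

Running this over every example before training makes it easier to find the one annotation that triggers the error.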

ghost · 2017-01-12T07:32:02Z

I have the same problem @angelo337 had: IndexError: list index out of range
I am using the expressions.json file from wit.ai.

Is there a problem with training on wit data?
expressions.json.zip

amn41 · 2017-01-12T08:15:55Z

thanks for sharing your training data! I'm able to reproduce this error. It's down to the fact that you have entities like 'perth' in the sentence "what is perths weather like next week". MITIE can only handle entities made up of whole tokens. I will handle this edge case in rasa, but it will still return "perths" rather than "perth" as your location. So for now you will have to resolve that entity yourself. It's on the roadmap to come up with a solution to that, though.
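To spot such cases in your own data, you can check whether an entity span starts and ends on token boundaries. The sketch below uses plain whitespace splitting as a stand-in tokenizer, which is an assumption: MITIE's real tokenizer behaves differently, for example around punctuation. The helper name entity_aligns_with_tokens is hypothetical.

```python
def entity_aligns_with_tokens(text, start, end):
    """True if text[start:end] begins and ends on whitespace-token boundaries."""
    token_starts = set()
    token_ends = set()
    pos = 0
    for tok in text.split():
        # Find each token from the current position so repeated words map correctly.
        begin = text.index(tok, pos)
        token_starts.add(begin)
        token_ends.add(begin + len(tok))
        pos = begin + len(tok)
    return start in token_starts and end in token_ends


text = "what is perths weather like next week"

# "perth" covers characters 8..13 and stops in the middle of the token "perths".
print(entity_aligns_with_tokens(text, 8, 13))

# "perths" covers characters 8..14 and is a whole token.
print(entity_aligns_with_tokens(text, 8, 14))
```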

amn41 · 2017-01-12T08:32:59Z

Although, thinking about it, we could explicitly insert a whitespace in these cases. I will create a new issue and make a proposal.
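One way to read the whitespace idea is sketched below: pad the entity with a space wherever it touches other non-space characters, so that it becomes a whole token. This is only an illustration of the proposal, not what rasa ended up implementing. The function name split_at_entity_boundaries is hypothetical, and the sketch handles a single entity per sentence.

```python
def split_at_entity_boundaries(text, start, end):
    """Insert spaces around text[start:end] where it touches other characters.

    Returns the new text and the entity's new (start, end) offsets.
    """
    before, entity, after = text[:start], text[start:end], text[end:]
    if before and not before[-1].isspace():
        before += " "
    if after and not after[0].isspace():
        after = " " + after
    new_start = len(before)
    return before + entity + after, new_start, new_start + len(entity)


new_text, new_start, new_end = split_at_entity_boundaries(
    "what is perths weather like next week", 8, 13
)
print(new_text)
print(new_text[new_start:new_end])
```

With more than one entity in a sentence, the offsets of later entities would have to be shifted as well, which this sketch does not attempt.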

beeva-lisettegarcia · 2017-02-21T11:14:15Z

Hello

I would like to use rasa for Spanish texts.
I have already downloaded the Spanish MITIE model and prepared the config file.
During training, I get the following error:

python -m rasa_nlu.train -c config.json
Training to recognize 8 categories: 'greet', 'restaurant_search', 'affirm', 'goodbye', 'saludo', 'busqueda_restaurante', 'afirmacion', 'despedida'
Train classifier
extracting text features
now do training
num training samples: 44
C: 200 f-score: 0.525
C: 400 f-score: 0.525
C: 300 f-score: 0.525
C: 100 f-score: 0.525
C: 0.01 f-score: 0.575
C: 50.005 f-score: 0.525
C: 25.0075 f-score: 0.525
C: 12.5088 f-score: 0.525
C: 6.25938 f-score: 0.525
C: 3.13469 f-score: 0.525
C: 1.57234 f-score: 0.525
C: 0.791172 f-score: 0.525
C: 0.400586 f-score: 0.525
best C: 0.01
test on train:
5 0 0 0 0 0 0 0
0 8 0 0 0 0 0 0
0 0 6 0 0 0 1 0
0 0 0 5 0 0 0 0
1 0 0 0 1 0 0 0
0 0 0 0 0 8 0 0
0 0 0 0 0 0 5 0
0 0 0 1 0 0 0 3

overall accuracy: 0.931818
Training time: 854 seconds.
df.number_of_classes(): 8

Traceback (most recent call last):
File "/home/lisettegarcia/miniconda3/envs/py2/lib/python2.7/runpy.py", line 174, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/home/lisettegarcia/miniconda3/envs/py2/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/home/lisettegarcia/miniconda3/envs/py2/lib/python2.7/site-packages/rasa_nlu/train.py", line 65, in <module>
do_train(config)
File "/home/lisettegarcia/miniconda3/envs/py2/lib/python2.7/site-packages/rasa_nlu/train.py", line 59, in do_train
trainer.train(training_data)
File "/home/lisettegarcia/miniconda3/envs/py2/lib/python2.7/site-packages/rasa_nlu/trainers/mitie_trainer.py", line 30, in train
self.entity_extractor = self.train_entity_extractor(data.entity_examples)
File "/home/lisettegarcia/miniconda3/envs/py2/lib/python2.7/site-packages/rasa_nlu/trainers/mitie_trainer.py", line 53, in train_entity_extractor
start, end = self.find_entity(ent, text)
File "/home/lisettegarcia/miniconda3/envs/py2/lib/python2.7/site-packages/rasa_nlu/trainers/mitie_trainer.py", line 35, in find_entity
tokens, offsets = tk.tokenize_with_offsets(text)
File "/home/lisettegarcia/miniconda3/envs/py2/lib/python2.7/site-packages/rasa_nlu/tokenizers/mitie_tokenizer.py", line 24, in tokenize_with_offsets
offset += m.start()
AttributeError: 'NoneType' object has no attribute 'start'

Tracing the error, I found that the problem is in mitie_tokenizer.py, line 22:

m = re.search(re.escape(tok), _text[offset:])

It fails when the text contains words with accents.

Any ideas?

Thanks
busq_restaurante_Data.json.zip
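The sketch below shows one plausible way this failure can arise: if a token comes back in a different encoding than the text, re.search finds no match and returns None, so calling m.start() fails. This is an illustration in Python 3, not the actual rasa_nlu code or its fix. The function is a simplified stand-in for tokenize_with_offsets that takes the tokens as an argument, and the mis-decoded tokens are simulated.

```python
import re


def tokenize_with_offsets(text, tokens):
    """Return the start offset of each token in text, searching left to right."""
    offsets = []
    offset = 0
    for tok in tokens:
        m = re.search(re.escape(tok), text[offset:])
        if m is None:
            # Guard against the 'NoneType' has no attribute 'start' failure.
            raise ValueError("token %r not found in %r" % (tok, text[offset:]))
        offsets.append(offset + m.start())
        offset += m.end()
    return offsets


text = "búsqueda de restaurante"
tokens = ["búsqueda", "de", "restaurante"]
print(tokenize_with_offsets(text, tokens))

# Simulate tokens that were UTF-8 encoded and then decoded with the wrong codec.
mangled = [t.encode("utf-8").decode("latin-1") for t in tokens]
try:
    tokenize_with_offsets(text, mangled)
except ValueError as err:
    print("mismatch:", err)
```

Keeping the text and the tokens in the same unicode representation before searching avoids the mismatch.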

frankai · 2017-02-21T12:26:23Z

I have the same problem as @beeva-lisettegarcia when training on Spanish text with accents. The problem appears to be in the mitie_tokenizer.py script. Any idea or clue on how to fix it? Thanks!

tmbo · 2017-02-21T14:22:43Z

@beeva-lisettegarcia @frankai I just pushed a change that should fix the encoding issue (unfortunately the test that should have ensured this functionality had a bug of its own 😓 ). It would be great if you could test it to see if it solves your issue.

For the future: Please avoid re-using closed issues. Don't hesitate to create new issues. The only thing you should do is the following: make sure the exact problem is not already an existing issue.

cbonadio · 2017-02-21T19:46:55Z

I had the same issue as @beeva-lisettegarcia @frankai, now pulled the changes and it is working.

Thanks

beeva-lisettegarcia · 2017-02-22T09:08:19Z

Thanks, now it is working :-)

angelo337 closed this as completed Dec 22, 2016

tmbo mentioned this issue Dec 24, 2016

api.ai and multi-language support. #15

Merged

tmbo added a commit that referenced this issue Feb 21, 2017

Fixed unicode character tokenization using mitie #75

c52aeb4

vcidst pushed a commit that referenced this issue Jan 23, 2024

Merge pull request #75 from RasaHQ/ATO-1764-dependency-upgrades-oss

0c55e0f

[ATO-1764] Dependency Upgrades
