no dictionary #8
The dictionary is not licensed, so I don't think I can include it in my library. I had a similar problem with my praatio library. |
But you can convert this file into a Python dictionary object and save it as a pickled Python object or as JSON. It would load faster and be easier to upload. |
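A minimal sketch of the conversion suggested above. It assumes a hypothetical plain-text layout with one `word<TAB>pronunciation` entry per line; the real ISLEdict.txt layout may differ, and `parse_lexicon`, `save_json`, and `load_json` are illustrative names, not part of pysle.

```python
import json
from collections import defaultdict


def parse_lexicon(text_path):
    """Parse a 'word<TAB>pronunciation' file into {word: [pronunciations]}.

    The tab-separated layout is an assumption made for illustration only.
    """
    lexicon = defaultdict(list)
    with open(text_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if "\t" not in line:
                continue  # skip blank or malformed lines
            word, pronunciation = line.split("\t", 1)
            lexicon[word].append(pronunciation)
    return dict(lexicon)


def save_json(lexicon, json_path):
    """Write the parsed lexicon to disk as JSON, keeping non-ASCII symbols."""
    with open(json_path, "w", encoding="utf-8") as handle:
        json.dump(lexicon, handle, ensure_ascii=False)


def load_json(json_path):
    """Read a lexicon previously written by save_json."""
    with open(json_path, encoding="utf-8") as handle:
        return json.load(handle)
```

Whether this actually loads faster than parsing the text file depends on the file size and the parsing work involved, so it is worth measuring before committing to it.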
Sorry, I forgot to add the error message:
from pysle import isletool
isletool.LexicalTool('ISLEdict.txt').lookup('cat')
|
As I said, the issue of including it is a legal one. Perhaps I can ask if the data can be released under some license? Either way, maybe I could cache the results or store in some intermediate format to make loading faster. |
So how should I use this package without the necessary data? It would be very good if you converted the necessary files into some JSON or binary format to avoid legacy problems. |
Oh sorry, the link to the necessary data is in the requirements section: The data is derived from an academic project. If there is a way for me to make that clearer, please let me know. |
I have made a new release, 2.1.1. If you try to load the ISLEdict but the file does not exist, an error message states where to find it:
>>> from pysle import isletool
>>> a = isletool.LexicalTool('bloop.txt')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "isletool.py", line 87, in __init__
    raise IsleDictDoesNotExist()
pysle.isletool.IsleDictDoesNotExist: The path to the ISLE dictionary file does not exist.
The ISLE dictionary is an external resource that must be downloaded separately. ISLEdict.txt can be found here:
https://github.com/uiuc-sst/g2ps/tree/master/English/ISLEdict.txt
Please see the requirements section in the README file for more details.
What do you think? It takes about 1 to 1.5 seconds to load ISLEdict.txt into memory. If we pickle the data and then load it, it takes about the same amount of time. I was a bit surprised. Maybe I did something wrong? |
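The missing-file check shown in the traceback above can be sketched roughly as follows. This is not pysle's actual implementation: `DictionaryNotFoundError` and `open_dictionary` are hypothetical names, and the message text is only modeled on the one quoted in the thread.

```python
import os


class DictionaryNotFoundError(Exception):
    """Raised when the dictionary file is missing (hypothetical name)."""

    def __init__(self, path):
        message = (
            f"The path to the dictionary file does not exist: {path}\n"
            "The dictionary is an external resource that must be "
            "downloaded separately."
        )
        super().__init__(message)
        self.path = path


def open_dictionary(path):
    """Open the dictionary file, failing early with a helpful message."""
    if not os.path.isfile(path):
        raise DictionaryNotFoundError(path)
    return open(path, encoding="utf-8")
```

Failing in the constructor, before any parsing starts, means the user sees where to get the file instead of a generic I/O error.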
I also tried serializing/deserializing with json and it wasn't any faster. I think it is slow to load simply because the dictionary file is large (~16 MB). |
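One way to check timings like the ones reported above is a small benchmark on synthetic data. This is only a sketch: the tab-separated layout, the entry count, and all function names are assumptions, and the numbers will vary by machine and say nothing definitive about ISLEdict.txt itself.

```python
import json
import pickle
import tempfile
import time
from pathlib import Path


def time_call(func, *args):
    """Return (result, elapsed_seconds) for a single call."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def read_text(path):
    """Parse a hypothetical 'word<TAB>pronunciation' text file."""
    lexicon = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            word, pronunciation = line.rstrip("\n").split("\t", 1)
            lexicon.setdefault(word, []).append(pronunciation)
    return lexicon


def read_pickle(path):
    with open(path, "rb") as handle:
        return pickle.load(handle)


def read_json(path):
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def compare_loaders(entry_count=50_000):
    """Write one synthetic lexicon in three formats and time each loader."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        text_path = tmp / "lexicon.txt"
        lines = [f"word{i}\tp h o n e {i}\n" for i in range(entry_count)]
        text_path.write_text("".join(lines), encoding="utf-8")

        lexicon = read_text(text_path)
        with open(tmp / "lexicon.pkl", "wb") as handle:
            pickle.dump(lexicon, handle, protocol=pickle.HIGHEST_PROTOCOL)
        with open(tmp / "lexicon.json", "w", encoding="utf-8") as handle:
            json.dump(lexicon, handle)

        loaders = [
            ("text", read_text, text_path),
            ("pickle", read_pickle, tmp / "lexicon.pkl"),
            ("json", read_json, tmp / "lexicon.json"),
        ]
        results, timings = {}, {}
        for name, loader, path in loaders:
            results[name], timings[name] = time_call(loader, path)
    return results, timings


if __name__ == "__main__":
    _, timings = compare_loaders()
    for name, seconds in timings.items():
        print(f"{name}: {seconds:.3f} s")
```

A single run per loader is noisy; repeating each measurement a few times and taking the minimum gives a steadier comparison.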
Okay, now I know where to find the dictionary, thank you. I will try to use it. Will your package work with other languages if I use a dictionary for that language? In g2ps I cannot find files like ISLEdict.txt for other languages. In my project I saved my dictionary (50 KB) as JSON, and loading it was over 5 times faster. I don't know how you parse the ISLEdict.txt file; I converted mine into a Python dict(). Anyway, 1.5 seconds is not bad, don't worry. |
Unfortunately, I don't know of a similar resource file for other languages. If you know of any, please share! |
So I found this collection. Is it what your package needs? |
That collection is useful, but it comes with its own Python code. I think if you want to access the wikipron dataset, you should use their Python library:
If you need languages other than English, you should use wikipron. If you only need English, ISLEdict has 10 times as many words and includes syllable information. wikipron does not include syllable boundaries. |
The upstream project is now MIT licensed. Perhaps that means the file can now be included in this repo? uiuc-sst/g2ps@8abb736 |
Nice find! I should be able to put out a release with it included today or tomorrow. Thanks! |
Release v.2.2.0 is out. You can now just do:
etc. No need to deal with the ISLEX dictionary, as it is now included in the library. Thank you @sevagh! Thanks! |
I think you should list all data files in the MANIFEST.in file.
See https://realpython.com/pypi-publish-python-package/
I do something similar for the Persian language: https://github.com/PasaOpasen/PersianG2P