no dictionary #8

Closed
PasaOpasen opened this issue May 30, 2020 · 15 comments

@PasaOpasen

from pysle import isletool

isletool.LexicalTool('ISLEdict.txt').lookup('cat')

I think you should list all data files in the MANIFEST.in file.

See https://realpython.com/pypi-publish-python-package/

I did something similar for the Persian language: https://github.com/PasaOpasen/PersianG2P

@timmahrt
Owner

The dictionary is not licensed, so I don't think I can include it in my library. I had a similar problem with my praatio library.

@PasaOpasen
Author

But you can convert this file into a Python dictionary object and save it as a Python object or as JSON. It will load faster and will be easier to upload.
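
The conversion suggested here might look roughly like the following. This is only a minimal sketch: the one-pair-per-line, tab-separated format and the names `parse_lexicon`, `save_as_json`, and `load_json` are assumptions made for illustration, not the actual ISLEdict.txt layout or anything provided by pysle.

```python
import io
import json
from collections import defaultdict


def parse_lexicon(text_path):
    """Parse a text lexicon into a dict mapping word -> list of pronunciations.

    Assumes a hypothetical format: one "word<TAB>pronunciation" pair per line.
    """
    lexicon = defaultdict(list)
    with io.open(text_path, "r", encoding="utf-8") as fd:
        for line in fd:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            word, pronunciation = line.split("\t", 1)
            lexicon[word].append(pronunciation)
    return dict(lexicon)


def save_as_json(lexicon, json_path):
    """Write the parsed lexicon to disk as UTF-8 JSON."""
    with io.open(json_path, "w", encoding="utf-8") as fd:
        json.dump(lexicon, fd, ensure_ascii=False)


def load_json(json_path):
    """Read a lexicon previously written by save_as_json."""
    with io.open(json_path, "r", encoding="utf-8") as fd:
        return json.load(fd)
```

Whether loading the JSON copy is actually faster than parsing the text depends on the file's size and format, so it is worth measuring before relying on it.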

@PasaOpasen
Author

PasaOpasen commented May 30, 2020

Sorry, I forgot to add the error message:

from pysle import isletool

isletool.LexicalTool('ISLEdict.txt').lookup('cat')
Traceback (most recent call last):

  File "<ipython-input-1-c4a023343ec3>", line 3, in <module>
    isletool.LexicalTool('ISLEdict.txt').lookup('cat')

  File "C:\ProgramData\Anaconda3\lib\site-packages\pysle\isletool.py", line 74, in __init__
    self.data = self._buildDict()

  File "C:\ProgramData\Anaconda3\lib\site-packages\pysle\isletool.py", line 81, in _buildDict
    with io.open(self.islePath, "r", encoding='utf-8') as fd:

FileNotFoundError: [Errno 2] No such file or directory: 'ISLEdict.txt'

@timmahrt
Owner

As I said, the issue of including it is a legal one. Perhaps I can ask if the data can be released under some license?

Either way, maybe I could cache the results or store in some intermediate format to make loading faster.
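
One way the caching idea could be sketched is shown below. The helper `load_with_cache` and its `build` callback are hypothetical names, not part of pysle; `build` stands in for whatever function parses the dictionary file. The cache is rebuilt whenever the source file is newer than the pickled copy.

```python
import os
import pickle


def load_with_cache(source_path, build, cache_path=None):
    """Return build(source_path), reusing a pickled copy while it is still fresh."""
    if cache_path is None:
        cache_path = source_path + ".pickle"

    # The cache is usable only if it exists and is at least as new as the source.
    if (os.path.exists(cache_path)
            and os.path.getmtime(cache_path) >= os.path.getmtime(source_path)):
        with open(cache_path, "rb") as fd:
            return pickle.load(fd)

    # Otherwise parse the source and store the result for next time.
    data = build(source_path)
    with open(cache_path, "wb") as fd:
        pickle.dump(data, fd, protocol=pickle.HIGHEST_PROTOCOL)
    return data
```

Pickle files should only be loaded when they come from a trusted source, so a cache like this is best kept local rather than distributed.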

@PasaOpasen
Author

So how should I use this package without the necessary data?

It would be very good if you converted the necessary files into some JSON/binary format to avoid legal problems.

@timmahrt
Owner

timmahrt commented May 30, 2020

Oh sorry, the link to the necessary data is in the requirements section:
https://raw.githubusercontent.com/uiuc-sst/g2ps/master/English/ISLEdict.txt

The data is derived from an academic project.

If there is a way for me to make that clearer, please let me know.
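
For anyone who wants to script the download, a minimal sketch follows. It assumes the raw URL quoted above is still valid; `ensure_isle_dict` and its `fetch` parameter are hypothetical names introduced here, not part of pysle.

```python
import os
from urllib.request import urlretrieve

# URL as quoted in this thread; it may have moved since.
ISLE_URL = "https://raw.githubusercontent.com/uiuc-sst/g2ps/master/English/ISLEdict.txt"


def ensure_isle_dict(path="ISLEdict.txt", url=ISLE_URL, fetch=urlretrieve):
    """Download the dictionary to `path` unless a file is already there."""
    if not os.path.isfile(path):
        fetch(url, path)
    return path
```

The returned path could then be passed on, for example `isletool.LexicalTool(ensure_isle_dict())`. The `fetch` parameter exists only so the download step can be swapped out, for instance in tests.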

@timmahrt
Owner

I have made a new release, 2.1.1. If you try to load the ISLEdict but it does not exist, the error message now states where to find the file:

>>> from pysle import isletool
>>> a = isletool.LexicalTool('bloop.txt')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "isletool.py", line 87, in __init__
    raise IsleDictDoesNotExist()
pysle.isletool.IsleDictDoesNotExist: The path to the ISLE dictionary file does not exist.
The ISLE dictionary is an external resource that must  be downloaded separately.  ISLEdict.txt can be found here:
https://github.com/uiuc-sst/g2ps/tree/master/English/ISLEdict.txt
Please see the requirements section in the README file for more details.

What do you think?

It takes about 1~1.5 seconds to load ISLEDict.txt into memory. If we pickle the data and then load it, it takes about the same amount of time. I was a bit surprised. Maybe I did something wrong?

@timmahrt
Owner

I also tried serializing/deserializing with json and it wasn't any faster. I think it is slow to load simply because the dictionary file is large (~16 MB).
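
A comparison along these lines can be reproduced with a small timing sketch. The data below is synthetic, not the real ISLEdict contents, and `time_call` and `compare_loaders` are hypothetical helper names. It times only in-memory deserialization, so disk reads and any parsing of the original text format are left out.

```python
import json
import pickle
import time


def time_call(func, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def compare_loaders(data):
    """Serialize `data` with pickle and json, then time deserializing each."""
    pickled = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)
    jsoned = json.dumps(data)
    from_pickle, pickle_secs = time_call(pickle.loads, pickled)
    from_json, json_secs = time_call(json.loads, jsoned)
    # Both round trips must reproduce the original data.
    assert from_pickle == data and from_json == data
    return {"pickle": pickle_secs, "json": json_secs}


# Synthetic stand-in for a parsed lexicon: word -> list of pronunciations.
sample = {"word%d" % i: ["p r o n %d" % i] for i in range(100000)}
timings = compare_loaders(sample)
for name, secs in timings.items():
    print("%s: %.3f s" % (name, secs))
```

Results vary by machine and Python version, so a single run is only a rough indication.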

@PasaOpasen
Author

PasaOpasen commented May 31, 2020

Okay, now I know the path to the dictionary, thank you. I will try to use it.

So will your package work with other languages if I use their dictionaries? In g2ps, though, I cannot find files like ISLEDict.txt for other languages.

In my project I saved my dictionary (50 KB) as JSON, and loading it was over 5 times faster. I don't know how you transform the ISLEDict.txt file; I transformed mine into a Python dict(). Anyway, 1.5 seconds is not bad, don't worry.

@timmahrt
Owner

Unfortunately, I don't know of a similar resource file for other languages. If you know of any, please share!

@PasaOpasen
Author

So I found this collection. Is it exactly what your package needs?

@timmahrt
Owner

That collection is useful, but it comes with its own Python code. I think if you want to access the wikipron dataset, you should use their Python library:

https://github.com/kylebgorman/wikipron

If you need languages other than English, you should use wikipron.

If you only need English, ISLEdict has 10 times as many words and includes syllable information. wikipron does not include syllable boundaries.

@sevagh

sevagh commented Nov 16, 2020

The upstream project is now MIT licensed. Perhaps that means the file can now be included in this repo? uiuc-sst/g2ps@8abb736

@timmahrt
Owner

Nice find! I should be able to put out a release with it included today or tomorrow. Thanks!

@timmahrt timmahrt reopened this Nov 17, 2020
@timmahrt
Owner

timmahrt commented Nov 18, 2020

Release v.2.2.0 is out. You can now just do:

from pysle import isletool
isleDict = isletool.LexicalTool()
isleDict.lookup('pumpkin')

etc

No need to deal with the ISLEX dictionary, as it is now included in the library. Thank you @sevagh

Thanks!
