german presage #26
Comments
|
Very good, thanks for looking into it! The corpus has to be a plain text file
that will be processed by the software. For reference, the text files used so
far ranged from 200 MB to 2 GB.
|
|
I'm a bit confused by all the XML and other formats of the German corpora. |
|
No, we are looking for text with words in sentences, such as a collection
of articles, texts of speeches, logs of chatrooms, or movie subtitles:
something from which you can calculate the chance of seeing a given word after
a combination of other words. A dictionary alone will not help in this case
… |
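To illustrate why running text is needed rather than a word list, here is a minimal sketch of counting which word follows a given combination of words. It is not presage's own code; the function names `train_ngrams` and `next_word_probability` and the sample sentence are made up for this example.

```python
from collections import Counter, defaultdict


def train_ngrams(text, n=3):
    """Count which word follows each (n-1)-word context in the text."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        context = tuple(words[i:i + n - 1])
        counts[context][words[i + n - 1]] += 1
    return counts


def next_word_probability(counts, context, word):
    """Estimate the chance of seeing `word` right after `context`."""
    followers = counts.get(tuple(context))
    if not followers:
        return 0.0  # this context never occurred in the corpus
    return followers[word] / sum(followers.values())


# Hypothetical toy corpus; a real one would be hundreds of megabytes.
sample = "ich gehe nach hause ich gehe nach berlin ich gehe nach hause"
model = train_ngrams(sample, n=3)
print(next_word_probability(model, ["gehe", "nach"], "hause"))
```

A dictionary has no word order, so there would be nothing to count here. That is why sentences are required.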
|
OK, understood. Now I have found four files here: |
|
It should be simple to write a script that removes the count lines and then process the result with presage. So that's already good. |
|
Maybe I could simply delete all digit characters in the files, but that would probably corrupt some of the sentences? I don't think I have the skills to write a script :-/ |
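For anyone picking this up, a rough sketch of such a script follows. It assumes that every line of the corpus files starts with a numeric index followed by whitespace and then the sentence; the actual file layout was not verified here. The names `strip_leading_number` and `clean_corpus` and the file names in the usage note are hypothetical. Only the leading number is removed, so digits inside a sentence stay intact.

```python
import re

# Assumed layout: optional spaces, an index number, whitespace, then the sentence.
LEADING_NUMBER = re.compile(r"^\s*\d+\s+")


def strip_leading_number(line):
    """Remove only the index at the start of the line, not digits in the text."""
    return LEADING_NUMBER.sub("", line, count=1)


def clean_corpus(in_paths, out_path):
    """Merge several numbered files into one plain text file of sentences."""
    with open(out_path, "w", encoding="utf-8") as out:
        for path in in_paths:
            with open(path, encoding="utf-8") as src:
                for line in src:
                    sentence = strip_leading_number(line).strip()
                    if sentence:  # skip lines that are empty after cleaning
                        out.write(sentence + "\n")
```

It could be called as `clean_corpus(["part1.txt", "part2.txt"], "corpus_de.txt")`. Note that a sentence which itself begins with a number and has no index in front of it would lose that number, so the assumption about the layout matters.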
|
Thanks for finding the corpus files! No, you cannot upload files over here. The best you could do is to make a list of specific links to the files we are expected to import. As you don't know how to write a script, we'll have to wait until either I or someone else has time to look into it. I can't promise any specific time on my side, but if nobody volunteers, I'll look into it. Let's just use this issue to coordinate the effort and to let others know if someone starts working on it. So, please make a list of direct links to the files we need to download to get the texts. If they use that format with one number in front of each sentence, so be it. We will take that into account. |
|
Thanks for your patience! Let me know if there is something else I can do. I know it's a little hard for someone experienced like you to handle "unprofessionals" like me, but I really like to learn and contribute, so sorry for any inconvenience. |
|
I am sorry, I completely forgot about it. Would you mind resending it? |
|
Made one file out of it and deleted all numbers. Hope you can use it. Best regards! |
|
Hmm, strange - the link has expired again. |
|
OK, next try. If this fails too, I will use a different service :-) |
|
I have just pushed German packages to OpenRepos. Please test and report back. |
|
Thank you so much. I tested it for "normal" purposes like mails, SMS, notes and so on, and I think it's working just fine. A friend of mine who is used to Android's text prediction was totally OK with it, too! |
|
Excellent! I am closing it here. Please feel free to comment later in the closed issue or open a new one if needed. |
|
Thank you again for your excellent work - if I can buy you a cup of coffee, let me know. |
|
No worries, the main thing was to refresh my memory on how it was done. If you wish to donate, there is a link on GitHub that is easy to find by searching for "donate rinigus", as far as I can see. But please feel free to skip it ... Glad it worked out rather easily. |
I would like to have a German presage. On OpenRepos I've read that it needs a text corpus for that. I will try to get one. Is there anything special I have to take care of?
Best regards