german presage #26

Closed
matzgewinn opened this issue Feb 18, 2020 · 17 comments

Comments

@matzgewinn

I would like to have German support in Presage. On OpenRepos I've read that it needs a text corpus for that. I will try to get one. Is there anything special I need to take care of?
Best regards

@rinigus
Collaborator

rinigus commented Feb 18, 2020 via email

@matzgewinn
Author

I'm a bit confused by all the XML and other formats of the German corpora.
So far I have two word lists. They are much smaller than the size you mentioned and come from here:
https://github.com/solariz/german_stopwords
and here:
https://sourceforge.net/projects/germandict/
They are plain text files. Is this enough, and if so, how should I proceed?

@rinigus
Collaborator

rinigus commented Feb 18, 2020 via email

@matzgewinn
Author

OK, understood. Now I have found four files here:
https://wortschatz.uni-leipzig.de/de/download
They are from articles, the web, and Wikipedia. The files are 55 MB, 118 MB, 114 MB and 131 MB. Unfortunately the lines in the files are numbered, so there is a number before each sentence.

@rinigus
Collaborator

rinigus commented Feb 19, 2020

It should be simple to write a script that removes the line numbers and then process the result with Presage. So that's already good.
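Such a script could look roughly like the sketch below. It is only an illustration, not the script that was eventually used: it assumes every line starts with a counter followed by a tab or spaces, and the names `strip_line_number` and `clean_corpus` are made up for this example. It removes only the leading counter, so digits inside a sentence are left alone.

```python
import re

# Assumption: each corpus line looks like "<number><TAB or spaces><sentence>".
# A line that has no counter but begins with a number would lose that number.
_LEADING_NUMBER = re.compile(r"^\s*\d+[\t ]+")


def strip_line_number(line: str) -> str:
    """Remove only the leading counter; digits inside the sentence are kept."""
    return _LEADING_NUMBER.sub("", line, count=1)


def clean_corpus(sources, destination):
    """Write all numbered source files into one plain-text file without counters."""
    with open(destination, "w", encoding="utf-8") as out:
        for source in sources:
            with open(source, encoding="utf-8") as src:
                for line in src:
                    out.write(strip_line_number(line))
```

For example, `clean_corpus(["news.txt", "web.txt"], "german_corpus.txt")` would merge two downloaded files (the file names here are placeholders) into a single text file that can then be handed to the corpus processing step.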

@matzgewinn
Author

Maybe I could simply delete all digit characters in the files, but that would probably corrupt some of the sentences, wouldn't it? I don't think I have the skills to write a script :-/
Anyway, should I upload the files here, or what is the next step?
Best regards!

@rinigus
Collaborator

rinigus commented Feb 20, 2020

Thanks for finding the corpus files!

No, you cannot upload files over here. The best you could do is to make a list with specific links for the files we are expected to import.

As you don't know how to write a script, we'll have to wait until either I or someone else has time to look into it. I can't promise any specific time on my side, but if nobody volunteers, I'll look into it.

Let's just use this issue to coordinate the effort and warn others if someone starts working on it.

So, please make a list of direct links to the files we need to download to get the texts. If it's using that format with one number in front of the sentence, so be it. We will take that into account.

@matzgewinn
Author

Thanks for your patience!
At least I was able to delete the numbers (using the "sed" command). I hope the files are usable for you.
Here is the link, it is valid for seven days from now:
https://send.firefox.com/download/f9cfb13f4adb6d6a/#X7qFjSYId7PpfVkRcnmWGw

Let me know if there is something else I can do. I know it's a little hard for someone as experienced as you to deal with "non-professionals" like me, but I would really like to learn and contribute, so sorry for any inconvenience.

@rinigus
Collaborator

rinigus commented Mar 13, 2020

I am sorry, I completely forgot about it. Would you mind resending it?

@matzgewinn
Author

Made one file out of it and deleted all numbers. Hope you can use it. Best regards!
https://send.firefox.com/download/1ef75eda68a7e53d/#oZpe7VFklKPXPdtJs52y1w

@rinigus
Collaborator

rinigus commented Mar 13, 2020

Hmm, strange. The link has expired again.

@matzgewinn
Author

OK, next try. If this fails too, I will use a different service :-)
https://send.firefox.com/download/4bf05f4b2c3e44e8/#NKsjqPx2kLK0R51xFo-S5w

@rinigus
Collaborator

rinigus commented Mar 15, 2020

I have just pushed German packages to OpenRepos. Please test and report back.

@matzgewinn
Author

Thank you so much. I tested it for "normal" purposes like mail, SMS, notes and so on, and I think it's working just fine. A friend of mine who is used to Android's text prediction was totally OK with it, too!
So let's hear what other people say...

@rinigus
Collaborator

rinigus commented Mar 16, 2020

Excellent! I am closing it here. Please feel free to comment later in the closed issue or open a new one if needed.

@rinigus rinigus closed this as completed Mar 16, 2020
@matzgewinn
Author

Thank you again for your excellent work. If I can buy you a cup of coffee, let me know.

@rinigus
Collaborator

rinigus commented Mar 16, 2020

No worries, the main thing was to refresh my memory on how it was done. If you wish to donate, there is a link on GitHub, and it is easy to find by searching for "donate rinigus", as far as I can see. But please feel free to skip it ...

Glad it worked out rather easily.
