Greek support #282

Closed
cschwem2er opened this issue Nov 7, 2016 · 5 comments

Comments

@cschwem2er

Hi, are there any plans to fully support Greek as a language? From what I can tell, there are no stopwords available, although some sources already exist.

@kbenoit
Collaborator

kbenoit commented Nov 7, 2016

Thanks for the link, that's very useful. Reviewing it, however, I see that its list of English stopwords (895 words) is far longer than the fairly conservative lists currently available in quanteda through stopwords() (where length(stopwords("english")) is 174).

The Greek list from the source you linked, by contrast, has only 79 words, so it is much smaller. This also shows that the coverage of the different languages in your source is highly imbalanced. What do you think of the 79 words, compared to the other language lists in stopwords()?

@cschwem2er
Author

To be honest, I have absolutely no idea. I'm in the middle of processing Greek parliamentary written questions and only see a bunch of weird characters ;-).
I will contact a native speaker ASAP and let you know his or her opinion about the stopwords.

@cschwem2er
Author

cschwem2er commented Nov 22, 2016

EL_stopwords.xlsx
I just sent you a mail with a pretty exhaustive stop word list created by a native speaker (and translated to English).

kbenoit added a commit that referenced this issue Nov 23, 2016
@kbenoit kbenoit closed this as completed Nov 23, 2016
@cschwem2er
Author

cschwem2er commented Nov 23, 2016

There is still one mistake in the stopword list: "τηs", which is the translation of "hers", should be replaced with "της".
A fixed spreadsheet is available here.
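
The wrong entry appears to end in a Latin "s" rather than the Greek final sigma "ς", which is easy to miss by eye. A minimal sketch of how such mixed-script entries could be flagged automatically, in Python and independent of quanteda: the helper names `scripts_in` and `mixed_script_words` are made up for illustration, and taking the first word of the Unicode character name as the script is a rough heuristic, not a proper script lookup.

```python
import unicodedata


def scripts_in(word):
    """Return the set of scripts used by the letters in `word`.

    Heuristic: the first word of a character's Unicode name
    (e.g. "GREEK SMALL LETTER TAU") is treated as its script.
    """
    scripts = set()
    for ch in word:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def mixed_script_words(words):
    """Return the words whose letters come from more than one script."""
    return [w for w in words if len(scripts_in(w)) > 1]


# The two spellings from this thread, plus a purely Latin word for contrast.
candidates = ["της", "τηs", "hers"]
print(mixed_script_words(candidates))
```

Running a check like this over the whole spreadsheet column before importing it might catch similar typos.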

kbenoit added a commit that referenced this issue Nov 23, 2016
@kbenoit
Collaborator

kbenoit commented Nov 23, 2016

Thanks, fixed in 8fc9954
