
implementing stop words #17

Open
trutzig89182 opened this issue Feb 7, 2022 · 2 comments
Labels: prio:high (must be done some time soon)

Comments

@trutzig89182 (Collaborator)

Yes, this sounds more reasonable than checking the whole stop word list for every single word. Regarding the jsonl file: since this is our program, you can also implement a special "jsonl" option for doc_type, if you want. This would be very specific (since it would be for Twitter JSON only, I guess)... but why not. Otherwise, it should theoretically work by passing an iterator (class) with doc_type="iterable" that iterates over the jsonl files. An example can be found here in the gensim documentation (section "Training your own model"): https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#sphx-glr-auto-examples-tutorials-run-word2vec-py
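
For the iterator route, something like this minimal sketch could work (the directory layout, the "text" field, and the whitespace tokenization are assumptions based on Twitter JSONL exports, following the MyCorpus pattern from the linked gensim tutorial):

```python
import json
from pathlib import Path

class JsonlCorpus:
    """Streams one token list per document from JSONL files.

    Sketch only: assumes one JSON object per line with a "text" field,
    as in Twitter API exports.
    """

    def __init__(self, directory):
        self.paths = sorted(Path(directory).glob("*.jsonl"))

    def __iter__(self):
        for path in self.paths:
            with path.open(encoding="utf-8") as f:
                for line in f:
                    # Whitespace tokenization is a placeholder; the real
                    # pipeline may tokenize differently.
                    yield json.loads(line)["text"].split()
```

An instance could then be passed to the counting function with doc_type="iterable".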

/EDIT: Ah, I am unsure how the counting with and without stop words works. Are stop words also excluded from the total word count? This would be important to know: if they are, then deleting the corresponding rows in the final results table would already be too late.

Originally posted by @thomjur in #15 (comment)

@trutzig89182 mentioned this issue Feb 7, 2022
@trutzig89182 (Collaborator, Author)

The total word count is handled via full_counter, right? Then it would contain the stop words. One way of excluding them from a final file (if that is wanted) could be to look up the count for each stop word before deleting the item, and to sum these counts up along the way. That would also make it possible to print out how many words were excluded via the stop list.
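
A minimal sketch of that idea, assuming full_counter is a collections.Counter over all tokens (the function name is illustrative, not part of the codebase):

```python
from collections import Counter

def drop_stop_words(full_counter: Counter, stop_words: set) -> int:
    """Remove stop words from the counter in place and return how many
    tokens were excluded in total."""
    excluded = 0
    for word in stop_words:
        # pop() returns the stored count (0 if the word never occurred),
        # so the removed tokens can be tallied and reported.
        excluded += full_counter.pop(word, 0)
    return excluded
```

The returned total could then be printed or written alongside the final results file.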

But perhaps it would also be better to have two different kinds of stop lists? If we want to exclude punctuation and links from being counted at all, it would make sense to apply that within the function. If it is about excluding actual words without any expected keyness, we would still want them to count as words when defining the 3-token range, wouldn't we? So that would probably be a reason to exclude them only after gathering the collocations.
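
To make the distinction concrete, a hypothetical two-list sketch (both function names and the regexes are illustrative):

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")
PUNCT_PATTERN = re.compile(r"^\W+$")

def is_noise(token: str) -> bool:
    # First kind: checked inside the counting function, so links and
    # punctuation never enter the token stream or the 3-token window.
    return bool(URL_PATTERN.match(token)) or bool(PUNCT_PATTERN.match(token))

def drop_keyness_stop_words(results: dict, stop_words: set) -> dict:
    # Second kind: applied only after the collocations are gathered, so
    # these words still filled window positions during counting.
    return {word: count for word, count in results.items()
            if word not in stop_words}
```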

Probably the most difficult part is to work out what stop words mean for the statistical measures you have started to include.

Originally posted by @trutzig89182 in #15 (comment)

@trutzig89182 added the prio:high (must be done some time soon) label Feb 7, 2022
@thomjur (Owner) commented Feb 7, 2022

The problem is that I am not a trained (computational) linguist either, and I am not sure which procedure is most common. I think it might be reasonable to leave as many words "in" as possible; otherwise it might be strange if the word counts differ significantly from the actual number of words in the corpus. I think your initial idea to simply ignore the stop words in the results table sounds best, but we can check that. Also, I often take care of deleting the stop words before I feed the documents into a program. Punctuation: I think our current procedure already ignores punctuation... I thought this makes sense, but maybe I am wrong...
