
implementing stop words #17

Open
trutzig89182 opened this issue Feb 7, 2022 · 2 comments
Labels: prio:high (must be done some time soon)

Comments

@trutzig89182 (Collaborator)

Yes, this sounds more reasonable than checking the whole stop word list for every single word. Regarding the jsonl file: since this is our program, you can also implement a special "jsonl" option for doc_type, if you want. This would be very specific (since it would be for Twitter JSON only, I guess)... but why not. Otherwise, it should theoretically work by passing an iterator (class) with doc_type="iterable" that iterates over the jsonl files. An example can be found here in the gensim documentation (section "Training your own model"): https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#sphx-glr-auto-examples-tutorials-run-word2vec-py
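
For the iterator route, something like this minimal sketch could work (the directory layout, the "text" field, and the whitespace tokenization are assumptions based on Twitter JSONL exports, following the MyCorpus pattern from the linked gensim tutorial):

```python
import json
from pathlib import Path

class JsonlCorpus:
    """Streams one token list per document from JSONL files.

    Sketch only: assumes one JSON object per line with a "text" field,
    as in Twitter API exports.
    """

    def __init__(self, directory):
        self.paths = sorted(Path(directory).glob("*.jsonl"))

    def __iter__(self):
        for path in self.paths:
            with path.open(encoding="utf-8") as f:
                for line in f:
                    # Whitespace tokenization is a placeholder; the real
                    # pipeline may tokenize differently.
                    yield json.loads(line)["text"].split()
```

An instance could then be passed to the counting function with doc_type="iterable".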

/EDIT: Ah, I am unsure how the counting with and without stop words works. Are stop words also excluded from the total word count? This would be important to know: if they are, then deleting the corresponding rows in the final results table would already be too late.

Originally posted by @thomjur in #15 (comment)

@trutzig89182 mentioned this issue Feb 7, 2022
@trutzig89182 (Collaborator, Author)

The total word count is handled via full_counter, right? Then it would contain the stop words. One way of excluding them from a final file (if that is wanted) could be to look up the count for each stop word before deleting the item, and to sum these counts up along the way. That would also make it possible to print out how many words were excluded via the stop list.
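
A minimal sketch of that idea, assuming full_counter is a collections.Counter over all tokens (the function name is illustrative, not part of the codebase):

```python
from collections import Counter

def drop_stop_words(full_counter: Counter, stop_words: set) -> int:
    """Remove stop words from the counter in place and return how many
    tokens were excluded in total."""
    excluded = 0
    for word in stop_words:
        # pop() returns the stored count (0 if the word never occurred),
        # so the removed tokens can be tallied and reported.
        excluded += full_counter.pop(word, 0)
    return excluded
```

The returned total could then be printed or written alongside the final results file.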

But perhaps it would also be better to have two different kinds of stop lists? If we want to exclude punctuation and links from being counted at all, it would make sense to apply that within the function. If it is about excluding actual words without any expected keyness, we would still want them to count as words when defining the 3-token range, wouldn't we? So that would probably be a reason to exclude them only after gathering the collocations.
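
To make the distinction concrete, a hypothetical two-list sketch (both function names and the regexes are illustrative):

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")
PUNCT_PATTERN = re.compile(r"^\W+$")

def is_noise(token: str) -> bool:
    # First kind: checked inside the counting function, so links and
    # punctuation never enter the token stream or the 3-token window.
    return bool(URL_PATTERN.match(token)) or bool(PUNCT_PATTERN.match(token))

def drop_keyness_stop_words(results: dict, stop_words: set) -> dict:
    # Second kind: applied only after the collocations are gathered, so
    # these words still filled window positions during counting.
    return {word: count for word, count in results.items()
            if word not in stop_words}
```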

Probably the most difficult part is to work out what stop words mean for the statistical measures you have started to include.

Originally posted by @trutzig89182 in #15 (comment)

@trutzig89182 added the prio:high (must be done some time soon) label Feb 7, 2022
@thomjur (Owner) commented Feb 7, 2022

The problem is that I am not a trained (computational) linguist either, and I am not sure which procedure is most common. I think it might be reasonable to leave as many words "in" as possible; otherwise it might be strange if the word counts differ significantly from the actual number of words in the corpus. I think your initial idea to simply ignore the stop words in the results table sounds best, but we can check that. Also, I often take care of deleting the stop words before I feed the documents into a program. Punctuation: I think our current procedure already ignores punctuation... I thought this makes sense, but maybe I am wrong...
