Program Structure #15
I have started restructuring the program accordingly. I also needed to change the tests; they seem to work.
Okay, last change for today: One can now choose between
It worked and created a CSV file with the results in my folder. Note that I have also uploaded the
Looks great, and the core structure seems right to me. I am currently still trying to find out how best to use it with Twitter jsonl files, but that is basically one special case of importing data into PyCollocations and does not change anything in the core structure. Once we work on it, stop word lists could be another thing to hand over to the package. I think there would be two ways of doing it: either excluding stop words when collecting the collocations, or excluding them when returning results. My first impulse is that the second option would be better, as it keeps the core function less complex.
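The second option could be sketched roughly like this (a minimal sketch, not the actual PyCollocations API; `filter_results` and the example stop list are hypothetical names):

```python
from collections import Counter

# Placeholder stop list; in the real package this would be handed over
# by the user.
STOP_WORDS = {"the", "a", "and", "of"}

def filter_results(collocations: Counter, stop_words: set) -> Counter:
    """Return a copy of the collocation counts without stop-word keys.

    The core collection logic stays untouched; stop words are only
    dropped when the results are returned (option two above).
    """
    return Counter({word: count for word, count in collocations.items()
                    if word not in stop_words})

counts = Counter({"data": 5, "the": 42, "analysis": 3})
print(filter_results(counts, STOP_WORDS))
```

This keeps the per-token hot path free of stop-list lookups; the filtering cost is paid once, over the (much smaller) results table.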
Yes, this sounds more reasonable than checking the whole stop word list for every single word. Regarding the jsonl file: since this is our program, you can also implement a special "jsonl" option for
/EDIT: Ah, I am unsure how the counting with/without stop words works. Are they also excluded from the total word count? This would be important to know, since it would mean that deleting the corresponding rows in the final results table is too late.
The total word count is handled via full_counter, right? Then it would contain the stop words. One way of excluding them in a final file (if that is wanted) could be to get the value for each stop word before deleting the item and adding it up during this process. That would also allow us to print out how many words were excluded via the stop list. But perhaps it would also be better to have two different kinds of stop lists? If we want to exclude punctuation and links from being counted, it would make sense to apply this within the function. If it is about excluding actual words without any expected keyness, we still want them to count as words for defining the 3-token range, wouldn't we? So this would probably be a reason to exclude them after gathering the collocations. Probably the most difficult part is to have a look at what stop words mean for the statistical measures you started to include.
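Tallying the excluded counts while deleting could look like this (a sketch assuming full_counter is a `collections.Counter`; `remove_stop_words` is a hypothetical helper, not part of the existing code):

```python
from collections import Counter

def remove_stop_words(full_counter: Counter, stop_words: set):
    """Delete stop words from the counter and return the counter plus
    the total number of tokens that were excluded, so the share of
    removed words can be reported alongside the results."""
    excluded = 0
    for word in stop_words:
        if word in full_counter:
            # Read the value before deleting, as described above.
            excluded += full_counter[word]
            del full_counter[word]
    return full_counter, excluded

counter = Counter({"corpus": 4, "the": 10, "of": 7})
counter, excluded = remove_stop_words(counter, {"the", "of"})
print(excluded)  # number of tokens excluded via the stop list
```

Note that this only changes the final table; the stop words still occupied token positions while collecting, so the 3-token windows are unaffected.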
The problem is that I am not a trained (computational) linguist either, and I am not sure which procedure is most common. I think it might be reasonable to leave as many words "in" as possible; otherwise it might be strange if the word counts differ significantly from the number of words in the actual corpus. I think your initial idea to just ignore the stop words in the results table sounds best, but we can check that. Also, I often take care of deleting the stop words before I feed the documents into a program. Punctuation: I think our current procedure already ignores punctuation... I thought this makes sense, but maybe I am wrong...
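For what "ignoring punctuation" might mean in practice, here is a minimal sketch (assuming whitespace tokenization; `tokenize` is an illustrative name, not necessarily what the program currently does):

```python
import string

def tokenize(text: str) -> list:
    """Lower-case, split on whitespace, and strip surrounding punctuation
    from each token; tokens that were pure punctuation are dropped."""
    tokens = []
    for raw in text.lower().split():
        token = raw.strip(string.punctuation)
        if token:  # skip items like "--" that strip down to nothing
            tokens.append(token)
    return tokens

print(tokenize("Well, the corpus -- it works!"))
# ['well', 'the', 'corpus', 'it', 'works']
```

With this approach punctuation never enters the counters at all, so no punctuation stop list is needed; a separate word-level stop list would then only affect the results table.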
@trutzig89182
I have started thinking about the general structure based on the comments you made. Here is a first draft:
So, there are basically two ways one could interact with this program:
the start_collocation_analysis() function. What do you think?
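As a rough idea of what the entry point could look like, here is a sketch (the signature, window handling, and CSV output are assumptions based on this thread, not the actual draft):

```python
import csv
from collections import Counter

def start_collocation_analysis(texts, node_word, window=3, output_csv=None):
    """Hypothetical entry point: count words co-occurring within
    `window` tokens of `node_word` and optionally write a CSV file."""
    collocates = Counter()
    for text in texts:
        tokens = text.lower().split()
        for i, token in enumerate(tokens):
            if token == node_word:
                left = max(0, i - window)
                neighbours = tokens[left:i] + tokens[i + 1:i + 1 + window]
                collocates.update(neighbours)
    if output_csv:
        with open(output_csv, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["collocate", "count"])
            writer.writerows(collocates.most_common())
    return collocates

print(start_collocation_analysis(["the data analysis of the data"], "data"))
```

A single function like this would cover the "run it directly and get a CSV" way of interacting; importing the package and calling the lower-level pieces would be the other.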
Here is the draw.io file that you can also change. We can also use a different program; this is only a first draft. https://1drv.ms/u/s!AjzmoTNnf_mknqAUBlIqx-OB5jdLAg?e=5WtRyu