Unique word forms count versus corpus terms count #20

Closed
pbstudent opened this issue Apr 21, 2022 · 3 comments

Comments

@pbstudent

pbstudent commented Apr 21, 2022

Although I could not find a statement clarifying the difference in the online help or in the Voyant Tools info popups, it appears that the "unique word forms count" is computed before stopword processing and the "corpus terms count" after it. Is that correct?

@ajmacdonald
Collaborator

There are info popups in the summary panel that explain this. The unique word forms count is the total number of words after discarding duplicate occurrences, so "the" would be counted only once even if it occurs 100 times in the corpus. It is unrelated to stopwords.
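As an illustration of the distinction described above, here is a minimal Python sketch that counts total words versus unique word forms. It is not Voyant Tools code: the `tokenize` helper, its regex, and the sample sentence are assumptions made only for this example, and Voyant's own tokenizer may split text differently.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase the text and treat runs of letters or
    # apostrophes as words. Voyant's real tokenizer may differ.
    return re.findall(r"[a-z']+", text.lower())


text = "The cat sat on the mat. The dog sat too."
tokens = tokenize(text)
counts = Counter(tokens)

total_tokens = len(tokens)        # every occurrence counts
unique_word_forms = len(counts)   # each distinct form counts once

print(total_tokens, unique_word_forms, counts["the"])
```

In this sample, "the" occurs three times but contributes only one unique word form, whether or not it appears on a stopwords list.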

@pbstudent
Author

pbstudent commented Apr 25, 2022

Yes, the info blocks are helpful. However, terms such as "the" are part of the default stopwords list, and the corpus terms count is lower than the unique word forms count. Is the corpus terms count the result after stopwords filtering, whereas the unique word forms count includes words in the stopwords list?

@ajmacdonald
Collaborator

The default stopwords list is applied globally, and that includes the corpus terms panel.
Metadata is unaffected, however, e.g. the unique word forms statement in the summary panel or the stats in the documents panel.
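The behavior described here can be modeled with a rough Python sketch: the unique word forms figure is taken from the unfiltered counts, while the term list shown to the user has stopwords removed. This is only a model of what the thread describes, not Voyant's implementation; `STOPWORDS`, `tokenize`, and `summarize` are hypothetical names, and the tiny stopwords set is not Voyant's default list.

```python
import re
from collections import Counter

# Hypothetical, tiny stopwords set for illustration only.
STOPWORDS = {"the", "on", "too"}


def tokenize(text):
    # Assumption: lowercase and split on runs of letters or apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def summarize(text, stopwords):
    counts = Counter(tokenize(text))
    # "Metadata" figure: computed before any stopword filtering.
    unique_word_forms = len(counts)
    # Terms-panel view: stopwords are filtered out.
    visible_terms = {t: n for t, n in counts.items() if t not in stopwords}
    return unique_word_forms, visible_terms


unique_word_forms, visible_terms = summarize(
    "The cat sat on the mat. The dog sat too.", STOPWORDS
)
print(unique_word_forms, len(visible_terms))
```

Under these assumptions the filtered term list is shorter than the unique word forms count, which matches the gap between the two numbers reported in this issue.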
