collocations question #220

@Jim89

Hi Ken,

I have a question about quanteda collocations that I'm hoping you might be able to help with.

I'm looking for collocations of terms within a corpus of customer complaints and I've turned again to quanteda to help with this.

I notice that when you calculate the collocations you first tokenise the texts with simplify = TRUE, which returns a single vector of all tokens across all documents in the corpus rather than a separate set of tokens for each document.

This means that I'm detecting collocations that span document boundaries, not just those that occur within a single document, and I'm not sure if that's correct.

For example, if document 1 contains the text "This is a test" and document 2 contains the text "This is also a test", then the collocation "test this" is returned, even though it doesn't exist within any single document in the corpus.
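
To make the example above concrete, here is a minimal sketch of the effect in plain Python. It is not quanteda's code: `tokenize` and `bigrams` are hypothetical helpers, and I'm assuming simple lowercasing and whitespace splitting purely for illustration. It compares bigrams taken from one pooled token vector with bigrams taken document by document.

```python
from collections import Counter

# The two example documents from above.
docs = ["This is a test", "This is also a test"]


def tokenize(text):
    # Hypothetical helper: lowercase and split on whitespace.
    return text.lower().split()


def bigrams(tokens):
    # Adjacent token pairs, in order.
    return list(zip(tokens, tokens[1:]))


# Pooled: flatten every document into one token vector first,
# which is roughly what a "simplified" tokenisation amounts to.
pooled_tokens = [tok for doc in docs for tok in tokenize(doc)]
pooled = Counter(bigrams(pooled_tokens))

# Per document: build bigrams inside each document, then combine the counts.
per_doc = Counter(bg for doc in docs for bg in bigrams(tokenize(doc)))

# Bigrams that exist only because two documents were joined end to end.
spurious = set(pooled) - set(per_doc)
print(spurious)
```

With these two documents, the only pair in `spurious` is ("test", "this"), formed by the last token of document 1 and the first token of document 2.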

Are you able to shed any light on why this is the behaviour of collocations, and if this is correct?

Thanks very much and I hope you're well. I'm enjoying the work I'm doing with analysis of text and quanteda makes things a lot easier!

Jim.
