How to input tokenized sentences into sklearn.feature_extraction.text.TfidfVectorizer  #17279

@DachuanZhao

Description

I found this example in the documentation:

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

But in my real NLP project, I tokenize sentences in my own way, which means the corpus is a list of token lists, like this:

from sklearn.feature_extraction.text import TfidfVectorizer
corpus_list = [
    ['This', 'is', 'the', 'first', 'document', '.'],
    ['This', 'document', 'is', 'the', 'second', 'document', '.'],
    ['And', 'this', 'is', 'the', 'third', 'one', '.'],
    ['Is', 'this', 'the', 'first', 'document', '?']
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus_list)

I think TfidfVectorizer should support this input, because we often need a complex tokenization strategy.
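One known workaround is to pass an identity function as the `analyzer`, so that `TfidfVectorizer` treats each document as an already-tokenized list instead of a raw string. This is a sketch of that approach, not an official example; note that it also bypasses lowercasing, stop-word filtering, and n-gram extraction, since those all happen inside the default analyzer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus_list = [
    ['This', 'is', 'the', 'first', 'document', '.'],
    ['This', 'document', 'is', 'the', 'second', 'document', '.'],
    ['And', 'this', 'is', 'the', 'third', 'one', '.'],
    ['Is', 'this', 'the', 'first', 'document', '?'],
]

# The analyzer receives each document and must return its tokens;
# an identity function passes the pre-tokenized list through unchanged.
vectorizer = TfidfVectorizer(analyzer=lambda tokens: tokens)
X = vectorizer.fit_transform(corpus_list)

# Tokens are kept exactly as given, case and punctuation included.
print(sorted(vectorizer.vocabulary_))
```

One caveat: a `lambda` in the constructor cannot be pickled, so if the fitted vectorizer needs to be serialized, define the identity function at module level instead.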
