I found this example in the documentation:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
But in my real NLP project, I have already tokenized the sentences in my own way, so the corpus is a list of token lists (corpus_list) like this:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus_list = [
['This', 'is', 'the', 'first', 'document', '.'],
['This', 'document', 'is', 'the', 'second', 'document', '.'],
['And', 'this', 'is', 'the', 'third', 'one', '.'],
['Is', 'this', 'the', 'first', 'document', '?']
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus_list)
I think TfidfVectorizer should support this kind of input, because we often use a complex tokenization strategy.
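
As a workaround (not an official pre-tokenized API), you can pass a callable as analyzer so the vectorizer uses each token list as-is and skips its own preprocessing, tokenization and n-gram extraction. A minimal sketch, assuming every document is already a list of string tokens:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus_list = [
    ['This', 'is', 'the', 'first', 'document', '.'],
    ['This', 'document', 'is', 'the', 'second', 'document', '.'],
    ['And', 'this', 'is', 'the', 'third', 'one', '.'],
    ['Is', 'this', 'the', 'first', 'document', '?'],
]

# When analyzer is a callable, it receives each raw "document" (here a token
# list) and must return the final tokens, so an identity function passes the
# pre-tokenized input straight through. Note that lowercasing and stop-word
# removal are also bypassed in this case.
vectorizer = TfidfVectorizer(analyzer=lambda tokens: tokens)
X = vectorizer.fit_transform(corpus_list)

print(vectorizer.vocabulary_)
print(X.shape)

The tokens are kept exactly as given, so case-sensitive variants like 'This' and 'this' stay separate unless you normalize them yourself before vectorizing.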