`TfidfVectorizer` is another useful tool provided by the `scikit-learn` library in Python. It is based on [TF-IDF](Text_Vectorization.md). It transforms a text into a vector based on how often each word occurs in that text, but it also weights each word by how informative it is across the whole collection of texts: a word that appears in many texts gets a lower weight than a word that appears in only a few. This is helpful when we have multiple texts and wish to convert each text into a numeric vector for further text analysis.
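
To make the weighting concrete, the following is a minimal pure-Python sketch of the computation, assuming the scikit-learn default settings (lowercasing, tokens of two or more word characters, smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row). The helper names `tokenize` and `tfidf_row` are made up for this illustration and are not part of scikit-learn.

```python
import math
import re
from collections import Counter

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "Python is a great programming language for machine learning.",
    "CountVectorizer converts text into a matrix of token counts.",
    "Natural language processing helps computers understand human language.",
    "Machine learning models require clean and structured data.",
]

def tokenize(text):
    # Hypothetical helper: lowercase and keep tokens of two or more word
    # characters, which is assumed to mirror the default token pattern.
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(d) for d in docs]
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_row(tokens):
    # Hypothetical helper: tf-idf weights for one document.
    tf = Counter(tokens)
    # Raw count times smoothed idf: ln((1 + n) / (1 + df)) + 1
    weights = {
        t: count * (math.log((1 + n_docs) / (1 + df[t])) + 1)
        for t, count in tf.items()
    }
    # L2-normalize so that the row has unit length.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

row0 = tfidf_row(tokenized[0])
print(row0["the"], row0["quick"])
```

In the first sentence every word is unique to that sentence, so all idf values are equal and only the counts matter: "the" occurs twice and the other seven words once, which gives weights of 2/√11 ≈ 0.603 and 1/√11 ≈ 0.302 after normalization.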

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

d0 = "The quick brown fox jumps over the lazy dog."
d1 = "Python is a great programming language for machine learning."
d2 = "CountVectorizer converts text into a matrix of token counts."
d3 = "Natural language processing helps computers understand human language."
d4 = "Machine learning models require clean and structured data."

document = [d0, d1, d2, d3, d4]

# Create a TfidfVectorizer object
tfidf = TfidfVectorizer()

# Learn the vocabulary and idf weights, then build the tf-idf matrix in one step
vector = tfidf.fit_transform(document)

# Display the vocabulary (term -> column index)
print("Vocabulary:", tfidf.vocabulary_)

# display tf-idf values
print('\ntf-idf value:')
print(vector)

# in matrix form
print('\ntf-idf values in matrix form:')
print(vector.toarray())

Vocabulary: {'the': 33, 'quick': 29, 'brown': 1, 'fox': 10, 'jumps': 16, 'over': 25, 'lazy': 18, 'dog': 8, 'python': 28, 'is': 15, 'great': 11, 'programming': 27, 'language': 17, 'for': 9, 'machine': 20, 'learning': 19, 'countvectorizer': 6, 'converts': 4, 'text': 32, 'into': 14, 'matrix': 21, 'of': 24, 'token': 34, 'counts': 5, 'natural': 23, 'processing': 26, 'helps': 12, 'computers': 3, 'understand': 35, 'human': 13, 'models': 22, 'require': 30, 'clean': 2, 'and': 0, 'structured': 31, 'data': 7}

tf-idf value:
<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 39 stored elements and shape (5, 36)>
  Coords	Values
  (0, 33)	0.6030226891555271
  (0, 29)	0.3015113445777636
  (0, 1)	0.3015113445777636
  (0, 10)	0.3015113445777636
  (0, 16)	0.3015113445777636
  (0, 25)	0.3015113445777636
  (0, 18)	0.3015113445777636
  (0, 8)	0.3015113445777636
  (1, 28)	0.3792466455261381
  (1, 15)	0.3792466455261381
  (1, 11)	0.3792466455261381
  (1, 27)	0.3792466455261381
  (1, 17)	0.3059738