# TfidfVectorizer
TfidfVectorizer is a class provided by scikit-learn (a popular machine learning library in Python) for converting a collection of raw documents (text) into a matrix of TF-IDF features. TF-IDF stands for Term Frequency-Inverse Document Frequency, and it is a numerical statistic used in natural language processing and information retrieval to represent the importance of a term within a document relative to a collection of documents (corpus).

# TfidfVectorizer():
This is the constructor for creating a TfidfVectorizer object. You can customize the vectorizer's behavior by passing various parameters as arguments.

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
corpus = [
    'LW awarded best innovative training provider by linkedin',
    'best awarded serach engine provider is google',
    'we are LW working for making india future ready',
    'if we want to search in india are go to google',
    'i am the best and how is the best'
]

In [3]:
vectorizer = TfidfVectorizer()

# fit(raw_documents):
This method analyzes the text in raw_documents and learns the vocabulary and the IDF (Inverse Document Frequency) weights needed for TF-IDF vectorization. It does not produce a matrix; it returns the fitted vectorizer.

In [4]:
vectorizer.fit(corpus)

TfidfVectorizer()
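
As a minimal sketch of what fit leaves behind, the snippet below uses a hypothetical two-document corpus (not the one above) and checks that fit returns the vectorizer itself and populates the learned attributes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus, used only for illustration
docs = ["best search engine", "best training provider"]

vec = TfidfVectorizer()
fitted = vec.fit(docs)

# fit() returns the same object, so calls can be chained
print(fitted is vec)

# Learned state is stored in attributes that end with an underscore
learned_terms = sorted(vec.vocabulary_)
print(learned_terms)
print(vec.idf_.shape)
```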

# transform(raw_documents): 
This method transforms the input text into a TF-IDF matrix using the vocabulary and IDF weights learned by the fit method. Terms that were not seen during fit are ignored.

In [5]:
tfidf_matrix = vectorizer.transform(corpus)

In [6]:
tfidf_matrix

<5x30 sparse matrix of type '<class 'numpy.float64'>'
	with 40 stored elements in Compressed Sparse Row format>
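
The following sketch, again on a hypothetical mini-corpus, illustrates how transform treats a document that contains a word the vectorizer never saw during fit.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training documents
train_docs = ["best search engine", "best training provider"]
vec = TfidfVectorizer().fit(train_docs)

# "quantum" is not in the learned vocabulary, so it gets no column
new_matrix = vec.transform(["best quantum engine"])

print(new_matrix.shape)  # one row, one column per learned term
print(new_matrix.nnz)    # number of stored (nonzero) entries
```

Only "best" and "engine" contribute to the new row, which is why the number of columns stays fixed after fit.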

# fit_transform(raw_documents): 
This is a combination of the fit and transform methods, which is commonly used to both fit the vectorizer to the data and transform it in one step.

In [7]:
transform_output = vectorizer.fit_transform(corpus)

In [8]:
transform_output

<5x30 sparse matrix of type '<class 'numpy.float64'>'
	with 40 stored elements in Compressed Sparse Row format>
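
To illustrate the equivalence, this small check (on a hypothetical corpus) compares the one-step and two-step routes numerically.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus
docs = ["best search engine", "best training provider", "search provider"]

# Two steps: learn the vocabulary and IDF weights, then transform
two_step = TfidfVectorizer().fit(docs).transform(docs)

# One step: fit_transform does both on the same data
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```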

# get_feature_names_out():
Returns an array of feature names (terms), ordered to match the columns of the TF-IDF matrix.

In [9]:
vectorizer.get_feature_names_out()

array(['am', 'and', 'are', 'awarded', 'best', 'by', 'engine', 'for',
       'future', 'go', 'google', 'how', 'if', 'in', 'india', 'innovative',
       'is', 'linkedin', 'lw', 'making', 'provider', 'ready', 'search',
       'serach', 'the', 'to', 'training', 'want', 'we', 'working'],
      dtype=object)

# get_params():
Returns a dictionary of the vectorizer's current parameters, including the defaults that were not set explicitly.

In [10]:
params = vectorizer.get_params()

In [11]:
params

{'analyzer': 'word',
 'binary': False,
 'decode_error': 'strict',
 'dtype': numpy.float64,
 'encoding': 'utf-8',
 'input': 'content',
 'lowercase': True,
 'max_df': 1.0,
 'max_features': None,
 'min_df': 1,
 'ngram_range': (1, 1),
 'norm': 'l2',
 'preprocessor': None,
 'smooth_idf': True,
 'stop_words': None,
 'strip_accents': None,
 'sublinear_tf': False,
 'token_pattern': '(?u)\\b\\w\\w+\\b',
 'tokenizer': None,
 'use_idf': True,
 'vocabulary': None}

# vocabulary_:

This attribute is a dictionary that maps each unique term learned from the documents to its column index in the TF-IDF matrix.

In [12]:
vectorizer.vocabulary_

{'lw': 18,
 'awarded': 3,
 'best': 4,
 'innovative': 15,
 'training': 26,
 'provider': 20,
 'by': 5,
 'linkedin': 17,
 'serach': 23,
 'engine': 6,
 'is': 16,
 'google': 10,
 'we': 28,
 'are': 2,
 'working': 29,
 'for': 7,
 'making': 19,
 'india': 14,
 'future': 8,
 'ready': 21,
 'if': 12,
 'want': 27,
 'to': 25,
 'search': 22,
 'in': 13,
 'go': 9,
 'am': 0,
 'the': 24,
 'and': 1,
 'how': 11}

In [13]:
vectorizer.vocabulary_.get("lw")

18

In [14]:
vectorizer.vocabulary_['best']

4

# idf_: 
An attribute that contains the inverse document frequency (IDF) of each term in the vocabulary after the vectorizer has been fit.

In [15]:
vectorizer.idf_

array([2.09861229, 2.09861229, 1.69314718, 1.69314718, 1.40546511,
       2.09861229, 2.09861229, 2.09861229, 2.09861229, 2.09861229,
       1.69314718, 2.09861229, 2.09861229, 2.09861229, 1.69314718,
       2.09861229, 1.69314718, 2.09861229, 1.69314718, 2.09861229,
       1.69314718, 2.09861229, 2.09861229, 2.09861229, 2.09861229,
       2.09861229, 2.09861229, 2.09861229, 1.69314718, 2.09861229])
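
With the default smooth_idf=True, scikit-learn documents the weight as idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. For the five-document corpus above this gives about 2.0986 for a term in one document, 1.6931 for two, and 1.4055 for three, matching the output. The sketch below recomputes the weights by hand on a hypothetical corpus; using CountVectorizer to obtain document frequencies is a choice made for this illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical mini-corpus
docs = ["best search engine", "best training provider", "search provider"]

tfidf = TfidfVectorizer().fit(docs)

# binary=True records presence or absence, so column sums are document frequencies
presence = CountVectorizer(binary=True).fit_transform(docs)
doc_freq = np.asarray(presence.sum(axis=0)).ravel()

n_docs = len(docs)
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

matches = np.allclose(manual_idf, tfidf.idf_)
print(matches)
```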

# dtype:
`vectorizer.dtype` holds the data type of the matrix returned by fit_transform() or transform().

In [16]:
vectorizer.dtype

numpy.float64

# fixed_vocabulary_:
`vectorizer.fixed_vocabulary_` indicates whether the user supplied a fixed vocabulary through the vocabulary parameter (True) or the vocabulary was learned from the documents during fit (False).

In [17]:
vectorizer.fixed_vocabulary_

False

# lowercase:
`vectorizer.lowercase` specifies whether the vectorizer converts all characters to lowercase before tokenizing (True if it does, False if it doesn't).

In [18]:
vectorizer.lowercase

True

# toarray():
`transform_output.toarray()` converts the sparse matrix `transform_output` to a dense NumPy array.

In [19]:
transform_output.toarray()

array([[0.        , 0.        , 0.        , 0.31888178, 0.26470068,
        0.39524574, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.39524574, 0.        , 0.39524574, 0.31888178, 0.        ,
        0.31888178, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.39524574, 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.35894109, 0.29795353,
        0.        , 0.44489823, 0.        , 0.        , 0.        ,
        0.35894109, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.35894109, 0.        , 0.        , 0.        ,
        0.35894109, 0.        , 0.        , 0.44489823, 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.29258431, 0.        , 0.        ,
        0.        , 0.        , 0.36265071, 0.36265071, 0.        ,
        0.        , 0.        , 0.        , 0.
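
One property worth checking on the dense array: with the default norm='l2', every row is scaled to unit length. The sketch below verifies this on a hypothetical corpus. Note that a dense array stores every zero explicitly, so this conversion is best reserved for small matrices.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus
docs = ["best search engine", "best training provider", "search provider"]

dense = TfidfVectorizer().fit_transform(docs).toarray()

# Euclidean length of each document vector
row_norms = np.linalg.norm(dense, axis=1)
print(row_norms)
```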

# build_analyzer()

This method returns a callable that runs the full text-processing pipeline on a document: preprocessing, tokenization, and n-gram generation.

In [20]:
analyzer = vectorizer.build_analyzer()
text = "This is an example sentence."
tokens = analyzer(text)
print(tokens)  # Output: ['this', 'is', 'an', 'example', 'sentence']

['this', 'is', 'an', 'example', 'sentence']
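
The analyzer reflects the vectorizer's settings. As a sketch, a vectorizer configured with ngram_range=(1, 2) yields bigrams in addition to single words:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# build_analyzer() works without fitting; it only depends on the parameters
bigram_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
bigram_analyzer = bigram_vectorizer.build_analyzer()

bigram_tokens = bigram_analyzer("This is an example")
print(bigram_tokens)
# ['this', 'is', 'an', 'example', 'this is', 'is an', 'an example']
```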


# build_preprocessor()

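This method returns a function that preprocesses the text before tokenization. With the default settings it lowercases the text; it also strips accents if strip_accents is set.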

In [21]:
preprocessor = vectorizer.build_preprocessor()
text = "i am the best"
preprocessed_text = preprocessor(text)
print(preprocessed_text)  # Output: i am the best


i am the best
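
The example above is already lowercase, so the preprocessor leaves it unchanged. A sketch with mixed case and an accented character makes the effect visible; strip_accents='unicode' is set here purely for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Lowercasing is on by default; accent stripping must be requested
accent_vectorizer = TfidfVectorizer(strip_accents="unicode")
accent_preprocessor = accent_vectorizer.build_preprocessor()

cleaned = accent_preprocessor("Café MENU")
print(cleaned)  # cafe menu
```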


# build_tokenizer()

This method returns a function that splits a string into a sequence of tokens.

In [22]:
tokenizer = vectorizer.build_tokenizer()
text = "This is an example sentence."
tokens = tokenizer(text)
print(tokens)  # Output: ['This', 'is', 'an', 'example', 'sentence'] (case is kept, punctuation is dropped)


['This', 'is', 'an', 'example', 'sentence']
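
The default token_pattern only matches tokens of two or more word characters, so single-letter words disappear at this stage. A short sketch:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

default_tokenizer = TfidfVectorizer().build_tokenizer()

# "I" and "a" are single characters and do not match the default pattern
short_tokens = default_tokenizer("I am a Test.")
print(short_tokens)  # ['am', 'Test']
```

This is also why the word "i" from the corpus does not appear in the vocabulary.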


# decode(doc)

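This method decodes the input into a string of Unicode symbols. It is meant for raw documents: bytes are decoded with the configured encoding, while any other object is returned unchanged. It does not reverse transform(), so passing a row of the TF-IDF matrix, as below, simply prints that sparse row back.

In [23]:
encoded_text = vectorizer.transform(corpus)
decoded_text = vectorizer.decode(encoded_text[0])
print(decoded_text)  # Output: the same sparse row, because the input is not bytes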


  (0, 26)	0.39524574252810757
  (0, 20)	0.31888177640211135
  (0, 18)	0.31888177640211135
  (0, 17)	0.39524574252810757
  (0, 15)	0.39524574252810757
  (0, 5)	0.39524574252810757
  (0, 4)	0.26470068018333703
  (0, 3)	0.31888177640211135
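
For comparison, here is a sketch of decode applied to the kind of input it is designed for, a raw document given as bytes. The sample text is hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

byte_vectorizer = TfidfVectorizer()  # encoding='utf-8' by default

# A raw document as UTF-8 bytes, e.g. as read from a file in binary mode
raw_bytes = "café best provider".encode("utf-8")
decoded = byte_vectorizer.decode(raw_bytes)
print(decoded)

# A document that is already a str passes through unchanged
passthrough = byte_vectorizer.decode("already text")
print(passthrough)
```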


# get_params([deep])

This method gets the parameters for this estimator. With deep=True (the default) it also returns the parameters of any contained sub-objects that are estimators. The call and its output are identical to those shown in the get_params() section above.

# get_stop_words()

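This method builds or fetches the effective stop words list.

stop_words (default=None): You can pass 'english' for the built-in list or supply your own list of stop words (common words such as "the", "and", and "is" that are often removed from text). With the default of None, as in the vectorizer used here, get_stop_words() returns None, so the cell below prints nothing.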

In [26]:
stop_words = vectorizer.get_stop_words()


In [27]:
stop_words
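
The sketch below contrasts the three cases on fresh vectorizers: no stop words, the built-in English list, and a custom list (the custom words are chosen only for illustration).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Default: no stop words configured
no_stops = TfidfVectorizer().get_stop_words()
print(no_stops)  # None

# Built-in English list
english_stops = TfidfVectorizer(stop_words="english").get_stop_words()
print("the" in english_stops)

# Custom list, returned as a frozenset
custom_stops = TfidfVectorizer(stop_words=["lw", "india"]).get_stop_words()
print(custom_stops)
```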

# inverse_transform(X)

This method returns terms per document with nonzero entries in X.

In [28]:
terms_per_document = vectorizer.inverse_transform(tfidf_matrix)


In [29]:
terms_per_document

[array(['training', 'provider', 'lw', 'linkedin', 'innovative', 'by',
        'best', 'awarded'], dtype='<U10'),
 array(['serach', 'provider', 'is', 'google', 'engine', 'best', 'awarded'],
       dtype='<U10'),
 array(['working', 'we', 'ready', 'making', 'lw', 'india', 'future', 'for',
        'are'], dtype='<U10'),
 array(['we', 'want', 'to', 'search', 'india', 'in', 'if', 'google', 'go',
        'are'], dtype='<U10'),
 array(['the', 'is', 'how', 'best', 'and', 'am'], dtype='<U10')]
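
inverse_transform only knows about terms in the learned vocabulary, so words that transform ignored cannot be recovered. A sketch on a hypothetical corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training documents
train_docs = ["best search engine", "best training provider"]
vec = TfidfVectorizer().fit(train_docs)

# "quantum" is unknown to the vectorizer and is dropped by transform()
new_matrix = vec.transform(["best quantum engine"])
recovered = vec.inverse_transform(new_matrix)

recovered_terms = set(recovered[0])
print(recovered_terms)
```

The original word order and counts are not recoverable either; the result is just the set of known terms with nonzero weight in each document.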

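# set_params(**params)

This method sets the parameters of this estimator and returns the estimator itself. Attributes learned by an earlier fit, such as vocabulary_, are not updated until the vectorizer is fitted again.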

In [30]:
vectorizer.set_params(stop_words='english')

TfidfVectorizer(stop_words='english')
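
A sketch of the refit requirement, using a hypothetical two-document corpus: the vocabulary keeps its old contents after set_params and only changes once fit is called again.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus
docs = ["i am the best", "the best provider"]

vec = TfidfVectorizer().fit(docs)
before = set(vec.vocabulary_)

# Changing a parameter does not touch the learned vocabulary
vec.set_params(stop_words="english")
still_old = set(vec.vocabulary_)

# Refitting applies the new setting
vec.fit(docs)
after = set(vec.vocabulary_)

print("the" in before, "the" in still_old, "the" in after)
```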

# TfidfVectorizer()

The constructor accepts many parameters; the most commonly used ones are listed here.

Parameters:

stop_words (default=None): Stop words (common words such as "the", "and", and "is") to remove from the text. Pass 'english' for the built-in list or supply your own list.

max_df (default=1.0): Ignore terms whose document frequency is strictly higher than this threshold. A float is read as a proportion of documents, an integer as an absolute count.

min_df (default=1): Ignore terms whose document frequency is strictly lower than this threshold, with the same float and integer interpretation as max_df.

max_features (default=None): Keep only the top max_features terms, ranked by term frequency across the corpus.

ngram_range (default=(1, 1)): Specify the range of n-grams to consider (e.g., (1, 2) for unigrams and bigrams).

use_idf (default=True): Enable inverse-document-frequency reweighting.

smooth_idf (default=True): Add 1 to document frequencies, as if one extra document contained every term exactly once, which prevents division by zero.

sublinear_tf (default=False): Apply sublinear scaling to the term frequency, replacing tf with 1 + log(tf).

token_pattern (default=r"(?u)\b\w\w+\b"): Regular expression that defines a token. The default selects tokens of two or more alphanumeric characters and treats punctuation as a separator.