<h1> Vectorization</h1>
<p> 
Vectorization is the process of converting textual data into numerical vectors, where each vector represents a piece of text such as a document or a sentence. This conversion allows machines to process and analyze text data using mathematical and statistical techniques.</p>

<h3>Techniques of Vectorization</h3>
<p><b>Bag-of-Words (BoW)</b> : Represents each document as a vector where each dimension corresponds to a unique word in the corpus, and the value represents the frequency of that word in the document.</p>
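<p>To make the idea concrete, here is a minimal from-scratch sketch of Bag-of-Words using only the Python standard library. The helper name <code>bag_of_words</code> and the toy token lists are hypothetical and only for illustration; the documents are assumed to be tokenized already.</p>

```python
from collections import Counter

def bag_of_words(tokenized_docs):
    """Return (vocabulary, count_matrix) for a list of token lists."""
    # One dimension per unique token, in sorted order
    vocabulary = sorted({tok for doc in tokenized_docs for tok in doc})
    matrix = []
    for doc in tokenized_docs:
        counts = Counter(doc)  # missing tokens count as 0
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

# Hypothetical, already-tokenized toy documents
docs = [["cat", "sat", "mat"], ["cat", "cat", "ran"]]
vocab, bow = bag_of_words(docs)
print(vocab)
print(bow)
```

<p>Each row has one entry per vocabulary word, so the second document gets a 2 in the "cat" column because the word occurs twice.</p>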

<p><b>TF-IDF (Term Frequency-Inverse Document Frequency)</b>
    : Similar to BoW, but also weights each word by how rare it is across the entire corpus, so words that appear in many documents count for less.</p>

Now, let's implement these vectorization techniques using scikit-learn.

<h4> Bag-of-Words (BoW) Representation</h4>

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
import spacy

In [12]:
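# Custom tokenizer function for preprocessing

# `nlp` must be a loaded spaCy pipeline; the model name below is an assumption
# (install it with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

def custom_tokenizer(text):
    # Tokenization and lowercasing using spaCy
    doc = nlp(text.lower())
    # Remove stop words and punctuation, then lemmatize the remaining tokens
    tokens = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]
    return tokens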

<h5> <a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html">CountVectorizer</a></h5> 
<p>"Convert a collection of text documents to a matrix of token counts."<br>
It transforms a given text into a vector based on the frequency of each word in the text.</p>

In [14]:
# Initialize CountVectorizer with custom tokenizer
vectorizer = CountVectorizer(tokenizer=custom_tokenizer)

In [22]:
# Example sentences
sentences = ["I love learning NLP.", "NLP is a technique to analyze natural languages. It is fun to learn."]

In [16]:
# Fit and transform the sentences
X_bow = vectorizer.fit_transform(sentences)

In [17]:
# Get the Bag-of-Words representation
bow_representation = X_bow.toarray()

In [21]:
# Print vocabulary
print("\nVocabulary:\n")
print(vectorizer.get_feature_names_out())


Vocabulary:

['analyze' 'fun' 'language' 'learn' 'love' 'natural' 'nlp' 'technique']


In [23]:
# Print Bag-of-Words representation
print("Bag-of-Words Representation:\n")
print(bow_representation)

Bag-of-Words Representation:

[[0 0 0 1 1 0 1 0]
 [1 1 1 1 0 1 1 1]]
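
<p>Each column of the matrix lines up with one vocabulary entry, in the same order. The short sketch below pairs the counts from this example with their words so the rows are easier to read. The values are typed in by hand (using the lemmas "analyze" and "language"), and <code>nonzero_terms</code> is a hypothetical helper name.</p>

```python
# Vocabulary and counts from the example, entered by hand
vocabulary = ["analyze", "fun", "language", "learn",
              "love", "natural", "nlp", "technique"]
bow_rows = [
    [0, 0, 0, 1, 1, 0, 1, 0],
    [1, 1, 1, 1, 0, 1, 1, 1],
]

def nonzero_terms(row, vocab):
    """Map each word that occurs in the document to its count."""
    return {term: count for term, count in zip(vocab, row) if count > 0}

readable = [nonzero_terms(row, vocabulary) for row in bow_rows]
for i, terms in enumerate(readable):
    print(f"Document {i}: {terms}")
```

<p>The first sentence keeps only "learn", "love", and "nlp" because the stop word "I" and the punctuation were removed by the tokenizer.</p>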


<h3> TF-IDF (Term Frequency-Inverse Document Frequency) </h3>

<p>TF-IDF aims to reflect the importance of a word in a document relative to a collection of documents (the corpus). It combines two components:
    
<ul>
 <li><b>Term Frequency (TF):</b> Measures how often a term (word) occurs in a document, relative to the total number of terms in that document.</li><br>
    
   <li><b>Inverse Document Frequency (IDF):</b> Measures how rare a term is across the documents in the corpus. Terms that appear in fewer documents receive a higher weight.</li>
</ul></p>
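 <p><h5> The Formula </h5> 
 
TF-IDF(t, d) = TF(t, d) × IDF(t)<br>
    
 where:<br>
 
TF(t, d) = term frequency of term t in document d.<br>
IDF(t) = inverse document frequency of term t in the corpus.<br>
  
<i>To calculate:</i><br>
     
TF(t, d) = (number of occurrences of term t in document d) / (total number of terms in document d).<br>
IDF(t) = log(total number of documents in the corpus / number of documents containing term t).<br>

Note that scikit-learn's TfidfVectorizer, with its default settings, uses the raw count for TF, a smoothed IDF of ln((1 + n) / (1 + df(t))) + 1, and L2-normalizes each row, so its values differ from this textbook formula.
    
</p>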
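
<p>As a worked example, the sketch below computes TF-IDF by hand with the standard library. It assumes the common log-scaled IDF, log(N / df), and uses hand-written token lists that stand in for the tokenizer output; the function names are hypothetical.</p>

```python
import math

def tf(term, doc_tokens):
    # Occurrences of the term divided by the document length
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus_tokens):
    n_docs = len(corpus_tokens)
    df = sum(1 for doc in corpus_tokens if term in doc)
    # Assumes the term occurs in at least one document (df > 0)
    return math.log(n_docs / df)

def tf_idf(term, doc_tokens, corpus_tokens):
    return tf(term, doc_tokens) * idf(term, corpus_tokens)

# Hand-written tokens standing in for the preprocessed example corpus
toy_corpus = [
    ["learn", "love", "nlp"],
    ["analyze", "fun", "language", "learn", "natural", "nlp", "technique"],
]
love_score = tf_idf("love", toy_corpus[0], toy_corpus)
nlp_score = tf_idf("nlp", toy_corpus[0], toy_corpus)
print(love_score)  # (1/3) * ln(2)
print(nlp_score)   # 0.0, because "nlp" occurs in every document
```

<p>Under this unsmoothed formula, a term that appears in every document gets a score of zero, which is one reason libraries often apply smoothing.</p>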

In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [52]:
# Example corpus
corpus = [
    "I love learning NLP.",
    "NLP is a technique to analyze natural languages. It is fun to learn."
]

In [53]:
# Initialize TfidfVectorizer with the custom tokenizer
tf_idf_vectorizer = TfidfVectorizer(tokenizer=custom_tokenizer)

In [54]:
# Fit the vectorizer on the corpus and transform it into a TF-IDF matrix
X_tfidf = tf_idf_vectorizer.fit_transform(corpus)

In [63]:
# Get Tfidf Vocabulary
tfidf_vocabulary = tf_idf_vectorizer.get_feature_names_out()

In [64]:
# Get Tfidf representation 
tfidf_representaion = X_tfidf.toarray()

In [69]:
# print corpus
corpus

['I love learning NLP.',
 'NLP is a technique to analyze natural languages. It is fun to learn.']

In [65]:
# Show the vocabulary
tfidf_vocabulary

array(['analyze', 'fun', 'language', 'learn', 'love', 'natural', 'nlp',
       'technique'], dtype=object)

In [70]:
# Get tfidf representation of corpus
tfidf_representaion

array([[0.        , 0.        , 0.        , 0.50154891, 0.70490949,
        0.        , 0.50154891, 0.        ],
       [0.4078241 , 0.4078241 , 0.4078241 , 0.29017021, 0.        ,
        0.4078241 , 0.29017021, 0.4078241 ]])
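
<p>The numbers above can be reproduced by hand. The sketch below assumes scikit-learn's documented defaults for TfidfVectorizer (raw term counts, <code>smooth_idf=True</code>, and <code>norm='l2'</code>) and uses hand-written token lists in place of the spaCy tokenizer output; the helper names are hypothetical.</p>

```python
import math

# Hand-written tokens standing in for the output of custom_tokenizer
tokens = [
    ["learn", "love", "nlp"],
    ["analyze", "fun", "language", "learn", "natural", "nlp", "technique"],
]
vocab = sorted({t for doc in tokens for t in doc})
n_docs = len(tokens)

def smoothed_idf(term):
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    df = sum(1 for doc in tokens if term in doc)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(doc):
    # Raw count times smoothed IDF, then L2 normalization of the row
    raw = [doc.count(term) * smoothed_idf(term) for term in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

rows = [tfidf_row(doc) for doc in tokens]
for row in rows:
    print([round(v, 8) for v in row])
```

<p>In the first document, "love" scores higher than "learn" and "nlp" because it appears in only one of the two documents, while the other two words appear in both.</p>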