<a href="https://colab.research.google.com/github/shahzadahmad7/Natural-Language-Processing/blob/main/NLP_Models_Tech.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The **Bag of Words (BoW) model** is used to convert text documents into numeric features. It represents each
document as a bag of words: a count of how many times each vocabulary word occurs in it, ignoring word order and grammar.
Bag-of-Words is one of the most widely used methods for transforming tokens into a set of features.

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample documents
documents = [
    "I love coding",
    "Coding is fun",
    "Machine learning is interesting"
]

# Create the BoW model
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)

# Get the feature names (words)
feature_names = vectorizer.get_feature_names_out()

# Print the BoW representation of each document
for i in range(len(documents)):
    print(f"Document {i+1}:")
    for j in range(len(feature_names)):
        word = feature_names[j]
        count = X[i, j]
        if count > 0:
            print(f"    {word}: {count}")


Document 1:
    coding: 1
    love: 1
Document 2:
    coding: 1
    fun: 1
    is: 1
Document 3:
    interesting: 1
    is: 1
    learning: 1
    machine: 1
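
Note that "I" does not appear in the output for Document 1. By default, `CountVectorizer` lowercases the text and keeps only tokens of two or more word characters, so single-letter words are dropped. The following is a minimal sketch of the same counting done by hand with the standard library. The helper names `tokenize` and `bag_of_words` are hypothetical, and the regular expression is an approximation of the vectorizer's default token pattern, not its actual implementation.

```python
import re
from collections import Counter

documents = [
    "I love coding",
    "Coding is fun",
    "Machine learning is interesting",
]

# Assumption: approximate CountVectorizer's defaults by lowercasing and
# keeping only tokens made of two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Split a document into lowercase tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(text.lower())


def bag_of_words(docs):
    """Return the sorted vocabulary and one row of counts per document."""
    token_lists = [tokenize(doc) for doc in docs]
    vocabulary = sorted({tok for toks in token_lists for tok in toks})
    counts = [Counter(toks) for toks in token_lists]
    # Counter returns 0 for words that do not occur in a document.
    matrix = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, matrix


vocabulary, matrix = bag_of_words(documents)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row has one entry per vocabulary word, which is the dense form of the sparse matrix `X` returned by `fit_transform`.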


**TF-IDF** stands for Term Frequency-Inverse Document Frequency.
1. It is used as a scoring measure in information retrieval (IR) and summarization.
2. It indicates how relevant a term is to a particular document.
3. It is computed by multiplying two measures:
   i. the frequency of the word in the document;
   ii. the word's inverse document frequency across the set of documents.


**Why do we require the TF-IDF?**
1. TF-IDF helps determine a word's significance within the context of a corpus of documents.
2. It considers the frequency of the word in the document, offset by the number of documents in the corpus that contain the word.
3. TF is calculated by dividing the number of times a term occurs in a document by the total number of terms in that document.
4. IDF is calculated by taking the logarithm of the quotient obtained by dividing the total number of documents by the number of documents containing the term.
5. The TF-IDF score is the product of the two values, TF and IDF.
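
The definitions above can be sketched directly in plain Python. This is an illustration under stated assumptions rather than a reference implementation: the function names `tf`, `idf` and `tf_idf` are hypothetical, the documents are assumed to be already tokenized by lowercasing and splitting on whitespace, and the natural logarithm is assumed because the text does not specify a base.

```python
import math
from collections import Counter


def tf(term, tokens):
    """Term frequency: occurrences of the term divided by the document length."""
    return Counter(tokens)[term] / len(tokens)


def idf(term, corpus):
    """Inverse document frequency: log(total documents / documents containing the term).

    Assumes the term occurs in at least one document; otherwise the
    division raises ZeroDivisionError.
    """
    docs_with_term = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / docs_with_term)


def tf_idf(term, tokens, corpus):
    """TF-IDF score: the product of TF and IDF."""
    return tf(term, tokens) * idf(term, corpus)


# Assumption: documents pre-tokenized by lowercasing and whitespace splitting.
corpus = [
    ["i", "love", "coding"],
    ["coding", "is", "fun"],
    ["machine", "learning", "is", "interesting"],
]

for term in ["love", "coding"]:
    print(term, round(tf_idf(term, corpus[0], corpus), 4))
```

In the first document, "love" scores higher than "coding" because "coding" also appears in the second document, which lowers its IDF.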


In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
documents = [
    "I love coding",
    "Coding is fun",
    "Machine learning is interesting"
]

# Create the TF-IDF model
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)

# Get the feature names (words)
feature_names = vectorizer.get_feature_names_out()

# Print the TF-IDF representation of each document
for i in range(len(documents)):
    print(f"Document {i+1}:")
    for j in range(len(feature_names)):
        word = feature_names[j]
        tfidf_score = X[i, j]
        if tfidf_score > 0:
            print(f"    {word}: {tfidf_score}")


Document 1:
    coding: 0.6053485081062916
    love: 0.7959605415681652
Document 2:
    coding: 0.5178561161676974
    fun: 0.680918560398684
    is: 0.5178561161676974
Document 3:
    interesting: 0.5286346066596935
    is: 0.4020402441612698
    learning: 0.5286346066596935
    machine: 0.5286346066596935
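
These scores differ from what the plain TF × IDF definition would give, because `TfidfVectorizer` with default settings uses raw term counts for TF, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and then scales each document vector to unit Euclidean (L2) length. The sketch below reproduces the printed values with only the standard library. The helper names are hypothetical, and the tokenizer is an approximation of the vectorizer's default (lowercase, tokens of two or more word characters).

```python
import math
import re
from collections import Counter

documents = [
    "I love coding",
    "Coding is fun",
    "Machine learning is interesting",
]


def tokenize(text):
    """Approximate the default tokenization (hypothetical helper)."""
    return re.findall(r"\b\w\w+\b", text.lower())


def sklearn_style_tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    token_lists = [tokenize(doc) for doc in docs]
    n_docs = len(token_lists)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(tok for toks in token_lists for tok in set(toks))
    # Smoothed IDF, as used when smooth_idf=True (the default).
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for toks in token_lists:
        counts = Counter(toks)
        weights = {t: c * idf[t] for t, c in counts.items()}
        # L2 normalization; fall back to 1.0 for a document with no tokens.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


rows = sklearn_style_tfidf(documents)
for i, row in enumerate(rows, start=1):
    print(f"Document {i}:")
    for word in sorted(row):
        print(f"    {word}: {row[word]:.4f}")
```

Because of the normalization step, the squared weights within each document sum to 1, so scores are comparable across documents of different lengths.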
