<a href="https://colab.research.google.com/github/shrikant280304/FMML_PROJECTS_AND_LABS/blob/main/FMML_M3Lab3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **FOUNDATIONS OF MODERN MACHINE LEARNING, IIIT Hyderabad**
### MODULE: CLASSIFICATION-1
### LAB-3 : Using KNN for Text Classification
#### Module Coordinator: Jashn Arora


---

## **Section 1: Understanding NLP tools**

In this lab we will use KNN on a real-world NLP application: text classification. But first, let's look at some NLP techniques for text classification and the tools we use when working with Python for NLP.

## Section 1.1: Data Cleaning and Preprocessing step

Raw text must be processed and converted into a form that is suitable for machine-learning algorithms.  
With text, there are several things to take into account:  


1.   Removing numbers from the text
2.   Handling capitalization and punctuation.
3.   Stemming and Lemmatizing text.  

Most importantly, algorithms cannot use words or images directly; they must first be converted into vectors, a form that algorithms can understand.
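
The first two cleaning steps above can be sketched with the standard library alone. This is only a minimal illustration; `basic_clean` is a hypothetical helper name, not part of NLTK or of the lab code that follows.

```python
import re

def basic_clean(text):
    """Remove digits and punctuation, lowercase, and collapse whitespace."""
    # Replace every character that is not an ASCII letter with a space
    letters_only = re.sub(r"[^A-Za-z]", " ", text)
    # Lowercase, then split/join to collapse runs of spaces
    return " ".join(letters_only.lower().split())

print(basic_clean("Hello, World! 123"))
```

Stemming and lemmatization need language-specific rules or dictionaries, which is where a library such as NLTK comes in.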



### **NLTK**
NLTK (the Natural Language Toolkit) is a commonly used library for processing text. We will use it in this lab. Let's first download the NLTK resources we need.


In [None]:
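import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')  # needed by word_tokenize in recent NLTK releases
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')  # assumed to be needed by pos_tag in recent NLTK releases
nltk.download('wordnet')
nltk.download('omw-1.4')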

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [None]:
import re
import numpy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from bs4 import BeautifulSoup

def cleanText(text, lemmatize, stemmer):
    """Method for cleaning text from train and test data. Removes numbers, punctuation, and capitalization. Stems or lemmatizes text."""

    if isinstance(text, float):
        text = str(text)
    if isinstance(text, numpy.int64):
        text = str(text)
    try:
        text = text.decode()
    except AttributeError:
        pass

    soup = BeautifulSoup(text, "lxml")
    text = soup.get_text()
    text = re.sub(r"[^A-Za-z]", " ", text)
    text = text.lower()


    if lemmatize:
        wordnet_lemmatizer = WordNetLemmatizer()

        def get_tag(tag):
            if tag.startswith('J'):
                return wordnet.ADJ
            elif tag.startswith('V'):
                return wordnet.VERB
            elif tag.startswith('N'):
                return wordnet.NOUN
            elif tag.startswith('R'):
                return wordnet.ADV
            else:
                return wordnet.NOUN  # fall back to noun, the lemmatizer's default part of speech

        text_result = []
        tokens = word_tokenize(text)  # Generate list of tokens
        tagged = pos_tag(tokens)
        for t in tagged:
            try:
                text_result.append(wordnet_lemmatizer.lemmatize(t[0], get_tag(t[1][:2])))
            except Exception:  # fall back to the default (noun) lemmatization
                text_result.append(wordnet_lemmatizer.lemmatize(t[0]))
        return text_result

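    if stemmer:
        text_result = []
        tokens = word_tokenize(text)
        snowball_stemmer = SnowballStemmer('english')
        for t in tokens:
            text_result.append(snowball_stemmer.stem(t))
        return text_result

    # Neither lemmatize nor stemmer requested: return plain tokens instead of None
    return word_tokenize(text)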

In [None]:
sample_text = "Troubling"
sample_text_result = cleanText(sample_text, lemmatize=False, stemmer=True)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text)
print(sample_text_result)
sample_text_result = cleanText(sample_text, lemmatize=True, stemmer=False)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text_result)

LookupError: 
**********************************************************************
  Resource punkt_tab not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt_tab')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt_tab/english/

  Searched in:
    - '/root/nltk_data'
    - '/usr/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************


## Section 1.2: BAG OF WORDS

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms.

The approach is very simple and flexible, and can be used in many ways for extracting features from documents.

A bag-of-words is a representation of text that describes the occurrence of words within a document.
It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
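
To make the "order is discarded" point concrete, here is a minimal sketch in plain Python. The `bag_of_words` helper and the three-word vocabulary are made up for illustration; the lab itself uses scikit-learn's `CountVectorizer`.

```python
from collections import Counter

def bag_of_words(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]

vocabulary = ["dog", "bites", "man"]
a = bag_of_words("dog bites man", vocabulary)
b = bag_of_words("man bites dog", vocabulary)

# Both sentences map to the same vector, although they mean different things
print(a, b)
```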

In [None]:
# Functions to convert document(s) to a list of words, with the option of removing stopwords. Returns document-term matrix.

def createBagOfWords(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = CountVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer = CountVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    bag_of_words_train = vectorizer.fit_transform(clean_train).toarray()
    bag_of_words_test = vectorizer.transform(clean_test).toarray()
    return bag_of_words_train, bag_of_words_test


## Section 1.3: TF-IDF
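TF-IDF (term frequency-inverse document frequency) weights each word by how often it occurs in a document and by how rare it is across the corpus. This addresses a weakness of Bag of Words, which treats every word as equally important, and makes TF-IDF well suited to text classification.

The number of times a term occurs in a document is called its term frequency (TF).

Document frequency (DF) is the number of documents in which the term is present. Inverse document frequency (IDF) is based on the inverse of the document frequency, usually on a log scale, and measures the informativeness of term *t*: the fewer documents contain *t*, the higher its IDF. The TF-IDF weight of a term in a document is the product of its TF and its IDF.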
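
The definitions above can be worked through by hand on a tiny corpus. The sketch below uses the textbook form IDF = log(N / DF) on a made-up, pre-tokenized corpus; note that scikit-learn's `TfidfVectorizer` uses a smoothed variant of IDF and normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Toy corpus, made up for illustration and already tokenized
docs = [
    ["the", "movie", "was", "great"],
    ["the", "movie", "was", "terrible"],
    ["the", "food", "was", "great"],
]
n_docs = len(docs)

# Document frequency: number of documents that contain each term
df = Counter(term for doc in docs for term in set(doc))

def idf(term):
    """Textbook IDF, log(N / DF); a term found in every document gets 0."""
    return math.log(n_docs / df[term])

def tf_idf(doc):
    """Map each term of a document to (raw count) * IDF."""
    counts = Counter(doc)
    return {term: count * idf(term) for term, count in counts.items()}

weights = tf_idf(docs[1])
for term, weight in weights.items():
    print(f"{term:10s} {weight:.3f}")
```

In the second document, "the" and "was" occur in every document and so receive weight 0, while "terrible", which occurs in only one document, receives the highest weight.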




In [None]:
def createTFIDF(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = TfidfVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer =  TfidfVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    tfidf_train = vectorizer.fit_transform(clean_train).toarray()
    tfidf_test = vectorizer.transform(clean_test).toarray()
    return tfidf_train, tfidf_test

# **Section 2: UNDERSTANDING THE DATA : A REVIEWS DATASET**

Sentiment analysis is the interpretation and classification of emotions (such as positive, negative and neutral) within text data using text analysis techniques.  
Given below is a dataset consisting of reviews along with sentiment class (positive or negative).

In [None]:
# Upload the Reviews CSV file that has been shared with you.
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

In [None]:
import pandas as pd
df = pd.read_csv('reviews.csv')

In [None]:
df = df.dropna()

In [None]:
df.to_csv('reviews.csv', index=False)

# **Section 3: KNN MODEL**

Given below are two KNN models; in the first case we are using Bag-of-Words and in the second case we are using TF-IDF.
Note the different metrics and parameters used in each.
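
One reason to pay attention to the metric: Euclidean distance is sensitive to document length, while cosine distance only compares the direction of the vectors. The sketch below illustrates this with made-up count vectors; the function names are hypothetical helpers, not scikit-learn calls.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    """1 - cosine similarity; 0 means the vectors point in the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Made-up word-count vectors over a three-word vocabulary
short_review = [1, 1, 0]
long_review = [3, 3, 0]   # same words as short_review, repeated three times
other_review = [0, 1, 1]  # shares only one word with short_review

print("euclidean:", euclidean_distance(short_review, long_review),
      euclidean_distance(short_review, other_review))
print("cosine:   ", cosine_distance(short_review, long_review),
      cosine_distance(short_review, other_review))
```

With Euclidean distance the nearest neighbor of `short_review` is `other_review`; with cosine distance it is `long_review`, which uses exactly the same words.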

In [None]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 1: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"],
                                                        test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2,
                                         metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

Note: Cross-validation will be discussed in detail in the upcoming lab session.

In [None]:
## KNN accuracy after using BoW
predicted, y_test = bow_knn()

In [None]:
## KNN accuracy after using TFIDF
predicted, y_test = tfidf_knn()

# Section 4: SPAM TEXT DATASET
Now let's use what we've learnt to classify texts as spam or not spam.

In [None]:
# Upload the spam text data CSV file that has been shared with you. You can also download the file from https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

In [None]:
import pandas as pd
df = pd.read_csv('spam.csv')
df

In [None]:
df['Category'] = df['Category'].map({'ham': 0, 'spam': 1})

In [None]:
df.head(5)

In [None]:
len(df)

In [None]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 2: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2, metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

In [None]:
# This cell may take some time to run
predicted, y_test = bow_knn()

In [None]:
# This cell may take some time to run
predicted, y_test = tfidf_knn()

**Why does the TF-IDF approach generally result in better accuracy than Bag-of-Words?**


TF-IDF often outperforms the BoW model in NLP tasks because it captures how important each word is to a document relative to the rest of the corpus. Here is a breakdown of why TF-IDF usually performs better than BoW:

1. Distinguishes Important Words from Common Words
BoW is a rudimentary model in which every word in a document is weighted only by its frequency. As a result, when documents are compared, very frequent but uninformative words such as ‘the’, ‘is’ and ‘and’ contribute disproportionately to the feature vector.
TF-IDF counters this by assigning lower weights to words that are common across documents and higher weights to words that are rare. This is done through the IDF component, which reduces the importance of words that appear in most documents and increases the importance of words that appear in few.

2. Emphasizes Discriminative Terms
Unlike TF-IDF, BoW does not consider whether a word helps to distinguish one document from another. For instance, BoW may treat words such as “said” and “today” as important because they are frequent, even though they say little about topic differences.
BoW feature vectors therefore tend to be dominated by a few common words, which hurts classification and clustering results. TF-IDF reduces the weight of such words, so its feature vectors are shaped by the terms that matter most in each document.

3. Combines Document Relevance with Corpus Frequency
In the BoW model, words that occur frequently in a document receive high weights regardless of how relevant they are to the message of the document. This can lead to over-emphasis on terms that carry little meaning.
TF-IDF combines the term frequency within a particular document (TF) with the inverse document frequency (IDF), which helps it identify the terms that are meaningful to a document relative to the other documents.

4. Reduces Noise in High-Dimensional Features
The BoW model creates sparse, high-dimensional vectors, which can introduce a significant amount of noise. Many machine learning algorithms struggle in high-dimensional spaces because of the “curse of dimensionality” and the risk of overfitting.
TF-IDF does not reduce the number of dimensions (the vocabulary is the same), but it down-weights high-frequency terms that carry little information, which reduces their influence as noise. The resulting vectors are often easier for distance-based models such as KNN to work with, which can improve accuracy.

5. Works Better with Similarity-Based Methods
TF-IDF weights also work well with cosine similarity and other similarity measures, for instance in document clustering and information retrieval. Because TF-IDF gives more weight to terms that are distinctive for a particular document, the resulting “distance” between documents is usually a more useful measure of how related they are.


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the dataset
df = pd.read_csv('spam.csv', encoding='latin-1')

# Print the first few rows and the columns of the DataFrame to verify
print(df.head())
print(df.columns)

# Use the actual column names from the DataFrame
# 'Category' is the label and 'Message' is the text
X = df['Message']
y = df['Category'].map({'ham': 0, 'spam': 1})  # Convert labels to binary

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Bag-of-Words
bow_vectorizer = CountVectorizer()
X_train_bow = bow_vectorizer.fit_transform(X_train)
X_test_bow = bow_vectorizer.transform(X_test)

# TF-IDF
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# KNN Classifier
knn = KNeighborsClassifier(n_neighbors=5)

# Train and evaluate using Bag-of-Words
knn.fit(X_train_bow, y_train)
y_pred_bow = knn.predict(X_test_bow)
accuracy_bow = accuracy_score(y_test, y_pred_bow)

# Train and evaluate using TF-IDF
knn.fit(X_train_tfidf, y_train)
y_pred_tfidf = knn.predict(X_test_tfidf)
accuracy_tfidf = accuracy_score(y_test, y_pred_tfidf)

# Print results
print(f'Accuracy using Bag-of-Words: {accuracy_bow:.4f}')
print(f'Accuracy using TF-IDF: {accuracy_tfidf:.4f}')

Weighting of Terms:

TF-IDF computes the relative importance of a word from its term frequency, the number of times the word is used in a document, and its document frequency, the number of documents in which the word occurs. Words that appear in a large number of documents, such as stop words, are given less weight, while more distinctive terms that characterize a document are given more weight.

BoW, by contrast, treats every word purely as a count, so it cannot tell a distinctive term from a common one.

Handling of Rare Words:

TF-IDF reflects the idea that rare terms are more likely to represent the content of a document. This is especially useful in classification tasks, where specific terms can discriminate strongly between categories.
In BoW, however, such terms may contribute little to the document representation because their counts are low.

Normalization:

TF-IDF implementations typically scale (normalize) the weighted term frequencies, which makes it easier to compare documents of different lengths. This dampens the effect of longer documents, which naturally contain more occurrences of general words.
Plain BoW counts are not normalized, so the performance of BoW can depend heavily on the lengths of the documents in question.

Information Retrieval:

TF-IDF is preferred in information retrieval systems because it scores a document's relevance to a query by how informative the matching terms are, which makes it well suited to search and recommendation services.
BoW is less suitable for this job because it has no means of estimating the weight of a term relative to the entire collection.

Context Preservation:

Through its weighting, TF-IDF retains some corpus-level context about how important each term is, which BoW lacks.
Neither model keeps the order of words, however, so both can miss meaning that depends on word order or on the surrounding context.







**Can you think of techniques that are better than both BoW and TF-IDF?**

Yes, there are several text representation methods that outperform BoW and TF-IDF in many NLP applications. Here are some of the most notable ones:

**Word Embeddings (e.g., Word2Vec, GloVe):**

Word embeddings are dense vectors that place each word in a continuous vector space in which words with similar meanings lie close together. BoW and TF-IDF only compute frequencies, whereas word vectors can be compared with cosine similarity and capture finer-grained semantics and relationships between words.
These embeddings can be pre-trained on a large corpus and then fine-tuned for a particular task of interest, such as text classification or sentiment analysis.

**Contextualized Word Embeddings (e.g., BERT, ELMo):**

Models such as BERT (Bidirectional Encoder Representations from Transformers) and ELMo (Embeddings from Language Models) produce contextual word embeddings. In contrast to static embeddings, these models produce a different embedding for the same word depending on the sentence in which it appears.
This makes it possible to capture the meaning of a word in its context, which has substantially improved accuracy in language understanding and generation tasks.

**Transformers:**

The transformer architecture, on which BERT, GPT and T5 are based, has performed very well across NLP tasks. Its attention mechanism weights the different words in a sentence against one another, so it captures long-range dependencies and contextual relationships far better than BoW or TF-IDF.
Fine-tuning a pre-trained transformer model on a task-specific dataset can give state-of-the-art performance on tasks such as question answering, translation, and summarization.

**Sentence Embeddings (e.g., Sentence-BERT):**

Sentence embeddings are the phrase- and sentence-level analog of word embeddings: they represent an entire phrase or sentence as a fixed-size vector. This is especially helpful in tasks that compare sentences, such as measuring semantic similarity or clustering.
Sentence-BERT was developed to generate sentence embeddings that can be compared efficiently for semantic similarity.

**N-grams with Smoothing Techniques:**

N-gram models store sequences of ‘n’ consecutive words, and so retain some local context. Combined with smoothing techniques such as Laplace smoothing, they can capture more than BoW or TF-IDF over single words because they take word order and word combinations into account.
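
As a minimal sketch of the idea, the hypothetical `ngrams` helper below extracts consecutive word sequences from a token list using plain Python.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "not good at all".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

# The bigram ("not", "good") keeps the negation that single words lose
print(unigrams)
print(bigrams)
```

In scikit-learn, `CountVectorizer` and `TfidfVectorizer` accept an `ngram_range` parameter, for example `ngram_range=(1, 2)`, to include bigrams alongside single words.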

**Topic Modeling (e.g., LDA):**

Topic models such as Latent Dirichlet Allocation (LDA) are unsupervised models that discover themes in a set of documents. They can be more informative than word counts because they represent each document as a mixture of topics rather than as raw counts.

**Deep Learning Approaches:**

CNNs and RNNs can be applied to text data directly. These models can learn patterns and features automatically from raw text, which often gives them better performance in sentiment analysis and classification.

**Hybrid Models:**

Combining several techniques can also improve results. For instance, using BERT embeddings as input features for a classifier, or combining BoW features with word embeddings, can boost performance.

**Read about Stemming and Lemmatization from the resources given below. Think about the pros/cons of each.**

Stemming and lemmatization are both text normalization techniques. Both aim to reduce the inflected forms of a word to a common base form, but they go about it in different ways.

Here's a comparison of the two, along with their pros and cons:

Stemming

Definition:

Stemming converts words to a base form, usually by stripping suffixes. For instance, “run,” “runs,” and “running” may all be reduced to the stem “run.”

Pros:

Simplicity and Speed: Stemming algorithms are typically simple, fast, rule-based procedures, which makes them well suited to large datasets.

Reduces Variability: It can help reduce the dimensionality of the dataset, which can benefit tasks such as information indexing and text classification.

Broad Coverage: Stemming can be applied to many languages because it does not require large dictionaries or morphological analyzers.

Cons:

Inaccuracy: Stemming can produce a stem that is unrelated to the original word or is not a real word form. It also misses irregular forms: for instance, “better” is left as “better” and is never linked to “good.”

Loss of Meaning: Because stemming is rule-based, often aggressive, and ignores context, words with different meanings can be collapsed into the same stem.

No Guarantee of Real Words: The results may not be valid dictionary words, which makes them harder for humans to read.

Lemmatization

Definition:

Lemmatization differs from stemming in that it returns a word's base or dictionary form (lemma), taking into account context and the word's part of speech. For instance, “running” is lemmatized to “run,” and “better” is lemmatized to “good.”

Pros:

Semantic Accuracy: Lemmatization is more accurate than stemming because it uses context and the part of speech of each word, which yields genuine dictionary forms.

Improved Understanding: Lemmatization delivers actual words, which can help downstream work such as sentiment analysis or translation.

Enhanced Performance: Lemmatization is used in many NLP applications because cleaner input data often improves model performance.

Cons:

Complexity and Speed: Lemmatization algorithms are more sophisticated and slower to run, since they usually need dictionaries and morphological analysis of the language.

Language Dependency: Lemmatization relies on language-specific resources and tools, which makes it harder than stemming to apply to a new language.

Need for Part-of-Speech Tagging: Lemmatization is most effective when the part of speech of each word is known, which requires an additional tagging step.
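
The trade-off can be illustrated with a deliberately simplified toy example. Neither function below is a real algorithm: `toy_stem` is a naive suffix stripper (far cruder than the Snowball stemmer) and `toy_lemmatize` is a three-entry lookup table standing in for a dictionary such as WordNet. Both names are hypothetical.

```python
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Naive rule-based stemming: strip the first matching suffix."""
    for suffix in SUFFIXES:
        # Keep at least three characters of the word after stripping
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hand-written lookup table standing in for a real lemma dictionary
TOY_LEMMAS = {"running": "run", "studies": "study", "better": "good"}

def toy_lemmatize(word):
    """Dictionary-based lemmatization: look the word up, else leave it unchanged."""
    return TOY_LEMMAS.get(word, word)

for word in ["running", "studies", "better"]:
    print(f"{word:10s} stem={toy_stem(word):8s} lemma={toy_lemmatize(word)}")
```

The rules are fast and need no dictionary, but they yield non-words such as "runn" and "studi" and cannot connect "better" to "good"; the lookup gets these right but only for words the dictionary knows.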

### Useful Resources for further reading
1. Stemming and Lemmatization: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
2. TF-IDF and BoW : https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
3. TF-IDF: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
