<a href="https://colab.research.google.com/github/sweety1920/FMML-LABS-4/blob/main/Lab_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **FOUNDATIONS OF MODERN MACHINE LEARNING, IIIT Hyderabad**
### MODULE: CLASSIFICATION-1
### LAB-3 : Using KNN for Text Classification
#### Module Coordinator: Jashn Arora


---

## **Section 1: Understanding NLP tools**

In this lab we will use KNN on a real-world NLP application: text classification. But first, let's look at some NLP techniques for text classification and the tools we use when working with Python for NLP.

## Section 1.1: Data Cleaning and Preprocessing step

Raw text must be cleaned and converted into a form that is suitable for use with machine-learning algorithms.  
With text, several things need to be taken into account:  


1.   Removing numbers from the text
2.   Handling capitalization and punctuation.
3.   Stemming and Lemmatizing text.  

Most importantly, algorithms cannot use words or images directly; these need to be converted into vectors, a numerical form that algorithms can work with.
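As a rough illustration of the first two steps, the sketch below uses only Python's `re` module to drop digits and punctuation and to lowercase the text. The name `basic_clean` is a hypothetical helper chosen for this example; it is not part of NLTK.

```python
import re

def basic_clean(text):
    """Replace every non-letter character with a space, lowercase, and split on whitespace."""
    letters_only = re.sub(r"[^A-Za-z]", " ", text)  # removes digits and punctuation
    return letters_only.lower().split()

tokens = basic_clean("The 3 Movies were GREAT, really!")
print(tokens)
```

Stemming and lemmatization, the third step, need language-specific rules or a dictionary, which is where a library such as NLTK comes in.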



### **NLTK**
NLTK (the Natural Language Toolkit) is a commonly used library for processing text. We will use it in this lab. Let's first download the NLTK resources we need.


In [None]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [None]:
import re
import numpy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from bs4 import BeautifulSoup

def cleanText(text, lemmatize, stemmer):
    """Method for cleaning text from train and test data. Removes numbers, punctuation, and capitalization. Stems or lemmatizes text."""

    if isinstance(text, float):
        text = str(text)
    if isinstance(text, numpy.int64):
        text = str(text)
    try:
        text = text.decode()
    except AttributeError:
        pass

    soup = BeautifulSoup(text, "lxml")
    text = soup.get_text()
    text = re.sub(r"[^A-Za-z]", " ", text)
    text = text.lower()


    if lemmatize:
        wordnet_lemmatizer = WordNetLemmatizer()

        def get_tag(tag):
            if tag.startswith('J'):
                return wordnet.ADJ
            elif tag.startswith('V'):
                return wordnet.VERB
            elif tag.startswith('N'):
                return wordnet.NOUN
            elif tag.startswith('R'):
                return wordnet.ADV
            else:
                return ''

        text_result = []
        tokens = word_tokenize(text)  # Generate list of tokens
        tagged = pos_tag(tokens)
        for t in tagged:
            try:
                text_result.append(wordnet_lemmatizer.lemmatize(t[0], get_tag(t[1][:2])))
            except:
                text_result.append(wordnet_lemmatizer.lemmatize(t[0]))
        return text_result

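    if stemmer:
        text_result = []
        tokens = word_tokenize(text)
        snowball_stemmer = SnowballStemmer('english')
        for t in tokens:
            text_result.append(snowball_stemmer.stem(t))
        return text_result

    # Neither lemmatize nor stemmer was requested: return the cleaned tokens
    # unchanged instead of falling through and returning None.
    return word_tokenize(text)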

In [None]:
sample_text = "Troubling"
sample_text_result = cleanText(sample_text, lemmatize=False, stemmer=True)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text)
print(sample_text_result)
sample_text_result = cleanText(sample_text, lemmatize=True, stemmer=False)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text_result)

Troubling
troubl
trouble


## Section 1.2: BAG OF WORDS

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms.

The approach is very simple and flexible, and can be used in many ways for extracting features from documents.

A bag-of-words is a representation of text that describes the occurrence of words within a document.
It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
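To make the idea concrete, here is a minimal pure-Python sketch of a bag-of-words representation. The helper names `build_vocabulary` and `to_bow_vector` and the example sentences are made up for illustration. The sketch counts words the way a count-based vectorizer does in spirit, but it is not a reimplementation of any library.

```python
from collections import Counter

def build_vocabulary(documents):
    """Collect the sorted set of words seen in the training documents."""
    return sorted({word for doc in documents for word in doc.lower().split()})

def to_bow_vector(document, vocabulary):
    """Count how often each vocabulary word occurs; unknown words are ignored."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]

train_docs = ["the food was good", "the service was bad"]
vocab = build_vocabulary(train_docs)
print(vocab)
print(to_bow_vector("good food good service", vocab))
print(to_bow_vector("service good food good", vocab))  # same words, different order
print(to_bow_vector("terrible food", vocab))           # 'terrible' is not in the vocabulary
```

The second and third calls produce the same vector, which is exactly the loss of word order described above. The last call shows that a word missing from the training vocabulary contributes nothing to the vector.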

In [None]:
# Functions to convert document(s) to a list of words, with the option of removing stopwords. Returns document-term matrix.

def createBagOfWords(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = CountVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer = CountVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    bag_of_words_train = vectorizer.fit_transform(clean_train).toarray()
    bag_of_words_test = vectorizer.transform(clean_test).toarray()
    return bag_of_words_train, bag_of_words_test


## Section 1.3: TF-IDF
TF-IDF (term frequency-inverse document frequency) weights each word by how often it occurs in a document and by how rare it is across the whole collection. This addresses a limitation of Bag of Words, which counts every word the same way no matter how common it is across documents.

The number of times a term occurs in a document is called its term frequency (TF).

Document frequency (DF) is the number of documents in which the term is present. Inverse document frequency (IDF) is the inverse of the document frequency, and measures the informativeness of term *t*: the fewer documents contain *t*, the higher its IDF.
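The three quantities can be computed by hand on a tiny corpus. The sketch below is illustrative only: the documents are invented, the helper names are hypothetical, and the logarithmic form `log(N / DF)` is a common textbook choice assumed here, since libraries often use smoothed variants.

```python
import math

docs = [
    "good movie",
    "bad movie",
    "good acting good plot",
]
tokenized = [d.split() for d in docs]

def term_frequency(term, tokens):
    """Number of times the term occurs in one document."""
    return tokens.count(term)

def document_frequency(term, tokenized_docs):
    """Number of documents that contain the term at least once."""
    return sum(1 for tokens in tokenized_docs if term in tokens)

def inverse_document_frequency(term, tokenized_docs):
    """log(N / DF): 0 for a term found in every document, larger for rarer terms."""
    return math.log(len(tokenized_docs) / document_frequency(term, tokenized_docs))

print(term_frequency("good", tokenized[2]))    # 'good' appears twice in the third document
print(document_frequency("movie", tokenized))  # 'movie' appears in two documents
print(document_frequency("plot", tokenized))   # 'plot' appears in one document
print(inverse_document_frequency("plot", tokenized) > inverse_document_frequency("movie", tokenized))
```

The rarer term "plot" ends up with a higher IDF than "movie", which is the sense in which IDF measures informativeness.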




In [None]:
def createTFIDF(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = TfidfVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer =  TfidfVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    tfidf_train = vectorizer.fit_transform(clean_train).toarray()
    tfidf_test = vectorizer.transform(clean_test).toarray()
    return tfidf_train, tfidf_test

# **Section 2: UNDERSTANDING THE DATA : A REVIEWS DATASET**

Sentiment analysis is the interpretation and classification of emotions (such as positive, negative and neutral) within text data using text analysis techniques.  
Given below is a dataset consisting of reviews along with sentiment class (positive or negative).

In [None]:
# Upload the Reviews CSV file that has been shared with you.
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

Saving reviews.csv to reviews.csv


In [None]:
import pandas as pd
df = pd.read_csv('reviews.csv')

In [None]:
df = df.dropna()

In [None]:
df.to_csv('reviews.csv', index=False)

# **Section 3: KNN MODEL**

Given below are two KNN models; in the first case we are using Bag-of-Words and in the second case we are using TF-IDF.
Note the different metrics and parameters used in each.
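Before running the scikit-learn models, it may help to see why the distance metric matters. The following is a minimal pure-Python sketch of a nearest-neighbour vote, assuming toy two-dimensional count vectors over a hypothetical vocabulary `['good', 'bad']`. The helper names are invented for this example and the sketch is not how scikit-learn implements KNN.

```python
import math
from collections import Counter

def euclidean_distance(a, b):
    """Straight-line distance; sensitive to how long the documents are."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity; depends on direction only, not on vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def knn_predict(query, train_vectors, train_labels, k, distance):
    """Majority vote among the k training points closest to the query."""
    ranked = sorted(range(len(train_vectors)),
                    key=lambda i: distance(query, train_vectors[i]))
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy count vectors: [count of 'good', count of 'bad']
train_vectors = [[4, 0], [6, 1], [0, 1], [1, 2]]
train_labels = ["positive", "positive", "negative", "negative"]
query = [1, 0]  # a short review that contains 'good' once

print(knn_predict(query, train_vectors, train_labels, k=1, distance=euclidean_distance))
print(knn_predict(query, train_vectors, train_labels, k=1, distance=cosine_distance))
```

With Euclidean distance the short query lands closest to another short document and is labelled negative, while cosine distance ignores document length and labels it positive. This is one reason cosine distance is a common choice for text vectors.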

In [None]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 1: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"],
                                                        test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2,
                                         metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

Note: Cross-validation will be discussed in detail in the upcoming lab session.

In [None]:
## KNN accuracy after using BoW
predicted, y_test = bow_knn()

KNN with BOW accuracy = 62.30366492146597%




Cross Validation Accuracy: 0.62
[0.60784314 0.58431373 0.66141732]




In [None]:
## KNN accuracy after using TFIDF
predicted, y_test = tfidf_knn()

KNN with TFIDF accuracy = 70.15706806282722%




Cross Validation Accuracy: 0.73
[0.7254902  0.74117647 0.72834646]


# **Section 4: SPAM TEXT DATASET**
Now let's use what we've learnt to classify texts as spam or not spam.

In [None]:
# Upload the spam text data CSV file that has been shared with you. You can also download the file from https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

Saving spam.csv to spam.csv


In [None]:
import pandas as pd
df = pd.read_csv('spam.csv')
df

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [None]:
df['Category'] = df['Category'].map({'ham': 0, 'spam': 1})

In [None]:
df.head(5)

Unnamed: 0,Category,Message
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


In [None]:
len(df)

5572

In [None]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 2: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2, metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

In [None]:
# This cell may take some time to run
predicted, y_test = bow_knn()

KNN with BOW accuracy = 92.19730941704036%
Cross Validation Accuracy: 0.91
[0.90713324 0.90040377 0.91245791]




In [None]:
# This cell may take some time to run
predicted, y_test = tfidf_knn()

KNN with TFIDF accuracy = 98.56502242152466%
Cross Validation Accuracy: 0.97
[0.96837147 0.96769852 0.96363636]


### Questions to Think About and Answer
1. Why does the TF-IDF approach generally result in a better accuracy than Bag-of-Words?

The **TF-IDF (Term Frequency-Inverse Document Frequency)** approach often results in better accuracy than the **Bag-of-Words (BoW)** model in text classification and information retrieval tasks for several key reasons:

### 1. **Handling Term Frequency in Context**:
   - **BoW** counts the occurrences of each word in a document and represents the text as a **sparse vector** of word frequencies. This approach treats every word equally, assuming that the frequency of a term alone is important for meaning, without considering the importance of the term across the entire dataset.
   - **TF-IDF**, on the other hand, adjusts the term frequencies by considering both the **frequency of the term within the document** and the **rarity of the term across all documents**. The formula is:
     \[
     \text{TF-IDF}(t,d) = \text{TF}(t,d) \times \log\left(\frac{N}{\text{DF}(t)}\right)
     \]
     where:
     - **TF(t,d)** is the frequency of the term in the document.
     - **DF(t)** is the number of documents containing the term.
     - **N** is the total number of documents in the corpus.
     
   - **Impact**: By considering the inverse document frequency (IDF), **TF-IDF reduces the weight of common words** (like "the", "is", "on") that appear frequently across many documents, while giving more importance to rare terms that are more likely to contain meaningful information. This makes the representation more **sensitive to the key distinguishing terms** in the dataset, leading to better accuracy.
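A short worked example of the formula above shows the down-weighting effect. This is a sketch under stated assumptions: the three-sentence corpus is invented, and `tf_idf` is a hypothetical helper that applies the plain formula with a natural logarithm.

```python
import math

def tf_idf(term, doc_tokens, corpus_tokens):
    """TF(t, d) * log(N / DF(t)), following the formula above."""
    tf = doc_tokens.count(term)
    df = sum(1 for tokens in corpus_tokens if term in tokens)
    if df == 0:
        return 0.0  # term never seen in the corpus
    return tf * math.log(len(corpus_tokens) / df)

corpus = [
    "the match was a thriller".split(),
    "the election was close".split(),
    "the blockchain startup raised funds".split(),
]
doc = corpus[2]

for term in ["the", "blockchain"]:
    print(term, round(tf_idf(term, doc, corpus), 3))
```

Because "the" occurs in every document, its IDF is log(1) = 0 and its weight vanishes, while "blockchain" keeps a positive weight. Note that scikit-learn's `TfidfVectorizer`, which `createTFIDF` uses in this lab, by default applies a smoothed IDF, `ln((1 + N) / (1 + DF)) + 1`, and L2-normalizes each document vector, so its numbers differ from this plain formula.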

### 2. **Improved Discrimination Between Documents**:
   - In **BoW**, common words may dominate the representation, leading to an inability to distinguish between documents that are similar or different. For example, in a collection of news articles, terms like "report" or "company" might appear frequently but don't carry much distinguishing power.
   - **TF-IDF** gives higher weight to less frequent but more informative words, making it easier to distinguish between documents based on their specific content. For example, a rare term like "blockchain" in a financial article would be given higher importance, which helps the model differentiate it from other documents where "blockchain" might not appear.

### 3. **Improved Generalization**:
   - **BoW** tends to produce high-dimensional vectors in which many features represent common words that provide little discriminative power. Because the raw counts of these words can dominate distance computations, this can hurt performance, especially with small datasets.
   - **TF-IDF** vectors have the same dimensionality as **BoW** vectors, but down-weighting the common words makes them more informative. As a result, models trained on **TF-IDF representations** often generalize better.

### 4. **Increased Focus on Important Terms**:
   - With **BoW**, all terms are treated equally, meaning the importance of a term is purely based on its frequency, not its ability to distinguish between classes.
   - **TF-IDF** focuses on terms that are more likely to provide **unique** or **meaningful insights**. This is particularly useful in domains where **domain-specific terms** or **rare terms** carry significant meaning. For instance, in **topic modeling** or **sentiment analysis**, rare and distinctive words often define the key differences between topics or sentiment classes.

### Conclusion:
The **TF-IDF** approach generally leads to better accuracy than **BoW** because it provides a more **balanced** and **context-aware** representation of text, which helps in better distinguishing between relevant and irrelevant terms. By adjusting for term frequency across the entire dataset and down-weighting common terms, **TF-IDF** helps to highlight more important features, making it a more powerful tool for text classification tasks.


2. Can you think of techniques that are better than both BoW and TF-IDF?

Yes, there are several advanced techniques that can outperform both **Bag-of-Words (BoW)** and **TF-IDF** in certain contexts, particularly when dealing with complex language patterns and semantic meanings in text. Here are a few of the most notable techniques:

### 1. **Word Embeddings (Word2Vec, GloVe, FastText)**:
   - **Word embeddings** are dense vector representations of words, which capture their semantic meaning based on context rather than mere frequency. Unlike **BoW** and **TF-IDF**, which treat words as individual features, word embeddings place similar words closer together in vector space. This is because they are trained on large corpora and learn relationships between words (e.g., "king" is closer to "queen" than to "car").
   - **Word2Vec** (developed by Google) and **GloVe** (Global Vectors for Word Representation) are popular models for generating these embeddings, and they generally perform better than both **BoW** and **TF-IDF** when the goal is to capture the meaning and relationships of words.
   - **FastText** is a variant that improves on Word2Vec by taking subword information into account, making it effective for morphologically rich languages or out-of-vocabulary words.

   **Advantages**:
   - Capture **semantic relationships** between words.
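   - **Similarity in vector space**: Similar words are represented by similar vectors.
   - **Generalizes better** on tasks involving synonymy, because synonyms end up with nearby vectors. (Static embeddings still assign a single vector per word, so polysemy, where one word has multiple meanings, is not resolved.)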

   **References**:
   - Mikolov et al. (2013), *Efficient Estimation of Word Representations in Vector Space* (Word2Vec).
   - Pennington et al. (2014), *GloVe: Global Vectors for Word Representation*.
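The notion of "closer together in vector space" is usually measured with cosine similarity. The sketch below uses hand-made three-dimensional toy vectors purely for illustration; they are not trained embeddings, and real Word2Vec or GloVe vectors are learned from large corpora and have many more dimensions.

```python
import math

# Hand-made toy vectors for illustration only (NOT trained embeddings).
toy_embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "car":   [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

king_queen = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
king_car = cosine_similarity(toy_embeddings["king"], toy_embeddings["car"])
print(king_queen > king_car)
```

In a BoW or TF-IDF representation, by contrast, "king" and "queen" are simply two unrelated columns, so no such similarity is available.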

### 2. **Contextualized Embeddings (BERT, GPT, ELMo)**:
   - **BERT (Bidirectional Encoder Representations from Transformers)** and **GPT (Generative Pretrained Transformer)** take word embeddings a step further by providing **contextualized representations**. This means that the embedding for a word can change depending on the words around it. For example, the word "bank" in "river bank" would have a different embedding than "bank" in "financial bank."
   - **ELMo (Embeddings from Language Models)** is another method that produces context-dependent word embeddings, allowing for deeper understanding in tasks like sentiment analysis, question answering, and language inference.
   
   **Advantages**:
   - **Context-awareness**: Words can have different meanings depending on their context.
   - **State-of-the-art performance** on many NLP tasks, including classification, named entity recognition, and more.

   **References**:
   - Devlin et al. (2018), *BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding*.
   - Peters et al. (2018), *Deep Contextualized Word Representations* (ELMo).

### 3. **Transformer-Based Models (T5, BART, RoBERTa)**:
   - Transformer-based models, like **T5 (Text-to-Text Transfer Transformer)**, **BART (Bidirectional and Auto-Regressive Transformers)**, and **RoBERTa** (a BERT variant trained with a more robustly optimized pre-training recipe), have achieved strong performance on many NLP tasks. T5 and BART are encoder-decoder models, so besides capturing semantic meaning they can also generate text, making them well suited to tasks like summarization and translation.
   
   **Advantages**:
   - **Highly accurate** for a wide variety of tasks (e.g., summarization, question answering).
   - Can **generate** coherent and contextually relevant text, unlike BoW and TF-IDF.

   **References**:
   - Raffel et al. (2020), *Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer* (T5).
   - Lewis et al. (2020), *BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension*.

### 4. **Doc2Vec (Paragraph Vectors)**:
   - **Doc2Vec**, an extension of Word2Vec, learns vector representations for entire documents or paragraphs, rather than just individual words. This can be particularly useful when dealing with longer texts, like reviews or articles, where the meaning depends on the overall context rather than individual words.
   
   **Advantages**:
   - Captures the meaning of entire **documents** or paragraphs.
   - Provides a **fixed-size representation** for variable-length texts.

   **References**:
   - Le and Mikolov (2014), *Distributed Representations of Sentences and Documents* (Doc2Vec).

### 5. **Latent Semantic Analysis (LSA)** and **Latent Dirichlet Allocation (LDA)**:
   - **LSA** and **LDA** are both **topic modeling** techniques that can be used to discover the latent structure or topics in a set of documents. While **BoW** and **TF-IDF** use surface-level term frequencies, **LSA** and **LDA** analyze the relationships between words and documents to find underlying patterns (topics).
   - **LDA** assumes that each document is a mixture of topics and each topic is a distribution over words, making it useful for unsupervised topic discovery and information retrieval.

   **Advantages**:
   - **Dimensionality reduction**: LSA, like **PCA**, reduces the number of features to the most significant ones.
   - **Discover hidden topics** in a corpus, enhancing understanding of large document collections.

   **References**:
   - Deerwester et al. (1990), *Indexing by Latent Semantic Analysis* (LSA).
   - Blei et al. (2003), *Latent Dirichlet Allocation* (LDA).

### Conclusion:
While **BoW** and **TF-IDF** are basic and useful methods for text representation, more advanced techniques such as **word embeddings**, **contextualized embeddings**, **transformers**, **Doc2Vec**, and **topic modeling** can provide significantly better performance. These methods are better at capturing semantic relationships, understanding context, and improving accuracy, especially in complex tasks such as sentiment analysis, document classification, and language generation.

The choice of technique depends on the specific task at hand, but **transformer-based models** like **BERT** and **GPT** currently achieve state-of-the-art results on most NLP benchmarks, at a considerably higher computational cost than **BoW** or **TF-IDF**.


3. Read about Stemming and Lemmatization from the resources given below. Think about the pros/cons of each.

**Stemming** and **Lemmatization** are two popular techniques used in natural language processing (NLP) to reduce words to their root form, though they operate in different ways, with different advantages and disadvantages.

### **Stemming:**

**Definition**: Stemming is the process of reducing a word to its base or root form by chopping off prefixes or suffixes. It usually relies on simple heuristics and does not take into account the actual meaning of the word.

**Pros**:
1. **Faster**: Stemming algorithms are generally faster because they use simple rules and algorithms (e.g., the Porter stemmer). This makes them more efficient when processing large datasets.
2. **Simplicity**: The algorithms behind stemming are straightforward and require less computational effort.
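3. **Effective for Information Retrieval**: For search engines or systems that need to match words, stemming can help improve recall by grouping variations of a word (e.g., "running", "runs", "run") together. Irregular forms such as "ran" are generally not grouped, because suffix-stripping rules do not cover them.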

**Cons**:
1. **Inaccuracy**: Stemming can produce stems that are not real words (e.g., "running" becomes "run", but "troubling" becomes "troubl", as in the sample output earlier in this lab).
2. **Lack of Precision**: Because stemming does not consider the meaning of words, it can over- or under-reduce words. This can lead to errors, especially in tasks that require precise understanding of word meanings (e.g., sentiment analysis).
3. **No Linguistic Consideration**: It doesn’t always respect linguistic rules, which can lead to inappropriate reductions, such as reducing "dogs" to "dog", but also "happiness" to "happi", which is not a valid word.

### **Lemmatization:**

**Definition**: Lemmatization is the process of reducing a word to its base or dictionary form (called a lemma), taking into account the context and the word's part of speech (POS). For example, "running" might be lemmatized to "run", but "better" would be lemmatized to "good".

**Pros**:
1. **Accuracy**: Lemmatization produces linguistically correct words (lemmas), making it more accurate and semantically meaningful than stemming. For example, "leaves" would be reduced to "leaf" (as a noun) or "leave" (as a verb), not to "leav".
2. **Better for NLP Applications**: Since it considers context and part of speech, lemmatization works better in applications that require deep semantic understanding, such as **text classification**, **sentiment analysis**, and **machine translation**.
3. **Produces Real Words**: The output of lemmatization is normally a valid dictionary word (words the lemmatizer does not know are typically returned unchanged), which reduces the chances of introducing errors in downstream tasks.

**Cons**:
1. **Slower**: Lemmatization is generally slower than stemming because it requires access to a dictionary and involves more complex algorithms.
2. **Complexity**: It requires more computational resources and sometimes part-of-speech tagging, which can add complexity to implementation.
3. **Requires More Context**: Lemmatization often depends on part-of-speech tagging, which requires understanding the context in which a word appears. This can lead to errors if the context isn't well-understood by the system.

### **Summary of Differences:**

| Feature               | **Stemming**                              | **Lemmatization**                         |
|-----------------------|-------------------------------------------|-------------------------------------------|
| **Approach**          | Rule-based (heuristics)                   | Dictionary-based (context-aware)          |
| **Output**            | May produce non-words (e.g., "troubl")    | Produces valid words (e.g., "good")       |
| **Speed**             | Faster                                    | Slower                                    |
| **Accuracy**          | Lower (may result in imprecise stems)     | Higher (linguistically correct)           |
| **Complexity**        | Simpler                                   | More complex (may require POS tagging)   |

### Conclusion:
- **Use Stemming** when **speed** is more critical and when **semantic accuracy** is less important (e.g., for **information retrieval** or when working with large text corpora).
- **Use Lemmatization** when **accuracy** and **meaning** are crucial (e.g., in tasks like **text classification**, **sentiment analysis**, and other NLP applications requiring precise understanding).

The choice between stemming and lemmatization largely depends on the specific NLP task at hand and the balance between performance (speed) and accuracy.
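The contrast can be sketched in a few lines of plain Python. Both functions below are deliberately simplistic stand-ins written for this example: `toy_stem` strips a handful of suffixes and is not the Porter or Snowball algorithm, and `TOY_LEMMAS` is a tiny hand-written lookup table standing in for a real dictionary such as WordNet.

```python
def toy_stem(word):
    """Strip the first matching suffix; no dictionary, so results may be non-words."""
    for suffix in ("ness", "ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hand-written entries; a real lemmatizer picks the lemma based on part of speech
# (e.g. 'leaves' -> 'leaf' as a noun, 'better' -> 'good' as an adjective).
TOY_LEMMAS = {"troubling": "trouble", "better": "good", "leaves": "leaf", "dogs": "dog"}

def toy_lemmatize(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return TOY_LEMMAS.get(word, word)

for word in ["troubling", "dogs", "leaves", "better"]:
    print(word, toy_stem(word), toy_lemmatize(word))
```

The stemmer needs no lookup table and is cheap, but yields fragments like "troubl" and "leav" and cannot relate "better" to "good". The lemmatizer returns real words, but only for entries it knows about.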


### Useful Resources for further reading
1. Stemming and Lemmatization: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
2. TF-IDF and BoW : https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
3. TF-IDF: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
