
# **FOUNDATIONS OF MODERN MACHINE LEARNING, IIIT Hyderabad**
### MODULE: CLASSIFICATION-1
### LAB-3 : Using KNN for Text Classification
#### Module Coordinator: Jashn Arora


---

## **Section 1: Understanding NLP tools**

In this lab we will apply KNN to a real-world NLP application: text classification. But first, let's look at some NLP techniques for text classification and the tools we use when working with Python for NLP.

## Section 1.1: Data Cleaning and Preprocessing

Raw text must be processed and converted into a form that is suitable for machine-learning algorithms.  
With text, several things need to be taken into account:  


1.   Removing numbers from the text
2.   Handling capitalization and punctuation.
3.   Stemming and Lemmatizing text.  

Most importantly, words and images cannot be fed directly to algorithms; they must first be converted into vectors, a numeric form that algorithms can work with.
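
As a rough illustration of the first two steps (removing numbers, handling capitalization and punctuation), here is a minimal sketch using only the standard library. The helper name `basic_clean` and the sample sentence are made up for this example; the letters-only regular expression is the same one used in `cleanText` later in this lab.

```python
import re

def basic_clean(text):
    """Keep letters only, collapse whitespace, and lowercase the result."""
    text = re.sub(r"[^A-Za-z]", " ", text)    # digits and punctuation become spaces
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return text.lower()

# Hypothetical sample sentence, not taken from the lab datasets.
cleaned = basic_clean("Call NOW!!! Win 2 tickets, FREE.")
print(cleaned)
```

Stemming and lemmatization then operate on the tokens of a string cleaned this way.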



### **NLTK**
NLTK (the Natural Language Toolkit) is a commonly used library for processing text. We will use it in this lab. Let's first download the NLTK resources we need.


In [21]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [22]:
import re
import numpy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from bs4 import BeautifulSoup

def cleanText(text, lemmatize, stemmer):
    """Method for cleaning text from train and test data. Removes numbers, punctuation, and capitalization. Stems or lemmatizes text."""

    if isinstance(text, float):
        text = str(text)
    if isinstance(text, numpy.int64):
        text = str(text)
    try:
        text = text.decode()
    except AttributeError:
        pass

    soup = BeautifulSoup(text, "lxml")
    text = soup.get_text()
    text = re.sub(r"[^A-Za-z]", " ", text)
    text = text.lower()


    if lemmatize:
        wordnet_lemmatizer = WordNetLemmatizer()

        def get_tag(tag):
            if tag.startswith('J'):
                return wordnet.ADJ
            elif tag.startswith('V'):
                return wordnet.VERB
            elif tag.startswith('N'):
                return wordnet.NOUN
            elif tag.startswith('R'):
                return wordnet.ADV
            else:
                return wordnet.NOUN  # fall back to WordNet's default POS

        text_result = []
        tokens = word_tokenize(text)  # Generate list of tokens
        tagged = pos_tag(tokens)
        for word, tag in tagged:
            text_result.append(wordnet_lemmatizer.lemmatize(word, get_tag(tag)))
        return text_result

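    if stemmer:
        text_result = []
        tokens = word_tokenize(text)
        snowball_stemmer = SnowballStemmer('english')
        for t in tokens:
            text_result.append(snowball_stemmer.stem(t))
        return text_result

    # Neither option selected: return the cleaned tokens unchanged
    # (otherwise the function would implicitly return None).
    return word_tokenize(text)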

In [23]:
sample_text = "Troubling"
sample_text_result = cleanText(sample_text, lemmatize=False, stemmer=True)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text)
print(sample_text_result)
sample_text_result = cleanText(sample_text, lemmatize=True, stemmer=False)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text_result)

Troubling
troubl
trouble


## Section 1.2: BAG OF WORDS

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms.

The approach is very simple and flexible, and can be used in many ways for extracting features from documents.

A bag-of-words is a representation of text that describes the occurrence of words within a document.
It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
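
To make the "order is discarded" point concrete, here is a minimal standard-library sketch. The two sentences and the helper `to_bow` are invented for illustration; `CountVectorizer`, used below, does the same job with more options.

```python
from collections import Counter

# Two made-up documents with the same words in a different order.
docs = ["the dog chased the cat", "the cat chased the dog"]

# Vocabulary: every distinct word across the documents, in sorted order.
vocab = sorted({word for doc in docs for word in doc.split()})

def to_bow(doc):
    """Return the count of each vocabulary word in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [to_bow(doc) for doc in docs]
print(vocab)
print(vectors)
```

Both sentences map to the same vector even though they mean different things, which is exactly the information a bag-of-words gives up.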

In [24]:
# Functions to convert document(s) to a list of words, with the option of removing stopwords. Returns document-term matrix.

def createBagOfWords(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = CountVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer = CountVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    bag_of_words_train = vectorizer.fit_transform(clean_train).toarray()
    bag_of_words_test = vectorizer.transform(clean_test).toarray()
    return bag_of_words_train, bag_of_words_test


## Section 1.3: TF-IDF
TF-IDF (Term Frequency-Inverse Document Frequency) addresses a limitation of Bag of Words: raw counts treat every word as equally informative. TF-IDF instead weights each word by how often it occurs in a document and how rare it is across the corpus, which often works well for text classification.

The number of times a term occurs in a document is called its Term frequency (TF).

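Document frequency (DF) is the number of documents in which the term appears. Inverse document frequency (IDF) is based on the inverse of DF, usually log-scaled, and measures how informative a term *t* is: the fewer documents a term appears in, the higher its IDF. The TF-IDF weight of a term in a document is the product of its TF and IDF.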
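
The definitions above can be sketched in a few lines of plain Python. The three tiny documents are invented, and the smoothed form `idf(t) = ln((1 + N) / (1 + df(t))) + 1` is one common variant (to our understanding it matches scikit-learn's default smoothing); other references use slightly different formulas. The length normalization that `TfidfVectorizer` applies by default is left out here.

```python
import math
from collections import Counter

# Three made-up, already tokenized documents.
docs = [
    "good movie good plot".split(),
    "bad movie".split(),
    "good acting".split(),
]
n_docs = len(docs)

# Document frequency: number of documents containing each term.
df = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Smoothed IDF (an assumed variant); rarer terms get larger weights.
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(doc):
    # Term frequency is the raw count of the term in this document.
    tf = Counter(doc)
    return {term: count * idf(term) for term, count in tf.items()}

weights = tfidf(docs[0])
for term, w in sorted(weights.items()):
    print(f"{term}: {w:.3f}")
```

In the first document, "plot" appears once but in no other document, so its IDF is higher than that of "movie", which appears in two documents.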




In [25]:
def createTFIDF(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = TfidfVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer =  TfidfVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    tfidf_train = vectorizer.fit_transform(clean_train).toarray()
    tfidf_test = vectorizer.transform(clean_test).toarray()
    return tfidf_train, tfidf_test

# **Section 2: UNDERSTANDING THE DATA : A REVIEWS DATASET**

Sentiment analysis is the interpretation and classification of emotions (such as positive, negative and neutral) within text data using text analysis techniques.  
Given below is a dataset consisting of reviews along with sentiment class (positive or negative).

In [26]:
# Upload the Reviews CSV file that has been shared with you.
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

Saving reviews.csv to reviews (1).csv


In [27]:
import pandas as pd
df = pd.read_csv('reviews.csv')

In [28]:
df = df.dropna()

In [29]:
df.to_csv('reviews.csv', index=False)

# **Section 3: KNN MODEL**

Given below are two KNN models; in the first case we are using Bag-of-Words and in the second case we are using TF-IDF.
Note the different metrics and parameters used in each.
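
Before running the scikit-learn models, it may help to see what the `metric` and `weights` parameters change. The following is a small pure-Python sketch, not scikit-learn's implementation: the function names, the two-word count vectors, and the labels are all invented for illustration.

```python
import math
from collections import defaultdict

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an all-zero vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)

def knn_predict(train_X, train_y, query, k, distance, weighted=False):
    # Sort training points by distance to the query and keep the k closest.
    nearest = sorted((distance(x, query), label) for x, label in zip(train_X, train_y))[:k]
    votes = defaultdict(float)
    for dist, label in nearest:
        # Uniform voting gives each neighbor one vote; distance voting uses 1/dist.
        votes[label] += 1.0 / (dist + 1e-12) if weighted else 1.0
    return max(votes, key=votes.get)

# Made-up counts of two words per document; the query is a long document.
train_X = [[1, 0], [2, 0], [6, 6], [7, 5]]
train_y = ["pos", "pos", "neg", "neg"]
query = [8, 1]

by_euclidean = knn_predict(train_X, train_y, query, 3, euclidean)
by_cosine = knn_predict(train_X, train_y, query, 3, cosine_distance, weighted=True)
print(by_euclidean, by_cosine)
```

Euclidean distance is sensitive to document length, so the long query lands near the other long documents. Cosine distance compares only the direction of the vectors, so the query is matched with the short documents that use the same word. This is one reason cosine distance is a popular choice for text.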

In [30]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 1: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"],
                                                        test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2,
                                         metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

Note: Cross-validation will be discussed in detail in the upcoming lab session.

In [31]:
## KNN accuracy after using BoW
predicted, y_test = bow_knn()



KNN with BOW accuracy = 62.30366492146597%
Cross Validation Accuracy: 0.62
[0.60784314 0.58431373 0.66141732]






In [32]:
## KNN accuracy after using TFIDF
predicted, y_test = tfidf_knn()



KNN with TFIDF accuracy = 70.15706806282722%
Cross Validation Accuracy: 0.73
[0.7254902  0.74117647 0.72834646]




# Section 4: SPAM TEXT DATASET
Now let's use what we've learnt to classify texts as spam or not spam.

In [33]:
# Upload the spam text data CSV file that has been shared with you. You can also download the file from https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

Saving spam.csv to spam (1).csv


In [34]:
import pandas as pd
df = pd.read_csv('spam.csv')
df

|      | Category | Message |
|------|----------|---------|
| 0    | ham      | Go until jurong point, crazy.. Available only ... |
| 1    | ham      | Ok lar... Joking wif u oni... |
| 2    | spam     | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3    | ham      | U dun say so early hor... U c already then say... |
| 4    | ham      | Nah I don't think he goes to usf, he lives aro... |
| ...  | ...      | ... |
| 5567 | spam     | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham      | Will ü b going to esplanade fr home? |
| 5569 | ham      | Pity, * was in mood for that. So...any other s... |
| 5570 | ham      | The guy did some bitching but I acted like i'd... |


In [35]:
df['Category'] = df['Category'].map({'ham': 0, 'spam': 1})

In [36]:
df.head(5)

|   | Category | Message |
|---|----------|---------|
| 0 | 0        | Go until jurong point, crazy.. Available only ... |
| 1 | 0        | Ok lar... Joking wif u oni... |
| 2 | 1        | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0        | U dun say so early hor... U c already then say... |
| 4 | 0        | Nah I don't think he goes to usf, he lives aro... |


In [37]:
len(df)

5572

In [38]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 2: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2, metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

In [39]:
# This cell may take some time to run
predicted, y_test = bow_knn()



KNN with BOW accuracy = 92.19730941704036%
Cross Validation Accuracy: 0.91
[0.90713324 0.90040377 0.91245791]




In [40]:
# This cell may take some time to run
predicted, y_test = tfidf_knn()



KNN with TFIDF accuracy = 98.56502242152466%
Cross Validation Accuracy: 0.97
[0.96837147 0.96769852 0.96363636]
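
Both functions return `predicted` and `y_test`, which the cells above do not inspect further. If one class is much rarer than the other, accuracy alone can hide mistakes on the rare class, so it can be worth counting the errors per class. The sketch below uses a hypothetical helper `confusion_counts` and invented labels, not the actual model output.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

# Invented labels for illustration only: 1 = spam, 0 = ham.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
print(accuracy, precision, recall)
```

With the real results, the same idea would be `confusion_counts(list(y_test), list(predicted))`. In this toy case the accuracy looks high even though half of the spam messages are missed.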


### Questions to Think About and Answer
1. Why does the TF-IDF approach generally result in better accuracy than Bag-of-Words?
2. Can you think of techniques that are better than both BoW and TF-IDF?
3. Read about Stemming and Lemmatization from the resources given below. Think about the pros/cons of each.

ANSWERS:

1A. TF-IDF (Term Frequency-Inverse Document Frequency) generally results in better accuracy than Bag-of-Words (BoW) in many text analysis tasks because it addresses some of the key limitations of the BoW model. Here's why:

### 1. **Handling Common Words (Stop Words)**
   - **Bag-of-Words**: BoW simply counts the frequency of words in a document without considering their importance in the context of a corpus. This means that common words like "the," "is," and "and" have a high frequency, even though they may not carry much meaning in distinguishing documents.
   - **TF-IDF**: TF-IDF reduces the weight of common words by assigning a lower score to frequently occurring words across the corpus (based on inverse document frequency). This helps to focus more on informative words that are specific to a document, rather than generic or common words.

### 2. **Differentiating Between Rare and Frequent Terms**
   - **Bag-of-Words**: It treats all words equally based only on their occurrence, meaning rare but significant terms may be overlooked if more frequent but less meaningful terms dominate the document.
   - **TF-IDF**: By using the inverse document frequency component, TF-IDF boosts the importance of rare terms that may be more representative of the document's unique content, helping to better capture the key features.

### 3. **Relevance of Terms**
   - **Bag-of-Words**: It assumes that words with higher frequencies are more important, which might not always be the case. A term might appear frequently in a document but have little importance if it's common across all documents.
   - **TF-IDF**: It balances term frequency within a document (TF) with how often the term appears across the entire corpus (IDF), making it more effective at identifying words that are both common in the document and uncommon in other documents. This leads to better feature weighting for text classification or retrieval tasks.

### 4. **Sparse and High-Dimensional Data**
   - **Bag-of-Words**: BoW produces very high-dimensional, sparse vectors, especially for large corpora, with one dimension per vocabulary word. Raw counts let words that carry little significance contribute as much as informative ones, which adds noise.
   - **TF-IDF**: TF-IDF vectors have the same dimensionality and sparsity as BoW vectors, but the weights are more discriminative: frequent but unimportant words are downweighted, which reduces the effect of noise and often improves model performance.

### 5. **Improved Performance in Classification and Retrieval**
   - Models trained on **TF-IDF** vectors often perform better in tasks like text classification, sentiment analysis, and document retrieval, as the features are better weighted and more meaningful. BoW may struggle in such tasks due to over-representation of unimportant words.

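In summary, **TF-IDF** often outperforms BoW because it takes into account the importance of words relative to the corpus, allowing models to focus on more meaningful words. Note that in this lab the two KNN models also differ in distance metric (Euclidean vs. cosine) and neighbor weighting (uniform vs. distance), so the accuracy gap is not due to TF-IDF alone.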


2A. Yes, there are several techniques that are generally considered more powerful and effective than both Bag-of-Words (BoW) and TF-IDF, especially with the rise of deep learning and more sophisticated natural language processing (NLP) methods. Some of the most widely used and effective techniques include:

### 1. **Word Embeddings (e.g., Word2Vec, GloVe)**
   - **What are they?** Word embeddings are dense, continuous vector representations of words that capture semantic meaning by placing words with similar meanings close to each other in the vector space. Word2Vec and GloVe are popular algorithms for generating these embeddings.
   - **Advantages**:
     - Captures the **context** and **semantic relationships** between words.
     - Word vectors are dense (low-dimensional), which reduces the dimensionality compared to sparse representations in BoW and TF-IDF.
     - Embeddings capture the idea that words appearing in similar contexts tend to have similar meanings (e.g., "king" and "queen" are semantically close).
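     - Generalizes to related words: a model can benefit from training examples that contain words similar to those in a new document. Note that static embeddings assign one vector per word, so they do not by themselves resolve **polysemy** (multiple meanings).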
   - **Why better?**: Unlike BoW and TF-IDF, which treat every word as an independent dimension and capture only counts or weights, embeddings can represent the relationships between words, which often leads to better performance in tasks like text classification, sentiment analysis, and language modeling.
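
The idea that similar words sit close together can be illustrated with cosine similarity. The three vectors below are hand-made toy values, NOT embeddings trained with Word2Vec or GloVe; real embeddings typically have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy vectors for illustration, not trained embeddings.
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "apple": [0.1, 0.0, 0.9],
}

sim_royal = cosine_similarity(embeddings["king"], embeddings["queen"])
sim_fruit = cosine_similarity(embeddings["king"], embeddings["apple"])
print(round(sim_royal, 3), round(sim_fruit, 3))
```

In a bag-of-words representation, "king" and "queen" would be two unrelated dimensions, so a classifier could not exploit their similarity.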

### 2. **Contextualized Word Embeddings (e.g., BERT, GPT)**
   - **What are they?** These are advanced embeddings generated by large pre-trained language models like **BERT** (Bidirectional Encoder Representations from Transformers) and **GPT** (Generative Pre-trained Transformer). These models generate contextualized representations of words, meaning the same word can have different embeddings depending on its surrounding context.
   - **Advantages**:
     - Models context **at the sentence level**, not just at the word level.
     - Better handling of **polysemy**: Words like "bank" (riverbank vs. financial institution) get different representations based on context.
     - Pre-trained on massive corpora, so they learn rich language representations that capture a wide range of linguistic information (syntax, semantics).
     - Fine-tuning allows these models to adapt to specific tasks like question answering, summarization, translation, etc.
   - **Why better?**: Contextualized embeddings outperform static embeddings like Word2Vec because they account for the context in which a word appears. This enables significantly better performance on NLP tasks such as Named Entity Recognition (NER), Sentiment Analysis, and Machine Translation.

### 3. **Transformers and Attention Mechanisms**
   - **What are they?** The transformer architecture uses self-attention mechanisms to allow models to weigh the importance of different words in a sentence dynamically. Models like **BERT**, **GPT**, **T5**, and **RoBERTa** are based on the transformer architecture.
   - **Advantages**:
     - Able to capture **long-range dependencies** between words in a sequence, which is challenging for traditional methods.
     - **Attention mechanisms** allow the model to focus on the most important words or parts of the sentence when making predictions.
     - Highly scalable and adaptable for a variety of downstream tasks through fine-tuning.
   - **Why better?**: The ability to model complex relationships between words at scale makes transformers much more effective than traditional BoW and TF-IDF approaches, especially for tasks involving context understanding and generation.
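
A heavily simplified sketch of self-attention follows. It assumes queries, keys, and values are all equal to the input vectors; a real transformer learns separate projection matrices for each and uses several attention heads. The 2-dimensional token vectors are invented.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = inputs."""
    d = len(vectors[0])
    outputs, all_weights = [], []
    for q in vectors:
        # Similarity of this token to every token, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in vectors]
        weights = softmax(scores)
        # Each output is a weighted average of all input vectors.
        out = [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(d)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # invented 2-d token vectors
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and shows how much one token draws on every other token, regardless of how far apart they are in the sequence.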

### 4. **Latent Semantic Analysis (LSA) / Latent Dirichlet Allocation (LDA)**
   - **What are they?**
     - **LSA**: A technique that applies Singular Value Decomposition (SVD) to a term-document matrix to identify patterns in word usage. It reduces the dimensionality and finds latent relationships between terms.
     - **LDA**: A generative probabilistic model used for topic modeling. It assumes that documents are mixtures of topics, and topics are distributions over words.
   - **Advantages**:
     - **LSA**: Reduces dimensionality and finds relationships between words that aren’t explicitly present in the raw data.
     - **LDA**: Extracts **topics** from documents, helping to identify thematic structure and meaning beyond simple word counts.
   - **Why better?**: LSA and LDA can uncover hidden structures (semantic or topical) in the data, making them more powerful than BoW and TF-IDF for tasks like topic extraction and document clustering.

### 5. **Doc2Vec (Paragraph Vectors)**
   - **What is it?** An extension of Word2Vec that generates dense vector representations for entire documents or paragraphs, not just individual words. This helps represent a document’s overall meaning in a single vector.
   - **Advantages**:
     - Captures the context and meaning of an entire document, not just words or terms.
     - More effective in tasks like document classification, where understanding the overall content of the text is crucial.
   - **Why better?**: Instead of relying on word counts or importance scores for individual terms (as in TF-IDF), Doc2Vec captures the relationships between the document's words and the overall meaning of the document in a low-dimensional vector.

### 6. **Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) Networks**
   - **What are they?** RNNs and LSTMs are types of neural networks designed to handle sequential data like text. They maintain an internal state, which allows them to capture dependencies and relationships across words in a sequence.
   - **Advantages**:
     - Excellent at capturing **sequential relationships** and handling longer texts.
     - LSTM (and GRU) models can capture long-term dependencies by mitigating the vanishing gradient problem faced by traditional RNNs.
   - **Why better?**: Unlike BoW and TF-IDF, which lose word order, RNNs and LSTMs maintain the sequence of words, making them better for tasks that require understanding the flow of information, like text generation, machine translation, and speech recognition.

### 7. **Convolutional Neural Networks (CNNs) for Text**
   - **What are they?** CNNs are typically used for image data, but they’ve also been successfully applied to text. In text classification, CNNs apply convolution operations over word sequences to extract local features (e.g., n-grams).
   - **Advantages**:
     - Good at capturing **local patterns** in text, such as specific word combinations (e.g., "not good," "very happy").
     - Efficient and scalable for large datasets.
   - **Why better?**: CNNs for text classification have shown strong performance in tasks like sentiment analysis because they can identify important local features and patterns, which unigram BoW and TF-IDF can't.
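
The "local pattern" idea can be seen with plain n-grams. The sketch below only shows the sliding window over tokens; an actual CNN slides learned filters over word embeddings rather than extracting strings. The helper `ngrams` and the sentence are made up.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the movie was not good".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
print(unigrams)
print(bigrams)
```

With single words, "not" and "good" are separate features; the bigram "not good" keeps the negation attached to the word it modifies.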

### 8. **Transformers with Pre-training (e.g., BERT, GPT, T5)**
   - **What are they?** Transformer-based models that have been pre-trained on massive amounts of data and then fine-tuned for specific NLP tasks. Examples include **BERT** (bidirectional transformer), **GPT** (generative transformer), and **T5** (text-to-text transformer).
   - **Advantages**:
     - **Pre-trained models**: Use transfer learning, where the model has already learned general language features from huge datasets and can be fine-tuned for specific tasks with smaller datasets.
     - Capture deep, contextualized understanding of language.
   - **Why better?**: These models often outperform traditional methods by a significant margin because of their ability to understand complex language patterns and context at scale.

### Conclusion:
While **BoW** and **TF-IDF** are simple, fast, and effective for basic text tasks, they are limited in terms of context, semantic understanding, and feature representation. More advanced techniques like **Word2Vec**, **BERT**, **GPT**, and other transformer-based models not only capture the meaning and relationships between words but also handle the complex, contextual nuances of language, leading to significantly better performance in many NLP tasks.


3A. Stemming and lemmatization are two important techniques used in Natural Language Processing (NLP) for reducing words to their base or root form. Both techniques help in simplifying the text data by grouping different forms of a word into a single term, which is particularly useful for tasks like text classification, search, and information retrieval.

#### **1. Stemming**
**Stemming** is the process of reducing a word to its root form by chopping off its suffixes. The resulting "stem" may not necessarily be a valid word, but it is a truncated version that helps simplify word variations.

- **How it works**:
   - Stemming algorithms, such as **Porter Stemmer** and **Snowball Stemmer**, apply simple rules to remove word endings. For example, "running" becomes "run", "happiness" becomes "happi", and "cats" becomes "cat".
   - It does not consider the grammatical structure of the word and might lead to incorrect stems (over-stemming or under-stemming).

- **Pros of Stemming**:
   - **Speed**: Stemming is computationally efficient because it uses straightforward rules for word truncation.
   - **Simplicity**: It’s easy to implement and understand, requiring fewer resources than lemmatization.
   - **Effective for certain tasks**: In some cases, especially when exact precision isn’t necessary (like search engines), stemming can work well enough.

- **Cons of Stemming**:
   - **Inaccuracy (Over-stemming)**: Stemming can sometimes reduce words too much, leading to nonsensical stems. For example, "universal" and "university" might both be reduced to "univers", which are semantically very different.
   - **Inflexibility**: Stemming applies fixed rules and does not account for context or the grammatical role of the word in a sentence.
   - **Loss of meaning**: Stems produced are often not real words, which can lead to a loss of information.

#### **2. Lemmatization**
**Lemmatization** is a more sophisticated approach to reducing words to their base form (lemma), where the word is reduced to its **dictionary form**. Unlike stemming, lemmatization considers the context and part of speech of a word.

- **How it works**:
   - Lemmatization algorithms rely on dictionaries and linguistic analysis to return the root form. For instance, "running" becomes "run", and "better" becomes "good" because the algorithm understands the comparative form.
   - Lemmatization requires identifying the part of speech (POS) of the word to return the correct lemma.

- **Pros of Lemmatization**:
   - **Accuracy**: Lemmatization is more precise because it reduces words to their **true base form**, which is always a valid word. For example, "better" becomes "good", and "geese" becomes "goose".
   - **Context-aware**: Lemmatization takes into account the part of speech and meaning, which helps retain more of the word’s original semantics.
   - **Less ambiguity**: Since the lemma is the root word in a dictionary, it generally preserves more meaning and creates less confusion than stemming.

- **Cons of Lemmatization**:
   - **Slower and more computationally expensive**: Lemmatization requires looking up words in a lexicon and identifying the part of speech, which takes more time and resources compared to stemming.
   - **Complexity**: Implementing a lemmatizer requires access to a dictionary or corpus and is more difficult than applying simple stemming rules.
   - **More effort to get results**: Lemmatization might need to process parts of speech tags or other linguistic features, making it less straightforward to implement than stemming.
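
The contrast between the two approaches can be sketched with toy versions of each. Both helpers below are invented for illustration: `toy_stem` uses a handful of suffix rules and is far simpler than the Porter or Snowball algorithms, and `toy_lemmatize` uses a tiny hand-written lookup table standing in for a real lexicon such as WordNet.

```python
SUFFIXES = ("ing", "es", "s", "ed")  # hypothetical rule list, not Porter's rules

def toy_stem(word):
    """Strip the first matching suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Tiny hand-written lookup keyed by (word, part of speech).
LEMMAS = {
    ("better", "adj"): "good",
    ("geese", "noun"): "goose",
    ("troubling", "verb"): "trouble",
}

def toy_lemmatize(word, pos):
    """Look the word up in the lexicon; unknown words are returned unchanged."""
    return LEMMAS.get((word, pos), word)

for word, pos in [("troubling", "verb"), ("geese", "noun"), ("better", "adj")]:
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word, pos))
```

The rule-based stemmer is fast and needs no data, but it produces the non-word "troubl" and cannot handle irregular forms like "geese". The lookup-based lemmatizer returns real words, but only for entries its lexicon knows and only when the part of speech is supplied.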

### Comparison Table: Stemming vs. Lemmatization

| Feature                   | Stemming                  | Lemmatization              |
|---------------------------|---------------------------|----------------------------|
| **Approach**               | Rule-based truncation      | Dictionary-based reduction |
| **Output**                 | Root form (may not be valid word) | Root form (valid dictionary word) |
| **Context Awareness**      | No                         | Yes                        |
| **Speed**                  | Faster                     | Slower                     |
| **Accuracy**               | Less accurate (over-stemming) | More accurate              |
| **Implementation**         | Simple and easy            | More complex               |
| **Resource Requirements**  | Low                        | High (needs POS tagging, dictionary) |
| **Examples**               | "running" → "run", "cats" → "cat" | "better" → "good", "running" → "run" |

### When to Use Stemming vs. Lemmatization?
- **Use Stemming**: When speed is a priority, or the task doesn’t require high precision. Stemming can be effective for **search engines** or **basic NLP tasks**, where exact word forms are not critical.
- **Use Lemmatization**: When you need more accurate analysis, especially for applications where understanding context or precise meanings is important. Lemmatization is better for tasks like **machine translation**, **text summarization**, or **document classification** where word forms and meaning matter more.

In summary:
- **Stemming** is faster but less precise, often resulting in inaccurate word truncation.
- **Lemmatization** is slower but yields more meaningful and contextually accurate root words, making it preferable for tasks requiring a deeper understanding of language.

### Useful Resources for further reading
1. Stemming and Lemmatization: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
2. TF-IDF and BoW : https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
3. TF-IDF: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
