<a href="https://colab.research.google.com/github/shalini225/-FMML_M1L3.ipynb/blob/main/Lab_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **FOUNDATIONS OF MODERN MACHINE LEARNING, IIIT Hyderabad**
### MODULE: CLASSIFICATION-1
### LAB-3 : Using KNN for Text Classification
#### Module Coordinator: Jashn Arora


---

## **Section 1: Understanding NLP tools**

In this lab we will be using KNN on a real world NLP application i.e. is text classification. But first look at some NLP techniques for text classification and tools that we use when we want to use python for NLP.

## Section 1.2: Data Cleaning and Preprocessing step

Raw text must be processed and converted into a form so that it is suitable to use with various machine-learning algorithms.  
In case of text, there are lots of things that need to be taken into account.  


1.   Removing numbers from the text
2.   Handling capitalization and punctuation.
3.   Stemming and Lemmatizing text.  

And most importantly, one can't just use words or images directly in algorithms; they need to be converted into vectors- a form that algorithms can understand.



### **NLTK**
NLTK (or Natural Language Tool Kit) is a commonly used library for processing text. We will use this tool in this lab. Lets first install it.


In [1]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [2]:
import re
import numpy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from bs4 import BeautifulSoup

def cleanText(text, lemmatize, stemmer):
    """Method for cleaning text from train and test data. Removes numbers, punctuation, and capitalization. Stems or lemmatizes text."""

    if isinstance(text, float):
        text = str(text)
    if isinstance(text, numpy.int64):
        text = str(text)
    try:
        text = text.decode()
    except AttributeError:
        pass

    soup = BeautifulSoup(text, "lxml")
    text = soup.get_text()
    text = re.sub(r"[^A-Za-z]", " ", text)
    text = text.lower()


    if lemmatize:
        wordnet_lemmatizer = WordNetLemmatizer()

        def get_tag(tag):
            if tag.startswith('J'):
                return wordnet.ADJ
            elif tag.startswith('V'):
                return wordnet.VERB
            elif tag.startswith('N'):
                return wordnet.NOUN
            elif tag.startswith('R'):
                return wordnet.ADV
            else:
                return ''

        text_result = []
        tokens = word_tokenize(text)  # Generate list of tokens
        tagged = pos_tag(tokens)
        for t in tagged:
            try:
                text_result.append(wordnet_lemmatizer.lemmatize(t[0], get_tag(t[1][:2])))
            except:
                text_result.append(wordnet_lemmatizer.lemmatize(t[0]))
        return text_result

    if stemmer:
        text_result = []
        tokens = word_tokenize(text)
        snowball_stemmer = SnowballStemmer('english')
        for t in tokens:
            text_result.append(snowball_stemmer.stem(t))
        return text_result

In [3]:
sample_text = "Troubling"
sample_text_result = cleanText(sample_text, lemmatize=False, stemmer=True)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text)
print(sample_text_result)
sample_text_result = cleanText(sample_text, lemmatize=True, stemmer=False)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text_result)

LookupError: 
**********************************************************************
  Resource [93mpunkt_tab[0m not found.
  Please use the NLTK Downloader to obtain the resource:

  [31m>>> import nltk
  >>> nltk.download('punkt_tab')
  [0m
  For more information see: https://www.nltk.org/data.html

  Attempted to load [93mtokenizers/punkt_tab/english/[0m

  Searched in:
    - '/root/nltk_data'
    - '/usr/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************


## Section 1.2: BAG OF WORDS

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms.

The approach is very simple and flexible, and can be used in many ways for extracting features from documents.

A bag-of-words is a representation of text that describes the occurrence of words within a document.
It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.

In [None]:
# Functions to convert document(s) to a list of words, with the option of removing stopwords. Returns document-term matrix.

def createBagOfWords(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = CountVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer = CountVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    bag_of_words_train = vectorizer.fit_transform(clean_train).toarray()
    bag_of_words_test = vectorizer.transform(clean_test).toarray()
    return bag_of_words_train, bag_of_words_test


## Section 1.3: TF-IDF
TF-IDF technique is used to find meaning of sentences consisting of words and cancels out the incapabilities of Bag of Words technique which is good for text classification or for helping a machine read words in numbers.

The number of times a term occurs in a document is called its Term frequency (TF).

 Document frequency is the number of documents in which the word is present.  Inverse DF (IDF) is the inverse of the document frequency which measures the informativeness of term *t*.




In [None]:
def createTFIDF(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = TfidfVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer =  TfidfVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    tfidf_train = vectorizer.fit_transform(clean_train).toarray()
    tfidf_test = vectorizer.transform(clean_test).toarray()
    return tfidf_train, tfidf_test

# **Section 2: UNDERSTANDING THE DATA : A REVIEWS DATASET**

Sentiment analysis is the interpretation and classification of emotions (such as positive, negative and neutral) within text data using text analysis techniques.  
Given below is a dataset consisting of reviews along with sentiment class (positive or negative).

In [None]:
# Upload the Reviews CSV file that has been shared with you.
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

In [None]:
import pandas as pd
df = pd.read_csv('reviews.csv')

In [None]:
df = df.dropna()

In [None]:
df.to_csv('reviews.csv', index=False)

# **Section 3: KNN MODEL**

Given below are two KNN models; in the first case we are using Bag-of-Words and in the second case we are using TF-IDF.
Note the different metrics and parameters used in each.

In [None]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 1: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"],
                                                        test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2,
                                         metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

Note: Cross-validation will be discussed in detail in the upcoming lab session.

In [None]:
## KNN accuracy after using BoW
predicted, y_test = bow_knn()

In [None]:
## KNN accuracy after using TFIDF
predicted, y_test = tfidf_knn()

# Section 4: SPAM TEXT DATASET
Now let's use what we've learnt to classify texts as spam or not spam.

In [None]:
# Upload the spam text data CSV file that has been shared with you. You can also download the file from https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

In [None]:
import pandas as pd
df = pd.read_csv('spam.csv')
df

In [None]:
df['Category'] = df['Category'].map({'ham': 0, 'spam': 1})

In [None]:
df.head(5)

In [None]:
len(df)

In [None]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 2: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2, metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

In [None]:
# This cell may take some time to run
predicted, y_test = bow_knn()

In [None]:
# This cell may take some time to run
predicted, y_test = tfidf_knn()

### Questions to Think About and Answer
1. Why does the TF-IDF approach generally result in a better accuracy than Bag-of-Words ?
A) The TF-IDF (Term Frequency-Inverse Document Frequency) approach generally results in better accuracy compared to the Bag-of-Words (BoW) model because it incorporates more meaningful information about the importance of words within a dataset. Here’s a breakdown of why TF-IDF performs better:


---

1. Weighting of Terms Based on Importance

Bag-of-Words: Treats all words equally. It only counts the raw frequency of words in a document, which means frequent words (like "the," "is," or "and") tend to dominate the representation, even though they carry little meaningful information.

TF-IDF: Assigns weights to words based on their importance.

Term Frequency (TF): Captures how frequently a word appears in a document.

Inverse Document Frequency (IDF): Reduces the weight of common words that appear in many documents, emphasizing less frequent but more discriminative words.


As a result, TF-IDF downplays the significance of common stopwords and emphasizes unique terms that can better distinguish between documents.



---

2. Reduction of Noise

In BoW, frequent but unimportant words (e.g., stopwords) can introduce noise into the model, leading to poorer performance.

TF-IDF reduces the influence of such noise by assigning low weights to words that appear in almost all documents.



---

3. Improved Discrimination Between Documents

Words with high IDF scores are typically rare and specific to certain documents, making them more useful for distinguishing between classes or topics.

BoW fails to differentiate between words that appear in all documents versus those that are rare and meaningful.


For example:

BoW might consider "movie" and "the" as equally important.

TF-IDF would identify "movie" as more informative if "the" appears frequently across all documents.



---

4. Handling Document-Length Bias

BoW models tend to favor longer documents because they contain more word occurrences, resulting in higher word counts.

TF-IDF normalizes term frequency, making it less biased toward document length and ensuring fair comparisons between documents.



---

5. Better Feature Representation

TF-IDF produces real-valued feature vectors instead of just integer counts, which provides a finer-grained representation of word importance.

This leads to improved performance when used with machine learning models, as the numerical weights capture more meaningful information than simple word counts.



---

6. Suitability for High-Dimensional Data

In text data, the number of features (unique words) is typically very high. TF-IDF helps highlight relevant words while suppressing irrelevant ones, making the high-dimensional space more manageable for machine learning algorithms.



---

Summary

While BoW treats all words equally, TF-IDF accounts for the relative importance of words in a document and across the corpus. By assigning higher weights to informative words and lower weights to common ones, TF-IDF creates a more meaningful feature representation, leading to better accuracy in tasks like text classification, clustering, and retrieval.


2. Can you think of techniques that are better than both BoW and TF-IDF ?
A) Yes, several techniques outperform Bag-of-Words (BoW) and TF-IDF in representing text data, especially with the advancements in natural language processing (NLP). These methods incorporate contextual, semantic, and structural information, which BoW and TF-IDF lack. Here are some of the most effective techniques:


---

1. Word Embeddings

Word embeddings represent words as dense, low-dimensional vectors that capture semantic relationships between words. Unlike BoW and TF-IDF, embeddings preserve word meaning and relationships.

Word2Vec (by Google): Generates word embeddings using two methods—CBOW (Continuous Bag of Words) and Skip-Gram. Words with similar meanings are mapped close together in the vector space.

GloVe (Global Vectors for Word Representation, by Stanford): Captures global co-occurrence statistics to build word embeddings.

FastText (by Facebook): Extends Word2Vec by including subword information, which helps represent rare or misspelled words.


Why better?

Captures semantic meaning and relationships between words (e.g., king - man + woman ≈ queen).

Handles vocabulary more effectively than BoW/TF-IDF.



---

2. Contextual Word Embeddings

Unlike static embeddings (Word2Vec), contextual embeddings generate word representations that depend on the surrounding context.

ELMo (Embeddings from Language Models): Uses deep bidirectional LSTM networks to generate context-sensitive embeddings for each word.

BERT (Bidirectional Encoder Representations from Transformers): Uses a transformer-based model to learn embeddings, capturing the context of words in both directions (left and right).

GPT (Generative Pre-trained Transformer): Similar to BERT but processes text unidirectionally during pretraining.


Why better?

Captures the meaning of words in context. For example, the word "bank" has different embeddings depending on whether it's used in "river bank" or "financial bank."

Produces state-of-the-art results on tasks like text classification, named entity recognition, and sentiment analysis.



---

3. Sentence and Document Embeddings

Techniques that generate vector representations for sentences or entire documents, rather than individual words:

Doc2Vec (Paragraph Vector): Extends Word2Vec to generate document-level embeddings.

Universal Sentence Encoder (by Google): Pretrained embeddings for sentences, useful for tasks like text similarity and clustering.

Sentence-BERT (SBERT): A modification of BERT that generates sentence-level embeddings, optimized for tasks like semantic similarity.


Why better?

Captures the meaning of entire sentences or documents.

Effective for tasks requiring more than word-level understanding, such as document classification or clustering.



---

4. Transformer-Based Models (End-to-End)

Modern transformer-based models, such as BERT, RoBERTa, GPT, and T5, not only generate embeddings but also perform text-related tasks directly in an end-to-end fashion.

These models are pretrained on massive corpora and fine-tuned for specific tasks like classification, summarization, or question answering.


Why better?

Eliminates the need for manual feature extraction (like BoW/TF-IDF).

Models the relationships between words, phrases, and sentences.

Achieves state-of-the-art performance on various NLP tasks.



---

5. Topic Modeling

Latent Dirichlet Allocation (LDA): Represents documents as mixtures of latent topics, where each topic is a distribution over words.

Non-Negative Matrix Factorization (NMF): Factorizes the term-document matrix to identify hidden topics in the corpus.


Why better?

Captures the underlying topics in a document, providing more interpretability.

Effective for tasks like topic discovery or summarization.



---

6. Neural Network-Based Techniques

Recurrent Neural Networks (RNNs): Processes sequential data, capturing word dependencies.

Long Short-Term Memory (LSTM): Addresses issues like long-term dependencies in sequential data.

Attention Mechanisms and Transformers: Focus on relevant parts of the input sequence, enabling better performance for tasks like machine translation, summarization, and classification.



---

7. Hybrid Approaches

Combining multiple techniques can also outperform BoW and TF-IDF. For example:

TF-IDF Weighted Word Embeddings: Combine the interpretability of TF-IDF with the semantic power of word embeddings.

BERT + TF-IDF Features: Combine transformer-based embeddings with traditional TF-IDF features to improve performance on certain tasks.



---

Summary

BoW and TF-IDF are simple and effective, but they ignore word meaning, context, and word order. Techniques like Word Embeddings (Word2Vec, GloVe), contextual embeddings (BERT, GPT), and transformer-based models capture rich, semantic, and contextual information, leading to better accuracy and performance in modern NLP tasks.


3. Read about Stemming and Lemmatization from the resources given below. Think about the pros/cons of each.
A) Stemming and Lemmatization are text preprocessing techniques used to reduce words to their root forms. While both serve the same purpose, they differ significantly in approach, accuracy, and computational cost. Here's an analysis of their pros and cons:


---

1. Stemming

Stemming is a heuristic process that chops off word endings (suffixes) to reach a base form, often using simple rules. The most common stemmer is Porter’s Stemmer.

Pros:

Speed and Simplicity: Stemming is fast and computationally inexpensive since it uses rule-based approaches (e.g., removing suffixes like -ing, -ed).

Good for Informal Applications: Works well for applications where precision is less critical, such as search engines, where approximate matches can be sufficient.

Language Independence: Stemmers can be easily implemented for different languages with basic rule adjustments.


Cons:

Lack of Accuracy: Stemming can produce non-real words that don’t exist in the language (e.g., "running" → "run", but "studies" → "studi").

Over-Stemming: It may incorrectly reduce words to unrelated stems (e.g., "universal" → "univers").

Under-Stemming: It may fail to reduce words with irregular endings to the same root (e.g., "running" and "ran" might be treated differently).

No Context Awareness: Stemming works purely on rules and doesn't consider the meaning or context of words.



---

2. Lemmatization

Lemmatization reduces words to their dictionary form (lemma) by considering the word's meaning, part of speech (POS), and linguistic rules. Lemmatization typically uses tools like WordNet, spaCy, or NLTK.

Pros:

Accuracy: Lemmatization produces valid dictionary words (e.g., "running" → "run", "better" → "good").

Context-Aware: It takes into account the word's POS (e.g., "bats" as a noun → "bat", as a verb → "bat").

Better for NLP Applications: Lemmatization is more effective for advanced applications requiring semantic understanding, such as text classification or machine translation.


Cons:

Computational Cost: Lemmatization is slower and more computationally intensive because it requires linguistic resources and parsing the context of the word.

Dependency on Language: Lemmatization relies on language-specific dictionaries and tools, making it harder to implement across different languages.

Tool Dependency: Lemmatization requires external libraries (e.g., WordNet, spaCy), unlike stemming, which can be implemented using simple rules.



---

Comparison Table


---

Which One to Choose?

Use Stemming when speed is a priority, and minor inaccuracies are acceptable (e.g., keyword search, information retrieval).

Use Lemmatization for tasks requiring linguistic accuracy, such as sentiment analysis, text classification, or chatbots.


Example:

By carefully choosing between stemming and lemmatization, you can optimize both accuracy and efficiency based on the specific requirements of your NLP task.

### Useful Resources for further reading
1. Stemming and Lemmatization: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
2. TF-IDF and BoW : https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
3. TF-IDF: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
