<a href="https://colab.research.google.com/github/shalini225/-FMML_M1L3.ipynb/blob/main/Lab_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **FOUNDATIONS OF MODERN MACHINE LEARNING, IIIT Hyderabad**
### MODULE: CLASSIFICATION-1
### LAB-3 : Using KNN for Text Classification
#### Module Coordinator: Jashn Arora


---

## **Section 1: Understanding NLP tools**

In this lab we will apply KNN to a real-world NLP application: text classification. First, we look at some NLP techniques for text classification and at the tools we use when doing NLP in Python.

## Section 1.1: Data Cleaning and Preprocessing step

Raw text must be processed and converted into a form that is suitable for use with machine-learning algorithms.  
For text, several things need to be taken into account:  


1.   Removing numbers from the text
2.   Handling capitalization and punctuation.
3.   Stemming and Lemmatizing text.  

Most importantly, one can't use words or images directly in algorithms; they need to be converted into vectors, a form that algorithms can understand.
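The first two cleaning steps can be sketched with the standard library alone. This is a minimal illustration, not the lab's `cleanText` function: `basic_clean` is a hypothetical helper name, and it skips stemming and lemmatization entirely.

```python
import re

def basic_clean(text):
    """Hypothetical helper: keep letters only, lowercase, split on whitespace."""
    # Digits and punctuation are replaced by spaces
    letters_only = re.sub(r"[^A-Za-z]", " ", text)
    # Lowercasing handles capitalization; split() drops the extra spaces
    return letters_only.lower().split()

print(basic_clean("Loved it!!! 10/10, would watch AGAIN."))
```

The result is a list of lowercase tokens with numbers and punctuation removed, which is the kind of input the vectorizers later in the lab expect.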



### **NLTK**
NLTK (the Natural Language Toolkit) is a commonly used library for processing text. We will use it in this lab. Let's first download the NLTK resources we need.


In [None]:
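import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')
# Newer NLTK releases look up these resource names for the tokenizer and tagger
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')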

In [None]:
import re
import numpy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from bs4 import BeautifulSoup

def cleanText(text, lemmatize, stemmer):
    """Method for cleaning text from train and test data. Removes numbers, punctuation, and capitalization. Stems or lemmatizes text."""

    if isinstance(text, float):
        text = str(text)
    if isinstance(text, numpy.int64):
        text = str(text)
    try:
        text = text.decode()
    except AttributeError:
        pass

    soup = BeautifulSoup(text, "lxml")
    text = soup.get_text()
    text = re.sub(r"[^A-Za-z]", " ", text)
    text = text.lower()


    if lemmatize:
        wordnet_lemmatizer = WordNetLemmatizer()

        def get_tag(tag):
            if tag.startswith('J'):
                return wordnet.ADJ
            elif tag.startswith('V'):
                return wordnet.VERB
            elif tag.startswith('N'):
                return wordnet.NOUN
            elif tag.startswith('R'):
                return wordnet.ADV
            else:
                return wordnet.NOUN  # fall back to the lemmatizer's default part of speech

        text_result = []
        tokens = word_tokenize(text)  # Generate list of tokens
        tagged = pos_tag(tokens)  # Tag each token with its part of speech
        for token, tag in tagged:
            text_result.append(wordnet_lemmatizer.lemmatize(token, get_tag(tag)))
        return text_result

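    if stemmer:
        text_result = []
        tokens = word_tokenize(text)
        snowball_stemmer = SnowballStemmer('english')
        for t in tokens:
            text_result.append(snowball_stemmer.stem(t))
        return text_result

    # Neither option selected: return the cleaned tokens unchanged instead of None
    return word_tokenize(text)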

In [None]:
sample_text = "Troubling"
sample_text_result = cleanText(sample_text, lemmatize=False, stemmer=True)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text)
print(sample_text_result)
sample_text_result = cleanText(sample_text, lemmatize=True, stemmer=False)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text_result)

## Section 1.2: BAG OF WORDS

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms.

The approach is very simple and flexible, and can be used in many ways for extracting features from documents.

A bag-of-words is a representation of text that describes the occurrence of words within a document.
It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
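To make the idea concrete, here is a minimal pure-Python sketch of a bag-of-words representation. It is an illustration only: `bag_of_words` is a hypothetical helper that splits on whitespace, whereas the lab itself relies on scikit-learn's `CountVectorizer`.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, count_vectors) for a list of strings."""
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary is every distinct word seen, in a fixed (sorted) order
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One count per vocabulary word; word order in the document is lost
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

docs = ["the movie was good", "the movie was bad bad"]
vocab, vectors = bag_of_words(docs)
print(vocab)    # ['bad', 'good', 'movie', 'the', 'was']
print(vectors)  # [[0, 1, 1, 1, 1], [2, 0, 1, 1, 1]]
```

Reordering the words of a document, for example "good was the movie", yields the same vector, which is exactly the loss of order the "bag" name refers to.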

In [None]:
# Functions to convert document(s) to a list of words, with the option of removing stopwords. Returns document-term matrix.

def createBagOfWords(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = CountVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer = CountVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    bag_of_words_train = vectorizer.fit_transform(clean_train).toarray()
    bag_of_words_test = vectorizer.transform(clean_test).toarray()
    return bag_of_words_train, bag_of_words_test


## Section 1.3: TF-IDF
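TF-IDF (term frequency-inverse document frequency) addresses a weakness of the Bag of Words technique: raw counts treat every occurrence as equally important, so words that appear in almost every document can dominate the representation. TF-IDF instead weights each word by how often it occurs in a document and by how rare it is across the whole collection, which makes it well suited to text classification.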

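The number of times a term occurs in a document is called its term frequency (TF).

Document frequency (DF) is the number of documents in which the term is present. Inverse document frequency (IDF) is based on the inverse of the document frequency, usually on a log scale, and measures the informativeness of term *t*: the fewer documents contain *t*, the higher its IDF. The TF-IDF weight of a term in a document is the product of its TF and its IDF.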
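The definitions above can be sketched in pure Python. This is a minimal illustration under stated assumptions: `tfidf` is a hypothetical helper, tokens are split on whitespace, and the IDF uses the smoothed form `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row, which, as far as I know, matches the defaults of scikit-learn's `TfidfVectorizer`. Other TF-IDF variants exist.

```python
import math
from collections import Counter

def tfidf(documents):
    """Return (vocabulary, idf, rows) using a smoothed IDF and L2-normalized rows."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({w for tokens in tokenized for w in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    df = {w: sum(1 for tokens in tokenized if w in tokens) for w in vocabulary}
    # Smoothed inverse document frequency (assumed variant, see lead-in)
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocabulary}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)  # term frequency within this document
        row = [counts[w] * idf[w] for w in vocabulary]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocabulary, idf, rows

docs = ["the movie was good", "the movie was bad", "the plot was thin"]
vocabulary, idf, rows = tfidf(docs)
for word in ("the", "movie", "good"):
    print(word, round(idf[word], 3))
```

In this toy corpus "the" appears in every document and gets the lowest IDF, "movie" appears in two of three and sits in the middle, and "good" appears in only one and gets the highest.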




In [None]:
def createTFIDF(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = TfidfVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer =  TfidfVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    tfidf_train = vectorizer.fit_transform(clean_train).toarray()
    tfidf_test = vectorizer.transform(clean_test).toarray()
    return tfidf_train, tfidf_test

# **Section 2: UNDERSTANDING THE DATA : A REVIEWS DATASET**

Sentiment analysis is the interpretation and classification of emotions (such as positive, negative and neutral) within text data using text analysis techniques.  
Given below is a dataset consisting of reviews along with sentiment class (positive or negative).

In [None]:
# Upload the Reviews CSV file that has been shared with you.
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

In [None]:
import pandas as pd
df = pd.read_csv('reviews.csv')

In [None]:
df = df.dropna()

In [None]:
df.to_csv('reviews.csv', index=False)

# **Section 3: KNN MODEL**

Given below are two KNN models; in the first case we are using Bag-of-Words and in the second case we are using TF-IDF.
Note the different metrics and parameters used in each.
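As a rough picture of what the two distance metrics do, here is a minimal pure-Python KNN sketch. The helper names and the toy vectors are illustrative assumptions, and the voting is uniform only; it does not reproduce the `weights='distance'` option used with TF-IDF in the lab.

```python
import math
from collections import Counter

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an all-zero vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)

def knn_predict(train_vectors, train_labels, query, k, distance):
    """Majority vote among the k training vectors closest to the query."""
    ranked = sorted(range(len(train_vectors)),
                    key=lambda i: distance(train_vectors[i], query))
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy count vectors over the vocabulary [bad, good, movie]; 1 = positive, 0 = negative
train_vectors = [[0, 1, 1], [0, 2, 1], [1, 0, 1], [3, 0, 1]]
train_labels = [1, 1, 0, 0]
query = [0, 4, 2]

print(knn_predict(train_vectors, train_labels, query, k=3, distance=euclidean_distance))
print(knn_predict(train_vectors, train_labels, query, k=3, distance=cosine_distance))
print(euclidean_distance([0, 2, 1], query), cosine_distance([0, 2, 1], query))
```

The last line shows the practical difference: `[0, 2, 1]` and `[0, 4, 2]` point in the same direction, so their cosine distance is essentially zero even though their Euclidean distance is not. This is one reason cosine distance is often preferred for documents of different lengths.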

In [None]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 1: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"],
                                                        test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2,
                                         metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

Note: Cross-validation will be discussed in detail in the upcoming lab session.
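As a brief preview, the sketch below shows the basic idea behind `cv=3`: the training data is split into three folds, and each fold takes a turn as the validation set. It is a simplified illustration with a hypothetical helper, `k_fold_indices`, that uses plain contiguous folds; for classifiers, scikit-learn's `cross_val_score` uses stratified folds instead.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for contiguous, unshuffled folds."""
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, validation
        start += size

for train_idx, val_idx in k_fold_indices(7, 3):
    print("validate on", val_idx, "train on", train_idx)
```

Each sample is used for validation exactly once, and the reported cross-validation accuracy is the mean of the per-fold scores.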

In [None]:
## KNN accuracy after using BoW
predicted, y_test = bow_knn()

In [None]:
## KNN accuracy after using TFIDF
predicted, y_test = tfidf_knn()

# **Section 4: SPAM TEXT DATASET**
Now let's use what we've learnt to classify texts as spam or not spam.

In [None]:
# Upload the spam text data CSV file that has been shared with you. You can also download the file from https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

In [None]:
import pandas as pd
df = pd.read_csv('spam.csv')
df

In [None]:
df['Category'] = df['Category'].map({'ham': 0, 'spam': 1})

In [None]:
df.head(5)

In [None]:
len(df)

In [None]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 2: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2, metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

In [None]:
# This cell may take some time to run
predicted, y_test = bow_knn()

In [None]:
# This cell may take some time to run
predicted, y_test = tfidf_knn()

### Questions to Think About and Answer
1. Why does the TF-IDF approach generally result in better accuracy than Bag-of-Words?
A) The TF-IDF (Term Frequency-Inverse Document Frequency) approach generally yields better accuracy than Bag-of-Words (BoW) because it provides a more refined representation of the importance of words in a document. Here’s how TF-IDF improves upon BoW:

1. Reduces the Impact of Common Words:

In BoW, a word's weight is simply its raw count, so frequent but uninformative words like "the," "and," or "is" can dominate the vector unless they are removed as stop words.

TF-IDF reduces the weight of these common words by factoring in inverse document frequency (IDF), which assigns lower scores to words that appear in many documents. This helps focus on words that are more unique and informative for each document.



2. Highlights Informative Words:

By combining term frequency (how often a word appears in a document) with IDF, TF-IDF gives higher scores to words that are more representative of a document’s specific content.

For instance, if a word appears frequently in a document but rarely elsewhere, it is likely to be more relevant to that document’s topic. TF-IDF captures this better than BoW.



3. Handles Document Similarity Better:

TF-IDF provides a more nuanced way to compare documents by emphasizing words that are important to each document, rather than counting all words equally.

This helps in calculating document similarity more accurately, which is often essential in classification and clustering tasks.



4. Reduces Noise in Data:

Since TF-IDF down-weights common words and emphasizes distinctive terms, it often reduces "noise" in the dataset.

This can make it easier for machine learning models to focus on meaningful distinctions between documents, which can improve the accuracy of the model.




Overall, by adjusting for the relevance of words in the entire dataset, TF-IDF offers a more informative and discriminative feature representation than BoW, which typically improves classification accuracy and overall model performance.


2. Can you think of techniques that are better than both BoW and TF-IDF?
A) Yes, several techniques generally outperform both Bag-of-Words (BoW) and TF-IDF because they capture deeper linguistic and contextual relationships between words. Here are some of the most effective ones:

1. Word Embeddings (Word2Vec, GloVe)

Word2Vec: This technique, created by Google, learns word embeddings by analyzing large corpora. It creates dense, continuous-valued vector representations for words, allowing words with similar meanings to have similar vectors.

GloVe (Global Vectors for Word Representation): Developed by Stanford, GloVe captures word meaning by combining the strengths of global word co-occurrence statistics and local context windows.

Advantage: Both methods capture semantic relationships (e.g., "king" - "man" + "woman" ≈ "queen"), which BoW and TF-IDF do not. This allows for a more meaningful comparison between words and documents based on context.


2. FastText

Developed by Facebook, FastText is an extension of Word2Vec that includes subword information. It represents each word as a combination of character n-grams, allowing it to handle rare or misspelled words better.

Advantage: FastText can generate embeddings for words it hasn't seen in training by leveraging subword information, which makes it useful for domains with specialized or infrequent vocabulary.


3. Doc2Vec

Doc2Vec (or Paragraph Vectors) is a technique that learns fixed-length representations for entire documents, rather than individual words. Developed as an extension of Word2Vec, it assigns vectors to whole documents, capturing the context of the entire document rather than only words.

Advantage: This method is effective in capturing thematic and contextual similarities between documents, which can significantly improve performance on tasks like document classification, sentiment analysis, and topic modeling.


4. Topic Modeling (LDA, NMF)

Latent Dirichlet Allocation (LDA), a probabilistic model, and Non-negative Matrix Factorization (NMF), a matrix-decomposition method, both aim to uncover hidden thematic structures within text.

Advantage: These methods model each document as a mixture of topics, providing a more nuanced, interpretable representation that captures thematic relationships. This often improves classification and clustering tasks.


5. Transformer-Based Models (BERT, RoBERTa, GPT)

BERT (Bidirectional Encoder Representations from Transformers): BERT is a pre-trained transformer model that captures context in both directions (left-to-right and right-to-left). It has been widely used for a variety of NLP tasks, achieving state-of-the-art results.

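Other Variants: RoBERTa, GPT, and T5 are other transformer-based models. They differ in architecture and pre-training objective (for example, GPT is a left-to-right decoder, while T5 is an encoder-decoder) and can be fine-tuned for specific NLP tasks.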

Advantage: Transformer models understand context at a deeper level, handling polysemy (words with multiple meanings) and capturing fine-grained relationships between words. This makes them highly effective for complex NLP tasks like question-answering, sentiment analysis, and named entity recognition.


6. Sentence and Document Embeddings (Sentence-BERT, Universal Sentence Encoder)

These techniques extend word embeddings to encode entire sentences or documents, often using deep learning techniques to capture broader context.

Advantage: Sentence and document embeddings provide compact, context-aware representations, useful for tasks requiring sentence-level understanding, such as similarity matching and information retrieval.


Summary

While BoW and TF-IDF focus on word frequencies and importance, these advanced techniques capture linguistic structure, word relationships, and semantic context, making them better suited for complex NLP tasks. Transformer-based models, in particular, are currently among the most effective due to their ability to model language comprehensively and accurately.


3. Read about Stemming and Lemmatization from the resources given below. Think about the pros/cons of each.
A) Stemming and lemmatization are two common techniques in natural language processing (NLP) used to reduce words to their base or root forms, but they differ in how they approach this task.

Stemming

Definition: Stemming is a rule-based process of reducing a word to its root form by stripping prefixes and suffixes, often without regard for whether the result is a valid word.

Example:

"playing," "played," and "plays" are all reduced to "play" by removing suffixes.

"running" becomes "run."



Pros:

1. Speed: Stemming is typically faster than lemmatization because it relies on simple rules rather than linguistic analysis.


2. Resource-Efficient: Stemming doesn’t require additional resources like language dictionaries, so it is lightweight and easy to implement.


3. Useful in Specific Contexts: In cases where precision is less important, such as early-stage data exploration or when building a quick prototype, stemming can still improve text processing efficiency.



Cons:

1. Lack of Precision: Stemming may reduce words to non-existent or incorrect root forms, which can introduce inaccuracies. For example, "troubling" is stemmed to "troubl," which is not a word.


2. Over-Stemming and Under-Stemming: Because stemming is rule-based, it can over-stem (removing too much) or under-stem (removing too little). For instance, "university" and "universe" might both be reduced to "univers," though they have different meanings.


3. Language-Specific Limitations: Stemming doesn’t account for the nuanced differences between words in complex languages, especially those with irregular words or varying grammatical rules.



Lemmatization

Definition: Lemmatization is a more sophisticated process that reduces words to their base or dictionary form (lemma) by considering the word’s meaning and part of speech. It relies on vocabulary and morphological analysis (e.g., identifying "good" as the lemma of "better").

Example:

"playing" becomes "play," but "better" would be correctly lemmatized to "good."



Pros:

1. Higher Accuracy: Lemmatization produces linguistically correct base forms, reducing ambiguities and improving accuracy, especially in applications where precise meaning is essential.


2. Better for Semantic Analysis: Since lemmatization considers word context, it is more suitable for tasks that require understanding nuances, such as sentiment analysis, entity recognition, or machine translation.


3. Handles Irregular Forms: Lemmatization can handle irregular word forms (like "went" to "go"), which is challenging for simple stemming rules.



Cons:

1. Slower Performance: Lemmatization is computationally heavier than stemming because it involves dictionary lookups and morphological analysis, making it slower.


2. Language-Specific Dictionaries Needed: Lemmatization requires language-specific resources, so it’s less portable across different languages or dialects without adapting dictionaries and grammatical rules.


3. Complex Implementation: Implementing lemmatization is generally more complex and requires additional libraries or pre-trained models, which may add to the development time.



Summary

Stemming is faster and simpler but less accurate, often leading to incorrect or overly aggressive reductions.

Lemmatization is more accurate and context-aware but slower and requires additional linguistic resources.


In practice, choosing between them depends on the specific application requirements. For applications prioritizing speed or where exact root forms are less critical (e.g., search engines), stemming might be sufficient. However, for tasks needing high linguistic precision (e.g., chatbots, sentiment analysis), lemmatization is generally preferred.
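The contrast can be illustrated with a toy sketch in pure Python. Both helpers are hypothetical stand-ins: `toy_stem` strips a single suffix and is far cruder than the Snowball stemmer used earlier in this lab, and `TOY_LEMMAS` is a tiny hand-written table standing in for a real dictionary such as WordNet.

```python
def toy_stem(word):
    """Strip one common suffix; a crude stand-in for a rule-based stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so very short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hand-written lemma table (illustrative assumption, not a real dictionary)
TOY_LEMMAS = {"went": "go", "better": "good", "troubling": "trouble", "plays": "play"}

def toy_lemmatize(word):
    """Look the word up in the lemma table; unknown words are returned as-is."""
    return TOY_LEMMAS.get(word, word)

for word in ["plays", "troubling", "went", "better"]:
    print(word, "| stem:", toy_stem(word), "| lemma:", toy_lemmatize(word))
```

The rule-based stemmer needs no dictionary but turns "troubling" into the non-word "troubl" and cannot relate "went" to "go". The lookup handles both, but only for words its dictionary covers.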

### Useful Resources for further reading
1. Stemming and Lemmatization: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
2. TF-IDF and BoW : https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
3. TF-IDF: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
