<a href="https://colab.research.google.com/github/sanjeevmanvithvellala/IIITH_AI-ML/blob/main/VSM_AIML_Module_3_Lab_3_Using_KNN_for_Text_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Student Training Program on AIML**
### MODULE: CLASSIFICATION-1
### LAB-3 : Using KNN for Text Classification


---

## **Section 1: Understanding NLP tools**

In this lab we will use KNN on a real-world NLP application: text classification. But first, let's look at some NLP techniques for text classification and the tools we use when working with Python for NLP.

## Section 1.1: Data Cleaning and Preprocessing step

Raw text must be processed and converted into a form that is suitable for use with machine-learning algorithms.  
For text, several things need to be taken into account:  


1.   Removing numbers from the text.
2.   Handling capitalization and punctuation.
3.   Stemming and lemmatizing text.  

Most importantly, algorithms cannot work on words or images directly; these must first be converted into vectors, a numeric form that algorithms can operate on.
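
As a minimal sketch of steps 1 and 2, the standard-library `re` module can strip digits and punctuation and lowercase the text. This is a simplified stand-in for the `cleanText` function defined later in this lab; the name `basic_clean` and the sample sentence are made up for illustration.

```python
import re

def basic_clean(text):
    """Keep letters only, lowercase, and collapse repeated spaces."""
    letters_only = re.sub(r"[^A-Za-z]", " ", text)   # digits and punctuation become spaces
    return " ".join(letters_only.lower().split())    # lowercase and normalise whitespace

print(basic_clean("The movie ran 120 minutes... and I LOVED it!"))
# the movie ran minutes and i loved it
```

Stemming and lemmatizing (step 3) need language knowledge, which is why the lab relies on NLTK for them.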



### **NLTK**
NLTK (the Natural Language Toolkit) is a commonly used library for processing text. We will use it in this lab. Let's first import it and download the resources we need.


In [38]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [39]:
import re
import numpy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from bs4 import BeautifulSoup

def cleanText(text, lemmatize, stemmer):
    """Method for cleaning text from train and test data. Removes numbers, punctuation, and capitalization. Stems or lemmatizes text."""

    if isinstance(text, float):
        text = str(text)
    if isinstance(text, numpy.int64):
        text = str(text)
    try:
        text = text.decode()
    except AttributeError:
        pass

    soup = BeautifulSoup(text, "lxml")
    text = soup.get_text()
    text = re.sub(r"[^A-Za-z]", " ", text)
    text = text.lower()


    if lemmatize:
        wordnet_lemmatizer = WordNetLemmatizer()

        def get_tag(tag):
            if tag.startswith('J'):
                return wordnet.ADJ
            elif tag.startswith('V'):
                return wordnet.VERB
            elif tag.startswith('N'):
                return wordnet.NOUN
            elif tag.startswith('R'):
                return wordnet.ADV
            else:
                return ''

        text_result = []
        tokens = word_tokenize(text)  # Generate list of tokens
        tagged = pos_tag(tokens)
        for t in tagged:
            try:
                text_result.append(wordnet_lemmatizer.lemmatize(t[0], get_tag(t[1][:2])))
            except Exception:
                # The tag has no usable WordNet equivalent: fall back to the default lemma
                text_result.append(wordnet_lemmatizer.lemmatize(t[0]))
        return text_result

    if stemmer:
        text_result = []
        tokens = word_tokenize(text)
        snowball_stemmer = SnowballStemmer('english')
        for t in tokens:
            text_result.append(snowball_stemmer.stem(t))
        return text_result

    # Neither option selected: return the cleaned tokens unchanged
    return word_tokenize(text)

In [40]:
sample_text = "Troubling"
sample_text_result = cleanText(sample_text, lemmatize=False, stemmer=True)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text)
print(sample_text_result)
sample_text_result = cleanText(sample_text, lemmatize=True, stemmer=False)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text_result)

Troubling
troubl
trouble


## Section 1.2: BAG OF WORDS

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms.

The approach is very simple and flexible, and can be used in many ways for extracting features from documents.

A bag-of-words is a representation of text that describes the occurrence of words within a document.
It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
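
To make this concrete, here is a small sketch that builds bag-of-words vectors with only the standard library. The three toy sentences and the helper name `bow_vector` are made up for illustration; scikit-learn's `CountVectorizer`, used in the next cell, follows the same idea with its own tokenisation rules.

```python
from collections import Counter

docs = ["the movie was good", "the movie was not good", "good movie the was"]

# Vocabulary: every distinct word in the corpus, in sorted order
vocab = sorted({word for doc in docs for word in doc.split()})

def bow_vector(doc):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [bow_vector(doc) for doc in docs]
print(vocab)          # ['good', 'movie', 'not', 'the', 'was']
for doc, vec in zip(docs, vectors):
    print(vec, doc)

# Documents 0 and 2 contain the same words in a different order,
# so their bag-of-words vectors are identical.
print(vectors[0] == vectors[2])   # True
```

Note that "good" and "not good" differ by a single count in one column, which shows how much information is lost when word order is discarded.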

In [42]:
# Functions to convert document(s) to a list of words, with the option of removing stopwords. Returns document-term matrix.

def createBagOfWords(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = CountVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer = CountVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    bag_of_words_train = vectorizer.fit_transform(clean_train).toarray()
    bag_of_words_test = vectorizer.transform(clean_test).toarray()
    return bag_of_words_train, bag_of_words_test


## Section 1.3: TF-IDF
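TF-IDF (term frequency-inverse document frequency) addresses a weakness of Bag of Words: raw counts give the most weight to the most frequent words, even when those words appear in almost every document and say little about any one of them. TF-IDF scales each count by how rare the term is across the corpus.

The number of times a term occurs in a document is called its term frequency (TF).

Document frequency (DF) is the number of documents in which the term is present. Inverse document frequency (IDF) is the inverse of the document frequency, usually log-scaled, and measures the informativeness of term *t*: the fewer documents contain *t*, the higher its IDF. The TF-IDF weight of a term in a document is the product of its TF and its IDF.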
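
The definitions above can be sketched by hand. The snippet assumes the textbook formula tf(t, d) * log(N / df(t)), where N is the number of documents. scikit-learn's `TfidfVectorizer`, used in the next cell, applies a smoothed IDF and L2-normalises each row by default, so its numbers will differ. The toy corpus and function names are made up for illustration.

```python
import math
from collections import Counter

docs = [
    "free prize call now",
    "call me when you are free",
    "see you at lunch",
    "lunch now",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
df = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    """Textbook inverse document frequency: log(N / df)."""
    return math.log(n_docs / df[term])

def tfidf(tokens):
    """Map each term of one document to tf * idf."""
    tf = Counter(tokens)
    return {term: count * idf(term) for term, count in tf.items()}

weights = tfidf(tokenized[0])
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:6s} df={df[term]} tfidf={weight:.3f}")
```

In the first document, "prize" occurs in only one of the four documents, so it receives a larger weight than "free", "call" and "now", which each occur in two.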




In [43]:
def createTFIDF(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = TfidfVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer =  TfidfVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    tfidf_train = vectorizer.fit_transform(clean_train).toarray()
    tfidf_test = vectorizer.transform(clean_test).toarray()
    return tfidf_train, tfidf_test

# **Section 2: UNDERSTANDING THE DATA : A REVIEWS DATASET**

Sentiment analysis is the interpretation and classification of emotions (such as positive, negative and neutral) within text data using text analysis techniques.  
Given below is a dataset consisting of reviews along with sentiment class (positive or negative).

In [44]:
# Upload the Reviews CSV file that has been shared with you.
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

Saving reviews.csv to reviews (1).csv


In [45]:
import pandas as pd
df = pd.read_csv('reviews.csv')

In [46]:
df = df.dropna()

In [47]:
df

Unnamed: 0,sentence,sentiment
0,Not sure who was more lost - the flat characte...,0
1,Attempting artiness with black & white and cle...,0
2,Very little music or anything to speak of.,0
3,The best scene in the movie was when Gerardo i...,1
4,"The rest of the movie lacks art, charm, meanin...",0
...,...,...
950,I just got bored watching Jessice Lange take h...,0
951,"Unfortunately, any virtue in this film's produ...",0
952,"In a word, it is embarrassing.",0
953,Exceptionally bad!,0


In [48]:
df.to_csv('reviews.csv', index=False)

# **Section 3: KNN MODEL**

Given below are two KNN models; in the first case we are using Bag-of-Words and in the second case we are using TF-IDF.
Note the different metrics and parameters used in each.
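
The two models differ in their distance metric (`euclidean` vs `cosine`) and in how neighbours vote (`weights='uniform'` vs `weights='distance'`). The sketch below is a plain-Python illustration of what those settings mean, not scikit-learn's implementation; the function names and toy vectors are made up.

```python
import math
from collections import defaultdict

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity: compares direction only (assumes non-zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def knn_predict(train_X, train_y, query, k, distance, weighted=False):
    """Vote among the k nearest training points.

    weighted=False gives each neighbour one vote (like weights='uniform');
    weighted=True gives each neighbour a vote of 1 / distance (like weights='distance').
    """
    neighbours = sorted(
        (distance(x, query), label) for x, label in zip(train_X, train_y)
    )[:k]
    votes = defaultdict(float)
    for dist, label in neighbours:
        votes[label] += 1.0 / (dist + 1e-12) if weighted else 1.0
    return max(votes, key=votes.get)

# Metric: a short and a long review that use the same words in the same proportion.
short_review, long_review = [1, 1, 0], [5, 5, 0]
print(euclidean(short_review, long_review))        # large, because the lengths differ
print(cosine_distance(short_review, long_review))  # about 0, because the direction is the same

# Weights: one very close neighbour of class 1, two farther neighbours of class 0.
train_X = [[1.0, 1.0], [3.0, 1.0], [1.0, 3.0], [5.0, 5.0]]
train_y = [1, 0, 0, 1]
query = [1.2, 1.0]
uniform_vote = knn_predict(train_X, train_y, query, k=3, distance=euclidean)
distance_vote = knn_predict(train_X, train_y, query, k=3, distance=euclidean, weighted=True)
print(uniform_vote, distance_vote)   # 0 1
```

With uniform weights the two farther class-0 neighbours outvote the single close class-1 neighbour; with distance weights the close neighbour dominates.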

In [49]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 1: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"],
                                                        test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2,
                                         metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

Note: Cross-validation will be discussed in detail in the upcoming lab session.
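
As a rough preview, k-fold cross-validation splits the training data into k parts, trains on k-1 of them, scores on the held-out part, and rotates until every part has been held out once. The scores printed by the functions above are one accuracy per fold. The sketch below only generates the index splits, using contiguous, unshuffled folds; for classifiers, `cross_val_score` additionally stratifies the folds by class, which this sketch does not do. The function name is made up.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k contiguous folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)   # the first `extra` folds get one more sample
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size

folds = list(k_fold_indices(n_samples=7, k=3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```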

In [50]:
## KNN accuracy after using BoW
predicted, y_test = bow_knn()

KNN with BOW accuracy = 64.3979057591623%




Cross Validation Accuracy: 0.65
[0.64313725 0.60392157 0.7007874 ]




In [51]:
## KNN accuracy after using TFIDF
predicted, y_test = tfidf_knn()

KNN with TFIDF accuracy = 71.72774869109948%




Cross Validation Accuracy: 0.73
[0.71764706 0.74509804 0.73622047]


In [52]:
## TASK - 1: Tweak the models below and see results with different parameters and distance metrics.
import pandas as pd
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import LabelEncoder
from tabulate import tabulate

def evaluate_knn(X_train, X_test, y_train, y_test, k_values, metrics_list, weights_list, algorithm):
    results = []

    for k in k_values:
        for m in metrics_list:
            for w in weights_list:
                knn = neighbors.KNeighborsClassifier(
                    n_neighbors=k,
                    weights=w,
                    algorithm=algorithm,
                    metric=m,
                    n_jobs=-1
                )
                knn.fit(X_train, y_train)
                predicted = knn.predict(X_test)

                acc = metrics.accuracy_score(y_test, predicted)
                cv_scores = cross_val_score(knn, X_train, y_train, cv=3)

                results.append([k, m, w, acc, cv_scores.mean()])

    # Sorting by best test accuracy
    results.sort(key=lambda x: x[3], reverse=True)

    # Printing the results table
    print(tabulate(results, headers=["K", "Metric", "Weights", "Test Acc", "CV Acc"],
                   tablefmt="pretty", floatfmt=".4f"))
    print("\nBest combination:")
    print(results[0])
    return results[0]

def encode_labels(y_train, y_test):
    if y_train.dtype == object or y_test.dtype == object:
        le = LabelEncoder()
        y_train = le.fit_transform(y_train)
        y_test = le.transform(y_test)
        print("Labels encoded:", list(le.classes_))
    return y_train, y_test


def bow_knn():
    """KNN with Bag of Words"""
    training_data = pd.read_csv('reviews.csv')

    X_train_raw, X_test_raw, y_train, y_test = train_test_split(
        training_data["sentence"],
        training_data["sentiment"],
        test_size=0.2,
        random_state=5
    )

    y_train, y_test = encode_labels(y_train, y_test)

    X_train, X_test = createBagOfWords(X_train_raw, X_test_raw,
                                       remove_stopwords=True, lemmatize=True, stemmer=False)
    best_params = evaluate_knn(
        X_train, X_test, y_train, y_test,
        k_values=[3, 5, 7, 9],
        metrics_list=['euclidean', 'manhattan', 'cosine'],
        weights_list=['uniform', 'distance'],
        algorithm='brute'
    )
    return best_params

def tfidf_knn():
    """KNN with TF-IDF"""
    training_data = pd.read_csv('reviews.csv')

    X_train_raw, X_test_raw, y_train, y_test = train_test_split(
        training_data["sentence"],
        training_data["sentiment"],
        test_size=0.2,
        random_state=5
    )

    # Encode labels if needed
    y_train, y_test = encode_labels(y_train, y_test)

    # Vectorize
    X_train, X_test = createTFIDF(X_train_raw, X_test_raw,
                                  remove_stopwords=True, lemmatize=True, stemmer=False)

    print("\n--- TF-IDF + KNN Testing ---")
    best_params = evaluate_knn(
        X_train, X_test, y_train, y_test,
        k_values=[3, 5, 7, 9],
        metrics_list=['euclidean', 'manhattan', 'cosine'],
        weights_list=['uniform', 'distance'],
        algorithm='brute'
    )
    return best_params

if __name__ == "__main__":
    bow_knn()
    tfidf_knn()


Labels encoded: [' it just lacked imagination.', ' nothing about where he spend 2 years between his childhood and mature age.', '0', '1']




+---+-----------+----------+--------------------+--------------------+
| K |  Metric   | Weights  |      Test Acc      |       CV Acc       |
+---+-----------+----------+--------------------+--------------------+
| 3 |  cosine   | uniform  | 0.7068062827225131 | 0.7107663012711646 |
| 3 |  cosine   | distance | 0.7068062827225131 | 0.7107611548556431 |
| 5 |  cosine   | distance | 0.6963350785340314 | 0.7173125418146261 |
| 7 |  cosine   | uniform  | 0.6963350785340314 | 0.6911584581339097 |
| 5 |  cosine   | uniform  | 0.6910994764397905 | 0.716010498687664  |
| 7 |  cosine   | distance | 0.6910994764397905 | 0.6950748803458392 |
| 9 |  cosine   | uniform  | 0.6858638743455497 | 0.7094951366373321 |
| 9 |  cosine   | distance | 0.6858638743455497 | 0.7134167052647831 |
| 3 | manhattan | distance | 0.680628272251309  | 0.6020894447017652 |
| 3 | euclidean | distance | 0.6701570680628273 | 0.6322628789048429 |
| 3 | manhattan | uniform  | 0.6701570680628273 | 0.5981678760743142 |
| 9 | 



+---+-----------+----------+--------------------+--------------------+
| K |  Metric   | Weights  |      Test Acc      |       CV Acc       |
+---+-----------+----------+--------------------+--------------------+
| 9 |  cosine   | distance | 0.7329842931937173 | 0.7448201327775205 |
| 5 |  cosine   | uniform  | 0.7225130890052356 | 0.730368997992898  |
| 9 | euclidean | uniform  | 0.7225130890052356 | 0.7447738150378261 |
| 9 | euclidean | distance | 0.7225130890052356 | 0.743461479079821  |
| 9 |  cosine   | uniform  | 0.7225130890052356 | 0.7435077968195153 |
| 5 |  cosine   | distance | 0.7172774869109948 | 0.732988523493387  |
| 7 |  cosine   | distance | 0.7172774869109948 | 0.747444804693531  |
| 7 |  cosine   | uniform  | 0.7120418848167539 | 0.7448355720240851 |
| 3 |  cosine   | uniform  | 0.7068062827225131 | 0.7421234110442078 |
| 3 |  cosine   | distance | 0.7068062827225131 | 0.7394987391281972 |
| 7 | euclidean | uniform  | 0.7068062827225131 | 0.7303844372394628 |
| 7 | 



# **Section 4: SPAM TEXT DATASET**
Now let's use what we've learnt to classify texts as spam or not spam.

In [53]:
# Upload the spam text data CSV file that has been shared with you. You can also download the file from https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

Saving spam.csv to spam (1).csv


In [54]:
import pandas as pd
df = pd.read_csv('spam.csv')
df

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [55]:
df['Category'] = df['Category'].map({'ham': 0, 'spam': 1})

In [56]:
df.head(5)

Unnamed: 0,Category,Message
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


In [57]:
len(df)

5572

In [58]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 2: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2, metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

In [59]:
# This cell may take some time to run
predicted, y_test = bow_knn()

KNN with BOW accuracy = 92.19730941704036%
Cross Validation Accuracy: 0.91
[0.9064603  0.89973082 0.91313131]




In [60]:
# This cell may take some time to run
predicted, y_test = tfidf_knn()

KNN with TFIDF accuracy = 98.56502242152466%
Cross Validation Accuracy: 0.97
[0.96837147 0.96769852 0.96363636]


In [61]:
## TASK - 2: Tweak the models below and see results with different parameters and distance metrics.
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score
import pandas as pd

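# Prevent CV errors if class counts are small
def get_safe_cv(y_train, desired_cv=3):
    min_class_size = int(y_train.value_counts().min())
    if min_class_size < 2:
        # cross_val_score needs at least 2 folds, so every class needs 2+ samples
        raise ValueError("Need at least 2 samples per class for cross-validation")
    return min(desired_cv, min_class_size)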

def evaluate_knn(X_train, X_test, y_train, y_test, params):
    knn = neighbors.KNeighborsClassifier(**params)
    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)

    cv_folds = get_safe_cv(y_train, desired_cv=3)
    scores = cross_val_score(knn, X_train, y_train, cv=cv_folds)

    print(f"Params: {params}")
    print(f"Accuracy = {acc*100:.2f}%")
    print(f"Cross Validation ({cv_folds}-fold) Accuracy: {scores.mean():.2f}")
    print(f"Scores: {scores}\n")

def bow_knn():
    """Run KNN with Bag-of-Words on multiple parameter sets"""
    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})

    X_train, X_test, y_train, y_test = train_test_split(
        training_data["Message"], training_data["Category"],
        test_size=0.2, random_state=5
    )

    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)

    param_grid = [
        {"n_neighbors": 3, "weights": "uniform", "metric": "euclidean"},
        {"n_neighbors": 5, "weights": "distance", "metric": "manhattan"},
        {"n_neighbors": 7, "weights": "uniform", "metric": "cosine"}
    ]

    for params in param_grid:
        evaluate_knn(X_train, X_test, y_train, y_test, params)

def tfidf_knn():
    """Run KNN with TF-IDF on multiple parameter sets"""
    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})

    X_train, X_test, y_train, y_test = train_test_split(
        training_data["Message"], training_data["Category"],
        test_size=0.2, random_state=5
    )

    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)

    param_grid = [
        {"n_neighbors": 3, "weights": "distance", "metric": "cosine"},
        {"n_neighbors": 5, "weights": "uniform", "metric": "euclidean"},
        {"n_neighbors": 7, "weights": "distance", "metric": "manhattan"}
    ]

    for params in param_grid:
        evaluate_knn(X_train, X_test, y_train, y_test, params)

# Example run
bow_knn()
tfidf_knn()

Params: {'n_neighbors': 3, 'weights': 'uniform', 'metric': 'euclidean'}
Accuracy = 93.45%
Cross Validation (3-fold) Accuracy: 0.92
Scores: [0.9320323  0.91386272 0.92525253]

Params: {'n_neighbors': 5, 'weights': 'distance', 'metric': 'manhattan'}
Accuracy = 94.17%
Cross Validation (3-fold) Accuracy: 0.93
Scores: [0.93472409 0.92732167 0.93198653]

Params: {'n_neighbors': 7, 'weights': 'uniform', 'metric': 'cosine'}
Accuracy = 97.58%
Cross Validation (3-fold) Accuracy: 0.96
Scores: [0.95693136 0.96231494 0.96094276]

Params: {'n_neighbors': 3, 'weights': 'distance', 'metric': 'cosine'}
Accuracy = 98.30%
Cross Validation (3-fold) Accuracy: 0.97
Scores: [0.96971736 0.96164199 0.96430976]

Params: {'n_neighbors': 5, 'weights': 'uniform', 'metric': 'euclidean'}
Accuracy = 92.20%
Cross Validation (3-fold) Accuracy: 0.90
Scores: [0.90040377 0.89771198 0.9010101 ]

Params: {'n_neighbors': 7, 'weights': 'distance', 'metric': 'manhattan'}
Accuracy = 93.09%
Cross Validation (3-fold) Accuracy: 0.

### Questions to Think About and Answer
1. Why does the TF-IDF approach generally result in better accuracy than Bag-of-Words?
2. Can you think of techniques that are better than both BoW and TF-IDF?
3. Read about Stemming and Lemmatization from the resources given below. Think about the pros/cons of each.

---
### 1) Why does TF-IDF generally result in better accuracy than Bag-of-Words?
Reason:

* BoW only counts occurrences, so every word is weighted by its raw frequency. Very common but uninformative words like “the”, “is”, “you” can then dominate the feature space, even though they don’t help in classification.

* TF-IDF (Term Frequency – Inverse Document Frequency) reduces the weight of words that appear in most documents and increases the weight of words that appear in only a few, which are often the more discriminative ones.

* TF-IDF does not look at the class labels, but by downweighting ubiquitous words it helps a distance-based model such as KNN focus on more meaningful features, often improving accuracy.

Example: in a spam dataset:

* The word “free” appears in relatively few messages overall, most of them spam → TF-IDF will assign it a high weight.

* The word “the” occurs everywhere → TF-IDF will assign it a low weight.

---
### 2) Techniques better than both BoW and TF-IDF
Yes, there are vectorization techniques that capture semantic meaning rather than just counting words:

* Word2Vec (Mikolov et al., Google) learns dense word embeddings where similar words have similar vectors, thus capturing meaning, not just counts.
* GloVe (Stanford) is similar to Word2Vec but is based on co-occurrence matrices.
* FastText (Facebook) improves on Word2Vec by using subword information, which makes it more effective for rare or misspelled words.
* Doc2Vec extends this idea to entire documents or sentences, producing embeddings for larger text units.
* BERT and other transformer-based embeddings are context-aware, meaning a word’s vector changes depending on surrounding words (for example, “bank” in “river bank” vs. “bank account”).
* Sentence-BERT is optimized for capturing semantic similarity between entire sentences, making it powerful for tasks like question answering or text matching.
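
To see why embeddings can help, compare two reviews that share no words. Under bag-of-words their vectors have nothing in common, while averaging word embeddings places them close together. The 3-dimensional vectors below are invented for illustration only; real embeddings come from a trained model such as Word2Vec or GloVe and typically have hundreds of dimensions.

```python
import math

# Hypothetical, hand-made "embeddings" (not taken from any trained model).
toy_embeddings = {
    "good":  [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "movie": [0.1, 0.9, 0.2],
    "film":  [0.2, 0.8, 0.3],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def bow_vector(tokens, vocab):
    return [tokens.count(word) for word in vocab]

def mean_embedding(tokens):
    """Average the word vectors of a document, dimension by dimension."""
    vectors = [toy_embeddings[t] for t in tokens]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

doc_a, doc_b = ["good", "movie"], ["great", "film"]
vocab = sorted(toy_embeddings)

bow_sim = cosine_similarity(bow_vector(doc_a, vocab), bow_vector(doc_b, vocab))
emb_sim = cosine_similarity(mean_embedding(doc_a), mean_embedding(doc_b))
print(bow_sim)   # 0.0, because the documents share no words
print(emb_sim)   # close to 1, because the toy vectors for the synonyms are similar
```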

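Why they’re better:

* Capture the meaning of words, so synonyms end up with similar vectors.

* Contextual models such as BERT also handle polysemy and word order.

* Produce dense, lower-dimensional vectors, which keeps distance computations cheap while retaining rich information (although computing the embeddings themselves can be costly).

---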

### 3) Stemming vs Lemmatization — Pros & Cons
* Stemming and lemmatization are both techniques to reduce words to a base form, but they differ in approach and precision.
* Stemming is a rule-based method that chops word endings to reach a root form (e.g., running → run, studies → studi).
* It is faster since it relies on simple rules without linguistic checks, but it can produce non-words like “studi,” making it less accurate.
* Lemmatization, on the other hand, uses a dictionary and morphological analysis to convert words to their proper lemma (e.g., running → run, studies → study).
* It is slower due to dictionary lookups and deeper analysis, and it works best when given the part of speech, but it is more accurate and returns dictionary forms for the words it knows.
* Stemming is useful for large-scale, quick preprocessing where precision is less critical (such as search indexing), whereas lemmatization is preferred in NLP tasks where meaning preservation is important, such as sentiment analysis or machine translation.

Summary:

* Stemming = quick & dirty.

* Lemmatization = slow & smart.
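
The contrast can be sketched with a toy example. The suffix rules and the small lookup table below are invented for illustration; they are far simpler than the Snowball stemmer and WordNet lemmatizer used earlier in this lab.

```python
# Toy suffix-stripping "stemmer": applies the first matching rule, with no dictionary.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_stem(word):
    for suffix, replacement in SUFFIX_RULES:
        # Only strip when at least three characters of the word would remain
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word

# Toy "lemmatizer": a dictionary lookup that falls back to the word itself.
LEMMA_TABLE = {"studies": "study", "better": "good", "troubling": "trouble", "was": "be"}

def toy_lemmatize(word):
    return LEMMA_TABLE.get(word, word)

for word in ["studies", "troubling", "better", "was"]:
    print(f"{word:10s} stem={toy_stem(word):8s} lemma={toy_lemmatize(word)}")
```

The stemmer produces "studi" and "troubl" from rules alone, while the lookup can map irregular forms such as "better" and "was" that no suffix rule would catch.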

### Useful Resources for further reading
1. Stemming and Lemmatization: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
2. TF-IDF and BoW : https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
3. TF-IDF: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
