<a href="https://colab.research.google.com/github/varshiarjampudi/FMML__labs__and__Projects/blob/main/Lab_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **FOUNDATIONS OF MODERN MACHINE LEARNING, IIIT Hyderabad**
### MODULE: CLASSIFICATION-1
### LAB-3 : Using KNN for Text Classification
#### Module Coordinator: Jashn Arora


---

## **Section 1: Understanding NLP tools**

In this lab we will use KNN on a real-world NLP application: text classification. First, we look at some NLP techniques for text classification and the tools we use when working with Python for NLP.

## Section 1.1: Data Cleaning and Preprocessing step

Raw text must be cleaned and converted into a form that machine-learning algorithms can use.  
For text, typical preprocessing steps include:  


1.   Removing numbers from the text
2.   Handling capitalization and punctuation.
3.   Stemming and Lemmatizing text.  

Most importantly, algorithms cannot operate on raw words or images directly; these must first be converted into numeric vectors, a form that algorithms can work with.
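As a minimal sketch of the first two steps (removing numbers, handling capitalization and punctuation), the snippet below uses only the standard library. The helper name `normalize_text` is hypothetical; the regular expression is the same one the lab's `cleanText` function applies. Stemming and lemmatization are not covered here.

```python
import re


def normalize_text(text: str) -> list[str]:
    """Lowercase the text, drop digits and punctuation, and split into words."""
    # Replace every character that is not an ASCII letter with a space.
    letters_only = re.sub(r"[^A-Za-z]", " ", text)
    # Lowercase, then split on runs of whitespace.
    return letters_only.lower().split()


print(normalize_text("Call NOW!!! Win 2 free tickets, today."))
```

Note that this simple rule also discards accented letters and other non-ASCII characters, which may or may not be acceptable depending on the dataset.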



### **NLTK**
NLTK (the Natural Language Toolkit) is a commonly used library for processing text. We will use this tool in this lab. Let's first download the NLTK resources we need.


In [None]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [None]:
import re
import numpy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from bs4 import BeautifulSoup

def cleanText(text, lemmatize, stemmer):
    """Method for cleaning text from train and test data. Removes numbers, punctuation, and capitalization. Stems or lemmatizes text."""

    if isinstance(text, float):
        text = str(text)
    if isinstance(text, numpy.int64):
        text = str(text)
    try:
        text = text.decode()
    except AttributeError:
        pass

    soup = BeautifulSoup(text, "lxml")
    text = soup.get_text()
    text = re.sub(r"[^A-Za-z]", " ", text)
    text = text.lower()


    if lemmatize:
        wordnet_lemmatizer = WordNetLemmatizer()

        def get_tag(tag):
            if tag.startswith('J'):
                return wordnet.ADJ
            elif tag.startswith('V'):
                return wordnet.VERB
            elif tag.startswith('N'):
                return wordnet.NOUN
            elif tag.startswith('R'):
                return wordnet.ADV
            else:
                # Fall back to noun, which is also the lemmatizer's default POS
                return wordnet.NOUN

        text_result = []
        tokens = word_tokenize(text)  # Generate list of tokens
        tagged = pos_tag(tokens)  # List of (token, Penn Treebank tag) pairs
        for token, tag in tagged:
            # Map the Treebank tag to a WordNet POS before lemmatizing
            text_result.append(wordnet_lemmatizer.lemmatize(token, get_tag(tag)))
        return text_result

    if stemmer:
        tokens = word_tokenize(text)
        snowball_stemmer = SnowballStemmer('english')
        return [snowball_stemmer.stem(t) for t in tokens]

    # Neither option selected: return the cleaned tokens unchanged
    # (previously the function returned None in this case)
    return word_tokenize(text)

In [None]:
sample_text = "Troubling"
sample_text_result = cleanText(sample_text, lemmatize=False, stemmer=True)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text)
print(sample_text_result)
sample_text_result = cleanText(sample_text, lemmatize=True, stemmer=False)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text_result)

Troubling
troubl
trouble


## Section 1.2: BAG OF WORDS

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms.

The approach is very simple and flexible, and can be used in many ways for extracting features from documents.

A bag-of-words is a representation of text that describes the occurrence of words within a document.
It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
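To make the idea concrete, here is a small standard-library sketch of a bag-of-words representation. The two example sentences and the helper name `bag_of_words` are made up for illustration; scikit-learn's `CountVectorizer`, used later in this lab, is similar in spirit but applies its own tokenization rules.

```python
from collections import Counter

docs = ["the food was good", "the service was not good"]

# Vocabulary: every distinct word in the corpus, in sorted order.
vocabulary = sorted({word for doc in docs for word in doc.split()})


def bag_of_words(doc: str) -> list[int]:
    """Count how often each vocabulary word occurs in doc; word order is ignored."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]


vectors = [bag_of_words(doc) for doc in docs]
print(vocabulary)
for vector in vectors:
    print(vector)

# Shuffling the words of a document leaves its vector unchanged.
print(bag_of_words("good was food the") == vectors[0])
```

Each document becomes one row of counts over the shared vocabulary, which is exactly the document-term matrix the classifier consumes.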

In [None]:
# Functions to convert document(s) to a list of words, with the option of removing stopwords. Returns document-term matrix.

def createBagOfWords(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = CountVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer = CountVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    bag_of_words_train = vectorizer.fit_transform(clean_train).toarray()
    bag_of_words_test = vectorizer.transform(clean_test).toarray()
    return bag_of_words_train, bag_of_words_test


## Section 1.3: TF-IDF
TF-IDF (term frequency-inverse document frequency) weights each word by how informative it is. This addresses a limitation of the Bag of Words technique, which counts every word equally. Like BoW, it converts text into numeric vectors that a machine can use for text classification.

The number of times a term occurs in a document is called its term frequency (TF).

Document frequency (DF) is the number of documents in which the term is present. Inverse document frequency (IDF) is the inverse of the document frequency, and it measures the informativeness of term *t*: the fewer documents contain *t*, the higher its IDF.
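The definitions above can be sketched by hand in a few lines. This is an illustration, not the library's code: it assumes the smoothed IDF variant, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization, which scikit-learn documents as the default behaviour of `TfidfVectorizer`. The toy corpus and the helper name `tfidf` are made up.

```python
import math
from collections import Counter

# Toy corpus, already tokenized (hypothetical data).
docs = [["good", "food"], ["good", "service"], ["bad", "food", "bad"]]
n_docs = len(docs)
vocabulary = sorted({word for doc in docs for word in doc})

# Document frequency: number of documents containing each term.
df = {word: sum(word in doc for doc in docs) for word in vocabulary}

# Smoothed inverse document frequency (assumed variant, see lead-in).
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocabulary}


def tfidf(doc: list[str]) -> list[float]:
    """Return the L2-normalized TF-IDF vector of one tokenized document."""
    tf = Counter(doc)
    weights = [tf[word] * idf[word] for word in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]


for word in vocabulary:
    print(f"{word:8s} df={df[word]} idf={idf[word]:.3f}")
print([round(w, 3) for w in tfidf(docs[2])])
```

Terms that occur in fewer documents ("bad", "service") receive a larger IDF than terms that occur in more of them ("good", "food").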




In [None]:
def createTFIDF(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = TfidfVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer =  TfidfVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    tfidf_train = vectorizer.fit_transform(clean_train).toarray()
    tfidf_test = vectorizer.transform(clean_test).toarray()
    return tfidf_train, tfidf_test

# **Section 2: UNDERSTANDING THE DATA : A REVIEWS DATASET**

Sentiment analysis is the interpretation and classification of emotions (such as positive, negative and neutral) within text data using text analysis techniques.  
Given below is a dataset consisting of reviews along with sentiment class (positive or negative).

In [None]:
# Upload the Reviews CSV file that has been shared with you.
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

Saving reviews.csv to reviews.csv


In [None]:
import pandas as pd
df = pd.read_csv('reviews.csv')

In [None]:
df = df.dropna()

In [None]:
df.to_csv('reviews.csv', index=False)

# **Section 3: KNN MODEL**

Given below are two KNN models; in the first case we are using Bag-of-Words and in the second case we are using TF-IDF.
Note the different metrics and parameters used in each.
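To see why the choice of distance metric matters, here is a minimal pure-Python sketch of KNN with a pluggable distance function. It is an illustration under simplifying assumptions (uniform voting, brute-force search), not scikit-learn's implementation, and the toy vectors over a hypothetical vocabulary `['bad', 'good']` are made up.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance; sensitive to document length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity; depends only on the direction of the vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def knn_predict(train_X, train_y, query, k, distance):
    """Majority vote among the k training points closest to query."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Word counts for ['bad', 'good'] in three training reviews.
train_X = [[0, 1], [5, 6], [4, 0]]
train_y = ["positive", "negative", "negative"]

# A long review that uses "good" ten times and "bad" once.
query = [1, 10]

print(knn_predict(train_X, train_y, query, k=1, distance=euclidean))
print(knn_predict(train_X, train_y, query, k=1, distance=cosine_distance))
```

With Euclidean distance the long query lands closest to the other long review, while cosine distance matches it to the short review with the same word proportions, so the two metrics can disagree on the same data.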

In [None]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 1: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"],
                                                        test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2,
                                         metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

Note: Cross-validation will be discussed in detail in the upcoming lab session.

In [None]:
## KNN accuracy after using BoW
predicted, y_test = bow_knn()

KNN with BOW accuracy = 62.30366492146597%




Cross Validation Accuracy: 0.62
[0.60784314 0.58431373 0.66141732]




In [None]:
## KNN accuracy after using TFIDF
predicted, y_test = tfidf_knn()

KNN with TFIDF accuracy = 70.15706806282722%




Cross Validation Accuracy: 0.73
[0.7254902  0.74117647 0.72834646]


# Section 4: SPAM TEXT DATASET
Now let's use what we've learnt to classify texts as spam or not spam.

In [None]:
# Upload the spam text data CSV file that has been shared with you. You can also download the file from https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

Saving spam.csv to spam.csv


In [None]:
import pandas as pd
df = pd.read_csv('spam.csv')
df

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [None]:
df['Category'] = df['Category'].map({'ham': 0, 'spam': 1})

In [None]:
df.head(5)

Unnamed: 0,Category,Message
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


In [None]:
len(df)

5572

In [None]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 2: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2, metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

In [None]:
# This cell may take some time to run
predicted, y_test = bow_knn()

KNN with BOW accuracy = 92.19730941704036%
Cross Validation Accuracy: 0.91
[0.90713324 0.90040377 0.91245791]




In [None]:
# This cell may take some time to run
predicted, y_test = tfidf_knn()

KNN with TFIDF accuracy = 98.56502242152466%
Cross Validation Accuracy: 0.97
[0.96837147 0.96769852 0.96363636]


### Questions to Think About and Answer
1. Why does the TF-IDF approach generally result in better accuracy than Bag-of-Words?
2. Can you think of techniques that are better than both BoW and TF-IDF?
3. Read about Stemming and Lemmatization from the resources given below. Think about the pros and cons of each.

In [None]:


1. Weighting by Term Importance:
   - BoW: Treats all words equally, using simple word frequency (the count of each word) in a document. This can lead to high-frequency words (like "the", "and", etc.) dominating the representation, even though they carry little meaning.
   - TF-IDF: Adjusts the frequency of words by how commonly they appear across the entire corpus. It gives higher importance to terms that appear frequently in a document but are rare across other documents, indicating they are more relevant or specific to that document's content.
2. Dealing with Common Words (Stop Words):
   - BoW: Common words (such as "a", "an", "the", etc.) can overwhelm the model because they appear in nearly every document, even though they provide little value in distinguishing between documents.
   - TF-IDF: Reduces the weight of common words by factoring in the inverse document frequency. Words that appear in many documents have a lower IDF score, which reduces their impact on the final representation.
3. Reducing Noise from Frequent but Irrelevant Words:
   - BoW: High-frequency words (even if they are contextually irrelevant) can significantly affect the model's performance. For example, a frequently occurring word like "movie" might appear in many documents, but it may not be very informative about the specific content of a document.
   - TF-IDF: By balancing term frequency with inverse document frequency, it helps highlight words that are important for distinguishing between documents, and diminishes the impact of frequent but unimportant words.
4. Better Representation of Rare but Important Terms:
   - BoW: Rare words that could be highly relevant (e.g., specific jargon or domain-related terms) are treated the same as more frequent words, which may not provide an effective feature set for classification.
   - TF-IDF: Rare terms get higher weights, ensuring that important but uncommon words (such as technical terms or unique identifiers) contribute more to the document's representation.
5. Improved Generalization:
   - BoW: Can result in overfitting, especially in cases where certain words may be too specific to certain documents or categories.
   - TF-IDF: Helps the model generalize better by emphasizing words that have a more unique and meaningful contribution, thus improving classification or clustering performance.

In summary, TF-IDF improves upon BoW by weighing terms according to their significance, reducing the influence of overly common words, and improving the model's ability to identify the truly informative terms in a document. This leads to better accuracy, especially in text classification tasks.

In [None]:


Stemming and Lemmatization are two fundamental techniques used in Natural Language Processing (NLP) for text preprocessing. Both aim to reduce words to their base or root form, but they do so in different ways. Here's a breakdown of each technique, along with their pros and cons:

1. Stemming

Stemming is the process of removing suffixes from words to get to their "root" or base form. This process is often heuristic-based, relying on predefined rules to strip suffixes from words.

Example:
- running → run
- happiness → happi
- better → better (not changed by the stemmer)

Pros of Stemming:
- Fast and Simple: Stemming algorithms like the Porter Stemmer are quick and computationally efficient.
- Works well for some languages: Particularly effective when you don't need perfect linguistic accuracy.
- Reduces dimensionality: By reducing words to their root forms, it helps reduce the complexity of the vocabulary in tasks like text classification or clustering.

Cons of Stemming:
- Overstemming and Understemming: Stemming can sometimes produce incorrect or non-existent roots (e.g., "happiness" → "happi"), which may not be meaningful in context.
- Loss of meaning: It does not preserve the exact meaning of words. For example, "universe" and "university" both stem to "univers," but they have different meanings.
- Lack of linguistic accuracy: Stemming is not based on proper linguistic rules but rather on heuristics, so it can lead to non-standard word forms that may not make sense in certain applications.

2. Lemmatization

Lemmatization is a more sophisticated technique that reduces a word to its base form by considering its meaning and part of speech. Lemmatization uses a vocabulary and morphological analysis to ensure that the root word is a valid word in the language.

Example:
- running → run (verb)
- better → good (adjective)
- mice → mouse (noun)

Pros of Lemmatization:
- Accurate and meaningful: Lemmatization produces real words (e.g., "better" becomes "good"), preserving meaning and making the process more linguistically sound.
- Context-aware: Lemmatization can handle different forms of a word (e.g., "running" can become "run" only if it's used as a verb).
- Reduces ambiguity: Helps in distinguishing words that share similar forms but have different meanings.

Cons of Lemmatization:
- Slower than stemming: Lemmatization requires more complex algorithms and, in some cases, access to dictionaries or additional resources, making it slower than stemming.
- Requires more computational resources: The additional steps in processing (e.g., part-of-speech tagging) make lemmatization computationally heavier.
- Dependency on external resources: Lemmatizers typically need access to a lexicon or corpus to correctly identify the lemma (e.g., WordNet), which may not always be available.

Summary: Stemming vs. Lemmatization

| Feature | Stemming | Lemmatization |
|---|---|---|
| Accuracy | Lower (may result in incorrect words) | Higher (produces valid words) |
| Complexity | Simple and fast | More complex and slower |
| Resource Requirements | No external resources | Requires lexicons or external resources |
| Meaning Preservation | Can lose meaning (e.g., "happiness" → "happi") | Preserves meaning (e.g., "better" → "good") |
| Use case | Suitable for simple applications with less focus on linguistic accuracy | Better for applications where linguistic correctness matters |

In [None]:
1. Vector Representations and Similarity:
In NLP, vector representations of words, sentences, or documents are used to capture their meaning in a mathematical space. These vectors are often created using word embeddings (e.g., Word2Vec, GloVe) or more advanced methods like transformer-based models (e.g., BERT, GPT).

Goal: The idea is that similar texts or words should have similar representations (vectors) in this space. For example, "king" and "queen" or "dog" and "cat" would be close to each other in the vector space, reflecting their semantic similarity.

Similarity: Texts that share similar topics or meanings should be closer in this space, which allows for more efficient comparisons, clustering, classification, and other NLP tasks.

2. Self-Supervised Learning:
Self-supervised learning is a type of machine learning where the model learns to predict parts of the input data from other parts, without relying on labeled data. This approach is highly beneficial in NLP, where vast amounts of unlabeled text data are available.

Pretraining: In the context of models like BERT or GPT, self-supervised tasks can involve predicting the next word in a sentence, filling in missing words, or even determining whether one sentence logically follows from another. The model uses these tasks to learn rich representations of language.

Fine-Tuning: After pretraining, the model can be fine-tuned on a specific task (like sentiment analysis, question answering, etc.) with labeled data. Fine-tuning helps adjust the model's knowledge to the specifics of the target task.

3. Bringing Similar Texts Closer Together:
When using models trained with self-supervised learning, the idea is to map similar inputs (e.g., similar sentences or documents) to similar vectors in the model's embedding space. This allows the model to distinguish between relevant and irrelevant inputs, improving performance on downstream tasks.

For example:

Contrastive Learning (a type of self-supervised learning) might involve creating positive pairs (similar texts) and negative pairs (dissimilar texts), and then training the model to minimize the distance between positive pairs and maximize the distance between negative pairs in the vector space.

Benefits of Self-Supervised Learning in NLP:
- No need for labeled data: Self-supervised methods can leverage vast amounts of unlabeled text data, which is often abundant and inexpensive to collect.
- Rich representations: By learning from context (e.g., predicting missing words or next words), models can capture nuanced meanings, relationships, and structures in language.
- Improved task performance: Fine-tuning a pre-trained model on a specific task allows for strong generalization even with a limited amount of labeled data.

In Summary:
- Vector representations bring similar texts closer together in the vector space, which is crucial for tasks like search, clustering, or document similarity.
- Self-supervised learning allows models to learn useful representations without labeled data, and the model can later be fine-tuned to specific tasks for enhanced performance.
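
The contrastive idea described above can be sketched with a classic margin-based pairwise loss. This is one common formulation, chosen here as an assumption for illustration; real systems often use other objectives. The function name `contrastive_loss`, the margin value, and the toy 2-D "embeddings" are all hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(a, b, is_similar, margin=1.0):
    """Pairwise contrastive loss for a single pair of embeddings.

    Similar pairs are penalized by their squared distance (pulled together).
    Dissimilar pairs are penalized only while closer than `margin` (pushed apart).
    """
    d = euclidean(a, b)
    if is_similar:
        return d ** 2
    return max(0.0, margin - d) ** 2


anchor = [0.0, 0.0]
print(contrastive_loss(anchor, [0.0, 0.0], is_similar=True))    # identical positive pair
print(contrastive_loss(anchor, [0.5, 0.0], is_similar=False))   # negative pair that is too close
print(contrastive_loss(anchor, [3.0, 4.0], is_similar=False))   # negative pair already far apart
```

Training a model to reduce this loss over many pairs moves similar texts closer together and keeps dissimilar texts at least a margin apart.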






### Useful Resources for further reading
1. Stemming and Lemmatization: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
2. TF-IDF and BoW : https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
3. TF-IDF: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
