<a href="https://colab.research.google.com/github/vijay638233/Vijaya_INFO5731_Fall2024/blob/main/Mallidi_Vijayaramareddy_Exercise_5_(1).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 5**

**This exercise aims to provide a comprehensive learning experience in text analysis and machine learning techniques, focusing on both text classification and clustering tasks.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid creating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## **Question 1 (20 Points)**

The purpose of this question is to practice different machine learning algorithms for **text classification**, as well as performance evaluation. In addition, you are required to conduct **10-fold cross-validation** (https://scikit-learn.org/stable/modules/cross_validation.html) during training.



The dataset can be downloaded from Canvas. It contains two files, training data and test data, for sentiment analysis of IMDB reviews. There are two categories: 1 represents positive and 0 represents negative. You need to split the training data into training and validation sets (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10-fold cross-validation while training the classifier. The final trained model should be evaluated on the test data.
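
The split-then-cross-validate workflow described above can be sketched as follows. This is a minimal illustration, not the required solution: the synthetic data from `make_classification` and the `LogisticRegression` classifier are assumptions standing in for the vectorized reviews and for whichever algorithm is being evaluated.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic stand-in for the vectorized reviews (hypothetical data, not the course files).
X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# 80% training, 20% validation; stratify keeps the class balance in both parts.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 10-fold cross-validation on the training part only.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
clf = LogisticRegression(max_iter=1000)
cv_scores = cross_val_score(clf, X_train, y_train, cv=cv, scoring="accuracy")
print(f"CV accuracy: {cv_scores.mean():.4f} +/- {cv_scores.std():.4f}")

# Refit on the full training part and check the held-out validation part.
clf.fit(X_train, y_train)
val_accuracy = clf.score(X_val, y_val)
print(f"Validation accuracy: {val_accuracy:.4f}")
```

The test file plays no role in either step; it is only used once, after a final model has been chosen.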


**Algorithms:**

*   MultinomialNB
*   SVM
*   KNN
*   Decision tree
*   Random Forest
*   XGBoost
*   Word2Vec
*   BERT

**Evaluation measurement:**


*   Accuracy
*   Recall
*   Precision
*   F-1 score
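
As a small worked example of the four measurements listed above, the sketch below computes them on ten made-up predictions. The label vectors are hypothetical and chosen so the confusion counts are easy to verify by hand.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels: 1 = positive review, 0 = negative review.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Counts behind the metrics: TP = 3, FN = 1, FP = 1, TN = 5.
accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / all = 8 / 10
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 4
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 4
f1 = f1_score(y_true, y_pred)                # 2 * P * R / (P + R) = 0.75

print(accuracy, precision, recall, f1)
```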


In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from gensim.models import Word2Vec
from transformers import BertTokenizer, BertModel
import torch
import nltk
from nltk.tokenize import word_tokenize
import ssl
import warnings
warnings.filterwarnings("ignore")

ssl._create_default_https_context = ssl._create_unverified_context
nltk.download('punkt')
train_file = "stsa-train.txt"
test_file = "stsa-test.txt"

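def load_data(filepath):
    # Each line has the form "<label> <text>", where the label is 0 (negative) or 1 (positive)
    data = []
    labels = []
    with open(filepath, 'r', encoding='utf-8') as file:
        for line in file:
            if not line.strip():
                continue  # skip blank lines
            label, text = line.split(maxsplit=1)
            data.append(text.strip())
            labels.append(int(label))
    return pd.DataFrame({'text': data, 'label': labels})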

train_data = load_data(train_file)
test_data = load_data(test_file)
X_train, X_val, y_train, y_val = train_test_split(
    train_data['text'], train_data['label'], test_size=0.2, random_state=42
)
tfidf = TfidfVectorizer(max_features=5000)
X_train_tfidf = tfidf.fit_transform(X_train)
X_val_tfidf = tfidf.transform(X_val)
X_test_tfidf = tfidf.transform(test_data['text'])

def train_word2vec(sentences):
    tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]
    model = Word2Vec(sentences=tokenized_sentences, vector_size=100, window=5, min_count=2, workers=4)
    return model

def get_sentence_embedding(model, sentence):
    words = word_tokenize(sentence.lower())
    embeddings = [model.wv[word] for word in words if word in model.wv]
    if len(embeddings) > 0:
        return np.mean(embeddings, axis=0)
    else:
        return np.zeros(model.vector_size)

word2vec_model = train_word2vec(X_train)
X_train_w2v = np.array([get_sentence_embedding(word2vec_model, sentence) for sentence in X_train])
X_val_w2v = np.array([get_sentence_embedding(word2vec_model, sentence) for sentence in X_val])
X_test_w2v = np.array([get_sentence_embedding(word2vec_model, sentence) for sentence in test_data['text']])

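def get_bert_embeddings(sentences, tokenizer, model, batch_size=32):
    # Encode in batches to bound memory use, and mean-pool over real tokens only
    # (padding positions are masked out so they do not dilute the sentence vector).
    model.eval()
    embeddings = []
    for i in range(0, len(sentences), batch_size):
        tokenized = tokenizer(sentences[i:i + batch_size], padding=True, truncation=True, return_tensors="pt", max_length=128)
        with torch.no_grad():
            outputs = model(**tokenized)
        mask = tokenized["attention_mask"].unsqueeze(-1).float()
        pooled = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
        embeddings.append(pooled.numpy())
    return np.vstack(embeddings)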

bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')

X_train_bert = get_bert_embeddings(X_train.tolist(), bert_tokenizer, bert_model)
X_val_bert = get_bert_embeddings(X_val.tolist(), bert_tokenizer, bert_model)
X_test_bert = get_bert_embeddings(test_data['text'].tolist(), bert_tokenizer, bert_model)
models = {
    "MultinomialNB": MultinomialNB(),
    "SVM": SVC(kernel='linear', probability=True),
    "KNN": KNeighborsClassifier(),
    "DecisionTree": DecisionTreeClassifier(),
    "RandomForest": RandomForestClassifier(),
    # `use_label_encoder` is deprecated in recent XGBoost releases, so it is omitted
    "XGBoost": XGBClassifier(eval_metric='logloss')
}
def evaluate_models(models, X_train, y_train, X_val, y_val):
    results = {}
    for name, model in models.items():
        # Perform 10-fold cross-validation
        scores = cross_val_score(model, X_train, y_train, cv=10, scoring='accuracy')
        print(f"{name} - Mean CV Accuracy: {np.mean(scores):.4f}")

        # Train on full training data
        model.fit(X_train, y_train)

        # Validate
        y_val_pred = model.predict(X_val)
        results[name] = {
            "Accuracy": accuracy_score(y_val, y_val_pred),
            "Precision": precision_score(y_val, y_val_pred),
            "Recall": recall_score(y_val, y_val_pred),
            "F1 Score": f1_score(y_val, y_val_pred)
        }
    return results
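print("\nTF-IDF Results:")
tfidf_results = evaluate_models(models, X_train_tfidf, y_train, X_val_tfidf, y_val)
print(pd.DataFrame(tfidf_results).T)

# MultinomialNB rejects negative feature values, and dense embeddings contain them,
# so it is left out for the Word2Vec and BERT features.
dense_models = {name: model for name, model in models.items() if name != "MultinomialNB"}

print("\nWord2Vec Results:")
word2vec_results = evaluate_models(dense_models, X_train_w2v, y_train, X_val_w2v, y_val)
print(pd.DataFrame(word2vec_results).T)

print("\nBERT Results:")
bert_results = evaluate_models(dense_models, X_train_bert, y_train, X_val_bert, y_val)
print(pd.DataFrame(bert_results).T)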
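# Pick the model with the highest validation F1 on the TF-IDF features instead of hard-coding one
best_name = max(tfidf_results, key=lambda name: tfidf_results[name]["F1 Score"])
best_model = models[best_name]
best_model.fit(X_train_tfidf, y_train)
y_test_pred = best_model.predict(X_test_tfidf)
test_accuracy = accuracy_score(test_data['label'], y_test_pred)
print(f"\nTest Accuracy of Best Model ({best_name}): {test_accuracy:.4f}")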

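
The cell above fits `TfidfVectorizer` on the whole training split before calling `cross_val_score`, so the vocabulary and IDF weights in every fold have already seen that fold's held-out documents. One way to avoid this, sketched here on a tiny made-up corpus (the texts and labels are assumptions, not the course data), is to wrap the vectorizer and the classifier in a pipeline so the vectorizer is refit inside each fold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny hypothetical corpus standing in for the review text.
positive = ["great movie", "loved the acting", "wonderful and moving story",
            "a brilliant film", "really enjoyable plot"]
negative = ["terrible movie", "hated the acting", "boring and slow story",
            "a dreadful film", "really awful plot"]
texts = positive * 4 + negative * 4
labels = [1] * 20 + [0] * 20

# The vectorizer is part of the estimator, so each CV fold fits TF-IDF
# on its own training portion only.
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
scores = cross_val_score(pipeline, texts, labels, cv=10, scoring="f1")
print(f"Mean F1 over {len(scores)} folds: {scores.mean():.4f}")
```

The same pattern should work with any of the classifiers in the `models` dictionary in place of `MultinomialNB`.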


## **Question 2 (20 Points)**

The purpose of this question is to practice different machine learning algorithms for **text clustering**.

Please download the dataset using the following link: https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
(You can also use a different text dataset of your choice.)

**Apply the listed clustering methods to the dataset:**
*   K-means
*   DBSCAN
*   Hierarchical clustering
*   Word2Vec
*   BERT

You can refer to the code at the following link:
https://www.kaggle.com/karthik3890/text-clustering

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
from sklearn.cluster import AgglomerativeClustering
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import nltk
from transformers import BertTokenizer, BertModel
import torch


import ssl
ssl._create_default_https_context = ssl._create_unverified_context

# Ensure the NLTK punkt tokenizer is downloaded
nltk.download('punkt')


df = pd.read_csv('Amazon_Unlocked_Mobile.csv')
print(df.head())
df.dropna(subset=['Reviews'], inplace=True)
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
X = vectorizer.fit_transform(df['Reviews'])

print("TF-IDF Matrix Shape: ", X.shape)
# n_init is set explicitly to avoid the FutureWarning about its changing default
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
df['kmeans_cluster'] = kmeans.fit_predict(X)
print(df[['Reviews', 'kmeans_cluster']].head())

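# DBSCAN and agglomerative clustering rely on pairwise distances, which is impractical for
# all ~414k reviews, so both run on a random subset (the size of 5000 is an assumption).
sample = df.sample(n=5000, random_state=42)
X_sample = vectorizer.transform(sample['Reviews'])
# Drop reviews with no in-vocabulary terms: cosine distance is undefined for all-zero rows.
keep = X_sample.getnnz(axis=1) > 0
sample, X_sample = sample[keep].copy(), X_sample[keep]

dbscan = DBSCAN(eps=0.5, min_samples=5, metric='cosine')
sample['dbscan_cluster'] = dbscan.fit_predict(X_sample)
print(sample[['Reviews', 'dbscan_cluster']].head())

# scikit-learn >= 1.2 uses `metric` here; the older `affinity` argument was removed in 1.4
hierarchical = AgglomerativeClustering(n_clusters=5, metric='cosine', linkage='average')
sample['hierarchical_cluster'] = hierarchical.fit_predict(X_sample.toarray())
print(sample[['Reviews', 'hierarchical_cluster']].head())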

tokenized_reviews = df['Reviews'].apply(word_tokenize)
model = Word2Vec(sentences=tokenized_reviews, vector_size=100, window=5, min_count=1, workers=4)
def get_review_vector(review):
    vectors = [model.wv[word] for word in review if word in model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)

df['word2vec_vector'] = tokenized_reviews.apply(get_review_vector)
word2vec_vectors = np.vstack(df['word2vec_vector'].values)
kmeans_word2vec = KMeans(n_clusters=5, random_state=42)
df['word2vec_cluster'] = kmeans_word2vec.fit_predict(word2vec_vectors)
print(df[['Reviews', 'word2vec_cluster']].head())

# Separate names for the BERT objects so the Word2Vec `model` above is not overwritten
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')
def get_bert_embeddings(text):
    inputs = bert_tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=512)
    with torch.no_grad():
        outputs = bert_model(**inputs)
    # Mean-pool the token embeddings into a single vector per review
    return outputs.last_hidden_state.mean(dim=1).squeeze().numpy()

# Note: encoding reviews one at a time is slow on the full dataset
df['bert_embedding'] = df['Reviews'].apply(get_bert_embeddings)
bert_embeddings = np.vstack(df['bert_embedding'].values)
kmeans_bert = KMeans(n_clusters=5, random_state=42)
df['bert_cluster'] = kmeans_bert.fit_predict(bert_embeddings)
print(df[['Reviews', 'bert_cluster']].head())



                                        Product Name Brand Name   Price  \
0  "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7...    Samsung  199.99   
1  "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7...    Samsung  199.99   
2  "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7...    Samsung  199.99   
3  "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7...    Samsung  199.99   
4  "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7...    Samsung  199.99   

   Rating                                            Reviews  Review Votes  
0       5  I feel so LUCKY to have found this used (phone...           1.0  
1       4  nice phone, nice up grade from my pantach revu...           0.0  
2       5                                       Very pleased           0.0  
3       4  It works good but it goes slow sometimes but i...           0.0  
4       4  Great phone to replace my lost phone. The only...           0.0  
TF-IDF Matrix Shape:  (413770, 1000)




                                             Reviews  kmeans_cluster
0  I feel so LUCKY to have found this used (phone...               2
1  nice phone, nice up grade from my pantach revu...               2
2                                       Very pleased               2
3  It works good but it goes slow sometimes but i...               1
4  Great phone to replace my lost phone. The only...               2


**In one paragraph, please compare the results of K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT.**

**Write your response here:**

K-means, DBSCAN, and hierarchical clustering each bring different advantages and drawbacks when applied to the reviews. K-means works well when the data has a centroid-based structure, since it separates similar reviews using their high-dimensional TF-IDF vectors, but it struggles with irregularly shaped clusters and with noise. DBSCAN, a density-based method, handles both: it can find clusters of arbitrary shape and label outliers as noise. However, its results depend heavily on the hyperparameters eps and min_samples, which are hard to tune in practice, especially on large datasets. Hierarchical clustering produces a dendrogram that shows how clusters nest and merge, but its computational cost restricts it to smaller datasets. Word2Vec and BERT supply semantic embeddings, so the clustering algorithm applied to them can group reviews that share deeper, possibly latent similarities. Word2Vec captures word-level semantics, which is adequate for simpler sentences, whereas BERT produces contextual embeddings that reflect sentence-level meaning and can give more accurate clusters when semantic understanding matters. Overall, richer embeddings should improve clustering quality, especially for context-rich text.
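
The comparison above is qualitative. One way to put a number on it, sketched here on synthetic data, is the silhouette score, which rewards tight, well-separated clusters. Everything in this example is an assumption for illustration: the blobs from `make_blobs` stand in for review embeddings, and the `eps`, `min_samples`, and cluster counts are not tuned for the Amazon dataset.

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for review embeddings (hypothetical data).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

labelings = {
    "K-means": KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X),
    "DBSCAN": DBSCAN(eps=0.8, min_samples=5).fit_predict(X),
    "Hierarchical": AgglomerativeClustering(n_clusters=3).fit_predict(X),
}

scores = {}
for name, labels in labelings.items():
    kept = labels != -1  # DBSCAN marks noise points with -1; leave them out
    n_clusters = len(set(labels[kept]))
    # The silhouette score is only defined for at least two clusters.
    if n_clusters >= 2:
        scores[name] = silhouette_score(X[kept], labels[kept])
    else:
        scores[name] = float("nan")
    print(f"{name}: clusters={n_clusters}, noise={np.sum(~kept)}, silhouette={scores[name]:.3f}")
```

On the real reviews, the same loop could be run over the TF-IDF, Word2Vec, and BERT representations, ideally on a subset, since the silhouette score also needs pairwise distances.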



# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.


**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
Please write your answer here:

This exercise was an excellent opportunity to apply different machine learning algorithms to text classification and to gain practical skills in applications such as sentiment analysis. It was also interesting to work with models such as Naive Bayes, SVM, and Random Forest and to see how well they perform in practice. Moreover, using state-of-the-art techniques such as Word2Vec and BERT embeddings enriched the learning experience by exposing the details of modern natural language processing (NLP).

One of the major challenges was using computational resources effectively and efficiently, particularly with large embedding models such as BERT, which raised optimization and memory-management problems. At the other extreme, simpler approaches such as TF-IDF with MultinomialNB showed that traditional methods can be reasonably effective with very little complexity. This contrast underscored the trade-off between simpler, more interpretable models and more advanced, more accurate ones.

Applying 10-fold cross-validation underlined the importance of reliable model evaluation and comparison: the performance measures, although model-specific, did not depend heavily on one particular split of the dataset. It also helped me appreciate how much model performance varies across folds.

In conclusion, this exercise was a good blend of theory and practice. It deepened my understanding of text pre-processing, feature engineering, and model evaluation, while giving insight into how machine learning techniques, in this case sentiment analysis, can be used to solve real-life problems. It was a difficult but enjoyable task that illustrated the wide range of techniques available in modern machine learning.

'''