
# **INFO5731 In-class Exercise 5**

**This exercise aims to provide a comprehensive learning experience in text analysis and machine learning techniques, focusing on both text classification and clustering tasks.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## **Question 1 (20 Points)**

The purpose of this question is to practice different machine learning algorithms for **text classification** and to evaluate their performance. In addition, you are required to conduct **10-fold cross-validation** (https://scikit-learn.org/stable/modules/cross_validation.html) during training.



The dataset can be downloaded from Canvas. It contains two files, training data and test data, for sentiment analysis of IMDB reviews, with two categories: 1 represents positive and 0 represents negative. Split the training data into training and validation sets (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10-fold cross-validation while training the classifier. The final trained model should then be evaluated on the test data.
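
A minimal sketch of the protocol described above, assuming scikit-learn is installed: hold out 20% of the training data for validation, then run 10-fold cross-validation on the remaining 80%. The toy corpus and the names `toy_texts` and `toy_labels` are made up for illustration. Wrapping the vectorizer and classifier in a pipeline refits the vocabulary inside each fold, so no fold is scored with features learned from its own held-out part.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 30 positive and 30 negative one-line "reviews".
positive = ["great", "wonderful", "moving", "brilliant", "funny"]
negative = ["dull", "boring", "clumsy", "tedious", "awful"]
toy_texts, toy_labels = [], []
for i in range(30):
    toy_texts.append(f"a {positive[i % 5]} and {positive[(i + 1) % 5]} film")
    toy_labels.append(1)
    toy_texts.append(f"a {negative[i % 5]} and {negative[(i + 1) % 5]} film")
    toy_labels.append(0)

# 80% training, 20% validation; stratify keeps both classes balanced.
X_train, X_val, y_train, y_val = train_test_split(
    toy_texts, toy_labels, test_size=0.2, stratify=toy_labels, random_state=42
)

# The pipeline refits TF-IDF on the training part of every fold.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
cv_scores = cross_validate(
    model, X_train, y_train, cv=cv,
    scoring=["accuracy", "precision", "recall", "f1"],
)

print("folds:", len(cv_scores["test_accuracy"]))
print("mean CV accuracy:", cv_scores["test_accuracy"].mean())

# Fit on the full training split, then score the validation split once.
model.fit(X_train, y_train)
val_accuracy = model.score(X_val, y_val)
print("validation accuracy:", val_accuracy)
```

The test file stays untouched until model selection is finished; it would be scored once at the very end with the same fitted pipeline.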


**Algorithms:**

*   MultinomialNB
*   SVM
*   KNN
*   Decision tree
*   Random Forest
*   XGBoost
*   Word2Vec
*   BERT

**Evaluation measurement:**


*   Accuracy
*   Recall
*   Precision
*   F1 score
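
For reference, the four measures can be computed from the confusion-matrix counts of a binary classifier. The sketch below is a plain-Python illustration with made-up labels; the helper name `binary_metrics` is hypothetical, and in the notebook itself `sklearn.metrics` provides the same quantities.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, recall, precision and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators (no positives present or predicted).
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, recall, precision, f1

# Made-up labels: 1 = positive review, 0 = negative review.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
accuracy, recall, precision, f1 = binary_metrics(y_true, y_pred)
print(accuracy, recall, precision, f1)  # 0.75 0.75 0.75 0.75
```

Here three of four positives are found (recall 0.75) and three of four positive predictions are correct (precision 0.75), so F1, their harmonic mean, is also 0.75.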


In [3]:
# Sentiment Analysis Classification Pipeline

import numpy as np
import pandas as pd
import torch
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from gensim.models import Word2Vec
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from transformers import BertTokenizer, BertModel
import time

# Load data
print("Loading data...")
train_data_path = "/content/stsa-train.txt"
test_data_path = "/content/stsa-test.txt"

# Function to load data
def load_data(file_path):
    data = []
    labels = []
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            # Each line is "<label> <review text>"; skip blank lines
            if not line.strip():
                continue
            label, text = line.split(' ', 1)
            data.append(text.strip())
            labels.append(int(label))
    return data, labels

train_texts, train_labels = load_data(train_data_path)
test_texts, test_labels = load_data(test_data_path)

# Split train data into train and validation sets
print("Splitting data into train and validation sets...")
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=0.2, random_state=42)

# Convert text data into numerical features
print("Converting text data into numerical features...")
count_vectorizer = CountVectorizer()
X_train_counts = count_vectorizer.fit_transform(train_texts)
X_val_counts = count_vectorizer.transform(val_texts)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_val_tfidf = tfidf_transformer.transform(X_val_counts)

# Function to perform 10 fold cross validation
def perform_cross_validation(classifier, X, y):
    print("Performing cross-validation...")
    cv = KFold(n_splits=10, shuffle=True, random_state=42)
    scores = cross_val_score(classifier, X, y, cv=cv, scoring='accuracy')
    return scores.mean()

# Function to evaluate classifier
def evaluate_classifier(classifier, X_train, y_train, X_test, y_test):
    print("Evaluating classifier...")
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    return accuracy, recall, precision, f1

# Function to measure execution time
def measure_execution_time(start_time, end_time):
    duration = end_time - start_time
    return duration

# Multinomial Naive Bayes
start_time = time.time()
print("Training Multinomial Naive Bayes classifier...")
nb_classifier = MultinomialNB()
nb_cv_accuracy = perform_cross_validation(nb_classifier, X_train_counts, train_labels)
nb_accuracy, nb_recall, nb_precision, nb_f1 = evaluate_classifier(nb_classifier, X_train_counts, train_labels, X_val_counts, val_labels)
end_time = time.time()
print("Execution Time:", measure_execution_time(start_time, end_time), "seconds")
print("Multinomial Naive Bayes Classifier Results:")
print("Cross Validation Accuracy:", nb_cv_accuracy)
print("Accuracy:", nb_accuracy)
print("Recall:", nb_recall)
print("Precision:", nb_precision)
print("F1 Score:", nb_f1)
print()

# SVM
start_time = time.time()
print("Training SVM classifier...")
svm_classifier = SVC()
svm_cv_accuracy = perform_cross_validation(svm_classifier, X_train_tfidf, train_labels)
svm_accuracy, svm_recall, svm_precision, svm_f1 = evaluate_classifier(svm_classifier, X_train_tfidf, train_labels, X_val_tfidf, val_labels)
end_time = time.time()
print("Execution Time:", measure_execution_time(start_time, end_time), "seconds")
print("SVM Classifier Results:")
print("Cross Validation Accuracy:", svm_cv_accuracy)
print("Accuracy:", svm_accuracy)
print("Recall:", svm_recall)
print("Precision:", svm_precision)
print("F1 Score:", svm_f1)
print()

# KNN
start_time = time.time()
print("Training KNN classifier...")
knn_classifier = KNeighborsClassifier()
knn_cv_accuracy = perform_cross_validation(knn_classifier, X_train_tfidf, train_labels)
knn_accuracy, knn_recall, knn_precision, knn_f1 = evaluate_classifier(knn_classifier, X_train_tfidf, train_labels, X_val_tfidf, val_labels)
end_time = time.time()
print("Execution Time:", measure_execution_time(start_time, end_time), "seconds")
print("KNN Classifier Results:")
print("Cross Validation Accuracy:", knn_cv_accuracy)
print("Accuracy:", knn_accuracy)
print("Recall:", knn_recall)
print("Precision:", knn_precision)
print("F1 Score:", knn_f1)
print()

# Decision Tree
start_time = time.time()
print("Training Decision Tree classifier...")
dt_classifier = DecisionTreeClassifier()
dt_cv_accuracy = perform_cross_validation(dt_classifier, X_train_tfidf, train_labels)
dt_accuracy, dt_recall, dt_precision, dt_f1 = evaluate_classifier(dt_classifier, X_train_tfidf, train_labels, X_val_tfidf, val_labels)
end_time = time.time()
print("Execution Time:", measure_execution_time(start_time, end_time), "seconds")
print("Decision Tree Classifier Results:")
print("Cross Validation Accuracy:", dt_cv_accuracy)
print("Accuracy:", dt_accuracy)
print("Recall:", dt_recall)
print("Precision:", dt_precision)
print("F1 Score:", dt_f1)
print()

# Random Forest
start_time = time.time()
print("Training Random Forest classifier...")
rf_classifier = RandomForestClassifier()
rf_cv_accuracy = perform_cross_validation(rf_classifier, X_train_tfidf, train_labels)
rf_accuracy, rf_recall, rf_precision, rf_f1 = evaluate_classifier(rf_classifier, X_train_tfidf, train_labels, X_val_tfidf, val_labels)
end_time = time.time()
print("Execution Time:", measure_execution_time(start_time, end_time), "seconds")
print("Random Forest Classifier Results:")
print("Cross Validation Accuracy:", rf_cv_accuracy)
print("Accuracy:", rf_accuracy)
print("Recall:", rf_recall)
print("Precision:", rf_precision)
print("F1 Score:", rf_f1)
print()

# XGBoost
start_time = time.time()
print("Training XGBoost classifier...")
xgb_classifier = XGBClassifier()
xgb_cv_accuracy = perform_cross_validation(xgb_classifier, X_train_tfidf, train_labels)
xgb_accuracy, xgb_recall, xgb_precision, xgb_f1 = evaluate_classifier(xgb_classifier, X_train_tfidf, train_labels, X_val_tfidf, val_labels)
end_time = time.time()
print("Execution Time:", measure_execution_time(start_time, end_time), "seconds")
print("XGBoost Classifier Results:")
print("Cross Validation Accuracy:", xgb_cv_accuracy)
print("Accuracy:", xgb_accuracy)
print("Recall:", xgb_recall)
print("Precision:", xgb_precision)
print("F1 Score:", xgb_f1)
print()

# Word2Vec
start_time = time.time()
print("Training Word2Vec classifier...")
word2vec_model = Word2Vec(sentences=[text.split() for text in train_texts], vector_size=100, window=5, min_count=1, workers=4)
word2vec_features = np.array([np.mean([word2vec_model.wv[word] for word in text.split() if word in word2vec_model.wv] or [np.zeros(100)], axis=0) for text in train_texts])
word2vec_val_features = np.array([np.mean([word2vec_model.wv[word] for word in text.split() if word in word2vec_model.wv] or [np.zeros(100)], axis=0) for text in val_texts])

word2vec_classifier = RandomForestClassifier()
word2vec_cv_accuracy = perform_cross_validation(word2vec_classifier, word2vec_features, train_labels)
word2vec_accuracy, word2vec_recall, word2vec_precision, word2vec_f1 = evaluate_classifier(word2vec_classifier, word2vec_features, train_labels, word2vec_val_features, val_labels)
end_time = time.time()
print("Execution Time:", measure_execution_time(start_time, end_time), "seconds")
print("Word2Vec Classifier Results:")
print("Cross Validation Accuracy:", word2vec_cv_accuracy)
print("Accuracy:", word2vec_accuracy)
print("Recall:", word2vec_recall)
print("Precision:", word2vec_precision)
print("F1 Score:", word2vec_f1)
print()

# BERT
print("Training BERT classifier...")
start_time = time.time()
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
bert_model.to(device)  # the inputs are moved to `device`, so the model must be there too

# Function to encode text using BERT
def encode_text(text):
    # Tokenize one review and move every tensor (input_ids, attention_mask, ...) to the device
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=128).to(device)
    with torch.no_grad():
        outputs = bert_model(**inputs)
    return outputs.pooler_output.cpu().numpy()

bert_features_train = np.concatenate([encode_text(text) for text in train_texts])
bert_features_val = np.concatenate([encode_text(text) for text in val_texts])

bert_classifier = RandomForestClassifier()
bert_cv_accuracy = perform_cross_validation(bert_classifier, bert_features_train, train_labels)
bert_accuracy, bert_recall, bert_precision, bert_f1 = evaluate_classifier(bert_classifier, bert_features_train, train_labels, bert_features_val, val_labels)
end_time = time.time()
print("Execution Time:", measure_execution_time(start_time, end_time), "seconds")
print("BERT Classifier Results:")
print("Cross Validation Accuracy:", bert_cv_accuracy)
print("Accuracy:", bert_accuracy)
print("Recall:", bert_recall)
print("Precision:", bert_precision)
print("F1 Score:", bert_f1)
print()


Loading data...
Splitting data into train and validation sets...
Converting text data into numerical features...
Training Multinomial Naive Bayes classifier...
Performing cross-validation...
Evaluating classifier...
Execution Time: 0.15351343154907227 seconds
Multinomial Naive Bayes Classifier Results:
Cross Validation Accuracy: 0.7779979240245201
Accuracy: 0.7947976878612717
Recall: 0.8429172510518934
Precision: 0.777490297542044
F1 Score: 0.8088829071332435

Training SVM classifier...
Performing cross-validation...
Evaluating classifier...
Execution Time: 79.04475116729736 seconds
SVM Classifier Results:
Cross Validation Accuracy: 0.775101350689707
Accuracy: 0.7976878612716763
Recall: 0.8597475455820477
Precision: 0.7730138713745272
F1 Score: 0.8140770252324037

Training KNN classifier...
Performing cross-validation...
Evaluating classifier...
Execution Time: 14.630201816558838 seconds
KNN Classifier Results:
Cross Validation Accuracy: 0.7162213329329357
Accuracy: 0.7297687861271677
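
The printed output above covers the validation split only, while the task statement asks for a final evaluation on the test file. One way to run that step without repeating a block per model is to loop over a dictionary of estimators. The sketch below is an illustration on a made-up corpus, not the notebook's actual run: the `models` dictionary, the `report` helper and the toy texts are assumptions, and only three of the listed algorithms are included to keep it short.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Made-up stand-ins for the training and test files.
toy_train_texts = ["a great film", "a wonderful story", "truly moving", "great acting",
                   "a dull film", "a boring story", "truly awful", "dull acting"]
toy_train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_test_texts = ["a wonderful film", "great story", "a boring film", "awful acting"]
toy_test_labels = [1, 1, 0, 0]

def report(y_true, y_pred):
    """Collect the four required measures in one dictionary."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
    }

models = {
    "MultinomialNB": MultinomialNB(),
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "Decision tree": DecisionTreeClassifier(random_state=42),
}

results = {}
for name, estimator in models.items():
    # Each pipeline owns its vectorizer, so the test texts are only transformed.
    pipeline = make_pipeline(TfidfVectorizer(), estimator)
    pipeline.fit(toy_train_texts, toy_train_labels)
    results[name] = report(toy_test_labels, pipeline.predict(toy_test_texts))

for name, scores in results.items():
    print(name, scores)
```

Keeping the scores in one `results` dictionary also makes it easy to build a comparison table at the end, for example with `pandas.DataFrame(results).T`.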


## **Question 2 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text clustering**.

Please download the dataset using the following link: https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
(You may also use a different text dataset of your choice.)

**Apply the listed clustering methods to the dataset:**
*   K-means
*   DBSCAN
*   Hierarchical clustering
*   Word2Vec
*   BERT

You can refer to the code at the following link:
https://www.kaggle.com/karthik3890/text-clustering

In [None]:
!pip install sentence-transformers




In [18]:
# Step 1: Import the Colab Drive helper (only needed in Google Colab; the data below is read from /content, so Drive is not mounted)
from google.colab import drive


# Step 2: Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from gensim.models import Word2Vec
from sentence_transformers import SentenceTransformer

# Step 3: Load the dataset
data_path = "/content/imdb_reviews_sentimentanalysis.csv"  # Update with your file path
df = pd.read_csv(data_path)

# Step 4: Explore the dataset
print(df.head())

# Step 5: Preprocess the data if necessary

# Step 6: Text Clustering with K-means
# TF-IDF Vectorization
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, min_df=5, stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(df['clean_text'].values.astype('U'))

# K-means Clustering
num_clusters = 5
kmeans = KMeans(n_clusters=num_clusters, n_init=10, max_iter=300, random_state=42)
clusters_kmeans = kmeans.fit_predict(tfidf_matrix)

# Step 7: Text Clustering with DBSCAN
# DBSCAN Clustering
dbscan = DBSCAN(eps=0.5, min_samples=5)
clusters_dbscan = dbscan.fit_predict(tfidf_matrix)

# Step 8: Text Clustering with Hierarchical Clustering
# Hierarchical Clustering
agglomerative = AgglomerativeClustering(n_clusters=num_clusters, linkage='ward')
clusters_hierarchical = agglomerative.fit_predict(tfidf_matrix.toarray())

# Step 9: Text Clustering with Word2Vec
# Tokenization
sentences = [review.split() for review in df['clean_text'].values]

# Training Word2Vec model
word2vec_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

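# Applying K-means clustering on Word2Vec embeddings
# Note: this clusters the vocabulary's word vectors, so there is one label per word, not per review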
word_vectors = word2vec_model.wv
word_vectors_matrix = word_vectors.vectors
kmeans_word2vec = KMeans(n_clusters=num_clusters, random_state=42)
clusters_word2vec = kmeans_word2vec.fit_predict(word_vectors_matrix)

# Step 10: Text Clustering with BERT
# Sentence embeddings with BERT
model = SentenceTransformer('bert-base-nli-mean-tokens')
sentence_embeddings = model.encode(df['clean_text'], show_progress_bar=True)

# Applying K-means clustering on BERT embeddings
kmeans_bert = KMeans(n_clusters=num_clusters, random_state=42)
clusters_bert = kmeans_bert.fit_predict(sentence_embeddings)

# Step 11: Visualize the clustering results (optional)

# Step 12: Analyze the clustering results (optional)

# Step 13: Print out the cluster assignments for each method
print("K-means Clustering:")
print(clusters_kmeans)
print("DBSCAN Clustering:")
print(clusters_dbscan)
print("Hierarchical Clustering:")
print(clusters_hierarchical)
print("Word2Vec Clustering:")
print(clusters_word2vec)
print("BERT Clustering:")
print(clusters_bert)


   document_id                                         clean_text Sentiment 
0            1  margot does the best with what shes given but ...   negative
1            2  before making barbie greta gerwigsinglehandedl...   positive
2            3  the quality the humor and the writing of the m...   negative
3            4  as much as it pains me to give a movie called ...   positive
4            5  as a woman that grew up with barbie i was very...   negative





K-means Clustering:
[4 1 3 4 1 1 1 4 4 1 0 0 1 1 4 3 1 1 2 1 3 1 1 4 1 4 1 3 4 1 1 1 4 4 1 2 0
 1 1 4 3 1 1 2 1 3 1 1 4 1 4 1 3 4 1 1 1 4 4 1 1 0 1 1 4 3 1 1 2 1 3 1 1 4
 1 4 1 3 4 1 1 1 4 4 1 4 0 1 1 4 3 1 1 2 1 3 1 1 4 1]
DBSCAN Clustering:
[ 0 -1 -1 -1 -1 -1 -1 -1 -1 -1  1  1 -1  2 -1 -1  3 -1  4 -1 -1 -1 -1 -1
 -1  0 -1 -1 -1 -1 -1 -1 -1 -1 -1  4  1 -1  2 -1 -1  3 -1  4 -1 -1 -1 -1
 -1 -1  0 -1 -1 -1 -1 -1 -1 -1 -1 -1  3  1 -1  2 -1 -1  3 -1  4 -1 -1 -1
 -1 -1 -1  0 -1 -1 -1 -1 -1 -1 -1 -1 -1  0  1 -1  2 -1 -1  3 -1  4 -1 -1
 -1 -1 -1  2]
Hierarchical Clustering:
[2 0 2 1 1 0 1 1 1 1 0 0 1 0 1 1 3 1 4 1 0 0 0 2 1 2 0 2 1 1 0 1 1 1 1 4 0
 1 0 1 1 3 1 4 1 0 0 0 2 1 2 0 2 1 1 0 1 1 1 1 3 0 1 0 1 1 3 1 4 1 0 0 0 2
 1 2 0 2 1 1 0 1 1 1 1 2 0 1 0 1 1 3 1 4 1 0 0 0 2 0]
Word2Vec Clustering:
[4 4 4 ... 2 2 2]
BERT Clustering:
[3 1 0 1 1 4 4 3 2 1 0 0 2 1 1 3 1 2 4 3 3 1 1 2 2 3 1 0 1 1 4 4 3 2 1 4 0
 2 1 1 3 1 2 4 3 3 1 1 2 2 3 1 0 1 1 4 4 3 2 1 1 0 2 1 1 3 1 2 4 3 3 1 1 2
 2 3 1 0 1 1 4 4
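
The Word2Vec output above has one label per vocabulary word rather than one per review, because K-means was fit on `word2vec_model.wv.vectors`. A common way to get one vector per review is to average the vectors of its words. The sketch below shows only that averaging step with NumPy; the tiny `toy_vectors` dictionary is a made-up stand-in for a trained `word2vec_model.wv`, and `document_vector` is a hypothetical helper.

```python
import numpy as np

# Made-up 3-dimensional "embeddings"; a trained model would supply these.
toy_vectors = {
    "great": np.array([1.0, 0.0, 0.0]),
    "film": np.array([0.0, 1.0, 0.0]),
    "dull": np.array([0.0, 0.0, 1.0]),
}

def document_vector(text, vectors, dim):
    """Average the vectors of known words; return zeros if none are known."""
    known = [vectors[word] for word in text.split() if word in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

reviews = ["great film", "dull film", "unknown words only"]
doc_matrix = np.vstack([document_vector(r, toy_vectors, dim=3) for r in reviews])

print(doc_matrix.shape)        # (3, 3): one row per review
print(doc_matrix[0].tolist())  # [0.5, 0.5, 0.0]
```

With a real model, `doc_matrix` would have one row per review and could be passed to `KMeans.fit_predict` in place of `word_vectors_matrix`.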

**In one paragraph, please compare the results of K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT.**


K-means, DBSCAN, hierarchical clustering, Word2Vec, and BERT group the reviews in different ways, and the outputs above reflect that. K-means on TF-IDF vectors assigns every review to one of the five requested clusters, but the result is sensitive to the initial centroids and to the chosen number of clusters. DBSCAN groups points by density and does not need the number of clusters in advance; with `eps=0.5` and `min_samples=5` it labels most reviews as noise (`-1`) and keeps only a few small clusters, which suggests these parameters may be too strict for sparse TF-IDF vectors. Hierarchical (Ward) clustering also returns five clusters and builds them by merging the closest groups step by step, which exposes the nested structure of the data. The Word2Vec run applies K-means to the word vectors of the vocabulary, so it clusters individual words rather than reviews, and its labels are not directly comparable with those of the other methods. The BERT run applies K-means to sentence embeddings of whole reviews, which capture context beyond word counts. Overall, the choice of method depends on the characteristics of the data and the analytical goal, and each method has its own strengths and limitations.
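
The comparison above is qualitative. One hedged way to put numbers on it is to report the share of points DBSCAN marks as noise and a silhouette score for each labelling. The sketch below runs on a made-up corpus with assumed parameter values (`eps=0.9`, `min_samples=2`, cosine distance, two clusters), so its numbers say nothing about the review data; `noise_share` is a hypothetical helper.

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Made-up corpus with two obvious topics.
docs = [
    "battery life is great", "great battery and screen", "screen and battery are great",
    "battery life is short", "plot was dull", "dull plot and acting",
    "acting and plot were dull", "the plot was slow",
]
# Ward linkage needs a dense array, so the same dense matrix is used throughout.
X_dense = TfidfVectorizer().fit_transform(docs).toarray()

labelings = {
    "kmeans": KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_dense),
    "hierarchical": AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X_dense),
    "dbscan": DBSCAN(eps=0.9, min_samples=2, metric="cosine").fit_predict(X_dense),
}

def noise_share(labels):
    """Fraction of points labelled -1 (noise) by DBSCAN."""
    return float(np.mean(np.asarray(labels) == -1))

for name, labels in labelings.items():
    n_clusters = len(set(labels) - {-1})
    line = f"{name}: clusters={n_clusters}, noise={noise_share(labels):.2f}"
    # Silhouette needs between 2 and n-1 distinct labels; noise (-1) counts as a label here.
    if 2 <= len(set(labels)) < len(docs):
        line += f", silhouette={silhouette_score(X_dense, labels):.3f}"
    print(line)
```

A high noise share, as in the DBSCAN output of this notebook, is usually a cue to revisit `eps`, `min_samples`, or the distance metric before comparing cluster quality.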







# Mandatory Question

Working with 10-fold cross-validation was new to me in this exercise, and I also learned about algorithms I had not used before, such as Multinomial Naive Bayes.

