
# **INFO5731 In-class Exercise 5**

**This exercise aims to provide a comprehensive learning experience in text analysis and machine learning techniques, focusing on both text classification and clustering tasks.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## **Question 1 (20 Points)**

The purpose of this question is to practice different machine learning algorithms for **text classification**, as well as performance evaluation. In addition, you are required to conduct **10-fold cross-validation** (https://scikit-learn.org/stable/modules/cross_validation.html) during training.



The dataset can be downloaded from Canvas. It contains two files, train data and test data, for sentiment analysis of IMDB reviews. There are two categories: 1 represents positive and 0 represents negative. You need to split the training data into training and validation sets (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10-fold cross-validation while training the classifier. The final trained model is then evaluated on the test data.
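
One way to combine the 80/20 split with 10-fold cross-validation is sketched below, assuming scikit-learn. The toy corpus, the `texts` and `labels` names, and the choice of `MultinomialNB` are illustrative assumptions, not the Canvas dataset. The vectorizer is wrapped in a pipeline so that it is refit on the training portion of each fold rather than on all of the data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in corpus (assumption): 1 = positive, 0 = negative.
texts = [f"a great and wonderful film {i}" for i in range(25)] + \
        [f"a boring and terrible film {i}" for i in range(25)]
labels = [1] * 25 + [0] * 25

# 80% training, 20% validation, keeping the class balance in both parts.
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# Vectorizer and classifier in one estimator, refit inside every fold.
model = make_pipeline(CountVectorizer(), MultinomialNB())

# 10-fold stratified cross-validation on the training portion only.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_validate(
    model, X_train, y_train, cv=cv,
    scoring=["accuracy", "precision", "recall", "f1"],
)
for metric in ["accuracy", "precision", "recall", "f1"]:
    print(f"CV {metric}: {scores['test_' + metric].mean():.3f}")

# Refit on the full training portion, then check the held-out validation set.
model.fit(X_train, y_train)
val_accuracy = model.score(X_val, y_val)
print(f"Validation accuracy: {val_accuracy:.3f}")
```

The same pattern should apply to the other scikit-learn classifiers by swapping the final pipeline step; the test file would only be scored once, after model selection.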


**Algorithms:**

*   MultinomialNB
*   SVM
*   KNN
*   Decision tree
*   Random Forest
*   XGBoost
*   Word2Vec
*   BERT

**Evaluation measurement:**


*   Accuracy
*   Recall
*   Precision
*   F-1 score
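
All four measures can be derived from the confusion-matrix counts. The following is a minimal pure-Python sketch, where `binary_metrics` is a hypothetical helper name and label 1 is treated as the positive class, as in the dataset description. The example labels are made up for illustration.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 3 true positives, 1 false negative, 1 false positive, 3 true negatives.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In practice the scikit-learn functions `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` compute the same quantities.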


In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from transformers import BertTokenizer, BertForSequenceClassification
from sklearn.model_selection import StratifiedKFold
import torch

# Load the dataset
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

# Preprocess the text data
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(train_data['text'])
X_test_counts = vectorizer.transform(test_data['text'])

# Split the training data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train_counts, train_data['label'], test_size=0.2, random_state=42)

# Define classifiers
classifiers = {
    'Multinomial Naive Bayes': MultinomialNB(),
    'SVM': SVC(),
    'KNN': KNeighborsClassifier(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'XGBoost': XGBClassifier()
}

# Train and evaluate classifiers
results = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    train_pred = clf.predict(X_train)
    val_pred = clf.predict(X_val)
    test_pred = clf.predict(X_test_counts)

    results[name] = {
        'Train Accuracy': accuracy_score(y_train, train_pred),
        'Validation Accuracy': accuracy_score(y_val, val_pred),
        'Test Accuracy': accuracy_score(test_data['label'], test_pred),
        'Test Recall': recall_score(test_data['label'], test_pred),
        'Test Precision': precision_score(test_data['label'], test_pred),
        'Test F1': f1_score(test_data['label'], test_pred)
    }

# Print results
for name, metrics in results.items():
    print(f"{name}:")
    for metric, value in metrics.items():
        print(f"{metric}: {value}")
    print()


Multinomial Naive Bayes:
Train Accuracy: 0.9447254335260116
Validation Accuracy: 0.8020231213872833
Test Accuracy: 0.9161849710982659
Test Recall: 0.918005540166205
Test Precision: 0.9210672595886603
Test F1: 0.9195338512763596

SVM:
Train Accuracy: 0.9626083815028902
Validation Accuracy: 0.7557803468208093
Test Accuracy: 0.921242774566474
Test Recall: 0.9343490304709141
Test Precision: 0.9163270850312415
Test F1: 0.9252503085996433

KNN:
Train Accuracy: 0.7344653179190751
Validation Accuracy: 0.6163294797687862
Test Accuracy: 0.7108381502890173
Test Recall: 0.7293628808864266
Test Precision: 0.7199890620727372
Test F1: 0.7246456584560342

Decision Tree:
Train Accuracy: 1.0
Validation Accuracy: 0.6459537572254336
Test Accuracy: 0.9291907514450867
Test Recall: 0.9409972299168975
Test Precision: 0.9246053347849755
Test F1: 0.9327292696320704

Random Forest:
Train Accuracy: 1.0
Validation Accuracy: 0.7355491329479769
Test Accuracy: 0.9471098265895954
Test Recall: 0.9626038781163435
Test P

## **Question 2 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text clustering**.

Please download the dataset using the following link: https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
(You can also use different text data which you want)

**Apply the listed clustering methods to the dataset:**
*   K-means
*   DBSCAN
*   Hierarchical clustering
*   Word2Vec
*   BERT

You can refer to the code at the following link:
https://www.kaggle.com/karthik3890/text-clustering
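
Word2Vec and BERT produce vector representations rather than cluster assignments, so a common pattern is to embed each document and then run a clustering algorithm such as K-means on the resulting vectors. The sketch below illustrates that pattern with hypothetical 2-D word vectors (`toy_vectors`) standing in for a trained model; the `embed` helper and the example documents are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D word vectors standing in for a trained Word2Vec or BERT model.
toy_vectors = {
    "battery": np.array([1.0, 0.0]),
    "charge": np.array([0.9, 0.1]),
    "screen": np.array([0.0, 1.0]),
    "display": np.array([0.1, 0.9]),
}


def embed(document, vectors=toy_vectors, dim=2):
    """Average the vectors of known words; return zeros if none are known."""
    known = [vectors[w] for w in document.lower().split() if w in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)


docs = [
    "battery charge",
    "charge battery battery",
    "screen display",
    "display screen screen",
]

# One averaged vector per document, then K-means on those vectors.
features = np.vstack([embed(d) for d in docs])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print(labels)
```

The zero-vector fallback keeps empty or out-of-vocabulary documents from raising an error during averaging, though such documents may be better filtered out before clustering.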

In [3]:
# Write your code here
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score
from gensim.models import Word2Vec
from transformers import BertTokenizer, BertModel
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load dataset
data = pd.read_csv('Amazon_Unlocked_Mobile.csv')
# For demonstration purposes, let's use a sample of the dataset
data = data.sample(frac=0.05, random_state=42)

# Preprocess text data
data.dropna(subset=['Reviews'], inplace=True)
text_data = data['Reviews'].values

# Vectorize text data
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(text_data)

# K-means clustering
kmeans = KMeans(n_clusters=5, random_state=42)
kmeans_labels = kmeans.fit_predict(X)


# DBSCAN clustering
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X)


# Hierarchical clustering
hierarchical = AgglomerativeClustering(n_clusters=5)
hierarchical_labels = hierarchical.fit_predict(X.toarray())


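# Word2Vec clustering: average each review's word vectors, then cluster with K-means
tokenized_reviews = [review.split() for review in text_data]
word2vec_model = Word2Vec(sentences=tokenized_reviews, vector_size=100, window=5, min_count=1, workers=4)
# Fall back to a zero vector for reviews with no tokens, which would otherwise raise an error
zero_vector = np.zeros(word2vec_model.wv.vector_size, dtype=np.float32)
word2vec_features = np.array([word2vec_model.wv[tokens].mean(axis=0) if tokens else zero_vector
                              for tokens in tokenized_reviews])
kmeans_word2vec = KMeans(n_clusters=5, random_state=42)
kmeans_labels_word2vec = kmeans_word2vec.fit_predict(word2vec_features)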


# BERT clustering: mean-pool each review's last hidden state, reduce with PCA, then cluster
import torch  # needed for torch.no_grad() below

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()
embeddings = []
for review in text_data:
    inputs = tokenizer(review, return_tensors="pt", padding=True, truncation=True)
    # Inference only: skip gradient tracking to save memory and time
    with torch.no_grad():
        outputs = model(**inputs)
    embeddings.append(outputs.last_hidden_state.mean(dim=1).squeeze().numpy())
embeddings = np.array(embeddings)
pca = PCA(n_components=100)
embeddings_pca = pca.fit_transform(embeddings)
kmeans_bert = KMeans(n_clusters=5, random_state=42)
kmeans_labels_bert = kmeans_bert.fit_predict(embeddings_pca)


print("K-means Silhouette Score:", silhouette_score(X, kmeans_labels))
print("DBSCAN Silhouette Score:", silhouette_score(X, dbscan_labels))
print("Hierarchical Silhouette Score:", silhouette_score(X, hierarchical_labels))
print("Word2Vec Silhouette Score:", silhouette_score(word2vec_features, kmeans_labels_word2vec))
print("BERT Silhouette Score:", silhouette_score(embeddings_pca, kmeans_labels_bert))


K-means Silhouette Score: 0.030034727314298782
DBSCAN Silhouette Score: 0.027794061251486497
Hierarchical Silhouette Score: 0.02361670863102917
Word2Vec Silhouette Score: 0.5157992
BERT Silhouette Score: 0.0997056
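
DBSCAN assigns the label -1 to points it considers noise, and `silhouette_score` treats -1 as just another cluster label, so the DBSCAN score above may not be directly comparable with the others. One hedged alternative is to score only the non-noise points, as sketched here on synthetic 2-D data. The data and the `eps` and `min_samples` values are illustrative assumptions, not tuned for the review corpus.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# Synthetic data (assumption): two tight blobs plus two far-away outliers.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.1, size=(30, 2))
blob_b = rng.normal(loc=(3.0, 3.0), scale=0.1, size=(30, 2))
outliers = np.array([[10.0, -10.0], [-10.0, 10.0]])
points = np.vstack([blob_a, blob_b, outliers])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(points)

# Keep only points that DBSCAN assigned to a real cluster.
mask = labels != -1
n_clusters = len(set(labels[mask]))
n_noise = int((~mask).sum())

# The silhouette score needs at least two clusters to be defined.
if n_clusters >= 2:
    score = silhouette_score(points[mask], labels[mask])
else:
    score = None

print(f"clusters={n_clusters}, noise points={n_noise}, silhouette={score}")
```

Reporting the number of noise points alongside the score helps show how much of the data the score actually covers.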


**In one paragraph, please compare the results of K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT.**

**Write your response here:**

.

.

.

.

.




# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.


**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the above questions):

'''
This exercise was very hard to complete: quite interesting, but the code was difficult to write.
I tried to compute the accuracy of both BERT and Word2Vec for Question 1, but the code was taking an hour to run.
For Question 2, I have displayed the silhouette score for every algorithm.




'''