# **INFO5731 In-class Exercise 5**

**This exercise aims to provide a comprehensive learning experience in text analysis and machine learning techniques, focusing on both text classification and clustering tasks.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing rights from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## **Question 1 (20 Points)**

The purpose of this question is to practice different machine learning algorithms for **text classification** and to evaluate their performance. In addition, you are required to conduct **10-fold cross-validation** (https://scikit-learn.org/stable/modules/cross_validation.html) during training.



The dataset can be downloaded from Canvas. It contains two files, training data and test data, for sentiment analysis of IMDB reviews, with two categories: 1 represents positive and 0 represents negative. You need to split the training data into training and validation sets (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10-fold cross-validation while training the classifier. The final trained model is then evaluated on the test data.
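
As a rough illustration of the splitting scheme described above, the following library-free sketch shows only the index bookkeeping: an 80/20 hold-out split followed by 10 folds over the training portion. The helper names `split_train_val` and `k_fold_indices` and the sample count of 1000 are assumptions made for this example. In a real solution, scikit-learn's `train_test_split` and `cross_val_score` would normally do this work.

```python
import random

def split_train_val(n_samples, val_fraction=0.2, seed=42):
    """Shuffle sample indices and hold out a validation fraction."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(round(n_samples * val_fraction))
    return indices[n_val:], indices[:n_val]

def k_fold_indices(indices, k=10):
    """Yield (fit_indices, held_out_indices) pairs for k-fold cross-validation."""
    # Spread any remainder over the first folds so fold sizes differ by at most one
    fold_sizes = [len(indices) // k + (1 if i < len(indices) % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        held_out = indices[start:start + size]
        fit = indices[:start] + indices[start + size:]
        yield fit, held_out
        start += size

# Hypothetical corpus of 1000 reviews: 800 for training, 200 for validation
train_idx, val_idx = split_train_val(n_samples=1000)
folds = list(k_fold_indices(train_idx, k=10))
print(len(train_idx), len(val_idx), len(folds))  # 800 200 10
print(len(folds[0][0]), len(folds[0][1]))        # 720 80
```

Each training sample is held out exactly once across the 10 folds. The validation and test sets stay untouched until model selection is finished.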


**Algorithms:**

*   MultinomialNB
*   SVM
*   KNN
*   Decision tree
*   Random Forest
*   XGBoost
*   Word2Vec
*   BERT

**Evaluation measurement:**


*   Accuracy
*   Recall
*   Precision
*   F-1 score
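
To make the four measurements listed above concrete, here is a minimal sketch that computes them from confusion-matrix counts for a binary task. The function name `binary_metrics` and the toy labels are assumptions for illustration. The functions `accuracy_score`, `recall_score`, `precision_score`, and `f1_score` from `sklearn.metrics` should give the same values on these inputs.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators (no predicted or no actual positives)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels: 3 true positives, 1 false negative, 1 false positive, 3 true negatives
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Precision is the share of predicted positives that are correct. Recall is the share of actual positives that are found. F1 is their harmonic mean.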


In [None]:
# Write your code here
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from gensim.models import Word2Vec
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

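# Load the dataset (assuming the files have no header row, header=None keeps
# the first review as data instead of consuming it as the column name)
train_data = pd.read_csv('stsa-train.txt', delimiter='\t', header=None)
test_data = pd.read_csv('stsa-test.txt', delimiter='\t', header=None)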

# Extract labels and text from the single column in train_data
train_data['Label'] = train_data.iloc[:, 0].apply(lambda x: int(x.split()[0]))
train_data['Text'] = train_data.iloc[:, 0].apply(lambda x: ' '.join(x.split()[1:]))

# Extract labels and text from the single column in test_data
test_data['Label'] = test_data.iloc[:, 0].apply(lambda x: int(x.split()[0]))
test_data['Text'] = test_data.iloc[:, 0].apply(lambda x: ' '.join(x.split()[1:]))

# Drop the original columns
train_data.drop(columns=[train_data.columns[0]], inplace=True)
test_data.drop(columns=[test_data.columns[0]], inplace=True)

# Split the training data into features and labels
X_train, X_val, y_train, y_val = train_test_split(train_data['Text'].values,
                                                  train_data['Label'].values,
                                                  test_size=0.2,
                                                  random_state=42)

# Split the test data into features and labels
X_test = test_data['Text'].values
y_test = test_data['Label'].values

# Initialize Word2Vec model
word2vec_model = Word2Vec(vector_size=100, window=5, min_count=1, workers=4)

# Define a function to preprocess text data
def preprocess_text(text):
    # You can implement your text preprocessing steps here
    # For now, let's just return the text as is
    return text

# Perform preprocessing
X_train_preprocessed = [preprocess_text(text) for text in X_train]
X_val_preprocessed = [preprocess_text(text) for text in X_val]
X_test_preprocessed = [preprocess_text(text) for text in X_test]

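# Train Word2Vec model. gensim expects each sentence as a list of tokens;
# passing raw strings would build a vocabulary of single characters instead.
tokenized_train = [text.split() for text in X_train_preprocessed]
word2vec_model.build_vocab(tokenized_train)
word2vec_model.train(tokenized_train, total_examples=word2vec_model.corpus_count, epochs=10)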

# Average the Word2Vec vectors of the in-vocabulary words of each text
def embed_texts(texts, model, dim=100):
    features = []
    for text in texts:
        vectors = [model.wv[word] for word in text.split() if word in model.wv]
        # Fall back to a zero vector when no word is in the vocabulary
        features.append(np.mean(vectors, axis=0) if vectors else np.zeros(dim))
    return np.array(features)

# Extract Word2Vec embeddings for train, validation, and test data
X_train_word2vec = embed_texts(X_train_preprocessed, word2vec_model)
X_val_word2vec = embed_texts(X_val_preprocessed, word2vec_model)
X_test_word2vec = embed_texts(X_test_preprocessed, word2vec_model)

# Initialize classifiers
classifiers = {
    'SVM': SVC(),
    'KNN': KNeighborsClassifier(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'XGBoost': XGBClassifier()
}

# Initialize evaluation metrics
evaluation_metrics = {
    'Accuracy': accuracy_score,
    'Recall': recall_score,
    'Precision': precision_score,
    'F1 Score': f1_score
}

# Train and evaluate classifiers
results_word2vec = {}

for clf_name, clf in classifiers.items():
    print(f"Evaluating {clf_name} with Word2Vec embeddings...")
    clf.fit(X_train_word2vec, y_train)
    y_pred = clf.predict(X_val_word2vec)
    clf_results = {}
    for metric_name, metric_func in evaluation_metrics.items():
        clf_results[metric_name] = metric_func(y_val, y_pred)
    results_word2vec[clf_name] = clf_results

# Print results
print("\nEvaluation Results with Word2Vec embeddings:")
for clf_name, clf_result in results_word2vec.items():
    print(f"\nClassifier: {clf_name}")
    for metric_name, metric_value in clf_result.items():
        print(f"{metric_name}: {metric_value}")




Evaluating SVM with Word2Vec embeddings...
Evaluating KNN with Word2Vec embeddings...
Evaluating Decision Tree with Word2Vec embeddings...
Evaluating Random Forest with Word2Vec embeddings...
Evaluating XGBoost with Word2Vec embeddings...

Evaluation Results with Word2Vec embeddings:

Classifier: SVM
Accuracy: 0.536849710982659
Recall: 0.7387140902872777
Precision: 0.5454545454545454
F1 Score: 0.6275421266705404

Classifier: KNN
Accuracy: 0.5086705202312138
Recall: 0.35294117647058826
Precision: 0.5548387096774193
F1 Score: 0.43143812709030105

Classifier: Decision Tree
Accuracy: 0.5390173410404624
Recall: 0.6976744186046512
Precision: 0.5501618122977346
F1 Score: 0.6151990349819059

Classifier: Random Forest
Accuracy: 0.5411849710982659
Recall: 0.7086183310533516
Precision: 0.551063829787234
F1 Score: 0.6199880311190904

Classifier: XGBoost
Accuracy: 0.5361271676300579
Recall: 0.7045143638850889
Precision: 0.5472901168969182
F1 Score: 0.6160287081339714


## **Question 2 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text clustering**.

Please download the dataset using the following link: https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
(You can also use any other text data you want.)

**Apply the listed clustering methods to the dataset:**
*   K-means
*   DBSCAN
*   Hierarchical clustering
*   Word2Vec
*   BERT

You can refer to the code at the following link:
https://www.kaggle.com/karthik3890/text-clustering
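
For orientation, K-means (the first method listed above) can be sketched without any library as Lloyd's algorithm: assign each point to its nearest centroid, then move each centroid to the mean of its points. This is a minimal sketch on hypothetical 2-D points, not the scikit-learn `KMeans` implementation. The names `k_means` and `squared_distance` are assumptions for this example. For text, the points would be TF-IDF or embedding vectors.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, n_iter=20, seed=42):
    """Lloyd's algorithm: alternate between assigning points and moving centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct points as initial centroids
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: label each point with the index of its nearest centroid
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids

# Two well-separated toy groups of points
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.9, 5.3)]
labels, centroids = k_means(points, k=2)
print(labels)
```

The cluster numbering itself is arbitrary; only which points share a label is meaningful.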

In [None]:
# Write your code here
!pip install sentence-transformers
# Import necessary libraries
import pandas as pd
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from gensim.models import Word2Vec
from sentence_transformers import SentenceTransformer

# Download NLTK data
nltk.download('punkt')

# Load the CSV file containing movie reviews
data = pd.read_csv("movie_reviews_with_sentiment.csv")

# Drop rows with missing values (NaN) in the 'clean_text' column
data.dropna(subset=['clean_text'], inplace=True)

# Tokenize the text data (the token lists are used to train Word2Vec below)
preprocessed_text = []

for review in data['clean_text']:
    # Tokenize the review into words
    tokens = nltk.word_tokenize(review)
    preprocessed_text.append(tokens)

# Initialize TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, min_df=5, stop_words='english')

# Transform text data into TF-IDF matrix
tfidf_matrix = tfidf_vectorizer.fit_transform(data['clean_text'].astype(str))

# Perform clustering using different algorithms

# 1. K-means clustering with a different number of clusters
kmeans_model = KMeans(n_clusters=3, random_state=42)  # Change the number of clusters
kmeans_clusters = kmeans_model.fit_predict(tfidf_matrix)

# 2. DBSCAN clustering with different parameters
dbscan_model = DBSCAN(eps=1.0, min_samples=10)  # Change epsilon and min_samples
dbscan_clusters = dbscan_model.fit_predict(tfidf_matrix)

# 3. Hierarchical clustering with different linkage method
hierarchical_model = AgglomerativeClustering(n_clusters=5, linkage='ward')  # Change the linkage method
hierarchical_clusters = hierarchical_model.fit_predict(tfidf_matrix.toarray())

# 4. Word2Vec clustering with K-means (clusters the word vectors, so there is one label per vocabulary word, not per review)
word2vec_model = Word2Vec(preprocessed_text, vector_size=100, window=5, min_count=5, workers=4)
word_vectors = word2vec_model.wv
kmeans_word2vec = KMeans(n_clusters=5, random_state=42)
kmeans_word2vec.fit(word_vectors.vectors)
word2vec_clusters = kmeans_word2vec.labels_

# 5. BERT clustering: K-means on Sentence-BERT sentence embeddings
sentences = data['clean_text'].tolist()
sentence_transformer_model = SentenceTransformer('bert-base-nli-mean-tokens')
bert_embeddings = sentence_transformer_model.encode(sentences)
bert_kmeans_model = KMeans(n_clusters=5, random_state=42)
bert_clusters = bert_kmeans_model.fit_predict(bert_embeddings)

# Print clustering results
print("K-means Clustering Results:")
print(kmeans_clusters)

print("\nDBSCAN Clustering Results:")
print(dbscan_clusters)

print("\nHierarchical Clustering Results:")
print(hierarchical_clusters)

print("\nWord2Vec Clustering Results:")
print(word2vec_clusters)

print("\nBERT Clustering Results:")
print(bert_clusters)





K-means Clustering Results:
[1 0 0 1 2 1 1 2 1 2 1 2 2 2 2 2 2 2 2 1 2 1 1 1 2 1 0 0 1 2 1 1 2 1 2 1 2
 2 2 2 2 2 2 2 1 2 1 1 1 2 1 0 0 1 2 1 1 2 1 2 1 2 2 2 2 2 2 2 2 1 2 1 1 1
 2 1 0 0 1 2 1 1 2 1 2 1 2 2 2 2 2 2 2 2 1 2 1 1 1 2 1 0 0 1 2 1 1 2 1 2 1
 2 2 2 2 2 2 2 2 1 2 1 1 1 2 1 0 0 1 2 1 1 2 1 2 1 2 2 2 2 2 2 2 2 1 2 1 1
 1 2 1 0 0 1 2 1 1 2 1 2 1 2 2 2 2 2 2 2 2 1 2 1 1 1 2 1 0 0 1 2 1 1 2 1 2
 1 2 2 2 2 2 2 2 2 1 2 1 1 1 2 1 0 0 1 2 1 1 2 1 2 1 2 2 2 2 2 2 2 2 1 2 1
 1 1 2 1 0 0 1 2 1 1 2 1 2 1 2 2 2 2 2 2 2 2 1 2 1 1 1 2 1 0 0 1 2 1 1 2 1
 2 1 2 2 2 2 2 2 2 2 1 2 1 1 1 2 1 0 0 1 2 1 1 2 1 2 1 2 2 2 2 2 2 2 2 1 2
 1 1 1 2 1 0 0 1 2 1 1 2 1 2 1 2 2 2 2 2 2 2 2 1 2 1 1 1 2 1 0 0 1 2 1 1 2
 1 2 1 2 2 2 2 2 2 2 2 1 2 1 1 1 2 1 0 0 1 2 1 1 2 1 2 1 2 2 2 2 2 2 2 2 1
 2 1 1 1 2 1 0 0 1 2 1 1 2 1 2 1 2 2 2 2 2 2 2 2 1 2 1 1 1 2 1 0 0 1 2 1 1
 2 1 2 1 2 2 2 2 2 2 2 2 1 2 1 1 1 2 1 0 0 1 2 1 1 2 1 2 1 2 2 2 2 2 2 2 2
 1 2 1 1 1 2 1 0 0 1 2 1 1 2 1 2 1 2 2 2 2 2 2 2 2 1 2 1 1 1 2 1 0 0 1 2

**In one paragraph, please compare the results of K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT.**

**Write your response here:**

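K-means partitions the TF-IDF vectors into a fixed number of clusters (three here), so every review receives a label. DBSCAN instead groups points by density and can mark sparse points as noise, so it does not need the number of clusters in advance. Hierarchical (agglomerative) clustering merges reviews bottom-up, and the resulting merge tree can be shown as a dendrogram. The Word2Vec step clusters the word embeddings themselves, so its labels describe groups of words rather than reviews, while the BERT step clusters sentence embeddings that capture context. Together, these techniques range from simple partitioning to grouping by meaning and context.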
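
One way such a comparison could be made quantitative is the silhouette coefficient, which rewards points that are close to their own cluster and far from the nearest other cluster. The sketch below is a library-free version on hypothetical 2-D points; the function name and the toy data are assumptions. With real features, `sklearn.metrics.silhouette_score` could be applied to each method's labels.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient; assumes at least two distinct clusters."""
    scores = []
    for i, p in enumerate(points):
        # Collect distances from point i to every other point, grouped by cluster
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.get(labels[i])
        if not own:  # a singleton cluster scores 0 by convention
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance within the point's own cluster
        b = min(                 # mean distance to the nearest other cluster
            sum(d) / len(d) for label, d in by_cluster.items() if label != labels[i]
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.9, 5.3)]
good = silhouette_score(points, [0, 0, 0, 1, 1, 1])  # labels follow the two groups
bad = silhouette_score(points, [0, 1, 0, 1, 0, 1])   # labels cut across the groups
print(good > bad)
```

Scores near 1 indicate compact, well-separated clusters, so a higher score for one method's labels suggests a cleaner partition of the same feature space.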





# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.


**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the above questions):

'''
The code takes too long to run, and I was unable to run it in full. The BERT model takes the most time, and we are not
given enough time to do the exercises.
I also have another class today (Friday), and the exercise was released on Thursday, so I did not have sufficient time to
finish it.

Besides that, when I tried to run the first code cell it took hours and the system crashed multiple times, so I removed
BERT and went forward with the other features.

For the second question, the dataset is too big to preprocess and to run the clustering methods on.

I later used my own dataset to run the code.

Please take all of this into consideration. Thank you.
'''