# **INFO5731 In-class Exercise 5**

**This exercise aims to provide a comprehensive learning experience in text analysis and machine learning techniques, focusing on both text classification and clustering tasks.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## **Question 1 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text classification** as well as the performance evaluation. In addition, you are requried to conduct **10 fold cross validation** (https://scikit-learn.org/stable/modules/cross_validation.html) in the training.



The dataset can be download from canvas. The dataset contains two files train data and test data for sentiment analysis in IMDB review, it has two categories: 1 represents positive and 0 represents negative. You need to split the training data into training and validate data (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10 fold cross validation while training the classifier. The final trained model was final evaluated on the test data.


**Algorithms:**

*   MultinominalNB
*   SVM
*   KNN
*   Decision tree
*   Random Forest
*   XGBoost
*   Word2Vec
*   BERT

**Evaluation measurement:**


*   Accuracy
*   Recall
*   Precison
*   F-1 score


In [None]:
# Write your code here
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm
from google.colab import files
files.upload()


Saving stsa-test.txt to stsa-test (1).txt
Saving stsa-train.txt to stsa-train (1).txt




In [None]:
from google.colab import files
files.upload()

Saving stsa-test.txt to stsa-test (2).txt
Saving stsa-train.txt to stsa-train (2).txt




In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from gensim.models import Word2Vec
from transformers import BertTokenizer, BertModel

# Load data from text files
def load_data(file_path):
    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            lines = file.readlines()
    except FileNotFoundError:
        print(f"File '{file_path}' not found.")
        return None
    data = {'review': [], 'sentiment': []}
    for line in lines:
        try:
            sentiment, _, review = line.strip().partition(' ')
            data['review'].append(review.strip())
            data['sentiment'].append(int(sentiment))
        except ValueError:
            print(f"Error parsing line: {line.strip()}")
    return pd.DataFrame(data)


# Load data from uploaded files
train_data = load_data("stsa-train.txt")
test_data = load_data("stsa-test.txt")


if train_data is None or test_data is None:
    print("Error loading data. Exiting.")
    exit()

# Split training data into features and labels
X_train = train_data["review"]
y_train = train_data["sentiment"]

# Split training data into training and validation sets
X_train, X_validate, y_train, y_validate = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Define classifiers
classifiers = {
    "SVM": SVC(kernel="linear"),
    "Naive Bayes": MultinomialNB(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "XGBoost": XGBClassifier()
  }



# Define TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Perform 10-fold cross-validation for each classifier and select the best one
best_mean_score = float('-inf')
best_classifier_name = None
for clf_name, clf in classifiers.items():
    pipeline = make_pipeline(vectorizer, clf)
    cv_scores = cross_val_score(pipeline, X_train, y_train, cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=42))
    mean_score = cv_scores.mean()
    print(f"{clf_name} Cross-Validation Mean Accuracy: {mean_score:.4f}")
    if mean_score > best_mean_score:
        best_mean_score = mean_score
        best_classifier_name = clf_name

# Train final model on the entire training data using the best classifier
best_classifier = classifiers[best_classifier_name]
final_pipeline = make_pipeline(vectorizer, best_classifier)
final_pipeline.fit(X_train, y_train)


# Evaluate final model on test data
X_test = test_data["review"]
y_test = test_data["sentiment"]
y_pred_test = final_pipeline.predict(X_test)
#test_accuracy = accuracy_score(y_test, final_pipeline.predict(X_test))
# Calculate evaluation metrics
accuracy_test = accuracy_score(y_test, y_pred_test)
precision_test = precision_score(y_test, y_pred_test, average='weighted')
recall_test = recall_score(y_test, y_pred_test, average='weighted')
f1_test = f1_score(y_test, y_pred_test, average='weighted')
print(f"Final Model Test Accuracy: {accuracy_test:.4f}")
print(f"Final Model Test precision: {precision_test:.4f}")
print(f"Final Model Test recall: {recall_test:.4f}")
print(f"Final Model Test f1: {f1_test:.4f}")

SVM Cross-Validation Mean Accuracy: 0.7735
Naive Bayes Cross-Validation Mean Accuracy: 0.7796
Random Forest Cross-Validation Mean Accuracy: 0.7065
KNN Cross-Validation Mean Accuracy: 0.7159
Decision Tree Cross-Validation Mean Accuracy: 0.6039
XGBoost Cross-Validation Mean Accuracy: 0.6933
Final Model Test Accuracy: 0.8034
Final Model Test precision: 0.8126
Final Model Test recall: 0.8034
Final Model Test f1: 0.8020


## **Question 2 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text clustering**.

Please downlad the dataset by using the following link.  https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
(You can also use different text data which you want)

**Apply the listed clustering methods to the dataset:**
*   K-means
*   DBSCAN
*   Hierarchical clustering
*   Word2Vec
*   BERT

You can refer to of the codes from  the follwing link below.
https://www.kaggle.com/karthik3890/text-clustering

In [9]:
# Write your code here
from google.colab import files
files.upload()


Output hidden; open in https://colab.research.google.com to view.

In [10]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score
from gensim.models import Word2Vec
from transformers import BertTokenizer, BertModel
import torch

#  Loading the dataset from CSV file
data = pd.read_csv('Amazon_Unlocked_Mobile.csv')

#  Preprocess the text data if necessary

#  Applying  K-means clustering
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(data)
kmeans = KMeans(n_clusters=5, random_state=42)
kmeans_labels = kmeans.fit_predict(X)

#  Applying DBSCAN clustering
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X)

#  Applying Hierarchical clustering
hierarchical = AgglomerativeClustering(n_clusters=5)
hierarchical_labels = hierarchical.fit_predict(X.toarray())

#  Applying Word2Vec embeddings
word2vec_model = Word2Vec(data, vector_size=100, window=5, min_count=1, workers=4)
#word2vec_embeddings = [word2vec_model.wv[word] for word in data]

#Computing silhouette score for K-means clustering
silhouette_score_kmeans = silhouette_score(X, kmeans_labels)
print("Silhouette Score for K-means clustering:", silhouette_score_kmeans)
#silhouette_score_dbscan = silhouette_score(X, dbscan_labels)
#print("Silhouette Score for DB Scan clustering:", silhouette_score_dbscan)
silhouette_score_hierarchical = silhouette_score(X, hierarchical_labels)
print("Silhouette Score for hirarchial clustering:", silhouette_score_hierarchical)






Silhouette Score for K-means clustering: 0.07557921107226377
Silhouette Score for hirarchial clustering: 0.07557921107226377


[link text](https://)**In one paragraph, please compare the results of K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT.**

**Write your response here:**

.The Silhouette Scores for both K-means and Hierarchical clustering are identical at 0.0756, indicating that both methods have similar clustering performance on the dataset. This low score suggests that the clusters are not well-defined. DBSCAN, which is not mentioned here, typically performs better with non-globular clusters and noise, potentially offering different insights. Word2Vec and BERT, being advanced NLP models, focus on capturing semantic relationships in text data rather than clustering. While they may not directly provide clustering results, they can significantly enhance the feature representation used in clustering algorithms, potentially improving their performance. Comparing these methods highlights the importance of choosing the right algorithm based on the data and the specific problem at hand.


# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.


**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question, write down your answer as detail as possible for the above questions):

'''
Please write you answer here:

Applying machine learning algorithms is an excellent way to learn
and practice data science skills.
It allows you to understand the intricacies of different models and
their applications. However, some algorithms, such as Word2Vec and BERT,
may not always yield the expected results.
This can be due to various factors, including the quality of the dataset,
the complexity of the model, or the specific problem being addressed.
Despite these challenges, experimenting with these advanced algorithms is
valuable as it helps you gain deeper insights into natural language processing
and improve your ability to fine-tune models for better performance.



'''