
# **INFO5731 In-class Exercise 5**

**This exercise aims to provide a comprehensive learning experience in text analysis and machine learning techniques, focusing on both text classification and clustering tasks.**

***Please use the text corpus you collected in your last in-class exercise for this exercise. Perform the following tasks.***

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## **Question 1 (20 Points)**

The purpose of this question is to practice different machine learning algorithms for **text classification**, as well as performance evaluation. In addition, you are required to conduct **10-fold cross-validation** (https://scikit-learn.org/stable/modules/cross_validation.html) during training.



The dataset can be downloaded from Canvas. It contains two files, training data and test data, for sentiment analysis of IMDB reviews, with two categories: 1 represents positive and 0 represents negative. You need to split the training data into training and validation sets (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10-fold cross-validation while training the classifier. The final trained model should then be evaluated on the test data.
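
The split-then-cross-validate workflow described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the toy `reviews` and `labels` are hypothetical stand-ins for the Canvas files, and `MultinomialNB` is just one of the listed algorithms. Wrapping the vectorizer in a pipeline is a design choice, so that each fold builds its vocabulary from its own training part only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the IMDB training file (1 = positive, 0 = negative).
positive = [f"great movie number {i} loved the acting" for i in range(20)]
negative = [f"terrible movie number {i} hated the plot" for i in range(20)]
reviews = positive + negative
labels = [1] * 20 + [0] * 20

# 80% training / 20% validation, stratified so both classes keep their proportions.
X_tr, X_val, y_tr, y_val = train_test_split(
    reviews, labels, test_size=0.2, stratify=labels, random_state=42
)

# The pipeline refits the vectorizer inside every fold, so held-out text never shapes the vocabulary.
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())

# 10-fold cross-validation on the training portion only.
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
cv_scores = cross_val_score(pipeline, X_tr, y_tr, cv=folds, scoring="accuracy")

# Fit once on the whole training portion, then check against the validation split.
pipeline.fit(X_tr, y_tr)
val_accuracy = pipeline.score(X_val, y_val)

print(f"CV accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
print(f"Validation accuracy: {val_accuracy:.4f}")
```

The test file would be scored only once, after a model has been chosen from the cross-validation and validation results.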


**Algorithms:**

*   MultinomialNB
*   SVM
*   KNN
*   Decision tree
*   Random Forest
*   XGBoost
*   Word2Vec
*   BERT

**Evaluation measurement:**


*   Accuracy
*   Recall
*   Precision
*   F1 score
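
As a small worked example of the four measurements listed above, the sketch below scores a hypothetical set of eight predictions. The `y_true` and `y_pred` values are made up for illustration; the expected fractions in the comments follow from the confusion counts.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels for eight reviews (1 = positive, 0 = negative).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

# Confusion counts for these lists: TP = 2, FN = 2, FP = 1, TN = 3.
accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total = 5/8
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2/3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2/4
f1 = f1_score(y_true, y_pred)                # 2TP / (2TP + FP + FN) = 4/7

print(f"Accuracy: {accuracy:.4f}, Precision: {precision:.4f}, Recall: {recall:.4f}, F1: {f1:.4f}")
```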


In [3]:
# Write your code here
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from gensim.models import Word2Vec
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import numpy as np
import warnings

# Ignore warnings
warnings.filterwarnings("ignore")

# Load data
train_data = pd.read_csv("train_data.csv")
test_data = pd.read_csv("test_data.csv")

# Splitting data into features and labels
X_train = train_data['review']
y_train = train_data['sentiment']
X_test = test_data['review']
y_test = test_data['sentiment']

# Vectorize text data using Bag of Words (CountVectorizer)
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Algorithms
models = {
    "MultinomialNB": MultinomialNB(),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "DecisionTree": DecisionTreeClassifier(),
    "RandomForest": RandomForestClassifier(),
    "XGBoost": XGBClassifier()
}

# Perform 10-fold cross-validation and evaluate each model
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_train_vec, y_train, cv=10, scoring='accuracy')
    results[name] = scores
    print(f"{name}: Mean Accuracy: {np.mean(scores):.4f}, Std Dev: {np.std(scores):.4f}")

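# Train and evaluate Word2Vec model
# Note: gensim >= 4.0 names this parameter `vector_size`; older releases called it `size`.
word2vec_model = Word2Vec(sentences=[review.split() for review in X_train], min_count=1, vector_size=100)

def average_vector(review):
    # Average the vectors of in-vocabulary words; words unseen during training are skipped
    vectors = [word2vec_model.wv[word] for word in review.split() if word in word2vec_model.wv]
    # Fall back to a zero vector when a review has no known words
    return np.mean(vectors, axis=0) if vectors else np.zeros(word2vec_model.wv.vector_size)

X_train_word2vec = np.array([average_vector(review) for review in X_train])
X_test_word2vec = np.array([average_vector(review) for review in X_test])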

word2vec_classifier = RandomForestClassifier()
word2vec_classifier.fit(X_train_word2vec, y_train)
word2vec_pred = word2vec_classifier.predict(X_test_word2vec)

word2vec_accuracy = accuracy_score(y_test, word2vec_pred)
word2vec_precision = precision_score(y_test, word2vec_pred)
word2vec_recall = recall_score(y_test, word2vec_pred)
word2vec_f1 = f1_score(y_test, word2vec_pred)

print(f"Word2Vec Model: Accuracy: {word2vec_accuracy:.4f}, Precision: {word2vec_precision:.4f}, Recall: {word2vec_recall:.4f}, F1 Score: {word2vec_f1:.4f}")

# TODO: train and evaluate a BERT model (not implemented in this cell; assumes a pretrained BERT model is imported separately)

# Evaluation on test data
# Note: Evaluation on test data should be done only once after selecting the best model.
# If you want to evaluate all models on test data, you can move this part inside the loop above.

best_model = max(results, key=lambda k: np.mean(results[k]))
best_classifier = models[best_model]
best_classifier.fit(X_train_vec, y_train)
best_pred = best_classifier.predict(X_test_vec)

accuracy = accuracy_score(y_test, best_pred)
precision = precision_score(y_test, best_pred)
recall = recall_score(y_test, best_pred)
f1 = f1_score(y_test, best_pred)

print(f"Best Model ({best_model}): Accuracy: {accuracy:.4f}, Precision: {precision:.4f}, Recall: {recall:.4f}, F1 Score: {f1:.4f}")




FileNotFoundError: [Errno 2] No such file or directory: 'train_data.csv'
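
The error above means `train_data.csv` was not in the working directory when the cell ran. One possible guard, sketched below, is to check for the file before reading it. The helper name `load_reviews` is hypothetical, and the demo writes a tiny two-row CSV to a temporary directory purely so the example runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_reviews(path):
    """Read a CSV of reviews, failing early with a hint when the file is absent."""
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(
            f"{csv_path.resolve()} does not exist; upload the file to the runtime or adjust the path."
        )
    return pd.read_csv(csv_path)


# Demo with a throwaway file that uses the same column names as the cell above.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "train_data.csv"
    demo_path.write_text("review,sentiment\ngood film,1\nbad film,0\n")
    frame = load_reviews(demo_path)

print(frame.shape)
```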

## **Question 2 (20 Points)**

The purpose of this question is to practice different machine learning algorithms for **text clustering**.

Please download the dataset using the following link: https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
(You can also use different text data of your choice.)

**Apply the listed clustering methods to the dataset:**
*   K-means
*   DBSCAN
*   Hierarchical clustering
*   Word2Vec
*   BERT

You can refer to the code at the following link:
https://www.kaggle.com/karthik3890/text-clustering
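
As a starting point, the three clustering algorithms in the list above might be applied to TF-IDF features roughly as sketched here. Everything specific is an assumption: the nine `reviews` are a hypothetical stand-in for the Amazon data, and `n_clusters=3`, `eps=0.9`, and `min_samples=2` are illustrative values that would need tuning on the real corpus. Word2Vec or BERT are not shown; their document vectors could replace `X_dense` as the feature matrix.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical mini-corpus standing in for the Amazon phone reviews.
reviews = [
    "battery life is great and lasts all day",
    "great battery, lasts two days on a charge",
    "battery drains fast, poor battery life",
    "screen cracked after one week",
    "the screen is bright and sharp",
    "screen stopped responding to touch",
    "camera takes sharp photos",
    "camera photos are blurry at night",
    "love the camera and the photos",
]

# TF-IDF features, densified because AgglomerativeClustering needs a dense array.
X_dense = TfidfVectorizer(stop_words="english").fit_transform(reviews).toarray()

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_dense)
dbscan_labels = DBSCAN(eps=0.9, min_samples=2, metric="cosine").fit_predict(X_dense)
hier_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X_dense)


def safe_silhouette(features, labels):
    """Return the silhouette score, or None when the labelling makes it undefined."""
    n_labels = len(set(labels.tolist()))
    if n_labels < 2 or n_labels >= features.shape[0]:
        return None
    # DBSCAN marks noise as -1; this simple version scores noise as its own group.
    return float(silhouette_score(features, labels))


for name, labels in [("K-means", kmeans_labels), ("DBSCAN", dbscan_labels), ("Hierarchical", hier_labels)]:
    print(name, labels.tolist(), safe_silhouette(X_dense, labels))
```

The silhouette score is one common way to put the methods on a shared scale for the comparison paragraph, though it favours compact, well-separated clusters and should be read alongside the cluster contents.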

In [None]:
# Write your code here


**In one paragraph, please compare the results of K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT.**

**Write your response here:**

.

.

.

.

.




# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.


**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer to the above questions in as much detail as possible):

'''
Please write your answer here: This exercise was tough, but I should keep learning and improving in order to complete it. This type of exercise makes me more interested in learning.





'''