# **INFO5731 In-class Exercise 5**

**This exercise aims to provide a comprehensive learning experience in text analysis and machine learning techniques, focusing on both text classification and clustering tasks.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean", *i.e.*, contains no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## **Question 1 (20 Points)**

The purpose of this question is to practice different machine learning algorithms for **text classification**, as well as their performance evaluation. In addition, you are required to conduct **10-fold cross-validation** (https://scikit-learn.org/stable/modules/cross_validation.html) during training.



The dataset can be downloaded from Canvas. It contains two files, training data and test data, for sentiment analysis of IMDB reviews, with two categories: 1 represents positive and 0 represents negative. You need to split the training data into training and validation sets (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10-fold cross-validation while training the classifier. The final trained model is then evaluated on the test data.
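
The split-then-cross-validate workflow described above can be sketched as follows. This is a minimal illustration under assumptions rather than the required solution: the toy reviews, the `make_pipeline` wrapper, and the choice of `MultinomialNB` are placeholders. The 10-fold cross-validation runs on the 80% training portion, and the 20% validation portion stays untouched until the final check.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus (hypothetical): 20 positive and 20 negative one-line reviews.
pos_words = ["great", "wonderful", "moving", "funny"]
neg_words = ["dull", "boring", "tedious", "awful"]
texts = [f"a {w} film number {i}" for i in range(5) for w in pos_words + neg_words]
labels = [1 if w in pos_words else 0 for i in range(5) for w in pos_words + neg_words]

# 80% training / 20% validation, keeping the class balance in both parts.
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# Vectorizer and classifier in one pipeline, so TF-IDF is refit inside each fold.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# 10-fold cross-validation on the training portion only.
cv_scores = cross_val_score(model, X_train, y_train, cv=10, scoring="accuracy")
print("Mean CV accuracy:", cv_scores.mean())

# Fit on the full training portion, then check on the held-out validation portion.
model.fit(X_train, y_train)
val_accuracy = accuracy_score(y_val, model.predict(X_val))
print("Validation accuracy:", val_accuracy)
```

Wrapping the vectorizer and classifier in one pipeline means the TF-IDF vocabulary is rebuilt inside each fold, so no fold is scored with statistics taken from its own held-out part.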


**Algorithms:**

*   MultinomialNB
*   SVM
*   KNN
*   Decision tree
*   Random Forest
*   XGBoost
*   Word2Vec
*   BERT

**Evaluation measurement:**


*   Accuracy
*   Recall
*   Precision
*   F1 score
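
The four measures listed above can all be derived from the counts of true/false positives and negatives. The sketch below computes them by hand for binary labels; the function name `binary_metrics` and the toy label lists are hypothetical and only serve the illustration.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    # Guard against division by zero, e.g. when the positive class is never predicted.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
print(binary_metrics(y_true, y_pred))  # all four values are 0.75 for this toy example
```

Returning 0.0 when a denominator is zero matches the default behaviour of scikit-learn's metric functions, which emit a warning and report 0.0 in that case. A classifier that predicts only the negative class therefore gets precision, recall, and F1 of 0.0 even if its accuracy is around 50%.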


In [None]:
import pandas as pd
import re
from sklearn.model_selection import train_test_split
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import string
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.svm import SVC

from gensim.models import Word2Vec
from sklearn.preprocessing import LabelEncoder
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense, Flatten, LSTM, Conv1D, MaxPooling1D, Dropout
from keras.optimizers import Adam

from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np


from transformers import BertTokenizer, BertForSequenceClassification
from torch.utils.data import DataLoader, TensorDataset
import torch

nltk.download('stopwords')
nltk.download('punkt')
stop_words = set(stopwords.words('english'))

imdb_train = pd.read_csv('/content/stsa-train.txt', delimiter = '\t', header = None)
imdb_test = pd.read_csv('/content/stsa-test.txt', delimiter = '\t', header = None)
imdb_train.columns = ['original_data']
imdb_test.columns = ['original_data']

# Define a function to extract sentiment
def extract_sentiment(text):
    sentiment = re.match(r'^(\d+)\s', text).group(1)
    return int(sentiment)

# Extract sentiment and text
imdb_train['sentiment'] = imdb_train['original_data'].apply(extract_sentiment)
imdb_train['text'] = imdb_train['original_data'].apply(lambda x: re.sub(r'^\d+\s', '', x))

# Extract sentiment and text
imdb_test['sentiment'] = imdb_test['original_data'].apply(extract_sentiment)
imdb_test['text'] = imdb_test['original_data'].apply(lambda x: re.sub(r'^\d+\s', '', x))



# Split the data into 80% train and 20% validation
train_data, val_data = train_test_split(imdb_train, test_size=0.2, random_state=42)



ps = PorterStemmer()

def preprocess_text(text):
    # Tokenize the text
    tokens = word_tokenize(text)
    # Remove punctuation, special characters, and stop words
    tokens = [ps.stem(word.lower()) for word in tokens if word.isalpha() and word.lower() not in stop_words]
    # Join the tokens back into a string
    cleaned_text = ' '.join(tokens)
    return cleaned_text

# Apply preprocessing to the text column of train_data and val_data
train_data['clean_text'] = train_data['text'].apply(preprocess_text)
val_data['clean_text'] = val_data['text'].apply(preprocess_text)
imdb_test['clean_text'] = imdb_test['text'].apply(preprocess_text)

# Initialize the TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the training data
X_train_tfidf = vectorizer.fit_transform(train_data['clean_text'])

# Transform the validation and test data
X_val_tfidf = vectorizer.transform(val_data['clean_text'])
X_test_tfidf = vectorizer.transform(imdb_test['clean_text'])


# Initialize Multinomial Naive Bayes model
nb_model = MultinomialNB()

# Perform 10-fold cross-validation on the validation data
cv_scores = cross_val_score(nb_model, X_val_tfidf, val_data['sentiment'], cv=10, scoring='accuracy')

# Print cross-validation scores
print("Cross-Validation Scores:")
print(cv_scores)
print("Mean CV Accuracy:", cv_scores.mean())

# Train the model on the entire validation data
nb_model.fit(X_val_tfidf, val_data['sentiment'])

# Predict sentiment on the test data
test_predictions = nb_model.predict(X_test_tfidf)

# Compute evaluation metrics
accuracy = accuracy_score(imdb_test['sentiment'], test_predictions)
precision = precision_score(imdb_test['sentiment'], test_predictions)
recall = recall_score(imdb_test['sentiment'], test_predictions)
f1 = f1_score(imdb_test['sentiment'], test_predictions)

# Print evaluation metrics
print("\nEvaluation Metrics on Test Data:")
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)

# Initialize Support Vector Machine model
svm_model = SVC()

# Perform 10-fold cross-validation on the validation data
cv_scores_svm = cross_val_score(svm_model, X_val_tfidf, val_data['sentiment'], cv=10, scoring='accuracy')

# Print cross-validation scores
print("Cross-Validation Scores:")
print(cv_scores_svm)
print("Mean CV Accuracy:", cv_scores_svm.mean())

# Train the model on the entire validation data
svm_model.fit(X_val_tfidf, val_data['sentiment'])

# Predict sentiment on the test data
test_predictions_svm = svm_model.predict(X_test_tfidf)

# Compute evaluation metrics
accuracy_svm = accuracy_score(imdb_test['sentiment'], test_predictions_svm)
precision_svm = precision_score(imdb_test['sentiment'], test_predictions_svm)
recall_svm = recall_score(imdb_test['sentiment'], test_predictions_svm)
f1_svm = f1_score(imdb_test['sentiment'], test_predictions_svm)

# Print evaluation metrics
print("\nEvaluation Metrics on Test Data (SVM):")
print("Accuracy:", accuracy_svm)
print("Precision:", precision_svm)
print("Recall:", recall_svm)
print("F1 Score:", f1_svm)

from sklearn.neighbors import KNeighborsClassifier

# Initialize K-Nearest Neighbors model
knn_model = KNeighborsClassifier()

# Perform 10-fold cross-validation on the validation data
cv_scores_knn = cross_val_score(knn_model, X_val_tfidf, val_data['sentiment'], cv=10, scoring='accuracy')

# Print cross-validation scores
print("Cross-Validation Scores:")
print(cv_scores_knn)
print("Mean CV Accuracy:", cv_scores_knn.mean())

# Train the model on the entire validation data
knn_model.fit(X_val_tfidf, val_data['sentiment'])

# Predict sentiment on the test data
test_predictions_knn = knn_model.predict(X_test_tfidf)

# Compute evaluation metrics
accuracy_knn = accuracy_score(imdb_test['sentiment'], test_predictions_knn)
precision_knn = precision_score(imdb_test['sentiment'], test_predictions_knn)
recall_knn = recall_score(imdb_test['sentiment'], test_predictions_knn)
f1_knn = f1_score(imdb_test['sentiment'], test_predictions_knn)

# Print evaluation metrics
print("\nEvaluation Metrics on Test Data (KNN):")
print("Accuracy:", accuracy_knn)
print("Precision:", precision_knn)
print("Recall:", recall_knn)
print("F1 Score:", f1_knn)

from sklearn.tree import DecisionTreeClassifier

# Initialize Decision Tree model
dt_model = DecisionTreeClassifier()

# Perform 10-fold cross-validation on the validation data
cv_scores_dt = cross_val_score(dt_model, X_val_tfidf, val_data['sentiment'], cv=10, scoring='accuracy')

# Print cross-validation scores
print("Cross-Validation Scores:")
print(cv_scores_dt)
print("Mean CV Accuracy:", cv_scores_dt.mean())

# Train the model on the entire validation data
dt_model.fit(X_val_tfidf, val_data['sentiment'])

# Predict sentiment on the test data
test_predictions_dt = dt_model.predict(X_test_tfidf)

# Compute evaluation metrics
accuracy_dt = accuracy_score(imdb_test['sentiment'], test_predictions_dt)
precision_dt = precision_score(imdb_test['sentiment'], test_predictions_dt)
recall_dt = recall_score(imdb_test['sentiment'], test_predictions_dt)
f1_dt = f1_score(imdb_test['sentiment'], test_predictions_dt)

# Print evaluation metrics
print("\nEvaluation Metrics on Test Data (Decision Tree):")
print("Accuracy:", accuracy_dt)
print("Precision:", precision_dt)
print("Recall:", recall_dt)
print("F1 Score:", f1_dt)


from sklearn.ensemble import RandomForestClassifier

# Initialize Random Forest model
rf_model = RandomForestClassifier()

# Perform 10-fold cross-validation on the validation data
cv_scores_rf = cross_val_score(rf_model, X_val_tfidf, val_data['sentiment'], cv=10, scoring='accuracy')

# Print cross-validation scores
print("Cross-Validation Scores:")
print(cv_scores_rf)
print("Mean CV Accuracy:", cv_scores_rf.mean())

# Train the model on the entire validation data
rf_model.fit(X_val_tfidf, val_data['sentiment'])

# Predict sentiment on the test data
test_predictions_rf = rf_model.predict(X_test_tfidf)

# Compute evaluation metrics
accuracy_rf = accuracy_score(imdb_test['sentiment'], test_predictions_rf)
precision_rf = precision_score(imdb_test['sentiment'], test_predictions_rf)
recall_rf = recall_score(imdb_test['sentiment'], test_predictions_rf)
f1_rf = f1_score(imdb_test['sentiment'], test_predictions_rf)

# Print evaluation metrics
print("\nEvaluation Metrics on Test Data (Random Forest):")
print("Accuracy:", accuracy_rf)
print("Precision:", precision_rf)
print("Recall:", recall_rf)
print("F1 Score:", f1_rf)


import xgboost as xgb

# Initialize XGBoost model
xgb_model = xgb.XGBClassifier()

# Perform 10-fold cross-validation on the validation data
cv_scores_xgb = cross_val_score(xgb_model, X_val_tfidf, val_data['sentiment'], cv=10, scoring='accuracy')

# Print cross-validation scores
print("Cross-Validation Scores:")
print(cv_scores_xgb)
print("Mean CV Accuracy:", cv_scores_xgb.mean())

# Train the model on the entire validation data
xgb_model.fit(X_val_tfidf, val_data['sentiment'])

# Predict sentiment on the test data
test_predictions_xgb = xgb_model.predict(X_test_tfidf)

# Compute evaluation metrics
accuracy_xgb = accuracy_score(imdb_test['sentiment'], test_predictions_xgb)
precision_xgb = precision_score(imdb_test['sentiment'], test_predictions_xgb)
recall_xgb = recall_score(imdb_test['sentiment'], test_predictions_xgb)
f1_xgb = f1_score(imdb_test['sentiment'], test_predictions_xgb)

# Print evaluation metrics
print("\nEvaluation Metrics on Test Data (XGBoost):")
print("Accuracy:", accuracy_xgb)
print("Precision:", precision_xgb)
print("Recall:", recall_xgb)
print("F1 Score:", f1_xgb)

# Define a custom transformer to convert documents into average Word2Vec vectors
class AverageWord2VecVectorizer(BaseEstimator, TransformerMixin):
    def __init__(self, model):
        self.model = model
        self.vector_size = model.vector_size

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([
            np.mean([self.model.wv[word] for word in document.split() if word in self.model.wv] or [np.zeros(self.vector_size)], axis=0)
            for document in X
        ])

# Train Word2Vec model
word2vec_model = Word2Vec(sentences=train_data['clean_text'].apply(str.split), vector_size=100, window=5, min_count=1, sg=1)

# Convert documents to Word2Vec vectors
vectorizer = AverageWord2VecVectorizer(word2vec_model)
X_train_word2vec = vectorizer.transform(train_data['clean_text'])
X_val_word2vec = vectorizer.transform(val_data['clean_text'])
X_test_word2vec = vectorizer.transform(imdb_test['clean_text'])

# Train SVM model using Word2Vec vectors
svm_model_word2vec = SVC()

# Perform 10-fold cross-validation on the validation data
cv_scores_svm_word2vec = cross_val_score(svm_model_word2vec, X_val_word2vec, val_data['sentiment'], cv=10, scoring='accuracy')

# Print cross-validation scores
print("Cross-Validation Scores:")
print(cv_scores_svm_word2vec)
print("Mean CV Accuracy:", cv_scores_svm_word2vec.mean())

# Train the model on the entire validation data
svm_model_word2vec.fit(X_val_word2vec, val_data['sentiment'])

# Predict sentiment on the test data
test_predictions_svm_word2vec = svm_model_word2vec.predict(X_test_word2vec)

# Compute evaluation metrics
accuracy_svm_word2vec = accuracy_score(imdb_test['sentiment'], test_predictions_svm_word2vec)
precision_svm_word2vec = precision_score(imdb_test['sentiment'], test_predictions_svm_word2vec)
recall_svm_word2vec = recall_score(imdb_test['sentiment'], test_predictions_svm_word2vec)
f1_svm_word2vec = f1_score(imdb_test['sentiment'], test_predictions_svm_word2vec)

# Print evaluation metrics
print("\nEvaluation Metrics on Test Data (SVM with Word2Vec):")
print("Accuracy:", accuracy_svm_word2vec)
print("Precision:", precision_svm_word2vec)
print("Recall:", recall_svm_word2vec)
print("F1 Score:", f1_svm_word2vec)


# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

# Tokenize text data
def tokenize_text(text):
    return tokenizer.encode_plus(
        text,
        add_special_tokens=True,
        max_length=128,
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
        return_tensors='pt'
    )

# Encode training, validation, and test data
train_encoded = [tokenize_text(text) for text in train_data['clean_text']]
val_encoded = [tokenize_text(text) for text in val_data['clean_text']]
test_encoded = [tokenize_text(text) for text in imdb_test['clean_text']]

# Create DataLoader for training, validation, and test data
train_dataset = TensorDataset(torch.cat([x['input_ids'] for x in train_encoded]),
                              torch.cat([x['attention_mask'] for x in train_encoded]),
                              torch.tensor(train_data['sentiment'].values))
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

val_dataset = TensorDataset(torch.cat([x['input_ids'] for x in val_encoded]),
                            torch.cat([x['attention_mask'] for x in val_encoded]),
                            torch.tensor(val_data['sentiment'].values))
val_loader = DataLoader(val_dataset, batch_size=16)

test_dataset = TensorDataset(torch.cat([x['input_ids'] for x in test_encoded]),
                             torch.cat([x['attention_mask'] for x in test_encoded]),
                             torch.tensor(imdb_test['sentiment'].values))
test_loader = DataLoader(test_dataset, batch_size=16)

# Fine-tune the pre-trained BERT model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def train(model, optimizer, train_loader, val_loader, epochs=3):
    for epoch in range(epochs):
        model.train()
        total_loss = 0
        for input_ids, attention_mask, labels in train_loader:
            input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            total_loss += loss.item()
            loss.backward()
            optimizer.step()
        avg_train_loss = total_loss / len(train_loader)
        print(f'Epoch {epoch+1}/{epochs}, Average Training Loss: {avg_train_loss:.4f}')

        # Evaluate on validation data
        model.eval()
        val_preds = []
        val_labels = []
        val_loss = 0
        with torch.no_grad():
            for input_ids, attention_mask, labels in val_loader:
                input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device)
                outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
                loss = outputs.loss
                val_loss += loss.item()
                logits = outputs.logits
                preds = torch.argmax(logits, dim=1)
                val_preds.extend(preds.cpu().numpy())
                val_labels.extend(labels.cpu().numpy())
            avg_val_loss = val_loss / len(val_loader)
            val_accuracy = accuracy_score(val_labels, val_preds)
            val_precision = precision_score(val_labels, val_preds)
            val_recall = recall_score(val_labels, val_preds)
            val_f1 = f1_score(val_labels, val_preds)
            print(f'Epoch {epoch+1}/{epochs}, Validation Loss: {avg_val_loss:.4f}, '
                  f'Validation Accuracy: {val_accuracy:.4f}, '
                  f'Validation Precision: {val_precision:.4f}, '
                  f'Validation Recall: {val_recall:.4f}, '
                  f'Validation F1 Score: {val_f1:.4f}')

# Train the model
train(model, optimizer, train_loader, val_loader)

# Evaluate on test data
model.eval()
test_preds = []
test_labels = []
with torch.no_grad():
    for input_ids, attention_mask, labels in test_loader:
        input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device)
        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        preds = torch.argmax(logits, dim=1)
        test_preds.extend(preds.cpu().numpy())
        test_labels.extend(labels.cpu().numpy())

# Compute evaluation metrics
accuracy_bert = accuracy_score(test_labels, test_preds)
precision_bert = precision_score(test_labels, test_preds)
recall_bert = recall_score(test_labels, test_preds)
f1_bert = f1_score(test_labels, test_preds)

# Print evaluation metrics
print("\nEvaluation Metrics on Test Data (BERT):")
print("Accuracy:", accuracy_bert)
print("Precision:", precision_bert)
print("Recall:", recall_bert)
print("F1 Score:", f1_bert)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Cross-Validation Scores:
[0.77697842 0.77697842 0.71942446 0.69064748 0.70289855 0.71014493
 0.74637681 0.73913043 0.6884058  0.66666667]
Mean CV Accuracy: 0.7217651965384214

Evaluation Metrics on Test Data:
Accuracy: 0.729818780889621
Precision: 0.6921658986175115
Recall: 0.8261826182618262
F1 Score: 0.7532597793380141
Cross-Validation Scores:
[0.82014388 0.76978417 0.73381295 0.71942446 0.79710145 0.77536232
 0.7173913  0.74637681 0.68115942 0.66666667]
Mean CV Accuracy: 0.7427223438640392

Evaluation Metrics on Test Data (SVM):
Accuracy: 0.7358594179022515
Precision: 0.717479674796748
Recall: 0.7766776677667767
F1 Score: 0.745905969360803
Cross-Validation Scores:
[0.48920863 0.48201439 0.48201439 0.48201439 0.48550725 0.48550725
 0.48550725 0.48550725 0.48550725 0.48550725]
Mean CV Accuracy: 0.48482952768220205


  _warn_prf(average, modifier, msg_start, len(result))



Evaluation Metrics on Test Data (KNN):
Accuracy: 0.500823723228995
Precision: 0.0
Recall: 0.0
F1 Score: 0.0
Cross-Validation Scores:
[0.65467626 0.63309353 0.60431655 0.60431655 0.63043478 0.65942029
 0.62318841 0.64492754 0.57971014 0.58695652]
Mean CV Accuracy: 0.6221040558857263

Evaluation Metrics on Test Data (Decision Tree):
Accuracy: 0.6320702910488742
Precision: 0.6477132262051916
Recall: 0.5764576457645765
F1 Score: 0.610011641443539
Cross-Validation Scores:
[0.76258993 0.71223022 0.69784173 0.61870504 0.63768116 0.72463768
 0.64492754 0.63043478 0.65942029 0.60144928]
Mean CV Accuracy: 0.6689917631112501

Evaluation Metrics on Test Data (Random Forest):
Accuracy: 0.6836902800658978
Precision: 0.7557603686635944
Recall: 0.5412541254125413
F1 Score: 0.6307692307692307
Cross-Validation Scores:
[0.71942446 0.63309353 0.66906475 0.64028777 0.65942029 0.70289855
 0.65942029 0.67391304 0.63768116 0.55797101]
Mean CV Accuracy: 0.655317485142321

Evaluation Metrics on Test Data (XGBo

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/3, Average Training Loss: 0.6114
Epoch 1/3, Validation Loss: 0.5003, Validation Accuracy: 0.7659, Validation Precision: 0.7951, Validation Recall: 0.7349, Validation F1 Score: 0.7638


## **Question 2 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text clustering**.

Please download the dataset using the following link: https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
(You may also use a different text dataset of your choice.)

**Apply the listed clustering methods to the dataset:**
*   K-means
*   DBSCAN
*   Hierarchical clustering
*   Word2Vec
*   BERT

You can refer to the code at the following link.
https://www.kaggle.com/karthik3890/text-clustering
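
As a starting point, the three classical methods in the list can be applied to the same TF-IDF matrix. The sketch below is a minimal illustration, not the reference solution: the six toy documents and every parameter value (`eps`, `min_samples`, `n_clusters`) are assumptions chosen only so the example runs.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents (hypothetical) about two loose topics: phones and food.
docs = [
    "battery life of this phone is great",
    "phone screen cracked after a week",
    "great phone with a long battery life",
    "the pasta was fresh and the sauce was rich",
    "fresh bread and rich tomato sauce",
    "the sauce on the pasta was too salty",
]

# Sparse TF-IDF matrix, one row per document.
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# K-means: the number of clusters is fixed in advance.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# DBSCAN: density-based; label -1 marks points treated as noise.
# Cosine distance is often a better fit for TF-IDF vectors than Euclidean distance.
dbscan_labels = DBSCAN(eps=0.9, min_samples=2, metric="cosine").fit_predict(X)

# Hierarchical (agglomerative) clustering with Ward linkage needs a dense array.
agglo_labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X.toarray())

print("K-means:     ", kmeans_labels)
print("DBSCAN:      ", dbscan_labels)
print("Hierarchical:", agglo_labels)
```

Word2Vec and BERT are representations rather than clustering algorithms, so they would replace the TF-IDF matrix `X` as the input to one of the clusterers above.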

The dataset mentioned in the assignment was too large: preprocessing and cleaning it took a lot of time, and clustering it afterwards was even more time-consuming. Instead, I downloaded a dataset from https://github.com/PawanKrGunjan/Natural-Language-Processing/blob/main/Sarcasm%20Detection/sarcasm.json. It is a sarcasm dataset in which 1 indicates sarcastic and 0 indicates not sarcastic.

In [None]:
data = pd.read_json('/content/sarcasm (2).json')
data = data[['headline', 'is_sarcastic']]
data.head()

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Tokenize the text
    tokens = word_tokenize(text)
    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]
    # Join tokens back into a string
    cleaned_text = ' '.join(tokens)
    return cleaned_text

# Apply preprocessing to the headline column
data['clean_text'] = data['headline'].apply(preprocess_text)

data['is_sarcastic'].value_counts()

# TF-IDF Vectorization
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(data['clean_text'])

from sklearn.cluster import KMeans

# Apply K-means clustering
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans_labels = kmeans.fit_predict(X_tfidf)

from sklearn.cluster import DBSCAN

# Apply DBSCAN clustering
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X_tfidf)

from scipy.cluster.hierarchy import linkage, cut_tree

# Compute pairwise distances and perform hierarchical clustering
linkage_matrix = linkage(X_tfidf.toarray(), method='ward')

# Extract cluster assignments
cut_tree_labels = cut_tree(linkage_matrix, n_clusters=2).flatten()

from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

# Define a custom transformer to convert documents into average Word2Vec vectors
class AverageWord2VecVectorizer(BaseEstimator, TransformerMixin):
    def __init__(self, model):
        self.model = model
        self.vector_size = model.vector_size

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([
            np.mean([self.model.wv[word] for word in document.split() if word in self.model.wv] or [np.zeros(self.vector_size)], axis=0)
            for document in X
        ])

# Train Word2Vec model on tokenized headlines (gensim expects lists of tokens, not raw strings)
word2vec_model = Word2Vec(sentences=data['clean_text'].apply(str.split), vector_size=100, window=5, min_count=1, sg=1)

# Convert documents to Word2Vec vectors
vectorizer = AverageWord2VecVectorizer(word2vec_model)
X_word2vec = vectorizer.transform(data['clean_text'])


from sklearn.cluster import KMeans

# Train KMeans clustering using Word2Vec vectors
kmeans_word2vec = KMeans(n_clusters=2, random_state=42)
kmeans_labels_word2vec = kmeans_word2vec.fit_predict(X_word2vec)

## Using BERT
from transformers import BertTokenizer, BertModel

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Tokenize text data
tokenized_data = data['clean_text'].apply(lambda x: tokenizer(x, return_tensors='pt', padding=True, truncation=True))

# Encode tokenized text into numerical representations
encoded_data = []
for i in range(len(tokenized_data)):
    with torch.no_grad():
        outputs = model(**tokenized_data[i])
        encoded_data.append(outputs.last_hidden_state.mean(dim=1).squeeze().numpy())
X_bert = np.vstack(encoded_data)

from sklearn.cluster import KMeans

# Apply K-means clustering using BERT embeddings
kmeans_bert = KMeans(n_clusters=2, random_state=42)
kmeans_labels_bert = kmeans_bert.fit_predict(X_bert)

from sklearn.metrics import adjusted_rand_score, silhouette_score
from scipy.cluster.hierarchy import linkage, cut_tree

# Ground truth labels
true_labels = data['is_sarcastic'].values

# Evaluate K-means clustering
ari_kmeans = adjusted_rand_score(true_labels, kmeans_labels)
silhouette_kmeans = silhouette_score(X_tfidf, kmeans_labels)

# Evaluate DBSCAN clustering
#ari_dbscan = adjusted_rand_score(true_labels, dbscan_labels)
#silhouette_dbscan = silhouette_score(X_tfidf, dbscan_labels)

# Evaluate hierarchical clustering
ari_hierarchical = adjusted_rand_score(true_labels, cut_tree_labels)
silhouette_hierarchical = silhouette_score(X_tfidf, cut_tree_labels)

# Evaluate K-means clustering with Word2Vec
ari_kmeans_word2vec = adjusted_rand_score(true_labels, kmeans_labels_word2vec)
silhouette_kmeans_word2vec = silhouette_score(X_word2vec, kmeans_labels_word2vec)

# Evaluate K-means clustering with BERT embeddings
ari_kmeans_bert = adjusted_rand_score(true_labels, kmeans_labels_bert)
silhouette_kmeans_bert = silhouette_score(X_bert, kmeans_labels_bert)

# Print evaluation results
print("Evaluation Results:")
print("K-means - ARI:", ari_kmeans, "Silhouette Score:", silhouette_kmeans)
#print("DBSCAN - ARI:", ari_dbscan, "Silhouette Score:", silhouette_dbscan)
print("Hierarchical - ARI:", ari_hierarchical, "Silhouette Score:", silhouette_hierarchical)
print("K-means with Word2Vec - ARI:", ari_kmeans_word2vec, "Silhouette Score:", silhouette_kmeans_word2vec)
print("K-means with BERT - ARI:", ari_kmeans_bert, "Silhouette Score:", silhouette_kmeans_bert)





Evaluation Results:
K-means - ARI: -0.00887055695096076 Silhouette Score: 0.0011449686543569446
Hierarchical - ARI: -0.00038797454305711124 Silhouette Score: 0.0003877639781810512
K-means with Word2Vec - ARI: -0.005723980609364586 Silhouette Score: 0.979387477786733
K-means with BERT - ARI: 5.787865645758347e-05 Silhouette Score: 0.04185712


**In one paragraph, please compare the results of K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT.**

**Write your response here:**

None of the methods recovers the sarcastic/non-sarcastic labels: every adjusted Rand index (ARI) is essentially zero (K-means on TF-IDF -0.0089, hierarchical clustering -0.0004, K-means on Word2Vec -0.0057, K-means on BERT 0.00006), which is the level expected from random assignment. DBSCAN was run, but its evaluation is commented out in the code above, so no scores are reported for it. On TF-IDF features the silhouette scores are also close to zero (K-means 0.0011, hierarchical 0.0004), indicating heavily overlapping clusters; K-means is marginally higher than hierarchical clustering on silhouette but slightly lower on ARI. BERT embeddings give a somewhat higher silhouette (0.042), while Word2Vec vectors give a very high one (0.979). Because the Word2Vec ARI is still near zero, that high silhouette shows only that the two clusters are geometrically well separated, not that they correspond to sarcasm; a value this close to 1 may also reflect a degenerate partition and should be checked against the cluster sizes. Overall, the embedding-based representations produce more compact clusters than TF-IDF, but none of the clusterings agrees with the ground-truth labels, so further feature engineering and parameter tuning would be needed.
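
The two scores used in this comparison measure different things: ARI compares a clustering against reference labels, while the silhouette score looks only at the geometry of the clusters. The toy example below, with made-up points and labels, is a sketch of how a clustering can be very well separated and still unrelated to the labels.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Two tight, far-apart groups of points (hypothetical data).
X = np.array([
    [0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1],
    [10.0, 10.0], [10.0, 10.1], [10.1, 10.0], [10.1, 10.1],
])

# The clustering follows the geometry exactly ...
cluster_labels = [0, 0, 0, 0, 1, 1, 1, 1]
# ... but the reference labels alternate inside each group.
true_labels = [0, 1, 0, 1, 0, 1, 0, 1]

ari = adjusted_rand_score(true_labels, cluster_labels)
sil = silhouette_score(X, cluster_labels)

print(f"ARI: {ari:.3f}")         # at or below zero: no agreement with the labels
print(f"Silhouette: {sil:.3f}")  # close to 1: compact, well-separated clusters
```

A high silhouette score on its own therefore says nothing about whether the clusters match the categories of interest; it needs to be read together with ARI or another label-based measure.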





# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.


**(Your submission will not be graded if this question is left unanswered)**



The exercises provided in this assignment were quite comprehensive and covered various aspects of text analysis and machine learning techniques. Here are some reflective feedback points:

Diverse Coverage: The exercises covered a wide range of machine learning algorithms for text classification and clustering tasks, including traditional methods like Naive Bayes, SVM, KNN, Decision Trees, Random Forest, and XGBoost, as well as advanced techniques like Word2Vec and BERT embeddings. This allowed for a thorough exploration of different approaches and their performance in different scenarios.

Hands-on Practice: The exercises provided hands-on practice with implementing machine learning algorithms using Python libraries such as scikit-learn, Gensim, and Hugging Face Transformers. This practical experience was valuable in reinforcing theoretical concepts and improving coding skills.

Evaluation and Performance Metrics: The exercises emphasized the importance of evaluation and performance metrics in assessing the quality of machine learning models. Metrics such as accuracy, precision, recall, F1-score, adjusted Rand index, and silhouette score were used to evaluate the models, providing a comprehensive understanding of their strengths and limitations.

Integration of External Libraries: Integration of external libraries like Gensim for Word2Vec embeddings and Hugging Face Transformers for BERT embeddings added depth to the exercises and exposed learners to state-of-the-art tools and techniques in natural language processing.

Clear Instructions: The instructions provided for each exercise were clear and well-structured, making it easy to follow along and complete the tasks within a reasonable timeframe.