<a href="https://colab.research.google.com/github/snampally97/Assignment-Exercises/blob/main/Srikanth_Nampally_Exercise_5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 5**

**This exercise aims to provide a comprehensive learning experience in text analysis and machine learning techniques, focusing on both text classification and clustering tasks.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## **Question 1 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text classification** as well as the performance evaluation. In addition, you are requried to conduct **10 fold cross validation** (https://scikit-learn.org/stable/modules/cross_validation.html) in the training.



The dataset can be download from canvas. The dataset contains two files train data and test data for sentiment analysis in IMDB review, it has two categories: 1 represents positive and 0 represents negative. You need to split the training data into training and validate data (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10 fold cross validation while training the classifier. The final trained model was final evaluated on the test data.


**Algorithms:**

*   MultinominalNB
*   SVM
*   KNN
*   Decision tree
*   Random Forest
*   XGBoost
*   Word2Vec
*   BERT

**Evaluation measurement:**


*   Accuracy
*   Recall
*   Precison
*   F-1 score


In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

# Load data
train_data_url = 'https://raw.githubusercontent.com/snampally97/Assignment-Exercises/main/stsa-train.csv'
test_data_url = 'https://raw.githubusercontent.com/snampally97/Assignment-Exercises/main/stsa-test.csv'

train_df = pd.read_csv(train_data_url, header=None)
test_df = pd.read_csv(test_data_url, header=None)

# Rename columns
train_df.columns = ['text']
test_df.columns = ['text']

# Add labels to the data
train_df['label'] = 1
test_df['label'] = 0

# Combine train and test data
full_df = pd.concat([train_df, test_df], ignore_index=True)

# Split data into features and target
X = full_df['text']
y = full_df['label']

# Vectorize text data
vectorizer = CountVectorizer()
X_vec = vectorizer.fit_transform(X)

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_vec, y, test_size=0.2, random_state=42)

# Define classifiers
classifiers = {
    "MultinomialNB": MultinomialNB(),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "Decision tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "XGBoost": XGBClassifier()
}

# Train and evaluate classifiers
results = {}
for clf_name, clf in classifiers.items():
    print(f"Training and evaluating {clf_name}...")
    # Perform 10-fold cross validation
    cv_scores = cross_val_score(clf, X_train, y_train, cv=10, scoring='accuracy')
    print(f"Cross-validation scores: {cv_scores}")
    print(f"Mean cross-validation accuracy: {cv_scores.mean()}")

    # Train on full training data
    clf.fit(X_train, y_train)

    # Evaluate on test set
    y_pred = clf.predict(X_test)

    # Calculate evaluation metrics
    accuracy = accuracy_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)

    results[clf_name] = {
        "Accuracy": accuracy,
        "Recall": recall,
        "Precision": precision,
        "F1 Score": f1
    }

# Display results
print("\nResults:")
for clf_name, metrics in results.items():
    print(clf_name)
    for metric_name, value in metrics.items():
        print(f"{metric_name}: {value}")
    print()

Training and evaluating MultinomialNB...
Cross-validation scores: [0.75       0.74571429 0.74105866 0.76680973 0.76251788 0.74964235
 0.75822604 0.76108727 0.74821173 0.74821173]
Mean cross-validation accuracy: 0.7531479664827303
Training and evaluating SVM...
Cross-validation scores: [0.79       0.79       0.79113019 0.79113019 0.79113019 0.79113019
 0.78969957 0.78969957 0.78969957 0.78969957]
Mean cross-validation accuracy: 0.7903319027181688
Training and evaluating KNN...
Cross-validation scores: [0.78285714 0.75857143 0.77253219 0.76251788 0.76824034 0.76537911
 0.7739628  0.76967096 0.77682403 0.77110157]
Mean cross-validation accuracy: 0.7701657469854896
Training and evaluating Decision tree...
Cross-validation scores: [0.67857143 0.69142857 0.70815451 0.68669528 0.70529328 0.67525036
 0.70243205 0.7167382  0.6981402  0.70100143]
Mean cross-validation accuracy: 0.6963705293276109
Training and evaluating Random Forest...
Cross-validation scores: [0.78571429 0.78       0.78254649 

## **Question 2 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text clustering**.

Please downlad the dataset by using the following link.  https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
(You can also use different text data which you want)

**Apply the listed clustering methods to the dataset:**
*   K-means
*   DBSCAN
*   Hierarchical clustering
*   Word2Vec
*   BERT

You can refer to of the codes from  the follwing link below.
https://www.kaggle.com/karthik3890/text-clustering

In [13]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from gensim.models import Word2Vec
from sentence_transformers import SentenceTransformer
import numpy as np

# Load the dataset
url = "https://raw.githubusercontent.com/snampally97/Assignment-Exercises/main/stsa-test.csv"
data = pd.read_csv(url)

# Detect the column containing text data
text_column_name = data.columns[0]  # Assuming the text data is in the first column

# Vectorization using TF-IDF
vectorizer = TfidfVectorizer(max_features=1000)  # You can adjust max_features as needed
X = vectorizer.fit_transform(data[text_column_name])

# K-means clustering
kmeans = KMeans(n_clusters=5, random_state=42)
kmeans_labels = kmeans.fit_predict(X)

# DBSCAN clustering
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X)

# Hierarchical clustering
agg_clustering = AgglomerativeClustering(n_clusters=5)
agg_labels = agg_clustering.fit_predict(X.toarray())  # Hierarchical clustering requires dense matrix

# Word2Vec clustering
word2vec_model = Word2Vec(sentences=[sentence.split() for sentence in data[text_column_name]], vector_size=100, window=5, min_count=1, workers=4)
word2vec_vectors = np.array([word2vec_model.wv[word] for sentence in data[text_column_name] for word in sentence.split()])
word2vec_kmeans = KMeans(n_clusters=5, random_state=42)
word2vec_kmeans_labels = word2vec_kmeans.fit_predict(word2vec_vectors)

# BERT embeddings and clustering
model = SentenceTransformer('bert-base-nli-mean-tokens')
bert_embeddings = model.encode(data[text_column_name])
bert_kmeans = KMeans(n_clusters=5, random_state=42)
bert_kmeans_labels = bert_kmeans.fit_predict(bert_embeddings)

# Output the cluster labels for each method
print("K-means Labels:", kmeans_labels)
print("DBSCAN Labels:", dbscan_labels)
print("Hierarchical Labels:", agg_labels)
print("Word2Vec K-means Labels:", word2vec_kmeans_labels)
print("BERT K-means Labels:", bert_kmeans_labels)




modules.json:   0%|          | 0.00/229 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/122 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/3.99k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/625 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/399 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/2.00 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]



K-means Labels: [1 0 3 ... 4 3 0]
DBSCAN Labels: [-1 -1 -1 ... -1 -1 -1]
Hierarchical Labels: [0 0 1 ... 0 4 0]
Word2Vec K-means Labels: [2 4 1 ... 0 1 4]
BERT K-means Labels: [4 4 3 ... 3 3 3]


In [9]:
!pip install sentence-transformers

Collecting sentence-transformers
  Downloading sentence_transformers-2.7.0-py3-none-any.whl (171 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m171.5/171.5 kB[0m [31m3.8 MB/s[0m eta [36m0:00:00[0m
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.11.0->sentence-transformers)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.11.0->sentence-transformers)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch>=1.11.0->sentence-transformers)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch>=1.11.0->sentence-transformers)
  Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl (731.7 MB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch>=1.11.0->sentence-transform

In [3]:
pip install pandas numpy scikit-learn gensim tensorflow




**In one paragraph, please compare the results of K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT.**

**Write your response here:**

.

.

.

.

.




# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.


**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question, write down your answer as detail as possible for the above questions):

'''
Please write you answer here:
Based on a variety of embedding techniques and clustering algorithms, including machine learning algorithms for text categorization, I believe the application provides a thorough approach to text clustering. It uses well-known libraries like Gensim, scikit-learn, and Hugging Face Transformers to efficiently manage preprocessing. While there is need for improvement, it offers a solid framework for text clustering challenges. Advanced methods for adjusting hyperparameters and assessing performance could be investigated.




'''