
# **INFO5731 In-class Exercise 5**

**This exercise aims to provide a comprehensive learning experience in text analysis and machine learning techniques, focusing on both text classification and clustering tasks.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## **Question 1 (20 Points)**

The purpose of this question is to practice different machine learning algorithms for **text classification** as well as performance evaluation. In addition, you are required to conduct **10-fold cross-validation** (https://scikit-learn.org/stable/modules/cross_validation.html) during training.



The dataset can be downloaded from Canvas. It contains two files, training data and test data, for sentiment analysis of IMDB reviews, with two categories: 1 represents positive and 0 represents negative. You need to split the training data into training and validation sets (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10-fold cross-validation while training the classifier. The final trained model is then evaluated on the test data.
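
One way to realize the split and cross-validation described above is sketched below. It assumes scikit-learn. The tiny synthetic corpus, the `texts`/`labels` names, and the choice of `MultinomialNB` are placeholders for illustration, not the course dataset. Because the vectorizer sits inside the pipeline, each fold fits its vocabulary only on that fold's training portion.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the training file (1 = positive, 0 = negative).
texts = [f"great fun movie number {i}" for i in range(25)] + \
        [f"dull boring movie number {i}" for i in range(25)]
labels = [1] * 25 + [0] * 25

# 80% training, 20% validation, keeping the class balance in both parts.
X_tr, X_val, y_tr, y_val = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# The vectorizer is refit inside every fold because it is part of the pipeline.
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
cv_scores = cross_val_score(pipeline, X_tr, y_tr, cv=cv, scoring="accuracy")

# Fit on the full training portion and check the held-out validation split.
pipeline.fit(X_tr, y_tr)
val_accuracy = pipeline.score(X_val, y_val)
print("Mean CV accuracy:", cv_scores.mean())
print("Validation accuracy:", val_accuracy)
```

The test file would only be touched once, after a model and its settings have been chosen from the cross-validation and validation results.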


**Algorithms:**

*   MultinomialNB
*   SVM
*   KNN
*   Decision tree
*   Random Forest
*   XGBoost
*   Word2Vec
*   BERT

**Evaluation measurement:**


*   Accuracy
*   Recall
*   Precision
*   F-1 score
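
The four measures listed above can be computed with scikit-learn once a model has produced predictions. The sketch below uses made-up label vectors (`y_true`, `y_pred`) and a hypothetical helper named `evaluate`, purely for illustration. In the exercise they would be the validation or test labels and the classifier's predictions.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical ground truth and predictions (1 = positive, 0 = negative).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

def evaluate(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for the positive class (label 1)."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }

metrics = evaluate(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that labels read from a text file are usually strings; converting them to integers first (for example with `astype(int)`) keeps the default positive label `1` valid for precision, recall, and F1.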


In [None]:
# Write your code here
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from xgboost import XGBClassifier
from gensim.models import Word2Vec
from transformers import BertTokenizer, BertModel
import torch


In [None]:
# load the data
train_data = 'stsa-train.txt'
test_data = 'stsa-test.txt'

In [None]:
def load_data(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        lines = file.readlines()
        data = []
        for line in lines:
            line = line.strip()
            sentiment, text = line.split(' ', 1)
            data.append({'Emotion': sentiment, 'Text': text})
        return pd.DataFrame(data)

train_data = load_data(train_data)
test_data = load_data(test_data)


In [None]:
train_data.head(5)


|   | Emotion | Text |
|---|---------|------|
| 0 | 1 | a stirring , funny and finally transporting re... |
| 1 | 0 | apparently reassembled from the cutting-room f... |
| 2 | 0 | they presume their audience wo n't sit still f... |
| 3 | 1 | this is a visually stunning rumination on love... |
| 4 | 1 | jonathan parker 's bartleby should have been t... |


In [None]:
# Data Preprocessing
X_train = train_data['Text']
y_train = train_data['Emotion']

In [None]:
X_test = test_data['Text']
y_test = test_data['Emotion']


In [None]:
# Split the Training Data
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)


In [None]:
# Vectorize the text data
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_val_vec = vectorizer.transform(X_val)
X_test_vec = vectorizer.transform(X_test)

In [None]:
# Define Models
models = {
    "MultinomialNB": MultinomialNB(),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "DecisionTree": DecisionTreeClassifier(),
    "RandomForest": RandomForestClassifier(),
    "XGBoost": XGBClassifier()
}

In [None]:
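# Perform 10-fold cross-validation while training the classifier
from sklearn.model_selection import cross_val_score

# Assumption: MultinomialNB serves as the example classifier here; any entry of `models` could be used instead.
classifier = MultinomialNB()
# Cross-validate on the vectorized features with integer labels
accuracies = cross_val_score(estimator=classifier, X=X_train_vec, y=y_train.astype(int), cv=10)
print("Accuracy: %f (%f)" % (accuracies.mean(), accuracies.std()))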



In [None]:
# Step 6: Cross Validation and Evaluation
for name, model in models.items():
    print("Model:", name)
    # Convert labels to integer type
    y_train_int = y_train.astype(int)

    # Cross Validation
    cv_scores = cross_val_score(model, X_train_vec, y_train_int, cv=10)
    print("Cross Validation Scores:", cv_scores)
    print("Mean CV Score:", cv_scores.mean())


Model: MultinomialNB
Cross Validation Scores: [0.76173285 0.8032491  0.78158845 0.80144404 0.7833935  0.75812274
 0.78842676 0.78661844 0.7522604  0.78842676]
Mean CV Score: 0.7805263054817504
Model: SVM
Cross Validation Scores: [0.72924188 0.72743682 0.71299639 0.72382671 0.74368231 0.7599278
 0.75406872 0.72151899 0.72151899 0.75949367]
Mean CV Score: 0.7353712275021055
Model: KNN
Cross Validation Scores: [0.58483755 0.56137184 0.58844765 0.57581227 0.5631769  0.55234657
 0.60036166 0.56600362 0.54972875 0.54068716]
Mean CV Score: 0.5682773973273447
Model: DecisionTree
Cross Validation Scores: [0.59927798 0.62454874 0.61552347 0.60830325 0.64981949 0.64620939
 0.66546112 0.63652803 0.6039783  0.62206148]
Mean CV Score: 0.6271711243561539
Model: RandomForest
Cross Validation Scores: [0.68231047 0.69314079 0.6967509  0.71119134 0.74187726 0.74187726
 0.75045208 0.70705244 0.71971067 0.73417722]
Mean CV Score: 0.7178540419503724
Model: XGBoost
Cross Validation Scores: [0.67870036 0.6877

## **Question 2 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text clustering**.

Please download the dataset using the following link: https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
(You can also use different text data which you want)

**Apply the listed clustering methods to the dataset:**
*   K-means
*   DBSCAN
*   Hierarchical clustering
*   Word2Vec
*   BERT

You can refer to the code at the following link.
https://www.kaggle.com/karthik3890/text-clustering

In [None]:
# Write your code here
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score
from gensim.models import Word2Vec
from sentence_transformers import SentenceTransformer


In [None]:
# Step 3: Feature Extraction
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
# Initialize the CountVectorizer object
vectorizer = CountVectorizer()
# Fit and transform the training data
X_train = vectorizer.fit_transform(train_data['Text'])



In [None]:
# Step 4: Apply Clustering Algorithms
# K-means
# Set n_init explicitly so the result does not depend on the library's changing default
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
kmeans.fit(X_train)


In [None]:
# DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan.fit(X_train)


In [None]:
# Hierarchical clustering
hierarchical = AgglomerativeClustering(n_clusters=5)
hierarchical.fit(X_train.toarray())


In [None]:
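# Word2Vec
# Tokenize on whitespace (the corpus is already lowercased and space-separated)
sentences = [text.split() for text in train_data['Text']]
word2vec_model = Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=1, workers=4)
# Represent each document as the mean of its word vectors so it can be clustered
X_w2v = np.array([
    np.mean([word2vec_model.wv[w] for w in s], axis=0) if s else np.zeros(word2vec_model.vector_size)
    for s in sentences
])

# BERT
# Encode the same texts that were vectorized above
bert_model = SentenceTransformer('bert-base-nli-mean-tokens')
X_bert = bert_model.encode(train_data['Text'].tolist())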

**In one paragraph, please compare the results of K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT.**

**Write your response here:**

In comparing the results of K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT for text clustering, several observations can be made. K-means, being a centroid-based clustering algorithm, is sensitive to the choice of the number of clusters and tends to produce spherical clusters, which may not be ideal for text data with complex structures. DBSCAN, on the other hand, is a density-based algorithm that can discover clusters of arbitrary shapes and sizes, but its performance heavily depends on the choice of parameters such as epsilon and minimum samples. Hierarchical clustering provides a hierarchical structure of clusters, which can be advantageous for understanding relationships between clusters, but it may not scale well to large datasets. Word2Vec, a word embedding technique, captures semantic relationships between words but may struggle with out-of-vocabulary words and requires careful parameter tuning. BERT, a transformer-based language model, produces dense representations of text that capture context and semantic meaning effectively, but its computational cost and memory requirements may limit its applicability to large datasets. Overall, the choice of algorithm depends on the specific characteristics of the dataset and the desired outcome, balancing factors such as interpretability, scalability, and performance.
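
To ground such a comparison in numbers, cluster assignments can be scored with the silhouette coefficient. The following is a minimal sketch on a hypothetical two-topic toy corpus (not the Amazon reviews). The `docs` list, the `safe_silhouette` helper, the parameter values, and the use of TF-IDF features are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical toy corpus with two obvious topics (battery and screen).
docs = (
    ["battery life is great and charging is fast"] * 5
    + ["battery drains quickly and charging is slow"] * 5
    + ["screen is bright and the display is sharp"] * 5
    + ["screen cracked and the display is dim"] * 5
)

# Dense TF-IDF features; AgglomerativeClustering does not accept sparse input.
X_dense = TfidfVectorizer().fit_transform(docs).toarray()

def safe_silhouette(features, labels):
    """Silhouette score over non-noise points, or None if fewer than 2 clusters."""
    labels = np.asarray(labels)
    mask = labels != -1  # DBSCAN marks noise points with -1
    if len(set(labels[mask])) < 2:
        return None
    return silhouette_score(features[mask], labels[mask])

# Cluster labels from each method on the same feature matrix.
results = {
    "KMeans": KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_dense),
    "Agglomerative": AgglomerativeClustering(n_clusters=2).fit_predict(X_dense),
    "DBSCAN": DBSCAN(eps=0.5, min_samples=3, metric="cosine").fit_predict(X_dense),
}
scores = {name: safe_silhouette(X_dense, labels) for name, labels in results.items()}
for name, score in scores.items():
    print(name, "silhouette:", score)
```

The same scoring could be applied to Word2Vec or BERT document vectors by swapping the feature matrix, which makes the representations comparable under one clustering method.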





# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.


**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the above questions):

'''
Please write your answer here:
Here's some reflective feedback on the exercises completed in this assignment:

Answer:
Text Classification Task: This exercise provided a good opportunity to practice implementing various machine learning algorithms for text classification, including data preprocessing, model selection, and evaluation. It was beneficial to work with different algorithms such as MultinomialNB, SVM, KNN, Decision Trees, Random Forest, and XGBoost, and to evaluate their performance using metrics like accuracy, precision, recall, and F1-score. The use of cross-validation helped in obtaining more robust performance estimates for the models.
Text Clustering Task: This task allowed for exploring different clustering algorithms and feature extraction techniques for text data. However, the task could have been more specific regarding the evaluation of clustering results. While silhouette score and visual inspection were mentioned, a more thorough evaluation with clustering-specific metrics and techniques could have been included. Additionally, the Word2Vec and BERT sections lacked implementation details, making it difficult to fully understand how these techniques were applied for clustering.
Overall, these exercises provided valuable hands-on experience in working with text data, applying machine learning algorithms, and evaluating model performance. However, providing more detailed instructions and examples, especially for advanced techniques like Word2Vec and BERT, would enhance the learning experience. Additionally, including more diverse datasets and real-world scenarios could further enrich the exercises and better prepare for practical applications in the field.

'''