
# **INFO5731 In-class Exercise 5**

**This exercise aims to provide a comprehensive learning experience in text analysis and machine learning techniques, focusing on both text classification and clustering tasks.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## **Question 1 (20 Points)**

The purpose of this question is to practice different machine learning algorithms for **text classification**, along with performance evaluation. In addition, you are required to conduct **10-fold cross-validation** (https://scikit-learn.org/stable/modules/cross_validation.html) during training.



The dataset can be downloaded from Canvas. It contains two files, train data and test data, for sentiment analysis of IMDB reviews, with two categories: 1 represents positive and 0 represents negative. You need to split the training data into training and validation sets (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10-fold cross-validation while training the classifier. The final trained model is then evaluated on the test data.


**Algorithms:**

*   MultinomialNB
*   SVM
*   KNN
*   Decision tree
*   Random Forest
*   XGBoost
*   Word2Vec
*   BERT

**Evaluation measurement:**


*   Accuracy
*   Recall
*   Precision
*   F-1 score


In [1]:
!unzip exercise09_datacollection.zip

Archive:  exercise09_datacollection.zip
   creating: exercise09_datacollection/
  inflating: exercise09_datacollection/stsa-test.txt  
  inflating: exercise09_datacollection/stsa-train.txt  


In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from gensim.models import Word2Vec
from transformers import BertTokenizer, BertModel
import torch

In [3]:
import pandas as pd

# Load train and test data
train_data_path = '/content/exercise09_datacollection/stsa-train.txt'
test_data_path = '/content/exercise09_datacollection/stsa-test.txt'

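def load_data(file_path):
    """Read one 'label text' pair per line into a DataFrame."""
    data = []
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # The first token is the label (0 or 1); the rest is the review text
            sentiment, text = line.split(' ', 1)
            data.append({'Sentiment': int(sentiment), 'Text': text})
    return pd.DataFrame(data)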

train_data = load_data(train_data_path)
test_data = load_data(test_data_path)

In [4]:
train_data.head()

| | Sentiment | Text |
|---|---|---|
| 0 | 1 | a stirring , funny and finally transporting re... |
| 1 | 0 | apparently reassembled from the cutting-room f... |
| 2 | 0 | they presume their audience wo n't sit still f... |
| 3 | 1 | this is a visually stunning rumination on love... |
| 4 | 1 | jonathan parker 's bartleby should have been t... |


In [5]:
# Split training data into training and validation sets (80% training, 20% validation)
train_df, validate_df = train_test_split(train_data, test_size=0.2, random_state=42)

# Define classifiers
classifiers = {
    'MultinomialNB': MultinomialNB(),
    'SVM': SVC(),
    'KNN': KNeighborsClassifier(),
    'DecisionTree': DecisionTreeClassifier(),
    'RandomForest': RandomForestClassifier()
}

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [7]:
# Define TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Vectorize the text data
X_train = vectorizer.fit_transform(train_df['Text'])
y_train = train_df['Sentiment']


In [8]:
# Evaluation metrics
metrics = ['accuracy']

In [9]:
import warnings

# Perform 10-fold cross-validation for each classifier
for clf_name, clf in classifiers.items():
    print(f"Classifier: {clf_name}")
    # Perform cross-validation
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", category=UserWarning)
        scores = cross_validate(clf, X_train, y_train, cv=10, scoring=metrics)

    # Print average scores
    for metric in metrics:
        print(f"{metric}: {scores[f'test_{metric}'].mean()}")


Classifier: MultinomialNB
accuracy: 0.7799906646385648
Classifier: SVM
accuracy: 0.7729440988112103
Classifier: KNN
accuracy: 0.7095410005157298
Classifier: DecisionTree
accuracy: 0.6062142171679257
Classifier: RandomForest
accuracy: 0.7028495701163983
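
The run above reports accuracy only and does not yet use the 20% validation split. The following is a minimal sketch of how all four evaluation measures could be collected, not the method used in the cells above. It assumes integer labels and uses a small made-up corpus in place of the stsa files; the names `texts_train`, `texts_val`, and `val_scores` are illustrative. The vectorizer sits inside a pipeline so that TF-IDF is refit within each fold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import StratifiedKFold, cross_validate, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the training file (1 = positive, 0 = negative).
positive = ["a stirring and funny film", "visually stunning and moving", "great acting and a warm story"]
negative = ["dull and lifeless plot", "a boring waste of time", "poorly written and badly acted"]
texts = positive * 10 + negative * 10
labels = [1] * 30 + [0] * 30

# 80/20 split, stratified so both classes appear in each part.
texts_train, texts_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# Keeping the vectorizer in the pipeline refits TF-IDF inside every fold.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# 10-fold cross-validation on the training split with all four measures.
scoring = ["accuracy", "recall", "precision", "f1"]
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
cv_scores = cross_validate(model, texts_train, y_train, cv=cv, scoring=scoring)
for metric in scoring:
    print(f"CV {metric}: {cv_scores[f'test_{metric}'].mean():.3f}")

# Refit on the full training split, then score the held-out split once.
model.fit(texts_train, y_train)
y_pred = model.predict(texts_val)
val_scores = {
    "accuracy": accuracy_score(y_val, y_pred),
    "recall": recall_score(y_val, y_pred),
    "precision": precision_score(y_val, y_pred),
    "f1": f1_score(y_val, y_pred),
}
print(val_scores)
```

With the real data, the same pattern would take `train_df['Text']` and `train_df['Sentiment']` for cross-validation, and the final scores would come from the test file rather than the toy validation split.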


## **Question 2 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text clustering**.

Please download the dataset using the following link: https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
(You can also use different text data which you want)

**Apply the listed clustering methods to the dataset:**
*   K-means
*   DBSCAN
*   Hierarchical clustering
*   Word2Vec
*   BERT

You can refer to the code at the following link.
https://www.kaggle.com/karthik3890/text-clustering

In [10]:
# Write your code here
!unzip archive.zip

Archive:  archive.zip
  inflating: Amazon_Unlocked_Mobile.csv  


In [11]:
# Load the dataset
data = pd.read_csv("Amazon_Unlocked_Mobile.csv")

In [12]:
data.head()

| | Product Name | Brand Name | Price | Rating | Reviews | Review Votes |
|---|---|---|---|---|---|---|
| 0 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 5 | I feel so LUCKY to have found this used (phone... | 1.0 |
| 1 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 4 | nice phone, nice up grade from my pantach revu... | 0.0 |
| 2 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 5 | Very pleased | 0.0 |
| 3 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 4 | It works good but it goes slow sometimes but i... | 0.0 |
| 4 | "CLEAR CLEAN ESN" Sprint EPIC 4G Galaxy SPH-D7... | Samsung | 199.99 | 4 | Great phone to replace my lost phone. The only... | 0.0 |


In [13]:
# Write your code here
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

In [14]:
data.dropna(inplace=True)

In [15]:
# Vectorize the text data
tfidf_vectorizer = TfidfVectorizer(max_df=0.5, min_df=2, stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(data['Reviews'])

# Apply K-means clustering
num_clusters = 5
kmeans = KMeans(n_clusters=num_clusters)
kmeans.fit(tfidf_matrix)

# Print the top terms per cluster
order_centroids = kmeans.cluster_centers_.argsort()[:, ::-1]
terms = tfidf_vectorizer.get_feature_names_out()
for i in range(num_clusters):
    print("Cluster %d:" % i)
    # The ten terms with the largest centroid weights describe the cluster
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])
    print()




Cluster 0:
 good
 great
 works
 like
 nice
 perfect
 just
 battery
 work
 new

Cluster 1:
 love
 great
 new
 works
 good
 nice
 perfect
 thank
 iphone
 just

Cluster 2:
 great
 works
 product
 price
 condition
 buy
 good
 fast
 deal
 far

Cluster 3:
 excellent
 excelente
 product
 producto
 condition
 recommend
 seller
 good
 price
 100

Cluster 4:
 good
 product
 price
 far
 works
 thanks
 quality
 really
 thank
 buy
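
The top terms overlap heavily across the five clusters, which suggests that `num_clusters = 5` may not be the best fit. One common way to compare candidate values is the silhouette score. The sketch below is an illustration on a few made-up reviews, not a result from the Amazon data; `silhouette_by_k` and `best_k` are hypothetical names.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical toy reviews; with the real data a sample of data['Reviews'] would go here.
reviews = [
    "great phone works perfectly",
    "great phone great price",
    "phone works great",
    "battery died after a week",
    "battery drains too fast",
    "terrible battery life",
    "excelente producto recomendado",
    "producto excelente",
    "excelente producto buen precio",
]

matrix = TfidfVectorizer().fit_transform(reviews)

# Fit K-means for several k and record the mean silhouette (higher is better).
silhouette_by_k = {}
for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    cluster_labels = km.fit_predict(matrix)
    silhouette_by_k[k] = silhouette_score(matrix, cluster_labels)

best_k = max(silhouette_by_k, key=silhouette_by_k.get)
print(silhouette_by_k)
print("best k:", best_k)
```

Fixing `random_state` also makes the cluster assignments repeatable between runs, which the K-means cell above does not guarantee.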



In [None]:
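from sklearn.cluster import DBSCAN

# metric='precomputed' expects a distance matrix, not a similarity matrix,
# so let DBSCAN compute cosine distances on the sparse TF-IDF rows directly.
# This also avoids building a dense n x n similarity matrix in memory.
# On the full review set this can still be slow; fitting on a random
# sample of rows is a practical alternative.
dbscan = DBSCAN(eps=0.5, min_samples=5, metric='cosine')
dbscan.fit(tfidf_matrix)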

# Print clusters
labels = dbscan.labels_
unique_labels = set(labels)
for label in unique_labels:
    print("Cluster ", label)
    for i, text in enumerate(data[labels == label]['Reviews']):
        print(i, text)
    print()
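
DBSCAN's `metric='precomputed'` option expects pairwise distances, where small values mean "close", so a cosine similarity matrix has to be converted to a distance matrix (1 minus similarity) before it is passed in. The sketch below illustrates this on a handful of made-up reviews; the `eps` and `min_samples` values are chosen for the toy data only and are not tuned for the Amazon reviews.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

# Hypothetical toy reviews: two dense groups and one outlier.
reviews = [
    "great phone great price",
    "great phone good price",
    "great price good phone",
    "battery drains fast",
    "battery drains very fast",
    "fast battery drains",
    "arrived with a cracked screen",
]
matrix = TfidfVectorizer().fit_transform(reviews)

# cosine_distances returns 1 - cosine similarity; clip guards against
# tiny negative values caused by floating-point rounding.
distances = np.clip(cosine_distances(matrix), 0.0, None)

dbscan = DBSCAN(eps=0.5, min_samples=2, metric="precomputed")
dbscan_labels = dbscan.fit_predict(distances)

# Label -1 marks noise points that belong to no cluster.
for label, text in zip(dbscan_labels, reviews):
    print(label, text)
```

The dense distance matrix grows with the square of the number of reviews, so this form only suits small samples.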


In [None]:
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Perform hierarchical clustering
linkage_matrix = linkage(tfidf_matrix.toarray(), method='ward', metric='euclidean')

# Plot dendrogram
plt.figure(figsize=(10, 7))
dendrogram(linkage_matrix, truncate_mode='level', p=5)
plt.show()
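
A dendrogram shows the merge structure but does not assign reviews to clusters. One way to obtain flat labels is SciPy's `fcluster`, sketched here on made-up reviews. Ward linkage needs a dense array and memory that grows with the square of the number of rows, so for the full review set a random sample is the practical input; the limit of 3 clusters is an assumption for the toy data.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy reviews standing in for a sample of data['Reviews'].
reviews = [
    "great phone great price",
    "great phone good price",
    "battery drains fast",
    "battery drains very fast",
    "excelente producto",
    "producto excelente buen precio",
]
dense = TfidfVectorizer().fit_transform(reviews).toarray()

# Build the merge tree with Ward linkage on Euclidean distances.
linkage_matrix = linkage(dense, method="ward", metric="euclidean")

# Cut the tree so that at most 3 flat clusters remain (labels start at 1).
flat_labels = fcluster(linkage_matrix, t=3, criterion="maxclust")
for label, text in zip(flat_labels, reviews):
    print(label, text)
```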


In [None]:
from gensim.models import Word2Vec

# Train Word2Vec model
sentences = [text.split() for text in data['Reviews']]
# gensim 4.x renamed the `size` argument to `vector_size`
word2vec_model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)

**In one paragraph, please compare the results of K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT.**

K-means, DBSCAN, and hierarchical clustering are traditional clustering algorithms that operate on vector representations of text. K-means partitions the data around centroids using Euclidean distance, but it requires the number of clusters to be specified beforehand, which is difficult when that number is unknown. DBSCAN works the other way around: it infers the number of clusters from the density of the data and handles noise well, although it is sensitive to high-dimensional data and to clusters of varying density. Hierarchical clustering builds a tree of clusters that shows both the individual clusters and the relationships between them, but it can be computationally expensive on large datasets. Word2Vec and BERT are embedding methods rather than clustering algorithms; they capture the semantics of words or sentences, and the resulting vectors can then be clustered. Word2Vec places words that appear in similar contexts close together, although this alone is not enough to represent every semantic relationship. BERT, a contextual embedding model, produces high-quality embeddings that capture rich semantic information, but it consumes a lot of computing resources and may therefore not suit every case. In general, the choice of method depends on the characteristics of the data, the available computing power, and the depth of semantic understanding the task requires.


# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.


**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the questions above):

'''
Please write your answer here:

This assignment gave me the chance to work with several text clustering methods on a real-world dataset. Applying algorithms such as K-means, DBSCAN, and hierarchical clustering gave me hands-on experience in preparing and clustering text data. I often had to adapt the code examples so that they fit my own data. As a result, this activity strengthened my problem-solving skills and reinforced the algorithmic concepts taught in class. Finally, the contrast between the different clustering techniques showed the need to select methods that suit the data and the project. Overall, these exercises were a thorough and stimulating introduction to clustering text documents and its applications in various fields.




'''