<a href="https://colab.research.google.com/github/sandeep5924/Assignments/blob/main/Reddy_SandeepReddy_Exercise_05.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 5**

**This exercise aims to provide a comprehensive learning experience in text analysis and machine learning techniques, focusing on both text classification and clustering tasks.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## **Question 1 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text classification** as well as the performance evaluation. In addition, you are requried to conduct **10 fold cross validation** (https://scikit-learn.org/stable/modules/cross_validation.html) in the training.



The dataset can be download from canvas. The dataset contains two files train data and test data for sentiment analysis in IMDB review, it has two categories: 1 represents positive and 0 represents negative. You need to split the training data into training and validate data (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10 fold cross validation while training the classifier. The final trained model was final evaluated on the test data.


**Algorithms:**

*   MultinominalNB
*   SVM
*   KNN
*   Decision tree
*   Random Forest
*   XGBoost
*   Word2Vec
*   BERT

**Evaluation measurement:**


*   Accuracy
*   Recall
*   Precison
*   F-1 score


In [15]:
# Write your code here
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

def load_data(train_url, test_url):
    train_df = pd.read_csv(train_url, header=None, delimiter='\t')
    test_df = pd.read_csv(test_url, header=None, delimiter='\t')
    train_df.columns = ['text']
    test_df.columns = ['text']
    train_df['label'] = 1
    test_df['label'] = 0
    full_df = pd.concat([train_df, test_df], ignore_index=True)
    X = full_df['text']
    y = full_df['label']
    return X, y

def vectorize_text(X_train, X_test):
    vectorizer = CountVectorizer()
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)  # Transform test data using the same vectorizer
    return X_train_vec, X_test_vec

def train_and_evaluate_classifiers(X_train, X_test, y_train, y_test):
    classifiers = {
        "MultinomialNB": MultinomialNB(),
        "SVM": SVC(),
        "KNN": KNeighborsClassifier(),
        "Decision tree": DecisionTreeClassifier(),
        "Random Forest": RandomForestClassifier(),
        "XGBoost": XGBClassifier()
    }

    results = {}
    for clf_name, clf in classifiers.items():
        print(f"Training and evaluating {clf_name}...")
        cv_scores = cross_val_score(clf, X_train, y_train, cv=10, scoring='accuracy')
        print(f"Cross-validation scores: {cv_scores}")
        print(f"Mean cross-validation accuracy: {cv_scores.mean()}")

        clf.fit(X_train, y_train)

        y_pred = clf.predict(X_test)

        accuracy = accuracy_score(y_test, y_pred)
        recall = recall_score(y_test, y_pred)
        precision = precision_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred)

        results[clf_name] = {
            "Accuracy": accuracy,
            "Recall": recall,
            "Precision": precision,
            "F1 Score": f1
        }

    return results

def display_results(results):
    print("\nResults:")
    for clf_name, metrics in results.items():
        print(clf_name)
        for metric_name, value in metrics.items():
            print(f"{metric_name}: {value}")
        print()

def main(train_url, test_url):
    X, y = load_data(train_url, test_url)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    X_train_vec, X_test_vec = vectorize_text(X_train, X_test)  # Vectorize both training and test data
    results = train_and_evaluate_classifiers(X_train_vec, X_test_vec, y_train, y_test)
    display_results(results)

if __name__ == "__main__":
    train_data_url = 'https://raw.githubusercontent.com/sandeep5924/Assignments/main/stsa-train.txt'
    test_data_url = 'https://raw.githubusercontent.com/sandeep5924/Assignments/main/stsa-test.txt'
    main(train_data_url, test_data_url)







Training and evaluating MultinomialNB...
Cross-validation scores: [0.73428571 0.73142857 0.73104435 0.75822604 0.75107296 0.74248927
 0.74105866 0.74821173 0.74678112 0.74391989]
Mean cross-validation accuracy: 0.7428518291436745
Training and evaluating SVM...
Cross-validation scores: [0.79       0.79       0.79113019 0.79113019 0.79113019 0.79113019
 0.78969957 0.78969957 0.78969957 0.78969957]
Mean cross-validation accuracy: 0.7903319027181688
Training and evaluating KNN...
Cross-validation scores: [0.78285714 0.75857143 0.77253219 0.76251788 0.76824034 0.76537911
 0.7739628  0.76967096 0.77682403 0.77110157]
Mean cross-validation accuracy: 0.7701657469854896
Training and evaluating Decision tree...
Cross-validation scores: [0.68285714 0.68857143 0.71101574 0.68669528 0.68955651 0.6981402
 0.69384835 0.72246066 0.68955651 0.69957082]
Mean cross-validation accuracy: 0.6962272634375639
Training and evaluating Random Forest...
Cross-validation scores: [0.78428571 0.78       0.78540773 0

## **Question 2 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text clustering**.

Please downlad the dataset by using the following link.  https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
(You can also use different text data which you want)

**Apply the listed clustering methods to the dataset:**
*   K-means
*   DBSCAN
*   Hierarchical clustering
*   Word2Vec
*   BERT

You can refer to of the codes from  the follwing link below.
https://www.kaggle.com/karthik3890/text-clustering

In [24]:
# Write your code here
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import zipfile
import io

# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Load dataset from the archive
archive_path = "/content/archive.zip"
with zipfile.ZipFile(archive_path, 'r') as zip_ref:
    csv_file = zip_ref.open(zip_ref.namelist()[0])
    data = pd.read_csv(csv_file)

reviews = data['Reviews']

# Sample a subset of the data for faster processing
reviews_subset = reviews.sample(n=1000, random_state=42)

# Preprocess text data
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    tokens = word_tokenize(text.lower())
    tokens = [token for token in tokens if token.isalpha()]
    tokens = [token for token in tokens if token not in stop_words]
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return " ".join(tokens)

preprocessed_reviews = reviews_subset.apply(preprocess_text)

# TF-IDF Vectorization
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(preprocessed_reviews)

# K-means clustering
kmeans = KMeans(n_clusters=5, random_state=42)
kmeans_labels = kmeans.fit_predict(X)
print("K-means Silhouette Score:", silhouette_score(X, kmeans_labels))




[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


K-means Silhouette Score: 0.03744601767949588




In [25]:
# DBSCAN clustering
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X)
print("DBSCAN Silhouette Score:", silhouette_score(X, dbscan_labels))

DBSCAN Silhouette Score: 0.025936946962022367


In [26]:
# Hierarchical clustering
hierarchical = AgglomerativeClustering(n_clusters=5)
hierarchical_labels = hierarchical.fit_predict(X.toarray())
print("Hierarchical Silhouette Score:", silhouette_score(X, hierarchical_labels))


Hierarchical Silhouette Score: 0.011387879599253522


In [27]:
# Word2Vec clustering
word2vec_model = Word2Vec(sentences=[text.split() for text in preprocessed_reviews], vector_size=100, window=5, min_count=1, workers=4)
word_vectors = word2vec_model.wv
word2vec_clusters = KMeans(n_clusters=5, random_state=42).fit_predict(word_vectors.vectors)
print("Word2Vec Silhouette Score:", silhouette_score(word_vectors.vectors, word2vec_clusters))



Word2Vec Silhouette Score: 0.47346723


In [3]:
# BERT clustering
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')
encoded_inputs = bert_tokenizer(preprocessed_reviews.tolist(), padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    outputs = bert_model(**encoded_inputs)
    pooled_output = outputs.pooler_output



NameError: name 'BertTokenizer' is not defined

**In one paragraph, please compare the results of K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT.**

**Write your response here:**

.
The clustering performance and underlying methods used by K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT for grouping comparable data points vary. Traditional clustering algorithms like K-means, DBSCAN, and hierarchical clustering use various techniques to divide the data into clusters. While DBSCAN classifies dense regions as clusters based on a distance criterion, K-means allocates each data point to the closest centroid. A tree-like hierarchy of clusters is created using hierarchical clustering. Conversely, Word2Vec and BERT are embedding methods that identify semantic similarities within words or sentences. While BERT creates contextualized embeddings by taking the full sentence into account, Word2Vec creates dense vector representations of words based on their context.
.

.

.

.




# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.


**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question, write down your answer as detail as possible for the above questions):

'''
Please write you answer here:

During the exercises, a number of errors were encountered, such as NameError, AttributeError, ValueError, and ParserError. Programming errors like this are frequently encountered, particularly in the areas of data loading, preprocessing, and model training. Correcting file paths, making sure data is formatted correctly, and importing the required libraries are just a few of the suitable remedies that must be put in place in order to properly address these problems.



'''