
# **INFO5731 In-class Exercise 5**

**This exercise aims to provide a comprehensive learning experience in text analysis and machine learning techniques, focusing on both text classification and clustering tasks.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## **Question 1 (20 Points)**

The purpose of this question is to practice different machine learning algorithms for **text classification** as well as performance evaluation. In addition, you are required to conduct **10-fold cross-validation** (https://scikit-learn.org/stable/modules/cross_validation.html) during training.



The dataset can be downloaded from Canvas. It contains two files, training data and test data, for sentiment analysis of IMDB reviews, with two categories: 1 represents positive and 0 represents negative. You need to split the training data into training and validation sets (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10-fold cross-validation while training the classifier. The final trained model is then evaluated on the test data.
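
As a minimal sketch of the workflow described above, the following splits a labeled corpus 80/20 and runs 10-fold cross-validation on the training portion with scikit-learn. The toy `texts` and `labels` are placeholders rather than the IMDB data, and the TF-IDF plus `MultinomialNB` pipeline is only one assumed choice of model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder corpus (assumption): 50 positive and 50 negative toy reviews.
texts = [f"great wonderful film number {i}" for i in range(50)] + \
        [f"boring terrible film number {i}" for i in range(50)]
labels = [1] * 50 + [0] * 50

# 80% training / 20% validation, stratified so both classes keep their ratio.
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# Keep the vectorizer inside the pipeline so each fold fits TF-IDF
# only on its own training part.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# 10-fold cross-validation on the training portion only.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")
print(f"CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")

# Refit on all training data, then check the held-out validation split.
model.fit(X_train, y_train)
val_accuracy = model.score(X_val, y_val)
print(f"Validation accuracy: {val_accuracy:.3f}")
```

The test file stays untouched until the very end, so the cross-validation and validation scores guide model selection without leaking test information.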


**Algorithms:**

*   MultinomialNB
*   SVM
*   KNN
*   Decision tree
*   Random Forest
*   XGBoost
*   Word2Vec
*   BERT
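
One possible way to compare the scikit-learn classifiers from this list is to wrap each in the same TF-IDF pipeline and score it on the validation split. This is a sketch under assumptions: the tiny `train_texts` and `val_texts` lists are placeholders, `LinearSVC` stands in for "SVM", and the hyperparameters are arbitrary. XGBoost, Word2Vec, and BERT are left out because they need additional packages.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Placeholder data (assumption): replace with the real training/validation splits.
train_texts = ["great film", "wonderful acting", "loved it", "superb story",
               "boring film", "terrible acting", "hated it", "dull story"]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
val_texts = ["great story", "boring acting terrible"]
val_labels = [1, 0]

# Hyperparameters below are illustrative, not tuned.
classifiers = {
    "MultinomialNB": MultinomialNB(),
    "SVM": LinearSVC(),
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "Decision tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

val_accuracy = {}
for name, clf in classifiers.items():
    # Same feature extraction for every model so the comparison is fair.
    pipeline = make_pipeline(TfidfVectorizer(), clf)
    pipeline.fit(train_texts, train_labels)
    val_accuracy[name] = pipeline.score(val_texts, val_labels)
    print(f"{name}: validation accuracy = {val_accuracy[name]:.2f}")
```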

**Evaluation measurement:**


*   Accuracy
*   Recall
*   Precision
*   F-1 score
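
To make the four measures concrete, here is a small sketch that computes them from raw predictions for the binary case. The helper name `binary_metrics` and the example labels are made up for illustration; in practice scikit-learn's `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` do the same job.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative labels: 1 = positive review, 0 = negative review.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Here there are 2 true positives, 1 false positive, 2 false negatives, and 3 true negatives, so accuracy is 5/8, precision is 2/3, recall is 1/2, and F1 is 4/7.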


In [None]:
import pandas as pd

# Step 1: Load the dataset without any aggressive cleaning for now
train_file_path = '/content/stsa-train.txt'

try:
    # Load each line whole into a single 'raw' column; every line has the form "<label> <review text>"
    with open(train_file_path, 'r', encoding='utf-8') as f:
        train_data = pd.DataFrame({'raw': [line.strip() for line in f if line.strip()]})
    print(f"\nLoaded training data with {len(train_data)} rows.")
except Exception as e:
    print(f"Error loading training data: {e}")

# Step 2: Inspect the raw data to understand its structure
print("\nFirst few rows of the raw training data:")
print(train_data.head())

# Step 3: Process the raw data without aggressive filtering
try:
    # Check if raw data contains any text
    if train_data['raw'].notnull().any():
        # Instead of strictly extracting the first character as the label, let's explore the possibility of spaces separating the label and text
        train_data['label'] = train_data['raw'].apply(lambda x: x.split()[0] if len(x.split()) > 1 else None)  # Take the first word as label
        train_data['text'] = train_data['raw'].apply(lambda x: ' '.join(x.split()[1:]) if len(x.split()) > 1 else None)  # Take the rest as text

        # Clean the data (remove rows with missing labels or text)
        train_data.dropna(subset=['label', 'text'], inplace=True)

        # Ensure labels are numeric and clean any invalid rows
        train_data['label'] = pd.to_numeric(train_data['label'], errors='coerce')
        train_data.dropna(subset=['label'], inplace=True)  # Drop rows with invalid labels
        train_data['label'] = train_data['label'].astype(int)  # Ensure label is integer type

        print(f"Training Data Size After Cleaning: {len(train_data)}")
    else:
        print("Error: 'raw' column contains no text.")
except Exception as e:
    print(f"Error processing training data: {e}")

# Debug: Check the structure of train_data after cleaning
print("\nTraining Data Sample After Cleaning:")
print(train_data.head())




Loaded training data with 3169 rows.

First few rows of the raw training data:
                                                                                                                                                                  raw
1 a          stirring    ,       funny    and          finally transporting re-imagining of             beauty  and   the           beast  and   1930s  horror  films
0 apparently reassembled from    the      cutting-room floor   of           any          given          daytime soap  .             NaN    NaN   NaN    NaN       NaN
1 jonathan   parker      's      bartleby should       have    been         the          be-all-end-all of      the   modern-office anomie films .      NaN       NaN
0 a          fan         film    that     for          the     uninitiated  plays        better         on      video with          the    sound turned down        .
1 béart      and         berling are      both         superb  ,            while        h

## **Question 2 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text clustering**.

Please download the dataset using the following link: https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
(You can also use different text data which you want)

**Apply the listed clustering methods to the dataset:**
*   K-means
*   DBSCAN
*   Hierarchical clustering
*   Word2Vec
*   BERT

You can refer to the code at the following link.
https://www.kaggle.com/karthik3890/text-clustering

In [None]:
import pandas as pd
from sklearn.cluster import KMeans, DBSCAN
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import silhouette_score

# Step 1: Read and inspect the data
reviews_path = '/content/stsa-train.txt'

# Read the first 10 lines of the dataset to understand its structure
with open(reviews_path, 'r') as file:
    lines = [file.readline().strip() for _ in range(10)]
    print("First 10 lines of the dataset:\n")
    print("\n".join(lines))

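# Each line is "<label> <review text>": split on the first whitespace only so the review stays in one column
with open(reviews_path, 'r', encoding='utf-8') as file:
    rows = [line.strip().split(maxsplit=1) for line in file if line.strip()]
reviews_data = pd.DataFrame([row for row in rows if len(row) == 2], columns=['label', 'text'])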

# Check if data is loaded correctly
print(f"\nLoaded dataset with {reviews_data.shape[0]} rows and {reviews_data.shape[1]} columns.")
print("\nFirst few rows of the dataset:\n", reviews_data.head())

# Step 2: Drop rows with NaN values in the 'text' column
reviews_data.dropna(subset=['text'], inplace=True)

# Step 3: Convert text to feature vectors using CountVectorizer
vectorizer = CountVectorizer()
text_vectors = vectorizer.fit_transform(reviews_data['text'])

# Step 4: Apply clustering algorithms

# KMeans clustering
kmeans = KMeans(n_clusters=5, random_state=42).fit(text_vectors)

# DBSCAN clustering
dbscan = DBSCAN(eps=1.5, min_samples=5).fit(text_vectors)

# Hierarchical clustering
linkage_matrix = linkage(text_vectors.toarray(), method='ward')
hierarchical_clusters = fcluster(linkage_matrix, t=5, criterion='maxclust')

# Step 5: Evaluate clustering using Silhouette Score
kmeans_silhouette = silhouette_score(text_vectors, kmeans.labels_)

# Only calculate silhouette score for DBSCAN if it has more than one label
if len(set(dbscan.labels_)) > 1:
    dbscan_silhouette = silhouette_score(text_vectors, dbscan.labels_)
else:
    dbscan_silhouette = 'Invalid (single label)'

hierarchical_silhouette = silhouette_score(text_vectors, hierarchical_clusters)

# Print Silhouette Scores
print("\nSilhouette Scores:")
print(f"KMeans Silhouette Score: {kmeans_silhouette}")
print(f"DBSCAN Silhouette Score: {dbscan_silhouette}")
print(f"Hierarchical Clustering Silhouette Score: {hierarchical_silhouette}")



First 10 lines of the dataset:

1 a stirring , funny and finally transporting re-imagining of beauty and the beast and 1930s horror films
0 apparently reassembled from the cutting-room floor of any given daytime soap .
0 they presume their audience wo n't sit still for a sociology lesson , however entertainingly presented , so they trot out the conventional science-fiction elements of bug-eyed monsters and futuristic women in skimpy clothes .
1 this is a visually stunning rumination on love , memory , history and the war between art and commerce .
1 jonathan parker 's bartleby should have been the be-all-end-all of the modern-office anomie films .
1 campanella gets the tone just right -- funny in the middle of sad in the middle of hopeful .
0 a fan film that for the uninitiated plays better on video with the sound turned down .
1 béart and berling are both superb , while huppert ... is magnificent .
0 a little less extreme than in the past , with longer exposition sequences between the

**In one paragraph, please compare the results of K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT.**

**Write your response here:**

K-means, DBSCAN, and hierarchical clustering are clustering algorithms, whereas Word2Vec and BERT are embedding methods whose vectors still have to be passed to a clustering algorithm, so the comparison covers two different stages of the pipeline. K-means works well with well-separated, roughly spherical clusters but struggles with irregular shapes or noise, and the number of clusters must be chosen in advance. DBSCAN can find clusters of arbitrary shape and label outliers as noise, but its results depend heavily on tuning `eps` and `min_samples`. Hierarchical clustering offers detailed insight through a dendrogram but becomes inefficient on large datasets. Word2Vec creates dense word vectors from local context, but it assigns each word a single vector regardless of the sentence and cannot represent out-of-vocabulary words. BERT, a transformer model, captures sentence context well thanks to its bidirectional attention and generally performs better on text tasks, though it is more computationally expensive than the others.
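
The difference in how K-means and DBSCAN treat noise can be seen in a small sketch. It uses synthetic 2-D points rather than review vectors, and the `eps` and `min_samples` values are assumptions picked to suit this toy data.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

# Toy 2-D points (assumption): two tight groups plus one far-away outlier.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=(0.0, 0.0), scale=0.1, size=(20, 2))
group_b = rng.normal(loc=(5.0, 5.0), scale=0.1, size=(20, 2))
outlier = np.array([[20.0, -20.0]])
points = np.vstack([group_a, group_b, outlier])

# K-means has no noise label: every point, including the outlier,
# is assigned to one of the requested clusters.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(points)

# DBSCAN marks points without enough close neighbours as noise (label -1).
dbscan_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(points)

print("K-means label of the outlier:", kmeans_labels[-1])
print("DBSCAN label of the outlier:", dbscan_labels[-1])
print("DBSCAN clusters found:", len(set(dbscan_labels) - {-1}))
```

With real text, `points` would be the document vectors (counts, TF-IDF, or embeddings), and `eps` would need retuning for that feature space.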












# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.


**(Your submission will not be graded if this question is left unanswered)**



This exercise was a valuable experience in applying machine learning techniques for text analysis, particularly sentiment classification and clustering. It allowed me to explore different models, from traditional algorithms like Naive Bayes and SVM to advanced methods like BERT. I found the cross-validation process insightful in understanding model performance and variability. Additionally, the clustering task provided a practical comparison of different methods and their strengths, such as K-means for efficiency versus DBSCAN's handling of noise. Overall, this exercise deepened my understanding of text data processing, model evaluation, and the importance of selecting the right approach for different tasks.