### Project 3: Movie Review Sentiment Analysis
 Arun Giridharan - arunhg2 ; 651024910
 
 Shivani Mangaleswaran - sm131 ; 654099242 
 
 Srishti Sharma - srishti9 ; 663146421                                    


## Section 1

### 1. Data Preprocessing

The input datasets consist of OpenAI embeddings and raw review text:

Features: 1536-dimensional OpenAI text embeddings provided for each review.

Target: A binary label (sentiment), where 1 represents positive sentiment and 0 represents negative sentiment.

Input Files: train.csv for training and test.csv for predictions.

##### Key steps in preprocessing include:

Removing unnecessary columns: id (unique identifier) and review (raw text, not used for model training).

The features (embeddings) are already normalized and standardized, requiring no additional scaling.

The target variable sentiment is extracted for supervised learning.

The training features and target labels are prepared as:

X_train: A matrix with 1536 features per review.

y_train: Binary labels for sentiment classification.

The test data is prepared in the same way, without the target column.
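The preprocessing steps above can be sketched with pandas. This is a minimal illustration, not the project's actual script: the helper name `split_features_target`, the `embedding_*` column names, and the three-column toy frame (standing in for the 1536 embedding columns) are assumptions made for the example.

```python
import pandas as pd

def split_features_target(df, target_col="sentiment", drop_cols=("id", "review")):
    """Return (X, y): embedding columns only, plus labels when present."""
    y = df[target_col] if target_col in df.columns else None
    # Drop identifier, raw text, and the target; names that are absent are ignored.
    X = df.drop(columns=[*drop_cols, target_col], errors="ignore")
    return X, y

# Tiny stand-in for train.csv: 3 embedding columns instead of 1536 (hypothetical names).
toy_train = pd.DataFrame({
    "id": [1, 2],
    "sentiment": [1, 0],
    "review": ["great film", "dull plot"],
    "embedding_1": [0.1, -0.2],
    "embedding_2": [0.3, 0.0],
    "embedding_3": [-0.5, 0.4],
})
X_train, y_train = split_features_target(toy_train)
print(X_train.shape, y_train.tolist())
```

The same helper can be applied to a frame without a sentiment column, in which case it returns `None` for the labels.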

### 2. Model Implementation

The classification model used is Logistic Regression with Elastic Net regularization. Key details of the model are:

Elastic Net Regularization:

Elastic Net combines L1 (Lasso) and L2 (Ridge) penalties to handle high-dimensional data and reduce overfitting. The mixing parameter l1_ratio controls the balance between the two: 0 is pure L2 and 1 is pure L1.

Hyperparameter Tuning:

Hyperparameters are optimized using GridSearchCV, which performs 3-fold cross-validation on the training data. The grid includes:

l1_ratio: Values [0.1, 0.5, 0.7, 0.9].

C: Inverse regularization strength (smaller values mean stronger regularization), values [0.01, 0.1, 1, 10].

The optimal hyperparameters are those with the highest mean cross-validated Area Under the ROC Curve (AUC).

Model Training:

The model is trained using the saga solver, which supports the Elastic Net penalty and is well-suited for large datasets.

Predictions:

After training, the model generates probabilities for the positive sentiment class on the test data. These probabilities are saved in the required output format.
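The tuning and prediction steps described above might look roughly like the following scikit-learn sketch. The synthetic data from `make_classification`, the `max_iter` value, and the variable names are assumptions for illustration; only the grid values, the 3-fold setting, the AUC scoring, and the saga solver come from the description above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the embedding matrix (assumption: 200 rows, 20 features).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

param_grid = {
    "l1_ratio": [0.1, 0.5, 0.7, 0.9],
    "C": [0.01, 0.1, 1, 10],  # inverse regularization strength
}
# max_iter is an assumed value chosen to give saga room to converge.
base_model = LogisticRegression(penalty="elasticnet", solver="saga", max_iter=2000)
search = GridSearchCV(base_model, param_grid, cv=3, scoring="roc_auc")
search.fit(X, y)

# Probability of the positive class (column 1 of predict_proba).
positive_probs = search.predict_proba(X)[:, 1]
print(search.best_params_, round(search.best_score_, 4))
```

By default `GridSearchCV` refits the best configuration on the full training data, so `search.predict_proba` uses the selected model directly.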

### 3. Evaluation Metric

The model's performance is evaluated using the Area Under the ROC Curve (AUC), which measures the model's ability to distinguish between positive and negative reviews. A higher AUC indicates better performance.

The AUC scores achieved for each of the 5 dataset splits are as follows:

| Split | AUC Score |
|-------|-----------|
| 1     | 0.9871    |
| 2     | 0.9868    |
| 3     | 0.9864    |
| 4     | 0.987     |
| 5     | 0.9863    |

Average AUC across all splits: 0.9867.
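As a small illustration of the metric, the sketch below computes AUC on a toy example with `roc_auc_score` and then averages the per-split scores reported above. The toy labels and probabilities are made up for the example; only the five split scores come from this report.

```python
from statistics import mean
from sklearn.metrics import roc_auc_score

# Toy labels and predicted probabilities (illustrative values only).
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
# AUC equals the fraction of (negative, positive) pairs ranked in the right order;
# here 3 of the 4 pairs are ordered correctly.
toy_auc = roc_auc_score(y_true, y_prob)

# Per-split scores as reported in this section.
split_aucs = [0.9871, 0.9868, 0.9864, 0.987, 0.9863]
average_auc = round(mean(split_aucs), 4)
print(toy_auc, average_auc)
```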

### 4. Execution Time and System Specifications

The model was trained and evaluated on the following system:

System: MacBook Pro.

Processor: 2.3 GHz Intel Core i5.

Memory: 8 GB RAM.

Operating System: macOS Ventura 13.5.

Python Version: 3.8.

Execution Time:

Total execution time for all 5 splits: 1174.61 seconds (about 19.6 minutes).

## Section 2: Interpretability Approach


#### 1. Overview of the Interpretability Approach

In this implementation, we aim to provide insights into which parts of each review contributed the most to the sentiment classification made by our trained logistic regression model. We achieve this through the following steps:

Load a Pre-trained Model and Vectorizer:

The trained logistic regression model and its vectorizer are stored online in a .pkl file (on GitHub).
https://raw.githubusercontent.com/srishti1909/project3/main/best_model_vectorizer.pkl

The model is loaded using pickle and used for predictions.

Transform Reviews Using the Vectorizer:

The input test reviews are converted into a sparse document-term matrix using the pre-trained CountVectorizer.

Highlighting Contributing Words:

Words in each review are ranked based on their contribution to the final prediction.

Contribution is calculated using the feature coefficients learned by the model.

For each review, we extract the top 2 contributing words, ranked by absolute coefficient, whether the prediction is positive or negative.

These words are visually highlighted within the review text.

Randomly Selecting Reviews:

We randomly select 5 positive and 5 negative reviews based on the predictions.
The top contributing words in these reviews are highlighted and displayed using HTML.

#### 2. Implementation Details

Interpretability Method: Coefficient-based word importance.

The logistic regression model learns one coefficient per vocabulary word during training.
In the code below, the words present in a review are ranked by the absolute value of their learned coefficients; the word's frequency in the review does not enter the ranking.

Visualization:

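Words with positive coefficients (pushing toward positive sentiment) and words with negative coefficients (pushing toward negative sentiment) are both highlighted in bold on a yellow background.
Each review is framed in green when predicted positive and in red when predicted negative.

Reproducibility: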

The pre-trained model is stored in a remote repository (GitHub).
The script loads this model and vectorizer dynamically during execution.
Random seeds ensure the same reviews are selected each time.




#### 3. Code for Reproducibility

The model is stored on GitHub for easy access.
The test data must be available locally.
All required code for running the interpretability approach and visualizations is included in this file.

Below is the code:

# Import necessary libraries
import pandas as pd
import numpy as np
import pickle
import requests
import random
import re
import os
from IPython.display import display, HTML
import warnings
from sklearn.exceptions import InconsistentVersionWarning

# Suppress specific sklearn warnings
warnings.filterwarnings("ignore", category=InconsistentVersionWarning)

# Step 1: Load the Trained Model and Vectorizer
#print("Loading trained model and vectorizer...")
model_url = "https://raw.githubusercontent.com/srishti1909/project3/main/best_model_vectorizer.pkl"

# Note: unpickling can execute arbitrary code, so only load files from a trusted source
response = requests.get(model_url, timeout=60)
if response.status_code == 200:
    best_model, vectorizer = pickle.loads(response.content)  # Unpack model and vectorizer
    print("Model and vectorizer loaded successfully!")
else:
    raise RuntimeError(f"Failed to load model. HTTP Status Code: {response.status_code}")

# Define the base directory and construct path to test.csv dynamically
base_dir = os.getcwd()  # Current working directory
test_data_path = os.path.join(base_dir, "split_1", "test.csv")

# Load the test data
#print(f"Loading test data from: {test_data_path}")
test_data = pd.read_csv(test_data_path)
reviews = test_data["review"]
ids = test_data["id"]

# Transform reviews using the loaded vectorizer
X_test = vectorizer.transform(reviews)
predicted_probs = best_model.predict_proba(X_test)[:, 1]
predicted_sentiments = (predicted_probs >= 0.5).astype(int)

# Extract feature names and model coefficients
feature_names = vectorizer.get_feature_names_out()
coefficients = best_model.coef_[0]

# Step 2: Identify Top Words for Each Review
def get_top_words_for_review(review_vector, feature_names, coefficients, top_n=5):
    """Get the top N words in a review, ranked by the absolute value of their coefficients."""
    nonzero_indices = review_vector.nonzero()[1]  # Column indices of words present in the review
    word_contributions = [(feature_names[i], coefficients[i]) for i in nonzero_indices]
    sorted_contributions = sorted(word_contributions, key=lambda x: -abs(x[1]))  # Largest |coefficient| first
    return [word for word, _ in sorted_contributions[:top_n]]

def highlight_words_in_review(review, top_words):
    """Highlight the top words in the review, preserving the original casing."""
    if not top_words:
        return review
    # Match all words in one pass so the inserted <span> markup is never re-scanned
    word_pattern = re.compile(
        r'\b(' + '|'.join(re.escape(word) for word in top_words) + r')\b',
        re.IGNORECASE
    )
    return word_pattern.sub(
        lambda m: f'<span style="background-color: yellow; font-weight: bold;">{m.group(0)}</span>',
        review
    )

# Step 3: Randomly Select 5 Positive and 5 Negative Reviews
random.seed(42)
positive_indices = np.where(predicted_sentiments == 1)[0]
negative_indices = np.where(predicted_sentiments == 0)[0]
selected_indices = random.sample(list(positive_indices), 5) + random.sample(list(negative_indices), 5)

# Generate and display results
for idx in selected_indices:
    review_id = ids.iloc[idx]
    review_text = reviews.iloc[idx]
    prediction = predicted_probs[idx]
    sentiment = "Positive" if prediction >= 0.5 else "Negative"
    border_color = "green" if sentiment == "Positive" else "red"

    # Extract the top 2 contributing words for the review
    review_vector = X_test[idx]
    top_words = get_top_words_for_review(review_vector, feature_names, coefficients, top_n=2)
    highlighted_review = highlight_words_in_review(review_text, top_words)

    # Display results
    display(HTML(f"""
    <div style="border: 2px solid {border_color}; padding: 15px; margin-bottom: 10px; border-radius: 10px;">
        <h3>Review {review_id}</h3>
        <p><strong>Sentiment:</strong> {sentiment}</p>
        <p><strong>Prediction:</strong> {prediction:.4f}</p>
        <p style="font-size: 1.1em;">{highlighted_review}</p>
    </div>
    """))

print("Interpretability visualization complete!")


Model and vectorizer loaded successfully!


Interpretability visualization complete!



#### 4. Advantages of the Approach

Simplicity: Logistic Regression coefficients provide a clear and interpretable measure of feature importance.

Efficiency: The approach is computationally efficient as it directly uses learned coefficients.

Visual Clarity: Highlighting words makes it easy for users to understand what drives the prediction.
