# Kaggle Amazon Movie Reviews Prediction Project

## Overview

This project predicts Amazon movie review scores using machine learning. Our approach combines feature engineering, model selection, and hyperparameter tuning to improve predictive accuracy on the highly imbalanced multi-class dataset we were given. We use metadata, text-based features, and sentiment analysis to create meaningful features, and we evaluate a range of models to identify the most effective approach.

---


### Helpfulness Features
Our helpfulness features aim to capture how valuable users found each review. The `HelpfulnessRatio` is the number of helpful votes (`HelpfulnessNumerator`) divided by the total number of votes (`HelpfulnessDenominator`), so it reflects the engagement and feedback a review received. It is most useful for reviews with many votes, where it adds insight into which reviews readers found informative and relevant. Moreover, we can smooth the helpfulness score by adding 1 to the denominator. This prevents division errors when `HelpfulnessDenominator = 0` and keeps the feature numerically stable.

> **Rationale**: A high helpfulness ratio suggests that a review resonates with readers, so it may correlate with certain star ratings, particularly the extremes (1 or 5 stars), where readers are likely to agree or disagree strongly with the content. This feature helps capture that signal.

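The ratio described above can be sketched in a few lines of pandas. This is a minimal illustration rather than the project's actual code: the function name `add_helpfulness_ratio` and the toy data are assumptions, while the two column names come from the dataset description above.

```python
import pandas as pd


def add_helpfulness_ratio(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a smoothed HelpfulnessRatio column."""
    out = df.copy()
    # Adding 1 to the denominator keeps the ratio defined for reviews with no votes.
    out["HelpfulnessRatio"] = out["HelpfulnessNumerator"] / (
        out["HelpfulnessDenominator"] + 1
    )
    return out


# Toy data (hypothetical): no votes, 3 of 3 helpful, 9 of 9 helpful.
reviews = pd.DataFrame(
    {"HelpfulnessNumerator": [0, 3, 9], "HelpfulnessDenominator": [0, 3, 9]}
)
print(add_helpfulness_ratio(reviews)["HelpfulnessRatio"].tolist())
# prints [0.0, 0.75, 0.9]
```

A side effect of the `+ 1` is that reviews with few votes are pulled towards zero: a unanimous 3-vote review scores 0.75, while a unanimous 9-vote review scores 0.9.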
### Time-Based Features
- **ReviewYear and ReviewMonth**: Time-based features let models pick up on temporal patterns that often influence the scores users give. For example:
  - More popular listings might receive higher scores soon after release, with scores potentially decreasing as the hype dies down.
  - There may be seasonal effects, especially for movies or holiday-related products, where scores fluctuate with release dates or time of year. Christmas decorations, for instance, will probably receive higher ratings in December than in July, since reviewers are in the Christmas spirit and feel more positive.

> **Rationale**: These two patterns are especially useful for models like Random Forest (RF) or Gradient Boosting (GB), because these models can exploit non-linear relationships and capture seasonality or trends in the dataset.

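A possible way to derive `ReviewYear` and `ReviewMonth` is sketched below. It assumes the raw data has a `Time` column holding Unix timestamps in seconds; that column name and unit are assumptions about the dataset, and the helper name `add_time_features` is hypothetical.

```python
import pandas as pd


def add_time_features(df: pd.DataFrame, time_col: str = "Time") -> pd.DataFrame:
    """Return a copy of df with ReviewYear and ReviewMonth columns."""
    out = df.copy()
    # Assumption: time_col stores Unix timestamps in seconds (UTC).
    timestamps = pd.to_datetime(out[time_col], unit="s")
    out["ReviewYear"] = timestamps.dt.year
    out["ReviewMonth"] = timestamps.dt.month
    return out


# 1354363200 is 2012-12-01 12:00:00 UTC.
demo = add_time_features(pd.DataFrame({"Time": [1354363200]}))
print(demo[["ReviewYear", "ReviewMonth"]])
```

Keeping the month as an integer from 1 to 12 is enough for tree-based models, which can split on it directly.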
### Text Analysis Features
- **TextLength and SummaryLength**: `TextLength` and `SummaryLength` measure the verbosity and engagement of the reviewer. Longer reviews can signal greater engagement and stronger opinions, which could lean towards either end of the scoring spectrum.

  - **Word Count**: Counting the words in the text and summary offers a slightly different perspective: it indicates descriptiveness independently of character length (for example, a short review with impactful words).

> **Rationale**: These features add depth to our analysis by giving models indicators of engagement, which is an important aspect of reviews. Higher verbosity can accompany either higher or lower ratings, because users are more likely to share detailed insights about extreme experiences. This should improve performance, especially for classifiers that benefit from richer feature sets, such as Extra Trees (ET) and Logistic Regression (LR).

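The length and word-count features above could be computed roughly as follows. The source columns `Text` and `Summary` are assumptions about the raw data, the helper name is hypothetical, and splitting on whitespace is one simple way to count words.

```python
import pandas as pd


def add_text_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with character-length and word-count columns."""
    out = df.copy()
    for source, length_col, words_col in [
        ("Text", "TextLength", "NumWordsText"),
        ("Summary", "SummaryLength", "NumWordsSummary"),
    ]:
        # Treat missing reviews as empty strings so they count as length 0.
        text = out[source].fillna("").astype(str)
        out[length_col] = text.str.len()
        # Whitespace tokenization: a simple word count, not a linguistic one.
        out[words_col] = text.str.split().str.len()
    return out


demo = add_text_features(
    pd.DataFrame({"Text": ["Great movie, loved it", None], "Summary": ["Great", "Bad film"]})
)
print(demo[["TextLength", "NumWordsText", "SummaryLength", "NumWordsSummary"]])
```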
### Sentiment Analysis
- **Text Polarity (Sentiment Score)**: We use TextBlob’s sentiment analysis to calculate each review’s polarity, a score between -1 (negative) and 1 (positive). Because positive reviews generally correlate with higher scores and negative reviews with lower scores, this feature should be a strong predictor of review scores.

> **Rationale**: Sentiment polarity helps our models differentiate between positive and negative language, which is crucial for accurate predictions. It adds a qualitative dimension that purely numeric features might otherwise miss, helping algorithms like Gradient Boosting (GB) capture complex relationships.
---

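The `TextPolarity` feature from the Sentiment Analysis section might be computed along these lines. `TextBlob(text).sentiment.polarity` is TextBlob's own attribute; everything else is an illustrative assumption: the `Text` column name, the helper names, and the pluggable `scorer` argument (added so the helper can be tried with any scoring function, with TextBlob imported only when its scorer is actually called).

```python
from typing import Callable

import pandas as pd


def textblob_polarity(text: str) -> float:
    """Polarity in [-1, 1] according to TextBlob."""
    # Imported lazily so add_text_polarity can also be used with another scorer.
    from textblob import TextBlob

    return TextBlob(text).sentiment.polarity


def add_text_polarity(
    df: pd.DataFrame, scorer: Callable[[str], float] = textblob_polarity
) -> pd.DataFrame:
    """Return a copy of df with a TextPolarity column."""
    out = df.copy()
    # Missing reviews are scored as empty strings.
    out["TextPolarity"] = out["Text"].fillna("").astype(str).map(scorer)
    return out
```

Calling `add_text_polarity(df)` uses TextBlob by default. Scoring each review is comparatively slow on a large dataset, so it can be worth computing this column once and caching it.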
### Selected Features
The selected features (**Helpfulness, ReviewYear, ReviewMonth, TextLength, SummaryLength, NumWordsText, NumWordsSummary, HelpfulnessRatio, and TextPolarity**) provide a comprehensive description of each review. We chose these features to capture a balanced mix of quantitative and qualitative characteristics.

These features enable models to capture diverse aspects of the review content, structure, and context, enhancing prediction accuracy by giving the model a fuller picture of each review.

### Standardization
To maintain feature consistency in our models, we applied **StandardScaler** to standardize the data. Standardization offers several advantages:

- **Uniform Scale**: Standardizing rescales each feature to a mean of 0 and a standard deviation of 1. This helps in models like K-Nearest Neighbors and Logistic Regression, where feature scales affect distance calculations and linear coefficients.

- **Equal Weighting**: By putting all features on a comparable scale, we prevent features with large numeric ranges (e.g., `TextLength` or `NumWordsText`) from dominating others with smaller ranges, like `TextPolarity`. Standardization allows each feature to contribute equitably to the model, improving overall accuracy.

- **Optimization for Certain Models**: Distance-based models (e.g., **KNN**) and gradient-based models (e.g., **Logistic Regression**) benefit greatly from standardization. Consistent scaling reduces convergence time during training and improves result stability.

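A minimal sketch of the standardization step follows. Fitting the scaler on the training split only and reusing it on the test split is an assumption about how the step was applied (it is the usual way to avoid leaking test statistics); the helper name and toy values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


def scale_features(X_train, X_test):
    """Standardize both splits using statistics learned from the training split."""
    scaler = StandardScaler()
    # fit_transform learns each column's mean and standard deviation from X_train.
    X_train_scaled = scaler.fit_transform(X_train)
    # transform reuses those training statistics on the test split.
    X_test_scaled = scaler.transform(X_test)
    return X_train_scaled, X_test_scaled, scaler


# Toy columns on very different scales, e.g. TextLength and TextPolarity.
X_train_demo = np.array([[100.0, 0.1], [300.0, -0.1], [200.0, 0.0]])
X_test_demo = np.array([[200.0, 0.0]])
train_scaled, test_scaled, _ = scale_features(X_train_demo, X_test_demo)
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
```

After scaling, both training columns have mean 0 and standard deviation 1 (up to floating-point error), so neither dominates a distance computation.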
### Summary
By combining these balanced, mixed features with a structured approach to standardization, our feature engineering and preprocessing create an efficient and accurate model pipeline. This setup supports model interpretability and robustness, and it allows our selected algorithms to use both numerical insights (`TextLength`, `NumWordsText`) and textual nuances (`TextPolarity`). The balanced approach improves predictive performance on the Kaggle competition by capturing contextual depth and linguistic sentiment in the reviews.

---

## Models and Methodology

### Models Tested
To assess model performance, we tested five main algorithms, each with its own strengths:

1. **K-Nearest Neighbors (KNN)**: 
   - **Purpose**: KNN leverages the similarity of data points to capture **local patterns** within the dataset. Given our rich set of sentiment and text-based features, KNN can group similar reviews based on shared characteristics.
   - **Hyperparameters**: We specifically tuned `n_neighbors` to identify the optimal neighborhood size, alongside the `weights` parameter to adjust distance weighting.
   
2. **Random Forest (RF)**: 
   - **Purpose**: RF is robust in handling both **high-dimensional** data and mixed feature types (e.g., categorical and continuous). It’s also effective at managing **non-linear relationships** between features, providing reliable and stable predictions.
   - **Feature Importance**: With built-in feature importance measures, RF prioritizes the most informative features, such as `TextPolarity` and `HelpfulnessRatio`, making it valuable for feature selection.
   
3. **Logistic Regression (LR)**:
   - **Purpose**: As a **linear baseline model**, LR offers interpretability and insight into linear relationships. Adding `class_weight='balanced'` addresses **class imbalance**, making it useful for applications where underrepresented classes affect prediction accuracy.
   - **Regularization**: The `C` parameter, adjusted through hyperparameter tuning, controls regularization strength to prevent overfitting while maintaining interpretability.
   
4. **Gradient Boosting (GB)**:
   - **Purpose**: With its iterative learning process, GB sequentially improves the model by **combining weak learners** into a robust final model. This model is particularly effective for capturing **non-linear relationships** and managing complex interactions between features.
   - **Hyperparameters**: Key parameters include `n_estimators`, `learning_rate`, and `max_depth` to control tree complexity and ensure performance optimization.
   
5. **Extra Trees (ET)**:
   - **Purpose**: ET is similar to RF but introduces extra randomness in tree construction, potentially enhancing generalization. This approach makes it more computationally efficient and suitable for **high-dimensional datasets** with many features.
   - **Hyperparameters**: Tuning focused on `n_estimators` and `max_depth` to control the number of trees and their complexity, balancing accuracy and training speed.

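The built-in feature importances mentioned for Random Forest can be inspected as sketched below. The data here is synthetic and exists only to make the example runnable; `rank_features` is a hypothetical helper and `n_estimators=100` is an illustrative value, not a tuned setting from the project.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def rank_features(X: pd.DataFrame, y, random_state: int = 0) -> pd.Series:
    """Fit a Random Forest and return impurity-based importances, largest first."""
    model = RandomForestClassifier(n_estimators=100, random_state=random_state)
    model.fit(X, y)
    importances = pd.Series(model.feature_importances_, index=X.columns)
    return importances.sort_values(ascending=False)


# Synthetic stand-in data: the label depends only on TextPolarity.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(
    {
        "TextPolarity": rng.uniform(-1, 1, size=300),
        "ReviewMonth": rng.integers(1, 13, size=300),
    }
)
y_demo = (X_demo["TextPolarity"] > 0).astype(int)
ranking = rank_features(X_demo, y_demo)
print(ranking)
```

The importances sum to 1, so each value can be read as a feature's share of the total impurity reduction across the forest.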
### Hyperparameter Tuning Strategy
Our hyperparameter tuning strategy relied on **RandomizedSearchCV**:
- **Efficiency**: Unlike exhaustive grid search, RandomizedSearchCV allows exploring a **broad parameter space** within a feasible time.
- **Key Parameters**:
   - **KNN**: `n_neighbors`, `weights`
   - **RF and ET**: `n_estimators`, `max_depth`
   - **LR**: `C` for regularization
   - **GB**: `learning_rate`, `n_estimators`, `max_depth`, and `subsample` for tree control and overfitting prevention

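As a rough sketch of this tuning strategy, the snippet below runs `RandomizedSearchCV` over the four Gradient Boosting parameters listed above. The search ranges, `n_iter`, `cv`, the accuracy scoring, and the synthetic data are all illustrative assumptions, not the values used in the project.

```python
from scipy.stats import loguniform, randint
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV


def tune_gradient_boosting(X, y, n_iter=10, cv=3, n_jobs=-1, random_state=0):
    """Randomly sample n_iter parameter settings and keep the best by CV accuracy."""
    # Assumed search space: distributions are sampled, lists are chosen from uniformly.
    param_distributions = {
        "learning_rate": loguniform(1e-2, 3e-1),
        "n_estimators": randint(50, 300),
        "max_depth": randint(2, 6),
        "subsample": [0.6, 0.8, 1.0],
    }
    search = RandomizedSearchCV(
        GradientBoostingClassifier(random_state=random_state),
        param_distributions=param_distributions,
        n_iter=n_iter,
        cv=cv,
        scoring="accuracy",
        n_jobs=n_jobs,
        random_state=random_state,
    )
    search.fit(X, y)
    return search


# Small synthetic 3-class problem; n_jobs=1 keeps the toy run lightweight.
X_demo, y_demo = make_classification(
    n_samples=120, n_features=6, n_informative=3, n_redundant=0,
    n_classes=3, random_state=0,
)
search = tune_gradient_boosting(X_demo, y_demo, n_iter=3, cv=2, n_jobs=1)
print(search.best_params_, round(search.best_score_, 3))
```

The cost is fixed at `n_iter * cv` model fits regardless of how wide the ranges are, which is what makes the broad search space affordable compared with a full grid.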
### Special Techniques
1. **Parallel Processing**: 
   - We leveraged `parallel_backend` with `n_jobs=-1` to distribute computations across multiple cores, reducing training time for each model.
   
2. **Class Weight Balancing**:
   - To counter class imbalance, especially with Logistic Regression, we applied `class_weight='balanced'`. This ensured that underrepresented classes received proportionate attention during training, enhancing recall for minority classes.

3. **Feature Engineering**:
   - By incorporating both **quantitative** (e.g., `TextLength`, `HelpfulnessRatio`) and **qualitative** (e.g., `TextPolarity`) features, we provided models with a rich view of each review. This mix allows models to discern subtle patterns within the language and structure of reviews.

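The first two techniques above can be combined in one short sketch: joblib's `parallel_backend` context sets the number of workers, and `class_weight='balanced'` reweights the classes inside Logistic Regression. The `"loky"` backend, the `balanced_accuracy` metric, `max_iter=1000`, and the synthetic imbalanced data are assumptions made for illustration.

```python
from joblib import parallel_backend
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def cross_validate_balanced_lr(X, y, cv=3, n_jobs=-1):
    """Cross-validate a class-weighted Logistic Regression inside a joblib backend."""
    # 'balanced' weights each class inversely to its frequency in y.
    model = LogisticRegression(class_weight="balanced", max_iter=1000)
    # scikit-learn calls that leave n_jobs unset pick up n_jobs from this context.
    with parallel_backend("loky", n_jobs=n_jobs):
        scores = cross_val_score(model, X, y, cv=cv, scoring="balanced_accuracy")
    return scores


# Synthetic imbalanced 3-class data (roughly 70% / 20% / 10%).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0,
    n_classes=3, weights=[0.7, 0.2, 0.1], random_state=0,
)
# n_jobs=1 keeps the toy run sequential; pass -1 to use every available core.
scores = cross_validate_balanced_lr(X_demo, y_demo, cv=3, n_jobs=1)
print(scores)
```

Balanced accuracy averages recall over the classes, so it shows whether the reweighting actually helps the minority classes instead of rewarding predictions of the majority class.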
### Summary of Key Choices
Our approach combines **robust feature engineering** with **well-tuned models** to create a balanced pipeline that prioritizes interpretability, efficiency, and predictive power:
- Feature engineering yields a comprehensive set of both **numerical** and **text-based** insights, giving models the capability to detect nuanced relationships in user reviews.
- A diverse set of models allowed us to leverage each algorithm’s strengths, with KNN capturing local similarities, RF and ET handling non-linear feature interactions, and GB and LR balancing complexity with interpretability. This creates a well-rounded set of candidates that can adapt to the dataset’s unique characteristics.
- Hyperparameter tuning optimized model performance without unnecessary overfitting, while preprocessing ensured data consistency and model robustness. This approach maximizes predictive accuracy while maintaining model stability and interpretability.

> **Final Model Selection**: After training and evaluating each model, we selected the one with the highest validation score as our final estimator. This balanced approach, enriched with sentiment and helpfulness insights, positions our model strongly on Kaggle by drawing from both **linguistic depth** and **statistical precision**.

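The final selection step described above could look roughly like this. It assumes a single held-out validation split scored by accuracy; the metric, the helper name `select_best_model`, and the two toy candidates are illustrative assumptions, not the project's actual setup.

```python
from sklearn.base import clone
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier


def select_best_model(candidates, X_train, y_train, X_val, y_val):
    """Fit every candidate and return the one with the highest validation accuracy."""
    scores, fitted = {}, {}
    for name, estimator in candidates.items():
        # clone() leaves the caller's estimator objects unfitted.
        model = clone(estimator).fit(X_train, y_train)
        scores[name] = accuracy_score(y_val, model.predict(X_val))
        fitted[name] = model
    best_name = max(scores, key=scores.get)
    return best_name, fitted[best_name], scores


# Toy data: the label is 1 when the single feature is at least 10.
X_all = [[i] for i in range(20)]
y_all = [0] * 10 + [1] * 10
X_train_demo, y_train_demo = X_all[0::2], y_all[0::2]
X_val_demo, y_val_demo = X_all[1::2], y_all[1::2]

candidates = {
    "dummy": DummyClassifier(strategy="most_frequent"),
    "tree": DecisionTreeClassifier(random_state=0),
}
best_name, best_model, scores = select_best_model(
    candidates, X_train_demo, y_train_demo, X_val_demo, y_val_demo
)
print(best_name, scores)
```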