Analysis:

Overview of Modeling Approach

Building on the cleaned and structured dataset described in previous sections, our main objective was to construct a predictive model capable of determining, at the moment a review is posted, whether it will ultimately receive more than five helpful votes. Because the “helpful” outcome is rare in the dataset and occurs long after posting, we restricted our features strictly to those available at posting time, including numeric attributes (rating, verification status, text length), reviewer- and product-level aggregates, and the review summary processed through TF-IDF and dimensionality reduction.

Given the chronological nature of reviews, we designed our workflow to mimic realistic deployment conditions: models train on earlier reviews and forecast outcomes for later ones. This time-based structure prevents information leakage and ensures our evaluation reflects how such a system would perform in practice.
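
A minimal sketch of such a time-based split is shown below. The column names (review_time, rating), the toy rows, and the 50/50 cutoff are illustrative assumptions, not the report's actual schema or split ratio.

```python
import pandas as pd

def chronological_split(df, time_col, train_frac=0.8):
    """Return (earlier rows, later rows) after sorting by time_col."""
    ordered = df.sort_values(time_col, kind="stable").reset_index(drop=True)
    cutoff = int(len(ordered) * train_frac)
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]

# Toy data with hypothetical column names.
reviews = pd.DataFrame({
    "review_time": pd.to_datetime([
        "2014-03-01", "2013-01-15", "2015-07-09",
        "2013-11-30", "2014-09-21", "2015-02-02",
    ]),
    "rating": [5, 3, 4, 1, 5, 2],
})

train, test = chronological_split(reviews, "review_time", train_frac=0.5)
print(train["review_time"].max(), "<", test["review_time"].min())
```

Every training row precedes every test row, so nothing from the test window can influence fitting.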

Preprocessing and Feature Construction

To prepare the data for modeling, we combined numerical and textual information in a unified preprocessing pipeline. Numerical variables were standardized to ensure comparability across features. Reviewer and product-level statistics such as average historical rating and typical helpful vote behavior were computed only within the training window to avoid leaking future information.
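
One way to keep such aggregates leakage-free is to fit them on the training window and only look them up for later rows. The sketch below assumes hypothetical column names (product_id, rating) and falls back to the training-set mean for products unseen during training; it is a simplification, since a stricter version would also restrict each training row to reviews posted before it.

```python
import pandas as pd

def add_train_only_aggregate(train, test, key, value, name):
    """Attach a per-key mean computed from the training window only."""
    stats = train.groupby(key)[value].mean()   # fitted on train only
    fallback = train[value].mean()             # used for unseen keys
    train_out = train.assign(**{name: train[key].map(stats)})
    test_out = test.assign(**{name: test[key].map(stats).fillna(fallback)})
    return train_out, test_out

# Toy frames with hypothetical column names.
train_df = pd.DataFrame({"product_id": ["A", "A", "B"], "rating": [4, 2, 5]})
test_df = pd.DataFrame({"product_id": ["A", "C"], "rating": [3, 4]})

train_feat, test_feat = add_train_only_aggregate(
    train_df, test_df, key="product_id", value="rating",
    name="product_mean_rating",
)
print(test_feat)
```

Product C never appears in the training window, so it receives the global training mean instead of a statistic derived from test rows.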

For text processing, summaries were vectorized using TF-IDF with unigram and bigram features. Because full TF-IDF matrices are high-dimensional and computationally expensive, we applied Truncated SVD to reduce these representations to 100 components, retaining most of the meaningful variation while improving model speed and reducing overfitting risk.
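
The text branch can be sketched with scikit-learn as follows. The example summaries are made up, and n_components is set to 2 only so the toy corpus is large enough; the report used 100.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Unigram + bigram TF-IDF followed by Truncated SVD.
text_pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    TruncatedSVD(n_components=2, random_state=0),  # 100 in the report
)

# Made-up summaries for illustration only.
summaries = [
    "great value works well",
    "broke after two weeks",
    "detailed review of battery life",
    "not great not terrible",
    "love it works well",
    "okay value for the price",
]

reduced = text_pipeline.fit_transform(summaries)
print(reduced.shape)
```

Fitting the vectorizer and the SVD inside one pipeline means both are learned from training text only when the pipeline is used in cross-validation.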

Finally, because helpful reviews make up a very small portion of the dataset, we applied class weighting to increase the influence of the minority class during training. Without this adjustment, virtually all models would default toward predicting the majority non-helpful class.
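
To illustrate the size of this adjustment, the sketch below computes scikit-learn's "balanced" class weights, which are n_samples / (n_classes * class_count). The 95/5 label split is an assumed example, not the dataset's actual prevalence.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Assumed toy labels: 95 non-helpful (0) and 5 helpful (1) reviews.
y = np.array([0] * 95 + [1] * 5)

weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y
)
print(dict(zip([0, 1], weights)))
```

With these toy labels each helpful review counts about 19 times as much as a non-helpful one in the loss, which is what passing class_weight="balanced" to an estimator does internally.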

Model Training Process

We implemented three models of increasing complexity: K-Nearest Neighbors, Logistic Regression, and a small feedforward Neural Network. Each model was wrapped in the same scikit-learn pipeline, allowing identical preprocessing procedures and making the final comparisons fair and consistent.
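
A shared pipeline of this kind might look like the sketch below. The column names are hypothetical, and the estimator settings are defaults rather than the tuned values.

```python
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

NUMERIC_COLS = ["rating", "verified", "text_length"]  # hypothetical names
TEXT_COL = "summary"                                  # hypothetical name

def make_preprocessor(n_components=100):
    """Standardize numeric columns; TF-IDF + SVD on the summary text."""
    text = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        TruncatedSVD(n_components=n_components, random_state=0),
    )
    return ColumnTransformer([
        ("num", StandardScaler(), NUMERIC_COLS),
        ("txt", text, TEXT_COL),  # a string selects a 1-D text column
    ])

def make_models(n_components=100):
    """Wrap each estimator in an identical preprocessing pipeline."""
    estimators = {
        "knn": KNeighborsClassifier(),
        # Of these three, only LogisticRegression takes class_weight.
        "logreg": LogisticRegression(class_weight="balanced", max_iter=1000),
        "mlp": MLPClassifier(early_stopping=True, random_state=0),
    }
    return {
        name: Pipeline([("prep", make_preprocessor(n_components)), ("clf", est)])
        for name, est in estimators.items()
    }

models = make_models()
print(sorted(models))
```

Because each pipeline owns its own preprocessor, scaling and text statistics are refit on whatever training fold the pipeline sees.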

Hyperparameter tuning was performed using a rolling TimeSeriesSplit, ensuring that each validation fold used data occurring after its corresponding training segment. Logistic Regression in particular benefitted from tuning of the regularization parameter C, and we selected the class-weighted version of the model to counteract imbalance. For the neural network, early stopping prevented overfitting, especially given the scarcity of positive cases.
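
The tuning loop can be sketched as a grid search over C with TimeSeriesSplit as the cross-validator. The synthetic data, the number of splits, and the grid values below are assumptions for illustration; rows are taken to be already sorted by posting time.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic stand-in for time-ordered features and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=300) > 1.0).astype(int)

# Each validation fold comes strictly after its training segment.
cv = TimeSeriesSplit(n_splits=4)

search = GridSearchCV(
    LogisticRegression(class_weight="balanced", max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # assumed grid
    scoring="average_precision",
    cv=cv,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Unlike shuffled K-fold, this never validates on rows that precede the training segment.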

Because accuracy is uninformative with extreme class imbalance, our primary metric was Average Precision (PR-AUC). This metric emphasizes ranking quality and the model’s ability to identify rare helpful reviews. ROC-AUC, F1, Brier score, and confusion matrices were reported for completeness but interpreted within the limitations imposed by class imbalance.
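
A small example makes the point concrete. With an assumed 5% positive rate (illustrative, not the dataset's actual rate), a classifier that never predicts "helpful" is 95% accurate, while an uninformative constant score earns an Average Precision equal to the positive rate.

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score

# Assumed toy labels: 95 non-helpful (0) and 5 helpful (1) reviews.
y_true = np.array([0] * 95 + [1] * 5)

# A model that never predicts "helpful" still looks accurate.
always_negative = np.zeros_like(y_true)
acc = accuracy_score(y_true, always_negative)

# A constant score ranks nothing; its AP equals the positive rate.
constant_scores = np.zeros(len(y_true), dtype=float)
ap_constant = average_precision_score(y_true, constant_scores)

print(acc, ap_constant)
```

Average Precision should therefore be read against the positive-class prevalence rather than against 0.5.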

Main Results and Interpretation

Among all tested models, Logistic Regression with class weighting produced the most stable and interpretable performance. On the held-out test window, the logistic model achieved a ROC-AUC of approximately 0.89, indicating strong ranking ability. The ROC curve rose well above the 45-degree baseline, showing that the model consistently assigned higher probabilities to reviews that ultimately received more than five helpful votes.

However, as expected given the rarity of helpful reviews, the Precision–Recall performance was modest, with a PR-AUC near 0.13. The PR curve showed high precision at very low recall, meaning the model can confidently identify a small subset of helpful reviews but struggles to capture the full set of positives without sacrificing precision. This pattern aligns with the dataset’s extreme imbalance: even well-ranked positive cases occur too infrequently for precision to remain high once the threshold is lowered.

Threshold selection on the validation set improved F1 marginally, but the underlying class scarcity remained the dominant limiting factor. Together, the PR and ROC curves show that the model distinguishes helpful vs. non-helpful reviews well, but the rarity of the positive class limits the achievable precision at broader recall levels.
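
Threshold selection of this kind can be sketched as a scan over candidate cut-offs on validation predictions, keeping the one with the highest F1. The grid (steps of 0.025) and the toy scores are assumptions; the report does not state how candidates were generated.

```python
import numpy as np
from sklearn.metrics import f1_score

def best_f1_threshold(y_val, p_val, grid=None):
    """Return (threshold, F1) maximizing F1 on validation data."""
    y_val = np.asarray(y_val)
    p_val = np.asarray(p_val)
    if grid is None:
        grid = np.linspace(0.05, 0.95, 37)  # assumed grid, step 0.025
    scores = [
        f1_score(y_val, (p_val >= t).astype(int), zero_division=0)
        for t in grid
    ]
    best = int(np.argmax(scores))  # first maximum if there are ties
    return float(grid[best]), float(scores[best])

# Made-up validation labels and predicted probabilities.
y_val = np.array([0, 0, 0, 1, 1])
p_val = np.array([0.10, 0.20, 0.56, 0.70, 0.90])

threshold, f1 = best_f1_threshold(y_val, p_val)
print(threshold, f1)
```

The chosen threshold is then frozen and applied unchanged to the test window, so the test labels play no part in selecting it.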

These outcomes suggest that helpfulness is influenced by nuanced, context-dependent user behavior and remains only partially predictable using metadata and summary text alone.

Summary of Findings

Overall, our modeling approach successfully identified patterns associated with review helpfulness, particularly when leveraging both numeric and text-based features. Logistic Regression performed the best across evaluation metrics, demonstrating that simpler linear decision boundaries can still capture meaningful structure in the data, especially once balanced class weighting and dimension reduction are applied.

While predictive performance on the minority class is constrained by real-world scarcity of helpful reviews, the workflow provides a transparent and deployable strategy for forecasting review helpfulness at posting time. Future extensions may explore enriched text embeddings, reviewer behavior histories, or larger neural architectures, though such approaches must be weighed against the challenges of data imbalance and interpretability.

3.4.1 KNN

The KNN model performed much worse than the logistic regression model and exposed clear limitations on this data. After tuning, the best configuration used k = 11 with distance weighting, and the threshold yielding the highest validation F1 was 0.35. The confusion matrices reveal that KNN predicted the “not helpful” class for the large majority of reviews, identifying only 11 true positives out of 67 actual helpful reviews in the test set. The KNN model failed to generalize and offered limited practical utility for predicting helpful reviews.
 
3.4.2 Logistic Regression 
 
The logistic regression model, with a regularization strength of C = 0.1 and evaluated at the best validation threshold of 0.475, had the strongest and most stable performance among all tested models. The confusion matrices show that the model identified 31 true positives but still missed many helpful reviews, which is common with imbalanced data. However, it demonstrated strong generalization across time and proved to be the most reliable model.

3.4.3 Neural Network (MLP)
 
The MLP, tuned with hidden layers of (64, 32) and a regularization strength of alpha = 0.001, produced mixed results that fell between those of KNN and logistic regression. The MLP showed that it can model more complex relationships within the summary text, but it did not generalize reliably.
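
For reference, the tuned configurations reported in Sections 3.4.1 to 3.4.3 can be written as scikit-learn estimators, as sketched below. Settings not stated in the text (max_iter, random_state) are assumptions, and the logistic regression recall assumes the same 67 helpful test reviews reported for KNN.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# (estimator, decision threshold); None where none was reported.
tuned = {
    "knn": (KNeighborsClassifier(n_neighbors=11, weights="distance"), 0.35),
    "logreg": (
        LogisticRegression(C=0.1, class_weight="balanced", max_iter=1000),
        0.475,
    ),
    "mlp": (
        MLPClassifier(hidden_layer_sizes=(64, 32), alpha=0.001,
                      early_stopping=True, random_state=0),
        None,
    ),
}

def recall(true_positives, actual_positives):
    """Share of actual helpful reviews that the model recovered."""
    return true_positives / actual_positives

knn_recall = recall(11, 67)
logreg_recall = recall(31, 67)  # assumes the same 67 test positives
print(round(knn_recall, 3), round(logreg_recall, 3))
```

Under that assumption, logistic regression recovers nearly three times as many helpful reviews as KNN at its chosen threshold.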
 

3.5 Error Analysis
Across all three models, two consistent error patterns emerged. False negatives typically occurred on short or vague summaries that looked unhelpful at posting time but later accumulated helpful votes. Because our features relied on the summary text, these reviews lacked enough signal for the models to classify them as helpful. In contrast, false positives were usually long or enthusiastic summaries that appeared helpful in wording but ultimately received little community engagement. These reviews often had stylistic markers that the models interpreted as helpful even though readers did not agree. The MLP overproduced these false positives because of its sensitivity to text patterns, while the logistic regression made fewer and more balanced errors. Overall, these patterns reflect the limitations of using the summary text alone.
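
An error breakdown along these lines can be sketched by labeling each test row as a false negative, false positive, or correct prediction and then comparing summary lengths across groups. The column names (summary, helpful, score) and the four rows below are made up for illustration and are not drawn from the dataset.

```python
import pandas as pd

def label_errors(df, threshold):
    """Tag each row with its error type and its summary word count."""
    pred = (df["score"] >= threshold).astype(int)
    kind = pd.Series("correct", index=df.index)
    kind[(pred == 0) & (df["helpful"] == 1)] = "false_negative"
    kind[(pred == 1) & (df["helpful"] == 0)] = "false_positive"
    words = df["summary"].str.split().str.len()
    return df.assign(pred=pred, error=kind, summary_words=words)

# Made-up rows with hypothetical column names.
scored = pd.DataFrame({
    "summary": [
        "ok",
        "absolutely amazing product, love everything about it",
        "great detailed comparison of battery life",
        "bad",
    ],
    "helpful": [1, 0, 1, 0],
    "score": [0.10, 0.90, 0.80, 0.20],
})

errors = label_errors(scored, threshold=0.475)
print(errors.groupby("error")["summary_words"].mean())
```

Grouping by error type makes it easy to check whether false negatives skew short and false positives skew long, as described above.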
