Methodology Overview 

The objective of this project is to predict, at the time an Amazon product review is posted, whether it will receive more than 5 helpful votes. Our approach is a binary classification model that uses only variables known at posting time, so that the model reflects a realistic prediction setting. 

In the previous milestone, we cleaned and processed the data so that it could be used for analysis. We removed variables such as "image" and "style" because they had too many missing values to be useful for our model. All missing values in the "vote" column were set to 0, which reflects a review that received no helpful votes. 

To define the target variable, we applied a binary threshold of more than 5 votes. Reviews with more than 5 votes were labeled "helpful", while those with 5 or fewer votes were labeled "not helpful". Text features were converted into numerical representations better suited for modeling. Numerical variables such as star rating and review length were scaled as needed. 
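The labeling rule above can be sketched in a few lines. This is a minimal illustration, not the project's actual cleaning code: the helper names `parse_votes` and `label_helpful` are hypothetical, and the handling of comma-formatted strings such as "1,024" is an assumption about how raw vote counts might be stored.

```python
from typing import Optional, Union

HELPFUL_THRESHOLD = 5  # a review is "helpful" only if it has MORE than 5 votes


def parse_votes(raw: Optional[Union[str, int, float]]) -> int:
    """Convert a raw vote value to an int, treating missing values as 0."""
    if raw is None:
        return 0
    if isinstance(raw, float) and raw != raw:  # NaN is the only value not equal to itself
        return 0
    if isinstance(raw, str):
        # Assumption: counts may be stored as text with thousands separators.
        raw = raw.replace(",", "").strip()
        if not raw:
            return 0
    return int(float(raw))


def label_helpful(raw: Optional[Union[str, int, float]]) -> int:
    """Return 1 for "helpful" (more than 5 votes) and 0 for "not helpful"."""
    return 1 if parse_votes(raw) > HELPFUL_THRESHOLD else 0
```

Note that a review with exactly 5 votes falls in the "not helpful" class under this rule.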

Class imbalance was a major characteristic of the dataset, with the majority of reviews receiving 5 or fewer votes. To address this, we will use class weighting to keep the model from being biased toward the majority class. 
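One common way to set class weights is the "balanced" heuristic, in which each class is weighted by n_samples / (n_classes * class_count). The sketch below computes those weights by hand; the function name `balanced_class_weights` and the 9-to-1 toy labels are illustrative assumptions, not figures from our dataset.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def balanced_class_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Weight each class inversely to its frequency in the labels."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy example: 9 "not helpful" reviews (0) for every "helpful" review (1).
labels = [0] * 9 + [1]
weights = balanced_class_weights(labels)
print(weights)
```

In this toy case the minority class receives a weight nine times larger than the majority class, so each helpful review contributes more to the training loss.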



Model Overview and Justification

Our first model will be k-Nearest Neighbors (KNN). KNN predicts the outcome for a given review from the most similar reviews in the training data, which makes it a good way to establish whether reviews with similar features, such as length, rating, or sentiment, tend to receive similar levels of helpfulness. While intuitive and easy to implement, KNN has drawbacks: it is sensitive to feature scaling and slows down on large datasets. It will primarily serve here as a point of comparison for the other approaches. 

Next, we will apply Logistic Regression to build a model that is both more robust and more interpretable. Logistic regression estimates the probability that a given review is helpful and helps identify which features have the strongest influence on that outcome. Its simplicity and interpretability make it a useful benchmark for understanding the key factors behind review helpfulness. 

We will also employ Principal Component Analysis (PCA) to reduce the dimensionality of the data, particularly for text-based and numeric features. PCA retains most of the variation in a smaller number of components, which helps models run more efficiently and limits the risk of overfitting. 
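As a rough illustration of the dimensionality reduction described above, the sketch below uses scikit-learn's `PCA` with a fractional `n_components`, which keeps just enough components to exceed the requested share of variance. The data here is synthetic (10 features driven by 3 latent factors) and stands in for our real features; the 0.95 target is one plausible setting, not a tuned value.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the real feature matrix:
# 200 rows, 10 correlated features generated from 3 latent factors plus noise.
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 10))

# PCA is scale-sensitive, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# A float in (0, 1) asks PCA to keep enough components to explain
# more than that fraction of the variance (requires the "full" solver).
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

print("components kept:", X_reduced.shape[1])
print("variance explained:", pca.explained_variance_ratio_.sum())
```

In a real pipeline the scaler and PCA would be fit on the training split only and then applied to the test split, so that no information from later reviews leaks into training.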

Finally, we will test a small neural network to capture more complex relationships between variables. Neural networks can model patterns that might elude simpler models, especially in text features such as review summaries. Since our data is imbalanced and large, we have chosen a lightweight architecture to keep training stable and the results easier to interpret. 

Overall, this combination of models will let us compare several modeling approaches for predicting which reviews are likely to receive more than five helpful votes.

Model Training Procedure

The training process will begin by dividing the data into training and testing sets, ensuring that the model is evaluated on reviews it has not seen before. A time-based split will be used so that the model trains on earlier reviews and tests on later ones, reflecting a realistic prediction scenario in which future data is unknown at training time. 
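A time-based split can be sketched as follows. This is a minimal version that assumes each review carries a sortable timestamp; the function name `time_based_split`, the 80/20 fraction, and the toy rows are all illustrative assumptions.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def time_based_split(
    rows: Sequence[T],
    timestamps: Sequence[float],
    train_fraction: float = 0.8,
) -> Tuple[List[T], List[T]]:
    """Put the earliest rows in the training set and the latest in the test set."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    if len(rows) != len(timestamps):
        raise ValueError("rows and timestamps must have the same length")

    # Sort row indices chronologically, then cut once at the chosen fraction.
    order = sorted(range(len(rows)), key=lambda i: timestamps[i])
    cutoff = int(len(order) * train_fraction)
    train = [rows[i] for i in order[:cutoff]]
    test = [rows[i] for i in order[cutoff:]]
    return train, test


# Toy example: five reviews listed out of chronological order.
reviews = ["c", "a", "d", "b", "e"]
posted_at = [30, 10, 40, 20, 50]
train_rows, test_rows = time_based_split(reviews, posted_at, train_fraction=0.8)
print(train_rows, test_rows)
```

Unlike a random split, every training review here was posted no later than every test review, which mirrors how the model would be used in practice.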

All numerical variables, including overall rating and review length, will be standardized so that no single feature dominates distance-based methods such as KNN. Text data from the summary field will be transformed into numerical features using TF-IDF vectorization, which weights each word by how often it appears in a review relative to how common it is across all reviews. This allows the model to capture key linguistic patterns that may be associated with higher helpfulness scores. 
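The preprocessing described above could be combined in a single scikit-learn `ColumnTransformer`, as sketched below. The column names `overall` and `summary` follow the fields mentioned in this section, while `review_length` and the four toy rows are assumptions made for illustration only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Toy training data; "review_length" is a hypothetical column name.
train = pd.DataFrame(
    {
        "overall": [5, 1, 4, 3],
        "review_length": [120, 15, 300, 60],
        "summary": ["great product", "broke quickly", "great value", "works fine"],
    }
)

preprocessor = ColumnTransformer(
    [
        # Standardize numeric columns to zero mean and unit variance.
        ("numeric", StandardScaler(), ["overall", "review_length"]),
        # TfidfVectorizer expects a 1-D column of strings, so the column
        # is given as a plain string rather than a list.
        ("summary_tfidf", TfidfVectorizer(), "summary"),
    ]
)

# Fit on training data only; reuse the fitted transformer on later data.
X_train = preprocessor.fit_transform(train)
print(X_train.shape)
```

Fitting the scaler and the TF-IDF vocabulary on the training split alone keeps statistics from later reviews out of the training features.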

After preprocessing, each model will be trained separately using the same training data to allow for direct comparison. For KNN, we will experiment with different values of k to identify the number of neighbors that yields the highest validation accuracy. For logistic regression, the regularization strength will be tuned to avoid overfitting while maintaining interpretability. When incorporating PCA, the number of principal components retained will be determined by the proportion of variance explained, typically around 90 to 95 percent. For the neural network, we will train a small feedforward model with one hidden layer and apply techniques such as early stopping to prevent overfitting. 
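The tuning steps above might look roughly like the following in scikit-learn. This is a sketch on synthetic imbalanced data, not our actual experiment: the candidate values for k and C, the hidden layer size of 16, and the scoring choices are all placeholder assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with roughly 15% positives to mimic class imbalance.
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.85, 0.15], random_state=0
)

# KNN: search over the number of neighbors k.
knn_search = GridSearchCV(
    Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())]),
    param_grid={"knn__n_neighbors": [3, 5, 11, 21]},
    scoring="accuracy",
    cv=5,
)
knn_search.fit(X, y)

# Logistic regression: search over C (the inverse of regularization strength),
# with balanced class weights for the minority class.
logreg_search = GridSearchCV(
    Pipeline(
        [
            ("scale", StandardScaler()),
            ("logreg", LogisticRegression(class_weight="balanced", max_iter=1000)),
        ]
    ),
    param_grid={"logreg__C": [0.01, 0.1, 1.0, 10.0]},
    scoring="f1",
    cv=5,
)
logreg_search.fit(X, y)

# Small feedforward network: one hidden layer, with early stopping on a
# held-out slice of the training data.
mlp = Pipeline(
    [
        ("scale", StandardScaler()),
        (
            "mlp",
            MLPClassifier(
                hidden_layer_sizes=(16,),
                early_stopping=True,
                max_iter=200,
                random_state=0,
            ),
        ),
    ]
)
mlp.fit(X, y)

print("best k:", knn_search.best_params_["knn__n_neighbors"])
print("best C:", logreg_search.best_params_["logreg__C"])
print("MLP epochs run:", mlp.named_steps["mlp"].n_iter_)
```

Placing the scaler inside each pipeline means it is refit on every cross-validation fold, so validation folds do not influence the scaling. Note that `MLPClassifier` does not accept a `class_weight` argument, so class weighting for the neural network would need a different mechanism, such as resampling.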

Throughout the training process, cross-validation will be used to assess model performance and stability. Since the dataset is imbalanced, class weighting will be applied so that the minority class, representing helpful reviews, has a stronger influence during training. Model performance will be evaluated primarily using accuracy, precision, recall, F1-score, and ROC-AUC, which together provide a balanced view of predictive ability. 
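To make the threshold-based metrics concrete, the sketch below computes accuracy, precision, recall, and F1 from scratch. The helper `binary_metrics` is a hypothetical name, and the labels are toy values; in practice a library such as scikit-learn would likely be used, and ROC-AUC (which needs predicted scores rather than hard labels) is left out here.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 for the positive class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# A classifier that always predicts "not helpful" on a 9-to-1 imbalanced sample.
y_true = [0] * 9 + [1]
always_negative = binary_metrics(y_true, [0] * 10)
print(always_negative)
```

The always-negative classifier reaches 90% accuracy while its recall and F1 are both 0, which is why accuracy is reported alongside the other metrics rather than on its own.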

The combination of careful preprocessing, cross-validation, and class weighting is intended to make the final model both fair and generalizable, and to provide insight into the factors that make a review more likely to be deemed helpful at the time it is posted.