This project is part of the Natural Language Processing (NLP) module from the KeepCoding AI Bootcamp.
The objective is to build and evaluate a sentiment analysis pipeline using Amazon product reviews.
- EDA: data download and exploration.
- Preprocessing: text cleaning pipeline.
- Modeling: training and comparing two Machine Learning and one Deep Learning sentiment classifiers.
- Reports: metrics, evaluation, and conclusions.
- Dataset: Amazon Reviews Data 2018, 5-core (small subsets for experimentation).
- Format: one review per line, in JSON.

Sample review (the `//` comments are explanatory and not part of the data):

    {
      "image": ["https://images-na.ssl-images-amazon.com/images/I/71eG75FTJJL._SY88.jpg"],
      "overall": 5.0,                      // rating of the product
      "vote": "2",                         // helpful votes of the review
      "verified": true,
      "reviewTime": "01 1, 2018",          // time of the review (raw)
      "reviewerID": "AUI6WTTT0QZYS",       // ID of the reviewer
      "asin": "5120053084",                // ID of the product
      "style": {                           // dictionary of the product metadata
        "Size:": "Large",
        "Color:": "Charcoal"
      },
      "reviewerName": "Abbey",             // name of the reviewer
      "reviewText": "I now have 4 of the 5 available colors of this shirt... ",
      "summary": "Comfy, flattering, discreet--highly recommended!",
      "unixReviewTime": 1514764800         // time of the review (unix time)
    }

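
Because each line of the data file is a standalone JSON object, the file can be read one line at a time. The snippet below is a minimal sketch and not the project's own loader: the function name `read_reviews` and the two in-memory sample lines are illustrative assumptions, and it assumes the lines in the real file are plain JSON without comments.

```python
import io
import json


def read_reviews(lines):
    """Yield one dict per non-empty line of a JSON-lines stream."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        yield json.loads(line)


# Hypothetical two-line sample standing in for a real data file.
sample = io.StringIO(
    '{"overall": 5.0, "verified": true, "reviewText": "Comfy shirt"}\n'
    '{"overall": 2.0, "verified": false, "reviewText": "Runs small"}\n'
)

reviews = list(read_reviews(sample))
ratings = [r["overall"] for r in reviews]
```

With a real file, an open file object can be passed in place of the `StringIO` sample, since the function only needs an iterable of lines.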
- Preprocessing: only reviews with ratings (1–5) are kept
- Sentiment labels created from star ratings:

| Stars | Sentiment Label | Category |
|---|---|---|
| ⭐ | 0 | Negative |
| ⭐⭐ | 0 | Negative |
| ⭐⭐⭐ | removed | Neutral |
| ⭐⭐⭐⭐ | 1 | Positive |
| ⭐⭐⭐⭐⭐ | 1 | Positive |
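
The mapping in the table can be sketched as a small function. This is an illustration only: the name `star_to_label` and the sample ratings are assumptions, and the project may implement the same rule differently (for example with pandas).

```python
def star_to_label(stars):
    """Map a 1-5 star rating to 0 (negative), 1 (positive), or None (dropped)."""
    if stars in (1, 2):
        return 0
    if stars in (4, 5):
        return 1
    return None  # 3 stars (neutral) or anything outside 1-5 is dropped


# Hypothetical ratings, in the float form used by the "overall" field.
ratings = [5.0, 1.0, 3.0, 4.0, 2.0]
labeled = [(s, star_to_label(s)) for s in ratings]

# Neutral reviews are removed before training.
kept = [(s, y) for s, y in labeled if y is not None]
```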
- Custom pipeline: normalization, tokenization, stopword removal, optional stemming
- Duplicates removed based on cleaned token sequences
- TF-IDF vectorization for model input
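
The cleaning and deduplication steps can be sketched with the standard library alone. This is a simplified stand-in under stated assumptions, not the project's pipeline: `clean_tokens`, `drop_duplicates`, and the tiny `STOPWORDS` set are hypothetical (the project uses NLTK's English stopword list), tokenization here is a plain regular expression, and the optional stemming step is left out.

```python
import re
import unicodedata

# Tiny illustrative stopword set; an assumption, not the NLTK list.
STOPWORDS = {"a", "an", "and", "the", "is", "it", "this", "of", "i"}


def clean_tokens(text):
    """Normalize text, tokenize on letter runs, and drop stopwords."""
    # Strip accents: decompose characters, then discard combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower()
    tokens = re.findall(r"[a-z]+", text)
    return [t for t in tokens if t not in STOPWORDS]


def drop_duplicates(token_lists):
    """Keep the first occurrence of each cleaned token sequence."""
    seen = set()
    unique = []
    for tokens in token_lists:
        key = tuple(tokens)  # lists are not hashable, tuples are
        if key not in seen:
            seen.add(key)
            unique.append(tokens)
    return unique


raw = ["This is a GREAT café!", "this is a great cafe", "Terrible, it broke."]
cleaned = drop_duplicates([clean_tokens(t) for t in raw])
```

Comparing the cleaned token sequences rather than the raw strings means the first two reviews, which differ only in case, accents, and punctuation, count as one.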
- **Logistic Regression** (with GridSearchCV)
- **Multinomial Naive Bayes** (with GridSearchCV)
- **Deep Learning model** (LSTM)
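
As a rough illustration of how TF-IDF features and a grid-searched classifier fit together in scikit-learn, here is a minimal sketch. The toy corpus, the step names `tfidf` and `clf`, the grid of `C` values, the `f1` scoring choice, and `cv=2` are all assumptions made for the example; they are not the project's actual data or settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus (illustrative only); 1 = positive, 0 = negative.
texts = [
    "great product love it", "excellent quality very happy",
    "works great highly recommended", "love the fit perfect",
    "terrible quality broke fast", "awful waste of money",
    "bad product very disappointed", "poor fit returned it",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Vectorizer and classifier in one pipeline, so TF-IDF is refit on each CV fold.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid over the inverse regularization strength C.
param_grid = {"clf__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="f1")
search.fit(texts, labels)

best_c = search.best_params_["clf__C"]
predictions = search.predict(["love this great product", "terrible waste"])
```

The same structure would apply to Multinomial Naive Bayes by swapping the `clf` step for `MultinomialNB()` and searching over `clf__alpha` instead of `clf__C`. The LSTM model is not covered by this sketch.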
- Accuracy, precision, recall, F1-score
- Confusion matrix and classification report
- Threshold-based analysis using precision-recall curves
- Plots of performance versus regularization strength for each ML model
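
To make the metrics and the threshold analysis concrete, the sketch below computes precision, recall, and F1 from confusion-matrix counts at several decision thresholds. It uses only the standard library; the function names, the labels, and the scores are made up for illustration. In the project itself, scikit-learn's `classification_report`, `confusion_matrix`, and `precision_recall_curve` cover the same ground.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    return tn, fp, fn, tp


def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for the positive class."""
    _, fp, fn, tp = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def predict_at(scores, threshold):
    """Label a review positive when its score reaches the threshold."""
    return [1 if s >= threshold else 0 for s in scores]


# Hypothetical true labels and predicted positive-class probabilities.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.1]

by_threshold = {
    t: precision_recall_f1(y_true, predict_at(scores, t))
    for t in (0.35, 0.5, 0.8)
}
```

Raising the threshold in this toy example trades recall for precision, which is the trade-off a precision-recall curve displays across all thresholds.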
- `pandas`, `numpy`: data handling and manipulation
- `string`, `re`, `unicodedata`: text normalization and cleanup
- `random`, `pickle`, `collections.Counter`: utilities and storage
- `nltk`: tokenization, stopword removal, stemming, n-grams
- `matplotlib.pyplot`, `seaborn`: data plots, metrics, and trends
- `WordCloud`: visualize most frequent terms
- `FreqDist` (nltk): token frequency analysis
- `scikit-learn`:
  - `TfidfVectorizer`: Bag-of-Words (TF-IDF) encoding
  - `LogisticRegression`, `MultinomialNB`: classifiers
  - `GridSearchCV`, `train_test_split`: model selection and evaluation
  - `chi2`: feature selection
  - `classification_report`, `confusion_matrix`, `accuracy_score`, `precision_recall_curve`, `roc_curve`: evaluation metrics
- `stopwords`: English stopword list (downloaded at runtime)
