This repository contains the complete development of a Natural Language Processing (NLP) practice, carried out during the KeepCoding Full-Stack AI Bootcamp III.

syllerim/nlp-lab

💬 NLP – Sentiment Analysis Assignment 🧐

✍️ Author: Mirellys Arteta Davila

This project is part of the Natural Language Processing (NLP) module from the KeepCoding AI Bootcamp.
The objective is to build and evaluate a sentiment analysis pipeline using Amazon product reviews.


Structure

  • EDA → Dataset download and exploratory data analysis.
  • Preprocessing → Text cleaning pipeline.
  • Modeling → Training and comparing two Machine Learning classifiers and one Deep Learning sentiment classifier.
  • Reports → Metrics, evaluation and conclusions.

📦 Dataset

Sample Review:

{
  "image": ["https://images-na.ssl-images-amazon.com/images/I/71eG75FTJJL._SY88.jpg"],
  "overall": 5.0,                  // rating of the product
  "vote": "2",                     // helpful votes of the review
  "verified": True,
  "reviewTime": "01 1, 2018",      // time of the review (raw)
  "reviewerID": "AUI6WTTT0QZYS",   // ID of the reviewer
  "asin": "5120053084",            // ID of the product
  "style": {                       // dictionary of the product metadata
    "Size:": "Large",
    "Color:": "Charcoal"
  },
  "reviewerName": "Abbey",         // name of the reviewer
  "reviewText": "I now have 4 of the 5 available colors of this shirt... ",
  "summary": "Comfy, flattering, discreet--highly recommended!",
  "unixReviewTime": 1514764800     // time of the review (unix time)
}
  • Preprocessing: only reviews that carry a star rating (1–5) are kept
  • Sentiment labels created from star ratings:
| Stars | Sentiment Label | Category |
|---|---|---|
| ⭐️ | 0 | Negative |
| ⭐️⭐️ | 0 | Negative |
| ⭐️⭐️⭐️ | removed | Neutral |
| ⭐️⭐️⭐️⭐️ | 1 | Positive |
| ⭐️⭐️⭐️⭐️⭐️ | 1 | Positive |
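
The star-to-label mapping above can be sketched with pandas roughly as follows. This is a minimal illustration, not the notebook's actual code: the helper name `add_sentiment_labels` and the output column `sentiment` are assumptions, while `overall` is the rating field shown in the sample review.

```python
import pandas as pd


def add_sentiment_labels(df: pd.DataFrame, rating_col: str = "overall") -> pd.DataFrame:
    """Keep rated reviews, drop 3-star (neutral) ones, and add a binary label.

    The function name and the "sentiment" column are hypothetical.
    """
    # Keep only rows that have a rating in the 1-5 range.
    rated = df.dropna(subset=[rating_col])
    rated = rated[rated[rating_col].between(1, 5)]
    # Neutral reviews (3 stars) are removed, as in the table above.
    polar = rated[rated[rating_col] != 3].copy()
    # 4-5 stars -> 1 (positive), 1-2 stars -> 0 (negative).
    polar["sentiment"] = (polar[rating_col] >= 4).astype(int)
    return polar.reset_index(drop=True)


# Toy data for illustration only.
reviews = pd.DataFrame({
    "overall": [5.0, 1.0, 3.0, 4.0, 2.0, None],
    "reviewText": ["great", "awful", "okay", "nice", "bad", "unrated"],
})
labeled = add_sentiment_labels(reviews)
print(labeled[["overall", "sentiment"]])
```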

🧹 Preprocessing

  • Custom pipeline: normalization, tokenization, stopword removal, optional stemming
  • Duplicates removed based on cleaned token sequences
  • TF-IDF vectorization for model input
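
A cleaning pipeline along these lines might look like the sketch below. It uses only the standard library so it runs without downloads: the small `STOPWORDS` set and the whitespace tokenizer are stand-ins (assumptions) for the NLTK resources the project uses, and all function names are hypothetical.

```python
import re
import unicodedata
from typing import Callable, Iterable, List, Optional

# Tiny illustrative stopword set (assumption); the project uses NLTK's English list.
STOPWORDS = frozenset({"a", "an", "and", "the", "is", "it", "this", "of", "i", "to"})


def normalize(text: str) -> str:
    """Lowercase, strip accents, and replace non-letter characters with spaces."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z\s]", " ", text)


def clean(text: str, stem: Optional[Callable[[str], str]] = None) -> List[str]:
    """Normalize, tokenize on whitespace, drop stopwords, optionally stem."""
    tokens = [t for t in normalize(text).split() if t not in STOPWORDS]
    if stem is not None:
        # Any callable works here, e.g. nltk.stem.PorterStemmer().stem.
        tokens = [stem(t) for t in tokens]
    return tokens


def drop_duplicates(token_lists: Iterable[List[str]]) -> List[List[str]]:
    """Remove reviews whose cleaned token sequence was already seen."""
    seen = set()
    unique = []
    for tokens in token_lists:
        key = tuple(tokens)
        if key not in seen:
            seen.add(key)
            unique.append(tokens)
    return unique


docs = ["This shirt is GREAT!!", "this shirt is great", "Café quality: awful."]
cleaned = drop_duplicates(clean(d) for d in docs)
print(cleaned)
```

Deduplicating on the cleaned token sequence rather than on the raw text means that reviews differing only in case or punctuation collapse into one entry.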

🤖 Models

  • ✅ Logistic Regression (with GridSearchCV)

  • 🟡 Multinomial Naive Bayes (with GridSearchCV)

  • 🔵 Deep Learning model (LSTM)
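
For the two classical models, a grid search over a TF-IDF pipeline could be set up roughly as shown here. The toy texts, the parameter grids (`C`, `alpha`), the `cv=2` setting and the F1 scoring are all assumptions chosen to keep the sketch small; the notebooks may use different values. The LSTM is not covered.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus for illustration only (1 = positive, 0 = negative).
texts = [
    "great shirt love it", "comfy and flattering", "highly recommended great fit",
    "love the color great quality", "awful fabric bad fit", "terrible quality waste",
    "bad stitching very poor", "poor fit awful color",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Hypothetical grids: regularization strength C and smoothing alpha.
candidates = {
    "logreg": (LogisticRegression(max_iter=1000), {"clf__C": [0.1, 1.0, 10.0]}),
    "nb": (MultinomialNB(), {"clf__alpha": [0.1, 0.5, 1.0]}),
}

best = {}
for name, (estimator, grid) in candidates.items():
    # Vectorizer and classifier share one pipeline so TF-IDF is refit per fold.
    pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", estimator)])
    search = GridSearchCV(pipe, grid, cv=2, scoring="f1")
    search.fit(texts, labels)
    best[name] = search
    print(name, search.best_params_, round(search.best_score_, 3))
```

Keeping the vectorizer inside the pipeline avoids leaking vocabulary statistics from the validation folds into training.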


📊 Evaluation

  • Accuracy, precision, recall, F1-score
  • Confusion matrix and classification report
  • Threshold-based analysis using precision-recall curves
  • Plots of performance versus regularization strength for each ML model
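
The metrics listed above can be computed with scikit-learn as in this sketch. The labels and scores are made-up values, and picking the threshold that maximizes F1 is only one possible criterion (an assumption, not necessarily the one used in the reports).

```python
import numpy as np
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    precision_recall_curve,
)

# Made-up ground truth and predicted positive-class probabilities.
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.90, 0.65, 0.55])

# Default decision rule: predict positive when the score is at least 0.5.
y_pred = (y_score >= 0.5).astype(int)
print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, target_names=["negative", "positive"]))

# Threshold analysis: precision and recall have one more entry than thresholds,
# so the final point (precision=1, recall=0) is dropped before computing F1.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
denom = np.clip(precision[:-1] + recall[:-1], 1e-12, None)
f1 = 2 * precision[:-1] * recall[:-1] / denom
best_threshold = float(thresholds[np.argmax(f1)])
print("threshold with highest F1:", best_threshold)
```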

πŸ› οΈ Tools & Libraries

πŸ§ͺ Data & Preprocessing

  • pandas, numpy → Data handling and manipulation
  • string, re, unicodedata → Text normalization and cleanup
  • random, pickle, collections.Counter → Utilities and storage
  • nltk → Tokenization, stopword removal, stemming, n-grams

📊 Visualization

  • matplotlib.pyplot, seaborn → Data plots, metrics, and trends
  • WordCloud → Visualize most frequent terms
  • FreqDist (nltk) → Token frequency analysis

🤖 Machine Learning

  • scikit-learn:
    • TfidfVectorizer → Bag-of-Words (TF-IDF) encoding
    • LogisticRegression, MultinomialNB → Classifiers
    • GridSearchCV, train_test_split → Model selection and evaluation
    • chi2 → Feature selection
    • classification_report, confusion_matrix, accuracy_score, precision_recall_curve, roc_curve → Evaluation metrics

NLTK Resources

  • stopwords → English stopword list (downloaded at runtime)
