Yelp Review Classification (NLP)

Project Overview

In this project we use NLP to analyze Yelp reviews. Yelp is an app that provide a crowd-sourced review forum to business and services. The app is used publish crowd-sourced reviews about businesses. Number of 'stars' indicate the business rating given by a customer, ranging from 1 to 5. 'Cool', 'Useful', 'Funny' indicate the number of cool votes given by other Yelp Users.

Tools

Data Visualization:

Matplotlib
Seaborn
WorldCloud
ProfileReport

NLP Preproccesing:

Natural Language Toolkit (NLTK)
Stopwords
TfidfVectorizer
Counter

Machine Learning Algorithms:

Random Forest
Bagging Classifier
AdaBoost Classifier
Voting Classifier

ML Algorithms to handle imbalanced data:

BalancedBaggingClassifier
EasyEnsembleClassifier
RUSBoostClassifier
BalancedRandomForestClassifier

Evaluation Metrics:

We are working with classification problem, thats why we use these metrics:

Balanced Accuracy
Roc-auc
Classification Report
Confusion Matrix

Conclusion

Yelp review dataset is very unbalanced (81.7% positives reviews vs 18.3% negative reviews).

We used only preprocessed text with TF-IDF vectorizer in training models. In this study we've trained machine learning models with regularization (class_weight = 'balanced'), then implemented balancing methods and specific algorithms from scikit-learn.imbalanced library.

Insights:

In the case of an unbalanced dataset, they do a better job of predicting models in which you can adjust parameter class_weight = 'balanced' (like Logistic Regression, SVC)
We've tried to apply strategies to balance the dataset, but the results according to machine learning model is not always be better than if we did not change the balance.
Under-sampling can help improve run time and storage problems by reducing the number of training data samples when the training data set is huge.
Under-sampling can discard potentially useful information which could be important for building rule classifiers. The sample chosen by random under-sampling may be a biased sample. And it will not be an accurate representation of the population. Thereby, resulting in inaccurate results with the actual test data set.
Over-sampling unlike under-sampling, this method leads to no information loss. Outperforms under sampling. But it increases the likelihood of overfitting since it replicates the minority class events.
We used specific models from scikit-learn imbalanced library. But they did not scored better than classic machine learning models.
In accordance with the results of the tested models, the best estimate was shown by Multinomial Naive Bayes with undersampling.

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
data		data
README.md		README.md
yelp_review.ipynb		yelp_review.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

data

data

README.md

README.md

yelp_review.ipynb

yelp_review.ipynb

Repository files navigation

Yelp Review Classification (NLP)

Project Overview

Tools

Data Visualization:

NLP Preproccesing:

Machine Learning Algorithms:

ML Algorithms to handle imbalanced data:

Evaluation Metrics:

Conclusion

Insights:

About

Releases

Packages

Languages

vadimlee33/Yelp-Review-Classification

Folders and files

Latest commit

History

Repository files navigation

Yelp Review Classification (NLP)

Project Overview

Tools

Data Visualization:

NLP Preproccesing:

Machine Learning Algorithms:

ML Algorithms to handle imbalanced data:

Evaluation Metrics:

Conclusion

Insights:

About

Resources

Stars

Watchers

Forks

Languages