The second project of Spring 2017 Stat 333 is a Kaggle competition where we are asked to predict Yelp ratings from text comments in the Madison, WI area. Our group ranked first on both the public and private leaderboards.
| Model | Description |
|-------|-------------|
| | Use Stanford's GloVe to vectorize text, and a simple CP-CP-CP neural network |
| | Use tf-idf text encoding, with lasso, ridge regression, and elastic net |
| Multiple Linear Regression | Naive multiple linear regression with silly variables |
| | Use tf-idf text encoding, and a simple one-hidden-layer neural network |
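The GloVe approach above works by mapping each word to a pretrained vector and collapsing a review into a single fixed-length feature vector (commonly by averaging) before feeding it to a model. This is a minimal sketch of that idea; the embedding table and its 3-d vectors are made up for illustration (real GloVe vectors are loaded from a file such as `glove.6B.50d.txt`):

```python
import numpy as np

# Toy stand-in for pretrained GloVe vectors; the words and 3-d values
# are invented for this example, not taken from the real embeddings.
embeddings = {
    "great":    np.array([0.9, 0.1, 0.0]),
    "terrible": np.array([-0.8, 0.2, 0.1]),
    "food":     np.array([0.1, 0.9, 0.0]),
    "service":  np.array([0.0, 0.8, 0.2]),
}

def vectorize(text, dim=3):
    """Average the vectors of known words; fall back to zeros if none match."""
    vecs = [embeddings[w] for w in text.lower().split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

print(vectorize("great food"))     # average of the "great" and "food" vectors
print(vectorize("unknown words"))  # no known words, so the zero vector
```

The resulting document vectors would then be passed to a regressor or a small neural network.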
Our best model uses ridge regression with tf-idf text encoding. You can check out the self-explanatory Jupyter notebook here.
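A tf-idf plus ridge pipeline like the winning one can be sketched in a few lines of scikit-learn. The reviews, ratings, and the `sublinear_tf` and `alpha` settings below are placeholders, not the project's actual data or tuned values:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

# Hypothetical stand-in data; the real project trained on Madison-area Yelp reviews.
train_text = [
    "great food and friendly staff",
    "terrible service and cold food",
    "amazing pizza, will come back",
    "rude waiter and a long wait",
    "decent coffee, average pastries",
    "best burger in town, loved it",
]
train_stars = [5, 1, 5, 2, 3, 5]

# Encode text as tf-idf features, then fit a ridge regressor on them.
model = make_pipeline(TfidfVectorizer(sublinear_tf=True), Ridge(alpha=1.0))
model.fit(train_text, train_stars)

# Predict a rating for an unseen comment.
print(model.predict(["friendly staff and great food"]))
```

In practice the regularization strength `alpha` would be chosen by cross-validation on the training reviews.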
- Feature engineering is much more important in NLP. We tried many different text encoding methods here. GloVe should have worked best in theory, but it was beaten by tf-idf in this particular project.
- We stemmed words and removed stop words. It turns out the stop-word cutoff is really worth tuning.
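One way to tune the stop-word cutoff is through `TfidfVectorizer`'s `stop_words` and `max_df` options: `max_df` treats words that appear in too large a fraction of documents as corpus-specific stop words. This sketch uses made-up documents and an arbitrary 75% cutoff (stemming, which the project also used, could be added via a custom preprocessor, e.g. NLTK's PorterStemmer):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents for illustration, not the project's Yelp reviews.
docs = [
    "the food was great and the staff was friendly",
    "the service was terrible and the food was cold",
    "the pizza was amazing",
    "the wait was long and the waiter was rude",
]

# No filtering: every token survives into the vocabulary.
v_all = TfidfVectorizer().fit(docs)

# Built-in English stop list, plus a document-frequency cutoff: any word
# appearing in more than 75% of documents is dropped as a stop word.
v_filtered = TfidfVectorizer(stop_words="english", max_df=0.75).fit(docs)

print(len(v_all.vocabulary_), "->", len(v_filtered.vocabulary_))
```

Sweeping `max_df` (or the size of the stop list) and cross-validating the downstream model is how the "stop-word level" gets tuned.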
You can see our presentation for more details.