This project is a collection of Wongnai's datasets, most of which are in Thai. We hope that these datasets will advance research in natural language processing (NLP), especially for the Thai language.
1. Search query dataset
There are 500,000 unique words extracted from search queries. These words were labeled by algorithms and by judges for a word segmentation task. Our segmentation criterion is to segment out the longest possible food word, in order to achieve the highest precision score in the search system.
search/labeled_queries_by_algo.txt: List of 500K words labeled by the algorithms described in detail in the blog post.
search/labeled_queries_by_judges.txt: List of 10K words labeled by judges following Wongnai's search criteria.
search/food_dictionary.txt: List of 400K food words used for labelling the
Please note that these words were collected from user-generated content (UGC), which might include some off-topic words.
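To make the "longest food word" criterion concrete, here is a minimal sketch of greedy longest-match segmentation against a dictionary. It is only an illustration and not necessarily the algorithm described in the blog post. The function name `segment_longest_match` and the toy dictionary are hypothetical; in practice the dictionary could be built from search/food_dictionary.txt, assuming it holds one word per line.

```python
def segment_longest_match(query, dictionary, max_len=None):
    """Greedy left-to-right segmentation that prefers the longest dictionary word.

    Characters that do not start any dictionary word are emitted one at a time,
    which is a simplification (Thai combining marks may end up on their own).
    """
    if max_len is None:
        # Longest dictionary entry bounds how far ahead we need to look.
        max_len = max((len(word) for word in dictionary), default=1)

    tokens = []
    i = 0
    while i < len(query):
        match = None
        # Try the longest candidate first, then shrink the window.
        for j in range(min(len(query), i + max_len), i, -1):
            if query[i:j] in dictionary:
                match = query[i:j]
                break
        if match is None:
            match = query[i]  # fallback: single character
        tokens.append(match)
        i += len(match)
    return tokens


# Toy dictionary for illustration only (not taken from the dataset).
toy_dictionary = {"ข้าว", "ข้าวผัด", "ข้าวผัดกุ้ง", "กุ้ง"}
print(segment_longest_match("ข้าวผัดกุ้ง", toy_dictionary))  # → ['ข้าวผัดกุ้ง']
```

Because the longest entry wins, the query is kept as a single food word rather than being split into the shorter words that the dictionary also contains.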
- You may use labeled_queries_by_algo.txt for training your own word segmentation model by splitting it into training and validation sets, and then evaluate your model with
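The train/validation split suggested above could look roughly like the following sketch. It assumes that each non-empty line of the file is one labeled example, which this README does not confirm; the helper names `load_lines` and `split_train_validation`, the 10% validation fraction, and the seed are all illustrative choices.

```python
import random


def load_lines(path):
    """Read a UTF-8 text file and return its lines without trailing newlines.

    Assumption: one labeled example per line.
    """
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def split_train_validation(lines, validation_fraction=0.1, seed=42):
    """Shuffle the examples reproducibly and split them into (train, validation)."""
    items = [line for line in lines if line.strip()]  # drop blank lines
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    rng.shuffle(items)
    n_val = int(len(items) * validation_fraction)
    return items[n_val:], items[:n_val]


# Demo on in-memory placeholder data; with the real file one would call
# split_train_validation(load_lines("search/labeled_queries_by_algo.txt")).
examples = [f"example-{i}" for i in range(100)]
train, validation = split_train_validation(examples)
print(len(train), len(validation))  # → 90 10
```

Fixing the seed keeps the split stable between runs, so validation scores from different experiments stay comparable.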
2. Review dataset
The review dataset contains restaurant reviews and their ratings. The ratings fall into five classes, from 1 to 5 stars.
If you can't download files, they are also located here
Wongnai data services
- If you are interested in Wongnai data such as photos, reviews, or the restaurant database, Wongnai also provides data services, including an API and files. For more details, please follow the link below. https://business.wongnai.com/restaurants-data-service/en/