Building a picture-based attraction recommendation system by combining LDA and K-Means in the structure of Scatter/Gather
Webpage | 3-minute demo/introduction
The main purpose of Picture Your Way is to build a picture-based attraction recommendation system. The proposed algorithm combines Latent Dirichlet Allocation (LDA) and K-Means within the Scatter/Gather structure. The system recommends thousands of tourist attractions in Taiwan to users through a direct and quick process: selecting pictures. We used the attraction information listed in the government's open tourist attraction data and selected corresponding pictures from Instagram. The picture rankings produced by the Elo Rating System are moderately correlated with actual public preferences. The accuracy of the LDA topics is 66%, and the precision of the proposed algorithm is 74%. The system effectiveness, calculated from user feedback, is PR 72. These results indicate that the proposed algorithm succeeds in recommending suitable and beautiful attractions in Taiwan to users.
- Data Sources: Taiwan Open Government Data - Tourist Attraction Database (under the Open Government Data License), Instagram.
- Methods: Latent Dirichlet Allocation (LDA) Topic Modeling, K-means Clustering Algorithm, Elo Rating Algorithm, Natural Language Processing (NLP), PHP Web Development.
- Experimental Design: Spearman Correlation, Confusion Matrix (F-measure).
- Instant Demonstration: Distinguish attraction types by browsing pictures, without reading long descriptions.
- Convenience: Greatly reduce search time for users.
- Accuracy: Achieve optimal clustering and recommendation through machine learning algorithm training and careful evaluation.
- Aesthetics: Weight the displayed pictures by rankings aligned with public preferences to better meet users' needs.
- Scatter/Gather: A document clustering algorithm that clusters large document collections within a short period of time, which allowed us to provide the optimal tourist attractions to users effectively.
- LDA Topic Modeling: A machine learning algorithm that extracts latent topics from text data, which we used for the first "Scatter" step in the Scatter/Gather structure. (Unlike TF-IDF, LDA considers term distributions in addition to term frequencies.)
- K-means Clustering: An unsupervised learning algorithm that provides instant clustering in this system. Our evaluation showed that it is feasible to use the LDA topic distribution of each tourist spot as the attributes in the Euclidean distance calculation in K-means. This replaces the traditional sparse term-document matrix and reduces the time and space cost.
- Elo Rating System: A picture ranking algorithm inspired by the film The Social Network (2010). Because aesthetics is an important factor in picture selection, we ranked the pictures with the Elo rating algorithm and integrated the rankings into our system as one basis for choosing which pictures to display.
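
The two numeric steps described in the list above can be illustrated with a minimal, standard-library-only sketch. It is not the project's actual code. The topic vectors are hypothetical stand-ins for per-attraction LDA topic distributions, the function names (`elo_update`, `kmeans`) are illustrative, seeding the centroids with the first `k` points is a simplification, and the Elo constants (K-factor 32, scale 400) are conventional defaults rather than values reported by this project.

```python
import math


def elo_update(winner_rating, loser_rating, k=32.0):
    """Return new (winner, loser) ratings after one pairwise picture comparison."""
    # Expected score of the winner under the standard logistic Elo model.
    expected = 1.0 / (1.0 + 10 ** ((loser_rating - winner_rating) / 400.0))
    delta = k * (1.0 - expected)
    return winner_rating + delta, loser_rating - delta


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iterations=10):
    """Lloyd's K-means; the first k points seed the centroids (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


# Hypothetical per-attraction topic distributions (3 LDA topics, rows sum to 1).
topic_vectors = [
    [0.85, 0.10, 0.05],  # beach-like attraction
    [0.80, 0.15, 0.05],  # beach-like attraction
    [0.05, 0.90, 0.05],  # temple-like attraction
    [0.10, 0.85, 0.05],  # temple-like attraction
    [0.05, 0.05, 0.90],  # trail-like attraction
    [0.10, 0.10, 0.80],  # trail-like attraction
]

# Cluster attractions on dense topic vectors instead of a sparse term matrix.
labels = kmeans(topic_vectors, k=3)
print(labels)

# Two equally rated pictures: the one the user prefers gains what the other loses.
new_winner, new_loser = elo_update(1500.0, 1500.0)
print(new_winner, new_loser)
```

Each attraction is represented by only as many numbers as there are topics, so the distance calculation stays cheap compared with a full term-document row. In practice the vectors would come from a fitted LDA model rather than being written by hand.
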
Areas | Techniques |
---|---|
Analytics | Python |
Front-end | JavaScript, CSS, HTML |
Back-end | PHP |
Database | MySQL |
Cloud Service | Google Cloud Platform |