• This project helps users find their favorite Airbnb lodging for travel through an image-similarity-based recommendation algorithm
• It provides value to:
o People looking for stylish, funky, or fancy Airbnb apartments for their travels
o Apartment owners who want to know which other apartments in an area look like theirs, so they can make their own apartments either look unique or resemble apartments that are already popular
• InsideAirbnb, an independently published dataset that contains a full list of Airbnb apartments in the U.S. (http://insideairbnb.com/get-the-data.html)
• Images and other information about Airbnb apartments in Boston were scraped with urllib2
• Online collection
o Basic information about Airbnb apartments (location, description, price range, webpage URL) was retrieved from InsideAirbnb and saved to MongoDB running on an EC2 instance
Airbnb_web_scrape_script.py: Reads URLs from the Airbnb apartment list file (downloaded from http://insideairbnb.com/get-the-data.html), scrapes the desired apartment info, and saves it to JSON files
o Apartment images were scraped from Airbnb and saved to S3
Airbnb_Image_scrape_script.py: Reads the JSON files generated in the step above, fetches the apartment images, and saves them locally
o Highlight tags and user reviews were scraped at the same time and saved for potential future use
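A minimal sketch of the collection step above, pulling image URLs out of one scraped listing record; the `photos` list of `url` fields is a hypothetical layout for the JSON produced by Airbnb_web_scrape_script.py, not Airbnb's actual schema:

```python
import json

def extract_image_urls(listing_json):
    """Pull photo URLs out of one scraped listing record.

    The 'photos' list of {'url': ...} dicts is a hypothetical layout,
    not Airbnb's real schema.
    """
    listing = json.loads(listing_json)
    return [photo["url"] for photo in listing.get("photos", [])]

record = json.dumps({
    "id": "12345",
    "photos": [{"url": "https://example.com/a.jpg"},
               {"url": "https://example.com/b.jpg"}],
})
print(extract_image_urls(record))
# ['https://example.com/a.jpg', 'https://example.com/b.jpg']
```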
• Offline processing
o For each apartment image, image features were extracted and saved to a pickle file
Gist_feature_extraction_script.py: Extracts GIST features from images and saves them to a local pickle file along with metadata
HSV_feature_extraction_script.py: Extracts HSV features from images and saves them to a local pickle file along with metadata
Feature_combination_script.py: Merges the two features above into a single feature and saves it to a local pickle file along with metadata
o Highlights and reviews were put into a CSV file for further use
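The feature-merging step can be sketched as below; the `{apartment_id: feature_vector}` pickle layout is an assumption about what the extraction scripts emit, not their actual format:

```python
import os
import pickle
import tempfile
import numpy as np

def combine_features(gist_path, hsv_path, out_path):
    """Concatenate the GIST and HSV feature of each apartment into a
    single vector and pickle the merged dict (assumed layout:
    {apartment_id: feature_vector})."""
    with open(gist_path, "rb") as f:
        gist = pickle.load(f)
    with open(hsv_path, "rb") as f:
        hsv = pickle.load(f)
    combined = {apt: np.concatenate([gist[apt], hsv[apt]])
                for apt in gist if apt in hsv}
    with open(out_path, "wb") as f:
        pickle.dump(combined, f)
    return combined

# tiny demo with fabricated feature dicts
tmp = tempfile.mkdtemp()
paths = [os.path.join(tmp, name)
         for name in ("gist.pkl", "hsv.pkl", "combined.pkl")]
with open(paths[0], "wb") as f:
    pickle.dump({"apt1": np.array([1.0, 2.0])}, f)
with open(paths[1], "wb") as f:
    pickle.dump({"apt1": np.array([0.5])}, f)
merged = combine_features(paths[0], paths[1], paths[2])
print(merged["apt1"])  # [1.  2.  0.5]
```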
Image feature extraction (see the references at the bottom for details)
For each image, an HSV-histogram feature was extracted by first decomposing the image into its H/S/V channels, generating a histogram for each channel, and concatenating the three histograms (90 buckets per channel, producing a single feature vector of length 270). HSV is said to reflect how people truly perceive color. The HSV histogram mainly captures the color information of an image.
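A minimal sketch of the HSV-histogram extraction described above, using plain NumPy on an image already converted to HSV (channel values assumed normalized to [0, 1]):

```python
import numpy as np

def hsv_histogram_feature(hsv_image, buckets=90):
    """Concatenate per-channel histograms of an HSV image into one
    feature vector (buckets per channel, 3 * buckets entries total).

    hsv_image: H x W x 3 array with channel values in [0, 1].
    """
    feats = []
    for channel in range(3):
        hist, _ = np.histogram(hsv_image[:, :, channel],
                               bins=buckets, range=(0.0, 1.0))
        feats.append(hist / hist.sum())  # normalize each channel
    return np.concatenate(feats)

image = np.random.rand(32, 32, 3)  # stand-in for a real HSV image
vec = hsv_histogram_feature(image)
print(vec.shape)  # (270,)
```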
For the GIST feature, a set of filters was first convolved over the entire image, producing a set of filtered images. Each filtered image was then divided into 4 * 4 subimages, and the mean value of each subimage was computed. Finally, all the mean values were concatenated into a single vector called the GIST feature. GIST mainly captures the high-level characteristics of each image, such as structural information and whether it is plain or textured, providing a holistic view of each image.
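A simplified sketch of the GIST pipeline described above; the two-tap edge filters below are toy stand-ins for the oriented Gabor filter bank used by real GIST implementations:

```python
import numpy as np
from scipy.ndimage import convolve

def gist_like_feature(gray_image, filters, grid=4):
    """Simplified GIST-style descriptor: convolve the image with each
    filter, split each response map into grid x grid cells, and keep
    the mean of every cell."""
    h, w = gray_image.shape
    feats = []
    for filt in filters:
        response = np.abs(convolve(gray_image, filt, mode="reflect"))
        for i in range(grid):
            for j in range(grid):
                cell = response[i * h // grid:(i + 1) * h // grid,
                                j * w // grid:(j + 1) * w // grid]
                feats.append(cell.mean())
    return np.array(feats)

# toy filter bank: horizontal and vertical edge detectors
filters = [np.array([[1.0, -1.0]]), np.array([[1.0], [-1.0]])]
image = np.random.rand(64, 64)
print(gist_like_feature(image, filters).shape)  # (32,) = 2 filters * 16 cells
```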
o CNN (to do)
Transfer learning will be used as the feature-extraction methodology. Specifically, the output of the layer before the softmax of a well-trained Inception-v3 model (trained on ImageNet) may be a feasible image feature.
o others(to do)
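The planned transfer-learning extractor could look roughly like the Keras sketch below; `weights=None` is used so the example builds without downloading the ImageNet weights the real pipeline would load:

```python
import numpy as np
from tensorflow.keras.applications import InceptionV3

# Use the pooled activations just before the classifier as the image
# feature. weights=None keeps this sketch self-contained; the real
# extractor would pass weights="imagenet".
extractor = InceptionV3(weights=None, include_top=False,
                        pooling="avg", input_shape=(299, 299, 3))
feature = extractor.predict(np.random.rand(1, 299, 299, 3))
print(feature.shape)  # (1, 2048)
```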
o Cosine Distance
Cosine distance was used because it is immune to differences in feature magnitudes
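A one-line illustration of why cosine distance ignores magnitude:

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity: depends only on the angle between the
    vectors, not on their magnitudes."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

v = np.array([1.0, 2.0, 3.0])
print(cosine_distance(v, 10 * v))  # ~0.0: rescaling changes nothing
```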
Evaluation - conceptually and quantitatively
Challenge: how can we know whether a feature works for measuring image similarity, given that this is an unsupervised task and we have no ground-truth labels for the images scraped from Airbnb?
Solution: test it on another labeled dataset that resembles Airbnb's scenario
• Dataset: image features and similarity measurements were tested on a subset of the Indoor Scene Recognition dataset from MIT. For more details: http://web.mit.edu/torralba/www/indoor.html
Examples from the dataset (images omitted)
• Assumption: features of images in the same category (bedroom/bookstore/florist/dining room, ...) should be closer to each other than to features from different categories. Quantitatively speaking, the higher the ratio dis(between clusters)/dis(within clusters), the better the feature is.
The Calinski-Harabasz index was used to quantitatively validate the features. For comparison, a baseline case was fabricated by shuffling the image labels to make them random. The results were as follows:
Calinski-Harabasz index for the real case: 22.816
Calinski-Harabasz index for the random case: 0.986
The Calinski-Harabasz index for the real case is considerably higher than that of the baseline case, indicating that the feature works for representing images and that it is feasible to use it to measure image similarity.
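The shuffled-label baseline check can be reproduced on synthetic data with scikit-learn's `calinski_harabasz_score`; the two Gaussian blobs below are fabricated stand-ins for feature vectors from two image categories:

```python
import numpy as np
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(0)
# two well-separated synthetic "categories" standing in for, e.g.,
# bedroom and bookstore feature vectors
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 5)),
               rng.normal(8.0, 1.0, size=(50, 5))])
labels = np.array([0] * 50 + [1] * 50)

real = calinski_harabasz_score(X, labels)
shuffled = calinski_harabasz_score(X, rng.permutation(labels))
print(real > shuffled)  # True: real labels score far higher
```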
• Step 1: given a picture, the system returns places that look similar to it (in the same style)
• (backlog) Step 2: users can add word descriptions of where they dream of going (industrial style, splendid, ...) to refine the results
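Step 1 boils down to ranking the stored apartment feature vectors by cosine distance to the query image's feature; a minimal sketch:

```python
import numpy as np

def top_k_similar(query, features, k=3):
    """Return the indices of the k stored feature vectors closest to
    the query under cosine distance."""
    q = query / np.linalg.norm(query)
    F = features / np.linalg.norm(features, axis=1, keepdims=True)
    distances = 1.0 - F.dot(q)            # cosine distance per row
    return np.argsort(distances)[:k]      # smallest distance first

stored = np.array([[1.0, 0.0],   # same direction as the query
                   [0.9, 0.1],   # almost the same style
                   [0.0, 1.0]])  # orthogonal, i.e. very different
print(top_k_similar(np.array([1.0, 0.0]), stored, k=2))  # [0 1]
```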
• AWS EC2 and S3
• (To do) Keras/TensorFlow
References for the techniques involved in the project:
• Matthijs Douze, Hervé Jégou, Harsimrat Sandhawalia, Laurent Amsaleg, Cordelia Schmid. Evaluation of GIST descriptors for web-scale image search. CIVR 2009 - International Conference on Image and Video Retrieval, Jul 2009, Santorini, Greece. ACM, pp. 19:1-8, 2009.
• Gist/Context of a Scene: http://ilab.usc.edu/siagian/Research/Gist/Gist.html
• A. Oliva and A. Torralba. Modeling the shape of the scene: a holistic representation of the spatial envelope. IJCV, 42(3):145–175, 2001.
• Similarity measurement between images: https://ieeexplore.ieee.org/document/1508081/
• Calinski-Harabasz Index and Bootstrap Evaluation with Clustering Methods: http://ethen8181.github.io/machine-learning/clustering_old/clustering/clustering.html
PyMongo, cv2 (OpenCV), FFTW3, urllib2, boto3, SciPy, scikit-learn, PIL, Flask