The goal is to predict which place a user is checking into, given the location and timestamp of the check-in. This project started as a coding challenge on Kaggle, and the dataset we use comes from the competition page itself.
A system that classifies and analyzes social media check-in data to find patterns in user check-in activity would benefit both customers and businesses. Such predictions will never be one hundred percent accurate, but even a reasonably good prediction is valuable to both.
- See the most common places of interest in a given geographical area.
- Allow businesses to promote themselves based on customer check-ins and find ways to improve customer retention.
- Understand how customers visit places by season, day, hour, etc. More check-ins in the evening could be a sign of restaurant outings and help identify popular places to eat.
- Allow users to find the most visited hangout places nearby.
- Allow tourists to plan their trips better by highlighting how people visit a city or area at different times of the day.
- Enable customized offers and personalized, customer-based advertising on social media and other networks.
The dataset contains train and test data files with the following columns:
- row_id -> id of the check-in event
- x -> x coordinate of the check-in
- y -> y coordinate of the check-in
- accuracy -> location accuracy
- timestamp -> timestamp of the check-in
- place_id -> business id
One of the major challenges was making the data more meaningful by finding patterns in the features. Hence, data preprocessing and feature extraction are critical steps in this scenario.
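As an illustration of this kind of feature extraction, the sketch below derives hour, weekday and day number from a raw timestamp. It assumes the `timestamp` column counts minutes from an unspecified origin, which the dataset description here does not state; the function name `time_features` is hypothetical and is not taken from the repository.

```python
# Assumption: `timestamp` is measured in minutes from an unknown origin.
MINUTES_PER_HOUR = 60
MINUTES_PER_DAY = 24 * MINUTES_PER_HOUR
MINUTES_PER_WEEK = 7 * MINUTES_PER_DAY


def time_features(timestamp):
    """Derive simple time components from a raw check-in timestamp."""
    minute_of_day = timestamp % MINUTES_PER_DAY
    return {
        # Hour within the day, 0..23
        "hour": minute_of_day // MINUTES_PER_HOUR,
        # Position within a 7-day cycle, 0..6 (relative to the origin,
        # not necessarily a calendar weekday)
        "weekday": (timestamp % MINUTES_PER_WEEK) // MINUTES_PER_DAY,
        # Whole days elapsed since the origin
        "day": timestamp // MINUTES_PER_DAY,
    }


print(time_features(1500))  # {'hour': 1, 'weekday': 1, 'day': 1}
```

Because the origin is unknown, the `weekday` value only identifies a repeating 7-day slot; that is still enough for a classifier to pick up weekly patterns.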
To reinforce our understanding of the problem and to visualize the data, we started by plotting all the check-ins within a smaller grid of 500 X 500 meters, taken at random from the larger grid. We used an R script, CreatePlots.R in the repository, to generate the plot. The part of the script that generates it is:

    library(ggplot2)

    # Scatter plot of check-ins in the small grid, one color per place_id
    ggplot(small_trainz, aes(x, y)) +
      geom_point(aes(color = place_id)) +
      theme_minimal() +
      theme(legend.position = "none") +
      ggtitle("Check-ins for a smaller 500 X 500m grid")

We have plotted only place_ids that have more than 100 check-ins to visualize the clusters.
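The selection described above (one small grid cell, then only the places with enough check-ins) can be sketched in Python as follows. This is not the code from CreatePlots.R; the function name, the dictionary layout and the sample rows are made up for illustration, and the coordinate units are whatever the `x` and `y` columns use.

```python
from collections import Counter


def frequent_places_in_cell(checkins, x_min, y_min, size, min_checkins):
    """Keep check-ins inside a square cell whose place has more than
    `min_checkins` check-ins within that cell."""
    in_cell = [
        c for c in checkins
        if x_min <= c["x"] < x_min + size and y_min <= c["y"] < y_min + size
    ]
    counts = Counter(c["place_id"] for c in in_cell)
    return [c for c in in_cell if counts[c["place_id"]] > min_checkins]


# Tiny made-up sample; the text above uses a threshold of 100 check-ins.
sample = [
    {"x": 1.0, "y": 1.0, "place_id": "A"},
    {"x": 1.2, "y": 1.1, "place_id": "A"},
    {"x": 1.4, "y": 1.3, "place_id": "B"},
    {"x": 9.0, "y": 9.0, "place_id": "A"},  # outside the cell
]
kept = frequent_places_in_cell(sample, x_min=1.0, y_min=1.0, size=0.5, min_checkins=1)
print([c["place_id"] for c in kept])  # ['A', 'A']
```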
To make these clusters more separable and more evident, we added one more feature, “hour of the day”, as the third dimension of the plot. The third dimension helps, and the plot supports our assumption that the hour of the day affects the check-ins for a place. Plotting the same data with the “weekday” feature resulted in a similar plot. The 3D plot was generated with:

    library(plotly)

    # 3D scatter of check-ins by x, y and hour of day, one color per place_id
    plot_ly(data = small_trainz, x = small_trainz$x, y = small_trainz$y, z = small_trainz$hour,
            color = small_trainz$place_id, type = "scatter3d", mode = "markers",
            marker = list(size = 5)) %>%
      layout(title = "Place_ids clustered by x, y and hour of day")

These plots confirm our understanding that user check-ins depend on time components such as the hour of the day and the weekday.
After preprocessing, we implemented the approaches below and compared them to find the best results.
- K Nearest Neighbour
With a smaller grid of 250 X 250 meters in place, the first thing that comes to mind for the classification task is KNN. It is very easy to implement and gives good results. The only tricky part of applying KNN is finding the optimal weights for the variables used. We tuned the weights by trial and error. In the future, we plan to use exploratory data analysis to narrow down the optimal weights for KNN in the final version of the report.
- Random Forest
Random Forest was our second choice of classifier because it is efficient and generally gives more accurate results. Performance matters to us because we run the classification on the fly. We also chose random forest because it estimates the importance of each variable in the classification task, which helped us fine-tune our models for the other classifiers as well. We tried different Random Forest implementations available in Python and achieved the best results with the sklearn random forest classifier.
- Boosted Trees
To improve the accuracy further, we tried boosted trees, i.e. an ensemble of classification and regression trees (CART). We used the XGBoost library, short for “Extreme Gradient Boosting”; the term “Gradient Boosting” was proposed in the paper Greedy Function Approximation: A Gradient Boosting Machine, by Friedman.
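To show why the variable weights matter for KNN, here is a minimal pure-Python sketch of a weighted nearest-neighbour vote. It is not the project's implementation: the function name, the weights and the sample rows are all made up, and the hour difference is treated as a plain number even though hours wrap around at midnight. Scaling each feature column by its weight before running a standard Euclidean KNN has the same effect.

```python
import math
from collections import Counter


def weighted_knn_predict(train, query, weights, k=3):
    """Predict a place_id by majority vote among the k nearest rows.

    `train` is a list of (features, place_id) pairs. Each feature
    difference is multiplied by its weight before the Euclidean distance
    is computed, so a larger weight makes that feature count for more.
    """
    def distance(a, b):
        return math.sqrt(sum((w * (p - q)) ** 2 for w, p, q in zip(weights, a, b)))

    nearest = sorted(train, key=lambda row: distance(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up rows: (x, y, hour) -> place_id
train = [
    ((1.0, 1.0, 9), "cafe"),
    ((1.1, 1.0, 8), "cafe"),
    ((1.0, 1.1, 10), "cafe"),
    ((1.3, 1.0, 21), "bar"),
    ((1.4, 1.1, 22), "bar"),
    ((1.3, 1.1, 20), "bar"),
]
query = (1.05, 1.05, 21)  # spatially close to the cafe, but late in the evening

# Ignoring the hour, the spatially closest rows win.
print(weighted_knn_predict(train, query, weights=(1.0, 1.0, 0.0)))  # cafe
# Giving the hour some weight flips the prediction.
print(weighted_knn_predict(train, query, weights=(1.0, 1.0, 0.1)))  # bar
```

The same query gets a different label depending on the weights, which is why tuning them, by trial and error or by a more systematic search, has a direct effect on accuracy.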
We obtained the following accuracies:
Method | Accuracy |
---|---|
KNN | 0.42 |
Random Forest | 0.48 |
Boosted Trees | 0.54 |
Our best results came from boosted trees using the XGBoost library, with an accuracy of 0.54.
- Spatial data and ways to handle it
- Different classifiers - pros and cons
- Data exploration and feature analysis for Big Data
- Improved processing for better results
- Offline Data Processing for Real-Time Results
- Recommendations based on Check-ins
- Ensemble and Improved Models
- Problem Statement - https://www.kaggle.com/c/facebook-v-predicting-check-ins
- Dataset - https://www.kaggle.com/c/facebook-v-predicting-check-ins/data
- sklearn - http://scikit-learn.org/
- XGBoost - http://xgboost.readthedocs.io/en