Facebook Footsteps - Predict Facebook checkins

Project Idea

The idea is to predict where a user could be checking into from their location and timestamp. This project was a coding challenge on kaggle.com, and the dataset we are using is from the website itself.

A system which could classify and analyze social media check-in data to find patterns on how users check-in activity would be beneficial for the customers and businesses. Although, achieving this would never have a hundred percent accuracy, but a good prediction would certainly do wonders for both.

Applications

See the most common places of interests as per the geographical area.
Allow businesses to promote themselves as per customer check-ins and find ways to improve and work on customer retention
Realize how customer visit places as per the seasons, days, hours, etc. More check-ins in evening could be a sign of restaurant outings and allow to find favorable eating joints.
Allow users to find the most visited hangout places around.
Allow tourists to better plan their trips highlighting how people go about visiting a city and area as per different time of the day.
Customized offers and personalized customer based advertising on social media and other networks.

Dataset

The dataset contains train and test data files with below columns:

row_id -> id of the checkin event
x -> x coordinate of checkin
y -> y coordinate of checkin
accuracy - location accuracy
timestamp - timestamp of checkin
place_id -> business id

One of the major challenges was to make the data more meaningful by finding patterns on features. Hence, data preprocessing and feature extraction is a critical step in this scenario.

Exploratory Analysis

To reinforce our understanding of the problem and to visualize the data we start by plotting all the check-ins within a smaller grid of 500 X 500 meters taken at random from the given larger grid. We used R script to generate the plot. The script is CreatePlots.R in the repository. A subset of script which generates the same is

ggplot(small_trainz, aes(x, y )) +
  geom_point(aes(color = place_id)) + 
  theme_minimal() +
  theme(legend.position = "none") +
  ggtitle("Check-ins for a smaller 500 X 500m grid")

We have plotted only place_ids that have more than 100 check-ins to visualize the clusters.

To make these clusters separable and more evident we tried using one more feature “hour of the day” as the third dimension for our plot. Addition of third dimension helps and we can see that our assumption that hour of day affects the check-ins for a place is valid. We tried plotting the same data using “weekday” feature that resulted in similar plot. The same was generated as

plot_ly(data = small_trainz, x = small_trainz$x , y = small_trainz$y, z = small_trainz$hour, color = small_trainz$place_id,  type = "scatter3d", mode = "markers", marker=list(size= 5)) %>% layout(title = "Place_ids clustered by x, y and hour of day")

These plots confirm our understanding that check-ins from user depends on the different time components like hour of the day and weekday.

Techniques used

Post the processing, wee implemented the below approaches to compare and get the best results.

K Nearest Neighbour

Once we have a smaller grid of 250 X 250 meters in place first thing that comes to mind is KNN for the classification task. It is very easy to implement and give good results. The only tricky part about applying KNN is finding out the optimal weights for the variables used. We have used hit and trial method to optimize our model. We plan to use data exploratory techniques in future to narrow down on optimal weights for KNN in final version of the report.

Random Forest

Random Forest was our second choice for the classifier as it is efficient and generally results in more accurate results. The performance factor of random forest is important for us as we are doing classification task on the fly. We chose random forest also because it gives us estimate of the importance of different variables in classification task this helped us in fine-tuning our model for other classifiers as well. We tried different flavors of Random Forest available in Python and achieved best results using sklearn random forest classifier.

Boosted Trees

To improve the accuracy further, we tried boosted trees i.e. tree ensemble model for classification and regression trees (CART). We used XGBoost library, short for “Extreme Gradient Boosting”, where the term “Gradient Boosting” is proposed in the paper Greedy Function Approximation: A Gradient Boosting Machine, by Friedman.

Results

We obtained the below accuracies:

Method	Accuracy
KNN	0.42
Random Forest	0.48
Boosted Trees	0.54

Our best results were with the XGBoost library for Boosted trees with an accuracy of 0.54.

Key Learnings

Spatial data and ways to handle it
Different classifiers - pros and cons
Data exploration and feature analysis for Big Data

Future Enhancements

Improved processing for better results
Offline Data Processing for Real-Time Results
Recommendations based on Check-ins
Ensemble and Improved Models

References

Problem Statement - https://www.kaggle.com/c/facebook-v-predicting-check-ins
Dataset - https://www.kaggle.com/c/facebook-v-predicting-check-ins/data
sklearn - http://scikit-learn.org/
XGBoost - http://xgboost.readthedocs.io/en

Name		Name	Last commit message	Last commit date
Latest commit History 31 Commits
Exploratory_Analysis		Exploratory_Analysis
Report		Report
README.md		README.md
aldaProject.py		aldaProject.py
result.PNG		result.PNG

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Facebook Footsteps - Predict Facebook checkins

Project Idea

Applications

Dataset

Exploratory Analysis

Techniques used

Results

Key Learnings

Future Enhancements

References

About

Releases

Packages

Contributors 3

Languages

shivamgulati1991/Facebook--Footsteps-Prediction

Folders and files

Latest commit

History

Repository files navigation

Facebook Footsteps - Predict Facebook checkins

Project Idea

Applications

Dataset

Exploratory Analysis

Techniques used

Results

Key Learnings

Future Enhancements

References

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages