# Spatio-Temporal Data Mining
**_Dataset and Data preprocessing_**  
*Dr. Mitra Baratchi, Leiden University*  
*Hossein A. Rahmani, University of Zanjan*

## Dataset Properties

The __original_data.csv__ file contains the users' check-in information. Each line of the file has the following _tab_-separated format:  

__user_ID	POI_ID	coordinate(latitude and longitude)	checkin_time(hour:min)	date_id__  

Note: check-ins made on the same date share the same date_id, and the file contains $151589$ check-ins in total.
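
As a rough illustration of the line format above, the following sketch parses two made-up rows with Python's `csv` module. The sample values, the `parse_checkins` helper, and the `lat,lon` encoding of the coordinate column are assumptions for illustration only; the real file may encode coordinates differently.

```python
import csv
import io

# Hypothetical sample rows in the tab-separated layout described above.
# Assumption: the coordinate column is written as "latitude,longitude".
sample = (
    "0\t3\t1.3,103.8\t08:15\t12\n"
    "0\t7\t1.28,103.85\t21:40\t12\n"
)


def parse_checkins(handle):
    """Yield one dict per tab-separated check-in row."""
    for user_id, poi_id, coord, checkin_time, date_id in csv.reader(handle, delimiter="\t"):
        latitude, longitude = (float(value) for value in coord.split(","))
        hour, minute = (int(value) for value in checkin_time.split(":"))
        yield {
            "user_id": int(user_id),
            "poi_id": int(poi_id),
            "latitude": latitude,
            "longitude": longitude,
            "hour": hour,
            "minute": minute,
            "date_id": int(date_id),
        }


rows = list(parse_checkins(io.StringIO(sample)))
```

Replacing the `io.StringIO` object with an opened file handle (`open(path, newline='')`) would apply the same parser to a file on disk.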

In this notebook, we read the data and preprocess it. First, we read the check-in data and count the check-ins of each user and location pair; an indicator function ($1$ if a POI is checked by user $u$, otherwise $0$) follows directly from these counts.
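
The relation between per-pair check-in counts and the indicator function can be sketched as follows. The `pairs` list and the `checked` helper are hypothetical and only serve to illustrate the idea; they are not part of the original notebook.

```python
from collections import Counter

# Hypothetical (user_id, poi_id) pairs, one entry per check-in.
pairs = [(0, 3), (0, 3), (0, 7), (1, 3)]

# Number of check-ins for each (user, POI) pair.
frequency = Counter(pairs)


def checked(user_id, poi_id):
    """Indicator function: 1 if the user checked in at the POI, otherwise 0."""
    return 1 if frequency[(user_id, poi_id)] > 0 else 0
```

A `Counter` returns `0` for pairs it has never seen, so the indicator needs no special handling for unvisited POIs.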

The [CSV (Comma Separated Values)](https://en.wikipedia.org/wiki/Comma-separated_values) format is the most common import and export format for spreadsheets and databases.
We use Python's built-in [`csv` module](https://docs.python.org/3/library/csv.html) to work with CSV files.  

In [1]:
# Import the libraries used below
import csv

from pymongo import MongoClient
from tqdm import tqdm

client = MongoClient('localhost', 27017)
db_final = client.Yelp_Final
business_final = db_final.business
review_final = db_final.review

In [15]:
# Build a dict that maps each POI to its (latitude, longitude)
location_geo = {}
bar = tqdm(total=business_final.count_documents({}), desc='Get Business Location')
tempIds = business_final.find({}, no_cursor_timeout=True, batch_size=10)
for item in tempIds:
    location_geo[item['newId']] = (item['latitude'], item['longitude'])
    bar.update(1)
tempIds.close()
bar.close()

In [None]:
# Build a dict that maps each (user, POI) pair to its number of check-ins
user_checkins = {}
bar = tqdm(total=review_final.count_documents({}), desc='Get User Checkins')
tempIds = review_final.find({}, no_cursor_timeout=True, batch_size=10)
for item in tempIds:
    key = (item['newUserId'], item['newBusinessId'])
    # Start from 0 the first time a (user, POI) pair is seen
    user_checkins[key] = user_checkins.get(key, 0) + 1
    bar.update(1)
tempIds.close()
bar.close()

In the next step, we save the preprocessed data into two files. The first file stores the users' check-ins in three columns: __User_id__, __Location_id__, and __Checkin_frequency__. The second file stores the _geo data_, i.e. the location information (**Location_id**, **latitude**, **longitude**).

In [16]:
with open('preprocessed_data/preprocessed_data.csv', 'w', newline='') as preprocessed_data:
    checkins_writer = csv.writer(preprocessed_data, delimiter='\t')
    for checkin_info in user_checkins:
        checkin = [checkin_info[0], checkin_info[1], user_checkins[checkin_info]]
        checkins_writer.writerow(checkin)

with open('preprocessed_data/geo_data.csv', 'w', newline='') as geo_data:
    geo_writer = csv.writer(geo_data, delimiter='\t')
    for lid, geo in location_geo.items():
        geo = [lid, geo[0], geo[1]]
        geo_writer.writerow(geo)

Now, if everything ran correctly, you will find **preprocessed_data.csv** and **geo_data.csv** in the `preprocessed_data` directory next to this file.
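
One way to check the output is to read a file back and compare it with what was written. The sketch below is a self-contained round trip in a temporary directory; the `write_tsv` and `read_tsv` helpers and the sample counts are hypothetical. It also creates the output directory first, since `open(..., 'w')` raises `FileNotFoundError` when the target directory does not exist.

```python
import csv
import os
import tempfile


def write_tsv(path, rows):
    """Write an iterable of rows as tab-separated values."""
    with open(path, "w", newline="") as handle:
        csv.writer(handle, delimiter="\t").writerows(rows)


def read_tsv(path):
    """Read a tab-separated file back as a list of string lists."""
    with open(path, newline="") as handle:
        return list(csv.reader(handle, delimiter="\t"))


# Hypothetical counts in the same shape as user_checkins above.
sample_checkins = {(0, 3): 2, (1, 3): 1}

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "preprocessed_data")
    os.makedirs(out_dir, exist_ok=True)  # make sure the directory exists
    path = os.path.join(out_dir, "preprocessed_data.csv")
    write_tsv(path, [(user, poi, count) for (user, poi), count in sample_checkins.items()])
    loaded = read_tsv(path)
```

Note that `csv.reader` returns every field as a string, so IDs and frequencies need to be converted with `int(...)` before further use.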

## References

This dataset is used in the experiments of the following paper (it originally comes from the [Foursquare](https://foursquare.com/) location-based social network):  
*Quan Yuan, Gao Cong, Zongyang Ma, Aixin Sun, Nadia Magnenat-Thalmann: __Time-aware point-of-interest recommendation__, SIGIR, 2013*