# STEP 0 - Preprocessing of the Foursquare DataSet

In this step, the dataset is first analyzed and then preprocessed.

## Imports

In [1]:
import pandas as pd
from sqlalchemy import create_engine
import numpy as np

nyc_venues = create_engine('sqlite:///_nyc_venues.db')

## Data Analysis

First, the dataset is loaded into a SQLite database in order to get a general understanding of it. It has the following columns:
* user_id
* venue
* category_id
* category_name
* latitude
* longitude
* time_offset
* time

In [2]:
df = pd.read_csv("./4square/dataset_TSMC2014_NYC.txt", sep="\t",
                 names=["user_id", "venue", "category_id", "category_name",
                        "latitude", "longitude", "time_offset", "time"])
print(df.head(10))

df.to_sql('nyc_venues', con=nyc_venues, if_exists="replace")

   user_id                     venue               category_id  \
0      470  49bbd6c0f964a520f4531fe3  4bf58dd8d48988d127951735   
1      979  4a43c0aef964a520c6a61fe3  4bf58dd8d48988d1df941735   
2       69  4c5cc7b485a1e21e00d35711  4bf58dd8d48988d103941735   
3      395  4bc7086715a7ef3bef9878da  4bf58dd8d48988d104941735   
4       87  4cf2c5321d18a143951b5cec  4bf58dd8d48988d1cb941735   
5      484  4b5b981bf964a520900929e3  4bf58dd8d48988d118951735   
6      642  4ab966c3f964a5203c7f20e3  4bf58dd8d48988d1e0931735   
7      292  4d0cc47f903d37041864bf55  4bf58dd8d48988d12b951735   
8      428  4ce1863bc4f6a35d8bd2db6c  4bf58dd8d48988d103941735   
9      877  4be319b321d5a59352311811  4bf58dd8d48988d10a951735   

         category_name   latitude  longitude  time_offset  \
0  Arts & Crafts Store  40.719810 -74.002581         -240   
1               Bridge  40.606800 -74.044170         -240   
2       Home (private)  40.716162 -73.883070         -240   
3       Medical Center  40.74

In [3]:
res=pd.read_sql_query('SELECT count(*) FROM nyc_venues', nyc_venues)
print(res)

   count(*)
0    227428


There are 227,428 rows in the dataset. Each row is one checkin by a user in NYC.

In [4]:
res=pd.read_sql_query('SELECT count(*) FROM nyc_venues '+
        'WHERE time IS NULL '+
        'OR user_id IS NULL ' +
        'OR venue IS NULL ' +
        'OR category_id IS NULL '+
        'OR category_name IS NULL ', nyc_venues)
print(res)

   count(*)
0         0


In [5]:
res=pd.read_sql_query('SELECT count(*) ' +
        'FROM nyc_venues WHERE ((latitude IS null) OR ' +
        '(latitude = 0.000000) OR (longitude IS null) OR ' +
        '(longitude = 0.000000))', nyc_venues)

print(res)

   count(*)
0         0


It can be concluded that there are no incomplete or invalid rows.
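
The same completeness check can also be sketched directly in pandas, without going through SQL. The small frame below is invented illustration data, not taken from the dataset; on the real data, `df` from the cell above would take its place.

```python
import pandas as pd

# Toy stand-in for the checkin table (values are invented for illustration).
toy = pd.DataFrame({
    "user_id": [1, 2, 3],
    "venue": ["a", "b", None],
    "latitude": [40.7, 0.0, 40.6],
    "longitude": [-74.0, -73.9, -73.8],
})

# Count missing values per column.
missing_per_column = toy.isna().sum()

# Flag rows with a missing field or a zero coordinate, mirroring the SQL conditions.
zero_coord = (toy["latitude"] == 0) | (toy["longitude"] == 0)
invalid_rows = toy[toy.isna().any(axis=1) | zero_coord]

print(missing_per_column)
print(len(invalid_rows), "invalid rows")
```

On a clean dataset, `invalid_rows` would be empty, matching the two zero counts returned by the SQL queries.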

Next, the user data is analyzed. The checkin counts of the most and least active users are the key measurements here.

In [6]:
res=pd.read_sql_query('SELECT user_id, "count" FROM ' +
            '(SELECT count(user_id) as "count", user_id FROM nyc_venues GROUP BY user_id) ' +
            'ORDER BY "count" DESC LIMIT 10', nyc_venues)

print(res)

   user_id  count
0      293   2697
1      185   2079
2      354   2061
3      315   1682
4       84   1376
5      349   1369
6      384   1116
7      974   1107
8      768   1096
9      445    952


In [7]:
res=pd.read_sql_query('SELECT user_id, "count" FROM ' +
            '(SELECT count(user_id) as "count", user_id FROM nyc_venues GROUP BY user_id) ' +
            'ORDER BY "count" ASC LIMIT 10', nyc_venues)

print(res)

   user_id  count
0       18    100
1       92    100
2      247    100
3      268    100
4      324    100
5      369    100
6      394    100
7      426    100
8      451    100
9      671    100


In [8]:
res=pd.read_sql_query('SELECT AVG(count) FROM (SELECT user_id, "count" FROM ' +
            '(SELECT count(user_id) as "count", user_id FROM nyc_venues GROUP BY user_id))', nyc_venues)

print(res)

   AVG(count)
0  209.998153


In [9]:
res =pd.read_sql_query('SELECT count(distinct(user_id)) FROM nyc_venues', nyc_venues)

print(res)

   count(distinct(user_id))
0                      1083


There are 1,083 users in the dataset.
The three most active users have over 2,000 checkins each, the ten least active users have exactly 100 each, and the average is about 210 checkins per user.
These numbers seem reasonable. Next, the locations have to be checked.
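
As a cross-check, the per-user statistics can be sketched in pandas as well. The frame below is invented toy data; with the real data, the `user_id` column of `df` would be used instead.

```python
import pandas as pd

# Toy checkin log (invented values); one row per checkin.
toy = pd.DataFrame({"user_id": [1, 1, 1, 2, 2, 3]})

# Number of checkins per user, sorted from most to least active.
checkins_per_user = toy["user_id"].value_counts()

summary = {
    "users": checkins_per_user.size,
    "max": int(checkins_per_user.max()),
    "min": int(checkins_per_user.min()),
    "mean": float(checkins_per_user.mean()),
}
print(summary)
```

This yields the number of users and the maximum, minimum and mean checkin counts in one pass instead of four separate queries.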

In [10]:
res=pd.read_sql_query('SELECT count(distinct(venue)) FROM nyc_venues', nyc_venues)

print(res)

   count(distinct(venue))
0                   38333


In [11]:
res=pd.read_sql_query('SELECT AVG(a) FROM (' +
    'SELECT count(distinct(venue)) as a, user_id FROM nyc_venues GROUP BY user_id) ' , nyc_venues)

print(res)

      AVG(a)
0  84.048015


There are 38,333 different venues. On average, every user visits 84 distinct venues.
Given that a user has only about 210 checkins on average, this number is too high:
it leaves roughly 2.5 checkins per visited venue.
It can be expected that a model trained on so many locations will be very inaccurate.

In [12]:
res=pd.read_sql_query('SELECT time, category_name FROM nyc_venues WHERE user_id=293 LIMIT 50', nyc_venues)

print(res)

                              time                  category_name
0   Tue Apr 03 23:56:43 +0000 2012                         Subway
1   Wed Apr 04 00:19:13 +0000 2012           Other Great Outdoors
2   Wed Apr 04 00:19:44 +0000 2012                            Bar
3   Wed Apr 04 00:21:12 +0000 2012                     Smoke Shop
4   Wed Apr 04 04:53:06 +0000 2012                   Neighborhood
5   Wed Apr 04 04:53:36 +0000 2012           Other Great Outdoors
6   Wed Apr 04 05:08:28 +0000 2012                    Bus Station
7   Wed Apr 04 15:38:43 +0000 2012                 Home (private)
8   Wed Apr 04 16:16:45 +0000 2012                 Home (private)
9   Wed Apr 04 16:18:34 +0000 2012                 General Travel
10  Wed Apr 04 16:20:07 +0000 2012                         Subway
11  Wed Apr 04 17:38:40 +0000 2012                         Subway
12  Wed Apr 04 17:40:06 +0000 2012              Electronics Store
13  Wed Apr 04 17:40:29 +0000 2012                         Office
14  Wed Ap

The checkin times are as expected. On most days, there are many checkins.
The data is continuous and can thus be used.

In order to decrease the number of locations, the categories can be used instead.
As a result, the prediction becomes semantic (what kind of place) rather than physical (latitude/longitude).

In [13]:
res=pd.read_sql_query('SELECT DISTINCT(category_id) AS ID, category_name FROM nyc_venues', nyc_venues)
print(res)

                           ID         category_name
0    4bf58dd8d48988d127951735   Arts & Crafts Store
1    4bf58dd8d48988d1df941735                Bridge
2    4bf58dd8d48988d103941735        Home (private)
3    4bf58dd8d48988d104941735        Medical Center
4    4bf58dd8d48988d1cb941735            Food Truck
..                        ...                   ...
395  5032897c91d4c4b30a586d69           Pet Service
396  4f04b08c2fb6e1c99f3db0bd    Miscellaneous Shop
397  4eb1c0ed3b7b52c0e1adc2ea              Ski Area
398  5032856091d4c4b30a586d63                Office
399  5032848691d4c4b30a586d61  Other Great Outdoors

[400 rows x 2 columns]


In [14]:
res=pd.read_sql_query('SELECT COUNT(DISTINCT(category_id)) FROM nyc_venues', nyc_venues)

print(res)

   COUNT(DISTINCT(category_id))
0                           400


In [15]:
res=pd.read_sql_query('SELECT category_name, COUNT(*) FROM nyc_venues GROUP BY category_id ORDER BY COUNT(*) DESC', nyc_venues)

print(res)

            category_name  COUNT(*)
0          Home (private)     15334
1                  Office      9614
2                  Subway      9348
3             Coffee Shop      7510
4                     Bar      6188
..                    ...       ...
395  Other Great Outdoors         1
396              Ski Area         1
397              Ski Area         1
398    Miscellaneous Shop         1
399          Music School         1

[400 rows x 2 columns]


In [16]:
res=pd.read_sql_query('SELECT a.* FROM (SELECT category_name, COUNT(*) AS num FROM nyc_venues GROUP BY category_name ORDER BY num DESC) a WHERE a.num < 10', nyc_venues)
print(res)

               category_name  num
0   Bike Rental / Bike Share    9
1      Portuguese Restaurant    9
2                 Distillery    8
3              Internet Cafe    6
4     Gluten-free Restaurant    5
5          Afghan Restaurant    4
6             Sorority House    4
7                Pet Service    3
8                     Castle    2
9            Motorcycle Shop    2
10           Photography Lab    2
11              Music School    1


There are 400 different category IDs in the dataset.
This number still seems quite high, especially as some categories appear to be duplicated (the same name under several IDs, e.g. "Ski Area") or are used for only a handful of checkins.
A possible solution is to group them into super categories.
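
One way to make the duplicates visible is to count how many IDs share a category name. The following is a minimal sketch on invented toy data (the IDs `id1` to `id3` are placeholders), not a query that was run on the dataset.

```python
import pandas as pd

# Toy data (invented IDs): "Office" appears under two different category IDs.
toy = pd.DataFrame({
    "category_id": ["id1", "id2", "id3", "id1"],
    "category_name": ["Office", "Office", "Subway", "Office"],
})

# Number of distinct IDs per category name.
ids_per_name = toy.groupby("category_name")["category_id"].nunique()

# Names that are backed by more than one ID are duplicates.
duplicated_names = ids_per_name[ids_per_name > 1]
print(duplicated_names)
```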

## Pre-processing

The categories are exported to a .csv file, and custom uber categories (super categories) are created from it.
This is done per category_name rather than per category_id in order to account for duplicates.

In [17]:
unprocessed = df.copy()

In [18]:
res=pd.read_sql_query('SELECT category_name, COUNT(*) FROM nyc_venues GROUP BY category_name ORDER BY COUNT(*) DESC', nyc_venues)
res.to_csv('./4square/categories.csv', index=False)

In [19]:
uber = create_engine('sqlite:///_uber.db')
df = pd.read_csv("./4square/uber_cat.csv", sep=",")

print(df.head(10))

df.to_sql('uber', con=uber, if_exists="replace")

      uber_category         category_name  COUNT(*)
0               Bar                   Bar     15978
1              Home        Home (private)     15382
2              Work                Office     12740
3  Public Transport                Subway      9348
4            Sports  Gym / Fitness Center      9171
5              Cafe           Coffee Shop      7510
6           Grocery     Food & Drink Shop      6596
7  Public Transport         Train Station      6408
8              Park                  Park      4804
9              Home          Neighborhood      4604


In [20]:
res=pd.read_sql_query('SELECT DISTINCT uber_category, COUNT(*) FROM uber GROUP BY uber_category ORDER BY COUNT(*) DESC', uber)
print(res)

            uber_category  COUNT(*)
0              Restaurant        68
1                Shopping        30
2           Entertainment        18
3                  School        17
4                 Culture        12
5                 Service        11
6                  Snacks        11
7        Public Transport        10
8           Misc Shopping         8
9                Outdoors         7
10               Religion         7
11              Cosmetics         5
12                  Event         5
13                   Home         5
14         Infrastructure         5
15                   Misc         5
16                   Cafe         4
17  Gas Station / Parking         3
18             Government         3
19                   Park         3
20                 Sports         3
21                   Work         3
22                    Bar         2
23                Grocery         2
24                Medical         2
25               Building         1
26                  Hotel   

There are 27 uber categories, each grouping semantically similar original categories.

Next, the uber categories are joined with the original dataset.

In [21]:
ubers=pd.read_sql_query('SELECT DISTINCT uber_category, category_name FROM uber', uber)

In [22]:
merged = unprocessed.merge(ubers, on='category_name', suffixes=('', '_y'), how='left')
merged.head(10)

Unnamed: 0,user_id,venue,category_id,category_name,latitude,longitude,time_offset,time,uber_category
0,470,49bbd6c0f964a520f4531fe3,4bf58dd8d48988d127951735,Arts & Crafts Store,40.71981,-74.002581,-240,Tue Apr 03 18:00:09 +0000 2012,Shopping
1,979,4a43c0aef964a520c6a61fe3,4bf58dd8d48988d1df941735,Bridge,40.6068,-74.04417,-240,Tue Apr 03 18:00:25 +0000 2012,Infrastructure
2,69,4c5cc7b485a1e21e00d35711,4bf58dd8d48988d103941735,Home (private),40.716162,-73.88307,-240,Tue Apr 03 18:02:24 +0000 2012,Home
3,395,4bc7086715a7ef3bef9878da,4bf58dd8d48988d104941735,Medical Center,40.745164,-73.982519,-240,Tue Apr 03 18:02:41 +0000 2012,Medical
4,87,4cf2c5321d18a143951b5cec,4bf58dd8d48988d1cb941735,Food Truck,40.740104,-73.989658,-240,Tue Apr 03 18:03:00 +0000 2012,Snacks
5,484,4b5b981bf964a520900929e3,4bf58dd8d48988d118951735,Food & Drink Shop,40.690427,-73.954687,-240,Tue Apr 03 18:04:00 +0000 2012,Grocery
6,642,4ab966c3f964a5203c7f20e3,4bf58dd8d48988d1e0931735,Coffee Shop,40.751591,-73.974121,-240,Tue Apr 03 18:04:38 +0000 2012,Cafe
7,292,4d0cc47f903d37041864bf55,4bf58dd8d48988d12b951735,Bus Station,40.779422,-73.955341,-240,Tue Apr 03 18:04:42 +0000 2012,Public Transport
8,428,4ce1863bc4f6a35d8bd2db6c,4bf58dd8d48988d103941735,Home (private),40.619151,-74.035888,-240,Tue Apr 03 18:06:18 +0000 2012,Home
9,877,4be319b321d5a59352311811,4bf58dd8d48988d10a951735,Bank,40.619006,-73.990375,-240,Tue Apr 03 18:06:19 +0000 2012,Service
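
A left join keeps every checkin, but it fills `uber_category` with NaN for any category name missing from the mapping. A minimal sketch of how this could be checked, using invented toy frames and the `validate` and `indicator` options of `DataFrame.merge`:

```python
import pandas as pd

# Toy frames (invented values); "Castle" has no entry in the mapping table.
checkins = pd.DataFrame({"category_name": ["Bar", "Subway", "Castle"]})
mapping = pd.DataFrame({
    "category_name": ["Bar", "Subway"],
    "uber_category": ["Bar", "Public Transport"],
})

# validate="many_to_one" raises if a category name occurs twice in the mapping,
# which would otherwise silently duplicate checkin rows.
joined = checkins.merge(mapping, on="category_name", how="left",
                        validate="many_to_one", indicator=True)

# Rows marked "left_only" found no uber category.
unmapped = joined.loc[joined["_merge"] == "left_only", "category_name"].tolist()
print(unmapped)
```

An empty `unmapped` list would confirm that every category name has been assigned an uber category.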


The first processed dataset is supposed to be minimalistic.
It should contain only the time components and the uber category.
The time is split into several features so that the model can be trained more accurately:
for example, it becomes easier to distinguish hourly, daily and monthly trends.
Week days are also important, in particular the distinction between working days and weekends (possibly through a dedicated feature, e.g. "is_weekend").
The user information has to be kept, too, because the data must be sequenced for every user.

In [23]:
def is_weekend(week_day):
    return week_day > 4

In [24]:
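# Parse the timestamp strings (all given in UTC, "+0000").
merged["time"] = pd.to_datetime(merged["time"])

merged["month"] = merged.time.dt.month
merged["day"] = merged.time.dt.day
# Time of day as fractional hours, e.g. 18:30:00 -> 18.5
merged["clock"] = merged.time.dt.hour + merged.time.dt.minute/60 + merged.time.dt.second/3600
# dayofweek runs from 0 (Monday) to 6 (Sunday)
merged["week_day"] = merged.time.dt.dayofweek
# The comparison in is_weekend works element-wise on the whole column.
merged["is_weekend"] = is_weekend(merged["week_day"])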
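
Note that the features above are derived from the UTC timestamps. If local time of day is preferred, the `time_offset` column could be applied first. The sketch below assumes that `time_offset` is the local offset from UTC in minutes; this is an assumption about the dataset and is not verified in this notebook.

```python
import pandas as pd

# First row of the dataset as shown above: UTC timestamp and an offset of -240.
toy = pd.DataFrame({
    "time": ["Tue Apr 03 18:00:09 +0000 2012"],
    "time_offset": [-240],
})

utc = pd.to_datetime(toy["time"], format="%a %b %d %H:%M:%S %z %Y")

# Assumption: time_offset is the local offset from UTC in minutes.
local = utc.dt.tz_localize(None) + pd.to_timedelta(toy["time_offset"], unit="min")

# Fractional hour of the local time of day.
local_clock = local.dt.hour + local.dt.minute / 60 + local.dt.second / 3600
print(local.iloc[0], local_clock.iloc[0])
```

Under this assumption, the first checkin would fall in the early afternoon local time rather than at 18:00.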

The categories are transformed into unique integer ids in order to save space and make learning easier.

In [25]:
merged['cat_id'] = merged.groupby('uber_category', sort=False).ngroup()
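
To decode predicted ids back into category names later, a lookup table is needed. A minimal sketch on toy data, using `pd.factorize` as an alternative that returns the codes together with that table (this is an illustration, not what the notebook itself runs):

```python
import pandas as pd

# Toy category column (names taken from the uber categories above).
cats = pd.Series(["Shopping", "Home", "Shopping", "Bar"], name="uber_category")

# groupby(..., sort=False).ngroup() numbers groups in order of first appearance.
ids_ngroup = cats.to_frame().groupby("uber_category", sort=False).ngroup()

# pd.factorize yields the same codes plus the lookup table for decoding.
codes, uniques = pd.factorize(cats)
id_to_category = dict(enumerate(uniques))

print(ids_ngroup.tolist(), codes.tolist(), id_to_category)
```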

In [26]:
merged.head(10)

Unnamed: 0,user_id,venue,category_id,category_name,latitude,longitude,time_offset,time,uber_category,month,day,clock,week_day,is_weekend,cat_id
0,470,49bbd6c0f964a520f4531fe3,4bf58dd8d48988d127951735,Arts & Crafts Store,40.71981,-74.002581,-240,2012-04-03 18:00:09+00:00,Shopping,4,3,18.0025,1,False,0
1,979,4a43c0aef964a520c6a61fe3,4bf58dd8d48988d1df941735,Bridge,40.6068,-74.04417,-240,2012-04-03 18:00:25+00:00,Infrastructure,4,3,18.006944,1,False,1
2,69,4c5cc7b485a1e21e00d35711,4bf58dd8d48988d103941735,Home (private),40.716162,-73.88307,-240,2012-04-03 18:02:24+00:00,Home,4,3,18.04,1,False,2
3,395,4bc7086715a7ef3bef9878da,4bf58dd8d48988d104941735,Medical Center,40.745164,-73.982519,-240,2012-04-03 18:02:41+00:00,Medical,4,3,18.044722,1,False,3
4,87,4cf2c5321d18a143951b5cec,4bf58dd8d48988d1cb941735,Food Truck,40.740104,-73.989658,-240,2012-04-03 18:03:00+00:00,Snacks,4,3,18.05,1,False,4
5,484,4b5b981bf964a520900929e3,4bf58dd8d48988d118951735,Food & Drink Shop,40.690427,-73.954687,-240,2012-04-03 18:04:00+00:00,Grocery,4,3,18.066667,1,False,5
6,642,4ab966c3f964a5203c7f20e3,4bf58dd8d48988d1e0931735,Coffee Shop,40.751591,-73.974121,-240,2012-04-03 18:04:38+00:00,Cafe,4,3,18.077222,1,False,6
7,292,4d0cc47f903d37041864bf55,4bf58dd8d48988d12b951735,Bus Station,40.779422,-73.955341,-240,2012-04-03 18:04:42+00:00,Public Transport,4,3,18.078333,1,False,7
8,428,4ce1863bc4f6a35d8bd2db6c,4bf58dd8d48988d103941735,Home (private),40.619151,-74.035888,-240,2012-04-03 18:06:18+00:00,Home,4,3,18.105,1,False,2
9,877,4be319b321d5a59352311811,4bf58dd8d48988d10a951735,Bank,40.619006,-73.990375,-240,2012-04-03 18:06:19+00:00,Service,4,3,18.105278,1,False,8


A cyclical transformation is applied to the time components.
Each component is mapped to a sin/cos pair, so that values at the two ends of a cycle (e.g. 23:00 and 00:00) end up close to each other.

In [27]:
merged['clock_sin'] = np.sin(2 * np.pi * merged['clock']/24.0)
merged['clock_cos'] = np.cos(2 * np.pi * merged['clock']/24.0)
merged['day_sin'] = np.sin(2 * np.pi * merged['day']/30.0)
merged['day_cos'] = np.cos(2 * np.pi * merged['day']/30.0)
merged['month_sin'] = np.sin(2 * np.pi * merged['month']/12.0)
merged['month_cos'] = np.cos(2 * np.pi * merged['month']/12.0)
merged['week_day_sin'] = np.sin(2 * np.pi * merged['week_day']/7.0)
merged['week_day_cos'] = np.cos(2 * np.pi * merged['week_day']/7.0)
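
The effect of the sin/cos encoding can be illustrated with a small, self-contained sketch. The helper names `encode_cyclical` and `circle_distance` are hypothetical and only used for this illustration.

```python
import numpy as np

def encode_cyclical(value, period):
    """Map a periodic value onto the unit circle as (sin, cos)."""
    angle = 2 * np.pi * np.asarray(value, dtype=float) / period
    return np.stack([np.sin(angle), np.cos(angle)], axis=-1)

def circle_distance(a, b, period):
    """Euclidean distance between two encoded values."""
    return float(np.linalg.norm(encode_cyclical(a, period) - encode_cyclical(b, period)))

# 23:00 and 01:00 are two hours apart, so they end up close together ...
near = circle_distance(23, 1, 24)
# ... while 23:00 and 11:00 are twelve hours apart and land on opposite sides.
far = circle_distance(23, 11, 24)
print(near, far)

# With a period of 30, day 31 receives the same encoding as day 1.
same_as_day_one = bool(np.allclose(encode_cyclical(31, 30), encode_cyclical(1, 30)))
print(same_as_day_one)
```

The last check shows a side effect of dividing the day of month by 30: the 31st of a month becomes indistinguishable from the 1st. Whether this matters depends on the model; a period of 31 would be one way to avoid it.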

The columns that are no longer needed are dropped, which completes the processed dataset.

In [28]:
merged.drop(['clock', 'day', 'month', 'week_day', 'uber_category'], axis=1, inplace=True)

In [29]:
processed = merged.copy()
processed.head(10)

Unnamed: 0,user_id,venue,category_id,category_name,latitude,longitude,time_offset,time,is_weekend,cat_id,clock_sin,clock_cos,day_sin,day_cos,month_sin,month_cos,week_day_sin,week_day_cos
0,470,49bbd6c0f964a520f4531fe3,4bf58dd8d48988d127951735,Arts & Crafts Store,40.71981,-74.002581,-240,2012-04-03 18:00:09+00:00,False,0,-1.0,0.000654,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349
1,979,4a43c0aef964a520c6a61fe3,4bf58dd8d48988d1df941735,Bridge,40.6068,-74.04417,-240,2012-04-03 18:00:25+00:00,False,1,-0.999998,0.001818,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349
2,69,4c5cc7b485a1e21e00d35711,4bf58dd8d48988d103941735,Home (private),40.716162,-73.88307,-240,2012-04-03 18:02:24+00:00,False,2,-0.999945,0.010472,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349
3,395,4bc7086715a7ef3bef9878da,4bf58dd8d48988d104941735,Medical Center,40.745164,-73.982519,-240,2012-04-03 18:02:41+00:00,False,3,-0.999931,0.011708,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349
4,87,4cf2c5321d18a143951b5cec,4bf58dd8d48988d1cb941735,Food Truck,40.740104,-73.989658,-240,2012-04-03 18:03:00+00:00,False,4,-0.999914,0.01309,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349
5,484,4b5b981bf964a520900929e3,4bf58dd8d48988d118951735,Food & Drink Shop,40.690427,-73.954687,-240,2012-04-03 18:04:00+00:00,False,5,-0.999848,0.017452,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349
6,642,4ab966c3f964a5203c7f20e3,4bf58dd8d48988d1e0931735,Coffee Shop,40.751591,-73.974121,-240,2012-04-03 18:04:38+00:00,False,6,-0.999796,0.020215,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349
7,292,4d0cc47f903d37041864bf55,4bf58dd8d48988d12b951735,Bus Station,40.779422,-73.955341,-240,2012-04-03 18:04:42+00:00,False,7,-0.99979,0.020506,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349
8,428,4ce1863bc4f6a35d8bd2db6c,4bf58dd8d48988d103941735,Home (private),40.619151,-74.035888,-240,2012-04-03 18:06:18+00:00,False,2,-0.999622,0.027485,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349
9,877,4be319b321d5a59352311811,4bf58dd8d48988d10a951735,Bank,40.619006,-73.990375,-240,2012-04-03 18:06:19+00:00,False,8,-0.99962,0.027558,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349


Here, a reduced dataset for the first model training is created.

In [30]:
reduced = processed.drop(['venue', 'category_id', 'category_name', 'latitude', 'longitude', 'time_offset', 'time', 'is_weekend'], axis=1, inplace=False)

reduced.head(10)

Unnamed: 0,user_id,cat_id,clock_sin,clock_cos,day_sin,day_cos,month_sin,month_cos,week_day_sin,week_day_cos
0,470,0,-1.0,0.000654,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349
1,979,1,-0.999998,0.001818,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349
2,69,2,-0.999945,0.010472,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349
3,395,3,-0.999931,0.011708,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349
4,87,4,-0.999914,0.01309,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349
5,484,5,-0.999848,0.017452,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349
6,642,6,-0.999796,0.020215,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349
7,292,7,-0.99979,0.020506,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349
8,428,2,-0.999622,0.027485,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349
9,877,8,-0.99962,0.027558,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349


A bigger dataset with all the processed features is also created.
For that purpose, venues and category ids are mapped to compact integer indices.

In [31]:
processed['venue_id'] = processed.groupby('venue', sort=False).ngroup()
processed['orig_cat_id'] = processed.groupby('category_id', sort=False).ngroup()
big = processed.drop(['venue', 'category_id', 'category_name', 'time_offset', 'time'], axis=1, inplace=False)

big.head(10)

Unnamed: 0,user_id,latitude,longitude,is_weekend,cat_id,clock_sin,clock_cos,day_sin,day_cos,month_sin,month_cos,week_day_sin,week_day_cos,venue_id,orig_cat_id
0,470,40.71981,-74.002581,False,0,-1.0,0.000654,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349,0,0
1,979,40.6068,-74.04417,False,1,-0.999998,0.001818,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349,1,1
2,69,40.716162,-73.88307,False,2,-0.999945,0.010472,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349,2,2
3,395,40.745164,-73.982519,False,3,-0.999931,0.011708,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349,3,3
4,87,40.740104,-73.989658,False,4,-0.999914,0.01309,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349,4,4
5,484,40.690427,-73.954687,False,5,-0.999848,0.017452,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349,5,5
6,642,40.751591,-73.974121,False,6,-0.999796,0.020215,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349,6,6
7,292,40.779422,-73.955341,False,7,-0.99979,0.020506,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349,7,7
8,428,40.619151,-74.035888,False,2,-0.999622,0.027485,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349,8,2
9,877,40.619006,-73.990375,False,8,-0.99962,0.027558,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349,9,8


The 'cat_id' column is moved to the first position, as it will be used as the prediction target (y).

In [32]:
def move_first(df, column_name):
    fc = df.pop(column_name)
    df.insert(0, column_name, fc)
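
Note that `move_first` modifies the dataframe in place and returns nothing. The toy example below (invented values) shows this behaviour, together with a non-mutating alternative based on column selection.

```python
import pandas as pd

def move_first(df, column_name):
    # Same helper as above: pop the column and re-insert it at position 0.
    fc = df.pop(column_name)
    df.insert(0, column_name, fc)

toy = pd.DataFrame({"user_id": [1, 2], "clock_sin": [0.0, 1.0], "cat_id": [5, 7]})

# Mutates toy in place and returns None.
move_first(toy, "cat_id")
print(list(toy.columns))

# Non-mutating alternative: build the new column order and select it.
ordered = toy[["user_id"] + [c for c in toy.columns if c != "user_id"]]
print(list(ordered.columns))
```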

In [33]:
move_first(reduced, 'cat_id')
reduced.head(10)

Unnamed: 0,cat_id,user_id,clock_sin,clock_cos,day_sin,day_cos,month_sin,month_cos,week_day_sin,week_day_cos
0,0,470,-1.0,0.000654,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349
1,1,979,-0.999998,0.001818,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349
2,2,69,-0.999945,0.010472,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349
3,3,395,-0.999931,0.011708,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349
4,4,87,-0.999914,0.01309,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349
5,5,484,-0.999848,0.017452,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349
6,6,642,-0.999796,0.020215,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349
7,7,292,-0.99979,0.020506,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349
8,2,428,-0.999622,0.027485,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349
9,8,877,-0.99962,0.027558,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349


In [34]:
move_first(big, 'cat_id')
big.head(10)

Unnamed: 0,cat_id,user_id,latitude,longitude,is_weekend,clock_sin,clock_cos,day_sin,day_cos,month_sin,month_cos,week_day_sin,week_day_cos,venue_id,orig_cat_id
0,0,470,40.71981,-74.002581,False,-1.0,0.000654,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349,0,0
1,1,979,40.6068,-74.04417,False,-0.999998,0.001818,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349,1,1
2,2,69,40.716162,-73.88307,False,-0.999945,0.010472,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349,2,2
3,3,395,40.745164,-73.982519,False,-0.999931,0.011708,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349,3,3
4,4,87,40.740104,-73.989658,False,-0.999914,0.01309,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349,4,4
5,5,484,40.690427,-73.954687,False,-0.999848,0.017452,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349,5,5
6,6,642,40.751591,-73.974121,False,-0.999796,0.020215,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349,6,6
7,7,292,40.779422,-73.955341,False,-0.99979,0.020506,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349,7,7
8,2,428,40.619151,-74.035888,False,-0.999622,0.027485,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349,8,2
9,8,877,40.619006,-73.990375,False,-0.99962,0.027558,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349,9,8


Save the datasets to .csv files.

In [35]:
reduced.to_csv('./4square/processed_transformed.csv', index=False)
big.to_csv('./4square/processed_transformed_big.csv', index=False)

An old version of the dataset (with 'user_id' as the first column) is created for legacy notebooks and trial runs with changed architectures.

In [36]:
# column order: user_id,cat_id,clock_sin,clock_cos,day_sin,day_cos,month_sin,month_cos,week_day_sin,week_day_cos
old = reduced.copy()
move_first(old, 'user_id')
old.head(10)

Unnamed: 0,user_id,cat_id,clock_sin,clock_cos,day_sin,day_cos,month_sin,month_cos,week_day_sin,week_day_cos
0,470,0,-1.0,0.000654,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349
1,979,1,-0.999998,0.001818,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349
2,69,2,-0.999945,0.010472,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349
3,395,3,-0.999931,0.011708,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349
4,87,4,-0.999914,0.01309,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349
5,484,5,-0.999848,0.017452,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349
6,642,6,-0.999796,0.020215,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349
7,292,7,-0.99979,0.020506,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349
8,428,2,-0.999622,0.027485,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349
9,877,8,-0.99962,0.027558,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349


In [37]:
old.to_csv('./4square/processed_transformed_old.csv', index=False)

Here, a single-user dataset is created for the user with ID 293,
the user with the most checkins in the dataset.

In [38]:
#test area

t = unprocessed.merge(ubers, on='category_name', suffixes=('', '_y'), how='left')
single_user_df = t.loc[t.user_id == 293].copy()
single_user_df.head(100)

Unnamed: 0,user_id,venue,category_id,category_name,latitude,longitude,time_offset,time,uber_category
690,293,4ae8fd76f964a520e1b321e3,4bf58dd8d48988d1fd931735,Subway,40.725577,-74.004252,-240,Tue Apr 03 23:56:43 +0000 2012,Public Transport
760,293,4d5bedbb5d153704613a6ce7,4bf58dd8d48988d162941735,Other Great Outdoors,40.714310,-73.949394,-240,Wed Apr 04 00:19:13 +0000 2012,Outdoors
763,293,41156d00f964a520f40b1fe3,4bf58dd8d48988d1d8941735,Bar,40.713624,-73.949710,-240,Wed Apr 04 00:19:44 +0000 2012,Bar
766,293,4d61b919865a224bef72ba85,4bf58dd8d48988d123951735,Smoke Shop,40.715951,-73.949595,-240,Wed Apr 04 00:21:12 +0000 2012,Shopping
1121,293,4c97edbdf419a09395806b88,4f2a25ac4b909258e854f55f,Neighborhood,40.650169,-73.949575,-240,Wed Apr 04 04:53:06 +0000 2012,Home
...,...,...,...,...,...,...,...,...,...
13791,293,4e5577f445dd0a4826e6ab3e,4bf58dd8d48988d103941735,Home (private),40.693212,-73.930104,-240,Fri Apr 13 16:18:13 +0000 2012,Home
13793,293,4d2f5ae2940137044fa1eeda,4bf58dd8d48988d1f6931735,General Travel,40.689003,-73.929085,-240,Fri Apr 13 16:18:55 +0000 2012,Public Transport
13794,293,4d2f5ae2940137044fa1eeda,4bf58dd8d48988d1f6931735,General Travel,40.689003,-73.929085,-240,Fri Apr 13 16:19:13 +0000 2012,Public Transport
13798,293,4b0b68cef964a5202f3123e3,4bf58dd8d48988d1fd931735,Subway,40.679336,-73.930285,-240,Fri Apr 13 16:20:17 +0000 2012,Public Transport


Next, a dataset with location information is processed.
Each checkin is joined with an area name that is retrieved with a library.

In [39]:
merged.head(10)

Unnamed: 0,user_id,venue,category_id,category_name,latitude,longitude,time_offset,time,is_weekend,cat_id,clock_sin,clock_cos,day_sin,day_cos,month_sin,month_cos,week_day_sin,week_day_cos
0,470,49bbd6c0f964a520f4531fe3,4bf58dd8d48988d127951735,Arts & Crafts Store,40.71981,-74.002581,-240,2012-04-03 18:00:09+00:00,False,0,-1.0,0.000654,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349
1,979,4a43c0aef964a520c6a61fe3,4bf58dd8d48988d1df941735,Bridge,40.6068,-74.04417,-240,2012-04-03 18:00:25+00:00,False,1,-0.999998,0.001818,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349
2,69,4c5cc7b485a1e21e00d35711,4bf58dd8d48988d103941735,Home (private),40.716162,-73.88307,-240,2012-04-03 18:02:24+00:00,False,2,-0.999945,0.010472,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349
3,395,4bc7086715a7ef3bef9878da,4bf58dd8d48988d104941735,Medical Center,40.745164,-73.982519,-240,2012-04-03 18:02:41+00:00,False,3,-0.999931,0.011708,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349
4,87,4cf2c5321d18a143951b5cec,4bf58dd8d48988d1cb941735,Food Truck,40.740104,-73.989658,-240,2012-04-03 18:03:00+00:00,False,4,-0.999914,0.01309,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349
5,484,4b5b981bf964a520900929e3,4bf58dd8d48988d118951735,Food & Drink Shop,40.690427,-73.954687,-240,2012-04-03 18:04:00+00:00,False,5,-0.999848,0.017452,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349
6,642,4ab966c3f964a5203c7f20e3,4bf58dd8d48988d1e0931735,Coffee Shop,40.751591,-73.974121,-240,2012-04-03 18:04:38+00:00,False,6,-0.999796,0.020215,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349
7,292,4d0cc47f903d37041864bf55,4bf58dd8d48988d12b951735,Bus Station,40.779422,-73.955341,-240,2012-04-03 18:04:42+00:00,False,7,-0.99979,0.020506,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349
8,428,4ce1863bc4f6a35d8bd2db6c,4bf58dd8d48988d103941735,Home (private),40.619151,-74.035888,-240,2012-04-03 18:06:18+00:00,False,2,-0.999622,0.027485,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349
9,877,4be319b321d5a59352311811,4bf58dd8d48988d10a951735,Bank,40.619006,-73.990375,-240,2012-04-03 18:06:19+00:00,False,8,-0.99962,0.027558,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349


First, an array of the coordinates (latitude, longitude) is created.

In [40]:
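import reverse_geocoder as rg

# coords holds one (longitude, latitude) row per checkin ...
coords = np.vstack((merged["longitude"].values, merged["latitude"].values)).T
# ... while reverse_geocoder expects (latitude, longitude) tuples.
coord_array = [(i[1], i[0]) for i in coords]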

coord_array

[(40.71981037548853, -74.00258103213994),
 (40.606799581406435, -74.04416981025437),
 (40.71616168484322, -73.88307005845945),
 (40.7451638, -73.982518775),
 (40.74010382743943, -73.98965835571289),
 (40.69042711809854, -73.954686775096),
 (40.75159143134631, -73.9741214009634),
 (40.77942173066975, -73.95534113280371),
 (40.61915106755737, -74.03588760058483),
 (40.61900594093755, -73.99037472596906),
 (40.74348254, -73.99400899999999),
 (40.74260751232707, -73.99270534515381),
 (40.719762266666656, -74.25001400000001),
 (40.741190550641626, -73.98966309787518),
 (40.70458850287454, -74.00963933087027),
 (40.861981503068144, -74.04790453737951),
 (40.826789537813866, -73.94950923509141),
 (40.75527487548953, -73.97880613803865),
 (40.906627, -73.777774),
 (40.73067679262482, -74.06567180055882),
 (40.88302032878746, -74.0758752822876),
 (40.74281621210842, -74.00040610551177),
 (40.64531729239498, -73.77383708953856),
 (40.76067264603736, -74.00367709397945),
 (40.736676, -73.98891),


Now, the reverse_geocoder library is used to look up all coordinates.

In [41]:
coord_array = [(i[1], i[0]) for i in coords]
search = rg.search(coord_array)
search

Loading formatted geocoded file...


[OrderedDict([('lat', '40.71427'),
              ('lon', '-74.00597'),
              ('name', 'New York City'),
              ('admin1', 'New York'),
              ('admin2', ''),
              ('cc', 'US')]),
 OrderedDict([('lat', '40.60177'),
              ('lon', '-73.99403'),
              ('name', 'Bensonhurst'),
              ('admin1', 'New York'),
              ('admin2', 'Kings County'),
              ('cc', 'US')]),
 OrderedDict([('lat', '40.66677'),
              ('lon', '-73.88236'),
              ('name', 'East New York'),
              ('admin1', 'New York'),
              ('admin2', 'Kings County'),
              ('cc', 'US')]),
 OrderedDict([('lat', '40.74482'),
              ('lon', '-73.94875'),
              ('name', 'Long Island City'),
              ('admin1', 'New York'),
              ('admin2', 'Queens County'),
              ('cc', 'US')]),
 OrderedDict([('lat', '40.71427'),
              ('lon', '-74.00597'),
              ('name', 'New York City'),
          

The results are saved in a dataframe. The columns are lat, lon, name, admin1, admin2 and cc.

In [42]:
# search is already a list of dicts, so it converts to a dataframe directly.
coord_search_df = pd.DataFrame(search)
coord_search_df

Unnamed: 0,lat,lon,name,admin1,admin2,cc
0,40.71427,-74.00597,New York City,New York,,US
1,40.60177,-73.99403,Bensonhurst,New York,Kings County,US
2,40.66677,-73.88236,East New York,New York,Kings County,US
3,40.74482,-73.94875,Long Island City,New York,Queens County,US
4,40.71427,-74.00597,New York City,New York,,US
...,...,...,...,...,...,...
227423,40.71427,-74.00597,New York City,New York,,US
227424,40.71427,-74.00597,New York City,New York,,US
227425,40.84985,-73.86641,The Bronx,New York,Bronx,US
227426,40.74399,-74.03236,Hoboken,New Jersey,Hudson County,US
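
Some labels in the table above may look surprising: check-in 3 lies in Manhattan, west of the East River, yet it is labelled Long Island City. As far as the library's documentation describes, reverse_geocoder matches each coordinate to the nearest place in a bundled offline dataset, so the label depends on which reference point is closest. The following is a minimal sketch of that idea, not the library's implementation. It assumes haversine distance and uses only the six reference points visible in the output above; `haversine_km` and `nearest_place` are hypothetical helper names.

```python
import math

# Reference places copied from the lookup output above: name -> (lat, lon).
PLACES = {
    "New York City": (40.71427, -74.00597),
    "Bensonhurst": (40.60177, -73.99403),
    "East New York": (40.66677, -73.88236),
    "Long Island City": (40.74482, -73.94875),
    "The Bronx": (40.84985, -73.86641),
    "Hoboken": (40.74399, -74.03236),
}


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def nearest_place(point, places=PLACES):
    """Return the name of the reference place closest to point (assumed metric)."""
    return min(places, key=lambda name: haversine_km(point, places[name]))


# Check-in 3 (Medical Center) and check-in 0 (Arts & Crafts Store).
print(nearest_place((40.745164, -73.982519)))
print(nearest_place((40.719810, -74.002581)))
```

With these six points the sketch assigns check-in 3 to Long Island City and check-in 0 to New York City, matching the library's labels for those rows. The resulting areas are therefore nearest-point regions rather than official administrative boundaries.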


The number of unique values in name, admin1 and admin2 helps to decide which column to use for the areas.

In [43]:
print("Name : ", coord_search_df.name.nunique())
print("Admin1 : ", coord_search_df.admin1.nunique())
print("Admin2 : ", coord_search_df.admin2.nunique())

Name :  141
Admin1 :  2
Admin2 :  13
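
One detail is worth keeping in mind when reading these counts: `nunique()` skips NaN but not empty strings, and the lookup returns `''` in admin2 for places such as New York City. The admin2 count presumably includes that empty value. The sketch below reproduces the counting in plain Python on the nine lookup rows visible in the table above; the variable names are illustrative only.

```python
from collections import Counter

# The nine lookup rows visible in the table above: (name, admin1, admin2).
rows = [
    ("New York City", "New York", ""),
    ("Bensonhurst", "New York", "Kings County"),
    ("East New York", "New York", "Kings County"),
    ("Long Island City", "New York", "Queens County"),
    ("New York City", "New York", ""),
    ("New York City", "New York", ""),
    ("New York City", "New York", ""),
    ("The Bronx", "New York", "Bronx"),
    ("Hoboken", "New Jersey", "Hudson County"),
]

fields = ("name", "admin1", "admin2")

# Distinct values per field; the empty string counts as a value of its own.
distinct = {f: len({row[i] for row in rows}) for i, f in enumerate(fields)}

# Frequency of each area name, the plain-Python analogue of value_counts().
name_counts = Counter(row[0] for row in rows)

# Rows for which the lookup returned no admin2 value.
missing_admin2 = sum(1 for row in rows if row[2] == "")

print(distinct)
print(name_counts.most_common(1))
print(missing_admin2)
```

On this small sample, four of the nine rows have no admin2 value, which is one reason the name column is the more complete choice for labelling areas.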


In [44]:
coord_search_df.name.value_counts()

New York City       49835
Manhattan           32514
Long Island City    28944
Weehawken           12428
Brooklyn            10379
                    ...  
Plandome Heights        3
Glen Rock               3
Ho-Ho-Kus               1
Plandome                1
Larchmont               1
Name: name, Length: 141, dtype: int64

The area name column, with 141 distinct values, is chosen. Because coord_search_df and merged have the same length and row order, the names can be assigned directly as a new column.

In [45]:
merged_zones = merged.copy()
merged_zones['location_name'] = coord_search_df['name']
merged_zones

Unnamed: 0,user_id,venue,category_id,category_name,latitude,longitude,time_offset,time,is_weekend,cat_id,clock_sin,clock_cos,day_sin,day_cos,month_sin,month_cos,week_day_sin,week_day_cos,location_name
0,470,49bbd6c0f964a520f4531fe3,4bf58dd8d48988d127951735,Arts & Crafts Store,40.719810,-74.002581,-240,2012-04-03 18:00:09+00:00,False,0,-1.000000,0.000654,0.587785,0.809017,0.866025,-0.5,0.781831,0.623490,New York City
1,979,4a43c0aef964a520c6a61fe3,4bf58dd8d48988d1df941735,Bridge,40.606800,-74.044170,-240,2012-04-03 18:00:25+00:00,False,1,-0.999998,0.001818,0.587785,0.809017,0.866025,-0.5,0.781831,0.623490,Bensonhurst
2,69,4c5cc7b485a1e21e00d35711,4bf58dd8d48988d103941735,Home (private),40.716162,-73.883070,-240,2012-04-03 18:02:24+00:00,False,2,-0.999945,0.010472,0.587785,0.809017,0.866025,-0.5,0.781831,0.623490,East New York
3,395,4bc7086715a7ef3bef9878da,4bf58dd8d48988d104941735,Medical Center,40.745164,-73.982519,-240,2012-04-03 18:02:41+00:00,False,3,-0.999931,0.011708,0.587785,0.809017,0.866025,-0.5,0.781831,0.623490,Long Island City
4,87,4cf2c5321d18a143951b5cec,4bf58dd8d48988d1cb941735,Food Truck,40.740104,-73.989658,-240,2012-04-03 18:03:00+00:00,False,4,-0.999914,0.013090,0.587785,0.809017,0.866025,-0.5,0.781831,0.623490,New York City
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
227423,688,3fd66200f964a52000e71ee3,4bf58dd8d48988d1e7931735,Music Venue,40.733596,-74.003139,-300,2013-02-16 02:29:11+00:00,True,16,0.605931,0.795518,-0.207912,-0.978148,0.866025,0.5,-0.974928,-0.222521,New York City
227424,560,4bca32ff0687ef3be789dbcc,4bf58dd8d48988d16c941735,Burger Joint,40.745719,-73.993720,-300,2013-02-16 02:31:35+00:00,True,11,0.614228,0.789129,-0.207912,-0.978148,0.866025,0.5,-0.974928,-0.222521,New York City
227425,945,50a77716e4b0b5a9492f6f56,4bf58dd8d48988d103941735,Home (private),40.854364,-73.883070,-300,2013-02-16 02:33:16+00:00,True,2,0.620007,0.784596,-0.207912,-0.978148,0.866025,0.5,-0.974928,-0.222521,The Bronx
227426,671,4514efe0f964a520e7391fe3,4bf58dd8d48988d11d941735,Bar,40.735981,-74.029309,-300,2013-02-16 02:34:31+00:00,True,9,0.624277,0.781203,-0.207912,-0.978148,0.866025,0.5,-0.974928,-0.222521,Hoboken
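
The direct assignment above relies on pandas aligning the two objects by index label, not by position. That works as long as merged carries a default 0..n-1 index, which is assumed here and not shown in this section. The sketch below uses two toy frames with hypothetical names (`merged_toy`, `names`) to show what happens otherwise and how `to_numpy()` forces a positional assignment.

```python
import pandas as pd

# Toy frame whose index is NOT 0..n-1, as happens after rows were filtered out.
merged_toy = pd.DataFrame({"user_id": [470, 979, 69]}, index=[0, 2, 5])
names = pd.DataFrame({"name": ["New York City", "Bensonhurst", "East New York"]})

# Index-aligned assignment: only the labels 0 and 2 exist in both frames,
# so label 2 receives names.loc[2] and label 5 receives NaN.
by_index = merged_toy.copy()
by_index["location_name"] = names["name"]

# Positional assignment: to_numpy() drops the index, so row k gets name k.
by_position = merged_toy.copy()
by_position["location_name"] = names["name"].to_numpy()

print(by_index)
print(by_position)
```

If merged had been filtered earlier, calling `reset_index(drop=True)` before the assignment or assigning the raw array as shown would keep rows and names in step.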


Group the rows by location name to assign an integer id to each location. With sort=False, the ids follow the order in which the names first appear.

In [46]:
merged_zones['location_id'] = merged_zones.groupby('location_name', sort=False).ngroup()
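
What `groupby(..., sort=False).ngroup()` computes can be sketched in plain Python: each distinct name receives the next free integer the first time it is seen. The helper name `first_appearance_ids` is hypothetical and only serves as an illustration.

```python
def first_appearance_ids(labels):
    """Map each label to an integer id in order of first appearance."""
    ids = {}
    # setdefault stores len(ids) only when the label has not been seen yet.
    return [ids.setdefault(label, len(ids)) for label in labels]


# Location names of the first five check-ins.
sample = ["New York City", "Bensonhurst", "East New York",
          "Long Island City", "New York City"]
print(first_appearance_ids(sample))
```

For these five names the sketch yields the ids 0, 1, 2, 3 and 0, so the repeated New York City entry reuses the id assigned at its first occurrence. The ids carry no ordering or distance information, so they are best treated as categorical labels.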

Move location_id to the first column, then drop the columns that are not needed for this use case.

In [47]:
move_first(merged_zones, 'location_id')
merged_zones

Unnamed: 0,location_id,user_id,venue,category_id,category_name,latitude,longitude,time_offset,time,is_weekend,cat_id,clock_sin,clock_cos,day_sin,day_cos,month_sin,month_cos,week_day_sin,week_day_cos,location_name
0,0,470,49bbd6c0f964a520f4531fe3,4bf58dd8d48988d127951735,Arts & Crafts Store,40.719810,-74.002581,-240,2012-04-03 18:00:09+00:00,False,0,-1.000000,0.000654,0.587785,0.809017,0.866025,-0.5,0.781831,0.623490,New York City
1,1,979,4a43c0aef964a520c6a61fe3,4bf58dd8d48988d1df941735,Bridge,40.606800,-74.044170,-240,2012-04-03 18:00:25+00:00,False,1,-0.999998,0.001818,0.587785,0.809017,0.866025,-0.5,0.781831,0.623490,Bensonhurst
2,2,69,4c5cc7b485a1e21e00d35711,4bf58dd8d48988d103941735,Home (private),40.716162,-73.883070,-240,2012-04-03 18:02:24+00:00,False,2,-0.999945,0.010472,0.587785,0.809017,0.866025,-0.5,0.781831,0.623490,East New York
3,3,395,4bc7086715a7ef3bef9878da,4bf58dd8d48988d104941735,Medical Center,40.745164,-73.982519,-240,2012-04-03 18:02:41+00:00,False,3,-0.999931,0.011708,0.587785,0.809017,0.866025,-0.5,0.781831,0.623490,Long Island City
4,0,87,4cf2c5321d18a143951b5cec,4bf58dd8d48988d1cb941735,Food Truck,40.740104,-73.989658,-240,2012-04-03 18:03:00+00:00,False,4,-0.999914,0.013090,0.587785,0.809017,0.866025,-0.5,0.781831,0.623490,New York City
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
227423,0,688,3fd66200f964a52000e71ee3,4bf58dd8d48988d1e7931735,Music Venue,40.733596,-74.003139,-300,2013-02-16 02:29:11+00:00,True,16,0.605931,0.795518,-0.207912,-0.978148,0.866025,0.5,-0.974928,-0.222521,New York City
227424,0,560,4bca32ff0687ef3be789dbcc,4bf58dd8d48988d16c941735,Burger Joint,40.745719,-73.993720,-300,2013-02-16 02:31:35+00:00,True,11,0.614228,0.789129,-0.207912,-0.978148,0.866025,0.5,-0.974928,-0.222521,New York City
227425,33,945,50a77716e4b0b5a9492f6f56,4bf58dd8d48988d103941735,Home (private),40.854364,-73.883070,-300,2013-02-16 02:33:16+00:00,True,2,0.620007,0.784596,-0.207912,-0.978148,0.866025,0.5,-0.974928,-0.222521,The Bronx
227426,28,671,4514efe0f964a520e7391fe3,4bf58dd8d48988d11d941735,Bar,40.735981,-74.029309,-300,2013-02-16 02:34:31+00:00,True,9,0.624277,0.781203,-0.207912,-0.978148,0.866025,0.5,-0.974928,-0.222521,Hoboken


In [48]:
# drop() returns a new DataFrame by default, so merged_zones stays unchanged
reduced_zones = merged_zones.drop(
    columns=['venue', 'category_id', 'category_name', 'latitude', 'longitude',
             'time_offset', 'time', 'is_weekend', 'cat_id', 'location_name'])
reduced_zones

Unnamed: 0,location_id,user_id,clock_sin,clock_cos,day_sin,day_cos,month_sin,month_cos,week_day_sin,week_day_cos
0,0,470,-1.000000,0.000654,0.587785,0.809017,0.866025,-0.5,0.781831,0.623490
1,1,979,-0.999998,0.001818,0.587785,0.809017,0.866025,-0.5,0.781831,0.623490
2,2,69,-0.999945,0.010472,0.587785,0.809017,0.866025,-0.5,0.781831,0.623490
3,3,395,-0.999931,0.011708,0.587785,0.809017,0.866025,-0.5,0.781831,0.623490
4,0,87,-0.999914,0.013090,0.587785,0.809017,0.866025,-0.5,0.781831,0.623490
...,...,...,...,...,...,...,...,...,...,...
227423,0,688,0.605931,0.795518,-0.207912,-0.978148,0.866025,0.5,-0.974928,-0.222521
227424,0,560,0.614228,0.789129,-0.207912,-0.978148,0.866025,0.5,-0.974928,-0.222521
227425,33,945,0.620007,0.784596,-0.207912,-0.978148,0.866025,0.5,-0.974928,-0.222521
227426,28,671,0.624277,0.781203,-0.207912,-0.978148,0.866025,0.5,-0.974928,-0.222521


Save the reduced DataFrame as a CSV file.

In [49]:
reduced_zones.to_csv('./4square/processed_transformed_locations.csv', index=False)

In addition, the check-ins of a single user (user_id 185) are extracted and saved to a separate file.

In [50]:
# Boolean-mask selection of one user's rows; copy() makes the result independent of reduced_zones
single_user_locations = reduced_zones.loc[reduced_zones.user_id == 185].copy()

In [51]:
single_user_locations.to_csv('./4square/processed_transformed_locations_185.csv', index=False)