# Data Process for Location Data

This is a guide to process location data.

---

Here, you can analyze location data from K-Emophone Dataset and derive routines from location data.
Following is the goals of this Jupyter Notebook.

- Concat location data into one file
  If you unzip the K-Emophone Dataset, location data is distributed into several files. Therefore, we are going to merge them into one file.
- Cluster by GPS coordinates
  As we learned in Lab 9, we can cluster geographical data by GPS coordinates. By clustering, we can derive routines such as sleep, meal, or exercise.
- Resample as 15 minutes
  Original K-Emophone Dataset is too huge. Therefore, we are going to resample the timestamp as 15 minutes.
- Analyze routine
  Finally, we are going to aggregate location data to routine data, and we will use routine data in our visualization web page.

## Get data

Assume that you already download the K-Emophone Dataset, and unzip it.
For example, `*/P3041`, and there will be location data like `*/P3041/LocationEntity-*.csv`.

Please change the `DATASET_DIRECTORY` as your local directory.

In [1]:
# Please use absolute path.
# fyi. use `pwd` command for mac users.
DATASET_DIRECTORY = '/Users/osjun/Downloads/P3029'
USER_ID = 'P3029'

In [2]:
import glob, os
import pandas as pd

location_files = glob.glob(os.path.join(DATASET_DIRECTORY, 'LocationEntity-*.csv'))

location_df = pd.concat([pd.read_csv(f) for f in location_files])
location_df['datetime'] = pd.to_datetime(location_df['timestamp'], utc=True, unit='ms')
location_df['datetime'] = location_df.datetime.dt.tz_convert('Asia/Seoul')
location_df = location_df[['timestamp', 'longitude', 'latitude', 'datetime']]
location_df.head()

Unnamed: 0,timestamp,longitude,latitude,datetime
0,1557123564025,127.112623,37.382442,2019-05-06 15:19:24.025000+09:00
1,1557123572513,127.113192,37.379412,2019-05-06 15:19:32.513000+09:00
2,1557123575989,127.112073,37.381695,2019-05-06 15:19:35.989000+09:00
3,1557123576986,127.112142,37.38181,2019-05-06 15:19:36.986000+09:00
4,1557123577996,127.112231,37.381929,2019-05-06 15:19:37.996000+09:00


In [3]:
location_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 15303 entries, 0 to 288
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype                     
---  ------     --------------  -----                     
 0   timestamp  15303 non-null  int64                     
 1   longitude  15303 non-null  float64                   
 2   latitude   15303 non-null  float64                   
 3   datetime   15303 non-null  datetime64[ns, Asia/Seoul]
dtypes: datetime64[ns, Asia/Seoul](1), float64(2), int64(1)
memory usage: 597.8 KB


## Clustering GPS Coordinates

Clustering by GPS Coordinates, as we learned in Lab 9

In [4]:
from sklearn.cluster import DBSCAN
import numpy as np

EPSILON_METRE = 50
MIN_POINTS = 5
R = 63710088

cluster = DBSCAN(
    #The maximum angle between two samples
    eps=EPSILON_METRE / R,
    #The number of samples in a neighborhood for a point to be considered as a core point.
    min_samples=MIN_POINTS,
    # IMPORTANT: only Ball Tree can handle haversine distance.
    #The algorithm to compute pointwise distances and find nearest neighbors.
    metric='haversine',
    algorithm='ball_tree'
)

loc_degrees = location_df.loc[:, ['latitude', 'longitude']].to_numpy()  # Convert the DataFrame to a NumPy array.
loc_radians = np.radians(loc_degrees)  # To use haversine distance, degree should be transformed into radians
labels = cluster.fit_predict(loc_radians)

cluster_df = pd.DataFrame(
    np.column_stack([location_df.to_numpy(), labels]),
    columns=['timestamp', 'longitude', 'latitude', 'datetime', 'labels']
)

cluster_df.head()

Unnamed: 0,timestamp,longitude,latitude,datetime,labels
0,1557123564025,127.112623,37.382442,2019-05-06 15:19:24.025000+09:00,-1
1,1557123572513,127.113192,37.379412,2019-05-06 15:19:32.513000+09:00,-1
2,1557123575989,127.112073,37.381695,2019-05-06 15:19:35.989000+09:00,-1
3,1557123576986,127.112142,37.38181,2019-05-06 15:19:36.986000+09:00,-1
4,1557123577996,127.112231,37.381929,2019-05-06 15:19:37.996000+09:00,-1


## Resample by 15 minutes

Resample by 15 minutes

In [14]:
resampled_df = cluster_df.copy() # cluster_df.set_index('datetime').resample('15min').first().reset_index()
resampled_df = resampled_df[resampled_df['labels'] != -1]
resampled_df.head()

Unnamed: 0,timestamp,longitude,latitude,datetime,labels
385,1557124429485,127.126918,37.412841,2019-05-06 15:33:49.485000+09:00,0
386,1557124502800,127.126881,37.412764,2019-05-06 15:35:02.800000+09:00,0
387,1557124529447,127.126867,37.412846,2019-05-06 15:35:29.447000+09:00,0
388,1557124549798,127.126807,37.412846,2019-05-06 15:35:49.798000+09:00,0
389,1557124632936,127.126817,37.412793,2019-05-06 15:37:12.936000+09:00,0


## Trace location data

Trace location data on time.

In [15]:
import plotly.express as px

px.set_mapbox_access_token(open("../.mapbox_token").read())

fig = px.scatter_mapbox(
    resampled_df,
    lat="latitude",
    lon="longitude",
    color="labels",
    hover_data=["latitude", "longitude", "datetime"],
    # If you want animation, please uncomment this line
    # animation_frame=resampled_df.datetime.astype(str),
    center={'lat': 36.37 , 'lon': 127.36},
    zoom=14,
    width=800,
    height=800
)
fig.update_layout(title="Time trace on location", mapbox_style="streets")
# fig.update_layout(mapbox_bounds={"west": 127.35, "east": 127.37, "south": 36.36, "north": 36.38})

fig.show()

## Define routine

Using clustering, define routines

In [16]:
ROUTINE_LABELS = {
    2: "MEAL",
    3: "MEAL",
    4: "MEAL",
    7: "INDOOR",
    8: "MEAL",
    9: "MEAL",
    10: "CLASS",
    15: "MEAL",
    16: "MEAL",
    19: "CLASS",
    22: "CLASS",
    23: "CLASS",
    24: "STUDY",
    25: "STUDY",
    26: "STUDY",
    27: "STUDY",
    28: "STUDY",
    29: "EXERCISE",
    30: "EXERCISE",
    34: "STUDY",
}

In [17]:
resampled_df['routine'] = resampled_df['labels'].map(ROUTINE_LABELS)
resampled_df.head()

Unnamed: 0,timestamp,longitude,latitude,datetime,labels,routine
385,1557124429485,127.126918,37.412841,2019-05-06 15:33:49.485000+09:00,0,
386,1557124502800,127.126881,37.412764,2019-05-06 15:35:02.800000+09:00,0,
387,1557124529447,127.126867,37.412846,2019-05-06 15:35:29.447000+09:00,0,
388,1557124549798,127.126807,37.412846,2019-05-06 15:35:49.798000+09:00,0,
389,1557124632936,127.126817,37.412793,2019-05-06 15:37:12.936000+09:00,0,


Unnamed: 0,timestamp,longitude,latitude,routine,start_at,end_at,duration,weekday
14482,1556582713087,127.362479,36.37021,CLASS,2019-04-30 09:05:13.087000+09:00,2019-04-30 12:03:29+09:00,10695.0,1
14548,1556593550881,127.363623,36.369107,MEAL,2019-04-30 12:05:50.881000+09:00,2019-04-30 12:29:10.999000+09:00,1400.0,1
14696,1556595310999,127.357595,36.373436,INDOOR,2019-04-30 12:35:10.999000+09:00,2019-04-30 13:00:07.999000+09:00,1497.0,1
14785,1556596912999,127.362088,36.373273,CLASS,2019-04-30 13:01:52.999000+09:00,2019-04-30 14:13:23+09:00,4290.0,1
14902,1556601550999,127.358237,36.373464,INDOOR,2019-04-30 14:19:10.999000+09:00,2019-04-30 17:41:12.999000+09:00,12122.0,1
14957,1556614002118,127.359108,36.373716,MEAL,2019-04-30 17:46:42.118000+09:00,2019-04-30 18:15:18+09:00,1715.0,1
14984,1556615798000,127.357657,36.373646,INDOOR,2019-04-30 18:16:38+09:00,2019-05-01 13:19:40.686000+09:00,68582.0,1
13180,1556684712058,127.363675,36.369112,MEAL,2019-05-01 13:25:12.058000+09:00,2019-05-01 13:40:11.999000+09:00,899.0,2
13215,1556685755004,127.362326,36.370357,CLASS,2019-05-01 13:42:35.004000+09:00,2019-05-01 15:49:00.999000+09:00,7585.0,2
13365,1556693486999,127.358269,36.373475,INDOOR,2019-05-01 15:51:26.999000+09:00,2019-05-02 12:58:57.999000+09:00,76051.0,2


## Return dataframe

Make dataframe for routine.
It will be stored in `csv` directory.

---

DataFrame columns

- timestamp
- user_id
- logitude
- latitude
- routine
- start_at
- end_at
- weekday

In [34]:
result_df = resampled_df.copy().sort_values('datetime')

# Left rows when routine is changed
result_df['prev_routine'] = result_df['routine'].shift(1)
result_df = result_df[(result_df['routine'] != result_df['prev_routine'])]

# Calculate start_at and end_at by shifting
result_df['start_at'] = result_df['datetime']
result_df['end_at'] = result_df['datetime'].shift(-1)
result_df['duration'] = (result_df['end_at'] - result_df['start_at']).dt.seconds
result_df = result_df[result_df['routine'].notnull()]
result_df = result_df[result_df['duration'] >= 300]
result_df['weekday'] = result_df['start_at'].dt.dayofweek
result_df['user_id'] = USER_ID
result_df = result_df[['timestamp', 'user_id', 'longitude', 'latitude', 'routine', 'start_at', 'end_at', 'duration', 'weekday']]
result_df

Unnamed: 0,timestamp,user_id,longitude,latitude,routine,start_at,end_at,duration,weekday
14482,1556582713087,P3029,127.362479,36.37021,CLASS,2019-04-30 09:05:13.087000+09:00,2019-04-30 12:03:29+09:00,10695.0,1
14548,1556593550881,P3029,127.363623,36.369107,MEAL,2019-04-30 12:05:50.881000+09:00,2019-04-30 12:29:10.999000+09:00,1400.0,1
14696,1556595310999,P3029,127.357595,36.373436,INDOOR,2019-04-30 12:35:10.999000+09:00,2019-04-30 13:00:07.999000+09:00,1497.0,1
14785,1556596912999,P3029,127.362088,36.373273,CLASS,2019-04-30 13:01:52.999000+09:00,2019-04-30 14:13:23+09:00,4290.0,1
14902,1556601550999,P3029,127.358237,36.373464,INDOOR,2019-04-30 14:19:10.999000+09:00,2019-04-30 17:41:12.999000+09:00,12122.0,1
14957,1556614002118,P3029,127.359108,36.373716,MEAL,2019-04-30 17:46:42.118000+09:00,2019-04-30 18:15:18+09:00,1715.0,1
14984,1556615798000,P3029,127.357657,36.373646,INDOOR,2019-04-30 18:16:38+09:00,2019-05-01 13:19:40.686000+09:00,68582.0,1
13180,1556684712058,P3029,127.363675,36.369112,MEAL,2019-05-01 13:25:12.058000+09:00,2019-05-01 13:40:11.999000+09:00,899.0,2
13215,1556685755004,P3029,127.362326,36.370357,CLASS,2019-05-01 13:42:35.004000+09:00,2019-05-01 15:49:00.999000+09:00,7585.0,2
13365,1556693486999,P3029,127.358269,36.373475,INDOOR,2019-05-01 15:51:26.999000+09:00,2019-05-02 12:58:57.999000+09:00,76051.0,2


In [35]:
result_df.to_csv(f'../csv/routines_raw/{USER_ID}-location.csv', index=False)