# Data Process for Location Data

This is a guide to process location data.

---

Here, you can analyze location data from K-Emophone Dataset and derive routines from location data.
Following is the goals of this Jupyter Notebook.

- Concat location data into one file
  If you unzip the K-Emophone Dataset, location data is distributed into several files. Therefore, we are going to merge them into one file.
- Cluster by GPS coordinates
  As we learned in Lab 9, we can cluster geographical data by GPS coordinates. By clustering, we can derive routines such as sleep, meal, or exercise.
- Resample as 15 minutes
  Original K-Emophone Dataset is too huge. Therefore, we are going to resample the timestamp as 15 minutes.
- Analyze routine
  Finally, we are going to aggregate location data to routine data, and we will use routine data in our visualization web page.

## Get data

Assume that you already download the K-Emophone Dataset, and unzip it.
For example, `*/P3041`, and there will be location data like `*/P3041/LocationEntity-*.csv`.

Please change the `DATASET_DIRECTORY` as your local directory.

In [2]:
# Please use absolute path.
# fyi. use `pwd` command for mac users.
DATASET_DIRECTORY = '/Users/osjun/Downloads/P3029'
USER_ID = 'P3029'

In [3]:
import glob, os
import pandas as pd

location_files = glob.glob(os.path.join(DATASET_DIRECTORY, 'LocationEntity-*.csv'))

location_df = pd.concat([pd.read_csv(f) for f in location_files])
location_df['datetime'] = pd.to_datetime(location_df['timestamp'], utc=True, unit='ms')
location_df['datetime'] = location_df.datetime.dt.tz_convert('Asia/Seoul')
location_df = location_df[['timestamp', 'longitude', 'latitude', 'datetime']]
location_df.head()

Unnamed: 0,timestamp,longitude,latitude,datetime
0,1557123564025,127.112623,37.382442,2019-05-06 15:19:24.025000+09:00
1,1557123572513,127.113192,37.379412,2019-05-06 15:19:32.513000+09:00
2,1557123575989,127.112073,37.381695,2019-05-06 15:19:35.989000+09:00
3,1557123576986,127.112142,37.38181,2019-05-06 15:19:36.986000+09:00
4,1557123577996,127.112231,37.381929,2019-05-06 15:19:37.996000+09:00


In [4]:
location_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 15303 entries, 0 to 288
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype                     
---  ------     --------------  -----                     
 0   timestamp  15303 non-null  int64                     
 1   longitude  15303 non-null  float64                   
 2   latitude   15303 non-null  float64                   
 3   datetime   15303 non-null  datetime64[ns, Asia/Seoul]
dtypes: datetime64[ns, Asia/Seoul](1), float64(2), int64(1)
memory usage: 597.8 KB


## Clustering GPS Coordinates

Clustering by GPS Coordinates, as we learned in Lab 9

In [5]:
from sklearn.cluster import DBSCAN
import numpy as np

EPSILON_METRE = 50
MIN_POINTS = 5
R = 63710088

cluster = DBSCAN(
    #The maximum angle between two samples
    eps=EPSILON_METRE / R,
    #The number of samples in a neighborhood for a point to be considered as a core point.
    min_samples=MIN_POINTS,
    # IMPORTANT: only Ball Tree can handle haversine distance.
    #The algorithm to compute pointwise distances and find nearest neighbors.
    metric='haversine',
    algorithm='ball_tree'
)

loc_degrees = location_df.loc[:, ['latitude', 'longitude']].to_numpy()  # Convert the DataFrame to a NumPy array.
loc_radians = np.radians(loc_degrees)  # To use haversine distance, degree should be transformed into radians
labels = cluster.fit_predict(loc_radians)

cluster_df = pd.DataFrame(
    np.column_stack([location_df.to_numpy(), labels]),
    columns=['timestamp', 'longitude', 'latitude', 'datetime', 'labels']
)

cluster_df.head()

Unnamed: 0,timestamp,longitude,latitude,datetime,labels
0,1557123564025,127.112623,37.382442,2019-05-06 15:19:24.025000+09:00,-1
1,1557123572513,127.113192,37.379412,2019-05-06 15:19:32.513000+09:00,-1
2,1557123575989,127.112073,37.381695,2019-05-06 15:19:35.989000+09:00,-1
3,1557123576986,127.112142,37.38181,2019-05-06 15:19:36.986000+09:00,-1
4,1557123577996,127.112231,37.381929,2019-05-06 15:19:37.996000+09:00,-1


## Resample by 15 minutes

Resample by 15 minutes

In [6]:
resampled_df = cluster_df.set_index('datetime').resample('15min').first().reset_index()
resampled_df = resampled_df[resampled_df['labels'] != -1]
resampled_df.head()

Unnamed: 0,datetime,timestamp,longitude,latitude,labels
0,2019-04-30 09:00:00+09:00,1556582713087.0,127.362479,36.37021,10.0
2,2019-04-30 09:30:00+09:00,,,,
3,2019-04-30 09:45:00+09:00,1556585347204.0,127.362467,36.370216,10.0
4,2019-04-30 10:00:00+09:00,1556586774094.0,127.362427,36.370349,10.0
5,2019-04-30 10:15:00+09:00,1556586945903.0,127.362403,36.370506,10.0


## Trace location data

Trace location data on time.

In [10]:
import plotly.express as px

px.set_mapbox_access_token(open("../.mapbox_token").read())

fig = px.scatter_mapbox(
    resampled_df,
    lat="latitude",
    lon="longitude",
    color="labels",
    hover_data=["latitude", "longitude", "datetime"],
    # If you want animation, please uncomment this line
    # animation_frame=resampled_df.datetime.astype(str),
    center={'lat': 36.37 , 'lon': 127.36},
    zoom=14,
    width=800,
    height=800
)
fig.update_layout(title="Time trace on location", mapbox_style="streets")
# fig.update_layout(mapbox_bounds={"west": 127.35, "east": 127.37, "south": 36.36, "north": 36.38})

fig.show()

## Define routine

Using clustering, define routines

In [11]:
ROUTINE_LABELS = {
    10: "CLASS",
    7: "INDOOR",
    8: "MEAL",
    9: "MEAL",
    25: "STUDY",
    23: "CLASS",
    15: "MEAL",
    29: "EXERCISE",
    13: "STUDY"
}

In [13]:
resampled_df['routine'] = resampled_df['labels'].map(ROUTINE_LABELS)
resampled_df.head()

Unnamed: 0,datetime,timestamp,longitude,latitude,labels,routine
0,2019-04-30 09:00:00+09:00,1556582713087.0,127.362479,36.37021,10.0,CLASS
2,2019-04-30 09:30:00+09:00,,,,,
3,2019-04-30 09:45:00+09:00,1556585347204.0,127.362467,36.370216,10.0,CLASS
4,2019-04-30 10:00:00+09:00,1556586774094.0,127.362427,36.370349,10.0,CLASS
5,2019-04-30 10:15:00+09:00,1556586945903.0,127.362403,36.370506,10.0,CLASS


In [14]:
resampled_df

Unnamed: 0,datetime,timestamp,longitude,latitude,labels,routine
0,2019-04-30 09:00:00+09:00,1556582713087,127.362479,36.37021,10,CLASS
2,2019-04-30 09:30:00+09:00,,,,,
3,2019-04-30 09:45:00+09:00,1556585347204,127.362467,36.370216,10,CLASS
4,2019-04-30 10:00:00+09:00,1556586774094,127.362427,36.370349,10,CLASS
5,2019-04-30 10:15:00+09:00,1556586945903,127.362403,36.370506,10,CLASS
...,...,...,...,...,...,...
600,2019-05-06 15:00:00+09:00,,,,,
603,2019-05-06 15:45:00+09:00,1557125170469,127.126739,37.412901,0,
615,2019-05-06 18:45:00+09:00,1557135900000,127.3587,36.37352,1,
616,2019-05-06 19:00:00+09:00,1557137207851,127.357251,36.373675,7,INDOOR


## Return dataframe

Make dataframe for routine.
It will be stored in `csv` directory.

---

DataFrame columns

- timestamp
- user_id
- logitude
- latitude
- routine
- start_at
- end_at
- weekday

In [15]:
result_df = resampled_df[resampled_df['routine'].notnull()] \
    [['timestamp', 'longitude', 'latitude', 'routine', 'datetime']] \
    .rename(columns={'datetime': 'start_at'}) \
    .set_index('timestamp')
result_df['user_id'] = USER_ID
result_df['end_at'] = result_df['start_at'] + pd.Timedelta(minutes=15)
result_df['weekday'] = result_df['start_at'].dt.dayofweek
result_df.head()

Unnamed: 0_level_0,longitude,latitude,routine,start_at,user_id,end_at,weekday
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1556582713087,127.362479,36.37021,CLASS,2019-04-30 09:00:00+09:00,P3029,2019-04-30 09:15:00+09:00,1
1556585347204,127.362467,36.370216,CLASS,2019-04-30 09:45:00+09:00,P3029,2019-04-30 10:00:00+09:00,1
1556586774094,127.362427,36.370349,CLASS,2019-04-30 10:00:00+09:00,P3029,2019-04-30 10:15:00+09:00,1
1556586945903,127.362403,36.370506,CLASS,2019-04-30 10:15:00+09:00,P3029,2019-04-30 10:30:00+09:00,1
1556589943395,127.362521,36.370524,CLASS,2019-04-30 11:00:00+09:00,P3029,2019-04-30 11:15:00+09:00,1


In [17]:
result_df.to_csv(f'../csv/routines_raw/{USER_ID}-location.csv', index=False)