### GeoSpatial Data

Geospatial data represents positions on the Earth's surface. The data points themselves are usually accompanied by metadata (such as time, format, etc.) so that they can be interpreted correctly.

---

In [2]:
import numpy as np
import pandas as pd
import geopandas as gpd
import re
from sqlalchemy import create_engine
from sqlalchemy.types import Integer, Text, String, DateTime, Float, TIMESTAMP

### Raw Dataset

Our working dataset consists of three CSV files:

- **raw_data_2.csv** contains the GPS trajectories of 300+ vehicles during April 2020. Consecutive GPS points are at least 60 s apart.

## Part 1. Fixing DataTypes for GPS Trajectory

In [3]:
# the file is about 1.7 GB; low_memory=False makes pandas infer each column's dtype
# from the whole file instead of chunk by chunk (no mixed-type columns, but more memory)
df = pd.read_csv('Datasets/raw_data_2.csv', sep=';', low_memory=False)

In [3]:
df.head(3)

Unnamed: 0,record_id,device_id,license_plate,driver,vehicle_group,date,time,speed_kmh,source,type,...,course,distance_km,odometer,region,route,altitude_m,accu_voltage,unit,address,Unnamed: 24
0,1.0,792168.0,B9922SDB,HARI PERMANA,DC Kawasan,"Wednesday, April 1, 2020",00:05:05,0,Program,Location Information,...,,,1231348,DC Kawasan,,37.0,,,"Kawasan Industri Pulo Gadung, Jakarta, Dki Jak...",
1,2.0,792168.0,B9922SDB,HARI PERMANA,DC Kawasan,"Wednesday, April 1, 2020",00:15:05,0,Program,Location Information,...,,0.0,1231348,DC Kawasan,,37.0,,,"Kawasan Industri Pulo Gadung, Jakarta, Dki Jak...",
2,3.0,792168.0,B9922SDB,HARI PERMANA,DC Kawasan,"Wednesday, April 1, 2020",00:25:05,0,Program,Location Information,...,,0.0,1231348,DC Kawasan,,37.0,,,"Kawasan Industri Pulo Gadung, Jakarta, Dki Jak...",


### Feature Description

- **device_id** : identifier of each gps receiver
- **license_plate** : license plate of the vehicle
- **driver** : driver name
- **vehicle_group** : vehicle group (defined by customer)
- **date** : date on which this point was recorded (UTC+7)
- **time** : time at which this point was recorded (UTC+7)
- **speed_kmh** : current speed of the vehicle, in km/h
- **source** : source of the data (reported by hardware or inferred by software)
- **type** : location mark
- **value** : value of the event (if applicable)
- **longitude**
- **latitude**           
- **idling_duration_s** 
- **ignition_duration_s**
- **course** : vehicle bearing (0-360)          
- **distance_km**        
- **odometer**           
- **region** : name of region (if the vehicle is inside a predefined geofence)             
- **route**           
- **altitude_m** : current altitude above sea level, in metres
- **accu_voltage**       
- **unit**               
- **address**            

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7132129 entries, 0 to 7132128
Data columns (total 25 columns):
record_id              float64
device_id              float64
license_plate          object
driver                 object
vehicle_group          object
date                   object
time                   object
speed_kmh              object
source                 object
type                   object
value                  float64
longitude              object
latitude               object
idling_duration_s      object
ignition_duration_s    object
course                 object
distance_km            object
odometer               object
region                 object
route                  float64
altitude_m             float64
accu_voltage           float64
unit                   float64
address                object
Unnamed: 24            float64
dtypes: float64(8), object(17)
memory usage: 1.3+ GB


In [4]:
df.isna().sum()

record_id                    1
device_id                    1
license_plate                1
driver                   70381
vehicle_group                1
date                         1
time                         1
speed_kmh                    1
source                       1
type                         1
value                  7132129
longitude                    1
latitude                     1
idling_duration_s         2965
ignition_duration_s       2965
course                 1616285
distance_km               6189
odometer                  5826
region                 5629655
route                  7132129
altitude_m              134935
accu_voltage           7132129
unit                   7132129
address                    816
Unnamed: 24            7132129
dtype: int64

In [5]:
df[df['device_id'].isna()]
# each datapoint must have device id

Unnamed: 0,record_id,device_id,license_plate,driver,vehicle_group,date,time,speed_kmh,source,type,...,course,distance_km,odometer,region,route,altitude_m,accu_voltage,unit,address,Unnamed: 24
7132128,,,,,,,,,,,...,,43240416,,,,,,,,


In [6]:
df.drop(7132128, inplace=True)

In [7]:
# discard record_id and the empty trailing column 'Unnamed: 24'
df.drop(['record_id', 'Unnamed: 24'], axis=1, inplace=True)

***Some feature data types have to be converted***

In [8]:
# device_id is an identifier, not a quantity, so store it as a string
df['device_id'] = df['device_id'].astype(int).astype(str)

In [9]:
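# speed should be a float; the file uses ',' as the decimal separator
# (vectorized string replace is much faster than a row-wise apply on 7M rows)
df['speed_kmh'] = df['speed_kmh'].str.replace(',', '.', regex=False).astype(float)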

In [10]:
df.head(1)['longitude']

0    106,911575
Name: longitude, dtype: object

In [11]:
df.head(1)['latitude']

0    -6,188440
Name: latitude, dtype: object

In [12]:
# longitude and latitude should be floats as well; swap the decimal comma for a dot

df['longitude'] = df['longitude'].str.replace(',', '.', regex=False).astype(float)
df['latitude'] = df['latitude'].str.replace(',', '.', regex=False).astype(float)
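
As an alternative to converting each column after loading, `pd.read_csv` accepts a `decimal` argument that treats `,` as the decimal separator at parse time. The following is a minimal sketch on a two-row inline sample. The sample text is made up from the values shown above, and whether this suits every column of the real file is an assumption that would need checking.

```python
import io
import pandas as pd

# Tiny stand-in for the real file: ';' as field separator, ',' as decimal mark.
# The rows are illustrative, not an extract of raw_data_2.csv.
sample = io.StringIO(
    "device_id;speed_kmh;longitude;latitude\n"
    "792168;0;106,911575;-6,188440\n"
    "792168;24,44;106,911575;-6,188440\n"
)

# decimal=',' lets the parser read '106,911575' directly as a float.
sample_df = pd.read_csv(sample, sep=";", decimal=",")
print(sample_df.dtypes)
```

With this option the numeric columns arrive as `float64`, so the per-column string replacement becomes unnecessary for them.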

In [13]:
day_ptn = re.compile(r"(\d+)d", flags=re.I)
hour_ptn = re.compile(r"(\d+)h", flags=re.I)
min_ptn = re.compile(r"(\d+)m", flags=re.I)
sec_ptn = re.compile(r"(\d+)s", flags=re.I)

def convert_prettytime_to_seconds(_str):
    """Convert a space-separated duration string (day/hour/minute/second tokens) to seconds.

    Missing values (None or NaN) and empty strings are treated as 0 seconds.
    """
    # NaN arrives as a float, so anything that is not a string counts as "no duration"
    if not isinstance(_str, str):
        return 0

    seconds = 0

    # str.split() without arguments also copes with leading/trailing whitespace
    for t_bits in _str.split():
        if day_ptn.search(t_bits) is not None:
            seconds += int(day_ptn.findall(t_bits)[0]) * 86400
        elif hour_ptn.search(t_bits) is not None:
            seconds += int(hour_ptn.findall(t_bits)[0]) * 3600
        elif min_ptn.search(t_bits) is not None:
            seconds += int(min_ptn.findall(t_bits)[0]) * 60
        elif sec_ptn.search(t_bits) is not None:
            seconds += int(sec_ptn.findall(t_bits)[0])
        else:
            raise ValueError("Unable to parse time format: {!r}".format(t_bits))

    return seconds


In [14]:
# idling_duration_s and ignition_duration_s

df['idling_duration_s'] = df.apply(lambda x: convert_prettytime_to_seconds(x['idling_duration_s']), axis = 1)

In [15]:
df['ignition_duration_s'] = df.apply(lambda x: convert_prettytime_to_seconds(x['ignition_duration_s']), axis = 1)
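
To make the duration conversion concrete, here is a small standalone sketch of the same idea using a single regular expression. The name `parse_pretty_duration` is hypothetical, and the input format (tokens such as `2h` or `15m`) is inferred from the patterns above rather than confirmed against the raw file. Unlike the notebook's function, this variant silently ignores tokens it does not recognise instead of raising.

```python
import re

# Seconds per unit suffix; suffixes are matched case-insensitively.
UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}
TOKEN = re.compile(r"(\d+)\s*([dhms])", flags=re.I)


def parse_pretty_duration(text):
    """Return the duration in seconds; non-string input (None, NaN) gives 0."""
    if not isinstance(text, str):
        return 0
    return sum(int(n) * UNIT_SECONDS[u.lower()] for n, u in TOKEN.findall(text))


print(parse_pretty_duration("1d 2h 3m 4s"))  # 86400 + 7200 + 180 + 4 = 93784
print(parse_pretty_duration(float("nan")))   # 0
```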

In [16]:
# convert distance_km to float; astype(str) turns NaN into 'nan', which float() reads back as NaN
df['distance_km'] = df['distance_km'].astype(str).str.replace(',', '.', regex=False).astype(float)

In [17]:
# convert odometer to float; astype(str) turns NaN into 'nan', which float() reads back as NaN
df['odometer'] = df['odometer'].astype(str).str.replace(',', '.', regex=False).astype(float)

In [18]:
# convert route to string where a value is present; NaN is left as-is
# (route is entirely NaN in this file, so the column stays float64)
df['route'] = df.apply(lambda x: x['route'] if np.isnan(x['route']) else str(x['route']) , axis =1)

In [19]:
# convert course to float as well; NaN values are left untouched. A course of 0 means true north

def format_course(x):
    if type(x['course']) == float:
        return x['course']
    elif isinstance(x['course'], str):
        return float(x['course'].replace(',','.'))
    else:
        return x['course']

# df.apply(format_course, axis=1)

df['course'] = df.apply(format_course, axis=1)

### Convert time to timestamp format

In [20]:
df['combined_time_str'] = df['date'].astype(str) + ' ' + df['time'].astype(str) + ' ' + '+07:00'

In [21]:
df.head(5)

Unnamed: 0,device_id,license_plate,driver,vehicle_group,date,time,speed_kmh,source,type,value,...,course,distance_km,odometer,region,route,altitude_m,accu_voltage,unit,address,combined_time_str
0,792168,B9922SDB,HARI PERMANA,DC Kawasan,"Wednesday, April 1, 2020",00:05:05,0.0,Program,Location Information,,...,,,12313.48,DC Kawasan,,37.0,,,"Kawasan Industri Pulo Gadung, Jakarta, Dki Jak...","Wednesday, April 1, 2020 00:05:05 +07:00"
1,792168,B9922SDB,HARI PERMANA,DC Kawasan,"Wednesday, April 1, 2020",00:15:05,0.0,Program,Location Information,,...,,0.0,12313.48,DC Kawasan,,37.0,,,"Kawasan Industri Pulo Gadung, Jakarta, Dki Jak...","Wednesday, April 1, 2020 00:15:05 +07:00"
2,792168,B9922SDB,HARI PERMANA,DC Kawasan,"Wednesday, April 1, 2020",00:25:05,0.0,Program,Location Information,,...,,0.0,12313.48,DC Kawasan,,37.0,,,"Kawasan Industri Pulo Gadung, Jakarta, Dki Jak...","Wednesday, April 1, 2020 00:25:05 +07:00"
3,792168,B9922SDB,HARI PERMANA,DC Kawasan,"Wednesday, April 1, 2020",00:35:05,0.0,Program,Location Information,,...,,0.0,12313.48,DC Kawasan,,37.0,,,"Kawasan Industri Pulo Gadung, Jakarta, Dki Jak...","Wednesday, April 1, 2020 00:35:05 +07:00"
4,792168,B9922SDB,HARI PERMANA,DC Kawasan,"Wednesday, April 1, 2020",00:45:05,0.0,Program,Location Information,,...,,0.0,12313.48,DC Kawasan,,37.0,,,"Kawasan Industri Pulo Gadung, Jakarta, Dki Jak...","Wednesday, April 1, 2020 00:45:05 +07:00"


In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7132128 entries, 0 to 7132127
Data columns (total 24 columns):
device_id              object
license_plate          object
driver                 object
vehicle_group          object
date                   object
time                   object
speed_kmh              float64
source                 object
type                   object
value                  float64
longitude              float64
latitude               float64
idling_duration_s      int64
ignition_duration_s    int64
course                 float64
distance_km            float64
odometer               float64
region                 object
route                  float64
altitude_m             float64
accu_voltage           float64
unit                   float64
address                object
combined_time_str      object
dtypes: float64(11), int64(2), object(11)
memory usage: 1.3+ GB


In [23]:
for i in df['combined_time_str'].sample(5).values:
    print(i, '--->' ,pd.to_datetime(i), '::', pd.to_datetime(i).utcoffset())

# it works: pd.to_datetime parses the full string and keeps the +07:00 offset

Monday, April 13, 2020 13:41:23 +07:00 ---> 2020-04-13 13:41:23+07:00 :: 7:00:00
Wednesday, April 1, 2020 22:36:47 +07:00 ---> 2020-04-01 22:36:47+07:00 :: 7:00:00
Wednesday, April 22, 2020 13:45:54 +07:00 ---> 2020-04-22 13:45:54+07:00 :: 7:00:00
Friday, April 17, 2020 08:27:26 +07:00 ---> 2020-04-17 08:27:26+07:00 :: 7:00:00
Saturday, April 4, 2020 21:28:20 +07:00 ---> 2020-04-04 21:28:20+07:00 :: 7:00:00


In [24]:
t = pd.to_datetime('Thursday, April 16, 2020 18:19:24 +07:00')
print(t.utcoffset())
print(t.timestamp())
print(t.day)

7:00:00
1587035964.0
16
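
The string layout that `pd.to_datetime` infers here can also be written out explicitly. The sketch below uses only the standard library to show a format string matching these values. Passing the same format through `pd.to_datetime`'s `format` argument may speed up parsing of the full column, though that has not been measured in this notebook. Note that `%A` and `%B` depend on the locale, so an English (or default C) locale is assumed.

```python
from datetime import datetime

# Layout of strings such as 'Thursday, April 16, 2020 18:19:24 +07:00'.
# %z accepts the '+07:00' form on Python 3.7 and later.
FMT = "%A, %B %d, %Y %H:%M:%S %z"

parsed = datetime.strptime("Thursday, April 16, 2020 18:19:24 +07:00", FMT)
print(parsed.utcoffset())   # 7:00:00
print(parsed.timestamp())   # 1587035964.0
print(parsed.weekday())     # 3 (Monday=0, so Thursday=3)
```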


In [25]:
df['datetime'] = pd.to_datetime(df['combined_time_str'])

In [26]:
# derive the time features with vectorized operations instead of a row-wise apply
epoch = pd.Timestamp('1970-01-01', tz='UTC')
df['posix_time'] = (df['datetime'] - epoch).dt.total_seconds()
df['hour'] = df['datetime'].dt.hour
df['day_in_month'] = df['datetime'].dt.day
df['day_of_week'] = df['datetime'].dt.dayofweek  # Monday=0 ... Sunday=6

In [27]:
df.drop(['combined_time_str'], axis=1, inplace=True)
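
With a proper `datetime` column in place, the "at least 60 s apart" statement from the dataset description can be checked per device. This is a minimal sketch on toy data: the three timestamps mirror the first rows shown earlier, and the variable names `toy` and `gap_s` are illustrative only.

```python
import pandas as pd

# Toy frame mirroring the first three records of device 792168.
toy = pd.DataFrame({
    "device_id": ["792168"] * 3,
    "datetime": pd.to_datetime([
        "2020-04-01 00:05:05+07:00",
        "2020-04-01 00:15:05+07:00",
        "2020-04-01 00:25:05+07:00",
    ]),
})

# Sort within each device, then take the time difference to the previous point.
toy = toy.sort_values(["device_id", "datetime"])
gap_s = toy.groupby("device_id")["datetime"].diff().dt.total_seconds()

# The first point of each device has no predecessor, so its gap is NaN.
print(gap_s.min())
```

On the real frame, a minimum below 60 would point to duplicated or out-of-order records worth inspecting.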

### Handling Missing data

In [28]:
df.isna().sum()

device_id                    0
license_plate                0
driver                   70380
vehicle_group                0
date                         0
time                         0
speed_kmh                    0
source                       0
type                         0
value                  7132128
longitude                    0
latitude                     0
idling_duration_s            0
ignition_duration_s          0
course                 1616284
distance_km               6189
odometer                  5825
region                 5629654
route                  7132128
altitude_m              134934
accu_voltage           7132128
unit                   7132128
address                    815
datetime                     0
posix_time                   0
hour                         0
day_in_month                 0
day_of_week                  0
dtype: int64

In [66]:
# inspect rows where distance_km is NaN (candidates for filling with 0)

df[df['distance_km'].isna()].head(100)

Unnamed: 0,device_id,license_plate,driver,vehicle_group,date,time,speed_kmh,source,type,value,...,route,altitude_m,accu_voltage,unit,address,datetime,posix_time,hour,day_in_month,day_of_week
0,792168,B9922SDB,HARI PERMANA,DC Kawasan,"Wednesday, April 1, 2020",00:05:05,0.0,Program,Location Information,,...,,37.0,,,"Kawasan Industri Pulo Gadung, Jakarta, Dki Jak...",2020-04-01 00:05:05+07:00,1.585674e+09,0,1,2
32648,792173,B9909SDB,YULIANTO,DC Kawasan,"Wednesday, April 1, 2020",00:06:36,0.0,Program,Location Information,,...,,10.0,,,"Kawasan Industri Pulo Gadung, Jakarta, Dki Jak...",2020-04-01 00:06:36+07:00,1.585674e+09,0,1,2
53005,792173,B9909SDB,YULIANTO,DC Kawasan,"Wednesday, April 22, 2020",06:45:50,0.0,Program,Location Information,,...,,12.0,,,"Kawasan Industri Pulo Gadung, Jakarta, Dki Jak...",2020-04-22 06:45:50+07:00,1.587513e+09,6,22,2
55462,792173,B9909SDB,YULIANTO,DC Kawasan,"Friday, April 24, 2020",06:26:19,0.0,Program,Location Information,,...,,30.0,,,"Kawasan Industri Pulo Gadung, Jakarta, Dki Jak...",2020-04-24 06:26:19+07:00,1.587684e+09,6,24,4
55464,792173,B9909SDB,YULIANTO,DC Kawasan,"Friday, April 24, 2020",06:40:03,0.0,Program,Location Information,,...,,30.0,,,"Kawasan Industri Pulo Gadung, Jakarta, Dki Jak...",2020-04-24 06:40:03+07:00,1.587685e+09,6,24,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
492377,1019950,B9340SDB,BAGUS NUGRAHA,DC Cikarang,"Wednesday, April 1, 2020",00:01:03,0.0,Program,Location Information,,...,,65.0,,,"Bekasi, Jawa Barat, Indonesia",2020-04-01 00:01:03+07:00,1.585674e+09,0,1,2
528013,1019951,B9176HZ,FAISAL,DC Cikarang,"Wednesday, April 1, 2020",00:08:54,0.0,Program,Location Information,,...,,67.0,,,"Bekasi, Jawa Barat, Indonesia",2020-04-01 00:08:54+07:00,1.585675e+09,0,1,2
563372,1019952,B9896SDB,WASONO,DC Cikarang,"Wednesday, April 1, 2020",00:08:12,0.0,Program,Location Information,,...,,60.0,,,"Bekasi, Jawa Barat, Indonesia",2020-04-01 00:08:12+07:00,1.585674e+09,0,1,2
603286,1019953,T8626DD,TEGUH RIYANTO,DC Cikarang,"Wednesday, April 1, 2020",00:02:26,0.0,Program,Location Information,,...,,61.0,,,"Bekasi, Jawa Barat, Indonesia",2020-04-01 00:02:26+07:00,1.585674e+09,0,1,2


In [69]:
# inspect rows where odometer is NaN (candidates for filling with 0)
df[df['odometer'].isna()].sample(50).apply(print, axis=1)

device_id                                                        1021536
license_plate                                                    B9418VJ
driver                                                        JAYADI K.W
vehicle_group                                               DC P Kambing
date                                              Tuesday, April 7, 2020
time                                                            01:57:20
speed_kmh                                                              0
source                                                           Program
type                                                Location Information
value                                                                NaN
longitude                                                        106.909
latitude                                                        -6.19618
idling_duration_s                                                      0
ignition_duration_s                                

6904141    None
3131114    None
6904504    None
2868194    None
6905466    None
6853463    None
6904132    None
6904155    None
6905594    None
6904755    None
6905355    None
2226574    None
1947156    None
3356387    None
6853661    None
1947079    None
2589809    None
2592556    None
1230671    None
6853683    None
7079442    None
6904726    None
1679284    None
6908681    None
6853537    None
2697491    None
6905287    None
6903945    None
1944041    None
6853396    None
3150641    None
1945102    None
6905435    None
3138421    None
6905620    None
3650517    None
2586712    None
6908859    None
6904266    None
6905428    None
6853545    None
6904994    None
1938791    None
2773425    None
6905003    None
6904438    None
6853597    None
2588979    None
6903887    None
2713443    None
dtype: object

In [70]:
df[df['odometer'].isna()][['speed_kmh']].describe()

Unnamed: 0,speed_kmh
count,5825.0
mean,0.498439
std,2.297667
min,0.0
25%,0.0
50%,0.0
75%,0.0
max,24.44


In [71]:
df[df['distance_km'].isna()][['speed_kmh']].describe()

Unnamed: 0,speed_kmh
count,6189.0
mean,0.479628
std,2.258426
min,0.0
25%,0.0
50%,0.0
75%,0.0
max,24.44
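
The two summaries above suggest that most rows with a missing `odometer` or `distance_km` are stationary (the 75th percentile of speed is 0), but not all of them (the maximum is 24.44 km/h). One cautious option, sketched here on toy data, is to fill the missing distance with 0 only where the vehicle is not moving and leave the remaining gaps as NaN. This is an assumption about a reasonable policy, not a step the notebook actually performs.

```python
import numpy as np
import pandas as pd

# Toy data: two stationary rows and two moving rows, illustrative values only.
toy = pd.DataFrame({
    "speed_kmh": [0.0, 0.0, 24.44, 12.0],
    "distance_km": [np.nan, 0.0, np.nan, 0.2],
})

# Fill only the gaps where the vehicle is standing still.
stationary_gap = toy["distance_km"].isna() & (toy["speed_kmh"] == 0)
toy.loc[stationary_gap, "distance_km"] = 0.0

print(toy)
```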


In [75]:
df[(df['speed_kmh'] == 24.44) & (df['odometer'].isna())]

Unnamed: 0,device_id,license_plate,driver,vehicle_group,date,time,speed_kmh,source,type,value,...,route,altitude_m,accu_voltage,unit,address,datetime,posix_time,hour,day_in_month,day_of_week
3155679,1021346,B9921DI,DEDE M ISHAK,DC Kawasan,"Friday, April 24, 2020",08:35:08,24.44,Program,Location Information,,...,,21.0,,,"Pt Asuransi, Jalan Pemuda, 13220 Jakarta, Dki ...",2020-04-24 08:35:08+07:00,1587692000.0,8,24,4


In [76]:
df[df['course'].isna()]

Unnamed: 0,device_id,license_plate,driver,vehicle_group,date,time,speed_kmh,source,type,value,...,route,altitude_m,accu_voltage,unit,address,datetime,posix_time,hour,day_in_month,day_of_week
0,792168,B9922SDB,HARI PERMANA,DC Kawasan,"Wednesday, April 1, 2020",00:05:05,0.0,Program,Location Information,,...,,37.0,,,"Kawasan Industri Pulo Gadung, Jakarta, Dki Jak...",2020-04-01 00:05:05+07:00,1.585674e+09,0,1,2
1,792168,B9922SDB,HARI PERMANA,DC Kawasan,"Wednesday, April 1, 2020",00:15:05,0.0,Program,Location Information,,...,,37.0,,,"Kawasan Industri Pulo Gadung, Jakarta, Dki Jak...",2020-04-01 00:15:05+07:00,1.585675e+09,0,1,2
2,792168,B9922SDB,HARI PERMANA,DC Kawasan,"Wednesday, April 1, 2020",00:25:05,0.0,Program,Location Information,,...,,37.0,,,"Kawasan Industri Pulo Gadung, Jakarta, Dki Jak...",2020-04-01 00:25:05+07:00,1.585676e+09,0,1,2
3,792168,B9922SDB,HARI PERMANA,DC Kawasan,"Wednesday, April 1, 2020",00:35:05,0.0,Program,Location Information,,...,,37.0,,,"Kawasan Industri Pulo Gadung, Jakarta, Dki Jak...",2020-04-01 00:35:05+07:00,1.585676e+09,0,1,2
4,792168,B9922SDB,HARI PERMANA,DC Kawasan,"Wednesday, April 1, 2020",00:45:05,0.0,Program,Location Information,,...,,37.0,,,"Kawasan Industri Pulo Gadung, Jakarta, Dki Jak...",2020-04-01 00:45:05+07:00,1.585677e+09,0,1,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7132123,1021565,B9786TCH,Mukti Makmur,DC Rawa Domba,"Thursday, April 30, 2020",13:51:36,0.0,Program,Location Information,,...,,35.0,,,"Jalan Rawa Domba, 13440 Jakarta, Dki Jakarta, ...",2020-04-30 13:51:36+07:00,1.588229e+09,13,30,3
7132124,1021565,B9786TCH,Mukti Makmur,DC Rawa Domba,"Thursday, April 30, 2020",14:01:35,0.0,Program,Location Information,,...,,35.0,,,"Jalan Rawa Domba, 13440 Jakarta, Dki Jakarta, ...",2020-04-30 14:01:35+07:00,1.588230e+09,14,30,3
7132125,1021565,B9786TCH,Mukti Makmur,DC Rawa Domba,"Thursday, April 30, 2020",14:11:35,0.0,Program,Location Information,,...,,35.0,,,"Jalan Rawa Domba, 13440 Jakarta, Dki Jakarta, ...",2020-04-30 14:11:35+07:00,1.588231e+09,14,30,3
7132126,1021565,B9786TCH,Mukti Makmur,DC Rawa Domba,"Thursday, April 30, 2020",14:21:36,0.0,Program,Location Information,,...,,35.0,,,"Jalan Rawa Domba, 13440 Jakarta, Dki Jakarta, ...",2020-04-30 14:21:36+07:00,1.588231e+09,14,30,3


In [77]:
df[df['course'].isna()][['speed_kmh']].describe()

Unnamed: 0,speed_kmh
count,1616284.0
mean,0.03729285
std,0.6035708
min,0.0
25%,0.0
50%,0.0
75%,0.0
max,70.56


Most NaN course values occur when the speed is 0, which makes sense: a stationary vehicle has no heading to report.
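
The observation about missing course values can be quantified as the share of NaN-course rows whose speed is exactly 0. A minimal sketch on toy data follows. The values are illustrative, and the same expression could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy data: four rows without a course, three of them stationary.
toy = pd.DataFrame({
    "course": [np.nan, np.nan, np.nan, 90.0, np.nan],
    "speed_kmh": [0.0, 0.0, 0.0, 30.0, 70.56],
})

# Among rows with a missing course, the fraction that are not moving.
share_stationary = (toy.loc[toy["course"].isna(), "speed_kmh"] == 0).mean()
print(share_stationary)  # 0.75
```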

## Export to CSV

In [78]:
df.to_csv('Datasets/raw_data_2_cleaned.csv')

On second thought, this data belongs in a database.

In [29]:
# dumping every column takes too long, so keep only the features that are needed
df_sql = df[['device_id',
'license_plate',
'driver',
'vehicle_group',
'speed_kmh',
'type',
'value',
'longitude',
'latitude',
'idling_duration_s',
'ignition_duration_s',
'course',
'distance_km',
'odometer',
'region',
'route',
'altitude_m',
'datetime',
'posix_time',
'hour',
'day_in_month',
'day_of_week']]

In [30]:
# the dialect name must be 'postgresql'; the old 'postgres' alias was removed in SQLAlchemy 1.4
engine = create_engine('postgresql+psycopg2://jcds:pwdk2020@127.0.0.1:5432/gpstrajectory')

In [92]:
df_sql.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7132128 entries, 74224 to 74223
Data columns (total 22 columns):
device_id              object
license_plate          object
driver                 object
vehicle_group          object
speed_kmh              float64
type                   object
value                  float64
longitude              float64
latitude               float64
idling_duration_s      int64
ignition_duration_s    int64
course                 float64
distance_km            float64
odometer               float64
region                 object
route                  float64
altitude_m             float64
datetime               datetime64[ns, pytz.FixedOffset(420)]
posix_time             float64
hour                   int64
day_in_month           int64
day_of_week            int64
dtypes: datetime64[ns, pytz.FixedOffset(420)](1), float64(10), int64(5), object(6)
memory usage: 1.2+ GB


In [31]:
# sort_values returns a new DataFrame; assigning it back avoids the
# SettingWithCopyWarning that inplace=True triggers on a slice of df
df_sql = df_sql.sort_values(['device_id', 'datetime'], ascending=True)


In [33]:
from sqlalchemy.types import Integer, Text, String, DateTime, Float, TIMESTAMP
df_sql.to_sql("raw_trajectory",
           engine,
           if_exists='replace',
           index=True,
           chunksize=50,
           dtype={
                'device_id': String,
                'license_plate': String,
                'driver': String,
                'vehicle_group': String,
                'speed_kmh': Float,
                'type': String,
                'value': Float,
                'longitude': Float,
                'latitude': Float,
                'idling_duration_s': Integer,
                'ignition_duration_s': Integer,
                'course': Float,
                'distance_km': Float,
                'odometer': Float,
                'region': String,
                'route': String,
                'altitude_m': Float,
                'datetime': TIMESTAMP(timezone=True),
                'posix_time': Float,
                'hour': Integer,
                'day_in_month': Integer,
                'day_of_week': Integer
           })
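
After a load like this it is worth confirming that the table holds as many rows as the DataFrame. The sketch below shows the round trip on a three-row toy frame. It deliberately swaps in an in-memory SQLite database (via the standard-library `sqlite3` module) for PostgreSQL so that it runs without a server. The table name matches the one above, while the data values are illustrative.

```python
import sqlite3
import pandas as pd

# Toy frame with a few of the exported columns; values are illustrative.
toy = pd.DataFrame({
    "device_id": ["792168", "792168", "792173"],
    "posix_time": [1585674305.0, 1585674905.0, 1585674396.0],
    "speed_kmh": [0.0, 0.0, 0.0],
})

# In-memory SQLite stands in for the PostgreSQL engine used in the notebook.
conn = sqlite3.connect(":memory:")

# chunksize controls how many rows are written per batch.
toy.to_sql("raw_trajectory", conn, if_exists="replace", index=False, chunksize=2)

# Read the row count back and compare it with the source frame.
n_rows = pd.read_sql("SELECT COUNT(*) AS n FROM raw_trajectory", conn)["n"].iloc[0]
print(n_rows == len(toy))

conn.close()
```

For the full 7-million-row table, a `chunksize` far larger than 50 would likely shorten the load, though the best value depends on the database and has not been benchmarked here.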