# SnapShot preprocessing:
This notebook takes the MATSim snapshot output as its input. MATSim reports each user's location at constant intervals, but only while the user is traveling. To bring the data as close as possible to an actual periodic location update, we fill in the missing records. The base snapshots are taken every 30 seconds; snapshots for longer polling intervals are derived from this base, which only reduces the accuracy. For the spatial aggregation to the level of TAZ zones, we use ArcGIS to extract the TAZ associated with each location record in the snapshot file. The ArcGIS output is the input here.

#### adding required packages

In [1]:
import pandas as pd
import time
from math import floor
import numpy as np
import pickle
import requests
import concurrent.futures

#### specifying the saving location 

In [2]:
savingLoc = "Y:/ZahraEftekhar/phase4/"

#### preparing the output of arcGIS for completion

The GIS output omits the locations outside Amsterdam. We therefore complete the data by assigning these records the TAZ code `0`.
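The completion step can be illustrated on toy data. This is a minimal sketch, not the notebook's actual cell: the helper name `fill_missing_taz`, the parameter `outside_code`, and the toy coordinates are assumptions made for illustration, while the column names follow the ones used in this notebook.

```python
import pandas as pd

# Toy stand-in for the complete MATSim snapshot (hypothetical values).
matsim = pd.DataFrame({
    "VEHICLE": [1, 1, 2],
    "TIME": [23400, 23430, 30000],
    "EASTING": [632364.8, 632279.7, 700000.0],
    "NORTHING": [5816900.0, 5816846.0, 5900000.0],
})

# Toy stand-in for the GIS output: the record of vehicle 2 is missing
# because it lies outside the study area.
gis = matsim.iloc[:2].assign(mzr_id=[7065, 5329])


def fill_missing_taz(gis_df, matsim_df, outside_code=0):
    """Right-join the GIS output onto the snapshot and fill missing TAZ codes."""
    merged = pd.merge(
        gis_df[["VEHICLE", "TIME", "mzr_id"]],
        matsim_df,
        how="right",               # keep every snapshot record
        on=["VEHICLE", "TIME"],
    )
    # Records without a GIS match get the "outside" TAZ code.
    merged["mzr_id"] = merged["mzr_id"].fillna(outside_code).astype(int)
    return merged.sort_values(["VEHICLE", "TIME"]).reset_index(drop=True)


completed = fill_missing_taz(gis, matsim)
print(completed)
```

Assigning the filled column back with `merged["mzr_id"] = ...` avoids chained indexing, so pandas does not raise a `SettingWithCopyWarning`.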

In [130]:
precompletion = pd.read_csv('{a}GISoutput_PreCompletion.CSV'.format(a=savingLoc),usecols=['mzr_id', 'VEHICLE','TIME','EASTING','NORTHING'])
precompletion = precompletion.sort_values(by=["VEHICLE","TIME"])
precompletion = precompletion.reset_index(drop=True)
with open('{a}snapShot_allowedUsers.pickle'.format(a=savingLoc),'rb') as handle:
    MATSimOutput = pickle.load(handle)
MATSimOutput=MATSimOutput.reset_index(drop=False)
print(len(MATSimOutput)-len(precompletion)," records are missing; we refill them in our snapShot. ")


7481  records are missing; we refill them in our snapShot. 


In [42]:
snapData = pd.merge(precompletion, MATSimOutput, how='right', on=['VEHICLE','TIME'])
# Records without a GIS match lie outside the study area and get TAZ code 0.
# Assigning the column directly avoids chained indexing (SettingWithCopyWarning).
snapData['mzr_id'] = snapData['mzr_id'].fillna(0)
snapData = snapData.loc[:,['VEHICLE','TIME','EASTING_y','NORTHING_y','mzr_id']]
snapData.columns = ['VEHICLE', 'TIME', 'EASTING', 'NORTHING', 'mzr_id']
snapData = snapData.sort_values(by = ['VEHICLE', 'TIME'])
with open('{a}finalInputPython.pickle'.format(a=savingLoc),'wb') as handle:
    pickle.dump(snapData, handle, protocol=pickle.HIGHEST_PROTOCOL)
  


Next, we generate the complete snapshot data at 30-second intervals. This is the base data set from which all other snapshots are derived: for each polling interval, we resample from this data.

In [9]:
with open('{a}finalInputPython.pickle'.format(a=savingLoc),'rb') as handle:
    snapData = pickle.load(handle)
snapData.reset_index(drop=True,inplace=True)
snapData.head()

|   | VEHICLE | TIME | EASTING | NORTHING | mzr_id |
|---|---|---|---|---|---|
| 0 | 1 | 23400 | 632364.770972 | 5816900.0 | 7065.0 |
| 1 | 1 | 23430 | 632279.680941 | 5816846.0 | 5329.0 |
| 2 | 1 | 23460 | 632234.315601 | 5816431.0 | 5329.0 |
| 3 | 1 | 23490 | 632200.291596 | 5816119.0 | 5329.0 |
| 4 | 1 | 23520 | 632209.756236 | 5815776.0 | 5329.0 |


### Generating snapshot file for 30 seconds polling interval:

In [24]:
# Time the whole process
startTime = time.time()

userGroups = snapData.groupby("VEHICLE")
concatData = {}
t1 = time.time()
for i, (ID, records) in enumerate(userGroups):
    # Report the elapsed time for every 1000 users
    if i % 1000 == 0:
        print(i, ")      ", time.time() - t1)
        t1 = time.time()

    # Work on a copy so that the group slice of snapData is not modified
    records = records.copy()
    records["TIME"] = pd.to_timedelta(records["TIME"], unit="s")
    records = records.set_index("TIME")
    # Repeat the first record 24 hours later so that the fill covers a full day
    wrapTime = records.index[0] + pd.to_timedelta('24:00:00')
    records.loc[wrapTime] = records.iloc[0, :]
    records = records.sort_index()
    # Forward-fill onto a regular 30-second grid, then drop the helper record
    records = records.resample('30s').ffill()
    records = records.drop(index=wrapTime)
    concatData[ID] = records
print((time.time() - startTime)//60,'minutes')

0 )       0.0009970664978027344
1000 )       14.911331415176392
2000 )       14.199820280075073
3000 )       14.559810400009155
4000 )       14.942324161529541
5000 )       14.227362632751465
6000 )       14.944886684417725
7000 )       13.681585550308228
8000 )       15.447866916656494
9000 )       15.049609899520874
10000 )       15.654713869094849
11000 )       15.848775625228882
12000 )       18.301263093948364
13000 )       15.120662212371826
14000 )       16.96431303024292
15000 )       16.541218519210815
16000 )       17.3521831035614
17000 )       17.873571634292603
18000 )       17.04734992980957
19000 )       17.75270915031433
20000 )       16.350047826766968
21000 )       17.702622175216675
22000 )       16.715301752090454
5.0 minutes
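The per-user gap filling in the loop above can be checked on a single toy user. This is a minimal sketch under stated assumptions: the function name `fill_user_day` and the toy records are hypothetical, and the helper row is appended with `pd.concat` rather than with `.loc` enlargement, which is a small variation that keeps the column dtypes unchanged.

```python
import pandas as pd


def fill_user_day(records, step="30s"):
    """Forward-fill one user's records onto a regular grid spanning 24 hours
    from the user's first record."""
    records = records.copy()
    records["TIME"] = pd.to_timedelta(records["TIME"], unit="s")
    records = records.set_index("TIME").sort_index()

    # Helper row: the first record repeated 24 hours later, so that
    # resampling extends the grid over a full day.
    wrap_row = records.iloc[[0]].copy()
    wrap_row.index = wrap_row.index + pd.Timedelta(hours=24)
    wrap_time = wrap_row.index[0]

    extended = pd.concat([records, wrap_row])
    filled = extended.resample(step).ffill()
    # The helper row only served to extend the grid; remove it again.
    return filled.drop(index=wrap_time)


# Hypothetical user with three records and gaps between them.
toy = pd.DataFrame({
    "VEHICLE": [1, 1, 1],
    "TIME": [23400, 23490, 23580],   # seconds since midnight
    "mzr_id": [7065, 5329, 5329],
})

filled = fill_user_day(toy)
print(len(filled))       # number of 30-second slots in 24 hours
print(filled.head(5))
```

Each gap is filled with the last known record, and the last record is carried forward until the 24-hour window closes, which matches the assumption that a user who reports nothing has not moved.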


In [None]:
with open("{a}completePLUdata_30sec_dict.pickle".format(a=savingLoc),"wb") as handle:
    pickle.dump(concatData, handle, protocol=pickle.HIGHEST_PROTOCOL)
# Stack all per-user frames in one call; DataFrame.append inside a loop
# copies the data on every iteration and was removed in pandas 2.0.
snapDataNew = pd.concat(list(concatData.values()))

In [27]:
with open("{a}completePLUdata_30sec.pickle".format(a=savingLoc),'wb') as handle:
    pickle.dump(snapDataNew, handle, protocol=pickle.HIGHEST_PROTOCOL)

### Generating snapshot file of polling intervals greater than 30 seconds: 

So far, we have prepared the snapshot file with a polling interval of 30 seconds. We now use it as the base for generating the snapshot files of the other polling intervals with `resample`, keeping the first record in each interval.

In [60]:
# Time the whole process
startTime = time.time()

pollInt = [60,300,600,900,1200,1500,1800,2100,2400,2700,3000,3300,3600,4500,5400,6300,7200]
IDs = list(concatData.keys())
for interval in pollInt:
    pollData = {}
    t1 = time.time()
    for ID in IDs:
        # Keep the first 30-second record in each polling window
        pollData[ID] = concatData[ID].resample('{b}s'.format(b=interval)).first()
    print("polling interval ",interval,":      ",(time.time()-t1)//60,'minutes')
    with open("{a}completePLUdata_{b}sec_dict.pickle".format(a=savingLoc,b= interval),'wb') as handle:
        pickle.dump(pollData, handle, protocol=pickle.HIGHEST_PROTOCOL)
print((time.time() - startTime)//60,'minutes')

polling interval  30 :       1.0 minutes
polling interval  60 :       0.0 minutes
polling interval  300 :       0.0 minutes
polling interval  600 :       0.0 minutes
polling interval  900 :       0.0 minutes
polling interval  1200 :       0.0 minutes
polling interval  1500 :       0.0 minutes
polling interval  1800 :       0.0 minutes
polling interval  2100 :       0.0 minutes
polling interval  2400 :       0.0 minutes
polling interval  2700 :       0.0 minutes
polling interval  3000 :       0.0 minutes
polling interval  3300 :       0.0 minutes
polling interval  3600 :       0.0 minutes
polling interval  4500 :       0.0 minutes
polling interval  5400 :       0.0 minutes
polling interval  6300 :       0.0 minutes
polling interval  7200 :       0.0 minutes
9.0 minutes
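The down-sampling step can be checked on a small series. This is a minimal sketch: the helper `downsample` and the toy 30-second series are assumptions made for illustration, and the interval is passed as a `pd.Timedelta` instead of a formatted string.

```python
import pandas as pd

# Hypothetical 30-second base series for one user: 10 minutes of records.
# The user changes TAZ after the fifth record (at 06:32:30).
index = pd.timedelta_range(start="06:30:00", periods=20, freq="30s", name="TIME")
base = pd.DataFrame(
    {"VEHICLE": 1, "mzr_id": [7065] * 5 + [5329] * 15},
    index=index,
)


def downsample(records, interval_seconds):
    """Keep the first 30-second record in each polling window."""
    return records.resample(pd.Timedelta(seconds=interval_seconds)).first()


coarse = downsample(base, 300)
print(coarse)
```

With a 300-second polling interval, the zone change at 06:32:30 only becomes visible in the record at 06:35:00. This is the loss of accuracy that longer polling intervals introduce.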
