# Attempt 1

This script takes as input a monthly aggregation of SafeGraph files. Each file holds one record per row: a CAID (unique user ID), a latitude and longitude, and a timestamp. The script takes the raw input, generates geometry objects, removes dud readings, and tries to classify every journey into one of several types.

Note that this is NOT the script that eventually determined the migration counts in the final layer. Nonetheless, it is catalogued here because its techniques are applicable to similar data sources and may be worth recycling.

Library Imports

In [2]:
import pandas as pd
import numpy as np
import sys, os
import shapely
import geopandas as gpd
from shapely.geometry import Point, LineString, box, MultiLineString, MultiPolygon
import math
import time

Set path Locations

In [397]:
data_path = r'D:\SG'
data_file = r'Results_Month_3.csv'
workspace = r'C:\Users\charl\Documents\GOST\SafeGraph'

DataFrame Controls: set the number of rows to import as row_count

In [378]:
row_count = None  # number of rows to import; None imports every row
mode = ''         # 'test' processes a single user; any other value processes all users

Define import function, create unique user counter

In [379]:
def Import(data_file):
    print("Commencing Import: %s" % time.ctime())
    
    df = pd.read_csv(os.path.join(data_path, data_file), nrows = row_count) # standard import of rows up to 'row_count'
    
    df['time'] = pd.to_datetime(df['utc_timestamp'], unit='s') # time object generated by converting unix timestamp
    
    df["n"] = df.groupby('caid').ngroup() # here, we number each user uniquely by grouping records by caid.
    
    print("Import Complete: %s" % time.ctime())
    return df
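
The two transformations inside Import can be tried on a small frame before committing to a full load. The following is a minimal sketch with invented values; only the column names `caid` and `utc_timestamp` come from the function above.

```python
import pandas as pd

# Toy records; the column names mirror those used by Import(), the values are invented.
toy = pd.DataFrame({
    "caid": ["b", "a", "b", "a", "c"],
    "utc_timestamp": [1520000000, 1520000060, 1520000120, 1520000180, 1520000240],
})

# Unix seconds -> pandas datetime (read as UTC, stored timezone-naive).
toy["time"] = pd.to_datetime(toy["utc_timestamp"], unit="s")

# ngroup() numbers the groups 0..k-1 in sorted key order, so every row
# belonging to the same caid receives the same integer.
toy["n"] = toy.groupby("caid").ngroup()

print(toy[["caid", "n"]].values.tolist())
```

Because the numbering starts at 0, the user with `n == 1` is the second caid in sorted order, not the first.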

Import function called. May take a while if row_count is large

In [380]:
df = Import(data_file)

Commencing Import: Mon Sep 17 17:54:24 2018
Import Complete: Mon Sep 17 17:54:35 2018


Grouping Functions for turning points into lines.

In [381]:
def LINEGROUPER(x2):
    # Collapse the points of one journey (x2) into a single-row summary frame
    y = pd.DataFrame()
    y['caid'] = [x2.caid.iloc[0]]
    y['StartTime'] = x2.time.iloc[0]
    y['EndTime'] = x2.time.iloc[-1]
    y['time_elapsed'] = y['EndTime'] - y['StartTime']
    y['bbox_utm'] = [LineString(x2.proj_geom.tolist()).bounds]
    y['start_utm'] = [x2.proj_geom.tolist()[0]]
    y['end_utm'] = [x2.proj_geom.tolist()[-1]]
    y['total_dist'] = [x2.del_dist.sum()]
    y['Av.Speed_mph'] = [(x2.del_dist.sum() / x2.del_time_secs.sum()) * 3600 / 1609.34]
    y['npoints'] = len(x2)
    try:
        y['geometry'] = [LineString(x2.geometry.tolist())]
    except Exception:
        # a LineString needs at least two points; mark the geometry as missing
        y['geometry'] = 'null'
    return y

def GROUPER(x):
    # One summary row per JourneyID
    return x.groupby(['JourneyID']).apply(LINEGROUPER)
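
The per-journey totals that LINEGROUPER computes (distance, average speed, point count) can also be obtained without building geometry. The sketch below uses invented data and swaps the per-group DataFrame construction for `groupby(...).agg(...)`; it is an illustration of the same arithmetic, not the notebook's method.

```python
import pandas as pd

# Invented point records for two journeys; del_dist in metres, del_time_secs in seconds.
pts = pd.DataFrame({
    "JourneyID": ["7_0", "7_0", "7_0", "7_1", "7_1"],
    "del_dist": [0.0, 800.0, 809.34, 0.0, 1609.34],
    "del_time_secs": [0.0, 60.0, 60.0, 0.0, 3600.0],
})

METRES_PER_MILE = 1609.34  # same conversion constant as LINEGROUPER

# One row per journey: total distance, total time and number of points.
journeys = pts.groupby("JourneyID").agg(
    total_dist=("del_dist", "sum"),
    total_secs=("del_time_secs", "sum"),
    npoints=("del_dist", "size"),
)

# metres per second -> miles per hour
journeys["Av.Speed_mph"] = (
    journeys["total_dist"] / journeys["total_secs"] * 3600 / METRES_PER_MILE
)

print(journeys)
```

The first journey covers one mile in two minutes (about 30 mph) and the second covers one mile in an hour (about 1 mph).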

Main function. It loops over the unique users in the DataFrame and processes each user's records in turn. Users are identified by the 'n' property generated upon import, which gives every caid its own number.

In [382]:
def Main(df, mode): 
    
    lines, points = [], []

    if mode == 'test':
        rlist = [1]
    else:
        rlist = df.n.unique() 
    print("Commencing Main: %s" % time.ctime())
    for record in rlist:
        
        # Take a single user's records in the master DF. This collapses the dataframe to just a single CAID. 
        cdf = df.loc[df.n == record]
        
        # reset index to treat this CAID like its own DF.
        cdf = cdf.reset_index()

        # Skip if too short to extract useful information
        if len(cdf) < 3:
            pass

        else:

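            # Find the relevant UTM zone for that user from their first latitude and longitude.
            # Latitude selects the hemisphere (326xx north, 327xx south); longitude selects the zone number.
            EPSG = 32700-round((45+cdf.latitude.iloc[0])/90,0)*100+round((183+cdf.longitude.iloc[0])/6,0)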

            # Create a geometry object for each point recorded in their records
            geom = [Point(xy) for xy in zip(cdf.longitude, cdf.latitude)]

            # Use this to create a GeoDataFrame. Initialize with WGS84.
            gcdf = gpd.GeoDataFrame(cdf, crs = {'init' :'epsg:4326'}, geometry = geom)

            # Reproject to relevant UTM Zone
            gcdf = gcdf.to_crs({'init' :'epsg:%d' % EPSG})

            # define new lat / lon in projected UTM coordinates
            gcdf['lon_UTM'], gcdf['lat_UTM'] = gcdf['geometry'].x, gcdf['geometry'].y

            # Generate Time Deltas
            gcdf['del_time_secs'] = gcdf['utc_timestamp'].diff()

            # Generate Distance Deltas (metres)
            gcdf['del_lon'] = gcdf.lon_UTM.diff()
            gcdf['del_lat'] = gcdf.lat_UTM.diff() 
            gcdf['dist'] = np.sqrt(np.square(gcdf['del_lat'])+np.square(gcdf['del_lon'])).fillna(0)

            # Create offset column - 'next distance travelled' - flags start of journey
            gcdf['next_dist'] = gcdf['dist'].shift(-1).fillna(0)       

            # Calculate average speed
            gcdf['av.speed (m/s)'] = (gcdf['dist'] / gcdf['del_time_secs'])
            gcdf['av.speed (mph)'] = (gcdf['av.speed (m/s)'] * 3600 / 1609.34)

            # Remove stationary points - keep a point only if it is about to move (start of a journey) or has just moved
            gcdf = gcdf.loc[(gcdf['next_dist'] != 0) | (gcdf['dist'] != 0)]
            
            # having applied the above controls, pass if fewer than 3 records. 
            if len(gcdf) < 3:
                pass
            
            # otherwise, continue...
            else:
                
                # Regenerate deltas / calculations post cleaning.
                # these steps mirror those above, but with stationary points removed. 
                gcdf = gcdf.reset_index()
                gcdf['del_lon'] = gcdf.lon_UTM.diff()
                gcdf['del_lat'] = gcdf.lat_UTM.diff() 
                gcdf['del_dist'] = np.sqrt(np.square(gcdf['del_lat'])+np.square(gcdf['del_lon']))
                gcdf['del_time_secs'] = gcdf['utc_timestamp'].diff()
                gcdf['av.speed (m/s)'] = gcdf['del_dist'] / gcdf['del_time_secs'].fillna(0)
                gcdf['av.speed (mph)'] = gcdf['av.speed (m/s)'] * 3600 / 1609.34
                gcdf['av.speed (mph)'] = gcdf['av.speed (mph)'].fillna(0)

                # Create 'new journey start' flag function (not yet applied)
                def JourneyFlag(x):
                    if x['del_dist'] == 0 and x['next_dist'] > 0 and  x['del_time_secs'] > 60:
                        flag = 1
                    else:
                        flag = 0
                    return flag

                # Add flag for a new journey (applied above function, row-wise)
                gcdf['newjourneyflag'] = gcdf.apply(lambda x: JourneyFlag(x), axis = 1)

                # Add counter for journeys done by this caid
                gcdf['JourneyID'] = gcdf['newjourneyflag'].cumsum(axis=0)
                gcdf['JourneyID'] = '%d_' % record + gcdf['JourneyID'].astype(str)

                # Generate bearing function - demarks direction of travel between record pairs. 
                def BearingCalc(x):
                    degs = math.degrees(math.atan2(x['del_lon'], x['del_lat']))
                    if degs < 0:
                        degs = 360 + degs
                    return degs
                
                # application of bearing function
                gcdf['bearing_last'] = gcdf.apply(lambda x: BearingCalc(x), axis = 1).fillna(0)
                gcdf['bearing_next'] = gcdf['bearing_last'].shift(-1).fillna(0)

                # Reset Geometry to WGS84 base
                gcdf['proj_geom'] = gcdf['geometry']
                gcdf = gcdf.set_geometry([Point(xy) for xy in zip(gcdf.longitude, gcdf.latitude)])

                # Group points into a new Line objects dataframe, reset index
                ldf = GROUPER(gcdf)
                ldf = ldf.reset_index()

                # Remove journeys which are too slow, too fast or don't go anywhere
                ldf = ldf.loc[(ldf['Av.Speed_mph'] > 0.1) & 
                              (ldf['Av.Speed_mph'] < 120) &
                              (ldf['total_dist'] > 100)
                             ]

                if len(ldf) == 0:
                    pass
                
                else:

                    # Remove points corresponding to dud journeys
                    gcdf = gcdf.loc[gcdf['JourneyID'].isin(ldf.JourneyID)]

                    # Add start and end coordinates to line dataframe
                    ldf['start_loc'] = ldf['geometry'].apply(lambda x: Point(x.coords[0]))
                    ldf['end_loc'] = ldf['geometry'].apply(lambda x: Point(x.coords[-1]))

                    # Add centroid of journey
                    ldf['journey_centroid'] = ldf['geometry'].apply(lambda x: x.centroid)

                    # MID is maximal intra-trip displacement. This is a measure of the area a journey covers
                    ldf['MID'] = ldf['bbox_utm'].apply(lambda x: (np.sqrt(np.square(x[2] - x[0])+np.square(x[3] - x[1]))))

                    # Add displacement between start and end of journey
                    ldf['Disp'] = ldf.apply(lambda x: x.start_utm.distance(x.end_utm), axis = 1)

                    # append lines and points dataframes for this ID to list objects for later concatenation.
                    lines.append(ldf)
                    points.append(gcdf)
    
    # place into master dataframes
    linesdf = pd.concat(lines)
    pointsdf = pd.concat(points)   
    
    print("Main Complete: %s" % time.ctime())
    
    # return both line and point dataframes.
    return linesdf, pointsdf
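
The UTM zone lookup in Main packs the hemisphere and zone arithmetic into a single expression. A more explicit version is sketched below. `utm_epsg` is a hypothetical helper that is not part of the notebook; it assumes the standard 6-degree zones and ignores the special zones around Norway and Svalbard.

```python
def utm_epsg(lat, lon):
    """EPSG code of the WGS84 / UTM zone containing (lat, lon) in decimal degrees."""
    zone = int((lon + 180) // 6) + 1      # zones 1..60, each 6 degrees wide
    zone = min(max(zone, 1), 60)          # clamp lon == 180 into zone 60
    base = 32600 if lat >= 0 else 32700   # 326xx = northern, 327xx = southern hemisphere
    return base + zone


print(utm_epsg(51.5, -0.12))    # London
print(utm_epsg(-33.9, 151.2))   # Sydney
```

Latitude only decides the hemisphere prefix, while longitude decides the zone number, so the two inputs must not be swapped.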

Commencing Main: Mon Sep 17 17:54:35 2018
Main Complete: Tue Sep 18 01:49:30 2018
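
Much of the run time above is spent in row-wise `apply` calls. As an illustration, the journey-start rule from JourneyFlag and the running journey counter can be expressed with vectorised boolean operations. This is a sketch on invented deltas, and whether it would shorten the full run materially has not been measured.

```python
import pandas as pd

# Invented deltas for one user: metres since the previous fix, metres to the
# next fix, and seconds since the previous fix.
pts = pd.DataFrame({
    "del_dist":      [0.0, 50.0, 40.0, 0.0, 70.0, 30.0],
    "next_dist":     [50.0, 40.0, 0.0, 70.0, 30.0, 0.0],
    "del_time_secs": [0.0, 30.0, 30.0, 900.0, 30.0, 30.0],
})

# A journey starts where the user has not moved since the last fix, is about
# to move, and more than 60 seconds have passed (the rule used in JourneyFlag).
new_journey = (
    (pts["del_dist"] == 0)
    & (pts["next_dist"] > 0)
    & (pts["del_time_secs"] > 60)
)

# The cumulative sum turns the start flags into a running journey counter.
pts["JourneyID"] = new_journey.astype(int).cumsum()

print(pts["JourneyID"].tolist())
```

Only the fourth row satisfies all three conditions, so the first three points fall in journey 0 and the last three in journey 1.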


Here, we execute the above function. If mode == 'test', then only a single user's records (the group with n == 1) are processed.

In [None]:
linesdf, pointsdf = Main(df, mode)
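
Among the per-point attributes that Main produces is the bearing between consecutive records. The convention used by BearingCalc (0 degrees = north, increasing clockwise) can be checked in isolation. `bearing_deg` is a hypothetical stand-alone version that takes the easting and northing differences directly.

```python
import math

def bearing_deg(d_east, d_north):
    """Compass bearing in degrees (0 = north, clockwise) for a planar displacement."""
    # atan2(east, north) measures the angle from the north axis; the modulo
    # maps negative angles into the 0..360 range.
    return math.degrees(math.atan2(d_east, d_north)) % 360


for name, step in [("north", (0, 1)), ("east", (1, 0)), ("south", (0, -1)), ("west", (-1, 0))]:
    print(name, bearing_deg(*step))
```

Passing the arguments as (east, north) rather than the usual (y, x) order of `atan2` is what produces a compass bearing instead of a mathematical angle.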

Here, we send the intermediate, processed frames to file:

In [383]:
print("Commencing Save: %s" % time.ctime())

# Drop columns that are not needed downstream (columns absent from pointsdf are ignored)
pointsdf = pointsdf.drop(columns=['n','chng','index','Unnamed: 0','del_lon','del_lat','newjourneyflag'], errors='ignore')
linesdf = linesdf.drop(columns=['level_1'])

# Save down
linesdf.to_csv(os.path.join(workspace, 'journeys.csv'))
pointsdf.to_csv(os.path.join(workspace, 'points.csv'))

print("Save complete: %s" % time.ctime())

Commencing Save: Tue Sep 18 01:49:31 2018
Save complete: Tue Sep 18 01:52:11 2018


Now that we have journeys assembled from the points, we can classify their probable types. We do this by looking at:

- 'total_dist': the total distance covered in a single uninterrupted journey
- 'Disp': the displacement between the start and end of the journey
- 'MID': the maximal intra-trip displacement. This is a measure of the area the journey covers.

The thresholds used in the classification are approximations chosen by Charles Fox from observation alone. If you have better ideas for new classification rules, or better estimates for the parameters, implement them in this classification function!
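
To make the three measures concrete, the sketch below computes them for an invented out-and-back path in projected metres. It follows the definitions used in Main: 'total_dist' sums the legs, 'Disp' is the start-to-end distance, and 'MID' is the diagonal of the journey's bounding box.

```python
import math

# An invented out-and-back path in projected (UTM-style) metres.
path = [(0, 0), (3000, 4000), (6000, 8000), (3000, 4000), (0, 0)]

def leg(a, b):
    """Straight-line distance between two (x, y) points."""
    return math.hypot(b[0] - a[0], b[1] - a[1])

# total_dist: sum of the legs between consecutive points.
total_dist = sum(leg(a, b) for a, b in zip(path, path[1:]))

# Disp: distance between the first and the last point.
disp = leg(path[0], path[-1])

# MID: diagonal of the axis-aligned bounding box around the whole path.
xs, ys = zip(*path)
mid = math.hypot(max(xs) - min(xs), max(ys) - min(ys))

print(total_dist, disp, mid)
```

A round trip therefore shows a large 'total_dist' and 'MID' but a 'Disp' near zero, which is the pattern the commute categories look for.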

In [384]:
def Classify(x):
    home_range = x.total_dist/6
    
    if x.Disp > 100000 and x.total_dist < x.Disp*3:
        status = 'Migration'
        return status
    elif x.total_dist > 25000 and x.total_dist < x.Disp*3:
        status = 'LongTrip'
        return status
    elif x.total_dist > 25000 and x.Disp < home_range and x.MID > home_range:
        status = 'LongCommute'
        return status
    elif x.total_dist < 25000 and x.Disp < home_range and x.MID > home_range:
        status = 'LocalCommute'
        return status
    elif x.total_dist / max(x.Disp,1) > 5:
        status = 'LocalMovement'
        return status
    elif x.total_dist < 25000:
        status = 'LocalTrip'
        return status
    else:
        status = 'Other'
    return status
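
To see how the rules interact, it can help to run them on a few hand-made journeys. The sketch below restates the rules of Classify in condensed form so that it runs on its own. `Journey`, `classify_trip` and the example values are hypothetical and exist only for this illustration; all distances are in metres.

```python
from collections import namedtuple

Journey = namedtuple("Journey", ["total_dist", "Disp", "MID"])  # all in metres

def classify_trip(j):
    """Condensed restatement of the rules in Classify, for experimentation."""
    home_range = j.total_dist / 6
    direct = j.total_dist < j.Disp * 3                       # path is fairly straight
    loops_home = j.Disp < home_range and j.MID > home_range  # ends near where it began
    if j.Disp > 100000 and direct:
        return "Migration"
    if j.total_dist > 25000 and direct:
        return "LongTrip"
    if j.total_dist > 25000 and loops_home:
        return "LongCommute"
    if j.total_dist < 25000 and loops_home:
        return "LocalCommute"
    if j.total_dist / max(j.Disp, 1) > 5:
        return "LocalMovement"
    if j.total_dist < 25000:
        return "LocalTrip"
    return "Other"

examples = {
    "one-way 150 km drive": Journey(150000, 120000, 125000),
    "20 km out-and-back": Journey(20000, 0, 10000),
    "40 km one-way trip": Journey(40000, 30000, 31000),
    "5 km errand": Journey(5000, 3000, 3200),
}
labels = {name: classify_trip(j) for name, j in examples.items()}

for name, label in labels.items():
    print(name, "->", label)
```

Because the rules are evaluated in order, a journey is assigned the first category it satisfies, so reordering the checks changes the results.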

In this block, we apply the customized classification function to each row in the lines (i.e. journeys) dataframe:

In [385]:
print("Commencing Classification: %s" % time.ctime())

# Take a copy of the selected columns so that adding 'TripType' does not raise a SettingWithCopyWarning
sum_lines = linesdf[['StartTime','start_loc','time_elapsed','end_loc','total_dist','Av.Speed_mph','MID','Disp','geometry']].copy()
sum_lines['TripType'] = sum_lines.apply(lambda x: Classify(x), axis = 1)
sum_lines.TripType.value_counts()
sum_lines.to_csv('journeys.csv') # send the classified lines file to .csv as 'journeys.csv'

print("Classification Complete: %s" % time.ctime())

Commencing Classification: Tue Sep 18 01:52:11 2018
Classification Complete: Tue Sep 18 01:53:27 2018


Finally, we generate a daily summary dataframe, built only from the lines dataframe assembled in 'Main'. For each user and each day, it combines that day's journeys into a single bounding box, counts the journeys, sums their distance, and places a centroid at the centre of the box. The summary is most informative for users who make more than one journey per day. If a user's centroid moves drastically from one day to the next, their home range has changed and a migration may have occurred.

In [386]:
print("Commencing Daily Summary: %s" % time.ctime())

def summary(x):
    y = pd.DataFrame()
    y['minx'] = [min(x.minx.tolist())]
    y['miny'] = [min(x.miny.tolist())]
    y['maxx'] = [max(x.maxx.tolist())]
    y['maxy'] = [max(x.maxy.tolist())]
    y['journey count'] = len(x)
    y['total distance'] = [x.total_dist.sum()]
    return y

ndf = linesdf.copy()
ndf = ndf[['caid','JourneyID','StartTime','total_dist', 'journey_centroid','geometry']]
ndf['Day'] = ndf['StartTime'].dt.day
ndf['Month'] = ndf['StartTime'].dt.month
ndf['journey_bounds'] = ndf['geometry'].apply(lambda x: x.bounds)
ndf['minx'] = ndf.journey_bounds.apply(lambda x: x[0])
ndf['miny'] = ndf.journey_bounds.apply(lambda x: x[1])
ndf['maxx'] = ndf.journey_bounds.apply(lambda x: x[2])
ndf['maxy'] = ndf.journey_bounds.apply(lambda x: x[3])
ndf = ndf.groupby(['caid', 'Day']).apply(lambda x: summary(x))
ndf['bbox'] = ndf.apply(lambda x: box(x.minx, x.miny, x.maxx, x.maxy), axis = 1)
ndf['centroid'] = ndf.bbox.apply(lambda x: x.centroid)
ndf = ndf.drop(['minx','miny','maxx','maxy'], axis = 1)
ndf.to_csv(os.path.join(workspace, 'daily_summary.csv'))

print("Daily Summary Complete: %s" % time.ctime())

Commencing Daily Summary: Tue Sep 18 01:53:27 2018
Daily Summary Complete: Tue Sep 18 01:56:20 2018
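
The summary stops short of comparing centroids between days. One possible follow-up is sketched below for a single user, using invented daily bounding boxes. It assumes the boxes are in projected metres (the notebook derives them from the WGS84 geometry, so they would need projecting first), and `SHIFT_THRESHOLD_M` is a hypothetical cut-off, not a value from the notebook.

```python
import math

# Invented daily bounding boxes for one user, as (minx, miny, maxx, maxy) in metres.
daily_bounds = {
    1: (1000, 1000, 5000, 4000),
    2: (1200, 900, 5200, 4100),
    3: (251000, 1000, 255000, 4000),
}

def box_centroid(bounds):
    """Centre of an axis-aligned box, i.e. the centroid of the rectangle."""
    minx, miny, maxx, maxy = bounds
    return ((minx + maxx) / 2, (miny + maxy) / 2)

SHIFT_THRESHOLD_M = 100000  # hypothetical cut-off for a "drastic" move

# Distance between the centroids of consecutive days.
days = sorted(daily_bounds)
shifts = {}
for prev, cur in zip(days, days[1:]):
    x0, y0 = box_centroid(daily_bounds[prev])
    x1, y1 = box_centroid(daily_bounds[cur])
    shifts[cur] = math.hypot(x1 - x0, y1 - y0)

flagged_days = [d for d, s in shifts.items() if s > SHIFT_THRESHOLD_M]
print(shifts, flagged_days)
```

A single large shift could equally be a long trip, so a real rule would probably also require the new centroid to persist over several days.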
