### Route Analyzer
The objective of this script was to identify the Origin-Destination matrix out of Khayelitsha, Cape Town, South Africa, from GPS traces of vehicles operating in the area. 

Data were organized into two DataFrames: one where the unit of analysis is the trip, and one where the unit of analysis is the passenger (i.e. multiple passengers per trip). The two can be tied together using a key field, the unique trip ID.
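As a minimal sketch of that link, using invented values rather than the real spreadsheet, a trip-level field can be looked up for every passenger through the shared trip ID. The notebook itself does this by index alignment; `Series.map` is used here only because it is compact.

```python
import pandas as pd

# Toy data: two trips, three passengers (all values invented for illustration)
trips = pd.DataFrame({
    'Trip ID': [101, 102],
    'Revenue': [40.0, 25.0],
}).set_index('Trip ID')

passengers = pd.DataFrame({
    'Trip ID': [101, 101, 102],
    'Fare': [20.0, 20.0, 25.0],
})

# Look up the trip-level value for each passenger via the key field
passengers['Trip Revenue'] = passengers['Trip ID'].map(trips['Revenue'])
print(passengers)
```

Both passengers on trip 101 receive the same trip-level value, which is the behaviour the rest of the notebook relies on.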

Import the usual suspects

In [None]:
import sys, os, time
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point, LineString
from shapely.ops import unary_union

### Initial Raw File Imports

In [None]:
# file path
p = r'C:\Users\charl\Documents\GOST\South Africa\Connectivity\Submission_15_12_2018'

# file name
q = r'Recon_December.xlsx'

# We import different sheets of the MS Excel document as different DataFrames
Trips = pd.read_excel(os.path.join(p,q), sheet_name = 'Trips')
Passengers = pd.read_excel(os.path.join(p,q), sheet_name = 'Passengers')

# A separate doc, the RouteMaster, is also brought in. This MS Excel included data on the geometry of the routes
# themselves, but little else. All trips were made using standardized routes. 
RouteMaster = pd.read_excel(os.path.join(p,r'Route O-D_Master_21December2018.xlsx'))

# We only really need the origin and destination for each route to build our OD
RouteMaster = RouteMaster[["UNIQUE ROUTE ID's",'ORIGIN','DESTINATION']].set_index("UNIQUE ROUTE ID's")

### Match on useful fields from Trips frame
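The cell below starts by converting a time-of-day value into minutes. A standalone sketch of that conversion follows, assuming the 'Travel Time' cells arrive as `datetime.time` objects (an assumption about how the Excel sheet is parsed); `to_minutes` is a hypothetical name used only here.

```python
from datetime import time

def to_minutes(t):
    """Convert a time-like value to minutes since midnight; None if it has no time fields."""
    try:
        return t.hour * 60 + t.minute + t.second / 60
    except AttributeError:
        # blanks, strings and other non-time values end up here
        return None

print(to_minutes(time(1, 30, 30)))  # 90.5
print(to_minutes('not a time'))     # None
```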

In [None]:
# Write a function to convert the timestamp field into a number of minutes (a float, as seconds become fractions of a minute)
def convert(x):
    try:
        return (x.minute + x.hour * 60 + x.second / 60)
    except AttributeError:
        # values with no hour / minute / second attributes (e.g. blanks) are returned as None
        return None

# apply this to the timestamp field, 'Travel Time', to get back a new field, expressed in minutes
Passengers['TT_minutes'] = Passengers['Travel Time'].apply(convert)

# Using the key field of the Trip ID, we copy the fields of interest from the Trips DataFrame onto the Passengers DataFrame. 
Trips = Trips.set_index('Trip ID')
Passengers = Passengers.set_index('Trip ID')
for i in ['Route Description','Start Coordinate', 'End Coordinate', 'Revenue', 'Start Time', 'Total Passengers', 'Distance']:
    Passengers['Trip %s' % i] = Trips[i]
        
# We reset the index to release the Trip ID after this process has been completed. 
Passengers = Passengers.reset_index()

### Match on useful fields from RouteMaster frame
Similarly, before we get started in earnest, we want to consolidate all of the useful fields into a single DataFrame that can serve as the basis of analysis for the rest of the script.

In [None]:
# To differentiate between the origin of the passenger and the origin of the trip itself,
# we generate a new column 'Trip Origin' and a new column 'Trip Destination'.
Passengers = Passengers.set_index('Trip Route Description')
Passengers['Trip Origin'] = RouteMaster['ORIGIN']
Passengers['Trip Destination'] = RouteMaster['DESTINATION']
Passengers = Passengers.reset_index()

### Error Handling: Remove Passengers where travel time, fare or trip distance values are erroneous

The thresholds can of course be adjusted. I treated any journey of 1,000 minutes or more as too long, and discarded any passenger record with no fare or with a fare of 1,000 rand or more. A distance filter was also added; it is commented out below and can be un-commented if needed.
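One way to make those thresholds easy to adjust is to pass them in as parameters. The sketch below uses invented rows; `drop_implausible`, `MAX_MINUTES` and `MAX_FARE` are hypothetical names, not part of the original script.

```python
import pandas as pd

# Hypothetical constants; the values mirror the thresholds used in this notebook
MAX_MINUTES = 1000
MAX_FARE = 1000

def drop_implausible(df, max_minutes=MAX_MINUTES, max_fare=MAX_FARE):
    """Keep rows whose travel time and fare lie strictly between 0 and the given caps."""
    ok_time = (df['TT_minutes'] > 0) & (df['TT_minutes'] < max_minutes)
    ok_fare = (df['Fare'] > 0) & (df['Fare'] < max_fare)
    return df.loc[ok_time & ok_fare]

# Invented rows: only the first one passes both filters
demo = pd.DataFrame({
    'TT_minutes': [35.0, 0.0, 1500.0, 20.0],
    'Fare': [12.0, 10.0, 15.0, 0.0],
})
kept = drop_implausible(demo)
print(kept)
```

Loosening a cap is then a one-argument change, e.g. `drop_implausible(demo, max_minutes=2000)`.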

In [None]:
Passengers = Passengers.loc[(Passengers['TT_minutes'] < 1000) & (Passengers['TT_minutes'] > 0)]
Passengers = Passengers.loc[(Passengers['Fare'] < 1000) & (Passengers['Fare'] > 0)]
#Passengers = Passengers.loc[(Passengers['Trip Distance'] < 1000) & (Passengers['Trip Distance'] > 0)]

### Add fare per minute / unit distance

In [None]:
Passengers['Fare_per_minute'] = Passengers['Fare'] / Passengers['TT_minutes']
Passengers['Fare_per_mile'] = Passengers['Fare'] / Passengers['Trip Distance']

### Passengers: Add on ward details for boarding / alighting

In [None]:
# here we import the ward shapefile. The unit of analysis we want to summarize to is the ward, a city-level administrative unit. 
shp_pth = r'C:\Users\charl\Documents\GOST\South Africa\Connectivity\Fwd__SAL\1_Residential_SAL'
ward_shp = gpd.read_file(os.path.join(shp_pth, 'wards_4326.shp'))
ward_shp = ward_shp[['SL_WARD_KE','SL_SUB_CNC','geometry']]

# We generate a new unique passenger ID field from the current DataFrame index
Passengers['P_ID'] = Passengers.index

# we want to know the ward of boarding and ward of alighting for each passenger. 
# so, we create a lightweight dataframe with just the essentials (a copy, so that adding columns does not touch Passengers):
Bind_P = Passengers[['P_ID','Boarding Location','Alighting Location']].copy()

def convert_to_point(x):
    l = x.split(', ')
    return Point(float(l[0]), float(l[1]))
    
Bind_P['Boarding Point'] = Bind_P['Boarding Location'].apply(lambda x: convert_to_point(x))
Bind_P['Alighting Point'] = Bind_P['Alighting Location'].apply(lambda x: convert_to_point(x))

# make it a GeoDataFrame...
Q = gpd.GeoDataFrame(Bind_P, crs = ward_shp.crs, geometry = 'Boarding Point')

# and then spatially join this to the ward shapefile.
# this is fairly quick as it is a point to polygon join, but could be made faster via a spatial index if necessary
boarding_join = gpd.sjoin(Q, ward_shp, how = 'left')

# we drop the geometry info post-join to just retain the passenger ID, and the boarding ward's identifying info
boarding_join = boarding_join[['P_ID','SL_WARD_KE','SL_SUB_CNC']]

# rename for visibility's sake
boarding_join.columns = ['P_ID','Board_WARD_KE','Board_SUB_CNC']

# set unique passenger ID as index, and wait. Leave to cool for 20s. 
boarding_join = boarding_join.set_index('P_ID')

# we now repeat the same process, but for alighting. Note the tactical choice of geometry field in the Q2 line:
Q2 = gpd.GeoDataFrame(Bind_P, crs = ward_shp.crs, geometry = 'Alighting Point')
alighting_join = gpd.sjoin(Q2, ward_shp, how = 'left')
alighting_join = alighting_join[['P_ID','SL_WARD_KE','SL_SUB_CNC']]
alighting_join.columns = ['P_ID','Alight_WARD_KE','Alight_SUB_CNC']
alighting_join = alighting_join.set_index('P_ID')

# Now, we have two super light DFs - the alighting_join, and the boarding join - which we match back on to the main Passenger DF:
Passengers = Passengers.set_index('P_ID')
Passengers['Alight_WARD_KE'] = alighting_join['Alight_WARD_KE']
Passengers['Alight_SUB_CNC'] = alighting_join['Alight_SUB_CNC']
Passengers['Board_WARD_KE'] = boarding_join['Board_WARD_KE']
Passengers['Board_SUB_CNC'] = boarding_join['Board_SUB_CNC']

# finish up by resetting the index. Nice. 
Passengers = Passengers.reset_index()

### Passengers: Fare Summary Output Tables
Until now, we have concerned ourselves with data preparation and manipulation; this is the first bit of analysis we do. 
 

We generate a number of different tables, summarized by various demographic factors, e.g. ethnicity, age, gender, etc. 

Pandas' groupby functionality allows summarizing across multiple dimensions at once, so we can compare stats for elderly white men against young women of color if we so choose.

Scroll to the right to see how we group each table - that is the key bit here.
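To see what grouping on one key versus several does, here is a small sketch on invented data (the column names match the notebook, the values do not):

```python
import pandas as pd

# Invented passengers, for illustration only
demo = pd.DataFrame({
    'Gender': ['M', 'M', 'F', 'F', 'F'],
    'Age Group': ['YOUNG', 'OLD', 'YOUNG', 'YOUNG', 'OLD'],
    'Fare': [10.0, 14.0, 8.0, 12.0, 9.0],
})

# One grouping key: one row per gender
by_gender = demo.groupby('Gender')['Fare'].median()

# Two grouping keys: one row per (gender, age group) combination
by_gender_age = demo.groupby(['Gender', 'Age Group'])['Fare'].median()

print(by_gender)
print(by_gender_age)
```

Adding a key to the list splits every existing group further, so the tables get longer and each median rests on fewer passengers.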

In [None]:
# numeric_only = True limits the median to the fare / time columns; newer pandas versions can raise an error on the text columns otherwise
ethnicity_table = Passengers[['Ethnicity', 'Age Group', 'Gender', 'Fare', 'Fare_per_minute', 'Fare_per_mile', 'TT_minutes']].groupby(['Ethnicity']).median(numeric_only = True)
age_table = Passengers[['Ethnicity', 'Age Group', 'Gender', 'Fare', 'Fare_per_minute', 'Fare_per_mile', 'TT_minutes']].groupby(['Age Group']).median(numeric_only = True)
gender_table = Passengers[['Ethnicity', 'Age Group', 'Gender', 'Fare', 'Fare_per_minute', 'Fare_per_mile', 'TT_minutes']].groupby(['Gender']).median(numeric_only = True)
gender_ethnicity_table = Passengers[['Ethnicity', 'Age Group', 'Gender', 'Fare', 'Fare_per_minute', 'Fare_per_mile', 'TT_minutes']].groupby(['Gender','Ethnicity']).median(numeric_only = True)
age_ethnicity_table = Passengers[['Ethnicity', 'Age Group', 'Gender', 'Fare', 'Fare_per_minute', 'Fare_per_mile', 'TT_minutes']].groupby(['Age Group','Ethnicity']).median(numeric_only = True)
full_table = Passengers[['Ethnicity', 'Age Group', 'Gender', 'Fare', 'Fare_per_minute', 'Fare_per_mile', 'TT_minutes']].groupby(['Gender','Age Group','Ethnicity']).median(numeric_only = True)

# we create a bag of our summary tables for ease of keeping them together. 
tables = [ethnicity_table, age_table, gender_table, gender_ethnicity_table, age_ethnicity_table, full_table]

### Trip level analytics
Having prepared our data and created some basic summary analytics, we can now do some trip-level analytics

In [None]:
# generate list of unique trips, via python's built in set() function
unique_trips = list(set(Passengers['Trip Route Description']))

# Here, I subset this for just the first 10 in the list
Selected_trips = unique_trips[:10]

# open a blank list to append results to
trip_tables = []

# for each trip...
for t in Selected_trips:
    
    # pick out all the passengers in the table with the same trip route description 
    Q = Passengers.loc[Passengers['Trip Route Description'].isin([t])]
    
    # NOTE - these are the official race brackets as determined by the South African Govt - not my choice
    ethnicities = ['BLACK','INDIAN','OTHER','WHITE','COLOURED']
    
    genders = ['M','F']
    ages = ['YOUNG','MIDDLE','OLD']
    
    # we take a copy of the relevant passenger data for this trip and assign it to 's'.
    s = Q.copy()
    
    # here we are creating a series of flag columns for each of our demographic indicators. 
    # values are either 1 (true) or 0 (false) for each demographic option. These are mutually exclusive within their lists. 
    # (each column is assigned in one go; chained assignment via s[col].loc[...] may not stick in newer pandas)
    for gend in genders:
        s[gend] = (s['Gender'] == gend).astype(int)
    for ethn in ethnicities:
        s[ethn] = (s['Ethnicity'] == ethn).astype(int)
    for age in ages:
        s[age] = (s['Age Group'] == age).astype(int)
        
    # We also create a dummy total column, which is 1 for all persons (everyone either M or F in this dataset)
    s['Total'] = s['M'] + s['F']
    col_keep = []
    
    # we add an 'hour of the day' column for the hour in which the trip is started. 
    s['Trip_st_hour'] = s['Trip Start Time'].apply(lambda x: x.hour)
    
    # we group records by this value - i.e. the time at which people were starting the trip. 
    # now, each 'row' represents many people. 
    # the aggregation we are applying is the sum - so we have the 'count' of people in each bracket undertaking 
    # the trip at this time (remember we generated these 1 / 0 cols above). We sum only the flag columns.
    s = s.groupby('Trip_st_hour')[[*genders, *ethnicities, *ages, 'Total']].sum()
    
    # we don't have to sort (groupby already orders the hours) but it is pleasing to do so
    s = s.sort_index()
    
    # we create fractional columns that allow us to easily see the age, gender and ethnicity breakdown of passengers 
    # for each hour the trip occurred in the master Passenger DataFrame. 
    for ethn in ethnicities:
        s['% {}'.format(ethn)] = s[ethn] / (s['Total'])
        col_keep.append('% {}'.format(ethn))
    for gend in genders:
        s['% {}'.format(gend)] = s[gend] / (s['Total'])
        col_keep.append('% {}'.format(gend))
    for age in ages:
        s['% {}'.format(age)] = s[age] / (s['Total'])
        col_keep.append('% {}'.format(age)) 
        
    # We subset the DF to the columns we are interested in using col_keep
    s = s[['M','F',*col_keep,]]
    
    # we append the results DataFrame to trip_tables
    trip_tables.append(s)

### Route Summary: Initial Generation
Now, instead of looking at individual trips, we move up to the route level - and analyze the data at the level of routes (multiple trips).

We perform much the same analysis as the above block here
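For comparison, the per-route shares can also be produced with `pd.crosstab`. This is not the approach the notebook takes (it builds 1 / 0 flag columns and divides by a total); it is an alternative sketch on invented data that should give the same kind of fractions.

```python
import pandas as pd

# Invented passengers on two made-up routes, 'A' and 'B'
demo = pd.DataFrame({
    'Trip Route Description': ['A', 'A', 'A', 'B', 'B'],
    'Gender': ['M', 'F', 'F', 'M', 'M'],
})

# Share of each gender per route; normalize='index' makes each row sum to 1
gender_share = pd.crosstab(demo['Trip Route Description'], demo['Gender'], normalize='index')
print(gender_share)
```

The flag-column approach used below has the advantage of keeping the raw counts alongside the shares.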

In [None]:
import warnings
warnings.filterwarnings('ignore')

# we generate a route summary by grouping Passengers by Trip Description, again. We take the median values and sort by fare per minutes
route_summary = Passengers[['Trip Route Description', 'Fare', 'Fare_per_minute', 'Fare_per_mile', 'TT_minutes']].groupby('Trip Route Description').median().sort_values(by = 'Fare_per_minute', ascending = False)
route_summary['Origin'] = RouteMaster['ORIGIN']
route_summary['Destination'] = RouteMaster['DESTINATION']

# we set up a slice of the passengers DF (a copy, so the flag columns below don't touch Passengers)
s = Passengers[['Trip Route Description', 'Gender','Ethnicity','Age Group']].copy()

# Re-declare the demographic buckets discussed above
ethnicities = ['BLACK','INDIAN','OTHER','WHITE','COLOURED']
genders = ['M','F']
ages = ['YOUNG','MIDDLE','OLD']

# again, set up flag columns for each option (assigned in one go, avoiding chained assignment)
for gend in genders:
    s[gend] = (s['Gender'] == gend).astype(int)
for ethn in ethnicities:
    s[ethn] = (s['Ethnicity'] == ethn).astype(int)
for age in ages:
    s[age] = (s['Age Group'] == age).astype(int)

# sum the flag columns again, by Trip this time, NOT by hour. Hence, one row = one route    
s = s.groupby('Trip Route Description')[[*genders, *ethnicities, *ages]].sum()
s['Total'] = s['M'] + s['F']

# set up the fractional columns for our demographic info
col_keep = []
for ethn in ethnicities:
    s['% {}'.format(ethn)] = s[ethn] / (s['Total'])
    col_keep.append('% {}'.format(ethn))
for gend in genders:
    s['% {}'.format(gend)] = s[gend] / (s['Total'])
    col_keep.append('% {}'.format(gend))
for age in ages:
    s['% {}'.format(age)] = s[age] / (s['Total'])
    col_keep.append('% {}'.format(age))
    
# subset to just the columns we are interested in
s = s[col_keep]

# build a single route summary DataFrame with all routes in it (contrast to above block - one DF per route, by hour)
for i in col_keep:
    route_summary[i] = s[i]

### Route Summary: Match on Route WKT

In [None]:
# our route geometries are expressed as 'kmls' (groan). We modify our fiona instance in this script by adding KML drivers. 
# Fiona is the underlying library that GeoPandas uses to import KMLs. Unless we make this adjustment, it will not read in KMLs
import geopandas as gpd
import fiona
fiona.drvsupport.supported_drivers['kml'] = 'rw' # enable KML support which is disabled by default
fiona.drvsupport.supported_drivers['KML'] = 'rw' # enable KML support which is disabled by default

# we pick up our route summary DF, and make a list of all of the unique trips
route_summary = route_summary.reset_index()
set_of_routes = list(set(route_summary['Trip Route Description']))

# we list the folder with the kmls in it, which puts the file names in the 'files' list
root = r'C:\Users\charl\Documents\GOST\South Africa\Connectivity\Submission_15_12_2018\COCT_Dec'
files = os.listdir(root)

# now, we walk through the files, and check whether the filename corresponds to anything in the Trip Route Description bucket
# if it matches, we load it as a geopandas dataframe, and then add the geometry specifically to a dictionary, geom_dict
gathered = []
geom_dict = {}

# go through files...
for f in files:
    
    # split on underscore, take the first bit (this should be a string which matches a route_description object)
    route = f.split('_')[0]
    
    # check to see if it is, and that we haven't already gathered it
    if route in set_of_routes and route not in gathered:
        
        # if it's a kml
        if f.split('_')[2] == 'trip.kml':
            # read it in
            f = gpd.read_file(os.path.join(root, f))
            
            # add it to our geom_dict, with key being the route identifier (important)
            geom_dict[route] = f.geometry.iloc[0]
            
            # append it also to the gathered list - we don't want multiple geometries recorded for each route ID
            gathered.append(route)

# we convert this dictionary of geometries to a DataFrame, indexed by the dictionary keys...which are our route descriptions....
geom_df = pd.DataFrame({'geometry':list(geom_dict.values())}, index = list(geom_dict.keys()))

# allowing us to then match this geometry onto the route_summary df via the Trip Route Description field!
route_summary = route_summary.set_index('Trip Route Description')
route_summary['WKT'] = geom_df['geometry']

### Route Summary: Match on Start / End Ward

In [None]:
# we can use the kml geometry to also match on the route's start and end ward, using an identical process to before. 
route_summary = route_summary.reset_index()

# drop routes for which no kml geometry was found (their WKT is NaN); copy so new columns can be added safely
route_summary = route_summary.loc[route_summary['WKT'].notna()].copy()

# we make an origin point out of the first coordinate in the string
route_summary['Origin_Point'] = route_summary.WKT.apply(lambda x: Point((x.coords[0])))

# and a destination point out of the last coordinate in the string
route_summary['Destination_Point'] = route_summary.WKT.apply(lambda x: Point((x.coords[-1])))

# subset the route summary df
trip_origin_df = route_summary[['Trip Route Description','Origin_Point']]

# generate a GDF where the geometry is the origin point
trip_origin_df = gpd.GeoDataFrame(trip_origin_df, geometry = 'Origin_Point', crs = ward_shp.crs)

# perform spatial join
trip_origin_join = gpd.sjoin(trip_origin_df, ward_shp, how = 'left')

# prep for joining based on Trip Route Description
trip_origin_join = trip_origin_join.set_index('Trip Route Description')

# same again, using Destination Point as the geometry
trip_dest_df = route_summary[['Trip Route Description','Destination_Point']]
trip_dest_df = gpd.GeoDataFrame(trip_dest_df, geometry = 'Destination_Point', crs = ward_shp.crs)
trip_dest_join = gpd.sjoin(trip_dest_df, ward_shp, how = 'left')
trip_dest_join = trip_dest_join.set_index('Trip Route Description')

# now, join on the ward info using the key field, Trip Route Description
route_summary = route_summary.set_index('Trip Route Description')
route_summary['dest_ward_KE'] = trip_dest_join['SL_WARD_KE']
route_summary['dest_SUB_CNC'] = trip_dest_join['SL_SUB_CNC']
route_summary['origin_ward_KE'] = trip_origin_join['SL_WARD_KE']
route_summary['origin_SUB_CNC'] = trip_origin_join['SL_SUB_CNC']

### Time of Day: Generate Analytics
This is very similar to the trip-level analytics, but here we want to analyze the behaviour and characteristics of all passengers, across all journeys - with the only differentiator being the hour of travel. 
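The cells below compute medians and counts in two separate DataFrames and then join them. As an alternative sketch (not the notebook's method), pandas' named aggregation can do both in a single `groupby`. The data is invented, and the sketch assumes 'Boarding Time' is a datetime column.

```python
import pandas as pd

# Invented boardings; dates and fares are placeholders
demo = pd.DataFrame({
    'Boarding Time': pd.to_datetime([
        '2018-12-01 07:05', '2018-12-01 07:40', '2018-12-01 08:15', '2018-12-01 08:50',
    ]),
    'Fare': [10.0, 14.0, 9.0, 11.0],
    'Gender': ['M', 'F', 'F', 'F'],
})

demo['Boarding Hour'] = demo['Boarding Time'].dt.hour
demo['F'] = (demo['Gender'] == 'F').astype(int)

# One pass: a median for the fare, a sum for the flag column, a row count for the total
hourly = demo.groupby('Boarding Hour').agg(
    median_fare=('Fare', 'median'),
    women=('F', 'sum'),
    passengers=('Gender', 'size'),
)
print(hourly)
```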

In [None]:
# Once again, create a fresh copy of the all-important Passengers DF
s = Passengers.copy()

# generate a boarding hour field, which we will use to summarize the data later
s['Boarding Hour'] = s['Boarding Time'].apply(lambda x: x.hour)

# Using this familiar process, set up demographic flag columns (see above for more detailed walkthrough of same process)
col_keep = []
for gend in genders:
    s[gend] = (s['Gender'] == gend).astype(int)
    col_keep.append(gend)
for ethn in ethnicities:
    s[ethn] = (s['Ethnicity'] == ethn).astype(int)
    col_keep.append(ethn)
for age in ages:
    s[age] = (s['Age Group'] == age).astype(int)
    col_keep.append(age)

In [None]:
# We want to do two processes here - average for values within each hour like fare, and count people. 
# So, we set up two DFs - one for averaging and one for counting (both are subsets of the passengers DF)
data_averaging = s[['Boarding Hour','Fare','Fare_per_mile','Fare_per_minute','TT_minutes']]
data_counting = s[['Boarding Hour',*col_keep]]

# Here we actually apply the aggregation - .median() for data_averaging, .sum() for data_counting
data_averaging = data_averaging.groupby('Boarding Hour').median()
data_counting = data_counting.groupby('Boarding Hour').sum()

# now, we join them on to each other
data = data_averaging.join(data_counting)

# adjust the column naming a little
data.columns = ['Median Fare','Median Fare per Mile', 'Median Fare per Minute', 'Median Trip Time (minutes)', *col_keep]

In [None]:
# Here we take a copy of the above data DF, add those classic % demographic columns
s = data.copy()
s['Total'] = s['M'] + s['F']
for ethn in ethnicities:
    s['% {}'.format(ethn)] = s[ethn] / (s['Total'])
for gend in genders:
    s['% {}'.format(gend)] = s[gend] / (s['Total'])
for age in ages:
    s['% {}'.format(age)] = s[age] / (s['Total'])
time_of_day = s.copy()

### Write Out
Here, the client was interested in results in Excel format. As such, we send some of the data tables generated above to Excel. Note the use of the sheet_name parameter, which allows us to generate a single Excel workbook with multiple sheets. 
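One caveat when sheet names come from data, as with the route descriptions used below: Excel limits sheet names to 31 characters and rejects the characters `[ ] : * ? / \`. A minimal sketch of a clean-up helper follows; `safe_sheet_name` is a hypothetical function, not part of the original script, and the route name in the example is made up.

```python
import re

def safe_sheet_name(name, fallback='Sheet'):
    """Return a version of `name` that Excel should accept as a worksheet name."""
    # Replace the characters Excel forbids in sheet names: [ ] : * ? / \
    cleaned = re.sub(r'[\[\]:*?/\\]', '_', str(name)).strip()
    # Sheet names are limited to 31 characters; fall back if nothing is left
    return cleaned[:31] or fallback

print(safe_sheet_name('Khayelitsha - Cape Town CBD via N2 / Nyanga'))
```

Truncation can make two long names collide, so a real workbook might also need a numeric suffix to keep sheet names unique.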

In [None]:
# We set up an excel writer object first to allow us to write multiple objects to the same file
with pd.ExcelWriter(os.path.join(p, 'summary.xlsx')) as writer:
    route_summary.to_excel(writer, sheet_name = 'RouteSummary')
    time_of_day.to_excel(writer, sheet_name = 'HourlyAnalysis')
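    
    # the demographic summary tables are numbered table_1, table_2, ...
    for counter, table in enumerate(tables, start = 1):
        table.to_excel(writer, sheet_name = 'table_%s' % counter)
    
    # the trip tables are named after their route description
    for trip_name, table in zip(Selected_trips, trip_tables):
        table.to_excel(writer, sheet_name = '%s' % trip_name)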

### Analysis of Khayelitsha
Whilst the preceding analysis has covered journeys across the city, we are now going to geographically bound our activities to just the Khayelitsha wards. 
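The end product of this section is a net flow for each pair of wards: people arriving into Khayelitsha minus people leaving it. The arithmetic can be sketched in plain Python on invented journeys. The ward labels are placeholders, and unlike the notebook this simplified version ignores journeys that start and end inside Khayelitsha.

```python
from collections import Counter

# Invented (boarding ward, alighting ward) pairs; 'K1' stands in for a Khayelitsha ward
journeys = [('K1', 'W9'), ('K1', 'W9'), ('W9', 'K1'), ('K1', 'W7'), ('W7', 'K1'), ('W7', 'K1')]
khayelitsha = {'K1'}

inflow, outflow = Counter(), Counter()
for origin, destination in journeys:
    if destination in khayelitsha and origin not in khayelitsha:
        # keyed as (outside ward, Khayelitsha ward)
        inflow[(origin, destination)] += 1
    elif origin in khayelitsha and destination not in khayelitsha:
        # same key order as inflow, so the two directions of a pair line up
        outflow[(destination, origin)] += 1

# Missing pairs count as zero: no one made that journey
net = {pair: inflow[pair] - outflow[pair] for pair in set(inflow) | set(outflow)}
print(net)
```

A positive net value means more people arrived into the Khayelitsha ward from that outside ward than left for it.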

In [None]:
# Bring in the Khayelitsha shapefile as a GDF
p_s = r'C:\Users\charl\Documents\GOST\South Africa\Connectivity\Fwd__SAL'
khay = gpd.read_file(os.path.join(p_s, 'Khayelitsha.shp'))

# join up the khayelitsha geometries into a single geometry object by using unary_union
khay_shp = unary_union(khay.geometry)

# make a GDF of this shapely geometry object
khay_shp_gdf = gpd.GeoDataFrame({'geometry':khay_shp}, crs = {'init':'epsg:4326'}, index = [1], geometry = 'geometry')

# send the single-geometry GDF to file
khay_shp_gdf.to_file(os.path.join(p_s, 'Khayelitsha_area.shp'), driver = 'ESRI Shapefile')

# project to metres
khay_shp_gdf = khay_shp_gdf.to_crs({'init':'epsg:22234'})
khay_shp_gdf.to_file(os.path.join(p_s, 'Khayelitsha_area_222234.shp'), driver = 'ESRI Shapefile')

# read in whole of Cape Town wards, project, save down once again
shp_pth = r'C:\Users\charl\Documents\GOST\South Africa\Connectivity\Fwd__SAL\1_Residential_SAL'
ward_shp = gpd.read_file(os.path.join(shp_pth, 'wards_4326.shp'))
ward_shp = ward_shp.to_crs({'init':'epsg:22234'})
ward_shp.to_file(os.path.join(p_s, 'wards_222234.shp'), driver = 'ESRI Shapefile')

In [None]:
# make geometry objects of the boarding and alighting points
Passengers['Boarding Point'] = Passengers['Boarding Location'].apply(lambda x: convert_to_point(x))
Passengers['Alighting Point'] = Passengers['Alighting Location'].apply(lambda x: convert_to_point(x))

In [None]:
# Here, we generate a Passenger dataframe of JUST the Passengers whose boarding points intersect the Khayelitsha shape
khay_p_boarding = gpd.GeoDataFrame(Passengers, crs = khay.crs, geometry = 'Boarding Point')
khay_p_boarding = khay_p_boarding.loc[khay_p_boarding.intersects(khay_shp)].copy()

# we add again the boarding time hour for summary purposes
khay_p_boarding['Boarding Time hour'] = khay_p_boarding['Boarding Time'].apply(lambda x: x.hour)

In [None]:
# for testing purposes I send this, projected, to .csv
khay_p_boarding = khay_p_boarding.to_crs({'init':'epsg:22234'})
khay_p_boarding.to_csv(os.path.join(p, 'khay.csv'))

In [None]:
# We can subset this set of passengers by gender very easily. 
# these files allow us to easily identify differences in spatial location of boarding points for men and women

# men
khay_p_boarding_men = khay_p_boarding.loc[khay_p_boarding['Gender'] == 'M']
khay_p_boarding_men.to_csv(os.path.join(p, 'khay_boarding_men.csv'))

# women
khay_p_boarding_women = khay_p_boarding.loc[khay_p_boarding['Gender'] == 'F']
khay_p_boarding_women.to_csv(os.path.join(p, 'khay_boarding_women.csv'))

In [None]:
# We do precisely the same thing, this time using alighting point (people arriving into Khayelitsha) as the geometry of interest
khay_p_alighting = gpd.GeoDataFrame(Passengers, crs = khay.crs, geometry = 'Alighting Point')
khay_p_alighting['Alighting Time hour'] = khay_p_alighting['Alighting Time'].apply(lambda x: x.hour)
khay_p_alighting = khay_p_alighting.loc[khay_p_alighting.intersects(khay_shp)]

In [None]:
# again, to see if there are differences in disembarkment point, we send these to .csv
khay_p_alighting = khay_p_alighting.to_crs({'init':'epsg:22234'})
khay_p_alighting_men = khay_p_alighting.loc[khay_p_alighting['Gender'] == 'M']
khay_p_alighting_men.to_csv(os.path.join(p, 'khay_alighting_men.csv'))
khay_p_alighting_women = khay_p_alighting.loc[khay_p_alighting['Gender'] == 'F']
khay_p_alighting_women.to_csv(os.path.join(p, 'khay_alighting_women.csv'))

In [None]:
# read in the ward shapefile
shp_pth = r'C:\Users\charl\Documents\GOST\South Africa\Connectivity\Fwd__SAL\1_Residential_SAL'
ward_shp = gpd.read_file(os.path.join(shp_pth, 'wards_4326.shp'))

# create centroids from ward geometries
ward_shp['centroid'] = ward_shp.geometry.centroid

# For the purposes of visualizing trips between wards, we create a DF of their centroids
# These can later be made into lines between centroids
match_ward = ward_shp[['SL_WARD_KE','centroid']].set_index('SL_WARD_KE')

# adjust a specific centroid which lies in water (part of ward offshore)
# .at sets the single cell directly, avoiding chained assignment
match_ward.at[5177473, 'centroid'] = Point(18.3918, -33.9201)

In [None]:
# From manual inspection of the file in QGIS, I know these are the wards that form Khayelitsha
khayelitsha_wards = [5177561,5177559, 5177557, 5177555, 5177553, 5177551, 5177549, 5177547, 5177545, 5177543]

# As we want to observe the OD matrix at different times of day, I create a times dictionary with associated hour bracketing. 
times = {'morning':[7,8,9],'midday':[11,12,13],'evening':[16,17,18,19]}

# We now repeat a process for both arrivals into and departures from Khayelitsha. 
# We want one DataFrame for each time of day, so we iterate through 'times'
outs = []
for t in times:
    # pick out our current 'hours' for this time of day
    HRS = times[t]
    binz = []
    # for each ward in Khayelitsha
    for ward in khayelitsha_wards:
        
        # grab all the passengers boarding in this ward at this time of day
        Q = khay_p_boarding.copy()
        Q = Q.loc[Q['Boarding Time hour'].isin(HRS)]
        q = Q.loc[Q.Board_WARD_KE == ward]
        
        # count the number of people going to each other ward
        dest = q.Alight_WARD_KE.value_counts().to_frame()
        dest = dest.reset_index()
        dest.columns = ['D_ID','count']
        
        # set an origin ward ID as the current ward
        dest['O_ID'] = ward
        binz.append(dest)
        
    # summarize the dataframe at the Khayelitsha level
    df2 = pd.concat(binz)
    df2['time'] = str(t)
    outs.append(df2)

# same process, using the passengers alighting in Khayelitsha
ins = []
for t in times:
    HRS = times[t]
    binz = []
    for ward in khayelitsha_wards:
        Q = khay_p_alighting.copy()
        Q = Q.loc[Q['Alighting Time hour'].isin(HRS)]
        q = Q.loc[Q.Alight_WARD_KE == ward]
        orig = q.Board_WARD_KE.value_counts().to_frame()
        orig = orig.reset_index()
        orig.columns = ['O_ID','count']
        orig['D_ID'] = ward
        binz.append(orig)
    df2 = pd.concat(binz)
    df2['time'] = str(t)
    ins.append(df2)

# Now, we combine the in and out dataframes to work out the net inflows and outflows to each ward from Khayelitsha
nets = []

# we have to do this for each frame in the ins and outs bucket (i.e. for each time of day)
for i in range(0, len(ins)):
    # A is a copy of the inflows to Khayelitsha
    A = ins[i].copy()
    
    # define a new column which describes the movement (origin and destination ward IDs)
    A['combo'] = A['O_ID'].astype(str) + ' | ' + A['D_ID'].astype(str)
    A = A.rename({'count':'inflow'}, axis = 1)
    A = A[['inflow','combo']].set_index('combo')
    
    # B is a copy of the outflows from Khayelitsha
    B = outs[i].copy()
    
    # define a new column which describes the movement (origin and destination ward IDs)
    B['combo'] = B['D_ID'].astype(str) + ' | ' + B['O_ID'].astype(str)
    B = B.rename({'count':'outflow'}, axis = 1)
    B = B[['outflow','combo']].set_index('combo')
    
    # join the two together, to be able to work out the net
    C = A.join(B, how = 'outer')
    
    # any missing values are filled with 0 - no one did that trip
    C['inflow'] = C['inflow'].fillna(0)
    C['outflow'] = C['outflow'].fillna(0)
    
    # work out the net flow between district pairs
    C['net'] = C['inflow'] - C['outflow']
    
    # sort values by the net movement
    C = C.sort_values(by = 'net', ascending = False)
    
    # add on the time of day
    C['time'] = list(times.keys())[i]
    C = C.reset_index()
    
    # split the code on the pipe to get the origin and destination ward IDs back
    C['O_ID'] = C['combo'].apply(lambda x: x.split(' | ')[0]).astype(float)
    C['D_ID'] = C['combo'].apply(lambda x: x.split(' | ')[1]).astype(float)
    
    # append the resultant dataframe to the bucket of netted-out DFs
    nets.append(C)

# we want both combined and disaggregated results, so we go through separate processes for each
for X in ['combined','disagg']:
    
    # disaggregated process
    if X == 'disagg':
        
        # for each time of day
        for i in range(0, len(nets)):
            
            # get current dataFrame from nets bucket
            df2 = nets[i]
            
            # add time of day
            time_of_day = list(times.keys())[i]
            
            # take a copy
            df = df2.copy()
            
            # match on the centroids of the destination ward
            df = df.set_index('D_ID')
            df['D_Point'] = match_ward['centroid']
            
            # match on the centroid of the origin ward
            df = df.reset_index().set_index('O_ID')
            df['O_Point'] = match_ward['centroid']
            
            # reset index
            df = df.reset_index()
            
            # make a human intelligible string
            df['journey'] = df['O_ID'].astype(str) + ' to ' + df['D_ID'].astype(str)
            
            # build a lineString which describes the journey's geometry
            df['WKT'] = df.apply(lambda x: LineString([x.O_Point,x.D_Point]), axis = 1)
            
            # send to csv - all done!
            df.to_csv(os.path.join(p, 'Khayelitsha', '{}_{}.csv'.format(time_of_day, X)))
    
    # for the combined one, the difference is we want to aggregate by origin ID. 
    # the process is otherwise identical
    # combined process
    elif X == 'combined':
        for i in range(0, len(nets)):
            time_of_day = list(times.keys())[i]
            df2 = nets[i]
            
            # these are the only lines which are different vs. above:
            # group by origin ID and sum the net flow (only 'net', so that ward IDs and time labels are not added up)
            df2 = df2.groupby('O_ID')[['net']].sum().reset_index()
            df2['time'] = time_of_day
            
            # every destination is now Khayelitsha as a whole (this label is a placeholder, not a ward ID)
            df2['D_ID'] = 'Khayelitsha'
            
            # resume above process, using the centroid of the whole Khayelitsha shape as the destination point
            df = df2.copy()
            khay_centroid = khay_shp.centroid
            df['D_Point'] = [khay_centroid for _ in range(len(df))]
            df = df.set_index('O_ID')
            df['O_Point'] = match_ward['centroid']
            df = df.reset_index()
            df['journey'] = df['O_ID'].astype(str) + ' to ' + df['D_ID'].astype(str)
            df['WKT'] = df.apply(lambda x: LineString([x.O_Point,x.D_Point]), axis = 1)
            df.to_csv(os.path.join(p, 'Khayelitsha', '{}_{}.csv'.format(time_of_day, X)))