### Step 1: Building Attribution
In this script, we take a building footprint layer, provided by Digital Globe, and attach a range of standardized characteristics to each building footprint polygon. 

These characteristics include properties such as area, count of buildings within 25m, 50m and 100m, and the average properties of the closest 5 and 25 buildings. 
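
As a rough illustration of the neighbourhood counts, the sketch below counts centroids within 25m, 50m and 100m of one building using plain Python. The function name 'count_within' and the sample coordinates are made up for this example; the script itself uses a KD tree for the same job, which scales far better. Note that the count includes the building itself.

```python
import math

def count_within(points, index, radius):
    """Count points within `radius` of points[index], including the point itself."""
    x0, y0 = points[index]
    return sum(1 for (x, y) in points if math.hypot(x - x0, y - y0) <= radius)

# hypothetical centroids, in metres
centroids = [(0.0, 0.0), (10.0, 0.0), (0.0, 40.0), (90.0, 0.0), (200.0, 0.0)]

# counts around the first building for each distance threshold
counts = {r: count_within(centroids, 0, r) for r in (25, 50, 100)}
print(counts)
```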

The theory behind this is that these characteristics, about both the building itself and its immediate neighbours, can be used by a machine learning model to identify slum areas - if some training shapefiles on slums are also provided. 

Import libraries

In [58]:
import pandas as pd
import geopandas as gpd
import sys, os
from scipy import spatial
import numpy as np
from sklearn.neighbors import KDTree
import time
from multiprocessing import Pool
import multiprocessing

Set basic definitions

In [61]:
pth = os.getcwd()
WGS = {'init':'epsg:4326'}
UTM = {'init':'epsg:32629'}
save_thresh = 100000 # save progress every [] rows 
print_thresh = 10000 # print out calculation process every [] rows for each processor

In this block we import the shapefile, ensure it is projected in WGS 84, reproject to a metres-based projection (UTM zone 29N, EPSG:32629), and then add area information.

We also calculate the centroid here, while the data is projected, to ensure that distance-based measures are returned in relevant units (metres).

In [3]:
fil = gpd.read_file(os.path.join(pth, '1243_bamako_building_32629.shp'))
if fil.crs != WGS:
    fil = fil.to_crs(WGS)
fil = fil.to_crs(UTM) 
fil['area'] = fil.area
fil['centroid'] = fil['geometry'].centroid
fil = fil.to_crs(WGS)
fil = fil[['PID','centroid','area']]
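
To show why the projection matters, here is a minimal sketch of a planar area and centroid calculation (the shoelace formula) for a simple polygon without holes. It is only an illustration of the kind of calculation involved, not the geopandas implementation; the function name and the rooftop coordinates are assumptions made for this example. With coordinates in metres the area comes out in square metres, whereas the same arithmetic on degrees would give a meaningless number.

```python
def polygon_area_and_centroid(coords):
    """Planar area and centroid of a simple polygon (no holes).

    coords: list of (x, y) vertices in a metres-based projection,
    without repeating the first vertex at the end.
    """
    twice_area = 0.0
    cx = 0.0
    cy = 0.0
    n = len(coords)
    for i in range(n):
        x0, y0 = coords[i]
        x1, y1 = coords[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    area = twice_area / 2.0  # signed: positive for counter-clockwise vertices
    centroid = (cx / (6.0 * area), cy / (6.0 * area))
    return abs(area), centroid

# hypothetical 10 m x 5 m rooftop, coordinates in metres
rooftop = [(600000.0, 1400000.0), (600010.0, 1400000.0),
           (600010.0, 1400005.0), (600000.0, 1400005.0)]
area, centroid = polygon_area_and_centroid(rooftop)
print(area, centroid)
```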

The first line offers an opportunity to shorten the DataFrame for testing purposes (for example, by slicing 'fil'). Otherwise, this block builds the KDTree from the centroids of the underlying GeoDataFrame. As such, it may take a while to generate, depending on the number of objects. 

In [55]:
short = fil
# key areas by row position, because the KD tree returns positional indices
area_dict = dict(zip(range(len(short)), short['area']))
matrix = list(zip(short.centroid.apply(lambda x: x.x),short.centroid.apply(lambda x: x.y)))
KD_tree = KDTree(matrix)
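
To make the shape of a nearest-neighbour query concrete, the sketch below uses a brute-force search in plain Python as a stand-in for the KD tree. The function 'brute_force_query' and the sample points are hypothetical. The point to notice is that a building queried against a set that contains it comes back as its own nearest neighbour at distance 0, which is why later steps slice from position 1 onwards.

```python
import math

def brute_force_query(points, query, k):
    """Return (distances, indices) of the k points closest to `query`, nearest first."""
    ranked = sorted(
        (math.hypot(px - query[0], py - query[1]), i)
        for i, (px, py) in enumerate(points)
    )[:k]
    distances = [d for d, _ in ranked]
    indices = [i for _, i in ranked]
    return distances, indices

# hypothetical centroids, in metres
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (30.0, 40.0)]

# query the first building against the full set, asking for 3 hits
distances, indices = brute_force_query(points, points[0], k=3)

# drop the first hit, which is the building itself
neighbour_distances = distances[1:]
neighbour_indices = indices[1:]
print(neighbour_distances, neighbour_indices)
```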

This block sets up multiprocessing functionality. It splits up the input DataFrame, short, into one chunk per worker process, with the number of workers set by the 'threads' parameter (by default, the number of available CPU cores). This allows the calculations to be spread across multiple processes easily. 

Users should manually adjust the 'threads' parameter (dtype: int) to avoid taking over all of the available resources on the server.

In [None]:
threads = multiprocessing.cpu_count()  # limit this line if on the JNB to avoid consuming 100% of resources!

d = []

for i in range(1, (threads+1)):
    len_total_df = len(short)
    chunk = int(np.ceil(len_total_df / threads))
    d_f = short[(chunk*(i-1)):(chunk*i)]
    
    processor_input_dict = {
        'df':d_f,
        'thread_no':i,
        'print_thresh':print_thresh,
        'save_thresh':save_thresh
    }
    
    d.append(processor_input_dict)

with Pool(threads) as pool:
        results = pool.map(Main,d,chunksize=1)
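
The chunking arithmetic above can be checked in isolation. The sketch below reproduces the same ceiling-based split as a small helper; 'chunk_bounds' is a name introduced for this example only. It shows that the last chunk is usually shorter than the others, and that a chunk can be empty when there are few rows relative to the number of workers.

```python
import math

def chunk_bounds(n_rows, n_workers):
    """Return the (start, stop) row positions handled by each worker."""
    chunk = math.ceil(n_rows / n_workers)
    bounds = []
    for i in range(1, n_workers + 1):
        # clamp to n_rows, mirroring how a slice past the end is truncated
        start = min(chunk * (i - 1), n_rows)
        stop = min(chunk * i, n_rows)
        bounds.append((start, stop))
    return bounds

bounds = chunk_bounds(10, 4)
print(bounds)
```

Because every row position falls in exactly one (start, stop) range, no building is processed twice or skipped.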

Here we define 'Main' - the function called by each processor in the Pool. This cell must be run before the previous block, which calls it. In each case, it expects a dictionary of passed objects (generated in the previous block). Each process deals with a roughly equal-sized chunk of the original input DataFrame (the last chunk may be shorter). 

In [64]:
# Query individual rooftop objects against KD Tree, calculate statistics
def Main(passed_dict):
    
    # unpack passed dict into local variables for this thread.
    short = passed_dict['df']
    thread_no = passed_dict['thread_no']
    print_thresh = passed_dict['print_thresh']
    save_thresh = passed_dict['save_thresh']
    
    # set up some counters / timings, and the list that collects one result dictionary per building
    t = time.time()
    counter = 1
    bundle = []
    
    # iterate through each row of the passed DataFrame of housing polygons.
    for index, row in short.iterrows():
        
        # identify the x and y coordinates of the house's centroid
        y = row.centroid.y
        x = row.centroid.x
        
        # Query the KD tree for the first 26 objects (1 will be the house itself.)
        # this returns two arrays, nearest first: the distances to the 26 closest objects, and their positional indices. 
        distances, indices = KD_tree.query([(x,y)], k = 26)

        # Distance calculations - closest 5
        # here, we subset the distances frame for the first 5 neighbours, and calculate summary stats
        nearest_5_distances = list(distances[0])[1:6]  # subset / slice
        min_5 = min(nearest_5_distances) # closest neighbour of the 5 closest (min distance to another building)
        max_5 = max(nearest_5_distances) # furthest neighbour of the 5 closest (max distance among them)
        mean_5 = np.mean(nearest_5_distances) # average distance of centroids of 5 nearest neighbours
        median_5 = np.median(nearest_5_distances) # median distance of centroids of 5 nearest neighbours
        dist_5_std = np.std(nearest_5_distances) # standard deviation of centroids of 5 nearest neighbours

        # Distance calculations - closest 25
        # here, we subset the distances frame for the first 25 neighbours, and calculate summary stats
        nearest_25_distances = list(distances[0])[1:]
        min_25 = min(nearest_25_distances)
        max_25 = max(nearest_25_distances)
        mean_25 = np.mean(nearest_25_distances)
        median_25 = np.median(nearest_25_distances)
        dist_25_std = np.std(nearest_25_distances)

        # Areal calculations - closest 5
        # here, instead of the distances frame we generated via the KD tree, we use the area_dict 
        # and query it with the indices from the KD tree step
        indices_5 = list(indices[0])[1:6]
        areas = [area_dict[x] for x in indices_5] 
        area_5_mean = np.mean(areas)  # mean area of 5 nearest neighbours
        area_5_med = np.median(areas)  # median area of 5 nearest neighbours
        area_5_stdev = np.std(areas)   # standard deviation of area of 5 nearest neighbours

        # Areal calculations - closest 25
        # repeat above block for closest 25
        indices_25 = list(indices[0])[1:]
        areas = [area_dict[x] for x in indices_25]
        area_25_mean = np.mean(areas)
        area_25_med = np.median(areas)
        area_25_stdev = np.std(areas)

        # Count
        # here we turn the process on its head, and identify all objects within certain distance thresholds
        # (each count includes the building itself, whose centroid lies at distance 0)
        count_25m = KD_tree.query_radius([(x,y)], r = 25, count_only = True)[0] # count of buildings in 25m radius
        count_50m = KD_tree.query_radius([(x,y)], r = 50, count_only = True)[0] # count of buildings in 50m radius
        count_100m = KD_tree.query_radius([(x,y)], r = 100, count_only = True)[0] # count of buildings in 100m radius
        
        # add these stats to a dictionary called 'ans'
        ans = {'PID':row.PID,
               'area':row.area,
              'dist_5_min':min_5,
              'dist_5_max':max_5,
              'dist_5_mean':mean_5,
              'dist_5_med':median_5,
              'dist_5_std':dist_5_std,
              'area_5_mean':area_5_mean,
              'area_5_med':area_5_med,
              'area_5_std':area_5_stdev,
              'dist_25_min':min_25,
              'dist_25_max':max_25,
              'dist_25_mean':mean_25,
              'dist_25_med':median_25,
              'dist_25_std':dist_25_std,
              'area_25_mean':area_25_mean,
              'area_25_med':area_25_med,
              'area_25_std':area_25_stdev,
              'count_25m':count_25m,
              'count_50m':count_50m,
              'count_100m':count_100m
              }

        bundle.append(ans)
        
        # keep track of progress via this row
        if counter % print_thresh == 0:
            print('%s rows completed at %s' % (counter, time.ctime()))
        
        # this functionality saves progress in case the process cannot be finished in one sitting. 
        # ideally, finish the processing in one sitting. 
        # the statistics carry no geometry column, so each checkpoint is written as a plain CSV table.
        if counter % save_thresh == 0:
            saver = pd.DataFrame(bundle[-save_thresh:])
            saver = saver[list(bundle[0].keys())]
            saver = saver.set_index('PID')
            saver.to_csv(os.path.join(pth, 'output_%s_to_%s_thread_%s.csv' % (counter - save_thresh, counter, thread_no)))
        counter+=1
        
    print('Task completed in %s seconds' % (time.time() - t))
    
    # hand the full list of result dictionaries back to the parent process
    return bundle

### Output Final Layer
Here, we rejoin the original geometry onto our statistics DF via the key field 'PID', and output the resultant file as a shapefile.

In [187]:
# flatten the per-process lists returned by pool.map into a single list of rows
bundle = [row for chunk in results for row in chunk]
out_df = pd.DataFrame(bundle)

orig_fil = gpd.read_file(os.path.join(pth, '1243_bamako_building_32629.shp'))
if orig_fil.crs != WGS:
    orig_fil = orig_fil.to_crs(WGS)
orig_fil = orig_fil.set_index('PID')

out_df = out_df.set_index('PID')
out_df['geometry'] = orig_fil['geometry']
out_df = gpd.GeoDataFrame(out_df, geometry = 'geometry', crs = WGS)
out_df.to_file(os.path.join(pth, 'buildings_altered.shp'), driver = 'ESRI Shapefile')
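
The rejoin works because pandas aligns on the index when a column from one frame is assigned to another, so row order in the two frames does not need to match. A minimal sketch with made-up PIDs and placeholder strings standing in for the geometries:

```python
import pandas as pd

# hypothetical statistics table, deliberately in a different row order
stats = pd.DataFrame({'PID': [3, 1, 2], 'area': [30.0, 10.0, 20.0]}).set_index('PID')

# hypothetical original layer; strings stand in for polygon geometries
geoms = pd.DataFrame({'PID': [1, 2, 3],
                      'geometry': ['poly_1', 'poly_2', 'poly_3']}).set_index('PID')

# assignment matches rows on the shared 'PID' index, not on position
stats['geometry'] = geoms['geometry']
print(stats)
```

Any PID present in the statistics but missing from the original layer would receive a missing value rather than raise an error, so it is worth checking for nulls in 'geometry' before writing the file.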

