### Step 2: Machine Learning

This script is only functional once Step 1: Building Attribution has been run.

In this script, we import training areas, intersect them with the attributed building footprints file, and pass the result to an automated machine learning library: h2o's AutoML.

Import libraries and set file paths and file names. building_df should be the attributed building footprint shapefile from Step 1.

In [3]:
import os, sys, time
import pandas as pd
import geopandas as gpd

In [7]:
pth = os.getcwd()+'/merged'
base_fil = 'buildings_altered.shp'
building_df = gpd.read_file(os.path.join(pth, base_fil))
building_df.crs = {'init' :'epsg:4326'}

Here, we import the training shapefiles. These are polygons which intersect the building footprints file.

We combine these into a new GeoDataFrame, df, which holds each polygon and its class (commercial, industrial, informal low income, or residential high, middle or low income).

In [10]:
regions = ['commercial.shp',
           'industrial.shp',
           'informal_low_income.shp',
           'residential_high_income.shp',
           'residential_middle_income.shp',
           'residential_low_income.shp']

shp_list = []
for shp in regions:
    shp_df = gpd.read_file(os.path.join(pth, 'training_zim', shp))
    # the class label is the file name without its extension, e.g. 'commercial'
    shp_df['type'] = os.path.splitext(shp)[0]
    shp_list.append(shp_df[['Class','geometry','type']])
df = gpd.GeoDataFrame(pd.concat(shp_list), crs = {'init':'epsg:4326'}, geometry = 'geometry')

Using these known training classifications, we assign the class of each training polygon to every building footprint that intersects it. We then drop all footprints that received no class.

The surviving footprints will serve as our model training data.

In [11]:
# set property type default as 'unknown'
building_df['type'] = 'unknown'

# iterate through training area polygons, assign type from training polygon DataFrame (df).
# Where training polygons overlap, the polygon processed last determines the type.
for index, row in df.iterrows():
    x = row.geometry
    y = row['type']
    # .loc with a boolean mask writes to building_df itself and avoids chained assignment
    building_df.loc[building_df.intersects(x), 'type'] = y

# drop all other footprints outside training polygons; copy so later edits leave building_df untouched
t = building_df.loc[building_df['type'].isin(['commercial','industrial','informal_low_income','residential_high_income','residential_middle_income',
                         'residential_low_income'])].copy()

# build_df is now our official 'training data' for our ML model. 
build_df = t
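The labelling loop has one behaviour worth making explicit: a footprint that intersects several training polygons keeps the label of whichever polygon is processed last, and a footprint that intersects none stays 'unknown'. The toy sketch below illustrates this with axis-aligned boxes standing in for real geometries. The helper names and coordinates are made up for illustration and are not part of the notebook.

```python
# Toy stand-in for the labelling loop: axis-aligned boxes instead of real
# footprint and training polygons. All names and coordinates are hypothetical.

def boxes_intersect(a, b):
    """True if boxes given as (minx, miny, maxx, maxy) share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def label_buildings(buildings, training_areas, default="unknown"):
    """Give each building the label of the last training area it intersects."""
    labels = [default] * len(buildings)
    for area_box, area_label in training_areas:
        for i, building_box in enumerate(buildings):
            if boxes_intersect(building_box, area_box):
                labels[i] = area_label  # overwrites any earlier label
    return labels

buildings = [
    (0, 0, 1, 1),      # inside the first area only
    (4, 4, 6, 6),      # straddles both areas
    (20, 20, 21, 21),  # outside every training area
]
training_areas = [
    ((0, 0, 5, 5), "commercial"),
    ((5, 5, 10, 10), "industrial"),
]

labels = label_buildings(buildings, training_areas)
print(labels)
```

If training polygons of different classes overlap in practice, the order of the files in regions therefore decides which class wins.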


Map the types to a numerical, categorical variable

In [12]:
build_df['type'] = build_df['type'].map({'commercial':1,
                                        'industrial':2,
                                        'informal_low_income':3,
                                        'residential_high_income':4,
                                        'residential_middle_income':5,
                                        'residential_low_income':6})
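Because the model is trained on these integer codes, its predictions come back as codes too. A minimal sketch of decoding them, assuming the same mapping as above; TYPE_CODES, CODE_TO_TYPE and decode are illustrative names, not part of the notebook:

```python
# Same mapping as in the cell above, kept in one place so it can be inverted.
TYPE_CODES = {
    'commercial': 1,
    'industrial': 2,
    'informal_low_income': 3,
    'residential_high_income': 4,
    'residential_middle_income': 5,
    'residential_low_income': 6,
}

# Inverse lookup: integer code -> class name.
CODE_TO_TYPE = {code: name for name, code in TYPE_CODES.items()}

def decode(codes, default='unknown'):
    """Translate predicted integer codes back into class names."""
    return [CODE_TO_TYPE.get(int(code), default) for code in codes]

print(decode([1, 6, 3]))
```

Keeping the mapping in a single dictionary avoids the two directions drifting apart if a class is added later.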

In [13]:
build_df['PID'] = build_df.index

Import our machine learning library, including helper functions for exchanging between a Pandas DataFrame and an h2o Frame

In [15]:
import h2o
from h2o.automl import H2OAutoML
from h2o.frame import H2OFrame

Shut down any existing h2o server (uncomment the first line below if one is already running), then initiate a new one.

In [16]:
#h2o.cluster().shutdown(prompt=True)
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: java version "10.0.1" 2018-04-17; Java(TM) SE Runtime Environment 18.3 (build 10.0.1+10); Java HotSpot(TM) 64-Bit Server VM 18.3 (build 10.0.1+10, mixed mode)
  Starting server from /Users/alex.chunet/anaconda3/envs/ML/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /var/folders/hv/sntp5ybs2bl6hqtgl4pb1p2r0000gn/T/tmpa5b0exba
  JVM stdout: /var/folders/hv/sntp5ybs2bl6hqtgl4pb1p2r0000gn/T/tmpa5b0exba/h2o_alex_chunet_started_from_python.out
  JVM stderr: /var/folders/hv/sntp5ybs2bl6hqtgl4pb1p2r0000gn/T/tmpa5b0exba/h2o_alex_chunet_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


H2O cluster uptime:,02 secs
H2O cluster timezone:,Europe/Paris
H2O data parsing timezone:,UTC
H2O cluster version:,3.28.1.2
H2O cluster version age:,9 months and 12 days !!!
H2O cluster name:,H2O_from_python_alex_chunet_yd4q3e
H2O cluster total nodes:,1
H2O cluster free memory:,2 Gb
H2O cluster total cores:,4
H2O cluster allowed cores:,4


This step is a bit clumsy, but it allows the construction of an h2o frame object from the training data.

In [18]:
# columns to send to h2o: the ID, the predictor fields and the response ('type')
cols = ['PID', 'area', 'dist_5_min', 'dist_5_max', 'dist_5_mea', 'dist_5_med',
        'dist_5_std', 'area_5_mea', 'area_5_med', 'area_5_std', 'dist_25_mi',
        'dist_25_ma', 'dist_25_me', 'dist_25__1', 'dist_25_st', 'area_25_me',
        'area_25__1', 'area_25_st', 'count_25m', 'count_50m', 'count_100m', 'type']
expi_f = {col: list(build_df[col]) for col in cols}

In [19]:
frme = H2OFrame(expi_f)

Parse progress: |█████████████████████████████████████████████████████████| 100%


Here we explicitly define the predictor fields and our dependent variable (the response).

In [22]:
predictors = ['area', 'dist_5_min', 'dist_5_max', 'dist_5_mea', 'dist_5_med',
       'dist_5_std', 'area_5_mea', 'area_5_med', 'area_5_std', 'dist_25_mi',
       'dist_25_ma', 'dist_25_me', 'dist_25__1', 'dist_25_st', 'area_25_me',
       'area_25__1', 'area_25_st', 'count_25m', 'count_50m', 'count_100m']
response = 'type'

In [23]:
train, valid = frme.split_frame(ratios = [.8], seed = 10)
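To our understanding, split_frame assigns each row to a split at random rather than cutting at exactly 80%, so the two frames are only approximately 80/20 in size. The pure-Python sketch below illustrates that idea. It is not h2o's implementation and will not reproduce h2o's row assignment for the same seed; probabilistic_split is a hypothetical helper.

```python
import random

def probabilistic_split(n_rows, ratio=0.8, seed=10):
    """Assign each row index to train with probability `ratio`, else to valid."""
    rng = random.Random(seed)
    train_idx, valid_idx = [], []
    for i in range(n_rows):
        if rng.random() < ratio:
            train_idx.append(i)
        else:
            valid_idx.append(i)
    return train_idx, valid_idx

train_idx, valid_idx = probabilistic_split(10000)
# Sizes are close to, but generally not exactly, 8000 and 2000.
print(len(train_idx), len(valid_idx))
```

Fixing the seed makes the split repeatable, which is why the cell above passes seed = 10.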

This block of code is fairly standard h2o usage. AutoML trains up to 20 base models, with runtime limited to 1 hour by default, and stops at whichever limit is reached first. The leaderboard, lb, lists the trained models ordered by performance; preds holds the predictions of the top model (aml.leader) on the held-out validation frame.

In [24]:
# Identify predictors and response
x = predictors
y = response

# For classification, the response must be a factor (categorical); here it has six classes
train[y] = train[y].asfactor()
valid[y] = valid[y].asfactor()

# Run AutoML for 20 base models (limited to 1 hour max runtime by default)
aml = H2OAutoML(max_models=20, seed=1)
aml.train(x=x, y=y, training_frame=train)

# View the AutoML Leaderboard
lb = aml.leaderboard
lb.head(rows=lb.nrows)  # Print all rows instead of default (10 rows)

preds = aml.leader.predict(valid)

AutoML progress: |████████████████████████████████████████████████████████| 100%
stackedensemble prediction progress: |████████████████████████████████████| 100%


Here, we print out the performance of our top performing model.

In [26]:
res = aml.leader.model_performance(valid)

print(res)


ModelMetricsMultinomialGLM: stackedensemble
** Reported on test data. **

MSE: 0.12220657961186379
RMSE: 0.3495805767085234



In [27]:
res.confusion_matrix()


Confusion Matrix: Row labels: Actual class; Column labels: Predicted class


Actual,1,2,3,4,5,6,Error,Rate
1,87.0,39.0,0.0,11.0,2.0,0.0,0.374101,52 / 139
2,12.0,399.0,1.0,27.0,5.0,0.0,0.101351,45 / 444
3,0.0,0.0,118.0,10.0,4.0,5.0,0.138686,19 / 137
4,0.0,8.0,4.0,437.0,37.0,6.0,0.111789,55 / 492
5,0.0,2.0,0.0,48.0,202.0,25.0,0.270758,75 / 277
6,0.0,0.0,4.0,3.0,17.0,463.0,0.049281,24 / 487
Total,99.0,448.0,127.0,536.0,267.0,499.0,0.13664,"270 / 1,976"
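The Error column can be reproduced by hand, which helps when reading the table: each class's error is the share of its actual rows that fall off the diagonal, and the final row gives the overall error. A short sketch using the counts above, with the six class rows in order and the totals row left out; the function names are illustrative:

```python
# Counts copied from the confusion matrix above (rows: actual, columns: predicted).
matrix = [
    [87, 39, 0, 11, 2, 0],
    [12, 399, 1, 27, 5, 0],
    [0, 0, 118, 10, 4, 5],
    [0, 8, 4, 437, 37, 6],
    [0, 2, 0, 48, 202, 25],
    [0, 0, 4, 3, 17, 463],
]

def per_class_error(m):
    """Fraction of each actual class that was predicted as some other class."""
    return [1 - row[i] / sum(row) for i, row in enumerate(m)]

def overall_accuracy(m):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(m[i][i] for i in range(len(m)))
    total = sum(sum(row) for row in m)
    return correct / total

errors = per_class_error(matrix)
accuracy = overall_accuracy(matrix)
print([round(e, 6) for e in errors])
print(round(accuracy, 5))
```

On these counts the weakest class is the first one (commercial): 39 of its 52 misclassified footprints were predicted as class 2 (industrial).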




We save the top-performing model to disk so that it can be reloaded later without retraining.

In [36]:
model_path = h2o.save_model(model=aml.leader, path='/Users/alex.chunet/Documents/Repositories/GEO_ML/Bamako_building attribution', force=True)


h2o struggled to generate predictions for more than 100,000 rows at a time. We therefore split the original DataFrame into 100,000-row chunks, convert each chunk to an h2o frame, run the predictions on it, and write the result to file. These predictions could be re-aggregated as desired, but this was not required for this proof of concept.

In [29]:
building_df['PID'] = building_df.index

In [31]:
bef = [0,100000,200000,300000,400000,500000,600000,700000,800000,900000,1000000,1100000,1200000]
af = [100000,200000,300000,400000,500000,600000,700000,800000,900000,1000000,1100000,1200000,1300000]
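The two lists above are hard-coded for a footprint file of up to 1.3 million rows. As a sketch of an alternative, the same start and end offsets can be derived from the row count, so that the chunking follows the size of the input. chunk_bounds is a hypothetical helper, not part of the notebook.

```python
def chunk_bounds(n_rows, chunk_size=100000):
    """Return (start, end) row offsets covering n_rows in chunks of chunk_size."""
    return [(start, min(start + chunk_size, n_rows))
            for start in range(0, n_rows, chunk_size)]

# A 250,000-row input gives two full chunks and one partial chunk.
print(chunk_bounds(250000))
```

In the notebook this could be called as chunk_bounds(len(building_df)) and iterated over in place of zip(bef, af).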

In [35]:
for x,y in zip(bef, af):
    print(x,y)
    # copy only the current chunk rather than the whole DataFrame
    df_short = building_df.iloc[x:y].copy()

    # convert to h2o frame: the ID plus the predictor fields used in training
    expi = {col: list(df_short[col]) for col in ['PID'] + predictors}
    frme = H2OFrame(expi)

    # generate predictions from top model
    preds = aml.leader.predict(frme)

    # send back to Pandas DataFrame
    preds_df = preds.as_data_frame()
    preds_df = preds_df.reset_index()
    preds_df['New_ID'] = preds_df.index
    preds_df = preds_df.set_index('New_ID')
    u = df_short.copy()
    u = u.reset_index()
    u['New_ID'] = u.index
    u = u.set_index('New_ID')
    u['predict'] = preds_df['predict']
    u.to_file(os.path.join('/Users/alex.chunet/Documents/Repositories/GEO_ML/Bamako_building attribution/output_zim','pred_layer_%s_%s.shp' % (x, y)), driver = 'ESRI Shapefile')
    

0 100000
Parse progress: |█████████████████████████████████████████████████████████| 100%
stackedensemble prediction progress: |████████████████████████████████████| 100%
100000 200000
Parse progress: |█████████████████████████████████████████████████████████| 100%
stackedensemble prediction progress: |████████████████████████████████████| 100%
200000 300000
Parse progress: |█████████████████████████████████████████████████████████| 100%
stackedensemble prediction progress: |████████████████████████████████████| 100%
300000 400000
Parse progress: |█████████████████████████████████████████████████████████| 100%
stackedensemble prediction progress: |████████████████████████████████████| 100%
400000 500000
Parse progress: |█████████████████████████████████████████████████████████| 100%
stackedensemble prediction progress: |████████████████████████████████████| 100%
500000 600000
Parse progress: |█████████████████████████████████████████████████████████| 100%
stackedensemble prediction pro

In [None]:
# Merging

In [None]:
path_in=""
path_out =""

In [49]:
pred2 = gpd.read_file(path_in+'pred_layer_2.shp')
pred3 = gpd.read_file(path_in+'pred_layer_3.shp')
pred4 = gpd.read_file(path_in+'pred_layer_4.shp')

In [50]:
merged = pd.concat([pred2, pred3, pred4], join="inner")

In [52]:
# write the merged layer to path_out ('path' was never defined in this notebook)
merged.to_file(os.path.join(path_out,'merged.shp'), driver = 'ESRI Shapefile')