### Step 2: Machine Learning

This script is only functional once Step 1: Bamako Building Attribution has been run.

In this script, we import training areas, intersect them with the attributed building footprints file, and pass the result to an automated machine learning library: h2o's AutoML.

CREDITS: Sarah Antos for training area shapefiles.

Import libraries and set file paths and file names. building_df should be the attributed building footprint shapefile from Step 1.

In [1]:
import os, sys, time
import pandas as pd
import geopandas as gpd

In [2]:
pth = r'C:\Users\charl\Documents\GOST\Bamako'
base_fil = r'1243_bamako_building_32629_neighbourInfo.shp'
building_df = gpd.read_file(os.path.join(pth, base_fil))
# reproject to WGS84; the epsg keyword avoids the deprecated {'init': ...} syntax
building_df = building_df.to_crs(epsg=4326)

Here, we import the training shapefiles. These are polygons which intersect the building footprints file. 

We combine these into a new GeoDataFrame, df, which holds each polygon and its income bracket (slum, mid or rich).

In [3]:
regions = ['Middle_Class_Hamdallaye.shp',
           'Middle_Class_Hippodrome.shp',
           'Middle_Class_Sogoniko.shp',
           'TFS_slum.shp',
           'High Class.shp']

shp_list = []
for shp in regions: 
    shp_df = gpd.read_file(os.path.join(pth, shp))
    shp_df['type'] = 'blank'
    if shp == 'TFS_slum.shp':
        shp_df['type'] = 'slum'
    elif shp == 'High Class.shp':
        shp_df['type'] = 'rich'
    else:
        shp_df['type'] = 'mid'
    shp_list.append(shp_df[['Name','geometry','type']])
df = gpd.GeoDataFrame(pd.concat(shp_list), crs = {'init':'epsg:4326'}, geometry = 'geometry')
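
The if/elif chain above amounts to a lookup from file name to income bracket. A minimal sketch of that lookup as a dictionary, assuming the same five file names; BRACKET_BY_FILE and bracket_for are hypothetical names, not part of the original notebook:

```python
# Hypothetical lookup table: training shapefile name -> income bracket.
BRACKET_BY_FILE = {
    'Middle_Class_Hamdallaye.shp': 'mid',
    'Middle_Class_Hippodrome.shp': 'mid',
    'Middle_Class_Sogoniko.shp': 'mid',
    'TFS_slum.shp': 'slum',
    'High Class.shp': 'rich',
}


def bracket_for(shp_name):
    """Return the income bracket for a training shapefile name.

    Raises KeyError for an unlisted file, instead of labelling it 'mid'
    the way a catch-all else branch would.
    """
    return BRACKET_BY_FILE[shp_name]


print(bracket_for('TFS_slum.shp'))
```

One possible advantage of this design is that a misspelled or newly added file name fails loudly rather than being classed as middle income.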

Using these known training classifications, we assign the income bracket from the training data onto the main building footprints file. We then drop all non-attributed footprints. 

The surviving footprints will serve as our model training data.

In [4]:
# set property type default as 'unknown'
building_df['type'] = 'unknown'

# iterate through training area polygons, assign type from training polygon DataFrame (df).
for index, row in df.iterrows():
    x = row.geometry
    y = row['type']
    # a single .loc call on the frame itself avoids chained assignment
    # (the original form triggered pandas' SettingWithCopyWarning)
    building_df.loc[building_df.intersects(x), 'type'] = y
t = building_df.copy()

# drop all other footprints outside training polygons
t = t.loc[t['type'].isin(['mid','slum','rich'])]

# build_df is now our official 'training data' for our ML model. 
build_df = t.copy()
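
Assigning labels through a boolean mask is safest when the mask and the column label go into a single .loc call on the frame itself. A minimal sketch with a toy pandas DataFrame; the in_area column is a made-up stand-in for the result of building_df.intersects(x):

```python
import pandas as pd

# Toy stand-in for the building footprints table (values are invented).
toy = pd.DataFrame({'PID': [1, 2, 3, 4],
                    'in_area': [True, False, True, False]})
toy['type'] = 'unknown'

# One .loc call with (row mask, column label) writes into toy directly,
# so there is no intermediate object that might turn out to be a copy.
toy.loc[toy['in_area'], 'type'] = 'slum'

# Keep only the labelled rows; .copy() makes the result independent of toy.
labelled = toy.loc[toy['type'] != 'unknown'].copy()
print(labelled)
```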


Map the types to a numerical, categorical variable.

In [5]:
build_df['type'] = build_df['type'].map({'slum':1,
                                        'mid':2,
                                        'rich':3})
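
Because the model is trained on these integer codes, its predictions come back as 1, 2 or 3. A small sketch of how the codes could be translated back to bracket names by inverting the mapping; TYPE_CODES, CODE_TYPES and decode are illustrative names, not part of the original notebook:

```python
# Same mapping as in the cell above.
TYPE_CODES = {'slum': 1, 'mid': 2, 'rich': 3}

# Invert it: integer code -> bracket name.
CODE_TYPES = {code: name for name, code in TYPE_CODES.items()}


def decode(codes):
    """Translate a sequence of integer codes back to bracket names."""
    return [CODE_TYPES[int(c)] for c in codes]


print(decode([1, 3, 2, 2]))
```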

Import our machine learning library, including helper functions for exchanging between a Pandas DataFrame and an h2o Frame

In [6]:
import h2o
from h2o.automl import H2OAutoML
from h2o.frame import H2OFrame

Shut down any existing h2o server, then start a new one.

In [11]:
h2o.cluster().shutdown(prompt=True) 
h2o.init()

Are you sure you want to shutdown the H2O instance running at http://127.0.0.1:54321 (Y/N)? Y
Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
; Java HotSpot(TM) Client VM (build 25.171-b11, mixed mode, sharing)
  Starting server from C:\Users\charl\AppData\Local\Continuum\anaconda3\envs\test\lib\site-packages\h2o\backend\bin\h2o.jar
  Ice root: C:\Users\charl\AppData\Local\Temp\tmpj3a2j57m
  JVM stdout: C:\Users\charl\AppData\Local\Temp\tmpj3a2j57m\h2o_charl_started_from_python.out
  JVM stderr: C:\Users\charl\AppData\Local\Temp\tmpj3a2j57m\h2o_charl_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.


| Property | Value |
|---|---|
| H2O cluster uptime | 01 secs |
| H2O cluster timezone | America/New_York |
| H2O data parsing timezone | UTC |
| H2O cluster version | 3.22.1.3 |
| H2O cluster version age | 10 days |
| H2O cluster name | H2O_from_python_charl_9zrjy3 |
| H2O cluster total nodes | 1 |
| H2O cluster free memory | 247.5 Mb |
| H2O cluster total cores | 0 |
| H2O cluster allowed cores | 0 |


This step is a bit clumsy, but it allows the construction of an H2OFrame object from the training data.

In [12]:
expi = {'PID':list(build_df['PID']),
        'bArea':list(build_df['bArea']),
        'distMean':list(build_df['distMean']),
        'distMin':list(build_df['distMin']),
        'distMax':list(build_df['distMax']),
        'areaMean':list(build_df['areaMean']),
        'areaMin':list(build_df['areaMin']),
        'areaMax':list(build_df['areaMax']),
        'type':list(build_df['type'])}

In [13]:
frme = H2OFrame(expi)

Parse progress: |█████████████████████████████████████████████████████████| 100%
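
The column-by-column dictionary above can also be built with a comprehension over a list of column names. A hedged sketch: columns_to_dict and FEATURE_COLUMNS are hypothetical names, and a plain dict of lists with made-up values stands in for build_df so that the example runs without pandas or h2o. The same indexing, table[col], also works on a pandas DataFrame.

```python
# Columns passed to h2o in the cell above.
FEATURE_COLUMNS = ['PID', 'bArea', 'distMean', 'distMin', 'distMax',
                   'areaMean', 'areaMin', 'areaMax', 'type']


def columns_to_dict(table, columns):
    """Copy the named columns of `table` into a dict of plain lists."""
    return {col: list(table[col]) for col in columns}


# Stand-in for build_df: two rows of invented values plus one unused column.
toy_table = {col: [0.0, 1.0] for col in FEATURE_COLUMNS}
toy_table['unused'] = [9, 9]

expi_toy = columns_to_dict(toy_table, FEATURE_COLUMNS)
print(sorted(expi_toy))
```

The resulting dictionary has the same shape as expi, so it could be handed to H2OFrame in the same way.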


Here we explicitly define the predictor fields and our dependent variable (the response).

In [14]:
# the response column ('type') must not appear among the predictors
predictors = ['bArea','distMean','distMax','distMin','areaMean','areaMin','areaMax']
response = 'type'

In [15]:
train, valid = frme.split_frame(ratios = [.8], seed = 10)

This block of code is fairly standard h2o. It trains up to 20 base models on the training data; in this h2o version the runtime is also capped at 1 hour by default. Training stops at whichever limit is reached first. The leaderboard, lb, ranks the trained models by performance, and preds holds the leading model's predictions on the validation frame.

In [16]:
# Identify predictors and response
x = predictors
y = response

# For classification (here, three classes), the response should be a factor
train[y] = train[y].asfactor()
valid[y] = valid[y].asfactor()

# Run AutoML for 20 base models (limited to 1 hour max runtime by default)
aml = H2OAutoML(max_models=20, seed=1)
aml.train(x=x, y=y, training_frame=train)

# View the AutoML Leaderboard
lb = aml.leaderboard
lb.head(rows=lb.nrows)  # Print all rows instead of default (10 rows)

preds = aml.leader.predict(valid)

AutoML progress: |████████████████████████████████████████████████████████| 100%
stackedensemble prediction progress: |████████████████████████████████████| 100%


Here, we print out the performance of our top performing model.

In [17]:
res = aml.leader.model_performance(valid)

print(res)


ModelMetricsMultinomialGLM: stackedensemble
** Reported on test data. **

MSE: 0.22544477423935919
RMSE: 0.4748102507732528
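
The two figures reported above are linked: RMSE is the square root of MSE. A quick check using the values printed in this output:

```python
import math

mse = 0.22544477423935919   # MSE reported above
rmse = 0.4748102507732528   # RMSE reported above

# RMSE is defined as the square root of MSE, so the two should agree.
print(math.isclose(math.sqrt(mse), rmse, rel_tol=1e-9))
```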



We save the leading model to its own directory.

In [18]:
model_path = h2o.save_model(model=aml.leader, path=r'C:\Users\charl\Documents\GOST\Bamako\model', force=True)

h2o struggled to generate predictions for more than 100,000 rows at a time. We therefore split the original DataFrame into 100,000-row chunks, convert each chunk to an H2OFrame, run the predictions, and write the results to file. These predictions could be re-aggregated as desired, but this was not required for this proof of concept.

In [21]:
df_short = building_df.copy()
df_short = df_short[:100000]

# convert to h2o frame
expi2 = {'PID':list(df_short['PID']),
        'bArea':list(df_short['bArea']),
        'distMean':list(df_short['distMean']),
        'distMin':list(df_short['distMin']),
        'distMax':list(df_short['distMax']),
        'areaMean':list(df_short['areaMean']),
        'areaMin':list(df_short['areaMin']),
        'areaMax':list(df_short['areaMax'])}
frme2 = H2OFrame(expi2)

# generate predictions from top model
preds_2 = aml.leader.predict(frme2)

# send back to Pandas DataFrame
preds_df = preds_2.as_data_frame()
preds_df = preds_df.reset_index()
preds_df['New_ID'] = preds_df.index
preds_df = preds_df.set_index('New_ID')
u = df_short.copy()
u = u.reset_index()
u['New_ID'] = u.index
u = u.set_index('New_ID')
u['predict'] = preds_df['predict']
u.to_file(os.path.join(r'C:\Users\charl\Documents\GOST\Bamako','pred_layer_2.shp'), driver = 'ESRI Shapefile')

In [23]:
df_short = building_df.copy()
df_short = df_short[100000:200000]

# convert to h2o frame
expi3 = {'PID':list(df_short['PID']),
        'bArea':list(df_short['bArea']),
        'distMean':list(df_short['distMean']),
        'distMin':list(df_short['distMin']),
        'distMax':list(df_short['distMax']),
        'areaMean':list(df_short['areaMean']),
        'areaMin':list(df_short['areaMin']),
        'areaMax':list(df_short['areaMax'])}
frme3 = H2OFrame(expi3)

# generate predictions from top model
preds_3 = aml.leader.predict(frme3)

# send back to Pandas DataFrame
preds_df = preds_3.as_data_frame()
preds_df = preds_df.reset_index()
preds_df['New_ID'] = preds_df.index
preds_df = preds_df.set_index('New_ID')
u = df_short.copy()
u = u.reset_index()
u['New_ID'] = u.index
u = u.set_index('New_ID')
u['predict'] = preds_df['predict']
u.to_file(os.path.join(r'C:\Users\charl\Documents\GOST\Bamako','pred_layer_3.shp'), driver = 'ESRI Shapefile')

Parse progress: |█████████████████████████████████████████████████████████| 100%
stackedensemble prediction progress: |████████████████████████████████████| 100%


In [24]:
df_short = building_df.copy()
df_short = df_short[200000:]
expi4 = {'PID':list(df_short['PID']),
        'bArea':list(df_short['bArea']),
        'distMean':list(df_short['distMean']),
        'distMin':list(df_short['distMin']),
        'distMax':list(df_short['distMax']),
        'areaMean':list(df_short['areaMean']),
        'areaMin':list(df_short['areaMin']),
        'areaMax':list(df_short['areaMax'])}
frme4 = H2OFrame(expi4)
preds_4 = aml.leader.predict(frme4)
preds_df = preds_4.as_data_frame()
preds_df = preds_df.reset_index()
preds_df['New_ID'] = preds_df.index
preds_df = preds_df.set_index('New_ID')
u = df_short.copy()
u = u.reset_index()
u['New_ID'] = u.index
u = u.set_index('New_ID')
u['predict'] = preds_df['predict']
u.to_file(os.path.join(r'C:\Users\charl\Documents\GOST\Bamako','pred_layer_4.shp'), driver = 'ESRI Shapefile')

Parse progress: |█████████████████████████████████████████████████████████| 100%
stackedensemble prediction progress: |████████████████████████████████████| 100%
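
The three prediction cells above differ only in their slice bounds and output file name, so they could be driven by a loop. A minimal sketch of the slicing logic only; chunk_bounds is a hypothetical helper, and the row count of 250,000 is made up, since the actual size of building_df is not stated here:

```python
def chunk_bounds(n_rows, chunk_size=100000):
    """Yield (start, stop) pairs that cover range(n_rows) in order."""
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)


# Made-up row count, for illustration only.
for i, (start, stop) in enumerate(chunk_bounds(250000), start=2):
    # In the notebook, this is where building_df[start:stop] would be
    # converted to an H2OFrame, scored, and written to a shapefile.
    print('pred_layer_%d.shp' % i, start, stop)
```

The file counter starts at 2 only to mirror the pred_layer_2.shp naming used in the cells above.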
