# Load, Shrink, Compile

The GeoJSON source files are large and slow to open. To avoid that delay, this notebook reads them in slices and then compiles the slices into CSVs in which each building polygon is reduced to the coordinates of its centroid plus an "area" value giving the area of the polygon (i.e., the building footprint).
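
To make the polygon-to-row reduction concrete, here is a minimal pure-Python sketch that computes the area and centroid of a simple polygon ring with the shoelace formula. This is only an illustration: the notebook itself relies on geopandas' `.area` and `.centroid`, and the helper name `polygon_area_centroid` is hypothetical.

```python
def polygon_area_centroid(ring):
    """Return (area, (cx, cy)) for a simple polygon ring of (x, y) pairs.

    Illustrative shoelace-formula helper (hypothetical name); it ignores
    holes and assumes planar coordinates.
    """
    pts = [tuple(p) for p in ring]
    if pts[0] != pts[-1]:
        pts.append(pts[0])  # close the ring if it is open
    twice_area = 0.0
    cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    signed_area = twice_area / 2.0
    centroid = (cx / (6.0 * signed_area), cy / (6.0 * signed_area))
    return abs(signed_area), centroid


# A 2 x 2 square footprint
square = [(0, 0), (2, 0), (2, 2), (0, 2)]
area, centroid = polygon_area_centroid(square)
print(area, centroid)
```

Each building then needs only three numbers (two centroid coordinates and one area), which is what keeps the CSVs small.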

The slicer function is "slice_files". It is defined in slicer.py, which is located in the same directory as this notebook.
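
Since `slicer.py` is not shown here, the following is only a sketch of the general idea of slicing, not the actual `slice_files` implementation. It assumes the GeoJSON features are already available as a Python list; the function name `slice_features` is hypothetical.

```python
def slice_features(features, slice_length):
    """Split a list of GeoJSON features into FeatureCollection dicts.

    Hypothetical helper: each returned dict holds at most `slice_length`
    features and could be written to its own file with json.dump.
    """
    return [
        {'type': 'FeatureCollection',
         'features': features[start:start + slice_length]}
        for start in range(0, len(features), slice_length)
    ]


# Seven dummy features split into slices of three
dummy = [{'type': 'Feature', 'id': k} for k in range(7)]
chunks = slice_features(dummy, 3)
print([len(chunk['features']) for chunk in chunks])
```

The last slice is simply whatever remains, so it is usually shorter than the others.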

### Steps:
#### 1 - Import Modules and Data
#### 2 - Slice the GeoJSONs
#### 3 - Compile Slices as CSVs with centroids and area


### Import Modules, Set Paths, and Start the Clock


In [1]:
import datetime
from tqdm import tqdm_notebook as tqdm
from slicer import slice_files
import json
import os, os.path
import geopandas as gpd
import pandas as pd

In [6]:
# Set Main Directories
project_folder = '../'
data_folder = project_folder + '1_data/'

In [26]:
# Start the clock
print(datetime.datetime.now())

2018-12-12 20:35:26.448141


### Slice Files

In [8]:
# Set State
states = [('Oregon','OR'), ('California', 'CA'),\
          ('Washington','WA')]
slice_length = 50000

# Pull the data from Microsoft's GitHub and save it in slices
for state, state_abbv in tqdm(states):
    slice_files(state, state_abbv, data_folder, slice_length)


The 1,809,555 buildings from OR will go into 37 slice files
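
The slice count reported above is consistent with ceiling division of the building count by `slice_length`. A quick check, using only the numbers printed in this notebook:

```python
import math

n_buildings = 1809555   # Oregon building count reported above
slice_length = 50000    # slice_length set in the cell above

# 36 full slices plus one partial slice
n_slices = int(math.ceil(n_buildings / float(slice_length)))
print(n_slices)  # 37
```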



### Compile Slice Info into CSVs

In [29]:
for state, state_abbv in states:
    state_slice_folder = data_folder + 'states_slices/' + state_abbv + '/'
    slices = os.listdir(state_slice_folder)
    # Accumulate centroids and areas across all slices for this state
    lon, lat, area = [], [], []
    print("Compiling " + state + ":")
    for file_name in tqdm(slices):
        # Read slice (named gdf to avoid shadowing the Python 2 builtin `file`)
        gdf = gpd.read_file(state_slice_folder + file_name)
        # Get area in m^2 after projecting to EPSG:3857 (Web Mercator); see
        # <https://gis.stackexchange.com/questions/218450/getting-polygon-areas-using-geopandas>
        # Note: Web Mercator inflates areas away from the equator (by roughly
        # 1/cos^2(latitude)), so these values overstate the true footprint area.
        area_slice = gdf.to_crs(epsg=3857).area
        area.extend(area_slice.apply(lambda footprint: round(footprint, 4)).tolist())
        # Get centroids from slice; for lon/lat data, x is longitude and y is latitude
        centroids = gdf.geometry.centroid
        lon.extend(centroids.apply(lambda point: round(point.x, 6)).tolist())
        lat.extend(centroids.apply(lambda point: round(point.y, 6)).tolist())
    # Given centroids, define dataframe
    df = pd.DataFrame({'longitude': lon, 'latitude': lat, 'area': area})
    # Export centroids as CSV
    df.to_csv(data_folder + 'states_csv/' + state + '.csv')
    print('Recorded: ' + str(df.columns.tolist()) + ' for '
          + str(round(len(df) / 1000000.0, 2))
          + ' Million Buildings from ' + state)
    # Remove df from memory before the next iteration
    del df
    

Compiling Oregon:


HBox(children=(IntProgress(value=0, max=37), HTML(value=u'')))

Recorded: ['area', 'latitude', 'longitude'] for 1.81 Million Buildings from Oregon
Compiling California:


HBox(children=(IntProgress(value=0, max=220), HTML(value=u'')))

Recorded: ['area', 'latitude', 'longitude'] for 10.99 Million Buildings from California
Compiling Washington:


HBox(children=(IntProgress(value=0, max=60), HTML(value=u'')))

Recorded: ['area', 'latitude', 'longitude'] for 2.99 Million Buildings from Washington
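
One caveat on the `area` column: EPSG:3857 (Web Mercator) stretches distances by about 1/cos(latitude), so areas computed in it are inflated by about 1/cos²(latitude). At Oregon's latitudes that is close to a factor of two. The sketch below applies an approximate correction using each building's centroid latitude. It assumes the spherical Mercator scale factor, and the helper name `approx_true_area` is hypothetical; reprojecting to an equal-area CRS before calling `.area` would be the more rigorous option.

```python
import math


def approx_true_area(area_3857, latitude_deg):
    """Approximate ground area (m^2) from a Web Mercator area.

    Hypothetical helper: multiplies by cos^2(latitude), the inverse of the
    spherical Mercator area scale factor.
    """
    return area_3857 * math.cos(math.radians(latitude_deg)) ** 2


# A footprint measured as 400 m^2 in EPSG:3857 at 60 degrees latitude
# corresponds to roughly 100 m^2 on the ground.
corrected = approx_true_area(400.0, 60.0)
print(round(corrected, 4))
```

Because the CSVs already store each centroid's latitude, the same correction could be applied to the whole `area` column after the fact.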


In [30]:
# Stop the clock
print(datetime.datetime.now())

2018-12-12 21:59:44.350089
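
The elapsed time can be read off by subtracting the two timestamps printed in this notebook. A small sketch follows. The cell execution counts (In [8] for slicing, In [26] for the clock) suggest the clock was started after the slicing cell ran, so this figure likely covers only the compile step.

```python
import datetime

fmt = '%Y-%m-%d %H:%M:%S.%f'
start = datetime.datetime.strptime('2018-12-12 20:35:26.448141', fmt)
stop = datetime.datetime.strptime('2018-12-12 21:59:44.350089', fmt)

elapsed = stop - start
print(elapsed)  # 1:24:17.901948
```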
