# Data Design and Process


## Parcel Generation

The model operates at the unit of the parcel: data are aggregated to the land-parcel level, and the parcels are then used for modeling. Because actual parcels vary widely in size, and parcel layers vary in availability and quality, we generated a layer of parcels for each town in the model. For each town, we generated a tessellation of hexagons covering the town's search area.
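
As a rough illustration of what tessellating a search area involves, the sketch below lays out pointy-top hexagon centers over a bounding box in plain Python. The function names, the bounding-box inputs, and the pointy-top orientation are assumptions made for this example; they are not taken from the project's code, which may well generate the hexagons with a GIS tool instead.

```python
import math


def hexagon_vertices(cx, cy, side):
    """Return the six corner points of a pointy-top hexagon centered at (cx, cy)."""
    return [
        (cx + side * math.cos(math.radians(60 * k + 30)),
         cy + side * math.sin(math.radians(60 * k + 30)))
        for k in range(6)
    ]


def hexagon_centers(xmin, ymin, xmax, ymax, side):
    """Return centers of a hexagon grid that fully covers the bounding box."""
    dx = math.sqrt(3) * side   # spacing between centers within a row
    dy = 1.5 * side            # spacing between rows
    centers = []
    row = 0
    while ymin + row * dy <= ymax + side:
        # Odd rows are shifted by half a cell; starting them left of xmin
        # keeps the left edge of the box covered.
        x0 = xmin - dx / 2 if row % 2 else xmin
        col = 0
        while x0 + col * dx <= xmax + side:
            centers.append((x0 + col * dx, ymin + row * dy))
            col += 1
        row += 1
    return centers


# Hypothetical 1 km x 1 km search area, coordinates in meters.
centers = hexagon_centers(0, 0, 1000, 1000, side=142)
hexagons = [hexagon_vertices(cx, cy, 142) for cx, cy in centers]
```

Each entry in `hexagons` is a ring of six vertices that could be written out as a polygon feature; hexagons along the edge extend past the bounding box so that no part of the search area is left uncovered.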

For the tessellation, we used a hexagon side length of 142 meters. A hexagon with this side length has an area (about 5.2 hectares) that matches the average parcel area in the parcels layer we obtained for Monroe County, Illinois - one of the study sites.
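
The relationship between side length and area is the standard formula for a regular hexagon, area = (3√3 / 2) × side². The sketch below shows the conversion in both directions; the function names are made up for this example.

```python
import math


def hexagon_area(side):
    """Area of a regular hexagon with the given side length."""
    return 1.5 * math.sqrt(3) * side ** 2


def side_for_area(area):
    """Side length of the regular hexagon that has the given area."""
    return math.sqrt(area / (1.5 * math.sqrt(3)))


area_m2 = hexagon_area(142)        # roughly 52,388 square meters
side_m = side_for_area(area_m2)    # recovers 142 meters
```

Given the average parcel area of another parcels layer in square meters, `side_for_area` would give the matching hexagon side length for that layer.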

## Hexagon Side Lengths/Cell Size
I tried running the data extraction at three hexagon side lengths: 30m, 60m, and 142m. I tried 30m to get closer to the cell size of the many 30m rasters we are using, and 60m after 30m caused some Zonal Statistics calls to crash the Python interpreter (a known bug that occurs when no cells are selected during an operation). We used 142m because a hexagon with that side length has the same area as the average parcel in the Monroe County parcels layer we originally used for Valmeyer.

Other side lengths might be worth testing, but the results so far were as follows. The 60m hexagons seemed to provide little net benefit to model accuracy and likely worsened the problem of spatial autocorrelation. I did not quantify the changes formally: the model seemed slightly more accurate, but I'm not sure whether the difference was significant or how much autocorrelation increased. The model was also significantly slower to process data and generate outputs because the 60m hexagons produce many more "parcels" than the 142m hexagons.
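
To put a number on "many more parcels": hexagon area scales with the square of the side length, so the number of hexagons needed to cover the same search area scales with the inverse square. The sketch below is only that arithmetic (the function name is invented here) and ignores edge effects at the boundary of the search area.

```python
def relative_cell_count(side, reference_side=142.0):
    """Approximate number of hexagons of `side` per reference hexagon of equal coverage."""
    return (reference_side / side) ** 2


ratio_60 = relative_cell_count(60)   # about 5.6 times as many hexagons as at 142m
ratio_30 = relative_cell_count(30)   # about 22.4 times as many hexagons as at 142m
```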

## Zonal Statistics Bug and Workaround
This tool relies heavily on ArcGIS for spatial functionality. In the version of ArcGIS we use, the Zonal Statistics as Table tool has a bug that causes it to return incorrect data, which seriously impairs our code's ability to extract values. We determined that converting our zone data to raster before providing it to the tool works around the problem and gives correct results. The code has a flag indicating whether this workaround should be used, and it is on by default.
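
A minimal sketch of the workaround, assuming `arcpy` with the Spatial Analyst extension is available. The function name, its parameters, and the `rasterize_zones` flag name are hypothetical stand-ins, not the project's actual code; the sketch also assumes `zone_field` is an integer ID field, so that the rasterized zones carry those IDs in the raster's `Value` field.

```python
def zonal_stats_table(zones_fc, zone_field, value_raster, out_table,
                      zone_raster_path, cell_size, rasterize_zones=True):
    """Run Zonal Statistics as Table, optionally rasterizing the zones first."""
    import arcpy  # imported here so the module loads on machines without ArcGIS

    arcpy.CheckOutExtension("Spatial")
    zone_data, field = zones_fc, zone_field
    if rasterize_zones:
        # Align the zone raster with the value raster's cell grid.
        arcpy.env.snapRaster = value_raster
        arcpy.PolygonToRaster_conversion(
            zones_fc, zone_field, zone_raster_path,
            "CELL_CENTER", "NONE", cell_size)
        # Hand the tool the raster zones instead of the polygons.
        zone_data, field = zone_raster_path, "Value"
    arcpy.sa.ZonalStatisticsAsTable(
        zone_data, field, value_raster, out_table, "DATA", "ALL")
    return out_table
```

One caveat with any rasterized zone layer: a polygon that captures no cell centers at the chosen cell size will not appear in the zone raster, and so gets no row in the output table.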

## Town Boundary and Structure Filtering
Note choices made while filtering structures here - pull from metadata. Note the erasure of the old location from the new and note the transition to static town boundaries.

### Potential Issues
With Soldiers Grove and Gays Mills in particular, we might wish to provide a manual boundary that excludes the floodplain from the town boundary - because each town sits on both sides of the river, the whole town appears to be in the floodplain. In fact, manual boundaries may be in order for many of the towns - derive them from the existing autogenerated boundaries and clean them up?

# Data Sources
## Roads
Road description here

## Structures
Structures Description here

## Floodplains
We use the National Flood Hazard Layer (NFHL), as available in December 2015, as our base floodplain layer. We filtered it to areas with 100-year flood protection or worse, which are the kinds of floodplains we are targeting for this model. Many locations aren't completely covered by the NFHL, though, so we took a mixed approach to filling the gaps. Filling them was important because one parameter in the model is distance to floodplain, which would be distorted for some parcels if data were missing.

1. When possible, we located, georeferenced, and roughly digitized Digital Flood Insurance Rate Maps (DFIRMs). Originally, with Valmeyer, we digitized at 1:24000 scale, but we decided that the effort required to get highly accurate floodplains was not worth it. Still, even at the lower scale, the deviation from the true distance value shouldn't be more than 50 meters or so in most cases, which gives each area an appropriate measure of floodplain distance.
2. When DFIRMs were not available, we used NHDPlus v2 to create a rough estimate of the floodplains (one that seemed to track well enough in locations that had floodplain data). We joined the TotDASqKm attribute in NHDPlus to the Flowlines, which provides the upstream drainage area in square kilometers as an attribute on each stream segment. We then filtered to segments with more than 250 sq km of upstream drainage area and buffered each of those segments by 0.1 × (the upstream drainage area in sq km) meters on each side to represent the floodplain - so a segment with just over 250 sq km upstream would be buffered by about 25 meters on each side.
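
The buffer rule in step 2 can be sketched as plain arithmetic. The code below only computes buffer distances; the buffering itself would happen in GIS. The function names and the dictionary layout are assumptions for illustration (`TotDASqKm` is the attribute named above, and `COMID` is used here as the segment identifier), and the sketch treats the 250 sq km threshold as strict, following "more than 250".

```python
MIN_DRAINAGE_SQKM = 250.0   # segments at or below this are not buffered
METERS_PER_SQKM = 0.1       # buffer distance per sq km of upstream drainage


def floodplain_buffer_m(tot_da_sqkm):
    """Buffer distance in meters per side, or None if the segment is filtered out."""
    if tot_da_sqkm <= MIN_DRAINAGE_SQKM:
        return None
    return METERS_PER_SQKM * tot_da_sqkm


def buffer_distances(segments):
    """Map segment ID to buffer distance for segments that pass the filter."""
    distances = {}
    for segment in segments:
        distance = floodplain_buffer_m(segment["TotDASqKm"])
        if distance is not None:
            distances[segment["COMID"]] = distance
    return distances


# Made-up segments, for illustration only.
example = buffer_distances([
    {"COMID": 1, "TotDASqKm": 120.0},    # filtered out
    {"COMID": 2, "TotDASqKm": 400.0},    # buffered by about 40 meters per side
    {"COMID": 3, "TotDASqKm": 1500.0},   # buffered by about 150 meters per side
])
```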

Some regions use a mix of strategies: where the NFHL was incomplete, we used it where possible and either digitized the remaining portions from DFIRMs or generated floodplain (Niobrara, Odanah). Generating floodplain was especially necessary for areas with Native American reservation land, as FIRMs did not cover these locations.

### Odanah issues
Odanah, WI is a town on the Bad River Native American Reservation and, as such, has no data in the NFHL or available as a FIRM. We needed to approximate a floodplain in order to get consistent results for the model. To do so, we used NHDPlus V2 river data with the total drainage area attribute, and buffered rivers with a total drainage area greater than 250 sq km by one meter for every 10 sq km of drainage area (that is, by (total drainage area in sq km)/10 meters) to approximate which areas would be floodplains.

# Modeling

## General Approach



## Min Max Mean vs Max
Placeholder text here

## Random Forests vs. Logistic Regression

## Scaling of values

## Size of Area Searched

I discovered massive variations in the area covered for each study area - the smallest is ~50 sq km and the largest closer to 1100 sq km. Most are close to 1100, but a few are much smaller: 2 at ~50, 2 at around 600, and the rest close enough to 1100 (900-1100) that I'm not concerned.

My instinct is that this matters and needs to be corrected because some locations are being undersampled relative to the rest of the locations. But I wanted your confirmation that it was worth spending time retrieving the new DEM data and having Megan digitize additional roads.

I took this as an opportunity to see whether prediction was any better for the smaller areas, and the answer appears to be "no". I ran the model using only the bottom 4 (2 at ~50 sq km and 2 at around 600 sq km). The output was as follows:

    In [13]: model.run_and_validate()
    INFO Percent Correct: 0.9987368421052631
    INFO Total records: 23751
    INFO Number of withheld records: 2375
    INFO Correctly predicted: 2372
    INFO Incorrectly predicted: 3
    INFO Underpredicted: 3
    INFO Overpredicted: 0
    INFO Total 'True' values in validation dataset: 7
    INFO Percent incorrect for True: 0.42857142857142855

The bottom number there (percent incorrect for True) is usually between .40 and .50 right now when we use everything, so I don't know that using smaller areas would improve our results. I was concerned that the larger areas could introduce many confounding negative locations that would be suitable but weren't chosen. That seems like maybe not the case, but this isn't definitive.
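
For reference, the figures in that output can be reproduced from the raw counts. The sketch below assumes that "Underpredicted" counts True records that the model predicted as False; the function name is made up for this example. It also computes the accuracy a model would get by predicting False for every withheld record, which helps explain why the overall percent correct is a much less informative number than the bottom one.

```python
def summarize_validation(withheld, correct, underpredicted, true_total):
    """Recompute validation ratios from the counts in the log output."""
    return {
        # Share of withheld records predicted correctly.
        "percent_correct": correct / withheld,
        # Share of True records that were missed (assumed meaning of "Underpredicted").
        "percent_incorrect_true": underpredicted / true_total,
        # Accuracy of a trivial model that always predicts False.
        "all_false_baseline": (withheld - true_total) / withheld,
    }


summary = summarize_validation(
    withheld=2375, correct=2372, underpredicted=3, true_total=7)
# percent_correct is about 0.9987, percent_incorrect_true about 0.4286,
# and all_false_baseline about 0.9971.
```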