# pyOpenFEMA Tutorial

This notebook provides a general overview of how to use `pyOpenFEMA` to read datasets from [OpenFEMA](https://www.fema.gov/about/reports-and-data/openfema).
The package acts as a wrapper around the OpenFEMA API to simplify the data reading process and allow for easier exploration of the OpenFEMA datasets.

In [None]:
from pyOpenFEMA import OpenFEMA

Before reading any data, we first need to create an instance of the `OpenFEMA` class, which retrieves the OpenFEMA API metadata.
This metadata is then used to see which datasets exist and what information is available about each.

In [None]:
openfema = OpenFEMA()

Now, let's see which methods we can call on this class.
We can list them, along with their docstrings, by calling `help` on our `OpenFEMA` instance.

In [None]:
help(openfema)

Okay, so it looks like we have three main methods:
 - `list_datasets`, which lists all datasets available on OpenFEMA;
 - `dataset_info`, which prints metadata info on a given dataset; and
 - `read_dataset`, which reads in a given dataset.
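
If the full `help` output is more than you need, a shorter listing can be built with the standard library's `inspect` module. The following is a minimal sketch: `summarize_methods` and the `Demo` class are hypothetical names introduced only for illustration and are not part of `pyOpenFEMA`.

```python
import inspect


def summarize_methods(obj):
    """Return {method_name: first docstring line} for the public methods of obj."""
    summary = {}
    for name, member in inspect.getmembers(obj, predicate=inspect.ismethod):
        if name.startswith('_'):
            continue  # skip private and dunder methods
        doc = inspect.getdoc(member) or ''
        summary[name] = doc.splitlines()[0] if doc else ''
    return summary


class Demo:
    """Hypothetical stand-in class, used only to make this sketch runnable."""

    def list_items(self):
        """List all items."""

    def _helper(self):
        """Internal helper."""


for name, first_line in summarize_methods(Demo()).items():
    print(f'{name}: {first_line}')
```

Calling `summarize_methods(openfema)` on the instance created earlier should produce the same kind of one-line-per-method listing.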

Well, let's see what our options are for possible datasets to read.

In [None]:
openfema.list_datasets()

As we can see, there are numerous datasets that we could read.
Let's start by picking a simple one, say `'FemaRegions'`.
Before reading it in, let's see what the dataset is by getting the info on it.

In [None]:
openfema.dataset_info('FemaRegions')

As seen in the `'description'` key, the dataset provides a list of FEMA Regions, including the address of each region's headquarters, a point that identifies the headquarters' geographic location, and a geometry shape for the region.
Seems like a simple dataset.
Let's go ahead and read it in then.

In [None]:
fema_regions = openfema.read_dataset('FemaRegions')
fema_regions

As expected, there are ten regions plus one additional row giving the location of FEMA Headquarters.
So, the read worked as intended.
There does appear to be missing data in the `loc` and `regionGeometry` columns.
However, this is due to how OpenFEMA groups those columns into the `geometry` column.
Let's go ahead and fill them in ourselves, since `pyOpenFEMA` simply returns the data as OpenFEMA provides it.

In [None]:
from shapely.geometry.collection import GeometryCollection

fema_regions['loc'] = fema_regions['geometry'].apply(
    lambda x: x.geoms[0] if isinstance(x, GeometryCollection) else x
)
fema_regions['regionGeometry'] = fema_regions['geometry'].apply(
    lambda x: x.geoms[1] if isinstance(x, GeometryCollection) else None
)
fema_regions

Nice! This now looks as expected.
From here, we could easily plot the regions or each region's headquarters, or apply any other analysis we want.
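
To see what the split above does in isolation, here is a small self-contained sketch on toy geometries. The coordinates are made up, and `split_geometry` is a hypothetical helper name. Like the cell above, it assumes that each `GeometryCollection` holds the point first and the region shape second.

```python
from shapely.geometry import GeometryCollection, Point, Polygon

# Toy geometries standing in for rows of the `geometry` column (values are made up).
headquarters = Point(0.5, 0.5)
region = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
rows = [
    GeometryCollection([headquarters, region]),  # point and shape grouped together
    Point(2.0, 3.0),                             # point only, no region shape
]


def split_geometry(geom):
    """Return (location, region_shape) for one geometry value."""
    if isinstance(geom, GeometryCollection):
        # Assumed ordering: point first, region shape second.
        return geom.geoms[0], geom.geoms[1]
    return geom, None


for geom in rows:
    loc, shape = split_geometry(geom)
    print(loc.geom_type, None if shape is None else shape.geom_type)
```

A row that holds only a point keeps that point as its location and gets `None` for the region shape, which matches the `else` branches of the two `apply` calls.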

Finally, let's do a more specific data read for a larger dataset.
To see how to do this, let's double check the `read_dataset` method's docstring.

In [None]:
help(openfema.read_dataset)

As we can see, besides specifying the dataset and getting the whole dataset back, we can request specific columns and optionally filter and sort the rows to get a subset of the dataset.
This is beneficial when the full dataset is large and we only need a small part of it.
By specifying the subset, we minimize the data we need to get from OpenFEMA, which decreases read times.
Let's try subsetting the `'FimaNfipPolicies'` dataset, which is a very large dataset.
First, let's double check the dataset's info.

> NOTE:
> Currently, `pyOpenFEMA` reads all datasets using `pandas`.
> This reads all data requested from OpenFEMA by the `read_dataset` call into memory.
> As most datasets on OpenFEMA are relatively small (like the FEMA Regions dataset), reading the full dataset will not be a problem.
> However, for large datasets that are not filtered, this may exceed available memory and become problematic.
> Future updates to the package should include the ability to read larger-than-memory datasets using [dask](https://docs.dask.org/en/stable/dataframe.html) and [pyspark](https://spark.apache.org/docs/latest/api/python/index.html), which allow for distributed computing on such datasets.

In [None]:
openfema.dataset_info("FimaNfipPolicies")

Looking at the dataset size, we can see under the `'distribution'` key that the dataset is >10 GB in size, which is more data than we want to get from OpenFEMA.
To subset this dataset, let's first filter it to keep only policies that are within FEMA Region 1 and cover state-owned buildings.
Next, we only want to look at the policy cost, how much coverage there is, and when the policy expires.
So, let's limit the columns to those along with the state and location of the building (i.e., latitude and longitude).

In [None]:
region1_stateowned_policies = openfema.read_dataset(
    'FimaNfipPolicies',
    filters=[[('femaRegion', 'eq', 1), ('stateOwnedIndicator', 'eq', True)]],
    columns=['policyCost', 'totalBuildingInsuranceCoverage',
             'policyTerminationDate', 'propertyState',
             'latitude', 'longitude'],
)
region1_stateowned_policies
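
For intuition about why this reduces the download, note that the OpenFEMA API accepts OData-style `$filter` and `$select` query parameters, so the filtering can happen on the server. How `pyOpenFEMA` builds its requests internally is not shown in this tutorial. The sketch below is only an illustration under two assumptions: that the outer list of `filters` is OR-ed while each inner list of tuples is AND-ed, and that values are written as OData literals. `odata_literal` and `build_query` are hypothetical helper names.

```python
def odata_literal(value):
    """Format a Python value as an OData literal."""
    if isinstance(value, bool):  # check bool before int, since True is also an int
        return 'true' if value else 'false'
    if isinstance(value, (int, float)):
        return str(value)
    # Strings are single-quoted, with embedded quotes doubled.
    return "'" + str(value).replace("'", "''") + "'"


def build_query(filters=None, columns=None):
    """Build OData-style query parameters from nested filters and a column list.

    Assumption: the outer list is OR-ed and each inner list of
    (column, operator, value) tuples is AND-ed.
    """
    params = {}
    if filters:
        groups = [
            ' and '.join(f'{col} {op} {odata_literal(val)}' for col, op, val in group)
            for group in filters
        ]
        params['$filter'] = ' or '.join(f'({g})' for g in groups)
    if columns:
        params['$select'] = ','.join(columns)
    return params


params = build_query(
    filters=[[('femaRegion', 'eq', 1), ('stateOwnedIndicator', 'eq', True)]],
    columns=['policyCost', 'propertyState'],
)
print(params)
```

Under these assumptions, only rows matching the filter and only the selected columns would need to be sent back, rather than the full dataset.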

As we can see, our subset contains fewer than 1,000 rows with the six columns we requested.
Subsetting therefore let us avoid reading nearly all of the >10 GB full dataset.
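
One way to quantify the in-memory footprint of a result is pandas' `DataFrame.memory_usage`. The sketch below uses a toy frame with the six requested columns. The values are made up, and `subset` is a stand-in name for the DataFrame returned by `read_dataset`.

```python
import pandas as pd

# Toy frame with the six requested columns; the values are made up.
subset = pd.DataFrame({
    'policyCost': [500, 750],
    'totalBuildingInsuranceCoverage': [250000, 100000],
    'policyTerminationDate': pd.to_datetime(['2024-01-01', '2024-06-30']),
    'propertyState': ['MA', 'CT'],
    'latitude': [42.4, 41.8],
    'longitude': [-71.1, -72.7],
})

# deep=True also counts the Python string objects in object-dtype columns.
n_bytes = subset.memory_usage(deep=True).sum()
print(f'{len(subset)} rows x {subset.shape[1]} columns, {n_bytes} bytes in memory')
```

Replacing `subset` with `region1_stateowned_policies` should report the footprint of the actual subset, which can then be compared against the dataset size listed under the `'distribution'` key.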