# Data pre-processing and preparation for analysis

This notebook walks through the steps taken to prepare the full `USGS_Wildland_Fire_Combined_Dataset.json` dataset for analysis (full details of the dataset can be [found here](https://www.sciencebase.gov/catalog/item/61aa537dd34eb622f699df81)). Proper analysis requires several preparation steps, such as extracting data for specific dates (months and years) and projecting the data into an appropriate coordinate system.

## Requirements
1. `USGS_Wildland_Fire_Combined_Dataset.json` dataset file
2. Custom Python modules from the `wildfire` directory, placed at the same level as this notebook.

## Citations
This entire notebook is an adaptation of a code example developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.2 - August 16, 2024

In [6]:
import math, os, json, time, re
import geojson
import geopandas as gpd
from collections.abc import Iterable
from geopy.distance import geodesic
from pyproj import Transformer, Geod
from shapely.geometry import Polygon, Point
from shapely.ops import nearest_points

In [8]:
"""
Set custom python modules in PYTHONPATH to import and use them.
Set path to full dataset .json file.
"""

# Custom python modules
MODULENAME = "wildfire"
MODULEPATH = ""
ppath = os.environ.get('PYTHONPATH')
if ppath:
    MODULEPATH = os.path.join(ppath, MODULENAME)
else:
    # PYTHONPATH is not set, so show a warning message and leave MODULEPATH empty
    print("Looks like you're not using a 'PYTHONPATH' to specify the location of your python user modules.")
    print("You may have to modify the sample code in this notebook to get the documented behaviors.")

DATA_FILENAME = "./GeoJSON_Exports/USGS_Wildland_Fire_Combined_Dataset.json"

# Import custom python methods
from wildfire.Reader import Reader as WFReader

Looks like you're not using a 'PYTHONPATH' to specify the location of your python user modules.
You may have to modify the sample code in this notebook to get the documented behaviors.


In [9]:
print(f"Attempting to open '{DATA_FILENAME}' with wildfire.Reader() object")
wfreader = WFReader(DATA_FILENAME)
print()

Attempting to open './GeoJSON_Exports/USGS_Wildland_Fire_Combined_Dataset.json' with wildfire.Reader() object



In [10]:
"""
Load all features from dataset
"""
feature_list = []
feature_count = 0

wfreader.rewind()
feature = wfreader.next()
while feature:
    feature_list.append(feature)
    feature_count += 1
    if (feature_count % 1500) == 0:
        print(f"Loaded {feature_count} features")
    feature = wfreader.next()

print(f"Loaded a total of {feature_count} features")
print(f"Variable 'feature_list' contains {len(feature_list)} features")

Loaded 1500 features
Loaded 3000 features
Loaded 4500 features
Loaded 6000 features
Loaded 7500 features
Loaded 9000 features
Loaded 10500 features
Loaded 12000 features
Loaded 13500 features
Loaded 15000 features
Loaded 16500 features
Loaded 18000 features
Loaded 19500 features
Loaded 21000 features
Loaded 22500 features
Loaded 24000 features
Loaded 25500 features
Loaded 27000 features
Loaded 28500 features
Loaded 30000 features
Loaded 31500 features
Loaded 33000 features
Loaded 34500 features
Loaded 36000 features
Loaded 37500 features
Loaded 39000 features
Loaded 40500 features
Loaded 42000 features
Loaded 43500 features
Loaded 45000 features
Loaded 46500 features
Loaded 48000 features
Loaded 49500 features
Loaded 51000 features
Loaded 52500 features
Loaded 54000 features
Loaded 55500 features
Loaded 57000 features
Loaded 58500 features
Loaded 60000 features
Loaded 61500 features
Loaded 63000 features
Loaded 64500 features
Loaded 66000 features
Loaded 67500 features
Loaded 69000 fea

### Feature Discussion

Some features are malformed: the ring data that should describe a shapely Polygon or MultiPolygon is missing or not properly formed. So, we need to go through every feature, test whether a Polygon can be built from its ring data, and convert the well-formed ones into actual Polygon objects.
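
A quick structural check can help explain why a given feature fails before it is handed to shapely. The sketch below is a minimal, standard-library-only illustration; the helper name `is_simple_ring` and the three-point minimum are assumptions made for this example, not part of the `wildfire` module. It also assumes that curve segments inside a `curveRings` entry appear as dictionaries mixed in with coordinate pairs, which is one plausible cause of failures but is not confirmed in this notebook.

```python
from numbers import Real


def is_simple_ring(ring, min_points=3):
    """Return True if `ring` is a sequence of at least `min_points` numeric (x, y) pairs.

    Hypothetical helper for illustration; the minimum point count is an assumption.
    """
    if not isinstance(ring, (list, tuple)) or len(ring) < min_points:
        return False
    for point in ring:
        # Anything that is not a coordinate sequence (e.g. a dict describing a curve) fails
        if not isinstance(point, (list, tuple)) or len(point) < 2:
            return False
        # Only the first two values (x, y) are checked; booleans are not valid coordinates
        if not all(isinstance(v, Real) and not isinstance(v, bool) for v in point[:2]):
            return False
    return True


good_ring = [[0, 0], [4, 0], [4, 3], [0, 0]]
curved_ring = [[0, 0], {"c": [[4, 0], [2, 1]]}, [4, 3], [0, 0]]

print(is_simple_ring(good_ring))    # True
print(is_simple_ring(curved_ring))  # False
```

A check like this only reports whether the ring looks like plain coordinates; it does not replace the `Polygon` construction, which can still fail for other reasons.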

#### Assumptions
It is assumed that the first ring of each feature is the largest ring for that feature.
See the comments in the code segment below for where this assumption is applied.
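
The first-ring assumption can be spot-checked by comparing ring areas. The sketch below uses the planar shoelace formula on plain coordinate lists; `ring_area` and `largest_ring_index` are hypothetical helpers written for this illustration, and the sample `rings` are made-up squares rather than data from the dataset. Areas are in squared coordinate units, so this is only meaningful for comparing rings within the same feature.

```python
def ring_area(ring):
    """Unsigned planar area of a ring of (x, y) pairs, via the shoelace formula."""
    total = 0.0
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i][:2]
        x2, y2 = ring[(i + 1) % n][:2]  # wrap around so open rings are closed implicitly
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def largest_ring_index(rings):
    """Index of the ring with the largest area."""
    return max(range(len(rings)), key=lambda i: ring_area(rings[i]))


# Made-up example: the second ring is the larger one, so the assumption would not hold
rings = [
    [[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]],       # unit square
    [[0, 0], [10, 0], [10, 10], [0, 10], [0, 0]],   # 10 x 10 square
]

print(ring_area(rings[0]))        # 1.0
print(largest_ring_index(rings))  # 1
```

Counting how many features return an index other than `0` would show how often the assumption fails, provided the rings contain only plain coordinate pairs.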

In [11]:
"""
Remove bad features from dataset
"""
clean_features = []
bad_features = []
for i, wf_feature in enumerate(feature_list):
    try:
        geometry = wf_feature['geometry']
        if 'rings' in geometry:
            ring_data = geometry['rings'][0] # Assuming first ring is largest
        elif 'curveRings' in geometry:
            ring_data = geometry['curveRings'][0] # Assuming first ring is largest
        else:
            # Without this branch, ring_data would silently keep the previous feature's value
            raise KeyError("geometry has neither 'rings' nor 'curveRings'")
        polygon = Polygon(ring_data)
        clean_features.append(geojson.Feature(geometry=polygon, properties=wf_feature['attributes']))
    except Exception:
        # Any failure (missing keys, malformed ring data) marks the feature as bad
        print(f"Feature {i} has issues, determine what is wrong with data")
        bad_features.append(wf_feature)

Feature 109604 has issues, determine what is wrong with data
Feature 110223 has issues, determine what is wrong with data
Feature 110638 has issues, determine what is wrong with data
Feature 111430 has issues, determine what is wrong with data
Feature 111896 has issues, determine what is wrong with data
Feature 112409 has issues, determine what is wrong with data
Feature 112414 has issues, determine what is wrong with data
Feature 113410 has issues, determine what is wrong with data
Feature 113664 has issues, determine what is wrong with data
Feature 113737 has issues, determine what is wrong with data
Feature 113765 has issues, determine what is wrong with data
Feature 113804 has issues, determine what is wrong with data
Feature 114308 has issues, determine what is wrong with data
Feature 114321 has issues, determine what is wrong with data
Feature 115628 has issues, determine what is wrong with data
Feature 115973 has issues, determine what is wrong with data
Feature 116234 has issue

In [13]:
"""
Load the cleaned data into a geopandas GeoDataFrame.
"""
# Full cleaned dataset in geopandas dataframe
gdf = gpd.GeoDataFrame.from_features(clean_features)

# Filtered dataframe containing features from the year 1961 to 2021
# year_filtered_gdf = gdf[gdf['Fire_Year'] >= 1961]
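
The year filter in the commented-out line above can also be applied to the feature list before a GeoDataFrame is built. The sketch below is a minimal illustration using plain dictionaries; `filter_by_year` is a hypothetical helper, the sample features are made up, and it assumes `Fire_Year` is stored as an integer under each feature's `properties`.

```python
def filter_by_year(features, start_year, end_year, key="Fire_Year"):
    """Keep features whose year attribute falls within [start_year, end_year]."""
    kept = []
    for feature in features:
        year = feature.get("properties", {}).get(key)
        # Skip features with a missing or non-integer year rather than guessing
        if isinstance(year, int) and start_year <= year <= end_year:
            kept.append(feature)
    return kept


# Made-up sample features for illustration only
sample_features = [
    {"properties": {"Fire_Year": 1950}},
    {"properties": {"Fire_Year": 1961}},
    {"properties": {"Fire_Year": 2021}},
    {"properties": {"Fire_Year": None}},
]

recent = filter_by_year(sample_features, 1961, 2021)
print(len(recent))  # 2
```

Filtering with an explicit upper bound keeps the result aligned with the stated 1961 to 2021 range even if later years are added to the dataset.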

In [14]:
"""
Save complete cleaned dataset out in geojson format.
NOTE: The data has yet to be transformed to a different coordinate system.
"""
# full_geojson = 'full_data.geojson'
# gdf.to_file(full_geojson, driver='GeoJSON')

'\nSave complete cleaned dataset out in geojson format.\nNOTE: The data has yet to be transformed to a different coordinate system.\n'

### Next Steps

Now that the data has been cleaned up, go to `visualization.ipynb` to use the cleaned data for creating visualizations.