# Data Acquisition

### Notebook Purpose:

This notebook collects and processes wildfire data from a publicly available dataset retrieved from a US government repository. We filter the dataset, calculate the distance between each wildfire and the city of Muskogee, Oklahoma, and save the result as a separate CSV file for further exploration and research.

### Data Source:

The primary data source for this notebook is the 'Wildfire dataset,' which can be accessed from the following link: Wildfire dataset. It is maintained by a US government agency and records historical wildfire events, including the location, size, year, and other attributes of each wildfire. In this notebook it is loaded from the local file `data/USGS_Wildland_Fire_Combined_Dataset.json`.

### Data Processing Steps:

- **Data Retrieval**: We will start by retrieving the wildfire data from the provided dataset. This step involves downloading and loading the dataset into our analysis environment.
- **Distance Calculation**: After obtaining the wildfire data, we will calculate the geographical distance between each wildfire location and the city of 'Muskogee, Oklahoma.' This analysis will provide us with valuable information about the proximity of wildfires to this specific city.
- **Data Storage**: To facilitate further analysis and research, we will store the resulting dataset, which includes calculated distances, in a separate CSV file. This file will serve as the basis for future investigations and can be easily shared with others.

The resulting dataset records how close each retained wildfire came to Muskogee, Oklahoma. It is intended as a resource for researchers, policymakers, and anyone interested in how wildfires affect a specific region.

In [1]:
# ----------------------- importing necessary libraries ---------------------- #
import json
import re
import pandas as pd
import geojson
from tqdm import tqdm
from wildfire.Reader import Reader as WFReader

# importing custom functions from functions.py
from functions import shortest_distance_from_place_to_fire_perimeter

In [2]:
# ---------------------------- defining constants ---------------------------- #

DATA_FILENAME = "data/USGS_Wildland_Fire_Combined_Dataset.json"

CITY_LOCATIONS = {
    'muskogee' :     {'city'   : 'Muskogee',
                       'latlon' : [35.7479, -95.3697] },
}

In [3]:
# ------------- using the geojson module to open and load the file ------------ #
print(f"Attempting to open '{DATA_FILENAME}'")
# the context manager closes the file even if loading fails
with open(DATA_FILENAME, "r") as geojson_file:
    print(f"Using GeoJSON module to load sample file '{DATA_FILENAME}'")
    gj_data = geojson.load(geojson_file)

# Print the keys of the dictionary
gj_keys = list(gj_data.keys())
print("The loaded JSON dictionary has the following keys:")
print(gj_keys)
print()

Attempting to open 'data/USGS_Wildland_Fire_Combined_Dataset.json'
Using GeoJSON module to load sample file 'data/USGS_Wildland_Fire_Combined_Dataset.json'
The loaded JSON dictionary has the following keys:
['displayFieldName', 'fieldAliases', 'geometryType', 'spatialReference', 'fields', 'features']



In [4]:
# ---------------- filtering data to only include 1963 to 2023 --------------- #

filtered_data = []

all_wildfire = gj_data['features']

for data_point in tqdm(all_wildfire, desc="Filtering Data"):
    if 1963 <= data_point['attributes']['Fire_Year'] <= 2023:
        filtered_data.append(data_point)

num_records = len(filtered_data)
print(f"Number of records after filtering data: {num_records}\n")

Filtering Data: 100%|██████████| 135061/135061 [00:00<00:00, 182872.68it/s]

Number of records after filtering data: 117578






### Calculating Distance Between Wildfires and Muskogee, Oklahoma:

After loading the wildfire data as dictionaries, the next crucial step in the analysis is to determine the geographical distance between each wildfire location and the city of 'Muskogee, Oklahoma.' This distance measurement is essential for understanding the proximity of wildfires to the city and assessing potential risks.

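To perform this task, we use `shortest_distance_from_place_to_fire_perimeter`, a custom function imported from `functions.py`. It is called with the city's latitude/longitude pair and the first ring of a fire's perimeter geometry, and the first element of its return value is the distance used here. Features without a `geometry` entry containing `rings` are skipped.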
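The kind of calculation involved can be sketched as follows. This is not the implementation in `functions.py`, which is not shown in this notebook. It is a minimal sketch that swaps in the haversine formula and assumes the ring vertices are already (longitude, latitude) pairs in decimal degrees; if the dataset's `spatialReference` is a projected coordinate system, the vertices would first need converting. The names `haversine_miles`, `nearest_vertex_distance`, and `EARTH_RADIUS_MILES` are hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an assumed constant for this sketch


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def nearest_vertex_distance(place_latlon, ring_lonlat):
    """Hypothetical helper: return (distance, vertex) for the ring vertex closest to the place.

    Assumes ring vertices are (lon, lat) pairs in decimal degrees.
    """
    best = None
    for lon, lat in ring_lonlat:
        d = haversine_miles(place_latlon[0], place_latlon[1], lat, lon)
        if best is None or d < best[0]:
            best = (d, (lon, lat))
    return best


# Made-up ring vertices, for illustration only
example_ring = [(-96.0, 36.0), (-100.0, 40.0), (-95.4, 35.8)]
distance, vertex = nearest_vertex_distance([35.7479, -95.3697], example_ring)
```

Returning a tuple mirrors how the notebook reads the distance from index `0` of the result. Checking only the vertices is an approximation: the closest point on a perimeter can lie along an edge between two vertices.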

### Storing Calculated Distances:

Once the distance is calculated for each fire in relation to Muskogee, Oklahoma, it is stored within the attributes of the fire feature as a new attribute, 'Distance_From_Muskogee'. The value is kept in whatever unit the distance function returns. Only fires whose distance is at most 1250 in that unit are retained; the retained features are then written to `data/muskogee_wildfire.json`.

By including these distances as attributes in the dataset, we ensure that the distance information is readily available for further analysis and visualization. Researchers and analysts can utilize this data to assess the potential impact of wildfires on Muskogee, evaluate risk factors, and make informed decisions regarding fire management and safety measures.

In [5]:
# -------------- calculating distance from the fire to the city -------------- #

# get the location (city coordinates) you want to calculate the distance from
place = CITY_LOCATIONS["muskogee"]

# features that fall within the distance threshold of the city
muskogee_wildfire = []

# loop through each wildfire feature in filtered_data
for wf_feature in tqdm(filtered_data):
    if 'geometry' in wf_feature and 'rings' in wf_feature['geometry']:
        distance = shortest_distance_from_place_to_fire_perimeter(place['latlon'], wf_feature['geometry']['rings'][0])
        if distance[0] <= 1250:
            wf_feature['attributes']['Distance_From_Muskogee'] = distance[0]
            muskogee_wildfire.append(wf_feature)
print(len(muskogee_wildfire))

100%|██████████| 117578/117578 [4:40:50<00:00,  6.98it/s] 

73705





In [6]:
with open('data/muskogee_wildfire.json', 'w') as json_file:
    json.dump(muskogee_wildfire, json_file)

### Attribute Selection for Smoke Estimation and Predictive Analysis:

Once we have calculated and stored the distances between wildfires and the city of 'Muskogee, Oklahoma,' the next critical step is to prepare the dataset for smoke estimation and predictive analysis. To achieve this, we need to refine the dataset by eliminating unnecessary attributes and retaining only those that are instrumental in our objective.

### Retention of Relevant Attributes:

1. Distance Attribute: The 'Distance_From_Muskogee' attribute, which we calculated earlier, is a crucial feature for estimating the proximity of wildfires to the city. This attribute directly contributes to our smoke estimation and predictive analysis.

2. Wildfire Characteristics: Attributes that describe the characteristics of the wildfires, such as fire size, fire type, and year of occurrence, are valuable for understanding how these factors relate to smoke levels and their impact.
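As a small illustration of this selection step, the sketch below keeps only a chosen set of keys from one feature's attributes. The helper `select_attributes` and the sample feature are hypothetical and the values are made up. Unlike a plain dictionary lookup, this version fills a missing key with `None` rather than raising `KeyError`, and it leaves the original feature unmodified.

```python
def select_attributes(feature, keep):
    """Return a new attributes dict containing only the keys in `keep`.

    Keys missing from the feature are filled with None instead of raising KeyError.
    """
    attrs = feature.get("attributes", {})
    return {key: attrs.get(key) for key in keep}


# Hypothetical feature with made-up values, for illustration only
sample_feature = {
    "attributes": {
        "OBJECTID": 1,
        "Assigned_Fire_Type": "Wildfire",
        "Fire_Year": 1990,
        "GIS_Acres": 120.5,
        "Distance_From_Muskogee": 300.25,
        "Extra_Field": "not needed",
    },
    "geometry": {"rings": [[[0, 0], [1, 1], [0, 1]]]},
}

keep = [
    "OBJECTID",
    "Assigned_Fire_Type",
    "Fire_Year",
    "GIS_Acres",
    "Listed_Fire_Dates",
    "Distance_From_Muskogee",
]
slim = select_attributes(sample_feature, keep)
```

Here `slim` has exactly the six requested keys, with `Listed_Fire_Dates` set to `None` because the sample feature does not define it.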

### Importance of Attribute Selection:

The process of attribute selection is critical as it directly influences the quality and effectiveness of our smoke estimator and predictive analysis. By retaining only the most relevant attributes, we reduce data noise and enhance the interpretability of our models. This streamlined dataset allows us to focus on the key factors that affect smoke levels and wildfire behavior in the context of Muskogee, Oklahoma.

The selected attributes serve as the foundation for building predictive models, conducting statistical analyses, and generating insights that can be used to assess the potential impact of wildfires on air quality and make informed decisions regarding public safety and environmental protection.

In [7]:
# -------------------- keeping only necessary attributes --------------------- #

necessary_attributes = [
    'OBJECTID',
    'Assigned_Fire_Type',
    'Fire_Year',
    'GIS_Acres',
    'Listed_Fire_Dates',
    'Distance_From_Muskogee'
]

filtered_data_list = []
for original_data in muskogee_wildfire:
    # remove the 'geometry' key from the dictionary
    if 'geometry' in original_data:
        del original_data['geometry']

    # create a new dictionary with only the necessary attributes within 'attributes'
    filtered_attributes = {key: original_data['attributes'][key] for key in necessary_attributes}

    # update the 'attributes' key with the filtered attributes
    original_data['attributes'] = filtered_attributes

    # append the modified dictionary to the result list
    filtered_data_list.append(original_data)


### Extracting Wildfire Dates:

We also extract a single date for each wildfire from the free-text 'Listed_Fire_Dates' attribute. Having a date lets later analysis categorize wildfires by whether they fall within the annual fire season, the period of the year in which wildfires are more likely to occur, which is important for understanding seasonal patterns of wildfire activity.
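A minimal sketch of how an extracted date might be checked against a fire season is shown below. The helpers `first_date` and `in_fire_season` are hypothetical, the sample string is made up, and the default season bounds (May 1 to October 31) are placeholder assumptions, since this notebook does not state the season's start and end.

```python
import re
from datetime import date

# Matches the first ISO-style date (YYYY-MM-DD) in a string
DATE_PATTERN = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def first_date(text):
    """Return the first YYYY-MM-DD date in `text` as a date, or None if absent or invalid."""
    match = DATE_PATTERN.search(text)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # e.g. month 13 or day 40
        return None


def in_fire_season(d, start=(5, 1), end=(10, 31)):
    """True if date `d` falls between the (month, day) bounds, inclusive.

    The default bounds are placeholders, not values taken from this notebook.
    """
    return start <= (d.month, d.day) <= end


# Made-up input string, for illustration only
sample = "Prescribed Fire Start Date: 2018-05-02"
parsed = first_date(sample)
is_in_season = parsed is not None and in_fire_season(parsed)
```

Comparing `(month, day)` tuples keeps the check independent of the year, so the same bounds apply to every fire in the dataset. This only works for a season that does not wrap around the new year.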

In [32]:
# define regular expression patterns to match the dates listed in 'Listed_Fire_Dates'
ignition_pattern = r'Listed Other Fire Date\(s\): (\d{4}-\d{2}-\d{2})'

prescribed_pattern = r'Prescribed Fire Start Date: (\d{4}-\d{2}-\d{2})'

# process each dictionary in the list
for original_data in filtered_data_list:
    if isinstance(original_data['attributes']['Listed_Fire_Dates'], str):
        # search for the ignition date pattern in the 'Listed_Fire_Dates' string
        ignition_match = re.search(ignition_pattern, original_data['attributes']['Listed_Fire_Dates'])

        prescribed_match = re.search(prescribed_pattern, original_data['attributes']['Listed_Fire_Dates'])

        # replace the 'Listed_Fire_Dates' string with the matched "other fire" date if found
        if ignition_match:
            ignition_date = ignition_match.group(1)  # extract the matched date
            original_data['attributes']['Listed_Fire_Dates'] = ignition_date

        # a prescribed fire start date, if present, overrides the date set above
        if prescribed_match:
            prescribed_date = prescribed_match.group(1)  # extract the matched prescribed fire start date
            original_data['attributes']['Listed_Fire_Dates'] = prescribed_date

In [33]:
# ------------ converting to a pandas dataframe and storing as csv ----------- #

df = pd.DataFrame([d['attributes'] for d in filtered_data_list])

In [35]:
len(df)

73705

In [37]:
df['GIS_Acres'] = df['GIS_Acres'].round(2)
df['Distance_From_Muskogee'] = df['Distance_From_Muskogee'].round(2)

In [38]:
df.head()

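|   | OBJECTID | Assigned_Fire_Type | Fire_Year | GIS_Acres | Listed_Fire_Dates | Distance_From_Muskogee |
|---|----------|--------------------|-----------|-----------|-------------------|------------------------|
| 0 | 14302 | Wildfire | 1963 | 10395.01 | 1963-08-06 | 1198.52 |
| 1 | 14303 | Wildfire | 1963 | 9983.61 | 1963-08-06 | 1248.93 |
| 2 | 14304 | Wildfire | 1963 | 9674.18 | 1963-12-31 | 1122.08 |
| 3 | 14305 | Wildfire | 1963 | 4995.91 | 2018-05-02 | 623.75 |
| 4 | 14306 | Wildfire | 1963 | 4995.25 | 2018-05-02 | 635.01 |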


In [39]:
df.to_csv("data/muskogee.csv")