# Smoke Estimator

In the notebook, we leverage the Wildfire data specific to Muskogee, Oklahoma, that has been previously stored. The primary objective is to develop a smoke estimator, a tool designed to assess and predict the potential impact of wildfires on air quality, particularly with regard to the production of smoke.

The initial steps involve loading the saved Wildfire data into the notebook. Subsequently, pre-processing tasks are undertaken to ensure data integrity and relevance. These tasks include verifying the 'Listed_Fire_Dates' column to guarantee the presence of accurate date values and converting it to a datetime format. Further refinement is achieved by extracting only the month information, aligning with the focus on the annual fire season from May 1st to October 31st.

To tailor the dataset for the smoke estimator's requirements, a filtering process is implemented to retain only the rows corresponding to the specified months. Additionally, any extraneous or redundant columns are dropped, streamlining the dataset and optimizing it for the subsequent smoke estimation process.

In [1]:
# ----------------------- importing necessary libraries ---------------------- #

import pandas as pd
from tqdm import tqdm
import math

In [2]:
# ----------------------------- reading the data ----------------------------- #

df = pd.read_csv("data/muskogee.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,OBJECTID,Assigned_Fire_Type,Fire_Year,GIS_Acres,Listed_Fire_Dates,Distance_From_Muskogee
0,0,14302,Wildfire,1963,10395.01,1963-08-06,1198.52
1,1,14303,Wildfire,1963,9983.61,1963-08-06,1248.93
2,2,14304,Wildfire,1963,9674.18,1963-12-31,1122.08
3,3,14305,Wildfire,1963,4995.91,2018-05-02,623.75
4,4,14306,Wildfire,1963,4995.25,2018-05-02,635.01


In [3]:
len(df)

73705

### Pre-Processing Steps

Before diving into the smoke estimator process, it's crucial to perform some pre-processing steps to ensure the quality and relevance of the data. Although dates were extracted from strings during data storage, it's prudent to double-check for any potential oversights, ensuring that the 'Listed_Fire_Dates' column exclusively contains date values.

Following this verification, the next step involves converting the 'Listed_Fire_Dates' column to a datetime format. Subsequently, we extract only the month information, as the focus is on the annual fire season spanning from May 1st to October 31st. This ensures that the data is limited to the specific months relevant for the smoke estimator.

After extracting the month information, the dataset is filtered to retain only the rows corresponding to the specified months of the annual fire season. To streamline the dataset further and remove unnecessary columns, a subsequent step involves dropping any columns that are deemed irrelevant for the smoke estimator process.

In summary, these pre-processing steps aim to enhance the dataset's suitability for the smoke estimator, ensuring that it solely contains valid date values, is restricted to the relevant months, and is devoid of unnecessary columns.

In [4]:
# ---------------------------- preprocessing steps --------------------------- #

# making sure there are no strings in the column
df['Listed_Fire_Dates'] = df['Listed_Fire_Dates'].str.extract(r'(\d{4}-\d{2}-\d{2})')

# creating the column for the month
df['Listed_Fire_Month'] = pd.to_datetime(df['Listed_Fire_Dates']).dt.strftime('%m')

In [5]:
df.isnull().sum()

Unnamed: 0                   0
OBJECTID                     0
Assigned_Fire_Type           0
Fire_Year                    0
GIS_Acres                    0
Listed_Fire_Dates         7402
Distance_From_Muskogee       0
Listed_Fire_Month         7402
dtype: int64

Given the presence of missing values in the 'Listed_Fire_Dates' column, it is imperative to address these gaps to ensure the accuracy and reliability of the data, particularly for the assessment of the fire season. To mitigate any potential impact on the analysis, the decision is made to drop rows with missing values in the date column.

By removing the rows with missing date values, we aim to streamline the dataset and create a more robust foundation for the subsequent steps in the analysis. This ensures that the dataset used for the smoke estimator exclusively contains entries with valid and complete date information, contributing to the accuracy of the overall assessment.

In [6]:
# dropping the null values
df = df.dropna()

# converting the column to int
df['Listed_Fire_Month'] = df['Listed_Fire_Month'].astype(int)
df['Distance_From_Muskogee'] = df['Distance_From_Muskogee'].astype(int)

# filtering data to have only annual fire season data
df = df[(df['Listed_Fire_Month'] >= 5) & (df['Listed_Fire_Month'] <= 10)]

# dropping unnecessary columns
df = df.drop(['Unnamed: 0', 'Listed_Fire_Dates', 'Listed_Fire_Month'], axis = 1)

In [7]:
df.head(10)

Unnamed: 0,OBJECTID,Assigned_Fire_Type,Fire_Year,GIS_Acres,Distance_From_Muskogee
0,14302,Wildfire,1963,10395.01,1198
1,14303,Wildfire,1963,9983.61,1248
3,14305,Wildfire,1963,4995.91,623
4,14306,Wildfire,1963,4995.25,635
9,14317,Wildfire,1963,2144.29,1165
14,14325,Wildfire,1963,1544.8,1155
16,14327,Wildfire,1963,1460.36,1248
17,14329,Wildfire,1963,3511.3,1247
19,14332,Wildfire,1963,997.97,1249
28,14363,Wildfire,1963,400.26,1175


In the process of preparing the data for the smoke estimator, an additional step involves categorizing the 'Assigned_Fire_Type' column. This categorization facilitates the creation of the smoke estimator, providing numerical values instead of strings for ease of computation. In this specific scenario, binary values are assigned, where 'Wildfire' is represented as 1, and all other fire types are considered as 0.

The objective of this categorization is to convert the 'Assigned_Fire_Type' column into a binary indicator, simplifying the classification task for the smoke estimator. This binary representation enables a straightforward differentiation between wildfires and other types of fires, streamlining the subsequent modeling process.

In [8]:
df['Assigned_Fire_Type'].value_counts()

Wildfire                            28153
Prescribed Fire                      5764
Likely Wildfire                       901
Unknown - Likely Prescribed Fire      848
Unknown - Likely Wildfire             143
Name: Assigned_Fire_Type, dtype: int64

In [9]:
# keep only rows that have 'Assigned_Fire_Type' as 'Wildfire'
df = df[df['Assigned_Fire_Type'] == 'Wildfire']

In [10]:
num_years = df['Fire_Year'].value_counts().nunique()
num_rows = len(df)

print(f"Number of years: {num_years}")
print(f"Number of rows: {num_rows}")

Number of years: 57
Number of rows: 28153


Out of the initial dataset, which comprised 73,705 rows from the wildfire data, a refined subset has been derived by focusing on the annual fire season. This refined subset encompasses 35,809 rows, signifying the instances where the 'Listed_Fire_Dates' fall within the specified timeframe of the annual fire season—from May 1st to October 31st.

This reduction in the number of rows is a deliberate step to narrow down the dataset to the periods deemed relevant for the smoke estimator. By concentrating on the annual fire season, the analysis becomes more targeted and aligned with the specific timeframe of interest.

The subset of 35,809 rows serves as the foundation for subsequent steps in the development and implementation of the smoke estimator. This refined dataset ensures that the analysis is centered around the periods when wildfires are more prevalent, contributing to a more accurate assessment of potential smoke impact during the specified timeframe.

# Smoke Estimator

Creating a smoke impact estimator formula based on the given attributes can provide valuable insights into the potential impact of wildfires on air quality. To develop such a formula, the following approach is considered:

**Smoke Impact Estimator:**

The smoke impact estimator aims to provide an estimate of the potential smoke impact based on relevant attributes. In this case, we can use the following attributes:

1. **GIS_Acres:** The size of the wildfire, measured in acres, can be an indicator of the potential smoke output. Larger wildfires tend to produce more smoke.

2. **Assigned_Fire_Type:** Different fire types, such as wildfires, prescribed fire, or likely wildfire, may produce varying levels of smoke. This attribute can help categorize the type of fire.

3. **Distance_From_Muskogee:** The calculated distance from Muskogee, Oklahoma, can be an essential factor. Wildfires in closer proximity to the city may have a more significant impact on air quality.

**Proximity Impact**:

Calculating proximity impact often involves assessing the influence of a variable representing proximity, such as 'Distance_From_Muskogee', on a given outcome, such as smoke impact. The proximity impact can be calculated by considering the distance and assigning weights to it.

In [11]:
# this parameter controls the impact decay rate in the proximity calculation.
# adjusting this factor allows you to control how quickly the impact decreases with distance.
decay_rate = 0.05

# iterate through each row (fire record) in the DataFrame df
for index, row in tqdm(df.iterrows(), total=df.shape[0]):

    # extract the distance from Muskogee for the current fire record
    dist = row['Distance_From_Muskogee']

    # calculate the proximity impact using an exponential decay function
    proximity = math.exp(-decay_rate * dist)

    # extract the size (GIS_Acres) and fire type (Assigned_Fire_Type) of the current fire record
    size = row['GIS_Acres']
    fire_type = row['Assigned_Fire_Type']

    # calculate the smoke impact for the current fire record
    df.at[index, 'Smoke_Impact'] = proximity * size

100%|██████████| 28153/28153 [00:00<00:00, 32072.81it/s]


In [12]:
df.head(10)

Unnamed: 0,OBJECTID,Assigned_Fire_Type,Fire_Year,GIS_Acres,Distance_From_Muskogee,Smoke_Impact
0,14302,Wildfire,1963,10395.01,1198,1.005971e-22
1,14303,Wildfire,1963,9983.61,1248,7.930708e-24
3,14305,Wildfire,1963,4995.91,623,1.480272e-10
4,14306,Wildfire,1963,4995.25,635,8.122831e-11
9,14317,Wildfire,1963,2144.29,1165,1.080513e-22
14,14325,Wildfire,1963,1544.8,1155,1.2834120000000001e-22
16,14327,Wildfire,1963,1460.36,1248,1.16007e-24
17,14329,Wildfire,1963,3511.3,1247,2.93229e-24
19,14332,Wildfire,1963,997.97,1249,7.540968e-25
28,14363,Wildfire,1963,400.26,1175,1.2233240000000001e-23


# Final Dataset 

We use the pandas groupby function to group the DataFrame by the 'Year' column, which presumably contains the year information for each record. After grouping, we calculate the average (mean) of the 'smoke_impact' column for each year. This allows us to analyze and observe the yearly average impact of smoke based on the provided data. The resulting DataFrame, smoke_impact_yearly, provides a clear representation of how the smoke impact has varied on an annual basis.

In [13]:
final_df = df.copy()
final_df = final_df.drop(['OBJECTID', 'Assigned_Fire_Type'], axis = 1)
final_df.head()

Unnamed: 0,Fire_Year,GIS_Acres,Distance_From_Muskogee,Smoke_Impact
0,1963,10395.01,1198,1.005971e-22
1,1963,9983.61,1248,7.930708e-24
3,1963,4995.91,623,1.480272e-10
4,1963,4995.25,635,8.122831e-11
9,1963,2144.29,1165,1.080513e-22


In [14]:
final_df = final_df.groupby('Fire_Year').mean().reset_index()

In [15]:
final_df.head(10)

Unnamed: 0,Fire_Year,GIS_Acres,Distance_From_Muskogee,Smoke_Impact
0,1963,457.405426,773.212766,2.897483e-12
1,1964,1061.792,788.870588,7.772437e-12
2,1965,497.778571,878.755102,6.465108e-09
3,1966,1416.721204,896.296296,7.751541e-13
4,1967,698.545946,823.540541,3.380762e-12
5,1968,1052.758043,1040.978261,0.0007710199
6,1969,332.855294,931.470588,2.858886e-12
7,1970,2188.156283,1003.920354,6.235783e-08
8,1971,2475.594122,956.763514,5.350388e-11
9,1972,918.108182,1039.681818,9.8768e-14


In [16]:
final_df.to_csv("data/smoke-estimators.csv")