# Part 1 - Common Analysis | Wildfire Data Analysis

Common Analysis is the first step of our broader project, which focuses on the impact of wildfires in the US. In this common analysis, we first analyze in detail the fires near our allotted city and come up with a smoke estimate. This is a rough estimate of the wildfire impact on the city, which may have profound negative effects on health, tourism, property, and other aspects of society.

In this notebook we process all the data that we have pulled thus far. The data is divided into 3 themes:
1. Distance data - Files that contain the wildfires and their corresponding distance from the assigned city (Bismarck, North Dakota)
2. Wildfire attributes data - File that contains attributes of each wildfire, such as when it burned, its area, its intensity, etc.
3. AQI data - The air quality index data from the US EPA

Before we proceed to the smoke estimation, we need to filter and preprocess the data to be of an appropriate format.

# Setup

We first set the working dependencies and constants that are required to process the data. 

The setup contains the following steps

1. Import all relevant packages
2. Define all the relevant constants that will be used throughout the script.

In [1]:
# import packages
import re
import warnings
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
# Ignore all warnings
warnings.filterwarnings('ignore')

In [2]:
# load the input files that are used throughout the notebook

distance_df = pd.read_csv(r"C:\Users\shwet\Documents\local-wildfire-project\Data512-WildFire-Project\02_data\02_intermediate_data\distance.csv")
wildfire_feature_df = pd.read_csv(r"C:\Users\shwet\Documents\local-wildfire-project\Data512-WildFire-Project\02_data\02_intermediate_data\wildfire_attributes.csv")
aqi_data_df = pd.read_csv(r"C:\Users\shwet\Documents\local-wildfire-project\Data512-WildFire-Project\02_data\02_intermediate_data\AQI_DataPull_1963_2023_monthly.csv",index_col=0)

# Step 1 - Preprocess data

In this step we sanitize the data and join the distance and attributes of a particular wildfire. We also filter the data for the constraints provided.

Your smoke estimate should adhere to the following conditions:

1. The estimate only considers the last 60 years of wildland fires (1963-2023).
2. The estimate only considers fires that are within 1250 miles of your assigned city.
3. An annual fire season will run from May 1st through October 31st.
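
The three conditions can be written as boolean masks. The sketch below is illustrative only: the toy rows and the `discovery_date` column are hypothetical (the real data stores dates inside the free-text `Listed_Fire_Dates` field), and the cells that follow apply the distance condition explicitly.

```python
import pandas as pd

# Hypothetical toy data; `discovery_date` is an assumed column for illustration.
fires = pd.DataFrame({
    "OBJECTID": [1, 2, 3, 4],
    "Fire_Year": [1960, 1985, 1999, 2010],
    "shortest_dist": [300.0, 1400.0, 800.0, 500.0],
    "discovery_date": pd.to_datetime(
        ["1960-07-01", "1985-08-15", "1999-12-05", "2010-06-20"]
    ),
})

in_period = fires["Fire_Year"].between(1963, 2023)            # condition 1
in_range = fires["shortest_dist"] < 1250                      # condition 2
in_season = fires["discovery_date"].dt.month.between(5, 10)   # condition 3: May 1 to Oct 31

selected = fires[in_period & in_range & in_season]
selected_ids = selected["OBJECTID"].tolist()
print(selected_ids)
```

Only the last toy fire satisfies all three conditions; each of the others fails exactly one of them.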

### Preprocess wildfire data

In [3]:
# filtering data for wildfires within 1250 miles of our assigned city
closest_wf_df = distance_df.loc[distance_df['shortest_dist']< 1250]

# Dimensions of the data
print(closest_wf_df.shape[0], len(closest_wf_df.columns))

# Sample
closest_wf_df.head()

110886 2


Unnamed: 0,OBJECTID,shortest_dist
0,1,1043.5
1,2,1044.67
2,3,1046.36
3,4,714.14
4,5,685.97


In [4]:
# sample of the wildfire attributes
wildfire_feature_df.head()

Unnamed: 0,OBJECTID,USGS_Assigned_ID,Assigned_Fire_Type,Fire_Year,Fire_Polygon_Tier,Fire_Attribute_Tiers,GIS_Acres,GIS_Hectares,Source_Datasets,Listed_Fire_Types,...,Processing_Notes,Wildfire_Notice,Prescribed_Burn_Notice,Wildfire_and_Rx_Flag,Overlap_Within_1_or_2_Flag,Circleness_Scale,Circle_Flag,Exclude_From_Summary_Rasters,Shape_Length,Shape_Area
0,14299,14299,Wildfire,1963,1,"1 (1), 3 (3)",40992.458271,16589.059302,Comb_National_NIFC_Interagency_Fire_Perimeter_...,"Wildfire (1), Likely Wildfire (3)",...,,Wildfire mapping prior to 1984 was inconsisten...,Prescribed fire data in this dataset represent...,,,0.385355,,No,73550.428118,165890600.0
1,14300,14300,Wildfire,1963,1,"1 (1), 3 (3)",25757.090203,10423.524591,Comb_National_NIFC_Interagency_Fire_Perimeter_...,"Wildfire (2), Likely Wildfire (2)",...,,Wildfire mapping prior to 1984 was inconsisten...,Prescribed fire data in this dataset represent...,,,0.364815,,No,59920.576713,104235200.0
2,14301,14301,Wildfire,1963,1,"1 (5), 3 (15), 5 (1)",45527.210986,18424.208617,Comb_National_NIFC_Interagency_Fire_Perimeter_...,"Wildfire (6), Likely Wildfire (15)",...,,Wildfire mapping prior to 1984 was inconsisten...,Prescribed fire data in this dataset represent...,,,0.320927,,No,84936.82781,184242100.0
3,14302,14302,Wildfire,1963,1,"1 (1), 3 (3), 5 (1)",10395.010334,4206.711433,Comb_National_NIFC_Interagency_Fire_Perimeter_...,"Wildfire (2), Likely Wildfire (3)",...,,Wildfire mapping prior to 1984 was inconsisten...,Prescribed fire data in this dataset represent...,,,0.428936,,No,35105.903602,42067110.0
4,14303,14303,Wildfire,1963,1,"1 (1), 3 (3)",9983.605738,4040.2219,Comb_National_NIFC_Interagency_Fire_Perimeter_...,"Wildfire (1), Likely Wildfire (3)",...,,Wildfire mapping prior to 1984 was inconsisten...,Prescribed fire data in this dataset represent...,,,0.703178,,No,26870.456126,40402220.0


In [5]:
wildfire_feature_df['Listed_Fire_Dates'].values[0]

'Listed Wildfire Discovery Date(s): 1963-08-06 (3) | Listed Wildfire Controlled Date(s): 1963-12-31 (3)'

In [6]:
wildfire_feature_df['Listed_Fire_Dates'].isnull().sum()

8423

In [7]:
# checking all the attributes related to the wildfire
wildfire_feature_df.columns

Index(['OBJECTID', 'USGS_Assigned_ID', 'Assigned_Fire_Type', 'Fire_Year',
       'Fire_Polygon_Tier', 'Fire_Attribute_Tiers', 'GIS_Acres',
       'GIS_Hectares', 'Source_Datasets', 'Listed_Fire_Types',
       'Listed_Fire_Names', 'Listed_Fire_Codes', 'Listed_Fire_IDs',
       'Listed_Fire_IRWIN_IDs', 'Listed_Fire_Dates', 'Listed_Fire_Causes',
       'Listed_Fire_Cause_Class', 'Listed_Rx_Reported_Acres',
       'Listed_Map_Digitize_Methods', 'Listed_Notes', 'Processing_Notes',
       'Wildfire_Notice', 'Prescribed_Burn_Notice', 'Wildfire_and_Rx_Flag',
       'Overlap_Within_1_or_2_Flag', 'Circleness_Scale', 'Circle_Flag',
       'Exclude_From_Summary_Rasters', 'Shape_Length', 'Shape_Area'],
      dtype='object')

In [8]:
# filtering the data for the columns we are interested in 
filtered_col = ['OBJECTID','Fire_Year', 'GIS_Acres','Shape_Area','Overlap_Within_1_or_2_Flag','Listed_Fire_Dates']
wildfire_subset_df = wildfire_feature_df[filtered_col]
# sample
wildfire_subset_df.head()

Unnamed: 0,OBJECTID,Fire_Year,GIS_Acres,Shape_Area,Overlap_Within_1_or_2_Flag,Listed_Fire_Dates
0,14299,1963,40992.458271,165890600.0,,Listed Wildfire Discovery Date(s): 1963-08-06 ...
1,14300,1963,25757.090203,104235200.0,,Listed Wildfire Discovery Date(s): 1963-07-28 ...
2,14301,1963,45527.210986,184242100.0,,Listed Wildfire Discovery Date(s): 1963-08-06 ...
3,14302,1963,10395.010334,42067110.0,,Listed Wildfire Discovery Date(s): 1963-08-06 ...
4,14303,1963,9983.605738,40402220.0,,Listed Wildfire Discovery Date(s): 1963-08-06 ...


Now that we have shortlisted the attributes for the smoke estimate, we join the distance information with the wildfire attributes to get the final dataset.

In [9]:
# merging the data
merged_df = pd.merge(wildfire_subset_df,closest_wf_df, on = 'OBJECTID', how = 'inner')
# sample
merged_df.head()

Unnamed: 0,OBJECTID,Fire_Year,GIS_Acres,Shape_Area,Overlap_Within_1_or_2_Flag,Listed_Fire_Dates,shortest_dist
0,14299,1963,40992.458271,165890600.0,,Listed Wildfire Discovery Date(s): 1963-08-06 ...,782.41
1,14300,1963,25757.090203,104235200.0,,Listed Wildfire Discovery Date(s): 1963-07-28 ...,801.32
2,14301,1963,45527.210986,184242100.0,,Listed Wildfire Discovery Date(s): 1963-08-06 ...,770.09
3,14302,1963,10395.010334,42067110.0,,Listed Wildfire Discovery Date(s): 1963-08-06 ...,770.83
4,14303,1963,9983.605738,40402220.0,,Listed Wildfire Discovery Date(s): 1963-08-06 ...,776.87


In [10]:
# dimensions of the resultant dataframe
print(merged_df.shape[0], len(merged_df.columns))

95846 7


In [11]:
# dropping duplicates
merged_df = merged_df.drop_duplicates()

In [12]:
# dimensions of the resultant dataframe
print(merged_df.shape[0], len(merged_df.columns))

95846 7


### Preprocess AQI data

In [13]:
# Dimensions of the data
print(aqi_data_df.shape[0], len(aqi_data_df.columns))

# Sample
aqi_data_df.head()

316 3


Unnamed: 0,date,aqi,pollutant_id
0,1998-01-31,5.0,
1,1998-02-28,6.666667,
2,1998-03-31,6.8,
3,1998-04-30,7.6,
4,1998-05-31,12.6,


In [14]:
# subsetting relevant columns
aqi_data_df = aqi_data_df[['date','aqi']]
# Sample
aqi_data_df.head()

Unnamed: 0,date,aqi
0,1998-01-31,5.0
1,1998-02-28,6.666667
2,1998-03-31,6.8
3,1998-04-30,7.6
4,1998-05-31,12.6


In [15]:
# checking the granularity of the data (number of rows vs. number of unique dates)
print(aqi_data_df.shape[0],aqi_data_df['date'].nunique())

316 309


In [16]:
# checking null values
aqi_data_df.isnull().sum()

date    0
aqi     0
dtype: int64

In [17]:
# Converting 'date' column to DateTime format
aqi_data_df['date'] = pd.to_datetime(aqi_data_df['date'])

# aggregating data to get accurate monthly and yearly AQI estimates
agg_data = pd.DataFrame(aqi_data_df.groupby('date').mean()).reset_index()

# Rename the 'date' column to 'full_date' to avoid conflicts
agg_data.rename(columns={'date': 'full_date'}, inplace=True)

# Extract year from the 'full_date' column
agg_data['year'] = agg_data['full_date'].dt.year


# Aggregate at a yearly level
agg_data_yearly = agg_data.groupby(agg_data['year'])['aqi'].mean().reset_index()

# Display the resulting DataFrame with year as the first column
agg_data_yearly

Unnamed: 0,year,aqi
0,1998,7.947222
1,1999,14.836017
2,2000,13.636348
3,2001,20.568655
4,2002,16.386569
5,2003,13.982835
6,2004,19.127736
7,2005,19.968622
8,2006,19.633682
9,2007,16.032216


# Step 2 - Calculate Smoke Estimate

#### Description
Creating a smoke estimate requires us to consider several aspects of a wildfire, such as fire size, intensity, and distance from the city. The approach I have used to create the smoke estimate is described below.

**Note** - I have considered both prescribed fires and wildfires, since both contribute to gaseous and particulate pollution despite a difference in magnitude.

1. Select relevant variables:
The attributes of the fires were carefully picked to provide the most information about the smoke estimate. The factors I considered are:
- Size of the fire  
- Distance of the fire from the city
- Intensity of the fire
- Overlap Component
    
    
2. Define a function for estimating:
Create a formula or model that combines these variables to estimate smoke impact. This formula assumes that larger fires closer to the city and with higher intensity produce more smoke.
The formula I have used is:

Smoke Impact = (Fire Size / Distance) * Fire Intensity * (1+ Overlap Component)

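where:
- Fire Size = Area that the fire has covered
- Distance = Shortest distance of the fire from the assigned city
- Fire Intensity = The shape area of the fire combined with the number of days the fire lasted
- Overlap Component = A factor derived from Overlap_Within_1_or_2_Flag, combining the year difference, the percentage of overlap, and the area of overlap with an earlier fire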

3. Amortization Over the Fire Season:
I estimate the smoke impact cumulatively over the fire season, since there is no clear information about the duration or the time at which each fire was burning.

4. Apply the Estimate:
I will then apply my formula to calculate the smoke impact for each fire within 1250 miles from my city during the annual fire season.

5. Aggregate Annual Estimates:
I will then average the smoke impact estimates for each year to get an annual estimate.

6. Data Validation:
Another problem is understanding how good or bad the smoke estimate might be. The estimate will be compared to the available AQI (Air Quality Index) data from the US EPA. This helps check the reasonableness of the estimates.
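
The steps above can be sketched end to end on toy data: apply the formula per fire, average per year, and compare with yearly AQI. Every number below is made up, the function name `smoke_impact` is hypothetical, and using a Pearson correlation for the comparison is an assumption rather than something this notebook prescribes.

```python
import pandas as pd

def smoke_impact(fire_size, distance, fire_intensity, overlap):
    """Smoke Impact = (Fire Size / Distance) * Fire Intensity * (1 + Overlap Component)."""
    return (fire_size / distance) * fire_intensity * (1 + overlap)

# Hypothetical fires; all values are invented for illustration.
fires = pd.DataFrame({
    "Fire_Year": [2000, 2000, 2001, 2001, 2002],
    "fire_size": [100.0, 400.0, 900.0, 300.0, 500.0],
    "distance": [500.0, 800.0, 300.0, 600.0, 1000.0],
    "fire_intensity": [2.0, 3.0, 4.0, 2.0, 3.0],
    "overlap": [0.0, 0.1, 0.0, 0.5, 0.0],
})

# Step 4: apply the formula to every fire (the function works column-wise on Series).
fires["smoke_impact"] = smoke_impact(
    fires["fire_size"], fires["distance"], fires["fire_intensity"], fires["overlap"]
)

# Step 5: average the per-fire values within each year.
yearly = fires.groupby("Fire_Year", as_index=False)["smoke_impact"].mean()

# Step 6: compare with (hypothetical) yearly AQI on the years present in both tables.
aqi = pd.DataFrame({"year": [2000, 2001, 2002, 2003], "aqi": [14.0, 20.0, 15.0, 16.0]})
both = yearly.merge(aqi, left_on="Fire_Year", right_on="year", how="inner")
correlation = both["smoke_impact"].corr(both["aqi"])  # Pearson by default

print(both[["year", "smoke_impact", "aqi"]])
print(round(correlation, 3))
```

An inner join keeps only the years that appear in both tables, so years without AQI readings do not enter the comparison.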

In [18]:
# checking all the attributes related to the wildfire
wildfire_feature_df.columns

Index(['OBJECTID', 'USGS_Assigned_ID', 'Assigned_Fire_Type', 'Fire_Year',
       'Fire_Polygon_Tier', 'Fire_Attribute_Tiers', 'GIS_Acres',
       'GIS_Hectares', 'Source_Datasets', 'Listed_Fire_Types',
       'Listed_Fire_Names', 'Listed_Fire_Codes', 'Listed_Fire_IDs',
       'Listed_Fire_IRWIN_IDs', 'Listed_Fire_Dates', 'Listed_Fire_Causes',
       'Listed_Fire_Cause_Class', 'Listed_Rx_Reported_Acres',
       'Listed_Map_Digitize_Methods', 'Listed_Notes', 'Processing_Notes',
       'Wildfire_Notice', 'Prescribed_Burn_Notice', 'Wildfire_and_Rx_Flag',
       'Overlap_Within_1_or_2_Flag', 'Circleness_Scale', 'Circle_Flag',
       'Exclude_From_Summary_Rasters', 'Shape_Length', 'Shape_Area'],
      dtype='object')

The attributes that will be used to calculate the smoke estimate are listed below
1. Size of the fire - GIS_Acres
2. Distance of the fire from the city - shortest_dist
3. Intensity of the fire - Shape_Area and the Listed_Fire_Dates 
4. Overlap Component - Overlap_Within_1_or_2_Flag

## Smoke estimate parameters

### Size of the fire

The size of a wildfire profoundly influences the volume of smoke it generates. Larger fires, by burning more extensive areas of vegetation, emit substantially greater quantities of smoke particles and gases into the atmosphere. These extensive blazes produce heightened levels of pollutants, impacting air quality over larger regions and potentially affecting communities far beyond the fire's immediate vicinity. The scale of a fire significantly contributes to the intensity and duration of smoke production, amplifying its potential health, environmental, and societal implications.

### Distance of the fire

The distance of a wildfire from a city plays a pivotal role in determining the extent of its impact on urban areas. Wildfires closer to cities tend to pose more immediate threats due to the shorter distance for smoke to travel, potentially resulting in poorer air quality and health hazards for residents. Proximity amplifies the direct exposure and the speed at which smoke reaches populated areas, affecting visibility and respiratory health. Additionally, nearby wildfires may prompt evacuation orders or emergency responses, underscoring the critical influence of distance on the urgency and severity of the wildfire's impact on urban communities.

### Fire intensity

The intensity of a wildfire profoundly influences smoke production, with higher intensities generating increased amounts of smoke and pollutants. More intense fires produce finer particles and a wider array of compounds, impacting air quality significantly and potentially posing greater health risks due to prolonged exposure to dense smoke clouds. The rapid spread and longer duration of high-intensity fires amplify their effects, underscoring their substantial contribution to air quality degradation and health concerns within affected regions.

#### Estimating Fire Intensity

We estimate the fire intensity component with the following steps:

1. Consider the attribute Shape_Area - This gives us the area of the entire geographical region that was burnt during the wildfire.
2. Listed_Fire_Dates - This gives us the dates between which the fire burned. We calculate the number of days the fire burned from this field with some regular expression manipulation.
3. The shape area is then multiplied by a factor based on the number of days the fire burned, which gives us the intensity of the fire.
4. The number of days the fire burned is scaled between 0 and 1 since it is a multiplier. Fires without a usable number of days get a value of 0, so their intensity is just the area value.
5. The fire intensity is calculated as
**Fire Intensity = shape_area * (1+scaled_days_fire_burned)**
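
A small worked example of the steps above, using plain Python in place of the `MinMaxScaler` that is used later in this notebook. The three fires and their values are made up.

```python
# Hypothetical fires: (shape_area, days_fire_burned); 0 days means no usable dates.
fires = [(2.0e6, 0), (5.0e5, 30), (1.0e6, 60)]

days = [d for _, d in fires]
lo, hi = min(days), max(days)

# Min-max scaling maps the shortest duration to 0 and the longest to 1.
scaled_days = [(d - lo) / (hi - lo) for d in days]

# Fire Intensity = shape_area * (1 + scaled_days_fire_burned)
fire_intensity = [area * (1 + s) for (area, _), s in zip(fires, scaled_days)]

print(scaled_days)     # [0.0, 0.5, 1.0]
print(fire_intensity)  # [2000000.0, 750000.0, 2000000.0]
```

A fire with no recorded duration keeps its area unchanged, while the longest-burning fire has its area doubled.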

In [19]:
# check whether the field has null values
merged_df['Listed_Fire_Dates'].isnull().sum()

7638

In [20]:
# sample of the Listed_Fire_Dates field
merged_df['Listed_Fire_Dates'].values[0]

'Listed Wildfire Discovery Date(s): 1963-08-06 (3) | Listed Wildfire Controlled Date(s): 1963-12-31 (3)'

The function below extracts the discovery and controlled dates from a wildfire date string of the form 'Listed Wildfire Discovery Date(s): 1963-08-06 (3) | Listed Wildfire Controlled Date(s): 1963-12-31 (3)', calculates the difference in days between these dates, and returns the result. If either date is absent, or if an error occurs during the process, it returns 0.

In [21]:
# function to calculate the number of days the fire burned from the listed fire dates

def calculate_days(row):
    """
    Calculates the difference in days between the discovery and controlled dates of a wildfire.

    Args:
    row (str): String containing wildfire discovery and controlled dates.

    Returns:
    int: Difference in days between the controlled and discovery dates.
         Returns 0 if dates are not found or in case of exceptions.
    """
    try:
        # Use regular expressions to extract dates
        discovery_date_match = re.search(r'Discovery Date\(s\): (\d{4}-\d{2}-\d{2})', row)
        controlled_date_match = re.search(r'Controlled Date\(s\): (\d{4}-\d{2}-\d{2})', row)

        if discovery_date_match and controlled_date_match:
            # Extract discovered and controlled dates
            discovery_date = discovery_date_match.group(1)
            controlled_date = controlled_date_match.group(1)

            # Convert dates to datetime objects
            discovery_date = pd.to_datetime(discovery_date, format='%Y-%m-%d')
            controlled_date = pd.to_datetime(controlled_date, format='%Y-%m-%d')

            # Calculate the difference in days
            return (controlled_date - discovery_date).days
        else:
            return 0  # Return 0 if either date is not found in the string

    except Exception:
        # Fall back to 0 for malformed input (e.g. an unparsable date)
        return 0
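
To illustrate the behaviour described above, here is a compact, standard-library restatement of the same idea. `days_between` is a hypothetical name, not the function used in this notebook, and unlike that function it does not catch malformed dates. It is applied to the sample string shown earlier and to a made-up string that carries only an end date.

```python
import re
from datetime import date

def days_between(text):
    """Days from the first discovery date to the first controlled date, else 0."""
    found = re.search(r'Discovery Date\(s\): (\d{4}-\d{2}-\d{2})', text)
    ended = re.search(r'Controlled Date\(s\): (\d{4}-\d{2}-\d{2})', text)
    if not (found and ended):
        return 0
    return (date.fromisoformat(ended.group(1)) - date.fromisoformat(found.group(1))).days

sample = ('Listed Wildfire Discovery Date(s): 1963-08-06 (3) | '
          'Listed Wildfire Controlled Date(s): 1963-12-31 (3)')
# Made-up string with only an end date, so no discovery/controlled pair exists.
no_pair = 'Listed Prescribed Fire End Date(s): 2020-01-01 (1)'

print(days_between(sample))   # 147
print(days_between(no_pair))  # 0
print(days_between(''))       # 0
```

Strings without both a discovery and a controlled date fall back to 0, so such fires contribute area only to the intensity.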


In [22]:
def scale_column(df, column_name):
    """
    Scales the values in a specified column of a DataFrame between 0 and 1 using Min-Max scaling.

    Args:
    df (DataFrame): Input DataFrame.
    column_name (str): Name of the column to be scaled.

    Returns:
    DataFrame: DataFrame with the scaled column added.
    """
    # Initialize MinMaxScaler
    scaler = MinMaxScaler(feature_range=(0, 1))

    # Reshape the column (required by the scaler)
    column_to_scale = df[column_name].values.reshape(-1, 1)

    # Fit and transform the data
    scaled_values = scaler.fit_transform(column_to_scale)

    # Assign the scaled values back to a new column in the DataFrame
    df[f'scaled_{column_name}'] = scaled_values.flatten()
    
    return df
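
For reference, min-max scaling is the transformation (x - min) / (max - min). The NumPy sketch below is not the scikit-learn implementation; it assumes, as the scikit-learn documentation describes, that missing values are ignored when finding the minimum and maximum and remain missing afterwards. The helper name `min_max_scale` is hypothetical, and the sketch does not handle a constant column.

```python
import numpy as np

def min_max_scale(values):
    """Scale to [0, 1] with (x - min) / (max - min), skipping NaN when finding min and max."""
    values = np.asarray(values, dtype=float)
    lo, hi = np.nanmin(values), np.nanmax(values)
    return (values - lo) / (hi - lo)

scaled = min_max_scale([0.0, 50.0, np.nan, 200.0])
print(scaled)  # NaN stays NaN; the other values map to 0.0, 0.25 and 1.0
```

This is why columns that contain missing values still need a `fillna` step after scaling.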


In [23]:
# Applying the function to the column and creating a new column with the difference in days
merged_df['Listed_Fire_Dates'] =  merged_df['Listed_Fire_Dates'].fillna('')
merged_df['days_fire_lasted'] = merged_df['Listed_Fire_Dates'].apply(calculate_days)

In [24]:
# Replace null values with 0 (assign back instead of using inplace on a column)
merged_df['days_fire_lasted'] = merged_df['days_fire_lasted'].fillna(0)

# Replace negative values with 0
merged_df['days_fire_lasted'] = merged_df['days_fire_lasted'].clip(lower=0)

# counting number of records that have valid values
merged_df[merged_df['days_fire_lasted']>0].count()

OBJECTID                      15678
Fire_Year                     15678
GIS_Acres                     15678
Shape_Area                    15678
Overlap_Within_1_or_2_Flag     1052
Listed_Fire_Dates             15678
shortest_dist                 15678
days_fire_lasted              15678
dtype: int64

In [25]:
# scaling the days_fire_lasted column
merged_df = scale_column(merged_df, 'days_fire_lasted')

In [26]:
# creating the Fire Intensity
# Fire Intensity = shape_area * (1+scaled_days_fire_burned)
merged_df['fire_intensity'] = (merged_df['scaled_days_fire_lasted']+1)*merged_df['Shape_Area']

In [27]:
# sample
merged_df.head()

Unnamed: 0,OBJECTID,Fire_Year,GIS_Acres,Shape_Area,Overlap_Within_1_or_2_Flag,Listed_Fire_Dates,shortest_dist,days_fire_lasted,scaled_days_fire_lasted,fire_intensity
0,14299,1963,40992.458271,165890600.0,,Listed Wildfire Discovery Date(s): 1963-08-06 ...,782.41,147,0.409471,233817900.0
1,14300,1963,25757.090203,104235200.0,,Listed Wildfire Discovery Date(s): 1963-07-28 ...,801.32,0,0.0,104235200.0
2,14301,1963,45527.210986,184242100.0,,Listed Wildfire Discovery Date(s): 1963-08-06 ...,770.09,147,0.409471,259683800.0
3,14302,1963,10395.010334,42067110.0,,Listed Wildfire Discovery Date(s): 1963-08-06 ...,770.83,147,0.409471,59292370.0
4,14303,1963,9983.605738,40402220.0,,Listed Wildfire Discovery Date(s): 1963-08-06 ...,776.87,147,0.409471,56945750.0


### Overlap with Previous Fires

Spatial overlap with earlier fires can contribute significantly to the smoke estimate, even when the fires did not burn at the same time.

1. Impact on Environment: Even if wildfires occur at different times but in the same geographical area, their combined impact on the environment can be notable. The land might have experienced reduced vegetation due to earlier fires, making it more susceptible to subsequent fires, altering soil conditions, and impacting recovery processes.

2. Air Quality and Public Health: Although not simultaneous, the spatial overlap can affect air quality over time. The accumulation of smoke-related pollutants, the prolonged exposure of affected areas, and the potential for repeated disruptions to air quality due to recurrent fires can impact public health, particularly for vulnerable populations.

3. Residual Effects: The aftermath of earlier fires, such as increased dryness, altered ecosystems, or changes in vegetation, can influence the behavior and severity of subsequent fires in the same area. This can affect the intensity and duration of smoke production during later fires.

#### Estimating overlap component

We estimate the overlap component with the following methodology:

1. We consider a typical 'Overlap_Within_1_or_2_Flag' value, which is of the form 'Caution, this Wildfire in 1963 overlaps with a Wildfire that occurred in 1961 (2 year difference). The overlapping fire overlaps by 30.5% (196.0 acres). Overlapping fire USGS Assigned ID: 13685.'
2. From the string we extract 3 important components:
    - The difference in years between the two overlapping fires
    - Area of overlap in acres
    - Percentage of overlap between the wildfires
    
3. The factor increases with the area of overlap and the percentage of overlap, and decreases as the difference in years grows.
4. We calculate the overlap component with this formula:
**Overlap Factor = (1 / Time Difference + 1) * Percentage of Overlap * Area of Overlap**
Here the division is applied before the addition, i.e. (1 / Time Difference) + 1, which matches the code below. Fires with no overlap have a time difference of 0, so the result is undefined and is filled with 0.
5. This value is then scaled between 0 and 1 since it is a multiplier
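
Applying these steps to the example string gives a concrete value. The sketch below is a standalone illustration with the standard-library `re` module; `overlap_factor` is a hypothetical helper, and it follows the notebook's arithmetic, in which `1/Time Difference+1` evaluates as `(1 / Time Difference) + 1`.

```python
import re

text = ('Caution, this Wildfire in 1963 overlaps with a Wildfire that occurred in 1961 '
        '(2 year difference). The overlapping fire overlaps by 30.5% (196.0 acres). '
        'Overlapping fire USGS Assigned ID: 13685.')

years = int(re.search(r'\((\d+) year', text).group(1))
percent = float(re.search(r'by\s([\d.]+)%', text).group(1))
acres = float(re.search(r'\(([\d.]+)\sacres', text).group(1))

def overlap_factor(years, percent, acres):
    # Same operator precedence as the notebook: (1 / years) + 1, not 1 / (years + 1).
    return (1 / years + 1) * percent * acres

print(years, percent, acres)                  # 2 30.5 196.0
print(overlap_factor(years, percent, acres))  # 8967.0
```

With a 1 year difference the first term is 2, and with a 2 year difference it is 1.5, so more recent overlaps weigh more.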




In [28]:
# sample of Overlap_Within_1_or_2_Flag value
merged_df.Overlap_Within_1_or_2_Flag.unique()[2]

'Caution, this Wildfire in 1963 overlaps with a Wildfire that occurred in 1961 (2 year difference). The overlapping fire overlaps by 30.5% (196.0 acres). Overlapping fire USGS Assigned ID: 13685.'

In [29]:
# overlap function
def extract_overlap_info(text):
    """
    Extracts information about the overlap from the given text.
    
    Args:
    text (str): Text containing information about the overlap between wildfires.
    
    Returns:
    pd.Series: Pandas Series containing extracted information:
               - 'Year Difference': Difference in years between the wildfires.
               - 'Area of Overlap': Area of overlap in acres.
               - 'Percentage of Overlap': Percentage of overlap between the wildfires.
    """
    result = {
        'Year Difference': 0,
        'Area of Overlap': 0.0,
        'Percentage of Overlap': 0.0
    }
    
    try:
        # Extract year difference
        year_difference = int(re.search(r'(?<=\().*?(?=\syear)', text).group())
        result['Year Difference'] = year_difference
        
        # Extract area of overlap in acres
        overlap_area = float(re.search(r'\(([\d.]+)\sacres', text).group(1))
        result['Area of Overlap'] = overlap_area
        
        # Extract percentage of overlap
        percentage_overlap = float(re.search(r'by\s([\d.]+)%', text).group(1))
        result['Percentage of Overlap'] = percentage_overlap
        return pd.Series(result)
    
    except Exception:
        # No overlap text or an unexpected format: return what was parsed so far (zeros by default)
        return pd.Series(result)

    


In [30]:
# filling null values
merged_df['Overlap_Within_1_or_2_Flag'] = merged_df['Overlap_Within_1_or_2_Flag'].fillna('')

In [31]:
# extracting the year, area and the percentage overlap factor
merged_df[['Year Difference', 'Area of Overlap', 'Percentage of Overlap']] = merged_df['Overlap_Within_1_or_2_Flag'].apply(extract_overlap_info)

In [32]:
# checking for null values in the extracted columns
merged_df[['Year Difference', 'Area of Overlap', 'Percentage of Overlap']].isnull().any()

Year Difference          False
Area of Overlap          False
Percentage of Overlap    False
dtype: bool

In [33]:
# sample
merged_df.head()

Unnamed: 0,OBJECTID,Fire_Year,GIS_Acres,Shape_Area,Overlap_Within_1_or_2_Flag,Listed_Fire_Dates,shortest_dist,days_fire_lasted,scaled_days_fire_lasted,fire_intensity,Year Difference,Area of Overlap,Percentage of Overlap
0,14299,1963,40992.458271,165890600.0,,Listed Wildfire Discovery Date(s): 1963-08-06 ...,782.41,147,0.409471,233817900.0,0.0,0.0,0.0
1,14300,1963,25757.090203,104235200.0,,Listed Wildfire Discovery Date(s): 1963-07-28 ...,801.32,0,0.0,104235200.0,0.0,0.0,0.0
2,14301,1963,45527.210986,184242100.0,,Listed Wildfire Discovery Date(s): 1963-08-06 ...,770.09,147,0.409471,259683800.0,0.0,0.0,0.0
3,14302,1963,10395.010334,42067110.0,,Listed Wildfire Discovery Date(s): 1963-08-06 ...,770.83,147,0.409471,59292370.0,0.0,0.0,0.0
4,14303,1963,9983.605738,40402220.0,,Listed Wildfire Discovery Date(s): 1963-08-06 ...,776.87,147,0.409471,56945750.0,0.0,0.0,0.0


In [34]:
# creating the overlap component
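# Overlap Factor = (1 / Time Difference + 1) * Percentage of Overlap * Area of Overlap
# rows with Year Difference == 0 evaluate to NaN (inf * 0) and are filled with 0 in the next cell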
merged_df['overlap_component'] = (1/merged_df['Year Difference']+1)* merged_df['Area of Overlap']*merged_df['Percentage of Overlap']

In [35]:
# scaling the overlap_component column
merged_df = scale_column(merged_df, 'overlap_component')

merged_df['overlap_component'] = merged_df['overlap_component'].fillna(0)
merged_df['scaled_overlap_component'] = merged_df['scaled_overlap_component'].fillna(0)

In [36]:
# sample
merged_df

Unnamed: 0,OBJECTID,Fire_Year,GIS_Acres,Shape_Area,Overlap_Within_1_or_2_Flag,Listed_Fire_Dates,shortest_dist,days_fire_lasted,scaled_days_fire_lasted,fire_intensity,Year Difference,Area of Overlap,Percentage of Overlap,overlap_component,scaled_overlap_component
0,14299,1963,40992.458271,1.658906e+08,,Listed Wildfire Discovery Date(s): 1963-08-06 ...,782.41,147,0.409471,2.338179e+08,0.0,0.0,0.0,0.0,0.000000
1,14300,1963,25757.090203,1.042352e+08,,Listed Wildfire Discovery Date(s): 1963-07-28 ...,801.32,0,0.000000,1.042352e+08,0.0,0.0,0.0,0.0,0.000000
2,14301,1963,45527.210986,1.842421e+08,,Listed Wildfire Discovery Date(s): 1963-08-06 ...,770.09,147,0.409471,2.596838e+08,0.0,0.0,0.0,0.0,0.000000
3,14302,1963,10395.010334,4.206711e+07,,Listed Wildfire Discovery Date(s): 1963-08-06 ...,770.83,147,0.409471,5.929237e+07,0.0,0.0,0.0,0.0,0.000000
4,14303,1963,9983.605738,4.040222e+07,,Listed Wildfire Discovery Date(s): 1963-08-06 ...,776.87,147,0.409471,5.694575e+07,0.0,0.0,0.0,0.0,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95841,135057,2020,16.412148,6.641761e+04,"Caution, this Prescribed Fire in 2020 overlaps...",Listed Prescribed Fire End Date(s): 2020-01-01...,1094.84,0,0.000000,6.641761e+04,1.0,16.0,100.0,3200.0,0.000123
95842,135058,2020,7.050837,2.853373e+04,"Caution, this Prescribed Fire in 2020 overlaps...",Listed Prescribed Fire End Date(s): 2020-05-16...,861.07,0,0.000000,2.853373e+04,1.0,7.0,100.0,1400.0,0.000054
95843,135059,2020,9.342668,3.780843e+04,"Caution, this Prescribed Fire in 2020 overlaps...",Listed Prescribed Fire End Date(s): 2020-05-16...,861.63,0,0.000000,3.780843e+04,1.0,9.0,100.0,1800.0,0.000069
95844,135060,2020,0.996962,4.034562e+03,,Listed Prescribed Fire Start Date(s): 2020-07-...,588.16,0,0.000000,4.034562e+03,0.0,0.0,0.0,0.0,0.000000


## Creating Smoke estimate for all the fires

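We take the logarithm of each term because the raw values span many orders of magnitude. In log space the product and quotient in the formula become sums and differences, and 1 is added inside each logarithm so that zero values remain defined. Note that the code below uses the unscaled `overlap_component` rather than `scaled_overlap_component`.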

In [37]:
# Smoke Impact = (Fire Size / Distance) * Fire Intensity * (1+ Overlap Component)
merged_df['smoke_estimate'] = (np.log(merged_df['GIS_Acres'] + 1) + np.log(merged_df['fire_intensity'] + 1) 
                           - np.log(merged_df['shortest_dist']  + 1) +np.log(merged_df['overlap_component'] + 1))
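
As a sanity check on the log form, the sketch below recomputes the estimate for one row from the rounded values displayed in the table above (OBJECTID 135057). It uses only the standard library, and the inputs are the displayed values rather than the full-precision data.

```python
import math

# Values copied from the displayed row for OBJECTID 135057 (rounded as shown).
gis_acres = 16.412148
fire_intensity = 6.641761e4
shortest_dist = 1094.84
overlap_component = 3200.0  # unscaled value, as used in the expression above

smoke_estimate = (math.log(gis_acres + 1) + math.log(fire_intensity + 1)
                  - math.log(shortest_dist + 1) + math.log(overlap_component + 1))
print(round(smoke_estimate, 3))
```

The result agrees with the value of 15.032843 that the notebook reports for this row, up to rounding of the inputs.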

In [38]:
merged_df['smoke_estimate'].min()

-7.089855823676603

In [39]:
# checking the columns
merged_df.columns

Index(['OBJECTID', 'Fire_Year', 'GIS_Acres', 'Shape_Area',
       'Overlap_Within_1_or_2_Flag', 'Listed_Fire_Dates', 'shortest_dist',
       'days_fire_lasted', 'scaled_days_fire_lasted', 'fire_intensity',
       'Year Difference', 'Area of Overlap', 'Percentage of Overlap',
       'overlap_component', 'scaled_overlap_component', 'smoke_estimate'],
      dtype='object')

In [40]:
# subsetting with relevant columns
smoke_estimate_df  = merged_df[['OBJECTID', 'Fire_Year', 'GIS_Acres', 'Shape_Area', 'shortest_dist',
                                'days_fire_lasted', 'scaled_days_fire_lasted', 'fire_intensity',
                                'Year Difference', 'Area of Overlap', 'Percentage of Overlap',
                                'overlap_component', 'scaled_overlap_component', 'smoke_estimate']]
# writing to csv file
smoke_estimate_df.to_csv(r"C:\Users\shwet\Documents\local-wildfire-project\Data512-WildFire-Project\02_data\02_intermediate_data\wildfire_smoke_estimates.csv")

In [41]:
# sample
smoke_estimate_df

Unnamed: 0,OBJECTID,Fire_Year,GIS_Acres,Shape_Area,shortest_dist,days_fire_lasted,scaled_days_fire_lasted,fire_intensity,Year Difference,Area of Overlap,Percentage of Overlap,overlap_component,scaled_overlap_component,smoke_estimate
0,14299,1963,40992.458271,1.658906e+08,782.41,147,0.409471,2.338179e+08,0.0,0.0,0.0,0.0,0.000000,23.227565
1,14300,1963,25757.090203,1.042352e+08,801.32,0,0.000000,1.042352e+08,0.0,0.0,0.0,0.0,0.000000,21.931157
2,14301,1963,45527.210986,1.842421e+08,770.09,147,0.409471,2.596838e+08,0.0,0.0,0.0,0.0,0.000000,23.453258
3,14302,1963,10395.010334,4.206711e+07,770.83,147,0.409471,5.929237e+07,0.0,0.0,0.0,0.0,0.000000,20.498404
4,14303,1963,9983.605738,4.040222e+07,776.87,147,0.409471,5.694575e+07,0.0,0.0,0.0,0.0,0.000000,20.409850
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95841,135057,2020,16.412148,6.641761e+04,1094.84,0,0.000000,6.641761e+04,1.0,16.0,100.0,3200.0,0.000123,15.032843
95842,135058,2020,7.050837,2.853373e+04,861.07,0,0.000000,2.853373e+04,1.0,7.0,100.0,1400.0,0.000054,12.830258
95843,135059,2020,9.342668,3.780843e+04,861.63,0,0.000000,3.780843e+04,1.0,9.0,100.0,1800.0,0.000069,13.612703
95844,135060,2020,0.996962,4.034562e+03,588.16,0,0.000000,4.034562e+03,0.0,0.0,0.0,0.0,0.000000,2.615830


# Step 3 - Getting Yearly Smoke Estimates to Compare with AQI

In [42]:
smoke_est_df = smoke_estimate_df[['Fire_Year', 'GIS_Acres', 'Shape_Area', 'shortest_dist', 'fire_intensity','scaled_overlap_component', 'smoke_estimate']]
smoke_est_df = smoke_est_df.groupby('Fire_Year').mean().reset_index()
smoke_est_df

Unnamed: 0,Fire_Year,GIS_Acres,Shape_Area,shortest_dist,fire_intensity,scaled_overlap_component,smoke_estimate
0,1963,691.834084,2799753.0,886.566366,3348336.0,1.7e-05,7.409755
1,1964,943.032726,3816318.0,942.635637,3880452.0,2.3e-05,9.473445
2,1965,686.502917,2778179.0,983.856358,2800565.0,0.000708,6.415212
3,1966,1470.830214,5952239.0,916.784617,6370382.0,2.6e-05,8.631456
4,1967,779.61252,3154980.0,978.28646,3161821.0,0.001162,6.593157
5,1968,676.723133,2738601.0,996.56279,2764659.0,0.000158,8.650442
6,1969,510.362125,2065362.0,980.113564,2104951.0,0.000135,7.775776
7,1970,1015.245169,4108551.0,941.389885,4201760.0,6.8e-05,9.188843
8,1971,1392.31989,5634519.0,879.754963,6630058.0,8.5e-05,10.242204
9,1972,747.814787,3026299.0,957.926527,3313837.0,2.5e-05,9.040093


In [43]:
# writing to csv file
smoke_est_df.to_csv(r"C:\Users\shwet\Documents\local-wildfire-project\Data512-WildFire-Project\02_data\02_intermediate_data\yearly_smoke_estimates.csv")

### Joining with AQI table

In [44]:
agg_data_yearly['year'] = agg_data_yearly['year'].astype(int)
smoke_est_df['Fire_Year'] = smoke_est_df['Fire_Year'].astype(int)

In [45]:
smoke_data = pd.merge(smoke_est_df,agg_data_yearly, left_on='Fire_Year', right_on='year', how='outer')
smoke_data.columns

Index(['Fire_Year', 'GIS_Acres', 'Shape_Area', 'shortest_dist',
       'fire_intensity', 'scaled_overlap_component', 'smoke_estimate', 'year',
       'aqi'],
      dtype='object')

In [46]:
# fill missing years from Fire_Year, keep the relevant columns, and set missing AQI to 0
smoke_data['year'] = smoke_data['year'].fillna(smoke_data['Fire_Year'])
smoke_data = smoke_data[[ 'year', 'GIS_Acres', 'Shape_Area', 'shortest_dist','fire_intensity',
                         'scaled_overlap_component', 'smoke_estimate','aqi']]
smoke_data['aqi'] = smoke_data['aqi'].fillna(0) 
smoke_data

Unnamed: 0,year,GIS_Acres,Shape_Area,shortest_dist,fire_intensity,scaled_overlap_component,smoke_estimate,aqi
0,1963.0,691.834084,2.799753e+06,886.566366,3.348336e+06,0.000017,7.409755,0.000000
1,1964.0,943.032726,3.816318e+06,942.635637,3.880452e+06,0.000023,9.473445,0.000000
2,1965.0,686.502917,2.778179e+06,983.856358,2.800565e+06,0.000708,6.415212,0.000000
3,1966.0,1470.830214,5.952239e+06,916.784617,6.370382e+06,0.000026,8.631456,0.000000
4,1967.0,779.612520,3.154980e+06,978.286460,3.161821e+06,0.001162,6.593157,0.000000
...,...,...,...,...,...,...,...,...
56,2019.0,793.084805,3.209500e+06,865.479474,3.297554e+06,0.000344,11.554043,8.030042
57,2020.0,2083.377696,8.431130e+06,817.544346,9.693808e+06,0.000178,8.339470,14.099363
58,2021.0,,,,,,,18.852270
59,2022.0,,,,,,,17.592467


In [47]:
# writing to csv file
smoke_data.to_csv(r"C:\Users\shwet\Documents\local-wildfire-project\Data512-WildFire-Project\02_data\03_final_data\final_yearly_wildfire_data_w_smokeestimate_aqi.csv")