<a href="https://colab.research.google.com/github/ParsaJafarian/fire/blob/main/fire.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction
We will be preprocessing canadian wildfire data in order to derive a machine learning/deep learning model that predicts size of a fire at a certain location.

We will be using the [Canadian Wildfires (1950-2021)](https://www.kaggle.com/datasets/ulasozdemir/wildfires-in-canada-19502021) dataset from Kaggle

For newer data, 2023 and 2024 wildfire datasets were obtained from [Canadian Wildland Fire Information System](https://cwfis.cfs.nrcan.gc.ca/downloads/activefires/)

## Analysis
Analysis of what the datasets offer

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
import seaborn as sns

#### Download 1950-2021 dataset from Kaggle

In [2]:
import subprocess

# Define the Kaggle dataset URL and the path to download
dataset = "ulasozdemir/wildfires-in-canada-19502021"
path_to_download = "."

# Run the Kaggle API command
subprocess.run(["kaggle", "datasets", "download", "-d", dataset, "-p", path_to_download, "--unzip"])


CompletedProcess(args=['kaggle', 'datasets', 'download', '-d', 'ulasozdemir/wildfires-in-canada-19502021', '-p', '.', '--unzip'], returncode=0)

#### Download 2023 and 2024 datasets from CWFIS

In [3]:
#Define download function
import requests
def download(base_url: str, filename: str):
  """
  Download a file from the given base url.
  base_url - base url where file is stored. Must end with `/`
  filename - name of the file to download. Must contain suffix (.csv, .json, etc)
  """
  response = requests.get(base_url + filename)

  # Check if the request was successful (status code 200)
  if response.status_code == 200:
      # Write the content to a local file
      with open(filename, 'wb') as f:
          f.write(response.content)
      print('File downloaded successfully')
  else:
      print('Failed to download file')

In [4]:
#Download the datasets
url = "https://cwfis.cfs.nrcan.gc.ca/downloads/activefires/"
download(url, "activefires.csv")
download(url, "reported_fires_2023.csv")

File downloaded successfully
File downloaded successfully


#### Load Datasets and compare them

In [5]:
df = pd.read_csv("CANADA_WILDFIRES.csv")
df_2023 = pd.read_csv("reported_fires_2023.csv")
df_2024 = pd.read_csv("activefires.csv")

len(df), len(df_2023), len(df_2024)

(423831, 7171, 296)

In [6]:
df_2023.head()

Unnamed: 0,firename,agency,startdate,hectares,cause,lat,lon,response_type
0,CPP-001-2023,AB,2023-08-14 21:21:00,0.01,H,51.0748,-115.08,FUL
1,CWF-001-2023,AB,2023-04-10 07:25:00,0.0,H,51.165,-114.853,FUL
2,CWF-002-2023,AB,2023-04-24 02:00:00,0.7,H,51.139,-114.907,FUL
3,CWF-003-2023,AB,2023-05-01 21:25:00,0.7,H,51.222,-114.85,FUL
4,CWF-004-2023,AB,2023-05-05 21:16:00,0.8,H,51.228,-114.857,FUL


In [7]:
df.head()

Unnamed: 0,FID,SRC_AGENCY,LATITUDE,LONGITUDE,REP_DATE,SIZE_HA,CAUSE,PROTZONE,ECOZ_NAME
0,0,BC,59.963,-128.172,1953-05-26,8.0,H,,Boreal Cordillera
1,1,BC,59.318,-132.172,1950-06-22,8.0,L,,Boreal Cordillera
2,2,BC,59.876,-131.922,1950-06-04,12949.9,H,,Boreal Cordillera
3,3,BC,59.76,-132.808,1951-07-15,241.1,H,,Boreal Cordillera
4,4,BC,59.434,-126.172,1952-06-12,1.2,H,,Boreal Cordillera


In [8]:
df = df.drop("FID", axis=1)
df.head()

Unnamed: 0,SRC_AGENCY,LATITUDE,LONGITUDE,REP_DATE,SIZE_HA,CAUSE,PROTZONE,ECOZ_NAME
0,BC,59.963,-128.172,1953-05-26,8.0,H,,Boreal Cordillera
1,BC,59.318,-132.172,1950-06-22,8.0,L,,Boreal Cordillera
2,BC,59.876,-131.922,1950-06-04,12949.9,H,,Boreal Cordillera
3,BC,59.76,-132.808,1951-07-15,241.1,H,,Boreal Cordillera
4,BC,59.434,-126.172,1952-06-12,1.2,H,,Boreal Cordillera


In [9]:
df.columns = ["agency", "lat", "lon", "date", "hectares", "cause", "response_type", "biome"]
df.head()

Unnamed: 0,agency,lat,lon,date,hectares,cause,response_type,biome
0,BC,59.963,-128.172,1953-05-26,8.0,H,,Boreal Cordillera
1,BC,59.318,-132.172,1950-06-22,8.0,L,,Boreal Cordillera
2,BC,59.876,-131.922,1950-06-04,12949.9,H,,Boreal Cordillera
3,BC,59.76,-132.808,1951-07-15,241.1,H,,Boreal Cordillera
4,BC,59.434,-126.172,1952-06-12,1.2,H,,Boreal Cordillera


### 1950-2021 dataset Analysis

In [10]:
#Check for different response types
df["response_type"].value_counts()

response_type
                         333463
Intensive                 62725
EXT                        4242
Initial Attack             3158
Monitored                  3117
Nordique                   2784
R (High Priority)          2745
FUL                        1568
G (Low Priority)           1327
W (Observation Zone)       1062
Full Response               906
Sustained Action            890
Wilderness FMZ              857
intensive                   663
MON                         539
Modified Response           399
FullResponse                289
Full FMZ                    278
Critical FMZ                248
Delayed Action              239
Stratigic FMZ               197
Transition FMZ              170
ActionCode: 1               150
Limited Action              148
Y (Moderate Priority)       111
Prescribed Fire             108
Being Monitored              87
nordique                     77
Wilderness (forested)        68
Monitored Response           51
MOD                       

In [11]:
#Change all response types where it's "FullResponse" to "Full Response"
df["response_type"] = df["response_type"].apply(lambda x: "Full Response" if x == "FullReponse" else x)

In [12]:
#Check for different biomes
df["biome"].value_counts()

biome
Montane Cordillera    120880
Boreal PLain           78544
Boreal Shield East     68168
Boreal Shield West     58847
Atlantic Maritime      25981
Pacific Maritime       25952
Taiga Plain            16943
Boreal Cordillera       9084
Taiga Shield West       8051
MixedWood Plain         4543
Hudson Plain            2166
Taiga Shield East       1568
Prairie                 1433
Taiga Cordillera        1105
                         294
Southern Arctic          267
Northern Arctic            5
Name: count, dtype: int64

In [13]:
#Check different causes
df["cause"].value_counts()

cause
H       230498
L       183179
U         9517
H-PB       322
RE          74
Name: count, dtype: int64

In [14]:
#Compare response types of 2023 dataset with 1950-2021
df_2023["response_type"].value_counts(), df["response_type"].value_counts()

(response_type
 FUL    5667
 MON     861
 MOD     643
 Name: count, dtype: int64,
 response_type
                          333463
 Intensive                 62725
 EXT                        4242
 Initial Attack             3158
 Monitored                  3117
 Nordique                   2784
 R (High Priority)          2745
 FUL                        1568
 G (Low Priority)           1327
 W (Observation Zone)       1062
 Full Response               906
 Sustained Action            890
 Wilderness FMZ              857
 intensive                   663
 MON                         539
 Modified Response           399
 FullResponse                289
 Full FMZ                    278
 Critical FMZ                248
 Delayed Action              239
 Stratigic FMZ               197
 Transition FMZ              170
 ActionCode: 1               150
 Limited Action              148
 Y (Moderate Priority)       111
 Prescribed Fire             108
 Being Monitored              87
 nordique   

1950-2021 dataset contains all values of response types in 2023 dataset so the integration is good!

In [15]:
#drop firename column in 2023 dataset
df_2023 = df_2023.drop("firename", axis=1)
df_2023.head()

Unnamed: 0,agency,startdate,hectares,cause,lat,lon,response_type
0,AB,2023-08-14 21:21:00,0.01,H,51.0748,-115.08,FUL
1,AB,2023-04-10 07:25:00,0.0,H,51.165,-114.853,FUL
2,AB,2023-04-24 02:00:00,0.7,H,51.139,-114.907,FUL
3,AB,2023-05-01 21:25:00,0.7,H,51.222,-114.85,FUL
4,AB,2023-05-05 21:16:00,0.8,H,51.228,-114.857,FUL


#### Convert 2023 dataset's datetime column to date only

In [16]:
df_2023.columns = ['agency', 'date', 'hectares', 'cause', 'lat', 'lon',
       'response_type']
#Change datetime column name to date
#For each datetime string, remove the time part
df_2023["date"] = df_2023["date"].apply(lambda x: x.split()[0] if not x.isspace() else x)
df_2023.head()

Unnamed: 0,agency,date,hectares,cause,lat,lon,response_type
0,AB,2023-08-14,0.01,H,51.0748,-115.08,FUL
1,AB,2023-04-10,0.0,H,51.165,-114.853,FUL
2,AB,2023-04-24,0.7,H,51.139,-114.907,FUL
3,AB,2023-05-01,0.7,H,51.222,-114.85,FUL
4,AB,2023-05-05,0.8,H,51.228,-114.857,FUL


Compare the datasets after the mutations

In [17]:
df.head(1)

Unnamed: 0,agency,lat,lon,date,hectares,cause,response_type,biome
0,BC,59.963,-128.172,1953-05-26,8.0,H,,Boreal Cordillera


In [18]:
df_2023.head(1)

Unnamed: 0,agency,date,hectares,cause,lat,lon,response_type
0,AB,2023-08-14,0.01,H,51.0748,-115.08,FUL


Both datasets have

In [19]:
#Change column names and drop some
df_2024.columns = ['agency', 'firename', 'lat', 'lon', 'date', 'hectares',
       'stage_of_control', 'timezone', 'response_type']
df_2024 = df_2024.drop(labels=["firename", "timezone", "stage_of_control"], axis=1)

In [20]:
#Transform datetimes into dates & uppercase the agencies
df_2024["date"] = df_2024["date"].apply(lambda x: x.split()[0] if not x.isspace() else x)
df_2024["agency"] = df_2024["agency"].apply(lambda x: x.upper())

In [21]:
df_current = pd.concat([df_2023, df_2024])
df_current.head()

Unnamed: 0,agency,date,hectares,cause,lat,lon,response_type
0,AB,2023-08-14,0.01,H,51.0748,-115.08,FUL
1,AB,2023-04-10,0.0,H,51.165,-114.853,FUL
2,AB,2023-04-24,0.7,H,51.139,-114.907,FUL
3,AB,2023-05-01,0.7,H,51.222,-114.85,FUL
4,AB,2023-05-05,0.8,H,51.228,-114.857,FUL


### Adding elevation by using Canada's api


In [22]:
url = "http://geogratis.gc.ca/services/elevation/cdem/altitude?"
def get_altitude(lat: float, lon: float):
  response = requests.get(f"{url}lat={lat}&lon={lon}")

  # Check if the request was successful (status code 200)
  if response.status_code == 200:
      try:
            # Parse the response JSON
            data = response.json()
            # Extract the altitude value (assuming the key is 'altitude')
            altitude = data.get('altitude')
            if altitude is not None:
                print("Altitude obtained")
                return altitude
            else:
                print('Altitude data not found in response')
      except ValueError:
            print('Failed to parse JSON response')
  else:
      print('Failed to get altitude')

  return None

In [23]:
latitude = 45.4215
longitude = -75.6972
altitude = get_altitude(latitude, longitude)

Altitude obtained


In [24]:
from google.colab import files

df_current['altitude'] = df_current.apply(lambda row: get_altitude(row['lat'], row['lon']), axis=1)
df_current.to_csv("current_fires.csv")
files.download("current_fires.csv")

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained
Altitude obtained

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

### Creating a model to predict Biome
The 2023-2024 df does not have a biome column. In order to complete the data, we could predict the biomes using a machine learning model and training it with 1950-2021 data. We want to use all other columns to predict the biome