## Problem
## Questions
- What are the common causes of wildfires?
- How does the frequency of fires differ between provinces?
- What are the causes of the largest fires?
- Is there a correlation between the fire type or cause and the size of the area burned?

In [1]:
# !pip install pandas
# !pip install numpy
# !pip install matplotlib
# !pip install dbf

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.set_option('display.max_columns', None)

In [3]:
fire_pts = pd.read_csv('data/NFDB_point_txt/NFDB_point_20220901.txt', low_memory=False)
fire_pts.head()

Unnamed: 0,FID,SRC_AGENCY,FIRE_ID,FIRENAME,LATITUDE,LONGITUDE,YEAR,MONTH,DAY,REP_DATE,ATTK_DATE,OUT_DATE,DECADE,SIZE_HA,CAUSE,PROTZONE,FIRE_TYPE,MORE_INFO,CFS_REF_ID,CFS_NOTE1,CFS_NOTE2,ACQ_DATE,SRC_AGY2,ECOZONE,ECOZ_REF,ECOZ_NAME,ECOZ_NOM
0,0,BC,1953-G00041,,59.963,-128.172,1953,5,26,1953-05-26 00:00:00,,,1950-1959,8.0,H,,Fire,,BC-1953-1953-G00041,,,2020-05-05 00:00:00,BC,12,12,Boreal Cordillera,CordillCre boreale
1,1,BC,1950-R00028,,59.318,-132.172,1950,6,22,1950-06-22 00:00:00,,,1950-1959,8.0,L,,Fire,,BC-1950-1950-R00028,,,2020-05-05 00:00:00,BC,12,12,Boreal Cordillera,CordillCre boreale
2,2,BC,1950-G00026,,59.876,-131.922,1950,6,4,1950-06-04 00:00:00,,,1950-1959,12949.9,H,,Fire,,BC-1950-1950-G00026,,,2020-05-05 00:00:00,BC,12,12,Boreal Cordillera,CordillCre boreale
3,3,BC,1951-R00097,,59.76,-132.808,1951,7,15,1951-07-15 00:00:00,,,1950-1959,241.1,H,,Fire,,BC-1951-1951-R00097,,,2020-05-05 00:00:00,BC,12,12,Boreal Cordillera,CordillCre boreale
4,4,BC,1952-G00116,,59.434,-126.172,1952,6,12,1952-06-12 00:00:00,,,1950-1959,1.2,H,,Fire,,BC-1952-1952-G00116,,,2020-05-05 00:00:00,BC,12,12,Boreal Cordillera,CordillCre boreale


In [11]:
fire_pts.shape

(423831, 27)

In [4]:
fire_pts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 423831 entries, 0 to 423830
Data columns (total 27 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   FID         423831 non-null  int64  
 1   SRC_AGENCY  423831 non-null  object 
 2   FIRE_ID     423666 non-null  object 
 3   FIRENAME    423830 non-null  object 
 4   LATITUDE    423831 non-null  float64
 5   LONGITUDE   423831 non-null  float64
 6   YEAR        423831 non-null  int64  
 7   MONTH       423831 non-null  int64  
 8   DAY         423831 non-null  int64  
 9   REP_DATE    420118 non-null  object 
 10  ATTK_DATE   28740 non-null   object 
 11  OUT_DATE    213092 non-null  object 
 12  DECADE      423831 non-null  object 
 13  SIZE_HA     423831 non-null  float64
 14  CAUSE       423590 non-null  object 
 15  PROTZONE    422821 non-null  object 
 16  FIRE_TYPE   423831 non-null  object 
 17  MORE_INFO   423831 non-null  object 
 18  CFS_REF_ID  423831 non-null  object 
 19  CF

In [43]:
fire_pts.describe()

Unnamed: 0,FID,LATITUDE,LONGITUDE,YEAR,MONTH,DAY,SIZE_HA,ECOZONE
count,423831.0,423831.0,423831.0,423831.0,423831.0,423831.0,423831.0,423831.0
mean,211915.0,51.895936,-104.253547,1990.464784,6.585818,15.441048,316.271,9.376594
std,122349.615308,4.46777,20.412661,47.872423,1.675279,8.98398,5617.05,3.631528
min,0.0,0.0,-166.044,-999.0,0.0,0.0,0.0,0.0
25%,105957.5,49.1337,-120.11235,1979.0,5.0,8.0,0.1,6.0
50%,211915.0,51.023,-114.503,1991.0,7.0,15.0,0.1,9.0
75%,317872.5,54.630289,-88.303603,2005.0,8.0,23.0,1.0,14.0
max,423830.0,70.0,116.188,2021.0,12.0,31.0,1050000.0,15.0


In [37]:
# count missing (NaN) values in each column
fire_pts.isnull().sum()

FID                0
SRC_AGENCY         0
FIRE_ID          165
FIRENAME           1
LATITUDE           0
LONGITUDE          0
YEAR               0
MONTH              0
DAY                0
REP_DATE        3713
ATTK_DATE     395091
OUT_DATE      210739
DECADE             0
SIZE_HA            0
CAUSE            241
PROTZONE        1010
FIRE_TYPE          0
MORE_INFO          0
CFS_REF_ID         0
CFS_NOTE1          0
CFS_NOTE2          0
ACQ_DATE       10605
SRC_AGY2           0
ECOZONE            0
ECOZ_REF           0
ECOZ_NAME          0
ECOZ_NOM           0
dtype: int64

Although `isnull()` reports no missing values for the `YEAR`, `MONTH`, and `DAY` columns, they do contain unknown values. The summary statistics show minimum values that are not possible for dates. The [NFDB documentation](/data/NFDB_point_txt/NFDB_point_20220901_shapefile_metadata.pdf) notes that unknown years are recorded as `-999`. Similarly, the value `0` is outside the range of possible months or days.

In [48]:
# actual number of unknown date parts:
# YEAR is recorded as -999 when unknown; 0 is not a valid MONTH or DAY
for col in ['YEAR', 'MONTH', 'DAY']:
    print(f'{col}\t {(fire_pts[col] <= 0).sum()}')

YEAR 	 95
MONTH 	 3391
DAY 	 3713
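
One possible next step, sketched here rather than taken from the original analysis, is to turn these sentinel values into real missing values so that `isnull()` and `describe()` report them correctly. The `sample` frame and the `mask_unknown_dates` helper below are hypothetical and used only for illustration. The sketch assumes that any non-positive `YEAR`, `MONTH`, or `DAY` means "unknown", which matches the `-999` and `0` values discussed above.

```python
import pandas as pd

# Hypothetical sample that mimics the NFDB sentinel values; these are not real records.
sample = pd.DataFrame({
    'YEAR':  [1953, -999, 2001, 1990],
    'MONTH': [5, 0, 0, 7],
    'DAY':   [26, 0, 0, 0],
})

def mask_unknown_dates(df):
    """Return a copy in which non-positive YEAR/MONTH/DAY values become NaN."""
    out = df.copy()
    for col in ['YEAR', 'MONTH', 'DAY']:
        # Series.where keeps values where the condition is True
        # and replaces the rest with NaN (the column becomes float64).
        out[col] = out[col].where(out[col] > 0)
    return out

cleaned = mask_unknown_dates(sample)

# After masking, isnull() counts the unknown date parts directly.
unknown_counts = cleaned[['YEAR', 'MONTH', 'DAY']].isnull().sum()
print(unknown_counts)
```

Applied to the full dataset, the call would be `mask_unknown_dates(fire_pts)`. Working on a copy leaves the original sentinel values available in case they are needed later.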


In [40]:
for col in fire_pts.select_dtypes(include=['object']).columns:
    print(f'{col}: {fire_pts[col].unique()}\n')

SRC_AGENCY: ['BC' 'AB' 'SK' 'MB' 'ON' 'QC' 'NL' 'NB' 'NS' 'YT' 'NT' 'PC-NA' 'PC-WB'
 'PC-VU' 'PC-BA' 'PC-EI' 'PC-WP' 'PC-JA' 'PC-PA' 'PC-GL' 'PC-KO' 'PC-RE'
 'PC-BT' 'PC-YO' 'PC-RM' 'PC-GF' 'PC-GR' 'PC-WL' 'PC-FR' 'PC-PU' 'PC-KG'
 'PC-LM' 'PC-CB' 'PC-PE' 'PC-BP' 'PC-TI' 'PC-SL' 'PC-KE' 'PC-PP' 'PC-SY'
 'PC-SE' 'PC-NC' 'PC-KL' 'PC-RE-GL' 'PC-GM' 'PC-PR' 'PC-TN' 'PC-GI'
 'PC-FW' 'PC-FO' 'PC-GB' 'PC-LO' 'PC-RO' 'PC-FU' 'PC-MM' 'PC-TH']

FIRE_ID: ['1953-G00041' '1950-R00028' '1950-G00026' ... '2021WB002' '2021WB003'
 '2021WP001']

FIRENAME: [' ' 'DA3002' 'DA3003' ... 'Y-camp guard (Maskinonge 2)'
 'Bertha trailhead powerline' 'WAP 24 pond fire / WNP-21-001']

REP_DATE: ['1953-05-26 00:00:00' '1950-06-22 00:00:00' '1950-06-04 00:00:00' ...
 '2021-03-08 00:00:00' '2021-11-30 00:00:00' '2021-03-04 00:00:00']

ATTK_DATE: [nan '1983-05-26 00:00:00' '1983-06-19 00:00:00' ... '2021-10-24 00:00:00'
 '2021-10-27 00:00:00' '2021-10-30 00:00:00']

OUT_DATE: [nan '1983-05-26 00:00:00' '1983-06-19 00:0