# **Wildfire Risk Assessment**
# Data Cleaning and General Pre-processing

This notebook pre-processes the forest fire data. It drops unwanted columns, narrows the data to fires from 2015 onwards that were reported by provincial agencies (national park agencies are excluded), and fills missing fire causes with the unknown code `U`. The cleaned data is then exported to a CSV file for use in the modelling stage.

### Input Required:
- **forest_fires.txt**: A comma-separated text file containing records of forest fires in Canada, reported by provinces and national parks.

### Output Generated:
- **fire_data_cleaned.csv**: The cleaned and filtered version of the input data, used as the input for the next part.

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('forest_fires.txt', sep=",")

data.head(5)


Unnamed: 0,FID,SRC_AGENCY,FIRE_ID,FIRENAME,LATITUDE,LONGITUDE,YEAR,MONTH,DAY,REP_DATE,...,MORE_INFO,CFS_REF_ID,CFS_NOTE1,CFS_NOTE2,ACQ_DATE,SRC_AGY2,ECOZONE,ECOZ_REF,ECOZ_NAME,ECOZ_NOM
0,0,BC,1953-G00041,,59.963,-128.172,1953,5,26,1953-05-26 00:00:00,...,,BC-1953-1953-G00041,,,2020-05-05 00:00:00,BC,12,12,Boreal Cordillera,Cordillère boréale
1,1,BC,1950-R00028,,59.318,-132.172,1950,6,22,1950-06-22 00:00:00,...,,BC-1950-1950-R00028,,,2020-05-05 00:00:00,BC,12,12,Boreal Cordillera,Cordillère boréale
2,2,BC,1950-G00026,,59.876,-131.922,1950,6,4,1950-06-04 00:00:00,...,,BC-1950-1950-G00026,,,2020-05-05 00:00:00,BC,12,12,Boreal Cordillera,Cordillère boréale
3,3,BC,1951-R00097,,59.76,-132.808,1951,7,15,1951-07-15 00:00:00,...,,BC-1951-1951-R00097,,,2020-05-05 00:00:00,BC,12,12,Boreal Cordillera,Cordillère boréale
4,4,BC,1952-G00116,,59.434,-126.172,1952,6,12,1952-06-12 00:00:00,...,,BC-1952-1952-G00116,,,2020-05-05 00:00:00,BC,12,12,Boreal Cordillera,Cordillère boréale


Removing unwanted columns

In [3]:
remove_cols = ['FIRENAME', 'ATTK_DATE', 'DECADE', 'PROTZONE', 'MORE_INFO', 'CFS_REF_ID', 'CFS_NOTE1', 'CFS_NOTE2', 'ACQ_DATE', 'SRC_AGY2', 'FIRE_TYPE']

fire_data = data.drop(columns=remove_cols)
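
`DataFrame.drop` raises a `KeyError` if any listed column is absent, which can happen if the source file is re-exported with a different schema. A minimal sketch of a more tolerant variant follows; the helper name `drop_columns_if_present` and the toy frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def drop_columns_if_present(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a new frame without the listed columns, skipping absent names."""
    # errors="ignore" makes pandas silently skip labels that do not exist.
    return df.drop(columns=columns, errors="ignore")


# Toy frame for illustration only; the values are made up.
toy = pd.DataFrame(
    {"FIRE_ID": ["a", "b"], "FIRENAME": [None, None], "YEAR": [2015, 2016]}
)

# "DECADE" is not in the toy frame, but no error is raised.
trimmed = drop_columns_if_present(toy, ["FIRENAME", "DECADE"])
print(list(trimmed.columns))
```

The trade-off is that a misspelled column name would also be skipped silently, so the strict default used in the cell above is the safer choice when the schema is known.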

Keeping only fires from 2015 onwards

In [4]:
fire_data = fire_data[fire_data['YEAR'] >= 2015]

In [5]:
fire_data['SRC_AGENCY'].value_counts()

SRC_AGENCY
BC          9520
AB          8983
ON          5772
QC          3583
SK          3001
MB          2617
NB          1859
NS          1313
NT          1108
YT           657
NL           532
PC-BA        193
PC-WB        160
PC-JA         78
PC-GL         50
PC-KO         43
PC-NA         37
PC-GR         34
PC-WL         33
PC-RM         30
PC-PA         29
PC-LM         28
PC-YO         24
PC-PP         21
PC-RE         17
PC-CB         10
PC-TI         10
PC-TN         10
PC-WP          8
PC-EI          7
PC-BP          6
PC-GI          6
PC-PU          4
PC-FO          4
PC-RE-GL       3
PC-FU          3
PC-KG          3
PC-KE          3
PC-MM          2
PC-RO          2
PC-TH          2
PC-KL          2
PC-VU          2
PC-GF          1
PC-LO          1
PC-PR          1
PC-SE          1
PC-NC          1
Name: count, dtype: int64

Restricting SRC_AGENCY to the provinces by removing the national park agencies (codes starting with `PC`)

In [6]:
mask = fire_data['SRC_AGENCY'].str.startswith('PC')

fire_data = fire_data[~mask]

fire_data.shape

(38945, 16)
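
The year and agency filters above can also be sketched as a single helper that returns an explicit copy, which avoids pandas' chained-assignment warnings when columns are modified later. The function name `filter_provincial_fires`, its parameters, and the toy rows are assumptions made for illustration.

```python
import pandas as pd


def filter_provincial_fires(
    df: pd.DataFrame, min_year: int = 2015, park_prefix: str = "PC"
) -> pd.DataFrame:
    """Keep rows from min_year onwards whose agency code lacks the park prefix."""
    recent = df["YEAR"] >= min_year
    # na=False treats a missing agency code as "not a park" instead of NaN.
    is_park = df["SRC_AGENCY"].str.startswith(park_prefix, na=False)
    # .copy() returns an independent frame rather than a view of df.
    return df.loc[recent & ~is_park].copy()


# Toy rows for illustration only.
toy = pd.DataFrame(
    {
        "YEAR": [2014, 2015, 2016, 2020],
        "SRC_AGENCY": ["BC", "PC-BA", "AB", "PC-RE-GL"],
    }
)

provincial = filter_provincial_fires(toy)
print(provincial)
```

In the toy data only the 2016 `AB` row survives: the 2014 row fails the year filter, and both `PC-` rows are dropped as park agencies.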

In [7]:
fire_data['CAUSE'].value_counts()

CAUSE
H       20252
L       17762
U         815
H-PB        7
Name: count, dtype: int64

Dropping observations with H-PB as fire CAUSE

In [8]:
fire_data = fire_data[fire_data['CAUSE'] != 'H-PB']

In [9]:
fire_data['CAUSE'].value_counts()

CAUSE
H    20252
L    17762
U      815
Name: count, dtype: int64

Checking if there are any null values in the dataset

In [10]:
fire_data.isnull().sum()

FID               0
SRC_AGENCY        0
FIRE_ID           0
LATITUDE          0
LONGITUDE         0
YEAR              0
MONTH             0
DAY               0
REP_DATE          2
OUT_DATE      14342
SIZE_HA           0
CAUSE           109
ECOZONE           0
ECOZ_REF          0
ECOZ_NAME         0
ECOZ_NOM          0
dtype: int64
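
Raw null counts are easier to judge next to the share of rows they represent. The sketch below builds such a summary; `null_summary` is a hypothetical helper and the toy frame contains made-up values.

```python
import pandas as pd


def null_summary(df: pd.DataFrame) -> pd.DataFrame:
    """List columns that contain nulls, with counts and percentages."""
    is_null = df.isnull()
    summary = pd.DataFrame(
        {
            "null_count": is_null.sum(),
            # The mean of a boolean column is the fraction of True values.
            "null_pct": (is_null.mean() * 100).round(2),
        }
    )
    has_nulls = summary[summary["null_count"] > 0]
    return has_nulls.sort_values("null_count", ascending=False)


# Toy frame for illustration only.
toy = pd.DataFrame(
    {
        "REP_DATE": ["2015-05-01", None, "2016-07-01", "2017-01-01"],
        "OUT_DATE": [None, None, "2016-07-03", None],
        "CAUSE": ["H", "L", "U", "H"],
    }
)

report = null_summary(toy)
print(report)
```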

Dropping observations where the REP_DATE (Reporting Date) column is null

In [11]:
fire_data = fire_data.dropna(subset=['REP_DATE'])

In [12]:
fire_data.isnull().sum()

FID               0
SRC_AGENCY        0
FIRE_ID           0
LATITUDE          0
LONGITUDE         0
YEAR              0
MONTH             0
DAY               0
REP_DATE          0
OUT_DATE      14340
SIZE_HA           0
CAUSE           109
ECOZONE           0
ECOZ_REF          0
ECOZ_NAME         0
ECOZ_NOM          0
dtype: int64

Filling the null values in the CAUSE column with 'U' (unknown)

In [13]:
# fill null values with U for column CAUSE
fire_data['CAUSE'] = fire_data['CAUSE'].fillna('U')
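
The two CAUSE steps (dropping `H-PB` rows and filling nulls with `U`) can be sketched as one function that works on a copy, so the assignment never touches a slice of another frame. The name `clean_cause` and its parameters are assumptions for illustration.

```python
import pandas as pd


def clean_cause(
    df: pd.DataFrame, drop_code: str = "H-PB", unknown_code: str = "U"
) -> pd.DataFrame:
    """Drop rows with drop_code and label missing causes as unknown_code."""
    # A null CAUSE compares as "not equal" to drop_code, so null rows are kept.
    kept = df[df["CAUSE"] != drop_code].copy()
    kept["CAUSE"] = kept["CAUSE"].fillna(unknown_code)
    return kept


# Toy frame for illustration only.
toy = pd.DataFrame({"CAUSE": ["H", None, "H-PB", "L", "U"]})

cleaned = clean_cause(toy)
print(cleaned["CAUSE"].value_counts())
```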

In [14]:
fire_data['SRC_AGENCY'].value_counts()

SRC_AGENCY
BC    9520
AB    8983
ON    5765
QC    3583
SK    3001
MB    2617
NB    1859
NS    1313
NT    1107
YT     657
NL     531
Name: count, dtype: int64

In [15]:
# Export fire_data to new csv file
fire_data.to_csv('fire_data_cleaned.csv', index=False)
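
The `index=False` argument matters for the next stage: without it, pandas writes the row index as an extra unnamed column that reappears as `Unnamed: 0` when the file is read back. The sketch below shows the difference on a toy frame, using in-memory buffers instead of files; the helper `round_trip` is a hypothetical name.

```python
import io

import pandas as pd


def round_trip(df: pd.DataFrame, write_index: bool) -> pd.DataFrame:
    """Write df to an in-memory CSV and read it back."""
    buffer = io.StringIO()
    df.to_csv(buffer, index=write_index)
    buffer.seek(0)  # rewind so read_csv starts at the beginning
    return pd.read_csv(buffer)


# Toy frame for illustration only.
toy = pd.DataFrame({"SRC_AGENCY": ["BC", "AB"], "YEAR": [2015, 2016]})

with_index = round_trip(toy, write_index=True)
without_index = round_trip(toy, write_index=False)

print(list(with_index.columns))
print(list(without_index.columns))
```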