# Wildfire Prediction - Data Prep

## Status Updates

| Source            | Retrieval Status | Ready for Use | Work Remaining | Notes| Link |
|---------------|:---------------|:----------------|:----------------|:---------------|----------------|
|CA Wildfire Incidents|Complete | X | -- | -- | https://www.kaggle.com/datasets/ananthu017/california-wildfire-incidents-20132020 |
|FIPS CA County Codes| Complete | X | -- | -- | https://www.census.gov/library/reference/code-lists/ansi.html#cou|
|NOAA Weather (temp, pcp) | Complete | X | -- | -- | https://www.ncei.noaa.gov/pub/data/cirs/climdiv/county-readme.txt |
|Wind| N/A | -- | - Gather data | Not yet collected, but it seems like a useful predictor. | -- |
|CA Elevation| Partial | -- |  - Assess completeness<br>- Identify features<br>- Prep features | It seems like this is incomplete; HARP only provides data for specific CA regions. | https://ww2.arb.ca.gov/resources/documents/harp-digital-elevation-model-files |
|CA Vegetation | Partial -<br>In Progress (SA) | -- | - Assess completeness<br>- Identify features<br>- Prep features | I'm retrieving satellite data for the dates that preceded each fire.  Will also try to parse the link Reese provided (on right). | https://map.dfg.ca.gov/metadata/ds1020.html |
|Population and Other Demographics| Partial | -- | - Gather other years' data<br>- Consider other demographic info to include | - We currently have data for only one year, via Reese's scraped set.<br>- This notebook includes code that uses the Census API to gather data for specified years, with an example for a single year. It can be modified to gather other years. To use it, you'll need to request a free API key.<br>- Population has not yet been normalized to population density. | https://api.census.gov/data.html |

In [1]:
import json
import pandas as pd
import requests
from io import StringIO

## Get NOAA weather data

In [2]:
# NOAA data

# from https://www.ncei.noaa.gov/pub/data/cirs/climdiv/
# README file: https://www.ncei.noaa.gov/pub/data/cirs/climdiv/county-readme.txt
# in this file CA state code is 04 (not 06)

noaa_pcp_url = "https://www.ncei.noaa.gov/pub/data/cirs/climdiv/climdiv-pcpncy-v1.0.0-20240606"
noaa_tmax_url = "https://www.ncei.noaa.gov/pub/data/cirs/climdiv/climdiv-tmaxcy-v1.0.0-20240606"
noaa_tmin_url = "https://www.ncei.noaa.gov/pub/data/cirs/climdiv/climdiv-tmincy-v1.0.0-20240606"

In [3]:
def get_noaa_data_df(target_url):
    noaa_data = requests.get(target_url)
    noaa_data.raise_for_status()  # stop early if the download failed
    noaa_month_colnames = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN", "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]
    # Read every column as a string so the leading zeros in ID are preserved
    noaa_df = pd.read_csv(
        StringIO(noaa_data.text), lineterminator="\n", sep=r"\s+", header=None,
        names=["ID", *noaa_month_colnames], index_col=False, dtype="string",
    )
    return noaa_df

In [4]:
pcp_data_df = get_noaa_data_df(noaa_pcp_url)
pcp_data_df.head(2)

Unnamed: 0,ID,JAN,FEB,MAR,APR,MAY,JUN,JUL,AUG,SEP,OCT,NOV,DEC
0,1001011895,7.03,2.96,8.36,3.53,3.96,5.4,3.92,3.36,0.73,2.03,1.44,3.66
1,1001011896,5.86,5.42,5.54,3.98,3.77,6.24,4.38,2.57,0.82,1.66,2.89,1.94


In [5]:
tmax_data_df = get_noaa_data_df(noaa_tmax_url)
tmax_data_df.head(2)

Unnamed: 0,ID,JAN,FEB,MAR,APR,MAY,JUN,JUL,AUG,SEP,OCT,NOV,DEC
0,1001271895,53.7,48.7,67.6,76.4,81.9,89.2,91.1,90.4,90.9,76.0,66.6,58.0
1,1001271896,54.2,60.8,65.3,81.6,88.5,88.2,92.0,94.5,90.8,77.2,69.9,58.7


In [6]:
tmin_data_df = get_noaa_data_df(noaa_tmin_url)
tmin_data_df.head(2)

Unnamed: 0,ID,JAN,FEB,MAR,APR,MAY,JUN,JUL,AUG,SEP,OCT,NOV,DEC
0,1001281895,34.2,27.7,43.4,51.8,59.3,67.4,69.7,70.3,67.1,46.9,42.1,32.5
1,1001281896,34.4,37.2,42.6,57.0,65.0,67.9,71.4,71.7,65.0,52.2,46.1,35.9


In [7]:
# Stack the three NOAA dfs row-wise (each keeps its own original row index)
noaa_full_df = pd.concat([pcp_data_df, tmax_data_df, tmin_data_df], axis=0)

## Parse NOAA weather data

In [8]:
# Split the NOAA ID into state code, county FIPS code, element code, and year
def extract_noaa_id(df):
    df["STATE_CODE"] = df["ID"].str[:2]
    df["FIPS_CODE"] = df["ID"].str[2:5]
    df["ELEMENT_CODE"] = df["ID"].str[5:7]
    df["YEAR"] = df["ID"].str[7:]
    return df

extract_noaa_id(noaa_full_df)
noaa_full_df

Unnamed: 0,ID,JAN,FEB,MAR,APR,MAY,JUN,JUL,AUG,SEP,OCT,NOV,DEC,STATE_CODE,FIPS_CODE,ELEMENT_CODE,YEAR
0,01001011895,7.03,2.96,8.36,3.53,3.96,5.40,3.92,3.36,0.73,2.03,1.44,3.66,01,001,01,1895
1,01001011896,5.86,5.42,5.54,3.98,3.77,6.24,4.38,2.57,0.82,1.66,2.89,1.94,01,001,01,1896
2,01001011897,3.27,6.63,10.94,4.35,0.81,1.57,3.96,5.02,0.87,0.75,1.84,4.38,01,001,01,1897
3,01001011898,2.33,2.07,2.60,4.56,0.54,3.13,5.80,6.02,1.51,3.21,6.66,3.91,01,001,01,1898
4,01001011899,5.80,6.94,3.35,2.22,2.93,2.31,6.80,2.90,0.63,3.02,1.98,5.25,01,001,01,1899
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
406905,50290282020,-22.80,-16.70,-1.60,18.40,36.30,45.90,47.10,44.80,34.90,18.30,-0.90,-3.20,50,290,28,2020
406906,50290282021,-3.30,-17.20,-6.20,11.00,33.30,46.20,49.20,43.90,31.10,20.50,-7.50,-7.60,50,290,28,2021
406907,50290282022,-12.50,-9.80,2.20,12.80,33.00,45.10,47.40,43.50,35.80,19.60,3.00,-8.50,50,290,28,2022
406908,50290282023,-4.40,-9.40,-1.70,2.10,34.60,45.10,50.80,48.00,33.00,17.20,8.80,-10.80,50,290,28,2023
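
The slicing in `extract_noaa_id` assumes an 11-character ID laid out as state (2), county FIPS (3), element (2), and year (4), which matches the sample rows above. A minimal sketch of the same decoding for a single ID follows. `decode_noaa_id` is a hypothetical helper, not part of this notebook, and the length check is an added assumption meant to catch IDs that have lost their leading zero (as would happen if the column were read as an integer).

```python
def decode_noaa_id(noaa_id: str) -> dict:
    """Split an 11-character NOAA county ID into its parts.

    Hypothetical helper for illustration; mirrors the slices in extract_noaa_id.
    """
    if len(noaa_id) != 11:
        raise ValueError(f"expected 11 characters, got {len(noaa_id)}: {noaa_id!r}")
    return {
        "STATE_CODE": noaa_id[:2],     # NOAA state code (CA is 04 in this file)
        "FIPS_CODE": noaa_id[2:5],     # county FIPS code
        "ELEMENT_CODE": noaa_id[5:7],  # e.g. 01 = precipitation
        "YEAR": noaa_id[7:],
    }

print(decode_noaa_id("04001011895"))
```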


In [9]:
# Map element codes to descriptive labels

# Element codes from the NOAA county README
element_code_map = {
    "01": "pcp",  # Precipitation
    "02": "tavg",  # Average Temperature
    "25": "Heating Degree Days",
    "26": "Cooling Degree Days",
    "27": "tmax",  # Maximum Temperature
    "28": "tmin",  # Minimum Temperature
}

noaa_full_df["NOAA_ELEMENT"] = noaa_full_df["ELEMENT_CODE"].map(element_code_map).fillna(noaa_full_df["ELEMENT_CODE"])
noaa_full_df

Unnamed: 0,ID,JAN,FEB,MAR,APR,MAY,JUN,JUL,AUG,SEP,OCT,NOV,DEC,STATE_CODE,FIPS_CODE,ELEMENT_CODE,YEAR,NOAA_ELEMENT
0,01001011895,7.03,2.96,8.36,3.53,3.96,5.40,3.92,3.36,0.73,2.03,1.44,3.66,01,001,01,1895,pcp
1,01001011896,5.86,5.42,5.54,3.98,3.77,6.24,4.38,2.57,0.82,1.66,2.89,1.94,01,001,01,1896,pcp
2,01001011897,3.27,6.63,10.94,4.35,0.81,1.57,3.96,5.02,0.87,0.75,1.84,4.38,01,001,01,1897,pcp
3,01001011898,2.33,2.07,2.60,4.56,0.54,3.13,5.80,6.02,1.51,3.21,6.66,3.91,01,001,01,1898,pcp
4,01001011899,5.80,6.94,3.35,2.22,2.93,2.31,6.80,2.90,0.63,3.02,1.98,5.25,01,001,01,1899,pcp
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
406905,50290282020,-22.80,-16.70,-1.60,18.40,36.30,45.90,47.10,44.80,34.90,18.30,-0.90,-3.20,50,290,28,2020,tmin
406906,50290282021,-3.30,-17.20,-6.20,11.00,33.30,46.20,49.20,43.90,31.10,20.50,-7.50,-7.60,50,290,28,2021,tmin
406907,50290282022,-12.50,-9.80,2.20,12.80,33.00,45.10,47.40,43.50,35.80,19.60,3.00,-8.50,50,290,28,2022,tmin
406908,50290282023,-4.40,-9.40,-1.70,2.10,34.60,45.10,50.80,48.00,33.00,17.20,8.80,-10.80,50,290,28,2023,tmin


## Create cleaned subset with just CA counties

In [10]:
noaa_ca_df = noaa_full_df[noaa_full_df["STATE_CODE"] == "04"]
noaa_ca_df

Unnamed: 0,ID,JAN,FEB,MAR,APR,MAY,JUN,JUL,AUG,SEP,OCT,NOV,DEC,STATE_CODE,FIPS_CODE,ELEMENT_CODE,YEAR,NOAA_ELEMENT
20410,04001011895,8.43,2.09,1.97,1.75,1.20,0.00,0.05,0.01,0.50,0.57,1.57,1.47,04,001,01,1895,pcp
20411,04001011896,8.47,0.23,2.71,4.13,0.72,0.00,0.02,0.27,0.47,1.56,4.12,3.30,04,001,01,1896,pcp
20412,04001011897,2.66,4.30,4.11,0.56,0.14,0.21,0.00,0.02,0.15,1.82,0.97,1.54,04,001,01,1897,pcp
20413,04001011898,1.14,2.49,0.48,0.21,1.35,0.18,0.00,0.02,0.86,0.95,0.65,1.32,04,001,01,1898,pcp
20414,04001011899,4.52,0.25,7.19,0.55,0.86,0.22,0.00,0.04,0.03,4.20,3.69,2.65,04,001,01,1899,pcp
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27945,04115282020,38.90,39.70,40.60,47.10,53.30,58.60,62.10,65.00,61.10,53.30,40.40,37.40,04,115,28,2020,tmin
27946,04115282021,38.50,39.40,39.50,46.70,53.40,61.40,66.40,63.40,59.00,48.50,45.70,37.70,04,115,28,2021,tmin
27947,04115282022,37.10,37.80,43.20,44.70,50.50,59.00,62.40,64.10,61.40,51.80,38.00,36.90,04,115,28,2022,tmin
27948,04115282023,36.80,35.50,37.80,44.00,52.00,56.20,63.70,63.10,56.10,50.60,42.70,41.10,04,115,28,2023,tmin


## Get county FIPS codes

In [11]:
# FIPS Codes

# from https://www.census.gov/library/reference/code-lists/ansi.html#cou
fips_url = "https://www2.census.gov/geo/docs/reference/codes2020/cou/st06_ca_cou2020.txt"

fips_data = requests.get(fips_url)
fips_df = pd.read_csv(StringIO(fips_data.text), lineterminator="\n", sep="|", dtype="string")
fips_df.head(2)

Unnamed: 0,STATE,STATEFP,COUNTYFP,COUNTYNS,COUNTYNAME,CLASSFP,FUNCSTAT
0,CA,6,1,1675839,Alameda County,H1,A
1,CA,6,3,1675840,Alpine County,H1,A


In [12]:
# Join to FIPS df on the 3-digit county code
noaa_ca_counties_df = pd.merge(fips_df[["COUNTYFP", "COUNTYNAME"]], noaa_ca_df, left_on="COUNTYFP", right_on="FIPS_CODE")
# Drop the trailing " County" so names match the wildfire data (e.g. "Alameda County" -> "Alameda")
noaa_ca_counties_df["COUNTYNAME"] = noaa_ca_counties_df["COUNTYNAME"].str.replace(r"\s+County$", "", regex=True)
noaa_ca_counties_df.head(2)

Unnamed: 0,COUNTYFP,COUNTYNAME,ID,JAN,FEB,MAR,APR,MAY,JUN,JUL,AUG,SEP,OCT,NOV,DEC,STATE_CODE,FIPS_CODE,ELEMENT_CODE,YEAR,NOAA_ELEMENT
0,1,Alameda,4001011895,8.43,2.09,1.97,1.75,1.2,0.0,0.05,0.01,0.5,0.57,1.57,1.47,4,1,1,1895,pcp
1,1,Alameda,4001011896,8.47,0.23,2.71,4.13,0.72,0.0,0.02,0.27,0.47,1.56,4.12,3.3,4,1,1,1896,pcp


In [13]:
# Remove unnecessary encoding columns
noaa_ca_counties_df = noaa_ca_counties_df[['COUNTYNAME', 'NOAA_ELEMENT', 'YEAR', 'JAN', 'FEB', 'MAR', 'APR', 'MAY',
       'JUN', 'JUL', 'AUG', 'SEP', 'OCT', 'NOV', 'DEC', 'ID']]
noaa_ca_counties_df.head(2)

Unnamed: 0,COUNTYNAME,NOAA_ELEMENT,YEAR,JAN,FEB,MAR,APR,MAY,JUN,JUL,AUG,SEP,OCT,NOV,DEC,ID
0,Alameda,pcp,1895,8.43,2.09,1.97,1.75,1.2,0.0,0.05,0.01,0.5,0.57,1.57,1.47,4001011895
1,Alameda,pcp,1896,8.47,0.23,2.71,4.13,0.72,0.0,0.02,0.27,0.47,1.56,4.12,3.3,4001011896


## Load wildfire data

In [14]:
# from Kaggle: https://www.kaggle.com/datasets/ananthu017/california-wildfire-incidents-20132020

wildfire_fpath = "../Data/California_Fire_Incidents.csv"

In [15]:
wildfire_df = pd.read_csv(wildfire_fpath)
wildfire_df.head()

Unnamed: 0,AcresBurned,Active,AdminUnit,AirTankers,ArchiveYear,CalFireIncident,CanonicalUrl,ConditionStatement,ControlStatement,Counties,...,SearchKeywords,Started,Status,StructuresDamaged,StructuresDestroyed,StructuresEvacuated,StructuresThreatened,UniqueId,Updated,WaterTenders
0,257314.0,False,Stanislaus National Forest/Yosemite National Park,,2013,True,/incidents/2013/8/17/rim-fire/,,,Tuolumne,...,"Rim Fire, Stanislaus National Forest, Yosemite...",2013-08-17T15:25:00Z,Finalized,,,,,5fb18d4d-213f-4d83-a179-daaf11939e78,2013-09-06T18:30:00Z,
1,30274.0,False,USFS Angeles National Forest/Los Angeles Count...,,2013,True,/incidents/2013/5/30/powerhouse-fire/,,,Los Angeles,...,"Powerhouse Fire, May 2013, June 2013, Angeles ...",2013-05-30T15:28:00Z,Finalized,,,,,bf37805e-1cc2-4208-9972-753e47874c87,2013-06-08T18:30:00Z,
2,27531.0,False,CAL FIRE Riverside Unit / San Bernardino Natio...,,2013,True,/incidents/2013/7/15/mountain-fire/,,,Riverside,...,"Mountain Fire, July 2013, Highway 243, Highway...",2013-07-15T13:43:00Z,Finalized,,,,,a3149fec-4d48-427c-8b2c-59e8b79d59db,2013-07-30T18:00:00Z,
3,27440.0,False,Tahoe National Forest,,2013,False,/incidents/2013/8/10/american-fire/,,,Placer,...,"American Fire, August 2013, Deadwood Ridge, Fo...",2013-08-10T16:30:00Z,Finalized,,,,,8213f5c7-34fa-403b-a4bc-da2ace6e6625,2013-08-30T08:00:00Z,
4,24251.0,False,Ventura County Fire/CAL FIRE,,2013,True,/incidents/2013/5/2/springs-fire/,Acreage has been reduced based upon more accur...,,Ventura,...,"Springs Fire, May 2013, Highway 101, Camarillo...",2013-05-02T07:01:00Z,Finalized,6.0,10.0,,,46731fb8-3350-4920-bdf7-910ac0eb715c,2013-05-11T06:30:00Z,11.0


In [16]:
wildfire_df["Counties"]

0          Tuolumne
1       Los Angeles
2         Riverside
3            Placer
4           Ventura
           ...     
1631      Riverside
1632         Nevada
1633           Yolo
1634      San Diego
1635      Riverside
Name: Counties, Length: 1636, dtype: object

In [17]:
# Filter out incidents whose "Counties" value is not a CA county
ca_wildfire_df = wildfire_df[~wildfire_df["Counties"].isin(['Mexico', 'State of Oregon', 'State of Nevada'])]
ca_wildfire_df["Counties"].unique()

array(['Tuolumne', 'Los Angeles', 'Riverside', 'Placer', 'Ventura',
       'Fresno', 'Siskiyou', 'Humboldt', 'Tehama', 'Shasta', 'San Diego',
       'Kern', 'Sonoma', 'Contra Costa', 'Butte', 'Tulare',
       'Santa Barbara', 'Mariposa', 'Monterey', 'El Dorado',
       'San Bernardino', 'Plumas', 'Modoc', 'San Luis Obispo', 'Madera',
       'Inyo', 'Napa', 'San Benito', 'San Joaquin', 'Lake', 'Alameda',
       'Glenn', 'Yolo', 'Sacramento', 'Stanislaus', 'Solano', 'Merced',
       'Mendocino', 'Lassen', 'Amador', 'Yuba', 'Nevada', 'Santa Clara',
       'Calaveras', 'San Mateo', 'Orange', 'Colusa', 'Trinity',
       'Del Norte', 'Mono', 'Alpine', 'Sutter', 'Kings', 'Sierra',
       'Santa Cruz', 'Marin'], dtype=object)
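
The exclusion list above is hard-coded from the values seen in this dataset. One possible alternative is to keep only rows whose county appears in the FIPS list. The sketch below assumes that county names in the wildfire data match the FIPS names once the " County" suffix is removed. The two small frames are stand-ins for `fips_df` and `wildfire_df`, not the real data.

```python
import pandas as pd

# Stand-in frames (illustrative only); in the notebook these would be
# fips_df and wildfire_df.
fips_sample = pd.DataFrame({"COUNTYNAME": ["Alameda County", "Riverside County"]})
fires_sample = pd.DataFrame({"Counties": ["Riverside", "Mexico", "Alameda", "State of Nevada"]})

# Build the set of valid CA county names without the " County" suffix.
valid_counties = set(
    fips_sample["COUNTYNAME"].str.replace(r"\s+County$", "", regex=True)
)

# Keep only incidents whose county is a known CA county.
ca_fires_sample = fires_sample[fires_sample["Counties"].isin(valid_counties)]
print(ca_fires_sample["Counties"].tolist())
```

An allowlist like this would also drop any unexpected value that the hard-coded exclusion list misses, at the cost of silently removing rows with misspelled county names, so it may be worth checking how many rows are dropped.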

## Get Population Data from Census

In [26]:
# STATUS: Reese retrieved bulk data for 2021 (./Data/CA_Counties_Population_Density.csv)
# TODO: Get data for other years, either bulk data or via Census API
# This code also can be repurposed for housing and other demo data

In [463]:
# If using Census API: https://api.census.gov/data.html
# https://www.census.gov/content/dam/Census/data/developers/api-user-guide/api-guide.pdf
# Will need to figure out which historical surveys contain this data
# Not all years' data is available through PEP

# If you are working on this using the API you'll need to request a key
# More info here https://www.census.gov/data/developers.html

#### PLEASE REPLACE THE STRING BELOW WITH YOUR KEY ####
API_KEY_CENSUS = ""

In [457]:
year = 2019 # convert this to a list of available years
census_url = f"https://api.census.gov/data/{year}/pep/population"
# https://api.census.gov/data/2019/pep/population/examples.html

params = {
    "get": "NAME,POP",
    "for": "county:*",
    "in": "state:06",
    'key': API_KEY_CENSUS
}

census_resp = requests.get(census_url, params=params)
census_resp.raise_for_status()  # surfaces a bad key or an unavailable year early
census_data = census_resp.json()
census_data

[['NAME', 'POP', 'state', 'county'],
 ['Merced County, California', '277680', '06', '047'],
 ['Mariposa County, California', '17203', '06', '043'],
 ['Modoc County, California', '8841', '06', '049'],
 ['Contra Costa County, California', '1153526', '06', '013'],
 ['Inyo County, California', '18039', '06', '027'],
 ['Stanislaus County, California', '550660', '06', '099'],
 ['Santa Barbara County, California', '446499', '06', '083'],
 ['Tehama County, California', '65084', '06', '103'],
 ['Mono County, California', '14444', '06', '051'],
 ['San Benito County, California', '62808', '06', '069'],
 ['Sacramento County, California', '1552058', '06', '067'],
 ['El Dorado County, California', '192843', '06', '017'],
 ['Monterey County, California', '434061', '06', '053'],
 ['San Francisco County, California', '881549', '06', '075'],
 ['San Diego County, California', '3338330', '06', '073'],
 ['Tulare County, California', '466195', '06', '107'],
 ['Humboldt County, California', '135558', '06', '

In [458]:
census_pop_df = pd.DataFrame(census_data[1:], columns=census_data[0], dtype="str")
census_pop_df["YEAR"] = year
census_pop_df.head(2)

Unnamed: 0,NAME,POP,state,county,YEAR
0,"Merced County, California",277680,6,47,2019
1,"Mariposa County, California",17203,6,43,2019


In [459]:
census_pop_df["COUNTYNAME"] = census_pop_df["NAME"].str.extract(r"(.*) County, California", expand=False)
census_pop_df = census_pop_df[["COUNTYNAME", "YEAR", "POP"]]
census_pop_df.head(2)

Unnamed: 0,COUNTYNAME,YEAR,POP
0,Merced,2019,277680
1,Mariposa,2019,17203
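
To extend the single-year example to several years, the parsing step can be separated from the download step. The sketch below is one possible shape, not a finished implementation: `census_rows_to_df`, `collect_population`, and `fake_fetch` are hypothetical names, and the fetcher is injected so the parsing can be tried without an API key. A real fetcher would call `requests.get` against whichever endpoint serves each year, since not all years are available through PEP.

```python
import pandas as pd

def census_rows_to_df(rows, year):
    """Turn a Census API response (header row + data rows) into a tidy frame."""
    df = pd.DataFrame(rows[1:], columns=rows[0])
    df["COUNTYNAME"] = df["NAME"].str.extract(r"(.*) County, California", expand=False)
    df["YEAR"] = year
    df["POP"] = df["POP"].astype(int)  # the API returns populations as strings
    return df[["COUNTYNAME", "YEAR", "POP"]]

def collect_population(years, fetch_rows):
    """Stack one frame per year; fetch_rows(year) must return the parsed JSON rows."""
    frames = [census_rows_to_df(fetch_rows(year), year) for year in years]
    return pd.concat(frames, ignore_index=True)

# Stand-in fetcher that returns the first two 2019 rows shown above for any year.
def fake_fetch(year):
    return [
        ["NAME", "POP", "state", "county"],
        ["Merced County, California", "277680", "06", "047"],
        ["Mariposa County, California", "17203", "06", "043"],
    ]

pop_df = collect_population([2019], fake_fetch)
print(pop_df)
```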


## Join Annual Census Data with Weather Data

In [460]:
# TODO: after retrieving remaining census years
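
A rough sketch of what this join might look like, assuming the keys are `COUNTYNAME` and `YEAR`. One thing to watch for: `YEAR` in the NOAA frame is a string (everything was read with `dtype="string"`), while `YEAR` in `census_pop_df` is an integer, so the key dtypes need to be aligned first. The frames below are stand-ins with placeholder numbers, not real measurements.

```python
import pandas as pd

# Stand-in frames with placeholder values (not real measurements).
weather_sample = pd.DataFrame({
    "COUNTYNAME": ["Merced", "Merced", "Mariposa"],
    "NOAA_ELEMENT": ["pcp", "tmax", "pcp"],
    "YEAR": pd.array(["2019", "2019", "2018"], dtype="string"),  # strings, as in the NOAA frame
    "JAN": ["1.00", "2.00", "3.00"],
})
pop_sample = pd.DataFrame({
    "COUNTYNAME": ["Merced", "Mariposa"],
    "YEAR": [2019, 2019],  # integers, as in census_pop_df
    "POP": ["277680", "17203"],
})

# Align key dtypes before merging; mismatched key dtypes either raise or fail to match.
weather_sample["YEAR"] = weather_sample["YEAR"].astype(int)

# A left join keeps every weather row; POP is missing where no census year exists.
joined = weather_sample.merge(pop_sample, on=["COUNTYNAME", "YEAR"], how="left")
print(joined)
```

A left join is used here so that weather rows for years without census data are kept rather than dropped; whether that is the right choice depends on how the modeling step handles missing population values.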

## Process Elevation Data

In [461]:
# TODO: Also identify features that could be useful

## Process Vegetation Data

In [465]:
# TODO: Process data from https://map.dfg.ca.gov/metadata/ds1020.html
# TODO: Consider also using satellite data (I'll look into this further to see how viable it is)

## EDA (in separate files)?