# The California Wildfire Damage Prediction

# Introduction

Wildfires in California have become more frequent and severe, causing significant environmental and economic damage. This project aims to predict wildfire damage using machine learning by analyzing historical wildfire data, weather conditions, and geographical features.

# Description:
The dataset records the damage sustained by structures across various fire incidents, categorized by damage percentage: from "Affected (1-9%)" and "Minor (10-25%)" through "Major (26-50%)" to "Destroyed (>50%)". The data is collected by field inspectors who evaluate structures impacted by wildland fires.

# Objective:
The goal is to develop a regression or classification model (or another suitable ML model) to predict wildfire damage in California using historical data, weather conditions, and geographical factors. By analyzing fire patterns, cleaning data, and building predictive models, this project aims to enhance early warning systems, improve resource allocation, and support risk assessment for wildfire management.

# Loading dataset

In [15]:
# Import packages and load the dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('/content/The California Wildfire Data.csv', encoding='ISO-8859-1')
df.head()


Unnamed: 0,_id,OBJECTID,* Damage,* Street Number,* Street Name,"* Street Type (e.g. road, drive, lane, etc.)","Street Suffix (e.g. apt. 23, blding C)",* City,State,Zip Code,...,Fire Name (Secondary),APN (parcel),Assessed Improved Value (parcel),Year Built (parcel),Site Address (parcel),GLOBALID,Latitude,Longitude,x,y
0,1,1,No Damage,8376.0,Quail Canyon,Road,,Winters,CA,,...,Quail,101090290,510000.0,1997.0,8376 QUAIL CANYON RD VACAVILLE CA 95688,e1919a06-b4c6-476d-99e5-f0b45b070de8,38.47496,-122.044465,-13585927.7,4646740.75
1,2,2,Affected (1-9%),8402.0,Quail Canyon,Road,,Winters,CA,,...,Quail,101090270,573052.0,1980.0,8402 QUAIL CANYON RD VACAVILLE CA 95688,b090eeb6-5b18-421e-9723-af7c9144587c,38.477442,-122.043252,-13585792.71,4647093.599
2,3,3,No Damage,8430.0,Quail Canyon,Road,,Winters,CA,,...,Quail,101090310,350151.0,2004.0,8430 QUAIL CANYON RD VACAVILLE CA 95688,268da70b-753f-46aa-8fb1-327099337395,38.479357,-122.044585,-13585941.01,4647366.034
3,4,4,No Damage,3838.0,Putah Creek,Road,,Winters,CA,,...,Quail,103010240,134880.0,1981.0,3838 PUTAH CREEK RD WINTERS CA 95694,64d4a278-5ee9-414a-8bf4-247c5b5c60f9,38.487313,-122.015115,-13582660.52,4648497.399
4,5,5,No Damage,3830.0,Putah Creek,Road,,Winters,CA,,...,Quail,103010220,346648.0,1980.0,3830 PUTAH CREEK RD WINTERS CA 95694,1b44b214-01fd-4f06-b764-eb42a1ec93d7,38.485636,-122.016122,-13582772.6,4648258.826


# Exploratory Data Analysis (EDA)

In [19]:
# Understanding the basic structure of the dataset
df.info()
df.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100230 entries, 0 to 100229
Data columns (total 47 columns):
 #   Column                                                        Non-Null Count   Dtype  
---  ------                                                        --------------   -----  
 0   _id                                                           100230 non-null  int64  
 1   OBJECTID                                                      100230 non-null  int64  
 2   * Damage                                                      100230 non-null  object 
 3   * Street Number                                               95810 non-null   float64
 4   * Street Name                                                 94744 non-null   object 
 5   * Street Type (e.g. road, drive, lane, etc.)                  87033 non-null   object 
 6   Street Suffix (e.g. apt. 23, blding C)                        44148 non-null   object 
 7   * City                                                  

(100230, 47)

**Dataset Column Descriptions**

OBJECTID: A unique identifier for each record in the dataset.

DAMAGE: Indicates the level of fire damage to the structure (e.g., "No Damage", "Affected (1-9%)").

STREETNUMBER: The street number of the impacted structure.

STREETNAME: The name of the street where the impacted structure is located.

STREETTYPE: The type of street (e.g., "Road", "Lane").

STREETSUFFIX: Additional address information, such as apartment or building numbers (if applicable).

CITY: The city where the impacted structure is located.

STATE: The state abbreviation (e.g., "CA" for California).

ZIPCODE: The postal code of the impacted structure.

CALFIREUNIT: The CAL FIRE unit responsible for the area.

COUNTY: The county where the impacted structure is located.

COMMUNITY: The community or neighborhood of the structure.

INCIDENTNAME: The name of the fire incident that impacted the structure.

APN: The Assessor’s Parcel Number (APN) of the property.

ASSESSEDIMPROVEDVALUE: The assessed value of the improved property (e.g., structures, not just land).

YEARBUILT: The year the structure was built.

SITEADDRESS: The full address of the property, including city, state, and ZIP code.

GLOBALID: A globally unique identifier for each record.

Latitude: The latitude coordinate of the structure’s location.

Longitude: The longitude coordinate of the structure’s location.

UTILITYMISCSTRUCTUREDISTANCE: The distance between the main structure and any utility or miscellaneous structures (if recorded).

FIRENAME: An alternative or secondary name for the fire incident.

geometry: A geospatial representation of the location in a point format (e.g., "POINT (-13585927.697 4646740.750)").

In [21]:
# Basic statistical Measures
df.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
_id,100230.0,50115.5,28934.05,1.0,25058.25,50115.5,75172.75,100230.0
OBJECTID,100230.0,50227.78,29107.68,1.0,25058.25,50115.5,75172.75,101221.0
* Street Number,95810.0,38867.22,5271695.0,0.0,723.25,4308.5,10003.0,1410065000.0
Zip Code,47429.0,46309.7,47467.65,0.0,0.0,0.0,95667.0,96311.0
# Units in Structure (if multi unit),31184.0,0.4332991,34.60877,0.0,0.0,0.0,0.0,6101.0
# of Damaged Outbuildings < 120 SQFT,31085.0,0.08756635,0.4627289,0.0,0.0,0.0,0.0,40.0
# of Non Damaged Outbuildings < 120 SQFT,31073.0,0.1215203,0.5255796,0.0,0.0,0.0,0.0,20.0
Assessed Improved Value (parcel),94195.0,733702.2,8603013.0,0.0,59370.0,145551.0,310935.5,1220403000.0
Year Built (parcel),69812.0,1672.284,708.4518,0.0,1944.0,1972.0,1987.0,2022.0
Latitude,100230.0,38.32295,2.019086,32.59255,37.35093,38.69296,39.76387,41.99119
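
The summary above shows `Year Built (parcel)` with a minimum of 0 and a mean near 1672, and `Zip Code` with a median of 0, which suggests that zeros stand in for unrecorded values. Below is a minimal sketch of one way to handle this, assuming a zero in these two columns means "not recorded" (an assumption, not something the dataset confirms). The `sample` frame is a made-up stand-in for `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; the values are invented for illustration only.
sample = pd.DataFrame({
    "Year Built (parcel)": [1997.0, 0.0, 1980.0, np.nan],
    "Zip Code": [95688.0, 0.0, 0.0, 95694.0],
})

# Assumption: a zero in these columns means "not recorded".
placeholder_cols = ["Year Built (parcel)", "Zip Code"]
cleaned = sample.copy()
cleaned[placeholder_cols] = cleaned[placeholder_cols].replace(0, np.nan)

# Zeros now count as missing alongside the original NaN values.
print(cleaned.isnull().sum())
```

Converting the zeros before computing statistics keeps them from dragging the mean and quartiles down.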


In [22]:
# Finding unique values
df.nunique()

Unnamed: 0,0
_id,100230
OBJECTID,100230
* Damage,6
* Street Number,18098
* Street Name,9818
"* Street Type (e.g. road, drive, lane, etc.)",27
"Street Suffix (e.g. apt. 23, blding C)",2260
* City,442
State,1
Zip Code,206


In [30]:
# Checking missing values: count per column, sorted from most to least missing
missing = df.isnull().sum().sort_values(ascending=False)
# Fraction (0-1) of rows missing in each column, in the same order as `missing`
percentage_of_missing = missing / len(df)
missing_data = pd.concat([missing, percentage_of_missing], axis=1, keys=['Missing_values', 'Percentage'])
missing_data

Unnamed: 0,Missing_values,Percentage
Battalion,93832,0.936167
If Affected 1-9% - What started fire?,91214,0.910047
If Affected 1-9% - Where did fire start?,89490,0.892846
Distance - Residence to Utility/Misc Structure > 120 SQFT,85874,0.856769
Fire Name (Secondary),79059,0.788776
Distance - Propane Tank to Structure,77173,0.769959
Structure Defense Actions Taken,75760,0.755862
# of Non Damaged Outbuildings < 120 SQFT,69157,0.689983
# of Damaged Outbuildings < 120 SQFT,69145,0.689863
# Units in Structure (if multi unit),69046,0.688876
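
Several columns in the table above are more than 70% empty. One common way to deal with this is to drop the mostly empty columns and impute the rest. Below is a minimal sketch, not a recommendation for this dataset: the 0.7 cutoff, the median/mode imputation, and the toy `sample` frame are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; the values are invented for illustration only.
sample = pd.DataFrame({
    "Battalion": [np.nan, np.nan, np.nan, "B1", np.nan],
    "Year Built (parcel)": [1997.0, np.nan, 2004.0, 1981.0, 1980.0],
    "* City": ["Winters", "Winters", None, "Winters", "Vacaville"],
})

threshold = 0.7  # hypothetical cutoff: drop columns missing in more than 70% of rows

# Share of missing rows per column; keep only columns at or below the cutoff.
missing_share = sample.isnull().mean()
kept = sample.loc[:, missing_share <= threshold].copy()

# Impute what remains: median for numeric columns, most frequent value otherwise.
for col in kept.columns:
    if pd.api.types.is_numeric_dtype(kept[col]):
        kept[col] = kept[col].fillna(kept[col].median())
    else:
        kept[col] = kept[col].fillna(kept[col].mode().iloc[0])

print(kept)
```

When a model is trained later, the medians and modes should be computed on the training split only and then applied to the test split, so that no information leaks between them.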


In [31]:
df.duplicated().sum()

0

In [35]:
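# Finding categorical columns
# Columns stored as strings (object dtype)
object_columns = df.select_dtypes(include=['object']).columns
# Treat low-cardinality columns (fewer than 10 distinct values) as categorical
categorical_columns = df.columns[df.nunique() < 10]
for col in categorical_columns:
    print(col, df[col].unique())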

* Damage ['No Damage' 'Affected (1-9%)' 'Minor (10-25%)' 'Destroyed (>50%)'
 'Major (26-50%)' 'Inaccessible']
State ['CA' nan]
Hazard Type ['Fire' 'Earthquake' 'Flood']
If Affected 1-9% - What started fire? [nan 'Unknown' 'Direct flame impingement' 'Embers' 'Radiant Heat'
 'Not Applicable' 'Bushes' 'Embers or overheated electrical motor'
 'Post on structure' '0-10']
Structure Category ['Single Residence' 'Other Minor Structure' 'Multiple Residence'
 'Nonresidential Commercial' 'Mixed Commercial/Residential'
 'Infrastructure' 'Agriculture']
* Eaves ['Unenclosed' 'Enclosed' 'Unknown' 'No Eaves' ' ' 'Not Applicable' nan]
* Window Pane ['Single Pane' 'Multi Pane' 'Unknown' 'No Windows' ' ' nan 'No Deck/Porch'
 'Radiant Heat']
* Deck/Porch On Grade ['Wood' 'Masonry/Concrete' 'No Deck/Porch' 'Unknown' 'Composite' ' ']
* Deck/Porch Elevated ['Wood' 'No Deck/Porch' 'Unknown' 'Masonry/Concrete' 'Composite' ' ']
* Patio Cover/Carport Attached to Structure ['No Patio Cover/Carport' 'Combustible' 
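
Several of the value lists printed above contain a single-space string `' '` alongside `nan`, so some missing entries are not counted by `isnull()`. Below is a minimal sketch of one way to normalize them, assuming whitespace-only strings should be treated as missing. The `sample` frame is a made-up stand-in for `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; the values are invented for illustration only.
sample = pd.DataFrame({
    "* Eaves": ["Unenclosed", " ", "Unknown", None],
    "* Window Pane": ["Single Pane", "Multi Pane", " ", "No Windows"],
})

# Assumption: empty or whitespace-only strings mean "not recorded".
normalized = sample.replace(r"^\s*$", np.nan, regex=True)

print(normalized.isnull().sum())
```

Whether labels such as 'Unknown' or 'Not Applicable' should also become missing is a separate modelling choice; this sketch leaves them as they are.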

In [37]:
# Finding Numerical Columns
df.select_dtypes(include=['int64', 'float64']).columns

Index(['_id', 'OBJECTID', '* Street Number', 'Zip Code',
       '# Units in Structure (if multi unit)',
       '# of Damaged Outbuildings < 120 SQFT',
       '# of Non Damaged Outbuildings < 120 SQFT',
       'Assessed Improved Value (parcel)', 'Year Built (parcel)', 'Latitude',
       'Longitude', 'x', 'y'],
      dtype='object')

# My Observation

Based on my observations, the dataset has no duplicate records, but it has many missing values in both categorical and numerical columns.

**Duplicates:** None

**Missing Values:** Present in both numerical and categorical columns, so they need to be handled (dropped or imputed), and the categorical values encoded, before fitting machine learning models.

**Target Variable:** The objective is to predict the damage caused by wildfires, so I am setting the '* Damage' column as the target variable.
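
Since the target column holds text labels, it has to be converted to numbers before most models can use it. Below is a minimal sketch of an ordinal encoding. It assumes the labels printed earlier can be ordered by severity and that 'Inaccessible' rows are excluded because they carry no damage level. Both are assumptions, and `damage_level` is a hypothetical column name.

```python
import pandas as pd

# Labels as printed by the unique-values cell; the severity ordering is an assumption.
damage_order = [
    "No Damage",
    "Affected (1-9%)",
    "Minor (10-25%)",
    "Major (26-50%)",
    "Destroyed (>50%)",
]

# Toy stand-in for df; the rows are invented for illustration only.
sample = pd.DataFrame({
    "* Damage": ["No Damage", "Destroyed (>50%)", "Inaccessible", "Minor (10-25%)"],
})

# Assumption: 'Inaccessible' is not a damage level, so those rows are set aside.
labelled = sample[sample["* Damage"].isin(damage_order)].copy()

# Map each label to its position in the severity ordering (0 = least severe).
damage_codes = {label: code for code, label in enumerate(damage_order)}
labelled["damage_level"] = labelled["* Damage"].map(damage_codes)

print(labelled)
```

An ordinal code suits models that can use the ordering. A classifier that treats the classes as unordered would work just as well with the raw labels.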
