## Author: Sourav Patel, PhD Candidate, Electrical Engineering, University of Minnesota, Minnesota, USA <br>
## Date: 3rd October, 2020 <br>
## Title: Wildfires in California - An Exploratory Data Analysis <br>
Dataset: california-wildfire-incidents (Kaggle)<br>
Language: Python<br>

### Disclaimer: This EDA has been conducted by the author and reflects only the author's views and beliefs, not those of any other authority.

# Analysis of Wildfire in California

Wildfires are becoming increasingly prevalent disasters, causing larger-than-ever losses of land and wildlife and having a wide impact on the environment (air quality, higher surrounding temperatures), the economy and human lives.

![](https://cdn.mos.cms.futurecdn.net/ZbatoSejxMwdMSVi3p5KdU-650-80.jpg.webp)

The Operational Land Imager aboard the NASA-USGS Landsat 8 satellite captured this image of California's Camp Fire on Nov. 8, 2018, around 10:45 a.m. local time (1845 GMT).
(Image: © NASA Earth Observatory image by Joshua Stevens, using Landsat data from USGS)

#### In this article we will look at an exploratory data analysis of wildfires in California using the [California Wildfire incidents dataset](https://www.kaggle.com/ananthu017/california-wildfire-incidents-20132020). The goal of this article will be to look at the following questions and to validate the initial hypotheses that ...:

1. Is the trend of wildfires decreasing or increasing? <br/>
2. Where in California are these zones of wildfire concentrated? <br/>
  2.1 Does a particular belt of forests (national and state) need more deployment of fire fighters (and/or fire handlers) than current? <br/>
  2.2 Is the CAL FIRE Dept. well situated? (Look at personnel deployed)
  
3. 

In [None]:
!pip install jovian --upgrade --quiet

In [None]:
import jovian

In [None]:
# imports
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
# ignores deprecation warnings
import warnings
warnings.filterwarnings("ignore")


# Plotting Libraries
import matplotlib.pyplot as plt
import matplotlib.style as style

# Seaborn Library Setting
import seaborn as sns
sns.set_style("darkgrid")
style.use('seaborn-talk') #sets the size of the charts
style.use('ggplot')
sns.set_context('talk') # Sets the font of the figures , talk -- presentation-friendly


In [None]:
# Read data file and create pandas data frame
fire_raw_df = pd.read_csv('../input/california-wildfire-incidents-20132020/California_Fire_Incidents.csv')

## 1. Understanding the wildfire-incidents dataset

### The dataset consists of 1636 wildfire incidents over the course of 2013 to 2019. There are 40 attributes (or columns) associated with each incident.

In [None]:
print(f"size of the raw dataset: {fire_raw_df.shape}")


### Let us look at a few sample rows of the dataset to get more insight into what can be analyzed.

In [None]:
fire_raw_df.sample(7)

### Now let us glance at the attributes associated with each incident, using the info() method in pandas.

In [None]:
fire_raw_df.info()

### As observed, the dataset contains a multitude of columns. Of these, we will focus on a few key attributes, namely:

| Column Name         	|                       Description                       	|
|---------------------	|:-------------------------------------------------------:	|
| AcresBurned         	|           Acres of land affected by wildfires           	|
| AdminUnit           	|   Administrative Jurisdiction of California Fire Dept.  	|
| ArchiveYear         	|              Year of fire incident reported             	|
| Counties            	|                    counties involved                    	|
| Extinguished        	|     Date, Time of the day the fire was extinguished     	|
| Fatalities          	|       Number of deaths caused due to the incident       	|
| Latitude, Longitude 	|            geo Location of the fire incident            	|
| MajorIncident       	|               Based on acres of land burned              	|
| Name                	| Name of the fire incident (based on incident occurrence) 	|
| PersonnelInvolved   	|        Manpower deployed to handle the incident         	|
| Started             	|                  Time of fire reported                  	|
| WaterTenders         	| Type of firefighting apparatus that specialises <br>in the transport of water <br> (**capacity ~2,900 US gallons each**).               	|

## creating the derived dataset


In [None]:
selected_columns = [
    'AcresBurned',
    'AdminUnit',
    'ArchiveYear',
    'Counties',
    'Extinguished',
    'Fatalities',
    'Latitude',
    'Longitude',
    'MajorIncident',
    'Name',
    'PersonnelInvolved',
    'Started',
    'WaterTenders'
]

In [None]:
fire_df = fire_raw_df[selected_columns].copy()

In [None]:
fire_df['PersonnelInvolved'] = pd.to_numeric(fire_df.PersonnelInvolved, errors='coerce')
fire_df['Fatalities'] = pd.to_numeric(fire_df.Fatalities, errors='coerce')
fire_df['WaterTenders'] = pd.to_numeric(fire_df.WaterTenders, errors='coerce')
fire_df.info()

## 2. Data Preparation and Cleaning <br/>


## Duration of wildfire incident <br/> 
### One of the important aspects of analyzing the data is to get a sense of how long the wildfire incidents last. To calculate this, we use the 'Started' and 'Extinguished' columns: the difference is computed with to_datetime(), converted to whole hours with astype('timedelta64[h]'), and then divided by 24 to express the duration in days.

In [None]:
fire_df['fire_duration'] = (pd.to_datetime(fire_df.Extinguished) - pd.to_datetime(fire_df.Started)).astype('timedelta64[h]')/24 # To convert in days
fire_df['fire_duration'] = pd.to_numeric(fire_df.fire_duration,errors = 'coerce')
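
The duration calculation can also be sketched on a small toy table. The snippet below is a minimal sketch, not the notebook's exact cell: the three timestamp pairs are made-up values, `toy` and `duration_days` are hypothetical names, and `dt.total_seconds()` is used in place of `astype('timedelta64[h]')`, so fractional hours are kept rather than truncated.

```python
import pandas as pd

# Made-up timestamps for illustration; the last row mimics a 'Started'
# value that was left at the 1969 default.
toy = pd.DataFrame({
    "Started": ["2018-11-08T06:33:00Z", "2017-12-04T18:28:00Z", "1969-12-31T16:00:00Z"],
    "Extinguished": ["2018-11-25T08:00:00Z", "2018-01-12T10:00:00Z", "2018-01-09T13:46:00Z"],
})

# Parse both columns as timezone-aware datetimes
started = pd.to_datetime(toy["Started"], utc=True)
extinguished = pd.to_datetime(toy["Extinguished"], utc=True)

# Timedelta -> seconds -> days (86400 seconds per day)
toy["duration_days"] = (extinguished - started).dt.total_seconds() / 86400
print(toy)
```

The third row shows why a default timestamp produces a duration of many thousands of days, which is the kind of outlier examined in the cells that follow.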

In [None]:
fire_df['Fatalities'] = pd.to_numeric(fire_df.Fatalities,errors = 'coerce')

In [None]:
fire_df.describe()

We observe that the calculated fire_duration spans from -17052 to 17900 days. To decide whether these are data entry errors, we need to explore how many fire incidents have negative durations and whether we should ignore them or take their absolute values. This is what we will explore next.

In [None]:
fire_df[fire_df.fire_duration < 0]['fire_duration'].value_counts()

In [None]:
# Find the indices where these faulty fire duration occurs in the dataset
faulty_fire_duration_index = fire_df[(fire_df.fire_duration < 0) | (fire_df.fire_duration > 500)]['fire_duration'].index.values

In [None]:
for idx in faulty_fire_duration_index:
    # Look up the incident page path and the computed duration by index label
    url = fire_raw_df.loc[idx, 'CanonicalUrl']
    duration = fire_df.loc[idx, 'fire_duration']
    print(f" link for Fire incident: https://www.fire.ca.gov{url} \t, \t fire_duration: {duration} days")

This is also a nice way to retrieve any URL from the source if the reader is interested in viewing more details of each incident.

As suspected, there are human errors in the data entry. The small negative durations result from swapping the _Extinguished_ and _Started_ columns, while the very large negative (or positive) durations occur because the _Extinguished_ and/or _Started_ column was left at the default value of 31st Dec, 1969. For our analysis we ignore the implausible values: the next cell drops every incident whose duration is negative or exceeds $500$ days, and keeps same-day fires ($0$ days). Because all negative durations are dropped, the absolute value taken afterwards leaves the remaining durations unchanged.

In [None]:
fire_df.drop(fire_df[(fire_df.fire_duration < 0) | (fire_df.fire_duration > 500)].index, inplace=True) # 0 days are same day fire extinguished, hence still kept in the dataset
# Taking absolute value for the rest of the fire duration 
fire_df['fire_duration'] = np.abs(fire_df.fire_duration)
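
To make the filtering rule concrete, here is a minimal sketch on a hypothetical Series of durations (the values are chosen for illustration only, and `durations`, `invalid` and `cleaned` are hypothetical names). It shows that a mask of this form removes negative and very large durations, while rows with a missing duration stay, because comparisons against NaN evaluate to False.

```python
import pandas as pd

# Illustrative durations in days, including outliers and a missing value
durations = pd.Series([-17052.0, -0.5, 0.0, 12.3, 465.0, 17900.0, float("nan")])

# Same rule as the drop: negative or longer than 500 days is invalid
invalid = (durations < 0) | (durations > 500)
cleaned = durations[~invalid]

print(cleaned)
```

If rows without a duration should be excluded as well, `cleaned.dropna()` would be one way to do it.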

Similarly we drop data that has invalid latitude and longitude.

In [None]:
fire_df.drop(fire_df[(fire_df.Latitude < -90) | (fire_df.Latitude > 90)].index, inplace=True)
fire_df.drop(fire_df[(fire_df.Longitude < -180) | (fire_df.Longitude >= 0)].index, inplace=True)
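
A stricter check than the global latitude and longitude limits is a bounding box around California. The sketch below is only an illustration: the box limits are approximate values assumed here, and `in_california_bbox` is a hypothetical helper that is not part of the notebook.

```python
# Approximate bounding box for California (assumed values for illustration)
CA_LAT_RANGE = (32.5, 42.1)
CA_LON_RANGE = (-124.6, -114.1)


def in_california_bbox(lat: float, lon: float) -> bool:
    """Return True if the point lies inside the rough bounding box."""
    return (CA_LAT_RANGE[0] <= lat <= CA_LAT_RANGE[1]
            and CA_LON_RANGE[0] <= lon <= CA_LON_RANGE[1])


# A plausible point, a (0, 0) placeholder, and a longitude missing its minus sign
points = [(39.81, -121.44), (0.0, 0.0), (34.4, 119.1)]
flags = [in_california_bbox(lat, lon) for lat, lon in points]
print(flags)
```

A box like this also catches coordinates that are valid on the globe but cannot belong to a California incident.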

# Cleaning AdminUnit names (Repetitions and Formatting)

Inspecting the AdminUnit column shows ~ 460 unique fire handling departments in California, which is not the case. Many different formats are used for entering the Admin Unit names, which has resulted in duplication of units under different names. To clean this, we use a simple approach: normalizing the formatting (uppercase to lowercase) and removing common terms such as:

>["/", "california", "cal", "fire", "unit", "county", "department", "national", "forest", "district", "unified command:", "usfs", "us", "service", "and", "of", "the", "city"]

Finally, we take the unique values, which results in ~ 200 units. These could be cleaned further, but for this project we will stick with this number.

In [None]:

fire_admin_df = fire_df['AdminUnit']
xwords = pd.Series(["/", "california", "cal", "fire", "unit", "county", "department", "national", "forest", "district", "unified command:", "usfs", "us", "service", "and", "of", "the", "city"])


fire_admin_df=  (fire_admin_df
                .str.lower()
                .str.replace("-", " ")
                .str.replace("los angeles", "la")
                );

# Remove each common term (plain substring match, not a regular expression)
for word in xwords:
    fire_admin_df = fire_admin_df.str.replace(word, "", regex=False)


# The following alone reduces the number of units by half
fire_admin_df=  (fire_admin_df
                .str.lstrip()
                .str.rstrip()
                .str.replace(" ", ""))
    
fire_df["AdminUnitModified"] = fire_admin_df

# fire_df["AdminUnitModified"].unique()
length_clean = len(fire_df["AdminUnitModified"].unique())

In [None]:
print(f"Number of Admin Units for uncleaned dataset: {len(fire_df['AdminUnit'].unique())}")
print(f"Number of Admin Units for cleaned dataset: {length_clean}")
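
One limitation of plain substring removal is that short terms such as "cal" or "us" are also removed from inside longer words (for example "calaveras"). A possible variant, sketched below under assumptions, removes only whole words by using regular expression word boundaries. The `normalize_unit` helper and the sample names are hypothetical, and the "unified command:" and "los angeles" rules of the cell above are left out for brevity.

```python
import re

# Common terms to drop; matched as whole words only
STOP_WORDS = ["california", "cal", "fire", "unit", "county", "department",
              "national", "forest", "district", "usfs", "us", "service",
              "and", "of", "the", "city"]
_STOP_RE = re.compile(r"\b(?:" + "|".join(map(re.escape, STOP_WORDS)) + r")\b")


def normalize_unit(name: str) -> str:
    """Lowercase, drop separators and stop words, then remove all spaces."""
    text = name.lower().replace("-", " ").replace("/", " ")
    text = _STOP_RE.sub(" ", text)
    return "".join(text.split())


for raw in ["CAL FIRE Riverside Unit",
            "CAL FIRE / Riverside County Fire Department",
            "Tuolumne-Calaveras Unit"]:
    print(raw, "->", normalize_unit(raw))
```

With word boundaries, the two Riverside spellings collapse to the same key while "calaveras" stays intact.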

We found that the Counties column identifies the fire handling unit for each incident more reliably. Thus we will use Counties instead of AdminUnits.

In [None]:
len(fire_df['Counties'].unique())

Next we move onto exploring the prepared data.

## 3. Exploratory Data Analysis

Before diving into details of each year, let us explore and test some of the hypotheses we have, based on an aggregated year basis.

In [None]:
# Create year wise data
fire_year_df = fire_df.groupby(['ArchiveYear']).sum(numeric_only=True)  # sum only the numeric columns
years = np.sort(fire_df['ArchiveYear'].unique())  # sorted so that it lines up with the groupby index
print(f"Unique years in dataset: {years}")
fire_year_df.head()
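
The yearly aggregation can be illustrated on a small hypothetical table. This is a minimal sketch with made-up numbers; it uses pandas named aggregation to compute the incident count and the total acres in a single step, which is one alternative to calling `sum()` and `count()` separately.

```python
import pandas as pd

# Made-up incidents for illustration
toy = pd.DataFrame({
    "ArchiveYear": [2017, 2017, 2018],
    "Name": ["Fire A", "Fire B", "Fire C"],
    "AcresBurned": [100.0, 300.0, 900.0],
})

# One row per year: number of incidents and total acres burned
yearly = toy.groupby("ArchiveYear").agg(
    incidents=("Name", "count"),
    acres_burned=("AcresBurned", "sum"),
)
print(yearly)
```

Because the result is indexed by year, values can be read by label, for example `yearly.loc[2018, "acres_burned"]`.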

First let us look at how severe these fires are by looking at the number of fire incidents and the total acres of land burned in each of the years.

In [None]:
## Plot total land area burned
sns.set_style("ticks")
sns.set_context("talk")
Total_land = fire_df.groupby(['ArchiveYear']).sum(numeric_only=True)
Total_incidents = fire_df.groupby(['ArchiveYear']).count()
print(f"Total Number of Incidents: {Total_incidents['AcresBurned'].values}")
f, ax = plt.subplots()
sns.set_palette('bright')
# Bars (left axis): total acres burned per year
ax.bar(years, Total_land['AcresBurned'], alpha=0.785, edgecolor="0")
# Red line (right axis): number of incidents per year
ax2 = ax.twinx()
ax2.plot(years, Total_incidents['AcresBurned'], "s-r", lw=4, ms=8, mew=2, mec='k')
ax.set_ylabel("Total acres of land burned")
ax2.set_ylabel("Total number of wildfire incidents", c='r')  # red label matches the red line
ax.set_xlabel("Year")
ax.set_title("Total land area burned due to wildfires in California [2013-2019]\n")
plt.show()

In [None]:
# Ratio of the 2018 value to the 2017 value, expressed in percent
percent2018_over2017_incidents = Total_incidents['AcresBurned'].loc[2018]/Total_incidents['AcresBurned'].loc[2017]*100
percent2018_over2017_landBurned = Total_land['AcresBurned'].loc[2018]/Total_land['AcresBurned'].loc[2017]*100

print(f"Percentage of number of incidents in year 2018 to that of year 2017: {percent2018_over2017_incidents} % \nPercentage of land burned in year 2018 to that of year 2017: {percent2018_over2017_landBurned} %")
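
The values printed by this cell are ratios expressed in percent, which are easy to confuse with percent changes. The short sketch below spells out the difference; the inputs are illustrative round numbers rather than values read from the dataset, and both function names are hypothetical.

```python
def ratio_percent(new: float, old: float) -> float:
    """New value as a percentage of the old value."""
    return new * 100 / old


def percent_change(new: float, old: float) -> float:
    """Relative change from old to new, in percent (negative means a decrease)."""
    return (new - old) * 100 / old


# Illustrative inputs: a quantity that falls and a quantity that rises
for label, new, old in [("falls", 72, 100), ("rises", 187, 100)]:
    print(label, ratio_percent(new, old), percent_change(new, old))
```

A ratio of 72 % corresponds to a 28 % decrease, while a ratio of 187 % corresponds to an 87 % increase, so it matters which of the two quantities is being quoted.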

## Observation 0: <br>
It can be clearly observed that a larger land area was burned by wildfires in 2017 and 2018 than in any other year in the dataset.

## Observation 1: <br>
The total number of incidents in 2018 is about 28 % lower than in 2017, whereas the land area burned in 2018 is about 187 % of that burned in 2017. This means that, taken as a whole, the wildfires in 2018 were larger and more severe.

Let us investigate the major incident column to verify this next.

In [None]:
custom_palette = ["#12C617","#F60700"]
sns.set_style("whitegrid")
# sns.barplot( fire_df["ArchiveYear"],fire_df['AcresBurned'],palette=custom_palette, hue = fire_df["MajorIncident"],capsize=.15)
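# estimator=np.sum plots yearly totals; the default estimator would plot the mean acres per incident
sns.barplot(x="ArchiveYear", y='AcresBurned', hue="MajorIncident", data=fire_df, palette=custom_palette, estimator=np.sum, capsize=.15)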
plt.ylabel("Total acres of land Burned")
plt.xlabel("Year")
plt.xticks(rotation= 45)
plt.title("Total land area burned due to wildfires in California [2013-2019]");


In [None]:
# Share of the burned land that is attributable to major incidents, per year
for year in (2016, 2017, 2018):
    in_year = fire_df[fire_df.ArchiveYear == year]
    major_acres = in_year[in_year.MajorIncident == True]['AcresBurned'].sum()
    total_acres = in_year['AcresBurned'].sum()
    percent_major = np.divide(major_acres, total_acres)*100
    print(f" Land burned in Major Incidents vs Total Land Burned in {year}: {percent_major}%")
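
The same share can be computed for every year at once with a grouped sum. The snippet below is a minimal sketch on made-up numbers, assuming a boolean `MajorIncident` column; `toy`, `major`, `total` and `share` are hypothetical names.

```python
import pandas as pd

# Made-up incidents for illustration
toy = pd.DataFrame({
    "ArchiveYear":   [2015, 2016, 2016, 2017, 2017],
    "MajorIncident": [False, True, False, True, False],
    "AcresBurned":   [50.0, 600.0, 400.0, 900.0, 100.0],
})

# Acres burned per year: all incidents, and major incidents only
total = toy.groupby("ArchiveYear")["AcresBurned"].sum()
major = toy[toy["MajorIncident"]].groupby("ArchiveYear")["AcresBurned"].sum()

# Years without any major incident get a share of 0 instead of NaN
share = major.reindex(total.index, fill_value=0) / total * 100
print(share)
```

The `reindex` step matters for years that have no major incident at all, which would otherwise be missing from `major`.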

## Observation 2. (Increase in Severity)
The above figure corroborates our hypothesis. This was expected and is a **shocking insight** at the same time: since 2016, major incidents account for more than $50 \%$ of the land burned.

A more detailed future analysis could examine how this finding corresponds to the increasing trend of California temperature (or reduced rainfall), which dries out forest beds and makes them more susceptible to fire hazards, leading to major incidents.

## Observation 3. (Learning from the Past)

It is possible that the number of wildfires in 2019 is much lower than in 2018 because the severity of the 2018 wildfires resulted in extra funding, equipment and better handling of wildfires in 2019.

In fact, the following excerpt from the [California State Budget Summary, Pg. 3](http://www.ebudget.ca.gov/2019-20/pdf/Enacted/BudgetSummary/FullBudgetSummary.pdf) for 2019-2020 supports it.

> The Budget includes critical investments needed to sustain and improve California’s
emergency preparedness, response, and recovery capabilities. This includes
$240.3 million to augment the California Department of Forestry and Fire Protection's
(CAL FIRE's) firefighting capabilities by adding 13 additional year‑round engines,
replacing Vietnam War-era helicopters, deploying new air tankers, and investing in
technology and data analytics that support CAL FIRE's initial fire suppression strategies.
The Budget also provides a sizable investment in forest management to increase fire
prevention and complete additional fuel reduction projects, including increased
prescribed fire crews. 

This further bolsters our findings.

In [None]:
# When was the last fire recorded in the list for year 2019? 
print(f"Last fire reported on : {fire_df['Started'].tail(1).values[0]}")

One thing to note here is that for 2019 we only have data until mid-October, so it is possible that larger and more severe fire incidents occurred in November and December of 2019. We acknowledge this fact and proceed with the rest of our analysis.

What can we say about the number of personnel involved?

In [None]:
fig, ax = plt.subplots(1,3, figsize = (20,5))
sns.barplot(y= years,x = fire_year_df['WaterTenders'],palette="rocket",alpha = 0.75,orient = 'h',ax=ax[1],edgecolor="0")
sns.barplot(x = fire_year_df['PersonnelInvolved'],y = years, palette="rocket",alpha = 0.75, orient = 'h', ax = ax[0],edgecolor="0")
sns.barplot(y = years,x = np.divide(fire_year_df['WaterTenders'],fire_year_df['PersonnelInvolved'])*100,palette="rocket", orient = 'h',ax = ax[2],edgecolor="0.5")
ax[0].set_ylabel('Years')
ax[0].set_xlabel('Number of Personnel Involved')

ax[1].set_ylabel('Years')
ax[1].set_xlabel('Number of Water Tenders Involved')

ax[2].set_ylabel('Years')
ax[2].set_xlabel('Ratio of Water Tenders to Personnel (in %)')

plt.suptitle('Personnel and Water Tenders in firefighting over the years')
plt.tight_layout(pad=2)

In [None]:
# Personnel deployed in 2018 as a percentage of the personnel deployed in 2013
(fire_year_df.loc[2018, 'PersonnelInvolved']/fire_year_df.loc[2013, 'PersonnelInvolved'])*100

## Observation 4. (Progress)

Due to advancements in wildfire handling technology, fewer deployed personnel can handle larger and more expansive wildfires. This is evident from the years 2013 and 2018: the number of wildfires in 2018 was $\sim 200 \%$ of that in 2013, whereas the personnel deployed was lower by around $23 \%$. A similar conclusion can be drawn for the number of water tenders involved.

_In other words, twice the number of wildfire incidents were handled by roughly three-fourths of the workforce in 2018 compared to 2013._

In [None]:
jovian.commit(project='california-wildfire-analysis', environment=None)

## 4. Questions and Answers

Now, let us try to evaluate the nature of these wildfires: what was the worst wildfire (in terms of duration, fatalities, etc.), where are these fires mostly concentrated, and which administrative unit is involved in dealing with them? We will try to validate some of these findings against other publicly available reports.

Let us begin by looking at the top 20 administrative divisions of CAL FIRE that handle the most wildfires.

In [None]:
admin_indx = fire_df["Counties"].value_counts().index#[0:20]

admin_count = fire_df["Counties"].value_counts().values#[0:20]
admin_count

In [None]:

fig, ax = plt.subplots(1,2, figsize = (10,5))

sns.distplot(admin_count, ax=ax[0])
sns.boxplot(x=admin_count, ax=ax[1])
plt.suptitle("Distribution of the number of wildfires handled per administrative unit")
ax[0].set_xlabel("Number of wildfires handled")
ax[0].set_ylabel("Density")  # distplot shows a density estimate, not a percentage

ax[1].set_xlabel("Number of wildfires handled")

## Observation 5 (Who does how much)

A typical fire administrative unit handles around 15-20 fires over the 2013-2019 period. The most wildfire-prone regions handle a maximum of 80-120 wildfires.
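
A "typical" count and its spread can also be read off numerically instead of visually. The sketch below uses only the standard library on a made-up list of county labels; the data and variable names are hypothetical and only illustrate the idea of summarizing per-unit counts with a median and quartiles.

```python
from collections import Counter
import statistics

# Made-up county label per incident, for illustration only
counties = (["Riverside"] * 5 + ["San Diego"] * 4 + ["Shasta"] * 2
            + ["Butte"] * 2 + ["Kern"])

# Number of incidents per county, sorted ascending
counts = sorted(Counter(counties).values())

median_count = statistics.median(counts)
q1, q2, q3 = statistics.quantiles(counts, n=4)  # quartiles (Python 3.8+)

print(counts, median_count, (q1, q3), max(counts))
```

The median is less sensitive than the mean to the few units that handle a very large number of fires, which is why it suits a skewed distribution like this one.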

In [None]:
plt.figure(figsize=(10,15))
sns.set_context("talk")
sns.barplot( x = admin_count, y=admin_indx, palette  = "Spectral");
# sns.scatterplot(x = "Counties", y = )
plt.xticks(rotation = 0);
plt.title('Number of wildfire incidents handled based on Administrative Zones')
plt.xlabel('Number of wildfire incidents')

Note that some fire unit names are repeated under different spellings and are therefore identified as different units in the plot. This was mostly observed to affect the wildfires handled by the San Bernardino and Los Angeles units.

#### Looking at the administrative zones for the California Fire Dept. (shown below) we observe that the Riverside, San Diego, San Luis Obispo and Shasta-Trinity Unit are the top four administrative zones dealing with fire.

Note: Non-CAL FIRE Dept. and other assistive service units are not shown in this map.

![](http://santacruzcountyfire.com/images/cdfmap.jpg)

Naturally, the next question to ask is:<br>
_Are these fires concentrated in the areas covered under the CAL FIRE Riverside and San Diego ?_

To answer this, we will use the Latitude and Longitude data and geolocate the wildfires on a map of California for better visualization.

In [None]:
sns.set_style("white")
fig, ax = plt.subplots(2,2, figsize = (20,10))
fire_df.plot(kind="scatter", x="Latitude", y="Longitude",
    c=fire_df['ArchiveYear'],  s =fire_df['AcresBurned']/100,
    cmap=plt.get_cmap("gist_rainbow"),colorbar=True, alpha=0.3,
ax =ax[0,0]);

ax[0,0].set_xlim([30.45, 43.05]);
ax[0,0].set_ylim([ -124.55, -115.80]);
ax[0,0].set_title("Wildfires with size as total acres burned");

fire_df.plot(kind="scatter", x="Latitude", y="Longitude",
    c=fire_df['ArchiveYear'],  s =fire_df['MajorIncident']*100,
    cmap=plt.get_cmap("gist_rainbow"),colorbar=True, alpha=0.3,
ax =ax[0,1]);

ax[0,1].set_xlim([30.45, 43.05]);
ax[0,1].set_ylim([ -124.55, -115.80]);
ax[0,1].set_title("Major Incident Wildfires")

fire_df.plot(kind="scatter", x="Latitude", y="Longitude",
    c=fire_df['ArchiveYear'],  s =fire_df['Fatalities']*50,
    cmap=plt.get_cmap("gist_rainbow"),colorbar=True, alpha=0.3,
ax =ax[1,0]);

ax[1,0].set_xlim([30.45, 43.05]);
ax[1,0].set_ylim([ -124.55, -115.80]);
ax[1,0].set_title("Fatalities")


fire_df.plot(kind="scatter", x="Latitude", y="Longitude",
    c=fire_df['ArchiveYear'],  s =fire_df['PersonnelInvolved']/2,
    cmap=plt.get_cmap("gist_rainbow"),colorbar=True, alpha=0.3,
ax =ax[1,1]);

ax[1,1].set_xlim([30.45, 43.05]);
ax[1,1].set_ylim([ -124.55, -115.80]);
ax[1,1].set_title("Personnel Involved")

plt.xlabel("Latitude")
plt.tight_layout(pad=2)
plt.show()

In [None]:
fire_df[ (fire_df['ArchiveYear'] >=  2017)].count()/fire_df.count() *100

The four plots above provide a multitude of insights into the dataset. First of all, they again corroborate the findings of Observation 2 (increase in severity of wildfires): incidents since 2017 account for $68.3 \%$ of all the wildfires and $\sim 95 \%$ of all the fatalities, and they are mostly concentrated towards the south-east region of California - Riverside Unit (RRU). See discussion on Observation 5.3 later.
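
When computing such shares, it helps to keep in mind that `count()` counts non-missing entries, whereas `sum()` adds up the values. The sketch below illustrates the difference on made-up numbers (the `toy` frame and the variable names are hypothetical): a ratio of counts on a column like `Fatalities` gives the share of rows with a recorded value, while a ratio of sums gives the share of the deaths themselves.

```python
import numpy as np
import pandas as pd

# Made-up incidents for illustration; NaN means no value was recorded
toy = pd.DataFrame({
    "ArchiveYear": [2013, 2016, 2017, 2018],
    "Fatalities": [np.nan, 1.0, 3.0, 85.0],
})
recent = toy["ArchiveYear"] >= 2017

# Share of incidents since 2017 (mean of a boolean mask)
incident_share = recent.mean() * 100
# Share of rows with a recorded fatality value that fall in 2017 or later
recorded_share = toy.loc[recent, "Fatalities"].count() / toy["Fatalities"].count() * 100
# Share of all fatalities that occurred in 2017 or later
fatality_share = toy.loc[recent, "Fatalities"].sum() / toy["Fatalities"].sum() * 100

print(incident_share, recorded_share, fatality_share)
```

On real data the two fatality measures can differ noticeably, so it is worth stating which one a quoted percentage refers to.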

Looking at the top two figures above and the figure of the population density of California from 2000-2010 (shown below), it can be clearly observed that the pattern of wildfires and major wildfire incidents visually correlates strongly with population density.

## Observation 6
Major wildfire incidents have mostly occurred near regions of high population density and are mostly concentrated towards the south-eastern regions (San Diego and Riverside) of the state. This trend has become more prevalent since the year 2017.

Are these mostly man-made? This is the topic of the [article](https://www.vox.com/2018/8/7/17661096/california-wildfires-2018-camp-woolsey-climate-change) (California’s wildfires are hardly “natural” — humans made them worse at every step) that notes that:

>A study published earlier this year in the Proceedings of the National Academies of Science, or PNAS, found that 84 percent of wildfires are ignited by humans, whether through downed power lines, careless campfires, or arson.
“Human-started wildfires ... tripled the length of the fire season, dominated an area seven times greater than that affected by lightning fires, and were responsible for nearly half of all area burned,” the paper reported.


<img src="https://www.intimeandplace.org/El%20Dorado/imagefile/maps/caldemo.jpg" alt="drawing" width="480"/>
Population per Square Mile by Census Tract in Census 2000: California Profile, U S Census Bureau

In [None]:
# from PIL import Image
# import requests
# # url = "http://santacruzcountyfire.com/images/cdfmap.jpg"
# url = "https://www.intimeandplace.org/El%20Dorado/imagefile/maps/caldemo.jpg"
# response = requests.get(url, stream=True);
# img = Image.open(response.raw);

fig, ax = plt.subplots(1,2,figsize = (20,7));

sns.set_style("white")
import matplotlib.image as mpimg
img = mpimg.imread('../input/california-image/california.png')

fig1 = fire_df.plot(kind="scatter", x="Latitude", y="Longitude",
    c=fire_df['ArchiveYear'],  s =fire_df['AcresBurned']/100,
    cmap=plt.get_cmap("gist_rainbow"),colorbar=True, alpha=0.4, ax = ax[0]
);
ax[0].imshow(img,extent=[30.55, 50.05, -125.550, -115.80], alpha =0.7);

ax[0].set_xlim([32.45, 42.05]);
ax[0].set_ylim([ -124.55, -115.8]);
ax[0].set_title("Number of WildFire incidents")


ax[1] = fire_df.plot(kind="scatter", x="Latitude", y="Longitude",
    c=fire_df['ArchiveYear'], s = fire_df['Fatalities']*50, 
    cmap=plt.get_cmap("gist_rainbow"),colorbar=True, alpha=0.3, ax = ax[1]
);



print(f" Latitude bounds: {np.max(fire_df['Latitude']), np.min(fire_df['Latitude'])}")
print(f" Longitude bounds: {np.max(fire_df['Longitude']), np.min(fire_df['Longitude'])}")

ax[1].imshow(img,extent=[30.55, 50.05, -125.550, -115.80], alpha =0.7);

ax[1].set_xlim([32.45, 42.05]);
ax[1].set_ylim([ -124.55, -115.8]);
ax[1].set_title("Fatalities in WildFire incidents")

plt.suptitle("WildFire incidents California: Geographical plot");
plt.show();


The figure above clearly delineates where the fires are concentrated. These are mostly along the jurisdictions of: <br>
CAL FIRE Riverside County Fire <br>
CAL FIRE San Diego Unit <br>
CAL FIRE San Luis Obispo <br>
CAL FIRE Shasta-Trinity Unit <br>

which are the top 4 in the dataset.

## Observation 7

Wildfires since 2017 have been mostly concentrated in the same regions which vastly span the state of California. 

* These regions have now been identified (in 2018) as priority landscapes for reducing wildfire risks as shown below.

<!-- ![](https://www.redzone.co/wp-content/uploads/2019/06/fhsz_map.png) -->

<img src="https://www.redzone.co/wp-content/uploads/2019/06/fhsz_map.png" alt="drawing" width="400"/>

Fire Hazard Severity Zones of California, California Department of Forestry and Fire Protection, 2019.

## Question: How many of these incidents in 2018 were handled by Riverside and San Diego Unit?

In [None]:
fire_counties_df = fire_df.groupby([fire_df.Counties]).head(100)
sns.set_context("talk")
alpha = 0.1;
color = ['red','blue','green','orange','magenta']

rows = 3
cols = 3
fig,ax = plt.subplots(rows, cols, sharex = True, figsize=(25,7))
current_row = 0
current_col = 0
for idx,year in enumerate(years):
    TotalFire_year= fire_df[(fire_df['ArchiveYear'] ==  year)].count().values[0]
    FiresInYear_byCounty = fire_counties_df[(fire_counties_df['ArchiveYear'] == year)]['Counties'].value_counts()
#     plt.figure()
#     print(current_col)
    sns.barplot(x = FiresInYear_byCounty.values[0:5]/TotalFire_year*100, 
                y = FiresInYear_byCounty.index.unique()[0:5], alpha = alpha + 0.3, ax = ax[current_row,current_col])
    ax[current_row,current_col].set_title("Wildfires handled in year %d" %year)
    
    current_col +=1
    
    if(current_col == cols):
        current_row +=1
        current_col = 0
       
plt.tight_layout(pad=1)
ax[2,0].set_xlabel("Percentages of wildfires handled")
ax[2,1].set_xlabel("Percentages of wildfires handled")
ax[2,2].set_xlabel("Percentages of wildfires handled")

The above figure shows the progression of wildfires handled by the top 5 wildfire units each year.

## Observation 8
1. The San Diego Fire Unit has been among the top 5 wildfire handling units throughout the years (2013-2019).
2. Riverside has handled wildfires only in the years 2013, 2017 and 2018, yet has still handled more incidents than any other unit (2013-2019).
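
A per-year ranking like this can also be produced as a table rather than read off a plot. The following is a minimal sketch on made-up data (the `toy` frame and all variable names are hypothetical): it counts incidents per county within each year, keeps the top two per year, and expresses them as a percentage of that year's incidents.

```python
import pandas as pd

# Made-up incidents for illustration
toy = pd.DataFrame({
    "ArchiveYear": [2017] * 6 + [2018] * 3,
    "Counties": (["San Diego"] * 3 + ["Riverside"] * 2 + ["Shasta"]
                 + ["Riverside"] * 2 + ["San Diego"]),
})

# Incidents per county within each year, sorted descending inside each year
counts = toy.groupby("ArchiveYear")["Counties"].value_counts()

# Keep the two busiest counties of every year
top2 = counts.groupby(level=0).head(2)

# Convert to a percentage of all incidents in the same year
per_year_total = toy.groupby("ArchiveYear").size()
share = top2.div(per_year_total, level=0) * 100
print(share)
```

Replacing `head(2)` with `head(5)` would give the top five per year, matching the number of bars shown in each panel.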

In [None]:
fire_df['Name'].value_counts().head(20)

As a matter of fact, while writing this report, the Creek Fire of September 2020 has been reported to be one of the largest and most devastating fires in the history of CAL FIRE, with only 44 % containment even after ~29 days (https://inciweb.nwcg.gov/incident/7147/).

> FRESNO, Calif. (KFSN) -- The Creek Fire was first sparked on Friday, September 4, and 309,033 acres have burned as of Thursday morning (Oct 1st, 2020) with 44% containment. CAL FIRE officials say it is the largest single fire in California's recorded history.
>At least 926 structures have been damaged or destroyed, and 4,576 are threatened. Officials say 30,000 residents of Fresno County and 15,000 residents of Madera County have been evacuated.

## Observation 9.
Creek fire is the most fire prone zone in California

## Analyze duration of fires

### Worst wildfire

In [None]:
fire_df[ (fire_df.Fatalities > 20) & (fire_df.AcresBurned > 20000)].head()

The worst wildfire occurred in Butte County in November 2018, with 85 fatalities.

In [None]:
fire_df[ fire_df.fire_duration == np.max(fire_df.fire_duration)]
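
An alternative way to pull out the longest-lasting incident is `idxmax()`, which returns the index label of the maximum and skips missing values. The sketch below uses a hypothetical toy table with made-up values; unlike the equality filter above, it returns a single row even when several incidents tie for the maximum.

```python
import numpy as np
import pandas as pd

# Made-up incidents for illustration
toy = pd.DataFrame({
    "Name": ["Fire A", "Fire B", "Fire C", "Fire D"],
    "fire_duration": [3.0, 465.0, np.nan, 12.5],
})

# Index label of the largest duration (NaN is ignored), then the full row
longest = toy.loc[toy["fire_duration"].idxmax()]
print(longest)
```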

In [None]:
sns.scatterplot(y = 'fire_duration', x = 'ArchiveYear', hue='Counties', data = fire_df, palette = 'Reds_r')
plt.legend(bbox_to_anchor=(1.05, 1),  ncol=4, labelspacing=0.05)
plt.xlabel("Year")
plt.ylabel("Duration of fire in Days")
plt.title("Fire Duration Distribution over the years")

In [None]:
sns.boxplot(y = 'fire_duration', x = 'ArchiveYear', data = fire_df, palette = 'Reds_r')
# No legend here: the box plot has no labeled artists to list
plt.xlabel("Year")
plt.ylabel("Duration of fire in Days")
plt.title("Box plot of Fire Duration Distribution over the years")

## Observation 10.1

Yet again, the long fire durations occurred in the years 2017 and 2018, with an average duration of around 190 hours, orders of magnitude larger than in other years.

In [None]:
# sns.distplot(y = 'fire_duration', x = 'ArchiveYear', data = fire_df, palette = 'bright')
sns.distplot(fire_df['fire_duration'], color = 'r')
plt.xlabel("Duration of fire in Days")
plt.ylabel("Density")
plt.title("Distribution of fire duration (2013-2019)")

## Observation 10.2
Fires are either dealt with relatively fast (within 20 days) or last as long as 200 days. The longest lasted around 450 days (almost a year and a quarter!).

In [None]:
fire_df[ fire_df.fire_duration > 365]

The longest fire was the Thomas Fire in the Los Padres National Forest, which started in December 2017 and was extinguished in March 2019, with a total duration of 465 days.

### Relationship between Fatalities, Fire Duration and Acres Burned

As a final note, the relationship between fire duration, acres burned and fatalities is plotted next.

In [None]:
# 'height' replaces the older 'size' argument of pairplot
sns.pairplot(fire_df, vars = ['Fatalities', 'fire_duration', 'AcresBurned'], diag_kind = 'kde', hue = "ArchiveYear", height = 3)

-----------------

## Summary
### The biggest takeaway of this analysis is that the years 2017 and 2018 saw some of the most numerous, most severe and longest wildfires during 2013-2019. The CAL FIRE Dept. learnt from its deficiencies and took the necessary actions, which was evident in the number and control of wildfires in the year 2019.

In summary,
* The total number of incidents in 2018 is about 28 % lower than in 2017, whereas the land area burned in 2018 is about 187 % of that burned in 2017. This means that, taken as a whole, the wildfires in 2018 were larger and more severe.
* (Severity) The share of land burned in major fire incidents has been more than 50 % from 2016 until 2019.
* (Learning from the Past) The number of wildfires in 2019 is much lower than in 2018, possibly because the severity of the 2018 wildfires resulted in extra funding, equipment and better handling of wildfires in 2019, as mentioned in the California State Budget Summary.
* (Technological Progress) Due to advancements in wildfire handling technology, fewer deployed personnel can handle larger and more expansive wildfires. This is evident from the years 2013 and 2018: the number of wildfires in 2018 was about 200 % of that in 2013, whereas the personnel deployed was lower by around 23 %. A similar conclusion can be drawn for the number of water tenders involved.
* (Who does how much) A typical fire administrative unit handles around 15-20 fires over the 2013-2019 period. The most wildfire-prone regions handle a maximum of 80-120 wildfires.
* Looking at the administrative zones of the California Fire Dept., we observe that the Riverside, San Diego, San Luis Obispo and Shasta-Trinity Units are the top four administrative zones dealing with fire.
* The San Diego Fire Unit has been among the top 5 wildfire handling units throughout the years (2013-2019).
* The Riverside Fire Unit has handled wildfires only in the years 2013, 2017 and 2018, yet has still handled more incidents than any other unit (2013-2019).
* Wildfires since 2017 account for 68.3 % of all the wildfires and about 95 % of all the fatalities; they are mostly concentrated towards the south-east region of California - Riverside Unit (RRU) and San Diego Fire Unit.
* Major wildfire incidents have mostly occurred near regions of high population density and are mostly concentrated towards the south-eastern regions (San Diego and Riverside) of the state. This trend has become more prevalent since the year 2017. This is the topic of the article (California’s wildfires are hardly “natural” — humans made them worse at every step).
* Wildfires since 2017 have been mostly concentrated in the same regions which vastly span the state of California. These regions have now been identified (in 2018) as priority landscapes for reducing wildfire risks.
* Creek fire is the most fire prone zone in California.
* The worst wildfire occurred in Butte County in November 2018, with 85 fatalities.
* Wildfires with long fire durations occurred in the years 2017 and 2018, with an average duration of around 190 hours, orders of magnitude larger than in other years.
* Fires are either dealt with relatively fast (within 20 days) or last as long as 200 days. The longest lasted around 450 days (almost a year and a quarter!).
* The longest fire was the Thomas Fire in the Los Padres National Forest, which started in December 2017 and was extinguished in March 2019, with a total duration of 465 days.

--------------

### Future Work:
1. Analyze more about the year 2018.
2. A more detailed future analysis could examine how this finding corresponds to the increasing trend of California temperature (or reduced rainfall), which dries out forest beds and makes them more susceptible to fire hazards, leading to major incidents.
3. Include geoplot-based python libraries to facilitate interactive study.
4. Incorporate other data sets such as the [1.88 Million US wildfires](https://www.kaggle.com/rtatman/188-million-us-wildfires) to obtain more insights such as the nature and cause of fire etc.
5. Update the current dataset to include wildfires until September 2020 and repeat the analysis.

### More resources
[1] Community Wildfire Prevention & Mitigation Report, https://www.fire.ca.gov/media/5584/45-day-report-final.pdf <br>
[2] Realtime Incidents, https://inciweb.nwcg.gov/accessible-view/ <br>
[3] Cal Fire Incidents details, https://www.fire.ca.gov/incidents/ <br>
[4] California Budget 2019-2020, http://www.ebudget.ca.gov/2019-20/pdf/Enacted/BudgetSummary/FullBudgetSummary.pdf


### Additional Resources
[1] Wildfire Smoke: A Guide for Public Health Officials and Factsheets, https://www.airnow.gov/wildfire-smoke-guide-publications/

In [None]:
jovian.commit(project='california-wildfire-analysis', environment=None)