# Fighting fire with firepower
## Notebook 5: Webscraping
<br>
<br>

> Wayne Chan <br>
> Mariam Javed <br>
> Shawn Syms 

<a name="contents"></a>
## Contents

* <a href="#context">Context</a>
* <a href="#imports">Imports</a>
* <a href="#dataframe-initialization">Dataframe initialization</a>
* <a href="#api-and-webscraping">API and webscraping</a>
* <a href="#conclusion">Conclusion</a>
* <a href="#recommendations">Recommendations</a>

<a name="context"></a>
## Context

As part of our efforts to increase our model's performance, we integrated a set of weather measurements. To do this, we used the Dark Sky API to capture weather data for the dates and locations of the fires in a subset of our dataset. This was a time-consuming process. These are the measurements that we integrated into our model's feature set:

> - precipitation intensity<br>
> - precipitation probability<br>
> - temperature high<br>
> - temperature low<br>
> - humidity<br>
> - wind speed<br>
> - wind gust<br>
> - UV index<br>
> - visibility<br>

Ultimately, we found that the weather data was not strongly correlated with our training target (fire size), but we are including the notebook for completeness of documentation.

Our data originated in a SQLite database; the steps for loading it are captured here as well.
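
The weak relationship between the weather measurements and fire size noted above is the kind of thing a correlation table can show. The following is a minimal sketch rather than the analysis we actually ran: the `toy` dataframe, its values, and the weather column names are made up for illustration.

```python
import pandas as pd

# Toy stand-in for a merged fire + weather dataframe (all values are made up).
toy = pd.DataFrame({
    'FIRE_SIZE': [0.1, 0.25, 70.0, 0.5, 1.0, 25.0],
    'humidity': [0.62, 0.55, 0.31, 0.48, 0.44, 0.29],
    'wind_speed': [3.1, 4.0, 9.8, 5.2, 2.7, 7.5],
    'temp_high': [71.0, 68.5, 88.2, 79.4, 75.0, 90.1],
})

# Pearson correlation of each weather column with the target.
corr = toy.corr()['FIRE_SIZE'].drop('FIRE_SIZE')

# Order the columns by absolute correlation, strongest first.
order = corr.abs().sort_values(ascending=False).index
corr_with_target = corr.loc[order]
print(corr_with_target)
```

Values close to zero for every weather column would be consistent with the finding described above.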

<div style="text-align: right">(<a href="#contents">home</a>) </div>

<a name="imports"></a>
## Imports

In [2]:
import pandas as pd
import numpy as np
import sqlite3
import matplotlib.pyplot as plt
import requests
import datetime
import time
# None means "no column-width limit"; the older value -1 is deprecated in pandas 1.0+
pd.set_option('display.max_colwidth', None)

from bs4 import BeautifulSoup

<div style="text-align: right">(<a href="#contents">home</a>) </div>

<a name="dataframe-initialization"></a>
## Dataframe initialization

In [3]:
#establish a connection with the database file
cnx = sqlite3.connect('kaggle_dataset/FPA_FOD_20170508.sqlite')

Now create our dataframe by running a select query on the DB:

In [3]:
df = pd.read_sql_query("SELECT FIRE_YEAR, DISCOVERY_DATE, DISCOVERY_DOY, DISCOVERY_TIME, STAT_CAUSE_DESCR, LATITUDE, LONGITUDE, STATE, FIRE_CODE, FIRE_SIZE FROM 'Fires'", cnx)
df.head()

Unnamed: 0,FIRE_YEAR,DISCOVERY_DATE,DISCOVERY_DOY,DISCOVERY_TIME,STAT_CAUSE_DESCR,LATITUDE,LONGITUDE,STATE,FIRE_CODE,FIRE_SIZE
0,2005,2453403.5,33,1300,Miscellaneous,40.036944,-121.005833,CA,BJ8K,0.1
1,2004,2453137.5,133,845,Lightning,38.933056,-120.404444,CA,AAC0,0.25
2,2004,2453156.5,152,1921,Debris Burning,38.984167,-120.735556,CA,A32W,0.1
3,2004,2453184.5,180,1600,Lightning,38.559167,-119.913333,CA,,0.1
4,2004,2453184.5,180,1600,Lightning,38.559167,-119.933056,CA,,0.1


In [4]:
df.to_csv('fire.csv')

In [5]:
df['FIRE_YEAR'].unique()

array([2005, 2004, 2006, 2008, 2002, 2007, 2009, 2001, 2003, 1992, 1993,
       1994, 1995, 1996, 1997, 1998, 1999, 2000, 2010, 2011, 2012, 2013,
       2014, 2015])

In [11]:
fire_date = datetime.date(2004,5,12)

unixtime = time.mktime(fire_date.timetuple())
unixtime

1084334400.0

In [12]:
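# the same date as a dd/mm/yyyy string, the format used for the 'dmy' column below
dmy = '12/05/2004'
d = datetime.date(int(dmy[6:]), int(dmy[3:5]), int(dmy[0:2]))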

unixtime = time.mktime(d.timetuple())
unixtime

1084334400.0

In [14]:
df.drop(['DISCOVERY_TIME'], axis = 1, inplace = True)

In [16]:
df['datetime'] = df['FIRE_YEAR'].astype(str) + df['DISCOVERY_DOY'].astype(str)
df.head()

Unnamed: 0,FIRE_YEAR,DISCOVERY_DATE,DISCOVERY_DOY,STAT_CAUSE_DESCR,LATITUDE,LONGITUDE,STATE,FIRE_CODE,FIRE_SIZE,datetime
0,2005,2453403.5,33,Miscellaneous,40.036944,-121.005833,CA,BJ8K,0.1,200533
1,2004,2453137.5,133,Lightning,38.933056,-120.404444,CA,AAC0,0.25,2004133
2,2004,2453156.5,152,Debris Burning,38.984167,-120.735556,CA,A32W,0.1,2004152
3,2004,2453184.5,180,Lightning,38.559167,-119.913333,CA,,0.1,2004180
4,2004,2453184.5,180,Lightning,38.559167,-119.933056,CA,,0.1,2004180


In [19]:
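# copy the filtered rows so that adding columns later does not trigger SettingWithCopyWarning
df = df.loc[df['FIRE_YEAR'] >= 2010].copy()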

In [20]:
df.head()

Unnamed: 0,FIRE_YEAR,DISCOVERY_DATE,DISCOVERY_DOY,STAT_CAUSE_DESCR,LATITUDE,LONGITUDE,STATE,FIRE_CODE,FIRE_SIZE,datetime
1067487,2010,2455335.5,139,Equipment Use,36.766944,-121.303056,CA,,70.0,2010139
1067488,2010,2455355.5,159,Miscellaneous,36.776944,-121.311111,CA,,0.5,2010159
1067489,2010,2455359.5,163,Miscellaneous,36.856111,-121.381111,CA,,0.1,2010163
1067490,2010,2455361.5,165,Miscellaneous,36.818056,-121.391111,CA,,0.1,2010165
1067491,2010,2455388.5,192,Miscellaneous,36.883056,-121.561944,CA,,1.0,2010192


In [21]:
new_list = []

for i in df['datetime']:
    
    dt = datetime.datetime.strptime(i, '%Y%j').strftime('%d/%m/%Y')
    new_list.append(dt)
    
df['dmy'] = new_list

In [23]:
df.reset_index(drop = True, inplace = True)
df.head()

Unnamed: 0,FIRE_YEAR,DISCOVERY_DATE,DISCOVERY_DOY,STAT_CAUSE_DESCR,LATITUDE,LONGITUDE,STATE,FIRE_CODE,FIRE_SIZE,datetime,dmy
0,2010,2455335.5,139,Equipment Use,36.766944,-121.303056,CA,,70.0,2010139,19/05/2010
1,2010,2455355.5,159,Miscellaneous,36.776944,-121.311111,CA,,0.5,2010159,08/06/2010
2,2010,2455359.5,163,Miscellaneous,36.856111,-121.381111,CA,,0.1,2010163,12/06/2010
3,2010,2455361.5,165,Miscellaneous,36.818056,-121.391111,CA,,0.1,2010165,14/06/2010
4,2010,2455388.5,192,Miscellaneous,36.883056,-121.561944,CA,,1.0,2010192,11/07/2010


In [25]:
new_list = []

for i in df['dmy']:
    #print(type(i))
    d = datetime.date(int(i[6:]),int(i[3:5]),int(i[0:2]))
    #print(type(d))
    unixtime = time.mktime(d.timetuple())

    new_list.append(unixtime)
    
df['unixtime'] = new_list

In [27]:
df['unixtime'] = df['unixtime'].apply(lambda x: int(x))
df.head()

Unnamed: 0,FIRE_YEAR,DISCOVERY_DATE,DISCOVERY_DOY,STAT_CAUSE_DESCR,LATITUDE,LONGITUDE,STATE,FIRE_CODE,FIRE_SIZE,datetime,dmy,unixtime
0,2010,2455335.5,139,Equipment Use,36.766944,-121.303056,CA,,70.0,2010139,19/05/2010,1274241600
1,2010,2455355.5,159,Miscellaneous,36.776944,-121.311111,CA,,0.5,2010159,08/06/2010,1275969600
2,2010,2455359.5,163,Miscellaneous,36.856111,-121.381111,CA,,0.1,2010163,12/06/2010,1276315200
3,2010,2455361.5,165,Miscellaneous,36.818056,-121.391111,CA,,0.1,2010165,14/06/2010,1276488000
4,2010,2455388.5,192,Miscellaneous,36.883056,-121.561944,CA,,1.0,2010192,11/07/2010,1278820800


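Note: the `unixtime` values in the rows shown fall at 04:00 UTC. Due to limitations with the dataset we use only the discovery date, not the time of day, so each timestamp marks midnight on that date, and `time.mktime` interprets that midnight in the local time zone of the machine that ran the notebook.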
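
Because `time.mktime` depends on the machine's time zone, the same notebook can produce different timestamps on different machines. A minimal sketch of a time-zone-independent alternative follows; the helper name `doy_to_unix_utc` is hypothetical and this is not the conversion used above.

```python
import calendar
import datetime

def doy_to_unix_utc(year, day_of_year):
    """Return the Unix timestamp for midnight UTC on the given day of the year."""
    # '%j' parses the day of year; zero-padding keeps the string unambiguous
    day = datetime.datetime.strptime('{}{:03d}'.format(year, day_of_year), '%Y%j')
    # timegm treats the time tuple as UTC, unlike time.mktime (local time)
    return calendar.timegm(day.timetuple())

print(doy_to_unix_utc(2010, 139))  # 1274227200, i.e. 2010-05-19 00:00:00 UTC
```

Whether midnight UTC or local midnight at the fire's location is the better reference point is a separate decision; the sketch only makes the result reproducible.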

In [35]:
df.reset_index(drop = True, inplace = True)

In [38]:
df.drop(labels=['FIRE_CODE'], axis = 1, inplace = True)
df.head()

Unnamed: 0,FIRE_YEAR,DISCOVERY_DATE,DISCOVERY_DOY,STAT_CAUSE_DESCR,LATITUDE,LONGITUDE,STATE,FIRE_SIZE,datetime,dmy,unixtime
0,2015,2457314.5,292,Miscellaneous,37.69,-96.97,KS,0.01,2015292,19/10/2015,1445227200
1,2015,2457314.5,292,Miscellaneous,39.6138,-82.2251,OH,0.51,2015292,19/10/2015,1445227200
2,2015,2457313.5,291,Miscellaneous,34.68201,-106.77655,NM,0.46,2015291,18/10/2015,1445140800
3,2015,2457314.5,292,Miscellaneous,39.73,-96.71,KS,25.0,2015292,19/10/2015,1445227200
4,2015,2457314.5,292,Miscellaneous,37.68,-97.1,KS,0.01,2015292,19/10/2015,1445227200


<div style="text-align: right">(<a href="#contents">home</a>) </div>

<a name="api-and-webscraping"></a>
## API and webscraping

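The following fields from the API's daily data are the ones we need for our dataframe:

- precipitation intensity (`precipIntensity`)
- precipitation probability (`precipProbability`)
- temperature high (`temperatureHigh`)
- temperature low (`temperatureLow`)
- humidity (`humidity`)
- wind speed (`windSpeed`)
- wind gust (`windGust`)
- UV index (`uvIndex`)
- visibility (`visibility`)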

In [None]:
precip_intensityls = []
precip_probls = []
temp_highls = []
temp_lowls = []
humidityls = []
wind_speedls = []
wind_gustls = []
uv_indexls = []
visibilityls = []

key = 'REDACTED: PUT YOUR OWN API KEY HERE'

# map each field in the API's daily data to the list that collects it
targets = {
    'precipIntensity': precip_intensityls,
    'precipProbability': precip_probls,
    'temperatureHigh': temp_highls,
    'temperatureLow': temp_lowls,
    'humidity': humidityls,
    'windSpeed': wind_speedls,
    'windGust': wind_gustls,
    'uvIndex': uv_indexls,
    'visibility': visibilityls,
}

for i in range(len(df)):

    #call api for this fire's location and date
    lat = df['LATITUDE'][i]
    long = df['LONGITUDE'][i]
    dt = df['unixtime'][i]

    url = 'https://api.darksky.net/forecast/{}/{},{},{}'.format(key, lat, long, dt)

    #pull the daily summary; fall back to an empty dict if the request or response is unusable
    try:
        daily = requests.get(url, timeout=30).json()['daily']['data'][0]
    except (requests.RequestException, ValueError, KeyError, IndexError, TypeError):
        daily = {}

    #record each measurement, using NaN where the field is missing
    for field, values in targets.items():
        values.append(daily.get(field, np.nan))

    print('Completed {} of {}'.format(i + 1, len(df)))

    #pause between API calls
    time.sleep(2)
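
The loop above only fills the lists; it does not attach them to the dataframe. A minimal sketch of that last step follows, using a toy dataframe and two of the lists. The column names `humidity` and `wind_speed` and all values are assumptions for illustration, not the names used elsewhere in this project.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the dataframe and two of the collected lists (values made up).
fires = pd.DataFrame({'LATITUDE': [37.69, 39.61], 'LONGITUDE': [-96.97, -82.23]})
humidityls = [0.41, np.nan]   # NaN marks a field the API did not return
wind_speedls = [6.2, 3.8]

# Hypothetical column names; each list must have one entry per dataframe row.
weather = pd.DataFrame(
    {'humidity': humidityls, 'wind_speed': wind_speedls},
    index=fires.index,
)

# Place the weather columns alongside the original fire columns.
fires_weather = pd.concat([fires, weather], axis=1)
print(fires_weather)
```

Building the weather columns as one dataframe and concatenating once keeps the row alignment explicit, which matters because the lists carry no index of their own.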

<a name="conclusion"></a>
## Conclusion

- A neural network was the most effective model at predicting a wildfire's fire-size class, with train and test accuracy of 61 percent and 62 percent, respectively
- This dataset is far better suited to a classification problem than a regression one, with a **significant** boost in model results: from R² scores below 0.05 for regression to accuracy scores of roughly 60 percent for classification
- Our results are comparable to others that have used this dataset (see Sky B.T. Williams's [article](https://towardsdatascience.com/wildfire-destruction-a-random-forest-classification-of-forest-fires-e08070230276)), which suggests that our accuracy is close to the best achievable with the dataset as is

<div style="text-align: right">(<a href="#contents">home</a>) </div>

<a name="recommendations"></a>
## Recommendations

- The historical weather data from the Dark Sky API was collected only for 2015, given time and financial constraints. We recommend that this observed weather data be collected for all years (1992-2015) and incorporated into the neural network classification model to see whether results improve
- It's evident from the results that the class distribution in this dataset is significantly imbalanced: class A and class B make up the majority of the dataset, while class G makes up only 0.2% of the total wildfires. For this reason, it would be beneficial to apply class weights to compensate for the imbalance in data points
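
One common way to derive such weights is the inverse-frequency ("balanced") heuristic, `n_samples / (n_classes * class_count)`. The sketch below illustrates it with a made-up label distribution; the function name `balanced_class_weights` is hypothetical, and this is not something we ran on the wildfire data.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# Made-up label distribution, skewed toward the small-fire classes.
labels = ['A'] * 50 + ['B'] * 40 + ['C'] * 8 + ['G'] * 2
weights = balanced_class_weights(labels)
print(weights)
```

Rare classes receive proportionally larger weights, so errors on them count for more during training if the model's training routine accepts per-class weights.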

<div style="text-align: right">(<a href="#contents">home</a>) </div>