# COGS 108 - Data Checkpoint

# Names

- Jiay Zhao
- Wenbo Hu
- Xiaotong Zeng
- Yunyi Huang

<a id='research_question'></a>
# Research Question

Is there a statistically significant relationship between the scale (burned area) of wildfires in California and climate variables associated with global warming, such as relative humidity, temperature, and wind speed? Additionally, how can we use these variables to predict the next wildfire event in California and its scale?

# Dataset(s)

**Dataset Name:** California Wildfire Incidents (2013-2020) --- List of wildfires in California between 2013 and 2020<br>
**Link to the dataset:** https://www.kaggle.com/ananthu017/california-wildfire-incidents-20132020 <br>
**Number of observations:** 1636 (each with 40 variables)<br>
**Description of the dataset:** This dataset lists the wildfires that occurred in California between 2013 and 2020, along with other variables such as location and injuries.


**Dataset Name:** NOAA Daily global surface summary 2013-2019<br>
**Link to the dataset:** https://www.ncei.noaa.gov/data/global-summary-of-the-day/archive/ <br>
**Number of observations:** We will use the yearly subsets for 2013-2019, together with the station list dataset. The number of observations in each yearly subset is listed below:
- 2013: 4.01m
- 2014: 4.12m
- 2015: 4.20m
- 2016: 4.29m
- 2017: 4.29m
- 2018: 4.01m
- 2019: 3.29m
<br>

**Description of the dataset:** These datasets contain daily weather records, including temperature, precipitation, and wind speed, from weather stations around the world. The dataset does not report humidity directly, so we use temperature and dew point to calculate relative humidity as a percentage.
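
One way to derive relative humidity from temperature and dew point is the Magnus approximation. The sketch below is an illustration under stated assumptions, not the method this notebook commits to. It assumes the GSOD `TEMP` and `DEWP` columns are in degrees Fahrenheit and use `9999.9` as the missing-value sentinel, and it uses one commonly quoted pair of Magnus coefficients. The function name `relative_humidity` is hypothetical.

```python
import numpy as np
import pandas as pd

# Magnus coefficients (one commonly used pair; an assumption, not from the notebook)
MAGNUS_A = 17.625
MAGNUS_B = 243.04  # degrees Celsius


def relative_humidity(temp_f, dewp_f, missing=9999.9):
    """Approximate relative humidity (%) from temperature and dew point in Fahrenheit."""
    temp_f = pd.Series(temp_f, dtype="float64")
    dewp_f = pd.Series(dewp_f, dtype="float64")

    # Treat the assumed missing-value sentinel as NaN
    temp_f = temp_f.mask(temp_f == missing)
    dewp_f = dewp_f.mask(dewp_f == missing)

    # Convert Fahrenheit to Celsius
    temp_c = (temp_f - 32.0) * 5.0 / 9.0
    dewp_c = (dewp_f - 32.0) * 5.0 / 9.0

    # RH = 100 * e_s(dew point) / e_s(temperature), with e_s from the Magnus formula
    exponent = (MAGNUS_A * dewp_c / (MAGNUS_B + dewp_c)
                - MAGNUS_A * temp_c / (MAGNUS_B + temp_c))
    return 100.0 * np.exp(exponent)


# Toy inputs: a dry day, a saturated day, and a missing temperature
rh = relative_humidity([86.0, 50.0, 9999.9], [50.0, 50.0, 40.0])
```

When temperature equals dew point the result is 100%, and rows with a missing input come out as `NaN` rather than a spurious humidity value.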



**Dataset Name:** The Integrated Surface Data (ISD) Station List<br>
**Link to the dataset:**  https://www1.ncdc.noaa.gov/pub/data/noaa/isd-history.csv <br>
**Number of observations:** 29700 <br>
**Description of the dataset:** This dataset contains identification numbers and information for weather stations in the Federal Climate Complex ISD.

**Data Combination:** To perform our analysis, we first clean and wrangle each dataset separately. The second dataset, the NOAA GSOD daily global surface summary, is split by year, so we need to combine its yearly subsets. After cleaning each dataset, we will merge them by weather station ID and event date.
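
The combination step described above can be sketched with toy data. Everything in this example is hypothetical: the values are made up, and the `fires` table is assumed to already carry a station ID for each event, which our cleaned wildfire data does not yet have. It only illustrates stacking yearly subsets with `pd.concat` and joining on station ID and date with `DataFrame.merge`.

```python
import pandas as pd

# Hypothetical yearly weather subsets (toy values, not real GSOD records)
weather_2013 = pd.DataFrame(
    {"STATION": ["69002093218"], "DATE": ["2013-08-17"], "TEMP": [75.1]}
)
weather_2014 = pd.DataFrame(
    {"STATION": ["69002093218"], "DATE": ["2014-08-17"], "TEMP": [78.4]}
)

# Stack the yearly subsets into one table and parse the dates
weather = pd.concat([weather_2013, weather_2014], ignore_index=True)
weather["DATE"] = pd.to_datetime(weather["DATE"])

# Hypothetical fire table that already has a station assigned to each event
fires = pd.DataFrame(
    {
        "STATION": ["69002093218", "69002093218"],
        "DATE": pd.to_datetime(["2013-08-17", "2015-01-01"]),
        "AcresBurned": [257314.0, 10.0],
    }
)

# Left join keeps every fire; indicator=True records whether weather was found
merged = fires.merge(weather, on=["STATION", "DATE"], how="left", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]
```

A left join with `indicator=True` makes it easy to count how many fire events have no matching weather record before deciding how to handle them.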

# Setup

In [2]:
# Import pandas to read CSV files and manage heterogeneous data
import pandas as pd

# Import numpy to store numeric information and perform numerical analysis
import numpy as np

# Import seaborn and matplotlib to visualize data
import seaborn as sns
import matplotlib.pyplot as plt

# Import scipy to gather statistics
from scipy import stats

# Import patsy and statsmodels for regression analysis
import patsy
import statsmodels.api as sm

import warnings

# Data Cleaning

**SUMMARY:**

Since we have three datasets, we clean them separately and then merge them by location.

#### First, load the California wildfire incidents dataset

Take a brief look at the data and trim it to the columns we need

In [3]:
# Load the California wildfire incidents data set in data frame
# We get this data set from Kaggle (https://www.kaggle.com/ananthu017/california-wildfire-incidents-20132020)
wildfire = pd.read_csv("California_Fire_Incidents.csv")

In [4]:
print('The California fire incidents data set shape is ', wildfire.shape)
wildfire.head()

The California fire incidents data set shape is  (1636, 40)


|   | AcresBurned | Active | AdminUnit | AirTankers | ArchiveYear | CalFireIncident | CanonicalUrl | ConditionStatement | ControlStatement | Counties | ... | SearchKeywords | Started | Status | StructuresDamaged | StructuresDestroyed | StructuresEvacuated | StructuresThreatened | UniqueId | Updated | WaterTenders |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 257314.0 | False | Stanislaus National Forest/Yosemite National Park |  | 2013 | True | /incidents/2013/8/17/rim-fire/ |  |  | Tuolumne | ... | Rim Fire, Stanislaus National Forest, Yosemite... | 2013-08-17T15:25:00Z | Finalized |  |  |  |  | 5fb18d4d-213f-4d83-a179-daaf11939e78 | 2013-09-06T18:30:00Z |  |
| 1 | 30274.0 | False | USFS Angeles National Forest/Los Angeles Count... |  | 2013 | True | /incidents/2013/5/30/powerhouse-fire/ |  |  | Los Angeles | ... | Powerhouse Fire, May 2013, June 2013, Angeles ... | 2013-05-30T15:28:00Z | Finalized |  |  |  |  | bf37805e-1cc2-4208-9972-753e47874c87 | 2013-06-08T18:30:00Z |  |
| 2 | 27531.0 | False | CAL FIRE Riverside Unit / San Bernardino Natio... |  | 2013 | True | /incidents/2013/7/15/mountain-fire/ |  |  | Riverside | ... | Mountain Fire, July 2013, Highway 243, Highway... | 2013-07-15T13:43:00Z | Finalized |  |  |  |  | a3149fec-4d48-427c-8b2c-59e8b79d59db | 2013-07-30T18:00:00Z |  |
| 3 | 27440.0 | False | Tahoe National Forest |  | 2013 | False | /incidents/2013/8/10/american-fire/ |  |  | Placer | ... | American Fire, August 2013, Deadwood Ridge, Fo... | 2013-08-10T16:30:00Z | Finalized |  |  |  |  | 8213f5c7-34fa-403b-a4bc-da2ace6e6625 | 2013-08-30T08:00:00Z |  |
| 4 | 24251.0 | False | Ventura County Fire/CAL FIRE |  | 2013 | True | /incidents/2013/5/2/springs-fire/ | Acreage has been reduced based upon more accur... |  | Ventura | ... | Springs Fire, May 2013, Highway 101, Camarillo... | 2013-05-02T07:01:00Z | Finalized | 6.0 | 10.0 |  |  | 46731fb8-3350-4920-bdf7-910ac0eb715c | 2013-05-11T06:30:00Z | 11.0 |


Since we only need the dates, acres burned (scale), and county names for the following analysis, we keep only these columns in `wildfire`.

In [5]:
# keep only the relevant columns (copy so that later column assignments
# do not trigger pandas' SettingWithCopyWarning)
wildfire = wildfire[['AcresBurned','Started','Counties']].copy()

# convert the start timestamp into a calendar date
wildfire['Started'] = pd.to_datetime(wildfire['Started']).dt.date

# rename the 'Started' column to 'Date'
wildfire = wildfire.rename({'Started':'Date'}, axis='columns')

wildfire.head()

|   | AcresBurned | Date | Counties |
|---|---|---|---|
| 0 | 257314.0 | 2013-08-17 | Tuolumne |
| 1 | 30274.0 | 2013-05-30 | Los Angeles |
| 2 | 27531.0 | 2013-07-15 | Riverside |
| 3 | 27440.0 | 2013-08-10 | Placer |
| 4 | 24251.0 | 2013-05-02 | Ventura |
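
Because `.dt.date` stores Python `date` objects in an object-typed column, a later merge against a weather table keyed by `datetime64` dates may need an extra conversion. As a possible alternative, the sketch below keeps the column as `datetime64` while still dropping the time of day. It uses two timestamps copied from the output above and treats the trailing `Z` as UTC, which is an assumption about how the source encodes time.

```python
import pandas as pd

# Two 'Started' values copied from the wildfire table
started = pd.Series(["2013-08-17T15:25:00Z", "2013-05-30T15:28:00Z"])

# Parse as UTC, drop the timezone, then truncate each timestamp to midnight
dates = pd.to_datetime(started, utc=True).dt.tz_localize(None).dt.normalize()
```

The resulting column can be compared or merged directly with other timezone-naive `datetime64` columns.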


#### Second, load the Integrated Surface Data (ISD) station list

Clean the station table

In [6]:
# Load the Integrated Surface Data (ISD) station list into a DataFrame
# We get the station list from ncdc.noaa.gov
station = pd.read_csv("https://www1.ncdc.noaa.gov/pub/data/noaa/isd-history.csv")

# The weather station ID is the concatenation of the 'USAF' and 'WBAN' columns,
# so we combine these two columns into a new column called 'ID'
# NOTE: if pandas reads 'WBAN' as an integer, its leading zeros are dropped
# (see the 9-character IDs at the end of the output), so those IDs will not
# match an 11-character station code
station['ID'] = station['USAF'].astype(str) + station['WBAN'].astype(str)

# we only analyze weather stations in California
station = station[(station['STATE']=='CA') & (station['CTRY']=='US')].reset_index()

# keep only the station ID and the station name
station = station[['ID','STATION NAME']]
station = station.rename({'ID':'STATION','STATION NAME':'NAME'}, axis='columns')

station

|   | STATION | NAME |
|---|---|---|
| 0 | 69002093218 | JOLON HUNTER LIGGETT MIL RES |
| 1 | 69002099999 | JOLON HUNTER LIGGETT MIL RES |
| 2 | 69007093217 | FRITZSCHE AAF |
| 3 | 69014093101 | EL TORO MCAS |
| 4 | 69015093121 | TWENTY NINE PALMS |
| ... | ... | ... |
| 493 | 99999993243 | MERCED 23 WSW |
| 494 | 99999993245 | BODEGA 6 WSW |
| 495 | A06854115 | BIG BEAR CITY AIRPORT |
| 496 | A07049320 | PETALUMA MUNICIPAL AIRPORT |
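
The last two station IDs in the output have only 9 characters, while the others have 11. This suggests that `WBAN` was parsed as an integer and lost its leading zeros. The sketch below reproduces the effect on a toy excerpt and shows one possible fix, reading both ID columns as text. The two-row CSV is illustrative, and the `00115` value is inferred from the output rather than checked against the source file.

```python
import io

import pandas as pd

# Toy excerpt in the layout of isd-history.csv (WBAN value inferred, not verified)
csv_text = (
    "USAF,WBAN,STATION NAME\n"
    "A06854,00115,BIG BEAR CITY AIRPORT\n"
    "690020,93218,JOLON HUNTER LIGGETT MIL RES\n"
)

# Default parsing: WBAN becomes an integer column, so 00115 turns into 115
default = pd.read_csv(io.StringIO(csv_text))
default_id = default["USAF"].astype(str) + default["WBAN"].astype(str)

# Reading both ID columns as strings preserves the leading zeros
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"USAF": str, "WBAN": str})
padded_id = as_text["USAF"] + as_text["WBAN"]
```

With the string dtype every ID has 11 characters, so it can be matched against 11-character station codes in the weather data.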


#### Third, use 'wildfire' and 'station' to build a data frame of California weather data from 2013 to 2019

In [6]:
# We get the weather data from https://www.ncei.noaa.gov/data/global-summary-of-the-day/archive/
# We only need the weather data from 2013 to 2019 for stations whose ID appears in 'station'
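
The loading step is not implemented yet. One possible approach is sketched here under several assumptions: that each yearly archive is a tar file holding one CSV per station named `<STATION>.csv`, and that the CSVs have `STATION`, `DATE`, `TEMP`, `DEWP`, and `WDSP` columns. The function name `read_gsod_archive` is hypothetical. Reading only the members whose names match our California station IDs avoids loading millions of unrelated rows.

```python
import io
import tarfile

import pandas as pd


def read_gsod_archive(fileobj, station_ids,
                      usecols=("STATION", "DATE", "TEMP", "DEWP", "WDSP")):
    """Read only the per-station CSVs for `station_ids` from one yearly archive.

    Assumes one CSV per station, named '<STATION>.csv', inside the tar file.
    """
    wanted = {f"{sid}.csv" for sid in station_ids}
    frames = []
    with tarfile.open(fileobj=fileobj, mode="r:*") as tar:
        for member in tar:
            # Ignore any directory prefix in the member name
            name = member.name.split("/")[-1]
            if member.isfile() and name in wanted:
                with tar.extractfile(member) as fh:
                    frames.append(
                        pd.read_csv(
                            fh,
                            usecols=list(usecols),
                            dtype={"STATION": str},  # keep leading zeros in the ID
                            parse_dates=["DATE"],
                        )
                    )
    if not frames:
        return pd.DataFrame(columns=list(usecols))
    return pd.concat(frames, ignore_index=True)


# Possible usage (archive file name is an assumption):
# with open("2013.tar.gz", "rb") as f:
#     weather_2013 = read_gsod_archive(f, station["STATION"])
```

Calling the function once per year and concatenating the results would give a single California weather table for 2013 to 2019.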

# Project Proposal (updated)

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 1/20  |  1 PM | Read & Think about COGS 108 expectations; brainstorm topics/questions  | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research | 
| 1/26  |  4 PM |  Do background research on topic | Discuss ideal dataset(s) and ethics; draft project proposal | 
| 1/28  |  8 PM  | Edit, finalize, and submit proposal | Search for datasets    |
| 2/3  | 8 PM  | Search for datasets | Discuss wrangling/EDA Plan; Assign group members to lead each specific part |
| 2/10  | 8 PM  | Data checkpoint | Review dataset; Edit wrangling/EDA   |
| 2/17  | 8 PM  | Import & Wrangle Data; EDA | Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| 2/24  | 8 PM  | Finalize wrangling/EDA; Begin Analysis | Discuss/edit Analysis; Submit EDA Checkpoint |
| 3/13  | 8 PM  | Complete analysis; Draft results/conclusion/discussion | Discuss/review/edit full project |
| 3/17  | Before 11:59 PM  | NA | Final check; Turn in Final Project & Group Project Surveys |