# Welcome  

Notebook Author: Samuel Alter  
Notebook Subject: Capstone Project - Geographic Analysis

BrainStation Winter 2023: Data Science

## Introduction

This notebook analyzes how two geographic features, elevation and aspect, influence wildfire incidence. We want to see whether these landscape features can accurately predict wildfire occurrence. The study area is the Santa Monica Mountains, an east-west trending mountain range.

The dataset consists of a high-density grid of points, each with elevation and aspect data attached. Each point also records whether a wildfire occurred at that location. The elevation and aspect data are sourced from [USGS EarthExplorer](https://earthexplorer.usgs.gov), using the SRTM dataset. The wildfire data are sourced from the [National Interagency Fire Center](https://data-nifc.opendata.arcgis.com/datasets/nifc::interagencyfireperimeterhistory-all-years-view/explore?location=39.778749%2C-121.769073%2C11.96).

Please refer to the visualization notebook to see maps of the field site.

## Initial Setup

### Imports

In [18]:
import pandas as pd
import numpy as np
import seaborn as sns

import statsmodels.api as sm

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [2]:
# import geographic information .CSV

sm_geo_original=pd.read_csv('/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/shapefiles/joins/sm_geo_combine.csv')
sm_geo_original

| | objectid | perim_id | perim_elevation_1 | perim_asp_1 | fire | geometry |
|---|---|---|---|---|---|---|
| 0 | 4751.0 | 1644.0 | 434.0 | 31.883564 | 1 | MULTIPOLYGON (((-118.297645979672 34.135057069... |
| 1 | 4751.0 | 1645.0 | 349.0 | 7.624194 | 1 | MULTIPOLYGON (((-118.297645979672 34.135057069... |
| 2 | 4751.0 | 1646.0 | 288.0 | 72.613029 | 1 | MULTIPOLYGON (((-118.297645979672 34.135057069... |
| 3 | 4751.0 | 1647.0 | 189.0 | 335.170654 | 1 | MULTIPOLYGON (((-118.297645979672 34.135057069... |
| 4 | 4751.0 | 1648.0 | 229.0 | 346.464142 | 1 | MULTIPOLYGON (((-118.297645979672 34.135057069... |
| ... | ... | ... | ... | ... | ... | ... |
| 19140 | 0.0 | 5935.0 | 52.0 | 149.036240 | 0 | POINT (-118.2930000000007 34) |
| 19141 | 0.0 | 5936.0 | 51.0 | 185.194427 | 0 | POINT (-118.2880000000007 34) |
| 19142 | 0.0 | 5937.0 | 52.0 | 270.000000 | 0 | POINT (-118.2830000000007 34) |
| 19143 | 0.0 | 5938.0 | 54.0 | 341.565063 | 0 | POINT (-118.2780000000007 34) |


### EDA

In [3]:
sm_geo_original.isna().sum()

objectid               0
perim_id             641
perim_elevation_1    641
perim_asp_1          668
fire                   0
geometry               0
dtype: int64

In [4]:
# show the rows where aspect is missing
sm_geo_original[sm_geo_original['perim_asp_1'].isna()]

| | objectid | perim_id | perim_elevation_1 | perim_asp_1 | fire | geometry |
|---|---|---|---|---|---|---|
| 14 | 4753.0 | NaN | NaN | NaN | 1 | MULTIPOLYGON (((-118.305363835672 34.119507991... |
| 15 | 4767.0 | NaN | NaN | NaN | 1 | MULTIPOLYGON (((-118.843533279074 34.169495387... |
| 24 | 4796.0 | NaN | NaN | NaN | 1 | MULTIPOLYGON (((-118.60574361788 34.1457095562... |
| 25 | 4849.0 | NaN | NaN | NaN | 1 | MULTIPOLYGON (((-118.377669263434 34.122895200... |
| 26 | 4850.0 | NaN | NaN | NaN | 1 | MULTIPOLYGON (((-118.574062124596 34.079625547... |
| ... | ... | ... | ... | ... | ... | ... |
| 18743 | 0.0 | 4600.0 | 32.0 | NaN | 0 | POINT (-118.3680000000007 34.04) |
| 18786 | 0.0 | 4752.0 | 53.0 | NaN | 0 | POINT (-118.4330000000006 34.035) |
| 18810 | 0.0 | 4776.0 | 67.0 | NaN | 0 | POINT (-118.3130000000007 34.035) |
| 18863 | 0.0 | 4941.0 | 52.0 | NaN | 0 | POINT (-118.3130000000007 34.03) |


In [5]:
(sm_geo_original.isna().sum())/(sm_geo_original.count())*100

objectid             0.000000
perim_id             3.464116
perim_elevation_1    3.464116
perim_asp_1          3.615306
fire                 0.000000
geometry             0.000000
dtype: float64
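Note that `count()` skips missing values, so the ratio above compares missing entries with non-missing ones rather than with all rows. As a minimal sketch on a made-up frame (the column names and values are illustrative, not taken from the project data), `isna().mean()` gives the share of all rows that are missing:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 1 of 4 aspect values is missing.
toy = pd.DataFrame({
    "elevation": [434.0, 349.0, 288.0, 189.0],
    "aspect": [31.9, np.nan, 72.6, 335.2],
})

# Missing-to-non-missing ratio (what isna().sum() / count() computes).
ratio_to_present = toy.isna().sum() / toy.count() * 100

# Share of all rows that are missing.
share_of_rows = toy.isna().mean() * 100

print(ratio_to_present["aspect"])  # 1 missing vs. 3 present
print(share_of_rows["aspect"])     # 1 missing out of 4 rows
```

With only a few percent of values missing, the two numbers are close, so the difference matters little for the decision that follows.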

Rows with missing values make up only about $3.5\%$ of the dataset ($668$ rows), so I will drop them:

In [6]:
sm_geo_original=sm_geo_original.dropna()
sm_geo_original.isna().sum()

objectid             0
perim_id             0
perim_elevation_1    0
perim_asp_1          0
fire                 0
geometry             0
dtype: int64

### Prepare data for analysis

In [25]:
X=sm_geo_original[['perim_elevation_1','perim_asp_1']]
y=sm_geo_original[['fire']]

y

| | fire |
|---|---|
| 0 | 1 |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 1 |
| ... | ... |
| 19140 | 0 |
| 19141 | 0 |
| 19142 | 0 |
| 19143 | 0 |


In [8]:
print(X.shape)
print(y.shape)

(18477, 2)
(18477, 1)


In [9]:
# reshape the y data into a 1D array for modeling

y_array=np.ravel(a=y,order='C')
y_array

array([1, 1, 1, ..., 0, 0, 0])

In [10]:
y.sum()/y.count()

fire    0.889809
dtype: float64

About $89\%$ of the points in the dataset experienced a wildfire, so the two classes are heavily imbalanced.

In [16]:
X.corr()

| | perim_elevation_1 | perim_asp_1 |
|---|---|---|
| perim_elevation_1 | 1.0 | -0.018131 |
| perim_asp_1 | -0.018131 | 1.0 |


Elevation and aspect are essentially uncorrelated ($r \approx -0.02$), so collinearity between the two predictors is not a concern.

## Modeling

### Train, Test, Split

Hold out one third of the data ($0.\overline{3}$) for testing, stratifying on `fire` so that the training and testing sets keep the same class balance.

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y_array, test_size=(1/3),stratify=y)
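To illustrate what `stratify` does, here is a small sketch on synthetic labels (the arrays and the `random_state` value are made up for the example; the split above does not set a seed). With stratification, the training and testing sets keep the same share of positives:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 positives, 10 negatives.
y_demo = np.array([1] * 90 + [0] * 10)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# random_state is fixed here only so the sketch is repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Both partitions keep the 90% positive rate of the full label array.
print(y_tr.mean(), y_te.mean())
```

Setting `random_state` in the real split would also make the reported scores reproducible between runs.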

#### Basic logistic regression:

##### `sklearn`:

In [12]:
# instantiate model
logreg=LogisticRegression()

# fit the model
logreg.fit(X_train,y_train)

print('Training Score: ',logreg.score(X_train,y_train))
print('Testing Score:  ',logreg.score(X_test,y_test))

Training Score:  0.8898360123396656
Testing Score:   0.889754830329599


The training and testing scores match closely, but the accuracy (about $0.89$) is no better than the base rate: a model that always predicts fire would score the same. Can `statsmodels` give a clearer statistical picture of what influences wildfires?
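To make the base-rate comparison concrete, a majority-class baseline can be scored in the same way as the model. The sketch below uses synthetic labels with an imbalance similar to the `fire` column (the arrays are made up, not the project data) and scikit-learn's `DummyClassifier`:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 89 positives and 11 negatives.
y_demo = np.array([1] * 89 + [0] * 11)
X_demo = np.zeros((len(y_demo), 2))  # the baseline ignores the features

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)

baseline_accuracy = baseline.score(X_demo, y_demo)
print(baseline_accuracy)
```

Any model worth keeping should beat this score, which equals the share of the majority class.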

##### `statsmodels`:

In [20]:
# add constant to prepare for statsmodels

X_withconstant=sm.add_constant(X)
X_withconstant.head()

| | const | perim_elevation_1 | perim_asp_1 |
|---|---|---|---|
| 0 | 1.0 | 434.0 | 31.883564 |
| 1 | 1.0 | 349.0 | 7.624194 |
| 2 | 1.0 | 288.0 | 72.613029 |
| 3 | 1.0 | 189.0 | 335.170654 |
| 4 | 1.0 | 229.0 | 346.464142 |


In [21]:
# instantiate model
logreg_sm=sm.Logit(y,X_withconstant)

# fit the model
logreg_sm_results=logreg_sm.fit()

# summary
logreg_sm_results.summary()

Optimization terminated successfully.
         Current function value: 0.284397
         Iterations 7


| | | | |
|---|---|---|---|
| Dep. Variable: | fire | No. Observations: | 18477 |
| Model: | Logit | Df Residuals: | 18474 |
| Method: | MLE | Df Model: | 2 |
| Date: | Wed, 22 Mar 2023 | Pseudo R-squ.: | 0.1802 |
| Time: | 13:35:58 | Log-Likelihood: | -5254.8 |
| converged: | True | LL-Null: | -6409.9 |
| Covariance Type: | nonrobust | LLR p-value: | 0.0 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -0.0242 | 0.072 | -0.338 | 0.736 | -0.165 | 0.116 |
| perim_elevation_1 | 0.0087 | 0.000 | 40.762 | 0.000 | 0.008 | 0.009 |
| perim_asp_1 | 0.0003 | 0.000 | 0.887 | 0.375 | -0.000 | 0.001 |
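As a quick check on the summary, the reported pseudo-$R^2$ can be recovered from the two log-likelihoods, assuming it is McFadden's pseudo-$R^2$ (which is what `statsmodels` documents for `Logit` results):

$$
R^2_{\text{McFadden}} = 1 - \frac{\ln L_{\text{model}}}{\ln L_{\text{null}}} = 1 - \frac{-5254.8}{-6409.9} \approx 0.1802
$$

Here $\ln L_{\text{model}}$ is the `Log-Likelihood` entry and $\ln L_{\text{null}}$ is the `LL-Null` entry, the log-likelihood of an intercept-only model.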


In [22]:
logreg_sm_results.params

const               -0.024207
perim_elevation_1    0.008738
perim_asp_1          0.000252
dtype: float64
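Written out with these parameters, the fitted model for the probability $p$ of fire at a point is:

$$
\ln\frac{p}{1-p} = -0.024207 + 0.008738\,x_{\text{elevation}} + 0.000252\,x_{\text{aspect}}
$$

As a worked example, the first row of the dataset has elevation $434.0$ and aspect $31.883564$:

$$
\ln\frac{p}{1-p} \approx -0.0242 + 3.7923 + 0.0080 = 3.7761,
\qquad
p = \frac{1}{1 + e^{-3.7761}} \approx 0.978
$$

The symbols $x_{\text{elevation}}$ and $x_{\text{aspect}}$ are shorthand introduced here for `perim_elevation_1` and `perim_asp_1`.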

In [24]:
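# use positional access (.iloc) because the params Series is indexed by name
beta0=logreg_sm_results.params.iloc[0] # constant
beta1=logreg_sm_results.params.iloc[1] # elevation
beta2=logreg_sm_results.params.iloc[2] # aspect

betas=[beta0,beta1,beta2]

# exponentiate each coefficient to get its odds ratio
odds=[np.exp(b) for b in betas]

print(odds)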

[0.9760838980920863, 1.0087765028598021, 1.0002517075801256]
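The exponentiated coefficients are odds ratios for a one-unit change in each predictor, which is why they sit so close to $1$. A rough sketch of how they scale over larger steps follows. The coefficients are copied from the fitted parameters above, while the step sizes ($100$ elevation units, $90$ aspect units, assuming aspect is in degrees) are illustrative choices:

```python
import numpy as np

# Coefficients copied from the fitted parameters reported above.
beta_elevation = 0.008738
beta_aspect = 0.000252

# Odds ratio for a one-unit increase in elevation.
or_elevation_1 = np.exp(beta_elevation)

# Odds ratio for a 100-unit increase in elevation: exp(100 * beta).
or_elevation_100 = np.exp(100 * beta_elevation)

# Odds ratio for a 90-unit change in aspect (assumed to be degrees).
or_aspect_90 = np.exp(90 * beta_aspect)

print(or_elevation_1, or_elevation_100, or_aspect_90)
```

Under these assumptions, a 100-unit gain in elevation multiplies the odds of fire by roughly $2.4$, while the aspect effect stays near $1$, consistent with its non-significant p-value ($0.375$) in the summary.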


Now let's try to increase the accuracy of our model.