# Welcome  

Notebook Author: Samuel Alter  
Notebook Subject: Capstone Project - Geographic Data Processing

BrainStation Winter 2023: Data Science

This notebook reads the `.geojson` fire perimeter files created in `QGIS`, imputes the `NaN` aspect values, and cleans the data for the next step: modeling the dataset with `statsmodels` and `sklearn`.

In [2]:
# imports

import numpy as np
import pandas as pd
import geopandas as gpd

from scipy.spatial import KDTree

# Join `city`, `farm`, `fire1`, `fire2` datasets together

I created four point layers: two in areas that experienced no fire and two in areas that did. Each point carries slope, elevation, and aspect values joined from the underlying raster layers. Because each layer lies entirely within either a fire or a nofire area, every point's fire label is already known. Concatenating the four layers therefore yields a dataset of fire/nofire point locations that I can feed into a model. I want to preserve the order of the datasets, which is:
* `city`
* `farm`
* `fire1`
* `fire2`

## Read in data

In [27]:
layer_city=gpd.read_file('/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/shapefiles/joins/patch_city_asp_elev_slope.geojson')
print(layer_city.shape)
display(layer_city.head())

(7524, 5)


Unnamed: 0,id,aspect1,elevation1,slope1,geometry
0,0,90.0,229.0,1.891522,POINT (361436.400 3782022.600)
1,1,180.0,230.0,0.535182,POINT (361513.200 3782022.600)
2,2,147.528809,227.0,2.269464,POINT (361590.000 3782022.600)
3,3,149.03624,226.0,1.55997,POINT (361666.800 3782022.600)
4,4,206.565048,228.0,2.037113,POINT (361743.600 3782022.600)


In [28]:
layer_farm=gpd.read_file('/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/shapefiles/joins/patch_farm_asp_elev_slope.geojson')
print(layer_farm.shape)
display(layer_farm.head())

(2394, 5)


Unnamed: 0,id,aspect1,elevation1,slope1,geometry
0,0,45.0,8.0,1.135177,POINT (307121.400 3783334.800)
1,1,210.96376,7.0,1.55997,POINT (307198.200 3783334.800)
2,2,270.0,9.0,0.535182,POINT (307275.000 3783334.800)
3,3,360.0,10.0,1.605173,POINT (307351.800 3783334.800)
4,4,90.0,9.0,1.070271,POINT (307428.600 3783334.800)


In [53]:
layer_fire1=gpd.read_file('/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/shapefiles/joins/patch_fire1_asp_elev_slope.geojson')
print(layer_fire1.shape)
display(layer_fire1.head())

(7524, 5)


Unnamed: 0,id,aspect1,elevation1,slope1,geometry
0,0,14.74356,52.0,12.257875,POINT (310476.000 3778264.200)
1,1,21.03751,45.0,7.412804,POINT (310552.800 3778264.200)
2,2,333.434967,45.0,4.775862,POINT (310629.600 3778264.200)
3,3,283.392487,51.0,5.75721,POINT (310706.400 3778264.200)
4,4,229.289154,63.0,14.839245,POINT (310783.200 3778264.200)


In [54]:
layer_fire2=gpd.read_file('/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/shapefiles/joins/patch_fire2_asp_elev_slope.geojson')
print(layer_fire2.shape)
display(layer_fire2.head())

(2394, 5)


Unnamed: 0,id,aspect1,elevation1,slope1,geometry
0,0,19.885162,604.0,13.139307,POINT (338507.400 3772576.200)
1,1,6.892421,604.0,23.176134,POINT (338584.200 3772576.200)
2,2,9.211024,592.0,19.296698,POINT (338661.000 3772576.200)
3,3,251.565048,598.0,10.868616,POINT (338737.800 3772576.200)
4,4,294.928467,618.0,20.086058,POINT (338814.600 3772576.200)


In [31]:
# are there any duplicates?
dups=layer_city[layer_city.duplicated()]
print(dups)

Empty GeoDataFrame
Columns: [id, aspect1, elevation1, slope1, geometry]
Index: []
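
The check above covers only `layer_city`. A minimal sketch of running the same check over several layers in one loop, using small stand-in `pandas` frames (the `layers` dict and its toy values are assumptions for illustration, not the project data):

```python
import pandas as pd

# toy stand-ins for the point layers (hypothetical values)
layers = {
    'city': pd.DataFrame({'id': [0, 1, 2], 'slope1': [1.9, 0.5, 2.3]}),
    'farm': pd.DataFrame({'id': [0, 1, 1], 'slope1': [1.1, 1.6, 1.6]}),
}

# count fully duplicated rows in each layer
dup_counts = {name: int(df.duplicated().sum()) for name, df in layers.items()}
print(dup_counts)
```

With the real layers, the dict would hold the four GeoDataFrames read in above.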


## Inspect dataframes for `NaN`: use `.isna()` and impute values if necessary

### `city`

In [32]:
layer_city.isna().sum()

id             0
aspect1       92
elevation1     0
slope1         0
geometry       0
dtype: int64

In [33]:
layer_city[layer_city['aspect1'].isna()]

Unnamed: 0,id,aspect1,elevation1,slope1,geometry
60,60,,209.0,0.000000,POINT (366044.400 3782022.600)
99,99,,207.0,0.000000,POINT (369039.600 3782022.600)
146,146,,201.0,0.000000,POINT (372649.200 3782022.600)
156,156,,197.0,0.000000,POINT (373417.200 3782022.600)
171,171,,192.0,0.000000,POINT (374569.200 3782022.600)
...,...,...,...,...,...
6706,6706,,177.0,0.000000,POINT (374646.000 3779488.200)
6902,6902,,180.0,0.000000,POINT (374492.400 3779411.400)
7031,7031,,195.0,0.535182,POINT (369193.200 3779334.600)
7112,7112,,174.0,0.000000,POINT (375414.000 3779334.600)
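
Most of the rows displayed above have a `slope1` of `0`, which suggests (though this listing alone does not prove it) that aspect is missing mainly where the terrain is flat. A small sketch of how that could be checked with a cross-tabulation, using toy values rather than the project data:

```python
import numpy as np
import pandas as pd

# toy stand-in for a point layer (hypothetical values)
toy = pd.DataFrame({
    'aspect1': [90.0, np.nan, np.nan, 180.0, np.nan],
    'slope1':  [1.9,  0.0,    0.0,    0.5,   0.5],
})

# rows: is aspect missing?  columns: is the slope exactly zero?
table = pd.crosstab(toy['aspect1'].isna(), toy['slope1'].eq(0),
                    rownames=['aspect_missing'], colnames=['flat'])
print(table)
```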


In [34]:
perc_nan_city=(layer_city['aspect1'].isna().sum())/(layer_city.shape[0])*100
perc_nan_city

1.2227538543328018

In [36]:
print(f'Percentage of rows with a null aspect value:\n~{round(perc_nan_city,2)}')

Percentage of rows with a null aspect value:
~1.22
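
The percentage here is the null count divided by the number of rows, times 100 (for `city`, 92 / 7524 * 100 ≈ 1.22). Since the same calculation is repeated for each layer, it could be wrapped in a small helper. `nan_percent` below is a hypothetical name, and the frame is toy data:

```python
import numpy as np
import pandas as pd


def nan_percent(df, column='aspect1'):
    """Return the percentage of rows whose `column` is NaN."""
    return df[column].isna().mean() * 100


toy = pd.DataFrame({'aspect1': [90.0, np.nan, 180.0, 270.0]})
print(round(nan_percent(toy), 2))

# the figure reported for the city layer: 92 nulls in 7524 rows
city_pct = 92 / 7524 * 100
print(round(city_pct, 2))
```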


I want to impute an aspect value for the `NaN` rows. Looking at the map, these points do have an aspect, and I'm not sure why the join returned `NaN`. There are too many to update manually. To impute the aspect, I could assign a random value to each point or copy values from adjacent points. I will use a nearest-neighbors approach to impute the missing data.

In [20]:
# make a function to impute aspect

def imputeAspect(df):
    '''
    Imputes missing aspect values from nearest neighbors, in place.
    
    Requires the scipy KDTree class and that the 
    aspect column is named 'aspect1'.
    
    'from scipy.spatial import KDTree'
    
    ----
    Input
    
    > df
    the GeoDataFrame of points that you want to impute.
    '''
    # create a KDTree from the x,y coordinates of the points
    coords = np.column_stack([df.geometry.x, df.geometry.y])
    tree = KDTree(coords)

    # get the row positions of the NaN values in the 'aspect1' column
    # (KDTree returns positions, so positional indexing is used throughout)
    nan_pos = np.flatnonzero(df['aspect1'].isna().to_numpy())
    aspect_col = df.columns.get_loc('aspect1')

    # iterate over the NaN positions and impute the values
    for pos in nan_pos:
        # query the 4 nearest points; the closest is normally the point itself,
        # so at most 3 real neighbors contribute after dropna()
        _, neighbor_pos = tree.query(coords[pos], k=4)

        # arithmetic mean of the neighbors' 'aspect1' values
        # (this does not account for the 0/360 degree wraparound)
        neighbor_vals = df['aspect1'].iloc[neighbor_pos].dropna()
        imputed_val = neighbor_vals.mean()

        # set the imputed value for the current row
        df.iloc[pos, aspect_col] = imputed_val
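
One caveat of averaging aspect directly: aspect is a compass angle, so the arithmetic mean of 350° and 10° is 180°, although both directions point roughly north. The sketch below is a possible variant, not the function used in this notebook. It works on plain `numpy` arrays, uses a circular mean, and reads only the original (non-imputed) values. The names `circular_mean_deg` and `impute_aspect_circular` and the toy points are hypothetical.

```python
import numpy as np
from scipy.spatial import KDTree


def circular_mean_deg(angles_deg):
    """Mean direction of angles given in degrees, wrapped to the 0-360 range."""
    rad = np.deg2rad(np.asarray(angles_deg, dtype=float))
    mean = np.arctan2(np.sin(rad).mean(), np.cos(rad).mean())
    return float(np.rad2deg(mean) % 360.0)


def impute_aspect_circular(xy, aspect, k=4):
    """Fill NaN aspects with the circular mean of the k nearest points.

    xy     : (n, 2) array of point coordinates, with n >= k
    aspect : length-n array of aspect values in degrees, NaN where missing
    """
    xy = np.asarray(xy, dtype=float)
    original = np.asarray(aspect, dtype=float)
    filled = original.copy()
    tree = KDTree(xy)
    for pos in np.flatnonzero(np.isnan(original)):
        # the query returns positions into xy; the nearest hit is the point itself
        _, nbrs = tree.query(xy[pos], k=k)
        vals = original[nbrs]
        vals = vals[~np.isnan(vals)]
        if vals.size:  # leave NaN in place if every neighbor is missing too
            filled[pos] = circular_mean_deg(vals)
    return filled


# toy example: the first point is missing and its neighbors all face north
xy = [[0, 0], [1, 0], [0, 1], [-1, 0], [5, 5]]
aspect = [np.nan, 350.0, 10.0, 0.0, 180.0]
filled = impute_aspect_circular(xy, aspect)
print(filled)
```

In this toy case the arithmetic mean of the three neighbors would be 120°, while the circular mean stays at north.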

In [37]:
imputeAspect(layer_city)

In [39]:
layer_city.isna().sum()

id            0
aspect1       0
elevation1    0
slope1        0
geometry      0
dtype: int64

It worked! Now for the `layer_farm` dataset.

### `farm`

In [40]:
layer_farm.isna().sum()

id              0
aspect1       102
elevation1      0
slope1          0
geometry        0
dtype: int64

In [41]:
perc_nan_farm=(layer_farm['aspect1'].isna().sum())/(layer_farm.shape[0])*100
perc_nan_farm

4.260651629072681

In [42]:
print(f'Percentage of rows with a null aspect value:\n~{round(perc_nan_farm,2)}')

Percentage of rows with a null aspect value:
~4.26


In [43]:
imputeAspect(layer_farm)

In [44]:
layer_farm.isna().sum()

id            0
aspect1       0
elevation1    0
slope1        0
geometry      0
dtype: int64

### `fire1`

In [55]:
layer_fire1.isna().sum()

id            0
aspect1       1
elevation1    0
slope1        0
geometry      0
dtype: int64

In [56]:
perc_nan_fire1=(layer_fire1['aspect1'].isna().sum())/(layer_fire1.shape[0])*100
perc_nan_fire1
print(f'Percentage of rows with a null aspect value:\n~{round(perc_nan_fire1,2)}')
imputeAspect(layer_fire1)

Percentage of rows with a null aspect value:
~0.01


In [57]:
layer_fire1.isna().sum()

id            0
aspect1       0
elevation1    0
slope1        0
geometry      0
dtype: int64

### `fire2`

In [58]:
layer_fire2.isna().sum()

id            0
aspect1       0
elevation1    0
slope1        0
geometry      0
dtype: int64

In [59]:
perc_nan_fire2=(layer_fire2['aspect1'].isna().sum())/(layer_fire2.shape[0])*100
perc_nan_fire2
print(f'Percentage of rows with a null aspect value:\n~{round(perc_nan_fire2,2)}')
imputeAspect(layer_fire2)

Percentage of rows with a null aspect value:
~0.0


In [60]:
layer_fire2.isna().sum()

id            0
aspect1       0
elevation1    0
slope1        0
geometry      0
dtype: int64

## Combine `city`, `farm`, `fire1`, and `fire2` datasets, in that order

First, I need to create a column denoting whether each layer comes from a fire area (`1`) or a nofire area (`0`).

In [62]:
layer_city['fire']=0
layer_city.head(3)

Unnamed: 0,id,aspect1,elevation1,slope1,geometry,fire
0,0,90.0,229.0,1.891522,POINT (361436.400 3782022.600),0
1,1,180.0,230.0,0.535182,POINT (361513.200 3782022.600),0
2,2,147.528809,227.0,2.269464,POINT (361590.000 3782022.600),0


In [63]:
layer_farm['fire']=0
layer_farm.head(3)

Unnamed: 0,id,aspect1,elevation1,slope1,geometry,fire
0,0,45.0,8.0,1.135177,POINT (307121.400 3783334.800),0
1,1,210.96376,7.0,1.55997,POINT (307198.200 3783334.800),0
2,2,270.0,9.0,0.535182,POINT (307275.000 3783334.800),0


In [64]:
layer_fire1['fire']=1
layer_fire1.head(3)

Unnamed: 0,id,aspect1,elevation1,slope1,geometry,fire
0,0,14.74356,52.0,12.257875,POINT (310476.000 3778264.200),1
1,1,21.03751,45.0,7.412804,POINT (310552.800 3778264.200),1
2,2,333.434967,45.0,4.775862,POINT (310629.600 3778264.200),1


In [65]:
layer_fire2['fire']=1
layer_fire2.head(3)

Unnamed: 0,id,aspect1,elevation1,slope1,geometry,fire
0,0,19.885162,604.0,13.139307,POINT (338507.400 3772576.200),1
1,1,6.892421,604.0,23.176134,POINT (338584.200 3772576.200),1
2,2,9.211024,592.0,19.296698,POINT (338661.000 3772576.200),1


Combine the dataframes in the order specified:

In [66]:
layer_combine=pd.concat([layer_city,layer_farm,layer_fire1,layer_fire2],axis=0)
layer_combine

Unnamed: 0,id,aspect1,elevation1,slope1,geometry,fire
0,0,90.000000,229.0,1.891522,POINT (361436.400 3782022.600),0
1,1,180.000000,230.0,0.535182,POINT (361513.200 3782022.600),0
2,2,147.528809,227.0,2.269464,POINT (361590.000 3782022.600),0
3,3,149.036240,226.0,1.559970,POINT (361666.800 3782022.600),0
4,4,206.565048,228.0,2.037113,POINT (361743.600 3782022.600),0
...,...,...,...,...,...,...
2389,2389,249.702423,92.0,19.977131,POINT (341041.800 3767814.600),1
2390,2390,222.709396,120.0,18.289597,POINT (341118.600 3767814.600),1
2391,2391,228.990921,134.0,15.892071,POINT (341195.400 3767814.600),1
2392,2392,233.686356,149.0,25.691380,POINT (341272.200 3767814.600),1
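
Because `pd.concat` keeps each frame's own index by default, the combined index restarts for every layer: the last rows shown above are labeled 2389 to 2392 even though the combined frame is far longer. A minimal sketch on toy frames of how `ignore_index=True` yields a single running index (the frames are made up for illustration):

```python
import pandas as pd

# toy stand-ins for a nofire layer and a fire layer (hypothetical values)
nofire = pd.DataFrame({'slope1': [1.9, 0.5], 'fire': 0})
fire = pd.DataFrame({'slope1': [12.3, 7.4, 4.8], 'fire': 1})

kept = pd.concat([nofire, fire], axis=0)                           # index restarts per frame
renumbered = pd.concat([nofire, fire], axis=0, ignore_index=True)  # one running index

print(kept.index.tolist())
print(renumbered.index.tolist())
```

Either way the row order follows the order of the list passed to `pd.concat`.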


In [67]:
layer_combine.describe()

Unnamed: 0,id,aspect1,elevation1,slope1,fire
count,19836.0,19836.0,19836.0,19836.0,19836.0
mean,3142.362069,178.895378,281.611867,10.670559,0.5
std,2213.394884,103.031413,203.469005,10.36528,0.500013
min,0.0,0.437359,1.0,0.0,0.0
25%,1239.0,93.709311,185.0,1.891522,0.0
50%,2564.5,176.633545,212.0,5.890609,0.5
75%,5044.0,261.869904,357.0,18.75248,1.0
max,7523.0,360.0,924.0,64.823074,1.0


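Mean and median are roughly the same for aspect (${\approx}179{^\circ}$ vs ${\approx}177{^\circ}$). Elevation (mean ${\approx}282$, median $212$) and slope (mean ${\approx}10.7$, median ${\approx}5.9$) are both skewed right.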
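
A quick way to put a number on skew is to compare mean and median, or to call `Series.skew()`. A sketch on made-up values (not the project data):

```python
import pandas as pd

# toy right-skewed sample: many small values and a few large ones (hypothetical)
slope = pd.Series([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 20.0, 45.0])

mean_minus_median = slope.mean() - slope.median()
skewness = slope.skew()  # positive for a right-skewed sample

print(round(mean_minus_median, 2), round(skewness, 2))
```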

In [68]:
layer_combine[layer_combine['fire']==0].describe()

Unnamed: 0,id,aspect1,elevation1,slope1,fire
count,9918.0,9918.0,9918.0,9918.0,9918.0
mean,3142.362069,176.750417,158.715568,2.458668,0.0
std,2213.450681,105.974963,88.066381,2.617656,0.0
min,0.0,1.4688,1.0,0.0,0.0
25%,1239.25,90.0,171.0,1.070271,0.0
50%,2564.5,171.869904,197.0,1.891522,0.0
75%,5043.75,270.0,208.0,2.879701,0.0
max,7523.0,360.0,324.0,29.405672,0.0


In [69]:
layer_combine[layer_combine['fire']==1].describe()

Unnamed: 0,id,aspect1,elevation1,slope1,fire
count,9918.0,9918.0,9918.0,9918.0,9918.0
mean,3142.362069,181.040339,404.508167,18.882451,1.0
std,2213.450681,99.960587,211.749173,8.552943,0.0
min,0.0,0.437359,32.0,0.0,1.0
25%,1239.25,107.251465,237.0,12.514403,1.0
50%,2564.5,178.736259,357.0,18.646463,1.0
75%,5043.75,257.900124,569.0,24.880323,1.0
max,7523.0,360.0,924.0,64.823074,1.0


Mean elevation is ${\approx}282$ meters in the combined dataset.
* In the fire areas, the mean elevation is ${\approx}405$ meters.
* In the nofire areas, the mean elevation is ${\approx}159$ meters.

Aspect is almost identical between the two areas.

Slope is much higher in the fire areas (${\approx}19{^\circ}$) versus the nofire (${\approx}2.5{^\circ}$) areas.
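
The three `describe()` calls above can also be condensed into one grouped summary. A sketch using `groupby` on a toy frame with the same column names (values are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    'elevation1': [200.0, 180.0, 400.0, 600.0],
    'slope1':     [2.0,   3.0,   18.0,  20.0],
    'fire':       [0,     0,     1,     1],
})

# one row per fire class, mean and median per feature
summary = toy.groupby('fire')[['elevation1', 'slope1']].agg(['mean', 'median'])
print(summary)
```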

In [71]:
# clean up dataframe to have just elevation, aspect, slope, and fire

layer_combine=layer_combine[['elevation1','aspect1','slope1','fire']]
layer_combine

Unnamed: 0,elevation1,aspect1,slope1,fire
0,229.0,90.000000,1.891522,0
1,230.0,180.000000,0.535182,0
2,227.0,147.528809,2.269464,0
3,226.0,149.036240,1.559970,0
4,228.0,206.565048,2.037113,0
...,...,...,...,...
2389,92.0,249.702423,19.977131,1
2390,120.0,222.709396,18.289597,1
2391,134.0,228.990921,15.892071,1
2392,149.0,233.686356,25.691380,1


In [72]:
# rename the columns, then reset the index so rows are numbered from 0
layer_combine=layer_combine.rename(columns={'elevation1':'elevation','aspect1':'aspect','slope1':'slope'})
layer_combine=layer_combine.reset_index(drop=True)
layer_combine


Unnamed: 0,elevation,aspect,slope,fire
0,229.0,90.000000,1.891522,0
1,230.0,180.000000,0.535182,0
2,227.0,147.528809,2.269464,0
3,226.0,149.036240,1.559970,0
4,228.0,206.565048,2.037113,0
...,...,...,...,...
19831,92.0,249.702423,19.977131,1
19832,120.0,222.709396,18.289597,1
19833,134.0,228.990921,15.892071,1
19834,149.0,233.686356,25.691380,1


### Write `layer_combine` to a `.csv`

In [73]:
path='/Users/sra/Desktop/Data_Science_2023/_capstone/00_capstone_data/shapefiles/joins/layer_combine.csv'

In [74]:
layer_combine.to_csv(path_or_buf=path,index=False)
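
Note that the CSV holds only the four numeric columns, since the point geometry was dropped earlier. A small sketch of a round-trip check after writing, using a temporary file and two toy rows (the file name and path are placeholders, not the project file):

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({
    'elevation': [229.0, 92.0],
    'aspect': [90.0, 249.702423],
    'slope': [1.891522, 19.977131],
    'fire': [0, 1],
})

with tempfile.TemporaryDirectory() as tmp:
    tmp_path = os.path.join(tmp, 'layer_combine_toy.csv')
    toy.to_csv(tmp_path, index=False)
    reloaded = pd.read_csv(tmp_path)

# same shape and column order as before writing
print(reloaded.shape, list(reloaded.columns))
```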

# Now the `layer_combine.csv` file will be used in the Geoanalysis notebook.