# Konzept *Data Preperation*

> Im Folgenden wird am Beispiel eines Data Set gezeigt, wie die Variablen im ersten Schritt aus der .nc-Datei extrahiert werden können. Anschließend werden sie für die Nutzung in einem pandas-Dataframe transformiert. Der Dataframe wird mit weiteren Variablen angereichert und auf die notwendige Anzahl an Zeilen reduziert. 

In [75]:
import netCDF4 as nc
import pandas as pd
import numpy as np
import plotly.graph_objects as go
from geopy.geocoders import Nominatim
import reverse_geocoder as rg
import time 

import warnings
warnings.filterwarnings('ignore')


## Dateienformat NetCDF

Die Daten liegen im .nc-Format (Network Common Data Format). Diesen Datenformat wird genutzt, um mehrdimensionale wissenschaftlichen Daten zu speichern. Um die Daten mit Python auszulesen, wird die Bibliothek `netCDF4` verwendet. Weiterverarbeitet werden die Daten mit `numpy`-Funktionen. 

In [4]:
fn = 'data/MODIS_20160701.nc'
ds = nc.Dataset(fn)

Als Beispieldatensatz werden im folgenden die Daten aus Juli 2016 analysiert. 

In [5]:
list_variables = list()
for var in ds.variables:
    list_variables.append(var)
list_variables

['lat',
 'lat_bnds',
 'lon',
 'lon_bnds',
 'time',
 'time_bnds',
 'vegetation_class',
 'vegetation_class_name',
 'burned_area',
 'standard_error',
 'fraction_of_burnable_area',
 'fraction_of_observed_area',
 'number_of_patches',
 'burned_area_in_vegetation_class']

Eine genaue Beschreibung der Variablen stellt die ESA in einer Dokumentation zur Verfügung (Quelle). Die Weiterverarbeitung der einzelnen Variablen wir im folgenden erläutert.  

In [6]:
#explore variables
for var in list_variables: 
    array = ds[var][:]
    print(f" Variable {var} mit Shape {array.shape}")

 Variable lat mit Shape (720,)
 Variable lat_bnds mit Shape (720, 2)
 Variable lon mit Shape (1440,)
 Variable lon_bnds mit Shape (1440, 2)
 Variable time mit Shape (1,)
 Variable time_bnds mit Shape (1, 2)
 Variable vegetation_class mit Shape (18,)
 Variable vegetation_class_name mit Shape (18, 150)
 Variable burned_area mit Shape (1, 720, 1440)
 Variable standard_error mit Shape (1, 720, 1440)
 Variable fraction_of_burnable_area mit Shape (1, 720, 1440)
 Variable fraction_of_observed_area mit Shape (1, 720, 1440)
 Variable number_of_patches mit Shape (1, 720, 1440)
 Variable burned_area_in_vegetation_class mit Shape (1, 18, 720, 1440)


Wie bereits erwähnt ermöglicht NetCDF, dass Speichern mehrdimensionaler Variablen. Dabei unterscheidet sich die Anzahl der Dimensionen je Variable. In diesem Notebook werden die einzeln Variablen transformiert und näher beleuchte. 

## Längen- und Breitengrad

Die zentralen Variablen im Data Set ist das Längen- und Breitenmaß. *Longitude* und *Latitude* geben den Mittelpunkt, der Rasterzelle mit der Auflösung 0,25° x 0,25° an. Die geographische Breite von -90°bis 90° und die geographische Länge von -180° bis 180° sind in zwei verschiedenen Variablen gespeichert. Um die Information nutzen zu können werden aus den einzelnen Variablen die Geokoordinaten gebildet. 

**(1) Auslesen der Arrays aus dem Data Set**

In [7]:
lat = ds['lat'][:]
lon = ds['lon'][:]
lon

masked_array(data=[-179.875, -179.625, -179.375, ...,  179.375,  179.625,
                    179.875],
             mask=False,
       fill_value=1e+20,
            dtype=float32)

**(2)  Umwandlung der Koordinaten-Vektoren in eine Koordinaten-Matrix**

In [10]:
lons, lats = np.meshgrid(lon, lat)
print(lons)
print(lats)

[[-179.875 -179.625 -179.375 ...  179.375  179.625  179.875]
 [-179.875 -179.625 -179.375 ...  179.375  179.625  179.875]
 [-179.875 -179.625 -179.375 ...  179.375  179.625  179.875]
 ...
 [-179.875 -179.625 -179.375 ...  179.375  179.625  179.875]
 [-179.875 -179.625 -179.375 ...  179.375  179.625  179.875]
 [-179.875 -179.625 -179.375 ...  179.375  179.625  179.875]]
[[ 89.875  89.875  89.875 ...  89.875  89.875  89.875]
 [ 89.625  89.625  89.625 ...  89.625  89.625  89.625]
 [ 89.375  89.375  89.375 ...  89.375  89.375  89.375]
 ...
 [-89.375 -89.375 -89.375 ... -89.375 -89.375 -89.375]
 [-89.625 -89.625 -89.625 ... -89.625 -89.625 -89.625]
 [-89.875 -89.875 -89.875 ... -89.875 -89.875 -89.875]]


Die Funktion `np.meshgrid` interpretiert die Input-Variablen als Matrix-Index. Zum Bespiel erzeugt die Funktion aus zwei Input-Arrays (0,1,3), die Output-Arrays ([0,1,2], [0,1,2,], [0,1,2]) und ([0,0,0], [1,1,1], [2,2,2]). Durch das Glätten der Arrays erhalt man die im Anschluss die Variablen: (0,0), (1,0), (2,0), (0,1, etc. 

**(3) Glätten der Variablen zu 1-dimensionalen Arrays**

In [11]:
lons_flatt = lons.flatten()
lats_flatt = lats.flatten()
lons_flatt

masked_array(data=[-179.875, -179.625, -179.375, ...,  179.375,  179.625,
                    179.875],
             mask=False,
       fill_value=1e+20,
            dtype=float32)

In [12]:
whole_world = {
    'lon': lons_flatt, 
    'lat': lats_flatt, 
}
df_whole_world = pd.DataFrame(whole_world)
df_whole_world

Unnamed: 0,lon,lat
0,-179.875,89.875
1,-179.625,89.875
2,-179.375,89.875
3,-179.125,89.875
4,-178.875,89.875
...,...,...
1036795,178.875,-89.875
1036796,179.125,-89.875
1036797,179.375,-89.875
1036798,179.625,-89.875


Die beiden Arrays werden in einen Dataframe übernommen und bilden zusammen den *key* des Dataframe. Die Schritte 1-3 werden für automatisierte Ausführung in die Funktion `create_geoframe` überführt. Die Funktion ist Teil der func.py-Datei. 

In [13]:
def create_geoframe(dataset):
    lat = dataset['lat'][:]
    lon = dataset['lon'][:]
    lons, lats = np.meshgrid(lon, lat)
    lons_flatt = lons.flatten()
    lats_flatt = lats.flatten()
    whole_world = {
    'lon': lons_flatt, 
    'lat': lats_flatt, }
    geoframe = pd.DataFrame(whole_world)
    return geoframe

## Numerische Varaiblen 

In [None]:
# Variable burned_area mit Shape (1, 720, 1440)
# Variable standard_error mit Shape (1, 720, 1440)
# Variable fraction_of_burnable_area mit Shape (1, 720, 1440)
# Variable fraction_of_observed_area mit Shape (1, 720, 1440)
# Variable number_of_patches mit Shape (1, 720, 1440)

Die Variablen `burned_area`, ` standard_error`, ` fraction_of_burnable_area`, ` fraction_of_observed_area` und ` number_of_patches` haben die gleiche Form und können aus diesem Grund gleich transformiert werden. 

In [14]:
def create_column(dataset,name_variable):
    df_temp = pd.DataFrame()
    # Auslesen des Arrays aus dem DataSet 
    variable = dataset[name_variable][:]
    variable = variable[0]
    # Glätten des Arrays
    variable_flatt = variable.flatten()
    # Erstellen des DataFrames 
    df_temp[name_variable] = variable_flatt
    return df_temp

Ergebnis der Funktion ist ein Dataframe. Da die Variablen immer in Bezug auf die Geokoordinaten in gleicher Reihenfolge angegeben werden, kann der Dataframe mit `concat` an den bestehenden Dataframe gehängt werden. 

In [15]:
df_all_data = pd.concat([df_whole_world,
            create_column(ds, 'burned_area'),
            create_column(ds, 'standard_error'),
            create_column(ds,'fraction_of_burnable_area'),
            create_column(ds,'fraction_of_observed_area'),
            create_column(ds, 'number_of_patches')], axis = 1, sort = False)

In [16]:
df_all_data

Unnamed: 0,lon,lat,burned_area,standard_error,fraction_of_burnable_area,fraction_of_observed_area,number_of_patches
0,-179.875,89.875,0.0,0.0,0.0,0.0,0.0
1,-179.625,89.875,0.0,0.0,0.0,0.0,0.0
2,-179.375,89.875,0.0,0.0,0.0,0.0,0.0
3,-179.125,89.875,0.0,0.0,0.0,0.0,0.0
4,-178.875,89.875,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...
1036795,178.875,-89.875,0.0,0.0,0.0,0.0,0.0
1036796,179.125,-89.875,0.0,0.0,0.0,0.0,0.0
1036797,179.375,-89.875,0.0,0.0,0.0,0.0,0.0
1036798,179.625,-89.875,0.0,0.0,0.0,0.0,0.0


## Vegetationsklassen 

In [29]:
#Variable burned_area_in_vegetation_class mit Shape (1, 18, 720, 1440)

burned_area_in_vegetation_class= ds['burned_area_in_vegetation_class'][:]
burned_area_in_vegetation_class = burned_area_in_vegetation_class[0]


Das Data Set enthält neben der Information der insgesamt verbrannte Fläche pro Rasterzelle eine Aufschlüsselung der verbrannten Fläche pro Vegetationsklasse. Die Informationen zur Vegetationsklassen werden im Rahmen des Projekts „Land Cover CCI“ der ESA bestimmt. Im Data Set wird in 18 verschiedene Vegetationsklassen unterschieden:  

In [30]:
vegetation_class = ds['vegetation_class'][:]
vegetation_class

masked_array(data=[ 10,  20,  30,  40,  50,  60,  70,  80,  90, 100, 110,
                   120, 130, 140, 150, 160, 170, 180],
             mask=False,
       fill_value=999999)

In [31]:
vegetation_class_name = ds['vegetation_class_name'][:]
vegetation_class_name[1]

masked_array(data=[b'C', b'r', b'o', b'p', b'l', b'a', b'n', b'd', b',',
                   b' ', b'i', b'r', b'r', b'i', b'g', b'a', b't', b'e',
                   b'd', b' ', b'o', b'r', b' ', b'p', b'o', b's', b't',
                   b'-', b'f', b'l', b'o', b'o', b'd', b'i', b'n', b'g',
                   --, --, --, --, --, --, --, --, --, --, --, --, --, --,
                   --, --, --, --, --, --, --, --, --, --, --, --, --, --,
                   --, --, --, --, --, --, --, --, --, --, --, --, --, --,
                   --, --, --, --, --, --, --, --, --, --, --, --, --, --,
                   --, --, --, --, --, --, --, --, --, --, --, --, --, --,
                   --, --, --, --, --, --, --, --, --, --, --, --, --, --,
                   --, --, --, --, --, --, --, --, --, --, --, --, --, --,
                   --, --, --, --, --, --, --, --, --, --, --, --, --, --,
                   --, --],
             mask=[False, False, False, False, False, False, False, False,
     

Da die Variable ` vegetation_class_name` keine verständlichen Namen der Vegetationsklassen enthält werden die Namen aus der Dokumentation übernommen. 

In [45]:
name_vegetation_class ={
    10: ['no_data'], 
    20: ['cropland_rainfed'], #Ackerland 
    30: ['cropland_irrigated'], #Ackerland_bewässert
    40: ['50_mosaic_cropland_50_natural_vegetation'], #Ackerland >50%, natürliche Bepflanzung <50%
    50: ['tree_cover_broadleaved_evergreen'], # immergrüne Laubbäume 
    60: ['tree_cover_broadleaved_deciduous'], # laubabwerfend Laubbäume
    70: ['tree_cover_needleleaved_evergreen'], # immmergrüne Nadelbäume
    80: ['tree_cover_needleleaved_deciduous'], # laubabwerfend Laubbäume
    90: ['tree_cover_mixed_leave'], # Mischwald
    100: ['50_mosaic_tree_50_herbaceous'], # Bäume >50%, <50% krautig Bewachsen
    110: ['50_herbaceous_50_tree'], # >50% krautige Bewachsen, <50% Bäu,e 
    120: ['shrubland'], #Buschland
    130: ['grassland'], #Grasland
    140: ['lichens_and_mosses'], #Flechten und Moose
    150: ['sparse_vegetation'], #spärliche Vegetation 
    160: ['tree_cover_flooded_fresh_water'], #geflutete Bäume Süßwasser
    170: ['tree_cover_flooded_saline_water'], # geflutete Bäume Salzwasser
    180: ['shrub_flooded_water'] #geflutetes Buschland
}

Für das Projekt werden die einzelnen Kategorien zusammengefasst: 
* 10: no_data
* 20, 30, 40: cropland
* 100, 110: mosaic_tree_grass
* 120, 130, 140, 150: other_vegetation
* 160, 170, 180: flooded_area

Die Vegetationsklassen 50 -90 werden belassen, da die Untersuchung von Waldbränden im Fokus steht und dabei die genaue Aufsplittung interessant sein kann.

In [53]:
i_vegetation_class = 10 
for burned_area_per_veg_class in burned_area_in_vegetation_class:
    s_vegetation_class= f"{i_vegetation_class}_burned_area"
    burned_area_per_veg_class_flatt= burned_area_per_veg_class.flatten()
    df_all_data[s_vegetation_class] = burned_area_per_veg_class_flatt
    i_vegetation_class= i_vegetation_class + 10
    #print(a)

In [54]:
df_all_data

Unnamed: 0,lon,lat,burned_area,standard_error,fraction_of_burnable_area,fraction_of_observed_area,number_of_patches,10_burned_area,20_burned_area,30_burned_area,...,90_burned_area,100_burned_area,110_burned_area,120_burned_area,130_burned_area,140_burned_area,150_burned_area,160_burned_area,170_burned_area,180_burned_area
0,-179.875,89.875,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,-179.625,89.875,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,-179.375,89.875,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,-179.125,89.875,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,-178.875,89.875,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1036795,178.875,-89.875,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1036796,179.125,-89.875,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1036797,179.375,-89.875,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1036798,179.625,-89.875,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [55]:
#Spalte umbennen
df_all_data['cropland_burned_area'] = df_all_data['20_burned_area'] + df_all_data['30_burned_area'] + df_all_data['40_burned_area']
df_all_data['mosaic_tree_grass_burned_area'] = df_all_data['100_burned_area'] + df_all_data['100_burned_area']
df_all_data['other_vegetation_burned_area']= df_all_data['120_burned_area'] + df_all_data['130_burned_area'] + df_all_data['140_burned_area']+ df_all_data['150_burned_area']
df_all_data['flooded_area_burned_area']= df_all_data['160_burned_area'] + df_all_data['170_burned_area'] + df_all_data['180_burned_area']


In [56]:
df_all_data.drop(columns=['20_burned_area','30_burned_area','40_burned_area','100_burned_area','110_burned_area','120_burned_area', '130_burned_area', '140_burned_area', '150_burned_area','160_burned_area','170_burned_area','180_burned_area'], inplace=True)

In [57]:
replace_names={ '10_burned_area': name_vegetation_class[10][0] + "_burned_area",
                '50_burned_area': name_vegetation_class[50][0] + "_burned_area",
                '60_burned_area': name_vegetation_class[60][0] + "_burned_area", 
                '70_burned_area': name_vegetation_class[70][0] + "_burned_area", 
                '80_burned_area': name_vegetation_class[80][0] + "_burned_area", 
                '90_burned_area': name_vegetation_class[90][0] + "_burned_area", }
replace_names

{'10_burned_area': 'no_data_burned_area',
 '50_burned_area': 'tree_cover_broadleaved_evergreen_burned_area',
 '60_burned_area': 'tree_cover_broadleaved_deciduous_burned_area',
 '70_burned_area': 'tree_cover_needleleaved_evergreen_burned_area',
 '80_burned_area': 'tree_cover_needleleaved_deciduous_burned_area',
 '90_burned_area': 'tree_cover_mixed_leave_burned_area'}

In [58]:
df_all_data.rename(columns = replace_names, inplace = True)
df_all_data

Unnamed: 0,lon,lat,burned_area,standard_error,fraction_of_burnable_area,fraction_of_observed_area,number_of_patches,no_data_burned_area,tree_cover_broadleaved_evergreen_burned_area,tree_cover_broadleaved_deciduous_burned_area,tree_cover_needleleaved_evergreen_burned_area,tree_cover_needleleaved_deciduous_burned_area,tree_cover_mixed_leave_burned_area,cropland_burned_area,mosaic_tree_grass_burned_area,other_vegetation_burned_area,flooded_area_burned_area
0,-179.875,89.875,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,-179.625,89.875,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,-179.375,89.875,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,-179.125,89.875,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,-178.875,89.875,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1036795,178.875,-89.875,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1036796,179.125,-89.875,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1036797,179.375,-89.875,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1036798,179.625,-89.875,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Die gerade gezeigten Schritte werden ebenfalls als Funktion `create_vegetation_class_breakdown` in die `func.py`-Datei aufgenommen. 

## Transformation des Data Frames 

In diesem Kapitel werden die numerischen Variablen nähere betrachtet und untersucht ob weitere Transformationsschritte notwendig sind. 

In [59]:
df_all_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1036800 entries, 0 to 1036799
Data columns (total 17 columns):
 #   Column                                         Non-Null Count    Dtype  
---  ------                                         --------------    -----  
 0   lon                                            1036800 non-null  float32
 1   lat                                            1036800 non-null  float32
 2   burned_area                                    1036800 non-null  float32
 3   standard_error                                 1036800 non-null  float32
 4   fraction_of_burnable_area                      1036800 non-null  float32
 5   fraction_of_observed_area                      1036800 non-null  float32
 6   number_of_patches                              1036800 non-null  float32
 7   no_data_burned_area                            1036800 non-null  float32
 8   tree_cover_broadleaved_evergreen_burned_area   1036800 non-null  float32
 9   tree_cover_broadleaved_d

In [60]:
for column_name in df_all_data.columns:
    column = df_all_data[column_name]
    count = (column == 0).sum()
    print('Anzahl an 0 in Spalte', column_name, ' ist : ', count)

Anzahl an 0 in Spalte lon  ist :  0
Anzahl an 0 in Spalte lat  ist :  0
Anzahl an 0 in Spalte burned_area  ist :  1021636
Anzahl an 0 in Spalte standard_error  ist :  1021636
Anzahl an 0 in Spalte fraction_of_burnable_area  ist :  793999
Anzahl an 0 in Spalte fraction_of_observed_area  ist :  807313
Anzahl an 0 in Spalte number_of_patches  ist :  1021636
Anzahl an 0 in Spalte no_data_burned_area  ist :  1029851
Anzahl an 0 in Spalte tree_cover_broadleaved_evergreen_burned_area  ist :  1033357
Anzahl an 0 in Spalte tree_cover_broadleaved_deciduous_burned_area  ist :  1029262
Anzahl an 0 in Spalte tree_cover_needleleaved_evergreen_burned_area  ist :  1035638
Anzahl an 0 in Spalte tree_cover_needleleaved_deciduous_burned_area  ist :  1035730
Anzahl an 0 in Spalte tree_cover_mixed_leave_burned_area  ist :  1036442
Anzahl an 0 in Spalte cropland_burned_area  ist :  1030119
Anzahl an 0 in Spalte mosaic_tree_grass_burned_area  ist :  1031670
Anzahl an 0 in Spalte other_vegetation_burned_area 

Der gesamte Dataframe enthält keine NULL-Werte, aber eine große Anzahl an 0-Werte. Das ursprüngliche Data Set enthält, wie erwähnt, Informationen zu jeder Rasterzelle. Da jedoch nicht jede Rasterzelle verbrannte Fläche enthält, ist der Wert für die `burned_area` größtenteils 0. Für die weitere Auswertung haben nur Rasterzellen einen Mehrwert, welche verbrannte Fläche enthalten. Damit kann der Dataframe mit `burned_area >0` gefiltert werden und damit die gesamte Größe signifikant reduziert werden. 

Die Variable `fraction_of_observed_area` gibt den Anteils der verbrennbaren Fläche an, die in dem Monat beobachtet werden konnte, z.B. ohne Wolkenbedeckung. In der Dokumentation ist die Empfehlung enthalten nur Rasterzellen in die weitere Analyse einzubeziehen, bei welchen der Anteil größer als 80% ist. Aus diesem Grund wird der Dataframe mit `fraction_of_observed_area  > 0.8` gefiltert. 

In [72]:
df_all_data = df_all_data[((df_all_data['fraction_of_observed_area'] > 0.8) & (df_all_data['burned_area']>0))]

In [73]:
df_numerical = df_all_data[['burned_area', 'standard_error','fraction_of_burnable_area','fraction_of_observed_area', 'number_of_patches' ]]

In [69]:
df_numerical.describe()

Unnamed: 0,burned_area,standard_error,fraction_of_burnable_area,fraction_of_observed_area,number_of_patches
count,15115.0,15115.0,15115.0,15115.0,15115.0
mean,32345114.0,1145705.0,0.96043,0.994243,25.589348
std,58782248.0,518922.4,0.118531,0.015193,37.621387
min,53664.0,105968.0,0.006217,0.80046,1.0
25%,1395281.0,726708.5,0.979965,0.995108,2.0
50%,6600754.0,980657.0,0.998817,0.999575,8.0
75%,34560048.0,1542320.0,1.0,1.0,33.0
max,696567424.0,2548772.0,1.0,1.0,289.0


Die Variable `fraction_of_burnable_area` gibt den Anteil der Fläche pro Rasterzelle an, welche verbrannt werden kann, z.B. keine verbrennbare Fläche sind die Weltmeere. Die Variable `number_of_patches` gibt die Anzahl an verbunden verbrannten Rasterzellen an. Die Variable `standard_error` gibt den Wert der Standardfehler bei der Bewertung der verbrannten Fläche an.  Diese drei Variablen werden bei der Erstellung des Dashboards nicht weiter berücksichtigt, aber verbleiben Variable im Dataframe, da sie weiterführende Betrachtungen notwendig sein können. 

In [31]:
corr_matrix = df_numerical.corr()
corr_matrix

Unnamed: 0,burned_area,standard_error,fraction_of_burnable_area,fraction_of_observed_area,number_of_patches
burned_area,1.0,0.75336,0.05198,0.050848,0.719608
standard_error,0.75336,1.0,0.08956,0.088193,0.829647
fraction_of_burnable_area,0.05198,0.08956,1.0,0.014466,0.063728
fraction_of_observed_area,0.050848,0.088193,0.014466,1.0,0.062504
number_of_patches,0.719608,0.829647,0.063728,0.062504,1.0


Die Korrelation zwischen ` standard_error`, `number_of_patches` und `burend_area` ist hoch. Für das Dashboard wird das nicht weiter betrachtet. Für andere Anwendungsfälle sollte die hohe Korrelation berücksichtig werden.

## Zusätzliche Variablen 

### Variable Land

Um z.B. Auswerten zu können, welche Länder besonders stark von Bränden betroffen sind, muss das die Landesbezeichnung als zusätzlich Variable ergänzt werden.  Mit *reverse geocoding* wird dabei die Landesbezeichnung aus den Geokoordinaten ermittelt.  

In [74]:
df_all_data.shape 

(15115, 17)

In [76]:
df_geo = df_all_data.head(100)

In [77]:
geolocator = Nominatim(user_agent="my_app")
st = time.time()
df_geo['add'] = df_geo.apply(lambda x : (geolocator.reverse((x['lat'], x['lon']), zoom=3, language= 'en')).address, axis = 1)
et = time.time()
elapsed_time = et - st
print('Execution time:', elapsed_time, 'seconds')

Execution time: 49.86528563499451 seconds


In [78]:
df_geo

Unnamed: 0,lon,lat,burned_area,standard_error,fraction_of_burnable_area,fraction_of_observed_area,number_of_patches,no_data_burned_area,tree_cover_broadleaved_evergreen_burned_area,tree_cover_broadleaved_deciduous_burned_area,tree_cover_needleleaved_evergreen_burned_area,tree_cover_needleleaved_deciduous_burned_area,tree_cover_mixed_leave_burned_area,cropland_burned_area,mosaic_tree_grass_burned_area,other_vegetation_burned_area,flooded_area_burned_area,add
103243,70.875,72.125,1931928.0,488056.0,0.926680,0.967521,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1931927.0,0.0,Russia
106161,80.375,71.625,53664.0,345052.0,0.920062,0.971039,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,53664.0,0.0,Russia
106162,80.625,71.625,53664.0,343751.0,0.902998,0.966064,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,53664.0,0.0,Russia
107567,71.875,71.375,912299.0,291183.0,0.585074,0.933804,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,858634.0,53664.0,Russia
109006,71.625,71.125,53664.0,357733.0,0.958995,0.976494,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,53664.0,0.0,Russia
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
133601,100.375,66.875,4561496.0,554328.0,0.996817,0.998758,17.0,0.0,0.0,0.0,0.0,3810191.0,0.0,0.0,965964.0,268323.0,0.0,Russia
133602,100.625,66.875,32037806.0,602866.0,0.979830,0.986818,13.0,0.0,0.0,0.0,0.0,29730226.0,0.0,0.0,1931928.0,1287951.0,0.0,Russia
133690,122.625,66.875,1395281.0,524477.0,0.963755,0.985691,6.0,0.0,0.0,0.0,0.0,1287952.0,0.0,0.0,0.0,107329.0,0.0,Russia
133691,122.875,66.875,1717269.0,449843.0,0.983728,0.987055,1.0,0.0,0.0,0.0,0.0,1609940.0,0.0,0.0,0.0,107329.0,0.0,Russia


Mit der Bibliothek `Nominatim` kann eine große Zahl an Geoinformation abgerufen werden. Das führt aber auch zu einer hohen Durchlaufzeit bei der Ermittlung der Länderbezeichnung. Effektiver ist in diesem Fall die Bibliothek `reverse_geocoder` und aus diesem Grund besser für die großen Datenmengen dieses Anwendungsfalls geeignet. 

In [79]:
st = time.time()

coords = tuple(zip(df_geo['lat'], df_geo['lon']))

results_rg = rg.search(coords)
results_admin2 = [x.get('cc') for x in results_rg]

# Optional: insert admin2 results into new df column
df_geo['add_rg'] = results_admin2
et = time.time()

# get the execution time
elapsed_time = et - st
print('Execution time:', elapsed_time, 'seconds')


Loading formatted geocoded file...
Execution time: 2.40061354637146 seconds


In [81]:
df_geo

Unnamed: 0,lon,lat,burned_area,standard_error,fraction_of_burnable_area,fraction_of_observed_area,number_of_patches,no_data_burned_area,tree_cover_broadleaved_evergreen_burned_area,tree_cover_broadleaved_deciduous_burned_area,tree_cover_needleleaved_evergreen_burned_area,tree_cover_needleleaved_deciduous_burned_area,tree_cover_mixed_leave_burned_area,cropland_burned_area,mosaic_tree_grass_burned_area,other_vegetation_burned_area,flooded_area_burned_area,add,add_rg
103243,70.875,72.125,1931928.0,488056.0,0.926680,0.967521,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1931927.0,0.0,Russia,RU
106161,80.375,71.625,53664.0,345052.0,0.920062,0.971039,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,53664.0,0.0,Russia,RU
106162,80.625,71.625,53664.0,343751.0,0.902998,0.966064,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,53664.0,0.0,Russia,RU
107567,71.875,71.375,912299.0,291183.0,0.585074,0.933804,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,858634.0,53664.0,Russia,RU
109006,71.625,71.125,53664.0,357733.0,0.958995,0.976494,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,53664.0,0.0,Russia,RU
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
133601,100.375,66.875,4561496.0,554328.0,0.996817,0.998758,17.0,0.0,0.0,0.0,0.0,3810191.0,0.0,0.0,965964.0,268323.0,0.0,Russia,RU
133602,100.625,66.875,32037806.0,602866.0,0.979830,0.986818,13.0,0.0,0.0,0.0,0.0,29730226.0,0.0,0.0,1931928.0,1287951.0,0.0,Russia,RU
133690,122.625,66.875,1395281.0,524477.0,0.963755,0.985691,6.0,0.0,0.0,0.0,0.0,1287952.0,0.0,0.0,0.0,107329.0,0.0,Russia,RU
133691,122.875,66.875,1717269.0,449843.0,0.983728,0.987055,1.0,0.0,0.0,0.0,0.0,1609940.0,0.0,0.0,0.0,107329.0,0.0,Russia,RU


### Variable Datum

Zum Vergleichen der Brand-Daten über mehrere Zeiträume fehlt im Data Frame die Datumsinformation. Da für jeden Monat eine einzelne NetCDF-Datei abgelegt wird, kann das Datum aus dem Dateinamen generiert und als zusätzlich Spalte zum Dataframe hinzugefügt werden. 

In [83]:
def add_date(filename, dataframe ):
    string_date = filename[-11:-7] + "-" + filename[-7:-5] + "-" + filename[-5:-3]
    dataframe['date'] = string_date
    return dataframe

In [87]:
add_date(fn, df_all_data)

Unnamed: 0,lon,lat,burned_area,standard_error,fraction_of_burnable_area,fraction_of_observed_area,number_of_patches,no_data_burned_area,tree_cover_broadleaved_evergreen_burned_area,tree_cover_broadleaved_deciduous_burned_area,tree_cover_needleleaved_evergreen_burned_area,tree_cover_needleleaved_deciduous_burned_area,tree_cover_mixed_leave_burned_area,cropland_burned_area,mosaic_tree_grass_burned_area,other_vegetation_burned_area,flooded_area_burned_area,date
103243,70.875,72.125,1931928.0,488056.0,0.926680,0.967521,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1931927.0,0.0,2016-07-01
106161,80.375,71.625,53664.0,345052.0,0.920062,0.971039,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,53664.0,0.0,2016-07-01
106162,80.625,71.625,53664.0,343751.0,0.902998,0.966064,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,53664.0,0.0,2016-07-01
107567,71.875,71.375,912299.0,291183.0,0.585074,0.933804,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,858634.0,53664.0,2016-07-01
109006,71.625,71.125,53664.0,357733.0,0.958995,0.976494,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,53664.0,0.0,2016-07-01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
721884,-68.875,-35.375,1770934.0,844149.0,0.991741,0.998884,1.0,160994.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1556275.0,0.0,2016-07-01
727629,-72.625,-36.375,1448946.0,804088.0,0.989306,0.995816,1.0,0.0,912299.0,0.0,0.0,0.0,0.0,0.0,1073292.0,0.0,0.0,2016-07-01
732839,149.875,-37.125,1287952.0,765822.0,0.818712,0.993828,1.0,0.0,1287952.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2016-07-01
737711,-72.125,-38.125,1609940.0,1146090.0,1.000000,1.000000,2.0,0.0,1609940.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2016-07-01


---

Im nächsten Notebook werden die exemplarisch für eine Datei angewandten Schritt automatisiert für alle Dateien von 2001-2020 durchgeführt und diese in eine Non-SQL-Datenbank geschrieben.