# Dataset

The dataset used is ERA-40, available from http://apps.ecmwf.int/datasets/data/era40-daily/levtype=pl/. It can be downloaded as either GRIB or NetCDF. Select Temperature and Relative Humidity at the 1000 hPa pressure level (taken as surface level) and both wind components at 700 hPa, as described in the thesis, for the date range 01.01.2000 to 31.08.2002. This example uses only the 12:00 time step, but it could be extended to daily averages or similar.

=> https://drive.google.com/drive/folders/0B_wueX1dv4FsQ0t6Rm9fYTRUWGs?usp=sharing
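
The extension to daily averages mentioned above could look roughly like the following sketch. It assumes that several time steps per day were downloaded instead of only 12:00. The frame `sample` and its values are hypothetical stand-ins, not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for a download with several time steps per day
# (four steps, six hours apart, over two days at a single grid point).
times = pd.date_range("2000-01-01", periods=8, freq=pd.Timedelta(hours=6))
sample = pd.DataFrame(
    {"temperature": [270.0, 272.0, 276.0, 274.0, 271.0, 273.0, 277.0, 275.0]},
    index=pd.Index(times, name="time"),
)

# Group all time steps that fall on the same calendar day and average them.
daily_mean = sample.groupby(sample.index.normalize()).mean()
print(daily_mean)
```

On the full dataframe the grouping would also have to include the spatial index levels, so that values are averaged per grid point and not across the whole grid.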

# Dependencies

- http://xarray.pydata.org/en/stable/index.html => library that can handle NetCDF and other multidimensional formats
- **conda install matplotlib seaborn numpy pandas xarray dask netCDF4 bottleneck**

In [1]:
import numpy as np
import pandas as pd
import xarray as xr
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# Exploratory Data Analysis

In [2]:
# read in the dataset with xarray

data = xr.open_dataset('data/era40_2000-2002.cf')
data

<xarray.Dataset>
Dimensions:    (latitude: 73, level: 2, longitude: 144, time: 974)
Coordinates:
  * longitude  (longitude) float32 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 ...
  * latitude   (latitude) float32 90.0 87.5 85.0 82.5 80.0 77.5 75.0 72.5 ...
  * level      (level) int32 1000 700
  * time       (time) datetime64[ns] 2000-01-01T12:00:00 2000-01-02T12:00:00 ...
Data variables:
    t          (time, level, latitude, longitude) float64 263.7 263.7 263.7 ...
    r          (time, level, latitude, longitude) float64 86.13 86.13 86.13 ...
    u          (time, level, latitude, longitude) float64 nan nan nan nan ...
    v          (time, level, latitude, longitude) float64 nan nan nan nan ...
Attributes:
    Conventions:  CF-1.6
    history:      2017-08-03 18:30:14 GMT by grib_to_netcdf-2.4.0: grib_to_ne...

In [3]:
# look at the head of the dataset
# the dataframe needs about 700MB in RAM

df = data.to_dataframe()
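# rename by variable name so the mapping does not depend on column order
df = df.rename(columns={'t': 'temperature', 'r': 'relative_humidity',
                        'u': 'wind_u', 'v': 'wind_v'})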

df.head()

latitude,level,longitude,time,temperature,relative_humidity,wind_u,wind_v
90.0,1000,0.0,2000-01-01 12:00:00,263.738445,86.127481,,
90.0,1000,0.0,2000-01-02 12:00:00,267.382147,87.96868,,
90.0,1000,0.0,2000-01-03 12:00:00,264.258055,95.606734,,
90.0,1000,0.0,2000-01-04 12:00:00,262.919618,89.579965,,
90.0,1000,0.0,2000-01-05 12:00:00,260.786484,92.862839,,


In [4]:
# detailed information about the data ranges and types
# latitude from 90° to -90°
# longitude from 0° to 357.5°
# measurements from 01.01.2000 to 31.08.2002 at 12:00
# => TODO: take the average of the entire day?

df.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 20477376 entries, (90.0, 1000, 0.0, 2000-01-01 12:00:00) to (-90.0, 700, 357.5, 2002-08-31 12:00:00)
Data columns (total 4 columns):
temperature          float64
relative_humidity    float64
wind_u               float64
wind_v               float64
dtypes: float64(4)
memory usage: 742.1 MB


In [5]:
# calculate some summary statistics on the dataframe

df.describe().round(2)

,temperature,relative_humidity,wind_u,wind_v
count,10238688.0,10238688.0,10238688.0,10238688.0
mean,281.25,78.48,3.1,-0.0
std,17.52,17.6,9.0,6.6
min,220.81,-9.77,-46.23,-44.07
25%,271.04,72.81,-3.4,-3.55
50%,283.85,82.54,1.89,-0.04
75%,296.67,89.97,8.5,3.5
max,326.23,113.73,50.32,46.23
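
The summary reports relative humidity between -9.77 and 113.73, which lies outside the physical range of 0 to 100 percent. Whether such values should be clipped is an analysis decision that the source does not state. The sketch below only shows how they could be counted and limited with `Series.clip`, using a small hypothetical sample built from the values in the table.

```python
import pandas as pd

# Hypothetical sample including the extreme values reported by describe().
rh = pd.Series([-9.77, 72.81, 82.54, 113.73], name="relative_humidity")

# Count values outside the physical range, then limit them to [0, 100].
out_of_range = int(((rh < 0) | (rh > 100)).sum())
rh_clipped = rh.clip(lower=0, upper=100)

print(out_of_range)
print(rh_clipped.tolist())
```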


# Extraction of Relevant Geographical Area

- 62.5°-97.5° E, 5.0°-40.0° N => 15*15 points
- Daily surface temperature (T, 1000hPa) => already done in data selection
- Daily relative humidity (rh, 1000hPa) => already done in data selection
- Wind (700hPa) => already done in data selection
- http://www.learner.org/jnorth/tm/LongitudeIntro.html
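
The dataset stores longitude from 0° to 357.5° counted eastward, so the eastern bounds 62.5° and 97.5° can be used as they are. For a region given with negative (western) longitudes, a conversion along the following lines would be needed first. The helper name `to_0_360` is made up for this sketch.

```python
def to_0_360(lon_deg):
    """Map a longitude in [-180, 180] to the 0..360 east-positive convention."""
    return lon_deg % 360.0


print(to_0_360(62.5))    # eastern longitudes are unchanged
print(to_0_360(-75.0))   # 75° W
```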

In [7]:
# extract rows between a latitude of 5° and 40°
latitudes = ((df.index.get_level_values(0) <= 40) & (df.index.get_level_values(0) >= 5))
df_lat = df[latitudes]

# extract rows between a longitude of 62.5° and 97.5°
longitudes = ((df_lat.index.get_level_values(2) <= 97.5) & (df_lat.index.get_level_values(2) >= 62.5))
df_lon = df_lat[longitudes]

df_lon.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 438300 entries, (40.0, 1000, 62.5, 2000-01-01 12:00:00) to (5.0, 700, 97.5, 2002-08-31 12:00:00)
Data columns (total 4 columns):
temperature          219150 non-null float64
relative_humidity    219150 non-null float64
wind_u               219150 non-null float64
wind_v               219150 non-null float64
dtypes: float64(4)
memory usage: 15.9 MB
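
As a sanity check, the number of entries can be derived from the grid itself. The sketch below assumes the 2.5° spacing visible in the coordinate listing of the dataset and the 974 time steps and 2 pressure levels from its dimensions. The variable names are illustrative only.

```python
import numpy as np

# Grid spacing of 2.5° is taken from the coordinate listing of the dataset.
step = 2.5
lats = np.arange(5.0, 40.0 + step, step)
lons = np.arange(62.5, 97.5 + step, step)

n_times = 974   # length of the time dimension
n_levels = 2    # 1000 hPa and 700 hPa

expected_rows = len(lats) * len(lons) * n_times * n_levels
print(len(lats), len(lons), expected_rows)
```

The result agrees with the 438300 entries reported by `df_lon.info()`. Each variable is present on only one of the two levels, which explains the 219150 non-null values per column.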


In [8]:
df_lon.head()

latitude,level,longitude,time,temperature,relative_humidity,wind_u,wind_v
40.0,1000,62.5,2000-01-01 12:00:00,281.892605,59.615724,,
40.0,1000,62.5,2000-01-02 12:00:00,282.280301,51.099472,,
40.0,1000,62.5,2000-01-03 12:00:00,282.830476,43.745984,,
40.0,1000,62.5,2000-01-04 12:00:00,283.749043,51.348232,,
40.0,1000,62.5,2000-01-05 12:00:00,281.641648,65.565228,,
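
An alternative, not used in this notebook, is to cut out the region in xarray before converting to a dataframe, which would avoid building the full 700MB frame. The sketch below runs on a small synthetic dataset with the same coordinate layout. The names `ds` and `region` and the zero-filled variable are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Small synthetic dataset with the same coordinate layout as the file above
# (descending latitude, 0..357.5 longitude, two pressure levels).
lat = np.arange(90.0, -92.5, -2.5)
lon = np.arange(0.0, 360.0, 2.5)
time = pd.date_range("2000-01-01 12:00", periods=3)
level = [1000, 700]
shape = (len(time), len(level), len(lat), len(lon))
ds = xr.Dataset(
    {"t": (("time", "level", "latitude", "longitude"), np.zeros(shape))},
    coords={"time": time, "level": level, "latitude": lat, "longitude": lon},
)

# Latitude is stored from north to south, so the slice runs from 40 down to 5.
region = ds.sel(latitude=slice(40.0, 5.0), longitude=slice(62.5, 97.5))
print(region["t"].shape)
```

Label-based slices in `sel` include both end points, so the selection keeps 15 latitudes and 15 longitudes, the same grid as the boolean masks above.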


# Restructuring the Dataset

In [10]:
# split up the dataset according to pressure levels
# 700 or 1000 is implicit depending on the type of column
# => might as well drop it from the index

df_wind = df_lon[df_lon.index.get_level_values(1) == 700]
df_other = df_lon[df_lon.index.get_level_values(1) == 1000]

# drop the pressure level index
df_wind.index = df_wind.index.droplevel(level=1)
df_other.index = df_other.index.droplevel(level=1)

# drop columns where everything is NaN
df_wind = df_wind.dropna(axis=1, how='all')
df_other = df_other.dropna(axis=1, how='all')

df_wind.head()

latitude,longitude,time,wind_u,wind_v
40.0,62.5,2000-01-01 12:00:00,7.268903,-2.407981
40.0,62.5,2000-01-02 12:00:00,3.442582,1.517888
40.0,62.5,2000-01-03 12:00:00,5.256291,3.569702
40.0,62.5,2000-01-04 12:00:00,2.536464,5.580177
40.0,62.5,2000-01-05 12:00:00,6.482128,0.790314
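
The two wind components can be combined into a speed and a direction if that is more convenient for later analysis. The source does not say whether the thesis does this, so the following is only a sketch. It uses two rows from the table above plus one hypothetical pure westerly, and the column names `speed` and `direction` are made up here.

```python
import numpy as np
import pandas as pd

# Two rows copied from the wind table above plus one hypothetical pure westerly.
wind = pd.DataFrame({"wind_u": [7.268903, 3.442582, 10.0],
                     "wind_v": [-2.407981, 1.517888, 0.0]})

# Speed is the length of the (u, v) vector.
wind["speed"] = np.hypot(wind["wind_u"], wind["wind_v"])

# Meteorological direction: where the wind comes from, in degrees clockwise
# from north (270 means a wind blowing from the west).
angle = np.degrees(np.arctan2(wind["wind_u"], wind["wind_v"]))
wind["direction"] = (180.0 + angle) % 360.0

print(wind.round(2))
```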


In [11]:
# rejoin the split dataframes to get a single time series index

df_rejoin = df_wind.join(df_other)
df_rejoin.head()

latitude,longitude,time,wind_u,wind_v,temperature,relative_humidity
40.0,62.5,2000-01-01 12:00:00,7.268903,-2.407981,281.892605,59.615724
40.0,62.5,2000-01-02 12:00:00,3.442582,1.517888,282.280301,51.099472
40.0,62.5,2000-01-03 12:00:00,5.256291,3.569702,282.830476,43.745984
40.0,62.5,2000-01-04 12:00:00,2.536464,5.580177,283.749043,51.348232
40.0,62.5,2000-01-05 12:00:00,6.482128,0.790314,281.641648,65.565228
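
The joined frame is still indexed by latitude, longitude and time. If a layout with one row per time step is wanted, the spatial levels can be moved into the columns with `unstack`. Whether the later analysis needs this layout is an assumption. The miniature frame `small` below is hypothetical.

```python
import pandas as pd

# Hypothetical miniature of the joined frame: 2 grid points x 2 days.
index = pd.MultiIndex.from_product(
    [[40.0], [62.5, 65.0],
     pd.to_datetime(["2000-01-01 12:00", "2000-01-02 12:00"])],
    names=["latitude", "longitude", "time"],
)
small = pd.DataFrame({"temperature": [281.9, 282.3, 280.1, 280.7]}, index=index)

# Move the spatial levels into the columns so that each row is one time step.
wide = small.unstack(["latitude", "longitude"])
print(wide.shape)
```

Each column of `wide` is then addressed by a tuple of variable, latitude and longitude, for example `wide[("temperature", 40.0, 62.5)]`.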
