# Real world data - TAWES

## Tawes weather data example

TAWES weather data for Vienna city center from 1990 to 2025 is in `Messstationen_Tagesdaten_v2_Datensatz_19900101_20250515.csv`.
CSV (comma-separated values) files can be read with pandas, as can many other file formats.

In [1]:
import pandas
df = pandas.read_csv('Messstationen_Tagesdaten_v2_Datensatz_19900101_20250515.csv')
df.head()

Unnamed: 0,time,station,tlmax,tlmin,tl_mittel,rf_mittel
0,1990-01-01T00:00+00:00,5925,-0.1,-1.6,-0.9,
1,1990-01-02T00:00+00:00,5925,1.4,-2.3,-0.5,
2,1990-01-03T00:00+00:00,5925,,-0.7,0.4,66.0
3,1990-01-04T00:00+00:00,5925,1.2,,-2.2,78.0
4,1990-01-05T00:00+00:00,5925,-1.0,-4.6,-2.8,85.0


tl_mittel is the daily mean air temperature, tlmin and tlmax are the temperature extremes of that day, and rf_mittel is the daily mean relative humidity.
Some values are missing, however. pandas marks them with NaN, which is short for Not a Number: whenever a cell in an otherwise numeric column has no value, pandas simply fills it with this NaN indicator.
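
Before dropping anything, it can help to count how many values are missing in each column. The following is a minimal sketch that rebuilds the first five rows shown above from an inline string (so it runs without the CSV file); `sample` and `missing_per_column` are names chosen for this illustration.

```python
import io
import pandas

# First five rows of the dataset, copied from the output above.
csv_text = """time,station,tlmax,tlmin,tl_mittel,rf_mittel
1990-01-01T00:00+00:00,5925,-0.1,-1.6,-0.9,
1990-01-02T00:00+00:00,5925,1.4,-2.3,-0.5,
1990-01-03T00:00+00:00,5925,,-0.7,0.4,66.0
1990-01-04T00:00+00:00,5925,1.2,,-2.2,78.0
1990-01-05T00:00+00:00,5925,-1.0,-4.6,-2.8,85.0
"""
sample = pandas.read_csv(io.StringIO(csv_text))

# isna() marks every missing cell with True; summing counts them per column.
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

Empty fields in the CSV text are read as NaN, which is the behaviour described above.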

So how do we get rid of those NaNs?


In [2]:
df.dropna().head()

Unnamed: 0,time,station,tlmax,tlmin,tl_mittel,rf_mittel
4,1990-01-05T00:00+00:00,5925,-1.0,-4.6,-2.8,85.0
5,1990-01-06T00:00+00:00,5925,-3.4,-5.7,-4.6,78.0
6,1990-01-07T00:00+00:00,5925,-5.2,-8.5,-6.9,90.0
7,1990-01-08T00:00+00:00,5925,-5.4,-8.0,-6.7,92.0
8,1990-01-09T00:00+00:00,5925,-1.7,-8.7,-5.2,81.0


However, this drops every row that contains a NaN in any column. clgo was not available until later (2002), so we lose a lot of data. We want to keep at least the rows where all temperature values are available:

In [3]:
df.dropna(subset=['tl_mittel', 'tlmax', 'tlmin']).head()

Unnamed: 0,time,station,tlmax,tlmin,tl_mittel,rf_mittel
0,1990-01-01T00:00+00:00,5925,-0.1,-1.6,-0.9,
1,1990-01-02T00:00+00:00,5925,1.4,-2.3,-0.5,
4,1990-01-05T00:00+00:00,5925,-1.0,-4.6,-2.8,85.0
5,1990-01-06T00:00+00:00,5925,-3.4,-5.7,-4.6,78.0
6,1990-01-07T00:00+00:00,5925,-5.2,-8.5,-6.9,90.0


In [4]:
df


Unnamed: 0,time,station,tlmax,tlmin,tl_mittel,rf_mittel
0,1990-01-01T00:00+00:00,5925,-0.1,-1.6,-0.9,
1,1990-01-02T00:00+00:00,5925,1.4,-2.3,-0.5,
2,1990-01-03T00:00+00:00,5925,,-0.7,0.4,66.0
3,1990-01-04T00:00+00:00,5925,1.2,,-2.2,78.0
4,1990-01-05T00:00+00:00,5925,-1.0,-4.6,-2.8,85.0
...,...,...,...,...,...,...
12914,2025-05-11T00:00+00:00,5925,18.0,10.2,14.1,45.0
12915,2025-05-12T00:00+00:00,5925,18.7,9.1,13.9,36.0
12916,2025-05-13T00:00+00:00,5925,19.5,9.5,14.5,34.0
12917,2025-05-14T00:00+00:00,5925,23.9,8.2,16.1,30.0
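
The output above still contains the rows with missing values, because `dropna` returns a new DataFrame and leaves the original untouched. A small sketch of that behaviour, using a three-row stand-in frame (the names `frame`, `cleaned` and `with_tlmin` are assumptions made for this example):

```python
import pandas

# Tiny stand-in frame with one missing tlmax value.
frame = pandas.DataFrame({
    'tlmax': [1.4, None, -1.0],
    'tlmin': [-2.3, -0.7, -4.6],
})

cleaned = frame.dropna()                      # new frame without incomplete rows
with_tlmin = frame.dropna(subset=['tlmin'])   # only tlmin has to be present

# The original frame still has all three rows.
print(len(frame), len(cleaned), len(with_tlmin))
```

To keep a cleaned version, assign the result to a name, for example `df = df.dropna(subset=[...])`.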


## Accessing columns

Columns can be accessed by name, one at a time or several at once, much like the keys of a dictionary:

In [5]:
df['tl_mittel']

0        -0.9
1        -0.5
2         0.4
3        -2.2
4        -2.8
         ... 
12914    14.1
12915    13.9
12916    14.5
12917    16.1
12918     NaN
Name: tl_mittel, Length: 12919, dtype: float64

In [6]:
df[['tl_mittel', 'tlmin', 'tlmax']]

Unnamed: 0,tl_mittel,tlmin,tlmax
0,-0.9,-1.6,-0.1
1,-0.5,-2.3,1.4
2,0.4,-0.7,
3,-2.2,,1.2
4,-2.8,-4.6,-1.0
...,...,...,...
12914,14.1,10.2,18.0
12915,13.9,9.1,18.7
12916,14.5,9.5,19.5
12917,16.1,8.2,23.9
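
The two forms of access return different types. A short sketch on a two-row stand-in frame (values taken from the first rows above; the variable names are chosen for illustration):

```python
import pandas

frame = pandas.DataFrame({
    'tlmax': [-0.1, 1.4],
    'tlmin': [-1.6, -2.3],
    'tl_mittel': [-0.9, -0.5],
})

one_column = frame['tl_mittel']            # a single label gives a Series
two_columns = frame[['tlmin', 'tlmax']]    # a list of labels gives a DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
print(list(two_columns.columns))           # columns come back in the requested order
```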


## Basic maths operations

Let's do some basic operations. Make a new column with the temperature differential tlmax - tlmin.

In [7]:
df['tl_diff'] = df['tlmax'] - df['tlmin']
df.head()

Unnamed: 0,time,station,tlmax,tlmin,tl_mittel,rf_mittel,tl_diff
0,1990-01-01T00:00+00:00,5925,-0.1,-1.6,-0.9,,1.5
1,1990-01-02T00:00+00:00,5925,1.4,-2.3,-0.5,,3.7
2,1990-01-03T00:00+00:00,5925,,-0.7,0.4,66.0,
3,1990-01-04T00:00+00:00,5925,1.2,,-2.2,78.0,
4,1990-01-05T00:00+00:00,5925,-1.0,-4.6,-2.8,85.0,3.6
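
Arithmetic on columns works row by row, and a missing input produces a missing result, as rows 2 and 3 above show. A minimal sketch on a stand-in frame; the `tl_mid` column is a hypothetical quantity introduced only to show an operation with a scalar:

```python
import pandas

frame = pandas.DataFrame({
    'tlmax': [-0.1, None, -1.0],
    'tlmin': [-1.6, -0.7, -4.6],
})

# Column minus column: evaluated separately for every row.
frame['tl_diff'] = frame['tlmax'] - frame['tlmin']

# Dividing by a scalar applies the same operation to every row.
frame['tl_mid'] = (frame['tlmax'] + frame['tlmin']) / 2

print(frame)
```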


**Exercise**: Create a new column named 'freezing' that contains True if the minimum temperature was below 0 and False otherwise.

**Exercise**: Instead of True and False, this freezing column should contain 1 and 0 (1 if True, 0 if False).

## Basic stats

Get some basic statistics on the data using describe().

In [8]:
df.describe()

Unnamed: 0,station,tlmax,tlmin,tl_mittel,rf_mittel,tl_diff
count,12919.0,12917.0,12917.0,12918.0,12916.0,12916.0
mean,5925.0,16.314973,9.104529,12.730276,67.103825,7.210855
std,0.0,9.588808,7.411942,8.402692,14.342467,3.437491
min,5925.0,-10.0,-15.4,-12.7,19.0,0.5
25%,5925.0,8.6,3.2,6.0,57.0,4.4
50%,5925.0,16.5,9.3,13.0,67.0,6.9
75%,5925.0,24.1,15.2,19.6,78.0,9.7
max,5925.0,39.5,26.9,32.3,100.0,18.5
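
The count row differs between columns because missing values are left out of the statistics. The individual numbers are also available as separate methods; the sketch below uses a five-value stand-in column with one gap:

```python
import pandas

frame = pandas.DataFrame({'tlmin': [-1.6, -2.3, -0.7, None, -4.6]})

n_values = frame['tlmin'].count()     # number of non-missing values
lowest = frame['tlmin'].min()         # NaN is skipped by default
middle = frame['tlmin'].median()      # the 50% row of describe()

print(n_values, lowest, middle)
```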


## Conditional slicing, finding and counting occurrences

When was the coldest day in Vienna? We can see above that tlmin had a lowest value of -15.4, but when?
We can use conditions that evaluate to true or false (like the freezing one above) as indexers.

In [9]:
df['tlmin'] == -15.4

0        False
1        False
2        False
3        False
4        False
         ...  
12914    False
12915    False
12916    False
12917    False
12918    False
Name: tlmin, Length: 12919, dtype: bool

In [10]:
df[df['tlmin'] == -15.4]


Unnamed: 0,time,station,tlmax,tlmin,tl_mittel,rf_mittel,tl_diff
2553,1996-12-28T00:00+00:00,5925,-10.0,-15.4,-12.7,66.0,5.4
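
An alternative that avoids copying the number from the describe() output is `idxmin`, which returns the index label of the smallest value. This is only a sketch: the coldest day is taken from the result above, while the two neighbouring rows are made-up values.

```python
import pandas

frame = pandas.DataFrame({
    'time': ['1996-12-27', '1996-12-28', '1996-12-29'],
    'tlmin': [-12.0, -15.4, -11.0],   # first and last value are illustrative
})

coldest_label = frame['tlmin'].idxmin()   # index label of the smallest value
coldest_row = frame.loc[coldest_label]    # the full row at that label

print(coldest_row['time'])
```

If several days share the minimum, `idxmin` reports only the first one, whereas the boolean condition returns all of them.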


We can use this to select whole ranges of data where some condition applies. E.g. select all data where it was freezing.

In [11]:
df['freezing'] = (df['tlmin'] < 0).astype(int)
df[df['freezing'] == 1]

Unnamed: 0,time,station,tlmax,tlmin,tl_mittel,rf_mittel,tl_diff,freezing
0,1990-01-01T00:00+00:00,5925,-0.1,-1.6,-0.9,,1.5,1
1,1990-01-02T00:00+00:00,5925,1.4,-2.3,-0.5,,3.7,1
2,1990-01-03T00:00+00:00,5925,,-0.7,0.4,66.0,,1
4,1990-01-05T00:00+00:00,5925,-1.0,-4.6,-2.8,85.0,3.6,1
5,1990-01-06T00:00+00:00,5925,-3.4,-5.7,-4.6,78.0,2.3,1
...,...,...,...,...,...,...,...,...
12835,2025-02-21T00:00+00:00,5925,4.5,-0.9,1.8,46.0,5.4,1
12836,2025-02-22T00:00+00:00,5925,7.8,-0.8,3.5,46.0,8.6,1
12837,2025-02-23T00:00+00:00,5925,4.6,-0.6,2.0,64.0,5.2,1
12860,2025-03-18T00:00+00:00,5925,6.4,-0.4,3.0,42.0,6.8,1
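
Several conditions can be combined into one indexer. The sketch below selects days with frost at night and temperatures above zero during the day, using a four-row stand-in frame (`frost_and_thaw` is a name chosen for this example):

```python
import pandas

frame = pandas.DataFrame({
    'tlmax': [-0.1, 1.4, -1.0, 4.5],
    'tlmin': [-1.6, -2.3, -4.6, -0.9],
})

# Each comparison needs its own parentheses.
# & means "and", | means "or", ~ means "not".
frost_and_thaw = frame[(frame['tlmin'] < 0) & (frame['tlmax'] > 0)]

print(frost_and_thaw)
```

Python's `and` and `or` keywords do not work here, because they cannot be applied element by element.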


We can use value_counts to see how often each value occurs in a column.

In [12]:
df['freezing'].value_counts()

freezing
0    11433
1     1486
Name: count, dtype: int64
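
When relative frequencies are more useful than absolute counts, `value_counts` accepts `normalize=True`. A short sketch on an eight-value stand-in series:

```python
import pandas

flags = pandas.Series([1, 0, 0, 0, 1, 0, 0, 0], name='freezing')

counts = flags.value_counts()                  # absolute counts per value
shares = flags.value_counts(normalize=True)    # fractions that sum to 1

print(counts)
print(shares)
```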

**Exercise**: On how many days was the temperature below freezing all day (look at tlmax), and what percentage of all days does this represent?

## Time indexed DataFrames

Pandas supports datetime indices. To use one, convert the time column to datetime values and set it as the index:

In [13]:
df['time'] = pandas.to_datetime(df['time'], utc=True)
df = df.set_index('time', drop=True)

In [14]:
df.head()


time,station,tlmax,tlmin,tl_mittel,rf_mittel,tl_diff,freezing
1990-01-01 00:00:00+00:00,5925,-0.1,-1.6,-0.9,,1.5,1
1990-01-02 00:00:00+00:00,5925,1.4,-2.3,-0.5,,3.7,1
1990-01-03 00:00:00+00:00,5925,,-0.7,0.4,66.0,,1
1990-01-04 00:00:00+00:00,5925,1.2,,-2.2,78.0,,0
1990-01-05 00:00:00+00:00,5925,-1.0,-4.6,-2.8,85.0,3.6,1


Now we can index rows based on times.

In [15]:
df['2011-03-01 00:00': '2011-03-02 00:00']

time,station,tlmax,tlmin,tl_mittel,rf_mittel,tl_diff,freezing
2011-03-01 00:00:00+00:00,5925,5.2,-0.8,2.2,71.0,6.0,1
2011-03-02 00:00:00+00:00,5925,4.3,-2.3,1.0,56.0,6.6,1
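
With a datetime index, a partial date string is enough to select a whole month or year. The following sketch assumes a four-row stand-in frame; the two March rows use the tlmin values shown above, the other two are made up.

```python
import pandas

frame = pandas.DataFrame({
    'time': ['2011-02-28T00:00+00:00', '2011-03-01T00:00+00:00',
             '2011-03-02T00:00+00:00', '2011-04-01T00:00+00:00'],
    'tlmin': [-1.0, -0.8, -2.3, 3.0],   # first and last value are illustrative
})
frame['time'] = pandas.to_datetime(frame['time'], utc=True)
frame = frame.set_index('time')

march = frame.loc['2011-03']      # every row in March 2011
year_2011 = frame.loc['2011']     # every row in 2011

print(len(march), len(year_2011))
```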


We can now resample the dataframe to a different time resolution. Resampling to a lower frequency is called downsampling; resampling to a higher frequency is called upsampling.

When resampling, one has to specify a frequency and a method.

For example, yearly averages:


In [16]:
df.resample('1YS').mean() 

time,station,tlmax,tlmin,tl_mittel,rf_mittel,tl_diff,freezing
1990-01-01 00:00:00+00:00,5925.0,16.130769,8.649725,12.371781,68.749311,7.496419,0.082192
1991-01-01 00:00:00+00:00,5925.0,14.65726,7.773425,11.234247,71.876712,6.883836,0.156164
1992-01-01 00:00:00+00:00,5925.0,16.285246,9.139891,12.737978,68.994536,7.145355,0.076503
1993-01-01 00:00:00+00:00,5925.0,15.391233,8.088767,11.75863,71.391781,7.302466,0.186301
1994-01-01 00:00:00+00:00,5925.0,16.803836,9.529863,13.190959,68.819178,7.273973,0.079452
1995-01-01 00:00:00+00:00,5925.0,15.191233,8.569863,11.899178,69.772603,6.62137,0.134247
1996-01-01 00:00:00+00:00,5925.0,13.544262,7.230874,10.403825,72.661202,6.313388,0.210383
1997-01-01 00:00:00+00:00,5925.0,15.089863,8.15726,11.643562,70.273973,6.932603,0.128767
1998-01-01 00:00:00+00:00,5925.0,15.849315,8.727397,12.309589,69.282192,7.121918,0.161644
1999-01-01 00:00:00+00:00,5925.0,15.66411,8.950685,12.332055,73.534247,6.713425,0.123288


We can also resample to a higher frequency than the original data.

For example upsampling to hourly frequency while using linear interpolation.

In [17]:
small_df = df['2011-03-01 00:00': '2011-03-07 00:00'].copy() 
small_df.resample('1h').interpolate()


time,station,tlmax,tlmin,tl_mittel,rf_mittel,tl_diff,freezing
2011-03-01 00:00:00+00:00,5925.0,5.200000,-0.800000,2.2000,71.000000,6.000000,1.000000
2011-03-01 01:00:00+00:00,5925.0,5.162500,-0.862500,2.1500,70.375000,6.025000,1.000000
2011-03-01 02:00:00+00:00,5925.0,5.125000,-0.925000,2.1000,69.750000,6.050000,1.000000
2011-03-01 03:00:00+00:00,5925.0,5.087500,-0.987500,2.0500,69.125000,6.075000,1.000000
2011-03-01 04:00:00+00:00,5925.0,5.050000,-1.050000,2.0000,68.500000,6.100000,1.000000
...,...,...,...,...,...,...,...
2011-03-06 20:00:00+00:00,5925.0,4.533333,-0.533333,2.0500,44.666667,5.066667,0.833333
2011-03-06 21:00:00+00:00,5925.0,4.450000,-0.675000,1.9375,44.500000,5.125000,0.875000
2011-03-06 22:00:00+00:00,5925.0,4.366667,-0.816667,1.8250,44.333333,5.183333,0.916667
2011-03-06 23:00:00+00:00,5925.0,4.283333,-0.958333,1.7125,44.166667,5.241667,0.958333
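
Linear interpolation is applied to every column, which is why the freezing flag above takes fractional values such as 0.833333. For flag-like columns, forward filling may be the better choice. A minimal sketch comparing both methods on a two-day stand-in frame (the tl_mittel values come from the data above, the freezing flags are made up):

```python
import pandas

index = pandas.date_range('2011-03-01', periods=2, freq='D', tz='UTC')
frame = pandas.DataFrame({
    'tl_mittel': [2.2, 1.0],
    'freezing': [1, 0],      # illustrative flags
}, index=index)

linear = frame.resample('12h').interpolate()   # straight line between known points
held = frame.resample('12h').ffill()           # repeat the last known value

print(linear)
print(held)
```

In the interpolated frame the noon value of freezing lies halfway between 1 and 0, while the forward-filled frame keeps the flag at 1 until the next measurement.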


**Exercise**: Resample the 'freezing' column to yearly frequency providing not the mean (as in the examples above) but the sum within each year.