<h1>10. NetCDF I/O</h1>
<h2>10/26/2020</h2>

<h2>10.0 Last Time...</h2>
<ul>
    <li>Statistics that are not resistant to outliers include the mean, the standard deviation, skewness, and kurtosis.</li>
    <li>Statistics that are resistant to outliers include the median and the interquartile range.</li>
    <li>By making use of dictionaries, we can create versatile, non-hard-coded programs!</li>
    <li>glob is a module that lets us list all the files in a directory that match a pattern.</li>
</ul>

<h2>10.1 Structure of a NetCDF File</h2>

NetCDF is a commonly used file format in our field because it enables the storage of data as well as the storage of its <b>metadata</b>. Every major language in our field is equipped to deal with NetCDF files, and Python is no exception!

There are <b>four</b> parameter types in a NetCDF file:
<ul>
    <li><b>Global attributes:</b> strings that describe the file as a whole: for example, a title, who created it, what standards it follows.</li>
    <li><b>Variables:</b> entities that hold data, which includes the data, the domain the data is defined on (dimensionality), and metadata about the data (for example, units).</li>
    <li><b>Variable attributes:</b> actual storage of the data's metadata.</li>
    <li><b>Dimensions:</b> not only define the domain, but also might have values of their own (for example, latitude values, longitude values, altitude values, etc.).</li>
</ul>

As an example, you might have a time series of surface temperature for a latitude-longitude grid. The dimensions for that dataset would be <b>lat</b>, <b>lon</b>, and <b>time</b>. The dimension <b>lat</b> tells you the number of elements along the latitude axis, and a variable of the same name can hold the latitude values themselves; likewise for <b>lon</b> and <b>time</b>. Finally, you might have a variable containing temperatures called <b>Ts</b> that would be 3-D, with dimensions of <b>lat</b>, <b>lon</b>, and <b>time</b>.
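One way to picture these four pieces is as nested Python dictionaries. The sketch below is only a mental model, not how SciPy actually stores a file; every name and value in it (<b>toy_file</b>, <b>shape_of</b>, the attribute strings, the grid sizes) is made up for illustration.

```python
# A toy, dictionary-based picture of a NetCDF file (illustrative only).
toy_file = {
    # Global attributes describe the file as a whole.
    "global_attributes": {"title": "Toy surface temperature file"},
    # Dimensions map a name to a length.
    "dimensions": {"time": 2, "lat": 3, "lon": 4},
    # Each variable holds data, the dimensions it is defined on,
    # and its own attributes (metadata such as units).
    "variables": {
        "lat": {
            "dimensions": ("lat",),
            "attributes": {"units": "degrees_north"},
            "data": [90.0, 87.5, 85.0],
        },
        "Ts": {
            "dimensions": ("time", "lat", "lon"),
            "attributes": {"units": "K"},
            "data": [[[273.0] * 4 for _ in range(3)] for _ in range(2)],
        },
    },
}


def shape_of(file_dict, var_name):
    """Look up a variable's shape from the lengths of its dimensions."""
    dims = file_dict["variables"][var_name]["dimensions"]
    return tuple(file_dict["dimensions"][d] for d in dims)


print(shape_of(toy_file, "lat"))
print(shape_of(toy_file, "Ts"))
```

Notice that the shape of a variable is never stored twice: it follows from the dimensions the variable is defined on.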

There are several packages that can read NetCDF files; today we're going to use SciPy. It's not necessarily the best, but it is one of the easiest to learn, and SciPy is a useful package for many other reasons.

<h2>10.2 Reading a NetCDF File</h2>

For NetCDF files, we're interested in the I/O functionality of SciPy, so we'll call that part of the package and assign it to an alias.

In [None]:
# Import the I/O functionality of SciPy.

import scipy.io as S

Creating a file object works much like it did in our I/O lecture.

In [None]:
# Read in the provided NetCDF file in read-only mode.

fileobj = S.netcdf_file('air.mon.mean.nc',mode='r')
print(fileobj)

NetCDF file objects have several <b>attributes</b> that we can call on (more on this when we explore object-oriented programming). One of those attributes is called <b>variables</b>, which is a dictionary. The keys of the dictionary are strings corresponding to the names of the variables, and the values are a special kind of object called <b>variable objects</b> that contain the variable's values as well as any metadata (units, etc.).

Another NetCDF file object attribute is <b>dimensions</b>, which is another dictionary. The keys of this dictionary are strings that are the names of the dimensions, and the values are the lengths of the dimensions.
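To make the two dictionaries concrete without depending on the course data file, here is a minimal sketch that builds a tiny file in a temporary directory and then inspects it. The file name <b>tiny.nc</b> and its contents are hypothetical stand-ins, and the writing calls are the ones covered in Section 10.3.

```python
import os
import tempfile

import numpy as np
from scipy.io import netcdf_file

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "tiny.nc")  # hypothetical file name

    # Build a tiny file with one dimension and one variable.
    with netcdf_file(path, mode="w") as f:
        f.createDimension("lat", 3)
        lat_var = f.createVariable("lat", "f", ("lat",))
        lat_var[:] = np.array([90.0, 87.5, 85.0])
        lat_var.units = "degrees_north"

    # Reopen it and look at the two dictionaries.
    # mmap=False reads the data into memory so the file closes cleanly.
    with netcdf_file(path, mode="r", mmap=False) as f:
        dim_lengths = dict(f.dimensions)      # name -> length
        var_names = list(f.variables.keys())  # names of the variable objects
        lat_values = f.variables["lat"][:].copy()

print(dim_lengths)
print(var_names)
print(lat_values)
```

The keys of both dictionaries are plain strings, so you can loop over them just like any other dictionary.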

Let's see an example from our .nc file: a grid of monthly-mean temperature values.

In [None]:
# Let's import NumPy and SciPy's I/O functionality.

import numpy as np
import scipy.io as S

# Now, create a file object!

fileobj = S.netcdf_file('air.mon.mean.nc',mode='r')

# First, let's find out what information's in the title.

print(fileobj.title)

In [None]:
# Now let's explore the dimensions dictionary.

print(fileobj.dimensions)

# Time is set to 'None' because that's this file's
# 'unlimited' dimension; you can keep adding new times to it
# and it will use the same lat/lon grid.
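
If you want to see where that None comes from, the sketch below creates a small file with its own unlimited dimension. It assumes that passing None as the length to createDimension marks the dimension as unlimited; the file name <b>unlimited.nc</b> and the three time values are made up for this example.

```python
import os
import tempfile

import numpy as np
from scipy.io import netcdf_file

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "unlimited.nc")  # hypothetical file name

    with netcdf_file(path, mode="w") as f:
        # A length of None marks 'time' as the unlimited (record) dimension.
        f.createDimension("time", None)
        time_var = f.createVariable("time", "d", ("time",))
        # Assigning three values grows the record dimension to three entries.
        time_var[:] = np.array([0.0, 744.0, 1440.0])

    with netcdf_file(path, mode="r", mmap=False) as f:
        time_length_entry = f.dimensions["time"]    # what the dictionary stores
        time_values = f.variables["time"][:].copy()  # the data actually on disk

print(time_length_entry)
print(time_values)
```

The dimensions dictionary reports None for the unlimited dimension; to find out how many time steps are present, check the shape of a variable defined on it.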

In [None]:
# And let's see what kinds of variables are inside.

print(fileobj.variables)

In [None]:
# Okay, that's all very messy!
# Now it's time to grab the values of air temperature.

air_temp = fileobj.variables['air']
print(air_temp)

In [None]:
# We can now examine this data more carefully.

print(air_temp.units)
print(air_temp.shape)
print(air_temp[:])

In [None]:
# Let's be a little more restrained... how about the value at the first time step and the first lat/lon pair?

print(air_temp[0,0,0])

In [None]:
# That seems pretty chilly! What are the lat/lon values?

lat = fileobj.variables['lat']
lon = fileobj.variables['lon']

print(lat[0])
print(lon[0])

In [None]:
# Okay, that's reasonable for the North Pole.
# This grid is 2.5 degrees... let's find out what Seattle's weather was like!

# Seattle is at approximately 47.5 N and 122.5 W.

# This dataset's lon starts at 0 and counts up to 360.
# So 122.5 W corresponds to 360-122.5 = 237.5.

# Remember, lat and lon are those weird data types:
print(lat)
print(lon)
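
The longitude arithmetic in the comments above is easy to get wrong by hand, so here is a small sketch of it as two helper functions. The function names are hypothetical; they are not part of NumPy or SciPy.

```python
def west_to_east_longitude(lon_west):
    """Convert degrees west (0 to 180) to degrees east on a 0-360 grid."""
    return (360.0 - lon_west) % 360.0


def signed_to_east_longitude(lon_signed):
    """Convert a signed longitude (-180 to 180, west negative) to 0-360."""
    return lon_signed % 360.0


# Seattle, written both ways.
print(west_to_east_longitude(122.5))
print(signed_to_east_longitude(-122.5))
```

The modulo operator takes care of the wrap-around, so longitudes that are already east of Greenwich pass through unchanged.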

In [None]:
# So let's save their values instead.
a = lat[:]
b = lon[:]
print(a)
print(b)
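
The difference between a variable object and the array you get by slicing it can be checked directly. The sketch below uses a small throwaway file (the name <b>latdemo.nc</b> and its values are hypothetical) and opens it with mmap=False, an assumption made here so that the data is read into memory and stays valid after the file is closed.

```python
import os
import tempfile

import numpy as np
from scipy.io import netcdf_file

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "latdemo.nc")  # hypothetical file name

    with netcdf_file(path, mode="w") as f:
        f.createDimension("lat", 3)
        v = f.createVariable("lat", "f", ("lat",))
        v[:] = np.array([90.0, 87.5, 85.0])
        v.units = "degrees_north"

    with netcdf_file(path, mode="r", mmap=False) as f:
        lat_obj = f.variables["lat"]             # the variable object
        obj_type_name = type(lat_obj).__name__
        lat_array = lat_obj[:].copy()            # a plain NumPy array
        raw_units = lat_obj.units                # metadata lives on the object
        # String attributes may come back as bytes, so decode if needed.
        lat_units = raw_units.decode() if isinstance(raw_units, bytes) else raw_units

print(obj_type_name)
print(type(lat_array).__name__)
print(lat_units)
```

The array carries the numbers but not the metadata, so grab attributes such as units from the variable object before you let go of it.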

In [None]:
# Now we can use np.where() to find the locations of the values in the dataset.

seattle_lat = np.where(a == 47.5)
seattle_lon = np.where(b == 237.5)

print(seattle_lat)
print(seattle_lon)
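
Testing floats with == only works when the target sits exactly on a grid point. A more forgiving approach is to look for the nearest grid point instead. The sketch below assumes a 2.5-degree grid running from 90 N to 90 S and from 0 to 357.5 E, which is a stand-in for the real coordinate arrays; <b>nearest_index</b> is a hypothetical helper.

```python
import numpy as np

# Assumed grid layout (a stand-in for the file's lat and lon values).
lat_values = np.arange(90.0, -92.5, -2.5)
lon_values = np.arange(0.0, 360.0, 2.5)


def nearest_index(grid, target):
    """Return the integer index of the grid value closest to target."""
    return int(np.argmin(np.abs(grid - target)))


lat_idx = nearest_index(lat_values, 47.5)
lon_idx = nearest_index(lon_values, 237.5)
print(lat_idx, lat_values[lat_idx])
print(lon_idx, lon_values[lon_idx])

# A point that is not exactly on the grid still finds its nearest neighbor.
off_grid_idx = nearest_index(lat_values, 47.6)
print(off_grid_idx)
```

Because the helper returns a plain integer, the result can be used directly as an index into the data array.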

In [None]:
# Let's print it out!

# np.where() returns a tuple of index arrays, so pull out the integer index first.
print(air_temp[0,seattle_lat[0][0],seattle_lon[0][0]])

In [None]:
# If you want the full name of a particular variable, long_name is useful!

print(fileobj.variables['air'].long_name)

<h2>10.3 Writing a NetCDF File</h2>

Just as with normal files, we can write our own NetCDF files!

To do so, create a NetCDF file object in write mode.

In [1]:
import numpy as np
import scipy.io as S

new_file = S.netcdf_file('new.nc',mode='w')

# Let's start by putting in 10 latitude and 20 longitude values.
lat = np.arange(10)
lon = np.arange(20)

# And maybe two different sets of data, one array and one scalar.
data1 = np.reshape(np.sin(np.arange(200)*0.1),(10,20))
data2 = 42.0

print(data1)
print(data2)

[[ 0.          0.09983342  0.19866933  0.29552021  0.38941834  0.47942554
   0.56464247  0.64421769  0.71735609  0.78332691  0.84147098  0.89120736
   0.93203909  0.96355819  0.98544973  0.99749499  0.9995736   0.99166481
   0.97384763  0.94630009]
 [ 0.90929743  0.86320937  0.8084964   0.74570521  0.67546318  0.59847214
   0.51550137  0.42737988  0.33498815  0.23924933  0.14112001  0.04158066
  -0.05837414 -0.15774569 -0.2555411  -0.35078323 -0.44252044 -0.52983614
  -0.61185789 -0.68776616]
 [-0.7568025  -0.81827711 -0.87157577 -0.91616594 -0.95160207 -0.97753012
  -0.993691   -0.99992326 -0.99616461 -0.98245261 -0.95892427 -0.92581468
  -0.88345466 -0.83226744 -0.77276449 -0.70554033 -0.63126664 -0.55068554
  -0.46460218 -0.37387666]
 [-0.2794155  -0.1821625  -0.0830894   0.0168139   0.1165492   0.21511999
   0.31154136  0.40484992  0.49411335  0.57843976  0.6569866   0.72896904
   0.79366786  0.85043662  0.8987081   0.93799998  0.96791967  0.98816823
   0.99854335  0.99894134]
 [ 0

In [2]:
# So far so good! Let's create the actual dimension information.

new_file.createDimension('lat',len(lat))
new_file.createDimension('lon',len(lon))

In [3]:
# Now create the variables: each needs a name, a type code ('f' is a 32-bit float), and a tuple of dimensions.

lat_var = new_file.createVariable('lat','f',('lat',))
lon_var = new_file.createVariable('lon','f',('lon',))
data1_var = new_file.createVariable('data1','f',('lat','lon'))
data2_var = new_file.createVariable('data2','f',())
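
The second argument to createVariable is a type code, and the choice matters: 'f' stores 32-bit floats, so values are rounded slightly when they are written. Here is a minimal sketch comparing a few type codes; the file name <b>types.nc</b> and the variable names are hypothetical.

```python
import os
import tempfile

import numpy as np
from scipy.io import netcdf_file

values = np.sin(np.arange(4) * 0.1)  # 64-bit floats

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "types.nc")  # hypothetical file name

    with netcdf_file(path, mode="w") as f:
        f.createDimension("x", 4)
        single = f.createVariable("single", "f", ("x",))    # 32-bit float
        double = f.createVariable("double", "d", ("x",))    # 64-bit float
        integer = f.createVariable("integer", "i", ("x",))  # 32-bit integer

        single[:] = values
        double[:] = values
        integer[:] = np.arange(4)

        # Record each variable's type code and item size in bytes.
        sizes = {name: (var.typecode(), var.itemsize())
                 for name, var in f.variables.items()}

        # Compare what was stored with the original 64-bit values.
        single_error = float(np.max(np.abs(np.array(single[:], dtype=float) - values)))
        double_error = float(np.max(np.abs(np.array(double[:], dtype=float) - values)))

print(sizes)
print(single_error)
print(double_error)
```

The 'd' variable keeps the values exactly, while the 'f' variable is off in roughly the eighth decimal place. Small differences like that in the last digits of a read-back array are expected when a variable is stored as 'f'.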

In [4]:
# And now we assign the actual values to our variables!

lat_var[:] = lat[:]
lon_var[:] = lon[:]
data1_var[:,:] = data1[:,:]
data2_var.assignValue(data2)

In [5]:
# And assign some units!

data1_var.units = 'kg'

In [6]:
# Add a title to finish up!

new_file.title = 'New NetCDF File'
new_file.close()
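
Forgetting close() is an easy mistake, and the data may not reach the disk until the file is closed. As an alternative sketch, the file object can be used in a with block, which closes it automatically. The file name <b>with_demo.nc</b>, the variable name <b>field</b>, and the helper <b>as_text</b> are all hypothetical.

```python
import os
import tempfile

import numpy as np
from scipy.io import netcdf_file


def as_text(value):
    """String attributes may be read back as bytes; decode them if so."""
    return value.decode() if isinstance(value, bytes) else value


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "with_demo.nc")  # hypothetical file name

    # The file is flushed and closed when the with block ends.
    with netcdf_file(path, mode="w") as nc:
        nc.title = "Context manager demo"
        nc.createDimension("lat", 2)
        nc.createDimension("lon", 3)
        field = nc.createVariable("field", "f", ("lat", "lon"))
        field[:, :] = np.arange(6).reshape(2, 3)
        field.units = "kg"

    # Read everything back to confirm it was written.
    with netcdf_file(path, mode="r", mmap=False) as nc:
        title_text = as_text(nc.title)
        field_units = as_text(nc.variables["field"].units)
        field_values = nc.variables["field"][:].copy()

print(title_text)
print(field_units)
print(field_values)
```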

Okay, so having done all that (don't worry if all the details are unclear - this is a fairly advanced topic that will take some practice!), let's try reading our values.

In [7]:
import numpy as np
import scipy.io as S

file_object = S.netcdf_file('new.nc',mode='r')
print(file_object.dimensions)
print(file_object.variables)

OrderedDict([('lat', 10), ('lon', 20)])
OrderedDict([('lon', <scipy.io.netcdf.netcdf_variable object at 0x7f3c57edc750>), ('data1', <scipy.io.netcdf.netcdf_variable object at 0x7f3c57edc710>), ('lat', <scipy.io.netcdf.netcdf_variable object at 0x7f3c57edc890>), ('data2', <scipy.io.netcdf.netcdf_variable object at 0x7f3c57edc950>)])


In [8]:
# Let's print some values!

data1 = file_object.variables['data1']

print(data1[:])
print(data1.units)

[[ 0.          0.09983341  0.19866933  0.29552022  0.38941833  0.47942555
   0.5646425   0.64421767  0.7173561   0.7833269   0.84147096  0.89120734
   0.9320391   0.9635582   0.98544973  0.997495    0.9995736   0.9916648
   0.9738476   0.9463001 ]
 [ 0.9092974   0.86320937  0.8084964   0.7457052   0.6754632   0.5984721
   0.5155014   0.42737988  0.33498815  0.23924933  0.14112     0.04158066
  -0.05837414 -0.15774569 -0.25554112 -0.35078323 -0.44252044 -0.5298361
  -0.6118579  -0.68776613]
 [-0.7568025  -0.8182771  -0.8715758  -0.91616595 -0.9516021  -0.9775301
  -0.993691   -0.9999232  -0.9961646  -0.98245263 -0.9589243  -0.9258147
  -0.8834547  -0.83226746 -0.7727645  -0.7055403  -0.63126665 -0.5506855
  -0.46460217 -0.37387666]
 [-0.2794155  -0.18216251 -0.0830894   0.0168139   0.1165492   0.21511999
   0.31154135  0.40484992  0.49411336  0.5784398   0.6569866   0.72896904
   0.79366785  0.8504366   0.8987081   0.93799996  0.96791965  0.98816824
   0.9985433   0.99894136]
 [ 0.98935

<h2>10.4 NetCDF Example</h2>

Let's pull up some monthly mean surface air temperature data from our air.mon.mean.nc data file. These data come from the NCEP/NCAR Reanalysis 1.

In [9]:
import numpy as np
import scipy.io as S

fileobj = S.netcdf_file('air.mon.mean.nc',mode='r')

# Let's take a look at the time units.
time_data = fileobj.variables['time'][:]
print(time_data)
time_units = fileobj.variables['time'].units
print(time_units)

[17067072. 17067816. 17068512. 17069256. 17069976. 17070720. 17071440.
 17072184. 17072928. 17073648. 17074392. 17075112. 17075856. 17076600.
 17077272. 17078016. 17078736. 17079480. 17080200. 17080944. 17081688.
 17082408. 17083152. 17083872. 17084616. 17085360. 17086032. 17086776.
 17087496. 17088240. 17088960. 17089704. 17090448. 17091168. 17091912.
 17092632. 17093376. 17094120. 17094792. 17095536. 17096256. 17097000.
 17097720. 17098464. 17099208. 17099928. 17100672. 17101392. 17102136.
 17102880. 17103576. 17104320. 17105040. 17105784. 17106504. 17107248.
 17107992. 17108712. 17109456. 17110176. 17110920. 17111664. 17112336.
 17113080. 17113800. 17114544. 17115264. 17116008. 17116752. 17117472.
 17118216. 17118936. 17119680. 17120424. 17121096. 17121840. 17122560.
 17123304. 17124024. 17124768. 17125512. 17126232. 17126976. 17127696.
 17128440. 17129184. 17129856. 17130600. 17131320. 17132064. 17132784.
 17133528. 17134272. 17134992. 17135736. 17136456. 17137200. 17137944.
 17138

In [10]:
# Well, that seems confusing.
# Let's create a new version of the file where time just starts at 0.0, and change the units string so it just says 'hours'.

time_data = time_data - np.min(time_data)
time_units = 'hours'
print(time_data)

[     0.    744.   1440.   2184.   2904.   3648.   4368.   5112.   5856.
   6576.   7320.   8040.   8784.   9528.  10200.  10944.  11664.  12408.
  13128.  13872.  14616.  15336.  16080.  16800.  17544.  18288.  18960.
  19704.  20424.  21168.  21888.  22632.  23376.  24096.  24840.  25560.
  26304.  27048.  27720.  28464.  29184.  29928.  30648.  31392.  32136.
  32856.  33600.  34320.  35064.  35808.  36504.  37248.  37968.  38712.
  39432.  40176.  40920.  41640.  42384.  43104.  43848.  44592.  45264.
  46008.  46728.  47472.  48192.  48936.  49680.  50400.  51144.  51864.
  52608.  53352.  54024.  54768.  55488.  56232.  56952.  57696.  58440.
  59160.  59904.  60624.  61368.  62112.  62784.  63528.  64248.  64992.
  65712.  66456.  67200.  67920.  68664.  69384.  70128.  70872.  71568.
  72312.  73032.  73776.  74496.  75240.  75984.  76704.  77448.  78168.
  78912.  79656.  80328.  81072.  81792.  82536.  83256.  84000.  84744.
  85464.  86208.  86928.  87672.  88416.  89088.  8
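
Hours counted from zero are easier to read, and they are also easy to turn into calendar dates. The sketch below assumes that the first record (offset 0 hours) corresponds to 1 January 1948, the start of the NCEP/NCAR Reanalysis 1 record; that reference date is an assumption made here, not something read from the file.

```python
from datetime import datetime, timedelta

# Assumption: offset 0 hours corresponds to 1 January 1948.
reference = datetime(1948, 1, 1)

# The first few hour offsets from the output above.
hour_offsets = [0.0, 744.0, 1440.0, 2184.0]

# Add each offset to the reference date.
dates = [reference + timedelta(hours=h) for h in hour_offsets]

for hours, date in zip(hour_offsets, dates):
    print(hours, date.strftime("%Y-%m-%d"))
```

Under that assumption each offset lands on the first day of a month, which is what we would expect for monthly means: 744 hours is 31 days, and the next step of 696 hours is the 29 days of February in a leap year.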

In [11]:
newfile = S.netcdf_file('newtime.nc',mode='w')
newfile.createDimension('time',np.size(time_data))
time_var = newfile.createVariable('time','d',('time',))
time_var.units = time_units
time_var[:] = time_data[:]
newfile.close()

In [12]:
# And let's test it!

import numpy as np
import scipy.io as S

testfile = S.netcdf_file('newtime.nc')

time_data = testfile.variables['time'][:]
print(time_data)
time_units = testfile.variables['time'].units
print(time_units)

[     0.    744.   1440.   2184.   2904.   3648.   4368.   5112.   5856.
   6576.   7320.   8040.   8784.   9528.  10200.  10944.  11664.  12408.
  13128.  13872.  14616.  15336.  16080.  16800.  17544.  18288.  18960.
  19704.  20424.  21168.  21888.  22632.  23376.  24096.  24840.  25560.
  26304.  27048.  27720.  28464.  29184.  29928.  30648.  31392.  32136.
  32856.  33600.  34320.  35064.  35808.  36504.  37248.  37968.  38712.
  39432.  40176.  40920.  41640.  42384.  43104.  43848.  44592.  45264.
  46008.  46728.  47472.  48192.  48936.  49680.  50400.  51144.  51864.
  52608.  53352.  54024.  54768.  55488.  56232.  56952.  57696.  58440.
  59160.  59904.  60624.  61368.  62112.  62784.  63528.  64248.  64992.
  65712.  66456.  67200.  67920.  68664.  69384.  70128.  70872.  71568.
  72312.  73032.  73776.  74496.  75240.  75984.  76704.  77448.  78168.
  78912.  79656.  80328.  81072.  81792.  82536.  83256.  84000.  84744.
  85464.  86208.  86928.  87672.  88416.  89088.  8

<h2>10.5 Take-Home Points</h2>
<ul>
    <li>NetCDF is a powerful file type containing global attributes, variables, variable attributes, and dimensions.</li>
    <li>We can read from NetCDF files using similar syntax to that for regular files.</li>
    <li>Using attributes such as 'dimensions' and 'variables', we can learn about individual variables in the dataset.</li>
    <li>We can also write to NetCDF files in a similar way.</li>
</ul>