# Howard University PBL Workshop Computational Warmup Exercises
Zach Moon

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import interpolate
import statsmodels.api as sm
import xarray as xr

In [2]:
plt.rcParams.update({
    "figure.autolayout": True,
    "axes.grid": True
})

# `figure.autolayout=True` has issues in the Edge browser (pre-Chromium version):
# with `%matplotlib notebook`, parts of the fig are cut off (in the nb HTML output too),
# though this corrects itself if you drag the fig. It works fine in Firefox.

#plt.style.use("seaborn")

%matplotlib notebook

## Exercise 1 (HDF file import and data manipulation)

This is the (public) Google Drive folder where the data file we are using, `Ret_677_20200227v0.9.h5`, can be found:  
https://drive.google.com/drive/folders/1vYg636p0bmPYrlub1YuFguIY9mw98D4D



### Load the data

> 1) assess the various data structures in the HDF file

In [3]:
h5_name = "Ret_677_20200227v0.9.h5"

dset0 = xr.open_dataset(h5_name)

dset0

In [4]:
# set the coords to be what they are supposed to be
# should rename the coords with meaningful names. add legend variable?
# AOD var includes cloudiness impact too (change to just OD)

# rename dims
dset = dset0.rename_dims({
    "phony_dim_0": "wavelength",
    "phony_dim_1": "time"
})

# set coordinate variables
dset["wavelength"] = np.r_[414, 500, 614, 674, 869]#.astype(float)
dset.wavelength.attrs.update({"units": "nm"})

# construct datetime
# the time variable is in days since "1900-01-00", i.e. day 1 is 1900-01-01 (counting starts at 0, not 1)
t_elapsed = pd.to_timedelta(dset.MFRTime.values, unit="days")
t = pd.Timestamp("1900/01/01") - pd.Timedelta(days=1) + t_elapsed
dset["time"] = t

# rename AOD to OD (like Whiteman does with the loading)
dset = dset.rename({"AOD": "OD"})

dset
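
The date arithmetic in the cell above can be checked on a few hand-picked values. This is only a sketch: the helper name `days_to_timestamps` and the sample day counts are made up for illustration, and it assumes (as the cell above does) that a value of 1 corresponds to 1900-01-01.

```python
import numpy as np
import pandas as pd

def days_to_timestamps(days):
    """Convert 'days since 1900-01-00' to pandas timestamps (hypothetical helper)."""
    # Day 1 is 1900-01-01, so the effective epoch is one day earlier (1899-12-31).
    epoch = pd.Timestamp("1900-01-01") - pd.Timedelta(days=1)
    return epoch + pd.to_timedelta(np.asarray(days, dtype=float), unit="D")

# made-up sample values: whole and fractional days
stamps = days_to_timestamps([1.0, 1.5, 2.0])
print(stamps)
```

Fractional days map to times of day, so 1.5 lands at noon on 1900-01-01.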

> 2) import the aerosol/cloud optical thicknesses measured at 414, 500, 614, 673 and 869 nm

In [5]:
plt.figure()
dset.MFRVolts.plot.line(x="time");


In [6]:
plt.figure()
dset.OD.plot.line(x="time");


In [7]:
plt.figure()
dset.plot.scatter(x="OD", y="CloudFlag", marker=".");


^ we can see that there is one erroneous point that probably should have `CloudFlag = 0`

### Aerosol-only optical depth using cloud mask

> 3) use the cloud mask to select just the aerosol optical depths
> 4) plot up the AODs with a legend to distinguish the plots

In [8]:
plt.figure()

# new variable "AOD" as "OD" with cloud not flagged
dset["AOD"] = dset.OD.where(dset.CloudFlag == 0) 

lines = dset.AOD.plot.line(x="time");
plt.ylim(ymin=0, ymax=0.14)  # to match Whiteman's figure

# we get a big spike if we leave as a line plot
plt.setp(lines, ls="", marker=".", ms=3);
# ^ but for some reason making this change shifts the plot area too
    
# note: `dset.AOD.where(dset.CloudFlag == 0).plot.line(x="phony_dim_1", marker="*", linestyle="")`
# also works but doesn't auto-hide the point at ~ 1.5
# xr.plot.scatter is for plotting multiple variables against each other

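
The masking step above relies on `.where`, which keeps values where the condition holds and fills the rest with NaN, so masked points simply drop out of the plot. A minimal sketch of the same idea with pandas (the optical depth and flag values here are made up, not taken from the data file):

```python
import numpy as np
import pandas as pd

# made-up optical depths and cloud flags (0 = not flagged as cloud)
od = pd.Series([0.08, 0.09, 1.50, 0.10, 0.95])
cloud_flag = pd.Series([0, 0, 1, 0, 1])

# keep OD where no cloud is flagged; everything else becomes NaN
aod = od.where(cloud_flag == 0)
print(aod)
```

`xarray`'s `DataArray.where` behaves analogously, additionally broadcasting the flag along the wavelength dimension.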

### Competition entry

Remaking the plot *starting from the loading step* in as few chars as possible.

Whiteman's entry for this (234 chars) doesn't have axis labels or a legend but does use the `MFRTime` variable as x.

In IPython we have a list of the input cells called `In`, which we can use to count the characters:

In [9]:
def count_chars_previous_cell():
    from IPython.display import display, Markdown
    
    # !important: only works if the nb is executed in order
    # e.g., restart and run all cells
    previous_In = In[-2]  # a special IPython list
    c_ns = len([c for c in previous_In if not c.isspace()])  # exclude spaces
        
    # `;` at end is just to suppress output, not strictly part of the code
    c = c_ns - 1 if previous_In.rstrip()[-1] == ";" else c_ns

    # `plt.figure()` is only necessary in this setting 
    # because xarray.plot default behavior is to plot in fig if it exists
    # in other settings it would not be necessary
    len_newfig = len("plt.figure()")

    display(Markdown(
        f"non-space char count: **{c_ns}**\n\n"
        f"excluding IPython output suppression: **{c}**\n\n"
        f"excluding fresh fig command: **{c-len_newfig}**"
    ))
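
The counting logic itself does not need IPython. A sketch that takes the cell source as a plain string (the function name `count_nonspace` is made up here, and the string is a copy of one of the cells below):

```python
def count_nonspace(source):
    """Return (non-whitespace char count, count excluding a trailing `;`)."""
    n = sum(not ch.isspace() for ch in source)
    # A trailing `;` only suppresses notebook output, so optionally discount it.
    if source.rstrip().endswith(";"):
        return n, n - 1
    return n, n

cell = 'plt.figure()\ndset.OD.where(dset.CloudFlag == 0).plot.line(x="time");'
total, without_semicolon = count_nonspace(cell)
print(total, without_semicolon)  # 65 64
```

Using `endswith` rather than indexing the last character also avoids an `IndexError` on an empty cell.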

**93 chars:** (but not time as x)

In [10]:
d = xr.open_dataset(h5_name)
plt.figure()
d.AOD.where(d.CloudFlag == 0).plot.line(x="phony_dim_1");


In [11]:
count_chars_previous_cell()

non-space char count: **93**

excluding IPython output suppression: **92**

excluding fresh fig command: **80**

**120 chars:** (with time as x)

In [12]:
d, cf, t = xr.open_dataset(h5_name)[["AOD", "CloudFlag", "MFRTime"]].to_array()
plt.figure()
plt.plot(t.T, d.where(cf == 0).T, ".");


In [13]:
count_chars_previous_cell()

non-space char count: **120**

excluding IPython output suppression: **119**

excluding fresh fig command: **107**

We can get a nicer plot with a smaller char count using the `dset` that already has its coordinate variables set.

**65 chars:**

In [14]:
plt.figure()
dset.OD.where(dset.CloudFlag == 0).plot.line(x="time");


In [15]:
count_chars_previous_cell()

non-space char count: **65**

excluding IPython output suppression: **64**

excluding fresh fig command: **52**

### Ångström coefficient

> 5) compute the Ångström coefficient between the 414 and 869 nm wavelengths

Possible extension: show solving the equation with SymPy, like Whiteman has done with Wolfram.
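
The two-wavelength form comes from assuming a power law $\tau(\lambda) = \beta \lambda^{-\alpha}$ and taking the ratio at two wavelengths, which eliminates $\beta$. A minimal numerical check of that formula on synthetic values (the `alpha_true` and `beta` numbers are arbitrary assumptions, not derived from the data):

```python
import numpy as np

def angstrom_exponent(tau1, tau2, lam1, lam2):
    """Two-wavelength Ångström exponent, assuming tau = beta * lam**(-alpha)."""
    return -np.log(tau1 / tau2) / np.log(lam1 / lam2)

alpha_true = 1.3   # arbitrary value to recover
beta = 0.05        # arbitrary turbidity coefficient
lam = np.array([414.0, 869.0])       # nm; only the ratio matters
tau = beta * (lam / 1000.0) ** (-alpha_true)

alpha = angstrom_exponent(tau[0], tau[1], lam[0], lam[1])
print(alpha)
```

Because only the wavelength ratio enters, the result does not depend on whether the wavelengths are given in nm or µm.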

In [16]:
Ang414869 = -np.log(dset.AOD.sel(wavelength=414) / dset.AOD.sel(wavelength=869)) / np.log(414/869)
plt.figure()
Ang414869.plot();


## Exercise 2 (ASCII and netCDF radiosonde data file import and manipulation)


RS41 file is in this public Dropbox folder:  
https://www.dropbox.com/sh/4j2rvwbdbr0bgcw/AADrK7PCjInM0E7dfThEvtnBa/RS41?dl=0&lst=&subfolder_nav_tracking=1

RS92:  
https://www.dropbox.com/sh/4j2rvwbdbr0bgcw/AADYhzLJ9WsAWtSLB83X4Tlwa/RS92?dl=0&lst=&subfolder_nav_tracking=1

GRUAN-processed RS92:  
https://www.dropbox.com/sh/4j2rvwbdbr0bgcw/AAD7XhFPeyOol84fmu86Gw4pa/GRUAN?dl=0&lst=&subfolder_nav_tracking=1







> 4) calculate water vapor mixing ratio using the Hyland and Wexler (1983) definition of saturation vapor pressure

> 5) compare some measurements between the RS92 and the GRUAN-processed RS92


### Load and inspect the data

> 1) view the ascii files to discern the formats

> 2) import the data

In [17]:
varnames = ["time", "temp", "rh", "height", "press", "mixrat"]
columns = [0, 2, 3, 6, 7, 9]
# ^ same in the RS92 file

rs41 = pd.read_table("HUBV_RS41SGP_20200123_064641UT.mw41.dat", 
                     names=varnames, skiprows=40, sep=r"\s+", usecols=columns)
rs41

| | time | temp | rh | height | press | mixrat |
|---|---|---|---|---|---|---|
| 0 | 0.00 | 268.37 | 85.00 | 52.3 | 1023.19 | 2.22 |
| 1 | 0.81 | 268.46 | 83.38 | 54.5 | 1022.90 | 2.19 |
| 2 | 1.81 | 268.74 | 81.17 | 59.8 | 1022.22 | 2.18 |
| 3 | 2.81 | 269.39 | 77.76 | 65.8 | 1021.44 | 2.20 |
| 4 | 3.81 | 270.30 | 73.57 | 72.1 | 1020.61 | 2.23 |
| ... | ... | ... | ... | ... | ... | ... |
| 5301 | 5300.81 | 221.39 | 0.59 | 32725.2 | 7.27 | 0.03 |
| 5302 | 5301.81 | 221.38 | 0.59 | 32729.0 | 7.26 | 0.03 |
| 5303 | 5302.81 | 221.38 | 0.59 | 32741.6 | 7.25 | 0.03 |
| 5304 | 5303.81 | 221.37 | 0.59 | 32746.3 | 7.25 | 0.03 |


In [18]:
rs92 = pd.read_table("ALVICE_RS92SGP_20200123_064640UT.mw41.dat", 
                     names=varnames, skiprows=40, sep=r"\s+", usecols=columns)
rs92

| | time | temp | rh | height | press | mixrat |
|---|---|---|---|---|---|---|
| 0 | 0.00 | 268.37 | 85.00 | 53.0 | 1023.19 | 2.2200 |
| 1 | 1.45 | 268.70 | 83.90 | 57.7 | 1022.58 | 2.2500 |
| 2 | 2.45 | 269.12 | 82.90 | 60.9 | 1022.16 | 2.2900 |
| 3 | 3.45 | 269.77 | 81.20 | 65.4 | 1021.58 | 2.3600 |
| 4 | 4.45 | 270.51 | 79.00 | 71.4 | 1020.81 | 2.4300 |
| ... | ... | ... | ... | ... | ... | ... |
| 5305 | 5305.63 | 221.37 | 1.02 | 32585.7 | 7.44 | 0.0433 |
| 5306 | 5306.63 | 221.40 | 1.03 | 32588.3 | 7.44 | 0.0437 |
| 5307 | 5307.63 | 221.42 | 1.04 | 32591.0 | 7.43 | 0.0443 |
| 5308 | 5308.63 | 221.42 | 1.06 | 32593.4 | 7.43 | 0.0452 |


In [19]:
gruan_all = xr.open_dataset("BEL-RS-01_2_RS92-GDP_002_20200123T064600_1-003-001.nc")
gruan_all

In [20]:
# pull out the variables we are going to use
gruan = gruan_all[["time", "temp", "rh", "geopot", "press", "WVMR"]]
# ^ could rename to have the same names used for the others
gruan.variables

Frozen({'time': <xarray.IndexVariable 'time' (time: 5249)>
array(['2020-01-23T06:46:40.000000000', '2020-01-23T06:46:41.000000000',
       '2020-01-23T06:46:42.000000000', ..., '2020-01-23T08:15:04.175781250',
       '2020-01-23T08:15:05.175781250', '2020-01-23T08:15:06.175781250'],
      dtype='datetime64[ns]')
Attributes:
    standard_name:    time
    long_name:        Time
    g_format_type:    FLT
    g_format_format:  F8.1
    g_format_width:   8
    g_format_nan:     NaN
    g_source_desc:    FRAWPTU
    g_column_type:    original data
    g_resolution:     1.0 s (time)
    axis:             T, 'temp': <xarray.Variable (time: 5249)>
array([268.31528, 268.6145 , 269.3061 , ..., 221.3588 , 221.35834, 221.41803],
      dtype=float32)
Attributes:
    standard_name:      air_temperature
    units:              K
    long_name:          Temperature
    g_format_type:      FLT
    g_format_format:    F6.2
    g_format_width:     6
    g_format_nan:       NaN
    g_processing_flag:  raw

### Sonde comparison

> 3) compare some measurements between the RS92 and RS41

In [21]:
datasets = {
    "RS41": rs41,
    "RS92": rs92,
    #"GRUAN": gruan,
}

plt.figure()

for dsname, ds in datasets.items():
    plt.plot(ds["rh"], ds["time"], label=dsname)

plt.xlabel("RH (%)")
plt.ylabel("time (s)")

plt.legend();


In [22]:
# interpolate so we can regress
y_ = rs92["rh"]  # has its own time values, not same as the RS41
f = interpolate.interp1d(rs92["time"], rs92["rh"])  # x, y
y = f(rs41["time"])

x = rs41["rh"]
X = sm.add_constant(x)

mod = sm.OLS(y, X)
res = mod.fit()
res.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.996 |
| Model: | OLS | Adj. R-squared: | 0.996 |
| Method: | Least Squares | F-statistic: | 1183000.0 |
| Date: | Sat, 30 May 2020 | Prob (F-statistic): | 0.0 |
| Time: | 13:41:04 | Log-Likelihood: | -8664.1 |
| No. Observations: | 5306 | AIC: | 17330.0 |
| Df Residuals: | 5304 | BIC: | 17350.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.2420 | 0.021 | 11.786 | 0.000 | 0.202 | 0.282 |
| rh | 0.9995 | 0.001 | 1087.553 | 0.000 | 0.998 | 1.001 |

| | | | |
|---|---|---|---|
| Omnibus: | 4153.454 | Durbin-Watson: | 0.039 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 162545.907 |
| Skew: | 3.408 | Prob(JB): | 0.0 |
| Kurtosis: | 29.245 | Cond. No. | 27.0 |
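
The two steps in the cell above (put both series on a common time grid, then regress one on the other) can be sketched with NumPy alone. This is not the notebook's method: `np.interp` and `np.polyfit` stand in for `interpolate.interp1d` and `sm.OLS`, and the two "sondes" are synthetic signals with a known linear relation (slope 0.98, offset 1.0) chosen for illustration.

```python
import numpy as np

def truth(t):
    """Made-up RH-like signal (%), used to generate both synthetic records."""
    return 50.0 + 30.0 * np.sin(t / 15.0)

t_a = np.arange(0.0, 100.0, 1.0)     # sample times of sonde A (s)
t_b = np.linspace(0.0, 100.0, 90)    # sonde B samples at different times

rh_a = truth(t_a)
rh_b = 0.98 * truth(t_b) + 1.0       # B differs from A by a known linear relation

# put B on A's time grid (linear interpolation), then fit rh_b = slope * rh_a + intercept
rh_b_on_a = np.interp(t_a, t_b, rh_b)
slope, intercept = np.polyfit(rh_a, rh_b_on_a, 1)
print(slope, intercept)
```

One difference to keep in mind: `np.interp` silently clamps outside the input range, whereas `interp1d` raises by default, so the target times should lie within the source record.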


In [23]:
#sm.graphics.plot_fit(res, 1);  # auto plot doesn't show the params

In [24]:
# manual plot
p = res.params  # 'const' and 'rh'

plt.figure()

plt.plot([0, 100], [0, 100], ":", c="0.5")

xfit = np.r_[0, 100]
fitstr = f"{p.rh:.6f} $x$ + {p.const:.6f}"
plt.plot(xfit, xfit*p.rh + p.const, c="r", label=f"fit\n{fitstr}")

plt.plot(x, y, '.', ms=3, alpha=0.8)

plt.legend();


### Water vapor mixing ratio

$$
\text{water vapor mixing ratio [g/kg]} = 10 \, \mathrm{RH} \, \cdot \varepsilon \frac{e_s}{p - \frac{\mathrm{RH}}{100} e_s}
$$

where $\varepsilon \approx 0.622$ is the ratio of water vapor molecular weight to dry air molecular weight.
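
As a quick check on where the factor of 10 comes from (a sketch, assuming RH in %, $p$ and $e_s$ in the same pressure units, and writing $e$ for the actual vapor pressure and $w$ for the mixing ratio, symbols not used elsewhere in this notebook):

$$
e = \frac{\mathrm{RH}}{100} e_s, \qquad
w \, \text{[kg/kg]} = \varepsilon \frac{e}{p - e}
$$

$$
w \, \text{[g/kg]} = 1000 \, \varepsilon \frac{\frac{\mathrm{RH}}{100} e_s}{p - \frac{\mathrm{RH}}{100} e_s}
= 10 \, \mathrm{RH} \, \cdot \varepsilon \frac{e_s}{p - \frac{\mathrm{RH}}{100} e_s}
$$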

In [25]:
def e_sat_Wexler(TempK):
    """Wexler formulation for saturation vapor pressure (over liquid).

    `TempK` is the temperature in K; the result is in hPa.
    """
    #TempK = TempC + 273.15

    # the exponent is ln(e_s) with e_s in Pa
    return np.exp(
        -2.9912729e3 * TempK**-2
        - 6.0170128e3 * TempK**-1
        + 1.887643854e1
        - 2.8354721e-2 * TempK
        + 1.7838301e-5 * TempK**2
        - 8.4150417e-10 * TempK**3
        + 4.4412543e-13 * TempK**4
        + 2.858487 * np.log(TempK)
    ) * 10. / 1000.  # Pa -> hPa (net factor of 1/100)

def mrw_Wexler(p, TempK, RH):
    """Water vapor mixing ratio (g/kg) using Wexler saturation VP over liquid.

    `p` is the pressure in hPa, `TempK` the temperature in K, and `RH` the relative humidity in %.
    """
    e_s = e_sat_Wexler(TempK)

    return 10 * RH * 0.62197 * e_s / (
        p - RH/100 * e_s
    )
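
A quick sanity check on the coefficients and units: at 273.15 K the saturation vapor pressure over liquid water should come out at roughly 6.11 hPa, and it should increase with temperature. The sketch below repeats the formulation under a different, made-up name (`e_sat_wexler_hpa`) so that it runs on its own.

```python
import numpy as np

def e_sat_wexler_hpa(temp_k):
    """Saturation vapor pressure over liquid (hPa); same coefficients as above."""
    ln_es_pa = (
        -2.9912729e3 * temp_k**-2
        - 6.0170128e3 * temp_k**-1
        + 1.887643854e1
        - 2.8354721e-2 * temp_k
        + 1.7838301e-5 * temp_k**2
        - 8.4150417e-10 * temp_k**3
        + 4.4412543e-13 * temp_k**4
        + 2.858487 * np.log(temp_k)
    )
    return np.exp(ln_es_pa) / 100.0  # Pa -> hPa

e0 = e_sat_wexler_hpa(273.15)
es = e_sat_wexler_hpa(np.array([253.15, 273.15, 293.15]))
print(e0, es)
```

If the value at 273.15 K were off by a factor of 10 or 100, that would point to a unit mismatch with the sonde pressure (hPa) rather than a problem with the formulation.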

#### Manual Wexler calculation vs Vaisala

Compare our water vapor mixing ratio calculation with the one reported by the Vaisala sonde. 

In [26]:
mrw_manual = mrw_Wexler(rs92["press"], rs92["temp"], rs92["rh"])

z = rs92["height"] / 1000  # m -> km

fig, [ax1, ax2] = plt.subplots(1, 2, sharey=True, figsize=(9, 4.5))

ax1.plot(rs92["mixrat"], z, label="RS92-Ori")
ax1.plot(mrw_manual, z, c="orange", label="RS92-Wex")
ax1.set_xscale("log")
#ax1.legend()

ax2.plot(mrw_manual - rs92["mixrat"], z, lw=0.5, c="green", label="RS92-Wex $-$ RS92-Ori")

ax1.set_xlabel("Water vapor mixing ratio (g/kg)")
ax1.set_ylabel("Altitude (km)")
#ax2.set_xlabel("")

fig.legend(loc="upper center");
fig.tight_layout();


> Exercise for the student!! Play around with the different vapor pressure
formulations and see if you can resolve the discrepancies shown in the plot
above. You may want to plot the data up differently to illustrate the
discrepancy between the two mixing ratio calculations.

We could find which formulation minimizes the difference with respect to GRUAN.
* To really get the best match, we would probably want to use an over-ice formulation at lower temperatures. But when is it safe to assume the water would be frozen? (Could ask Jerry.)

#### Vaisala vs GRUAN

Compare the standard RS92 file with the one generated using the GRUAN data processing.

In [27]:

# gruan RH is in [0,1] not %
# gruan WVMR is in kg/kg not g/kg

fig, [[ax1, ax2], [ax3, ax4]] = plt.subplots(2, 2, sharey=True, figsize=(9, 7))

l1, = ax1.plot(rs92["mixrat"], rs92["height"]/1000, label="RS92")
l2, = ax1.plot(gruan["WVMR"]*1000, gruan["geopot"]/1000, c="orange", label="RS92-G")
ax1.set_xscale("log")

#ax2.plot(rs92["mixrat"] - gruan["WVMR"], z, lw=0.5, c="green", label="RS92 $-$ RS92-G")
ax3.plot(rs92["rh"], rs92["height"]/1000, label="RS92")
ax3.plot(gruan["rh"]*100, gruan["geopot"]/1000, c="orange", label="RS92-G")
ax3.set_xscale("log")

# interpolate one of them so we can subtract
# neglect any geopotential vs actual height difference that may exist here

# gruan time is loaded as datetime
t_gruan_elapsed = (gruan["time"] - gruan["time"][0])/np.timedelta64(1, "s")

f = interpolate.interp1d(rs92["time"], rs92["rh"])  # x, y
rs92_rh_new = f(t_gruan_elapsed)
f = interpolate.interp1d(rs92["time"], rs92["mixrat"])
rs92_mixrat_new = f(t_gruan_elapsed)

l3, = ax2.plot(rs92_mixrat_new - gruan["WVMR"]*1000, gruan["geopot"]/1000, "g", lw=0.5, 
         label="RS92 $-$ RS92-G")

ax4.plot(rs92_rh_new - gruan["rh"]*100, gruan["geopot"]/1000, "g", lw=0.5)


ax1.set_xlabel("Water vapor mixing ratio (g/kg)")
ax3.set_xlabel("RH (%)")

for ax in [ax1, ax3]:
    ax.set_ylabel("Altitude (km)")

fig.legend(loc="upper center", handles=[l1, l2, l3]);


⬆ The dry bias correction applied by GRUAN is clear when looking at the log plot of water vapor mixing ratio. From the difference plot, it seems to be most pronounced in the lower troposphere.

> Exercise for the student! Present the data in different ways to quantify the
magnitude of the dry bias correction. How much is the correction a function of
RH? How much is it a function of temperature?
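
One possible starting point for this exercise is to bin the RS92 minus GRUAN difference by RH (or temperature) and look at the per-bin statistics. The sketch below uses entirely synthetic data: the RH values and the assumed RH-dependent "correction" are invented to show the mechanics only, and say nothing about the real dry bias.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# synthetic stand-ins: RH (%) and a difference that grows with RH (pure assumption)
rh = rng.uniform(0, 100, 2000)
diff = -0.05 * rh + rng.normal(0, 0.5, rh.size)
df = pd.DataFrame({"rh": rh, "diff": diff})

# 10 %-wide RH bins, then mean / spread / sample count of the difference per bin
bins = np.arange(0, 101, 10)
binned = (
    df.groupby(pd.cut(df["rh"], bins), observed=True)["diff"]
    .agg(["mean", "std", "count"])
)
print(binned)
```

With the real data, `rh` and `diff` would be replaced by the GRUAN RH and the interpolated difference from the cell above; the same pattern works for temperature bins.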