# Some tips with xarray and pandas

- People here are at very different levels
- Set yourself some goals for technical skills to learn!
- If you are just starting with Python --> learn the basics
- If you are comfortable with basic Python --> learn new packages
- If you know all the packages --> work on writing your own software etc.
- If you don't know git and GitHub --> get better at this!


## What are pandas and xarray?
- pandas --> like a spreadsheet: 2D data with rows and columns
- xarray --> like pandas, but in N dimensions
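
As a minimal illustration of the difference, the sketch below builds a small `pandas.DataFrame` and an `xarray.DataArray` with three named dimensions. All names and numbers are made up for the example and are not taken from the course data.

```python
import numpy as np
import pandas as pd
import xarray as xr

# pandas: a 2D table with labelled rows (index) and columns
df_demo = pd.DataFrame(
    {"temperature": [271.3, 272.1, 270.8], "pressure": [1012.0, 1009.5, 1011.2]},
    index=pd.date_range("2011-01-01", periods=3, freq="D"),
)

# xarray: an N-dimensional array with named dimensions and labelled coordinates
da_demo = xr.DataArray(
    np.zeros((3, 2, 4)),
    dims=("time", "lat", "lon"),
    coords={
        "time": pd.date_range("2011-01-01", periods=3, freq="D"),
        "lat": [58.0, 60.0],
        "lon": [8.0, 10.0, 12.0, 14.0],
    },
    name="temperature",
)

print(df_demo.shape)  # (rows, columns)
print(da_demo.dims, da_demo.shape)
```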

In [None]:
!wget 'https://zenodo.org/record/5639504/files/OsloAeroSec2011-3_subset2.nc'

# Some examples with xarray and pandas:

In [None]:
import xarray as xr
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import datetime
import seaborn as sns

# Reading in the data:

In [None]:
path = "OsloAeroSec2011-3_subset2.nc"
ds = xr.open_dataset(path)

##### Opening multiple files:

```python

list_of_files = [
    'file1.nc',
    'file2.nc'
]
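# combine='by_coords' orders the files by their coordinate values;
# concat_dim is only used together with combine='nested'
xr.open_mfdataset(list_of_files, combine='by_coords')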
```
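
`open_mfdataset` needs real files on disk. To see what combining by coordinates does without them, here is a small sketch that uses `xr.combine_by_coords` on two in-memory datasets. The helper `make_part`, the variable name `T` and the dates are hypothetical and only there for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr


def make_part(start):
    """Build a tiny dataset with 3 monthly time steps starting at `start`."""
    time = pd.date_range(start, periods=3, freq="MS")
    return xr.Dataset(
        {"T": (("time",), np.arange(3, dtype=float))},
        coords={"time": time},
    )


part1 = make_part("2011-01-01")
part2 = make_part("2011-04-01")

# The pieces are ordered and concatenated using their coordinate values,
# so the order of the list does not matter.
combined = xr.combine_by_coords([part2, part1])
print(combined["time"].values)
```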

### Check how your dataset looks

#### Different types of information/data:
- Coordinates
- Data variables
- Global attributes
- Variable attributes

In [None]:
ds
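
To make the four kinds of information concrete, the sketch below builds a toy dataset by hand and prints each part separately. The variable `T`, its attributes and the title are invented for the example.

```python
import numpy as np
import xarray as xr

ds_demo = xr.Dataset(
    data_vars={
        # a data variable: (dimensions, values, variable attributes)
        "T": (
            ("lat", "lon"),
            np.full((2, 3), 280.0),
            {"units": "K", "long_name": "Temperature"},
        ),
    },
    coords={"lat": [58.0, 60.0], "lon": [8.0, 10.0, 12.0]},
    attrs={"title": "Toy dataset"},  # global attributes
)

print(list(ds_demo.coords))     # coordinates
print(list(ds_demo.data_vars))  # data variables
print(ds_demo.attrs)            # global attributes
print(ds_demo["T"].attrs)       # variable attributes
```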

## Sometimes we want to do some nice tweaks before we start: 

In [None]:
ds["T"]

Assign attributes! They are nice for plotting and for keeping track of what is in your dataset (xarray looks especially for 'units' and 'standard_name'/'long_name').

In [None]:
ds["T_C"] = ds["T"] - 273.15

# raw string, so that "\c" is not treated as an (invalid) escape sequence
ds["T_C"] = ds["T_C"].assign_attrs({"units": r"$^\circ$C"})

### There are always small things you need to adjust:

In [None]:
ds["time"]

This data in particular has the issue that the date is the end of the month, which gets read as the first of the next month. So I usually just do a quick fix and subtract roughly 15 days (half a month).

In [None]:
t_corrected = pd.to_datetime(ds["time"].values) - datetime.timedelta(days=15)
ds["time"] = t_corrected

In [None]:
ds["time"]
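
The effect of the shift is easy to check on a few hypothetical time stamps (these dates are made up, not read from the file):

```python
import datetime

import pandas as pd

# Hypothetical stamps: end-of-month data read as the first of the next month
t_raw = pd.to_datetime(["2011-02-01", "2011-03-01", "2011-04-01"])
t_shifted = t_raw - datetime.timedelta(days=15)

print(t_shifted)
# After the shift, each stamp falls in the month the data actually describes
print(t_shifted.month.tolist())
```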

### Convert longitude: 
This data comes with longitudes from 0 to 360 degrees, but -180 to 180 is often more convenient, so we can convert:

**NOTE:** Maybe you want to put this in a module? Or a package...

In [None]:
ds

In [None]:
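def convert360_180(_ds):
    """
    Convert longitude from 0-360 to -180-180 degrees.
    """
    # only convert if the longitudes are not already in the -180 to 180 range
    if _ds["lon"].min() >= 0:
        # keep_attrs=True so the longitude attributes survive the arithmetic
        with xr.set_options(keep_attrs=True):
            _ds.coords["lon"] = (_ds["lon"] + 180) % 360 - 180
        # put the longitudes back in ascending order
        _ds = _ds.sortby("lon")
    return _ds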

(You might want to move this to a module!)

In [None]:
ds = convert360_180(ds)

In [None]:
ds["lon"]
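
To see what the conversion formula does, here is a small sketch on a toy 0-360 grid. The dataset and the variable `orig_lon` are invented for the example; the formula is the same one used in `convert360_180`, here written with `assign_coords`.

```python
import numpy as np
import xarray as xr

# Toy dataset on a 0-360 grid; the data values are just the original longitudes
lon_360 = np.array([0.0, 90.0, 180.0, 270.0])
ds_lon = xr.Dataset({"orig_lon": (("lon",), lon_360)}, coords={"lon": lon_360})

# Wrap the longitudes to [-180, 180) and sort them in ascending order
ds_lon = ds_lon.assign_coords(lon=(ds_lon["lon"] + 180) % 360 - 180).sortby("lon")

print(ds_lon["lon"].values)       # new longitudes, ascending
print(ds_lon["orig_lon"].values)  # the data moved along with its coordinate
```

Note that 180 maps to -180, so the converted range is closed on the western end and open on the eastern end.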

Let's pick out only the surface layer.
It's the last one:

# Selecting data and super quick plotting:

xarray loads data only when it needs to (it's lazy, Anne can explain), so it pays to define the subset of data you want to look at early on. That way you don't end up loading a lot of extra data.


##### See [here](http://xarray.pydata.org/en/stable/user-guide/indexing.html) for nice overview

#### isel, sel

In [None]:
ds_s = ds.isel(lev=-1)

In [None]:
ds_s

In [None]:
ds_s["T_C"].isel(time=0).plot()

Notice how the labels use both the attribute "standard_name" and "units" from the dataset. 

In [None]:
ds["T_C"].sel(lev=1000.0, lon=0, method="nearest").plot(x="time")

### Slice:

In [None]:
ds_s["T_C"].sel(lat=slice(0, 90)).isel(time=0).plot()
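
The difference between position-based and label-based selection is easiest to see on a tiny array. This is a sketch with a made-up latitude axis, not the course data:

```python
import numpy as np
import xarray as xr

lat = np.array([-60.0, -30.0, 0.0, 30.0, 60.0])
da_sel = xr.DataArray(np.arange(5.0), dims="lat", coords={"lat": lat})

by_position = da_sel.isel(lat=-1)                 # integer position (last element)
by_label = da_sel.sel(lat=30.0)                   # exact coordinate value
nearest = da_sel.sel(lat=58.9, method="nearest")  # closest grid point (lat=60)
northern = da_sel.sel(lat=slice(0, 90))           # label slice, both ends inclusive

print(float(by_position), float(by_label), float(nearest))
print(northern["lat"].values)
```

A label slice like `slice(0, 90)` assumes the coordinate is stored in ascending order; for a descending latitude axis you would write `slice(90, 0)` instead.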

### Super quick averaging etc

In [None]:
da_T = ds["T_C"]

Mean: 


In [None]:
da_T.mean(["time", "lon"]).plot(ylim=[1000, 100], yscale="log")

Standard deviation

In [None]:
da_T.isel(lev=-1).std(["time"]).plot()

Temperature varies much more over land than over the ocean...

## Seasonal average

In [None]:
month = (ds["time.month"] == 7) | (ds["time.month"] == 8)

In [None]:
ds_sum = ds.where(month).mean("time")

In [None]:
ds_sum

In [None]:
ds_season = ds.groupby("time.season").mean()

In [None]:
ds_season

In [None]:
ds_season["T_C"].isel(lev=-1).plot(col="season")
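
Both approaches (masking with `where` and grouping with `groupby`) can be checked by hand on a toy series where each value equals its month number. The array below is made up for the example, and `.dt.month` is used as an equivalent of the `"time.month"` shorthand:

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2011-01-01", periods=12, freq="MS")
# Each value equals its month number, so the averages are easy to verify
da_month = xr.DataArray(np.arange(1.0, 13.0), dims="time", coords={"time": time})

# Mask: keep July and August; everything else becomes NaN and is skipped by mean()
is_jul_aug = da_month["time"].dt.month.isin([7, 8])
jul_aug_mean = da_month.where(is_jul_aug).mean("time")

# groupby: one average per season (DJF, MAM, JJA, SON)
season_mean = da_month.groupby("time.season").mean()

print(float(jul_aug_mean))
print(season_mean.sel(season="JJA").item())
```

With only one year of data, DJF here mixes January, February and December of the same calendar year, which is worth keeping in mind for short time series.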

## Control the plot visuals:

In [None]:
# let's plot the wind fields
_ds = ds_s[["V", "U"]]
_da = np.sqrt(_ds["V"] ** 2 + _ds["U"] ** 2)

# _da.attrs['long_name'] = 'Wind speed'
# _da.attrs['units'] = 'm/s'

In [None]:
f, ax = plt.subplots(dpi=100)
_dm = _da.isel(time=0)
_dm.plot(cmap=plt.get_cmap("Reds"), ax=ax, cbar_kwargs={"label": "Wind Speed [m/s]"})


_ds = ds_s.isel(time=0, lon=slice(0, None, 2), lat=slice(0, None, 2))
ax.quiver(
    _ds["lon"],
    _ds["lat"],
    _ds["U"],
    _ds["V"],
    scale=300,
)

# ax.set_title('Wind strength and pattern')
# ax.set_xlabel('Longitude [$^\circ$E]')

# Plotting with cartopy

In [None]:
import cartopy as cy
import cartopy.crs as ccrs

In [None]:
f, ax = plt.subplots(dpi=100, subplot_kw={"projection": ccrs.PlateCarree()})


_dm.plot.pcolormesh(
    cmap=plt.get_cmap("Reds"),
    ax=ax,
    cbar_kwargs={
        "label": "Wind Speed [m/s]",
        "orientation": "horizontal",
    },
    transform=ccrs.PlateCarree(),
    x="lon",
    y="lat",
    levels=6,
)
# _dm is the surface wind speed at the first time step (not a time mean)
ax.set_title("Surface wind speed; first time step")
ax.coastlines()

gl = ax.gridlines(draw_labels=True)
# top_labels/right_labels replace the deprecated xlabels_top/ylabels_right
gl.top_labels = False
gl.right_labels = False

ax.add_feature(cy.feature.BORDERS);

# Convert to pandas & do some random fun stuff: 

Maybe we want to compare with a station, for example, or just use some of the many tools available in pandas. It's easy to convert back and forth between xarray and pandas:

## Pick out station: 

Let's pick out Tjärnö research station!


In [None]:
lat_tjarno = 58.9
lon_tjarno = 11.1
# pick out surface
ds_surf = ds.isel(lev=-1)
ds_tjarno = ds_surf.sel(lat=lat_tjarno, lon=lon_tjarno, method="nearest")

### Resample:

In [None]:
df_tjarno = ds_tjarno.to_dataframe()

In [None]:
df_tjarno.head()

In [None]:
# yearly means; note that pandas >= 2.2 prefers the alias "YE" over "Y"
df_yearly = df_tjarno.resample("Y").mean()

In [None]:
df_yearly[["U", "V"]].plot()
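
What `resample(...).mean()` does is easiest to see on a made-up monthly series with a known answer. This sketch uses the `"YS"` alias, which labels each bin by the start of the year (an assumption made here to avoid the alias change between pandas versions; the averages themselves are the same as with year-end labels).

```python
import numpy as np
import pandas as pd

# Two years of made-up monthly values: 1.0 throughout 2011, 3.0 throughout 2012
index = pd.date_range("2011-01-01", periods=24, freq="MS")
df_demo = pd.DataFrame({"U": np.repeat([1.0, 3.0], 12)}, index=index)

# One row per year, holding the mean of that year's 12 monthly values
df_yearly_demo = df_demo.resample("YS").mean()
print(df_yearly_demo)
```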

# Using pandas specific tools:

In [None]:
ds_s["Wind_speed"] = np.sqrt(ds_s["U"] ** 2 + ds_s["V"] ** 2)

In [None]:
df = ds_s.to_dataframe()
df.head()

In [None]:
df[["U", "V"]].plot.hist(alpha=0.5, bins=200)

In [None]:
df_ri = df.reset_index()
df_ri.head()

### Check out the trade winds (skip in presentation):

In [None]:
trops = (-20 < df_ri["lat"]) & (df_ri["lat"] < 20)
df_ri[["U", "V"]][trops].plot.hist(alpha=0.5, bins=200)

### Let's do something unnecessarily complicated :D

## qcut, cut

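`qcut` splits the data into bins whose edges are quantiles of the data. Values outside the outermost quantiles given (here below 5 % and above 95 %) are left as NaN.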

In [None]:
df_ri["wind_speed_cat"] = pd.qcut(
    df_ri["Wind_speed"],
    q=[0.05, 0.17, 0.34, 0.66, 0.83, 0.95],
    labels=["very low", "low", "med", "high", "very high"],
)

`cut` splits the data into categories using bin edges that you give explicitly.

In [None]:
df_ri["lat_cat"] = pd.cut(
    df_ri["lat"],
    [-90, -60, -30, 0, 30, 60, 90],
    labels=["S polar", "S mid", "S tropics", "N tropics", "N mid", "N polar"],
)
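
To see how the two functions treat values at the edges, here is a sketch on made-up numbers (`values` and `lats` are invented for the example; the bin edges and quantiles follow the cells above):

```python
import numpy as np
import pandas as pd

# qcut: bin edges are quantiles of the data itself
values = pd.Series(np.arange(100.0))
speed_cat = pd.qcut(
    values,
    q=[0.05, 0.17, 0.34, 0.66, 0.83, 0.95],
    labels=["very low", "low", "med", "high", "very high"],
)
# Values below the 5th or above the 95th percentile fall outside all bins -> NaN
print(speed_cat.isna().sum())

# cut: bin edges are given explicitly; bins are right-closed, e.g. (-90, -60]
lats = pd.Series([-90.0, -75.0, -10.0, 10.0, 75.0])
lat_cat = pd.cut(
    lats,
    [-90, -60, -30, 0, 30, 60, 90],
    labels=["S polar", "S mid", "S tropics", "N tropics", "N mid", "N polar"],
)
# -90 sits exactly on the lowest edge and is excluded unless include_lowest=True
print(lat_cat.tolist())
```

If your grid includes a point at exactly -90, passing `include_lowest=True` to `pd.cut` keeps it in the first bin.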

In [None]:
# numeric_only=True skips the time and category columns, which cannot be averaged
df_ri.groupby("lat_cat").mean(numeric_only=True)

In [None]:
sns.boxenplot(x="lat_cat", y="U", color="b", scale="linear", data=df_ri)

In [None]:
sns.boxenplot(
    x="wind_speed_cat",
    y="N_AER",
    color="b",
    scale="linear",
    data=df_ri,
)

In [None]:
sns.displot(
    x="N_AER", hue="lat_cat", log_scale=True, kind="kde", data=df_ri, multiple="stack"
)

## Convert back to xarray if we need:

In [None]:
ds_new = df_ri.set_index(["time", "lat", "lon"]).to_xarray()

In [None]:
ds_new
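
The round trip between xarray and pandas can be verified on a small made-up array. The name `N_AER` is borrowed from the dataset above, but the values and coordinates are invented:

```python
import numpy as np
import pandas as pd
import xarray as xr

da_rt = xr.DataArray(
    np.arange(12.0).reshape(2, 2, 3),
    dims=("time", "lat", "lon"),
    coords={
        "time": pd.date_range("2011-01-01", periods=2, freq="MS"),
        "lat": [58.0, 60.0],
        "lon": [8.0, 10.0, 12.0],
    },
    name="N_AER",
)

# xarray -> pandas: the dimensions become a MultiIndex, flattened by reset_index()
df_rt = da_rt.to_dataframe().reset_index()

# pandas -> xarray: the index levels become the dimensions again
ds_rt = df_rt.set_index(["time", "lat", "lon"]).to_xarray()

print(df_rt.shape)
print(ds_rt["N_AER"].dims)
```

The order of the columns passed to `set_index` decides the order of the dimensions in the resulting dataset.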

### Groupby

In [None]:
ds_new.where(ds_new["wind_speed_cat"] == "low").mean("time")["N_AER"].plot()

In [None]:
ds["wind_speed"] = np.sqrt(ds["U"] ** 2 + ds["V"] ** 2)

In [None]:
ds_new.groupby("wind_speed_cat").mean()

In [None]:
df[["U", "V"]].plot.hist(alpha=0.5, bins=200)

In [None]:
df_tjarno.head()