# CSS 120: Environmental Data Science

## Paleoclimate

### Umberto Mignozzetti (UCSD)

(Based on Project Pythia and ClimateMatch)

# Packages


In [None]:
# To install
# !pip install LiPD --quiet
# !pip install pyleoclim --quiet
# !pip install climlab --quiet

In [None]:
# System helpers
import os
import sys
from io import StringIO
import tempfile

# Data analysis
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import pooch

# Principal Component Analysis
from sklearn.decomposition import PCA

# Paleodata analysis
import lipd
import pyleoclim as pyleo

# Maps
import cartopy, cartopy.crs as ccrs, cartopy.feature as cfeature, cartopy.io.shapereader as shapereader

##  Helper functions


In [None]:
# Pooch Load
def pooch_load(filelocation=None, filename=None, processor=None):
    # Expand "~" explicitly; os.path.exists does not expand it on its own
    shared_location = os.path.expanduser("~")
    user_temp_cache = tempfile.gettempdir()

    if os.path.exists(os.path.join(shared_location, filename)):
        file = os.path.join(shared_location, filename)
    else:
        file = pooch.retrieve(
            filelocation,
            known_hash=None,
            fname=os.path.join(user_temp_cache, filename),
            processor=processor,
        )

    return file

##  Helper functions


In [None]:
# Function to convert the PAGES2k LiPD files into a pandas.DataFrame
def lipd2df(
    lipd_dirpath,
    pkl_filepath=None,
    col_str=[
        "paleoData_pages2kID", "dataSetName", "archiveType", "geo_meanElev", "geo_meanLat",
        "geo_meanLon", "year", "yearUnits", "paleoData_variableName", "paleoData_units",
        "paleoData_values", "paleoData_proxy",
    ],
):
    """
    Convert a bunch of PAGES2k LiPD files to a `pandas.DataFrame` to boost data loading.

    If `pkl_filepath` isn't `None`, save the DataFrame as a pickle file.

    Parameters:
    ----------
        lipd_dirpath: str
          Path of the PAGES2k LiPD files
        pkl_filepath: str or None
          Path of the converted pickle file. Default: `None`
        col_str: list of str
          Name of the variables to extract from the LiPD files

    Returns:
    -------
        df: `pandas.DataFrame`
          Converted Pandas DataFrame
    """

    # Save the current working directory for later use, as the LiPD utility will change it in the background
    work_dir = os.getcwd()
    # LiPD utility requires the absolute path
    lipd_dirpath = os.path.abspath(lipd_dirpath)
    # Load LiPD files
    lipds = lipd.readLipd(lipd_dirpath)
    # Extract timeseries from the list of LiPD objects
    ts_list = lipd.extractTs(lipds)
    # Recover the working directory
    os.chdir(work_dir)
    # Create an empty pandas.DataFrame with the number of rows to be the number of the timeseries (PAGES2k records),
    # and the columns to be the variables we'd like to extract
    df_tmp = pd.DataFrame(index=range(len(ts_list)), columns=col_str)
    # Loop over the timeseries and pick those for global temperature analysis
    i = 0
    for ts in ts_list:
        if (
            "paleoData_useInGlobalTemperatureAnalysis" in ts.keys()
            and ts["paleoData_useInGlobalTemperatureAnalysis"] == "TRUE"
        ):
            for name in col_str:
                try:
                    df_tmp.loc[i, name] = ts[name]
                except KeyError:  # variable missing from this record
                    df_tmp.loc[i, name] = np.nan
            i += 1
    # Drop the rows with all NaNs (those not for global temperature analysis)
    df = df_tmp.dropna(how="all")
    # Save the dataframe to a pickle file for later use
    if pkl_filepath:
        save_path = os.path.abspath(pkl_filepath)
        print(f"Saving pickle file at: {save_path}")
        df.to_pickle(save_path)
    return df

##  Helper functions


In [None]:
class SupressOutputs(list):
    def __enter__(self):
        self._stdout = sys.stdout
        sys.stdout = self._stringio = StringIO()
        return self

    def __exit__(self, *args):
        self.extend(self._stringio.getvalue().splitlines())
        del self._stringio  # free up some memory
        sys.stdout = self._stdout


# Reconstructing Past Changes in Terrestrial Climate

## Reconstructing Past Changes in Terrestrial Climate

Let us now look at temperature change over the past 2,000 years as recorded by proxy records from tree rings, speleothems, and lake sediments. 

To analyze these datasets, we will group them by archive and create time series plots to assess temperature variations.

We will:

- Plot temperature records based on three different terrestrial proxies
- Assess similarities and differences between the temperature records

## Loading Terrestrial Paleoclimate Records

First, we need to download the data. Since it is stored as a LiPD file, we will use Pyleoclim to format and interpret the data.

In [None]:
# set the name to save the Euro2k data
fname = "euro2k_data"

# download the data
lipd_file_path = pooch.retrieve(
    url="https://osf.io/7ezp3/download/",
    known_hash=None,
    path="./",
    fname=fname,
    processor=pooch.Unzip(),
)

## Loading Terrestrial Paleoclimate Records

In [None]:
with SupressOutputs():
    d_euro = pyleo.Lipd(os.path.join(".", f"{fname}.unzip", "Euro2k"))

## Temperature Reconstructions

We can filter all of the data so that we only keep reconstructions of temperature from terrestrial archives (e.g. tree rings, speleothems and lake sediments). 

This is accomplished with the function below. 

The [`Lipd.to_tso`](https://pyleoclim-util.readthedocs.io/en/latest/core/api.html#pyleoclim.core.lipd.Lipd.to_tso) method is used to obtain a list of dictionaries that can be iterated upon.

In [None]:
def filter_data(dataset, archive_type, variable_name):
    """
    Return a MultipleSeries object with the records of a given
    variable (variable_name) and archive type (archive_type),
    along with the latitudes and longitudes of those records.
    """
    # Create a list of dictionaries to iterate over, using the Lipd.to_tso method
    ts_list = dataset.to_tso()
    # Append the correct indices for a given value of archive_type and variable_name
    indices = []
    lat = []
    lon = []
    for idx, item in enumerate(ts_list):
        # Check that the key exists to avoid a KeyError in the loop
        if "archiveType" in item.keys():
            # If the archive type matches, go on to check the variable name
            if item["archiveType"] == archive_type:
                if item["paleoData_variableName"] == variable_name:
                    indices.append(idx)
    print(indices)
    # Create a list of LipdSeries for the given indices
    ts_list_archive_type = []
    for indice in indices:
        ts_list_archive_type.append(pyleo.LipdSeries(ts_list[indice]))

        # save lat and lons of proxies
        lat.append(ts_list[indice]["geo_meanLat"])
        lon.append(ts_list[indice]["geo_meanLon"])

    return pyleo.MultipleSeries(ts_list_archive_type), lat, lon
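
The filtering logic inside `filter_data` can be tried out without any LiPD files. Below is a minimal sketch on a handful of made-up dictionaries that only mimic the key names used above; the helper name `matching_indices` and the record values are assumptions for illustration.

```python
# Hypothetical records that mimic the dictionary layout used above
records = [
    {"archiveType": "tree", "paleoData_variableName": "temperature"},
    {"archiveType": "lake sediment", "paleoData_variableName": "temperature"},
    {"archiveType": "tree", "paleoData_variableName": "trsgi"},
    {"paleoData_variableName": "temperature"},  # no archiveType key
]


def matching_indices(records, archive_type, variable_name):
    """Return the positions of the records that match both criteria."""
    return [
        idx
        for idx, item in enumerate(records)
        # dict.get returns None for a missing key instead of raising KeyError
        if item.get("archiveType") == archive_type
        and item.get("paleoData_variableName") == variable_name
    ]


tree_temperature = matching_indices(records, "tree", "temperature")
print(tree_temperature)
```

Using `dict.get` for both keys means a record that lacks either key is simply skipped.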

## Temperature Reconstructions

In [None]:
ts_list = d_euro.to_tso()

## Temperature Reconstructions

Dictionaries are native to Python and can be explored as shown below.

In [None]:
# look at available entries for just one time-series
ts_list[0].keys()

## Temperature Reconstructions

We can also loop over the list to print the dataset name and variable name of every entry.

In [None]:
# print relevant information for all entries
for idx, item in enumerate(ts_list):
    print(str(idx) + ": " + item["dataSetName"] +
          ": " + item["paleoData_variableName"])

## Temperature Reconstructions

Now let's use our pre-defined function to create a new list that only has temperature reconstructions based on proxies from **lake sediments**:

In [None]:
ms_euro_lake, euro_lake_lat, euro_lake_lon = filter_data(
    d_euro, "lake sediment", "temperature"
)

## Temperature Reconstructions

And a new list that only has temperature reconstructions based on proxies from **tree rings**:

In [None]:
ms_euro_tree, euro_tree_lat, euro_tree_lon = filter_data(
    d_euro, "tree", "temperature")

## Temperature Reconstructions

And a new list that only has temperature information based on proxies from **speleothems**:

In [None]:
ms_euro_spel, euro_spel_lat, euro_spel_lon = filter_data(
    d_euro, "speleothem", "d18O")

## Temperature Reconstructions

Since we are going to compare temperature datasets based on different terrestrial climate archives, the quantitative values of the measurements in each record will differ. 

Therefore, to more easily and accurately compare temperature between the records, it's helpful to standardize the data. 

The `.standardize()` function removes the estimated mean of the time series and divides by its estimated standard deviation.
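
To see what standardizing does to a record, here is a minimal NumPy sketch on two made-up records in different units. The function below is our own illustration, not the Pyleoclim implementation, and the use of the sample standard deviation (`ddof=1`) is an assumption.

```python
import numpy as np


def standardize(values):
    """Return z-scores: subtract the mean, divide by the standard deviation."""
    values = np.asarray(values, dtype=float)
    # ddof=1 gives the sample standard deviation (an assumption of this sketch)
    return (values - values.mean()) / values.std(ddof=1)


# Two made-up records with the same shape but very different units
record_a = np.array([10.0, 12.0, 14.0, 16.0])
record_b = np.array([-30.2, -30.0, -29.8, -29.6])

z_a = standardize(record_a)
z_b = standardize(record_b)
print(z_a)
print(z_b)
```

After standardizing, both records have a mean of zero and a standard deviation of one, so they can be compared on the same axis.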

In [None]:
# standardize the data
spel_stnd = ms_euro_spel.standardize()
lake_stnd = ms_euro_lake.standardize()
tree_stnd = ms_euro_tree.standardize()

## Temperature Reconstructions

Now we can use Pyleoclim functions to create three stacked plots of this data with lake sediment records on top, tree ring reconstructions in the middle and speleothem records on the bottom.

Note that the colors used for the time series in each plot are the default colors generated by the function, so the corresponding colors in each of the three plots are not relevant.

In [None]:
# note the x axis is years before present, so read from left to right moving back in time

ax = lake_stnd.stackplot(
    label_x_loc=1.7,
    xlim=[0, 2000],
    v_shift_factor=1,
    figsize=[9, 5],
    time_unit="yrs BP",
)
ax[0].suptitle("Lake Cores", y=1.2)

ax = tree_stnd.stackplot(
    label_x_loc=1.7,
    xlim=[0, 2000],
    v_shift_factor=1,
    figsize=[9, 5],
    time_unit="yrs BP",
)
ax[0].suptitle("Tree Rings", y=1.2)

# recall d18O is a proxy for SST, and that more positive d18O means colder SST
ax = spel_stnd.stackplot(
    label_x_loc=1.7,
    xlim=[0, 2000],
    v_shift_factor=1,
    figsize=[9, 5],
    time_unit="yrs BP",
)
ax[0].suptitle("Speleothems", y=1.2)

# Reconstructing Past Changes in Atmospheric Climate

## Reconstructing Past Changes in Atmospheric Climate

To understand past atmospheric climate changes, we’ll analyze δD and atmospheric CO<sub>2</sub> data from the EPICA Dome C ice core.

Recall that δD and δ<sup>18</sup>O measurements on ice cores record past changes in temperature, and that measurements of CO<sub>2</sub> trapped in ice cores can be used to reconstruct past changes in Earth's atmospheric composition.

We will:

- Plot δD and CO<sub>2</sub> records from the EPICA Dome C ice core
- Assess changes in temperature and atmospheric greenhouse gas concentration over the past 800,000 years 

## Exploring past variations in atmospheric CO$_2$

Paleoclimatologists can reconstruct past changes in atmospheric composition by measuring gases trapped in layers of ice from ice cores retrieved from polar regions and high elevation mountain glaciers. 

We'll specifically be focusing on paleoclimate records produced from the [EPICA Dome C](https://en.wikipedia.org/wiki/Dome_C) ice core from Antarctica.

![](https://media.springernature.com/full/springer-static/image/art%3A10.1038%2Fncomms8850/MediaObjects/41467_2015_Article_BFncomms8850_Fig1_HTML.jpg?as=webp)

Credit: [Conway et al 2015, *Nature Communications*](https://www.nature.com/articles/ncomms8850)

Let's start by downloading the data for the composite CO<sub>2</sub> record for EPICA Dome C in Antarctica:

## Exploring past variations in atmospheric CO$_2$

In [None]:
filename_antarctica2015 = "antarctica2015co2composite.txt"
url_antarctica2015 = "https://www.ncei.noaa.gov/pub/data/paleo/icecore/antarctica/antarctica2015co2composite.txt"

data_path = pooch_load(
    filelocation=url_antarctica2015, filename=filename_antarctica2015
)  # open the file

co2df = pd.read_csv(data_path, skiprows=137, sep="\t")

co2df.head()

## Exploring past variations in atmospheric CO$_2$

Store this data as a `Series` in Pyleoclim:

In [None]:
ts_co2 = pyleo.Series(
    time=co2df["age_gas_calBP"] / 1000,
    value=co2df["co2_ppm"],
    time_name="Age",
    time_unit="kyr BP",
    value_name=r"$CO_2$",
    value_unit="ppm",
    label="EPICA Dome C CO2",
)

## Exploring past variations in atmospheric CO$_2$

We can now plot age vs. CO<sub>2</sub> from EPICA Dome C:

In [None]:
ts_co2.plot(color="C1")

## Exploring past variations in atmospheric CO$_2$

Notice that the x-axis is plotted with present-day (0 kyr) on the left and the past (800 kyr) on the right. This is a common practice when plotting paleoclimate time series data.

These changes in CO<sub>2</sub> are tracking glacial-interglacial cycles (Ice Ages) over the past 800,000 years. 

Recall that these Ice Ages occur as a result of changes in the orbital cycles of Earth: eccentricity (100,000 year cycle), obliquity (40,000 year cycle) and precession (21,000 year cycle).

Can you observe them in the graph above?
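
One way to look for such cycles is to compute a periodogram. The sketch below does this with NumPy on a synthetic, evenly spaced signal rather than on the EPICA record itself. The amplitudes are made up, and the periods (100, 40 and 20 kyr) are rounded so that each fits a whole number of times into the 800 kyr window.

```python
import numpy as np

dt = 1.0                      # sampling step in kyr
time = np.arange(0, 800, dt)  # 800 kyr, evenly spaced

# Synthetic signal: three sine waves with made-up amplitudes
signal = (
    1.0 * np.sin(2 * np.pi * time / 100)
    + 0.5 * np.sin(2 * np.pi * time / 40)
    + 0.3 * np.sin(2 * np.pi * time / 20)
)

# Periodogram: squared magnitude of the Fourier transform
power = np.abs(np.fft.rfft(signal)) ** 2
freqs = np.fft.rfftfreq(time.size, d=dt)  # cycles per kyr

# Skip the zero frequency, then keep the three strongest peaks
top = np.argsort(power[1:])[-3:] + 1
periods = np.sort(1 / freqs[top])
print(periods)
```

A plain Fourier transform assumes even sampling, so a real ice core record would first need to be placed on a uniform time axis or analyzed with a method designed for uneven sampling.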

## Exploring the relationship between δD and atmospheric CO<sub>2</sub>

To investigate the relationship between glacial cycles, atmospheric CO<sub>2</sub> and temperature, we can compare CO<sub>2</sub> to a record of hydrogen isotopic values (δD) of ice cores, which is a proxy for temperature in this case. 

***When interpreting isotopic measurements of ice cores, a more depleted δD value indicates cooler temperatures, and a more enriched δD value indicates warmer temperatures.***

This is the opposite of the relationship we looked at previously with δ<sup>18</sup>O, not because we are looking at a different isotope, but because we are now looking at the isotopic composition of ice rather than the isotopic composition of the ocean.

Let's download the EPICA Dome C δD data, store it as a `Series`, and plot the data:

In [None]:
filename_edc3deuttemp2007 = "edc3deuttemp2007.txt"
url_edc3deuttemp2007 = "https://www.ncei.noaa.gov/pub/data/paleo/icecore/antarctica/epica_domec/edc3deuttemp2007.txt"
data_path = pooch_load(
    filelocation=url_edc3deuttemp2007, filename=filename_edc3deuttemp2007
)  # open the file

# raw string so that "\s" is not treated as an (invalid) string escape
dDdf = pd.read_csv(data_path, skiprows=91, encoding="unicode_escape", sep=r"\s+")
dDdf.dropna(inplace=True)
dDdf.head()

## Exploring the relationship between δD and atmospheric CO<sub>2</sub>

In [None]:
dDts = pyleo.Series(
    time=dDdf["Age"] / 1000,
    value=dDdf["Deuterium"],
    time_name="Age",
    time_unit="kyr BP",
    value_name=r"$\delta D$",
    value_unit="\u2030",
    label=r"EPICA Dome C $\delta D$",
)

## Exploring the relationship between δD and atmospheric CO<sub>2</sub>

In [None]:
dDts.plot()

## Exploring the relationship between δD and atmospheric CO<sub>2</sub>

When we observe the δD data, we see very similar patterns as in the atmospheric CO<sub>2</sub> data. 

To more easily compare the two records, we can plot the two series side by side by putting them into a `MultipleSeries` object. 

Since the δD and CO<sub>2</sub> values have different units, we can first standardize the series and then plot the data. 

In [None]:
# combine series
ms = pyleo.MultipleSeries([dDts, ts_co2])

# standardize series and plot
ms.standardize().plot()

## Exploring the relationship between δD and atmospheric CO<sub>2</sub>

Now we can more easily compare the timing and magnitude of changes in CO<sub>2</sub> and δD at EPICA Dome C over the past 800,000 years. 

During glacial periods, δD was more depleted (cooler temperatures) and atmospheric CO<sub>2</sub>  was lower. 

During interglacial periods, δD was more enriched (warmer temperatures) and atmospheric CO<sub>2</sub>  was higher.

# Paleoclimate Data Analysis Tools

## Paleoclimate Data Analysis Tools

A common issue in paleoclimate is the presence of uneven time spacing between consecutive observations. 

`Pyleoclim` includes several methods that can deal with uneven sampling effectively, but there are certain applications and analyses for which it's necessary to place the records on a uniform time axis. Let us study a few ways to do this with `Pyleoclim`. 

Additionally, we will explore another useful paleoclimate data analysis tool, Principal Component Analysis (PCA), which allows us to identify a common signal between various paleoclimate reconstructions. 

By the end of this section, you should be able to perform the following data analysis techniques on proxy-based climate reconstructions:

*   Interpolation
*   Binning 
*   Principal component analysis

## Load the sample dataset for analysis

The dataset we'll be using is a record of hydrogen isotopes of leaf waxes (δD<sub>wax</sub>) from Lake Tanganyika in East Africa [(Tierney et al., 2008)](https://www.science.org/doi/10.1126/science.1160485?url_ver=Z39.88-2003&rfr_id=ori:rid:crossref.org&rfr_dat=cr_pub%20%200pubmed). 

Recall that δD<sub>wax</sub> is a proxy that is typically thought to record changes in the amount of precipitation in the tropics via the amount effect. 

Let's first read the data from a .csv file.

## Load the sample dataset for analysis

In [None]:
filename_tang = "tanganyika_dD.csv"
url_tang = "https://osf.io/sujvp/download/"
tang_dD = pd.read_csv(pooch_load(filelocation=url_tang, filename=filename_tang))
tang_dD.head()

## Load the sample dataset for analysis

We can now create a `Series` in Pyleoclim and assign names to different variables so that we can easily plot the data.

In [None]:
ts_tang = pyleo.Series(
    time=tang_dD["Age"],
    value=tang_dD["dD_IVonly"],
    time_name="Age",
    time_unit="yr BP",
    value_name="dDwax",
    value_unit="per mille",
    label="Lake Tanganyika dDprecip",
)

ts_tang.plot(color="C1", invert_yaxis=True)

## Load the sample dataset for analysis

You may notice that the y-axis is inverted. 

When we're plotting δD data, we typically invert the y-axis because more negative ("depleted") values suggest increased rainfall, whereas more positive ("enriched") values suggest decreased rainfall.

## Uniform Time-Sampling of the Data

There are a number of different reasons we might want to assign new values to our data. 

For example, if the data is not evenly spaced, we might need to resample in order to use a specific data analysis technique or to more easily compare to other data of a different sampling resolution. 

First, let's check whether our data is already evenly spaced using the `.is_evenly_spaced()` method:

In [None]:
ts_tang.is_evenly_spaced()
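
Conceptually, such a check compares the steps between consecutive time points. Here is a minimal NumPy sketch of the idea on made-up ages; the function and its tolerance are our own illustration and may differ from what Pyleoclim does internally.

```python
import numpy as np


def is_evenly_spaced(time, rtol=1e-5):
    """True when every step between consecutive times matches the first one."""
    steps = np.diff(np.asarray(time, dtype=float))
    return bool(np.allclose(steps, steps[0], rtol=rtol))


even_ages = [0, 100, 200, 300]    # constant 100-year step
uneven_ages = [0, 90, 310, 400]   # steps of 90, 220 and 90 years

print(is_evenly_spaced(even_ages), is_evenly_spaced(uneven_ages))
```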

## Uniform Time-Sampling of the Data

Our data is not evenly spaced. 

There are a few different methods available in `pyleoclim` to place the data on a uniform axis.

We will now interpolate and bin the data. 

In general, these methods use the available data near a chosen time to estimate what the value was at that time, but each method differs in which nearby data points it uses and how it uses them.

## Interpolation

Interpolation projects the data onto an evenly spaced time axis with a distance between points (step size) of our choosing. 

The data can be interpolated with any of the following methods: 

- `linear`
- `nearest`
- `zero`
- `slinear`
- `quadratic`
- `cubic`
- `previous`
- `next`

More on these and their associated keyword arguments can be found in the [documentation](https://pyleoclim-util.readthedocs.io/en/latest/core/api.html#pyleoclim.core.series.Series.interp). 
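
To build some intuition for the linear case, here is a minimal sketch with `np.interp` on a made-up, unevenly spaced record. The ages, values and step size are assumptions chosen for illustration; this is not the Pyleoclim implementation.

```python
import numpy as np

# Made-up, unevenly spaced record
age = np.array([0.0, 100.0, 400.0, 500.0])          # yr BP
value = np.array([-100.0, -110.0, -140.0, -120.0])  # per mille

step = 100.0  # step size of our choosing
even_age = np.arange(age.min(), age.max() + step, step)

# np.interp connects neighbouring data points with straight lines
even_value = np.interp(even_age, age, value)

print(even_age)
print(even_value)
```

The new values at 200 and 300 yr BP fall on the straight line between the measurements at 100 and 400 yr BP.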


## Interpolation

By default, the function `.interp()` implements linear interpolation:

In [None]:
tang_linear = ts_tang.interp()  # default method = 'linear'

## Interpolation

In [None]:
# check whether or not the series is now evenly spaced
tang_linear.is_evenly_spaced()

## Interpolation

Now that we've interpolated our data, let's compare the original dataset to the linearly interpolated dataset we just created.

In [None]:
fig, ax = plt.subplots()  # assign a new plot axis
ts_tang.plot(ax=ax, label="Original", invert_yaxis=True)
tang_linear.plot(ax=ax, label="Linear")

## Interpolation

Notice there are only some minor differences between the original and linearly interpolated data.

You can print the data in the original and interpolated time series to see the difference in the ages between the two datasets. The interpolated dataset is now evenly spaced with a δD value every ~290 years.

In [None]:
ts_tang

In [None]:
tang_linear

## Interpolation

Let's compare a few of the different interpolation methods (e.g., quadratic, next, zero) with one another just to see how they are similar and different:

In [None]:
fig, ax = plt.subplots()  # assign a new plot axis
ts_tang.plot(ax=ax, label="original", invert_yaxis=True)
for method in ["linear", "quadratic", "next", "zero"]:
    ts_tang.interp(method=method).plot(
        ax=ax, label=method, alpha=0.9
    )  # plot all the methods we want

The methods can produce slightly different results, but mostly reproduce the same overall trend. In this case, the quadratic method may be less appropriate than the other methods.

## Binning

Another option for resampling our data onto a uniform time axis is binning.

Binning is when a set of time intervals is defined and data is grouped or binned with other data in the same interval, then all those points in a "bin" are averaged to get a data value for that bin. 

The defaults for binning pick a bin size at the coarsest time spacing present in the dataset and average data over a uniform sequence of such intervals. 
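
The idea can be sketched in a few lines of NumPy. The record below is made up, and the fixed bin size of 100 years is an assumption of this sketch rather than the Pyleoclim default described above.

```python
import numpy as np

# Made-up, unevenly spaced record
age = np.array([0.0, 40.0, 90.0, 130.0, 260.0, 290.0])  # yr BP
value = np.array([1.0, 3.0, 2.0, 4.0, 6.0, 8.0])

bin_size = 100.0
edges = np.arange(0.0, 300.0 + bin_size, bin_size)  # 0, 100, 200, 300
centers = (edges[:-1] + edges[1:]) / 2              # 50, 150, 250

# np.digitize gives, for each age, the index of the bin it falls in
which_bin = np.digitize(age, edges) - 1

# Average the values in each bin; empty bins become NaN
binned = np.array([
    value[which_bin == k].mean() if np.any(which_bin == k) else np.nan
    for k in range(centers.size)
])

print(centers)
print(binned)
```

Each binned value is the average of all measurements whose ages fall inside that bin, and it is assigned to the center of the bin.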

In [None]:
tang_bin = (
    ts_tang.bin()
)  # default settings pick the coarsest time spacing in the data as the binning period

## Binning

In [None]:
fig, ax = plt.subplots()  # assign a new plot axis
ts_tang.plot(ax=ax, label="Original", invert_yaxis=True)
tang_bin.plot(ax=ax, label="Binned")

Again, notice that although there are some minor differences between the original and binned data, the records still capture the same overall trend.

## Principal Component Analysis (PCA)

Large datasets, such as global climate datasets, are often difficult to interpret due to their multiple dimensions. 

Although tools such as Xarray help us to organize large, multidimensional climate datasets, it can still sometimes be difficult to interpret certain aspects of such data. 

Principal Component Analysis (PCA) is a tool for reducing the dimensionality of such datasets and increasing interpretability with minimal modification or loss of data. 

In other words, PCA allows us to reduce the number of variables of a dataset, while preserving as much information as possible.

## Principal Component Analysis (PCA)

The first step in PCA is to calculate a matrix that summarizes how the variables in a dataset all relate to one another.

This matrix is then broken down into new uncorrelated variables that are linear combinations or mixtures of the initial variables.

These new variables are the **principal components**.

The initial dimensions of the dataset determine the number of principal components calculated. 

Most of the information within the initial variables is compressed into the first components. 

Additional details about PCA and the calculations involved can be found [here](https://builtin.com/data-science/step-step-explanation-principal-component-analysis).
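
The steps above can be sketched with plain NumPy. This example uses synthetic data (three variables built from one shared signal plus a little noise) and the covariance matrix as the summary matrix; it illustrates the idea and is not the implementation used by any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples of 3 variables that share one common signal
common = rng.normal(size=200)
data = np.column_stack([
    common + 0.1 * rng.normal(size=200),
    -common + 0.1 * rng.normal(size=200),
    0.5 * common + 0.1 * rng.normal(size=200),
])

# 1. Center the variables and summarize how they relate to one another
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# 2. Break the matrix down (eigh is meant for symmetric matrices)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]  # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 3. Project the data onto the eigenvectors: the principal components
components = centered @ eigvecs

# Fraction of the total variance carried by each component
explained_ratio = eigvals / eigvals.sum()
print(explained_ratio)
```

Because all three variables are driven by the same signal, nearly all of the variance ends up in the first component.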

## Principal Component Analysis (PCA)

Applied to paleoclimate, PCA can reduce the dimensionality of large paleoclimate datasets with multiple variables and can help us identify a common signal between various paleoclimate reconstructions.

An example of a study that applies PCA to paleoclimate is [Otto-Bliesner et al., 2014](https://www.science.org/doi/full/10.1126/science.1259531).

This study applies PCA to rainfall reconstructions from models and proxies from throughout Africa to determine common climate signals in these reconstructions.

Let us calculate the PCA of four δD paleoclimate records from Africa to assess common climate signals in the four records.

## Principal Component Analysis (PCA)

So far, we've been looking at δD data from Lake Tanganyika in tropical East Africa. 

Let's compare this δD record to other existing δD records from lake and marine sediment cores in tropical Africa from the Gulf of Aden [(Tierney and deMenocal, 2013)](https://doi.org/10.1126/science.1240411), Lake Bosumtwi [(Shanahan et al., 2015)](https://doi.org/10.1038/ngeo2329), and the West African Margin [(Tierney et al., 2017)](https://doi.org/10.1126/sciadv.1601503).

First, let's load these datasets:

## Principal Component Analysis (PCA)

In [None]:
# Gulf of Aden
filename_aden = "aden_dD.csv"
url_aden = "https://osf.io/gm2v9/download/"
aden_dD = pd.read_csv(pooch_load(filelocation=url_aden, filename=filename_aden))
aden_dD.head()

## Principal Component Analysis (PCA)

Next, the Lake Bosumtwi record:

In [None]:
# Lake Bosumtwi
filename_Bosumtwi = "bosumtwi_dD.csv"
url_Bosumtwi = "https://osf.io/mr7d9/download/"
bosumtwi_dD = pd.read_csv(
    pooch_load(filelocation=url_Bosumtwi, filename=filename_Bosumtwi)
)
bosumtwi_dD.head()

## Principal Component Analysis (PCA)

Finally, the GC27 record from the West African Margin:

In [None]:
# GC27 (West African Margin)
filename_GC27 = "gc27_dD.csv"
url_GC27 = "https://osf.io/k6e3a/download/"
gc27_dD = pd.read_csv(pooch_load(filelocation=url_GC27, filename=filename_GC27))
gc27_dD.head()

## Principal Component Analysis (PCA)

Next, let's convert each dataset into a `Series` in Pyleoclim.

In [None]:
ts_tanganyika = pyleo.Series(
    time=tang_dD["Age"],
    value=tang_dD["dD_IVonly"],
    time_name="Age",
    time_unit="yr BP",
    value_name="dDwax",
    label="Lake Tanganyika",
)
ts_aden = pyleo.Series(
    time=aden_dD["age_calBP"],
    value=aden_dD["dDwaxIVcorr"],
    time_name="Age",
    time_unit="yr BP",
    value_name="dDwax",
    label="Gulf of Aden",
)
ts_bosumtwi = pyleo.Series(
    time=bosumtwi_dD["age_calBP"],
    value=bosumtwi_dD["d2HleafwaxC31ivc"],
    time_name="Age",
    time_unit="yr BP",
    value_name="dDwax",
    label="Lake Bosumtwi",
)
ts_gc27 = pyleo.Series(
    time=gc27_dD["age_BP"],
    value=gc27_dD["dDwax_iv"],
    time_name="Age",
    time_unit="yr BP",
    value_name="dDwax",
    label="GC27",
)

## Principal Component Analysis (PCA)

Now let's set up a `MultipleSeries` using Pyleoclim with all four δD datasets. 

In [None]:
ts_list = [ts_tanganyika, ts_aden, ts_bosumtwi, ts_gc27]
ms_africa = pyleo.MultipleSeries(ts_list, label="African dDwax")

## Principal Component Analysis (PCA)

We can now create a stackplot with all four δD records:

In [None]:
fig, ax = ms_africa.stackplot()

## Principal Component Analysis (PCA)

By creating a stackplot, we can visually compare between the datasets. 

However, the four δD records aren't the same resolution and don't span the same time interval.

To better compare the records and assess a common trend, we can use PCA.

First, we can use `.common_time()` to place the records on a shared time axis with a common sampling frequency. 

This function takes the argument `method`, which can be one of `bin`, `interp`, or `gdkernel`. 

The binning and interpolation methods are what we just covered in the previous section. 

Let's set the time step to 500 years, the method to `interp`, and standardize the data:

In [None]:
africa_ct = ms_africa.common_time(method="interp", step=0.5).standardize()
fig, ax = africa_ct.stackplot()
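
The idea behind a common time axis can be sketched with plain NumPy: keep only the interval covered by every record, then interpolate each record onto the same evenly spaced ages. The two records and the 500-year step below are made up for illustration, and this is not how Pyleoclim implements `.common_time()`.

```python
import numpy as np

# Two made-up records with different spans and resolutions
age_a = np.array([0.0, 300.0, 900.0, 1500.0, 2000.0])
val_a = np.array([1.0, 2.0, 4.0, 3.0, 1.0])
age_b = np.array([500.0, 1000.0, 1500.0, 2500.0])
val_b = np.array([10.0, 20.0, 30.0, 10.0])

# Keep only the interval covered by both records
start = max(age_a.min(), age_b.min())  # 500
stop = min(age_a.max(), age_b.max())   # 2000
step = 500.0
# The overlap here is a whole number of steps, so the last point equals `stop`
common_age = np.arange(start, stop + step, step)

# Linearly interpolate each record onto the shared axis
a_on_common = np.interp(common_age, age_a, val_a)
b_on_common = np.interp(common_age, age_b, val_b)

print(common_age)
print(a_on_common)
print(b_on_common)
```

Both records now have one value at each of the same ages, which is what PCA needs as input. The parts of each record outside the overlap are dropped.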

## Principal Component Analysis (PCA)

We now have standardized δD records that are the same sampling resolution and span the same time interval. Note this meant trimming the longer time series down to around 20,000 years in length.

We can now apply PCA, which will allow us to quantitatively identify a common signal between the four δD paleoclimate reconstructions. 

The `pyleoclim` package has its own [`.pca`](https://pyleoclim-util.readthedocs.io/en/latest/core/api.html#pyleoclim.core.multipleseries.MultipleSeries.pca) method (e.g., `africa_ct.pca()`), but it is computationally intensive. For our purposes, we will use the `PCA` class from `sklearn`.

Please refer to this [example](https://vitalflux.com/pca-explained-variance-concept-python-example/) for further reference.

## Principal Component Analysis (PCA)

In [None]:
pca = PCA()
africa_PCA = pca.fit(africa_ct.to_pandas())

## Principal Component Analysis (PCA)

The result is an object containing multiple outputs. The two outputs we'll look at are the percentage of variance accounted for by each mode and the principal components themselves. 

First, let's print the percentage of variance accounted for by each mode, noting that the first principal component is first in the list, and so on.

In [None]:
print(africa_PCA.explained_variance_ratio_)

## Principal Component Analysis (PCA)

This means that most of the variance in the four paleoclimate records is explained by the first principal component (the first value displayed). The number of datasets in the PCA constrains the number of principal components that can be defined, which is why we only have four components in this example.

We can now look at the principal component of the first mode of variance. Let's create a new series for the first principal component and plot it against the original datasets:

In [None]:
africa_pc1 = africa_PCA.transform(africa_ct.to_pandas())

In [None]:
africa_mode1 = pyleo.Series(
    time=africa_ct.series_list[0].time,
    value=africa_pc1[:,0],
    label=r'$PC_1$',
    value_name='PC1',
    time_name='age',
    time_unit='yr BP'
)

## Principal Component Analysis (PCA)

In [None]:
fig, ax1 = plt.subplots()

ax1.set_ylabel("dDwax")
ax2 = ax1.twinx()  # instantiate a second axes that shares the same x-axis
ax2.set_ylabel("PC1")  # we already handled the x-label with ax1

# plt.plot(mode1.time,pc1_scaled)
africa_mode1.plot(color="black", ax=ax2, invert_yaxis=True)
africa_ct.plot(ax=ax1, linewidth=0.5)

## Questions?

## See you in the next class!