In [1]:
# TEMPORARY CODE

# Until smart broker is installable
import sys
sys.path.append("../../fastapi")

# Until we start using sessions properly
class Session():
    def __init__(self, ip, port):
        self.ip = ip
        self.port = port

session = Session("127.0.0.1", "8000")

# SAIL Researcher User Documentation

**Aim:** This notebook section acts as a primer for new users of the SAIL platform. It contains the background information needed to work with SAIL data federations.


## 1. Introduction to Dataset Types in the SAIL Ecosystem

SAIL supports computation on four main dataset types:

- Longitudinal Datasets
    - Longitudinal-Time-Series
    - Longitudinal-Repeated-Measurement
    - Longitudinal-Events
- Survey/Cross-Sectional Datasets

### 1.1 Longitudinal Datasets

Longitudinal Datasets represent the raw data format which is ingested into the SAIL platform. Longitudinal data are not queryable in their raw form and must first be flattened into conventional tabular data before they can be analysed. For our purposes, Longitudinal Datasets come in three types.

#### 1.1.1 Longitudinal-repeated-measurement

This type of dataset contains the same set of measurements for different instances or subjects at different points in time. Although the number of measurements taken can differ per subject, the set of measurements is always the same.

| Instance      | Blood Pressure | Heartbeat     |
| :---        |    :----:   |          :---: |
| Adam Monday      | 120       | 70   |
| Adam Tuesday   | 90        | 40      |
| Adam Wednesday      | 80       | 60   |
| Adam Thursday   | 72        | 42      |

#### 1.1.2 Longitudinal Time Series

Time series data has a single repeated measurement over time in a series. Each series is specific to a subject and a type of measurement. Not all series need to be of the same length, and not all data needs to follow the same timestamps.

| Instance      | Monday | Tuesday     |
| :---        |    :----:   |          :---: |
| Anne Heartbeat  | 60       | 70   |

<br>


| Instance      | Monday | Tuesday     |
| :---        |    :----:   |          :---: |
| Anne blood pressure  | 90       | 80   |


#### 1.1.3 Longitudinal Events

This type of dataset contains combinations of times, values and types of test. The type of the value field may vary. Outside of the SAIL ecosystem this type of data is also referred to as journal data. Since missing events are simply omitted, this type of data normally does not contain missing fields.

| Patient       | Value     | Test                 |   Day       |
| :---          |    :----: |  :---:               |        :---: |
| Adam          | 120       | Blood Pressure       | Monday      |
| Adam          | 90        | Heartbeat            | Tuesday     |
| Saurabh       | 80        | Blood Pressure       | Monday      |
| Saurabh       | 85        | Fasting Blood Glucose| Wednesday   |
| Stanley       | 72        | Blood Pressure       | Friday      |

### 1.2 Survey/Cross-Sectional Datasets

These are the most common type of dataset: there is a list of instances and a list of measurements, and each measurement appears at most once for every instance. In conventional machine learning and data analytics, this is the format in which data must exist. All longitudinal data must first be processed into this format before it can be analysed by SAIL SAFE functions.


| Instance      | Blood Pressure  | Heartbeat            |   Fasting Blood Glucose       |
| :---          |    :----:       |  :---:               |        :---:                  |
| Adam          | 120             | 90                   | 101                           |
| Anne          | 90              | 50                   | 120                           |
| Jaap          | 80              | 70                   | 115                           |
| Saurabh       | 85              | 80                   | 118                           |
| Stanley       | 72              | 65                   | 110                           |
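As a rough local illustration of what "processing longitudinal data into this format" means, the sketch below pivots the event rows from section 1.1.3 into one row per patient. It is plain Python, not SAIL code: the function name `flatten_events` and the "last event wins" rule are assumptions made for the example.

```python
from collections import defaultdict

# Event rows in the Longitudinal Events layout from section 1.1.3:
# (patient, value, test, day).
events = [
    ("Adam", 120, "Blood Pressure", "Monday"),
    ("Adam", 90, "Heartbeat", "Tuesday"),
    ("Saurabh", 80, "Blood Pressure", "Monday"),
    ("Saurabh", 85, "Fasting Blood Glucose", "Wednesday"),
    ("Stanley", 72, "Blood Pressure", "Friday"),
]


def flatten_events(events):
    """Pivot event rows into one row per patient, one column per test.

    Assumption for this sketch: if a patient has several events for the
    same test, the last one in input order wins. Tests with no event for
    a patient are stored as None (a missing value).
    """
    tests = sorted({test for _, _, test, _ in events})
    table = defaultdict(lambda: dict.fromkeys(tests))
    for patient, value, test, _day in events:
        table[patient][test] = value
    return dict(table)


cross_sectional = flatten_events(events)
for patient, row in cross_sectional.items():
    print(patient, row)
```

Note that the flattened table now contains missing values, which the event format itself did not need to represent.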


## 3 SAIL Data Structures

**Aim:** This section provides information on how the different data structures described above are organised in the SAIL platform. We will detail how to process a data federation into a usable format once provisioned. Specifically, we will look at how to distil a *longitudinal* data federation into a *tabular* data format.

### 3.1 SAIL Data Federation Structure

Federated data relies on two main architectural components: a **dataset** and a corresponding **data model**.

- Datasets contain the raw data which Researchers work with
- Data models contain meta-information relating to the structure of Datasets

###### **TODO: INSERT NICE INFOGRAPHIC**

### 3.2 SAIL Data Federation Hierarchy

Federated data is organised into the following object types:

- Longitudinal Dataset
- Tabular Dataset
- Dataframe
- Series


<img src="images/dataset_lifecycle.png" alt="Data Federation Hierarchy" width="800vw"/>


###### **TODO: INSERT NICER INFOGRAPHIC**

#### 3.2.1 Longitudinal Dataset Structure

A *Longitudinal Dataset* is a collection of *Patients* and a *Longitudinal Data Model*.
- A Longitudinal Data Model is the FHIR profile which defines how the longitudinal dataset is structured
- A Patient owns a list of *Measurements* and some basic information concerning the data subject. 
- A Measurement is a timepoint in patient history where a single data point was measured or a single fact was established.
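The relationships above can be pictured with a few small data classes. This is only a local sketch of the concepts: the class and field names are hypothetical and do not reflect how the SAIL platform stores longitudinal data.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any


@dataclass
class Measurement:
    """A single data point or fact established at one point in time."""
    name: str
    value: Any
    timestamp: datetime


@dataclass
class Patient:
    """Basic subject information plus the list of measurements it owns."""
    patient_id: str
    measurements: list[Measurement] = field(default_factory=list)


@dataclass
class LongitudinalDataset:
    """A collection of patients and a longitudinal data model."""
    data_model: str  # hypothetical stand-in for the FHIR profile
    patients: list[Patient] = field(default_factory=list)


adam = Patient("adam")
adam.measurements.append(
    Measurement("Blood Pressure", 120, datetime(2022, 1, 3))
)
dataset = LongitudinalDataset(data_model="example-fhir-profile", patients=[adam])
print(len(dataset.patients), len(dataset.patients[0].measurements))
```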

###### **TODO: INSERT NICE INFOGRAPHIC**

#### 3.2.2 Tabular Dataset Structure

A *Tabular Dataset* is a composition of *Data Frames* and a *Tabular Data Model*. A Tabular Dataset has its own id that is unique within the platform. Tabular Datasets can also be saved to disk to establish a breakpoint. A Tabular Data Model contains a composition of Data Frame Data Models.

###### **TODO: INSERT NICE INFOGRAPHIC**


#### 3.2.3 Dataframe Structure

A *Data Frame* is a composition of *Series* and a *Data Frame Model*. A Data Frame Model is a composition of Series Models. Most machine learning operations are performed at the Data Frame level. 

###### **TODO: INSERT NICE INFOGRAPHIC**


#### 3.2.4 Series Structure

A *Series* is a composition of *Values* and a *Series Model*. Most statistical operations use Series as input for their processing. There are two types of Series, each with its own type of Series Model:

An Interval Series Model defines:

- value resolution
- min value
- max value

A Categorical Series Model defines:

- A list of categories that can be present in the series
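The two kinds of Series Model can be sketched as small data classes that know which values they allow. The names below (`IntervalSeriesModel`, `CategoricalSeriesModel`, `accepts`) are hypothetical and chosen for this example only; the bounds used for BMI are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntervalSeriesModel:
    """Describes a numeric series by its resolution and value bounds."""
    name: str
    resolution: float
    min_value: float
    max_value: float

    def accepts(self, value) -> bool:
        # A value fits the model when it lies inside the declared bounds.
        return self.min_value <= value <= self.max_value


@dataclass(frozen=True)
class CategoricalSeriesModel:
    """Describes a series by the categories that may appear in it."""
    name: str
    categories: tuple

    def accepts(self, value) -> bool:
        return value in self.categories


bmi_model = IntervalSeriesModel("bmi_mean", resolution=0.1, min_value=10.0, max_value=80.0)
day_model = CategoricalSeriesModel("day", categories=("Monday", "Tuesday"))

print(bmi_model.accepts(24.3), bmi_model.accepts(120.0))
print(day_model.accepts("Monday"), day_model.accepts("Sunday"))
```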

###### **TODO: INSERT NICE INFOGRAPHIC**


### 3.3 Converting a Longitudinal dataset to a Tabular Dataset

A *Longitudinal Dataset* may be converted to a *Tabular Dataset* using *Anchors* and *Aggregators*.

#### 3.3.1 Anchors

An Anchor is a function that establishes a time point in a patient's history in a Longitudinal Dataset. Simple default anchors would be the date of birth or death. More complicated ones could be the time of initial diagnosis or the start of the third round of treatment.

#### 3.3.2 Aggregator

An Aggregator is a function (or Visitor pattern) that can be inserted into a Longitudinal Dataset to produce a Series with one entry for each patient visited. An Aggregator is date-time agnostic but not sequence agnostic. It can select the first Measurement but not the first after, for example, January 21st 2022. Aggregators cannot refer to each other but they can refer to Anchors. For example, an Aggregator could collect the mean of the systolic blood pressure (measurement) between the end of the second round of chemo treatment (anchor) and the death of the patient (anchor).
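The example at the end of the paragraph above (mean systolic blood pressure between two anchors) can be sketched locally as follows. Everything here is an assumption made for illustration: the tuple layout of the patient history, the helper names `anchor_event` and `aggregate_mean`, and the sample values. It is not the SAIL implementation.

```python
from datetime import datetime
from statistics import mean

# Hypothetical patient history: (measurement name, value, timestamp).
history = [
    ("systolic_bp", 130, datetime(2021, 2, 1)),
    ("chemo_round_end", 2, datetime(2021, 3, 1)),
    ("systolic_bp", 120, datetime(2021, 4, 1)),
    ("systolic_bp", 110, datetime(2021, 5, 1)),
    ("death", None, datetime(2021, 6, 1)),
]


def anchor_event(name, value=None):
    """Build an anchor: a function that maps a history to one time point."""
    def anchor(history):
        for event_name, event_value, timestamp in history:
            if event_name == name and (value is None or event_value == value):
                return timestamp
        return None  # the anchor does not exist for this patient
    return anchor


def aggregate_mean(history, measurement, start_anchor, end_anchor):
    """Mean of one measurement between two anchors, or None if undefined."""
    start, end = start_anchor(history), end_anchor(history)
    if start is None or end is None:
        return None
    values = [
        value
        for name, value, timestamp in history
        if name == measurement and start <= timestamp <= end
    ]
    return mean(values) if values else None


end_of_second_chemo = anchor_event("chemo_round_end", value=2)
death = anchor_event("death")
result = aggregate_mean(history, "systolic_bp", end_of_second_chemo, death)
print(result)
```

The aggregator never names a calendar date itself; it only refers to anchors, which matches the constraint that Aggregators are date-time agnostic. Running it over every patient would yield one entry per patient, i.e. a Series.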

## 3. Working with SAIL Data

**Aim:** Previously we looked at the conceptual framework for working with data in the SAIL ecosystem. In the following sections, we'll look at how that conceptual framework is applied in code. Specifically, we will process a longitudinal dataset to the point where we can perform statistical functions on Series data which has been pulled from the longitudinal dataset.


### 3.1 Making Data Available

**Aim:** This section will detail how to search for available data federations and provision Secure Computation Nodes (SCNs) to hold each dataset in the federation.

#### 3.1.1 Finding a Data Federation

##### TODO

#### 3.1.2 Provisioning a Data Federation

##### TODO

### 3.2 Longitudinal Operations

**Aim:** In this section we will look at how to both read in a longitudinal dataset and filter this into a tabular dataset.

#### 3.2.1 Reading a Longitudinal Dataset

In the previous sections we found the data we were looking for and provisioned SCNs holding that longitudinal data. Now we'd like to gain a reference which we can use to refer to the federated dataset held remotely. The functionality we require in the first step is contained in the <code>data_api</code> component of our orchestrator client.

We run the <code>read_longitudinal_fhirv1</code> function to gain a reference to that federated longitudinal dataset.


In [2]:
from smart_broker_api import data_api

longitudinal_dataset_id = data_api.read_longitudinal_fhirv1(session)

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

We now have a remote, federated dataset which we can refer to with the <code>id</code> returned by the <code>read_longitudinal_fhirv1</code> function. However, longitudinal datasets as outlined above are not suitable to be operated on by machine learning or statistical functions. First we must distill this raw format into either survey/ cross sectional data or series. In the next section we will define the data models required to manage this transformation.

#### 3.2.2 Building Our Desired Tabular Data Structure

In order to convert a longitudinal dataset to a workable format we must process our desired features into a tabular form. The first step is to define a data model which refers to our tabular data. As tabular data models are a composition of data frame models, which are in turn compositions of series models, we must define this from the series level up.

We perform this using functionality contained in the <code>data_model_api</code> component of our orchestrator client. In the code block below, we first define a <code>data_frame</code> model to be populated by a set of series models. We do this with <code>data_model_api.create_date_frame</code>, passing the desired name for our data frame as we create the data frame model. We receive a reference to the newly minted data frame model which we can use to work with it remotely.

###### TODO: Rationalise how we build up data models. We should begin at the smallest components and work our way larger or we should begin at the largest component and populate it with progressively smaller components. Currently we begin in the middle with Dataframe.

In [1]:
from smart_broker_api import data_model_api

data_frame_model_id = data_model_api.create_date_frame(session, "data_frame_0")

ModuleNotFoundError: No module named 'smart_broker_api'

Now that we have a <code>data_frame_model</code> to populate, we can define the individual <code>series</code> which will fit into this model. For the purpose of this demo, we will allocate three new <code>series</code> models to the <code>data_frame_model</code>. We do this with <code>data_model_api.data_frame_add_series</code> from the data_model component of the orchestrator client.

<code>data_frame_add_series</code> takes the following parameters:

- The <code>id</code> of the <code>data_frame_model</code> we are adding these series to
- The <code>name</code> of the series we are specifying
- The <code>observation</code> to be pulled from the longitudinal dataset
- The <code>aggregator</code> which will be used to pull the <code>observation</code> from the longitudinal dataset

Below we add three new series models to our dataframe model, which we may refer to remotely with <code>data_frame_model_id</code>.

###### TODO: Related to the todo above. We should be able to define series models as atomic components which may stand alone, outside of a dataframe model. They may then be added to a dataframe model in the same way that a dataframe model is added to a tabular model.

In [None]:
data_model_api.data_frame_add_series(session, data_frame_model_id, "bmi_mean", "Observation:Body Mass Index", "AgregatorIntervalMean")
data_model_api.data_frame_add_series(session, data_frame_model_id, "bmi_first", "Observation:Body Mass Index", "AgregatorIntervalFirstOccurance")
data_model_api.data_frame_add_series(session, data_frame_model_id, "bmi_last", "Observation:Body Mass Index", "AgregatorIntervalLastOccurance")

Now we're ready to generate our tabular data model. We create an empty tabular data model with <code>create_tabular_data</code> from the <code>data_model_api</code> component of our orchestrator client. We receive a reference to the new data model, <code>data_model_tabular_id</code>, which we can use to work with it remotely.

In [None]:
data_model_tabular_id = data_model_api.create_tabular_data(session)

We may then add the previously defined dataframe model to the empty tabular data model to create a complete set of meta-information relating to the tabular data we'd like to pull from our longitudinal dataset. The functionality we use to achieve this is held in the <code>data_model_api</code> component of the orchestrator client. The specific function is <code>tabular_add_dataframe</code>, which takes the following parameters:

- the <code>id</code> of the dataframe to be added to the tabular data model
- the <code>id</code> of the tabular data model which the dataframe will be added to

Once built, we can refer to this entire structure with <code>data_model_tabular_id</code>.

In [None]:
data_model_tabular_id = data_model_api.tabular_add_dataframe(session, data_frame_model_id, data_model_tabular_id)

As we complete this section, we now have a fully defined data model which contains the following structure:

- Tabular Data model
    - Data Frame Model ('data_frame_0')
        - Series Model ('bmi_mean')
        - Series Model ('bmi_first')
        - Series Model ('bmi_last')

We are now ready to parse the values specified in this data model from our original longitudinal dataset.
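For orientation, the data model built in this section can be mirrored locally as a nested dictionary. This is purely illustrative: the dictionary layout and the `describe` helper are assumptions for this sketch, while the series names, observation strings and aggregator strings are copied from the cells in this section.

```python
# Illustrative local mirror of the data model defined in this section.
# Each series name maps to (observation, aggregator).
tabular_data_model = {
    "data_frame_0": {
        "bmi_mean": ("Observation:Body Mass Index", "AgregatorIntervalMean"),
        "bmi_first": ("Observation:Body Mass Index", "AgregatorIntervalFirstOccurance"),
        "bmi_last": ("Observation:Body Mass Index", "AgregatorIntervalLastOccurance"),
    },
}


def describe(model):
    """Return the model as indented text lines, one per component."""
    lines = ["Tabular Data Model"]
    for frame_name, series in model.items():
        lines.append(f"    Data Frame Model ({frame_name!r})")
        for series_name, (observation, aggregator) in series.items():
            lines.append(
                f"        Series Model ({series_name!r}): {observation} via {aggregator}"
            )
    return lines


print("\n".join(describe(tabular_data_model)))
```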

###### TODO: Replace ugly list with pretty diagram of defined data structure

#### 3.2.3 Parse Tabular Dataset from Longitudinal Dataset Model Specification

In this section we'll be using our newly defined tabular data model to parse from our longitudinal dataset. We'll end up with a tabular dataset which we can perform analytics on downstream. We'll be using the <code>data_api</code> component of the orchestrator to achieve this. Specifically, we'll be using <code>parse_dataset_tabular_from_longitudinal</code>. This takes the following parameters:

- The <code>id</code> of the longitudinal dataset we are parsing from
- dataset_federation_id *(it's not clear this is a necessary parameter)*
- dataset_federation_name *(it's not clear this is a necessary parameter)*
- The <code>id</code> of the tabular data model we defined above

When this has run to completion we will receive a reference to our newly minted tabular dataset.

###### TODO: Why are we using dataset_federation_id and dataset_federation_name if we already have access to the longitudinal data? ADD TO TECHDEBT

In [2]:
from smart_broker_api import data_api

dataset_federation_id = "a892f738-4f6f-11ed-bdc3-0242ac120002"
dataset_federation_name = "r4sep2019_csvv1_20_1"

tabular_dataset_id = data_api.parse_dataset_tabular_from_longitudinal(session, longitudinal_dataset_id, dataset_federation_id, dataset_federation_name, data_model_tabular_id)

ModuleNotFoundError: No module named 'smart_broker_api'

### 3.3 Tabular Dataset Operations

In the previous sections we provisioned a longitudinal dataset, defined a tabular data model and used this data model to parse a tabular dataset from the longitudinal dataset. We will now select an individual data frame from our tabular dataset which will be consumable by machine learning and pre-processing functionality.

#### 3.3.1 Selecting a Data Frame from a Tabular Dataset

In order to select a specific data frame from our tabular dataset, we will use <code>data_frame_tabular_select_data_frame</code> from the <code>data_api</code> component of our orchestrator client. This function takes the following parameters:

- The <code>id</code> of the tabular dataset we will be selecting from
- The <code>name</code> of the particular data frame we would like to process

We will receive the <code>id</code> of the data frame specified by the data frame <code>name</code> provided as a parameter. This will allow us to refer to that particular data frame remotely.


In [None]:
data_frame_id = data_api.data_frame_tabular_select_data_frame(session, tabular_dataset_id, "data_frame_0")

### 3.4 Data Frame Level Operations

In the previous section we selected an individual data frame from the tabular dataset we created earlier. In this section we will learn how to perform operations on that data frame.

#### 3.4.1 Removing Missing values from a Data Frame

Now that we have a reference to an individual data frame, we can perform preprocessing operations on it. In this case we will remove all missing values from the data frame selected in the previous section. To do this, we employ the <code>preprocessing_api</code> component of the orchestrator client. The specific function we'll be using is <code>drop_na_data_frame</code>, which takes the <code>id</code> of the original data frame and returns the <code>id</code> of a new data frame with all missing values removed.


In [3]:
from smart_broker_api import preprocessing_api

no_na_data_frame_id = preprocessing_api.drop_na_data_frame(session, data_frame_id)

ModuleNotFoundError: No module named 'smart_broker_api'
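To make the effect of removing missing values concrete, here is a small local sketch in plain Python. It assumes row-wise removal (any row with at least one missing value is dropped), which may or may not match what <code>drop_na_data_frame</code> does remotely; the helper names and sample values are hypothetical.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_na_rows(rows):
    """Return a new list holding only the rows with no missing values.

    The input list is left untouched, mirroring the way the remote call
    returns the id of a new data frame rather than modifying the original.
    """
    return [row for row in rows if not any(is_missing(v) for v in row.values())]


rows = [
    {"bmi_mean": 24.1, "bmi_first": 23.0, "bmi_last": 25.2},
    {"bmi_mean": None, "bmi_first": 27.5, "bmi_last": 28.0},
    {"bmi_mean": 30.4, "bmi_first": float("nan"), "bmi_last": 31.0},
]
clean_rows = drop_na_rows(rows)
print(len(rows), len(clean_rows))
```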

#### 3.4.2 Selecting a Series from a Data Frame

We can select an individual series from a data frame by once again using the <code>data_api</code> component of the orchestrator client. The function used here is <code>data_frame_select_series</code>. This takes two parameters:

- The <code>id</code> of the data frame being selected from
- The <code>name</code> of the desired series

Below we select two series from our preprocessed dataframe. We receive back the <code>id</code> of each series referred to by <code>name</code>.

In [None]:
series_1_id = data_api.data_frame_select_series(session, no_na_data_frame_id, "bmi_mean")
series_2_id = data_api.data_frame_select_series(session, no_na_data_frame_id, "bmi_last")

### 3.5 Series Level Data Operations

We now have references to two series which have been selected all the way from our original longitudinal dataset. We can use these series to perform some statistical analysis and data visualisation.

#### 3.5.1 Statistical Analysis

In [5]:
from smart_broker_api import statistics_api
from smart_broker_api import visualization_api
import plotly.graph_objects as go

type_distribution="normalunit"
type_ranking="cdf"
alternative = "two-sided"
print(statistics_api.count(session,  series_1_id))
print(statistics_api.mean(session,  series_1_id))
print(statistics_api.chisquare(session,  series_1_id, series_2_id))
print(statistics_api.kolmogorovSmirnovTest(session,  series_1_id, type_distribution, type_ranking))
print(statistics_api.kurtosis(session,  series_1_id))
print(statistics_api.levene_test(session,  series_1_id, series_2_id))
print(statistics_api.mann_whitney_u_test(session,  series_1_id, series_2_id, alternative, type_ranking))
print(statistics_api.min_max(session,  series_1_id))
print(statistics_api.paired_t_test(session,  series_1_id, series_2_id, alternative))
print(statistics_api.pearson(session,  series_1_id, series_2_id, alternative))
print(statistics_api.skewness(session,  series_1_id))
print(statistics_api.spearman(session,  series_1_id, series_2_id, alternative, type_ranking))
print(statistics_api.student_t_test(session,  series_1_id, series_2_id, alternative))
print(statistics_api.variance(session,  series_1_id))
print(statistics_api.welch_t_test(session,  series_1_id, series_2_id, alternative))
print(statistics_api.wilcoxon_signed_rank_test(session,  series_1_id, series_2_id, alternative, type_ranking))

ModuleNotFoundError: No module named 'smart_broker_api'
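As a local point of comparison for a few of the statistics requested above, the sketch below computes count, mean, variance, min/max and a Welch t statistic on two small in-memory lists using only the Python standard library. The sample values are made up, and it is an assumption that the remote functions use the same conventions (for example, sample rather than population variance).

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical local stand-ins for two series of BMI values.
series_1 = [22.0, 24.0, 26.0, 28.0]
series_2 = [23.0, 25.0, 27.0, 33.0]


def welch_t_statistic(a, b):
    """Welch's t statistic: difference of means over the unpooled standard error."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


print("count:", len(series_1))
print("mean:", mean(series_1))
print("variance (sample):", variance(series_1))
print("min_max:", (min(series_1), max(series_1)))
print("welch t:", welch_t_statistic(series_1, series_2))
```

Unlike the remote calls, this sketch needs the raw values in local memory, which is exactly what a federated setup avoids; it is only meant to show what each number represents.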

#### 3.5.2 Data Visualisation

In [None]:

dict_of_fig = visualization_api.histogram(session, series_1_id, 20)
fig = go.Figure(dict_of_fig["figure"])
fig.show()


dict_of_fig = visualization_api.kernel_density_estimation(session, series_1_id, 2)
fig = go.Figure(dict_of_fig["figure"])
fig.show()
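The histogram call above returns a ready-made figure. To show the kind of computation behind such a plot, the sketch below counts values in equal-width bins using plain Python. It assumes the third argument to <code>visualization_api.histogram</code> is a bin count, which the notebook does not confirm; the function name `histogram_counts` and the sample values are hypothetical.

```python
def histogram_counts(values, bin_count):
    """Count how many values fall into each of bin_count equal-width bins.

    The bins span [min(values), max(values)]; the maximum value is placed
    in the last bin. If all values are equal, everything lands in bin 0.
    """
    low, high = min(values), max(values)
    width = (high - low) / bin_count or 1.0  # avoid a zero width
    counts = [0] * bin_count
    for value in values:
        index = min(int((value - low) / width), bin_count - 1)
        counts[index] += 1
    return counts


bmi_values = [1.0, 2.0, 2.0, 3.0, 5.0]
counts = histogram_counts(bmi_values, 2)
print(counts)
```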
