In [None]:
# TEMPORARY CODE

# Until smart broker is installable
import sys
sys.path.append("../../fastapi")

# Until we start using sessions properly
class Session():
    def __init__(self, ip, port):
        self.ip = ip
        self.port = port

session = Session("127.0.0.1", "8000")

# SAIL Researcher User Documentation

**Aim:** This notebook section acts as a primer for new researchers on the SAIL platform.  This contains all necessary background information for working with SAIL data federations.


## 1. Introduction to Dataset Types in the SAIL Ecosystem

SAIL supports computation on four main dataset types, three of which are longitudinal:

- Longitudinal Datasets
    - Longitudinal Time Series
    - Longitudinal Repeated Measurement
    - Longitudinal Events
- Survey/Cross-Sectional Datasets

### 1.1 Longitudinal Datasets

Longitudinal datasets are the raw data format ingested into the SAIL platform. Longitudinal data are not queryable in their raw form and must first be flattened into conventional tabular data before they can be analysed. For our purposes, there are three types of longitudinal dataset.

#### 1.1.1 Longitudinal-repeated-measurement

This type of dataset contains the same set of measurements for different instances or subjects at different points in time. Although the number of measurements taken may differ per subject, the set of measurements is always the same.

| Instance      | Blood Pressure | Heartbeat     |
| :---        |    :----:   |          :---: |
| Adam Monday      | 120       | 70   |
| Adam Tuesday   | 90        | 40      |
| Adam Wednesday      | 80       | 60   |
| Adam Thursday   | 72        | 42      |

#### 1.1.2 Longitudinal Time Series

Time series data has a single repeated measurement over time in a series. Each series is specific to a subject and a type of measurement. Not all series need to be of the same length, and not all data needs to follow the same timestamps.

| Instance      | Monday | Tuesday     |
| :---        |    :----:   |          :---: |
| Anne Heartbeat  | 60       | 70   |

<br>


| Instance      | Monday | Tuesday     |
| :---        |    :----:   |          :---: |
| Anne blood pressure  | 90       | 80   |


#### 1.1.3 Longitudinal Events

Longitudinal event based datasets contain combinations of times, values and types of test. The type of the value field may vary. Outside of the SAIL ecosystem this type of data is also referred to as journal data. Since missing events are simply omitted, this type of data normally does not contain missing fields.

| Patient       | Value     | Test                 |   Day       |
| :---          |    :----: |  :---:               |        :---: |
| Adam          | 120       | Blood Pressure       | Monday      |
| Adam          | 90        | Heartbeat            | Tuesday     |
| Saurabh       | 80        | Blood Pressure       | Monday      |
| Saurabh       | 85        | Fasting Blood Glucose| Wednesday   |
| Stanley       | 72        | Blood Pressure       | Friday      |

### 1.2 Survey/Cross-Sectional Datasets

Survey/cross-sectional datasets are the most common type of data: a list of instances and a list of measurements, where each measurement appears at most once for every instance. This is the format expected by conventional machine learning and data analytics. All longitudinal data must first be processed into this format before it can be analysed by SAIL SAFE functions.


| Instance      | Blood Pressure  | Heartbeat            |   Fasting Blood Glucose       |
| :---          |    :----:       |  :---:               |        :---:                  |
| Adam          | 120             | 90                   | 101                           |
| Anne          | 90              | 50                   | 120                           |
| Jaap          | 80              | 70                   | 115                           |
| Saurabh       | 85              | 80                   | 118                           |
| Stanley       | 72              | 65                   | 110                           |
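To make the flattening step concrete, here is a minimal local sketch in plain Python. It is not part of the SAIL API: the function name `flatten_events` and the rule "keep the last value seen per test" are assumptions chosen purely for illustration, using the event rows from Section 1.1.3.

```python
from collections import defaultdict

# Event-style records: (patient, value, test, day), as in the 1.1.3 table.
events = [
    ("Adam", 120, "Blood Pressure", "Monday"),
    ("Adam", 90, "Heartbeat", "Tuesday"),
    ("Saurabh", 80, "Blood Pressure", "Monday"),
    ("Saurabh", 85, "Fasting Blood Glucose", "Wednesday"),
    ("Stanley", 72, "Blood Pressure", "Friday"),
]


def flatten_events(events):
    """Pivot events into one row per patient (hypothetical helper).

    Assumption: if a test occurs more than once for a patient,
    the last value seen wins.
    """
    rows = defaultdict(dict)
    for patient, value, test, _day in events:
        rows[patient][test] = value
    return dict(rows)


table = flatten_events(events)
print(table["Adam"])
```

Tests that never occurred for a patient are simply absent from that patient's row, which is where missing values in the cross-sectional table come from.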


## 3 SAIL Data Structures

**Aim:** This section provides information on how the different data structures described above are organised in the SAIL platform.

### 3.1 SAIL Data Federation Structure

Federated data relies on two main architectural components: a **dataset** and a corresponding **data model**.

- Datasets contain the raw data which Researchers work with
- Data models contain meta-information relating to the structure of Datasets

###### TODO: INSERT NICE INFOGRAPHIC
###### TODO: THIS SHOULD BE ENRICHED WITH MORE DETAIL

### 3.2 SAIL Data Federation Hierarchy

Federated Data adheres to the following Object types:

- Longitudinal Dataset
- Tabular Dataset
- Dataframe
- Series

<img src="images/dataset-lifecycle.png" alt="Data Federation Hierarchy" width="800"/>

In the sections below, each data type will be discussed.

###### TODO: INSERT NICER INFOGRAPHIC

#### 3.2.1 Longitudinal Dataset Structure

A *Longitudinal Dataset* is a collection of *Patients* and a *Longitudinal Data Model*.
- A Longitudinal Data Model is the FHIR profile which defines how the longitudinal dataset is structured
- A Patient owns a list of *Measurements* and some basic information concerning the data subject. 
- A Measurement is a timepoint in patient history where a single data point was measured or a single fact was established.

###### **TODO: INSERT NICE INFOGRAPHIC**
###### **TODO: THIS DESCRIPTION NEEDS MORE. DETAIL IS NOT PRESENT IN INTERNAL DOCUMENTATION**

#### 3.2.2 Tabular Dataset Structure

A *Tabular Dataset* is a composition of *Data Frames* and a *Tabular Data Model*. A Tabular Dataset has its own id that is unique within the platform. Tabular Datasets can also be saved to disk to establish a breakpoint. A Tabular Data Model contains a composition of Data Frame Models.

###### **TODO: INSERT NICE INFOGRAPHIC**
###### **TODO: THIS DESCRIPTION NEEDS MORE. DETAIL IS NOT PRESENT IN INTERNAL DOCUMENTATION**


#### 3.2.3 Dataframe Structure

A *Data Frame* is a composition of *Series* and a *Data Frame Model*. A Data Frame Model is a composition of Series Models. Most machine learning operations are performed at the Data Frame level. This is the data type referred to in Section 1.2.

###### **TODO: INSERT NICE INFOGRAPHIC**
###### **TODO: THIS DESCRIPTION NEEDS MORE. DETAIL IS NOT PRESENT IN INTERNAL DOCUMENTATION**


#### 3.2.4 Series Structure

A *Series* is a composition of *Values* and a *Series Model*. A Data Frame model is a composition of Series Models. Most statistical operations use Series as input for their processing. There are two types of Series with two types of Series Model:

An Interval Series Model defines:

- value resolution
- minimum value
- maximum value

A Categorical Series Model defines:

- A list of categories that can be present in the series
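As a rough illustration of the two Series Model types, the sketch below models them as Python dataclasses. The class names, field names, the `accepts` method and the example bounds are all hypothetical and are not taken from the SAIL codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntervalSeriesModel:
    """Hypothetical model for a numeric series."""
    name: str
    resolution: float
    min_value: float
    max_value: float

    def accepts(self, value):
        # A value is valid if it lies inside the declared range.
        return self.min_value <= value <= self.max_value


@dataclass(frozen=True)
class CategoricalSeriesModel:
    """Hypothetical model for a categorical series."""
    name: str
    categories: tuple

    def accepts(self, value):
        # A value is valid if it is one of the declared categories.
        return value in self.categories


# Illustrative bounds and categories only.
bmi = IntervalSeriesModel("bmi_mean", resolution=0.1, min_value=10.0, max_value=80.0)
smoker = CategoricalSeriesModel("smoker", categories=("yes", "no", "unknown"))

print(bmi.accepts(23.4), smoker.accepts("maybe"))
```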

###### **TODO: INSERT NICE INFOGRAPHIC**
###### **TODO: THIS DESCRIPTION NEEDS MORE. DETAIL IS NOT PRESENT IN INTERNAL DOCUMENTATION**


### 3.3 Converting a Longitudinal dataset to a Tabular Dataset

A *Longitudinal Dataset* may be converted to a *Tabular Dataset* using *Anchors* and *Aggregators*.

#### 3.3.1 Anchors

An *Anchor* is a function that establishes a time point in a patient's history. Simple default anchors would be date of birth or date of death. More complicated ones could be the time of initial diagnosis or the start of the third round of treatment.

###### TODO: NEEDS MORE. POINT, EXPLAIN, EXAMPLE.

#### 3.3.2 Aggregator

An Aggregator is a function (or Visitor pattern) that can be inserted into a Longitudinal Dataset to produce a Series with one entry for each patient visited. An Aggregator is date-time agnostic but not sequence agnostic. It can select the first measurement but not the first after, for example, January 21st 2022. Aggregators cannot refer to each other but they can refer to Anchors. For example, an Aggregator could collect the mean of the systolic blood pressure (measurement) between the end of the second round of chemo treatment (anchor) and the death of the patient (anchor).
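The relationship between Anchors and Aggregators can be sketched locally as follows. This is an illustration only: the patient record layout and the helper names (`anchor_death`, `anchor_end_of_chemo_round`, `aggregate_mean`) are assumptions and do not reflect how SAIL implements them.

```python
from datetime import date
from statistics import mean

# Hypothetical patient record: measurements are (date, name, value) tuples.
patient = {
    "death": date(2022, 3, 1),
    "measurements": [
        (date(2021, 1, 10), "systolic_bp", 130),
        (date(2021, 6, 15), "systolic_bp", 120),
        (date(2021, 7, 1), "chemo_round_end", 2),
        (date(2021, 9, 20), "systolic_bp", 110),
        (date(2021, 12, 1), "systolic_bp", 100),
    ],
}


def anchor_death(p):
    """Anchor: the patient's date of death."""
    return p["death"]


def anchor_end_of_chemo_round(round_number):
    """Build an anchor for the end of a given chemo round."""
    def anchor(p):
        for when, name, value in p["measurements"]:
            if name == "chemo_round_end" and value == round_number:
                return when
        return None
    return anchor


def aggregate_mean(p, measurement, start_anchor, end_anchor):
    """Aggregator: mean of a measurement between two anchors (inclusive)."""
    start, end = start_anchor(p), end_anchor(p)
    if start is None or end is None:
        return None
    values = [v for when, name, v in p["measurements"]
              if name == measurement and start <= when <= end]
    return mean(values) if values else None


result = aggregate_mean(patient, "systolic_bp",
                        anchor_end_of_chemo_round(2), anchor_death)
print(result)
```

Applied to every patient in a longitudinal dataset, an aggregator of this shape would yield one value per patient, which is exactly the shape of a Series.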

## 3. Working with SAIL Data

**Aim:** Previously we looked at the conceptual framework for working with data in the SAIL ecosystem. In the following sections, we'll look at how that conceptual framework is applied in system architecture. We will work through the full Researcher lifecycle with examples in code.


### 3.1 Making Data Available

This section will detail how to search for available data federations and provision Secure Computation Nodes (SCNs) to hold each dataset in the federation.

#### 3.1.1 Finding a Data Federation

##### TODO

#### 3.1.2 Provisioning a Data Federation

##### TODO

### 3.2 Longitudinal Dataset Operations

Here, we'll look at the operations which can be performed on longitudinal data. Currently this is restricted to reading in longitudinal data and transforming it into a tabular structure.

#### 3.2.1 Reading a Longitudinal Dataset

In the previous sections we found the data we were looking for and provisioned SCNs holding that data. Now we'd like to gain a reference which we can use to refer to the federated longitudinal dataset held remotely. The functionality we require in the first step is contained in the <code>data_api</code> component of our orchestrator client.

We run the <code>read_longitudinal_fhirv1</code> function to gain a reference to that federated longitudinal dataset.


In [None]:
from smart_broker_api import data_api

longitudinal_dataset_id = data_api.read_longitudinal_fhirv1(session)

We now have a federated longitudinal dataset which we can refer to with the <code>id</code> returned by the <code>read_longitudinal_fhirv1</code> function. However, longitudinal datasets as outlined above are not suitable to be operated on by machine learning or statistical functions. We must distill this raw format into either survey/ cross sectional data or a series. In the next section we will define the data models required to manage this transformation.

#### 3.2.2 Building Our Desired Tabular Data Structure

In order to convert a longitudinal dataset to a workable format, we must process our desired features into a tabular form. The first step is to define a data model which refers to our tabular data. As tabular data models are a composition of data frame models which are in turn compositions of series models, we must define this from the series level up.

We perform this using functionality contained in the <code>data_model_api</code> component of our orchestrator client. In the code block below, we define a data frame model to be populated by a set of series models. We do this with <code>create_date_frame</code>, passing the desired name for our data frame. We then receive a reference to the newly minted data frame model, which we can use to work with it remotely.

###### TODO: Rationalise how we build up data models. We should begin at the smallest components and work our way larger or we should begin at the largest component and populate it with progressively smaller components. Currently we begin in the middle with Dataframe.

In [None]:
from smart_broker_api import data_model_api

data_frame_model_id = data_model_api.create_date_frame(session, "data_frame_0")

Now that we have a data frame model to populate, we define the individual series models which our data frame model will be composed of. For the purpose of this demo, we will allocate three new series models to the data_frame_model we created in the previous cell. We do this with <code>data_frame_add_series</code> from the <code>data_model_api</code> component of the orchestrator client.

<code>data_frame_add_series</code> takes the following parameters:

- The <code>id</code> of the <code>data_frame_model</code> we are adding these series to
- The <code>name</code> of the series we are specifying
- The <code>observation</code> to be pulled from the longitudinal dataset
- The <code>aggregator</code> which will be used to pull the <code>observation</code> from the longitudinal dataset

Below we add three new series models to our dataframe model, which we may refer to remotely with <code>data_frame_model_id</code>.

###### TODO: Related to the todo above. We should be able to define series models as atomic components which may stand alone, outside of a dataframe model. They may then be added to a dataframe model in the same way that a dataframe model is added to a tabular model.

In [None]:
data_model_api.data_frame_add_series(session, data_frame_model_id, "bmi_mean", "Observation:Body Mass Index", "AgregatorIntervalMean")
data_model_api.data_frame_add_series(session, data_frame_model_id, "bmi_first", "Observation:Body Mass Index", "AgregatorIntervalFirstOccurance")
data_model_api.data_frame_add_series(session, data_frame_model_id, "bmi_last", "Observation:Body Mass Index", "AgregatorIntervalLastOccurance")

'747a6a90-655f-443c-8ec0-442790f9b186'

Now we're ready to generate our tabular data model. We create an empty tabular data model with <code>create_tabular_data</code> from the <code>data_model_api</code> component of our orchestrator client. We receive a reference to the new data model, <code>data_model_tabular_id</code>, which we can use to work with it remotely.

In [None]:
data_model_tabular_id = data_model_api.create_tabular_data(session)

We may then add the previously defined dataframe model to the empty tabular data model to create a complete set of meta-information relating to the tabular data we'd like to pull from our longitudinal dataset. The functionality we use to achieve this is held in the <code>data_model_api</code> component of the orchestrator client. The specific function is <code>tabular_add_dataframe</code>, which takes the following parameters:

- the <code>id</code> of the dataframe to be added to the tabular data model
- the <code>id</code> of the tabular data model which the dataframe will be added to

Once built, we can refer to this entire structure with <code>data_model_tabular_id</code>.

In [None]:
data_model_tabular_id = data_model_api.tabular_add_dataframe(session, data_frame_model_id, data_model_tabular_id)

As we complete this section, we now have a fully defined data model which contains the following structure:

- Tabular Data model
    - Data Frame Model ('data_frame_0')
        - Series Model ('bmi_mean')
        - Series Model ('bmi_first')
        - Series Model ('bmi_last')

We are now ready to parse the values specified in this data model from our original longitudinal dataset.

###### TODO: Replace ugly list with pretty diagram of defined data structure

#### 3.2.3 Parse Tabular Dataset from Longitudinal Dataset Model Specification

In this section we'll be using our newly defined tabular data model to parse from our longitudinal dataset. We'll end up with a tabular dataset which we can perform analytics on downstream. We'll be using the <code>data_api</code> component of the orchestrator to achieve this. Specifically, we'll be using <code>parse_dataset_tabular_from_longitudinal</code>. This takes the following parameters:

- The <code>id</code> of the longitudinal dataset we are parsing from
- dataset_federation_id *(It's not clear this is a necessary parameter)*
- dataset_federation_name *(It's not clear this is a necessary parameter)*
- The <code>id</code> of the tabular data model we defined above

When this has run to completion we will receive a reference to our newly minted tabular dataset.

###### TODO: Why are we using dataset_federation_id and dataset_federation_name if we already have access to the longitudinal data? ADD TO TECHDEBT

In [None]:
from smart_broker_api import data_api

dataset_federation_id = "a892f738-4f6f-11ed-bdc3-0242ac120002"
dataset_federation_name = "r4sep2019_csvv1_20_1"

tabular_dataset_id = data_api.parse_dataset_tabular_from_longitudinal(session, longitudinal_dataset_id, dataset_federation_id, dataset_federation_name, data_model_tabular_id)

### 3.3 Tabular Dataset Operations

In the previous sections we provisioned a longitudinal dataset, defined a tabular data model and used this data model to parse a tabular dataset from the longitudinal dataset. We will now select an individual data frame from our tabular dataset which will be consumable by machine learning and pre-processing functionality.

#### 3.3.1 Selecting a Data Frame from a Tabular Dataset

In order to select a specific data frame from our tabular dataset, we will use <code>data_frame_tabular_select_data_frame</code> from the <code>data_api</code> component of our orchestrator client. This function takes the following parameters:

- The <code>id</code> of the tabular dataset we will be selecting from
- The <code>name</code> of the particular data frame we would like to process

We will receive the <code>id</code> of the data frame specified by the data frame <code>name</code> provided as a parameter. This will allow us to refer to that particular data frame remotely.


In [None]:
data_frame_id = data_api.data_frame_tabular_select_data_frame(session, tabular_dataset_id, "data_frame_0")

### 3.4 Data Frame Level Operations

In the previous section we selected an individual data frame from the tabular dataset we created earlier. In this section we will learn how to perform operations on that data frame.

#### 3.4.1 Removing Missing values from a Data Frame

Now that we have a reference to an individual dataframe, we can perform preprocessing operations. In this case we will remove all missing values from the data frame selected in the previous section. To do this, we employ the <code>preprocessing_api</code> component of the orchestrator client. The specific function we'll be using is <code>drop_na_data_frame</code>, which takes the <code>id</code> of the original data frame and returns the <code>id</code> of a new dataframe with all missing values removed.


In [None]:
from smart_broker_api import preprocessing_api

no_na_data_frame_id = preprocessing_api.drop_na_data_frame(session, data_frame_id)

#### 3.4.2 Selecting a Series from a Data Frame

We can select an individual series from a data frame by once again using the <code>data_api</code> component of the orchestrator client. The function used here is <code>data_frame_select_series</code>. This takes two parameters:

- The <code>id</code> of the data frame being selected from
- The <code>name</code> of the desired series

Below we select two series from our preprocessed dataframe. We receive back the <code>id</code> of each series referred to by <code>name</code>.

In [None]:
series_1_id = data_api.data_frame_select_series(session, no_na_data_frame_id, "bmi_mean")
series_2_id = data_api.data_frame_select_series(session, no_na_data_frame_id, "bmi_last")

### 3.5 Series Level Data Operations

We now have a reference to two series which have been selected all the way from our original longitudinal dataset. We can use these series to perform some statistical analysis and data visualisation.

#### 3.5.1 Statistical Analysis

# 1. Count

The function used here is <code>statistics_api.count</code>. This takes two parameters:

- <code>session</code>: the session object created at the start of this notebook
- <code>series_1_id</code>: the <code>id</code> of the series to count

This function returns the length of a federated series, in the same way that <code>len([1, 2, 3, 4])</code> returns 4 in Python.

In [None]:
from smart_broker_api import statistics_api

# Shared settings reused by the statistical tests later in this section
type_distribution = "normalunit"
type_ranking = "cdf"
alternative = "two-sided"

print(statistics_api.count(session, series_1_id))

# 2. Mean

The function used here is <code>statistics_api.mean</code>. This takes two parameters:

- <code>session</code>: the session object created at the start of this notebook
- <code>series_1_id</code>: the <code>id</code> of the series to average

The mean of a list of numbers is the sum of all of the numbers divided by how many numbers there are.

## 2.1 Calculating Mean

$\bar{x}=\frac{1}{n}\left(\sum_{i=1}^{n} x_{i}\right)=\frac{x_{1}+x_{2}+\cdots+x_{n}}{n}$

$\frac{4+36+45+50+75}{5}=\frac{210}{5}=42$
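The worked example above can be checked locally with plain Python. This runs on an ordinary list rather than a federated series, so it illustrates the arithmetic only, not the SAIL call.

```python
values = [4, 36, 45, 50, 75]

count = len(values)          # local analogue of a count
mean = sum(values) / count   # sum divided by the number of values

print(count, mean)  # prints: 5 42.0
```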

In [None]:
print(statistics_api.mean(session,  series_1_id))

# 3. Chi-Square 

The function used here is <code>statistics_api.chisquare</code>. This takes three parameters:

- <code>session</code>: the session object created at the start of this notebook
- <code>series_1_id</code>: the <code>id</code> of the first series
- <code>series_2_id</code>: the <code>id</code> of the second series

A chi-square (χ2) statistic is a test that measures how a model compares to actual observed data. The data used in calculating a chi-square statistic must be random, raw, mutually exclusive, drawn from independent variables, and drawn from a large enough sample. For example, the results of tossing a fair coin meet these criteria.

Chi-square tests are often used to test hypotheses. The chi-square statistic compares the size of any discrepancies between the expected results and the actual results, given the size of the sample and the number of variables in the relationship.

## 3.1 Calculating Chi-square

$\chi^{2}=\sum \frac{\left(O_{i}-E_{i}\right)^{2}}{E_{i}}$

- $O_{i}$ = observed value

- $E_{i}$ = expected value
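A minimal local sketch of the chi-square formula, using the coin-toss example mentioned above. The function name `chi_square_statistic` and the toss counts are illustrative assumptions; the code operates on plain lists, not federated series.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O_i - E_i)^2 / E_i over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# 100 tosses of a supposedly fair coin: 60 heads, 40 tails.
observed = [60, 40]
expected = [50, 50]

chi2 = chi_square_statistic(observed, expected)
print(chi2)  # prints: 4.0
```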

In [None]:
print(statistics_api.chisquare(session,  series_1_id, series_2_id))

# 4. Kolmogorov–Smirnov test

The function used here is <code>statistics_api.kolmogorovSmirnovTest</code>. This takes four parameters:

- <code>session</code>: the session object created at the start of this notebook
- <code>series_1_id</code>: the <code>id</code> of the series to test
- <code>type_distribution</code>: the type of reference distribution
- <code>type_ranking</code>: the type of ranking

The Kolmogorov–Smirnov test is an efficient way to determine whether a sample differs significantly from a reference distribution, or whether two samples differ significantly from each other. It is often used to check the uniformity of random numbers: uniformity is one of the most important properties of any random number generator, and the Kolmogorov–Smirnov test can be used to test it.

The Kolmogorov–Smirnov test (KS test) is a bit more complex than a Student’s t-test and allows you to detect patterns that a t-test cannot.

From Wikipedia:

“The Kolmogorov–Smirnov statistic quantifies a distance between the empirical distribution function of the sample and the cumulative distribution function of the reference distribution, or between the empirical distribution functions of two samples.”


## 4.1 How to perform KS Test?

1. State the null hypothesis that both random variables come from the same distribution.
2. State the alternative hypothesis that the random variables do not come from the same distribution.
3. Choose a significance level $\alpha$.
4. Calculate the $D$ value using the following formula: $D_{n,m}=\sup_{x}\left|F_{n}(x)-F_{m}(x)\right|$
5. The null hypothesis is rejected at level $\alpha$ if $$D_{n,m} > c(\alpha) \sqrt{\frac{n+m}{nm}}$$ where $c(\alpha)=\sqrt{-\frac{1}{2}\ln\frac{\alpha}{2}}$ and $n$, $m$ are the numbers of points in the two samples.
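The two-sample procedure can be sketched locally as below. This is an illustration on plain lists: the function names are hypothetical, and the critical value uses the standard large-sample form $c(\alpha)=\sqrt{-\frac{1}{2}\ln(\alpha/2)}$. It is not the SAIL implementation, which takes a single series and a reference distribution.

```python
import math
from bisect import bisect_right


def ks_two_sample_statistic(sample_a, sample_b):
    """Largest gap between the two empirical distribution functions."""
    a, b = sorted(sample_a), sorted(sample_b)
    n, m = len(a), len(b)
    d = 0.0
    # The gap can only change at observed data points.
    for x in a + b:
        f_a = bisect_right(a, x) / n  # fraction of a that is <= x
        f_b = bisect_right(b, x) / m  # fraction of b that is <= x
        d = max(d, abs(f_a - f_b))
    return d


def ks_critical_value(n, m, alpha=0.05):
    """Threshold above which the null hypothesis is rejected."""
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    return c_alpha * math.sqrt((n + m) / (n * m))


sample_a = [1, 2, 3, 4, 5]
sample_b = [6, 7, 8, 9, 10]

d = ks_two_sample_statistic(sample_a, sample_b)
critical = ks_critical_value(len(sample_a), len(sample_b))
print(d, d > critical)
```

For samples this small the asymptotic critical value is only approximate; exact tables are normally preferred.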

In [None]:
print(statistics_api.kolmogorovSmirnovTest(session,  series_1_id, type_distribution, type_ranking))

# 5. Kurtosis

The function used here is <code>statistics_api.kurtosis</code>. This takes two parameters:
 
- <code>session</code>: the session object created at the start of this notebook
- <code>series_1_id</code>: the <code>id</code> of the series to analyse

Kurtosis is a measure of the tailedness of a distribution. Tailedness is how often outliers occur. Excess kurtosis is the tailedness of a distribution relative to a normal distribution.

Distributions with medium kurtosis (medium tails) are mesokurtic.

Distributions with low kurtosis (thin tails) are platykurtic.

Distributions with high kurtosis (fat tails) are leptokurtic.

Tails are the tapering ends on either side of a distribution. They represent the probability or frequency of values that are extremely high or low compared to the mean. In other words, tails represent how often outliers occur.

## 5.1 Calculating kurtosis

Mathematically speaking, kurtosis is the standardized fourth moment of a distribution. Moments are a set of measurements that tell you about the shape of a distribution.

Moments are standardized by dividing them by the standard deviation raised to the appropriate power.

The following formula describes the kurtosis of a population:

Kurt $=\tilde{\mu}_{4}=\frac{\mu_{4}}{\sigma^{4}}$

Where:

- $\tilde{\mu}_{4}$ is the standardized fourth moment

- $\mu_{4}$ is the unstandardized central fourth moment

- $\sigma$ is the standard deviation
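A short local sketch of the population kurtosis formula, computed on a plain list. The function name `population_kurtosis` is an assumption for illustration and is unrelated to the SAIL API.

```python
def population_kurtosis(values):
    """Fourth central moment divided by the squared variance."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((x - mu) ** 2 for x in values) / n  # variance (sigma squared)
    m4 = sum((x - mu) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2


kurt = population_kurtosis([1, 2, 3, 4, 5])
excess = kurt - 3  # a normal distribution has kurtosis 3
print(kurt, excess)
```

Here the kurtosis is about 1.7, so the excess kurtosis is negative and this small evenly spread sample is platykurtic.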

In [None]:
print(statistics_api.kurtosis(session,  series_1_id))

# 6. Levene's test

The function used here is <code>statistics_api.levene_test</code>. This takes three parameters:
 
- <code>session</code>: the session object created at the start of this notebook
- <code>series_1_id</code>: the <code>id</code> of the first series
- <code>series_2_id</code>: the <code>id</code> of the second series

Levene's test is equivalent to a one-way between-groups analysis of variance (ANOVA) in which the dependent variable is the absolute value of the difference between a score and the mean of its group. The test statistic, W, is equivalent to the F statistic that would be produced by such an ANOVA, and is defined as follows:

## 6.1 Calculating Levene's test


$$
W=\frac{(N-k)}{(k-1)} \cdot \frac{\sum_{i=1}^{k} N_{i}\left(Z_{i \cdot}-Z_{\cdot \cdot}\right)^{2}}{\sum_{i=1}^{k} \sum_{j=1}^{N_{i}}\left(Z_{i j}-Z_{i \cdot}\right)^{2}}
$$

where

- $k$ is the number of different groups to which the sampled cases belong,

- $N_{i}$ is the number of cases in the $i$-th group,

- $N$ is the total number of cases in all groups,

- $Y_{i j}$ is the value of the measured variable for the $j$-th case from the $i$-th group,

- $Z_{i j}$ is the absolute deviation of $Y_{i j}$ from the centre of its group:

$$
Z_{i j}= \begin{cases}\left|Y_{i j}-\bar{Y}_{i \cdot}\right|, & \bar{Y}_{i \cdot} \text { is a mean of the } i \text {-th group, } \\ \left|Y_{i j}-\tilde{Y}_{i \cdot}\right|, & \tilde{Y}_{i \cdot} \text { is a median of the } i \text {-th group, }\end{cases}
$$

- $Z_{i \cdot}$ is the mean of the $Z_{i j}$ for group $i$:

$$
Z_{i \cdot}=\frac{1}{N_{i}} \sum_{j=1}^{N_{i}} Z_{i j}
$$

- $Z_{\cdot \cdot}$ is the mean of all $Z_{i j}$:

$$
Z_{\cdot \cdot}=\frac{1}{N} \sum_{i=1}^{k} \sum_{j=1}^{N_{i}} Z_{i j}
$$
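The W statistic can be sketched locally as follows, using the group mean as the centre. The function name `levene_w` and the sample data are illustrative assumptions; this works on plain lists and is not the SAIL implementation.

```python
def levene_w(groups):
    """Levene's W using absolute deviations from each group mean."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)

    # Z_ij: absolute deviation of each value from its group mean.
    z = []
    for g in groups:
        g_mean = sum(g) / len(g)
        z.append([abs(y - g_mean) for y in g])

    z_group_means = [sum(zi) / len(zi) for zi in z]
    z_grand_mean = sum(sum(zi) for zi in z) / n_total

    between = sum(len(zi) * (zm - z_grand_mean) ** 2
                  for zi, zm in zip(z, z_group_means))
    within = sum((zij - zm) ** 2
                 for zi, zm in zip(z, z_group_means)
                 for zij in zi)
    return (n_total - k) / (k - 1) * between / within


w = levene_w([[1, 2, 3, 4, 5], [2, 4, 6, 8, 10]])
print(w)
```

Note that the sketch divides by the within-group sum of squares, so it fails if every deviation inside every group is identical.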

In [None]:
print(statistics_api.levene_test(session,  series_1_id, series_2_id))

# 7. Mann-Whitney U test

The function used here is <code>statistics_api.mann_whitney_u_test</code>. This takes five parameters:

- <code>session</code>: the session object created at the start of this notebook
- <code>series_1_id</code>: the <code>id</code> of the first series
- <code>series_2_id</code>: the <code>id</code> of the second series
- <code>alternative</code>: the alternative hypothesis to test (set to <code>"two-sided"</code> earlier in this section)
- <code>type_ranking</code>: the type of ranking (set to <code>"cdf"</code> earlier in this section)

The Mann-Whitney U test is the non-parametric alternative to the independent-samples t-test. It tests whether two independent samples come from the same population, that is, whether values in one sample tend to be larger than values in the other. It is usually used when the data are ordinal or when the assumptions of the t-test are not met. The Mann-Whitney U can be difficult to interpret because the results are presented as group rank differences rather than group mean differences.

## 7.1 Calculation of the Mann-Whitney U:

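$$
U=n_{1} n_{2}+\frac{n_{2}\left(n_{2}+1\right)}{2}-\sum_{i=n_{1}+1}^{n_{1}+n_{2}} R_{i}
$$

Where:

- $U$ = Mann-Whitney U statistic

- $n_{1}$ = size of sample one

- $n_{2}$ = size of sample two

- $R_{i}$ = rank of the $i$-th observation in the pooled sample, where observations $n_{1}+1, \ldots, n_{1}+n_{2}$ belong to sample two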
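A local sketch of the U calculation on plain lists. It pools both samples, ranks them (tied values share the average of their ranks), and subtracts the rank sum of sample two. The helper names are hypothetical and this is not the SAIL implementation.

```python
from bisect import bisect_left, bisect_right


def average_ranks(values):
    """1-based ranks; tied values receive the mean of their positions."""
    ordered = sorted(values)
    return [(bisect_left(ordered, v) + bisect_right(ordered, v) + 1) / 2
            for v in values]


def mann_whitney_u(sample_1, sample_2):
    """U = n1*n2 + n2*(n2+1)/2 - (rank sum of sample two)."""
    n1, n2 = len(sample_1), len(sample_2)
    ranks = average_ranks(list(sample_1) + list(sample_2))
    rank_sum_2 = sum(ranks[n1:])  # ranks belonging to sample two
    return n1 * n2 + n2 * (n2 + 1) / 2 - rank_sum_2


u = mann_whitney_u([1, 2, 3], [4, 5, 6])
print(u)  # prints: 0.0
```

A value of 0 (or of $n_1 n_2$ with the samples swapped) means the two samples do not overlap at all.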



In [None]:
print(statistics_api.mann_whitney_u_test(session,  series_1_id, series_2_id, alternative, type_ranking))

# 8. Min-Max Function

The function used here is <code>statistics_api.min_max</code>. This takes two parameters:

- <code>session</code>: the session object created at the start of this notebook
- <code>series_1_id</code>: the <code>id</code> of the series to analyse

This is a simple statistical function which returns the minimum and maximum values of the series.

In [None]:
print(statistics_api.min_max(session,  series_1_id))


# 9. Paired T-Test

The function used here is <code>statistics_api.paired_t_test</code>. This takes four parameters:

- <code>session</code>: the session object created at the start of this notebook
- <code>series_1_id</code>: the <code>id</code> of the first series
- <code>series_2_id</code>: the <code>id</code> of the second series
- <code>alternative</code>: the alternative hypothesis to test (set to <code>"two-sided"</code> earlier in this section)

The paired sample t-test, sometimes called the dependent sample t-test, is a statistical procedure used to determine whether the mean difference between two sets of observations is zero. In a paired sample t-test, each subject or entity is measured twice, resulting in pairs of observations. Common applications of the paired sample t-test include case-control studies and repeated-measures designs. Suppose you are interested in evaluating the effectiveness of a company training program. One approach would be to measure the performance of a sample of employees before and after completing the program, and analyse the differences using a paired sample t-test.

## 9.1 How to perform Paired T-Test

- Null hypothesis: $H_{0}: \mu_{d}=0$

- Alternative hypothesis: $H_{1}: \mu_{d} \neq 0$

- Point estimate: $\bar{d}$ (the sample mean difference) is the point estimate of $\mu_{d}$.

- Test statistic:

$$
t=\frac{\bar{d}-\mu_{d}}{s_{d} / \sqrt{n}}
$$

- Note that the standard error of $\bar{d}$ is $s_{d} / \sqrt{n}$, where $s_{d}$ is the standard deviation of the differences.

- We compare the t-statistic to the critical value of $t$ (which can be found in a t-distribution table using $n-1$ degrees of freedom and the pre-selected level of significance, $\alpha$). If the absolute value of the calculated t-statistic is larger than the critical value of $t$, we reject the null hypothesis.

- Confidence intervals

We can also calculate a $95 \%$ confidence interval around the mean difference. The general form for a confidence interval around a mean difference is

$$
\bar{d} \pm t_{(n-1, \text { two-sided } \alpha)}\left(s_{d} / \sqrt{n}\right)
$$

For a two-sided $95 \%$ confidence interval, use a table of the t-distribution to select the appropriate critical value of $t$ for the two-sided $\alpha=0.05$.
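The test statistic can be sketched locally with the standard library. The before/after scores are made-up illustrative data in the spirit of the training-program example, and `paired_t_statistic` is a hypothetical helper, not the SAIL function.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(before, after, mu_d=0.0):
    """t = (d_bar - mu_d) / (s_d / sqrt(n)) for paired observations."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    d_bar = mean(diffs)   # sample mean difference
    s_d = stdev(diffs)    # sample standard deviation (n - 1 denominator)
    return (d_bar - mu_d) / (s_d / math.sqrt(n))


# Illustrative scores for four employees before and after training.
before = [10, 12, 14, 16]
after = [11, 14, 15, 18]

t_stat = paired_t_statistic(before, after)
print(t_stat)
```

The resulting statistic would then be compared with the critical value of $t$ for $n-1=3$ degrees of freedom.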

In [None]:
print(statistics_api.paired_t_test(session,  series_1_id, series_2_id, alternative))


# 10. Pearson

The function used here is <code>statistics_api.pearson</code>. This takes four parameters:

- <code>session</code>: the session object created at the start of this notebook
- <code>series_1_id</code>: the <code>id</code> of the first series
- <code>series_2_id</code>: the <code>id</code> of the second series
- <code>alternative</code>: the alternative hypothesis to test (set to <code>"two-sided"</code> earlier in this section)

The Pearson correlation coefficient $(r)$ is the most widely used correlation coefficient and is known by many names:

- Pearson's $r$

- Bivariate correlation

- Pearson product-moment correlation coefficient (PPMCC)

- The correlation coefficient

The correlation coefficient is a measure of linear correlation between two sets of data. It is the ratio between the covariance of two variables and the product of their standard deviations; thus, it is essentially a normalized measurement of the covariance, such that the result always has a value between $-1$ and $1$.

As with covariance itself, the measure can only reflect a linear correlation of variables, and ignores many other types of relationships or correlations. As a simple example, one would expect the age and height of a sample of teenagers from a high school to have a Pearson correlation coefficient significantly greater than 0, but less than 1 (as 1 would represent an unrealistically perfect correlation).

The Pearson correlation coefficient is a descriptive statistic, meaning that it summarizes the characteristics of a dataset. Specifically, it describes the strength and direction of the linear relationship between two quantitative variables.

## 10.1 Calculating Pearson coefficient.

$r=\frac{\sum\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{\sqrt{\sum\left(x_{i}-\bar{x}\right)^{2} \sum\left(y_{i}-\bar{y}\right)^{2}}}$

- $r$ = correlation coefficient

- $x_{i}$ = values of the $x$-variable in a sample

- $\bar{x}$ = mean of the values of the $x$-variable

- $y_{i}$ = values of the $y$-variable in a sample

- $\bar{y}$ = mean of the values of the $y$-variable
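To make the formula concrete, here is a minimal pure-Python sketch of it. The helper name `pearson_r` and the sample values are illustrative assumptions; this is not the implementation behind <code>statistics_api.pearson</code>, which computes the result across the federation.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r for two equal-length samples (illustrative helper)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Numerator: sum of the products of the deviations from each mean
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the summed squared deviations
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("r is undefined when a sample has zero variance")
    return numerator / math.sqrt(ss_x * ss_y)


# Made-up ages and heights, echoing the teenagers example above
ages = [13, 14, 15, 16, 17]
heights_cm = [151, 158, 160, 169, 171]
r = pearson_r(ages, heights_cm)
print(round(r, 3))
```

For these made-up values the result is strongly positive but below 1, as the teenagers example suggests. A perfectly linear pair such as `[1, 2, 3]` and `[2, 4, 6]` gives exactly 1.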


In [None]:
print(statistics_api.pearson(session,  series_1_id, series_2_id, alternative))

# 11. Skewness

The function used here is <code>statistics_api.skewness</code>. This takes two parameters:
 
- The <code>session</code> identifying the active session
- The <code>series_1_id</code> of the desired series


Skewness measures the asymmetry of a distribution. On a bell curve, skewness appears when data points are not distributed symmetrically to the left and right of the median. If the bell curve is shifted to the left or the right, it is said to be skewed.

Skewness can be quantified as a representation of the extent to which a given distribution varies from a normal distribution. A normal distribution has a zero skew, while a lognormal distribution, for example, would exhibit some right skew.

## 11.1 Calculating skewness

$\tilde{\mu}_{3}=\frac{\sum_{i=1}^{N}\left(X_{i}-\bar{X}\right)^{3}}{(N-1)\,\sigma^{3}}$

- $\tilde{\mu}_{3}$ = skewness

- $N$ = number of observations in the distribution

- $X_{i}$ = the $i$-th observation

- $\bar{X}$ = mean of the distribution

- $\sigma$ = standard deviation
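The formula can be sketched in a few lines of Python. The helper name `skewness` and the sample data are illustrative, and taking $\sigma$ to be the sample standard deviation (denominator $N-1$) is an assumption; the value returned by <code>statistics_api.skewness</code> may use a different convention.

```python
import statistics


def skewness(values):
    """Skewness per the formula above; sigma is assumed to be the sample SD."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n
    sigma = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    if sigma == 0:
        raise ValueError("skewness is undefined for constant data")
    # Sum of cubed deviations from the mean
    cubed_deviations = sum((x - mean) ** 3 for x in values)
    return cubed_deviations / ((n - 1) * sigma ** 3)


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 10]
print(skewness(symmetric))
print(skewness(right_skewed))
```

The symmetric sample has zero skew because its cubed deviations cancel, while the sample with a long right tail gives a positive value.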


In [None]:
print(statistics_api.skewness(session,  series_1_id))

# 12. Spearman's rank coefficient

The function used here is <code>statistics_api.spearman</code>. This takes five parameters:
 
- The <code>session</code> identifying the active session
- The <code>series_1_id</code> of the first series
- The <code>series_2_id</code> of the second series
- The <code>alternative</code> hypothesis to test (for example, <code>"two-sided"</code>)
- The <code>type_ranking</code> choosing the type of ranking (for example, <code>"cdf"</code>)

Spearman's rank correlation coefficient is a nonparametric measure of rank correlation (the statistical dependence between the rankings of two variables).

It is often denoted by the Greek letter $\rho$ (rho) and measures the strength and direction of the association between two ranked variables.

It helps to contrast it with Pearson's correlation, which is a statistical measure of the strength of a linear relationship between paired data. Spearman's coefficient applies the same idea to the ranks of the data rather than to the raw values.

## 12.1 Calculating Spearman's rank coefficient

$\rho=1-\frac{6 \sum d_{i}^{2}}{n\left(n^{2}-1\right)}$

- $n$ = number of data points in each of the two variables

- $d_{i}$ = difference between the two ranks of the $i$-th element

The Spearman coefficient, $\rho$, can take a value between $-1$ and $+1$, where:

- A $\rho$ value of $+1$ means a perfect positive association between ranks

- A $\rho$ value of $0$ means no association between ranks

- A $\rho$ value of $-1$ means a perfect negative association between ranks

The closer the $\rho$ value is to $0$, the weaker the association between the two ranks.
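As a rough illustration of the formula, the sketch below ranks two small samples and applies it directly. The helper names and the data are hypothetical. This shortcut formula is only exact when there are no tied values, so the sketch rejects ties rather than handling them. It does not model the <code>type_ranking</code> option of the API.

```python
def ranks_without_ties(values):
    """1-based ranks of the values; assumes all values are distinct."""
    if len(set(values)) != len(values):
        raise ValueError("this simplified formula assumes no tied values")
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho from the sum of squared rank differences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    n = len(xs)
    rank_x = ranks_without_ties(xs)
    rank_y = ranks_without_ties(ys)
    sum_d_squared = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - (6 * sum_d_squared) / (n * (n ** 2 - 1))


xs = [10, 20, 30, 40, 50]
ys = [1, 3, 2, 5, 4]
rho = spearman_rho(xs, ys)
print(rho)
```

Here the rank differences are 0, -1, 1, -1 and 1, so $\sum d_{i}^{2}=4$ and $\rho = 1 - 24/120 = 0.8$.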

In [None]:
print(statistics_api.spearman(session,  series_1_id, series_2_id, alternative, type_ranking))


# 13. Student t-test

The function used here is <code>statistics_api.student_t_test</code>. This takes four parameters:
 
- The <code>session</code> identifying the active session
- The <code>series_1_id</code> of the first series
- The <code>series_2_id</code> of the second series
- The <code>alternative</code> hypothesis to test (for example, <code>"two-sided"</code>)

Student's t-test tells you how significant the difference between two group means is, that is, whether the difference could plausibly have arisen by chance. The $t$-test is usually used when the data follow a normal distribution but the population variance is unknown.

## 13.1 Calculating test statistics

For equal sample sizes and equal variance

Given two groups $(1,2)$, this test is only applicable when:

- the two sample sizes are equal;
- it can be assumed that the two distributions have the same variance.

When these assumptions do not hold, Welch's t-test (Section 15) is generally the better choice.

The $t$ statistic to test whether the means are different can be calculated as follows:

$t=\frac{\bar{X}_{1}-\bar{X}_{2}}{s_{p} \sqrt{\frac{2}{n}}}$

where $s_{p}$ is the pooled standard deviation, computed from the two sample variances $s_{X_{1}}^{2}$ and $s_{X_{2}}^{2}$:

$s_{p}=\sqrt{\frac{s_{X_{1}}^{2}+s_{X_{2}}^{2}}{2}}$

For significance testing, the degrees of freedom for this test are $2 n-2$, where $n$ is the number of observations in each group.
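The calculation for the equal-size case can be sketched as follows. The function name `student_t_equal_n` and the two groups are made up for illustration. The sketch only computes the statistic and the degrees of freedom and leaves the p-value to <code>statistics_api.student_t_test</code>.

```python
import math
import statistics


def student_t_equal_n(group_1, group_2):
    """t statistic and degrees of freedom for two groups of equal size."""
    n = len(group_1)
    if n != len(group_2) or n < 2:
        raise ValueError("groups must have the same size (at least 2)")
    # Pooled standard deviation from the two sample variances
    var_1 = statistics.variance(group_1)
    var_2 = statistics.variance(group_2)
    pooled_sd = math.sqrt((var_1 + var_2) / 2)
    if pooled_sd == 0:
        raise ValueError("t is undefined when both groups are constant")
    mean_difference = statistics.mean(group_1) - statistics.mean(group_2)
    t = mean_difference / (pooled_sd * math.sqrt(2 / n))
    degrees_of_freedom = 2 * n - 2
    return t, degrees_of_freedom


group_1 = [1, 2, 3, 4, 5]
group_2 = [2, 3, 4, 5, 6]
t, df = student_t_equal_n(group_1, group_2)
print(t, df)
```

Both groups have a sample variance of 2.5 and their means differ by 1, so $t$ is approximately $-1$ with $2 \cdot 5 - 2 = 8$ degrees of freedom.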


In [None]:
print(statistics_api.student_t_test(session,  series_1_id, series_2_id, alternative))


# 14. Variance

The function used here is <code>statistics_api.variance</code>. This takes two parameters:
 
- The <code>session</code> identifying the active session
- The <code>series_1_id</code> of the desired series


Variance is a statistical measurement of the spread between numbers in a data set. More specifically, variance measures how far each number in the set is from the mean (average), and thus from every other number in the set. Variance is often denoted by the symbol $\sigma^{2}$. In finance, for example, analysts and traders use it to gauge volatility and market security.

The square root of the variance is the standard deviation (SD or $\sigma$), which, continuing the finance example, helps determine the consistency of an investment's returns over a period of time.

## 14.1 Calculating variance

$S^{2}=\frac{\sum\left(x_{i}-\bar{x}\right)^{2}}{n-1}$

- $S^{2}$ = sample variance

- $x_{i}$ = the value of one observation

- $\bar{x}$ = the mean value of all observations

- $n$ = the number of observations
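A short sketch of the sample variance formula follows, using made-up observations and a hypothetical helper name. The standard library's `statistics.variance` uses the same $n-1$ denominator, so it serves as a cross-check.

```python
import statistics


def sample_variance(values):
    """Sample variance with the n - 1 denominator, as in the formula above."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


observations = [2, 4, 4, 4, 5, 5, 7, 9]
s_squared = sample_variance(observations)
print(s_squared)                          # sample variance
print(s_squared ** 0.5)                   # standard deviation
print(statistics.variance(observations))  # standard library cross-check
```

The mean of these observations is 5 and the squared deviations sum to 32, so the sample variance is $32/7$.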


In [None]:
print(statistics_api.variance(session,  series_1_id))


# 15. Welch's t-test

The function used here is <code>statistics_api.welch_t_test</code>. This takes four parameters:
 
- The <code>session</code> identifying the active session
- The <code>series_1_id</code> of the first series
- The <code>series_2_id</code> of the second series
- The <code>alternative</code> hypothesis to test (for example, <code>"two-sided"</code>)

Welch's t-test, also known as the unequal variances t-test, is used to test whether the means of two populations are equal. It is generally applied when the two populations have different variances and their sample sizes may also be unequal.

## 15.1 Calculating Welch's t-test

Test statistic:

$$
t=\frac{\bar{X}_{1}-\bar{X}_{2}}{\sqrt{\frac{s_{1}^{2}}{N_{1}}+\frac{s_{2}^{2}}{N_{2}}}}=\frac{\bar{X}_{1}-\bar{X}_{2}}{\sqrt{s e_{1}^{2}+s e_{2}^{2}}}
$$

- The null hypothesis for the test is that the means are equal.

- The alternate hypothesis for the test is that means are not equal.

## 15.2 Comparison to Student's T-Test

Welch's t-test, unlike Student's t-test, does not assume equal variances (however, both tests assume normality). When two groups have equal sample sizes and variances, Welch's test tends to give the same result as Student's. However, when sample sizes and variances are unequal, Student's t-test is quite unreliable and Welch's tends to perform better.
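The test statistic from Section 15.1 can be sketched as below. The function name `welch_t` and the samples are illustrative. The degrees of freedom use the standard Welch-Satterthwaite approximation, which is not stated above and is added here as an assumption about how significance would be assessed; <code>statistics_api.welch_t_test</code> may differ in detail.

```python
import math
import statistics


def welch_t(sample_1, sample_2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n_1, n_2 = len(sample_1), len(sample_2)
    if n_1 < 2 or n_2 < 2:
        raise ValueError("each sample needs at least two observations")
    # Squared standard errors s^2 / N for each sample
    se_1_sq = statistics.variance(sample_1) / n_1
    se_2_sq = statistics.variance(sample_2) / n_2
    if se_1_sq + se_2_sq == 0:
        raise ValueError("t is undefined when both samples are constant")
    mean_difference = statistics.mean(sample_1) - statistics.mean(sample_2)
    t = mean_difference / math.sqrt(se_1_sq + se_2_sq)
    # Welch-Satterthwaite approximation (assumption: not given in the text above)
    df = (se_1_sq + se_2_sq) ** 2 / (
        se_1_sq ** 2 / (n_1 - 1) + se_2_sq ** 2 / (n_2 - 1)
    )
    return t, df


# Equal sizes and equal variances
t_equal, df_equal = welch_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
# Unequal sizes and very different variances
t_unequal, df_unequal = welch_t([1, 2, 3, 4, 5], [10, 30, 50])
print(t_equal, df_equal)
print(t_unequal, df_unequal)
```

For the first pair the statistic is approximately $-1$ with about 8 degrees of freedom, which is what the equal-size Student formula in Section 13.1 gives for the same data. For the second pair the approximate degrees of freedom fall well below $N_{1}+N_{2}-2$, reflecting the unequal variances.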

In [None]:
print(statistics_api.welch_t_test(session,  series_1_id, series_2_id, alternative))


# 16. Wilcoxon signed-rank test

The function used here is <code>statistics_api.wilcoxon_signed_rank_test</code>. This takes five parameters:
 
- The <code>session</code> identifying the active session
- The <code>series_1_id</code> of the first series
- The <code>series_2_id</code> of the second series
- The <code>alternative</code> hypothesis to test (for example, <code>"two-sided"</code>)
- The <code>type_ranking</code> choosing the type of ranking (for example, <code>"cdf"</code>)

The Wilcoxon rank-sum test is used to compare two independent samples, while the Wilcoxon signed-rank test is used to compare two related or matched samples, or to conduct a paired difference test of repeated measurements on a single sample, in order to assess whether their population mean ranks differ.

## 16.1 Calculating Wilcoxon signed-rank test

$$
W=\sum_{i=1}^{N_{r}}\left[\operatorname{sgn}\left(x_{2, i}-x_{1, i}\right) \cdot R_{i}\right]
$$

- $W$ = test statistic

- $N_{r}$ = sample size, excluding pairs where $x_{1}=x_{2}$

- $\operatorname{sgn}$ = sign function

- $x_{1, i}, x_{2, i}$ = corresponding ranked pairs from two distributions

- $R_{i}$ = rank of the $i$-th pair
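The statistic can be sketched in plain Python as follows. The helper names and the paired values are hypothetical. Ranking the absolute differences and giving tied values their average rank is the usual convention and is assumed here. The sketch does not model the <code>type_ranking</code> option, and the statistic reported by <code>statistics_api.wilcoxon_signed_rank_test</code> may follow a different convention.

```python
def average_ranks(values):
    """1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = average
        pos = end + 1
    return ranks


def wilcoxon_w(x_1, x_2):
    """Signed-rank sum W and N_r, the number of non-zero differences."""
    if len(x_1) != len(x_2):
        raise ValueError("samples must be paired (equal length)")
    # Drop pairs with no difference, as in the definition of N_r
    differences = [b - a for a, b in zip(x_1, x_2) if b != a]
    ranks = average_ranks([abs(d) for d in differences])
    w = sum((1 if d > 0 else -1) * rank for d, rank in zip(differences, ranks))
    return w, len(differences)


before = [10, 12, 14, 16]
after = [11, 10, 17, 16]
w, n_r = wilcoxon_w(before, after)
print(w, n_r)
```

The differences are 1, -2, 3 and 0. The zero is dropped, leaving $N_{r}=3$, and the signed ranks sum to $1 - 2 + 3 = 2$.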

In [None]:
print(statistics_api.wilcoxon_signed_rank_test(session,  series_1_id, series_2_id, alternative, type_ranking))

# Running all functions together

In [None]:
from smart_broker_api import statistics_api

type_distribution="normalunit"
type_ranking="cdf"
alternative = "two-sided"
print(statistics_api.count(session,  series_1_id))
print(statistics_api.mean(session,  series_1_id))
print(statistics_api.chisquare(session,  series_1_id, series_2_id))
print(statistics_api.kolmogorovSmirnovTest(session,  series_1_id, type_distribution, type_ranking))
print(statistics_api.kurtosis(session,  series_1_id))
print(statistics_api.levene_test(session,  series_1_id, series_2_id))
print(statistics_api.mann_whitney_u_test(session,  series_1_id, series_2_id, alternative, type_ranking))
print(statistics_api.min_max(session,  series_1_id))
print(statistics_api.paired_t_test(session,  series_1_id, series_2_id, alternative))
print(statistics_api.pearson(session,  series_1_id, series_2_id, alternative))
print(statistics_api.skewness(session,  series_1_id))
print(statistics_api.spearman(session,  series_1_id, series_2_id, alternative, type_ranking))
print(statistics_api.student_t_test(session,  series_1_id, series_2_id, alternative))
print(statistics_api.variance(session,  series_1_id))
print(statistics_api.welch_t_test(session,  series_1_id, series_2_id, alternative))
print(statistics_api.wilcoxon_signed_rank_test(session,  series_1_id, series_2_id, alternative, type_ranking))

{'count': 20}
{'mean': 22.76433814224586}
{'chisquare': [380.0, 0.23583148842099294]}
{'kolmogorov_smirnov_test': [0.95, 1.9073486328125338e-26]}
{'kurtosis': -1.225866948308427}
{'f_statistic': 0.003760306204864892, 'p_value': 0.9514246952891092}
{'w_statistic': 180.0, 'p_value': 0.5978625674290079}
{'min': 11.76951154694196, 'max': 30.172100539128525}
{'t_statistic': -3.842534659324879, 'p_value': 0.00027445641167562454}
{'pearson': 0.9332515128886971, 'p_value': 1.9573960230445664e-09}
{'skewness': -0.08041628194553505}
{'spearman': 0.9505828978811472, 'p_value': 1.394346860195128e-10}
{'t_statistic': -0.999458438479299, 'p_value': 0.08097370675497725}
{'variance': 28.617281263722433}
{'t_statistic': -0.9994584384792992, 'p_value': 0.08097675209806848}
{'w_statistic': 0.9505828978811472, 'p_value': 1.394346860195128e-10}


#### 3.5.2 Data Visualisation

We can also perform visualisation functions on our series-level data. We'll now go over the two visualisation functions available: histogram and kernel density estimation.

##### 3.5.2.1 Histogram

To generate a histogram from a series, we employ the <code>visualization_api</code> component of the orchestrator client. When using the <code>histogram</code> function we supply two parameters in addition to the <code>session</code>:

- The <code>id</code> of the series in question
- <code>bin_count</code>; the number of bins we'd like to sort our data into

We receive a <code>dict</code> back from the client containing multiple entries. To create the visualisation we must query the <code>'figure'</code> key from the result and supply this to <code>plotly</code> as shown below.

In [None]:
from smart_broker_api import visualization_api
import plotly.graph_objects as go

dict_of_fig = visualization_api.histogram(session, series_1_id, 20)
fig = go.Figure(dict_of_fig["figure"])
fig.show()

# TODO: This is ugly UX. Researcher shouldn't have to fish out 'figure' from the json
# Why aren't all steps held inside the client? ADD TO TECH DEBT


##### 3.5.2.2 Kernel Density Estimation

To generate a kernel density estimate from a series, we employ the <code>visualization_api</code> component of the orchestrator client. When using the <code>kernel_density_estimation</code> function, we supply two parameters in addition to the <code>session</code>:

- The <code>id</code> of the series in question
- <code>bin_size</code>; the width of bins we'd like to sort our data into

We receive a <code>dict</code> back from the client containing multiple entries. To create the visualisation we must query the <code>'figure'</code> key from the result and supply this to <code>plotly</code> as shown below.
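Since both visualisation functions return the figure under the same <code>'figure'</code> key, a small wrapper can keep that lookup in one place. The sketch below is only a suggestion: `extract_figure` is a hypothetical helper, not part of <code>smart_broker_api</code>, and the stand-in response is invented for illustration because the full contents of the real response are not documented here.

```python
def extract_figure(result):
    """Return the figure description from a visualisation result.

    Hypothetical helper: it only assumes the response is a dict with a
    'figure' entry, as described above.
    """
    if not isinstance(result, dict) or "figure" not in result:
        raise KeyError("expected a dict containing a 'figure' key")
    return result["figure"]


# Stand-in for a client response; a real one would come from
# visualization_api.histogram or visualization_api.kernel_density_estimation.
example_result = {"figure": {"data": [], "layout": {"title": {"text": "example"}}}}
figure_spec = extract_figure(example_result)
print(sorted(figure_spec))
```

The returned value is what gets passed to <code>go.Figure(...)</code> in the notebook cells, and a malformed response fails with a clear message instead of a bare lookup error.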

In [None]:
dict_of_fig = visualization_api.kernel_density_estimation(session, series_1_id, 2)
fig = go.Figure(dict_of_fig["figure"])
fig.show()

# TODO: This is ugly UX. Researcher shouldn't have to fish out 'figure' from the json
# Why aren't all steps held inside the client? ADD TO TECH DEBT