# Sliding-window analysis

Let's get down to it! This module is probably the easiest to use in `sihnpy` so you will be on your way very quickly.

## Practice data

In other `sihnpy` modules, real data from a subset of the PREVENT-AD Open Dataset is used. While we could probably use the continuous handedness scale to create windows, there is not a lot of diversity in the numbers (so a lot of participants that are fully right-handed would be clumped together in most windows). 

A variable that usually lends itself well to the sliding-window approach is age. Age is not available in the PREVENT-AD Open Dataset as it is restricted information. Instead, I opted to **simulate age data** for the PREVENT-AD participants, similarly to {ref}`the data available for the Spatial Extent <2.spex/spex_details:Creating Gaussian simulated data>`. I will refer you to that section for more detailed information on how the age data was simulated in the PREVENT-AD.

Just know that the simulated age data matches the [mean, standard deviation and inclusion criteria from the PREVENT-AD](https://doi.org/10.1016/j.nicl.2021.102733): mean age of 65 years, standard deviation of 5 years and minimum age to be included is 55 years.

```{warning}
Just like for the spatial extent module, `sihnpy` provides practice data to use the sliding-window module. While PREVENT-AD participants are used, the data is **simulated**. As a general rule for `sihnpy`, and especially for this module, **only use the data provided to help you practice using the module, not to conduct or publish actual research**.
```

## Deriving the sliding windows

### 1. Preparing the data

To run the **sliding-window** module, you will need two things:
* A spreadsheet with the data, with the index set as the participants' IDs
* The name of the variable we want to slide along

If you already have your data ready, you can skip ahead to {ref}`the next section <3.sliding_window/sw_module:2. Calculating the number of windows>`.

As mentioned before, `sihnpy` has data available for you to use to practice:


```python
from sihnpy.datasets import pad_sw_input

pad_age_data = pad_sw_input()
pad_age_data
```

| participant_id | sex | test_language | handedness_score | handedness_interpretation | age |
|:---|:---|:---|---:|:---|---:|
| sub-5458966 | Male | French | 80.00 | Right-handed | 65.892657 |
| sub-2424540 | Female | French | 100.00 | Right-handed | 65.543026 |
| sub-7855613 | Female | French | 90.00 | Right-handed | 59.054610 |
| sub-3137570 | Male | French | 90.00 | Right-handed | 65.653705 |
| sub-9650197 | Female | French | 100.00 | Right-handed | 68.059713 |
| ... | ... | ... | ... | ... | ... |
| sub-5336241 | Female | French | -30.00 | Ambidextrous | 70.342373 |
| sub-1002928 | Female | French | 100.00 | Right-handed | 68.658707 |
| sub-1283278 | Female | English | 80.00 | Right-handed | 61.154117 |
| sub-9101699 | Male | French | 57.89 | Right-handed | 62.495973 |


We have our dataset, which is basically the demographic information for our participants included in the PREVENT-AD Open Dataset with the simulated age. We're ready to go!

````{admonition} Fix: dataframe index
:class: warning

While the practice data imported from `sihnpy` is already in the right format, it is **critical** that your own data has its final index set before you use the rest of the functions.

Unlike R, `pandas` relies on an index, where each row is referred to by a label. `sihnpy` will sort participants and output the results of each window using these labels.

In `sihnpy` in particular, we output the list of participants in each window (i.e., we extract and output the index of the dataframe) and we output the data of each window (i.e., we extract and output the data of each window based on their index).

You can easily set an index by doing the following procedure when importing the data in Python:

```python
import pandas as pd

data = pd.read_csv("/path/to/file.csv", index_col=0) #The number is the position (integer) of the column to be used as index
```

Otherwise, if you already have the data imported in Python, you can manually force the index to the variable you want:

```python
import pandas as pd

data = pd.read_csv("/path/to/file.csv")
data_indexed = data.set_index('name_column') #Where `name_column` is the name of the column to use as index

```

````
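
If you want to confirm that the index is what you expect, here is a minimal sketch. It uses a small made-up dataframe (the IDs and ages are invented for illustration and are not from the PREVENT-AD) and relies only on standard `pandas` attributes.

```python
import pandas as pd

# Toy data, invented for illustration only
data = pd.DataFrame({
    "participant_id": ["sub-001", "sub-002", "sub-003"],
    "age": [61.2, 58.7, 66.4],
})

data_indexed = data.set_index("participant_id")

# The index should now hold the participant IDs, with no duplicates
print(data_indexed.index.name)       # participant_id
print(data_indexed.index.is_unique)  # True
```

A duplicated ID would make `is_unique` return `False`, which is worth catching before building windows.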

### 2. Calculating the number of windows

The first step is to estimate **how many windows** we want to compute. To ensure we don't have empty windows, `sihnpy` will use your desired **window** and **step size** to compute the ideal number of windows.

For example, let's say I would like a window size of 100 and step size of 20, then I would simply need to tell `sihnpy`:

```python
from sihnpy import sliding_window as sw

n_windows_sane = sw.bins(data=pad_age_data, var='age', w_size=100, s_size=20)
```

```text
Collapse is False: the last window may have a smaller number of participants
Number of windows: 12
```


You aren't limited in how you divide the windows. You can even use odd numbers if you like (not my preference because I don't like odd numbers... but if it fits your research design, great!):

```python
n_windows_insane = sw.bins(data=pad_age_data, var='age', w_size=66, s_size=13)
```

```text
Collapse is False: the last window may have a smaller number of participants
Number of windows: 20
```


The only crucial information `sihnpy` needs is which `DataFrame` to use (`pad_age_data` in our case) and the name of the variable to use for sorting.

```{warning}
Missing values are not currently tolerated in the sliding-window module and will throw errors. Make sure that there are no missing values on your sorting variable. Future versions will allow users to choose whether to throw errors, put missing values first or last. 
```
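
One way to check for missing values before calling the module is sketched below. The dataframe is a made-up stand-in (IDs and ages are invented), and dropping incomplete rows is only one possible choice, not something `sihnpy` does for you.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
data = pd.DataFrame(
    {"age": [61.2, np.nan, 66.4, 58.7]},
    index=pd.Index(["sub-001", "sub-002", "sub-003", "sub-004"], name="participant_id"),
)

# Count missing values on the sorting variable
n_missing = data["age"].isna().sum()
print(f"Missing values on 'age': {n_missing}")

# One option (a choice, not a recommendation): drop those participants
data_complete = data.dropna(subset=["age"])
```

Whether dropping participants is appropriate depends on your research design; the point here is only to find the missing values before the sliding-window functions do.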

```{admonition} Advanced topic: Collapse argument
:class: danger

You might have noticed above that there is a message indicating that "Collapse is False". What is that?

First, we need to understand how `sihnpy` computed the number of windows. The formula is as follows:

$n_{windows} = \lceil (n_{sub} - w_{size}) / s_{size} \rceil$

Where $n_{windows}$ is the resulting number of windows, $n_{sub}$ is the number of participants in the whole sample, $w_{size}$ is the size of the windows we want and $s_{size}$ is the step size we want.

In other words, the formula subtracts the window size from the total number of participants, and divides the result by the step size. Because the numbers we choose can result in divisions with remainders, we force a **ceiling** rounding (rounding up) as we can't have fractions of windows.

This is also where the `collapse` argument comes into play. The window and step sizes we choose will almost never divide the sample so that every window has exactly the same number of participants, so `sihnpy` lets the user choose how to deal with the leftover participants.

In the default scenario (`collapse=False`), `sihnpy` will assume that we prefer having **more windows, but the last window will have fewer participants**. In this case, `sihnpy` will automatically add one more window to the number.

In the other scenario (`collapse=True`), `sihnpy` will assume that we prefer having **fewer windows, but the last window will have more participants**. In this case, `sihnpy` will not add any window to the count, but the last window will have more participants.

It is possible that you choose a step size and window size that would ensure that there are no extra participants at the end. In such a case, please ensure `collapse=True`, as otherwise `sihnpy` will create an extraneous empty window.

In the end, the choice to collapse or not falls to the user, and I don't think there is ultimately a good or a bad choice. In our recent publication[^Stonge_2023], we opted for `collapse=False`.

```
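
To make the arithmetic concrete, here is a minimal sketch of the window count as described in the advanced topic above. `estimate_n_windows` is a hypothetical helper written for this page, not `sihnpy`'s own code, and the sample size of 308 is an assumption: the text does not state it, and it was picked because it reproduces the counts printed earlier.

```python
from math import ceil


def estimate_n_windows(n_sub, w_size, s_size, collapse=False):
    """Window count following the formula in the text (hypothetical helper)."""
    n_windows = ceil((n_sub - w_size) / s_size)
    if not collapse:
        # collapse=False: one extra, smaller window holds the leftover participants
        n_windows += 1
    return n_windows


n_sub = 308  # assumed sample size, not stated in the text

print(estimate_n_windows(n_sub, w_size=100, s_size=20))                 # 12
print(estimate_n_windows(n_sub, w_size=66, s_size=13))                  # 20
print(estimate_n_windows(n_sub, w_size=100, s_size=20, collapse=True))  # 11
```

With `collapse=True` the count drops by one, and the leftover participants end up in the last window instead of forming their own.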


### 3. Building the windows

Once the number of windows has been determined, we need to "build" the windows. In other words, we need to split the participants into their respective windows.

In `sihnpy`, this is the step where the sliding-window is applied to the data. Specifically, we use `pandas`'s `iloc` to grab participants while accounting for our window and step sizes. For all the windows except the last one, this is determined by the following equations:

**Starting Index**: $s_{size} * (\text{current window} - 1)$

**Ending Index**: $w_{size} + s_{size} * (\text{current window} - 1)$

Let's imagine a window size of 100 and a step size of 20. The first window would be:

**Starting Index** $ = 20 * (1 - 1) = 0$

**Ending Index** $ = 100 + 20 * (1 - 1) = 100$

So the starting index for the first window would be 0 and the ending index would be 100. Let's repeat this with window 5 just to demonstrate.

**Starting Index** $ = 20 * (5 - 1) = 80$

**Ending Index** $ = 100 + 20 * (5 - 1) = 180$

As you can see, we moved up our sliding-window so that it now starts at index 80, and ends at index 180. 
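
The two equations above can be sketched in a few lines of Python. This is an illustration under assumptions, not `sihnpy`'s implementation: `window_bounds` is a hypothetical helper, and the data is a made-up sample of 300 participants with random ages.

```python
import numpy as np
import pandas as pd


def window_bounds(current_window, w_size, s_size):
    """Start and end positions for a window (hypothetical helper)."""
    start = s_size * (current_window - 1)
    end = w_size + s_size * (current_window - 1)
    return start, end


# Toy data: 300 made-up participants with random ages
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {"age": rng.normal(65, 5, size=300)},
    index=pd.Index([f"sub-{i:03d}" for i in range(300)], name="participant_id"),
)

# Sort on the sliding variable first, then slice by position
toy_sorted = toy.sort_values("age")

start, end = window_bounds(5, w_size=100, s_size=20)
window_5 = toy_sorted.iloc[start:end]
print(start, end, len(window_5))  # 80 180 100
```

Note that `iloc` excludes the ending position, so the slice `80:180` covers positions 80 to 179, which is exactly 100 participants.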

Thankfully, you don't need to compute any of that: `sihnpy` will do it for you:

```python
w_store = sw.build_windows(data=pad_age_data, var='age', w_size=100, s_size=20, n_bin=n_windows_sane)
```

```text
Creating bin 1
Creating bin 2
Creating bin 3
Creating bin 4
Creating bin 5
Creating bin 6
Creating bin 7
Creating bin 8
Creating bin 9
Creating bin 10
Creating bin 11
Creating bin 12
```


And like that, it's done! `sihnpy` stored all of our windows in a dictionary. You can access any of the windows with the following naming convention:

`ww{w_size}_sts{s_size}_w{n_window}`

Note that windows 1-9 have an extra 0 in their name (e.g., `w01`), so that all the names have the same number of characters.
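
As a rough illustration of the naming convention, the keys can be rebuilt with an f-string. `window_key` is a hypothetical helper, and the two-digit padding is an assumption based on the names shown on this page; how `sihnpy` pads names when there are 100 windows or more is not covered here.

```python
w_size, s_size = 100, 20


def window_key(n_window):
    """Rebuild a dictionary key from the window number (hypothetical helper)."""
    # :02d pads single-digit window numbers with a leading zero
    return f"ww{w_size}_sts{s_size}_w{n_window:02d}"


print(window_key(1))   # ww100_sts20_w01
print(window_key(12))  # ww100_sts20_w12
```

This can be handy for looping over windows in order without typing each key by hand.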

For instance, let's take a look at the first window:

```python
w_store['ww100_sts20_w01']
```

```text
sub-3165520
sub-4396879
sub-9249727
sub-9327302
sub-4498598
...
sub-2757160
sub-9865768
sub-6967785
sub-1176949
sub-7755697
```


Great! We see a `pandas.DataFrame` with our participants and with 0 columns. This is normal: the columns are removed to simplify merging data later on and to easily output the list of participants as needed. More on that in the {ref}`section on data export <3.sliding_window/sw_module:6. Exporting data>`.

For fun, let's check that the windows are sliding properly. Let's take the last participant in our window: `sub-7755697`.

```python
w_store['ww100_sts20_w01'].index.get_loc('sub-7755697')
```

```text
99
```

In the first window, this participant is at the last position, i.e., position 100. It shows up as 99 because Python is 0-indexed (meaning that the count starts at 0, not 1). If the sliding window worked properly, the participant's position should **slide** back by 20 with each window (so position 80 in the second window; 79 in Python's 0-indexing). This should hold for all subsequent windows until the participant is no longer included (which should happen in window 6).

```python
print(w_store['ww100_sts20_w02'].index.get_loc('sub-7755697'))
print(w_store['ww100_sts20_w03'].index.get_loc('sub-7755697'))
print(w_store['ww100_sts20_w04'].index.get_loc('sub-7755697'))
print(w_store['ww100_sts20_w05'].index.get_loc('sub-7755697'))
try:
    w_store['ww100_sts20_w06'].index.get_loc('sub-7755697')
except KeyError:
    print("Participant not in this window")
```

```text
79
59
39
19
Participant not in this window
```


That's right on the money! The algorithm is working properly.

### 4. Reconstructing data in each window

As we saw in the previous step, the `build_windows` function returns only an index with the participant IDs. Now we need to attach each participant's data to each window, so that each window has its own spreadsheet with its own data. `sihnpy` only needs two things for this: the dictionary of window IDs we just built and the original dataset.

```python
w_data = sw.data_by_window(w_store=w_store, data=pad_age_data)
```

```text
Reconstructing data for window ww100_sts20_w01
Reconstructing data for window ww100_sts20_w02
Reconstructing data for window ww100_sts20_w03
Reconstructing data for window ww100_sts20_w04
Reconstructing data for window ww100_sts20_w05
Reconstructing data for window ww100_sts20_w06
Reconstructing data for window ww100_sts20_w07
Reconstructing data for window ww100_sts20_w08
Reconstructing data for window ww100_sts20_w09
Reconstructing data for window ww100_sts20_w10
Reconstructing data for window ww100_sts20_w11
Reconstructing data for window ww100_sts20_w12
```


This creates a new dictionary, where each entry is a dataframe. For instance, let's look again at the first window:

```python
w_data['ww100_sts20_w01']
```

| participant_id | sex | test_language | handedness_score | handedness_interpretation | age |
|:---|:---|:---|---:|:---|---:|
| sub-3165520 | Male | English | 80.0 | Right-handed | 55.000000 |
| sub-4396879 | Male | French | 100.0 | Right-handed | 55.000000 |
| sub-9249727 | Female | French | 100.0 | Right-handed | 55.000000 |
| sub-9327302 | Female | French | 100.0 | Right-handed | 55.000000 |
| sub-4498598 | Female | French | 100.0 | Right-handed | 55.000000 |
| ... | ... | ... | ... | ... | ... |
| sub-2757160 | Female | French | 90.0 | Right-handed | 62.998585 |
| sub-9865768 | Female | French | 50.0 | Right-handed | 63.005971 |
| sub-6967785 | Male | French | 90.0 | Right-handed | 63.092224 |
| sub-1176949 | Female | French | 80.0 | Right-handed | 63.194265 |


We get the full data for the 100 participants included in this window. We also see that `sub-7755697` is still the last participant, keeping the same order we saw before. We can verify this if we are paranoid:

```python
print(w_data['ww100_sts20_w02'].index.get_loc('sub-7755697'))
print(w_data['ww100_sts20_w03'].index.get_loc('sub-7755697'))
print(w_data['ww100_sts20_w04'].index.get_loc('sub-7755697'))
print(w_data['ww100_sts20_w05'].index.get_loc('sub-7755697'))
try:
    w_data['ww100_sts20_w06'].index.get_loc('sub-7755697')
except KeyError:
    print("Participant not in this window")
```

```text
79
59
39
19
Participant not in this window
```


This gives the same result as before, so we're all good.
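
Conceptually, the reconstruction in this step can be pictured as a label-based lookup with `pandas`'s `loc`. The sketch below is an assumption about the general idea, not `sihnpy`'s actual code, and it uses invented data and shortened window names.

```python
import pandas as pd

# Toy dataset, invented for illustration only (already sorted by age)
data = pd.DataFrame(
    {"age": [58.7, 61.2, 63.9, 66.4], "sex": ["Female", "Male", "Female", "Male"]},
    index=pd.Index(["sub-004", "sub-001", "sub-003", "sub-002"], name="participant_id"),
)

# Stand-in for the dictionary of windows: zero-column dataframes that only carry IDs
toy_store = {
    "w01": pd.DataFrame(index=data.index[0:3]),
    "w02": pd.DataFrame(index=data.index[1:4]),
}

# Label-based lookup keeps the order of the IDs within each window
toy_data = {name: data.loc[ids.index] for name, ids in toy_store.items()}
print(toy_data["w02"])
```

Because the lookup is done by label, this only works cleanly when the index is unique, which is one more reason to set it carefully at the start.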

### 5. Summary statistics for each window

OK, so `sihnpy` split our participants into windows and slid across the age variable. That's great. But you might wonder: what is the actual age in each window? You could compute this for each dataframe yourself, but that is kind of a pain. Thankfully, I am quite a lazy programmer and didn't want to do that every time, so I integrated a function in `sihnpy` that does it. You just need to feed it the dictionary we just computed as well as the name of the variable you want statistics on.

```python
w_summary = sw.sum_by_window(w_data=w_data, var='age')
w_summary
```

| window | mean_age | median_age | sd_age | min_age | max_age |
|:---|---:|---:|---:|---:|---:|
| ww100_sts20_w01 | 59.457366 | 59.790299 | 2.556229 | 55.0 | 63.285955 |
| ww100_sts20_w02 | 61.072791 | 61.320336 | 2.009204 | 57.227661 | 63.805879 |
| ww100_sts20_w03 | 62.307265 | 62.682785 | 1.726616 | 58.661927 | 64.730185 |
| ww100_sts20_w04 | 63.415577 | 63.598469 | 1.403649 | 60.456187 | 65.634214 |
| ww100_sts20_w05 | 64.369545 | 64.36417 | 1.226137 | 62.049926 | 66.424965 |
| ww100_sts20_w06 | 65.219789 | 65.186617 | 1.220798 | 63.318288 | 67.369632 |
| ww100_sts20_w07 | 66.092823 | 66.022348 | 1.289132 | 63.870281 | 68.415997 |
| ww100_sts20_w08 | 66.984937 | 66.885311 | 1.323495 | 64.744182 | 69.223976 |
| ww100_sts20_w09 | 67.928772 | 67.910107 | 1.402626 | 65.653705 | 70.425195 |
| ww100_sts20_w10 | 68.989369 | 68.841317 | 1.595768 | 66.4438 | 72.391602 |


`sihnpy` will output this dataframe, where each row is a window, and each column is a descriptive statistic. Easy-peasy.
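
If you are curious what such a summary looks like when built by hand, here is a minimal sketch with plain `pandas`. It is not `sihnpy`'s implementation: the data and window names are invented, and the standard deviation uses the `pandas` default (sample SD, `ddof=1`), which may or may not match what `sihnpy` reports.

```python
import pandas as pd

# Toy windows, invented for illustration only
toy_data = {
    "w01": pd.DataFrame({"age": [55.0, 57.0, 59.0]}),
    "w02": pd.DataFrame({"age": [57.0, 59.0, 64.0]}),
}

# One row of descriptive statistics per window
rows = {
    name: {
        "mean_age": df["age"].mean(),
        "median_age": df["age"].median(),
        "sd_age": df["age"].std(),  # pandas default is the sample SD (ddof=1)
        "min_age": df["age"].min(),
        "max_age": df["age"].max(),
    }
    for name, df in toy_data.items()
}

toy_summary = pd.DataFrame.from_dict(rows, orient="index")
toy_summary.index.name = "window"
print(toy_summary)
```

The column names mirror the ones in the table above so the two are easy to compare.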

### 6. Exporting data

You made it all the way to the end. The last step is simply to export the data to files. `sihnpy` outputs a lot of files (2 per window + the summary statistics file) so be ready. It outputs both the full data dataframe (what we generated in {ref}`step 4 <3.sliding_window/sw_module:4. Reconstructing data in each window>`) as well as a text file for each window containing only the IDs (1 ID per line). Here is the code to export the data:

```python
sw.export_data(w_data=w_data, w_summary=w_summary, var='age', path='/path/to/output', name='suffix_to_add')
```

And you are done with the **sliding-window** analysis!

## tl;dr

Too lazy to read everything? Or read everything and need a quick refresher? Here is the code in the order you need to make it work.

```python
from sihnpy.datasets import pad_sw_input #For practice data
from sihnpy import sliding_window as sw #Sliding-window functions

pad_age_data = pad_sw_input() #Import practice data

n_windows = sw.bins(data=pad_age_data, var='age', w_size=100, s_size=20, collapse=False) #Computes the number of windows to create

w_store = sw.build_windows(data=pad_age_data, var='age', w_size=100, s_size=20, n_bin=n_windows) #Build the windows

w_data = sw.data_by_window(w_store=w_store, data=pad_age_data) #Reconstructs the data for each window

w_summary = sw.sum_by_window(w_data=w_data, var='age') #Computes summary statistics for each window

sw.export_data(w_data=w_data, w_summary=w_summary, var='age', path='/path/to/output', name='suffix_to_add') #Export the sliding-window data

```

## References

Here are the references for this section:

[^Stonge_2023]: 