# Loading Data in sktime

[github lookup](https://github.com/alan-turing-institute/sktime/blob/dev/examples/Loading%20Data%20Examples.ipynb)

Note: please consider this data storage approach a working prototype. Its primary purpose is to support code development, and full testing and additional functionality will be added later. There are many elements that could be refined, and some elements should likely be handled by a Task object. Suggestions and comments are welcome! 

### Current Approach: 

Data should be stored in pandas DataFrame objects. This can be achieved by creating the data structure programmatically or by loading data directly from a bespoke sktime file format (.ts).

Below is a brief description of the .ts file format and an introduction to how data are stored in dataframes for sktime.

In [1]:
from sktime.utils.load_data import load_from_tsfile_to_dataframe

## Representing data with .ts files

The most typical use case is to load data from a locally stored .ts file. The .ts file format has been created for representing problems in a standard format for use with sktime. These files include two main parts:  
* header information
* data 

The header information keeps the representation of the data simple by including metadata about the structure of the problem. The header contains the following:

    @problemName <problem name>
    @timeStamps <true/false> 
    @univariate <true/false>
    @classLabel <true/false> <space-delimited list of possible class values>
    @data
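
As a rough illustration of how these tags could be read, the sketch below collects the header lines into a dictionary and stops at `@data`. The function name `parse_ts_header` and the choice to lower-case the tag names are assumptions made for this example; this is not the parser that sktime itself uses.

```python
def parse_ts_header(lines):
    """Collect '@tag value ...' lines into a dict, stopping at '@data'.

    Hypothetical helper for illustration only (not part of sktime).
    """
    header = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.lower() == "@data":
            break  # everything after this tag is series data
        if line.startswith("@"):
            tag, *values = line[1:].split()
            header[tag.lower()] = values
    return header


example_lines = [
    "@problemName ToyProblem",
    "@timeStamps false",
    "@univariate true",
    "@classLabel true 1 2",
    "@data",
    "1,4,23,34:1",
]
header = parse_ts_header(example_lines)
print(header["classlabel"])  # ['true', '1', '2']
```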
    
The data for the problem should begin after the @data tag. In the simplest case, where @timeStamps is false, the values of a series are written as a comma-separated list and the index of each value is its position in the list (0, 1, ..., m). A _case_ may contain one or more dimensions; cases are line-delimited and the dimensions within a case are colon (:) delimited. For example:

    2,3,2,4:4,3,2,2
    13,12,32,12:22,23,12,32
    4,4,5,4:3,2,3,2

This example data has 3 _cases_, where each case has 2 _dimensions_ with 4 observations per dimension. Missing readings can be specified using ?. For sparse datasets, readings can instead be specified by setting @timeStamps to true and representing the data with tuples of the form (timestamp, value). For example, the first case in the example above could be specified in this representation as:

    (0,2),(1,3),(2,2),(3,4):(0,4),(1,3),(2,2),(3,2)

Equivalently, 

    2,5,?,?,?,?,?,5,?,?,?,?,4 

could be represented with timestamps as:

    (0,2),(1,5),(7,5),(12,4)
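
To make the relationship between the two representations concrete, here is a minimal sketch that parses a single dimension in either form into a list of (index, value) pairs. The helper names `parse_dimension` and `parse_case` are hypothetical, and the sketch assumes integer timestamps and numeric values; it is not the parsing code used by sktime.

```python
def parse_dimension(text, timestamps=False):
    """Return a list of (index, value) pairs for one dimension of a case."""
    text = text.strip()
    if timestamps:
        # "(0,2),(1,5)" -> ["0,2", "1,5"]
        pairs = text.strip("()").split("),(")
        result = []
        for pair in pairs:
            stamp, value = pair.split(",")
            result.append((int(stamp), float(value)))  # assumes integer timestamps
        return result
    # Dense form: the position in the list is the index, '?' marks a missing reading.
    return [
        (i, float(v))
        for i, v in enumerate(text.split(","))
        if v.strip() != "?"
    ]


def parse_case(line, timestamps=False):
    """Split a case into its colon-delimited dimensions and parse each one."""
    return [parse_dimension(dim, timestamps) for dim in line.split(":")]


dense = parse_dimension("2,5,?,?,?,?,?,5,?,?,?,?,4")
sparse = parse_dimension("(0,2),(1,5),(7,5),(12,4)", timestamps=True)
print(dense)            # [(0, 2.0), (1, 5.0), (7, 5.0), (12, 4.0)]
print(dense == sparse)  # True

two_dims = parse_case("2,3,2,4:4,3,2,2")
print(len(two_dims))    # 2
```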
    
For classification problems, the class label for a case should be specified in the last dimension and @classLabel should be in the header information to specify the set of possible class values. For example, if a case consists of a single dimension and has a class value of 1 it would be specified as:

     1,4,23,34:1
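
Putting the header and data rules together, the sketch below writes a tiny classification file to a temporary directory, reads it back, and splits the class label off each case. The problem name `ToyProblem`, the file name, and the helper `split_label` are invented for this example. Whether sktime's loader accepts exactly this header may depend on the installed version, so the sketch reads the file with plain Python rather than calling the loader.

```python
import tempfile
from pathlib import Path

# A made-up two-case, single-dimension problem with class labels 1 and 2.
TS_TEXT = """@problemName ToyProblem
@timeStamps false
@univariate true
@classLabel true 1 2
@data
1,4,23,34:1
2,3,2,4:2
"""


def split_label(case_line):
    """Return (dimensions, label), where the label is the last ':' field."""
    *dims, label = case_line.strip().split(":")
    values = [[float(v) for v in dim.split(",")] for dim in dims]
    return values, label


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "ToyProblem_TRAIN.ts"
    path.write_text(TS_TEXT)
    lines = path.read_text().splitlines()

# Everything after the @data tag is one case per line.
data_lines = lines[lines.index("@data") + 1:]
cases = [split_label(line) for line in data_lines]
print(cases[0])  # ([[1.0, 4.0, 23.0, 34.0]], '1')
```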

## Storing data in a pandas DataFrame

The core data structure for storing datasets in sktime is a pandas DataFrame, where rows of the dataframe correspond to cases and columns correspond to dimensions of the problem. The readings within each cell of the dataframe are stored as pandas Series objects; the use of Series makes it simple to store sparse data or series with non-integer timestamps (such as dates). Further, if the loaded problem is a classification problem, the standard loading functionality within sktime will return the class values in a separate index-aligned numpy array (with an option to combine X and Y into a single dataframe for high-level task construction). For example, for a problem with n cases that each have data across c dimensions:

    DataFrame:
    index |   dim_0   |   dim_1   |    ...    |  dim_c-1
       0  | pd.Series | pd.Series | pd.Series | pd.Series
       1  | pd.Series | pd.Series | pd.Series | pd.Series
      ... |    ...    |    ...    |    ...    |    ...
      n-1 | pd.Series | pd.Series | pd.Series | pd.Series

And if the problem is a classification problem, a separate (index-aligned) array will be returned with the class labels:

    index | class_val
      0   |   label
      1   |   label
     ...  |   ...
     n-1  |   label
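
The nested layout can also be built by hand. The following sketch constructs a small dataframe whose cells are pandas Series, together with an index-aligned label array; the sizes, random values, and labels are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

n_cases, n_dims, series_len = 3, 2, 4
rng = np.random.default_rng(0)

# One column per dimension; every cell holds its own pd.Series.
X = pd.DataFrame({
    f"dim_{d}": [pd.Series(rng.random(series_len)) for _ in range(n_cases)]
    for d in range(n_dims)
})

# Class labels live in a separate array aligned with the rows of X.
y = np.array(["1", "2", "1"])

first_series = X.iloc[0, 0]  # the series for case 0, dimension 0
print(X.shape)               # (3, 2)
print(type(first_series).__name__, len(first_series))  # Series 4
```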


## Loading from .ts file to pandas DataFrame

A dataset can be loaded from a .ts file using the following function in `sktime.utils.load_data`:
    
    load_from_tsfile_to_dataframe(full_file_path_and_name, replace_missing_vals_with='NaN')
    
This can be demonstrated using the GunPoint problem that is included in sktime under sktime/datasets/data:

In [2]:
from sktime.utils.load_data import load_from_tsfile_to_dataframe

train_x, train_y = load_from_tsfile_to_dataframe("../sktime/datasets/data/GunPoint/GunPoint_TRAIN.ts") 
test_x, test_y = load_from_tsfile_to_dataframe("../sktime/datasets/data/GunPoint/GunPoint_TEST.ts") 


Train and test partitions of the GunPoint problem have been loaded into dataframes with associated arrays for class values. As an example, the first 5 rows of train_x and the first 5 values of train_y are shown below:

In [3]:
train_x.head()

       | dim_0
    0  | 0 -0.64789 1 -0.64199 2 -0.63819 3...
    1  | 0 -0.64443 1 -0.64540 2 -0.64706 3...
    2  | 0 -0.77835 1 -0.77828 2 -0.77715 3...
    3  | 0 -0.75006 1 -0.74810 2 -0.74616 3...
    4  | 0 -0.59954 1 -0.59742 2 -0.59927 3...


In [4]:
train_y[0:5]

array(['2', '2', '1', '1', '2'], dtype='<U1')
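
Note that the labels come back as strings. A quick way to inspect them is sketched below, using the five values printed above as a stand-in for the full array; the variable name `train_y_head` is introduced here only for the example.

```python
import numpy as np

# The first five labels shown above, copied here so the snippet runs on its own.
train_y_head = np.array(['2', '2', '1', '1', '2'])

# Count how often each class label occurs.
labels, counts = np.unique(train_y_head, return_counts=True)
label_counts = dict(zip(labels.tolist(), counts.tolist()))
print(label_counts)  # {'1': 2, '2': 3}

# Convert to integers if a downstream tool expects numeric labels.
train_y_int = train_y_head.astype(int)
print(train_y_int.tolist())  # [2, 2, 1, 1, 2]
```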

## Loading from Weka ARFF files

It is also possible to load data from Weka's attribute-relation file format (ARFF) files. The `load_from_arff_to_dataframe` method in `sktime.utils.load_data` supports reading both univariate and multivariate problems. Examples are shown below using the GunPoint problem again (this time loading from ARFF) and also the multivariate BasicMotions problem.

### Loading the univariate GunPoint problem from ARFF

In [5]:
from sktime.utils.load_data import load_from_arff_to_dataframe

In [6]:
X, y = load_from_arff_to_dataframe("../sktime/datasets/data/GunPoint/GunPoint_TRAIN.arff")
X.head()

       | dim_0
    0  | 0 -0.64789 1 -0.64199 2 -0.63819 3...
    1  | 0 -0.64443 1 -0.64540 2 -0.64706 3...
    2  | 0 -0.77835 1 -0.77828 2 -0.77715 3...
    3  | 0 -0.75006 1 -0.74810 2 -0.74616 3...
    4  | 0 -0.59954 1 -0.59742 2 -0.59927 3...


### Loading the multivariate BasicMotions problem from ARFF

In [7]:
X, y = load_from_arff_to_dataframe("../sktime/datasets/data/BasicMotions/BasicMotions_TRAIN.arff")
X.head()

       | dim_0 | dim_1 | dim_2 | dim_3 | dim_4 | dim_5
    0  | 0 0.079106 1 0.079106 2 -0.903497 3... | 0 0.394032 1 0.394032 2 -3.666397 3... | 0 0.551444 1 0.551444 2 -0.282844 3... | 0 0.351565 1 0.351565 2 -0.095881 3... | 0 0.023970 1 0.023970 2 -0.319605 3... | 0 0.633883 1 0.633883 2 0.972131 3...
    1  | 0 0.377751 1 0.377751 2 2.952965 3... | 0 -0.610850 1 -0.610850 2 0.970717 3... | 0 -0.147376 1 -0.147376 2 -5.962515 3... | 0 -0.103872 1 -0.103872 2 -7.593275 3... | 0 -0.109198 1 -0.109198 2 -0.697804 3... | 0 -0.037287 1 -0.037287 2 -2.865789 3...
    2  | 0 -0.813905 1 -0.813905 2 -0.424628 3... | 0 0.825666 1 0.825666 2 -1.305033 3... | 0 0.032712 1 0.032712 2 0.826170 3... | 0 0.021307 1 0.021307 2 -0.372872 3... | 0 0.122515 1 0.122515 2 -0.045277 3... | 0 0.775041 1 0.775041 2 0.383526 3...
    3  | 0 0.289855 1 0.289855 2 -0.669185 3... | 0 0.284130 1 0.284130 2 -0.210466 3... | 0 0.213680 1 0.213680 2 0.252267 3... | 0 -0.314278 1 -0.314278 2 0.018644 3... | 0 0.074574 1 0.074574 2 0.007990 3... | 0 -0.079901 1 -0.079901 2 0.237040 3...
    4  | 0 -0.123238 1 -0.123238 2 -0.249547 3... | 0 0.379341 1 0.379341 2 0.541501 3... | 0 -0.286006 1 -0.286006 2 0.208420 3... | 0 -0.098545 1 -0.098545 2 -0.023970 3... | 0 0.058594 1 0.058594 2 0.175783 3... | 0 -0.074574 1 -0.074574 2 0.114525 3...
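
When every series in a multivariate problem has the same length, it can be handy to view the nested dataframe as a dense array. The sketch below stacks the cells into a NumPy array of shape (cases, dimensions, series length). The helper `nested_to_3d` and the toy data are assumptions for illustration and do not refer to an sktime function.

```python
import numpy as np
import pandas as pd


def nested_to_3d(X):
    """Stack a nested dataframe into an array of shape (n_cases, n_dims, series_len).

    Assumes every cell holds a pd.Series of the same length.
    """
    return np.stack([
        np.stack([X.iloc[i, d].to_numpy() for d in range(X.shape[1])])
        for i in range(X.shape[0])
    ])


# Toy nested dataframe: 2 cases, 2 dimensions, 3 readings each.
X_toy = pd.DataFrame({
    "dim_0": [pd.Series([1.0, 2.0, 3.0]), pd.Series([4.0, 5.0, 6.0])],
    "dim_1": [pd.Series([0.1, 0.2, 0.3]), pd.Series([0.4, 0.5, 0.6])],
})

arr = nested_to_3d(X_toy)
print(arr.shape)  # (2, 2, 3)
```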


## Using long-format data with sktime 

It is also possible to use data from sources other than .ts and .arff files by manually shaping the data into the format described above. For convenience, the `from_long_to_nested` helper function in `sktime.utils.load_data` converts long-format data into sktime-formatted data (with assumptions made about how the data is initially formatted).

The function converts rows from a long-table schema dataframe, assuming each row contains information for:

`case_id, dim_id, reading_id, value`

where `case_id` is an id to identify a specific case in the data, `dim_id` is an integer between 0 and d-1 for d dimensions in the data, `reading_id` is the index of this observation for the associated `case_id` and `dim_id`, and `value` is the actual value of the observation. E.g.:

          | case_id | dim_id | reading_id | value
     ------------------------------------------------
       0  |   int   |  int   |    int     | double   
       1  |   int   |  int   |    int     | double
       2  |   int   |  int   |    int     | double
       3  |   int   |  int   |    int     | double

To demonstrate this functionality, the function below creates a dataset with a given number of cases, dimensions and observations:

In [8]:
import numpy as np
import pandas as pd


def generate_example_long_table(num_cases=50, series_len=20, num_dims=2):
    """Build a random long-format table with one row per observation."""
    rows_per_case = series_len * num_dims
    total_rows = num_cases * rows_per_case

    # The np.int alias was removed in NumPy 1.24; the builtin int is equivalent here.
    case_ids = np.empty(total_rows, dtype=int)
    idxs = np.empty(total_rows, dtype=int)
    dims = np.empty(total_rows, dtype=int)
    vals = np.random.rand(total_rows)

    for i in range(total_rows):
        case_ids[i] = i // rows_per_case  # case this row belongs to
        rem = i % rows_per_case           # position of the row within its case
        dims[i] = rem // series_len       # dimension within the case
        idxs[i] = rem % series_len        # reading within the dimension

    df = pd.DataFrame()
    df['case_id'] = pd.Series(case_ids)
    df['dim_id'] = pd.Series(dims)
    df['reading_id'] = pd.Series(idxs)
    df['value'] = pd.Series(vals)
    return df
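
For larger tables the same index pattern can be produced without a Python loop. The variant below is a sketch using `np.repeat` and `np.tile`; the function name `generate_long_table_vectorised` and the `seed` parameter are additions for this example and are not part of the original notebook.

```python
import numpy as np
import pandas as pd


def generate_long_table_vectorised(num_cases=50, series_len=20, num_dims=2, seed=None):
    """Same row layout as the loop-based generator, built with repeat/tile."""
    rng = np.random.default_rng(seed)
    rows_per_case = series_len * num_dims
    total_rows = num_cases * rows_per_case
    return pd.DataFrame({
        # Each case id is repeated once per row of that case.
        "case_id": np.repeat(np.arange(num_cases), rows_per_case),
        # Within a case, each dimension id covers series_len consecutive rows.
        "dim_id": np.tile(np.repeat(np.arange(num_dims), series_len), num_cases),
        # Reading ids cycle 0..series_len-1 for every (case, dimension) pair.
        "reading_id": np.tile(np.arange(series_len), num_cases * num_dims),
        "value": rng.random(total_rows),
    })


small = generate_long_table_vectorised(num_cases=2, series_len=3, num_dims=2, seed=0)
print(small[["case_id", "dim_id", "reading_id"]].to_numpy().tolist()[:4])
# [[0, 0, 0], [0, 0, 1], [0, 0, 2], [0, 1, 0]]
```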

The following example generates a long-format table with 50 cases, each with 4 dimensions of length 20:

In [9]:
X = generate_example_long_table(num_cases=50, series_len=20, num_dims=4)
X.head()

       | case_id | dim_id | reading_id | value
    0  |    0    |   0    |     0      | 0.177589
    1  |    0    |   0    |     1      | 0.781639
    2  |    0    |   0    |     2      | 0.22335
    3  |    0    |   0    |     3      | 0.053108
    4  |    0    |   0    |     4      | 0.650965


In [10]:
X.tail()

          | case_id | dim_id | reading_id | value
    3995  |   49    |   3    |     15     | 0.764313
    3996  |   49    |   3    |     16     | 0.579147
    3997  |   49    |   3    |     17     | 0.830007
    3998  |   49    |   3    |     18     | 0.657461
    3999  |   49    |   3    |     19     | 0.482724


As shown below, applying the `from_long_to_nested` function returns an sktime-formatted dataset with individual dimensions represented by columns of the output dataframe:

In [11]:
from sktime.utils.load_data import from_long_to_nested 
X_nested = from_long_to_nested(X)
X_nested.head()

       | dim_0 | dim_1 | dim_2 | dim_3
    0  | 0 0.177589 1 0.781639 2 0.223350 3... | 0 0.947357 1 0.641315 2 0.514998 3... | 0 0.643837 1 0.274636 2 0.005983 3... | 0 0.293897 1 0.658655 2 0.555860 3...
    1  | 0 0.341743 1 0.149166 2 0.908948 3... | 0 0.719205 1 0.764392 2 0.353981 3... | 0 0.827305 1 0.429136 2 0.283994 3... | 0 0.254740 1 0.133941 2 0.567786 3...
    2  | 0 0.972167 1 0.309815 2 0.666487 3... | 0 0.385930 1 0.181997 2 0.977246 3... | 0 0.672560 1 0.963813 2 0.884127 3... | 0 0.577515 1 0.192003 2 0.151689 3...
    3  | 0 0.814272 1 0.059455 2 0.178887 3... | 0 0.286362 1 0.481258 2 0.992999 3... | 0 0.705961 1 0.070460 2 0.469713 3... | 0 0.274182 1 0.086482 2 0.478018 3...
    4  | 0 0.521243 1 0.466565 2 0.092443 3... | 0 0.263912 1 0.952919 2 0.749043 3... | 0 0.367707 1 0.561043 2 0.235882 3... | 0 0.258686 1 0.683320 2 0.126101 3...


In [12]:
X_nested.iloc[0, 0].head()

0    0.177589
1    0.781639
2    0.223350
3    0.053108
4    0.650965
dtype: float64
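
For readers curious about what such a conversion involves, the sketch below reproduces the idea with a pandas `groupby`. It is an approximation written for this tutorial, not the implementation of `from_long_to_nested`. The name `long_to_nested_sketch` is hypothetical, and the code assumes the column names `case_id`, `dim_id`, `reading_id` and `value` and that every case has a reading for every dimension.

```python
import pandas as pd


def long_to_nested_sketch(long_df):
    """Convert a long-format table into a nested dataframe of pd.Series cells."""
    case_ids = sorted(long_df["case_id"].unique())
    dim_ids = sorted(long_df["dim_id"].unique())

    # One sub-table per (case, dimension) pair.
    grouped = {key: group for key, group in long_df.groupby(["case_id", "dim_id"])}

    columns = {}
    for dim in dim_ids:
        cells = []
        for case in case_ids:
            group = grouped[(case, dim)].sort_values("reading_id")
            # Use reading_id as the index of the series held in this cell.
            cells.append(
                pd.Series(group["value"].to_numpy(), index=group["reading_id"].to_numpy())
            )
        columns[f"dim_{dim}"] = cells
    return pd.DataFrame(columns, index=case_ids)


long_df = pd.DataFrame({
    "case_id":    [0, 0, 0, 0, 1, 1, 1, 1],
    "dim_id":     [0, 0, 1, 1, 0, 0, 1, 1],
    "reading_id": [0, 1, 0, 1, 0, 1, 0, 1],
    "value":      [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
})
nested = long_to_nested_sketch(long_df)
print(nested.shape)                 # (2, 2)
print(nested.iloc[1, 0].tolist())   # [5.0, 6.0]
```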