# Loading Data in sktime

[github lookup](https://github.com/alan-turing-institute/sktime/blob/dev/examples/Loading%20Data%20Examples.ipynb)

Note: please consider this data storage approach a working prototype. Its primary purpose is to support code development, and full testing and additional functionality will be added later. There are many elements that could be refined, and some elements should likely be handled by a Task object. Suggestions and comments are welcome! 

### Current Approach: 

Data should be stored in pandas DataFrame objects. This can be achieved either by creating the data structure programmatically or by loading data directly from a bespoke sktime file format (.ts).

Below is a brief description of the .ts file format and an introduction to how data are stored in dataframes for sktime.

In [1]:
from sktime.utils.load_data import load_from_tsfile_to_dataframe

## Representing data with .ts files

The most typical use case is to load data from a locally stored .ts file. The .ts file format has been created for representing problems in a standard format for use with sktime. These files include two main parts:  
* header information
* data 

The header information simplifies the representation of the data by including metadata about the structure of the problem. The header contains the following:

    @problemName <problem name>
    @timeStamps <true/false> 
    @univariate <true/false>
    @classLabel <true/false> <space delimited list of possible class values>
    @data
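
As a rough illustration of how these tags might be consumed, the sketch below collects header lines into a dictionary and stops at the @data tag. It is not sktime's own parser: the function name `parse_ts_header`, the lowercased dictionary keys, and the example header values are assumptions made for this illustration.

```python
def parse_ts_header(lines):
    """Collect '@tag value ...' header lines into a dict, stopping at @data.

    Hypothetical helper for illustration only; not part of sktime.
    """
    header = {}
    for line in lines:
        line = line.strip()
        if not line.startswith("@"):
            continue  # skip blank lines and anything that is not a tag
        tag, *values = line[1:].split()
        tag = tag.lower()
        if tag == "data":
            break  # everything after @data is series data, not header
        # Interpret the first token as a true/false flag where relevant.
        flag = bool(values) and values[0].lower() == "true"
        if tag == "classlabel":
            header["classlabel"] = flag
            header["classvalues"] = values[1:]  # possible class values
        elif tag in ("timestamps", "univariate"):
            header[tag] = flag
        else:
            header[tag] = " ".join(values)
    return header


# Made-up header used only to exercise the sketch.
example = [
    "@problemName Example",
    "@timeStamps false",
    "@univariate true",
    "@classLabel true 1 2",
    "@data",
    "1,4,23,34:1",
]
header = parse_ts_header(example)
print(header)
```

Class values are kept as strings here, since the header itself does not say how they should be typed.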
    
The data for the problem should begin after the @data tag. In the simplest case, where @timeStamps is false, the values of a series are expressed as a comma-separated list and the index of each value is its position in the list (0, 1, ..., m). A _case_ may contain one or more dimensions; cases are line-delimited and the dimensions within a case are colon (:) delimited. For example:

    2,3,2,4:4,3,2,2
    13,12,32,12:22,23,12,32
    4,4,5,4:3,2,3,2

This example data has 3 _cases_, where each case has 2 _dimensions_ with 4 observations per dimension. Missing readings can be specified using ?. Alternatively, for sparse datasets, readings can be specified by setting @timeStamps to true and representing the data with tuples of the form (timestamp, value). For example, the first case in the example above could be specified in this representation as:

    (0,2),(1,3),(2,2),(3,4):(0,4),(1,3),(2,2),(3,2)

Similarly,

    2,5,?,?,?,?,?,5,?,?,?,?,4

could be represented with timestamps as:

    (0,2),(1,5),(7,5),(12,4)
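
The relationship between the dense and timestamped notations can be sketched in plain Python. The helper names below are hypothetical and not part of sktime. The sketch assumes that timestamps are simply integer positions in the dense list and that ? marks a missing reading.

```python
import math


def parse_dense_dimension(text):
    """Parse a comma-separated dense series; '?' becomes NaN."""
    return [
        math.nan if token.strip() == "?" else float(token)
        for token in text.split(",")
    ]


def dense_to_timestamped(values):
    """Keep only the observed readings as (position, value) tuples."""
    return [(i, v) for i, v in enumerate(values) if not math.isnan(v)]


def format_timestamped(pairs):
    """Render tuples in the (timestamp,value) notation used above."""
    return ",".join(f"({t},{v:g})" for t, v in pairs)


# A case with two dimensions: dimensions are colon-delimited.
case = "2,3,2,4:4,3,2,2"
dims = [parse_dense_dimension(d) for d in case.split(":")]
print(":".join(format_timestamped(dense_to_timestamped(d)) for d in dims))

# A sparse single-dimension series with missing readings.
sparse = parse_dense_dimension("2,5,?,?,?,?,?,5,?,?,?,?,4")
print(format_timestamped(dense_to_timestamped(sparse)))
```

The timestamped form only pays off when most readings are missing; for fully observed series the dense form is shorter.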
    
For classification problems, the class label for a case should be specified in the last dimension and @classLabel should be in the header information to specify the set of possible class values. For example, if a case consists of a single dimension and has a class value of 1 it would be specified as:

    1,4,23,34:1
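
A minimal sketch of how such a line could be split into its dimensions and class label, assuming dense (non-timestamped) data with no missing readings. The function `split_case` and its `has_class_label` parameter are hypothetical names for this illustration, not sktime API.

```python
def split_case(line, has_class_label=True):
    """Split one data line into its dimensions and, optionally, a class label.

    Hypothetical helper for illustration only; not part of sktime.
    """
    parts = line.strip().split(":")
    # When @classLabel is true, the last colon-delimited field is the label.
    label = parts.pop() if has_class_label else None
    dimensions = [[float(v) for v in part.split(",")] for part in parts]
    return dimensions, label


dims, label = split_case("1,4,23,34:1")
print(dims, label)
```

The label is left as a string rather than converted to a number, since class values need not be numeric.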

## Storing data in a pandas DataFrame

The core data structure for storing datasets in sktime is a pandas DataFrame, where rows of the dataframe correspond to cases and columns correspond to dimensions of the problem. The readings within each column of the dataframe are stored as pandas Series objects; the use of Series facilitates simple storage of sparse data or series with non-integer timestamps (such as dates). Further, if the loaded problem is a classification problem, the standard loading functionality within sktime will return the class values in a separate index-aligned numpy array (with an option to combine X and Y into a single dataframe for high-level task construction). For example, for a problem with n cases that each have data across c dimensions:

    DataFrame:                                            
    index |   dim_0   |   dim_1   |    ...    |  dim_c-1
       0  | pd.Series | pd.Series |    ...    | pd.Series
       1  | pd.Series | pd.Series |    ...    | pd.Series
      ... |    ...    |    ...    |    ...    |    ...   
      n-1 | pd.Series | pd.Series |    ...    | pd.Series

If the problem is a classification problem, a separate (index-aligned) array will be returned with the class labels:

    index | class_val 
      0   |   int    
      1   |   int 
     ...  |   ...
     n-1  |   int 
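
The same layout can also be built programmatically. The sketch below constructs a small nested DataFrame by hand, reusing the values from the three-case .ts example earlier in this document. The column names follow the dim_0, dim_1 pattern shown above, and the class labels in `y` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Three cases, two dimensions, four observations per dimension.
cases = [
    ([2, 3, 2, 4], [4, 3, 2, 2]),
    ([13, 12, 32, 12], [22, 23, 12, 32]),
    ([4, 4, 5, 4], [3, 2, 3, 2]),
]

# Each cell of the DataFrame holds one pd.Series (one dimension of one case).
X = pd.DataFrame(
    {
        "dim_0": pd.Series([pd.Series(case[0]) for case in cases]),
        "dim_1": pd.Series([pd.Series(case[1]) for case in cases]),
    }
)

# Made-up class labels, index-aligned with the rows of X.
y = np.array(["1", "2", "1"])

print(X.shape)                   # (number of cases, number of dimensions)
print(X.loc[1, "dim_1"].values)  # readings of case 1, dimension 1
```

Because every cell is itself a Series, each case can carry its own index, which is what allows sparse or date-indexed readings.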


### Loading from .ts file to pandas DataFrame

A dataset can be loaded from a .ts file using the following function in the sktime.utils.load_data module:
    
    load_from_tsfile_to_dataframe(full_file_path_and_name, replace_missing_vals_with='NaN')
    
This can be demonstrated using the GunPoint problem that is included in sktime under sktime/datasets/data:

In [2]:
from sktime.utils.load_data import load_from_tsfile_to_dataframe

train_x, train_y = load_from_tsfile_to_dataframe("../sktime/datasets/data/GunPoint/GunPoint_TRAIN.ts") 
test_x, test_y = load_from_tsfile_to_dataframe("../sktime/datasets/data/GunPoint/GunPoint_TEST.ts") 


In [3]:
# alternatively there are utility methods in sktime to load this and two other example datasets:
from sktime.datasets import load_gunpoint_dataframe, load_arrow_head_dataframe, load_italy_power_demand_dataframe
train_x, train_y = load_gunpoint_dataframe('TRAIN')
test_x, test_y = load_gunpoint_dataframe('TEST')

The train and test partitions of the GunPoint problem have been loaded into dataframes with associated arrays for class values. As an example, below are the first 5 rows of train_x and the first 5 values of train_y:

In [4]:
train_x.head()

    index | dim_0
      0   | 0 -0.64789 1 -0.64199 2 -0.63819 3...
      1   | 0 -0.64443 1 -0.64540 2 -0.64706 3...
      2   | 0 -0.77835 1 -0.77828 2 -0.77715 3...
      3   | 0 -0.75006 1 -0.74810 2 -0.74616 3...
      4   | 0 -0.59954 1 -0.59742 2 -0.59927 3...


In [5]:
train_y[0:5]

array(['2', '2', '1', '1', '2'], dtype='<U1')