# Loading Data in sktime

Note: please consider this code a working prototype. Its primary purpose is to support code development; full testing and additional functionality will be added later. Many elements could be refined, and some should likely be handled by a Task object (e.g. to replace the somewhat hacky way that class labels are set when converting long data to wide).

Data should be stored in xpandas.XDataFrame objects; this can be achieved by creating the data structure programmatically or by loading data from file. Data can be loaded directly from a bespoke sktime file format (.ts), or by loading any delimited file into a long-table format and converting it into an XDataFrame object using a utility method (long-table data created through other means can be converted in the same way).

Below is a brief description of the .ts file format and a recap of how data are stored in XDataFrame objects. Following this, methods are demonstrated for loading data into an XDataFrame from a .ts file, and for loading arbitrary data into a long-format pandas.DataFrame and converting it into an xpandas.XDataFrame.

The examples throughout rely on the utility methods implemented in the utilities.load_data module (load_data.py).

In [1]:
from utilities.load_data import load_from_web_to_xdataframe

# Representing data with .ts files

The most typical use case is to load data from a locally stored .ts file. The .ts file format has been created for representing problems in a standard format for use with sktime. These files include two main parts:  
* header information
* data 

The header information is used to facilitate simple representation of the data through including metadata about the structure of the problem. The header contains the following: 

    @problemName <problem name>
    @timeStamps <true/false> 
    @univariate <true/false>
    @classLabel <true/false> <space delimited list of possible class values>
    @data
    
The data for the problem should begin after the @data tag. In the simplest case, where @timeStamps is false, values for a series are expressed in a comma-separated list and the index of each value is given by its position in the list (0, 1, ..., m). A _case_ may contain one or more dimensions, where cases are line-delimited and dimensions within a case are colon (:) delimited. For example:

    2,3,2,4:4,3,2,2
    13,12,32,12:22,23,12,32
    4,4,5,4:3,2,3,2

This example data has 3 _cases_, where each case has 2 _dimensions_ with 4 observations per dimension. Missing readings can be specified using ?. Alternatively, for sparse datasets, readings can be specified by setting @timeStamps to true and representing the data with tuples of the form (timestamp, value). For example, the first case in the example above could be specified in this representation as:

    (0,2),(1,3),(2,2),(3,4):(0,4),(1,3),(2,2),(3,2)

Equivalently, 

    2,5,?,?,?,?,?,5,?,?,?,?,4 

could be represented with timestamps as:

    (0,2),(1,5),(7,5),(12,4)
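
The two representations described above can be checked against each other with a small parser. The following is a minimal sketch using only the standard library; it is not the parser in utilities.load_data, and the function names (`parse_dense_dimension`, `parse_timestamped_dimension`, `parse_case`) are hypothetical. It assumes integer timestamps and numeric readings.

```python
import re


def parse_dense_dimension(text):
    """Parse '2,5,?,4' into {position: value}, skipping '?' (missing) readings."""
    readings = {}
    for position, token in enumerate(text.split(",")):
        token = token.strip()
        if token != "?":
            readings[position] = float(token)
    return readings


def parse_timestamped_dimension(text):
    """Parse '(0,2),(1,5)' into {timestamp: value} (integer timestamps assumed)."""
    pairs = re.findall(r"\(\s*([^,()]+?)\s*,\s*([^,()]+?)\s*\)", text)
    return {int(timestamp): float(value) for timestamp, value in pairs}


def parse_case(line, timestamps=False):
    """Split one line (one case) into its colon-delimited dimensions."""
    parse = parse_timestamped_dimension if timestamps else parse_dense_dimension
    return [parse(dimension) for dimension in line.strip().split(":")]


dense = parse_case("2,5,?,?,?,?,?,5,?,?,?,?,4")
sparse = parse_case("(0,2),(1,5),(7,5),(12,4)", timestamps=True)
print(dense == sparse)  # True: both describe the same four readings
```

Storing each dimension as a position-to-value mapping makes the dense and sparse forms directly comparable, which is the equivalence the text describes.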
    
For classification problems, the class label for a case should be specified in the last dimension and @classLabel should be in the header information to specify the set of possible class values. For example, if a case consists of a single dimension and has a class value of 1 it would be specified as:

     1,4,23,34:1
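
To make the header and class-label conventions concrete, here is a rough sketch of how the tags before @data might be collected and how the class value might be separated from the last dimension. It is an illustration only: `parse_ts_header` and `split_class_label` are hypothetical names, the problem name "Toy" is made up, and the real loader may treat edge cases differently.

```python
def parse_ts_header(lines):
    """Collect the @tag lines that precede @data.

    Returns (header, data_start), where header maps each lower-cased tag
    name to its list of space-separated values and data_start is the index
    of the first line after @data.
    """
    header = {}
    for i, raw in enumerate(lines):
        line = raw.strip()
        if not line:
            continue  # blank lines are ignored in this sketch
        if line.lower() == "@data":
            return header, i + 1
        if line.startswith("@"):
            tag, *values = line[1:].split()
            header[tag.lower()] = values
    raise ValueError("no @data tag found")


def split_class_label(case_line):
    """Split 'dim:dim:label' into (list of dimension strings, label string)."""
    *dimensions, label = case_line.strip().split(":")
    return dimensions, label


example = [
    "@problemName Toy",
    "@timeStamps false",
    "@univariate true",
    "@classLabel true 1 2",
    "@data",
    "1,4,23,34:1",
]
header, data_start = parse_ts_header(example)
dimensions, label = split_class_label(example[data_start])
print(header["classlabel"])  # ['true', '1', '2']
print(dimensions, label)     # ['1,4,23,34'] 1
```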

# Storing data in xpandas.XDataFrame

A number of methods are provided in utilities.load_data to load data into suitable data structures. The core data structure for storing datasets in sktime is an xpandas.XDataFrame, where rows of the XDataFrame correspond to cases and columns correspond to dimensions of the problem. A column is stored as an xpandas.XSeries, where individual entries are stored as a pandas.Series to allow for sparse/non-integer timestamps (such as dates). Further, if the loaded problem is a classification problem, the standard loading functionality within sktime will return the class values in a separate index-aligned XSeries object. For example, for n cases and c dimensions:

    XDataFrame:                                            
    index |   dim_0   |   dim_1   |    ...    |  dim_c-1
       0  | pd.Series | pd.Series | pd.Series | pd.Series
       1  | pd.Series | pd.Series | pd.Series | pd.Series
      ... |    ...    |    ...    |    ...    |    ...   
     n-1  | pd.Series | pd.Series | pd.Series | pd.Series

And if the data is a classification problem, a separate (index-aligned) XSeries object will be returned with the class labels:

    XSeries:
    index | class_val 
      0   |   int    
      1   |   int 
     ...  |   ...
     n-1  |   int 
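
The layout above can be imitated with plain pandas objects, which may help when xpandas is not to hand. The sketch below is an assumption-laden stand-in: it uses pandas.DataFrame and pandas.Series in place of XDataFrame and XSeries, the helper `nested_column` is hypothetical, and the class labels are invented for illustration. The readings are the three example cases from the .ts description.

```python
import numpy as np
import pandas as pd


def nested_column(series_list):
    """Pack pd.Series objects into a 1-D object array, one series per cell."""
    column = np.empty(len(series_list), dtype=object)
    for i, series in enumerate(series_list):
        column[i] = series
    return column


# Three cases, two dimensions, four readings per dimension.
cases = [
    ([2, 3, 2, 4], [4, 3, 2, 2]),
    ([13, 12, 32, 12], [22, 23, 12, 32]),
    ([4, 4, 5, 4], [3, 2, 3, 2]),
]

nested_x = pd.DataFrame({
    "dim_0": nested_column([pd.Series(d0, dtype=float) for d0, _ in cases]),
    "dim_1": nested_column([pd.Series(d1, dtype=float) for _, d1 in cases]),
})
# Hypothetical class labels, index-aligned with the rows of nested_x.
labels_y = pd.Series(["1", "2", "1"], index=nested_x.index)

print(nested_x.shape)                       # (3, 2): 3 cases, 2 dimensions
print(nested_x["dim_1"].iloc[1].tolist())   # [22.0, 23.0, 12.0, 32.0]
```

Each cell holds a whole series with its own index, which is what allows cases of different lengths or with non-integer timestamps to sit in the same column.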


## 1. Load from .ts file to xpandas.XDataFrame

A dataset can be loaded from a .ts file using the following method in utilities.load_data.py:
    
    load_from_tsfile_to_xdataframe(file_path, file_name, replace_missing_vals_with='NaN')
    
For convenience, a version of the method has been specified to load from a remote .ts file stored on timeseriesclassification.com. *Note: this is not a data API*. A proper data API can be explored at a later date; this remote data loading exists for convenience while developing.

This can be demonstrated using the GunPoint problem and the following method, which downloads the file (or uses a cached copy) and then calls the method above to load the .ts file.

In [2]:
# path where to download/look for datasets. If not present, they will be pulled from the web. If files
# are present then they will simply be loaded
cache_path = "C:/temp/sktime_temp_data/"
dataset_name = "GunPoint"

# if is_train_file=True, the method looks for the filename suffix "_TRAIN.ts"
# if is_test_file=True, the method looks for the filename suffix "_TEST.ts"
# if neither of the above is true, it looks for the filename suffix ".ts" - i.e. no default train or test split
train_x, train_y = load_from_web_to_xdataframe(dataset_name, is_train_file=True, cache_path=cache_path) 
test_x, test_y = load_from_web_to_xdataframe(dataset_name, is_test_file=True, cache_path=cache_path)
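
The suffix rules in the comments above can be written down as a small path helper. This is only a sketch of the naming convention, not the internals of load_from_web_to_xdataframe: the name `cached_ts_path` is hypothetical, the per-dataset subdirectory is inferred from the cached path used later in this document, and rejecting the case where both flags are set is an assumption.

```python
from pathlib import PurePosixPath


def cached_ts_path(cache_path, dataset_name, is_train_file=False, is_test_file=False):
    """Build the expected location of a cached .ts file for a dataset."""
    if is_train_file and is_test_file:
        # Assumption: a single file cannot be both the train and the test split.
        raise ValueError("choose at most one of is_train_file / is_test_file")
    if is_train_file:
        suffix = "_TRAIN.ts"
    elif is_test_file:
        suffix = "_TEST.ts"
    else:
        suffix = ".ts"  # no default train or test split
    return str(PurePosixPath(cache_path) / dataset_name / (dataset_name + suffix))


print(cached_ts_path("C:/temp/sktime_temp_data/", "GunPoint", is_train_file=True))
# C:/temp/sktime_temp_data/GunPoint/GunPoint_TRAIN.ts
```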

Train and test partitions of the GunPoint problem have been loaded into xpandas.XDataFrame objects and the associated class values for each have been read into xpandas.XSeries objects. As an example, below are the first 5 rows from the train_x (XDataFrame) and train_y (XSeries) objects:

In [3]:
train_x.head()

Unnamed: 0,dim_0
0,0 -0.647885 1 -0.641992 2 -0.63818...
1,0 -0.644427 1 -0.645401 2 -0.64705...
2,0 -0.778353 1 -0.778279 2 -0.77715...
3,0 -0.750060 1 -0.748103 2 -0.74616...
4,0 -0.599539 1 -0.597422 2 -0.59926...


In [4]:
train_y.head()

0    2
1    2
2    1
3    1
4    2
dtype: object
data_type: <class 'str'>

## 2. Load from a delimited file into a long format table

As mentioned, it is also possible to create the same train_x and train_y structures from an arbitrary delimited file. The only requirements are that the values within a time series are delimited by a specified character (default is ','), the dimensions of a problem are delimited by another character (default is ':'), and cases are line separated. 

This approach makes some assumptions about the data (and may need to be tidied up later). For example, the user must specify via a method argument whether the data contains a class value (i.e. if it is a TSC problem), and if so, it must be represented as a single value in the last dimension of a case. It may make more sense to delegate this behaviour to a Task object in the future, but this behaviour will suffice for initial development.

The utilities.load_data file contains a load_from_file_to_long_format method. For simplicity, the example below uses the same input .ts file that was previously cached. However, this method does not use any of the header data and relies on the reading and dimension delimiters being set, so can function on any delimited data file.

In [5]:
from utilities.load_data import load_from_file_to_long_format

file_name_and_path = cache_path+dataset_name+"/"+dataset_name+"_TRAIN.ts"

long_table_example = load_from_file_to_long_format(file_name_and_path, reading_delimiter=",", dimension_delimiter=":", last_dim_is_class_val=True)

long_table_example[0:10]

Unnamed: 0,case_id,dimension_id,reading_id,value
0,0,0,0,-0.647885
1,0,0,1,-0.641992
2,0,0,2,-0.638186
3,0,0,3,-0.638259
4,0,0,4,-0.638345
5,0,0,5,-0.638697
6,0,0,6,-0.643049
7,0,0,7,-0.643768
8,0,0,8,-0.64505
9,0,0,9,-0.647118
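
The long format shown above has one row per reading. As a rough illustration of how delimited lines could be unrolled into such rows, here is a standard-library sketch. It is not the implementation of load_from_file_to_long_format: the function name `lines_to_long_rows` is hypothetical, header lines are assumed to have been skipped already, and giving the class value a reading_id of 0 is an assumption. Placing class values in dimension "c" follows the convention this document describes for last_dim_is_class_val=True.

```python
def lines_to_long_rows(lines, reading_delimiter=",", dimension_delimiter=":",
                       last_dim_is_class_val=False):
    """Unroll delimited case lines into (case_id, dimension_id, reading_id, value) rows."""
    rows = []
    for case_id, line in enumerate(lines):
        dimensions = line.strip().split(dimension_delimiter)
        class_val = dimensions.pop() if last_dim_is_class_val else None
        for dimension_id, dimension in enumerate(dimensions):
            for reading_id, token in enumerate(dimension.split(reading_delimiter)):
                rows.append((case_id, dimension_id, reading_id, float(token)))
        if class_val is not None:
            # Class values are kept as strings under dimension "c".
            rows.append((case_id, "c", 0, class_val))
    return rows


rows = lines_to_long_rows(["2,3:4,3:1", "13,12:22,23:2"], last_dim_is_class_val=True)
for row in rows[:5]:
    print(row)
```

The resulting tuples could be passed to `pandas.DataFrame(rows, columns=["case_id", "dimension_id", "reading_id", "value"])` to obtain a table with the same columns as the output above.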


## 3. Converting long-format data into wide-format data 
### (i.e. pandas.DataFrame to xpandas.XDataFrame)

The .ts file has now been read into long format. This is a pandas.DataFrame with columns for [case_id, dimension_id, reading_id, value]. This format can be converted into an xpandas.XDataFrame using the long_format_to_wide_format method in utilities.load_data (note: it should be possible to convert a pandas.DataFrame created through other means with this method too, providing that the expected column names are included).

To facilitate loading classification problems into an XDataFrame and XSeries (as above), the method accepts a class_dimension_name argument. By default, if last_dim_is_class_val=True in load_from_file_to_long_format, all class values in the long table will belong to dimension "c". Hence, it is set to "c" below, but the default value is None to facilitate loading any kind of data. Again, this would likely be handled more effectively by a Task object.

In [6]:
from utilities.load_data import long_format_to_wide_format

wide_x, wide_y = long_format_to_wide_format(long_table_example, class_dimension_name="c")
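
For readers who want to see what such a conversion involves, the following sketch regroups a long table into one series per (case, dimension) cell. It is a simplified stand-in written against plain pandas, not the code of long_format_to_wide_format; `long_to_nested` is a hypothetical name, and the toy table and its labels "a" and "b" are invented.

```python
import numpy as np
import pandas as pd


def long_to_nested(long_df, class_dimension_name=None):
    """Regroup a long table into a nested frame plus optional class labels."""
    labels = None
    if class_dimension_name is not None:
        # Rows belonging to the class dimension become the label series.
        is_class = long_df["dimension_id"] == class_dimension_name
        labels = long_df[is_class].set_index("case_id")["value"].sort_index()
        long_df = long_df[~is_class]

    case_ids = sorted(long_df["case_id"].unique())
    columns = {}
    for dimension_id, dim_rows in long_df.groupby("dimension_id"):
        cells = np.empty(len(case_ids), dtype=object)
        for position, case_id in enumerate(case_ids):
            rows = dim_rows[dim_rows["case_id"] == case_id]
            # One series per cell, indexed by reading_id.
            cells[position] = pd.Series(
                rows["value"].astype(float).to_numpy(),
                index=rows["reading_id"].to_numpy(),
            )
        columns["dim_" + str(dimension_id)] = cells
    return pd.DataFrame(columns, index=case_ids), labels


toy_long = pd.DataFrame({
    "case_id":      [0, 0, 0, 1, 1, 1],
    "dimension_id": [0, 0, "c", 0, 0, "c"],
    "reading_id":   [0, 1, 0, 0, 1, 0],
    "value":        [1.0, 2.0, "a", 3.0, 4.0, "b"],
})
toy_x, toy_y = long_to_nested(toy_long, class_dimension_name="c")
print(toy_x["dim_0"].iloc[1].tolist())  # [3.0, 4.0]
print(toy_y.tolist())                   # ['a', 'b']
```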

In [7]:
wide_x.head()

Unnamed: 0,dim_0
0,0 -0.647885 1 -0.641992 2 -0.63818...
1,0 -0.644427 1 -0.645401 2 -0.64705...
2,0 -0.778353 1 -0.778279 2 -0.77715...
3,0 -0.750060 1 -0.748103 2 -0.74616...
4,0 -0.599539 1 -0.597422 2 -0.59926...


In [8]:
wide_y.head()

0    2
1    2
2    1
3    1
4    2
dtype: object
data_type: <class 'str'>

(train_x,train_y) and (wide_x,wide_y) should be equivalent, demonstrating the capability to either load data directly from a .ts file, or convert data in an appropriately formatted long-table format (a pandas.DataFrame with correct header information). Once in the correct form, a model (e.g. classifier) can be built using fit(train_x,train_y) and predict(test_x), where train_x and test_x are xpandas.XDataFrame objects and train_y is an xpandas.XSeries object.

Below is a _very_ hacky way to demonstrate the equivalence of train_x and wide_x, and of train_y and wide_y, by comparing the .to_string() output of the appropriate data structures.

In [9]:
train_x.to_string()==wide_x.to_string()

True

In [10]:
train_y.to_string()==wide_y.to_string()

True
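
A less string-dependent check is to compare the two structures cell by cell. The sketch below shows one possible approach on toy data; it assumes the frames support pandas-style column and .iloc access, as the examples in this document do, and the names `nested_frames_equal` and `make_nested` are hypothetical.

```python
import numpy as np
import pandas as pd


def nested_frames_equal(left, right):
    """True if both frames share labels and hold equal series in every cell."""
    if not left.columns.equals(right.columns) or not left.index.equals(right.index):
        return False
    return all(
        left[col].iloc[i].equals(right[col].iloc[i])
        for col in left.columns
        for i in range(len(left))
    )


def make_nested(rows):
    """Build a single-column nested frame from lists of readings (toy helper)."""
    cells = np.empty(len(rows), dtype=object)
    for i, values in enumerate(rows):
        cells[i] = pd.Series(values, dtype=float)
    return pd.DataFrame({"dim_0": cells})


a = make_nested([[1, 2, 3], [4, 5]])
b = make_nested([[1, 2, 3], [4, 5]])
c = make_nested([[1, 2, 3], [4, 6]])
print(nested_frames_equal(a, b))  # True
print(nested_frames_equal(a, c))  # False
```

Unlike comparing .to_string() output, this does not depend on display truncation, and pandas.Series.equals treats NaN values in matching positions as equal.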

## 4. Testing with various dataset use cases

### 4.1 Univariate, equal length, no missing
GunPoint (example from above)

In [11]:
dataset_name = "GunPoint"
train_x, train_y = load_from_web_to_xdataframe(dataset_name, is_train_file=True, cache_path=cache_path) 
test_x, test_y = load_from_web_to_xdataframe(dataset_name, is_test_file=True, cache_path=cache_path)

train_x.head()

Unnamed: 0,dim_0
0,0 -0.647885 1 -0.641992 2 -0.63818...
1,0 -0.644427 1 -0.645401 2 -0.64705...
2,0 -0.778353 1 -0.778279 2 -0.77715...
3,0 -0.750060 1 -0.748103 2 -0.74616...
4,0 -0.599539 1 -0.597422 2 -0.59926...


In [12]:
print("length of series 0: "+str(len(train_x.dim_0.iloc[0])))
print("length of series 10: "+str(len(train_x.dim_0.iloc[10])))

length of series 0: 150
length of series 10: 150


In [13]:
from collections import namedtuple
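# fields: dataset name, the XDataFrame of series (data_x) and the XSeries of class values (data_y)
TSCDataset = namedtuple("TSCDataset", "dataset_name data_x data_y")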

blob = TSCDataset("Gunpoint_TRAIN", train_x, train_y)
blob.dataset_name

'Gunpoint_TRAIN'

### 4.2. Univariate, unequal length, no missing
PLAID

In [14]:
dataset_name = "PLAID"
train_x, train_y = load_from_web_to_xdataframe(dataset_name, is_train_file=True, cache_path=cache_path) 
test_x, test_y = load_from_web_to_xdataframe(dataset_name, is_test_file=True, cache_path=cache_path)

In [15]:
print("length of series 0: "+str(len(train_x.dim_0.iloc[0])))
print("length of series 10: "+str(len(train_x.dim_0.iloc[10])))

length of series 0: 500
length of series 10: 300


### 4.3. Univariate, unequal length, with missing vals
DodgerLoopDay

In [16]:
dataset_name = "DodgerLoopDay"
train_x, train_y = load_from_web_to_xdataframe(dataset_name, is_train_file=True, cache_path=cache_path) 
test_x, test_y = load_from_web_to_xdataframe(dataset_name, is_test_file=True, cache_path=cache_path)

In [17]:
# series should be of equal length because NaN values have been inserted in place of unknown values (as these may
# have a meaningfully different interpretation to values that are completely omitted from the original input). Below
# is a subsequence of series 16 (with missing vals), which has the same length as series 0 (without missing vals)

print("length of series 0: "+str(len(train_x.dim_0.iloc[0])))
print("length of series 16: "+str(len(train_x.dim_0.iloc[16])))

train_x.dim_0.iloc[16][145:165]

length of series 0: 288
length of series 16: 288


145    18.0
146    21.0
147    22.0
148    29.0
149    30.0
150    27.0
151     NaN
152     NaN
153     NaN
154     NaN
155     NaN
156     NaN
157     NaN
158     NaN
159     NaN
160     NaN
161     NaN
162     NaN
163     NaN
164     NaN
dtype: float64
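
The length-preserving behaviour seen above follows from substituting a placeholder for each ? rather than dropping it. Here is a minimal standard-library sketch of that idea; `parse_with_missing` is a hypothetical helper, and its `replace_missing_vals_with` parameter merely mirrors the name of the argument to load_from_tsfile_to_xdataframe without claiming to reproduce its behaviour.

```python
import math


def parse_with_missing(text, replace_missing_vals_with="NaN"):
    """Parse comma-separated readings, substituting a value for each '?'."""
    tokens = [token.strip() for token in text.split(",")]
    return [
        float(replace_missing_vals_with if token == "?" else token)
        for token in tokens
    ]


values = parse_with_missing("18,21,?,?,30")
print(len(values))  # 5: missing readings still occupy a position
print(values)       # [18.0, 21.0, nan, nan, 30.0]
```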

### 4.4. Multivariate, equal length, no missing
BasicMotions

In [18]:
dataset_name = "BasicMotions"
train_x, train_y = load_from_web_to_xdataframe(dataset_name, is_train_file=True, cache_path=cache_path) 
test_x, test_y = load_from_web_to_xdataframe(dataset_name, is_test_file=True, cache_path=cache_path)

In [19]:
# BasicMotions is multivariate and the XDataFrame has multiple columns to reflect this:
train_x.keys()

Index(['dim_0', 'dim_1', 'dim_2', 'dim_3', 'dim_4', 'dim_5'], dtype='object')

In [20]:
print("length of series 0: "+str(len(train_x.dim_0.iloc[0])))
print("length of series 10: "+str(len(train_x.dim_0.iloc[10])))

train_x.iloc[0]

length of series 0: 100
length of series 10: 100


dim_0    0     0.079106
1     0.079106
2    -0.903497
3...
dim_1    0     0.394032
1     0.394032
2    -3.666397
3...
dim_2    0     0.551444
1     0.551444
2    -0.282844
3...
dim_3    0     0.351565
1     0.351565
2    -0.095881
3...
dim_4    0     0.023970
1     0.023970
2    -0.319605
3...
dim_5    0     0.633883
1     0.633883
2     0.972131
3...
Name: 0, dtype: object
data_type: <class 'pandas.core.series.Series'>

### 4.5. Multivariate, unequal length, no missing
JapaneseVowels 
(variable length between cases, but not within cases - this should be supported by the code in any case)

In [21]:
dataset_name = "JapaneseVowels"
train_x, train_y = load_from_web_to_xdataframe(dataset_name, is_train_file=True, cache_path=cache_path) 
test_x, test_y = load_from_web_to_xdataframe(dataset_name, is_test_file=True, cache_path=cache_path)

In [22]:
# JapaneseVowels is multivariate and the XDataFrame has multiple columns to reflect this:
train_x.keys()

Index(['dim_0', 'dim_1', 'dim_2', 'dim_3', 'dim_4', 'dim_5', 'dim_6', 'dim_7',
       'dim_8', 'dim_9', 'dim_10', 'dim_11'],
      dtype='object')

In [23]:
print("length of series 0, dimension 0: "+str(len(train_x.dim_0[0])))
print("length of series 0, dimension 9: "+str(len(train_x.dim_9[0])))
print("length of series 7, dimension 0: "+str(len(train_x.dim_0[7])))
print("length of series 7, dimension 9: "+str(len(train_x.dim_9[7])))

length of series 0, dimension 0: 20
length of series 0, dimension 9: 20
length of series 7, dimension 0: 18
length of series 7, dimension 9: 18
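
Rather than printing individual lengths, the same properties can be checked for every case at once. The sketch below tabulates the length of each cell and tests whether lengths agree within cases and between cases. It uses a toy pandas frame shaped like the output above (lengths 20 and 18), not the real JapaneseVowels data, and `series_lengths` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def series_lengths(nested):
    """Return a frame holding the length of the series in every cell."""
    return pd.DataFrame(
        {col: [len(cell) for cell in nested[col]] for col in nested.columns},
        index=nested.index,
    )


def make_column(lengths):
    """Toy helper: one zero-filled series per case, with the given lengths."""
    cells = np.empty(len(lengths), dtype=object)
    for i, n in enumerate(lengths):
        cells[i] = pd.Series(np.zeros(n))
    return cells


# Two cases, two dimensions: lengths differ between cases but not within a case.
toy = pd.DataFrame({"dim_0": make_column([20, 18]), "dim_1": make_column([20, 18])})

lengths = series_lengths(toy)
equal_within_cases = bool((lengths.nunique(axis=1) == 1).all())
equal_between_cases = bool((lengths.nunique(axis=0) == 1).all())
print(lengths["dim_0"].tolist())  # [20, 18]
print(equal_within_cases)         # True
print(equal_between_cases)        # False
```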
