# MAT-data: Data Handler for Multiple Aspect Trajectory Data Mining \[MAT-Tools Framework\]

Sample code in a Python notebook showing how to use mat-data as a Python library.

This package offers a tool to support the user in preprocessing multiple aspect trajectory data or in generating synthetic datasets. It integrates into a unified framework for multiple aspect trajectories and, more generally, for multidimensional sequence data mining methods.

Created on Dec, 2023
Copyright (C) 2023, License GPL Version 3 or later (see LICENSE file)

Check the Documentation: https://mat-analysis.github.io/mat-tools/mat-data

In [None]:
!pip install mat-data
#!pip install --upgrade mat-data

This package provides the script `MAT-GetData.py` for downloading data in various formats, including k-fold and hold-out cross-validation splits:

In [None]:
!MAT-GetData.py . 'sequential.Promoters,mat.FoursquareNYC' -k 1

In [None]:
!MAT-GetData.py --help

### 1. Reading local data

Sample code for reading a trajectory dataset from local files.

The easiest way to read data is to load a CSV file with pandas, for example:

In [6]:
import pandas as pd

data_path = 'matdata/assets/sample'

pd.read_csv(data_path + '/Foursquare_Sample.csv')

Unnamed: 0,tid,lat_lon,date_time,time,rating,price,weather,day,root_type,type
0,126,40.8331652006224 -73.9418603427692,2012-11-12 05:17:18,317,-1.0,-1,Clear,Monday,Residence,Home (private)
1,126,40.8340978041072 -73.9452672225881,2012-11-12 23:24:55,1404,8.2,1,Clouds,Monday,Food,Deli / Bodega
2,126,40.8331652006224 -73.9418603427692,2012-11-13 00:00:07,0,-1.0,-1,Clouds,Tuesday,Residence,Home (private)
3,126,40.7646959283254 -73.8851974964414,2012-11-15 17:49:01,1069,6.6,3,Clear,Thursday,Food,Fried Chicken Joint
4,126,40.7660790376824 -73.8835287094116,2012-11-15 18:40:16,1120,-1.0,-1,Clear,Thursday,Travel & Transport,Bus Station
...,...,...,...,...,...,...,...,...,...,...
66957,29563,40.7047332789043 -73.9877378940582,2012-08-10 17:17:37,1037,-1.0,-1,Clouds,Friday,College & University,General College & University
66958,29563,40.6951627360199 -73.9954478691072,2012-08-10 20:10:59,1210,8.0,2,Clouds,Friday,Food,Thai Restaurant
66959,29563,40.6978026652822 -73.9941451630314,2012-08-11 08:01:20,481,6.9,-1,Clouds,Saturday,Outdoors & Recreation,Gym
66960,29563,40.6946728967503 -73.9940820360805,2012-08-11 13:39:39,819,7.0,1,Clouds,Saturday,Food,Coffee Shop
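
Each row of this table is one trajectory point, and the `tid` column says which trajectory the point belongs to. As a minimal, dependency-free sketch of that structure (not part of `mat-data`; the helper name `group_by_trajectory` is hypothetical), the snippet below parses three rows copied from the output above using only the standard library and groups them by `tid`:

```python
import csv
import io
from collections import defaultdict

# Three rows copied from the sample output above (index column omitted).
SAMPLE_CSV = """tid,lat_lon,date_time,time,rating,price,weather,day,root_type,type
126,40.8331652006224 -73.9418603427692,2012-11-12 05:17:18,317,-1.0,-1,Clear,Monday,Residence,Home (private)
126,40.8340978041072 -73.9452672225881,2012-11-12 23:24:55,1404,8.2,1,Clouds,Monday,Food,Deli / Bodega
29563,40.6946728967503 -73.9940820360805,2012-08-11 13:39:39,819,7.0,1,Clouds,Saturday,Food,Coffee Shop
"""


def group_by_trajectory(csv_text, tid_col="tid"):
    """Group CSV rows (points) into trajectories keyed by trajectory id."""
    trajectories = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        trajectories[row[tid_col]].append(row)
    return dict(trajectories)


trajectories = group_by_trajectory(SAMPLE_CSV)
for tid, points in trajectories.items():
    # Print the trajectory id, its number of points and the visited place types.
    print(tid, len(points), [p["type"] for p in points])
```

Note that `csv.DictReader` returns every value as a string, so the trajectory ids here are `"126"` and `"29563"`.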


`mat-data` provides modules to handle dataset reading in standard ways:

    a) Read a dataset locally:
    
This example reads a .csv file, but the same function can also read .parquet, .zip, .ts, and .xes files.

In [None]:
from matdata.dataset import *

In [None]:
df = read_ds('matdata/assets/sample/Foursquare_Sample.csv')
df.head()

Optionally, you can use the standardized reading, which applies the 'tid'/'label' nomenclature (renaming the columns you indicate) and sorts the trajectories:

In [None]:
df = read_ds('matdata/assets/sample/Foursquare_Sample.csv', tid_col='tid', class_col='root_type')
df.head()
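
To make the idea of standardized reading concrete, here is a rough sketch in plain Python of what "rename to 'tid'/'label' and sort" could look like. It is an illustration under assumptions, not the `read_ds` implementation: the helper `standardize`, the toy rows, and the sort order (by label, then by trajectory id) are all hypothetical.

```python
def standardize(rows, tid_col="tid", class_col="label"):
    """Rename the id/class columns to 'tid'/'label' and sort the points.

    Assumption: trajectories are ordered by (label, tid). sorted() is stable,
    so points keep their original order inside each trajectory.
    """
    renamed = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k not in (tid_col, class_col)}
        new_row["tid"] = row[tid_col]
        new_row["label"] = row[class_col]
        renamed.append(new_row)
    return sorted(renamed, key=lambda r: (r["label"], r["tid"]))


# Hypothetical toy points: one class value per trajectory.
points = [
    {"tid": 200, "root_type": "Food", "time": 317},
    {"tid": 126, "root_type": "Residence", "time": 1404},
    {"tid": 126, "root_type": "Residence", "time": 0},
    {"tid": 150, "root_type": "Food", "time": 1069},
]

result = standardize(points, tid_col="tid", class_col="root_type")
for row in result:
    print(row)
```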

### 2. Loading Repository Data

This module loads data from the public repository [Git: mat-analysis datasets (v2_0)](https://github.com/mat-analysis/datasets)

Check the GitHub repository to see available datasets.

To use helpers for data loading, import from package `matdata.dataset`:

In [7]:
from matdata.dataset import *

    a) First, you can load datasets by specifying the category (parent folder) and dataset name (subfolder):

In [8]:
# dataset='mat.FoursquareNYC' ## => default

df = load_ds(sample_size=0.25)
df

Loading dataset file: https://github.com/mat-analysis/datasets/tree/main/mat/FoursquareNYC/


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1055k  100 1055k    0     0  2431k      0 --:--:-- --:--:-- --:--:-- 2448k


Stratification (class-balanced):   0%|          | 0/193 [00:00<?, ?it/s]

Sorting data:   0%|          | 0/193 [00:00<?, ?it/s]

Unnamed: 0,space,time,day,poi,type,root_type,rating,weather,tid,label
0,40.8340978041072 -73.9452672225881,788,Monday,Galaxy Gourmet Deli,Deli / Bodega,Food,8.2,Clouds,127,6
1,40.5671960000000 -73.8825760000000,1175,Monday,MTA Bus - Beach 169 St & Rockaway Point Bl (Q2...,Bus Stop,Travel & Transport,-1.0,Clouds,127,6
2,40.6899127194574 -73.9815044403076,1381,Monday,MTA Subway - DeKalb Ave (B/Q/R),Metro Station,Travel & Transport,-1.0,Clouds,127,6
3,40.7085883614824 -73.9910316467285,1404,Monday,MTA Subway - Manhattan Bridge (B/D/N/Q),Train,Travel & Transport,-1.0,Clouds,127,6
4,40.8331652006224 -73.9418603427692,845,Tuesday,The Grinnell,Home (private),Residence,-1.0,Clear,127,6
...,...,...,...,...,...,...,...,...,...,...
17,40.7047332789043 -73.9877378940582,939,Thursday,Miami Ad School Brooklyn,General College & University,College & University,-1.0,Clear,29559,1070
18,40.6978026652822 -73.9941451630314,483,Friday,Eastern Athletic Club,Gym,Outdoors & Recreation,6.9,Clear,29559,1070
19,40.6946728967503 -73.9940820360805,794,Friday,Starbucks,Coffee Shop,Food,7.0,Clear,29559,1070
20,40.7023694709909 -73.9875124790989,1261,Friday,Superfine,American Restaurant,Food,7.6,Clear,29559,1070


     b) Second, you can load the available hold-out split (70/30 by default):

In [9]:
df_train, df_test = load_ds_holdout()

print(df_train.shape, df_test.shape)

df_train

Loading dataset file: https://github.com/mat-analysis/datasets/tree/main/mat/FoursquareNYC/


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1055k  100 1055k    0     0  2413k      0 --:--:-- --:--:-- --:--:-- 2437k


Spliting Data (class-balanced):   0%|          | 0/193 [00:00<?, ?it/s]

Sorting data:   0%|          | 0/193 [00:00<?, ?it/s]

Sorting data:   0%|          | 0/193 [00:00<?, ?it/s]

(44989, 10) (21957, 10)


Unnamed: 0,tid,label,space,time,day,poi,type,root_type,rating,weather
0,127,6,40.8340978041072 -73.9452672225881,788,Monday,Galaxy Gourmet Deli,Deli / Bodega,Food,8.2,Clouds
1,127,6,40.5671960000000 -73.8825760000000,1175,Monday,MTA Bus - Beach 169 St & Rockaway Point Bl (Q2...,Bus Stop,Travel & Transport,-1.0,Clouds
2,127,6,40.6899127194574 -73.9815044403076,1381,Monday,MTA Subway - DeKalb Ave (B/Q/R),Metro Station,Travel & Transport,-1.0,Clouds
3,127,6,40.7085883614824 -73.9910316467285,1404,Monday,MTA Subway - Manhattan Bridge (B/D/N/Q),Train,Travel & Transport,-1.0,Clouds
4,127,6,40.8331652006224 -73.9418603427692,845,Tuesday,The Grinnell,Home (private),Residence,-1.0,Clear
...,...,...,...,...,...,...,...,...,...,...
89,29563,1070,40.7047332789043 -73.9877378940582,1037,Friday,Miami Ad School Brooklyn,General College & University,College & University,-1.0,Clouds
90,29563,1070,40.6951627360199 -73.9954478691072,1210,Friday,Lantern Thai Kitchen,Thai Restaurant,Food,8.0,Clouds
91,29563,1070,40.6978026652822 -73.9941451630314,481,Saturday,Eastern Athletic Club,Gym,Outdoors & Recreation,6.9,Clouds
92,29563,1070,40.6946728967503 -73.9940820360805,819,Saturday,Starbucks,Coffee Shop,Food,7.0,Clouds


Or you can split the hold-out in a different proportion (50%, for instance):

In [None]:
df_train, df_test = load_ds_holdout(train_size=0.5)

# The split is class-balanced, so the train/test trajectory counts may not match the requested proportion exactly.
print(df_train.shape, df_test.shape)

df_train
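
The reason a class-balanced split is only approximately proportional can be seen in a small sketch. The code below is not the `mat-data` implementation; `holdout_split` and the toy data are hypothetical. It splits trajectory ids class by class, and because each class is rounded separately, the overall ratio drifts slightly from the requested `train_size`:

```python
import random
from collections import defaultdict


def holdout_split(labels_by_tid, train_size=0.7, seed=1):
    """Split trajectory ids into train/test, one class at a time."""
    rng = random.Random(seed)
    tids_by_label = defaultdict(list)
    for tid, label in labels_by_tid.items():
        tids_by_label[label].append(tid)

    train, test = [], []
    for label in sorted(tids_by_label):
        tids = sorted(tids_by_label[label])
        rng.shuffle(tids)
        # Rounding per class is why overall proportions are only approximate.
        cut = max(1, round(train_size * len(tids)))
        train.extend(tids[:cut])
        test.extend(tids[cut:])
    return train, test


# Hypothetical data: 3 classes with 3, 5 and 7 trajectories (15 in total).
labels_by_tid = {}
tid = 0
for label, count in (("A", 3), ("B", 5), ("C", 7)):
    for _ in range(count):
        labels_by_tid[tid] = label
        tid += 1

train, test = holdout_split(labels_by_tid, train_size=0.5)
print(len(train), len(test))
```

With 15 trajectories and `train_size=0.5`, an exact half is impossible, and the per-class rounding decides which side gets the extra trajectories.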

    c) Or, you can load the available k-fold split datasets (default k=5):

In [None]:
df_train, df_test = load_ds_kfold()

for k in range(len(df_train)):
    print('Shape train/test:', df_train[k].shape, df_test[k].shape)
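
For readers unfamiliar with k-fold splitting, the following sketch shows the general technique on plain indices. It is a generic illustration with a hypothetical helper `kfold_indices`; unlike the repository splits, it does not balance classes. Each item lands in the test set of exactly one fold, so with k=5 every fold is roughly an 80/20 split.

```python
def kfold_indices(n_items, k=5):
    """Yield (train_idx, test_idx) pairs; each item is tested exactly once."""
    indices = list(range(n_items))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(kfold_indices(10, k=5))
for train_idx, test_idx in folds:
    print('Shape train/test:', len(train_idx), len(test_idx))
```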

    d) You can load a different dataset from the repository:

In [None]:
# Use the format: 'category.DatasetName'
dataset='raw.Animals'

df = load_ds(dataset)
df
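
The `'category.DatasetName'` identifier maps onto the repository layout of parent folder and subfolder. The sketch below shows that mapping, using the folder URL pattern printed in the loading messages earlier in this notebook. The helper `dataset_folder_url` is hypothetical and is not a `mat-data` function; that other datasets follow the same URL pattern is an assumption.

```python
# URL prefix taken from the "Loading dataset file" message shown earlier.
REPO_URL = "https://github.com/mat-analysis/datasets/tree/main"


def dataset_folder_url(dataset="mat.FoursquareNYC"):
    """Split 'category.DatasetName' and build the repository folder URL."""
    category, name = dataset.split(".", 1)
    return f"{REPO_URL}/{category}/{name}/"


# The same comma-separated list format used by the MAT-GetData.py example.
for ds in "sequential.Promoters,mat.FoursquareNYC".split(","):
    print(dataset_folder_url(ds))
```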

    e) To get a full list of available repositories and categories:

In [None]:
rd = repository_datasets()
rd

### 3. Pre-processing data
To use helpers for data pre-processing, import from package `matdata.preprocess`:

In [None]:
from matdata.preprocess import *

The **preprocess** module provides functions to work with data:

Basic functions:
- `readDataset`: load datasets as pandas DataFrame (from .csv, .parquet, .zip, .ts or .xes)
- `organizeFrame`: standardize data columns for the DataFrame

Train and Test split functions:
- `trainTestSplit`: split dataset (pandas DataFrame) into train / test (70/30% by default)
- `kfold_trainTestSplit`: split dataset (pandas DataFrame) into k-fold train / test (5 folds of 80/20% each by default)
- `stratify`: extract trajectories from the dataset, respecting class balance, creating a subset of the data (to use when smaller datasets are needed)
- `klabels_stratify`: k-labels stratification (randomly select k labels from the dataset)
- `joinTrainTest`: joins the separated train and test files into one DataFrame.

Statistical functions:
- `printFeaturesJSON`: print a default JSON file descriptor for Movelets methods (version 1 or 2)
- `countClasses`: calculates statistics from a dataset dataframe
- `dfVariance`: calculates a variance rank from a dataset dataframe
- `dfStats`: calculates attributes statistics ordered by variance from a dataset dataframe
- `datasetStatistics`: generates dataset statistics from a dataframe in markdown text.

Type reading functions:
- `csv2df`: reads .csv dataset dataframe
- `parquet2df`: reads .parquet dataset dataframe
- `zip2df`: reads .zip dataset dataframe (zip containing trajectory csv files)
- `ts2df`: reads .ts dataset dataframe (Time Series data format)
- `xes2df`: reads .xes dataset dataframe (event log / event stream file)
- `mat2df`: *TODO* reads .mat dataset dataframe (multiple aspect trajectory specific file format)

File conversion functions:
- `zip2csv`: converts .zip files and saves to .csv files
- `df2zip`: converts DataFrame and saves to .zip files
- `any2ts`: converts .zip or .csv files and saves to .ts files
- `xes2csv`: reads .xes files and converts to DataFrame
- `convertDataset`: default format conversions. Reads the dataset files and saves them in .csv and .zip formats; also creates the k-fold split if not present

    a) Basic data reading and organization:

In [None]:
data_path = 'matdata/assets/sample'

df = readDataset(data_path, file='Foursquare_Sample.csv')
df.head()

In [None]:
df, space_cols, ll_cols = organizeFrame(df, make_spatials=True)

print('Columns with space: ', space_cols)
print('Columns with lat/lon: ', ll_cols)
df.head()
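
The sample data stores each position as a single string of the form `'lat lon'`. As a minimal sketch of what separating such a column into numeric coordinates involves (how `organizeFrame` does this internally, and which column names it produces, is not shown here; `split_lat_lon` is a hypothetical helper):

```python
def split_lat_lon(value):
    """Parse a 'lat lon' string into a (lat, lon) tuple of floats."""
    lat, lon = value.split()
    return float(lat), float(lon)


# Value copied from the first row of the sample dataset.
lat, lon = split_lat_lon("40.8331652006224 -73.9418603427692")
print(lat, lon)
```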

**Note:** For better standardization in classification tasks, we recommend the `prepare_ds` function from the `dataset` module, which lets you indicate the class column:

In [None]:
from matdata.dataset import prepare_ds

df = prepare_ds(df, class_col='root_type') # 'root_type' is then renamed 'label'
df

    b) Train and test split:

To hold-out split a dataset into train and test (70/30% by default):

In [None]:
train, test = trainTestSplit(df, random_num=1)
train.head()

If you want to save, indicate the output format and data path:

In [None]:
trainTestSplit(df, data_path=data_path, outformats=['csv', 'parquet'])

# Reading:
df = readDataset(data_path, file='train.parquet')
df.head()

To k-fold split a dataset into train and test:

In [None]:
train, test = kfold_trainTestSplit(df, k=3)

for k in range(len(train)):
    print('Shape train/test:', train[k].shape, test[k].shape)

    c) Stratifying the data (example to get 50% of the dataset):

In [None]:
train, test = stratify(df, sample_size=0.5)

print('Shape train/test:', train.shape, test.shape)
train.head()

k-labels stratification (example randomly selecting 5 labels from the dataset):

In [None]:
train, test = klabels_stratify(df, kl=5)

print('Shape train/test:', train.shape, test.shape)


print('Labels before:', df.label.unique())
print('Labels after:', train.label.unique())
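
Conceptually, selecting k labels means drawing a random subset of the class labels and keeping only the trajectories that belong to them. The sketch below illustrates that idea on a toy mapping from trajectory id to label. It is not the `klabels_stratify` implementation; `select_k_labels` and the data are hypothetical.

```python
import random


def select_k_labels(labels_by_tid, kl=2, seed=1):
    """Randomly pick kl labels and keep only the trajectories that carry them."""
    rng = random.Random(seed)
    all_labels = sorted(set(labels_by_tid.values()))
    chosen = set(rng.sample(all_labels, kl))
    return {tid: lab for tid, lab in labels_by_tid.items() if lab in chosen}


# Hypothetical mapping: trajectory id -> class label.
labels_by_tid = {1: "A", 2: "A", 3: "B", 4: "C", 5: "C", 6: "D"}
subset = select_k_labels(labels_by_tid, kl=2)

print('Labels before:', sorted(set(labels_by_tid.values())))
print('Labels after:', sorted(set(subset.values())))
```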

    d) Joining train and test files:

In [None]:
df = joinTrainTest(data_path, train_file="train.csv", test_file="test.csv", to_file=True) # Saves 'joined.csv' file

df.head()

**Note:** We standardized all repository datasets by creating a `data.parquet` file with `np.nan` as the missing value, for example:

In [None]:
from matdata.preprocess import *
from matdata.dataset import prepare_ds
import numpy as np

data_path = 'matdata/assets/sample'

df.replace('?', np.nan, inplace=True)  # np.nan works on all NumPy versions (the np.NaN alias was removed in NumPy 2.0)

df = prepare_ds(df)

df2parquet(df, data_path, 'data')

### 4. Synthetic Data Generation

Methods for generating trajectory data, either random or sampled from real data.

In [None]:
from matdata.generator import *

- `scalerSamplerGenerator`: generates trajectory datasets based on real data on scale intervals
- `samplerGenerator`: generates a trajectory dataset based on real data
- `scalerRandomGenerator`: generates trajectory datasets based on random data on scale intervals
- `randomGenerator`: generates a trajectory dataset based on random data

    a) To generate a sample dataset (default config):

In [None]:
samplerGenerator()

To specify the synthetic dataset parameters:

In [None]:
N=10 # Number of trajectories
M=50 # Number of points per trajectory
C=3  # Number of classes (C1 to Cn)
samplerGenerator(N, M, C)

    b) To generate a set of sample datasets:
    
Creates and saves dataset files (including the movelets JSON descriptor file). Generates sample datasets on an increasing log scale for each parameter, using the middle value for the other parameters.

In [None]:
data_path = 'matdata/assets/sample/samples'

Ns=[100, 3]   # Min. number of trajectories: 100, 3 scales (by log increment)  
Ms=[10,  3]   # Min. number of points: 10, 3 scales (by log increment)
Ls=[8,   3]   # Min. number of attributes: 8, 3 scales (by log increment)
Cs=[2,   3]   # Min. number of labels: 2, 3 scales (by log increment)

scalerSamplerGenerator(Ns, Ms, Ls, Cs, save_to=data_path)
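
The "one parameter varies on a log scale while the others stay at their middle value" scheme can be sketched as follows. This is an illustration under explicit assumptions, not the generator's actual code: the helpers `log_scale` and `scaled_configs` are hypothetical, the log increment is assumed to be base 10 (the base is not stated here), and dropping duplicate configurations is also an assumption.

```python
def log_scale(minimum, n_scales, base=10):
    """Assumed scale: minimum * base**i for i in 0..n_scales-1."""
    return [minimum * base ** i for i in range(n_scales)]


def scaled_configs(**params):
    """Vary one parameter at a time; the others stay at their middle value."""
    scales = {name: log_scale(*spec) for name, spec in params.items()}
    middle = {name: values[len(values) // 2] for name, values in scales.items()}
    configs = []
    for name, values in scales.items():
        for value in values:
            config = dict(middle, **{name: value})
            if config not in configs:  # keep the all-middle config only once
                configs.append(config)
    return configs


# Same [minimum, number of scales] pairs as in the cell above.
configs = scaled_configs(N=[100, 3], M=[10, 3], L=[8, 3], C=[2, 3])
for config in configs:
    print(config)
```

Under these assumptions, four parameters with three scales each give nine distinct configurations rather than 3^4 = 81, which is the point of fixing the other parameters at their middle value.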

    c) To generate a random dataset (default config):

In [None]:
randomGenerator()

To specify the synthetic random dataset parameters:

In [None]:
N=10 # Number of trajectories
M=50 # Number of points per trajectory
L=10 # Number of attributes
C=3  # Number of classes (C1 to Cn)
randomGenerator(N, M, L, C)
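
To show how the four parameters shape a synthetic dataset, here is a minimal standard-library sketch that builds N trajectories of M points, each with L random attributes, and assigns classes C1 to Cn. It is not the `randomGenerator` implementation: the function `random_trajectories`, the attribute names `a1..aL`, the uniform random values, and the round-robin class assignment are all assumptions made for illustration.

```python
import random


def random_trajectories(n_traj=10, n_points=50, n_attrs=10, n_classes=3, seed=1):
    """Build n_traj trajectories of n_points rows with n_attrs random attributes."""
    rng = random.Random(seed)
    rows = []
    for tid in range(1, n_traj + 1):
        # Assumption: classes C1..Cn are assigned to trajectories in rotation.
        label = f"C{(tid - 1) % n_classes + 1}"
        for _ in range(n_points):
            row = {"tid": tid, "label": label}
            # Hypothetical attribute names a1..aL with uniform values in [0, 1).
            row.update({f"a{j + 1}": rng.random() for j in range(n_attrs)})
            rows.append(row)
    return rows


rows = random_trajectories(n_traj=10, n_points=50, n_attrs=10, n_classes=3)
print(len(rows), len(rows[0]), sorted({r["label"] for r in rows}))
```

The resulting table has N x M rows and L + 2 columns (the attributes plus `tid` and `label`).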

    d) To generate a set of random datasets:
    
Creates and saves dataset files (including the movelets JSON descriptor file). Generates random datasets on an increasing log scale for each parameter, using the middle value for the other parameters.

In [None]:
data_path = 'matdata/assets/sample/random'

Ns=[100, 3]   # Min. number of trajectories: 100, 3 scales (by log increment)  
Ms=[10,  3]   # Min. number of points: 10, 3 scales (by log increment)
Ls=[8,   3]   # Min. number of attributes: 8, 3 scales (by log increment)
Cs=[2,   3]   # Min. number of labels: 2, 3 scales (by log increment)

scalerRandomGenerator(Ns, Ms, Ls, Cs, save_to=data_path)

---

By Tarlis Portela (2024)