# MAT-data: Data Preprocessing for Multiple Aspect Trajectory Data Mining \[MAT-Tools Framework\]

Sample code, in a Python notebook, for using mat-data as a Python library.

This package supports the user in preprocessing multiple aspect trajectory data and in generating synthetic datasets. It integrates into a single framework for mining multiple aspect trajectories and, more generally, multidimensional sequence data.

Created on Dec, 2023
Copyright (C) 2023, License GPL Version 3 or later (see LICENSE file)

In [7]:
!pip install mat-data
#!pip install --upgrade mat-data

Collecting mat-data
  Downloading mat_data-0.1b6-py3-none-any.whl.metadata (5.4 kB)
Downloading mat_data-0.1b6-py3-none-any.whl (1.4 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.4/1.4 MB 6.1 MB/s eta 0:00:00
Installing collected packages: mat-data
  Attempting uninstall: mat-data
    Found existing installation: mat-data 0.1b5
    Uninstalling mat-data-0.1b5:
      Successfully uninstalled mat-data-0.1b5
Successfully installed mat-data-0.1b6


### 1. Pre-processing data
To use helpers for data pre-processing, import from package `matdata.preprocess`:

In [3]:
from matdata.preprocess import *

The **preprocess** module provides functions for working with trajectory data:

Basic functions:
- `readDataset`: loads a dataset as a pandas DataFrame (from .csv, .zip, or .ts)
- `printFeaturesJSON`: prints a default JSON file descriptor for Movelets methods (version 1 or 2)
- `datasetStatistics`: calculates statistics from a dataset DataFrame.
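
As a rough illustration of the kind of summary `datasetStatistics` is described as computing, the sketch below counts points, trajectories, and classes from records keyed by `tid` and `label`. The helper name `trajectory_stats`, the sample records, and the chosen statistics are assumptions made for illustration. This is not the library's implementation, and its actual output may differ.

```python
from collections import Counter
from statistics import mean

def trajectory_stats(rows, tid_col="tid", class_col="label"):
    """Summarize a list of point records (dicts). Hypothetical helper."""
    points_per_tid = Counter(row[tid_col] for row in rows)
    # A trajectory carries a single label; keep the first one seen per tid.
    label_of = {}
    for row in rows:
        label_of.setdefault(row[tid_col], row[class_col])
    return {
        "points": len(rows),
        "trajectories": len(points_per_tid),
        "classes": len(set(label_of.values())),
        "mean_points_per_trajectory": mean(list(points_per_tid.values())),
        "trajectories_per_class": dict(Counter(label_of.values())),
    }

# Made-up records: trajectory 12 has three points, trajectory 13 has one.
rows = [
    {"tid": 12, "label": "A", "poi": "Home"},
    {"tid": 12, "label": "A", "poi": "University"},
    {"tid": 12, "label": "A", "poi": "Restaurant"},
    {"tid": 13, "label": "B", "poi": "Home"},
]
stats = trajectory_stats(rows)
print(stats)
```

With a pandas DataFrame, the same counts could be obtained by grouping on the `tid` column.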

Train and Test split functions:
- `trainAndTestSplit`: splits a dataset (pandas DataFrame) into train / test (70/30% by default)
- `kfold_trainAndTestSplit`: splits a dataset (pandas DataFrame) into k-fold train / test (80/20% each fold by default)
- `stratify`: extracts trajectories from the dataset, creating a subset of the data (to use when smaller datasets are needed)
- `joinTrainAndTest`: joins the train and test files into one DataFrame.
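
To make the idea of a 70/30 train / test split concrete, here is a minimal sketch. It assumes the split is made per trajectory (so all points of a `tid` end up on the same side) and class by class. The helper `split_tids` and the toy labels are hypothetical and are not part of matdata; the library's own splitting logic may differ.

```python
import random
from collections import defaultdict

def split_tids(tid_labels, train_size=0.7, seed=1):
    """Split trajectory ids into train/test sets, class by class.

    tid_labels maps each trajectory id to its class label.
    Hypothetical helper for illustration; not part of matdata.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for tid, label in tid_labels.items():
        by_class[label].append(tid)

    train, test = set(), set()
    for tids in by_class.values():
        tids = sorted(tids)
        rng.shuffle(tids)
        cut = int(round(len(tids) * train_size))
        train.update(tids[:cut])
        test.update(tids[cut:])
    return train, test

# Toy data: 20 trajectories, 10 per class, 3 points each.
tid_labels = {tid: ("car" if tid % 2 else "bus") for tid in range(20)}
train_tids, test_tids = split_tids(tid_labels)

# Points are then routed according to the trajectory they belong to.
points = [{"tid": tid, "label": tid_labels[tid]} for tid in range(20) for _ in range(3)]
train_points = [p for p in points if p["tid"] in train_tids]
test_points = [p for p in points if p["tid"] in test_tids]
print(len(train_tids), len(test_tids))
```

Splitting on trajectory ids rather than on individual rows keeps points of one trajectory from leaking into both sets.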

Type conversion functions:
- `convertDataset`: default format conversions. Reads the dataset files and saves them in .csv and .zip formats; also does a k-fold split if one is not present
- `zip2df`: converts .zip files and saves to DataFrame
- `zip2csv`: converts .zip files and saves to .csv files
- `df2zip`: converts DataFrame and saves to .zip files
- `zip2arf`: converts .zip and saves to .arf files
- `any2ts`: converts .zip or .csv files and saves to .ts files
- `xes2csv`: reads .xes files and converts to DataFrame

In [10]:
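#cols = ['tid','label','lat','lon','day','hour','poi','category','price','rating']

# Folder containing train.csv and test.csv (the path reported in the output below)
data_path = 'automatize/assets/examples/Example/data'

df = joinTrainAndTest(data_path, train_file="train.csv", test_file="test.csv", class_col='label')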
df.head()

Joining train and test data from... automatize/assets/examples/Example/data
Done.
 --------------------------------------------------------------------------------


| | tid | lat_lon | hour | price | poi | weather | day | label |
|---|---|---|---|---|---|---|---|---|
| 0 | 12 | 0.0 6.2 | 8 | -1 | Home | Clear | Monday | Classs_False |
| 1 | 12 | 0.8 6.2 | 9 | 2 | University | Clouds | Monday | Classs_False |
| 2 | 12 | 3.1 11 | 12 | 2 | Restaurant | Clear | Monday | Classs_False |
| 3 | 12 | 0.8 6.5 | 13 | 2 | University | Clear | Monday | Classs_False |
| 4 | 12 | 0.2 6.2 | 17 | -1 | Home | Rain | Monday | Classs_False |


To k-fold split a dataset into train and test:

In [11]:
k = 3

train, test = kfold_trainAndTestSplit(data_path, k, df, random_num=1, class_col='label')

3-fold train and test split in... automatize/assets/examples/Example/data


Spliting Data:   0%|          | 0/2 [00:00<?, ?it/s]

Done.
Writing files ... 1/3


Writing TRAIN - ZIP|1:   0%|          | 0/7 [00:00<?, ?it/s]

Writing TEST  - ZIP|1:   0%|          | 0/4 [00:00<?, ?it/s]

Writing TRAIN / TEST - CSV|1


Writing TRAIN - MAT|1:   0%|          | 0/7 [00:00<?, ?it/s]

Writing TEST  - MAT|1:   0%|          | 0/4 [00:00<?, ?it/s]

Writing files ... 2/3


Writing TRAIN - ZIP|2:   0%|          | 0/7 [00:00<?, ?it/s]

Writing TEST  - ZIP|2:   0%|          | 0/4 [00:00<?, ?it/s]

Writing TRAIN / TEST - CSV|2


Writing TRAIN - MAT|2:   0%|          | 0/7 [00:00<?, ?it/s]

Writing TEST  - MAT|2:   0%|          | 0/4 [00:00<?, ?it/s]

Writing files ... 3/3


Writing TRAIN - ZIP|3:   0%|          | 0/8 [00:00<?, ?it/s]

Writing TEST  - ZIP|3:   0%|          | 0/3 [00:00<?, ?it/s]

Writing TRAIN / TEST - CSV|3


Writing TRAIN - MAT|3:   0%|          | 0/8 [00:00<?, ?it/s]

Writing TEST  - MAT|3:   0%|          | 0/3 [00:00<?, ?it/s]

Done.
 --------------------------------------------------------------------------------
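
The k-fold idea can be sketched in a few lines: trajectory ids are shuffled and dealt into k folds, and each fold serves once as the test set while the remaining folds form the training set. The helper `kfold_tids` is hypothetical and ignores class stratification for brevity. It is not the implementation behind `kfold_trainAndTestSplit`.

```python
import random

def kfold_tids(tids, k=3, seed=1):
    """Yield (train, test) lists of trajectory ids for each of k folds.

    Hypothetical helper for illustration; not part of matdata.
    """
    tids = sorted(tids)
    random.Random(seed).shuffle(tids)
    # Deal ids round-robin so that fold sizes differ by at most one.
    folds = [tids[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [t for j, fold in enumerate(folds) if j != i for t in fold]
        yield train, test

splits = list(kfold_tids(range(11), k=3))
for n, (train, test) in enumerate(splits, start=1):
    print(f"fold {n}: {len(train)} train / {len(test)} test")
```

With 11 trajectory ids and k = 3, the folds have 7/4, 7/4, and 8/3 train / test trajectories. These sizes happen to coincide with the totals in the progress bars above, although what those counters measure is not stated here.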


To convert train and test from one available format to other default formats (CSV, ZIP, MAT):

In [4]:
convertDataset(data_path)

Writing TRAIN - ZIP|:   0%|          | 0/14 [00:00<?, ?it/s]

Writing TEST  - ZIP|:   0%|          | 0/14 [00:00<?, ?it/s]

Writing TRAIN - MAT|:   0%|          | 0/14 [00:00<?, ?it/s]

Writing TEST  - MAT|:   0%|          | 0/14 [00:00<?, ?it/s]

All Done.


### 2. Synthetic Data Generation

TODO

In [5]:
from matdata.generator import *

- `scalerSamplerGenerator`: generates trajectory datasets based on real data on scale intervals
- `samplerGenerator`: generates a trajectory dataset based on real data
- `scalerRandomGenerator`: generates trajectory datasets based on random data on scale intervals
- `randomGenerator`: generates a trajectory dataset based on random data
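
Since this section does not yet show how the generators are called, the following is only a conceptual sketch of what generating a trajectory dataset from random data can look like. The function `random_trajectories`, its parameters, and the aspect values are assumptions chosen for illustration. It does not use or reproduce the `matdata.generator` API.

```python
import random

def random_trajectories(n_trajectories=4, points_per_trajectory=3,
                        labels=("A", "B"), seed=1):
    """Build random multiple-aspect trajectory points as a list of dicts.

    Hypothetical helper for illustration; not part of matdata.
    """
    rng = random.Random(seed)
    pois = ["Home", "University", "Restaurant"]
    weather = ["Clear", "Clouds", "Rain"]
    rows = []
    for tid in range(1, n_trajectories + 1):
        label = labels[(tid - 1) % len(labels)]  # one label per trajectory
        # Distinct hours in ascending order, so points follow a timeline.
        hours = sorted(rng.sample(range(24), points_per_trajectory))
        for hour in hours:
            rows.append({
                "tid": tid,
                "label": label,
                "hour": hour,
                "poi": rng.choice(pois),
                "weather": rng.choice(weather),
            })
    return rows

rows = random_trajectories()
print(len(rows), rows[0])
```

Fixing the seed makes the generated dataset reproducible, which helps when comparing methods on synthetic data.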

### 3. Loading data

This module loads data from the public repository [Git: bigdata-ufsc datasets (v1_0)](https://github.com/bigdata-ufsc/datasets_v1_0)

Check the GitHub repository to see available datasets.

a) First, you can load a dataset by specifying its category (parent folder) and dataset name (subfolder):

In [1]:
from matdata.datasets import load_ds

# dataset='mat.FoursquareNYC' => default

df_train, df_test = load_ds()
df_train

Reading dataset FoursquareNYC of Multiple Aspect Trajectories


FoursquareNYC (Multiple Aspect Trajectories):   0%|          | 0/2 [00:00<?, ?it/s]

| | tid | label | lat | lon | day | hour | poi | category | price | rating | weather |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 127 | 6 | 40.834098 | -73.945267 | Monday | 13 | 21580 | Food | 1 | 8.2 | Clouds |
| 1 | 127 | 6 | 40.567196 | -73.882576 | Monday | 19 | 2392 | Travel & Transport | -999 | -999.0 | Clouds |
| 2 | 127 | 6 | 40.689913 | -73.981504 | Monday | 23 | 35589 | Travel & Transport | -999 | -999.0 | Clouds |
| 3 | 127 | 6 | 40.708588 | -73.991032 | Monday | 23 | 18603 | Travel & Transport | -999 | -999.0 | Clouds |
| 4 | 127 | 6 | 40.833165 | -73.941860 | Tuesday | 14 | 36348 | Residence | -999 | -999.0 | Clear |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 44804 | 29563 | 1070 | 40.704733 | -73.987738 | Friday | 17 | 1805 | College & University | -999 | -999.0 | Clouds |
| 44805 | 29563 | 1070 | 40.695163 | -73.995448 | Friday | 20 | 1944 | Food | 2 | 8.0 | Clouds |
| 44806 | 29563 | 1070 | 40.697803 | -73.994145 | Saturday | 8 | 16452 | Outdoors & Recreation | -999 | 6.9 | Clouds |
| 44807 | 29563 | 1070 | 40.694673 | -73.994082 | Saturday | 13 | 16201 | Food | 1 | 7.0 | Clouds |
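
The dataset identifiers used in this section, such as `'mat.FoursquareNYC'` and `'raw.GoTrack'`, appear to join the category and the dataset name with a dot. The snippet below sketches that convention. The helper `split_dataset_id` is hypothetical, and how `load_ds` parses its argument internally is not shown in this notebook.

```python
def split_dataset_id(dataset):
    """Split 'category.name' into its two parts. Hypothetical helper."""
    category, sep, name = dataset.partition(".")
    if not sep or not category or not name:
        raise ValueError(f"expected 'category.name', got {dataset!r}")
    return category, name

print(split_dataset_id("mat.FoursquareNYC"))  # ('mat', 'FoursquareNYC')
print(split_dataset_id("raw.GoTrack"))        # ('raw', 'GoTrack')
```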


b) Second, you can load the available 5-fold split of a dataset:

In [2]:
from matdata.datasets import load_ds_5fold

dataset='raw.GoTrack'

df_train, df_test = load_ds_5fold(dataset)
df_train[4]  # training set of the fifth fold (zero-based index 4)

Reading 5-fold dataset GoTrack of Raw Trajectories:   0%|          | 0/5 [00:00<?, ?it/s]

GoTrack (Raw Trajectories), fold: 1:   0%|          | 0/2 [00:00<?, ?it/s]

GoTrack (Raw Trajectories), fold: 2:   0%|          | 0/2 [00:00<?, ?it/s]

GoTrack (Raw Trajectories), fold: 3:   0%|          | 0/2 [00:00<?, ?it/s]

GoTrack (Raw Trajectories), fold: 4:   0%|          | 0/2 [00:00<?, ?it/s]

GoTrack (Raw Trajectories), fold: 5:   0%|          | 0/2 [00:00<?, ?it/s]

| | time | lat | lon | tid | label |
|---|---|---|---|---|---|
| 0 | 444 | -10.939341 | -37.062742 | 1 | car |
| 1 | 444 | -10.939341 | -37.062742 | 1 | car |
| 2 | 444 | -10.939324 | -37.062765 | 1 | car |
| 3 | 444 | -10.939211 | -37.062843 | 1 | car |
| 4 | 444 | -10.938939 | -37.062879 | 1 | car |
| ... | ... | ... | ... | ... | ... |
| 13513 | 1303 | -10.933398 | -37.078873 | 38084 | bus |
| 13514 | 1303 | -10.933398 | -37.078873 | 38084 | bus |
| 13515 | 1303 | -10.933398 | -37.078873 | 38084 | bus |
| 13516 | 58 | -10.869450 | -37.095276 | 38090 | bus |


\# By Tarlis Portela (2023)