# Welcome to the Datasurfer Tutorial

In this tutorial, you'll get a quick overview of Datasurfer: installing it, building your first data pool, searching for specific data, and visualizing it.

## Install Datasurfer

To install Datasurfer, run the following command:

> pip install datasurfer
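
To confirm that the installation succeeded, one option is to ask Python for the installed version. This is a minimal sketch using only the standard library; the helper name `installed_version` is made up for this example, and it assumes the distribution is registered under the same name used in the pip command.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if the install worked, otherwise None.
print(installed_version("datasurfer"))
```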


## Create Data Files
Before working with data in Datasurfer, let's first create some dummy data files in various formats.

In [39]:
import numpy as np
import pandas as pd
import json
from pathlib import Path

# Create a directory to store the data.
dir_data = Path('demo_data')
dir_data.mkdir(exist_ok=True)

# Create a csv file.
data1 = pd.DataFrame(np.random.rand(4, 5), columns=list('abcde'))
data1.to_csv(dir_data / 'data1.csv', index=False)

# Create an excel file (pandas needs an Excel writer engine such as openpyxl)
data2 = pd.DataFrame(np.random.rand(6, 4), columns=list('bcde'))
data2.to_excel(dir_data / 'data2.xlsx', index=False)

# Create a json file; the context manager closes the file once it is written
data3 = pd.DataFrame(np.random.rand(5, 3), index=list('cdefg'))
with open(dir_data / 'data3.json', 'w') as f:
    json.dump(data3.to_dict(), f, indent=4)
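
To see what ends up in `data3.json`, it can help to look at the dictionary layout that pandas produces. The sketch below uses only pandas and the standard `json` module; how Datasurfer maps this layout onto signals is not covered by it.

```python
import json

import numpy as np
import pandas as pd

# Same shape as data3 above: 5 rows labelled c..g, 3 unnamed columns (0, 1, 2).
data3 = pd.DataFrame(np.random.rand(5, 3), index=list('cdefg'))

# to_dict() nests the data as {column label: {index label: value}}.
as_dict = data3.to_dict()

# JSON object keys are always strings, so the integer column labels
# come back as '0', '1', '2' after a round trip through JSON.
round_trip = json.loads(json.dumps(as_dict, indent=4))

outer_keys = list(round_trip)        # column labels, now strings
inner_keys = list(round_trip['0'])   # index labels of the original frame
print(outer_keys)
print(inner_keys)
```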


## Data Pool

### Create a Data Pool Object
We've now generated three files in the "demo_data" directory, each with a different file type, data size, and set of column names.

In the next step, we will create a data pool object to organize and contain these files.

In [40]:
import datasurfer as ds

# Create a Data_Pool object by giving the path of the data directory
dp = ds.Data_Pool("demo_data")

# Display all information about the data pool
dp.describe(verbose=True)

Processing "demo_data/data3": 100%|██████████| 3/3 [00:00<00:00, 126.42it/s]


| | Comment | Signal Count | Signal Length | Signal Size | Memory Usage | Interface | File Type | File Size | File Date | File Path |
|---|---|---|---|---|---|---|---|---|---|---|
| data1 | | 5 | 4 | 20 | 0.0003 | PANDAS_OBJECT | .csv | 0.0004 | 2024-04-12 23:36:48.066740 | `c:\95_Programming\10_Data_Related\20_Projects\...` |
| data2 | | 4 | 6 | 24 | 0.0003 | PANDAS_OBJECT | .xlsx | 0.0058 | 2024-04-12 23:36:48.147264 | `c:\95_Programming\10_Data_Related\20_Projects\...` |
| data3 | | 5 | 3 | 15 | 0.0003 | JSON_OBJECT | .json | 0.0006 | 2024-04-12 23:36:48.167137 | `c:\95_Programming\10_Data_Related\20_Projects\...` |
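
The Signal Count, Signal Length, and Signal Size columns can be cross-checked against plain pandas. The sketch below rebuilds frames with the same shapes as data1 and data2. It assumes, based on the numbers in the table rather than on Datasurfer's documentation, that the count is the number of columns, the length is the number of rows, and the size is their product.

```python
import numpy as np
import pandas as pd

# Frames with the same shapes and column names as the demo files.
frames = {
    'data1': pd.DataFrame(np.random.rand(4, 5), columns=list('abcde')),
    'data2': pd.DataFrame(np.random.rand(6, 4), columns=list('bcde')),
}

# One row per file; column meanings are an assumption read off the table above.
summary = pd.DataFrame.from_dict(
    {
        name: {
            'Signal Count': df.shape[1],   # number of columns
            'Signal Length': df.shape[0],  # number of rows
            'Signal Size': df.size,        # rows * columns
        }
        for name, df in frames.items()
    },
    orient='index',
)
print(summary)
```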


### List Signal Names in the Pool
The data pool description provides details on the three files we've created, including signals, file types, sizes, and more. Using the following command, you can view all the signal names stored in the data pool:

In [41]:
# list all pool signals

dp.list_signals()

['a', 'b', 'c', 'd', 'e', 'f', 'g']
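
The list above is the union of the signal names found in the three files. As a plain-Python illustration (not Datasurfer's actual implementation), the same result follows from merging the per-file names. The `file_signals` mapping is a hypothetical stand-in for what the pool reads from disk.

```python
# Hypothetical stand-in for the signal names held by each file in the pool.
file_signals = {
    'data1': list('abcde'),
    'data2': list('bcde'),
    'data3': list('cdefg'),
}

# Union of all names across files, sorted alphabetically.
all_signals = sorted(set().union(*file_signals.values()))
print(all_signals)  # ['a', 'b', 'c', 'd', 'e', 'f', 'g']
```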

### Obtain Single Signal from Pool

Indexing the pool with a signal name returns that signal's values from the files in the pool as a pandas DataFrame, with one column per file. Because the files differ in length, the shorter columns are padded with `NaN` so that all columns align.

In [42]:
# Obtain signal "c" from pool files
df = dp['c']

df

| | data1 | data2 | data3 |
|---|---|---|---|
| 0 | 0.096317 | 0.047774 | 0.322151 |
| 1 | 0.519415 | 0.862198 | 0.402787 |
| 2 | 0.791313 | 0.943141 | 0.772013 |
| 3 | 0.928654 | 0.791107 | NaN |
| 4 | NaN | 0.475045 | NaN |
| 5 | NaN | 0.943275 | NaN |
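
The NaN padding matches what pandas does when series of unequal length are placed side by side. The following sketch reproduces that alignment with pandas alone; the `per_file` series are hypothetical stand-ins for signal "c" as read from each file, and this is not necessarily how Datasurfer builds its result internally.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for signal "c" from each file
# (lengths 4, 6, and 3, matching the demo data).
per_file = {
    'data1': pd.Series(np.random.rand(4)),
    'data2': pd.Series(np.random.rand(6)),
    'data3': pd.Series(np.random.rand(3)),
}

# Column-wise concatenation aligns on the row index; rows that a
# shorter series does not have are filled with NaN.
aligned = pd.concat(per_file, axis=1)
print(aligned)
```

The result has as many rows as the longest series, which is why data1 and data3 end with NaN entries.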


### Obtain Multiple Signals from Pool

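The pool can also return multiple signals at once when indexed with a list of signal names. Each requested signal becomes a column, and the rows from the individual files are stacked one after another. In the output below, the first four rows come from data1 and the remaining six from data2; data3, which has no signal `b`, does not appear.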


In [47]:
import warnings
warnings.filterwarnings("ignore")


df = dp[['b', 'c']]
df

| | b | c |
|---|---|---|
| 0 | 0.344473 | 0.096317 |
| 1 | 0.539755 | 0.519415 |
| 2 | 0.145672 | 0.791313 |
| 3 | 0.783115 | 0.928654 |
| 4 | 0.289679 | 0.047774 |
| 5 | 0.727152 | 0.862198 |
| 6 | 0.801876 | 0.943141 |
| 7 | 0.400355 | 0.791107 |
| 8 | 0.83681 | 0.475045 |
| 9 | 0.204659 | 0.943275 |
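
The ten rows correspond to the four rows of data1 followed by the six rows of data2. A comparable result can be sketched in plain pandas by stacking the selected columns of each frame. This is an illustration of the observed output under that assumption, not a description of Datasurfer's internals.

```python
import numpy as np
import pandas as pd

# Frames with the same shapes and column names as data1 and data2.
data1 = pd.DataFrame(np.random.rand(4, 5), columns=list('abcde'))
data2 = pd.DataFrame(np.random.rand(6, 4), columns=list('bcde'))

wanted = ['b', 'c']

# Stack the selected columns row-wise and renumber the index from 0.
stacked = pd.concat([data1[wanted], data2[wanted]], ignore_index=True)
print(stacked.shape)  # (10, 2)
```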
