# Selecting Data Examples

In this notebook, you will find examples covering:

- [Inspect a Dataset](#Inspect-a-DataSet)  
    - [Show first rows](#Show-first-rows)
    - [Show last rows](#Show-last-rows)
- [DataSet description](#DataSet-Description)
    - [Get the number of rows](#Get-the-number-of-rows)
    - [Get the columns of a DataSet](#Get-the-columns-of-a-DataSet)
    - [Get the shape of a DataSet](#Get-the-shape-of-a-DataSet)
    - [Get a descriptive statistics summary of a DataSet](#Get-a-descriptive-statitics-summary-of-a-DataSet)
- [Select Data](#Select-Data)
    - [Select all rows](#Select-all-rows)
    - [Select specific columns](#Select-specifics-columns)
    - [Make a filter](#Make-a-filter)
    - [Sort Data](#Sort-Data)
- [Aggregate Data](#Aggregate-Data)
    
    

In [1]:
# Copyright (c) 2022 Grumpy Cat Software S.L.
#
# This Source Code is licensed under the MIT 2.0 license.
# the terms can be found in LICENSE.md at the root of
# this project, or at http://mozilla.org/MPL/2.0/.

import shapelets as sh

session = sh.sandbox()

data = session.load_test_data()

# Inspect a DataSet

## Show first rows

In [2]:
data.head(n=5)

Unnamed: 0,Sepal_Length,Sepal_Width,Petal_Length,Petal_Width,Class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


## Show last rows

In [3]:
data.tail(n=5)

Unnamed: 0,Sepal_Length,Sepal_Width,Petal_Length,Petal_Width,Class
0,6.7,3.0,5.2,2.3,Iris-virginica
1,6.3,2.5,5.0,1.9,Iris-virginica
2,6.5,3.0,5.2,2.0,Iris-virginica
3,6.2,3.4,5.4,2.3,Iris-virginica
4,5.9,3.0,5.1,1.8,Iris-virginica


# DataSet Description

## Get the number of rows

In [4]:
len(data)

150

## Get the columns of a DataSet

In [5]:
data.columns

['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width', 'Class']

## Get the shape of a DataSet

In [6]:
data.shape()

(150, 5)

## Get a descriptive statitics summary of a DataSet

In [7]:
data.describe()

Unnamed: 0,column_name,column_type,min,max,approx_unique,avg,std,q25,q50,q75,count,null_percentage
0,Sepal_Length,DOUBLE,4.3,7.9,35,5.843333333333335,0.8280661279778637,5.1,5.8,6.4,150,0.0%
1,Sepal_Width,DOUBLE,2.0,4.4,23,3.0540000000000007,0.4335943113621737,2.8,3.0,3.3125,150,0.0%
2,Petal_Length,DOUBLE,1.0,6.9,41,3.758666666666669,1.764420419952262,1.5750000000000002,4.35,5.1,150,0.0%
3,Petal_Width,DOUBLE,0.1,2.5,22,1.1986666666666672,0.7631607417008414,0.3,1.3,1.8,150,0.0%
4,Class,VARCHAR,Iris-setosa,Iris-virginica,3,,,,,,150,0.0%
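To make a few of the summary columns concrete, here is a minimal sketch in plain Python (standard library only, no Shapelets calls) that computes `min`, `max`, `avg`, `std`, `q50` and `count` for a handful of values. The helper name `summarize` is hypothetical, the sample is just the five `Sepal_Length` values shown by `head()`, and the use of the sample standard deviation is an assumption about how `describe()` works.

```python
import statistics

# Five Sepal_Length values copied from the head() output earlier in this
# notebook. This is a tiny sample, not the full 150-row dataset.
sepal_length = [5.1, 4.9, 4.7, 4.6, 5.0]


def summarize(values):
    """Hypothetical helper: a few of the statistics reported for a numeric column."""
    return {
        "min": min(values),
        "max": max(values),
        "avg": statistics.fmean(values),
        # Sample standard deviation (divides by n - 1). Assumed, not confirmed,
        # to be the definition used by describe().
        "std": statistics.stdev(values),
        "q50": statistics.median(values),
        "count": len(values),
    }


summary = summarize(sepal_length)
print(summary)
```

Because only five rows are used, the numbers will differ from the table above, which summarizes the whole dataset.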


# Select Data

## Select all rows

In [8]:
session.map(x for x in data)

Column,NumPy Type,SQL Type
sepal_length,float64,DOUBLE
sepal_width,float64,DOUBLE
petal_length,float64,DOUBLE
petal_width,float64,DOUBLE
class,object,VARCHAR


In [9]:
session.map(x for x in data).head()

Unnamed: 0,Sepal_Length,Sepal_Width,Petal_Length,Petal_Width,Class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


## Select specifics columns

In [10]:
session.map((x.Sepal_Length,x.Sepal_Width) for x in data)

Column,NumPy Type,SQL Type
sepal_length,float64,DOUBLE
sepal_width,float64,DOUBLE


In [11]:
session.map((x.Sepal_Length,x.Sepal_Width) for x in data).head()

Unnamed: 0,Sepal_Length,Sepal_Width
0,5.1,3.5
1,4.9,3.0
2,4.7,3.2
3,4.6,3.1
4,5.0,3.6


## Make a filter

In [12]:
session.map(x for x in data if x.Petal_Length > 1.5)

Column,NumPy Type,SQL Type
sepal_length,float64,DOUBLE
sepal_width,float64,DOUBLE
petal_length,float64,DOUBLE
petal_width,float64,DOUBLE
class,object,VARCHAR


In [13]:
session.map(x for x in data if x.Petal_Length > 1.5).head()

Unnamed: 0,Sepal_Length,Sepal_Width,Petal_Length,Petal_Width,Class
0,5.4,3.9,1.7,0.4,Iris-setosa
1,4.8,3.4,1.6,0.2,Iris-setosa
2,5.7,3.8,1.7,0.3,Iris-setosa
3,5.4,3.4,1.7,0.2,Iris-setosa
4,5.1,3.3,1.7,0.5,Iris-setosa
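The expressions passed to `session.map` read like ordinary Python comprehensions. As a rough analogy only, the sketch below evaluates the same kind of expression eagerly over a list of named tuples, using the standard library and no Shapelets calls. The `Row` type and the sample list are hypothetical stand-ins built from rows shown in the outputs above; how Shapelets actually evaluates the generator is not covered here.

```python
from collections import namedtuple

# Hypothetical row type mirroring the dataset's column names.
Row = namedtuple(
    "Row",
    ["Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width", "Class"],
)

# A few rows copied from the outputs shown in this notebook.
rows = [
    Row(5.1, 3.5, 1.4, 0.2, "Iris-setosa"),
    Row(4.9, 3.0, 1.4, 0.2, "Iris-setosa"),
    Row(5.4, 3.9, 1.7, 0.4, "Iris-setosa"),
    Row(4.6, 3.1, 1.5, 0.2, "Iris-setosa"),
    Row(4.8, 3.4, 1.6, 0.2, "Iris-setosa"),
]

# Filter: keep only rows whose Petal_Length is strictly greater than 1.5.
filtered = [x for x in rows if x.Petal_Length > 1.5]

# Filter and projection together: keep two columns of the matching rows.
projected = [(x.Sepal_Length, x.Sepal_Width) for x in rows if x.Petal_Length > 1.5]

print(filtered)
print(projected)
```

Note that the comparison is strict, so the row with `Petal_Length` equal to 1.5 is excluded.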


## Sort Data

To sort in ascending order, pass the column name as a string.

In [14]:
data.sort_by('Sepal_Length')

Column,NumPy Type,SQL Type
Sepal_Length,float64,DOUBLE
Sepal_Width,float64,DOUBLE
Petal_Length,float64,DOUBLE
Petal_Width,float64,DOUBLE
Class,object,VARCHAR


To sort in descending order, pass the column name followed by `False` as the second argument.

In [15]:
data.sort_by('Sepal_Length',False)

Column,NumPy Type,SQL Type
Sepal_Length,float64,DOUBLE
Sepal_Width,float64,DOUBLE
Petal_Length,float64,DOUBLE
Petal_Width,float64,DOUBLE
Class,object,VARCHAR


To sort by multiple columns, each with its own direction, pass a list of column names and a matching list of booleans.

In [16]:
data.sort_by(['Sepal_Length','Petal_Length'],[False,True])

Column,NumPy Type,SQL Type
Sepal_Length,float64,DOUBLE
Sepal_Width,float64,DOUBLE
Petal_Length,float64,DOUBLE
Petal_Width,float64,DOUBLE
Class,object,VARCHAR
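The cells above only display the schema, so the effect of mixed sort directions is not visible. The following sketch shows the intended ordering in plain Python, assuming that `False` means descending and `True` means ascending, as the single-column call suggests. The function `sort_rows` is hypothetical and is not part of Shapelets; the sample rows reuse values from earlier outputs.

```python
def sort_rows(rows, columns, ascending):
    """Hypothetical helper: sort dict rows by several columns, each with its own direction."""
    result = list(rows)
    # Python's sort is stable, so sorting by the least significant column
    # first and the most significant column last yields a multi-column order.
    for column, asc in reversed(list(zip(columns, ascending))):
        result.sort(key=lambda row: row[column], reverse=not asc)
    return result


# Sample rows built from values shown earlier in this notebook.
rows = [
    {"Sepal_Length": 5.1, "Petal_Length": 1.4},
    {"Sepal_Length": 4.9, "Petal_Length": 1.4},
    {"Sepal_Length": 5.4, "Petal_Length": 1.7},
    {"Sepal_Length": 5.1, "Petal_Length": 1.7},
    {"Sepal_Length": 4.8, "Petal_Length": 1.6},
]

# Sepal_Length descending, then Petal_Length ascending within ties.
ordered = sort_rows(rows, ["Sepal_Length", "Petal_Length"], [False, True])
for row in ordered:
    print(row["Sepal_Length"], row["Petal_Length"])
```

The two rows with `Sepal_Length` 5.1 tie on the first column, so the second column decides their relative order.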


# Aggregate Data

In [17]:
from shapelets.functions import avg

In [18]:
session.map((x.Class,avg(x.Sepal_Length)) for x in data)

Column,NumPy Type,SQL Type
class,object,VARCHAR
avg_x__sepal_length__,float64,DOUBLE


In [19]:
session.map((x.Class,avg(x.Sepal_Length)) for x in data).head()

Unnamed: 0,Class,"avg(x.""Sepal_Length"")"
0,Iris-setosa,5.006
1,Iris-versicolor,5.936
2,Iris-virginica,6.588
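The result above has one row per class, which suggests that the non-aggregated column (`Class`) acts as a grouping key for `avg`. That reading is inferred from the output, not from the Shapelets documentation. Under that assumption, the same computation can be sketched in plain Python as a group-then-average step; the sample pairs are taken from rows shown by `head()` and `tail()`, so the averages differ from the full-dataset values.

```python
from collections import defaultdict
from statistics import fmean

# (Class, Sepal_Length) pairs copied from the head() and tail() outputs.
# Only four rows, so the averages will not match the full-dataset result.
pairs = [
    ("Iris-setosa", 5.1),
    ("Iris-setosa", 4.9),
    ("Iris-virginica", 6.7),
    ("Iris-virginica", 6.3),
]

# Group the values by class.
groups = defaultdict(list)
for cls, sepal_length in pairs:
    groups[cls].append(sepal_length)

# Average each group.
averages = {cls: fmean(values) for cls, values in groups.items()}
print(averages)
```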
