# "Replicating .SD in Python Datatable"
> "Work with Subset of Data"

- toc: true
- branch: master
- badges: true
- hide_binder_badge: True
- hide_colab_badge: True
- comments: true
- categories: [Python, Datatable, pydatatable, h2oai, .SD, rdatatable]
- image: images/some_folder/your_image.png
- hide: false
- search_exclude: true

## **.SD - Subset of Data**

I will be using [Jose Morales](https://twitter.com/jmrlsz)'s excellent [post](https://rpubs.com/josemz/SDbf) to show how `.SD`'s functionality can be replicated in Python's [datatable](https://datatable.readthedocs.io/en/latest/index.html). Not every feature can be replicated: R's [data.table](https://github.com/Rdatatable/data.table) has many functions and features that are not yet implemented in [datatable](https://datatable.readthedocs.io/en/latest/index.html).

In [1]:
from datatable import dt, by, sort, mean, count, update, max, f, fread

In [2]:
df = fread('Data_files/iris.csv')
df.head()

|  | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | setosa |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | setosa |


- Number of unique observations per column

In [3]:
# DT[, lapply(.SD, uniqueN)] --> Rdatatable

df.nunique()

|  | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 35 | 23 | 43 | 22 | 3 |


- Mean of all columns by `species`

In [4]:
# DT[, lapply(.SD, mean), by = species] --> Rdatatable

df[:, mean(f[:]), by('species')]

|  | species | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|---|
| 0 | setosa | 5.006 | 3.428 | 1.462 | 0.246 |
| 1 | versicolor | 5.936 | 2.77 | 4.26 | 1.326 |
| 2 | virginica | 6.588 | 2.974 | 5.552 | 2.026 |


### __Filtering__

- First two observations by species

In [5]:
# DT[, .SD[1:2], by = species]

df[:2, :, by('species')]

|  | species | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|---|
| 0 | setosa | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | setosa | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | versicolor | 7.0 | 3.2 | 4.7 | 1.4 |
| 3 | versicolor | 6.4 | 3.2 | 4.5 | 1.5 |
| 4 | virginica | 6.3 | 3.3 | 6.0 | 2.5 |
| 5 | virginica | 5.8 | 2.7 | 5.1 | 1.9 |


In [datatable](https://datatable.readthedocs.io/en/latest/index.html), the row selection in the `i` section is applied after grouping, so `:2` picks the first two rows of each group. In R's [data.table](https://github.com/Rdatatable/data.table), `i` is evaluated before grouping, and per-group row selection is done on `.SD`.

- Last two observations by `species`

In [6]:
# DT[, tail(.SD, 2), by = species] 

df[-2:, :, by('species')]

|  | species | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|---|
| 0 | setosa | 5.3 | 3.7 | 1.5 | 0.2 |
| 1 | setosa | 5.0 | 3.3 | 1.4 | 0.2 |
| 2 | versicolor | 5.1 | 2.5 | 3.0 | 1.1 |
| 3 | versicolor | 5.7 | 2.8 | 4.1 | 1.3 |
| 4 | virginica | 6.2 | 3.4 | 5.4 | 2.3 |
| 5 | virginica | 5.9 | 3.0 | 5.1 | 1.8 |


Again, the rows are selected after grouping, this time with Python's negative slicing.

- Select the top two rows per species, sorted by `sepal_length` in descending order

In [7]:
# DT[order(-sepal_length), head(.SD, 2), by = species] 

df[:2, :, by('species'), sort(-f.sepal_length)]

|  | species | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|---|
| 0 | setosa | 5.8 | 4.0 | 1.2 | 0.2 |
| 1 | setosa | 5.7 | 4.4 | 1.5 | 0.4 |
| 2 | versicolor | 7.0 | 3.2 | 4.7 | 1.4 |
| 3 | versicolor | 6.9 | 3.1 | 4.9 | 1.5 |
| 4 | virginica | 7.9 | 3.8 | 6.4 | 2.0 |
| 5 | virginica | 7.7 | 3.8 | 6.7 | 2.2 |


In [datatable](https://datatable.readthedocs.io/en/latest/index.html), the [sort](https://datatable.readthedocs.io/en/latest/api/dt/sort.html#) function replicates the `order` function in R's [data.table](https://github.com/Rdatatable/data.table). Note the `-` symbol before the `sepal_length` *f-expression*; it tells `sort` to order the rows in descending order.

- Select the top two rows per species, sorted by the difference between `sepal_length` and `sepal_width`

In [8]:
# DT[order(sepal_length - sepal_width), head(.SD, 2), by = species] 

df[:2, :, by('species'), sort(f.sepal_length - f.sepal_width)]

|  | species | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|---|
| 0 | setosa | 4.6 | 3.6 | 1.0 | 0.2 |
| 1 | setosa | 5.2 | 4.1 | 1.5 | 0.1 |
| 2 | versicolor | 5.4 | 3.0 | 4.5 | 1.5 |
| 3 | versicolor | 5.2 | 2.7 | 3.9 | 1.4 |
| 4 | virginica | 4.9 | 2.5 | 4.5 | 1.7 |
| 5 | virginica | 5.6 | 2.8 | 4.9 | 2.0 |


Just like `order` in R's [data.table](https://github.com/Rdatatable/data.table), the [sort](https://datatable.readthedocs.io/en/latest/api/dt/sort.html#) function accepts computed expressions, such as the difference between two columns, and not only plain column names.

- Filter observations above the mean of `sepal_length` by species

In [9]:
# DT[, .SD[sepal_length > mean(sepal_length)], by = species] 

df[:, update(temp = f.sepal_length > mean(f.sepal_length)), by('species')]
df[f.temp == 1, f[:-1]]

|  | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
| 2 | 5.4 | 3.7 | 1.5 | 0.2 | setosa |
| 3 | 5.8 | 4 | 1.2 | 0.2 | setosa |
| 4 | 5.7 | 4.4 | 1.5 | 0.4 | setosa |
| 5 | 5.4 | 3.9 | 1.3 | 0.4 | setosa |
| 6 | 5.1 | 3.5 | 1.4 | 0.3 | setosa |
| 7 | 5.7 | 3.8 | 1.7 | 0.3 | setosa |
| 8 | 5.1 | 3.8 | 1.5 | 0.3 | setosa |
| 9 | 5.4 | 3.4 | 1.7 | 0.2 | setosa |


Unlike in R's [data.table](https://github.com/Rdatatable/data.table), boolean expressions cannot be applied in the `i` section when `by` is present. The next best thing is to break the operation into two steps: create a temporary column that holds the boolean value, then filter on that column.

- Filter rows with group size greater than 10 

In [10]:
# DT[, .SD[.N > 10], keyby = .(species, petal_width)] 

df[:, update(temp = count() > 10), by('species', 'petal_width')]
df[f.temp == 1, f[:-1]]

|  | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5 | 3.6 | 1.4 | 0.2 | setosa |
| 5 | 5 | 3.4 | 1.5 | 0.2 | setosa |
| 6 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
| 7 | 5.4 | 3.7 | 1.5 | 0.2 | setosa |
| 8 | 4.8 | 3.4 | 1.6 | 0.2 | setosa |
| 9 | 5.8 | 4 | 1.2 | 0.2 | setosa |


- Get the rows with the maximum `petal_length` by species

In [11]:
# DT[, .SD[which.max(petal_length)], by = species]  --> first maximum per group only
# DT[, .SD[petal_length == max(petal_length)], by = species]  --> all tied maxima; this is the form replicated here

df[:, update(temp = f.petal_length == max(f.petal_length)), by('species')]
df[f.temp == 1, f[:-1]]

|  | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 4.8 | 3.4 | 1.9 | 0.2 | setosa |
| 1 | 5.1 | 3.8 | 1.9 | 0.4 | setosa |
| 2 | 6.0 | 2.7 | 5.1 | 1.6 | versicolor |
| 3 | 7.7 | 2.6 | 6.9 | 2.3 | virginica |


### __.SDcols__

- Including columns in `.SD`

In [12]:
# col_idx <- grep("^sepal", names(DT)) --> filter for the specific columns
# DT[, lapply(.SD, mean), .SDcols = col_idx]

# delete 'temp' column
del df['temp']

# filter for the specific columns with a list comprehension
names = [f[name] for name in df.names
         if name.startswith('sepal')]

df[:, mean(names)]

|  | sepal_length | sepal_width |
|---|---|---|
| 0 | 5.84333 | 3.05733 |


- Removing columns from `.SD`

In [13]:
# col_idx <- grep("^(petal|species)", names(DT))
# DT[, lapply(.SD, mean), .SDcols = -col_idx] --> exclusion occurs within .SDcols

# here, exclusion occurs within the list comprehension
names = [f[name] for name in df.names 
         if not name.startswith(('petal','species'))] 

df[:, mean(names)]

|  | sepal_length | sepal_width |
|---|---|---|
| 0 | 5.84333 | 3.05733 |


- Column ranges

In [14]:
# DT[, lapply(.SD, mean), .SDcols = sepal_length:sepal_width]

df[:, mean(f['sepal_length':'sepal_width'])]

|  | sepal_length | sepal_width |
|---|---|---|
| 0 | 5.84333 | 3.05733 |


### __Summary__

We've seen how to replicate `.SD` in [datatable](https://datatable.readthedocs.io/en/latest/index.html). Some `.SD` functionality is not yet possible in Python's [datatable](https://datatable.readthedocs.io/en/latest/index.html). It is possible that `.SD` will be implemented in the future to allow custom aggregation functions. That would be truly awesome, as it would let [numpy](https://numpy.org/doc/stable/index.html) functions and functions from other Python libraries be used in [datatable](https://datatable.readthedocs.io/en/latest/index.html).