# Datatable

## Overview:

Anyone familiar with R will be aware of the [data.table](https://cran.r-project.org/web/packages/data.table/data.table.pdf) package. It extends R's widely used [data.frame](https://www.rdocumentation.org/packages/base/versions/3.6.0/topics/data.frame), which is R's rough equivalent of a pandas DataFrame in Python. **data.table** prioritizes memory efficiency and speed, making it the go-to package for big data problems in R.

In this spotlight I will go over how to get started with [datatable](https://github.com/h2oai/datatable), a Python library developed by [H2O.ai](https://h2o.ai). Like data.table in R, it aims to be faster and more memory efficient than pandas while retaining some of the important functionality that pandas offers.

## Installation:

### Linux and MacOS:

Just use pip to install the package:

pip install datatable

### Windows:

At the time of writing, the package is not available on Windows, but work is actively being done to bring it to the OS. You can track it [here](https://github.com/h2oai/datatable/issues/1114).

## Dataset:

In this spotlight I will be using the [Lending Club Loan Dataset](https://www.kaggle.com/wendykan/lending-club-loan-data#loan.csv) from Kaggle. It has **145 columns** and **2.26 million rows**, which makes it a good dataset for showing the capabilities of this package.

## Reading Data:

In [1]:
import datatable as dt
import pandas as pd
import numpy as np

### datatable

In [2]:
%%time
data_dt = dt.fread("loan.csv")

CPU times: user 18 s, sys: 1.33 s, total: 19.3 s
Wall time: 3.65 s


### pandas

In [3]:
%%time
data_pd = pd.read_csv("loan.csv",low_memory=False)

CPU times: user 41.7 s, sys: 2.67 s, total: 44.4 s
Wall time: 43 s


The results above show that **datatable** reads the file in 3.65 s of wall time against 43 s for **pandas**, which is roughly 12 times faster on this machine.
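
The two `%%time` outputs also hint at why the gap is so large. As a small worked calculation (using only the numbers printed above), the ratio of CPU time to wall time is well above 1 for datatable, which suggests the read was spread across several cores, while for pandas it stays close to 1:

```python
# Timings copied from the two %%time cells above (seconds).
dt_wall, dt_cpu = 3.65, 19.3   # datatable fread: wall time, total CPU time
pd_wall, pd_cpu = 43.0, 44.4   # pandas read_csv: wall time, total CPU time

wall_speedup = pd_wall / dt_wall    # how much sooner fread finishes
dt_cpu_per_wall = dt_cpu / dt_wall  # CPU seconds used per second of wall time
pd_cpu_per_wall = pd_cpu / pd_wall

print(f"wall-clock speedup: {wall_speedup:.1f}x")
print(f"datatable CPU/wall ratio: {dt_cpu_per_wall:.1f}")
print(f"pandas CPU/wall ratio: {pd_cpu_per_wall:.1f}")
```

These figures come from a single run on one machine, so treat them as indicative rather than as a benchmark.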

## Converting Dataset:

The datatable package can convert a Frame to a pandas DataFrame or to a numpy array:
- `data_dt_pd = data_dt.to_pandas()`
- `data_dt_np = data_dt.to_numpy()`

The cell below times the conversion to a pandas DataFrame.


In [4]:
%%time
data_dt_pd = data_dt.to_pandas()

CPU times: user 1min 12s, sys: 1.66 s, total: 1min 14s
Wall time: 14.8 s


## Important Properties

Below are some important properties that are present with the datatable frame:

In [5]:
print(data_dt.shape)
print(data_dt.names[:5])
print(data_dt.stypes[:5])
data_dt.head(10)

(2260668, 145)
('id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv')
(stype.bool8, stype.bool8, stype.int32, stype.int32, stype.float64)


| | id | member_id | loan_amnt | funded_amnt | funded_amnt_inv | term | int_rate | installment | grade | sub_grade | … | settlement_status | settlement_date | settlement_amount | settlement_percentage | settlement_term |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | ▪ | ▪ | ▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪ | | ▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪ |
| 0 | | | 2500 | 2500 | 2500 | 36 months | 13.56 | 84.92 | C | C1 | … | | | | | |
| 1 | | | 30000 | 30000 | 30000 | 60 months | 18.94 | 777.23 | D | D2 | … | | | | | |
| 2 | | | 5000 | 5000 | 5000 | 36 months | 17.97 | 180.69 | D | D1 | … | | | | | |
| 3 | | | 4000 | 4000 | 4000 | 36 months | 18.94 | 146.51 | D | D2 | … | | | | | |
| 4 | | | 30000 | 30000 | 30000 | 60 months | 16.14 | 731.78 | C | C4 | … | | | | | |
| 5 | | | 5550 | 5550 | 5550 | 36 months | 15.02 | 192.45 | C | C3 | … | | | | | |
| 6 | | | 2000 | 2000 | 2000 | 36 months | 17.97 | 72.28 | D | D1 | … | | | | | |
| 7 | | | 6000 | 6000 | 6000 | 36 months | 13.56 | 203.79 | C | C1 | … | | | | | |
| 8 | | | 5000 | 5000 | 5000 | 36 months | 17.97 | 180.69 | D | D1 | … | | | | | |
| 9 | | | 6000 | 6000 | 6000 | 36 months | 14.47 | 206.44 | C | C2 | … | | | | | |


In the rendered notebook output, the different **colors** in the header indicate the data types of the columns: for example, **green** is int, **blue** is float and **red** is string.

## Dataframe Statistics:

Like a pandas DataFrame, the datatable Frame has a number of methods for extracting summary statistics of the data, such as the sum, mean, minimum and maximum. Some examples are as follows:

### datatable

In [6]:
%%time
print(data_dt.sum()) # Per column sum
print(data_dt.mean()) # Per column Mean
print(data_dt.max()) # Per column Max
print(data_dt.min()) # Per column Min
print(data_dt.nunique()) # Per column Number of unique values

   | id  member_id    loan_amnt  funded_amnt  funded_amnt_inv  term     int_rate  installment  grade  sub_grade  …  settlement_status  settlement_date  settlement_amount  settlement_percentage  settlement_term
-- + --  ---------  -----------  -----------  ---------------  ----  -----------  -----------  -----  ---------     -----------------  ---------------  -----------------  ---------------------  ---------------
 0 |  0          0  3.40161e+10  3.40042e+10       3.3963e+10    NA  2.95987e+07  1.00782e+09     NA         NA  …                 NA               NA        1.66292e+08            1.57927e+06           434640

[1 row x 145 columns]

[output for the remaining statistics truncated]

### pandas

In [13]:
%%time
print(data_pd.sum(axis=0)) # Per column sum

MemoryError: Unable to allocate array with shape (145, 2260668) and data type object

As the two result cells above show, **datatable** has no problem returning a number of statistics, whereas pandas raises a `MemoryError` on a simple *sum* operation.
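
The error message mentions an array of type `object`, which suggests that pandas tried to combine every column, strings included, before summing. One possible workaround, shown here only as a sketch on a made-up toy frame and not measured on the loan data, is to restrict the reduction to numeric columns first:

```python
import pandas as pd

# Toy frame with mixed numeric and string columns (values are made up).
toy_pd = pd.DataFrame({
    "loan_amnt": [2500, 30000, 5000],
    "int_rate": [13.56, 18.94, 17.97],
    "grade": ["C", "D", "D"],
})

# Keep only the numeric columns before reducing, so the string
# column never takes part in the sum.
numeric_sum = toy_pd.select_dtypes(include="number").sum()
print(numeric_sum)
```

Whether this avoids the memory error on the full 145-column file depends on the machine and the pandas version, so it is an assumption to verify rather than a guaranteed fix.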

## Data Handling

Following are some of the ways you can work with the datatable frame for your specific needs.

### Slicing

Frames in datatable are sliced with a `frame[rows, columns]` syntax, which is similar to indexing in pandas. Some examples are in the cells below:

In [7]:
data_dt[:,'int_rate'] # The int_rate column

| | int_rate |
|---|---|
| | ▪▪▪▪▪▪▪▪ |
| 0 | 13.56 |
| 1 | 18.94 |
| 2 | 17.97 |
| 3 | 18.94 |
| 4 | 16.14 |
| 5 | 15.02 |
| 6 | 17.97 |
| 7 | 13.56 |
| 8 | 17.97 |
| 9 | 14.47 |


In [8]:
data_dt[:10,:5] # The first 10 rows of the first 5 columns

| | id | member_id | loan_amnt | funded_amnt | funded_amnt_inv |
|---|---|---|---|---|---|
| | ▪ | ▪ | ▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪▪▪▪▪ |
| 0 | | | 2500 | 2500 | 2500 |
| 1 | | | 30000 | 30000 | 30000 |
| 2 | | | 5000 | 5000 | 5000 |
| 3 | | | 4000 | 4000 | 4000 |
| 4 | | | 30000 | 30000 | 30000 |
| 5 | | | 5550 | 5550 | 5550 |
| 6 | | | 2000 | 2000 | 2000 |
| 7 | | | 6000 | 6000 | 6000 |
| 8 | | | 5000 | 5000 | 5000 |
| 9 | | | 6000 | 6000 | 6000 |


pandas frames can be sliced in much the same way, using `.loc` for labels and `.iloc` for positions.
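
For comparison, here is a short sketch of the pandas counterparts of the two slices above. The toy frame and its values are made up for illustration:

```python
import pandas as pd

# Toy frame standing in for the loan data (values are made up).
toy_pd = pd.DataFrame({
    "loan_amnt": [2500, 30000, 5000, 4000],
    "int_rate": [13.56, 18.94, 17.97, 18.94],
    "grade": ["C", "D", "D", "D"],
})

rate_col = toy_pd.loc[:, ["int_rate"]]  # label-based: all rows, one column
top_left = toy_pd.iloc[:2, :2]          # position-based: first 2 rows, first 2 columns

print(rate_col)
print(top_left)
```

Passing a list of labels to `.loc` keeps the result a DataFrame, which mirrors the one-column frame that datatable returns.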

### Filtering

Datatable allows for filtering of data, by putting conditions in the index:

In [9]:
data_dt[dt.f.int_rate>10,"loan_amnt"] # loan amounts where int_rate is greater than ten

| | loan_amnt |
|---|---|
| | ▪▪▪▪ |
| 0 | 2500 |
| 1 | 30000 |
| 2 | 5000 |
| 3 | 4000 |
| 4 | 30000 |
| 5 | 5550 |
| 6 | 2000 |
| 7 | 6000 |
| 8 | 5000 |
| 9 | 6000 |


The **dt.f** object is the "frame proxy": it stands for the frame that is currently being operated on, so `dt.f.int_rate` refers to that frame's `int_rate` column. It is through this object that we can put conditions on the columns.

### Sorting

Datatable, like pandas, supports sorting by column, and in this test it performs significantly better.

#### datatable

In [10]:
%%time
data_dt.sort('int_rate')

CPU times: user 509 ms, sys: 0 ns, total: 509 ms
Wall time: 70.8 ms


| | id | member_id | loan_amnt | funded_amnt | funded_amnt_inv | term | int_rate | installment | grade | sub_grade | … | settlement_status | settlement_date | settlement_amount | settlement_percentage | settlement_term |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | ▪ | ▪ | ▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪ | | ▪▪▪▪ | ▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪ |
| 0 | | | 7500 | 7500 | 7500 | 36 months | 5.31 | 225.83 | A | A1 | … | | | | | |
| 1 | | | 20000 | 20000 | 20000 | 36 months | 5.31 | 602.21 | A | A1 | … | | | | | |
| 2 | | | 9900 | 9900 | 9900 | 36 months | 5.31 | 298.1 | A | A1 | … | | | | | |
| 3 | | | 15000 | 15000 | 15000 | 36 months | 5.31 | 451.66 | A | A1 | … | | | | | |
| 4 | | | 15000 | 15000 | 15000 | 36 months | 5.31 | 451.66 | A | A1 | … | | | | | |
| 5 | | | 30000 | 30000 | 30000 | 36 months | 5.31 | 903.31 | A | A1 | … | | | | | |
| 6 | | | 16000 | 16000 | 16000 | 36 months | 5.31 | 481.77 | A | A1 | … | | | | | |
| 7 | | | 10000 | 10000 | 10000 | 36 months | 5.31 | 301.11 | A | A1 | … | | | | | |
| 8 | | | 27000 | 27000 | 27000 | 60 months | 5.31 | 513.37 | A | A1 | … | | | | | |
| 9 | | | 35000 | 35000 | 35000 | 36 months | 5.31 | 1053.86 | A | A1 | … | | | | | |


#### pandas

In [11]:
%%time
data_pd.sort_values(by = 'int_rate')

CPU times: user 5.38 s, sys: 2.5 s, total: 7.88 s
Wall time: 12.7 s


| | id | member_id | loan_amnt | funded_amnt | funded_amnt_inv | term | int_rate | installment | grade | sub_grade | ... | hardship_payoff_balance_amount | hardship_last_payment_amount | disbursement_method | debt_settlement_flag | debt_settlement_flag_date | settlement_status | settlement_date | settlement_amount | settlement_percentage | settlement_term |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 238926 | | | 30000 | 30000 | 30000.0 | 36 months | 5.31 | 903.31 | A | A1 | ... | | | DirectPay | N | | | | | | |
| 246668 | | | 40000 | 40000 | 40000.0 | 36 months | 5.31 | 1204.42 | A | A1 | ... | | | Cash | N | | | | | | |
| 414210 | | | 6400 | 6400 | 6400.0 | 36 months | 5.31 | 192.71 | A | A1 | ... | | | Cash | N | | | | | | |
| 414231 | | | 13000 | 13000 | 13000.0 | 36 months | 5.31 | 391.44 | A | A1 | ... | | | Cash | N | | | | | | |
| 414237 | | | 6000 | 6000 | 6000.0 | 36 months | 5.31 | 180.67 | A | A1 | ... | | | Cash | N | | | | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2247965 | | | 18000 | 18000 | 18000.0 | 60 months | 30.99 | 593.36 | G | G5 | ... | | | Cash | N | | | | | | |
| 1408883 | | | 22875 | 22875 | 22875.0 | 36 months | 30.99 | 983.53 | G | G5 | ... | | | Cash | N | | | | | | |
| 1454082 | | | 27425 | 27425 | 27425.0 | 36 months | 30.99 | 1179.16 | G | G5 | ... | | | Cash | N | | | | | | |
| 1372582 | | | 12650 | 12650 | 12650.0 | 36 months | 30.99 | 543.90 | G | G5 | ... | | | Cash | N | | | | | | |


As we can see, the pandas dataframe sorts much more slowly: 12.7 s of wall time against 70.8 ms for datatable. (It sometimes throws a memory error as well.)

### Saving Dataframe

In [12]:
data_dt.to_csv('output.csv')

## Conclusion

As we have seen in this spotlight, the datatable package gives us the ability to handle and manipulate extremely large datasets in cases where pandas fails. This makes many tasks, especially the initial steps where we play around with the data, significantly easier.

## References

- https://towardsdatascience.com/an-overview-of-pythons-datatable-package-5d3a97394ee9
- https://datatable.readthedocs.io/en/v0.10.1/