## Setup

In [None]:
# Install the Delta Lake reader library
!pip install -qq delta-lake-reader

In [5]:
# S3 path of the Delta table; replace it with your own path
S3_PATH = "s3://wysde2-test/sparsh/read-s3-delta-in-python"

In [6]:
# Download the sample data, upload it to S3, then clean up the local copies
!wget -q --show-progress https://github.com/datalaker/assets/releases/download/data-v1/read-s3-delta-in-python.zip
!unzip read-s3-delta-in-python.zip
!aws s3 sync data {S3_PATH}
!rm -r data
!rm -rf read-s3-delta-in-python.zip

Archive:  read-s3-delta-in-python.zip
   creating: data/
   creating: data/_delta_log/
 extracting: data/_delta_log/.s3-optimization-2  
  inflating: data/_delta_log/00000000000000000000.json  
 extracting: data/_delta_log/.s3-optimization-0  
 extracting: data/_delta_log/.s3-optimization-1  
  inflating: data/_delta_log/00000000000000000000.crc  
  inflating: data/part-00000-493c3c61-b0e6-4bf2-871c-8a0a4f8aa73d-c000.snappy.parquet  
upload: data/_delta_log/.s3-optimization-2 to s3://wysde2-test/sparsh/read-s3-delta-in-python/_delta_log/.s3-optimization-2
upload: data/_delta_log/.s3-optimization-0 to s3://wysde2-test/sparsh/read-s3-delta-in-python/_delta_log/.s3-optimization-0
upload: data/_delta_log/.s3-optimization-1 to s3://wysde2-test/sparsh/read-s3-delta-in-python/_delta_log/.s3-optimization-1
upload: data/part-00000-493c3c61-b0e6-4bf2-871c-8a0a4f8aa73d-c000.snappy.parquet to s3://wysde2-test/sparsh/read-s3-delta-in-python/part-00000-493c3c61-b0e6-4bf2-871c-8a0a4f8aa73d-c000.snapp

In [10]:
# Imports
import s3fs
from deltalake import DeltaTable
import pyarrow.dataset as ds

## Read

### Standard read

To read a Delta table from S3:

- Create an `s3fs` filesystem object (`s3fs` is a Python file interface to S3).
- Pass it to `DeltaTable` and convert the table to pandas. By default, the reader returns the latest version of the data.

In [None]:
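# s3fs resolves AWS credentials from the environment (env vars, config files, IAM role)
fs = s3fs.S3FileSystem()

# Point the reader at the table root, i.e. the folder that contains _delta_log/
delta_table = DeltaTable(S3_PATH, file_system=fs)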

In [8]:
df = delta_table.to_pandas()
df

|   | visit_id | client_id | therapist_id | service_id | visit_ts | month |
|---|----------|-----------|--------------|------------|----------|-------|
| 0 | 0 | 830-11-8837 | 150-60-1665 | 3 | 2022-05-21 09:34:39 | 5.0 |
| 1 | 1 |  | 150-60-1665 | 1 | 2022-02-11 23:51:36 | 2.0 |
| 2 | 2 | 154-64-9693 | 030-45-1969 | 0 | 2022-03-23 15:21:37 | 3.0 |
| 3 | 3 | 148-49-3184 | 030-45-1969 | 1 | 2022-03-09 17:27:23 | 3.0 |
| 4 | 4 | 148-49-3184 | 150-60-1665 | 3 | NaT |  |
| 5 | 5 | 594-87-8512 | 280-65-5827 | 2 | 2022-02-09 17:14:50 | 2.0 |
| 6 | 6 |  | 150-60-1665 | 4 | 2022-01-08 07:19:43 | 1.0 |
| 7 | 7 | 431-25-4334 | 150-60-1665 | 3 | 2022-03-03 04:27:57 | 3.0 |
| 8 | 8 | 038-37-7264 | 030-45-1969 | 4 | 2022-01-20 21:52:59 | 1.0 |
| 9 | 9 | 898-73-3339 | 280-65-5827 | 1 | 2022-04-25 16:04:39 | 4.0 |


### Time travel

In [None]:
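# as_version(n) loads the table as of commit n, which must exist in _delta_log.
# The sample data uploaded above contains only version 0, so versions 1 and 2
# become available only after further writes to the table.
delta_table_version_1 = delta_table.as_version(1)
delta_table_version_2 = delta_table.as_version(2)

df_1 = delta_table_version_1.to_pandas()
df_2 = delta_table_version_2.to_pandas()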
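
Time travel only works for versions that have a commit in the table's `_delta_log` directory. Each commit is a JSON file named after the zero-padded, 20-digit version number, like the `00000000000000000000.json` file extracted during setup. As a minimal sketch, assuming a local copy of the log directory, the hypothetical helper `list_commit_versions` below (not part of `delta-lake-reader`) lists the versions that can be requested. The demo directory and its files are created only for illustration.

```python
import re
import tempfile
from pathlib import Path

# Delta commit files are named <20-digit version>.json
COMMIT_FILE = re.compile(r"^(\d{20})\.json$")


def list_commit_versions(log_dir):
    """Return the sorted commit versions found in a local _delta_log directory."""
    versions = []
    for entry in Path(log_dir).iterdir():
        match = COMMIT_FILE.match(entry.name)
        if match:
            versions.append(int(match.group(1)))
    return sorted(versions)


# Illustration only: recreate the file names seen in the sample archive.
demo_log = Path(tempfile.mkdtemp()) / "_delta_log"
demo_log.mkdir()
for name in [
    "00000000000000000000.json",
    "00000000000000000000.crc",
    ".s3-optimization-0",
]:
    (demo_log / name).touch()

print(list_commit_versions(demo_log))  # [0]
```

For the sample table this reports only version 0, so `as_version(1)` has nothing to load until another commit is written.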

### Predicate Pushdown, Partition Pruning & Columnar file formats

Because the resulting `DeltaTable` is built on `pyarrow.dataset.Dataset`, it inherits several useful features.

`DeltaTable.to_table` is inherited from `pyarrow.dataset.Dataset.to_table`, so it accepts arguments such as `filter`, which enables partition pruning and predicate pushdown. On a partitioned dataset, partition pruning can substantially reduce the amount of data that has to be downloaded. Predicate pushdown does not reduce the amount of data downloaded, but it does reduce the size of the dataset loaded into memory.

Furthermore, because the underlying Parquet file format is columnar, you can read only a subset of columns from the files by passing a list of column names to `to_table` through its `columns` argument.

In [11]:
# Keep only the rows whose service_id equals 3
delta_table_filtered = delta_table.to_table(filter=ds.field("service_id") == 3)

df_filtered = delta_table_filtered.to_pandas()
df_filtered

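|   | visit_id | client_id | therapist_id | service_id | visit_ts | month |
|---|----------|-----------|--------------|------------|----------|-------|
| 0 | 0 | 830-11-8837 | 150-60-1665 | 3 | 2022-05-21 09:34:39 | 5.0 |
| 1 | 4 | 148-49-3184 | 150-60-1665 | 3 | NaT |  |
| 2 | 7 | 431-25-4334 | 150-60-1665 | 3 | 2022-03-03 04:27:57 | 3.0 |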
