# Using lakeFS with R - NYC Filming Permits

<img src="https://docs.lakefs.io/assets/logo.svg" alt="lakeFS logo" height=100/>  <img src="https://www.r-project.org/logo/Rlogo.svg" alt="R logo" width=50/>

lakeFS interfaces with R in two ways:

* the [S3 gateway](https://docs.lakefs.io/understand/architecture.html#s3-gateway), which presents a lakeFS repository as an S3 bucket. You can then read and write data in lakeFS using standard S3 tools such as the `aws.s3` library.
* a [rich API](https://docs.lakefs.io/reference/api.html), which can be accessed from R using the `httr` library. Use the API for working with branches and commits.

_**Learn more about lakeFS in the [Quickstart](https://docs.lakefs.io/quickstart/) and about its support for R in the [documentation](https://docs.lakefs.io/integrations/r.html)**_

## Config

**_If you're not using the provided lakeFS server and MinIO storage, change these values to match your environment_**

### lakeFS endpoint and credentials

In [None]:
lakefsEndPoint = 'http://lakefs:8000' # e.g. 'https://username.aws_region_name.lakefscloud.io' 
lakefsAccessKey = 'AKIAIOSFOLKFSSAMPLES'
lakefsSecretKey = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'

### Object Storage

In [None]:
storageNamespace = 's3://example' # e.g. "s3://bucket"

---

## Setup

**(you shouldn't need to change anything in this section, just run it)**

In [None]:
repo_name = "using-r-with-lakefs"

### Variables

In [None]:
# Split the endpoint into host and protocol: aws.s3 expects the host
# without the protocol, and takes the protocol as a separate setting.
lakefsEndPoint_no_proto <- sub("^https?://", "", lakefsEndPoint)
lakefsEndPoint_proto <- sub("^(https?)://.*", "\\1", lakefsEndPoint)
# Logical flag passed as use_https in each aws.s3 call below
useHTTPS <- lakefsEndPoint_proto != "http"

# The aws.s3 library reads credentials and endpoint from these environment
# variables. Other settings, such as region, are not taken from the
# environment and need to be specified in each function call.
# See https://github.com/cloudyr/aws.s3/blob/master/man/s3HTTP.Rd for the
# full list of configuration parameters when calling the s3 functions.
Sys.setenv("AWS_ACCESS_KEY_ID" = lakefsAccessKey,
           "AWS_SECRET_ACCESS_KEY" = lakefsSecretKey,
           "AWS_S3_ENDPOINT" = lakefsEndPoint_no_proto)

# Set the API endpoint
lakefs_api_url <- paste0(lakefsEndPoint, "/api/v1")

### Libraries

In [None]:
system("conda install --quiet --yes r-arrow r-aws.s3 r-httr=1.4.6")

In [None]:
library(aws.s3)
library(httr)
library(arrow)

### Set up S3FileSystem for Arrow access to lakeFS

In [None]:
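# Arrow takes the endpoint as host[:port] and the protocol as a separate scheme,
# so reuse the two parts split out of lakefsEndPoint in the Variables section.
lakefs <- S3FileSystem$create(
    endpoint_override = lakefsEndPoint_no_proto,
    access_key = lakefsAccessKey, 
    secret_key = lakefsSecretKey, 
    region = "",
    scheme = lakefsEndPoint_proto
)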

#### Verify lakeFS credentials by getting lakeFS version

In [None]:
r=GET(url=paste0(lakefs_api_url,"/config/version"), authenticate(lakefsAccessKey, lakefsSecretKey))

In [None]:
print("Verifying lakeFS credentials…")
if (r$status_code == 200) {
    print(paste0("…✅lakeFS credentials verified. ℹ️lakeFS version ",content(r)$version))   
} else {
    print("🛑 failed to get lakeFS version")
    print(content(r)$message)
}

### Define lakeFS Repository

In [None]:
r=GET(url=paste0(lakefs_api_url,"/repositories/",repo_name), authenticate(lakefsAccessKey, lakefsSecretKey))

In [None]:
if (r$status_code ==404) {
    print(paste0("Repository ",repo_name," does not exist, so going to try and create it now."))

    body=list(name=repo_name, storage_namespace=paste0(storageNamespace,"/",repo_name))

    r=POST(url=paste0(lakefs_api_url,"/repositories"), 
           authenticate(lakefsAccessKey, lakefsSecretKey),
           body=body, encode="json" )

    if (r$status_code <400) {
        print(paste0("🟢 Created new repo ",repo_name," using storage namespace ",content(r)$storage_namespace))
    } else {
        print(paste0("🔴 Failed to create new repo: ",r$status_code))
        print(content(r)$message)
    }
    
} else if (r$status_code == 201 || r$status_code == 200) {
    print(paste0("Found existing repo ",repo_name," using storage namespace ",content(r)$storage_namespace))
} else {
    print(paste0("🔴 lakeFS API call failed: ",r$status_code))
    print(content(r)$message)
    print(r)
}

---

## Main demo starts here 🚦 👇🏻

### Load NYC Film Permits data from JSON

In [None]:
library(jsonlite)

In [None]:
nyc_data <- fromJSON("/data/nyc_film_permits.json")

### Show a sample of the data

In [None]:
str(nyc_data)

In [None]:
table(nyc_data$borough)

### Write the data to `main` branch (using `aws.s3`)

In [None]:
branch <- "main"
aws.s3::s3saveRDS(x = nyc_data,
                  object = paste0(branch,"/nyc/","nyc_permits.R"), 
                  bucket = repo_name, 
                  region="",
                  use_https=useHTTPS)
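
Through the S3 gateway, the repository plays the role of the bucket and the branch is the first component of the object key. That is why the call above passes `bucket = repo_name` and prefixes the object with `branch`. A minimal sketch of that addressing, using a hypothetical helper `lakefs_object_key()` that is not part of `aws.s3` or lakeFS:

```r
# Hypothetical helper (not part of aws.s3 or lakeFS): build the object key
# that the S3 gateway expects, i.e. "<branch>/<path within the branch>".
lakefs_object_key <- function(branch, path) {
  # Strip any leading slashes so the key never contains "//"
  paste0(branch, "/", sub("^/+", "", path))
}

# The same file addressed on two branches of one repository
key_main <- lakefs_object_key("main", "nyc/nyc_permits.R")
key_dev <- lakefs_object_key("dev", "/nyc/nyc_permits.R")
print(key_main)
print(key_dev)
```

Only the branch prefix differs between the two keys; the bucket (the repository) and the rest of the path stay the same.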

#### List uncommitted changes on `main`

In [None]:
# A GET request needs no body; the branch diff lists its uncommitted changes
r=GET(url=paste0(lakefs_api_url,"/repositories/",repo_name,"/branches/",branch,"/diff"), 
       authenticate(lakefsAccessKey, lakefsSecretKey))

In [None]:
if (r$status_code <400) {
    print(paste0("👏🏻 lakeFS API call succeeded (",r$status_code,")"))
    str((content(r)$results))
} else {
    print(paste0("☹️ lakeFS API call failed: ",r$status_code))
    print(content(r)$message)
}
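
The same success-or-failure check follows each API call in this notebook. As a sketch only, it could be factored into a helper. `describe_api_status()` is a hypothetical name, and it takes the status code and message as plain values so it can be tried without a running lakeFS server:

```r
# Hypothetical helper: turn an HTTP status code (and optional error message)
# into the one-line summary printed after each API call in this notebook.
describe_api_status <- function(status_code, message = NULL) {
  if (status_code < 400) {
    paste0("lakeFS API call succeeded (", status_code, ")")
  } else {
    # Append the server's error message when one is available
    detail <- if (is.null(message)) "" else paste0(" - ", message)
    paste0("lakeFS API call failed: ", status_code, detail)
  }
}

print(describe_api_status(201))
print(describe_api_status(404, "branch not found"))
```

With an `httr` response `r`, this would be called as `describe_api_status(r$status_code, content(r)$message)`.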

#### Commit the data to `main`

In [None]:
body=list(message="Initial data load", 
          metadata=list(
              client="httr", author="rmoff"))

r=POST(url=paste0(lakefs_api_url,"/repositories/",repo_name,"/branches/",branch,"/commits"), 
       authenticate(lakefsAccessKey, lakefsSecretKey),
       body=body, encode="json" )

In [None]:
if (r$status_code <400) {
    print(paste0("👏🏻 lakeFS API call succeeded (",r$status_code,")"))
    content(r)
} else {
    print(paste0("☹️ lakeFS API call failed: ",r$status_code))
    print(content(r)$message)
}

### Create a new branch on which to experiment with the data

In [None]:
branch <- "dev"

In [None]:
r=POST(url=paste0(lakefs_api_url,"/repositories/",repo_name,"/branches"), 
       authenticate(lakefsAccessKey, lakefsSecretKey),
       body=list(name=branch, source="main"), 
       encode="json" )

In [None]:
if (r$status_code <400) {
    print(paste0("👏🏻 lakeFS API call succeeded (",r$status_code,")"))
    content(r)
} else {
    print(paste0("☹️ lakeFS API call failed: ",r$status_code))
    print(content(r)$message)
}

### Read the data from the `dev` branch to show that it's the same

In [None]:
nyc_data_dev <- aws.s3::s3readRDS(object = paste0(branch,"/nyc/","nyc_permits.R"), 
                                  bucket = repo_name, 
                                  region="",
                                  use_https=useHTTPS)

In [None]:
table(nyc_data_dev$borough)

### Delete some of the data

In [None]:
nyc_data_dev <- nyc_data_dev[nyc_data_dev$borough != "Manhattan", ]

In [None]:
table(nyc_data_dev$borough)
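
One caveat with this style of filter, shown here on made-up data rather than the permits file: if `borough` contains missing values, the comparison yields `NA`, and logical indexing returns an all-`NA` row for each of them. Whether the permits data has missing boroughs is not checked in this notebook, so treat the following as a sketch of the behaviour and two possible alternatives:

```r
# Toy data frame standing in for the permits data (values are made up)
permits <- data.frame(
  borough = c("Manhattan", "Brooklyn", NA, "Queens"),
  id = 1:4
)

# Logical indexing: the NA comparison produces a row filled with NAs
by_index <- permits[permits$borough != "Manhattan", ]

# subset() drops rows where the condition is NA as well as FALSE
by_subset <- subset(permits, borough != "Manhattan")

# Keep rows with a missing borough intact by testing for NA explicitly
keep_missing <- permits[is.na(permits$borough) | permits$borough != "Manhattan", ]

print(nrow(by_index))
print(nrow(by_subset))
print(keep_missing$id)
```

Which variant is appropriate depends on whether rows with an unknown borough should survive the delete.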

### Write it back to the object store in Parquet format

In [None]:
write_parquet(x = nyc_data_dev,
              sink = lakefs$path(paste0(repo_name, "/", branch , "/nyc/nyc_permits.parquet")))

#### Remove the RDS file

In [None]:
lakefs$DeleteFile(paste0(repo_name, "/", branch , "/nyc/nyc_permits.R"))

#### Show uncommitted changes

In [None]:
# A GET request needs no body; the branch diff lists its uncommitted changes
r=GET(url=paste0(lakefs_api_url,"/repositories/",repo_name,"/branches/",branch,"/diff"), 
       authenticate(lakefsAccessKey, lakefsSecretKey))

In [None]:
if (r$status_code <400) {
    print(paste0("👏🏻 lakeFS API call succeeded (",r$status_code,")"))
    str((content(r)$results))
} else {
    print(paste0("☹️ lakeFS API call failed: ",r$status_code))
    print(content(r)$message)
}

### Show that the `main` view of the data is unchanged

In [None]:
branch <- "main"
lakefs$ls(path = paste0(repo_name,"/",branch),
          recursive = TRUE)

In [None]:
nyc_data <- aws.s3::s3readRDS(object = paste0(branch,"/nyc/","nyc_permits.R"), 
                                  bucket = repo_name, 
                                  region="",
                                  use_https=useHTTPS)

table(nyc_data$borough)

### Commit the data to the branch

In [None]:
branch <- "dev"

body=list(message="remove data for Manhattan, write as parquet, remove original file", 
          metadata=list(
              client="httr", author="rmoff"))

r=POST(url=paste0(lakefs_api_url,"/repositories/",repo_name,"/branches/",branch,"/commits"), 
       authenticate(lakefsAccessKey, lakefsSecretKey),
       body=body, encode="json" )

In [None]:
if (r$status_code <400) {
    print(paste0("👏🏻 lakeFS API call succeeded (",r$status_code,")"))
    content(r)
} else {
    print(paste0("☹️ lakeFS API call failed: ",r$status_code))
    print(content(r)$message)
}

### Merge the branch into `main`

In [None]:
r=POST(url=paste0(lakefs_api_url,"/repositories/",repo_name,"/refs/",branch,"/merge/main"), 
       authenticate(lakefsAccessKey, lakefsSecretKey),
       body=list(message="merge changes from dev back to main branch"), encode="json" )

In [None]:
if (r$status_code <400) {
    print(paste0("👏🏻 lakeFS API call succeeded (",r$status_code,")"))
    content(r)
} else {
    print(paste0("☹️ lakeFS API call failed: ",r$status_code))
    print(content(r)$message)
}
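
The API calls in this notebook all build their URLs from the same pieces. The sketch below uses a hypothetical helper, `lakefs_url()`, to assemble the paths already used above for the diff, commit, and merge calls. The helper and the variable names `api` and `repo` are assumptions for illustration; only the endpoint paths come from this notebook.

```r
# Hypothetical helper: join the API base URL, the repository, and any
# further path segments with "/" to form a lakeFS API URL.
lakefs_url <- function(api_url, repo, ...) {
  paste(c(api_url, "repositories", repo, ...), collapse = "/")
}

# Example values matching the defaults used in this notebook
api <- "http://lakefs:8000/api/v1"
repo <- "using-r-with-lakefs"

diff_url <- lakefs_url(api, repo, "branches", "dev", "diff")
commit_url <- lakefs_url(api, repo, "branches", "dev", "commits")
merge_url <- lakefs_url(api, repo, "refs", "dev", "merge", "main")

print(diff_url)
print(commit_url)
print(merge_url)
```

Note the order in the merge path: the source ref (`dev`) comes first and the destination branch (`main`) last, as in the merge call above.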

### Show that the `main` view of the data has now changed

In [None]:
branch <- "main"
nyc_data <- read_parquet(lakefs$path(paste0(repo_name, "/", branch , "/nyc/nyc_permits.parquet")))

In [None]:
table(nyc_data$borough)