# Using lakeFS with R

<img src="https://docs.lakefs.io/assets/logo.svg" alt="lakeFS logo" height="100"/>  <img src="https://www.r-project.org/logo/Rlogo.svg" alt="R logo" width="50"/>

lakeFS interfaces with R in two ways: 

* the [S3 gateway](https://docs.lakefs.io/understand/architecture.html#s3-gateway), which presents a lakeFS repository as an S3 bucket. You can then read and write data in lakeFS using standard S3 tools such as the `aws.s3` library.
* a [rich API](https://docs.lakefs.io/reference/api.html), which can be accessed from R using the `httr` library. Use the API for working with branches and commits.

## Config

**_If you're not using the provided lakeFS server and MinIO storage then change these values to match your environment_**

### lakeFS endpoint and credentials

In [None]:
lakefsEndPoint = 'http://lakefs:8000' # e.g. 'https://username.aws_region_name.lakefscloud.io' 
lakefsAccessKey = 'AKIAIOSFOLKFSSAMPLES'
lakefsSecretKey = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'

### Object Storage

In [None]:
storageNamespace = 's3://example' # e.g. "s3://bucket"

---

## Setup

**(you shouldn't need to change anything in this section, just run it)**

In [None]:
repo_name = "using-r-with-lakefs"

### Variables

In [None]:
# aws.s3 library uses these environment variables
# Some, such as region, need to be specified in the function call 
# and are not taken from environment variables.
# See https://github.com/cloudyr/aws.s3/blob/master/man/s3HTTP.Rd for
# full list of configuration parameters when calling the s3 functions.
lakefsEndPoint_no_proto <- sub("^https?://", "", lakefsEndPoint)
lakefsEndPoint_proto <- sub("^(https?)://.*", "\\1", lakefsEndPoint)
# aws.s3 documents use_https as a logical, so store TRUE/FALSE rather than a string
useHTTPS <- lakefsEndPoint_proto != "http"

Sys.setenv("AWS_ACCESS_KEY_ID" = lakefsAccessKey,
           "AWS_SECRET_ACCESS_KEY" = lakefsSecretKey,
           "AWS_S3_ENDPOINT" = lakefsEndPoint_no_proto)

# Set the API endpoint
lakefs_api_url<- paste0(lakefsEndPoint,"/api/v1")
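
The protocol handling above can also be wrapped in a small function. The following is a minimal sketch in base R. `split_endpoint` is a hypothetical helper name (it is not part of `aws.s3` or lakeFS), and it assumes the endpoint always starts with `http://` or `https://`. It returns the HTTPS flag as a logical value, which is the type `aws.s3` documents for `use_https`.

```r
# Hypothetical helper: split a lakeFS endpoint URL into the pieces that
# the S3 client settings are derived from.
split_endpoint <- function(endpoint) {
    # Assumption: the endpoint always carries an explicit http(s) scheme
    if (!grepl("^https?://", endpoint)) {
        stop("endpoint must start with http:// or https://")
    }
    scheme <- sub("^(https?)://.*", "\\1", endpoint)
    # Remove the scheme, then any trailing slashes
    host <- sub("/+$", "", sub("^https?://", "", endpoint))
    list(scheme = scheme, host = host, use_https = scheme == "https")
}

ep <- split_endpoint("http://lakefs:8000")
str(ep)
```

The `host` element would then feed `AWS_S3_ENDPOINT`, and `use_https` the `use_https` argument of the `aws.s3` calls.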

### Libraries

In [None]:
system("conda install --quiet --yes r-arrow r-aws.s3 r-httr=1.4.6")

In [None]:
library(aws.s3)
library(httr)

#### Verify lakeFS credentials by getting lakeFS version

In [None]:
r=GET(url=paste0(lakefs_api_url,"/config/version"), authenticate(lakefsAccessKey, lakefsSecretKey))

In [None]:
print("Verifying lakeFS credentials…")
if (r$status_code <400) {
    print(paste0("…✅lakeFS credentials verified. ℹ️lakeFS version ",content(r)$version))   
} else {
    print("🛑 failed to get lakeFS version")
    print(content(r)$message)
}

### Define lakeFS Repository

In [None]:
r=GET(url=paste0(lakefs_api_url,"/repositories/",repo_name), authenticate(lakefsAccessKey, lakefsSecretKey))

In [None]:
if (r$status_code ==404) {
    print(paste0("Repository ",repo_name," does not exist, so going to try and create it now."))

    body=list(name=repo_name, storage_namespace=paste0(storageNamespace,"/",repo_name))

    r=POST(url=paste0(lakefs_api_url,"/repositories"), 
           authenticate(lakefsAccessKey, lakefsSecretKey),
           body=body, encode="json" )

    if (r$status_code <400) {
        print(paste0("🟢 Created new repo ",repo_name," using storage namespace ",content(r)$storage_namespace))
    } else {
        print(paste0("🔴 Failed to create new repo: ",r$status_code))
        print(content(r)$message)
    }
    
} else if (r$status_code == 201 || r$status_code == 200) {
    print(paste0("Found existing repo ",repo_name," using storage namespace ",content(r)$storage_namespace))
} else {
    print(paste0("🔴 lakeFS API call failed: ",r$status_code))
    print(content(r)$message)
    print(r)
}

---

# Main demo starts here 🚦 👇🏻

## Use built-in dataset from R for our example

In [None]:
data(mtcars)

In [None]:
str(mtcars)

## Draw some charts

In [None]:
library(ggplot2)

In [None]:
my_scatplot <- ggplot(mtcars,aes(x=wt,y=mpg)) + geom_point()
p <- my_scatplot + xlab('Weight (x 1000lbs)') + ylab('Miles per Gallon') + geom_smooth()

chart1file <- tempfile("mtcars-mpg_vs_weight",fileext = ".png")
ggsave(chart1file, plot = p, device = "png")
p

In [None]:
my_scatplot <- ggplot(mtcars,aes(x=wt,y=mpg,col=cyl)) + geom_point()
p <- my_scatplot + facet_grid(~cyl)

chart2file <- tempfile("mtcars-mpg_vs_weight_cyl",fileext = ".png")
ggsave(chart2file, plot = p, device = "png")
p

---

## <img src="https://docs.lakefs.io/assets/logo.svg" alt="lakeFS logo" width="100"/> Working with lakeFS

_lakeFS sits on top of object storage, so you can store any kind of file in it_

## Smoke test - list the lakeFS repositories

This uses the `aws.s3` library. 

Each _bucket_ is a [_lakeFS repository_](https://docs.lakefs.io/understand/model.html#repository).

In [None]:
bucketlist(
    region="",
    use_https=useHTTPS)

## Create branch 

We're going to write the data from above to the repository and, as is good practice, won't write directly to the main branch. Instead, we'll write to a 'feature' branch and merge it into main from there.

_ref: [lakeFS API](https://docs.lakefs.io/reference/api.html#/branches/createBranch)_

In [None]:
branch <- "add-data"

In [None]:
body=list(name=branch, source="main")

r=POST(url=paste0(lakefs_api_url,"/repositories/",repo_name,"/branches"), 
       authenticate(lakefsAccessKey, lakefsSecretKey),
       body=body, encode="json" )

In [None]:
if (r$status_code <400) {
    print(paste0("👏🏻 lakeFS API call succeeded (",r$status_code,")"))
    content(r)
} else {
    print(paste0("☹️ lakeFS API call failed: ",r$status_code))
    print(content(r)$message)
}

## List branches

_ref: [lakeFS API](https://docs.lakefs.io/reference/api.html#/branches/listBranches)_

In [None]:
r=GET(url=paste0(lakefs_api_url,"/repositories/",repo_name,"/branches"), 
       authenticate(lakefsAccessKey, lakefsSecretKey))

In [None]:
if (r$status_code <400) {
    print(paste0("👏🏻 lakeFS API call succeeded (",r$status_code,")"))
    content(r)
} else {
    print(paste0("☹️ lakeFS API call failed: ",r$status_code))
    print(content(r)$message)
}
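
The same status check follows each API call, so it could be factored into one function. This is only a sketch: `describe_result` is a hypothetical helper, and it takes the status code and the parsed body as plain values so that it can be tried without a running lakeFS server. It assumes, as the checks above do, that error responses carry a `message` field.

```r
# Hypothetical helper: turn a status code and parsed body into one message.
# Mirrors the checks above: anything below 400 counts as success.
describe_result <- function(status_code, body = list()) {
    if (status_code < 400) {
        paste0("lakeFS API call succeeded (", status_code, ")")
    } else {
        # Assumption: error responses carry a `message` field
        msg <- if (is.null(body$message)) "no message returned" else body$message
        paste0("lakeFS API call failed: ", status_code, " - ", msg)
    }
}

# Made-up inputs; with httr the call would be
# describe_result(r$status_code, content(r))
print(describe_result(201))
print(describe_result(404, list(message = "branch not found")))
```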

## Write R data to lakeFS

There are different ways to do this; here are two. Any method that can write to S3 will work.

### `s3saveRDS` (aws.s3)

Save the R object

In [None]:
s3saveRDS(x=mtcars, 
          bucket = repo_name, 
          object = paste0(branch,"/cars/","data.R"), 
          region="",
          use_https=useHTTPS)

### `put_object` (aws.s3)

Upload the two chart images that we saved above

In [None]:
put_object(file = chart1file, 
           bucket = repo_name, 
           object = paste0(branch,"/cars/","plot1.png"),
           region="",
           use_https=useHTTPS)

In [None]:
put_object(file = chart2file, 
           bucket = repo_name, 
           object = paste0(branch,"/cars/","plot2.png"),
           region="",
           use_https=useHTTPS)
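
In the calls above, the `bucket` argument is the repository name and the object key begins with the branch name. A minimal sketch of a helper that builds such keys follows. `lakefs_key` is a hypothetical name used only for illustration.

```r
# Hypothetical helper: build an S3 object key of the form <branch>/<path>
lakefs_key <- function(branch, ...) {
    parts <- c(branch, ...)
    # Drop leading and trailing slashes so the pieces join with single slashes
    parts <- gsub("^/+|/+$", "", parts)
    # Skip pieces that are empty after trimming
    paste(parts[nzchar(parts)], collapse = "/")
}

print(lakefs_key("add-data", "cars", "plot1.png"))
```

The result could be passed as the `object` argument in place of the `paste0(branch, "/cars/", ...)` expressions.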

## List uncommitted data

When you write an object to lakeFS it is uncommitted until you commit it. 

_ref: [lakeFS API](https://docs.lakefs.io/reference/api.html#/branches/diffBranch)_

In [None]:
# A GET request takes no body, so only the URL and credentials are passed
r=GET(url=paste0(lakefs_api_url,"/repositories/",repo_name,"/branches/",branch,"/diff"), 
       authenticate(lakefsAccessKey, lakefsSecretKey))

In [None]:
if (r$status_code <400) {
    print(paste0("👏🏻 lakeFS API call succeeded (",r$status_code,")"))
    str((content(r)$results))
} else {
    print(paste0("☹️ lakeFS API call failed: ",r$status_code))
    print(content(r)$message)
}
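
`str()` prints the raw nested list. As a sketch, the entries could be flattened into a data frame for easier reading. The example below uses a hand-written list in place of `content(r)$results`. The field names `type`, `path`, and `size_bytes` are assumptions about the response shape and should be checked against the API reference. `diff_to_df` is a hypothetical helper.

```r
# Hypothetical helper: flatten a list of diff entries into a data frame.
diff_to_df <- function(results) {
    # Return NA when an entry lacks the requested field
    field <- function(entry, name) {
        value <- entry[[name]]
        if (is.null(value)) NA else value
    }
    data.frame(
        type = vapply(results, function(e) as.character(field(e, "type")), character(1)),
        path = vapply(results, function(e) as.character(field(e, "path")), character(1)),
        size_bytes = vapply(results, function(e) as.numeric(field(e, "size_bytes")), numeric(1)),
        stringsAsFactors = FALSE
    )
}

# Made-up sample standing in for content(r)$results
sample_results <- list(
    list(type = "added", path = "cars/data.R", size_bytes = 1218),
    list(type = "added", path = "cars/plot1.png")
)
print(diff_to_df(sample_results))
```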

## Commit data

In [None]:
body=list(message="add car data and charts", 
          metadata=list(
              client="httr", author="rmoff"))

r=POST(url=paste0(lakefs_api_url,"/repositories/",repo_name,"/branches/",branch,"/commits"), 
       authenticate(lakefsAccessKey, lakefsSecretKey),
       body=body, encode="json" )

In [None]:
if (r$status_code <400) {
    print(paste0("👏🏻 lakeFS API call succeeded (",r$status_code,")"))
    content(r)
} else {
    print(paste0("☹️ lakeFS API call failed: ",r$status_code))
    print(content(r)$message)
}

## Merge to main

In [None]:
body=list(message="merge new car data to main branch")

r=POST(url=paste0(lakefs_api_url,"/repositories/",repo_name,"/refs/",branch,"/merge/main"), 
       authenticate(lakefsAccessKey, lakefsSecretKey),
       body=body, encode="json" )

In [None]:
if (r$status_code <400) {
    print(paste0("👏🏻 lakeFS API call succeeded (",r$status_code,")"))
    content(r)
} else {
    print(paste0("☹️ lakeFS API call failed: ",r$status_code))
    print(content(r)$message)
}

## Read data from lakeFS into R

### `s3readRDS` (aws.s3)

Load the R object

In [None]:
main_cars <- s3readRDS(bucket = repo_name, 
              object = paste0("main","/cars/","data.R"), 
              region="",
              use_https=useHTTPS)

In [None]:
str(main_cars)

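# Using Arrow with R and lakeFS 

The examples in this section read from and write to the `main` branch of a repository named `quickstart`, which is assumed to already exist and to contain `lakes.parquet`.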

In [None]:
library(arrow)

In [None]:
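# Pass host:port and scheme separately, derived from the configured endpoint
lakefs <- S3FileSystem$create(
    endpoint_override = lakefsEndPoint_no_proto,
    access_key = lakefsAccessKey, 
    secret_key = lakefsSecretKey, 
    region = "",
    scheme = lakefsEndPoint_proto
)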

## List bucket contents

In [None]:
lakefs$ls(path = "quickstart/main")

## Read a parquet file

In [None]:
lakes <- read_parquet(lakefs$path("quickstart/main/lakes.parquet"))
str(lakes)

## Write a file as Arrow (feather)

In [None]:
write_feather(x = lakes,
              sink = lakefs$path("quickstart/main/lakes.arrow"))

### Read the file back to make sure it worked

In [None]:
check <- read_feather(lakefs$path("quickstart/main/lakes.arrow"))
str(check)