# Using lakeFS with R - weather data example

<img src="https://docs.lakefs.io/assets/logo.svg" alt="lakeFS logo" height=100/>  <img src="https://www.r-project.org/logo/Rlogo.svg" alt="R logo" width=50/>

This notebook shows a simple example of getting data into R, writing it to a branch of lakeFS, and merging that branch into another.

lakeFS interfaces with R in two ways: 

* the [S3 gateway](https://docs.lakefs.io/understand/architecture.html#s3-gateway), which presents a lakeFS repository as an S3 bucket. You can then read and write data in lakeFS using standard S3 tools such as the `aws.s3` library.
* a [rich API](https://docs.lakefs.io/reference/api.html), which can be accessed from R using the `httr` library. Use the API for working with branches and commits.

In the example below we load some data from an external URL, plot it, and then write it to lakeFS.

# Libraries

_The installation process may take a few minutes the first time that it runs._

In [None]:
system("conda install --quiet --yes r-arrow r-aws.s3 r-httr=1.4.6")

In [None]:
install.packages(c("dplyr","lubridate"))

In [None]:
library(dplyr)
library(jsonlite)
library(lubridate)
library(aws.s3)
library(httr)

# Do stuff in R

## Get the data in 💾 

This uses Environment Agency flood and river level data from the [real-time data API (Beta)](https://environment.data.gov.uk/flood-monitoring/doc/reference).

In [None]:
rainfall <- jsonlite::fromJSON("https://environment.data.gov.uk/flood-monitoring/id/stations/058461/readings?_limit=2500")$items
riverlevel <- jsonlite::fromJSON("https://environment.data.gov.uk/flood-monitoring/id/stations/F1902/readings?_limit=2500")$items

### Shape it into a dataframe

In [None]:
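# The trailing "Z" in the timestamps means UTC, so parse them as UTC rather than local time
dateTime <- as.POSIXct(unlist(riverlevel$dateTime), format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")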

df <- data.frame(dateTime, river_value=unlist(riverlevel$value))

df <- df %>% mutate(rainfall_value = unlist(rainfall$value))

str(df)
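
The `mutate()` call above pairs the two series by row position, which relies on both stations returning the same number of readings in the same order. A minimal sketch of aligning by timestamp instead, using base R `merge()`, is shown here. The `river` and `rain` data frames and their values are made up for illustration and do not come from the API.

```r
# Illustrative data only: two stations with slightly different timestamps
river <- data.frame(
  dateTime = as.POSIXct(c("2023-06-19 00:00:00", "2023-06-19 00:15:00",
                          "2023-06-19 00:30:00"), tz = "UTC"),
  river_value = c(0.41, 0.42, 0.44)
)
rain <- data.frame(
  dateTime = as.POSIXct(c("2023-06-19 00:15:00", "2023-06-19 00:30:00",
                          "2023-06-19 00:45:00"), tz = "UTC"),
  rainfall_value = c(0.0, 0.2, 0.4)
)

# Inner join on the shared timestamp column; readings without a match are dropped
aligned <- merge(river, rain, by = "dateTime")
str(aligned)
```

With an inner join, only timestamps present in both series are kept, so a missing reading at one station cannot shift the other series out of step.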

## Plot the data 📉

In [None]:
library(ggplot2)

# Create a line plot
p <- ggplot(data = df) +
  geom_line(aes(x = dateTime, y = rainfall_value, color = "Rainfall")) +
  geom_line(aes(x = dateTime, y = river_value, color = "River Height (m)")) +
  scale_color_manual(values = c("River Height (m)" = "darkblue", "Rainfall" = "lightblue")) +
  xlab("Date") +
  ylab("Height (m)") +
  ggtitle("Rainfall and River Wharfe level in Ilkley") +
  scale_y_continuous(
    name = "River Height (m)",
    sec.axis = sec_axis(~ .,
                        name = "Rainfall (mm/15min)"
    )
  )

p

### Write the data to a local file

In [None]:
chart_image <- tempfile("plot",fileext = ".png")
ggsave(chart_image, plot = p, device = "png")

## Zoom in on a day

In [None]:
subset_df <- filter(df, month(dateTime) == 6, day(dateTime) == 19)
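
The filter above hard-codes 19 June, so it returns no rows if the readings fetched from the API do not cover that date. One hedged alternative is to derive the day from the data itself. The sketch below uses a small made-up data frame (`readings` is a hypothetical stand-in for `df`) and keeps only the most recent calendar day, in UTC.

```r
# Illustrative data only: readings that span midnight
readings <- data.frame(
  dateTime = as.POSIXct(c("2023-06-18 23:45:00", "2023-06-19 00:00:00",
                          "2023-06-19 00:15:00"), tz = "UTC"),
  river_value = c(0.40, 0.41, 0.42)
)

# Most recent calendar day present in the data (UTC)
latest_day <- as.Date(max(readings$dateTime), tz = "UTC")

# Keep only the rows that fall on that day
latest_df <- readings[as.Date(readings$dateTime, tz = "UTC") == latest_day, ]
latest_df
```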

In [None]:
p <-  ggplot(data = subset_df) +
  geom_line(aes(x = dateTime, y = rainfall_value, color = "Rainfall")) +
  geom_line(aes(x = dateTime, y = river_value, color = "River Height (m)")) +
  scale_color_manual(values = c( "River Height (m)" = "darkblue", "Rainfall" = "lightblue")) +
  xlab("Date") +
  ylab("Height (m)") +
  ggtitle("Rainfall and River Wharfe level in Ilkley") +
  scale_y_continuous(
    name = "River Height (m)",
    sec.axis = sec_axis(~ .,
                        name = "Rainfall (mm/15min)"
    )
  )

p

### Write the new chart to a local file

In [None]:
day_chart_image <- tempfile("plot-day",fileext = ".png")
ggsave(day_chart_image, plot = p, device = "png")

---

# <img src="https://docs.lakefs.io/assets/logo.svg" alt="lakeFS logo" width=100/> Save the data to lakeFS 

## Set up lakeFS connection

Objects are read and written via the [lakeFS S3 gateway](https://docs.lakefs.io/understand/architecture.html#s3-gateway); branch and commit operations go through the lakeFS API.

### lakeFS credentials and location

If you're using the `lakefs-samples` Docker Compose then you can leave this unchanged. 

In [None]:
access_key <- "AKIAIOSFOLKFSSAMPLES"
secret_key <- "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
baseurl <- "lakefs:8000"

### Store creds as env vars, set API endpoint

In [None]:
Sys.setenv("AWS_ACCESS_KEY_ID" = access_key,
           "AWS_SECRET_ACCESS_KEY" = secret_key)

lakefs_api_url <- paste0("http://", baseurl, "/api/v1")

## Smoke test - list the lakeFS repositories

This uses the `aws.s3` library. 

Each _bucket_ is a [_lakeFS repository_](https://docs.lakefs.io/understand/model.html#repository).

In [None]:
bucketlist(
    base_url=baseurl,
    region="",
    use_https=FALSE)

## Show objects in `main` branch

This assumes we're using the `quickstart` repository.

In [None]:
branch <- "main"

get_bucket_df(
    base_url=baseurl,
    bucket="quickstart",
    use_https=FALSE, 
    prefix=paste0(branch,"/"), delimiter="/",
    region="",
    verbose=FALSE)
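
Through the S3 gateway, the bucket is the repository and the first element of the object key is the branch, which is why the prefix above is `main/`. A small helper can make that mapping explicit. `lakefs_key()` is a hypothetical name introduced here for illustration; it is not part of `aws.s3` or lakeFS.

```r
# Build an S3 object key of the form "<branch>/<path>" for the lakeFS S3 gateway.
# lakefs_key() is an illustrative helper, not a library function.
lakefs_key <- function(branch, ...) {
  # Drop leading or trailing slashes from each part so they join cleanly
  parts <- gsub("^/+|/+$", "", c(branch, ...))
  paste(parts, collapse = "/")
}

lakefs_key("main", "weather", "plot.png")
```

A call such as `lakefs_key(branch, "weather", "plot.png")` could then be passed as the `object` argument of the `aws.s3` functions in place of the `paste0()` calls.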

## Create branch 

We're going to write our data from above to the repository and, as is good practice, we won't write directly to the main branch. Instead we'll write to a 'feature' branch and merge it into main from there.

_ref: [lakeFS API](https://docs.lakefs.io/reference/api.html#/branches/createBranch)_

In [None]:
branch <- "weather"

In [None]:
body=list(name=branch, source="main")

r=POST(url=paste0(lakefs_api_url,"/repositories/quickstart/branches"), 
       authenticate(access_key, secret_key),
       body=body, encode="json" )

In [None]:
if (r$status_code <400) {
    print(paste0("👏🏻 lakeFS API call succeeded (",r$status_code,")"))
    content(r)
} else {
    print(paste0("☹️ lakeFS API call failed: ",r$status_code))
    print(content(r)$message)
}
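
The same status check is repeated after each API call in this notebook. One way to cut the repetition is a small function, sketched here under the same assumption as the cell above: any status code below 400 counts as success. `describe_status()` is a hypothetical helper name, not part of `httr`.

```r
# Turn an HTTP status code into a short human-readable message.
# Assumption: codes below 400 are treated as success, as in the notebook's checks.
describe_status <- function(status_code) {
  if (status_code < 400) {
    paste0("lakeFS API call succeeded (", status_code, ")")
  } else {
    paste0("lakeFS API call failed: ", status_code)
  }
}

describe_status(201)
describe_status(409)
```

After an `httr` request it could be used as `print(describe_status(r$status_code))`, followed by `content(r)` to inspect the response body.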

## List branches

_ref: [lakeFS API](https://docs.lakefs.io/reference/api.html#/branches/listBranches)_

In [None]:
r=GET(url=paste0(lakefs_api_url,"/repositories/quickstart/branches"), 
       authenticate(access_key, secret_key))

In [None]:
if (r$status_code <400) {
    print(paste0("👏🏻 lakeFS API call succeeded (",r$status_code,")"))
    content(r)
} else {
    print(paste0("☹️ lakeFS API call failed: ",r$status_code))
    print(content(r)$message)
}

## Write R data to lakeFS

There are different ways to do this; two are shown here. Any method that can write to S3 will work.

### `s3saveRDS` (aws.s3)

Save the R dataframe

In [None]:
s3saveRDS(x=df, 
          bucket = 'quickstart', 
          object = paste0(branch,"/weather/","data.rds"), 
          base_url=baseurl,
          region="",
          use_https=FALSE)
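
`s3saveRDS` uploads a serialised copy of the R object, in the same spirit as base R `saveRDS`, so column types such as `POSIXct` are expected to survive the round trip; `aws.s3` provides `s3readRDS` as the counterpart for reading it back. The sketch below shows the equivalent round trip against a local temporary file using base R only, with made-up data, so it runs without a lakeFS server.

```r
# Illustrative data only: one reading with a UTC timestamp
sample_df <- data.frame(
  dateTime = as.POSIXct("2023-06-19 00:00:00", tz = "UTC"),
  river_value = 0.41
)

# Serialise to a temporary file, then read it back
rds_path <- tempfile(fileext = ".rds")
saveRDS(sample_df, rds_path)
restored <- readRDS(rds_path)

# Check that the restored object matches the original, including column classes
identical(sample_df, restored)
class(restored$dateTime)
```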

### `put_object` (aws.s3)

Upload the two plot images that we saved to local files above.

In [None]:
put_object(chart_image, 
           bucket = 'quickstart', 
           object = paste0(branch,"/weather/","plot.png"),
           base_url=baseurl,
           region="",
           use_https=FALSE)

In [None]:
put_object(day_chart_image, 
           bucket = 'quickstart', 
           object = paste0(branch,"/weather/","day_plot.png"),
           base_url=baseurl,
           region="",
           use_https=FALSE)

## List uncommitted data

When you write an object to lakeFS it is uncommitted until you commit it. 

_ref: [lakeFS API](https://docs.lakefs.io/reference/api.html#/branches/diffBranch)_

In [None]:
r=GET(url=paste0(lakefs_api_url,"/repositories/quickstart/branches/",branch,"/diff"), 
       authenticate(access_key, secret_key))

In [None]:
if (r$status_code <400) {
    print(paste0("👏🏻 lakeFS API call succeeded (",r$status_code,")"))
    str((content(r)$results))
} else {
    print(paste0("☹️ lakeFS API call failed: ",r$status_code))
    print(content(r)$message)
}
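
The nested list printed by `str()` can be hard to scan when there are many changes. A minimal sketch of flattening it into a data frame follows. It runs on a mock list rather than a live response, and the field names `type` and `path` are an assumption about the shape of each entry in `content(r)$results`.

```r
# Mock of content(r)$results; the field names are assumed, not taken from a live response
results <- list(
  list(type = "added", path = "weather/plot.png"),
  list(type = "added", path = "weather/day_plot.png")
)

# Pull one field out of every entry and combine them into a data frame
changes <- data.frame(
  type = vapply(results, function(x) x$type, character(1)),
  path = vapply(results, function(x) x$path, character(1)),
  stringsAsFactors = FALSE
)
changes
```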

## Commit data

In [None]:
body=list(message="add weather data", 
          metadata=list(
              client="httr", author="rmoff"))

r=POST(url=paste0(lakefs_api_url,"/repositories/quickstart/branches/",branch,"/commits"), 
       authenticate(access_key, secret_key),
       body=body, encode="json" )

In [None]:
if (r$status_code <400) {
    print(paste0("👏🏻 lakeFS API call succeeded (",r$status_code,")"))
    content(r)
} else {
    print(paste0("☹️ lakeFS API call failed: ",r$status_code))
    print(content(r)$message)
}

## Merge to main

In [None]:
body=list(message="merge new weather data to main branch")

r=POST(url=paste0(lakefs_api_url,"/repositories/quickstart/refs/",branch,"/merge/main"), 
       authenticate(access_key, secret_key),
       body=body, encode="json" )

In [None]:
if (r$status_code <400) {
    print(paste0("👏🏻 lakeFS API call succeeded (",r$status_code,")"))
    content(r)
} else {
    print(paste0("☹️ lakeFS API call failed: ",r$status_code))
    print(content(r)$message)
}