Skip to content

Impala connection via R using Implyr and Dplyr

Genelle Denzin edited this page Jun 28, 2019 · 7 revisions

Before attempting to connect, please be sure you have the following:

  1. Install R. (If it is already installed, be sure it is updated.)
  2. Install R Studio. (If it is already installed, be sure it is updated.)
  3. Get credentials and database ID for the database you will be using from HSLynk Support.
  4. Get credentials to your VPN connection from HSLynk Support.

How to Set Up your Connection

To connect to the HSLynk Big Data Warehouse with R, RStudio, etc., the best way is using Impala (as opposed to Hive).

You need to install the correct Cloudera driver for Impala. Cloudera supports Windows, Mac, and Linux. See the documentation here. Use the credentials given to you in step 3 above.

RStudio doesn't currently offer Secure Sockets connections in its free build, so you need to connect over a VPN. We recommend OpenVPN. Use the credentials given to you in step 4 above.

Before you run the following code, be sure you have successfully connected with your credentials to the VPN.

Here's a sample program to connect to Impala from R via Implyr.

# if you don't already have these packages installed, please run:
# install.package(odbc)
# install.package(implyr)
# install.package(dplyr)

library(odbc)
library(implyr)
library(dplyr)

impala <- src_impala(
  drv = odbc::odbc(),
  driver = "[directory or reference to driver]", # this will depend on your 
  # operating system and setup. if you need help, please ask on the Slack channel
  host = "HOST",
  port = 21050,
  database = "[get this from HSLynk Support]",
  uid = "[get this from HSLynk Support]",
  pwd = "[get this from HSLynk Support]",
  UseSasl=1,
  AuthMech=3
)

List the tables inside the database

tables <- dbGetQuery(impala, "show tables")

View(tables)

Creating dataframes from the database

client <- data.frame(tbl(impala, "client"))
enrollment <- data.frame(tbl(impala, "enrollment"))
exit <- data.frame(tbl(impala, "exit"))
moveindate <- data.frame(tbl(impala, "moveindate"))
project <- data.frame(tbl(impala, "project"))

Sample code

# view all clients where Ethnicity is Latino (Latino = 1 in the specs)
client %>% 
  select(id, active, date_created, ethnicity, ethnicity_desc) %>%
  filter(ethnicity == 1) %>% 
  View()

Once you're connected, check out all the dplyr commands you can use: https://dplyr.tidyverse.org/

Clone this wiki locally