-
-
Notifications
You must be signed in to change notification settings - Fork 3
Impala connection via R using Implyr and Dplyr
Genelle Denzin edited this page Jun 28, 2019
·
7 revisions
- Install R. (If it is already installed, be sure it is updated.)
- Install R Studio. (If it is already installed, be sure it is updated.)
- Get credentials and database ID for the database you will be using from HSLynk Support.
- Get credentials to your VPN connection from HSLynk Support.
To connect to the HSLynk Big Data Warehouse with R, RStudio, etc., the best way is using Impala (as opposed to Hive).
You need to install the correct Cloudera driver for Impala. Cloudera supports Windows, Mac, and Linux. See the documentation here. Use the credentials given to you in step 3 above.
RStudio doesn't currently offer Secure Sockets connections in its free build, so you need to connect over a VPN. We recommend OpenVPN. Use the credentials given to you in step 4 above.
Before you run the following code, be sure you have successfully connected with your credentials to the VPN.
Here's a sample program to connect to Impala from R via Implyr.
# if you don't already have these packages installed, please run:
# install.package(odbc)
# install.package(implyr)
# install.package(dplyr)
library(odbc)
library(implyr)
library(dplyr)
impala <- src_impala(
drv = odbc::odbc(),
driver = "[directory or reference to driver]", # this will depend on your
# operating system and setup. if you need help, please ask on the Slack channel
host = "HOST",
port = 21050,
database = "[get this from HSLynk Support]",
uid = "[get this from HSLynk Support]",
pwd = "[get this from HSLynk Support]",
UseSasl=1,
AuthMech=3
)
tables <- dbGetQuery(impala, "show tables")
View(tables)
client <- data.frame(tbl(impala, "client"))
enrollment <- data.frame(tbl(impala, "enrollment"))
exit <- data.frame(tbl(impala, "exit"))
moveindate <- data.frame(tbl(impala, "moveindate"))
project <- data.frame(tbl(impala, "project"))
# view all clients where Ethnicity is Latino (Latino = 1 in the specs)
client %>%
select(id, active, date_created, ethnicity, ethnicity_desc) %>%
filter(ethnicity == 1) %>%
View()
Once you're connected, check out all the dplyr commands you can use: https://dplyr.tidyverse.org/