Join GitHub today
GitHub is home to over 36 million developers working together to host and review code, manage projects, and build software together.Sign up
Impala connection via R using Implyr and Dplyr
Before attempting to connect, please be sure you have the following:
- Install R. (If it is already installed, be sure it is updated.)
- Install R Studio. (If it is already installed, be sure it is updated.)
- Get credentials and database ID for the database you will be using from HSLynk Support.
- Get credentials to your VPN connection from HSLynk Support.
How to Set Up your Connection
To connect to the HSLynk Big Data Warehouse with R, RStudio, etc., the best way is using Impala (as opposed to Hive).
You need to install the correct Cloudera driver for Impala. Cloudera supports Windows, Mac, and Linux. See the documentation here. Use the credentials given to you in step 3 above.
RStudio doesn't currently offer Secure Sockets connections in its free build, so you need to connect over a VPN. We recommend OpenVPN. Use the credentials given to you in step 4 above.
Before you run the following code, be sure you have successfully connected with your credentials to the VPN.
Here's a sample program to connect to Impala from R via Implyr.
# if you don't already have these packages installed, please run: # install.package(odbc) # install.package(implyr) # install.package(dplyr) library(odbc) library(implyr) library(dplyr) impala <- src_impala( drv = odbc::odbc(), driver = "[directory or reference to driver]", # this will depend on your # operating system and setup. if you need help, please ask on the Slack channel host = "HOST", port = 21050, database = "[get this from HSLynk Support]", uid = "[get this from HSLynk Support]", pwd = "[get this from HSLynk Support]", UseSasl=1, AuthMech=3 )
List the tables inside the database
tables <- dbGetQuery(impala, "show tables") View(tables)
Creating dataframes from the database
client <- data.frame(tbl(impala, "client")) enrollment <- data.frame(tbl(impala, "enrollment")) exit <- data.frame(tbl(impala, "exit")) moveindate <- data.frame(tbl(impala, "moveindate")) project <- data.frame(tbl(impala, "project"))
# view all clients where Ethnicity is Latino (Latino = 1 in the specs) client %>% select(id, active, date_created, ethnicity, ethnicity_desc) %>% filter(ethnicity == 1) %>% View()
Once you're connected, check out all the dplyr commands you can use: https://dplyr.tidyverse.org/