# L03-4-SQL Import
## Assignment Instructions
Rename with your name in place of Studentname and make your edits and updates here.



# SQL Import
In this exercise, we will connect to a SQL database, browse the data programmatically and then export the data into R data frames. We will use the odbc and DBI libraries to connect to SQL Server. These libraries work well with the tidyverse dplyr package.

The database authorization level is read-only for the credentials used in this exercise. However, these packages provide the capability to alter the database data and the database itself.


## R Features
* library()
* dbConnect()
* print()
* str()
* dbListTables()
* sort()
* head()
* dbListFields()
* dbGetQuery()
* select()
* everything()
* arrange()
* desc()
* dbDisconnect()
* ggplot()
* geom_bar()


## Datasets
* AdventureWorks


In [1]:
# Load libraries
library('odbc') # odbc
library('DBI') # DBI
library('tidyverse') # tidyverse

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.2.1     ✔ purrr   0.3.2
✔ tibble  2.1.3     ✔ dplyr   0.8.3
✔ tidyr   1.0.0     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.4.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


## Database connection information
You will need a few pieces of information to import data from SQL.
* Driver library: the R package that provides a database driver compatible with your database. We will use odbc in this exercise. For a different database, you may need to load a different library.
* Driver: the driver name, which is part of the connection string. We will use 'ODBC Driver 13 for SQL Server' in this exercise. Change this to match your needs.
* Server: the name of the database server or server instance.
* Database: the name of the database on the server that you want to connect to.
* Authentication credentials: the username and password that allow access to the database.



In [2]:
# Connection string info
# Already completed, just run the code block
driver_name <- "ODBC Driver 13 for SQL Server"
server_name <- "uwc-sqlserver.clients.uw.edu"
database_name <- "AdventureWorks2016CTP3" 
user_id <- "sqlstudentreader"
password <- "PA6aX2gAhe4hE!ru$6atru"

## dbConnect()
Connects to an ODBC-compatible database. The required first parameter, drv, is the driver library that you want to use. In our case it will be odbc::odbc(), which means: call the odbc() function found in the odbc library.

dbConnect(drv, dsn = NULL, ..., timezone = "UTC",
  encoding = "", driver = NULL, server = NULL, database = NULL,
  uid = NULL, pwd = NULL, .connection_string = NULL)
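
The signature above also accepts a `.connection_string` argument as an alternative to passing the pieces individually. The following is a minimal sketch of how such a string could be assembled in base R. The helper name `build_connection_string` and all values passed to it are hypothetical placeholders, and the exact keywords a driver accepts may vary, so treat this as an illustration rather than the required approach.

```r
# Hypothetical helper: assemble an ODBC connection string from its parts.
build_connection_string <- function(driver, server, database, uid, pwd) {
  paste0(
    "Driver={", driver, "};",
    "Server=", server, ";",
    "Database=", database, ";",
    "UID=", uid, ";",
    "PWD=", pwd, ";"
  )
}

# Placeholder values for illustration only (not the course credentials)
conn_string <- build_connection_string(
  driver   = "ODBC Driver 13 for SQL Server",
  server   = "example-server",
  database = "ExampleDb",
  uid      = "example_user",
  pwd      = "example_password"
)
print(conn_string)

# With a reachable server, the string could then be passed as:
# conn <- DBI::dbConnect(odbc::odbc(), .connection_string = conn_string)
```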

In [5]:
# View help on dbConnect()
? dbConnect

In [7]:
# Connect to the database
# Store connection in conn variable
conn <- dbConnect(odbc::odbc(), 
                  driver = driver_name, 
                  server = server_name, 
                  database = database_name,
                  uid = user_id,
                  pwd = password)

# Print the connection object
print(conn)


Notice in the above output that it returned similar information to what you passed into dbConnect, namely the driver, server, and database information. You can additionally see the driver version.

## str()
Compactly displays the internal structure of an R object, a diagnostic function and an alternative to summary(). Ideally, only one line for each ‘basic’ structure is displayed. The idea is to give reasonable output for any R object. 

You can use str() on any R variable, which makes it universally useful.
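
As a quick illustration that str() is not specific to database connections, here is a small sketch using made-up objects. The names `connection_info` and `df_example` and their contents are invented for this example.

```r
# A small, made-up list standing in for any R object
connection_info <- list(
  driver   = "ODBC Driver 13 for SQL Server",
  port     = 1433L,
  readonly = TRUE
)

# One line per element: name, type, and value
str(connection_info)

# str() also works on data frames
df_example <- data.frame(id = 1:3, name = c("a", "b", "c"))
str(df_example)
```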

In [None]:
# View help on str()
?str

In [None]:
# Print connection details
# using str()
str(___)

Notice in the output above that str() provides additional details, including the DLL of the driver and the versions of various components. This may be useful in troubleshooting connection issues.

## dbListTables()
Lists remote tables found when using the established database connection.

dbListTables(conn, ...)

In [None]:
# View help on dbListTables()
?___

In [None]:
# List the first 50 table objects in the database,
# sorted alphabetically
# Hint: head(), sort()
conn %>% 
   ___() %>% 
   ___() %>% 
   ___(50)

Notice the output above lists tables along with other objects. This potentially could be a long list. Using another tool to explore the database as well as develop the SQL query might be useful. 

## dbListFields()
Lists field names of a remote table. Field names is another term for column names.

dbListFields(conn, name, ...)

In [None]:
# View help on dbListFields()
?___

In [None]:
# List the columns 
# for table: Customer
conn %>% 
   ___("___")

The above is the equivalent of the names() function, providing the column names of the table. This is useful for deciding which columns to export with our SELECT statement. Alternatively, we can use SELECT * to get them all and sort them out in R later.

## dbGetQuery()
Sends a query, retrieves the results, and then clears the result set. This function is for SELECT queries only. Some backends may support data manipulation statements through this function for compatibility reasons. However, callers are strongly advised to use dbExecute() for data manipulation statements.

dbGetQuery(conn, statement, ...)

In [None]:
# View help on dbGetQuery()
?___

In [None]:
# Create a SQL SELECT statement
# select the first 10 rows of data for all columns
# Table: Sales.Customer
# First store the query text in a variable sql_customer
# Hint: SQL TOP N
sql_customer <- 
"SELECT ___ 
FROM ___"

# Second, execute the query
# store result in df_customer
df_customer <- conn %>% 
   ___(___) %>%
   head(10)  # Optional if part of SQL query

# Glimpse result
glimpse(___)

* Notice that it is simple to execute SQL SELECT statements and return the results as a data frame. 
* Notice that the data types map well from SQL to R since R is less restrictive than SQL.  
* Notice that the SQL statement can be a variable and thus we can programmatically create and edit this code to create what is referred to as 'dynamic SQL'. This is where the power of programming comes in for automation as well as adjusting what is queried based upon other data or configurations.
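
The 'dynamic SQL' idea in the last point can be sketched in base R as follows. The helper `build_top_n_query` and the table names are hypothetical and chosen only for illustration. Because the table name is pasted directly into the SQL text, the sketch restricts it to a conservative set of characters; this is one possible safeguard, not a complete defense against SQL injection.

```r
# Hypothetical helper: build "SELECT TOP n * FROM schema.table"
build_top_n_query <- function(table_name, n) {
  # Allow only letters, digits, underscores, and dots in the table name,
  # since the name is pasted directly into the SQL text
  stopifnot(grepl("^[A-Za-z0-9_.]+$", table_name))
  stopifnot(is.numeric(n), length(n) == 1, n >= 1)
  sprintf("SELECT TOP %d * FROM %s", as.integer(n), table_name)
}

# Made-up table names; one query string is generated per table
tables <- c("dbo.ExampleOrders", "dbo.ExampleCustomers")
queries <- vapply(tables, build_top_n_query, character(1), n = 5)
print(unname(queries))
```

Each generated string could then be passed to dbGetQuery() in the same way as a hand-written query.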

## SQL SELECT 
The full capabilities of the SQL SELECT statement are available. It doesn't have to be limited to simple queries.

Let's try returning specific columns and filtering on some rows.

In [None]:
# Import Sales.SalesPerson
# returning only the first 5 columns 
# and top 10 rows
# with Bonus > 0
# sorted by CommissionPct from high to low
# Hint: select(1:5)
sql_sales_person <- 
"SELECT ___
FROM ___
WHERE ___
ORDER BY ___"

df_sales_person <- conn %>% 
   dbGetQuery(___)

# Glimpse result
glimpse(___)

Repeat the above task, but this time bring in all the data with the SQL statement and perform the manipulation in R.

In [None]:
# Import Sales.SalesPerson
# returning only the first 5 columns 
# and top 10 rows
# with Bonus > 0
# sorted by CommissionPct from high to low
# Hint: select(1:5), filter(), arrange(), desc(), head()
sql_sales_person <- 
"SELECT * 
FROM ___"

df_sales_person <- conn %>% 
   dbGetQuery(___) %>%
   select(___) %>%  
   filter(___) %>%  
   arrange(___) %>% 
   head(___)  

# Glimpse result
glimpse(___)

Are the results from using SQL SELECT and R identical? If not, why not?
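
One way to check this programmatically is to compare the two data frames directly. The sketch below uses small, made-up data frames in place of the real query results, and the helper name `compare_results` is hypothetical. It ignores row names but treats a different row order as a difference.

```r
# Two small, made-up data frames standing in for the SQL and R results
df_from_sql <- data.frame(id = c(3L, 1L, 2L), pct = c(0.02, 0.015, 0.01))
df_from_r   <- data.frame(id = c(3L, 1L, 2L), pct = c(0.02, 0.015, 0.01))

compare_results <- function(a, b) {
  # Drop row names so that only the contents are compared
  rownames(a) <- NULL
  rownames(b) <- NULL
  isTRUE(all.equal(a, b))
}

# Same contents in the same order
print(compare_results(df_from_sql, df_from_r))

# Same rows in a different order
print(compare_results(df_from_sql, df_from_r[3:1, ]))
```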

In [None]:
# Create a SELECT statement 
# that returns all rows
# for table Sales.SalesOrderDetail
# Store as sql_sales_order_detail
sql_sales_order_detail <- 
"SELECT ___
  FROM ___"

# Execute the query 
# Store the result in df_sales_order_detail
df_sales_order_detail <- conn %>%
   ___(___)

# Glimpse result
glimpse(___)

# Print top 5 results of the first 5 columns
# with OrderQty being the first column
# Hint: select(), everything(), arrange(), head()
df_sales_order_detail %>% 
   ___  # May need to use several functions, see hint above

## dbDisconnect()
Disconnect (close) a connection. This closes the connection, discards all pending work, and frees resources (e.g., memory, sockets).

dbDisconnect(conn, ...)
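
A common pattern is to register the disconnect as soon as the connection is opened, so that it runs even if a later step fails. The following sketch assumes the RSQLite package is installed and uses its in-memory database purely as a stand-in for SQL Server, so no server is needed. The function name `query_demo` and the `orders` table are invented for illustration.

```r
# Sketch: open, query, and always close a connection.
# Assumes the DBI and RSQLite packages; RSQLite's in-memory database
# is a stand-in for SQL Server so the example needs no server.
query_demo <- function() {
  if (!requireNamespace("DBI", quietly = TRUE) ||
      !requireNamespace("RSQLite", quietly = TRUE)) {
    message("DBI and RSQLite are needed for this demo; skipping.")
    return(NULL)
  }
  conn <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
  # Close the connection when the function exits, even after an error
  on.exit(DBI::dbDisconnect(conn), add = TRUE)

  # Made-up table for illustration
  DBI::dbWriteTable(conn, "orders",
                    data.frame(order_id = 1:3, order_qty = c(2L, 5L, 1L)))

  DBI::dbGetQuery(conn,
                  "SELECT order_id, order_qty FROM orders WHERE order_qty > 1")
}

result <- query_demo()
print(result)
```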

In [None]:
# View help on dbDisconnect()
?___

In [None]:
# Close the database connection
___(conn)

Notice it returned TRUE meaning the connection was successfully closed.

## Code Recap
Let's put all the code in one block to better see the pattern.

In [None]:
# Load libraries
library(___) # odbc
library(___) # DBI
library(___) # tidyverse

# Connection string info
driver_name <- "ODBC Driver 13 for SQL Server"
server_name <- "69.91.210.142" 
database_name <- "AdventureWorks2016CTP3" 
user_id <- "sqlstudentreader"
password <- "PA6aX2gAhe4hE!ru$6atru"

# Connect to the database
conn <- dbConnect(odbc::odbc(), 
                  driver = ___, 
                  server = ___, 
                  database = ___,
                  uid = ___,
                  pwd = ___)

# List the first 20 table objects in the database
conn %>% 
   ___() %>% 
   ___(20)

# List the columns 
# for table: Customer
# limit to the first 10 columns
conn %>% 
   ___("Customer") %>% 
   ___(10)

# Create a SELECT statement 
# that returns all rows
# for table Sales.SalesOrderDetail
# Store as sql_sales_order_detail
sql_sales_order_detail <- 
"SELECT ___
  FROM ___"

# Execute the query 
# Store the result in df_sales_order_detail
df_sales_order_detail <- ___ %>%
   ___(___)

# Glimpse result
glimpse(___)

# Print top 5 highest quantity orders and the first 5 columns
# with OrderQty being the first column
# Hint: select(), everything(), arrange(), head()
df_sales_order_detail %>% 
   ___  # See hint above

# Close the database connection
___(conn)

## Analyze Sales Orders
Once you import a table of data, it may be helpful to get a sense of the data from a graphical perspective. 

Let's plot the frequency of the order quantities.  
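
Before plotting, it can help to look at the counts that a bar plot of frequencies would display. The sketch below uses base R's table() on a made-up vector standing in for the OrderQty column; the values are invented and are not from AdventureWorks.

```r
# Made-up order quantities standing in for df_sales_order_detail$OrderQty
order_qty <- c(1, 1, 2, 1, 3, 2, 1, 4, 2, 1)

# Count how often each quantity occurs
qty_counts <- table(order_qty)
print(qty_counts)

# Base R equivalent of a simple bar plot of the counts
# barplot(qty_counts, xlab = "OrderQty", ylab = "Count")
```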

In [None]:
# Create a bar plot of OrderQty
df_sales_order_detail %>% 
   ggplot(aes(___)) + 
      geom____()

## Summary
The odbc and DBI packages enable you to execute SQL SELECT statements and have the results returned as data frames.