# Joins
These are the solutions to selected code blocks. You will have to run the code blocks to see what they do.

In [1]:
# Load libraries
library(odbc) # odbc
library(DBI) # DBI
library(tidyverse) # tidyverse

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.0.0     ✔ purrr   0.2.5
✔ tibble  1.4.2     ✔ dplyr   0.7.6
✔ tidyr   0.8.1     ✔ stringr 1.3.1
✔ readr   1.1.1     ✔ forcats 0.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


## Import several related SQL tables
Connect to the AdventureWorks SQL database and download some tables.

In [2]:
# Connection string info
# Already completed, just run the code block
# Everyone uses the same SQL credentials
driver_name <- "ODBC Driver 13 for SQL Server"
server_name <- "uwc-sqlserver.clients.uw.edu"
database_name <- "AdventureWorks2016CTP3" 
user_id <- "sqlstudentreader"
password <- "PA6aX2gAhe4hE!ru$6atru"

# Connect to the database
# Store connection in conn variable
conn <- dbConnect(odbc::odbc(), 
                  driver = driver_name, 
                  server = server_name, 
                  database = database_name,
                  uid = user_id,
                  pwd = password)

# Print the connection object
print(conn)

<OdbcConnection> sqlstudentreader@UWC-SQLSERVER
  Database: AdventureWorks2016CTP3
  Microsoft SQL Server Version: 13.00.4224


## Sales Order Tables

In [3]:
# Get Sales.SalesOrderHeader
sql_select <- "SELECT * FROM Sales.SalesOrderHeader"
df_sales_order_header <- conn %>% 
   dbGetQuery(sql_select)

# Get Sales.SalesOrderDetail
sql_select <- "SELECT * FROM Sales.SalesOrderDetail"
df_sales_order_detail <- conn %>% 
   dbGetQuery(sql_select)

# Glimpse results
glimpse(df_sales_order_header)
glimpse(df_sales_order_detail)

Observations: 31,465
Variables: 26
$ SalesOrderID           <int> 43659, 43660, 43661, 43662, 43663, 43664, 43...
$ RevisionNumber         <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,...
$ OrderDate              <dttm> 2011-05-31, 2011-05-31, 2011-05-31, 2011-05...
$ DueDate                <dttm> 2011-06-12, 2011-06-12, 2011-06-12, 2011-06...
$ ShipDate               <dttm> 2011-06-07, 2011-06-07, 2011-06-07, 2011-06...
$ Status                 <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,...
$ OnlineOrderFlag        <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA...
$ SalesOrderNumber       <chr> "SO43659", "SO43660", "SO43661", "SO43662", ...
$ PurchaseOrderNumber    <chr> "PO522145787", "PO18850127500", "PO184731896...
$ AccountNumber          <chr> "10-4020-000676", "10-4020-000117", "10-4020...
$ CustomerID             <int> 29825, 29672, 29734, 29994, 29565, 29898, 29...
$ SalesPersonID          <int> 279, 279, 282, 282, 276, 280, 283, 276, 277,...
$ TerritoryID    

## Distinct()

In [4]:
# Get the row count for 
# Sales.SalesOrderHeader
# Hint: nrow()
df_sales_order_header %>% 
   nrow()

# Get the distinct row count for 
# Sales.SalesOrderHeader/SalesOrderID
# Hint: distinct() and nrow()
df_sales_order_header %>% 
   distinct(SalesOrderID) %>% 
   nrow()

Notice that the overall row count and the distinct row count for SalesOrderID are the same. A primary key must identify each row uniquely, so this is required for SalesOrderID to serve as the primary key.
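The uniqueness check above can also be written as a single comparison. This is a minimal sketch on a hypothetical toy table (`orders` and its columns are made up for illustration, not taken from AdventureWorks); it assumes only that dplyr is loaded.

```r
library(dplyr)

# Hypothetical toy table standing in for a header table
orders <- data.frame(
  order_id = c(1L, 2L, 3L),
  customer = c("a", "b", "a"),
  stringsAsFactors = FALSE
)

# A column can act as a primary key only if no value repeats
is_unique_key <- nrow(orders) == n_distinct(orders$order_id)
is_unique_key

# Counting rows per key value lists any offending duplicates
duplicate_keys <- orders %>%
  count(order_id) %>%
  filter(n > 1)
nrow(duplicate_keys)
```

If `is_unique_key` comes back `FALSE`, the `duplicate_keys` table shows which values repeat.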

In [5]:
# Get the row count for 
# Sales.SalesOrderDetail
# Hint: nrow()
df_sales_order_detail %>% 
   nrow()

# Get the distinct row count for 
# Sales.SalesOrderDetail/SalesOrderID
# Hint: distinct() and nrow()
df_sales_order_detail %>% 
   distinct(SalesOrderID) %>% 
   nrow()

Notice that the overall row count is much larger than the number of distinct SalesOrderID values. This is because SalesOrderID is a foreign key here, and each value can be used more than once. The grain of the sales order *detail* table is one row per sales transaction AND product ID, which allows it to store the quantity of each product purchased. In contrast, the sales order *header* table contains neither the product ID nor the quantity purchased.

Also notice that the number of distinct SalesOrderID values in the detail table matches the number of rows in the sales order header table. This is a sign of clean data, consistent with SQL referential integrity being enforced: a foreign key constraint ensures there can be no detail row without a header row. The matching counts also show that, in this data, there is no header row without a detail row.
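The two directions of that check can be tested separately. The sketch below uses hypothetical toy tables (`header`, `detail`, and their column names are illustrative) and base R only.

```r
# Hypothetical toy tables: one header row per order, many detail rows per order
header <- data.frame(order_id = c(1L, 2L, 3L))
detail <- data.frame(
  order_id   = c(1L, 1L, 2L, 3L, 3L, 3L),
  product_id = c(10L, 11L, 10L, 12L, 13L, 14L)
)

# Every detail row points at an existing header row (no orphaned detail rows)
no_orphan_details <- all(detail$order_id %in% header$order_id)

# Every header row has at least one detail row (no empty orders)
no_empty_orders <- all(header$order_id %in% detail$order_id)

# The detail grain in this toy data: one row per order and product
detail_grain_ok <- nrow(detail) ==
  nrow(unique(detail[, c("order_id", "product_id")]))

c(no_orphan_details, no_empty_orders, detail_grain_ok)
```

Only the first condition is what a foreign key constraint enforces. The second is a property of the data itself.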

## Joins
Join two tables together. 

### Usage
inner_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...)

left_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...)

right_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...)

full_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...)

semi_join(x, y, by = NULL, copy = FALSE, ...)

anti_join(x, y, by = NULL, copy = FALSE, ...)
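To see how the six functions differ, it can help to compare row counts on a tiny example before running them on the real tables. This sketch uses two hypothetical data frames, `x` and `y`, that share a `key` column; the names and values are made up for illustration.

```r
library(dplyr)

# Hypothetical toy tables: key 1 exists only in x, key 4 only in y,
# and key 3 appears twice in y
x <- data.frame(key = c(1L, 2L, 3L),
                x_val = c("x1", "x2", "x3"),
                stringsAsFactors = FALSE)
y <- data.frame(key = c(2L, 3L, 3L, 4L),
                y_val = c("y2", "y3a", "y3b", "y4"),
                stringsAsFactors = FALSE)

# Row count produced by each join type
join_rows <- c(
  inner = nrow(inner_join(x, y, by = "key")),  # matches only
  left  = nrow(left_join(x, y, by = "key")),   # all of x, NA where y is missing
  right = nrow(right_join(x, y, by = "key")),  # all of y, NA where x is missing
  full  = nrow(full_join(x, y, by = "key")),   # everything from both sides
  semi  = nrow(semi_join(x, y, by = "key")),   # rows of x that have a match
  anti  = nrow(anti_join(x, y, by = "key"))    # rows of x that have no match
)
join_rows
# inner 3, left 4, right 4, full 5, semi 2, anti 1
```

The first four add columns from `y` to the result, while `semi_join()` and `anti_join()` only filter the rows of `x`.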



### inner_join

In [6]:
# Join Sales.SalesOrderHeader and Sales.SalesOrderDetail
# using inner_join() 
# using automatic join column (omit the 'by' parameter)
# store as df_sales_order_inner
df_sales_order_inner <- df_sales_order_header %>%
   inner_join(df_sales_order_detail)

# Glimpse result
glimpse(df_sales_order_inner)

Joining, by = c("SalesOrderID", "rowguid", "ModifiedDate")


Observations: 0
Variables: 34
$ SalesOrderID           <int> 
$ RevisionNumber         <int> 
$ OrderDate              <dttm> 
$ DueDate                <dttm> 
$ ShipDate               <dttm> 
$ Status                 <int> 
$ OnlineOrderFlag        <lgl> 
$ SalesOrderNumber       <chr> 
$ PurchaseOrderNumber    <chr> 
$ AccountNumber          <chr> 
$ CustomerID             <int> 
$ SalesPersonID          <int> 
$ TerritoryID            <int> 
$ BillToAddressID        <int> 
$ ShipToAddressID        <int> 
$ ShipMethodID           <int> 
$ CreditCardID           <int> 
$ CreditCardApprovalCode <chr> 
$ CurrencyRateID         <int> 
$ SubTotal               <dbl> 
$ TaxAmt                 <dbl> 
$ Freight                <dbl> 
$ TotalDue               <dbl> 
$ Comment                <chr> 
$ rowguid                <chr> 
$ ModifiedDate           <dttm> 
$ SalesOrderDetailID     <int> 
$ CarrierTrackingNumber  <chr> 
$ OrderQty               <int> 
$ ProductID              <int> 
$ Spec

### left_join

In [7]:
# Join Sales.SalesOrderHeader and Sales.SalesOrderDetail
# using left_join() 
# using automatic join column (omit the 'by' parameter)
# store as df_sales_order_left
df_sales_order_left <- df_sales_order_header %>%
   left_join(df_sales_order_detail)

# Glimpse result
glimpse(df_sales_order_left)

Joining, by = c("SalesOrderID", "rowguid", "ModifiedDate")


Observations: 31,465
Variables: 34
$ SalesOrderID           <int> 43659, 43660, 43661, 43662, 43663, 43664, 43...
$ RevisionNumber         <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,...
$ OrderDate              <dttm> 2011-05-31, 2011-05-31, 2011-05-31, 2011-05...
$ DueDate                <dttm> 2011-06-12, 2011-06-12, 2011-06-12, 2011-06...
$ ShipDate               <dttm> 2011-06-07, 2011-06-07, 2011-06-07, 2011-06...
$ Status                 <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,...
$ OnlineOrderFlag        <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA...
$ SalesOrderNumber       <chr> "SO43659", "SO43660", "SO43661", "SO43662", ...
$ PurchaseOrderNumber    <chr> "PO522145787", "PO18850127500", "PO184731896...
$ AccountNumber          <chr> "10-4020-000676", "10-4020-000117", "10-4020...
$ CustomerID             <int> 29825, 29672, 29734, 29994, 29565, 29898, 29...
$ SalesPersonID          <int> 279, 279, 282, 282, 276, 280, 283, 276, 277,...
$ TerritoryID    

### right_join

In [8]:
# Join Sales.SalesOrderHeader and Sales.SalesOrderDetail
# using right_join() 
# using automatic join column (omit the 'by' parameter)
# store as df_sales_order_right
df_sales_order_right <- df_sales_order_header %>%
   right_join(df_sales_order_detail)

# Glimpse result
glimpse(df_sales_order_right)

Joining, by = c("SalesOrderID", "rowguid", "ModifiedDate")


Observations: 121,317
Variables: 34
$ SalesOrderID           <int> 43659, 43659, 43659, 43659, 43659, 43659, 43...
$ RevisionNumber         <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
$ OrderDate              <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ DueDate                <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ ShipDate               <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ Status                 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
$ OnlineOrderFlag        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
$ SalesOrderNumber       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
$ PurchaseOrderNumber    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
$ AccountNumber          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
$ CustomerID             <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
$ SalesPersonID          <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
$ TerritoryID   

### full_join

In [9]:
# Join Sales.SalesOrderHeader and Sales.SalesOrderDetail
# using full_join() 
# using automatic join column (omit the 'by' parameter)
# store as df_sales_order_full
df_sales_order_full <- df_sales_order_header %>%
   full_join(df_sales_order_detail)

# Glimpse result
glimpse(df_sales_order_full)

Joining, by = c("SalesOrderID", "rowguid", "ModifiedDate")


Observations: 152,782
Variables: 34
$ SalesOrderID           <int> 43659, 43660, 43661, 43662, 43663, 43664, 43...
$ RevisionNumber         <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,...
$ OrderDate              <dttm> 2011-05-31, 2011-05-31, 2011-05-31, 2011-05...
$ DueDate                <dttm> 2011-06-12, 2011-06-12, 2011-06-12, 2011-06...
$ ShipDate               <dttm> 2011-06-07, 2011-06-07, 2011-06-07, 2011-06...
$ Status                 <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,...
$ OnlineOrderFlag        <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA...
$ SalesOrderNumber       <chr> "SO43659", "SO43660", "SO43661", "SO43662", ...
$ PurchaseOrderNumber    <chr> "PO522145787", "PO18850127500", "PO184731896...
$ AccountNumber          <chr> "10-4020-000676", "10-4020-000117", "10-4020...
$ CustomerID             <int> 29825, 29672, 29734, 29994, 29565, 29898, 29...
$ SalesPersonID          <int> 279, 279, 282, 282, 276, 280, 283, 276, 277,...
$ TerritoryID   

## by
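There were no matches because, by default, the join functions join on every column name the two tables have in common. Here that meant rowguid and ModifiedDate as well as SalesOrderID, and the values of those metadata columns do not line up between the two tables. We can specify the by parameter to override this default behavior.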
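One way to predict which columns a join will pick when by is omitted is to intersect the column names yourself. The sketch below uses hypothetical toy tables that mimic the shared rowguid column; the values are made up.

```r
library(dplyr)

# Hypothetical toy tables that share a key column and a metadata column
header <- data.frame(
  SalesOrderID = c(1L, 2L),
  rowguid      = c("h-1", "h-2"),
  stringsAsFactors = FALSE
)
detail <- data.frame(
  SalesOrderID = c(1L, 1L, 2L),
  rowguid      = c("d-1", "d-2", "d-3"),
  stringsAsFactors = FALSE
)

# Columns a join would use when 'by' is omitted
common_cols <- intersect(names(header), names(detail))
common_cols

# Joining on every shared column finds nothing, because rowguid never agrees
natural <- inner_join(header, detail, by = common_cols)

# Joining on the real key gives one row per detail row
keyed <- inner_join(header, detail, by = "SalesOrderID")

c(natural = nrow(natural), keyed = nrow(keyed))
```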

There are three forms of syntax for by:
1. by = "column name" - use this when a single column has the same name in both tables
2. by = c("col_1", "col_2") - use this when multiple columns have the same names in both tables (composite keys)
3. by = c("col_1_left" = "col_1_right", "col_2_left" = "col_2_right") - use this when the names don't match between the two tables. It also supports composite keys.
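The three forms listed above can be sketched on small hypothetical tables. `orders`, `status`, `notes`, and `items` and their columns are invented for illustration only.

```r
library(dplyr)

# Hypothetical toy tables
orders <- data.frame(order_id = c(1L, 1L, 2L),
                     line     = c(1L, 2L, 1L),
                     qty      = c(5L, 1L, 3L))
status <- data.frame(order_id = c(1L, 2L),
                     state    = c("open", "shipped"),
                     stringsAsFactors = FALSE)
notes  <- data.frame(order_id = c(1L, 2L),
                     line     = c(2L, 1L),
                     note     = c("gift", "rush"),
                     stringsAsFactors = FALSE)
items  <- data.frame(id    = c(1L, 2L),
                     owner = c("ann", "bob"),
                     stringsAsFactors = FALSE)

# 1. One column with the same name in both tables
by_single <- left_join(orders, status, by = "order_id")

# 2. Several columns with the same names (composite key)
by_composite <- left_join(orders, notes, by = c("order_id", "line"))

# 3. Differently named columns: left name = right name
by_renamed <- left_join(orders, items, by = c("order_id" = "id"))

names(by_renamed)  # the key keeps its name from the left table
```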

In [10]:
# Join Sales.SalesOrderHeader and Sales.SalesOrderDetail
# using inner_join() 
# using by = "SalesOrderID"
# store as df_sales_order
df_sales_order <- df_sales_order_header %>%
   inner_join(df_sales_order_detail, by = "SalesOrderID")

# Glimpse result
glimpse(df_sales_order)

Observations: 121,317
Variables: 36
$ SalesOrderID           <int> 43659, 43659, 43659, 43659, 43659, 43659, 43...
$ RevisionNumber         <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,...
$ OrderDate              <dttm> 2011-05-31, 2011-05-31, 2011-05-31, 2011-05...
$ DueDate                <dttm> 2011-06-12, 2011-06-12, 2011-06-12, 2011-06...
$ ShipDate               <dttm> 2011-06-07, 2011-06-07, 2011-06-07, 2011-06...
$ Status                 <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,...
$ OnlineOrderFlag        <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA...
$ SalesOrderNumber       <chr> "SO43659", "SO43659", "SO43659", "SO43659", ...
$ PurchaseOrderNumber    <chr> "PO522145787", "PO522145787", "PO522145787",...
$ AccountNumber          <chr> "10-4020-000676", "10-4020-000676", "10-4020...
$ CustomerID             <int> 29825, 29825, 29825, 29825, 29825, 29825, 29...
$ SalesPersonID          <int> 279, 279, 279, 279, 279, 279, 279, 279, 279,...
$ TerritoryID   

## Renaming conflicting columns

In [11]:
# Rename the conflicting .x and .y columns
# using the naming convention <table>_<column>
# update df_sales_order
df_sales_order <- df_sales_order %>%
   rename(SalesOrderHeader_rowguid = rowguid.x,
         SalesOrderHeader_ModifiedDate = ModifiedDate.x,
         SalesOrderDetail_rowguid = rowguid.y,
         SalesOrderDetail_ModifiedDate = ModifiedDate.y)

# Glimpse result
glimpse(df_sales_order)

Observations: 121,317
Variables: 36
$ SalesOrderID                  <int> 43659, 43659, 43659, 43659, 43659, 43...
$ RevisionNumber                <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8...
$ OrderDate                     <dttm> 2011-05-31, 2011-05-31, 2011-05-31, ...
$ DueDate                       <dttm> 2011-06-12, 2011-06-12, 2011-06-12, ...
$ ShipDate                      <dttm> 2011-06-07, 2011-06-07, 2011-06-07, ...
$ Status                        <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
$ OnlineOrderFlag               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FA...
$ SalesOrderNumber              <chr> "SO43659", "SO43659", "SO43659", "SO4...
$ PurchaseOrderNumber           <chr> "PO522145787", "PO522145787", "PO5221...
$ AccountNumber                 <chr> "10-4020-000676", "10-4020-000676", "...
$ CustomerID                    <int> 29825, 29825, 29825, 29825, 29825, 29...
$ SalesPersonID                 <int> 279, 279, 279, 279, 279, 279, 279, 27...
$ TerritoryID   
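The .x and .y endings come from the suffix parameter shown in the Usage section, which is applied to non-key columns that exist in both tables. As an alternative to renaming afterwards, you can set suffix in the join itself. This is a sketch on hypothetical toy tables. Note that the suffix is appended, so the names come out as <column>_<table> rather than <table>_<column>.

```r
library(dplyr)

# Hypothetical toy tables that both carry a ModifiedDate column
header <- data.frame(order_id     = c(1L, 2L),
                     ModifiedDate = c("2011-06-07", "2011-06-08"),
                     stringsAsFactors = FALSE)
detail <- data.frame(order_id     = c(1L, 1L, 2L),
                     ModifiedDate = c("2011-05-31", "2011-05-31", "2011-06-01"),
                     stringsAsFactors = FALSE)

# Default suffixes mark the left (.x) and right (.y) table
default_names <- names(inner_join(header, detail, by = "order_id"))
default_names
# "order_id" "ModifiedDate.x" "ModifiedDate.y"

# Custom suffixes make the origin of each column explicit
custom_names <- names(
  inner_join(header, detail, by = "order_id",
             suffix = c("_header", "_detail"))
)
custom_names
# "order_id" "ModifiedDate_header" "ModifiedDate_detail"
```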

## Single vs Multiple Dataframes
You can continue to join tables and create a single table at the lowest data grain with a large number of columns. This is useful for machine learning algorithms that require all of their input data in a single table.

For human analysis, however, it might be more effective to keep several data frames that you can easily join together as required. Consider a data warehouse star schema as a guideline for how to design your data frames.

In [12]:
# Get Production.Product
# exclude metadata columns: rowguid, ModifiedDate
sql_select <- "SELECT * FROM Production.Product"
df_product <- conn %>% 
   dbGetQuery(sql_select) %>% 
   select(-rowguid, -ModifiedDate)

# Get Sales.SalesTerritory
# exclude metadata columns: rowguid, ModifiedDate
sql_select <- "SELECT * FROM Sales.SalesTerritory"
df_sales_territory <- conn %>% 
   dbGetQuery(sql_select) %>% 
   select(-rowguid, -ModifiedDate)

# Glimpse results
glimpse(df_product)
glimpse(df_sales_territory)

Observations: 504
Variables: 23
$ ProductID             <int> 1, 2, 3, 4, 316, 317, 318, 319, 320, 321, 322...
$ Name                  <chr> "Adjustable Race", "Bearing Ball", "BB Ball B...
$ ProductNumber         <chr> "AR-5381", "BA-8327", "BE-2349", "BE-2908", "...
$ MakeFlag              <lgl> FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE...
$ FinishedGoodsFlag     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL...
$ Color                 <chr> NA, NA, NA, NA, NA, "Black", "Black", "Black"...
$ SafetyStockLevel      <int> 1000, 1000, 800, 800, 800, 500, 500, 500, 100...
$ ReorderPoint          <int> 750, 750, 600, 600, 600, 375, 375, 375, 750, ...
$ StandardCost          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ ListPrice             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ Size                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ SizeUnitMeasureCode   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ WeightUnitMeasureC

### left_join

In [13]:
# left join df_sales_order with
# df_product and df_sales_territory
# store df_sales_order_complete
# do not specify the by parameter
df_sales_order_complete <- df_sales_order %>% 
   left_join(df_product) %>% 
   left_join(df_sales_territory)

# Glimpse result
glimpse(df_sales_order_complete)

Joining, by = "ProductID"
Joining, by = c("TerritoryID", "Name")


Observations: 121,317
Variables: 64
$ SalesOrderID                  <int> 43659, 43659, 43659, 43659, 43659, 43...
$ RevisionNumber                <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8...
$ OrderDate                     <dttm> 2011-05-31, 2011-05-31, 2011-05-31, ...
$ DueDate                       <dttm> 2011-06-12, 2011-06-12, 2011-06-12, ...
$ ShipDate                      <dttm> 2011-06-07, 2011-06-07, 2011-06-07, ...
$ Status                        <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
$ OnlineOrderFlag               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FA...
$ SalesOrderNumber              <chr> "SO43659", "SO43659", "SO43659", "SO4...
$ PurchaseOrderNumber           <chr> "PO522145787", "PO522145787", "PO5221...
$ AccountNumber                 <chr> "10-4020-000676", "10-4020-000676", "...
$ CustomerID                    <int> 29825, 29825, 29825, 29825, 29825, 29...
$ SalesPersonID                 <int> 279, 279, 279, 279, 279, 279, 279, 27...
$ TerritoryID   

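Notice the messages printed because the by parameter was omitted. Were these the join columns you expected? The second join used both TerritoryID and Name, because after the first join both tables contain a Name column, and a product name is not a territory name.

These messages are a convenient way to help determine the by parameter, and they alert you to potential issues.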
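The effect of an accidentally shared column can be reproduced on a small scale. The tables below are hypothetical stand-ins (the values are made up, and only the column names echo the real tables). The sketch checks which columns the second join would pick up and what that does to the territory columns.

```r
library(dplyr)

# Hypothetical toy tables; product and territory both have a Name column
sales <- data.frame(ProductID = c(1L, 2L), TerritoryID = c(10L, 20L))
product <- data.frame(ProductID = c(1L, 2L),
                      Name      = c("Bike", "Helmet"),
                      stringsAsFactors = FALSE)
territory <- data.frame(TerritoryID = c(10L, 20L),
                        Name        = c("Northwest", "Canada"),
                        Group       = c("North America", "North America"),
                        stringsAsFactors = FALSE)

with_product <- left_join(sales, product, by = "ProductID")

# Columns the next join would use if 'by' were omitted
auto_by <- intersect(names(with_product), names(territory))
auto_by

# Product names never equal territory names, so nothing matches
accidental <- left_join(with_product, territory, by = auto_by)
all(is.na(accidental$Group))

# Naming the key explicitly restores the intended match
intended <- left_join(with_product, territory, by = "TerritoryID")
anyNA(intended$Group)
```

With an explicit by, the two Name columns are kept apart with the .x and .y suffixes instead of being used as keys.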

### left_join and rename

In [14]:
# left join df_sales_order with
# df_product and df_sales_territory
# rename the Name columns to <table>_Name before joining
# store df_sales_order_complete
# do not specify the by parameter
df_sales_order_complete <- df_sales_order %>% 
   left_join(df_product %>% rename(Product_Name = Name)) %>% 
   left_join(df_sales_territory %>% rename(SalesTerritory_Name = Name))

# Glimpse result
glimpse(df_sales_order_complete)

Joining, by = "ProductID"
Joining, by = "TerritoryID"


Observations: 121,317
Variables: 65
$ SalesOrderID                  <int> 43659, 43659, 43659, 43659, 43659, 43...
$ RevisionNumber                <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8...
$ OrderDate                     <dttm> 2011-05-31, 2011-05-31, 2011-05-31, ...
$ DueDate                       <dttm> 2011-06-12, 2011-06-12, 2011-06-12, ...
$ ShipDate                      <dttm> 2011-06-07, 2011-06-07, 2011-06-07, ...
$ Status                        <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
$ OnlineOrderFlag               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FA...
$ SalesOrderNumber              <chr> "SO43659", "SO43659", "SO43659", "SO4...
$ PurchaseOrderNumber           <chr> "PO522145787", "PO522145787", "PO5221...
$ AccountNumber                 <chr> "10-4020-000676", "10-4020-000676", "...
$ CustomerID                    <int> 29825, 29825, 29825, 29825, 29825, 29...
$ SalesPersonID                 <int> 279, 279, 279, 279, 279, 279, 279, 27...
$ TerritoryID   

## semi_join

In [15]:
# semi join df_sales_order
# with df_sales_territory
# glimpse result, do not store in variable
df_sales_order %>% semi_join(df_sales_territory) %>% 
   glimpse()

Joining, by = "TerritoryID"


Observations: 121,317
Variables: 36
$ SalesOrderID                  <int> 43659, 43659, 43659, 43659, 43659, 43...
$ RevisionNumber                <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8...
$ OrderDate                     <dttm> 2011-05-31, 2011-05-31, 2011-05-31, ...
$ DueDate                       <dttm> 2011-06-12, 2011-06-12, 2011-06-12, ...
$ ShipDate                      <dttm> 2011-06-07, 2011-06-07, 2011-06-07, ...
$ Status                        <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
$ OnlineOrderFlag               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FA...
$ SalesOrderNumber              <chr> "SO43659", "SO43659", "SO43659", "SO4...
$ PurchaseOrderNumber           <chr> "PO522145787", "PO522145787", "PO5221...
$ AccountNumber                 <chr> "10-4020-000676", "10-4020-000676", "...
$ CustomerID                    <int> 29825, 29825, 29825, 29825, 29825, 29...
$ SalesPersonID                 <int> 279, 279, 279, 279, 279, 279, 279, 27...
$ TerritoryID   

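Is the row count what you expected? Are any SalesTerritory columns part of the result?

The semi join found a matching TerritoryID for every row, so no rows were dropped. This is an indication of clean source data. A semi join only filters rows, so the result contains no SalesTerritory columns.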
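A small sketch can make the difference from inner_join concrete: semi_join keeps only the columns of the first table and never duplicates its rows, even when the second table has repeated keys. The tables here are hypothetical toy data.

```r
library(dplyr)

# Hypothetical toy tables; territory 3 has no entry in the lookup
orders <- data.frame(OrderID = 1:3, TerritoryID = c(1L, 2L, 3L))
territory <- data.frame(TerritoryID = c(1L, 2L),
                        Region      = c("Northwest", "Canada"),
                        stringsAsFactors = FALSE)

# Repeat every territory row to simulate duplicated keys
territory_dup <- bind_rows(territory, territory)

kept <- semi_join(orders, territory_dup, by = "TerritoryID")
names(kept)  # only the columns of orders
nrow(kept)   # each matching order appears once

expanded <- inner_join(orders, territory_dup, by = "TerritoryID")
nrow(expanded)  # each matching order appears once per matching territory row
```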

## anti_join

In [16]:
# anti join df_sales_order
# with df_sales_territory
# glimpse result, do not store in variable
df_sales_order %>% anti_join(df_sales_territory) %>% 
   glimpse()

Joining, by = "TerritoryID"


Observations: 0
Variables: 36
$ SalesOrderID                  <int> 
$ RevisionNumber                <int> 
$ OrderDate                     <dttm> 
$ DueDate                       <dttm> 
$ ShipDate                      <dttm> 
$ Status                        <int> 
$ OnlineOrderFlag               <lgl> 
$ SalesOrderNumber              <chr> 
$ PurchaseOrderNumber           <chr> 
$ AccountNumber                 <chr> 
$ CustomerID                    <int> 
$ SalesPersonID                 <int> 
$ TerritoryID                   <int> 
$ BillToAddressID               <int> 
$ ShipToAddressID               <int> 
$ ShipMethodID                  <int> 
$ CreditCardID                  <int> 
$ CreditCardApprovalCode        <chr> 
$ CurrencyRateID                <int> 
$ SubTotal                      <dbl> 
$ TaxAmt                        <dbl> 
$ Freight                       <dbl> 
$ TotalDue                      <dbl> 
$ Comment                       <chr> 
$ SalesOrderHeader_rowguid     

Are the rows returned what you expected?

anti_join is the opposite of semi_join: it keeps the rows of the first table that have no match in the second. Since semi_join matched all the rows, anti_join returns 0 rows.
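Because the two functions are complementary, together they split the first table into matched and unmatched rows. This sketch checks that on hypothetical toy data in which some rows do fail to match.

```r
library(dplyr)

# Hypothetical toy tables; territories 3 and 4 are missing from the lookup
orders <- data.frame(OrderID = 1:5, TerritoryID = c(1L, 1L, 2L, 3L, 4L))
territory <- data.frame(TerritoryID = c(1L, 2L))

matched   <- semi_join(orders, territory, by = "TerritoryID")
unmatched <- anti_join(orders, territory, by = "TerritoryID")

# Every row of orders lands in exactly one of the two results
nrow(matched) + nrow(unmatched) == nrow(orders)
unmatched$OrderID
```

Running anti_join against a lookup table is a quick way to list the rows that would be lost in an inner join.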

### semi_join and filter

In [17]:
# semi join df_sales_order
# with df_sales_territory
# inline filtered on CountryRegionCode == "US"
# glimpse result, do not store in variable
df_sales_order %>% semi_join(df_sales_territory %>% filter(CountryRegionCode == "US")) %>% 
   glimpse()

Joining, by = "TerritoryID"


Observations: 60,153
Variables: 36
$ SalesOrderID                  <int> 43659, 43659, 43659, 43659, 43659, 43...
$ RevisionNumber                <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8...
$ OrderDate                     <dttm> 2011-05-31, 2011-05-31, 2011-05-31, ...
$ DueDate                       <dttm> 2011-06-12, 2011-06-12, 2011-06-12, ...
$ ShipDate                      <dttm> 2011-06-07, 2011-06-07, 2011-06-07, ...
$ Status                        <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
$ OnlineOrderFlag               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FA...
$ SalesOrderNumber              <chr> "SO43659", "SO43659", "SO43659", "SO4...
$ PurchaseOrderNumber           <chr> "PO522145787", "PO522145787", "PO5221...
$ AccountNumber                 <chr> "10-4020-000676", "10-4020-000676", "...
$ CustomerID                    <int> 29825, 29825, 29825, 29825, 29825, 29...
$ SalesPersonID                 <int> 279, 279, 279, 279, 279, 279, 279, 27...
$ TerritoryID    

Is the row count what you expected? 

There are fewer rows in the result because only sales in US territories are kept.
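A semi join against a filtered table behaves like filtering on a set of allowed key values. The sketch below compares the two approaches on hypothetical toy tables; the values are made up and only the column names echo the real ones.

```r
library(dplyr)

# Hypothetical toy tables
orders <- data.frame(OrderID = 1:5, TerritoryID = c(1L, 2L, 3L, 1L, 3L))
territory <- data.frame(TerritoryID       = 1:3,
                        CountryRegionCode = c("US", "CA", "US"),
                        stringsAsFactors  = FALSE)

# Approach 1: collect the allowed keys, then filter with %in%
us_ids <- territory %>%
  filter(CountryRegionCode == "US") %>%
  pull(TerritoryID)
via_filter <- orders %>% filter(TerritoryID %in% us_ids)

# Approach 2: semi join against the filtered lookup table
via_semi <- orders %>%
  semi_join(territory %>% filter(CountryRegionCode == "US"),
            by = "TerritoryID")

identical(via_filter$OrderID, via_semi$OrderID)
```

The semi join form extends naturally to composite keys, where a single `%in%` test is not enough.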

### anti_join and filter

In [18]:
# anti join df_sales_order
# with df_sales_territory
# inline filtered on Group == "North America"
# glimpse result, do not store in variable
df_sales_order %>% anti_join(df_sales_territory %>% filter(Group == "North America")) %>% 
   glimpse()

Joining, by = "TerritoryID"


Observations: 42,100
Variables: 36
$ SalesOrderID                  <int> 43698, 43701, 43703, 43704, 43705, 43...
$ RevisionNumber                <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8...
$ OrderDate                     <dttm> 2011-05-31, 2011-05-31, 2011-06-01, ...
$ DueDate                       <dttm> 2011-06-12, 2011-06-12, 2011-06-13, ...
$ ShipDate                      <dttm> 2011-06-07, 2011-06-07, 2011-06-08, ...
$ Status                        <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
$ OnlineOrderFlag               <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, T...
$ SalesOrderNumber              <chr> "SO43698", "SO43701", "SO43703", "SO4...
$ PurchaseOrderNumber           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ AccountNumber                 <chr> "10-4030-028389", "10-4030-011003", "...
$ CustomerID                    <int> 28389, 11003, 16624, 11005, 11011, 20...
$ SalesPersonID                 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ TerritoryID    

Is the row count what you expected? 

These results are the sales from all territory groups except for North America.

## Close SQL connection

In [19]:
# Close SQL connection
conn %>% dbDisconnect()

## Summary
Joining becomes a common data wrangling task as you accumulate more and more data frames. You use joins to organize the data into a design that makes your analysis more agile. You also use joins to explore and filter the data as part of data exploration.