# Combining Data

First, install the required R packages if you have not done so already. See [Installing Required R Packages](../00_Installing_Required_R_Packages.ipynb).

Second, be sure to run all notebooks from the **Accessing and Exploring Data** section of **Addressing the Use Case with the R language** before running this notebook:

[01_Accessing_and_Reading_Local_Files.ipynb](01_Accessing_and_Reading_Local_Files.ipynb)

[02_Accessing_and_Reading_Data_Lake_Files.ipynb](02_Accessing_and_Reading_Data_Lake_Files.ipynb)

[03_Accessing_and_Reading_Database-Data_Lakehouse_Data.ipynb](03_Accessing_and_Reading_Database-Data_Lakehouse_Data.ipynb)


## Load any necessary packages

In [1]:
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


## Load data created in data access notebooks

In [5]:
load("../04_01_Accessing_and_Exploring_Data/01_Accessing_and_Reading_Local_Files.RData")
load("../04_01_Accessing_and_Exploring_Data/02_Accessing_and_Reading_Data_Lake_Files.RData")
load("../04_01_Accessing_and_Exploring_Data/03_Accessing_and_Reading_Database-Data_Lakehouse_Data.RData")

## Joining the Data

In [6]:
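# Start from the churn table and attach customer attributes,
# dropping each join key once it is no longer needed
df <- cust_churn_df %>%
  inner_join(customers_df, by = "custId") %>%
  select(-custId) %>%
  # Attach subscription details
  inner_join(subscriptions_df, by = "customerSubscrCode") %>%
  select(-customerSubscrCode) %>%
  # Attach technical support evaluations
  inner_join(techsupportevals_df, by = "ID") %>%
  # Left join: keep rows that have no matching review
  left_join(reviews_df, by = "reviewId") %>%
  select(-reviewId)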
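
The pipeline above mixes `inner_join()` and `left_join()`. As a minimal sketch of the difference, the example below uses two small hypothetical tables (`left_tbl`, `right_tbl`, and their values are made up for illustration and are not part of the use case data). An inner join keeps only rows whose key appears in both tables, while a left join keeps every row of the left table and fills the unmatched columns with `NA`.

```r
library(dplyr)

# Hypothetical toy tables, for illustration only
left_tbl <- data.frame(reviewId = c(1, 2, 3), techSupportEval = c(3, 2, 2))
right_tbl <- data.frame(reviewId = c(1, 3), Title = c("Great", "Slow"))

# Inner join: only keys present in both tables survive
inner_result <- inner_join(left_tbl, right_tbl, by = "reviewId")

# Left join: all rows of left_tbl survive; unmatched Title values become NA
left_result <- left_join(left_tbl, right_tbl, by = "reviewId")

nrow(inner_result)             # 2
nrow(left_result)              # 3
sum(is.na(left_result$Title))  # 1
```

Comparing row counts before and after each join, as with `dim()` in the following cells, is one simple way to confirm that a join kept the rows you expected.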

In [7]:
dim(reviews_df)

In [8]:
dim(df)
head(df)

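| ID | LostCustomer | regionPctCustomers | numOfTotalReturns | wksSinceLastPurch | basktPurchCount12Month | LastPurchaseAmount | AvgPurchaseAmount12 | AvgPurchaseAmountTotal | intAdExposureCount12 | ⋯ | wksSinceFirstPurch | DemHomeOwnerCode | customerGender | EstimatedIncome | regionMedHomeVal | birthDate | customerSubscrStat | techSupportEval | Review_Text | Title |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | ⋯ | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<date>` | `<chr>` | `<dbl>` | `<chr>` | `<chr>` |
| 9155 | 0 | 43 | 0 | 14 | 10 | 50 | 0.0 | 55.65 | 27 | ⋯ | 89 | U | F | 93000 | 59280 | 2004-11-13 | Platinum | 3 | | |
| 9160 | 0 | 19 | 1 | 19 | 3 | 50 | 62.5 | 61.9 | 13 | ⋯ | 89 | H | M | 84000 | 170820 | 1974-03-14 | Platinum | 3 | | |
| 9163 | 0 | 19 | 0 | 7 | 10 | 50 | 0.0 | 35.2 | 32 | ⋯ | 148 | H | F | 140000 | 92430 | 2006-02-22 | Platinum | 2 | | |
| 9170 | 0 | 33 | 5 | 7 | 2 | 50 | 40.0 | 34.75 | 31 | ⋯ | 93 | U | F | 142000 | 53430 | 2006-04-19 | Platinum | 2 | | |
| 9175 | 0 | 22 | 0 | 6 | 10 | 50 | 0.0 | 62.95 | 40 | ⋯ | 137 | H | F | 83000 | 443690 | 2002-02-23 | Platinum | 2 | | |
| 9190 | 0 | 30 | 0 | 20 | 5 | 55 | 55.0 | 80.2 | 29 | ⋯ | 150 | U | M | 66000 | 107770 | 2005-10-23 | Platinum | 2 | | |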


In [9]:
colSums(!is.na(df))
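
`colSums(!is.na(df))` counts the non-missing values in each column. After a left join, columns that come from the right-hand table (here `Review_Text` and `Title`) can have fewer non-missing values than the key columns, because rows without a matching review are kept with `NA`. A minimal sketch on a hypothetical toy data frame (`toy` and its values are assumptions for illustration):

```r
# Hypothetical toy data frame: two of three rows have no review title
toy <- data.frame(ID = c(1, 2, 3), Title = c("Great", NA, NA))

# !is.na() gives a logical matrix; colSums() counts the TRUE values per column
non_missing <- colSums(!is.na(toy))
non_missing
```

Here `ID` has three non-missing values and `Title` has one, so the difference is the number of rows with no match.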

## Saving data to be accessed later

In [10]:
save(df, file = "01_Combining_Data.RData")
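
As a hedged sketch of how a later notebook could read such a file back, the example below saves a hypothetical toy data frame to a temporary file and restores it with `load()`. The names `toy`, `tmp_file`, and `restored` are illustrative assumptions. Loading into a separate environment avoids silently overwriting objects of the same name in the current session.

```r
# Hypothetical toy data frame standing in for df
toy <- data.frame(ID = c(1, 2), LostCustomer = c(0, 1))

# Save to a temporary .RData file
tmp_file <- tempfile(fileext = ".RData")
save(toy, file = tmp_file)

# load() restores objects under their original names and
# returns those names as a character vector
restored <- new.env()
loaded_names <- load(tmp_file, envir = restored)

loaded_names
identical(restored$toy, toy)
```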