*Analytical Information Systems*

# Worksheet 2 - Data Integration

Matthias Griebel<br>
Lehrstuhl für Wirtschaftsinformatik und Informationsmanagement

SS 2020

## Exercises

### 1 Data Extraction

You are provided with a set of operational data from a retail company. All files are stored online on [github](https://github.com/wi3jmu/AIS_2019/tree/master/notebooks/data/T02).

- Transaction data (Comma Delimited Files): 
    - *'transactions_eng.csv'*
    - *'transactions_ger.csv'*
   
- Customer data (Semi-colon Delimited Files)
    - *'customers.csv*
    - *'customers_usa.csv*

- Product data (Excel Files)
    - *'products_convenience.xlsx'*
    - *'products.xlsx*

- Load the required packages

In [36]:
library(tidyverse) # includes the readr package
library(readxl) # excel files
data_url = 'https://raw.githubusercontent.com/wi3jmu/AIS_2019/master/notebooks/data/T02/'

#### 1.1 Load the provided files .csv files

- Use `read_csv2`
- Use `paste0` to concatenate the `data_url` and the file name

Example:
```R
customers <- read_csv2(paste0(data_url, 'customers.csv'))
```

In [2]:
# Write your code here 
# Use read_csv for semicolon seperated files
customers <- read_csv2(paste0(data_url, 'customers.csv'))
customers_usa <- read_csv2(paste0(data_url, 'customers_usa.csv'))

# Use read_csv for comma seperated files
transactions_eng <- read_csv(paste0(data_url, 'transactions_eng.csv'))
transactions_ger <- read_csv(paste0(data_url, 'transactions_ger.csv'))

Using ',' as decimal and '.' as grouping mark. Use read_delim() for more control.
Parsed with column specification:
cols(
  customerID = col_double(),
  country = col_character(),
  gender = col_character(),
  firstNames = col_character(),
  lastNames = col_character()
)
Using ',' as decimal and '.' as grouping mark. Use read_delim() for more control.
Parsed with column specification:
cols(
  customerID = col_double(),
  country = col_character(),
  gender = col_character(),
  name = col_character()
)
Parsed with column specification:
cols(
  date = col_date(format = ""),
  customerID = col_double(),
  productID = col_double(),
  payment = col_character(),
  amount = col_double()
)
Parsed with column specification:
cols(
  Datum = col_date(format = ""),
  Kundennummer = col_double(),
  Produkt = col_double(),
  Zahlungsmethode = col_character(),
  Stückzahl = col_double()
)


#### 1.2 Load the provided files .xlsx files
- Firstly, download the files using `downlaod.file`
- Read use `read_xlsx` to read excel files

Example:
```R
download.file(url=paste0(data_url, 'products.xlsx'), destfile='products.xlsx')
products <- read_excel('products.xlsx')
```

In [3]:
# Write your code here 
download.file(url=paste0(data_url, 'products.xlsx'), destfile='products.xlsx')
products <- read_excel('products.xlsx')

download.file(url=paste0(data_url, 'products.xlsx'), destfile='products_convenience.xlsx')
products_convenience <- read_xlsx('products_convenience.xlsx')

#### 1.3 Get to know the data

Take a look at the data 
```R
head()
sample_n()
```

In [4]:
# Write your code here 
# e.g.
products_convenience %>%
    head()

productID,price,cost,category
25426,146.15,75.92,emergency
81136,45.9,23.98,emergency
40784,125.79,47.53,specialty
34082,40.67,12.21,specialty
99445,74.29,28.26,specialty
78934,586.96,264.75,specialty


Check that all data is read in correctly

```R
nrow(), ncol(), colnames()
```

In [5]:
# Write your code here 
# e.g.
customers_usa %>% nrow()

Understand the rows and columns (observations and variables)
```R
glimpse()
summary()
```

In [6]:
# Write your code here 
# e.g.
products_convenience %>% 
    summary()

   productID         price            cost          category        
 Min.   : 2002   Min.   :-99.0   Min.   :-99.00   Length:67         
 1st Qu.:25228   1st Qu.:136.0   1st Qu.: 76.45   Class :character  
 Median :41910   Median :393.7   Median :171.11   Mode  :character  
 Mean   :47533   Mean   :414.2   Mean   :194.55                     
 3rd Qu.:75684   3rd Qu.:675.9   3rd Qu.:325.40                     
 Max.   :99445   Max.   :942.0   Max.   :543.90                     

### 2 Data Tranformation

#### 2.1 Resolve 1st class deficiencies

Find the syntactic in the products convenience data sets

- `products_convenience`: look at price and costs
- `customers_usa`: look at the names

In [7]:
# Write your code here 
products_convenience %>% head(10)

productID,price,cost,category
25426,146.15,75.92,emergency
81136,45.9,23.98,emergency
40784,125.79,47.53,specialty
34082,40.67,12.21,specialty
99445,74.29,28.26,specialty
78934,586.96,264.75,specialty
70159,905.96,345.4,emergency
50304,95.66,30.34,specialty
10130,767.88,417.93,specialty
2002,799.99,407.3,emergency


In [8]:
# Write your code here 
customers_usa %>% head(10)

customerID,country,gender,name
97218,usa,m,"Daleel, el-Sinai"
71221,usa,m,"Waseef, Sharma"
39248,usa,f,"Jessica, el-Taha"
41419,usa,f,"Hannah, Cung"
55495,usa,f,"Chelsea, Vang"
17358,usa,f,"Ashley, Marler"
86957,usa,f,"Tiegen, Chambers"
65343,usa,m,"Marwaan, Estrada"
85419,usa,m,"Jeramiah, Brame"
79896,usa,f,"Zeinab, Brown"


Implement transformation rules to resolve the deficiencies
- Transformation rules can be implemented as pipes
- You will have to use mutate() in combination with *str_replace()* and/or *str_split()*

In [9]:
# Write your code here 
products_convenience %>%
    mutate(price = as.numeric(str_replace(price,' €', "")),
           cost = as.numeric(str_replace(cost, ' €', ""))) -> products_convenience

In [10]:
# Write your code here 
customers_usa %>%
    mutate(firstNames = str_split(string = name, pattern = ', ', simplify = TRUE)[, 1],
           lastNames = str_split(string = name, pattern = ', ', simplify = TRUE)[, 2]) %>%
    select(-name) -> customers_usa

#### 2. Resolve 2<sup>nd</sup> class deficiencies*

- Perform plausibility checks (min, mean, max, …) to identify deficiencies in the product data 


In [11]:
# Write your code here 
products %>%
    summary()

   productID         price            cost          category        
 Min.   : 2002   Min.   :-99.0   Min.   :-99.00   Length:67         
 1st Qu.:25228   1st Qu.:136.0   1st Qu.: 76.45   Class :character  
 Median :41910   Median :393.7   Median :171.11   Mode  :character  
 Mean   :47533   Mean   :414.2   Mean   :194.55                     
 3rd Qu.:75684   3rd Qu.:675.9   3rd Qu.:325.40                     
 Max.   :99445   Max.   :942.0   Max.   :543.90                     

In [12]:
# Write your code here 
products_convenience %>%
    summary()

   productID         price            cost          category        
 Min.   : 2002   Min.   :-99.0   Min.   :-99.00   Length:67         
 1st Qu.:25228   1st Qu.:136.0   1st Qu.: 76.45   Class :character  
 Median :41910   Median :393.7   Median :171.11   Mode  :character  
 Mean   :47533   Mean   :414.2   Mean   :194.55                     
 3rd Qu.:75684   3rd Qu.:675.9   3rd Qu.:325.40                     
 Max.   :99445   Max.   :942.0   Max.   :543.90                     

- Implement transformation rules to resolve the deficiencies. <br> If you identify errors or missing values you can either:
    - Keep the errors / missing values
    - Remove the observations
    - Impute the values

In [13]:
# Write your code here 
# remove
products %>%
    filter(cost >= 0, price >= 0) -> products_fil
products_convenience %>%
    filter(cost >= 0, price >= 0) -> products_convenience_fil

In [14]:
# impute (example only, not used further)
# products %>%
#    mutate(price = if_else(price < 0, mean(price), price),
#           cost = if_else(cost < 0, mean(cost), cost)) 

#### 2.3 Resolve 3<sup>rd</sup> class deficiencies*

- Find the semantic 3<sup>rd</sup> class deficiencies in the customer data
    - `customer`: Take a closer look at the countries

In [15]:
# Write your code here 
customers %>%
    distinct(country)

country
uk
ger
moon


- Resolve the deficiencies

In [16]:
# Write your code here 
customers %>%
    filter(country != 'moon') -> customers_filtered

#### 2.4 Data Harmonization - Schema Heterogeneity

Find and harmonize schema heterogeneity in transaction data sets
- Look at the attribute names
- Adjust the transactions_ger to the schema of transactions_eng

In [17]:
# Write your code here 
colnames(transactions_ger)
colnames(transactions_eng)

colnames(transactions_ger) <- colnames(transactions_eng)

#### 2.4 Data Harmonization - Data-level Heterogeneity

Find and harmonize data-level heterogeneity in the customer data sets
- Take a closer look to the variables names as well as the variable values
- Adjust the `customers_usa` to the schema of `customers`

In [18]:
# Write your code here 
customers_usa %>%
        mutate(gender = str_replace(gender, "m", "Male")) %>%
        mutate(gender = str_replace(gender, "f", "Female")) -> customers_usa_harmonized

#### 2.5 Combine the data

- Combine the harmonised data sets
    - Create three new data sets: `customers`, `transactions`, `products`
    - Use *bind_rows()* for binding multiple data frames by row

In [19]:
# Write your code here 
transactions_eng %>%
    bind_rows(transactions_ger) -> transactions

products_fil %>%
    bind_rows(products_convenience_fil) -> products

customers_filtered %>%
    bind_rows(customers_usa_harmonized) -> customers

- Join the three data sets into one data final data frame

In [21]:
# Write your code here 
transactions %>%
    left_join(products) %>%
    left_join(customers) -> data_combined

Joining, by = "productID"
Joining, by = "customerID"


#### 2.6 Enrich the data

Create two new variables:
    - revenue per transaction
    - profit per transaction

In [22]:
# Write your code here 
data_combined %>%
    mutate(revenue = amount * price,
           profit = revenue - amount * cost) -> data_enriched

### 3 Data Loading

We’ll first create an in-memory SQLite database. We also need to install the `RSQLite` package.

In [23]:
#install.packages('RSQLite')
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

and copy over our dataset

In [24]:
copy_to(con, data_enriched)

Now you can retrieve a table using `tbl()` . Printing it just retrieves the first few rows:

In [25]:
db <- tbl(con, "data_enriched")
db

# Source:   table<data_enriched> [?? x 14]
# Database: sqlite 3.22.0 [:memory:]
    date customerID productID payment amount price  cost category country gender
   <dbl>      <dbl>     <dbl> <chr>    <dbl> <dbl> <dbl> <chr>    <chr>   <chr> 
 1 17285      13582     88744 cash         2   NA    NA  <NA>     uk      Male  
 2 17328      45708     16105 credit…      3   NA    NA  <NA>     usa     Male  
 3 17507      85419     93187 cash         2   NA    NA  <NA>     usa     Male  
 4 17182      64292      8403 paypal       3  561.  235. emergen… uk      Male  
 5 17182      64292      8403 paypal       3  561.  235. emergen… uk      Male  
 6 17394      41174      7234 paypal       3  351.  171. special… usa     Female
 7 17394      41174      7234 paypal       3  351.  171. special… usa     Female
 8 17380      99042     57590 credit…      3   NA    NA  <NA>     usa     Female
 9 17465      49737     32738 paypal       3  343.  105. emergen… uk      Female
10 17465      49737     32738

- (Lazily) generate query

In [26]:
db %>%
    filter(payment == 'cash') %>%
    summarise(MeanAmount = mean(amount, na.rm = TRUE)) -> summary

- See query

In [27]:
summary %>% 
    show_query()

<SQL>


“`overscope_eval_next()` is deprecated as of rlang 0.2.0.
Please use `eval_tidy()` with a data mask instead.
“`overscope_clean()` is deprecated as of rlang 0.2.0.

SELECT AVG(`amount`) AS `MeanAmount`
FROM `data_enriched`
WHERE (`payment` = 'cash')


- Execute query and retrieve results

In [28]:
summary %>% 
    collect()

MeanAmount
1.996989


#### 3.1 Analyze the data

Use the database connection to answer the following question:

__How much profit did the company realize in 2017?__

- Generate query

In [29]:
# Write your code here
db %>%
    summarise(totalProfit = sum(profit, na.rm = TRUE)) -> profit

- See query

In [30]:
# Write your code here 
profit %>% 
    show_query()

<SQL>
SELECT SUM(`profit`) AS `totalProfit`
FROM `data_enriched`


- Execute query and retrieve results

In [31]:
# Write your code here 
profit %>% 
    collect()

totalProfit
11483190


### 4 Exam Questions

Exam AIS SS 2018, Question 1

__Data Engineering & Integration (10 points)__

(a) __Getting orders in order__: You are working for a major online retailer who is interested in optimizing internal logistics processes. A key problem in this context is the handling of __orders with a single line item__ vs. __orders with multiple line items__.

The cornerstone of your analysis is an orders table with the following structure:<br>

<left>
    
\begin{array}{cccc}  
\hline
productID & quantity & orderID  \\ 
  \hline
...&...&...\\
\end{array}
    
</left>

i. (1 points) Explain (verbally or in pseudo code) how you would identify the number of orders with a single line item from this data base.

In [32]:
# Toy example for demonstration
order_data = tribble(
     ~productID, ~quantity, ~orderID,
    "Prod1",     2,         "Ord1",
    "Prod2",     3,         "Ord1",
    "Prod3",     4,         "Ord1",
    "Prod2",     5,         "Ord2",
    "Prod1",     2,         "Ord3",
    "Prod3",     1,         "Ord3",
    "Prod3",     1,         "Ord4")

In [33]:
# Write your code here 
order_data %>%
    group_by(orderID) %>%
    filter(n() == 1) %>%
    nrow()

or alternatively,

In [34]:
order_data %>%
    group_by(orderID) %>%
    summarise(nItems = n()) %>%
    filter(nItems == 1) %>%
    nrow()

i. (2 points) The frontend reporting tool used by the logistics department cannot handle data sets with more than 1 million rows. Yet your order table has many more rows. Recognizing that individual product IDs are not crucial for the logistics process analysis (handling times are determined by the number of products in an order) you are approached to provide ___a compact representation which retains the structure (number of line items) of the order invoices___. Explain how this can be achieved by means of clever aggregation.

In [35]:
# Write your code here 
order_data %>%
    group_by(orderID) %>%
    summarise(nItems = n()) 

orderID,nItems
Ord1,3
Ord2,1
Ord3,2
Ord4,1
