*Analytical Information Systems*

# Introduction to Jupyter and R Analytic Pipelines

Matthias Griebel<br>
Chair of Information Systems and Business Analytics 

SS 2022

***

# Table of Contents

* [1. Jupyter and R Working Environment](#jupyter_and_r)
  * [1.1. The Jupyter Notebook](#notebook)
  * [1.2. Kaggle Notebooks](#kaggle_notebooks)
  * [1.3. Local Installation](#local_install)
* [2. R Programming Basics](#r_programming)
  * [2.1. Getting Help](#help)
  * [2.2. R Packages](#r_packages)
  * [2.3. Operators in R](#r_operators)
* [3. Analytic Pipelines](#analytic_pipelines)
  * [3.1. Programming Example](#pe)
* [4. Recommended Books](#recommended_books)

## 1. Jupyter and R Working Environment
<a id="jupyter_and_r"></a>

__We will be using R__

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/1b/R_logo.svg/724px-R_logo.svg.png" width="200">


R is not the only language that can be used for data analysis. Why R rather than another?
- interactive language
- data structures & functions
- graphics
- packages & community!

__... and tidyverse__

<img src="https://github.com/matjesg/AIS_2019/raw/master/notebooks/images/01/ecosystem.png" width="300">

The [tidyverse](https://www.tidyverse.org) is a collection of R packages that share common philosophies and are designed to work together.

- Reuse existing data structures
- Compose simple functions with the pipe
- Embrace functional programming
- Design for humans

__... within the Jupyter Ecosystem__

Project Jupyter exists to develop open-source software, open-standards, and services for interactive computing across dozens of programming languages (https://jupyter.org/).

### 1.1. The Jupyter Notebook (this!)
<a id="notebook"></a>

- open-source web application 
- create and share documents that contain
    - live code and narrative text
    - data cleaning and transformation
    - numerical simulation
    - statistical modeling 
    - data visualization
    - machine learning
    - and much more 

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/3/38/Jupyter_logo.svg/250px-Jupyter_logo.svg.png" width="100">
    
__Jupyter Notebook Cells__
- A Markdown cell (this cell) contains text formatted using Markdown and displays its output in-place when it is run
- A code cell contains code to be executed in the kernel and displays its output below

Now write your first code in the next cell and run it:
```R
string <- "your text"
print(string)
```

In [1]:
string <- "hello world"
print(string)

[1] "hello world"


### 1.2. Kaggle Notebooks
<a id="kaggle_notebooks"></a>

There are two different types of Notebooks on Kaggle (see [Types of Notebooks](https://www.kaggle.com/docs/notebooks)):

- Jupyter Notebooks (Python and R)
- Scripts (Python and RMarkdown)

With Kaggle Notebooks, you can
- write and execute code,
- save and share your analyses, and
- access powerful computing resources (GPU and TPU),

all for free from your browser.

### 1.3. Local Installation<a id="local_install"></a>

You can also download the worksheets (notebooks) from [GitHub](https://github.com/wi3jmu/AIS2022) and work with them locally or on a platform of your choice.

> **Basic git knowledge required ;)**

## 2. R Programming Basics
<a id="r_programming"></a>

### 2.1. Getting Help 
<a id="help"></a>

![](https://pbs.twimg.com/media/En136SCXEAInQ6u?format=jpg&name=small)

[Richard Campbell via Twitter](https://twitter.com/richcampbell/status/1332352909451911170?s=21)

#### Cheat Sheets

- Cheat sheets make it easy to learn about your favorite packages
- [Here](https://www.rstudio.com/resources/cheatsheets/), you will find some cheat sheets

#### Accessing the documentation with '?'

The question mark is a simple shortcut to get help

```R
?print
```


In [2]:
?c

### 2.2. R Packages
<a id="r_packages"></a>

Packages are the fundamental units of reproducible R code. They include reusable R functions, the documentation that describes how to use them, and sample data. 

Example: Install and load the *tidyverse*
```R
# install
install.packages('tidyverse')
# load
library(tidyverse)
```

The *tidyverse* package is already pre-installed, so we just need to load it.

In [3]:
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.3     ✔ purrr   0.3.4
✔ tibble  3.1.0     ✔ dplyr   1.0.5
✔ tidyr   1.1.3     ✔ stringr 1.4.0
✔ readr   1.4.0     ✔ forcats 0.5.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

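On a local installation the package may not be present yet. The following is a minimal sketch of an "install only if missing" pattern; the helper name `install_if_missing` is hypothetical and not part of any package. It is demonstrated with *stats*, which ships with R, so nothing is downloaded here.

```R
# Hypothetical helper: install a package only when it cannot be found
install_if_missing <- function(pkg) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
        install.packages(pkg)   # needs internet access and a CRAN mirror
    }
    # report (invisibly) whether the package is available now
    invisible(requireNamespace(pkg, quietly = TRUE))
}

# 'stats' is part of every R installation, so this is a no-op
ok <- install_if_missing("stats")
ok
```

Calling `install_if_missing("tidyverse")` in the same way would trigger the installation only on machines where the package is absent.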


### 2.3. Operators in R
<a id="r_operators"></a>

#### Assignment operators

These operators are used to assign values to variables
<table style="font-size: 100%;">
<tbody>
<tr>
<th>Operator</th>
<th>Description</th>
</tr>
<tr>
<td>&lt;-, =</td>
<td>Leftwards assignment</td>
</tr>
<tr>
<td>-&gt;</td>
<td>Rightwards assignment</td>
</tr>
</tbody>
</table>

In [4]:
x <- 5
x
x = 5
x
5 -> x
x

#### Arithmetic operators

These operators are used to carry out mathematical operations like addition and multiplication.

<table style="font-size: 100%;">
<tbody>
<tr>
<th>Operator</th>
<th>Description</th>
</tr>
<tr>
<td>+</td>
<td>Addition</td>
</tr>
<tr>
<td>&#8211;</td>
<td>Subtraction</td>
</tr>
<tr>
<td>*</td>
<td>Multiplication</td>
</tr>
<tr>
<td>/</td>
<td>Division</td>
</tr>
<tr>
<td>^</td>
<td>Exponent</td>
</tr>
<tr>
<td>%%</td>
<td>Modulus (Remainder from division)</td>
</tr>
<tr>
<td>%/%</td>
<td>Integer Division</td>
</tr>
</tbody>
</table>

In [5]:
x^2
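
The two less familiar operators in the table, `%%` and `%/%`, belong together. A small sketch with made-up values (the variable names are illustrative only):

```R
# Illustrative values, not part of the course data
a <- 17
b <- 5

quotient  <- a %/% b   # integer division: how many times b fits into a
remainder <- a %% b    # modulus: what is left over

quotient
remainder

# The two results recombine to the original value
b * quotient + remainder == a
```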

#### Relational operators
Relational operators compare two values and return `TRUE` or `FALSE`.
<table style="font-size: 100%;">
<tbody>
<tr>
<th>Operator</th>
<th>Description</th>
</tr>
<tr>
<td>&lt;</td>
<td>Less than</td>
</tr>
<tr>
<td>&gt;</td>
<td>Greater than</td>
</tr>
<tr>
<td>&lt;=</td>
<td>Less than or equal to</td>
</tr>
<tr>
<td>&gt;=</td>
<td>Greater than or equal to</td>
</tr>
<tr>
<td>==</td>
<td>Equal to</td>
</tr>
<tr>
<td>!=</td>
<td>Not equal to</td>
</tr>
</tbody>
</table>

In [6]:
x != 10
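
Relational operators also work element by element on vectors, which is what makes them useful for filtering later on. A brief sketch with made-up values:

```R
# Made-up ages, for illustration only
ages <- c(15, 18, 21)

# The comparison is applied to every element and returns a logical vector
is_adult <- ages >= 18
is_adult

# TRUE counts as 1, so sum() gives the number of matches
sum(is_adult)
```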

#### The pipe operator

<img src="https://github.com/matjesg/AIS_2019/raw/master/notebooks/images/01/pipes.png" width="150">

Pipes are a powerful tool for clearly expressing a sequence of multiple operations.<br>

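The pipe passes the value on its left as the first argument to the function on its right. Using a pipe, `print(string)` can be rewritten as follows: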
```R
string %>%
    print()
```

In [7]:
x %>%
    print()

[1] 5
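
The benefit becomes clearer with more than one step. The sketch below compares a nested call with the equivalent pipe on made-up numbers; it loads *magrittr*, the package that provides `%>%` and is installed alongside the tidyverse.

```R
library(magrittr)   # provides the %>% operator

values <- c(1, 4, 9)   # made-up numbers

# Nested form: has to be read from the inside out
nested <- sum(sqrt(values))

# Piped form: reads from top to bottom in the order of execution
piped <- values %>%
    sqrt() %>%
    sum()

nested
piped
```

Both forms compute the same value; the pipe only changes how the steps are written down.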


## 3. Analytic Pipelines: Data Transformation with *dplyr*
<a id="analytic_pipelines"></a>

The *dplyr* package provides a grammar for manipulating tables in R. It can be conceptualized as an alternative to a traditional query language like SQL.

Main functions are

- *select()* extracts variables/columns as a table

- *filter()* extracts rows that meet logical criteria

- *group_by()* creates a "grouped" copy of a table. *dplyr* functions will manipulate each "group" separately and then combine the results

- *summarise()* applies summary functions to columns to create a new table of summary statistics based on grouping.

- *arrange()* orders rows by values of a column or columns

- *mutate()* computes new columns/variables
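
As a first impression of how these verbs combine, here is a minimal sketch on a made-up toy table (the data frame `toy` and its columns are assumptions for illustration, not the course data). It chains four of the functions listed above.

```R
library(dplyr)

# Made-up toy table, not the course data
toy <- data.frame(
    name  = c("Ana", "Ben", "Cleo", "Dan"),
    score = c(12, 7, 15, 10),
    stringsAsFactors = FALSE
)

result <- toy %>%
    select(name, score) %>%            # keep two columns
    filter(score >= 10) %>%            # keep rows with a score of at least 10
    mutate(passed = score >= 12) %>%   # add a logical column
    arrange(score)                     # sort ascending by score

result
```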

### 3.1. Programming Example
<a id="pe"></a>

We will be working on the Student Performance Data Set:
The [data set](https://rstudio-pubs-static.s3.amazonaws.com/108835_65a73467d96f4c79a5f808f5b8833922.html) contains information on students in secondary education in Portugal.

Important  attributes/columns:
- G1 - first period grade (from 0 to 20) 
- G2 - second period grade (from 0 to 20) 
- G3 - final grade (from 0 to 20)

Let's load the data and save it to the data frame "student_data"

In [8]:
file = "../input/d/impapan/student-performance-data-set/student/student-mat.csv"
student_data <- read.table(file=file, header=TRUE, sep=";")
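
If the input file is not available, the same call can be tried on inline text. The sketch below uses the `text` argument of `read.table()` with a made-up, semicolon-separated snippet (the values are illustrative, not taken from the data set).

```R
# Made-up, semicolon-separated text standing in for a file
csv_text <- "school;sex;G1\nGP;F;5\nMS;M;14"

# header = TRUE uses the first line as column names; sep = ";" sets the delimiter
toy_data <- read.table(text = csv_text, header = TRUE, sep = ";")

dim(toy_data)     # number of rows and columns
names(toy_data)   # column names taken from the header line
```

Whether text columns arrive as factors or as character vectors depends on the R version and on the `stringsAsFactors` argument.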

#### Have a look at the data

To view your data frame, write its name in a code cell and run it. Piping it into *head()* limits the output to the first six rows.

In [9]:
student_data %>% 
    head()

Unnamed: 0_level_0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,⋯,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
Unnamed: 0_level_1,<fct>,<fct>,<int>,<fct>,<fct>,<fct>,<int>,<int>,<fct>,<fct>,⋯,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
1,GP,F,18,U,GT3,A,4,4,at_home,teacher,⋯,4,3,4,1,1,3,6,5,6,6
2,GP,F,17,U,GT3,T,1,1,at_home,other,⋯,5,3,3,1,1,3,4,5,5,6
3,GP,F,15,U,LE3,T,1,1,at_home,other,⋯,4,3,2,2,3,3,10,7,8,10
4,GP,F,15,U,GT3,T,4,2,health,services,⋯,3,2,2,1,1,5,2,15,14,15
5,GP,F,16,U,GT3,T,3,3,other,other,⋯,4,3,2,1,2,5,4,6,10,10
6,GP,M,16,U,LE3,T,4,3,services,other,⋯,5,4,2,1,2,5,10,15,15,15


#### View first or last part

*head()* and *tail()* return the first or last part of the data frame

In [10]:
student_data %>%
    tail()

Unnamed: 0_level_0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,⋯,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
Unnamed: 0_level_1,<fct>,<fct>,<int>,<fct>,<fct>,<fct>,<int>,<int>,<fct>,<fct>,⋯,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
390,MS,F,18,U,GT3,T,1,1,other,other,⋯,1,1,1,1,1,5,0,6,5,0
391,MS,M,20,U,LE3,A,2,2,services,services,⋯,5,5,4,4,5,4,11,9,9,9
392,MS,M,17,U,LE3,T,3,1,services,services,⋯,2,4,5,3,4,2,3,14,16,16
393,MS,M,21,R,GT3,T,1,1,other,other,⋯,5,5,3,3,3,3,3,10,8,7
394,MS,M,18,R,LE3,T,3,2,services,other,⋯,4,4,1,3,4,5,0,11,12,10
395,MS,M,19,U,LE3,T,1,1,other,at_home,⋯,3,2,3,3,3,5,5,8,9,9


#### Get a glimpse of your data

*glimpse()* outputs a transposed version of the standard view: columns run down the page, and data runs across. This makes it possible to see every column in a data frame

In [11]:
student_data %>%
    glimpse()

Rows: 395
Columns: 33
$ school     <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP,…
$ sex        <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, F, M, M,…
$ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15,…
$ address    <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,…
$ famsize    <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, GT3, GT3,…
$ Pstatus    <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, T, T, T,…
$ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4,…
$ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3,…
$ Mjob       <fct> at_home, at_home, at_home, health, other, services, other, …
$ Fjob       <fct> teacher, other, other, services, other, other, ot

#### Data summaries

You can use the *summary()* command to get a better feel for how your data are distributed

In [12]:
student_data %>%
    summary()

 school   sex          age       address famsize   Pstatus      Medu      
 GP:349   F:208   Min.   :15.0   R: 88   GT3:281   A: 41   Min.   :0.000  
 MS: 46   M:187   1st Qu.:16.0   U:307   LE3:114   T:354   1st Qu.:2.000  
                  Median :17.0                             Median :3.000  
                  Mean   :16.7                             Mean   :2.749  
                  3rd Qu.:18.0                             3rd Qu.:4.000  
                  Max.   :22.0                             Max.   :4.000  
      Fedu             Mjob           Fjob            reason      guardian  
 Min.   :0.000   at_home : 59   at_home : 20   course    :145   father: 90  
 1st Qu.:2.000   health  : 34   health  : 18   home      :109   mother:273  
 Median :2.000   other   :141   other   :217   other     : 36   other : 32  
 Mean   :2.522   services:103   services:111   reputation:105               
 3rd Qu.:3.000   teacher : 58   teacher : 29                                
 Max.   :4.00

#### Piping Multiple Operations

Multiple operations can be executed in sequence using the pipe operator:

```R
df %>%
    filter() %>%
    mutate() %>%
    arrange()
```

We will now apply these functions to our student dataset. You can use the [Cheat Sheet](https://content.cdntwrk.com/files/aT05NjI5Mjgmdj0xJmlzc3VlTmFtZT1kYXRhLXRyYW5zZm9ybWF0aW9uLWNoZWF0LXNoZWV0JmNtZD1kJnNpZz01ZjdlZGUxZDJiM2QwMmYxNDUzODIwYzA0NzE5NTA2YQ%253D%253D) to work on the following tasks.

#### Select variables

Select the attributes *sex* and *age* from the data

In [13]:
student_data %>%
    select(sex, age) %>%
    head()

Unnamed: 0_level_0,sex,age
Unnamed: 0_level_1,<fct>,<int>
1,F,18
2,F,17
3,F,15
4,F,15
5,F,16
6,M,16


#### Make new variables

Calculate the average of the first period grade (G1) and the second period grade (G2) in a new column 'MeanGrade'

In [14]:
student_data %>%
    mutate("MeanGrade" = (G1+G2)/2) %>%
    head()

Unnamed: 0_level_0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,⋯,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3,MeanGrade
Unnamed: 0_level_1,<fct>,<fct>,<int>,<fct>,<fct>,<fct>,<int>,<int>,<fct>,<fct>,⋯,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<dbl>
1,GP,F,18,U,GT3,A,4,4,at_home,teacher,⋯,3,4,1,1,3,6,5,6,6,5.5
2,GP,F,17,U,GT3,T,1,1,at_home,other,⋯,3,3,1,1,3,4,5,5,6,5.0
3,GP,F,15,U,LE3,T,1,1,at_home,other,⋯,3,2,2,3,3,10,7,8,10,7.5
4,GP,F,15,U,GT3,T,4,2,health,services,⋯,2,2,1,1,5,2,15,14,15,14.5
5,GP,F,16,U,GT3,T,3,3,other,other,⋯,3,2,1,2,5,4,6,10,10,8.0
6,GP,M,16,U,LE3,T,4,3,services,other,⋯,4,2,1,2,5,10,15,15,15,15.0


#### Extract data
Filter only male students

In [15]:
student_data %>%
    filter(sex=='M') %>%
    head()

Unnamed: 0_level_0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,⋯,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
Unnamed: 0_level_1,<fct>,<fct>,<int>,<fct>,<fct>,<fct>,<int>,<int>,<fct>,<fct>,⋯,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
1,GP,M,16,U,LE3,T,4,3,services,other,⋯,5,4,2,1,2,5,10,15,15,15
2,GP,M,16,U,LE3,T,2,2,other,other,⋯,4,4,4,1,1,3,0,12,12,11
3,GP,M,15,U,LE3,A,3,2,services,other,⋯,4,2,2,1,1,1,0,16,18,19
4,GP,M,15,U,GT3,T,3,4,other,other,⋯,5,5,1,1,1,5,0,14,15,15
5,GP,M,15,U,LE3,T,4,4,health,services,⋯,4,3,3,1,3,5,2,14,14,14
6,GP,M,15,U,GT3,T,4,3,teacher,other,⋯,5,4,3,1,2,3,2,10,10,11


#### Sorting the data

Select only the female students and sort them by age in descending order (oldest first).

In [16]:
student_data %>%
    filter(sex=='F') %>%
    arrange(-age) %>%
    head()

Unnamed: 0_level_0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,⋯,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
Unnamed: 0_level_1,<fct>,<fct>,<int>,<fct>,<fct>,<fct>,<int>,<int>,<fct>,<fct>,⋯,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
1,MS,F,20,U,GT3,T,4,2,health,other,⋯,5,4,3,1,1,3,4,15,14,15
2,GP,F,19,U,GT3,T,0,1,at_home,other,⋯,3,4,2,1,1,5,2,7,8,9
3,GP,F,19,U,GT3,T,3,3,other,other,⋯,4,3,3,1,2,3,10,8,8,8
4,GP,F,19,U,GT3,T,3,3,other,services,⋯,4,3,5,3,3,5,15,9,9,9
5,GP,F,19,U,GT3,T,4,4,health,other,⋯,2,3,4,2,3,2,0,10,9,0
6,GP,F,19,U,LE3,T,1,1,at_home,other,⋯,4,4,3,1,3,3,18,12,10,10
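
The minus sign in `arrange(-age)` works because *age* is numeric. *dplyr* also offers `desc()`, which sorts text columns in descending order as well. A small sketch on made-up data (the table `toy` is an assumption for illustration):

```R
library(dplyr)

# Made-up toy table, not the course data
toy <- data.frame(
    name = c("Ana", "Ben", "Cleo"),
    age  = c(16, 19, 17),
    stringsAsFactors = FALSE
)

# Descending by a numeric column; same ordering as arrange(-age)
by_age_desc <- toy %>% arrange(desc(age))

# Descending by a character column, where unary minus would raise an error
by_name_desc <- toy %>% arrange(desc(name))

by_age_desc
by_name_desc
```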


#### Summarize the data

What is the average number of absences per student?

In [17]:
student_data %>%
    summarise(Mean_absences = mean(absences))

Mean_absences
<dbl>
5.708861


#### Grouping and summarizing

Calculate the average number of absences for each combination of age and sex

In [18]:
student_data %>%
    group_by(age, sex) %>%
    summarise(Mean_absences = mean(absences)) %>%
    head()

`summarise()` has grouped output by 'age'. You can override using the `.groups` argument.



age,sex,Mean_absences
<int>,<fct>,<dbl>
15,F,3.894737
15,M,2.863636
16,F,5.888889
16,M,4.98
17,F,6.913793
17,M,5.8
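
The message printed above points to the `.groups` argument of `summarise()`. As a hedged sketch on made-up data (the table `toy` is an assumption for illustration), setting `.groups = "drop"` returns an ungrouped result, so later steps are not applied group by group by accident.

```R
library(dplyr)

# Made-up toy table, not the course data
toy <- data.frame(
    school   = c("GP", "GP", "GP", "MS"),
    sex      = c("F", "F", "M", "M"),
    absences = c(2, 6, 1, 3),
    stringsAsFactors = FALSE
)

by_group <- toy %>%
    group_by(school, sex) %>%
    summarise(Mean_absences = mean(absences), .groups = "drop")

by_group

# FALSE: the grouping has been removed from the result
is_grouped_df(by_group)
```

Without `.groups = "drop"`, the result here would stay grouped by *school*, which is what the message reports for *age* in the cell above.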


## 4. Recommended Books
<a id="recommended_books"></a>

R for Data Science (https://r4ds.had.co.nz/)

<img src="https://d33wubrfki0l68.cloudfront.net/b88ef926a004b0fce72b2526b0b5c4413666a4cb/24a30/cover.png" width="100" style="float:left">


An Introduction to Statistical Learning (https://www.springer.com/de/book/9781461471370)

<img src="https://images-na.ssl-images-amazon.com/images/I/41RgG05lZaL._SX329_BO1,204,203,200_.jpg" width="100" style="float:left">