*Analytical Information Systems*

# Worksheet 1 - Introduction to Jupyter and R Analytic Pipelines

Matthias Griebel<br>
Lehrstuhl für Wirtschaftsinformatik und Informationsmanagement

SS 2020

## Jupyter and R Working Environment

__We will be using R__

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/1b/R_logo.svg/724px-R_logo.svg.png" width="100" align="right">


R is not the only language that can be used for data analysis. Why R rather than another?
- interactive language
- data structures & functions
- graphics
- packages & community!



__... and tidyverse__

<img src="https://github.com/matjesg/AIS_2019/raw/master/notebooks/images/01/ecosystem.png" width="300" align="right">

The [tidyverse](https://www.tidyverse.org) is a collection of R packages that share common philosophies and are designed to work together.

- Reuse existing data structures
- Compose simple functions with the pipe
- Embrace functional programming
- Design for humans




__within the Jupyter Ecosystem__

Project Jupyter exists to develop open-source software, open-standards, and services for interactive computing across dozens of programming languages (https://jupyter.org/).

### The Jupyter Notebook (this!)

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/3/38/Jupyter_logo.svg/250px-Jupyter_logo.svg.png" width="100" align="right">

- open-source web application 
- create and share documents that contain
    - live code and narrative text
    - data cleaning and transformation
    - numerical simulation
    - statistical modeling 
    - data visualization
    - machine learning
    - and much more 

__Jupyter Notebook Cells__
- A Markdown cell (this cell) contains text formatted using Markdown and displays its output in-place when it is run
- A code cell contains code to be executed in the kernel and displays its output below

Now write your first code in the next cell and run it:
```R
string <- "your text"
print(string)
```

In [1]:
string <- "hello world"
print(string)

[1] "hello world"


### Google Colab

<img alt="Colaboratory logo" height="45px" src="https://colab.research.google.com/img/colab_favicon.ico" align="right" hspace="10px" vspace="0px">

[Colaboratory](https://colab.research.google.com/) is a free Jupyter notebook environment that requires no setup and runs entirely in the cloud.

With Colaboratory you can 
- write and execute code, 
- save and share your analyses, and 
- access powerful computing resources (GPU and TPU), 

all for free from your browser. [More information](https://colab.research.google.com/notebooks/welcome.ipynb)

### R Programming

#### R Packages

Packages are the fundamental units of reproducible R code. They include reusable R functions, the documentation that describes how to use them, and sample data. 

Example: Install and load the *tidyverse*
```R
# install
install.packages('tidyverse')
# load
library(tidyverse)
```

The *tidyverse* package is already pre-installed, so we just need to load it.

In [10]:
library(tidyverse)

#### Operators in R

#### Assignment operators

These operators are used to assign values to variables
<table style="font-size: 100%;">
<tbody>
<tr>
<th>Operator</th>
<th>Description</th>
</tr>
<tr>
<td>&lt;-, =</td>
<td>Leftwards assignment</td>
</tr>
<tr>
<td>-&gt;</td>
<td>Rightwards assignment</td>
</tr>
</tbody>
</table>

In [3]:
x <- 5
x
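
The three assignment forms from the table can be compared side by side. This is a minimal sketch; the names `a`, `b`, and `d` are chosen only for illustration.

```R
# Three ways to bind the value 5 to a name
a <- 5    # leftwards assignment (the conventional R style)
b = 5     # leftwards assignment with '=', equivalent at the top level
5 -> d    # rightwards assignment: the value comes first, the name last

# All three names now hold the same value
print(c(a, b, d))
```

Most R style guides prefer `<-` for assignment and reserve `=` for naming function arguments.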

#### Arithmetic operators

These operators are used to carry out mathematical operations like addition and multiplication.

<table style="font-size: 100%;">
<tbody>
<tr>
<th>Operator</th>
<th>Description</th>
</tr>
<tr>
<td>+</td>
<td>Addition</td>
</tr>
<tr>
<td>-</td>
<td>Subtraction</td>
</tr>
<tr>
<td>*</td>
<td>Multiplication</td>
</tr>
<tr>
<td>/</td>
<td>Division</td>
</tr>
<tr>
<td>^</td>
<td>Exponent</td>
</tr>
<tr>
<td>%%</td>
<td>Modulus (Remainder from division)</td>
</tr>
<tr>
<td>%/%</td>
<td>Integer Division</td>
</tr>
</tbody>
</table>

In [4]:
x^10
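
The two less familiar operators in the table, `%/%` and `%%`, split a division into its whole-number part and its remainder. A small sketch with arbitrarily chosen numbers:

```R
dividend <- 17
divisor  <- 5

quotient  <- dividend %/% divisor   # integer division: 3
remainder <- dividend %% divisor    # modulus: 2

# Quotient and remainder always recombine to the original value
quotient * divisor + remainder == dividend   # TRUE
```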

#### Relational operators
Relational operators test or define some kind of relation between two entities/values
<table style="font-size: 100%;">
<tbody>
<tr>
<th>Operator</th>
<th>Description</th>
</tr>
<tr>
<td>&lt;</td>
<td>Less than</td>
</tr>
<tr>
<td>&gt;</td>
<td>Greater than</td>
</tr>
<tr>
<td>&lt;=</td>
<td>Less than or equal to</td>
</tr>
<tr>
<td>&gt;=</td>
<td>Greater than or equal to</td>
</tr>
<tr>
<td>==</td>
<td>Equal to</td>
</tr>
<tr>
<td>!=</td>
<td>Not equal to</td>
</tr>
</tbody>
</table>

In [5]:
x < 10
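
Relational operators are vectorized: applied to a vector, they compare every element and return a logical vector of the same length. The sketch below uses a hypothetical vector `ages`, which is not part of the worksheet data.

```R
ages <- c(15, 17, 18, 21)   # hypothetical example values

is_adult <- ages >= 18      # FALSE FALSE TRUE TRUE

sum(is_adult)               # TRUE counts as 1, so this is 2
ages[is_adult]              # keeps only matching elements: 18 21
```

This element-wise behaviour is what makes conditions such as `sex == 'M'` usable for selecting rows of a data frame.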

#### The pipe operator

<img src="https://github.com/matjesg/AIS_2019/raw/master/notebooks/images/01/pipes.png" width="150">

Pipes are a powerful tool for clearly expressing a sequence of multiple operations.<br>

Using a pipe, the call `print(string)` from above can be rewritten as follows
```R
string %>%
    print()
```

In [6]:
string %>%
    print()

[1] "hello world"
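
The benefit of the pipe becomes clearer with more than one step. The sketch below compares a nested call with its piped equivalent; it assumes the *magrittr* package (installed together with the tidyverse) is available, and `values` is an illustrative vector.

```R
library(magrittr)   # provides %>%; dplyr re-exports the same operator

values <- c(1, 4, 9)

# Nested calls are read from the inside out
nested <- sum(sqrt(values))

# The piped version reads left to right: take values, then sqrt, then sum
piped <- values %>%
    sqrt() %>%
    sum()

identical(nested, piped)   # TRUE, both equal 6
```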


### Relational Data and Data Frames

__ARIS Data View__

<img src="https://github.com/matjesg/AIS_2019/raw/master/notebooks/images/01/aris.png" width="500">

The relational model represents the database as a collection of relations (= tables, in R: *data frames* or *tibbles*).

- Each row of a table represents a list of related data values (= data record). Such a row is referred to as a "tuple"
- A column corresponds to an attribute
    - Attributes are assigned a data type, format, or value range 
    - Each attribute value is atomic and cannot be further broken down into components
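
These terms map directly onto an R data frame. The following sketch builds a small, entirely hypothetical relation `students` and inspects its tuples and attributes.

```R
# A hypothetical relation with three tuples and three attributes
students <- data.frame(
    name   = c("Ana", "Rui", "Ines"),
    age    = c(17L, 18L, 16L),
    passed = c(TRUE, FALSE, TRUE),
    stringsAsFactors = FALSE
)

nrow(students)            # number of tuples (rows): 3
names(students)           # attribute names (columns)
sapply(students, class)   # data type of each attribute
students[2, ]             # the second tuple
```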

#### Working on the Student Performance Data Set 

The [data set](https://rstudio-pubs-static.s3.amazonaws.com/108835_65a73467d96f4c79a5f808f5b8833922.html) contains information on students in secondary education in Portugal.

Important attributes/columns:
- G1 - first period grade (from 0 to 20) 
- G2 - second period grade (from 0 to 20) 
- G3 - final grade (from 0 to 20)

Let's download the data and save it to the data frame "student_data"

In [7]:
url = "https://raw.githubusercontent.com/arunk13/MSDA-Assignments/master/IS607Fall2015/Assignment3/student-mat.csv"
student_data <- read.table(file= url, header = TRUE, sep = ";")
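
To see what `header = TRUE` and `sep = ";"` do without downloading anything, the same call can be tried on an inline string. The two data rows below are a cut-down excerpt (six of the 33 columns), assumed here purely for illustration.

```R
# A tiny semicolon-separated excerpt; the first line holds the column names
csv_text <- "school;sex;age;G1;G2;G3
GP;F;18;5;6;6
GP;M;16;15;15;15"

# text = ... reads from a string instead of a file or URL
demo_data <- read.table(text = csv_text, header = TRUE, sep = ";")

dim(demo_data)     # 2 rows, 6 columns
names(demo_data)   # column names taken from the header line
```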

#### Have a look at the data

To view your data frame, write the name in a code cell and run it

In [8]:
student_data %>%
    head()

school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,6,5,6,6
GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,4,5,5,6
GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,10,7,8,10
GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,2,15,14,15
GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,4,6,10,10
GP,M,16,U,LE3,T,4,3,services,other,...,5,4,2,1,2,5,10,15,15,15


#### View first or last part

*head()* and *tail()* return the first or last part of the data frame (six rows by default)

In [9]:
student_data %>%
    tail()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
390,MS,F,18,U,GT3,T,1,1,other,other,...,1,1,1,1,1,5,0,6,5,0
391,MS,M,20,U,LE3,A,2,2,services,services,...,5,5,4,4,5,4,11,9,9,9
392,MS,M,17,U,LE3,T,3,1,services,services,...,2,4,5,3,4,2,3,14,16,16
393,MS,M,21,R,GT3,T,1,1,other,other,...,5,5,3,3,3,3,3,10,8,7
394,MS,M,18,R,LE3,T,3,2,services,other,...,4,4,1,3,4,5,0,11,12,10
395,MS,M,19,U,LE3,T,1,1,other,at_home,...,3,2,3,3,3,5,5,8,9,9


#### Get a glimpse of your data

*glimpse()* outputs a transposed version of the standard view: columns run down the page, and data runs across. This makes it possible to see every column in a data frame

In [10]:
glimpse(student_data)

Observations: 395
Variables: 33
$ school     <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP…
$ sex        <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, F, M, M…
$ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15…
$ address    <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U…
$ famsize    <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, GT3, GT3…
$ Pstatus    <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, T, T, T…
$ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4…
$ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3…
$ Mjob       <fct> at_home, at_home, at_home, health, other, services, other,…
$ Fjob       <fct> teacher, other, other, services, other, other, other, teac…
$ reason     <fct> course, course, other, home, home, reputation, home, home,…
$ guardian   <fct> mother, father, mother, mother, father, mother, mother, mo…
$ traveltime <int> 2

#### Data summaries

You can use the *summary()* command to get a better feel for how your data are distributed

In [11]:
student_data %>%
    summary()

 school   sex          age       address famsize   Pstatus      Medu      
 GP:349   F:208   Min.   :15.0   R: 88   GT3:281   A: 41   Min.   :0.000  
 MS: 46   M:187   1st Qu.:16.0   U:307   LE3:114   T:354   1st Qu.:2.000  
                  Median :17.0                             Median :3.000  
                  Mean   :16.7                             Mean   :2.749  
                  3rd Qu.:18.0                             3rd Qu.:4.000  
                  Max.   :22.0                             Max.   :4.000  
      Fedu             Mjob           Fjob            reason      guardian  
 Min.   :0.000   at_home : 59   at_home : 20   course    :145   father: 90  
 1st Qu.:2.000   health  : 34   health  : 18   home      :109   mother:273  
 Median :2.000   other   :141   other   :217   other     : 36   other : 32  
 Mean   :2.522   services:103   services:111   reputation:105               
 3rd Qu.:3.000   teacher : 58   teacher : 29                                
 Max.   :4.00

### Help and Documentation

#### Accessing the documentation with '?'

The question mark is a simple shortcut to get help

```R
?tidyverse
```

In [12]:
?tidyverse

tidyverse-package {tidyverse},R Documentation


### Recommended Books

R for Data Science (https://r4ds.had.co.nz/)

<img src="https://d33wubrfki0l68.cloudfront.net/b88ef926a004b0fce72b2526b0b5c4413666a4cb/24a30/cover.png" width="100" style="float:left">

An Introduction to Statistical Learning (https://www.springer.com/de/book/9781461471370)

<img src="https://images.springer.com/sgw/books/medium/9781461471370.jpg" width="100" style="float:left">

#### Cheat Sheets

- Cheat sheets make it easy to learn about your favorite packages
- [Here](https://www.rstudio.com/resources/cheatsheets/), you will find some cheat sheets

## Analytic Pipelines: Data Transformation with *dplyr*

The *dplyr* package provides a grammar for manipulating tables in R. It can be conceptualized as an alternative to a traditional query language like SQL.

Main functions are

- *select()* extracts variables/columns as a table

- *filter()* extracts rows that meet logical criteria

- *group_by()* creates a "grouped" copy of a table. *dplyr* functions will manipulate each "group" separately and then combine the results

- *summarise()* applies summary functions to columns to create a new table of summary statistics based on grouping.

- *arrange()* orders rows by values of a column or columns

- *mutate()* computes new columns/variables

Multiple operations can be executed in sequence using the pipe operator:

```R
df %>%
    filter() %>%
    mutate() %>%
    arrange()
```

We will now apply these functions to our student dataset. You can use the [Cheat Sheet](https://content.cdntwrk.com/files/aT05NjI5Mjgmdj0xJmlzc3VlTmFtZT1kYXRhLXRyYW5zZm9ybWF0aW9uLWNoZWF0LXNoZWV0JmNtZD1kJnNpZz01ZjdlZGUxZDJiM2QwMmYxNDUzODIwYzA0NzE5NTA2YQ%253D%253D) to work on the following tasks. 

#### Select variables

Select the attributes *sex* and *age* from the data

In [13]:
student_data %>%
    select(sex, age) %>%
    head()

sex,age
F,18
F,17
F,15
F,15
F,16
M,16
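
Besides listing columns by name, `select()` also accepts ranges, helper functions, and exclusions. A brief sketch on a hypothetical two-row table `toy`:

```R
library(dplyr)

# Hypothetical toy table, not the real student data
toy <- data.frame(
    sex = c("F", "M"),
    age = c(18, 16),
    G1  = c(5, 15),
    G2  = c(6, 15),
    G3  = c(6, 15),
    stringsAsFactors = FALSE
)

by_range  <- toy %>% select(G1:G3)              # all columns from G1 to G3
by_prefix <- toy %>% select(starts_with("G"))   # columns whose name starts with "G"
without   <- toy %>% select(-age)               # every column except age

names(by_range)
names(without)
```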


#### Make new variables

Calculate the average of the first period grade (G1) and the second period grade (G2) in a new column 'MeanGrade'

In [14]:
student_data %>%
    mutate(MeanGrade = (G1 + G2) / 2) %>%
    head()

school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3,MeanGrade
GP,F,18,U,GT3,A,4,4,at_home,teacher,...,3,4,1,1,3,6,5,6,6,5.5
GP,F,17,U,GT3,T,1,1,at_home,other,...,3,3,1,1,3,4,5,5,6,5.0
GP,F,15,U,LE3,T,1,1,at_home,other,...,3,2,2,3,3,10,7,8,10,7.5
GP,F,15,U,GT3,T,4,2,health,services,...,2,2,1,1,5,2,15,14,15,14.5
GP,F,16,U,GT3,T,3,3,other,other,...,3,2,1,2,5,4,6,10,10,8.0
GP,M,16,U,LE3,T,4,3,services,other,...,4,2,1,2,5,10,15,15,15,15.0


#### Extract data
Keep only the male students

In [15]:
student_data %>%
    filter(sex=='M') %>%
    head()

school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
GP,M,16,U,LE3,T,4,3,services,other,...,5,4,2,1,2,5,10,15,15,15
GP,M,16,U,LE3,T,2,2,other,other,...,4,4,4,1,1,3,0,12,12,11
GP,M,15,U,LE3,A,3,2,services,other,...,4,2,2,1,1,1,0,16,18,19
GP,M,15,U,GT3,T,3,4,other,other,...,5,5,1,1,1,5,0,14,15,15
GP,M,15,U,LE3,T,4,4,health,services,...,4,3,3,1,3,5,2,14,14,14
GP,M,15,U,GT3,T,4,3,teacher,other,...,5,4,3,1,2,3,2,10,10,11


#### Sorting the data

Select only the female students and sort them by age in descending order (oldest first).

In [16]:
student_data %>%
    filter(sex=='F') %>%
    arrange(-age) %>%
    head()

school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
MS,F,20,U,GT3,T,4,2,health,other,...,5,4,3,1,1,3,4,15,14,15
GP,F,19,U,GT3,T,0,1,at_home,other,...,3,4,2,1,1,5,2,7,8,9
GP,F,19,U,GT3,T,3,3,other,other,...,4,3,3,1,2,3,10,8,8,8
GP,F,19,U,GT3,T,3,3,other,services,...,4,3,5,3,3,5,15,9,9,9
GP,F,19,U,GT3,T,4,4,health,other,...,2,3,4,2,3,2,0,10,9,0
GP,F,19,U,LE3,T,1,1,at_home,other,...,4,4,3,1,3,3,18,12,10,10
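
`arrange()` sorts in ascending order by default; wrapping a column in `desc()`, or negating a numeric column, reverses the order. A minimal sketch on a hypothetical table `toy`:

```R
library(dplyr)

# Hypothetical toy table, not the real student data
toy <- data.frame(
    name = c("a", "b", "c"),
    age  = c(16, 19, 15),
    stringsAsFactors = FALSE
)

ascending  <- toy %>% arrange(age)         # ages 15, 16, 19
descending <- toy %>% arrange(desc(age))   # ages 19, 16, 15
negated    <- toy %>% arrange(-age)        # same order as desc(age) here

descending
```

`desc()` also works for character columns, whereas the minus sign only works for numeric ones.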


#### Summarize the data

What is the average number of absences per student?

In [17]:
student_data %>%
    summarise(Mean_absences = mean(absences))

Mean_absences
5.708861


#### Grouping and summarizing

Calculate the average number of absences for each combination of age and sex

In [18]:
student_data %>%
    group_by(age, sex) %>%
    summarise(Mean_absences = mean(absences)) %>%
    head()

age,sex,Mean_absences
15,F,3.894737
15,M,2.863636
16,F,5.888889
16,M,4.98
17,F,6.913793
17,M,5.8
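
Grouping by a single column yields one summary row per value of that column. The sketch below groups a hypothetical table `toy` by sex only and also counts the rows in each group with `n()`.

```R
library(dplyr)

# Hypothetical toy table, not the real student data
toy <- data.frame(
    sex      = c("F", "M", "F", "M"),
    absences = c(4, 2, 6, 0),
    stringsAsFactors = FALSE
)

by_sex <- toy %>%
    group_by(sex) %>%
    summarise(
        Mean_absences = mean(absences),   # average within each group
        n = n()                           # number of rows in each group
    )

by_sex
```

Reporting the group size next to each mean helps judge how much weight an average deserves, since small groups produce unstable means.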
