# R Introductory Notebook complete

This notebook covers both part I and part II

Course: Elementare Statistik / R   
Libraries required: none  
Author: timo.varelmann@uni-koeln.de  
Date: 28.10.2018

List of contents:

- [Basic operations](#basics)
 - [R language basics using summation examples](#basics-sum)
 - [Some more mathematical functions](#functions)
 - [Small numbers](#small)
- [Object assignment and the workspace](#assignment)
- [Vectors](#vectors)
 - [Combinations and sequences of elements](#seq)
 - [Vector classes](#classes)
 - [Position index](#PI)
 - [Exercise Position Index](#ex_one)
- [Two-dimensional data structures I](#two-dim)
 - [Matrix](#matrix)
   - [Binding](#bind)
   - [Function `matrix()`](#func_mat)
   - [Function `dim()`](#func_dim)
   - [Exercise Matrix Building](#ex_mat)
 - [Position index](#PImat)
 - [Some basic vector and matrix operations](#vector-ops)
- [Two Dimensional Data Structures II](#data_structures2)
 - [Matrix and Data Frame - Important differences](#data_frame)
 - [Inspection of the Data Frame](#inspection)
 - [Creating Subsets](#subsets)
   - [Index](#index)
   - [\$ Column Selection](#dollar)
   - [Logical Expressions](#log_index)
   - [Subset Function](#subset)
 - [Manipulating Data or Data Frame Attributes](#manipulation)
   - [Changing column or variable names](#colnames)
   - [Changing single values](#singlevals)
   - [Extend the Data Frame](#extend)
   - [Change Order of Columns](#change_columns)
- [Save, load, and import files](#import)
  - [The Working Directory](#wd)
  - [Save Objects like Vectors, Matrices or Data Frames as `.RData`-File](#save)
  - [Load `.RData`-Files](#load)
  - [Read `.csv`-Files](#read.csv)
  - [Read `.txt`-Files](#read.table)

## Basic operations<a name="basics">
### R language basics using summation examples<a name="basics-sum">

In R, basic mathematical operations have the structure 'number - operator - number'.  
Like with any calculator or other programming language, such as Python, the sum of two numbers &mdash; 3 and 5 &mdash; can be written as follows:   
`3+5`

In [4]:
3 + 5
3 - 1

As you may have noticed immediately: comments are not executed. All characters on the right-hand side of a `#` will be treated as comments. As such, you may also write comments within an executable code line like:

In [5]:
3 + 5 # This returns the sum of 3 and 5

Unlike Python, in R the result of every line of code in a cell is printed when the cell is run.  
You don't need the R function `print()` to return all of this:

In [1]:
1.1 + 1
2 + 45
3.5 + -25

The `+` operator accepts two arguments, one on the left side and one on the right side.  
Here, the number 3 is the right-hand argument of the first `+`; the result of `2 + 3` is then the left-hand argument of the second `+`:

In [2]:
2 + 3 + 1.1

Arguments of an arithmetic operator must be numbers (including numbers assigned to variables), which can be positive, negative or zero. Negative numbers must be marked with `-`; for positive numbers, a `+` sign is not necessary.  
Here is an example of the sum of two positive numbers and one negative number:

In [3]:
+2 + +5 + -3

Let's write the same sum in a more intuitive way: just subtract the number 3

In [5]:
2 + 5 - 3

Write your code in a way that is well structured!  
Above, I have used empty spaces for readability. An empty space is ignored in the R language (at least, as long as it is not part of a 'character string', see further below).  
So, this also works (though not well arranged)

In [5]:
1.1+     1
    2      -                                                45
                  3.5+-25

If an input is not completed in a line of code, R continues with the next line; if the input is completed within that next line, it is executed (here: the operation is executed and its result is returned).  
If the input on a line is already complete, R interprets the next line as separate input.

In [6]:
1 +    # the right-hand argument of the operator is missing
3      # here it is: the sum will be returned
1      # This input is complete: it returns the single number
3      # This input is complete: it returns the single number

Important: if the code within a given cell ends without completion of a function, this will produce an error.  
Try it out: uncomment the next line by erasing '#' and execute this cell:

In [1]:
# 1 + 

### Some more mathematical functions<a name="functions">

Here are some basic mathematical operators that have two arguments:

In [16]:
3+5 # sum up
3-4 # subtract
3*4 # multiply
3/4 # divide
3^4 # exponent calculation

Make use of parentheses for prioritization.  
Compare:

In [8]:
3^4/9-8   # vs.
3^(4/9-8) # vs.
3^(4/9)-8 # vs.
3^4/(9-8)

Unlike the mathematical operators you have just learned (argument - operator - argument), the majority of functions in R use the structure `function()`.  
Arguments are written in parentheses.

As such, `3+4` can be rewritten using the function `sum()` followed by its arguments in parentheses:

In [19]:
sum(3,4)

The function `sum()` returns the sum of all the values present in its arguments.  
It represents the big operator for calculating the sum of a sequence:

$\sum(2,3,6,-3)$ can be written and executed as

In [20]:
sum(2,3,6,-3)

The function `prod()` returns the product of all the values present in its arguments.  
It represents the big operator for calculating the product of a sequence:

$\prod(2,3,6,-3)$ can be written and executed as

In [21]:
prod(2,3,6,-3)

R has elaborate documentation of its functions.  
Type a question mark followed by the name of a function to open its help page, e.g. `?round`.

Arguments of a given function can also be used to specify some function operations.

For example, in order to round a given number, you specify not only the number to be rounded but also the number of decimal places to which it should be rounded. That additional argument `digits` is a number:  
`round(x=1/3, digits=3)`

If you don't use the digits-argument, R takes a default specification when executing the function.  
For `round()`, `digits=0` is the default.

So this returns 0:

In [23]:
round(x=1/3)

Further, the order of arguments matters as soon as you don't indicate which argument ('x' or 'digits') you are going to specify.  
The documentation `?round` shows the default order of arguments and the default argument specifications: 

`round(x, digits=0)`

With that knowledge in mind, you can reformulate `round(x=1/3, digits=3)` as follows:

In [24]:
round(1/3 ,3)          # That's the most economical way (least code writing)
round(1/3 ,digits=3)   # That's what you will often see in textbooks; standard coding. Background knowledge:
                       # The input of a function is typically written on the left-hand side
round(x=1/3 ,3)        # This is also possible, though you won't see such code very often.
round(digits=3, x=1/3) # This is also possible. If you know the arguments' names (here 'x' and 'digits'), 
                       # but not their default order, just add the argument names every time. That's safe.
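
The matching rules described above apply to any function, not only to `round()`. Here is a minimal sketch using a hypothetical helper `power()`, which is defined only for this illustration and is not part of R or of this notebook:

```r
# Hypothetical helper for illustration: 'exponent' has the default value 2
power <- function(base, exponent = 2) {
  base^exponent
}

power(3)                      # the default exponent is used: 3^2
power(2, 3)                   # positional matching: base = 2, exponent = 3
power(exponent = 3, base = 2) # named arguments may come in any order
power(3, 2)                   # swapping unnamed arguments changes the result
```

Naming the arguments costs a few keystrokes, but it protects against accidentally swapped values.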

Here are some more basic mathematical operations which have the structure function():

In [26]:
sqrt(9)        # Square root of 9
log(9,base=3)  # Logarithm of 9, base 3. 
               # If the base is not specified, the default base is the mathematical constant e. 
log(9)         # Thus, this returns the natural logarithm of 9
log2(16)       # This is an example of a prespecified log-function with base 2
log10(100)     # And this with base 10.
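
The `base` argument can be checked against the change-of-base rule $\log_b(x) = \ln(x)/\ln(b)$. The following sketch compares both ways of calculating; `all.equal()` is used instead of `==` because results of floating-point calculations may differ in the last digits:

```r
x <- 9
log(x, base = 3)      # logarithm of 9 to base 3
log(x) / log(3)       # the same value via natural logarithms
all.equal(log(x, base = 3), log(x) / log(3))  # comparison with a numerical tolerance
```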

Finally: two examples of mathematical constants:

In [27]:
exp(1)         # The mathematical constant e
pi             # The mathematical constant pi

And absolute value or modulus (in German: "Betrag"):

In [28]:
abs(-1.4)

### Small numbers<a name="small">
In R, as in other programming languages, very small (and very large) numbers are presented in scientific notation (e-notation):

In [29]:
4^-16 

The output has the form 2.3283064365387e-10. This is 0.00000000023283064365387, because:

$x\mathrm{e}y = x \cdot 10^{y}, \quad y \in \mathbb{Z}$

In this example, x is 2.3283064365387 and y is -10.

You can also use `e` in your input code line:

In [30]:
2.33e-3
2.33e3
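
As a small sketch of the notation, the following lines compare the e-notation with the written-out calculation. The object name `small` is chosen only for this example; `format()` with `scientific = FALSE` returns the number as a character string in fixed notation:

```r
small <- 4^-16
small                              # shown in e-notation
format(small, scientific = FALSE)  # the same number written out in fixed notation
all.equal(2.33e-3, 2.33 * 10^-3)   # e-notation is shorthand for x * 10^y
all.equal(2.33e3, 2330)
```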

## Object assignment and the workspace<a name="assignment">

Let's assign number 9 to an object called "number".  
By convention, the recipient of assignment is written on the left-hand side:

In [2]:
number <- 9

An equal sign does the same; here, the recipient of the assignment **must (!)** be on the left-hand side:

In [4]:
number = 9

Let's print it:

In [6]:
number
print(number)

[1] 9


The bracket [ ] in the output indicates a position, later we'll come back to this issue

Let's overwrite this object:

In [8]:
(number <- 5)  # Parentheses around object assignment: return it immediately

Let's create more objects whose names are variations of "number".

Regarding object names, keep in mind:
- names are case sensitive
- numbers can be part of the object name, as long as the name doesn't start with a number
- `.` or `_` can also be part of the object name; a name may not start with `_`, and it may start with `.` only if no number follows (such names are hidden from `ls()`)
- mathematical / boolean operators may not be part of an object name

In [15]:
Number <- 10          
number1 <- 1          
my.Number1 <- number1

After having specified some objects in the cells above, let's take a closer look at them.
Objects 'number' and 'Number' are different objects with different elements:

In [None]:
number         
Number         
number == Number

On the other hand, these are different objects with the same element:

In [None]:
number1   
my.Number1       
number1 == my.Number1

The objects (number, my.Number1 ...) are stored in the R workspace. In Jupyter, the workspace is active as long as the notebook is active. You can make use of these objects in every cell of a notebook as long as they are stored in the workspace.

After a shutdown of the notebook the workspace will be empty when you start this notebook again.

If you want to know all objects that are currently stored in the workspace, type:  
`ls()`  
This returns all objects in the workspace.

`rm()` removes an object from workspace:

In [None]:
rm(my.Number1)

Now, the object 'my.Number1' is deleted. Check it by executing `ls()` again.  
This also means that executing 'my.Number1' would produce an error now.

This removes all objects from the current workspace:

In [None]:
rm(list=ls())

## Vectors<a name="vectors">
The simplest and most essential data structure in R is the vector, which is a single entity consisting of an ordered collection of elements. It's one-dimensional.

### Combinations and sequences of elements<a name="seq">

Even typing and executing a single number will produce a vector:

In [None]:
5

How many (ordered) elements does the vector have, i.e. what is its length?

In [None]:
length(5)

To create a numeric vector with length > 1, one possibility is to combine or concatenate any numbers using the function `c()`, e.g.

In [None]:
c(1,3,6,-0.1)

Let's assign this vector to an object and do some inspections:

In [None]:
myNumbers <- c(1,3,6,-0.1)
length(myNumbers)

Another possibility to create a vector is by repeating a number, e.g. by repeating the number 4 seven times using the function `rep()`

In [None]:
rep(4, times=7)
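
`rep()` also accepts a vector of length > 1. As a short sketch, the arguments `times` and `each` lead to different orders of the repeated elements (the object names are chosen only for this example):

```r
repTimes <- rep(c(1, 2), times = 3)  # repeat the whole vector three times: 1 2 1 2 1 2
repEach  <- rep(c(1, 2), each = 3)   # repeat each element three times:     1 1 1 2 2 2
repTimes
repEach
```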

You may also want to create a sequence of numbers using function `seq()`, e.g.

In [None]:
-4:1                     # A sequence of integers from -4 to 1, same output as
seq(from=-4, to=1, by=1) # or
seq(-4,1,1)              # ... you immediately see the default order of arguments here!

The `seq()` function takes numbers as arguments. Instead of specifying the arguments by numbers, you can also use a variable that has been tied to a number:

In [None]:
myEndofSequence <- 10
mySequence <- seq(1,myEndofSequence,0.5)
mySequence
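
Instead of the step width `by`, `seq()` can also be given the desired number of elements via the argument `length.out`. A minimal sketch (the object name `quarters` is chosen only for this example):

```r
quarters <- seq(from = 0, to = 1, length.out = 5)  # five equally spaced values from 0 to 1
quarters
length(seq(1, 10, 0.5))  # 19 elements: 1.0, 1.5, ..., 10.0
```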

Finally, we can easily combine two or more vectors into a new one by using function `c()`.  
(Python users: there will be no nesting of vectors)

In [None]:
vector1  <- c(2,3,4,7)
vector2  <- c(10,23)

vector12 <- c(vector1,vector2)

vector12
class(vector12)

### Vector classes<a name="classes">

Vectors have specific classes. Here is an example of a numerical vector:

In [None]:
mynumb <- c(-1.2,6,2)
mynumb
length(mynumb)
class(mynumb)

Remember how to produce sequences of integers.   
R automatically classifies numbers as numeric or integer. Integers are a subset of numerics.   
Sometimes, this classification works: Let's create an integer vector:

In [None]:
myseq <- 1:5
myseq
class(myseq)

Many times, it doesn't work (though I am not aware of any negative consequence of this fact). For example, a single number is always treated as numeric:

In [18]:
5
class(5)

Function `seq` also treats all numbers as numeric:

In [None]:
myseq <- seq(1,5,1)
myseq
class(myseq)
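
If an integer vector is needed, it can be requested explicitly. The following sketch uses the suffix `L` and the function `as.integer()`; the object name `myint` is chosen only for this example:

```r
class(5)                           # "numeric"
class(5L)                          # "integer": the suffix L requests an integer
myint <- as.integer(seq(1, 5, 1))  # explicit conversion of a numeric vector
class(myint)                       # "integer"
all(myint == 1:5)                  # TRUE: the values themselves are the same
```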

Remember `ls()` returns a vector of all objects in the workspace.  
This is a vector of character strings, indicated by the quotation marks around all vector elements when printed.

Let's check it:

In [None]:
ls()

class(ls())

Let's create a vector of character strings of length = 3

In [None]:
mychar <- c("Dorte","Sebastian","Frodo Baggins")
mychar
class(mychar)

Finally, let's create a vector of logical values:

In [None]:
mylogicals <- c(TRUE,FALSE,FALSE,FALSE)
mylogicals
class(mylogicals)

Instead of typing TRUE or FALSE, a T or an F suffices

In [None]:
mylogicals <- c(T,F,F,F)

We will not consider all classes right now.

Important: The class of a given vector is *exclusive*, there can only be one vector class of a given vector.

If you combine a numerical and character vector to a single vector, numerical vector elements are converted to characters. You cannot do calculations such as summation with character strings any more!

In [None]:
myvector <- c(mynumb,mychar)
myvector
class(myvector)

If the combined vector elements are numerical and logical, T becomes 1 and F becomes 0

In [None]:
myvector <- c(mynumb,mylogicals)
myvector
class(myvector)
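
The conversion rules above can be summarised in a short sketch. It also shows `as.numeric()`, which converts a character string back to a number, as long as the string represents a number:

```r
class(c(1, "a"))             # "character": numbers are converted to strings
class(c(1, TRUE))            # "numeric": TRUE becomes 1
sum(c(TRUE, FALSE, TRUE))    # 2: summing logicals counts the TRUE values
as.numeric("3.5") + 1        # 4.5: explicit conversion back to numeric
```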

### Position index<a name="PI">

There are two different ways to return a vector:

In [None]:
5
print(5)

A difference in the output only occurs in the Jupyter environment, not in the console of the R environment.  
Above, the 1 in brackets shows the position of vector element presented on the left-most side of the output row.  
Remember, just typing a number in R returns a numeric vector with length=1.

Now take a look at this vector and its position indices:

In [None]:
seqUp <- 1:100
print(seqUp)

And compare it to:

In [1]:
seqDown <- 100:1
print(seqDown)

  [1] 100  99  98  97  96  95  94  93  92  91  90  89  88  87  86  85  84  83
 [19]  82  81  80  79  78  77  76  75  74  73  72  71  70  69  68  67  66  65
 [37]  64  63  62  61  60  59  58  57  56  55  54  53  52  51  50  49  48  47
 [55]  46  45  44  43  42  41  40  39  38  37  36  35  34  33  32  31  30  29
 [73]  28  27  26  25  24  23  22  21  20  19  18  17  16  15  14  13  12  11
 [91]  10   9   8   7   6   5   4   3   2   1


If you want to return only the vector element at the 91st position of vector seqDown, use the index brackets:

In [None]:
seqDown[91]

If you want to return all seqDown vector elements except the one at the 91st position:

In [None]:
seqDown[-91]

Unfortunately, this Jupyter-specific presentation is quite chaotic. To clearly arrange your output like in the standard R environment, use `print()` again:

In [2]:
print(seqDown[-91])

 [1] 100  99  98  97  96  95  94  93  92  91  90  89  88  87  86  85  84  83  82
[20]  81  80  79  78  77  76  75  74  73  72  71  70  69  68  67  66  65  64  63
[39]  62  61  60  59  58  57  56  55  54  53  52  51  50  49  48  47  46  45  44
[58]  43  42  41  40  39  38  37  36  35  34  33  32  31  30  29  28  27  26  25
[77]  24  23  22  21  20  19  18  17  16  15  14  13  12  11   9   8   7   6   5
[96]   4   3   2   1


As you already know, -91 is itself a vector of length 1 and class numeric. In R, you use a vector to specify positions within another vector.

So, in order to print the second and the last seqDown vector element, you can use this vector  
`c(2,100)`   
to specify the index:

In [None]:
seqDown[c(2,100)]
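
Here is a compact sketch of the different kinds of index vectors, using a small vector `x` that is defined only for this example:

```r
x <- c(10, 20, 30, 40, 50)
x[c(1, 3)]    # elements at positions 1 and 3: 10 30
x[2:4]        # a sequence as index: 20 30 40
x[-c(1, 3)]   # everything except positions 1 and 3: 20 40 50
```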

###  Exercise Position Index<a name="ex_one">
1. Specify the position index of vector seqDown in such a way that it returns the elements at position 2, 4, 6 and 8

2. Specify the position index in such a way that it always returns the last element of a vector. It should work even if you don't know the number of elements of your vector. 

3. Check your solution by use of the vector:     
    `myRandVec <- 2:sample(2:300,size=1)` .   
The elements of this vector are a sequence of integers from 2 to a random integer between 2 and 300. So you do not know how many vector elements there are.

## Two-dimensional data structures<a name="two-dim">

R provides several classes of data structures beyond the vector, such as lists, arrays, matrices and data frames.  
This notebook introduces the matrix and the data frame.

### Matrix<a name="matrix">

A matrix is a two-dimensional array with
- same length of all columns and same length of all rows
- same class of elements

There are several ways to create a matrix.

#### Binding<a name="bind">

You can create a matrix by means of binding (one-dimensional) vectors.  

Requirement: The vectors have to be of the same length and must have the same vector class.

Let's create two vectors, e.g.

In [20]:
a <- c(1,2,4)
b <- c(6,7,-9)

You can bind these vectors vertically. In this case, each vector becomes a *column* of the matrix. Make use of the function `cbind()` (you may remember this function as "column-bind")

In [None]:
cbind(a,b)

Or you can bind the vectors horizontally. In this case, each vector becomes a *row* of the matrix. Make use of the function `rbind()` (you may remember this function as "row-bind")

In [None]:
rbind(a,b) 

Instead of using objects a and b, you can also write the vectors directly within the `cbind()` or `rbind()` function.

In this case, there are no column or row names:

In [None]:
rbind(c(1,2,4),c(6,7,-9))

#### Function `matrix()`<a name="func_mat">

Function `matrix()` creates a matrix from a given set of values.   

Take the six values of the vector c(1,2,3,4,5,6). To create a two-dimensional data structure, the number of rows multiplied by the number of columns must be equal to the number of values we have, which is 6.   
Dimensions of the matrix to be created could be:
- 1 row, 6 columns
- 2 rows, 3 columns
- 3 rows, 2 columns
- 6 rows, 1 column

In [26]:
matrix(c(1,2,3,4,5,6), nrow=2, ncol=3)

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
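
By default, `matrix()` fills the matrix column by column. As a sketch, the argument `byrow = TRUE` fills it row by row instead (the object names are chosen only for this example):

```r
byColumn <- matrix(c(1,2,3,4,5,6), nrow = 2, ncol = 3)                # default: column by column
byRow    <- matrix(c(1,2,3,4,5,6), nrow = 2, ncol = 3, byrow = TRUE)  # row by row
byColumn[1, ]  # first row: 1 3 5
byRow[1, ]     # first row: 1 2 3
```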


#### Function `dim()`<a name="func_dim">

Here is another way to create a matrix. Take a single vector and convert it to a matrix

In [None]:
d <- 1:15
mymatrix <- matrix(d)

The matrix above has 15 rows and 1 column. It has the dimensions (dim) 15 and 1, i.e. the number of rows and the number of columns.

Remember: **the first value refers to rows, the second to columns**

The following line returns a vector showing the dimensions:

In [None]:
dim(mymatrix)

Let's manipulate these dimensions. 

We do that by changing the dimension-vector returned by `dim()`. Be careful: the number of rows multiplied by the number of columns must be equal to the number of elements of mymatrix, which is 15.   
(3 * 5 = 15)

In [None]:
dim(mymatrix) <- c(3,5)
mymatrix
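
The following sketch shows what happens if the dimensions do not match the number of elements. It uses a separate object `m` (so that mymatrix stays untouched) and `tryCatch()`, which catches the error so that the cell still runs to the end; the messages are wording chosen for this example:

```r
m <- 1:15
dim(m) <- c(3, 5)   # 3 rows, 5 columns; filled column by column
m[2, 3]             # 8: second row, third column

# 4 * 4 = 16 does not match the 15 elements, so the assignment fails
result <- tryCatch({
  dim(m) <- c(4, 4)
  "dimensions changed"
}, error = function(e) "error: dimensions do not match the number of elements")
result
```

After the failed assignment, `m` still has its previous dimensions 3 and 5.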

#### Exercise Matrix Building<a name="ex_mat">

Create two matrices called 'mymat1' and 'mymat2'. These matrices should look like:

mymat1

&nbsp;&nbsp;1&nbsp;&nbsp;&nbsp;   2  
&nbsp;&nbsp;3&nbsp;&nbsp;&nbsp;   4  
&nbsp;-4 &nbsp;&nbsp; 6

mymat2

&nbsp;&nbsp; 1 &nbsp;&nbsp;&nbsp;  3  
&nbsp;&nbsp; 2 &nbsp;&nbsp;&nbsp;  4  
&nbsp; -4 &nbsp;&nbsp;&nbsp;  4

Insert a new cell below this cell and create these matrices.

Some inspection of these matrices:

In [None]:
class(mymat1) # It should be a matrix
dim(mymat1)   # It should have 3 rows and 2 columns
str(mymat1)   # The structure shows
              #  - elements within rows 1 to 3 and columns 1 to 2 are numerical
              #  - and have the values 1 3 -4 2 4 6 

#### Position index<a name="PImat">

'mymat1' should look like this:

mymat1   
&nbsp;&nbsp;1 &nbsp;&nbsp;&nbsp; 2  
&nbsp;&nbsp;3 &nbsp;&nbsp;&nbsp; 4  
&nbsp; -4 &nbsp;&nbsp; 6

Let's only select the second row. Make use of the position index in brackets: mymat1[ , ]  
Remember: the left-hand side refers to rows, the right-hand side to columns.

Thus `mymat1[2,]` will return the second row of the matrix 'mymat1' (as a vector):

In [None]:
mymat1[2,]

Knowing this, we can also select only the second column:

In [None]:
mymat1[,2]

The 6 in 'mymat1' has the position third row, second column:

In [None]:
mymat1[3,2]

`print()` demonstrates the matrix structure:

In [None]:
print(mymat1)

#### Some basic vector and matrix operations<a name="vector-ops">

Let's multiply all vector elements by the same number.

Vector 'a' has the three elements 1, 2 and 4, and all of them should be multiplied by 1.5.   
To do so in R, just execute:

In [None]:
a*1.5

The same can be done with summation: add number 1.5 to all vector elements:

In [None]:
a+1.5

If you want to sum up all vector elements, use function `sum()`:

In [None]:
sum(a)

The same can be done if we want to multiply all elements of a matrix by the same number.  
Let's take mymat1 again and multiply all matrix elements by the number 1.5:

In [None]:
mymat1 <- cbind(c(1,3,-4),c(2,4,6))
mymat1
mymat1*1.5

Add 1.5 to all elements of the matrix:

In [None]:
mymat1+1.5

By use of the position index, we can add 1.5 to the elements of a specific row or column.  

For example, take the first row of the matrix and add 1.5:

In [None]:
mymat1[1,]+1.5

*Caution*: by doing so, matrix mymat1 remains unchanged. The first row is still the original row.

You may want to create an updated matrix by overwriting the original matrix.  

In this example, you do this by taking the first row of mymat1, and assign the sum of first row elements and 1.5 to it:

In [None]:
mymat1[1,] <- mymat1[1,]+1.5

Now, mymat1 is updated:

In [None]:
mymat1
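
The operators also work element by element between two vectors of the same length. Here is a short sketch using the vectors a and b from the binding section, defined again so that the cell runs on its own:

```r
a <- c(1, 2, 4)
b <- c(6, 7, -9)
a + b        # element-wise sum:     7  9  -5
a * b        # element-wise product: 6 14 -36
sum(a * b)   # -16: the sum of the element-wise products
```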

## Two Dimensional Data Structures II <a name="data_structures2">

### Matrix and Data Frame - Important differences<a name="data_frame">
For the purposes of our course (managing, manipulating and analysing data), the `data frame` is the most important two-dimensional data structure in R. When datasets stored in `csv` or `txt` file formats are read into R, they are converted into a data frame by default. 

Like a `matrix`, all columns of the data frame have equal length, as do all rows. Also, data of a given column are of the same class (e.g. numerical, integer, character string, logical, factor).  
But in contrast to matrices, different columns of a data frame may comprise different classes of data.

We can demonstrate this by constructing a data frame.  
Let there be three vectors &ndash; of same length &ndash; with vector classes `character` and `numeric`
    
`subject <- c("A","B","C","A","B","C")`  
`test <- c(1,1,1,2,2,2)`  
`score <- c(22,31,10,20,45,2)`

In [32]:
mymatrix <- cbind(subject,test,score)
print(mymatrix)

     subject test score
[1,] "A"     "1"  "22" 
[2,] "B"     "1"  "31" 
[3,] "C"     "1"  "10" 
[4,] "A"     "2"  "20" 
[5,] "B"     "2"  "45" 
[6,] "C"     "2"  "2"  


One way to create a matrix is to bind vectors; here we use `cbind()` for column-wise binding:

`mymatrix <- cbind(subject,test,score)`

Caution: The standard Jupyter notebook output (Jupyter Lab Version 0.34.9) shows a table which looks fine, but this is misleading. As a matrix may hold only a single vector class and this matrix is built from numerical data and character strings, the numerical data are "downgraded" to character strings, which is a nominal data type (see next session). As a consequence, e.g. summing up all data of variable score is impossible:

`sum(mymatrix[,3])` produces an error.

In Jupyter, it is essential to use the `print()` function in order to create a clearly arranged output that covers basic or standard information of data structures in R.   
Now you may immediately recognise that all variables consist of character strings marked by quote signs:

In [153]:
print(mymatrix)

     subject test score
[1,] "A"     "1"  "22" 
[2,] "B"     "1"  "31" 
[3,] "C"     "1"  "10" 
[4,] "A"     "2"  "20" 
[5,] "B"     "2"  "45" 
[6,] "C"     "2"  "2"  


Now create a data frame (called mydf) of the three variables using function `data.frame()` and compare the output with the matrix output above:

In [34]:
mydf <- data.frame(subject,test,score)
print(mydf) 

  subject test score
1       A    1    22
2       B    1    31
3       C    1    10
4       A    2    20
5       B    2    45
6       C    2     2


If your intuition is that this data frame has no character string data, you're right.  
Inspection of the data frame using function `str()` provides information about its structure:

In [155]:
str(mydf)

'data.frame':	6 obs. of  3 variables:
 $ subject: Factor w/ 3 levels "A","B","C": 1 2 3 1 2 3
 $ test   : num  1 1 1 2 2 2
 $ score  : num  22 31 10 20 45 2


Inspection of data structures will be covered in the next section of this notebook. For now, it is important to see that the data frame consists of different types of data: variables test and score are numerical vectors, whereas variable subject is a vector of class factor (the character strings have been converted to a factor; this is the default behaviour of `data.frame()` before R 4.0.0, later versions keep character strings unless `stringsAsFactors = TRUE` is set).   
As such, this correctly sums up all score values:

In [156]:
sum(mydf[,3])
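
The difference between the two structures can be sketched in one cell. The vectors are defined again so that the cell runs on its own; `as.numeric()` converts the character strings of the matrix column back to numbers:

```r
subject <- c("A","B","C","A","B","C")
test    <- c(1,1,1,2,2,2)
score   <- c(22,31,10,20,45,2)

mymatrix <- cbind(subject, test, score)       # character matrix
mydf     <- data.frame(subject, test, score)  # columns keep their own classes

class(mymatrix[, "score"])             # "character"
sum(as.numeric(mymatrix[, "score"]))   # 130: works only after explicit conversion
sum(mydf$score)                        # 130: the data frame kept the numbers
```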

### Inspection of the data frame<a name="inspection">
Remember that an object will be stored in the workspace until shutdown of the notebook (use function `ls()` to show all objects stored).   
We will do some inspections on the object mydf:
    
- Object class: `class()`

- Number of rows and number of columns: `nrow()` , `ncol()`

- dimensions: `dim()` (showing both rows and columns; in that order)

In [35]:
dim(mydf)

- names of columns: `colnames()`

In [36]:
colnames(mydf)

- the "head" of the data frame (the first six rows): `head()`  
- the "tail" of the data frame (the last six rows): `tail()`  
  - both functions provide a good and quick check when a data frame is created, though Jupyter output misses important information unless executing `print(head())`

- structure of the data frame:   

In [157]:
str(mydf)

'data.frame':	6 obs. of  3 variables:
 $ subject: Factor w/ 3 levels "A","B","C": 1 2 3 1 2 3
 $ test   : num  1 1 1 2 2 2
 $ score  : num  22 31 10 20 45 2


This output contains:  
A. Information about object mydf: object class is data.frame, which has 6 rows (observations) and 3 columns (variables)  

B. Information about all columns/variables
 1. column subject: categorical vector of class "Factor" with 3 levels "A", "B", "C", followed by the first values of this vector
 2. column test: numerical vector, followed by the first values of this vector
 3. column score: numerical vector, followed by the first values of this vector


### Creating Subsets<a name="subsets">

#### Index<a name="index">
Like in matrices, indices can be used to specify subsets of a data frame. The first index refers to rows, the second to columns: 
    
`df[row,col]`

For data frame mydf, `mydf[,3]` returns the third column as a vector:

What follows here are vector inspections regarding the third column of the data frame:
- head and tail of the vector: `head(mydf[,3])`, `tail(mydf[,3])`
- length: `length(mydf[,3])` giving the number of vector elements

- vector class: `class(mydf[,3])`

In [37]:
class(mydf[,3])

As this is a numerical vector, we can also show (for example):
- the maximum value: `max(mydf[,3])`
- the minimum value: `min(mydf[,3])`
- min and max value: `range(mydf[,3])`

Remember that the output of all of these inspections is again a vector. As such, you can index the output of `range(mydf[,3])`, for example to show that the first element of the output vector, `range(mydf[,3])[1]`, equals the minimum value `min(mydf[,3])`.

In [40]:
range(mydf[,3])

The first column of the data frame is a vector of class factor. You immediately recognise it when executing `mydf[,1]`. It is a categorical vector with three levels.

In [44]:
range(mydf[,3])[1] != min(mydf[,3])  # FALSE, because both values are equal

Use both position indices to indicate a single value of the data frame, e.g. second row second column: `mydf[2,2]`

Use combinations of numbers with `c()` or sequences like `2:4` to index more than one row or column position. 

For example:
`mydf[2:4,c(1,3)]`

Remember that in all indexing examples above, a numerical or integer vector was used to index row or column positions. Values of these vectors were positive.

If negative vector elements are used, indexed rows or columns are excluded.  
For example:

`mydf[-(2:4),]`

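returns the original data frame *except* rows 2 to 4. In the same way, a negative column index excludes columns; this returns the data frame *except* its third column: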

In [45]:
mydf[,-3]

  subject test
1       A    1
2       B    1
3       C    1
4       A    2
5       B    2
6       C    2
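
One detail of data frame indexing is worth a sketch: selecting a single column returns a vector, whereas the additional argument `drop = FALSE` keeps the data frame structure. The data frame is defined again so that the cell runs on its own:

```r
mydf <- data.frame(subject = c("A","B","C","A","B","C"),
                   test    = c(1,1,1,2,2,2),
                   score   = c(22,31,10,20,45,2))

class(mydf[, 3])                 # "numeric": a single column is returned as a vector
class(mydf[, 3, drop = FALSE])   # "data.frame": the two-dimensional structure is kept
dim(mydf[-(2:4), ])              # 3 3: rows 2 to 4 are excluded
```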


#### `$` Column Selection<a name="dollar">
Instead of using the second position index, the symbol `$` can be used to select a specific column of a data frame, followed by one of the `colnames()`, e.g.  
`mydf$score` returns the third column or variable `score` as a vector:

In [49]:
mydf$score

Column selection using `$` is an intuitive way when dealing with specific variables. The matrix data structure does not support this syntax.
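
As a sketch, the following lines compare three ways of selecting the same column. The small data frame is an assumption made only for this example; the double brackets `[[ ]]` are a further selection syntax that takes the column name as a character string:

```r
mydf <- data.frame(test = c(1, 1, 2), score = c(22, 31, 20))

identical(mydf$score, mydf[, "score"])    # TRUE: same vector as with the column index
identical(mydf$score, mydf[["score"]])    # TRUE: same vector as with double brackets
```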

#### Logical Expressions<a name="log_index">
Not only numerical/integer, but also logical vectors are used to index subsets of data frames, matrices or vectors. By the use of logical expressions a subset can be created that exactly meets the criteria of that expression.

Let's say, the subset should cover only and all rows for which this expression is `TRUE`:
    
`mydf$test == 1`

To do so, this logical vector is used as row index of the data frame:

`mydf[mydf$test == 1,]`

In [50]:
mydf$test == 1

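Let's extend the logical expression and create another subset for which the expressions "test value equals 1" AND "score value greater than or equal to 20" are both true:  

`mydf$test==1 & mydf$score >= 20`

Logical vectors can also be combined in other ways. With `A` being the first and `B` the second expression, the exclusive OR `xor(A,B)` selects the rows for which exactly one of the two expressions is true:

In [59]:
A <- mydf$test == 1; B <- mydf$score >= 20  # the two logical vectors from above
print(mydf[xor(A,B),])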

  subject test score
3       C    1    10
4       A    2    20
5       B    2    45
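
The AND expression from this section can be sketched as follows. The data frame is defined again so that the cell runs on its own; the object name `keep` is chosen only for this example:

```r
mydf <- data.frame(subject = c("A","B","C","A","B","C"),
                   test    = c(1,1,1,2,2,2),
                   score   = c(22,31,10,20,45,2))

keep <- mydf$test == 1 & mydf$score >= 20   # TRUE only where both expressions are TRUE
keep
print(mydf[keep, ])                         # rows 1 and 2 remain
```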


#### Subset Function<a name="subset">
Function `subset()` provides a convenient way to create a subset of a given data frame, matrix or vector using logical expressions.

Here is a subset of mydf which consists of all rows for which the expression "test value equals 2" is true:
    
`subset(mydf, test==2)`  

Here is another subset for which the expressions "test value equals 2" and "score value greater than or equal to 20" are both true:  

`subset(mydf, test==2 & score >=20)`

Be careful. The `subset()` output is a separate object *apart* from the original data frame. As such, manipulations within such a subset have no consequences for the original data frame.
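
Here is a minimal sketch of this independence. The data frame is defined again so that the cell runs on its own; the object name `mysub` is chosen only for this example:

```r
mydf <- data.frame(subject = c("A","B","C","A","B","C"),
                   test    = c(1,1,1,2,2,2),
                   score   = c(22,31,10,20,45,2))

mysub <- subset(mydf, test == 2 & score >= 20)  # rows 4 and 5 of mydf
mysub$score[1] <- 0                             # change a value in the subset only
mysub$score[1]                                  # 0
mydf$score[4]                                   # still 20: mydf is unchanged
```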

### Manipulating Data or Data Frame Attributes<a name="manipulation">

#### Changing column or variable names<a name="colnames">
To change column or variable names, the function `colnames()` we know from data frame inspection is used. 

`colnames(mydf)` gives a vector of character strings of column names:

In [60]:
colnames(mydf)
class(colnames(mydf))

Changing the column names of a data frame is done by changing these vector elements.   

Take the third element of colnames vector:
`colnames(mydf)[3]`

You can now change or overwrite this element with a different character string. Say, we want the corresponding variable within our data frame to be called "scores" (the plural):

`colnames(mydf)[3] <- "scores"`

A quick look at the head of our data frame with `head(mydf)` shows the result of that manipulation.

#### Changing single values<a name="singlevals">
Errors may occur when dealing with data, and correcting these data may become mandatory. However, it becomes a messy enterprise if you cannot track your data manipulation in retrospect. For example, using spreadsheets and just overwriting single cells makes your manipulations untrackable.   
R scripts or (Jupyter) notebooks make it easy to track manipulations in retrospect; they facilitate literate programming.

The principle is the same as in changing column names: take a subset of data or data structure attribute and change / overwrite it.
    
Let's say the score of subject C test 1 was 12, not 10. One possibility to change that score is to specify the corresponding position by use of its position index `mydf[3,3]`, and overwrite the element at that position:
    
`mydf[3,3] <- 12`
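As a sketch, the correction can be made by position, as above, or by a logical condition that names the subject and the test. The second variant is an alternative not used in the text; the data frame is a stand-in rebuilt from the values printed in this notebook, and `mydf2` is a hypothetical copy.

```r
# Stand-in for the notebook's data frame (assumed column names)
mydf <- data.frame(
  subject = c("A", "B", "C", "A", "B", "C"),
  test    = c(1, 1, 1, 2, 2, 2),
  score   = c(22, 31, 10, 20, 45, 2)
)
mydf2 <- mydf  # copy for the alternative approach

# Variant 1: address the cell by row and column position
mydf[3, 3] <- 12

# Variant 2: address the cell by a logical condition on the rows
mydf2$score[mydf2$subject == "C" & mydf2$test == 1] <- 12

mydf[3, ]
mydf2[3, ]
```

The condition-based variant keeps working if the rows are reordered, whereas the position index silently points at a different cell.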

#### Extend the Data Frame<a name="extend">

There are different strategies to extend the data frame.  

- One way is binding, using the functions `cbind()` or `rbind()`.

Say we want to add a second score to the data frame as a fourth column, namely

`score2 <- c(24,53,33,22,11,14)`

We can use columnar binding to extend the data frame:

`cbind(mydf,score2)`

CAUTION: This leaves the original data frame unchanged. The function `cbind()` takes the data frame and the vector as arguments, binds them, and returns the result. That's all. The only way to keep the extended data frame in the workspace is to assign it to an object (a new object, or the existing mydf, which is then overwritten), e.g.

`mydf <- cbind(mydf,score2)`

- Another way to achieve the same result is:

`mydf[,4] <- score2`

and adding the variable name:

`colnames(mydf)[4] <- "score2"`

- Finally, you may add a new column called "score2" to the right of the existing data frame by using the `$`-sign, and assign the values to that column:

`mydf$score2 <- score2`

In [63]:
mydf[,4] <- c(1,2,3,4,5,6)
mydf

subject,test,score,V4
A,1,22,1
B,1,31,2
C,1,10,3
A,2,20,4
B,2,45,5
C,2,2,6
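The three strategies can be compared side by side in a short sketch. The data frame is a stand-in rebuilt from the values printed above; the object names `by_cbind`, `by_index`, `by_dollar` and `with_row`, and the extra row for a subject "D", are hypothetical and only serve the illustration.

```r
# Stand-in for the notebook's data frame (assumed column names)
mydf <- data.frame(
  subject = c("A", "B", "C", "A", "B", "C"),
  test    = c(1, 1, 1, 2, 2, 2),
  score   = c(22, 31, 10, 20, 45, 2)
)
score2 <- c(24, 53, 33, 22, 11, 14)

# Strategy 1: columnar binding (the result must be assigned to keep it)
by_cbind <- cbind(mydf, score2)

# Strategy 2: assign to a new column index, then name the column
by_index <- mydf
by_index[, 4] <- score2
colnames(by_index)[4] <- "score2"

# Strategy 3: create the column with the $ sign
by_dollar <- mydf
by_dollar$score2 <- score2

colnames(by_cbind)
colnames(by_index)
colnames(by_dollar)

# rbind() works the same way for rows; the new row is invented for illustration
with_row <- rbind(mydf, data.frame(subject = "D", test = 1, score = 30))
nrow(with_row)
```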


#### Delete Columns
The knowledge of data frame management and manipulation gained above is sufficient to understand column deletion. Remember that negative column indices exclude columns. As such, this returns the data frame mydf except its fourth column:

`mydf[,-4]`

For permanent deletion of that data frame column in workspace, mydf has to be overwritten:

`mydf <- mydf[,-4]`
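A small sketch of column deletion on a stand-in data frame (values rebuilt from this notebook, column names assumed). Besides the negative index, it shows assigning `NULL` to a column, which is an alternative not covered in the text above.

```r
# Stand-in data frame that already contains the fourth column
mydf <- data.frame(
  subject = c("A", "B", "C", "A", "B", "C"),
  test    = c(1, 1, 1, 2, 2, 2),
  score   = c(22, 31, 10, 20, 45, 2),
  score2  = c(24, 53, 33, 22, 11, 14)
)

# Variant 1: exclude the fourth column by a negative index
by_index <- mydf[, -4]

# Variant 2: remove the column by name by assigning NULL
by_name <- mydf
by_name$score2 <- NULL

colnames(by_index)
colnames(by_name)
```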

#### Change Order of Columns<a name="change_columns">
Again, take a look at the column names. `colnames()` returns a character vector whose length equals the number of columns (`ncol`):
    
`colnames(mydf)`

In [175]:
colnames(mydf)

Say we want "test" to be the first and "subject" to be the second column. To achieve this, create a character vector that contains the exact column names in the desired order, and place it in the column index position of mydf:

`mydf[,c("test", "subject", "score")]`

or

`mydf[,c(colnames(mydf)[2], colnames(mydf)[1], colnames(mydf)[3])]`

Finally, overwrite the original data frame:

`mydf <- mydf[,c("test", "subject", "score")]`
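The reordering can be sketched as follows, again on a stand-in data frame with assumed column names. The numeric index vector `c(2, 1, 3)` is an equivalent alternative to spelling out the names; `by_name` and `by_pos` are hypothetical object names.

```r
# Stand-in for the notebook's data frame (assumed column names)
mydf <- data.frame(
  subject = c("A", "B", "C", "A", "B", "C"),
  test    = c(1, 1, 1, 2, 2, 2),
  score   = c(22, 31, 10, 20, 45, 2)
)

# Reorder by column names
by_name <- mydf[, c("test", "subject", "score")]

# Reorder by column positions
by_pos <- mydf[, c(2, 1, 3)]

colnames(by_name)
colnames(by_pos)
```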

## Save, load, and import files<a name="import">

### The Working Directory<a name="wd">
The working directory (wd) is the location in the file system to which files are saved and from which files are imported by default, i.e. whenever only a file name or a relative path is given. In Jupyter, the working directory is automatically set to the location within the Jupyter file system where the notebook is stored. 

`getwd()` returns the current wd-path:

In [64]:
getwd()

The working directory is changed with the function `setwd()`, which takes the new path as its argument (written as a character string, in quotes):
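A minimal sketch of switching the working directory and switching back. The session's temporary folder, returned by `tempdir()`, is used here only as a target that is guaranteed to exist; replace it with your own path.

```r
# Remember the current working directory
old_wd <- getwd()

# Switch to another existing folder (here: the session's temporary folder)
setwd(tempdir())
getwd()

# Switch back to where we started
setwd(old_wd)
getwd()
```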

### Save Objects like Vectors, Matrices or Data Frames as `.RData`-File<a name ="save">
Function `save()` is used to store an external representation of an object as an `.RData`-file in the current working directory (or in another directory, if a full path is given).

It is mandatory to indicate:
- which object should be saved and
- the file name (ideally the same as the object name, though you are free to choose another one).

Save an external representation of mydf as "mydf.RData" in the current working directory:
    
`save(mydf, file="mydf.RData")`
    
Save it in another directory, e.g. in your JupyterHub home folder:

`save(mydf, file="/home/timo/mydf.RData")` (this is my home folder; change the path accordingly).

### Load `.RData`-Files<a name="load">

Function `load()` reloads `.RData`-files, i.e. object representations that have been saved with function `save()`.

`load("mydf.RData")` reloads the saved dataset from the working directory.

`load("/home/timo/mydf.RData")` reloads the dataset saved in (my) JupyterHub Home folder.
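A save-and-load round trip can be sketched like this. The file is written to the session's temporary folder so that the example runs anywhere; the data frame is a stand-in rebuilt from the values in this notebook, and `rdata_path` and `restored` are hypothetical names.

```r
# Stand-in for the notebook's data frame (assumed column names)
mydf <- data.frame(
  subject = c("A", "B", "C", "A", "B", "C"),
  test    = c(1, 1, 1, 2, 2, 2),
  score   = c(22, 31, 10, 20, 45, 2)
)

# Save the object to a file in the temporary folder
rdata_path <- file.path(tempdir(), "mydf.RData")
save(mydf, file = rdata_path)

# Remove the object from the workspace, then restore it from the file
rm(mydf)
restored <- load(rdata_path)

restored      # load() returns the names of the objects it restored
head(mydf)    # the data frame is back under its original name
```

Note that `load()` restores objects under the names they had when saved, so an object of the same name in the workspace is overwritten.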

### Read `.csv`-Files<a name="read.csv">

`csv` (comma-separated values) is a common file format that stores tabular data as plain text, with the fields separated by commas. It is a standard (though mostly not default) format for saving data that are created with spreadsheet software like Excel.

In many European countries the comma serves as the decimal mark, so the usual field separator there is the semicolon, not the comma.

Function `read.csv()` imports .csv-datasets separated by comma, `read.csv2()` reads semicolon-separated files.

These functions take as arguments (amongst others):
- the file name / the path (mandatory)
- `header`: if `TRUE` (default), data of the first row will be recognised as column names
- `sep`: default for `read.csv()` is `sep=","`; default for `read.csv2()` is `sep=";"`
- `dec`: the character used in the file for decimal points; default for `read.csv()` is `dec="."`, default for `read.csv2()` is `dec=","`
- `fill`: if `TRUE` then in case the rows have unequal length, blank fields are implicitly added

**Tip**: Truncated paths of files that were uploaded in Jupyter can be copied quickly by *right-mouse-click* on the file, followed by *Copy Path*. 

Take for example the copied path of the file *Classroom.csv*, which is "LV2018-19/Data/Classroom.csv" in my JupyterLab environment. To open this semicolon-separated csv-file, complete the path as "~/LV2018-19/Data/Classroom.csv", or type "/home/*username*/..." instead of "~/...".

`read.csv2("~/LV2018-19/Data/Classroom.csv")`
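Since *Classroom.csv* is not available here, the following sketch writes a tiny semicolon-separated file to the temporary folder and reads it back. The file name, the column names and the values are invented for illustration.

```r
# Write a small semicolon-separated file with decimal commas (invented data)
csv_path <- file.path(tempdir(), "example_semicolon.csv")
writeLines(c("subject;height", "A;1,75", "B;1,80"), csv_path)

# read.csv2() expects ";" as separator and "," as decimal mark
dat <- read.csv2(csv_path)
str(dat)

# The same import with read.csv() and explicit arguments
same <- read.csv(csv_path, sep = ";", dec = ",")
str(same)
```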

### Read `.txt`-Files<a name="read.table">

`txt` is another common file format that stores data, such as tabular data, as plain text. The separator varies; common choices are comma, semicolon, space, or tab.

Function `read.table()` imports .txt datasets, and takes as arguments (amongst others):

- the file name / the path (mandatory)
- `header`: if `TRUE`, data of the first row will be recognised as column names (default is `FALSE`)
- `sep`: default is `sep=""`, which treats any white space (one or more spaces or tabs) as the separator. Set `sep=";"`, `sep=","` or `sep="\t"` (tab) explicitly, depending on the separator of the .txt-file
- `dec`: the character used in the file for decimal points (comma or point)
- `fill`: if `TRUE` then in case the rows have unequal length, blank fields are implicitly added

Example:  
`read.table("~/LV2018-19/Data/Scores.txt", header=TRUE, fill=TRUE)`
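The effect of `header` can be seen in a small sketch that does not depend on *Scores.txt*. It writes a space-separated file to the temporary folder (file name invented, values taken from the first rows of `mydf` as printed in this notebook) and reads it twice.

```r
# Write a small space-separated text file
txt_path <- file.path(tempdir(), "example_scores.txt")
writeLines(c("subject test score", "A 1 22", "B 1 31", "C 1 10"), txt_path)

# Without header: the first row is treated as data, columns are named V1, V2, V3
no_header <- read.table(txt_path)
no_header

# With header: the first row supplies the column names
with_header <- read.table(txt_path, header = TRUE)
with_header
str(with_header)
```

Writing `TRUE` in full is safer than the abbreviation `T`, because `T` is an ordinary variable that can be overwritten.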