# Data Management (Spring/Summer 2018) at OSIPP, Osaka U
By Shuhei Kitamura

## R basics

### Outline
1. Arithmetic operation
2. Run R scripts
3. Variables
    - Variable assignment
    - Classes
    - Concatenation
    - Vectors
    - Matrices
    - Factors
    - Data frames
        - Get elements
        - Change contents
        - Computation
        - Treat missing values
        - Merge
    - Lists
4. Functions and Packages
5. If Statements
6. Loops

## 1. Arithmetic operation
- Main operators are `+`, `-`, `*`, and `/`.
- Write `1 + 2` and execute it (Shift + Enter, or press the "run cell" button). Next, do the same for `print(1 + 2)`.

In [2]:
1+2
print(1+2)

[1] 3


- If you want to write a comment, use #. Write `# print(1 + 2)` and execute.

In [3]:
#print(1+2)

- Play with `+`, `-`, `*`, and `/`.

In [5]:
4*4
3104*3104


- `^` or `**` for power. Power is right associative (= `**` in Python). How do you calculate $-2^3$, $3^{3^3}$, and $\frac{3}{3^3}$?
- Note: `^` does not calculate power in Python.

In [9]:
3^3^3
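
As a quick check of the precedence rules above, the following sketch evaluates the three expressions from the question. The variable names are only illustrative. Note that unary minus is applied after `^`, so `-2^3` is read as `-(2^3)`.

```r
# Power is right associative: 3^3^3 is 3^(3^3), not (3^3)^3.
right_assoc <- 3^3^3
left_grouped <- (3^3)^3

# Unary minus is applied after the power.
neg_power <- -2^3     # same as -(2^3)
neg_base <- (-2)^2    # parentheses make the base negative

# Power is evaluated before division.
ratio <- 3 / 3^3      # same as 3 / 27

print(right_assoc == 3^27)  # TRUE
print(left_grouped)         # 19683
print(neg_power)            # -8
print(neg_base)             # 4
print(ratio)
```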

- `%%` for modulus and `%/%` for floor division. What is the outcome of `7 %% 2`? How about `7 %/% 2`?
    - `%` (modulus in Python) and `//` (floor division in Python) do not work in R.
    - Similarly, these R operators do not work in Python.

In [10]:
7%%2
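
The sketch below answers the question above and also tries a negative dividend, which the text does not cover. The variable names are illustrative.

```r
# Modulus and floor division for a positive dividend.
r_pos <- 7 %% 2       # remainder: 1
q_pos <- 7 %/% 2      # floor of 7/2: 3

# With a negative dividend the quotient is rounded down (toward -Inf),
# and the remainder takes the sign of the divisor.
r_neg <- (-7) %% 2    # 1
q_neg <- (-7) %/% 2   # -4

print(c(r_pos, q_pos, r_neg, q_neg))

# The two operators are consistent: x == (x %/% y) * y + (x %% y)
print((-7) == q_neg * 2 + r_neg)  # TRUE
```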

## 2. Run R scripts
<!--
 - Launch a text editor (e.g. Visual Studio Code). Create a new file.
 - Type `print(1 + 2)` in line 1, and `print("Hello world!")` in line 2.
 - Save the file in the local repo. Type "myR" in the file name, and set R for the file type. The file extension becomes ".r".
-->

- Suppose that you already have an R file "myR.R" in your local repository.
- Set the current directory by typing: `setwd("C:/path_to_your_local_repo")` where you need to specify the path to your local repository.
    - You will get an error message if you use a single backslash instead of a slash. (Write `/` or `\\`.)

- Run the R script by typing: `source("myR.R")`.

- Let's push the new file to your remote repository. Start a Git GUI (e.g. SourceTree).
- Stage "myR.R". Then press Commit.
- Write a commit message (e.g. "Create myR.R.").
- Commit and push.
- Check the remote repository.
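
The `setwd()` and `source()` steps above can be tried without a repository. The following is a minimal sketch that assumes a temporary directory as a stand-in for the local repo; `script_dir`, `script_path`, and `old_dir` are hypothetical names.

```r
# Write a two-line script to a temporary directory (a stand-in for the local repo).
script_dir <- tempdir()
script_path <- file.path(script_dir, "myR.R")
writeLines(c('print(1 + 2)', 'print("Hello world!")'), script_path)

# Move into that directory, run the script, then restore the old directory.
old_dir <- setwd(script_dir)  # setwd() returns the previous directory
source("myR.R")
setwd(old_dir)
```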

## 3. Variables
- You can assign an object to a variable. An object in R is data, which does not contain any method.

#### - Variable assignment

- R uses `<-` or `->` for assignment instead of `=` in Python. (`=` also works in R, but it is less common.)
```r
myvar <- "Hello world!"
print(myvar)
```
is equivalent to:
```r
print("Hello world!")
```
- In Jupyter Notebook, press `Alt` + `-` to insert `<-` with surrounding spaces. (See Keyboard Shortcuts in the Help tab above.)
- You can also wrap the assignment in parentheses, `( <- )`, to assign and print at once.

In [25]:
myvar <- "Hello world!"
print(myvar)
"Hello world!" -> myvar
print(myvar)
(myvar <- "Hello world!")

#### - Classes
- Types in Python are called classes in R. They are: numerics, integers, characters, logical, matrices, arrays, etc.
- For logical you should write `TRUE` or `FALSE`, not `True` or `False` as in Python.
- R is also case sensitive!
- You use `class()` to check the class of data. (`type()` in Python.)
    - Alternatively, you can check the class using `is.numeric()`, `is.integer()`, `is.character()`, and `is.logical()`.

In [None]:
mynum1 <- 1 # this is not an integer!
mynum2 <- 1.0 # .0 is not printed.
mynum3 <- 1.5
mychar <- "Hello"
mylogic <- TRUE

print(mynum1)
print(class(mynum1))
print(mynum2)
print(class(mynum2))
print(mynum3)
print(class(mynum3))
print(class(mychar))
print(class(mylogic))
print(is.integer(mynum1))

- How you write an integer in R is different from how you do it in Python. You need to put `L` after a value.

In [None]:
myint <- 1L
mynum <- 1
print(class(myint))
print(class(mynum))

- Logical is the same as booleans in Python.

In [None]:
print(1 + 2 == 3)
print(class(1 + 2 == 3))

- If a string includes a quote (e.g. `"This is a "quote""`), it returns an error. Use `'This is a "quote"'` instead.

In [None]:
# print("This is a "quote".") # This returns an error.
print('This is a "quote".')

- To change classes, you can use `as.numeric()`, `as.integer()`, `as.character()`, and `as.logical()`.

In [None]:
mynum <- 1
print(as.character(mynum))
mynum <- 1
print(as.integer(mynum))

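- `FALSE` converts to `0` and `0` to `FALSE`. `TRUE` converts to `1` and `1` to `TRUE`.
    - All other (nonzero) numerics and integers also convert to `TRUE`.
- Arbitrary characters cannot be converted to logical: `as.logical("Hello!")` returns `NA`. Only strings such as `"TRUE"`, `"true"`, `"T"`, or `"FALSE"` are converted. (Recall that non-empty strings are truthy in Python.)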

In [None]:
print(as.numeric(FALSE))
print(as.logical(0))
print(as.numeric(TRUE))
print(as.logical(1))
print(as.logical("Hello!")) # This returns NA, not an error.
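
A short sketch of how `as.logical()` treats numbers and strings. The variable names are illustrative.

```r
# Numbers: zero is FALSE, any other number is TRUE.
from_numbers <- as.logical(c(0, 1, 2, -3.5))

# Strings: only a few spellings of true/false are recognised;
# anything else becomes NA rather than raising an error.
from_strings <- as.logical(c("TRUE", "true", "T", "FALSE", "Hello!"))

print(from_numbers)  # FALSE TRUE TRUE TRUE
print(from_strings)  # TRUE TRUE TRUE FALSE NA
```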

- The class of `NULL` is NULL (like `None` in Python, whose type is `NoneType`).
- `NaN` (Not a Number) is numeric, and `NA` (Not Available) is logical.
    - Python (pandas) only has `NaN`, which is a missing value.

In [None]:
print(class(NULL))
print(class(NaN))
print(class(NA))
print(as.logical(NaN)) # converting NaN to logical gives NA
print(as.numeric(NA)) # converting NA to numeric still gives NA (now of class numeric)
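
The difference between `NA`, `NaN`, and `NULL` can be checked with the test functions `is.na()`, `is.nan()`, and `is.null()`. This is a small sketch with illustrative variable names.

```r
# is.na() is TRUE for both NA and NaN; is.nan() is TRUE only for NaN.
vals <- c(1, NA, NaN)
print(is.na(vals))    # FALSE TRUE TRUE
print(is.nan(vals))   # FALSE FALSE TRUE

# Converting NA keeps it missing but changes its class.
na_num <- as.numeric(NA)
print(class(na_num))  # "numeric"

# NULL is an empty object of length zero, not a missing value.
print(length(NULL))   # 0
print(is.null(NULL))  # TRUE
```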

#### - Concatenation
- In Python you can write `"Hello" + " World!"` to concatenate strings, but you cannot use `+` in this way in R.
- Instead, you need to use `paste()`. With `paste()`, you can also concatenate characters with other classes.
- Similarly, you cannot use `*` as in `"Hello" * 2` in R either.

In [None]:
# print("Hello" + "World!") # This returns an error.
myvar1 <- "Hello"
myvar2 <- " World!"
paste(myvar1, myvar2, sep = "")

savings = 100
print(paste("I have ", savings, " USD in my account.", sep = ""))
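
Besides `paste()`, base R has a few related helpers. The sketch below uses `paste0()`, `strrep()` (available in R 3.3.0 and later), and `sprintf()`; these are not used elsewhere in this notebook, and the variable names are illustrative.

```r
# paste0() is paste() with sep = "".
greeting <- paste0("Hello", " World!")

# strrep() repeats a string, like "Hello" * 2 in Python.
repeated <- strrep("Hello", 2)

# sprintf() fills a template; %d expects an integer.
savings <- 100L
message_text <- sprintf("I have %d USD in my account.", savings)

print(greeting)
print(repeated)
print(message_text)
```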

#### - Vectors
- A vector is a one-dimensional array (like `array` in NumPy). Use `c()` to create one.
- Vector is not a class of its own; the class of a vector is `numeric`, `character`, `logical`, etc., depending on its elements.
- You can put values of any class into `c()`, but a vector stores only one class, so mixed values are converted to a common class (like `array` in NumPy).
- You can also use `c()` in `print()` to show multiple outputs at once (like `print( , )` in Python).

In [None]:
myvec <- c("a", 1, TRUE)
print(myvec)
myvec1 <- c(1,2)
myvec2 <- c(3,4)
print(c(myvec1, myvec2))

- Some useful ways to make a vector.
    - `:` to make a vector of numerics or integers.
    - `numeric()` to make a vector of zeros (as numeric).
        - `integer()` to make a vector of zeros (as integer).
    - `seq()` to make a vector of numerics or integers.
    - `rep()` to make a vector. Most classes can be used.

In [None]:
myvec1 <- -1:4
print(myvec1)
print(numeric(5))
print(seq(1,4,by=1)) 
print(rep("Hello",4)) 
#mylist <- list(var1 = 1:10, var2 = "Hello")
#print(rep(mylist, 5)) 

- To get a length of a vector, use `length()`.

In [None]:
myvec <- 1:10
print(length(myvec))

- The class of a vector depends on its elements. If a vector contains multiple classes, the highest one is chosen.
    - Classes are ordered: character > complex > numeric > logical > NULL.
- The class of an empty vector is `NULL`.

In [None]:
myvec1 <- c(1,2)
print(class(myvec1))
myvec2 <- c(1,"Hello",TRUE)
print(myvec2)
print(class(myvec2))
myvec3 <- c(1,TRUE)
print(class(myvec3))
print(class(c())) # an empty vector
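
The ordering of classes can be checked one step at a time. The following sketch also includes `integer`, which sits between `numeric` and `logical`; the variable name `coerced` is illustrative.

```r
print(class(c(TRUE, 2L)))  # "integer"   (logical -> integer)
print(class(c(2L, 1.5)))   # "numeric"   (integer -> numeric)
print(class(c(1.5, 1i)))   # "complex"   (numeric -> complex)
print(class(c(1i, "a")))   # "character" (complex -> character)

# TRUE becomes 1 when combined with numbers.
coerced <- c(TRUE, 2L, 1.5)
print(coerced)             # 1.0 2.0 1.5
```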

- You can assign (column) names to a vector using `names()`.

In [None]:
myvec <- c(1,10)
print(myvec)
names(myvec) <- c("Monday","Tuesday")
print(myvec)

- To subset a vector, use `[]` as in Python. However, the index in R starts from one, not zero!
- You can also use vectors, names, and relational operators for subsetting a vector.

In [None]:
myvec <- c(1,10,100,1000)
names(myvec) <- c("col1","col2","col3","col4") # assign column names
print(myvec)
print(myvec[2])
print(myvec[c(2,3)])
print(myvec[c(-2,-3)])
print(myvec[c("col3","col4")])
print(myvec[myvec > 100])

- Other useful operators are: `is.na()`, `is.null()`, `is.nan()`, and `is.infinite()`.
    - For example, you can substitute NA with a value such as zero using `is.na()`.

In [None]:
myvec <- c(NA,10,100,Inf)
print(myvec[!is.na(myvec)]) # exclude NA
print(myvec[!is.infinite(myvec)]) # exclude infinity
print(myvec[!is.na(myvec) & !is.infinite(myvec)]) # exclude NA and infinity

## replace NA with 0
print(myvec)
myvec[is.na(myvec)] <- 0
print(myvec)

- To slice a vector, use `[start:end]`.
    - You must give both ends; you cannot omit the end as in Python's `[0:]`.
    - The minus sign means exclusion, not an index counted from the right (unlike Python).

In [None]:
myvec <- c(1,10,100,1000)
print(myvec[1:2])
print(myvec[-2:-3])
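
Since open-ended slices are not available, some common Python slices can be written as follows. This is a sketch using `length()`, `head()`, and `tail()`; the variable names are illustrative.

```r
myvec <- c(1, 10, 100, 1000)
n <- length(myvec)

from_second <- myvec[2:n]        # like myvec[1:] in Python
without_first <- myvec[-1]       # same result via exclusion
last_two <- tail(myvec, 2)       # like myvec[-2:] in Python
all_but_last <- head(myvec, -1)  # like myvec[:-1] in Python

print(from_second)
print(last_two)
print(all_but_last)
```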

- Vectors can be used for calculation.
    - `+`, `-`, `*`, and `/` work element by element, as with NumPy arrays.
    - `sum()` computes the sum of all elements in a vector.
    - `mean()` computes the mean of all elements in a vector.

In [None]:
x <- c(0,1)
y <- c(2,3)
print(c(sum(x), sum(y)))
print(c(mean(x), mean(y)))
print(x + y)
print(x * y)

#### - Matrices
- The matrix is one of the classes.
- The matrix is a two dimensional array (like `ndarray` in NumPy). You can specify the number of columns and rows, and the order of filling in the matrix.

In [None]:
mymat1 <- matrix(c(1,2,3,4),nrow=2,ncol=2) # using a vector, create a 2 by 2 matrix
print(mymat1)
print(class(mymat1))
mymat2 <- matrix(c(1,2,3,4),byrow=TRUE,nrow=2,ncol=2) # the same as above, but the matrix is filled by row
print(mymat2)
mymat3 <- matrix(1:9,byrow=TRUE,nrow=3)
print(mymat3)

- Alternatively, you can apply `cbind()` and `rbind()` to make a matrix.

In [None]:
t_avg <- c(60,170)
c_avg <- c(62,175)
mymat1 <- cbind(t_avg,c_avg)
print(mymat1)
print(class(mymat1))
mymat2 <- rbind(t_avg,c_avg)
print(mymat2)

- The matrix can store only one class. If there are multiple classes, the highest one is chosen.
    - Recall that classes are ordered: character > complex > numeric > logical > NULL. 
- But the class of the matrix itself is still matrix (unlike the vector, whose class follows its elements). R 4.0.0 and later report `"matrix" "array"`.

In [None]:
x <- c("a","b")
y <- c(2,3)
mymat <- cbind(x,y) # character > numeric
print(mymat)
print(class(mymat[2,2])) # the class of an element in the matrix
print(class(mymat)) # the class of the matrix

- Use `colnames()` and `rownames()` to add names for columns and rows, respectively.

In [None]:
t_avg <- c(60,170)
c_avg <- c(62,175)
cpop <- cbind(t_avg,c_avg)
rownames(cpop) <- c("Weight","Height")
colnames(cpop) <- c("Treatment","Control")
print(cpop)

- `colSums()` and `rowSums()` calculate the sums of each column and row, respectively.
    - In contrast, `sum()` computes the sum of a specific column or row, or of all elements in a matrix.
- `colMeans()` and `rowMeans()` calculate the means of each column and row, respectively.
    - In contrast, `mean()` computes the mean of a specific column or row (e.g. `mean(cpop[1,])`), or of all elements in a matrix.

In [None]:
t_avg <- c(60,170)
c_avg <- c(62,175)
cpop <- cbind(t_avg,c_avg)
rownames(cpop) <- c("Weight","Height")
colnames(cpop) <- c("Treatment","Control")
print(cpop)
print(c(colSums(cpop),rowSums(cpop)))
print(sum(cpop))
print(c(colMeans(cpop),rowMeans(cpop)))

- To subset a matrix, you often use `[row numbers, column numbers]`.
- You can also combine `:`, vectors, names, and relational operators.

In [None]:
x <- c(60,170,50)
y <- c(62,175,100)
z <- c(61.0,172.5,NA)
dims <- list(c("Weight","Height","Obs."),c("Treatment","Control","Mean"))
cpop <- matrix(c(x,y,z),ncol=3,dimnames=dims) # you can add column and row names at once
print(cpop)

print(cpop[1,1]) # first row, first column
print(cpop[1,1:2]) # first two columns in the first row
print(cpop[1,c(2,3)]) # second and third column in the first row
print(cpop[1,"Treatment"]) # "Treatment" column in the first row
print(cpop[1,]) # all elements in the first row

print(cpop[cpop < 100]) # all elements less than 100

# These are rarely used.
# print(cpop[3]) # the third element
# print(cpop[c(2, 3)]) # the second and the third element

- Matrix calculation allows `+`, `-`, `*`, and `/`. These operate element by element. (For the matrix product, use `%*%`.)

In [None]:
mymat <- matrix(1:9,byrow=TRUE,nrow=3)
print(mymat)
print(mymat + mymat)
print(mymat * 2)
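
Because `*` works element by element, it differs from the matrix product `%*%`. The following sketch compares the two on a small matrix; the variable names are illustrative.

```r
a <- matrix(1:4, nrow = 2)  # filled by column: rows are (1, 3) and (2, 4)

elementwise <- a * a        # each element squared
matprod <- a %*% a          # matrix product

print(elementwise)
print(matprod)
```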

#### - Factors
- The factor is one of the classes.
- The factor stores categorical data.
    - For example, a vector `c("Male", "Female", "Male")` contains two factor levels `"Male"` and `"Female"`. `factor()` encodes the vector as a factor.
- You can define the order of factor levels.

In [None]:
# ordinal categorical variable
temp <- c("High","Low","Medium","Low")
factor_temp <- factor(temp,levels=c("Low","Medium","High"),ordered=TRUE) # put levels from "Low", "Medium", to "High", where "Low" is the smallest
print(factor_temp)
print(class(factor_temp))
print(factor_temp[1] < factor_temp[2]) # check if "High" is smaller than "Low" (FALSE)
print(summary(factor_temp)) # count the number of each element

# nominal categorical variable
animals <- c("Elephant","Giraffe","Donkey","Horse")
factor_animals <- factor(animals)
print(factor_animals)
print(class(factor_animals))
# print(factor_animals[1] < factor_animals[2]) # this returns NA with a warning.
print(summary(factor_animals))

- It is possible to use any other labels for factor levels.

In [None]:
gender <- c("M","F","F","M","M")
factor_gender <- factor(gender) 
levels(factor_gender) <- c("Female","Male") # the levels are stored as "F" and "M". R sorts levels alphabetically, so the new labels must be given in the order Female, Male
print(factor_gender)
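
Assigning to `levels()` relies on remembering the alphabetical order. As an alternative sketch, the `levels` and `labels` arguments of `factor()` state the mapping explicitly, so the order is whatever you write.

```r
gender <- c("M", "F", "F", "M", "M")

# "M" is labelled "Male" and "F" is labelled "Female", in that level order.
factor_gender <- factor(gender, levels = c("M", "F"), labels = c("Male", "Female"))

print(factor_gender)
print(table(factor_gender))  # counts per level
```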

#### - Data frames
- The data frame is one of the classes.
- Recall that vectors and matrices can store only one class. Data frames allow you to store multiple classes.

In [None]:
print(class(mtcars)) # mtcars is built-in data
print(mtcars)

- To get the structure of a data frame, use `str()`.
- To get the dimensions of a data frame, use `dim()`. (The command is similar to `.shape` in pandas.)
- You can use `lengths()` for the number of elements in each column.
    - `length()` (without s) is for a specific column.

In [None]:
str(mtcars) # str() prints by itself; wrapping it in print() adds an extra NULL
print(dim(mtcars))
print(lengths(mtcars))
print(length(mtcars$mpg))

- Use `summary()` to get a summary table.

In [None]:
print(summary(mtcars))

- To get the total number of columns and rows, use `ncol()` and `nrow()`, respectively.

In [None]:
print(ncol(mtcars))
print(nrow(mtcars))

- To get column names and row names, use `colnames()` and `rownames()`, respectively.

In [None]:
print(colnames(mtcars))
print(rownames(mtcars))

- To check if a value is duplicated, use `duplicated()`. It marks the second and later occurrences of a value.
- To mark the first occurrence of each value instead, use `!duplicated()`.

In [None]:
print(duplicated(colnames(mtcars)))
print(!duplicated(colnames(mtcars)))
print(sum(duplicated(rownames(mtcars)))) # the total number of duplicated row names (0 here)

- To get unique values, use `unique()`.

In [None]:
unique(rownames(mtcars))

##### Get elements
- To access each element, you can use `[]`.
- To slice data, use `[:]`. 
    - To get columns, you can write either `[:]` or `[,:]`.
    - To get rows, use `[:,]`
- You can also access a column directly with `df$column_name`.

In [None]:
# get columns
print(mtcars[1:2]) 
print(mtcars[,1:2]) 
print(mtcars[,"mpg"]) 
print(mtcars$mpg) 
# get rows
print(mtcars[1:2,]) 

- You can use operators like `==`, `>`, and `<=` to get particular rows.
    - You can also use `subset()`.
- You can also use `filter()` in the `dplyr` package.
- If the condition evaluates to NA for a row, subsetting with operators in `[]` returns a row filled with NA.
    - `subset()` and `filter()` drop that row instead.
    - To keep the row, add an "or" condition with `is.na()`.

In [None]:
col1 <- c("a","a","b","b")
col2 <- c(1,2,1,2)
col3 <- c(NA,0.2,0.4,0.1)
col4 <- c(TRUE,FALSE,FALSE,TRUE)
myDf <- data.frame(id=col1,time=col2,varA=col3,varB=col4,stringsAsFactors=FALSE) # add column names at once, do not read strings as factors
print(myDf)

myDf2 <- myDf[myDf$varA != 0.2,]
print(myDf2)
print(subset(myDf,varA != 0.2)) # using subset
library(dplyr) 
print(filter(myDf,varA != 0.2)) # using filter in the dplyr package

In [None]:
col1 <- c("a","a","b","b")
col2 <- c(1,2,1,2)
col3 <- c(NA,0.2,0.4,0.1)
col4 <- c(TRUE,FALSE,FALSE,TRUE)
myDf <- data.frame(id=col1,time=col2,varA=col3,varB=col4,stringsAsFactors=FALSE)
print(myDf)

myDf2 <- myDf[(myDf$varA != 0.2) | (is.na(myDf$varA)),]
print(myDf2)
print(subset(myDf,varA != 0.2 | is.na(varA))) # using subset
library(dplyr) 
print(filter(myDf,varA != 0.2 | is.na(varA))) # using filter
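
The three behaviours can be compared in base R alone. The sketch below uses `which()`, which is not used elsewhere in this notebook, to drop rows whose condition is NA; the data frame is a shortened version of the one above.

```r
myDf <- data.frame(id = c("a", "a", "b", "b"),
                   varA = c(NA, 0.2, 0.4, 0.1),
                   stringsAsFactors = FALSE)

# The condition is NA for the first row, so `[]` returns a row filled with NA.
with_na_row <- myDf[myDf$varA != 0.2, ]

# which() keeps only positions where the condition is TRUE, like subset().
dropped_na <- myDf[which(myDf$varA != 0.2), ]

# Adding is.na() keeps the original first row.
kept_na <- myDf[myDf$varA != 0.2 | is.na(myDf$varA), ]

print(with_na_row)
print(dropped_na)
print(kept_na)
```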

- To subset columns, use `[]`, `subset()`, or `select()` in the dplyr package.

In [None]:
# subset columns
drops <- c("mpg","cyl")
print(mtcars[,!(colnames(mtcars) %in% drops)])
print(subset(mtcars,select=-c(mpg, cyl))) # using subset

- `select` in the `dplyr` package is very useful.

In [None]:
library(dplyr) 
print(select(mtcars,disp:carb)) # select columns from disp to carb
print(select(mtcars,starts_with("m"))) # columns starting with "m"
#print(select(mtcars, ends_with("p"))) # columns ending with "p"
#print(select(mtcars, contains("q"))) # columns containing "q"
# another useful option: num_range("x", 1:3) # columns x1, x2, and x3.

- To get first few rows and last few rows easily, use `head()` and `tail()`, respectively.

In [None]:
print(head(mtcars,n=5)) # first 5 rows
print(tail(mtcars,n=5)) # last 5 rows

- Appendix: To get the indices of matched elements, use `grep()` or `match()`.
    - `match()` is used for the exact match like `==`.

In [None]:
mycars <- read.table('cars.csv',header=TRUE,sep=',',stringsAsFactors=FALSE)
print(mycars)
grep("Japan",mycars$country)
match("Japan",mycars$country)
grep("Jap",mycars$country)
match("Jap",mycars$country) # returns NA unless an element is exactly "Jap" (exact match only)

##### Change contents
- To delete a column, use `<- NULL`.
    - Similar to `del` in Python.
- Alternatively, you can use `[]` for subsetting data.

In [None]:
mycars <- read.table('cars.csv',header=TRUE,sep=',',row.names='id',stringsAsFactors=FALSE)
print(mycars)
mycars$country <- NULL
print(mycars)

- You can sort elements using `order()`.
    - Though rarely used, you can also sort data by row values.
- Another option is `arrange()` in the `dplyr` package.

In [None]:
attach(mtcars)
print(mtcars[order(mpg,cyl),]) # sort by columns
print(mtcars[order(-mpg,-cyl),]) # sort by columns (descending)
print(mtcars[,order(mtcars["Mazda RX4",])]) # sort by a row
#require(dplyr)
#print(arrange(mtcars, cyl, desc(mpg))) # sort by cyl (ascending) and mpg (descending)

- To transpose data, use `t()`.

In [None]:
print(t(mtcars))

- You can use `reshape()` to reshape the data between long and wide formats.

In [None]:
col1 <- c("a","a","b","b")
col2 <- c(1,2,1,2)
col3 <- c(0.5,0.2,0.4,0.1)
col4 <- c(TRUE,FALSE,FALSE,TRUE)
myDf <- data.frame(id=col1,time=col2,varA=col3,varB=col4,stringsAsFactors=FALSE)
print(myDf)

myDf <- reshape(myDf,idvar="id",timevar="time",direction="wide")
rownames(myDf) <- NULL # delete the old rownames
print(myDf)
myDf <- reshape(myDf,idvar="id",timevar="time",direction="long")
rownames(myDf) <- NULL # delete the old rownames
print(myDf)

- Appendix: Use `melt()` in the `reshape2` package to show data in an alternative form.
<!-- `dcast()` 
myDf_back <- dcast(myDf_melt, id + time ~ variable)
print(myDf_back)
-->

In [None]:
library(reshape2)
col1 <- c("a","a","b","b")
col2 <- c(1,2,1,2)
col3 <- c(0.5,0.2,0.4,0.1)
col4 <- c(TRUE,FALSE,FALSE,TRUE)
myDf <- data.frame(id=col1,time=col2,varA=col3,varB=col4,stringsAsFactors=FALSE)
print(myDf)
print(melt(myDf,id.vars=c("id","time")))

- To rename column and row names, use `colnames()` and `rownames()`, respectively.
    - For column names, you can also use `setnames()` in the `data.table` package.
<!--    - For column names, you can also use `rename()` in the `dplyr` package. -->

In [None]:
col1 <- c("a","a","b","b")
col2 <- c(1,2,1,2)
col3 <- c(0.5,0.2,0.4,0.1)
col4 <- c(TRUE,FALSE,FALSE,TRUE)
myDf <- data.frame(id=col1,time=col2,varA=col3,varB=col4,stringsAsFactors=FALSE)
print(myDf)
colnames(myDf)[colnames(myDf)=="varA"] <- "varC" # replace a column name
print(myDf)
#library(data.table) #install.packages("data.table", repos='http://cran.us.r-project.org')
#print(setnames(myDf,"varB","varD"))

- To replace values, use `replace()` or `gsub()`.
    - `gsub()` returns character. If you want to replace numerics, use `replace()`.

In [None]:
col1 <- c("a","a","b","b")
col2 <- c(1,2,1,2)
col3 <- c(0.5,0.2,0.4,0.1)
col4 <- c(TRUE,FALSE,FALSE,TRUE)
myDf <- data.frame(id=col1,time=col2,varA=col3,varB=col4,stringsAsFactors=FALSE)
print(myDf)
myDf$varA <- replace(myDf$varA,myDf$varA==0.5,0.8) # replace varA == 0.5 to 0.8
print(myDf)
myDf$varA <- gsub("0.2","0.3",myDf$varA,fixed=TRUE) # using gsub on 0.2 (0.5 was already replaced above); the column becomes character
print(myDf)
print(class(myDf$varA))

##### Computation
- `colSums()` and `rowSums()` calculate the sums of each column and row, respectively.
    - In contrast, `sum()` computes the sum of a specific column or row, or of all elements in a data frame.
- `colMeans()` and `rowMeans()` calculate the means of each column and row, respectively.
    - In contrast, `mean()` computes the mean of a specific column, but not of a row (a row of a data frame is itself a data frame).

In [None]:
col2 <- c(1,2,1,2)
col3 <- c(1.5,0.9,0.2,0.3)
col4 <- c(0.5,0.2,0.4,0.1)
myDf <- data.frame(time=col2,varA=col3,varB=col4,stringsAsFactors=FALSE)
print(myDf)
print(sum(myDf$varA))
print(colSums(myDf[,c("varA","varB")]))
print(rowSums(myDf[c(1,3),]))

- `NaN` and `NA` are not ignored by default in calculations (unlike pandas, which skips `NaN` by default). Pass `na.rm=TRUE` to ignore them.

In [None]:
myDf <- data.frame(matrix(1:9,byrow=TRUE,nrow=3)) 
print(myDf)
myDf$X1 <- replace(myDf$X1,myDf$X1==1,NaN)
print(myDf)
print(sum(myDf$X1)) # returns NaN
print(sum(myDf$X1,na.rm=TRUE)) # ignore NaN

##### Treat missing values
- To check if data contain NA or NaN, use `is.na()`.
    - To check if data contain NaN only, use `is.nan()`.
- To check if data contain `Inf` or `-Inf`, use `is.infinite()`.

In [None]:
x <- c(-Inf,NA,3,Inf)
y <- c(NA,0/0,NaN,log(0))
my_data <- data.frame(x,y)
print(my_data)
print(is.na(my_data))
print(sum(is.na(my_data))) # total number of NA or NaN (counting TRUEs)
print(apply(is.na(my_data),2,sum)) # total number of NA or NaN by column (change 2 to 1 to count by row)
print(is.nan(as.matrix(my_data))) # is.nan() does not accept a data frame, so convert it with as.matrix()
# print(is.nan(my_data)) # this returns an error
print(is.infinite(as.matrix(my_data))) # likewise, convert with as.matrix() before using is.infinite()

- To drop rows with NA or NaN in a specific column, use `[]` with `!is.na()`.
- To drop rows with NA or NaN in any column, use `complete.cases()`.

In [None]:
x <- c(-Inf,NA,3,Inf)
y <- c(NA,0/0,NaN,log(0))
my_data <- data.frame(x,y)
print(my_data)
my_data_nomiss <- my_data[!is.na(my_data$x),] 
print(my_data_nomiss)
my_data_nomiss <- my_data[complete.cases(my_data),]
print(my_data_nomiss)
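
`complete.cases()` is TRUE only for rows without any NA or NaN. If you instead want to drop only the rows in which every value is missing, one possible sketch (not from the original notes) counts missing values per row with `rowSums()`.

```r
my_data <- data.frame(x = c(-Inf, NA, 3, Inf),
                      y = c(NA, 0/0, NaN, log(0)))

# TRUE only for rows without any NA or NaN (Inf and -Inf are not missing).
complete_rows <- complete.cases(my_data)
kept <- my_data[complete_rows, ]

# TRUE for rows in which every column is NA or NaN.
all_missing <- rowSums(is.na(my_data)) == ncol(my_data)
kept_some <- my_data[!all_missing, ]

print(complete_rows)  # FALSE FALSE FALSE TRUE
print(kept)
print(kept_some)
```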

- To drop columns with all NA or NaN, use `[]`.

In [None]:
x <- c(log(0), NA, 3, Inf)
y <- c(NA, 0/0, NaN, NA)
z <- c(1,2,3,4)
my_data <- data.frame(x,y,z)
print(my_data)
my_data2 <- my_data[,colSums(is.na(my_data)) != nrow(my_data)]
print(my_data2)

- To replace NA or NaN with another value, assign through `is.na()`. For `Inf` or `-Inf`, use `is.infinite()` in the same way.

In [None]:
x <- c(-Inf, NA, 3, Inf)
y <- c(NA, 0/0, NaN, log(0))
z <- c(1,2,3,4)
my_data <- data.frame(x,y,z)
print(my_data)
my_data$x[is.na(my_data$x)] <- 0 # NA or NaN in column x
print(my_data)
my_data[is.na(my_data)] <- 0 # all NA or NaN
print(my_data)
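
The cell above replaces NA and NaN only. A minimal sketch for `Inf` and `-Inf`, using `is.infinite()` on a single column, could look like this; converting infinite values to NA first is just one possible choice.

```r
my_data <- data.frame(x = c(-Inf, NA, 3, Inf), z = c(1, 2, 3, 4))

# Step 1: turn Inf and -Inf in column x into NA.
my_data$x[is.infinite(my_data$x)] <- NA

# Step 2: replace every NA in column x with 0.
my_data$x[is.na(my_data$x)] <- 0

print(my_data)
```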

##### Merge
- To merge datasets, use `merge()`.
    - Alternatively, you can use `cbind()` and `rbind()`.

In [None]:
id <- c("a","b","c")
weight <- c(80.0,65.0,55.0)
data1 <- data.frame(id,weight)
 
id <- c("a","b","c","d")
height <- c(180.0,170.0,150.0,155.0)
data2 <- data.frame(id,height)

print(merge(data1,data2,by="id")) # exclude unmatched ones
print(merge(data1,data2,by="id",all=TRUE)) # include all

- You can merge even if the key names are different.

In [None]:
id <- c("a","b","c")
weight <- c(80.0,65.0,55.0)
data1 <- data.frame(id,weight)
 
xid <- c("a","b","c","d")
height <- c(180.0,170.0,150.0,155.0)
data2 <- data.frame(xid,height)

print(merge(data1,data2,by.x="id",by.y="xid")) # exclude unmatched ones
print(merge(data1,data2,by.x="id",by.y="xid",all=TRUE)) # include all

- You can use more than one key.

In [None]:
id <- c("a","a","b","b")
time <- c(1,2,1,2)
weight <- c(80.0,79.0,54.0,55.0)
data1 <- data.frame(id,time,weight)

id <- c("a","a","b","b","c")
time <- c(1,2,1,2,1)
height <- c(180.0,180.0,170.0,170.0,155.0)
data2 <- data.frame(id,time,height)

print(merge(data1,data2,by=c("id","time"))) # exclude unmatched ones
print(merge(data1,data2,by=c("id","time"),all=TRUE)) # include all

- What happens if both datasets have the same column name but the contents are different?

In [None]:
id <- c("a","b","c")
weight <- c(80.0,65.0,55.0)
data1 <- data.frame(id, weight)
 
xid <- c("a","b","c","d")
weight <- c(180.0,170.0,150.0,155.0)
data2 <- data.frame(xid,weight)

print(merge(data1,data2,by.x="id",by.y="xid")) # exclude unmatched ones
print(merge(data1,data2,by.x="id",by.y="xid",all=TRUE)) # include all
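
To answer the question above: `merge()` keeps both columns and appends the suffixes `.x` and `.y`. The sketch below also shows the `suffixes` argument; the suffix strings `"_1"` and `"_2"` are arbitrary choices for illustration.

```r
data1 <- data.frame(id = c("a", "b", "c"), weight = c(80, 65, 55))
data2 <- data.frame(xid = c("a", "b", "c", "d"), weight = c(180, 170, 150, 155))

# Default: the clashing columns become weight.x and weight.y.
merged_default <- merge(data1, data2, by.x = "id", by.y = "xid")
print(colnames(merged_default))

# Custom suffixes.
merged_named <- merge(data1, data2, by.x = "id", by.y = "xid", suffixes = c("_1", "_2"))
print(colnames(merged_named))
```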

#### - Lists
- Lists are one of the classes.
- Lists can contain any class.
- The structure of lists in R looks quite different from that of lists in Python.

In [None]:
x <- c(1,2)
y <- matrix(1:4,nrow=2,ncol=2)
z <- mtcars
mylist <- list(x,y,z)
names(mylist) <- c("vec","mat","df")
print(class(mylist))
print(mylist)

- To get an element in a list, use `[[]]` or `$`. Single brackets `[]` return a sub-list instead of the element itself.

In [None]:
x <- c(1,2)
y <- matrix(1:4,nrow=2,ncol=2)
z <- mtcars
mylist <- list(vec=x,mat=y,df=z) # assign names
print(mylist[[2]]) # this returns mat
print(mylist[["mat"]]) # this returns mat
print(mylist$mat) # this returns mat

print(mylist$mat[,1]) # this returns the first column in mat
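
The difference between `[]` and `[[]]` is easy to miss, so here is a small sketch. The list is a shortened version of the one above, and `sub_list` and `element` are illustrative names.

```r
mylist <- list(vec = c(1, 2), mat = matrix(1:4, nrow = 2, ncol = 2))

sub_list <- mylist["mat"]    # a list of length one
element <- mylist[["mat"]]   # the matrix itself

print(class(sub_list))       # "list"
print(is.matrix(element))    # TRUE
print(length(sub_list))      # 1
```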

- To get the structure of a list, use `str()`.
- To get the length of each element in a list, use `lengths()`.
    - To get the length of the list, use `length()`.

In [None]:
x <- c(1,2)
y <- matrix(1:4,nrow=2,ncol=2)
z <- mtcars
mylist <- list(vec=x,mat=y,df=z)
str(mylist) # str() prints by itself; wrapping it in print() adds an extra NULL
print(lengths(mylist))
print(length(mylist))

## 4. Functions and Packages
- As in Python, there are many functions and packages.
- You have already used some of the functions such as `print()`, `paste()`, `summary()`, `sum()`, and `subset()`.
    - `builtins()` to get the list of built-in functions.
- You can get a help document using `?` or `help()`.

In [60]:
?print

- Packages contain functions, data, etc.
- There are 12,649 packages (as of 2018/06/26) on [CRAN](https://cran.r-project.org/).
- To install a package, use `install.packages()`. You can use `c()` to install multiple packages.
- Installed packages can be loaded using `library()`.
<!--
    - Type `.libPaths()` to check where packages are stored.
    - If no path to `Anaconda3/Lib/R/library` is found, type
```python
.libPaths("C:/[path_to_the_folder]/Anaconda3/Lib/R/library")
```
to set the path. You have to specify the path to your local directory.
    - Then type the following:
-->

In [None]:
#install.packages(c("ggplot2","foreign"),repos='http://cran.us.r-project.org')
library(ggplot2)

- To check functions in a package, use `lsf.str()` or `ls()`.

In [None]:
lsf.str("package:ggplot2")
#ls("package:ggplot2")

- You can list loaded packages using `(.packages())`.
- You can unload packages using `detach()`.

In [None]:
(.packages())
detach(package:ggplot2)
(.packages())

- The following commands install the `nycflights13` package and load it.
 <!--   - `dplyr` is a popular package for data scientist, focused on tools for working with data frames. -->
    - `nycflights13` includes data on 336,776 flights departed from NYC in 2013.

In [64]:
install.packages("nycflights13",repos='http://cran.us.r-project.org')
library(nycflights13)

- Unlike Python, once a package is loaded with `library()`, you can call its functions without a package prefix (i.e., there is no counterpart of `np.` as in `np.array`).
- In the following command:
    <!-- 
    - `filter()` included in the dplyr package
    - `head()` included in the utils package (= a built-in package)
    -->
    - `flights` is included in the nycflights13 package.

In [None]:
print(head(subset(flights, month == 3 & day == 1), 5))

- You can easily make your own function. The following example uses `x` and `y` as inputs and returns `x + y`.

In [None]:
mysum <- function(x,y) {
    return(x + y)
}
print(mysum(2,3))
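
As a variation on the example above, the following sketch adds a default value for `y` and omits `return()`. In R, a function returns the value of its last expression. The default of 1 is an arbitrary choice for illustration.

```r
mysum <- function(x, y = 1) {
    x + y  # the value of the last expression is returned
}

print(mysum(2))               # 3 (y falls back to its default)
print(mysum(2, 3))            # 5
print(mysum(y = 10, x = 2))   # 12 (named arguments can come in any order)
```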

## 5. If Statements
- R uses `{}` for if statements. Also, the condition has to be included in `()`. 

In [None]:
inp <- "Hello"
if (inp == "Hello") {
    print("World!")
}

- `else` and `else if` can be used.

In [None]:
inp <- "Hello"
if (inp == "Hell") {
    print("No.")
} else if (inp == "He") {
    print("No.")
} else print("World!")

- The above example looks ugly. Alternatively, you can define a function.

In [None]:
myfunc <- function(inp) {
    if (inp == "Hell") out <- "No."
    else if (inp == "He") out <- "No."
    else out <- "World!"
    return(out)
}
print(myfunc("Hello"))

- Other relational operators are: `!=`, `>`, `<`, `>=`, and `<=`.
- You can also combine them with `&&`, `||` and `%in%`.
    - Unlike Python, R has no `is` or `not`. Use `!` for negation.

In [None]:
year <- 2018
if ((year > 2017) && (year < 2019)) print("2018!")
if (year != 2017) print('Not 2017!')

myvec = c(NA,'b','c')
if (NA %in% myvec) print('NA in myvec!')
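
`if` expects a single `TRUE` or `FALSE`. For a whole vector, the element-wise operators `&` and `|` together with `ifelse()` and `any()` are one option. This sketch goes slightly beyond the text above, and the variable names are illustrative.

```r
years <- c(2016, 2018, 2020)

# `&` compares element by element, whereas `&&` is meant for single values.
in_range <- (years > 2017) & (years < 2019)

# ifelse() picks a value for each element.
year_labels <- ifelse(in_range, "2018!", "Not 2018!")

# any() reduces the vector to one logical value that `if` can use.
if (any(in_range)) print("At least one year is 2018.")

print(in_range)      # FALSE TRUE FALSE
print(!in_range)     # TRUE FALSE TRUE
print(year_labels)
```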

## 6. Loops
- Use `for` or `while` for loops.
- `for` loops use `in`.

In [None]:
for (year in 2010:2019) {
  print(paste("The year is", year))
}

- You can also loop over vectors.

In [None]:
for (vec in c(2010, TRUE, "a")) {
  print(vec)
}

- Loops can be simplified.

In [None]:
for (year in 2010:2019) print(year)

- You can insert elements into a vector, string, matrix, list, etc.

In [None]:
a <- NULL
for (i in 1:10) {
    a[[i]] <- i
}
print(a)
b <- c()
for (i in 1:10) {
    b[[i]] <- i
}
print(b)
c <- ""
for (i in 1:10) {
    c[[i]] <- i
}
print(c)
d <- matrix()
for (i in 1:10) {
    d[[i]] <- i
}
print(d)
e <- list()
for (i in 1:10) {
    e[[i]] <- i
}
print(e)
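
The examples above grow the object inside the loop. When the final length is known, a common alternative is to create the object at full size first. The following sketch assumes that approach, using `numeric()`, `vector("list", n)`, and `seq_len()`.

```r
n <- 10

# Create a numeric vector of n zeros, then fill it in.
squares <- numeric(n)
for (i in seq_len(n)) {
    squares[i] <- i^2
}

# The same idea for a list: n empty (NULL) slots.
items <- vector("list", n)
for (i in seq_len(n)) {
    items[[i]] <- i
}

print(squares)
print(length(items))
```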

- `while` loops use counters.

In [None]:
cnt <- 0
while (cnt < 10) {
    print(cnt)
    cnt <- cnt + 1
}

- Infinite loops use `repeat`.
- Infinite loops often use if statements with `break` and `next`.
    - `break` means you will exit the current loop.
    - `next` means you will go back to the starting point of the current loop (like `continue` in Python).

In [None]:
print("--- infinite loop till 9 ---")
cnt <- 0
repeat {
    print(cnt)
    cnt <- cnt + 1
    if (cnt >= 10) break
}


print("--- infinite loop till 9, except 2 ---")
cnt <- 0
repeat {
    if (cnt == 2) {
        cnt <- cnt + 1
        next
    }
    print(cnt)
    cnt <- cnt + 1
    
    if (cnt >= 10) break
}

- You can combine functions with loops.

In [None]:
myfive <- function(x) { # myfive returns "Five!" if the input is five, and "." otherwise.
    if (x == 5) y <- "Five!"
    else y <- "."
    return(y)
}
for (x in 1:5) {
    print(myfive(x))
}
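
The same loop can be written without `for` by applying the function to each element. The sketch below uses `sapply()` and `vapply()`, which are not covered in these notes; `vapply()` additionally checks that each result is a single character string.

```r
myfive <- function(x) {
    if (x == 5) "Five!" else "."
}

# Apply myfive to each of 1, ..., 5 and collect the results in a vector.
results <- sapply(1:5, myfive)

# vapply() does the same but requires each result to match character(1).
results_checked <- vapply(1:5, myfive, character(1))

print(results)
```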