## Conversion Notebook: Mapping from Data 8X's `datascience` to R

Throughout Data 8X, we have worked extensively with the `datascience` library, which was created by faculty at UC Berkeley specifically for this course. Although the library itself is not used outside the course, the ideas and concepts behind it and its functions are used widely in real-world data science. This notebook serves as an introduction to basic R terminology, data structures, and commands. The functions introduced are analogous to those in Berkeley's `datascience` module, with examples provided for each.

We will cover the following topics in this notebook:
1. Basics of R
    - Importing and Loading Packages
    - Arithmetic and Logical Operators
    - Assigning Variables
2. Dataframes: Storing Tabular Data
    - Creating a Dataframe
    - Accessing Values in Dataframe
    - Manipulating Data
3. Visualizing Data
    - Histograms
    - Line and Scatter Plots
    - Bar Charts
    
For reference:

Datascience documentation: http://data8.org/datascience/index.html

Python Reference: http://data8.org/python-reference/python-reference.html

R documentation: https://www.rdocumentation.org/

## 1. Basics of R

R is a command-line-driven program: the user can enter expressions, create variables, and define and run functions in the R console. In the Jupyter notebook interface, code chunks can be run as individual cells, either by clicking 'Run' in the toolbar above or by using the shortcut `shift + enter`.
<br>

### 1.1 Importing and Loading Packages

In Python, we use the following syntax to install packages:
```python
!pip install datascience
```
And we load them using:
```python
import numpy as np
from datascience import Table
```

In R, we use `install.packages('package_name')` to install new packages from the CRAN repository. For a full list of available packages, refer to https://cran.r-project.org/web/packages/available_packages_by_name.html

It is not necessary to reinstall packages every time we quit or reload an R session. Once a package is installed, we can load it using `library('package_name')`.

In [None]:
# example: install a package
install.packages('ggplot2')

In [None]:
# example: loading a package
library('ggplot2')
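
Since installation only needs to happen once, a common pattern is to install a package only when it is missing. The following is a minimal sketch of that idea; `ensure_package` is a hypothetical helper name, not a built-in function, and the call uses the `stats` package (which ships with R) so that nothing is downloaded.

```r
# hypothetical helper: install a package only if it is not already available
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # report whether the package can now be loaded
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# 'stats' ships with R, so this call does not install anything
ensure_package("stats")
```

Calling `ensure_package('ggplot2')` followed by `library('ggplot2')` would follow the same pattern for a CRAN package.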

### 1.2 Arithmetic and Logical Operators

#### Arithmetic Operators

Here are the basic arithmetic operations in Python:
<br>
```python
import math
import numpy as np

print(2 + 3) # add numbers
print(3**4) # powers
print(pow(3, 4)) # powers
print(math.sqrt(4**4)) # functions
print(21 % 5) # 21 mod 5
print(math.log(10)) # take log
print(math.exp(2)) # exponential
print(np.abs(-2)) # absolute value
print(2*math.pi) # mathematical constant

# scientific notation
print(5000000000 * 1000)
print(5e9 * 1e3)
```
In Python, we need to import `numpy` and `math` for certain mathematical operations. In R, however, these capabilities are built-in so no imports are required.
<br>
<br>
Running the following cells will demonstrate some basic operations performed in R.

In [None]:
# adding two numbers
2 + 3

In [None]:
# raising to a power
3 ^ 4

In [None]:
# square roots
sqrt(4 ^ 4)

In [None]:
# 21 mod 5
21 %% 5 

In [None]:
# taking the log
log(10)

In [None]:
# exponential
exp(2) 

In [None]:
# using mathematical constants
2 * pi 

In [None]:
# absolute value
abs(-2)

In [None]:
# scientific notation
5e9 * 1e3
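
A couple of related operations often come up alongside the ones above. This short sketch shows integer division (the counterpart of Python's `//`) and logarithms in bases other than e; the variable names are illustrative only.

```r
# integer division: how many whole times 5 goes into 21
quotient <- 21 %/% 5
quotient

# log() is the natural log by default; another base can be requested
log_base10 <- log(100, base = 10)
log_base10

# shortcut for base 2
log_base2 <- log2(8)
log_base2
```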

#### Logical Operators

Now, recall the logical operations in Python:
```python
print((1 > 0) and (3 <= 5))
print((1 < 0) or (3 > 5))
print((3 == 9/3) or (2 < 1) )
print(not(2 != 4/3))
```
<br>
In R, the comparison operators are `<`, `<=`, `>`, `>=`, `==` for exact equality, and `!=` for inequality.

Python's `and`, `or`, and `not` are replaced by `&`, `|`, and `!`.

The boolean values `True`/`False` in Python correspond to `TRUE`/`FALSE` in R (notice the difference in case).
<br><br>
Run the cells below to see how logical operators work in R.

In [None]:
(1 > 0) & (3 <= 5)

In [None]:
(1 < 0) | (3 > 5)

In [None]:
(3 == 9/3) | (2 < 1)

In [None]:
!(2 != 4/3)
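
One difference from Python's `and`/`or` is that `&` and `|` work element by element on vectors, much like comparisons on NumPy arrays. The sketch below uses a made-up vector `x` to illustrate this, along with `any()` and `all()` for collapsing the result to a single value. It also shows why `==` on decimal numbers can surprise, with `all.equal()` as a tolerance-based alternative.

```r
x <- c(1, 5, 10)

# element-wise: one TRUE/FALSE per element of x
in_range <- x > 2 & x < 8
in_range

# collapse a logical vector to a single value
any_large <- any(x > 8)
all_positive <- all(x > 0)

# == tests exact equality, which floating point arithmetic can break
exact <- (0.1 + 0.2 == 0.3)
approx <- isTRUE(all.equal(0.1 + 0.2, 0.3))
exact
approx
```

R also has `&&` and `||`, which are intended for single TRUE/FALSE values, for example inside an `if` condition.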

### 1.3 Assigning Variables

**Assignment**

In R, the assignment operator is `<-`. This represents an arrow, since we are assigning a value on the right side of the operator to the variable on the left side. In most (not all) contexts, the `=` operator can be used as an alternative. It is recommended to use `<-` as standard usage to avoid mistakes.

Note: Variable names in R are case sensitive, which means `A` and `a` are different symbols and refer to different variables.

In [None]:
# run this cell
val <- 3
print(val) # same usage of print function as in Python 3

Val <- 7 # case-sensitive!
print(Val)

**Vectors**

Vectors are the basic data structure in R. They play the role of lists in Python, although, like NumPy arrays, all elements of a vector share a single type. The syntax for creating a vector is `c(a, b, c)`, which creates a vector containing the values `a`, `b`, and `c`. `c()` is the combine function: it combines its arguments into a vector.
<br>
```python
# A comparison: create a list of numbers in python
a = [0.125, 4.75, -1.3]
a1 = np.array([0.125, 4.75, -1.3])
```

In [None]:
# run this cell
a <- c(1, 2, 3)
a
b <- c(4, 5, 6)
b

In [None]:
# combine two vectors
ab <- c(a, b)
ab
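
Arithmetic on vectors is applied element by element, as with NumPy arrays. The sketch below uses illustrative names (`v`, `w`, `mixed`) and also shows what happens when values of different types are combined: they are all converted to one common type.

```r
v <- c(1, 2, 3)
w <- c(4, 5, 6)

# element-wise arithmetic
v_plus_w <- v + w
v_plus_w
v_doubled <- v * 2
v_doubled

# number of elements in a vector
length(c(v, w))

# mixing types coerces every element to a common type (here: character)
mixed <- c(1, "a", TRUE)
class(mixed)
```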

**Creating Sequences**

In Python, we used `np.arange` to create a numpy array with a start, end and a step value as follows:
```python
a = np.arange(4, 9, 1) # creates [4 5 6 7 8]
```
<br>
In R, we can use the `seq` function to do the same. Unlike in `np.arange`, the end value is inclusive in `seq`. Naming the `from`, `to`, and `by` parameters makes the call explicit.

In [None]:
# run this cell
seq1 <- seq(from=4, to=9, by=1)
seq1 # Notice the output difference with np.arange

There are more parameters available for the `seq` function. To pull up more information about an R function, we can use either `?seq` or `help(seq)`.

Another important difference is that in R, indexing starts at 1, unlike 0 in Python.

In [None]:
seq1[1] # extracting element at first index of vector
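
To make the differences from Python concrete, here is a small sketch of indexing on a sequence. The name `s` is illustrative. Note in particular that a negative index in R drops an element rather than counting from the end.

```r
s <- seq(from = 4, to = 9, by = 1)

# the colon operator builds the same run of values
same_values <- all(s == 4:9)
same_values

# last element: index with the length, since indexing starts at 1
last_value <- s[length(s)]
last_value

# slices include both ends
middle <- s[2:3]
middle

# a negative index removes that position (it does NOT count from the end)
without_first <- s[-1]
without_first
```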

## 2. Dataframes: Storing Tabular Data

In Python's `datascience` module, we used `Table` to build our dataframes and used commands such as `select()`, `where()`, `group()`, `column()` etc. In this section, we will go over some basic commands to work with tabular data in R using Dataframes.

### 2.1 Creating a Dataframe

In Python's `datascience` module that is used in Data 8, this is how we created tables from scratch by extending an empty table:
```python
t = Table().with_columns([
     'letter', ['a', 'b', 'c', 'z'],
     'count',  [  9,   3,   3,   1],
     'points', [  1,   2,   2,  10],
 ])
```
<br> 
In R, we can initialize a dataframe using `data.frame()`. For a full list of parameters and options, refer to [this guide](https://www.rdocumentation.org/packages/base/versions/3.5.0/topics/data.frame).

In versions of R before 4.0.0, `data.frame` coerces all character variables to factors unless told otherwise; since R 4.0.0, strings are kept as character variables by default. Specifying `stringsAsFactors = FALSE` keeps the strings as character variables in either case.

In [None]:
# example: creating a dataframe in R
t <- data.frame(letter = c('a', 'b', 'c', 'z'),
                count = c(9, 3, 3, 1),
                points = c(1, 2, 2, 10),
                stringsAsFactors = FALSE
               )
t
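
To check how the columns of a dataframe were stored, we can inspect their classes. The sketch below builds two small illustrative dataframes (`demo_chr` and `demo_fct` are made-up names) that differ only in the `stringsAsFactors` setting.

```r
demo_chr <- data.frame(letter = c("a", "b"),
                       count = c(9, 3),
                       stringsAsFactors = FALSE)
demo_fct <- data.frame(letter = c("a", "b"),
                       count = c(9, 3),
                       stringsAsFactors = TRUE)

# class of every column
sapply(demo_chr, class)

# the same strings stored as a factor, with its distinct levels
class(demo_fct$letter)
levels(demo_fct$letter)
```

`str(demo_chr)` gives a similar overview, including the first few values of each column.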

More often, we will need to create a dataframe by importing data from a .csv file. In `datascience`, this is how we read data from a csv:
```python
Table.read_table('sample.csv')
```

In R, we can use `read.csv()` to read data from a csv file. There are many parameters that we can specify depending on the data we are importing (related to the header, column names, etc.). For a full list of parameters, refer to [this guide](https://www.rdocumentation.org/packages/utils/versions/3.5.0/topics/read.table).

In [None]:
# example: reading baby.csv (Located in current working directory)
baby <- read.csv('baby.csv')
head(baby) # display first few rows of dataframe
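
One detail worth knowing is that `read.csv()` by default rewrites column headers into syntactically valid R names, so a header containing a space gets a `.` in its place. The sketch below demonstrates this on a tiny temporary file; the file contents and the names `csv_path`, `cleaned`, and `raw` are made up for illustration.

```r
# write a two-row csv with spaces in its headers to a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Birth Weight,Gestational Days",
             "120,284",
             "113,282"),
           csv_path)

# default behaviour: headers are converted to valid R names
cleaned <- read.csv(csv_path)
colnames(cleaned)

# check.names = FALSE keeps the headers exactly as written
raw <- read.csv(csv_path, check.names = FALSE)
colnames(raw)
```

With the converted names, columns can be accessed with `$` directly, for example `cleaned$Birth.Weight`.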

To see a quick summary of the data, we can call `summary()` on our dataframe to get summary statistics for each column: the min, 1st quartile, median, mean, 3rd quartile, and max. Many other functions help with initial exploration of a dataframe: for example, we can look at the first few rows (using the `head` function), the dimensions, the number of rows, and the column names.

In [None]:
# view data summary
summary(baby)

In [None]:
# view information about dataframe
nrow(baby) # display no. of rows
dim(baby) # view dimensions (rows, cols)
colnames(baby) # view column names

### 2.2 Accessing Values in Dataframe

In `datascience`, we can use `column()` to access all the values in a particular column as follows:
```python
In [10]: t.column('letter')
Out[10]: 
array(['a', 'b', 'c', 'z'], 
      dtype='<U1')
```

In R, to access the values in a particular column, we can use the `$` sign or the syntax `df[, colname]`. To use `$`, we specify the dataframe name followed by the column name: for a dataframe called `df`, we would run `df$column_name`.

In [None]:
# accessing column values
t$letter
t[, 'letter'] # Can also use t[, 1] to access column at first index
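
The different ways of accessing a column do not all return the same kind of object: some give a plain vector and others a one-column dataframe. The following sketch, on an illustrative dataframe named `cols_demo`, shows which is which.

```r
cols_demo <- data.frame(letter = c("a", "b", "c"),
                        count = c(9, 3, 3),
                        stringsAsFactors = FALSE)

# these three return a plain vector
as_vector <- cols_demo[["letter"]]
is.vector(cols_demo$letter)
is.vector(cols_demo[, "letter"])

# these two return a dataframe with a single column
single_bracket <- cols_demo["letter"]
kept_frame <- cols_demo[, "letter", drop = FALSE]
is.data.frame(single_bracket)
is.data.frame(kept_frame)
```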

In `datascience`, we can use `take` to access rows:
```python
In [34]: t.take[0:2] # the first and second rows
Out[34]: 
points | letter | count
1      | a      | 9
2      | b      | 3
```

In R, we can use the following syntax to access row data: `df[rowname, ]` (similar to how we accessed columns above). Note: we CANNOT use the `$` notation here, since it only works for columns. We must use the bracket notation.

In [None]:
# example: Access first two rows of dataframe
t[1:2, ]

We can also access a specific value in the dataframe by specifying the row and column as follows. Remember that we specify the row first and then the column: `df[row, column]`.

In [None]:
# extracting one value
t[1, 'letter']
# slicing the dataframe
t[1:3, 'count']

### 2.3 Manipulating Data

**Adding Columns**

Adding a new column in `datascience` is done with the `with_column()` function as follows:
```python
In [23]: t.with_column('vowel', ['yes', 'no', 'no', 'no'])
Out[23]: 
points | letter | count | vowel
1      | a      | 9     | yes
2      | b      | 3     | no
2      | c      | 3     | no
10     | z      | 1     | no
```
In R, we can use `data$newcolumn <- datavector` to add a new column to an existing dataframe. On the right side of the arrow is the data vector we want to add; on the left side, the `$` notation creates the new column (or overwrites it if a column with that name already exists).

In [None]:
# example: Adding a new column
t$vowel <- c('yes', 'no', 'no', 'no')
t

We can also derive a new column from an existing one. The existing column is left unchanged, and the computed values are stored as a new column of the same dataframe.

In [None]:
# Example: Adding twice the count to the dataframe
t$double <- t$count * 2
t
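
The `$<-` approach changes the dataframe in place, whereas `with_column()` in `datascience` returns a new Table. When the original should stay untouched, one option in base R is `transform()`, sketched here on an illustrative dataframe (`counts_demo` and `with_double` are made-up names).

```r
counts_demo <- data.frame(letter = c("a", "b"),
                          count = c(9, 3),
                          stringsAsFactors = FALSE)

# transform() returns a copy with the extra column
with_double <- transform(counts_demo, double = count * 2)
with_double

# the original dataframe still has only its two columns
colnames(counts_demo)
```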

**Selecting Columns**

In `datascience`, we used `select()` to subset the dataframe by selecting columns:
```python
In [27]: t.select(['letter', 'points'])
Out[27]: 
letter | points
a      | 1
b      | 2
c      | 2
z      | 10
```

In R, we can subset the dataframe by columns using the `df[, colnames]` notation. We can pass in the column names as a vector or specify the indexes of the columns we want to use.

In [None]:
# selecting columns 1 through 3
t[, 1:3]

In [None]:
# selecting columns by name
colstoselect <- c('letter', 'points')
t[, colstoselect]
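
Two further variations on column selection are sketched below, using an illustrative dataframe named `sel_demo`: the base R `subset()` function with its `select` argument, and dropping a column by name with a logical condition on `colnames()`.

```r
sel_demo <- data.frame(letter = c("a", "b", "c", "z"),
                       count = c(9, 3, 3, 1),
                       points = c(1, 2, 2, 10),
                       stringsAsFactors = FALSE)

# select columns by (unquoted) name
picked <- subset(sel_demo, select = c(letter, points))
picked

# keep every column except 'count'
without_count <- sel_demo[, colnames(sel_demo) != "count"]
without_count
```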

**Filtering Rows Conditionally**

In `datascience`, we used `where()` to select rows according to a given condition:
```python
In [35]: t.where('points', 2) # rows where points == 2
Out[35]: 
points | letter | count
2      | b      | 3
2      | c      | 3

In [36]: t.where(t['count'] < 8) # rows where count < 8
Out[36]: 
points | letter | count
2      | b      | 3
2      | c      | 3
10     | z      | 1
```

In R, we use the following syntax to subset rows: `df[df$colname <logical operator> value, ]` (Note the comma at the end!). The following examples will provide different ways of filtering the dataframe.

In [None]:
# rows where points = 2
t[t$points == 2, ]

In [None]:
# rows where count < 8
t[t['count'] < 8, ] # we can also use df['colname'] instead of df$colname

In [None]:
# multiple conditions: points < 2 and vowel = 'yes'
t[t$points < 2 & t['vowel'] == 'yes', ]
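
Base R also provides `subset()`, which reads closer to `where()` because the column names can be written without repeating the dataframe name. This is a sketch on an illustrative dataframe named `filter_demo`, not a replacement for the bracket notation.

```r
filter_demo <- data.frame(letter = c("a", "b", "c", "z"),
                          count = c(9, 3, 3, 1),
                          points = c(1, 2, 2, 10),
                          stringsAsFactors = FALSE)

# rows where points == 2
twos <- subset(filter_demo, points == 2)
twos

# multiple conditions: count < 8 and points > 1
small_counts <- subset(filter_demo, count < 8 & points > 1)
small_counts
```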

**Renaming Columns**

In `datascience`, we used `relabeled()` to rename columns:
```python
In [29]: t.relabeled('points', 'other name')
Out[29]: 
other name | letter | count
1          | a      | 9
2          | b      | 3
2          | c      | 3
10         | z      | 1
```

In R, we can use the following to rename columns: `colnames(df)[colnames(df) == oldcolname] <- newcolname`. This selects the required column by filtering from `colnames(df)` and then assigns the new name to that column. Note that the Python `relabeled()` function creates a new Table, while the R method to rename mutates the current table.

In [None]:
# example: renaming double to twice
colnames(t)[colnames(t) == 'double'] <- 'twice'
t
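
To get behavior closer to `relabeled()`, which leaves the original Table alone, we can rename the columns of a copy instead. In R, assigning a dataframe to a new name and then modifying it does not affect the original. The names `rename_demo` and `renamed` below are illustrative.

```r
rename_demo <- data.frame(points = c(1, 2),
                          letter = c("a", "b"),
                          stringsAsFactors = FALSE)

# work on a copy so the original keeps its column names
renamed <- rename_demo
colnames(renamed)[colnames(renamed) == "points"] <- "other_name"

colnames(rename_demo) # unchanged
colnames(renamed)     # renamed copy
```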

**Sorting Dataframe by Column**

In `datascience` we used `sort()` to sort a Table according to the values in a column:
```python
In [40]: t.sort('count')
Out[40]: 
points | letter | count
10     | z      | 1
2      | b      | 3
2      | c      | 3
1      | a      | 9
```

In R, we can use `order()` in the subset notation: `df[order(df$colname), ]`.

In [None]:
# sort by count
t[order(t$count), ]

In [None]:
# sort by descending order of count
t[order(-t$count), ]
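
`order()` accepts more than one column, in which case later columns break ties in earlier ones. The minus-sign trick only works for numeric columns, so for text columns the `decreasing` argument can be used instead. A short sketch on an illustrative dataframe named `sort_demo`:

```r
sort_demo <- data.frame(letter = c("a", "b", "c", "z"),
                        count = c(9, 3, 3, 1),
                        stringsAsFactors = FALSE)

# sort by count, breaking ties by letter
by_count_then_letter <- sort_demo[order(sort_demo$count, sort_demo$letter), ]
by_count_then_letter

# sort a text column in descending order
by_letter_desc <- sort_demo[order(sort_demo$letter, decreasing = TRUE), ]
by_letter_desc
```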

**Grouping and Aggregating**

In `datascience`, we used `group()` and the `collect` argument to group a Table by a column and aggregate values in another column:
```python
In [42]: t.select(['count', 'points']).group('count', collect=sum)
Out[42]: 
count | points sum
1     | 10
3     | 4
9     | 1
```

In R, we can use `aggregate()` to group and aggregate values in our dataframe. The formula `points ~ count` reads as "summarize `points` for each value of `count`".

In [None]:
# grouping by count and summing points
aggregate(points ~ count, t, sum)
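
Any function that reduces a vector to a single value can take the place of `sum`. The sketch below, on an illustrative dataframe named `group_demo`, uses `mean` with `aggregate()` and also shows `tapply()`, a base R alternative that returns a named vector rather than a dataframe.

```r
group_demo <- data.frame(count = c(9, 3, 3, 1),
                         points = c(1, 2, 2, 10))

# mean of points within each count group
mean_points <- aggregate(points ~ count, data = group_demo, FUN = mean)
mean_points

# the same grouping with tapply(): result is named by group value
sum_points <- tapply(group_demo$points, group_demo$count, sum)
sum_points
```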

**Pivot Tables**

In `datascience`, we used the `pivot()` function to build contingency tables:
```python
In [44]: other_table
Out[44]: 
mar_status | empl_status     | count
married    | Working as paid | 1
married    | Working as paid | 1
partner    | Not working     | 1
partner    | Not working     | 1
married    | Not working     | 1

In [45]: other_table.pivot('mar_status', 'empl_status', 'count', collect=sum)
Out[45]: 
empl_status     | married | partner
Not working     | 1       | 2
Working as paid | 2       | 0
```

In R, one simple way of building a pivot table is to use the `table()` function.

In [None]:
# read in couples.csv
couples <- read.csv('couples.csv', header = TRUE)
head(couples)

`table` is a function that creates tabular results for categorical variables. We can use it to summarize a vector of categorical values. For example, to count the number of females and males:

In [None]:
# count no. of females and males
table(couples$Gender)

When `table` takes two vectors that can be interpreted as factors (categorical variables), we get the contingency table of counts for each pair of values.

In [None]:
# creating a pivot table
ratings_by_gender <- table(couples$Gender, couples$Relationship.Rating)
ratings_by_gender
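
`table()` counts rows, whereas the `datascience` pivot example above sums a `count` column. One way to sketch that summing behavior in base R is `xtabs()`. The dataframe below is retyped from the `other_table` example, and the name `pivot_sum` is illustrative.

```r
other_table <- data.frame(
  mar_status = c("married", "married", "partner", "partner", "married"),
  empl_status = c("Working as paid", "Working as paid",
                  "Not working", "Not working", "Not working"),
  count = c(1, 1, 1, 1, 1),
  stringsAsFactors = FALSE
)

# sum 'count' for every combination of empl_status (rows) and mar_status (columns)
pivot_sum <- xtabs(count ~ empl_status + mar_status, data = other_table)
pivot_sum
```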

## 3. Visualizing Data

In Python, we learned to plot data using histograms, line plots, scatter plots, and bar charts. In `datascience`, the corresponding functions were `hist()`, `plot()`, `scatter()`, and `barh()`.

In this section we will go through some examples of plots in R.

### 3.1 Histograms

In R, we can simply use the `hist()` function to create a histogram. In this example, we will be using data from `baby.csv`. Recall that the baby data set contains data on a random sample of 1,174 mothers and their newborn babies. The column `Birth.Weight` contains the birth weight of the baby, in ounces. The data set also records the number of gestational days (the number of days the baby was in the womb), maternal age, maternal height, maternal pregnancy weight, and whether or not the mother was a smoker.

In [None]:
head(baby)

In [None]:
# plotting a histogram of birth weights
hist(baby$Birth.Weight, 
     col=c("darkblue"), # histogram color
     main = "Birth weights of babies", # plot title
     xlab = "Birth weights (in ounces)", # x axis label
     xlim = c(40, 180)) # x axis range
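
The bins of a histogram can be controlled with the `breaks` argument, and `hist()` also returns the bin counts it computed. The sketch below uses synthetic, randomly generated weights (NOT the values in `baby.csv`) so that it runs on its own; the names `fake_weights`, `bin_edges`, and `h` are illustrative.

```r
# synthetic data for illustration only
set.seed(8)
fake_weights <- rnorm(200, mean = 120, sd = 18)

# bins of width 10 that are guaranteed to cover all the data
bin_edges <- seq(floor(min(fake_weights) / 10) * 10,
                 ceiling(max(fake_weights) / 10) * 10,
                 by = 10)

h <- hist(fake_weights,
          breaks = bin_edges, # explicit bin edges
          col = "darkblue",
          main = "Synthetic birth weights",
          xlab = "Weight (in ounces)")

# the returned object stores the count in each bin
h$counts
```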

### 3.2 Line and Scatter Plots

We can use the `plot()` function in R to plot two vectors X and Y. The first argument is the X values and the second argument is the Y values. We can specify the `type` parameter and set it to `'l'` for line plots, `'p'` for scatter plots, and `'b'` for both. For a full list of parameters, refer to [this guide](https://www.rdocumentation.org/packages/graphics/versions/3.5.0/topics/plot).

In [None]:
# line plot
plot(c(1, 2, 3, 4), c(10, 12, 23, 30), type='l', col=c("darkblue"))

In [None]:
# scatter plot
plot(c(1, 2, 3, 4), c(10, 12, 23, 30), type='p', col=c("darkblue"))

### 3.3 Bar Charts

We can plot categorical variables as a bar chart using the `barplot()` function in R.

Recall the contingency table `ratings_by_gender`.

In [None]:
ratings_by_gender

We will now create a bar chart using `barplot()`. For a full list of parameters, refer to [this guide](https://www.rdocumentation.org/packages/graphics/versions/3.5.0/topics/barplot).

In [None]:
# creating a bar chart
barplot(ratings_by_gender, 
        main="Rating by Gender", # Title of the plot
        xlab="Rating", # x axis label
        col=c("darkblue","yellow"), # specify color
        legend = rownames(ratings_by_gender), # legend
        beside=TRUE # plot bars for male and female side by side
       )
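
`barplot()` also accepts a plain matrix, which makes it easy to experiment without a data file. In the sketch below the numbers are made up for illustration, and `ratings_demo` and `bar_mids` are hypothetical names. Each column of the matrix becomes a group of bars and each row becomes one bar within the group.

```r
# made-up counts: rows are bars within a group, columns are groups
ratings_demo <- matrix(c(3, 5, 4, 2),
                       nrow = 2,
                       dimnames = list(c("Female", "Male"),
                                       c("Low", "High")))

bar_mids <- barplot(ratings_demo,
                    beside = TRUE, # side-by-side bars
                    col = c("darkblue", "yellow"),
                    legend.text = rownames(ratings_demo),
                    main = "Made-up ratings",
                    xlab = "Rating")

# barplot() returns the x positions of the bar midpoints
dim(bar_mids)
```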

### Reading `R` Documentation
There are many more functions in `R` that expand on what we went through in this notebook. One way to learn more is to look through the `R` documentation, which covers the functions available in `R`, with descriptions of what they do, how to use them, and some examples.

Some tips for reading through the documentation:
* Many functions have LOTS of parameters, but usually only a few are important (typically those describing the data you are working with and how the function should run). The rest are optional: if you do not specify them, R uses their default settings.
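
To see which parameters a function has and what their defaults are without leaving the notebook, the base R functions `args()` and `formals()` can help. A brief sketch using `read.csv` as the example (the name `default_header` is illustrative):

```r
# print the parameters of read.csv together with their default values
args(read.csv)

# look up one default programmatically
default_header <- formals(read.csv)$header
default_header
```

`example(seq)` runs the examples from a function's help page, which can be a quick way to see it in action.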

### Further Reading
Here is a list of useful resources for R:
* [R Documentation](https://www.rdocumentation.org/)
* [R for Data Science Online Textbook](http://r4ds.had.co.nz/)
* [Quick-R (Short R Tutorials)](https://www.statmethods.net/)
* [Professor Gaston Sanchez's Tutorials](http://www.gastonsanchez.com/)