In [1]:
# ignore warning message if cannot open file
# used for formatting slides, not needed for notes
try(source("../startup.R"))
options(jupyter.plot_mimetypes="image/png")

# STATS 306: Introduction to Statistical Computing

# Administrative stuff

| Person                  | Uniqname | Office Hours | Location      |
|-------------------------|----------|--------------|---------------|
| Prof. Jonathan Terhorst | jonth    | Tu 1-2:30p<br/>Th 3-4:30p | 269 West Hall |
| Byoung Jang             | bwjang   |    TBD       |     SLC       |
| Wayne Wang              | wayneyw  |    TBD       |     SLC       |
| Enes Dilber             | enes     |    TBD       |     SLC       |



## The book

<img style="float:right; margin: 10px; width: 200px" src="http://r4ds.had.co.nz/cover.png"/>

We will follow the book "R for Data Science" (R4DS) by Hadley Wickham and Garrett Grolemund. The electronic version is available [for free](http://r4ds.had.co.nz). There is no need to purchase the hardcopy version unless you enjoy spending money.

## What this course is about
* [Data Visualization](#Data-Visualization)
* [Data Transformation](#Data-Transformation)
* [Exploratory Data Analysis](#Exploratory-Data-Analysis)
* [Strings](#Strings)
* [Dates and Times](#Dates-and-Times)
* [Functions](#Functions) / abstraction
* [Vectors](#Vectors)
* [Iteration](#Iteration)
* [Models](#Models)

## What this course is *not* about
This is not a traditional programming course. You will learn to program in R as a byproduct of learning how to visualize, clean, and model data. However, we will *not* cover things like:
- Algorithms
- Data structures
- OOP
- etc.

If you find that you enjoy programming and want to go further, these would be good topics to learn about in a future course.

# Goals for today's lecture
- Learn how to get Jupyter / R running on your computer.
- Use R to perform a basic statistical analysis.

## Accessing an R programming environment
Everything in this course will be done using [Jupyter notebooks](http://jupyter.org/) running the [R programming language](https://www.r-project.org/). Lecture notes will be distributed in Jupyter notebook format before lecture. You are encouraged to bring your laptop to lecture and follow along.

### Using JS3O
The easiest way to get up and running in this environment is by surfing to [https://jupyter.stats306.org](https://jupyter.stats306.org).


### RStudio
Another popular option is [RStudio](http://rstudio.org). 

You are free to use whatever environment you please, but lectures and assignments will be done using Jupyter notebook.

# What is R
R is a programming language developed by statisticians to perform statistical analysis. The "traditional" way to run R from the Unix command line is by typing the command `R`:

    $ R
    R version 3.5.2 (2018-12-20) -- "Eggshell Igloo"
    Copyright (C) 2018 The R Foundation for Statistical Computing
    Platform: x86_64-apple-darwin17.7.0 (64-bit)

    R is free software and comes with ABSOLUTELY NO WARRANTY.
    You are welcome to redistribute it under certain conditions.
    Type 'license()' or 'licence()' for distribution details.

      Natural language support but running in an English locale

    R is a collaborative project with many contributors.
    Type 'contributors()' for more information and
    'citation()' on how to cite R or R packages in publications.

    Type 'demo()' for some demos, 'help()' for on-line help, or
    'help.start()' for an HTML browser interface to help.
    Type 'q()' to quit R.

    >

We won't use the command line in this class. The Jupyter notebook which runs these slides is running an R "kernel" in the background. Typing commands into these cells is the same as if you had typed them into the R interpreter:

In [2]:
1 + 1

[1] 2

## Vectors
The most basic element of statistics is a vector of numeric data. To create a vector in R, we use the following syntax:

In [17]:
v <- c(1, 2, 3, 4, 5)

This creates a variable named `v` and stores in it the vector of numbers `(1, 2, 3, 4, 5)`. Some things to note:

-  The `<-` (arrow) is an operator used to assign values to variables. (If you are used to programming in other languages, `=` also works.)
-  The `c()` is a function that takes several values and creates a vector out of them.
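
As a small sketch of the two points above (the variable names `a` and `b` are made up for illustration), both assignment operators store a value, and `c()` can also join existing vectors into a longer one:

```r
a <- c(1, 2, 3)   # assignment with the arrow
b = c(4, 5)       # assignment with `=` does the same thing here

ab <- c(a, b)     # c() combines the two vectors end to end
ab                # 1 2 3 4 5
length(ab)        # 5
```

In this course I will stick to `<-`, which is the more common convention in R code.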

Most operations in R are "vectorized", meaning that they operate on every element of a vector at once. If I want to multiply each element of `v` by the number 2, I simply type

In [18]:
v * 2

[1]  2  4  6  8 10

If I want to add together all the numbers in `v`, I can use `sum()`:

In [19]:
sum(v)

[1] 15
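
Vectorization also applies to operations between two vectors of the same length, which are carried out element by element. Here is a minimal sketch; the second vector `w` is a made-up example and not part of the lecture data:

```r
v <- c(1, 2, 3, 4, 5)
w <- c(10, 20, 30, 40, 50)  # hypothetical second vector, for illustration only

v + w     # element-wise sum: 11 22 33 44 55
v * w     # element-wise product: 10 40 90 160 250
mean(v)   # average of the elements of v: 3
```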

## Logical values
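In addition to representing numbers, it's quite common when programming to deal with *boolean* (true / false) values. In R these are represented by the special values `TRUE` and `FALSE`, commonly abbreviated `T` and `F`. (Unlike `TRUE` and `FALSE`, the names `T` and `F` are ordinary variables that can be overwritten, so the full names are safer in code you intend to keep.)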

In [9]:
u <- c(T, F, TRUE, TRUE, FALSE)
u

[1]  TRUE FALSE  TRUE  TRUE FALSE

One way we can get logical values is asking whether numerical values are larger or smaller than some number:

In [12]:
v > 3

[1] FALSE FALSE FALSE  TRUE  TRUE
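
Logical vectors are useful for more than just looking at. As a rough sketch (the name `big` is my own choice), R treats `TRUE` as 1 and `FALSE` as 0 in arithmetic, so `sum()` counts how many elements satisfy a condition:

```r
v <- c(1, 2, 3, 4, 5)
big <- v > 3   # FALSE FALSE FALSE TRUE TRUE

sum(big)       # number of elements larger than 3: 2
any(big)       # is at least one element larger than 3? TRUE
all(big)       # are all elements larger than 3? FALSE
!big           # logical "not": TRUE TRUE TRUE FALSE FALSE
```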

## Missing values
In statistics, missing data is common. One feature that sets R apart from many other languages is its native ability to handle missing data via the special value `NA`:

In [13]:
NA

[1] NA

Think of `NA` as saying that you don't know the value of something. Is `NA` greater than 5? I don't know, because I don't know what the original value was:

In [14]:
NA > 5  # I don't know
NA > NA  # I definitely don't know

[1] NA

[1] NA
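
The same logic carries over to functions that summarize a vector. A minimal sketch, using a made-up vector `x` with one missing entry: the sum of something unknown is itself unknown, unless we explicitly ask R to drop the missing values with `na.rm = TRUE`.

```r
x <- c(2, NA, 6)         # made-up data with one missing value

sum(x)                   # NA, because one of the terms is unknown
sum(x, na.rm = TRUE)     # drop the NA first: 8
mean(x, na.rm = TRUE)    # average of the known values: 4
is.na(x)                 # which entries are missing? FALSE TRUE FALSE
```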

## Functions
Above we saw that typing `c(1,2,3,4,5)` creates a vector with the numbers 1-5. `c()` is an example of a *function*. The general form of a function in R (and most other programming languages) is:

    <function name>(<function arguments>)
    
In the above example, the function is named `c`, and the arguments were `1, 2, 3, 4, 5`. Another example of a function is `print`, which prints its arguments to the screen:

In [21]:
print("I am a function named print")

[1] "I am a function named print"
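
Many functions take several arguments, and in R arguments can be passed by name. The calls below are a short sketch using a few built-in functions chosen only for illustration:

```r
round(3.14159, digits = 2)      # round to two decimal places: 3.14
seq(from = 1, to = 9, by = 2)   # sequence from 1 to 9 in steps of 2: 1 3 5 7 9
paste("STATS", 306)             # glue values into one string: "STATS 306"
```

Naming the arguments (`digits = 2`) is optional in many cases, but it makes the call easier to read.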


## Data Frames

Our main goal in R is to work with data, and one of the most fundamental objects in R is the *data frame*. Think of a data frame as a container for a bunch of vectors of data:

![dataframe](https://garrettgman.github.io/images/tidy-2.png)

In [27]:
library(tidyverse)
print(population)

# A tibble: 4,060 x 3
   country      year population
   <chr>       <int>      <int>
 1 Afghanistan  1995   17586073
 2 Afghanistan  1996   18415307
 3 Afghanistan  1997   19021226
 4 Afghanistan  1998   19496836
 5 Afghanistan  1999   19987071
 6 Afghanistan  2000   20595360
 7 Afghanistan  2001   21347782
 8 Afghanistan  2002   22202806
 9 Afghanistan  2003   23116142
10 Afghanistan  2004   24018682
# … with 4,050 more rows


To extract a vector (column) of data from a data frame, we use the `$` operator:

In [30]:
print(population$country)

   [1] "Afghanistan"                       "Afghanistan"                      
   [3] "Afghanistan"                       "Afghanistan"                      
   [5] "Afghanistan"                       "Afghanistan"                      
   [7] "Afghanistan"                       "Afghanistan"                      
   [9] "Afghanistan"                       "Afghanistan"                      
  [11] "Afghanistan"                       "Afghanistan"                      
  [13] "Afghanistan"                       "Afghanistan"                      
  [15] "Afghanistan"                       "Afghanistan"                      
  [17] "Afghanistan"                       "Afghanistan"                      
  [19] "Afghanistan"                       "Albania"                          
  [21] "Albania"                           "Albania"                          
  [23] "Albania"                           "Albania"                          
  [25] "Albania"                           "Albania"
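
To see how `$` works on something small enough to read in full, here is a sketch that builds a toy data frame by hand. The data frame `grades` and its contents are invented for this example, and I use base R's `data.frame()` so that no extra packages are needed:

```r
# A made-up data frame with two columns (vectors) of equal length
grades <- data.frame(
  student = c("A", "B", "C"),
  score   = c(90, 85, 77)
)

grades$score         # extract the score column as a vector: 90 85 77
mean(grades$score)   # the extracted column works like any other vector: 84
nrow(grades)         # number of rows: 3
```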

## Loading a data frame
To load a data frame from a file we use the `load()` function, which restores R objects saved in an `.RData` file under the names they were saved with. In the same folder as this lecture there is a file called `flint.RData`.

In [32]:
load("flint.RData")

This has loaded a data frame into a variable called `flint` containing data from the Flint water crisis:

In [38]:
print(flint)

# A tibble: 23,184 x 10
   `Sample Number` `Date Submitted`    `Analysis (Lead… `Lead (ppb)`
   <chr>           <dttm>              <chr>                   <dbl>
 1 LF84899         2015-09-25 11:07:30 Lead                        0
 2 LF85330         2015-09-29 14:35:09 Lead                        0
 3 LF85604         2015-09-30 13:06:52 Lead                        0
 4 LF85613         2015-09-30 13:07:02 Lead                        0
 5 LF85796         2015-10-01 11:10:35 Lead                        0
 6 LF85797         2015-10-01 11:10:36 Lead                        0
 7 LF85799         2015-10-01 11:10:38 Lead                        0
 8 LF85802         2015-10-01 11:10:41 Lead                        0
 9 LF85862         2015-10-01 12:46:38 Lead                        0
10 LF85931         2015-10-02 09:54:53 Lead                        0
# … with 23,174 more rows, and 6 more variables: `Analysis (Copper)` <chr>,
#   `Copper (ppb)` <dbl>, `Street #` <chr>, `Street Name` <chr>, City <c
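
If you are curious how `load()` knew to call the variable `flint`, the following sketch shows a round trip with `save()` and `load()`. The data frame `toy` and the temporary file are made up for illustration and have nothing to do with how `flint.RData` was actually produced:

```r
path <- tempfile(fileext = ".RData")   # a throwaway file location

toy <- data.frame(x = c(1, 2, 3))      # made-up data frame
save(toy, file = path)                 # write the object to disk
rm(toy)                                # delete it from the current session

loaded <- load(path)   # restores the object under its original name
loaded                 # load() returns the restored names: "toy"
toy$x                  # the data frame is back: 1 2 3
```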

Let's do some basic analysis of the Flint water crisis using this data. 

What are some interesting questions we could ask about this data set?