# Law, Order, and Algorithms
## Introduction to `R`

In [0]:
# Some initial setup
options(digits = 3)

### `R` basics

#### Assignment

The convention for assigning values to variables in `R` is an arrow (`<-`), where the direction of the arrow indicates the direction of assignment.
For example, if we want to assign the value `12` to a variable named `A`,

In [0]:
A <- 12   # This works
print(A)  # ... and this statement shows us ("prints") the value currently assigned to A
12 -> A   # So does this
print(A)

The more "standard" assignment using the equal sign (`=`) also works, but _only for assignment to the left_. In other words:

In [0]:
A = 12  # This works

In [0]:
12 = A  # But this doesn't!

#### Vectors

The native unit for variables in `R` is a vector. For example, the `A` variable we created above is actually a _vector_ of length 1.
We can create vectors of longer length by `c`ombining multiple values together.

In [0]:
X <- c(1, 2, 3)
print(X)
Y <- c("this", "that", "those")
print(Y)
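To check the claim that a single value is itself a vector of length 1, here is a minimal sketch using the base `R` functions `length()` and `is.vector()`:

```r
A <- 12
print(length(A))     # A single value has length 1
print(is.vector(A))  # ... and it is reported as a vector

# Combining a length-1 vector with more values gives a longer vector
X <- c(A, 13, 14)
print(length(X))
```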

A `seq`uence of numbers can be created using `seq(from, to, by = 1)`.
In other words, the `seq()` function takes three arguments named `from`, `to`, and `by`.
The last argument (`by`) is optional and defaults to `1` if not supplied.
For example,

In [0]:
seq(1, 5)  # Creates a sequence of 1 to 5

In [0]:
seq(1, 5, 2)  # Creates a sequence of 1 to 5, but in steps of 2
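Because the arguments of `seq()` have names, they can also be passed by name. The sketch below (variable names `s1`, `s2`, `s3` are just for illustration) shows that omitting `by` gives steps of 1, and that named arguments may be given in any order:

```r
s1 <- seq(from = 1, to = 5)          # by is omitted, so steps of 1
s2 <- seq(from = 1, to = 5, by = 2)  # steps of 2
s3 <- seq(by = 2, to = 5, from = 1)  # same call, arguments reordered by name

print(s1)
print(s2)
print(s3)
```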

Since sequences in steps of 1 are created quite often, `R` provides a short-hand notation in the form `from:to`. 
For example,

In [0]:
1:5  # Short-hand notation for generating a sequence of 1 to 5, in increments of 1

Use square brackets (`[]`) to index a vector. The first element is at index `1`, _not_ `0`.

In [0]:
X <- c(10, 11, 12, 13)
X[1]

Note that you _can_ index a position beyond the length of the vector.
`R` will NOT fail; instead, it returns a special value called `NA`.

In [0]:
X[500000]
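Since an out-of-range index returns `NA` silently, it can be worth checking for it explicitly. A small sketch, assuming you want to detect the problem rather than carry the `NA` forward (`i` and `value` are illustrative names):

```r
X <- c(10, 11, 12, 13)
i <- 500000

value <- X[i]           # No error, just NA
print(is.na(value))     # is.na() tests for the special NA value
print(i <= length(X))   # Comparing against length(X) catches the bad index up front
```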

In `R`, a _negative_ index is used to _exclude_ elements.

In [0]:
X[-1]  # This will return all but the first element of X
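Negative indices can also be given as a vector to exclude several elements at once. A short sketch (the variable names are illustrative):

```r
X <- c(10, 11, 12, 13)

without_first_two <- X[-c(1, 2)]  # Drops the first and second elements
without_last <- X[-length(X)]     # Drops the last element, whatever the length

print(without_first_two)
print(without_last)
```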

A vector can also be used to index multiple elements of another vector.
For example, if you want the second and fourth elements of `X`,

In [0]:
ind <- c(2, 4)  # A vector that we create for the sole purpose of indexing another vector, X
X[ind]          # We get the second and fourth elements of X, because ind = (2, 4)

#### Exercise: vector
Create a sequence of numbers from 5 to 10, and then select the numbers 6, 7, and 8 from this sequence.

In [0]:
# WRITE CODE HERE


#### Vector operations

Vectors are a native data structure in `R`, and many operations are "vectorized", meaning that they work directly on vectors.
Basic math operations are done element-wise.

In [0]:
A <- c(1, 2)
B <- c(6, 2)

A + B  # == c(1 + 6, 2 + 2)

In [0]:
A - B  # == c(1 - 6, 2 - 2)

In [0]:
A * B  # == c(1 * 6, 2 * 2)

In [0]:
B / A  # == c(6 / 1, 2 / 2)

In [0]:
B^2  # == c(6*6, 2*2)

Comparisons are also done element-wise.

In [0]:
A == B  # == c(1 == 6, 2 == 2)

Note the double equal sign (`==`) for comparing equality! (One equal sign would be assignment.)

In [0]:
A < B  # == c(1 < 6, 2 < 2)
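Element-wise operations also work between a vector and a single value: the single value (a vector of length 1) is reused for every element. The sketch below also shows that the `TRUE`/`FALSE` results of a comparison can be summed, since `TRUE` counts as 1 and `FALSE` as 0.

```r
A <- c(1, 2)
B <- c(6, 2)

scaled <- A * 10        # c(1 * 10, 2 * 10)
shifted <- B - 1        # c(6 - 1, 2 - 1)
n_equal <- sum(A == B)  # Number of positions where A and B are equal

print(scaled)
print(shifted)
print(n_equal)
```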

There are many functions in `R` that operate on whole vectors.
Some examples are:

In [0]:
X <- c(0.1, 1, 10, 100)
log(X)  # Element-wise log

In [0]:
exp(X)  # Element-wise exponential

In [0]:
sqrt(X)  # Element-wise square-root

In [0]:
mean(X)  # Mean

In [0]:
sd(X)  # (Sample) standard deviation

In [0]:
var(X)  # (Sample) variance

In [0]:
max(X)  # Maximum value

In [0]:
min(X)  # Minimum value

In [0]:
median(X)  # Median value

In [0]:
sum(X)  # Sum of all values

In [0]:
prod(X)  # Product of all values

In [0]:
quantile(X, probs = c(.1, .5, .9))  # Quantile at specified probs

In [0]:
length(X)  # Length of vector
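The comments above describe `sd()` and `var()` as _sample_ statistics, meaning they divide by `n - 1` rather than `n`. As a quick check, the sketch below recomputes the variance by hand with vectorized operations and compares it to the built-in functions (`manual_var` is an illustrative name):

```r
X <- c(0.1, 1, 10, 100)
n <- length(X)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
manual_var <- sum((X - mean(X))^2) / (n - 1)

print(all.equal(manual_var, var(X)))       # Matches var()
print(all.equal(sqrt(manual_var), sd(X)))  # sd() is the square root of var()
```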

#### Exercise: vector operations

Generate a sequence of 1,000 random numbers between 0 and 1, and calculate their
1. mean
2. variance
3. 25%, 50%, and 75% quantiles

Hint: you can use `runif(n)` to generate n random numbers between 0 and 1.

In [0]:
# WRITE CODE HERE


#### Strings

Use the `paste` function to concatenate two or more strings. 
Numerical values are automatically converted to strings.

In [0]:
paste("One plus one equals", 1 + 1, ".")

The `paste()` function has an optional `sep` argument, which you can use to specify how the different strings are `paste`d together.

In [0]:
paste("One plus one", 1 + 1, sep = " = ")
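The default value of `sep` is a single space, which is why the first example above ends with a space before the period. One way to avoid that is to set `sep = ""` and put the spaces into the strings yourself, or to use `paste0()`, which is the same as `paste()` with an empty separator:

```r
s1 <- paste("One plus one equals ", 1 + 1, ".", sep = "")
s2 <- paste0("One plus one equals ", 1 + 1, ".")

print(s1)
print(s2)
```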

Similar to `sep`, you can also use the optional `collapse` argument to concatenate a vector of strings instead of having them as individual arguments.

In [0]:
my_strings <- c("one", "plus", "one", "equals", "two")

paste(my_strings, collapse=" ")

If you're familiar with `C`-style formatting, there is also a `sprintf()` function, which is a wrapper around the `C` library function `sprintf`.

In [0]:
sprintf("One plus one = %d, and e = %.3f", 1 + 1, exp(1))
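Like most functions in `R`, `sprintf()` is vectorized: if you pass a vector as an argument, you get back one formatted string per element. A small sketch (the names `labels` and `padded` are illustrative):

```r
labels <- sprintf("item %d", 1:3)  # One string per element of 1:3
padded <- sprintf("%03d", 7L)      # Zero-pad an integer to width 3

print(labels)
print(padded)
```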

#### Exercise: string operations

Suppose you are given a vector of strings denoting items that you have. Write `R` code to turn this vector into an English sentence in the form of "I have x, y, and z". Pay extra attention to the "and" at the end.

Example input: `c("one apple", "two pears", "three bananas")`

Example output: `I have one apple, two pears, and three bananas.`

In [0]:
my_items <- c("a can of Coke", "a bottle of Pepsi", "a glass of water")

# WRITE CODE HERE


### Packages

`R` packages can be installed using the `install.packages()` function.
For example, to install the `tidyverse` package (which will be used primarily in this course) you can run:
`install.packages("tidyverse")`

This is like installing a piece of software, and only needs to be done once on any machine.

Once a package is installed, it can be "loaded" into the current environment with the `library` function.
For example, to load the `tidyverse` package, run

In [0]:
library("tidyverse")

Unlike `install.packages()`, this needs to be done in every new `R` session.
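If you are not sure whether a package is already installed, you can check before loading it. The sketch below uses the base function `requireNamespace()`; the helper name `is_installed` is made up for this example, and the code only prints a reminder rather than installing anything.

```r
# Hypothetical helper: TRUE if the package can be found on this machine
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

if (!is_installed("tidyverse")) {
  message("tidyverse is not installed; run install.packages(\"tidyverse\") once")
}
```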

### Intro to `ggplot`

Now we will explore how to make plots using `ggplot()` from the `ggplot2` package, which is part of the `tidyverse`.

We will explore the stop-and-frisk data from New York City. "Stop-and-frisk" is a police practice of temporarily detaining, questioning, and at times searching civilians on the street for weapons and other contraband. Officers are supposed to stop and frisk an individual only when they have reasonable suspicion that the individual is illegally carrying a weapon.

Here we want to plot the data and answer this question: are white and Black individuals subject to the same threshold of “reasonable suspicion”?

One way to examine the stop-and-frisk threshold is to check the "hit rate" of frisks.
The hit rate is defined as the percentage of stops made on suspicion of criminal possession of a weapon (CPW) in which a weapon is actually found.
If Black individuals who are frisked have a lower hit rate than white individuals, we have reason to believe the police might be applying a lower stop-and-frisk threshold to Black individuals (discuss: why?).
We will come back to this data later in this class for a more in-depth analysis.
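To make the definition concrete, here is a minimal sketch of the hit rate calculation. The counts and group names below are invented purely for illustration and are not taken from the NYC data.

```r
# Hypothetical counts, for illustration only (not from the NYC data)
stops <- c(group_a = 1000, group_b = 400)        # CPW stops per group
weapons_found <- c(group_a = 25, group_b = 20)   # Stops in which a weapon was found

# Hit rate = weapons found / stops, computed element-wise for each group
hit_rate <- weapons_found / stops
print(hit_rate)
```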

In [0]:
# set ggplot theme
theme_set(theme_bw())

# let's first load the data
load('../data/sqf.Rdata')

# display first few rows of the hitrate_by_precinct dataframe,
# which contains the hit rate for Black and white individuals in each precinct in NYC
head(hitrate_by_precinct)

Let's first plot the hit rate for white individuals against the hit rate for Black individuals in each precinct.

In [0]:
p <- ggplot(data=hitrate_by_precinct, aes(x=black, y=white)) +
  geom_point() +
  scale_x_continuous('\nHit rate for black individuals',
                     labels=scales::percent, limits=c(0, .3)) +
  scale_y_continuous('Hit rate for white individuals\n',
                     labels=scales::percent, limits=c(0, .5))

p

We then add a dashed line with a slope of 1 and an intercept of 0. If a point lies above the line, white individuals have a higher hit rate than Black individuals in that precinct.

Discuss: What do we see here? What can we say about the stop-and-frisk policy?

In [0]:
p <- ggplot(data=hitrate_by_precinct, aes(x=black, y=white)) +
  geom_point(size=1) +
  geom_abline(slope=1, intercept=0, linetype='dashed') +
  scale_x_continuous('\nHit rate for black individuals',
                     labels=scales::percent, limits=c(0, .5)) +
  scale_y_continuous('Hit rate for white individuals\n',
                     labels=scales::percent, limits=c(0, .5))

p
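Reading the plot amounts to asking, for each precinct, whether `white > black`. The sketch below does this numerically on a made-up data frame with the same `black` and `white` column names; the precincts and rates are hypothetical and only illustrate the comparison.

```r
# Hypothetical hit rates for four made-up precincts (illustration only)
toy <- data.frame(
  precinct = 1:4,
  black = c(0.02, 0.05, 0.10, 0.04),
  white = c(0.04, 0.05, 0.08, 0.09)
)

# A point is above the dashed line exactly when white > black
toy$above_line <- toy$white > toy$black
print(toy)

# Share of precincts that fall above the line
share_above <- mean(toy$above_line)
print(share_above)
```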

Now let's look at the `hitrate_by_location` data, which further breaks down the stops in `hitrate_by_precinct` by location (`housing`, `transit`, or `neither`).

In [0]:
# We assign different colors to each location category
p <- ggplot(data=hitrate_by_location, aes(x=black, y=white,
                                          group=location.housing)) +
  geom_point(aes(color=location.housing), alpha=.6) +
  geom_abline(slope=1, intercept=0, linetype='dashed') +
  scale_color_discrete(element_blank(),
                       breaks=c('housing', 'neither', 'transit'),
                       labels=c('Public housing', 'Pedestrian', 'Transit')) +
  scale_x_continuous('\nHit rate for black individuals',
                     labels=scales::percent, limits=c(0, .8)) +
  scale_y_continuous('Hit rate for white individuals\n',
                     labels=scales::percent, limits=c(0, .8)) +
  theme(legend.position=c(1, 0), legend.justification=c(1, 0),
        legend.background=element_blank())

p

In [0]:
# Change axes to log scales
p <- ggplot(data=hitrate_by_location, aes(x=black, y=white,
                                          group=location.housing)) +
  geom_point(aes(color=location.housing), alpha=.6) +
  geom_abline(slope=1, intercept=0, linetype='dashed') +
  scale_color_discrete(element_blank(),
                       breaks=c('housing', 'neither', 'transit'),
                       labels=c('Public housing', 'Pedestrian', 'Transit')) +
  scale_x_continuous('\nHit rate for black individuals',  labels=scales::percent,
                     trans='log10', limits=c(0.003, 1),
                     breaks=c(.003, .01, .03, .1, .3, 1)) +
  scale_y_continuous('Hit rate for white individuals\n',  labels=scales::percent,
                     trans='log10', limits=c(0.003, 1),
                     breaks=c(.003, .01, .03, .1, .3, 1)) +
  theme(legend.position=c(1, 0), legend.justification=c(1, 0),
        legend.background=element_blank())

p
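One reason a log scale can help here: many hit rates are small, and on a `log10` axis equal _ratios_ are drawn as equal distances, so the small values are spread out instead of being bunched near zero. A quick numeric sketch of that property, using some of the break values from the plot:

```r
# Going from 1% to 3% and from 10% to 30% are both a 3x increase
small_step <- diff(log10(c(0.01, 0.03)))
large_step <- diff(log10(c(0.1, 0.3)))

# On the log10 scale the two steps cover the same distance
print(all.equal(small_step, large_step))

# A rate of exactly zero has no finite position on a log axis
print(log10(0))
```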

In [0]:
# Resize the points by the number of stops
p <- ggplot(data=hitrate_by_location, aes(x=black, y=white,
                                          group=location.housing)) +
  geom_point(aes(color=location.housing, size=count), alpha=.6) +
  geom_abline(slope=1, intercept=0, linetype='dashed') +
  scale_size_area(guide='none') +  # hide the size legend (guide=FALSE is deprecated)
  scale_color_discrete(element_blank(),
                       breaks=c('housing', 'neither', 'transit'),
                       labels=c('Public housing', 'Pedestrian', 'Transit')) +
  scale_x_continuous('\nHit rate for black individuals',  labels=scales::percent,
                     trans='log10', limits=c(0.003, 1),
                     breaks=c(.003, .01, .03, .1, .3, 1)) +
  scale_y_continuous('Hit rate for white individuals\n',  labels=scales::percent,
                     trans='log10', limits=c(0.003, 1),
                     breaks=c(.003, .01, .03, .1, .3, 1)) +
  theme(legend.position=c(1, 0), legend.justification=c(1, 0),
        legend.background=element_blank())

p