-
Notifications
You must be signed in to change notification settings - Fork 1
/
R4EH.Rmd
906 lines (676 loc) · 30.1 KB
/
R4EH.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
---
title: "R for Econometrics: A Modest Handbook"
author: "[Ryan Safner](http://ryansafner.com) (safner@hood.edu)"
date: "Last Updated: `r system('git log -1 --format=%cd',intern=TRUE)`"
output:
html_document:
theme: simplex
toc: true
toc_float: true
code_folding: show
highlight: tango
---
> "There’s an implied contract between you and R: it will do the tedious computation for you, but in return, you must be completely precise in your instructions. Typos matter. Case matters." - Hadley Wickham, [R for Data Science](http://r4ds.had.co.nz/workflow-basics.html)
**Open Source**: The raw (`.Rmd`) code used to produce this guide, along with the guide itself, are available on [GitHub](http://github.com/ryansafner/R4EH), and are updated regularly. GitHub does not automatically render HTML, so download the HTML file and open it, or view it where I host it [on my website](http://ryansafner.com/tutorial/R4EH.html).
**Note to Students**: This is a work in progress, check the date at the top for when this was last updated. This compiles all of my instructions, advice, and examples from econometrics class lectures regarding `R`. It also contains some advanced material that I did not or will not cover in class, but will be useful to know for future data analysis and understanding or diagnosing problems.
**Note to Everyone Else**: This guide is oriented primarily for my [Econometrics class](http://ryansafner.com/courses/econ480) at Hood College, but should be of wider use to anyone interested in learning `R` for data analysis. Lecture slides, handouts, and guides (both PDFs and source code in R Markdown) are openly available [on GitHub](http://github.com/ryansafner/ECON480/).
**See also my companion guide to using R Markdown** to more effectively manage your entire workflow (text, data analysis, tables, graphs, and citations!) in a single plain text file and make your work reproducible and shareable, also [on GitHub](). This document, for example, was produced in `R Markdown` (see the raw code `.Rmd` file).
# For First-Timers:
This guide is meant to be a somewhat comprehensive resource such that you can come back to different sections when you encounter a specific limitation or problem in your own work. I *do not* recommend reading through this guide from start to finish, or in order.
I would recommend some basic concepts
# Basic Concepts
## Operating R Studio
- There are a few ways you can use R Studio:
1. \alert{Command line/Console}: writing each command by itself and copying down the result as needed
- Great for testing individual commands to see what happens
- Not reproducible! Not saved! NOT RECOMMENDED!
2. \alert{.R files}: A sequence of commands (and hopefully comments) saved as a script, the entire script is run all at once
- Can test individual commands in command line and then put good commands in *.R* file
- Equivalent to a *.do* file for Stata
- Reproducible, saved, commented
3. \alert{R Markdown (*.Rmd*) files}: A plain text document written in *R Markdown* language
- Allows for individual chunks of *R* code to be run individually (great for testing one command instead of all at once)
- Reproducible, saved, commented as if a normal document
- Can write an entire document (text, equations, R commands, figures, tables, etc) with one file!
- Can export to html, MS Word, Beamer, etc!
- Markdown is a language that is intuitive, simple, human- and machine-readable
![Rstudio Windows](images/rstudio.png)
### Keyboard Shortcuts
- `Ctrl+2`: move cursor to console
- `Ctrl` (`Cmd` on Mac)`+Enter`: run current line (from editor) in console
- `Ctrl` (`Cmd` on Mac)`+Enter`
- `Uparrow`: retrieve recent commands in console
- `Ctrl` (`Cmd` on Mac)`+Uparrow`: search previous commands
- `Option -`: insert assignment operator (`<-`)
- `Ctrl` (`Cmd` on Mac)+`Shift+M`: insert pipe operator (`%>%`)
## Working Directory (`wd`){#Working-directory}
- *R* assumes a default (often inconvenient) working directory on your computer
- this is where it thinks it will **load** any files you want to load and **save** anything you want to save by default
- Find out where *R* currently thinks this is with **`getwd()`**
- this is often Operating System specific, e.g.:
- Mac: `/Users/yourusername/`
- Windows: `C:/Users/yourusername/Documents/`
- you can move everything you want to load into this folder on your computer (and save everything there too), but this may be inconvenient
- *Change* the working directory to wherever you plan on keeping your related data and documents with **`setwd("/path/to/folder")`**
- you can move to a new `wd` *relative* to the current working directory:
- move down a folder by typing the folder name with a `/` after i
- e.g. to move from `/Ryansafner/Documents/` to `/Ryansafner/Documents/Econometrics/`
- move up one folder in a hierarchy with `..`
- e.g. to move from `/Ryansafner/Documents/` to `/Ryansafner/Downloads`, use `setwd("../Downloads/")` to move up from the `Documents` folder to `Ryansafner` folder, then down to Downloads
## Packages
- \alert{Packages} are extensions of base `R` designed by users
- Remember, `R` is open source, packages are usually published first on [Github](github.com)
- Official packages distributed and documented through [CRAN](cran.r-project.org/)
- To use a (previously-installed) package (note the ""), use the `library()` command:
```
library("packagename")
```
- If you do not have a package, they are easy to install with (note the plural "s")
```
install.packages("packagename")
```
- To install or load multiple packages at once, we can use the `c()` function to select multiple packages (**see below**)
```
library(c("gapminder","ggplot2","dplyr"))
```
## Useful Packages
- There are several packages we will use often (and are featured later in this guide)
- Packages are often very well-documented with explanations and examples
- Google each package for more information
\begin{table}
\begin{tabular}{ll}
Package name & Use \\ \toprule
\texttt{ggplot2} & Rendering beautiful graphics (scatterplots, histograms, etc)\\
\texttt{stargazer} & Rendering professioanl looking regression output tables\\
\texttt{dplyr} & Manipulating data much more intuitively\\
\texttt{sandwich} & More tools for regression, particularly robust SE's\\
\texttt{tidyverse} & An epic \emph{meta}package of \texttt{ggplot2, dplyr} and other popular packages\\ \bottomrule
\end{tabular}
\end{table}
## Calculations
- `R` can be used as a calculator
- Basic operations `+`, `-`, `*`, `/`
- More advanced math operators like exponents, logarithms, trigonometric functions, etc
```{r}
2+2
6^2 # 6 to the second power (i.e. squared)
sqrt(100/4) # square root
log(5) # logarithm
sin(2*pi) # sin
factorial(5) # factorial (e.g. 5!)
choose(2,6) # binomial choose function
# order of operations matters
3*3+4
3*(3+4)
```
- **Note on Notation**: `R` often reports very large (or very small) numbers in scientific notation with `e`
- For positive `e`: the number of zeros (or digits after the decimal point) to the right of a number
- e.g. $1.25e6 = 1.25 \times 10^6 = 1,250,000$
- For negative `e`: one less than the number of zeros (or digits after the decimal point) to the left of a number
- e.g. $1.25e-6 = 1.25 \times 10^{-6} = 0.00000125$
## Hints for Writing Code
### Naming Objects
- Object names cannot start with a digit or contain a space or comma
- FOR THE LOVE OF GOD AVOID SPACES IN GENERAL
- You've seen webpages intended to be called `"my webpage in html"` turned into `http://my%20webpage%20in%20html.html`
- Consider both your `R` objects and your files and folder names on your computer...(`/School/ECON_480_Econometrics/Homeworks_and_Data/`)
- It will be wise to adopt some consistent standard for demarcating names:
```
i.use.snake.case
otherPeopleUseCamelCase
some_people_use_underscores
And_aFew.People_RENOUNCEconvention
```
### Commenting
- Always comment your commands! Describe what you are doing so someone else (or you, 5 years later) can understand what is happening and why!
- Use the hashtag `#` to start a comment (`R` ignores everything on that line after the hashtag)
- Can be made its own line or at the end of lines
- e.g.
```{r writing, eval=FALSE}
# Run regression of y on x, save as reg1
reg1<-lm(y~x, data=mydata) # runs regression using mydata
summary(reg1$coefficients) # prints coefficients
```
### Managing Your Workflow
- **Save often!**
- Better yet, ask me about version control and GitHub
## Getting Help
- You can get documentation, explanations, and examples of every command in `R`
- simply type `?commandname` or `help("commandname")`
- Meet your new friend:
![](images/googlehelp.png)
- Meet your new *best* friend:
![](images/stackoverflow.png)
- The **only** way to learn coding is by tweaking existing examples, messing up, and searching the internet for help!
# Objects
- `R` is an \alert{object-oriented} programming language:
- We will almost always store data in **objects** and run **functions** on the objects
- Objects are assigned values: `myobject<-value`
- "**<-**" is the "\alert{assignment operator}", think of it like an $=$ sign
- Functions take the form: `functionname(myobject)`
- See **below** for examples of useful functions for data analysis
- Functions can have *other* functions for arguments (the object the function is run on), e.g.
```
round(rnorm(5),2)
# rnorm(5) takes 5 random draws from a normal distribution
# then round(, 2) rounds the result to 2 decimal places
```
## Vectors
- The simplest data structure in `R` is a \alert{vector}, simply a collection of objects
- To construct a vector, use the "combine/concatenate" function "**`c()`**"
- **Example**: a vector of the numbers 1 through 5
```{r vector1}
c(1,2,3,4,5)
```
- We can also build vectors via generating series
```{r vector 11}
1:5
```
- We can perform mathematical operations on a vector as a whole:
```{r vector 3}
sum(1:5)
mean(1:5)
```
-
```{r store.vector.as.x}
x<-c(1,2,3,4,5)
```
- To inspect an object, we simply call it up by name
```{r look.at.x}
x
```
# Other Object Types
## Lists
- A \alert{list} is a non-atomic vector, meaning you can gather data elements of different classes in one object
```{r list}
mylist<-list(5, pi, TRUE, 4.3, "cabbage")
class(mylist)
```
- Another great property of lists is that elements of the list can themselves be vectors
```{r}
vectored.list<-list(c(1.82, 1940, 93.20, 192.917),
c("Orange", "Cyan", "Pink"),
c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE))
str(vectored.list) # look at structure of the list
```
- We can create a **label** for each element in a list, called a `name`
```{r}
vectored.list<-list(numbers=c(1.82, 1940, 93.20, 192.917), # first element is a vector called 'numbers'
colors=c("Orange", "Cyan", "Pink"), # second element is a vector called `colors`
logic=c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)) # third element is a vector called `logic`
vectored.list
```
- The `names` command prints (or changes) the name of the label of each element in the list
```{r}
names(vectored.list) # print the names of the list elements
names(vectored.list)<-c("name1","name2","name3") # rename the lables to 'name1', 'name2', and 'name3'
names(vectored.list) # print new names
vectored.list # print list with new names
```
## Matrix
- Everything thus far has been 1 dimension, but we often work with 2-dimensional data
- Rows are observations
- Columns are variables
- A \alert{matrix}
- `matrix()` command creates a matrix by column,
- can define number of rows with `nrow=`, `R` will divide the elements into equal number of columns
```{r}
matrix1<-matrix(c(1,2,3,4,5,6),nrow=3) # make a 3-row matrix
matrix1
```
## Data Frame
- The most important object in `R` is a **data frame** (what you call a "spreadsheet"), used for statistics, plots, regressions, etc
- "Rectangular" data, rows are observations, columns are variables
- Can hold variables of different classes (e.g. a quantitative variable like income, a character variable like name, etc)
- In essence, data frames are actually lists (where each list object itself is a vector)
- All vectors (columns) must have the same length!
```{r}
df<-data.frame(x = 1:3,
y = c("a", "b", "c"),
z = c(TRUE, FALSE, TRUE))
df
```
# Data Classes
- Vectors **must** contain the same type of elements (e.g. numerical or text)
* Technically this refers to **atomic vectors** (nearly all vectors are atomic)
- Vectors with "mixed" types will convert all elements to the lowest-common denominator, e.g. character
- You can always check the type of vector using **class()**
```{r vector.types}
mixed<-c(5, pi, TRUE, 4.3, "cabbage")
class(mixed)
```
### Numeric
- \alert{Numeric} (aka "double"), as it sounds, can perform mathematical operations
```{r numeric, echo=TRUE, eval=FALSE}
numeric<-c(1,2,3,4,5)
```
- There are two **types** of numeric objects: **double** and **integer**
### Double
- If numeric values contain decimal points, they are technically called \alert{floating point double} or simply \alert{double} class
- `R` may simply call them numeric, but contrast with integer below
```{r}
double<-c(pi,2.34,9.99)
class(double)
typeof(double) # will return the more specific type
is.double(double) # a logical test to see if object is "double" type
is.integer(double) # a logical test to see if object is "integer" type
```
#### Integer
- If numeric values are all whole numbers, they are \alert{integer} class
```{r}
integers<-c(1,2,3,4)
class(integers)
typeof(integers)
is.double(double)
is.integer(double)
```
### Logical
- \alert{Logical} is a series of binary elements or statements that can either be TRUE or FALSE
```{r logical, echo=TRUE,eval=FALSE}
logical<-c(TRUE,FALSE,FALSE,TRUE)
```
- We can perform logical tests with common operators:
- `<` less than
- `>` greater than
- `<=` less than or equal to ($\leq$)
- `>=` greater than or equal to ($\geq$)
- `==` is equal to (note two equals signs are needed!)
- `!=` is not equal to
- `%in%` is a member of a set ($\in$)
```{r}
3==4 #is 3 equal to 4?
3<4 # is three less than 4?
3<=4 # is three less than or equal to 4?
3>4 # is three greater than 4?
3!=4 # is three not equal to four?
3 %in% c(0,1,2) # is three in the following set of numbers?
3 %in% c(0,1,2,3) # is three in the following set of numbers?
```
- We are not limited to using numeric data, `R` can also perform logical tests on other classes of variable, like characters (which need quotes):
```{r}
"red"=="blue" # is red the same as blue?
"red"!="blue" # is red not equal to blue?
political.party<-c("Republican","Democrat") # define political party as a set of Republican and Democrat
"Libertarian" %in% political.party # check if Libertarian is in the set of political parties we created
"Democrat" %in% political.party # check if Democrat is in our set of political parties
```
- We can also perform more than one test at a time with multiple conditions:
- `&` AND
- `|` OR
```{r}
2==2 & 2>3 # is 2 equal to 2 AND greater than 3?
2==2 | 2>3 # is 2 equal to 2 OR greater than 3?
```
- These commands will become very useful when we want to subset data or look at portions of our data based on some condition
### Character
- \alert{Character} is a string of text: letters, numbers, and symbols, cannot perform mathematical operations
- Character values require quotation marks around each value
```{r char, echo=TRUE,eval=FALSE}
character<-c("one","two","7","orange")
```
#### Dates
- Dates are a specific type of character class
- Specific dates
* Can do days, weeks, months, quarters, years
```{r dates 1}
today<-Sys.Date() #print today's date
format(today, format="%B %d %Y") # specify how to report date format
months<-seq(as.Date("2010/1/1"), as.Date("2012/1/1"), "months") # generate sequence of months between Jan 2010 and Jan 2012
months
```
### Factor
- \alert{Factor} is a special type of character variable, often used to indicate membership in one of several possible categories, called **levels** (e.g. for plotting, or conditional statistics and data work)
```{r factors}
students<-factor(c("freshman", "senior", "senior", "junior", "freshman",
"sophomore", "freshman"))
students # note order is arbitrary
levels(students) #extract unique levels
nlevels(students) #count the number of levels
```
```{r factors2}
table(students) #tabulate number of values for each level
```
#### Ordered Factors
- Factors have ordered `levels()` which control the order on plots and in `table()`
```{r}
students.o<-ordered(students, levels=c("freshman","sophomore","junior","senior"))
students.o
```
- **Be advised**: when `R` stores and calls factors, it actually stores them as integers [1..k, for k categories] instead of characters (e.g. "freshman"=1, "sophomore"=2), making this a **nominal** variable. This allows for some mathematical operations.
- An **ordered factor** is where the ordering matters (e.g. "small", "medium", "large" coded as 1, 2, 3 in order)
## Checking or Reclassifying Objects
- We can always check the class of an object with `class()` or `typeof()`.
- We can perform logical tests `is.numeric()`, `is.factor()`, etc. to see if an object is a specified type
- We can change the class of an object by redefining it with `as.classname()`, e.g.
```{r class change}
is.numeric(x) # check if x is numeric
is.factor(x) # check if x is a character
x<-as.character(x) # change vector x to a character
class(x)
x<-as.numeric(x) # change vector x back to numeric
class(x)
```
# Manipulating Objects
## Subsetting Objects
- We often find ourselves wanting to work with only a specific slice of data from a larger dataset
- e.g. for a dataframe, look only at one or two variables, or look only at members of a category
- e.g. only females, only age greater than 21
- This requires us to extract or **subset** our data
- there are many ways to do this in `R` (and some packages, like `dplyr`, `magrittr`, and `tidyverse` make this easier)
- we first discuss how `R` keeps track of objects to use this syntax to allow us to extract relevant data
- then we talk about **conditional** values to extract data using **logical statements** (see above on logical statements)
### Object Indexing
- `R` objects store multiple values as an ordered list
- The order of each value is kept as an index number
- e.g. the first value is [1], the second value is [2], the 82nd value is [82], etc.
- Use [square brackets] to isolate elements of a vector, index starts at [1]
```{r vector.index}
print(x) # Our original vector
x[1] # Print first element of x
```
- We can look at specific elements by identifying their index values:
```{r}
x[c(1,3)] # Print first and third elements
x[-c(1,3)] # Print everything EXCEPT first and third elements
```
- We can change specific elements simply by identifying them by index number and redefining them
```{r}
x[2]<-"7" #Change second element
print(x) #Our new vector
```
- We can also ask how many elements a vector contains with **length()**
```{r length}
length(x)
```
- We can also select elements by (a) logical condition(s)
```{r}
x[x>2] # print elements in x greater than 2
x[x>2 & x<5] # print elements in x greater than 2 AND less than 5
x[x>2 | x<5] # print elements in x greater than 2 OR less than 5
```
- It is often useful to label values (later, variables) by giving them "names"
```{r names}
myvars<-c(Var1="alpha",Var2=12,Var3="height") #Label name goes first, then =, then the value
myvars
#We can still call specific values by index position
myvars[3] #Get 3rd value
#We can equivalently call up specific values by name
myvars["Var3"] #Get value of Var3
names(myvars) #List names of all values
names(myvars)[3] #Get name of third value
```
- This become more useful when we try to subset dataframes for analysis:
- Observations are indexed by row and column `[r,c]`
- Use one or both indexes to identify specific observations
```{r}
df<-data.frame(x=c("a","b","c"),
y=c(1,2,3),
z=c("do","re","me"))
df
# select row 2, column 3
df[2,3]
# select column 3
df[,3]
# select row 2
df[2,]
# using just one index will always return a column vector (e.g. variable)
df[3]
df[[3]]
```
- Subsetting by actual position is quite cumbersome, it is easy to subset by `name`
```{r}
names(df)<-c("letter","number","note")
# select letters
df["letter"]
df[["letter"]]
# select letters and notes
df[c("letter", "note")]
```
- This is extremely common, so `$` provides a quicker way
```{r}
# select letters
df$letter
# select notes
df$note
```
# Automate all the Things with `function()`s and `for` Loops
## Functions
- Sometimes it makes sense to write your own function
- A function has **argument(s)** (inputs) and produces a desired **output** with the following syntax:
```{r, eval=FALSE}
my.function<-function(x){
argument.using.x
}
```
## `for` Loops
```{r}
library("ggplot2")
# using diamonds dataset in ggplot2
cuts <- levels(diamonds$cut)
means <- rep(NA, length(cuts))
for(i in seq_along(cuts)) {
sub <- diamonds[diamonds$cut == cuts[i], ]
means[i] <- mean(sub$price)
}
means
```
```{r}
# using diamonds dataset
colours <- levels(diamonds$color)
n <- length(colours)
mprice <- rep(NA, n)
mcarat <- rep(NA, n)
for(i in seq_len(n)) {
set <- diamonds[diamonds$color == colours[i], ]
mprice[i] <- median(set$price)
mcarat[i] <- median(set$carat)
}
results <- data.frame(colours, mprice, mcarat)
results
```
## Simulation
```{r}
# Let’s make a virtual coin flip
# 1 = heads, 0 = tails
coin <- c(0, 1)
# we can flip the coin once
flips <- sample(coin, 1, replace = T)
mean(flips)
# we can flip the coin many times
flips <- sample(coin, 10, replace = T)
mean(flips)
flips <- sample(coin, 10000, replace = T)
n <- seq_along(flips)
mean <- cumsum(flips) / n
coin_toss <- data.frame(n, flips, mean)
library(ggplot2)
qplot(n, mean, data = coin_toss, geom = "line") +
geom_hline(yintercept = 0.5)
```
## Math Operations on Objects
## Redefining Objects
# Importing Data
- `R` can read almost any type of data (with the proper package)
- Syntax: `newdf<-read("path/to/file")`
- You'll create a new object (typically dataframe) with the data (e.g. called `newdf`)
- Use a particular `read` function depending on the data type (see packages below) and package
- Put the file path of the object in quotes and parentheses
| Package | Function | File Format(s) |
|----|----|----|
| `foreign` | `read.*()` | `.dta`, `.spss`, etc. |
| `haven` | `read_*()` | `.dta`(13+) |
| `readr` | `read_csv()` | `.csv` files |
| `readxl` | `read_excel()` | `.xls`, `.xlsx` files |
where `*` is a placeholder for the file type
- Depending on where you saved the downloaded data on your computer, you tell `R` where to load it from
- If you save/move the data in the same default working directory as `R`, just tell it `newdf<-read("dataname.ext")`
- Where `read` should be replaced by the proper function above
- `dataname.etx` is the name of the data file and its file type (e.g. `.csv`, `.xslx`, etc)
- You can determine where `R` thinks your working directory is again with `getwd()` and change it (poor idea) with `setwd()` (See (Working Directory)[#Working-directory])
- If you save/move the data in a different folder:
- `path/to/file` should be the path **relative** to your current working directory, using `..` to move up one directory on your computer and `/` to move through folders
- Example: suppose my current working directory is `RyanSafner/Documentss` and I saved my data file in `RyanSafner/Downloads`
- My relative path would be `../Downloads/dataname.ext`, since I need to move up one directory (from `RyanSafner/Documents` to just `RyanSafner/'`) and then down into the `Downloads/` folder.
# Data Wrangling, Tidying, and Cleaning
- 90\% of data work is "wrangling" raw data files into something we can actually work with
- **"Tidy"** data, to use Hadley Wickham's term, has:
- Columns are variables
- Rows are observations
- One type per dataset
- Tidy data are easy to model, visualize, and aggregate (i.e. run `lm`, `ddply` and `ggplot` with)
- Common causes of "messy" data:
- column headers are values, not variable names
- multiple variables are stored in one column
- variables are stored in both rows and columns
- multiple types of experimental unit stored in the same table
- one type of experimental unit stored in multiple tables
## The `dplyr` package
-
### `dplyr` verbs
| Function | Does |
|----------|------|
| `filter` | Keep only selected *observations* |
| `select` | Keep only selected *variables* |
| `arrange` | Reorder rows (e.g. in numerical order) |
| `mutate` | Create new variables |
| `summarize` | Collapse data into summary statistics|
| `group_by` | Perform any of the above functions by groups/categories |
- Syntax for all of these functions is the same:
```{r, eval=FALSE}
dplyrfunction(dataframe, condition)
# returns a new dataframe
```
- `filter`ing observations examples (with `gapminder`)
```{r}
library("gapminder")
library("dplyr")
# Only look at countries in Africa
filter(gapminder, continent=="Africa")
# Only look at countries in Europe AND 1997
filter(gapminder, continent=="Europe", year==1997)
# Only look at countries in Europe OR Africa
filter(gapminder, continent=="Europe" | continent=="Africa")
# Only look at countries in the set of Europe and Asia
filter(gapminder, continent %in% c("Europe","Asia"))
```
- `select`ing observations examples (with `gapminder`)
```{r}
library("gapminder")
library("dplyr")
# Only keep country, year, and population variables
select(gapminder, country, year, pop)
# The "starts_with()", "ends_with()", and "contains()" are also useful functions
select(gapminder, starts_with("c"), gdpPercap) # i.e. *C*ountry, *C*ontinent
```
- `rename`ing variables
```{r}
rename(gapminder, population=pop, gdppercapita=gdpPercap)
```
```{r}
# order by year (smallest to largest)
arrange(gapminder, year)
# order by country (reverse alphabetical)
arrange(gapminder, desc(country)) # desc stands for descending()
# order by lifeexp (smallest to largest)
arrange(gapminder, lifeExp)
# order by lifeexp (largest to smallest)
arrange(gapminder, desc(lifeExp))
```
- `mutate`ing new variables
```{r}
# create a GDP variable by multiplying gdpPercap and pop
mutate(gapminder, gdp = gdpPercap * pop)
# Can create multiple variables at once
mutate(gapminder,
gdp = gdpPercap * pop,
gdp_billions = gdp/1000000000)
```
- `summarize`ing variables
```{r}
# get average life expectancy
summarize(gapminder, mean_life=mean(lifeExp))
# get standard deviation of life expectancy
summarize(gapminder, sd_life=sd(lifeExp))
# summarize avg life expectancy by continent with "group_by"
by_continent<-group_by(gapminder, continent)
summarize(by_continent, mean_life=mean(lifeExp))
# or by year
by_year<-group_by(gapminder, year)
summarize(by_year, mean_life=mean(lifeExp))
```
- The **pipe operator** `%>%` "pipes" the *output* of the previous command into the *input* of the next command:
```{r}
gapminder %>% # start with gapminder data
group_by(continent, year) %>% # use gapminder data to create groups of continents and of years
summarize(mean_life = mean(lifeExp)) # use those groups to calculate average life expectancy by group
```
- The pipe operator `%>%` works across different functions and packages
- Once you learn it, it's hard to give it up because it shortens your code significantly!
```{r}
library("ggplot2")
# we want to plot the change in average life expectancy by continent over time
gapminder %>% # start with gapminder data
group_by(year, continent) %>% # create groups of years and of continents
summarize(mean_life = mean(lifeExp)) %>% # get average life expectancy for each group
ggplot(aes(year, mean_life, color = continent))+ # plot this over time
geom_point() + geom_smooth(method="lm")
```
```
new <- summarise(diamonds, avg_depth = mean(depth), avg_price = mean(price))
```
# Plotting
## Base `R`
- Base `R` is very powerful and intuitive to plot, but not very sexy
- Basic syntax:
```{r, eval=FALSE}
plottype(data.frame.name$variable)
```
- If using multiple variables, you can avoid `$` by typing the variable names and then telling `R` where the variables come from (a data frame)
```{r, eval=FALSE}
plottype(variable1,variable2, data=data.frame.name)
```
### Histogram
```{r, echo=TRUE, fig.height=4, fig.width=10, warning=FALSE}
library("gapminder")
hist(gapminder$gdpPercap)
```
### Boxplot
- Boxplots are similar syntax
```{r, fig.height=4}
boxplot(gapminder$gdpPercap)
```
- If we want a boxplot for each category, use `variable.name~category.variable.name` to tell `R` to plot a boxplot **by** category
```{r, fig.height=4}
boxplot(gdpPercap~continent,data=gapminder)
```
### Scatterplot
- Scatterplot syntax for plotting is similar to `hist()` and `boxplot()`: `plot(df$x,df$y)`
```{r, fig.height=3.5}
plot(gapminder$gdpPercap, gapminder$lifeExp)
```
## With `ggplot2`
# Summary Statistics
- For some object named `distr`:
| Function | Result |
|---|---|
`min(distr)` | Find minimum value |
`max(distr)` | Find maximum value |
`range(distr)` | Find the range |
`sort(distr)` | Sort values of distribution from smallest to largest |
|`sort(distr)[1]` | Find first value when sorted (equvalient to finding min) |
|`sort(distr, decreasing=TRUE)` | Sort from largest to smallest |
|`median(distr)` | Find the median |
|`mean(distr)` | Find the mean |
|`var(distr)` | Find the variance |
|`sd(distr)` | Find the standard deviation |
|`table(distr)` | Gives frequency table of categorical variable values |
|`fivenum(distr)` | Five number summary (min, q1, median, q3, max) |
|`summary(distr)` | Gives min, q1, median, mean, q3, max |
|`quantile(distr, 0.32)` | Find specific (e.g. 32$^{nd}$) percentile |
|`summary(factor(distr))` | Lists all unique values in distr |
|`sum(distr)` | Takes the sum of all values in distr |
# Linear Regression
```{r}
sex<-c("Male", "Female")
marital<-c("Married","Unmarried")
```