## R Prep Minicourse
### Week 14: Importing, Generating, and Sampling Data

**Credits:** Importing Data in R [(Part 1)](https://www.datacamp.com/courses/importing-data-in-r-part-1) and [(Part 2)](https://www.datacamp.com/courses/importing-data-in-r-part-2e) Tutorials.

#### Importing data

CSV files can be imported with `read_csv()`, which is part of the `readr` package. It accepts either a path to a file or a URL. For example:

`read_csv("potatoes.csv")`

In [1]:
# Load the readr package
library(readr)

# Import potatoes.csv with read_csv() and store the result in potatoes.
# "potatoes.csv" is a relative file path: the CSV file sits in the same
# folder as this Jupyter notebook.
potatoes <- read_csv("potatoes.csv")

# Reading directly from the URL is equivalent. However, CoCalc has no network access, so this doesn't work there. It would work in RStudio.
# potatoes <- read_csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_1477/datasets/potatoes.csv")

Parsed with column specification:
cols(
  area = col_integer(),
  temp = col_integer(),
  size = col_integer(),
  storage = col_integer(),
  method = col_integer(),
  texture = col_double(),
  flavor = col_double(),
  moistness = col_double()
)


In [2]:
head(potatoes, 3)

area,temp,size,storage,method,texture,flavor,moistness
1,1,1,1,1,2.9,3.2,3.0
1,1,1,1,2,2.3,2.5,2.6
1,1,1,1,3,2.5,2.8,2.8


Sometimes you will need optional arguments. A common one is `skip`. Many datasets begin with several rows of description before the data start. In these cases, use `skip` to specify how many of the first rows to ignore. See this example:

In [3]:
# Fails to parse properly because the first rows contain a description of the data.
CO2 <- read_csv("weekly_in_situ_co2_mlo.csv")
# Or CO2 <- read_csv("http://scrippsco2.ucsd.edu/assets/data/atmospheric/stations/in_situ_co2/weekly/weekly_in_situ_co2_mlo.csv")
head(CO2)

Parsed with column specification:
cols(
  `-------------------------------------------------------------------------------------------` = col_character()
)
"3093 parsing failures.
# A tibble: 5 x 5
    row col                                 expected     actual  file
  <int> <chr>                               <chr>        <chr>   <chr>
1    32 ----------------------------------~ delimiter o~ A       'weekly_in_si~
2    32 <NA>                                1 columns    2 colu~ 'weekly_in_si~
3    44 <NA>                                1 columns    2 colu~ 'weekly_in_si~
4    45 <NA>                                1 columns    2 colu~ 'weekly_in_si~
5    46 <NA>                                1 columns    2 colu~ 'weekly_in_si~
...

-------------------------------------------------------------------------------------------
Atmospheric CO2 concentrations (ppm) derived from in situ air measurements
"at Mauna Loa, Observatory, Hawaii: Latitude 19.5°N Longitude 155.6°W Elevation 3397m"
""
"Source: R. F. Keeling, S. J. Walker, S. C. Piper and A. F. Bollenbacher"
Scripps CO2 Program ( http://scrippsco2.ucsd.edu )
Scripps Institution of Oceanography (SIO)


In [4]:
CO2 <- read_csv("weekly_in_situ_co2_mlo.csv", skip=44, col_names = c("Date", "CO2 ppm"))
head(CO2)

Parsed with column specification:
cols(
  Date = col_date(format = ""),
  `CO2 ppm` = col_double()
)


Date,CO2 ppm
1958-03-29,316.19
1958-04-05,317.31
1958-04-12,317.69
1958-04-19,317.58
1958-04-26,316.48
1958-05-03,316.95


There are several other optional arguments that may or may not be necessary depending on your case. Honestly, your best bet is to use the RStudio functionality for data importing. There is a tutorial available [here](https://support.rstudio.com/hc/en-us/articles/218611977-Importing-Data-with-RStudio).
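
As a minimal sketch of how `skip` and `col_names` work together, the following writes a small stand-in file to a temporary location and reads it back. The file contents, the `path` variable, and the name `co2_demo` are made up for illustration; only `read_csv()` and its `skip`, `col_names`, and `col_types` arguments come from `readr`.

```r
library(readr)

# Hypothetical file: three description lines followed by two data rows.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "Weekly CO2 readings (stand-in file for illustration)",
  "Source: hypothetical station",
  "----",
  "1958-03-29,316.19",
  "1958-04-05,317.31"
), path)

# Skip the three description lines and supply the column names ourselves.
# col_types = "Dd" asks for a date column followed by a double column.
co2_demo <- read_csv(path, skip = 3,
                     col_names = c("Date", "CO2 ppm"),
                     col_types = "Dd")
co2_demo
```

Passing `col_types` is optional; it only spares `readr` from guessing the column types.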

#### Generating data

R has functions associated with well-known probability distributions (see [here](http://www.r-tutor.com/elementary-statistics/probability-distributions)). You can generate random samples, evaluate the probability mass or density function (for a discrete distribution, $P(X = k)$), and evaluate the cumulative distribution function (i.e., $P(X \le k)$).

In [5]:
# This is the documentation for the functions associated with the binomial distribution.
?dbinom

Consider the following problem:

Suppose there are twelve multiple-choice questions in an English class quiz. Each question has five possible answers, and only one of them is correct. Find the probability of having four or fewer correct answers if a student attempts to answer every question at random.

In [6]:
# Since only one out of five possible answers is correct, the probability of answering a question
# correctly at random is 1/5 = 0.2. The probability of exactly 4 correct answers from random guessing is:
dbinom(4, size=12, prob=0.2)

In [7]:
# To find the probability of four or fewer correct answers from random guessing, we sum dbinom over x = 0, ..., 4.
dbinom(0, size=12, prob=0.2) + 
dbinom(1, size=12, prob=0.2) + 
dbinom(2, size=12, prob=0.2) + 
dbinom(3, size=12, prob=0.2) + 
dbinom(4, size=12, prob=0.2) 

In [8]:
# Alternatively, we can use pbinom, the cumulative distribution function of the binomial distribution.
pbinom(4, size=12, prob=0.2) 
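
Because `dbinom()` is vectorized over its first argument, the five-term sum can also be written in one line. A small sketch using only base R and the quiz parameters from the problem above (the names `p_sum` and `p_cdf` are illustrative):

```r
# dbinom() accepts a vector of counts, so 0:4 evaluates all five probabilities at once.
p_sum <- sum(dbinom(0:4, size = 12, prob = 0.2))

# pbinom() returns P(X <= 4) directly.
p_cdf <- pbinom(4, size = 12, prob = 0.2)

# The two agree up to floating-point rounding.
all.equal(p_sum, p_cdf)
round(p_cdf, 4)  # 0.9274
```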

In [9]:
# If we wanted to generate data for a classroom of 30 students in which everyone answers
# all questions at random, we can use the rbinom function as follows.
rbinom(n=30, size=12, prob=0.2)
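
One way to sanity-check simulated scores is to draw a much larger sample and compare it with the exact values. The sketch below assumes a seed of 1 and 10,000 simulated students, both chosen arbitrarily for illustration:

```r
set.seed(1)  # arbitrary seed, only so repeated runs give the same draws
scores <- rbinom(n = 10000, size = 12, prob = 0.2)

# The share of simulated students with four or fewer correct answers
# should be close to the exact probability from pbinom().
mean(scores <= 4)
pbinom(4, size = 12, prob = 0.2)

# The average score should be close to size * prob = 2.4.
mean(scores)
```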

The functions associated with other probability distributions work similarly:
- `dunif`, `punif`, `runif` for the uniform distribution;
- `dnorm`, `pnorm`, `rnorm` for the normal distribution;
- `dexp`, `pexp`, `rexp` for the exponential distribution;
- etc.
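
As a rough illustration of that naming pattern (`d` for density, `p` for cumulative probability, `r` for random draws), here is a base-R sketch. The specific arguments and the seed are arbitrary choices, not values from the course material:

```r
# Uniform on [0, 1]: the density is constant at 1, and P(X <= 0.25) = 0.25.
dunif(0.5, min = 0, max = 1)
punif(0.25, min = 0, max = 1)

# Standard normal: P(X <= 0) = 0.5 by symmetry.
pnorm(0, mean = 0, sd = 1)

# Exponential with rate 1: P(X <= 1) = 1 - exp(-1).
pexp(1, rate = 1)

set.seed(2)  # arbitrary seed for reproducible draws
u <- runif(5, min = 0, max = 1)   # five draws from Uniform(0, 1)
z <- rnorm(5, mean = 0, sd = 1)   # five draws from the standard normal
w <- rexp(5, rate = 1)            # five draws from Exponential(1)
```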

#### Sampling data

The last function we'll look at is `sample()`. It takes a vector $x$ and returns a random sample of its elements of a specified size. You can also specify whether to sample with or without replacement (the default is without). Hence:

`sample([vector], size = [# of samples], replace = [boolean])`

In [10]:
# If we wanted to throw a fair die
sample(1:6, size=1)

# If we wanted to throw a fair die three times
sample(1:6, size=3, replace=TRUE)  # We need replace = TRUE because we may sample the same value more than once
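
Two details are easy to trip over: the draws change on every run unless a seed is set, and sampling without replacement cannot return more values than the vector holds. A short base-R sketch, with an arbitrary seed and illustrative variable names:

```r
set.seed(123)  # arbitrary seed so the draws can be reproduced
rolls <- sample(1:6, size = 3, replace = TRUE)

# Resetting the same seed reproduces exactly the same rolls.
set.seed(123)
rolls_again <- sample(1:6, size = 3, replace = TRUE)
identical(rolls, rolls_again)  # TRUE

# Without replacement, sampling all six values gives a random permutation of 1:6.
shuffled <- sample(1:6, size = 6)

# Asking for more values than are available without replacement is an error;
# tryCatch() turns it into the string "error" so the script keeps running.
result <- tryCatch(sample(1:6, size = 7), error = function(e) "error")
```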

In [11]:
# sample() is particularly useful for randomly subsetting data. For example, last class we saw how to split a dataset
# into training and test sets:
library(Matching)
data(lalonde)
trainingRowIndex <- sample(1:nrow(lalonde), 0.8*nrow(lalonde))  # row indices for training data
trainingData <- lalonde[trainingRowIndex, ]  # model training data
testData  <- lalonde[-trainingRowIndex, ]   # test data

cat("Full data:", nrow(lalonde), "observations. \nTraining data:", nrow(trainingData), "observations. \nTest data:", nrow(testData), "observations.")

Loading required package: MASS
## 
##  Matching (Version 4.9-2, Build Date: 2015-12-25)
##  See http://sekhon.berkeley.edu/matching for additional documentation.
##  Please cite software as:
##   Jasjeet S. Sekhon. 2011. ``Multivariate and Propensity Score Matching
##   Software with Automated Balance Optimization: The Matching package for R.''
##   Journal of Statistical Software, 42(7): 1-52. 
##



Full data: 445 observations. 
Training data: 356 observations. 
Test data: 89 observations.
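
The same kind of split can be made reproducible by fixing a seed, and `floor()` makes the training-set size an explicit whole number. This sketch uses the built-in `mtcars` data instead of `lalonde` so that it runs without extra packages; the seed and variable names are arbitrary:

```r
set.seed(2024)  # arbitrary seed so the split is the same on every run
n <- nrow(mtcars)               # 32 rows
train_size <- floor(0.8 * n)    # 25 rows for training

train_idx <- sample(seq_len(n), size = train_size)  # sampled without replacement
train_data <- mtcars[train_idx, ]
test_data <- mtcars[-train_idx, ]   # every row not chosen for training

cat("Training:", nrow(train_data), "rows. Test:", nrow(test_data), "rows.\n")
```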

Use R to create a fake data set that contains 200 observations.

Your data set should consist of a single (continuous) dependent variable and at least two independent variables (also known as factors, or predictors).

Make sure that your independent variables predict the dependent variable, but make sure that they do not PERFECTLY predict it. This means you have to add some random noise to your model. Feel free to create this random noise and your independent variables using R’s random distribution functions (e.g., rnorm, runif) or any other method you wish.

E.g., your formula could be:

$20 \times age + 50 \times education^2 - 2 \times gender + 4 \times treatment\_indicator \times gender + 10 \times treatment\_indicator + treatment\_indicator / \log(parents\_income) + N(0, 10).$

This rather complex formula would work if you had already created a dataframe with age, education, gender, treatment_indicator, and parents_income. Your formula need not be quite so complex. A simple formula is fine, as long as it meets the requirements. The `N(0, 10)` item at the end is where you would use `rnorm(200, 0, 10)`.

Devise a story about the data set: what does it describe? Write down your short story in a paragraph about 3-5 sentences long.

Save your data (and the code that created it, if you used code), so that we will be able to utilize your creation whenever necessary.
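
One possible way to meet these requirements is sketched below. The variable names, value ranges, coefficients, seed, and output file name are all assumptions made for illustration, and the formula is a simplified linear cousin of the example above rather than a required answer:

```r
set.seed(42)  # arbitrary seed so the fake data can be regenerated exactly
n <- 200

# Independent variables (ranges are made up for this sketch).
age <- sample(18:65, size = n, replace = TRUE)
education <- sample(8:20, size = n, replace = TRUE)
treatment_indicator <- rbinom(n, size = 1, prob = 0.5)

# Dependent variable: a linear signal plus N(0, 10) noise,
# so the predictors do not perfectly predict the outcome.
noise <- rnorm(n, mean = 0, sd = 10)
outcome <- 20 * age + 50 * education + 10 * treatment_indicator + noise

fake_data <- data.frame(age, education, treatment_indicator, outcome)

# Check that the fit is good but not perfect (R-squared below 1).
fit <- lm(outcome ~ age + education + treatment_indicator, data = fake_data)
r2 <- summary(fit)$r.squared

# Save the data to a temporary folder and read it back to confirm the round trip.
out_path <- file.path(tempdir(), "fake_data.csv")
write.csv(fake_data, out_path, row.names = FALSE)
reloaded <- read.csv(out_path)
```

Keeping the seed and the generating code next to the saved file means the data set can be rebuilt if the CSV is lost.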

In [12]:
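# Two independent variables, drawn uniformly at random with replacement.
Education <- sample(10:20, size=200, replace=TRUE)
Age <- sample(15:50, size=200, replace = TRUE)

# Dependent variable: a linear signal plus normal noise (mean 100, sd 1000).
x <- 20 * Age + 100 * Education + rnorm(n=200, 100, 1000)
y <- data.frame("education" = Education, "age" = Age, "wealth" = x)
head(y)

# Fit a linear model to check that the predictors explain some, but not all, of the variation.
model <- lm(x ~ Education + Age)

summary(model)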

education,age,wealth
20,26,2621.805
20,48,2609.539
16,49,3971.405
11,32,2228.562
18,21,1891.237
20,23,4132.692



Call:
lm(formula = x ~ Education + Age)

Residuals:
    Min      1Q  Median      3Q     Max 
-2393.7  -642.7    48.0   606.7  2600.9 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  284.713    374.981   0.759 0.448596    
Education     84.757     21.462   3.949 0.000109 ***
Age           21.951      6.611   3.320 0.001072 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 974.7 on 197 degrees of freedom
Multiple R-squared:  0.128,	Adjusted R-squared:  0.1191 
F-statistic: 14.45 on 2 and 197 DF,  p-value: 1.389e-06


In [13]:
?rnorm