In [1]:
# Setup

library(readr)
data <- read_csv("https://github.com/CALDISS-AAU/workshop_R-intro/raw/master/data/ESS2018DK_subset.csv")

Parsed with column specification:
cols(
  idno = col_double(),
  netustm = col_double(),
  ppltrst = col_double(),
  vote = col_character(),
  prtvtddk = col_character(),
  lvpntyr = col_character(),
  tygrtr = col_character(),
  gndr = col_character(),
  yrbrn = col_double(),
  edlvddk = col_character(),
  eduyrs = col_double(),
  wkhct = col_double(),
  wkhtot = col_double(),
  grspnum = col_double(),
  frlgrsp = col_double(),
  inwtm = col_double()
)


# Vectors

A "vector" is a basic data structure in R.

## Constructing vectors

Vectors can be created using `c()`:

In [2]:
names <- c('araya', 'keenan', 'townsend')
years <- c(1961, 1964, 1972)

In [3]:
print(names)
print(years)

[1] "araya"    "keenan"   "townsend"
[1] 1961 1964 1972


In [4]:
mean(years)

Notice that vectors can only store values of the same type/class.

When different types are combined in a vector, R coerces all values to a single type that can represent every value (for example, numbers become text when combined with character values).

In [5]:
names_years <- c('araya', 1961, 'keenan', 1964)

In [6]:
names_years # Notice the numbers are now converted to text
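
The coercion rule can be illustrated with a small sketch. The variable names below are made up for this example and do not appear elsewhere in the workshop material; `typeof()` reports the type each vector ends up with.

```r
# Each call mixes two types; typeof() reports the type R coerced to.
mixed_lgl_int <- c(TRUE, 2L)        # logical + integer
mixed_int_dbl <- c(2L, 3.5)         # integer + double
mixed_dbl_chr <- c(3.5, "keenan")   # double + character

print(typeof(mixed_lgl_int))  # "integer"
print(typeof(mixed_int_dbl))  # "double"
print(typeof(mixed_dbl_chr))  # "character"

# Coercion changes the stored values as well: TRUE is stored as 1
print(mixed_lgl_int)
```

The pattern is that R picks the more general of the types involved: logical gives way to integer, integer to double, and everything gives way to character.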

## Types of vectors

There are six types of atomic vectors: logical, integer, double, character, complex, and raw.

The types primarily used for data analysis are: logical, integer, double, character.

"integer" and "double" are both referred to as *numeric vectors* (whole numbers and decimal numbers, respectively).

The type of a vector can be examined with either `typeof()` or `class()`:

In [7]:
print(class(names))
print(class(years))

[1] "character"
[1] "numeric"


In [8]:
print(typeof(years))

[1] "double"
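
As a rough illustration of the difference between "integer" and "double", the sketch below creates the same years in both forms. The variable names are hypothetical; the `L` suffix is the standard R way of writing an integer literal.

```r
whole_dbl <- c(1961, 1964, 1972)     # plain numbers are stored as double
whole_int <- c(1961L, 1964L, 1972L)  # the L suffix creates integers
flags <- whole_int > 1962            # a comparison produces a logical vector

print(typeof(whole_dbl))  # "double"
print(typeof(whole_int))  # "integer"
print(class(whole_int))   # "integer"
print(typeof(flags))      # "logical"

# Both numeric types pass is.numeric()
print(is.numeric(whole_dbl))  # TRUE
print(is.numeric(whole_int))  # TRUE
```

Note that `class()` reports "numeric" for a double vector but "integer" for an integer vector, while `typeof()` always reports the underlying storage type.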


# Data frames and vectors

Data frames are essentially a collection of same-length vectors.

R treats single columns (or variables) as "vectors".

A single column in a data frame is accessed with `$`, which returns the column as a vector.

In [9]:
head(data$yrbrn) # First six values of yrbrn variable
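
To make the link between vectors and data frames concrete, here is a minimal sketch that builds a small data frame from two vectors and pulls one column back out. The data frame `toy_df` and its column names are invented for this illustration and are not part of the ESS data.

```r
# Two vectors of the same length
author <- c("araya", "keenan", "townsend")
year <- c(1961, 1964, 1972)

# Each vector becomes one column of the data frame
toy_df <- data.frame(author = author, year = year)

# $ returns the column as an ordinary vector
print(toy_df$year)
print(is.vector(toy_df$year))        # TRUE
print(identical(toy_df$year, year))  # TRUE

# Vector functions therefore work directly on a column
print(mean(toy_df$year))
```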

Each value in a vector is assigned an index referring to the position of the value in the vector (indexing starts from 1).

A vector is indexed using `[]`:

In [10]:
data$yrbrn[10] # Returns the 10th value (row 10) of the yrbrn variable

In [11]:
data$yrbrn[2:10] # Returns values 2 to 10 of the yrbrn variable (both inclusive)

A range of useful functions exist for calculating descriptive measures for a vector, e.g. `mean()`, `min()`, `max()` and `length()`.

In [12]:
min(data$yrbrn) # Returns smallest value
max(data$yrbrn) # Returns largest value
mean(data$yrbrn) # Returns mean value
length(data$yrbrn) # Returns number of values in the vector (corresponding to the number of rows)

`unique()` returns the unique values in a vector (useful for getting familiar with a variable):

In [13]:
unique(data$ppltrst)

## Useful operations and functions on vectors
Below are some examples of different commands to interact with vectors.

| Code   | Description |
|:-------|:------------|
|`my_vec[-3]` | Everything but the 3rd element |
|`my_vec[c(1,4)]` | The 1st and 4th element |
|`my_vec[2:4]` | The elements from index 2 to 4 |
|`length(my_vec)` | The number of elements |
|`sort(my_vec)` | Sorts the elements in ascending order |
|`sum(my_vec)` | The sum of the vector elements (numeric) |
|`mean(my_vec)`| The mean of the vector elements (numeric) |
|`min(my_vec)` | The vector element with the lowest value (numeric) |
|`max(my_vec)` | The vector element with the highest value (numeric) |
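
The commands in the table can be tried out on a small example vector. The values in `my_vec` below are made up purely for illustration.

```r
my_vec <- c(10, 40, 20, 50, 30)

print(my_vec[-3])       # everything but the 3rd element: 10 40 50 30
print(my_vec[c(1, 4)])  # the 1st and 4th element: 10 50
print(my_vec[2:4])      # the elements from index 2 to 4: 40 20 50
print(length(my_vec))   # 5
print(sort(my_vec))     # 10 20 30 40 50
print(sum(my_vec))      # 150
print(mean(my_vec))     # 30
print(min(my_vec))      # 10
print(max(my_vec))      # 50
```

None of these commands change `my_vec` itself; each returns a new value, which has to be assigned to a name if it should be kept.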

## Missing values

Data will often contain missing values. Missing values can denote many things, such as a non-response, an invalid answer, inaccessible information and so on.

A missing value is a placeholder that marks where a value is absent or unknown. Missing values are denoted as `NA` in R.

The `summary()` function includes information about the number of missing values:

In [14]:
summary(data$inwtm)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  17.00   50.00   59.00   62.68   70.00  613.00       6 

Missing values are neither high nor low in R. As a result, computations that include missing values return `NA` by default:

In [15]:
min(data$inwtm) # NA is neither high nor low - returns NA
max(data$inwtm) # NA is neither high nor low - returns NA
mean(data$inwtm) # NA is neither high nor low - returns NA
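
The same behaviour can be seen on a small vector. This is a sketch with invented values; `interview_minutes` is a hypothetical name and `is.na()` is the base R function for testing which values are missing.

```r
interview_minutes <- c(45, NA, 60, 52, NA)  # made-up values with two NAs

# is.na() flags the missing positions
print(is.na(interview_minutes))       # FALSE TRUE FALSE FALSE TRUE

# Summing the flags counts the missing values
print(sum(is.na(interview_minutes)))  # 2

# Comparing with NA gives NA, because the value is unknown
print(NA > 50)                        # NA

# Any computation that includes an NA returns NA by default
print(sum(interview_minutes))         # NA
```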

Usually one will have to deal with the missing values in some way, either by replacing them or removing them.

Some functions have a built-in argument for dealing with missing values.

Looking at the help file for `max()` (`?max`), we can see an argument called `na.rm`. This argument is used for removing missing values when performing the calculation.

Notice that in the help file, the argument is set to `na.rm = FALSE`. This is the default setting of the function, meaning that unless otherwise specified, the function runs with the argument set to `FALSE` (missing values are kept).

When the argument is set to `TRUE` in the function call, the missing values are removed:

In [16]:
max(data$inwtm, na.rm = TRUE)
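
Removing and replacing missing values can also be done by hand. The sketch below uses a made-up vector (`interview_minutes` is a hypothetical name), and replacing with the mean is only one possible choice, shown for illustration rather than as a recommendation.

```r
interview_minutes <- c(45, NA, 60, 52, NA)  # made-up values with two NAs

# Option 1: let the function skip missing values
print(max(interview_minutes, na.rm = TRUE))  # 60

# Option 2: remove the missing values by keeping only non-NA positions
complete_only <- interview_minutes[!is.na(interview_minutes)]
print(complete_only)                         # 45 60 52

# Option 3: replace the missing values, here with the mean of the known values
replaced <- interview_minutes
replaced[is.na(replaced)] <- mean(interview_minutes, na.rm = TRUE)
print(replaced)
```

Options 1 and 2 leave the original vector untouched, whereas option 3 works on a copy so that the original values remain available for comparison.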