# Data frames in R

## R Objects: Data Frames

A "data frame" is the R-equivalent of a spreadsheet (a table of rows and columns). It is one of the most useful storage structures for data analysis in R.
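As a minimal sketch of the idea, a small data frame can be built by hand with base R's `data.frame()`. The object name `toy` and its values are made up for illustration and are not part of the data used later.

```R
# Hypothetical toy data: three respondents (rows), three variables (columns)
toy <- data.frame(
  idno  = c(1, 2, 3),
  gndr  = c("Male", "Female", "Female"),
  yrbrn = c(1974, 1975, 1958)
)

toy         # Prints the table: one row per respondent, one column per variable
class(toy)  # "data.frame"
```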

## Importing data with `readr`

`readr` is a package for reading various data files into R.

R does have some "base" functions for doing this (such as `read.csv()`), but `readr` is generally faster.

`readr` is part of a collection of packages called `tidyverse`: https://www.tidyverse.org/

In [2]:
library(readr)

ess18 <- read_csv("https://github.com/CALDISS-AAU/workshop_R-intro/raw/master/data/ESS2018DK_subset.csv")

Rows: 1285 Columns: 16

-- Column specification --------------------------------------------------
Delimiter: ","
chr  (6): vote, prtvtddk, lvpntyr, tygrtr, gndr, edlvddk
dbl (10): idno, netustm, ppltrst, yrbrn, eduyrs, wkhct, wkhtot, grspnum, frl...

i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.



`ess18` is now a data frame object containing a dataset. Notice that the basic syntax stays the same: `objectname <- somefunction(something)`.
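The same pattern applies to the base R reader `read.csv()`. The sketch below uses only base R and a small made-up file written to a temporary location, so the file contents and the object name `toy` are assumptions for illustration.

```R
# Write a small, made-up CSV file to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("idno,gndr,yrbrn", "1,Male,1974", "2,Female,1975"), tmp)

# Same syntax as before: objectname <- somefunction(something)
toy <- read.csv(tmp)

class(toy)  # "data.frame"
toy
```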

## Data used: European Social Survey 2018

We are using a subset of the Danish European Social Survey data from 2018 (https://www.europeansocialsurvey.org/).

The data contains the following variables:

|variable | description |
|----|---|
|idno|Respondent's identification number|
|netustm |Internet use, how much time on typical day, in minutes|
|ppltrst|Most people can be trusted or you can't be too careful|
|vote|Voted last national election|
|prtvtddk|Party voted for in last national election, Denmark|
|lvpntyr|Year first left parents for living separately for 2 months or more|
|tygrtr|Retire permanently, age too young. SPLIT BALLOT|
|gndr|Gender|
|yrbrn|Year of birth|
|edlvddk|Highest level of education, Denmark|
|eduyrs|Years of full-time education completed|
|wkhct|Total contracted hours per week in main job overtime excluded|
|wkhtot|Total hours normally worked per week in main job overtime included|
|grspnum|What is your usual [weekly/monthly/annual] gross pay|
|frlgrsp|Fair level of [weekly/monthly/annual] gross pay for you|
|inwtm|Interview length in minutes, main questionnaire|


## Exploring Data Frames
To get an idea of what the data contains, we can use `head()`:

In [3]:
head(ess18)

idno,netustm,ppltrst,vote,prtvtddk,lvpntyr,tygrtr,gndr,yrbrn,edlvddk,eduyrs,wkhct,wkhtot,grspnum,frlgrsp,inwtm
<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
5816,90.0,7,Yes,SF Socialistisk Folkeparti - Socialist People's Party,1994,60,Male,1974,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",35,37,37,37000.0,35000.0,61
7251,300.0,5,Yes,Dansk Folkeparti - Danish People's Party,1993,40,Female,1975,"Faglig uddannelse (håndværk, handel, landbrug mv.), F.eks. Faglærte, Social-",13,32,34,22000.0,30000.0,68
7887,360.0,8,Yes,Socialdemokratiet - The Social democrats,1983,55,Male,1958,"Lang videregående uddannelse. Kandidatuddannelser af 5.-6. års varighed, F.eks",25,39,39,36000.0,42000.0,89
9607,540.0,9,Yes,Alternativet - The Alternative,1982,64,Female,1964,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",13,32,34,32000.0,,50
11688,,5,Yes,Socialdemokratiet - The Social democrats,1968,50,Female,1952,"Faglig uddannelse (håndværk, handel, landbrug mv.), F.eks. Faglærte, Social-",2,37,37,,,77
12355,120.0,5,Yes,Socialdemokratiet - The Social democrats,1987,60,Male,1963,"Faglig uddannelse (håndværk, handel, landbrug mv.), F.eks. Faglærte, Social-",14,38,37,36000.0,38000.0,48


We can check the names of the rows and columns (the variable names) using `rownames()` and `colnames()`. `dim()` returns the number of rows and columns.

In [4]:
head(rownames(ess18))

In [5]:
colnames(ess18)

In [6]:
dim(ess18)
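To show what these three functions return, here is a small sketch on a made-up data frame (the object `toy` is hypothetical and much smaller than `ess18`):

```R
# Hypothetical toy data frame (not the ESS data)
toy <- data.frame(
  gndr  = c("Male", "Female", "Female"),
  yrbrn = c(1974, 1975, 1958)
)

rownames(toy)  # "1" "2" "3" (default row names are the row numbers, stored as text)
colnames(toy)  # "gndr" "yrbrn"
dim(toy)       # 3 2 (rows first, then columns)
```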

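See key summary statistics using `summary()`. For numeric variables it shows the minimum, quartiles, median, mean, maximum and the number of missing values; for character variables it only shows the length and class.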

In [7]:
summary(ess18)

      idno           netustm          ppltrst          vote          
 Min.   :  5816   Min.   :   0.0   Min.   : 0.00   Length:1285       
 1st Qu.: 93707   1st Qu.:  90.0   1st Qu.: 6.00   Class :character  
 Median :112877   Median : 150.0   Median : 7.00   Mode  :character  
 Mean   :110980   Mean   : 227.4   Mean   : 7.08                     
 3rd Qu.:131072   3rd Qu.: 300.0   3rd Qu.: 8.00                     
 Max.   :150446   Max.   :1020.0   Max.   :10.00                     
                  NA's   :151      NA's   :3                         
   prtvtddk           lvpntyr             tygrtr              gndr          
 Length:1285        Length:1285        Length:1285        Length:1285       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                       

## Data frames and vectors

Data frames consist of rows and columns. Typically R expects the rows of a data frame to contain observations and the columns of the data frame to contain variables (information about the observations).

R treats single columns (or variables) as "vectors". A vector is a series of values of the same class.
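A short sketch of what "same class" means in practice. The vectors below are made up with `c()`, which is base R's function for combining values into a vector:

```R
# Hypothetical vectors created with c()
years <- c(1974, 1975, 1958)
class(years)    # "numeric"

parties <- c("Alternativet", "Venstre")
class(parties)  # "character"

# Mixing numbers and text forces all values into the same class
mixed <- c(1974, "unknown")
class(mixed)    # "character"
mixed           # "1974" "unknown"
```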

We can refer to a single column in a data frame with `$`. The result is a vector.

In [8]:
head(ess18$yrbrn) # First six values of yrbrn variable

Each value in a vector is assigned an index referring to the position of the value in the vector (starting from 1).

A vector is indexed using `[]`:

In [9]:
ess18$yrbrn[10] # Returns the 10th value (row 10) of the yrbrn variable

In [10]:
ess18$yrbrn[2:10] # Returns values 2-10 of the yrbrn variable (both inclusive)

A range of useful functions exist for calculating descriptive measures for a vector, e.g. `mean()`, `min()`, `max()` and `length()`.

In [11]:
min(ess18$yrbrn) # Returns smallest value
max(ess18$yrbrn) # Returns largest value
mean(ess18$yrbrn) # Returns mean value
length(ess18$yrbrn) # Returns number of values in the vector (corresponding to the number of rows)

`unique()` returns the unique values in a vector (useful for getting familiar with a variable):

In [12]:
unique(ess18$ppltrst)
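A small sketch of how `unique()` behaves, using made-up trust scores rather than the real `ppltrst` column. `sort()` and `table()` are base R functions added here as one possible way to get a tidier overview:

```R
ppltrst <- c(7, 5, 8, 9, 5, 5, NA)  # Hypothetical trust scores

unique(ppltrst)        # 7 5 8 9 NA (in order of first appearance)
sort(unique(ppltrst))  # 5 7 8 9 (sort() drops NA by default)
table(ppltrst)         # Counts per value (NA is left out by default)
```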

### Useful operations and functions on vectors
Below are some examples of different commands to interact with vectors.

| Code   | Description |
|:-------|:------------|
|`my_vec[-3]` | Everything but the 3rd element |
|`my_vec[c(1,4)]` | The 1st and 4th element |
|`my_vec[c(2:4)]` | The elements from index 2 to 4 |
|`length(my_vec)` | The number of elements |
|`sort(my_vec)` | Sorts the elements in ascending order |
|`sum(my_vec)` | The sum of the vector elements (numeric) |
|`mean(my_vec)`| The mean of the vector elements (numeric) |
|`min(my_vec)` | The vector element with the lowest value (numeric) |
|`max(my_vec)` | The vector element with the highest value (numeric) |
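The commands in the table can be tried on a concrete vector. The values of `my_vec` below are made up for illustration:

```R
my_vec <- c(40, 10, 30, 20)  # Hypothetical example vector

my_vec[-3]       # 40 10 20
my_vec[c(1, 4)]  # 40 20
my_vec[c(2:4)]   # 10 30 20
length(my_vec)   # 4
sort(my_vec)     # 10 20 30 40
sum(my_vec)      # 100
mean(my_vec)     # 25
min(my_vec)      # 10
max(my_vec)      # 40
```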

## Missing values

Data will often contain missing values. Missing values can denote a lot of things, like a non-response, an invalid answer, inaccessible information and so on.

A missing value is a placeholder for a position where no actual value is recorded. Missing values are denoted as `NA` in R.

The `summary()` function includes information about the number of missing values:

In [13]:
summary(ess18$inwtm)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  18.00   51.00   59.00   63.32   70.00  613.00       5 

Missing values are neither high nor low in R. This means that computations involving missing values return `NA` by default:

In [15]:
min(ess18$inwtm) # NA is neither high nor low - returns NA
max(ess18$inwtm) # NA is neither high nor low - returns NA
mean(ess18$inwtm) # NA is neither high nor low - returns NA

Usually one will have to deal with the missing values in some way, either by replacing them or removing them.

Some functions have a built-in argument for dealing with missing values, for example `na.rm = TRUE` in `mean()`.
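A minimal sketch of both approaches on a made-up vector: using the `na.rm` argument that `mean()`, `min()` and `max()` accept, and removing the missing values by hand with `is.na()`. The vector `minutes` is hypothetical.

```R
# Hypothetical interview lengths in minutes, with one missing value
minutes <- c(61, 68, NA, 50)

mean(minutes)                # NA, because one value is missing
mean(minutes, na.rm = TRUE)  # Mean of the three non-missing values

is.na(minutes)               # FALSE FALSE TRUE FALSE
sum(is.na(minutes))          # 1 (the number of missing values)

# Removing the missing values by keeping only the non-missing ones
minutes[!is.na(minutes)]     # 61 68 50
```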

## Data handling in R

When working with data, we usually need to perform some introductory data handling steps before being able to conduct our analysis.

This could include:

- Filtering observations and variables
- Creating new variables
- Recoding values

R supports all these operations in "base" R, but there are also packages with more intuitive functions (like the packages in tidyverse: https://www.tidyverse.org/).

### Filtering observations (subsetting)

Like values in a vector, each value in a data frame also has an index. Each value in a data frame can be uniquely identified by the combination of its row and column index.

Data frames are indexed using `[rowindex, columnindex/column name]`:

In [16]:
ess18[10, 2] # Row 10, column 2

netustm
<dbl>
60


In [17]:
ess18[10, 'prtvtddk'] # Row 10 in column prtvtddk

prtvtddk
<chr>
"Venstre, Danmarks Liberale Parti - The Liberal Party"


Notice that even though we are asking for a single value, the object returned is a 1x1 data frame!
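This happens because `read_csv()` returns a tibble, and tibbles keep the data frame structure when indexed with `[ , ]`. A plain base R data frame would return the bare value here. The sketch below uses a made-up base R data frame `toy`, so `drop = FALSE` is needed to reproduce the 1x1 result, and then shows two ways to get the bare value (both also work on `ess18`):

```R
# Hypothetical toy data frame (base R, not a tibble)
toy <- data.frame(
  gndr  = c("Male", "Female", "Female"),
  yrbrn = c(1974, 1975, 1958)
)

toy[2, "yrbrn", drop = FALSE]  # A 1x1 data frame
toy[[2, "yrbrn"]]              # The bare value: 1975
toy$yrbrn[2]                   # Same value, via the column vector
```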

We can also ask for several rows and columns:

In [18]:
ess18[c(1:10), c('prtvtddk', 'gndr')] # Rows 1-10, columns prtvtddk and gndr (specified as vectors)

prtvtddk,gndr
<chr>,<chr>
SF Socialistisk Folkeparti - Socialist People's Party,Male
Dansk Folkeparti - Danish People's Party,Female
Socialdemokratiet - The Social democrats,Male
Alternativet - The Alternative,Female
Socialdemokratiet - The Social democrats,Female
Socialdemokratiet - The Social democrats,Male
Dansk Folkeparti - Danish People's Party,Male
Dansk Folkeparti - Danish People's Party,Female
Liberal Alliance - Liberal Alliance,Female
"Venstre, Danmarks Liberale Parti - The Liberal Party",Male


### Filtering with booleans/logical values

Normally we do not know the row index of the values we want to keep. Rather we want to filter observations based on certain criteria.

In R this is done via the use of "booleans" or "logical values". These are values that are either `TRUE` or `FALSE`.

A number of operations in R always return a logical value:

- `>`
- `>=`
- `<`
- `<=`
- `==`
- `!=`

In [19]:
42 > 10

In [20]:
10 != 10
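The same operators also work on whole vectors, returning one logical value per element. A small sketch with made-up birth years:

```R
yrbrn <- c(1974, 1975, 1958, 1964)  # Hypothetical birth years

yrbrn > 1970         # TRUE TRUE FALSE FALSE
sum(yrbrn > 1970)    # 2 (TRUE counts as 1, FALSE as 0)
yrbrn[yrbrn > 1970]  # 1974 1975
```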

Logicals can be used when filtering observations. Logicals are used to index the rows, so that only rows meeting the criteria will be returned:

In [21]:
head(ess18[ess18$gndr == 'Male', c('gndr', 'prtvtddk')])

gndr,prtvtddk
<chr>,<chr>
Male,SF Socialistisk Folkeparti - Socialist People's Party
Male,Socialdemokratiet - The Social democrats
Male,Socialdemokratiet - The Social democrats
Male,Dansk Folkeparti - Danish People's Party
Male,"Venstre, Danmarks Liberale Parti - The Liberal Party"
Male,"Venstre, Danmarks Liberale Parti - The Liberal Party"


Logicals can also be used when indexing vectors, which makes it possible to perform calculations on a specific group:

In [22]:
mean(ess18$ppltrst[ess18$gndr == "Male"], na.rm = TRUE) # Mean of ppltrst for gndr = male
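Several criteria can be combined with `&` (and) and `|` (or). The following is a sketch on a made-up data frame `toy`, not on the real data:

```R
# Hypothetical toy data with the same variable names as ess18
toy <- data.frame(
  gndr    = c("Male", "Female", "Male", "Female"),
  yrbrn   = c(1974, 1975, 1958, 1964),
  ppltrst = c(7, 5, 8, NA)
)

# Rows where both conditions are TRUE (men born after 1960)
toy[toy$gndr == "Male" & toy$yrbrn > 1960, ]

# Rows where at least one condition is TRUE (women, or trust of 8 or more)
toy[toy$gndr == "Female" | toy$ppltrst >= 8, ]

# Mean trust among women, ignoring the missing value
mean(toy$ppltrst[toy$gndr == "Female"], na.rm = TRUE)  # 5
```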

## Recoding and creating variables

Creating variables and (simple) recoding is usually done in the same way. The only difference is whether the result is assigned to a new variable or overwrites an existing one (we are here only looking at recoding by arithmetic operations and not by replacing values).

In base R, we simply specify a variable that is not in the data and specify the contents:

In [23]:
ess18$inwth <- ess18$inwtm / 60 # Creating variable for length of interview in hours

head(ess18$inwth)

In [24]:
ess18$inwth <- NULL # This line removes the variable
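A sketch of both cases on a made-up data frame: creating a new variable and overwriting an existing one. The variable name `age` and the use of 2018 as the reference year are assumptions for illustration.

```R
# Hypothetical toy data with two of the ess18 variable names
toy <- data.frame(
  yrbrn = c(1974, 1975, 1958),
  inwtm = c(61, 68, 89)
)

# New variable: approximate age in the survey year (assumed to be 2018)
toy$age <- 2018 - toy$yrbrn

# Overwriting an existing variable: interview length converted from minutes to hours
toy$inwtm <- toy$inwtm / 60

toy
```

Overwriting replaces the original values, so creating a new variable is the safer choice when the original might be needed later.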

## Saving files

The most important thing to save is the R script.

Otherwise data can be exported, e.g. as a .csv file.

Check your working directory with `getwd()`. The working directory can be changed with `setwd()`.

We can save a .csv-file (comma-separated values) with `write.csv`:

```R
write.csv(ess18, file = "my_ess2018dk.csv", row.names = FALSE)
```

**Code breakdown:**

| Code | Description |
|:-----|:------------|
|`ess18` | The object we want to save |
|`file = "my_ess2018dk.csv"` | The filename - .csv for comma-separated values |
|`row.names = FALSE` | Whether to store row names (in this case no) |

Column names are always stored by `write.csv()`; trying to set `col.names` is ignored with a warning.

A .csv-file is a multi-platform format and can be loaded by most statistical software (Excel, Stata, etc.).
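To check that a saved file can be read back, a round trip can be sketched as follows. The data frame `toy` is made up, and the file is written to a temporary location (via `tempfile()`) so nothing in the working directory is overwritten.

```R
# Hypothetical toy data frame
toy <- data.frame(
  gndr  = c("Male", "Female"),
  yrbrn = c(1974, 1975)
)

# Save to a temporary file
path <- tempfile(fileext = ".csv")
write.csv(toy, file = path, row.names = FALSE)

# Read the file back in and inspect it
toy_again <- read.csv(path)
toy_again
```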