# Common errors when working with data in R

R makes some assumptions about how imported data is set up (value delimiters, decimal points, etc.). Furthermore, data may contain errors or need some cleaning before it is ready for analysis.

Errors encountered when working with data are often specific to the dataset at hand, and solving them usually involves a fair amount of trial and error.

In this section we take a look at some of the errors one may encounter when working with tabular data in R. The section uses the same subset of ESS 2018 from the introduction but with some errors added to the data.

## Common error 1: Data uses a non-standard separator

This error can occur when working with CSV files. "CSV" stands for comma-separated values, but the format actually differs a bit across countries: some countries use commas as the separator; others use semicolons.

"CSV" is a type of "delimited data file". Delimited data files consist of lines in which the values are separated by some character (tab, comma, semicolon or something else).
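As a minimal sketch of what this means in practice, the toy example below writes the same two rows with three different delimiters and parses each one with `read_delim()` from `readr`, stating the delimiter explicitly. The data and object names are made up for illustration and are not part of the ESS dataset.

```r
library(readr)

# Hypothetical toy data: the same two rows written with three different delimiters
comma_text     <- "id,score\n1,10\n2,20\n"
semicolon_text <- "id;score\n1;10\n2;20\n"
tab_text       <- "id\tscore\n1\t10\n2\t20\n"

# Strings containing a newline are treated by readr as literal data, not as file paths
from_comma     <- read_delim(comma_text, delim = ",")
from_semicolon <- read_delim(semicolon_text, delim = ";")
from_tab       <- read_delim(tab_text, delim = "\t")

# The parsed values are the same once the right delimiter is given
identical(from_comma$score, from_semicolon$score)
identical(from_comma$score, from_tab$score)
```

Only the separating character differs between the three inputs; the resulting tables have the same columns and values.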

In the code below, the dataset is imported with the standard `read_csv()` function from `readr`:

In [21]:
library(readr)

ess2018 <- read_csv("https://raw.githubusercontent.com/CALDISS-AAU/workshop_r-extended-intro/master/data/ESS2018DK_subset_with-errors.csv")

head(ess2018, 5)

Parsed with column specification:
cols(
  `idno;netustm;ppltrst;vote;prtvtddk;lvpntyr;tygrtr;gndr;yrbrn;edlvddk;eduyrs;wkhct;wkhtot;grspnum;frlgrsp;inwtm` = col_character()
)
"1572 parsing failures.
row col  expected    actual                                                                                                                   file
  1  -- 1 columns 4 columns 'https://raw.githubusercontent.com/CALDISS-AAU/workshop_r-extended-intro/master/data/ESS2018DK_subset_with-errors.csv'
  2  -- 1 columns 4 columns 'https://raw.githubusercontent.com/CALDISS-AAU/workshop_r-extended-intro/master/data/ESS2018DK_subset_with-errors.csv'
  3  -- 1 columns 4 columns 'https://raw.githubusercontent.com/CALDISS-AAU/workshop_r-extended-intro/master/data/ESS2018DK_subset_with-errors.csv'
  4  -- 1 columns 4 columns 'https://raw.githubusercontent.com/CALDISS-AAU/workshop_r-extended-intro/master/data/ESS2018DK_subset_with-errors.csv'
  5  -- 1 columns 4 columns 'https://raw.githubusercontent.com/CAL

idno;netustm;ppltrst;vote;prtvtddk;lvpntyr;tygrtr;gndr;yrbrn;edlvddk;eduyrs;wkhct;wkhtot;grspnum;frlgrsp;inwtm
110;180
705;60
1327;240
3760;300
4658;90


As can be seen, something is off with the data: all the values are condensed into a single column. This happens because `read_csv()` assumes the values are separated by commas, but in this dataset the values are separated by semicolons.
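When it is unclear which separator a file uses, one option is to look at the raw text before parsing it. The sketch below does this on a small made-up file (the file contents and the names `path`, `candidates` and `likely_delimiter` are assumptions for illustration): it reads the header line with `read_lines()` and counts how often each candidate character occurs.

```r
library(readr)

# Hypothetical toy file standing in for the real dataset
path <- tempfile(fileext = ".csv")
writeLines(c("idno;gndr;yrbrn", "110;Male;1949", "705;Male;1958"), path)

# Read the header as plain text, without any parsing
raw_header <- read_lines(path, n_max = 1)
print(raw_header)

# Count how often each candidate delimiter occurs in the header line
candidates <- c(comma = ",", semicolon = ";", tab = "\t")
counts <- sapply(candidates, function(ch) {
  lengths(regmatches(raw_header, gregexpr(ch, raw_header, fixed = TRUE)))
})
print(counts)

# The most frequent candidate is a reasonable guess for the delimiter
likely_delimiter <- names(which.max(counts))
print(likely_delimiter)
```

This is only a heuristic: a header that contains commas inside quoted text could mislead the count, so it is worth looking at a few data lines as well.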

`read_csv2()` can be used in this case, as it assumes semicolons as separators (which is common in some European countries, such as Denmark):

In [22]:
ess2018 <- read_csv2("https://raw.githubusercontent.com/CALDISS-AAU/workshop_r-extended-intro/master/data/ESS2018DK_subset_with-errors.csv") # Works with ";"
head(ess2018)

Using ',' as decimal and '.' as grouping mark. Use read_delim() for more control.
Parsed with column specification:
cols(
  idno = col_double(),
  netustm = col_character(),
  ppltrst = col_character(),
  vote = col_character(),
  prtvtddk = col_character(),
  lvpntyr = col_character(),
  tygrtr = col_character(),
  gndr = col_character(),
  yrbrn = col_double(),
  edlvddk = col_character(),
  eduyrs = col_double(),
  wkhct = col_double(),
  wkhtot = col_double(),
  grspnum = col_double(),
  frlgrsp = col_double(),
  inwtm = col_character()
)


idno,netustm,ppltrst,vote,prtvtddk,lvpntyr,tygrtr,gndr,yrbrn,edlvddk,eduyrs,wkhct,wkhtot,grspnum,frlgrsp,inwtm
110,1800,8,Yes,Socialdemokratiet - The Social democrats,1968,Never too young,Male,1949,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",9,28,28,9999999,9999999,1190
705,600,5,Yes,Det Konservative Folkeparti - Conservative People's Party,1976,67,Male,1958,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",22,37,45,9999999,9999999,550
1327,2400,5,,,"Still in parental home, never left 2 months",,Male,2000,Folkeskole 9.-10. klasse,11,37,37,9999999,9999999,370
3760,3000,7,Not eligible to vote,,"Still in parental home, never left 2 months",40,Male,2002,Folkeskole 9.-10. klasse,9,2,2,200,9999999,430
4658,900,8,Yes,,1974,50,Female,1956,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",4,30,30,9999999,9999999,620
5816,900,7,Yes,SF Socialistisk Folkeparti - Socialist People's Party,1994,60,Male,1974,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",35,37,37,37000,35000,610


Alternatively, the function `read_delim()` can be used where the delimiter (the character separating the values) is specified:

In [23]:
ess2018 <- read_delim("https://raw.githubusercontent.com/CALDISS-AAU/workshop_r-extended-intro/master/data/ESS2018DK_subset_with-errors.csv", delim = ";") # Alternative - specify the delimiter
head(ess2018)

Parsed with column specification:
cols(
  idno = col_double(),
  netustm = col_character(),
  ppltrst = col_character(),
  vote = col_character(),
  prtvtddk = col_character(),
  lvpntyr = col_character(),
  tygrtr = col_character(),
  gndr = col_character(),
  yrbrn = col_double(),
  edlvddk = col_character(),
  eduyrs = col_double(),
  wkhct = col_double(),
  wkhtot = col_double(),
  grspnum = col_double(),
  frlgrsp = col_double(),
  inwtm = col_character()
)


idno,netustm,ppltrst,vote,prtvtddk,lvpntyr,tygrtr,gndr,yrbrn,edlvddk,eduyrs,wkhct,wkhtot,grspnum,frlgrsp,inwtm
110,1800,8,Yes,Socialdemokratiet - The Social democrats,1968,Never too young,Male,1949,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",9,28,28,9999999,9999999,1190
705,600,5,Yes,Det Konservative Folkeparti - Conservative People's Party,1976,67,Male,1958,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",22,37,45,9999999,9999999,550
1327,2400,5,,,"Still in parental home, never left 2 months",,Male,2000,Folkeskole 9.-10. klasse,11,37,37,9999999,9999999,370
3760,3000,7,Not eligible to vote,,"Still in parental home, never left 2 months",40,Male,2002,Folkeskole 9.-10. klasse,9,2,2,200,9999999,430
4658,900,8,Yes,,1974,50,Female,1956,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",4,30,30,9999999,9999999,620
5816,900,7,Yes,SF Socialistisk Folkeparti - Socialist People's Party,1994,60,Male,1974,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",35,37,37,37000,35000,610


## Common error 2: Missing not coded as missing

As there is no global standard for denoting missing values, the values used for denoting missing values will often vary from dataset to dataset. For surveys, the codebook usually contains information about what values are used to denote missing values (often very high numbers are used).

If this is overlooked, one can end up with erroneous results. In the code below, the mean for the variable `grspnum` (usual weekly/monthly/annual gross pay) is calculated:

In [14]:
mean(ess2018$grspnum)

The result is very high. This is because the variable uses a very high value to denote missing values (this can, for example, be seen using `summary()`):

In [15]:
summary(ess2018$grspnum)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0   32000  500000 4833784 9999999 9999999 
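To illustrate how much such a code can distort a mean, here is a small sketch on a made-up vector (the values and the name `pay` are hypothetical and not taken from the dataset). It compares the mean with and without the value 9999999 and counts how often that value occurs.

```r
# Hypothetical pay values where 9999999 stands for "missing"
pay <- c(200, 32000, 37000, 9999999, 9999999)

# Mean with the missing-value code treated as a real observation
mean(pay)

# Number of observations that carry the missing-value code
sum(pay == 9999999)

# Mean of the remaining, valid observations
mean(pay[pay != 9999999])
```

Counting the occurrences of a suspiciously high maximum is a quick way to check whether it is a code rather than a real observation, although the codebook remains the authoritative source.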

The function `na_if()` from `dplyr` can be used to recode a specific value to missing:

In [19]:
library(dplyr)

ess2018 <- ess2018 %>%
    mutate(grspnum = na_if(grspnum, 9999999))

mean(ess2018$grspnum, na.rm = TRUE)
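A possible alternative, sketched below on made-up data, is to declare the missing-value code at import through the `na` argument of the `readr` import functions. The toy text and the name `toy` are assumptions for illustration. Note that `na` applies to every column, so this only suits datasets where the code cannot be a valid value in any variable.

```r
library(readr)

# Hypothetical semicolon-separated data using 9999999 as the missing-value code
toy_text <- "idno;grspnum;frlgrsp\n1;37000;35000\n2;9999999;9999999\n3;200;9999999\n"

# Treat empty strings, "NA" and "9999999" as missing while importing
toy <- read_delim(toy_text, delim = ";", na = c("", "NA", "9999999"))

# Number of missing values per column after import
colSums(is.na(toy))

# The mean now only uses the valid observations
mean(toy$grspnum, na.rm = TRUE)
```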

## Common error 3: Data uses a different decimal point

R assumes that periods (".") are used as the decimal point in the data. This is however not standard in all countries (some countries use commas). 

Because R usually does not recognize commas as decimal points, values written with a decimal comma cannot be read as numbers, and R instead stores the whole vector as the character class (strings).

The variable `inwtm` contains the length of the interview in minutes which should be numeric. It is however currently not possible to perform arithmetic operations on it:

In [4]:
mean(ess2018$inwtm, na.rm = TRUE)

"argument is not numeric or logical: returning NA"

The code above returns `NA` with a warning because the variable/vector is of the wrong class:

In [6]:
class(ess2018$inwtm)
head(ess2018$inwtm)
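The same behaviour can be reproduced on a small made-up example (the vectors `minutes_ok` and `minutes_comma` are hypothetical and not part of the dataset). The sketch compares a numeric vector with a character vector holding the same values written with decimal commas.

```r
# Hypothetical interview lengths in minutes
minutes_ok    <- c(119.5, 55, 37.25)        # periods as decimal points
minutes_comma <- c("119,5", "55", "37,25")  # as read from a file with decimal commas

# The first vector is numeric, the second is character
class(minutes_ok)
class(minutes_comma)

# Arithmetic works on the numeric vector only
mean(minutes_ok)
mean(minutes_comma)  # returns NA with a warning
```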

If we simply try to coerce the class, R converts every value it cannot parse as a number (here, the values containing a comma) to a missing value:

In [9]:
head(as.numeric(ess2018$inwtm))

"NAs introduced by coercion"

This error can be fixed either by correcting the data before importing or by replacing the commas with periods in R and then coercing the vector to numeric.

In the code below, the function `str_replace()` from the package `stringr` is used to replace the commas with periods (`gsub()` also works):

In [13]:
library(stringr)
# Using stringr for the recoding

ess2018$inwtm <- str_replace(ess2018$inwtm, ",", ".") # Replace commas with periods
ess2018$inwtm <- as.numeric(ess2018$inwtm) # Coerce to numeric class

mean(ess2018$inwtm, na.rm = TRUE)
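For comparison, here is a minimal sketch of two other ways to do the same conversion, shown on a made-up vector (`minutes_raw` is hypothetical): the base R function `gsub()` mentioned above, and `parse_number()` from `readr`, where the decimal mark is declared through `locale()`.

```r
library(readr)

# Hypothetical values written with decimal commas
minutes_raw <- c("119,5", "55", "37,25")

# Base R: replace the comma literally, then coerce to numeric
minutes_base <- as.numeric(gsub(",", ".", minutes_raw, fixed = TRUE))

# readr: state that the comma is the decimal mark while parsing
minutes_readr <- parse_number(minutes_raw, locale = locale(decimal_mark = ","))

# Both approaches give the same numeric values
all.equal(minutes_base, minutes_readr)
mean(minutes_readr)
```

Declaring the decimal mark keeps the original strings untouched, which can be preferable when the data also contains commas that are not decimal points.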

## EXERCISE: COMMON ERRORS

- Load the data "ESS2018DK_subset_with-errors.csv" if you have not already. Make sure it is imported using the correct delimiter.
    - Link: https://raw.githubusercontent.com/CALDISS-AAU/workshop_r-extended-intro/master/data/ESS2018DK_subset_with-errors.csv


- The variable `frlgrsp` contains the level of weekly/monthly/annual gross pay that the respondent feels is fair for them. Try calculating the mean of the variable. The variable may contain high values to denote missing values, so be sure to recode these first.

- The variable `netustm` contains how much time the respondent spends on the internet on a typical day (in minutes). Try calculating the mean time respondents spend on the internet on a typical day. If you encounter errors, try correcting them and calculating the mean again.