# New Zealand Election Survey - Data Preparation

<img src="images/NZsurvey.png"/>

This data set is a version of the New Zealand Election Survey data that has been deliberately modified to introduce problems that occur naturally in many data sets.

The New Zealand Election Survey takes place every three years as a postal survey of a sample of registered electors. Some sampled electors belong to a panel of people surveyed at the previous election as part of a longitudinal study; others were randomly chosen from the electoral roll. The electors in the longitudinal panel were themselves randomly selected in previous elections.

As well as survey results, the data set includes information from the electoral roll, and weighting values for adjusting results. The full NZES data set has been reduced to a selected group of variables, making 3101 observations of 107 variables.

### Libraries

In [1]:
#install.packages("dplyr")
library(dplyr) # for data wrangling

#install.packages("ggplot2")
library(ggplot2) # for data visualization


Attaching package: 'dplyr'


The following objects are masked from 'package:stats':

    filter, lag


The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union




### Load Data

In [2]:
load("data/selected_nzes2011.Rdata")
head(selected_nzes2011)

Unnamed: 0_level_0,Jelect,jblogel,jnewspaper,jnatradio,jtalkback,jdiscussp,jrallies,jpersuade,jpcmoney,jpcposter,⋯,jethnicity_o,jethnicityx,jethnicmost,jethnicmostx,jpartyvote,jelecvote,njptyvote,njelecvote,jdiffvoting,X_singlefav
Unnamed: 0_level_1,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,36,,,,,"Yes, frequently",No,No,No,No,⋯,,,NZ European,,National,National,National,National,Voting can make a reasonable amount of difference to what happens,National
2,34,,Sometimes,,,"Yes, rarely",No,No,No,No,⋯,,,NZ European,,National,National,National,National,Voting can make some difference to what happens,National
3,27,Visited a political blog for election information,Sometimes,Sometimes,Sometimes,"Yes, frequently",No,"Yes, occasionally",No,No,⋯,,,,,National,Labour,National,Labour,Voting can make a big difference to what happens,National
4,34,,Often,,,"Yes, occasionally",No,No,No,No,⋯,,,,,NZ First,Labour,NZ First,Labour,Voting can make some difference to what happens,Labour
5,15,,Sometimes,Not at all,Often,"Yes, occasionally",No,No,No,No,⋯,Other,SOUTH AFRICAN,NZ European,,National,National,National,National,Voting can make a big difference to what happens,National
6,58,,Sometimes,,,"Yes, occasionally",,,,,⋯,,,,,NZ First,,NZ First,,Voting can make some difference to what happens,NZ First


### Some Exploration

In [3]:
str(selected_nzes2011)

'data.frame':	3101 obs. of  107 variables:
 $ Jelect        : int  36 34 27 34 15 58 30 32 40 10 ...
 $ jblogel       : chr  NA NA "Visited a political blog for election information" NA ...
 $ jnewspaper    : chr  NA "Sometimes" "Sometimes" "Often" ...
 $ jnatradio     : chr  NA NA "Sometimes" NA ...
 $ jtalkback     : chr  NA NA "Sometimes" NA ...
 $ jdiscussp     : chr  "Yes, frequently" "Yes, rarely" "Yes, frequently" "Yes, occasionally" ...
 $ jrallies      : chr  "No" "No" "No" "No" ...
 $ jpersuade     : chr  "No" "No" "Yes, occasionally" "No" ...
 $ jpcmoney      : chr  "No" "No" "No" "No" ...
 $ jpcposter     : chr  "No" "No" "No" "No" ...
 $ jlablike      : Factor w/ 12 levels "0","1","10","2",..: 4 8 6 10 6 8 NA NA 10 7 ...
 $ jnatlike      : Factor w/ 12 levels "0","1","10","2",..: 10 9 9 NA 11 8 3 NA 4 3 ...
 $ jgrnlike      : Factor w/ 12 levels "0","1","10","2",..: 1 8 9 NA 7 1 NA NA 3 12 ...
 $ jnzflike      : Factor w/ 12 levels "0","1","10","2",..: 1 1 4 10 4 11 NA NA 

In [4]:
summary(selected_nzes2011)

     Jelect        jblogel           jnewspaper         jnatradio        
 Min.   : 1.00   Length:3101        Length:3101        Length:3101       
 1st Qu.:21.00   Class :character   Class :character   Class :character  
 Median :44.00   Mode  :character   Mode  :character   Mode  :character  
 Mean   :41.41                                                           
 3rd Qu.:63.00                                                           
 Max.   :70.00                                                           
 NA's   :2                                                               
  jtalkback          jdiscussp           jrallies          jpersuade        
 Length:3101        Length:3101        Length:3101        Length:3101       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                       

As a first question, we might be interested in exploring the relationship between the party a person voted for, the party that was their favourite, and whether they believed that their vote makes a difference. In particular: are people who believe their vote makes a difference more likely to vote strategically for a party that is not their favourite? To answer this, we familiarise ourselves with the variables `jpartyvote`, `jdiffvoting`, and `_singlefav`.

    `jpartyvote`: if cast party vote, for which party 
    
    `jdiffvoting`: does voting make any difference to what happens
    
    `_singlefav`: calculated variable of most liked of major parties

In [5]:
names(selected_nzes2011)

When we have hundreds of column names, a useful tip is to search for just the likely candidates. We can search the names for a fragment of the name by using the `grep("FRAGMENT", variable, value = TRUE)` command, which in this case might be:

In [6]:
grep("singlefav", names(selected_nzes2011), value = TRUE)

The `value = TRUE` argument, as described in the help for the `grep()` function, reports the matching character string, as opposed to the index number of that string.
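
To make the difference concrete, here is a minimal sketch on a short, made-up vector of column names (`col_names` is a hypothetical stand-in for `names(selected_nzes2011)`):

```r
# Hypothetical stand-in for names(selected_nzes2011)
col_names <- c("jpartyvote", "jdiffvoting", "X_singlefav")

# Default behaviour: the position of each match
match_index <- grep("singlefav", col_names)

# With value = TRUE: the matching strings themselves
match_value <- grep("singlefav", col_names, value = TRUE)

match_index
match_value
```

The index form is handy for subsetting columns by position, while the value form is easier to read when we only want to learn the exact name.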

We can now confirm that the variable is called `X_singlefav`, so that is how we should be referring to it.
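
Where might the leading `X` come from? The source does not say how the file was prepared, so this is only an assumption: a syntactic R name cannot begin with an underscore, and base R's `make.names()` (which `data.frame()` and `read.csv()` apply when checking names) repairs such names by prepending an `X`. A small sketch:

```r
# make.names() turns arbitrary strings into syntactically valid R names
repaired <- make.names("_singlefav")
repaired

# Names that are already valid are left unchanged
unchanged <- make.names("jpartyvote")
unchanged
```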

In [7]:
selected_nzes2011 %>% 
  select(jpartyvote, jdiffvoting, X_singlefav) %>% 
  str()

'data.frame':	3101 obs. of  3 variables:
 $ jpartyvote : chr  "National" "National" "National" "NZ First" ...
 $ jdiffvoting: chr  "Voting can make a reasonable amount of difference to what happens" "Voting can make some difference to what happens" "Voting can make a big difference to what happens" "Voting can make some difference to what happens" ...
 $ X_singlefav: chr  "National" "National" "National" "Labour" ...


These are all categorical data; however, they are recorded as characters (text strings) rather than factors.

An easy way to tabulate these data and see how many times each level occurs is to use the `group_by()` function along with the `summarise()` function:

In [8]:
selected_nzes2011 %>% 
  group_by(jpartyvote) %>% 
  summarise(count = n())

`summarise()` ungrouping output (override with `.groups` argument)



jpartyvote,count
<chr>,<int>
Act,29
ALC,10
Alliance,2
Another party,8
Conservative,74
Don't know,23
Green,348
Labour,749
Mana,62
Maori Party,128


We can see that 23 people answered `"Don't know"`. Since our question is about people who knew which party they voted for, we might want to exclude these observations from our analysis. We can do so by `filter`ing them out.

In [9]:
selected_nzes2011 %>% 
  filter(jpartyvote != "Don't know") %>%
  group_by(jpartyvote) %>% 
  summarise(count = n())

`summarise()` ungrouping output (override with `.groups` argument)



jpartyvote,count
<chr>,<int>
Act,29
ALC,10
Alliance,2
Another party,8
Conservative,74
Green,348
Labour,749
Mana,62
Maori Party,128
National,1130
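
One side effect of this kind of filter is worth knowing: comparing `NA` with `!=` gives `NA`, and `filter()` keeps only rows where the condition is `TRUE`, so rows with a missing value are dropped as well. The sketch below uses a small made-up data frame (`votes` and its values are hypothetical, not taken from the survey):

```r
library(dplyr)

# Toy data with one "Don't know" and one missing value
votes <- data.frame(party = c("Labour", "Don't know", NA, "Green"))

# The comparison itself is NA for the missing entry
votes$party != "Don't know"

# filter() drops both the "Don't know" row and the NA row
kept <- votes %>%
  filter(party != "Don't know")

# To keep the NA rows, ask for them explicitly
kept_with_na <- votes %>%
  filter(party != "Don't know" | is.na(party))

nrow(kept)
nrow(kept_with_na)
```

For the question at hand this is harmless, since respondents with no recorded party vote cannot be compared with their favourite party anyway.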


We can similarly view the levels, and the number of occurrences of each level, in the `X_singlefav` variable:

In [10]:
selected_nzes2011 %>% 
  group_by(X_singlefav) %>% 
  summarise(count = n())

`summarise()` ungrouping output (override with `.groups` argument)



X_singlefav,count
<chr>,<int>
Act,33
Green,388
Labour,1043
Mana,47
National,1266
NZ First,138
United Future,128
,58


This variable also has `NA` entries, but in this case the `NA`s are the only thing we want to remove, so we need to target them directly. `NA` entries need special handling because `NA` marks a missing value rather than being a value in its own right (it is different from the text `"NA"`), so it cannot be matched with `==` or `!=`.

If we only wanted to find the `NA`s we would use the `is.na()` function with the name of the variable inside the parentheses. 

However, since we want the entries that are **not** `NA`s, we can use the __Not__ operator, `!`, to indicate "we want all the ones that are not NA": `!is.na()`. Hence we can `filter` for all the non-`NA` entries in our `dplyr` chain:

In [11]:
selected_nzes2011 %>% 
  filter(!is.na(X_singlefav)) %>%  #Not NA
  group_by(X_singlefav) %>% 
  summarise(count = n())

`summarise()` ungrouping output (override with `.groups` argument)



X_singlefav,count
<chr>,<int>
Act,33
Green,388
Labour,1043
Mana,47
National,1266
NZ First,138
United Future,128
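
The behaviour of `NA` in comparisons can be checked on a tiny made-up vector (the values in `x` are hypothetical):

```r
# A character vector with a true missing value and the text "NA"
x <- c("Labour", NA, "NA", "Green")

# Comparing with == never finds the missing value: every result is NA
eq_result <- x == NA

# is.na() flags only the true missing value, not the text "NA"
missing_flag <- is.na(x)

# Negating it selects the entries we want to keep
x[!missing_flag]
```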


And remember that we can `filter` for multiple characteristics at once:

In [12]:
selected_nzes2011 %>% 
  filter(!is.na(X_singlefav), jpartyvote != "Don't know") %>%
  group_by(X_singlefav) %>% 
  summarise(count=n())

`summarise()` ungrouping output (override with `.groups` argument)



X_singlefav,count
<chr>,<int>
Act,29
Green,354
Labour,914
Mana,42
National,1172
NZ First,119
United Future,115
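
Conditions separated by commas inside `filter()` are combined with a logical "and", so a row must satisfy all of them to be kept. A minimal sketch on made-up data (`toy` and its columns are hypothetical):

```r
library(dplyr)

# Toy data: favourite party and party voted for
toy <- data.frame(
  fav  = c("Labour", NA, "Green", "Labour"),
  vote = c("Labour", "Green", "Don't know", "Green")
)

# Comma-separated conditions ...
with_comma <- toy %>%
  filter(!is.na(fav), vote != "Don't know")

# ... keep the same rows as joining the conditions with &
with_and <- toy %>%
  filter(!is.na(fav) & vote != "Don't know")

nrow(with_comma)
nrow(with_and)
```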


If we examine the categories in `jdiffvoting`, we can see that this variable includes both `"Don't know"` and `NA` entries.

In [13]:
selected_nzes2011 %>% 
  group_by(jdiffvoting) %>% 
  summarise(count = n())

`summarise()` ungrouping output (override with `.groups` argument)



jdiffvoting,count
<chr>,<int>
Don't know,63
Voting can make a big difference to what happens,1605
Voting can make a reasonable amount of difference to what happens,841
Voting can make some difference to what happens,339
Voting won't make any difference to what happens,119
Voting won't make much difference to what happens,106
,28


We need to decide how we want to handle these levels in our analysis.

Remember that our main question is about whether people vote for their favourite party or a different one. Hence a straightforward approach would be to first determine whether each observation in the data represents a person who voted for their favourite party or for a different one. This requires creating a new variable with the `mutate()` function.

In creating this variable, we want to evaluate whether, for a given observation, the values in the `jpartyvote` and `X_singlefav` variables are the same or different:

In [14]:
selected_nzes2011 <- selected_nzes2011 %>%
  mutate(sameparty = ifelse(jpartyvote == X_singlefav, "same", "different"))

head(selected_nzes2011)

Unnamed: 0_level_0,Jelect,jblogel,jnewspaper,jnatradio,jtalkback,jdiscussp,jrallies,jpersuade,jpcmoney,jpcposter,⋯,jethnicityx,jethnicmost,jethnicmostx,jpartyvote,jelecvote,njptyvote,njelecvote,jdiffvoting,X_singlefav,sameparty
Unnamed: 0_level_1,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,36,,,,,"Yes, frequently",No,No,No,No,⋯,,NZ European,,National,National,National,National,Voting can make a reasonable amount of difference to what happens,National,same
2,34,,Sometimes,,,"Yes, rarely",No,No,No,No,⋯,,NZ European,,National,National,National,National,Voting can make some difference to what happens,National,same
3,27,Visited a political blog for election information,Sometimes,Sometimes,Sometimes,"Yes, frequently",No,"Yes, occasionally",No,No,⋯,,,,National,Labour,National,Labour,Voting can make a big difference to what happens,National,same
4,34,,Often,,,"Yes, occasionally",No,No,No,No,⋯,,,,NZ First,Labour,NZ First,Labour,Voting can make some difference to what happens,Labour,different
5,15,,Sometimes,Not at all,Often,"Yes, occasionally",No,No,No,No,⋯,SOUTH AFRICAN,NZ European,,National,National,National,National,Voting can make a big difference to what happens,National,same
6,58,,Sometimes,,,"Yes, occasionally",,,,,⋯,,,,NZ First,,NZ First,,Voting can make some difference to what happens,NZ First,same


This creates a new variable named `sameparty` that has the value `"same"` if `jpartyvote` is equal to `X_singlefav`, and `"different"` otherwise.

We can again check our work by exploring the groupings in a summary table:

In [15]:
selected_nzes2011 %>% 
  group_by(jpartyvote, X_singlefav, sameparty) %>%
  summarise(count = n())

`summarise()` regrouping output by 'jpartyvote', 'X_singlefav' (override with `.groups` argument)



jpartyvote,X_singlefav,sameparty,count
<chr>,<chr>,<chr>,<int>
Act,Act,same,12
Act,Green,different,1
Act,National,different,14
Act,United Future,different,1
Act,,,1
ALC,Green,different,1
ALC,Labour,different,4
ALC,National,different,2
ALC,United Future,different,3
Alliance,Labour,different,1


We can see that for observations where `jpartyvote` equaled `X_singlefav`, the value `"same"` was recorded for the new variable `sameparty`, and the value `"different"` was recorded otherwise. If either `jpartyvote` or `X_singlefav` was `NA`, R could not check for equality, and hence `NA` was recorded for the `sameparty` variable as well.
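
The way `ifelse()` passes `NA` through can be reproduced on two short made-up vectors (`voted` and `favourite` are hypothetical stand-ins for the two survey columns):

```r
# Toy stand-ins for jpartyvote and X_singlefav
voted     <- c("National", "NZ First", NA,      "Labour")
favourite <- c("National", "Labour",   "Green", NA)

# The comparison is NA wherever either side is missing ...
voted == favourite

# ... and ifelse() returns NA for those positions
sameparty <- ifelse(voted == favourite, "same", "different")
sameparty
```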

To view and summarize the "same" entries we can use the following:

In [16]:
selected_nzes2011 %>% 
  group_by(jpartyvote, X_singlefav, sameparty) %>%
  summarise(count = n()) %>% 
  filter(sameparty == "same")

`summarise()` regrouping output by 'jpartyvote', 'X_singlefav' (override with `.groups` argument)



jpartyvote,X_singlefav,sameparty,count
<chr>,<chr>,<chr>,<int>
Act,Act,same,12
Green,Green,same,237
Labour,Labour,same,632
Mana,Mana,same,31
National,National,same,1004
NZ First,NZ First,same,82
United Future,United Future,same,5


And to view and summarize the "different" entries we can use the following:

In [17]:
selected_nzes2011 %>% 
  group_by(jpartyvote, X_singlefav, sameparty) %>%
  summarise(count = n()) %>% 
  filter(sameparty == "different")

`summarise()` regrouping output by 'jpartyvote', 'X_singlefav' (override with `.groups` argument)



jpartyvote,X_singlefav,sameparty,count
<chr>,<chr>,<chr>,<int>
Act,Green,different,1
Act,National,different,14
Act,United Future,different,1
ALC,Green,different,1
ALC,Labour,different,4
ALC,National,different,2
ALC,United Future,different,3
Alliance,Labour,different,1
Alliance,National,different,1
Another party,Green,different,2


We can also check how we got any `NA`s we have by using the `is.na()` function:

In [18]:
selected_nzes2011 %>% 
  group_by(jpartyvote, X_singlefav, sameparty) %>%
  summarise(count = n()) %>% 
  filter(is.na(sameparty))

`summarise()` regrouping output by 'jpartyvote', 'X_singlefav' (override with `.groups` argument)



jpartyvote,X_singlefav,sameparty,count
<chr>,<chr>,<chr>,<int>
Act,,,1
Conservative,,,1
Don't know,,,7
Green,,,1
Labour,,,11
Maori Party,,,2
National,,,7
NZ First,,,2
,Act,,4
,Green,,32


The checks show that the observations with `NA`s in `sameparty` will be excluded from the analysis when we filter out the `NA`s in the `jpartyvote` and `X_singlefav` variables, so we don't need to worry about them anymore.

As a second question, we might be interested in exploring the relationship between the age of voters and how much they like the NZ First party. For this we use the variables `jnzflike` (how much the respondent likes NZ First) and `jage` (the respondent's age in years).

In [19]:
str(selected_nzes2011$jnzflike)

 Factor w/ 12 levels "0","1","10","2",..: 1 1 4 10 4 11 NA NA 1 12 ...


In [20]:
str(selected_nzes2011$jage)

 int [1:3101] 37 37 28 71 43 NA 59 68 64 70 ...


`jnzflike` is a factor variable. It is in fact ordinal, but by default the levels are listed in alphabetical order, which is why `"10"` comes before `"2"`. Since this is a categorical variable, we can again summarize the occurrences of each level with `group_by()` and `summarise()`:

In [21]:
selected_nzes2011 %>% 
  group_by(jnzflike) %>% 
  summarise(count = n())

`summarise()` ungrouping output (override with `.groups` argument)



jnzflike,count
<fct>,<int>
0,622
1,298
10,134
2,266
3,227
4,162
5,544
6,165
7,138
8,107


While `jnzflike` is on a 0 to 10 scale, it also has a level labeled `"Don't know"`, which is why R does not store it as a numeric variable.
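
If the alphabetical ordering gets in the way (for example in tables or plots), the levels can be given explicitly. The sketch below uses a small made-up factor, and assumes the levels are the labels `"0"` to `"10"` plus `"Don't know"`; `likes` and `ordered_likes` are hypothetical names:

```r
# Toy factor with the same kind of labels as jnzflike
likes <- factor(c("0", "10", "2", "Don't know", "5"))

# Default level order is alphabetical, so "10" sorts before "2"
levels(likes)

# Supplying the levels explicitly gives the natural order
ordered_likes <- factor(
  likes,
  levels  = c(as.character(0:10), "Don't know"),
  ordered = TRUE
)
levels(ordered_likes)
```

Only the order of the levels changes; the recorded responses stay the same.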

`jage`, on the other hand, is an integer, with values that are whole numbers between 0 and infinity (or `NA`). For this variable we would want to take a look at numerical summaries such as means, medians, etc.

In [22]:
selected_nzes2011 %>% 
  summarise(agemean = mean(jage), agemedian = median(jage), agesd = sd(jage), 
            agemin = min(jage), agemax = max(jage))

agemean,agemedian,agesd,agemin,agemax
<dbl>,<int>,<dbl>,<int>,<int>
,,,,


What went wrong? The reason why all of the results were reported as NAs is that there were some NA entries in the `jage` variable (people not reporting their age). Since it is not possible to take the average of a series of values that contain `NA`s, obtaining the numerical summaries requires that we exclude the `NA`s from the calculation.

Most numerical summary functions allow us to easily exclude `NA`s with the `na.rm` argument.
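
As a minimal sketch on a made-up vector of ages (`ages` is hypothetical, not survey data):

```r
# Toy ages with one missing value
ages <- c(37, 28, NA, 71)

# Without na.rm the result is NA
mean_with_na <- mean(ages)

# With na.rm = TRUE the missing value is dropped before computing
mean_no_na   <- mean(ages, na.rm = TRUE)
median_no_na <- median(ages, na.rm = TRUE)

mean_with_na
mean_no_na
median_no_na
```

Inside `summarise()` the same argument applies, for example `mean(jage, na.rm = TRUE)`.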

An alternative approach is just to `filter` out the `NA`s first, and then ask for the numerical summaries:

In [23]:
selected_nzes2011 %>% 
  filter(!(is.na(jage))) %>%
  summarise(agemean = mean(jage), agemedian = median(jage), agesd = sd(jage), 
            agemin = min(jage), agemax = max(jage))

agemean,agemedian,agesd,agemin,agemax
<dbl>,<dbl>,<dbl>,<int>,<int>
53.22328,54,17.5371,18,100


Having gained some familiarity with the specific variables we are using, we next need to consider if there is additional work we should do on the data in investigating the question. There are a number of different approaches we might take. For example, we could consider if those that strongly like NZ First are older than those that strongly dislike NZ First, or we could consider if old people like NZ First more than young people.

If we wanted to select only two of the possible levels in how much people like NZ First, we can filter for these specific levels. When interested in filtering for multiple values a variable can take, the `%in%` operator can come in handy:

In [24]:
selected_nzes2011 %>% 
  filter(jnzflike %in% c("0","10")) %>%
  group_by(jnzflike) %>% 
  summarise(count = n())

`summarise()` ungrouping output (override with `.groups` argument)



jnzflike,count
<fct>,<int>
0,622
10,134


Remember that `jnzflike` is not a numerical variable, hence we use quotation marks around the values (even though they happen to be numbers).

This is an example of simplifying the analysis by considering only two levels of a categorical variable, as opposed to all possible levels.
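
A short sketch of how `%in%` compares with chaining `==` tests, using a made-up character vector (`likes` is hypothetical):

```r
# Toy responses, including a missing value
likes <- c("0", "10", "5", NA, "10")

# %in% returns TRUE or FALSE for every element, never NA
in_result <- likes %in% c("0", "10")

# Chained == comparisons give NA for the missing value
or_result <- likes == "0" | likes == "10"

in_result
or_result
```

Within `filter()` both forms keep the same rows, since rows where the condition is `NA` are dropped, but `%in%` is shorter and scales better as the list of values grows.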

We might also like to refine our question slightly, asking whether people at or above retirement age (65 in New Zealand) like NZ First more than younger people. To do this we can turn the numeric age variable into a categorical variable based on whether people are 65 years or older, or younger than 65. Once again we make use of the `mutate()` and `ifelse()` functions:

In [25]:
selected_nzes2011 <- selected_nzes2011 %>% 
  mutate(retiredage = ifelse(jage >= 65, "retired age", "working age"))
selected_nzes2011 %>% 
  group_by(retiredage) %>% 
  summarise(count = n())

`summarise()` ungrouping output (override with `.groups` argument)



retiredage,count
<chr>,<int>
retired age,876
working age,2156
,69


We can see that individuals in the dataset are now labeled as either `"retired age"` or `"working age"` or neither (`NA`), which we can easily filter out if need be.

This is an example of using a numerical threshold to convert a numerical variable to a categorical variable.
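
If more than two groups were wanted, base R's `cut()` is one alternative to nesting `ifelse()` calls. The sketch below is illustrative only: the extra break at 30 and the labels are assumptions, not part of the original analysis.

```r
# Toy ages, including a missing value
ages <- c(18, 64, 65, 80, NA)

# right = FALSE makes each interval closed on the left: [30, 65), [65, Inf)
age_band <- cut(
  ages,
  breaks = c(-Inf, 30, 65, Inf),
  right  = FALSE,
  labels = c("under 30", "30 to 64", "65 and over")
)

age_band
```

As with `ifelse()`, a missing age stays `NA` in the result.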

Going in the other direction, to convert the categorical `jnzflike` variable into a numerical one, we need a conversion method that uses the text strings that label the levels, as opposed to the storage order of these levels. We can do this by first converting the variable to a character variable, and then turning it into a number:

In [26]:
selected_nzes2011 <- selected_nzes2011 %>% 
  mutate(numlikenzf = as.numeric(as.character(jnzflike)))

"NAs introduced by coercion"


The warning "NAs introduced by coercion" happens since the level `"Don't know"` cannot be turned into a number. But this should be fine for our purposes since we are interested in the numerical responses anyway.

In [27]:
selected_nzes2011 %>% 
  group_by(jnzflike, numlikenzf) %>% 
  summarise(count = n())

`summarise()` regrouping output by 'jnzflike' (override with `.groups` argument)



jnzflike,numlikenzf,count
<fct>,<dbl>,<int>
0,0.0,622
1,1.0,298
10,10.0,134
2,2.0,266
3,3.0,227
4,4.0,162
5,5.0,544
6,6.0,165
7,7.0,138
8,8.0,107


Converting the factor to a character first ensures that the numerical values used in the labels of the levels of the categorical variable are used.
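
The difference between the two conversions can be seen on a small made-up factor (`f` is hypothetical):

```r
# Toy factor; levels are stored alphabetically: "0", "10", "2", "Don't know"
f <- factor(c("0", "10", "2", "Don't know"))

# Converting directly returns the internal level codes, not the labels
codes <- as.numeric(f)

# Going through character first recovers the numbers in the labels;
# "Don't know" becomes NA (the coercion warning is suppressed here)
values <- suppressWarnings(as.numeric(as.character(f)))

codes
values
```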

Now that we have cleaned up the data in a way that addresses the needs of the research questions we want to explore, we are ready to continue with our analysis.