# Brexit - the data analysis

We start, as usual, by importing all the libraries we need.

In [None]:
# Import Numpy library, rename as "np"
import numpy as np
# Import Pandas library, rename as "pd"
import pandas as pd

# Set up plotting
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')

Then load the test library to run the tests for your answers.

In [None]:
# Load the OKpy test library and tests.
from client.api.notebook import Notebook
ok = Notebook('brexiteering.ok')

## All about the Brexiteers

Every year, the [Hansard
Society](https://www.hansardsociety.org.uk/research/audit-of-political-engagement)
sponsors a survey on political engagement in the UK.

They put topical questions in each survey.  For the 2016 / 7 survey, they
asked about how people voted in the Brexit referendum.

Luckily, they make the data freely available online for us to analyze.

You can get the data for yourself from the [UK Data
Service](https://discover.ukdataservice.ac.uk/catalogue/?sn=8183), if you want.
There are data files in various formats, including:

* SPSS format (for the SPSS statistical package);
* Stata format (for the Stata statistical package);
* tab-delimited (a general data format, that can be used with Pandas, Excel,
  and other packages).

The data is in a standard form, with one row per respondent, and one column
per question.
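As a small illustration of this layout, here is a sketch that builds a made-up tab-delimited table in memory and reads it with Pandas.  The values, the column names and the `toy_text` and `toy_data` variables are all invented for the example; they are not from the survey.

```python
import io

import pandas as pd

# Made-up tab-delimited text: a header row, then one row per respondent.
toy_text = "respondent\tage\tvote\n1\t34\t1\n2\t61\t2\n3\t19\t1\n"

# read_table splits the columns on tab characters by default.
toy_data = pd.read_table(io.StringIO(toy_text))
print(toy_data)
# The shape is (number of rows, number of columns).
print(toy_data.shape)
```

Each line of the text after the header becomes one row of the data frame, and each tab-separated field becomes one column.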

To save you a tiny bit of work, I have made an unchanged copy of the
tab-delimited version of the data file for you to download directly. I have
also made a copy of the document describing the questions they ask and the way
that they have recorded the answers in the data file.  This is often called
the “data dictionary”.  It was originally in Rich Text Format, but I have
converted it to PDF for convenience.  It is otherwise identical to the file you
will find at the UK Data Service.

You can download these copies from the following links:

* [tab-delimited data file](https://matthew-brett.github.io/cfd2019/data/audit_of_political_engagement_14_2017.tab);
* [data dictionary PDF file](https://matthew-brett.github.io/cfd2019/data/audit_of_political_engagement_14_2017_ukda_data_dictionary.pdf).

If you are running this notebook on your laptop, download the tab-delimited
data file to the same directory as the notebook.

In a moment, we are going to try to analyze these data.  We will focus on
two questions labeled `cut15` and `numage`.  `cut15` is the question about
Brexit. The data dictionary has the *variable label* “CUT15 - How did you vote
on the question ‘Should the United Kingdom remain a member of the European
Union or leave the European Union’?”.  The recorded values run from 1 through
6 and have the following labels:

```
Value label information for cut15
Value = 1.0    Label = Remain a member of the European Union
Value = 2.0    Label = Leave the European Union
Value = 3.0    Label = Did not vote
Value = 4.0    Label = Too young
Value = 5.0    Label = Can't remember
Value = 6.0    Label = Refused
```

We also want the variable `numage`; this is the age of the respondent in years.

The data file that you just downloaded should be called
`audit_of_political_engagement_14_2017.tab`.  The cell below loads the data
file into memory with Pandas:

In [None]:
# Load the data frame, and put it in the variable "audit_data"
audit_data = pd.read_table('audit_of_political_engagement_14_2017.tab')

As you know, we now have a *data frame*:

In [None]:
type(audit_data)

The data frame has one row per person surveyed, and one column for each
question in the survey.  The columns have kind-of helpful names that you can
read about in the data dictionary:

In [None]:
# Show the top five rows of the data frame.
audit_data.head()

The data frame has columns for all the questions listed in the data
dictionary:

In [None]:
audit_data.columns

To reduce clutter, we first make a new data frame that just has the two
questions we are interested in:


In [None]:
# Select the age and Brexit vote questions only
mini_brexit = pd.DataFrame()
mini_brexit['numage'] = audit_data['numage']
mini_brexit['cut15'] = audit_data['cut15']
mini_brexit.head()

To get started on exploring, we make a new variable `ages` that refers
to the `numage` column in the `mini_brexit` data frame.

In [None]:
# Make a new variable "ages" that refers to the "numage" column in
# "mini_brexit"
ages = mini_brexit["numage"]

Confirm that `ages` has a value of type `Series`, the Pandas type for a column of a data frame:

In [None]:
type(ages)

Here is the number of rows and columns in the original data frame:

In [None]:
audit_data.shape

Run the cell below to confirm that `ages` has the same number of values as
`audit_data` has rows.  To do this, we can use the `len` function on the
`ages` Series.  It returns the number of values.

In [None]:
len(ages)

In fact, `len`, as applied to the *data frame*, returns the number of rows:

In [None]:
len(audit_data)
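If you want to see the relationship between `shape` and `len` on something small, here is a sketch using a made-up data frame.  The names `toy_frame` and `toy_ages`, and all the values, are invented for the example.

```python
import pandas as pd

# A made-up data frame with four rows and two columns.
toy_frame = pd.DataFrame({'age': [34, 61, 19, 45],
                          'vote': [1, 2, 1, 3]})
toy_ages = toy_frame['age']

# shape gives (number of rows, number of columns).
n_rows, n_cols = toy_frame.shape
print(n_rows, n_cols)
# len of the data frame gives the number of rows.
print(len(toy_frame))
# len of a column (a Series) gives the number of values.
print(len(toy_ages))
```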

Start by doing a histogram of the values in `ages` (which are also the values
in the `numage` column of `mini_brexit`).  If you can't remember how to do
histograms, have a look at the [introduction to data
frames](https://matthew-brett.github.io/cfd2019/chapters/04/data_frame_intro)
notebook.   Hint: consider using the `hist`
method of the `ages` variable.

In [None]:
#- Do a histogram of the values in the "numage" column.
#- Your code here.

You will see that a few subjects have an age of 0.

It looks as if the survey coders are using the value 0 to mean that the person
did not state their age.  We will have to clean that up.  We do that by
selecting the cases that have ages not equal to 0.

Hint:  You have seen the operator to say whether two values are equal or not:

In [None]:
1 == 2

In [None]:
2 == 2

The operator for *not equal* is `!=`, as in:

In [None]:
1 != 2

Prepare for brain-bending double negative...

In [None]:
2 != 2

To identify the values in `ages` that are *not equal* to 0, use the comparison
I've hinted at above, to make a new variable, `age_not_0`, that has the same
number of values as `ages`, and has `True` at positions where `ages` is *not
equal* to 0, and `False` otherwise.   We will refer to these sequences of True
and False values as *Boolean vectors*.

Check back to the [introduction to data frames](../04/data_frame_intro)
notebook for a reminder of making and using Boolean vectors to select rows
from data frames.
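As a reminder of the general pattern, here is a sketch on a made-up data frame, using a different comparison (`>`) from the one you need.  The names `toy_frame`, `heights`, `is_tall` and `tall_rows`, and all the values, are invented for the example.

```python
import pandas as pd

# A made-up data frame, nothing to do with the survey.
toy_frame = pd.DataFrame({'height': [170, 160, 182, 165],
                          'shoe_size': [8, 7, 11, 6]})
heights = toy_frame['height']

# Comparing a Series to a value gives a Boolean vector, with one
# True or False value for each value in the Series.
is_tall = heights > 168
print(is_tall)

# Indexing the data frame with the Boolean vector keeps the rows
# where the vector has True, and drops the rows where it has False.
tall_rows = toy_frame[is_tall]
print(tall_rows)
```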

In [None]:
#- Create new variable "age_not_0", with True at positions where "ages" is not
#- equal to 0, and False otherwise.
age_not_0 = ...
age_not_0

In [None]:
_ = ok.grade('q_age_not_0')

Use `age_not_0` to select rows in the `mini_brexit` data frame where the value
is `True`, and throw away the rows where the value is `False`.

In [None]:
#- Select rows in the data frame where the age is not equal to 0.
good_brexit = ...
good_brexit

In [None]:
_ = ok.grade('q_good_brexit')

Now we want to ask what proportion of the respondents said that they voted
Remain or Leave.

We begin by making a new data frame that contains only the rows for people who
said they voted Remain in the referendum.  Remember, from the data
dictionary, that 1 is the code for a Remain vote.

First, make a new variable `votes` that has the values of the `cut15` column of
the `good_brexit` data frame.

In [None]:
votes = ...
votes

In [None]:
_ = ok.grade('q_votes')

Now make a new Boolean vector, that has True at the positions where `votes` is
equal to 1, and False otherwise.  Call this variable `is_remain`.

In [None]:
is_remain = ...

In [None]:
_ = ok.grade('q_is_remain')

Next, use `is_remain` to select the rows in `good_brexit` that correspond to
confessed "Remain" voters.  Call the new data frame `remainers`:

In [None]:
remainers = ...
remainers

In [None]:
_ = ok.grade('q_remainers')

Do a histogram of the values in the `numage` column of `remainers`:

In [None]:
#- Show a histogram of the `numage` column from `remainers`

Next, go through the same steps, to make a new data frame for those who
claimed to vote Leave (code 2):

In [None]:
#- Make a Boolean vector, called "is_leave", that is True for Leave rows,
#- False otherwise.
is_leave = ...

In [None]:
_ = ok.grade('q_is_leave')

Next, use `is_leave` to select the rows in `good_brexit` that correspond to
confessed "Leave" voters.  Call the new data frame `leavers`:

In [None]:
leavers = ...

In [None]:
_ = ok.grade('q_leavers')

Do a histogram of the values in the `numage` column of `leavers`:

In [None]:
#- Show a histogram of the `numage` column from `leavers`

Here is the total number of Remain voters:

In [None]:
n_remain = len(remainers)
n_remain

Here is the total number of Leave voters:

In [None]:
n_leave = len(leavers)
n_leave

Here is the total number of voters who confessed to a specific Leave or Remain vote:

In [None]:
n_total = n_leave + n_remain
n_total

Here is the proportion of Leave voters:

In [None]:
leave_proportion = n_leave / n_total
leave_proportion

As you remember, the proportion of Leave voters in the referendum was 51.9%.
`leave_proportion` from the survey seems a way off.  Is it too far off?

You go back to the survey company and tell them that the proportion of Leave voters seems too low.

They say the following:

> We took a random sample of the population.  You are a data scientist, you
> know well that the proportion from this random sample is very unlikely to be
> exactly the same as the proportion in the whole population.  The proportion
> we get is compatible with the variation we expect from taking a random
> sample.
>
> In other words - the difference in the proportions, between the referendum
> and the survey, is due to sampling error.

Time for a simulation.

The null hypothesis offered by the survey company is that the proportion we
saw above is a plausible value if we took a random sample of `n_total` voters
from a population in which 51.9% voted Leave.

We can simulate a new survey, with `n_total` voters, by taking `n_total`
random numbers between 0 and 1.  We consider the values less than 0.519 as
corresponding to a Leave vote, and the rest as Remain votes.  We then
calculate the proportion of Leave votes (the proportion of values where
`value < 0.519` is True).

We do this 10000 times, to get 10000 simulated surveys.  We calculate the
proportions for each simulated survey, and do a histogram of the proportions.
Is `leave_proportion` a plausible value on this histogram?

See:

* [Reply to the Supreme Court](https://matthew-brett.github.io/cfd2019/chapters/03/reply_supreme)
* [Final three girls simulation](https://matthew-brett.github.io/cfd2019/chapters/03/filling_arrays)

to remind yourself about simulations.
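To get you started, here is a sketch of a *single* simulated survey.  It assumes a made-up survey size of 1000 voters, rather than the real `n_total`, and it uses `np.random.uniform` for the random numbers, which is one option among several.  The names `toy_n_voters`, `toy_leave_chance`, `randoms`, `is_leave_vote` and `toy_proportion` are invented for the example.

```python
import numpy as np

# Hypothetical survey size, chosen only for illustration.
toy_n_voters = 1000
# Proportion of Leave votes in the referendum.
toy_leave_chance = 0.519

# One random number between 0 and 1 for each simulated voter.
randoms = np.random.uniform(0, 1, size=toy_n_voters)
# True where the simulated voter counts as a Leave voter.
is_leave_vote = randoms < toy_leave_chance
# Proportion of Leave votes in this one simulated survey.
toy_proportion = np.count_nonzero(is_leave_vote) / toy_n_voters
print(toy_proportion)
```

Each time you run this, you will get a slightly different proportion.  That run-to-run variation is the sampling error the survey company is talking about.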

Your simulation should calculate 10000 simulated proportions, and store them in an array called `simulated_proportions`.

In [None]:
#- Your simulation here
simulated_proportions = ...
# Show the first 5 simulated_proportions values
simulated_proportions[:5]

In [None]:
_ = ok.grade('q_simulated_proportions')

Do a histogram of `simulated_proportions` below.  What do you think of the survey company's explanation?

In [None]:
plt.hist(simulated_proportions)
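One way to put a number on "plausible" is to count how many simulated values are at least as low as the value you observed.  Here is a sketch of that calculation using a handful of made-up numbers; `toy_simulated` and `toy_observed` are invented for the example, and are not results from the survey or from your simulation.

```python
import numpy as np

# Made-up simulated proportions and a made-up observed proportion.
toy_simulated = np.array([0.50, 0.53, 0.51, 0.55, 0.49, 0.52, 0.54, 0.50])
toy_observed = 0.495

# Count the simulated values that are at or below the observed value.
n_as_low = np.count_nonzero(toy_simulated <= toy_observed)
# Express the count as a proportion of all the simulated values.
p_as_low = n_as_low / len(toy_simulated)
print(n_as_low, p_as_low)
```

If very few simulated values are as low as the observed value, then sampling error alone is an unconvincing explanation for the observed value.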

## Done

You're finished with the assignment!  Be sure to...

- **run all the tests** (the next cell has a shortcut for that),
- **Save and Checkpoint** from the `File` menu.

In [None]:
# For your convenience, you can run this cell to run all the tests at once!
import os
_ = [ok.grade(q[:-3]) for q in os.listdir("tests") if q.startswith('q')]