In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Demographics: Standardising and recoding data fields

Let's now turn to the individual demographics data, which we will tidy up a bit.  These are in the files called `demographics.csv`, one per batch.  We'll take a bit of a shortcut and not check that the schemas are the same (exercise: try it yourself!)

In [None]:
raw_demographics = pd.concat(
    [pd.read_csv("data/raw/batch1/demographics.csv"), pd.read_csv("data/raw/batch2/demographics.csv")],
    ignore_index=True
)
raw_demographics
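For the schema exercise, one possible starting point is sketched here.  The two small frames are made-up stand-ins for the batch files (they are not the real data), and the check simply compares column names and dtypes before concatenating.

```python
import pandas as pd

# Hypothetical stand-ins for the two batch files (not the real data)
batch1 = pd.DataFrame({"participant.code": ["a1", "b2"], "player.age": [20, 31]})
batch2 = pd.DataFrame({"participant.code": ["c3"], "player.age": [25]})

# The column names should match exactly, including their order
same_columns = list(batch1.columns) == list(batch2.columns)

# The dtypes should agree column by column
same_dtypes = batch1.dtypes.to_dict() == batch2.dtypes.to_dict()

print(same_columns, same_dtypes)
```

If either check fails, it is worth finding out why before calling `pd.concat`, since mismatched columns are filled with nulls rather than raising an error.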

Let's have a look at the columns and their data types.

In [None]:
raw_demographics.dtypes

In this case we know the fields we're particularly interested in: the seven "about you" questions.  Let's have a look at the data values for them.  We can look at just a subset of the columns in a `DataFrame` by using the square-bracket operator.  We are passing a list of columns, hence the double square brackets.  The outer pair of square brackets is the indexing operator, and the inner pair is what denotes the list.  (It is perhaps unfortunate that Python uses square brackets both for indexing and for delimiting a list.)

In [None]:
raw_demographics[[
    'player.gender', 'player.age', 'player.countryborn',
    'player.countrynow', 'player.department', 'player.degree',
    'player.timeuea'
]].head(20)
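To see the difference between single and double square brackets in isolation, here is a small sketch on a made-up frame (the column names `a`, `b`, `c` are purely illustrative).

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

one_column = toy["a"]          # a single label gives a Series
two_columns = toy[["a", "b"]]  # a list of labels gives a DataFrame

print(type(one_column).__name__, type(two_columns).__name__)
```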


As before, we know we have entries for slots that were opened up for participants who did not turn up for the experiment.  However, trying to filter participants on whether or not demographics are null would be problematic, because participants cannot be obligated to disclose any or all of their demographic information:

In [None]:
raw_demographics[[
    'player.gender', 'player.age', 'player.countryborn',
    'player.countrynow', 'player.department', 'player.degree',
    'player.timeuea'
]].query("`player.department`.isnull()").head(10)


Here's where knowing how the software works is useful, in particular the fields that are about administering the experiment rather than recording responses.  For each row there's a field called `participant._index_in_pages` (note the leading underscore), which tells you how far the participant has progressed in the experiment, and also `participant._max_page_index`, which is the total number of pages needed to complete the experiment.

In [None]:
raw_demographics[[
    "participant._index_in_pages", "participant._max_page_index",
    "player.gender", "player.age"
]].head(20)

The number of pages varies from session to session.  (As we were running the experiment, we realised having extra landing pages in the instructions was helpful to keep participants together.)  So the best way to test whether a participant row is a valid observation is to see whether they reached the final page.

In [None]:
df = raw_demographics.query("`participant._index_in_pages` == `participant._max_page_index`")
df
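The backticks in the `query` string are what allow column names containing periods to be used.  As a rough illustration on made-up page counts, the same filter can also be written as a boolean mask, and the two should agree.

```python
import pandas as pd

# Made-up progress data: the second row stopped early
toy = pd.DataFrame({
    "participant._index_in_pages": [12, 3, 14],
    "participant._max_page_index": [12, 12, 14],
})

# Backticks let query() refer to names that are not valid Python identifiers
via_query = toy.query("`participant._index_in_pages` == `participant._max_page_index`")

# The same filter written as a boolean mask
mask = toy["participant._index_in_pages"] == toy["participant._max_page_index"]
via_mask = toy[mask]

print(len(via_query), len(via_mask))
```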

We've got the right number of rows.  We'll check later on whether our participant IDs match up exactly with what we did with decisions.  For now, let's continue cleaning the data by selecting the columns we want (and while we're at it, let's get rid of those annoying periods in the column names).

In [None]:
df = (
    df.rename(columns=lambda x: x.replace(".", "_"))
    .reindex(
        columns=['session_label', 'participant_code',
                 'player_gender', 'player_age', 'player_countryborn',
                 'player_countrynow', 'player_department', 'player_degree',
                 'player_timeuea']

    )
)
df
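One thing to be aware of with `reindex` is that it does not complain about a column name that doesn't exist; it adds that column filled with nulls instead.  A small sketch on a made-up frame (the name `player_typo` is deliberately wrong):

```python
import pandas as pd

toy = pd.DataFrame({"session.label": ["s1"], "player.age": [21], "player.extra": [0]})

tidy = (
    toy.rename(columns=lambda name: name.replace(".", "_"))
    # reindex keeps only the listed columns, in the listed order;
    # a name that does not exist appears as a column of nulls
    .reindex(columns=["player_age", "session_label", "player_typo"])
)
print(tidy)
```

So it is worth glancing at the result for unexpected all-null columns after a `reindex`.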

Sometimes, there are fields where there is a finite list of possible answers - but that list is too long to specify completely in a question.  Countries are a good example of this; there are roughly 200 in the world (depending on how you count), but we've all had the experience of how tedious it is to pick out your country from a long drop-down list.  We have two country fields in our data; this is a good opportunity to look at ways to tidy up the data.

Let's first look at `countryborn`, and see what data values are there:

In [None]:
df.sort_values('player_countryborn')['player_countryborn'].unique()

Compared to some datasets, this isn't all that bad; most of the country names are already rather clean.  We just need to standardise a few country names, and to make a decision about how to code situations where more than one country is listed.

To accomplish this we'll use two functions:
1. `Series.str.title()`: This will convert all the strings to Title Case - that is, first letter of each word capitalised and all others lowercase; (https://pandas.pydata.org/pandas-docs/version/1.2.4/reference/api/pandas.Series.str.title.html)
2. `Series.replace()`: This takes a `dict`, and replaces each instance of a key with the corresponding value in the `dict`. (https://pandas.pydata.org/pandas-docs/version/1.2.4/reference/api/pandas.Series.replace.html)

In [None]:
df = df.assign(
    player_countryborn = lambda x: (
        x['player_countryborn'].str.title()
        .replace(
            {'Britain': 'United Kingdom',
             'British': 'United Kingdom',
             'Uk': 'United Kingdom',
             'Uk (England)': 'United Kingdom',
             'Uk, Australia': 'United Kingdom',
             'England': 'United Kingdom',
             'England / Uk': 'United Kingdom',
             'United Kingdom (England)': 'United Kingdom',
             'Denmark/Usa': 'Denmark',
             'United States Of America': 'United States',
             'Usa': 'United States',
             'Taiwan, Egypt': 'Taiwan'
            }
        )
    )
)
df.sort_values('player_countryborn')['player_countryborn'].unique()
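The reason the dictionary needs an entry for every variant is that `replace` with a `dict` matches whole values, not parts of a string.  A minimal sketch on made-up responses:

```python
import pandas as pd

countries = pd.Series(["uk", "UK", "united kingdom", "usa", "France", "Uk (England)"])

cleaned = countries.str.title().replace({
    "Uk": "United Kingdom",
    "Usa": "United States",
})
# "Uk (England)" is left alone: it is not exactly equal to the key "Uk"
print(cleaned.tolist())
```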

Looks good.  Now we'll do the same exercise with `player_countrynow`.

In [None]:
df.sort_values('player_countrynow')['player_countrynow'].unique()

In [None]:
df = df.assign(
    player_countrynow = lambda x: (
        x['player_countrynow'].str.title()
        .replace(
            {'Britain': 'United Kingdom',
             'British': 'United Kingdom',
             'Uk': 'United Kingdom',
             'Uk (England)': 'United Kingdom',
             'Uk England': 'United Kingdom',
             'Uk, Australia': 'United Kingdom',
             'England, Uk': 'United Kingdom',
             'England': 'United Kingdom',
             'England / Uk': 'United Kingdom',
             'Uk As A Visiting Student': 'United Kingdom',
             'United Kingdom (England)': 'United Kingdom',
             'Denmark/Usa': 'Denmark',
             'United States Of America': 'United States',
             'Usa': 'United States',
             'Taiwan, Egypt': 'Taiwan'
            }
        )
    )
)
df.sort_values('player_countrynow')['player_countrynow'].unique()

Now let's have a look at `player_gender`.  This is an example of a quite annoying data field - the data are recorded by the computer as integers, but you have to know the computer code to know what is what.  Because we do have the computer code, we know that 1 = Male, 2 = Female, 3 = Other, and 4 = prefer not to say.


In [None]:
df.groupby('player_gender')['participant_code'].count()

We'll recode these using letters (M, F, O), and replace 4 with true null values.

In [None]:
df = df.assign(
    player_gender = lambda x: (
        x['player_gender'].replace(
            {1: 'M',
             2: 'F',
             3: 'O',
             4: None}
        )
    )
)
df.groupby('player_gender')['participant_code'].count()
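An alternative way to recode (not the one used in this notebook) is `Series.map`, which turns any code missing from the dictionary into a null.  A sketch on made-up codes:

```python
import pandas as pd

codes = pd.Series([1, 2, 2, 4, 3])

# Codes missing from the mapping (here 4, "prefer not to say") become null
labels = codes.map({1: "M", 2: "F", 3: "O"})
print(labels)
```

Either way, rows with a null gender drop out of a `groupby` count, which is why the totals after recoding can be smaller than the number of participants.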

Now, let's have a look at the responses for UEA schools:

In [None]:
df.sort_values('player_department')['player_department'].unique()

These aren't too bad.  To clean these up, alongside `Series.replace` which we've already used, we'll make use of two useful methods for string manipulation:

1. `Series.str.upper()`: Converts all characters in the string to uppercase. (https://pandas.pydata.org/pandas-docs/version/1.2.4/reference/api/pandas.Series.str.upper.html) 
2. `Series.str[]`: The `[]` notation on a Series works just like it does on regular Python strings or lists.  We'll use it here to restrict to the first three letters - which, after a bit of initial cleanup, maps to the UEA School/Faculty/programme names.

In [None]:
df = df.assign(
    player_department=lambda x: (
        x['player_department'].replace(
            {"BSc Psychology": "PSY",
             "NMS": "MED",
             "UEA": None}
        )
        .str.upper()
        .str[:3]
    )
)
df.sort_values('player_department')['player_department'].unique()
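Here is the effect of the two string methods on their own, using a few made-up department responses:

```python
import pandas as pd

departments = pd.Series(["eco", "Economics", "PSY", None])

# Uppercase everything, then keep the first three characters;
# missing values stay missing
short = departments.str.upper().str[:3]
print(short)
```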

The UEA degree/affiliation field is, like gender, straightforward enough if you have the coding from the software.

In [None]:
df = df.assign(
    player_degree=lambda x: (
        x['player_degree'].replace(
            {1: "INTO",
             2: "BSc",
             3: "PGDip",
             4: "MA/MSc",
             5: "PhD",
             6: "Staff",
             7: "Other",
             8: None}
        )        
    )
)
df.sort_values('player_degree')['player_degree'].unique()

Likewise, coding up the time-at-UEA question is now routine (I hope!)

In [None]:
df = df.assign(
    player_timeuea=lambda x: (
        x['player_timeuea'].replace(
            {1: "1st",
             2: "2nd",
             3: "3rd",
             4: "4th",
             5: "5th+",
             6: None}
        )
    )
)
df.sort_values('player_timeuea')['player_timeuea'].unique()

Let's take stock of where we are.

In [None]:
df

We haven't yet looked at the 'age' field.  We can have a look at the distribution of values in this field to see whether there are any which might be problematic.

In [None]:
df.groupby('player_age')[['participant_code']].count()

We're rather close; just a few further adjustments are needed: shorter column names, and an integer type for age.

In [None]:
df = (
    df.rename(columns=lambda x: x.replace("player_", ""))
    .rename(columns={
        'session_label': 'session_id',
        'participant_code': 'participant_id'
    })
    .astype({'age': int})
)
df
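The conversion to `int` relies on every age being present, because a plain integer column cannot hold nulls.  If some ages were missing, one option would be pandas' nullable `"Int64"` type.  A sketch on made-up ages, not our data:

```python
import pandas as pd

ages = pd.Series([19.0, 22.0, None])

# Plain int cannot represent the missing value, so this conversion raises an error
try:
    ages.astype(int)
    plain_int_worked = True
except (ValueError, TypeError):
    plain_int_worked = False

# The nullable "Int64" dtype keeps the gap as a missing value
nullable = ages.astype("Int64")
print(plain_int_worked, nullable.tolist())
```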

We'll save our work.

In [None]:
demographics = df
demographics.to_csv("data/prepared/demographics.csv", index=False)

In experiments (whether lab or field), randomisation of participants into treatments is a crucial aspect of the research methodology.  For example, in this experiment, we want to isolate the effect of information provision.  Now, in naturally occurring data, there may be different kinds of information provided by, for example, different investment platforms.  However, because people choose which investment platform to use, it might be that individual characteristics or preferences of people vary across different platforms.  For example, hypothetically, people who are risk-averse might prefer platforms that emphasise risk information.  Or, it could be - because people tend to avoid negative information - risk-averse people might prefer platforms that *don't* have risk information!  Either way, this would confound our understanding of the effect of information.

Because we assign participants to treatments at random, it should be the case that the characteristics of the participants in each treatment will be similar.  It is customary in experiments (especially field experiments) to check that the participants assigned to each treatment are similar in their *observable* characteristics.

Let's check a few of these as an exercise.

In [None]:
sessions = pd.read_csv("data/raw/sessions.csv")

We will augment the demographics `DataFrame` with the treatment.  To do this, we use the `merge` operation.
(https://pandas.pydata.org/pandas-docs/version/1.2.4/reference/api/pandas.DataFrame.merge.html).  See the somewhat deeper dive in the "mini-focus" on `merge` (and `join`) available in the `topics` notebook for this week.

In our `sessions` data, we have only one row for each `session_id`.  So our resulting `DataFrame` should still have 200 rows - one for each participant in `demographics`.

In [None]:
demographics = demographics.merge(
    sessions, how='left', on='session_id'
)
demographics
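If you want pandas to enforce the "one row per `session_id`" assumption rather than checking the row count by eye, `merge` has a `validate` argument.  A sketch on made-up frames (the names `people` and `sessions_toy` are just for illustration):

```python
import pandas as pd

people = pd.DataFrame({"participant_id": ["p1", "p2", "p3"],
                       "session_id": ["s1", "s1", "s2"]})
sessions_toy = pd.DataFrame({"session_id": ["s1", "s2"],
                             "treatment": ["A", "B"]})

# validate="many_to_one" raises an error if session_id is repeated on the right
merged = people.merge(sessions_toy, how="left", on="session_id", validate="many_to_one")

# A left merge against unique keys should not change the number of rows
rows_preserved = len(merged) == len(people)
print(rows_preserved)
```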

In [None]:
gender = demographics.groupby(['treatment', 'gender'])[['participant_id']].count()
gender

In [None]:
gender.unstack(fill_value=0)

In [None]:
department = (
    demographics.assign(
        is_eco=lambda x: x['department'] == "ECO"
    )
    .groupby(['treatment', 'is_eco'])[['participant_id']].count()
)
department.unstack(fill_value=0)

In [None]:
country = (
    demographics.assign(
        is_uk=lambda x: x['countryborn'] == "United Kingdom"
    )
    .groupby(['treatment', 'is_uk'])[['participant_id']].count()
)
country.unstack(fill_value=0)
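The group-count-unstack pattern used in these balance checks can also be written with `pd.crosstab`, which builds the table of counts in one call.  This is an alternative, shown here on made-up data rather than our participants:

```python
import pandas as pd

toy = pd.DataFrame({
    "treatment": ["A", "A", "B", "B", "B"],
    "gender": ["F", "M", "F", "F", "M"],
})

# Rows are treatments, columns are genders, cells are counts
table = pd.crosstab(toy["treatment"], toy["gender"])
print(table)
```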

## Numeracy data: Wide and long data formats

We'll turn now to the data from the 7 economic/numeracy questions.

In [None]:
raw_numeracy = pd.concat(
    [pd.read_csv("data/raw/batch1/numeracy.csv"), pd.read_csv("data/raw/batch2/numeracy.csv")],
    ignore_index=True
)
raw_numeracy

In this experiment, the answers to the seven questions are coded in fields called `player.answer1` up to `player.answer7`.

In [None]:
raw_numeracy.columns

We're only interested in the session/participant labels, and the answers to the seven questions.

In [None]:
df = raw_numeracy.reindex(
    columns=['session.label', 'participant.code',
             'player.answer1', 'player.answer2', 'player.answer3', 'player.answer4',
             'player.answer5', 'player.answer6', 'player.answer7']
)
df

We'll get rid of those pesky full-stops in the column names.

In [None]:
df = df.rename(columns=lambda x: x.replace(".", "_"))
df

Answering the numeracy questions was compulsory - so we can identify which rows correspond to actual participant responses by looking at the answer to the first question.

In [None]:
df = df.query("player_answer1.notnull()")
df

A bit of column renaming gets us to a first tidied-up representation of our data.

In [None]:
numeracy = (
    df.rename(columns={'session_label': 'session_id',
                       'participant_code': 'participant_id'})
    .rename(columns=lambda x: x.replace("player_answer", "answer"))
)
numeracy

We're interested in how many questions participants got correct.  One way we could do this is by manually going through and assigning correct/incorrect for each of the seven questions, like this.

In [None]:
df = (
    numeracy.assign(
        correct1=lambda x: (x['answer1'] == 150).astype(int),
        correct2=lambda x: (x['answer2'] == 100).astype(int),
        correct3=lambda x: (x['answer3'] == 9000).astype(int),
        correct4=lambda x: (x['answer4'] == 400000).astype(int),
        correct5=lambda x: (x['answer5'] == 242).astype(int),
        correct6=lambda x: (x['answer6'] == 3).astype(int),
        correct7=lambda x: (x['answer7'] == 2).astype(int)
    )
)
df

The numeracy score is then just the number of correct responses.

In [None]:
df = df.assign(
    numeracy=lambda x: x['correct1'] + x['correct2'] + x['correct3'] + x['correct4'] + x['correct5'] + x['correct6'] + x['correct7']
)
df

How did our participants do?  Quite well actually.  Frankly - too well.  We used these questions because in previous studies most people scored 3 or 4.  Our sample is far more numerate than the general public.  So - good on UEA students!  But in the end not as good for our research question...

In [None]:
df.groupby('numeracy')[['participant_id']].count()

There is another way of computing these scores - one that involves less repetitive typing, and also would scale much better to different (and larger) numbers of questions.

The data here are represented in "wide" format.  Each row corresponds to one participant, and within that participant we have multiple columns corresponding to responses to different questions.

We can convert the data to "long" format.  In long format, each row corresponds to one participant's response to one question.  For this purpose we'll use the `wide_to_long` function.  (https://pandas.pydata.org/pandas-docs/version/1.2.4/reference/api/pandas.wide_to_long.html)

In [None]:
df = pd.wide_to_long(numeracy, 'answer', ['session_id', 'participant_id'], 'question')
df

In [None]:
df = df.reset_index()
df
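It can help to see `wide_to_long` with its arguments named, on a frame small enough to check by eye.  The data here are made up, with two participants and two questions:

```python
import pandas as pd

wide = pd.DataFrame({
    "session_id": ["s1", "s1"],
    "participant_id": ["p1", "p2"],
    "answer1": [150, 100],
    "answer2": [100, 100],
})

long = pd.wide_to_long(
    wide,
    stubnames="answer",                  # shared prefix of the wide columns
    i=["session_id", "participant_id"],  # columns identifying each row
    j="question",                        # new column holding the numeric suffix
).reset_index()
print(long)
```

Two participants times two questions gives four rows, one per response.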

Long-format data is often easier to work with for doing various types of analyses.  For example, if you want to look at the distribution of responses across participants for each question, it is very easy to do with one line when you have the data in long format.  Doing this analysis with wide-format data would be much more cumbersome - we would have to iterate over each of the response columns.

In [None]:
df.groupby(['question', 'answer'])[['participant_id']].count()

Scoring the responses to be correct/incorrect is also much easier.  We can do this by creating an auxiliary `DataFrame` which gives the correct response to each question. We'll do that here by just making the `DataFrame` in memory - but for example if you had a much longer inventory of questions you might create this as another data file in your `raw` data folder.

In [None]:
correct = pd.DataFrame(
    [(1, 150), (2, 100), (3, 9000), (4, 400000), (5, 242), (6, 3), (7, 2)],
    columns=['question', 'correct']
)
correct

Then we can use a `merge` to add the correct answer to each row of our long-format `DataFrame`.  This is much more elegant, and more maintainable, than the way we did this in wide-format with `assign` above.

In [None]:
df = df.merge(correct, how='left', on=['question'])
df

Likewise, scoring each question is now much easier to write.

In [None]:
df = df.assign(
    numeracy=lambda x: (x['answer'] == x['correct']).astype(int)
)
df
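The whole long-format scoring route fits in a few lines.  Here it is on made-up responses to two questions; the names `responses`, `key` and `is_correct` are only for this sketch:

```python
import pandas as pd

responses = pd.DataFrame({
    "participant_id": ["p1", "p1", "p2", "p2"],
    "question": [1, 2, 1, 2],
    "answer": [150, 90, 150, 100],
})
key = pd.DataFrame({"question": [1, 2], "correct": [150, 100]})

scored = (
    responses.merge(key, how="left", on="question")
    # 1 if the response matches the answer key, 0 otherwise
    .assign(is_correct=lambda x: (x["answer"] == x["correct"]).astype(int))
)

# Summing per participant gives the number of correct responses
totals = scored.groupby("participant_id")["is_correct"].sum()
print(totals)
```

Adding a question only means adding a row to the answer key; the scoring code does not change.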

In [None]:
scores = df.groupby(['participant_id'])[['numeracy']].sum()
scores

In [None]:
scores = scores.reset_index()
scores

And we can see that we get the same distribution of numeracy scores via the "long-format" route as we did via the "wide-format" route.

In [None]:
scores.groupby('numeracy')[['participant_id']].count()

The inverse operation to `wide_to_long` is `pivot`.  (https://pandas.pydata.org/pandas-docs/version/1.2.4/reference/api/pandas.DataFrame.pivot.html)

In [None]:
pd.pivot(df, index=['session_id', 'participant_id'], columns='question', values=['answer', 'correct', 'numeracy'])
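On a small made-up frame with a single value column, the reshaping that `pivot` does is easier to see:

```python
import pandas as pd

long = pd.DataFrame({
    "participant_id": ["p1", "p1", "p2", "p2"],
    "question": [1, 2, 1, 2],
    "answer": [150, 90, 150, 100],
})

# One row per participant, one column per question
wide = long.pivot(index="participant_id", columns="question", values="answer")
print(wide)
```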

Although we got less variation in the numeracy scores than we predicted, it is still interesting to see whether numeracy correlates with any of the demographics.

In [None]:
demographics = pd.read_csv("data/prepared/demographics.csv")

In what I hope is now starting to feel routine, we'll take our numeracy scores and merge them with the demographics by `participant_id`.

In [None]:
scores = scores.merge(demographics, how='left', on='participant_id')
scores

We can look at the relationship between gender and numeracy score.  First we could just look at average scores:

In [None]:
scores.groupby('gender')[['numeracy']].mean()

But it's often more informative to do a cross-tabulation breakdown, following a similar pattern as before:

In [None]:
df = scores.groupby(['gender', 'numeracy'])[['participant_id']].count()
df

As we observed already, it turned out we had rather more females than males in our study.  So it would be useful to convert the counts into proportions within each gender.  We can accomplish this by grouping and then calling `transform`.  (https://pandas.pydata.org/pandas-docs/version/1.2.4/reference/api/pandas.core.groupby.DataFrameGroupBy.transform.html)

In [None]:
df = df.groupby(level=0).transform(lambda x: x/sum(x))
df
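To check what `transform` is doing here, this sketch runs the same two steps on four made-up participants, where the arithmetic can be done by hand:

```python
import pandas as pd

toy = pd.DataFrame({
    "gender": ["F", "F", "F", "M"],
    "numeracy": [6, 7, 7, 7],
    "participant_id": ["p1", "p2", "p3", "p4"],
})

counts = toy.groupby(["gender", "numeracy"])[["participant_id"]].count()

# Divide each count by the total for its gender (index level 0)
shares = counts.groupby(level=0).transform(lambda x: x / x.sum())
print(shares)
```

Within each gender the shares add up to 1, which is a handy sanity check on the real table as well.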

In [None]:
df = df.unstack(1)
df

You might like to round the proportions for easier viewing:

In [None]:
df = df.round(2)
df

In [None]:
scores = scores.assign(
    is_eco=lambda x: x['department'] == "ECO"
)
scores

What about ECO students?  Do ECO students score more highly on numeracy than others?

We can follow the same pattern as above - but exercise our fluent-interface muscles to write the algorithm for computing the table compactly as a single expression!

In [None]:
df = (
    scores.groupby(['is_eco', 'numeracy'])[['participant_id']].count()
    .groupby(level=0).transform(lambda x: x/sum(x))
    .unstack(1, fill_value=0)
    .round(2)
)
df