In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Demographics: Standardising and recoding data fields

Let's now turn to the individual demographics data, which we will tidy up a bit.  These are in the files called `demographics.csv`, one for each batch.  We'll take a bit of a shortcut and not check that the schemas of the two files are the same (exercise: try it yourself!).
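As a starting point for that exercise, here is a minimal sketch of one way to compare two schemas.  The two small `DataFrame`s are made-up stand-ins for the batch files (the names `batch1` and `batch2` are ours, not part of the data), so treat this as an illustration rather than the definitive check.

```python
import pandas as pd

# Toy stand-ins for the two batch files (made-up values, not the real data).
batch1 = pd.DataFrame({"participant.code": ["aaa"], "player.age": [21.0]})
batch2 = pd.DataFrame({"participant.code": ["bbb"], "player.age": [19.0]})

# Same column names, in the same order?
same_columns = list(batch1.columns) == list(batch2.columns)

# Columns present in one batch but not the other (empty sets mean a match).
only_in_1 = set(batch1.columns) - set(batch2.columns)
only_in_2 = set(batch2.columns) - set(batch1.columns)

# Same data type for every column?
same_dtypes = batch1.dtypes.to_dict() == batch2.dtypes.to_dict()

print(same_columns, only_in_1, only_in_2, same_dtypes)
```

With the real data, you would replace the toy frames with the results of the two `pd.read_csv` calls.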

In [3]:
raw_demographics = pd.concat(
    [pd.read_csv("data/raw/batch1/demographics.csv"), pd.read_csv("data/raw/batch2/demographics.csv")],
    ignore_index=True
)
raw_demographics

Unnamed: 0,participant.id_in_session,participant.code,participant.label,participant._is_bot,participant._index_in_pages,participant._max_page_index,participant._current_app_name,participant._current_page_name,participant.time_started,participant.visited,...,player.timeuea,player.payoff,group.id_in_subsession,subsession.round_number,session.code,session.label,session.mturk_HITId,session.mturk_HITGroupId,session.comment,session.is_demo
0,1,giaw6638,,0,59,59,endpage,EarningsPage,2021-06-25 12:59:44.372507+00:00,1,...,5.0,0.0,1,1,o4cfc4sd,20210625_1400,,,,0
1,2,ty10jr1q,,0,59,59,endpage,EarningsPage,2021-06-25 13:02:03.866151+00:00,1,...,2.0,0.0,1,1,o4cfc4sd,20210625_1400,,,,0
2,3,yrmtcn62,,0,59,59,endpage,EarningsPage,2021-06-25 13:07:30.993543+00:00,1,...,3.0,0.0,1,1,o4cfc4sd,20210625_1400,,,,0
3,4,duxn23dk,,0,59,59,endpage,EarningsPage,2021-06-25 13:09:10.696985+00:00,1,...,2.0,0.0,1,1,o4cfc4sd,20210625_1400,,,,0
4,5,jry264g2,,0,14,59,decisions,DecisionPage,2021-06-25 13:26:00.922890+00:00,1,...,,0.0,1,1,o4cfc4sd,20210625_1400,,,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
289,5,sjnia5kz,,0,61,61,endpage,EarningsPage,2021-10-18 11:11:07.478584+00:00,1,...,2.0,0.0,1,1,rdbejwb9,20211018_1200,,,,0
290,6,e1u5e0p2,,0,61,61,endpage,EarningsPage,2021-10-18 11:11:06.498121+00:00,1,...,3.0,0.0,1,1,rdbejwb9,20211018_1200,,,,0
291,7,3teu8bii,,0,61,61,endpage,EarningsPage,2021-10-18 11:11:53.791273+00:00,1,...,2.0,0.0,1,1,rdbejwb9,20211018_1200,,,,0
292,8,uvptgp2n,,0,61,61,endpage,EarningsPage,2021-10-18 11:11:33.661737+00:00,1,...,1.0,0.0,1,1,rdbejwb9,20211018_1200,,,,0


Let's have a look at the columns and their data types.

In [3]:
raw_demographics.dtypes

participant.id_in_session            int64
participant.code                    object
participant.label                  float64
participant._is_bot                  int64
participant._index_in_pages          int64
participant._max_page_index          int64
participant._current_app_name       object
participant._current_page_name      object
participant.time_started            object
participant.visited                  int64
participant.mturk_worker_id        float64
participant.mturk_assignment_id    float64
participant.payoff                 float64
player.id_in_group                   int64
player.gender                      float64
player.age                         float64
player.countryborn                  object
player.countrynow                   object
player.department                   object
player.degree                      float64
player.timeuea                     float64
player.payoff                      float64
group.id_in_subsession               int64
subsession.

In this case we know the fields we're particularly interested in: the seven "about you" questions.  Let's have a look at the data values for them.  We can look at just a subset of the columns in a `DataFrame` by using the square-bracket operator.  We are passing a list of columns, hence the double square brackets: the outer pair is the indexing operator, and the inner pair denotes the list.  (It is perhaps unfortunate that Python uses square brackets both for indexing and for delimiting a list...)

In [4]:
raw_demographics[[
    'player.gender', 'player.age', 'player.countryborn',
    'player.countrynow', 'player.department', 'player.degree',
    'player.timeuea'
]].head(20)


Unnamed: 0,player.gender,player.age,player.countryborn,player.countrynow,player.department,player.degree,player.timeuea
0,1.0,25.0,India,UK,MED,4.0,5.0
1,2.0,20.0,UK,UK,DEV,2.0,2.0
2,1.0,21.0,Brunei,UK,ECO,2.0,3.0
3,2.0,21.0,britain,uk,Pharmacy,4.0,2.0
4,,,,,,,
5,,,,,,,
6,2.0,23.0,united kingdom,united kingdom,,7.0,5.0
7,1.0,48.0,United Kingdom,United Kingdom,AMA,2.0,2.0
8,2.0,19.0,UK,UK,ECO,2.0,1.0
9,1.0,40.0,UK,UK,PPL,2.0,3.0
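To make the single- versus double-bracket distinction concrete, here is a small sketch on a made-up `DataFrame` (the name `toy` and its values are ours).  A single label gives back a `Series`, while a list of labels gives back a `DataFrame`, even if the list has only one item.

```python
import pandas as pd

# Made-up values, using three of the real column names.
toy = pd.DataFrame({
    "player.gender": [1.0, 2.0],
    "player.age": [25.0, 20.0],
    "player.degree": [4.0, 2.0],
})

one_column = toy["player.age"]                      # a single label -> Series
two_columns = toy[["player.gender", "player.age"]]  # a list of labels -> DataFrame
still_a_frame = toy[["player.age"]]                 # a one-item list -> one-column DataFrame

print(type(one_column).__name__,
      type(two_columns).__name__,
      type(still_a_frame).__name__)
```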


As before, we know we have entries for slots that were opened up for participants who did not turn up for the experiment.  However, trying to filter participants on whether or not demographics are null would be problematic, because participants cannot be obligated to disclose any or all of their demographic information:

In [5]:
raw_demographics[[
    'player.gender', 'player.age', 'player.countryborn',
    'player.countrynow', 'player.department', 'player.degree',
    'player.timeuea'
]].query("`player.department`.isnull()").head(10)


Unnamed: 0,player.gender,player.age,player.countryborn,player.countrynow,player.department,player.degree,player.timeuea
4,,,,,,,
5,,,,,,,
6,2.0,23.0,united kingdom,united kingdom,,7.0,5.0
14,,,,,,,
15,,,,,,,
37,4.0,20.0,Sweden,UK,,6.0,2.0
43,,,,,,,
44,,,,,,,
45,,,,,,,
51,1.0,22.0,Norway,UK,,8.0,6.0


Here's where knowing how the software works is useful, including the fields that are about administering the experiment rather than recording responses.  Each row has a field called `participant._index_in_pages` (note the leading underscore), which tells you how far the participant progressed in the experiment, and also `participant._max_page_index`, which is the total number of pages needed to complete the experiment.

In [6]:
raw_demographics[[
    "participant._index_in_pages", "participant._max_page_index",
    "player.gender", "player.age"
]].head(20)

Unnamed: 0,participant._index_in_pages,participant._max_page_index,player.gender,player.age
0,59,59,1.0,25.0
1,59,59,2.0,20.0
2,59,59,1.0,21.0
3,59,59,2.0,21.0
4,14,59,,
5,0,61,,
6,61,61,2.0,23.0
7,61,61,1.0,48.0
8,61,61,2.0,19.0
9,61,61,1.0,40.0


The number of pages varies from session to session.  (As we were running the experiment, we realised that having extra landing pages in the instructions helped to keep participants together.)  So the best way to test whether a participant row is a valid observation is to check whether the participant reached the final page.

In [7]:
df = raw_demographics.query("`participant._index_in_pages` == `participant._max_page_index`")
df

Unnamed: 0,participant.id_in_session,participant.code,participant.label,participant._is_bot,participant._index_in_pages,participant._max_page_index,participant._current_app_name,participant._current_page_name,participant.time_started,participant.visited,...,player.timeuea,player.payoff,group.id_in_subsession,subsession.round_number,session.code,session.label,session.mturk_HITId,session.mturk_HITGroupId,session.comment,session.is_demo
0,1,giaw6638,,0,59,59,endpage,EarningsPage,2021-06-25 12:59:44.372507+00:00,1,...,5.0,0.0,1,1,o4cfc4sd,20210625_1400,,,,0
1,2,ty10jr1q,,0,59,59,endpage,EarningsPage,2021-06-25 13:02:03.866151+00:00,1,...,2.0,0.0,1,1,o4cfc4sd,20210625_1400,,,,0
2,3,yrmtcn62,,0,59,59,endpage,EarningsPage,2021-06-25 13:07:30.993543+00:00,1,...,3.0,0.0,1,1,o4cfc4sd,20210625_1400,,,,0
3,4,duxn23dk,,0,59,59,endpage,EarningsPage,2021-06-25 13:09:10.696985+00:00,1,...,2.0,0.0,1,1,o4cfc4sd,20210625_1400,,,,0
6,1,mh9jvilu,,0,61,61,endpage,EarningsPage,2021-06-28 08:51:58.930374+00:00,1,...,5.0,0.0,1,1,f3j5v0lq,20210628_1000,,,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
289,5,sjnia5kz,,0,61,61,endpage,EarningsPage,2021-10-18 11:11:07.478584+00:00,1,...,2.0,0.0,1,1,rdbejwb9,20211018_1200,,,,0
290,6,e1u5e0p2,,0,61,61,endpage,EarningsPage,2021-10-18 11:11:06.498121+00:00,1,...,3.0,0.0,1,1,rdbejwb9,20211018_1200,,,,0
291,7,3teu8bii,,0,61,61,endpage,EarningsPage,2021-10-18 11:11:53.791273+00:00,1,...,2.0,0.0,1,1,rdbejwb9,20211018_1200,,,,0
292,8,uvptgp2n,,0,61,61,endpage,EarningsPage,2021-10-18 11:11:33.661737+00:00,1,...,1.0,0.0,1,1,rdbejwb9,20211018_1200,,,,0
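If the backtick syntax inside `query` feels awkward, the same filter can be written as a boolean mask.  The sketch below uses a made-up four-row `DataFrame` (`toy` is our name) to show that the two approaches select the same rows.

```python
import pandas as pd

# Made-up page counts: rows 0 and 3 reached the final page, rows 1 and 2 did not.
toy = pd.DataFrame({
    "participant._index_in_pages": [59, 14, 0, 61],
    "participant._max_page_index": [59, 59, 61, 61],
})

# query() needs backticks because the column names contain a period.
via_query = toy.query(
    "`participant._index_in_pages` == `participant._max_page_index`"
)

# The same filter written as a boolean mask.
finished = toy["participant._index_in_pages"] == toy["participant._max_page_index"]
via_mask = toy[finished]

print(via_mask.index.tolist())
```

Note that both versions keep the original row labels, which is why the index in the output above has gaps.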


We've got the right number of rows.  We'll check later on whether our participant IDs match up exactly with what we did with decisions.  For now, let's continue cleaning the data by selecting the columns we want (and while we're at it, let's get rid of those annoying periods in the column names).

In [8]:
df = (
    df.rename(columns=lambda x: x.replace(".", "_"))
    .reindex(
        columns=['session_label', 'participant_code',
                 'player_gender', 'player_age', 'player_countryborn',
                 'player_countrynow', 'player_department', 'player_degree',
                 'player_timeuea']

    )
)
df

Unnamed: 0,session_label,participant_code,player_gender,player_age,player_countryborn,player_countrynow,player_department,player_degree,player_timeuea
0,20210625_1400,giaw6638,1.0,25.0,India,UK,MED,4.0,5.0
1,20210625_1400,ty10jr1q,2.0,20.0,UK,UK,DEV,2.0,2.0
2,20210625_1400,yrmtcn62,1.0,21.0,Brunei,UK,ECO,2.0,3.0
3,20210625_1400,duxn23dk,2.0,21.0,britain,uk,Pharmacy,4.0,2.0
6,20210628_1000,mh9jvilu,2.0,23.0,united kingdom,united kingdom,,7.0,5.0
...,...,...,...,...,...,...,...,...,...
289,20211018_1200,sjnia5kz,2.0,21.0,England,UK,DEV,2.0,2.0
290,20211018_1200,e1u5e0p2,1.0,20.0,United Kingdom,United Kingdom,ECO,2.0,3.0
291,20211018_1200,3teu8bii,2.0,20.0,England,England,PSY,2.0,2.0
292,20211018_1200,uvptgp2n,2.0,18.0,United Kingdom,United Kingdom,CMP,2.0,1.0


Sometimes, there are fields where there is a finite list of possible answers - but that list is too long to specify completely in a question.  Countries are a good example of this; there are roughly 200 in the world (depending on how you count), but we've all had the experience of how tedious it is to pick out your country from a long drop-down list.  We have two country fields in our data; this is a good opportunity to look at ways to tidy up the data.

Let's first look at `countryborn`, and see what data values are there:

In [9]:
df.sort_values('player_countryborn')['player_countryborn'].unique()

array(['Bangladesh', 'Brazil', 'British', 'Brunei', 'China', 'Cyprus',
       'Denmark', 'Denmark/USA', 'England', 'England / UK', 'France',
       'Hong Kong', 'Hungary', 'India', 'Italy', 'Jamaica', 'Japan',
       'Jordan', 'Kenya', 'Latvia', 'Lithuania', 'Malaysia', 'Mexico',
       'Nepal', 'Nigeria', 'Norway', 'Philippines', 'Portugal',
       'South Africa', 'Spain', 'Sri Lanka', 'Sweden', 'Syria',
       'Taiwan, Egypt', 'UK', 'UK (England)', 'UK, Australia', 'USA',
       'Uk', 'United Kingdom', 'United Kingdom (England)',
       'United States', 'United States of America', 'United kingdom',
       'Vietnam', 'britain', 'cyprus', 'england', 'india', 'nigeria',
       'spain', 'uk', 'united kingdom', 'vietnam', nan], dtype=object)

Compared to some datasets, this isn't all that bad; most of the country names are already rather clean.  We just need to standardise a few country names, and to make a decision about how to code situations where more than one country is listed.

To accomplish this we'll use two functions:
1. `Series.str.title()`: This will convert all the strings to Title Case - that is, first letter of each word capitalised and all others lowercase; (https://pandas.pydata.org/pandas-docs/version/1.2.4/reference/api/pandas.Series.str.title.html)
2. `Series.replace()`: This takes a `dict`, and replaces each instance of a key with the corresponding value in the `dict`. (https://pandas.pydata.org/pandas-docs/version/1.2.4/reference/api/pandas.Series.replace.html)
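Before applying these to the real column, here is a minimal sketch of the two functions on a made-up `Series` (the name `countries` and its values are ours).  One thing worth noticing is that `replace` with a `dict` matches whole values, not parts of a string, so `'Uk (England)'` is left alone unless it gets its own entry.

```python
import pandas as pd

# Made-up responses, including a missing one.
countries = pd.Series(["uk", "UK (England)", "united kingdom", "india", None])

cleaned = (
    countries.str.title()                # 'uk' -> 'Uk', 'UK (England)' -> 'Uk (England)'
    .replace({"Uk": "United Kingdom"})   # only values that are exactly 'Uk' are replaced
)
print(cleaned.tolist())
```

This is why the `dict` used on the real data lists every variant separately.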

In [10]:
df = df.assign(
    player_countryborn = lambda x: (
        x['player_countryborn'].str.title()
        .replace(
            {'Britain': 'United Kingdom',
             'British': 'United Kingdom',
             'Uk': 'United Kingdom',
             'Uk (England)': 'United Kingdom',
             'Uk, Australia': 'United Kingdom',
             'England': 'United Kingdom',
             'England / Uk': 'United Kingdom',
             'United Kingdom (England)': 'United Kingdom',
             'Denmark/Usa': 'Denmark',
             'United States Of America': 'United States',
             'Usa': 'United States',
             'Taiwan, Egypt': 'Taiwan'
            }
        )
    )
)
df.sort_values('player_countryborn')['player_countryborn'].unique()

array(['Bangladesh', 'Brazil', 'Brunei', 'China', 'Cyprus', 'Denmark',
       'France', 'Hong Kong', 'Hungary', 'India', 'Italy', 'Jamaica',
       'Japan', 'Jordan', 'Kenya', 'Latvia', 'Lithuania', 'Malaysia',
       'Mexico', 'Nepal', 'Nigeria', 'Norway', 'Philippines', 'Portugal',
       'South Africa', 'Spain', 'Sri Lanka', 'Sweden', 'Syria', 'Taiwan',
       'United Kingdom', 'United States', 'Vietnam', nan], dtype=object)

Looks good.  Now we'll do the same exercise with `player_countrynow`.

In [11]:
df.sort_values('player_countrynow')['player_countrynow'].unique()

array(['China', 'Cyprus', 'Egypt', 'England', 'England / UK',
       'England, uk', 'Hong Kong', 'India', 'Latvia', 'Lithuania',
       'Malaysia', 'Nepal', 'Nigeria', 'Norway', 'South Africa', 'Spain',
       'Switzerland', 'Syria', 'Taiwan, Egypt', 'UK',
       'UK as a visiting student', 'Uk England', 'United Kingdom',
       'United Kingdom (England)', 'cyprus', 'england', 'uk',
       'united kingdom', 'vietnam'], dtype=object)

In [12]:
df = df.assign(
    player_countrynow = lambda x: (
        x['player_countrynow'].str.title()
        .replace(
            {'Britain': 'United Kingdom',
             'British': 'United Kingdom',
             'Uk': 'United Kingdom',
             'Uk (England)': 'United Kingdom',
             'Uk England': 'United Kingdom',
             'Uk, Australia': 'United Kingdom',
             'England, Uk': 'United Kingdom',
             'England': 'United Kingdom',
             'England / Uk': 'United Kingdom',
             'Uk As A Visiting Student': 'United Kingdom',
             'United Kingdom (England)': 'United Kingdom',
             'Denmark/Usa': 'Denmark',
             'United States Of America': 'United States',
             'Usa': 'United States',
             'Taiwan, Egypt': 'Taiwan'
            }
        )
    )
)
df.sort_values('player_countrynow')['player_countrynow'].unique()

array(['China', 'Cyprus', 'Egypt', 'Hong Kong', 'India', 'Latvia',
       'Lithuania', 'Malaysia', 'Nepal', 'Nigeria', 'Norway',
       'South Africa', 'Spain', 'Switzerland', 'Syria', 'Taiwan',
       'United Kingdom', 'Vietnam'], dtype=object)
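The two cells above repeat almost the same `dict`.  One possible way to avoid the duplication is to keep a single lookup and a small helper function, sketched below.  The names `COUNTRY_FIXES` and `standardise_country` are hypothetical, the lookup contains only a few of the entries used above, and the data are made up.

```python
import pandas as pd

# Hypothetical shared lookup; only a few of the entries from the cells above.
COUNTRY_FIXES = {
    "Uk": "United Kingdom",
    "England": "United Kingdom",
    "Usa": "United States",
}


def standardise_country(column: pd.Series) -> pd.Series:
    """Title-case a column of country names, then apply the shared lookup."""
    return column.str.title().replace(COUNTRY_FIXES)


# Made-up responses for the two country fields.
toy = pd.DataFrame({
    "player_countryborn": ["uk", "USA", "Norway"],
    "player_countrynow": ["england", "UK", "uk"],
})
toy = toy.assign(
    player_countryborn=lambda x: standardise_country(x["player_countryborn"]),
    player_countrynow=lambda x: standardise_country(x["player_countrynow"]),
)
print(toy)
```

The trade-off is that a shared lookup hides which variants actually occur in which column, so it is still worth inspecting the unique values of each column first, as we did here.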

Now let's have a look at `player_gender`.  This is an example of a quite annoying data field - the data are recorded by the computer as integers, but you have to know the computer code to know what is what.  Because we do have the computer code, we know that 1 = Male, 2 = Female, 3 = Other, and 4 = prefer not to say.


In [13]:
df.groupby('player_gender')['participant_code'].count()

player_gender
1.0     74
2.0    120
3.0      3
4.0      3
Name: participant_code, dtype: int64

We'll recode these using letters (M, F, O), and replace 4 with true null values.

In [14]:
df = df.assign(
    player_gender = lambda x: (
        x['player_gender'].replace(
            {1: 'M',
             2: 'F',
             3: 'O',
             4: None}
        )
    )
)
df.groupby('player_gender')['participant_code'].count()

player_gender
F    120
M     74
O      3
Name: participant_code, dtype: int64
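Notice that the "prefer not to say" responses no longer appear in the counts: `groupby` leaves out rows whose key is missing by default.  As a sketch of an alternative way to recode and then count while keeping the missing values visible, the example below uses `Series.map` and `value_counts(dropna=False)` on made-up codes (`codes` and `labels` are our names).  Unlike `replace`, `map` turns any code that is not in the `dict` into a missing value.

```python
import pandas as pd

# Made-up gender codes; 4.0 stands for "prefer not to say".
codes = pd.Series([1.0, 2.0, 2.0, 3.0, 4.0])

# Codes without an entry in the dict (here 4.0) become missing values.
labels = codes.map({1.0: "M", 2.0: "F", 3.0: "O"})

# dropna=False keeps the missing values as their own row in the tally.
counts = labels.value_counts(dropna=False)
print(counts)
```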

Now, let's have a look at the responses for UEA schools:

In [15]:
df.sort_values('player_department')['player_department'].unique()

array(['AMA', 'BIO', 'BSc Psychology', 'Biological Science', 'CHE', 'CMP',
       'CMP - Computing Science', 'DEV', 'ECO', 'EDU', 'ENG', 'ENV',
       'Eco', 'Economics', 'HIS', 'HSC', 'Hsc', 'IIH', 'LAW', 'LDC',
       'Law', 'MED', 'MTH', 'Med', 'NAT', 'NATSci', 'NBS',
       'NBS Business School', 'NMS (Med)', 'PHA', 'PHY', 'PPL', 'PSY',
       'Pharmacy', 'SCI', 'SWK', 'UEA', 'chemistry', 'eco', 'nbs',
       'pharmacy', 'psy', nan], dtype=object)

These aren't too bad.  To clean these up, alongside `Series.replace` which we've already used, we'll make use of two useful methods for string manipulation:

1. `Series.str.upper()`: Converts all characters in the string to uppercase. (https://pandas.pydata.org/pandas-docs/version/1.2.4/reference/api/pandas.Series.str.upper.html) 
2. `Series.str[]`: The `[]` notation on the `.str` accessor works just like it does on regular Python strings or lists.  We'll use it here to restrict to the first three letters - which, after a bit of initial cleanup, maps to the UEA School/Faculty/programme names.
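Here is a small sketch of these two methods on made-up responses (the name `departments` is ours).  It also shows why some initial cleanup is needed: taking the first three letters only works when the response starts with the School code.

```python
import pandas as pd

# Made-up responses, including one that does not start with a School code.
departments = pd.Series(
    ["eco", "Economics", "Pharmacy", "NBS Business School", "BSc Psychology", None]
)

short = departments.str.upper().str[:3]
print(short.tolist())
```

The first four responses come out as `ECO`, `ECO`, `PHA` and `NBS`, but `'BSc Psychology'` becomes `BSC` rather than `PSY`, so responses like that need an explicit replacement before slicing.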

In [16]:
df = df.assign(
    player_department=lambda x: (
        x['player_department'].replace(
            # replace() matches whole values, so each key must be the full response
            {"BSc Psychology": "PSY",
             "NMS (Med)": "MED",
             "UEA": None}
        )
        .str.upper()
        .str[:3]
    )
)
df.sort_values('player_department')['player_department'].unique()

array(['AMA', 'BIO', 'CHE', 'CMP', 'DEV', 'ECO', 'EDU', 'ENG', 'ENV',
       'HIS', 'HSC', 'IIH', 'LAW', 'LDC', 'MED', 'MTH', 'NAT', 'NBS',
       'PHA', 'PHY', 'PPL', 'PSY', 'SCI', 'SWK', nan, None], dtype=object)

The UEA degree/affiliation field is, like gender, straightforward enough if you have the coding from the software.

In [17]:
df = df.assign(
    player_degree=lambda x: (
        x['player_degree'].replace(
            {1: "INTO",
             2: "BSc",
             3: "PGDip",
             4: "MA/MSc",
             5: "PhD",
             6: "Staff",
             7: "Other",
             8: None}
        )        
    )
)
df.sort_values('player_degree')['player_degree'].unique()

array(['BSc', 'INTO', 'MA/MSc', 'Other', 'PGDip', 'PhD', 'Staff', None],
      dtype=object)

Likewise, coding up the time-at-UEA question is now routine (I hope!)

In [18]:
df = df.assign(
    player_timeuea=lambda x: (
        x['player_timeuea'].replace(
            {1: "1st",
             2: "2nd",
             3: "3rd",
             4: "4th",
             5: "5th+",
             6: None}
        )
    )
)
df.sort_values('player_timeuea')['player_timeuea'].unique()

array(['1st', '2nd', '3rd', '4th', '5th+', None], dtype=object)

Let's take stock of where we are.

In [19]:
df

Unnamed: 0,session_label,participant_code,player_gender,player_age,player_countryborn,player_countrynow,player_department,player_degree,player_timeuea
0,20210625_1400,giaw6638,M,25.0,India,United Kingdom,MED,MA/MSc,5th+
1,20210625_1400,ty10jr1q,F,20.0,United Kingdom,United Kingdom,DEV,BSc,2nd
2,20210625_1400,yrmtcn62,M,21.0,Brunei,United Kingdom,ECO,BSc,3rd
3,20210625_1400,duxn23dk,F,21.0,United Kingdom,United Kingdom,PHA,MA/MSc,2nd
6,20210628_1000,mh9jvilu,F,23.0,United Kingdom,United Kingdom,,Other,5th+
...,...,...,...,...,...,...,...,...,...
289,20211018_1200,sjnia5kz,F,21.0,United Kingdom,United Kingdom,DEV,BSc,2nd
290,20211018_1200,e1u5e0p2,M,20.0,United Kingdom,United Kingdom,ECO,BSc,3rd
291,20211018_1200,3teu8bii,F,20.0,United Kingdom,United Kingdom,PSY,BSc,2nd
292,20211018_1200,uvptgp2n,F,18.0,United Kingdom,United Kingdom,CMP,BSc,1st


We haven't yet looked at the 'age' field.  We can have a look at the distribution of values in this field to see whether there are any which might be problematic.

In [20]:
df.groupby('player_age')[['participant_code']].count()

Unnamed: 0_level_0,participant_code
player_age,Unnamed: 1_level_1
18.0,21
19.0,27
20.0,44
21.0,44
22.0,15
23.0,14
24.0,10
25.0,9
26.0,1
27.0,3


We're rather close; just a few further adjustments remain.

In [21]:
df = (
    df.rename(columns=lambda x: x.replace("player_", ""))
    .rename(columns={
        'session_label': 'session_id',
        'participant_code': 'participant_id'
    })
    .astype({'age': int})
)
df

Unnamed: 0,session_id,participant_id,gender,age,countryborn,countrynow,department,degree,timeuea
0,20210625_1400,giaw6638,M,25,India,United Kingdom,MED,MA/MSc,5th+
1,20210625_1400,ty10jr1q,F,20,United Kingdom,United Kingdom,DEV,BSc,2nd
2,20210625_1400,yrmtcn62,M,21,Brunei,United Kingdom,ECO,BSc,3rd
3,20210625_1400,duxn23dk,F,21,United Kingdom,United Kingdom,PHA,MA/MSc,2nd
6,20210628_1000,mh9jvilu,F,23,United Kingdom,United Kingdom,,Other,5th+
...,...,...,...,...,...,...,...,...,...
289,20211018_1200,sjnia5kz,F,21,United Kingdom,United Kingdom,DEV,BSc,2nd
290,20211018_1200,e1u5e0p2,M,20,United Kingdom,United Kingdom,ECO,BSc,3rd
291,20211018_1200,3teu8bii,F,20,United Kingdom,United Kingdom,PSY,BSc,2nd
292,20211018_1200,uvptgp2n,F,18,United Kingdom,United Kingdom,CMP,BSc,1st
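The conversion with `astype({'age': int})` works here because every remaining row has an age.  If some ages were missing, a plain `int` conversion would fail; a possible alternative in that case is the nullable integer type `"Int64"` (capital I).  The sketch below uses made-up ages to show the difference.

```python
import pandas as pd

# Made-up ages with one missing value.
ages = pd.Series([25.0, 20.0, None])

# Plain int cannot represent a missing value, so the conversion raises.
try:
    ages.astype(int)
    plain_int_ok = True
except ValueError:
    plain_int_ok = False

# The nullable integer type keeps the missing value as <NA>.
nullable = ages.astype("Int64")
print(plain_int_ok, nullable.dtype)
```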


We'll save our work.

In [22]:
demographics = df
demographics.to_csv("data/prepared/demographics.csv", index=False)

In experiments (whether lab or field), randomisation of participants into treatments is a crucial aspect of the research methodology.  For example, in this experiment, we want to isolate the effect of information provision.  Now, in naturally occurring data, there may be different kinds of information provided by, for example, different investment platforms.  However, because people choose which investment platform to use, it might be that individual characteristics or preferences of people vary across different platforms.  For example, hypothetically, people who are risk-averse might prefer platforms that emphasise risk information.  Or, it could be - because people tend to avoid negative information - risk-averse people might prefer platforms that *don't* have risk information!  Either way, this would confound our understanding of the effect of information.

Because we recruit participants into treatments at random, the characteristics of the participants in each treatment should be similar.  It is customary in experiments (especially field experiments) to check that the participants in each treatment are indeed similar in their *observable* characteristics.

Let's check a few of these as an exercise.

In [23]:
sessions = pd.read_csv("data/raw/sessions.csv")

We will augment the demographics `DataFrame` with the treatment.  To do this, we use the `merge` operation (https://pandas.pydata.org/pandas-docs/version/1.2.4/reference/api/pandas.DataFrame.merge.html).  See the somewhat deeper dive in the "mini-focus" on `merge` (and `join`) available in the `topics` notebook for this week.

In our `sessions` data, we have only one row for each `session_id`.  So our resulting `DataFrame` should still have 200 rows - one for each participant in `demographics`.

In [24]:
demographics = demographics.merge(
    sessions, how='left', on='session_id'
)
demographics

Unnamed: 0,session_id,participant_id,gender,age,countryborn,countrynow,department,degree,timeuea,treatment
0,20210625_1400,giaw6638,M,25,India,United Kingdom,MED,MA/MSc,5th+,B
1,20210625_1400,ty10jr1q,F,20,United Kingdom,United Kingdom,DEV,BSc,2nd,B
2,20210625_1400,yrmtcn62,M,21,Brunei,United Kingdom,ECO,BSc,3rd,B
3,20210625_1400,duxn23dk,F,21,United Kingdom,United Kingdom,PHA,MA/MSc,2nd,B
4,20210628_1000,mh9jvilu,F,23,United Kingdom,United Kingdom,,Other,5th+,E
...,...,...,...,...,...,...,...,...,...,...
195,20211018_1200,sjnia5kz,F,21,United Kingdom,United Kingdom,DEV,BSc,2nd,B
196,20211018_1200,e1u5e0p2,M,20,United Kingdom,United Kingdom,ECO,BSc,3rd,B
197,20211018_1200,3teu8bii,F,20,United Kingdom,United Kingdom,PSY,BSc,2nd,B
198,20211018_1200,uvptgp2n,F,18,United Kingdom,United Kingdom,CMP,BSc,1st,B
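The expectation that the merge keeps exactly one row per participant can be checked rather than assumed.  The sketch below, on made-up data (`demo` and `sess` are our names), uses the `validate` argument of `merge`, which raises an error if a `session_id` appears more than once in the right-hand table, and then confirms that the row count is unchanged and that every participant received a treatment.

```python
import pandas as pd

# Made-up participants and sessions.
demo = pd.DataFrame({
    "session_id": ["s1", "s1", "s2"],
    "participant_id": ["a", "b", "c"],
})
sess = pd.DataFrame({"session_id": ["s1", "s2"], "treatment": ["B", "E"]})

# validate="many_to_one" raises a MergeError if session_id is duplicated in sess.
merged = demo.merge(sess, how="left", on="session_id", validate="many_to_one")

# A left merge on a unique key keeps the number of rows unchanged...
assert len(merged) == len(demo)
# ...and every participant should have found a matching session.
assert merged["treatment"].notna().all()
print(merged)
```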


In [27]:
gender = demographics.groupby(['treatment', 'gender'])[['participant_id']].count()
gender

Unnamed: 0_level_0,Unnamed: 1_level_0,participant_id
treatment,gender,Unnamed: 2_level_1
B,F,28
B,M,21
E,F,34
E,M,14
E,O,1
ER,F,32
ER,M,16
ER,O,2
R,F,26
R,M,23


In [28]:
gender.unstack(fill_value=0)

Unnamed: 0_level_0,participant_id,participant_id,participant_id
gender,F,M,O
treatment,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
B,28,21,0
E,34,14,1
ER,32,16,2
R,26,23,0
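The `groupby` followed by `unstack` gives us a table of counts of gender by treatment.  As a sketch of an alternative, `pd.crosstab` builds the same kind of table in one step, and with `normalize="index"` it reports shares within each treatment, which can be easier to compare across treatments.  The data below are made up.

```python
import pandas as pd

# Made-up treatment and gender values.
toy = pd.DataFrame({
    "treatment": ["B", "B", "E", "E", "E"],
    "gender": ["F", "M", "F", "F", "O"],
})

# Counts of each gender within each treatment; absent combinations show as 0.
counts = pd.crosstab(toy["treatment"], toy["gender"])

# Shares within each treatment (each row sums to 1).
shares = pd.crosstab(toy["treatment"], toy["gender"], normalize="index")

print(counts)
print(shares.round(2))
```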


In [29]:
department = (
    demographics.assign(
        is_eco=lambda x: x['department'] == "ECO"
    )
    .groupby(['treatment', 'is_eco'])[['participant_id']].count()
)
department.unstack(fill_value=0)

Unnamed: 0_level_0,participant_id,participant_id
is_eco,False,True
treatment,Unnamed: 1_level_2,Unnamed: 2_level_2
B,41,9
E,43,7
ER,41,9
R,44,6


In [30]:
country = (
    demographics.assign(
        is_uk=lambda x: x['countryborn'] == "United Kingdom"
    )
    .groupby(['treatment', 'is_uk'])[['participant_id']].count()
)
country.unstack(fill_value=0)

Unnamed: 0_level_0,participant_id,participant_id
is_uk,False,True
treatment,Unnamed: 1_level_2,Unnamed: 2_level_2
B,17,33
E,23,27
ER,22,28
R,22,28


## Numeracy data: Wide and long data formats

We'll turn now to the data from the 7 economic/numeracy questions.

In [4]:
raw_numeracy = pd.concat(
    [pd.read_csv("data/raw/batch1/numeracy.csv"), pd.read_csv("data/raw/batch2/numeracy.csv")],
    ignore_index=True
)
raw_numeracy

Unnamed: 0,participant.id_in_session,participant.code,participant.label,participant._is_bot,participant._index_in_pages,participant._max_page_index,participant._current_app_name,participant._current_page_name,participant.time_started,participant.visited,...,player.answer7,player.payoff,group.id_in_subsession,subsession.round_number,session.code,session.label,session.mturk_HITId,session.mturk_HITGroupId,session.comment,session.is_demo
0,1,giaw6638,,0,59,59,endpage,EarningsPage,2021-06-25 12:59:44.372507+00:00,1,...,2.0,0.0,1,1,o4cfc4sd,20210625_1400,,,,0
1,2,ty10jr1q,,0,59,59,endpage,EarningsPage,2021-06-25 13:02:03.866151+00:00,1,...,2.0,0.0,1,1,o4cfc4sd,20210625_1400,,,,0
2,3,yrmtcn62,,0,59,59,endpage,EarningsPage,2021-06-25 13:07:30.993543+00:00,1,...,2.0,0.0,1,1,o4cfc4sd,20210625_1400,,,,0
3,4,duxn23dk,,0,59,59,endpage,EarningsPage,2021-06-25 13:09:10.696985+00:00,1,...,2.0,0.0,1,1,o4cfc4sd,20210625_1400,,,,0
4,5,jry264g2,,0,14,59,decisions,DecisionPage,2021-06-25 13:26:00.922890+00:00,1,...,,0.0,1,1,o4cfc4sd,20210625_1400,,,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
289,5,sjnia5kz,,0,61,61,endpage,EarningsPage,2021-10-18 11:11:07.478584+00:00,1,...,2.0,0.0,1,1,rdbejwb9,20211018_1200,,,,0
290,6,e1u5e0p2,,0,61,61,endpage,EarningsPage,2021-10-18 11:11:06.498121+00:00,1,...,2.0,0.0,1,1,rdbejwb9,20211018_1200,,,,0
291,7,3teu8bii,,0,61,61,endpage,EarningsPage,2021-10-18 11:11:53.791273+00:00,1,...,2.0,0.0,1,1,rdbejwb9,20211018_1200,,,,0
292,8,uvptgp2n,,0,61,61,endpage,EarningsPage,2021-10-18 11:11:33.661737+00:00,1,...,2.0,0.0,1,1,rdbejwb9,20211018_1200,,,,0


In this experiment, the answers to the seven questions are coded in fields called `player.answer1` up to `player.answer7`.

In [5]:
raw_numeracy.columns

Index(['participant.id_in_session', 'participant.code', 'participant.label',
       'participant._is_bot', 'participant._index_in_pages',
       'participant._max_page_index', 'participant._current_app_name',
       'participant._current_page_name', 'participant.time_started',
       'participant.visited', 'participant.mturk_worker_id',
       'participant.mturk_assignment_id', 'participant.payoff',
       'player.id_in_group', 'player.answer1', 'player.answer2',
       'player.answer3', 'player.answer4', 'player.answer5', 'player.answer6',
       'player.answer7', 'player.payoff', 'group.id_in_subsession',
       'subsession.round_number', 'session.code', 'session.label',
       'session.mturk_HITId', 'session.mturk_HITGroupId', 'session.comment',
       'session.is_demo'],
      dtype='object')

We're only interested in the session/participant labels, and the answers to the seven questions.

In [8]:
df = raw_numeracy.reindex(
    columns=['session.label', 'participant.code',
             'player.answer1', 'player.answer2', 'player.answer3', 'player.answer4',
             'player.answer5', 'player.answer6', 'player.answer7']
)
df

Unnamed: 0,session.label,participant.code,player.answer1,player.answer2,player.answer3,player.answer4,player.answer5,player.answer6,player.answer7
0,20210625_1400,giaw6638,150.0,100.0,9000.0,400000.0,242.0,3.0,2.0
1,20210625_1400,ty10jr1q,150.0,100.0,9000.0,400000.0,242.0,3.0,2.0
2,20210625_1400,yrmtcn62,150.0,100.0,9000.0,2000000.0,242.0,3.0,2.0
3,20210625_1400,duxn23dk,150.0,100.0,9000.0,400000.0,240.0,2.0,2.0
4,20210625_1400,jry264g2,,,,,,,
...,...,...,...,...,...,...,...,...,...
289,20211018_1200,sjnia5kz,150.0,100.0,9000.0,400000.0,242.0,3.0,2.0
290,20211018_1200,e1u5e0p2,150.0,100.0,9000.0,400000.0,242.0,3.0,2.0
291,20211018_1200,3teu8bii,150.0,100.0,9000.0,400000.0,242.0,3.0,2.0
292,20211018_1200,uvptgp2n,150.0,100.0,9000.0,400000.0,240.0,3.0,2.0


We'll get rid of those pesky full-stops in the column names.

In [9]:
df = df.rename(columns=lambda x: x.replace(".", "_"))
df

Unnamed: 0,session_label,participant_code,player_answer1,player_answer2,player_answer3,player_answer4,player_answer5,player_answer6,player_answer7
0,20210625_1400,giaw6638,150.0,100.0,9000.0,400000.0,242.0,3.0,2.0
1,20210625_1400,ty10jr1q,150.0,100.0,9000.0,400000.0,242.0,3.0,2.0
2,20210625_1400,yrmtcn62,150.0,100.0,9000.0,2000000.0,242.0,3.0,2.0
3,20210625_1400,duxn23dk,150.0,100.0,9000.0,400000.0,240.0,2.0,2.0
4,20210625_1400,jry264g2,,,,,,,
...,...,...,...,...,...,...,...,...,...
289,20211018_1200,sjnia5kz,150.0,100.0,9000.0,400000.0,242.0,3.0,2.0
290,20211018_1200,e1u5e0p2,150.0,100.0,9000.0,400000.0,242.0,3.0,2.0
291,20211018_1200,3teu8bii,150.0,100.0,9000.0,400000.0,242.0,3.0,2.0
292,20211018_1200,uvptgp2n,150.0,100.0,9000.0,400000.0,240.0,3.0,2.0


Answering the numeracy questions was compulsory - so we can identify which rows correspond to actual participant responses by looking at the answer to the first question.

In [10]:
df = df.query("player_answer1.notnull()")
df

Unnamed: 0,session_label,participant_code,player_answer1,player_answer2,player_answer3,player_answer4,player_answer5,player_answer6,player_answer7
0,20210625_1400,giaw6638,150.0,100.0,9000.0,400000.0,242.0,3.0,2.0
1,20210625_1400,ty10jr1q,150.0,100.0,9000.0,400000.0,242.0,3.0,2.0
2,20210625_1400,yrmtcn62,150.0,100.0,9000.0,2000000.0,242.0,3.0,2.0
3,20210625_1400,duxn23dk,150.0,100.0,9000.0,400000.0,240.0,2.0,2.0
6,20210628_1000,mh9jvilu,150.0,100.0,9000.0,400000.0,242.0,3.0,2.0
...,...,...,...,...,...,...,...,...,...
289,20211018_1200,sjnia5kz,150.0,100.0,9000.0,400000.0,242.0,3.0,2.0
290,20211018_1200,e1u5e0p2,150.0,100.0,9000.0,400000.0,242.0,3.0,2.0
291,20211018_1200,3teu8bii,150.0,100.0,9000.0,400000.0,242.0,3.0,2.0
292,20211018_1200,uvptgp2n,150.0,100.0,9000.0,400000.0,240.0,3.0,2.0


A bit of column renaming gets us to a first tidied-up representation of our data.

In [12]:
numeracy = (
    df.rename(columns={'session_label': 'session_id',
                       'participant_code': 'participant_id'})
    .rename(columns=lambda x: x.replace("player_answer", "answer"))
)
numeracy

Unnamed: 0,session_id,participant_id,answer1,answer2,answer3,answer4,answer5,answer6,answer7
0,20210625_1400,giaw6638,150.0,100.0,9000.0,400000.0,242.0,3.0,2.0
1,20210625_1400,ty10jr1q,150.0,100.0,9000.0,400000.0,242.0,3.0,2.0
2,20210625_1400,yrmtcn62,150.0,100.0,9000.0,2000000.0,242.0,3.0,2.0
3,20210625_1400,duxn23dk,150.0,100.0,9000.0,400000.0,240.0,2.0,2.0
6,20210628_1000,mh9jvilu,150.0,100.0,9000.0,400000.0,242.0,3.0,2.0
...,...,...,...,...,...,...,...,...,...
289,20211018_1200,sjnia5kz,150.0,100.0,9000.0,400000.0,242.0,3.0,2.0
290,20211018_1200,e1u5e0p2,150.0,100.0,9000.0,400000.0,242.0,3.0,2.0
291,20211018_1200,3teu8bii,150.0,100.0,9000.0,400000.0,242.0,3.0,2.0
292,20211018_1200,uvptgp2n,150.0,100.0,9000.0,400000.0,240.0,3.0,2.0


We're interested in how many questions participants got correct.  One way we could do this is by manually going through and assigning correct/incorrect for each of the seven questions, like this.

In [13]:
df = (
    numeracy.assign(
        correct1=lambda x: (x['answer1'] == 150).astype(int),
        correct2=lambda x: (x['answer2'] == 100).astype(int),
        correct3=lambda x: (x['answer3'] == 9000).astype(int),
        correct4=lambda x: (x['answer4'] == 400000).astype(int),
        correct5=lambda x: (x['answer5'] == 242).astype(int),
        correct6=lambda x: (x['answer6'] == 3).astype(int),
        correct7=lambda x: (x['answer7'] == 2).astype(int)
    )
)
df

Unnamed: 0,session_id,participant_id,answer1,answer2,answer3,answer4,answer5,answer6,answer7,correct1,correct2,correct3,correct4,correct5,correct6,correct7
0,20210625_1400,giaw6638,150.0,100.0,9000.0,400000.0,242.0,3.0,2.0,1,1,1,1,1,1,1
1,20210625_1400,ty10jr1q,150.0,100.0,9000.0,400000.0,242.0,3.0,2.0,1,1,1,1,1,1,1
2,20210625_1400,yrmtcn62,150.0,100.0,9000.0,2000000.0,242.0,3.0,2.0,1,1,1,0,1,1,1
3,20210625_1400,duxn23dk,150.0,100.0,9000.0,400000.0,240.0,2.0,2.0,1,1,1,1,0,0,1
6,20210628_1000,mh9jvilu,150.0,100.0,9000.0,400000.0,242.0,3.0,2.0,1,1,1,1,1,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
289,20211018_1200,sjnia5kz,150.0,100.0,9000.0,400000.0,242.0,3.0,2.0,1,1,1,1,1,1,1
290,20211018_1200,e1u5e0p2,150.0,100.0,9000.0,400000.0,242.0,3.0,2.0,1,1,1,1,1,1,1
291,20211018_1200,3teu8bii,150.0,100.0,9000.0,400000.0,242.0,3.0,2.0,1,1,1,1,1,1,1
292,20211018_1200,uvptgp2n,150.0,100.0,9000.0,400000.0,240.0,3.0,2.0,1,1,1,1,0,1,1
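
One thing worth knowing about this style of scoring is how it treats missing answers.  The sketch below uses made-up values to show that a comparison with `NaN` is `False`, so a blank response would silently be scored as incorrect.

```python
import numpy as np
import pandas as pd

# Made-up responses: one correct, one wrong, one missing.
answers = pd.Series([150.0, 300.0, np.nan])

# NaN == 150 evaluates to False, so the missing answer is scored 0.
correct = (answers == 150).astype(int)
print(correct.tolist())  # [1, 0, 0]

# If we wanted to tell "wrong" apart from "not answered", we could keep a mask.
answered = answers.notna()
print(answered.tolist())  # [True, True, False]
```

Whether a blank should count as incorrect is a research design decision rather than a coding one, so it is worth making it deliberately.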


The numeracy score is then just the number of correct responses.

In [15]:
df = df.assign(
    numeracy=lambda x: x['correct1'] + x['correct2'] + x['correct3'] + x['correct4'] + x['correct5'] + x['correct6'] + x['correct7']
)
df

Unnamed: 0,session_id,participant_id,answer1,answer2,answer3,answer4,answer5,answer6,answer7,correct1,correct2,correct3,correct4,correct5,correct6,correct7,numeracy
0,20210625_1400,giaw6638,150.0,100.0,9000.0,400000.0,242.0,3.0,2.0,1,1,1,1,1,1,1,7
1,20210625_1400,ty10jr1q,150.0,100.0,9000.0,400000.0,242.0,3.0,2.0,1,1,1,1,1,1,1,7
2,20210625_1400,yrmtcn62,150.0,100.0,9000.0,2000000.0,242.0,3.0,2.0,1,1,1,0,1,1,1,6
3,20210625_1400,duxn23dk,150.0,100.0,9000.0,400000.0,240.0,2.0,2.0,1,1,1,1,0,0,1,5
6,20210628_1000,mh9jvilu,150.0,100.0,9000.0,400000.0,242.0,3.0,2.0,1,1,1,1,1,1,1,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
289,20211018_1200,sjnia5kz,150.0,100.0,9000.0,400000.0,242.0,3.0,2.0,1,1,1,1,1,1,1,7
290,20211018_1200,e1u5e0p2,150.0,100.0,9000.0,400000.0,242.0,3.0,2.0,1,1,1,1,1,1,1,7
291,20211018_1200,3teu8bii,150.0,100.0,9000.0,400000.0,242.0,3.0,2.0,1,1,1,1,1,1,1,7
292,20211018_1200,uvptgp2n,150.0,100.0,9000.0,400000.0,240.0,3.0,2.0,1,1,1,1,0,1,1,6


How did our participants do?  Quite well actually.  Frankly - too well.  We used these questions because in previous studies most people scored 3 or 4.  Our sample is far more numerate than the general public.  So - good on UEA students!  But in the end this is not as good for our research question, because it leaves us with little variation in scores to work with.

In [16]:
df.groupby('numeracy')[['participant_id']].count()

Unnamed: 0_level_0,participant_id
numeracy,Unnamed: 1_level_1
3,2
4,6
5,25
6,53
7,114
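
For a single column, `value_counts` is a common shorthand for this kind of tabulation.  A minimal sketch on made-up scores, showing that it gives the same counts as the `groupby` approach:

```python
import pandas as pd

# Made-up scores for five hypothetical participants.
toy = pd.DataFrame({
    "participant_id": list("abcde"),
    "numeracy": [7, 6, 7, 5, 7],
})

# The groupby-and-count pattern used above.
by_groupby = toy.groupby("numeracy")["participant_id"].count()

# value_counts orders by frequency, so we sort by score to match.
by_value_counts = toy["numeracy"].value_counts().sort_index()

print(by_groupby.tolist(), by_value_counts.tolist())
```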


There is another way of computing these scores - one that involves less repetitive typing, and also would scale much better to different (and larger) numbers of questions.

The data here are represented in "wide" format.  Each row corresponds to one participant, and within that participant we have multiple columns corresponding to responses to different questions.

We can convert the data to "long" format.  In long format, each row corresponds to one participant's response to one question.  For this purpose we'll use the `wide_to_long` function.  (https://pandas.pydata.org/pandas-docs/version/1.2.4/reference/api/pandas.wide_to_long.html)

In [17]:
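df = pd.wide_to_long(
    numeracy,
    stubnames='answer',                  # columns answer1..answer7 share this prefix
    i=['session_id', 'participant_id'],  # columns identifying each wide-format row
    j='question'                         # new column holding the numeric suffix
)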
df

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,answer
session_id,participant_id,question,Unnamed: 3_level_1
20210625_1400,giaw6638,1,150.0
20210625_1400,giaw6638,2,100.0
20210625_1400,giaw6638,3,9000.0
20210625_1400,giaw6638,4,400000.0
20210625_1400,giaw6638,5,242.0
...,...,...,...
20211018_1200,oukkjmup,3,9000.0
20211018_1200,oukkjmup,4,400000.0
20211018_1200,oukkjmup,5,242.0
20211018_1200,oukkjmup,6,3.0
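
To see exactly what `wide_to_long` does, it can help to run it on a table small enough to read in full.  The sketch below uses made-up identifiers and answers, with the same column naming pattern as our data.

```python
import pandas as pd

# Two hypothetical participants, two questions, in wide format.
wide = pd.DataFrame({
    "session_id": ["s1", "s1"],
    "participant_id": ["p1", "p2"],
    "answer1": [150.0, 300.0],
    "answer2": [100.0, 100.0],
})

long = (
    pd.wide_to_long(wide, stubnames="answer",
                    i=["session_id", "participant_id"], j="question")
    .reset_index()
    # Sort explicitly so the row order does not depend on pandas internals.
    .sort_values(["participant_id", "question"], ignore_index=True)
)
print(long)
```

Two rows with two answer columns become four rows with one `answer` column, and the numeric suffix of each column name ends up in `question`.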


In [18]:
df = df.reset_index()
df

Unnamed: 0,session_id,participant_id,question,answer
0,20210625_1400,giaw6638,1,150.0
1,20210625_1400,giaw6638,2,100.0
2,20210625_1400,giaw6638,3,9000.0
3,20210625_1400,giaw6638,4,400000.0
4,20210625_1400,giaw6638,5,242.0
...,...,...,...,...
1395,20211018_1200,oukkjmup,3,9000.0
1396,20211018_1200,oukkjmup,4,400000.0
1397,20211018_1200,oukkjmup,5,242.0
1398,20211018_1200,oukkjmup,6,3.0


Long-format data is often easier to work with for many types of analysis.  For example, if you want to look at the distribution of responses across participants for each question, you can do it in one line when the data are in long format.  Doing this analysis with wide-format data would be much more cumbersome - we would have to iterate over each of the response columns.

In [19]:
df.groupby(['question', 'answer'])[['participant_id']].count()

Unnamed: 0_level_0,Unnamed: 1_level_0,participant_id
question,answer,Unnamed: 2_level_1
1,150.0,198
1,300.0,1
1,600.0,1
2,1.0,1
2,10.0,4
2,30.0,1
2,100.0,194
3,4000.0,4
3,7500.0,1
3,7800.0,1
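
For comparison, here is a rough sketch of what the wide-format version of that tabulation might look like.  The data are made up; the point is only that we need to find the answer columns and loop over them ourselves.

```python
import pandas as pd

# Made-up wide-format responses for three hypothetical participants.
wide = pd.DataFrame({
    "participant_id": ["p1", "p2", "p3"],
    "answer1": [150.0, 150.0, 300.0],
    "answer2": [100.0, 10.0, 100.0],
})

# Pick out the response columns by their shared prefix.
answer_cols = [c for c in wide.columns if c.startswith("answer")]

# One tabulation per column, collected in a dict keyed by column name.
counts = {col: wide[col].value_counts().sort_index() for col in answer_cols}

for col, table in counts.items():
    print(col, table.to_dict())
```

This works, but the result is a collection of separate tables rather than one `DataFrame` we can carry on processing.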


Scoring the responses to be correct/incorrect is also much easier.  We can do this by creating an auxiliary `DataFrame` which gives the correct response to each question. We'll do that here by just making the `DataFrame` in memory - but for example if you had a much longer inventory of questions you might create this as another data file in your `raw` data folder.

In [21]:
correct = pd.DataFrame(
    [(1, 150), (2, 100), (3, 9000), (4, 400000), (5, 242), (6, 3), (7, 2)],
    columns=['question', 'correct']
)
correct

Unnamed: 0,question,correct
0,1,150
1,2,100
2,3,9000
3,4,400000
4,5,242
5,6,3
6,7,2
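
If the answer key did live in its own file, loading it would be a one-liner.  The sketch below uses an in-memory string as a stand-in for such a file; the filename in the comment is hypothetical and does not exist in our project.

```python
import io

import pandas as pd

# Stand-in for a hypothetical file such as data/raw/numeracy_key.csv.
key_csv = io.StringIO(
    "question,correct\n"
    "1,150\n"
    "2,100\n"
    "3,9000\n"
)

# With a real file we would pass its path to read_csv instead.
correct = pd.read_csv(key_csv)
print(correct)
```

Keeping the key as data rather than code means a longer inventory of questions needs no changes to the analysis itself.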


Then we can use a `merge` to add the correct answer to each row of our long-format `DataFrame`.  This is much more elegant, and more maintainable, than the way we did this in wide-format with `assign` above.

In [22]:
df = df.merge(correct, how='left', on=['question'])
df

Unnamed: 0,session_id,participant_id,question,answer,correct
0,20210625_1400,giaw6638,1,150.0,150
1,20210625_1400,giaw6638,2,100.0,100
2,20210625_1400,giaw6638,3,9000.0,9000
3,20210625_1400,giaw6638,4,400000.0,400000
4,20210625_1400,giaw6638,5,242.0,242
...,...,...,...,...,...
1395,20211018_1200,oukkjmup,3,9000.0,9000
1396,20211018_1200,oukkjmup,4,400000.0,400000
1397,20211018_1200,oukkjmup,5,242.0,242
1398,20211018_1200,oukkjmup,6,3.0,3
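
When merging in a lookup table like this, it can be reassuring to have `pandas` check our assumptions.  A minimal sketch on made-up data, using the `validate` and `indicator` arguments of `merge`:

```python
import pandas as pd

# Made-up long-format responses and answer key.
long = pd.DataFrame({
    "participant_id": ["p1", "p1", "p2", "p2"],
    "question": [1, 2, 1, 2],
    "answer": [150.0, 100.0, 300.0, 100.0],
})
correct = pd.DataFrame({"question": [1, 2], "correct": [150, 100]})

merged = long.merge(
    correct, how="left", on="question",
    validate="many_to_one",  # raises if a question appears twice in the key
    indicator=True,          # adds a _merge column recording where each row matched
)

# Responses to questions that are missing from the key would show up here.
unmatched = merged[merged["_merge"] != "both"]
print(len(merged), len(unmatched))
```

With `how='left'` the number of rows should be unchanged by the merge, which is another quick check worth making.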


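Likewise, scoring each question is now much easier to write.  Note that at this stage the `numeracy` column holds a 0/1 indicator for each individual response; summing it within each participant then gives that participant's numeracy score.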

In [23]:
df = df.assign(
    numeracy=lambda x: (x['answer'] == x['correct']).astype(int)
)
df

Unnamed: 0,session_id,participant_id,question,answer,correct,numeracy
0,20210625_1400,giaw6638,1,150.0,150,1
1,20210625_1400,giaw6638,2,100.0,100,1
2,20210625_1400,giaw6638,3,9000.0,9000,1
3,20210625_1400,giaw6638,4,400000.0,400000,1
4,20210625_1400,giaw6638,5,242.0,242,1
...,...,...,...,...,...,...
1395,20211018_1200,oukkjmup,3,9000.0,9000,1
1396,20211018_1200,oukkjmup,4,400000.0,400000,1
1397,20211018_1200,oukkjmup,5,242.0,242,1
1398,20211018_1200,oukkjmup,6,3.0,3,1


In [33]:
scores = df.groupby(['participant_id'])[['numeracy']].sum()
scores

Unnamed: 0_level_0,numeracy
participant_id,Unnamed: 1_level_1
00g32yc1,7
0ozxf8xh,7
0uviu3c8,7
1cnerogk,5
21otsptc,3
...,...
zmaoyq1i,6
znmoz035,4
zu809ozq,6
zw36ul8j,7


In [34]:
scores = scores.reset_index()
scores

Unnamed: 0,participant_id,numeracy
0,00g32yc1,7
1,0ozxf8xh,7
2,0uviu3c8,7
3,1cnerogk,5
4,21otsptc,3
...,...,...
195,zmaoyq1i,6
196,znmoz035,4
197,zu809ozq,6
198,zw36ul8j,7
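
The merge, scoring and summing steps can also be chained into a single expression.  This is a sketch on made-up data; the intermediate column name `is_correct` is my own choice here, to keep the per-response indicator distinct from the final score.

```python
import pandas as pd

# Made-up long-format responses and answer key.
long = pd.DataFrame({
    "participant_id": ["p1", "p1", "p2", "p2"],
    "question": [1, 2, 1, 2],
    "answer": [150.0, 100.0, 300.0, 100.0],
})
correct = pd.DataFrame({"question": [1, 2], "correct": [150, 100]})

scores = (
    long.merge(correct, how="left", on="question")
    # 1 if the response matches the key, 0 otherwise.
    .assign(is_correct=lambda x: (x["answer"] == x["correct"]).astype(int))
    # as_index=False keeps participant_id as a column, so no reset_index is needed.
    .groupby("participant_id", as_index=False)["is_correct"].sum()
    .rename(columns={"is_correct": "numeracy"})
)
print(scores)
```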


And we can see that we get the same distribution of numeracy scores via the "long-format" route as we did via the "wide-format" route.

In [27]:
scores.groupby('numeracy')[['participant_id']].count()

Unnamed: 0_level_0,participant_id
numeracy,Unnamed: 1_level_1
3,2
4,6
5,25
6,53
7,114


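To go in the opposite direction, from long format back to wide, we can use `pivot`.  Because we pivot three value columns at once, the resulting column labels have two levels: the variable name and the question number.  (https://pandas.pydata.org/pandas-docs/version/1.2.4/reference/api/pandas.DataFrame.pivot.html)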

In [96]:
pd.pivot(df, index=['session_id', 'participant_id'], columns='question', values=['answer', 'correct', 'numeracy'])

Unnamed: 0_level_0,Unnamed: 1_level_0,answer,answer,answer,answer,answer,answer,answer,correct,correct,correct,correct,correct,correct,correct,numeracy,numeracy,numeracy,numeracy,numeracy,numeracy,numeracy
Unnamed: 0_level_1,question,1,2,3,4,5,6,7,1,2,3,...,5,6,7,1,2,3,4,5,6,7
session_id,participant_id,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2,Unnamed: 22_level_2
1qzq4xiq,34gfep4y,150.0,100.0,7980.0,400000.0,242.0,3.0,2.0,150.0,100.0,9000.0,...,242.0,3.0,2.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0
1qzq4xiq,70uhesln,150.0,100.0,9000.0,400000.0,240.0,3.0,2.0,150.0,100.0,9000.0,...,242.0,3.0,2.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0
1qzq4xiq,9xmkmjqn,150.0,100.0,9000.0,400000.0,244.0,3.0,2.0,150.0,100.0,9000.0,...,242.0,3.0,2.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0
1qzq4xiq,bxvbfy2c,150.0,100.0,9000.0,400000.0,242.0,3.0,2.0,150.0,100.0,9000.0,...,242.0,3.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
1qzq4xiq,cw8zvr2h,150.0,100.0,9000.0,400000.0,242.0,3.0,2.0,150.0,100.0,9000.0,...,242.0,3.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
z4srslpi,lfrzjoj3,150.0,100.0,9000.0,400000.0,242.0,3.0,2.0,150.0,100.0,9000.0,...,242.0,3.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
z4srslpi,q0ffsac5,150.0,100.0,9000.0,400000.0,242.0,3.0,2.0,150.0,100.0,9000.0,...,242.0,3.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
z4srslpi,td1cwjdp,150.0,100.0,9000.0,400000.0,242.0,3.0,2.0,150.0,100.0,9000.0,...,242.0,3.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
z4srslpi,uel5pire,150.0,100.0,9000.0,400000.0,242.0,3.0,2.0,150.0,100.0,9000.0,...,242.0,3.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
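
If we wanted to get back to flat column names like `answer1`, one possible approach is to pivot a single value column and then relabel.  A sketch on made-up data:

```python
import pandas as pd

# Made-up long-format responses.
long = pd.DataFrame({
    "participant_id": ["p1", "p1", "p2", "p2"],
    "question": [1, 2, 1, 2],
    "answer": [150.0, 100.0, 300.0, 100.0],
})

# One row per participant, one column per question.
wide = long.pivot(index="participant_id", columns="question", values="answer")

# Rebuild the original-style labels from the question numbers.
wide.columns = [f"answer{q}" for q in wide.columns]
wide = wide.reset_index()
print(wide)
```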


Although we got less variation in the numeracy scores than we predicted, it is still interesting to see whether numeracy correlates with any of the other demographics.

In [30]:
demographics = pd.read_csv("data/prepared/demographics.csv")

In what I hope is now starting to feel routine, we'll take our numeracy scores and merge them with the demographics by `participant_id`.

In [35]:
scores = scores.merge(demographics, how='left', on='participant_id')
scores

Unnamed: 0,participant_id,numeracy,session_id,gender,age,countryborn,countrynow,department,degree,timeuea
0,00g32yc1,7,20210630_1000,M,23,United Kingdom,United Kingdom,ECO,MA/MSc,5th+
1,0ozxf8xh,7,20210714_1130,M,20,United Kingdom,United Kingdom,NBS,BSc,2nd
2,0uviu3c8,7,20210702_1300,F,22,United Kingdom,United Kingdom,DEV,BSc,3rd
3,1cnerogk,5,20211013_1300,,20,Spain,United Kingdom,PSY,INTO,1st
4,21otsptc,3,20210702_1300,F,26,Nepal,Nepal,DEV,MA/MSc,
...,...,...,...,...,...,...,...,...,...,...
195,zmaoyq1i,6,20210702_1430,F,21,,United Kingdom,NBS,BSc,2nd
196,znmoz035,4,20211006_1300,F,23,Hong Kong,United Kingdom,MED,BSc,1st
197,zu809ozq,6,20210628_1600,M,19,China,China,NBS,BSc,2nd
198,zw36ul8j,7,20210628_1000,M,40,United Kingdom,United Kingdom,PPL,BSc,3rd


We can look at the relationship between gender and numeracy score.  First we could just look at average scores:

In [51]:
scores.groupby('gender')[['numeracy']].mean()

Unnamed: 0_level_0,numeracy
gender,Unnamed: 1_level_1
F,6.275
M,6.5
O,6.666667
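
An average on its own can hide how many observations it is based on.  A small sketch, on made-up data, of reporting the group size alongside the mean using `agg`:

```python
import pandas as pd

# Made-up scores; the "O" group deliberately has a single member.
toy = pd.DataFrame({
    "gender": ["F", "F", "F", "M", "M", "O"],
    "numeracy": [7, 6, 5, 7, 6, 7],
})

# agg with a list of function names gives one column per statistic.
summary = toy.groupby("gender")["numeracy"].agg(["mean", "count"])
print(summary)
```

Seeing the count next to the mean makes it easier to judge how much weight to put on the average for a small group.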


But it's often more informative to do a cross-tabulation breakdown.  Following a similar pattern to before, we group by both variables and count:

In [52]:
df = scores.groupby(['gender', 'numeracy'])[['participant_id']].count()
df

Unnamed: 0_level_0,Unnamed: 1_level_0,participant_id
gender,numeracy,Unnamed: 2_level_1
F,3,1
F,4,5
F,5,19
F,6,30
F,7,65
M,3,1
M,4,1
M,5,4
M,6,22
M,7,46


As we observed already, it turned out we had rather more females than males in our study.  So it would be useful to convert the counts into proportions within each gender, so that each gender's row sums to 1.  We can accomplish this by grouping on the first level of the index (`gender`) and then calling `transform`.  (https://pandas.pydata.org/pandas-docs/version/1.2.4/reference/api/pandas.core.groupby.DataFrameGroupBy.transform.html)

In [54]:
df = df.groupby(level=0).transform(lambda x: x / x.sum())
df

Unnamed: 0_level_0,Unnamed: 1_level_0,participant_id
gender,numeracy,Unnamed: 2_level_1
F,3,0.008333
F,4,0.041667
F,5,0.158333
F,6,0.25
F,7,0.541667
M,3,0.013514
M,4,0.013514
M,5,0.054054
M,6,0.297297
M,7,0.621622
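
The key property of `transform` is that it returns a result with the same shape as its input, with each group's summary repeated for every row in that group.  A minimal sketch on made-up counts:

```python
import pandas as pd

# Made-up counts indexed by (gender, numeracy).
counts = pd.Series(
    [1, 3, 2, 2],
    index=pd.MultiIndex.from_tuples(
        [("F", 6), ("F", 7), ("M", 6), ("M", 7)],
        names=["gender", "numeracy"],
    ),
    name="n",
)

# Each row gets the total for its own gender: 4, 4, 4, 4 here.
totals = counts.groupby(level="gender").transform("sum")

# Dividing row by row then gives within-gender proportions.
shares = counts / totals
print(shares)
```

Because `totals` lines up with `counts` row for row, the division needs no further merging or reindexing.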


In [55]:
df = df.unstack(1)
df

Unnamed: 0_level_0,participant_id,participant_id,participant_id,participant_id,participant_id
numeracy,3,4,5,6,7
gender,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
F,0.008333,0.041667,0.158333,0.25,0.541667
M,0.013514,0.013514,0.054054,0.297297,0.621622
O,,,,0.333333,0.666667


You might like to round the proportions for easier viewing:

In [56]:
df = df.round(2)
df

Unnamed: 0_level_0,participant_id,participant_id,participant_id,participant_id,participant_id
numeracy,3,4,5,6,7
gender,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
F,0.01,0.04,0.16,0.25,0.54
M,0.01,0.01,0.05,0.3,0.62
O,,,,0.33,0.67
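
As an aside, `pandas` also has a dedicated `crosstab` function which can produce a table like this in one call.  This is an alternative to the group-transform-unstack route, not what we used above; the sketch runs on made-up data.

```python
import pandas as pd

# Made-up scores for six hypothetical participants.
toy = pd.DataFrame({
    "gender": ["F", "F", "F", "F", "M", "M"],
    "numeracy": [7, 7, 6, 5, 7, 6],
})

# normalize="index" divides each row by its row total.
shares = pd.crosstab(toy["gender"], toy["numeracy"], normalize="index").round(2)
print(shares)
```

One difference to be aware of is that `crosstab` reports combinations that never occur as 0 rather than as missing values.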


In [58]:
scores = scores.assign(
    is_eco=lambda x: x['department'] == "ECO"
)
scores

Unnamed: 0,participant_id,numeracy,session_id,gender,age,countryborn,countrynow,department,degree,timeuea,is_eco
0,00g32yc1,7,20210630_1000,M,23,United Kingdom,United Kingdom,ECO,MA/MSc,5th+,True
1,0ozxf8xh,7,20210714_1130,M,20,United Kingdom,United Kingdom,NBS,BSc,2nd,False
2,0uviu3c8,7,20210702_1300,F,22,United Kingdom,United Kingdom,DEV,BSc,3rd,False
3,1cnerogk,5,20211013_1300,,20,Spain,United Kingdom,PSY,INTO,1st,False
4,21otsptc,3,20210702_1300,F,26,Nepal,Nepal,DEV,MA/MSc,,False
...,...,...,...,...,...,...,...,...,...,...,...
195,zmaoyq1i,6,20210702_1430,F,21,,United Kingdom,NBS,BSc,2nd,False
196,znmoz035,4,20211006_1300,F,23,Hong Kong,United Kingdom,MED,BSc,1st,False
197,zu809ozq,6,20210628_1600,M,19,China,China,NBS,BSc,2nd,False
198,zw36ul8j,7,20210628_1000,M,40,United Kingdom,United Kingdom,PPL,BSc,3rd,False


What about ECO students?  Do ECO students score more highly on numeracy than others?

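We can follow the same pattern as above - but exercise our fluent-interface muscles to write the algorithm for computing the table compactly as a single expression!  Passing `fill_value=0` to `unstack` fills in any combination that does not occur with zero, rather than leaving it missing.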

In [60]:
df = (
    scores.groupby(['is_eco', 'numeracy'])[['participant_id']].count()
    .groupby(level=0).transform(lambda x: x / x.sum())
    .unstack(1, fill_value=0)
    .round(2)
)
df

Unnamed: 0_level_0,participant_id,participant_id,participant_id,participant_id,participant_id
numeracy,3,4,5,6,7
is_eco,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
False,0.01,0.03,0.15,0.27,0.54
True,0.0,0.03,0.0,0.26,0.71
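
Having written the same pipeline twice, we might wrap it in a small function so it can be reused with any grouping column.  The function name `share_table` and its parameters are hypothetical, and the sketch is run on made-up data.

```python
import pandas as pd


def share_table(data, by, score="numeracy", id_col="participant_id"):
    """Proportion of each score within each level of `by` (hypothetical helper)."""
    return (
        data.groupby([by, score])[id_col].count()
        # Divide each count by the total for its group.
        .groupby(level=0).transform(lambda x: x / x.sum())
        # Scores become columns; combinations that never occur become 0.
        .unstack(1, fill_value=0)
        .round(2)
    )


# Made-up participants in two departments.
toy = pd.DataFrame({
    "participant_id": list("abcdef"),
    "department": ["ECO", "ECO", "NBS", "NBS", "NBS", "NBS"],
    "numeracy": [7, 7, 7, 6, 6, 5],
})

table = share_table(toy, by="department")
print(table)
```

The same call with a different `by` column would give the breakdown by gender or by any other categorical variable.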
