In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Recall that in this module we're sticking with a given set of versions of libraries (such as `pandas`) which match the Anaconda installation available in the IT labs.  It is a convention in Python libraries that the version string of the library is available via the `__version__` attribute of the module.

In this notebook I'll include links to the documentation for various function calls corresponding to the version of `pandas` we're using.  The `pandas` documentation site maintains versions of the documentation for all (recent) releases - you'll always want to check that you're consulting the documentation for the version you're using, as libraries do evolve over time.

In [None]:
pd.__version__

In the experiment, we collected the data in two batches.  These are the folders `batch1` and `batch2` in `data/raw`.  Each folder has files with the same names, and the files with the same names have the same **schema** - that is to say, they have the same set of column names, and each column name means the same thing in the two batches.

In [None]:
decisions1 = pd.read_csv("data/raw/batch1/decisions.csv")
decisions1

In [None]:
decisions2 = pd.read_csv("data/raw/batch2/decisions.csv")
decisions2

Well, so far so good; there is the same number of columns across the two `decisions` files, and visually scanning the tables suggests they do have the same schemas.  But let's check!

In [None]:
decisions1.columns == decisions2.columns

File formats such as CSV do not have any way to represent the types of the data they contain - that is, whether the data are integers, floating-point numbers, text strings, dates, and so on.  `read_csv` [https://pandas.pydata.org/pandas-docs/version/1.2.4/reference/api/pandas.read_csv.html] attempts to *infer* the type of data by inspecting the contents of the file.  This often works well if your input data are well-behaved.  Because today we are working with data which "we" generated in an experiment where we wrote the program, our data will tend to be tidy.  But again we will check to be sure!

If you have a `DataFrame`, you can get a list of the data types using the `dtypes` attribute [https://pandas.pydata.org/pandas-docs/version/1.2.4/reference/api/pandas.DataFrame.dtypes.html].

In [None]:
decisions1.dtypes

**Fun fact**: `dtypes` is itself a `pandas` `Series`!

In [None]:
type(decisions1.dtypes)

So all of the things you've learned about working with `Series` apply - including that you can get the data type for a column via using the square-brackets notation like this:

In [None]:
decisions1.dtypes['player.lotterychoice']

This also means that we can check for the equality of the `dtypes` of our two `DataFrame`s by using the `==` operator.  Recall that this operator works by comparing entries with the same index label, so you get out another `Series` which shows the result of the comparison for each column.

In [None]:
decisions1.dtypes == decisions2.dtypes

We want it to be the case that all of our columns have the same datatype - we can check this using the `all()` method on the `Series`. [https://pandas.pydata.org/pandas-docs/version/1.2.4/reference/api/pandas.Series.all.html]

In [None]:
(decisions1.dtypes == decisions2.dtypes).all()
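The column-name check and the dtype check can be wrapped into one reusable test.  The following is a minimal sketch on made-up data; `same_schema` is a hypothetical helper name, not something `pandas` provides.  It uses `Index.equals` for the column names, which returns a single `True`/`False` rather than one result per column.

```python
import pandas as pd


def same_schema(a, b):
    """True if both DataFrames have the same column names, in the same order, with the same dtypes."""
    # Check the names first; the dtype comparison needs identical labels.
    if not a.columns.equals(b.columns):
        return False
    return bool((a.dtypes == b.dtypes).all())


# Made-up batches: the third stores `choice` as text rather than numbers.
batch1 = pd.DataFrame({"id": [1, 2], "choice": [0.0, 1.0]})
batch2 = pd.DataFrame({"id": [3, 4], "choice": [1.0, 0.0]})
batch3 = pd.DataFrame({"id": [5, 6], "choice": ["p", "q"]})

print(same_schema(batch1, batch2))
print(same_schema(batch1, batch3))
```

Checking the names before the dtypes matters because comparing two `Series` with `==` expects them to carry the same labels.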

At this point we feel confident enough that our two `DataFrame`s do have the same schema, so we can go ahead and concatenate them using `concat` [https://pandas.pydata.org/pandas-docs/version/1.2.4/reference/api/pandas.concat.html].

The index of our two `DataFrame`s was simply an auto-generated row number when we read the files in.  These row numbers don't have any meaning to us, so we use the `ignore_index` parameter to tell `pandas` to create a new index after concatenating the data, simply labeling each observation by an (arbitrary) row number.

In [None]:
raw_decisions = pd.concat([decisions1, decisions2], ignore_index=True)
raw_decisions
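To see what `ignore_index` changes, here is a small sketch on two made-up `DataFrame`s (the names `first` and `second` are just for illustration).  Without the parameter, each piece keeps its own row labels, so labels repeat in the result.

```python
import pandas as pd

first = pd.DataFrame({"value": [10, 20]})
second = pd.DataFrame({"value": [30, 40]})

# Default behaviour: the original row labels are carried over.
kept = pd.concat([first, second])

# With ignore_index=True, rows are relabelled from 0 upwards.
fresh = pd.concat([first, second], ignore_index=True)

print(list(kept.index))
print(list(fresh.index))
```

Repeated row labels are legal in `pandas`, but they make label-based lookups ambiguous, which is one reason to ask for a fresh index here.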

As the name of the folder suggests, the data we have in `raw` has come directly from the source - in our case, the server on which the experiment was hosted.  The files we have stored here are **completely untouched**.

This illustrates one of the key principles of data science projects:

**RAW (SOURCE) DATA IS IMMUTABLE**

What this means is that we always keep a copy of the file(s) we start with.  Those files will rarely be in exactly the format we want.  But that's OK.

What many (most) people are tempted to do (or, in fact, do!) is to start manipulating the file by, for example, loading it into Excel (shudder!), editing it, and then saving it back, often overwriting the original.  **You should never do this.**  Perhaps once upon a time, before there were great libraries like `pandas` (or the Tidyverse in R, or other similar libraries), cleaning and transforming data was difficult to do, and maybe manual editing was a practical if imperfect solution.

But today there is no excuse!  A key learning objective of this module (basically, the entire reason it exists) is to give you the tools to manage, transform, and analyse data such that every step you take is completely reproducible by yourself, and by anyone else.  Especially in the next four lectures, we will develop good habits and practices for how to accomplish that.

In the case of the dataset from our experiment, the experiment software keeps some standard data fields which are used only for some types of experiments.  These were not relevant for our experiment, and so they are null (blank) in all of our observations.  We can see which entries are null by calling `isnull` [https://pandas.pydata.org/pandas-docs/version/1.2.4/reference/api/pandas.DataFrame.isnull.html]

In [None]:
raw_decisions.isnull()

That gives us whether the individual cells are nulls - we want to know which columns are entirely null.  Just as before, we can use `all()` to aggregate the columns and report whether all of the entries are nulls:

In [None]:
raw_decisions.isnull().all()

Dropping columns which are totally null can be done with the `dropna()` method on a `DataFrame` [https://pandas.pydata.org/pandas-docs/version/1.2.4/reference/api/pandas.DataFrame.dropna.html].  We will use it here to drop columns - but you can also use the method to drop rows which are entirely null.  (We don't have any of those.)

`dropna` comes with a bit of a "gotcha" - its default mode is to drop a column if **any** of the entries are null.  In our case, we want to make sure to drop only when **all** of the entries are null.

We'll assign the resulting `DataFrame` with the null columns dropped to a variable `df`.  The variable name `df` is commonly used by convention when you're working with just one `DataFrame` and don't need to distinguish among more than one - you will see it in a lot of examples in the `pandas` documentation and elsewhere.  There's nothing special about using `df` other than it's mnemonic and easy to type - you could use any variable name you wanted.

In [None]:
df = raw_decisions.dropna(axis='columns', how='all')
df
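The difference between the two modes of `dropna` is easiest to see on a tiny example.  This is a sketch on made-up data, with one complete column, one partly null column, and one entirely null column.

```python
import pandas as pd

toy = pd.DataFrame({
    "complete": [1, 2, 3],
    "partial": [1.0, None, 3.0],
    "empty": [None, None, None],
})

# Default (how='any'): a single null is enough to drop the column.
default_drop = toy.dropna(axis="columns")

# how='all': only columns with no data at all are dropped.
all_drop = toy.dropna(axis="columns", how="all")

print(list(default_drop.columns))
print(list(all_drop.columns))
```

With the default, the `partial` column would be lost along with `empty`, taking real observations with it.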

Although we have no rows where the data are all null, we do have some rows that we don't want to consider for analysis.  Here we need to apply our specialist knowledge of the dataset and how it was generated.

First, the experimental software has a "demo" mode that is used for testing purposes.  Because part of testing is ensuring data is recorded correctly, the "demo" data also appears in this file.  However, we don't want to include it for data analysis.  So, we want to remove rows which are flagged as being demo.

To select a subset of rows, the best way is to use the `query` method [https://pandas.pydata.org/pandas-docs/version/1.2.4/reference/api/pandas.DataFrame.query.html].  This is a bit of an unusual method in that it takes a text string - and the text string is then a logical expression.  (There are practical reasons for this, which perhaps I'll have time to mention in due course...)

We have one extra wrinkle here, which is that because we have a full-stop in the variable name, we need to surround the field name in the expression using back-ticks.

In [None]:
df = df.query("`session.is_demo` == 0")
df

Second, the file also contains rows for participant slots that were opened up but never used, because the participant did not turn up for the experiment.  Those rows have no lottery choice recorded, so we keep only the rows where `player.lotterychoice` is not null.

In [None]:
df = df.query("`player.lotterychoice`.notnull()")
df
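If the string-based syntax of `query` feels opaque, it may help to see that it selects the same rows as a boolean mask.  The sketch below uses made-up data with the same dotted column names, and compares the two approaches.

```python
import pandas as pd

toy = pd.DataFrame({
    "session.is_demo": [0, 1, 0],
    "player.lotterychoice": [1.0, 0.0, None],
})

# query: the column name goes in back-ticks because it contains a full-stop.
by_query = toy.query("`session.is_demo` == 0")

# The same selection written as a boolean mask.
by_mask = toy[toy["session.is_demo"] == 0]

print(by_query.equals(by_mask))
```

Both forms keep the original row labels of the rows they select, so the results can be compared directly.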

In our file, most of the columns are actually redundant for analysis purposes - they're there to support the infrastructure of running the experiment, or for diagnostic reasons (confirming the experiment software is doing what we want it to).  It's good for us to keep them in our `raw` data, but we won't need them for data analysis.

We can shape our `DataFrame` to include the columns we want using `reindex()`.  [https://pandas.pydata.org/pandas-docs/version/1.2.4/reference/api/pandas.DataFrame.reindex.html]. We give `reindex` a list of column names, and it returns a `DataFrame` with the data just from those columns.  Notice that we also use it to re-order the columns here.  My own personal taste is to put "metadata" at the start, in decreasing order of scope.  So we have sessions, and in each session there is a participant, and each participant plays a number of rounds, hence the ordering of the first three columns.  (This is a personal preference and not a hard rule, but it is good to develop a convention and try to stick with it!  Operations like `reindex()` make this easy to do.)

In [None]:
df = df.reindex(
    columns=[
        'session.label', 'participant.code', 'subsession.round_number', 'player.menu_number',
        'player.displayed_first', 'player.lotterychoice'
    ]
)
df

OK, so earlier we saw that having full-stops in column names can be annoying, because when we ran `query()` we had to put extra characters in our query string.  Indeed, a lot of software does not like column names with periods, or spaces, or other characters in them.

We can use the `rename` method on `DataFrame` to re-label columns.  [https://pandas.pydata.org/pandas-docs/version/1.2.4/reference/api/pandas.DataFrame.rename.html]

There are a few ways you can use `rename`.  One way is to pass a function to the `columns` parameter.  This function is then called on each column name.  This function takes each column name, and replaces all instances of a full-stop with an underscore - usually, underscores are a safe character to have in your column names.

In [None]:
df = df.rename(columns=lambda x: x.replace(".", "_"))
df

Another way to use `rename` is to pass a `dict` which maps how to change existing column names into the ones you want.

(In this case, we could have just done a `rename` like this one straightaway without doing the function-based one.  I wanted to show you that as well, because often you will want to do a mass-rename according to some pattern.)

In [None]:
df = df.rename(columns={
    'session_label': 'session_id',
    'participant_code': 'participant_id',
    'subsession_round_number': 'round_id',
    'player_menu_number': 'menu_id',
    'player_displayed_first': 'displayed_first',
})
df

In the experiment, each menu consisted of two choices, which are labeled 'p' and 'q' for reasons having to do with the theory being tested.  To avoid ordering effects, the software randomised whether 'p' or 'q' was displayed first.  The choice recorded is then 0 or 1, where 0 means the subject chose the first lottery shown and 1 means they chose the second.

When we analyse the data we don't want to have to worry about remembering that, so we transform the data so we just directly record the name of the lottery chosen.

In [None]:
df = df.assign(
     chose_p=lambda x: (
        ((x['displayed_first'] == "p") & (x['player_lotterychoice'] == 0)) |
        ((x['displayed_first'] == "q") & (x['player_lotterychoice'] == 1))
     )
)
df

In [None]:
df = df.assign(
    choice=lambda x: x['chose_p'].replace({True: "p", False: "q"})
)
df
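One way to sanity-check a recoding like this is to run it on a tiny hand-made table that covers all four combinations of display order and recorded choice.  This is a sketch on made-up rows; it uses `Series.map` for the final step, which is an alternative way to translate `True`/`False` into labels.

```python
import pandas as pd

# All four combinations of display order and recorded choice.
toy = pd.DataFrame({
    "displayed_first": ["p", "p", "q", "q"],
    "player_lotterychoice": [0, 1, 0, 1],
})

toy = toy.assign(
    # 'p' was chosen if it was shown first and picked first,
    # or shown second and picked second.
    chose_p=lambda x: (
        ((x["displayed_first"] == "p") & (x["player_lotterychoice"] == 0)) |
        ((x["displayed_first"] == "q") & (x["player_lotterychoice"] == 1))
    )
).assign(
    choice=lambda x: x["chose_p"].map({True: "p", False: "q"})
)
print(toy)
```

Reading the four rows against the description of the experiment confirms that a 0 with 'p' displayed first, and a 1 with 'q' displayed first, both come out as 'p'.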

Now we close with some tidying up for human consumption.  Let's start by keeping just the columns we care about, discarding the interim columns we used in computation.

In [None]:
df = df.reindex(
    columns=['session_id', 'participant_id', 'round_id', 'menu_id', 'displayed_first', 'choice']
)
df

The menu IDs were actually integers.  However, `read_csv` imported them as floating-point numbers.  This has to do with how `pandas` deals with null values in numeric types - historically, it was not possible to have null values in integer fields, and so `read_csv` made the menu ID column to be floating-point because our original file had null entries for the menu ID.  Now that we have tidied things up, we can make the menu ID to be an integer.

In [None]:
df = df.astype({'menu_id': int})
df
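The effect of null values on integer columns can be reproduced in a few lines.  The sketch below uses a made-up `Series`; it also shows the nullable `"Int64"` data type, which newer versions of `pandas` offer as a way to hold integers alongside missing values.

```python
import pandas as pd

# Integers plus a missing value: pandas falls back to floating-point.
with_gap = pd.Series([1, 2, None])
print(with_gap.dtype)

# Once the missing entries are gone, converting to integer is safe.
no_gap = with_gap.dropna().astype(int)
print(no_gap.dtype)

# The nullable integer type keeps the missing entry as a missing value.
nullable = with_gap.astype("Int64")
print(nullable.dtype)
```

Note the capital "I" in `"Int64"`, which distinguishes the nullable type from the ordinary integer type.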

And finally, let's order the data in the way that's most convenient for us.  In terms of the experiment, what we are interested in is the 25 choices of each participant, so it makes sense to order by participant.  Further, we're most interested in the choices by menu, and not necessarily the order in which the participant saw each of the menus.  So this is the most natural ordering.

Sorting data like this is entirely cosmetic - it's just for our benefit as humans.

In [None]:
df = df.sort_values(['session_id', 'participant_id', 'menu_id'])
df

If you were paying close attention, you'll note that I pulled a bit of a fast one on you.  Once I started working with the data, I was assigning the resulting `DataFrame` to a new variable `df`.  What this means is that actually the original data in `raw_decisions` is still there - and completely unchanged!

That might not seem like something to make a big deal about.  However, this turns out to be **extremely powerful** in practice, for many reasons.  Some of these reasons are technical and have to do with how calculations on `DataFrame`s are implemented behind-the-scenes; you probably won't have to worry about those until you're a very advanced user.  Other reasons however will come up in our examples over the next few weeks.

A way of thinking about how we work with data is that we do not **change** data, but instead we **transform** it.  Each step we did above had the same logical structure: You start with a `DataFrame`, you apply some operation to it, and then you get out another `DataFrame`.  Now, you might start to get worried by this.  We are working here with a quite small `DataFrame` - but when you're working on a project you might have a huge `DataFrame`.  Isn't it inefficient (both in memory and processing speed) to be creating a new `DataFrame` each time?  Well, it turns out that there are techniques which libraries can use to avoid copying data unless necessary - and so actually many of these operations can be implemented very efficiently.

Efficiency aside, because of this property that every operation transforms one `DataFrame` into another `DataFrame`, we can do something **very cool**.  We can write all of the transformations we did on our data above as **one single Python expression**:

In [None]:
df2 = (
    raw_decisions.dropna(axis='columns', how='all')
    .query("`session.is_demo` == 0")
    .query("`player.lotterychoice`.notnull()")
    .reindex(
        columns=[
            'session.label', 'participant.code', 'subsession.round_number', 'player.menu_number',
            'player.displayed_first', 'player.lotterychoice'
        ]
    )
    .rename(columns=lambda x: x.replace(".", "_"))
    .rename(columns={
        'session_label': 'session_id',
        'participant_code': 'participant_id',
        'subsession_round_number': 'round_id',
        'player_menu_number': 'menu_id',
        'player_displayed_first': 'displayed_first',
    })
    .assign(
        chose_p=lambda x: (
            ((x['displayed_first'] == "p") & (x['player_lotterychoice'] == 0)) |
            ((x['displayed_first'] == "q") & (x['player_lotterychoice'] == 1))
        )
    )
    .assign(
        choice=lambda x: x['chose_p'].replace({True: "p", False: "q"})
    )
    .reindex(
        columns=['session_id', 'participant_id', 'round_id', 'menu_id', 'displayed_first', 'choice']
    )
    .astype({'menu_id': int})
    .sort_values(['session_id', 'participant_id', 'menu_id'])
)
df2

Is this new `df2` we have constructed this way identical to the `df` we built step-by-step above?

In [None]:
(df == df2).all()
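One caveat with `==`: a missing value never compares equal to anything, including another missing value, so the elementwise check reports `False` wherever both frames hold a null.  Here is a minimal sketch on made-up data of two alternatives which treat nulls in matching positions as equal, `DataFrame.equals` and `pandas.testing.assert_frame_equal`.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

left = pd.DataFrame({"a": [1.0, float("nan")], "b": ["x", "y"]})
right = left.copy()

# Elementwise comparison: the NaN in column `a` is not equal to itself.
print((left == right).all())

# equals() returns one True/False and treats matching NaNs as equal.
print(left.equals(right))

# Raises an AssertionError describing the difference if the frames differ.
assert_frame_equal(left, right)
```

Because our cleaned data has no nulls left in the columns we kept, the elementwise check is adequate here, but the alternatives are safer as a general habit.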

Yes it is!

Isn't that absolutely brilliant?

On the one hand, that expression we used to make `df2` is a lot to take in, and when you first see that all at once it could be overwhelming.  However, look at each step individually, one-by-one: each line of code represents one logical step in the process.  So, when you write your code like this, you're also documenting the process you used to go from the raw data to the finished product.

This style of programming is sometimes called a more "declarative" style, and comes from the world of what's called "functional programming".  The emphasis in declarative-style programming is on writing what task you want the computer to do, as opposed to micro-managing the implementation of how the task is carried out.  (The latter is sometimes called "imperative programming".)  When you use a library like `pandas`, you can interact with the data at a very high level of abstraction, and leave it to the library implementers to come up with efficient ways to carry out each of those operations.

The style of chaining of the `DataFrame` methods used here is sometimes called a *fluent interface*.  If you are familiar with R (or someday use it), the `dplyr` library in R's `tidyverse` uses the "%>%" operator, which accomplishes basically the same thing.

When we're working in Jupyter notebooks, we'll typically only take one step at a time because we're working interactively and want to see the output of each step.  So we won't too often chain all of our transformations together like the above.  In a few weeks' time we'll talk about how to organise your code into scripts, and there we will tend to use this chaining technique to very powerful effect!

Let's do a little more data checking and a bit of analysis.  First, we expect that each participant should have 25 choices recorded.

To do this, we group our data by `participant_id`. [https://pandas.pydata.org/pandas-docs/version/1.2.4/reference/api/pandas.DataFrame.groupby.html].   The easiest way to think of what `groupby` does is that it creates a collection of sub-`DataFrame`s, one for each of the values that we have grouped by.

Then, we can do operations on each of those individual sub-`DataFrame`s.  Here, we will use the `count` method, which gives the number of non-null values in each column. [https://pandas.pydata.org/pandas-docs/version/1.2.4/reference/api/pandas.core.groupby.DataFrameGroupBy.count.html].

When we're working with a `DataFrame`, all of our operations get applied individually to each row.  With `groupby` it's the same idea, except the operations are applied to **groups** of rows instead of individual rows.

In [None]:
df.groupby(['participant_id']).count()

Are all of the entries equal to 25 as expected?

In [None]:
(df.groupby(['participant_id']).count() == 25).all()
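Because `count` reports non-null values, it can differ from the number of rows in a group.  The sketch below, on made-up data, contrasts it with `size`, which counts rows regardless of nulls.

```python
import pandas as pd

toy = pd.DataFrame({
    "participant_id": ["a", "a", "b", "b", "b"],
    "choice": ["p", None, "q", "p", "q"],
})

# Non-null entries per column within each group.
counts = toy.groupby("participant_id").count()

# Number of rows in each group, nulls included.
sizes = toy.groupby("participant_id").size()

print(counts)
print(sizes)
```

In our prepared data the two agree, since we have already removed the rows with no recorded choice.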

We also expect that because we have 200 participants, we should have 200 observations for each menu.

In [None]:
df.groupby(['menu_id']).count()

In [None]:
(df.groupby(['menu_id']).count() == 200).all()

But having the right number of participants and right number of menus isn't quite enough - we also want that each participant should have seen each menu exactly once.  There are a few ways we can check for this.  One is by grouping jointly by `participant_id` and `menu_id`:

In [None]:
df.groupby(['participant_id', 'menu_id']).count()

In [None]:
(df.groupby(['participant_id', 'menu_id']).count() == 1).all()

Another way is by using the method `duplicated`, which returns `True` for a row if it matches a previous row over all of the fields given. [https://pandas.pydata.org/pandas-docs/version/1.2.4/reference/api/pandas.DataFrame.duplicated.html]

In [None]:
df.duplicated(['participant_id', 'menu_id'])

Here we would want to know if any of the entries in that `Series` were `True` - to do that we use the method `any`. [https://pandas.pydata.org/pandas-docs/version/1.2.4/reference/api/pandas.Series.any.html]

In [None]:
df.duplicated(['participant_id', 'menu_id']).any()
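Our data has no duplicates, so it is worth seeing what `duplicated` reports when there is one.  This sketch uses made-up rows in which participant "a" appears twice for menu 1.

```python
import pandas as pd

toy = pd.DataFrame({
    "participant_id": ["a", "a", "a", "b"],
    "menu_id": [1, 2, 1, 1],
})

# Only the later occurrence of a repeated pair is flagged.
flags = toy.duplicated(["participant_id", "menu_id"])

print(list(flags))
print(flags.any())
```

By default the first occurrence is not flagged, only the repeats that follow it.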

That's a good place to save our work.  Remembering what we were saying before about always preserving the original data, we store the output of our work in a different directory.  I tend to use `prepared`, to indicate that it should be in a format that's ready for analysis.

In [None]:
df.to_csv("data/prepared/decisions.csv", index=False)

In this experiment there were four different treatments, which varied the information that participants were shown about the lotteries.  There is a separate `sessions` file which indicates which sessions were associated with which treatments.  We want to add this information into our table.

In [None]:
sessions = pd.read_csv("data/raw/sessions.csv")
sessions

In [None]:
df = df.merge(sessions, how='left', on='session_id')
df
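When merging, it can be reassuring to have `pandas` verify your expectations about the keys.  The sketch below uses made-up tables (`decisions_toy` and `sessions_toy` are illustrative names) and two optional parameters of `merge`: `validate`, which checks the relationship between the tables, and `indicator`, which records where each row found a match.

```python
import pandas as pd

decisions_toy = pd.DataFrame({
    "session_id": ["s1", "s1", "s2"],
    "choice": ["p", "q", "p"],
})
sessions_toy = pd.DataFrame({
    "session_id": ["s1", "s2"],
    "treatment": ["B", "E"],
})

merged = decisions_toy.merge(
    sessions_toy, how="left", on="session_id",
    validate="many_to_one",  # error if a session_id repeats in sessions_toy
    indicator=True,          # adds a `_merge` column with the match status
)
print(merged)
```

A value of `left_only` in the `_merge` column would flag a decision whose session is missing from the sessions table.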

We are expecting that participants were assigned in equal numbers to the four treatments.  Let us check this!

In [None]:
df.groupby(['treatment']).nunique()

Much of our focus in these few weeks is the process of going from raw, potentially messy source data to data which is ready for analysis.  This process, often called "data wrangling", will often take up most of your time in a real empirical project.  Ideally, by the time you start doing statistical or econometric analysis, the data are already in the format you need, so your scripts (whether in Python or Stata or another package) that do the actual analysis will tend to be short - and short scripts are easier to read and understand!

Having said that, after having done all of this work to prepare the data, I'm sure you're at least curious a bit about the results!  So let's look a little bit at some results - it'll also illustrate a few points about using `pandas` which we'll elaborate on in the next few weeks.

First, the design of the experiment is that generally the lottery labeled 'p' has a higher expected value but also a higher variance (more risk) than the lottery 'q' in that menu.  Our hypothesis in the experiment is that giving information about expected value (treatments with E in the name) will increase the frequency with which lottery 'p' is chosen, while treatments giving information about risk (treatments with R in the name) will increase the frequency with which 'q' is chosen.

In [None]:
df.groupby(['treatment', 'choice'])[['participant_id']].count()

We can also view the 25 lotteries that a participant chooses over the experiment as them creating a 'portfolio'.  We can ask how much the portfolios differ across treatments.  To do this we'll use an auxiliary data file which contains the expected value and standard deviation of the lotteries.

In [None]:
lotteries = pd.read_csv("data/raw/lotteries.csv")
lotteries

In [None]:
df = df.merge(
    lotteries.rename(columns={'lottery': 'choice'}),
    how='left', on=['menu_id', 'choice']
)
df

In our experiment, we took the simplified view that all of the lottery outcomes were realised independently.  (In real finance applications they would be correlated, and correlation is an important part of portfolio design.  However for the research question for the experiment, the simpler environment of independence is useful.)

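Recall from your basic probability:
1. The expected value of the total of the 25 lotteries is the sum of the expected values of the lotteries (this does not depend on independence);
2. The variance of the total of the 25 lotteries is the sum of the variances of the lotteries (this does depend on independence).  The standard deviation of the total is the square root of that variance; it is not the sum of the individual standard deviations.

In [None]:
portfolios = (
    df.assign(variance=lambda x: x['stdev'] ** 2)
    .groupby(['treatment', 'participant_id'])[['mean', 'variance']].sum()
    .assign(stdev=lambda x: x['variance'] ** 0.5)
    .drop(columns='variance')
)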
portfolios
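A small numerical check of how the spread of independent lotteries combines: variances add, so the standard deviation of the total is the square root of the summed variances, which is smaller than the sum of the standard deviations.  The two lotteries below are made up for illustration and are not taken from the experiment.

```python
import itertools
import math

# Two made-up lotteries, each paying one of two amounts with equal probability.
lottery_a = [0, 6]   # mean 3, standard deviation 3
lottery_b = [0, 8]   # mean 4, standard deviation 4


def mean(xs):
    return sum(xs) / len(xs)


def stdev(xs):
    """Population standard deviation of equally likely outcomes."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))


# Under independence, every pair of outcomes is equally likely.
totals = [a + b for a, b in itertools.product(lottery_a, lottery_b)]

print(mean(totals))                                              # sum of the means
print(stdev(totals))                                             # spread of the total
print(math.sqrt(stdev(lottery_a) ** 2 + stdev(lottery_b) ** 2))  # root of summed variances
print(stdev(lottery_a) + stdev(lottery_b))                       # NOT the spread of the total
```

The second and third printed values agree, while the last one is larger.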

In [None]:
portfolios = portfolios.reset_index()
portfolios

Let's have a look at averages across the treatments.

In [None]:
portfolios.groupby('treatment')[['mean', 'stdev']].mean()

We can do some quick visualisations to see whether there are really obvious patterns.  For this, we'll do some quick scatterplots using the `seaborn` library, which extends `matplotlib` by automating a number of processes around choosing axes, colours, legends, and so on.  We'll just have a look at what it can do here; we'll cover the library in more depth later.

In [None]:
sns.scatterplot(
    x='mean', y='stdev', hue='treatment',
    data=portfolios.query("treatment.isin(['B', 'E'])")
)

In [None]:
sns.scatterplot(
    x='mean', y='stdev', hue='treatment',
    data=portfolios.query("treatment.isin(['B', 'R'])")
)

Let's now turn to the individual demographics data, which we will tidy up a bit.  These are in the files called `demographics`.  We'll take a bit of a shortcut and not check the schemas are the same.  (Exercise: try it yourself!)

In [None]:
raw_demographics = pd.concat(
    [pd.read_csv("data/raw/batch1/demographics.csv"), pd.read_csv("data/raw/batch2/demographics.csv")],
    ignore_index=True
)
raw_demographics

Let's have a look at the columns and their data types.

In [None]:
raw_demographics.dtypes

In this case we know the fields we're particularly interested in: the seven "about you" questions.  Let's have a look at the data values for them.

In [None]:
raw_demographics[[
    'player.gender', 'player.age', 'player.countryborn',
    'player.countrynow', 'player.department', 'player.degree',
    'player.timeuea'
]].head(20)


As before, we know we have entries for slots that were opened up for participants who did not turn up for the experiment.  However, trying to filter participants on whether or not demographics are null would be problematic, because participants cannot be obligated to disclose any or all of their demographic information:

In [None]:
raw_demographics[[
    'player.gender', 'player.age', 'player.countryborn',
    'player.countrynow', 'player.department', 'player.degree',
    'player.timeuea'
]].query("`player.department`.isnull()").head(10)


Here's where knowing how the software works is useful, including some of that data which is more about administering the experiment rather than collecting responses directly.  For each row there's a field called `participant._index_in_pages` (note the leading underscore), which tells you how far the participant has progressed in the experiment, and also `participant._max_page_index`, which is the total number of pages needed to complete the experiment.

In [None]:
raw_demographics[[
    "participant._index_in_pages", "participant._max_page_index",
    "player.gender", "player.age"
]].head(20)

The number of pages there are in a session depends on the session.  (As we were running the experiment, we realised having extra landing pages in the instructions was helpful to keep participants together.)  So the best way to test whether a participant row is a valid observation is to see whether they reached the final page.

In [None]:
df = raw_demographics.query("`participant._index_in_pages` == `participant._max_page_index`")
df

We've got the right number of rows.  We'll check later on whether our participant IDs match up exactly with what we did with decisions.  For now, let's continue cleaning the data by selecting the columns we want (and while we're at it, let's get rid of those annoying periods in the column names).

In [None]:
df = (
    df.rename(columns=lambda x: x.replace(".", "_"))
    .reindex(
        columns=['session_code', 'participant_code',
                 'player_gender', 'player_age', 'player_countryborn',
                 'player_countrynow', 'player_department', 'player_degree',
                 'player_timeuea']

    )
)
df

Sometimes, there are fields where there is a finite list of possible answers - but that list is too long to specify completely in a question.  Countries are a good example of this; there are roughly 200 in the world (depending on how you count), but we've all had the experience of how tedious it is to pick out your country from a long drop-down list.  We have two country fields in our data; this is a good opportunity to look at ways to tidy up the data.

Let's first look at `countryborn`, and see what data values are there:

In [None]:
df.sort_values('player_countryborn')['player_countryborn'].unique()

Compared to some datasets, this isn't all that bad; most of the country names are already rather clean.  We just need to standardise a few country names, and to make a decision about how to code situations where more than one country is listed.

To accomplish this we'll use two functions:
1. `Series.str.title()`: This will convert all the strings to Title Case - that is, first letter of each word capitalised and all others lowercase; [https://pandas.pydata.org/pandas-docs/version/1.2.4/reference/api/pandas.Series.str.title.html]
2. `Series.replace()`: This takes a `dict`, and replaces each instance of a key with the corresponding value in the `dict`. [https://pandas.pydata.org/pandas-docs/version/1.2.4/reference/api/pandas.Series.replace.html]

In [None]:
df = df.assign(
    player_countryborn = lambda x: (
        x['player_countryborn'].str.title()
        .replace(
            {'Britain': 'United Kingdom',
             'British': 'United Kingdom',
             'Uk': 'United Kingdom',
             'Uk (England)': 'United Kingdom',
             'Uk, Australia': 'United Kingdom',
             'England': 'United Kingdom',
             'England / Uk': 'United Kingdom',
             'United Kingdom (England)': 'United Kingdom',
             'Denmark/Usa': 'Denmark',
             'United States Of America': 'United States',
             'Usa': 'United States',
             'Taiwan, Egypt': 'Taiwan'
            }
        )
    )
)
df.sort_values('player_countryborn')['player_countryborn'].unique()
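The reason for applying `str.title()` first is that it collapses differences in capitalisation, so the `dict` given to `replace` needs fewer entries.  Here is a sketch on a handful of made-up answers.

```python
import pandas as pd

answers = pd.Series(["uk", "UNITED KINGDOM", "England", "france", None])

cleaned = (
    answers.str.title()      # 'uk' -> 'Uk', 'UNITED KINGDOM' -> 'United Kingdom'
    .replace({
        "Uk": "United Kingdom",
        "England": "United Kingdom",
    })
)
print(list(cleaned))
```

Used with a `dict` like this, `replace` matches whole entries only, so an answer that merely contains "Uk" alongside other text is left alone unless it is listed explicitly.  Missing answers stay missing throughout.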

Looks good.  Now we'll do the same exercise with `player_countrynow`.

In [None]:
df.sort_values('player_countrynow')['player_countrynow'].unique()

In [None]:
df = df.assign(
    player_countrynow = lambda x: (
        x['player_countrynow'].str.title()
        .replace(
            {'Britain': 'United Kingdom',
             'British': 'United Kingdom',
             'Uk': 'United Kingdom',
             'Uk (England)': 'United Kingdom',
             'Uk England': 'United Kingdom',
             'Uk, Australia': 'United Kingdom',
             'England, Uk': 'United Kingdom',
             'England': 'United Kingdom',
             'England / Uk': 'United Kingdom',
             'Uk As A Visiting Student': 'United Kingdom',
             'United Kingdom (England)': 'United Kingdom',
             'Denmark/Usa': 'Denmark',
             'United States Of America': 'United States',
             'Usa': 'United States',
             'Taiwan, Egypt': 'Taiwan'
            }
        )
    )
)
df.sort_values('player_countrynow')['player_countrynow'].unique()

Now let's have a look at `player_gender`.  This is an example of a quite annoying data field - the data are recorded by the computer as integers, but you have to know the computer code to know what is what.  Because we do have the computer code, we know that 1 = Male, 2 = Female, 3 = Other, and 4 = prefer not to say.


In [None]:
df.groupby('player_gender')['participant_code'].count()

We'll recode these using letters (M, F, O), and replace 4 with true null values.

In [None]:
df = df.assign(
    player_gender = lambda x: (
        x['player_gender'].replace(
            {1: 'M',
             2: 'F',
             3: 'O',
             4: None}
        )
    )
)
df.groupby('player_gender')['participant_code'].count()

Now, let's have a look at the responses for UEA schools:

In [None]:
df.sort_values('player_department')['player_department'].unique()

These aren't too bad.  To clean these up, alongside `Series.replace` which we've already used, we'll make use of two useful methods for string manipulation:

1. `Series.str.upper()`: Converts all characters in the string to uppercase. [https://pandas.pydata.org/pandas-docs/version/1.2.4/reference/api/pandas.Series.str.upper.html]
2. `Series.str[]`: The `[]` notation on the `.str` accessor of a `Series` works just like it does on regular Python strings or lists, applied to each entry.  We'll use it here to restrict to the first three letters - which, after a bit of initial cleanup, maps to the UEA School/Faculty/programme names.

In [None]:
df = df.assign(
    player_department=lambda x: (
        x['player_department'].replace(
            {"BSc Psychology": "PSY",
             "NMS": "MED",
             "UEA": None}
        )
        .str.upper()
        .str[:3]
    )
)
df.sort_values('player_department')['player_department'].unique()
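Here is the upper-casing and slicing combination on its own, as a sketch with made-up answers, to show what each step contributes.

```python
import pandas as pd

departments = pd.Series(["eco", "Economics", "psy", None])

# Upper-case every entry, then keep the first three characters of each.
codes = departments.str.upper().str[:3]

print(list(codes))
```

As with the other `.str` methods, missing entries are passed through as missing rather than causing an error.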

The UEA degree/affiliation field is, like gender, straightforward enough if you have the coding from the software.

In [None]:
df = df.assign(
    player_degree=lambda x: (
        x['player_degree'].replace(
            {1: "INTO",
             2: "BSc",
             3: "PGDip",
             4: "MA/MSc",
             5: "PhD",
             6: "Staff",
             7: "Other",
             8: None}
        )        
    )
)
df.sort_values('player_degree')['player_degree'].unique()

Likewise, coding up the time-at-UEA question is now routine (I hope!)

In [None]:
df = df.assign(
    player_timeuea=lambda x: (
        x['player_timeuea'].replace(
            {1: "1st",
             2: "2nd",
             3: "3rd",
             4: "4th",
             5: "5th+",
             6: None}
        )
    )
)
df.sort_values('player_timeuea')['player_timeuea'].unique()

Let's take stock of where we are.

In [None]:
df

We're rather close; just a few further adjustments:

In [None]:
df = (
    df.rename(columns=lambda x: x.replace("player_", ""))
    .rename(columns={
        'session_code': 'session_id',
        'participant_code': 'participant_id'
    })
    .astype({'age': int})
)
df

We'll save our work.

In [None]:
df.to_csv("data/prepared/demographics.csv", index=False)

As a final exercise, let's do a quick look at demographics and decisions.

In [None]:
decisions = pd.read_csv("data/prepared/decisions.csv")
decisions

Let's check to confirm that our lists of participant IDs match across the two files:

In [None]:
df.sort_values('participant_id')['participant_id'].unique() == decisions.sort_values('participant_id')['participant_id'].unique()
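The elementwise comparison gives one result per participant and relies on both arrays having the same length.  An alternative sketch, on made-up IDs, is to compare the IDs as Python sets, which gives a single answer and makes it easy to list any IDs that appear on only one side.

```python
import pandas as pd

ids_a = pd.Series(["x1", "x2", "x3", "x3"])
ids_b = pd.Series(["x3", "x1", "x2"])

# Sets ignore order and repeats, so this asks: are the same IDs present?
same_ids = set(ids_a) == set(ids_b)

# IDs present in the first collection but missing from the second.
only_in_a = set(ids_a) - set(ids_b)

print(same_ids)
print(only_in_a)
```

If the check fails, the set difference tells you exactly which participants to investigate.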

In [None]:
decisions = decisions.groupby(['participant_id', 'choice'])[['menu_id']].count()
decisions

Let's just look at how many choices of 'p' each participant made.

In [None]:
decisions = decisions.reset_index()
decisions = decisions.query("choice == 'p'")
decisions

A simple thing we might look at: Is there a difference by gender in the frequency of choices of 'p'?

In [None]:
decisions = decisions.merge(df, how='left', on='participant_id')
decisions

In [None]:
decisions.groupby('gender')[['menu_id']].mean()