As a further exercise, we'll now clean and prepare the data from another behavioural economics paper, "Honesty, beliefs about honesty, and economic growth in 15 countries" (Journal of Economic Behavior & Organization 2016). This reports an online experiment.

The experiment data came from Qualtrics as an Excel file. It is quite messy.  In particular, Qualtrics (and Google Forms, and Office 365 Forms) store the responses to questions in columns where the column name is the text of the question.

**Question**: Is it good or bad that the software uses the question text as variable names?

As we have seen in our previous exercises, data cleaning involves a standard set of common tasks:

* Importing data from external files
* "Eyeballing" the data manually
* Removing unwanted rows
* Removing unwanted columns
* Renaming variables
* Recoding variables, into a new or the same variable
* Merging different datasets, by row or by column

In [None]:
import pandas as pd

In [None]:
hon_data = pd.read_excel("data/honesty-data.xlsx")
hon_data

From the 2016 paper:

*People from 15 countries took part in an online survey containing two incentivized experiments measuring honest behaviour. I use both the well-known coin flip experiment, where subjects report the result of a coin flip and are offered money for reporting "heads", and a new experimental paradigm: an online quiz in which subjects were able to cheat and this could be detected.*

To reanalyze this data we will need to know what country subjects were from; whether they reported heads in the coin flip; and what their quiz answers were.

Let's look at the column names of the data.  We have already seen how to do this with `.columns`; another way to get a nice list of column details is with the method `.info()`.  This method also has the benefit of telling us how many non-null entries are in each column.
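
As a small illustration of what `.info()` reports, here is a sketch on a made-up two-column table (the values are invented and are not from the honesty dataset). The `.count()` method gives the same non-null counts as a `Series`, which is handy if we want to use them in later code.

```python
import pandas as pd

# Hypothetical toy data, not the honesty dataset: one column has a gap.
toy = pd.DataFrame({
    "Did the coin land on heads?": [1, 2, None, 1],
    "How old are you?": [23, 35, 41, 29],
})

toy.info()              # prints dtypes and non-null counts per column
non_null = toy.count()  # the same non-null counts, returned as a Series
print(non_null)
```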

In [None]:
hon_data.info()

This looks encouraging. We see that column 33 (remember to count from zero!) is called "Did the coin land on heads?" and column 51 is "Please enter your nationality:" There are also some quiz questions in columns 22-28.

Let's look at column 33. We hope there will be two values in it, representing heads and tails:

In [None]:
hon_data["Did the coin land on heads?"].unique()

We have some good and bad news. Good news: we can see 1 and 2. Those probably mean heads and tails. Bad news: there is some other stuff in there - a question number, the question text itself, and `nan`, which stands for "not a number" and is how pandas marks a missing value.  We now need to figure out where this other data comes from.

First we can check how many rows have a value equal to 1 or 2.  Two notes:

* The method `.isin()` returns a `Series` whose elements are of type `bool`.  To negate a Boolean series, we use the tilde operator `~`.
* Here we assign the result to a `Series` that is not a column in the `DataFrame`.  We don't necessarily have to store the results of every calculation on `DataFrame` columns in the `DataFrame` itself.  In this case, our `not_1_or_2` is an auxiliary calculation we are doing.  We don't need to keep it long-term, so there's no need to assign it to a column in the `DataFrame`.
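
To see `.isin()` and `~` in isolation, here is a minimal sketch on a made-up `Series` that mixes valid codes with the kind of junk described above. The names `coin`, `is_valid` and `not_valid` are ours, for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for the coin column: valid codes mixed with junk.
coin = pd.Series([1, 2, "Did the coin land on heads?", None, 1])

is_valid = coin.isin([1, 2])  # True where the value is 1 or 2
not_valid = ~is_valid         # flips every True to False and vice versa

print(is_valid.tolist())
print(not_valid.tolist())
```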

In [None]:
is_1_or_2 = hon_data["Did the coin land on heads?"].isin([1,2])
not_1_or_2 = ~is_1_or_2
not_1_or_2

That looks OK - most of the rows seem to be 1 or 2 (so `not_1_or_2` is False). We can check by counting the rows. To do this we just take the sum of `not_1_or_2`. This trick uses the useful fact that if you add boolean values in Python, they get treated as 1 for True and 0 for False:

In [None]:
True + True + False + True
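
The same idea is available directly on a `Series`. As a sketch on made-up values, `.sum()` counts the `True` entries and `.mean()` gives the proportion of `True` entries, which is often the more useful number.

```python
import pandas as pd

# Hypothetical flags, for illustration only.
flags = pd.Series([True, False, True, True])

n_true = flags.sum()       # number of True values
share_true = flags.mean()  # proportion of True values

print(n_true, share_true)
```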

In [None]:
sum(not_1_or_2)

That seems like a lot of dodgy rows.  Looking at the dataset manually in Excel, we find:

* Many rows don't contain any data beyond the first few columns. These might be participants who gave up early, or who failed a check to qualify for the survey.
* The first row contains the question (e.g. "Did the coin land on heads?")
* Rows 322 and 323 repeat the question, along with a list of question numbers.
* Lastly, some rows are blank in column 33 - perhaps because the participant simply didn't answer.

None of these rows contain the data we need, so we can safely delete them.

In practice, it would always be a good idea to know as much as you can about how the data were collected.  For example, if you got this data from someone else, you might ask them why there are so many rows which do not seem to have valid data.

We know how to use `.query()` to select rows from a `DataFrame`.  We could use this to get the rows we are interested in by doing

```hon_data.query("`Did the coin land on heads?`.isin([1,2])")```

But `query()` has a feature that allows us to use the work we have already done.  If you put an `@` in front of a variable name, `query()` looks up that variable in your current Python scope and uses its value.  So we can use our existing `is_1_or_2` variable.

In [None]:
hon_data = hon_data.query("@is_1_or_2")
hon_data
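
For comparison, here is a sketch on made-up data showing that `query()` with an `@` variable selects the same rows as plain boolean indexing with square brackets. Which one to use is largely a matter of taste; the toy table and the name `keep` are ours.

```python
import pandas as pd

# Hypothetical toy data: 99 and the missing value play the role of junk.
toy = pd.DataFrame({"answer": [1, 99, 2, None], "id": [10, 11, 12, 13]})
keep = toy["answer"].isin([1, 2])

via_query = toy.query("@keep")  # refers to the Python variable `keep`
via_mask = toy[keep]            # plain boolean indexing

print(via_query)
print(via_mask)
```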

At this point, the horrible column names are starting to give me a headache.  And, there are a lot of columns that aren't relevant to our objectives.  (As when doing a maths problem, what you need to do in developing your recipes for cleaning up data depends on what you are trying to accomplish.  'The wise person begins at the end; the foolish person ends at the beginning.')

Let's make some progress towards rationalising the data by removing the "block randomizer" columns.  First we need to make a list of the matching column names.  Here is one way to do it:

In [None]:
block_rand = []
for c in hon_data.columns:
    if "Block Randomizer" in c:
        block_rand.append(c)
block_rand

However - this is a very common pattern in programming, and perhaps not surprisingly Python has a more compact way of writing this using **list comprehensions**.

In a list comprehension, one can put the for-loop inside the definition of the list you want to make.  The resulting code is... Pythonic!

In [None]:
[c for c in hon_data.columns if "Block Randomizer" in c]
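
To make the correspondence explicit, here is the loop and the comprehension side by side on a short, made-up list of column names. The general shape is `[expression for item in iterable if condition]`.

```python
# Hypothetical column names, for illustration only.
columns = ["Block Randomizer 1", "How old are you?", "Block Randomizer 2"]

# Loop version: start with an empty list and append the matches.
matches_loop = []
for c in columns:
    if "Block Randomizer" in c:
        matches_loop.append(c)

# Comprehension version: the same logic in a single expression.
matches_comp = [c for c in columns if "Block Randomizer" in c]

# The expression part can also transform each item, not just copy it.
shouted = [c.upper() for c in columns]

print(matches_comp)
```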

We can then write our `drop` expression quite elegantly, in a way that (we hope) makes it clear what our intention is.

In [None]:
hon_data = hon_data.drop(columns=[c for c in hon_data.columns if "Block Randomizer" in c])
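
As an aside, pandas also offers string methods on the column index itself. The sketch below is an alternative to the list comprehension, not the approach used in this notebook: on a made-up table, it builds a boolean mask over the columns and keeps the ones that do not match.

```python
import pandas as pd

# Hypothetical toy data with two "Block Randomizer" columns.
toy = pd.DataFrame({
    "Block Randomizer - A": [1, 0],
    "How old are you?": [23, 35],
    "Block Randomizer - B": [0, 1],
})

# regex=False asks for a literal substring match.
is_block = toy.columns.str.contains("Block Randomizer", regex=False)
trimmed = toy.loc[:, ~is_block]  # all rows, only the non-matching columns

print(list(trimmed.columns))
```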

There are a few columns which we definitely will want to retain; let's give them nicer names.

In [None]:
hon_data = hon_data.rename(columns={
    "Did the coin land on heads?": "heads",
    "What is your gender?": "gender",
    "How old are you?": "age"
})

Several of the columns contain whether or not the participant got the answer to quiz questions correct:

In [None]:
[c for c in hon_data.columns if "correct" in c]

We've now seen many times the benefits of having column names which are also valid Python function or variable names.  Let's clean these!

First, let's do this by defining a bespoke function that takes a column name `x` and makes it lowercase, with underscores instead of spaces, if the column name starts with "Quiz" and ends with "correct":

In [None]:
def rename_quiz_column(x):
    if x.startswith("Quiz") and x.endswith("correct"):
        return x.lower().replace(" ", "_")
    else:
        return x

print(
    hon_data.rename(columns=rename_quiz_column).columns
)

Could we do this more compactly by using a `lambda` function instead of one we define via `def`?  Your first thought might be that `rename_quiz_column` looks like it has more than one expression, so we can't write it as a `lambda`.

However, much as we can embed a `for` loop in a list comprehension, Python also allows us to do **conditional expressions**, which let us put `if`/`else` logic in a single expression.  With that we can in fact do the renaming with a `lambda`:
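
A conditional expression has the shape `value_if_true if condition else value_if_false`. Here is a minimal sketch, unrelated to our data, showing the same logic written with `def` and with `lambda`.

```python
def label(n):
    # value_if_true if condition else value_if_false
    return "even" if n % 2 == 0 else "odd"

# The same logic as a lambda, which must be a single expression.
label_lambda = lambda n: "even" if n % 2 == 0 else "odd"

print(label(4), label_lambda(7))
```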

In [None]:
hon_data = hon_data.rename(
    columns=lambda x: x.lower().replace(" ", "_") if x.startswith("Quiz") and x.endswith("correct") else x
)
print(hon_data.columns)

Which one is better - the `def` or the `lambda`?  It depends on the situation; next week when we talk about organising one's work into scripts, we'll see that perhaps in this case putting the logic in a `def` could actually be the more transparent way to write it.  But neither is wrong; it's a question of personal preference and style, not correctness.

The `nationality` variable records participants' answers to "What nationality are you?" They were given as free text:

In [None]:
hon_data['nationality'].unique()

We want these names to be consistent across subjects, so that e.g. "turkey", "Turkey" and "Türk" are the same.

As a start, we can make all the nationalities lowercase.

In [None]:
hon_data = hon_data.assign(**{
    'nationality': lambda x: x['nationality'].str.lower()
})

In [None]:
hon_data['nationality'].unique()
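
Free-text answers often carry stray spaces as well as inconsistent capitalisation. We have not checked whether that is the case in this dataset, so treat the following as a sketch on made-up values: `.str` methods can be chained, and missing values simply stay missing.

```python
import pandas as pd

# Hypothetical free-text answers with stray spaces and mixed case.
raw = pd.Series([" Turkey", "TURKEY ", "turkey", None])

clean = raw.str.strip().str.lower()  # trim whitespace, then lowercase

print(clean.unique())
```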

We can then standardise these to country names.  In the case of "t.c.", it might not be commonly known what this is an abbreviation for.  In such a case, if you have to look something up, it's probably a good idea to include a comment!

In [None]:
hon_data = hon_data.assign(**{
    'nationality': lambda x: x['nationality'].replace({
        "türk"      : "turkey",
        "turkey"    : "turkey",
        "t.c."      : "turkey", # Wikipedia: short for "Türkiye Cumhuriyeti".
        "tc"        : "turkey",
        "polish"    : "poland",
        "american"  : "US",
        "german"    : "germany",
        "russian"   : "russia",
        "italian"   : "italy",
        "australian": "australia",
        "irish"     : "ireland",
        "japanese"  : "japan",
        "malaysian" : "malaysia",
        "chinese"   : "china",
        "indian"    : "india",
        "togolese"  : "togo",
        "swiss"     : "switzerland",
        "brazilian" : "brazil",
        "guinean"   : "guinea",
    })
})
hon_data["nationality"].unique()
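
When `.replace()` is given a dictionary, it swaps whole values that match a key and leaves everything else untouched. The sketch below, on made-up answers with a shortened mapping, also lists the values the mapping did not cover, which are the ones we would need to review by hand.

```python
import pandas as pd

# Hypothetical answers and a shortened mapping, for illustration only.
answers = pd.Series(["türk", "t.c.", "polish", "romanian", "white"])
mapping = {"türk": "turkey", "t.c.": "turkey", "polish": "poland"}

standardised = answers.replace(mapping)         # unmapped values pass through
unmapped = answers[~answers.isin(list(mapping))]  # values to review by hand

print(standardised.tolist())
print(unmapped.tolist())
```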

There are still some participants we haven't dealt with - those who gave a nationality like "white" or "musulman", or a mixed nationality like "italian/irish". We can't categorize those into one of our national groups, so in this instance we are going to delete them.

Also, there are some people who are not in the group of nations selected for the survey. For example, there is one Romanian. We only want the 8 nations targeted in (this part of) the survey.

⚠️ **Choices made during data cleaning can have statistical implications!** When we delete people who describe their nationality as "white" or "musulman", we may be selecting out subjects with a particular sense of national identity! Similarly, by excluding people who were resident in one of these 8 countries but were from a different nation, we exclude migrants.

In [None]:
hon_data = hon_data.query(
    'nationality.isin(["US", "brazil", "russia", "turkey", "china", "japan", "greece", "switzerland"])'
)

In [None]:
hon_data['nationality'].value_counts()
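
Given the warning above, it can be worth tabulating what a filter throws away as well as what it keeps. Here is a sketch on made-up data with a shortened, hypothetical target list; the same pattern could be run on the real data just before the filtering step.

```python
import pandas as pd

# Hypothetical toy data and a shortened target list.
toy = pd.DataFrame({"nationality": ["US", "romania", "turkey", "white", "US"]})
targets = ["US", "turkey"]

in_targets = toy["nationality"].isin(targets)
dropped_counts = toy.loc[~in_targets, "nationality"].value_counts()
kept = toy[in_targets]

print(dropped_counts)  # what the filter would remove, and how often
```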

Because these are country names, let's capitalise them.

In [None]:
hon_data = hon_data.assign(**{
    'nationality': lambda x: x['nationality'].str.capitalize()
})
hon_data['nationality'].value_counts()

Ahh, one last adjustment; we want "US" instead of "Us"....

In [None]:
hon_data = hon_data.assign(**{
    'nationality': lambda x: x['nationality'].replace("Us", "US")
})
hon_data['nationality'].value_counts()
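
The reason for the "Us" is that `.str.capitalize()` upper-cases the first character and lower-cases all the others. One possible way to do the capitalisation in a single step, sketched here on made-up values, is `.where()`, which keeps the original value wherever the condition is true and uses the replacement elsewhere.

```python
import pandas as pd

# Hypothetical values, for illustration only.
names = pd.Series(["US", "brazil", "turkey"])

naive = names.str.capitalize()  # also lower-cases the "S" in "US"

# Keep "US" as it is; capitalise everything else.
fixed = names.where(names == "US", names.str.capitalize())

print(naive.tolist())
print(fixed.tolist())
```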

Now, let's have a look at some of the core "honesty" data - which is whether the participant reported the coin toss to be heads or tails.  The coding of this field is ambiguous, but perhaps we can figure out how it is coded:

In [None]:
hon_data['heads'].value_counts()

Because in this experiment there were incentives to say the coin toss was heads, we expect heads to be reported more often than tails - this is certainly the finding in many other experiments using the same instrument.  So if 1 is the more frequent value, it's quite probable that 1 = heads and 2 = tails.  Let's re-code on that assumption.

In [None]:
hon_data = hon_data.assign(**{
    'heads': lambda x: x['heads'] == 1
})
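
Note that `== 1` turns every value other than 1 into `False`. That is fine here because we have already kept only the rows coded 1 or 2. As a sketch on made-up codes (and still assuming 1 = heads), `.map()` with a dictionary is an alternative that spells out the coding; any code missing from the dictionary would become a missing value instead of `False`.

```python
import pandas as pd

# Hypothetical codes; the coding 1 = heads, 2 = tails is an assumption.
codes = pd.Series([1, 2, 1, 1])

heads = codes == 1                           # True for 1, False otherwise
mapped = codes.map({1: True, 2: False})      # explicit coding
share_heads = heads.mean()                   # proportion reporting heads

print(heads.tolist(), share_heads)
```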

Some of the columns are results from an "integrity test" where participants were asked whether they thought certain actions were always (1), sometimes (2), rarely (3), or never (4) justified.

These are the columns that start with "Please / think":

In [None]:
[c for c in hon_data.columns if c.startswith("Please / think")]

We'd like to create a single total score from these columns.  Higher scores will indicate greater "integrity."

When we did something similar previously (with numeracy scores), we saw that a wide-to-long transformation can be an attractive way to do this kind of calculation.  However, we can also use the `sum` method.  The default for `sum` is to sum *down* a column (as we have seen before).  If we specify the `axis` parameter, we can instead sum *across* columns:
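
Here is a sketch of the two directions on a made-up table of three items. One thing to keep in mind: by default `sum` skips missing values, so a participant who skipped an item gets a lower total and no warning. Passing `skipna=False` makes such totals missing instead.

```python
import pandas as pd

# Hypothetical item scores; the third participant skipped item_b.
toy = pd.DataFrame({
    "item_a": [4, 3, 1],
    "item_b": [4, 2, None],
    "item_c": [3, 3, 2],
})

down = toy.sum()                               # one total per column
across = toy.sum(axis="columns")               # one total per row
strict = toy.sum(axis="columns", skipna=False) # missing if any item is missing

print(across.tolist())
print(strict.tolist())
```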

In [None]:
integrity = hon_data[[c for c in hon_data.columns if c.startswith("Please / think")]].sum(axis='columns')
integrity

Let's have a quick look at the distribution of these scores.

In [None]:
integrity.value_counts().sort_index().plot(kind='bar')

The distribution is quite skewed.  We might want to bin this data for analysis.  We can do this using the `cut()` function:

In [None]:
help(pd.cut)

In [None]:
hon_data = hon_data.assign(**{
    'integrity': pd.cut(integrity, [0, 45, 50, 55, 60])
})

In [None]:
hon_data['integrity'].value_counts()
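
Two details of `pd.cut` are worth knowing when choosing bin edges. By default each bin includes its right edge but not its left one, and any value outside the outermost edges becomes a missing value. The sketch below uses the same edges as above on made-up scores.

```python
import pandas as pd

# Hypothetical scores chosen to sit on and around the bin edges.
scores = pd.Series([40, 45, 46, 60, 61])

binned = pd.cut(scores, [0, 45, 50, 55, 60])

print(binned)
```

Here 45 falls in the first bin, 46 in the second, 60 in the last, and 61 is left unbinned.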