In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (10, 5)


# Lecture 05 - Grouping

## DSC 80, Fall 2022

## Today, in DSC 80...

- Data comes in varying levels of detail (**granularity**)
- How do we go from very fine granularity to **aggregated** data?


## Data granularity

### Granularity

- **Granularity** refers to the level of detail present in data.
    - Fine: small details.
    - Coarse: bigger picture.
- Typically, rows in a DataFrame correspond to individuals, and columns correspond to attributes.
- In the following example, what is an individual?

| Name | Assignment | Score |
| --- | --- | --- |
| Billy | Homework 1 | 94 |
| Sally | Homework 1 | 98 |
| Molly | Homework 1 | 82 |
| Sally | Homework 2 | 47 |

### Levels of granularity

<center><img src='imgs/caper.png' width=30%></center>

Each student submits CAPEs once for each course they are in.

| Student Name | Quarter | Course | Instructor | Recommend? | Expected Grade | Hours Per Week | Comments |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Billy | SP22 | DSC 80 | Suraj Rampure | No | A- | 14 | I hate this class |
| Billy | SP22 | DSC 40B | Arya Mazumdar | Yes | B+ | 9 | go big O |
| Sally | SP22 | DSC 10 | Janine Tiefenbruck | Yes | A | 11 | babypandas are so cute |
| Molly | SP22 | DSC 80 | Suraj Rampure | Yes | A+ | 2 | I wish there was music in class |
| Billy | SP22 | DSC 190 | Justin Eldridge | No | B+ | 7 | Wears too much flannel |

Only instructors can see individual responses. At [cape.ucsd.edu](https://cape.ucsd.edu), only overall class statistics are visible.

| Quarter | Course | Instructor | Recommend (%) | Expected Grade | Hours Per Week |
| --- | --- | --- | --- | --- | --- |
| SP22 | DSC 80 | Suraj Rampure | 23% | 3.15 (B) | 13.32 |
| SP22 | DSC 40B | Arya Mazumdar | 89% | 3.35 (B+) | 8.54 |
| SP22 | DSC 10 | Janine Tiefenbruck | 94% | 3.45 (B+) | 11.49 |
| SP22 | DSC 190 | Justin Eldridge | 24% | 3.5 (B+) | 9.21 |

The university may be interested in looking at CAPEs results by department.

| Quarter | Department | Recommend (%) | Expected Grade | Hours Per Week |
| --- | --- | --- | --- | --- |
| SP22 | DSC | 91% | 3.01 (B) | 12.29 |
| SP22 | BILD | 85% | 2.78 (C+) | 13.21 |

Prospective students may be interested in comparing course evaluations across different universities.

| University | Recommend (%) | Average GPA | Hours Per Week |
| --- | --- | --- | --- |
| UC San Diego | 94% | 3.12 (B) | 42.19 |
| UC Irvine | 89% | 3.15 (B) | 38.44 |
| SDSU | 88% | 2.99 (B-) | 36.89 |

### Collecting data

- If you can control how your dataset is created, you should opt for **finer granularity** (more detail).
- You can always remove detail, but you cannot add detail if it is not already present in the dataset.
- However, obtaining fine-grained data can take more time and space.

### Manipulating granularity

- In the CAPEs example, we looked at the same information (course evaluations) at varying levels of detail.
- We'll now explore how to change the level of granularity present in our dataset.
    - While it may seem like we are "losing information," removing detail can help us understand bigger-picture trends in our data (reduces **noise**).
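
As a rough preview of what "removing detail" looks like in code, the sketch below rebuilds the small scores table from the start of this section and summarizes it at two coarser levels. It uses the `groupby` method, which is introduced later in this lecture; the variable names `scores`, `per_student`, and `overall` are illustrative only.

```python
import pandas as pd

# Fine granularity: one row per (student, assignment) pair,
# copied from the scores table earlier in this section.
scores = pd.DataFrame({
    'Name': ['Billy', 'Sally', 'Molly', 'Sally'],
    'Assignment': ['Homework 1', 'Homework 1', 'Homework 1', 'Homework 2'],
    'Score': [94, 98, 82, 47],
})

# Coarser granularity: one value per student.
per_student = scores.groupby('Name')['Score'].mean()

# Coarsest granularity: a single number for the whole class.
overall = scores['Score'].mean()

print(per_student)
print(overall)
```

Going from `scores` to `per_student` is easy, but there is no way to recover Sally's two individual scores from her average alone.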

### Discussion Question

What is the average number of `'Years'` for each `'Degree'`? Write code that finds the answer as a **Series** indexed by `'Degree'`.

In [None]:
profs = pd.DataFrame(
[['Brad', 'UCB', 8, 'Neuro', 'Orange'],
 ['Janine', 'UCSD', 7, 'Math', 'Purple'],
 ['Marina', 'UIC', 6, 'CS', 'Yellow'],
 ['Justin', 'OSU', 7, 'CS', 'Yellow'],
 ['Aaron', 'UCB', 4, 'Math', 'Purple'],
 ['Soohyun', 'UCSD', 1, 'CS', 'Orange'],
 ['Suraj', 'UCB', 1, 'CS', 'Purple']],
    columns=['Name', 'School', 'Years', 'Degree', 'Color']
)

profs

### Approach 1: Looping through unique values

In [None]:
year_map = {}
for degree in profs['Degree'].unique():
    degree_only = profs.loc[profs['Degree'] == degree]
    year_map[degree] = degree_only['Years'].mean()
    
pd.Series(year_map)

- For each unique `'Degree'`, we make a pass through the entire dataset.
- For 40B people: $\Theta(nd)$, where $n$ is the number of rows and $d$ is the number of unique degrees.

### Approach 2: Single pass

Let's try to avoid passing over the dataset repeatedly.

In [None]:
profs

You can iterate over the rows of a DataFrame using the `iterrows` method (though you should rarely need to do this):

In [None]:
for idx, row in profs.iterrows():
    print(row, '\n')

In [None]:
year_map = {}
for idx, row in profs.iterrows():                            
    degree = row['Degree']
    person_years = row['Years']
    if degree in year_map:
        year_map[degree] += np.array([1, person_years])
    else:
        year_map[degree] = np.array([1, person_years])
        
year_map

In [None]:
df = pd.DataFrame(year_map, index=['total', 'years'])
df.loc['years'] / df.loc['total']

### Issues with the previous solutions

- These solutions were "ad-hoc", and depended on the specific problem we had.
    - What if we wanted the **median** `'Years'` for each `'Degree'`?
- Loops in Python are slow (though the **algorithmic reasoning** is still relevant).

## GroupBy

### 🤔

In [None]:
profs

In [None]:
profs.groupby('Degree').mean()
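
The discussion question asked for a **Series** indexed by `'Degree'`. One way to get that, sketched below, is to select the `'Years'` column before aggregating (the `[]` selection on grouped objects is covered later in this lecture). Swapping `mean` for `median` also answers the earlier "what if we wanted the median?" question without writing a new loop. The names `mean_years` and `median_years` are illustrative.

```python
import pandas as pd

profs = pd.DataFrame(
    [['Brad', 'UCB', 8, 'Neuro', 'Orange'],
     ['Janine', 'UCSD', 7, 'Math', 'Purple'],
     ['Marina', 'UIC', 6, 'CS', 'Yellow'],
     ['Justin', 'OSU', 7, 'CS', 'Yellow'],
     ['Aaron', 'UCB', 4, 'Math', 'Purple'],
     ['Soohyun', 'UCSD', 1, 'CS', 'Orange'],
     ['Suraj', 'UCB', 1, 'CS', 'Purple']],
    columns=['Name', 'School', 'Years', 'Degree', 'Color']
)

# Selecting a single column first yields a Series indexed by 'Degree'.
mean_years = profs.groupby('Degree')['Years'].mean()

# Changing the statistic only requires changing the aggregator.
median_years = profs.groupby('Degree')['Years'].median()

print(mean_years)
print(median_years)
```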

### Aside: Pandas Tutor

- [pandastutor.com](https://pandastutor.com) is a new tool that allows you to visualize DataFrame operations.
    - It works similarly to [pythontutor.com](https://pythontutor.com), which you may have seen in DSC 20.
    - Slight issue: can't upload `.csv` files.
- Follow along with our current example [here](https://pandastutor.com/vis.html#code=import%20pandas%20as%20pd%0A%0Aprofs%20%3D%20pd.DataFrame%28%0A%5B%5B'Brad',%20'UCB',%208,%20'Neuro',%20'Orange'%5D,%0A%20%5B'Janine',%20'UCSD',%207,%20'Math',%20'Purple'%5D,%0A%20%5B'Marina',%20'UIC',%206,%20'CS',%20'Yellow'%5D,%0A%20%5B'Justin',%20'OSU',%204,%20'CS',%20'Yellow'%5D,%0A%20%5B'Aaron',%20'UCB',%204,%20'Math',%20'Purple'%5D,%0A%20%5B'Soohyun',%20'UCSD',%201,%20'CS',%20'Orange'%5D,%0A%20%5B'Suraj',%20'UCB',%201,%20'CS',%20'Purple'%5D%5D,%0A%20%20%20%20columns%3D%5B'Name',%20'School',%20'Years',%20'Degree',%20'Color'%5D%0A%29%0A%0Aprofs.groupby%28'Degree'%29.mean%28%29&d=2022-04-11&lang=py&v=v1).

### Split-apply-combine

- The `groupby` method involves three steps: **split**, **apply**, and **combine**.

<center><img src="imgs/image_0.png" width=40%></center>

- **Split** breaks up and "groups" the rows of a DataFrame according to the specified key. There is one "group" for every unique value of the key.
- **Apply** uses a function (e.g. aggregation, transformation, filtering) within the individual groups.
- **Combine** stitches the results of these operations into an output DataFrame.
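
To make the three steps concrete, here is a hand-written sketch of split, apply, and combine on a trimmed copy of `profs`. This is not how `pandas` implements `groupby` internally; it only mirrors the conceptual steps, and the names `groups`, `results`, and `combined` are illustrative.

```python
import pandas as pd

profs = pd.DataFrame({
    'Name': ['Brad', 'Janine', 'Marina', 'Justin', 'Aaron', 'Soohyun', 'Suraj'],
    'Years': [8, 7, 6, 7, 4, 1, 1],
    'Degree': ['Neuro', 'Math', 'CS', 'CS', 'Math', 'CS', 'CS'],
})

# Split: one sub-DataFrame per unique value of the key.
groups = {
    degree: profs[profs['Degree'] == degree]
    for degree in profs['Degree'].unique()
}

# Apply: run the same function within each group.
results = {degree: rows['Years'].mean() for degree, rows in groups.items()}

# Combine: stitch the per-group results into a single output.
combined = pd.Series(results, name='Years').rename_axis('Degree').sort_index()

# The one-line equivalent.
builtin = profs.groupby('Degree')['Years'].mean()

print(combined)
print(builtin)
```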

### Runtime considerations

* The `groupby` method can often produce results using just a **single pass** over the data, updating the sum, mean, count, min, or other aggregate for each group along the way.

* `groupby` is a **declarative** operation – the user just specifies **what** computation needs to be done, and `pandas` figures out **how** to do it under the hood.

* The split-apply-combine pattern can be parallelized to work on multiple computers or threads, by sending computations for each group to different processors.

### Example: Penguins 🐧

In [None]:
import seaborn as sns
penguins = sns.load_dataset('penguins').dropna()
penguins.head()

In [None]:
penguins['species'].value_counts()

In [None]:
penguins['island'].value_counts()

### For each species...

What is the median bill length?

In [None]:
penguins.groupby('species').median()

What proportion live on Dream Island?

In [None]:
(
    penguins.assign(on_Dream = penguins['island'] == 'Dream')
            .groupby('species')
            .mean()
)

Now that we understand how to use `groupby`, let's dive deeper into **how** it works.

### Accessing groups

- If `df` is a DataFrame, then `df.groupby(key)` returns a `DataFrameGroupBy` object.
    - This object represents the "split" in "split-apply-combine".
- Methods and attributes of `DataFrameGroupBy` objects:
    - `.groups`: a dictionary in which the keys are group names and the values are lists of row labels.
    - `.get_group(key)`: a DataFrame with only the rows for the given key.
    - We usually don't use these directly, but they're useful in understanding how `groupby` works.

In [None]:
# Creates one group for each unique value in the species column
penguin_groups = penguins.groupby('species')
penguin_groups

In [None]:
penguin_groups.groups

In [None]:
penguin_groups.get_group('Chinstrap')

In [None]:
# Same as the above
penguins[penguins['species'] == 'Chinstrap']

In [None]:
for key, df in penguin_groups:
    display(df)

### Aggregation

- Once we create a `DataFrameGroupBy` object, we need to **apply** some function to each group, and **combine** the results.
- The most common operation applied to each group is an **aggregation**.
    - Aggregation refers to the process of reducing many values to one.
- To perform an aggregation, use an aggregator method on the `DataFrameGroupBy` object, e.g. `.mean()`, `.max()`, `.median()`, etc.

In [None]:
penguins

In [None]:
penguin_groups

In [None]:
penguin_groups.mean()

In [None]:
penguin_groups.sum()

In [None]:
penguin_groups.max()

### Column selection

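- By default, the aggregator will be applied to **all** columns that it can be applied to.
    - `max` and `min` are defined on strings, while `median` and `mean` are not.
    - This describes `pandas` 1.x. In `pandas` 2.0 and later, `mean` and `median` raise a `TypeError` on string columns unless you pass `numeric_only=True` or select numeric columns first.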
- If we only care about one column, we can select that column before aggregating to save time.
- `DataFrameGroupBy` objects support `[]` notation.

In [None]:
penguins.groupby('species').median()

In [None]:
penguins.groupby('species')['bill_length_mm'].median()

In [None]:
# Gives the same result, but involves wasted effort
# since the other columns had to be aggregated for no reason
penguins.groupby('species').median()['bill_length_mm']

In [None]:
# Note that this is a SeriesGroupBy object, not a DataFrameGroupBy object!
penguins.groupby('species')['bill_length_mm']

## Additional `GroupBy` methods

### Aggregation methods

- There are many built-in aggregation methods.
- What if you want to apply different aggregation methods to different columns?
- What if the aggregation method you want to use doesn't already exist in `pandas`?

### The `aggregate` method

- The `DataFrameGroupBy` object has a general `aggregate` method, which aggregates using one or more operations.
    - Remember, aggregation refers to the process of reducing many values to one.
- There are many ways of using `aggregate`; refer to [the documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.aggregate.html) for a comprehensive list.
- Example arguments:
    - A single function.
    - A list of functions.
    - A dictionary mapping column names to functions.
- Per [the documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html), `agg` is an alias for `aggregate`.

### Example

How many penguins are there of each species, and what is the mean body mass of each species?

In [None]:
penguins.groupby('species')['body_mass_g'].aggregate(['count', 'mean'])

Note what happens when we don't select a column before aggregating.

In [None]:
# penguins.drop(columns=['island', 'sex']).groupby('species').aggregate(['count', 'mean'])
penguins.groupby('species').aggregate(['count', 'mean'])

### Example

What is the max bill length of each species, and how many islands is each species found on?

In [None]:
penguins.groupby('species').aggregate({'bill_length_mm': 'max', 'island': 'nunique'})

### Example

What is the **interquartile range** of the body mass of each species?

In [None]:
def IQR(col):
    return np.percentile(col, 75) - np.percentile(col, 25)

In [None]:
penguins.groupby('species')['body_mass_g'].aggregate(IQR)

### The `transform` method

- Let's say we want to subtract the mean within each group.
- This is not an **aggregation**, it is a **transformation**.
- A transformation returns a DataFrame or Series of the same size as the original, with one output value per input row.

In [None]:
penguins

In [None]:
penguins.groupby('species')['body_mass_g'].transform(lambda ser: ser - ser.mean())
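
Because the result of `transform` has the same index as the original DataFrame, it can be attached as a new column. The sketch below uses a small, made-up DataFrame (`toy`, not the real `penguins` data) and also shows that `transform` accepts the name of a built-in aggregator such as `'mean'`, in which case each row receives its group's mean.

```python
import pandas as pd

# Hypothetical mini-dataset, standing in for `penguins`.
toy = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo', 'Gentoo', 'Gentoo'],
    'body_mass_g': [3500.0, 3700.0, 5000.0, 5200.0, 4800.0],
})

# Each row gets the mean of its own group; the length matches `toy`.
group_means = toy.groupby('species')['body_mass_g'].transform('mean')

# Subtracting the group mean centers each species at zero.
toy = toy.assign(mass_centered=toy['body_mass_g'] - group_means)

# The lambda form used above produces the same values.
same = toy.groupby('species')['body_mass_g'].transform(lambda ser: ser - ser.mean())

print(toy)
```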

### The `filter` method

- Suppose we want to keep only the groups that satisfy a particular condition.
- To do this, we use the `filter` method, which takes in a function.
- That function should accept a DataFrame/Series and return a Boolean.
- The result is a new DataFrame/Series with only the groups for which the filter function returned `True`.
- For example, suppose we want only the species whose mean bill length is above 39 mm.

In [None]:
penguins

In [None]:
penguins.groupby('species').filter(lambda df: df['bill_length_mm'].mean() > 39)

No more Adelies!

### The `apply` method

- The `apply` method is a generalization of `aggregate`, `transform`, and `filter`.
- It accepts a group as a DataFrame/Series, and can return a DataFrame, Series, or scalar.
- Per [the documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.apply.html), it is slower than other aggregation and transformation methods, so use those instead whenever possible, and **avoid `apply`**.

In [None]:
penguins.groupby('species').apply(lambda s: s * 2)

In [None]:
penguins.groupby('species').apply(lambda s: s.mean().mean())

### Discussion Question

For each species, find the island on which the heaviest penguin of that species lives.

In [None]:
# Why doesn't this work?
penguins.groupby('species').max()

In [None]:
penguins.sort_values('body_mass_g', ascending=False).groupby('species').first()
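
The sketch below illustrates, on a small made-up DataFrame (`toy`, a hypothetical stand-in for `penguins`), why `max` gives the wrong answer and shows one alternative to sorting: `idxmax` returns the row label of the heaviest penguin in each group, which can then be used with `loc`.

```python
import pandas as pd

# Hypothetical stand-in for `penguins`.
toy = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Chinstrap', 'Chinstrap', 'Gentoo'],
    'island': ['Torgersen', 'Biscoe', 'Dream', 'Dream', 'Biscoe'],
    'body_mass_g': [3700, 4100, 3800, 3500, 5400],
})

# max() aggregates each column independently, so the 'island' it reports is
# the alphabetically last island, not the island of the heaviest penguin.
columnwise = toy.groupby('species').max()

# idxmax() gives the row label of the maximum body mass within each species.
heaviest_rows = toy.groupby('species')['body_mass_g'].idxmax()
heaviest = toy.loc[heaviest_rows].set_index('species')['island']

print(columnwise)
print(heaviest)
```

In this toy data, the heaviest Adelie lives on Biscoe, but the column-wise `max` reports Torgersen.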

### Grouping with multiple columns

When we group with multiple columns, one group is created for **every unique combination** of elements in the specified columns.

In [None]:
double_group = penguins.groupby(['species', 'island'])
double_group

In [None]:
double_group.groups

In [None]:
for key, df in double_group:
    display(df.head())

In [None]:
penguins.groupby(['species', 'island']).mean()

### Grouping and indexes

- The `groupby` method creates an index based on the specified columns.
- When grouping by multiple columns, the resulting DataFrame has a `MultiIndex`.
- Advice: When working with a `MultiIndex`, use `reset_index` or set `as_index=False` in `groupby`.

In [None]:
weird = penguins.groupby(['species', 'island']).mean()
weird

In [None]:
weird['body_mass_g']

In [None]:
weird.loc['Adelie']

In [None]:
weird.loc[('Adelie', 'Torgersen')]

In [None]:
weird.reset_index()

In [None]:
penguins.groupby(['species', 'island'], as_index=False).mean()

### Summary

- Grouping allows us to change the level of granularity in a DataFrame.
- Grouping involves three steps – split, apply, and combine.
- The `groupby` method returns a `DataFrameGroupBy` object, which creates one group for every unique combination of values in the column(s) being grouped on.
- Most often, we will use an aggregation method on a `DataFrameGroupBy` object, but we can also use `transform`, `filter`, or the more general `apply` methods. Each one of these methods acts on each group individually.
- **Next time:** More on `pivot` and `pivot_table`. Simpson's paradox. Combining DataFrames.

## Pivoting

### Average body mass for every combination of species and island

To find the above information, we can group by both `'species'` and `'island'`.

In [None]:
penguins.groupby(['species', 'island'])['body_mass_g'].mean()

But we can also create a **pivot table**.

In [None]:
penguins.pivot_table(index='species', 
                     columns='island', 
                     values='body_mass_g', 
                     aggfunc='mean')

Note that the DataFrame above shows the same information as the Series above it, just in a different arrangement.

### `pivot_table`

- The `pivot_table` DataFrame method aggregates a DataFrame using two columns. To use it:

```py
df.pivot_table(index=index_col,
               columns=columns_col,
               values=values_col,
               aggfunc=func)
```
- The resulting DataFrame will have:
    - One row for every unique value in `index_col`.
    - One column for every unique value in `columns_col`.
    - Values determined by applying `func` on values in `values_col`.

### Example

Find the number of penguins per island and species.

In [None]:
penguins.pivot_table(index='island', 
                     columns='species', 
                     values='bill_length_mm', 
                     aggfunc='count')

Note that there is a `NaN` at the intersection of `'Biscoe'` and `'Chinstrap'`, because there were no Chinstrap penguins on Biscoe Island.

We can either use the `fillna` method afterwards or the `fill_value` argument to fill in `NaN`s.

In [None]:
penguins.pivot_table(index='island', 
                     columns='species', 
                     values='bill_length_mm', 
                     aggfunc='count').fillna(0)

In [None]:
penguins.pivot_table(index='island', 
                     columns='species', 
                     values='bill_length_mm', 
                     aggfunc='count', 
                     fill_value=0)

### Example

Find the mean body mass per species and sex.

In [None]:
penguins.pivot_table(index='species', columns='sex', values='body_mass_g', aggfunc='mean')

**Important:** In `penguins`, each row corresponds to an individual/observation. In the pivot table above, that is no longer true.

### Joint and conditional distributions

When using `aggfunc='count'`, a pivot table describes the joint distribution of two categorical variables.

In [None]:
counts = penguins.pivot_table(index='species', 
                              columns='sex', 
                              values='body_mass_g', 
                              aggfunc='count', 
                              fill_value=0)

counts

We can normalize the DataFrame by dividing by the total number of penguins. The resulting numbers can be interpreted as **probabilities** that a randomly selected penguin from the dataset belongs to a given combination of species and sex.

In [None]:
joint = counts / counts.sum().sum()
joint

If we sum over one of the axes, we can compute **marginal probabilities**.

In [None]:
joint

In [None]:
joint.sum(axis=1)

In [None]:
joint.sum(axis=0)

For instance, the first Series tells us that a randomly selected penguin has a 0.357357 chance of being of species `'Gentoo'`.

If we divide `counts` by row or column sums, we can compute **conditional probabilities**.

In [None]:
counts

In [None]:
counts.sum(axis=0)

The conditional distribution of species **given** sex is below.

In [None]:
counts / counts.sum(axis=0)

For instance, the above DataFrame tells us that the probability that a randomly selected penguin is of species `'Adelie'` **given** that they are of sex `'Female'` is 0.442424.

The conditional distribution of sex given species is below.

In [None]:
counts.T / counts.sum(axis=1)
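
Dividing by row sums required a transpose above, because `/` aligns a Series with the **columns** of a DataFrame. A possible alternative is the `div` method, whose `axis` argument chooses what to align on. The sketch below hard-codes the counts so that it runs on its own; the numbers are assumed to match `counts` as computed from `penguins` after `dropna`, and the names `species_given_sex` and `sex_given_species` are illustrative.

```python
import pandas as pd

# Hard-coded counts, assumed to match the `counts` pivot table above.
counts = pd.DataFrame(
    {'Female': [73, 34, 58], 'Male': [73, 34, 61]},
    index=pd.Index(['Adelie', 'Chinstrap', 'Gentoo'], name='species'),
)
counts.columns.name = 'sex'

# P(species | sex): divide each column by its column total.
species_given_sex = counts.div(counts.sum(axis=0), axis=1)

# P(sex | species): divide each row by its row total (no transpose needed).
sex_given_species = counts.div(counts.sum(axis=1), axis=0)

print(species_given_sex)
print(sex_given_species)
```

Each column of `species_given_sex` sums to 1, and each row of `sex_given_species` sums to 1.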

### `pivot_table` aggregates and reshapes

- The `pivot_table` method does two things. It:
    - Aggregates based on two columns.
    - Reshapes the data from "long" to "wide".
        - Rows no longer correspond to observations.
- At times, we may only want to do the second step – reshape the data.

### Example: Tic-tac-toe

<center><img src='imgs/tic-tac-toe.png' width=20%></center>

In [None]:
moves = pd.DataFrame([
    [1, 1, 'O'],
    [2, 1, 'X'],
    [2, 2, 'X'],
    [2, 3, 'O'],
    [3, 1, 'O'],
    [3, 3, 'X']
], columns=['i', 'j', 'move'])
moves

In [None]:
moves.pivot(index='i', columns='j', values='move').fillna('')

The `pivot` method **only** reshapes a DataFrame. It does not change any of the values in it (i.e. `aggfunc` doesn't work with `pivot`).

### `pivot_table` = `groupby` + `pivot`

- `pivot_table` is a shortcut for using `groupby` and then using `pivot`.
- For example, both of the following code cells find the mean body mass per species and sex.

In [None]:
(
    penguins.groupby(['species', 'sex'])[['body_mass_g']]
            .mean()
            .reset_index()
            .pivot(index='species', columns='sex', values='body_mass_g')
)

In [None]:
penguins.pivot_table(index='species', columns='sex', values='body_mass_g', aggfunc='mean')

`aggfunc='mean'` plays the same role that `.mean()` does.

### Reshaping

- `pivot_table` and `pivot` reshape DataFrames from "long" to "wide".
- Other DataFrame reshaping methods:
    - `melt`: un-pivots a DataFrame.
    - `stack`: pivots multi-level columns to multi-indices.
    - `unstack`: pivots multi-indices to columns.
    - Google and the documentation are your friends!
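
As a small illustration of two of these methods, the sketch below reuses the tic-tac-toe `moves` DataFrame. It reproduces the `pivot` result with `unstack`, then uses `melt` to go from "wide" back to "long". The variable names are illustrative, and the cleanup steps after `melt` (dropping empty cells, restoring integer column numbers, sorting) are just one way to recover the original layout.

```python
import pandas as pd

moves = pd.DataFrame([
    [1, 1, 'O'],
    [2, 1, 'X'],
    [2, 2, 'X'],
    [2, 3, 'O'],
    [3, 1, 'O'],
    [3, 3, 'X']
], columns=['i', 'j', 'move'])

# Long to wide, as before.
wide = moves.pivot(index='i', columns='j', values='move')

# unstack: moves the inner index level ('j') into the columns.
wide_via_unstack = moves.set_index(['i', 'j'])['move'].unstack()

# melt: un-pivots the wide board back into one row per (i, j) cell.
long_again = (
    wide.reset_index()
        .melt(id_vars='i', var_name='j', value_name='move')
        .dropna()                  # drop the empty squares
        .astype({'j': int})        # column labels come back as generic objects
        .sort_values(['i', 'j'])
        .reset_index(drop=True)
)

print(wide_via_unstack)
print(long_again)
```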

### Simpson's paradox

<center><img src="imgs/image_2.png" width=50%></center>

### Example: Grades

- Two students, Lisa and Bart, just finished freshman year. They both took a different number of classes in Fall, Winter, and Spring.

- Within each quarter, Lisa had a higher GPA than Bart.

- But Bart has a higher overall GPA.

- How is this possible? 🤔

**Note:** The number of "grade points" you earn for a course is

$$\text{number of units} \cdot \text{grade (out of 4)}$$

So an A- in a 4-unit course earns $3.7 \cdot 4 = 14.8$ grade points.

In [None]:
lisa = pd.DataFrame([
        [20, 46],
        [18, 54],
        [5, 20]
    ],
    columns=['Units', 'Grade Points Earned'], 
    index=['Fall', 'Winter', 'Spring'])

lisa

In [None]:
bart = pd.DataFrame([
        [5, 10],
        [5, 13.5],
        [22, 81.4]
    ],
    columns=['Units', 'Grade Points Earned'], 
    index=['Fall', 'Winter', 'Spring'])

bart

The following DataFrame shows that Lisa had a higher GPA in all three quarters.

In [None]:
quarterly_gpas = pd.DataFrame(
    {
        "Lisa's Quarter GPA": lisa['Grade Points Earned'] / lisa['Units'],
        "Bart's Quarter GPA": bart['Grade Points Earned'] / bart['Units']
    }
)

quarterly_gpas

But Lisa's overall GPA is less than Bart's overall GPA.

In [None]:
tot = lisa.sum()
tot['Grade Points Earned'] / tot['Units']

In [None]:
tot = bart.sum()
tot['Grade Points Earned'] / tot['Units']

### What happened?

- When Lisa and Bart both performed poorly, Lisa took more units than Bart.
    - This brings down Lisa's overall average.
- When Lisa and Bart both performed well, Bart took more units than Lisa.
    - This brings up Bart's overall average.
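
Put differently, an overall GPA is a **weighted** average of the quarterly GPAs, with each quarter weighted by its number of units:

$$\text{overall GPA} = \frac{\sum_q \text{units}_q \cdot \text{GPA}_q}{\sum_q \text{units}_q}$$

The sketch below checks this with the same numbers as `lisa` and `bart`. The helper `overall_gpa` is a hypothetical name introduced only for this illustration.

```python
import numpy as np
import pandas as pd

quarters = ['Fall', 'Winter', 'Spring']
lisa = pd.DataFrame({'Units': [20, 18, 5],
                     'Grade Points Earned': [46, 54, 20]}, index=quarters)
bart = pd.DataFrame({'Units': [5, 5, 22],
                     'Grade Points Earned': [10, 13.5, 81.4]}, index=quarters)

def overall_gpa(student):
    """Overall GPA as a units-weighted average of quarterly GPAs."""
    quarter_gpa = student['Grade Points Earned'] / student['Units']
    return np.average(quarter_gpa, weights=student['Units'])

lisa_overall = overall_gpa(lisa)  # equals total grade points / total units
bart_overall = overall_gpa(bart)

print(lisa_overall, bart_overall)
```

Lisa's largest weights (20 and 18 units) fall on her two lowest quarterly GPAs, while Bart's largest weight (22 units) falls on his highest.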

In [None]:
quarterly_gpas.assign(Lisa_units=lisa['Units']) \
              .assign(Bart_units=bart['Units']) \
              .iloc[:, [0, 2, 1, 3]]

### Simpson's paradox

- Simpson's paradox occurs when **grouped data and ungrouped data show opposing trends**.
    - It is named after Edward H. Simpson, not Lisa or Bart Simpson.
    
- It is **purely arithmetic** – it is a consequence of weighted averages.

- It often happens because there is a hidden factor (i.e. a **confounder**) within the data that influences results.

- **Question:** What is the "correct" way to summarize your data? What if you had to act on these results?

### Example: How Berkeley was sued for gender discrimination (1973)

### What do you notice?

<center><img src='imgs/berkeley.png' width=70%></center>

In [None]:
from IPython.display import display, IFrame

def show_paradox_slides():
    src = 'https://docs.google.com/presentation/d/e/2PACX-1vSbFSaxaYZ0NcgrgqZLvjhkjX-5MQzAITWAsEFZHnix3j1c0qN8Vd1rogTAQP7F7Nf5r-JWExnGey7h/embed?start=false'
    width = 960
    height = 569
    display(IFrame(src, width, height))

show_paradox_slides()

### What happened?

- The overall acceptance rate for women (30%) was lower than it was for men (45%).
- However, most departments (A, B, D, F) had a higher acceptance rate for women.
- Department A had a 62% acceptance rate for men and an 82% acceptance rate for women!
    - 31% of men applied to Department A.
    - 6% of women applied to Department A.
- Department F had a 6% acceptance rate for men and a 7% acceptance rate for women!
    - 14% of men applied to Department F.
    - 19% of women applied to Department F.
- **Conclusion:** Women tended to apply to departments with a lower acceptance rate.

### Caution!

This doesn't mean that admissions are free from gender discrimination! 

From [Moss-Racusin et al., 2012, PNAS](https://www.pnas.org/doi/10.1073/pnas.1211286109) (cited 2600+ times):

> In a randomized double-blind study (n = 127), **science faculty** from research-intensive universities **rated the application materials of a student—who was randomly assigned either a male or female** name—for a laboratory manager position. Faculty **participants rated the male applicant as significantly more competent and hireable than the (identical) female applicant**. These participants also selected a higher starting salary and offered more career mentoring to the male applicant. The gender of the faculty participants did not affect responses, such that female and male faculty were equally likely to exhibit bias against the female student.

### But then...

From [Williams and Ceci, 2015, PNAS](https://www.pnas.org/doi/10.1073/pnas.1418878112):

> Here we report five hiring experiments in which faculty evaluated hypothetical female and male applicants, using systematically varied profiles disguising identical scholarship, for assistant professorships in biology, engineering, economics, and psychology. Contrary to prevailing assumptions, **men and women faculty members from all four fields preferred female applicants 2:1 over identically qualified males** with matching lifestyles (single, married, divorced), with the exception of male economists, who showed no gender preference.

### Do these conflict?

Not necessarily. One explanation, from Williams and Ceci:

> Instead, past studies have used ratings of students’ hirability for a range of posts that do not include tenure-track jobs, such as managing laboratories or performing math assignments for a company. However, hiring tenure-track faculty differs from hiring lower-level staff: it entails selecting among highly accomplished candidates, all of whom have completed Ph.D.s and amassed publications and strong letters of support. **Hiring bias may occur when applicants’ records are ambiguous, as was true in studies of hiring bias for lower-level staff posts, but such bias may not occur when records are clearly strong**, as is the case with tenure-track hiring.

### Do these conflict?

From Witteman et al., 2019, in *The Lancet*:

> Thus, evidence of scientists favouring women comes exclusively from hypothetical scenarios, whereas evidence of scientists favouring men comes from hypothetical scenarios and real behaviour. This **might reflect academics' growing awareness of the social desirability of achieving gender balance, while real academic behaviour might not yet put such ideals into action**.

### Example: Restaurant reviews and phone types

* You are deciding whether to eat at Dirty Birds or The Loft.
* Suppose Yelp shows ratings aggregated by phone type (Android vs. iPhone).
* Should you choose Dirty Birds or The Loft? 

|Phone Type|Stars for Dirty Birds|Stars for The Loft|
|---|---|---|
|Android|4.24|4.0|
|iPhone|2.99|2.79|
|**All**|**3.32**|**3.37**|



### Restaurant reviews and phone types

* It's doubtful that your phone type will **cause** you to prefer one restaurant over another.
* Again, Simpson's paradox is merely a property of weighted averages!

### Verifying Simpson's paradox

In [None]:
ratings = pd.read_csv('data/ratings.csv')
ratings.sample(5)

Mean rating for each phone type and restaurant:

In [None]:
ratings.pivot_table(index='phone', columns='restaurant', values='rating', aggfunc='mean')

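Overall mean rating for each restaurant, ignoring phone type:

In [None]:
# Select the numeric 'rating' column so the string 'phone' column is not averaged
ratings.groupby('restaurant')['rating'].mean()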

### Takeaways

Be skeptical of...
- Aggregate statistics.
- People misusing statistics to "prove" that discrimination doesn't exist.
- Drawing conclusions from individual publications (p-hacking, publication bias, narrow focus, etc.).
- Everything!

### Further reading

- [Gender Bias in Admission Statistics?](https://www.cantorsparadise.com/gender-bias-in-admission-statistics-eaabca650810)
    - Contains a **great** visualization.
- [Simpson's paradox](https://en.wikipedia.org/wiki/Simpson%27s_paradox#UC_Berkeley_gender_bias) on Wikipedia.

## Next time in DSC 80...

- Combining data from different sources.