# Age and Gender Distortion

In a [recent paper in Nature](https://www.nature.com/articles/s41586-025-09581-z), Douglas Guilbeault, Solène Delecourt & Bhargav Srinivasa Desikan investigated how representations of age are distorted across genders.

In this assignment, you will work through some of their results yourself. You will use their data, available at <https://github.com/drguilbe/distortion_age_gender_online/>.

## Correlation Between Age and Gender

You will begin by analysing gender-age associations in GPT-2 Large, the
largest (albeit old) model available from OpenAI. To do that, the
researchers captured how the internal representations of texts in
GPT-2 Large map onto age and gender dimensions. In the file
`GPT2-large-dimensions.csv`, each social category is mapped to a
number on the age dimension and a number on the gender dimension.
You will calculate the Pearson correlation between age and gender. The
results should look like the following.

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>n</th>
      <th>r</th>
      <th>CI95%</th>
      <th>p-val</th>
      <th>BF10</th>
      <th>power</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>pearson</th>
      <td>3495</td>
      <td>0.872559</td>
      <td>[0.86, 0.88]</td>
      <td>0.0</td>
      <td>inf</td>
      <td>1.0</td>
    </tr>
  </tbody>
</table>
</div>
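As a rough illustration of where the `r`, `CI95%` and `p-val` columns come from, the sketch below computes a Pearson correlation and a Fisher z-transform confidence interval on synthetic data. The column names `age` and `gender` are hypothetical stand-ins and may not match the real CSV. The column layout of the table above resembles what the `pingouin` package's `corr` function reports, though the assignment does not prescribe a library.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for GPT2-large-dimensions.csv; the column names
# "age" and "gender" are hypothetical.
rng = np.random.default_rng(0)
gender = rng.normal(size=500)
age = 0.8 * gender + rng.normal(scale=0.5, size=500)
df = pd.DataFrame({"age": age, "gender": gender})

# Pearson correlation coefficient and two-sided p-value.
r, p_value = stats.pearsonr(df["age"], df["gender"])

# 95% confidence interval via the Fisher z-transform.
n = len(df)
z = np.arctanh(r)
half_width = stats.norm.ppf(0.975) / np.sqrt(n - 3)
ci_low, ci_high = np.tanh(z - half_width), np.tanh(z + half_width)

print(f"n={n}, r={r:.3f}, CI95%=[{ci_low:.2f}, {ci_high:.2f}], p={p_value:.3g}")
```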

You will then show that the results are robust to alternative methods
for extracting age and gender associations by creating all pairwise
correlations and putting them on heatmaps:

<img src="correlation_heatmap_age.svg" width="300"/><img src="correlation_heatmap_gender.svg" width="300"/>
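The pairwise correlations behind such a heatmap can be sketched with pandas alone. The three column names below are hypothetical placeholders for alternative extraction methods; the real file may use different names and more columns.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: three hypothetical extraction methods for the age
# dimension, each a noisy version of one shared signal.
rng = np.random.default_rng(1)
signal = rng.normal(size=300)
methods = pd.DataFrame({
    "age_method_a": signal + rng.normal(scale=0.3, size=300),
    "age_method_b": signal + rng.normal(scale=0.3, size=300),
    "age_method_c": signal + rng.normal(scale=0.3, size=300),
})

# Pairwise Pearson correlations between all columns.
corr_matrix = methods.corr(method="pearson")
print(corr_matrix.round(2))
```

The resulting square matrix can then be passed to a plotting function such as seaborn's `heatmap` (for example with `annot=True`) to draw one heatmap per dimension.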

## Relationship Between Age and Gender

You will run a regression between the normalized age and gender
measures. You will report the results of the model, which
should be:

<table class="simpletable">
<caption>OLS Regression Results</caption>
<tr>
  <th>Dep. Variable:</th>      <td>age_norm_main</td>  <th>  R-squared:         </th>  <td>   0.761</td> 
</tr>
<tr>
  <th>Model:</th>                   <td>OLS</td>       <th>  Adj. R-squared:    </th>  <td>   0.761</td> 
</tr>
<tr>
  <th>Method:</th>             <td>Least Squares</td>  <th>  F-statistic:       </th>  <td>1.114e+04</td>
</tr>
<tr>
  <th>Date:</th>             <td>Tue, 09 Dec 2025</td> <th>  Prob (F-statistic):</th>   <td>  0.00</td>  
</tr>
<tr>
  <th>Time:</th>                 <td>07:10:32</td>     <th>  Log-Likelihood:    </th>  <td>  5425.6</td> 
</tr>
<tr>
  <th>No. Observations:</th>      <td>  3495</td>      <th>  AIC:               </th> <td>-1.085e+04</td>
</tr>
<tr>
  <th>Df Residuals:</th>          <td>  3493</td>      <th>  BIC:               </th> <td>-1.083e+04</td>
</tr>
<tr>
  <th>Df Model:</th>              <td>     1</td>      <th>                     </th>      <td> </td>    
</tr>
<tr>
  <th>Covariance Type:</th>      <td>nonrobust</td>    <th>                     </th>      <td> </td>    
</tr>
</table>
<table class="simpletable">
<tr>
          <td></td>            <th>coef</th>     <th>std err</th>      <th>t</th>      <th>P>|t|</th>  <th>[0.025</th>    <th>0.975]</th>  
</tr>
<tr>
  <th>Intercept</th>        <td>   -0.0211</td> <td>    0.003</td> <td>   -6.510</td> <td> 0.000</td> <td>   -0.027</td> <td>   -0.015</td>
</tr>
<tr>
  <th>gender_norm_main</th> <td>    0.7450</td> <td>    0.007</td> <td>  105.565</td> <td> 0.000</td> <td>    0.731</td> <td>    0.759</td>
</tr>
</table>
<table class="simpletable">
<tr>
  <th>Omnibus:</th>       <td>829.901</td> <th>  Durbin-Watson:     </th> <td>   1.255</td>
</tr>
<tr>
  <th>Prob(Omnibus):</th> <td> 0.000</td>  <th>  Jarque-Bera (JB):  </th> <td>4643.267</td>
</tr>
<tr>
  <th>Skew:</th>          <td> 1.012</td>  <th>  Prob(JB):          </th> <td>    0.00</td>
</tr>
<tr>
  <th>Kurtosis:</th>      <td> 8.271</td>  <th>  Cond. No.          </th> <td>    9.75</td>
</tr>
</table><br/><br/>Notes:<br/>[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
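A minimal sketch of such a regression, using the statsmodels formula interface on synthetic data, is shown here. The variable names `age_norm_main` and `gender_norm_main` are taken from the summary table above; the data and the true slope of 0.75 are made up for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data; only the column names come from the summary table.
rng = np.random.default_rng(2)
gender_norm = rng.uniform(-1, 1, size=400)
df = pd.DataFrame({
    "gender_norm_main": gender_norm,
    "age_norm_main": 0.75 * gender_norm + rng.normal(scale=0.1, size=400),
})

# Ordinary least squares with an intercept and a single predictor.
model = smf.ols("age_norm_main ~ gender_norm_main", data=df).fit()
print(model.summary())
```

In a regression with a single predictor, R-squared equals the squared Pearson correlation, which is consistent with the figures in this assignment: 0.8726² ≈ 0.761.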

Following that, you will make a regression plot, highlighting
representative categories:

<img src ="age_gender_regression_plot.svg" width="600"/>

Having done that, you will also make 
[an interactive plot](age_gender_regression_interactive.html), 
so that by hovering over the points the reader can 
see the details of other categories. You will notice outliers that 
the authors of the study did not highlight. Comment on that.

## Amplification via Google Search

To see if the results of online search amplify distortion, the
researchers designed an experiment. To quote from the paper:

> The participants were randomized into treatment or control
> condition.  In the treatment condition (hereafter ‘image
> condition’), the participants used Google Images to search for
> images of occupations, which they then uploaded to our survey. 
> After uploading an image for an occupation, the participants were 
> asked to label the gender of the image they uploaded and then to 
> estimate the average age of someone in this occupation. The
> participants were also asked to rate their willingness to hire 
> the person depicted in their uploaded image. In the control
> condition, the participants used Google Images to search for and
> upload images of basic unrelated categories (such as apple and
> guitar). After uploading a random image, the control participants 
> were asked to estimate the average age of someone in a randomly 
> selected occupation from the same set. 


The results of the experiment are in files `experiment_control.csv`
and `experiment_treatment.csv`. To see the effects of image search,
you will create the following plot. The plot shows two distributions,
one for uploaded images labeled male and one for those labeled female,
of the differences between the ages estimated in the treatment and
control groups. Each age difference is the age estimated by a subject
for a category in the treatment group minus the average age estimated
for that category in the control group.

<img src ="estimated_age_gender_image_search_distortion.svg" width="600"/>
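The age difference described above can be sketched with a group mean and a lookup. The tiny tables below are made up, and the column names (`category`, `gender`, `age`) follow the variable names used later in this assignment; the real CSV files may differ.

```python
import pandas as pd

# Tiny made-up tables standing in for the two experiment files.
control = pd.DataFrame({
    "category": ["nurse", "nurse", "pilot", "pilot"],
    "age": [40, 44, 46, 50],
})
treatment = pd.DataFrame({
    "category": ["nurse", "nurse", "pilot"],
    "gender": ["Female", "Male", "Male"],
    "age": [38, 45, 50],
})

# Average control estimate per category.
control_mean = control.groupby("category")["age"].mean()

# Each treatment estimate minus the control mean of its own category.
treatment["age_diff"] = treatment["age"] - treatment["category"].map(control_mean)
print(treatment)
```

The `age_diff` column can then be split by `gender` and plotted as two distributions.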

Show, as the authors do, that:

> The participants who uploaded an image of a woman estimated the
> average age of an occupation to be 5.46 years younger than those who
> uploaded an image of a man ($t = −19.07$; $p$-value
> $= 2.2 \times 10^{−16}$; Student's t-test), holding occupation
> constant. Moreover, uploading an image of a woman led the participants
> to estimate a significantly lower age for each occupation (by 1.75 years)
> compared with the control participants ($t = −11.32$; 
> $p$-value $= 2.2 \times 10^{−16}$), whereas uploading an
> image of a man led the participants to estimate a significantly
> higher age for each occupation (by 0.64 years) compared with those
> in the control condition ($t = 3.42$; $p$-value $= 0.0006$; Student's 
> two-tailed t-test).
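One simplified way to approach these comparisons is to run t-tests on the per-category age differences. The sketch below uses synthetic differences and SciPy; it is an assumption about a workable procedure, not necessarily the authors' exact method, which may hold occupation constant in another way (for example through a regression with occupation terms).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Synthetic age differences (treatment estimate minus control mean).
diff_female = rng.normal(loc=-2.0, scale=8.0, size=1000)
diff_male = rng.normal(loc=0.5, scale=8.0, size=1000)

# Two-sample test: do the female and male differences share a mean?
t_between, p_between = stats.ttest_ind(diff_female, diff_male)

# One-sample tests: does each group differ from the control baseline (zero)?
t_female, p_female = stats.ttest_1samp(diff_female, popmean=0.0)
t_male, p_male = stats.ttest_1samp(diff_male, popmean=0.0)

print(f"female vs male: t={t_between:.2f}, p={p_between:.3g}")
print(f"female vs control: t={t_female:.2f}, p={p_female:.3g}")
print(f"male vs control: t={t_male:.2f}, p={p_male:.3g}")
```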

## Investigate the Amplification

To investigate the amplification, take the treatment and the control
together as one dataset and run a regression with `age` as the
dependent variable, and as independent variables:

* `condition * gender`, with gender as categorical with Treatment
  coding and `Male` as the reference level

* `category`

* `subj`

Report the results of the regression. The beginning of the summary
table should be like the following.

<table class="simpletable">
<caption>OLS Regression Results</caption>
<tr>
  <th>Dep. Variable:</th>           <td>age</td>       <th>  R-squared:         </th> <td>   0.586</td> 
</tr>
<tr>
  <th>Model:</th>                   <td>OLS</td>       <th>  Adj. R-squared:    </th> <td>   0.560</td> 
</tr>
<tr>
  <th>Method:</th>             <td>Least Squares</td>  <th>  F-statistic:       </th> <td>   22.30</td> 
</tr>
<tr>
  <th>Date:</th>             <td>Thu, 11 Dec 2025</td> <th>  Prob (F-statistic):</th>  <td>  0.00</td>  
</tr>
<tr>
  <th>Time:</th>                 <td>08:36:12</td>     <th>  Log-Likelihood:    </th> <td> -28066.</td> 
</tr>
<tr>
  <th>No. Observations:</th>      <td>  8514</td>      <th>  AIC:               </th> <td>5.715e+04</td>
</tr>
<tr>
  <th>Df Residuals:</th>          <td>  8005</td>      <th>  BIC:               </th> <td>6.074e+04</td>
</tr>
<tr>
  <th>Df Model:</th>              <td>   508</td>      <th>                     </th>     <td> </td>    
</tr>
<tr>
  <th>Covariance Type:</th>      <td>nonrobust</td>    <th>                     </th>     <td> </td>    
</tr>
</table>
<table class="simpletable">
<tr>
                                                     <td></td>                                                       <th>coef</th>     <th>std err</th>      <th>t</th>      <th>P>|t|</th>  <th>[0.025</th>    <th>0.975]</th>  
</tr>
<tr>
  <th>Intercept</th>                                                                                              <td>   44.1459</td> <td>    1.524</td> <td>   28.963</td> <td> 0.000</td> <td>   41.158</td> <td>   47.134</td>
</tr>
<tr>
  <th>C(condition, Treatment(reference="Control"))[T.Image]</th>                                                  <td>   -4.7819</td> <td>    1.447</td> <td>   -3.305</td> <td> 0.001</td> <td>   -7.618</td> <td>   -1.946</td>
</tr>
<tr>
  <th>C(gender, Treatment(reference="Male"))[T.Female]</th>                                                       <td>   -2.1919</td> <td>    0.288</td> <td>   -7.614</td> <td> 0.000</td> <td>   -2.756</td> <td>   -1.628</td>
</tr>
<tr>
  <th>category[T.appliedscientist]</th>                                                                           <td>   -0.1997</td> <td>    0.772</td> <td>   -0.259</td> <td> 0.796</td> <td>   -1.713</td> <td>    1.314</td>
</tr>
</table>
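The term names in the table above follow the patsy formula syntax used by statsmodels. The sketch below fits a reduced version of the model on synthetic data to show how the Treatment coding and the reference levels are specified. It omits `subj` for brevity, which is a simplification of the model the assignment asks for, and all the data are made up.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic combined dataset; only the column names follow the assignment.
rng = np.random.default_rng(4)
n = 600
df = pd.DataFrame({
    "condition": rng.choice(["Control", "Image"], size=n),
    "gender": rng.choice(["Male", "Female"], size=n),
    "category": rng.choice(["nurse", "pilot", "chef"], size=n),
})
df["age"] = (
    45
    - 2.0 * (df["gender"] == "Female")
    - 1.5 * ((df["condition"] == "Image") & (df["gender"] == "Female"))
    + rng.normal(scale=5, size=n)
)

# `a * b` expands to both main effects plus their interaction.
formula = (
    'age ~ C(condition, Treatment(reference="Control"))'
    ' * C(gender, Treatment(reference="Male"))'
    " + category"
)
model = smf.ols(formula, data=df).fit()
print(model.params)
```

The interaction coefficient estimates how much the female-male gap changes when moving from the control to the image condition.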
 

Then, run another regression for `age` with only `category` and `subj` as the
independent variables. You will then use that model to predict the age
for the combined treatment and control dataset. Use the results to
create the following two plots showing the differences between the
two groups. In the first plot, show the age predictions grouped
by gender (men and women) and by condition (control and treatment).
In the second plot, show the residuals of the predictions.

Explain the results.

<img src='grouped_plot.svg' width="400"/><img src='grouped_plot_residuals.svg' width="400"/>
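The prediction and residual step can be sketched as follows on synthetic data. The column names follow the assignment, while the data, the number of subjects, and the size of the gender effect are invented for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic combined dataset with a gender effect the baseline cannot see.
rng = np.random.default_rng(5)
n = 600
df = pd.DataFrame({
    "condition": rng.choice(["Control", "Image"], size=n),
    "gender": rng.choice(["Male", "Female"], size=n),
    "category": rng.choice(["nurse", "pilot", "chef"], size=n),
    "subj": rng.choice([f"s{i}" for i in range(20)], size=n),
})
df["age"] = 45 - 2.0 * (df["gender"] == "Female") + rng.normal(scale=5, size=n)

# Baseline model that ignores condition and gender.
baseline = smf.ols("age ~ category + subj", data=df).fit()
df["predicted"] = baseline.predict(df)
df["residual"] = df["age"] - df["predicted"]

# Mean prediction and mean residual per condition and gender.
summary = df.groupby(["condition", "gender"])[["predicted", "residual"]].mean()
print(summary)
```

Because the baseline model knows nothing about `condition` or `gender`, any systematic pattern left in the residuals across those groups reflects variation that `category` and `subj` do not explain.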

Finally, run two ANOVA models. Both of them will have `age` as the
dependent variable.

* The first one will have `condition * gender` as the independent
  variables, both with Sum encoding.

* The second one will have `condition * gender` as the independent
  variables, both with Sum encoding, as well as `category` and `subj`.

Explain the results.

## Evaluation

Your work will be evaluated for completeness and quality. There is no absolute measure by which a 10/10 is awarded. The best submissions will get the top grades, and the rest will be graded accordingly. Grading will take into account:

* Is the work engaging?

* Is the approach explained well enough?

* Have the questions been answered perfunctorily, or do the answers exhibit attention and care?

## Submission Instructions

You must submit your assignment as a Jupyter notebook that will contain the full code and documentation of how you solved the questions, and all other necessary files. Your submission must be fully replicable: that is, somebody reading it must be able to do exactly what you did and obtain the same results.

The documentation must be at the level where somebody who has some knowledge of Python can understand exactly what you are doing and why. Your output must be as user-friendly as possible. It should not contain reams of output that is unrelated to the question and does not inform the reader.

## Honor Code

You understand that this is an individual assignment, and as such you must carry it out alone. You may seek help on the Internet, on ChatGPT/Gemini/etc., by Googling or searching in StackOverflow for general questions pertaining to the use of Python and pandas libraries and idioms. However, it is not right to ask direct questions that relate to the assignment and where third parties will actually solve your problem by answering them. You may discuss with your colleagues in order to better understand the questions, if they are not clear enough, but you should not ask them to share their answers with you, or to help you by giving specific advice.