In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw2-seda.ipynb")

# PSTAT 100 Homework 2

In [None]:
import numpy as np
import pandas as pd
import altair as alt

## Background 

Gender achievement gaps in education have been well-documented over the years -- studies consistently find boys outperforming girls on math tests and girls outperforming boys on reading and language tests. A particularly controversial [article](https://www.jstor.org/stable/1684489) was published in Science in 1980 arguing that this pattern was due to an 'innate' difference in ability (focusing, of course, on mathematics rather than on reading and language). Such views persisted in part because studying systematic patterns in achievement nationwide was a challenge due to differential testing standards across school districts and the general lack of availability of large-scale data.

It is only recently that data-driven research has begun to reveal socioeconomic drivers of achievement gaps. The [Stanford Educational Data Archive](https://edopportunity.org/) (SEDA), a publicly available database on academic achievement and educational opportunity in U.S. schools, has supported this effort. The database is part of a broader initiative aiming to improve educational opportunity by enabling researchers and policymakers to identify systemic drivers of disparity.

> SEDA includes a range of detailed data on educational conditions, contexts, and outcomes in school districts and counties across the United States. It includes measures of academic achievement and achievement gaps for school districts and counties, as well as district-level measures of racial and socioeconomic composition, racial and socioeconomic segregation patterns, and other features of the schooling system.

The database standardizes average test scores for schools in 10,000 U.S. school districts relative to national standards, allowing comparison between school districts and across grade levels and years. The test score data come from the U.S. Department of Education. In addition, multiple data sources (American Community Survey and Common Core of Data) are integrated to provide district-level socioeconomic and demographic information.

A [study of the SEDA data published in 2018](https://cepa.stanford.edu/content/gender-achievement-gaps-us-school-districts) identified the following persistent patterns across grade levels 3 - 8 and school years from 2008 through 2015:
* a consistent reading and language achievement gap favoring girls;
* *no* national math achievement gap on average; and
* local math achievement gaps that depend on the socioeconomic conditions of school districts.

You can read about the main findings of the study in this [brief NY Times article](https://www.nytimes.com/interactive/2018/06/13/upshot/boys-girls-math-reading-tests.html).

Below, we'll work with selected portions of the database. The full datasets can be downloaded [here](https://edopportunity.org/get-the-data/seda-archive-downloads/).

---
## Assignment objectives

In this assignment, you'll explore achievement gaps in California school districts in 2018, reproducing the findings described [in the article above](https://www.nytimes.com/interactive/2018/06/13/upshot/boys-girls-math-reading-tests.html) on a more local scale and with the most recent SEDA data. This will afford you an opportunity to practice the first several stages of the data science lifecycle: collect, acquaint, tidy, and explore.

**Collect/acquaint**
* review data documentation
* identify population, sampling frame, sample
* assess scope of inference

**Tidy**
* data import
* slicing and filtering
* merging multiple data frames
* pivoting tables
* renaming and reordering variables

**Explore**
* scatterplots
* basic plotting aesthetics
* faceted plots
* visualizing trends
* aggregation and tabulation

**Communicate**
* narrative summary of exploratory analysis

---

### Collaboration

You are encouraged to collaborate with other students on the labs, but are expected to write up your own work for submission. Copying and pasting others' solutions is considered plagiarism and may result in penalties, depending on severity and extent.

If you choose to work with others, please list their names here.

**Your name:**

**Collaborators:**


---
## 0. Getting acquainted with the SEDA data

The cell below imports the district-level SEDA data from California in 2018. The test score data is stored in a separate file (`ca-main.csv`) from the socioeconomic and demographic covariate data (`ca-cov.csv`). 

In [None]:
# import seda data
ca_main = pd.read_csv('data/ca-main.csv')
ca_cov = pd.read_csv('data/ca-cov.csv')

### Test score data

The first few rows of the test data are shown below. The columns are:

Column name | Meaning
---|---
`sedalea` | District ID
`grade` | Grade level
`stateabb` | State abbreviation
`sedaleaname` | District name
`subject` | Test subject
`cs_mn_...` | Estimated mean test score
`cs_mnse_...` | Standard error for estimated mean test score
`totgyb_...` | Number of individual tests used to estimate the mean score

In [None]:
ca_main.head(3)

The test score means for each district are named `cs_mn_...` with an abbreviation indicating the subgroup (such as `cs_mn_all` for all students, `cs_mn_mal` for boys, `cs_mn_wht` for white students, and so on). Notice that these are generally small: decimal numbers between -0.5 and 0.5.

These means are *estimated* from a number of individual student tests and *standardized* relative to national averages. They represent the number of standard deviations by which a district mean differs from the national average. So, for instance, the value `cs_mn_all = 0.1` indicates that the district average is estimated to be 0.1 standard deviations greater than the national average on the corresponding test and at the corresponding grade level.
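
As a minimal sketch of the arithmetic behind this interpretation, the snippet below standardizes a district average against a national average and standard deviation. All three numbers are hypothetical and chosen only for illustration; SEDA's actual estimation procedure is more involved than a single subtraction and division.

```python
# Hypothetical numbers, chosen only to illustrate the arithmetic.
national_mean = 250.0   # assumed national average score on some test scale
national_sd = 40.0      # assumed national standard deviation of student scores
district_mean = 254.0   # assumed district average on the same scale

# Standardized district mean: distance from the national average in SD units.
standardized_mean = (district_mean - national_mean) / national_sd
print(standardized_mean)
```

A positive value places the district above the national average, and a negative value places it below.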

<!-- BEGIN QUESTION -->

### Q0 (a). Interpreting test score values

Interpret the average math test score for all 4th grade students in Acton-Agua Dulce Unified School District (the first row of the dataset shown above).


_Type your answer here, replacing this text._

<!-- END QUESTION -->

### Covariate data

The first few rows of the covariate data are shown below. The column information is as follows:

Column name | Meaning
---|---
`sedalea` | District ID
`grade` | Grade level
`sedaleanm` | District name
`urban` | Indicator: is the district in an urban locale?
`suburb` | Indicator: is the district in a suburban locale?
`town` | Indicator: is the district in a town locale?
`rural` | Indicator: is the district in a rural locale?
`locale` | Description of district locale
Remaining variables | Demographic and socioeconomic measures

In [None]:
ca_cov.head(3)

You will only be working with a handful of the demographic and socioeconomic measures, so you can put off getting acquainted with those until selecting a subset of variables.

<!-- BEGIN QUESTION -->

### Q0 (b). Data semantics

In the non-public data, observational units are students -- test scores are measured for each student. However, in the SEDA data you've imported, scores are *aggregated* to the district level by grade. Let's regard estimated test score means for each grade as distinct variables, so that an observation consists in a set of estimated means for different grade levels and groups. In this view, what are the observational units in the test score dataset? Are they the same or different for the covariate dataset?


_Type your answer here, replacing this text._

<!-- END QUESTION -->

### Q0 (c). Sample sizes

How many observational units are in each dataset? Count the number of units in the test dataset and the number of units in the covariate dataset separately. Store the numbers as `ca_main_units` and `ca_cov_units`, respectively.

(*Hint*: use `.nunique()`.)
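
As a toy illustration of the hint, the sketch below contrasts the number of rows with the number of distinct identifiers in a small made-up data frame. The column names and values are hypothetical and are not taken from the SEDA files.

```python
import pandas as pd

# Toy data: two hypothetical districts, each observed at two grade levels.
toy = pd.DataFrame({
    'district_id': [101, 101, 202, 202],
    'grade': [3, 4, 3, 4],
    'score': [0.10, 0.05, -0.20, -0.15],
})

n_rows = len(toy)                            # number of rows
n_districts = toy['district_id'].nunique()   # number of distinct IDs
print(n_rows, n_districts)
```

The two counts differ whenever an identifier appears in more than one row.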


In [None]:
ca_cov_units = ...
ca_main_units = ...

In [None]:
grader.check("q0_c")

### Q0 (d). Sample characteristics

Answer the questions below about the sampling design. You do not need to dig through any data documentation in order to resolve these questions.

<!-- BEGIN QUESTION -->

#### (i) What is the relevant population for the datasets you've imported?


_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

#### (ii) About what proportion (to within 0.1) of the population is captured in the sample?
(*Hint*: have a look at [this website](https://www.cde.ca.gov/ds/sd/cb/ceffingertipfacts.asp).)


_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

#### (iii) Considering that the sampling frame is not identified clearly, what kind of dataset do you suspect this is (*e.g.*, administrative, data from a 'typical sample', census, etc.)?  


_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Q0 (e). Scope of inference

In light of your description of the sample characteristics, what is the scope of inference for this dataset?


_Type your answer here, replacing this text._

<!-- END QUESTION -->

---
## 1. Tidy

Your goal will be to examine the relationship between gender achievement gaps and socioeconomic measures for school districts in California in 2018. In order to do this, the following manipulations of the imported data are needed:
* selecting columns of interest;
* filtering out non-urban districts;
* merging the covariate data with the test data; and
* putting the result in tidy format.

Since you've already had some guided practice doing this in previous assignments, you'll be left to fill in a little bit more of the details on your own in this assignment.

You'll work with the following variables from each dataset:

* **Test score data**
    + District ID
    + District name
    + Grade
    + Test subject
    + Estimated male-female gap
* **Covariate data**
    + District ID
    + Locale
    + Grade
    + Socioeconomic status (all demographic groups)
    + Log median income (all demographic groups)
    + Poverty rate (all demographic groups)
    + Unemployment rate (all demographic groups)
    + SNAP benefit receipt rate (all demographic groups)

### Q1 (a). Variable names of interest

Download the codebooks by opening the 'data' directory from your Jupyter Lab file navigator (pstat100-s21-content > hw > hw2 > data), right-click the codebook .xlsx files, and select 'Download'. Identify the variables listed above, and store the column names in lists `main_vars` and `cov_vars`. 


In [None]:
# store variable names of interest
main_vars = ...
cov_vars = ...

In [None]:
grader.check("q1_a")

### Q1 (b). Slice columns

Use your result from Q1 (a) to slice the columns of interest from the test score and covariate data. Store the results as `main_sub` and `cov_sub`, respectively.


In [None]:
# slice columns to select variables of interest
main_sub = ...
cov_sub = ...

In [None]:
grader.check("q1_b")

In the next step you'll merge the covariate data with the test score data. In order to do this, you can use the `pd.merge(A, B, how = ..., on = SHARED_COLS)` function, which will match the rows of `A` and `B` based on the shared columns `SHARED_COLS`. If `how = 'left'`, then only rows in `A` will be retained in the output (so `B` will be merged *to* `A`); conversely, if `how = 'right'`, then only rows in `B` will be retained in the output (so `A` will be merged *to* `B`).

A simple example of the use of `pd.merge` is illustrated below:

In [None]:
# toy data frames
A = pd.DataFrame(
    {'shared_col': ['a', 'b', 'c'], 
    'x1': [1, 2, 3], 
    'x2': [4, 5, 6]}
)

B = pd.DataFrame(
    {'shared_col': ['a', 'b'], 
    'y1': [7, 8]}
)

In [None]:
A

In [None]:
B

Below, if `A` and `B` are merged retaining the rows in `A`, notice that a missing value is input because `B` has no row where the shared column (on which the merging is done) has value `c`. In other words, the third row of `A` has no match in `B`.

In [None]:
# left join
pd.merge(A, B, how = 'left', on = 'shared_col')

If the direction of merging is reversed, and the row structure of `B` is dominant, then the third row of `A` is dropped altogether because it has no match in `B`.

In [None]:
# right join
pd.merge(A, B, how = 'right', on = 'shared_col')

### Q1 (c). Merge

Follow the example above and merge the covariate and test score data on both the ***district ID*** and ***grade level***, retaining only the rows from the test score data (meaning, treat the test score data as primary and merge the covariate data *to* the test score data). Store the result as `rawdata` and print the first four rows.

**Hint**: When merging on multiple columns, you can utilize a list to hold both column names.
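
To illustrate the hint on toy data, the sketch below merges two made-up frames on a pair of key columns. The frames and column names (`id`, `year`, `score`, `size`) are hypothetical.

```python
import pandas as pd

# Toy frames with hypothetical column names; both share 'id' and 'year'.
scores = pd.DataFrame({
    'id': [1, 1, 2],
    'year': [2017, 2018, 2018],
    'score': [0.3, 0.4, -0.1],
})
info = pd.DataFrame({
    'id': [1, 2, 2],
    'year': [2018, 2017, 2018],
    'size': [500, 120, 130],
})

# Left join: keep every row of `scores`, matching on the (id, year) pair.
merged = pd.merge(scores, info, how = 'left', on = ['id', 'year'])
print(merged)
```

A row of `scores` is matched only when both keys agree, so the row with `id` 1 and `year` 2017 receives a missing `size`.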


In [None]:
# merge covariates with gap data
rawdata = ...

# print first four rows
...

In [None]:
grader.check("q1_c")

### Q1 (d). Rename and reorder columns

Now rename and rearrange the columns of `rawdata` so that they appear in the following order and with the following names:

* District ID, District, Locale, log(Median income), Poverty rate, Unemployment rate, SNAP rate, Socioeconomic index, Grade, Gender gap, Subject 

Store the result as `rawdata_mod1` and print the first four rows.

(*Hint*: first define a dictionary to map the old names to the new ones; then create a list of the new names specified in the desired order; then use `.rename()` and `.loc[]`. You can follow the renaming steps in HW1 as an example if needed.)
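
The steps in the hint can be sketched on a toy data frame as follows. The names `toy`, `toy_name_map`, and `toy_order` are hypothetical and unrelated to the assignment's variables.

```python
import pandas as pd

# Toy frame with terse, hypothetical column names.
toy = pd.DataFrame({'id': [1, 2], 'val': [0.5, 0.7], 'nm': ['a', 'b']})

# Map old names to new names.
toy_name_map = {'id': 'ID', 'nm': 'Name', 'val': 'Value'}

# Desired column order, written with the new names.
toy_order = ['ID', 'Name', 'Value']

# Rename, then select all rows and the columns in the chosen order.
toy_tidy = toy.rename(columns = toy_name_map).loc[:, toy_order]
print(toy_tidy)
```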


In [None]:
# define dictionary mapping for renaming columns
...

# specify order of columns
...

# rename and reorder
...

# print first four rows
...

In [None]:
grader.check("q1_d")

### Q1 (e). Pivot

Notice that the Gender gap column contains the values of two variables: the gap in estimated mean test scores for math tests, and the gap in estimated mean test scores for reading and language tests. To put the data in tidier format, use `.pivot` to pivot the table so that the gender gap column is spread into two columns corresponding to the entries of `Subject`. Name the resulting columns `Math gap` and `Reading gap`, and store the result as `rawdata_mod2` and print the first four rows.

*Comment*: an alternative solution is to manipulate the indices and use `.unstack()`. Either method will produce a dataframe with hierarchical column indexing; this will need to be collapsed in order to rename the columns as instructed. You may find `MultiIndex.droplevel()` to be of use.
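
A toy sketch of the pivot-then-collapse pattern mentioned in the comment is given below. The data and column names are made up, and this is only one of several ways to flatten hierarchical column labels.

```python
import pandas as pd

# Toy long-format data: one 'value' column mixing two kinds of measurement.
long_df = pd.DataFrame({
    'unit': ['u1', 'u1', 'u2', 'u2'],
    'kind': ['height', 'weight', 'height', 'weight'],
    'value': [1.5, 60.0, 1.7, 72.0],
})

# Passing `values` as a list yields hierarchical (two-level) column labels.
wide = long_df.pivot(index = 'unit', columns = 'kind', values = ['value'])
n_levels = wide.columns.nlevels

# Drop the outer level ('value'), keeping only 'height' and 'weight'.
wide.columns = wide.columns.droplevel(0)

# Move the index back into an ordinary column and rename.
wide = wide.reset_index().rename(columns = {'height': 'Height', 'weight': 'Weight'})
print(wide)
```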

In [None]:
# pivot to unstack gender gap (fixing tidy issue: multiple variables in one column)
...

# print first four rows
...

In [None]:
grader.check("q1_e")

### Q1 (f). Indexing

If necessary, remove the name of the column index ('Subject') that was induced by the pivot step using `.rename_axis()`, and store the result as `data`; otherwise, simply store a copy of the previous dataframe as `data`. Print the first four rows.
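
For reference, a minimal toy sketch of removing a column-index name with `.rename_axis()` is shown below. The frame and the name `'kind'` are hypothetical.

```python
import pandas as pd

# Toy frame whose column index carries a name, as can happen after a pivot.
toy = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
toy.columns.name = 'kind'

# Remove the name of the column index; the data are unchanged.
toy_clean = toy.rename_axis(columns = None)
print(toy.columns.name, toy_clean.columns.name)
```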

In [None]:
# drop the name of column index induced by pivoting
data = ...

# print first four rows
...

Your final dataset should match the dataframe below. You can use this to check your answer and revise any portions above that lead to different results.

In [None]:
# intended result
data_reference = pd.read_csv('data/tidy-seda.csv')
data_reference.head(4)

### Q1 (g). Sanity check

Ensure that your tidying did not inadvertently drop any observations: count the number of units in `data`. Does this match the number of units represented in the original test score data `ca_main`? Store these as `data_units` and `ca_main_units`, respectively.

(*Hint*: use `.nunique()`.)


In [None]:
# number of districts in tidied data compared with raw
data_units = ...
ca_main_units = ...

In [None]:
grader.check("q1_g")

### Q1 (h). Missing values

Gap estimates were not calculated for certain grades in certain districts due to small sample sizes (not enough individual tests recorded).

#### (i) What proportion of rows are missing for each of the math and reading gap variables? 
Store these as `math_missing` and `reading_missing`, respectively.

**Hint**: You can use the fact that both column names end in "gap" to subset the dataframe.
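
As a toy illustration of the hint, the sketch below selects columns by a shared name ending and computes the share of missing entries in each. The frame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame; NaN marks a missing estimate. Column names are hypothetical.
toy = pd.DataFrame({
    'unit': ['u1', 'u2', 'u3', 'u4'],
    'A gap': [0.1, np.nan, 0.3, np.nan],
    'B gap': [np.nan, 0.2, 0.2, 0.4],
})

# Select columns whose names end in 'gap', then average the missingness flags.
gap_cols = toy.filter(regex = 'gap$')
prop_missing = gap_cols.isna().mean()
print(prop_missing)
```

Averaging works here because `.isna()` returns booleans, and the mean of a boolean column is the proportion of `True` values.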

In [None]:
# proportion of missing values
...

In [None]:
grader.check("q1_h_i")

#### (ii) What proportion of *districts* have missing gap estimates for one or both test subjects for at least one grade level?

Save the value as `district_missing`.


In [None]:
# proportion of districts with missing values
...

In [None]:
grader.check("q1_h_ii")

<!-- BEGIN QUESTION -->

#### (iii) Do you expect that this missingness is related to any particular district attribute(s)?


_Type your answer here, replacing this text._

<!-- END QUESTION -->

---
## 2. Explore

For the purpose of visualizing the relationship between estimated gender gaps and socioeconomic variables, you'll find it more helpful to store a non-tidy version of the data. The cell below rearranges the dataset so that one column contains an estimated gap, one column contains the value of a socioeconomic variable, and the remaining columns record the gap type and variable identity. 

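Ensure that your results from part 1 match the reference dataset before running this cell. The cell also refers to `name_order`, which it assumes is the list of new column names, in the order given in Q1 (d), that you defined in that question.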

In [None]:
# format data for plotting
plot_df = data.melt(
    id_vars = name_order[0:9],
    value_vars = ['Math gap', 'Reading gap'],
    var_name = 'Gap type',
    value_name = 'Gap'
).melt(
    id_vars = ['District ID', 'District', 'Locale', 'Gap type', 'Gap', 'Grade'],
    value_vars = name_order[3:8],
    var_name = 'Socioeconomic variable',
    value_name = 'Measure'
)

# preview
plot_df.head()
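
To see what a single `.melt()` step does, here is a toy sketch on a made-up frame. The column names are hypothetical and the example is separate from the assignment data.

```python
import pandas as pd

# Toy wide frame: one row per unit, two outcome columns (hypothetical names).
wide = pd.DataFrame({
    'unit': ['u1', 'u2'],
    'x': [10, 20],
    'Outcome A': [0.1, 0.2],
    'Outcome B': [-0.3, -0.4],
})

# Stack the two outcome columns into a label column and a value column.
long_df = wide.melt(
    id_vars = ['unit', 'x'],
    value_vars = ['Outcome A', 'Outcome B'],
    var_name = 'Outcome type',
    value_name = 'Outcome'
)
print(long_df)
```

Each original row appears once per melted column, so the toy frame grows from two rows to four.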

Altair, by default, limits the number of rows for input dataframes. We will need to disable this behavior in order to generate plots of this dataset.

In [None]:
# disable row limit for plotting
alt.data_transformers.disable_max_rows()

### Relationship between gender gaps and socioeconomic factors

The cell below generates a panel of scatterplots showing the relationship between estimated gender gap and socioeconomic factors for all grade levels by test subject. The plot suggests that the reading gap favors girls consistently across the socioeconomic spectrum -- in a typical district girls seem to outperform boys by 0.25 standard deviations of the national average. By contrast, the math gap appears to depend on socioeconomic factors -- boys only seem to outperform girls under *better* socioeconomic conditions.  

In [None]:
# plot gap against socioeconomic variables by subject for all grades
fig1 = alt.Chart(plot_df).mark_circle(opacity = 0.1).encode(
    y = 'Gap',
    x = alt.X('Measure', scale = alt.Scale(zero = False), title = ''),
    color = 'Gap type'
).properties(
    width = 200,
    height = 200
).facet(
    column = alt.Column('Socioeconomic variable')
).resolve_scale(x = 'independent')

fig1

### Q2 (a). Relationships by grade level

Does the pattern shown in the plot above persist within each grade level? 

<!-- BEGIN QUESTION -->


#### (i)
Modify the plot above to show these relationships by grade level: generate a panel of scatterplots of gap against socioeconomic measures by subject, where each column of the panel corresponds to one socioeconomic variable and each row corresponds to one grade level; the result should be a 5x5 panel. Resize the width and height of each facet so that the panel is of reasonable size.

(*Hint*: you may find it useful to have a look at the [altair documentation on compound charts](https://altair-viz.github.io/user_guide/compound_charts.html), and lab 2, for examples to follow.)



In [None]:
# plotting code here
...

# display
fig2a

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

#### (ii) Is the pattern consistent across grade levels?


_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Q2 (b). Do gaps shift across grade levels?

#### (i)
Construct a 2x5 panel of scatterplots showing estimated achievement gap against each of the 5 socioeconomic variables, with one row per test subject. Display grade level using a color gradient.

(*Hint:* plot gap against measure, facet by gap type (rows) and socioeconomic variable (columns), and color by grade.)


In [None]:
# plotting code here
...


# display
fig2b

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

#### (ii) Do the gaps seem to shift with grade level?


_Type your answer here, replacing this text._

<!-- END QUESTION -->

### Aggregating by grade

While the magnitude of the achievement gaps seems to depend very slightly on grade level (figure 2b), the *relationship* between achievement gap and socioeconomic factors does not differ from grade to grade (figure 2a). In what follows, you'll look at the average relationship between estimated achievement gap and median income after aggregating across grade. The cell below computes the mean of each variable across grade levels for each district. 

In [None]:
# aggregate across grades
data_agg = data.groupby(['District ID', 'District', 'Locale']).mean().reset_index().drop(columns = 'Grade')
data_agg.head()

Similar to working with the disaggregated data, it will be helpful for plotting to melt the two gap variables into a single column.

In [None]:
# format for plotting
agg_plot_df = data_agg.melt(
    id_vars = name_order[0:7],
    value_vars = ['Math gap', 'Reading gap'],
    var_name = 'Subject',
    value_name = 'Average estimated gap'
)

agg_plot_df.head()

<!-- BEGIN QUESTION -->

### Q2 (c). District average gaps

Construct a scatterplot of the average estimated gap against log(Median income) by subject for each district and add trend lines.


In [None]:
# scatterplot
...

# trend line
trend = ...

# combine layers
fig2c = ...

# display
fig2c

<!-- END QUESTION -->

Now let's try to capture this pattern in *tabular* form. The cell below adds an `Income bracket` variable by cutting the median income into 8 contiguous intervals using `pd.cut()`, and tabulates the average socioeconomic measures and estimated gaps across districts by income bracket. Notice that with respect to the gaps, this displays the pattern that is shown visually in the figures above. 

In [None]:
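# np.exp undoes the log transform before cutting into 8 equal-width brackets
data_agg['Income bracket'] = pd.cut(np.exp(data_agg['log(Median income)']), 8)
# numeric_only skips the text columns (District, Locale), which cannot be averaged
data_agg.groupby('Income bracket').mean(numeric_only = True).drop(columns = ['District ID', 'log(Median income)'])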
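
As a toy sketch of how `pd.cut()` forms equal-width intervals when given an integer number of bins, consider the made-up incomes below. The values are hypothetical and unrelated to the SEDA data.

```python
import pandas as pd

# Hypothetical incomes, in dollars.
income = pd.Series([30000, 45000, 60000, 90000, 120000, 150000])

# Cut the observed range into 3 equal-width intervals.
bracket = pd.cut(income, 3)
print(bracket.value_counts(sort = False))
```

The intervals have equal width but need not contain equal numbers of observations.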

### Q2 (d). Proportion of districts with a math gap.

What proportion of districts in each income bracket have an average estimated math achievement gap favoring boys? Answer this question by performing the following steps:

* Append an indicator variable `Math gap favoring boys` to `data_agg` that records whether the average estimated math gap favors boys by more than 0.1 standard deviations relative to the national average.
* Compute the proportion of districts in each income bracket for which the indicator is true: group by bracket and take the mean. Store this as `income_bracket_boys_favored`.
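
The two steps above follow a general pattern that can be sketched on toy data: create a boolean indicator, then take its mean within groups. The frame, column names, and threshold below are hypothetical.

```python
import pandas as pd

# Toy data with hypothetical names: a group label and a numeric measure.
toy = pd.DataFrame({
    'group': ['low', 'low', 'high', 'high', 'high'],
    'measure': [0.05, 0.20, 0.15, 0.30, -0.10],
})

# Boolean indicator: does the measure exceed a threshold?
toy['exceeds'] = toy['measure'] > 0.1

# The mean of a boolean column is the proportion of True values.
prop_by_group = toy.groupby('group')['exceeds'].mean()
print(prop_by_group)
```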


In [None]:
# define indicator
...

# proportion of districts with gap favoring boys, by income bracket
income_bracket_boys_favored = ...

In [None]:
grader.check("q2_d")

### Q2 (e). Statewide averages

To wrap up the exploration, calculate a few statewide averages to get a sense of how some of the patterns above compare with the state as a whole.

#### (i) Compute the statewide average estimated achievement gaps.
Store the result as `state_avg`.


In [None]:
# statewide average
state_avg = ...

In [None]:
grader.check("q2_e_i")

#### (ii)  Compute the proportion of districts in the state with a math gap favoring boys.
Store this result as `math_boys_proportion`.


In [None]:
# proportion of districts in the state with a math gap favoring boys
math_boys_proportion = ...

In [None]:
grader.check("q2_e_ii")

#### (iii)  Compute the proportion of districts in the state with a math gap favoring girls.

You will need to define a new indicator within `data_agg` to perform this calculation. Store the result as `math_girls_proportion`.


In [None]:
# new indicator
...

# proportion of districts in the state with a math gap favoring girls
math_girls_proportion = ...

In [None]:
grader.check("q2_e_iii")

---
## 3. Communicating results

Take a moment to review and reflect on your findings, and then answer the questions below.

<!-- BEGIN QUESTION -->

### Q3 (a). Summary

Write a brief summary of your exploratory analysis. What have you discovered about educational achievement gaps in California school districts? Aim for no more than 3-5 sentences.


_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Q3 (b). Hypothesize!

It's a cliche in statistics that 'correlation is not causation'. In your exploratory analysis, you identified a correlation between socioeconomic factors and achievement gaps. But clearly, affluence does not directly cause a math achievement gap favoring boys. What factors do you think might explain this association?


_Type your answer here, replacing this text._

<!-- END QUESTION -->

---
## Submission

1. Save file to confirm all changes are on disk
2. Run *Kernel > Restart & Run All* to execute all code from top to bottom
3. Save file again to write any new output to disk
4. Generate PDF copy
5. Submit both notebook and PDF to Gradescope

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()