# Missing Values Exercises


## Gradescope Autograding

Please follow [all standard guidance](https://www.practicaldatascience.org/html/autograder_guidelines.html) for submitting this assignment to the Gradescope autograder, including storing your solutions in a dictionary called `results` and ensuring your notebook runs from the start to completion without any errors.

For this assignment, please name your file `exercise_missing.ipynb` before uploading.

You can check that you have answers for all questions in your `results` dictionary with this code:


```python
assert set(results.keys()) == {
    "ex2_avg_income",
    "ex3_share_making_9999999",
    "ex3_share_making_zero",
    "ex5_avg_income",
    "ex8_avg_income_black",
    "ex8_avg_income_white",
    "ex8_racial_difference",
    "ex9_avg_income_black",
    "ex9_avg_income_white",
    "ex10_wage_gap",
}
```

### Submission Limits

Please remember that you are **only allowed three submissions to the autograder.** Your last submission (if you submit 3 or fewer times), or your third submission (if you submit more than 3 times) will determine your grade. Submissions that error out will **not** count against this total.


In [162]:
import pandas as pd
import numpy as np

pd.set_option("mode.copy_on_write", True)

results = {}

## Exercises

### Exercise 1

Today, we will be using the ACS data we used during our first `pandas` exercise to examine the US income distribution, and how it varies by race. Note that because the US income distribution has a very small number of people with *extremely* high incomes, and the ACS is just a sample of Americans, the far right tail of the distribution will not be very well estimated. However, this data should suffice for helping to understand wealth inequality in the United States.

To begin, load the ACS Data we used in our first pandas exercise. That [data can be found here](https://github.com/nickeubank/MIDS_Data/tree/master/US_AmericanCommunitySurvey). We'll be working with `US_ACS_2017_10pct_sample.dta`. 

In [163]:
import pandas as pd

# URL of the dataset US_ACS_2017_10pct_sample.dta
url = "https://github.com/nickeubank/MIDS_Data/raw/master/US_AmericanCommunitySurvey/US_ACS_2017_10pct_sample.dta"

# Reading the data from the URL into a pandas DataFrame
acs = pd.read_stata(url)

### Exercise 2

Let's begin by calculating the mean US income from this data (recall that income is stored in the `inctot` variable). Store the answer in `results` under the key `"ex2_avg_income"`.

In [164]:
ex2_avg_income = acs["inctot"].mean()
results["ex2_avg_income"] = ex2_avg_income
print(f"The mean US incomes from this ACS data is around ${round (ex2_avg_income,2)}.")

The mean US incomes from this ACS data is around $1723646.27.


### Exercise 3

Hmmm... That doesn't look right. The average American is definitely not earning that much a year! Let's look at the values of `inctot` using `value_counts()`. Do you see a problem?

Now use `value_counts()` with the argument `normalize=True` to see proportions of the sample that report each value instead of the count of people in each category. What percentage of our sample has an income of 9,999,999? Store that proportion (between 0 and 1) as `"ex3_share_making_9999999"`. What percentage has an income of 0? Store that proportion as `"ex3_share_making_zero"`.

(Recall `.value_counts()` returns a Series, so you can pull values out with our usual pandas tools.)
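As a small illustration of pulling values out of a `value_counts()` result, here is a sketch on a made-up Series (the numbers and the names `toy_income` and `toy_shares` are invented for illustration and are not from the ACS data). The result is indexed by the distinct values themselves, so a share can be looked up by label:

```python
import pandas as pd

# Toy income data, invented for illustration only
toy_income = pd.Series([0, 0, 30000, 9999999, 9999999, 9999999, 50000, 0])

# Proportion of the sample reporting each distinct value;
# the index of the result holds the values themselves
toy_shares = toy_income.value_counts(normalize=True)

# Look up a share by its label with .loc
share_sentinel = toy_shares.loc[9999999]
share_zero = toy_shares.loc[0]

print(share_sentinel, share_zero)
```

Using `.loc` makes it explicit that the lookup is by label (the income value), not by position.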

In [165]:
acs["inctot"].value_counts()

inctot
9999999    53901
0          33679
30000       4778
50000       4414
40000       4413
           ...  
70520          1
76680          1
57760          1
200310         1
505400         1
Name: count, Length: 8471, dtype: int64

In [166]:
acs["inctot"].value_counts(normalize=True)

inctot
9999999    0.168967
0          0.105575
30000      0.014978
50000      0.013837
40000      0.013834
             ...   
70520      0.000003
76680      0.000003
57760      0.000003
200310     0.000003
505400     0.000003
Name: proportion, Length: 8471, dtype: float64

In [167]:
income_percentage = acs["inctot"].value_counts(normalize=True)
ex3_share_making_9999999 = income_percentage[9999999]
print(
    f"The proportion of our dataset acs having an income of 9,999,999 is around {round(ex3_share_making_9999999*100,2)}%."
)
ex3_share_making_zero = income_percentage[0]
print(
    f"The proportion of our dataset acs having an income of 0 is around {round(ex3_share_making_zero*100,2)}%."
)
results["ex3_share_making_9999999"] = ex3_share_making_9999999
results["ex3_share_making_zero"] = ex3_share_making_zero
print(results)

The proportion of our dataset acs having an income of 9,999,999 is around 16.9%.
The proportion of our dataset acs having an income of 0 is around 10.56%.
{'ex2_avg_income': 1723646.2703978634, 'ex3_share_making_9999999': 0.1689665333350052, 'ex3_share_making_zero': 0.10557547867738336}


### Exercise 4

As we discussed before, the ACS uses a value of 9999999 to denote that income information is not available for someone. The problem with using this kind of "sentinel value" is that pandas doesn't understand that this is supposed to denote missing data, and so when it averages the variable, it doesn't know to ignore 9999999. 

To help out `pandas`, use the `replace` command to replace all values of 9999999 with `np.nan`. 
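To see why the sentinel matters, here is a minimal sketch on a toy Series (the values are invented for illustration; only the sentinel 9999999 comes from the text above). It compares the mean before and after replacing the sentinel with `np.nan`:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; 9999999 stands in for "not available"
toy = pd.Series([20000, 40000, 9999999, 60000])

# With the sentinel left in, it is averaged like any other number
mean_with_sentinel = toy.mean()

# After replacement, pandas skips the missing entry by default
toy_clean = toy.replace(9999999, np.nan)
mean_clean = toy_clean.mean()

print(mean_with_sentinel, mean_clean)
```

Note that the replaced Series becomes `float64`, because `np.nan` is a floating-point value.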

In [168]:
acs["inctot"] = acs["inctot"].replace(9999999, np.nan)
acs["inctot"]

0              NaN
1           6000.0
2           6150.0
3          14000.0
4              NaN
            ...   
318999     22130.0
319000         NaN
319001      5000.0
319002    240000.0
319003     48000.0
Name: inctot, Length: 319004, dtype: float64

### Exercise 5

Now that we've properly labeled our missing data as `np.nan`, let's calculate the average US income once more. Store the answer in `results` under the key `"ex5_avg_income"`.

In [169]:
ex5_avg_income = acs["inctot"].mean()
print(
    f"The mean US incomes from this data after converting values 9999999 to np.nan is around ${round (ex5_avg_income,2)}."
)
results["ex5_avg_income"] = ex5_avg_income
print(results)

The mean US incomes from this data after converting values 9999999 to np.nan is around $40890.18.
{'ex2_avg_income': 1723646.2703978634, 'ex3_share_making_9999999': 0.1689665333350052, 'ex3_share_making_zero': 0.10557547867738336, 'ex5_avg_income': 40890.177564946454}


### Exercise 6

OK, now we've been able to get a reasonable average income number. As we can see, a major advantage of using `np.nan` is that `pandas` knows that `np.nan` observations should just be ignored when we are calculating means. 

But it's not enough to just get rid of the people who had `inctot` values of 9999999. We also need to know why those values were missing. Suppose, for example, that the value of 9999999 was used for anyone who made more than 100,000 dollars: if we just dropped those people, then our estimate of average income wouldn't mean much, would it?

So let's make sure we understand *why* data is missing for some people. If you recall from our last exercise, it seemed to be the case that most of the people who had incomes of 9999999 were children. Let's make sure that's true by looking at the distribution of the variable `age` for people for whom `inctot` is missing (i.e. subset the data to people with `inctot` missing, then look at the values of `age` with `value_counts()`).

Then do the opposite: look at the distribution of the `age` variable for people for whom `inctot` is *not* missing.

Can you determine when 9999999 was being used? Is it ok we're excluding those people from our analysis?

Note: In this data, Python doesn't understand `age` is a number; it thinks it is a string because the original data has categories like "90 (90+ in 1980 and 1990)" and "less than 1 year old". So you can't just use `min()` or `max()`. We'll discuss converting string variables into numbers in a future class.
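The subsetting step described above can be sketched on a toy DataFrame (all rows below are invented; only the column names and the two age labels quoted in the note come from the exercise). The sketch assumes `age` behaves like a pandas categorical column, which is an assumption here rather than something the exercise states. For a categorical column, `value_counts()` lists every category, including those with a count of 0 in the subset, so filtering to positive counts makes the pattern easier to read:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame(
    {
        "age": ["less than 1 year old", "6", "6", "34", "52",
                "90 (90+ in 1980 and 1990)"],
        "inctot": [np.nan, np.nan, np.nan, 41000.0, 0.0, 12000.0],
    }
)
# Assumption: treat age as categorical, as it might be after reading a .dta file
toy["age"] = toy["age"].astype("category")

# Ages of people whose income is missing
missing_mask = toy["inctot"].isna()
ages_missing = toy.loc[missing_mask, "age"].value_counts()

# Categorical value_counts keeps zero-count categories; drop them for readability
ages_missing_nonzero = ages_missing[ages_missing > 0]

# Ages of people whose income is observed
ages_present = toy.loc[~missing_mask, "age"].value_counts()
ages_present_nonzero = ages_present[ages_present > 0]

print(ages_missing_nonzero)
print(ages_present_nonzero)
```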

In [170]:
# check for missing values
acs["inctot"].value_counts(dropna=False)

# subset the data to people with `inctot` missing, then look at the values of `age` with `value_counts()`
subset_age_null = acs.loc[acs["inctot"].isnull(), "age"].value_counts()

# sample 50 random age categories from the counts
subset_age_null.sample(50)

age
50                         0
62                         0
6                       3524
49                         0
87                         0
81                         0
54                         0
21                         0
47                         0
67                         0
42                         0
59                         0
19                         0
28                         0
82                         0
44                         0
76                         0
33                         0
17                         0
31                         0
83                         0
73                         0
60                         0
63                         0
95                         0
92                         0
80                         0
55                         0
69                         0
43                         0
16                         0
53                         0
23                         0
26                         0
37        

In [171]:
# do the opposite: subset to people with `inctot` not missing, then look at the values of `age` with `value_counts()`
subset_age_not_null = acs.loc[acs["inctot"].notnull(), "age"].value_counts()

# sample 50 random age categories from the counts
subset_age_not_null.sample(50)

age
34                           3942
31                           3880
21                           3740
42                           3603
1                               0
79                           1758
38                           3718
23                           3551
59                           4776
less than 1 year old            0
29                           3810
11                              0
82                           1464
86                           1041
57                           4720
53                           4600
30                           3917
54                           4821
28                           3808
71                           2917
74                           2819
85                           1117
22                           3617
26                           3781
52                           4418
68                           3951
88                            859
48                           3956
44                           3656
90 (90+ in

> After this analysis, we find that the missing values (9999999) belong predominantly to children, a group that doesn't typically have an income. Excluding them from the income analysis should therefore be fine, since they wouldn't be relevant to income statistics.

### Exercise 7

Great, so now we know why those people had missing data, and we're ok with excluding them. 

But as we previously noted, there are also a lot of observations of zero income in our data, and it's not clear that everyone with zero income *should* be included in this average, since those may be people who are retired, or in school.

Let's limit our attention to people who are currently working by subsetting to only employed respondents. We can do this using `empstat`. Remember you can use `value_counts()` to see what values of `empstat` are in the data!
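As a rough sketch of how this kind of subsetting changes an average, consider a toy DataFrame (the rows are invented for illustration; the `empstat` labels are assumed to match the ones in the ACS data):

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame(
    {
        "empstat": ["employed", "n/a", "not in labor force",
                    "employed", "unemployed"],
        "inctot": [52000.0, 0.0, 0.0, 31000.0, 4000.0],
    }
)

# Inspect which employment categories exist before filtering
print(toy["empstat"].value_counts())

# Keep only employed respondents
toy_employed = toy.loc[toy["empstat"] == "employed"]

mean_all = toy["inctot"].mean()
mean_employed = toy_employed["inctot"].mean()

print(mean_all, mean_employed)
```

In this toy example, the zero-income rows from people who are not working pull the overall mean well below the mean among the employed.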

In [172]:
# Check unique values and their counts for employment status
print(acs["empstat"].value_counts())

# subsetting to only employed respondents
subset_employment = acs.loc[acs["empstat"] == "employed"]

empstat
employed              148758
not in labor force    104676
n/a                    57843
unemployed              7727
Name: count, dtype: int64

### Exercise 8

Now let's estimate the racial income gap in the United States. What is the average salary for employed Black Americans, and what is the average salary for employed White Americans? In percentage terms, how much more does the average White American make than the average Black American?

**Note:** these values are not quite accurate estimates. As we'll discuss in later lessons, to get completely accurate estimates from the ACS we have to take into account how people were selected to be interviewed. But you get pretty good estimates in most cases even without weights—your estimate of the racial wage gap without weights is within 5\% of the corrected value. 

**Note:** This is actually an underestimate of the wage gap. The US Census treats Hispanic respondents as a sub-category of "White." While all ethnic distinctions are socially constructed, and so on some level these distinctions are all deeply problematic, this coding is inconsistent with what most Americans think of when they hear the term "White," a term *most* Americans think of as a category that is mutually exclusive of being Hispanic or Latino (categories which are also usually conflated in American popular discussion). With that in mind, most researchers working with US Census data split "White" into "White, Hispanic" and "White, Non-Hispanic" using `race` *and* `hispan`. But for the moment, just identify "White" respondents using the value in `race`.

Store your results in `results` under the keys `"ex8_avg_income_black"`, `"ex8_avg_income_white"`, and the percentage difference as `"ex8_racial_difference"`. Please note the wording above when calculating the percentage difference to ensure you get the reference category correct, and interpret your result as well.
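The choice of reference category changes the answer, which a short sketch with made-up numbers can show. The helper `pct_more` below is hypothetical (it is not part of pandas or of the assignment); it computes how much more `a` is than `b`, in percent, with `b` as the reference:

```python
def pct_more(a, b):
    """Percent by which a exceeds b, using b as the reference category."""
    return (a - b) / b * 100


# Toy averages, invented for illustration only
group_a_avg = 60000.0
group_b_avg = 40000.0

# "How much more does group A make than group B?" uses B in the denominator
a_over_b = pct_more(group_a_avg, group_b_avg)

# Swapping the reference category gives a different magnitude
b_over_a = pct_more(group_b_avg, group_a_avg)

print(a_over_b, round(b_over_a, 2))
```

Here group A makes 50% more than group B, while group B makes about 33% less than group A: the same two averages, but a different number depending on the denominator.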

In [173]:
# check race value counts
subset_employment["race"].value_counts()

# 1. Average salary for employed Black Americans
ex8_avg_income_black = subset_employment[
    subset_employment["race"] == "black/african american/negro"
]["inctot"].mean()

# 2. Average salary for employed White Americans
ex8_avg_income_white = subset_employment[subset_employment["race"] == "white"][
    "inctot"
].mean()

# 3. Percentage difference
ex8_racial_difference = (
    (ex8_avg_income_white - ex8_avg_income_black) / ex8_avg_income_black
) * 100

print(
    f"The average salary for employed Black Americans is around ${round(ex8_avg_income_black,2)}."
)
print(
    f"The average salary for employed white Americans is around ${round(ex8_avg_income_white,2)}."
)

print(
    f"The average White American makes around {round(ex8_racial_difference,2)}% more than the average Black American."
)

# Store the results
results["ex8_avg_income_black"] = ex8_avg_income_black
results["ex8_avg_income_white"] = ex8_avg_income_white
results["ex8_racial_difference"] = ex8_racial_difference

print(results)

The average salary for employed Black Americans is around $41747.95.
The average salary for employed white Americans is around $60473.15.
The average White American makes around 44.85% more than the average Black American.
{'ex2_avg_income': 1723646.2703978634, 'ex3_share_making_9999999': 0.1689665333350052, 'ex3_share_making_zero': 0.10557547867738336, 'ex5_avg_income': 40890.177564946454, 'ex8_avg_income_black': 41747.949905123336, 'ex8_avg_income_white': 60473.15372747098, 'ex8_racial_difference': 44.85299006275197}


### Exercise 9


As noted above, these estimates are not actually *quite* correct because we aren't using survey weights. To calculate a weighted average that takes into account survey weights, you need to use the following formula:

$$weighted\_mean\_of\_x = \frac{\sum_i x_i * weight_i}{\sum_i weight_i}$$

(As you can see, when $weight_i$ is constant for all observations, this just simplifies to our normal formula for mean values. It is only when weights vary across individuals that weights must be explicitly addressed).

In this data, weights are stored in the variable `perwt`, which is the number of people for which each observation is a stand-in (the inverse of that observation's sampling probability).

Using the formula, re-calculate the *weighted* average income for both populations and store them as `ex9_avg_income_white` and `ex9_avg_income_black`.
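The formula can be sketched on a toy DataFrame (values invented for illustration; only the column names `inctot` and `perwt` come from the exercise). One detail worth handling explicitly is missing income: if a row has a weight but no income, pandas drops it from the numerator sum but not from the denominator, so it is safer to drop such rows first. `np.average` with its `weights` argument serves as a cross-check:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame(
    {
        "inctot": [10000.0, 50000.0, np.nan],
        "perwt": [300.0, 100.0, 600.0],
    }
)

# Keep only rows with an observed income so that the numerator and
# denominator are computed over the same observations
valid = toy.dropna(subset=["inctot"])

# Weighted mean: sum(x_i * weight_i) / sum(weight_i)
weighted_mean = (valid["inctot"] * valid["perwt"]).sum() / valid["perwt"].sum()

# Cross-check against NumPy's built-in weighted average
check = np.average(valid["inctot"], weights=valid["perwt"])

# Without dropping, the missing row's weight still enters the denominator
naive = (toy["inctot"] * toy["perwt"]).sum() / toy["perwt"].sum()

print(weighted_mean, check, naive)
```

If the subset has no missing incomes, the naive and the careful versions agree, so the extra `dropna` only matters when missing values are present.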


In [174]:
# Weighted average salary for employed White Americans
white_subset = subset_employment[subset_employment["race"] == "white"]
ex9_avg_income_white = (
    white_subset["inctot"] * white_subset["perwt"]
).sum() / white_subset["perwt"].sum()

# Weighted average salary for employed Black Americans
black_subset = subset_employment[
    subset_employment["race"] == "black/african american/negro"
]
ex9_avg_income_black = (
    black_subset["inctot"] * black_subset["perwt"]
).sum() / black_subset["perwt"].sum()

print(
    f"The weighted average salary for employed Black Americans is around ${round(ex9_avg_income_black,2)}."
)
print(
    f"The weighted average salary for employed white Americans is around ${round(ex9_avg_income_white,2)}."
)

# Store the results
results["ex9_avg_income_black"] = ex9_avg_income_black
results["ex9_avg_income_white"] = ex9_avg_income_white

print(results)

The weighted average salary for employed Black Americans is around $40430.95.
The weighted average salary for employed white Americans is around $58361.48.
{'ex2_avg_income': 1723646.2703978634, 'ex3_share_making_9999999': 0.1689665333350052, 'ex3_share_making_zero': 0.10557547867738336, 'ex5_avg_income': 40890.177564946454, 'ex8_avg_income_black': 41747.949905123336, 'ex8_avg_income_white': 60473.15372747098, 'ex8_racial_difference': 44.85299006275197, 'ex9_avg_income_black': 40430.953355310274, 'ex9_avg_income_white': 58361.48196061399}


### Exercise 10

Now calculate the weighted average income gap between *non-Hispanic* White Americans and Black Americans. What percentage more do employed White non-Hispanic Americans earn than employed Black Americans? Store as `"ex10_wage_gap"`.
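Selecting on two variables at once means combining boolean masks. A minimal sketch on toy rows (invented for illustration; the category labels are assumed to match the values of `race` and `hispan` in the data):

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame(
    {
        "race": ["white", "white", "black/african american/negro", "white"],
        "hispan": ["not hispanic", "mexican", "not hispanic", "not hispanic"],
        "inctot": [60000.0, 35000.0, 40000.0, 70000.0],
    }
)

# Each comparison needs its own parentheses because & binds more
# tightly than == in Python
mask = (toy["race"] == "white") & (toy["hispan"] == "not hispanic")
white_non_hispanic = toy.loc[mask]

print(len(white_non_hispanic), white_non_hispanic["inctot"].mean())
```

Using `&` (rather than `and`) is what makes the comparison element-wise across rows.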

In [175]:
# observe unique values in the hispan column to identify non-Hispanic Whites
print(subset_employment["hispan"].value_counts())

# Subset for employed non-Hispanic White Americans
white_non_hispanic_subset = subset_employment[
    (subset_employment["race"] == "white")
    & (subset_employment["hispan"] == "not hispanic")
]

# Calculate the weighted average for non-Hispanic White Americans
weighted_avg_income_white_non_hispanic = (
    white_non_hispanic_subset["inctot"] * white_non_hispanic_subset["perwt"]
).sum() / white_non_hispanic_subset["perwt"].sum()

# Wage gap calculation
wage_gap = (
    (weighted_avg_income_white_non_hispanic - ex9_avg_income_black)
    / ex9_avg_income_black
) * 100

print(
    f"The weighted average income gap between non-Hispanic white Americans and black Americans is about {round(wage_gap,2)}%."
)
# Store the result
results["ex10_wage_gap"] = wage_gap

print(results)

hispan
not hispanic    128293
mexican          12437
other             5302
puerto rican      1787
cuban              939
Name: count, dtype: int64
The weighted average income gap between non-Hispanic white Americans and black Americans is about 52.53%.
{'ex2_avg_income': 1723646.2703978634, 'ex3_share_making_9999999': 0.1689665333350052, 'ex3_share_making_zero': 0.10557547867738336, 'ex5_avg_income': 40890.177564946454, 'ex8_avg_income_black': 41747.949905123336, 'ex8_avg_income_white': 60473.15372747098, 'ex8_racial_difference': 44.85299006275197, 'ex9_avg_income_black': 40430.953355310274, 'ex9_avg_income_white': 58361.48196061399, 'ex10_wage_gap': 52.52989147705372}


### Exercise 11

Is that greater or less than the difference you found in Exercise 8? Why do you think that's the case?

> + **Comparison of Income Gaps:**
>   - In Exercise 8, comparing all respondents coded as "White" with Black Americans, we found that the average employed White American makes around 44.85% more than the average employed Black American (an unweighted estimate).
>   - In Exercise 10, restricting the comparison to non-Hispanic White Americans and using survey weights, the income gap rises to 52.53%. The disparity is therefore larger once respondents who identify as Hispanic are excluded from the "White" category.
>
> + **Reason for the Difference:**
>   - The category "White" in `race` spans several values of `hispan`: "not hispanic", "mexican", "puerto rican", "cuban", and "other".
>   - A larger gap after the restriction implies that Hispanic White respondents have, on average, lower incomes than non-Hispanic White respondents.
>   - When these lower-income subgroups are included in the broader "White" category, they pull down the average income for that entire group, so the gap with Black Americans appears smaller.
>   - Restricting the analysis to the non-Hispanic White subgroup removes that downward pull, which raises the White average and widens the measured gap with Black Americans.
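One caveat about this comparison: the Exercise 8 gap is unweighted, while the Exercise 10 gap is weighted. A like-for-like check is to compute the weighted gap for all White respondents from the two Exercise 9 averages. The sketch below simply reuses the values printed in the Exercise 9 output (copied by hand, so the variable names here are illustrative):

```python
# Weighted averages as printed in the Exercise 9 output of this notebook
ex9_white = 58361.48196061399
ex9_black = 40430.953355310274

# Weighted percentage gap, with Black Americans as the reference category
weighted_gap_all_white = (ex9_white - ex9_black) / ex9_black * 100

print(round(weighted_gap_all_white, 2))
```

This comes out at roughly 44.35%, close to the unweighted 44.85% from Exercise 8, which suggests that most of the jump to 52.53% comes from excluding Hispanic respondents rather than from applying weights.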

In [176]:
assert set(results.keys()) == {
    "ex2_avg_income",
    "ex3_share_making_9999999",
    "ex3_share_making_zero",
    "ex5_avg_income",
    "ex8_avg_income_black",
    "ex8_avg_income_white",
    "ex8_racial_difference",
    "ex9_avg_income_black",
    "ex9_avg_income_white",
    "ex10_wage_gap",
}