Source: https://www.practicaldatascience.org/html/exercises/Exercise_missing.html

# Exercise 1
Today, we will be using the ACS data we used during our first pandas exercise to examine the US income distribution, and how it varies by race. Note that because the US income distribution has a very small number of people with extremely high incomes, and the ACS is just a sample of Americans, the far right tail of the distribution will not be very well estimated. However, this data should suffice for helping to understand wealth inequality in the United States.

To begin, load the ACS Data we used in our first pandas exercise. That data can be found here. We’ll be working with US_ACS_2017_10pct_sample.dta.

In [1]:
!ls

DataFrames.ipynb             Series.ipynb
Index.ipynb                  US_ACS_2017_10pct_sample.dta
Missing_Data.ipynb


In [3]:
import pandas as pd
df = pd.read_stata('US_ACS_2017_10pct_sample.dta')

# Exercise 2
Let’s begin by calculating the mean US incomes from this data (recall that income is stored in the inctot variable).

In [5]:
df['inctot'].mean()

1723646.2703978634

In [6]:
df['inctot'].max()

9999999

The mean looks far too high because the ACS codes missing income as 9999999, and those values are being averaged in as if they were real incomes.

# Exercise 3
Hmmm… That doesn’t look right. The average American is definitely not earning 1.7 million dollars a year. Let’s look at the values of inctot using value_counts(). Do you see a problem?

Now use value_counts() with the argument normalize=True to see proportions of the sample that report each value instead of the count of people in each category. What percentage of our sample has an income of 9,999,999? What percentage has an income of 0?

In [7]:
df['inctot'].value_counts()

9999999    53901
0          33679
30000       4778
50000       4414
40000       4413
           ...  
70520          1
76680          1
57760          1
200310         1
505400         1
Name: inctot, Length: 8471, dtype: int64

In [8]:
df['inctot'].value_counts(normalize=True)

9999999    0.168967
0          0.105575
30000      0.014978
50000      0.013837
40000      0.013834
             ...   
70520      0.000003
76680      0.000003
57760      0.000003
200310     0.000003
505400     0.000003
Name: inctot, Length: 8471, dtype: float64
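To answer the two percentage questions directly, one option is to take the mean of a boolean comparison, which gives the share of rows where the comparison is true. The sketch below uses a small made-up Series (`inctot` here is a toy stand-in, not the ACS column) and checks that this matches the normalized value counts. In the real output above, the corresponding shares are roughly 16.9% for 9999999 and 10.6% for 0.

```python
import pandas as pd

# Toy stand-in for df['inctot']; the values are invented for illustration.
inctot = pd.Series([9999999, 0, 30000, 9999999, 50000, 0, 0, 40000, 9999999, 9999999])

# A comparison yields True/False per row, so its mean is the share of True rows.
share_sentinel = (inctot == 9999999).mean()
share_zero = (inctot == 0).mean()

# The same shares can be looked up in the normalized value counts.
proportions = inctot.value_counts(normalize=True)

print(f"{share_sentinel:.1%} of the toy sample is coded 9999999")
print(f"{share_zero:.1%} of the toy sample reports zero income")
```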

# Exercise 4
As we discussed before, the ACS uses a value of 9999999 to denote that income information is not available for someone. The problem with using this kind of “sentinel value” is that pandas doesn’t understand that this is supposed to denote missing data, and so when it averages the variable, it doesn’t know to ignore 9999999.

To help out pandas, use the replace command to replace all values of 9999999 with np.nan.

In [10]:
import numpy as np

In [12]:
df['inctot'] = df['inctot'].replace(9999999, np.nan)
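A quick way to confirm that a replacement like this did what we wanted is to count the missing values afterwards and compare the mean before and after. The following is a minimal sketch on a made-up Series (`toy` and its values are assumptions for illustration, not ACS data).

```python
import numpy as np
import pandas as pd

# Toy income column; 9999999 plays the role of the ACS sentinel value.
toy = pd.Series([9999999, 20000, 40000, 9999999, 60000])

cleaned = toy.replace(9999999, np.nan)

n_missing = cleaned.isna().sum()  # how many values are now flagged as missing
mean_before = toy.mean()          # inflated, because the sentinels are averaged in
mean_after = cleaned.mean()       # NaN values are skipped by default (skipna=True)

print(n_missing, mean_before, mean_after)
```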

# Exercise 5
Now that we’ve properly labeled our missing data as np.nan, let’s calculate the average US income once more.

In [13]:
df['inctot'].mean()

40890.177564946454

# Exercise 6
OK, now we’ve been able to get a reasonable average income number. As we can see, a major advantage of using np.nan is that pandas knows that np.nan observations should just be ignored when we are calculating means.

But it’s not enough to just get rid of the people who had inctot values of 9999999. We also need to know why those values were missing. Suppose, for example, that the value of 9999999 was used for anyone who made more than 100,000 dollars: if we just dropped those people, then our estimate of average income wouldn’t mean much, would it?

So let’s make sure we understand why data is missing for some people. If you recall from our last exercise, it seemed to be the case that most of the people who had incomes of 9999999 were children. Let’s make sure that’s true by looking at the distribution of the variable age for people for whom inctot is missing (i.e. subset the data to people with inctot missing, then look at the values of age with value_counts()).

Then do the opposite: look at the distribution of the age variable for people for whom inctot is not missing.

Can you determine when 9999999 was being used? Is it ok we’re excluding those people from our analysis?

Note: In this data, Python doesn’t understand age is a number; it thinks it is a string because the original data has categories like “90 (90+ in 1980 and 1990)” and “less than 1 year old”. So you can’t just use min() or max(). We’ll discuss converting string variables into numbers in a future class.

In [19]:
cond1 = df['inctot'].isna()
print(df[cond1]['age'].min())
print(df[cond1]['age'].max())

less than 1 year old
14


In [20]:
print(df[~cond1]['age'].min())
print(df[~cond1]['age'].max())

15
96
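The exercise text suggests value_counts() for this comparison; that approach works even when age is stored as text. Here is a minimal sketch on a made-up DataFrame (`toy` and its rows are invented, with age deliberately stored as strings), splitting on whether inctot is missing.

```python
import numpy as np
import pandas as pd

# Toy data: age is stored as text, and inctot is NaN for the youngest rows.
toy = pd.DataFrame({
    "age": ["less than 1 year old", "7", "14", "15", "42", "14", "67"],
    "inctot": [np.nan, np.nan, np.nan, 12000.0, 55000.0, np.nan, 30000.0],
})

missing = toy["inctot"].isna()

# Distribution of age among rows with missing income, and among the rest.
ages_missing = toy.loc[missing, "age"].value_counts()
ages_present = toy.loc[~missing, "age"].value_counts()

print(ages_missing)
print(ages_present)
```

If the two sets of age labels do not overlap, as in this toy data, that is good evidence the sentinel was applied by age rather than by income level.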


# Exercise 7
Great, so now we know why those people had missing data, and we’re ok with excluding them.

But as we previously noted, there are also a lot of observations of zero income in our data, and it’s not clear that everyone with zero income should be included in this average, since those may be people who are retired, or in school.

Let’s limit our attention to people who are currently working. We can do this using empstat. Remember you can use value_counts() to see what values of empstat are in the data!

In [22]:
df['empstat'].value_counts()

employed              148758
not in labor force    104676
n/a                    57843
unemployed              7727
Name: empstat, dtype: int64

In [28]:
cond1 = df['empstat'] == 'employed'
df[cond1]['inctot'].mean()

57854.723914007984

# Exercise 8
Now let’s estimate the racial income gap in the United States. What is the average salary for employed Black Americans, and what is the average salary for employed White Americans? In percentage terms, how much more does the average White American make than the average Black American?

In [38]:
df['race'].value_counts()

white                               243751
black/african american/negro         31691
other asian or pacific islander      12508
other race, nec                      12304
two major races                       8826
chinese                               4313
american indian or alaska native      3595
three or more major races             1207
japanese                               809
Name: race, dtype: int64

In [39]:
cond1 = df['empstat'] == 'employed'
# Select inctot before averaging so non-numeric columns are never passed to mean()
df[cond1].groupby('race')['inctot'].mean()

race
white                               60473.153727
black/african american/negro        41747.949905
american indian or alaska native    37996.522481
chinese                             72804.918567
japanese                            78906.744186
other asian or pacific islander     66647.736613
other race, nec                     34989.400521
two major races                     49021.151515
three or more major races           49787.183099
Name: inctot, dtype: float64
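The percentage question in this exercise can be answered from the two group means in the output above. The sketch below copies those two numbers by hand (the variable names are just for illustration) and expresses the gap relative to the average for employed Black Americans, which comes to roughly 45 percent.

```python
# Unweighted mean incomes for employed respondents, copied from the output above.
mean_white = 60473.153727
mean_black = 41747.949905

# Percentage by which the White average exceeds the Black average.
pct_gap = (mean_white - mean_black) / mean_black * 100

print(f"{pct_gap:.1f}%")
```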

In [40]:
cond1 = df['empstat'] == 'employed'
# Same breakdown, further split by Hispanic origin
df[cond1].groupby(['race','hispan'])['inctot'].mean()

race                              hispan      
white                             not hispanic     63049.444348
                                  mexican          37490.119250
                                  puerto rican     46674.033838
                                  cuban            48149.635922
                                  other            44690.357210
black/african american/negro      not hispanic     41969.817312
                                  mexican          25981.224490
                                  puerto rican     36056.916667
                                  cuban            36912.121212
                                  other            39147.647059
american indian or alaska native  not hispanic     38021.568816
                                  mexican          39143.406593
                                  puerto rican     27366.666667
                                  cuban            14500.000000
                                  other            37018.

# Want more practice?
(1) As noted above, these estimates are not actually quite correct because we aren’t using survey weights. To calculate a weighted average that takes into account survey weights, you need to use the following formula:

 
$$\text{weighted mean of } x = \frac{\sum_i x_i \, w_i}{\sum_i w_i}$$

(As you can see, when $w_i$ is constant for all observations, this just simplifies to our normal formula for mean values. It is only when weights vary across individuals that weights must be explicitly addressed).
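As a small worked check of the weighted-average formula (the sum of each value times its weight, divided by the sum of the weights), the sketch below uses made-up incomes and weights. It also confirms the point about constant weights, and compares against NumPy's `np.average`, which accepts a `weights` argument.

```python
import numpy as np

# Made-up incomes and survey weights, for illustration only.
income = np.array([20000.0, 40000.0, 90000.0])
weights = np.array([10.0, 30.0, 60.0])

# Weighted mean written out from the formula.
weighted_mean = (income * weights).sum() / weights.sum()

# NumPy's built-in weighted average should agree.
numpy_weighted_mean = np.average(income, weights=weights)

# With constant weights, the weighted mean collapses to the ordinary mean.
constant = np.full_like(income, 5.0)
constant_weight_mean = (income * constant).sum() / constant.sum()

print(weighted_mean, numpy_weighted_mean, constant_weight_mean, income.mean())
```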

In this data, weights are stored in the variable perwt, which is the number of people for which each observation is a stand-in (the inverse of that observation’s sampling probability).

Using the formula, re-calculate the weighted average income for both populations.

In [44]:
cond1 = df['empstat'] == 'employed'
filtered_df = df[cond1].copy()

(filtered_df['inctot']*filtered_df['perwt']).sum()/filtered_df['perwt'].sum()

55050.47746086801

In [48]:
# Use the group's own weights (x['perwt']) rather than relying on index alignment with filtered_df
f = lambda x: (x['inctot']*x['perwt']).sum()/x['perwt'].sum()
filtered_df.groupby('race').apply(f)

race
white                               58361.481961
black/african american/negro        40430.953355
american indian or alaska native    36982.665676
chinese                             71139.573781
japanese                            73385.636287
other asian or pacific islander     63954.682574
other race, nec                     33696.658180
two major races                     47724.446956
three or more major races           47127.188928
dtype: float64
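An alternative to apply() for group-wise weighted means is to compute the weighted numerator as a column, sum both numerator and weights by group, and divide. This is only a sketch on a made-up DataFrame (`toy`, its group labels, and the `weighted_income` helper column are all assumptions for illustration).

```python
import pandas as pd

# Toy data with two groups; values and weights are invented.
toy = pd.DataFrame({
    "race": ["a", "a", "b", "b"],
    "inctot": [10000.0, 30000.0, 20000.0, 60000.0],
    "perwt": [1.0, 3.0, 2.0, 2.0],
})

# Numerator of the weighted mean, row by row.
toy["weighted_income"] = toy["inctot"] * toy["perwt"]

# Sum numerator and weights within each group, then divide.
sums = toy.groupby("race")[["weighted_income", "perwt"]].sum()
weighted_means = sums["weighted_income"] / sums["perwt"]

print(weighted_means)
```

Because this uses only built-in group sums, it avoids calling a Python function once per group, which can matter on larger data.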