# Cleaning Data Exercises

In this exercise, we'll be returning to the American Community Survey data we used previously to measure racial income inequality in the United States. In today's exercise, we'll use it to measure the returns to education and how those returns vary by race and gender.




## Gradescope Autograding

Please follow [all standard guidance](https://www.practicaldatascience.org/html/autograder_guidelines.html) for submitting this assignment to the Gradescope autograder, including storing your solutions in a dictionary called `results` and ensuring your notebook runs from the start to completion without any errors.

For this assignment, please name your file `exercise_missing.ipynb` before uploading.

You can check that you have answers for all questions in your `results` dictionary with this code:

```python
assert set(results.keys()) == {
    "ex5_age_young",
    "ex5_age_old",
    "ex7_avg_age",
    "ex8_avg_age",
    "ex9_num_college",
    "ex11_share_male_w_degrees",
    "ex11_share_female_w_degrees",
    "ex12_comparing",
}
```


### Submission Limits

Please remember that you are **only allowed three submissions to the autograder.** Your last submission (if you submit 3 or fewer times), or your third submission (if you submit more than 3 times) will determine your grade. Submissions that error out will **not** count against this total.

In [1]:
import pandas as pd
import numpy as np

pd.set_option("mode.copy_on_write", True)

# Create a results dictionary
results = {}

## Exercises

### Exercise 1

For these cleaning exercises, we'll return to the ACS data we've used before one last time. We'll be working with `US_ACS_2017_10pct_sample.dta`. Import the data (please load it from a URL so the autograder can run your notebook).

In [2]:
import pandas as pd

# URL of the dataset US_ACS_2017_10pct_sample.dta
url = "https://github.com/nickeubank/MIDS_Data/raw/master/US_AmericanCommunitySurvey/US_ACS_2017_10pct_sample.dta"

# Reading the data from the URL into a pandas DataFrame
acs = pd.read_stata(url)
print("The head of the dataframe is: ")
acs.head()

The head of the dataframe is: 


Unnamed: 0,year,datanum,serial,cbserial,numprec,subsamp,hhwt,hhtype,cluster,adjust,...,migcounty1,migmet131,vetdisab,diffrem,diffphys,diffmob,diffcare,diffsens,diffeye,diffhear
0,2017,1,177686,2017001000000.0,9,64,55,"female householder, no husband present",2017002000000.0,1.011189,...,0,not in identifiable area,,,,,,no vision or hearing difficulty,no,no
1,2017,1,1200045,2017001000000.0,6,79,25,"male householder, no wife present",2017012000000.0,1.011189,...,0,not in identifiable area,,no cognitive difficulty,no ambulatory difficulty,no independent living difficulty,no,no vision or hearing difficulty,no,no
2,2017,1,70831,2017000000000.0,1 person record,36,57,"male householder, living alone",2017001000000.0,1.011189,...,0,not in identifiable area,,has cognitive difficulty,no ambulatory difficulty,no independent living difficulty,no,no vision or hearing difficulty,no,no
3,2017,1,557128,2017001000000.0,2,10,98,married-couple family household,2017006000000.0,1.011189,...,0,not in identifiable area,,no cognitive difficulty,no ambulatory difficulty,no independent living difficulty,no,no vision or hearing difficulty,no,no
4,2017,1,614890,2017001000000.0,4,96,54,married-couple family household,2017006000000.0,1.011189,...,0,not in identifiable area,,,,,,no vision or hearing difficulty,no,no


### Exercise 2

For our exercises today, we'll focus on `age`, `sex`, `educ` (education), and `inctot` (total income). Subset your data to those variables, and quickly look at a sample of 10 rows.

In [3]:
# Subsetting the DataFrame to only include 'age', 'sex', 'educ', 'inctot' columns
subset_acs = acs[["age", "sex", "educ", "inctot"]]

# Displaying a sample of 10 rows from the subsetted DataFrame
print(
    "The sample of the subsetted dataframe only include age, sex, education and income is: "
)
subset_acs.sample(10)

The sample of the subsetted dataframe only include age, sex, education and income is: 


Unnamed: 0,age,sex,educ,inctot
227931,34,male,grade 12,25000
204450,64,female,5+ years of college,1400
300291,58,male,5+ years of college,518000
254697,44,female,5+ years of college,8000
149184,47,male,grade 12,35000
315913,80,male,grade 12,8400
200435,36,male,4 years of college,115000
163773,41,female,1 year of college,50000
153706,37,female,grade 9,12000
216759,65,female,grade 10,19800


### Exercise 3

As before, all the values of `9999999` have the potential to cause us real problems, so replace all the values of `inctot` that are `9999999` with `np.nan`. 

In [4]:
subset_acs.loc[subset_acs["inctot"] == 9999999, "inctot"] = np.nan
print(
    "The sample of the subsetted dataframe only include age, sex, education and income with replacement of 9999999 to NaN is: "
)
subset_acs.sample(10)

The sample of the subsetted dataframe only include age, sex, education and income with replacement of 9999999 to NaN is: 


Unnamed: 0,age,sex,educ,inctot
306378,53,female,1 year of college,0.0
20266,65,male,2 years of college,30000.0
155004,75,female,5+ years of college,0.0
48518,67,female,grade 12,29900.0
12738,46,male,1 year of college,56630.0
14390,41,female,n/a or no schooling,23400.0
141705,74,male,grade 12,24100.0
180858,31,male,grade 12,18300.0
197298,71,female,1 year of college,10200.0
146885,34,male,4 years of college,12600.0
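
As a minimal sketch of why the sentinel matters, the toy example below (with made-up income values, not ACS data) replaces `9999999` with `np.nan` using `Series.replace`, an alternative to the `.loc` assignment above, and then computes a mean that ignores the missing entries.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); 9999999 stands in for "missing" as in the ACS
toy = pd.DataFrame({"inctot": [25000, 9999999, 0, 9999999, 51000]})

# Series.replace swaps every occurrence of the sentinel for NaN
toy["inctot"] = toy["inctot"].replace(9999999, np.nan)

# NaN values are skipped by default, so the sentinel no longer inflates the mean
n_missing = int(toy["inctot"].isna().sum())
mean_income = toy["inctot"].mean()
print(n_missing, mean_income)
```

Note that the column becomes a float column after the replacement, which is why the incomes in the sample above print as `0.0`, `30000.0`, and so on.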


### Exercise 4

Attempt to calculate the average age of people in our data. What do you get? Why are you getting that error?

You *should* get an error in trying to answer this question, but **PLEASE LEAVE THE CODE THAT GENERATES THIS ERROR COMMENTED OUT SO YOUR NOTEBOOK WILL RUN IN THE AUTOGRADER**. 

Then talk about the error in a markdown cell.

In [5]:
# average_age = subset_acs["age"].mean()

> + Calculating the mean raises an error because `age` is stored as a categorical column, not a numeric one. It is categorical because some of its values are text labels, such as "less than 1 year old" and "90 (90+ in 1980 and 1990)".
> + To fix this, I need to replace those labels with values that can be read as numbers and then convert the column to a numeric type.
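
The error can be reproduced safely on a small toy column (hypothetical values) by wrapping the call in `try`/`except`, which keeps the notebook running. This is only a sketch; the exact wording of the message may differ between pandas versions.

```python
import pandas as pd

# Toy categorical column of string labels (hypothetical values)
age = pd.Series(["less than 1 year old", "1", "2", "2"], dtype="category")

try:
    age.mean()
    error_name = None
except TypeError as err:
    # Categorical data does not support numeric reductions such as mean
    error_name = type(err).__name__
    print(f"{error_name}: {err}")
```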

### Exercise 5

We want to be able to calculate things using age, so we need it to be a numeric type. Check the current type of `age`, and look at all the values of `age` to figure out why it's categorical and not numeric. You should find two problematic categories. Store the values of these categories in `"ex5_age_young"` and `"ex5_age_old"` (once you find them, it should be clear which is which).

In [6]:
# Check the current type of `age`
age_datatype = subset_acs["age"].dtype
print("The datatype of the age column:", age_datatype)

# look at all the unique values in the `age` column
age_unique = subset_acs["age"].unique()
print("The unique values in the age column are:", age_unique)

# Store the values of the two problematic categories in dictionary
ex5_age_young = "less than 1 year old"
ex5_age_old = "90 (90+ in 1980 and 1990)"

print(f"The value of the first problematic category is: {ex5_age_young}.")
print(f"The value of the second problematic category is: {ex5_age_old}.")

results["ex5_age_young"] = ex5_age_young
results["ex5_age_old"] = ex5_age_old

print(results)

The datatype of the age column: category
The unique values in the age column are: ['4', '17', '63', '66', '1', ..., '86', '95', '89', '91', '96']
Length: 97
Categories (97, object): ['less than 1 year old' < '1' < '2' < '3' ... '93' < '94' < '95' < '96']
The value of the first problematic category is: less than 1 year old.
The value of the second problematic category is: 90 (90+ in 1980 and 1990).
{'ex5_age_young': 'less than 1 year old', 'ex5_age_old': '90 (90+ in 1980 and 1990)'}


### Exercise 6

In order to convert `age` into a numeric variable, we need to replace those problematic entries with values that `pandas` can later convert into numbers. Pick appropriate substitutions for the existing values and replace the current values. 

**Hint 1:** Categorical variables act like strings, so you might want to use string methods! 

**Hint 2:** Remember that characters like parentheses, pluses, asterisks, etc. are special in regular expressions (which many string methods use for pattern matching), and you have to escape them if you want them to be interpreted literally!

**Hint 3:** Because the US Census has been conducted regularly for hundreds of years but exactly how the census has been conducted has occasionally changed, variables are sometimes coded in a way that might be interpreted in different ways for different census years. For example, hypothetically, one might write `90 (90+ in 1980 and 1990)` if the Censuses conducted in 1980 and 1990 used to top-code age at 90 (any values *over* 90 were just coded as 90), but more recent Censuses no longer top-coded age and recorded ages over 90 as the respondent's actual age.
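
To illustrate Hint 2, here is a small sketch on a toy Series (not the ACS data) showing three ways to match the label `90 (90+ in 1980 and 1990)` with `.str.replace`: escaping by hand, escaping with `re.escape`, and turning regular expressions off.

```python
import re
import pandas as pd

ages = pd.Series(["less than 1 year old", "45", "90 (90+ in 1980 and 1990)"])

# Option 1: escape the parentheses and plus sign by hand
by_hand = ages.str.replace(r"90 \(90\+ in 1980 and 1990\)", "90", regex=True)

# Option 2: let re.escape build the escaped pattern
pattern = re.escape("90 (90+ in 1980 and 1990)")
escaped = ages.str.replace(pattern, "90", regex=True)

# Option 3: turn regex off so the text is matched literally
literal = ages.str.replace("90 (90+ in 1980 and 1990)", "90", regex=False)

print(by_hand.tolist())
```

`Series.replace` (without `.str`) matches whole values exactly by default, so it needs no escaping when the full label is supplied.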

In [7]:
# Replace 'less than 1 year old' with 0
subset_acs["age"] = subset_acs["age"].replace("less than 1 year old", "0")

# Replace '90 (90+ in 1980 and 1990)' with 90
subset_acs["age"] = subset_acs["age"].replace("90 (90+ in 1980 and 1990)", "90")

age_unique2 = subset_acs["age"].unique()
print(
    f"The unique values in the age column after modifying the first and second problematic category are: {age_unique2}"
)

The unique values in the age column after modifying the first and second problematic category are: ['4', '17', '63', '66', '1', ..., '86', '95', '89', '91', '96']
Length: 97
Categories (97, object): ['0' < '1' < '2' < '3' ... '93' < '94' < '95' < '96']


### Exercise 7

Now convert age from a categorical to numeric. Calculate the average age among this group, and store it in `"ex7_avg_age"`.

In [8]:
# Convert the 'age' column to a numeric data type
subset_acs["age"] = pd.to_numeric(subset_acs["age"])

print(f"The data type of 'age' after conversion: {subset_acs['age'].dtype}")

# Calculate the average age among this group
ex7_avg_age = subset_acs["age"].mean()

print(f"The average age in this dataset is about {round(ex7_avg_age)} years old.")

results["ex7_avg_age"] = ex7_avg_age
print(results)

The data type of 'age' after conversion: int64
The average age in this dataset is about 41 years old.
{'ex5_age_young': 'less than 1 year old', 'ex5_age_old': '90 (90+ in 1980 and 1990)', 'ex7_avg_age': 41.30384885455982}
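
One way to double-check that no text labels were missed before converting is sketched below on toy data. The `"unknown"` label is hypothetical and does not appear in the ACS; it is there only to show what a leftover label looks like. With `errors="coerce"`, `pd.to_numeric` turns anything it cannot parse into `NaN`, which makes stragglers easy to find.

```python
import pandas as pd

# Toy categorical column; "unknown" is a hypothetical leftover label
age = pd.Series(["0", "17", "90", "unknown"], dtype="category")

# Values that cannot be parsed as numbers become NaN instead of raising
numeric_age = pd.to_numeric(age.astype(str), errors="coerce")

# Original labels that failed to convert
leftover = age[numeric_age.isna()].astype(str).tolist()
print(leftover)
```

An empty `leftover` list would indicate that every category converted cleanly.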


### Exercise 8

Let's now filter out anyone in our data whose age is less than 18. Note that before we made `age` a numeric variable, we couldn't do this! Again, calculate the average age and this time store it in `"ex8_avg_age"`. 

Use this sample of people 18 and over for all subsequent exercises.

In [9]:
# Filter out individuals with age less than 18
subset_acs_adults = subset_acs[subset_acs["age"] >= 18]

# Calculate the average age
avg_age = subset_acs_adults["age"].mean()

# Store the average age in 'ex8_avg_age'
ex8_avg_age = avg_age
print(
    f"The average age of adults in the dataset is about {round(ex8_avg_age)} years old."
)

results["ex8_avg_age"] = ex8_avg_age
print(results)

The average age of adults in the dataset is about 50 years old.
{'ex5_age_young': 'less than 1 year old', 'ex5_age_old': '90 (90+ in 1980 and 1990)', 'ex7_avg_age': 41.30384885455982, 'ex8_avg_age': 49.75769659413359}


### Exercise 9

Create an indicator variable for whether each person has *at least* a college Bachelor's degree called `college_degree`. Use this variable to calculate the number of people in the dataset with a college degree. You may assume that to get a college degree you need to complete at least 4 years of college. Save the result as `"ex9_num_college"`.

In [10]:
# Observe the unique values in the 'educ' column
degree_unique = subset_acs_adults["educ"].unique()
print(
    "The unique values in the degree column for adults in the dataset are:",
    degree_unique,
)

# Create the 'college_degree' column
# For each row, if the educ column indicates that the person has completed 4 or more years of college, set college_degree to 1 (indicating they have a college degree).
# Otherwise, set it to 0.
subset_acs_adults["college_degree"] = subset_acs_adults["educ"].apply(
    lambda x: 1 if "4 years of college" in x or "5+ years of college" in x else 0
)

# Sum the values in 'college_degree' to get the number of people with a college degree
ex9_num_college = subset_acs_adults["college_degree"].sum()

print(f"There are {ex9_num_college} adults with a college degree in the dataset.")

results["ex9_num_college"] = ex9_num_college
print(results)

The unique values in the degree column for adults in the dataset are: ['4 years of college', 'grade 12', '1 year of college', 'n/a or no schooling', '2 years of college', ..., 'grade 5, 6, 7, or 8', 'grade 9', 'grade 11', 'grade 10', 'nursery school to grade 4']
Length: 11
Categories (11, object): ['n/a or no schooling' < 'nursery school to grade 4' < 'grade 5, 6, 7, or 8' < 'grade 9' ... '1 year of college' < '2 years of college' < '4 years of college' < '5+ years of college']
There are 77013 adults with a college degree in the dataset.
{'ex5_age_young': 'less than 1 year old', 'ex5_age_old': '90 (90+ in 1980 and 1990)', 'ex7_avg_age': 41.30384885455982, 'ex8_avg_age': 49.75769659413359, 'ex9_num_college': 77013}
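
As an alternative to `.apply` with a lambda, the indicator can be built with the vectorized `Series.isin`. The sketch below uses a toy `educ` column with the same labels as the ACS data; the variable names are illustrative only.

```python
import pandas as pd

# Toy education column using the ACS labels
educ = pd.Series(
    [
        "grade 12",
        "4 years of college",
        "1 year of college",
        "5+ years of college",
        "2 years of college",
    ],
    dtype="category",
)

# Labels that count as at least a Bachelor's degree under the exercise's assumption
degree_levels = ["4 years of college", "5+ years of college"]

# isin returns a boolean Series; astype(int) turns it into a 0/1 indicator
college_degree = educ.isin(degree_levels).astype(int)
num_college = int(college_degree.sum())
print(num_college)
```

Matching whole labels with `isin` also avoids accidental substring matches that an `in` test on strings could produce.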


### Exercise 10

Let's examine the educational gender gap. Use `pd.crosstab` to create a cross-tabulation of `sex` and `college_degree`. `pd.crosstab` will give you the number of people who have each combination of `sex` and `college_degree` (so in this case, it will give us a 2x2 table with Male and Female as rows, and `college_degree` True and False as columns, or vice versa).

In [11]:
# Create a cross-tabulation of 'sex' and 'college_degree'
educ_gender_crosstab = pd.crosstab(
    subset_acs_adults["sex"], subset_acs_adults["college_degree"]
)

# Display the resulting table
print("Cross-tabulation of sex and college degree for all adults in the dataset:")
print(educ_gender_crosstab)

Cross-tabulation of sex and college degree for all adults in the dataset:
college_degree      0      1
sex                         
male            85821  36181
female          90200  40832


### Exercise 11

Counts are kind of hard to interpret. `pd.crosstab` can also normalize values to give percentages. Look at the `pd.crosstab` help file to figure out how to normalize the values in the table. Normalize them so that you get the share of men with and without college degrees, and the share of women with and without college degrees.

Store the share (between 0 and 1) of men with college degrees in `"ex11_share_male_w_degrees"`, and the share of women with degrees in `"ex11_share_female_w_degrees"`.

In [12]:
# Normalize the crosstab values row-wise
normalized_crosstab = pd.crosstab(
    subset_acs_adults["sex"], subset_acs_adults["college_degree"], normalize="index"
)

# Display the normalized table
print(
    "Normalized Cross-tabulation of 'sex' and 'college_degree' for all adults in the dataset:"
)
print(normalized_crosstab)

# Store the share of men with degrees
ex11_share_male_w_degrees = normalized_crosstab.loc["male", 1]

# Store the share of women with degrees
ex11_share_female_w_degrees = normalized_crosstab.loc["female", 1]

print(
    f"The share of adult men with college degrees is about {round(ex11_share_male_w_degrees*100, 2)}%."
)
print(
    f"The share of adult women with college degree is about {round(ex11_share_female_w_degrees*100,2)}%."
)

results["ex11_share_male_w_degrees"] = ex11_share_male_w_degrees
results["ex11_share_female_w_degrees"] = ex11_share_female_w_degrees
print(results)

Normalized Cross-tabulation of 'sex' and 'college_degree' for all adults in the dataset:
college_degree         0         1
sex                               
male            0.703439  0.296561
female          0.688381  0.311619
The share of adult men with college degrees is about 29.66%.
The share of adult women with college degree is about 31.16%.
{'ex5_age_young': 'less than 1 year old', 'ex5_age_old': '90 (90+ in 1980 and 1990)', 'ex7_avg_age': 41.30384885455982, 'ex8_avg_age': 49.75769659413359, 'ex9_num_college': 77013, 'ex11_share_male_w_degrees': 0.29656071211947344, 'ex11_share_female_w_degrees': 0.3116185359301545}
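
To make the effect of `normalize="index"` concrete, here is a small sketch on toy data (made-up rows, not the ACS sample). Each row of the resulting table is divided by its own row total, so every row sums to 1.

```python
import pandas as pd

# Toy data: 4 men (1 with a degree) and 4 women (2 with a degree)
toy = pd.DataFrame(
    {
        "sex": ["male"] * 4 + ["female"] * 4,
        "college_degree": [1, 0, 0, 0, 1, 1, 0, 0],
    }
)

# normalize="index" converts counts to within-row shares
shares = pd.crosstab(toy["sex"], toy["college_degree"], normalize="index")

share_male = shares.loc["male", 1]
share_female = shares.loc["female", 1]
print(shares)
```

Using `normalize="columns"` instead would answer a different question: among people with (or without) a degree, what share are men and what share are women.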


### Exercise 12

Now, let's recreate that table for people who are 40 and over and people under 40. Over time, what does this suggest about the absolute difference in the share of men and women earning college degrees? Has it gotten larger, stayed the same, or gotten smaller? Store your answer (either `"the absolute difference has increased"` or `"the absolute difference has decreased"`) in `"ex12_comparing"`.

In [13]:
# Create subsets
subset_40_and_over = subset_acs_adults[subset_acs_adults["age"] >= 40]
subset_under_40 = subset_acs_adults[subset_acs_adults["age"] < 40]

# Get normalized crosstabs for both subsets
crosstab_40_and_over = pd.crosstab(
    subset_40_and_over["sex"], subset_40_and_over["college_degree"], normalize="index"
)
crosstab_under_40 = pd.crosstab(
    subset_under_40["sex"], subset_under_40["college_degree"], normalize="index"
)

print(
    "Cross-tabulation of sex and college degree for all adults over 40 years old in the dataset:"
)
print(crosstab_40_and_over)
print(
    "Cross-tabulation of sex and college_degree for all adults under 40 years old in the dataset:"
)
print(crosstab_under_40)

# Calculate absolute differences for both subsets
diff_40_and_over = abs(
    crosstab_40_and_over.loc["male", 1] - crosstab_40_and_over.loc["female", 1]
)

print(
    f"The absolute difference in the share of men and women earning college degrees for adults over 40 years old is {round(diff_40_and_over*100,2)}%."
)

diff_under_40 = abs(
    crosstab_under_40.loc["male", 1] - crosstab_under_40.loc["female", 1]
)

print(
    f"The absolute difference in the share of men and women earning college degrees for adults under 40 years old is {round(diff_under_40*100,2)}%."
)

# Compare and draw conclusions
if diff_under_40 > diff_40_and_over:
    ex12_comparing = "the absolute difference has increased"
elif diff_under_40 < diff_40_and_over:
    ex12_comparing = "the absolute difference has decreased"
else:
    ex12_comparing = "the absolute difference has not changed"

print(ex12_comparing)

results["ex12_comparing"] = ex12_comparing

print(results)

Cross-tabulation of sex and college degree for all adults over 40 years old in the dataset:
college_degree         0         1
sex                               
male            0.682123  0.317877
female          0.699144  0.300856
Cross-tabulation of sex and college_degree for all adults under 40 years old in the dataset:
college_degree         0         1
sex                               
male            0.743143  0.256857
female          0.665710  0.334290
The absolute difference in the share of men and women earning college degrees for adults over 40 years old is 1.7%.
The absolute difference in the share of men and women earning college degrees for adults under 40 years old is 7.74%.
the absolute difference has increased
{'ex5_age_young': 'less than 1 year old', 'ex5_age_old': '90 (90+ in 1980 and 1990)', 'ex7_avg_age': 41.30384885455982, 'ex8_avg_age': 49.75769659413359, 'ex9_num_college': 77013, 'ex11_share_male_w_degrees': 0.29656071211947344, 'ex11_share_female_w_degrees': 0.

In [14]:
print(
    f"Since 7.74% is greater than 1.7%, the absolute difference in the share of adult men and adult women earning college degrees has gotten greater over time. Therefore, {ex12_comparing} over time."
)

Since 7.74% is greater than 1.7%, the absolute difference in the share of adult men and adult women earning college degrees has gotten greater over time. Therefore, the absolute difference has increased over time.
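
The same comparison can also be computed in one pass with `groupby`, since the mean of a 0/1 indicator is the share of ones. The sketch below uses toy data and a hypothetical `age_group` column; it is an alternative to building two separate crosstabs, not a replacement for the answer above.

```python
import numpy as np
import pandas as pd

# Toy data (made-up rows)
toy = pd.DataFrame(
    {
        "age": [25, 30, 35, 38, 45, 50, 60, 70],
        "sex": ["male", "male", "female", "female"] * 2,
        "college_degree": [0, 1, 1, 1, 1, 0, 0, 1],
    }
)

# Label each person by age group (hypothetical helper column)
toy["age_group"] = np.where(toy["age"] < 40, "under 40", "40 and over")

# Share with a degree for each age group and sex, one column per sex
shares = toy.groupby(["age_group", "sex"])["college_degree"].mean().unstack("sex")

# Absolute gap between women and men within each age group
gap = (shares["female"] - shares["male"]).abs()
print(gap)
```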


### Exercise 13

In words, what is causing the change noted in Exercise 12? (That is, looking at the tables above, tell me a story about men's and women's college attainment.)

> **Observation**:
> + The first table shows the share of adult men and women aged 40 and over who have a college degree.
>   + About 31.79% of men have a college degree.
>   + About 30.09% of women have a college degree.
>   + The share for men is therefore about 1.7 percentage points higher than the share for women in this age group.
> + The second table shows the share of adult men and women under 40 who have a college degree.
>   + About 25.69% of men have a college degree.
>   + About 33.43% of women have a college degree.
>   + The share for women is therefore about 7.74 percentage points higher than the share for men in this age group.

> **Interpretation**:
> + The absolute difference between the shares of men and women with college degrees is 7.74 percentage points for the younger group (under 40), notably larger than the 1.7 percentage points for the older group (40 and over). Treating the younger group as the more recent cohort, the gap has widened over time (the absolute difference has increased).
> + The direction of the gap has also flipped: among adults under 40, women are more likely than men to have a college degree, while among adults 40 and over, men are slightly more likely than women to have one. The widening comes from both sides, since the share of women with degrees is higher in the younger group (33.43% versus 30.09%) and the share of men is lower (25.69% versus 31.79%).
> + This shift points to a significant change in educational attainment by gender. It may be due to societal changes that prioritize and support women's education. For instance, women in the older generation were more often expected to stay in the household rather than join the workforce, so they had fewer incentives to earn a college degree. In the younger generation, by contrast, women are encouraged to pursue higher education and enter the workforce. The wider gap is therefore consistent with increased access to higher education for women.
> + In essence, the older generation shows a modest male advantage in college attainment, while the younger generation shows women leading. This trend suggests a major societal shift in the value and accessibility of education for women.

## Want More Practice?

Calculate the educational racial gap in the United States for White Americans, Black Americans, Hispanic Americans, and other groups. 

Note that to do these calculations, you'll have to deal with the fact that unlike most Americans, the U.S. Census Bureau treats "Hispanic" not as a racial category, but a linguistic one. As a result, the racial category "White" in `race` actually includes most Hispanic Americans. For this analysis, we wish to work with the mutually exclusive categories of "White, non-Hispanic", "White, Hispanic", "Black (Hispanic or non-Hispanic)", and a category for everyone else. 

In [15]:
# Observe Unique Values in race and hispan columns

print(acs["race"].value_counts())
print(acs["hispan"].value_counts())
# Subsetting the DataFrame to only include 'age', 'educ', 'race', 'hispan' columns
subset_acs_race = acs[["age", "educ", "race", "hispan"]]

# Displaying a sample of 10 rows
print(
    "The sample of the subsetted dataframe only include age, education, race and hispanic is: "
)

subset_acs_race.sample(10)

race
white                               243751
black/african american/negro         31691
other asian or pacific islander      12508
other race, nec                      12304
two major races                       8826
chinese                               4313
american indian or alaska native      3595
three or more major races             1207
japanese                               809
Name: count, dtype: int64
hispan
not hispanic    272750
mexican          28755
other            11151
puerto rican      4397
cuban             1951
Name: count, dtype: int64
The sample of the subsetted dataframe only include age, education, race and hispanic is: 


Unnamed: 0,age,educ,race,hispan
205425,65,grade 12,white,not hispanic
15382,74,grade 12,white,not hispanic
246552,15,grade 10,white,not hispanic
254280,14,"grade 5, 6, 7, or 8",white,not hispanic
39683,19,grade 12,two major races,not hispanic
69194,31,grade 12,white,not hispanic
156419,32,5+ years of college,white,not hispanic
216195,51,2 years of college,white,not hispanic
178642,3,nursery school to grade 4,white,mexican
237033,34,4 years of college,white,other


In [16]:
# Replace 'less than 1 year old' with 0
subset_acs_race["age"] = subset_acs_race["age"].replace("less than 1 year old", "0")

# Replace '90 (90+ in 1980 and 1990)' with 90
subset_acs_race["age"] = subset_acs_race["age"].replace(
    "90 (90+ in 1980 and 1990)", "90"
)

# Convert the 'age' column to a numeric data type
subset_acs_race["age"] = pd.to_numeric(subset_acs_race["age"])

# Filter out individuals with age less than 18
subset_acs_race_adults = subset_acs_race[subset_acs_race["age"] >= 18]

# Create the 'college_degree' column
# For each row, if the educ column indicates that the person has completed 4 or more years of college, set college_degree to 1 (indicating they have a college degree).
# Otherwise, set it to 0.
subset_acs_race_adults["college_degree"] = subset_acs_race_adults["educ"].apply(
    lambda x: 1 if "4 years of college" in x or "5+ years of college" in x else 0
)

print(
    "The sample of the subsetted dataframe only include age, education, race and hispanic after cleaning age column and creating college degree column is: "
)
subset_acs_race_adults.sample(10)

The sample of the subsetted dataframe only include age, education, race and hispanic after cleaning age column and creating college degree column is: 


Unnamed: 0,age,educ,race,hispan,college_degree
190207,42,5+ years of college,white,not hispanic,1
133441,64,1 year of college,white,not hispanic,0
124585,52,grade 12,white,not hispanic,0
117297,30,grade 12,white,not hispanic,0
133033,72,"grade 5, 6, 7, or 8",white,not hispanic,0
31335,82,grade 12,white,not hispanic,0
141034,57,4 years of college,white,not hispanic,1
289391,34,4 years of college,white,not hispanic,1
128698,27,2 years of college,white,not hispanic,0
311034,63,grade 12,white,not hispanic,0


In [17]:
# Creating a new 'race_category' column in 'subset_acs_race_adults'


def categorize_race(row):
    if row["race"] == "white" and row["hispan"] == "not hispanic":
        return "White, non-Hispanic"
    elif row["race"] == "white":
        return "White, Hispanic"
    elif row["race"] == "black/african american/negro":
        return "Black (Hispanic or non-Hispanic)"
    else:
        return "Other"


subset_acs_race_adults["race_category"] = subset_acs_race_adults.apply(
    categorize_race, axis=1
)

print(
    "The sample of the subsetted dataframe only include age, education, race and hispanic after cleaning age column and creating college degree column and race category column is: "
)
subset_acs_race_adults.sample(10)

The sample of the subsetted dataframe only include age, education, race and hispanic after cleaning age column and creating college degree column and race category column is: 


Unnamed: 0,age,educ,race,hispan,college_degree,race_category
140976,80,4 years of college,white,not hispanic,1,"White, non-Hispanic"
185682,37,grade 11,"other race, nec",mexican,0,Other
255546,72,grade 11,white,not hispanic,0,"White, non-Hispanic"
127440,43,1 year of college,white,not hispanic,0,"White, non-Hispanic"
216510,49,grade 12,white,not hispanic,0,"White, non-Hispanic"
236513,77,4 years of college,white,not hispanic,1,"White, non-Hispanic"
83090,63,5+ years of college,white,not hispanic,1,"White, non-Hispanic"
187885,27,1 year of college,"other race, nec",other,0,Other
57836,29,1 year of college,white,not hispanic,0,"White, non-Hispanic"
119029,32,4 years of college,white,not hispanic,1,"White, non-Hispanic"
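
Row-wise `apply` with `axis=1` is easy to read but can be slow on a few hundred thousand rows. A vectorized alternative, sketched below on toy data with the same labels, is `np.select`, which picks the first matching condition for each row. The helper names (`is_white`, `is_hispanic`, `is_black`) are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data using the ACS labels for race and hispan
toy = pd.DataFrame(
    {
        "race": ["white", "white", "black/african american/negro", "chinese"],
        "hispan": ["not hispanic", "mexican", "not hispanic", "not hispanic"],
    }
)

is_white = toy["race"] == "white"
is_hispanic = toy["hispan"] != "not hispanic"
is_black = toy["race"] == "black/african american/negro"

# Conditions are checked in order; rows matching none get the default
conditions = [is_white & ~is_hispanic, is_white & is_hispanic, is_black]
labels = [
    "White, non-Hispanic",
    "White, Hispanic",
    "Black (Hispanic or non-Hispanic)",
]
toy["race_category"] = np.select(conditions, labels, default="Other")
print(toy)
```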


In [18]:
# Method 1
# Calculating the percentage with a college degree for each race category
racial_gap = subset_acs_race_adults.groupby("race_category")["college_degree"].mean()

for race, percentage in racial_gap.items():
    print(f"{race}: {percentage*100:.2f}%")

Black (Hispanic or non-Hispanic): 19.11%
Other: 31.45%
White, Hispanic: 17.25%
White, non-Hispanic: 33.46%


In [19]:
# Method 2 using crosstab
# Calculate the educational gap
educ_racial_gap = pd.crosstab(
    subset_acs_race_adults["race_category"],
    subset_acs_race_adults["college_degree"],
    normalize="index",
)
print("Cross-tabulation of race and college_degree for all adults in the dataset:")
print(educ_racial_gap)

# To extract the percentage of each reclassified racial group with a college degree:
percent_with_degree = educ_racial_gap[1]
for race, percentage in percent_with_degree.items():
    print(f"{race}: {percentage*100:.2f}%")

Cross-tabulation of race and college_degree for all adults in the dataset:
college_degree                           0         1
race_category                                       
Black (Hispanic or non-Hispanic)  0.808948  0.191052
Other                             0.685502  0.314498
White, Hispanic                   0.827497  0.172503
White, non-Hispanic               0.665403  0.334597
Black (Hispanic or non-Hispanic): 19.11%
Other: 31.45%
White, Hispanic: 17.25%
White, non-Hispanic: 33.46%


In [20]:
print(results)

{'ex5_age_young': 'less than 1 year old', 'ex5_age_old': '90 (90+ in 1980 and 1990)', 'ex7_avg_age': 41.30384885455982, 'ex8_avg_age': 49.75769659413359, 'ex9_num_college': 77013, 'ex11_share_male_w_degrees': 0.29656071211947344, 'ex11_share_female_w_degrees': 0.3116185359301545, 'ex12_comparing': 'the absolute difference has increased'}


In [21]:
assert set(results.keys()) == {
    "ex5_age_young",
    "ex5_age_old",
    "ex7_avg_age",
    "ex8_avg_age",
    "ex9_num_college",
    "ex11_share_male_w_degrees",
    "ex11_share_female_w_degrees",
    "ex12_comparing",
}