Source: https://www.practicaldatascience.org/html/exercises/Exercise_cleaning.html

__1. For our data cleaning exercises, we will return one last time to our ACS data here. Download and import the 10percent ACS sample.__

__2. For our exercises today, we’ll focus on age, sex, educ, and inctot. Subset your data to those variables, and quickly look at a sample of 10 rows.__

In [1]:
import pandas as pd
df = pd.read_stata('US_ACS_2017_10pct_sample.dta')

In [40]:
cols = ['age', 'sex', 'inctot','empstat','educ']
sub_df = df[cols].copy()

In [41]:
sub_df.sample(10)

Unnamed: 0,age,sex,inctot,empstat,educ
161046,48,male,87100,employed,1 year of college
109696,49,female,30000,employed,2 years of college
198174,54,male,7000,employed,4 years of college
35805,8,female,9999999,,nursery school to grade 4
136659,12,male,9999999,,"grade 5, 6, 7, or 8"
109329,16,female,0,not in labor force,grade 9
224144,68,male,13200,not in labor force,grade 12
126252,41,male,24000,employed,grade 12
21499,49,female,0,not in labor force,4 years of college
147745,17,female,0,not in labor force,grade 9


__3) First, replace all the values of inctot that are 9999999 with np.nan.__

In [42]:
import numpy as np
sub_df['inctot'] = sub_df['inctot'].replace(9999999, np.nan)
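
To see why this replacement matters, here is a minimal sketch on a made-up frame (the `toy` data below is hypothetical, not taken from the ACS file). It assumes, as the sample above suggests, that 9999999 is a placeholder code rather than a real income.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes; 9999999 plays the role of the placeholder code
toy = pd.DataFrame({'inctot': [87100, 30000, 9999999, 0, 9999999]})

# With the placeholder left in, the mean is wildly inflated
mean_with_placeholder = toy['inctot'].mean()

# After replacing it with NaN, mean() skips those rows by default
toy['inctot'] = toy['inctot'].replace(9999999, np.nan)
mean_after = toy['inctot'].mean()

print(mean_with_placeholder, mean_after)
```

Because pandas reductions skip NaN by default, the second mean is computed from the three real incomes only.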

__4) Calculate the average age of people in our data. What do you get?__

In [49]:
# Left commented out: age is still categorical at this point, so mean() raises a TypeError
# sub_df['age'].mean()
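
The failure can be reproduced on a small, hypothetical categorical Series (`toy_age` is made up for illustration). This is a sketch of the behavior, not output from the ACS data.

```python
import pandas as pd

# Hypothetical ages stored as a categorical, as in the imported data
toy_age = pd.Series(['4', '17', 'less than 1 year old'], dtype='category')

try:
    toy_age.mean()
    mean_failed = False
except TypeError as err:
    # pandas refuses to average a categorical column
    mean_failed = True
    print(f"mean() failed: {err}")
```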

__5) We want to be able to calculate things using age, so we need it to be a numeric type. Check all the values of age to figure out why it’s categorical and not numeric. You should find two problematic categories.__

In [44]:
for val in sub_df['age'].unique():
    print(val)

4
17
63
66
1
50
82
8
14
47
13
27
15
67
92
37
59
11
28
48
39
70
10
80
75
7
9
34
88
49
73
less than 1 year old
76
20
40
22
12
25
52
43
51
53
35
69
30
38
81
64
36
57
21
72
79
71
46
78
26
45
56
93
19
18
58
5
29
54
44
31
68
42
62
74
60
65
61
41
32
3
6
55
2
33
77
16
23
94
24
83
90 (90+ in 1980 and 1990)
85
84
87
86
95
89
91
96
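
Rather than scanning the printed list by eye, one option is to ask pandas which labels fail to parse as numbers. The sketch below uses a hypothetical `toy_age` Series; with the real data you would swap in `sub_df['age']`, assuming it is still categorical.

```python
import pandas as pd

# Hypothetical categorical with the two problem labels mixed in
toy_age = pd.Series(
    ['4', '17', 'less than 1 year old', '90 (90+ in 1980 and 1990)', '63'],
    dtype='category',
)

# Take the category labels as strings and try to parse each one
labels = pd.Series(toy_age.cat.categories.astype(str))
is_number = pd.to_numeric(labels, errors='coerce').notna()

# Labels that could not be parsed are the ones that need cleaning
problem_labels = labels[~is_number].tolist()
print(problem_labels)
```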


__6) In order to convert age into a numeric variable, we need to replace those problematic entries with values that pandas can later convert into numbers. Pick appropriate substitutions for the existing values and replace the current values. Hint 1: Categorical variables act like strings, so you might want to use string methods! Hint 2: Remember that characters like parentheses, pluses, asterisks, etc. are special in regular expressions, which some pandas string methods can interpret patterns as, and you have to escape them if you want them to be interpreted literally!__

In [45]:
sub_df['age'] = sub_df['age'].replace('less than 1 year old', '1')
sub_df['age'] = sub_df['age'].replace('90 (90+ in 1980 and 1990)', '90')
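
Hint 2 applies if you take the string-method route instead of `Series.replace`. A minimal sketch on a hypothetical `toy_age` Series, showing what happens with and without escaping when `str.replace` is told to treat the pattern as a regular expression:

```python
import re
import pandas as pd

toy_age = pd.Series(['4', '90 (90+ in 1980 and 1990)', 'less than 1 year old'])
label = '90 (90+ in 1980 and 1990)'

# Unescaped: the parentheses and plus are read as regex syntax, so nothing matches
unescaped = toy_age.str.replace(label, '90', regex=True)

# Escaped: re.escape makes every special character literal
escaped = toy_age.str.replace(re.escape(label), '90', regex=True)

# Or skip regular expressions entirely
literal = toy_age.str.replace(label, '90', regex=False)

print(unescaped.tolist())
print(escaped.tolist())
```

Passing `regex=False` is usually the simpler choice when the text to replace is a fixed label.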

In [46]:
# for val in sub_df['age'].unique():
#     print(val)

__7) Now convert age from a categorical to numeric.__

In [47]:
sub_df['age'] = sub_df['age'].astype(float)
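
An alternative sketch for the conversion, on a hypothetical `toy_age` Series: go through strings first and let `pd.to_numeric` do the parsing, which raises an error if any non-numeric label is left over.

```python
import pandas as pd

# Hypothetical cleaned ages, still stored as a categorical
toy_age = pd.Series(['4', '17', '1', '90'], dtype='category')

# Convert the labels to plain strings, then parse them as numbers
numeric_age = pd.to_numeric(toy_age.astype(str))

# Numeric operations now work
print(numeric_age.mean())
```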

__8) Let’s now filter out anyone in our data whose age is less than 18. Note that before we made age a numeric variable, we couldn’t do this!__

In [48]:
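# Select the rows with age under 18 and preview them
# (this only displays them; it does not drop them from sub_df)
cond1 = sub_df['age'] < 18
sub_df[cond1].sample(10)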

Unnamed: 0,age,sex,inctot,empstat,educ
178142,7.0,female,,,nursery school to grade 4
26295,16.0,male,0.0,not in labor force,grade 9
209377,6.0,male,,,nursery school to grade 4
96364,1.0,male,,,n/a or no schooling
118724,1.0,female,,,n/a or no schooling
86242,9.0,male,,,nursery school to grade 4
12886,13.0,female,,,"grade 5, 6, 7, or 8"
125556,9.0,male,,,nursery school to grade 4
211530,1.0,male,,,n/a or no schooling
297720,10.0,male,,,nursery school to grade 4
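
To actually drop the under-18 rows, you would keep the complement of that condition. A minimal sketch on a hypothetical `toy` frame (the name `adults` is also made up for illustration):

```python
import pandas as pd

# Hypothetical rows with numeric ages
toy = pd.DataFrame({
    'age': [7.0, 16.0, 18.0, 41.0, 68.0],
    'sex': ['female', 'male', 'female', 'male', 'male'],
})

# Keep only people aged 18 or older; copy() avoids chained-assignment warnings later
adults = toy[toy['age'] >= 18].copy()
print(adults)
```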


__9) Create an indicator variable for whether each person has at least a college degree called college_degree.__

In [53]:
sub_df['educ'].value_counts()

grade 12                     93133
4 years of college           47212
1 year of college            38779
5+ years of college          29801
nursery school to grade 4    24514
grade 5, 6, 7, or 8          21535
2 years of college           20757
n/a or no schooling          19562
grade 11                      8758
grade 10                      7818
grade 9                       7135
Name: educ, dtype: int64

In [54]:
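# educ is an ordered categorical, so this comparison follows the category order.
# Note: it is True for any schooling beyond grade 12, including 1 or 2 years of college.
sub_df['college_degree'] = sub_df['educ'] > 'grade 12'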
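
If "at least a college degree" is read more strictly, one option is to list the qualifying categories explicitly. The sketch below is on a hypothetical `toy` frame and assumes that "4 years of college" and "5+ years of college" are the levels that correspond to a completed degree; that mapping is an assumption, not something the data states.

```python
import pandas as pd

# Hypothetical education values drawn from the categories listed above
toy = pd.DataFrame({'educ': [
    'grade 12',
    '1 year of college',
    '2 years of college',
    '4 years of college',
    '5+ years of college',
]})

# Assumed degree-level categories (an interpretation of the labels)
degree_levels = ['4 years of college', '5+ years of college']
toy['college_degree'] = toy['educ'].isin(degree_levels)
print(toy)
```

Using `isin` also avoids depending on the categorical being ordered.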

__10) Let’s examine the educational gender gap. Use pd.crosstab to create a cross-tabulation of sex and college_degree. pd.crosstab will give you the number of people who have each combination of sex and college_degree (so in this case, it will give us a 2x2 table with Male and Female as rows, and college_degree True and False as columns, or vice versa).__

In [55]:
pd.crosstab(sub_df['sex'],sub_df['college_degree'])

sex,college_degree=False,college_degree=True
male,92134,63731
female,90321,72818


__11) Counts are kind of hard to interpret. pd.crosstab can also normalize values to give percentages. Look at the pd.crosstab help file to figure out how to normalize the values in the table. Normalize them so that you get the share of men with and without college degrees, and the share of women with and without college degrees.__

In [61]:
pd.crosstab(sub_df['sex'],sub_df['college_degree'], normalize='index')

sex,college_degree=False,college_degree=True
male,0.591114,0.408886
female,0.553644,0.446356
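
What `normalize='index'` does is easiest to see on a small example. The sketch below uses a hypothetical eight-row `toy` frame and compares raw counts with row-normalized shares.

```python
import pandas as pd

# Hypothetical data: 4 men (1 with a degree), 4 women (2 with a degree)
toy = pd.DataFrame({
    'sex': ['male'] * 4 + ['female'] * 4,
    'college_degree': [True, False, False, False, True, True, False, False],
})

# Raw counts for each sex / degree combination
counts = pd.crosstab(toy['sex'], toy['college_degree'])

# normalize='index' divides each row by its row total, so every row sums to 1
shares = pd.crosstab(toy['sex'], toy['college_degree'], normalize='index')

print(counts)
print(shares)
```

Normalizing by index answers "what share of each sex has a degree"; `normalize='columns'` would instead answer "what share of degree holders is each sex".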


__12) Now, let’s recreate that table for people over 40 and people under 40. Has the difference between men and women in terms of getting a college degree improved, stayed the same, or worsened?__

In [62]:
# Above 40
cond1 = sub_df['age'] > 40
filtered_df = sub_df[cond1].copy()

pd.crosstab(filtered_df['sex'],filtered_df['college_degree'], normalize='index')

sex,college_degree=False,college_degree=True
male,0.476536,0.523464
female,0.479776,0.520224


In [63]:
# 40 and under
cond1 = sub_df['age'] <= 40
filtered_df = sub_df[cond1].copy()

pd.crosstab(filtered_df['sex'],filtered_df['college_degree'], normalize='index')

sex,college_degree=False,college_degree=True
male,0.704431,0.295569
female,0.637741,0.362259
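
To answer the question, it helps to put the two tables side by side as a single gap per age group. The sketch below just redoes the arithmetic on the `True` shares reported above (the dictionary names are made up for illustration). Keep in mind that the 40-and-under group still includes children, and that `college_degree` as built here counts any schooling beyond grade 12.

```python
# Shares with college_degree == True, copied from the two tables above
over_40 = {'male': 0.523464, 'female': 0.520224}
age_40_and_under = {'male': 0.295569, 'female': 0.362259}

# Gap = women's share minus men's share, in each age group
gap_over_40 = over_40['female'] - over_40['male']
gap_40_and_under = age_40_and_under['female'] - age_40_and_under['male']

print(round(gap_over_40, 4), round(gap_40_and_under, 4))
```

In these numbers, men and women over 40 are nearly even, while among those 40 and under the share of women is several percentage points higher than the share of men.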
