# Education data prep **Part 2**

### This script combines the following 3 datasets, aggregates them by county, renames the columns to a consistent scheme, and recalculates rates:
1. District Student Mobility/Stability Statistics 2011-2012 **by Instructional Program/Service Type**
2. District Student Mobility/Stability Statistics 2011-2012 **by Gender & Race/Ethnicity**
3. District Graduation Data Statistics 2011-2012 **by Instructional Program Service Type**

## Reference: Column Naming conventions

- The dataset has around 140 columns, so the names follow a fixed pattern. You should never need to scan the columns to find a name; use this section as the reference instead.
- For instance, to get the rate for any variable, append `_rate` to its name. So `graduated` becomes `graduated_rate`.

| Type | Naming | Example |
| - | - | - |
| County Total | variable | `stable` |
| Count | group + variable | `disabled_stable` |
| Rate | group + variable + "rate" | `disabled_stable_rate` |
| Group Total | group + group total | `disabled_pupil_total` |
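
The convention in the table above can be written as a small function. This is only a sketch: `column_name` is a hypothetical helper for illustration and is not part of `workspace.py`.

```python
def column_name(variable, group=None, rate=False):
    """Build a column name from the naming table (hypothetical helper)."""
    parts = [group, variable] if group else [variable]
    if rate:
        parts.append("rate")
    return "_".join(parts)


print(column_name("stable"))                                # county total
print(column_name("stable", group="disabled"))              # count
print(column_name("stable", group="disabled", rate=True))   # rate
print(column_name("pupil_total", group="disabled"))         # group total
```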

<br>

#### Mobility/Stability columns

| GROUPS | VARIABLES | GROUP TOTALS |
| - | - | - |
| disabled | stable | pupil_total |
| limited_eng | mobile | |
| poor | mobile_instances | |
| migrant | | |
| title_1 | | |
| homeless | | |
| gifted | | |
| male | | |
| female | | |
| white | | |
| asian | | |
| black | | |
| hispanic | | |

<br>

#### Graduation columns

| GROUPS | VARIABLES | GROUP TOTALS |
| - | - | - |
| disabled | graduated | grad_base_total |
| limited_eng | completed | |
| poor | | |
| migrant | | |
| title_1 | | |
| homeless | | |
| gifted | | |

<br>

**What are group totals?**
- Notice they aren't just called "total". For graduation data, the total number of students is not the right denominator. What matters is the number of students who are actually in the pool for graduation, so that column is called `grad_base_total` and is used when calculating graduation and completion rates. Mobility/stability rates use `pupil_total` in the same way.

**Rates are calculated by dividing a variable by its group total, then multiplying by 100**
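
As a worked example of that rule, here is a minimal sketch using one value pair from the graduation data (MAPLETON 1: 18 graduates with disabilities out of a graduation base of 49). The function name `rate` and the choice to return 0 for an empty group are assumptions made for illustration.

```python
def rate(count, group_total):
    """Percentage of group_total represented by count, rounded to 2 decimals."""
    if group_total == 0:
        return 0.0  # assumption: report an empty group as a rate of 0
    return round(count / group_total * 100, 2)


# disabled_graduated / disabled_grad_base_total for MAPLETON 1
disabled_graduated_rate = rate(18, 49)
print(disabled_graduated_rate)
```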

---

In [1]:
%run workspace.py

In [2]:
filtr = "`Organization Name` != 'STATE TOTAL'"

grad_raw = read_raw('dist_grad_rate', where=filtr)
mob_raw = read_raw('dist_student_mobility', where=filtr)
mob_dem_raw = read_raw('dist_mobility_demographics', where=filtr)

head(grad_raw, mob_raw, mob_dem_raw)

38 cols x 183 rows


Unnamed: 0,County Name,Organization Code,Organization Name,Students with Disabilities Final Grad Base,Students with Disabilities Graduates Total,Students with Disabilities Graduation Rate,Students with Disabilities Completers Total,Students with Disabilities Completion Rate,Limited English Proficient Final Grad Base,Limited English Proficient Graduates Total,...,Homeless Final Grad Base,Homeless Graduates Total,Homeless Graduation Rate,Homeless Completers Total,Homeless Completion Rate,Gifted-Talented Final Grad Base,Gifted-Talented Graduates Total,Gifted-Talented Graduation Rate,Gifted-Talented Completers Total,Gifted-Talented Completion Rate
0,ADAMS,10,MAPLETON 1,49,18,36.7,19,38.8,219,73,...,41,12,29.3,16,39.0,44,27,61.4,27,61.4
1,ADAMS,20,ADAMS 12 FIVE STAR SCHOOLS,250,118,47.2,127,50.8,379,257,...,106,62,58.5,65,61.3,227,201,88.5,208,91.6
2,ADAMS,30,ADAMS COUNTY 14,59,32,54.2,32,54.2,170,86,...,99,52,52.5,57,57.6,30,27,90.0,27,90.0


60 cols x 183 rows


Unnamed: 0,School Year,Org. Code,Organization Name,Category,Total Pupil Count (All students),Total Stable Pupil Count (All Students),Total Stability Rate (All Students),Total Mobile Student Count (All students),Total Student Mobility Rate (All students),Total Instances of Mobility (All students),...,Homeless Student Mobility Rate,Homeless Instances of Mobility,Homeless Mobility Incidence Rate,Gifted & Talented Pupil Count,Gifted & Talented Stable Student Count,Gifted & Talented Stability Rate,Gifted & Talented Mobile Student Count,Gifted & Talented Student Mobility Rate,Gifted & Talented Instances of Mobility,Gifted & Talented Mobility Incidence Rate
0,20112012,10,MAPLETON 1,1DISTRICT TOTALS (INCLUDING ALTERNATIVE SCHOOLS),9037,5077,56.2,3919,43.4,4133,...,32.7,79,36.9,250,205,82.0,44,17.6,47,18.8
1,20112012,20,ADAMS 12 FIVE STAR SCHOOLS,1DISTRICT TOTALS (INCLUDING ALTERNATIVE SCHOOLS),49889,34283,68.7,15424,30.9,16854,...,57.2,481,68.2,3590,3225,89.8,361,10.1,404,11.3
2,20112012,30,ADAMS COUNTY 14,1DISTRICT TOTALS (INCLUDING ALTERNATIVE SCHOOLS),8265,5510,66.7,3038,36.8,3397,...,49.7,529,59.7,377,317,84.1,75,19.9,89,23.6


74 cols x 183 rows


Unnamed: 0,School Year,Org. Code,Organization Name,Category,Total Pupil Count,Total Stable Student Count,Total Stability Rate,Total Mobile Student Count,Total Student Mobility Rate,Total Instances Of Mobility,...,Total Native Hawaiian or Other Pacific Islander Student Mobility Rate,Total Native Hawaiian or Other Pacific Islander Instances Of Mobility,Total Native Hawaiian or Other Pacific Islander Mobility Incidence Rate,Total Two or More Races Pupil Count,Total Two or More Races Stable Student Count,Total Two or More Races Stability Rate,Total Two or More Races Mobile Student Count,Total Two or More Races Student Mobility Rate,Total Two or More Races Instances Of Mobility,Total Two or More Races Mobility Incidence Rate
0,20112012,10,MAPLETON 1,1DISTRICT TOTALS (INCLUDING ALTERNATIVE SCHOOLS),9037,5077,56.2,3919,43.4,4133,...,70.8,17,70.8,219,129,58.9,90,41.1,91,41.6
1,20112012,20,ADAMS 12 FIVE STAR SCHOOLS,1DISTRICT TOTALS (INCLUDING ALTERNATIVE SCHOOLS),49889,34283,68.7,15424,30.9,16854,...,45.3,42,48.8,662,455,68.7,203,30.7,222,33.5
2,20112012,30,ADAMS COUNTY 14,1DISTRICT TOTALS (INCLUDING ALTERNATIVE SCHOOLS),8265,5510,66.7,3038,36.8,3397,...,0.0,0,0.0,55,28,50.9,26,47.3,28,50.9


In [3]:
def format_cols(df):
    import re
    df = (df
        .drop_cols(['County Name', 'Organization Code', 'School Year', 'Org. Code', 'Category'])
        .rename_col('Organization Name', 'district')
        .rename(columns={c: re.sub(r"\s|-", "_", c.lower()) for c in df.columns})
    )
    df = df.rename(columns={c: re.sub(r"\.|\(|\)|\&", "", c).replace('__', '_') for c in df.columns})
    return df

grad_raw = format_cols(grad_raw)
mob_raw = format_cols(mob_raw)
mob_dem_raw = format_cols(mob_dem_raw)
head(grad_raw)

36 cols x 183 rows


Unnamed: 0,district,students_with_disabilities_final_grad_base,students_with_disabilities_graduates_total,students_with_disabilities_graduation_rate,students_with_disabilities_completers_total,students_with_disabilities_completion_rate,limited_english_proficient_final_grad_base,limited_english_proficient_graduates_total,limited_english_proficient_graduation_rate,limited_english_proficient_completers_total,...,homeless_final_grad_base,homeless_graduates_total,homeless_graduation_rate,homeless_completers_total,homeless_completion_rate,gifted_talented_final_grad_base,gifted_talented_graduates_total,gifted_talented_graduation_rate,gifted_talented_completers_total,gifted_talented_completion_rate
0,MAPLETON 1,49,18,36.7,19,38.8,219,73,33.3,76,...,41,12,29.3,16,39.0,44,27,61.4,27,61.4
1,ADAMS 12 FIVE STAR SCHOOLS,250,118,47.2,127,50.8,379,257,67.8,261,...,106,62,58.5,65,61.3,227,201,88.5,208,91.6
2,ADAMS COUNTY 14,59,32,54.2,32,54.2,170,86,50.6,88,...,99,52,52.5,57,57.6,30,27,90.0,27,90.0
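
To see what the two regular-expression passes do to individual headers, here is a standalone sketch that applies the same substitutions to a few raw column names from the datasets. `clean_name` is a hypothetical name; it works on plain strings rather than a DataFrame.

```python
import re


def clean_name(name):
    """Lower-case a raw header and normalise separators and punctuation."""
    name = re.sub(r"[\s-]", "_", name.lower())  # whitespace and hyphens -> underscore
    name = re.sub(r"[.()&]", "", name)          # drop periods, parentheses, ampersands
    return name.replace("__", "_")              # collapse the gap left by " & "


raw_headers = [
    "Gifted-Talented Final Grad Base",
    "Total Pupil Count (All students)",
    "Gifted & Talented Pupil Count",
]
cleaned = [clean_name(h) for h in raw_headers]
for before, after in zip(raw_headers, cleaned):
    print(f"{before!r} -> {after!r}")
```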


### Before joining, make sure the district names match across all three datasets

In [4]:
sorted(grad_raw.district) == sorted(mob_raw.district) == sorted(mob_dem_raw.district)

True
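
If the check above ever returns `False`, it does not say which names differ. One possible way to find out is sketched below with `collections.Counter`, which also catches duplicated names. The function `name_differences` is hypothetical, and the misspelled name in `mob_names` is made up purely to trigger a mismatch.

```python
from collections import Counter


def name_differences(left, right):
    """Return the names (with counts) that are over-represented on each side."""
    left_counts, right_counts = Counter(left), Counter(right)
    return left_counts - right_counts, right_counts - left_counts


grad_names = ["MAPLETON 1", "ADAMS COUNTY 14", "ADAMS 12 FIVE STAR SCHOOLS"]
# Made-up typo for illustration only; the real datasets match
mob_names = ["ADAMS COUNTY 14", "MAPLETON 1", "ADAMS 12 FIVE STAR SCHOOL"]

only_grad, only_mob = name_differences(grad_names, mob_names)
print("only in graduation:", dict(only_grad))
print("only in mobility:", dict(only_mob))
```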

### Merge

In [5]:
# Remove the columns duplicated across mobility demographics and mobility datasets
mob_dem = mob_dem_raw.drop(columns=[
    'total_pupil_count', 'total_stable_student_count', 'total_stability_rate', 'total_mobile_student_count',
    'total_student_mobility_rate', 'total_instances_of_mobility', 'total_mobility_incidence_rate'])

# Combine the two mobility datasets, then join the graduation data
df_raw_dist = (
    mob_raw
    .merge(mob_dem, on=['district'])
    .merge(grad_raw, on=['district'])
)
head(df_raw_dist)

155 cols x 183 rows


Unnamed: 0,district,total_pupil_count_all_students,total_stable_pupil_count_all_students,total_stability_rate_all_students,total_mobile_student_count_all_students,total_student_mobility_rate_all_students,total_instances_of_mobility_all_students,total_mobility_incidence_rate_all_students,students_with_disabilities_pupil_count,students_with_disabilities_stable_student_count,...,homeless_final_grad_base,homeless_graduates_total,homeless_graduation_rate,homeless_completers_total,homeless_completion_rate,gifted_talented_final_grad_base,gifted_talented_graduates_total,gifted_talented_graduation_rate,gifted_talented_completers_total,gifted_talented_completion_rate
0,MAPLETON 1,9037,5077,56.2,3919,43.4,4133,45.7,735,469,...,41,12,29.3,16,39.0,44,27,61.4,27,61.4
1,ADAMS 12 FIVE STAR SCHOOLS,49889,34283,68.7,15424,30.9,16854,33.8,4339,3001,...,106,62,58.5,65,61.3,227,201,88.5,208,91.6
2,ADAMS COUNTY 14,8265,5510,66.7,3038,36.8,3397,41.1,876,636,...,99,52,52.5,57,57.6,30,27,90.0,27,90.0
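
Because each district should appear exactly once per dataset, the merge can also be asked to verify that. The sketch below uses the `validate` argument of `DataFrame.merge` on two tiny frames built from values shown above; the frame names are illustrative and this is not how the notebook itself performs the join.

```python
import pandas as pd

mobility = pd.DataFrame({
    "district": ["MAPLETON 1", "ADAMS COUNTY 14"],
    "total_pupil_count_all_students": [9037, 8265],
})
graduation = pd.DataFrame({
    "district": ["MAPLETON 1", "ADAMS COUNTY 14"],
    "students_with_disabilities_graduates_total": [18, 32],
})

# validate="one_to_one" raises pandas.errors.MergeError if a district
# name is duplicated on either side, instead of silently adding rows
merged = mobility.merge(graduation, on="district", how="inner", validate="one_to_one")
print(merged)
```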


## Column Name Manipulation
---

In [6]:
df = df_raw_dist.copy()

### Remove all rates. District-level rates cannot be summed when we aggregate by county, so we drop them here and recalculate them from the counts later

In [7]:
df = separate_by(df, "rate", mode='exclude')
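
The reason rates have to be recomputed from counts can be shown with two Adams County districts from the tables above. This is a small illustrative sketch in plain Python: averaging the two district rates weights both districts equally, while pooling the counts weights them by enrollment, and the two answers differ by several percentage points.

```python
# Counts for two Adams County districts, taken from the raw tables above
districts = {
    "MAPLETON 1": {"stable": 5077, "pupil_total": 9037},
    "ADAMS 12 FIVE STAR SCHOOLS": {"stable": 34283, "pupil_total": 49889},
}

# Misleading: averaging district rates gives every district equal weight
district_rates = [d["stable"] / d["pupil_total"] * 100 for d in districts.values()]
mean_of_rates = sum(district_rates) / len(district_rates)

# Preferred: sum the counts first, then divide once
total_stable = sum(d["stable"] for d in districts.values())
total_pupils = sum(d["pupil_total"] for d in districts.values())
pooled_rate = total_stable / total_pupils * 100

print(round(mean_of_rates, 2), round(pooled_rate, 2))
```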

#### Remove the American Indian and Native Hawaiian groups because they are very small and are 0 for many counties. Remove `two_or_more_races` because it is inconsistent and hard to compare with the other groups

In [8]:
df = separate_by(df, "american_indian", mode='exclude')
df = separate_by(df, "native_hawaiian", mode='exclude')
df = separate_by(df, "two_or_more", mode='exclude')
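
`separate_by` comes from `workspace.py` and its implementation is not shown here. Assuming that `mode='exclude'` drops every column whose name contains the given text, a rough pandas equivalent might look like the sketch below. `exclude_columns` is a hypothetical stand-in, not the real helper.

```python
import pandas as pd


def exclude_columns(df, text):
    """Drop every column whose name contains `text` (stand-in for separate_by)."""
    keep = ~df.columns.str.contains(text, regex=False)
    return df.loc[:, keep]


sample = pd.DataFrame({
    "district": ["MAPLETON 1"],
    "total_pupil_count_all_students": [9037],
    "total_stability_rate_all_students": [56.2],
})
without_rates = exclude_columns(sample, "rate")
print(list(without_rates.columns))
```

Substring matching is convenient here, but it would also remove any column that merely happens to contain the text, so it is worth checking the remaining column list after each call.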

### Standardize group names, then shorten group names
- Graduation data has `limited_english_proficient` and `econ_disadvant`
- Mobility data has `english_language_learners` and `economically_disadvantaged`

**Standardize these to `limited_eng` and `poor`, and shorten the others**

In [9]:
df = df.col_replace({
    # Mobility/Stability groups
    "limited_english_proficient": "limited_eng",
    "english_language_learners": "limited_eng",
    "economically_disadvantaged": "poor",
    "econ_disadvant": "poor",
    "students_with_disabilities": "disabled",
    "gifted_talented": "gifted",
    # Demographics
    "black_or_african_american": "black",
    "hispanic_or_latino": "hispanic",
    # Graduation data
    "final_grad_base": "grad_base_total",
    "graduates_total": "graduated",
    "completers_total": "completed",
    # Mobility/Stability data
    "instances_of_mobility": "mobile_instances",
    "pupil_count": "pupil_total",
    "_student_count": "",
    # Variable totals
    "_all_students": "",
    "total_": "",
}
).rename_col('stable_pupil_total', 'stable')
df_dist_counts = df
head(df_dist_counts, with_tail=True)

78 cols x 183 rows


Unnamed: 0,district,pupil_total,stable,mobile,mobile_instances,disabled_pupil_total,disabled_stable,disabled_mobile,disabled_mobile_instances,limited_eng_pupil_total,...,migrant_completed,title_1_grad_base_total,title_1_graduated,title_1_completed,homeless_grad_base_total,homeless_graduated,homeless_completed,gifted_grad_base_total,gifted_graduated,gifted_completed
0,MAPLETON 1,9037,5077,3919,4133,735,469,261,279,2863,...,5,218,118,124,41,12,16,44,27,27
1,ADAMS 12 FIVE STAR SCHOOLS,49889,34283,15424,16854,4339,3001,1325,1501,6141,...,12,224,80,98,106,62,65,227,201,208
181,SAN JUAN BOCES,84,0,84,84,5,0,5,5,0,...,0,0,0,0,0,0,0,1,0,1
182,EXPEDITIONARY BOCES,402,302,100,101,40,34,6,6,16,...,0,0,0,0,0,0,0,0,0,0
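
`col_replace` is another `workspace.py` helper. Assuming it applies each substring replacement to every column name in the order the dictionary lists them, its effect on a single name can be traced with the sketch below. `replace_in_name` is hypothetical and only a subset of the mapping is used.

```python
def replace_in_name(name, replacements):
    """Apply each substring replacement in dictionary order."""
    for old, new in replacements.items():
        name = name.replace(old, new)
    return name


# Subset of the mapping used in the cell above
replacements = {
    "students_with_disabilities": "disabled",
    "final_grad_base": "grad_base_total",
    "pupil_count": "pupil_total",
    "_student_count": "",
    "_all_students": "",
    "total_": "",
}

examples = [
    "total_pupil_count_all_students",
    "students_with_disabilities_stable_student_count",
    "students_with_disabilities_final_grad_base",
]
renamed = [replace_in_name(name, replacements) for name in examples]
for before, after in zip(examples, renamed):
    print(f"{before} -> {after}")
```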


#### Standardize district names

In [10]:
from format_district import standardize_district_name
df.district = df.district.apply(standardize_district_name)
head(df)

78 cols x 183 rows


Unnamed: 0,district,pupil_total,stable,mobile,mobile_instances,disabled_pupil_total,disabled_stable,disabled_mobile,disabled_mobile_instances,limited_eng_pupil_total,...,migrant_completed,title_1_grad_base_total,title_1_graduated,title_1_completed,homeless_grad_base_total,homeless_graduated,homeless_completed,gifted_grad_base_total,gifted_graduated,gifted_completed
0,MAPLETON 1,9037,5077,3919,4133,735,469,261,279,2863,...,5,218,118,124,41,12,16,44,27,27
1,ADAMSFIVESTAR 12,49889,34283,15424,16854,4339,3001,1325,1501,6141,...,12,224,80,98,106,62,65,227,201,208
2,ADAMSCOUNTY 14,8265,5510,3038,3397,876,636,266,311,3826,...,4,419,296,301,99,52,57,30,27,27


In [11]:
write_main(df_dist_counts, 'education_dist_counts')

183

#### Bring in county column

In [12]:
dist_county = read_main('select district, in_county as county from district')
head(dist_county)

2 cols x 183 rows


Unnamed: 0,district,county
0,MAPLETON 1,ADAMS
1,ADAMSFIVESTAR 12,ADAMS
2,ADAMSCOUNTY 14,ADAMS


#### Make sure all districts match across datasets

In [13]:
from format_district import join_conflicts

conflicts = join_conflicts(df, dist_county, 'district')
assert len(conflicts) == 0

df = df.merge(dist_county, on='district').move_col('county', 1)
head(df)

79 cols x 183 rows


Unnamed: 0,district,county,pupil_total,stable,mobile,mobile_instances,disabled_pupil_total,disabled_stable,disabled_mobile,disabled_mobile_instances,...,migrant_completed,title_1_grad_base_total,title_1_graduated,title_1_completed,homeless_grad_base_total,homeless_graduated,homeless_completed,gifted_grad_base_total,gifted_graduated,gifted_completed
0,MAPLETON 1,ADAMS,9037,5077,3919,4133,735,469,261,279,...,5,218,118,124,41,12,16,44,27,27
1,ADAMSFIVESTAR 12,ADAMS,49889,34283,15424,16854,4339,3001,1325,1501,...,12,224,80,98,106,62,65,227,201,208
2,ADAMSCOUNTY 14,ADAMS,8265,5510,3038,3397,876,636,266,311,...,4,419,296,301,99,52,57,30,27,27
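
`join_conflicts` is imported from `format_district` and is not shown in this notebook. A similar check can be sketched with plain pandas by running an outer merge with `indicator=True` and keeping the rows that did not match on both sides. The `UNKNOWN DISTRICT` row is made up to produce a conflict.

```python
import pandas as pd

counts = pd.DataFrame({
    "district": ["MAPLETON 1", "ADAMSCOUNTY 14", "UNKNOWN DISTRICT"],  # last row is made up
})
lookup = pd.DataFrame({
    "district": ["MAPLETON 1", "ADAMSCOUNTY 14"],
    "county": ["ADAMS", "ADAMS"],
})

# indicator=True adds a "_merge" column: "both", "left_only" or "right_only"
check = counts.merge(lookup, on="district", how="outer", indicator=True)
conflicts = check[check["_merge"] != "both"]
print(conflicts)
```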


### Create county grouping

In [14]:
df_county_counts = (df
    .groupby(['county'])
    .sum(numeric_only=True)
    .reset_index()
)
head(df_county_counts)

78 cols x 63 rows


Unnamed: 0,county,pupil_total,stable,mobile,mobile_instances,disabled_pupil_total,disabled_stable,disabled_mobile,disabled_mobile_instances,limited_eng_pupil_total,...,migrant_completed,title_1_grad_base_total,title_1_graduated,title_1_completed,homeless_grad_base_total,homeless_graduated,homeless_completed,gifted_grad_base_total,gifted_graduated,gifted_completed
0,ADAMS,98546,67272,31222,33925,8848,6263,2588,2896,20773,...,33,935,529,559,360,190,204,402,337,345
1,ALAMOSA,2775,1882,885,950,223,159,63,66,368,...,4,28,22,23,6,6,6,0,0,0
2,ARAPAHOE,124639,94109,30134,32269,11842,9461,2354,2568,25370,...,9,488,202,213,243,96,102,909,820,828
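
The aggregation step can be checked by hand on a small subset. The sketch below sums the three Adams County districts displayed earlier. Adams County has more districts than these three, so the result is smaller than the full county total in the table above.

```python
import pandas as pd

districts = pd.DataFrame({
    "district": ["MAPLETON 1", "ADAMSFIVESTAR 12", "ADAMSCOUNTY 14"],
    "county": ["ADAMS", "ADAMS", "ADAMS"],
    "pupil_total": [9037, 49889, 8265],
    "stable": [5077, 34283, 5510],
})

# numeric_only=True leaves the text column "district" out of the sum
county_counts = districts.groupby("county").sum(numeric_only=True).reset_index()
print(county_counts)
```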


In [15]:
write_main(df_county_counts, 'education_county_counts')

63

## Calculate Rates
---

- The code below looks dense, but the idea is simple: divide each count by the total of the group it belongs to, then multiply by 100 to get a rate.
- For example, `disabled_stable` / `disabled_pupil_total` gives the percentage of students with disabilities who are stable, and `stable` / `pupil_total` gives the percentage of all students who are stable. Graduation variables (`graduated`, `completed`) are divided by the group's `grad_base_total` instead.
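
Here is a small worked version of that idea for one group, using the MAPLETON 1 counts for students with disabilities. It is a sketch in plain Python rather than the notebook's own function. It selects a group's columns by prefix because a plain substring test such as `"male" in column` would also match `female_*` columns.

```python
row = {
    "disabled_pupil_total": 735,
    "disabled_stable": 469,
    "disabled_mobile": 261,
    "disabled_mobile_instances": 279,
}

group = "disabled"
rates = {}
for column, count in row.items():
    # Prefix match on "<group>_", skipping the group total itself
    if column.startswith(f"{group}_") and "total" not in column:
        rates[f"{column}_rate"] = round(count / row[f"{group}_pupil_total"] * 100, 2)
print(rates)

# Why prefix matching matters: substring matching over-selects
columns = ["male_pupil_total", "male_stable", "female_pupil_total", "female_stable"]
substring_matches = [c for c in columns if "male" in c and "total" not in c]
prefix_matches = [c for c in columns if c.startswith("male_") and "total" not in c]
print(substring_matches, prefix_matches)
```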

In [16]:
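def get_rates(df, index):
    """Return (counts + rates, rates only) for a frame keyed by `index` columns."""
    df = df.copy()
    df_rates = df[index].copy()

    # Overall rates: each variable divided by the total pupil count
    for c in ['stable', 'mobile', 'mobile_instances']:
        group_rate = (df[c] / df['pupil_total'] * 100).round(2).fillna(0)
        df_rates[f"{c}_rate"] = group_rate
        df[f"{c}_rate"] = group_rate

    # Group rates: each group's counts divided by that group's total
    for group in [
            'disabled', 'limited_eng', 'poor', 'migrant', 'title_1', 'homeless', 'gifted',
            'male', 'female', 'white', 'black', 'hispanic', 'asian']:

        # Match on the prefix so that 'male' does not also pick up 'female_*' columns
        for c in [c for c in df.columns if c.startswith(f"{group}_") and "total" not in c]:
            var = c[len(group) + 1:]

            # Graduation variables use the graduation base as the denominator
            if var in ['graduated', 'completed']:
                new = df[c] / df[f"{group}_grad_base_total"]
            else:
                new = df[c] / df[f"{group}_pupil_total"]

            # 0 / 0 produces NaN, which is reported as a rate of 0
            new = (new * 100).round(2).fillna(0)
            df_rates[f"{c}_rate"] = new
            df[f"{c}_rate"] = new

    return df, df_rates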

In [17]:
df_dist_all, df_dist_rates = get_rates(df_dist_counts, ['district'])
df_county_all, df_county_rates = get_rates(df_county_counts, ['county'])
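
Some groups are empty in some counties. For example, ALAMOSA has a `gifted_grad_base_total` of 0. The sketch below shows how the divide, round, and `fillna(0)` chain treats that case, using the ADAMS and ALAMOSA county counts from the table above. Note that `fillna(0)` only covers the 0 / 0 case; a non-zero count over a zero total would yield `inf`, which the sketch does not handle.

```python
import pandas as pd

county = pd.DataFrame({
    "county": ["ADAMS", "ALAMOSA"],
    "gifted_grad_base_total": [402, 0],
    "gifted_graduated": [337, 0],
})

# 0 / 0 evaluates to NaN in pandas; fillna(0) turns it into a rate of 0
ratio = county["gifted_graduated"] / county["gifted_grad_base_total"]
county["gifted_graduated_rate"] = (ratio * 100).round(2).fillna(0)
print(county)
```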

In [18]:
head(df_dist_all, df_dist_counts, df_dist_rates, df_county_all, df_county_counts, df_county_rates)

137 cols x 183 rows


Unnamed: 0,district,pupil_total,stable,mobile,mobile_instances,disabled_pupil_total,disabled_stable,disabled_mobile,disabled_mobile_instances,limited_eng_pupil_total,...,white_mobile_instances_rate,black_stable_rate,black_mobile_rate,black_mobile_instances_rate,hispanic_stable_rate,hispanic_mobile_rate,hispanic_mobile_instances_rate,asian_stable_rate,asian_mobile_rate,asian_mobile_instances_rate
0,MAPLETON 1,9037,5077,3919,4133,735,469,261,279,2863,...,52.08,48.04,51.4,53.63,60.31,39.19,42.19,47.22,52.78,53.7
1,ADAMSFIVESTAR 12,49889,34283,15424,16854,4339,3001,1325,1501,6141,...,32.27,55.98,43.94,47.41,67.3,32.23,36.81,81.07,18.84,21.15
2,ADAMSCOUNTY 14,8265,5510,3038,3397,876,636,266,311,3826,...,47.45,48.2,51.8,53.6,68.67,35.2,39.57,80.0,20.0,20.0


78 cols x 183 rows


Unnamed: 0,district,pupil_total,stable,mobile,mobile_instances,disabled_pupil_total,disabled_stable,disabled_mobile,disabled_mobile_instances,limited_eng_pupil_total,...,migrant_completed,title_1_grad_base_total,title_1_graduated,title_1_completed,homeless_grad_base_total,homeless_graduated,homeless_completed,gifted_grad_base_total,gifted_graduated,gifted_completed
0,MAPLETON 1,9037,5077,3919,4133,735,469,261,279,2863,...,5,218,118,124,41,12,16,44,27,27
1,ADAMSFIVESTAR 12,49889,34283,15424,16854,4339,3001,1325,1501,6141,...,12,224,80,98,106,62,65,227,201,208
2,ADAMSCOUNTY 14,8265,5510,3038,3397,876,636,266,311,3826,...,4,419,296,301,99,52,57,30,27,27


60 cols x 183 rows


Unnamed: 0,district,stable_rate,mobile_rate,mobile_instances_rate,disabled_stable_rate,disabled_mobile_rate,disabled_mobile_instances_rate,disabled_graduated_rate,disabled_completed_rate,limited_eng_stable_rate,...,white_mobile_instances_rate,black_stable_rate,black_mobile_rate,black_mobile_instances_rate,hispanic_stable_rate,hispanic_mobile_rate,hispanic_mobile_instances_rate,asian_stable_rate,asian_mobile_rate,asian_mobile_instances_rate
0,MAPLETON 1,56.18,43.37,45.73,63.81,35.51,37.96,36.73,38.78,66.01,...,52.08,48.04,51.4,53.63,60.31,39.19,42.19,47.22,52.78,53.7
1,ADAMSFIVESTAR 12,68.72,30.92,33.78,69.16,30.54,34.59,47.2,50.8,67.56,...,32.27,55.98,43.94,47.41,67.3,32.23,36.81,81.07,18.84,21.15
2,ADAMSCOUNTY 14,66.67,36.76,41.1,72.6,30.37,35.5,54.24,54.24,73.05,...,47.45,48.2,51.8,53.6,68.67,35.2,39.57,80.0,20.0,20.0


137 cols x 63 rows


Unnamed: 0,county,pupil_total,stable,mobile,mobile_instances,disabled_pupil_total,disabled_stable,disabled_mobile,disabled_mobile_instances,limited_eng_pupil_total,...,white_mobile_instances_rate,black_stable_rate,black_mobile_rate,black_mobile_instances_rate,hispanic_stable_rate,hispanic_mobile_rate,hispanic_mobile_instances_rate,asian_stable_rate,asian_mobile_rate,asian_mobile_instances_rate
0,ADAMS,98546,67272,31222,33925,8848,6263,2588,2896,20773,...,32.45,54.66,45.08,47.91,67.49,32.71,36.4,78.55,21.37,23.54
1,ALAMOSA,2775,1882,885,950,223,159,63,66,368,...,35.54,57.14,42.86,42.86,70.15,29.47,32.69,64.0,36.0,36.0
2,ARAPAHOE,124639,94109,30134,32269,11842,9461,2354,2568,25370,...,21.16,67.47,32.03,34.57,72.76,26.75,29.2,78.59,21.3,22.56


78 cols x 63 rows


Unnamed: 0,county,pupil_total,stable,mobile,mobile_instances,disabled_pupil_total,disabled_stable,disabled_mobile,disabled_mobile_instances,limited_eng_pupil_total,...,migrant_completed,title_1_grad_base_total,title_1_graduated,title_1_completed,homeless_grad_base_total,homeless_graduated,homeless_completed,gifted_grad_base_total,gifted_graduated,gifted_completed
0,ADAMS,98546,67272,31222,33925,8848,6263,2588,2896,20773,...,33,935,529,559,360,190,204,402,337,345
1,ALAMOSA,2775,1882,885,950,223,159,63,66,368,...,4,28,22,23,6,6,6,0,0,0
2,ARAPAHOE,124639,94109,30134,32269,11842,9461,2354,2568,25370,...,9,488,202,213,243,96,102,909,820,828


60 cols x 63 rows


Unnamed: 0,county,stable_rate,mobile_rate,mobile_instances_rate,disabled_stable_rate,disabled_mobile_rate,disabled_mobile_instances_rate,disabled_graduated_rate,disabled_completed_rate,limited_eng_stable_rate,...,white_mobile_instances_rate,black_stable_rate,black_mobile_rate,black_mobile_instances_rate,hispanic_stable_rate,hispanic_mobile_rate,hispanic_mobile_instances_rate,asian_stable_rate,asian_mobile_rate,asian_mobile_instances_rate
0,ADAMS,68.26,31.68,34.43,70.78,29.25,32.73,47.54,50.1,69.99,...,32.45,54.66,45.08,47.91,67.49,32.71,36.4,78.55,21.37,23.54
1,ALAMOSA,67.82,31.89,34.23,71.3,28.25,29.6,86.67,93.33,72.01,...,35.54,57.14,42.86,42.86,70.15,29.47,32.69,64.0,36.0,36.0
2,ARAPAHOE,75.51,24.18,25.89,79.89,19.88,21.69,51.26,52.06,74.94,...,21.16,67.47,32.03,34.57,72.76,26.75,29.2,78.59,21.3,22.56


## Save
---

In [19]:
write_main(df_dist_all, 'education_dist')
write_main(df_dist_rates, 'education_dist_rates')

write_main(df_county_all, 'education_county')
write_main(df_county_rates, 'education_county_rates')

63