This project explores demographic differences in immediate postsecondary enrollment among Washington high school graduates. It uses public data from data.wa.gov.
The first dataset contains postsecondary enrollment rates, broken down by:
year (2005-2019); area (state, county, district, school); demographic (sex, race, income, etc.); and institution type (2 year, 4 year, or not enrolled). For now, I will arrange the data for all students at the district level.

In [1]:
#READ IN HIGH SCHOOL GRADUATE OUTCOMES - FIRST YEAR ENROLLMENT
#https://data.wa.gov/Education/High-School-Graduate-Outcomes-First-Year-Enrollmen/vk6s-am8z

#!/usr/bin/env python

# make sure to install these packages before running:
# pip install pandas
# pip install sodapy

import pandas as pd
from sodapy import Socrata

# Unauthenticated client only works with public data sets. Note 'None'
# in place of application token, and no username or password:
client = Socrata("data.wa.gov", None)

# Example authenticated client (needed for non-public datasets):
# client = Socrata("data.wa.gov",
#                  MyAppToken,
#                  username="user@example.com",
#                  password="AFakePassword")

# All results, returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get("7ma7-qs6m", limit=400000)

# Convert to pandas DataFrame
enrollment_df = pd.DataFrame.from_records(results)



In [2]:
#filter to all students at the school district level, keeping the '1yr' cohort type only
enrollment_df=enrollment_df[enrollment_df['districttype'].str.contains('School Dist')]
enrollment_df=enrollment_df[enrollment_df['demotype'].str.contains('All Students')]
enrollment_df=enrollment_df[enrollment_df['cohorttype'].str.contains('1yr')]
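The three filters above can also be written as a single boolean mask. This is a minimal sketch on a made-up frame: the rows and the names `toy`, `mask`, and `filtered` are hypothetical, and only the column names and matched substrings come from the cell above.

```python
import pandas as pd

# Hypothetical rows that mimic the columns used in the filter cell.
toy = pd.DataFrame({
    "districttype": ["School District", "County", "School District", "School District"],
    "demotype": ["All Students", "All Students", "Gender", "All Students"],
    "cohorttype": ["1yr", "1yr", "1yr", "4yr"],
    "pct": ["0.44", "0.40", "0.51", "0.38"],
})

# Combine the three conditions; regex=False treats each pattern as a literal substring.
mask = (
    toy["districttype"].str.contains("School Dist", regex=False)
    & toy["demotype"].str.contains("All Students", regex=False)
    & toy["cohorttype"].str.contains("1yr", regex=False)
)

# .copy() returns an independent frame, so later column assignments
# do not raise SettingWithCopyWarning.
filtered = toy[mask].copy()
print(filtered)
```

Only the row that satisfies all three conditions survives. Taking a copy after filtering is a design choice that keeps later in-place edits from touching a view of the original frame.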

In [3]:
#drop unused columns
enrollment_df.drop(['districttype','schoolttl','redactedpct','demotype',
                    'demographicgroup', 'demographicvalue', 'cohorttype'],
                   axis=1, inplace=True)
enrollment_df.reset_index(drop=True, inplace=True)

In [4]:
#change 'pct' and 'cohortyearttl' from object to numeric (unparseable values become NaN)
enrollment_df['pct']=pd.to_numeric(enrollment_df['pct'], errors='coerce')
enrollment_df['cohortyearttl']=pd.to_numeric(enrollment_df['cohortyearttl'], errors='coerce')
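To illustrate what `errors='coerce'` does during this conversion, here is a small sketch on a made-up Series. The values are hypothetical, and the `"*"` entry is only an assumed stand-in for a non-numeric cell, not something taken from the dataset.

```python
import pandas as pd

# Hypothetical object-typed strings; "*" stands in for a non-numeric entry.
raw = pd.Series(["0.44", "0.18", "*", "0.31"])

# Values that cannot be parsed become NaN instead of raising an error.
converted = pd.to_numeric(raw, errors="coerce")

print(converted.dtype)         # numeric dtype after conversion
print(converted.isna().sum())  # number of values that failed to parse
```

Counting the NaN values after conversion is a quick way to see how many entries were silently dropped by the coercion.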

In [5]:
#keep cohort years 2014 and later, the first year covered by the demographic dataset
enrollment_df=enrollment_df[enrollment_df['cohortyearttl']>=2014]

In [6]:
#pivot from long to wide data frame and reset index
df1=pd.pivot_table(enrollment_df,
                   index=['cohortyearttl','districtttl'],
                   columns='psenrolllevel',
                   values='pct').reset_index()

#remove index name
df1.index.name = df1.columns.name = None

#rename columns
df1.rename(columns={'cohortyearttl': 'Year', 'districtttl': 'District', '2 Year / CTC' : '2 Year'}, inplace=True)
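One detail of the pivot is worth spelling out: `pd.pivot_table` aggregates with the mean by default, so if several rows share the same year, district, and enrollment level, their `pct` values are averaged. This may explain fractional rates such as 0.266667 in the merged output, though that is an assumption about the underlying rows. The sketch below uses hypothetical data (`long_df`, districts "A" and "B") to show the behavior.

```python
import pandas as pd

# Hypothetical long-format rows; district "A" has two "4 Year" rows for 2014.
long_df = pd.DataFrame({
    "cohortyearttl": [2014, 2014, 2014, 2014],
    "districtttl": ["A", "A", "A", "B"],
    "psenrolllevel": ["4 Year", "4 Year", "2 Year / CTC", "4 Year"],
    "pct": [0.20, 0.30, 0.40, 0.10],
})

# Default aggfunc is the mean, so the duplicate rows for "A" are averaged.
wide = pd.pivot_table(long_df,
                      index=["cohortyearttl", "districtttl"],
                      columns="psenrolllevel",
                      values="pct").reset_index()
wide.columns.name = None

print(wide)
```

District "A" ends up with the average of its two "4 Year" rows, and district "B" gets NaN for the level it has no row for.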

The next dataset includes demographic data. It contains raw counts of students broken down by:

Year (2014-15 to 2021-22);
area (state, county, district, school); and
demographic.
For now, I will arrange the data for all students at the district level.

In [7]:
#LOAD DISTRICT DEMOGRAPHIC DATA
#https://data.wa.gov/education/Report-Card-Enrollment-from-2014-15-to-Current-Yea/rxjk-6ieq

#!/usr/bin/env python

# make sure to install these packages before running:
# pip install pandas
# pip install sodapy

import pandas as pd
from sodapy import Socrata

# Unauthenticated client only works with public data sets. Note 'None'
# in place of application token, and no username or password:
client = Socrata("data.wa.gov", None)

# Example authenticated client (needed for non-public datasets):
# client = Socrata("data.wa.gov",
#                  MyAppToken,
#                  username="user@example.com",
#                  password="AFakePassword")

# All results, returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
demo = client.get("rxjk-6ieq", limit=200000)

# Convert to pandas DataFrame
demo_df = pd.DataFrame.from_records(demo)



In [8]:
#filter to district-level rows covering all grades
demo_df=demo_df[demo_df['organizationlevel'].str.contains('District')]
demo_df=demo_df[demo_df['gradelevel'].str.contains('AllGrades')]

In [9]:
#drop unused columns
demo_df.drop(['organizationlevel','county','esdname','schoolname',
              'gradelevel', 'dataasof', 'esdorganizationid', 'districtcode',
              'districtorganizationid', 'schoolcode', 'schoolorganizationid', 'currentschooltype',
              'fostercare', 'non_fostercare', 'students_without_disabilities','students_with_disabilities',
              'section_504', 'non_section_504',
              'mobile', 'non_mobile', 'military_parent', 'non_military_parent','migrant', 'non_migrant',
              'homeless', 'non_homeless',
              ],
             axis=1, inplace=True)
demo_df.reset_index(drop=True, inplace=True)

In [10]:
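#strip the ' School District' suffix so names match the enrollment data
demo_df['districtname'] = demo_df['districtname'].str.replace(' School District', '', regex=False)
#keep the starting year only, e.g. '2014-15' becomes '2014'
demo_df['schoolyear'] = demo_df['schoolyear'].str[:-3]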
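The two string operations in this cell can be checked on a tiny example. The frame below is hypothetical (the name `toy` and its rows are made up), with values written in the same shape as the `districtname` and `schoolyear` columns.

```python
import pandas as pd

# Hypothetical rows in the same shape as the demographic data.
toy = pd.DataFrame({
    "districtname": ["Aberdeen School District", "Adna School District"],
    "schoolyear": ["2014-15", "2019-20"],
})

# Remove the literal suffix; regex=False avoids treating the pattern as a regex.
toy["districtname"] = toy["districtname"].str.replace(" School District", "", regex=False)

# Drop the last three characters ("-15", "-20") to keep the starting year.
toy["schoolyear"] = toy["schoolyear"].str[:-3]

print(toy)
```

After these steps the district names and years are in the same form as the `District` and `Year` columns of the enrollment frame, which is what the later merge relies on.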

In [11]:
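#convert the count columns (everything after schoolyear and districtname) to numeric
severalToNum=lambda x:pd.to_numeric(x,errors='coerce')

where=demo_df.columns[2:]

#assign by column label so the converted columns replace the object-typed ones
demo_df[where]=demo_df[where].apply(severalToNum)
demo_df['schoolyear']=pd.to_numeric(demo_df['schoolyear'], errors="coerce")

#keep school years through 2019, the last year in the enrollment data
demo_df = demo_df[demo_df['schoolyear']<=2019]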



In [12]:
df2=demo_df
df2.rename(columns={'schoolyear': 'Year', 'districtname': 'District'}, inplace=True)
df2.reset_index(drop=True, inplace=True)

In [13]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1066 entries, 0 to 1065
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Year          1066 non-null   int64  
 1   District      1066 non-null   object 
 2   2 Year        1066 non-null   float64
 3   4 Year        1066 non-null   float64
 4   Not Enrolled  1066 non-null   float64
dtypes: float64(3), int64(1), object(1)
memory usage: 41.8+ KB


In [14]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1928 entries, 0 to 1927
Data columns (total 19 columns):
 #   Column                          Non-Null Count  Dtype 
---  ------                          --------------  ----- 
 0   Year                            1928 non-null   int64 
 1   District                        1928 non-null   object
 2   all_students                    1928 non-null   int64 
 3   female                          1928 non-null   int64 
 4   male                            1928 non-null   int64 
 5   gender_x                        1928 non-null   int64 
 6   american_indian_alaskan_native  1928 non-null   int64 
 7   asian                           1928 non-null   int64 
 8   black_african_american          1928 non-null   int64 
 9   hispanic_latino_of_any_race     1928 non-null   int64 
 10  native_hawaiian_other_pacific   1928 non-null   int64 
 11  two_or_more_races               1928 non-null   int64 
 12  white                           1928 non-null   

In [15]:
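#inner merge: keeps only District/Year pairs present in both frames
df3 = pd.merge(df1, df2, on=['District', 'Year'])
#note: df3.info without () echoes the frame instead of printing the info() summary
df3.info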

<bound method DataFrame.info of       Year             District  2 Year    4 Year  Not Enrolled  all_students  \
0     2014             Aberdeen   0.440  0.180000      0.385000          3404   
1     2014                 Adna   0.420  0.120000      0.470000           610   
2     2014            Anacortes   0.275  0.420000      0.310000          2708   
3     2014            Arlington   0.310  0.275000      0.415000          5575   
4     2014               Auburn   0.285  0.325000      0.387500         15685   
...    ...                  ...     ...       ...           ...           ...   
1019  2019  White Salmon Valley   0.120  0.375000      0.510000          1285   
1020  2019             Woodland   0.245  0.130000      0.625000          2535   
1021  2019               Yakima   0.280  0.266667      0.453333         16419   
1022  2019                 Yelm   0.185  0.285000      0.530000          5902   
1023  2019               Zillah   0.310  0.220000      0.470000          1335
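Because `pd.merge` defaults to an inner join, rows without a matching District/Year pair in both frames are dropped: `df1` has 1066 rows, while the merged output above ends at index 1023. One way to see which pairs fail to match is an outer merge with `indicator=True`. The sketch below uses small hypothetical frames (`left`, `right`, and their values are made up) rather than the real data.

```python
import pandas as pd

# Hypothetical stand-ins for df1 (rates) and df2 (counts).
left = pd.DataFrame({
    "Year": [2014, 2014, 2015],
    "District": ["Aberdeen", "Adna", "Adna"],
    "4 Year": [0.1, 0.2, 0.3],
})
right = pd.DataFrame({
    "Year": [2014, 2015, 2015],
    "District": ["Aberdeen", "Adna", "Auburn"],
    "all_students": [100, 200, 300],
})

# An outer merge keeps every row; the _merge column records where each came from.
check = pd.merge(left, right, on=["District", "Year"], how="outer", indicator=True)

# Rows that an inner merge would have dropped.
unmatched = check[check["_merge"] != "both"]
print(unmatched[["Year", "District", "_merge"]])
```

Rows flagged `left_only` or `right_only` point to district names or years that differ between the two sources, for example a spelling mismatch left over from the name cleanup.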

In [16]:
df3

Unnamed: 0,Year,District,2 Year,4 Year,Not Enrolled,all_students,female,male,gender_x,american_indian_alaskan_native,...,hispanic_latino_of_any_race,native_hawaiian_other_pacific,two_or_more_races,white,english_language_learners,highly_capable,low_income,non_english_language_learners,non_highly_capable,non_low_income
0,2014,Aberdeen,0.440,0.180000,0.385000,3404,1661,1743,0,145,...,1003,10,162,1952,334,96,2472,3070,3308,932
1,2014,Adna,0.420,0.120000,0.470000,610,287,323,0,3,...,35,0,14,551,0,0,176,610,610,434
2,2014,Anacortes,0.275,0.420000,0.310000,2708,1338,1370,0,20,...,226,8,44,2274,61,46,898,2647,2662,1810
3,2014,Arlington,0.310,0.275000,0.415000,5575,2661,2914,0,63,...,667,30,279,4390,217,3,2089,5358,5572,3486
4,2014,Auburn,0.285,0.325000,0.387500,15685,7730,7955,0,221,...,4148,530,1444,7081,2341,227,9349,13344,15458,6336
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1019,2019,White Salmon Valley,0.120,0.375000,0.510000,1285,591,693,1,10,...,406,2,42,819,196,100,589,1089,1185,696
1020,2019,Woodland,0.245,0.130000,0.625000,2535,1193,1342,0,8,...,571,2,99,1821,213,170,1228,2322,2365,1307
1021,2019,Yakima,0.280,0.266667,0.453333,16419,8126,8293,0,150,...,13058,18,396,2642,4947,462,13681,11472,15957,2738
1022,2019,Yelm,0.185,0.285000,0.530000,5902,2813,3085,4,108,...,937,89,686,3924,165,357,3001,5737,5545,2901
