# About this Notebook
9/26/23<br>
This notebook pulls enrollment data from the Chicago Data Portal rather than from CPS' official "20th day" numbers; the 20th-day demographics file is used only to look up each school's community area. It then organizes the data and prepares it for analysis in a subsequent notebook.
<br>
### Dataset timestamps:
<ul>
<li>CPS Profile SY2223: last updated 11/28/22 and no longer being updated</li>
<li>CPS Profile SY2324: <span style="color:red;">last updated 9/13/23 and likely to be updated again</span></li>
</ul>
    
### Next Steps
N/A

# 1. Review Documentation

### Chicago Data Portal API
CPS Profile SY2223
https://data.cityofchicago.org/Education/Chicago-Public-Schools-School-Profile-Information-/9a5f-2r4p
<br><strong>API: </strong>https://data.cityofchicago.org/resource/9a5f-2r4p.json
<br>Last updated 11/28/22. The description says "Data set is no longer being updated when data set for next year is created", which suggests that a data set can be updated throughout its school year, until the next year's data set is created.

CPS Profile SY2324
https://data.cityofchicago.org/Education/Chicago-Public-Schools-School-Profile-Information-/cu4u-b4d9
<br>data last updated 9/13/23
<br><strong>API: </strong>https://data.cityofchicago.org/resource/cu4u-b4d9.json

### CPS Data- 20th day of each school year
Manually downloaded each data set from
https://www.cps.edu/about/district-data/demographics

# 2. Get Data
<br>CPS school profile data is acquired via API from Chicago's data portal.
<br>CPS 20th-day data on low income and bilingual status is first downloaded manually from the CPS website.

In [34]:
import pandas as pd
import requests
# note: pd.read_excel needs an Excel engine installed (openpyxl for .xlsx files in recent pandas versions)

In [35]:
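# CPS profile SY2223: download from the data portal and keep a local CSV copy
url_y1 = "https://data.cityofchicago.org/resource/9a5f-2r4p.json"
file_y1 = "../data/cps_profile2223.csv"
response = requests.get(url_y1)
response.raise_for_status()  # stop on an HTTP error instead of parsing an error payload
cps_profile_y1 = response.json()
df_profile_y1 = pd.DataFrame(cps_profile_y1)
df_profile_y1.to_csv(file_y1)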

In [36]:
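# CPS profile SY2324: download from the data portal and keep a local CSV copy
url_y2 = "https://data.cityofchicago.org/resource/cu4u-b4d9.json"
file_y2 = "../data/cps_profile2324.csv"
response = requests.get(url_y2)
response.raise_for_status()  # stop on an HTTP error instead of parsing an error payload
cps_profile_y2 = response.json()
df_profile_y2 = pd.DataFrame(cps_profile_y2)
df_profile_y2.to_csv(file_y2)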
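
The two download cells above follow the same pattern, which could be wrapped in a small helper. The sketch below is one possible version, not the notebook's actual code: `fetch_portal_json` and its `limit`, `session`, and `timeout` parameters are hypothetical names introduced here. It assumes the portal's Socrata endpoint accepts a `$limit` query parameter and returns at most 1,000 rows when none is given, which is worth confirming against the portal's API documentation.

```python
import pandas as pd
import requests


def fetch_portal_json(url, limit=5000, session=None, timeout=30):
    """Download one data portal resource into a DataFrame (hypothetical helper)."""
    http = session if session is not None else requests
    # "$limit" is an assumption about the Socrata API; adjust or drop it if it does not apply
    response = http.get(url, params={"$limit": limit}, timeout=timeout)
    response.raise_for_status()  # fail on HTTP errors rather than parsing an error payload
    return pd.DataFrame(response.json())


# Example call (requires network access):
# df_profile_y1 = fetch_portal_json("https://data.cityofchicago.org/resource/9a5f-2r4p.json")
```

Passing a `session` makes it possible to reuse one connection for both school years, or to substitute a stub when working offline.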

In [37]:
#For offline work, read locally downloaded files
# df_profile_y1 = pd.read_csv("../data/cps_profile2223.csv")
# df_profile_y2 = pd.read_csv("../data/cps_profile2324.csv")

In [38]:
#CPS demographics report is useful for Community Area data
df_lkup_community = pd.read_excel("../data/demographics_20thday_2023.xlsx")

# Store the first row values to use in column names
new_column_names = df_lkup_community.iloc[0]

# Rename the columns by appending the values from the first row
df_lkup_community.columns = [f"{col}-{new_col}" for col, new_col in zip(df_lkup_community.columns, new_column_names)]

# Remove the first two rows from the DataFrame (the extraneous header, AND the totals row)
df_lkup_community = df_lkup_community[2:]
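
To make the header handling above easier to follow, here is a minimal sketch on a toy frame. The layout is an assumption about the workbook (a top header row read as column names, then a sub-header row and a totals row); the toy values and the names `raw`, `sub_header`, and `tidy` are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the Excel sheet as pd.read_excel might return it
raw = pd.DataFrame(
    [
        ["School ID", "Community Area"],  # sub-header row
        ["Total", "Total"],               # totals row (placeholder values)
        [609772, "EAST SIDE"],
        [610513, "ARMOUR SQUARE"],
    ],
    columns=["School Information", "Unnamed: 5"],
)

# Join each top-level name with the matching sub-header value
sub_header = raw.iloc[0]
raw.columns = [f"{top}-{sub}" for top, sub in zip(raw.columns, sub_header)]

# Drop the sub-header row and the totals row
tidy = raw.iloc[2:].reset_index(drop=True)

print(list(tidy.columns))
# ['School Information-School ID', 'Unnamed: 5-Community Area']
```

The combined names are what the rename step in the next cell expects.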

In [40]:
# rename column headers for simplicity
df_lkup_community = df_lkup_community.rename(columns={
    'School Information-School ID': 'school_id',
    'Unnamed: 5-Community Area': 'community_name'
})

# keep only the columns needed for the lookup
df_lkup_community = df_lkup_community[['school_id', 'community_name']]
df_lkup_community

Unnamed: 0,school_id,community_name
2,609772,EAST SIDE
3,610513,ARMOUR SQUARE
4,610212,ALBANY PARK
5,609774,LINCOLN PARK
6,610524,NORTH CENTER
...,...,...
645,610571,NEAR WEST SIDE
646,610557,ASHBURN
647,610568,AVONDALE
648,400173,BRIGHTON PARK


# 3. Reformat and Clean Data

### create master schools dataset

In [41]:
df_master = df_profile_y2

In [42]:
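# keep descriptive fields only; .copy() makes df_master independent of df_profile_y2
df_master = df_master[[
    'school_id', 'short_name', 'long_name', 'primary_category',
    'is_high_school', 'is_middle_school', 'is_elementary_school', 'is_pre_school',
    'school_latitude', 'school_longitude'
]].copy()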

### clean up school profile data

In [43]:
# keep the same count columns for both school years
profile_cols = ['school_id', 'student_count_total', 'student_count_low_income',
                'student_count_special_ed', 'student_count_english_learners',
                'student_count_hispanic']
df_profile_y1 = df_profile_y1[profile_cols]
df_profile_y2 = df_profile_y2[profile_cols]

In [44]:
df_profile_y1 = df_profile_y1.fillna(0)
df_profile_y2 = df_profile_y2.fillna(0)

In [46]:
# rename column headers for simplicity
df_profile_y1 = df_profile_y1.rename(columns={
    'student_count_total': 'n_y1',
    'student_count_low_income': 'n_low_income_y1',
    'student_count_special_ed': 'n_special_y1',
    'student_count_english_learners': 'n_english_learn_y1',
    'student_count_hispanic': 'n_hispanic_y1'
})

df_profile_y2 = df_profile_y2.rename(columns={
    'student_count_total': 'n_y2',
    'student_count_low_income': 'n_low_income_y2',
    'student_count_special_ed': 'n_special_y2',
    'student_count_english_learners': 'n_english_learn_y2',
    'student_count_hispanic': 'n_hispanic_y2'
})

In [47]:
# convert the count columns and school_id to int; descriptive fields in df_master stay as they are
df_profile_y1 = df_profile_y1.astype(int)
df_profile_y2 = df_profile_y2.astype(int)
df_master = df_master.assign(school_id=df_master["school_id"].astype(int))
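
`astype(int)` raises an error if any value is not a clean integer string. A more forgiving variant, sketched below on toy data, is to coerce with `pd.to_numeric` first. This is an alternative, not what the notebook does, and the frame `counts` and its values are made up; note that, like the `fillna(0)` step above, it turns unreadable values into 0, which can hide data problems.

```python
import pandas as pd

# Toy frame mimicking the API payload, where every value arrives as a string
counts = pd.DataFrame({
    "school_id": ["610176", "400180", "610009"],
    "n_y1": ["391", None, "n/a"],  # a missing count and a non-numeric string
})

# Unparseable values become NaN, then 0, before the integer cast
clean = counts.apply(pd.to_numeric, errors="coerce").fillna(0).astype(int)

print(clean["n_y1"].tolist())
# [391, 0, 0]
```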

# 4. Merge Datasets

In [48]:
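# pd.merge defaults to an inner join, so only schools found in both profile years
# and in the community lookup are kept
df_cps = pd.merge(df_master, df_profile_y1, on="school_id")
df_cps = pd.merge(df_cps, df_profile_y2, on="school_id")
df_cps = pd.merge(df_cps, df_lkup_community, on="school_id")
df_cps.head()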

Unnamed: 0,school_id,short_name,long_name,primary_category,is_high_school,is_middle_school,is_elementary_school,is_pre_school,school_latitude,school_longitude,...,n_low_income_y1,n_special_y1,n_english_learn_y1,n_hispanic_y1,n_y2,n_low_income_y2,n_special_y2,n_english_learn_y2,n_hispanic_y2,community_name
0,610176,SHOOP,John D Shoop Math-Science Technical Academy ES,ES,False,True,True,True,41.690919,-87.658669,...,311,72,1,8,414,317,80,1,8,MORGAN PARK
1,400180,KIPP - ONE,KIPP One Academy,ES,False,True,True,False,41.893805,-87.726615,...,804,99,210,378,1028,853,100,247,402,HUMBOLDT PARK
2,610009,GALILEO,Galileo Math & Science Scholastic Academy ES,ES,False,True,True,False,41.871255,-87.653366,...,251,79,74,305,531,251,90,87,295,NEAR WEST SIDE
3,610145,REINBERG,Peter A Reinberg Elementary School,ES,False,True,True,True,41.943019,-87.768983,...,502,187,329,568,774,495,180,347,586,PORTAGE PARK
4,400119,LEGAL PREP HS,Legal Prep Charter Academy,HS,True,False,False,False,41.881733,-87.733778,...,213,47,1,5,217,183,43,1,3,WEST GARFIELD PARK
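
Because `pd.merge` performs an inner join by default, a school that opened, closed, or changed ID between the two years drops out without any warning. One way to see which rows would be lost is an outer merge with `indicator=True`. The sketch below uses toy frames (`y1`, `y2`, and their values are made up) rather than the notebook's data.

```python
import pandas as pd

# Toy stand-ins for the two yearly profile tables
y1 = pd.DataFrame({"school_id": [1, 2, 3], "n_y1": [100, 200, 300]})
y2 = pd.DataFrame({"school_id": [2, 3, 4], "n_y2": [210, 290, 50]})

# indicator=True adds a "_merge" column: "left_only", "right_only", or "both"
check = pd.merge(y1, y2, on="school_id", how="outer", indicator=True)

# Rows that an inner join would silently drop
unmatched = check[check["_merge"] != "both"]
print(unmatched[["school_id", "_merge"]])
```

Here schools 1 and 4 are flagged; in the notebook the same check could be run on each pair of frames before the inner merges.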


# 5. School Summary Stats

In [49]:
# non-English learners = total enrollment minus English learners
df_cps["n_non_english_learn_y1"] = df_cps["n_y1"] - df_cps["n_english_learn_y1"]
df_cps["n_non_english_learn_y2"] = df_cps["n_y2"] - df_cps["n_english_learn_y2"]

In [50]:
# calculate year over year change
df_cps["d_total"] = df_cps["n_y2"]-df_cps["n_y1"]
df_cps["d_low_income"] = df_cps["n_low_income_y2"]-df_cps["n_low_income_y1"]
df_cps["d_english_learn"] = df_cps["n_english_learn_y2"]-df_cps["n_english_learn_y1"]
df_cps["d_non_english_learn"] = df_cps["n_non_english_learn_y2"]-df_cps["n_non_english_learn_y1"]
df_cps["d_hispanic"] = df_cps["n_hispanic_y2"]-df_cps["n_hispanic_y1"]

In [51]:
# calculate percent changes relative to year 1
# (a zero year-1 count yields NaN for 0/0 or inf for n/0)
df_cps["pct_total"] = 100*df_cps["d_total"]/df_cps["n_y1"]
df_cps["pct_low_income"] = 100*df_cps["d_low_income"]/df_cps["n_low_income_y1"]
df_cps["pct_english_learn"] = 100*df_cps["d_english_learn"]/df_cps["n_english_learn_y1"]
df_cps["pct_non_english_learn"] = 100*df_cps["d_non_english_learn"]/df_cps["n_non_english_learn_y1"]
df_cps["pct_hispanic"] = 100*df_cps["d_hispanic"]/df_cps["n_hispanic_y1"]

In [52]:
df_cps

Unnamed: 0,school_id,short_name,long_name,primary_category,is_high_school,is_middle_school,is_elementary_school,is_pre_school,school_latitude,school_longitude,...,d_total,d_low_income,d_english_learn,d_non_english_learn,d_hispanic,pct_total,pct_low_income,pct_english_learn,pct_non_english_learn,pct_hispanic
0,610176,SHOOP,John D Shoop Math-Science Technical Academy ES,ES,False,True,True,True,41.690919,-87.658669,...,23,6,0,23,0,5.882353,1.929260,0.000000,5.897436,0.000000
1,400180,KIPP - ONE,KIPP One Academy,ES,False,True,True,False,41.893805,-87.726615,...,38,49,37,1,24,3.838384,6.094527,17.619048,0.128205,6.349206
2,610009,GALILEO,Galileo Math & Science Scholastic Academy ES,ES,False,True,True,False,41.871255,-87.653366,...,-4,0,13,-17,-10,-0.747664,0.000000,17.567568,-3.687636,-3.278689
3,610145,REINBERG,Peter A Reinberg Elementary School,ES,False,True,True,True,41.943019,-87.768983,...,29,-7,18,11,18,3.892617,-1.394422,5.471125,2.644231,3.169014
4,400119,LEGAL PREP HS,Legal Prep Charter Academy,HS,True,False,False,False,41.881733,-87.733778,...,-25,-30,0,-25,-2,-10.330579,-14.084507,0.000000,-10.373444,-40.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
640,610520,LASALLE II,LaSalle II Magnet Elementary School,ES,False,True,True,True,41.902914,-87.673696,...,0,11,-1,1,0,0.000000,6.508876,-1.562500,0.193424,0.000000
641,609821,BURNHAM,Burnham Elementary Inclusive Academy,ES,False,True,True,True,41.714402,-87.567131,...,-9,-11,-1,-8,0,-2.127660,-3.179191,-11.111111,-1.932367,0.000000
642,610248,CHICAGO ACADEMY ES,Chicago Academy Elementary School,ES,False,True,True,True,41.943056,-87.778187,...,-8,-71,-7,-1,0,-1.415929,-18.393782,-3.517588,-0.273224,0.000000
643,610499,COLLINS HS,Collins Academy High School,HS,True,False,False,False,41.864149,-87.702042,...,2,4,0,2,-1,0.934579,1.960784,,0.934579,-11.111111
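
The empty `pct_english_learn` value in the Collins HS row is a 0/0 division, which pandas reports as NaN; a school going from 0 to a positive count would instead produce `inf`. If a single, consistent marker is preferred, the baseline can be masked before dividing. The helper name `pct_change_safe` and the toy values below are hypothetical, and this is a sketch rather than the notebook's method.

```python
import pandas as pd


def pct_change_safe(new, old):
    """Percent change from old to new; NaN wherever the baseline is zero (hypothetical helper)."""
    baseline = old.where(old != 0)  # zeros become NaN so the division cannot produce inf
    return 100 * (new - old) / baseline


toy = pd.DataFrame({"n_y1": [391, 0, 0], "n_y2": [414, 5, 0]})
toy["pct_total"] = pct_change_safe(toy["n_y2"], toy["n_y1"])
print(toy)
```

The first row gives about 5.88, and both zero-baseline rows come out as NaN, which downstream code can filter with `dropna()` or `isna()`.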


In [53]:
#write summary
df_cps.to_csv("../results/cps_school_sy2223_sy2324.csv")

# 6. Community Summary Stats

In [54]:
#aggregate by community
df_community = (
    df_cps.groupby('community_name', as_index=False)
    .agg({'n_y1': 'sum',
          'n_y2': 'sum',
          'n_english_learn_y1': 'sum',
          'n_english_learn_y2': 'sum',
          'n_non_english_learn_y1': 'sum',
          'n_non_english_learn_y2': 'sum',
          'n_low_income_y1': 'sum',
          'n_low_income_y2': 'sum',
          'n_hispanic_y1': 'sum',
          'n_hispanic_y2': 'sum'
         })
)

In [55]:
#add count of number of schools
df_count = df_cps.groupby('community_name').size().reset_index(name='n_schools')
df_community = pd.merge(df_community, df_count, on='community_name')
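
The sums and the school count can also be produced in one pass with pandas named aggregation, which avoids the second `groupby` and the merge. The sketch below shows the idea on toy data (the frame `toy` and its values are made up) and covers only two of the count columns.

```python
import pandas as pd

toy = pd.DataFrame({
    "community_name": ["ASHBURN", "ASHBURN", "AVONDALE"],
    "school_id": [1, 2, 3],
    "n_y1": [100, 200, 300],
    "n_y2": [110, 190, 310],
})

# Each keyword is an output column: (input column, aggregation function)
by_community = toy.groupby("community_name", as_index=False).agg(
    n_y1=("n_y1", "sum"),
    n_y2=("n_y2", "sum"),
    n_schools=("school_id", "count"),
)
print(by_community)
```

`count` tallies non-null `school_id` values per community, which matches `size()` as long as no school ID is missing.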

In [56]:
# calculate year over year change
df_community["d_total"] = df_community["n_y2"]-df_community["n_y1"]
df_community["d_low_income"] = df_community["n_low_income_y2"]-df_community["n_low_income_y1"]
df_community["d_english_learn"] = df_community["n_english_learn_y2"]-df_community["n_english_learn_y1"]
df_community["d_non_english_learn"] = df_community["n_non_english_learn_y2"]-df_community["n_non_english_learn_y1"]
df_community["d_hispanic"] = df_community["n_hispanic_y2"]-df_community["n_hispanic_y1"]

In [57]:
#calculate percent changes
df_community["pct_total"] = 100*df_community["d_total"]/df_community["n_y1"]
df_community["pct_low_income"] = 100*df_community["d_low_income"]/df_community["n_low_income_y1"]
df_community["pct_english_learn"] = 100*df_community["d_english_learn"]/df_community["n_english_learn_y1"]
df_community["pct_non_english_learn"] = 100*df_community["d_non_english_learn"]/df_community["n_non_english_learn_y1"]
df_community["pct_hispanic"] = 100*df_community["d_hispanic"]/df_community["n_hispanic_y1"]

In [58]:
#write summary
df_community.to_csv("../results/cps_community_sy2223_sy2324.csv")

# 7. Overall Summary Stats

In [59]:
df_total = pd.DataFrame(
{
    'n_y1': [df_cps['n_y1'].sum()],
    'n_y2': [df_cps['n_y2'].sum()],
    'n_english_learn_y1': [df_cps['n_english_learn_y1'].sum()],
    'n_english_learn_y2': [df_cps['n_english_learn_y2'].sum()],
    'n_non_english_learn_y1': [df_cps['n_non_english_learn_y1'].sum()],
    'n_non_english_learn_y2': [df_cps['n_non_english_learn_y2'].sum()],
    'n_low_income_y1': [df_cps['n_low_income_y1'].sum()],
    'n_low_income_y2': [df_cps['n_low_income_y2'].sum()],
    'n_hispanic_y1': [df_cps['n_hispanic_y1'].sum()],
    'n_hispanic_y2': [df_cps['n_hispanic_y2'].sum()]
}
)
df_total

Unnamed: 0,n_y1,n_y2,n_english_learn_y1,n_english_learn_y2,n_non_english_learn_y1,n_non_english_learn_y2,n_low_income_y1,n_low_income_y2,n_hispanic_y1,n_hispanic_y2
0,321432,320612,71744,75367,249688,245245,220511,215184,149560,150412
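
The school and community tables include percent changes, but the totals table does not. As a worked example, the same formula can be applied to the totals shown above. The dictionary `pairs` and the variable `overall` are names introduced here for illustration; the numbers are copied from the `df_total` output.

```python
import pandas as pd

# Totals copied from the df_total output above
df_total = pd.DataFrame({
    "n_y1": [321432], "n_y2": [320612],
    "n_english_learn_y1": [71744], "n_english_learn_y2": [75367],
    "n_non_english_learn_y1": [249688], "n_non_english_learn_y2": [245245],
    "n_low_income_y1": [220511], "n_low_income_y2": [215184],
    "n_hispanic_y1": [149560], "n_hispanic_y2": [150412],
})

# Map each group to its (year 1, year 2) columns
pairs = {
    "total": ("n_y1", "n_y2"),
    "english_learn": ("n_english_learn_y1", "n_english_learn_y2"),
    "non_english_learn": ("n_non_english_learn_y1", "n_non_english_learn_y2"),
    "low_income": ("n_low_income_y1", "n_low_income_y2"),
    "hispanic": ("n_hispanic_y1", "n_hispanic_y2"),
}

row = df_total.iloc[0]
overall = {name: 100 * (row[y2] - row[y1]) / row[y1] for name, (y1, y2) in pairs.items()}

for name, pct in overall.items():
    print(f"{name}: {pct:+.2f}%")
```

With these totals, overall enrollment changes by roughly -0.26% and English learners by roughly +5.05%. The totals are sums over `df_cps`, so they cover only the schools that remain after the merges in section 4.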


In [60]:
#write summary
df_total.to_csv("../results/cps_total_sy2223_sy2324.csv")