This data is from the state of Oregon and relates to school performance between 1997 and 2006.
https://www.ode.state.or.us/sfda/reports/r0045Select.asp

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

In [2]:
school_perf_report = pd.read_csv("oregon_schoool_performance_report.csv", encoding='iso-8859-1')

In [3]:
school_perf_report.shape

(28240, 165)

There are about 28k rows, but most of the cells are empty. For this reason, we keep only the columns with fewer than 15,000 missing values (15,000 is a little more than half of the rows). Note that the performance results are very sparse and get dropped by this step. Most of the columns that remain hold demographic information and information about school funding.

This is yearly data, but SCHLYR gives a full date, so a new column called 'year' is created for easy plotting. INSTID values are unique numbers assigned to schools in this report, but they are not standard. "DISTINSTID" is a standard number assigned to schools and will be used for correlating with other data. "GRDRNG" is wrong or incomplete: some schools have grade information like "K-05", while others have values like "8-Jul". "last column" contains only "-", so it is also dropped.
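
As a small illustration of the sparsity filter described above, here is a sketch on a toy frame. The column `READING_SCORE` and the threshold `max_missing = 3` are made up for the example; the notebook itself uses 15000 on the real report. The `dropna` call is an alternative way to express the same rule, not what the notebook runs.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the report: 6 rows, two mostly complete columns and one sparse one.
toy = pd.DataFrame({
    "SCHLNM": ["a", "b", "c", "d", "e", "f"],
    "FREEREDPCT": [10.0, np.nan, 30.0, 40.0, 50.0, 60.0],
    "READING_SCORE": [np.nan, np.nan, np.nan, np.nan, 3.0, np.nan],  # hypothetical sparse column
})

max_missing = 3  # hypothetical threshold for this toy example

# Count missing values per column and keep columns below the threshold.
null_counts = toy.isnull().sum()
keep = toy.columns[null_counts < max_missing]
filtered = toy[keep]

# Same rule stated in terms of non-null values:
# "fewer than max_missing nulls" == "at least rows - max_missing + 1 non-nulls".
same = toy.dropna(axis=1, thresh=len(toy) - max_missing + 1)

print(list(filtered.columns))
print(list(same.columns))
```

Both selections keep `SCHLNM` and `FREEREDPCT` and drop the sparse column.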

In [4]:
keep_columns = school_perf_report.columns[school_perf_report.isnull().sum() < 15000]

In [5]:
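# Take a copy so that adding 'year' below does not assign into a slice of the original frame.
school_perf_report = school_perf_report[keep_columns].copy()
# errors='ignore' is deprecated in recent pandas; with the default, an unparseable date raises here.
school_perf_report['year'] = pd.to_datetime(school_perf_report['SCHLYR']).dt.year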
school_perf_report['year'].head()

0    1999
1    2000
2    2001
3    2002
4    2003
Name: year, dtype: int64
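
If some SCHLYR values could not be parsed, one way to find them is to coerce failures to NaT and inspect those rows. This is only a sketch: the sample strings and the `%m/%d/%Y` format are assumptions, since the notebook does not show the raw SCHLYR values.

```python
import pandas as pd

# Hypothetical SCHLYR values; the real column's date format is not shown in the notebook.
schlyr = pd.Series(["6/30/1999", "6/30/2000", "not a date"])

# errors="coerce" turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(schlyr, format="%m/%d/%Y", errors="coerce")

bad_rows = schlyr[parsed.isna()]        # raw values that failed to parse
year = parsed.dt.year.astype("Int64")   # nullable integer dtype keeps missing years as <NA>

print(bad_rows.tolist())
print(year.tolist())
```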

In [6]:
school_perf_report = school_perf_report.drop(["INSTID","SCHLYR","GRDRNG","last column"], axis=1)
school_perf_report.columns

Index(['DISTINSTID', 'SCHLNM', 'DISTNM', 'STUDENRCNT', 'DISTSTUDENRCNT',
       'STSTUDENRCNT', 'FREEREDSTUDCNT', 'FREEREDPCT', 'DISTFREEREDPCT',
       'STFREEREDPCT', 'MnrtyStudCnt', 'MnrtyStudPct', 'DistMnrtyStudPct',
       'StMnrtyStudPct', 'GenFundDIRCLSRMAMT', 'GenFundDISTDIRCLSRmAmt',
       'GenFundSTDIRCLSRMAmt', 'GenFundCLSRMSUPPAMt',
       'GenFundDISTCLSRMSUppAmt', 'GenFundSTCLSRMSUPPAmt',
       'GenFundBLDGSUPPAMT', 'GenFundDISTBLDGSUPpAmt', 'GenFundSTBLDGSUPPAmt',
       'GenFundCNTLSUPPAMT', 'GenFundDISTCNTLSUPpAmt', 'GenFundSTCNTLSUPPAmt',
       'GenFundTtlAmt', 'GenFundDistTtlAmt', 'GenFundStTtlAmt',
       'TtlDirClsRmAmt', 'TtlClsRmSuppAmt', 'TtlBldgSuppAmt', 'TtlCntlSuppAmt',
       'TtlSpendAmt', 'TtlDistDirClsRmAmt', 'TtlDistClsRmSuppAmt',
       'TtlDistBldgSuppAmt', 'TtlDistCntlSuppAmt', 'TtlDistSpendAmt',
       'TtlStDirClsRmAmt', 'TtlStClsRmSuppAmt', 'TtlStBldgSuppAmt',
       'TtlStCntlSuppAmt', 'TtlStSpendAmt', 'year'],
      dtype='object')

It is not clear from the webpage what the column descriptions are.
https://www.ode.state.or.us/sfda/rptDef.aspx

Another problem is that the same school appears under several versions of its name, so the school names had to be cleaned up.
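
To make the idea concrete, here is a minimal sketch in plain Python that applies the same kind of replacements to a few raw names taken from this report. The helper `normalize_school_name` is illustrative and not part of the notebook; it also removes "the" only as a whole word, which is an assumption made here so that names such as "luther" stay intact.

```python
import re

def normalize_school_name(name: str) -> str:
    """Illustrative helper: collapse spelling variants of a school name."""
    s = name.lower()
    s = s.replace("elementary", "elem")   # unify "Elementary" and "Elem"
    s = s.replace("school", "")           # drop the generic suffix
    s = s.replace(".", "")                # "jr." -> "jr"
    s = re.sub(r"[/-]", " ", s)           # "/" and "-" act as word separators
    s = re.sub(r"\bthe\b", "", s)         # whole word only
    return re.sub(r"\s+", " ", s).strip() # collapse and trim whitespace

raw_names = [
    "Abernethy Elem School",
    "Abernethy Elementary School",
    "BOISE/ELIOT ELEM SCHOOL",
    "Boise/Eliot Elementary School",
    "Boise-Eliot Elementary School",
]

unique_names = sorted({normalize_school_name(n) for n in raw_names})
print(unique_names)  # ['abernethy elem', 'boise eliot elem']
```

Five raw spellings collapse to two names.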


In [7]:
school_perf_report[school_perf_report["DISTINSTID"]==2180]["SCHLNM"].unique()

array(['Abernethy Elem School', 'Abernethy Elementary School',
       'AINSWORTH ELEM SCHOOL', 'Ainsworth Elementary School',
       'ALAMEDA ELEM SCHOOL', 'Alameda Elementary School',
       'APPLEGATE ELEM SCHOOL', 'Applegate Elementary School',
       'ARLETA ELEM SCHOOL', 'Arleta Elementary School',
       'ASTOR ELEM SCHOOL', 'Astor Elementary School',
       'ATKINSON ELEM SCHOOL', 'Atkinson Elementary School',
       'BALL ELEM SCHOOL', 'Ball Elementary School',
       'Rosa Parks Elementary School', 'BEACH ELEM SCHOOL',
       'Beach Elementary School', 'BEAUMONT MIDDLE SCHOOL',
       'Beaumont Middle School', 'BINNSMEAD MIDDLE SCHOOL',
       'Binnsmead Middle School', 'BOISE/ELIOT ELEM SCHOOL',
       'Boise/Eliot Elementary School', 'Boise-Eliot Elementary School',
       'BRIDGER ELEM SCHOOL', 'Bridger Elementary School',
       'BRIDLEMILE ELEM SCHOOL', 'Bridlemile Elementary School',
       'BROOKLYN ELEM SCHOOL', 'Brooklyn Elementary School',
       'BUCKMAN ELEM SCHOOL

In [8]:

# Normalize school names so that spelling variants of the same school collapse to one value.
school_perf_report["SCHLNM"] = (
    school_perf_report["SCHLNM"]
    .str.lower()
    .str.replace("elementary", "elem", regex=False)
    .str.replace("/", " ", regex=False)
    .str.replace("school", "", regex=False)
    .str.replace("-", " ", regex=False)
    .str.replace(r"\bthe\b", "", regex=True)  # whole word only; a bare "the" also hits "luther"
    .str.replace(".", "", regex=False)        # literal dot, not the regex wildcard
    .str.replace(r"\s+", " ", regex=True)     # collapse runs of whitespace
    .str.strip()
)

In [9]:
school_perf_report[school_perf_report["DISTINSTID"]==2180]["SCHLNM"].unique()

array(['abernethy elem', 'ainsworth elem', 'alameda elem',
       'applegate elem', 'arleta elem', 'astor elem', 'atkinson elem',
       'ball elem', 'rosa parks elem', 'beach elem', 'beaumont middle',
       'binnsmead middle', 'boise eliot elem', 'bridger elem',
       'bridlemile elem', 'brooklyn elem', 'buckman elem',
       'capitol hill elem', 'chapman elem', 'chief joseph elem',
       'clarendon elem', 'clarendon portsmouth', 'césar chávez k 8',
       'cesar chavez k 8', 'clark elem', 'harrison park', 'creston elem',
       'duniway elem', 'edwards elem', 'faubion elem', 'fernwood middle',
       'george middle', 'glencoe elem', 'gray middle',
       'gregory heights middle', 'grout elem', 'hayhurst elem',
       'hollyrood elem', 'hollyrood fernwood', 'beverly cleary',
       'hosford middle', 'humboldt elem', 'irvington elem',
       'james john elem', 'kellogg middle', 'kelly elem', 'kenton elem',
       'king elem', 'martin luther king jr', 'laurelhurst elem', 'lee elem',
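
The cleaned list still contains 'césar chávez k 8' and 'cesar chavez k 8' as separate names, because they differ only in accents. One possible further step, sketched here with the standard library and not applied in this notebook, is to strip combining accent marks. The helper name `strip_accents` is made up for the example.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Illustrative helper: remove accent marks, e.g. 'é' -> 'e'."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

names = ["césar chávez k 8", "cesar chavez k 8"]
unique_names = sorted({strip_accents(n) for n in names})
print(unique_names)  # ['cesar chavez k 8']
```

On a DataFrame column this could be applied with something like `school_perf_report["SCHLNM"].map(strip_accents)`, assuming the column has no missing values.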
   

In [10]:
school_perf_report.to_csv("oregon_schoool_performance_report_filtered.csv", index=False)
school_perf_report.shape

(28240, 45)
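
Since DISTINSTID is meant for correlating this report with other data, a later join might look roughly like the sketch below. Everything in it is hypothetical: the second dataset, its `SOME_METRIC` column, the ID 1924, and all the values are invented for illustration.

```python
import pandas as pd

# Made-up rows shaped like the filtered report.
report = pd.DataFrame({
    "DISTINSTID": [2180, 2180, 1924],
    "year": [1999, 2000, 1999],
    "FREEREDPCT": [40.1, 41.3, 22.0],
})

# Hypothetical second dataset keyed by the same identifier and year.
other = pd.DataFrame({
    "DISTINSTID": [2180, 2180],
    "year": [1999, 2000],
    "SOME_METRIC": [0.5, 0.6],
})

# Left join keeps every report row; indicator=True adds a "_merge" column showing
# which rows found a match; validate guards against duplicated keys in `other`.
merged = report.merge(
    other, on=["DISTINSTID", "year"], how="left",
    indicator=True, validate="many_to_one",
)
unmatched = merged[merged["_merge"] == "left_only"]

print(len(merged), len(unmatched))
```

Checking the unmatched rows first is a cheap way to spot identifier or year mismatches before doing any analysis on the joined data.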