# Texas school accountability data
This notebook has the scripts needed to cut, filter and analyze school accountability data from the Texas Education Agency.

### Download the data
Accountability data for 2013-2016 are in the `data` folder inside this repo. Here's how you could get them yourself, using the 2015 data file as an example.

First: Go to the [accountability data portal](https://rptsvr1.tea.texas.gov/perfreport/account/2015/) and click the "Download data" link on the left rail.

<img src="img/1-portal-page.gif" style="border: 1px solid #ccc; margin: 20px auto 40px auto;" />

On the resulting page, click the "Campus-level Data" radio button, then scroll down and click "Continue."

<img src="img/2-data-page.gif" style="border: 1px solid #ccc; margin: 20px auto 40px auto;" />

Finally, on the data download page, select "Tab delimited" from the select menu. Click the "Select all" button. Then click the "Download" button.

<img src="img/3-download-page.gif" style="border: 1px solid #ccc; margin: 20px auto 40px auto;" />

I renamed this file `2015-tx-school-acc-data.dat` and dropped it into the `/data` folder, then repeated this process for 2014 and 2013.

Also, I snagged the file layouts ([e.g.](https://rptsvr1.tea.texas.gov/perfreport/account/2015/download/camprate.html)) and saved them as .tsv files in the `/data` directory. In practice, however, they didn't always match up with the data, so I used them as a rough guide and consulted a sample of published summary reports [like this one](https://rptsvr1.tea.texas.gov/perfreport/account/2013/static/summary/campus/c227901170.pdf) to check expected values against actual values.

Also also, I grabbed a .csv file with [spatial and contact data for every school in Texas](http://schoolsdata.tea-texas.opendata.arcgis.com/datasets/059432fd0dcb4a208974c235e837c94f_0), renamed the columns I'm going to use later (`campus_id`, `city`, `lat`, `lng`, `district_id`) and saved it as `/data/school_location_data.csv`. (TODO: grab the [districts shapefile](http://schoolsdata.tea-texas.opendata.arcgis.com/datasets/e115fed14c0f4ca5b942dc3323626b1c_0), too.)

### Cut and stack
Now I can use `awk` and `csvkit` to extract the columns I need from each file and append them to `data/stacked_data.csv`. (The file layouts are different each year.) Then I join a few columns of location data and sort by campus ID.

In [145]:
%%bash
# truncate existing file
:> data/stacked_data.csv

# write headers
echo "campus_id,campus_name,campus_population,campus_pct_disadvantaged,campus_pct_english_language_learners,district_name,index1_target_score,index1_score,index2_target_score,index2_score,index3_target_score,index3_score,index4_target_score,index4_score,distinction_reading,distinction_math,distinction_student_progress,distinction_science,distinction_social_studies,distinction_close_performance_gap,distinction_postsecondary_readiness,jjaep,daep,year,overall_rating,updated_rating" >> data/stacked_data.csv

# 2013 data
awk -F '\t' '{OFS=","; if (NR!=1) {print $1,$6,$44,$46,$48,$51,$20,$19,$25,$24,$30,$29,$35,$34,$5,$3,$4,".",".",".",".",$12,$11,"2013",$49,$50;}}' data/2013-tx-school-acc-data.dat >> data/stacked_data.csv

# 2014 data
awk -F '\t' '{OFS=","; if (NR!=1) {print $1,$9,$49,$51,$53,$56,$23,$22,$28,$27,$33,$32,$39,$37,$6,$3,$5,$7,$8,$2,$4,$15,$13,"2014",$54,$55;}}' data/2014-tx-school-acc-data.dat >> data/stacked_data.csv

# 2015 data
awk -F '\t' '{OFS=","; if (NR!=1) {print $1,$9,$49,$51,$53,$56,$23,$22,$28,$27,$33,$32,$38,$37,$6,$3,$5,$7,$8,$2,$4,$15,$13,"2015",$54,$55;}}' data/2015-tx-school-acc-data.dat >> data/stacked_data.csv

# join to location data and sort by campus ID
csvcut -c 9,7,2,1,15 data/school_location_data.csv | csvjoin -c "campus_id,campus_id" data/stacked_data.csv - | csvcut -c 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,28,29,30,31 | csvsort -c 1 > data/stacked_data_with_coordinates.csv

# check for ish
csvclean -n data/stacked_data_with_coordinates.csv

# report line count
wc -l data/stacked_data_with_coordinates.csv

No errors.
   24241 data/stacked_data_with_coordinates.csv
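
For anyone who'd rather not parse that `csvjoin` pipeline: by default `csvjoin` does an inner join, so a campus with no match in the location file drops out of the result. Here's a rough sketch of the same idea in plain Python. The `inner_join` name and the sample rows are made up for illustration, and it assumes one location row per campus.

```python
def inner_join(left_rows, right_rows, key):
    """Join two lists of dicts on `key`, keeping only rows found in both."""
    # Assumption: `key` is unique in right_rows (one location row per campus).
    lookup = {row[key]: row for row in right_rows}
    joined = []
    for row in left_rows:
        match = lookup.get(row[key])
        if match is not None:
            merged = dict(row)
            merged.update({k: v for k, v in match.items() if k != key})
            joined.append(merged)
    # Sort by the join key, like the csvsort step.
    return sorted(joined, key=lambda r: r[key])

# Made-up sample data: campus "3" has no location row, so it is dropped.
stacked = [
    {"campus_id": "2", "year": "2015"},
    {"campus_id": "1", "year": "2015"},
    {"campus_id": "3", "year": "2015"},
]
locations = [
    {"campus_id": "1", "city": "A"},
    {"campus_id": "2", "city": "B"},
]
joined = inner_join(stacked, locations, "campus_id")
print(joined)
```

If the dropped campuses matter, comparing the line counts before and after the join is a quick way to see how many went missing.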


_Psst, future me_: I created a Python dict with a 1-indexed column layout for each year of data at `/col_index.py`.
You're welcome.
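
Here's a sketch of how a 1-indexed layout dict could drive the column cutting instead of hard-coding positions in `awk`. The dict shape and the names `COL_INDEX` and `cut_columns` are assumptions (the real `col_index.py` may look different). The positions shown are the ones the 2013 and 2015 `awk` commands above use for the first three output columns.

```python
import csv
import io

# Hypothetical layout: {year: {column_name: 1-indexed position}}.
COL_INDEX = {
    "2013": {"campus_id": 1, "campus_name": 6, "campus_population": 44},
    "2015": {"campus_id": 1, "campus_name": 9, "campus_population": 49},
}

def cut_columns(tsv_file, year, names):
    """Yield the requested columns from each data row of a tab-delimited file."""
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader)  # skip the header row, like NR!=1 in awk
    positions = [COL_INDEX[year][name] - 1 for name in names]  # to 0-indexed
    for row in reader:
        yield [row[i] for i in positions]

# Tiny in-memory demo with made-up values: 49 columns, c1..c49 / v1..v49.
header = "\t".join("c{}".format(n) for n in range(1, 50))
record = "\t".join("v{}".format(n) for n in range(1, 50))
sample = io.StringIO(header + "\n" + record + "\n")

rows = list(cut_columns(sample, "2015", ["campus_id", "campus_name", "campus_population"]))
print(rows)
```

Looking columns up by name makes a shifted layout easier to spot than a long run of `$n` fields.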

### Load up the data to analyze
Time to analyze some data. Should I use `R`, or `numpy`, or maybe `pandas`?

<img src="img/achewood.png" style="margin: 0;" />

Ha ha OK guys, settle down, I'll use `agate`. First, create a table.

In [146]:
import agate

# Define the column types
column_types = {
    'campus_id': agate.Text(),
    'campus_name': agate.Text(),
    'campus_population': agate.Number(),
    'campus_pct_disadvantaged': agate.Number(),
    'campus_pct_english_language_learners': agate.Number(),
    'district_name': agate.Text(),
    'index1_target_score': agate.Number(),
    'index1_score': agate.Number(),
    'index2_target_score': agate.Number(),
    'index2_score': agate.Number(),
    'index3_target_score': agate.Number(),
    'index3_score': agate.Number(),
    'index4_target_score': agate.Number(),
    'index4_score': agate.Number(),
    'distinction_reading': agate.Boolean(),
    'distinction_math': agate.Boolean(),
    'distinction_student_progress': agate.Boolean(),
    'distinction_science': agate.Boolean(),
    'distinction_social_studies': agate.Boolean(),
    'distinction_close_performance_gap': agate.Boolean(),
    'distinction_postsecondary_readiness': agate.Boolean(),
    'jjaep': agate.Boolean(),
    'daep': agate.Boolean(),
    'year': agate.Text(),
    'overall_rating': agate.Text(),
    'updated_rating': agate.Number(),
    'district_id': agate.Text(),
    'lng': agate.Number(),
    'lat': agate.Number(),
    'city': agate.Text()
}

school_ratings = agate.Table.from_csv('data/stacked_data_with_coordinates.csv', column_types=column_types)

print(school_ratings)

|---------------------------------------+------------|
|  column                               | data_type  |
|---------------------------------------+------------|
|  campus_id                            | Text       |
|  campus_name                          | Text       |
|  campus_population                    | Number     |
|  campus_pct_disadvantaged             | Number     |
|  campus_pct_english_language_learners | Number     |
|  district_name                        | Text       |
|  index1_target_score                  | Number     |
|  index1_score                         | Number     |
|  index2_target_score                  | Number     |
|  index2_score                         | Number     |
|  index3_target_score                  | Number     |
|  index3_score                         | Number     |
|  index4_target_score                  | Number     |
|  index4_score                         | Number     |
|  distinction_reading                  | Boolean    |
|  distinc

### Process the data
I need to:
* Exclude disciplinary alternative schools ("daep") and kid jails ("jjaep").
* Run the campus names through some text transforms to standardize names.

In [147]:
import re

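TEXT_TRANSFORMS = (
    # Most specific pattern first, so "JR H S" isn't caught by "H S".
    # \b keeps a pattern from matching the tail of a longer word (e.g. "CHAPEL").
    (r"\bJR H S$", "Junior High School"),
    (r"\bH S$", "High School"),
    (r"\bMIDDLE$", "Middle School"),
    (r"\bINT$", "Intermediate"),
    (r"\bEL$", "Elementary"),
)

def clean_name(garb):
    """Expand trailing abbreviations and title-case a campus, city or district name."""
    if garb:
        for item in TEXT_TRANSFORMS:
            garb = re.sub(*item, garb, flags=re.IGNORECASE)
        return garb.title().replace("Isd", "ISD")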

school_ratings_transformed = school_ratings.where(
    lambda row: row['jjaep'] is False and row['daep'] is False
).compute([
    ('campus_name', agate.Formula(agate.Text(), lambda row: clean_name(row['campus_name']))),
    ('city', agate.Formula(agate.Text(), lambda row: clean_name(row['city']))),
    ('district_name', agate.Formula(agate.Text(), lambda row: clean_name(row['district_name'])))
], replace=True)

disciplinary_schools_count = len(school_ratings.rows) - len(school_ratings_transformed.rows)

print(
    "Chopped",
    "{:,}".format(disciplinary_schools_count),
    "disciplinary schools ..."
)

Chopped 1,553 disciplinary schools ...
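
To see why the order and anchoring of the text transforms matter, here's a stand-alone sketch using only `re`. The sample names are made up, and the `standardize` name and the word-boundary (`\b`) patterns are my assumptions for illustration.

```python
import re

# More specific patterns come first so "JR H S" is not half-eaten by "H S".
# \b keeps "EL$" from matching the tail of a word like "CHAPEL".
TRANSFORMS = (
    (r"\bJR H S$", "Junior High School"),
    (r"\bH S$", "High School"),
    (r"\bMIDDLE$", "Middle School"),
    (r"\bINT$", "Intermediate"),
    (r"\bEL$", "Elementary"),
)

def standardize(name):
    """Expand trailing abbreviations, then title-case the result."""
    if not name:
        return name
    for pattern, replacement in TRANSFORMS:
        name = re.sub(pattern, replacement, name, flags=re.IGNORECASE)
    return name.title().replace("Isd", "ISD")

# Made-up sample names.
for raw in ("EXAMPLE JR H S", "EXAMPLE H S", "CHAPEL", "EXAMPLE EL", "EXAMPLE ISD"):
    print(raw, "->", standardize(raw))
```

With `H S$` listed before `JR H S$`, a junior high would come out as "Jr High School", because the shorter pattern rewrites the ending before the longer one gets a chance to match.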


### Have any schools that reported scores for all 4 standards missed every one?

In [148]:
def sad_trombone(row):
    if row['index1_score'] and \
            row['index1_target_score'] and \
            row['index2_score'] and \
            row['index2_target_score'] and \
            row['index3_score'] and \
            row['index3_target_score'] and \
            row['index4_score'] and \
            row['index4_target_score']:
        return row['index1_score'] < row['index1_target_score'] and \
        row['index2_score'] < row['index2_target_score'] and \
        row['index3_score'] < row['index3_target_score'] and \
        row['index4_score'] < row['index4_target_score']

missed_erry_one = school_ratings_transformed.where(sad_trombone).order_by('campus_id')

print(
    len(missed_erry_one.rows),
    "schools missed every one:\n"
)

for row in missed_erry_one.rows:
    print(
        row['campus_name'],
        "\n" + row['district_name'],
        "\n" + row['city'],
        "\n" + row['year'],
        "\n" + "https://rptsvr1.tea.texas.gov/perfreport/account/{}/static/summary/campus/c{}.pdf".format(row['year'], row['campus_id']),
        "\n"
    )

70 schools missed every one:

O A Fleming Elementary 
Brazosport Isd 
Freeport 
2014 
https://rptsvr1.tea.texas.gov/perfreport/account/2014/static/summary/campus/c020905104.pdf 

O A Fleming Elementary 
Brazosport Isd 
Freeport 
2015 
https://rptsvr1.tea.texas.gov/perfreport/account/2015/static/summary/campus/c020905104.pdf 

Jane Long Elementary 
Brazosport Isd 
Freeport 
2014 
https://rptsvr1.tea.texas.gov/perfreport/account/2014/static/summary/campus/c020905106.pdf 

O'Hara Lanier Middle School 
Brazosport Isd 
Freeport 
2015 
https://rptsvr1.tea.texas.gov/perfreport/account/2015/static/summary/campus/c020905116.pdf 

T W Browne Middle School 
Dallas Isd 
Dallas 
2014 
https://rptsvr1.tea.texas.gov/perfreport/account/2014/static/summary/campus/c057905043.pdf 

Thomas A Edison Middle Learning Ce 
Dallas Isd 
Dallas 
2015 
https://rptsvr1.tea.texas.gov/perfreport/account/2015/static/summary/campus/c057905074.pdf 

Rufus C Burleson Elementary 
Dallas Isd 
Dallas 
2014 
https://rptsvr1.
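
One caveat with a truthiness check like the one in `sad_trombone`: a reported score of 0 counts as missing. A possible alternative, sketched here on plain dicts rather than agate rows, tests for `None` explicitly. The `missed_every_index` name and the sample values are made up for illustration, and this version may give a different count than the cell above if any school reported a 0.

```python
INDEXES = (1, 2, 3, 4)

def missed_every_index(row):
    """True only if all four indexes have a score and a target, and every score is below its target."""
    pairs = [
        (row.get("index{}_score".format(i)), row.get("index{}_target_score".format(i)))
        for i in INDEXES
    ]
    if any(score is None or target is None for score, target in pairs):
        return False  # did not report all four standards
    return all(score < target for score, target in pairs)

# Made-up sample rows.
missed_all = {
    "index1_score": 40, "index1_target_score": 55,
    "index2_score": 0, "index2_target_score": 16,
    "index3_score": 20, "index3_target_score": 28,
    "index4_score": 10, "index4_target_score": 12,
}
met_one = dict(missed_all, index2_score=30)
incomplete = dict(missed_all, index4_score=None)

print(missed_every_index(missed_all))
print(missed_every_index(met_one))
print(missed_every_index(incomplete))
```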

### Dump to a grouped list of dictionaries

In [149]:
# school_ratings_grouped = school_ratings_transformed.group_by('campus_id')
# print(school_ratings_grouped)
# outlist = []
# for row in school_ratings_transformed:
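
The cell above is still a stub. One possible shape for the output, sketched with plain dicts standing in for agate rows, is one dictionary per campus holding a list of its yearly records. The output structure, the `group_by_campus` name and the sample rows are all assumptions.

```python
import json
from collections import defaultdict

def group_by_campus(rows):
    """Return a list of {"campus_id": ..., "years": [...]} dicts, sorted by campus ID."""
    grouped = defaultdict(list)
    for row in rows:
        # Keep everything except the grouping key in the per-year record.
        record = {k: v for k, v in row.items() if k != "campus_id"}
        grouped[row["campus_id"]].append(record)
    return [
        {"campus_id": campus_id, "years": sorted(records, key=lambda r: r["year"])}
        for campus_id, records in sorted(grouped.items())
    ]

# Made-up sample rows.
sample_rows = [
    {"campus_id": "000000002", "year": "2015", "index1_score": 70},
    {"campus_id": "000000001", "year": "2015", "index1_score": 61},
    {"campus_id": "000000001", "year": "2014", "index1_score": 58},
]
outlist = group_by_campus(sample_rows)
print(json.dumps(outlist, indent=2))
```

With real agate rows, `Number` columns hold `Decimal` values, which `json.dumps` won't serialize by default, so they'd need converting first.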
    


### Peep data on local schools

In [150]:
# csvcut -d "," -c 6 data/stacked_data_with_coordinates.csv | sort | uniq > districts.txt

local_districts = ["AUSTIN ACHIEVE PUBLIC SCHOOLS", "AUSTIN DISCOVERY SCHOOL", "AUSTIN ISD", "ROUND ROCK ISD", "LEANDER ISD", "PFLUGERVILLE ISD", "HAYS CISD", "GEORGETOWN ISD", "BASTROP ISD", "MANOR ISD", "LAKE TRAVIS ISD", "EANES ISD", "SAN MARCOS CISD", "HUTTO ISD", "DRIPPING SPRINGS ISD", "DEL VALLE ISD"]

local_school_data = school_ratings_transformed.where(
    lambda row: row['district_name'].upper() in local_districts
)

print(
    "Pulled data for",
    "{:,}".format(len(local_school_data.rows)),
    "schools in",
    "{:,}".format(len(local_districts)),
    "local districts ..."
)

local_school_data.to_json('local.json')

Pulled data for 1,053 schools in 16 local districts ...
