# Texas school accountability data
This notebook has the scripts needed to process Texas Education Association school accountability data, 2013-2016, for a Statesman interactive. Cody Winchester wrote for 2016 and Christian updated to prepare for the 2017 release.

### Download the data
Accountability data for 2013-2016 are in the `data` folder inside this repo. Excluding the 2016 file -- it's in a different format, in a different place -- here's how you could get the older data yourself, using the 2015 data file as an example.

First: Go to the [accountability data portal](https://rptsvr1.tea.texas.gov/perfreport/account/2015/) and click the "Download data" link on the left rail.

<img src="img/1-portal-page.gif" style="border: 1px solid #ccc; margin: 20px auto 40px auto;" />

On the resulting page, click the "Campus-level Data" radio button, then scroll down and click "Continue."

<img src="img/2-data-page.gif" style="border: 1px solid #ccc; margin: 20px auto 40px auto;" />

Finally, on the data download page, select "Tab delimited" from the select menu. Click the "Select all" button. Then click the "Download" button.

<img src="img/3-download-page.gif" style="border: 1px solid #ccc; margin: 20px auto 40px auto;" />

I renamed this file `2015-tx-school-acc-data.dat` and dropped it into the `/data` folder, then repeated this process for 2014 and 2013.

Also, I snagged the file layouts ([e.g.](https://rptsvr1.tea.texas.gov/perfreport/account/2015/download/camprate.html)) and saved them as .tsv files in the `/data` directory. In practice, however, they didn't always match up with the data, so I used them as a rough guide and consulted a sample of published summary reports [like this one](https://rptsvr1.tea.texas.gov/perfreport/account/2013/static/summary/campus/c227901170.pdf) to check expected values against actual values.

### Process preliminary 2016 data
The initial data release for 2016, [available here as an Excel file](http://tea.texas.gov/WorkArea/linkit.aspx?LinkIdentifier=id&ItemID=51539609928), had the four top-level index scores and the "met standard" ratings, so that defineed our fields. The file was saved as `data/2016-raw-data.csv`, headers chopped and notes lines at bottom were deleted. The file that processed it [has been preserved from the notebook](2016-process-save.py). **The file created was `/data/2016-processed-data.txt`**.

### Preparing for the 2017 data release
The processing scripts below is set up for 2017 since 2016 is done did.

We will need top open the new Excel file, chop off the headers and notes at the bottom, and save as a csv file in data. Name it `2017-raw-data.csv`. **This script currently uses test 2017 data made from a copy of 2016. PLEASE replace this with the real deal.**

It's possibly that some adjustments will be needed when we get the real 2017 file.

In [1]:
import csv

## processing new year

## Will need to replace both input and output file here
with open('school_data/2017-raw-data.csv', 'r') as file_in, \
         open('school_data/2017-processed-data.txt', 'w') as file_out:
    reader = csv.reader(file_in, delimiter=',')


    fieldnames = ['campus_id', 'campus_name', 'district_name', 'rating', 'i1_target', 'i1_score',
                  'i2_target', 'i2_score', 'i3_target', 'i3_score', 'i4_target', 'i4_score', 'year']
    
    writer = csv.DictWriter(file_out, fieldnames=fieldnames, delimiter="|")
    # writer.writeheader()

    for row in reader:
        if row[1] != "":
            d = {}
            # left pad to nine digits or your results will be wack
            d['campus_id'] = row[0].zfill(9)
            d['campus_name'] = row[23]
            d['district_name'] = row[26]
            d['rating'] = row[57]
            d['i1_target'] = row[53]
            d['i1_score'] = row[2]
            d['i2_target'] = row[54]
            d['i2_score'] = row[7]
            d['i3_target'] = row[55]
            d['i3_score'] = row[10]
            d['i4_target'] = row[56]
            d['i4_score'] = row[15]
            d['year'] = '2017'
            writer.writerow(d)

print("~ processed 2017 raw data ~")

~ processed 2017 raw data ~


### Cut and stack
So now you can use `awk` and `csvkit` to extract the columns needed from each file and append them to `data/stacked-file.csv`. (The file layouts are different each year.) **Running this code block creates a pipe-delimited file at `data/stacked_data.txt`**

**UPDATE THIS FOR 2017 WITH REAL DATA PATH**

In [2]:
%%bash

# truncate file if already exists
:> school_data/stacked_data.txt

# write headers
echo "campus_id|campus_name|district_name|rating|i1_target|i1_score|i2_target|i2_score|i3_target|i3_score|i4_target|i4_score|year" >> school_data/stacked_data.txt

# slim version of 2013 data
awk -F '\t' '{OFS="|"; if (NR!=1) {print$1,$6,$51,$49,$20,$19,$25,$24,$30,$29,$35,$34,"2013"}}' school_data/2013-tx-school-acc-data.dat >> school_data/stacked_data.txt

# slim version of 2014 data
awk -F '\t' '{OFS="|"; if (NR!=1) {print $1,$9,$56,$54,$23,$22,$28,$27,$33,$32,$39,$37,"2014"}}' school_data/2014-tx-school-acc-data.dat >> school_data/stacked_data.txt

# slim version of 2015 data
awk -F '\t' '{OFS="|"; if (NR!=1) {print $1,$9,$56,$54,$23,$22,$28,$27,$33,$32,$38,$37,"2015"}}' school_data/2015-tx-school-acc-data.dat >> school_data/stacked_data.txt

# 2016 data
cat school_data/2016-processed-data.txt >> school_data/stacked_data.txt

# 2017 data THIS WILL NEED TO BE UPDATED
cat school_data/2017-processed-data.txt >> school_data/stacked_data.txt

# check for ish
csvclean -n school_data/stacked_data.txt

# report line count
wc -l < school_data/stacked_data.txt

No errors.
   43206


### Group and bake
The goal here is to power an interactive, and a single JSON file with all this data is ~9MB. So we're going to group the records by school and bake out 9K individual files. **Running this code block will create ~9,000 json files in `public/assets/data`. They are NOT checked into the repo any longer.**

In [3]:
from operator import itemgetter
import json
import re
import os

# create data folder if it doesn't exist
if not os.path.exists('../public/assets/data/'):
    os.mkdir('../public/assets/data/')

outdict = {}
index_list = []

TEXT_TRANSFORMS = (
    (r" H S$", " HIGH SCHOOL"),
    (r" MIDDLE$", " MIDDLE SCHOOL"),
    (r" JR H S$", " JUNIOR HIGH SCHOOL"),
    (r" INT$", " INTERMEDIATE"),
    (r" EL$", " ELEMENTARY"),
    (r"Met Standard", "M"),
    (r"Met Standard\**", "M"),
    (r"Met Standard-Paired", "M"),
    (r"Not Rated", "X"),
    (r"Not Rated: Data Integrity Issues", "X"),
    (r"Not Rated: Data Integrity Issues-Paired", "X"),
    (r"^Z$", "X"),
    (r"^Q$", "X"),
    (r"^T$", "X"),
    (r"Improvement Required-Paired", "I"),
    (r"Improvement Required", "I"),
    (r"Met Alternative Standard-Paired", "A"),
    (r"Met Alternative Standard", "A")
)

def clean_text(garb):
    if garb:
        for item in TEXT_TRANSFORMS:
            garb = re.sub(*item, garb, flags=re.IGNORECASE)
        return garb

def de_decimalize(num, type_method):
    """Turn decimals into something JSON-serializable."""
    if num:
        if num == ".":
            return None
        try:
            return type_method(num)
        except:
            return num

with open('school_data/stacked_data.txt') as infile:
    reader = csv.reader(infile, delimiter="|")
    for row in reader:
        campus_id = row[0]
        campus_name = row[1]
        district_name = row[2]
        rating = row[3]
        i1_target = row[4]
        i1_score = row[5]
        i2_target = row[6]
        i2_score = row[7]
        i3_target = row[8]
        i3_score = row[9]
        i4_target = row[10]
        i4_score = row[11]
        year = row[12]
        
        overall_rating = {}
        index1 = {}
        index2 = {}
        index3 = {}
        index4 = {}

        overall_rating['year'] = year
        overall_rating['rating'] = clean_text(rating)

        index1['year'] = year
        index1['target'] = de_decimalize(i1_target, int)
        index1['score'] = de_decimalize(i1_score, int)

        index2['year'] = year
        index2['target'] = de_decimalize(i2_target, int)
        index2['score'] = de_decimalize(i2_score, int)

        index3['year'] = year
        index3['target'] = de_decimalize(i3_target, int)
        index3['score'] = de_decimalize(i3_score, int)

        index4['year'] = year
        index4['target'] = de_decimalize(i4_target, int)
        index4['score'] = de_decimalize(i4_score, int)

        d = outdict.get(campus_id, None)

        if not d:
            outdict[campus_id] = {}
            outdict[campus_id]['name'] = clean_text(campus_name)
            outdict[campus_id]['dist_name'] = clean_text(district_name)
            outdict[campus_id]['ratings'] = []    

        idx1 = outdict[campus_id].get('1', None)
        idx2 = outdict[campus_id].get('2', None)
        idx3 = outdict[campus_id].get('3', None)
        idx4 = outdict[campus_id].get('4', None)

        if not idx1:
            outdict[campus_id]['1'] = {}
            outdict[campus_id]['1']['scores'] = []

        if not idx2:
            outdict[campus_id]['2'] = {}
            outdict[campus_id]['2']['scores'] = []

        if not idx3:
            outdict[campus_id]['3'] = {}
            outdict[campus_id]['3']['scores'] = []

        if not idx4:
            outdict[campus_id]['4'] = {}
            outdict[campus_id]['4']['scores'] = []

        outdict[campus_id]['1']['scores'].append(index1)
        outdict[campus_id]['2']['scores'].append(index2)
        outdict[campus_id]['3']['scores'].append(index3)
        outdict[campus_id]['4']['scores'].append(index4)
        outdict[campus_id]['ratings'].append(overall_rating)

    # fill in missing years
    expected_years = ['2013', '2014', '2015', '2016', '2017']

    for school in outdict:
        years = [x['year'] for x in outdict[school]['ratings']]
        missing_years = [x for x in expected_years if x not in years]

        if len(missing_years) > 0:
            for missing_year in missing_years:
                outdict[school]['ratings'].append({"year": missing_year, "rating": None})

        for i in range(1,5):
            years = [x['year'] for x in outdict[school][str(i)]['scores']]
            missing_years = [x for x in expected_years if x not in years]
            if len(missing_years) > 0:
                for missing_year in missing_years:
                    outdict[school][str(i)]['scores'].append({'target': None, 'year': missing_year, 'score': None})

        # sort the list of dicts by year
        outdict[school]['ratings'] = sorted(outdict[school]['ratings'], key=itemgetter('year'), reverse=True)
        outdict[school]['1']['scores'] = sorted(outdict[school]['1']['scores'], key=itemgetter('year'), reverse=True)
        outdict[school]['2']['scores'] = sorted(outdict[school]['2']['scores'], key=itemgetter('year'), reverse=True)
        outdict[school]['3']['scores'] = sorted(outdict[school]['3']['scores'], key=itemgetter('year'), reverse=True)
        outdict[school]['4']['scores'] = sorted(outdict[school]['4']['scores'], key=itemgetter('year'), reverse=True)

        # write the record to its own file and add to index
        with open("../public/assets/data/" + school + ".json", "w") as f:
            f.write(json.dumps(outdict[school]))

        index_list.append({
            "name": outdict[school]['name'],
            "id": school,
            "district": outdict[school]['dist_name']
        })
        
with open('../public/assets/data/search_index.json', 'w') as f:
    f.write(json.dumps(index_list))
    
print("Wrote", "{:,}".format(len(index_list)), "records to file.")

Wrote 9,317 records to file.
