I want to reorganize this dataset once again.

This time, let's compare each demographic group to the state population as a whole. That way, the calculation performed for the "ALL" population in each district is comparable to what's being done for each demographic within the district.

Also, let's use the "DISTRICT CUMULATIVE YEAR END ENROLLMENT" number as the total population for each district, not the number reported in the district snapshot.

Also, it should follow Tidy Data practices (https://vita.had.co.nz/papers/tidy-data.pdf).
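
As a rough sketch of what the tidy layout could look like, here is one way to turn the two-level column layout used later in this notebook into one row per district, year, group, and measure. The column names `GROUP`, `MEASURE`, and `VALUE` are placeholders I made up for illustration; the numbers are the district 1902 values that show up in the output further down.

```python
import pandas as pd

# Two-level columns shaped like dfpivot: (demographic group, measure).
wide = pd.DataFrame(
    {("ALL", "ISS"): [46.0], ("ALL", "POP"): [622.0],
     ("BLA", "ISS"): [1.0], ("BLA", "POP"): [26.0]},
    index=pd.MultiIndex.from_tuples([(1902, 2016)], names=["DISTRICT", "Year"]),
)
wide.columns.names = ["GROUP", "MEASURE"]  # hypothetical level names

# melt() turns each column level into its own variable column;
# ignore_index=False keeps DISTRICT and Year so reset_index() can restore them.
tidy = wide.melt(ignore_index=False, value_name="VALUE").reset_index()
print(tidy)
```

Each observation ends up on its own row, so filtering by group or measure is a plain boolean mask instead of a MultiIndex lookup.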

This script takes Texas Education Agency data about school district demographics and disciplinary actions, and puts them together in one GeoJSON file for the Texas Appleseed "School to Prison Pipeline" map. See http://www.texasdisciplinelab.org/

To use the script, follow these instructions:

1. For every year that you want to cover, download all 20 of the region files from http://rptsvr1.tea.texas.gov/adhocrpt/Disciplinary_Data_Products/Download_Region_Districts.html and put them in the directory '../data/from_agency/by_region/'

2. For every year that you want to cover, download the "District and Charter Detail Data" Snapshot Data File (comma-delimited *.dat) from https://rptsvr1.tea.texas.gov/perfreport/snapshot/download.html. The website delivers every one of these files under the same filename, district.dat, so rename each one by adding the year after "district" (for instance, "district2016.dat") and put it in the directory '../data/from_agency/districts/'

3. This script needs a GeoJSON file of district shapes. Make sure it can find that file at '../geojson/base_districts.geojson'

4. Change the first_year and last_year variables below to reflect the years you want your file to cover.

5. Run the notebook with "Kernel -> Restart and Run All"

6. Wait a while for it to finish. After about 15 minutes, the notebook should produce 'districts_with_data.geojson' in the '../geojson/' directory.

7. The resulting file will be about 20 MB depending on how many years it covers. You can make it smaller (about 10 MB) by uploading it to http://mapshaper.org/, using the "simplify" function to reduce the number of lines in the district boundaries, and exporting the file as TopoJSON instead of GeoJSON. I did this and put the result in the '../topojson/' directory.
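
Before running everything, it may help to check that the downloads from steps 1 and 2 are where the code expects them. The sketch below only uses the filename patterns that appear in the cells that follow; the helper name `expected_inputs` and its `data_dir` argument are my own invention.

```python
from pathlib import Path


def expected_inputs(first_year, last_year, data_dir="../data/from_agency"):
    """List the input files the notebook reads for the given range of years."""
    paths = []
    for year in range(first_year, last_year + 1):
        yy = str(year)[-2:]
        # 20 region files per year, numbered 01 through 20
        for region in range(1, 21):
            name = "REGION_{}_DISTRICT_summary_{}.csv".format(str(region).zfill(2), yy)
            paths.append(Path(data_dir) / "by_region" / name)
        # one renamed snapshot file per year
        paths.append(Path(data_dir) / "districts" / "district{}.dat".format(year))
    return paths


missing = [p for p in expected_inputs(2006, 2016) if not p.exists()]
print("{} input files missing".format(len(missing)))
```

If the count is not zero, printing `missing` shows which years or regions still need to be downloaded.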

In [1]:
import pandas as pd
import numpy as np
import scipy.stats as stats

pd.options.display.max_columns = 999
pd.options.display.max_rows = 999

first_year = 2006 # the year 2006 is the first year on the TEA site
last_year = 2016

pdict = {"EXPULSIONS":"EXP","DAEP REMOVALS":"DAE","IN SCHOOL SUSPENSIONS":"ISS","OUT OF SCHOOL SUSPENSIONS":"OSS"}

demos = {'SPE', 'ECO','HIS','BLA', 'WHI','IND', 'ASI','PCI', 'TWO'}

In [2]:
def formatDF(apple, year, year_col):
    
    # Removes columns not needed for the map
    
    apple = apple.drop(["AGGREGATION LEVEL","REGION","DISTNAME"], axis = 1)
    
    # Keeping only the rows that categorize students by protected class, or that have totals.
    
    patternIn = "WHITE|BLACK OR AFRICAN AMERICAN|AMERICAN INDIAN OR ALASKA NAT|HISPANIC|NATIVE HAWAIIAN|ASIAN" +\
                "|TWO OR MORE RACES|SPEC. ED|ECO. DISAD|ECO DISAD.|TOTAL|DISTRICT CUMULATIVE YEAR END ENROLLMENT" +\
                "|MANDATORY|DISCRETIONARY"
    
    apple = apple[apple["HEADING NAME"].str.contains(patternIn)]
    
    # Getting rid of rows that count students instead of incidents, or non-disadvantaged kids.
    
    patternOut = "SPEC. ED. STUDENTS|EXPULSIONS TO JJAEP|ECO DISAD. STUDENTS|ECO. DISAD. STUDENTS" +\
                 "|AT RISK|NON AT|UNKNOWN AT|NON ECO DISAD.|NON ECO. DISAD."
    apple = apple[apple["HEADING NAME"].str.contains(patternOut) == False]

    # Delete rows appearing to double-count the same expulsions.
    
    JJAEPReplace = {"SECTION": {
                        r'M-ECO\. DISADV\. JJAEP PLACEMENTS|H-SPEC\. ED\. JJAEP EXPULSIONS': 'C-JJAEP EXPULSIONS'}}
    apple = apple.replace(to_replace=JJAEPReplace, regex=True)
    apple = apple[apple["SECTION"].str.contains("JJAEP EXPULSIONS|DISCIPLINE ACTION COUNTS") == False]
    
    # Consolidating some of the descriptors into broader categories
    
    # Raw strings keep the regex escapes (\.) from being read as string escapes.
    # Negative sentinel values (presumably masked small counts) become 1.

    appleReplace = {year_col:
                        {-99999999: 1, -999999: 1, -999: 1},
                    "SECTION": {
                        'A-PARTICIPATION': 'POP',
                        r'D-EXPULSION ACTIONS|N-ECO\. DISADV\. EXPULSIONS|I-SPEC\. ED\. EXPULSIONS': 'EXP',
                        r'E-DAEP PLACEMENTS|O-ECO\. DISADV\. DAEP PLACEMENTS|J-SPEC\. ED\. DAEP PLACEMENTS': 'DAE',
                        r'F-OUT OF SCHOOL SUSPENSIONS|P-ECO\. DISADV\. OUT OF SCHOOL SUS\.|K-SPEC\. ED\. OUT OF SCHOOL SUS\.': 'OSS',
                        r'G-IN SCHOOL SUSPENSIONS|Q-ECO\. DISADV\. IN SCHOOL SUS\.|L-SPEC\. ED\. IN SCHOOL SUS\.': 'ISS'},
                    "HEADING NAME": {r'SPEC\. ED.*$': 'SPE',
                                     r'ECO\.? DISAD.*$': 'ECO',  # matches "ECO. DISAD" and "ECO DISAD."
                                     'HISPANIC': 'HIS',
                                     'HIS/LATINO': 'HIS',
                                     'HISPANIC/LATINO': 'HIS',
                                     'BLACK OR AFRICAN AMERICAN': 'BLA',
                                     'BLACK/AFRICAN AMERICAN': 'BLA',
                                     'WHITE': 'WHI',
                                     'AMERICAN INDIAN OR ALASKA NAT': 'IND',
                                     'ASIAN': 'ASI',
                                     'NATIVE HAWAIIAN/OTHER PACIFIC': 'PCI',
                                     'TWO OR MORE RACES': 'TWO',
                                     'DISTRICT CUMULATIVE YEAR END ENROLLMENT': 'ALL'
                                    }
                    }

    df = apple.replace(to_replace=appleReplace, regex=True)
    
    df["Year"] = year

    for punishment in pdict:
        for category in ("MANDATORY ", "DISCRETIONARY "):
            df.loc[((df['HEADING NAME'] == category + punishment), 'SECTION')] = category + pdict[punishment]
            df = df.replace(to_replace=category + punishment, value="ALL")

    return df
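
To see what the "HEADING NAME" consolidation in formatDF does to individual labels, here is a small stand-alone sketch using the standard `re` module. It only approximates what `DataFrame.replace(regex=True)` does, it covers just a subset of the patterns, and the example headings ending in "EXAMPLE" are made up rather than taken from a TEA file.

```python
import re

# Ordered (pattern, code) pairs adapted from the "HEADING NAME" mapping in formatDF.
HEADING_CODES = [
    (r"SPEC\. ED.*$", "SPE"),
    (r"ECO\.? DISAD.*$", "ECO"),
    (r"HISPANIC", "HIS"),
    (r"HIS/LATINO", "HIS"),
    (r"BLACK OR AFRICAN AMERICAN", "BLA"),
    (r"NATIVE HAWAIIAN/OTHER PACIFIC", "PCI"),
]


def normalize_heading(name):
    """Apply each substitution in turn, the way the replacement dict is applied."""
    for pattern, code in HEADING_CODES:
        name = re.sub(pattern, code, name)
    return name


for heading in ("ECO. DISAD. EXAMPLE", "ECO DISAD. EXAMPLE", "HISPANIC/LATINO"):
    print(heading, "->", normalize_heading(heading))
```

The "HISPANIC/LATINO" case shows why the 'HIS/LATINO' entry exists: the 'HISPANIC' substitution runs first and leaves 'HIS/LATINO' behind for the next pattern to clean up.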


In [3]:
year = 2016

def getYear(year):
    year_col = "YR{}".format(str(year)[-2:])
    apple_path = '../data/from_agency/by_region/REGION_{}_DISTRICT_summary_{}.csv'
    one_year = [pd.read_csv(apple_path.format(str(region).zfill(2),str(year)[-2:]), 
                            dtype = {year_col: int})
                for region in range(1,21)]
    a = pd.concat(one_year)
    # a = a[~a.index.duplicated(keep='last')]  # a single row was causing a non-unique multiindex error 
    # dfnames = a[["DISTRICT", "DISTNAME"]].drop_duplicates()
    apple = formatDF(a, year, year_col).rename(columns={year_col: "Count"})
    return apple

df = getYear(year)

In [4]:
# saving list of school names to link to the IDs later.

dfnames[:5]

Unnamed: 0,DISTRICT,DISTNAME
0,31901,BROWNSVILLE ISD
97,108902,DONNA ISD
198,108903,EDCOUCH-ELSA ISD
256,108904,EDINBURG CISD
365,108809,EXCELLENCE IN LEADERSHIP ACADEMY


In [5]:
df[:5]

Unnamed: 0,DISTRICT,SECTION,HEADING,HEADING NAME,Count,Year
0,31901,POP,A01,ALL,50150,2016
7,31901,MANDATORY EXP,B05,ALL,1,2016
8,31901,DISCRETIONARY EXP,B06,ALL,1,2016
10,31901,MANDATORY DAE,B08,ALL,218,2016
11,31901,DISCRETIONARY DAE,B09,ALL,384,2016


In [6]:
dfpivot = df.pivot_table(index=['DISTRICT','Year'], columns=['HEADING NAME', 'SECTION'], values='Count')

In [7]:
# Trying two methods to get total disciplinary actions per district: adding up "Mandatory" and "Discretionary"
# actions (which are not always reported), and adding up actions against special ed and non-special ed students.
# Relying on whichever number is higher, on the assumption that if actions are reported anywhere, they probably
# really happened.

for p in pdict.values():
    try:
        dfpivot["ALL", p] = dfpivot["ALL"]["DISCRETIONARY " + p] + dfpivot["ALL"]["MANDATORY " + p]
    except KeyError:
        print("No mandatory/discretionary columns for " + p + " in " + str(year))
    dfpivot["ALLS", p] = dfpivot["SPE"][p] + dfpivot["NON SPE"][p]
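    # errors="ignore" because these columns are missing for some punishments (see the KeyError above)
    dfpivot = dfpivot.drop("DISCRETIONARY " + p, axis=1, level=1, errors="ignore")
    dfpivot = dfpivot.drop("MANDATORY " + p, axis=1, level=1, errors="ignore")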
    try:
        dfpivot["ALL", p] = dfpivot[[("ALL", p), ("ALLS", p)]].max(axis=1)
    except KeyError:
        dfpivot["ALL", p] = dfpivot["ALLS", p]
    # dfpivot = dfpivot.MultiIndex.drop([("ALL", "DISCRETIONARY " + p), ("ALL", "MANDATORY " + p)])

No mandatory/discretionary columns for ISS in 2016
No mandatory/discretionary columns for OSS in 2016
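
One thing to keep in mind about the two-totals approach above: adding two columns with `+` gives NaN whenever either part is missing, so the "mandatory plus discretionary" total can vanish even when one of the two was reported. The toy numbers below are invented to show the difference between `+`, `Series.add(fill_value=0)`, and the row-wise `max` used to pick the larger total.

```python
import numpy as np
import pandas as pd

# Invented counts for three districts
mandatory = pd.Series([1.0, np.nan, np.nan])
discretionary = pd.Series([3.0, 2.0, np.nan])
spe_plus_non_spe = pd.Series([5.0, 1.0, 7.0])

by_reason = mandatory + discretionary                          # NaN if either part is missing
by_reason_filled = mandatory.add(discretionary, fill_value=0)  # NaN only if both are missing

# max(axis=1) skips NaN, so a missing total falls back to the other method
total = pd.concat([by_reason, spe_plus_non_spe], axis=1).max(axis=1)
print(total.tolist())
```

In the second district the plain sum is lost while the filled sum keeps the 2 discretionary actions. Whether that matters for the real data is something I have not checked.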


In [8]:
dfpivot.columns

MultiIndex(levels=[['ALL', 'ASI', 'BLA', 'ECO', 'HIS', 'IND', 'NON SPE', 'PCI', 'SPE', 'TWO', 'WHI', 'ALLS'], ['DAE', 'DISCRETIONARY DAE', 'DISCRETIONARY EXP', 'EXP', 'ISS', 'MANDATORY DAE', 'MANDATORY EXP', 'OSS', 'POP']],
           labels=[[0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10, 10, 0, 11, 0, 11, 11, 0, 11, 0], [8, 0, 3, 4, 7, 0, 3, 4, 7, 0, 3, 4, 7, 0, 3, 4, 7, 0, 3, 4, 7, 0, 3, 4, 7, 0, 3, 4, 7, 0, 3, 4, 7, 0, 3, 4, 7, 0, 3, 4, 7, 3, 3, 0, 0, 4, 4, 7, 7]],
           names=['HEADING NAME', 'SECTION'])

In [9]:
# Dropping columns that were just used to get totals, and won't be needed for the map.

dfpivot = dfpivot.drop("ALLS", axis=1, level=0)
dfpivot = dfpivot.drop("NON SPE", axis=1, level=0)
dfpivot = dfpivot.sort_index(axis=1) # sorting columns back into order

In [10]:
def populations(districtPath, year):
    district = pd.read_csv(districtPath)

    district = district.rename(columns = {"SNAPDIST": 'DISTNAME'})
    
    sometimes_missing = [ 'DPETINDP', 'DPETASIP', 'DPETPCIP', 'DPETTWOP']
    
    for c in sometimes_missing:
        if c not in district.columns:
            district[c] = np.nan
    
    district["Year"] = year
    
    # keeping only the columns needed below. This drops 'DPETALLC', which is also a measure of
    # district population, but it's not the same as what the TEA uses in the annual discipline
    # reports processed above.
    
    district = district[['DISTRICT', 'Year', 
                         'DPETBLAP', 'DPETHISP', 'DPETWHIP', 'DPETINDP',
                         'DPETASIP', 'DPETPCIP', 'DPETTWOP', 'DPETECOP', 
                         'DPETSPEP']] 

    district = district.set_index(["DISTRICT",'Year'])
    
    # turning percentages into decimals

    district = district * .01

    return district

In [11]:
districtPath = '../data/from_agency/districts/district{}.dat'.format(year)
district = populations(districtPath, year)
    

In [12]:
district[:10]

DISTRICT,Year,DPETBLAP,DPETHISP,DPETWHIP,DPETINDP,DPETASIP,DPETPCIP,DPETTWOP,DPETECOP,DPETSPEP
1902,2016,0.042,0.07,0.829,0.0,0.007,0.0,0.051,0.338,0.137
1903,2016,0.054,0.096,0.814,0.001,0.009,0.002,0.026,0.547,0.116
1904,2016,0.088,0.102,0.759,0.005,0.007,0.001,0.038,0.565,0.082
1906,2016,0.084,0.115,0.773,0.0,0.008,0.0,0.021,0.428,0.11
1907,2016,0.27,0.412,0.279,0.002,0.008,0.0,0.028,0.741,0.085
1908,2016,0.178,0.239,0.533,0.006,0.005,0.0,0.04,0.628,0.099
1909,2016,0.022,0.053,0.894,0.007,0.002,0.0,0.022,0.478,0.13
2901,2016,0.011,0.674,0.291,0.003,0.004,0.0,0.018,0.397,0.065
3801,2016,0.172,0.196,0.59,0.0,0.023,0.002,0.018,0.467,0.06
3902,2016,0.066,0.24,0.661,0.007,0.008,0.0,0.017,0.535,0.076


In [13]:
dfwide = pd.concat([dfpivot, district], axis=1, join='outer')

In [14]:


# adding population for each demographic group

for demo in demos:
    dfwide[demo, "POP"] = dfwide[[("ALL", "POP"), ("DPET" + demo + "P")]].prod(axis=1).round(0)
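
As a worked check of the estimate above, here is the same arithmetic for district 1902 in 2016, using the cumulative enrollment (622) and a few of the snapshot shares shown in the tables in this notebook. Nothing here is new data; it is just the multiplication and rounding done by hand.

```python
# District 1902, 2016: cumulative year-end enrollment and snapshot shares (as decimals)
all_pop = 622
shares = {"BLA": 0.042, "HIS": 0.07, "WHI": 0.829, "ASI": 0.007}

# Estimated head count per group = total enrollment x share, rounded to a whole person
estimated = {demo: round(all_pop * share) for demo, share in shares.items()}
print(estimated)
```

These match the POP values for district 1902 in the output below (26, 44, 516 and 4). Because each group is rounded separately, the estimates are not guaranteed to add up to the district total.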



In [15]:
dfpivot = dfwide.drop(columns=["DPETBLAP","DPETHISP","DPETWHIP","DPETINDP","DPETASIP","DPETPCIP","DPETTWOP","DPETECOP","DPETSPEP"])
dfpivot = dfpivot.sort_index(axis=1)

In [16]:
dfpivot.columns

Index([('ALL', 'DAE'), ('ALL', 'EXP'), ('ALL', 'ISS'), ('ALL', 'OSS'),
       ('ALL', 'POP'), ('ASI', 'DAE'), ('ASI', 'EXP'), ('ASI', 'ISS'),
       ('ASI', 'OSS'), ('ASI', 'POP'), ('BLA', 'DAE'), ('BLA', 'EXP'),
       ('BLA', 'ISS'), ('BLA', 'OSS'), ('BLA', 'POP'), ('ECO', 'DAE'),
       ('ECO', 'EXP'), ('ECO', 'ISS'), ('ECO', 'OSS'), ('ECO', 'POP'),
       ('HIS', 'DAE'), ('HIS', 'EXP'), ('HIS', 'ISS'), ('HIS', 'OSS'),
       ('HIS', 'POP'), ('IND', 'DAE'), ('IND', 'EXP'), ('IND', 'ISS'),
       ('IND', 'OSS'), ('IND', 'POP'), ('PCI', 'DAE'), ('PCI', 'EXP'),
       ('PCI', 'ISS'), ('PCI', 'OSS'), ('PCI', 'POP'), ('SPE', 'DAE'),
       ('SPE', 'EXP'), ('SPE', 'ISS'), ('SPE', 'OSS'), ('SPE', 'POP'),
       ('TWO', 'DAE'), ('TWO', 'EXP'), ('TWO', 'ISS'), ('TWO', 'OSS'),
       ('TWO', 'POP'), ('WHI', 'DAE'), ('WHI', 'EXP'), ('WHI', 'ISS'),
       ('WHI', 'OSS'), ('WHI', 'POP')],
      dtype='object')

In [17]:
dfpivot[:10]

DISTRICT,Year,"(ALL, DAE)","(ALL, EXP)","(ALL, ISS)","(ALL, OSS)","(ALL, POP)","(ASI, DAE)","(ASI, EXP)","(ASI, ISS)","(ASI, OSS)","(ASI, POP)","(BLA, DAE)","(BLA, EXP)","(BLA, ISS)","(BLA, OSS)","(BLA, POP)","(ECO, DAE)","(ECO, EXP)","(ECO, ISS)","(ECO, OSS)","(ECO, POP)","(HIS, DAE)","(HIS, EXP)","(HIS, ISS)","(HIS, OSS)","(HIS, POP)","(IND, DAE)","(IND, EXP)","(IND, ISS)","(IND, OSS)","(IND, POP)","(PCI, DAE)","(PCI, EXP)","(PCI, ISS)","(PCI, OSS)","(PCI, POP)","(SPE, DAE)","(SPE, EXP)","(SPE, ISS)","(SPE, OSS)","(SPE, POP)","(TWO, DAE)","(TWO, EXP)","(TWO, ISS)","(TWO, OSS)","(TWO, POP)","(WHI, DAE)","(WHI, EXP)","(WHI, ISS)","(WHI, OSS)","(WHI, POP)"
1902,2016,2.0,,46.0,2.0,622.0,,,,,4.0,,,1.0,,26.0,1.0,,23.0,1.0,210.0,,,7.0,1.0,44.0,,,,,0.0,,,,,0.0,1.0,,19.0,1.0,85.0,,,1.0,,32.0,7.0,,31.0,1.0,516.0
1903,2016,2.0,,380.0,28.0,1381.0,,,,,12.0,,,53.0,1.0,75.0,5.0,,284.0,23.0,755.0,,,17.0,1.0,133.0,,,,,1.0,,,,,3.0,1.0,,77.0,7.0,160.0,,,5.0,1.0,36.0,5.0,,305.0,22.0,1124.0
1904,2016,2.0,,117.0,2.0,916.0,,,1.0,,6.0,1.0,,19.0,1.0,81.0,1.0,,80.0,1.0,518.0,,,10.0,,93.0,,,,,5.0,,,,,1.0,1.0,,9.0,1.0,75.0,,,1.0,1.0,35.0,1.0,,80.0,1.0,695.0
1906,2016,2.0,,21.0,2.0,425.0,,,,,3.0,,,,,36.0,1.0,,15.0,1.0,182.0,1.0,,,,49.0,,,,,0.0,,,,,0.0,1.0,,8.0,1.0,47.0,,,,,9.0,1.0,,21.0,8.0,329.0
1907,2016,163.0,,996.0,234.0,3693.0,,,1.0,,30.0,94.0,1.0,458.0,125.0,997.0,148.0,1.0,859.0,198.0,2737.0,22.0,,265.0,46.0,1522.0,1.0,,1.0,1.0,7.0,,,,,0.0,34.0,,153.0,58.0,314.0,1.0,,26.0,1.0,103.0,42.0,,236.0,52.0,1030.0
1908,2016,24.0,,484.0,153.0,1774.0,,,1.0,,9.0,1.0,,154.0,40.0,316.0,19.0,,379.0,128.0,1114.0,1.0,,83.0,32.0,424.0,,,1.0,1.0,11.0,,,,,0.0,1.0,,74.0,35.0,176.0,,,1.0,1.0,71.0,16.0,,231.0,67.0,946.0
1909,2016,,,74.0,2.0,440.0,,,,,1.0,,,,,10.0,1.0,,65.0,1.0,210.0,,,,,23.0,,,1.0,1.0,3.0,,,,,0.0,,,18.0,1.0,57.0,,,1.0,,10.0,1.0,,1.0,1.0,393.0
2901,2016,61.0,,739.0,182.0,4273.0,,,,,17.0,1.0,,15.0,5.0,47.0,43.0,,461.0,133.0,1696.0,42.0,1.0,548.0,126.0,2880.0,,,,,13.0,,,,,0.0,8.0,,79.0,43.0,278.0,1.0,,25.0,10.0,77.0,13.0,1.0,151.0,41.0,1243.0
3801,2016,,,239.0,2.0,1025.0,,,,,24.0,,,65.0,1.0,176.0,,,170.0,16.0,479.0,,,63.0,1.0,201.0,,,,,0.0,,,,,2.0,,,27.0,1.0,62.0,,,7.0,1.0,18.0,,,104.0,11.0,605.0
3902,2016,39.0,,398.0,64.0,3002.0,,,,,24.0,1.0,,81.0,1.0,198.0,1.0,,333.0,1.0,1606.0,1.0,,99.0,27.0,720.0,,,1.0,,21.0,,,,,0.0,6.0,,90.0,10.0,228.0,,,1.0,1.0,51.0,18.0,,209.0,25.0,1984.0


In [18]:
def impossible(distPop, racePop, all_punishments, group_punishments):

    """
    >>> print(impossible(50, 20, 20, 100))
    1
    >>> impossible(20, 0, 20, 0)
    0
    """

    # flags implausible data entries. Some of them could still be true if school administrators
    # applied different standards to determine which students belong to which demographic group.
    # Or some could be the result of students not being counted because of the time they moved in and out of district.

    if group_punishments > max(all_punishments,8): # eight because TEA could report 2 masked columns with 4 each
        return 1
    if racePop == 0 and group_punishments > 0:
        return 1
    return 0



def getFisher(row, state_pop, state_punishments, d, p):

    """
    Scores one (district, year) row for demographic group d and punishment p.

    Returns an integer from -6 to 6, or None when the data are missing or implausible.
    Meant to be called through dfpivot.apply(getFisher, axis=1, args=(state_pop, state_punishments, d, p)).
    """
    
    # I don't know if this is a valid way to report the Fisher's exact test result. The score is 1 minus a
    # one-sided p-value, with a sign attached. A positive score over .95 means the group's share of punishments is
    # higher than its share of the population, and the one-sided p-value for that difference is under .05.
    # A score under -.95 means the group's share of punishments is lower than its share of the population,
    # with a one-sided p-value under .05.
    # I think it should be easier to create a color scale to show the scores on a map this way.

    # The getFisher function assumes wrongly that everyone can have only one punishment (of each type). If the number of
    # punishments exceeds the number of kids, it reduces the number of punishments (and assumes wrongly that every
    # kid has been punished) But maybe the results are still close enough to correct to use for scaling?

    # Using the impossible function to decide whether to proceed, instead of making separate columns.

    if d == "ALL":
        distPop = state_pop # variable names are misleading in the "ALL" case
        all_punishments = state_punishments
        
    else:
        distPop = row[("ALL", "POP")]
        all_punishments = row[("ALL", p)]
        
    racePop = row[(d, "POP")]
    group_punishments = row[(d, p)]
    
    if impossible(distPop, racePop, all_punishments, group_punishments):
        return None
    
    if max(racePop, group_punishments) == 0:
        return None
    if distPop == 0:
        return None
    elif max(group_punishments, all_punishments) == 0:
        return 0
    else:
        if pd.isna(group_punishments):
            group_punishments = 0
        if pd.isna(all_punishments):
            all_punishments = 0
        table = [[racePop, max(distPop - racePop, 0)],
                 [group_punishments, max(all_punishments - group_punishments, 0)]]
        try: 
            oddsratio, pvalueG = stats.fisher_exact(table, alternative='greater')
            oddsratio, pvalueL = stats.fisher_exact(table, alternative='less')
        except ValueError:
            # fisher_exact rejected the table; print the inputs and skip this row
            print(distPop, racePop, all_punishments, group_punishments)
            return None
        if pvalueL < pvalueG:
            pv = 1 - pvalueL
        else:
            pv = pvalueG - 1
        
        # To save space in the output file, this simplifies the decimal values to an integer from -6 to 6
        # It should replace similar code in txappleseedmap/js/index.js
        
        scale = -6
        scale_colors = (-0.99999,-0.9984,-0.992,-0.96,-0.8,-0.2,0.2,0.8,0.96,0.992,0.9984,0.99999)
        
        for v in scale_colors:
            if pv > v:
                scale += 1
        
    return scale

# print(getFisher(20, 5, 20, 10))
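
To make the scoring easier to reason about without a dataframe, here is a pure-Python sketch of the three steps inside getFisher: the two one-sided p-values, the signed score, and the -6 to 6 bucket. The p-values are computed by summing hypergeometric probabilities, which I believe is what `stats.fisher_exact` reports for `alternative='less'` and `'greater'`, but that equivalence is an assumption worth checking against scipy. The function names are made up for this example, and the inputs must be whole numbers.

```python
from math import comb


def one_sided_p_values(table):
    """Return (p_less, p_greater) for a 2x2 table of non-negative integers."""
    (a, b), (c, d) = table
    row1, col1, total = a + b, a + c, a + b + c + d

    def pmf(k):
        # probability that the top-left cell equals k with all margins fixed
        return comb(row1, k) * comb(total - row1, col1 - k) / comb(total, col1)

    low, high = max(0, col1 - (total - row1)), min(row1, col1)
    p_less = sum(pmf(k) for k in range(low, a + 1))
    p_greater = sum(pmf(k) for k in range(a, high + 1))
    return p_less, p_greater


def signed_score(p_less, p_greater):
    # same sign convention as getFisher
    return 1 - p_less if p_less < p_greater else p_greater - 1


def to_scale(pv, cutoffs=(-0.99999, -0.9984, -0.992, -0.96, -0.8, -0.2,
                          0.2, 0.8, 0.96, 0.992, 0.9984, 0.99999)):
    # start at -6 and move up one step for every cutoff the score exceeds
    return -6 + sum(pv > v for v in cutoffs)


# Made-up district: the group is 20 of 100 students but got 10 of 20 punishments
example_less, example_greater = one_sided_p_values([[20, 80], [10, 10]])
example_scale = to_scale(signed_score(example_less, example_greater))
print(example_less, example_greater, example_scale)
```

The group in this made-up example is overrepresented among punishments, so the score comes out positive.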

In [19]:
# changing the set used to make columns earlier.
# could cause a bug if cells not run in order.

demos.add("ALL")

In [20]:
demos

{'ALL', 'ASI', 'BLA', 'ECO', 'HIS', 'IND', 'PCI', 'SPE', 'TWO', 'WHI'}

In [21]:
state_pop = dfpivot[("ALL", "POP")].sum()

In [22]:
# This compares each population within a district to the state as a whole,
# not the district as a whole.

for p in pdict.values():
    state_punishments = dfpivot[("ALL", p)].sum()
    for d in demos:
        dfpivot[(d, p, "S")] = dfpivot.apply(getFisher, axis = 1, args = (state_pop, state_punishments, d, p))
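
The difference between the "ALL" rows and the demographic rows is easiest to see in the 2x2 table that gets passed to the test. The sketch below rebuilds that table outside of getFisher. The district numbers are the 2016 ISS figures for district 181906 from the output in this notebook; the statewide totals are round placeholders I made up, not TEA figures, and the helper name is hypothetical.

```python
def contingency_table(dist_pop, group_pop, all_punishments, group_punishments):
    # same layout as the table passed to stats.fisher_exact in getFisher
    return [[group_pop, max(dist_pop - group_pop, 0)],
            [group_punishments, max(all_punishments - group_punishments, 0)]]


# Demographic row: Black students compared with the rest of district 181906
bla_table = contingency_table(dist_pop=2778, group_pop=1670,
                              all_punishments=931, group_punishments=764)

# "ALL" row: the whole district compared with the rest of the state
state_pop, state_iss = 5000000, 1000000   # placeholder statewide totals, NOT real data
all_table = contingency_table(dist_pop=state_pop, group_pop=2778,
                              all_punishments=state_iss, group_punishments=931)

print(bla_table)
print(all_table)
```

In the "ALL" case the second column is enormous compared with the first, which may be part of why so many district-level scores land at 6 or -6.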

In [23]:
# This chart has too many sixes and negative sixes. Need to go back to the old way of calculating.

dfpivot = dfpivot.sort_index(axis=1)
dfpivot[900:920]

DISTRICT,Year,"(ALL, DAE)","(ALL, DAE, S)","(ALL, EXP)","(ALL, EXP, S)","(ALL, ISS)","(ALL, ISS, S)","(ALL, OSS)","(ALL, OSS, S)","(ALL, POP)","(ASI, DAE)","(ASI, DAE, S)","(ASI, EXP)","(ASI, EXP, S)","(ASI, ISS)","(ASI, ISS, S)","(ASI, OSS)","(ASI, OSS, S)","(ASI, POP)","(BLA, DAE)","(BLA, DAE, S)","(BLA, EXP)","(BLA, EXP, S)","(BLA, ISS)","(BLA, ISS, S)","(BLA, OSS)","(BLA, OSS, S)","(BLA, POP)","(ECO, DAE)","(ECO, DAE, S)","(ECO, EXP)","(ECO, EXP, S)","(ECO, ISS)","(ECO, ISS, S)","(ECO, OSS)","(ECO, OSS, S)","(ECO, POP)","(HIS, DAE)","(HIS, DAE, S)","(HIS, EXP)","(HIS, EXP, S)","(HIS, ISS)","(HIS, ISS, S)","(HIS, OSS)","(HIS, OSS, S)","(HIS, POP)","(IND, DAE)","(IND, DAE, S)","(IND, EXP)","(IND, EXP, S)","(IND, ISS)","(IND, ISS, S)","(IND, OSS)","(IND, OSS, S)","(IND, POP)","(PCI, DAE)","(PCI, DAE, S)","(PCI, EXP)","(PCI, EXP, S)","(PCI, ISS)","(PCI, ISS, S)","(PCI, OSS)","(PCI, OSS, S)","(PCI, POP)","(SPE, DAE)","(SPE, DAE, S)","(SPE, EXP)","(SPE, EXP, S)","(SPE, ISS)","(SPE, ISS, S)","(SPE, OSS)","(SPE, OSS, S)","(SPE, POP)","(TWO, DAE)","(TWO, DAE, S)","(TWO, EXP)","(TWO, EXP, S)","(TWO, ISS)","(TWO, ISS, S)","(TWO, OSS)","(TWO, OSS, S)","(TWO, POP)","(WHI, DAE)","(WHI, DAE, S)","(WHI, EXP)","(WHI, EXP, S)","(WHI, ISS)","(WHI, ISS, S)","(WHI, OSS)","(WHI, OSS, S)","(WHI, POP)"
181905,2016,19.0,-3,,-1,506.0,6,126.0,-3,1878.0,,-1.0,,0.0,1.0,-5.0,,-2.0,36.0,1.0,2.0,,0.0,13.0,5.0,1.0,1.0,6.0,1.0,-4.0,1.0,1.0,293.0,6.0,96.0,6.0,689.0,1.0,-1.0,,0.0,41.0,-1.0,8.0,-2.0,171.0,1.0,2.0,,0.0,12.0,5.0,1.0,1.0,6.0,,,,,,,,,0.0,5.0,3.0,1.0,2.0,72.0,5.0,47.0,6.0,148.0,1.0,1.0,,0.0,1.0,-5.0,15.0,6.0,49.0,14.0,-2.0,1.0,0.0,429.0,-1.0,92.0,-5.0,1611.0
181906,2016,2.0,-6,,-1,931.0,6,503.0,6,2778.0,,0.0,,0.0,,-3.0,,-2.0,14.0,13.0,,1.0,1.0,764.0,6.0,348.0,5.0,1670.0,18.0,,,0.0,858.0,6.0,466.0,6.0,2322.0,1.0,1.0,,0.0,43.0,-6.0,24.0,-6.0,353.0,,0.0,,0.0,,-2.0,,-1.0,6.0,,0.0,,0.0,,-2.0,,-1.0,6.0,1.0,1.0,,0.0,128.0,3.0,189.0,6.0,306.0,,0.0,,0.0,14.0,-6.0,5.0,-5.0,128.0,1.0,1.0,,0.0,110.0,-6.0,126.0,2.0,600.0
181907,2016,121.0,5,2.0,-1,1907.0,6,554.0,6,5210.0,,-1.0,,0.0,1.0,-6.0,1.0,-2.0,47.0,,-1.0,,0.0,1.0,-2.0,,-1.0,10.0,101.0,6.0,7.0,3.0,1354.0,6.0,479.0,6.0,2683.0,1.0,-4.0,,0.0,102.0,-3.0,30.0,-2.0,359.0,1.0,1.0,,0.0,7.0,-1.0,1.0,-1.0,21.0,,0.0,,0.0,,-1.0,,-1.0,5.0,24.0,3.0,1.0,1.0,402.0,6.0,144.0,6.0,630.0,5.0,2.0,,0.0,32.0,-1.0,19.0,4.0,89.0,111.0,1.0,7.0,1.0,1761.0,5.0,494.0,-1.0,4679.0
181908,2016,93.0,6,2.0,1,2589.0,6,245.0,-4,3654.0,,-1.0,,0.0,1.0,-6.0,,-3.0,55.0,16.0,5.0,,0.0,384.0,6.0,74.0,6.0,241.0,62.0,6.0,,-1.0,1659.0,6.0,196.0,6.0,1286.0,1.0,-5.0,,0.0,339.0,5.0,23.0,-1.0,373.0,,-1.0,,0.0,1.0,-5.0,1.0,-1.0,18.0,,0.0,,0.0,,-2.0,,-1.0,4.0,27.0,5.0,1.0,1.0,635.0,6.0,81.0,6.0,442.0,1.0,-1.0,,0.0,1.0,-6.0,1.0,-4.0,106.0,62.0,-4.0,1.0,-1.0,1771.0,-6.0,137.0,-6.0,2857.0
182901,2016,,-3,,0,39.0,-1,,-6,214.0,,,,,,,,,0.0,,,,,,,,,0.0,,0.0,,0.0,27.0,5.0,1.0,1.0,85.0,,0.0,,0.0,12.0,2.0,1.0,2.0,42.0,,,,,,,,,0.0,,,,,,,,,0.0,,0.0,,0.0,12.0,5.0,1.0,2.0,17.0,,0.0,,0.0,5.0,3.0,,0.0,7.0,,0.0,,0.0,22.0,-4.0,,0.0,166.0
182902,2016,2.0,-2,,0,2.0,-6,,-6,405.0,,0.0,,0.0,,0.0,,0.0,2.0,,,,,,,,,0.0,1.0,-1.0,,0.0,17.0,,1.0,1.0,213.0,,-1.0,,0.0,1.0,1.0,1.0,2.0,55.0,,0.0,,0.0,,0.0,,0.0,2.0,,,,,,,,,0.0,1.0,2.0,,0.0,1.0,2.0,,0.0,26.0,,0.0,,0.0,,0.0,,0.0,6.0,1.0,-1.0,,0.0,1.0,-1.0,1.0,0.0,340.0
182903,2016,104.0,6,,-2,1178.0,6,236.0,-4,3542.0,,-1.0,,0.0,,-3.0,,-1.0,14.0,8.0,3.0,,0.0,1.0,-6.0,24.0,6.0,120.0,87.0,4.0,1.0,1.0,1002.0,6.0,200.0,6.0,2501.0,32.0,-2.0,1.0,1.0,294.0,-6.0,44.0,-6.0,1364.0,1.0,1.0,,0.0,1.0,-2.0,1.0,1.0,14.0,,,,,,,,,0.0,11.0,1.0,,0.0,133.0,4.0,50.0,6.0,308.0,1.0,-2.0,,0.0,71.0,5.0,1.0,-4.0,110.0,56.0,-1.0,1.0,1.0,753.0,6.0,150.0,4.0,1920.0
182904,2016,,-5,,-1,2.0,-6,,-6,519.0,,0.0,,0.0,,0.0,,0.0,1.0,,0.0,,0.0,,0.0,,0.0,1.0,1.0,1.0,,0.0,6.0,3.0,1.0,1.0,236.0,,0.0,,0.0,1.0,1.0,,0.0,90.0,,0.0,,0.0,,0.0,,0.0,1.0,,,,,,,,,0.0,,0.0,,0.0,1.0,2.0,,0.0,48.0,,0.0,,0.0,,0.0,1.0,3.0,11.0,1.0,0.0,,0.0,1.0,-1.0,1.0,0.0,415.0
182905,2016,,-2,,0,,-6,,-6,187.0,,,,,,,,,0.0,,0.0,,0.0,,0.0,,0.0,4.0,1.0,1.0,,0.0,,0.0,1.0,1.0,110.0,,0.0,,0.0,,0.0,,0.0,66.0,,,,,,,,,0.0,,,,,,,,,0.0,,0.0,,0.0,,0.0,,0.0,14.0,,0.0,,0.0,,0.0,,0.0,6.0,1.0,1.0,,0.0,,0.0,1.0,1.0,111.0
182906,2016,,-2,,0,,-6,,-5,126.0,,,,,,,,,0.0,,0.0,,0.0,,0.0,,0.0,1.0,,0.0,,0.0,,0.0,,0.0,78.0,,0.0,,0.0,,0.0,,0.0,28.0,,,,,,,,,0.0,,,,,,,,,0.0,,0.0,,0.0,,0.0,,0.0,8.0,,0.0,,0.0,,0.0,,0.0,4.0,,0.0,,0.0,,0.0,,0.0,93.0


In [128]:
# Checking to see how to restore the column MultiIndex

m = pd.MultiIndex.from_tuples(dfpivot.columns)

In [136]:
df = dfwide.set_axis(m, axis = 1, inplace=False)

In [8]:
dfpivot = dfpivot.reset_index()

In [139]:
dfpivot = dfpivot.sort_index(axis=1)

In [142]:
# The population columns were estimated by multiplying by a percentage.
# So they need to be rounded to the nearest whole person.

dfpivot = dfpivot.round(0)

In [143]:
dfpivot[:5]

,,ALL,ALL,ALL,ALL,ALL,ASI,ASI,ASI,ASI,ASI,BLA,BLA,BLA,BLA,BLA,ECO,ECO,ECO,ECO,ECO,HIS,HIS,HIS,HIS,HIS,IND,IND,IND,IND,IND,PCI,PCI,PCI,PCI,PCI,SPE,SPE,SPE,SPE,SPE,TWO,TWO,TWO,TWO,TWO,WHI,WHI,WHI,WHI,WHI
DISTRICT,Year,DAE,EXP,ISS,OSS,POP,DAE,EXP,ISS,OSS,POP,DAE,EXP,ISS,OSS,POP,DAE,EXP,ISS,OSS,POP,DAE,EXP,ISS,OSS,POP,DAE,EXP,ISS,OSS,POP,DAE,EXP,ISS,OSS,POP,DAE,EXP,ISS,OSS,POP,DAE,EXP,ISS,OSS,POP,DAE,EXP,ISS,OSS,POP
1902,2016,2.0,,46.0,2.0,622.0,,,,,435.0,,,1.0,,2612.0,1.0,,23.0,1.0,21024.0,,,7.0,1.0,4354.0,,,,,0.0,,,,,0.0,1.0,,19.0,1.0,8521.0,,,1.0,,3172.0,7.0,,31.0,1.0,51564.0
1903,2016,2.0,,380.0,28.0,1381.0,,,,,1243.0,,,53.0,1.0,7457.0,5.0,,284.0,23.0,75541.0,,,17.0,1.0,13258.0,,,,,138.0,,,,,276.0,1.0,,77.0,7.0,16020.0,,,5.0,1.0,3591.0,5.0,,305.0,22.0,112413.0
1904,2016,2.0,,117.0,2.0,916.0,,,1.0,,641.0,1.0,,19.0,1.0,8061.0,1.0,,80.0,1.0,51754.0,,,10.0,,9343.0,,,,,458.0,,,,,92.0,1.0,,9.0,1.0,7511.0,,,1.0,1.0,3481.0,1.0,,80.0,1.0,69524.0
1906,2016,2.0,,21.0,2.0,425.0,,,,,340.0,,,,,3570.0,1.0,,15.0,1.0,18190.0,1.0,,,,4888.0,,,,,0.0,,,,,0.0,1.0,,8.0,1.0,4675.0,,,,,892.0,1.0,,21.0,8.0,32852.0
1907,2016,163.0,,996.0,234.0,3693.0,,,1.0,,2954.0,94.0,1.0,458.0,125.0,99711.0,148.0,1.0,859.0,198.0,273651.0,22.0,,265.0,46.0,152152.0,1.0,,1.0,1.0,739.0,,,,,0.0,34.0,,153.0,58.0,31390.0,1.0,,26.0,1.0,10340.0,42.0,,236.0,52.0,103035.0


In [144]:
dfpivot.to_csv("../data/processed/stppFlat.csv")

In [148]:
dfpivot.to_json("../data/processed/stpp.json")

In [147]:
# Probably CSV will be fine.

import sqlalchemy
dfpivot.to_sql("schools", "sqlite")

ArgumentError: Could not parse rfc1738 URL from string 'sqlite'
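
The error above comes from passing the bare word "sqlite" where SQLAlchemy expects a full database URL (something of the form sqlite:///path/to/file.db). If SQLite output turns out to be worth having, one option that avoids SQLAlchemy entirely is to hand `to_sql` a standard-library `sqlite3` connection. The sketch below uses an in-memory database and a small made-up frame with flat column names (`ALL_POP` is a placeholder); I would expect the two-level columns of dfpivot to need flattening first, but I have not tried it.

```python
import sqlite3

import pandas as pd

# Small stand-in for a flattened version of the real table
flat = pd.DataFrame({"DISTRICT": [1902, 1903],
                     "Year": [2016, 2016],
                     "ALL_POP": [622.0, 1381.0]})

conn = sqlite3.connect(":memory:")   # use a file path instead to keep the database
flat.to_sql("schools", conn, index=False)

# Read the row count back to confirm the table was written
n_rows = conn.execute("SELECT COUNT(*) FROM schools").fetchone()[0]
conn.close()
print(n_rows)
```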

In [1]:
first_year = 2006 # the year 2006 is the first year on the TEA site
last_year = 2016



def getYear(year):
    year_col = "YR{}".format(str(year)[-2:])
    apple_path = '../data/from_agency/by_region/REGION_{}_DISTRICT_summary_{}.csv'
    one_year = [pd.read_csv(apple_path.format(str(region).zfill(2),str(year)[-2:]), 
                            index_col = ["DISTRICT","HEADING"], dtype = {year_col: int})
                for region in range(1,21)]
    a = pd.concat(one_year)
    
    # a = a.set_index(["DISTRICT","HEADING"] )
    a = a[~a.index.duplicated(keep='last')]  # a single row was causing a non-unique multiindex error 
    # print(a.loc[31901])
    a = formatDF(a, year, year_col)  # formatDF takes (apple, year, year_col)
    return a

In [2]:


from decimal import Decimal


def getRatio(distPop, racePop, all_punishments, group_punishments):
    # Calculating how much more or less often the demographic group got the punishment, compared to the student
    # population as a whole. For instance, "0.5" in the disparity column indicates the group got the punishment
    # 50% more often than average for the student population, and "-0.5" indicates half as often.

    """
    >>> getRatio(200, 20, 20, 10)
    4.0
    >>> getRatio(200, 20, 20, 2)
    0.0
    >>> print(getRatio(200, 0, 20, 0))
    None
    """

    if max(racePop, group_punishments) == 0:
        return None
    elif all_punishments == 0:
        return 0
    else:
        disparity = (group_punishments / (max(all_punishments, group_punishments))
                     / (max(racePop, group_punishments) / distPop)) - 1
        disparity = Decimal(disparity)
        disparity = disparity.quantize(Decimal('0.01'))
    return float(disparity)

In [4]:
import json
with open("../geojson/base_districts.geojson") as json_data:
    district_map = json.load(json_data)  # the with block closes the file on exit

In [5]:
shapeIDs = set()

for shape in district_map["features"]:
    shape["id"] = shape["properties"]["DISTRICT_N"]
    assert shape["id"] not in shapeIDs, "id already in list: %r" % shape["id"]
    shapeIDs.add(shape["id"])
    
    # These two fields look redundant. Let's try deleting them.
    
    shape["properties"].pop("DISTRICT_1", None)
    shape["properties"].pop("OBJECTID_1", None)
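
The loop above copies each feature's DISTRICT_N into a top-level id and checks that no id repeats. A minimal sketch of the same pattern on a made-up two-feature collection (geometry omitted, helper name assign_ids hypothetical) might look like this:

```python
toy_map = {"type": "FeatureCollection", "features": [
    {"type": "Feature",
     "properties": {"DISTRICT_N": 228905, "DISTRICT_1": "x", "OBJECTID_1": 1}},
    {"type": "Feature",
     "properties": {"DISTRICT_N": 67908, "OBJECTID_1": 2}},
]}

def assign_ids(feature_collection):
    """Set each feature's id from DISTRICT_N and drop two redundant fields."""
    seen = set()
    for shape in feature_collection["features"]:
        shape["id"] = shape["properties"]["DISTRICT_N"]
        if shape["id"] in seen:
            raise ValueError("id already in list: %r" % shape["id"])
        seen.add(shape["id"])
        # pop() with a default does nothing when the key is absent.
        shape["properties"].pop("DISTRICT_1", None)
        shape["properties"].pop("OBJECTID_1", None)
    return seen

ids = assign_ids(toy_map)
print(sorted(ids))
```

The sketch raises ValueError instead of using assert, since assert statements are skipped when Python runs with the -O flag.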


In [6]:
type(district_map["features"][1]['geometry']['coordinates'][0][1][1])

float

In [7]:
# For districts overall, need columns that show what percentage of the state population they have
# and what percentage of the punishments?

def getLE(x):
    
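    # Collects the correct values from the merged row (columns from "apple" and "district")
    # and calls the "impossible" function, which looks for data errors.
    # Returns 1 when the district has no usable population and 0 for the "ALL" rows.
    
    distPop = x["DPETALLC"]
    # "in (0, None, np.nan)" misses most NaN values, because NaN never equals itself
    if pd.isnull(distPop) or distPop == 0:
        return 1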
    elif x["HEADING NAME"] == "ALL":
        return 0
    else:    
        all_punishments = x["all_punishments"]
        # all_punishments = apple.loc[x["DISTRICT"]][x["SECTION"]]["ALL"]
        group_punishments = x[year_col]
        # trying to make this run faster by returning info for two columns, then splitting them
        raceP = x["DPET{}P".format(x["HEADING NAME"][:3])]
        return impossible(distPop, raceP, all_punishments, group_punishments)


In [8]:
def getScale(x, punishment_totals, statewide_students_count):
    
    """
    Picks the populations to compare, then calls getFisher for the real calculation.

    For the "HEADING NAME == ALL" rows, the district is compared to the state: "distPop" is the
    statewide student count, "racePop" is the district's enrollment, and "all_punishments" is the
    statewide total for that punishment type.

    For the other rows, the group is compared to its district: "distPop" is the district's
    enrollment and "racePop" is the group's estimated headcount (its percentage times distPop).
    """
    
    group_punishments = x[year_col]
    if x["HEADING NAME"] == "ALL":
        distPop = statewide_students_count
        racePop = x["DPETALLC"]
        all_punishments = punishment_totals[x["SECTION"]]
    else:
        distPop = x["DPETALLC"]
        racePop = x["DPET{}P".format(x["HEADING NAME"])] * distPop * .01
        if pd.isna(racePop):
            return None
        if pd.isnull(x["all_punishments"]):
            print("null all_punishments: " + str(x))
        all_punishments = x["all_punishments"]
    return getFisher(distPop, racePop, all_punishments, group_punishments)

def getPercentage(x, punishment_totals):
    if x["HEADING NAME"] == "ALL":
        return x["all_punishments"] / punishment_totals[x["SECTION"]] * 100
    else:
        return x[year_col] / x["all_punishments"] * 100
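
To illustrate the two cases getScale distinguishes, here is a minimal sketch that only selects the populations and stops before the getFisher call. The helper name choose_populations and the statewide count of 5,000,000 are made up for illustration; the rows are plain dicts standing in for DataFrame rows.

```python
def choose_populations(row, statewide_students_count):
    """Return (distPop, racePop) following the branching in getScale."""
    if row["HEADING NAME"] == "ALL":
        # District versus state: the district plays the role of the "group".
        return statewide_students_count, row["DPETALLC"]
    # Group versus district: convert the group's percentage to a headcount.
    distPop = row["DPETALLC"]
    pct = row["DPET{}P".format(row["HEADING NAME"])]
    return distPop, pct * distPop * .01

all_row = {"HEADING NAME": "ALL", "DPETALLC": 568}
whi_row = {"HEADING NAME": "WHI", "DPETALLC": 568, "DPETWHIP": 82.9}

print(choose_populations(all_row, 5000000))
print(choose_populations(whi_row, 5000000))
```

Because the group headcount is derived from a rounded percentage, it is an estimate and will usually not be a whole number.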

In [9]:
# Need to merge columns of apple and district.

years = list(range(first_year, last_year + 1))

pop_stats = ("DPETALLC","DPETALLP", "DPETBLAP","DPETHISP","DPETWHIP","DPETINDP","DPETASIP","DPETPCIP",
             "DPETTWOP","DPETECOP","DPETSPEP")



punishments = ('EXP','DAE','OSS','ISS')

fail = {} # for testing
noScale = {}

for year in years:
    print("starting year " + str(year))

    apple = getYear(year)
    # the path to the files in the district demographics directory
    districtPath = '../data/from_agency/districts/district{}.dat'.format(year)
    district = populations(districtPath)

    statewide_students_count = district["DPETALLC"].sum()
    year_col = "YR" + str(year)[-2:]

    apple = apple.reset_index()
    appleAll = apple[apple["HEADING NAME"] == "ALL"].rename({year_col: "all_punishments"}, axis = 1).drop(["HEADING NAME", "HEADING"], axis = 1)
    print(appleAll[:5])
    apple = apple.merge(district, how = "left", left_on = "DISTRICT", right_index = True)
    
    apple = apple[apple["DPETALLC"].notnull()]
    
    punishment_totals = {}
    for p in apple["SECTION"].unique():
        # combine both conditions in one mask; chained boolean indexing triggers a reindexing warning
        punishment_totals[p] = apple[(apple["SECTION"] == p) & (apple["HEADING NAME"] == "ALL")][year_col].sum()
        
    # apple[18464:18470]  previous problem rows, gone because of the .notnull()
    
    apple = apple.merge(appleAll, how = "left", left_on = ["DISTRICT","SECTION"], right_on = ["DISTRICT","SECTION"])
    
    """    
    # This line will run slowly because for each row, it searches the entire dataframe
    apple["all_punishments"] = apple.apply(lambda x: 
                                               apple[apple["DISTRICT"] == x["DISTRICT"]][apple["SECTION"] == x["SECTION"]][apple["HEADING NAME"] == "ALL"][year_col].values[0], axis=1)
    """
    
    # New column will show the district's percentage of the state's student population.
    district["DPETALLP"] = (district["DPETALLC"] / statewide_students_count * 100).round(2)
    
    apple["LikelyError"] = apple.apply(getLE, axis=1)
    apple["Scale"] = apple.apply(lambda x: getScale(x, punishment_totals, statewide_students_count), axis=1)
    apple["Percentage"] = apple.apply(lambda x: getPercentage(x, punishment_totals), axis=1).round(3)
    
    apple = apple.set_index(["DISTRICT","SECTION","HEADING NAME"])
    apple = apple.sort_index() # trying to improve speed
    
    # populating the GeoJSON file, which already has geometry for the districts.
    
    for entry in district_map["features"]:
        if entry["id"] in district.index:
            entry["properties"][year] = {}
            for stat in pop_stats:
                # This will give NaN (numpy.float64) when empty
                if pd.notnull(district.loc[entry["id"]][stat]):
                    try:
                        entry["properties"][year][stat] = district.loc[entry["id"]][stat]
                    except KeyError:
                        # for when the map has a district not in the TEA's data
                        print("no stats for " + str(year) + " " + str(entry["id"]))
                        entry["properties"][year][stat] = None
        if entry["id"] in apple.index.get_level_values(0):
            for punishment in punishments:
                entry["properties"][year][punishment] = {}
                for demo in demos:
                    if (entry["id"],punishment,demo) in apple.index:
                    # if pd.notnull(apple.loc[entry["id"],punishment,demo][year_col]): # should prevent empty dicts at "demo" level
                        entry["properties"][year][punishment][demo] = {} 
                        try:
                            entry["properties"][year][punishment][demo]["C"] = int(apple.loc[entry["id"],punishment,demo][year_col])
                            entry["properties"][year][punishment][demo]["E"] = int(apple.loc[entry["id"],punishment,demo]["LikelyError"])
                            entry["properties"][year][punishment][demo]["%"] = float(apple.loc[entry["id"],punishment,demo]["Percentage"])
                        except (KeyError, TypeError, ValueError, OverflowError):  # e.g. int(NaN) raises ValueError
                            fail[entry["id"]] = (year,punishment,demo)
                        try:
                            entry["properties"][year][punishment][demo]["S"] = int(apple.loc[entry["id"],punishment,demo]["Scale"])
                        except (KeyError, TypeError, ValueError, OverflowError):  # Scale is None or NaN for some rows
                            noScale[entry["id"]] = (year,punishment,demo)
    print(district_map["features"][30]["properties"])
                    # print("Nothing for {} {} {}".format(entry["id"],punishment,demo))
                    # impossible(distPop, racePop, all_punishments, group_punishments)

starting year 2006
       DISTRICT SECTION  all_punishments
12970     31901     EXP               73
12971     31901     DAE              826
12972     31901     OSS             3144
12973     31901     ISS            15309
12974    108902     EXP               18




{'DISTRICT_C': '228905', 'DISTRICT_N': 228905, 'OBJECTID': 31, 'NAME2': 'Apple Springs', 'DISTRICT': '228-905', 'OBJECTID_2': 1113, 'DISTNAME': 'APPLE SPRINGS ISD', 'REGION': 6, 2006: {'DPETALLC': 167, 'DPETALLP': 0.0, 'DPETBLAP': 16, 'DPETHISP': 2, 'DPETWHIP': 83, 'DPETECOP': 59.9, 'DPETSPEP': 20, 'EXP': {'ALL': {'C': 0, 'E': 0, '%': 0.0, 'S': -1}}, 'DAE': {'ALL': {'C': 0, 'E': 0, '%': 0.0, 'S': -4}}, 'OSS': {'ALL': {'C': 2, 'E': 0, '%': 0.0, 'S': -6}, 'SPE': {'C': 1, 'E': 0, '%': 50.0, 'S': 1}, 'ECO': {'C': 1, 'E': 0, '%': 50.0, 'S': -1}, 'WHI': {'C': 1, 'E': 0, '%': 50.0, 'S': -1}}, 'ISS': {'ALL': {'C': 30, 'E': 0, '%': 0.002, 'S': -5}, 'SPE': {'C': 29, 'E': 0, '%': 96.667, 'S': 6}, 'ECO': {'C': 20, 'E': 0, '%': 66.667, 'S': 1}, 'WHI': {'C': 25, 'E': 0, '%': 83.333, 'S': -1}}}}
starting year 2007
       DISTRICT SECTION  all_punishments
13039     31901     EXP              161
13040     31901     DAE             1083
13041     31901     OSS             3797
13042     31901     ISS  

       DISTRICT SECTION  all_punishments
13065     31901     EXP               58
13066     31901     DAE              889
13067     31901     OSS             4952
13068     31901     ISS            16379
13069    108902     EXP                2
{'DISTRICT_C': '228905', 'DISTRICT_N': 228905, 'OBJECTID': 31, 'NAME2': 'Apple Springs', 'DISTRICT': '228-905', 'OBJECTID_2': 1113, 'DISTNAME': 'APPLE SPRINGS ISD', 'REGION': 6, 2006: {'DPETALLC': 167, 'DPETALLP': 0.0, 'DPETBLAP': 16, 'DPETHISP': 2, 'DPETWHIP': 83, 'DPETECOP': 59.9, 'DPETSPEP': 20, 'EXP': {'ALL': {'C': 0, 'E': 0, '%': 0.0, 'S': -1}}, 'DAE': {'ALL': {'C': 0, 'E': 0, '%': 0.0, 'S': -4}}, 'OSS': {'ALL': {'C': 2, 'E': 0, '%': 0.0, 'S': -6}, 'SPE': {'C': 1, 'E': 0, '%': 50.0, 'S': 1}, 'ECO': {'C': 1, 'E': 0, '%': 50.0, 'S': -1}, 'WHI': {'C': 1, 'E': 0, '%': 50.0, 'S': -1}}, 'ISS': {'ALL': {'C': 30, 'E': 0, '%': 0.002, 'S': -5}, 'SPE': {'C': 29, 'E': 0, '%': 96.667, 'S': 6}, 'ECO': {'C': 20, 'E': 0, '%': 66.667, 'S': 1}, 'WHI': {'C':

       DISTRICT SECTION  all_punishments
17396     31901     EXP               51
17397     31901     DAE              848
17398     31901     OSS             3717
17399     31901     ISS            14344
17400    108902     EXP                2
{'DISTRICT_C': '228905', 'DISTRICT_N': 228905, 'OBJECTID': 31, 'NAME2': 'Apple Springs', 'DISTRICT': '228-905', 'OBJECTID_2': 1113, 'DISTNAME': 'APPLE SPRINGS ISD', 'REGION': 6, 2006: {'DPETALLC': 167, 'DPETALLP': 0.0, 'DPETBLAP': 16, 'DPETHISP': 2, 'DPETWHIP': 83, 'DPETECOP': 59.9, 'DPETSPEP': 20, 'EXP': {'ALL': {'C': 0, 'E': 0, '%': 0.0, 'S': -1}}, 'DAE': {'ALL': {'C': 0, 'E': 0, '%': 0.0, 'S': -4}}, 'OSS': {'ALL': {'C': 2, 'E': 0, '%': 0.0, 'S': -6}, 'SPE': {'C': 1, 'E': 0, '%': 50.0, 'S': 1}, 'ECO': {'C': 1, 'E': 0, '%': 50.0, 'S': -1}, 'WHI': {'C': 1, 'E': 0, '%': 50.0, 'S': -1}}, 'ISS': {'ALL': {'C': 30, 'E': 0, '%': 0.002, 'S': -5}, 'SPE': {'C': 29, 'E': 0, '%': 96.667, 'S': 6}, 'ECO': {'C': 20, 'E': 0, '%': 66.667, 'S': 1}, 'WHI': {'C':

       DISTRICT SECTION  all_punishments
17394     31901     EXP               38
17395     31901     DAE              671
17396     31901     OSS             2715
17397     31901     ISS             8848
17398    108902     EXP                2
{'DISTRICT_C': '228905', 'DISTRICT_N': 228905, 'OBJECTID': 31, 'NAME2': 'Apple Springs', 'DISTRICT': '228-905', 'OBJECTID_2': 1113, 'DISTNAME': 'APPLE SPRINGS ISD', 'REGION': 6, 2006: {'DPETALLC': 167, 'DPETALLP': 0.0, 'DPETBLAP': 16, 'DPETHISP': 2, 'DPETWHIP': 83, 'DPETECOP': 59.9, 'DPETSPEP': 20, 'EXP': {'ALL': {'C': 0, 'E': 0, '%': 0.0, 'S': -1}}, 'DAE': {'ALL': {'C': 0, 'E': 0, '%': 0.0, 'S': -4}}, 'OSS': {'ALL': {'C': 2, 'E': 0, '%': 0.0, 'S': -6}, 'SPE': {'C': 1, 'E': 0, '%': 50.0, 'S': 1}, 'ECO': {'C': 1, 'E': 0, '%': 50.0, 'S': -1}, 'WHI': {'C': 1, 'E': 0, '%': 50.0, 'S': -1}}, 'ISS': {'ALL': {'C': 30, 'E': 0, '%': 0.002, 'S': -5}, 'SPE': {'C': 29, 'E': 0, '%': 96.667, 'S': 6}, 'ECO': {'C': 20, 'E': 0, '%': 66.667, 'S': 1}, 'WHI': {'C':

       DISTRICT SECTION  all_punishments
17141     31901     EXP                2
17142     31901     DAE              602
17143     31901     OSS             3091
17144     31901     ISS             8413
17145    108902     EXP                2
{'DISTRICT_C': '228905', 'DISTRICT_N': 228905, 'OBJECTID': 31, 'NAME2': 'Apple Springs', 'DISTRICT': '228-905', 'OBJECTID_2': 1113, 'DISTNAME': 'APPLE SPRINGS ISD', 'REGION': 6, 2006: {'DPETALLC': 167, 'DPETALLP': 0.0, 'DPETBLAP': 16, 'DPETHISP': 2, 'DPETWHIP': 83, 'DPETECOP': 59.9, 'DPETSPEP': 20, 'EXP': {'ALL': {'C': 0, 'E': 0, '%': 0.0, 'S': -1}}, 'DAE': {'ALL': {'C': 0, 'E': 0, '%': 0.0, 'S': -4}}, 'OSS': {'ALL': {'C': 2, 'E': 0, '%': 0.0, 'S': -6}, 'SPE': {'C': 1, 'E': 0, '%': 50.0, 'S': 1}, 'ECO': {'C': 1, 'E': 0, '%': 50.0, 'S': -1}, 'WHI': {'C': 1, 'E': 0, '%': 50.0, 'S': -1}}, 'ISS': {'ALL': {'C': 30, 'E': 0, '%': 0.002, 'S': -5}, 'SPE': {'C': 29, 'E': 0, '%': 96.667, 'S': 6}, 'ECO': {'C': 20, 'E': 0, '%': 66.667, 'S': 1}, 'WHI': {'C':
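
The loop above attaches each district's overall count to its demographic rows with a merge, rather than the commented-out row-by-row lookup. A minimal sketch of that pattern on a made-up four-row frame, assuming pandas is installed (the column names follow the notebook; the values are invented):

```python
import pandas as pd

apple = pd.DataFrame({
    "DISTRICT": [1902, 1902, 1902, 1902],
    "SECTION": ["DAE", "DAE", "EXP", "EXP"],
    "HEADING NAME": ["ALL", "ECO", "ALL", "ECO"],
    "YR16": [2, 1, 0, 0],
})

# One row per (DISTRICT, SECTION) holding that district's overall count.
apple_all = (apple[apple["HEADING NAME"] == "ALL"]
             .rename(columns={"YR16": "all_punishments"})
             .drop(columns=["HEADING NAME"]))

# A left merge copies the overall count onto every demographic row.
merged = apple.merge(apple_all, how="left", on=["DISTRICT", "SECTION"])

# Statewide totals per punishment type, taken from the "ALL" rows only.
punishment_totals = (apple[apple["HEADING NAME"] == "ALL"]
                     .groupby("SECTION")["YR16"].sum().to_dict())

print(merged)
print(punishment_totals)
```

The merge does one pass over the data, whereas the apply-based lookup rescans the whole frame for every row.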

In [10]:
pd.notnull(district.loc[228905])

DISTNAME    True
REGION      True
DPETALLC    True
DPETBLAP    True
DPETHISP    True
DPETWHIP    True
DPETINDP    True
DPETASIP    True
DPETPCIP    True
DPETTWOP    True
DPETECOP    True
DPETSPEP    True
DPETALLP    True
Name: 228905, dtype: bool

In [25]:
apple[:5]

DISTRICT,SECTION,HEADING NAME,HEADING,YR16,DISTNAME,REGION,DPETALLC,DPETBLAP,DPETHISP,DPETWHIP,DPETINDP,DPETASIP,DPETPCIP,DPETTWOP,DPETECOP,DPETSPEP,all_punishments,LikelyError,Scale,Percentage
1902,DAE,ALL,D09,2,CAYUGA ISD,7,568,4.2,7.0,82.9,0.0,0.7,0.0,5.1,33.8,13.7,2,0,-4,0.002
1902,DAE,ECO,E10,1,CAYUGA ISD,7,568,4.2,7.0,82.9,0.0,0.7,0.0,5.1,33.8,13.7,2,0,1,50.0
1902,DAE,SPE,D08,1,CAYUGA ISD,7,568,4.2,7.0,82.9,0.0,0.7,0.0,5.1,33.8,13.7,2,0,1,50.0
1902,DAE,WHI,C21,7,CAYUGA ISD,7,568,4.2,7.0,82.9,0.0,0.7,0.0,5.1,33.8,13.7,2,0,1,350.0
1902,EXP,ALL,D06,0,CAYUGA ISD,7,568,4.2,7.0,82.9,0.0,0.7,0.0,5.1,33.8,13.7,0,0,-1,0.0


In [11]:
int(apple.loc[entry["id"],punishment,demo][year_col])

33

In [12]:
float(apple.loc[entry["id"],punishment,demo]["Percentage"])

1.275

In [13]:
apple.loc[entry["id"],punishment,demo]["Scale"]

2

In [14]:
with open('../geojson/districts_with_data.geojson', 'w') as fp:
    json.dump(district_map, fp, default=int)  # the with block closes the file
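
The default=int argument matters because pandas hands back NumPy integer types, which the json module cannot serialize on its own. Below is a minimal sketch of how default is consulted. It uses decimal.Decimal as a stand-in for a non-serializable number so that it needs only the standard library; the record values are illustrative.

```python
import json
from decimal import Decimal

record = {"C": Decimal("33"), "%": 1.275, "S": 2}

# Without a fallback, json refuses types it does not know.
try:
    json.dumps(record)
    failed = False
except TypeError:
    failed = True

# With default=int, json calls int() on each object it cannot serialize.
encoded = json.dumps(record, default=int)
print(failed, encoded)
```

One caveat with this approach: int() truncates, so any non-serializable value with a fractional part would silently lose it. Plain Python floats are unaffected because json handles them before default is ever called.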

In [15]:
import geojson

# FeatureCollection expects the list of features, not the whole collection dict
fc = geojson.FeatureCollection(district_map["features"])

In [16]:
type(fc)

geojson.feature.FeatureCollection

In [17]:
help(geojson.load)

Help on function load in module geojson.codec:

load(fp, cls=<class 'json.decoder.JSONDecoder'>, parse_constant=<function _enforce_strict_numbers at 0x115eb7840>, object_hook=<bound method GeoJSON.to_instance of <class 'geojson.base.GeoJSON'>>, **kwargs)



In [18]:
with open("../geojson/districts_with_data.geojson") as geo_data:
    fc = geojson.load(geo_data)  # the with block closes the file

In [19]:
district_map["features"][3]

{'geometry': {'coordinates': [[[-100.27515251768764, 34.7111389239958],
    [-100.28657751973152, 34.71077492247225],
    [-100.31184752675881, 34.71081492105591],
    [-100.31216652699148, 34.703736920181406],
    [-100.31243352377226, 34.66611091322487],
    [-100.33106252812291, 34.66589691201054],
    [-100.3312295268639, 34.64954390931071],
    [-100.34276552863754, 34.649149909104864],
    [-100.34966653147917, 34.64928690860006],
    [-100.3499905290419, 34.62083190423951],
    [-100.37889753639715, 34.620883902110876],
    [-100.40052054204337, 34.621032901222385],
    [-100.41585854499239, 34.6209689004178],
    [-100.41592154324587, 34.589524896120565],
    [-100.41579454076648, 34.558511890082706],
    [-100.41595853781935, 34.51835488290953],
    [-100.41596053666917, 34.50006488057768],
    [-100.41684753212495, 34.42457586695365],
    [-100.42922853412152, 34.42467186533117],
    [-100.42928953339748, 34.41542686458377],
    [-100.4300625317414, 34.391619859961324],
    [

In [20]:
# Testing to see if the file we produced is valid GeoJSON

fc.is_valid

True

In [21]:
district.loc[67908]

DISTNAME    RISING STAR ISD
REGION                   14
DPETALLC                160
DPETBLAP                  0
DPETHISP               19.4
DPETWHIP               79.4
DPETINDP                  0
DPETASIP                  0
DPETPCIP                  0
DPETTWOP                1.3
DPETECOP               70.6
DPETSPEP                5.6
DPETALLP                  0
Name: 67908, dtype: object

In [22]:
len(fail)

4

In [23]:
len(noScale)

779

In [24]:
# df[df['A'] > 0]

q = district_map["features"][900]["properties"]["DISTRICT_N"]

district.loc[q]

# district[district["DISTRICT"] == 167903]

DISTNAME    RANDOLPH FIELD ISD
REGION                      20
DPETALLC                  1357
DPETBLAP                  17.5
DPETHISP                  21.9
DPETWHIP                  45.3
DPETINDP                   0.3
DPETASIP                   2.9
DPETPCIP                   1.1
DPETTWOP                  11.1
DPETECOP                   8.8
DPETSPEP                   7.1
DPETALLP                  0.03
Name: 15906, dtype: object

In [None]:
def weirdProcessForTotals(apple):
    
    # Avoid using this.

    # Adding totals for each discipline in each district, by adding up actions 
    # against special ed students and non-special ed students. This will be 
    # inefficient because it makes a dict list first instead of staying in pandas.
    
    non_special = {"D06": ("D05","D-EXPULSION ACTIONS"), 
                   "D09": ("D08","E-DAEP PLACEMENTS"), 
                   "D12": ("D11", "F-OUT OF SCHOOL SUSPENSIONS"), 
                   "D15": ("D14", "G-IN SCHOOL SUSPENSIONS")}
    
    all_actions = []
    
    # if it was a .csv, the headers would be ["DISTRICT", "SECTION", "HEADING", "HEADING NAME", year_col]
    
    unfound = []
    
    for d in apple.index.get_level_values(0).unique():
        for key in non_special:
            try: 
                a = apple.loc[(d, key)][year_col]
            except KeyError:
                a = 0
            try:
                b = apple.loc[(d, non_special[key][0])][year_col]
            except KeyError:
                b = 0
            if a < 0: # in case of dummy values like -999
                a = 1
            if b < 0:
                b = 1
            total = a + b
            all_actions.append({"DISTRICT": d, "HEADING": key, "SECTION": non_special[key][1], 
                                "HEADING NAME": "ALL", year_col: total})
    
    new = pd.DataFrame(all_actions)
    new = new.set_index(["DISTRICT", "HEADING"])
    
    apple = pd.concat([apple, new])  # DataFrame.append was removed in pandas 2.0
    
    return apple

In [11]:
# old

for district in df["DISTRICT"].unique():
    for p in pdict.values():
        df.loc[((df['DISTRICT'] == district) & (df['SECTION'] == p) & (df['Year'] == year), 'ALL')] = \
        df.loc[((df['DISTRICT'] == district) & (df['SECTION'] == "DISCRETIONARY " + p) & (df['Year'] == year), 'ALL')] + \
        df.loc[((df['DISTRICT'] == district) & (df['SECTION'] == "MANDATORY " + p) & (df['Year'] == year), 'ALL')]

KeyError: 'the label [ALL] is not in the [columns]'

In [3]:
#old

        for district in df["DISTRICT"].unique():
            df.loc[((df["DISTRICT"] == district) & (df['HEADING NAME'] == "MANDATORY " + punishment), 'Count')] += \
            df.loc[((df["DISTRICT"] == district) & (df['HEADING NAME'] == "DISCRETIONARY " + punishment), 'Count')]

            df = df.replace(to_replace="MANDATORY " + punishment, value="ALL")
            


IndentationError: unexpected indent (<ipython-input-3-3494c545380c>, line 3)