In [1]:
import pandas as pd
import numpy as np

Read in the merged data file. 

In [2]:
df = pd.read_csv('../../data/50_states.csv')
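
Some of the code-like columns (for example 'CSA' and 'CBSA') hold both numbers and the letter 'N'. A minimal sketch of one way to handle that at read time, assuming a toy in-memory CSV rather than the real file: passing `dtype` for those columns makes pandas read them as strings from the start. The sample rows below are illustrative only.

```python
import io
import pandas as pd

# Toy stand-in for the real file; only the column names come from the notebook.
csv_text = "LEAID,CSA,CBSA\n2700001,N,20260\n2700006,359,31860\n"

# Read the code-like columns as strings so 'N' and '359' share one dtype.
sample = pd.read_csv(io.StringIO(csv_text), dtype={"CSA": str, "CBSA": str})
print(sample.dtypes)
```

With the real file, the same `dtype` mapping would be passed to the existing `pd.read_csv` call.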

Examine the first few rows.

In [3]:
df.head()

Unnamed: 0,LEAID,CENSUSID,FIPST,CONUM,CSA,CBSA,NAME,STNAME,STABBR,SCHLEV,...,FL_W61,FL_V95,FL_V02,FL_K14,FL_CE1,FL_CE2,FL_CE3,District Name,Total,Graduation Rate
0,2700001,24506901700000,27,27137,N,20260,MOUNTAIN IRON-BUHL,Minnesota,MN,3,...,R,R,R,R,M,M,M,MOUNTAIN IRON-BUHL,31.0,0.9355
1,2700005,24502200900000,27,27043,N,N,UNITED SOUTH CENTRAL,Minnesota,MN,3,...,R,R,R,R,M,M,M,UNITED SOUTH CENTRAL,42.0,0.881
2,2700006,24500720200000,27,27013,359,31860,MAPLE RIVER,Minnesota,MN,3,...,R,R,R,R,M,M,M,MAPLE RIVER,79.0,0.9747
3,2700007,24502320200000,27,27045,462,40340,KINGSLAND,Minnesota,MN,3,...,R,R,R,R,M,M,M,KINGSLAND,31.0,0.9677
4,2700008,24506920200000,27,27137,N,20260,ST LOUIS COUNTY,Minnesota,MN,3,...,R,R,R,R,M,M,M,ST LOUIS COUNTY,122.0,0.8607


Drop the columns left over from merging with the state-level data: 'District Name' as it's duplicated by 'NAME', and 'Total' because too many districts were missing the information. 

In [4]:
df = df.drop(columns=['District Name', 'Total'])

Set nominal features with mixed datatypes to be strings, and set all district names to be uppercase for consistency. 

In [5]:
df['NAME'] = df['NAME'].astype(str).str.upper()
df['STABBR'] = df['STABBR'].astype(str)

Set categorical feature 'AGCHRT' to be all ints for easier processing.

In [6]:
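# Assign the result back; replace(inplace=True) on a selected column may not update df.
df['AGCHRT'] = df['AGCHRT'].replace('N', 0).astype(int)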
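
As a rough illustration of the conversion, here is a sketch on a toy column that mixes the letter 'N', digit strings, and ints. The toy values are assumptions, not taken from the data. Replacing 'N' with 0 and then casting gives a single integer dtype.

```python
import pandas as pd

# Toy column with mixed types, similar in spirit to 'AGCHRT' (values are made up).
toy = pd.DataFrame({"AGCHRT": pd.Series(["N", "3", 1, "2"], dtype=object)})

# Map the 'N' code to 0, then cast everything to int and assign the result back.
toy["AGCHRT"] = toy["AGCHRT"].replace("N", 0).astype(int)
print(toy["AGCHRT"].tolist())
```

If the real column held any other non-numeric code, `astype(int)` would raise an error, which is a useful signal that another code needs mapping first.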

Check for null values. 

In [7]:
df.isna().sum()

LEAID                0
CENSUSID             0
FIPST                0
CONUM                0
CSA                  0
                  ... 
FL_K14               0
FL_CE1               0
FL_CE2               0
FL_CE3               0
Graduation Rate    306
Length: 263, dtype: int64

Drop any rows with null values. The value they are missing, Graduation Rate, is the target value, so these rows cannot be used.

In [8]:
df.dropna(inplace=True)
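
A plain `dropna()` removes a row if any column is null. A hedged alternative, shown on toy data, is to name the target column in `subset` so that only rows missing the target are removed. The toy frame and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: one row is missing a feature, another is missing the target.
toy = pd.DataFrame({
    "LEAID": [1, 2, 3],
    "TOTALREV": [100.0, np.nan, 300.0],
    "Graduation Rate": [0.93, 0.88, np.nan],
})

# Only rows without a target value are dropped; row 2 keeps its missing feature.
labeled = toy.dropna(subset=["Graduation Rate"])
print(labeled)
```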

Check to make sure none of the rows are duplicates. 

In [9]:
df.duplicated().sort_values(ascending=False)

2118      True
2116      True
0        False
7910     False
7902     False
         ...  
3941     False
3942     False
3943     False
3944     False
11758    False
Length: 11453, dtype: bool

Drop the duplicate rows. 

In [10]:
# Drops the repeated rows flagged above (2116 and 2118) without hard-coding their labels.
df.drop_duplicates(inplace=True)
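
Sorting the boolean output works, but counting is often quicker to read. A small sketch on toy data (the rows are made up) that counts the duplicates, shows every copy side by side with `keep=False`, and removes the repeats:

```python
import pandas as pd

# Toy frame with one repeated row.
toy = pd.DataFrame({"LEAID": [1, 2, 2, 3], "NAME": ["A", "B", "B", "C"]})

n_dupes = int(toy.duplicated().sum())         # rows that repeat an earlier row
all_copies = toy[toy.duplicated(keep=False)]  # every member of a duplicate group
deduped = toy.drop_duplicates()               # keeps the first copy of each row
print(n_dupes, len(deduped))
```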

Drop the unneeded flag columns.

In [11]:
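# Match on the 'FL_' prefix so columns that merely contain 'FL' are not dropped by accident.
fls = [col for col in df.columns if col.startswith('FL_')]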
df.drop(columns=fls, inplace=True)
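
To see why the matching rule matters, here is a sketch comparing a substring test with a prefix test. The column 'INFLOW' is hypothetical and is only there to show a false match.

```python
import pandas as pd

# Two flag columns from the data, one real feature, and one hypothetical column name.
toy = pd.DataFrame(columns=["FL_V33", "FL_CE1", "TOTALREV", "INFLOW"])

by_substring = [c for c in toy.columns if "FL" in c]          # also catches 'INFLOW'
by_prefix = [c for c in toy.columns if c.startswith("FL_")]   # flag columns only
print(by_substring, by_prefix)
```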

Drop the rest of the columns with unneeded and non-financial data. 

In [12]:
df.drop(columns=['FIPST', 'CENSUSID', 'CONUM', 'CSA', 'CBSA', 'STNAME', 'SCHLEV', 'YEAR', 
                 'CCDNF', 'CENFILE', 'GSLO', 'GSHI', 'MEMBERSCH', 'WEIGHT'], inplace=True)

Find and drop the entries with bad inputs. A number of districts in the original file appear to have all of their financial data entered as negative numbers; the filter below flags them by checking 'TOTALREV' and 'TOTALEXP'.

In [13]:
bad_entries = df[(df['TOTALREV'] < 0) & (df['TOTALEXP'] < 0)]

In [14]:
len(bad_entries)

130

In [15]:
df.drop(bad_entries.index, inplace=True)
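
The filter above checks two columns. A stricter variant, sketched here on toy data, flags a row only when every listed financial column is negative. The column list and values are assumptions for illustration.

```python
import pandas as pd

# Toy frame: rows 11 and 13 are negative everywhere, row 12 has one negative value.
toy = pd.DataFrame(
    {
        "TOTALREV": [500, -500, 300, -1],
        "TOTALEXP": [450, -480, -2, -1],
        "T06": [20, -20, 15, -1],
    },
    index=[10, 11, 12, 13],
)

money_cols = ["TOTALREV", "TOTALEXP", "T06"]
all_negative = (toy[money_cols] < 0).all(axis=1)  # True only if every column is < 0
bad = toy[all_negative]
clean = toy.drop(bad.index)
print(len(bad), clean.index.tolist())
```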

Some missing values are still coded as -2, so convert them to NaN and see which columns they fall in.

In [16]:
df = df.replace({-2.0: np.nan})
df.isna().sum().sort_values(ascending=False).head(10)

T02    10501
T40     2149
T09     2149
T15     2149
T99     2148
T06     1633
V33       42
V15        0
V13        0
Q11        0
dtype: int64
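
Replacing -2 across the whole frame assumes that -2 is never a real value in any column. A more cautious sketch, on toy data, limits the replacement to a chosen list of columns. The 'DELTA' column is hypothetical and stands in for a column where -2 is a real value.

```python
import numpy as np
import pandas as pd

# Toy frame: -2.0 is a missing-value code in T06/T09 but a real value in 'DELTA'.
toy = pd.DataFrame({
    "T06": [120.0, -2.0, 80.0],
    "T09": [-2.0, -2.0, 15.0],
    "DELTA": [-2.0, 1.0, 3.0],
})

sentinel_cols = ["T06", "T09"]  # columns where -2.0 is treated as missing
toy[sentinel_cols] = toy[sentinel_cols].replace(-2.0, np.nan)

missing = toy.isna().sum().sort_values(ascending=False)
print(missing)
```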

Drop 'T02', which is missing in 10,501 rows, and fill the remaining missing values with zeroes.

In [17]:
df.drop(columns='T02', inplace=True)

In [18]:
df.fillna(0, inplace=True)

As a convenience, set the district ID to be the index value. 

In [19]:
df.set_index('LEAID', inplace=True)

Save the master file. 

In [20]:
df.to_csv('../../data/dataset.csv', index=True)
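
Because the index is written to the file, it should be named again when the file is read back; otherwise 'LEAID' returns as an ordinary column. A small round-trip sketch using an in-memory buffer in place of the real path:

```python
import io
import pandas as pd

# Toy frame indexed by district ID, using two values from the table above.
toy = pd.DataFrame(
    {"LEAID": [2700001, 2700005], "Graduation Rate": [0.9355, 0.881]}
).set_index("LEAID")

buffer = io.StringIO()
toy.to_csv(buffer, index=True)  # the index is written as the first column
buffer.seek(0)

# Name the index column on the way back in to restore the same layout.
restored = pd.read_csv(buffer, index_col="LEAID")
print(restored)
```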