# North Carolina IRS Individual Income Tax Statistics by School District 2013
* ZIP Code data shows selected income and tax items classified by State, ZIP Code, and size of adjusted gross income. 
* Data are based on individual income tax returns filed with the IRS and are available for Tax Years 1998, 2001, and 2004 through 2016.
* This data is available at: https://www.irs.gov/statistics/soi-tax-stats-individual-income-tax-statistics-zip-code-data-soi
* We suggest mapping IRS Income Tax data for the tax year that covers a majority of a particular school year. 
* For example, you will find the tax data for 2013 in the 2013-2014 school year folder. This is why the files for NCDPI and IRS data may appear one year off.  
* However, the raw data is keyed by zip code, so users may merge years however they see fit! 
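
The year convention described above can be sketched as a pair of small helpers. The function names are hypothetical and are not part of this repository; they only illustrate why the code below uses `taxYear + 1` when it builds the public school file name.

```python
def school_year_label(tax_year):
    """Return the school year label that a given tax year is stored under."""
    # Tax year 2013 is stored with the 2013-2014 school year.
    return f"{tax_year}-{tax_year + 1}"


def school_year_end(tax_year):
    """Return the calendar year in which that school year ends."""
    return tax_year + 1


print(school_year_label(2013))
print(school_year_end(2013))
```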

## IRS Income Tax Data by School District
* Pivoted IRS tax data by zip code and Adjusted Gross Income range for North Carolina is available at: \\Raw Datasets\IncomeTaxDataByZipCode_**(tax year)**.csv
* We map IRS tax data by zip code to school districts using every zip code in which a district has a physical
  elementary, middle, or high school campus.
* IRS tax data for each zip code mapped to a school district is aggregated, summing each individual data field by school district.  
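
As a minimal sketch of the mapping and aggregation described above, the toy example below uses made-up zip codes, made-up district names, and a single hypothetical column, `Returns Ct All`. It is not the real dataset; it only shows the merge-then-sum pattern on data small enough to check by hand.

```python
import pandas as pd

# Made-up mapping: zip code 27002 holds campuses from two districts.
district_zips = pd.DataFrame({
    "Zip Code": [27001, 27002, 27002, 27003],
    "Lea_Name": ["District A", "District A", "District B", "District B"],
})

# Made-up tax data with one count column per zip code.
tax_by_zip = pd.DataFrame({
    "Zip Code": [27001, 27002, 27003],
    "Returns Ct All": [100, 200, 300],
})

# Attach each zip code's tax data to every district with a campus there,
# then sum each data field within a district.
merged = tax_by_zip.merge(district_zips, how="inner", on="Zip Code")
by_district = (
    merged.drop(columns="Zip Code")
    .groupby("Lea_Name", as_index=False)
    .sum()
)
print(by_district)
```

Zip code 27002 contributes its 200 returns to both districts, so District A totals 300 and District B totals 500.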

In [1]:
#import required Libraries
import pandas as pd
import numpy as np

#**********************************************************************************
# Set the following variables before running this code!!!
#**********************************************************************************

#Location where copies of the school data files will be downloaded and saved as csv files.
dataDir = 'C:/Users/Jake/Documents/GitHub/EducationDataNC/2014/'

#All raw data files are filtered for the year below
taxYear = 2013

In [2]:
#Read in all public school data 
districtZips = pd.read_csv(dataDir + 'School Datasets/PublicSchools' + str(taxYear + 1) + '.csv', low_memory=False)

#Map all zip codes within each school district
districtZips = districtZips[['szip_ad','Lea_Name']].drop_duplicates()
districtZips.rename(columns={'szip_ad': 'Zip Code'}, inplace=True)

## Take a Look at our Mapping of Zip Codes to School Districts
* Here we have created a unique list, by school district, of all North Carolina zip codes containing public elementary, middle, and high school campuses.
* It is possible for more than one school district to have campuses located in a single zip code.

In [3]:
#Take a look at the data 
districtZips

Unnamed: 0,Zip Code,Lea_Name
0,27253.0,Alamance-Burlington Schools
2,27244.0,Alamance-Burlington Schools
3,27217.0,Alamance-Burlington Schools
5,27215.0,Alamance-Burlington Schools
6,27302.0,Alamance-Burlington Schools
18,27258.0,Alamance-Burlington Schools
30,27349.0,Alamance-Burlington Schools
36,27253.0,Charter and Non-District Affiliated Schools
37,27217.0,Charter and Non-District Affiliated Schools
38,27340.0,Charter and Non-District Affiliated Schools
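
The output above lists zip codes 27253 and 27217 under two different districts. One way to find such shared zip codes, sketched here on six rows copied from that output rather than on the full `districtZips` frame, is to count the distinct districts per zip code:

```python
import pandas as pd

# Rows copied from the sample output above.
district_zips = pd.DataFrame({
    "Zip Code": [27253.0, 27244.0, 27217.0, 27253.0, 27217.0, 27340.0],
    "Lea_Name": [
        "Alamance-Burlington Schools",
        "Alamance-Burlington Schools",
        "Alamance-Burlington Schools",
        "Charter and Non-District Affiliated Schools",
        "Charter and Non-District Affiliated Schools",
        "Charter and Non-District Affiliated Schools",
    ],
})

# Count distinct districts per zip code and keep zip codes with more than one.
districts_per_zip = district_zips.groupby("Zip Code")["Lea_Name"].nunique()
shared_zips = districts_per_zip[districts_per_zip > 1]
print(shared_zips)
```

Running the same two lines against the full `districtZips` frame should list every zip code whose tax data will be counted under more than one district.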


## Summarize IRS Tax Data by School District 
**Income tax data counts and amounts are organized by Adjusted Gross Income Ranges within each Zip Code**
* **All** - Income tax data represents the entire zip code 
* **LT25K** - Income tax data represents adjusted gross income from \$1 to under \$25,000 within a zip code.
* **25KLT50K** - Income tax data represents adjusted gross income >= \$25,000 and < \$50,000 within a zip code.
* **50KLT75K** - Income tax data represents adjusted gross income >= \$50,000 and < \$75,000 within a zip code.
* **75KLT100K** - Income tax data represents adjusted gross income >= \$75,000 and < \$100,000 within a zip code.
* **100KLT200K** - Income tax data represents adjusted gross income >= \$100,000 and < \$200,000 within a zip code.
* **GE200K** - Income tax data represents adjusted gross income >= \$200,000 within a zip code.
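
Because each column name ends with one of these range labels, columns for a single income range can be picked out by suffix. The sketch below is one way to do that; `columns_for_range` is a hypothetical helper, and the values are the unemployment compensation counts for zip code 27006 copied from this notebook's sample output.

```python
import pandas as pd

# AGI range labels used as column-name suffixes (excluding "All").
AGI_RANGES = ["LT25K", "25KLT50K", "50KLT75K", "75KLT100K", "100KLT200K", "GE200K"]


def columns_for_range(df, agi_range):
    """Return the columns whose names end with the given AGI range label."""
    # The leading space keeps the match anchored to a whole label.
    suffix = " " + agi_range
    return [col for col in df.columns if col.endswith(suffix)]


# Unemployment compensation counts for zip code 27006.
sample = pd.DataFrame({
    "Zip Code": [27006.0],
    "Unemployment compensation Ct 100KLT200K": [40.0],
    "Unemployment compensation Ct 25KLT50K": [90.0],
    "Unemployment compensation Ct 50KLT75K": [60.0],
    "Unemployment compensation Ct 75KLT100K": [40.0],
    "Unemployment compensation Ct All": [320.0],
    "Unemployment compensation Ct GE200K": [0.0],
    "Unemployment compensation Ct LT25K": [90.0],
})

low_income_cols = columns_for_range(sample, "LT25K")

# Sum the six range columns and compare with the "All" column.
range_cols = [c for r in AGI_RANGES for c in columns_for_range(sample, r)]
range_total = sample[range_cols].sum(axis=1).iloc[0]
all_total = sample["Unemployment compensation Ct All"].iloc[0]
print(low_income_cols, range_total, all_total)
```

In this row the six ranges add up to the `All` value. That is a property of this one example and should not be assumed for every field or zip code.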

In [4]:
#Read in IRS tax data by zip code
irsData = pd.read_csv(dataDir + 'Raw Datasets/IncomeTaxDataByZipCode_' + str(taxYear) + '.csv', low_memory=False)
#Look at the record and column counts
irsData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 723 entries, 0 to 722
Columns: 771 entries, Zip Code to Unemployment compensation Ct LT25K
dtypes: float64(771)
memory usage: 4.3 MB


In [5]:
#Look at a sample of the data
irsData

Unnamed: 0,Zip Code,Add Medicare tax Amt 100KLT200K,Add Medicare tax Amt 25KLT50K,Add Medicare tax Amt 50KLT75K,Add Medicare tax Amt 75KLT100K,Add Medicare tax Amt All,Add Medicare tax Amt GE200K,Add Medicare tax Amt LT25K,Add Medicare tax Ct 100KLT200K,Add Medicare tax Ct 25KLT50K,...,Unemployment compensation Amt All,Unemployment compensation Amt GE200K,Unemployment compensation Amt LT25K,Unemployment compensation Ct 100KLT200K,Unemployment compensation Ct 25KLT50K,Unemployment compensation Ct 50KLT75K,Unemployment compensation Ct 75KLT100K,Unemployment compensation Ct All,Unemployment compensation Ct GE200K,Unemployment compensation Ct LT25K
0,27006.0,0.0,0.0,0.0,0.0,504.0,504.0,0.0,0.0,0.0,...,1639.0,0.0,392.0,40.0,90.0,60.0,40.0,320.0,0.0,90.0
1,27007.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,278.0,0.0,185.0,0.0,0.0,30.0,0.0,80.0,0.0,50.0
2,27009.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,458.0,0.0,196.0,0.0,20.0,30.0,0.0,90.0,0.0,40.0
3,27011.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,635.0,0.0,260.0,0.0,40.0,40.0,0.0,140.0,0.0,60.0
4,27012.0,0.0,0.0,0.0,0.0,546.0,546.0,0.0,0.0,0.0,...,3456.0,0.0,948.0,90.0,140.0,100.0,60.0,580.0,0.0,190.0
5,27013.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1283.0,0.0,436.0,0.0,70.0,70.0,60.0,290.0,0.0,90.0
6,27016.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,190.0,0.0,131.0,0.0,0.0,20.0,0.0,60.0,0.0,40.0
7,27017.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1177.0,0.0,489.0,0.0,80.0,40.0,20.0,260.0,0.0,120.0
8,27018.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,961.0,0.0,408.0,0.0,50.0,30.0,20.0,190.0,0.0,90.0
9,27019.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,519.0,0.0,210.0,0.0,40.0,0.0,20.0,110.0,0.0,50.0


## Merge School Districts by Zip Codes 
* Here we create one data record for each unique public elementary, middle, and high school campus zip code and school district combination.
* We can group these records by school district and sum up the income tax data. 
* This provides income tax data specific to each school district, based on the zip codes of its public school campuses.
* It is possible for one zip code to be mapped to multiple school districts when two districts have campuses in the same zip code.
* As a result, school district data should not be summed to the state level, because shared zip codes would be counted more than once. The original raw data file from the IRS already contains state-level information.
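
To illustrate why district totals do not add up to a state total, the toy example below uses made-up zip codes, districts, and a hypothetical `Returns Ct All` column. One zip code is shared by two districts and one has no campus at all, so the inner merge both repeats and drops data.

```python
import pandas as pd

# Made-up tax data for four zip codes.
tax_by_zip = pd.DataFrame({
    "Zip Code": [27001, 27002, 27003, 27004],
    "Returns Ct All": [100, 200, 300, 400],
})

# Made-up mapping: 27002 is shared by two districts, 27004 has no campus.
district_zips = pd.DataFrame({
    "Zip Code": [27001, 27002, 27002, 27003],
    "Lea_Name": ["District A", "District A", "District B", "District B"],
})

merged = tax_by_zip.merge(district_zips, how="inner", on="Zip Code")
by_district = merged.groupby("Lea_Name")["Returns Ct All"].sum()

state_total = tax_by_zip["Returns Ct All"].sum()
district_total = by_district.sum()

# Zip codes dropped by the inner merge because no campus is located there.
zips_with_campus = district_zips["Zip Code"].unique()
unmatched = tax_by_zip[~tax_by_zip["Zip Code"].isin(zips_with_campus)]

# Zip codes counted under more than one district.
repeated = merged[merged.duplicated("Zip Code", keep=False)]

print(state_total, district_total)
print(unmatched)
print(repeated)
```

Here the state total is 1000 while the district totals sum to 800: zip code 27002 is counted twice (+200) and zip code 27004 is not counted at all (-400). The same two effects would explain a row count that changes across the merge in the real data.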

In [6]:
#Merge each school district 
irsData = irsData.merge(districtZips,how='inner',on='Zip Code', suffixes=('', '_Drop'))
#Look at the record and column counts after the merge
irsData.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 744 entries, 0 to 743
Columns: 772 entries, Zip Code to Lea_Name
dtypes: float64(771), object(1)
memory usage: 4.4+ MB


## Summarize Zip Code by School District
* This rolls up IRS income tax data by school district using the zip codes for all public elementary, middle, and high school campuses within each district.  

In [7]:
#Summarize IRS tax data by school district and remove zip codes
irsBySchoolDist = irsData.loc[:, irsData.columns != 'Zip Code'].groupby('Lea_Name').sum()
#Make our index a column for merges later
irsBySchoolDist.reset_index(level=0, inplace=True)

In [8]:
irsBySchoolDist.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 116 entries, 0 to 115
Columns: 771 entries, Lea_Name to Unemployment compensation Ct LT25K
dtypes: float64(770), object(1)
memory usage: 698.8+ KB


In [9]:
#Save IRS tax data by school district to school datasets folder
irsBySchoolDist.to_csv(dataDir + 'School Datasets/IncomeTaxDataBySchoolDistrict_' + str(taxYear) + '.csv',
                       sep=',', index=False)