# Part 1: Data Cleaning
----

# To TAs/Graders:

### Important: please read the following before running this notebook.

This `data_cleaning.ipynb` notebook is long and tedious to follow, because it both

- **needs domain knowledge**, and
- **has to be re-run several times with small changes** (once per year of data).

We strongly suggest reading through this file, and using `data/data_com.csv` for the other .ipynb files.

### Explanation for `temp` directory.

It is a playground directory.

You do not need to worry that the important data and models used in the other notebooks will be overwritten, appended to, or changed by a run-time error, because this notebook only writes to `temp`.

- The **raw data** are saved in `raw_data` directory.

- The **cleaned data** are saved in `data` directory, which will be used by other .ipynb files.

- The **datasets created in this notebook** are saved in the `temp/data` directory. We assume you run this notebook only to check the code, so its output goes to this temporary directory.

----

# Note
This file preprocesses the fiscal data and graduation data for the years 2007 to 2010. The data were cleaned, selected, merged across tables, and combined across years. The processed data has one identification column 'LEAID', one target column 'AFGR' representing the graduation rate, and the remaining feature columns.

# Table of contents
1. Import data
2. Data cleaning
3. Combine variables based on explanation from the data manual
4. Output
5. Combine four data files

# 1. Import data
---
Import both fiscal and graduation data

## a. Look at data

In [1]:
import pandas as pd

In [2]:
# The following code is for the data from year 2008.
# For other years, change the file names, e.g. 'raw_data/fiscal08-09.txt' and 'raw_data/dr08-09.txt' for year 2009.
fiscal_path = 'raw_data/fiscal07-08.txt'
dropout_path = 'raw_data/dr07-08.txt'

In [3]:
def import_data(path):
    imported_data = pd.read_csv(path, sep='\t', low_memory=False, dtype = {'LEAID':str})
    return imported_data
#dtype={"LEAID": str,'SCHLEV':str, 'GSLO':str}

In [4]:
fiscal = import_data(fiscal_path)

In [5]:
# only year 2008 has graduation data in which the 'leaid' is in lowercase
dropout = pd.read_csv(dropout_path, sep='\t', dtype = {'leaid':str})

# for other years run the following code
# dropout = pd.read_csv(dropout_path, sep='\t', dtype = {'LEAID':str})

In [6]:
# If no dtype is specified, pandas emits a warning:
# DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
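
The dtype note above matters for another reason: LEAID values start with a leading zero in some states. A minimal sketch of the effect, using a tiny made-up tab-separated sample instead of the real raw files:

```python
import io

import pandas as pd

# Tiny tab-separated sample standing in for a raw file (made-up values).
sample = "LEAID\tV33\n0100005\t3790\n5605160\t120\n"

# With an explicit dtype, LEAID stays a string and keeps its leading zero.
as_str = pd.read_csv(io.StringIO(sample), sep="\t", dtype={"LEAID": str})

# Without it, pandas infers an integer column and the leading zero is lost.
as_int = pd.read_csv(io.StringIO(sample), sep="\t")

print(as_str["LEAID"].tolist())
print(as_int["LEAID"].tolist())
```

Keeping the identifier as a string also keeps the join key comparable between the fiscal and graduation tables.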

In [7]:
fiscal.head()

Unnamed: 0,LEAID,CENSUSID,FIPST,CONUM,CSA,CBSA,NAME,STNAME,STABBR,SCHLEV,...,FL_V93,FL_19H,FL_21F,FL_31F,FL_41F,FL_61V,FL_66V,FL_W01,FL_W31,FL_W61
0,100005,1504840100000,1,1095,N,10700,ALBERTVILLE CITY SCHOOL DISTRICT,Alabama,AL,3,...,R,R,R,R,R,R,R,R,R,R
1,100006,1504800100000,1,1095,N,10700,MARSHALL COUNTY SCHOOL DISTRICT,Alabama,AL,3,...,R,R,R,R,R,R,R,R,R,R
2,100007,1503740100000,1,1073,142,13820,HOOVER CITY SCHOOL DISTRICT,Alabama,AL,3,...,R,R,R,R,R,R,R,R,R,R
3,100008,1504530100000,1,1089,290,26620,MADISON CITY SCHOOL DISTRICT,Alabama,AL,3,...,R,R,R,R,R,R,R,R,R,R
4,100011,1503710100000,1,1073,142,13820,LEEDS CITY SCHOOL DISTRICT,Alabama,AL,3,...,R,R,R,R,R,R,R,R,R,R


In [8]:
fiscal.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16453 entries, 0 to 16452
Columns: 250 entries, LEAID to FL_W61
dtypes: int64(129), object(121)
memory usage: 31.4+ MB


In [9]:
dropout.head()

Unnamed: 0,survyear,fipst,leaid,totd912,ebs912,drp912,totdpl,afgeb,afgr,totohc
0,2007-08,1,100002,0,47,0.0,-1,14,-1.0,-1
1,2007-08,1,100005,29,939,3.1,172,247,69.6,5
2,2007-08,1,100006,41,1612,2.5,276,450,61.3,13
3,2007-08,1,100007,38,3817,1.0,899,978,91.9,5
4,2007-08,1,100008,27,2715,1.0,624,584,100.0,9


In [10]:
dropout.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18090 entries, 0 to 18089
Data columns (total 10 columns):
survyear    18090 non-null object
fipst       18090 non-null int64
leaid       18090 non-null object
totd912     18090 non-null int64
ebs912      18090 non-null int64
drp912      18090 non-null float64
totdpl      18090 non-null int64
afgeb       18090 non-null int64
afgr        18090 non-null float64
totohc      18090 non-null int64
dtypes: float64(2), int64(6), object(2)
memory usage: 1.4+ MB


## b. Check duplicated data for each dataframe

In [11]:
len(fiscal['LEAID'])

16453

In [12]:
len(fiscal[fiscal['LEAID'].duplicated() == True])

0

In [13]:
len(dropout['leaid'])

18090

In [14]:
len(dropout[dropout['leaid'].duplicated() == True])

0

In [15]:
# only run this for year 2008 because the column labels are in lowercase
dropout['LEAID'] = dropout['leaid']
dropout['AFGR'] = dropout['afgr']
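
As an alternative to the year-specific cell above, the column labels could be normalized once so the same code runs for every year. This is only a sketch: `upper_case_columns` is a hypothetical helper, not part of the notebook, and it renames the columns instead of adding copies as the cell above does.

```python
import pandas as pd


def upper_case_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df whose column labels are all upper case."""
    return df.rename(columns=str.upper)


# Made-up rows that mimic the lowercase 2007-08 layout.
dropout_demo = pd.DataFrame({"leaid": ["0100005", "0100006"], "afgr": [69.6, 61.3]})
dropout_upper = upper_case_columns(dropout_demo)
print(list(dropout_upper.columns))
```

Frames whose labels are already upper case pass through unchanged, so the helper would be safe to call for the other years too.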

## c. Check the number of LEAIDs shared by both dataframes
LEAID is used to join the fiscal table and the graduation table.

In [16]:
len(set(fiscal['LEAID']) & set(dropout['LEAID']))

16408
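
The set intersection gives only the shared count. If the unmatched keys on each side are also of interest, an outer merge with `indicator=True` is one way to see all three groups at once. The frames below are made-up stand-ins for `fiscal` and `dropout`:

```python
import pandas as pd

# Made-up identifiers; only two of them appear in both frames.
fiscal_demo = pd.DataFrame({"LEAID": ["0100005", "0100006", "0100007"]})
dropout_demo = pd.DataFrame({"LEAID": ["0100002", "0100005", "0100006"]})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
overlap = pd.merge(fiscal_demo, dropout_demo, on="LEAID", how="outer", indicator=True)
counts = overlap["_merge"].value_counts()
print(counts)
```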

# 2. Data cleaning
---

## a. Select the variables that are going to be used

In [17]:
# ignore columns that are flags for the numerical variables
# for year 2007, there is one more data column 'C18'
fiscal = fiscal[['LEAID','SCHLEV','AGCHRT','CONUM','FIPST','YEAR','V33',
                 'C14','C16','C17','C25',
                 'C15','C18','C19','B11','B10','B12',
                 'C20','C36','B13',
                 'C01','C04','C10','C12','C38',
                 'C05','C06','C07','C08','C09',
                 'C11','C13','C35','C39',
                 'T06','T09','T15','T40','T99','D11','D23',
                 'A07','A08','A09','A11','A13','A15','A20','A40',
                 'U11','U22','U30','U50','U97','C24',
                 'Z33',
                 'V11','V13','V15','V17','V21','V23','V37','V29',
                 'Z34',
                 'E13','TCURSSVC','E11','V60','V65',
                 'TNONELSE',
                 'TCAPOUT',
                 'L12','M12','Q11','V91','V92',
                 'V93',
                 'I86'                 
                ]]

In [18]:
fiscal.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16453 entries, 0 to 16452
Data columns (total 79 columns):
LEAID       16453 non-null object
SCHLEV      16453 non-null int64
AGCHRT      16453 non-null object
CONUM       16453 non-null int64
FIPST       16453 non-null int64
YEAR        16453 non-null int64
V33         16453 non-null int64
C14         16453 non-null int64
C16         16453 non-null int64
C17         16453 non-null int64
C25         16453 non-null int64
C15         16453 non-null int64
C18         16453 non-null int64
C19         16453 non-null int64
B11         16453 non-null int64
B10         16453 non-null int64
B12         16453 non-null int64
C20         16453 non-null int64
C36         16453 non-null int64
B13         16453 non-null int64
C01         16453 non-null int64
C04         16453 non-null int64
C10         16453 non-null int64
C12         16453 non-null int64
C38         16453 non-null int64
C05         16453 non-null int64
C06         16453 non-null int6

In [19]:
# There are no null values in either dataset.

## b. Deal with invalid value

#### 1) Almost every column contains the values -2, -1, and 0

Take column "V93" as an example (percentage of rows holding each value):

-2 5.935222

-1 3.490985

0 32.690305

At this point we do not know what -1 and -2 mean, or how they differ.

In almost every column, -2 and -1 account for the same proportions, 5.935222% and 3.490985%.

**We assume that the same rows are affected in every column, which means that dropping the rows where one column equals -1 or -2 should clean every other column as well.**

But 0 means 0, so there is no need to delete it.

In [20]:
for a in fiscal.columns:
    l = len(fiscal)
    print(fiscal.groupby(by=[a])[a].count()/l*100)

LEAID
0100005    0.006078
0100006    0.006078
0100007    0.006078
0100008    0.006078
0100011    0.006078
0100012    0.006078
0100013    0.006078
0100030    0.006078
0100060    0.006078
0100090    0.006078
0100100    0.006078
0100120    0.006078
0100180    0.006078
0100185    0.006078
0100210    0.006078
0100240    0.006078
0100270    0.006078
0100300    0.006078
0100330    0.006078
0100360    0.006078
0100390    0.006078
0100420    0.006078
0100450    0.006078
0100480    0.006078
0100510    0.006078
0100540    0.006078
0100600    0.006078
0100630    0.006078
0100660    0.006078
0100690    0.006078
             ...   
5602870    0.006078
5602990    0.006078
5603170    0.006078
5603180    0.006078
5603310    0.006078
5603770    0.006078
5604030    0.006078
5604060    0.006078
5604120    0.006078
5604230    0.006078
5604260    0.006078
5604380    0.006078
5604450    0.006078
5604500    0.006078
5604510    0.006078
5604830    0.006078
5604860    0.006078
5605090    0.006078
5605160    0.0

U50
0           60.262566
1000         2.716830
2000         1.981402
3000         1.452623
4000         1.300675
5000         1.160883
6000         1.094025
7000         0.911688
8000         0.784052
9000         0.662493
10000        0.869142
11000        0.832675
12000        0.662493
13000        0.565246
14000        0.559169
15000        0.650337
16000        0.443688
17000        0.577402
18000        0.407221
19000        0.468000
20000        0.461922
21000        0.431532
22000        0.395065
23000        0.395065
24000        0.364675
25000        0.322130
26000        0.370753
27000        0.352519
28000        0.346441
29000        0.297818
              ...    
3388000      0.006078
3444000      0.006078
3469000      0.006078
3580000      0.006078
3593000      0.006078
3604000      0.006078
3649000      0.006078
3660000      0.006078
3897000      0.006078
3925000      0.006078
4308000      0.006078
4459000      0.006078
4517000      0.006078
4767000      0.006078
545300

#### 2) Missing, Nonapplicable, and Suppressed Data (interpretation of negative values)

-1 In the F-33 data files, CCD identifies missing data by reporting the data value as “-1.”

-3 CCD identifies suppressed membership data by reporting the membership as “-3” and the membership flag as a value of “A.”

-9 CCD identifies submitted F-33 data that do not meet NCES data quality standards by reporting the data item as “-9” and data item flag as “A.”

-2 CCD identifies nonapplicable data by reporting the data value as “-2” and the data item flag as a value of “N.”
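
Given these code meanings, the share of each negative code per column can be summarized in one small table instead of reading the full value distributions. A minimal sketch on made-up data; `F33_CODES` and `sentinel_share` are hypothetical names introduced here:

```python
import pandas as pd

# Code meanings copied from the list above.
F33_CODES = {
    -1: "missing",
    -2: "not applicable",
    -3: "suppressed membership",
    -9: "below NCES quality standards",
}


def sentinel_share(df, columns):
    """Percentage of rows per column that hold each negative code."""
    shares = {code: df[columns].eq(code).mean() * 100 for code in F33_CODES}
    # Rows are the checked columns, columns are the code meanings.
    return pd.DataFrame(shares).rename(columns=F33_CODES)


# Made-up values, not taken from the real F-33 file.
demo = pd.DataFrame({"V33": [3790, -1, -2, 120], "T06": [0, -1, -2, -2]})
summary = sentinel_share(demo, ["V33", "T06"])
print(summary)
```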

In [21]:
# drop the rows that hold a negative value in any of these columns
for a in ['T06','T09','T15','T40','T99','V33']:
    fiscal = fiscal[fiscal[a] >= 0 ]

Check whether -1 or -2 still appears in any of the columns.

In [22]:
for a in fiscal.columns:
    l = len(fiscal)
    print(fiscal.groupby(by=[a])[a].count()/l*100)

LEAID
0100005    0.007827
0100006    0.007827
0100007    0.007827
0100008    0.007827
0100011    0.007827
0100012    0.007827
0100013    0.007827
0100030    0.007827
0100060    0.007827
0100090    0.007827
0100100    0.007827
0100120    0.007827
0100180    0.007827
0100210    0.007827
0100240    0.007827
0100270    0.007827
0100300    0.007827
0100330    0.007827
0100360    0.007827
0100390    0.007827
0100420    0.007827
0100450    0.007827
0100480    0.007827
0100510    0.007827
0100540    0.007827
0100600    0.007827
0100630    0.007827
0100660    0.007827
0100690    0.007827
0100720    0.007827
             ...   
5602870    0.007827
5602990    0.007827
5603170    0.007827
5603180    0.007827
5603310    0.007827
5603770    0.007827
5604030    0.007827
5604060    0.007827
5604120    0.007827
5604230    0.007827
5604260    0.007827
5604380    0.007827
5604450    0.007827
5604500    0.007827
5604510    0.007827
5604830    0.007827
5604860    0.007827
5605090    0.007827
5605160    0.0

A40
0           50.833529
1000         5.063786
2000         3.201064
3000         2.598419
4000         2.285356
5000         2.011427
6000         1.729671
7000         1.588792
8000         1.267903
9000         1.150505
10000        1.033106
11000        0.907881
12000        0.907881
13000        0.813963
14000        0.735697
15000        0.657431
16000        0.665258
17000        0.579166
18000        0.688738
19000        0.555686
20000        0.532206
21000        0.508727
22000        0.493073
23000        0.352195
24000        0.438288
25000        0.406981
26000        0.422634
27000        0.438288
28000        0.352195
29000        0.289583
              ...    
2356000      0.007827
2511000      0.007827
2551000      0.007827
2578000      0.007827
2632000      0.007827
2710000      0.007827
2797000      0.007827
2814000      0.007827
2829000      0.007827
2879000      0.007827
2930000      0.007827
3201000      0.007827
3253000      0.007827
3371000      0.007827
351100

In [23]:
print(len(fiscal))

12777
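
Instead of scanning the printed distributions, the filter and the follow-up check can both be written as boolean masks. This is a sketch on made-up data, and `check_cols` is a shortened, hypothetical version of the column list used above:

```python
import pandas as pd

check_cols = ["T06", "V33"]  # shortened stand-in for the filtered columns
demo = pd.DataFrame(
    {"T06": [0, -1, 5, 7], "V33": [10, 20, -2, 30], "C14": [1, 2, 3, 4]}
)

# One mask: keep rows where every checked column is non-negative.
clean = demo[(demo[check_cols] >= 0).all(axis=1)]

# True for any numeric column that still contains a negative value.
remaining = (clean.select_dtypes("number") < 0).any()
print(len(clean), remaining.any())
```

The mask keeps the same rows as filtering one column at a time in a loop, and `remaining` gives a yes/no answer per column.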


In [24]:
# check which columns have null values
fiscal.isnull().any()

# fiscal.dropna(how = 'any',axis = 0)
# fiscal.dropna(how = 'any',axis = 1)

LEAID       False
SCHLEV      False
AGCHRT      False
CONUM       False
FIPST       False
YEAR        False
V33         False
C14         False
C16         False
C17         False
C25         False
C15         False
C18         False
C19         False
B11         False
B10         False
B12         False
C20         False
C36         False
B13         False
C01         False
C04         False
C10         False
C12         False
C38         False
C05         False
C06         False
C07         False
C08         False
C09         False
            ...  
U11         False
U22         False
U30         False
U50         False
U97         False
C24         False
Z33         False
V11         False
V13         False
V15         False
V17         False
V21         False
V23         False
V37         False
V29         False
Z34         False
E13         False
TCURSSVC    False
E11         False
V60         False
V65         False
TNONELSE    False
TCAPOUT     False
L12         False
M12       

##### AGCHRT: 

AGENCY CHARTER CODE

1 = All schools are charter schools

2 = All schools are charter and noncharter schools

3 = All associated schools are noncharter schools

N = Not applicable or code could not be determined

**Just treat them as categorical data, no need to change.**
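
If a later model needs numeric input, one possible way to treat AGCHRT as categorical is to declare the four levels explicitly and one-hot encode them. The sketch below assumes the codes are stored as strings and uses made-up rows:

```python
import pandas as pd

# Made-up AGCHRT values, assumed to be stored as strings.
demo = pd.DataFrame({"AGCHRT": ["3", "1", "N", "3", "2"]})

# Declare the levels listed above so each code gets its own column.
agchrt_levels = ["1", "2", "3", "N"]
demo["AGCHRT"] = pd.Categorical(demo["AGCHRT"], categories=agchrt_levels)

dummies = pd.get_dummies(demo["AGCHRT"], prefix="AGCHRT")
print(dummies.columns.tolist())
```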

## c. Deal with graduation data

In [25]:
dropout.head(20)

Unnamed: 0,survyear,fipst,leaid,totd912,ebs912,drp912,totdpl,afgeb,afgr,totohc,LEAID,AFGR
0,2007-08,1,100002,0,47,0.0,-1,14,-1.0,-1,100002,-1.0
1,2007-08,1,100005,29,939,3.1,172,247,69.6,5,100005,69.6
2,2007-08,1,100006,41,1612,2.5,276,450,61.3,13,100006,61.3
3,2007-08,1,100007,38,3817,1.0,899,978,91.9,5,100007,91.9
4,2007-08,1,100008,27,2715,1.0,624,584,100.0,9,100008,100.0
5,2007-08,1,100009,0,163,0.0,-1,20,-1.0,-1,100009,-1.0
6,2007-08,1,100010,-2,-2,-2.0,-2,-2,-2.0,-2,100010,-2.0
7,2007-08,1,100011,15,401,3.7,72,107,67.3,6,100011,67.3
8,2007-08,1,100012,-3,622,-3.0,109,156,69.9,7,100012,69.9
9,2007-08,1,100013,14,1241,1.1,263,-2,-2.0,1,100013,-2.0


In [26]:
for a in dropout.columns:
    l = len(dropout)
    print(dropout.groupby(by=[a])[a].count()/l*100)

survyear
2007-08    100.0
Name: survyear, dtype: float64
fipst
1     0.950802
2     0.298507
4     3.449420
5     1.686014
6     6.296296
8     1.448314
9     1.105583
10    0.226645
11    0.320619
12    0.425650
13    1.133223
15    0.005528
16    0.751797
17    6.030956
18    2.050857
19    2.078496
20    1.835268
21    1.077944
22    0.585959
23    1.686014
24    0.138198
25    2.769486
26    4.687673
27    3.145384
28    0.906578
29    3.101161
30    2.857933
31    1.724710
32    0.105030
33    1.520177
34    3.747927
35    0.525152
36    4.814815
37    1.387507
38    1.337756
39    6.064124
40    3.322278
41    1.232725
42    4.378109
44    0.287452
45    0.569375
46    1.055832
47    0.773908
48    7.042565
49    0.630182
50    1.990050
51    1.276949
53    1.708126
54    0.315091
55    2.559425
56    0.348259
58    0.049751
59    0.110558
60    0.005528
61    0.038695
66    0.005528
69    0.005528
72    0.005528
78    0.011056
Name: fipst, dtype: float64
leaid
0100002    0.00552

#### 1) Dropout Data - Missing, Nonapplicable, and Suppressed Data from the data manual

Data suppression has also been employed as part of the CCD disclosure mitigation plan. Dropout
counts of 1, 2, or 3 have been suppressed. These counts are presented on the data file with the
value -3. Dropout counts that are 1, 2, or 3 students less than the membership count have also
been suppressed. These counts are represented on the file with the denoted value of -4. In order
to prevent data users from backing out these suppressed values and determining the real value of
the cell, complimentary suppression has also been employed. Any complementary suppression
performed on the file is denoted with the same value as a missing count, -1. These suppressed
cells are not distinguishable from the cells that contain missing values.

Suppression has also been employed to protect against the individual disclosure of anyone that
did not receive a regular high school diploma following their 12th grade year. These, and the
counter-suppressions made to protect the primary suppressions, are denoted as -1 on the data file.
These suppressed cells are not distinguishable from cells that contain missing values. 

0: A zero value represents a report of no occurrences of a data element. A value was
expected and measured, but zero cases were found in the category. (For example, a K–12
district having no 12th-graders would report “0.”) 

M (or -1 for numeric values): A value of M (or -1) indicates that data are missing. A
value was expected, but none was measured. (For example, a district that has at least one
12th-grader but cannot measure the number of 12th-graders would report “-1.”). This value
also denotes a suppressed high school diploma count or dropout count. 

N (or -2 for numeric values): A value of N (or -2) indicates that data are not applicable. A
value was neither expected nor measured. (For example, an elementary school district
would report “-2” for 12th-graders.)

-3: A value of -3 indicates a dropout count of 1, 2, or 3. These cells have been suppressed
such that the true value of the cell cannot be identified. All cells with a value of -3 have a
plausible value of 1, 2, or 3. 

-4: A value of -4 indicates a dropout count that is equal to or exceeds the 3 less than the
membership count. These cells have been suppressed such that the true value of the cell
cannot be identified. All cells with a value of -4 have a plausible value of 3 less than the
membership. 

#### 2) Look at the graduation rate column

AFGR is cleaner than DRP912: it contains only three distinct negative codes.

**For AFGR:**

-9.0       1.090081

-2.0      35.066978

-1.0       5.314822

**For DRP912:**
    
-9.0      4.051196

-4.0      2.717067

-3.0     50.111177

-2.0     28.472260

-1.0      2.467596

**So we use AFGR as the target, because this removes fewer instances.**
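
The comparison between the two candidate target columns can be reproduced with a short calculation. The rows below are made up for illustration, so the printed numbers do not match the real percentages listed above:

```python
import pandas as pd

# Made-up rows with the two candidate target columns.
demo = pd.DataFrame(
    {
        "afgr": [69.6, -1.0, -2.0, 91.9, 100.0],
        "drp912": [3.1, -3.0, -2.0, -3.0, 1.0],
    }
)

# Share of rows lost per column if every negative code is dropped.
negative_share = (demo < 0).mean() * 100

# Number of rows that would remain usable per column.
usable_rows = (demo >= 0).sum()
print(negative_share)
print(usable_rows)
```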

In [27]:
# keep the two columns we need and remove the negative codes (-9, -2, -1)
dropout = dropout[['LEAID','AFGR']]
dropout = dropout[~dropout['AFGR'].isin([-9, -2, -1])]

In [28]:
len(dropout)

10735

In [29]:
for a in dropout.columns:
    l = len(dropout)
    print(dropout.groupby(by=[a])[a].count()/l*100)

LEAID
0100005    0.009315
0100006    0.009315
0100007    0.009315
0100008    0.009315
0100011    0.009315
0100012    0.009315
0100030    0.009315
0100060    0.009315
0100090    0.009315
0100100    0.009315
0100120    0.009315
0100180    0.009315
0100210    0.009315
0100240    0.009315
0100270    0.009315
0100300    0.009315
0100330    0.009315
0100360    0.009315
0100390    0.009315
0100420    0.009315
0100450    0.009315
0100480    0.009315
0100510    0.009315
0100540    0.009315
0100600    0.009315
0100630    0.009315
0100660    0.009315
0100690    0.009315
0100720    0.009315
0100750    0.009315
             ...   
5603170    0.009315
5603180    0.009315
5603310    0.009315
5603770    0.009315
5604030    0.009315
5604060    0.009315
5604120    0.009315
5604230    0.009315
5604260    0.009315
5604380    0.009315
5604500    0.009315
5604510    0.009315
5604830    0.009315
5604860    0.009315
5605090    0.009315
5605160    0.009315
5605220    0.009315
5605302    0.009315
5605680    0.0

We now have a clean AFGR column; the other graduation columns are no longer needed.

In [30]:
dropout.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10735 entries, 1 to 18089
Data columns (total 2 columns):
LEAID    10735 non-null object
AFGR     10735 non-null float64
dtypes: float64(1), object(1)
memory usage: 251.6+ KB


In [31]:
fiscal.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12777 entries, 0 to 16452
Data columns (total 79 columns):
LEAID       12777 non-null object
SCHLEV      12777 non-null int64
AGCHRT      12777 non-null object
CONUM       12777 non-null int64
FIPST       12777 non-null int64
YEAR        12777 non-null int64
V33         12777 non-null int64
C14         12777 non-null int64
C16         12777 non-null int64
C17         12777 non-null int64
C25         12777 non-null int64
C15         12777 non-null int64
C18         12777 non-null int64
C19         12777 non-null int64
B11         12777 non-null int64
B10         12777 non-null int64
B12         12777 non-null int64
C20         12777 non-null int64
C36         12777 non-null int64
B13         12777 non-null int64
C01         12777 non-null int64
C04         12777 non-null int64
C10         12777 non-null int64
C12         12777 non-null int64
C38         12777 non-null int64
C05         12777 non-null int64
C06         12777 non-null int6

## d. Merge the fiscal and graduation tables on 'LEAID'

In [32]:
data = pd.merge(fiscal, dropout, how='inner', on="LEAID", suffixes=('_f', '_d'))

In [33]:
data.head()

Unnamed: 0,LEAID,SCHLEV,AGCHRT,CONUM,FIPST,YEAR,V33,C14,C16,C17,...,TNONELSE,TCAPOUT,L12,M12,Q11,V91,V92,V93,I86,AFGR
0,100005,3,3,1095,1,8,3790,1053000,163000,65000,...,736000,12082000,0,0,70000,0,0,240000,413000,69.6
1,100006,3,3,1095,1,8,5647,2288000,424000,558000,...,1491000,2443000,0,0,8000,0,0,555000,746000,61.3
2,100007,3,3,1073,1,8,12479,165000,165000,21000,...,4709000,5784000,0,0,99000,0,0,1876000,9415000,91.9
3,100008,3,3,1089,1,8,8298,589000,155000,16000,...,1245000,8347000,0,0,124000,8000,0,798000,3316000,100.0
4,100011,3,3,1073,1,8,1406,344000,100000,6000,...,138000,1575000,0,0,13000,0,0,84000,182000,67.3


In [34]:
len(data)

9527
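
Since both tables were checked for duplicated LEAID values earlier, the merge can also enforce that assumption. A minimal sketch on made-up frames, using the `validate` parameter of `pd.merge`, which raises an error if either side has repeated keys:

```python
import pandas as pd

# Made-up stand-ins for the cleaned fiscal and graduation tables.
fiscal_demo = pd.DataFrame(
    {"LEAID": ["0100005", "0100006", "0100007"], "V33": [3790, 5647, 12479]}
)
dropout_demo = pd.DataFrame(
    {"LEAID": ["0100005", "0100007", "0100008"], "AFGR": [69.6, 91.9, 100.0]}
)

# Inner join keeps only districts present in both tables.
merged = pd.merge(
    fiscal_demo, dropout_demo, how="inner", on="LEAID", validate="one_to_one"
)
print(len(merged))
```

Only the identifiers found in both frames survive, which is why the merged table is smaller than either input.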

In [35]:
# look at the merged data
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9527 entries, 0 to 9526
Data columns (total 80 columns):
LEAID       9527 non-null object
SCHLEV      9527 non-null int64
AGCHRT      9527 non-null object
CONUM       9527 non-null int64
FIPST       9527 non-null int64
YEAR        9527 non-null int64
V33         9527 non-null int64
C14         9527 non-null int64
C16         9527 non-null int64
C17         9527 non-null int64
C25         9527 non-null int64
C15         9527 non-null int64
C18         9527 non-null int64
C19         9527 non-null int64
B11         9527 non-null int64
B10         9527 non-null int64
B12         9527 non-null int64
C20         9527 non-null int64
C36         9527 non-null int64
B13         9527 non-null int64
C01         9527 non-null int64
C04         9527 non-null int64
C10         9527 non-null int64
C12         9527 non-null int64
C38         9527 non-null int64
C05         9527 non-null int64
C06         9527 non-null int64
C07         9527 non-null i

In [36]:
# Look at data distribution
for a in data.columns:
    l = len(data)
    print(data.groupby(by=[a])[a].count()/l*100)

LEAID
0100005    0.010496
0100006    0.010496
0100007    0.010496
0100008    0.010496
0100011    0.010496
0100012    0.010496
0100030    0.010496
0100060    0.010496
0100090    0.010496
0100100    0.010496
0100120    0.010496
0100180    0.010496
0100210    0.010496
0100240    0.010496
0100270    0.010496
0100300    0.010496
0100330    0.010496
0100360    0.010496
0100390    0.010496
0100420    0.010496
0100450    0.010496
0100480    0.010496
0100510    0.010496
0100540    0.010496
0100600    0.010496
0100630    0.010496
0100660    0.010496
0100690    0.010496
0100720    0.010496
0100750    0.010496
             ...   
5602830    0.010496
5602870    0.010496
5602990    0.010496
5603170    0.010496
5603180    0.010496
5603310    0.010496
5603770    0.010496
5604030    0.010496
5604060    0.010496
5604120    0.010496
5604230    0.010496
5604260    0.010496
5604380    0.010496
5604500    0.010496
5604510    0.010496
5604830    0.010496
5604860    0.010496
5605090    0.010496
5605160    0.0

U97
0            2.802561
1000         1.196599
2000         1.018159
3000         0.892201
4000         0.944684
5000         0.965676
6000         0.934187
7000         0.892201
8000         0.923691
9000         0.808229
10000        0.829222
11000        0.766243
12000        0.829222
13000        0.923691
14000        0.892201
15000        0.818726
16000        0.755747
17000        0.703264
18000        0.892201
19000        0.808229
20000        0.787236
21000        0.850215
22000        0.755747
23000        0.776740
24000        0.692768
25000        0.640286
26000        0.598300
27000        0.503831
28000        0.524824
29000        0.545817
               ...   
25060000     0.010496
25519000     0.010496
26666000     0.010496
28689000     0.010496
30637000     0.010496
31119000     0.010496
33213000     0.010496
36452000     0.010496
36675000     0.010496
40715000     0.010496
44942000     0.010496
45539000     0.010496
46959000     0.010496
48180000     0.010496
518120

# 3. Combine variables based on explanation from the data manual
---

In [37]:
data.head()

Unnamed: 0,LEAID,SCHLEV,AGCHRT,CONUM,FIPST,YEAR,V33,C14,C16,C17,...,TNONELSE,TCAPOUT,L12,M12,Q11,V91,V92,V93,I86,AFGR
0,100005,3,3,1095,1,8,3790,1053000,163000,65000,...,736000,12082000,0,0,70000,0,0,240000,413000,69.6
1,100006,3,3,1095,1,8,5647,2288000,424000,558000,...,1491000,2443000,0,0,8000,0,0,555000,746000,61.3
2,100007,3,3,1073,1,8,12479,165000,165000,21000,...,4709000,5784000,0,0,99000,0,0,1876000,9415000,91.9
3,100008,3,3,1089,1,8,8298,589000,155000,16000,...,1245000,8347000,0,0,124000,8000,0,798000,3316000,100.0
4,100011,3,3,1073,1,8,1406,344000,100000,6000,...,138000,1575000,0,0,13000,0,0,84000,182000,67.3


In [38]:
# For year 2007 there is one more column, 'C18', which should be grouped into 'Re_F_Special'
data['Re_F_Basic'] = data['C14'] + data['C16'] + data['C17'] + data['C25']
data['Re_F_Special'] = data['C15'] + data['C19'] + data['B11'] + data['B10'] + data['B12']
data['Re_F_Other'] = data['C20'] + data['C36'] + data['B13']
data['Re_S_Basic'] = data['C01'] + data['C04'] + data['C10'] + data['C12'] + data['C38']
data['Re_S_Special'] = data['C05'] + data['C06'] + data['C07'] + data['C08'] + data['C09']
data['Re_S_Other'] = data['C11'] + data['C13'] + data['C35'] + data['C39']
data['Re_L_Gov'] = data['T06'] + data['T09'] + data['T15'] + data['T40'] + data['T99'] + data['D11'] + data['D23']
data['Re_L_fee'] = data['A07'] + data['A08'] + data['A09'] + data['A11'] + data['A13'] + data['A15'] + data['A20'] + data['A40']
data['Re_L_Other'] = data['U11'] + data['U22'] + data['U30'] + data['U50'] + data['U97'] + data['C24']
data['Ex_Teacher_Inst'] = data['Z33']
data['Ex_Teacher_Supp'] = data['V11'] + data['V13'] + data['V15'] + data['V17'] + data['V21'] + data['V23'] + data['V37'] + data['V29'] 
data['Ex_Employ'] = data['Z34']
data['Ex_Edu'] = data['E13'] + data['TCURSSVC'] + data['E11'] + data['V60'] + data['V65']
data['Ex_Community'] = data['TNONELSE']
data['Ex_Capital'] = data['TCAPOUT']
data['Ex_Payment'] = data['L12'] + data['M12'] + data['Q11'] + data['V91'] + data['V92']
data['Ex_Textbook'] = data['V93']
data['Ex_Interest'] = data['I86']
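
The groupings above could also be kept in a dictionary, which makes the year-specific change (such as adding 'C18' to one list) a one-line edit. This is a sketch only: `GROUPS` shows just two of the groups, `add_group_totals` is a hypothetical helper, and the numbers are made up.

```python
import pandas as pd

# Two of the groupings from the cell above; the others follow the same pattern.
GROUPS = {
    "Re_F_Basic": ["C14", "C16", "C17", "C25"],
    "Re_F_Other": ["C20", "C36", "B13"],
}


def add_group_totals(df, groups):
    """Return a copy of df with one summed column per group."""
    out = df.copy()
    for name, cols in groups.items():
        out[name] = out[cols].sum(axis=1)
    return out


# Made-up amounts, one district per row.
demo = pd.DataFrame(
    {"C14": [1], "C16": [2], "C17": [3], "C25": [4], "C20": [5], "C36": [6], "B13": [7]}
)
totals = add_group_totals(demo, GROUPS)
print(totals[["Re_F_Basic", "Re_F_Other"]])
```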

In [39]:
data = data[['LEAID','SCHLEV','AGCHRT','CONUM','FIPST','YEAR','V33',
             'Re_F_Basic','Re_F_Special','Re_F_Other',
             'Re_S_Basic','Re_S_Special','Re_S_Other',
             'Re_L_Gov','Re_L_fee','Re_L_Other',
             'Ex_Teacher_Inst','Ex_Teacher_Supp','Ex_Employ',
             'Ex_Edu','Ex_Community','Ex_Capital','Ex_Payment',
             'Ex_Textbook','Ex_Interest',
             'AFGR'
            ]]

In [40]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9527 entries, 0 to 9526
Data columns (total 26 columns):
LEAID              9527 non-null object
SCHLEV             9527 non-null int64
AGCHRT             9527 non-null object
CONUM              9527 non-null int64
FIPST              9527 non-null int64
YEAR               9527 non-null int64
V33                9527 non-null int64
Re_F_Basic         9527 non-null int64
Re_F_Special       9527 non-null int64
Re_F_Other         9527 non-null int64
Re_S_Basic         9527 non-null int64
Re_S_Special       9527 non-null int64
Re_S_Other         9527 non-null int64
Re_L_Gov           9527 non-null int64
Re_L_fee           9527 non-null int64
Re_L_Other         9527 non-null int64
Ex_Teacher_Inst    9527 non-null int64
Ex_Teacher_Supp    9527 non-null int64
Ex_Employ          9527 non-null int64
Ex_Edu             9527 non-null int64
Ex_Community       9527 non-null int64
Ex_Capital         9527 non-null int64
Ex_Payment         9527 non-null 

In [41]:
data.head()

Unnamed: 0,LEAID,SCHLEV,AGCHRT,CONUM,FIPST,YEAR,V33,Re_F_Basic,Re_F_Special,Re_F_Other,...,Ex_Teacher_Inst,Ex_Teacher_Supp,Ex_Employ,Ex_Edu,Ex_Community,Ex_Capital,Ex_Payment,Ex_Textbook,Ex_Interest,AFGR
0,100005,3,3,1095,1,8,3790,2482000,1101000,224000,...,12471000,5847000,7745000,32292000,736000,12082000,70000,240000,413000,69.6
1,100006,3,3,1095,1,8,5647,5309000,1408000,242000,...,17798000,10404000,11933000,50516000,1491000,2443000,8000,555000,746000,61.3
2,100007,3,3,1073,1,8,12479,1513000,2158000,159000,...,52946000,28252000,34750000,148790000,4709000,5784000,99000,1876000,9415000,91.9
3,100008,3,3,1089,1,8,8298,1582000,1738000,46000,...,27664000,12089000,16632000,72287000,1245000,8347000,132000,798000,3316000,100.0
4,100011,3,3,1073,1,8,1406,839000,346000,0,...,4262000,2685000,2990000,12542000,138000,1575000,13000,84000,182000,67.3


# 4. Output
---
To get the cleaned and merged data for each year, rerun the notebook with the changes noted in the code comments.

##### Because we assume you run this notebook only to check the code, `07-08data.csv` is written to a temporary directory.

In [42]:
# save the data to the corresponding csv file for the year
#data.to_csv('06-07data.csv')
data.to_csv('temp/data/07-08data.csv')
#data.to_csv('08-09data.csv')
#data.to_csv('09-10data.csv')
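
Rather than commenting lines in and out for each year, the output path could be built from a year label. The helper below is hypothetical and writes into a throwaway directory so the sketch can be run without touching `temp/data`:

```python
import os
import tempfile

import pandas as pd


def save_year(df, year_label, out_dir):
    """Write df to <out_dir>/<year_label>data.csv and return the path."""
    os.makedirs(out_dir, exist_ok=True)  # create the directory if missing
    path = os.path.join(out_dir, f"{year_label}data.csv")
    df.to_csv(path)
    return path


# Made-up frame standing in for the cleaned data of one year.
demo = pd.DataFrame({"LEAID": ["0100005"], "AFGR": [69.6]})
with tempfile.TemporaryDirectory() as tmp:
    written = save_year(demo, "07-08", os.path.join(tmp, "temp", "data"))
    file_name = os.path.basename(written)
    file_exists = os.path.exists(written)
print(file_name, file_exists)
```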

In [43]:
#data0607 = data
data0708 = data
#data0809 = data
#data0910 = data

# 5. Combine four data files
---
This step can only be run once the data files for all four years have been created.

In [44]:
#data0607.info()
#data0708.info()
#data0809.info()
#data0910.info()

In [45]:
# combine files from the four years
# data_com = pd.concat([data0607,data0708,data0809,data0910])
# data_com.info()

In [46]:
# # save the complete csv file
# data_com.to_csv('data_com.csv')
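
A small illustration of what the commented-out `pd.concat` call does, using two made-up yearly frames. Passing `ignore_index=True` is an optional choice added here, not something the original code does; it gives the combined table a fresh 0..n-1 index instead of repeating each year's row labels.

```python
import pandas as pd

# Made-up yearly frames; the YEAR and AFGR values are illustrative only.
data0708_demo = pd.DataFrame(
    {"LEAID": ["0100005", "0100006"], "YEAR": [8, 8], "AFGR": [69.6, 61.3]}
)
data0809_demo = pd.DataFrame({"LEAID": ["0100005"], "YEAR": [9], "AFGR": [70.0]})

# Stack the yearly frames row-wise into one table.
combined = pd.concat([data0708_demo, data0809_demo], ignore_index=True)
print(combined.index.tolist())
```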

The final data file `data_com.csv` is generated by the commented-out code above.
Generating it requires the files for all four years. To produce them, re-run this notebook once per year and follow the comments to change the relevant lines.

##### All the procedures in the other .ipynb files use the dataset in `data/data_com.csv`, which is the final dataset we have already created.