Budgets for schools and school districts are huge, complex, and unwieldy. It's no easy task to digest where and how schools are using their resources [source](https://www.drivendata.org/competitions/46/box-plots-for-education-reboot/). Education Resource Strategies is a non-profit that tackles just this task with the goal of letting districts be smarter, more strategic, and more effective in their spending.

In order to compare budget or expenditure data across districts, ERS assigns every line item to certain categories in a comprehensive financial spending framework. For instance, Object_Type describes what the spending "is"—Base Salary/Compensation, Benefits, Stipends & Other Compensation, Equipment & Equipment Lease, Property Rental, and so on. Other categories describe what the spending "does," which groups of students benefit, and where the funds come from.   

**Goal**: Build a machine learning algorithm that automates this labeling. For example:   
line-item = "algebra books for 8th grade student"   
label = "Text book", "math", "middle school"

The task is a **multi-class-multi-label classification problem** with the goal of attaching canonical labels to the freeform text in budget line items. These labels let ERS understand how schools are spending money and tailor their strategy recommendations to improve outcomes for students, teachers, and administrators.
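
To make the "multi-class-multi-label" framing concrete, the sketch below one-hot encodes a toy frame with two label columns. The rows are made up for illustration; only the column names `Function` and `Use` and the label strings come from the ERS framework described in this notebook.

```python
import pandas as pd

# Toy labels (invented rows): two categories, each with several possible classes.
toy_labels = pd.DataFrame(
    {
        "Function": ["Teacher Compensation", "Food Services", "Teacher Compensation"],
        "Use": ["Instruction", "O&M", "Instruction"],
    }
)

# One indicator column per (category, class) pair.
indicators = pd.get_dummies(toy_labels, prefix_sep="__").astype(int)

print(indicators.columns.tolist())
# Each row has exactly one 1 per category, so every row sums to the number of categories.
print(indicators.sum(axis=1).tolist())
```

Each category is a multi-class target on its own, and predicting all categories at once is what makes the overall problem multi-label.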

[Credit](https://github.com/datacamp/course-resources-ml-with-experts-budgets)
[Machine Learning with the Experts: School Budgets](https://www.datacamp.com/courses/machine-learning-with-the-experts-school-budgets)   

[Additional example](https://towardsdatascience.com/using-functiontransformer-and-pipeline-in-sklearn-to-predict-chardonnay-ratings-9b13fdd6c6fd)

In [25]:

from __future__ import division
from __future__ import print_function

# ignore deprecation warnings in sklearn
import warnings
warnings.filterwarnings("ignore")

import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd

import os
import sys

**The goal is to predict the probability that a certain label is attached to a budget line item**.   
Each row in the budget has mostly free-form text features, except for the two noted below as float. Any of the fields may be empty.

**FTE** (float) - If an employee, the percentage of full-time that the employee works.   
**Facility_or_Department** - If expenditure is tied to a department/facility, that department/facility.   
**Function_Description** - A description of the function the expenditure was serving.
**Fund_Description** - A description of the source of the funds.   
**Job_Title_Description** - If this is an employee, a description of that employee's job title.   
**Location_Description** - A description of where the funds were spent.   
**Object_Description** - A description of what the funds were used for.   
**Position_Extra** - Any extra information about the position that we have.   
**Program_Description** - A description of the program that the funds were used for.   
**SubFund_Description** - More detail on Fund_Description   
**Sub_Object_Description** - More detail on Object_Description   
**Text_1** - Any additional text supplied by the district.   
**Text_2** - Any additional text supplied by the district.   
**Text_3** - Any additional text supplied by the district.   
**Text_4** - Any additional text supplied by the district.   
**Total** (float) - The total cost of the expenditure.   

In [26]:
os.chdir('./Downloads/Projects_prototype/SchoolBudget')

'/Users/tridoan/Downloads/Projects_prototype/SchoolBudget'

In [27]:
# add the 'src' directory as one where we can import modules
src_dir = os.path.join(os.getcwd(), 'src')
sys.path.append(src_dir)

In [28]:
src_dir

'/Users/tridoan/Downloads/Projects_prototype/SchoolBudget/src'

For each of these rows, ERS attaches one label from each of 9 different categories:   

**Function**:

+ Aides Compensation   
+ Career & Academic Counseling
+ Communications
+ Curriculum Development
+ Data Processing & Information Services
+ Development & Fundraising
+ Enrichment
+ Extended Time & Tutoring
+ Facilities & Maintenance
+ Facilities Planning
+ Finance, Budget, Purchasing & Distribution
+ Food Services
+ Governance
+ Human Resources
+ Instructional Materials & Supplies
+ Insurance
+ Legal
+ Library & Media
+ NO_LABEL
+ Other Compensation
+ Other Non-Compensation
+ Parent & Community Relations
+ Physical Health & Services
+ Professional Development
+ Recruitment
+ Research & Accountability
+ School Administration
+ School Supervision
+ Security & Safety
+ Social & Emotional
+ Special Population Program Management & Support
+ Student Assignment
+ Student Transportation
+ Substitute Compensation
+ Teacher Compensation
+ Untracked Budget Set-Aside
+ Utilities  

**Object_Type**:   

+ Base Salary/Compensation
+ Benefits
+ Contracted Services
+ Equipment & Equipment Lease
+ NO_LABEL
+ Other Compensation/Stipend
+ Other Non-Compensation
+ Rent/Utilities
+ Substitute Compensation
+ Supplies/Materials
+ Travel & Conferences  

**Operating_Status**:

+ Non-Operating
+ Operating, Not PreK-12
+ PreK-12 Operating

**Position_Type**:   

+ (Exec) Director
+ Area Officers
+ Club Advisor/Coach
+ Coordinator/Manager
+ Custodian
+ Guidance Counselor
+ Instructional Coach
+ Librarian
+ NO_LABEL
+ Non-Position
+ Nurse
+ Nurse Aide
+ Occupational Therapist
+ Other
+ Physical Therapist
+ Principal
+ Psychologist
+ School Monitor/Security
+ Sec/Clerk/Other Admin
+ Social Worker
+ Speech Therapist
+ Substitute
+ TA
+ Teacher
+ Vice Principal

**Pre_K**:  

+ NO_LABEL
+ Non PreK
+ PreK

**Reporting**:

+ NO_LABEL
+ Non-School
+ School

**Sharing**:

+ Leadership & Management
+ NO_LABEL
+ School Reported
+ School on Central Budgets
+ Shared Services

**Student_Type**:

+ Alternative
+ At Risk
+ ELL
+ Gifted
+ NO_LABEL
+ Poverty
+ PreK
+ Special Education
+ Unspecified

**Use**:

+ Business Services
+ ISPD
+ Instruction
+ Leadership
+ NO_LABEL
+ O&M
+ Pupil Services & Enrichment
+ Untracked Budget Set-Aside   

Note that there is a hierarchical relationship among these labels. If a line is marked as Non-Operating in the Operating_Status category, then all of the other labels should be marked as NO_LABEL, since ERS does not analyze and compare non-operating budget items.
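
The rule above can be checked mechanically. The following is a minimal sketch, assuming the labels are held in a DataFrame with one column per category; the helper name `find_hierarchy_violations` and the toy rows are hypothetical and not part of the competition code.

```python
import pandas as pd

# Label columns other than Operating_Status, as listed above.
OTHER_LABELS = ["Function", "Use", "Sharing", "Reporting",
                "Student_Type", "Position_Type", "Object_Type", "Pre_K"]


def find_hierarchy_violations(labels: pd.DataFrame) -> pd.DataFrame:
    """Return Non-Operating rows where any other label is not NO_LABEL."""
    non_operating = labels["Operating_Status"] == "Non-Operating"
    has_real_label = (labels[OTHER_LABELS] != "NO_LABEL").any(axis=1)
    return labels[non_operating & has_real_label]


# Toy data (invented): only the second row breaks the rule.
all_no_label = {c: "NO_LABEL" for c in OTHER_LABELS}
toy = pd.DataFrame([
    {**all_no_label, "Operating_Status": "Non-Operating"},
    {**all_no_label, "Function": "Utilities", "Operating_Status": "Non-Operating"},
    {**all_no_label, "Function": "Teacher Compensation",
     "Operating_Status": "PreK-12 Operating"},
])

violations = find_hierarchy_violations(toy)
print(violations.index.tolist())
```

A check like this could be run on the training labels, or on hard predictions, to see how often the hierarchy is respected.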

In [29]:
os.getcwd()

'/Users/tridoan/Downloads/Projects_prototype/SchoolBudget'

The [goal](https://www.drivendata.org/competitions/46/box-plots-for-education-reboot/page/86/) is to predict a probability for each possible label in the dataset given a row of new data. Each of these probabilities goes in a separate column in the submission file. The submission must be 50064x104 where 50064 is the number of rows in the test dataset (excluding the header) and 104 is the number of columns (excluding a first column of row ids).   

The columns in the submission have the format ColumnName__PossibleLabel, for example:  

Function__Aides Compensation      
...   
Object_Type__Base Salary/Compensation   
Object_Type__Benefits   
... 
Position_Type__(Exec) Director   
Position_Type__Area Officers   
...   
Pre_K__NO_LABEL   
Pre_K__Non PreK   
...   
Sharing__Leadership & Management   
Sharing__NO_LABEL   
...   

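For example, the first rows of a submission could look like this:

| row id | Function__Aides Compensation | Function__Career & Academic Counseling | Function__Communications | ... | Use__O&M | Use__Pupil Services & Enrichment | Use__Untracked Budget Set-Aside |
|---|---|---|---|---|---|---|---|
| 180042 | 0.027027 | 0.027027 | 0.027027 | ... | 0.125 | 0.125 | 0.125 |
| 28872 | 0.027027 | 0.027027 | 0.027027 | ... | 0.125 | 0.125 | 0.125 |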
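
The values in the excerpt above match 1/37 and 1/8, which are the reciprocals of the number of `Function` and `Use` labels. That suggests the excerpt is a uniform guess within each category; this reading is an inference from the numbers, not something the competition page is quoted as saying here. The sketch below counts the labels listed earlier and reproduces both the 104 columns and the two probabilities.

```python
# Number of possible labels per category, counted from the lists above.
LABEL_COUNTS = {
    "Function": 37, "Object_Type": 11, "Operating_Status": 3,
    "Position_Type": 25, "Pre_K": 3, "Reporting": 3,
    "Sharing": 5, "Student_Type": 9, "Use": 8,
}

# One submission column per (category, label) pair.
total_columns = sum(LABEL_COUNTS.values())

# A uniform guess assigns 1 / (number of labels) to every label in a category.
uniform = {category: 1.0 / n for category, n in LABEL_COUNTS.items()}

print(total_columns)
print(round(uniform["Function"], 6), uniform["Use"])
```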

In [30]:
from data.multilabel import multilabel_sample_dataframe, multilabel_train_test_split
from features.SparseInteractions import SparseInteractions
from models.metrics import multi_multi_log_loss

# Load Data

First, we'll load the entire training data set available from DrivenData. 
 - [Sign up for an account on DrivenData](http://www.drivendata.org)
 - [Join the Box-plots for education competition](https://www.drivendata.org/competitions/46/box-plots-for-education-reboot/)
 - Download the competition data to the `data` folder in this repository. Files should be named `TrainingSet.csv` and `TestSet.csv`. 
 

In [32]:
#path_to_training_data = os.path.join(os.pardir,'data','TrainingSet.csv')
#df = pd.read_csv(path_to_training_data, index_col=0)
df = pd.read_csv('/Users/tridoan/Downloads/data/SchoolBudget/TrainingData.csv', index_col=0)

print(df.shape)

(400277, 25)


In [33]:
df.columns.values

array(['Function', 'Use', 'Sharing', 'Reporting', 'Student_Type',
       'Position_Type', 'Object_Type', 'Pre_K', 'Operating_Status',
       'Object_Description', 'Text_2', 'SubFund_Description',
       'Job_Title_Description', 'Text_3', 'Text_4',
       'Sub_Object_Description', 'Location_Description', 'FTE',
       'Function_Description', 'Facility_or_Department', 'Position_Extra',
       'Total', 'Program_Description', 'Fund_Description', 'Text_1'],
      dtype=object)

In [34]:
df.head()

Unnamed: 0,Function,Use,Sharing,Reporting,Student_Type,Position_Type,Object_Type,Pre_K,Operating_Status,Object_Description,...,Sub_Object_Description,Location_Description,FTE,Function_Description,Facility_or_Department,Position_Extra,Total,Program_Description,Fund_Description,Text_1
134338,Teacher Compensation,Instruction,School Reported,School,NO_LABEL,Teacher,NO_LABEL,NO_LABEL,PreK-12 Operating,,...,,,1.0,,,KINDERGARTEN,50471.81,KINDERGARTEN,General Fund,
206341,NO_LABEL,NO_LABEL,NO_LABEL,NO_LABEL,NO_LABEL,NO_LABEL,NO_LABEL,NO_LABEL,Non-Operating,CONTRACTOR SERVICES,...,,,,RGN GOB,,UNDESIGNATED,3477.86,BUILDING IMPROVEMENT SERVICES,,BUILDING IMPROVEMENT SERVICES
326408,Teacher Compensation,Instruction,School Reported,School,Unspecified,Teacher,Base Salary/Compensation,Non PreK,PreK-12 Operating,Personal Services - Teachers,...,,,1.0,,,TEACHER,62237.13,Instruction - Regular,General Purpose School,
364634,Substitute Compensation,Instruction,School Reported,School,Unspecified,Substitute,Benefits,NO_LABEL,PreK-12 Operating,EMPLOYEE BENEFITS,...,,,,UNALLOC BUDGETS/SCHOOLS,,PROFESSIONAL-INSTRUCTIONAL,22.3,GENERAL MIDDLE/JUNIOR HIGH SCH,,REGULAR INSTRUCTION
47683,Substitute Compensation,Instruction,School Reported,School,Unspecified,Teacher,Substitute Compensation,NO_LABEL,PreK-12 Operating,TEACHER COVERAGE FOR TEACHER,...,,,,NON-PROJECT,,PROFESSIONAL-INSTRUCTIONAL,54.166,GENERAL HIGH SCHOOL EDUCATION,,REGULAR INSTRUCTION


# Resample Data

400,277 rows is too many to work with locally while we develop our approach. We'll sample down to 40,000 rows so that our analysis runs quickly.

We'll also create dummy variables for our labels and split our sampled dataset into a training set and a test set.
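
The sampling helpers used in the next cell come from the local `src` directory, so their code is not shown in this notebook. As a rough illustration of what a `min_count` constraint can mean, here is a simplified sketch that guarantees every label column a minimum number of positive rows before filling the rest of the sample at random. The function `sample_with_min_count` is hypothetical and may differ from the real `multilabel_sample_dataframe`.

```python
import numpy as np
import pandas as pd


def sample_with_min_count(y, size, min_count=5, seed=None):
    """Pick `size` row labels so that every indicator column of `y`
    has at least `min_count` positive rows in the sample."""
    if (y.sum(axis=0) < min_count).any():
        raise ValueError("Some label has fewer than min_count positive rows.")
    rng = np.random.RandomState(seed)
    chosen = set()
    # First, guarantee min_count positives for every label column.
    for col in y.columns:
        positives = y.index[(y[col] > 0).to_numpy()].to_numpy()
        chosen.update(rng.choice(positives, size=min_count, replace=False).tolist())
    if len(chosen) > size:
        raise ValueError("size is too small to satisfy min_count for all labels.")
    # Then fill the rest of the sample uniformly at random.
    remaining = np.array(sorted(set(y.index) - chosen))
    extra = rng.choice(remaining, size=size - len(chosen), replace=False)
    return sorted(chosen | set(extra.tolist()))


# Toy indicator matrix (invented): label "b" is rare.
toy_y = pd.get_dummies(pd.Series(["a"] * 90 + ["b"] * 10, name="label")).astype(int)
rows = sample_with_min_count(toy_y, size=20, min_count=5, seed=0)
print(len(rows), toy_y.loc[rows].sum().to_dict())
```

Without such a constraint, a plain random sample could drop rare labels entirely, and the classifier would then have no positive examples for them.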

In [35]:
LABELS = ['Function',
          'Use',
          'Sharing',
          'Reporting',
          'Student_Type',
          'Position_Type',
          'Object_Type', 
          'Pre_K',
          'Operating_Status']

NON_LABELS = [c for c in df.columns if c not in LABELS]

SAMPLE_SIZE = 40000

sampling = multilabel_sample_dataframe(df,
                                       pd.get_dummies(df[LABELS]),
                                       size=SAMPLE_SIZE,
                                       min_count=25,
                                       seed=43)

dummy_labels = pd.get_dummies(sampling[LABELS])

X_train, X_test, y_train, y_test = multilabel_train_test_split(sampling[NON_LABELS],
                                                               dummy_labels,
                                                               0.2,
                                                               min_count=3,
                                                               seed=43)

In [36]:
sampling.shape

(40000, 25)

In [37]:
y_train[:5]

Unnamed: 0,Function_Aides Compensation,Function_Career & Academic Counseling,Function_Communications,Function_Curriculum Development,Function_Data Processing & Information Services,Function_Development & Fundraising,Function_Enrichment,Function_Extended Time & Tutoring,Function_Facilities & Maintenance,Function_Facilities Planning,...,Object_Type_Rent/Utilities,Object_Type_Substitute Compensation,Object_Type_Supplies/Materials,Object_Type_Travel & Conferences,Pre_K_NO_LABEL,Pre_K_Non PreK,Pre_K_PreK,Operating_Status_Non-Operating,"Operating_Status_Operating, Not PreK-12",Operating_Status_PreK-12 Operating
38,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,1
70,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1
198,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0
209,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,1
614,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,1


In [38]:
X_train.head()

Unnamed: 0,Object_Description,Text_2,SubFund_Description,Job_Title_Description,Text_3,Text_4,Sub_Object_Description,Location_Description,FTE,Function_Description,Facility_or_Department,Position_Extra,Total,Program_Description,Fund_Description,Text_1
38,OTHER PURCHASED SERVICES,,SCHOOL-WIDE SCHOOL PGMS FOR TITLE GRANTS,,,,,,,STUDENT TRANSPORT SERVICE,,,653.46,Misc,Schoolwide Schools,
70,Extra Duty Pay/Overtime For Support Personnel,,Operations,SECURITY OFFICER,,,Extra Duty Pay/Overtime For Support Personnel,Unallocated,,Security And Monitoring Services,Security Department,POLICE PATROL MAN,2153.53,Undistributed,General Operating Fund,OVERTIME
198,Supplemental *,,Operation and Maintenance of Plant Services,,,,Non-Certificated Salaries And Wages,,,Care and Upkeep of Building Services,,,-8291.86,,Title I - Disadvantaged Children/Targeted Assi...,TITLE I CARRYOVER
209,REPAIR AND MAINTENANCE SERVICES,,PUPIL TRANSPORTATION,,,,,ADMIN. SERVICES,,STUDENT TRANSPORT SERVICE,,,618.29,PUPIL TRANSPORTATION,General Fund,
614,,GENERAL EDUCATION,LOCAL,"EDUCATIONAL AIDE,70 HRS",,,,,0.71,,,,21747.666875,,,


# Create preprocessing tools

We need tools to preprocess our text and numeric data. We'll create those tools here. The `combine_text_columns` function will take a DataFrame of text columns and return a single series where all of the text in the columns has been joined together.

We'll then create `FunctionTransformer` objects that select our text and numeric data from the dataframe.

Finally, we create a custom scoring method that uses the `multi_multi_log_loss` function that is the evaluation metric for the competition.
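
The `multi_multi_log_loss` function is imported from the local `src` directory and is not reproduced here. The sketch below shows one plausible reading of such a metric: compute the multi-class log loss separately for each category and average the results. The function name `multi_category_log_loss`, the renormalization step, and the `category_slices` argument are assumptions for illustration, so the real implementation may differ in detail.

```python
import numpy as np


def multi_category_log_loss(predicted, actual, category_slices, eps=1e-15):
    """Average the per-category multi-class log loss.

    predicted, actual: arrays of shape (n_rows, n_indicator_columns)
    category_slices: one slice per category selecting its columns
    """
    predicted = np.asarray(predicted, dtype=np.float64)
    actual = np.asarray(actual, dtype=np.float64)
    losses = []
    for cols in category_slices:
        p = predicted[:, cols]
        # Renormalize so probabilities within a category sum to 1.
        p = p / p.sum(axis=1, keepdims=True)
        # Clip to avoid log(0).
        p = np.clip(p, eps, 1 - eps)
        losses.append(-np.sum(actual[:, cols] * np.log(p)) / actual.shape[0])
    return float(np.mean(losses))


# Toy check: two categories with 2 and 4 classes, uniform predictions.
actual = np.array([[1, 0, 0, 0, 1, 0],
                   [0, 1, 1, 0, 0, 0]])
predicted = np.hstack([np.full((2, 2), 0.5), np.full((2, 4), 0.25)])
loss = multi_category_log_loss(predicted, actual, [slice(0, 2), slice(2, 6)])
print(loss)
```

With uniform predictions the per-category losses are log(2) and log(4), so the average is 1.5 * log(2). Lower values are better.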

In [39]:
set(X_train.columns.tolist())

{'FTE',
 'Facility_or_Department',
 'Function_Description',
 'Fund_Description',
 'Job_Title_Description',
 'Location_Description',
 'Object_Description',
 'Position_Extra',
 'Program_Description',
 'SubFund_Description',
 'Sub_Object_Description',
 'Text_1',
 'Text_2',
 'Text_3',
 'Text_4',
 'Total'}

In [40]:
set(['FTE', "Total"]) & set(X_train.columns.tolist())

{'FTE', 'Total'}

In [41]:


NUMERIC_COLUMNS = ['FTE', "Total"]

def combine_text_columns(data_frame, to_drop = NUMERIC_COLUMNS + LABELS):
    """ Takes the dataset as read in, drops the non-feature, non-text columns and
        then combines all of the text columns into a single vector that has all of
        the text for a row.
        
        :param data_frame: The data as read in with read_csv (no preprocessing necessary)
        :param to_drop (optional): Removes the numeric and label columns by default.
    """
    # drop non-text columns that are in the df
    to_drop = set(to_drop) & set(data_frame.columns.tolist())
    text_data = data_frame.drop(to_drop, axis=1)
    
    # replace nans with blanks
    text_data.fillna("", inplace=True)
    
    # joins all of the text items in a row (axis=1)
    # with a space in between
    return text_data.apply(lambda x: " ".join(x), axis=1)


**FunctionTransformer** is useful because it lets you apply a custom function inside a pipeline: `Pipeline()` from `sklearn.pipeline` only works with **objects** that implement the `.fit()` and `.transform()` methods.   

For example, we could transform a DataFrame or Series with `.apply()` (or something similar, like a list comprehension), but we could not use that function in `Pipeline()` without first wrapping it in `FunctionTransformer`.
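
As a small self-contained illustration of that point, the sketch below wraps a plain column-selecting function so it can sit in front of an imputer. The toy frame and the helper `select_total` are made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def select_total(frame):
    """Plain function: keep only the numeric 'Total' column."""
    return frame[["Total"]]


# Toy data (invented) with one missing numeric value.
toy = pd.DataFrame({"Total": [100.0, np.nan, 300.0], "Text_1": ["a", "b", "c"]})

# Wrapping the function gives it .fit() and .transform(), so Pipeline accepts it.
numeric_pipeline = Pipeline([
    ("selector", FunctionTransformer(select_total, validate=False)),
    ("imputer", SimpleImputer()),  # default strategy fills NaN with the column mean
])

result = numeric_pipeline.fit_transform(toy)
print(result.ravel().tolist())
```

Passing `select_total` directly as a pipeline step would fail, because a bare function has no `fit` or `transform` method.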

In [42]:
from sklearn.preprocessing import FunctionTransformer

get_text_data = FunctionTransformer(combine_text_columns, validate=False)
get_numeric_data = FunctionTransformer(lambda x: x[NUMERIC_COLUMNS], validate=False)



In [15]:
get_text_data.fit_transform(sampling.head(5))

38     OTHER PURCHASED SERVICES  SCHOOL-WIDE SCHOOL P...
70     Extra Duty Pay/Overtime For Support Personnel ...
198    Supplemental *  Operation and Maintenance of P...
209    REPAIR AND MAINTENANCE SERVICES  PUPIL TRANSPO...
614     GENERAL EDUCATION LOCAL EDUCATIONAL AIDE,70 H...
dtype: object

In [16]:
get_numeric_data.fit_transform(sampling.head(5))

Unnamed: 0,FTE,Total
38,,653.46
70,,2153.53
198,,-8291.86
209,,618.29
614,0.71,21747.666875


In [17]:
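# make_scorer is part of the public sklearn.metrics API; the private
# sklearn.metrics.scorer module is not available in later scikit-learn releases
from sklearn.metrics import make_scorer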

log_loss_scorer = make_scorer(multi_multi_log_loss)

# Train model pipeline

Now we'll train the final pipeline from the course that takes text and numeric data, does the necessary preprocessing, and trains the classifier.

In [18]:
from sklearn.feature_selection import chi2, SelectKBest

from sklearn.pipeline import Pipeline, FeatureUnion

from sklearn.impute import SimpleImputer
from sklearn.feature_extraction.text import HashingVectorizer

from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.preprocessing import MaxAbsScaler

TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'
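
It can help to see what this token pattern matches on its own. The sketch below applies the same regular expression with `re.findall` to an invented line item; it shows only the behaviour of the pattern itself, not the full vectorizer.

```python
import re

TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'

# The lookahead (?=\s+) requires whitespace right after the token.
tokens = re.findall(TOKENS_ALPHANUMERIC, "PETRO-VEND FUEL AND FLUIDS")
print(tokens)  # ['VEND', 'FUEL', 'AND']
```

Because of the lookahead, a run of characters that is followed by punctuation or by the end of the string is not returned, so `PETRO` and the final `FLUIDS` are dropped here.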


In [44]:
%%time

# set a reasonable number of features before adding interactions
chi_k = 300

# create the pipeline object
pl = Pipeline([
        ('union', FeatureUnion(
            transformer_list = [
                ('numeric_features', Pipeline([
                    ('selector', get_numeric_data),
                    ('imputer', SimpleImputer())
                ])),
                ('text_features', Pipeline([
                    ('selector', get_text_data),
                    ('vectorizer', HashingVectorizer(token_pattern=TOKENS_ALPHANUMERIC,
                                                     alternate_sign=False, norm=None, binary=False,
                                                     ngram_range=(1, 2))),
                    ('dim_red', SelectKBest(chi2, k=chi_k))
                ]))
             ]
        )),
        ('int', SparseInteractions(degree=2)),
        ('scale', MaxAbsScaler()),
        ('clf', OneVsRestClassifier(LogisticRegression()))
    ])

# fit the pipeline to our training data
pl.fit(X_train, y_train.values)

# print the score of our trained pipeline on our test set
print("Logloss score of trained pipeline: ", log_loss_scorer(pl, X_test, y_test.values))

Logloss score of trained pipeline:  2.1882609128545414
CPU times: user 19min 48s, sys: 12.6 s, total: 20min
Wall time: 3min 59s
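
The `SparseInteractions(degree=2)` step in the pipeline above is imported from the local `src` directory, so its code does not appear in this notebook. As a rough idea of what a degree-2 interaction step can compute, the sketch below appends the element-wise product of every pair of columns to a sparse matrix. The function `pairwise_interactions` is hypothetical and is not the class used above.

```python
from itertools import combinations

import numpy as np
from scipy import sparse


def pairwise_interactions(X):
    """Append the element-wise product of every pair of columns to X."""
    X = sparse.csc_matrix(X)  # CSC makes column slicing cheap
    products = [X[:, i].multiply(X[:, j])
                for i, j in combinations(range(X.shape[1]), 2)]
    return sparse.hstack([X] + products).tocsr()


# Toy input (invented): 2 rows, 3 features.
X = np.array([[1, 2, 0],
              [0, 3, 4]])
out = pairwise_interactions(X)
print(out.toarray())
```

If the real step behaves like this sketch, the 302 columns entering it (2 numeric plus `chi_k = 300` selected text features) would grow to 302 + 302 * 301 / 2 = 45,753 columns, which helps explain why feature selection happens before the interactions are added.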


# Predict holdout set and write submission

Finally, we want to use our trained pipeline to predict the holdout dataset. We will write our predictions to a file, `predictions.csv`, that we can submit on [DrivenData](http://www.drivendata.org)!

In [26]:
os.getcwd()

'/Users/tridoan/Downloads/Projects_prototype/SchoolBudget'

In [11]:
path_to_holdout_data = os.path.join(os.pardir,
                                    'data',
                                    'TestSet.csv')

# Load holdout data
holdout = pd.read_csv(path_to_holdout_data, index_col=0)

# Make predictions
predictions = pl.predict_proba(holdout)

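# Format correctly in new DataFrame: prediction_df
# prefix_sep='__' produces the ColumnName__PossibleLabel names the submission expects
prediction_df = pd.DataFrame(columns=pd.get_dummies(df[LABELS], prefix_sep='__').columns,
                             index=holdout.index,
                             data=predictions)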


# Save prediction_df to csv called "predictions.csv"
prediction_df.to_csv("predictions.csv")
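
Before uploading, it may be worth checking the file against the submission format described earlier (104 probability columns named `ColumnName__PossibleLabel`). The helper below is a hypothetical sketch of such a check, not part of the competition tooling; the toy frame is invented.

```python
import numpy as np
import pandas as pd


def check_submission(frame, categories, n_columns=104):
    """Return a list of problems found in a submission-style DataFrame."""
    problems = []
    if frame.shape[1] != n_columns:
        problems.append(f"expected {n_columns} columns, found {frame.shape[1]}")
    # Column names should look like Category__Label with a known category.
    bad_names = [c for c in frame.columns
                 if "__" not in c or c.split("__", 1)[0] not in categories]
    if bad_names:
        problems.append(f"{len(bad_names)} column names are not Category__Label")
    values = frame.to_numpy(dtype=float)
    if np.isnan(values).any() or (values < 0).any() or (values > 1).any():
        problems.append("values must be probabilities in [0, 1]")
    return problems


# Toy frame: the second column uses a single underscore by mistake.
toy = pd.DataFrame({"Pre_K__PreK": [0.2], "Pre_K_NO_LABEL": [0.8]})
print(check_submission(toy, {"Pre_K"}, n_columns=2))
```

For the real file one would pass the nine category names from `LABELS` and keep the default of 104 columns.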