# Modeling School Budgets
- 2017-12-12
- William Surles
- This is for the DrivenData [school budget competition](https://www.drivendata.org/competitions/46/box-plots-for-education-reboot/) (it's just for learning)
- I am using [this](https://github.com/datacamp/course-resources-ml-with-experts-budgets/blob/master/notebooks/1.0-full-model.ipynb) notebook provided by datacamp as a guide

In [1]:
%matplotlib inline
# from __future__ import division
# from __future__ import print_function

# silence all warnings, including the deprecation warnings from sklearn
import warnings
warnings.filterwarnings("ignore")

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import os
import sys

# add the src directory as one where we can import modules
src_dir = os.path.join(os.getcwd(), 'src')
sys.path.append(src_dir)

from data.multilabel import multilabel_sample_dataframe, multilabel_train_test_split
from features.SparseInteractions import SparseInteractions
from models.metrics import multi_multi_log_loss

# Load Data

In [2]:
file = 'data/TrainingData.csv'
df = pd.read_csv(file, index_col = 0)
print(df.shape)

(400277, 25)


# Resample Data

- Okay, so 400k rows might be a bit too many to work on locally, especially if I'm going to do bi-grams and interactions. I'll be here all day. 
- We'll sample down to 10k so that we can run the analysis faster and see how the models perform. 
- We will also create dummy variables for our labels and split our sampled dataset into training and test sets.
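
The `min_count` argument matters because a plain random sample of 10k rows could easily miss the rare labels entirely. I haven't reproduced the course helper here, but a minimal sketch of the idea, assuming the approach is "reserve `min_count` rows per label column, then top up at random", could look like this. The function name `sample_with_min_count` and the toy label matrix are hypothetical, not part of `src/data/multilabel.py`.

```python
import numpy as np

def sample_with_min_count(y, size, min_count=5, seed=None):
    """Pick `size` distinct row indices so that every column of the
    0/1 matrix `y` keeps at least `min_count` positive rows.
    Hypothetical sketch, not the course implementation."""
    y = np.asarray(y)
    rng = np.random.RandomState(seed)
    if (y.sum(axis=0) < min_count).any():
        raise ValueError("some label has fewer than min_count positive rows")
    if size < min_count * y.shape[1] or size > y.shape[0]:
        raise ValueError("size must be in [min_count * n_labels, n_rows]")

    chosen = np.array([], dtype=int)
    # First reserve min_count positive rows for every label column.
    for j in range(y.shape[1]):
        positives = np.flatnonzero(y[:, j])
        picked = rng.choice(positives, size=min_count, replace=False)
        chosen = np.union1d(chosen, picked)

    # Then top up with rows drawn uniformly from everything not yet chosen.
    remaining = np.setdiff1d(np.arange(y.shape[0]), chosen)
    extra = rng.choice(remaining, size=size - len(chosen), replace=False)
    return np.concatenate([chosen, extra])

# Toy label matrix: 200 rows, 3 one-hot columns, the first one rare.
labels = np.zeros((200, 3), dtype=int)
labels[:10, 0] = 1
labels[10:30, 1] = 1
labels[30:, 2] = 1

idx = sample_with_min_count(labels, size=50, min_count=5, seed=43)
counts = labels[idx].sum(axis=0)  # positives per label in the sample
```

Every entry of `counts` is at least 5 here, which is the guarantee a naive `df.sample(50)` would not give for the rare first column.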

In [3]:
LABELS = [
    'Function',
    'Use',
    'Sharing',
    'Reporting',
    'Student_Type',
    'Position_Type',
    'Object_Type', 
    'Pre_K',
    'Operating_Status']

NON_LABELS = [c for c in df.columns if c not in LABELS]

SAMPLE_SIZE = 10000

sampling = multilabel_sample_dataframe(
    df, 
    pd.get_dummies(df[LABELS]),
    size = SAMPLE_SIZE,
    min_count = 25, 
    seed = 43)

sampling.head()

Unnamed: 0,Function,Use,Sharing,Reporting,Student_Type,Position_Type,Object_Type,Pre_K,Operating_Status,Object_Description,...,Sub_Object_Description,Location_Description,FTE,Function_Description,Facility_or_Department,Position_Extra,Total,Program_Description,Fund_Description,Text_1
38,Student Transportation,NO_LABEL,School Reported,School,NO_LABEL,NO_LABEL,Contracted Services,NO_LABEL,PreK-12 Operating,OTHER PURCHASED SERVICES,...,,,,STUDENT TRANSPORT SERVICE,,,653.46,Misc,Schoolwide Schools,
70,Security & Safety,O&M,School on Central Budgets,Non-School,Unspecified,School Monitor/Security,Other Compensation/Stipend,Non PreK,PreK-12 Operating,Extra Duty Pay/Overtime For Support Personnel,...,Extra Duty Pay/Overtime For Support Personnel,Unallocated,,Security And Monitoring Services,Security Department,POLICE PATROL MAN,2153.53,Undistributed,General Operating Fund,OVERTIME
198,NO_LABEL,NO_LABEL,NO_LABEL,NO_LABEL,NO_LABEL,NO_LABEL,NO_LABEL,NO_LABEL,Non-Operating,Supplemental *,...,Non-Certificated Salaries And Wages,,,Care and Upkeep of Building Services,,,-8291.86,,Title I - Disadvantaged Children/Targeted Assi...,TITLE I CARRYOVER
209,Student Transportation,NO_LABEL,Shared Services,Non-School,NO_LABEL,NO_LABEL,Other Non-Compensation,NO_LABEL,PreK-12 Operating,REPAIR AND MAINTENANCE SERVICES,...,,ADMIN. SERVICES,,STUDENT TRANSPORT SERVICE,,,618.29,PUPIL TRANSPORTATION,General Fund,
614,Aides Compensation,Instruction,School Reported,School,Unspecified,TA,Base Salary/Compensation,NO_LABEL,PreK-12 Operating,,...,,,0.71,,,,21747.666875,,,


In [4]:
dummy_labels = pd.get_dummies(sampling[LABELS])
dummy_labels.head()

Unnamed: 0,Function_Aides Compensation,Function_Career & Academic Counseling,Function_Communications,Function_Curriculum Development,Function_Data Processing & Information Services,Function_Development & Fundraising,Function_Enrichment,Function_Extended Time & Tutoring,Function_Facilities & Maintenance,Function_Facilities Planning,...,Object_Type_Rent/Utilities,Object_Type_Substitute Compensation,Object_Type_Supplies/Materials,Object_Type_Travel & Conferences,Pre_K_NO_LABEL,Pre_K_Non PreK,Pre_K_PreK,Operating_Status_Non-Operating,"Operating_Status_Operating, Not PreK-12",Operating_Status_PreK-12 Operating
38,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,1
70,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1
198,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0
209,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,1
614,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,1


In [5]:
X_train, X_test, y_train, y_test = multilabel_train_test_split(
    sampling[NON_LABELS],
    dummy_labels,
    size = 0.2,
    min_count = 25, 
    seed = 43)

In [6]:
X_train.head()

Unnamed: 0,Object_Description,Text_2,SubFund_Description,Job_Title_Description,Text_3,Text_4,Sub_Object_Description,Location_Description,FTE,Function_Description,Facility_or_Department,Position_Extra,Total,Program_Description,Fund_Description,Text_1
38,OTHER PURCHASED SERVICES,,SCHOOL-WIDE SCHOOL PGMS FOR TITLE GRANTS,,,,,,,STUDENT TRANSPORT SERVICE,,,653.46,Misc,Schoolwide Schools,
198,Supplemental *,,Operation and Maintenance of Plant Services,,,,Non-Certificated Salaries And Wages,,,Care and Upkeep of Building Services,,,-8291.86,,Title I - Disadvantaged Children/Targeted Assi...,TITLE I CARRYOVER
614,,GENERAL EDUCATION,LOCAL,"EDUCATIONAL AIDE,70 HRS",,,,,0.71,,,,21747.666875,,,
662,EMPLOYEE BENEFITS,EARLY EDUCATION,FEDERAL GDPG FUND - FY,"Teacher,Retrd Shrt Term Sub",Regular,,,,,HEAD START,,PROFESSIONAL-INSTRUCTIONAL,0.63,EARLY CHILDHOOD,,REGULAR INSTRUCTION
750,Personal Services - Teachers,,,TCHER 5TH GRADE,,Regular Instruction,,,1.0,,,TEACHER,49768.82,Instruction - Regular,General Purpose School,


In [7]:
y_train.head()

Unnamed: 0,Function_Aides Compensation,Function_Career & Academic Counseling,Function_Communications,Function_Curriculum Development,Function_Data Processing & Information Services,Function_Development & Fundraising,Function_Enrichment,Function_Extended Time & Tutoring,Function_Facilities & Maintenance,Function_Facilities Planning,...,Object_Type_Rent/Utilities,Object_Type_Substitute Compensation,Object_Type_Supplies/Materials,Object_Type_Travel & Conferences,Pre_K_NO_LABEL,Pre_K_Non PreK,Pre_K_PreK,Operating_Status_Non-Operating,"Operating_Status_Operating, Not PreK-12",Operating_Status_PreK-12 Operating
38,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,1
198,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0
614,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,1
662,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,1
750,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1


# Create preprocessing tools

- We need to make some pre-processing tools for our text and numeric data.
- The `combine_text_columns` function will take a DataFrame of text columns and return a single series where all of the text in the columns has been joined together.

In [9]:
NUMERIC_COLUMNS = ['FTE', 'Total']

def combine_text_columns(data_frame, to_drop=NUMERIC_COLUMNS + LABELS):
    """ Joins all text in each row of data_frame into a single string.
        :param data_frame: The data as read in with read_csv (no preprocessing necessary)
        :param to_drop: (optional) Columns to remove; the numeric and label columns by default
    """
    
    # Drop non-text columns that are in the df
    to_drop = set(to_drop) & set(data_frame.columns.tolist())
    text_data = data_frame.drop(to_drop, axis = 1)
    
    # Replace nans with blanks
    text_data.fillna("", inplace=True)
    
    # Join all text items in a row with a space in between
    return text_data.apply(lambda x: " ".join(x), axis=1)

- We also need to create `FunctionTransformer` objects that select our text and numeric data from the dataframe

In [11]:
from sklearn.preprocessing import FunctionTransformer

get_text_data = FunctionTransformer(combine_text_columns, validate = False)
get_numeric_data = FunctionTransformer(lambda x: x[NUMERIC_COLUMNS], validate = False)

In [12]:
get_text_data.fit_transform(sampling.head(5))

38     OTHER PURCHASED SERVICES  SCHOOL-WIDE SCHOOL P...
70     Extra Duty Pay/Overtime For Support Personnel ...
198    Supplemental *  Operation and Maintenance of P...
209    REPAIR AND MAINTENANCE SERVICES  PUPIL TRANSPO...
614     GENERAL EDUCATION LOCAL EDUCATIONAL AIDE,70 H...
dtype: object

In [13]:
get_numeric_data.fit_transform(sampling.head(5))

Unnamed: 0,FTE,Total
38,,653.46
70,,2153.53
198,,-8291.86
209,,618.29
614,0.71,21747.666875


- Finally, we create a custom scorer from `multi_multi_log_loss`, the evaluation metric for the competition.

In [14]:
from sklearn.metrics import make_scorer

log_loss_scorer = make_scorer(multi_multi_log_loss)
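
I import `multi_multi_log_loss` from `src` without showing it, so here is a rough sketch of what a metric like this computes, assuming it is the ordinary multiclass log loss calculated separately for each label category (Function, Use, ...) and then averaged. The name `grouped_log_loss` and the `group_slices` argument are made up for illustration; the real function may differ in details such as clipping.

```python
import numpy as np

def grouped_log_loss(y_pred, y_true, group_slices, eps=1e-15):
    """Average the multiclass log loss over several groups of one-hot columns.
    Hypothetical sketch, not the competition's reference implementation."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    losses = []
    for cols in group_slices:
        # Renormalize so the predicted probabilities within a group sum to 1.
        p = y_pred[:, cols]
        p = p / np.clip(p.sum(axis=1, keepdims=True), eps, np.inf)
        # Clip to avoid log(0).
        p = np.clip(p, eps, 1 - eps)
        # Mean negative log-probability of the true class in this group.
        losses.append(-np.mean(np.sum(y_true[:, cols] * np.log(p), axis=1)))
    return float(np.mean(losses))

# Two toy label groups: columns 0-1 (2 classes) and columns 2-4 (3 classes).
groups = [slice(0, 2), slice(2, 5)]
truth = np.array([[1, 0, 0, 1, 0],
                  [0, 1, 0, 0, 1]])

# A model that knows nothing predicts uniform probabilities in each group.
uniform = np.array([[0.5, 0.5, 1 / 3, 1 / 3, 1 / 3]] * 2)
uniform_loss = grouped_log_loss(uniform, truth, groups)

# A perfect model puts all probability on the true class.
perfect_loss = grouped_log_loss(truth, truth, groups)
```

For the uniform guess the per-group losses are ln 2 and ln 3, so `uniform_loss` is their mean, about 0.896. Lower is better, and `perfect_loss` is essentially 0.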

# Train model pipeline

Now we'll train the final pipeline from the course. It will:
- take text and numeric data
- do necessary preprocessing
- train the classifier

Pretty simple, right?

In [16]:
from sklearn.feature_selection import chi2, SelectKBest
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.impute import SimpleImputer as Imputer  # sklearn >= 0.20; the old preprocessing.Imputer was removed in 0.22
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MaxAbsScaler

TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'
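
To see what this pattern actually keeps, it helps to run it through `re.findall` on a couple of strings from the data. This is only a quick check with the standard library, on the assumption that the vectorizer finds tokens the same way. Note the lookahead `(?=\s+)`: a token only counts if whitespace follows it, so the last word of a string and anything directly before punctuation are dropped.

```python
import re

TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'

samples = ["TITLE I CARRYOVER", "EDUCATIONAL AIDE,70 HRS"]
matches = [re.findall(TOKENS_ALPHANUMERIC, s) for s in samples]
print(matches)
# → [['TITLE', 'I'], ['EDUCATIONAL', '70']]
```

`CARRYOVER`, `AIDE`, and `HRS` never make it into the features. That is not a big deal once all the text columns are joined with spaces, but it is worth knowing about.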

In [19]:
%%time

## set a reasonable number of features before adding interactions
chi_k = 300

## create the pipeline object
pl = Pipeline([
    ('union', FeatureUnion(
        transformer_list = [
            ('numeric_features', Pipeline([
                ('selector', get_numeric_data),
                ('imputer', Imputer())
            ])),
            ('text_features', Pipeline([
                ('selector', get_text_data),
                ('vectorizer', HashingVectorizer(
                    token_pattern = TOKENS_ALPHANUMERIC,

                    norm = None,
                    binary = False,
                    ngram_range = (1, 2))),
                ('dim_red', SelectKBest(chi2, k = chi_k))
            ]))
        ])),
    ('int', SparseInteractions(degree = 2)),
    ('scale', MaxAbsScaler()),
    ('clf', OneVsRestClassifier(LogisticRegression()))
])

## fit the pipeline to our training data
pl.fit(X_train, y_train.values)

## print the score
print("Logloss score on test data: ", log_loss_scorer(pl, X_test, y_test.values))

ValueError: Input X must be non-negative.
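
The error comes from the `chi2` step. `chi2` only accepts non-negative features, and by default `HashingVectorizer` randomly flips the sign of hashed features (`alternate_sign=True`), so some counts come out negative. Here is a small check on made-up documents, assuming sklearn 0.19 or newer where the `alternate_sign` parameter exists. The documents and labels are toy values, not competition data.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_selection import chi2

docs = ["student transport service", "security department overtime",
        "title carryover funds", "general education local"]
y = np.array([0, 1, 0, 1])  # toy labels

# Default settings: hashed counts may be negative.
signed = HashingVectorizer(norm=None, n_features=2**10).transform(docs)
# alternate_sign=False keeps every count non-negative.
unsigned = HashingVectorizer(norm=None, n_features=2**10,
                             alternate_sign=False).transform(docs)

has_negative = bool((signed.data < 0).any())

# chi2 rejects the signed matrix whenever it contains a negative value.
error = None
try:
    chi2(signed, y)
except ValueError as exc:
    error = str(exc)

# The unsigned matrix goes through without complaint.
scores, pvalues = chi2(unsigned, y)
print(has_negative, error)
```

So the likely fix for the pipeline above is to pass `alternate_sign = False` to the `HashingVectorizer` in the `text_features` branch. I haven't rerun the full pipeline with that change, so there is no score to report here.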

# Predict holdout set and write submission