# Box-Plots for Education (DrivenData project)

This project is a DrivenData competition (https://www.drivendata.org/competitions/46/box-plots-for-education-reboot/). The goal is to predict, for each budget line item, the probability that each label applies to it.

#### Predictors:

`FTE` float - If an employee, the percentage of full-time that the employee works. 

`Facility_or_Department` - If expenditure is tied to a department/facility, that department/facility.

`Function_Description` - A description of the function the expenditure was serving.

`Fund_Description` - A description of the source of the funds.

`Job_Title_Description` - If this is an employee, a description of that employee's job title.

`Location_Description` - A description of where the funds were spent.

`Object_Description` - A description of what the funds were used for.

`Position_Extra` - Any extra information about the position that we have.

`Program_Description` - A description of the program that the funds were used for.

`SubFund_Description` - More detail on Fund_Description

`Sub_Object_Description` - More detail on Object_Description 

`Text_1` - Any additional text supplied by the district.

`Text_2` - Any additional text supplied by the district.

`Text_3` - Any additional text supplied by the district.

`Text_4` - Any additional text supplied by the district.

`Total` float - The total cost of the expenditure. 

#### Labels:

`Function`

`Use`

`Sharing`

`Reporting`

`Student_Type`

`Position_Type`

`Object_Type`

`Pre_K`

`Operating_Status`

Loading the dataset:

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('TrainingData.csv', index_col=0)

Separating the columns into three lists: labels, non-label predictors, and numeric predictors:

In [2]:
# Store the names of the label (target) columns in a list
LABELS = ['Function', 'Object_Type', 'Operating_Status',
          'Position_Type', 'Pre_K', 'Reporting',
          'Sharing', 'Student_Type', 'Use']

# Get the columns that are features in the original df
NON_LABELS = [c for c in df.columns if c not in LABELS]

# Get the columns that are numeric in the original df
NUMERIC_COLUMNS = ['FTE', 'Total']

Loading functions to sample a multilabel dataset and split it into train and test sets (taken from DataCamp: https://github.com/datacamp/course-resources-ml-with-experts-budgets/blob/master/src/data/multilabel.py).

The full training set has 400,277 rows. To reduce computation time so that this notebook can run as an example, a sample is taken first and then split into train and test sets:

In [3]:
from warnings import warn
from multilable_split import multilabel_sample_dataframe, multilabel_train_test_split

# Set the sample size
SAMPLE_SIZE = 50000

# Take a sample of `SAMPLE_SIZE` rows
sampling = multilabel_sample_dataframe(df,
                                       pd.get_dummies(df[LABELS]),
                                       size=SAMPLE_SIZE,
                                       min_count=25,
                                       seed=42)

# Get the dummy encoding of the labels
dummy_labels = pd.get_dummies(sampling[LABELS])

# Split into training and test sets
X_train, X_test, y_train, y_test = multilabel_train_test_split(sampling[NON_LABELS],
                                                               dummy_labels,
                                                               0.3,
                                                               min_count=3,
                                                               seed=42)

Convert all text data into a single string per row so it can be passed to a vectorizer object:

In [4]:
# Function to convert all text data in the DataFrame to a
# single string per row that can be passed to the vectorizer object
def combine_text_columns(data_frame, to_drop=NUMERIC_COLUMNS + LABELS):
    """Convert all text in each row of data_frame to a single string."""
    # `NUMERIC_COLUMNS` and `LABELS` are lists that contain non-text columns
    
    # Drop non-text columns that are in the df
    to_drop = set(to_drop) & set(data_frame.columns.tolist())
    text_data = data_frame.drop(to_drop, axis=1)
    
    # Replace nans with blanks
    text_data.fillna('', inplace=True)
    
    # Join all text items in a row, separated by a space
    return text_data.apply(lambda x: " ".join(x), axis=1)
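
To make the effect of `combine_text_columns` concrete, here is a minimal sketch of the same idea on a single row, using a plain dict instead of a DataFrame. The helper name `combine_text_fields` and the row values are made up for illustration, and `None` stands in for a missing value.

```python
# Illustration only: a plain dict stands in for one DataFrame row.
NUMERIC_COLUMNS = ['FTE', 'Total']

def combine_text_fields(row, to_drop=NUMERIC_COLUMNS):
    """Join the text fields of one row into a single space-separated string."""
    parts = []
    for column, value in row.items():
        if column in to_drop:
            continue  # skip non-text columns
        # Replace a missing value with a blank, as fillna('') does
        parts.append('' if value is None else str(value))
    return " ".join(parts)

# Hypothetical row (values are made up)
row = {
    'FTE': 1.0,
    'Job_Title_Description': 'TEACHER',
    'Position_Extra': None,
    'Text_1': 'ENGLISH 2ND GRADE',
    'Total': 45000.0,
}
combined = combine_text_fields(row)
print(repr(combined))
```

Note that a missing field still contributes a separator, so blanks in the middle of a row leave consecutive spaces in the combined string.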


Creating Function Transformers:

In [5]:
from sklearn.preprocessing import FunctionTransformer

# Preprocess the text data: get_text_data
get_text_data = FunctionTransformer(combine_text_columns,validate=False)

# Preprocess the numeric data: get_numeric_data
get_numeric_data = FunctionTransformer(lambda x: x[NUMERIC_COLUMNS], validate=False)

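Text processing: tokenize so that only alphanumeric tokens are kept, and include both unigrams and bi-grams. With this `token_pattern`, a token is a run of alphanumeric characters immediately followed by whitespace. A hyphen or other punctuation mark between two words therefore breaks them apart, and a run that is followed directly by punctuation (or by the end of the string) is not kept as a token.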

Parameters used:

`token_pattern`: regular expression that defines what counts as a token. Used `'[A-Za-z0-9]+(?=\\s+)'` to keep only alphanumeric tokens.

`ngram_range`: range of n-gram sizes to use. Used `(1,2)` to include everything from unigrams (1) to bi-grams (2).

In [6]:
# Define the token pattern
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'
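
To see what this pattern keeps, it can be tried directly with the standard `re` module. This is only a sketch: the sample strings are made up, and `word_ngrams` is a hypothetical helper that imitates how word n-grams are built from a token list, not the vectorizer's own code.

```python
import re

TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'

def word_ngrams(tokens, ngram_range=(1, 2)):
    """Build space-joined n-grams for every size in ngram_range."""
    low, high = ngram_range
    return [" ".join(tokens[i:i + n])
            for n in range(low, high + 1)
            for i in range(len(tokens) - n + 1)]

# Made-up example; note the trailing space after 'teacher'
tokens = re.findall(TOKENS_ALPHANUMERIC, "2nd grade pre-k teacher ")
print(tokens)               # 'pre' is followed by '-', so it is not matched
print(word_ngrams(tokens))  # unigrams first, then bi-grams

# Without trailing whitespace the last word is not matched
last_word_dropped = re.findall(TOKENS_ALPHANUMERIC, "english teacher")
print(last_word_dropped)
```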

Hashing trick: a way to keep the feature array small, especially when working with large amounts of text data. In sklearn it is available through `HashingVectorizer`, which can be used in place of `CountVectorizer`.

In [7]:
from sklearn.feature_extraction.text import HashingVectorizer

vector = HashingVectorizer(token_pattern=TOKENS_ALPHANUMERIC,
                           ngram_range=(1,2),
                           alternate_sign=False,
                           norm=None,
                           binary=False)
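
The idea behind the hashing trick can be sketched in a few lines of plain Python: each token is hashed to a column index, so no vocabulary has to be stored and the vector length is fixed in advance. This is an illustration only. `hashed_counts` is a hypothetical helper, `md5` is a stand-in hash (sklearn uses MurmurHash3), and 16 columns is far smaller than `HashingVectorizer`'s default of 2**20.

```python
import hashlib

def hashed_counts(tokens, n_features=16):
    """Map tokens to a fixed-length count vector via hashing (no vocabulary stored)."""
    vector = [0] * n_features
    for token in tokens:
        digest = hashlib.md5(token.encode('utf-8')).hexdigest()
        index = int(digest, 16) % n_features  # column (bucket) for this token
        vector[index] += 1
    return vector

# Made-up token list; 'teacher' appears twice and lands in the same bucket both times
tokens = ['2nd', 'grade', 'english', 'teacher', 'teacher']
vec = hashed_counts(tokens)
print(vec)
```

Different tokens can collide in the same bucket. That is the price paid for the fixed size, and it becomes less likely as `n_features` grows.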

Adding interaction terms to account for the influence of tokens that appear in the same record but are not adjacent, and are therefore not captured by the bi-grams. For example, take '2nd grade english teacher' and '2nd grade elementary school english teacher'. Both strings contain '2nd grade' and 'english teacher', but in the second one they are separated by more words than the n-gram range covers. An interaction term records that both tokens are present in the same record, so the model can learn the influence that their co-occurrence has on the labels.

Using the `SparseInteractions` transformer, taken from the DrivenData `box-plots-sklearn` repository (https://github.com/drivendataorg/box-plots-sklearn/blob/master/src/features/SparseInteractions.py)
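
As a rough picture of what degree-2 interaction terms are, the sketch below appends the product of every pair of features to a row. It is not the `SparseInteractions` code (which works on sparse matrices); `pairwise_interactions` and the three binary features are made up for illustration.

```python
from itertools import combinations

def pairwise_interactions(row):
    """Append the product of every pair of features to the original features."""
    return list(row) + [a * b for a, b in combinations(row, 2)]

# Hypothetical binary indicators for the tokens
# ['2nd grade', 'english teacher', 'elementary school']
short_title = [1, 1, 0]  # '2nd grade english teacher'
long_title = [1, 1, 1]   # '2nd grade elementary school english teacher'

short_with_terms = pairwise_interactions(short_title)
long_with_terms = pairwise_interactions(long_title)
print(short_with_terms)
print(long_with_terms)
```

The fourth value in each result is the '2nd grade' x 'english teacher' term. It is 1 for both titles, even though the two tokens are not adjacent in the longer one.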

The full model takes too much computation time to run locally. So for my model I'll use `SelectKBest` to reduce the number of features (dimensionality reduction) and therefore the computation time.
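
The selection idea can be sketched without sklearn: score each feature by a chi-squared statistic against the labels and keep the `k` highest-scoring ones. This is a simplified, single-label illustration under my own assumptions, not sklearn's implementation. `chi2_scores`, `select_k_best` and the toy data are hypothetical, and features are assumed to be non-negative counts.

```python
def chi2_scores(X, y):
    """Chi-squared statistic between each non-negative feature and the class labels."""
    n_samples, n_features = len(X), len(X[0])
    classes = sorted(set(y))
    scores = []
    for j in range(n_features):
        feature_total = sum(row[j] for row in X)
        if feature_total == 0:
            scores.append(0.0)  # simplification: a feature that never occurs gets score 0
            continue
        score = 0.0
        for c in classes:
            # Observed total of feature j within class c
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            # Expected total if feature j were independent of the class
            expected = feature_total * y.count(c) / n_samples
            score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores

def select_k_best(scores, k):
    """Indices of the k highest-scoring features, best first."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]

# Toy data: feature 0 only occurs in class 1, feature 1 is the same everywhere
X = [[3, 1], [2, 1], [0, 1], [0, 1]]
y = [1, 1, 0, 0]
scores = chi2_scores(X, y)
best = select_k_best(scores, k=1)
print(scores, best)
```

The non-negativity assumption is also why the vectorizer above is created with `alternate_sign=False`: sklearn's `chi2` does not accept negative feature values.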

In [11]:
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.impute import SimpleImputer
from SparseInteractions import SparseInteractions
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestClassifier

# number of best features
chi_k = 300

# my model
pl = Pipeline([
        ('union', FeatureUnion(
            transformer_list = [
                ('numeric_features', Pipeline([
                    ('selector', get_numeric_data),
                    ('imputer', SimpleImputer())
                ])),
                ('text_features', Pipeline([
                    ('selector', get_text_data),
                    ('vectorizer', vector),
                    ('dim_red', SelectKBest(chi2, k=chi_k))
                ]))
             ]
        )),
        ('int', SparseInteractions(degree=2)),
        ('clf', OneVsRestClassifier(RandomForestClassifier(n_estimators=100)))
    ])


In [12]:
# Fit to the training data
pl.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('union', FeatureUnion(n_jobs=None,
       transformer_list=[('numeric_features', Pipeline(memory=None,
     steps=[('selector', FunctionTransformer(accept_sparse=False, check_inverse=True,
          func=<function <lambda> at 0x0000027FBDFEA2F0>, inv_kw_args=None,
          inverse_func=None...b_score=False, random_state=None, verbose=0,
            warm_start=False),
          n_jobs=None))])

In [13]:
from sklearn.metrics import make_scorer
from LogLoss import multi_multi_log_loss

# Compute and print accuracy
print("\nAccuracy: ",
      pl.score(X_test, y_test))

# Build a scorer from the multi-class, multi-label log loss function
log_loss_scorer = make_scorer(multi_multi_log_loss)

# calculate and print the log loss score
print("Logloss score: ", log_loss_scorer(pl, X_test, y_test.values))


Accuracy:  0.7932
Logloss score:  1.5872486574372298
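
The `LogLoss` module itself is not shown in this notebook. As a rough sketch of the idea behind a multi-class, multi-label log loss, the function below computes the ordinary multiclass log loss inside each label group and then averages over the groups. This is my assumption about the general shape of the metric, not the code of `multi_multi_log_loss`; the function name, the `group_slices` argument and the toy data are all hypothetical.

```python
import math

def multi_multi_log_loss_sketch(predicted, actual, group_slices, eps=1e-15):
    """Average, over label groups, of the multiclass log loss within each group."""
    group_losses = []
    for group in group_slices:
        total = 0.0
        for p_row, y_row in zip(predicted, actual):
            # Clip to avoid log(0), then renormalize within the group
            probs = [min(max(p, eps), 1 - eps) for p in p_row[group]]
            norm = sum(probs)
            total -= sum(y * math.log(p / norm)
                         for p, y in zip(probs, y_row[group]))
        group_losses.append(total / len(predicted))
    return sum(group_losses) / len(group_losses)

# Toy example: two label groups with 2 and 3 classes, one row, uniform predictions
groups = [slice(0, 2), slice(2, 5)]
predicted = [[0.5, 0.5, 1 / 3, 1 / 3, 1 / 3]]
actual = [[1, 0, 0, 1, 0]]
loss = multi_multi_log_loss_sketch(predicted, actual, groups)
print(loss)
```

With uniform predictions each group contributes the log of its number of classes, so the result here is (ln 2 + ln 3) / 2.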


## Calculating predicted probabilities on the test set



In [13]:
# loading the test set
testset = pd.read_csv('TestData.csv',index_col=0)

# calculating the predicted probabilities
predictions = pl.predict_proba(testset)




In [14]:

# storing the predictions in the format required
predictions_df = pd.DataFrame(columns=pd.get_dummies(df[LABELS],prefix_sep='__').columns, 
                              index=testset.index,
                              data=predictions)

# storing predictions in a csv file to upload
predictions_df.to_csv('predictions.csv')

predictions_df.head()

| | Function__Aides Compensation | Function__Career & Academic Counseling | Function__Communications | Function__Curriculum Development | Function__Data Processing & Information Services | Function__Development & Fundraising | Function__Enrichment | Function__Extended Time & Tutoring | Function__Facilities & Maintenance | Function__Facilities Planning | ... | Student_Type__Special Education | Student_Type__Unspecified | Use__Business Services | Use__ISPD | Use__Instruction | Use__Leadership | Use__NO_LABEL | Use__O&M | Use__Pupil Services & Enrichment | Use__Untracked Budget Set-Aside |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 180042 | 0.0 | 0.0 | 0.0 | 0.0 | 0.2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.866667 | 0.133333 | 0.0 | 0.733333 | 0.133333 | 0.066667 | 0.0 | 0.0 | 0.0 |
| 28872 | 0.066667 | 0.0 | 0.0 | 0.066667 | 0.0 | 0.0 | 0.8 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.866667 | 0.0 | 0.0 | 0.133333 | 0.0 | 0.2 | 0.066667 | 0.8 | 0.0 |
| 186915 | 0.0 | 0.066667 | 0.0 | 0.266667 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.4 | 0.0 | 0.2 | 0.266667 | 0.0 | 0.2 | 0.0 | 0.0 | 0.0 |
| 412396 | 0.0 | 0.066667 | 0.0 | 0.333333 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.333333 | 0.0 | 0.266667 | 0.266667 | 0.0 | 0.266667 | 0.0 | 0.0 | 0.0 |
| 427740 | 0.0 | 0.066667 | 0.0 | 0.066667 | 0.0 | 0.0 | 0.066667 | 0.0 | 0.133333 | 0.0 | ... | 0.0 | 0.733333 | 0.133333 | 0.066667 | 0.066667 | 0.066667 | 0.0 | 0.0 | 0.2 | 0.0 |


## Trying to improve the model

- vectorizer: try `FeatureHasher`, `CountVectorizer`, or `HashingVectorizer` with new parameters (n-grams and tokenizing)
- increase `chi_k`
- use a larger sample of the data
- increase the number of estimators in the random forest
- try other classifiers

In [9]:
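# Option 1: FeatureHasher with 2**20 hashed features
from sklearn.feature_extraction import FeatureHasher

vector = FeatureHasher(n_features=1048576,
                       alternate_sign=False)


# Option 2: CountVectorizer with the same token pattern and n-gram range
# (run in the same cell, this assignment replaces the one above)
from sklearn.feature_extraction.text import CountVectorizer

vector = CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC,
                         ngram_range=(1,2))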

Run 1:
- SAMPLE_SIZE = 50000
- `chi_k` = 500
- RandomForestClassifier - n_estimators=15

Accuracy:  0.7748
Logloss score:  1.640719800152314

Run 2:
- SAMPLE_SIZE = 50000
- `chi_k` = 500
- RandomForestClassifier - n_estimators=100

Accuracy:  0.7932
Logloss score:  1.5872486574372298

Run 3:
- SAMPLE_SIZE = 200000
- `chi_k` = 500
- RandomForestClassifier - n_estimators=100