# Introduction
https://github.com/datacamp/course-resources-ml-with-experts-budgets
## Topic: 
School Budgets problem from DrivenData.org
School budgets are huge, complex and non-standardized within the US
Schools want to measure their performance
## Goal: 
Build a machine learning algorithm that automates the labeling of spending
## Data:
Line items with a description:
like "Algebra books for 8th grade students"
Labels attached like: "Math", "Middle School", "Textbooks"
## Type of the problem:
Supervised learning problem -> use correctly labeled data to predict the labels of an unlabeled sample
Predicting a label -> classification problem
Predict the probability of each possible value of every target variable (e.g., with logistic regression)
## Specials:
Over 100 target variables
9 label columns, each with several possible values
Predict a probability for every value of every label column via dummy (indicator) variables
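
A minimal sketch of what the dummy-variable encoding looks like. The three-row frame `toy` is made up for illustration and is not taken from the dataset:

```python
import pandas as pd

# Toy label columns; the values are illustrative only
toy = pd.DataFrame({
    "Function": ["Teacher Compensation", "NO_LABEL", "Teacher Compensation"],
    "Use": ["Instruction", "NO_LABEL", "Instruction"],
})

# One indicator column per (label column, value) pair
toy_dummies = pd.get_dummies(toy)
print(list(toy_dummies.columns))

# Within one label column exactly one indicator is set per row
function_cols = [c for c in toy_dummies.columns if c.startswith("Function_")]
print(toy_dummies[function_cols].sum(axis=1).tolist())
```

Each of the 9 label columns expands this way, which is where the "over 100 target variables" come from.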
## Exploring the data:
Loading dataset via:

In [14]:
#import packages
import pandas as pd

#load a sample set of the data (TODO: change to the real dataset)
sample_df = pd.read_csv("TrainingData.csv")

#explore basics via head (data example), info (data structure) and describe (summary statistics)
print(sample_df.head())
sample_df.info()  #info() prints its report itself and returns None
print(sample_df.describe())
#all columns whose dtype is not object (i.e., not text)
NUMERIC_COLUMNS = list(sample_df.loc[:,sample_df.dtypes != "object"].columns)
LABELS = ['Function',
 'Use',
 'Sharing',
 'Reporting',
 'Student_Type',
 'Position_Type',
 'Object_Type',
 'Pre_K',
 'Operating_Status']
df = sample_df

#single brackets return a Series, double brackets a DataFrame
print(type(df["FTE"]))
print(type(df[["FTE"]]))

Unnamed: 0                 Function          Use          Sharing  \
0      134338     Teacher Compensation  Instruction  School Reported   
1      206341                 NO_LABEL     NO_LABEL         NO_LABEL   
2      326408     Teacher Compensation  Instruction  School Reported   
3      364634  Substitute Compensation  Instruction  School Reported   
4       47683  Substitute Compensation  Instruction  School Reported   

  Reporting Student_Type Position_Type               Object_Type     Pre_K  \
0    School     NO_LABEL       Teacher                  NO_LABEL  NO_LABEL   
1  NO_LABEL     NO_LABEL      NO_LABEL                  NO_LABEL  NO_LABEL   
2    School  Unspecified       Teacher  Base Salary/Compensation  Non PreK   
3    School  Unspecified    Substitute                  Benefits  NO_LABEL   
4    School  Unspecified       Teacher   Substitute Compensation  NO_LABEL   

    Operating_Status  ... Sub_Object_Description Location_Description  FTE  \
0  PreK-12 Operating  .

### Information obtained:
* Several columns contain string values
* Several columns contain NaNs

### Encountered problems
* ML models work with numbers, not with strings
* Strings are computationally expensive
    * The pandas category dtype solves this by storing the values numerically (as integer codes)

### What to do 
Create a lambda function that converts an object-type column into a categorical column

    categorize = lambda x: x.astype("category")

Apply this function to the desired columns with the .apply() method (pass a list of column labels)

    sample_df[LABELS] = sample_df[LABELS].apply(categorize, axis=0)

Count the number of unique values per column via .apply(pd.Series.nunique)

    num_unique_values = sample_df[LABELS].apply(pd.Series.nunique)
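
The conversion can be tried on a small made-up frame first. This is only a sketch; `toy` and `toy_labels` are stand-ins for the real data and for `LABELS`:

```python
import pandas as pd

toy = pd.DataFrame({
    "Function": ["Teacher Compensation", "NO_LABEL", "Teacher Compensation"],
    "Use": ["Instruction", "NO_LABEL", "Instruction"],
})
toy_labels = ["Function", "Use"]  # stand-in for LABELS

# Convert every label column from object to category
categorize = lambda x: x.astype("category")
toy[toy_labels] = toy[toy_labels].apply(categorize, axis=0)
print(toy.dtypes)

# Number of distinct values per label column
num_unique = toy[toy_labels].apply(pd.Series.nunique)
print(num_unique)
```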



## Measure of model success
### Log loss function
The log loss function is used in this competition to evaluate model performance

see:  https://www.drivendata.org/competitions/46/box-plots-for-education-reboot/submissions/

-> being less sure is better than being confident and wrong

Log loss function:

In [7]:
import numpy as np
def compute_log_loss(predicted, actual, eps=1e-14):
    predicted = np.clip(predicted, eps, 1-eps)
    loss = -1 * np.mean(actual * np.log(predicted)
                +(1-actual)
                *np.log(1-predicted))
    return loss
compute_log_loss(0.5, 0)



0.6931471805599453
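
To see why being less sure beats being confident and wrong, the same function can be evaluated for three predictions of a sample whose true label is 1. The probabilities 0.9, 0.5 and 0.1 are arbitrary values picked for illustration:

```python
import numpy as np

def compute_log_loss(predicted, actual, eps=1e-14):
    # Same formula as in the cell above, repeated so the sketch runs on its own
    predicted = np.clip(predicted, eps, 1 - eps)
    return -1 * np.mean(actual * np.log(predicted)
                        + (1 - actual) * np.log(1 - predicted))

# The true label is 1 in every case; only the stated confidence differs
confident_right = compute_log_loss(0.9, 1)
unsure = compute_log_loss(0.5, 1)
confident_wrong = compute_log_loss(0.1, 1)
print(confident_right, unsure, confident_wrong)
```

The loss grows from the confident and right prediction over the unsure one to the confident and wrong one.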

## Starting with a simple model
Gives a sense of how complex and difficult the problem might be  
Get from raw data to predictions as fast as possible  
Use multi-class logistic regression  
Format the predictions and save them to csv  
Submit

### Splitting the dataset
A normal train-test split does not work here: some labels occur in only a few rows, so a random split can leave labels out of the training set

### Solution:  

    StratifiedShuffleSplit
* Con: only works with a single target variable
* We have many target variables
* Use multilabel_train_test_split() instead  
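
A small sketch of the underlying problem: counting how many rows are positive for each indicator column shows which labels are too rare to be split safely. The matrix `Y_toy`, its column names and the threshold are made up for illustration:

```python
import pandas as pd

# Toy indicator matrix; two of the three labels are positive in one row only
Y_toy = pd.DataFrame({
    "Use_Instruction": [1, 1, 1, 0, 1, 1, 0, 1],
    "Use_Operations":  [0, 0, 0, 1, 0, 0, 0, 0],
    "Use_NO_LABEL":    [0, 0, 0, 0, 0, 0, 1, 0],
})

min_count = 2  # illustrative threshold
positives_per_label = Y_toy.sum(axis=0)
too_rare = positives_per_label[positives_per_label < min_count].index.tolist()
print(too_rare)
```

A label that is positive in a single row ends up either in the training set or in the test set, never in both.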

  


In [36]:
import numpy as np
import pandas as pd
from warnings import warn


def multilabel_sample(y, size=1000, min_count=5, seed=None):
    """ Takes a matrix of binary labels `y` and returns
        the indices for a sample of size `size` if
        `size` > 1 or `size` * len(y) if size <= 1.
        The sample is guaranteed to have at least `min_count` of
        each label.
    """
    try:
        # reject matrices whose unique values are not exactly 0 and 1
        if (np.unique(y).astype(int) != np.array([0, 1])).any():
            raise ValueError()
    except (TypeError, ValueError):
        raise ValueError('multilabel_sample only works with binary indicator matrices')

    if (y.sum(axis=0) < min_count).any():
        raise ValueError('Some classes do not have enough examples. Change min_count if necessary.')

    if size <= 1:
        size = np.floor(y.shape[0] * size)

    if y.shape[1] * min_count > size:
        msg = "Size less than number of columns * min_count, returning {} items instead of {}."
        warn(msg.format(y.shape[1] * min_count, size))
        size = y.shape[1] * min_count

    rng = np.random.RandomState(seed if seed is not None else np.random.randint(1))

    if isinstance(y, pd.DataFrame):
        choices = y.index
        y = y.values
    else:
        choices = np.arange(y.shape[0])

    sample_idxs = np.array([], dtype=choices.dtype)

    # first, guarantee > min_count of each label
    for j in range(y.shape[1]):
        label_choices = choices[y[:, j] == 1]
        label_idxs_sampled = rng.choice(label_choices, size=min_count, replace=False)
        sample_idxs = np.concatenate([label_idxs_sampled, sample_idxs])

    sample_idxs = np.unique(sample_idxs)

    # now that we have at least min_count of each, we can just random sample
    sample_count = int(size - sample_idxs.shape[0])

    # get sample_count indices from remaining choices
    remaining_choices = np.setdiff1d(choices, sample_idxs)
    remaining_sampled = rng.choice(remaining_choices,
                                   size=sample_count,
                                   replace=False)

    return np.concatenate([sample_idxs, remaining_sampled])


def multilabel_sample_dataframe(df, labels, size, min_count=5, seed=None):
    """ Takes a dataframe `df` and returns a sample of size `size` where all
        classes in the binary matrix `labels` are represented at
        least `min_count` times.
    """
    idxs = multilabel_sample(labels, size=size, min_count=min_count, seed=seed)
    return df.loc[idxs]

def multilabel_train_test_split(X, Y, size, min_count=5, seed=None):
    """ Takes a features matrix `X` and a label matrix `Y` and
        returns (X_train, X_test, Y_train, Y_test) where all
        classes in Y are represented at least `min_count` times.
    """
    index = Y.index if isinstance(Y, pd.DataFrame) else np.arange(Y.shape[0])

    test_set_idxs = multilabel_sample(Y, size=size, min_count=min_count, seed=seed)
    train_set_idxs = np.setdiff1d(index, test_set_idxs)

    test_set_mask = index.isin(test_set_idxs)
    train_set_mask = ~test_set_mask

    return (X[train_set_mask], X[test_set_mask], Y[train_set_mask], Y[test_set_mask])

BOX_PLOTS_COLUMN_INDICES = [list(range(37)),
                            list(range(37, 48)),
                            list(range(48, 51)),
                            list(range(51, 76)),
                            list(range(76, 79)),
                            list(range(79, 82)),
                            list(range(82, 87)),
                            list(range(87, 96)),
                            list(range(96, 104))]


def multi_multi_log_loss(predicted,
                          actual,
                          class_column_indices=BOX_PLOTS_COLUMN_INDICES,
                          eps=1e-15):
    """ Multi class version of Logarithmic Loss metric as implemented on
        DrivenData.org
    """
    class_scores = np.ones(len(class_column_indices), dtype=np.float64)

    # calculate log loss for each set of columns that belong to a class:
    for k, this_class_indices in enumerate(class_column_indices):
        # get just the columns for this class
        preds_k = predicted[:, this_class_indices].astype(np.float64)

        # normalize so probabilities sum to one (unless sum is zero, then we clip)
        preds_k /= np.clip(preds_k.sum(axis=1).reshape(-1, 1), eps, np.inf)

        actual_k = actual[:, this_class_indices]

        # shrink predictions so
        y_hats = np.clip(preds_k, eps, 1 - eps)
        sum_logs = np.sum(actual_k * np.log(y_hats))
        class_scores[k] = (-1.0 / actual.shape[0]) * sum_logs

    return np.average(class_scores)

### How to:

Minimal preprocessing:

In [8]:
print(NUMERIC_COLUMNS)  #list of all numeric columns 
print(LABELS)#List of target label columns

data_to_train = df[NUMERIC_COLUMNS].fillna(-1000)
labels_to_use = pd.get_dummies(df[LABELS])
X_train, X_test, y_train, y_test = multilabel_train_test_split(data_to_train, 
                                                               labels_to_use, 
                                                               size =0.2, 
                                                               seed=123)

['Unnamed: 0', 'FTE', 'Total']
['Function', 'Use', 'Sharing', 'Reporting', 'Student_Type', 'Position_Type', 'Object_Type', 'Pre_K', 'Operating_Status']



### Train the model:

In [9]:
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

#one logistic regression per label column
clf = OneVsRestClassifier(LogisticRegression()) 
clf.fit(X_train, y_train)

  
    OneVsRestClassifier
   
treats each column of y independently  
Fits a separate classifier for each of the columns
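
A minimal sketch on made-up data that shows the "one classifier per column" behaviour. `X_toy` and `Y_toy` are illustrative and have nothing to do with the budget data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy data: 6 rows, 2 numeric features, 3 binary label columns
X_toy = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 1.0],
                  [3.0, 0.0], [4.0, 1.0], [5.0, 0.0]])
Y_toy = np.array([[1, 0, 1], [1, 0, 0], [1, 1, 1],
                  [0, 1, 0], [0, 1, 1], [0, 0, 0]])

clf_toy = OneVsRestClassifier(LogisticRegression())
clf_toy.fit(X_toy, Y_toy)

# One fitted estimator per label column
print(len(clf_toy.estimators_))
# One probability per row and label column
print(clf_toy.predict_proba(X_toy).shape)
```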

# Simulating competing in the Competition
## Loading Test_set (Holdout data)

In [10]:
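#keep the row id as index (needed for the submission file) and as a column,
#because NUMERIC_COLUMNS was built from training data that still contains 'Unnamed: 0'
holdout = pd.read_csv("TestData.csv").set_index("Unnamed: 0", drop=False).rename_axis(None)
holdout = holdout[NUMERIC_COLUMNS].fillna(-1000)
predictions = clf.predict_proba(holdout)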

Because the metric is log loss, .predict() would score much worse (it returns only 0s and 1s)  
-> Using **.predict_proba()** solves the problem by returning probabilities rather than hard predictions  
  
  
## Formatting and submission of predictions



In [None]:
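#column names use "__" between label column and value
prediction_df = pd.DataFrame(columns = pd.get_dummies(df[LABELS], prefix_sep="__").columns, index=holdout.index, data=predictions)
prediction_df.to_csv("predictions.csv")
#score_submission is not defined in these notes (assumed to be a helper from the course environment)
score = score_submission(pred_path="predictions.csv")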

## Upload and obtain score from the Leaderboard

# Introduction to NLP
Data: Text, documents, speech

## First step: Tokenization  
* Splitting a string into segments
* Store results as lists  
Example:  
"Natural Language Processing"  
-> "Natural", "Language", "Processing"  

Tokenize on:  
* whitespace
* punctuation
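
The two options can be compared with plain regular expressions. This is a simplified sketch; the sentence and both patterns are assumptions chosen for illustration and are not the patterns used later in these notes:

```python
import re

text = "Algebra-books for 8th_grade students"

# Tokenize on whitespace only: hyphens and underscores stay inside tokens
whitespace_tokens = re.findall(r"\S+", text)
print(whitespace_tokens)

# Tokenize on whitespace and punctuation: keep only runs of letters and digits
alphanumeric_tokens = re.findall(r"[A-Za-z0-9]+", text)
print(alphanumeric_tokens)
```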

## Using "bag of words"
* Count the number of times a particular token appears
* Bag of words:  
  * Count the number of times a word was pulled out of the bag  
* This approach discards the information about word order
->"Red, not blue" == "blue, not red"  
  
## Using n-gram
1-gram, 2-gram,...,n-gram

With 2-grams, the unit that is counted is no longer a single word but every ordered pair of adjacent words
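
A short sketch of the difference, using the "red/blue" example from the bag-of-words section. It relies on CountVectorizer's default tokenizer; the variable names are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Red, not blue", "blue, not red"]

# Single words only: both sentences contain the same words
unigrams = CountVectorizer(ngram_range=(1, 1)).fit_transform(docs).toarray()
# Ordered word pairs only: "red not" differs from "not red"
bigrams = CountVectorizer(ngram_range=(2, 2)).fit_transform(docs).toarray()

same_as_unigrams = bool((unigrams[0] == unigrams[1]).all())
same_as_bigrams = bool((bigrams[0] == bigrams[1]).all())
print(same_as_unigrams, same_as_bigrams)
```

With single words the two sentences are indistinguishable, with word pairs they are not.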
  
## Representing words numerically


In [None]:
'''Bag-of-words representation
Using: sklearn's CountVectorizer

It does 3 things:
    Tokenizes all strings
    Builds a vocabulary
    Counts the occurrences of each token
How?'''

#import packages
from sklearn.feature_extraction.text import CountVectorizer

#define a regular expression that splits on whitespace
TOKEN_BASIC = "\\S+(?=\\s+)"

#replace NaNs with empty strings
df["Program_Description"] = df["Program_Description"].fillna("")

#instantiate CountVectorizer
vec_basic = CountVectorizer(token_pattern=TOKEN_BASIC)

#fit the vectorizer
vec_basic.fit(df["Program_Description"])

#extract the found tokens with the vectorizer's get_feature_names_out() method
#(scikit-learn versions before 1.0 use get_feature_names() instead)
features = vec_basic.get_feature_names_out()



# Pipeline, features & test processing


= Repeatable way to go from raw data to a trained model -> see unsupervised-learning notes
  
Pipelines can also have sub-pipelines as steps
  
## How to:

In [15]:
#import packages:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import train_test_split



#initiate pipeline
pl = Pipeline([("clf", OneVsRestClassifier(LogisticRegression()))])

#split data; "numeric" and "label" are placeholder column names (they are not columns of TrainingData.csv)
#double brackets give a DataFrame instead of a Series
X_train, X_test, y_train, y_test = train_test_split(df[["numeric"]], pd.get_dummies(df["label"]), random_state = 2)

#fit
pl.fit(X_train, y_train)

#score on test
pl.score(X_test, y_test)

#NaNs in the features make the fit fail
#Therefore: add an imputer step (fills in NaNs)
from sklearn.impute import SimpleImputer

pl = Pipeline([("imp", SimpleImputer()),
               ("clf", OneVsRestClassifier(LogisticRegression())
                )])

#fit and score with nans


## Implementing text features into the pipeline
from sklearn.feature_extraction.text import CountVectorizer
X_train, X_test, y_train, y_test = train_test_split(df["text"], pd.get_dummies(df["label"]), random_state=2)

pl = Pipeline([("vec" , CountVectorizer()),
                ("clf" , OneVsRestClassifier(LogisticRegression()))])

#fit and score again



## Pre-Processing multiple dtypes
* Using all variables from different types
* Problem:
    * The preprocessing steps for different dtypes cannot simply be chained one after another
    * e.g., the output of CountVectorizer can't be the input for an imputer
* Solution: FunctionTransformer() & FeatureUnion()  
  

### FunctionTransformer()
* Turns a Python function into an object that scikit-learn pipelines understand  
* Need to write two functions for pipeline preprocessing  
    1. Takes the whole dataframe, returns the numeric columns  
    2. Takes the whole dataframe, returns the text columns

## How to:

In [None]:
#import necessary packages
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import FeatureUnion

#split complete set
X_train, X_test, y_train, y_test = train_test_split(df[["numeric", "with_missing", "text"]], pd.get_dummies(df["label"]), random_state = 1)

get_text_data = FunctionTransformer(lambda x: x["text"], validate = False)
get_numeric_data = FunctionTransformer(lambda x: x[["numeric", "with_missing"]], validate = False)
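
What the two selectors return can be checked on a small made-up frame. `toy` only mimics the placeholder columns used above and is an assumption:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

toy = pd.DataFrame({
    "numeric": [1.0, 2.0, 3.0],
    "with_missing": [4.0, np.nan, 6.0],
    "text": ["foo bar", "", "bar baz"],
})

get_text = FunctionTransformer(lambda x: x["text"], validate=False)
get_numeric = FunctionTransformer(lambda x: x[["numeric", "with_missing"]], validate=False)

# fit_transform simply applies the wrapped function to the frame
text_part = get_text.fit_transform(toy)
numeric_part = get_numeric.fit_transform(toy)
print(text_part.shape, numeric_part.shape)
```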


### FeatureUnion
combines the outputs of the two sub-pipelines side by side into one feature array

## HowTo:

In [None]:
union = FeatureUnion([("numeric", numeric_pipeline),
                     ("text", text_pipeline)])                   
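
A self-contained sketch of the union on a small made-up frame, to show how the column counts add up. All names ending in `_toy` are illustrative, and `SimpleImputer` is used as the imputer:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

toy = pd.DataFrame({
    "numeric": [1.0, 2.0, 3.0],
    "with_missing": [4.0, np.nan, 6.0],
    "text": ["foo bar", "", "bar baz"],
})

numeric_pipeline_toy = Pipeline([
    ("selector", FunctionTransformer(lambda x: x[["numeric", "with_missing"]], validate=False)),
    ("imputer", SimpleImputer()),
])
text_pipeline_toy = Pipeline([
    ("selector", FunctionTransformer(lambda x: x["text"], validate=False)),
    ("vectorizer", CountVectorizer()),
])

union_toy = FeatureUnion([("numeric", numeric_pipeline_toy),
                          ("text", text_pipeline_toy)])
combined = union_toy.fit_transform(toy)

# 2 numeric columns plus one column per vocabulary token (foo, bar, baz)
print(combined.shape)
```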

## Complete Pipeline

In [None]:
#import necessary packages
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import train_test_split

#split complete set
X_train, X_test, y_train, y_test = train_test_split(df[["numeric", "with_missing", "text"]], pd.get_dummies(df["label"]), random_state = 1)

get_text_data = FunctionTransformer(lambda x: x["text"], validate = False)
get_numeric_data = FunctionTransformer(lambda x: x[["numeric", "with_missing"]], validate = False)
#to obtain results use fit_transform(x)


#create numeric sub-pipeline
numeric_pipeline = Pipeline([
                            ("selector", get_numeric_data),
                            ("imputer", SimpleImputer())
                            ])

#create text sub-pipeline
text_pipeline = Pipeline([
                        ("selector", get_text_data),
                        ("vectorizer", CountVectorizer())
                        ])
#create final pipeline with FeatureUnion, the two sub-pipelines and the model
pl = Pipeline([
                ("union", FeatureUnion([
                                        ("numeric", numeric_pipeline),
                                        ("text", text_pipeline)
                                        ])),
                ("clf" , OneVsRestClassifier(LogisticRegression()))
                ])

#fit the data
pl.fit(X_train, y_train)

#get the score
pl.score(X_test, y_test)

# Choosing classifier

In [58]:
#Using the pipeline with the main dataset

#import packages
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import FeatureUnion
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MaxAbsScaler
from sklearn.metrics import make_scorer
from sklearn.preprocessing import PolynomialFeatures

#by default make_scorer calls the metric as score_func(y_true, y_pred) on the output of predict()
log_loss_scorer = make_scorer(multi_multi_log_loss)

LABELS = ['Function',
 'Use',
 'Sharing',
 'Reporting',
 'Student_Type',
 'Position_Type',
 'Object_Type',
 'Pre_K',
 'Operating_Status']
#load the data first so that the column lists below are built from it
df = pd.read_csv("TrainingData.csv", index_col=0, nrows=20000)

#NON_LABELS = list(df.loc[:,~df.columns.isin(LABELS)].columns)
NON_LABELS = [c for c in df.columns if c not in LABELS] #much better approach
NUMERIC_COLUMNS = list(df.loc[:,df.dtypes != "object"].columns)
print(NON_LABELS)
print(NUMERIC_COLUMNS)

dummy_labels = pd.get_dummies(df[LABELS])


#note: the fourth positional argument of multilabel_train_test_split is min_count, not seed
X_train, X_test, y_train, y_test = multilabel_train_test_split(df[NON_LABELS], dummy_labels, size=0.2, min_count=1)



def combine_text_columns(data_frame, to_drop=NUMERIC_COLUMNS + LABELS):
    """ Takes the dataset as read in, drops the non-feature, non-text columns and
        then combines all of the text columns into a single vector that has all of
        the text for a row.
        
        :param data_frame: The data as read in with read_csv (no preprocessing necessary)
        :param to_drop (optional): Removes the numeric and label columns by default.
    """
    # drop non-text columns that are in the df
    to_drop = set(to_drop) & set(data_frame.columns.tolist())
    text_data = data_frame.drop(to_drop, axis=1)
    
    # replace nans with blanks
    text_data.fillna("", inplace=True)
    
    # joins all of the text items in a row (axis=1)
    # with a space in between
    return text_data.apply(lambda x: " ".join(x), axis=1)




get_text_data = FunctionTransformer(combine_text_columns, validate=False)
get_numeric_data = FunctionTransformer(lambda x: x[NUMERIC_COLUMNS], validate=False)

pl = Pipeline([("union",FeatureUnion([
                        ("numeric_features", Pipeline([
                                            ("selector", get_numeric_data),
                                            ("imputer", SimpleImputer()),
                                            ("scaler", MaxAbsScaler())
                                                        ])),
                        ("text_features", Pipeline([
                                        ("selector", get_text_data),
                                        ("vectorizer", CountVectorizer(ngram_range=(1,2))
                                        )
                                                    ]))
                                        ])),
                                
                ("clf", OneVsRestClassifier(LogisticRegression()))
              ])

pl.fit(X_train, y_train)

log_loss_scorer(pl, X_test, y_test.values)

['Object_Description', 'Text_2', 'SubFund_Description', 'Job_Title_Description', 'Text_3', 'Text_4', 'Sub_Object_Description', 'Location_Description', 'FTE', 'Function_Description', 'Facility_or_Department', 'Position_Extra', 'Total', 'Program_Description', 'Fund_Description', 'Text_1']
['FTE', 'Total']


1.4583388290318766

# Tips and Tricks
## Text Processing
* NLP Tricks  

* Tokenize on punctuation
to avoid hyphens, underscores, etc.  
  
* Include unigrams and bigrams in the model via CountVectorizer
  
## Stats trick
* Interaction terms -> describe mathematically when two tokens appear together by adding a term beta3*(x1 * x2): the product is 1 only if both tokens occur (1*0 = 0, 1*1 = 1, etc.)  
  
* use via **from sklearn.preprocessing import PolynomialFeatures**  
  
* interaction = PolynomialFeatures(degree = 2, interaction_only = True, include_bias = False) 
  * use SparseInteractions instead
  
* Fit and transform to get the new features
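
A minimal sketch of the interaction term on a made-up 0/1 token matrix. `X_tokens` is illustrative; its columns say whether token x1 or x2 occurs in a row:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Rows are documents, columns indicate whether token x1 / x2 occurs
X_tokens = np.array([[1, 0],
                     [0, 1],
                     [1, 1]])

interaction = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_inter = interaction.fit_transform(X_tokens)

# Output columns: x1, x2, x1*x2
print(X_inter)
```

Only the last row, where both tokens occur, gets a 1 in the added x1*x2 column.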

## The winning model  
* Adding new features causes an enormous increase in array size -> more computational power needed  
  
* Hashing to be more memory-efficient  
  
* -> Dimensionality reduction  
  
* How to: use HashingVectorizer instead of CountVectorizer with (norm=None, non_negative = True, token_pattern = TOKEN_ALPHANUMERIC, ngram_range = (1, 2)); newer scikit-learn versions replace non_negative = True with alternate_sign = False  
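
A short sketch of the hashing trick on two made-up strings. It assumes a scikit-learn version that has `alternate_sign`, uses the default token pattern instead of TOKEN_ALPHANUMERIC, and picks `n_features=16` only to keep the output small:

```python
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["Algebra books for 8th grade students",
        "Substitute teacher compensation"]

# n_features fixes the output width regardless of vocabulary size;
# 16 is an arbitrarily small value chosen for illustration
hasher = HashingVectorizer(n_features=16, norm=None, alternate_sign=False,
                           ngram_range=(1, 2))

# The vectorizer is stateless, so transform works without a prior fit
hashed = hasher.transform(docs)
print(hashed.shape)
```

No vocabulary is stored, so the array width stays fixed no matter how many distinct tokens or bigrams the text contains.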
  
## Other Tricks

* NLP: stemming, stop-word removal
* Model: RandomForest, kNN, Naive Bayes
* Numeric processing: imputation strategies
* Optimization: GridSearchCV
* Experiment with other techniques



