# Donors Choose - TensorFlow Model

## 1. Introduction
I created this project to gain practical experience with TensorFlow after finishing the Google Machine Learning Crash Course. It has been a rewarding experience, as it touched on several aspects of TensorFlow modeling, feature engineering, natural language processing, and time series analysis.

I developed my model as an exercise, with the goal of achieving a high AUC on the test dataset. If I were developing a production model with practical applicability in mind, I would adjust my methodology as well as the problem definition. I explain these considerations in section 2.

The main technical limitation I encountered was the 1GB of hard drive space in the Kaggle environment, which filled up easily during the NN training phase. Even though I managed to reduce the disk footprint with some data preprocessing, I was still not able to use the full training data due to this constraint. My plan is to learn more about the different ways to feed data into TensorFlow that may overcome this limitation.

## 2. Considerations for Practical Applicability

### 2.1 Discussion of Test Dataset and Repeatability
The test dataset for this exercise was created as a random selection from the full application population. While this is generally a safe way to test whether the model will perform in practice, in this particular case it may introduce unwanted leakage of information. Here are two examples of how this can happen:
 - Example 1: Since test records can be pulled from any point in the history of a teacher's submission list, the training and scoring process has access to the future approval performance of a given teacher. For example, the model knows whether the teacher's next project was approved. Obviously, this will not be known in a real-world application.
 - Example 2: For each test record, we have a number of training records submitted around the same time (e.g. same day, same week) together with their approval results. This data can carry approval rate information related to some unknown factor, for example a volunteer shortage at the time. Again, this information is available for the test data, but will not be available for future applications.

I explicitly create features in my kernel that exploit both of the examples above, since for this exercise I only care about performance on the test dataset. Nonetheless, even without explicit features, complex models like DNNs may pick up on these patterns, so care needs to be taken to make sure the model is not learning from leaked information. One way to do this would be to create an out-of-time test dataset that includes all applications submitted after a set date, while the training and validation data would contain only applications from before that date. Performance on this type of test dataset would be a better reflection of the final usability of the model.
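
The out-of-time split described above can be sketched as follows. This is a minimal illustration rather than part of the kernel: the helper name `out_of_time_split`, the toy rows, and the cutoff date `2017-01-01` are assumptions chosen for the example.

```python
import pandas as pd

def out_of_time_split(df, cutoff, date_col="project_submitted_datetime"):
    """Split df into rows submitted before the cutoff and rows submitted on or after it."""
    submitted = pd.to_datetime(df[date_col])
    cutoff = pd.Timestamp(cutoff)
    return df[submitted < cutoff], df[submitted >= cutoff]

# Toy rows standing in for train.csv (hypothetical values)
toy = pd.DataFrame({
    "id": ["p1", "p2", "p3", "p4"],
    "project_submitted_datetime": [
        "2016-05-01 10:00:00", "2016-11-15 09:30:00",
        "2017-02-01 08:00:00", "2017-03-20 14:45:00",
    ],
})

# Everything before the cutoff is available for training and validation,
# everything from the cutoff onward is held out as the out-of-time test set.
train_part, test_part = out_of_time_split(toy, "2017-01-01")
print(list(train_part["id"]), list(test_part["id"]))
```

With such a split, any context feature would also have to be computed only from rows that precede the application being scored.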

### 2.2 Discussion of Fairness, Explicability, Stability
Although this competition dataset and task are a wonderful resource for practicing modeling with TensorFlow (and this is what I personally came for), I can see three possible issues with an approach that uses NNs to simply replicate observed approvals from the past.
- Fairness: We understand that each volunteer potentially has their own bias in approval decisions, but in addition to this, there may also be unwanted overall bias (cultural, geographical, ...). While building the model on a large set of volunteers will improve consistency, the model will still reflect the overall bias if there was one. Furthermore, if it is put into action, it may even reinforce this bias in a positive feedback loop. It is therefore very important to conduct a thorough analysis and continuous monitoring to ensure that the model is fair.
- Explicability: Ideally, the model should help volunteers identify applications that may be problematic and need further attention. A natural extension of this process is to notify the teacher if their project was rejected and for what reason. It would therefore be very useful if the model could provide not only the probability of rejection, but also the most likely rejection reason. This would be best achieved by building a multi-class NN modeled on the rejection reason rather than on a simple indicator of approval. If the rejection reason is not available as a categorical column (because it was not collected), one option would be to compile correspondence from volunteers to teachers whose projects were rejected and use NLP to classify rejection reasons.
- Stability: Some of the predictors of approval are keywords in the requested resource descriptions. These can be things like "wobble chair" or "Apple iPad". Understandably, these terms can age: new technologies emerge, products can be created by alternative brands, and so on. If a model is trained on outdated data, it may be looking for resources that are no longer requested and lose predictive strength because of that. This instability cannot be completely overcome, but it can be controlled for and understood by analyzing the durability of terms and approval rates as a time series.
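
To make the stability point concrete, one simple way to look at the durability of a term is to track the share of resource descriptions that mention it per month. The sketch below is only an illustration under assumptions: `monthly_term_share` is a hypothetical helper, and the records are made-up examples, not competition data.

```python
from collections import defaultdict

def monthly_term_share(records, term):
    """records: iterable of (month as 'YYYY-MM', description).
    Returns {month: share of descriptions in that month containing the term}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for month, description in records:
        totals[month] += 1
        if term in description.lower():
            hits[month] += 1
    return {month: hits[month] / totals[month] for month in sorted(totals)}

# Made-up records for illustration only
toy_records = [
    ("2016-09", "Wobble chair for flexible seating"),
    ("2016-09", "Set of picture books"),
    ("2017-03", "Apple iPad with case"),
    ("2017-03", "Standing desk"),
]
shares = monthly_term_share(toy_records, "wobble chair")
print(shares)
```

A term whose share drops toward zero over time is a candidate for removal from the dictionary, or at least a signal that the model needs retraining.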

## 3. Implementation

### 3.1 Include libraries and basic setup

In [1]:
import tensorflow as tf
from keras.preprocessing.text import Tokenizer
from tensorflow.python.data import Dataset
import numpy as np
import re
import sklearn.metrics as metrics
import pandas as pd
from matplotlib import cm
from matplotlib import gridspec
from matplotlib import pyplot as plt
from collections import OrderedDict
tf.logging.set_verbosity(tf.logging.ERROR)

### 3.2 Load all data files

In [2]:
# Load data files
training_dataset = pd.read_csv('../input/train.csv', sep=',')
resources_dataset = pd.read_csv('../input/resources.csv', sep=',')
test_dataset = pd.read_csv('../input/test.csv', sep=',')

### 3.3 Create time-context features
These features capture the context of each application relative to other applications. They include:
- Is this the first/last application of the teacher?
- How long has it been since the teacher's previous application, and how long until the next one?
- Was the teacher's previous/next application approved?
- What was the approval rate of applications submitted within the preceding 24 hours?

As discussed in section 2.1, most of these features are only possible because of the nature of the test dataset in this exercise. In practice, we would not have this context for new incoming applications.

In [3]:
# Join train and test data. Context features are based on the full dataset 
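# pd.concat is used instead of DataFrame.append, which is no longer available in recent pandas versions
dfall = pd.concat([training_dataset[["id","teacher_id","project_submitted_datetime","project_is_approved"]],
                   test_dataset[["id","teacher_id","project_submitted_datetime"]]])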

# Parse project submitted date and sort the data frame
dfall["project_submitted_datetime"] = pd.to_datetime(dfall["project_submitted_datetime"])
dfall = dfall.sort_values(by=["project_submitted_datetime"])
dfall = dfall.set_index("project_submitted_datetime")

# Calculate rolling 1 day approval rate feature (roll_approved_pct)
#  - Calculated as approved applications within the last 24 hours / total applications within the last 24 hours
#  - Always exclude the current application from calculation
dfall["project_is_approved1"] = dfall["project_is_approved"].fillna(0)
dfall["train_data"] =1-( dfall["project_is_approved"].isna()*1)
dfall[["roll_approved","roll_total"]] = dfall.rolling('1d')[["project_is_approved1","train_data"]].sum()
dfall["roll_approved_pct"] = (dfall["roll_approved"]-dfall["project_is_approved1"])/(dfall["roll_total"]-dfall["train_data"])
dfall = dfall.reset_index(level="project_submitted_datetime")

# Create teacher-context features
#  - Sort by teacher ID + project submitted datetime and shift forward and backward to get information about next/previous application
dfall = dfall.sort_values(by=["teacher_id","project_submitted_datetime"])
dfall[["last_project_is_approved","last_dt"]] = dfall.groupby("teacher_id")[["project_is_approved","project_submitted_datetime"]].shift(1)
dfall[["next_project_is_approved","next_dt"]] = dfall.groupby("teacher_id")[["project_is_approved","project_submitted_datetime"]].shift(-1)
dfall["last_project_is_rejected"] = 1-dfall["last_project_is_approved"]
dfall["next_project_is_rejected"] = 1-dfall["next_project_is_approved"]
dfall["time_since_last_project"] = (dfall["project_submitted_datetime"] - dfall["last_dt"])/np.timedelta64(1, 'h')
dfall["time_to_next_project"] = (dfall["next_dt"] - dfall["project_submitted_datetime"])/np.timedelta64(1, 'h')
dfall["first_project_ind"] = dfall["time_since_last_project"].isna()*1
dfall["last_project_ind"] = dfall["time_to_next_project"].isna()*1
#  Use rank to get the position of the application in teacher's submission history
dfall["project_number"] = dfall.groupby("teacher_id")["project_submitted_datetime"].rank().astype(int)

#  Clean up variables: missing previous/next outcomes become 0 and are stored as integers
for col in ["last_project_is_approved","next_project_is_approved","last_project_is_rejected","next_project_is_rejected"]:
    dfall[col] = dfall[col].fillna(0).astype(int)
dfall.head(5)
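
The rolling approval rate above can be checked on a small example. The sketch below uses made-up timestamps and labels (a missing label stands for a test row) and shorter, hypothetical column names; it follows the same idea of subtracting the current row from both the numerator and the denominator.

```python
import pandas as pd

# Toy data: a missing label (None) marks a test row
toy = pd.DataFrame(
    {"approved": [1.0, 0.0, None, 1.0]},
    index=pd.to_datetime([
        "2017-01-01 08:00", "2017-01-01 12:00",
        "2017-01-01 18:00", "2017-01-03 09:00",
    ]),
)
toy["label_filled"] = toy["approved"].fillna(0)         # test rows count as 0 approvals
toy["has_label"] = toy["approved"].notna().astype(int)  # 1 for training rows, 0 for test rows

# Sums over the trailing 24 hours, including the current row
window = toy[["label_filled", "has_label"]].rolling("1D").sum()

# Remove the current row's own contribution before computing the rate
toy["roll_approved_pct"] = (
    (window["label_filled"] - toy["label_filled"])
    / (window["has_label"] - toy["has_label"])
)
print(toy["roll_approved_pct"].tolist())
```

Rows with no other labeled application in the preceding 24 hours end up with a missing rate, which has to be handled downstream.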

### 3.4 Split Training and Validation Data
Due to disk space constraints during the NN training phase, I use only a subset of the available data for training and validation.

In [4]:
# Split training and validation data
N_TRAINING = 60000
N_VALIDATION = 20000 

# Create separate training and validation datasets
training_dataset = training_dataset.reindex(np.random.RandomState(seed=67).permutation(training_dataset.index))
training_data = training_dataset.head(N_TRAINING).copy()
validation_data = training_dataset.tail(N_VALIDATION).copy()

### 3.5 Preprocess the resources dataset
This includes imputing missing descriptions and aggregating the data on request id level. Three aggregated variables are created:
 - resource_count - number of different items requested 
 - resource_price - total price for all resources
 - resource_descriptions - Appended descriptions from all resource rows

In [5]:
# Impute missing descriptions 
resources_dataset = resources_dataset.fillna({'description':'N/A'})
# Calculate total price for each resource
resources_dataset['total_price'] = resources_dataset['quantity'] * resources_dataset['price']
# Aggregate resources on id level: count rows, add total price, append descriptions
grp = resources_dataset.groupby(['id'])
resources_dataset_grp = grp.apply(lambda row: pd.Series(dict(
    resource_count=row['total_price'].count(),
    resource_price=row['total_price'].sum(),
    resource_descriptions=' '.join(row['description']) )) ).reset_index()
resources_dataset_grp.head(5)
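
As an alternative to `groupby(...).apply(...)`, the same three aggregates can be expressed with named aggregation, which is usually faster on large frames. This is a sketch on made-up rows and assumes a pandas version that supports named aggregation (0.25 or later); the `toy_` names are hypothetical.

```python
import pandas as pd

# Made-up resource rows for illustration
toy_resources = pd.DataFrame({
    "id": ["p1", "p1", "p2"],
    "description": ["wobble chair", None, "Apple iPad"],
    "quantity": [2, 1, 3],
    "price": [10.0, 5.0, 100.0],
})
toy_resources["description"] = toy_resources["description"].fillna("N/A")
toy_resources["total_price"] = toy_resources["quantity"] * toy_resources["price"]

# One output column per keyword: (source column, aggregation)
toy_grp = toy_resources.groupby("id").agg(
    resource_count=("total_price", "count"),
    resource_price=("total_price", "sum"),
    resource_descriptions=("description", lambda s: " ".join(s)),
).reset_index()
print(toy_grp)
```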

### 3.6 Calculate Teacher Statistics
Teacher statistics are another set of features that rely on the fact that the test dataset is selected randomly from the full timeline of submissions (for more detail, see section 2.1). The statistics are calculated separately for each teacher at three levels:
- All-time teacher statistics
- Same-month teacher statistics
- Same-day teacher statistics

For each of these periods, I calculate 
- Count of applications
- Count of approved applications
- Count of rejected applications
- Percent of approved applications

To avoid leaking the approval label, the stats are always calculated from all applications in the given time frame except the one in question.

In [6]:
# Calculation of teacher statistics is based on all training data
training_data_ext = training_dataset[["teacher_id","project_submitted_datetime","project_is_approved"]].copy()

# Discretize submitted datetime on day and month level
for ds in [training_data,validation_data,test_dataset,training_data_ext]:
    ds["project_submitted_date"] = pd.to_datetime(ds["project_submitted_datetime"]).dt.date
    ds["project_submitted_month"] = pd.to_datetime(ds["project_submitted_datetime"]).dt.to_period('M')

# Calculate all-time teacher stats
teacher_stats = pd.DataFrame(training_data_ext.groupby("teacher_id")["project_is_approved"].agg(["count","sum"]).reset_index())
teacher_stats = teacher_stats.rename(index=str,columns={"count":"app_cnt","sum":"approved_cnt"})
teacher_stats["approved_pct"] = teacher_stats["approved_cnt"]/teacher_stats["app_cnt"]
teacher_stats["rejected_cnt"] =teacher_stats["app_cnt"] - teacher_stats["approved_cnt"]
training_data = pd.merge(training_data, teacher_stats, on='teacher_id', how="left")
validation_data = pd.merge(validation_data, teacher_stats, on='teacher_id', how="left")
test_dataset = pd.merge(test_dataset, teacher_stats, on='teacher_id', how="left")

# Calculate teacher stats for each day and month
for period in ["date","month"]:
    teacher_stats = pd.DataFrame(training_data_ext.groupby(["teacher_id","project_submitted_" + period])["project_is_approved"].agg(["count","sum"]).reset_index())
    teacher_stats = teacher_stats.rename(index=str,columns={"count":"same_" + period + "_app_cnt","sum":"same_" + period + "_approved_cnt"})
    teacher_stats["same_" + period + "_approved_pct"] = teacher_stats["same_" + period + "_approved_cnt"]/teacher_stats["same_" + period + "_app_cnt"]
    teacher_stats["same_" + period + "_rejected_cnt"] = teacher_stats["same_" + period + "_app_cnt"] - teacher_stats["same_" + period + "_approved_cnt"]
    training_data = pd.merge(training_data, teacher_stats, on=['teacher_id','project_submitted_' + period], how="left")
    validation_data = pd.merge(validation_data, teacher_stats, on=['teacher_id','project_submitted_' + period], how="left")
    test_dataset = pd.merge(test_dataset, teacher_stats, on=['teacher_id','project_submitted_' + period], how="left")

# Impute zeroes for all missing counts
fillNaDict = {"app_cnt":0,"approved_cnt":0,"rejected_cnt":0,
              "same_date_app_cnt":0,"same_date_approved_cnt":0,"same_date_rejected_cnt":0,
              "same_month_app_cnt":0,"same_month_approved_cnt":0,"same_month_rejected_cnt":0}
training_data = training_data.fillna(fillNaDict)
validation_data = validation_data.fillna(fillNaDict)
test_dataset = test_dataset.fillna(fillNaDict)

# Adjust all stats, so that current row (and its approval) is excluded from the calculation
for ds in [training_data,validation_data]:
    for prd in ["","same_date_","same_month_"]:
        ds[prd + "app_cnt"] = ds[prd + "app_cnt"] - 1
        ds[prd + "approved_cnt"] = ds[prd + "approved_cnt"] - ds["project_is_approved"]
        ds[prd + "rejected_cnt"] = ds[prd + "app_cnt"] - ds[prd + "approved_cnt"]
        ds[prd + "approved_pct"] = ds[prd + "approved_cnt"]/ds[prd + "app_cnt"]

training_data.head(5)
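
The leave-one-out adjustment above can be illustrated on a tiny example. The sketch below uses `groupby(...).transform(...)` instead of a separate merge, which is a swapped-in technique and not what the cell above does; the toy teachers and labels are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "teacher_id": ["t1", "t1", "t1", "t2"],
    "project_is_approved": [1, 0, 1, 1],
})
grp = toy.groupby("teacher_id")["project_is_approved"]

# Per-teacher totals broadcast back to each row, minus the row's own contribution
toy["app_cnt"] = grp.transform("count") - 1
toy["approved_cnt"] = grp.transform("sum") - toy["project_is_approved"]
toy["rejected_cnt"] = toy["app_cnt"] - toy["approved_cnt"]
toy["approved_pct"] = toy["approved_cnt"] / toy["app_cnt"]
print(toy)
```

A teacher with a single application (`t2` here) has no other applications to learn from, so the count is 0 and the approval percentage is missing.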

### 3.7 Merge resource and time-context data to the training and test datasets

In [7]:
# Merge grouped resource data to the training and test datasets 
training_data = pd.merge(training_data, resources_dataset_grp, on='id', how="left")
validation_data = pd.merge(validation_data, resources_dataset_grp, on='id', how="left")
test_dataset = pd.merge(test_dataset, resources_dataset_grp, on='id', how="left")

# Merge time-context data to the training and test datasets
dropList = ["teacher_id","project_submitted_datetime","project_is_approved"]
training_data = pd.merge(training_data, dfall.drop(columns=dropList), on='id', how="left")
validation_data = pd.merge(validation_data, dfall.drop(columns=dropList), on='id', how="left")
test_dataset = pd.merge(test_dataset, dfall.drop(columns=dropList), on='id', how="left")
training_data.head(5)

### 3.8 Text Feature Preprocessing
This section contains standard preprocessing steps for text columns. All modifications are applied to the training, validation, and test datasets alike, so that the data looks the same during training and scoring.

**Multi-Category Columns**
- `project_subject_categories` and `project_subject_subcategories` contain comma-separated key phrases
- To handle these columns in a standard way, they are converted to pseudo-text: 1. remove spaces and 2. replace commas with spaces
- After this transformation, each key phrase becomes one word. This allows us to treat these columns as standard text columns

**Text Columns**
- Columns are cleaned to remove punctuation and special characters
- Derived features are created for each text feature: ..._is_na and ..._word_count
- Length of each text is limited to 500 words

In [8]:
datasets = [training_data,validation_data,test_dataset]
# 1. Multi-Category Columns
#   These columns include arbitrary number of comma-separated key phrases.
#    ->  We will remove spaces and then replace commas with spaces. That will allow us to treat the data as text with each key phrase being one "word"
multi_cat_cols = ["project_subject_categories","project_subject_subcategories"]
for col in multi_cat_cols:
    for ds in datasets:
        ds[col] = ds[col].str.replace(" ","").str.replace(","," ")

# 2. Text columns
#  - Clean: Remove punctuation, convert to lower case
#  - Engineer features: word_count, is_na
text_cols = ["project_title","project_essay_1","project_essay_2", #"project_essay_3","project_essay_4",
             "project_resource_summary","project_subject_categories","project_subject_subcategories","resource_descriptions"]
max_word_count = 500
for col in text_cols:
    for ds in datasets:
        ds[col + "_is_na"] = ds[col].isnull() * 1
        # Raw string keeps the regex escapes \w and \s intact
        ds[col] = ds[col].fillna('').str.lower().str.replace(r'[^\w\s]','')
        ds[col + "_word_count"] = ds[col].str.count(' ') + 1
        ds[col] = ds[col].str.split(' ',max_word_count).str[0:max_word_count].str.join(' ')

# Impute missing data for the teacher_prefix feature
for ds in datasets:
    ds["teacher_prefix"] = ds["teacher_prefix"].fillna('')

# Project Submitted Datetime Features
for ds in datasets:
    ds["project_submitted_datetime_dt"] = pd.to_datetime(ds["project_submitted_datetime"])
    ds["submitted_year"] = ds["project_submitted_datetime_dt"].dt.year
    ds["submitted_month"] = ds["project_submitted_datetime_dt"].dt.month
    ds["submitted_dow"] = ds["project_submitted_datetime_dt"].dt.weekday_name
    ds["submitted_dom"] = ds["project_submitted_datetime_dt"].dt.day

training_data.head(5)
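
The two text transformations above can be traced on single strings. The sketch below re-implements them in plain Python for illustration only; the helper names `to_pseudo_text` and `clean_text` and the sample strings are assumptions, not part of the kernel.

```python
import re

def to_pseudo_text(value):
    """Turn comma-separated key phrases into space-separated single 'words'."""
    return value.replace(" ", "").replace(",", " ")

def clean_text(value, max_words=500):
    """Lower-case, strip punctuation, and keep at most max_words words."""
    cleaned = re.sub(r"[^\w\s]", "", value.lower())
    return " ".join(cleaned.split(" ")[:max_words])

categories = to_pseudo_text("Literacy & Language, Math & Science")
print(categories)              # each key phrase is now one token
print(clean_text(categories))  # punctuation inside the tokens is removed as well
print(clean_text("Hello, World! It's 2018.", max_words=3))
```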

### 3.9 Create Features Definition
I have defined 6 types of features for this project, each with a different representation in the neural network:
 - Numeric features - A standard numeric input signal. Used for numeric columns without NaN values.
 - Numeric features bucketized - By default, these are bucketized into 20 bins of equal population. Alternatively, a custom list of bucket cutoffs can be specified.
 - Categorical features - These categorical columns are represented as one-hot indicators feeding directly into the model. Used for categorical columns with a small number of values.
 - Categorical features with embedding - These categorical columns are represented as one-hot indicators feeding into a 2-node embedding.
 - Binary features - Input columns have values 0 and 1 and are represented as a categorical column with identity.
 - Text features - Each text feature is represented as a categorical column with a vocabulary list, feeding into an embedding with 4 cells. The dictionary is determined from the training data separately for each feature and includes the 250 terms most overrepresented and the 250 terms most underrepresented in the approved applications. Terms must also meet a minimum count threshold in the approved applications in the training data (200). To reduce the disk space footprint, an additional preprocessing step at this stage removes all non-dictionary words from the respective text columns.
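
The equal-population bucketing mentioned in the list can be sketched as follows. The helper `quantile_boundaries` and the sample values are hypothetical; the point is that heavily skewed columns produce repeated quantiles, so duplicate cutoffs are dropped and fewer than 20 buckets may remain.

```python
from collections import OrderedDict
import pandas as pd

def quantile_boundaries(values, num_buckets=20):
    """Cutoffs that split values into num_buckets bins of roughly equal population."""
    qs = [i / num_buckets for i in range(1, num_buckets)]
    cutoffs = pd.Series(values).quantile(qs).tolist()
    # Drop repeated cutoffs while keeping them in increasing order
    return list(OrderedDict.fromkeys(cutoffs))

even = quantile_boundaries(list(range(1, 101)), num_buckets=4)
skewed = quantile_boundaries([0] * 90 + list(range(1, 11)), num_buckets=4)
print(even)    # three distinct cutoffs
print(skewed)  # all quartiles fall on 0, so only one cutoff remains
```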

In [10]:
# Define lists of all feature types
numeric_features = ["app_cnt","approved_cnt","rejected_cnt",
                    "same_date_app_cnt","same_date_approved_cnt","same_date_rejected_cnt",
                    "same_month_app_cnt","same_month_approved_cnt","same_month_rejected_cnt"]
numeric_features_bucket = ["teacher_number_of_previously_posted_projects","resource_price",
                    "time_since_last_project","time_to_next_project"
                    ,"approved_pct","same_date_approved_pct","same_month_approved_pct","project_number",
                    "roll_approved_pct"] + [col+"_word_count" for col in text_cols]
# Custom cutoffs for selected bucketized numeric features
numeric_features_bucket_cutoffs = {}
numeric_features_bucket_cutoffs["time_since_last_project"] = [0.1,0.2,0.3,0.5,1.0,6.0,24.0,168.0,336.0,720.0,8760.0]
numeric_features_bucket_cutoffs["time_to_next_project"] = numeric_features_bucket_cutoffs["time_since_last_project"]
numeric_features_bucket_cutoffs["project_number"] = [1.0,2.0,3.0,4.0,10.0,20.0]

categorical_features = ["project_grade_category","teacher_prefix","submitted_year","submitted_month","submitted_dow"]
categorical_features_embed = ["school_state","submitted_dom"]
binary_features = ["last_project_is_approved","next_project_is_approved",
                   "last_project_is_rejected","next_project_is_rejected",
                   "first_project_ind", "last_project_ind"] + [col + "_is_na" for col in text_cols]
text_features = text_cols
target = "project_is_approved"

# Based on the lists of column names above, create list of TensorFlow features 
features = []
for feature in binary_features:
    features.append( tf.feature_column.indicator_column(tf.feature_column.categorical_column_with_identity(feature, 2) ))

for feature in numeric_features_bucket:
    quantiles = []
    if feature in numeric_features_bucket_cutoffs:
        quantiles = numeric_features_bucket_cutoffs[feature]
    else:
        num_buckets = 20
        quantiles = training_data[feature].quantile(np.arange(1.0, num_buckets) / num_buckets)
        quantiles = [quantiles[q] for q in quantiles.keys()]
        quantiles = list(OrderedDict.fromkeys(quantiles))
    #print(feature)
    #print(quantiles)
    features.append( tf.feature_column.bucketized_column(tf.feature_column.numeric_column(feature), boundaries=quantiles) )

for feature in numeric_features:
    features.append( tf.feature_column.numeric_column(feature) )

for feature in categorical_features:
    dictionary = training_data[feature].unique()
    features.append(tf.feature_column.indicator_column(tf.feature_column.categorical_column_with_vocabulary_list(feature,dictionary)))

for feature in categorical_features_embed:
    embedding_count = 2
    dictionary = training_data[feature].unique()
    features.append(tf.feature_column.embedding_column(tf.feature_column.categorical_column_with_vocabulary_list(feature,dictionary),embedding_count))

# Helper function for text feature definition: Create dictionary of significant terms (return a list)
def getSignificantTerms(feature,posCountThreshold,wordCount):
    # Count term frequencies among accepted and among rejected applications
    tokenizerPos = Tokenizer()
    tokenizerNeg = Tokenizer()
    tokenizerPos.fit_on_texts(training_data.loc[training_data[target]==1][feature])
    tokenizerNeg.fit_on_texts(training_data.loc[training_data[target]==0][feature])
    posCounts = tokenizerPos.word_counts
    negCounts = tokenizerNeg.word_counts
    words = []
    # Iterate over all terms in the accepted submissions
    for w in posCounts:
        p = posCounts[w]
        # Term must appear more than posCountThreshold times in the accepted submissions to be considered
        if p > posCountThreshold:
            n = 0
            if w in negCounts:
                n = negCounts[w]
            # Add term to candidate list with some basic stats:
            # [word, count in accepted, count in rejected, share of occurrences in accepted]
            words.append([w,p,n,p/(p+n)])
    if len(words) <= wordCount:
        return [x[0] for x in words]
    words = sorted(words,key=lambda x: -x[3])
    wordCountPart = int(wordCount/2)
    ret = [x[0] for x in words[0:wordCountPart]]  # Most overrepresented in accepted
    ret += [x[0] for x in words[-wordCountPart:]]  # Most underrepresented in accepted
    # The two slices cannot overlap here because len(words) > wordCount; return a list, as in the branch above
    return ret

# Helper function for text feature definition: Prune text fields to keep only dictionary words
def dropNonDictionaryWords(feature,dictionary):
    dict_set = set(dictionary)
    for ds in [training_data,validation_data,test_dataset]:
        ds[feature] = ds[feature].apply(lambda x: set([y for y in x.split(' ') if y in dict_set])).str.join(' ')
    #print(training_data[feature][:10])
    
for feature in text_features:
    max_words = 500
    min_pos_frequency = 200
    embedding_count = 4
    dictionary = getSignificantTerms(feature,min_pos_frequency,max_words)
    #print(feature)
    dropNonDictionaryWords(feature,dictionary)
    features.append(tf.feature_column.embedding_column(tf.feature_column.categorical_column_with_vocabulary_list(feature,dictionary),embedding_count))
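
The dictionary selection can be illustrated without TensorFlow or Keras. The sketch below is a simplified stand-in that counts words with `collections.Counter` and plain whitespace splitting rather than the Keras `Tokenizer`; the function name `significant_terms` and the toy texts are assumptions.

```python
from collections import Counter

def significant_terms(pos_texts, neg_texts, min_pos_count, max_terms):
    """Pick the terms most over- and underrepresented in the accepted texts."""
    pos = Counter(w for text in pos_texts for w in text.split())
    neg = Counter(w for text in neg_texts for w in text.split())
    # Score = share of a term's occurrences that fall into accepted texts
    scored = [(w, p / (p + neg[w])) for w, p in pos.items() if p > min_pos_count]
    if len(scored) <= max_terms:
        return [w for w, _ in scored]
    scored.sort(key=lambda item: -item[1])
    half = max_terms // 2
    return [w for w, _ in scored[:half]] + [w for w, _ in scored[-half:]]

accepted = ["ipad ipad books chairs", "ipad books chairs pencils", "books pencils"]
rejected = ["chairs chairs chairs pencils", "books"]
terms = significant_terms(accepted, rejected, min_pos_count=1, max_terms=2)
print(terms)
```

In this toy case `ipad` never appears in rejected texts and `chairs` appears there most often relative to accepted texts, so these two terms form the dictionary.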

### 3.10 Create Input Function for DNNClassifier 

The input function must stay in sync with the feature list created above.

In [12]:
# Helper function for text feature input: Translate text (string with words) into a list of words
def _parse_text(features,targets):
    for key in text_features:
        features[key] = tf.string_split([features[key]]).values
        #print(features[key])
    return (features,targets)
    
def my_input_fn( data, batch_size=1, shuffle=True, num_epochs=None ):
    # Create dictionary of all columns that are used by the features as defined above
    features = {key:np.array(data[key]) for key in (binary_features + numeric_features + numeric_features_bucket + categorical_features 
                                                    + categorical_features_embed + text_features)}
    targets = data[target]
    
    ds = Dataset.from_tensor_slices((features,targets)) # warning: 2GB limit
    ds = ds.map(_parse_text)
    # Shuffle individual records before batching, so that batch composition changes between epochs
    if shuffle:
        ds = ds.shuffle(10000)
    ds = ds.padded_batch(batch_size, ds.output_shapes).repeat(num_epochs)
    
    # Return the next batch of data
    features, labels = ds.make_one_shot_iterator().get_next()
    return features, labels
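
Because each text feature becomes a variable-length list of words, the records in a batch have to be padded to a common length, which is what `padded_batch` takes care of. The plain-Python sketch below only mimics that idea for a single feature; `padded_batches` is a hypothetical helper, and padding with an empty string is an assumption about the default for string data.

```python
def padded_batches(token_lists, batch_size, pad=""):
    """Yield batches in which every token list is padded to the longest one in the batch."""
    for start in range(0, len(token_lists), batch_size):
        batch = token_lists[start:start + batch_size]
        width = max(len(tokens) for tokens in batch)
        yield [tokens + [pad] * (width - len(tokens)) for tokens in batch]

batches = list(padded_batches([["a", "b", "c"], ["d"], ["e", "f"]], batch_size=2))
print(batches)
```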

### 3.11 Train Model and Score Test and Validation Data

In [13]:
# Define Estimator
estimator = tf.estimator.DNNClassifier(
    feature_columns=features,
    hidden_units=[10,10],
    optimizer=tf.train.ProximalAdagradOptimizer(
      learning_rate=0.1,
      l1_regularization_strength=0.0001
    ))
# Create training and scoring versions of the input function 
train_fn = lambda:my_input_fn(data=training_data,batch_size=100)
trainev_fn = lambda:my_input_fn(data=training_data,num_epochs=1,shuffle=False)
valid_fn = lambda:my_input_fn(data=validation_data,num_epochs=1,shuffle=False)
# Train model
estimator.train(input_fn=train_fn, steps=2000)
# Calculate training and validation metrics
training_metrics = estimator.evaluate(input_fn=trainev_fn)
validation_metrics = estimator.evaluate(input_fn=valid_fn)
print("AUC train/validation: {}/{}".format(training_metrics['auc'],validation_metrics['auc']))
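
For intuition about the metric reported above: AUC can be read as the probability that a randomly chosen approved application receives a higher score than a randomly chosen rejected one. The sketch below computes it exactly by comparing all pairs on made-up labels and scores; the estimator's `auc` value is computed differently and may deviate slightly from such an exact calculation.

```python
def pairwise_auc(labels, scores):
    """Exact AUC: share of (positive, negative) pairs ranked correctly, ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities
toy_auc = pairwise_auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])
print(toy_auc)
```

Comparing all pairs is quadratic in the number of records, so this form is only suitable for small samples.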

### 3.12 Score Test Dataset and Save Results 

In [14]:
# Make predictions
# The input function expects a target column, so add a dummy one to the test data
test_dataset["project_is_approved"] = 0
test_fn = lambda:my_input_fn(data=test_dataset,num_epochs=1,shuffle=False)
predictions_generator = estimator.predict(input_fn=test_fn)
predictions_list = list(predictions_generator)

# Extract probabilities
probabilities = [p["probabilities"][1] for p in predictions_list]

my_submission = pd.DataFrame({'id': test_dataset["id"], 'project_is_approved': probabilities})

my_submission.to_csv('my_submission.csv', index=False)