#### Copyright 2017 Google LLC.

In [0]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Lab 9: Bucketized Features Using Quantiles and Feature Crosses
**Learning Objectives:**
  * Learn to use quantiles to create bucketized features.
  * Learn how to introduce feature crosses.
  * Starting from just having the data loaded, train a linear classifier to predict if an individual's income is at least 50k using numerical features, categorical features, bucketized features, and feature crosses.


### Standard Set-up

We begin with the same standard set-up as in the last lab, again using the census data set.

In [0]:
import math

from IPython import display
from matplotlib import cm
from matplotlib import gridspec
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from mpl_toolkits.mplot3d import Axes3D
from sklearn import metrics
import tempfile
import tensorflow as tf
from tensorflow.contrib.learn.python.learn import learn_io, estimator
import urllib

# This line restricts TensorFlow logging to error messages only.  You can
# remove it if you want more verbose logging (for example, warnings).
tf.logging.set_verbosity(tf.logging.ERROR)

# Display floats with two decimal places and print at most 15 rows.  This
# affects display readability only, not the underlying data.
pd.options.display.float_format = '{:.2f}'.format
pd.options.display.max_rows = 15


train_file = tempfile.NamedTemporaryFile()
urllib.urlretrieve("http://mlr.cs.umass.edu/ml/machine-learning-databases/adult/adult.data", train_file.name)

COLUMNS = ["age", "workclass", "sample_weight", "education", "education_num",
           "marital_status", "occupation", "relationship", "race", "gender",
           "capital_gain", "capital_loss", "hours_per_week", "native_country",
           "income_bracket"]
census_df = pd.read_csv(train_file, names=COLUMNS, skipinitialspace=True)

### Making Numerical Features Categorical through Bucketization

As we saw in [Lab 4 (Using Bucketized Numerical Features)](https://colab.sandbox.google.com/notebook#fileId=/v2/external/notebooks/intro_to_ml_semester_course/Lab_4__Using_a_Bucketized_Numerical_Feature.ipynb), the relationship between a numerical feature and the label is often not linear. As an example relevant to this data set, a person's income may grow with age in the early stage of their career, then the growth may slow at some point, and finally the income may decrease after retirement. If we want to learn the fine-grained correlation between income and each age group separately, we can leverage bucketization (also known as binning).  **Bucketization** is the process of partitioning the entire range of a numerical feature into bins/buckets, and then converting the original numerical feature into a set of categorical features, with one feature corresponding to each bucket (with a value of 1 when the numerical feature falls in the range of the bucket, and 0 otherwise). However, in general, it is not feasible to hand-pick boundaries as we did for compression ratio in Lab 4.
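
To make the encoding concrete, here is a minimal sketch of bucketizing a handful of ages into one-hot features with NumPy. The boundaries `[30, 50]` and the sample ages are hypothetical values chosen only for illustration, and `np.digitize` is used as a stand-in for what a bucketized column does; it is not the TensorFlow implementation.

```python
import numpy as np

# Hypothetical hand-picked boundaries: buckets are <30, 30-49, and >=50.
boundaries = [30, 50]
ages = np.array([22, 30, 45, 67])

# np.digitize returns, for each age, the index of the bucket it falls in.
# With the default right=False, a value equal to a boundary goes to the
# bucket above it.
bucket_ids = np.digitize(ages, boundaries)

# One column per bucket: 1 where the age falls in that bucket, 0 otherwise.
num_buckets = len(boundaries) + 1
one_hot = np.eye(num_buckets, dtype=int)[bucket_ids]

print(bucket_ids)
print(one_hot)
```

Here the ages 22, 30, 45, and 67 fall in buckets 0, 1, 1, and 2, so each row of `one_hot` has a single 1 in the corresponding column.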


### Computing Quantile Boundaries

A good general approach is to bucketize features into groups so that there are roughly the same number of examples falling into each group.  Such groups are called ***quantiles*** and can be computed very simply as illustrated below in `get_quantile_based_boundaries`.

In [0]:
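def get_quantile_based_boundaries(feature_values, num_buckets):
  """Returns the num_buckets - 1 quantile boundaries of feature_values."""
  # Fractions 1/num_buckets, 2/num_buckets, ..., (num_buckets - 1)/num_buckets.
  boundaries = np.arange(1.0, num_buckets) / num_buckets
  # feature_values is a pandas Series; quantile() returns one value per fraction.
  quantiles = feature_values.quantile(boundaries)
  return [q for q in quantiles]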
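
As a quick sanity check, the function can be tried on data where the answer is easy to verify by hand. The sketch below repeats the function so that it runs on its own; the input (the integers 1 through 100) is a hypothetical feature, not part of the census data.

```python
import numpy as np
import pandas as pd

def get_quantile_based_boundaries(feature_values, num_buckets):
  # Same function as above, repeated so this example is self-contained.
  boundaries = np.arange(1.0, num_buckets) / num_buckets
  quantiles = feature_values.quantile(boundaries)
  return [q for q in quantiles]

# Hypothetical feature: the integers 1 through 100.
values = pd.Series(np.arange(1, 101))

# Asking for 4 buckets yields 3 boundaries (the quartiles).
quartile_boundaries = get_quantile_based_boundaries(values, 4)
print(quartile_boundaries)
```

The three boundaries come out as 25.75, 50.5, and 75.25, because pandas interpolates linearly between neighboring values by default. Note that `num_buckets` buckets always need only `num_buckets - 1` boundaries.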

Let's try it out on `age` with 5 quantiles. To visualize the boundaries, we plot them as vertical lines on a histogram of `age`. The bins defined for `age` on this data are $\le$25, 26-32, 33-40, 41-49, and $\ge$50.

In [0]:
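# Plot a histogram of age and overlay the quantile boundaries as green lines.
histogram = census_df["age"].hist(bins=50)
boundaries = get_quantile_based_boundaries(census_df["age"], 5)
# This form of print works in both Python 2 and Python 3.
print("boundaries are: %s" % boundaries)
for x in boundaries:
  plt.axvline(x, color='g')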
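
One way to check that quantile boundaries really give buckets of roughly equal size is to count the examples that land in each bucket. The sketch below uses synthetic integer ages (an assumption, so that the example runs without the census download) and `np.digitize` to assign bucket indices; neither is part of the lab's provided code.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for census_df["age"]: 10,000 integer ages in [17, 90].
rng = np.random.RandomState(0)
ages = pd.Series(rng.randint(17, 91, size=10000))

# Compute the 4 boundaries that split the data into 5 quantile buckets.
num_buckets = 5
fractions = np.arange(1.0, num_buckets) / num_buckets
boundaries = ages.quantile(fractions).tolist()

# Assign each age to a bucket index in 0..num_buckets-1, then count.
bucket_ids = np.digitize(ages.values, boundaries)
counts = np.bincount(bucket_ids, minlength=num_buckets)
print(counts)
```

The counts should be close to, but not exactly, 2,000 each: many examples share the same integer age, and all examples with a tied value must fall in the same bucket.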

### Feature Crosses

As we discussed in the slides, another very powerful way to capture non-linear behavior in a linear model is to introduce feature crosses. Any combination of categorical features and bucketized features (which are a form of categorical feature) can be combined in a **feature cross**.  When this is done, a new categorical feature is introduced for each possible combination of values of the features in the cross.  Thus, if a feature with `n1` values is crossed with a feature with `n2` values, then there will be `n1 * n2` features for the cross.

Here is a sample of creating a cross between `gender` and `age_buckets`.
```
gender_x_age_buckets = tf.contrib.layers.crossed_column(
    [gender, age_buckets], hash_bucket_size=1000)
```

If we had defined 5 age buckets as above, then this crossed column would introduce 10 Boolean features: one for males in each of the 5 age buckets listed above, and one for females in each of the 5 age buckets.
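
The counting argument above can be sketched in plain Python. The helper `encode_cross` and the bucket labels are hypothetical names introduced for illustration. The explicit enumeration is also a simplification: `crossed_column` takes a `hash_bucket_size`, which indicates that combinations are hashed into a fixed number of slots rather than listed out.

```python
import itertools

# Values of the two features being crossed (age bucket labels from the text).
gender_values = ["Female", "Male"]
age_bucket_values = ["<=25", "26-32", "33-40", "41-49", ">=50"]

# A cross has one feature per combination of values: 2 * 5 = 10 here.
crossed_values = list(itertools.product(gender_values, age_bucket_values))
index_of = {combo: i for i, combo in enumerate(crossed_values)}

def encode_cross(gender, age_bucket):
  """Hypothetical helper: one-hot encodes a (gender, age_bucket) pair."""
  encoding = [0] * len(crossed_values)
  encoding[index_of[(gender, age_bucket)]] = 1
  return encoding

print(len(crossed_values))
print(encode_cross("Male", "33-40"))
```

Each example activates exactly one of the 10 crossed features, which lets a linear model learn a separate weight for, say, males aged 33-40 instead of only one weight per gender and one per age bucket.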

## Task 1 - Train a Linear Classifier with Bucketized Features and Feature Crosses (5 points)

For this lab, you are going to train a model to improve upon what you did in Lab 8 by introducing bucketized features and feature crosses.  You should introduce at least two bucketized features and at least two feature crosses.

Unlike in past labs, we are not providing any code other than what is provided above to load the data into Pandas and compute quantile boundaries.  Just to be sure it is clear how to introduce a `bucketized_column` and a `crossed_column`, below is a starting point for `construct_feature_columns`.  Copy any pieces of code that you'd like to use from Lab 8.

**WARNING: As discussed in the slides, the log loss takes the logarithm of a predicted probability, which goes to negative infinity as that probability approaches 0. When training a logistic regression model with a lot of features, and thus the potential to overfit the training data, predictions can become so extreme that the loss or its gradient overflows. If you see an error indicating that you divided by zero, or a loss of NaN, then most likely this situation has occurred. The way to address this problem is to introduce regularization (which you will learn how to do in the next lab). For now, the solution is to reduce the learning rate and/or the number of training steps, even if that means that your model is undertrained.**
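
The following is a minimal numeric sketch of how infinite and NaN losses can arise once predictions saturate. `naive_log_loss` is a hypothetical, unprotected NumPy implementation written for this illustration; it is not how TensorFlow computes the loss.

```python
import numpy as np

def naive_log_loss(label, prediction):
  """Hypothetical unprotected log loss for one example (not TensorFlow's)."""
  # Silence NumPy's divide-by-zero / invalid-value warnings so the
  # resulting inf and nan values can be inspected directly.
  with np.errstate(divide="ignore", invalid="ignore"):
    return -(label * np.log(prediction) +
             (1 - label) * np.log(1 - prediction))

# Loss for a positive example (label 1.0) at increasingly extreme predictions.
for prediction in [0.5, 1e-3, 1e-9, 0.0, 1.0]:
  print("%g -> %s" % (prediction, naive_log_loss(1.0, prediction)))
```

For a positive example, the loss grows as the prediction approaches 0 and becomes infinite at exactly 0. A prediction that saturates at exactly 1 yields NaN in this naive form, because the expression evaluates `0 * log(0)`.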


In [0]:
# You will need to modify this to add the features you had in Task 3 in addition
# to adding at least two more feature crosses beyond the one illustrated here. 

def construct_feature_columns():
  """Construct TensorFlow Feature Columns for features
  
  Returns:
    A set of feature columns
  """
  
  # Sample of creating a real-valued column.
  age = tf.contrib.layers.real_valued_column("age") 
  
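  # Sample of creating a bucketized column using a real-valued column.
  # Note: training_examples is not defined above; you must create it yourself
  # (for example, as the training split of census_df).
  boundaries = get_quantile_based_boundaries(training_examples["age"], 5)
  age_buckets = tf.contrib.layers.bucketized_column(age, boundaries)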
  
  # Sample of creating a categorical column with known values
  gender = tf.contrib.layers.sparse_column_with_keys(
    column_name="gender", keys=["Female", "Male"])

  # Sample of a crossed_column which in this case combines a bucketized column
  # and a categorical column. In general, you can include any number of each.
  # So for example you could cross two categorical columns, or two bucketized
  # columns, two categorical columns and also a bucketized column,...
  gender_x_age_buckets = tf.contrib.layers.crossed_column(
      [gender, age_buckets], hash_bucket_size=1000)

  # In this sample code, note that while the real-valued column age was defined
  # in order to define the bucketized column age_buckets, the real-valued
  # feature age is not being included in feature_columns.  If you would like
  # the real-valued feature age to also be used in training the model then you
  # would add that to the set of feature columns being returned.
  feature_columns = [age_buckets, gender, gender_x_age_buckets]
 
  return feature_columns