# Feature columns

Feature columns tell the model what inputs to expect and how each one should be interpreted.

In [12]:
import tensorflow as tf
import numpy as np

In [13]:
fc = tf.feature_column

## Example

In [14]:
featcolumns = [
    fc.numeric_column('sq_footage'),
    fc.categorical_column_with_vocabulary_list('type', ['house', 'apt'])  # sparse categorical; wrap in fc.indicator_column for a dense one-hot input
]

## `bucketized_column`

The `fc.bucketized_column` function splits a numeric feature into categories based on numeric ranges.

In [21]:
NBUCKETS = 16
latbuckets = np.linspace(start=38.0, stop=42.0, num=NBUCKETS).tolist()
lonbuckets = np.linspace(start=72.0, stop=76.0, num=NBUCKETS).tolist()

In [22]:
bucketised_plat = fc.bucketized_column(
    source_column=fc.numeric_column('pickup_latitude'),
    boundaries=latbuckets
)

In [23]:
bucketised_plon = fc.bucketized_column(
    source_column=fc.numeric_column('pickup_longitude'),
    boundaries=lonbuckets
)
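
To make the bucketing rule concrete, here is a minimal pure-Python sketch of how a value could be mapped to a bucket index from a list of boundaries. It assumes the convention documented for `bucketized_column`, where boundaries `[b0, ..., bn]` produce the buckets `(-inf, b0)`, `[b0, b1)`, ..., `[bn, +inf)`. The helper `bucket_index` is hypothetical and not part of TensorFlow.

```python
from bisect import bisect_right

NBUCKETS = 16
# Roughly the same 16 evenly spaced boundaries as
# np.linspace(start=38.0, stop=42.0, num=16).
step = (42.0 - 38.0) / (NBUCKETS - 1)
latbuckets = [38.0 + i * step for i in range(NBUCKETS)]


def bucket_index(value, boundaries):
    """Hypothetical helper: index of the bucket that `value` falls into."""
    # Counts how many boundaries are <= value, so a value equal to a
    # boundary lands in the bucket that starts at that boundary.
    return bisect_right(boundaries, value)


# n boundaries give n + 1 buckets, because of the two open-ended ones.
num_buckets = len(latbuckets) + 1

print(num_buckets)
print(bucket_index(37.0, latbuckets))  # below the first boundary
print(bucket_index(40.7, latbuckets))  # somewhere in the middle
print(bucket_index(42.5, latbuckets))  # above the last boundary
```

Under this convention, passing `NBUCKETS = 16` boundaries yields 17 categories, since values below the first boundary and above the last one each get their own bucket.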

## `categorical_column_*`

Categorical columns are represented as sparse tensors. This saves memory and reduces compute time.

If you know the keys beforehand:

In [30]:
fc.categorical_column_with_vocabulary_list('postcode', ['W1A1AA', 'W13LBT', 'W377F'])
# fc.categorical_column_with_vocabulary_file in case the vocabulary is stored in a file rather than in memory

VocabularyListCategoricalColumn(key='postcode', vocabulary_list=('W1A1AA', 'W13LBT', 'W377F'), dtype=tf.string, default_value=-1, num_oov_buckets=0)
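
The repr above shows `default_value=-1` and `num_oov_buckets=0`, which means values missing from the vocabulary are mapped to the id -1. A rough pure-Python sketch of that lookup follows. `vocabulary_id` is a hypothetical helper used only for illustration, not a TensorFlow function.

```python
vocabulary = ['W1A1AA', 'W13LBT', 'W377F']

# Each known key gets its position in the list as an integer id.
key_to_id = {key: idx for idx, key in enumerate(vocabulary)}


def vocabulary_id(value, default_value=-1):
    """Hypothetical helper: id of `value`, or `default_value` if unknown."""
    return key_to_id.get(value, default_value)


print(vocabulary_id('W13LBT'))  # known key
print(vocabulary_id('E14AB'))   # not in the vocabulary
```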

If your data is already indexed, i.e. it holds integers in [0, N):

In [31]:
fc.categorical_column_with_identity('schoolsRatings', num_buckets=2)

IdentityCategoricalColumn(key='schoolsRatings', number_buckets=2, default_value=None)

If you don't have a vocabulary of all possible values:

In [29]:
fc.categorical_column_with_hash_bucket('storeID', hash_bucket_size=500)

HashedCategoricalColumn(key='storeID', hash_bucket_size=500, dtype=tf.string)
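
The idea behind a hashed column can be sketched as "hash the value, then take the remainder modulo the number of buckets". The snippet below is only an illustration under that assumption: `hash_bucket` is a hypothetical helper, and `hashlib.md5` stands in for TensorFlow's own fingerprint function, so the bucket ids it produces will not match those of `categorical_column_with_hash_bucket`.

```python
import hashlib

HASH_BUCKET_SIZE = 500


def hash_bucket(value, num_buckets=HASH_BUCKET_SIZE):
    """Hypothetical helper: map a string to a bucket id in [0, num_buckets)."""
    # md5 is only a stand-in for a stable hash; TensorFlow uses its own
    # fingerprint function, so real bucket ids will differ.
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    return int(digest, 16) % num_buckets


# Made-up store ids, used only to exercise the helper.
store_ids = ['store_001', 'store_002', 'store_9999']
buckets = [hash_bucket(s) for s in store_ids]
print(buckets)
```

Because any string maps to some bucket, no vocabulary is needed. The trade-off is that unrelated values can collide in the same bucket; a larger `hash_bucket_size` makes collisions less likely.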

## `embedding_column`

As the number of categories of a feature grows, it becomes infeasible to train a neural network using one-hot encodings. Imagine a categorical feature with 1 million possible values: with one-hot encoding, we would need 1 million inputs to represent them.

Embeddings overcome this limitation. Rather than representing the data as a one-hot tensor of many dimensions, an embedding column represents the data in a lower-dimensional space using a dense vector, where each cell can contain any number, not just 0 or 1.
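
As a sketch of the idea (not TensorFlow's implementation), the snippet below compares a one-hot vector with an embedding lookup. The category count is kept at 1,000 so it runs quickly, and the embedding size of 8 is an assumption chosen only for illustration. The table values are random here, whereas in a real model they are learned during training. In TensorFlow, the corresponding wrapper is `fc.embedding_column(categorical_column, dimension=...)`.

```python
import random

NUM_CATEGORIES = 1000   # kept small here; the text imagines 1 million
EMBEDDING_DIM = 8       # assumed size, chosen only for illustration

random.seed(0)
# One row of EMBEDDING_DIM floats per category. Random here; in a real
# model these weights are learned during training.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(EMBEDDING_DIM)]
    for _ in range(NUM_CATEGORIES)
]


def one_hot(category_id, size=NUM_CATEGORIES):
    """One-hot encoding: all zeros except a single 1."""
    vec = [0.0] * size
    vec[category_id] = 1.0
    return vec


def embed(category_id):
    """Dense encoding: look up the table row for this category."""
    return embedding_table[category_id]


category_id = 42
print(len(one_hot(category_id)))  # one input per category
print(len(embed(category_id)))    # one input per embedding dimension
```

Looking up a row is equivalent to multiplying the one-hot vector by the table, which is why an embedding can replace the one-hot input without ever materialising it.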