# STEP 3 - Feature Selection

A central model for TFF has been found.
The model that uses all features proved to be less accurate.
The next step is to evaluate which features should be selected for the best prediction quality.
This is done by training models on all possible feature subsets and comparing the results.
The features expected to be most important are:

* the temporal features (all time components including is_weekend)
* user id
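
The subset search described above can be sketched in plain Python. This is a minimal illustration, not the training loop itself: the groups mirror the drop list used later in this notebook, and the helper name `excluded_column_sets` is hypothetical.

```python
import itertools

# Groups of columns that are dropped together (mirrors the notebook's drop list).
feature_groups = [
    ["user_id"],
    ["latitude", "longitude"],
    ["is_weekend"],
    ["venue_id"],
    ["orig_cat_id"],
]

def excluded_column_sets(groups):
    """Yield every non-empty combination of groups as a flat list of columns."""
    for size in range(1, len(groups) + 1):
        for subset in itertools.combinations(groups, size):
            yield [column for group in subset for column in group]

all_exclusions = list(excluded_column_sets(feature_groups))
print(len(all_exclusions))  # number of training runs: 2**5 - 1
print(all_exclusions[0])    # first run excludes only the first group
```

With five groups this yields 31 runs, so the cost of the search grows exponentially with the number of groups.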

## Imports

In [1]:
import nest_asyncio

nest_asyncio.apply()

import collections
import itertools
import functools
import os
import time
import numpy as np
import tensorflow as tf
import tensorflow_federated as tff
import pandas as pd

from tensorflow import feature_column
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split

from tqdm import tqdm

In [2]:
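import logging

# Make sure the log directory exists before the log file is opened
os.makedirs("./log/feature-selection", exist_ok=True)
logging.basicConfig(filename="./log/feature-selection/Evaluation.log", level=logging.INFO)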

def log(text):
  print(text)
  logging.info(text)

In [3]:
# Test that TFF is working:
tff.federated_computation(lambda: 'Hello, World!')()

b'Hello, World!'

## Feature Selection

Several models are trained on subsets of the full dataset to check which features are most beneficial for the prediction.

In [4]:
df = pd.read_csv("./4square/processed_transformed_big.csv")
df.head(100)

Unnamed: 0,cat_id,user_id,latitude,longitude,is_weekend,clock_sin,clock_cos,day_sin,day_cos,month_sin,month_cos,week_day_sin,week_day_cos,venue_id,orig_cat_id
0,0,470,40.719810,-74.002581,False,-1.000000,0.000654,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349,0,0
1,1,979,40.606800,-74.044170,False,-0.999998,0.001818,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349,1,1
2,2,69,40.716162,-73.883070,False,-0.999945,0.010472,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349,2,2
3,3,395,40.745164,-73.982519,False,-0.999931,0.011708,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349,3,3
4,4,87,40.740104,-73.989658,False,-0.999914,0.013090,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349,4,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,7,445,40.828602,-73.879259,False,-0.959601,0.281365,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349,93,24
96,6,235,40.745463,-73.990983,False,-0.956326,0.292302,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349,94,6
97,8,118,40.600144,-73.946593,False,-0.955729,0.294249,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349,95,57
98,2,1054,40.870630,-74.097926,False,-0.955407,0.295291,0.587785,0.809017,0.866025,-0.5,0.781831,0.62349,96,58


Only the 100 most active users are used for this purpose, because they have the longest sequences of visited places.

In [5]:
count = df.user_id.value_counts()

idx = count.index[:100]  # value_counts sorts descending, so these are the 100 users with the most records
df = df.loc[df.user_id.isin(idx)]
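
The same selection can be illustrated on a toy list without pandas. This is a sketch only: the check-in data and the helper name `most_active_users` are hypothetical.

```python
from collections import Counter

# Hypothetical check-in log: one user id per visit.
check_ins = [7, 3, 7, 1, 3, 7, 2, 3, 7, 1]

def most_active_users(user_ids, k):
    """Return the k user ids with the most records, most active first."""
    return [user for user, _ in Counter(user_ids).most_common(k)]

top_users = most_active_users(check_ins, 2)

# Keep only the records that belong to the selected users.
selected = set(top_users)
filtered = [user for user in check_ins if user in selected]
print(top_users, len(filtered))
```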

A list is created that holds one DataFrame per user with all of that user's visited locations.
The original data is sorted by time (ascending).
Thus, each entry contains the sequence of visited location categories for one user.

In [6]:
# List the df for each user
users_locations = []

# For each user
for user_id in tqdm(idx):
  users_locations.append(df.loc[df.user_id == user_id].copy())

100%|██████████| 100/100 [00:00<00:00, 1724.20it/s]


The data first has to be split into training, validation and test sets for each user.
These per-user sets are merged again afterwards.
This ensures that each user's sequence stays in chronological order and is not split randomly.

In [7]:
# List the dfs for train, val and test for each user
users_locations_train = []
users_locations_val = []
users_locations_test = []

for user_df in users_locations:
  # Split in train, test and validation
  train, test = train_test_split(user_df, test_size=0.2, shuffle=False)
  train, val = train_test_split(train, test_size=0.2, shuffle=False)

  # Append the sets
  users_locations_train.append(train)
  users_locations_val.append(val)
  users_locations_test.append(test)
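
Two successive 80/20 splits without shuffling leave roughly 64% of each user's records for training, 16% for validation and 20% for testing, all in chronological order. The sketch below reproduces that arithmetic in plain Python. `chronological_split` is a hypothetical stand-in, and rounding the held-out part up is an assumption about how `train_test_split` behaves with `shuffle=False`.

```python
import math

def chronological_split(records, tail_fraction=0.2):
    """Split an ordered sequence into (head, tail) without shuffling."""
    n_tail = math.ceil(len(records) * tail_fraction)
    cut = len(records) - n_tail
    return records[:cut], records[cut:]

# Hypothetical user with 100 time-ordered visits.
visits = list(range(100))

train, test = chronological_split(visits)  # last 20% becomes the test set
train, val = chronological_split(train)    # last 20% of the rest becomes validation

print(len(train), len(val), len(test))
```

Because the tail is always cut from the end, every validation record is later than every training record, and every test record is later than both.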

The dataframes are concatenated again.

In [8]:
# Merge back the dataframes
df_train = pd.concat(users_locations_train)

# Merge back the dataframes
df_val = pd.concat(users_locations_val)

# Merge back the dataframes
df_test = pd.concat(users_locations_test)

In [9]:
user_ids = df_train.user_id.unique()

The following helper functions split the data, create the client dictionaries and preprocess the data for the FL algorithm.
The model creation is also defined here.

In [10]:
# Split the data into chunks of N
def split_data(N, train, val, test):

  # dictionary of list of df
  df_dictionary = {}

  for uid in tqdm(user_ids):
    # Get the records of the user
    user_df_train = train.loc[train.user_id == uid].copy()
    user_df_val = val.loc[val.user_id == uid].copy()
    user_df_test = test.loc[test.user_id == uid].copy()

    # Get a list of dataframes of length N records
    user_list_train = [user_df_train[i:i+N] for i in range(0, user_df_train.shape[0], N)]
    user_list_val = [user_df_val[i:i+N] for i in range(0, user_df_val.shape[0], N)]
    user_list_test = [user_df_test[i:i+N] for i in range(0, user_df_test.shape[0], N)]

    # Save the list of dataframes into a dictionary
    df_dictionary[uid] = {
        'train': user_list_train,
        'val': user_list_val,
        'test': user_list_test
    }

  return df_dictionary

In [11]:
# Takes a dictionary with train, validation and test sets and the desired set type
def create_clients_dict(df_dictionary, set_type, N):

  dataset_dict = {}

  for uid in tqdm(user_ids):

    c_data = collections.OrderedDict()
    values = df_dictionary[uid][set_type]

    # Skip users that have no data for this set type
    if len(values) > 0:
      # If the last dataframe of the list is not complete, leave it out
      if len(values[-1]) < N:
        diff = 1
      else:
        diff = 0

      # Create the dictionary to create a clientData
      for header in columns_names:
        c_data[header] = [values[i][header].values for i in range(0, len(values)-diff)]
      dataset_dict[uid] = c_data

  return dataset_dict
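
Taken together, `split_data` and `create_clients_dict` cut each user's records into consecutive chunks of N and discard a trailing chunk that is shorter than N. A minimal sketch of that behavior on a plain list, where `full_chunks` and the sample data are hypothetical:

```python
def full_chunks(sequence, n):
    """Cut a sequence into consecutive chunks of length n, dropping a short tail."""
    chunks = [sequence[i:i + n] for i in range(0, len(sequence), n)]
    if chunks and len(chunks[-1]) < n:
        chunks = chunks[:-1]
    return chunks

# Hypothetical user with 40 records and a chunk length of 17.
visits = list(range(40))
chunks = full_chunks(visits, 17)
print(len(chunks), [len(chunk) for chunk in chunks])
```

Here the last six records are dropped, so users with many records lose proportionally less data than users whose record count is just above a multiple of N.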

In [12]:
# preprocess dataset to tf format
def preprocess(dataset, N):

  def batch_format_fn(element):

    x=collections.OrderedDict()

    for name in columns_names:
      x[name]=tf.reshape(element[name][:, :-1], [-1, N-1])

    y=tf.reshape(element[columns_names[0]][:, 1:], [-1, N-1])

    return collections.OrderedDict(x=x, y=y)

  return dataset.repeat(NUM_EPOCHS).batch(BATCH_SIZE, drop_remainder=True).map(batch_format_fn).prefetch(PREFETCH_BUFFER)
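
The slicing in `batch_format_fn` turns each chunk into a next-step prediction task: the inputs are all elements but the last, and the targets are the same sequence shifted by one position. A minimal sketch on a plain Python list, where `to_inputs_and_targets` and the sample values are hypothetical:

```python
def to_inputs_and_targets(chunk):
    """Return (inputs, targets) where targets[i] is the element after inputs[i]."""
    return chunk[:-1], chunk[1:]

# Hypothetical chunk of location category ids for one user.
chunk = [5, 2, 2, 9, 4]
inputs, targets = to_inputs_and_targets(chunk)
print(inputs)
print(targets)
```

A chunk of length N therefore yields N-1 training positions, which is why the model inputs are declared with shape `N-1`.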

In [13]:
# create federated data for every client
def make_federated_data(client_data, client_ids, N):

  return [
      preprocess(client_data.create_tf_dataset_for_client(x), N)
      for x in tqdm(client_ids)
  ]

In [14]:
# Create a model
def create_keras_model(number_of_places, N, batch_size):

  # Shortcut to the layers package
  l = tf.keras.layers

  # List of numeric feature columns to pass to the DenseLayer
  numeric_feature_columns = []

  # Handling numerical columns
  for header in numerical_column_names:
    # Append all the numerical columns defined into the list
    numeric_feature_columns.append(feature_column.numeric_column(header, shape=N-1))

  feature_inputs={}
  for c_name in numerical_column_names:
    feature_inputs[c_name] = tf.keras.Input((N-1,), batch_size=batch_size, name=c_name)

  # We cannot use an array of features as always because we have sequences
  # We have to do one by one in order to match the shape
  num_features = []
  for c_name in numerical_column_names:
    f =  feature_column.numeric_column(c_name, shape=(N-1))
    feature = l.DenseFeatures(f)(feature_inputs)
    feature = tf.expand_dims(feature, -1)
    num_features.append(feature)

  categorical_feature_inputs = []
  categorical_features = []
  for categorical_feature in categorical_columns:  # add batch_size=batch_size in case of stateful GRU
    d = {categorical_feature.feature_name: tf.keras.Input((N-1,), batch_size=batch_size, dtype=tf.dtypes.int32, name=categorical_feature.feature_name)}
    categorical_feature_inputs.append(d)

    one_hot = feature_column.sequence_categorical_column_with_vocabulary_list(categorical_feature.feature_name, [i for i in range(categorical_feature.vocab_size)])

    if categorical_feature.use_embedding:
      # Embed the one-hot encoding
      categorical_features.append(feature_column.embedding_column(one_hot, 64))
    else:
      categorical_features.append(feature_column.indicator_column(one_hot))

  seq_features = []
  for i in range(0, len(categorical_feature_inputs)):
    sequence_features, sequence_length = tf.keras.experimental.SequenceFeatures(categorical_features[i])(categorical_feature_inputs[i])
    seq_features.append(sequence_features)

  input_sequence = l.Concatenate(axis=2)( [] + seq_features + num_features)

  # Rnn
  recurrent = l.GRU(64,
                        batch_size=batch_size, #in case of stateful
                        dropout=0.3,
                        return_sequences=True,
                        stateful=True,
                        recurrent_initializer='glorot_uniform')(input_sequence)


  # Last layer with an output for each place
  dense_1 = layers.Dense(number_of_places)(recurrent)

  # Softmax output layer
  output = l.Softmax()(dense_1)

  # To return the Model, we need to define its inputs and outputs
  # In our case, we need to list all the input layers we have defined
  inputs = list(feature_inputs.values()) + categorical_feature_inputs

  # Return the Model
  return tf.keras.Model(inputs=inputs, outputs=output)

In [15]:
#train and evaluate the model
def train_and_eval_model(vocab_size, n, federated_train_data, federated_val_data, federated_test_data, path='./log/central-test-run'):
  train_logdir = path + '/train'
  val_logdir = path + '/val'
  eval_logdir = path + '/eval'

  train_summary_writer = tf.summary.create_file_writer(train_logdir)
  val_summary_writer = tf.summary.create_file_writer(val_logdir)
  eval_summary_writer = tf.summary.create_file_writer(eval_logdir)

  # Clone the keras_model inside `create_tff_model()`, which TFF will
  # call to produce a new copy of the model inside the graph that it will
  # serialize. Note: we want to construct all the necessary objects we'll need
  # _inside_ this method.
  def create_tff_model():
    # TFF uses an `input_spec` so it knows the types and shapes
    # that your model expects.
    input_spec = federated_train_data[0].element_spec
    keras_model_clone = create_keras_model(vocab_size, n, batch_size=BATCH_SIZE)
    #plot_model(keras_model_clone, 'keras_model_for_fl.png', show_shapes=True)
    tff_model = tff.learning.from_keras_model(
      keras_model_clone,
      input_spec=input_spec,
      loss=tf.keras.losses.SparseCategoricalCrossentropy(),
      metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])
    return tff_model

  # This command builds all the TensorFlow graphs and serializes them:
  fed_avg = tff.learning.build_federated_averaging_process(
    model_fn=create_tff_model,
    client_optimizer_fn=lambda: tf.keras.optimizers.Adam(learning_rate=0.002),
    server_optimizer_fn=lambda: tf.keras.optimizers.Adam(learning_rate=0.06))

  state = fed_avg.initialize()
  evaluation = tff.learning.build_federated_evaluation(model_fn=create_tff_model)

  tolerance = 7
  best_state = 0
  lowest_loss = 100.00
  stop = tolerance

  NUM_ROUNDS = 5
  with train_summary_writer.as_default():
    for round_num in range(1, NUM_ROUNDS + 1):
      log('Round {r}'.format(r=round_num))

      # Uncomment to simulate sparse availability of clients
      # train_data_for_this_round, val_data_for_this_round = sample((federated_train_data, federated_val_data), 20, NUM_CLIENTS)

      state, metrics = fed_avg.next(state, federated_train_data)

      train_metrics = metrics['train']
      log('\tTrain: loss={l:.3f}, accuracy={a:.3f}'.format(l=train_metrics['loss'], a=train_metrics['sparse_categorical_accuracy']))

      val_metrics = evaluation(state.model, federated_val_data)
      log('\tValidation: loss={l:.3f}, accuracy={a:.3f}'.format( l=val_metrics['loss'], a=val_metrics['sparse_categorical_accuracy']))

      # Check for decreasing validation loss
      if lowest_loss > val_metrics['loss']:
        log('\tSaving best model..')
        lowest_loss = val_metrics['loss']
        best_state = state
        stop = tolerance - 1
      else:
        stop = stop - 1
        if stop <= 0:
          log('\tEarly stopping...')
          break

      log(' ')
      log('\twriting..')

      # Iterate across the metrics and write their data
      for name, value in dict(train_metrics).items():
        tf.summary.scalar('epoch_'+name, value, step=round_num)

      with val_summary_writer.as_default():
        for name, value in dict(val_metrics).items():
          tf.summary.scalar('epoch_'+name, value, step=round_num)

  train_summary_writer.close()
  val_summary_writer.close()

  # evaluate over test data
  test_metrics = evaluation(best_state.model, federated_test_data)
  log('\tEvaluation: loss={l:.3f}, accuracy={a:.3f}'.format( l=test_metrics['loss'], a=test_metrics['sparse_categorical_accuracy']))
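
The training loop above keeps the state with the lowest validation loss and stops once the loss has not improved for several rounds. The sketch below shows a conventional patience counter in plain Python. It is an illustration rather than the notebook's exact bookkeeping: the helper name `best_round_with_patience` is hypothetical, and the loss values are the validation losses of the first run logged further down.

```python
def best_round_with_patience(val_losses, patience):
    """Return (round, loss) of the best round, stopping after `patience` bad rounds."""
    best_round, best_loss = None, float("inf")
    rounds_without_improvement = 0

    for round_num, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best model: remember it and reset the counter.
            best_round, best_loss = round_num, loss
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
            if rounds_without_improvement >= patience:
                break  # early stopping

    return best_round, best_loss

val_losses = [3.557, 3.158, 3.210, 3.082, 3.031]
print(best_round_with_patience(val_losses, patience=7))
print(best_round_with_patience(val_losses, patience=1))
```

With a patience of 7 and only 5 rounds, early stopping can never trigger, so the best round is simply the one with the lowest validation loss. A patience of 1 would stop after round 3 and keep the model from round 2.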

Here, the different subsets of the dataset are generated.
Afterwards, the variables `columns_names`, `numerical_column_names` and `categorical_columns` are set accordingly so that only the relevant columns are chosen for the federated train/val/test datasets.
The training is run for all combinations and the results are saved, so they can be compared easily.
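
The column filtering for one run can be sketched as follows. This is a simplified illustration that assumes the column names shown in the data preview. The helper `keep_columns` is hypothetical, and categorical features are represented by plain names instead of `CategoricalFeature` objects.

```python
numeric_columns = ["latitude", "longitude", "clock_sin", "clock_cos", "day_sin",
                   "day_cos", "month_sin", "month_cos", "week_day_sin", "week_day_cos"]
categorical_columns = ["user_id", "cat_id", "venue_id", "orig_cat_id"]

def keep_columns(columns, excluded):
    """Return the columns that are not excluded, preserving their order."""
    return [column for column in columns if column not in excluded]

# One run of the search: drop the coordinates and the venue id.
excluded = ["latitude", "longitude", "venue_id"]
run_numeric = keep_columns(numeric_columns, excluded)
run_categorical = keep_columns(categorical_columns, excluded)
print(run_numeric)
print(run_categorical)
```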

In [16]:
class CategoricalFeature:
  def __init__(self, feature_name, vocab_size, use_embedding):
    self.feature_name = feature_name
    self.vocab_size = vocab_size
    self.use_embedding = use_embedding

In [17]:
vocab_size = df.cat_id.unique().size
users_size = df.user_id.unique().size
venues_size = df.venue_id.unique().size
orig_cats_size = df.orig_cat_id.unique().size

In [18]:
all_num_column_names = ['latitude', 'longitude', 'clock_sin', 'clock_cos', 'day_sin', 'day_cos', 'month_sin',
                          'month_cos', 'week_day_sin', 'week_day_cos']

all_cat_columns = [
      CategoricalFeature('user_id', users_size, False),
      CategoricalFeature('cat_id', vocab_size, False),
      CategoricalFeature('venue_id', venues_size, False),
      CategoricalFeature('orig_cat_id', orig_cats_size, False)]

NUM_CLIENTS = user_ids.size
NUM_EPOCHS = 4
BATCH_SIZE = 16
#SHUFFLE_BUFFER = 100
PREFETCH_BUFFER = 5
n=17

In [19]:
drop_columns = [['user_id'], ['latitude', 'longitude'], ['is_weekend'], ['venue_id'], ['orig_cat_id']]

for L in range(1, len(drop_columns) + 1):
  for subset in itertools.combinations(drop_columns, L):

    cols = [item for sub_list in subset for item in sub_list]

    log('START Run'.center(80, '*'))
    log('Excluded columns: {c}'.format(c=cols))

    #train_dropped = df_train.drop(cols, axis=1, inplace=False)
    #val_dropped = df_val.drop(cols, axis=1, inplace=False)
    #test_dropped = df_test.drop(cols, axis=1, inplace=False)

    columns_names = [i for i in df_train.columns.values if i not in cols]

    numerical_column_names = [i for i in all_num_column_names if i not in cols]

    categorical_columns = [i for i in all_cat_columns if i.feature_name not in cols]

    df_dict = split_data(n, df_train, df_val, df_test)

    clients_train_dict = create_clients_dict(df_dict, 'train', n)
    clients_val_dict = create_clients_dict(df_dict, 'val', n)
    clients_test_dict = create_clients_dict(df_dict, 'test', n)

    # Convert the dictionary to a dataset
    client_train_data = tff.simulation.FromTensorSlicesClientData(clients_train_dict)
    client_val_data = tff.simulation.FromTensorSlicesClientData(clients_val_dict)
    client_test_data = tff.simulation.FromTensorSlicesClientData(clients_test_dict)

    example_dataset = client_train_data.create_tf_dataset_for_client(
    client_train_data.client_ids[1])

    example_element = next(iter(example_dataset))

    # Select the clients
    sample_clients = client_train_data.client_ids[0:NUM_CLIENTS]

    # Federate the clients datasets
    federated_train_data = make_federated_data(client_train_data, sample_clients, n)
    federated_val_data = make_federated_data(client_val_data, sample_clients, n)
    federated_test_data = make_federated_data(client_test_data, sample_clients, n)

    train_and_eval_model(vocab_size, n, federated_train_data, federated_val_data, federated_test_data, path='./log/central-test-run')

    log('END Run'.center(80, '*'))

***********************************START Run************************************
Excluded columns: ['user_id']


100%|██████████| 100/100 [00:00<00:00, 434.78it/s]
100%|██████████| 100/100 [00:00<00:00, 144.40it/s]
100%|██████████| 100/100 [00:00<00:00, 908.83it/s]
100%|██████████| 100/100 [00:00<00:00, 416.79it/s]
100%|██████████| 100/100 [00:03<00:00, 28.45it/s]
100%|██████████| 100/100 [00:03<00:00, 30.74it/s]
100%|██████████| 100/100 [00:03<00:00, 29.44it/s]


Round 1
	Train: loss=2.756, accuracy=0.246
	Validation: loss=3.557, accuracy=0.174
	Saving best model..
 
	writing..
Round 2
	Train: loss=3.180, accuracy=0.218
	Validation: loss=3.158, accuracy=0.071
	Saving best model..
 
	writing..
Round 3
	Train: loss=3.222, accuracy=0.090
	Validation: loss=3.210, accuracy=0.092
 
	writing..
Round 4
	Train: loss=2.913, accuracy=0.179
	Validation: loss=3.082, accuracy=0.169
	Saving best model..
 
	writing..
Round 5
	Train: loss=2.758, accuracy=0.234
	Validation: loss=3.031, accuracy=0.172
	Saving best model..
 
	writing..
	Evaluation: loss=3.113, accuracy=0.146
************************************END Run*************************************
***********************************START Run************************************
Excluded columns: ['latitude', 'longitude']


100%|██████████| 100/100 [00:00<00:00, 491.18it/s]
100%|██████████| 100/100 [00:00<00:00, 235.05it/s]
100%|██████████| 100/100 [00:00<00:00, 187.12it/s]
100%|██████████| 100/100 [00:00<00:00, 779.14it/s]
100%|██████████| 100/100 [00:03<00:00, 31.11it/s]
100%|██████████| 100/100 [00:03<00:00, 29.03it/s]
100%|██████████| 100/100 [00:03<00:00, 32.87it/s]


Round 1
	Train: loss=3.043, accuracy=0.211
	Validation: loss=4.064, accuracy=0.176
	Saving best model..
 
	writing..
Round 2
	Train: loss=3.514, accuracy=0.213
	Validation: loss=3.413, accuracy=0.093
	Saving best model..
 
	writing..
Round 3
	Train: loss=2.995, accuracy=0.171
	Validation: loss=3.243, accuracy=0.089
	Saving best model..
 
	writing..
Round 4
	Train: loss=2.717, accuracy=0.227
	Validation: loss=2.932, accuracy=0.210
	Saving best model..
 
	writing..
Round 5
	Train: loss=2.585, accuracy=0.282
	Validation: loss=2.781, accuracy=0.209
	Saving best model..
 
	writing..
	Evaluation: loss=2.915, accuracy=0.177
************************************END Run*************************************
***********************************START Run************************************
Excluded columns: ['is_weekend']


100%|██████████| 100/100 [00:00<00:00, 434.07it/s]
100%|██████████| 100/100 [00:01<00:00, 96.30it/s]
100%|██████████| 100/100 [00:00<00:00, 919.50it/s]
100%|██████████| 100/100 [00:00<00:00, 712.46it/s]
100%|██████████| 100/100 [00:03<00:00, 29.07it/s]
100%|██████████| 100/100 [00:03<00:00, 25.62it/s]
100%|██████████| 100/100 [00:03<00:00, 30.70it/s]


Round 1
	Train: loss=2.814, accuracy=0.229
	Validation: loss=3.846, accuracy=0.149
	Saving best model..
 
	writing..
Round 2
	Train: loss=3.776, accuracy=0.171
	Validation: loss=3.236, accuracy=0.114
	Saving best model..
 
	writing..
Round 3
	Train: loss=3.121, accuracy=0.142
	Validation: loss=3.319, accuracy=0.064
 
	writing..
Round 4
	Train: loss=3.277, accuracy=0.083
	Validation: loss=3.118, accuracy=0.050
	Saving best model..
 
	writing..
Round 5


KeyboardInterrupt: 