This notebook applies the Wide & Deep model using the TensorFlow class DNNLinearCombinedClassifier.

![wide and deep model](wide_n_deep.svg)

The combined model consists of a wide model (logistic regression with sparse features and transformations) and a deep model (a feed-forward neural network with several hidden layers). We train the wide and deep parts jointly. At a high level, here are the steps using TensorFlow's estimator API (tf.contrib.learn in this notebook):

1. Load data and create train and test (validation) datasets
2. Specify the names of columns
3. Define which features go to the wide part and which to the deep part of the model
4. Tell TensorFlow which features to use for the wide and deep parts
5. Train and validate the model
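
To make the joint architecture concrete, here is a minimal NumPy sketch of a forward pass. It assumes the formulation in the publication linked below: the wide part is a linear model on the inputs, the deep part is a ReLU feed-forward network, and their logits are summed before a softmax. All names, layer sizes and random weights are illustrative assumptions, not the internals of DNNLinearCombinedClassifier.

```python
import numpy as np

rng = np.random.default_rng(0)

n_examples, n_features, n_classes = 4, 100, 5
hidden_units = [16, 8]  # illustrative sizes, much smaller than in the notebook

x = rng.random((n_examples, n_features))

# Wide part: a single linear layer from features to class logits.
w_wide = rng.normal(scale=0.1, size=(n_features, n_classes))
wide_logits = x @ w_wide

# Deep part: feed-forward network with ReLU hidden layers.
h = x
in_dim = n_features
for out_dim in hidden_units:
    w = rng.normal(scale=0.1, size=(in_dim, out_dim))
    h = np.maximum(h @ w, 0.0)
    in_dim = out_dim
w_out = rng.normal(scale=0.1, size=(in_dim, n_classes))
deep_logits = h @ w_out

# Joint model: the two logits are added, then a softmax gives class probabilities.
bias = np.zeros(n_classes)
logits = wide_logits + deep_logits + bias
shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
```

Because both parts feed one loss through the summed logits, a single gradient step can update the wide and the deep weights together, which is what joint training refers to.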

For more information, see this publication:

https://arxiv.org/abs/1606.07792

TensorFlow tutorial for wide & deep models:

https://www.tensorflow.org/tutorials/wide_and_deep

and our own demo applied to movie ratings:

https://github.com/karthikbharadwaj/strata_SG_2017_recommender_tutorial



In [1]:
from __future__ import print_function  # must come before any other statement in the cell

import pickle

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
%matplotlib inline

### 1. Load data and create train and test (validation) datasets

Load the previously created dataset and split it into training and validation sets.

In [2]:
# Use a context manager so the file handle is closed after loading.
with open("/home/dnikolic/PMI/MentorMenteeDNNTrainData.p", "rb") as f:
    FullSet = pickle.load(f)


def split_dataset(dataset, split_frac=.5):
    dataset = dataset.sample(frac=1, replace=False)
    n_split = int(len(dataset)*split_frac)
    trainset = dataset[:n_split]
    validset = dataset[n_split:]
    return trainset, validset

trainset, validset = split_dataset(FullSet, 0.5)
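
Because `sample(frac=1)` reshuffles the rows on every run, the split above differs between runs. The following is a minimal sketch of a reproducible variant, assuming a fixed seed is acceptable for the experiment. `split_dataset_seeded`, its `seed` parameter and the toy DataFrame are hypothetical and only illustrate the idea.

```python
import pandas as pd


def split_dataset_seeded(dataset, split_frac=0.5, seed=0):
    """Shuffle rows with a fixed seed, then cut into train and validation parts."""
    shuffled = dataset.sample(frac=1, replace=False, random_state=seed)
    n_split = int(len(shuffled) * split_frac)
    return shuffled.iloc[:n_split], shuffled.iloc[n_split:]


# Toy frame with one feature column (0) and a label column (100), made up for the demo.
toy = pd.DataFrame({0: range(10), 100: [1, 2, 3, 4, 5] * 2})
toy_train, toy_valid = split_dataset_seeded(toy, 0.5, seed=42)

# The two parts are disjoint and together cover every row exactly once.
overlap = toy_train.index.intersection(toy_valid.index)
```

Calling the function again with the same seed returns the same rows in each part, which makes accuracy figures comparable across runs.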

In [11]:
validset

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,91,92,93,94,95,96,97,98,99,100
31683,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.022840,2.0
47294,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.252454,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,2.0
52480,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.488726,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,2.0
32485,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,3.0
18005,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1.0
174472,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.099797,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,4.0
164122,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.343909,...,0.000000,0.000000,0.000000,0.000000,0.173855,0.000000,0.000000,0.000000,0.000000,4.0
117049,0.000000,0.891111,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,5.0
69729,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.137832,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,3.0
91139,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.770484,0.000000,...,0.000000,0.000000,0.055353,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,5.0


In [12]:
trainset

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,91,92,93,94,95,96,97,98,99,100
175792,0.014851,0.012048,0.026141,0.027563,0.052779,0.012843,0.000000,0.024989,0.025331,0.025014,...,0.000000,0.000000,0.000000,0.000000,0.546169,0.000000,0.000000,0.000000,0.000000,3.0
115316,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1.0
110153,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.121407,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.105323,4.0
16533,0.178436,0.000000,0.323648,0.000000,0.000000,0.000000,0.000000,0.363630,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,2.0
99362,0.000000,0.000000,0.369340,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,4.0
146826,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.089481,0.000000,...,0.425091,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,2.0
113323,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.203038,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,2.0
192985,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.018765,0.000000,0.000000,0.042890,0.000000,0.000000,2.0
3820,0.032858,0.000000,0.000000,0.000000,0.000000,0.081300,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.036669,0.206192,0.101881,0.000000,0.000000,0.000000,0.153263,0.000000,5.0
29856,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,4.0


### 2. Specify the names of columns



<font color=blue>The assignment of features to the deep and wide parts of the model is flexible; you can try out different combinations. In this case we assigned all features to both the deep and the wide model. In a more elaborate application with a larger number of features (e.g., categorical features), this would not be the case.</font>

In [3]:

TOPIC_MENTOR = []
TOPIC_MENTEE = []
topic_array_len = 50

for i in range(topic_array_len):
    TOPIC_MENTOR.append(str(i))
    TOPIC_MENTEE.append(str(i + topic_array_len))
    
LABEL_COL = "100"

DEEP_COLS = TOPIC_MENTOR + TOPIC_MENTEE 
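
As a quick check of the naming scheme, the sketch below rebuilds the same lists with list comprehensions and confirms that they cover the feature names "0" to "99", leaving "100" for the label. The lowercase names are illustrative. The integer lookup is an assumption based on `make_inputs`, which indexes the DataFrame with `int(col_name)`.

```python
topic_array_len = 50

# Mentor topics use names "0".."49", mentee topics "50".."99".
topic_mentor = [str(i) for i in range(topic_array_len)]
topic_mentee = [str(i + topic_array_len) for i in range(topic_array_len)]
label_col = "100"
deep_cols = topic_mentor + topic_mentee

# Feature keys are strings; the DataFrame columns are assumed to be integers.
column_lookup = {name: int(name) for name in deep_cols}
```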


### 3. Define which features go to the wide part and which to the deep part of the model



Convert the feature and label columns of a DataFrame into TensorFlow variables.

In [1]:
def make_inputs(dataframe):
    """
    Build the model inputs from a pandas dataframe.

    Arguments:
    dataframe -- pandas dataframe containing the values of features and labels.

    Returns:
    feature_inputs -- a dictionary mapping each feature name to a dense float32 variable.
    label_input -- an int32 variable of shape [number of examples] holding class ids 0..4.
    """
    # Feature keys are strings, while the dataframe columns are indexed by integer.
    feature_inputs = {
        col_name: tf.Variable(
            dataframe[int(col_name)].values,
            dtype=tf.float32
        )
        for col_name in DEEP_COLS
    }
    # Labels are stored as 1..5; subtract 1 to obtain class ids 0..4.
    label_input = tf.Variable(tf.cast(dataframe[int(LABEL_COL)].values - 1, dtype=tf.int32))

    return (feature_inputs, label_input)
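
Before wiring the inputs into TensorFlow, it can help to inspect what `make_inputs` would produce. The sketch below mirrors its logic with plain NumPy arrays, under the assumption that the label column holds ratings from 1 to 5. `make_numpy_inputs` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd


def make_numpy_inputs(dataframe, feature_cols, label_col):
    """Mirror make_inputs with NumPy arrays instead of TensorFlow variables."""
    features = {
        name: dataframe[int(name)].to_numpy(dtype=np.float32)
        for name in feature_cols
    }
    # Shift labels from 1..5 to 0..4, the class ids a 5-class classifier expects.
    labels = (dataframe[int(label_col)].to_numpy() - 1).astype(np.int32)
    return features, labels


# Made-up data: two feature columns (0, 1) and the label column (100).
toy = pd.DataFrame({0: [0.0, 0.5, 0.1], 1: [0.2, 0.0, 0.9], 100: [1.0, 3.0, 5.0]})
features, labels = make_numpy_inputs(toy, ["0", "1"], "100")
```

Checking `labels.min()` and `labels.max()` on the real data is a cheap way to confirm that every class id falls in the range the classifier was configured for.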

### 4. Tell TensorFlow which features to use for the wide and deep parts

In our case, all features go into both deep and wide parts of the model.

In [None]:
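def create_wide_input_layers(col_names):
    """
    Create numeric feature columns for the wide (linear) part of the model.

    Arguments:
    col_names -- list of feature names (keys of the feature dictionary).
    """
    wide_input_layers = [
        tf.feature_column.numeric_column(key=cs)
        for cs in col_names
    ]
    return wide_input_layers


def create_deep_input_layers(col_names):
    """
    Create numeric feature columns for the deep (DNN) part of the model.

    Arguments:
    col_names -- list of feature names (keys of the feature dictionary).
    """
    deep_input_layers = [
        tf.feature_column.numeric_column(key=cs)
        for cs in col_names
    ]
    return deep_input_layers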

### 5. Train and validate the model

Here we provide the input features for the deep and wide models, define the number and sizes of the DNN layers, and create the model with tf.contrib.learn.DNNLinearCombinedClassifier. The model is saved in the directory ./model7/.

In [8]:
print("create input layers...", end="")


#hash_columns = make_hash_columns(CAT_STR_COLS)
#int_columns = make_int_columns(CAT_INT_COLS)
#embedding_layers = make_embeddings(hash_columns, int_columns,dim =6)

deep_input_layers = create_deep_input_layers(DEEP_COLS)
wide_input_layers = create_wide_input_layers(DEEP_COLS)


print("done!")
print("create model...", end="")


model = tf.contrib.learn.DNNLinearCombinedClassifier(
    n_classes=5,
    linear_feature_columns = wide_input_layers,
    dnn_feature_columns = deep_input_layers,
    dnn_hidden_units = [1024, 1024, 512],
    fix_global_step_increment_bug=True,  # lets a single fit step update both the linear and the DNN part
    config = tf.contrib.learn.RunConfig(
        keep_checkpoint_max = 1,
        save_summary_steps = 10,
        num_cores = 8,
        #gpu_memory_fraction = 0.9,
        model_dir = "./model7/"
    )
)


print("done!")
print("training model...", end="")
#model.fit(input_fn = lambda: make_inputs(trainset), steps=100)  #steps = 1000
model.fit(input_fn = lambda: make_inputs(trainset), steps=1000)
print("done!")
print("evaluating model on train data...", end="")
resultsTrain = model.evaluate(input_fn = lambda: make_inputs(trainset), steps=1)
print("done!")
print("evaluating model on test data (validation data)...", end="")
resultsTest = model.evaluate(input_fn = lambda: make_inputs(validset), steps=1)
print("done!")
print("calculating predictions...", end="")
predictions = model.predict_classes(input_fn = lambda: make_inputs(validset))
print("done!")
print("calculating probabilities...", end="")
probabilities = model.predict_proba(input_fn = lambda: make_inputs(validset))
print("done!")

create input layers...done!
create model...INFO:tensorflow:Using config: {'_tf_random_seed': None, '_log_step_count_steps': 100, '_keep_checkpoint_every_n_hours': 10000, '_task_id': 0, '_master': '', '_num_ps_replicas': 0, '_evaluation_master': '', '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f5c04908390>, '_keep_checkpoint_max': 1, '_is_chief': True, '_environment': 'local', '_task_type': None, '_session_config': None, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_num_worker_replicas': 0, '_model_dir': './model7/', '_tf_config': intra_op_parallelism_threads: 8
inter_op_parallelism_threads: 8
gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_save_summary_steps': 10}
done!
training model...INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from ./model7/model.ckpt-32000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:

# Performance

In [9]:
for n, r in resultsTest.items():
    print("%s: %s"%(n, r))

global_step: 33000
accuracy: 0.33422
loss: 1.4836442
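
To judge an accuracy of about 0.33 on five classes, it helps to compare it with two trivial baselines: uniform guessing (0.2) and always predicting the most frequent class. A minimal sketch follows. `baseline_accuracies` is a hypothetical helper and the toy labels are made up, so in the notebook one would pass the shifted labels of `validset` instead.

```python
import numpy as np


def baseline_accuracies(labels, n_classes=5):
    """Return (uniform-guess accuracy, majority-class accuracy) for integer labels."""
    labels = np.asarray(labels)
    counts = np.bincount(labels, minlength=n_classes)
    return 1.0 / n_classes, counts.max() / labels.size


toy_labels = np.array([0, 0, 0, 4, 4, 1, 2, 3, 4, 0])  # made-up example
uniform_acc, majority_acc = baseline_accuracies(toy_labels)
```

If the majority-class baseline on the real validation labels is close to the reported accuracy, the model has learned little beyond the class frequencies.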


# Predictions

In [13]:
predict = list(predictions)
predict

[0,
 0,
 4,
 0,
 3,
 1,
 0,
 4,
 0,
 0,
 0,
 0,
 4,
 0,
 0,
 4,
 0,
 4,
 3,
 4,
 0,
 0,
 4,
 0,
 4,
 4,
 4,
 4,
 0,
 1,
 0,
 0,
 0,
 0,
 4,
 0,
 0,
 4,
 0,
 4,
 4,
 2,
 4,
 4,
 0,
 1,
 3,
 0,
 0,
 4,
 0,
 4,
 0,
 4,
 4,
 0,
 4,
 4,
 2,
 0,
 4,
 4,
 4,
 4,
 0,
 4,
 0,
 4,
 0,
 0,
 0,
 2,
 4,
 4,
 0,
 4,
 4,
 4,
 0,
 0,
 4,
 4,
 0,
 4,
 4,
 4,
 0,
 1,
 4,
 0,
 4,
 0,
 0,
 0,
 0,
 0,
 4,
 0,
 4,
 4,
 0,
 4,
 4,
 0,
 0,
 1,
 0,
 0,
 4,
 4,
 4,
 0,
 4,
 4,
 0,
 0,
 4,
 4,
 4,
 0,
 4,
 0,
 4,
 0,
 4,
 4,
 4,
 0,
 4,
 0,
 4,
 0,
 0,
 4,
 4,
 4,
 4,
 4,
 0,
 0,
 0,
 4,
 0,
 4,
 4,
 4,
 4,
 0,
 0,
 4,
 4,
 1,
 4,
 2,
 4,
 4,
 4,
 0,
 4,
 0,
 1,
 4,
 0,
 4,
 0,
 4,
 4,
 0,
 4,
 0,
 0,
 4,
 3,
 4,
 4,
 4,
 4,
 4,
 4,
 1,
 1,
 4,
 1,
 4,
 0,
 2,
 0,
 0,
 4,
 4,
 0,
 4,
 0,
 0,
 4,
 1,
 4,
 4,
 0,
 4,
 0,
 4,
 3,
 4,
 4,
 0,
 0,
 4,
 4,
 4,
 0,
 2,
 4,
 4,
 0,
 0,
 0,
 4,
 4,
 1,
 4,
 0,
 0,
 4,
 0,
 0,
 0,
 4,
 0,
 4,
 4,
 4,
 0,
 4,
 4,
 0,
 1,
 4,
 0,
 0,
 1,
 0,
 4,
 4,
 4,
 4,
 0,
 4,
 0,
 0,
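
The predictions listed here are dominated by classes 0 and 4. One way to quantify this is to count the predictions per class and tabulate them against the true labels. The sketch below uses made-up arrays; `confusion_matrix` is a hypothetical helper, and in the notebook the inputs would be the shifted labels of `validset` and the `predict` list.

```python
import numpy as np


def confusion_matrix(true_labels, predicted, n_classes=5):
    """Count label pairs; rows are true classes, columns are predicted classes."""
    matrix = np.zeros((n_classes, n_classes), dtype=np.int64)
    # np.add.at accumulates correctly even when an index pair repeats.
    np.add.at(matrix, (np.asarray(true_labels), np.asarray(predicted)), 1)
    return matrix


toy_true = np.array([0, 1, 4, 3, 4, 2])  # made-up labels
toy_pred = np.array([0, 0, 4, 4, 4, 0])  # made-up predictions
cm = confusion_matrix(toy_true, toy_pred)
pred_counts = cm.sum(axis=0)  # number of predictions per class
accuracy = np.trace(cm) / cm.sum()
```

Columns of the matrix that stay near zero point to classes the model almost never predicts.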


# Probabilities
Softmax outputs of the five units in the last layer of the model, giving one probability per class for each example.
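
Since each row holds one value per class, the predicted class is the index of the largest entry, and adding 1 maps it back to the original rating scale (the labels were shifted by 1 in `make_inputs`). The sketch below checks this on three rows copied from the output in this section; the variable names are illustrative.

```python
import numpy as np

# First three probability rows copied from the notebook output.
prob_rows = np.array([
    [0.2943454, 0.24289766, 0.1736121, 0.15866765, 0.13047723],
    [0.38689426, 0.23102264, 0.15670401, 0.13618702, 0.08919206],
    [0.14528683, 0.19015527, 0.16112582, 0.20068935, 0.30274278],
], dtype=np.float32)

row_sums = prob_rows.sum(axis=1)          # each row should sum to about 1
class_ids = prob_rows.argmax(axis=1)      # most probable class per example
ratings = class_ids + 1                   # undo the label shift
confidence = prob_rows.max(axis=1)        # probability of the chosen class
```

The resulting class ids agree with the first three entries of the predictions list, and the modest confidence values show how close the competing classes are.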

In [17]:
prob = list(probabilities)
prob 

[array([0.2943454 , 0.24289766, 0.1736121 , 0.15866765, 0.13047723],
       dtype=float32),
 array([0.38689426, 0.23102264, 0.15670401, 0.13618702, 0.08919206],
       dtype=float32),
 array([0.14528683, 0.19015527, 0.16112582, 0.20068935, 0.30274278],
       dtype=float32),
 array([0.49055913, 0.20599757, 0.12812603, 0.11042535, 0.06489193],
       dtype=float32),
 array([0.21655752, 0.14147307, 0.21876092, 0.2658939 , 0.1573146 ],
       dtype=float32),
 array([0.2400289 , 0.27580902, 0.11038407, 0.24409293, 0.12968503],
       dtype=float32),
 array([0.3385826 , 0.21004668, 0.1883696 , 0.15387681, 0.10912427],
       dtype=float32),
 array([0.22964726, 0.16100088, 0.1274193 , 0.16120237, 0.3207302 ],
       dtype=float32),
 array([0.47195938, 0.24782008, 0.11945792, 0.09512348, 0.06563917],
       dtype=float32),
 array([0.3195886 , 0.15885648, 0.14983575, 0.13822657, 0.23349255],
       dtype=float32),
 array([0.24997339, 0.17563586, 0.15375488, 0.1877678 , 0.23286809],
       dtyp