In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# **TensorFlow tutorial 1 (Estimator API)**

This series of notebooks will progressively explore the TensorFlow library, focusing primarily on scalable and distributed training and prediction. This first notebook explores the Estimator API on a toy dataset. We will first build a simple baseline experiment with Keras, using pandas and NumPy for preprocessing, and then use TensorFlow to build a more production-ready, scalable end-to-end model pipeline by migrating those steps to TensorFlow operations. Our final prediction model will be a neural network, so we will not focus on traditional machine learning algorithms.

In [None]:
%%bash
cd ../input/mushroom-classification/
ls


The `%%bash` cell magic lets us run shell commands to locate our datasets and other files, and we will use it often.

In [None]:
import pandas as pd
import numpy as np
# Formatting option to let pandas display all columns in the notebook
pd.set_option('display.max_columns', 500)
data = pd.read_csv('../input/mushroom-classification/mushrooms.csv')

The data is loaded, so let's explore it a bit and do some preprocessing if required.

In [None]:
data.info()

In [None]:
data.describe()

In [None]:
data.head(5)

The data is already mostly complete, so it will not require much preprocessing. All of the columns are categorical, so we will use the categorical values to determine the class of the mushroom.

This is a simple binary classification problem where we will try to predict the class of the mushroom, that is, **edible(e)** or **poisonous(p)**.

We will also not do any EDA or feature engineering, as those will be covered in later modules.

Some data points have missing values, and some features carry very little information, so we will need to do a little cleaning of the dataset.

In [None]:
data.columns

In [None]:
X = data.loc[:,['cap-shape', 'cap-surface', 'cap-color', 'odor',
       'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color',
       'stalk-shape', 'stalk-root', 'stalk-surface-above-ring',
       'stalk-surface-below-ring', 'stalk-color-above-ring',
       'stalk-color-below-ring', 'veil-type', 'veil-color', 'ring-number',
       'ring-type', 'spore-print-color', 'habitat']]
Y = data.loc[:,['class']]


Here we keep only the columns that we feel are relevant to our problem. We then preprocess the data and split it into training, evaluation and test sets.

In [None]:
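def preprocess_data(x,y):
    # One-hot encode every categorical column; float32 keeps the output numeric for Keras
    x = pd.get_dummies(x, dtype=np.float32)
    y = pd.get_dummies(y, dtype=np.float32)
    return x,y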

Since all the values are categorical, we have to convert them to a one-hot encoded format. The above function is a pandas implementation of one-hot encoding.
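
To make the encoding concrete, here is a minimal sketch on a hypothetical two-column toy frame (the values are made up for illustration and are not taken from the mushroom data). Each distinct category becomes its own 0/1 column named `<column>_<value>`.

```python
import pandas as pd

# Hypothetical toy frame; values are illustrative only
toy = pd.DataFrame({"odor": ["a", "n", "a"], "habitat": ["g", "g", "u"]})

# One column per (feature, category) pair; dtype=int gives 0/1 instead of booleans
toy_enc = pd.get_dummies(toy, dtype=int)

print(list(toy_enc.columns))
print(toy_enc.values.tolist())
```

The two input columns expand to four output columns here, which is why the 20 selected mushroom features grow to a much wider matrix after encoding.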

In [None]:
x_enc,y_enc = preprocess_data(X,Y)
display(x_enc)
x_shape = x_enc.shape
y_shape = y_enc.shape
print("The shape of x is {} and the shape of y is {}".format(x_enc.shape,y_enc.shape))

In [None]:
def train_test_split(X,Y,train_split = 0.8,eval_split = 0.1):
    np.random.seed(0)
    # Draw one random number per row so that the three masks partition the data without overlap
    r = np.random.random(len(X))
    mask_train = r < train_split
    mask_eval = (r >= train_split) & (r < (train_split + eval_split))
    mask_test = r >= (train_split + eval_split)
    return X[mask_train],X[mask_eval],X[mask_test],Y[mask_train],Y[mask_eval],Y[mask_test]

x_train,x_eval,x_test,y_train,y_eval,y_test = train_test_split(x_enc,y_enc)

print("The number of train samples is {}, the number of eval samples is {}, the number of test samples is {}".format(len(x_train),len(x_eval),len(x_test)))

The above preprocessing steps generate a sparse (mostly zero) one-hot matrix, which we feed to our Keras neural network. To split the data into train, eval and test sets we could use the predefined functions in the scikit-learn API, but writing our own split function gives us an intuition for approaches we will use in future tutorials, especially when working with very large datasets.
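
When writing a split by hand, one property worth checking is that every row ends up in exactly one of the three sets. A minimal sketch, assuming a single random draw per row decides its split (the helper name `split_masks` is hypothetical):

```python
import numpy as np

def split_masks(n_rows, train_split=0.8, eval_split=0.1, seed=0):
    """Return boolean masks (train, eval, test) that partition n_rows rows."""
    rng = np.random.RandomState(seed)
    # A single draw per row: every row lands in exactly one of the three ranges
    r = rng.random_sample(n_rows)
    mask_train = r < train_split
    mask_eval = (r >= train_split) & (r < train_split + eval_split)
    mask_test = r >= train_split + eval_split
    return mask_train, mask_eval, mask_test

m_train, m_eval, m_test = split_masks(1000)
# Each row belongs to exactly one split
assert np.all(m_train.astype(int) + m_eval.astype(int) + m_test.astype(int) == 1)
print(m_train.sum(), m_eval.sum(), m_test.sum())
```

Drawing a fresh set of random numbers for each mask would not give this guarantee, because the same row could then fall into more than one set or into none.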

In [None]:
import tensorflow as tf

model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Input(shape = (109,)))
model.add(tf.keras.layers.Dense(1))
model.add(tf.keras.layers.Dense(2,activation = tf.keras.activations.softmax))
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate = 0.01),loss=tf.keras.losses.categorical_crossentropy,metrics=['acc'])
model.summary()

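Despite having two `Dense` layers, this is a very simple linear model: a single neuron with no activation feeds a two-unit softmax output, so it behaves much like logistic regression rather than a deep network.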

In [None]:
model.fit(x_train,y_train,batch_size=10,epochs=10, validation_data=(x_eval, y_eval))

The one-hot input is very sparse, which is wasteful because most of the memory holds zeros that carry no information. Still, even this simple linear model reaches very high accuracy.
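
To get a feel for the memory cost, the sketch below compares a dense one-hot matrix with the equivalent integer codes for one categorical column. The sizes (1000 rows, 9 categories) are hypothetical and chosen only for illustration.

```python
import numpy as np

n_rows, n_categories = 1000, 9  # hypothetical sizes

rng = np.random.RandomState(0)
# One small integer per row is enough to identify the category
codes = rng.randint(0, n_categories, size=n_rows).astype(np.int8)

# Dense one-hot: one float32 per (row, category), all zero except a single 1 per row
one_hot = np.zeros((n_rows, n_categories), dtype=np.float32)
one_hot[np.arange(n_rows), codes] = 1.0

print("integer codes:", codes.nbytes, "bytes")
print("dense one-hot:", one_hot.nbytes, "bytes")
```

With these sizes the one-hot matrix needs 36 times the memory of the integer codes, and the ratio grows with the number of categories.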

Let's test the model and evaluate its performance.

In [None]:
model.evaluate(x_test,y_test)

A simple linear model is enough to give us high accuracy on this problem, so it is sufficient here. However, we went through a set of preprocessing steps that we would like to embed in the model graph itself. This will enable us to deploy the model in production and serve it as is, with minimal external preprocessing required. So in the next part of the notebook we will replicate the existing code using TensorFlow operations as much as possible.

From here onwards we will use the Estimator API of TensorFlow and perform all the above steps using TensorFlow functions. The above code is fine for experimentation, but when we want to productionise our model pipeline we must make sure that the ETL, preprocessing, model training and serving steps are as performant as possible. TensorFlow operations are implemented in C++, so they generally perform much better than equivalent pure Python code. The preprocessing steps also become part of the TensorFlow graph itself, so they can run on a variety of devices such as GPUs, TPUs or mobile devices with minimal additional preprocessing required.

So let's get started.

In [None]:
features = [
        tf.feature_column.categorical_column_with_vocabulary_list("cap-shape",['a','b']),
        tf.feature_column.categorical_column_with_vocabulary_list("cap-surface",['a','b']),
        tf.feature_column.categorical_column_with_vocabulary_list("cap-color",['a','b']),
        tf.feature_column.categorical_column_with_vocabulary_list("odor",['a','b']),
        tf.feature_column.categorical_column_with_vocabulary_list("gill-attachment",['a','b']),
        tf.feature_column.categorical_column_with_vocabulary_list("gill-spacing",['a','b']),
        tf.feature_column.categorical_column_with_vocabulary_list("gill-size",['a','b']),
        tf.feature_column.categorical_column_with_vocabulary_list("gill-color",['a','b']),
        tf.feature_column.categorical_column_with_vocabulary_list("stalk-shape",['a','b']),
        tf.feature_column.categorical_column_with_vocabulary_list("stalk-root",['a','b']),
        tf.feature_column.categorical_column_with_vocabulary_list("stalk-surface-above-ring",['a','b']),
        tf.feature_column.categorical_column_with_vocabulary_list("stalk-surface-below-ring",['a','b']),
        tf.feature_column.categorical_column_with_vocabulary_list("stalk-color-above-ring",['a','b']),
        tf.feature_column.categorical_column_with_vocabulary_list("stalk-color-below-ring",['a','b']),
        tf.feature_column.categorical_column_with_vocabulary_list("veil-type",['a','b']),
        tf.feature_column.categorical_column_with_vocabulary_list("veil-color",['a','b']),
        tf.feature_column.categorical_column_with_vocabulary_list("ring-number",['a','b']),
        tf.feature_column.categorical_column_with_vocabulary_list("ring-type",['a','b']),
        tf.feature_column.categorical_column_with_vocabulary_list("spore-print-color",['a','b']),
        tf.feature_column.categorical_column_with_vocabulary_list("habitat",['a','b'])
]

features

Before we feed data to the TensorFlow graph, we need to define the feature columns and their types for the model. The above code is a dummy example with placeholder vocabularies and is quite tedious to write, so let's create a function that does it for us.

In [None]:
def generate_feature_columns():
    features = []
    for item in X:
        col_name = item
        # Use the distinct values observed in the column as its vocabulary
        col_classes = list(X[item].unique())
        feat_col = tf.feature_column.categorical_column_with_vocabulary_list(col_name,col_classes)
        # Wrap in an indicator column to obtain a one-hot representation
        one_hot = tf.feature_column.indicator_column(feat_col)
        features.append(one_hot)
    return features
generate_feature_columns()

The above function generates the feature columns that tell the model how to interpret each input. We could add conditions to the function to treat each column differently as required; for smaller datasets this can also be done manually. Next we combine our code into input functions, one each for the train, eval and predict phases. Under the hood TensorFlow will do the preprocessing for us, such as converting the values to their one-hot encoded form.
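
Conceptually, a vocabulary-list column wrapped in an indicator column maps each string to its position in the vocabulary and then to a one-hot vector. The sketch below imitates that idea in plain Python. It is not TensorFlow's implementation; the helper name `one_hot_from_vocabulary` and the vocabulary are hypothetical, and the sketch assumes that values missing from the vocabulary become an all-zero row.

```python
def one_hot_from_vocabulary(values, vocabulary):
    """Map each string to a one-hot list over `vocabulary` (all zeros if unknown)."""
    index = {token: i for i, token in enumerate(vocabulary)}
    rows = []
    for value in values:
        row = [0.0] * len(vocabulary)
        if value in index:
            row[index[value]] = 1.0
        rows.append(row)
    return rows

vocab = ["x", "b", "s"]   # hypothetical vocabulary for one column
batch = ["b", "x", "k"]   # "k" is not in the vocabulary
encoded = one_hot_from_vocabulary(batch, vocab)
print(encoded)
```

Because the vocabulary fixes the position of each category, the same input string always maps to the same output column at training and at serving time.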

Owing to the scalable nature of TensorFlow pipelines, preprocessing is usually expected to be performed by another tool such as BigQuery, Cloud Dataprep or an Apache Beam job. We shall cover these tools in detail in subsequent modules. For now we will split our data into three separate CSV files for train, eval and test.

In [None]:
def train_test_split_write(dataset,train_split = 0.8,eval_split = 0.1):
    np.random.seed(0)
    # Draw one random number per row so that the three files never share a row
    r = np.random.random(len(dataset))
    mask_train = r < train_split
    mask_eval = (r >= train_split) & (r < (train_split + eval_split))
    mask_test = r >= (train_split + eval_split)
    dataset[mask_train].to_csv('../working/mushrooms_train.csv')
    dataset[mask_eval].to_csv('../working/mushrooms_eval.csv')
    dataset[mask_test].to_csv('../working/mushrooms_test.csv')




train_test_split_write(pd.read_csv('../input/mushroom-classification/mushrooms.csv'))

Any data science pipeline consists of the following steps: data extraction, preprocessing, training, evaluation and serving. We will use more advanced tools for extraction and preprocessing in the future, and the output of those pipelines will mostly be a CSV file stored in a known location. Keeping the splits in separate files ensures that the data is never contaminated, and such pipelines also help with distributed ETL.
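
One detail of this hand-off is that `to_csv` writes the row index as the first column by default, so the reader has to treat that column as the index rather than as a feature. A minimal round-trip sketch on a hypothetical toy frame, written to a temporary directory instead of the real working directory:

```python
import os
import tempfile
import pandas as pd

# Illustrative rows only, not real records from the dataset
toy = pd.DataFrame({"class": ["p", "e", "e"], "odor": ["p", "a", "l"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_train.csv")   # hypothetical file name
    toy.to_csv(path)                                # row index is written as the first column
    restored = pd.read_csv(path, index_col=[0])     # read it back as the index

print(list(restored.columns))
print(restored["class"].tolist())
```

Without `index_col=[0]` the restored frame would contain an extra unnamed column holding the old row numbers.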

In [None]:
x_train,x_eval,x_test,y_train,y_eval,y_test = train_test_split(X,Y)
print("The number of train samples is {}, the number of eval samples is {}, the number of test samples is {}".format(len(x_train),len(x_eval),len(x_test)))

In [None]:
CSV_DEFAULTS = ['?' for item in range(24)]
CSV_COLUMN_NAMES = ['class','cap-shape', 'cap-surface', 'cap-color', 'odor',
       'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color',
       'stalk-shape', 'stalk-root', 'stalk-surface-above-ring',
       'stalk-surface-below-ring', 'stalk-color-above-ring',
       'stalk-color-below-ring', 'veil-type', 'veil-color', 'ring-number',
       'ring-type', 'spore-print-color', 'habitat']

def format_label(x):
    if(x=='p'):
        return 1
    else:
        return 0

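def read_dataset(csv_path):  
    dataset = pd.read_csv(csv_path,index_col=[0])
    dataset['class'] = dataset['class'].map(format_label)
    # Features exclude the label column so that the target never appears among the inputs
    feature_names = [name for name in CSV_COLUMN_NAMES if name != 'class']
    dataset = tf.data.Dataset.from_tensor_slices((dict(dataset[feature_names]),dataset['class']))
    return dataset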

def train_input_funnction(batch_size,epochs = 10):
    dataset = read_dataset('../working/mushrooms_train.csv')
    dataset = dataset.shuffle(buffer_size=6600).repeat(count=epochs).batch(batch_size)
    return dataset
    
def eval_input_funnction(batch_size,epochs = 10):
    dataset = read_dataset('../working/mushrooms_eval.csv')
    dataset = dataset.shuffle(buffer_size=810).repeat(count=epochs).batch(batch_size)
    return dataset

def predict_input_funnction(batch_size,epochs = 10):
    dataset = read_dataset('../working/mushrooms_test.csv')
    dataset = dataset.shuffle(buffer_size=880).repeat(count=epochs).batch(batch_size)
    return dataset

The above functions are the input functions; they read each CSV file and handle the shuffling, repeating and batching of the data.
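
The order shuffle, then repeat, then batch matters. The plain-Python sketch below imitates those semantics under two assumptions: the shuffle buffer is at least as large as the dataset (so each pass is fully shuffled), and the data is reshuffled on every pass. It builds the whole stream in memory, unlike `tf.data`, which streams lazily. The helper name `shuffle_repeat_batch` is hypothetical.

```python
import random

def shuffle_repeat_batch(rows, batch_size, epochs, seed=0):
    """Yield batches from `epochs` reshuffled passes over `rows`."""
    rng = random.Random(seed)
    stream = []
    for _ in range(epochs):
        epoch = list(rows)
        rng.shuffle(epoch)      # fresh order on every pass
        stream.extend(epoch)    # repeat: the passes are concatenated
    # batch: cut the concatenated stream, so a batch may span two passes
    for start in range(0, len(stream), batch_size):
        yield stream[start:start + batch_size]

batches = list(shuffle_repeat_batch(range(5), batch_size=2, epochs=2))
print(len(batches), [len(b) for b in batches])
```

Because batching happens after repeating, five rows over two passes give five full batches of two, with one batch mixing the end of the first pass and the start of the second.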

In [None]:
for item in iter(train_input_funnction(10)):
    feature_layer = tf.keras.layers.DenseFeatures(generate_feature_columns())
    display(item[0])
    display(feature_layer(item[0]))
    break

Our input functions are now created and will serve the dataset that our model consumes, and the code above tests them out. It shows how TensorFlow does one-hot encoding on the fly: we only need to feed the tensors from the dataset to our model along with the appropriate feature column definitions. As the output shows, the individual arrays of strings contained in the input tensors are converted to one-hot encoded form by the `DenseFeatures` layer.

Let's now define our estimator. TensorFlow has several predefined (canned) estimators that let us build standard models very quickly, and here we will use one of them. At a later stage we will also use a custom Keras model as an estimator. Keras models and estimators sometimes have compatibility issues, so we will demonstrate how to use the existing Keras API with estimators to get the best of both worlds.

In [None]:
# model_path = "../models/model_path"

# model = tf.keras.models.Sequential()
# model.add(tf.keras.layers.DenseFeatures(generate_feature_columns()))
# model.add(tf.keras.layers.Dense(1))
# model.add(tf.keras.layers.Dense(1,activation = tf.keras.activations.sigmoid))  # sigmoid, since softmax over a single unit always outputs 1
# model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate = 0.01),loss=tf.keras.losses.binary_crossentropy,metrics=['acc'])

# model = tf.keras.estimator.model_to_estimator(model)

# model.train(input_fn= lambda : train_input_funnction(10),steps = 1000)

In [None]:
# %%bash
# ls
# rm -r models/

The above shell script removes the model's checkpoint files. Estimators resume from the latest checkpoint in the model directory, so the checkpoints must be removed before a fresh training run; otherwise training stops immediately once the saved step count has reached max steps.
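
The same cleanup can be done from Python instead of a shell cell. A minimal sketch, assuming the checkpoints live in a single directory; the helper name `reset_model_dir` is hypothetical, and the demo runs on a throwaway directory rather than the real `models/` folder.

```python
import os
import shutil
import tempfile

def reset_model_dir(model_dir):
    """Delete `model_dir` (if present) and recreate it empty."""
    shutil.rmtree(model_dir, ignore_errors=True)
    os.makedirs(model_dir)

# Demonstrated on a temporary directory with a stand-in checkpoint file
base = tempfile.mkdtemp()
demo_dir = os.path.join(base, "models")
os.makedirs(demo_dir)
open(os.path.join(demo_dir, "checkpoint"), "w").close()

reset_model_dir(demo_dir)
remaining = os.listdir(demo_dir)
print(remaining)

shutil.rmtree(base)  # clean up the throwaway directory
```

Calling such a helper at the top of the training cell makes every run start from step zero.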

The relationship between steps and number of epochs is **steps = train_sample_size/batch_size * epochs**

So steps will be **(6537/10)*10 = 6537**
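
The same arithmetic as a small helper, in case the sample count or batch size changes. The name `steps_for` is hypothetical, and rounding up is an assumption for the case where the total does not divide evenly (a final partial batch still costs one step).

```python
import math

def steps_for(train_samples, batch_size, epochs):
    """Number of batches needed to see `train_samples` rows `epochs` times."""
    return math.ceil(train_samples * epochs / batch_size)

print(steps_for(6537, 10, 10))
```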

So let's start training.

In [None]:
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)
model = tf.estimator.DNNClassifier(
    hidden_units = [1], 
    feature_columns = generate_feature_columns(), 
    model_dir = "models/",
    n_classes = 2,
    config = tf.estimator.RunConfig(tf_random_seed = 1)
)

train_spec = tf.estimator.TrainSpec(input_fn=lambda : train_input_funnction(10), max_steps=6537)
eval_spec = tf.estimator.EvalSpec(input_fn=lambda : eval_input_funnction(10))
tf.estimator.train_and_evaluate(model, train_spec, eval_spec)

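The above code runs training and reports the training and evaluation metrics of the model. The performance is not on par with the previous Keras model. Note that with a batch size of 10, the 6537 steps correspond to the 10 epochs computed above, so the gap is more likely due to differences in setup, such as the estimator's default Adagrad optimizer, than to the amount of training. In the next notebook, as an extension, we will rerun this code and perform distributed training along with proper logging.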