# simulating data for classification

To train a neural network for object classification, we need some training data. We've already seen how we can use Scikit-learn to simulate training data for regression problems, but training data for classification problems is different. We need data with one-hot encoded class labels, and it would be nice if the data used for classification was actually correlated with the class labels.

Fortunately for us, scikit-learn has a function, "make_classification", that simulates training data for classification. Better still, make_classification works much like the function we used previously for regression problems.

The following code cell uses the make_classification function to generate 100 data samples for binary classification. 

The "x" variable holds the data that the model will use to infer the class labels. In classification, these input data are typically called "features". A classification network uses the feature data to infer the class label of each data sample.

The "y" variable holds the class labels.

In this case, we are simulating data with 10 features (n_features=10) and 2 possible class labels (n_classes=2). As before, we set the random_state option for reproducibility.

After simulating the data, we print the simulated features and the class labels, just to have a look at them. We also print the shapes of the feature data and the class labels.

In [None]:
import sklearn.datasets

x,y = sklearn.datasets.make_classification(n_samples=100,
                                           n_features=10,
                                           n_classes=2,
                                           random_state=77317)

print(x)
print(y)
print(x.shape,y.shape)

There is quite a bit of feature data in the output; 100 data samples of 10 features each, to be precise. You can see that the features are floating-point numbers.

The binary class labels (there are 100 of them, of course, one for each data sample) are either 0 or 1.

So, it looks like we have some data we can use for binary classification. Of course, we'll need to package these data into a tensorflow Dataset object to train our network, but we can deal with that later.

# binary classification

To do binary classification, we need a neural network. We are free to build any network we want, but the input and output shapes of the network must match the shape of the features and the shape of the class labels, respectively.

Let's look at the data to determine the input and output shapes required.

The feature data "x" has shape (100,10). There are 100 data samples (we can ignore this, it will be handled by the batch dimension in tensorflow), and there are 10 features per data sample. So, our feature data are 10-dimensional, and we'll need our network to be able to handle input vectors (rank-1 tensors) in 10 dimensions. This can be done by specifying:

    input_shape=[10]

in the first layer of our network.

The class label data "y" has shape (100,); it is basically scalar data, which can be treated as a rank-1 tensor (aka, vector) of 1 dimension in tensorflow. This means we will need a single neuron in our network's output layer.

For a binary classification problem, we'll need to specify sigmoid activation for the output layer, so that the single output can be read as the probability of class 1. In tensorflow, sigmoid activation is implemented by the function tf.keras.activations.sigmoid, so we can specify sigmoid activation using the option:

    activation=tf.keras.activations.sigmoid

when we create the network's output layer.
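To make the activation concrete, here is a minimal sketch of the sigmoid function in plain Python. The helper name `sigmoid` and the sample inputs are chosen just for illustration. This is not how tensorflow implements it internally, only the same formula written out.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative pre-activation values (weighted sum of the features plus bias).
for z in (-4.0, 0.0, 4.0):
    print(z, round(sigmoid(z), 3))
```

Large negative inputs map close to 0, large positive inputs map close to 1, and an input of 0 maps to exactly 0.5.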

We'll also need to specify cross-entropy loss when we compile our model. Tensorflow implements cross-entropy loss for binary classification using a tf.keras.losses.BinaryCrossentropy object, so we can specify binary cross-entropy loss using the option:

    loss=tf.keras.losses.BinaryCrossentropy()
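For intuition about what this loss measures, the sketch below computes mean binary cross-entropy by hand in plain Python. The function name and the example labels and probabilities are made up for illustration, and the sketch assumes every predicted probability lies strictly between 0 and 1.

```python
import math

def binary_cross_entropy(y_true, y_prob):
    """Mean binary cross-entropy over a batch of labels and predicted probabilities."""
    losses = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for y, p in zip(y_true, y_prob)
    ]
    return sum(losses) / len(losses)

labels = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]   # predictions that mostly agree with the labels
uncertain = [0.5, 0.5, 0.5, 0.5]   # predictions that carry no information

print(binary_cross_entropy(labels, confident))
print(binary_cross_entropy(labels, uncertain))
```

The confident, correct predictions give a lower loss than the uninformative ones, which is the behavior the optimizer exploits during training.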

Pretty simple so far. We can use the same optimizer for classification problems as we did for regression problems. And batch training is the same whether the problem is regression or classification.

For classification problems, we often want to know how often our model predicts the correct class label during the training process. We can see this information by adding the option:

    metrics=[tf.keras.metrics.BinaryAccuracy()]

when we call model.compile(...). With this option set, tensorflow will report the proportion of training samples correctly classified as training progresses. Ideally, we'd like model accuracy to be very close to 1.0 by the end of the training process.
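As a rough sketch of what the accuracy number means, the plain-Python helper below thresholds predicted probabilities and counts the matches. The helper name and sample values are hypothetical, and the 0.5 threshold is an assumption (I believe it matches the default of tf.keras.metrics.BinaryAccuracy).

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the true label."""
    predicted = [1 if p > threshold else 0 for p in y_prob]
    correct = sum(int(t == p) for t, p in zip(y_true, predicted))
    return correct / len(y_true)

labels = [1, 0, 1, 0, 1]
probs = [0.9, 0.2, 0.4, 0.1, 0.7]   # the third sample is misclassified

print(binary_accuracy(labels, probs))   # 0.8
```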

Using this information, let's build a simple *linear* classifier for our data in tensorflow.

We'll include the data simulation code in the following code cell, just for completeness. We then need to package the data into a tensorflow Dataset object, build our neural network classifier, compile the network using the appropriate loss function, and train the network. In this case, we won't worry about splitting the data into training and validation subsets, although in practice we typically would.

In [None]:
import sklearn.datasets
import tensorflow as tf

# simulate training data
x,y = sklearn.datasets.make_classification(n_samples=100,
                                           n_features=10,
                                           n_classes=2,
                                           random_state=77317)

# package training data into tensorflow Dataset
data = tf.data.Dataset.from_tensor_slices((x,y))

# create and summarize linear neural network classifier
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(units=1, input_shape=[10], activation=tf.keras.activations.sigmoid))
model.summary()

# compile model with loss function and optimizer
model.compile(optimizer=tf.keras.optimizers.SGD(),
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=[tf.keras.metrics.BinaryAccuracy()])

# batch data and train model
data = data.batch(10)
model.fit(data, epochs=100)

From the model summary (scroll back up through the training output), you can see that the linear classifier has 11 trainable parameters: a weight for each of the 10 input features, plus the bias term.

The model's loss is high when training starts, and accuracy is pretty poor, but the loss and accuracy both improve pretty quickly. By the end of the training process, our linear classifier achieves ~0.9 accuracy, indicating that it can correctly classify approximately 90/100 data samples.

Not too shabby for a simple linear model.

# simulating multi-class classification data

Let's extend our binary classification problem to the case in which there are more than two possible classes.

It's relatively easy to use scikit-learn's make_classification function to simulate multi-class classification data; all you need to do is set the n_classes option to the number of classes you'd like to simulate.

In the following code cell, edit the FIXME part to simulate 3 possible classes. You'll notice we added an option:

    n_clusters_per_class=1

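This is needed because, with the default settings (two clusters per class and two informative features), scikit-learn cannot fit 3 classes into the simulated feature space and raises an error. Using one cluster per class avoids the problem.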
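To see why the extra option matters: as I understand the scikit-learn documentation, make_classification requires n_classes * n_clusters_per_class to be no larger than 2 ** n_informative, and both n_clusters_per_class and n_informative default to 2. The hypothetical helper below (`clusters_fit` is not part of scikit-learn) just reproduces that arithmetic.

```python
def clusters_fit(n_classes, n_clusters_per_class=2, n_informative=2):
    """Check the (assumed) make_classification constraint on classes and clusters."""
    return n_classes * n_clusters_per_class <= 2 ** n_informative

print(clusters_fit(2))                           # True:  2 * 2 <= 4
print(clusters_fit(3))                           # False: 3 * 2 >  4
print(clusters_fit(3, n_clusters_per_class=1))   # True:  3 * 1 <= 4
```

Raising n_informative would be another way to satisfy the constraint, but setting n_clusters_per_class=1 keeps the rest of the simulation unchanged.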

In [None]:
import sklearn.datasets

x,y = sklearn.datasets.make_classification(n_samples=100,
                                           n_features=10,
                                           n_classes=FIXME,
                                           n_clusters_per_class=1,
                                           random_state=77317)

print(x)
print(y)
print(x.shape,y.shape)

If everything works out, you should see feature data (x) similar to what we had before. The class label data (y) might look a bit strange, though!

Notice that the class labels appear to be integers: 0, 1, 2. These are definitely *not* one-hot encoded!

Not to worry, we'll deal with one-hot encoding in tensorflow.
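For reference, here is a small sketch of what one-hot encoding does to integer class labels. It uses plain Python, and the `one_hot` helper and the sample labels are made up for illustration.

```python
def one_hot(labels, n_classes):
    """Turn integer class labels into one-hot rows of length n_classes."""
    return [[1 if k == label else 0 for k in range(n_classes)] for label in labels]

integer_labels = [0, 2, 1, 2]
for label, row in zip(integer_labels, one_hot(integer_labels, 3)):
    print(label, row)
```

Each row contains a single 1, in the position given by the integer label.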

# multi-class classification

Now that we have some 3-class data, we need to build a neural network classifier to classify it.

Notice that the shape of the feature data (x) is the same as before, so we can use the same input_shape option for our network's input layer.

The class label data (y) needs some special attention. Although the shape of the class label data is the same as before, there are now 3 possible classes, rather than 2. *If* the class label data were one-hot encoded, we'd need 3 output neurons, one for each of the possible classes. The *actual* class label data are *not* one-hot encoded, but we're going to allow tensorflow to internally *convert* the data to one-hot encoding, so we need to design our network's output layer *as if* the data were one-hot encoded. That means we need 3 neurons in the output layer.

We'll need to use softmax activation in the output layer, as well. Tensorflow implements softmax activation as the function tf.keras.activations.softmax, so we can specify:

    activation=tf.keras.activations.softmax

when we create the output layer, in order to use softmax activation.
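As a quick illustration of what softmax does to the three output values, here is a minimal plain-Python sketch. The helper name and the example scores are hypothetical, and subtracting the maximum score is a common numerical-stability trick rather than anything specific to this tutorial.

```python
import math

def softmax(scores):
    """Convert raw output scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]   # shift by the max for stability
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])   # one score per output neuron
print(probs, sum(probs))
```

The largest score receives the largest probability, and the three probabilities always add up to 1, so they can be read as the model's belief in each class.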

We'll use the loss function to calculate cross-entropy loss *and* convert the class labels to one-hot encoding, all in one step.

To do this, we'll use a tf.keras.losses.SparseCategoricalCrossentropy object. This object accepts the integer-encoded class labels (y) directly ("Sparse"), treating each label as if it were the corresponding one-hot vector, and calculates multi-class cross-entropy loss ("CategoricalCrossentropy"). We'll just need to specify:

    loss=tf.keras.losses.SparseCategoricalCrossentropy()

when we compile the model.
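The sketch below, in plain Python with made-up labels and probabilities, suggests why integer labels are enough: picking out the predicted probability of the true class gives the same loss as the full one-hot calculation. Both helper functions are illustrative and are not tensorflow's implementation.

```python
import math

def sparse_categorical_cross_entropy(labels, probs):
    """Mean loss using integer labels: -log of the true class's probability."""
    return sum(-math.log(p[label]) for label, p in zip(labels, probs)) / len(labels)

def categorical_cross_entropy(one_hot_labels, probs):
    """Mean loss using one-hot labels: -sum over classes of target * log(prob)."""
    losses = [
        -sum(t * math.log(p) for t, p in zip(target, row))
        for target, row in zip(one_hot_labels, probs)
    ]
    return sum(losses) / len(losses)

labels = [0, 2]
one_hot_labels = [[1, 0, 0], [0, 0, 1]]
probs = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]   # one row of softmax outputs per sample

print(sparse_categorical_cross_entropy(labels, probs))
print(categorical_cross_entropy(one_hot_labels, probs))
```

The two printed values agree, because the zeros in each one-hot vector remove every term except the one for the true class.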

Other than those changes, the rest of the model construction and training is pretty much the same. We do have to change the accuracy metric from tf.keras.metrics.BinaryAccuracy to tf.keras.metrics.SparseCategoricalAccuracy, to handle >2 possible classes (see line #22).

The following code cell implements multi-class data simulation, builds a model, compiles it, and executes batch training. Edit the FIXME portions to simulate 3-class data (line #7), use softmax activation (line #16) and train using sparse categorical cross-entropy loss (line #21).

In [None]:
import sklearn.datasets
import tensorflow as tf

# simulate training data
x,y = sklearn.datasets.make_classification(n_samples=100,
                                           n_features=10,
                                           n_classes=FIXME,
                                           n_clusters_per_class=1,
                                           random_state=77317)

# package training data into tensorflow Dataset
data = tf.data.Dataset.from_tensor_slices((x,y))

# create and summarize linear neural network classifier
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(units=3, input_shape=[10], activation=FIXME))
model.summary()

# compile model with loss function and optimizer
model.compile(optimizer=tf.keras.optimizers.SGD(),
              loss=FIXME,
              metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])

# batch data and train model
data = data.batch(10)
model.fit(data, epochs=100)

The model now has 33 trainable parameters, 11 for *each* of the 3 output neurons. When I ran the previous code cell, the model's loss stabilized to around 0.44, and the accuracy plateaued at around 0.87. Not too bad.

Let's see if we can 'improve' our model's accuracy on the training data by adding a few more layers *before* the output layer.

Edit the following code cell to include 2 new Dense neural-network layers, each with 8 units and ReLU activation. Remember that you'll need to specify the input_shape of the *first* layer in the network. Notice that we've also increased the training run to 300 epochs, in order to better fit the model's additional parameters.

In [None]:
import sklearn.datasets
import tensorflow as tf

# simulate training data
x,y = sklearn.datasets.make_classification(n_samples=100,
                                           n_features=10,
                                           n_classes=FIXME,
                                           n_clusters_per_class=1,
                                           random_state=77317)

# package training data into tensorflow Dataset
data = tf.data.Dataset.from_tensor_slices((x,y))

# create and summarize neural network classifier with hidden layers
model = tf.keras.models.Sequential()
FIXME
model.add(tf.keras.layers.Dense(units=3, activation=FIXME))
model.summary()

# compile model with loss function and optimizer
model.compile(optimizer=tf.keras.optimizers.SGD(),
              loss=FIXME,
              metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])

# batch data and train model
data = data.batch(10)
model.fit(data, epochs=300)

This model has 187 trainable parameters, more than the number of data samples used to train the model! This should make us highly suspicious that our model might be overfitting the training data!

Nonetheless, this model should achieve fairly high accuracy on the training data.
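The parameter counts quoted in this notebook can be checked by hand: each Dense layer has one weight per input-output pair plus one bias per output. The helper below is a hypothetical convenience function, not part of tensorflow.

```python
def dense_parameter_count(layer_sizes):
    """Total weights and biases for a stack of Dense layers.

    layer_sizes lists the input width followed by the units in each layer.
    """
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )

print(dense_parameter_count([10, 1]))         # 11:  binary linear classifier
print(dense_parameter_count([10, 3]))         # 33:  3-class linear classifier
print(dense_parameter_count([10, 8, 8, 3]))   # 187: two hidden layers of 8 units
```

These totals should match the "Trainable params" line printed by model.summary() for each of the three models.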

After completing the quiz, I'll leave it as a do-on-your-own exercise to:

1. simulate a larger data set, and see if your model still achieves high accuracy, and
2. implement a train-validate split and assess model overfitting.
