<a href="https://colab.research.google.com/github/victorgtrrz/titanic_survival_estimator/blob/main/flowers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classification example: Flowers

In this notebook we create a model that infers, given certain characteristics, which species a flower belongs to.

In [None]:
import tensorflow as tf
import pandas as pd
import os
import time

def clear_console():
    os.system('cls' if os.name == 'nt' else 'clear')

This specific dataset separates flowers into 3 different species.
<ul>
<li>Setosa</li>
<li>Versicolor</li>
<li>Virginica</li>
</ul>

The information about each flower is the following.
<ul>
<li>Sepal length</li>
<li>Sepal width</li>
<li>Petal length</li>
<li>Petal width</li>
</ul>

In [None]:
CSV_COLUMN_NAMES = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species']
SPECIES = ['SETOSA', 'VERSICOLOR', 'VIRGINICA']

train_path = tf.keras.utils.get_file("iris_training.csv", "https://storage.googleapis.com/download.tensorflow.org/data/iris_training.csv")
test_path = tf.keras.utils.get_file("iris_test.csv", "https://storage.googleapis.com/download.tensorflow.org/data/iris_test.csv")
train = pd.read_csv(train_path, names=CSV_COLUMN_NAMES, header=0)
test = pd.read_csv(test_path, names=CSV_COLUMN_NAMES, header=0)
# Here we use keras (a module inside of TensorFlow) to grab our datasets and read them into a pandas dataframe

train.head()

Unnamed: 0,SepalLength,SepalWidth,PetalLength,PetalWidth,Species
0,6.4,2.8,5.6,2.2,2
1,5.0,2.3,3.3,1.0,1
2,4.9,2.5,4.5,1.7,2
3,4.9,3.1,1.5,0.1,0
4,5.7,3.8,1.7,0.3,0
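
The integer values in the `Species` column are indices into the `SPECIES` list defined above. A small sketch that decodes only the five labels shown in the `head()` output:

```python
SPECIES = ['SETOSA', 'VERSICOLOR', 'VIRGINICA']

# Species labels of the first five training rows, copied from the table above.
head_labels = [2, 1, 2, 0, 0]

# Each integer label is an index into SPECIES.
head_names = [SPECIES[label] for label in head_labels]
print(head_names)
```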


Now we separate the outputs (species) from both datasets and store them in their own variables.

In [None]:
train_y = train.pop('Species')
test_y = test.pop('Species')

Here we check the size of the training dataset.

In [None]:
rows, columns = train.shape
print(f'The dataset contains {rows} rows and {columns} columns')

The dataset contains 120 rows and 4 columns


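The input function here is different from the one in the linear regression model: it shuffles and repeats the data only in training mode, and returns the data in batches of up to `batch_size` examples.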

In [None]:
def input_fn(features, labels, training=True, batch_size=256):
  # Convert the inputs to a Dataset.
  dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))
  # Shuffle and repeat if you are in training mode.
  if training:
    dataset = dataset.shuffle(1000).repeat()
  return dataset.batch(batch_size)
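
To make the effect of `shuffle`, `repeat` and `batch` more concrete, here is a plain-Python analogy. It is only a sketch and does not use `tf.data`: the helper `batches` and the integer stand-in rows are hypothetical, and it assumes the shuffle buffer (1000) is larger than the dataset, so each pass is fully shuffled.

```python
import itertools
import random

def batches(rows, batch_size, training=True, seed=0):
    """Plain-Python analogy of shuffle -> repeat -> batch (not tf.data itself)."""
    rng = random.Random(seed)

    def stream():
        if not training:
            # Evaluation: a single pass in the original order.
            yield from rows
            return
        while True:
            # Training: reshuffle on every pass and never stop.
            epoch = list(rows)
            rng.shuffle(epoch)
            yield from epoch

    it = stream()
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            return
        yield batch

rows = list(range(120))  # stand-in for the 120 training rows

eval_batches = list(batches(rows, batch_size=256, training=False))
train_batches = list(itertools.islice(batches(rows, batch_size=256), 3))

print([len(b) for b in eval_batches])   # [120]: one partial batch
print([len(b) for b in train_batches])  # [256, 256, 256]: batches span several passes
```

Because the dataset has fewer rows than the batch size, every training batch in this sketch contains more than two full passes over the data.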

In [None]:
my_feature_columns = []
for key in train.keys():
  my_feature_columns.append(tf.feature_column.numeric_column(key=key))
print(my_feature_columns)

Instructions for updating:
Use Keras preprocessing layers instead, either directly or via the `tf.keras.utils.FeatureSpace` utility. Each of `tf.feature_column.*` has a functional equivalent in `tf.keras.layers` for feature preprocessing when training a Keras model.


[NumericColumn(key='SepalLength', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), NumericColumn(key='SepalWidth', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), NumericColumn(key='PetalLength', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None), NumericColumn(key='PetalWidth', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)]


# Building the Model
For classification tasks there is a variety of estimators/models that we can pick from within TensorFlow.
Some options are
*   DNNClassifier (Deep Neural Network)
*   LinearClassifier

We can choose either model, but the DNN seems to be the best choice because we may not be able to find a linear correspondence in our data. However, most of the work is in the pre-processing of the data, so it would not be a problem to test different models and see which one best fits the idiosyncrasies of the dataset.
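
As an illustration of why a non-linear model can matter, the sketch below uses the classic XOR toy problem, not the Iris data. It searches a grid of linear decision rules and compares the best one against a tiny network with one hidden layer whose weights are picked by hand. All names and weights here are hypothetical and chosen only for the example.

```python
from itertools import product

# XOR: the label is 1 when exactly one input is 1. No straight line separates the classes.
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def linear_predict(w1, w2, b, x):
    """Linear rule: class 1 if the weighted sum is positive."""
    return 1 if w1 * x[0] + w2 * x[1] + b > 0 else 0

def relu(z):
    return max(0.0, z)

def hidden_predict(x):
    """One hidden layer of two ReLU units with hand-picked weights (illustrative only)."""
    h1 = relu(x[0] + x[1])
    h2 = relu(x[0] + x[1] - 1)
    return 1 if h1 - 2 * h2 > 0.5 else 0

grid = [i * 0.5 for i in range(-6, 7)]  # candidate weights from -3.0 to 3.0
best_linear = max(
    sum(linear_predict(w1, w2, b, x) == y for x, y in XOR)
    for w1, w2, b in product(grid, repeat=3)
)
hidden_correct = sum(hidden_predict(x) == y for x, y in XOR)

print(f"best linear rule: {best_linear}/4 correct")       # 3/4
print(f"hidden-layer rule: {hidden_correct}/4 correct")   # 4/4
```

This says nothing about whether the Iris classes themselves need a non-linear boundary; testing both estimators on the data is the way to find out.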




In [None]:
# Build a DNN with 2 hidden layers of 30 and 10 nodes respectively
classifier = tf.estimator.DNNClassifier(
  feature_columns=my_feature_columns,
  # The layer sizes are arbitrary; we'll discuss later how to choose an appropriate number
  hidden_units=[30, 10],
  # The model must choose between 3 classes.
  n_classes=3)

clear_console()

# A lambda lets us define a function in one line; train() expects a function that takes no arguments.
# Instead of epochs, here we define steps: each step processes one batch and performs one weight update.
tic = time.time()
classifier.train(
  input_fn = lambda: input_fn(train, train_y, training=True),
  steps = 5000)
toc = time.time()  # stop the timer before clearing the console
clear_console()

print(f"Model trained in {(toc-tic):.2f} seconds")



Model trained in 11.10 seconds
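
In the Estimator API one step consumes one batch, so the number of steps can be translated into passes over the data. A quick back-of-the-envelope calculation with the values used in this notebook (the variable names are only for illustration):

```python
steps = 5000        # value passed to classifier.train
batch_size = 256    # default batch size of input_fn
train_rows = 120    # rows in the training set

examples_seen = steps * batch_size
passes_over_data = examples_seen / train_rows

print(f"{examples_seen:,} examples, about {passes_over_data:,.0f} passes over the data")
# prints: 1,280,000 examples, about 10,667 passes over the data
```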


I noticed a high variance in the time it takes to train the model, and I can't explain why.

Now the model is evaluated. The accuracy may vary from run to run because the training data is shuffled, so the model sees it in a different order each time, and because the initial weights are random.

In [None]:
eval_result = classifier.evaluate(input_fn=lambda: input_fn(test, test_y, training=False))
print(f"\nTest set accuracy: {eval_result['accuracy']*100:0.3f}%\n")


Test set accuracy: 83.333%
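
Accuracy is the fraction of test examples whose predicted class matches the true label. The sketch below uses made-up labels, not the notebook's actual predictions, and assumes a test set of 30 rows, which the notebook does not print. Under that assumption, 83.333% corresponds to 25 correct predictions.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the prediction equals the true label."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)

# Hypothetical labels for illustration: 30 examples, 5 of them misclassified.
true_labels = [0, 1, 2] * 10
predicted_labels = list(true_labels)
for i in range(5):
    predicted_labels[i] = (true_labels[i] + 1) % 3  # force a wrong class

acc = accuracy(true_labels, predicted_labels)
print(f"Accuracy: {acc * 100:0.3f}%")  # Accuracy: 83.333%
```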



In [None]:
# Named differently from the training input_fn so that re-running the training cell still works.
def predict_input_fn(features, batch_size = 256):
  # Convert the inputs to a dataset without labels
  return tf.data.Dataset.from_tensor_slices(dict(features)).batch(batch_size)

features = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']
predict = {}

print("Please type numeric values as prompted.")
for feature in features:
  while True:
    val = input(feature + ": ")
    try:
      # Try to convert the input to a float
      float_val = float(val)
      # If the conversion succeeds, leave the while loop
      break
    except ValueError:
      # If the conversion fails, print an error message and ask again
      print("Please enter a valid numeric value.")

  predict[feature] = [float_val]

predictions = classifier.predict(input_fn=lambda: predict_input_fn(predict))

for pred_dict in predictions:
  class_id = pred_dict['class_ids'][0]
  probability = pred_dict['probabilities'][class_id]

  clear_console()

  print(f"Prediction is {SPECIES[class_id]} with {100 * probability:.1f}% probability")

Please type numeric values as prompted.
SepalLength: 1
SepalWidth: 1
PetalLength: 1.2
PetalWidth: 2
Prediction is VIRGINICA with 70.8% probability
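
The `probabilities` and `class_ids` entries used above are related: for a multi-class classifier the probabilities are typically a softmax over the network's raw scores (logits), and the predicted class is the index of the largest probability. A minimal sketch with hypothetical logits, not values taken from the trained model:

```python
import math

SPECIES = ['SETOSA', 'VERSICOLOR', 'VIRGINICA']

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for one flower (illustrative only).
logits = [-1.0, 0.5, 1.5]
probabilities = softmax(logits)

# The predicted class is the index of the largest probability.
class_id = max(range(len(probabilities)), key=probabilities.__getitem__)
print(f"Prediction is {SPECIES[class_id]} with {100 * probabilities[class_id]:.1f}% probability")
```

Note that the model always returns one of the three classes, even for measurements that look nothing like the flowers it was trained on.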
