# CHAPTER 3.2

### Training a simple classifier on extracted features

Classical machine learning algorithms are not designed to work with raw image tensors, which prevents them from learning directly from images. By using pre-trained networks as feature extractors, we close this gap and gain access to widely used, battle-tested algorithms such as Logistic Regression, Decision Trees, and Support Vector Machines.

We'll use the features generated in the previous recipe (stored in HDF5 format) to train an image orientation detector. It predicts how many degrees a picture has been rotated, so that the picture can be restored to its original orientation.
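
Once a detector predicts a rotation class, undoing the rotation is a single array operation. The sketch below is illustrative only: it assumes the labels denote counter-clockwise rotation in degrees (the direction is not stated in this recipe), and `restore_orientation` is a hypothetical helper, not part of the recipe.

```python
import numpy as np


def restore_orientation(image, predicted_degrees):
    """Undo a rotation of 0, 90, 180, or 270 degrees.

    Assumption: the image was rotated counter-clockwise by
    `predicted_degrees`, so we rotate it clockwise by the same amount.
    """
    if predicted_degrees not in (0, 90, 180, 270):
        raise ValueError(f"Unsupported rotation: {predicted_degrees}")
    quarter_turns = predicted_degrees // 90
    # A negative k in np.rot90 rotates clockwise.
    return np.rot90(image, k=-quarter_turns)


# A tiny 2x3 "image" stands in for a real picture.
original = np.arange(6).reshape(2, 3)
rotated = np.rot90(original, k=1)  # rotate 90 degrees counter-clockwise
restored = restore_orientation(rotated, 90)
```

If the dataset was built with clockwise rotations instead, the sign of `k` would need to be flipped.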

In [1]:
import pathlib

import h5py
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import classification_report

In [2]:
dataset_path = 'features.hdf5'
db = h5py.File(dataset_path, 'r')

Let's look at what we have in our dataset.

In [3]:
db

<HDF5 file "features.hdf5" (mode r)>

In [4]:
db.keys()

<KeysViewHDF5 ['features', 'label_names', 'labels']>

In [5]:
db['features']

<HDF5 dataset "features": shape (8144, 25088), type "<f8">

In [6]:
db['features'][0]

array([0., 0., 0., ..., 0., 0., 0.])

In [7]:
db['label_names'][0]

b'0'

In [8]:
db['labels'][0]

3
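
The integer stored in `labels` is an index into `label_names`, whose entries are byte strings. A minimal sketch of that lookup follows, with small NumPy arrays standing in for the HDF5 datasets. The order of the names matches the classification report shown later in this recipe; the integer labels and variable names are made up for illustration.

```python
import numpy as np

# Stand-ins for db['label_names'][:] and a few entries of db['labels'].
raw_label_names = np.array([b'0', b'180', b'270', b'90'])
encoded_labels = np.array([3, 1, 0, 2, 3])  # hypothetical values

# Decode the byte strings once, then index by the integer label.
label_names = [name.decode('utf-8') for name in raw_label_names]
decoded = [label_names[i] for i in encoded_labels]
print(decoded)
```

Under this ordering, a label of `3` corresponds to a rotation of 90 degrees.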

In [9]:
db['labels'].shape

(8144,)

Because the full dataset is too large to work with comfortably, we'll use only the first 50% of it.

In [10]:
SUBSET_INDEX = int(db['labels'].shape[0] * 0.5)
print(SUBSET_INDEX)

4072


In [11]:
features = db['features'][:SUBSET_INDEX]
labels = db['labels'][:SUBSET_INDEX]

In [12]:
features.shape

(4072, 25088)
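
Slicing the first half keeps the classes balanced only if the rows were shuffled when the features were written. If that is uncertain, one option is to draw a random subset instead. The sketch below sorts the sampled indices because h5py expects index lists in increasing order. A NumPy array stands in for `db['features']`, and the seed and sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # arbitrary seed for repeatability

num_rows = 20                          # stand-in for db['labels'].shape[0]
subset_size = int(num_rows * 0.5)

# Sample row indices without replacement, then sort them:
# h5py only accepts index lists in increasing order.
subset_indices = np.sort(rng.choice(num_rows, size=subset_size, replace=False))

all_features = np.arange(num_rows * 3).reshape(num_rows, 3)  # toy data
features_subset = all_features[subset_indices]
```

Index-list reads from an HDF5 file can be noticeably slower than a contiguous slice, so this trades some speed for a more representative sample.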

In [13]:
TRAIN_PROPORTION = 0.8
SPLIT_INDEX = int(len(labels) * TRAIN_PROPORTION)

Take the first 80% of the data to train the model, and keep the remaining 20% to evaluate it later.

In [14]:
X_train, y_train = (features[:SPLIT_INDEX],
                    labels[:SPLIT_INDEX])
X_test, y_test = (features[SPLIT_INDEX:],
                  labels[SPLIT_INDEX:])
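
The positional split relies on the stored order being random. As an alternative, scikit-learn's `train_test_split` can shuffle and stratify in one call. This is a sketch on toy data, not the split used to produce the results in this recipe; the variable names and `random_state` are arbitrary choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins: 40 samples, 5 features, 4 balanced classes.
toy_features = np.random.default_rng(0).normal(size=(40, 5))
toy_labels = np.repeat([0, 1, 2, 3], 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_features,
    toy_labels,
    train_size=0.8,
    stratify=toy_labels,   # keep class proportions equal in both parts
    random_state=42,
)
```

Stratifying matters most when classes are imbalanced or the test set is small.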

In [15]:
model = LogisticRegressionCV(n_jobs=-1)
model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
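
The warning offers two remedies: raise `max_iter` or scale the data. The sketch below combines both in a scikit-learn pipeline, fitted on small synthetic clusters rather than the extracted features. Whether scaling helps on the real features has not been checked here, and all names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Toy data: four well-separated clusters, one per class.
centers = np.array([[0, 0], [5, 5], [0, 5], [5, 0]])
toy_labels = np.repeat([0, 1, 2, 3], 25)
toy_features = centers[toy_labels] + rng.normal(scale=0.5, size=(100, 2))

# Scale features to zero mean and unit variance, then fit the classifier
# with a higher iteration cap than the default of 100.
scaled_model = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(max_iter=1000),
)
scaled_model.fit(toy_features, toy_labels)
train_accuracy = scaled_model.score(toy_features, toy_labels)
```

Putting the scaler inside the pipeline ensures that test data is transformed with statistics learned from the training data only.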


In [None]:
# To avoid hitting the iteration limit, raise max_iter and refit the model.
model = LogisticRegressionCV(max_iter=1000, n_jobs=-1)
model.fit(X_train, y_train)

In [16]:
predictions = model.predict(X_test)

In [20]:
print(predictions[0])

1


In [21]:
# Decode the byte-string label names; avoid shadowing the built-in `list`.
label_names = [name.decode('utf-8') for name in db['label_names'][:]]

In [22]:
report = classification_report(y_test, predictions, target_names=label_names)

In [23]:
print(report)

              precision    recall  f1-score   support

           0       1.00      0.99      0.99       204
         180       0.99      1.00      1.00       208
         270       1.00      1.00      1.00       210
          90       0.99      1.00      0.99       193

    accuracy                           1.00       815
   macro avg       1.00      1.00      1.00       815
weighted avg       1.00      1.00      1.00       815
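
A classification report summarizes per-class scores, while a confusion matrix shows which orientations get mistaken for which. The sketch below uses made-up labels and predictions purely to illustrate the call; they are not the results of this recipe.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and predictions for four classes (0..3).
toy_true = np.array([0, 0, 1, 1, 2, 2, 3, 3])
toy_pred = np.array([0, 0, 1, 3, 2, 2, 3, 3])

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(toy_true, toy_pred, labels=[0, 1, 2, 3])
print(matrix)
```

Off-diagonal entries point to the specific pairs of classes the model confuses.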



In [25]:
db.close()
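
Calling `db.close()` at the end works, but the file stays open if an earlier cell raises an error. One alternative is to read what is needed inside a `with` block, which closes the file automatically. The sketch writes a tiny temporary file so that it can run on its own; the file name and contents are placeholders.

```python
import os
import tempfile

import h5py
import numpy as np

# Create a small throwaway HDF5 file to read back.
tmp_dir = tempfile.mkdtemp()
toy_path = os.path.join(tmp_dir, 'toy_features.hdf5')
with h5py.File(toy_path, 'w') as toy_db:
    toy_db.create_dataset('features', data=np.zeros((4, 3)))
    toy_db.create_dataset('labels', data=np.array([0, 1, 2, 3]))

# Slicing with [:] copies the data into memory, so the arrays
# remain usable after the file is closed.
with h5py.File(toy_path, 'r') as toy_db:
    toy_features = toy_db['features'][:]
    toy_labels = toy_db['labels'][:]
```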