# Session 6 - Benchmark classification on ```cifar-10```

This notebook builds on what we were doing last week with the handwritten digits from the MNIST dataset.

This week, we're working with another famous dataset in computer vision and image processing research - [cifar10](https://www.cs.toronto.edu/~kriz/cifar.html).

In [9]:
# path tools
import os
import cv2

# data loader
import numpy as np
from tensorflow.keras.datasets import cifar10

# machine learning tools
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# classificatio models
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

We're going to load the data using a function from the library ```TensorFlow```, which we'll be looking at in more detail next week. 

For now, we're just using it to fetch the data!

In [3]:
(X_train, y_train), (X_test, y_test) = cifar10.load_data() #making it a tuple

Downloading data from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz


**Question:** What is the shape of the data?

In [4]:
X_train.shape # 4 dimensional numpy array  (n objects, 32*32 size, 3 colour channels)

(50000, 32, 32, 3)

In [6]:
y_train # these are the current labels, they suck, they are index (so 9= truck, 6 =frog, a lil dumb, lets remake them)

array([[6],
       [9],
       [9],
       ...,
       [9],
       [1],
       [1]], dtype=uint8)

Unfortunately, this version of the data set doesn't have explict labels, so we need to create our own.
Labels can be found here in documentation of data: https://www.cs.toronto.edu/~kriz/cifar.html 

In [5]:
labels = ["airplane", 
          "automobile", 
          "bird", 
          "cat", 
          "deer", 
          "dog", 
          "frog", 
          "horse", 
          "ship", 
          "truck"] #note this is alfabetically 

### Convert all the data to greyscale

In the following cell, I'm converting all of my images to greyscale and then making a ```numpy``` array at the end.

Notice that I'm using something funky here called *[list comprehensions](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions)*.

In [10]:
#list comprehensions (a loop in a single line)
X_train_grey = np.array([cv2.cvtColor(image, cv2.COLOR_BGR2GRAY) for image in X_train]) #list 
X_test_grey = np.array([cv2.cvtColor(image, cv2.COLOR_BGR2GRAY) for image in X_test])

In [12]:
X_train_grey.shape # now its a 3 dimensionel object (50000 images, 32 width, 32 height)

(50000, 32, 32)

In [None]:
#list comprehension (translation)
#for x in y:
#    do_this(x)

#[do_this(x) for x in y]
#OR
#[x.upper() for x in y]
#EG, having the list colours = ["red", "green", "blue"]
# upper = [colour.upper() for colour in colours] 

Then, we're going to do some simple scaling by dividing by 255.
- We're essentially doing this to make smaller numbers and fit the data better.
- Every pixel value in each 50000 32x32 image is now compressed! (dimensions are the same, we still have 50000 32x32 images, but now with LOWER pixel values) 

In [13]:
X_train_scaled = (X_train_grey)/255.0 
X_test_scaled = (X_test_grey)/255.0

### Reshaping the data

Next, we're going to reshape this data. 
- Both of the training and test data

In [1]:
nsamples, nx, ny = X_train_scaled.shape #extracting each dimension (samples, 32, 32)
X_train_dataset = X_train_scaled.reshape((nsamples,nx*ny)) #we are shaping our data from 3D (50000, 32,32) to 2D (50000, 1024)

#this is litteraly FLATTENING IT (as we have done before apparently)
#imagine having a pic of a car, that we want to flatten to one hidden layer where each node is a pixel in the 32*32 carpicture. 
# and now the data went from 32*32 pic to one hidden layer of 1024 nodes for each 50000 image

X_train_dataset.shape #see now we have 2D 


NameError: name 'X_train_scaled' is not defined

In [19]:
nsamples, nx, ny = X_test_scaled.shape
X_test_dataset = X_test_scaled.reshape((nsamples,nx*ny))

## Simple logistic regression classifier

We define our Logistic Regression classifier as we have done previously. You'll notice that I've set a lot of different parameters here - you can learn more in the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

- PENALTY:
    - L1 = setting very small (close to 0) weights to 0, and only keeping only meaingful weights
    - none = no penalty
- TOL:
    - toleance for stopping criteria. When the model stop learning (improving the weight values), then stop training. 
    - Default is usually pretty small, so 0.1 is a little high actually 
- VERBOSE: 
    - Boolean value. 
    - Default is FALSE: meaing no ouput is printet
    - TRUE: output is printed as it runs
 
- SOLVER: 
    - algorithm for optimization problem.
    - See documentation when choosing (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
    - Depends on dataset size and multiclass problems. 

- MULTI_CLASS: 
    - either multinomial and binary. 
    - In this case we have multinomial data, and not (FAKE vs REAL data articles)


In [18]:
clf = LogisticRegression(penalty="none", 
                        tol=0.1,
                        verbose=True,
                        solver="saga",
                        multi_class="multinomial").fit(X_train_dataset, y_train)

  y = column_or_1d(y, warn=True)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Epoch 1, change: 1.00000000
Epoch 2, change: 0.31719215
Epoch 3, change: 0.13990186
convergence after 4 epochs took 18 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   17.6s finished


In [20]:
y_pred = clf.predict(X_test_dataset)

We can then print our classification report, using the label names that we defined earlier.

In [21]:
report = classification_report(y_test, 
                               y_pred, 
                               target_names=labels) #cool classification report function
print(report)

              precision    recall  f1-score   support

    airplane       0.34      0.39      0.36      1000
  automobile       0.38      0.37      0.37      1000
        bird       0.28      0.20      0.23      1000
         cat       0.24      0.16      0.19      1000
        deer       0.25      0.25      0.25      1000
         dog       0.30      0.29      0.30      1000
        frog       0.28      0.30      0.29      1000
       horse       0.33      0.31      0.32      1000
        ship       0.33      0.40      0.36      1000
       truck       0.37      0.48      0.42      1000

    accuracy                           0.32     10000
   macro avg       0.31      0.32      0.31     10000
weighted avg       0.31      0.32      0.31     10000



## Neural network classifier

I've set a couple of different parameters here - you can see more in the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html).

**NB!** This will take a long time to run! On the 32 CPU machine on UCloud, this takes around 30 seconds per iteration.

In [22]:
#TRAINING THE MODEL and get validation score everytime
clf = MLPClassifier(random_state=42,
                    hidden_layer_sizes=(64, 10), #two hidden layers (64 nodes and 10 nodes), this will icrease computational time
                    learning_rate="adaptive", # adaptive = rapid guessing at first, and later slowly learning from what is already has learnt
                    early_stopping=True, #if we change the previous tolerance value it affects the early stopping.
                    verbose=True,
                    max_iter=20).fit(X_train_dataset, y_train) #we cross validate 20 times


# LOSS SCORE: we should see that the loss score should reduce as the model leans/TRAINS (see output)
# VALIDATION SCORE: the validation score will increase as the model learns, it gets better at predicting the cross val data 

  y = column_or_1d(y, warn=True)


Iteration 1, loss = 2.30872956
Validation score: 0.133000
Iteration 2, loss = 2.15971661
Validation score: 0.239200
Iteration 3, loss = 2.02581278
Validation score: 0.265200
Iteration 4, loss = 1.97076182
Validation score: 0.281800
Iteration 5, loss = 1.93555578
Validation score: 0.302600
Iteration 6, loss = 1.90926190
Validation score: 0.315600
Iteration 7, loss = 1.89160286
Validation score: 0.318800
Iteration 8, loss = 1.87500641
Validation score: 0.322200
Iteration 9, loss = 1.86730610
Validation score: 0.316800
Iteration 10, loss = 1.85845283
Validation score: 0.321200
Iteration 11, loss = 1.84549829
Validation score: 0.331400
Iteration 12, loss = 1.83590762
Validation score: 0.328600
Iteration 13, loss = 1.82908945
Validation score: 0.331400
Iteration 14, loss = 1.82320985
Validation score: 0.330600
Iteration 15, loss = 1.81056794
Validation score: 0.343400
Iteration 16, loss = 1.80707784
Validation score: 0.338400
Iteration 17, loss = 1.79877427
Validation score: 0.339800
Iterat



In [23]:
#ACTUALLY PREDICTING THE TEST SET 
y_pred = clf.predict(X_test_dataset)

Lastly, we can get our classification report as usual.

In [24]:
report = classification_report(y_test, 
                               y_pred, 
                               target_names=labels)
print(report)

              precision    recall  f1-score   support

    airplane       0.38      0.41      0.40      1000
  automobile       0.40      0.49      0.44      1000
        bird       0.26      0.34      0.30      1000
         cat       0.28      0.11      0.16      1000
        deer       0.27      0.26      0.27      1000
         dog       0.33      0.34      0.34      1000
        frog       0.28      0.29      0.28      1000
       horse       0.45      0.39      0.42      1000
        ship       0.44      0.44      0.44      1000
       truck       0.42      0.47      0.44      1000

    accuracy                           0.35     10000
   macro avg       0.35      0.35      0.35     10000
weighted avg       0.35      0.35      0.35     10000



## Tasks

Take the code outlined in this notebook and turn it into two separate Python scripts, one which performs Logistic Regression classification and one which uses the MLPClassifier on the ```Cifar10``` dataset.

Try to use the things we've spoken about in clas
- Requirements.txt
- Virtual environment
- Setup scripts
- Argparse

This task is [Assignment 2 for Visual Analytics](https://classroom.github.com/a/KLVvny7d).