<a href="https://colab.research.google.com/github/re114/re114.github.io/blob/main/machine_learning_introduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Approaches to machine learning
###Supervised Learning  
In supervised learning we take a set of data and a set of labels that map to the data, xi -> yi. The dataset is split into a training set and a test set; the model is trained on the training data and labels, then evaluated on the test data, with the aim of reducing the error. Supervised learning is typically used for classification or regression problems.
A regression problem involves predicting a numerical label. An example would be projecting new coronavirus cases from historic data, or from data from other countries such as that provided by Worldometer (“United Kingdom Coronavirus cases,” n.d.).
A classification problem aims to identify a class label. An example is the digit identification we looked at with the MNIST database (LeCun et al., n.d.), which takes data and label pairs (x, y), where x is the data and y is the label, with the goal of learning a function that maps x -> y.
The MNIST dataset consists of 70,000 labelled images of the digits 0-9, of which 60,000 are conventionally used for training and 10,000 for testing (LeCun et al., n.d.). A supervised model might take 40,000 of the images and labels as its training set and be trained on these. Once the model is trained, it is run on the test data.
A good model will give similar results on the test data to those on the training data. A model that has been trained too closely on the training data is said to be overfitted to it. This highlights the importance of splitting the data into at least training and test sets (better still, training, validation and test sets) so that overfitting to chance correlations in the training data can be detected.
  
>“Learning is a search through the space of possible hypotheses for one that will perform well, even on new examples beyond the training set. To measure the accuracy of a hypothesis we give it a test set of examples that are distinct from the training set.”  
(Russell & Norvig, 2010).
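
As a rough illustration of the split described above, the sketch below shuffles a dataset's indices and divides them into training, validation and test subsets using NumPy. The function name `split_indices`, the 60/20/20 proportions and the seed are assumptions made for this example rather than anything prescribed by the sources cited here.

```python
import numpy as np

def split_indices(n_samples, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle the indices 0..n_samples-1 and split them into three disjoint sets."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    n_train = int(n_samples * train_frac)
    n_val = int(n_samples * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains is held back for testing
    return train, val, test

# Hypothetical dataset of 1000 samples.
train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

Because the three index sets are disjoint, no example used for training is ever used to measure performance.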

###Unsupervised learning  
Unsupervised learning does not require labelled data, and instead actively looks for correlations within the data to learn the underlying structure - "is this thing like another thing?".
Unsupervised learning is typically used for clustering or density estimation problems.
An example of a problem addressed by clustering is spam filtering, which can use K-Means clustering to look at the email header and content and create groups, or clusters, to identify problem emails (“Lecture 13.2 — Clustering | KMeans Algorithm — [ Machine Learning | Andrew Ng ] - YouTube,” n.d.).

>“The most common unsupervised learning task is clustering: detecting potentially useful clusters of input examples. For example, a taxi agent might gradually develop a concept of “good traffic days” and “bad traffic days” without ever being given labelled examples of each by a teacher.”  
(Russell & Norvig, 2010).  
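
To make the idea of clustering more concrete, here is a minimal sketch of K-Means written with NumPy. It is not the implementation from the lecture cited above; the function name `k_means`, the synthetic two-blob data and the fixed number of iterations are all assumptions chosen to keep the example small.

```python
import numpy as np

def k_means(points, k, n_iters=20, seed=0):
    """Very small K-Means: returns (centroids, cluster index for each point)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct points chosen at random from the data.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point joins its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Two synthetic, well-separated blobs (illustrative data only, no labels given).
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2))
points = np.vstack([blob_a, blob_b])

centroids, labels = k_means(points, k=2)
print(centroids)
```

The algorithm is never told which blob a point came from; the grouping emerges from the distances alone.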

###Reinforcement learning  
Reinforcement learning places the machine learning agent in an environment and lets it learn from feedback measured against a success criterion. It takes data as state-action pairs and aims to maximise the expected future reward over many time steps.
>“Reinforcement learning is learning what to do — how to map situations to actions—so as to maximize a numerical reward signal. The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them.”  
(Sutton & Barto, 2018)  
An example of reinforcement learning would be the work on AI systems learning to play video games (Shao, Tang, Zhu, Li, & Zhao, 2018), which is very popular with Computer Science students.  
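
The sketch below shows one common reinforcement learning method, tabular Q-learning, on a toy problem far simpler than a video game. The five-state corridor, the reward of 1.0 at the goal, and the values of `GAMMA` and `ALPHA` are all assumptions invented for this illustration; the cited works do not prescribe them.

```python
import numpy as np

N_STATES = 5          # states 0..4 in a corridor; state 4 is the goal
ACTIONS = (-1, +1)    # action 0 moves left, action 1 moves right
GAMMA = 0.9           # discount applied to future rewards
ALPHA = 0.5           # learning rate

def step(state, action):
    """Move along the corridor; the reward is 1.0 only on reaching the goal."""
    next_state = min(max(state + ACTIONS[action], 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done

rng = np.random.default_rng(0)
q_table = np.zeros((N_STATES, len(ACTIONS)))  # one value per state-action pair

for episode in range(500):
    state = 0
    for _ in range(100):  # cap the episode length
        action = int(rng.integers(len(ACTIONS)))  # explore with random actions
        next_state, reward, done = step(state, action)
        # Q-learning update: reward now plus the discounted best future value.
        target = reward + (0.0 if done else GAMMA * q_table[next_state].max())
        q_table[state, action] += ALPHA * (target - q_table[state, action])
        state = next_state
        if done:
            break

greedy_policy = q_table[:-1].argmax(axis=1)  # best action in each non-goal state
print(greedy_policy)
```

The agent is never told that moving right is correct; it discovers this because those state-action pairs accumulate the highest estimated future reward.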

  
  


###Neural Networks  
It's interesting to consider that the original work undertaken by McCulloch and Pitts (McCulloch & Pitts, 1990) on neural brain structures as logic gates informed and inspired von Neumann's architecture (Ohta, 2015) and his view of the computer as a brain. The paradigm then shifted and we began to view the brain as a type of computer; with the arrival of neural networks, the analogy switched back once more (Cobb, 2020).  
Early work on Neural Networks was carried out by Frank Rosenblatt, who described the structure of the perceptron (Rosenblatt, 1958).  
Rosenblatt may have overhyped his findings, and fed a media circus instead of managing expectations (Boden, 2006).  
  
![Rosenblatt's Perceptron](https://raw.githubusercontent.com/re114/robotdraws/main/img/Rosenblatt-perceptron.png)

Minsky at MIT published a damning mathematical analysis of Rosenblatt's work (Minsky, 1961), which many cite as precipitating the first AI winter (Boden, 2006) (Russell & Norvig, 2010). The analysis suggested that the perceptron was a dead end for AI as it could not internally represent the things it was learning (Cobb, 2020), and it was not until the adoption of backpropagation that the approach became ascendant again (Y. LeCun et al., 1989).  
  
###The structure of Neural Networks  
Bengio quotes Hinton: “You have relatively simple processing elements that are very loosely models of neurons. They have connections coming in, each connection has a weight on it, and that weight can be changed through learning” (LeCun, Bottou, & Haffner, 1998).  
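
A single processing element of the kind described in the quote can be sketched in a few lines. The example below is an assumption-laden illustration: the function name `neuron`, the step activation and the hand-picked weights (chosen so the unit behaves like an AND gate, echoing the logic-gate view of McCulloch and Pitts) are ours, and nothing here is learned.

```python
import numpy as np

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs followed by a step activation (fires 1 or 0)."""
    weighted_sum = np.dot(inputs, weights) + bias
    return 1 if weighted_sum > 0 else 0

# Hand-picked (not learned) weights that make the unit behave like an AND gate.
weights = np.array([1.0, 1.0])
bias = -1.5

for a in (0, 1):
    for b in (0, 1):
        print(a, b, neuron(np.array([a, b]), weights, bias))
```

Learning, in this picture, amounts to adjusting `weights` and `bias` automatically rather than choosing them by hand.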
  
Chollet outlines the stages of working with a Neural Network (François Chollet, 2019):  
• define  
• fit  
• predict  
• evaluate  

Within the fit stage, training follows a gradient descent loop:  
• initialise the weights randomly, or using some insight into the relative importance of the inputs  
• loop until convergence  
• compute the gradient (derivative) of the error with respect to the weights  
• update the weights  
• return the weights  

In practical terms the aim is to minimise the error. We can describe the error as the squared difference between the prediction and the goal.  

```
# squared error for one input and one weight; goal_pred is the target value
error = ((input_value * weight) - goal_pred) ** 2
```

The weight determines the significance of the input.
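
The training loop and the squared error above can be combined into a small runnable sketch for a single weight. The names `train_weight`, `learning_rate` and `tolerance`, the starting weight of 0.0 (rather than a random one) and the example numbers are assumptions made for illustration only.

```python
def train_weight(input_value, goal_pred, weight=0.0,
                 learning_rate=0.1, tolerance=1e-8, max_steps=10_000):
    """Gradient descent on error = (input_value * weight - goal_pred) ** 2."""
    for _ in range(max_steps):                      # loop until convergence
        prediction = input_value * weight
        error = (prediction - goal_pred) ** 2
        if error < tolerance:
            break
        # Derivative of the error with respect to the weight.
        gradient = 2 * (prediction - goal_pred) * input_value
        weight -= learning_rate * gradient          # update the weight
    return weight                                   # return the weight

# Hypothetical example: which weight maps an input of 2.0 to a goal of 1.0?
weight = train_weight(input_value=2.0, goal_pred=1.0)
print(weight)
```

With these numbers the loop settles close to a weight of 0.5, since 2.0 * 0.5 matches the goal of 1.0.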



This script trains a basic neural network on labelled data to classify images of handwritten digits (0-9) from the MNIST dataset.

In [1]:
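import tensorflow as tf
mnist = tf.keras.datasets.mnist

# Load the MNIST digits and scale pixel values from 0-255 down to 0-1.
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Flatten each 28x28 image, pass it through one hidden layer with dropout,
# and output one raw score (logit) per digit class.
model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10)
])
# Logits from the untrained model for the first training image.
predictions = model(x_train[:1]).numpy()

predictions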

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
[[ 0.5176658  -0.13445848  0.02820197  0.04942533 -0.6842776  -0.26025444
   0.27671477 -0.6232597  -0.06118366 -0.44626555]]


In [2]:
tf.nn.softmax(predictions).numpy()


array([[0.17962019, 0.09357098, 0.11009909, 0.11246073, 0.05399552,
        0.0825104 , 0.14115994, 0.05739281, 0.10068483, 0.06850545]],
      dtype=float32)
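
To show what the softmax step is doing, the sketch below reimplements it with NumPy and applies it to the logits printed earlier for the untrained model. The helper name `softmax` is ours; the result should agree closely with the `tf.nn.softmax` output above, up to floating-point rounding.

```python
import numpy as np

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1 along the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum(axis=-1, keepdims=True)

# Logits copied from the untrained model's output shown earlier.
logits = np.array([[0.5176658, -0.13445848, 0.02820197, 0.04942533, -0.6842776,
                    -0.26025444, 0.27671477, -0.6232597, -0.06118366, -0.44626555]])

probabilities = softmax(logits)
print(probabilities.round(4))
print(probabilities.sum())
```

Because the model is untrained, the probabilities are spread fairly evenly across the ten digit classes.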

In [4]:
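# Cross-entropy loss that takes integer labels and raw logits.
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
# Loss of the untrained model on the first training image.
loss_fn(y_train[:1], predictions).numpy()
model.compile(optimizer='adam',
              loss=loss_fn,
              metrics=['accuracy'])

model.summary()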

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 784)               0         
_________________________________________________________________
dense (Dense)                (None, 128)               100480    
_________________________________________________________________
dropout (Dropout)            (None, 128)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 10)                1290      
Total params: 101,770
Trainable params: 101,770
Non-trainable params: 0
_________________________________________________________________


In [5]:
model.fit(x_train, y_train, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f9dc2eb0a10>

In [6]:
model.evaluate(x_test,  y_test, verbose=2)


313/313 - 0s - loss: 0.0764 - accuracy: 0.9766


[0.07643616944551468, 0.9765999913215637]

In [8]:
probability_model = tf.keras.Sequential([
  model,
  tf.keras.layers.Softmax()
])

probability_model(x_test[:5])


<tf.Tensor: shape=(5, 10), dtype=float32, numpy=
array([[2.3982443e-09, 6.4169414e-10, 1.0888503e-05, 3.0259458e-05,
        3.3600733e-11, 1.7275735e-07, 4.4016707e-15, 9.9995506e-01,
        2.0105460e-07, 3.4885350e-06],
       [2.0737152e-09, 2.4616753e-03, 9.9752265e-01, 1.4214254e-05,
        5.7037852e-15, 3.9163100e-07, 7.7725552e-08, 6.7630531e-13,
        9.1622178e-07, 2.7283492e-13],
       [2.7055131e-07, 9.9863869e-01, 2.0268162e-04, 5.1933890e-05,
        1.9910194e-04, 1.2790084e-05, 1.4361428e-04, 6.0762779e-04,
        1.4009200e-04, 3.2012922e-06],
       [9.9954587e-01, 7.9242646e-10, 5.4218904e-06, 1.5822366e-07,
        4.7550991e-05, 2.0334862e-06, 2.9748009e-04, 8.9524649e-05,
        1.1174333e-07, 1.1786353e-05],
       [3.9327824e-06, 1.9700616e-08, 1.1182532e-05, 2.0912091e-07,
        9.8954433e-01, 1.0499693e-06, 1.2581231e-05, 8.8090062e-05,
        2.3666976e-06, 1.0336231e-02]], dtype=float32)>

We can gauge how accurate our predictions are by comparing the prediction with a known result or label.  
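
As a small sketch of that comparison, the code below takes the most probable class for each sample and measures the fraction that match the labels. The function name `accuracy` and the three-class probabilities and labels are hypothetical values made up for the example, not output from the model above.

```python
import numpy as np

def accuracy(probabilities, labels):
    """Fraction of samples whose most probable class equals the true label."""
    predicted = probabilities.argmax(axis=1)
    return float((predicted == labels).mean())

# Hypothetical probabilities for three samples over three classes.
probabilities = np.array([[0.7, 0.2, 0.1],
                          [0.1, 0.8, 0.1],
                          [0.3, 0.4, 0.3]])
labels = np.array([0, 1, 2])  # hypothetical true labels

print(accuracy(probabilities, labels))
```

Here the first two predictions match their labels and the third does not, so two out of three are correct.
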
In order to benefit from multiple layers, we need to add an activation function; without one, the layers could in practice be collapsed into a single layer (H. Li, Ouyang, & Wang, 2016).  
The activation function maps back to the neural structure of biological systems, where synaptic inputs are expressed or repressed (Hawkins, 2005) to determine their activation.  
We can use a range of activation functions, such as ReLU (rectified linear unit), which passes its input through when the input is above zero and outputs zero otherwise.  
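
The point about collapsing layers can be checked numerically. In the sketch below, two stacked linear layers give the same result as one layer whose weights are the product of the two, and inserting ReLU between them breaks that equivalence. The tiny weight matrices are hand-picked assumptions, chosen so the arithmetic is easy to follow.

```python
import numpy as np

def relu(x):
    """Rectified linear unit: keep positive values, replace the rest with zero."""
    return np.maximum(x, 0.0)

x = np.array([[1.0, -2.0]])               # one sample with two features
w1 = np.array([[1.0, -1.0], [0.5, 1.0]])  # first layer weights
w2 = np.array([[1.0], [1.0]])             # second layer weights

two_linear_layers = (x @ w1) @ w2         # no activation between the layers
one_linear_layer = x @ (w1 @ w2)          # the same weights collapsed into one layer
with_relu = relu(x @ w1) @ w2             # activation between the layers

print(np.allclose(two_linear_layers, one_linear_layer))
print(np.allclose(with_relu, one_linear_layer))
```

The first comparison holds because matrix multiplication is associative; the second fails because ReLU zeroes the negative hidden value before the second layer sees it.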
