# Supervised Machine Learning
### Artificial Intelligence 1, Week 6


### Learning models for **classification** or **regression** 
from a set of labelled instances.

# This week
Learning outcomes:

- Identify, formulate and apply the basic processes of supervised machine learning
- Understand the role of data in estimating accuracy 

Videos:
- Basic model building process: train and test 
- Types of model: instance-based (e.g. kNN) vs explicit (e.g. decision trees, rules, ...) 
- Example: greedy rule induction compared with an expert system




# Machine Learning Paradigm
- Completely different paradigm to symbolic AI
- Create a system with the ability to learn
- Present the system with a series of examples
- System builds up its own model of the world




<img src="figures/PersonThinkingAboutDogs.png" style="float:left"><img src="figures/idealisedDog.png" style="float:right">

## Video (6:52): Hello World of Machine Learning Recipes


https://youtu.be/cKxRvEZd3Mw


## It's all about the data
- Computers cannot experience artefacts of the real world directly
- Instead they just deal with a few variables that represent them
- ML algorithms learn from a “training set” containing digital representations of examples
- Outcomes depend entirely on:
 - What you choose to measure
 - And how representative your training set is
 



## More formally

We have a set of *n* examples, and for each one we have: 
- a value for each of *f* features 
- a label

The data set *X* is usually a 2-D array of *n* rows and *f* columns.   

The label set *y* is usually a 1-D array with *n* entries.   

For now we'll assume the features are *continuous* (e.g. floating point values)

If the label comes from a discrete unordered set of *m* values, e.g.  ("Orange", "Apple", "Banana"): 
- we have a **Classification** problem. $M: \mathcal{R}^f \rightarrow \{1,\ldots,m\}$
- The learned model *M*  is a mapping from an *f*-dimensional continuous space (the feature values) onto a finite set
 

If the label is an ordered numeric value (integer, floating point):
- we have a **Regression** problem. $ M:\mathcal{R}^f \rightarrow \mathcal{R}$
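
As a small illustration of these shapes, the sketch below builds a toy data set. All numbers, feature meanings and label values are invented for this example.

```python
import numpy as np

# Invented toy data: n = 4 examples, f = 2 features (weight in g, length in cm)
X = np.array([[150.0, 7.0],
              [170.0, 7.5],
              [120.0, 6.0],
              [118.0, 18.0]])

# Classification: labels come from a finite, unordered set of m values
y_class = np.array(["Apple", "Apple", "Orange", "Banana"])

# Regression: labels are continuous values (here an invented price in pence)
y_reg = np.array([30.0, 35.0, 25.0, 20.0])

n, f = X.shape                      # rows are examples, columns are features
m = len(np.unique(y_class))         # number of distinct class labels
print(f"n = {n} examples, f = {f} features, m = {m} classes")
```

Both label arrays have one entry per row of `X`, which is what lets a model pair each example with its target.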

# The  Supervised Learning Workflow
<div>
<div width=40% style="float:left">
    <p>This diagram assumes you are trying out:</p>
<ul>
    <li> more than one type of algorithm </li>
    <li> or a choice of parameter settings</li>
    <li> or when to stop training <br>
        your algorithm</li>
    </ul>
<p>If you are just trying one algorithm<br>  
    you can skip the validation phase</p>
    </div>

<div width=45% style="float:right">    
<img src="figures/ML_workflow.png" style= "float:right">
    </div>
</div>

### Example:  Iris flowers <img src="figures/Iris-image.png" style="float:right;width:300px">
- Classic Machine Learning Data set
- 4 measurements: sepal and petal width and length
- 50 examples from each of 3 sub-species of iris flowers
- three class problem:
 - so for some types of algorithm we have to decide whether to make  
   a 3-way classifier or nested 1-vs-rest classifiers
- most ML classifiers can get over 90% accuracy




In [None]:
import numpy as np
import matplotlib.pyplot as plt
import sklearn.datasets
import  week6_utils as w6utils
%matplotlib inline

In [None]:
iris_x,iris_y = sklearn.datasets.load_iris(return_X_y=True)
iris_features= ("sepal_length", "sepal_width", "petal_length", "petal_width")
iris_names= ['setosa','versicolor','virginica']
title="Scatterplots of 2D slices through the 4D Iris data"
w6utils.show_scatterplot_matrix(iris_x,iris_y,iris_features,title)

# Recap so far
Machine Learning is about learning patterns from data.  
In supervised ML this means: 
1. **Training Data**: set of labelled examples, each characterised by values for *f* features  
   - **X**: data - usually a 2D array with one row per example, one column for each feature  
     (even images can be 'flattened' into this format).   
   - **y** : the labels/target 

2. A supervised Machine Learning **Algorithm**

3. A **performance criterion (quality)**: used to drive training and estimate the quality of the model.  
Depending on the **context** this might be accuracy, precision, recall, error rates...


4. A **test set** to estimate the performance of the model on unseen data.  
   If this is not available separately, we have to take out some data from the training set
   - crude way: a single 70:30 train:test split, making sure you preserve the proportions of different classes
   - better way: *N-Fold Cross Validation* [Description on Wikipedia](https://en.wikipedia.org/wiki/Cross-validation_(statistics))
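
A minimal sketch of N-fold cross-validation on the Iris data, assuming sklearn's `StratifiedKFold` and `cross_val_score` helpers. The choice of 5 folds and of a 1-nearest-neighbour classifier is arbitrary here; any sklearn classifier could be substituted.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 5 folds: every example is used for testing exactly once.
# Stratification keeps the class proportions similar in each fold.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = KNeighborsClassifier(n_neighbors=1)

# One accuracy score per fold; the model is re-fitted for each fold
scores = cross_val_score(model, X, y, cv=folds)
print(f"fold accuracies: {scores}")
print(f"mean accuracy: {scores.mean():.3f}")
```

Averaging over folds gives a less noisy estimate than a single split, at the cost of training the model several times.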

# Important Idea!  Decision Surfaces in feature space
<div>
<div width=70% style="float:left">
<p>    
Each feature defines a dimension in <em>feature space</em>.<br>  
Each example has specific values for each feature<br>
    - so it occupies <b>one point</b> in feature space</p>

<p>The aim of our model is to let us predict labels for any item:</p>
    <ul><li> so it puts decision boundaries into feature space </li>
        <li>  to divide it into regions with a common label</li></ul>

<p>Symbolic Reasoning:
<ul><li> boundaries defined by our 'knowledge' </li>
<li> so can plot without needing data! e.g. image on right</li>
</ul></p>

<p>Machine Learning: 
<ul><li> uses the training data to <b>estimate</b> where the boundaries should be.</li>
    <li> usually through a search process to minimise the number of mis-predictions</li>
<li> So we can plot model's prediction for lots of points over a grid <br> 
    to visualise the decision surface and boundaries </li></ul></p>
</div>
<div  width=35% style="float:right">    
    <figure>
        <img src="figures/decisionRegions.png" style="width:300px;float:right">
        <figcaption>Decision Model for predicting outcomes<br> using  pre-2024 assessment.</figcaption>
    </figure>
</div>
</div>
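
The grid idea above can be sketched with numpy alone. The `predict` function below is an invented hand-written rule (pass if both marks are at least 40) standing in for a model; it is not the rule used in the figure.

```python
import numpy as np

def predict(points: np.ndarray) -> np.ndarray:
    """Invented rule: 1 ('pass') if exam and coursework are both >= 40, else 0."""
    exam, coursework = points[:, 0], points[:, 1]
    return ((exam >= 40) & (coursework >= 40)).astype(int)

# Lay a regular grid over the 2-D feature space (marks 0..100 in steps of 5)
axis_values = np.arange(0, 101, 5)
grid_exam, grid_cw = np.meshgrid(axis_values, axis_values)
grid_points = np.column_stack([grid_exam.ravel(), grid_cw.ravel()])

# Ask the model for a label at every grid point, then reshape back to the grid
surface = predict(grid_points).reshape(grid_exam.shape)
print(surface[::4, ::4])  # coarse view of the two labelled regions
```

Passing `grid_exam`, `grid_cw` and `surface` to `matplotlib.pyplot.contourf` would draw the regions; the same recipe works for any model with a `predict` method.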

## Machine Learning Algorithms
Typically a ML method consists of:

1: A  representation for the decision boundaries
 - Each different arrangement of boundaries defines a unique model
 - Each unique model is defined by the set of values for variables specifying where the boundaries are.
 - Different types of models will have different variables.
 
2: A learning algorithm for deciding how to change those variable values to move between models
 - last week we saw how the KMeans clustering algorithm uses "local search with random restarts"

ML Algorithms build models in different ways
- but they don’t care what it is they are grouping
- and it is **meaningless** to say they “understand”.


## Some example ML methods
The field of ML is fast growing and contains many complex methods and representations.   
This module focusses on a few simple ideas to give you a feel for what is out there.  
- Instance-based learning (k-Nearest Neighbours) - this week
- Decision trees and rule induction algorithms- this week
- Artificial Neural Networks - weeks 7 and 8 

Next year: 
- Artificial Intelligence 2:  15 credits, semester 1 (AI and "General" pathways) 
and in particular 
- Machine Learning: 15 credits, semester 2     ( AI pathway)

will cover more algorithms in greater depth.


## Instance-based Methods: Nearest Neighbour Methods
- Do not explicitly represent class boundaries  
  Construct them “on-the-fly” when queried
- Store the set of training examples  
  More efficient methods may not store all points
- Use a metric to calculate distance between two points  
  e.g. Euclidean (continuous), Hamming (binary), ...
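
As a quick illustration of the two metrics named above, here is a minimal sketch. The function names are chosen for this example only.

```python
import numpy as np

def euclidean(a: np.ndarray, b: np.ndarray) -> float:
    """Straight-line distance between two continuous feature vectors."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    """Number of positions at which two binary vectors differ."""
    return int(np.sum(a != b))

print(euclidean(np.array([0.0, 0.0]), np.array([3.0, 4.0])))    # 5.0
print(hamming(np.array([1, 0, 1, 1]), np.array([1, 1, 0, 1])))  # 2
```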

<img src="figures/kNN-steps.png">

## K-Nearest Neighbour Classification 

<div>
<div width=50% style="float:left">
    <p><b>init(neighbours=k, distance metric =d)</b>:<br>  
        Specify <i>k</i> and a distance metric <i>d(i,j)</i> </p>
    <p><b>fit(trainingData)</b>:<br>  
      Store a local copy of the training data as two arrays:<br>  
       model_x of shape (numTrainingItems , numFeatures),  <br>
        model_y of shape( numTrainingItems)</p>  
    <p><b>predict(newItems)</b>:<br>
        <ol>
        <li>Make 2D array <i>distances</i> of shape (num_newItems , numTrainingItems)<br>  
            FOREACH COMBINATION of newItem i  and trainingitem j <br> 
            ...SET <i>distances[i][j] = d (i,j)</i> 
        </li>
        <li> Make 2D array <i>votes</i> of shape(num_newItems, k)<br> 
                FOREACH newItem i <br>
            ...Find the <i>k</i> columns of the row <i>distances[i]</i> with the smallest values<br>
                ...Put the corresponding <i>k</i> labels from model_y into <i>votes[i]</i> 
        </li>
        <li>Store majority vote in a  1D array <i>y_pred</i> of size <i>num_newItems</i><br>
                FOREACH  newItem i<br>
                ...SET<i> y_pred[i] = most_common_value(votes[i]) </i>
        </li>
        <li>RETURN y_pred</li>
            </ol>
</div>    
<div width=35% style="float:right">    
<img src="figures/voronoi.png" style="float:right" width = 400 title="https://kevinzakka.github.io/2016/07/13/k-nearest-neighbor">
<p><a href= https://kevinzakka.github.io/2016/07/13/k-nearest-neighbor>Image of Voronoi tessellation</a></p>
</div>
</div>

In [None]:
# Example for K = 1 
class Simple1NNClassifier:
    """ 
    Simple example class for 1-Nearest Neighbours algorithm.
    Assumes numpy is imported as np and uses euclidean distance
    """    
    def dist_a_b(self,a:np.ndarray,b:np.ndarray)->float:
        """ euclidean distance between same-size vectors a and b"""
        assert a.shape==b.shape, 'vectors not same size calculating distance'
        return np.linalg.norm(a-b) 
    
    def fit(self,x:np.ndarray,y:np.ndarray):
        """ just stores the data for nearest neighbour"""
        self.num_training_items = x.shape[0]
        self.num_features = x.shape[1]
        self.model_x = x
        self.model_y = y
        
    def predict(self,new_items:np.ndarray):
        """ makes predictions for an array of new items"""
        num_to_predict = new_items.shape[0]
        y_pred = np.zeros((num_to_predict),dtype=int)
        
        # measure distances - creates an array with num_to_predict rows and num_training_items columns
        dist = np.zeros((num_to_predict,self.num_training_items))
        for new_item in range(num_to_predict):
            for stored_example in range(self.num_training_items):
                dist[new_item][stored_example]= self.dist_a_b(new_items[new_item],
                                                              self.model_x[stored_example ])

        #make predictions: one for each row of dist (item to predict)
        for item_idx in range(num_to_predict):
            y_pred[item_idx] = self.predict_one(item_idx, dist)
        return y_pred
    
    def predict_one(self,item_idx:int,distances:np.ndarray):
        """ makes a class prediction for a single new item
        This version is just for 1 Nearest Neighbour
        Parameters
        ----------
        item_idx (int): item to make prediction for - i.e. idx of row in distances matrix
        distances (numpy ndarray): array of distances between new items (rows) and training set records (columns)
        """
        # we're going to use numpy's argmin function,
        # which gives us the index of the column with the lowest value in an array
        idx_of_nearest_neighbour = np.argmin (distances[item_idx])
        return self.model_y[ idx_of_nearest_neighbour]
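
The class above is fixed at k = 1. A minimal sketch of the general-k steps from the pseudocode (distances, votes, majority) follows. The function `knn_predict` and the toy arrays are invented for illustration and are not part of `week6_utils`. Ties in the vote go to the tied label that appears first among the sorted neighbours.

```python
import numpy as np
from collections import Counter

def knn_predict(model_x, model_y, new_items, k=3):
    """Predict a label for each row of new_items by majority vote of k neighbours."""
    # distances[i][j] = Euclidean distance from new item i to training item j
    diffs = new_items[:, np.newaxis, :] - model_x[np.newaxis, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=2))
    # column indices of the k smallest distances in each row
    nearest = np.argsort(distances, axis=1)[:, :k]
    votes = model_y[nearest]  # shape (num_new_items, k)
    # majority vote per row
    return np.array([Counter(row.tolist()).most_common(1)[0][0] for row in votes])

toy_x = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
toy_y = np.array([0, 0, 0, 1, 1, 1])
queries = np.array([[0.3, 0.3], [4.8, 5.0]])
print(knn_predict(toy_x, toy_y, queries, k=3))
```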

## How does K-nearest Neighbours do on the Iris data?

We'll use a function from sklearn to do our train/test split here.

This is handy because it shuffles the data and has options to make sure that we keep the same proportion of different classes in our training and testing data.


            
           
We'll also make a **confusion matrix** to examine the predictions it makes
rows = target labels,  columns = predicted labels
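
With that row/column convention, the usual metrics can be read straight off the matrix. The sketch below uses an invented 3-class matrix, not results from this notebook.

```python
import numpy as np

# Invented confusion matrix: rows = target labels, columns = predicted labels
cm = np.array([[17, 0, 0],
               [0, 15, 2],
               [0, 1, 15]])

correct = np.diag(cm)                 # diagonal entries are correct predictions
accuracy = correct.sum() / cm.sum()   # fraction of all examples predicted correctly
recall = correct / cm.sum(axis=1)     # per class: correct / number of true examples
precision = correct / cm.sum(axis=0)  # per class: correct / number of times predicted

print(f"accuracy  = {accuracy:.2f}")
print(f"recall    = {np.round(recall, 2)}")
print(f"precision = {np.round(precision, 2)}")
```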
           

In [None]:
# make train/test split 
from sklearn.model_selection import train_test_split

train_x, test_x, train_y, test_y = train_test_split(iris_x, iris_y, test_size=0.33,stratify=iris_y)


my_KNN_model = Simple1NNClassifier()
my_KNN_model.fit(train_x,train_y)
y_pred = my_KNN_model.predict(test_x)
print(y_pred) # y_pred is a 1D array so it prints as a single row

## how good are these results?
We can use a neat numpy trick to find out if the predictions are correct

In [None]:
print ( (test_y==y_pred))
accuracy = 100* ( test_y == y_pred).sum() / test_y.shape[0]
print(f"Overall Accuracy = {accuracy} %")

confusionMatrix = np.zeros((3,3),int)
for i in range(len(test_y)): # loop over however many test items the split produced
    actual = int(test_y[i])
    predicted = int(y_pred[i])
    confusionMatrix[actual][predicted] += 1
print(confusionMatrix)

#and here's sklearn's built-in method
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_predictions(test_y, y_pred,display_labels= iris_names )

### Visualising 1-NN if we just learn from the petal features


- same labels (y)
- but the training inputs only have the last two features, which describe the petals
- i.e. train_x[:, 2:4]  or test_x[:, 2:4]

In [None]:
petals_trainx= train_x[:,2:4]
petals_testx = test_x[:,2:4]
print(f'new input features have sizes {petals_trainx.shape} and {petals_testx.shape}')

In [None]:

# new model
my_KNN_model2= Simple1NNClassifier()

# fit the petals data
my_KNN_model2.fit(petals_trainx,train_y)

#make predictions for petals version of test data
y_pred2 = my_KNN_model2.predict(petals_testx)

accuracy = 100* ( test_y == y_pred2).sum() / test_y.shape[0]
print(f"Overall Accuracy in 2D = {accuracy} %")

#visualise the decision surface
w6utils.plot_decision_surface(petals_trainx,train_y,my_KNN_model2, 
                              "1-Nearest Neighbour on petal features", iris_features[2:4],min_zero=True,step_size= 0.1)

# Timeout

# Rule Induction Algorithms

## Principles

### Rule Representation
<div>
<div width=50% style="float:left">
<p>In Week 1 we introduced the idea of rules.</p>
<ul><li>Topic 3 is about <em>Knowledge-Based systems</em><br>  
    where <b>humans provide the rules</b> for a situation. </li>
    <li> ML Rule Induction algorithms automate learning rules</li>
    </ul>    
<p> In both cases a rule has the form:</p>
<ul>
    <li><b>if</b> feature_n <em> comparison</em> threshold <b>then</b> prediction.</li>
<li> with comparison  one of less than, equals, more than etc.</li>
    <li> Illustrated on the right for the iris data</li></ul>
    </div>
<div width=40% style="float:right">
<img src="figures/rule-representation.png" style="float:right">
    </div>
</div>

   
### Rule Matching
We say that a rule *covers* a training example (features, label) if
- the example features meet the rule's _condition_
- the rule's _action_ (prediction) matches the example's label.   
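
A minimal sketch of this test, assuming a rule is stored as a four-value tuple `(feature_index, operator, threshold, prediction)`. That layout, and the names `OPERATORS`, `condition_met` and `covers`, are assumptions made for this example.

```python
import operator

# Map the comparison symbols used in rules to Python functions
OPERATORS = {"<": operator.lt, "==": operator.eq, ">": operator.gt}

def condition_met(rule, features):
    """True if the example's feature value satisfies the rule's condition."""
    feature_idx, op, threshold, _ = rule
    return OPERATORS[op](features[feature_idx], threshold)

def covers(rule, features, label):
    """A rule covers an example if its condition is met AND its prediction is right."""
    return condition_met(rule, features) and rule[3] == label

rule = (2, "<", 2.5, "setosa")           # if petal_length < 2.5 then setosa
example_features = (5.1, 3.5, 1.4, 0.2)  # sepal length/width, petal length/width

print(covers(rule, example_features, "setosa"))     # True
print(covers(rule, example_features, "virginica"))  # False: condition met, label wrong
```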

### Decision Boundaries and Default Classes
Most existing algorithms tend to use  rules built up of lots of axis-perpendicular decisions.   
-  For example the (useless) rule  *If( petal_length > 0.3) THEN ("Setosa")*   
  Draws a line through feature space, at right angles to the petal_length axis, crossing it at 0.3.  
  Puts the label "setosa" on one side, nothing on the other

- As more rules are added, the model effectively builds labelled (hyper) boxes in space.  
  
- The rest of 'decision space' is given the default (majority) label

### Making a Prediction with a rule based model
Let's assume we have 3 rules (so this example fits on a slide)
and we want to make a prediction for a new item

```python
  if matches(rule0_condition, new_item):
     return rule0_prediction
  elif matches(rule1_condition, new_item):
     return rule1_prediction
  elif matches(rule2_condition, new_item):
     return rule2_prediction
  else:
     return default_class
```
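
The same chain generalises to any number of rules with a loop. This is a sketch only: the four-value rule layout `(feature_index, operator, threshold, prediction)` and the thresholds used are assumptions for illustration.

```python
import operator

OPERATORS = {"<": operator.lt, "==": operator.eq, ">": operator.gt}

def predict_one(rules, default_class, new_item):
    """Return the prediction of the first rule whose condition matches."""
    for feature_idx, op, threshold, prediction in rules:
        if OPERATORS[op](new_item[feature_idx], threshold):
            return prediction  # first matching rule wins
    return default_class       # no rule matched

# Invented rule set for iris-like items (petal_length is index 2, petal_width index 3)
rules = [(2, "<", 2.5, "setosa"), (3, ">", 1.7, "virginica")]

print(predict_one(rules, "versicolor", (5.1, 3.5, 1.4, 0.2)))  # setosa
print(predict_one(rules, "versicolor", (6.3, 3.3, 6.0, 2.5)))  # virginica
print(predict_one(rules, "versicolor", (5.9, 3.0, 4.2, 1.5)))  # versicolor
```

Because the first match wins, the order of the rules matters.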

### Relationship of model fitting to search
This learning or fitting a model to the data can be seen as a search process

*italics below refer to the search framework we used last topic*

To do that we need to have:
1. A representation for rules  
   *`mycandidatesolution.variable_values` is a list of rules*   
   *and each rule is a tuple of four values*
2. A way of assigning "goodness" to (sets of) rules.  
   *a dataset would be held in an instance of a `RuleInductionProblem`*  
   *evaluate() breaks into two stages: make predictions, then score them  against actual values*
3. A way of algorithmically generating possible rules  
   We have fixed sets of features, operators and outputs.  
   We can **discretize** the thresholds for each feature    
   So we can use nested loops (or a neat python trick) to create all possible rules.  
   *this would be the value_set for the problem instance*
   
Then we could use our local search algorithm to learn a set of rules.
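
One way to read the "neat python trick" in step 3 is `itertools.product`, which replaces the nested loops. In this sketch the feature range, the number of thresholds and the use of the same thresholds for every feature are all simplifying assumptions.

```python
import itertools
import numpy as np

num_features = 2
operators = ("<", "==", ">")
labels = (0, 1, 2)

# Discretize: 5 evenly spaced thresholds between an invented min and max.
# In practice each feature's range would come from the training data.
feature_min, feature_max = 0.0, 8.0
thresholds = np.linspace(feature_min, feature_max, 5)

# Every combination of (feature, operator, threshold, label) is one candidate rule
all_rules = list(itertools.product(range(num_features), operators, thresholds, labels))

print(len(all_rules))  # 2 features * 3 operators * 5 thresholds * 3 labels = 90
print(all_rules[:3])
```

Finer discretization gives more candidate rules, so the search space grows quickly.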

## Greedy rule induction: keep choosing the next best rule

We can exploit the ability to generate rules algorithmically to make a simple Machine Learning algorithm that **automatically** learns rules, using a greedy constructive hill climbing approach:  

This is a **generate-and-test** approach for searching the space of all possible models, that repeatedly takes the "next-best" rule to create a rule-set.     
- Note that this method can be easily out-performed by more sophisticated approaches.

In our framework `RuleInductionProblem.evaluate()`:
- takes a list of rules **not including the default rule**
- uses them to make predictions for the training data
- then compares each prediction to the actual label for that training item
- returns:
   - -1 if the model makes any incorrect predictions
   - otherwise the number of correct predictions

Note that this does not assume we can correctly classify each training item.
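
The greedy loop can be sketched as below. This is a stand-alone illustration, not the module's `RuleInductionProblem` class: the function names, the four-value rule layout and the toy data are assumptions. The `evaluate` function follows the scoring described above.

```python
import itertools
import operator

OPERATORS = {"<": operator.lt, ">": operator.gt}

def predict_one(rules, item):
    """Prediction of the first matching rule, or None if no rule matches."""
    for feature_idx, op, threshold, label in rules:
        if OPERATORS[op](item[feature_idx], threshold):
            return label
    return None

def evaluate(rules, data_x, data_y):
    """-1 if any prediction is wrong, otherwise the number of correct predictions."""
    correct = 0
    for item, label in zip(data_x, data_y):
        prediction = predict_one(rules, item)
        if prediction is None:
            continue      # item not covered yet: neither right nor wrong
        if prediction != label:
            return -1
        correct += 1
    return correct

def greedy_rule_induction(candidate_rules, data_x, data_y):
    rules, best_score = [], 0
    while True:
        # generate-and-test: score every one-rule extension of the current set
        scored = [(evaluate(rules + [r], data_x, data_y), r) for r in candidate_rules]
        score, rule = max(scored, key=lambda pair: pair[0])
        if score <= best_score:   # no extension helps: stop
            return rules
        rules.append(rule)        # keep the "next-best" rule
        best_score = score

# Invented toy data: one feature, two classes
data_x = [(1.0,), (2.0,), (3.0,), (7.0,), (8.0,), (9.0,)]
data_y = ["a", "a", "a", "b", "b", "b"]
candidates = list(itertools.product([0], ["<", ">"], [2.5, 5.0, 7.5], ["a", "b"]))

learned = greedy_rule_induction(candidates, data_x, data_y)
print(learned, evaluate(learned, data_x, data_y))
```

The loop must terminate because the score has to rise strictly each time a rule is added and cannot exceed the number of training items.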

## Decision Trees can capture rules and more
<div>
    <p> <b>Basic idea:</b> divide input space using a set of axis-parallel lines 
        and <b>"grow"</b> a tree via these steps.</p> 
    <ol>
        <li>Start with single node that predicts majority class label.</li>  
        <li>Loop over every leaf node:
        <ul>  
          <li>measure (in some way) the "data purity"  or "information content"  
             of the data that arrives at that node</li>  
         <li>for each possible way of splitting the data that arrives at that node
          <ul>
               <li>  measure and add the "information content" of the child nodes created by the split </li>
               <li> subtract information content of parent</li>
               <li> result is the <i> gain</i> in information content given by split</li>
              <li> update stored "best split" if appropriate</li>
           </ul>
        </li>
        <li> If the gain from the "best" split is above some threshold then change the leaf node to an interior node with the <i>best</i> condition</li>
        </ul>
        </li>
            <li> If the <i>termination criteria</i> are not met, go to step 2 </li>
                     </ol>
<p>These criteria for adding nodes are different from those of the rule induction algorithm, and give you different trees</p>
<p> <b>Interior nodes</b> are equivalent to conditions in a rule  </p>
            <p><b>Leaf Nodes</b> are the outputs (actions of a rule):</p>
            <ul>
                <li>class labels (classification tree), </li>
                <li>equation for predicting values (regression tree)</li>
            </ul>
        </div>
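
One common way to measure purity is Gini impurity (0 for a pure node). In that formulation the gain of a split is the parent's impurity minus the size-weighted impurity of the children. The sketch below uses invented petal-length values and hypothetical function names.

```python
import numpy as np

def gini(labels: np.ndarray) -> float:
    """Gini impurity: 0 for a pure node, larger for more mixed nodes."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return float(1.0 - np.sum(proportions ** 2))

def impurity_reduction(feature, labels, threshold):
    """Gain from splitting on 'feature <= threshold'."""
    left, right = labels[feature <= threshold], labels[feature > threshold]
    weighted_children = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
    return gini(labels) - weighted_children

# Invented values for 8 flowers from 3 classes
petal_length = np.array([1.4, 1.3, 1.5, 4.7, 4.5, 5.1, 6.0, 5.9])
species = np.array([0, 0, 0, 1, 1, 1, 2, 2])

# Try a few candidate thresholds and report the gain of each
for threshold in (2.0, 4.6, 5.5):
    gain = impurity_reduction(petal_length, species, threshold)
    print(f"split at {threshold}: gain = {gain:.3f}")
```

Of these three candidates the split at 2.0 gives the largest gain, because it separates one class cleanly from the rest.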


## Decision trees for our example datasets
using code from sklearn 
`class sklearn.tree.DecisionTreeClassifier(*, criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, ...)`

Like all sklearn models it implements `fit()` and `predict()` methods

Note the default criterion for splitting is the 'gini' index; several others are available, and this is a popular one
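
Because each path from the root to a leaf is a chain of threshold tests, a fitted tree can also be printed as nested rules. A minimal sketch using sklearn's `export_text` follows; the depth limit of 2 is an arbitrary choice to keep the output short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
feature_names = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# Fit a shallow tree so the printed rules stay readable
model = DecisionTreeClassifier(max_depth=2, random_state=1234)
model.fit(X, y)

# Each indented line is a condition; lines starting 'class:' are leaf predictions
rules_text = export_text(model, feature_names=feature_names)
print(rules_text)
```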


In [None]:
from sklearn.tree import DecisionTreeClassifier 
from sklearn import tree

fig,ax = plt.subplots(1,3,figsize=(15,5))
fig.suptitle("Illustration of how Decision Trees select and insert nodes to increase data purity")
for depth in range (1,4):
    my_dt_model = DecisionTreeClassifier(random_state=1234, max_depth=depth,min_samples_split=2,min_samples_leaf=1)
    my_dt_model.fit(train_x,train_y)
    _ = tree.plot_tree(my_dt_model, feature_names=iris_features, class_names= iris_names,filled=True,ax=ax[depth-1])
    ax[depth-1].set_title("Depth "+str(depth))

## Visualising the results using just the petal features

In [None]:
#make and fit new model
two_D_DT_model = DecisionTreeClassifier(max_depth=4)
two_D_DT_model.fit(petals_trainx,train_y)

#call sklearn's built in visualisation
# the model was fitted on the two petal features only, so pass just their names
_ = tree.plot_tree(two_D_DT_model, 
                   feature_names=list(iris_features[2:4]), 
                   class_names= iris_names,
                   filled=True)

#call our bespoke visualisation of the decision surface
w6utils.plot_decision_surface(petals_trainx,train_y,two_D_DT_model,
                              "Decision Tree: simplified outcomes", 
                              iris_features[2:4],step_size=0.1)

## So how do  we learn models?
**Construction**:  add boundaries to make models more complex
- Add examples to kNN
- Repeatedly add nodes to trees, splitting on new variables
- Repeatedly add rules that classify as-yet unclassified data
- Add nodes to an artificial neural network
 
**Perturbation**: Move existing boundaries to change model
- Change value of K or distance function in kNN
- Change rule/treenode thresholds: *if  exam < 40*  &rarr; *if exam < 38*
- Change operators in rules/ tree nodes:  *if exam < 38* &rarr; *if exam &leq; 38*
- Change variables considered in rules/tree nodes: *if exam < 38* &rarr; *if coursework < 38*
- Change weights in MLP
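
As a toy illustration of perturbation, the sketch below moves a single threshold by small random steps and keeps any move that does not reduce training accuracy. The data, starting threshold and step size are all invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented 1-D data: marks out of 100, true boundary at 40
marks = rng.uniform(0, 100, size=200)
passed = (marks >= 40).astype(int)

def accuracy(threshold: float) -> float:
    """Training accuracy of the model 'predict pass if mark >= threshold'."""
    return float(np.mean((marks >= threshold).astype(int) == passed))

threshold = 70.0                  # deliberately poor starting boundary
start_accuracy = accuracy(threshold)
best = start_accuracy

for _ in range(200):
    candidate = threshold + rng.normal(0, 5)  # small random move of the boundary
    score = accuracy(candidate)
    if score >= best:                         # keep moves that do not hurt
        threshold, best = candidate, score

print(f"accuracy went from {start_accuracy:.2f} to {best:.2f}, threshold = {threshold:.1f}")
```

Accepting equal-scoring moves lets the search drift across flat regions of model space rather than getting stuck on them.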


## Summary
Supervised Machine Learning is concerned with learning predictive models from datasets
- Different algorithms use different representations of decision boundaries
- Regions inside the boundaries contain **Class labels** or **(formulas leading to) continuous values** (regression)

Algorithms **fit** models to data by repeatedly:
  - making and testing small changes,  
  - and then selecting the ones that improve accuracy on the training set
  - until some stopping criterion is met

  - They do this by either adding complexity or changing the parameters of an existing model
  - This is equivalent to moving through “model space”

Once the model has been learned (fit) we leave it unchanged  
  - and use it to **predict** the labels for new data points

Next week:   Neural Networks
