---
# <div align="center"><font color='green'> COSC 2673/2793 | Machine Learning  </font></div>
## <div align="center"> <font color='green'> Week 7 Lab Exercises: **Rule Learning**</font></div>
---

# Introduction

In this lab you will:

1. Implement the entropy calculation
2. Implement a simplified propositional rule learning algorithm that outputs rules

*sklearn* does not have an implementation of a rule learner. Instead you will implement a simplified CN2 algorithm.
This algorithm constructs preconditions that contain a single term; that is, a rule precondition contains no conjunctions. You will need to implement functions in Python using simple loops and if-statements. If you are unfamiliar with these, first revise the Python tutorials from Lab01.

This lab only requires Pandas/Numpy to load and work with the data sets, and the math library.

In [None]:
import pandas as pd
import numpy as np
import math

## Datasets

You will be looking at two data sets for this lab which you have seen before:

1. Sailing days
2. Zoo (animal) classification

You can download these from Canvas or the BitBucket code repo.

In [None]:
# sep=r'\s+' splits on any run of whitespace (delim_whitespace is deprecated in recent pandas)
sailData = pd.read_csv('./datasets/sailing-custom-python.txt', sep=r'\s+')
zooData = pd.read_csv('./datasets/zoo-python.txt', sep=r'\s+')

In [None]:
sailData.head()

In [None]:
zooData.head()

Remove unnecessary columns:

In [None]:
zooData = zooData.drop(columns='name')

# Simple Rule Learner
You will develop the simple rule learner over three parts:

1. Entropy calculation function
2. Majority class calculation function
3. Rule learner


## Entropy function

First you will need a function that calculates the entropy of a data set.

In [None]:
def entropy(data, target):
    pass  # TODO: replace with your implementation

**Note:** In Jupyter you need to place the entire function definition in a single input cell. You also need to obey Python's formatting rules for functions (that is, consistent tabs/spaces for indentation).

This function takes two parameters, (1) the data set, and (2) the column name of the output/target class. The function should return the entropy of the data set. 

As a reminder, entropy is:

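$\text{entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$

where $c$ is the number of classes and $p_i$ is the proportion of examples in $S$ that belong to class $i$.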
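
As a quick check of the formula, the sketch below evaluates it for a few hypothetical class counts (not taken from the lab data). The helper name `entropy_from_counts` is an assumption made for illustration; it works on a plain list of counts and is not the `entropy(data, target)` function you are asked to write.

```python
import math

def entropy_from_counts(counts):
    """Entropy (in bits) of a class distribution given as raw counts."""
    total = sum(counts)
    value = 0.0
    for count in counts:
        if count == 0:
            continue  # a class with no examples contributes nothing
        p = count / total
        value -= p * math.log(p, 2)
    return value

print(entropy_from_counts([5, 5]))   # two evenly split classes: 1.0
print(entropy_from_counts([10, 0]))  # a pure set: 0.0
print(entropy_from_counts([9, 5]))   # a mixed set, roughly 0.94
```

An evenly split two-class set has the maximum entropy of 1 bit, while a set containing a single class has entropy 0. The rule learner relies on this second property to decide when to stop.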

The pseudo-code for the entropy calculation is (`$x` denotes the variable `x`):

```
entropy($data, $target):
    $entropy_value = 0
    foreach $value of $target:
        $count = the number of examples in $data where $target == $value
        $p_i = $count / (total number of examples in $data)
        Subtract $p_i * log2($p_i) from $entropy_value
    return $entropy_value
```


The following code-snippets will help in creating the entropy function:
- You can get a count of each of the values of a single attribute using:
    ```
        vCounts = data[target].value_counts()
    ```
    This gives a pandas Series that maps each value of the target column to the number of rows with that value.


- You can iterate through the actual counts by:
    ```
        for value in vCounts:
    ```


- You can iterate through the labels of the value counts array by:
    ```
        for value in vCounts.axes[0]:
    ```


- The following returns all examples in the data frame whose attribute matches the given value:
    ```
        matching = data.loc[data[attribute] == value]
    ```


- The number of rows in a data frame is
    ```
        data.shape[0]
    ```


- The `size` property returns the total number of elements: for a data frame this is rows × columns, while for a single column (a Series) it is the length of that column:
    ```
        data.size
    ```


- The $\log_2$ of a number $x$ is calculated by:
    ```
        math.log(x,2)
    ```
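
To see how these snippets behave together, here is a minimal sketch on a small hypothetical data frame (the `toy` frame and its `Wind`/`Play` columns are invented for illustration and are not one of the lab data sets). It only demonstrates the individual calls; combining them into `entropy` is left to you.

```python
import pandas as pd

# Hypothetical toy data, not one of the lab data sets
toy = pd.DataFrame({
    'Wind': ['high', 'low', 'low', 'high', 'low'],
    'Play': ['no', 'yes', 'yes', 'no', 'no'],
})

vCounts = toy['Play'].value_counts()   # Series: label -> count
for label in vCounts.index:            # same labels as vCounts.axes[0]
    print(label, vCounts[label])

matching = toy.loc[toy['Wind'] == 'low']  # rows where Wind is 'low'
print(matching.shape[0])   # number of matching rows: 3
print(toy.shape[0])        # number of rows: 5
print(toy.size)            # rows * columns: 10
print(toy['Play'].size)    # length of a single column: 5
```

Note the difference between `toy.size` and `toy.shape[0]`: when you need the number of examples, `shape[0]` is the safer choice for a data frame.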

In [None]:
print('Entropy for Sail data: ', entropy(sailData, 'Sail'))
print('Entropy for Zoo: ', entropy(zooData, 'type'))

If you have implemented the entropy function correctly, you should get the following results for the sailing and zoo data sets:

    - entropy(sailData, 'Sail') = 0.9975025463691153
    - entropy(zooData, 'type') = 2.390559682294039

## Majority Class
Second, you will need to implement a function that returns the majority class, that is, the value of the target column that occurs most often in the data set.
This code should be very similar to the entropy calculation.
Use the following as the definition for your function:


In [None]:
def majority_class(data, target):
    pass  # TODO: replace with your implementation

The pseudo-code for finding the majority is:

```
majority_class($data, $target):
    $majority = 0
    $class = ''
    foreach $value of $target:
        $count = the number of examples in $data where $target == $value
        if $count > $majority:
            $majority = $count
            $class = $value
    return $class
```

Alternatively, you can investigate how to use the **idxmax()** function, which is a method of a pandas data frame/series.
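
As a rough illustration of that alternative, the sketch below applies `value_counts()` followed by `idxmax()` to a hypothetical Series of labels (the `labels` data is invented and is not taken from the lab data sets). `idxmax()` returns the index label of the largest value, which for a value-counts Series is the most frequent label.

```python
import pandas as pd

# Hypothetical labels, not taken from the lab data sets
labels = pd.Series(['cat', 'dog', 'cat', 'bird', 'cat', 'dog'])

counts = labels.value_counts()  # cat: 3, dog: 2, bird: 1
print(counts.idxmax())          # index label with the largest count: cat
```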

In [None]:
print('Majority for Sail data Target: ', majority_class(sailData, 'Sail'))
print('Majority for Zoo data Target: ', majority_class(zooData, 'type'))

## Rule Learner

Given the above two functions, it is now possible to implement a simple propositional rule learner.
The features of this rule learner are:

1. The pre-condition of each rule contains a single condition
2. All attributes are treated as categorical
3. The rules are printed to standard output rather than returned


The pseudo-code for this simple propositional rule learner is:
```
simpler_rule_learner($data, $target):
    while $data.shape[0] > 0:
        if entropy($data, $target) == 0:
            print ("otherwise =>", majority_class($data, $target))
            drop all rows in $data
        else:
            $best_entropy = entropy($data, $target)
            $best_attribute = ''
            $best_value = ''
            $best_data = $data
            foreach $attribute of $data:
                foreach $value of $attribute:
                    $data2 = select the examples in $data where $attribute == $value
                    if entropy($data2, $target) < $best_entropy:
                        $best_entropy = entropy($data2, $target)
                        $best_attribute = $attribute
                        $best_value = $value
                        $best_data = $data2
            print($best_attribute, "=", $best_value, "=>",
                    majority_class($best_data, $target))
            drop all rows of $best_data from $data
```

In [None]:
def simpler_rule_learner(data, target):
    pass  # TODO: replace with your implementation

**Hints:**

- You can drop the necessary rows of `$data` by applying the opposite of the condition that was used to create `$best_data`, i.e.
    ```
        data = data.loc[data[best_attribute] != best_value]
    ```


- The following drops all rows of a data frame
    ```
        data = data.iloc[0:0]
    ```
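
The two hints can be tried out on a small hypothetical data frame before you use them in the rule learner (the `toy` frame and its columns are invented for illustration and are not one of the lab data sets):

```python
import pandas as pd

# Hypothetical toy data, not one of the lab data sets
toy = pd.DataFrame({
    'Wind': ['high', 'low', 'low', 'high'],
    'Play': ['no', 'yes', 'yes', 'no'],
})

# Keep only the rows NOT covered by the condition Wind == 'low'
remaining = toy.loc[toy['Wind'] != 'low']
print(remaining.shape[0])   # 2

# Drop every row but keep the columns; shape[0] becomes 0
empty = remaining.iloc[0:0]
print(empty.shape[0])       # 0
print(list(empty.columns))  # ['Wind', 'Play']
```

Both operations return a new data frame, so the result must be assigned back to `data` for the `while` loop condition to see the change.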
    
    

In [None]:
simpler_rule_learner(sailData, 'Sail')

In [None]:
simpler_rule_learner(zooData, 'type')

How do these rules compare to the rules generated by CN2 (try the Orange software demonstrated in the lecture)?


If you have implemented the simple rule learner correctly, you should get the following output:
```
    simpler_rule_learner(sailData, 'Sail')
        Company = big => yes
        Outlook = rainy => no
        Sailboat = small => yes
        Company = med => yes
        otherwise => no
```


```
    simpler_rule_learner(zooData, 'type')
        feathers = Yes => bird
        milk = Yes => mammal
        fins = Yes => fish
        hair = Yes => insect
        airborne = Yes => insect
        legs = 8.0 => invertebrate
        catsize = Yes => reptile
        eggs = No => reptile
        breathes = No => invertebrate
        aquatic = Yes => amphibian
        tail = Yes => reptile
        legs = 0.0 => invertebrate
        otherwise => insect
```

# Sample Solutions
If you are struggling with the first two functions, a sample solution has been provided for these.
Only use this if you have **made your absolute best attempts** at implementing these functions yourself.
The purpose of this lab is to understand common aspects of symbolic machine learning algorithms through the CN2 algorithm.
You will gain significantly less out of this lab if you don't try to solve the problems yourself.