<a href="https://colab.research.google.com/github/sysphcd/PythonProgrammingforData/blob/main/15_2_Decision_Tree_coded_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Coding a simple decision tree
---

In this worksheet we will work with a data set using the idea of a decision tree classifier.  We will simplify the model and write Python code to build a simple decision tree classification model.  We will do this for two reasons:
*   writing the code is often good for helping to understand what is going on under the bonnet of a library function
*   it is a good coding exercise for practice as it mostly depends on calculations and if..elif..else statements

We are going to code a decision tree that uses calculated proportions to decide whether a given row of data would be classified as Iris-virginica, or not, based on sepal and petal dimensions.  It is easier to classify between two values (Iris-virginica or not).  Later, this information can be used to predict species further, using the probability of error.

![Iris-petals and sepals](https://www.math.umd.edu/~petersd/666/html/iris_with_labels.jpg)

The workflow is:
*  divide the data set into 70% of the rows for training and 30% for testing  (we can increase the size of the training set later)
*  find the median for each of the 4 size columns
*  calculate the proportion of values in each column on or above the median that are of a given species (i.e. the proportion of petal-lengths on or above the median that are Iris-virginica)
*  infer the proportion of each that are not of that species (using 1 - proportion above).  In both cases we are checking whether either proportion is 1, from which we could infer that a row is definitely that species or definitely not that species.
*  calculate a Gini Index that will indicate the probability that a prediction will be incorrect
*  use the results of the Gini Index to model a decision tree
*  code the decision tree model into a function that will return whether or not a row in the test set is predicted to be of species Iris-virginica
*  use the decision tree function to predict, for each row in the test set, if the species will be Iris-virginica or not, using a set of nested if statements to classify
*  compare the predicted values against the actual values in the test set - what proportion were predicted correctly?
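
The workflow above mentions a Gini Index without giving the calculation. Here is a minimal sketch, assuming the standard two-class Gini impurity, 1 - p² - (1 - p)², where p is the proportion of rows in a branch that are of the species. The function name `gini_two_class` is made up for this illustration and is not part of the worksheet.

```python
def gini_two_class(p):
    """Gini impurity of a branch where a proportion p of rows are of the species."""
    if not 0 <= p <= 1:
        raise ValueError("p must be a proportion between 0 and 1")
    return 1 - p**2 - (1 - p)**2

# a pure branch (all rows, or no rows, are the species) gives 0
print(gini_two_class(1.0))
print(gini_two_class(0.0))
# an evenly mixed branch gives 0.5, the largest value for two classes
print(gini_two_class(0.5))
```

A value near 0 suggests the branch is a reliable indicator, while a value near 0.5 suggests the branch tells us very little.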


### Exercise 1 - investigate the iris data set
---
Let's start by looking at the data.  We are going to use a data set that contains data on iris flowers.

Read the data at this location: https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv into a dataframe called iris_data

The columns in the CSV file do not have headings.  When you read the file, add column headings like this:
```
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'species']
iris_data = pd.read_csv(url, names=names)
```
*  Take a look at the column info (how many columns, what type of data, any missing data?)
*  Take a look at the data values in the first 10 and the last 10 records to get an idea of the type of values included
*  Find out how many unique values there are in the species column
*  Find out the maximum, minimum, median and upper and lower quartile values in each of the columns


In [2]:
import pandas as pd
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'species']
iris_data = pd.read_csv(url, names=names)
display(iris_data)
print(iris_data.info())
print(type(iris_data))
print("any null values ? = " ,iris_data.isnull().values.any())

Unnamed: 0,sepal-length,sepal-width,petal-length,petal-width,species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal-length  150 non-null    float64
 1   sepal-width   150 non-null    float64
 2   petal-length  150 non-null    float64
 3   petal-width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
None
<class 'pandas.core.frame.DataFrame'>
any null values ? =  False


In [3]:
# Take a look at the data values in the first 10 and the last 10 records to get an idea of the type of values included
print(iris_data.head(10))
print(iris_data.tail(10))

   sepal-length  sepal-width  petal-length  petal-width      species
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
5           5.4          3.9           1.7          0.4  Iris-setosa
6           4.6          3.4           1.4          0.3  Iris-setosa
7           5.0          3.4           1.5          0.2  Iris-setosa
8           4.4          2.9           1.4          0.2  Iris-setosa
9           4.9          3.1           1.5          0.1  Iris-setosa
     sepal-length  sepal-width  petal-length  petal-width         species
140           6.7          3.1           5.6          2.4  Iris-virginica
141           6.9          3.1           5.1          2.3  Iris-virginica
142           5.8  

In [4]:
# Find out how many unique values there are in the species column
iris_data['species'].unique()

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

In [11]:
# Find out the maximum, minimum, median and upper and lower quartile values in each of the columns
iris_data.describe().T
# print('----max-----')
# print(iris_data[['sepal-length', 'sepal-width',  'petal-length',  'petal-width']].max())
# print('----min-----')
# print(iris_data[['sepal-length', 'sepal-width',  'petal-length',  'petal-width']].min())
# print('----median-----')
# print(iris_data[['sepal-length', 'sepal-width',  'petal-length',  'petal-width']].median())
# print('----lower & upper quartile-----')
# print(iris_data[['sepal-length', 'sepal-width',  'petal-length',  'petal-width']].quantile([.25, .75]))


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
sepal-length,150.0,5.843333,0.828066,4.3,5.1,5.8,6.4,7.9
sepal-width,150.0,3.054,0.433594,2.0,2.8,3.0,3.3,4.4
petal-length,150.0,3.758667,1.76442,1.0,1.6,4.35,5.1,6.9
petal-width,150.0,1.198667,0.763161,0.1,0.3,1.3,1.8,2.5


### Exercise 2 - split the data into train and test sets
---

Split the data set into a 70% train, 30% test split.  From now on, just use the train data set.


In [21]:
# import the train_test_split function
from sklearn.model_selection import train_test_split

# split all columns of the data set into train (70%) and test (30%) sets
train, test = train_test_split(iris_data, test_size=0.30, random_state=42)
train.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
sepal-length,105.0,5.842857,0.833304,4.3,5.1,5.8,6.4,7.7
sepal-width,105.0,3.004762,0.414956,2.0,2.8,3.0,3.2,4.2
petal-length,105.0,3.871429,1.720002,1.1,1.7,4.3,5.1,6.7
petal-width,105.0,1.238095,0.744128,0.1,0.4,1.3,1.8,2.5


### Exercise 3 - assumptions and classification
---

Let's make some assumptions based on the data

1.  Iris-setosa, Iris-versicolor, Iris-virginica are the full range of types of iris to be analysed
2.  Although this is a small data set, the means are fairly representative

With these in mind, let's start by classifying sepal/petal size into long/short and wide/narrow with values on or above the mean taken as long or wide and those below as short or narrow.

This is a starting point.  We will be trying to find a value (indicator) for each column where the rows on or above it do not contain any of a particular species.  This might indicate that the column is a good (if rough) indicator of species.

*  Drop any null values from each column

Calculate, and store the means of the four columns

*  **Test**:
Display train.describe() to see the value of the means of the training set. Print the four means and compare to the output of train.describe() to check that they have been calculated correctly.

*  Create a new dataframe with the numeric columns encoded to show a 1 for any value that is on or above the mean for its column and 0 for any that is below.
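
One possible way to do the encoding, sketched here on a small made-up dataframe rather than the real `train` set, is a vectorised comparison of each column against its own mean. The dataframe `toy` and its values are assumptions for illustration only.

```python
import pandas as pd

# toy measurements, made up for this illustration (not taken from the iris data)
toy = pd.DataFrame({'petal-length': [1.4, 4.7, 5.1, 6.0],
                    'petal-width': [0.2, 1.4, 1.8, 2.5]})

encoded = pd.DataFrame(index=toy.index)
for col in toy.columns:
    # True where the value is on or above the column mean, converted to 1/0
    encoded[col] = (toy[col] >= toy[col].mean()).astype(int)

print(encoded)
```

Comparing a whole column at once avoids calling a function for every row, which matters little for 105 rows but is the more usual pandas style.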




In [16]:
# print("any null values ? = " ,iris_data.isnull().values.any())
print("any null values in sepal-length ? = " ,train['sepal-length'].isnull().values.any())
print("any null values in sepal-width ? = " ,train['sepal-width'].isnull().values.any())
print("any null values in petal-length ? = " ,train['petal-length'].isnull().values.any())
print("any null values in petal-width ? = " ,train['petal-width'].isnull().values.any())

# train.dropna(subset = ["sepal-length", "sepal-width","petal-length","petal-width"])
train.shape, test.shape

any null values in sepal-length ? =  False
any null values in sepal-width ? =  False
any null values in petal-length ? =  False
any null values in petal-width ? =  False


((105, 5), (45, 5))

In [8]:

# get the mean for each column and apply a function to encode into 1 (on or above mean) and 0 (below mean)
def encode(df, **kwds):
  key = kwds['key']
  indicator = kwds['indicator']
  if df[key] >= indicator:
    return 1  
  return 0

    
sepal_length_mean = train['sepal-length'].mean()
sepal_width_mean = train['sepal-width'].mean()
petal_length_mean = train['petal-length'].mean()
petal_width_mean = train['petal-width'].mean()    

# run the function for each column so that each of the four columns are encoded, then drop the original columns, saving as a new dataframe
train['encodedmean-sl'] = train.apply(encode, axis=1, key='sepal-length', indicator=sepal_length_mean)
train['encodedmean-sw'] = train.apply(encode, axis=1, key='sepal-width', indicator=sepal_width_mean)
train['encodedmean-pl'] = train.apply(encode, axis=1, key='petal-length', indicator=petal_length_mean)
train['encodedmean-pw'] = train.apply(encode, axis=1, key='petal-width', indicator=petal_width_mean)
# encodedmean_train = train[(train['species']=='Iris-virginica')][['species','encodedmean-sl','encodedmean-sw','encodedmean-pl','encodedmean-pw']]
# drop returns a new dataframe without the original size columns, leaving train unchanged
encodedmean_train = train.drop(['sepal-length', 'sepal-width', 'petal-length', 'petal-width'], axis=1)
# display(train) 
display(encodedmean_train)

Unnamed: 0,species,encodedmean-sl,encodedmean-sw,encodedmean-pl,encodedmean-pw
35,Iris-setosa,0,1,0,0
71,Iris-versicolor,1,0,1,1
111,Iris-virginica,1,0,1,1
21,Iris-setosa,0,1,0,0
26,Iris-setosa,0,1,0,0
...,...,...,...,...,...
41,Iris-setosa,0,0,0,0
126,Iris-virginica,1,0,1,1
31,Iris-setosa,0,1,0,0
70,Iris-versicolor,1,1,1,1


### Exercise 4 - Calculate the proportion of values on or above the mean that are of each species

We are going to focus on the `Iris-virginica` species first.

First we will calculate, for each dimension column (`sepal-length, sepal-width, petal-length, petal-width`) what proportion of values in that column, where the value is on or above the mean, are classified as `Iris-virginica`.

We will do this by filtering all the records in each column of the `train` set that are on or above the mean and match the species.  Then use the outcome to calculate the proportion of values in the `train` set on or above the mean that are of species `Iris-virginica`.

*  filter for values in the `sepal-length` column being on or above the mean and the species column being `Iris-virginica`.  Then divide the count of rows in this filtered dataset by the count of rows in a second data set, filtered for just the value being on or above the mean.

*  Do this for all four columns, for `Iris-virginica`  (4 operations).

Print the results to see which columns look like they might most reliably predict the species as `Iris-virginica` (the result is as close as possible to 1).  The highest numbers may indicate the most reliable indicators, but we will do some more work before coming to this conclusion.

*  By definition, the proportion of those on or above the mean that are NOT Iris-virginica will be `1 - the proportion of those that are`.  Calculate these.

The first one has been done for you.

*  We will also need the proportion of those BELOW the mean that are NOT Iris-virginica.  Calculate these in the same way.
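
The three proportions described in this exercise can be sketched with boolean masks. This is an illustration on a tiny made-up encoded column, not the real training data; the names `toy`, `is_virginica` and `above` are assumptions.

```python
import pandas as pd

# toy encoded column (1 = on or above mean), made up for this illustration
toy = pd.DataFrame({
    'species': ['Iris-virginica', 'Iris-virginica', 'Iris-versicolor',
                'Iris-setosa', 'Iris-setosa', 'Iris-virginica'],
    'encodedmean-pl': [1, 1, 1, 0, 0, 0],
})

is_virginica = toy['species'] == 'Iris-virginica'
above = toy['encodedmean-pl'] == 1

# the mean of a boolean column is the proportion of True values
prop_above_virginica = is_virginica[above].mean()     # on or above mean AND virginica
prop_above_not = 1 - prop_above_virginica             # on or above mean, NOT virginica
prop_below_not = (~is_virginica[~above]).mean()       # below mean, NOT virginica

print(prop_above_virginica, prop_above_not, prop_below_not)
```

In this toy frame two of the three rows on or above the mean are Iris-virginica, and two of the three rows below the mean are not.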



In [17]:
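def cal_proportion_abovemean(df, specie, part_field):
  # rows where the encoded column is 1 (value on or above the mean)
  above_mean = df[df[part_field] == 1]
  # of those rows, keep the ones that are of the given species
  results = above_mean[above_mean['species'] == specie]

  species_count = results['species'].count()
  # proportion of the on-or-above-mean rows that are of the given species
  proportion = species_count / above_mean.shape[0]

  print("------------------------", part_field , " " , specie , "------------------------")
  # display(results)
  print("  number of row = ", species_count)
  return results, proportion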


print("-----------------------------------value >= mean-----------------------------------")
df_col1_virginica,proportion_col1_virginica = cal_proportion_abovemean(encodedmean_train, 'Iris-virginica', 'encodedmean-sl')
df_col2_virginica,proportion_col2_virginica = cal_proportion_abovemean(encodedmean_train, 'Iris-virginica', 'encodedmean-sw') 
df_col3_virginica,proportion_col3_virginica = cal_proportion_abovemean(encodedmean_train, 'Iris-virginica', 'encodedmean-pl') 
df_col4_virginica,proportion_col4_virginica = cal_proportion_abovemean(encodedmean_train, 'Iris-virginica', 'encodedmean-pw') 

-----------------------------------value >= mean-----------------------------------
------------------------ encodedmean-sl   Iris-virginica ------------------------
  number of row =  31
------------------------ encodedmean-sw   Iris-virginica ------------------------
  number of row =  9
------------------------ encodedmean-pl   Iris-virginica ------------------------
  number of row =  35
------------------------ encodedmean-pw   Iris-virginica ------------------------
  number of row =  35


### Exercise 5 - Calculate the proportion of each column where the value is below median that are of species `Iris-virginica`

Repeat the code above, this time looking for values below the mean

In [19]:
# calculate the proportion of results where the value is below median that are of the species Iris-virginica

def encode_median(df, **kwds):
  key = kwds['key']
  indicator = kwds['indicator']
  if df[key] < indicator:
    return 1
  else: 
    return 0

def filter_median_data(df, specie, col, new_col):
  median = df[df['species']==specie][col].median()
  results = df.apply(encode_median, axis=1, key=col, indicator=median)
  return median, results

sl_median_virginica, sl__virginica_result = filter_median_data(train, 'Iris-virginica', 'sepal-length', 'encodedmedian-sl')
sw_median_virginica, sw__virginica_result = filter_median_data(train, 'Iris-virginica', 'sepal-width', 'encodedmedian-sw')
pl_median_virginica, pl__virginica_result = filter_median_data(train, 'Iris-virginica', 'petal-length', 'encodedmedian-pl')
pw_median_virginica, pw__virginica_result = filter_median_data(train, 'Iris-virginica', 'petal-width', 'encodedmedian-pw')

print(sl__virginica_result)
# sepal_length_median = train[train['species']=='Iris-virginica']['sepal-length'].median()
# sepal_width_median = train[train['species']=='Iris-virginica']['sepal-width'].median()
# petal_length_median = train[train['species']=='Iris-virginica']['petal-length'].median()
# petal_width_median = train[train['species']=='Iris-virginica']['petal-width'].median()
# # train[(train['species']=='Iris-virginica')][['species','sepal-length']]    

# # run the function for each column so that each of the four columns are encoded, then drop the original columns, saving as a new dataframe
# train['encodedmedian-sl'] = train.apply(encode_median, axis=1, key='sepal-length', indicator=sepal_length_median)
# train['encodedmedian-sw'] = train.apply(encode_median, axis=1, key='sepal-width', indicator=sepal_width_median)
# train['encodedmedian-pl'] = train.apply(encode_median, axis=1, key='petal-length', indicator=petal_length_median)
# train['encodedmedian-pw'] = train.apply(encode_median, axis=1, key='petal-width', indicator=petal_width_median)


# encodedmedian_train = train[(train['species']=='Iris-virginica')][['species','encodedmedian-sl','encodedmedian-sw','encodedmedian-pl','encodedmedian-pw']]
# encodedmedian_train.drop('sepal-length', axis=1, inplace=True) #inplace=True : specifies the drop operation to be in same dataframe rather creating a copy of the dataframe after drop.
# encodedmedian_train.drop('sepal-width', axis=1, inplace=True)
# encodedmedian_train.drop('petal-length', axis=1, inplace=True)
# encodedmedian_train.drop('petal-width', axis=1, inplace=True)
# encodedmedian_train.drop('encodedmean-sl', axis=1, inplace=True) #inplace=True : specifies the drop operation to be in same dataframe rather creating a copy of the dataframe after drop.
# encodedmedian_train.drop('encodedmean-sw', axis=1, inplace=True)
# encodedmedian_train.drop('encodedmean-pl', axis=1, inplace=True)
# encodedmedian_train.drop('encodedmean-pw', axis=1, inplace=True)
# display(train) 
# display(encodedmedian_train)
# print("number of rows = ", encodedmedian_train['species'].count())

36     1
92     1
45     1
131    0
14     1
      ..
91     1
50     0
12     1
1      1
35     1
Length: 105, dtype: int64


### Exercise 5 - calculate for the other two Iris species
---

Do the same calculations for the Iris-versicolor species, then for the Iris-setosa species.






In [20]:
print("-----------------------------------value >= mean-----------------------------------")
df_col1_versicolor,proportion_col1_versicolor = cal_proportion_abovemean(encodedmean_train, 'Iris-versicolor', 'encodedmean-sl') 
df_col2_versicolor,proportion_col2_versicolor = cal_proportion_abovemean(encodedmean_train, 'Iris-versicolor', 'encodedmean-sw') 
df_col3_versicolor,proportion_col3_versicolor = cal_proportion_abovemean(encodedmean_train, 'Iris-versicolor', 'encodedmean-pl') 
df_col4_versicolor,proportion_col4_versicolor = cal_proportion_abovemean(encodedmean_train, 'Iris-versicolor', 'encodedmean-pw') 

df_col1_setosa,proportion_col1_setosa = cal_proportion_abovemean(encodedmean_train, 'Iris-setosa', 'encodedmean-sl') 
df_col2_setosa,proportion_col2_setosa = cal_proportion_abovemean(encodedmean_train, 'Iris-setosa', 'encodedmean-sw') 
df_col3_setosa,proportion_col3_setosa = cal_proportion_abovemean(encodedmean_train, 'Iris-setosa', 'encodedmean-pl') 
df_col4_setosa,proportion_col4_setosa = cal_proportion_abovemean(encodedmean_train, 'Iris-setosa', 'encodedmean-pw') 


-----------------------------------value >= mean-----------------------------------
------------------------ encodedmean-sl   Iris-versicolor ------------------------
  number of row =  20
------------------------ encodedmean-sw   Iris-versicolor ------------------------
  number of row =  6
------------------------ encodedmean-pl   Iris-versicolor ------------------------
  number of row =  29
------------------------ encodedmean-pw   Iris-versicolor ------------------------
  number of row =  28
------------------------ encodedmean-sl   Iris-setosa ------------------------
  number of row =  0
------------------------ encodedmean-sw   Iris-setosa ------------------------
  number of row =  31
------------------------ encodedmean-pl   Iris-setosa ------------------------
  number of row =  0
------------------------ encodedmean-pw   Iris-setosa ------------------------
  number of row =  0


### Exercise 6 - predict from the results
---

Create a list of dictionaries from the results of Exercises 4 and 5 (e.g. {'species':..., 'above_mean': 0.xx, 'below_mean': 0.xx})  

Then use a loop to go through the list and print:  
*  any species and indicator (above or below mean) that can reliably be predicted.  A reliable prediction may be one over 0.5
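
The loop described above might look something like this sketch. The proportions are placeholders invented to show the structure, NOT values calculated from the iris data, and the extra `'column'` key is an assumption added so each result can be traced back to a measurement.

```python
# placeholder proportions, NOT calculated from the iris data
results = [
    {'species': 'Iris-virginica', 'column': 'petal-length', 'above_mean': 0.55, 'below_mean': 0.0},
    {'species': 'Iris-setosa', 'column': 'petal-length', 'above_mean': 0.0, 'below_mean': 0.75},
    {'species': 'Iris-versicolor', 'column': 'sepal-width', 'above_mean': 0.2, 'below_mean': 0.45},
]

reliable = []
for row in results:
    for indicator in ('above_mean', 'below_mean'):
        # treat anything over 0.5 as a reliable prediction
        if row[indicator] > 0.5:
            reliable.append((row['species'], row['column'], indicator))
            print(row['species'], row['column'], indicator, row[indicator])
```

Replace the placeholder numbers with the proportions you calculated before drawing any conclusions.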

In [None]:
# show which columns are reliable predictors



### Exercise 6 - Make a decision tree
---

Use pencil and paper or a graphical application to create a decision tree for Iris-virginica, using the following rules (use the picture below as a guide):

*  The column with the highest indicator is placed at the top
*  The columns with the next highest indicators are placed in order below it
*  Any remaining columns are placed in order below these

Any column where one branch (on or above mean OR below mean) has an indicator of 0 could be classified as a strong indicator of Iris-virginica being the species.  Anything else, unless there is something very close to 0, could be classified as a weak indicator of Iris-virginica being the species.

Let's code the decision tree using the following logic for this decision tree (yours might be slightly different):

![Decision tree](https://drive.google.com/uc?id=1CTo23EHwR2IPCRjcfSyCQsT_oQ5Exwso)

In the decision tree above, there is no certainty below petal-length so our decision tree will only include petal-width and petal-length.




In [None]:
def get_species(df):
  # return None if petal-width or petal-length is below the mean, otherwise return 'Iris-virginica'
  # the test rows hold raw measurements (not 0/1 encoded), so compare them with the means calculated from the train set
  if df['petal-width'] < petal_width_mean:
    return None
  if df['petal-length'] < petal_length_mean:
    return None
  return 'Iris-virginica'

# use the get_species(df) function to predict the species, count how many are predicted correctly and use this to calculate the proportion correct
correct = 0
test_size = test.shape[0]
for i in range(0, test_size):
  predicted_virginica = get_species(test.iloc[i]) == 'Iris-virginica'
  actual_virginica = test.iloc[i]['species'] == 'Iris-virginica'
  # a prediction is correct when both agree, including rows correctly predicted as not Iris-virginica
  if predicted_virginica == actual_virginica:
    correct += 1

print("Proportion correctly identified", correct / test_size)


### Exercise 7 - change the measure

We are currently using the mean to act as the decision making line.  We can use the decision tree with a different line.

Change the mean values so that you are using the median instead for all four columns.  The code should not need changing except for where you calculated the mean.

Run all the code again.  Is the proportion of correct values better this time?   Is the decision tree still appropriate?


What do you notice? (write your answer here)

### Exercise 8 - try different measures
---

Do the same again but with the upper quartile, then again with the lower quartile.  Does it make any difference?  Which gives the best looking results?
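
One way to switch between decision lines without rewriting each calculation is a small helper that returns the chosen measure for a column. This is a sketch only: the function name `get_indicator`, its measure labels and the dataframe `toy` are made up for illustration.

```python
import pandas as pd

# toy column, made up for this illustration
toy = pd.DataFrame({'petal-length': [1.0, 2.0, 3.0, 4.0, 5.0]})

def get_indicator(df, col, measure):
    """Return the decision line for a column using the chosen measure."""
    if measure == 'mean':
        return df[col].mean()
    if measure == 'median':
        return df[col].median()
    if measure == 'upper':
        return df[col].quantile(0.75)   # upper quartile
    if measure == 'lower':
        return df[col].quantile(0.25)   # lower quartile
    raise ValueError(f"unknown measure: {measure}")

for measure in ('mean', 'median', 'upper', 'lower'):
    print(measure, get_indicator(toy, 'petal-length', measure))
```

With a helper like this, only the `measure` argument needs to change between runs.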

### Exercise 9 - try a different species

Run the mean test again for the Iris-versicolor species.  Again, try some different decision making lines.

What are the results?  Record them in the text cell below:

Write your answers here:  

# New logic introduced in this worksheet:

1.  Adding headings to a CSV if none currently exist
2.  Splitting a data set into train and test sets

In [None]:
## this type of plot will show the distribution on a chart, with lines marking the mean petal-length and mean petal-width
from plotnine import ggplot, aes, geom_point, geom_vline, geom_hline
(ggplot(train, aes(x='petal-length', y='petal-width', color='species'))
 + geom_point()
 + geom_vline(xintercept=train['petal-length'].mean())
 + geom_hline(yintercept=train['petal-width'].mean()))