# Hands-on introduction to ML training
In this notebook, we will look at another classification model: the Decision Tree. We will tackle the same animal classification problem as before, but this time with all 7 classes.

We will visualise the trained decision tree at the end of this notebook.

### Step 1: Load and explore data
The first step is to identify the data source. In this case we will use a pre-existing dataset. We will:
1. Create a folder named `data`
2. Download the file from a public GitHub repo using the Python package `requests` and save it as `zoo.csv` in the `data` folder.

In [1]:
%config IPCompleter.greedy=True #Helps with auto-complete

import numpy as np
import pandas as pd
import os

try:
    os.mkdir('data')
except OSError as error:
    print(error)

import requests, csv

url = 'https://raw.githubusercontent.com/techno-nerd/ML_101_Course/main/05%20Decision%20Tree/data/zoo.csv'
r = requests.get(url)
r.raise_for_status()  #Stop early if the download failed

#newline='' stops csv.writer from adding blank rows on Windows
with open('data/zoo.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for line in r.iter_lines():
        writer.writerow(line.decode('utf-8').split(','))

[Errno 17] File exists: 'data'


In [2]:
df = pd.read_csv('data/zoo.csv')

In [3]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101 entries, 0 to 100
Data columns (total 18 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   animal_name  101 non-null    object
 1   hair         101 non-null    int64 
 2   feathers     101 non-null    int64 
 3   eggs         101 non-null    int64 
 4   milk         101 non-null    int64 
 5   airborne     101 non-null    int64 
 6   aquatic      101 non-null    int64 
 7   predator     101 non-null    int64 
 8   toothed      101 non-null    int64 
 9   backbone     101 non-null    int64 
 10  breathes     101 non-null    int64 
 11  venomous     101 non-null    int64 
 12  fins         101 non-null    int64 
 13  legs         101 non-null    int64 
 14  tail         101 non-null    int64 
 15  domestic     101 non-null    int64 
 16  catsize      101 non-null    int64 
 17  class_type   101 non-null    int64 
dtypes: int64(17), object(1)
memory usage: 14.3+ KB
None


In [4]:
print(df[:5])

  animal_name  hair  feathers  eggs  milk  airborne  aquatic  predator  \
0    aardvark     1         0     0     1         0        0         1   
1    antelope     1         0     0     1         0        0         0   
2        bass     0         0     1     0         0        1         1   
3        bear     1         0     0     1         0        0         1   
4        boar     1         0     0     1         0        0         1   

   toothed  backbone  breathes  venomous  fins  legs  tail  domestic  catsize  \
0        1         1         1         0     0     4     0         0        1   
1        1         1         1         0     0     4     1         0        1   
2        1         1         0         0     1     0     1         0        0   
3        1         1         1         0     0     4     0         0        1   
4        1         1         1         0     0     4     1         0        1   

   class_type  
0           1  
1           1  
2           4  
3           1  
4           1  

In [5]:
print(df['class_type'].value_counts())

#1 = Mammal, 2 = Bird, 3 = Reptile, 4 = Fish, 5 = Amphibian, 6 = Bug, 7 = Invertebrate

class_type
1    41
2    20
4    13
7    10
6     8
3     5
5     4
Name: count, dtype: int64
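
The numeric codes are easier to read once they are mapped to names. The sketch below uses the mapping from the comment above on a small made-up series; `CLASS_NAMES` and `toy_class_type` are illustrative names that do not appear in the notebook, and in practice you would pass `df['class_type']` instead.

```python
import pandas as pd

# Mapping copied from the comment in the cell above
CLASS_NAMES = {
    1: "Mammal", 2: "Bird", 3: "Reptile", 4: "Fish",
    5: "Amphibian", 6: "Bug", 7: "Invertebrate",
}

# Toy stand-in for df['class_type'] (made-up values, for illustration only)
toy_class_type = pd.Series([1, 1, 2, 4, 7, 1], name="class_type")

# Replace each code with its name, then count rows per class
named_counts = toy_class_type.map(CLASS_NAMES).value_counts()
print(named_counts)
```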


### Step 2: Data preparation

There are a few tasks we need to do before we can train the model on this data:
1. Improve the representation of rare classes by duplicating their rows (oversampling)
2. Ignore unnecessary columns like `animal_name`

Then, we will split the data the same way as last time:
1. Split the data (170 rows after oversampling) into a training set (80%) and a test set (20%)
2. Separate the input features (attributes of the animal) from the label (`class_type`)
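
The oversampling cells in this step all repeat one pattern: filter the rows of a class, then append copies of them with `pd.concat`. As a minimal sketch, the pattern could be wrapped in a helper. `oversample_class` is a hypothetical name, not part of the notebook, and the toy frame stands in for `df`.

```python
import pandas as pd

def oversample_class(frame, label_col, class_value, copies):
    """Return frame plus `copies` extra copies of the rows labelled class_value."""
    rows = frame[frame[label_col] == class_value]
    return pd.concat([frame] + [rows] * copies, axis=0, ignore_index=True)

# Toy frame (made-up values): two rows of class 1, two rows of class 3
toy = pd.DataFrame({"hair": [1, 0, 0, 1], "class_type": [1, 3, 3, 1]})

# copies=2 triples class 3: the 2 original rows plus 2 * 2 appended rows
tripled = oversample_class(toy, "class_type", 3, copies=2)
print(tripled["class_type"].value_counts())
```

With this helper, "tripling" a class corresponds to `copies=2` and "doubling" to `copies=1`.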

In [6]:
#Tripling class 3
temp1 = df[df['class_type'] == 3]

print(temp1.shape)

(5, 18)


In [7]:
df = pd.concat([df, temp1, temp1], axis=0, ignore_index=True)
print(df['class_type'].value_counts())

class_type
1    41
2    20
3    15
4    13
7    10
6     8
5     4
Name: count, dtype: int64


In [8]:
#Tripling class 5
temp2 = df[df['class_type'] == 5]
df = pd.concat([df, temp2, temp2], axis=0, ignore_index=True)
print(df['class_type'].value_counts())


class_type
1    41
2    20
3    15
4    13
5    12
7    10
6     8
Name: count, dtype: int64


In [9]:
#Tripling class 6
temp3 = df[df['class_type'] == 6]
df = pd.concat([df, temp3, temp3], axis=0, ignore_index=True)
print(df['class_type'].value_counts())


class_type
1    41
6    24
2    20
3    15
4    13
5    12
7    10
Name: count, dtype: int64


In [10]:
#Doubling class 7
temp4 = df[df['class_type'] == 7]
df = pd.concat([df, temp4], axis=0, ignore_index=True)
print(df['class_type'].value_counts())

class_type
1    41
6    24
2    20
7    20
3    15
4    13
5    12
Name: count, dtype: int64


In [11]:
#Doubling class 5
temp5 = df[df['class_type'] == 5]
df = pd.concat([df, temp5], axis=0, ignore_index=True)
print(df['class_type'].value_counts())

class_type
1    41
6    24
5    24
2    20
7    20
3    15
4    13
Name: count, dtype: int64


In [12]:
#Doubling class 4
temp6 = df[df['class_type'] == 4]
df = pd.concat([df, temp6], axis=0, ignore_index=True)
print(df['class_type'].value_counts())

class_type
1    41
4    26
6    24
5    24
2    20
7    20
3    15
Name: count, dtype: int64


In [13]:
#Taking all integer columns as features (all are 0/1 flags except legs, which is a count), except class_type, which is the label
features = df.select_dtypes(include=['int64']).drop('class_type', axis=1)
features[:5]

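|   | hair | feathers | eggs | milk | airborne | aquatic | predator | toothed | backbone | breathes | venomous | fins | legs | tail | domestic | catsize |
|---|------|----------|------|------|----------|---------|----------|---------|----------|----------|----------|------|------|------|----------|---------|
| 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 4 | 0 | 0 | 1 |
| 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 4 | 1 | 0 | 1 |
| 2 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 3 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 4 | 0 | 0 | 1 |
| 4 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 4 | 1 | 0 | 1 |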


In [14]:
labels = df.class_type
labels[:5]

0    1
1    1
2    4
3    1
4    1
Name: class_type, dtype: int64

In [15]:
import sklearn.model_selection as ms

train_features, test_features, train_labels, test_labels = ms.train_test_split(features, labels, test_size=0.2)
print(train_features.shape)
print(test_features.shape)
print(train_labels.shape)
print(test_labels.shape)

(136, 16)
(34, 16)
(136,)
(34,)
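
One caveat: because the rows were duplicated before splitting, copies of the same animal can land in both the training and the test set, which tends to inflate test accuracy. A possible alternative, sketched here on made-up data and not what this notebook does, is to split first and oversample only the training set. The `stratify` and `random_state` arguments are real `train_test_split` parameters; the toy frame and variable names are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (made-up): 20 rows, class 2 is the minority
toy = pd.DataFrame({
    "feature": range(20),
    "class_type": [1] * 15 + [2] * 5,
})

# stratify keeps class proportions similar in both sets;
# random_state makes the split repeatable
train_df, test_df = train_test_split(
    toy, test_size=0.2, stratify=toy["class_type"], random_state=42
)

# Triple the minority class in the training set only
minority = train_df[train_df["class_type"] == 2]
train_balanced = pd.concat([train_df, minority, minority], ignore_index=True)

# Check that no original row appears in both sets
overlap = set(train_balanced["feature"]) & set(test_df["feature"])
print(len(train_balanced), len(test_df), len(overlap))
```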


### Step 3: Model Selection and Training

Instead of using 7 different Logistic Regression models, we will use a Decision Tree.

In [16]:
from sklearn.tree import DecisionTreeClassifier

dtree = DecisionTreeClassifier(min_samples_leaf=5)
dtree.fit(train_features, train_labels)
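
To get a feel for what the tree optimises: by default, scikit-learn's `DecisionTreeClassifier` scores candidate splits with Gini impurity (`criterion="gini"`). The sketch below is a simplified illustration of that measure on made-up labels, not scikit-learn's actual implementation; `gini` and `split_gini` are hypothetical helper names.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_gini(left, right):
    """Impurity of a split: size-weighted average of the two children."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Made-up node containing three mammals (1), two birds (2) and one fish (4)
parent = [1, 1, 1, 2, 2, 4]

# Hypothetical split on 'feathers': birds go right, everything else goes left
left, right = [1, 1, 1, 4], [2, 2]

parent_impurity = gini(parent)
child_impurity = split_gini(left, right)
print(parent_impurity, child_impurity)
```

A split is useful when the children are purer than the parent, that is, when the weighted impurity drops.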

### Step 4: Model evaluation and tuning
For logistic regression we used three metrics. For multi-class classification we will use just one: accuracy (correct predictions / total predictions).

Precision and Recall can be calculated for each class, but we will not do it for this exercise. If you would like to, use the code from 04 Classification to get the Precision and Recall for each class.
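
As an alternative to reusing the earlier code, scikit-learn can compute per-class precision and recall directly. This is a minimal sketch on made-up labels; in the notebook you would pass `test_labels` and `test_predictions` instead of the toy lists.

```python
from sklearn.metrics import precision_recall_fscore_support

# Toy stand-ins for test_labels / test_predictions (made-up values)
toy_labels      = [1, 1, 2, 2, 4, 4, 7]
toy_predictions = [1, 1, 2, 4, 4, 4, 7]

classes = [1, 2, 4, 7]
# zero_division=0 avoids warnings for classes that are never predicted
precision, recall, _, support = precision_recall_fscore_support(
    toy_labels, toy_predictions, labels=classes, zero_division=0
)

for c, p, r, s in zip(classes, precision, recall, support):
    print(f"class {c}: precision={p:.2f} recall={r:.2f} support={s}")
```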

In [17]:
train_predictions = dtree.predict(train_features)
test_predictions = dtree.predict(test_features)

In [18]:
print(train_predictions[:5], train_labels[:5])

[1 7 5 4 6] 54     1
72     7
111    5
158    4
42     6
Name: class_type, dtype: int64


In [19]:
def accuracy(labels, predictions):
    #Fraction of predictions that match the true labels
    correct = (labels == predictions).sum()
    return correct / labels.size
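
The hand-written metric can be cross-checked against scikit-learn's `accuracy_score`. The sketch below uses made-up labels and predictions in place of the notebook's variables.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score

# Toy stand-ins (made-up values): 4 of the 5 predictions are correct
toy_labels = pd.Series([1, 2, 4, 7, 1])
toy_predictions = np.array([1, 2, 4, 6, 1])

# Same calculation as the accuracy function: correct / total
manual = (toy_labels == toy_predictions).sum() / toy_labels.size
library = accuracy_score(toy_labels, toy_predictions)
print(manual, library)
```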

In [20]:
print(f"Train accuracy: {accuracy(train_labels, train_predictions)}")

Train accuracy: 0.9779411764705882


In [21]:
print(f"Test accuracy: {accuracy(test_labels, test_predictions)}")

Test accuracy: 0.8823529411764706


In [22]:
print(features.columns)

Index(['hair', 'feathers', 'eggs', 'milk', 'airborne', 'aquatic', 'predator',
       'toothed', 'backbone', 'breathes', 'venomous', 'fins', 'legs', 'tail',
       'domestic', 'catsize'],
      dtype='object')


In [23]:
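#Try your own animal:
#Remember, the values must be given in column order (0 is no, 1 is yes; legs is a count)

#Testing for bat

custom_animal = [1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0]
#Wrap the values in a one-row DataFrame so the column names match those used in training
custom_animal = pd.DataFrame([custom_animal], columns=features.columns)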
print(custom_animal.shape)

pred = dtree.predict(custom_animal)

print(pred)

#1 = Mammal, 2 = Bird, 3 = Reptile, 4 = Fish, 5 = Amphibian, 6 = Bug, 7 = Invertebrate

(1, 16)
[1]




### Step 5: Model visualisation

In [24]:
from sklearn.tree import export_graphviz
import graphviz

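# Export the decision tree as a DOT-format string (out_file=None returns it instead of writing a file)
# class_names must follow the ascending order of the class labels (1 to 7)
dot_data = export_graphviz(dtree, out_file=None,
                          feature_names=list(features.columns),
                          class_names=['Mammal', 'Bird', 'Reptile', 'Fish', 'Amphibian', 'Bug', 'Invertebrate'],
                          filled=True, rounded=True,
                          special_characters=True)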

In [25]:
graph = graphviz.Source(dot_data)
graph.render("resources/zoo_dtree")  # This will create a PDF file with the tree visualization

'resources/zoo_dtree.pdf'

In [26]:
graph.view()

'resources/zoo_dtree.pdf'
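
If Graphviz is not installed, scikit-learn's `export_text` prints a tree as indented text instead. The sketch below trains a small tree on made-up data so that it runs on its own; `toy_features`, `toy_labels` and `toy_tree` are illustrative names, and in the notebook you would pass `dtree` and `list(features.columns)`.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data (made-up): mammals give milk, birds have feathers, fish have neither
toy_features = pd.DataFrame({
    "feathers": [0, 0, 1, 1, 0, 0],
    "milk":     [1, 1, 0, 0, 0, 0],
})
toy_labels = [1, 1, 2, 2, 4, 4]

toy_tree = DecisionTreeClassifier(random_state=0).fit(toy_features, toy_labels)

# One line per node: split conditions on the branches, predicted class on the leaves
tree_text = export_text(toy_tree, feature_names=list(toy_features.columns))
print(tree_text)
```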