# Hands-on introduction to ML training
In this notebook we will tackle a different kind of problem: classification.

The zoo dataset used here labels each animal with a class_type. For this lesson, we will convert it to a simple binary classification problem: mammal or not a mammal.

### Step 1: Load and explore data
The first step is figuring out the data source. In this case we will use a pre-existing dataset. We will:
1. Create a folder 'data'
2. Download the file from a public GitHub repo using the Python package "requests" and save it as zoo.csv in the data folder.

In [1]:
%config IPCompleter.greedy=True #Helps with auto-complete

import numpy as np
import pandas as pd
import os

try:
    os.mkdir('data')
except OSError as error:
    print(error)

import requests, csv

url = 'https://raw.githubusercontent.com/techno-nerd/ML_101_Course/main/04%20Classification/data/zoo.csv'
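r = requests.get(url)
r.raise_for_status() #Stop here if the download failed
#newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open('data/zoo.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for line in r.iter_lines():
        writer.writerow(line.decode('utf-8').split(','))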

[Errno 17] File exists: 'data'
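The `[Errno 17] File exists: 'data'` message above is harmless: it only means the folder was already there from an earlier run. As a minimal sketch, assuming you would rather not see the message at all, `os.makedirs(..., exist_ok=True)` creates the folder only when it is missing. The helper name `save_text` is hypothetical and not part of the notebook, and the demo writes two made-up rows to a temporary folder so that it runs without a network connection.

```python
import os
import tempfile


def save_text(text, folder, filename):
    """Create `folder` if it is missing, write `text` to folder/filename, return the path."""
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    path = os.path.join(folder, filename)
    # newline="" writes the text exactly as given, without translating line endings
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)
    return path


# Demo on a temporary folder with made-up rows in a zoo.csv-like layout
demo_root = tempfile.mkdtemp()
demo_folder = os.path.join(demo_root, "data")
sample = "animal_name,hair,milk\naardvark,1,1\n"
first_path = save_text(sample, demo_folder, "zoo.csv")
second_path = save_text(sample, demo_folder, "zoo.csv")  # second call does not raise
with open(first_path, encoding="utf-8") as f:
    print(f.read())
```

In the notebook, the same helper could be called as `save_text(requests.get(url).text, 'data', 'zoo.csv')`.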


In [2]:
df = pd.read_csv('data/zoo.csv')

In [3]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101 entries, 0 to 100
Data columns (total 18 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   animal_name  101 non-null    object
 1   hair         101 non-null    int64 
 2   feathers     101 non-null    int64 
 3   eggs         101 non-null    int64 
 4   milk         101 non-null    int64 
 5   airborne     101 non-null    int64 
 6   aquatic      101 non-null    int64 
 7   predator     101 non-null    int64 
 8   toothed      101 non-null    int64 
 9   backbone     101 non-null    int64 
 10  breathes     101 non-null    int64 
 11  venomous     101 non-null    int64 
 12  fins         101 non-null    int64 
 13  legs         101 non-null    int64 
 14  tail         101 non-null    int64 
 15  domestic     101 non-null    int64 
 16  catsize      101 non-null    int64 
 17  class_type   101 non-null    int64 
dtypes: int64(17), object(1)
memory usage: 14.3+ KB
None


In [4]:
print(df[:5])

  animal_name  hair  feathers  eggs  milk  airborne  aquatic  predator  \
0    aardvark     1         0     0     1         0        0         1   
1    antelope     1         0     0     1         0        0         0   
2        bass     0         0     1     0         0        1         1   
3        bear     1         0     0     1         0        0         1   
4        boar     1         0     0     1         0        0         1   

   toothed  backbone  breathes  venomous  fins  legs  tail  domestic  catsize  \
0        1         1         1         0     0     4     0         0        1   
1        1         1         1         0     0     4     1         0        1   
2        1         1         0         0     1     0     1         0        0   
3        1         1         1         0     0     4     0         0        1   
4        1         1         1         0     0     4     1         0        1   

   class_type  
0           1  
1           1  
2           4  
3           1  
4           1  

### Step 2: Data preparation

There are a few tasks we need to do before we can train the model on this data:
1. Replace every label other than 1 (that is, all the non-mammal classes) with 0 so that it becomes a binary classification problem.

Then, we will split the data the same way as last time:
1. Divide the dataset into features (animal characteristics) and labels (mammal or not)
2. Split the data (101 rows) into a training set (about 80%) and a test set (about 20%)

In [6]:
labels = df.class_type
labels[:5]

0    1
1    1
2    4
3    1
4    1
Name: class_type, dtype: int64

In [5]:
#Taking all integer columns (0/1 flags, plus legs, which is a count) as features, except class_type, which is the label
features = df.select_dtypes(include=['int64']).drop('class_type', axis=1)
features[:5]

   hair  feathers  eggs  milk  airborne  aquatic  predator  toothed  backbone  \
0     1         0     0     1         0        0         1        1         1   
1     1         0     0     1         0        0         0        1         1   
2     0         0     1     0         0        1         1        1         1   
3     1         0     0     1         0        0         1        1         1   
4     1         0     0     1         0        0         1        1         1   

   breathes  venomous  fins  legs  tail  domestic  catsize  
0         1         0     0     4     0         0        1  
1         1         0     0     4     1         0        1  
2         0         0     1     0     1         0        0  
3         1         0     0     4     0         0        1  
4         1         0     0     4     1         0        1  


In [7]:
#Since this is binary, we will do either mammal or non-mammal

labels.replace(np.arange(2, 10, 1), 0, inplace=True)
#Non-mammals
print((labels == 0).sum())
#Mammals
print((labels == 1).sum())

60
41
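The two counts above also give a useful reference point for Step 4. As a quick sketch (the variable names are illustrative), a model that always answers "not a mammal" would already be right for 60 of the 101 animals, so any accuracy we report later should be judged against that baseline.

```python
# Class counts printed by the cell above
non_mammals = 60
mammals = 41
total = non_mammals + mammals

# Accuracy of a "model" that always predicts the larger class
baseline_accuracy = max(non_mammals, mammals) / total
print(f"Total animals: {total}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```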


In [8]:
from sklearn import preprocessing
import sklearn.model_selection as ms
from sklearn import linear_model as lm
from sklearn.metrics import mean_squared_error as mse 

In [9]:
train_features, test_features, train_labels, test_labels = ms.train_test_split(features, labels, test_size=0.2)
print(train_features.shape)
print(test_features.shape)
print(train_labels.shape)
print(test_labels.shape)

(80, 16)
(21, 16)
(80,)
(21,)
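The shapes above show 80 training rows and 21 test rows rather than an exact 80/20 split, because 20% of 101 is not a whole number. The sketch below is a simplified stand-in for what a shuffled split does, not scikit-learn's implementation: the helper `split_indices` is hypothetical, and it assumes the test size is rounded up. Fixing the seed makes the split repeatable between runs; in the notebook a similar effect can be had by passing `random_state` to `train_test_split`.

```python
import math
import random


def split_indices(n_rows, test_size, seed):
    """Shuffle row indices 0..n_rows-1 and split them into (train, test) lists."""
    n_test = math.ceil(test_size * n_rows)  # 20% of 101 rows rounds up to 21
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded, so the same split every run
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(n_rows=101, test_size=0.2, seed=0)
print(len(train_idx), len(test_idx))
```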


### Step 3: Model Selection and Training

For this problem, we will use a model called logistic regression. 

To learn more, watch the video: https://youtu.be/O4sExG-hUxA

This model also learns a weight for each feature, just like a linear regression model. However, it then passes the weighted sum through a function (the sigmoid) that maps it to a value between 0 and 1, which can be read as the probability of class 1.
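The function in question is usually called the sigmoid (or logistic function): sigmoid(z) = 1 / (1 + e^(-z)), where z is the weighted sum of the features plus the intercept. A minimal sketch in plain Python, independent of scikit-learn:

```python
import math


def sigmoid(z):
    """Map a real number z to a value between 0 and 1.

    Fine for moderate z; a very large negative z would overflow math.exp.
    """
    return 1.0 / (1.0 + math.exp(-z))


# Negative scores land below 0.5, positive scores above it,
# and a score of exactly 0 maps to 0.5.
for z in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"z = {z:+.1f} -> sigmoid(z) = {sigmoid(z):.3f}")
```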

In [10]:
logistic_mod = lm.LogisticRegression(solver='lbfgs')
logistic_mod.fit(train_features, train_labels)

### Step 4: Model evaluation and tuning
Unlike linear regression, we are not going to use Root Mean Squared Error. Instead, we will use the three metrics mentioned in the Titanic problem:
1. Accuracy = total correct predictions / total predictions
2. Precision = correct class 1 predictions / total predicted class 1
3. Recall = correct class 1 predictions / total number of actual class 1's
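To make the three definitions concrete, here is a small worked example on made-up labels and predictions (they are not taken from the zoo data). The helper name `confusion_counts` is illustrative only.

```python
def confusion_counts(labels, predictions):
    """Count true positives, false positives, false negatives and true negatives."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    return tp, fp, fn, tn


# Made-up example: 4 actual mammals (1) and 6 non-mammals (0)
labels      = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predictions = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(labels, predictions)
accuracy = (tp + tn) / len(labels)   # total correct / total predictions
precision = tp / (tp + fp)           # correct class 1 / total predicted class 1
recall = tp / (tp + fn)              # correct class 1 / total number of class 1's
print(accuracy, precision, recall)
```

Here one mammal is missed (which lowers recall) and one non-mammal is wrongly flagged (which lowers precision), so the two metrics can differ from accuracy even on a tiny example.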

In [11]:
print(logistic_mod.coef_)
print(logistic_mod.intercept_)

[[ 1.39633535 -0.46564754 -1.40707806  1.87461269 -0.40099644 -0.15568172
  -0.03635202  0.76443901  0.55381352  0.54524492 -0.4315619  -0.03397471
   0.17339491  0.15926214  0.14419439  0.87463228]]
[-3.4263619]
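The weights are easier to read next to their column names. The sketch below copies the values printed above (rounded to three decimals), pairs them with the feature columns in their DataFrame order, and sorts them from most positive to most negative. A positive weight pushes the prediction towards "mammal", a negative one away from it. The numbers are specific to this run; a different random split gives different values.

```python
feature_names = ["hair", "feathers", "eggs", "milk", "airborne", "aquatic",
                 "predator", "toothed", "backbone", "breathes", "venomous",
                 "fins", "legs", "tail", "domestic", "catsize"]
# Values copied from the coef_ output above, rounded to three decimals
weights = [1.396, -0.466, -1.407, 1.875, -0.401, -0.156, -0.036, 0.764,
           0.554, 0.545, -0.432, -0.034, 0.173, 0.159, 0.144, 0.875]

# Sort (name, weight) pairs from the most positive weight to the most negative
ranked = sorted(zip(feature_names, weights), key=lambda pair: pair[1], reverse=True)
for name, weight in ranked:
    print(f"{name:>9}: {weight:+.3f}")
```

In the notebook itself, `zip(features.columns, logistic_mod.coef_[0])` gives the same pairing without copying numbers by hand.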


In [12]:
train_probs = logistic_mod.predict_proba(train_features)
test_probs = logistic_mod.predict_proba(test_features)

In [13]:
print('Class 0 and Class 1 probabilities')

print(train_probs[:5])

Class 0 and Class 1 probabilities
[[0.98112347 0.01887653]
 [0.97293126 0.02706874]
 [0.03120249 0.96879751]
 [0.07270527 0.92729473]
 [0.0323203  0.9676797 ]]


In [14]:
train_predictions = np.where(train_probs[:, 1] > 0.5, 1, 0)
test_predictions = np.where(test_probs[:, 1] > 0.5, 1, 0)
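The 0.5 cut-off used above is a choice, not a fixed rule. As a sketch, the snippet below applies two different thresholds to the five class 1 probabilities printed earlier (rounded to three decimals); the helper `apply_threshold` is hypothetical. Raising the threshold makes the model more cautious about calling something a mammal.

```python
def apply_threshold(class1_probs, threshold=0.5):
    """Return 1 where the class 1 probability exceeds the threshold, else 0."""
    return [1 if p > threshold else 0 for p in class1_probs]


# Class 1 probabilities of the first five training rows, rounded
class1_probs = [0.019, 0.027, 0.969, 0.927, 0.968]

default_preds = apply_threshold(class1_probs, threshold=0.5)
strict_preds = apply_threshold(class1_probs, threshold=0.95)
print(default_preds)
print(strict_preds)
```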

In [15]:
print(train_predictions[:5], train_labels[:5])

[0 0 1 1 1] 11    0
61    0
17    1
9     1
10    1
Name: class_type, dtype: int64


In [16]:
def ClassifierMetrics(labels, predictions):
    total = labels.size
    result = (labels == predictions)
    correct = result.sum()
    accuracy = (correct)/total

    #Precision (correct '1' prediction / total '1' prediction)
    precision = (result[predictions == 1.0].sum()) / (predictions == 1.0).sum()

    #Recall = (correct '1' predictions / total number of '1's)

    recall = (result[predictions == 1.0].sum()) / (labels == 1.0).sum()

    return [accuracy, precision, recall]

In [17]:
train_metrics = ClassifierMetrics(train_labels, train_predictions)
print(f"Accuracy: {train_metrics[0]}")
print(f"Precision: {train_metrics[1]}")
print(f"Recall: {train_metrics[2]}")

Accuracy: 1.0
Precision: 1.0
Recall: 1.0


### Overfitting?

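Looking at the training results, it might seem that the model has simply memorised the data, since it achieves 100% accuracy, precision and recall.

However, the test performance (next cell) is equally good, so there is no sign of overfitting. This is a very simple problem: features such as milk and hair, which received the largest positive weights above, separate mammals from other animals almost directly, so the model appears to have learnt the pattern rather than the individual rows. Keep in mind that the test set has only 21 rows, so this is encouraging evidence rather than proof.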

In [18]:
test_metrics = ClassifierMetrics(test_labels, test_predictions)
print(f"Accuracy: {test_metrics[0]}")
print(f"Precision: {test_metrics[1]}")
print(f"Recall: {test_metrics[2]}")

Accuracy: 1.0
Precision: 1.0
Recall: 1.0
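A perfect score on 21 test rows is encouraging, but a small test set leaves real uncertainty. As a rough sketch, assuming the test animals are independent draws from the same population (an assumption, and a generous one for a small hand-built dataset), we can ask how high the true error rate could be while still producing 21 correct predictions out of 21 at least 5% of the time.

```python
n_test = 21          # test rows, all predicted correctly
confidence = 0.95

# Largest error rate e for which (1 - e) ** n_test is still >= 1 - confidence
max_error_rate = 1 - (1 - confidence) ** (1 / n_test)
print(f"Error rate could plausibly be as high as {max_error_rate:.1%}")
```

So a flawless result on this test set is still compatible with a model that gets roughly one animal in eight wrong on new data; a larger test set would narrow that range.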


In [19]:
print(features.columns)

Index(['hair', 'feathers', 'eggs', 'milk', 'airborne', 'aquatic', 'predator',
       'toothed', 'backbone', 'breathes', 'venomous', 'fins', 'legs', 'tail',
       'domestic', 'catsize'],
      dtype='object')


In [20]:
#Try your own animal:
#Remember, the columns (above for reference) should be given in order.
#Every column is 0 (no) or 1 (yes), except legs, which is the number of legs.

#Testing for bat (two legs)

custom_animal = [1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 2, 1, 0, 0]
custom_animal = np.reshape(custom_animal, (1, 16))
print(custom_animal.shape)

pred = logistic_mod.predict(custom_animal)

#predict returns one entry per row, so look at the first (and only) one
if pred[0] == 0:
    print('Your animal is not a mammal')
else:
    print('Your animal is a mammal')

(1, 16)
Your animal is a mammal
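To see what `predict` is doing, the sketch below repeats the calculation by hand for a bat, using the weights and intercept printed in Step 4. Those numbers are specific to that run; a different train/test split gives different values. The sketch assumes the bat is described with legs set to 2. The score is the weighted sum of the features plus the intercept, and the sigmoid turns it into the probability of class 1.

```python
import math

feature_names = ["hair", "feathers", "eggs", "milk", "airborne", "aquatic",
                 "predator", "toothed", "backbone", "breathes", "venomous",
                 "fins", "legs", "tail", "domestic", "catsize"]
# Copied from the coef_ and intercept_ output printed in Step 4
weights = [1.39633535, -0.46564754, -1.40707806, 1.87461269, -0.40099644,
           -0.15568172, -0.03635202, 0.76443901, 0.55381352, 0.54524492,
           -0.4315619, -0.03397471, 0.17339491, 0.15926214, 0.14419439,
           0.87463228]
intercept = -3.4263619

# Bat features; any feature not listed here is treated as 0
bat = {"hair": 1, "milk": 1, "airborne": 1, "predator": 1, "toothed": 1,
       "backbone": 1, "breathes": 1, "legs": 2, "tail": 1}

# Weighted sum of the features plus the intercept
score = intercept + sum(w * bat.get(name, 0) for name, w in zip(feature_names, weights))
# Sigmoid converts the score into the probability of class 1 (mammal)
probability = 1.0 / (1.0 + math.exp(-score))
print(f"score = {score:.3f}, P(mammal) = {probability:.3f}")
print("mammal" if probability > 0.5 else "not a mammal")
```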


