# **Classification using Zoo Animals**

This project was my first foray into Machine Learning. 

In this project, I use a dataset of over 100 zoo animals, each described by a set of physical features. I apply two classification methods, K-Nearest Neighbors and Decision Trees, to train a model that predicts which class each animal belongs to based on those features.

# **Table of Contents**
* **a)** Code Library Set-Up
* **b)** Brief Exploratory Analysis
* **c)** Test Splitting
* **1)** K-Nearest Neighbors Classification
* **2)** Decision Tree Classification
* **3)** Conclusion

# **a) Code Library Set-Up**

In [None]:
#Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#sklearn libraries
from sklearn import metrics, tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, mean_absolute_error
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))



In [None]:
#Load the data
zoo = pd.read_csv('../input/zoo-animal-classification/zoo.csv')
class_df = pd.read_csv('../input/zoo-animal-classification/class.csv')

In [None]:
#Combine the zoo and the class files
zoo = zoo.merge(class_df,how='left',left_on='class_type',right_on='Class_Number')


# **b) Brief Exploratory Analysis** 

In [None]:
zoo

In [None]:

print(zoo.columns)

print(zoo['animal_name'].unique() )

print(zoo['Class_Type'].unique() )



# So what's in the data?

* 101 animals
* The features are:
    * hair
    * feathers
    * eggs
    * milk
    * airborne
    * aquatic
    * predator
    * toothed
    * backbone
    * breathes
    * venomous
    * fins
    * legs
    * tail
    * domestic
    * catsize
* The types of classes are: 
    * mammal
    * fish
    * bird
    * invertebrate
    * bug
    * amphibian
    * reptile

# Visualization of our data

Below is a basic plot showing how many animals of each class we have.

In [None]:
plt.figure(figsize=(10,10))
# Count the animals in each class
ax = sns.countplot(x=zoo['Class_Type'], palette="Greens_r")
ax.set_ylabel("Count")
plt.show()


To get a rough idea of what to expect, let's first plot the correlations between all of our features and see whether any strong relationships stand out.

In [None]:
plt.figure(figsize=(13,13))
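# Keep only numeric columns so that text columns (e.g. Class_Type) do not break corr()
corr = zoo.iloc[:,1:-1].select_dtypes(include='number').corr()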
sns.heatmap(corr, cmap = "Greens", annot=True)
plt.show()

# **c) Test Splitting**

For our Machine Learning portion, we are going to split the data into our features and our target values.

> Our **features** are: *hair, feathers, eggs, milk, airborne, aquatic, predator, toothed, backbone, breathes, venomous, fins, legs, tail, domestic, catsize.*

> Our **target values** to predict are the class types: *mammal, fish, bird, invertebrate, bug, amphibian, reptile.*

For each method, we train on one portion of the data and test on the remainder to see which classifier is more accurate. In other words, each algorithm learns from the animals' features and then tries to predict an animal's class from the physical features it possesses.
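
With only about 100 rows spread over seven classes, a purely random split can leave a small class under-represented in one of the two sets. The sketch below shows one possible way to guard against that, using the `stratify` argument of `train_test_split`. It runs on made-up toy data (`X_toy`, `y_toy` and the class sizes are assumptions for illustration, not the zoo dataset), and it is not the split used in the rest of this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the zoo data (hypothetical): 100 rows, 16 binary features,
# and 4 deliberately imbalanced classes.
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(100, 16))
y_toy = np.repeat([1, 2, 3, 4], [60, 20, 12, 8])

# stratify=y_toy keeps each class's share roughly the same in both splits.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=1
)

# Compare the class counts in the training and validation sets.
print(np.unique(y_tr, return_counts=True))
print(np.unique(y_va, return_counts=True))
```

Whether stratifying changes the final accuracy on the real data is something to check by experiment rather than assume.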

In [None]:
# Remove unwanted columns, and assign the x and y values
features = ['hair', 'feathers', 'eggs', 'milk', 'airborne',
       'aquatic', 'predator', 'toothed', 'backbone', 'breathes', 'venomous',
       'fins', 'legs', 'tail', 'domestic', 'catsize']
X = zoo[features]
y = zoo['Class_Number']

# Split X and y into training and validation sets (default: 75% train / 25% validation)
train_X, val_X, train_y, val_y = train_test_split(X,y, random_state=1 )

# **1) K-Nearest Neighbors**

What is KNN?

From medium.com,
> *K-nearest neighbors (KNN) is a type of supervised learning algorithm used for both regression and classification. KNN tries to predict the correct class for the test data by calculating the distance between the test data and all the training points. Then select the K number of points which is closet to the test data. The KNN algorithm calculates the probability of the test data belonging to the classes of ‘K’ training data and class holds the highest probability will be selected. In the case of regression, the value is the mean of the ‘K’ selected training points.*
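
To make the description above concrete, here is a minimal from-scratch sketch of the idea: compute the distance from a query point to every training point, take the `k` closest, and let them vote. The helper `knn_predict` and the four-feature toy animals are hypothetical and exist only for illustration. It assumes plain Euclidean distance and a simple majority vote, which is a simplification of what `KNeighborsClassifier` does internally.

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify one query point by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training row
    distances = np.sqrt(((train_X - query) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of those neighbors
    votes = Counter(train_y[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data (hypothetical); columns are: hair, feathers, eggs, milk
toy_X = np.array([
    [1, 0, 0, 1],  # mammal
    [1, 0, 0, 1],  # mammal
    [1, 0, 1, 1],  # mammal that lays eggs
    [0, 1, 1, 0],  # bird
    [0, 1, 1, 0],  # bird
])
toy_y = np.array(["mammal", "mammal", "mammal", "bird", "bird"])

print(knn_predict(toy_X, toy_y, np.array([1, 0, 0, 1]), k=3))
print(knn_predict(toy_X, toy_y, np.array([0, 1, 1, 0]), k=3))
```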

In [None]:
#Implement KNN
knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(train_X,train_y)

# Evaluate on the held-out validation set rather than on the data the model was trained on
pred = knn.predict(val_X)
print("Accuracy: {}".format(accuracy_score(val_y, pred)))

In [None]:
# Let's try many values of k (the number of neighbors)

#Create an empty list to later insert our results
accuracy = []

#Check every k from 1 through 49
neighbors = range(1,50)
for i in neighbors:
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(train_X,train_y)
    # Score on the validation set; training-set accuracy is inflated
    # because each training point is its own nearest neighbor
    pred = knn.predict(val_X)
    accuracy.append(metrics.accuracy_score(val_y, pred))

#Now print the results
print(accuracy)

In [None]:
#Let's visualize these results
plt.figure(figsize=(10,10))
plt.plot(neighbors,accuracy, color = 'green')
plt.xlabel('Number of neighbors (k)')
_ = plt.ylabel('Accuracy')
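
Another common way to choose the number of neighbors is cross-validation, which averages accuracy over several train/validation splits instead of relying on a single one. The sketch below uses `cross_val_score` on synthetic data from `make_classification`. The data, the range of `k`, and the choice of 5 folds are all assumptions for illustration, so the numbers it prints say nothing about the zoo dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (hypothetical): 120 samples, 16 features, 3 classes
X_demo, y_demo = make_classification(
    n_samples=120, n_features=16, n_informative=8, n_classes=3, random_state=1
)

k_values = range(1, 16)
cv_means = []
for k in k_values:
    # 5-fold cross-validated accuracy for this value of k
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_demo, y_demo, cv=5)
    cv_means.append(scores.mean())

# Pick the k with the highest mean cross-validated accuracy
best_k = k_values[int(np.argmax(cv_means))]
print("Best k:", best_k)
print("Mean CV accuracy:", max(cv_means))
```

To try this on the notebook's data, `X_demo` and `y_demo` could be swapped for `X` and `y`, keeping in mind that very small classes limit how many folds are usable.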

# **2) Decision Tree**

What is a decision tree?

From medium.com,
> DTs are ML algorithms that progressively divide data sets into smaller data groups based on a descriptive feature, until they reach sets that are small enough to be described by some label.
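
How does a tree decide which feature to split on? One common criterion, and the default in `DecisionTreeClassifier`, is Gini impurity: a split is good if the groups it produces are each dominated by one class. The sketch below computes this by hand for a few binary features. The helpers `gini` and `split_impurity` and the six toy animals are hypothetical and written only for illustration. This is a simplified view of a single split, not the full tree-building algorithm.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of labels: 0 means the set is pure."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_impurity(feature, labels):
    """Weighted Gini impurity after splitting on a binary (0/1) feature."""
    mask = feature == 1
    n = len(labels)
    total = 0.0
    for side in (mask, ~mask):
        if side.any():  # skip an empty side of the split
            total += side.sum() / n * gini(labels[side])
    return total

# Toy data (hypothetical): six animals and three binary features
toy_labels = np.array(["mammal", "mammal", "bird", "bird", "fish", "fish"])
toy_features = {
    "milk":     np.array([1, 1, 0, 0, 0, 0]),
    "feathers": np.array([0, 0, 1, 1, 0, 0]),
    "tail":     np.array([1, 1, 1, 1, 1, 1]),
}

print("impurity before any split:", round(gini(toy_labels), 3))
for name, column in toy_features.items():
    print(name, round(split_impurity(column, toy_labels), 3))
```

In this toy example, `milk` and `feathers` each isolate one class and so lower the impurity, while `tail` is identical for every animal and leaves the impurity unchanged.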

In [None]:
dec = DecisionTreeClassifier(random_state=1)
dec = dec.fit(train_X,train_y)
pred = dec.predict(val_X)

plt.figure(figsize=(25,25))
_ = tree.plot_tree(dec,feature_names = features,filled = True)

print("Accuracy: " + str(accuracy_score(val_y,pred)))



We can see that with a Decision Tree, our accuracy in classifying the animals in the validation set is **96%**.
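
A single accuracy figure hides which classes get confused with each other. A possible next step is to look at a confusion matrix and a per-class report, both available in `sklearn.metrics`. The sketch below runs on synthetic data from `make_classification` (an assumption for illustration, not the zoo dataset), so its output does not reflect the results above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical): 200 samples, 16 features, 4 classes
X_demo, y_demo = make_classification(
    n_samples=200, n_features=16, n_informative=8, n_classes=4, random_state=1
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=1
)

clf = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
pred_va = clf.predict(X_va)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_va, pred_va)
print(cm)

# Precision, recall and F1 score for each class
print(classification_report(y_va, pred_va))
```

On the notebook's own variables, the equivalent calls would be `confusion_matrix(val_y, pred)` and `classification_report(val_y, pred)`.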

# **3) Conclusion**

Of the two methods, the decision tree proved to be the more effective solution, netting us a 96% accuracy rating.