## **Introduction**

This notebook contains the steps enumerated below for analyzing characteristics of zoo animals and creating classifications.<br> 
Data is available at: https://www.kaggle.com/uciml/zoo-animal-classification/data <br><br>
1. [Import Data & Python Packages](#1-bullet) <br>
2. [Assess Data Quality & Missing Values](#2-bullet)<br>
3. [Exploratory Data Analysis](#3-bullet) <br>
4. [Classification & Cross Validation](#4-bullet) <br>
    * [4.1 Split train and test dataset](#4.1-bullet) <br>
    * [4.2 Perceptron Method](#4.2-bullet)<br>
      * [4.2.1 Cross Validation for Perceptron Method](#4.2.1-bullet) <br>
    * [4.3 Decision Tree](#4.3-bullet)<br>
    * [4.4 SVM](#4.4-bullet)<br>
    * [4.5 Multiclass Logistic Regression](#4.5-bullet)<br>
5. [Summary](#5-bullet) <br>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.
from sklearn import preprocessing
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# white grid background for seaborn plots (a single call; a second sns.set would override the first)
sns.set(style="whitegrid", color_codes=True)
from sklearn.metrics import accuracy_score

In [None]:
animal=pd.read_csv('../input/zoo.csv')
ani_class=pd.read_csv('../input/class.csv')

**1. Import Data & Python Packages**

In [None]:
animal.head()

In [None]:
# Check class table for later use.
ani_class

In [None]:
# Check data type for each variable
animal.info()

**2. Assess Data Quality & Missing Values**

In [None]:
animal.isnull().sum()

Good news: there are no missing values in this table.
The data also looks clean, since the numeric columns contain whole numbers only.

In [None]:
animal.describe()

In [None]:
# Check if class_type has correct values
print(animal.class_type.unique())

In [None]:
print(animal.legs.unique())

In [None]:
# just curious which animal has 5 legs
animal.loc[animal['legs'] == 5]

**3. Exploratory Data Analysis**

In [None]:
# Join animal table and class table to show actual class names
df=pd.merge(animal,ani_class,how='left',left_on='class_type',right_on='Class_Number')
df.head()

In [None]:
plt.hist(df.class_type, bins=7)

In [None]:
# See which class the most zoo animals belong to
# factorplot was renamed to catplot in newer seaborn releases
sns.catplot(x='Class_Type', data=df, kind="count", aspect=2)

In [None]:
# heatmap to show correlations
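fig, ax = plt.subplots(figsize=(20,15))
ax.set_title("Correlation Heatmap")
# correlate numeric columns only; non-numeric columns (such as the animal name) cannot be correlated
corr = animal.select_dtypes(include='number').corr()
sns.heatmap(corr, annot=True, ax=ax,
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)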

In [None]:
# show variable pairs whose correlation exceeds 0.7 in absolute value (positive or negative)
corr[corr != 1][abs(corr)> 0.7].dropna(how='all', axis=1).dropna(how='all', axis=0)

In [None]:
# average each numeric feature per class; numeric_only skips the text columns
df.groupby('Class_Type').mean(numeric_only=True)

Two patterns stand out: if "milk" is present, the animal is a mammal; if "feathers" is present, it is a bird.
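
A rule like this can be checked directly rather than read off the group means. The sketch below is plain Python over a handful of made-up rows (the names `rows` and `feature_implies_class` are hypothetical and not part of the notebook); it only illustrates the check, not the real data.

```python
# Made-up rows for illustration only; the notebook's real table is `df`.
rows = [
    {"name": "toy_mammal_1", "milk": 1, "feathers": 0, "Class_Type": "Mammal"},
    {"name": "toy_mammal_2", "milk": 1, "feathers": 0, "Class_Type": "Mammal"},
    {"name": "toy_bird_1", "milk": 0, "feathers": 1, "Class_Type": "Bird"},
    {"name": "toy_fish_1", "milk": 0, "feathers": 0, "Class_Type": "Fish"},
]

def feature_implies_class(rows, feature, class_name):
    """Return True if every row with feature == 1 belongs to class_name."""
    return all(r["Class_Type"] == class_name for r in rows if r[feature] == 1)

milk_rule = feature_implies_class(rows, "milk", "Mammal")
feather_rule = feature_implies_class(rows, "feathers", "Bird")
print(milk_rule, feather_rule)
```

On the actual table, a cross-tabulation such as `pd.crosstab(df['milk'], df['Class_Type'])` should give the same information in one view.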

In [None]:
# checking leg number in each class
g = sns.FacetGrid(df, col="Class_Type")
g.map(plt.hist, "legs")
plt.show()

**4. Classification & Cross Validation**

**4.1 Split train and test dataset**

In [None]:
from sklearn.model_selection import train_test_split
# 80/20 split
#animal=animal.drop(['eggs', 'hair'], axis=1)
#X = animal.iloc[:,1:15]
#y = animal.iloc[:,15]
X = animal.iloc[:,1:17]
y = animal.iloc[:,17]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)

**4.2 Perceptron Method**<br>The perceptron is a simple linear classifier. scikit-learn's `Perceptron` extends it to multi-class problems with a one-vs-rest scheme, so it is a reasonable first method for our 7 animal classes.
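
To make the multi-class use of a perceptron concrete, here is a minimal sketch of a one-vs-rest scheme: one binary perceptron is trained per class, and the predicted class is the one whose perceptron scores highest. This is an illustration in plain Python on made-up data, not scikit-learn's implementation; the function names and the three toy features are assumptions.

```python
def dot(w, x):
    """Dot product of two equal-length lists."""
    return sum(wj * xj for wj, xj in zip(w, x))

def train_binary_perceptron(X, y, eta=1.0, epochs=20):
    """Train one perceptron; y holds +1 / -1 labels. Returns (weights, bias)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # Update only when the sample is misclassified or on the boundary
            if yi * (dot(w, xi) + b) <= 0:
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
                b += eta * yi
    return w, b

def train_one_vs_rest(X, labels, eta=1.0, epochs=20):
    """Train one 'this class vs. everything else' perceptron per class."""
    models = {}
    for c in sorted(set(labels)):
        y = [1 if label == c else -1 for label in labels]
        models[c] = train_binary_perceptron(X, y, eta, epochs)
    return models

def predict(models, x):
    """Pick the class whose perceptron gives the largest score."""
    return max(models, key=lambda c: dot(models[c][0], x) + models[c][1])

# Made-up toy data: features are [milk, feathers, fins]
X = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]]
labels = ["Mammal", "Mammal", "Bird", "Bird", "Fish", "Fish"]

models = train_one_vs_rest(X, labels)
preds = [predict(models, x) for x in X]
print(preds)
```

The `eta` argument plays the role of the learning rate that the notebook passes to scikit-learn as `eta0`.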

In [None]:
from sklearn.linear_model import Perceptron
ppn = Perceptron(eta0=1, random_state=1)
ppn.fit(X_train, y_train)
# make prediction
y_pred = ppn.predict(X_test)
# check model accuracy
accuracy_score(y_test, y_pred)  # signature is (y_true, y_pred)

In [None]:
# 70/30 split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)
ppn = Perceptron(eta0=1, random_state=1)
ppn.fit(X_train, y_train)
y_pred = ppn.predict(X_test)
accuracy_score(y_test, y_pred)  # signature is (y_true, y_pred)

Splitting the data again (70/30 instead of 80/20) did give a better test score, but a single split is a noisy estimate, so I would rather cross-validate this model.

**4.2.1 Cross Validation for Perceptron Method**<br>
K-fold CV: we split the data into k subsets (folds) and train on k-1 of them, holding out the remaining fold for testing. This is repeated k times so that each fold serves as the test set once.<br>
* A model is trained using k-1 of the folds as training data.<br>
* The resulting model is validated on the remaining fold (i.e., it is used as a test set to compute a performance measure such as accuracy).
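
The fold mechanics can be sketched in a few lines of plain Python. The helper `k_fold_indices` below is hypothetical and only shows how sample indices are divided into train and test parts; it does not shuffle or stratify.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, unshuffled folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, test_idx
        start = stop

folds = list(k_fold_indices(10, 5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Each index lands in exactly one test fold. Note that for classifiers, `cross_val_score` with an integer `cv` uses stratified folds, which keep class proportions roughly equal across folds; this sketch does not attempt that.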

In [None]:
from sklearn.model_selection import cross_val_score
score_ppn=cross_val_score(ppn, X,y, cv=5)
score_ppn

In [None]:
# Mean score across folds, with +/- 2 standard deviations as a rough spread (not a strict 95% confidence interval):
print("Accuracy: %0.2f (+/- %0.2f)" % (score_ppn.mean(), score_ppn.std() * 2))

So the cross-validated accuracy of the Perceptron model is around 0.89. That is fine, but I'd like to try some other models.
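
For reference, the "+/-" figure printed with the accuracy is simply twice the standard deviation of the per-fold scores. The sketch below reproduces that arithmetic with the standard library on hypothetical fold scores (these numbers are invented for illustration and are not results from this notebook).

```python
from statistics import mean, pstdev

# Hypothetical per-fold accuracies, for illustration only
fold_scores = [0.90, 0.95, 0.85, 0.90, 0.90]

avg = mean(fold_scores)
# pstdev is the population standard deviation, matching NumPy's default std (ddof=0)
spread = 2 * pstdev(fold_scores)

summary = "Accuracy: %0.2f (+/- %0.2f)" % (avg, spread)
print(summary)
```

With only five folds this spread is a rough indication of variability between folds rather than a formal confidence interval.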

**4.3 Decision Tree**

In [None]:
from sklearn import tree
dt = tree.DecisionTreeClassifier()
score_dt=cross_val_score(dt, X,y, cv=5)
score_dt

In [None]:
# Mean score across folds, with +/- 2 standard deviations as a rough spread (not a strict 95% confidence interval):
print("Accuracy: %0.2f (+/- %0.2f)" % (score_dt.mean(), score_dt.std() * 2))

**4.4 SVM**

In [None]:
from sklearn.svm import SVC
svc = SVC(kernel='linear', C=1)
score_svc=cross_val_score(svc, X,y, cv=5)
score_svc

In [None]:
# Mean score across folds, with +/- 2 standard deviations as a rough spread (not a strict 95% confidence interval):
print("Accuracy: %0.2f (+/- %0.2f)" % (score_svc.mean(), score_svc.std() * 2))

**4.5 Multiclass Logistic Regression**

In [None]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(multi_class='multinomial', solver='newton-cg')
score_lr=cross_val_score(lr, X,y, cv=5)
score_lr

In [None]:
# Mean score across folds, with +/- 2 standard deviations as a rough spread (not a strict 95% confidence interval):
print("Accuracy: %0.2f (+/- %0.2f)" % (score_lr.mean(), score_lr.std() * 2))

**5. Summary**

In [None]:
models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'Logistic Regression', 'Perceptron', 'Decision Tree'],
    'Score': [score_svc.mean(), score_lr.mean(), score_ppn.mean(), score_dt.mean()]})
models.sort_values(by='Score', ascending=False)

After comparing the score of each model, the SVM model seems to be the most accurate.