# Problem statement

##### Implement a KNN model that classifies animals and predicts which animal type each one belongs to. The 7 class types are: Mammal, Bird, Reptile, Fish, Amphibian, Bug and Invertebrate.


# Importing the libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold, GridSearchCV, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import RandomOverSampler

# Loading the dataset

In [None]:
zoo = pd.read_csv('../input/zoo-animal-classification/zoo.csv')


In [None]:
zoo.head(5)

The pandas `head` function gives a quick look at the first few rows. By default it returns the first 5 rows.

# Data Insights

In [None]:
zoo.shape

In [None]:
zoo.info()

### Observations :-

##### There are no null values in our dataset. There are 16 variables that describe traits of the animals: hair, feathers, eggs, milk, ..., domestic, catsize.


##### The purpose of this dataset is to predict the classification (type) of the animals based on these variables.


In [None]:
zoo[zoo.duplicated()]

##### There are no duplicate rows in our data.
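
##### The null and duplicate checks can also be expressed as plain numbers instead of being read off the output. The following is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration and are not taken from zoo.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one missing value and one repeated row.
toy = pd.DataFrame({
    "hair": [1, 0, 1, 1],
    "eggs": [0, 1, np.nan, 0],
})

n_missing = int(toy.isnull().sum().sum())   # total number of missing cells
n_duplicates = int(toy.duplicated().sum())  # rows identical to an earlier row

print(n_missing, n_duplicates)
```

##### Applied to `zoo`, both numbers should be 0 if the observations above hold.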

# Summary statistics 

In [None]:
zoo.describe()

##### All the feature attributes are already encoded as 0 and 1, except legs. So we will apply an encoding technique to the legs attribute as well.

##### Since all the other attributes are binary indicators, we will dummy-encode legs so that it follows the same scheme.

In [None]:
zoo = pd.get_dummies(zoo,columns=['legs'])

In [None]:
zoo.head()
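
##### To make the effect of `pd.get_dummies` on the legs column concrete, here is a small sketch on a hypothetical toy frame (the rows below are illustrative, not read from zoo.csv). Each distinct legs value becomes its own indicator column, and every row has exactly one of them set:

```python
import pandas as pd

# Hypothetical toy rows with different leg counts.
toy = pd.DataFrame({
    "animal_name": ["bass", "crow", "bear", "ant"],
    "legs": [0, 2, 4, 6],
})

# One indicator column per distinct value of 'legs'; the original column is dropped.
encoded = pd.get_dummies(toy, columns=["legs"])
legs_columns = [c for c in encoded.columns if c.startswith("legs_")]

print(legs_columns)
print(encoded)
```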

# Understanding the target variable

##### Our main objective is to predict the classification (type) of the animals based on these variables.

##### The value_counts() method shows how many samples there are for each animal type.


In [None]:
zoo['class_type'].value_counts()

##### The count for type 1 is very high, and there is a large gap to the next highest count, which is 20 for type 2. Datasets in which the classes are not evenly distributed are called imbalanced datasets. An imbalanced dataset can make the model's accuracy misleadingly high or low, because a single class dominates the score.
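
##### The degree of imbalance can be summarised with class proportions and the ratio between the largest and smallest class. This is a minimal sketch on hypothetical labels (the `labels` series is made up for illustration; on the real data it would be `zoo['class_type']`):

```python
import pandas as pd

# Hypothetical labels: class 1 dominates, class 3 is rare.
labels = pd.Series([1] * 8 + [2] * 4 + [3] * 1, name="class_type")

counts = labels.value_counts()                     # samples per class
proportions = labels.value_counts(normalize=True)  # share of each class
imbalance_ratio = counts.max() / counts.min()      # largest class / smallest class

print(proportions)
print(imbalance_ratio)
```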

In [None]:
sn.set(style = 'whitegrid', font_scale = 1.4)
plt.subplots(figsize = (12,7))
sn.countplot(x = 'class_type', data = zoo, palette = 'Pastel1')

##### The count plot confirms that type 1 is by far the most frequent class.

# Separating feature data and label data, and train-test split

##### We will separate the class label data (type) and features data as Y and X respectively. Also, we will split the dataset into training and test data. The animal_name column is not required for classification as it is not a feature, so we will drop that column as well.

In [None]:
Y = zoo['class_type']
Y.head()

In [None]:
X = zoo.drop('animal_name',axis=1)

In [None]:
X = X.drop('class_type',axis=1)

In [None]:
X.head()

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = .2, random_state = 30, stratify = Y)
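
##### The `stratify = Y` argument keeps the class proportions the same in the training and test sets, which matters when some classes are rare. A small sketch on hypothetical data (the arrays `X_demo` and `y_demo` are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 40 samples of class 0 and 10 samples of class 1.
y_demo = np.array([0] * 40 + [1] * 10)
X_demo = np.arange(50).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=30, stratify=y_demo
)

# With stratification, the 80/20 class ratio is preserved on both sides.
print(np.bincount(y_tr), np.bincount(y_te))
```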

In [None]:
X_train.head()

In [None]:
X_test.head()

In [None]:
Y_train.head()

In [None]:
Y_test.head()

# Grid search for Algorithm Tuning

In [None]:
n_neighbors = np.array(range(1,40))
param_grid = dict(n_neighbors=n_neighbors)
param_grid

In [None]:
model = KNeighborsClassifier()
grid = GridSearchCV(estimator=model, param_grid=param_grid,cv=10)
grid.fit(X_train, Y_train)
print(grid.best_params_)

##### GridSearch selected n_neighbors = 1 as the best value of K, so we will use k = 1 for the KNN classifier.
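
##### Besides `best_params_`, a fitted `GridSearchCV` exposes the mean and standard deviation of the cross-validated score for every candidate in `cv_results_`, which shows how close the runner-up values of k are. The sketch below uses the iris dataset bundled with scikit-learn as a stand-in (an assumption made so the example runs on its own; it is not the zoo data):

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data so the example is self-contained.
X_demo, y_demo = load_iris(return_X_y=True)

grid_demo = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 11))},
    cv=5,
)
grid_demo.fit(X_demo, y_demo)

# One row per candidate k, with its mean CV accuracy, spread and rank.
results = pd.DataFrame(grid_demo.cv_results_)[
    ["param_n_neighbors", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score").head(3))
print(grid_demo.best_params_, grid_demo.best_score_)
```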

### Visualizing CV results

In [None]:
import matplotlib.pyplot as plt 
%matplotlib inline
# try every k from 1 to 40
k_range = range(1, 41)
k_scores = []
# for each k, fit a KNN model and record its mean 10-fold cross-validated accuracy
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, Y_train, cv=10)
    k_scores.append(scores.mean())
# plot mean accuracy against k
plt.plot(k_range, k_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated Accuracy')
plt.show()

##### The model accuracy is very good for k values smaller than 5 and keeps decreasing as k increases.
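
##### The best k can also be read from the score list programmatically rather than from the plot. A minimal sketch with hypothetical scores (the numbers in `demo_scores` are made up for illustration and are not results from this notebook):

```python
import numpy as np

demo_k_range = range(1, 6)
# Hypothetical mean cross-validated accuracies, one per k.
demo_scores = [0.95, 0.95, 0.93, 0.90, 0.88]

# np.argmax returns the first maximum, so ties resolve to the smallest k.
best_k = demo_k_range[int(np.argmax(demo_scores))]
print(best_k)
```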

# Using KNN Classifier for prediction

In [None]:
model = KNeighborsClassifier(n_neighbors =1).fit(X_train,Y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(Y_test,y_pred)
print(accuracy)

##### The accuracy score of our model is 0.76, i.e. 76%, which is a decent score. However, accuracy can be misleading for imbalanced data, so we will also look at the confusion matrix and the classification report.
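
##### One way to see why accuracy can mislead on imbalanced data is to compare it with a baseline that always predicts the most frequent class. The sketch below uses hypothetical labels (`y_true` is made up for illustration) together with scikit-learn's `DummyClassifier` and `balanced_accuracy_score`, which averages the recall over the classes:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical labels: 8 samples of class 1, 2 samples of class 2.
y_true = np.array([1] * 8 + [2] * 2)
X_dummy = np.zeros((10, 1))  # features are ignored by the baseline

baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_base = baseline.predict(X_dummy)

acc = accuracy_score(y_true, y_base)           # looks good: 0.8
bal = balanced_accuracy_score(y_true, y_base)  # mean recall per class: 0.5
print(acc, bal)
```

##### The baseline never predicts class 2, yet its plain accuracy is still 0.8. A model is only informative to the extent that it beats such a baseline.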

In [None]:
# store the result under a new name so the imported confusion_matrix function is not overwritten
cm = confusion_matrix(Y_test,y_pred)
print(cm)

In [None]:
print(classification_report(Y_test,y_pred))

##### The precision and f1 score for type 5 are low. Since the data is imbalanced, the precision values are affected. We will use an oversampling technique, because the dataset is very small and undersampling would discard data.
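
##### The per-class values in the classification report can be derived directly from a confusion matrix. In scikit-learn's convention the rows are true classes and the columns are predicted classes, so precision divides the diagonal by the column sums and recall divides it by the row sums. A sketch with a hypothetical 3-class matrix (`cm_demo` is made up for illustration):

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class.
cm_demo = np.array([
    [8, 0, 0],
    [0, 3, 1],
    [0, 0, 2],
])

true_positives = np.diag(cm_demo)
precision = true_positives / cm_demo.sum(axis=0)  # correct / all predicted as the class
recall = true_positives / cm_demo.sum(axis=1)     # correct / all truly in the class
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
```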

# Using Over Sampling for balancing the data

##### We will use RandomOverSampler (ROS) to resample the data so that the classes are balanced.

In [None]:
ros = RandomOverSampler(random_state = 30)

##### Fitting the data using ROS 

In [None]:
x_resample, y_resample = ros.fit_resample(X, Y)
y_df = pd.DataFrame(y_resample)

In [None]:
y_df.value_counts()

##### The data is now resampled and every type has 41 samples; previously only type 1 had 41. We will split the resampled data into training and test data and build a KNN model.

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(x_resample, y_resample, test_size = .2, random_state = 30, stratify = y_resample)
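
##### Random oversampling works by duplicating rows, so when it runs before the split, copies of the same row can end up in both the training and the test set. A common alternative is to split first and oversample only the training part. The sketch below illustrates that ordering on hypothetical data, using a hand-rolled pandas oversampler as a stand-in for RandomOverSampler (the frame `demo` and all names in it are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 40 rows of class 0, 20 rows of class 1.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "f1": rng.integers(0, 2, 60),
    "f2": rng.integers(0, 2, 60),
    "label": [0] * 40 + [1] * 20,
})

# 1) Split first, so the test set contains only original rows.
train_df, test_df = train_test_split(
    demo, test_size=0.2, random_state=30, stratify=demo["label"]
)

# 2) Oversample the training part only: keep every row and add random
#    copies of minority-class rows until all classes match the largest one.
target = train_df["label"].value_counts().max()
parts = []
for _, group in train_df.groupby("label"):
    parts.append(group)
    extra = target - len(group)
    if extra > 0:
        parts.append(group.sample(n=extra, replace=True, random_state=30))
train_balanced = pd.concat(parts)

print(train_balanced["label"].value_counts())
```

##### With this ordering the test set keeps its original class distribution, and no test row is a copy of a training row introduced by resampling.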

# Using GridSearch for Algorithm Tuning after resampling

In [None]:
n_neighbors = np.array(range(1,40))
param_grid = dict(n_neighbors=n_neighbors)

model = KNeighborsClassifier()
grid = GridSearchCV(estimator=model, param_grid=param_grid,cv=10)
grid.fit(X_train, Y_train)
print(grid.best_params_)

##### After resampling, GridSearch again selected n_neighbors = 1 as the best value of K, so we will keep k = 1 for the KNN classifier.

### Visualizing the accuracy with different k values on sampled data

In [None]:
import matplotlib.pyplot as plt 
%matplotlib inline
# try every k from 1 to 40
k_range = range(1, 41)
k_scores = []
# for each k, fit a KNN model and record its mean 10-fold cross-validated accuracy
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, Y_train, cv=10)
    k_scores.append(scores.mean())
# plot mean accuracy against k
plt.plot(k_range, k_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated Accuracy')
plt.show()

##### The accuracy is high for low values of k (less than 5) and decreases as k increases.

# Using KNN with k=1 for model classification 

##### GridSearch identified k = 1 as the best parameter, so we use k = 1.

In [None]:
model = KNeighborsClassifier(n_neighbors =1).fit(X_train,Y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(Y_test,y_pred)
print(accuracy)

##### The accuracy is 1, i.e. 100%, after applying oversampling. We will use the confusion matrix and the classification report to check this result further.

In [None]:
from sklearn.metrics import confusion_matrix
# keep the result under a separate name so the imported function is not shadowed
cm_resampled = confusion_matrix(Y_test,y_pred)
print(cm_resampled)

In [None]:
print(classification_report(Y_test,y_pred))

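##### The precision and recall values are 1 for all 7 types, which is an excellent score. Note, however, that the oversampling was applied before the train-test split, so duplicated rows can appear in both the training and the test set. With k = 1 such a test row is matched to its own copy, which makes these scores optimistic.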
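
##### Because random oversampling duplicates rows, it can be useful to measure how many test rows also occur verbatim in the training data. A minimal sketch on hypothetical frames (`train_demo` and `test_demo` are made up for illustration; on this notebook's data they would be `X_train` and `X_test`):

```python
import pandas as pd

# Hypothetical feature frames with the same columns.
train_demo = pd.DataFrame({"hair": [1, 0, 1], "eggs": [0, 1, 0]})
test_demo = pd.DataFrame({"hair": [1, 1], "eggs": [0, 1]})

# Collect every training row as a tuple, then look each test row up.
seen_rows = set(map(tuple, train_demo.to_numpy()))
leaked = [tuple(row) in seen_rows for row in test_demo.to_numpy()]
leak_fraction = sum(leaked) / len(leaked)

print(leak_fraction)
```

##### A high fraction suggests that a 1-nearest-neighbour score on that test set mostly measures memorisation of repeated rows.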