
Project: Classifying iris flowers by species
using the Iris dataset and k-nearest neighbors

*Using Supervised Learning*

In [6]:
import pandas as pd 
import numpy as np 

In [7]:
#sklearn is the package, datasets is the module, load_iris is the function we are using
from sklearn.datasets import load_iris
iris_dataset = load_iris()

In [8]:
#The iris object that is returned by load_iris is a Bunch object, 
#which is very similar to a dictionary. It contains keys and values :
print("Keys of iris_dataset: \n{}".format(iris_dataset.keys()))

Keys of iris_dataset: 
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
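
The dictionary-like behaviour described above can be sketched in plain Python. `MiniBunch` below is a hypothetical stand-in written for illustration, not scikit-learn's own implementation; it only shows the idea of a dict whose keys can also be read as attributes (the object returned by load_iris likewise accepts both `iris_dataset['data']` and `iris_dataset.data`).

```python
class MiniBunch(dict):
    """Hypothetical stand-in: a dict whose keys can also be read as attributes."""

    def __getattr__(self, key):
        # Called only when normal attribute lookup fails, so dict methods still work.
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key) from None


demo = MiniBunch(target_names=["setosa", "versicolor", "virginica"])

print(demo["target_names"])  # dictionary-style access
print(demo.target_names)     # attribute-style access to the same value
print(list(demo.keys()))
```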


In [9]:
#The value of the key DESCR is a short description of the dataset. 
#Let's look at what it contains:
val = iris_dataset['DESCR']
start_val = val[:200]
print(start_val + "\n...")

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive
...


In [10]:
#The value of the key target_names is an array of strings, 
#containing the species of flowers that we want to predict 
#i.e. 'setosa' 'versicolor' and 'virginica'
print("Target names: {}".format(iris_dataset['target_names']))

Target names: ['setosa' 'versicolor' 'virginica']


In [11]:
#feature_names key gives the description of each feature it includes
print("Feature names: \n{}".format(iris_dataset['feature_names']))

Feature names: 
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [12]:
#The type in which we receive the data
print("Type of data: {}".format(type(iris_dataset['data'])))

Type of data: <class 'numpy.ndarray'>


In [13]:
#The rows in this data array correspond to the 150 flowers, 
#while the columns represent the 4 measurements that were taken for each flower:
print("Shape of data: {}".format(iris_dataset['data'].shape))

Shape of data: (150, 4)


1. Individual items are called samples, data points, or instances, and their properties are called features or attributes.

2. The shape of the data array is (number of samples, number of features).

Here we have 150 data points and 4 features.

In [14]:
print("First five rows of data: \n{}".format(iris_dataset['data'][:5]))
#we observe that all five flowers have the same petal width (0.2 cm)
#and that the fifth flower has the largest sepal width (3.6 cm)

First five rows of data: 
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
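
As a rough check of the shape convention and of what these five rows contain, here is a minimal sketch in plain Python. The rows are copied from the output above; `shape_2d` is a hypothetical helper written for this example and is not part of numpy or scikit-learn.

```python
# The five rows printed above: sepal length, sepal width, petal length, petal width (cm).
first_five = [
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [4.7, 3.2, 1.3, 0.2],
    [4.6, 3.1, 1.5, 0.2],
    [5.0, 3.6, 1.4, 0.2],
]


def shape_2d(rows):
    """Return (n_samples, n_features) for a rectangular list of rows."""
    n_features = len(rows[0])
    if any(len(row) != n_features for row in rows):
        raise ValueError("all rows must have the same number of features")
    return (len(rows), n_features)


print(shape_2d(first_five))  # 5 samples, 4 features

# Column 3 is petal width, column 1 is sepal width.
petal_widths = {row[3] for row in first_five}
widest_sepal_row = max(range(len(first_five)), key=lambda i: first_five[i][1])

print(petal_widths)      # a single distinct value
print(widest_sepal_row)  # zero-based index of the row with the largest sepal width
```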


In [15]:
print("Type of target: {}".format(type(iris_dataset['target']))) 
#tell us the type of target key

Type of target: <class 'numpy.ndarray'>


In [16]:
#target is a 1D array, and on seeing the shape 
#we can see that it contains one entry per flower:
print("Shape of target: {}".format(iris_dataset['target'].shape))

Shape of target: (150,)


In [17]:
print("Target: \n{}".format(iris_dataset['target']))
#the species are encoded as the integers 0 to 2:
#0 = setosa, 1 = versicolor, 2 = virginica

Target: 
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
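
The integer encoding can be decoded back into species names by using each label as an index into the list of target names. The sketch below uses plain Python lists rather than numpy arrays, and rebuilds the target as 50 labels per class in sorted order, which matches the output printed above.

```python
from collections import Counter

target_names = ["setosa", "versicolor", "virginica"]

# 50 samples per class, stored in class order, as in the printed target array.
target = [0] * 50 + [1] * 50 + [2] * 50

# Each integer label is an index into target_names.
species = [target_names[label] for label in target]
counts = Counter(species)

print(species[0], species[50], species[100])
print(counts)
```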


*Measuring Success: Training and Testing Data*

In [18]:
#scikit-learn, a widely used Python ML library
import sklearn #scikit-learn


1. In scikit-learn, data is usually denoted with a capital X, while labels are denoted by a lowercase y (inspired by the standard formulation f(x) = y in mathematics, where x is the input and y is the output).
2. We use a capital X because the data is a 2D array (a matrix) and a lowercase y because the target is a 1D array (a vector).


In [19]:
#train_test_split shuffles the data and splits it into 75% training data and 
#25% testing data by default
from sklearn.model_selection import train_test_split

#the samples are sorted by species, so the data must be shuffled before splitting;
#random_state=0 fixes the seed of the shuffle,
#so the split is the same on every run and for everyone
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state = 0)
# train_test_split returns a list of arrays, unpacked here into four variables

In [20]:
print("X_train shape: {}".format(X_train.shape))
print("y_train shape: {}".format(y_train.shape))
print("X_test shape: {}".format(X_test.shape))
print("y_test shape: {}".format(y_test.shape))

X_train shape: (112, 4)
y_train shape: (112,)
X_test shape: (38, 4)
y_test shape: (38,)


X_train holds 112 data points with 4 columns, one per attribute, and y_train holds the 112 matching labels.
X_test holds the remaining 38 data points with 4 columns, and y_test holds their 38 labels.
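
To see where the 112 and 38 come from, and why shuffling matters, here is a simplified sketch of a shuffle-then-split in plain Python. `shuffle_split` is a hypothetical helper, not scikit-learn's implementation, and Python's `random` module with seed 0 will not reproduce the exact rows that train_test_split selects. The sketch assumes the test size is rounded up, which gives the sizes shown above.

```python
import math
import random


def shuffle_split(n_samples, test_fraction=0.25, seed=0):
    """Return (train_indices, test_indices) after a seeded shuffle."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # same seed -> same order on every run
    n_test = math.ceil(test_fraction * n_samples)  # assumption: round the test size up
    return indices[n_test:], indices[:n_test]


# Labels sorted by class, as in the dataset: 50 of each species.
target = [0] * 50 + [1] * 50 + [2] * 50

train_idx, test_idx = shuffle_split(len(target))
print(len(train_idx), len(test_idx))

# Without shuffling, taking the last 38 samples as the test set
# would give a test set containing a single species.
unshuffled_test_labels = set(target[-38:])
print(unshuffled_test_labels)

# With shuffling, the test set usually mixes the species.
shuffled_test_labels = {target[i] for i in test_idx}
print(shuffled_test_labels)
```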

*Training Data*

In [21]:
#k-nearest neighbors ML model
from sklearn.neighbors import KNeighborsClassifier #neighbors is the module and KNeighborsClassifier is a class
knn = KNeighborsClassifier(n_neighbors=1) 
#The most important parameter of KNeighborsClassifier is the number of neighbors, which we have set to 1. 
#knn is the classifier object we have created

In [23]:
#(training the model)
#to build the model on the training set we call the fit method of the knn object,
#which takes as arguments the numpy array X_train (the training data)
#and the numpy array y_train (the corresponding training labels)

knn.fit(X_train, y_train) #this line trains the model on the training data

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                     weights='uniform')


*Making Predictions*

Suppose we have found an iris in the wild with a sepal length of 5 cm,
a sepal width of 2.9 cm, a petal length of 1 cm,
and a petal width of 0.2 cm. What species of iris would this be? 
Predicting its species is a first check of whether the model has learned something from the training data.
We put these measurements into a 2D numpy array because scikit-learn expects 2D arrays as data.
Its shape is the number of samples (1) by the number of features (4):


In [25]:
X_new = np.array([[5, 2.9, 1, 0.2]])

print("X_new.shape: {}".format(X_new.shape))


X_new.shape: (1, 4)


In [26]:
prediction = knn.predict(X_new) #calls the predict method to make a prediction

print("Prediction: {}".format(prediction))
print("Predicted Target Name: {}".format(
    iris_dataset['target_names'][prediction]
      ))

Prediction: [0]
Predicted Target Name: ['setosa']
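
With n_neighbors=1 and the Euclidean metric (minkowski with p=2 in the model output above), fitting amounts to remembering the training data, and predicting means copying the label of the closest stored point. The sketch below illustrates that idea in plain Python. `OneNearestNeighbor` is a hypothetical class, not scikit-learn's implementation, and the training rows are a tiny illustrative set: the two setosa rows come from the output earlier in the notebook, while the versicolor and virginica rows are example values chosen for this sketch.

```python
import math


class OneNearestNeighbor:
    """Hypothetical 1-nearest-neighbor classifier using Euclidean distance."""

    def fit(self, X, y):
        # "Training" only stores the data; all the work happens in predict.
        self.X_ = [list(row) for row in X]
        self.y_ = list(y)
        return self

    def predict(self, X):
        predictions = []
        for query in X:
            # Index of the stored point closest to the query.
            nearest = min(
                range(len(self.X_)),
                key=lambda i: math.dist(query, self.X_[i]),
            )
            predictions.append(self.y_[nearest])
        return predictions


# Illustrative training set (not the real 112-row training split).
X_demo = [
    [5.1, 3.5, 1.4, 0.2],  # setosa, from the first rows printed earlier
    [4.9, 3.0, 1.4, 0.2],  # setosa, from the first rows printed earlier
    [7.0, 3.2, 4.7, 1.4],  # versicolor, illustrative values
    [6.3, 3.3, 6.0, 2.5],  # virginica, illustrative values
]
y_demo = [0, 0, 1, 2]
target_names = ["setosa", "versicolor", "virginica"]

model = OneNearestNeighbor().fit(X_demo, y_demo)

# Input is 2D: a list of samples, even when there is only one sample.
X_new = [[5, 2.9, 1, 0.2]]
prediction = model.predict(X_new)
predicted_names = [target_names[label] for label in prediction]

print(prediction)
print(predicted_names)
```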


*Evaluating the Model*

In [31]:
#Now predicting for the test dataset (X_test)
y_pred = knn.predict(X_test) #X_test has 38 data points,
#so we get one prediction for each of the 38 flowers
print("Test set predictions: \n {}".format(y_pred))

Test set predictions: 
 [2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
 2]


In [33]:
#checking what fraction of the model's predictions were correct
print("Test Set Score: {}".format(np.mean(y_pred == y_test))) #for all 38 values
#y_test holds the actual labels of the test dataset (flowers)
#and y_pred holds the labels predicted by the model.

Test Set Score: 0.9736842105263158
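
The score is the fraction of test samples whose predicted label equals the true label. A minimal plain-Python sketch of the same calculation follows; `accuracy` is a hypothetical helper and the four demo labels are made up for illustration. The printed score matches 37/38, which would correspond to one misclassified flower out of 38.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = [t == p for t, p in zip(y_true, y_pred)]
    return sum(matches) / len(matches)  # True counts as 1, False as 0


# Made-up labels: three of the four predictions are correct.
y_true_demo = [0, 1, 2, 2]
y_pred_demo = [0, 1, 2, 1]
demo_score = accuracy(y_true_demo, y_pred_demo)
print(demo_score)

# 37 correct predictions out of 38 gives the score printed above.
print(37 / 38)
```

scikit-learn classifiers also provide a score method, so `knn.score(X_test, y_test)` should report the same mean accuracy without building y_pred by hand.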
