#### **Wheat Seed Classification**

In this assignment, you will use the [Wheat Seed Dataset](https://archive.ics.uci.edu/ml/datasets/seeds) to classify the type of wheat seed based on the measurements of the seed. The dataset contains 7 attributes and 210 instances. The attributes are:

1. Area
2. Perimeter
3. Compactness
4. Length of Kernel
5. Width of Kernel
6. Asymmetry Coefficient
7. Length of Kernel Groove

Based on the attributes, the dataset contains 3 classes:

1. Kama
2. Rosa
3. Canadian

The dataset file `seeds_dataset.csv` contains the data. The first 7 columns are the attributes and the last column is the class label. The class labels are encoded as 1, 2, and 3 for Kama, Rosa, and Canadian, respectively. The goal of this assignment is to build a classifier that predicts the type of wheat seed from its measurements. Follow the instructions below to complete the assignment.

#### **Step 1:** Download the dataset from [here](https://drive.google.com/file/d/1ZnGOVGFrNv0L1ctT8SO8Y3WfjD2HShgK/view?usp=sharing). It should be saved as `seeds_dataset.csv`.


#### **Step 2:** Upload the dataset to your Google Drive and mount your Google Drive to Colab.


In [22]:
from google.colab import drive
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

In [11]:
# write your code here
from google.colab import drive
drive.mount('/content/drive')
dataset_path = "/content/drive/MyDrive/Dataset/seeds_dataset.csv"

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


#### **Step 3:** Read the dataset into a DataFrame named `data` using pandas' built-in function `pd.read_csv()`, then convert it into a numpy array using `data.to_numpy()`. Pass the following parameters to `pd.read_csv()`:
    

*   `filepath_or_buffer`: The path to the dataset
*   `delimiter`: The delimiter used in the dataset to separate the attributes (Hint: Use `'\t'` as the delimiter)
*   `header`: The column header used in the dataset (Hint: Use `None` as the header)


In [12]:
# write your code here
data = pd.read_csv(dataset_path, delimiter='\t', header=None)
data_array = data.to_numpy()
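To see what these parameters do without touching Google Drive, here is a minimal sketch that parses a small in-memory sample. The three rows are made-up values for illustration only, not rows copied from the dataset, and the names `sample`, `demo`, and `demo_array` are hypothetical.

```python
import io
import pandas as pd

# Hypothetical sample: 7 tab-separated measurements plus a class label per row.
sample = (
    "15.0\t14.5\t0.87\t5.7\t3.3\t2.2\t5.2\t1\n"
    "18.9\t16.2\t0.90\t6.2\t3.8\t3.6\t6.0\t2\n"
    "11.8\t13.2\t0.85\t5.2\t2.8\t4.8\t5.0\t3\n"
)

# header=None tells pandas that the first line is data, not column names,
# so the columns are simply labelled 0..7.
demo = pd.read_csv(io.StringIO(sample), delimiter="\t", header=None)
demo_array = demo.to_numpy()

print(demo_array.shape)  # (3, 8)
```

Without `header=None`, pandas would treat the first row as column names and the array would have one row fewer.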


#### **Step 4:** Shuffle the dataset using `np.random.shuffle()`. Pass the following parameters to the function:
* `x`: The dataset


In [15]:
# write your code here
np.random.shuffle(data_array)
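Note that `np.random.shuffle()` reorders the rows in place and returns `None`, and that each row stays intact, so every label remains attached to its features. The sketch below illustrates this on a toy array. The seed value 42 is an arbitrary assumption, added only so that repeated runs give the same order; the assignment itself does not ask for a seed.

```python
import numpy as np

# Toy array: each row is [feature, label]; values are illustrative only.
toy = np.array([[0.1, 1], [0.2, 2], [0.3, 3], [0.4, 1]])
original_rows = {tuple(row) for row in toy}

np.random.seed(42)               # arbitrary seed, chosen only for repeatability
result = np.random.shuffle(toy)  # shuffles the rows of toy in place

print(result)  # None: the array itself was modified
shuffled_rows = {tuple(row) for row in toy}
print(shuffled_rows == original_rows)  # True: rows were reordered, not altered
```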

#### **Step 5:**  Split the dataset into features and labels. The first 7 columns of the dataset are the features and the last column is the label. Use numpy's array slicing to split the dataset into features and labels. (Hint: Use `:` to select all the rows and `0:7` to select the first 7 columns for features and `7` to select the last column for labels)


In [16]:
# write your code here
features = data_array[:, 0:7]
labels = data_array[:, 7]

#### **Step 6:**  Split the dataset into training and testing sets. Use numpy's built-in function `np.split()` to split the dataset into training and testing sets. Pass the following parameters to the function:
* `ary`: The dataset
* `indices_or_sections`: The number of instances in the training set (Hint: Use `int(0.8 * len(dataset))` to get the number of instances in the training set)
* `axis`: The axis to split the dataset (Hint: Use `0` to split the dataset along the rows)


In [17]:
# write your code here
train_features, test_features = np.split(features, [int(0.8 * len(features))], axis = 0) 
train_labels, test_labels = np.split(labels, [int(0.8 * len(labels))], axis = 0) 
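As a quick sanity check on the split sizes, the following sketch applies the same call to a placeholder array with the dataset's shape (210 rows, 8 columns). The array of zeros merely stands in for the real data.

```python
import numpy as np

# Placeholder array with the same shape as the dataset.
dataset = np.zeros((210, 8))

split_index = int(0.8 * len(dataset))  # 168
train, test = np.split(dataset, [split_index], axis=0)

print(train.shape, test.shape)  # (168, 8) (42, 8)
```

The split index is wrapped in a list on purpose: a bare integer would instead ask `np.split()` for that many equal-sized sections.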

#### **Step 7:**  Find the minimum and maximum values of each feature in the training set. Use numpy's built-in functions `np.min()` and `np.max()`. Pass the following parameters to each function:
* `a`: The training set
* `axis`: The axis to find the minimum and maximum values (Hint: Use `0` to find the minimum and maximum values along the columns)

In [19]:
# write your code here
minTrainValue = np.min(train_features, 0)
maxTrainValue = np.max(train_features, 0)

#### **Step 8:**  In this step, you must normalize the training and test sets. Normalization is an essential preprocessing step in many machine learning projects, especially for distance-based methods such as k-NN. It brings all the features to the same scale. If the features are not normalized, features with large values will dominate features with small values.

For example, suppose we have a dataset with two features, the number of bedrooms in a house and the size of its garden in square feet, and we are trying to predict the rent. If the features are not normalized, the feature with larger values takes precedence over the feature with smaller values. Here, the garden area has the larger values, so the model will base its rent prediction almost entirely on the size of the garden. Such a model will be faulty, since most people will not pay much higher rent for more garden area. Normalizing the features prevents this. Consider the following houses:
* House 1: 2 bedrooms, 2500 sq. ft. garden
* House 2: 3 bedrooms, 500 sq. ft. garden
* House 3: 7 bedrooms, 2300 sq. ft. garden

Since most people won't pay more for a larger garden, the rent for House 1 should be closer to that of House 2 than to that of House 3. However, if we give this data to a k-NN model without normalization, it computes the Euclidean distance between the test instance and each training instance and bases its prediction on the closest training instance.

Taking House 1 as the test instance and Houses 2 and 3 as the training instances, the Euclidean distances are:

* Distance between house 1 and house 2: $\sqrt{(2-3)^2 + (2500-500)^2} \approx 2000$
* Distance between house 1 and house 3: $\sqrt{(2-7)^2 + (2500-2300)^2} \approx 200$

As you can see, the distance between Houses 1 and 3 is much shorter than that between Houses 1 and 2, because the garden area dominates the calculation. As a result, the model will predict that House 1 rents for about the same as House 3, which is not what we expect. To normalize the features, subtract the minimum value of each feature from all the values of that feature and divide the result by the range of the feature. The range of a feature is the difference between its maximum and minimum values. The formula for normalization is given below:

$$x_{\text{normalized}} = \frac{x - \min(x)}{\max(x) - \min(x)}$$

<html>
<center> 
where $x$ is the feature vector. The above formula scales each feature to the range 0 to 1.
</center>
</html>
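The formula can be sketched in numpy as follows. This is a minimal illustration rather than the required solution: the helper name `min_max_normalize` and the toy arrays are hypothetical, and mapping a constant column to 0 is an assumption made here to avoid division by zero.

```python
import numpy as np

def min_max_normalize(x, x_min, x_max):
    """Scale each column of x using the given per-column minimum and maximum."""
    value_range = x_max - x_min
    # Assumption: a constant column (zero range) is mapped to 0
    # instead of triggering a division by zero.
    safe_range = np.where(value_range == 0, 1, value_range)
    return (x - x_min) / safe_range

# Toy data, illustrative values only.
train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]])
test = np.array([[2.0, 40.0]])

train_min, train_max = train.min(axis=0), train.max(axis=0)
print(min_max_normalize(train, train_min, train_max))
print(min_max_normalize(test, train_min, train_max))
```

When the test set is scaled with the training minimum and maximum, a test value that lies outside the training range lands outside 0 to 1 (the 40.0 above becomes 1.5). That is expected and keeps both sets on the same scale.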



Let's normalize the features in the above example. To do so, we need to find the minimum and maximum values of each feature. The minimum and maximum values of the number of bedrooms are 2 and 7, respectively. The minimum and maximum values of the garden area are 500 and 2500, respectively. The normalized values of the features are given below:

* House 1: bedrooms $(2 - 2) / 5 = 0$, garden area $(2500 - 500) / 2000 = 1$
* House 2: bedrooms $(3 - 2) / 5 = 0.2$, garden area $(500 - 500) / 2000 = 0$
* House 3: bedrooms $(7 - 2) / 5 = 1$, garden area $(2300 - 500) / 2000 = 0.9$

Now, the Euclidean distances between the test instance and the training instances are:

* Distance between house 1 and house 2: $\sqrt{(0-0.2)^2 + (1-0)^2} \approx 1.02$
* Distance between house 1 and house 3: $\sqrt{(0-1)^2 + (1-0.9)^2} \approx 1.00$

As you can see, the two distances are now almost equal, whereas before normalization House 3 appeared ten times closer to House 1 than House 2 did. The bedroom difference between Houses 1 and 3 now carries as much weight as the garden difference between Houses 1 and 2, so garden area no longer decides the outcome on its own. This is what normalization does: it puts all features on the same scale, which prevents features with larger values from dominating features with smaller values. Note that it weights the features equally; it does not make bedrooms count for more than garden area.
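The house example can be checked numerically. The sketch below recomputes the distances from House 1 to Houses 2 and 3, first on the raw values and then after min-max normalization. The helper `distance` and the array names are hypothetical.

```python
import numpy as np

# Rows: House 1, House 2, House 3; columns: bedrooms, garden area (sq. ft.)
houses = np.array([[2.0, 2500.0], [3.0, 500.0], [7.0, 2300.0]])

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return np.sqrt(np.sum((a - b) ** 2))

# Min-max normalization, column by column.
col_min, col_max = houses.min(axis=0), houses.max(axis=0)
normalized = (houses - col_min) / (col_max - col_min)

for name, data in [("raw", houses), ("normalized", normalized)]:
    d12 = distance(data[0], data[1])
    d13 = distance(data[0], data[2])
    print(f"{name}: d(1,2) = {d12:.3f}, d(1,3) = {d13:.3f}")
```

Running it prints both pairs of distances, so the hand calculation can be compared against the computed values.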

Use the minimum and maximum values you found in the previous step to normalize the training and test sets.


In [20]:
# write your code here
# Scale both sets with the training-set minimum and maximum from Step 7,
# so that the test set goes through exactly the same transformation.
trainRange = maxTrainValue - minTrainValue
normTrainFeature = (train_features - minTrainValue) / trainRange
normTestFeature = (test_features - minTrainValue) / trainRange

#### **Step 9:**  Now, you have to build a classifier to classify the type of wheat seed based on its measurements. Use the K-Nearest Neighbors algorithm to build the classifier, with the Euclidean distance to find the nearest neighbors.


In [24]:
# write your code here
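# Fit on the normalized features from Step 8. KNeighborsClassifier uses the
# Euclidean distance by default (Minkowski metric with p=2).
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(normTrainFeature, train_labels)
predictions = classifier.predict(normTestFeature)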

def euclidean_distance(sample, train_features):
    # Distance from one sample to every row of train_features
    return np.sqrt(np.sum((sample - train_features)**2, axis=1))

def nearest_neighbor(sample, train_features, train_labels):
    # Return the label of the single closest training instance (1-NN)
    distances = euclidean_distance(sample, train_features)
    return train_labels[np.argmin(distances)]

class_names = {1: "Kama", 2: "Rosa", 3: "Canadian"}

# Check one test instance (index 7) against its actual label
predicted_label = nearest_neighbor(normTestFeature[7], normTrainFeature, train_labels)
print("Predicted:", class_names[int(predicted_label)])
print("Actual label:", class_names[int(test_labels[7])])

Predicted: Kama
Actual label: Kama
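The hand-written `nearest_neighbor` function only looks at the single closest training instance. As a rough sketch of how the same idea extends to k neighbors with a majority vote, consider the following. The function name `knn_predict` and the toy data are hypothetical, and tie-breaking is simply left to `Counter.most_common`.

```python
from collections import Counter
import numpy as np

def knn_predict(sample, train_x, train_y, k=3):
    """Predict the label of one sample by majority vote among its k nearest neighbors."""
    distances = np.sqrt(np.sum((train_x - sample) ** 2, axis=1))
    nearest = np.argsort(distances)[:k]        # indices of the k closest rows
    votes = Counter(train_y[nearest].tolist())  # count labels among them
    return votes.most_common(1)[0][0]

# Toy, already-normalized data (illustrative values only).
train_x = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0], [0.9, 1.0], [1.0, 0.9]])
train_y = np.array([1, 1, 2, 3, 3])

print(knn_predict(np.array([0.05, 0.05]), train_x, train_y, k=3))  # 1
print(knn_predict(np.array([0.95, 0.95]), train_x, train_y, k=1))  # 3
```

With `k=1` this reduces to the nearest-neighbor rule used above. An odd `k` makes ties less likely, although with three classes they can still occur.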


#### **Step 10:**  Output the number of data points in the testing set and the number of correct predictions made by the classifier for each class.

In [25]:
# write your code here
def accuracy(predictions, labels):
    # Percentage of correct predictions, plus the raw count
    correct_predictions = np.sum(predictions == labels)
    score = round((correct_predictions / len(labels))*100, 3)
    return score, correct_predictions

score, correct = accuracy(predictions, test_labels)
print("Accuracy: {0}%".format(score))
print("Number of data points: {0}".format(len(test_labels)))
print("Correct predictions: {0}".format(correct))

Accuracy: 92.857%
Number of data points: 42
Correct predictions: 39
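The step also asks for the number of correct predictions for each class, which the overall count above does not show. One possible way to break the count down is sketched below. The helper `per_class_correct` is hypothetical, and the labels and predictions are illustrative values, not results from the seed dataset.

```python
import numpy as np

def per_class_correct(predictions, labels):
    """Return {class: (correct, total)} for every class present in labels."""
    summary = {}
    for cls in np.unique(labels):
        mask = labels == cls  # test instances that truly belong to this class
        correct = int(np.sum(predictions[mask] == cls))
        summary[int(cls)] = (correct, int(np.sum(mask)))
    return summary

# Illustrative labels and predictions only.
labels = np.array([1, 1, 2, 2, 3, 3, 3])
predictions = np.array([1, 2, 2, 2, 3, 1, 3])

class_names = {1: "Kama", 2: "Rosa", 3: "Canadian"}
for cls, (correct, total) in per_class_correct(predictions, labels).items():
    print(f"{class_names[cls]}: {correct} of {total} correct")
```

In the notebook, the same helper could be called with `predictions` and `test_labels` from the earlier cells.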
