<a href="https://colab.research.google.com/github/vahedshaik/cmpe-255-assignment-5/blob/main/Nearest_Neighb_algo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using different Nearest Neighbor Algorithms

This notebook applies several nearest-neighbor search algorithms to the [Crop Recommendation Dataset](https://www.kaggle.com/datasets/atharvaingle/crop-recommendation-dataset) published by **ATHARVA INGLE** on Kaggle.

**Content**

This dataset was built from rainfall, climate, and fertilizer data available for India.

**Data Fields** 

* ***N*** - ratio of Nitrogen content in soil

* ***P*** - ratio of Phosphorus content in soil

* ***K*** - ratio of Potassium content in soil

* ***temperature*** - temperature in degrees Celsius

* ***humidity*** - relative humidity in %

* ***ph*** - pH value of the soil

* ***rainfall*** - rainfall in mm


Downloading the dataset from Kaggle and loading it into a DataFrame


In [None]:
# Enabling Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
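# Installing and configuring the Kaggle CLI
!pip install kaggle
# The CLI reads its credentials from ~/.kaggle/kaggle.json, so create the folder first
!mkdir -p ~/.kaggle
!cp /content/drive/MyDrive/kaggle.json ~/.kaggle/kaggle.json
!chmod 600 ~/.kaggle/kaggle.json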

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
# Link for Dataset - https://www.kaggle.com/datasets/atharvaingle/crop-recommendation-dataset
!kaggle datasets download -d atharvaingle/crop-recommendation-dataset

Downloading crop-recommendation-dataset.zip to /content
  0% 0.00/63.7k [00:00<?, ?B/s]
100% 63.7k/63.7k [00:00<00:00, 33.5MB/s]


In [None]:
!unzip /content/crop-recommendation-dataset
!mkdir crop-dataset
!mv '/content/Crop_recommendation.csv' '/content/crop-dataset/'

Archive:  /content/crop-recommendation-dataset.zip
  inflating: Crop_recommendation.csv  


In [None]:
import pandas as pds

# Reading the CSV file
file = '/content/crop-dataset/Crop_recommendation.csv'
dataset = pds.read_csv(file)

# Displaying the first rows of the CSV file
dataset.head()

Unnamed: 0,N,P,K,temperature,humidity,ph,rainfall,label
0,90,42,43,20.879744,82.002744,6.502985,202.935536,rice
1,85,58,41,21.770462,80.319644,7.038096,226.655537,rice
2,60,55,44,23.004459,82.320763,7.840207,263.964248,rice
3,74,35,40,26.491096,80.158363,6.980401,242.864034,rice
4,78,42,42,20.130175,81.604873,7.628473,262.71734,rice


### Importing faiss and pickle

In [None]:
# The maintained CPU build on PyPI is published as faiss-cpu; it is still imported as `faiss`
!pip install faiss-cpu


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
!sudo apt-get install libomp-dev


Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'sudo apt autoremove' to remove it.
The following additional packages will be installed:
  libomp5
Suggested packages:
  libomp-doc
The following NEW packages will be installed:
  libomp-dev libomp5
0 upgraded, 2 newly installed, 0 to remove and 5 not upgraded.
Need to get 239 kB of archives.
After this operation, 804 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libomp5 amd64 5.0.1-1 [234 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libomp-dev amd64 5.0.1-1 [5,088 B]
Fetched 239 kB in 0s (1,012 kB/s)

In [None]:
import pickle
import faiss
import numpy as np

### Splitting the dataset into Vectors and Labels

In [None]:
data_vectors = dataset.drop(['label'], axis = 1)
data_vectors

Unnamed: 0,N,P,K,temperature,humidity,ph,rainfall
0,90,42,43,20.879744,82.002744,6.502985,202.935536
1,85,58,41,21.770462,80.319644,7.038096,226.655537
2,60,55,44,23.004459,82.320763,7.840207,263.964248
3,74,35,40,26.491096,80.158363,6.980401,242.864034
4,78,42,42,20.130175,81.604873,7.628473,262.717340
...,...,...,...,...,...,...,...
2195,107,34,32,26.774637,66.413269,6.780064,177.774507
2196,99,15,27,27.417112,56.636362,6.086922,127.924610
2197,118,33,30,24.131797,67.225123,6.362608,173.322839
2198,117,32,34,26.272418,52.127394,6.758793,127.175293


In [None]:
# faiss expects a contiguous float32 NumPy array
data_vectors = data_vectors.values
data_vectors = np.ascontiguousarray(data_vectors, dtype=np.float32)

In [None]:
data_labels = dataset['label']
data_labels.head()

0    rice
1    rice
2    rice
3    rice
4    rice
Name: label, dtype: object

In [None]:
data_labels = data_labels.values

### Locality Sensitive Hashing


In [None]:
class LSHIndex():
    def __init__(self, vectors, labels):
        self.dimension = vectors.shape[1]
        self.vectors = vectors
        self.labels = labels

    def build(self, num_bits=10):
        # num_bits is the length of the binary hash code stored for each vector
        self.index = faiss.IndexLSH(self.dimension, num_bits)
        self.index.add(self.vectors)

    def query(self, vectors, k=10):
        # search() returns one row per query vector; only the first query is used here
        distances, indices = self.index.search(vectors, k)
        return [self.labels[i] for i in indices[0]]


In [None]:
lsh_index = LSHIndex(data_vectors, data_labels)
lsh_index.build()

In [None]:
lsh_index.query(np.array([data_vectors[889]]))

['mothbeans',
 'mothbeans',
 'mothbeans',
 'mothbeans',
 'mothbeans',
 'mothbeans',
 'mothbeans',
 'mothbeans',
 'mothbeans',
 'mothbeans']

In [None]:
lsh_index.query(np.array([data_vectors[0]]))

['rice',
 'rice',
 'rice',
 'rice',
 'rice',
 'rice',
 'rice',
 'rice',
 'rice',
 'rice']
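
The idea behind sign-based locality sensitive hashing can be sketched in plain Python: each vector is projected onto a few random hyperplanes, the signs of the projections form a short binary code, and codes are compared by Hamming distance. This is a minimal illustration, not the faiss implementation; whether `faiss.IndexLSH` matches it in detail is an assumption, and the helper names `make_hyperplanes`, `hash_vector`, and `hamming` are made up for this sketch. The two rows come from the `dataset.head()` output above.

```python
import random

def make_hyperplanes(dimension, num_bits, seed=0):
    """Draw num_bits random hyperplane normals with Gaussian entries."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dimension)]
            for _ in range(num_bits)]

def hash_vector(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple(
        1 if sum(v * h for v, h in zip(vector, plane)) >= 0 else 0
        for plane in hyperplanes
    )

def hamming(code_a, code_b):
    """Number of bit positions in which two codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))

# Rows 0 and 1 from dataset.head(), plus row 0 scaled by a factor of 2.
row_0 = [90, 42, 43, 20.879744, 82.002744, 6.502985, 202.935536]
row_1 = [85, 58, 41, 21.770462, 80.319644, 7.038096, 226.655537]
row_0_scaled = [2 * x for x in row_0]

planes = make_hyperplanes(dimension=7, num_bits=10)
code_0 = hash_vector(row_0, planes)
code_1 = hash_vector(row_1, planes)
code_scaled = hash_vector(row_0_scaled, planes)

print("code of row 0:          ", code_0)
print("Hamming(row 0, row 1):  ", hamming(code_0, code_1))
print("Hamming(row 0, 2*row 0):", hamming(code_0, code_scaled))
```

Because only the sign of each projection is kept, scaling a vector by a positive factor does not change its code. More bits give finer buckets at the cost of longer codes.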

### Exhaustive Search


In [None]:
class ExhaustiveIndex():
    def __init__(self, vectors, labels):
        self.dimension = vectors.shape[1]
        self.vectors = vectors
        self.labels = labels

    def build(self):
        # IndexFlatL2 stores the raw vectors and compares a query against every one of them
        self.index = faiss.IndexFlatL2(self.dimension)
        self.index.add(self.vectors)

    def query(self, vectors, k=10):
        # search() returns one row per query vector; only the first query is used here
        distances, indices = self.index.search(vectors, k)
        return [self.labels[i] for i in indices[0]]

In [None]:
exact_index = ExhaustiveIndex(data_vectors, data_labels)
exact_index.build()

In [None]:
exact_index.query(
  np.array([data_vectors[776]])
)

['blackgram',
 'blackgram',
 'blackgram',
 'blackgram',
 'blackgram',
 'blackgram',
 'blackgram',
 'blackgram',
 'blackgram',
 'blackgram']
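
For intuition, an exhaustive search can be written in a few lines of plain Python: compute the squared L2 distance from the query to every stored vector, sort, and keep the `k` closest. The sketch below is only an illustration of that idea, not of how faiss implements it. The four rice rows and the query come from the `dataset.head()` output above; the two rows labeled `hypothetical_crop` are invented for contrast and are not part of the dataset.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def exhaustive_query(vectors, labels, query, k=3):
    """Compare the query with every stored vector and return the k nearest labels."""
    order = sorted(range(len(vectors)),
                   key=lambda i: squared_l2(vectors[i], query))
    return [labels[i] for i in order[:k]]

vectors = [
    # Rows 0-3 from dataset.head()
    [90, 42, 43, 20.879744, 82.002744, 6.502985, 202.935536],
    [85, 58, 41, 21.770462, 80.319644, 7.038096, 226.655537],
    [60, 55, 44, 23.004459, 82.320763, 7.840207, 263.964248],
    [74, 35, 40, 26.491096, 80.158363, 6.980401, 242.864034],
    # Invented rows, far away from the rice samples (NOT real data)
    [20, 130, 200, 22.0, 90.0, 6.0, 110.0],
    [25, 135, 195, 23.0, 92.0, 5.8, 105.0],
]
labels = ["rice", "rice", "rice", "rice",
          "hypothetical_crop", "hypothetical_crop"]

# Row 4 from dataset.head() serves as the query.
query = [78, 42, 42, 20.130175, 81.604873, 7.628473, 262.71734]
nearest = exhaustive_query(vectors, labels, query, k=3)
print(nearest)
```

The cost grows linearly with the number of stored vectors, which is the motivation for the approximate indexes in the other sections.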

### Product Quantization


In [None]:
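class ProductQuantizationIndex():
    def __init__(self, vectors, labels):
        self.dimension = vectors.shape[1]
        self.vectors = vectors.astype('float32')
        self.labels = labels

    def build(self, number_of_partitions=1, number_of_subquantizers=1, bits_per_code=2):
        # Coarse quantizer that assigns every vector to one partition (inverted list)
        quantizer = faiss.IndexFlatL2(self.dimension)
        # faiss.IndexIVFPQ(quantizer, d, nlist, M, nbits):
        #   nlist = number of partitions
        #   M     = number of sub-quantizers (d must be divisible by M)
        #   nbits = bits used to encode each sub-vector
        # How many partitions are visited per query is set separately via self.index.nprobe.
        self.index = faiss.IndexIVFPQ(quantizer,
                                      self.dimension,
                                      number_of_partitions,
                                      number_of_subquantizers,
                                      bits_per_code)
        self.index.train(self.vectors)
        self.index.add(self.vectors)

    def query(self, vectors, k=10):
        # search() returns one row per query vector; only the first query is used here
        distances, indices = self.index.search(vectors, k)
        return [self.labels[i] for i in indices[0]]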

In [None]:
product_quantization_index = ProductQuantizationIndex(data_vectors, data_labels)
product_quantization_index.build()


In [None]:
product_quantization_index.query(np.array([data_vectors[1000]]))

['maize',
 'maize',
 'maize',
 'maize',
 'maize',
 'maize',
 'maize',
 'maize',
 'maize',
 'maize']
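
Product quantization splits each vector into equally sized sub-vectors and encodes every sub-vector with a small code. In `faiss.IndexIVFPQ(quantizer, d, nlist, M, nbits)` the fourth argument `M` is the number of sub-vectors and the fifth, `nbits`, is the number of bits per sub-vector code, so `d` has to be divisible by `M`. The sketch below only illustrates that constraint for the 7 features used here; the helper names are made up for this example and are not faiss functions.

```python
def valid_subquantizer_counts(dimension):
    """All values of M that split a vector of this dimension evenly."""
    return [m for m in range(1, dimension + 1) if dimension % m == 0]

def split_into_subvectors(vector, num_subquantizers):
    """Cut a vector into num_subquantizers contiguous, equally sized pieces."""
    if len(vector) % num_subquantizers != 0:
        raise ValueError("dimension must be divisible by the number of sub-quantizers")
    size = len(vector) // num_subquantizers
    return [vector[i:i + size] for i in range(0, len(vector), size)]

def code_size_bits(num_subquantizers, bits_per_code):
    """Bits needed per stored vector, before rounding up to whole bytes."""
    return num_subquantizers * bits_per_code

# Row 0 from dataset.head(): N, P, K, temperature, humidity, ph, rainfall
row_0 = [90, 42, 43, 20.879744, 82.002744, 6.502985, 202.935536]

print(valid_subquantizer_counts(len(row_0)))  # 7 is prime, so only 1 and 7 work
print(split_into_subvectors(row_0, 7))        # seven sub-vectors of one value each
print(code_size_bits(1, 2))                   # one sub-quantizer with 2 bits per code
```

With a single sub-quantizer and 2 bits per code, every vector in a partition is represented by one of only four centroids, so the compression is very coarse.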

### Trees and Graphs

In [None]:
!pip install annoy


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting annoy
  Downloading annoy-1.17.1.tar.gz (647 kB)
     |████████████████████████████████| 647 kB 5.0 MB/s
Building wheels for collected packages: annoy
  Building wheel for annoy (setup.py) ... done
  Created wheel for annoy: filename=annoy-1.17.1-cp37-cp37m-linux_x86_64.whl size=395181 sha256=1ac98c4efb635346c3779b5767e0cdab9767c31185a5e34703c41d571b45f90a
  Stored in directory: /root/.cache/pip/wheels/81/94/bf/92cb0e4fef8770fe9c6df0ba588fca30ab7c306b6048ae8a54
Successfully built annoy
Installing collected packages: annoy
Successfully installed annoy-1.17.1


In [None]:
import annoy

In [None]:
class AnnoyIndex():
    def __init__(self, vectors, labels):
        self.dimension = vectors.shape[1]
        self.vectors = vectors
        self.labels = labels

    def build(self, number_of_trees=5):
        # Pass the metric explicitly; Annoy 1.17 falls back to 'angular'
        # (with a deprecation warning) when it is omitted
        self.index = annoy.AnnoyIndex(self.dimension, 'angular')
        for i, vec in enumerate(self.vectors):
            self.index.add_item(i, vec.tolist())
        self.index.build(number_of_trees)

    def query(self, vector, k=10):
        # search_k limits how many nodes are inspected; small values trade accuracy for speed
        indices = self.index.get_nns_by_vector(vector.tolist(), k, search_k=7)
        return [self.labels[i] for i in indices]

In [None]:
annoy_index = AnnoyIndex(data_vectors, data_labels)
annoy_index.build()

  


In [None]:
annoy_index.query(data_vectors[267])

['chickpea',
 'chickpea',
 'chickpea',
 'chickpea',
 'chickpea',
 'chickpea',
 'chickpea',
 'chickpea',
 'chickpea',
 'chickpea']
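
Annoy's `angular` metric is the Euclidean distance between length-normalized vectors, i.e. `sqrt(2 * (1 - cos(u, v)))`, so it ignores the magnitude of the vectors. The plain-Python sketch below contrasts it with ordinary Euclidean distance on row 0 of `dataset.head()` and a doubled copy of that row; the function names are illustrative and not part of the Annoy API.

```python
import math

def euclidean(a, b):
    """Ordinary L2 distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def angular(a, b):
    """sqrt(2 * (1 - cosine similarity)); depends only on direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cosine = dot / (norm_a * norm_b)
    # Clamp at 0 to guard against tiny negative values from rounding.
    return math.sqrt(max(0.0, 2.0 * (1.0 - cosine)))

row_0 = [90, 42, 43, 20.879744, 82.002744, 6.502985, 202.935536]
row_0_doubled = [2 * x for x in row_0]

print("euclidean:", euclidean(row_0, row_0_doubled))
print("angular:  ", angular(row_0, row_0_doubled))
```

Two samples whose features differ only by a common factor look identical under the angular metric but far apart under Euclidean distance, which is worth keeping in mind when comparing Annoy's results with the L2-based faiss indexes.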

### Hierarchical Navigable Small World 

In [None]:
!pip install nmslib

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting nmslib
  Downloading nmslib-2.1.1-cp37-cp37m-manylinux2010_x86_64.whl (13.5 MB)
     |████████████████████████████████| 13.5 MB 5.1 MB/s
Collecting pybind11<2.6.2
  Downloading pybind11-2.6.1-py2.py3-none-any.whl (188 kB)
     |████████████████████████████████| 188 kB 57.0 MB/s
Installing collected packages: pybind11, nmslib
Successfully installed nmslib-2.1.1 pybind11-2.6.1


In [None]:
import nmslib

In [None]:
class NMSLIBIndex():
    def __init__(self, vectors, labels):
        self.dimension = vectors.shape[1]
        self.vectors = vectors.astype('float32')
        self.labels = labels

    def build(self):
        # HNSW graph index using cosine similarity as the distance space
        self.index = nmslib.init(method='hnsw', space='cosinesimil')
        self.index.addDataPointBatch(self.vectors)
        self.index.createIndex({'post': 2})

    def query(self, vector, k=10):
        # knnQuery returns a pair (ids, distances) for a single query vector
        indices, distances = self.index.knnQuery(vector, k=k)
        return [self.labels[i] for i in indices]

In [None]:
hnsw_index = NMSLIBIndex(data_vectors, data_labels)
hnsw_index.build()

In [None]:
hnsw_index.query(data_vectors[1])

['rice',
 'rice',
 'rice',
 'jute',
 'rice',
 'rice',
 'rice',
 'rice',
 'jute',
 'rice']
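
When the returned neighbors disagree, as in the HNSW result above, one simple way to turn the list into a single recommendation is a majority vote. The sketch below applies it to exactly the labels printed above; `majority_vote` is a hypothetical helper written for this illustration, not part of nmslib.

```python
from collections import Counter

# Neighbor labels copied from the HNSW query output above
neighbor_labels = ['rice', 'rice', 'rice', 'jute', 'rice',
                   'rice', 'rice', 'rice', 'jute', 'rice']

def majority_vote(labels):
    """Return the most frequent label and the share of neighbors voting for it."""
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(labels)

prediction, share = majority_vote(neighbor_labels)
print(prediction, share)
```

Here 8 of the 10 neighbors are `rice`, so the vote still recovers the label of the query row, and the share gives a rough sense of how confident that recommendation is.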