## Introduction ##
The idea here is to use a bag-of-visual-words model to classify the images. We will use the SIFT algorithm to extract keypoint descriptors from each image and build the bag of words from them.<br>
More information about this method can be found here:<br><ul>
<li>http://www.cs.cmu.edu/~16385/lectures/Lecture12.pdf</li>
<li>https://www.youtube.com/watch?v=iGZpJZhqEME</li>
</ul>

Some parts of this script are wrapped in functions. This is only a way to avoid errors when I publish this notebook. If you want to use this script, just remove the lines starting with "def ..." and unindent their bodies.

In [None]:
import cv2
import numpy as np
import os
import pandas as pd
import csv

from sklearn.cluster import MiniBatchKMeans
from sklearn.neural_network import MLPClassifier

To do this, we will use the OpenCV (cv2) library to extract keypoints with the SIFT algorithm.

## Extract keypoints from each image ##

In [None]:
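img_path = '../input/images/'
train = pd.read_csv('../input/train.csv')
species = train.species.sort_values().unique()

# SIFT detector used in every step below (cv2.xfeatures2d comes with the opencv-contrib build)
sift = cv2.xfeatures2d.SIFT_create()

# Flat list of all descriptors found in the training images
dico = []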

def step1():
    for leaf in train.id:
        img = cv2.imread(img_path + str(leaf) + ".jpg")
        kp, des = sift.detectAndCompute(img, None)

        # des is None when no keypoint is found in the image
        if des is not None:
            dico.extend(des)

## Clustering ##
We now have an array with a huge number of descriptors. We cannot use all of them to build our model, so we need to cluster them. A rule of thumb is to create k centers with k = number of categories * 10 (in our case, 990).

In [None]:
def step2():
    k = np.size(species) * 10

    batch_size = np.size(os.listdir(img_path)) * 3
    kmeans = MiniBatchKMeans(n_clusters=k, batch_size=batch_size, verbose=1).fit(dico)

I use MiniBatchKMeans rather than KMeans to avoid a MemoryError.

## Creation of the histograms ##
We now represent each image by a histogram, that is, a vector of k values per image. For each keypoint in an image, we find the nearest center and increase its bin by one.

In [None]:
def step3():
    kmeans.verbose = False

    histo_list = []

    for leaf in train.id:
        img = cv2.imread(img_path + str(leaf) + ".jpg")
        kp, des = sift.detectAndCompute(img, None)

        histo = np.zeros(k)
        nkp = len(kp)

        # des is None when no keypoint is found; the histogram then stays at zero
        if des is not None:
            for d in des:
                idx = kmeans.predict([d])
                histo[idx] += 1.0 / nkp  # add 1/nkp directly so the histogram is already normalized

        histo_list.append(histo)
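
The histogram step above can also be written without the inner Python loop. The sketch below is only an illustration on toy data: `bow_histogram`, `centers` and `descriptors` are names I made up for this example, and the nearest center is computed by hand with NumPy instead of calling the fitted `kmeans` object.

```python
import numpy as np

def bow_histogram(descriptors, centers):
    """Return the normalized histogram of nearest-center assignments."""
    k = centers.shape[0]
    if descriptors is None or len(descriptors) == 0:
        return np.zeros(k)  # image without keypoints: histogram stays at zero
    # Squared Euclidean distance from every descriptor to every center
    diff = descriptors[:, None, :] - centers[None, :, :]
    dist2 = (diff ** 2).sum(axis=2)
    nearest = dist2.argmin(axis=1)               # index of the closest center
    counts = np.bincount(nearest, minlength=k)   # number of keypoints per center
    return counts / float(len(descriptors))      # normalize by the number of keypoints

# Toy data (hypothetical): 3 centers in 2-D and 4 descriptors
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
descriptors = np.array([[1.0, 1.0], [9.0, 1.0], [8.0, -1.0], [0.0, 9.0]])
histo = bow_histogram(descriptors, centers)
print(histo)
```

With the real model, `kmeans.predict(des)` should give the same kind of index array in a single call, so only the `np.bincount` line would be needed. The hand-written distance computation builds a large temporary array, so it is better suited to small examples like this one.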

## Training of the neural network ##

In [None]:
def step4():
    X = np.array(histo_list)
    Y = []

    # Convert each species name into its integer index in the sorted species array
    for s in train.species:
        Y.append(np.min(np.nonzero(species == s)))

    mlp = MLPClassifier(verbose=True, max_iter=600000)
    mlp.fit(X, Y)
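
The conversion of species names into integers can be checked on a small example. This is a sketch with made-up placeholder names, not the real training data. It uses `np.unique` with `return_inverse=True`, which is a different route from the loop above but should give the same indices, since `species` is also sorted and unique.

```python
import numpy as np

# Placeholder species names for illustration only
names = np.array(["Quercus_Rubra", "Acer_Opalus", "Quercus_Rubra", "Betula_Pendula"])

# classes: sorted unique names, labels: index of each name inside classes
classes, labels = np.unique(names, return_inverse=True)

# Same result with the loop used in this notebook
loop_labels = [int(np.min(np.nonzero(classes == s))) for s in names]

print(classes)
print(labels)
```

Because the class index follows the sorted order, column i of `predict_proba` corresponds to `classes[i]`, which is what the submission header relies on.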

## Predictions ##

In [None]:
def step5():
    test = pd.read_csv('../input/test.csv')

    result_file = open("sift.csv", "w")
    result_file_obj = csv.writer(result_file)
    result_file_obj.writerow(np.append("id", species))

    for leaf in test.id:
        img = cv2.imread(img_path + str(leaf) + ".jpg")
        kp, des = sift.detectAndCompute(img, None)

        x = np.zeros(k)
        nkp = len(kp)

        # des is None when no keypoint is found; x then stays at zero
        if des is not None:
            for d in des:
                idx = kmeans.predict([d])
                x[idx] += 1.0 / nkp

        res = mlp.predict_proba([x])
        row = []
        row.append(leaf)

        for e in res[0]:
            row.append(e)

        result_file_obj.writerow(row)

    result_file.close()
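
To see what the result file looks like, here is a minimal sketch of the writing part alone. All values are placeholders that I invented for the example (two classes, two image ids, hand-written probabilities), and an in-memory buffer stands in for the real `sift.csv` file.

```python
import csv
import io
import numpy as np

# Placeholder values for illustration only
species = np.array(["Acer_Opalus", "Betula_Pendula"])
ids = [4, 7]
proba = np.array([[0.9, 0.1], [0.2, 0.8]])

buffer = io.StringIO()  # stands in for the file opened in step5
writer = csv.writer(buffer)

# Header: "id" followed by one column per species
writer.writerow(["id"] + species.tolist())

# One row per image: its id, then the probability of each species
for leaf, p in zip(ids, proba):
    writer.writerow([leaf] + p.tolist())

lines = buffer.getvalue().splitlines()
print("\n".join(lines))
```

With a real file, opening it in a `with` block would close it automatically even if an error happens in the loop.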

## Alternative ##
I also ran this script with ORB instead of SIFT and got better results. To do the same, just replace `cv2.xfeatures2d.SIFT_create()` with `cv2.ORB_create()`.