Using clustering for feature selection.

Here I will show you my own way to extract features for seedling classification.

The features in this case are the pixels that belong to the plant (the mostly green ones).

In [None]:
import numpy as np
import pandas as pd
from PIL import Image
import matplotlib.pyplot as plt
import scipy.misc as sc

%matplotlib inline

Loading and reshaping an image for processing

In [None]:
pngfile = np.array(Image.open("train/Black-grass/0ace21089.png")) # load a single image and convert it into a NumPy array
plt.imshow(pngfile)
h=pngfile.shape[0] # get height
w=pngfile.shape[1] # get width
pngfile=pngfile.reshape([h*w,3]) # flatten to one row per pixel, 3 colour columns (assumes an RGB image)
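The reshape step can be checked on a toy array. This is only a small sketch with made-up data (the `img` array is not one of the seedling images): it shows that flattening to one row per pixel keeps the pixel order, so the result can later be reshaped back to the image layout.

```python
import numpy as np

# Toy "image": height 2, width 3, 3 colour channels (made-up values).
img = np.arange(2 * 3 * 3).reshape(2, 3, 3)
h, w = img.shape[:2]

flat = img.reshape([h * w, 3])       # one row per pixel, as fed to the clustering
restored = flat.reshape([h, w, 3])   # back to the image layout

print(flat.shape)                     # (6, 3)
print(np.array_equal(restored, img))  # True
```

Row `i` of `flat` corresponds to the pixel at row `i // w`, column `i % w` of the image.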

My idea is to use clustering to separate the different objects in the picture.

I used the simple KMeans class from sklearn and trained it on a single picture.

I chose 20 clusters, but this number might not be optimal.
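One common way to get a feeling for the number of clusters is to compare the KMeans inertia (the sum of squared distances to the nearest centre) for several values of k. The sketch below is not part of my original experiment: it runs on synthetic colour blobs instead of a real image, and the candidate values of k are arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the (h*w, 3) pixel array: three colour blobs (assumed data).
rng = np.random.RandomState(0)
centers = np.array([[40, 120, 40], [120, 90, 60], [200, 200, 200]], dtype=float)
pixels = np.vstack([c + rng.normal(scale=8.0, size=(300, 3)) for c in centers])

candidate_k = [1, 2, 3, 5, 8]  # arbitrary values chosen for illustration
inertias = []
for k in candidate_k:
    km = KMeans(init='k-means++', n_clusters=k, n_init=1, random_state=1).fit(pixels)
    inertias.append(km.inertia_)  # sum of squared distances to the nearest centre

for k, inertia in zip(candidate_k, inertias):
    print(k, round(inertia))
```

The inertia usually drops sharply until k reaches the number of clearly distinct colours and flattens afterwards. On real photos the bend is less sharp, so treat it as a hint rather than a rule.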

In [None]:
np.random.seed(1) # To be sure you will get the same clusters

from sklearn.cluster import KMeans

kmeans = KMeans(init='k-means++', n_clusters=20, n_init=1)
kmeans.fit(pngfile) # Clustering

pred=kmeans.predict(pngfile[:,:]) # cluster index of every pixel
pred=pred.reshape([h,w]) # reshape back to the image layout for visualization
plt.imshow(pred) # show the result

As one can see above, most of the plant's pixels belong to a single cluster.

In my case the plant's pixels belong to cluster #12 (the number depends on the random initialization, so it may differ in your run). Let's check it!

In [None]:
res=(pred==12)*1 # 1 if the pixel belongs to cluster 12, otherwise 0
plt.imshow(res)

As you can see, the extraction looks pretty good. The *1 converts the boolean mask to integers.
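KMeans numbers its clusters arbitrarily, so the hard-coded 12 only holds for this particular fit. One possible way around that, sketched here on a synthetic image, is to pick the cluster whose centre looks greenest. The greenness score and the name `plant_cluster` are my assumptions for illustration, not something sklearn provides.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for an image: brown "soil" with a green "plant" square (assumed data).
rng = np.random.RandomState(0)
img = np.empty((40, 40, 3))
img[:] = [120, 90, 60]               # soil colour
img[10:30, 10:30] = [40, 160, 40]    # plant colour
img += rng.normal(scale=5.0, size=img.shape)

h, w = img.shape[:2]
pixels = img.reshape(h * w, 3)
km = KMeans(init='k-means++', n_clusters=2, n_init=1, random_state=1).fit(pixels)

# Greenness of each cluster centre: green channel minus the mean of red and blue.
r, g, b = km.cluster_centers_.T
plant_cluster = int(np.argmax(g - (r + b) / 2))

# Same kind of 0/1 mask as above; .astype(int) is equivalent to *1.
mask = (km.predict(pixels) == plant_cluster).reshape(h, w).astype(int)
print(plant_cluster, mask.sum())
```

With many clusters the plant may be split across several of them, in which case a single greenest cluster covers only part of it. A threshold on the greenness score would be one way to keep more than one cluster.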

Let's check it on another example.

In [None]:
pngfile = Image.open("train/Maize/a1d7080b1.png")
pngfile=np.array(pngfile)
h=pngfile.shape[0]
w=pngfile.shape[1]
pngfile=pngfile.reshape([h*w,3])
pred=kmeans.predict(pngfile[:,:])
pred=(pred.reshape([h,w])==12)*1
plt.imshow(pred)

As one can see, this may work for different kinds of plants.

Below is the script I used to process all files the same way.
Training set:

In [None]:
import glob
import os
from tqdm import tqdm_notebook as tqdm

folder_list=[]
for filename in glob.iglob('train/**', recursive=False):
    folder_list.append(os.path.basename(filename)) # class name = folder name, on any OS
    
with open("train.csv",'w') as f, tqdm(total=len(folder_list)) as pbar:
    for folder in folder_list:
        print(folder)
        pbar.update(1)
        for filename in glob.iglob('train/' + folder + '/*.png', recursive=False):
            pngfile = np.array(Image.open(filename))

            # skip anything that is not a 3-channel (RGB) image
            if pngfile.ndim!=3 or pngfile.shape[2]!=3:
                continue

            h=pngfile.shape[0]
            w=pngfile.shape[1]
            pngfile=pngfile.reshape([h*w,3])
            pred=kmeans.predict(pngfile)
            pred=pred.reshape([h,w])

            # plant mask scaled to 0/255, the range scipy.misc.imresize used to produce
            res=((pred==12)*255).astype(np.uint8)
            # resize with PIL, because scipy.misc.imresize is not available in SciPy >= 1.3
            res=np.array(Image.fromarray(res).resize((100,100), Image.BILINEAR)).reshape([10000])

            # one row per image: class label, then 10000 tab-separated pixel values
            f.write(folder+'\t'+'\t'.join(str(v) for v in res)+'\n')

Test set:

In [None]:
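with open("test.csv",'w') as f:
    for filename in tqdm(glob.glob('test/*.png')):
        pngfile = np.array(Image.open(filename))
        h=pngfile.shape[0]
        w=pngfile.shape[1]
        pngfile=pngfile.reshape([h*w,3])
        pred=kmeans.predict(pngfile)
        pred=pred.reshape([h,w])

        # plant mask scaled to 0/255, the range scipy.misc.imresize used to produce
        res=((pred==12)*255).astype(np.uint8)
        # resize with PIL, because scipy.misc.imresize is not available in SciPy >= 1.3
        res=np.array(Image.fromarray(res).resize((100,100), Image.BILINEAR)).reshape([10000])

        # one row per image: 10000 tab-separated pixel values, no label
        f.write('\t'.join(str(v) for v in res)+'\n')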
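The resulting files can be read back with pandas. The snippet below is a minimal sketch, not part of the script above: it writes a tiny demo file in the same layout as train.csv (label first, then tab-separated pixel values, no header) and splits it into labels and features. The file name `train_demo.csv` and the four values per row are made up to keep the example short.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Tiny demo file in the train.csv layout: label, then tab-separated pixel values.
# Four values per row stand in for the 10000 of the real file (assumed toy data).
path = os.path.join(tempfile.mkdtemp(), "train_demo.csv")
with open(path, "w") as f:
    f.write("Maize\t0\t255\t255\t0\n")
    f.write("Black-grass\t0\t0\t255\t0\n")

df = pd.read_csv(path, sep="\t", header=None)         # the file has no header row
labels = df.iloc[:, 0].to_numpy()                     # first column: class name
features = df.iloc[:, 1:].to_numpy(dtype=np.uint8)    # remaining columns: pixel values

print(labels, features.shape)
```

For test.csv there is no label column, so all columns are features.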