Version 1.1.1

# The task

In this assignment you will implement features based on nearest neighbours. 

A KNN classifier (or regressor) is a very powerful model when the features are homogeneous, and it is very common practice to use KNN as a first-level model. In this homework we will extend the KNN model and compute more features based on nearest neighbors and their distances. 

You will implement a number of features that were among the key features that led the instructors to prizes in the [Otto](https://www.kaggle.com/c/otto-group-product-classification-challenge) and [Springleaf](https://www.kaggle.com/c/springleaf-marketing-response) competitions. Of course, the list of features you will implement can be extended; in fact, in the competitions the list was at least 3 times larger. So when solving a real competition, do not hesitate to make up your own features.   

You can optionally implement multicore feature computation. Nearest neighbours are expensive to compute, so it is preferable to have a parallel version of the algorithm. Knowing how to use `multiprocessing`, `joblib`, and similar tools is a valuable skill, and in this homework you will have a chance to see the benefits of a parallel algorithm. 
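
The serial-versus-parallel pattern described above can be sketched as follows. This is only an illustration under assumptions: `row_feature` and `compute_features` are hypothetical names, and the sketch uses `multiprocessing.pool.ThreadPool`, which exposes the same `map` interface as `multiprocessing.Pool` but avoids pickling. For CPU-bound work such as nearest-neighbour search, a process pool is the more likely choice.

```python
from multiprocessing.pool import ThreadPool


def row_feature(row):
    # Stand-in for an expensive per-object computation
    # (hypothetical; the assignment computes KNN features here).
    return sum(v * v for v in row)


def compute_features(rows, n_jobs=1):
    # Serial path: a plain loop, useful as a reference for correctness.
    if n_jobs == 1:
        return [row_feature(r) for r in rows]
    # Parallel path: `map` returns results in input order.
    with ThreadPool(processes=n_jobs) as pool:
        return pool.map(row_feature, rows)


rows = [[1, 2], [3, 4], [5, 6]]
serial = compute_features(rows, n_jobs=1)
parallel = compute_features(rows, n_jobs=2)
```

Because `map` preserves input order, the serial and parallel results can be compared directly, which is a convenient check when adding parallelism.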

# Check your versions

Some functions we use here are not present in old versions of the libraries, so make sure you have up-to-date software. 

In [213]:
import numpy as np
import pandas as pd 
import sklearn
import scipy.sparse 

for p in [np, pd, sklearn, scipy]:
    print (p.__name__, p.__version__)

numpy 1.13.1
pandas 0.20.3
sklearn 0.19.0
scipy 0.19.1


The versions should be at least:

    numpy 1.13.1
    pandas 0.20.3
    sklearn 0.19.0
    scipy 0.19.1
   
**IMPORTANT!** The results with `scipy=1.0.0` will be different! Make sure you use _exactly_ version `0.19.1`.
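
A version requirement like the ones above could be checked programmatically. The sketch below is a simplified assumption, not part of the assignment: `version_tuple` and `check_version` are hypothetical helpers that only understand plain dotted numeric versions and ignore pre-release suffixes.

```python
def version_tuple(version):
    # Keep only the leading numeric components, e.g. "0.19.1" -> (0, 19, 1).
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def check_version(installed, required, exact=False):
    # exact=True mirrors the scipy requirement; otherwise `required`
    # is treated as a minimum.
    have, want = version_tuple(installed), version_tuple(required)
    return have == want if exact else have >= want


ok_numpy = check_version("1.13.1", "1.13.1")
ok_scipy_exact = check_version("1.0.0", "0.19.1", exact=True)
```

Here `ok_scipy_exact` is `False`, reflecting that `scipy` 1.0.0 does not satisfy an exact requirement of 0.19.1.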

# Load data

Load the features and labels. These features are actually out-of-fold (OOF) predictions of linear models.

In [214]:
train_path = '../readonly/KNN_features_data/X.npz'
train_labels = '../readonly/KNN_features_data/Y.npy'

test_path = '../readonly/KNN_features_data/X_test.npz'
test_labels = '../readonly/KNN_features_data/Y_test.npy'

# Train data
X = scipy.sparse.load_npz(train_path)
Y = np.load(train_labels)

# Test data
X_test = scipy.sparse.load_npz(test_path)
Y_test = np.load(test_labels)

# Out-of-fold features we loaded above were generated with n_splits=4 and skf seed 123
# So it is better to use seed 123 for generating KNN features as well 
skf_seed = 123
n_splits = 4


Below you need to implement features based on nearest neighbors.
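
Every feature is derived from three aligned vectors for a query object: the indices of its nearest training objects, the distances to them (nearest first), and their labels. A brute-force NumPy sketch of that lookup is shown here for intuition; `brute_force_kneighbors` is a hypothetical helper working on small dense arrays with Euclidean distance, whereas the assignment uses `sklearn.neighbors.NearestNeighbors` on sparse data.

```python
import numpy as np


def brute_force_kneighbors(X_train, x, k):
    # Euclidean distance from the query `x` to every training row.
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # Indices of the k closest rows, nearest first (a stable sort breaks
    # ties by original index).
    order = np.argsort(dists, kind="mergesort")[:k]
    return dists[order], order


X_train = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 0.0], [0.0, 2.0]])
y_train = np.array([0, 1, 0, 1])

neighs_dist, neighs = brute_force_kneighbors(X_train, np.array([0.0, 0.0]), k=3)
neighs_y = y_train[neighs]
```

For this toy data the three nearest rows are 0, 2 and 3, at distances 0, 1 and 2, with labels 0, 0 and 1.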

In [265]:
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.neighbors import NearestNeighbors
from multiprocessing import Pool

import numpy as np


class NearestNeighborsFeats(BaseEstimator, ClassifierMixin):
    '''
    This class implements KNN features extraction.
    
    '''
    
    def __init__(self, n_jobs, k_list, metric, n_classes=None, n_neighbors=None, eps=1e-6):
        '''
        + metric: metric used to train the KNN algorithm.
        + k_list: values of K to apply the features generation to.
        + n_jobs: number of worker processes for parallel processing when
                  extracting features of datapoints.
        + n_classes: number of classes to consider - only useful when
                     some classes are not in the dataset. This number
                     must be higher than the number of classes in the 
                     dataset.
        + n_neighbors: number of neighbors to use when fitting the KNN algorithm.
                       if not provided, the algorithm will use max(k_list).
        + eps: ensures we don't divide by zero.
        '''
        
        # parameters for fitting the KNN algorithm
        self.metric = metric
        self.n_neighbors = max(k_list) if n_neighbors is None else n_neighbors
            
        # parameters for extracting features        
        self.k_list = k_list
        self.n_jobs = n_jobs
        self.n_classes_ = n_classes
        self.eps = eps
       
    
    def fit(self, X, y):
        '''
        Setup the train set and self.NN object
        '''
        
        # training labels
        self.y_train = y
        
        # classes
        self.n_classes = np.unique(y).shape[0] if self.n_classes_ is None else self.n_classes_
        
        
        # Fit a NearestNeighbors (NN) object to X 
        self.NN = NearestNeighbors(n_neighbors=self.n_neighbors, 
                                   metric=self.metric, 
                                   n_jobs=1, 
                                   algorithm='brute' if self.metric=='cosine' else 'auto')
        self.NN.fit(X)
                
            
        
    def predict(self, X):       
        '''
        Produces KNN features for every object of a dataset X
        '''
        
        if self.n_jobs == 1:
            test_feats = []
            for i in range(X.shape[0]):
                test_feats.append(self.get_features_for_one(X[i:i+1]))
        
        else:
            '''
            Number of worker processes is controlled by `self.n_jobs`.
            Either use `multiprocessing.Pool` or `joblib`                     
            '''
            
            gen = (X[i:i+1] for i in range(X.shape[0]))
            pool = Pool(processes=self.n_jobs)
            test_feats = pool.map(self.get_features_for_one, gen)
            pool.close()
            pool.join()
            
            # Comment out this line once you implement the code
            # assert False, 'You need to implement it for n_jobs > 1'
            
        return np.vstack(test_feats)
        
        
        
    def get_features_for_one(self, x):
        '''
        Computes KNN features for a single object `x`
        '''

        NN_output = self.NN.kneighbors(x)
        
        # vectors of size `n_neighbors`
        neighs = NN_output[1][0]        # neighbors indices        
        neighs_dist = NN_output[0][0]   # distances to neighbors
        neighs_y = self.y_train[neighs] # labels of neighbors
        
        # list of computed features (each feature is a list or np.array)
        features_list = [] 
        
        # add features
        
        
        features_list += self.neighbors_class_probabilities_(neighs_dist, neighs_y)
        features_list += self.first_neighbors_with_same_class_(neighs_y)
        features_list += self.minimum_distance_to_class_(neighs_dist, neighs_y)
        features_list += self.minimum_normed_distance_to_class_(neighs_dist, neighs_y)
        features_list += self.distance_to_kth_neighbor_(neighs_dist, neighs_y)
        features_list += self.mean_distance_to_class_(neighs_dist, neighs_y)
        
        # merge features
        knn_feats = np.hstack(features_list)
        
        assert knn_feats.shape == (239,) or knn_feats.shape == (239, 1)
        return knn_feats
    
    
    
    def neighbors_class_probabilities_(self, neighs_dist, neighs_y):
        ''' 
        Fraction of neighbors for every class (KNN classifier's predictions).
        Returns a list of length `len(k_list) * n_classes`.
        '''
        
        # use of `np.bincount`
        feature = []       
        for k in self.k_list:
            neighs_y_k = neighs_y[:k]
            feature_k = np.bincount(neighs_y_k, minlength=self.n_classes) / neighs_y_k.shape[0]
            feature += feature_k.tolist()
        
        return feature
        
        
        
    def first_neighbors_with_same_class_(self, neighs_y):
        '''
        Number of leading neighbors that share the label of the nearest neighbor.
        Returns a list of size `1` with a single integer.
        '''

        # use of `np.where`
        non_matching_idx = np.where(neighs_y!=neighs_y[0])[0]
        if non_matching_idx.size == 0:
            feature = self.n_neighbors
        else:
            feature = non_matching_idx[0]

        return [feature]
        
        
        
    def minimum_distance_to_class_(self, neighs_dist, neighs_y):  
        '''
        Minimum distance to neighbors of each class; 999 if no neighbors of a class.
        Returns a list of length `n_classes`.
        '''
                
        # use of `np.where`
        feature = [999] * self.n_classes
        for c in range(self.n_classes):
            c_dist = np.where(neighs_y==c, neighs_dist, 999)
            feature[c] = np.min(c_dist)
            
        return feature
        
        
        
    def minimum_normed_distance_to_class_(self, neighs_dist, neighs_y):
        '''
        Minimum distance to neighbors of each class, "normalized"
        by the distance to closest class; 999 if no neighbors of a class.
        Returns a list of length `n_classes`.
        '''
           
        # use of self.eps
        min_distances = self.minimum_distance_to_class_(neighs_dist, neighs_y)
        min_dist = min(min_distances) + self.eps
        
        feature = [dist/min_dist if dist < 999 else 999 for dist in min_distances]
        return feature
       
        
    
    def distance_to_kth_neighbor_(self, neighs_dist, neighs_y):
        '''
        Distance to each neighbor & distance normalized
        by the distance to closest neighbor.
        Returns a list of dimensions `1x2*len(k_list)`.
        '''
        
        # use of self.eps
        feature = []
        for k in self.k_list:
            
            neighs_dist_k = neighs_dist[:k]
            min_dist = min(neighs_dist_k)
            feat_51 = neighs_dist_k[k-1]
            feat_52 = neighs_dist_k[k-1] / (min_dist + self.eps)
            feature += [[feat_51, feat_52]]
            
        return feature
            
    
    
    def mean_distance_to_class_(self, neighs_dist, neighs_y):
        '''
        Mean distance to neighbors of each class; 999 if no neighbors of a class.
        Returns a list of length `len(k_list) * n_classes`.
        '''
        
        feature = []
        for k in self.k_list:
            
            # info for the k-NN
            neighs_y_k = neighs_y[:k]
            neighs_dist_k = neighs_dist[:k]
            
            # extract sum of distances & count of labels for each class
            dist_per_class = np.bincount(neighs_y_k, weights= neighs_dist_k, minlength=self.n_classes)
            y_per_class = np.bincount(neighs_y_k, minlength=self.n_classes)
            
            # get avg distance
            feature_k = np.where(y_per_class==0, 999, dist_per_class / (y_per_class + self.eps))            
            feature += feature_k.tolist()
            
        return feature
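
To make the individual features concrete, here is a small worked example on toy inputs. The arrays are hypothetical (4 neighbors, 3 classes), and the mean distance is computed as a plain mean rather than with the `eps` term used in the class above.

```python
import numpy as np

# Toy inputs (hypothetical): labels and distances of 4 neighbors, nearest
# first, in a problem with 3 classes.
neighs_y = np.array([1, 1, 0, 1])
neighs_dist = np.array([0.5, 1.0, 2.0, 4.0])
n_classes = 3

# Fraction of neighbors per class.
counts = np.bincount(neighs_y, minlength=n_classes)
class_fractions = counts / neighs_y.shape[0]

# Number of leading neighbors that share the nearest neighbor's label.
mismatch = np.where(neighs_y != neighs_y[0])[0]
same_class_run = mismatch[0] if mismatch.size else neighs_y.shape[0]

# Minimum distance to each class, with 999 for absent classes.
min_dist = [np.where(neighs_y == c, neighs_dist, 999).min() for c in range(n_classes)]

# Mean distance per class (plain mean, 999 for absent classes).
sums = np.bincount(neighs_y, weights=neighs_dist, minlength=n_classes)
mean_dist = np.where(counts == 0, 999, sums / np.maximum(counts, 1))
```

Class 2 never appears among the neighbors, so its fraction is 0 and both distance features fall back to the sentinel value 999.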
        

## Sanity check

To make sure you've implemented everything correctly we provide you the correct features for the first 50 objects.

In [270]:
# list of K values to use for the KNN features
k_list = [3, 8, 32]

# Load correct features
true_knn_feats_first50 = np.load('../readonly/KNN_features_data/knn_feats_test_first50.npy')

# Create instance of our KNN feature extractor
NNF = NearestNeighborsFeats(n_jobs=4, k_list=k_list, metric='minkowski')

# Fit on train set
NNF.fit(X, Y)

# Get features for test
test_knn_feats = NNF.predict(X_test[:50])

# This should be zero
print ('Deviation from ground truth features: %f' % np.abs(test_knn_feats - true_knn_feats_first50).sum())

deviation = np.abs(test_knn_feats - true_knn_feats_first50).sum(0)
for m in np.where(deviation > 1e-3)[0]: 
    p = np.where(np.array([87, 88, 117, 146, 152, 239]) > m)[0][0]
    print ('There is a problem in feature %d, which is a part of section %d.' % (m, p + 1))
    

Deviation from ground truth features: 0.000000
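
The boundaries `[87, 88, 117, 146, 152, 239]` used in the check are cumulative column counts of the six feature groups. The sketch below reconstructs them; `n_classes = 29` is an assumption inferred from `87 = len(k_list) * n_classes`, and `section_of` is a hypothetical helper that mirrors the lookup in the cell above.

```python
import numpy as np

k_list = [3, 8, 32]
n_classes = 29  # assumption: inferred from 87 = len(k_list) * n_classes

# Number of columns produced by each feature group, in the order they are
# appended in `get_features_for_one`.
section_sizes = [
    len(k_list) * n_classes,  # class fractions for each k
    1,                        # run of leading same-class neighbors
    n_classes,                # minimum distance to each class
    n_classes,                # normalized minimum distance to each class
    2 * len(k_list),          # distance to k-th neighbor, raw and normalized
    len(k_list) * n_classes,  # mean distance to each class for each k
]
boundaries = np.cumsum(section_sizes)


def section_of(column):
    # First boundary strictly greater than the column index, 1-based.
    return int(np.where(boundaries > column)[0][0]) + 1
```

The last boundary, 239, is the total feature count that `get_features_for_one` asserts on.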


Now implement parallel computations and compute features for the train and test sets. 

## Get features for test

Now compute features for the whole test set.

In [281]:
for metric in ['minkowski', 'cosine']:
    print (metric)
    
    # Create instance of our KNN feature extractor
    NNF = NearestNeighborsFeats(n_jobs=4, k_list=k_list, metric=metric)
    
    # Fit on train set
    NNF.fit(X, Y)

    # Get features for test
    test_knn_feats = NNF.predict(X_test)
    
    # Dump the features to disk
    np.save('knn_feats_%s_test.npy' % metric , test_knn_feats)
    

minkowski
cosine


## Get features for train

Compute features for the train set using the out-of-fold strategy.

In [290]:
# Unlike in other homework, we will not implement OOF predictions ourselves
# but use sklearn's `cross_val_predict`
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import StratifiedKFold

# We will use two metrics for KNN
for metric in ['minkowski', 'cosine']:
    print (metric)
    
    # Set up splitting scheme, use StratifiedKFold
    skf = StratifiedKFold(n_splits, shuffle=True, random_state=skf_seed)
    
    # Create instance of our KNN feature extractor
    NNF = NearestNeighborsFeats(n_jobs=4, k_list=k_list, metric=metric)
    
    # Get KNN features out-of-fold: use cross_val_predict with the right parameters
    preds = cross_val_predict(NNF, X, Y, cv=skf)
    
    # Save the features
    np.save('knn_feats_%s_train.npy' % metric, preds)
    

minkowski
cosine
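
The out-of-fold strategy itself can be sketched in a few lines of NumPy. This is an illustration under assumptions, not what `cross_val_predict` does internally: `out_of_fold_means` is a hypothetical function, the "model" is just the mean of the training targets, and the folds are plain shuffled splits rather than stratified ones.

```python
import numpy as np


def out_of_fold_means(y, n_splits, seed):
    # Shuffle indices once, then cut them into `n_splits` folds.
    rng = np.random.RandomState(seed)
    folds = np.array_split(rng.permutation(len(y)), n_splits)

    oof = np.empty(len(y), dtype=float)
    for held_out in folds:
        # "Fit" on everything except the held-out fold ...
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[held_out] = False
        fitted_mean = y[train_mask].mean()
        # ... and "predict" only for the held-out rows.
        oof[held_out] = fitted_mean
    return oof


y = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
oof = out_of_fold_means(y, n_splits=4, seed=123)
```

Each entry of `oof` is produced by a model that never saw that row, which is the property that makes OOF features safe to use as inputs to a second-level model.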


# Submit

If you made the above cells work, just run the following cell to produce a number to submit.

In [292]:
s = 0
for metric in ['minkowski', 'cosine']:
    knn_feats_train = np.load('knn_feats_%s_train.npy' % metric)
    knn_feats_test = np.load('knn_feats_%s_test.npy' % metric)
    
    s += knn_feats_train.mean() + knn_feats_test.mean()
    
answer = np.floor(s)
print (answer)


3838.0


Submit!

In [293]:
from grader import Grader

grader = Grader()

grader.submit_tag('statistic', answer)

STUDENT_EMAIL = 'plat.sebastien@hotmail.fr'
STUDENT_TOKEN = 'r3JKJUIMgA202mjV'
grader.status()

grader.submit(STUDENT_EMAIL, STUDENT_TOKEN)


Current answer for task statistic is: 3838.0
You want to submit these numbers:
Task statistic: 3838.0
Submitted to Coursera platform. See results on assignment page!
