## 1. Object-based ensemble model

This method feeds statistical and location features, calculated for each field, into an ensemble of linear (LogisticRegression), nearest-neighbours (KNN) and tree-based (RandomForest, ExtraTrees, XGBoost) models.

In [1]:
# OPTIONAL: Run this cell to automatically reload all modules (if they've been externally edited)
%load_ext autoreload
%autoreload 2

In [2]:
# OPTIONAL: Run this cell to silence warnings (not recommended!) Used here to silence LogReg convergence warning
import warnings
warnings.simplefilter('ignore')

### Load custom modules

In [3]:
from modules.process_data import SelectFeatures, Scale, OneHot
from modules.run_models import ModelEnsemble, make_submission
from modules.metaclassifiers import UnweightedAverage

### Load Python modules

In [4]:
import numpy as np
import pandas as pd
from sklearn.metrics import log_loss
import pickle

In [5]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC

### Load processed feature datasets

In [6]:
train_data = pd.read_pickle('extracted_data/train_data.pkl')
stat_features_train = pd.read_pickle('processed_data/train/stat_features.pkl')
location_features_train = pd.read_pickle('processed_data/train/location_features.pkl')
expanded_pixels_train = pd.read_pickle('processed_data/train/expanded_pixels.pkl')


test_data = pd.read_pickle('extracted_data/test_data.pkl')
stat_features_test = pd.read_pickle('processed_data/test/stat_features.pkl')
location_features_test = pd.read_pickle('processed_data/test/location_features.pkl')
expanded_pixels_test = pd.read_pickle('processed_data/test/expanded_pixels.pkl')

### Select features

Select statistical features (mean, std, max, min) for each spectral band and for the calculated NDVI vegetation index, as well as location features with 200 zones.

In [7]:
sf = SelectFeatures(keep_cols=['mean', 'std', 'max', 'min', 'location_200'], 
                    drop_cols=['diff', 'TCI', 'RVI', '_DVI', 'IPVI', 'ARVI', 'location_2000'])

fit_A = sf.transform([train_data, stat_features_train, location_features_train])
predict_A = sf.transform([test_data, stat_features_test, location_features_test])

Selected  617  columns 
Use .cols attribute to see all columns

Selected  617  columns 
Use .cols attribute to see all columns
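
The source of `SelectFeatures` is not shown here. The `keep_cols` and `drop_cols` arguments look like substring filters on column names, so the sketch below illustrates that reading only. The function `select_columns` and the toy frames are hypothetical and are not part of `modules.process_data`.

```python
import pandas as pd


def select_columns(frames, keep, drop):
    """Join frames column-wise, then keep columns whose name contains any
    `keep` substring and none of the `drop` substrings (assumed behaviour)."""
    joined = pd.concat(frames, axis=1)
    cols = [c for c in joined.columns
            if any(k in c for k in keep) and not any(d in c for d in drop)]
    return joined[cols]


# Toy data: column names mimic the <date>_<band>_<statistic> pattern seen above.
stats = pd.DataFrame({'0101_B05_mean': [0.1, 0.2], '0101_B05_diff': [0.0, 0.1]})
locs = pd.DataFrame({'location_200': [3, 7], 'location_2000': [31, 72]})

selected = select_columns([stats, locs],
                          keep=['mean', 'location_200'],
                          drop=['diff', 'location_2000'])
print(list(selected.columns))
```

Under this reading, `location_2000` has to be listed in `drop_cols` because its name also contains the substring `location_200`.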



### Pre-process features

* Min-Max scale all numerical features

In [8]:
sc = Scale(keep_cols=['B', 'NDVI']).fit(fit_A)

fit_A = sc.transform(fit_A)
predict_A = sc.transform(predict_A)

Scaling  616  columns 
Use .cols attribute to see all columns

Scaling  616  columns 
Use .cols attribute to see all columns
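
For reference, the same fit-on-train, transform-both pattern can be sketched with scikit-learn's `MinMaxScaler`. This is an illustration on toy data, assuming `Scale` selects columns by substring. It is not the implementation of the custom `Scale` class.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy frames; the column names are illustrative only.
train = pd.DataFrame({'B05_mean': [10.0, 20.0, 30.0], 'location_200': [1, 2, 3]})
test = pd.DataFrame({'B05_mean': [15.0, 40.0], 'location_200': [2, 1]})

# Assumed selection rule: scale columns whose name contains 'B' or 'NDVI'.
num_cols = [c for c in train.columns if 'B' in c or 'NDVI' in c]

# Learn min and max from the training data only.
scaler = MinMaxScaler().fit(train[num_cols])

train_scaled, test_scaled = train.copy(), test.copy()
train_scaled[num_cols] = scaler.transform(train[num_cols])
test_scaled[num_cols] = scaler.transform(test[num_cols])
```

Because the minimum and maximum come from the training set, scaled test values can fall outside the [0, 1] range when the test set contains more extreme values.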



* One-hot encode categorical location features

In [9]:
oh = OneHot(keep_cols=['location_200']).fit(fit_A)

fit_A = oh.transform(fit_A)
predict_A = oh.transform(predict_A)

Encoding  1  columns 
Use .cols attribute to see all columns

Encoding  1  columns 
Use .cols attribute to see all columns
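
One way to get the same effect with plain pandas is sketched below on toy data. It assumes the encoder should emit exactly the columns seen during fitting. The custom `OneHot` class may handle unseen zones differently.

```python
import pandas as pd

# Toy data: zone 9 appears only in the test set, zone 0 only in the training set.
train = pd.DataFrame({'location_200': [0, 5, 5]})
test = pd.DataFrame({'location_200': [5, 9]})

train_oh = pd.get_dummies(train, columns=['location_200'], dtype=float)

# Align the test columns with the training columns: zones missing from the
# test set become all-zero columns, zones unseen in training are dropped.
test_oh = (pd.get_dummies(test, columns=['location_200'], dtype=float)
             .reindex(columns=train_oh.columns, fill_value=0.0))
```

Aligning on the training columns keeps the feature matrix the same width at fit and predict time, which the downstream classifiers require.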



* Fill NaN values with the column means of the training set

In [10]:
fit_A = fit_A.fillna(fit_A.mean())
predict_A = predict_A.fillna(fit_A.mean())

In [11]:
fit_A.head()

Unnamed: 0,0101_B05_max,0210_B11_min,0715_B07_mean,0131_B12_max,0131_B8A_std,0819_NDVI_std,0620_B8A_max,0620_B02_min,0715_B07_max,0715_B05_max,...,location_200_190,location_200_191,location_200_192,location_200_193,location_200_194,location_200_195,location_200_196,location_200_197,location_200_198,location_200_199
0,0.385355,0.447401,0.330374,0.414763,0.269055,0.071518,0.47868,0.717791,0.428621,0.475167,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.343195,0.427027,0.658061,0.329805,0.243533,0.179315,0.769257,0.599864,0.678229,0.473424,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.364152,0.402911,0.385792,0.245125,0.133185,0.067051,0.528026,0.646898,0.466969,0.580017,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.285996,0.393347,0.296498,0.31727,0.249262,0.092814,0.503611,0.656442,0.443612,0.458612,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.348619,0.417672,0.527508,0.369081,0.163205,0.149857,0.693776,0.646217,0.578874,0.506245,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Define classifiers and metaclassifiers

Use a combination of linear, nearest-neighbour and tree-based models as base classifiers, and a linear model, a tree-based model and a simple average as metaclassifiers for stacking.
Note: UnweightedAv is a custom estimator that combines the base classifiers' predictions by taking their (unweighted) average. A weighted-average estimator, with weights optimised by Nelder-Mead, was also tested but was prone to overfitting.

In [12]:
classifiers = {
'LogReg': LogisticRegression(solver='lbfgs', multi_class='multinomial'),
'RandomForest': RandomForestClassifier(n_estimators = 1000),
'KNN': KNeighborsClassifier(n_neighbors=100),
'ExtraTrees': ExtraTreesClassifier(n_estimators = 1000),
'XGB': XGBClassifier(silent=False, 
                    n_estimators=1000, learning_rate=0.3, 
                    scale_pos_weight=1, colsample_bytree = 0.4, subsample = 0.9, objective='multi:softprob', 
                    eval_metric='mlogloss', reg_alpha = 0.3, max_depth=6, gamma=5)}

metaclassifiers = {
'LogReg': LogisticRegression(solver='lbfgs', multi_class='multinomial'),
'RandomForest': RandomForestClassifier(n_estimators = 1000),
'UnweightedAv': UnweightedAverage(n_classes=9)}
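
The two averaging strategies mentioned in the note can be sketched as follows. This is a minimal illustration, not the code in `modules.metaclassifiers`: the class `AverageProba` and the function `fit_weights` are hypothetical names, and the meta-feature layout (each base model's class probabilities placed side by side) is an assumption.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.metrics import log_loss


class AverageProba(ClassifierMixin, BaseEstimator):
    """Hypothetical stand-in for UnweightedAverage: averages the class
    probabilities of the base models with equal weight."""

    def __init__(self, n_classes):
        self.n_classes = n_classes

    def fit(self, X, y=None):
        return self  # nothing to learn

    def predict_proba(self, X):
        X = np.asarray(X, dtype=float)
        # (n_samples, n_models * n_classes) -> (n_samples, n_models, n_classes)
        return X.reshape(X.shape[0], -1, self.n_classes).mean(axis=1)


def fit_weights(X, y, n_classes):
    """Tune one weight per base model with Nelder-Mead to minimise log loss."""
    stacked = np.asarray(X, dtype=float).reshape(len(X), -1, n_classes)
    n_models = stacked.shape[1]

    def normalise(w):
        w = np.abs(w)
        return w / w.sum()  # non-negative weights that sum to one

    def loss(w):
        # Weighted sum over the model axis -> (n_samples, n_classes)
        blended = np.tensordot(normalise(w), stacked, axes=([0], [1]))
        return log_loss(y, blended, labels=np.arange(n_classes))

    start = np.full(n_models, 1.0 / n_models)  # begin at the unweighted average
    result = minimize(loss, start, method='Nelder-Mead')
    return normalise(result.x)


# Two base models, three classes: columns 0-2 from model A, 3-5 from model B.
X_meta = np.array([[0.2, 0.3, 0.5, 0.6, 0.3, 0.1],
                   [0.7, 0.2, 0.1, 0.1, 0.8, 0.1],
                   [0.1, 0.1, 0.8, 0.2, 0.2, 0.6]])
y_meta = np.array([2, 0, 2])

avg = AverageProba(n_classes=3).predict_proba(X_meta)
weights = fit_weights(X_meta, y_meta, n_classes=3)
```

In this sketch the weights are tuned on the same predictions they are scored on, so they can latch onto noise in a small set. That is one plausible source of the overfitting noted above.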

### Fit ensemble

In [13]:
ensemble_A = ModelEnsemble(clfs=classifiers, mclfs=metaclassifiers).fit(fit_A, train_data)

Fitting classifiers... 

Classifier                  Fold 1 Score             Fold 2 Score
---------------------------------------------------------------------
LogReg                          0.737                    0.743
RandomForest                    0.821                    0.913
KNN                             1.369                    1.527
ExtraTrees                      0.830                    0.887
XGB                             0.799                    0.863


Fitting metaclassifiers... 

Meta-Classifier                 Score
----------------------------------------------
LogReg                          1.700                  
RandomForest                    0.336                  
UnweightedAv                    2.492                  


Complete!
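
The internals of `ModelEnsemble` are not shown in this notebook. A common way to build such a stacked ensemble is to collect out-of-fold class probabilities from each base classifier and train the metaclassifier on them. The sketch below illustrates that general technique with scikit-learn on synthetic data. The two-fold split mirrors the two fold scores printed above, and scoring by log loss is an assumption suggested by the `log_loss` import.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in for the field features and crop labels.
X, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)

base_models = {
    'LogReg': LogisticRegression(max_iter=1000),
    'RandomForest': RandomForestClassifier(n_estimators=50, random_state=0),
}
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)

# Out-of-fold probabilities: every row is predicted by a model that never saw it.
oof = []
for name, model in base_models.items():
    proba = cross_val_predict(model, X, y, cv=cv, method='predict_proba')
    print(f'{name:<15}{log_loss(y, proba):.3f}')
    oof.append(proba)

# Meta-features: the base models' class probabilities placed side by side.
meta_features = np.hstack(oof)
meta = LogisticRegression(max_iter=1000).fit(meta_features, y)
```

Training the metaclassifier on out-of-fold rather than in-sample predictions keeps it from simply trusting whichever base model memorised the training set best.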


### Predict ensemble

In [14]:
predictions = ensemble_A.predict(predict_A)

Re-fitting and predicting classifiers... 

Classifier                     Status
----------------------------------------------
LogReg                          Complete
RandomForest                    Complete
KNN                             Complete
ExtraTrees                      Complete
XGB                             Complete


Predicting metaclassifiers... 

Meta-Classifier                 Status
----------------------------------------------
LogReg                          Complete
RandomForest                    Complete
UnweightedAv                    Complete


Complete!


### Make submissions

In [None]:
make_submission(predictions, 'Ensemble_A')