# Creating a submission file

## Setup

Import the libraries and set the random seed value.

In [2]:
import pickle
import pandas as pd

from sklearn.preprocessing import LabelEncoder

import sys
sys.path.append('..')

from utils.file_ops import *
from utils.runtime_helpers import *
from utils.data_prep import *

seed_val = 922

Load the data and add the crop labels to the training data.

In [3]:
train_labels = pd.read_csv('../../data/labels_TRAIN.csv', index_col=[0])
train_data_agg = pd.read_csv('../../data/pixel_data_agg_TRAIN.csv', index_col=[0])
test_data_agg = pd.read_csv('../../data/pixel_data_agg_TEST.csv', index_col=[0])

train_data_and_labels = train_data_agg.merge(train_labels, on=['field_id'])

## Assign Train / Test values

Below we assign our X and y values using the complete training dataset, and use the competition test dataset as the test set. This differs from our prior modeling efforts, which split the training set into train/test collections.

In [4]:
X_train = train_data_and_labels.drop(['field_id', 'crop_id'], axis=1)
y_train = train_data_and_labels['crop_id']
X_test = test_data_agg.drop(['field_id'], axis=1)
field_ids = test_data_agg['field_id']

le = LabelEncoder()
le.fit(y_train)
y_train = le.transform(y_train)
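The encoding step matters later, when `le.inverse_transform(clf.classes_)` is used to recover the crop ids behind each probability column. `LabelEncoder` sorts the unique labels and replaces each label with its position in that sorted list. The sketch below reproduces that mapping with NumPy alone; the crop ids and the id-to-name dictionary are made-up stand-ins for `labels_TRAIN.csv` and `get_crop_dict()`, not the real values.

```python
import numpy as np

# Made-up crop ids for illustration; the real ids come from labels_TRAIN.csv
crop_ids = np.array([36, 1, 4, 1, 36, 2])

# classes: sorted unique ids; encoded: position of each id within classes
# (this is the same mapping LabelEncoder.fit / transform produces)
classes, encoded = np.unique(crop_ids, return_inverse=True)

# Hypothetical id -> name lookup standing in for get_crop_dict()
crop_names = {1: "Wheat", 2: "Mustard", 4: "No Crop", 36: "Rice"}

# Column k of predict_proba corresponds to classes[k]
columns = [crop_names[int(c)] for c in classes]

# Indexing classes with the encoded labels undoes the encoding
decoded = classes[encoded]

print(classes.tolist())  # [1, 2, 4, 36]
print(encoded.tolist())  # [3, 0, 2, 0, 3, 1]
print(columns)           # ['Wheat', 'Mustard', 'No Crop', 'Rice']
```

Because the classes are sorted, the probability columns always come out in ascending crop-id order, regardless of the order in which labels appear in the training data.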

Additionally, experimentation showed that modeling with the following features (selected using the PCA from the EDA notebook) appears to lower the log loss and improve the score of the final submission.

*Note: the same selected features were also tried in prior iterations of the modeling notebook that precedes this one; in that context, however, we obtained better metrics using all the features available to us.*

In [5]:
agg_idxs = [
    'B01', 'B02', 'B03', 'B04', 'B05', 'B06', 'B07', 'B08', 'B8A', 'B09', 'B11', 'B12',
    'NDVI', 'ARVI', 'SAVI', 'MSI', 'MCARI', 'MARI', 'EVI2', 'NDMI', 'NDWI',
    'brightness'
]
agg_metrics = [
    '_median'
]

selected = get_features(agg_idxs, agg_metrics, ['pixels'])
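`get_features` comes from `utils.data_prep` and is not shown in this notebook. As a rough illustration only, the sketch below assumes that column names are formed by joining each index with each metric suffix (for example `NDVI_median`) and that the third argument lists extra columns to append. Both the naming scheme and the function name `get_features_sketch` are assumptions, not the project's actual implementation.

```python
def get_features_sketch(indices, metrics, extra=()):
    """Hypothetical stand-in: every index/metric pair, then any extra columns."""
    names = [f"{idx}{metric}" for idx in indices for metric in metrics]
    return names + list(extra)


# Small demo with two of the indices listed above
selected_demo = get_features_sketch(["B01", "NDVI"], ["_median"], ["pixels"])
print(selected_demo)  # ['B01_median', 'NDVI_median', 'pixels']
```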

## Fitting / Preparing Predictions for Submission

### Loading the saved models

In [6]:
def load_model(name):
    # Open the pickle in a context manager so the file handle is closed after loading
    with open(f'../../saved_models/{name}.sav', 'rb') as f:
        return pickle.load(f)

rf_p = load_model('rf_p')
xgb_p = load_model('xgb_p')
svm_p = load_model('svm_p')
mlp_p = load_model('mlp_p')
vp_p = load_model('vc_p')  # note: the voting classifier is stored as vc_p.sav


### Random Forest

In [8]:
clf = rf_p

clf.fit(X_train[selected], y_train)
y_test_pred = clf.predict_proba(X_test[selected])

crop_dict = get_crop_dict()
crop_columns = [crop_dict.get(i) for i in le.inverse_transform(clf.classes_)]

test_df = pd.DataFrame(columns= ['field_id'] + crop_columns)
test_df['field_id'] = field_ids
test_df[crop_columns] = y_test_pred 

test_df.to_csv('../../submissions/rf_p.csv', index=False)

test_df.head()

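|   | field_id | Wheat | Mustard | Lentil | No Crop | Green pea | Sugarcane | Garlic | Maize | Gram | Coriander | Potato | Bersem | Rice |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 11 | 0.118 | 0.158 | 0.06 | 0.1 | 0.024 | 0.528 | 0.002 | 0.004 | 0.0 | 0.002 | 0.004 | 0.0 | 0.0 |
| 1 | 13 | 0.606 | 0.232 | 0.028 | 0.068 | 0.014 | 0.036 | 0.006 | 0.0 | 0.002 | 0.006 | 0.0 | 0.0 | 0.002 |
| 2 | 19 | 0.166 | 0.19 | 0.4 | 0.126 | 0.018 | 0.086 | 0.002 | 0.0 | 0.0 | 0.012 | 0.0 | 0.0 | 0.0 |
| 3 | 21 | 0.052 | 0.33 | 0.184 | 0.272 | 0.002 | 0.066 | 0.018 | 0.0 | 0.052 | 0.0 | 0.012 | 0.012 | 0.0 |
| 4 | 25 | 0.134 | 0.026 | 0.122 | 0.706 | 0.002 | 0.01 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |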


### XGBoost (Extreme Gradient Boosting)

In [9]:
clf = xgb_p

clf.fit(X_train[selected], y_train)
y_test_pred = clf.predict_proba(X_test[selected])

crop_dict = get_crop_dict()
crop_columns = [crop_dict.get(i) for i in le.inverse_transform(clf.classes_)]

test_df = pd.DataFrame(columns= ['field_id'] + crop_columns)
test_df['field_id'] = field_ids
test_df[crop_columns] = y_test_pred 

test_df.to_csv('../../submissions/xgb_p.csv', index=False)

test_df.head()

|   | field_id | Wheat | Mustard | Lentil | No Crop | Green pea | Sugarcane | Garlic | Maize | Gram | Coriander | Potato | Bersem | Rice |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 11 | 0.202802 | 0.121857 | 0.032468 | 0.086322 | 0.019908 | 0.52761 | 0.001092 | 0.001722 | 0.001374 | 0.001367 | 0.001297 | 0.000935 | 0.001245 |
| 1 | 13 | 0.504577 | 0.073377 | 0.004514 | 0.408544 | 0.000627 | 0.003906 | 0.000874 | 0.000432 | 0.000636 | 0.000573 | 0.000461 | 0.001047 | 0.000433 |
| 2 | 19 | 0.148569 | 0.290637 | 0.442358 | 0.0713 | 0.002091 | 0.036588 | 0.001784 | 0.000992 | 0.001239 | 0.001342 | 0.00106 | 0.001045 | 0.000994 |
| 3 | 21 | 0.012638 | 0.739189 | 0.008112 | 0.188216 | 0.000543 | 0.044424 | 0.000793 | 0.000413 | 0.002882 | 0.000473 | 0.00108 | 0.000607 | 0.00063 |
| 4 | 25 | 0.063104 | 0.004891 | 0.025427 | 0.901334 | 0.000503 | 0.001399 | 0.000344 | 0.000304 | 0.000397 | 0.000676 | 0.000325 | 0.000818 | 0.000477 |



### Support Vector Machines

In [10]:
clf = svm_p

clf.fit(X_train[selected], y_train)
y_test_pred = clf.predict_proba(X_test[selected])

crop_dict = get_crop_dict()
crop_columns = [crop_dict.get(i) for i in le.inverse_transform(clf.classes_)]

test_df = pd.DataFrame(columns= ['field_id'] + crop_columns)
test_df['field_id'] = field_ids
test_df[crop_columns] = y_test_pred 

test_df.to_csv('../../submissions/svm_p.csv', index=False)

test_df.head()

|   | field_id | Wheat | Mustard | Lentil | No Crop | Green pea | Sugarcane | Garlic | Maize | Gram | Coriander | Potato | Bersem | Rice |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 11 | 0.046002 | 0.125851 | 0.013903 | 0.038035 | 0.006031 | 0.767024 | 0.000938 | 0.001055 | 0.000169 | 0.000245 | 0.000218 | 0.00017 | 0.000357 |
| 1 | 13 | 0.52605 | 0.125779 | 0.114691 | 0.127016 | 0.050062 | 0.020501 | 0.011899 | 0.001449 | 0.012424 | 0.001547 | 0.002192 | 0.002319 | 0.004071 |
| 2 | 19 | 0.080895 | 0.171049 | 0.544467 | 0.015403 | 0.005569 | 0.176266 | 0.000489 | 0.000231 | 0.000263 | 0.004236 | 0.000534 | 0.000386 | 0.000212 |
| 3 | 21 | 0.019594 | 0.122085 | 0.102055 | 0.068018 | 0.00247 | 0.616033 | 0.016509 | 0.001487 | 0.023261 | 0.000798 | 0.014019 | 0.01301 | 0.000662 |
| 4 | 25 | 0.081844 | 0.014978 | 0.040532 | 0.845483 | 0.001033 | 0.004353 | 0.000781 | 0.001043 | 0.00097 | 0.00068 | 0.000705 | 0.001485 | 0.006114 |


### Multi-Layer Perceptron

In [11]:
clf = mlp_p

clf.fit(X_train[selected], y_train)
y_test_pred = clf.predict_proba(X_test[selected])

crop_dict = get_crop_dict()
crop_columns = [crop_dict.get(i) for i in le.inverse_transform(clf.classes_)]

test_df = pd.DataFrame(columns= ['field_id'] + crop_columns)
test_df['field_id'] = field_ids
test_df[crop_columns] = y_test_pred 

test_df.to_csv('../../submissions/mlp_p.csv', index=False)

test_df.head()

|   | field_id | Wheat | Mustard | Lentil | No Crop | Green pea | Sugarcane | Garlic | Maize | Gram | Coriander | Potato | Bersem | Rice |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 11 | 0.051279 | 0.102818 | 0.117733 | 0.059046 | 0.001644 | 0.62471 | 0.032706 | 0.00039 | 0.000271 | 0.00549 | 0.002465 | 6e-06 | 0.001441 |
| 1 | 13 | 0.22924 | 0.06923 | 0.144726 | 0.126961 | 0.382958 | 0.038991 | 0.001646 | 0.000133 | 0.000222 | 1e-05 | 0.000408 | 0.000585 | 0.004891 |
| 2 | 19 | 0.095015 | 0.293615 | 0.271186 | 0.054623 | 0.011457 | 0.210124 | 0.045834 | 0.00025 | 0.002714 | 0.001113 | 0.003753 | 0.01023 | 8.6e-05 |
| 3 | 21 | 0.012866 | 0.040258 | 0.072815 | 0.047312 | 0.004073 | 0.404882 | 0.004572 | 0.005568 | 4.2e-05 | 0.000508 | 0.246426 | 0.160125 | 0.000554 |
| 4 | 25 | 0.184122 | 0.130562 | 0.52226 | 0.133386 | 0.000728 | 0.013455 | 0.000138 | 0.000118 | 7e-06 | 1.8e-05 | 0.001144 | 0.01396 | 0.000102 |


### Voting Classifier

In [14]:
clf = vp_p

clf.fit(X_train[selected], y_train)
y_test_pred = clf.predict_proba(X_test[selected])

crop_dict = get_crop_dict()
crop_columns = [crop_dict.get(i) for i in le.inverse_transform(clf.classes_)]

test_df = pd.DataFrame(columns= ['field_id'] + crop_columns)
test_df['field_id'] = field_ids
test_df[crop_columns] = y_test_pred 

test_df.to_csv('../../submissions/vp_p.csv', index=False)

test_df.head()

|   | field_id | Wheat | Mustard | Lentil | No Crop | Green pea | Sugarcane | Garlic | Maize | Gram | Coriander | Potato | Bersem | Rice |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 11 | 0.104521 | 0.127132 | 0.056026 | 0.070851 | 0.012896 | 0.611836 | 0.009184 | 0.001792 | 0.000454 | 0.002275 | 0.001995 | 0.000278 | 0.000761 |
| 1 | 13 | 0.466467 | 0.125097 | 0.072983 | 0.18263 | 0.111912 | 0.024849 | 0.005105 | 0.000504 | 0.003821 | 0.002032 | 0.000765 | 0.000988 | 0.002849 |
| 2 | 19 | 0.12262 | 0.236325 | 0.414503 | 0.066831 | 0.009279 | 0.127245 | 0.012527 | 0.000368 | 0.001054 | 0.004673 | 0.001337 | 0.002915 | 0.000323 |
| 3 | 21 | 0.024275 | 0.307883 | 0.091745 | 0.143886 | 0.002272 | 0.282835 | 0.009969 | 0.001867 | 0.019546 | 0.000445 | 0.068381 | 0.046435 | 0.000462 |
| 4 | 25 | 0.115767 | 0.044108 | 0.177555 | 0.646551 | 0.001066 | 0.007302 | 0.000316 | 0.000366 | 0.000343 | 0.000344 | 0.000543 | 0.004066 | 0.001673 |
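
The configuration of the voting classifier is not shown in this notebook. A plausible reading, assuming it is a soft-voting ensemble with equal weights over the four models above, is that each output probability is simply the mean of the four individual probabilities. The sketch below checks that assumption against the values displayed for `field_id` 11, using five of the crop columns.

```python
import numpy as np

# Probabilities for field_id 11, copied from the outputs displayed above
# (columns: Wheat, Mustard, Lentil, No Crop, Sugarcane)
per_model = np.array([
    [0.118,    0.158,    0.06,     0.1,      0.528],     # rf_p
    [0.202802, 0.121857, 0.032468, 0.086322, 0.52761],   # xgb_p
    [0.046002, 0.125851, 0.013903, 0.038035, 0.767024],  # svm_p
    [0.051279, 0.102818, 0.117733, 0.059046, 0.62471],   # mlp_p
])

# Equal-weight soft vote: average the probabilities over the models
soft_vote = per_model.mean(axis=0)

# Values displayed for the voting classifier for the same field
displayed = np.array([0.104521, 0.127132, 0.056026, 0.070851, 0.611836])

# The displayed values are rounded to six decimals, so compare with a tolerance
print(np.allclose(soft_vote, displayed, atol=2e-6))  # True
```

The match is consistent with equal-weight soft voting, though it does not rule out other configurations that happen to agree on these rows.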


## Analysis

After submitting each model's predictions on the test data, we achieved the following scores in the AgrifieldNet competition.

| Model | Public Score | Private Score |
| ----- | ------------ | ------------- |
| rf_p  | 1.744836782  | 2.163112415   |
| xgb_p | 1.758395241  | 1.765851258   |
| svm_p | 1.465262957  | 1.497996049   |
| mlp_p | 1.806023962  | 1.767157053   |
| vp_p  | 1.309915641  | 1.372958035   |

The competition website explains the presence of both public and private scoring by stating that "Zindi maintains a public leaderboard and a private leaderboard for each competition. The Public Leaderboard includes approximately 20% of the test dataset."
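
To help read these scores, the sketch below computes multiclass log loss with NumPy. It assumes the competition score is the log loss mentioned earlier in this notebook (the notebook does not state the metric explicitly), and the function name and toy inputs are illustrative only.

```python
import numpy as np


def multiclass_log_loss(y_true, proba, eps=1e-15):
    """Mean negative log of the probability assigned to the true class."""
    # Clip to avoid log(0), then renormalise so each row still sums to 1
    proba = np.clip(np.asarray(proba, dtype=float), eps, 1 - eps)
    proba = proba / proba.sum(axis=1, keepdims=True)
    rows = np.arange(len(y_true))
    return float(-np.mean(np.log(proba[rows, np.asarray(y_true)])))


# Toy example: two samples, three classes
toy = multiclass_log_loss([0, 1], [[0.8, 0.1, 0.1], [0.2, 0.5, 0.3]])

# Uninformative baseline: equal probability for each of 13 classes
uniform = multiclass_log_loss([0, 5, 12], np.full((3, 13), 1 / 13))

print(round(toy, 4))      # 0.4581
print(round(uniform, 4))  # 2.5649
```

Under that assumption, lower is better, and a model that assigns equal probability to all 13 crop columns would score ln 13 ≈ 2.565, so all five submissions in the table beat that uninformative baseline.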

As hypothesized, the voting classifier outperforms every individual estimator we created. While our voting classifier would not have won the competition, its score would have placed us **22nd** out of 151 submissions on the [leaderboard](https://zindi.africa/competitions/agrifieldnet-india-challenge/leaderboard), had we submitted our results before the competition deadline.

## Conclusions and Possible Next Steps

We refit the models created in the previous notebook on the complete training set and generated class probabilities for the test set. We then processed the results and wrote them to CSV files for submission to the AgrifieldNet competition website, where they were scored.
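
The per-model cells above repeat the same steps to turn a probability matrix into a submission table. A minimal sketch of how that step could be factored into one helper is shown below; `build_submission` is a hypothetical name, and the toy ids and probabilities are not real model output.

```python
import numpy as np
import pandas as pd


def build_submission(field_ids, proba, crop_columns):
    """Combine field ids and class probabilities into one submission frame."""
    proba = np.asarray(proba)
    if proba.shape != (len(field_ids), len(crop_columns)):
        raise ValueError("proba needs one row per field and one column per crop")
    df = pd.DataFrame(proba, columns=crop_columns)
    # Put the identifier first, matching the layout of the tables above
    df.insert(0, "field_id", np.asarray(field_ids))
    return df


# Toy values for illustration only
submission = build_submission(
    field_ids=[11, 13],
    proba=[[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]],
    crop_columns=["Wheat", "Mustard", "Lentil"],
)
print(submission.columns.tolist())  # ['field_id', 'Wheat', 'Mustard', 'Lentil']
```

Writing the file would then be a single `submission.to_csv(path, index=False)` call per model, as in the cells above.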

Interestingly, according to the AgrifieldNet scores, our models performed better with a subset of features, whereas in local evaluation on a train/test split of the training data they benefitted from all available features. Further exploration into the effects of manual and automated feature selection on the predictive power of our models is warranted. The voting classifier could also benefit from hyperparameter tuning, which was not performed due to a lack of time and compute power.