# Crop Type Classification - Featurization

This notebook performs featurization on the cleaned NDVI timeseries data to prepare it for model training.

### Conda environment setup
Before running this notebook, create a conda environment. If you do not have conda installed, follow the instructions in the [Conda User Guide](https://docs.conda.io/projects/conda/en/latest/user-guide/index.html).

```
$ conda create --name 'env_name' --file requirements.txt
$ conda activate 'env_name'
```

### Key Libraries:

- [Geopandas](https://geopandas.org/en/stable/docs.html), [Pandas](https://pandas.pydata.org/docs/): Data handling and manipulation.
- [Scikit-learn](https://scikit-learn.org/0.21/documentation.html): Train/test split, preprocessors, and encoders.


The key steps are:

- Train/Validation/Test Split: Split the data into training, validation, and test sets for model fitting and evaluation.
- Feature Selection: Keep only the NDVI features needed for crop classification and drop the metadata columns.
- Label Encoding: Map the crop type classes to integer labels.
- Scaling: Scale the NDVI features to zero mean and unit variance to improve model convergence.

The outputs of this notebook are preprocessed pandas DataFrames with selected features, encoded targets, and scaled NDVI data, ready for model training, validation, and testing.

--------

# Imports

In [10]:
import os
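try:
    # Defined only after the first run, so re-running this cell does not change directory again.
    if kernel_is_loaded:
        pass
except NameError:
    # First run: move to the parent directory so relative paths such as data_files/ resolve.
    os.chdir('/'.join(os.getcwd().split('/')[:-1]))
    kernel_is_loaded = True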

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from IPython.display import display
from sklearn.preprocessing import StandardScaler
import random
import warnings
warnings.filterwarnings("ignore")

pd.options.display.max_columns = 20

# Seeding
def seed_everything(seed = 42):
    random.seed(seed)
    np.random.seed(seed)
    
seed_everything()

In [11]:
data = pd.read_csv('data_files/preprocessed_data.csv')
data.head(3)

Unnamed: 0,oct_1f,oct_2f,nov_1f,nov_2f,dec_1f,dec_2f,jan_1f,jan_2f,feb_1f,feb_2f,mar_1f,mar_2f,apr_1f,apr_2f,crop_name,sowing_period,harvest_period
0,189.0,191.0,173.0,126.0,132.0,171.0,183.0,187.0,184.0,180.0,166.0,130.0,120.0,124.0,Wheat,nov_2f,mar_1f
1,171.0,166.0,142.0,132.0,165.0,185.0,190.0,188.0,180.0,171.0,143.0,130.0,123.0,121.0,Wheat,nov_2f,mar_1f
2,166.0,147.0,129.0,131.0,159.0,188.0,197.0,192.0,185.0,172.0,121.0,121.0,116.0,118.0,Wheat,nov_2f,feb_2f


# Train, Test and Validation split

In [12]:
# Combinations of crop, sowing period and harvest period with fewer than 3 samples
# cannot be divided into three parts (Train, Validation, Test).
# They are excluded before splitting and later added back to the training set to aid generalization.

data['crop_sp_hp'] = data['crop_name']+'_'+data['sowing_period']+'_'+data['harvest_period']

comb_under_three = data.crop_sp_hp.value_counts()[data.crop_sp_hp.value_counts() < 3].index

samples_under_three = data[data['crop_sp_hp'].isin(comb_under_three)]

data.drop(samples_under_three.index, inplace=True)

samples_under_three.shape[0]

23
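
The filtering step above can be illustrated on a small made-up frame. This is only a sketch: the column names mirror the notebook, but the rows are invented, and the names `toy`, `counts`, `rare_mask`, `rare_rows` and `common_rows` are hypothetical. It also computes `value_counts()` once instead of twice.

```python
import pandas as pd

# Toy data: column names follow the notebook, values are made up.
toy = pd.DataFrame({
    "crop_name": ["Wheat"] * 4 + ["Potato"] * 2,
    "sowing_period": ["nov_2f"] * 4 + ["oct_2f"] * 2,
    "harvest_period": ["mar_1f"] * 4 + ["feb_1f"] * 2,
})

# Composite key: one string per (crop, sowing period, harvest period) combination.
toy["crop_sp_hp"] = (
    toy["crop_name"] + "_" + toy["sowing_period"] + "_" + toy["harvest_period"]
)

# Count each combination once and reuse the result.
counts = toy["crop_sp_hp"].value_counts()
rare_keys = counts[counts < 3].index

rare_mask = toy["crop_sp_hp"].isin(rare_keys)
rare_rows = toy[rare_mask]      # held out, to be appended to the training set later
common_rows = toy[~rare_mask]   # frequent enough to be split three ways

print(len(rare_rows), len(common_rows))
```

Selecting with a boolean mask returns new frames and leaves `toy` unchanged, whereas the notebook drops the rare rows from `data` in place.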

In [13]:
# Train-Validation-Test split: 60-20-20, keeping the class distribution constant across all 3 data sets.
# Stratification is performed on the combination of crop_name, sowing_period and harvest_period.
# The second split takes 25% of the remaining 80%, which is 20% of the full data set.

train_test, val = train_test_split(data, test_size=0.2, 
                                   stratify=data[['crop_name', 'sowing_period', 'harvest_period']], random_state=0)

train, test = train_test_split(train_test, test_size=0.25, 
                                   stratify=train_test[['crop_name', 'sowing_period', 'harvest_period']], random_state=0)

display(train.shape, val.shape, test.shape, 
        train.crop_name.value_counts(normalize=True), val.crop_name.value_counts(normalize=True), test.crop_name.value_counts(normalize=True),
        train.crop_name.value_counts(normalize=False), val.crop_name.value_counts(normalize=False), test.crop_name.value_counts(normalize=False))

(16101, 18)

(5368, 18)

(5367, 18)

Wheat      0.814111
Potato     0.114341
Mustard    0.071548
Name: crop_name, dtype: float64

Wheat      0.813711
Potato     0.114382
Mustard    0.071908
Name: crop_name, dtype: float64

Wheat      0.813490
Potato     0.114589
Mustard    0.071921
Name: crop_name, dtype: float64

Wheat      13108
Potato      1841
Mustard     1152
Name: crop_name, dtype: int64

Wheat      4368
Potato      614
Mustard     386
Name: crop_name, dtype: int64

Wheat      4366
Potato      615
Mustard     386
Name: crop_name, dtype: int64
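
As a sanity check on the 60-20-20 arithmetic, the two-step split can be reproduced on synthetic data. The sketch below assumes a single stratification column and an invented 80/12/8 class balance; the names `toy`, `rest` and `fractions` are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data with a made-up class balance (illustrative only).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "ndvi": rng.normal(size=1000),
    "crop_name": ["Wheat"] * 800 + ["Potato"] * 120 + ["Mustard"] * 80,
})

# Step 1: hold out 20% of all rows for validation.
rest, val = train_test_split(
    toy, test_size=0.2, stratify=toy["crop_name"], random_state=0
)
# Step 2: 25% of the remaining 80% equals 20% of the original, leaving 60% for training.
train, test = train_test_split(
    rest, test_size=0.25, stratify=rest["crop_name"], random_state=0
)

fractions = [len(part) / len(toy) for part in (train, val, test)]
print(fractions)
print(val["crop_name"].value_counts(normalize=True))
```

The printed class proportions of each part should stay close to those of the full frame, which is the property the normalized `value_counts` output above demonstrates for the real data.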

In [14]:
# Adding the samples with fewer than 3 occurrences back to the training set to aid generalization.

train = pd.concat([train, samples_under_three], axis=0)

# Label Encoding

In [15]:
# Label-encode the crop classes and drop the metadata columns that were needed only for stratification.

crop_label = {'Mustard':0, 'Wheat':1, 'Potato':2}

for df in train, val, test:
    df['crop_name'] = df['crop_name'].apply(lambda crop:crop_label[crop])
    df.drop(['sowing_period', 'harvest_period', 'crop_sp_hp'], axis=1, inplace=True)
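
Model predictions will come back as integers, so an inverse mapping is handy for reporting. The sketch below reuses the `crop_label` dictionary from the cell above; `label_crop` and the sample series are hypothetical. It uses `Series.map`, which is an alternative to the `apply` call in the notebook and behaves differently for unknown classes: `map` yields `NaN`, while the dictionary lookup inside `apply` raises a `KeyError`.

```python
import pandas as pd

# Mapping copied from the notebook.
crop_label = {"Mustard": 0, "Wheat": 1, "Potato": 2}
# Inverse mapping (hypothetical helper) for decoding predictions back to crop names.
label_crop = {code: name for name, code in crop_label.items()}

# Made-up sample of crop names.
crops = pd.Series(["Wheat", "Potato", "Mustard", "Wheat"], name="crop_name")

encoded = crops.map(crop_label)
# map() produces NaN for a class missing from the dictionary, so check explicitly.
if encoded.isna().any():
    raise ValueError("unmapped crop name found")

decoded = encoded.map(label_crop)
print(encoded.tolist())
print(decoded.tolist())
```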

# Scaling

In [16]:
# Fit the standard scaler on the training set only, then apply the same transform to the validation and test sets

scaler = StandardScaler()

train.loc[:,'oct_1f':'apr_2f'] = scaler.fit_transform(train.drop('crop_name', axis=1))
val.loc[:,'oct_1f':'apr_2f'] = scaler.transform(val.drop('crop_name', axis=1))
test.loc[:,'oct_1f':'apr_2f'] = scaler.transform(test.drop('crop_name', axis=1))
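
To make explicit what the scaler does, the following sketch compares `StandardScaler` with the manual formula z = (x - mean) / std on synthetic arrays. The arrays stand in for the 14 NDVI columns and their values are made up; `train_X`, `val_X` and `manual` are hypothetical names.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic stand-ins for the 14 NDVI feature columns (values are invented).
train_X = rng.uniform(100, 200, size=(500, 14))
val_X = rng.uniform(100, 200, size=(100, 14))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_X)  # learns per-column mean and std from train only
val_scaled = scaler.transform(val_X)          # reuses the training statistics

# Manual equivalent, using the population standard deviation (NumPy's default, ddof=0).
manual = (val_X - train_X.mean(axis=0)) / train_X.std(axis=0)

print(np.allclose(val_scaled, manual))
print(np.allclose(train_scaled.mean(axis=0), 0.0))
print(np.allclose(train_scaled.std(axis=0), 1.0))
```

Only the training columns end up with exactly zero mean and unit variance. The validation and test sets are shifted and scaled by the training statistics, which avoids leaking information from held-out data into preprocessing.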

# File Export

In [17]:
train.to_csv('data_files/train.csv', index=False)
val.to_csv('data_files/val.csv', index=False)
test.to_csv('data_files/test.csv', index=False)