# Crop Type Classification - Featurization

This notebook performs featurization on the cleaned NDVI timeseries data to prepare it for model training.

### Conda environment setup
Before running this notebook, create a conda environment. If conda is not installed, follow the instructions in the [Conda User Guide](https://docs.conda.io/projects/conda/en/latest/user-guide/index.html).

```
$ conda create --name 'env_name' --file requirements.txt
$ conda activate 'env_name'
```

### Key Libraries:

- [Geopandas](https://geopandas.org/en/stable/docs.html), [Pandas](https://pandas.pydata.org/docs/): Data handling and manipulation.
- [Sklearn](https://scikit-learn.org/0.21/documentation.html): Train/test split, preprocessors, and encoders.


The key steps are:

- Train/Validation/Test Split: Split the data into training, validation, and test sets (60/20/20) for model fitting and evaluation, stratified by crop type, sowing period, and harvest period.
- Feature Selection: Keep only the NDVI features needed for crop classification and remove unnecessary metadata columns.
- Label Encoding: Encode the crop type classes as integer labels.
- Scaling: Scale the NDVI features to have zero mean and unit variance to improve model convergence.

The outputs of this notebook are preprocessed pandas DataFrames, exported as CSV files, with selected features, encoded targets, and scaled NDVI data ready for model training, validation, and testing.

--------

# Imports

In [2]:
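# Change the working directory to the repository main folder (two levels up).
# The kernel_is_loaded flag keeps the directory from changing again when the cell is re-run.
import os
try:
    if kernel_is_loaded:
        pass
except NameError:
    # First run: kernel_is_loaded is not defined yet.
    os.chdir('/'.join(os.getcwd().split('/')[:-2]))
    kernel_is_loaded = True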

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from IPython.display import display
from sklearn.preprocessing import StandardScaler
import random
import warnings
warnings.filterwarnings("ignore")

pd.options.display.max_columns = 20

# Seeding
def seed_everything(seed = 42):
    random.seed(seed)
    np.random.seed(seed)
    
seed_everything()
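
The working-directory cell above climbs two levels from the notebook's folder by splitting the path on `/`. A minimal sketch of the same idea with `pathlib` is shown below. The `repo_root` helper and the example path are hypothetical and only serve as illustration.

```python
from pathlib import PurePosixPath

def repo_root(cwd, levels_up=2):
    """Return the directory `levels_up` levels above `cwd`."""
    # parents[0] is the immediate parent, parents[1] the grandparent, and so on.
    return PurePosixPath(cwd).parents[levels_up - 1]

# Hypothetical location of a notebook folder inside a repository checkout.
example_cwd = "/home/user/repo/notebooks/demo"
root = repo_root(example_cwd)
print(root)  # /home/user/repo
```

In the notebook itself, `os.chdir(Path.cwd().parents[1])` would play the same role as the string-splitting expression.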

In [3]:
data = pd.read_csv('notebooks/demo/data_files/preprocessed_data_sample.csv')
data.head(5)

|   | oct_1f | oct_2f | nov_1f | nov_2f | dec_1f | dec_2f | jan_1f | jan_2f | feb_1f | feb_2f | mar_1f | mar_2f | apr_1f | apr_2f | crop_type | sowing_period | harvest_period |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 135.0 | 136.0 | 129.0 | 160.0 | 181.0 | 188.0 | 189.0 | 186.0 | 185.0 | 157.0 | 149.0 | 123.0 | 129.0 | 125.0 | Wheat | nov_1f | mar_1f |
| 1 | 174.0 | 144.0 | 132.0 | 137.0 | 168.0 | 196.0 | 192.0 | 189.0 | 189.0 | 177.0 | 163.0 | 134.0 | 121.0 | 116.0 | Wheat | nov_2f | mar_1f |
| 2 | 154.0 | 150.0 | 126.0 | 132.0 | 155.0 | 169.0 | 180.0 | 182.0 | 181.0 | 175.0 | 151.0 | 122.0 | 115.0 | 114.0 | Wheat | nov_1f | mar_1f |
| 3 | 180.0 | 171.0 | 135.0 | 128.0 | 135.0 | 162.0 | 178.0 | 173.0 | 185.0 | 188.0 | 175.0 | 144.0 | 134.0 | 130.0 | Wheat | nov_2f | mar_2f |
| 4 | 142.0 | 123.0 | 119.0 | 120.0 | 148.0 | 183.0 | 191.0 | 188.0 | 192.0 | 189.0 | 153.0 | 127.0 | 117.0 | 117.0 | Wheat | nov_2f | mar_1f |


# Train, Test and Validation split

In [5]:
# Combinations of crop type, sowing period and harvest period with fewer than 3 samples cannot be
# divided into three parts (Train, Validation, Test). They are excluded before splitting and then
# added back to the training set to aid generalization.

data['crop_sp_hp'] = data['crop_type']+'_'+data['sowing_period']+'_'+data['harvest_period']

comb_under_three = data.crop_sp_hp.value_counts()[data.crop_sp_hp.value_counts() < 3].index

samples_under_three = data[data['crop_sp_hp'].isin(comb_under_three)]

data.drop(samples_under_three.index, inplace=True)

samples_under_three.shape[0]

22
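
To illustrate the filtering step, the sketch below flags every stratification group with fewer than three samples in a small DataFrame. The rows are invented and are not from the dataset. It counts the groups once with `value_counts()` and maps the counts back onto the rows, which is one alternative to calling `value_counts()` twice.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "crop_type":      ["Wheat", "Wheat", "Wheat", "Potato", "Potato", "Mustard"],
    "sowing_period":  ["nov_1f", "nov_1f", "nov_1f", "oct_2f", "oct_2f", "oct_1f"],
    "harvest_period": ["mar_1f", "mar_1f", "mar_1f", "feb_1f", "feb_1f", "feb_2f"],
})

min_per_group = 3  # at least one sample each for train, validation and test

# One key per row, combining the three stratification columns.
key = toy["crop_type"] + "_" + toy["sowing_period"] + "_" + toy["harvest_period"]

# Size of the group each row belongs to, aligned with the original index.
group_size = key.map(key.value_counts())

rare = toy[group_size < min_per_group]          # held out of the stratified split
splittable = toy[group_size >= min_per_group]   # safe to stratify

print(len(rare), len(splittable))  # 3 3
```

Here the Potato group (2 rows) and the Mustard group (1 row) are set aside, while the Wheat group (3 rows) can be stratified.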

In [8]:
# Train-Validation-Test split: 60-20-20, keeping the data distribution consistent across all 3 data sets.
# Stratification is performed on crop_type, sowing_period and harvest_period.

# First hold out 20% of the data for validation.
train_test, val = train_test_split(data, test_size=0.2,
                                   stratify=data[['crop_type', 'sowing_period', 'harvest_period']],
                                   random_state=0)

# Then hold out 25% of the remaining 80% (20% of the total) for testing.
train, test = train_test_split(train_test, test_size=0.25,
                               stratify=train_test[['crop_type', 'sowing_period', 'harvest_period']],
                               random_state=0)

display(train.shape, val.shape, test.shape, 
        train.crop_type.value_counts(normalize=True), val.crop_type.value_counts(normalize=True), test.crop_type.value_counts(normalize=True),
        train.crop_type.value_counts(normalize=False), val.crop_type.value_counts(normalize=False), test.crop_type.value_counts(normalize=False))

(166, 18)

(56, 18)

(56, 18)

Wheat      0.349398
Mustard    0.349398
Potato     0.301205
Name: crop_type, dtype: float64

Wheat      0.357143
Mustard    0.339286
Potato     0.303571
Name: crop_type, dtype: float64

Mustard    0.357143
Wheat      0.339286
Potato     0.303571
Name: crop_type, dtype: float64

Wheat      58
Mustard    58
Potato     50
Name: crop_type, dtype: int64

Wheat      20
Mustard    19
Potato     17
Name: crop_type, dtype: int64

Mustard    20
Wheat      19
Potato     17
Name: crop_type, dtype: int64
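
The 60-20-20 proportions follow from applying the two splits in sequence: 20% is taken first, then 25% of the remaining 80%, which is again 20% of the original data. The sketch below reproduces the set sizes shown above. It assumes that scikit-learn rounds the held-out part up when `test_size` is a float, and the `holdout_size` helper is a hypothetical name used only for illustration.

```python
import math

def holdout_size(n_samples, fraction):
    """Number of held-out samples, assuming the held-out part is rounded up."""
    return math.ceil(fraction * n_samples)

n_total = 166 + 56 + 56  # samples that entered the stratified split

n_val = holdout_size(n_total, 0.2)         # first split: 20% for validation
n_remaining = n_total - n_val
n_test = holdout_size(n_remaining, 0.25)   # second split: 25% of what is left
n_train = n_remaining - n_test

print(n_train, n_val, n_test)  # 166 56 56

# General rule for the second fraction: desired test share / share left after the first split.
second_fraction = 0.2 / (1 - 0.2)
```

The same rule gives the second-stage fraction for other layouts, for example 0.15 / (1 - 0.15) for a 70-15-15 split.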

In [9]:
# Add the samples from combinations with fewer than 3 occurrences back to the training set to aid generalization.

train = pd.concat([train, samples_under_three], axis=0)

# Label Encoding

In [11]:
# Label-encode the crop classes and drop the metadata columns that are not model features.

crop_label = {'Mustard':0, 'Wheat':1, 'Potato':2}

for df in train, val, test:
    df['crop_type'] = df['crop_type'].apply(lambda crop: crop_label[crop])
    df.drop(['sowing_period', 'harvest_period', 'crop_sp_hp'], axis=1, inplace=True)
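
A fixed dictionary keeps the class-to-integer mapping explicit and reproducible. The sketch below, on a few made-up labels, applies the same mapping with `Series.map` and builds the inverse dictionary that would be needed to turn predicted integers back into crop names. The `inverse_label` name is introduced here for illustration and does not appear in the notebook.

```python
import pandas as pd

crop_label = {"Mustard": 0, "Wheat": 1, "Potato": 2}
# Hypothetical helper mapping for decoding model predictions.
inverse_label = {code: crop for crop, code in crop_label.items()}

crops = pd.Series(["Wheat", "Potato", "Mustard", "Wheat"])  # made-up labels

encoded = crops.map(crop_label)
# Series.map leaves NaN for labels missing from the dictionary, so check explicitly.
assert encoded.notna().all(), "found a crop type without a label"

decoded = encoded.map(inverse_label)
print(encoded.tolist())  # [1, 2, 0, 1]
print(decoded.tolist())  # ['Wheat', 'Potato', 'Mustard', 'Wheat']
```

Unlike the `apply` with a dictionary lookup, `map` does not raise on an unknown label, which is why the explicit check is included.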

# Scaling

In [12]:
# Fit the standard scaler on the training set only, then use it to transform the validation and test sets

scaler = StandardScaler()

train.loc[:,'oct_1f':'apr_2f'] = scaler.fit_transform(train.drop('crop_type', axis=1))
val.loc[:,'oct_1f':'apr_2f'] = scaler.transform(val.drop('crop_type', axis=1))
test.loc[:,'oct_1f':'apr_2f'] = scaler.transform(test.drop('crop_type', axis=1))
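
`StandardScaler` standardizes each column as z = (x - mean) / std, using the mean and the population standard deviation (`ddof=0`) learned from the data passed to `fit`. The NumPy sketch below mirrors that calculation on made-up numbers. It is meant to show why the statistics come from the training set only and are then reused for the other sets, and it is not a replacement for the scaler.

```python
import numpy as np

# Made-up NDVI-like values: rows are samples, columns are features.
train_X = np.array([[120.0, 150.0],
                    [140.0, 170.0],
                    [160.0, 190.0]])
val_X = np.array([[150.0, 160.0]])

# "fit": learn per-column statistics from the training data only.
mean = train_X.mean(axis=0)
std = train_X.std(axis=0, ddof=0)

# "transform": apply the training statistics to every split.
train_scaled = (train_X - mean) / std
val_scaled = (val_X - mean) / std

print(train_scaled.mean(axis=0))  # approximately 0 for each column
print(train_scaled.std(axis=0))   # approximately 1 for each column
print(val_scaled)                 # expressed relative to the training distribution
```

The validation row is not centered at zero, which is expected: fitting a separate scaler on it would leak information about the held-out data into preprocessing.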

# File Export

In [13]:
train.to_csv('notebooks/demo/data_files/train.csv', index=False)
val.to_csv('notebooks/demo/data_files/val.csv', index=False)
test.to_csv('notebooks/demo/data_files/test.csv', index=False)
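
Passing `index=False` keeps the DataFrame index out of the exported files. The sketch below writes to in-memory buffers instead of the notebook's paths and uses made-up values. It shows that without `index=False` the index is written as an unnamed first column, which `read_csv` then loads as `Unnamed: 0`.

```python
import io
import pandas as pd

df = pd.DataFrame({"oct_1f": [0.5, -1.2], "crop_type": [1, 0]})  # made-up values

with_index = io.StringIO()
df.to_csv(with_index)                  # default: the index is written as a column
with_index.seek(0)

without_index = io.StringIO()
df.to_csv(without_index, index=False)  # index omitted, as in the export cell
without_index.seek(0)

cols_with_index = pd.read_csv(with_index).columns.tolist()
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)     # ['Unnamed: 0', 'oct_1f', 'crop_type']
print(cols_without_index)  # ['oct_1f', 'crop_type']
```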