## Generate all features

This notebook takes the extracted train and test datasets and transforms them into engineered feature datasets.

In [1]:
# Run this cell to automatically reload all modules (if they have been edited externally)
%load_ext autoreload
%autoreload 2

In [3]:
from modules.generate_features import (GenerateLocationFeatures,
                                        GenerateVegIndexFeatures,
                                        GenerateTimeDiffFeatures,
                                        GenerateStatFeatures,
                                        GenerateMeanDiffFeatures,
                                        GenerateResizedImages,
                                        ObjectToPixels)

In [4]:
import pandas as pd

train_data = pd.read_pickle('extracted_data/train_data.pkl')
test_data = pd.read_pickle('extracted_data/test_data.pkl')

### Generate Location Features (0)

The latitude and longitude of each field in the train set are used to fit a KMeans clustering model, which divides the field locations into 'zones'. The fitted model then acts as a transformer that labels each field with a zone number, with the number of clusters setting the number of zones (in this example, features are generated for cluster counts between 10 and 2000).

In [5]:
location_features = GenerateLocationFeatures().fit(train_data)

In [6]:
location_features_train = location_features.transform(
    train_data, save=True, path='processed_data/train/location_features')

In [7]:
location_features_test = location_features.transform(
    test_data, save=True, path='processed_data/test/location_features')
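The zoning idea described above can be sketched roughly as follows. This is a minimal illustration using scikit-learn's `KMeans`, not the code inside `GenerateLocationFeatures`; the coordinates are synthetic, and the helper names `fit_zone_models` and `label_zones` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic (latitude, longitude) pairs standing in for field locations.
rng = np.random.default_rng(0)
train_coords = rng.uniform([-29.0, 20.0], [-28.0, 21.0], size=(200, 2))
test_coords = rng.uniform([-29.0, 20.0], [-28.0, 21.0], size=(50, 2))


def fit_zone_models(coords, zone_counts):
    """Fit one KMeans model per requested number of zones (train set only)."""
    return {
        k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(coords)
        for k in zone_counts
    }


def label_zones(models, coords):
    """Map each zone count to an array of zone labels for the given fields."""
    return {k: model.predict(coords) for k, model in models.items()}


# Fit on the train coordinates, then label both sets with the same models.
zone_models = fit_zone_models(train_coords, zone_counts=(10, 50))
train_zones = label_zones(zone_models, train_coords)
test_zones = label_zones(zone_models, test_coords)
```

Fitting on the train set only and reusing the fitted models for the test set keeps the zone numbering consistent between the two datasets.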

### Generate Veg Index Features (1)

Certain transformations of spectral bands can accentuate the spectral response of green plants, so that vegetation characteristics can be measured while the background soil signal and atmospheric and topographic effects are discounted. A selection of these vegetation indices is calculated for each field.

In [8]:
vegindex_features_train = GenerateVegIndexFeatures().transform(
    train_data, save=True, path='processed_data/train/vegindex_features')

Output (truncated in this export; these appear to be source lines echoed by runtime warnings from the index calculations):

  x[NIR_col].astype(np.int16) + x[RED_col].astype(np.int16))
  RVI = np.divide(x[NIR_col].astype(np.int16), x[RED_col].astype(np.int16))
  IPVI = x[NIR_col].astype(np.int16) / (x[NIR_col].astype(np.int16) + x[RED_col].astype(np.int16))
  ARVI = (x[NIR_col].astype(np.int16) - RB) / (x[NIR_col].astype(np.int16) + RB)

In [9]:
vegindex_features_test = GenerateVegIndexFeatures().transform(
    test_data, save=True, path='processed_data/test/vegindex_features')
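For reference, a few band-ratio indices of this kind can be sketched with NumPy as below. This is an illustrative sketch rather than the module's implementation: the helper names are hypothetical, NDVI is included on the assumption that it is among the indices calculated, and ARVI is left out because its `RB` term is not defined in this notebook. Zero denominators (for example, padded pixels) produce `NaN` instead of a division warning.

```python
import numpy as np


def safe_divide(numerator, denominator):
    """Element-wise division that yields NaN where the denominator is zero."""
    numerator = np.asarray(numerator, dtype=np.float32)
    denominator = np.asarray(denominator, dtype=np.float32)
    shape = np.broadcast(numerator, denominator).shape
    out = np.full(shape, np.nan, dtype=np.float32)
    np.divide(numerator, denominator, out=out, where=denominator != 0)
    return out


def vegetation_indices(nir, red):
    """Compute NDVI, RVI and IPVI from near-infrared and red band arrays."""
    # Cast to float first so that subtraction cannot wrap around on
    # unsigned integer inputs.
    nir = np.asarray(nir, dtype=np.float32)
    red = np.asarray(red, dtype=np.float32)
    return {
        "NDVI": safe_divide(nir - red, nir + red),
        "RVI": safe_divide(nir, red),
        "IPVI": safe_divide(nir, nir + red),
    }


# Tiny made-up bands; the top-left pixel is zero in both (e.g. padding).
nir = np.array([[0, 400], [300, 0]], dtype=np.uint16)
red = np.array([[0, 100], [100, 50]], dtype=np.uint16)
indices = vegetation_indices(nir, red)
```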

### ---> Merge into train and test data

Merging at this point ensures that the features generated below are calculated with the vegetation index features included.

In [11]:
train_data = train_data.merge(vegindex_features_train, on='Field_Id', how='left')

In [12]:
test_data = test_data.merge(vegindex_features_test, on='Field_Id', how='left')
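A left merge on `Field_Id` keeps every row of the original dataset, including fields with no matching feature row. The toy example below illustrates this with made-up values; the column names `band` and `ndvi_mean` are hypothetical. The optional `validate` argument makes pandas raise an error if `Field_Id` is duplicated on either side, which would otherwise silently multiply rows.

```python
import pandas as pd

# Toy stand-ins for the dataset and a generated feature table.
base = pd.DataFrame({"Field_Id": [1, 2, 3], "band": [0.2, 0.4, 0.6]})
new_features = pd.DataFrame({"Field_Id": [1, 2], "ndvi_mean": [0.55, 0.61]})

# how="left" keeps all three fields; field 3 gets NaN for the new column.
merged = base.merge(new_features, on="Field_Id", how="left",
                    validate="one_to_one")
```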

### Generate time difference features (2)

The change in the appearance of a field over time is likely closely related to the crop type, as a result of seeding and growth cycles throughout the year. Intensity difference features are calculated between the seasons of the year in an attempt to highlight this time-series pattern in the absence of several years of data.

In [14]:
timediff_features_train = GenerateTimeDiffFeatures().transform(
    train_data, save=True, path='processed_data/train/timediff_features')

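Output (truncated in this export; this appears to be a source line echoed by a runtime warning from the change calculation):

  change = (x[new_col].astype(np.float32) - x[old_col].astype(np.float32)) / x[old_col].astype(np.float32)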

In [15]:
timediff_features_test = GenerateTimeDiffFeatures().transform(
    test_data, save=True, path='processed_data/test/timediff_features')
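The relative change between two acquisitions, `(new - old) / old`, can be sketched as follows. This is a minimal illustration with made-up values, not the module's code: the function name `relative_change` and the season variable names are hypothetical, and pixels where the older value is zero are set to `NaN` rather than triggering a division warning.

```python
import numpy as np


def relative_change(old, new):
    """Return (new - old) / old element-wise, with NaN where old is zero."""
    old = np.asarray(old, dtype=np.float32)
    new = np.asarray(new, dtype=np.float32)
    change = np.full(old.shape, np.nan, dtype=np.float32)
    np.divide(new - old, old, out=change, where=old != 0)
    return change


# Made-up intensities for the same three pixels in two seasons.
summer = np.array([100, 200, 0])
winter = np.array([150, 100, 30])
change = relative_change(summer, winter)  # [0.5, -0.5, nan]
```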

### ---> Merge into train and test data

Merging at this point ensures that the features generated below are calculated with the time difference features included.

In [17]:
train_data = train_data.merge(timediff_features_train, on='Field_Id', how='left')

In [18]:
test_data = test_data.merge(timediff_features_test, on='Field_Id', how='left')

### Generate statistical features (3)

The mean, standard deviation, maximum and minimum pixel intensities are calculated for each image of each field.

In [20]:
stat_features_train = GenerateStatFeatures().transform(
    train_data, save=True, path='processed_data/train/stat_features')

In [21]:
stat_features_test = GenerateStatFeatures().transform(
    test_data, save=True, path='processed_data/test/stat_features')
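As a rough illustration, the four statistics for a single image array could be computed as below. The function name `image_stats` is hypothetical, and the `ignore_padding` option rests on the assumption that zero-valued pixels are padding; whether `GenerateStatFeatures` excludes them is not shown in this notebook.

```python
import numpy as np


def image_stats(image, ignore_padding=True):
    """Return mean, std, max and min of an image's pixel intensities."""
    pixels = np.asarray(image, dtype=np.float32).ravel()
    if ignore_padding:
        # Assumption: zeros are padding rather than genuine intensities.
        pixels = pixels[pixels != 0]
    if pixels.size == 0:
        return {"mean": np.nan, "std": np.nan, "max": np.nan, "min": np.nan}
    return {
        "mean": float(pixels.mean()),
        "std": float(pixels.std()),
        "max": float(pixels.max()),
        "min": float(pixels.min()),
    }


# A 3 x 3 image whose first row and first column are zero.
image = np.array([[0, 0, 0], [0, 2, 4], [0, 6, 8]])
stats = image_stats(image)
```

With padding ignored, the mean here is 5.0; including the zeros would pull it down to about 2.2, which is why the treatment of padding matters for small fields.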

### Generate mean difference features (4)

The deviation of a field from the 'typical' image for each crop may be a useful indicator of its similarity to that crop. The mean pixel value for each crop type in the training set is calculated, and the difference between each pixel in an image and this mean is then determined for each field.

Note: This step takes a long time to run and is skipped here, as these features are currently unused.
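Although the step is not run here, the calculation described above can be sketched on toy data. This is one possible reading of the description, not the behaviour of `GenerateMeanDiffFeatures`: the function names, the crop labels and the pixel values are all hypothetical.

```python
import numpy as np


def crop_pixel_means(images, crops):
    """Mean pixel value per crop type, pooled over all training images."""
    sums, counts = {}, {}
    for image, crop in zip(images, crops):
        pixels = np.asarray(image, dtype=np.float64)
        sums[crop] = sums.get(crop, 0.0) + pixels.sum()
        counts[crop] = counts.get(crop, 0) + pixels.size
    return {crop: sums[crop] / counts[crop] for crop in sums}


def mean_diff_features(image, means):
    """Difference between each pixel and each crop's mean pixel value."""
    pixels = np.asarray(image, dtype=np.float64)
    return {crop: pixels - mean for crop, mean in means.items()}


# Toy training images with made-up crop labels.
train_images = [np.array([[1, 3]]), np.array([[5, 7]]), np.array([[10, 20]])]
train_crops = ["maize", "maize", "wheat"]
means = crop_pixel_means(train_images, train_crops)

# One array of differences per crop type for a new field.
diffs = mean_diff_features(np.array([[2, 4]]), means)
```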

### Generate resized images (5)

Each field has a different size and shape, with its image represented by a numpy array of pixel intensities padded by zeros. To standardise these images for neural network (NN) training, they are resized to common dimensions (32 x 32 in this example).

In [22]:
resized_images_train = GenerateResizedImages().transform(
    train_data, save=True, path='processed_data/train/resized_images')

In [23]:
resized_images_test = GenerateResizedImages().transform(
    test_data, save=True, path='processed_data/test/resized_images')
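To make the resizing step concrete, the sketch below resamples an arbitrary 2-D array to 32 x 32 using nearest-neighbour index mapping in plain NumPy. The interpolation method used by `GenerateResizedImages` is not stated in this notebook, so nearest-neighbour is an assumption chosen for simplicity, and `resize_nearest` is a hypothetical name.

```python
import numpy as np


def resize_nearest(image, size=(32, 32)):
    """Resize a 2-D array to `size` by nearest-neighbour sampling."""
    image = np.asarray(image)
    # For each output row/column, pick the source index it maps back to.
    rows = np.arange(size[0]) * image.shape[0] // size[0]
    cols = np.arange(size[1]) * image.shape[1] // size[1]
    return image[np.ix_(rows, cols)]


# A small non-square image is stretched to the common dimensions.
small = np.arange(35).reshape(5, 7)
resized = resize_nearest(small)

# A larger image is downsampled to the same dimensions.
large = np.arange(64 * 64).reshape(64, 64)
shrunk = resize_nearest(large)
```

Nearest-neighbour sampling preserves the original intensity values, at the cost of blockier results than bilinear or bicubic interpolation.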

### Transform to pixels (6)

For training the pixel-based models, the dataset is transformed from a set of image arrays to a set of pixel values.

In [24]:
expand_to_pixels_train = ObjectToPixels().transform(
    train_data, save=True, path='processed_data/train/expanded_pixels')

In [25]:
expand_to_pixels_test = ObjectToPixels().transform(
    test_data, save=True, path='processed_data/test/expanded_pixels')
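The object-to-pixel expansion can be pictured as turning one row per field into one row per pixel. The sketch below shows one way this might look; it is not the `ObjectToPixels` implementation. The function name, the `pixel_value` column and the choice to drop zero-valued pixels (assumed to be padding) are all assumptions.

```python
import numpy as np
import pandas as pd


def objects_to_pixels(field_ids, images):
    """Expand per-field image arrays into a long table of pixel values."""
    frames = []
    for field_id, image in zip(field_ids, images):
        pixels = np.asarray(image).ravel()
        # Assumption: zeros are padding and carry no information.
        pixels = pixels[pixels != 0]
        frames.append(pd.DataFrame({"Field_Id": field_id,
                                    "pixel_value": pixels}))
    return pd.concat(frames, ignore_index=True)


# Two toy fields with differently shaped images.
field_ids = [1, 2]
images = [np.array([[0, 5], [6, 0]]), np.array([[7, 8, 9]])]
pixel_table = objects_to_pixels(field_ids, images)
```

Keeping `Field_Id` on every pixel row allows pixel-level predictions to be aggregated back to the field level later.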