## Linear model

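In this notebook, we train a linear model and inspect the coefficients it learns. Provided the input features are on comparable scales, these coefficients can be read as global feature importances: the larger a coefficient's absolute value, the more the prediction changes when that input changes.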
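To make the link between coefficients and importances concrete, here is a minimal sketch on synthetic data. It does not use `LinearModel`; it fits ordinary least squares with NumPy, and the feature names, weights, and noise level are all invented for illustration. It assumes the inputs are standardized, which is what makes the coefficient magnitudes comparable.

```python
import numpy as np

# Synthetic, standardized inputs with known weights (all values are illustrative)
rng = np.random.default_rng(0)
n_samples = 500
feature_names = ["feature_a", "feature_b", "feature_c"]
true_weights = np.array([0.3, -0.1, 0.0])

X = rng.standard_normal((n_samples, len(feature_names)))
y = X @ true_weights + 0.01 * rng.standard_normal(n_samples)

# Ordinary least squares with an intercept column appended to the inputs
X_design = np.column_stack([X, np.ones(n_samples)])
solution, *_ = np.linalg.lstsq(X_design, y, rcond=None)
coefs, intercept = solution[:-1], solution[-1]

# Rank features by the absolute value of their fitted coefficient
ranking = [feature_names[i] for i in np.argsort(-np.abs(coefs))]
print(dict(zip(feature_names, coefs.round(3))))
print(ranking)
```

The fitted coefficients land close to the weights used to generate the data, so sorting by absolute value recovers the order in which the features actually drive the target. If the inputs were on very different scales, that ordering would no longer be meaningful.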

In [1]:
from pathlib import Path
from itertools import chain
import pandas as pd
import numpy as np

import sys
sys.path.insert(0, '..')

from predictor.models import LinearModel
from predictor.preprocessing import VALUE_COLS, VEGETATION_LABELS

In [2]:
path_to_arrays = Path('../data/processed/arrays')

In [3]:
model_with_veg = LinearModel(path_to_arrays)

In [4]:
model_with_veg.train()

Train set RMSE: 0.05012686183266561


We can isolate the coefficients, along with the labels of the inputs they correspond to. Each label in `value_labels` has the format `{value}_{month}`, where `month` is counted relative to `pred_month` (so if we are predicting June, then `month=11` corresponds to data from May).
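The relative month index can be turned back into a calendar month with a little modular arithmetic. The helper below is hypothetical (it is not part of `predictor`), and it assumes that relative month 11 is the month immediately before `pred_month` and that relative month 1 is eleven months before it.

```python
import calendar


def relative_to_calendar_month(relative_month: int, pred_month: int) -> int:
    """Map a relative month index (1..11) to a calendar month (1..12).

    Assumption: relative month 11 is the month just before `pred_month`,
    and relative month 1 is eleven months before it.
    """
    if not 1 <= relative_month <= 11:
        raise ValueError("relative_month must be between 1 and 11")
    if not 1 <= pred_month <= 12:
        raise ValueError("pred_month must be between 1 and 12")
    months_before = 12 - relative_month
    # Shift to a 0-based month, wrap around the year, shift back to 1-based
    return (pred_month - months_before - 1) % 12 + 1


pred_month = 6  # predicting June
for relative_month in (1, 6, 11):
    month = relative_to_calendar_month(relative_month, pred_month)
    print(f"ndvi_{relative_month} -> {calendar.month_name[month]}")
```

Under these assumptions, `ndvi_11` maps to May when predicting June, matching the example in the text, and `ndvi_1` maps to July of the previous year.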

In [5]:
coefs = model_with_veg.model.coef_

In [6]:
value_labels = list(chain(*[[f'{val}_{month}' for val in VALUE_COLS] for month in range(1, 12)]))

In [7]:
feature_importances_veg = pd.DataFrame(data={
    'feature': value_labels,
    'value': coefs
})

Let's investigate the most important features, ranked by the absolute value of their coefficients.

In [8]:
feature_importances_veg.iloc[(-np.abs(feature_importances_veg['value'].values)).argsort()][:10]

| index | feature | value |
|---|---|---|
| 86 | ndvi_11 | 0.30343 |
| 83 | sm_11 | 0.11974 |
| 3 | sm_1 | -0.093128 |
| 81 | lst_day_11 | -0.092214 |
| 6 | ndvi_1 | 0.090166 |
| 40 | lst_night_6 | 0.088878 |
| 28 | spi_4 | 0.081083 |
| 19 | sm_3 | 0.080832 |
| 29 | spei_4 | -0.075993 |
| 43 | sm_6 | 0.070247 |
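The ranking by absolute value can also be written with pandas alone, which some readers may find easier to follow than the `argsort` expression. This is only an alternative sketch: the frame below holds a handful of rows copied from the output above rather than the full set of coefficients, and the `abs_value` column name is introduced here for illustration.

```python
import pandas as pd

# A few rows copied from the output above; the real frame has one row per label
feature_importances = pd.DataFrame({
    "feature": ["sm_1", "ndvi_1", "sm_11", "ndvi_11", "lst_day_11"],
    "value": [-0.093128, 0.090166, 0.11974, 0.30343, -0.092214],
})

# Add an explicit magnitude column, sort on it, and keep the largest entries
top = (
    feature_importances
    .assign(abs_value=lambda df: df["value"].abs())
    .sort_values("abs_value", ascending=False)
    .head(3)
)
print(top)
```

Keeping the signed `value` column next to `abs_value` makes it easy to see both how strongly and in which direction each feature pushes the prediction.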


## Without vegetation

The largest coefficient above belongs to `ndvi_11`, which tells us that vegetation health in May is the strongest predictor of vegetation health in June. What happens if we hide vegetation health from the model?

In [9]:
model_no_veg = LinearModel(path_to_arrays, hide_vegetation=True)

In [10]:
model_no_veg.train()

Training model without vegetation features
Train set RMSE: 0.08695270885288697


In [11]:
coefs = model_no_veg.model.coef_

# Skip the vegetation columns so the labels line up with the remaining coefficients
value_labels = list(chain(*[[f'{val}_{month}' for val in VALUE_COLS if val not in VEGETATION_LABELS]
                            for month in range(1, 12)]))

In [12]:
feature_importances_no_veg = pd.DataFrame(data={
    'feature': value_labels,
    'value': coefs
})

In [13]:
feature_importances_no_veg.iloc[(-np.abs(feature_importances_no_veg['value'].values)).argsort()][:10]

| index | feature | value |
|---|---|---|
| 15 | sm_3 | 0.395197 |
| 61 | lst_day_11 | -0.318953 |
| 64 | spi_11 | 0.253879 |
| 63 | sm_11 | 0.225947 |
| 49 | lst_day_9 | 0.219764 |
| 30 | lst_night_6 | 0.186422 |
| 3 | sm_1 | -0.185122 |
| 42 | lst_night_8 | -0.174076 |
| 9 | sm_2 | -0.166429 |
| 65 | spei_11 | -0.162737 |
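One way to compare the two models is to line up the features that appear in both top-10 lists. The sketch below copies those five shared rows from the outputs above and merges them on the feature name. The column suffixes and the `abs_ratio` column are names chosen here for illustration, and the ratio is only a rough descriptive measure, not a statistical test.

```python
import pandas as pd

# Rows shared by both top-10 outputs, copied from the tables above
with_veg = pd.DataFrame({
    "feature": ["sm_11", "sm_1", "lst_day_11", "lst_night_6", "sm_3"],
    "value": [0.11974, -0.093128, -0.092214, 0.088878, 0.080832],
})
no_veg = pd.DataFrame({
    "feature": ["sm_3", "lst_day_11", "sm_11", "lst_night_6", "sm_1"],
    "value": [0.395197, -0.318953, 0.225947, 0.186422, -0.185122],
})

# Align the two sets of coefficients on the feature label
comparison = with_veg.merge(no_veg, on="feature", suffixes=("_with_veg", "_no_veg"))

# How much larger each coefficient is (in magnitude) once vegetation is hidden
comparison["abs_ratio"] = (
    comparison["value_no_veg"].abs() / comparison["value_with_veg"].abs()
)
comparison = comparison.sort_values("abs_ratio", ascending=False)
print(comparison)
```

For these shared features the sign of each coefficient is unchanged between the two models, while every magnitude grows when vegetation is hidden, with `sm_3` and `lst_day_11` growing the most.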
