## Linear model

In this notebook, we train a linear model and investigate the coefficients it learns. These coefficients can be read as global feature importances.
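One caveat when reading coefficients as importances: their magnitudes are only comparable across features if the features are on similar scales. The sketch below illustrates this on synthetic data with plain NumPy least squares. It is not the `LinearModel` used in this notebook, and whether the processed arrays are already normalised is not shown here. All names in the block (`x_small`, `x_large`, `fit_coefs`) are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Two synthetic features that contribute equally to y in terms of variance,
# but are measured on very different scales (assumption for illustration).
x_small = rng.normal(scale=1.0, size=n)
x_large = rng.normal(scale=100.0, size=n)
y = 1.0 * x_small + 0.01 * x_large + rng.normal(scale=0.1, size=n)


def fit_coefs(X, y):
    """Ordinary least squares with an intercept; returns only the slopes."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta[1:]


X = np.column_stack([x_small, x_large])
raw_coefs = fit_coefs(X, y)

# Standardise each column to zero mean and unit variance before fitting.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
std_coefs = fit_coefs(X_std, y)

print('raw coefficients:         ', raw_coefs)
print('standardised coefficients:', std_coefs)
```

On the raw features the two coefficients differ by roughly two orders of magnitude, while on the standardised features they are of similar size, which better reflects their equal contribution.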

In [1]:
from pathlib import Path
from itertools import chain
import pandas as pd
import numpy as np

import sys
sys.path.insert(0, '..')

from predictor.models import LinearModel
from predictor.preprocessing import VALUE_COLS, VEGETATION_LABELS

In [2]:
target = 'ndvi'

In [3]:
path_to_arrays = Path(f'../data/processed/{target}/arrays')

In [4]:
model_with_veg = LinearModel(path_to_arrays)

In [5]:
model_with_veg.train()

Train set RMSE: 0.03977352798000336


We can extract the coefficients along with the labels of the features they correspond to. Each label in `value_labels` has the format `{value}_{month}`, where `month` is relative to `pred_month`: if we are predicting June, `month=11` corresponds to data from May.
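To make the label convention concrete, here is a minimal sketch of a helper that splits a label and maps its relative month to a calendar month. The helper `describe_label` is hypothetical and not part of `predictor`. It assumes that relative month `k` lies `12 - k` months before `pred_month`, which is extrapolated from the single June/May example above.

```python
MONTH_NAMES = ('January', 'February', 'March', 'April', 'May', 'June', 'July',
               'August', 'September', 'October', 'November', 'December')


def describe_label(label, pred_month):
    """Split a `{value}_{month}` label and name the calendar month it refers to.

    Assumption: relative month k is (12 - k) months before `pred_month`.
    """
    # Split on the last underscore only, since values such as
    # `ndvi_anomaly` contain underscores themselves.
    value, month_str = label.rsplit('_', 1)
    rel_month = int(month_str)
    months_before = 12 - rel_month
    cal_month = (pred_month - months_before - 1) % 12 + 1
    return value, rel_month, MONTH_NAMES[cal_month - 1]


print(describe_label('ndvi_11', pred_month=6))         # ('ndvi', 11, 'May')
print(describe_label('ndvi_anomaly_1', pred_month=6))  # ('ndvi_anomaly', 1, 'July')
```

Under this assumption, `month=1` for a June prediction refers to July of the previous year.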

In [6]:
coefs = model_with_veg.model.coef_

In [7]:
value_labels = list(chain(*[[f'{val}_{month}' for val in VALUE_COLS] for month in range(1, 12)]))

In [8]:
feature_importances_veg = pd.DataFrame(data={
    'feature': value_labels,
    'value': coefs
})

Let's investigate the most important features (by absolute value):

In [9]:
feature_importances_veg.iloc[(-np.abs(feature_importances_veg['value'].values)).argsort()][:10]

| | feature | value |
|---|---|---|
| 74 | ndvi_11 | 0.118459 |
| 4 | ndvi_1 | 0.110067 |
| 67 | ndvi_10 | -0.054666 |
| 6 | ndvi_anomaly_1 | -0.036427 |
| 11 | ndvi_2 | -0.031033 |
| 75 | evi_11 | 0.020969 |
| 39 | ndvi_6 | 0.020805 |
| 71 | lst_day_11 | -0.016052 |
| 46 | ndvi_7 | 0.015323 |
| 40 | evi_6 | -0.014433 |
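The ranking by absolute value can also be written with pandas alone, which some readers may find easier to follow than the `argsort` expression. This is a sketch on a toy frame, not the notebook's data. The helper name `top_by_abs` is hypothetical.

```python
import pandas as pd

# Toy stand-in for a feature importance frame (values are made up).
toy = pd.DataFrame({
    'feature': ['a_1', 'b_1', 'a_2', 'b_2'],
    'value': [0.10, -0.30, 0.05, 0.20],
})


def top_by_abs(df, n=10, column='value'):
    """Return the n rows whose `column` has the largest absolute value."""
    order = df[column].abs().sort_values(ascending=False).index
    return df.loc[order].head(n)


top = top_by_abs(toy, n=3)
print(top)
```

The sign of each coefficient is preserved in the result, so negative coefficients such as `b_1` still show their direction of effect.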


## Without vegetation

The coefficients above suggest that recent vegetation health is predictive of the target: when predicting June, for example, the largest coefficient belongs to May's NDVI (`ndvi_11`). What happens if we hide vegetation health from the model?

In [10]:
model_no_veg = LinearModel(path_to_arrays, hide_vegetation=True)

In [11]:
model_no_veg.train()

Training model without vegetation features
Train set RMSE: 0.07832527470825593
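The two train-set RMSE values can be compared directly. The snippet below is a quick arithmetic check using the figures copied from the cell outputs. The variable names are illustrative only.

```python
# Train-set RMSE values copied from the outputs of the two `train()` calls.
rmse_with_veg = 0.03977352798000336
rmse_no_veg = 0.07832527470825593

ratio = rmse_no_veg / rmse_with_veg
increase_pct = (ratio - 1) * 100

print(f'RMSE ratio: {ratio:.2f}x (increase of {increase_pct:.0f}%)')
```

Hiding the vegetation features roughly doubles the train-set error. Both figures are training errors, so they say nothing yet about performance on held-out data.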


In [12]:
coefs = model_no_veg.model.coef_

# Rebuild the labels, skipping the vegetation features hidden from the model
value_labels = list(chain(*[[f'{val}_{month}' for val in VALUE_COLS if val not in VEGETATION_LABELS]
                            for month in range(1, 12)]))

In [13]:
feature_importances_no_veg = pd.DataFrame(data={
    'feature': value_labels,
    'value': coefs
})

In [14]:
feature_importances_no_veg.iloc[(-np.abs(feature_importances_no_veg['value'].values)).argsort()][:10]

| | feature | value |
|---|---|---|
| 41 | lst_day_11 | -0.068263 |
| 42 | precip_11 | 0.04888 |
| 0 | lst_night_1 | 0.035778 |
| 40 | lst_night_11 | -0.035483 |
| 30 | precip_8 | 0.030408 |
| 20 | lst_night_6 | 0.029277 |
| 21 | lst_day_6 | -0.026891 |
| 10 | precip_3 | 0.026077 |
| 33 | lst_day_9 | 0.025803 |
| 2 | precip_1 | 0.021489 |
