## Linear model

In this notebook, we train a linear model and inspect the coefficients it learns. Provided the input features are on comparable scales, the magnitude of each coefficient can be read as the global importance of that feature, and its sign gives the direction of the effect.
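
Coefficient magnitudes are only comparable across features when the features share a scale. The sketch below illustrates this on synthetic data; it does not use the notebook's `LinearModel`, and whether `predictor.preprocessing` normalises its inputs is not shown here. The names `fit_coefs`, `raw_coefs` and `std_coefs` are hypothetical helpers introduced for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Two synthetic features on very different scales.
x1 = rng.normal(size=n)
x2 = 100.0 * rng.normal(size=n)
X = np.column_stack([x1, x2])

# Both features contribute roughly equal variance to y,
# even though their raw coefficients differ by a factor of 100.
y = 1.0 * x1 + 0.01 * x2


def fit_coefs(X, y):
    """Ordinary least squares with an intercept; returns the slopes only."""
    A = np.column_stack([X, np.ones(len(X))])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[:-1]


raw_coefs = fit_coefs(X, y)

# Standardise each column to zero mean and unit variance, then refit.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
std_coefs = fit_coefs(X_std, y)

print('raw coefficients:         ', raw_coefs)
print('standardised coefficients:', std_coefs)
```

On the raw inputs the second feature looks a hundred times less important; after standardisation the two coefficients are of similar size, which matches how much each feature actually moves the target.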

In [1]:
from pathlib import Path
from itertools import chain
import pandas as pd
import numpy as np

import sys
sys.path.insert(0, '..')

from predictor.models import LinearModel
from predictor.preprocessing import VALUE_COLS

In [2]:
path_to_arrays = Path('../data/processed/arrays')

In [3]:
model = LinearModel(path_to_arrays)

In [4]:
model.train()

Train set RMSE: 0.05012686183266561
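
For reference, a minimal sketch of how a root mean squared error is usually computed. This assumes `LinearModel.train()` reports the standard RMSE between targets and predictions, which the output above does not confirm; `rmse` is a hypothetical helper, not part of `predictor`.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equally shaped arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Errors of 3 and 4 give sqrt((9 + 16) / 2) = sqrt(12.5)
print(rmse([0.0, 0.0], [3.0, 4.0]))
```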


We can extract the coefficients and build labels for the input values they correspond to. Each label in `value_labels` has the format `{value}_{month}`, where `month` is counted relative to `pred_month`: if we are predicting June, then `month=11` corresponds to data from May, the month immediately before.
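
The relative month index can be converted to a calendar month with a little modular arithmetic. The sketch below extrapolates from the single June/May example above, assuming the 11 relative months are consecutive and that the highest index is the month right before `pred_month`; `calendar_month` is a hypothetical helper.

```python
import calendar


def calendar_month(pred_month, relative_month, n_months=11):
    """Calendar month (1-12) for a relative month index.

    Assumes relative_month == n_months is the month immediately
    before pred_month, and lower indices reach further back.
    """
    months_back = n_months - relative_month + 1
    return (pred_month - months_back - 1) % 12 + 1


# Predicting June (6): month=11 is May, month=1 is July of the previous year.
print(calendar.month_name[calendar_month(6, 11)])
print(calendar.month_name[calendar_month(6, 1)])
```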

In [5]:
coefs = model.model.coef_

In [6]:
value_labels = list(chain(*[[f'{val}_{month}' for val in VALUE_COLS] for month in range(1, 12)]))
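
Because the month loop is the outer one, the labels are laid out month by month, with every value column repeated inside each month. The sketch below shows that ordering and how a flat index maps back to a month and a column position. The three-name `value_cols` list is a hypothetical stand-in, since the real `VALUE_COLS` is not shown in this notebook, and `decode` is a helper introduced for illustration.

```python
from itertools import chain

# Hypothetical stand-in for predictor.preprocessing.VALUE_COLS.
value_cols = ['ndvi', 'sm', 'spi']

# Same construction as in the cell above, written with chain.from_iterable.
labels = list(chain.from_iterable(
    [f'{val}_{month}' for val in value_cols] for month in range(1, 12)
))


def decode(index, n_cols):
    """Map a flat label index to (relative month, position within value_cols)."""
    month_offset, position = divmod(index, n_cols)
    return month_offset + 1, position


print(labels[:4])

# One label per (month, column) pair.
assert len(labels) == 11 * len(value_cols)

# Decoding an index recovers the label stored at that index.
month, position = decode(31, len(value_cols))
assert labels[31] == f'{value_cols[position]}_{month}'
```

The same arithmetic applies to the real labels once the length of `VALUE_COLS` is known.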

In [7]:
feature_importances = pd.DataFrame(data={
    'feature': value_labels,
    'value': coefs
})

Let's look at the ten most important features, ranked by the absolute value of their coefficients:

In [8]:
feature_importances.iloc[(-np.abs(feature_importances['value'].values)).argsort()][:10]

| | feature | value |
|---|---|---|
| 86 | ndvi_11 | 0.30343 |
| 83 | sm_11 | 0.11974 |
| 3 | sm_1 | -0.093128 |
| 81 | lst_day_11 | -0.092214 |
| 6 | ndvi_1 | 0.090166 |
| 40 | lst_night_6 | 0.088878 |
| 28 | spi_4 | 0.081083 |
| 19 | sm_3 | 0.080832 |
| 29 | spei_4 | -0.075993 |
| 43 | sm_6 | 0.070247 |
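
One way to summarise a ranking like this is to split each label into its variable and month and sum the absolute coefficients per group. The sketch below does so for the ten rows shown above only, so the totals ignore all remaining coefficients; the names `top10`, `by_month` and `by_variable` are introduced here for illustration.

```python
import pandas as pd

# The ten rows from the output above.
top10 = pd.DataFrame({
    'feature': ['ndvi_11', 'sm_11', 'sm_1', 'lst_day_11', 'ndvi_1',
                'lst_night_6', 'spi_4', 'sm_3', 'spei_4', 'sm_6'],
    'value': [0.30343, 0.11974, -0.093128, -0.092214, 0.090166,
              0.088878, 0.081083, 0.080832, -0.075993, 0.070247],
})

# Split on the last underscore only, since variable names
# such as lst_day contain underscores themselves.
parts = top10['feature'].str.rsplit('_', n=1, expand=True)
top10['variable'] = parts[0]
top10['month'] = parts[1].astype(int)
top10['abs_value'] = top10['value'].abs()

by_month = top10.groupby('month')['abs_value'].sum().sort_values(ascending=False)
by_variable = top10.groupby('variable')['abs_value'].sum().sort_values(ascending=False)

print(by_month)
print(by_variable)
```

Within these ten rows, `month=11` carries the largest total absolute weight, and `ndvi` and `sm` are the variables with the largest totals.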
