# 03 - Prequential Expanding

Splits the original dataset into blocks (folds). In the first iteration, the first block is used for training and the second block for testing. In the second iteration, the first and second blocks are used for training and the third block for testing, and so on for the remaining iterations until the last block is reached. The training data always expands.
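As a minimal sketch of the scheme described above, the helper below lists which blocks go into training and which block is held out at each iteration. The function name `expanding_splits` and the parameter `n_blocks` are illustrative assumptions and are not part of the notebook; they work on block ids only, not on the actual data.

```python
def expanding_splits(n_blocks):
    """Yield (train_blocks, val_block) pairs for an expanding window.

    Hypothetical helper: block ids are assumed to be 0 .. n_blocks - 1
    and ordered in time.
    """
    for val_block in range(1, n_blocks):
        # Train on every block that comes before the validation block.
        yield list(range(val_block)), val_block


splits = list(expanding_splits(4))
for train_blocks, val_block in splits:
    print("train", train_blocks, "-> validate", val_block)
```

With 12 blocks, as in this notebook, this produces 11 train/validation pairs.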

In [1]:
import pandas as pd
import numpy as np
from lightgbm import LGBMRegressor

In [2]:
data = pd.read_csv('data-processed/train.csv')

## Prequential Expanding

In [3]:
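# Group eras into blocks: the block id is the integer part of era / 10.
data['block'] = np.trunc(data['era']*.1).astype(int)
# Merge block 12 into block 11, leaving 12 blocks (0 to 11).
data.loc[data['block'] == 12, 'block'] = 11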

data['block'].value_counts().sort_index()

0     24515
1     34600
2     37444
3     41101
4     43439
5     48186
6     46831
7     40403
8     43971
9     45609
10    46107
11    49602
Name: block, dtype: int64

In [4]:
results_val = []

for block in range(1,12):
    print("Train blocks 0-{} - Validation Block {}".format(block - 1, block))
    
    train = data[data['block'] < block]
    val = data[data['block'] == block]
    
    X_train = train.filter(regex=r'feature')
    X_val = val.filter(regex=r'feature')

    y_train = train['target']
    y_val = val['target']
     
    mdl = LGBMRegressor(max_depth=5, num_leaves=2**5, learning_rate=0.01, n_estimators=2000, colsample_bytree=0.1, random_state=0)
    mdl.fit(X_train, y_train)
    
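    # Convert predictions to percentile ranks (ties broken by order of appearance).
    predictions = pd.Series(mdl.predict(X_val))
    ranked_predictions = predictions.rank(pct=True, method="first")
    # Pearson correlation between the target and the ranked predictions;
    # np.corrcoef compares the values by position, not by index.
    correlation = np.corrcoef(y_val, ranked_predictions)[0, 1]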
    
    results_val.append(correlation)
    print("Correlation {}".format(correlation))
    print()

Train blocks 0-0 - Validation Block 1
Correlation 0.03492548009783929

Train blocks 0-1 - Validation Block 2
Correlation 0.050899663651192195

Train blocks 0-2 - Validation Block 3
Correlation 0.055482929228326164

Train blocks 0-3 - Validation Block 4
Correlation 0.04808939370995281

Train blocks 0-4 - Validation Block 5
Correlation 0.02929588848371697

Train blocks 0-5 - Validation Block 6
Correlation 0.02933274216060179

Train blocks 0-6 - Validation Block 7
Correlation 0.04980009798824874

Train blocks 0-7 - Validation Block 8
Correlation 0.036794885860251224

Train blocks 0-8 - Validation Block 9
Correlation 0.06196275563173071

Train blocks 0-9 - Validation Block 10
Correlation 0.041960375767892584

Train blocks 0-10 - Validation Block 11
Correlation 0.03723415166242421



In [5]:
np.median(results_val)

0.041960375767892584

In [6]:
np.min(results_val)

0.02929588848371697

In [7]:
np.max(results_val)

0.06196275563173071

In [8]:
np.mean(results_val)

0.043252578567470605

In [9]:
len(results_val)

11

## Prequential Expanding With Gap

We insert a gap between the training and validation blocks. This is useful when the data immediately preceding the period we need to predict is not available at training time.
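The gapped variant can be sketched the same way: at each iteration, the blocks right before the validation block are skipped. The names `expanding_splits_with_gap`, `n_blocks` and `gap` are hypothetical and used only for illustration; the notebook itself uses a gap of one block.

```python
def expanding_splits_with_gap(n_blocks, gap=1):
    """Yield (train_blocks, gap_blocks, val_block) for an expanding window with a gap.

    Hypothetical helper: block ids are assumed to be 0 .. n_blocks - 1
    and ordered in time.
    """
    for val_block in range(gap + 1, n_blocks):
        # Training stops `gap` blocks before the validation block.
        train_blocks = list(range(val_block - gap))
        # Blocks that are neither trained on nor validated on in this fold.
        gap_blocks = list(range(val_block - gap, val_block))
        yield train_blocks, gap_blocks, val_block


gapped_splits = list(expanding_splits_with_gap(4, gap=1))
for train_blocks, gap_blocks, val_block in gapped_splits:
    print("train", train_blocks, "| gap", gap_blocks, "-> validate", val_block)
```

With 12 blocks and a gap of one block there are 10 folds, one fewer than without the gap, because the first validation block must be block 2.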

In [10]:
results_val = []

for block in range(2, 12):
    print("Train blocks 0-{} - Gap Block {} - Validation Block {}".format(block - 2, block - 1, block))
    
    train = data[data['block'] < block-1]
    val = data[data['block'] == block]
    
    X_train = train.filter(regex=r'feature')
    X_val = val.filter(regex=r'feature')

    y_train = train['target']
    y_val = val['target']
     
    mdl = LGBMRegressor(max_depth=5, num_leaves=2**5, learning_rate=0.01, n_estimators=2000, colsample_bytree=0.1, random_state=0)
    mdl.fit(X_train, y_train)
    
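    # Convert predictions to percentile ranks (ties broken by order of appearance).
    predictions = pd.Series(mdl.predict(X_val))
    ranked_predictions = predictions.rank(pct=True, method="first")
    # Pearson correlation between the target and the ranked predictions;
    # np.corrcoef compares the values by position, not by index.
    correlation = np.corrcoef(y_val, ranked_predictions)[0, 1]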
    
    results_val.append(correlation)
    print("Correlation {}".format(correlation))
    print()

Train blocks 0-0 - Gap Block 1 - Validation Block 2
Correlation 0.0468018477612904

Train blocks 0-1 - Gap Block 2 - Validation Block 3
Correlation 0.04479889032662142

Train blocks 0-2 - Gap Block 3 - Validation Block 4
Correlation 0.04308764828870972

Train blocks 0-3 - Gap Block 4 - Validation Block 5
Correlation 0.02576312120128353

Train blocks 0-4 - Gap Block 5 - Validation Block 6
Correlation 0.02090852253662712

Train blocks 0-5 - Gap Block 6 - Validation Block 7
Correlation 0.04964472276964134

Train blocks 0-6 - Gap Block 7 - Validation Block 8
Correlation 0.032339553609559715

Train blocks 0-7 - Gap Block 8 - Validation Block 9
Correlation 0.05902661603936733

Train blocks 0-8 - Gap Block 9 - Validation Block 10
Correlation 0.038391438450815364

Train blocks 0-9 - Gap Block 10 - Validation Block 11
Correlation 0.03633220010476147



In [11]:
np.median(results_val)

0.04073954336976254

In [12]:
np.min(results_val)

0.02090852253662712

In [13]:
np.max(results_val)

0.05902661603936733

In [14]:
np.mean(results_val)

0.03970945610886774

In [15]:
len(results_val)

10

# End