# Sequences - Practical Class

## LSTM + GRU

In this notebook, we will implement new types of recurrent neural networks (LSTM, GRU, BiLSTM, BiGRU) for the lithology classification dataset.

- **Important:** if you are running this notebook in Google Colab, please run the next cells. Otherwise, simply skip them.

In [None]:
!wget -nc https://raw.githubusercontent.com/bolgebrygg/Force-2020-Machine-Learning-competition/refs/heads/master/lithology_competition/data/hidden_test.csv
!wget -nc https://raw.githubusercontent.com/bolgebrygg/Force-2020-Machine-Learning-competition/refs/heads/master/lithology_competition/data/leaderboard_test_features.csv
!wget -nc https://raw.githubusercontent.com/bolgebrygg/Force-2020-Machine-Learning-competition/refs/heads/master/lithology_competition/data/leaderboard_test_target.csv
!wget -nc https://raw.githubusercontent.com/bolgebrygg/Force-2020-Machine-Learning-competition/refs/heads/master/lithology_competition/data/train.zip


In [None]:
!unzip -o train.zip

Archive:  train.zip
  inflating: train.csv               


In [None]:
data_dir = '/content'

# Environment setup



In [None]:
import os

import json
import pandas as pd
from tqdm import tqdm
import pickle

from sklearn.preprocessing import LabelEncoder
from sklearn.utils.class_weight import compute_class_weight, compute_sample_weight
from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import accuracy_score, confusion_matrix, matthews_corrcoef, f1_score, precision_score, recall_score, ConfusionMatrixDisplay, balanced_accuracy_score

import torch
import random

import numpy as np
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torchinfo import summary

import matplotlib.pyplot as plt
import matplotlib

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Selected device:', device)

random_state = 42

Selected device: cuda


# Loading the dataset

In [None]:
class Force:
    def __init__(self, directory:str, logs:list[str], verbose:bool=False) -> None:
        """
            Arguments:
            ---------
                - directory (str): Path to the directory where data is
                - logs (list[str] or tuple[str]): Logs used from FORCE data
                - verbose (bool): If True, print progress details. Else, does not print anything.
        """

        self.directory = directory
        self.logs = logs

        self.lithology_keys = {30000: 'Sandstone',
                     65030: 'Sandstone/Shale',
                     65000: 'Shale',
                     80000: 'Marl',
                     74000: 'Dolomite',
                     70000: 'Limestone',
                     70032: 'Chalk',
                     88000: 'Halite',
                     86000: 'Anhydrite',
                     99000: 'Tuff',
                     90000: 'Coal',
                     93000: 'Basement'}

        self.data, self.le = self.open_data()

    def open_data(self) -> tuple[pd.DataFrame, LabelEncoder]:
        """
        Main method to open data.
            Arguments:
            ---------
                -
            Return:
            ---------
                - data (pd.DataFrame): Well log dataset fully configured to be used
                - le (LabelEncoder): Label Encoder used to encode lithology classes to consecutive numbers
        """

        train_data = pd.read_csv( os.path.join(self.directory, 'train.csv'), sep=';' )
        hidden_test = pd.read_csv( os.path.join(self.directory, 'hidden_test.csv'), sep=';' )
        leaderboard_test_features = pd.read_csv( os.path.join(self.directory, 'leaderboard_test_features.csv'), sep=';' )
        leaderboard_test_target = pd.read_csv( os.path.join(self.directory, 'leaderboard_test_target.csv'), sep=';' )

        ## A little consistency checking
        leaderboard_test_target['WELL_tg'] = leaderboard_test_target.WELL
        leaderboard_test_target['DEPTH_MD_tg'] = leaderboard_test_target.DEPTH_MD
        leaderboard_test_target.drop(columns=['WELL', 'DEPTH_MD'], inplace=True)
        leaderboard_test = pd.concat([leaderboard_test_features, leaderboard_test_target], axis=1)

        ## Make sure the values for the WELL and DEPTH_MD columns match between the two concatenated data-frames
        _check_well = np.all( (leaderboard_test.WELL == leaderboard_test.WELL_tg).values )
        _check_depth = np.all( (leaderboard_test.DEPTH_MD == leaderboard_test.DEPTH_MD_tg).values )
        assert _check_well and _check_depth, 'Inconsistency found in leaderboard test data...'

        ## Passed the consistency check, we drop the redundant columns
        leaderboard_test.drop(columns=['WELL_tg', 'DEPTH_MD_tg'], inplace=True)

        ## Note leaderboard_test dataframe does not have the FORCE_2020_LITHOFACIES_CONFIDENCE column. We will therefore fill it with NaNs.
        leaderboard_test['FORCE_2020_LITHOFACIES_CONFIDENCE'] = np.nan

        data = pd.concat([train_data, leaderboard_test, hidden_test], axis=0, ignore_index=True)
        data.sort_values(by=['WELL', 'DEPTH_MD'], inplace=True)
        data.reset_index(drop=True, inplace=True)

        data['LITHOLOGY_NAMES'] = data.FORCE_2020_LITHOFACIES_LITHOLOGY.map(self.lithology_keys)
        # Take an explicit copy so the column assignment below does not act on a slice
        data = data[data["LITHOLOGY_NAMES"] != 'Basement'].copy()
        le = LabelEncoder()
        data['LITHOLOGY'] = le.fit_transform(data['FORCE_2020_LITHOFACIES_LITHOLOGY'])

        return data, le

# Preprocessing

In [None]:
def remove_quartiles(original_data:pd.DataFrame, logs:list[str], q:tuple[float, float]=(0.01, 0.99), verbose:bool=True) -> pd.DataFrame:
    """
    Function to apply winsorization (limit outliers by clipping values below the lower quantile or above the upper quantile)
        Arguments:
        ---------
            - original_data (pd.DataFrame): Well log data, including lithology column
            - logs (list[str]): List of log names used. Ex: GR, NPHI, ...
            - q (tuple[float, float]): Lower and upper quantiles used as clipping bounds
            - verbose (bool): If True, print progress details. Else, does not print anything.
        Return:
        ---------
            - data (pd.DataFrame): Well log data with outliers clipped to the quantile bounds.
    """

    data = original_data.copy()
    num_cols = len(logs)

    for i, col in enumerate(logs):
        if verbose:
            print(f'Handling log {i + 1}/{num_cols} - {col}')
        array_data = data[col].values
        only_nans = np.all( np.isnan(array_data) )

        if not only_nans:
            min_quart, max_quart = np.nanquantile(array_data, q=q)

            if verbose:
                print(f'{col}: min: {min_quart:.4f} - max: {max_quart:.4f} ')

            # Count the values outside the quantile bounds (they are clipped below, not set to NaN)
            outlier_idx = np.logical_or(array_data < min_quart, array_data > max_quart)
            if verbose:
                print(f'Ignoring {np.sum(outlier_idx)} values')

            # Set series in dataframe with clipped values
            data[col] = data[col].clip(min_quart, max_quart)

    if verbose:
        print()

    return data
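
The clipping step can be checked on a small example. The sketch below uses a made-up `GR` column (the values and the 10%/90% quantiles are assumptions chosen for illustration, not the notebook's settings) to show that extreme readings are pulled back to the quantile bounds while missing values stay missing.

```python
import numpy as np
import pandas as pd

# Toy log with one missing value and two extreme readings (hypothetical data)
toy = pd.DataFrame({'GR': [10.0, 12.0, 11.0, np.nan, 500.0, 9.0, -300.0, 13.0]})

# Quantile bounds computed while ignoring NaNs, as in remove_quartiles
low, high = np.nanquantile(toy['GR'].values, q=[0.1, 0.9])

# Clipping pulls values outside [low, high] back to the bounds; NaNs stay NaN
clipped = toy['GR'].clip(low, high)

print(f'bounds: [{low:.2f}, {high:.2f}]')
print(clipped.tolist())
```

Because the values are clipped rather than dropped, the number of rows and the depth ordering of each well are preserved, which matters for the sequence extraction later on.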

In [None]:
def open_and_preprocess_data(data_dir:str, logs:list[str], class_col:str, test_size:float, val_size:float, shuffle:bool, random_state:int|np.random.RandomState, verbose:bool=True) -> tuple[pd.DataFrame, LabelEncoder, list, list, list, list]:

    """
    Function that receives all necessary parameters to open and preprocess data and calls all necessary functions, classes and methods.
        Arguments:
        ---------
            - data_dir (str): Path for folder containing dataset.
            - logs (list[str]): List of names of logs used.
            - class_col (str): Name of the label column (usually 'Lithology')
            - test_size (float): Size of test set. Range: 0-1.
            - val_size (float): Size of validation set. Range: 0-1.
            - shuffle (bool): Whether to shuffle or not while data splitting.
            - random_state (int or np.random.RandomState): Random state to define random operations.
            - verbose (bool): If True, print progress details. Else, does not print anything.
        Return:
        ---------
            - data (pd.DataFrame): Well log dataset fully configured to be used
            - le (LabelEncoder): Label Encoder used to encode lithology classes to consecutive numbers
            - well_names (list[str]): List of all well names contained in dataset
            - train_wells (list[str]): List of train wells after splitting
            - val_wells (list[str] or None): List of validation wells after splitting. Can be None if there is no validation split.
            - test_wells (list[str] or None): List of test wells after splitting. Can be None if there is no test split.
    """

    force_dataset = Force(data_dir, logs)
    data, le = force_dataset.data, force_dataset.le

    data = remove_quartiles(data, logs, verbose=verbose)

    well_names = list(data['WELL'].unique())
    train_wells, test_wells = train_test_split(well_names, test_size=test_size, shuffle=shuffle, random_state=random_state)
    train_wells, val_wells = train_test_split(train_wells, test_size=val_size, shuffle=shuffle, random_state=random_state)


    return data, le, well_names, train_wells, val_wells, test_wells

In [None]:
class Config:
    seq_size = 256
    batch_size = 64

    split_form = 'train_val_test' ## kfold, train_test or train_val_test
    n_splits = 5
    test_size = 0.2
    val_size = 0.1
    shuffle = True

    scaling_method = 'standard' # standard or min-max

    data_dir = data_dir
    logs = ['GR', 'RHOB', 'NPHI', 'DTC']
    logs_info = logs + ['WELL', 'DEPTH_MD', 'LITHOLOGY']
    class_col = 'LITHOLOGY'

    num_classes = 11

    verbose = True

cfg = Config()

In [None]:
data, le_data, well_names, train_wells, val_wells, test_wells = open_and_preprocess_data(cfg.data_dir, cfg.logs, cfg.class_col,
                                                                                   cfg.test_size, cfg.val_size, cfg.shuffle,
                                                                                   random_state, verbose=cfg.verbose)

Handling log 1/4 - GR
GR: min: 8.9536 - max: 180.3104 
Ignoring 28592 values
Handling log 2/4 - RHOB
RHOB: min: 1.6196 - max: 2.6975 
Ignoring 24838 values
Handling log 3/4 - NPHI
NPHI: min: 0.0491 - max: 0.6245 
Ignoring 19320 values
Handling log 4/4 - DTC
DTC: min: 60.0862 - max: 173.0303 
Ignoring 26878 values



In [None]:
# Explicit copies avoid pandas' SettingWithCopyWarning when the columns are scaled later
train_data = data[data['WELL'].isin(train_wells)]
X_train = train_data[cfg.logs_info].copy()
y_train = train_data[cfg.class_col]

val_data = data[data['WELL'].isin(val_wells)]
X_val = val_data[cfg.logs_info].copy()
y_val = val_data[cfg.class_col]

test_data = data[data['WELL'].isin(test_wells)]
X_test = test_data[cfg.logs_info].copy()
y_test = test_data[cfg.class_col]

In [None]:
scaler = StandardScaler()

X_train[cfg.logs] = scaler.fit_transform(X_train[cfg.logs])
X_val[cfg.logs] = scaler.transform(X_val[cfg.logs])
X_test[cfg.logs] = scaler.transform(X_test[cfg.logs])



# Dataset and DataLoader

In [None]:
class LithologyDataset(Dataset):
    def __init__(self, df:pd.DataFrame, labels:pd.Series, logs:list[str], num_classes:int, seq_size:int=100, interval_size:int=100, well_name_column:str='WELL', lithology_column:str='LITHOLOGY') -> None:
        """
            Arguments:
            ---------
                - df (pd.DataFrame): Well log data
                - labels (pd.Series): Column containing lithology classes for each depth
                - logs (list[str]): List of logs used. Ex: GR, NPHI, ...
                - num_classes (int): Number of lithology classes
                - seq_size (int): Size of sequence sent to the model
                - interval_size (int): Size of the interval used to extract consecutive sequences
                - well_name_column (str): Name of the column that indicates the well name in the data
                - lithology_column (str): Name of the lithology column
            Return:
            ---------
                None
        """

        self.data = df.copy()  # copy so that adding the 'labels' column does not modify the caller's DataFrame
        self.list_of_wells = list(df[well_name_column].unique())
        self.labels = labels
        self.logs = logs
        self.num_classes = num_classes
        self.seq_size = seq_size
        self.interval_size = interval_size
        self.well_name_column = well_name_column
        self.lithology_column = lithology_column
        self.no_missing_logs = self.logs + [self.lithology_column]

        self.data['labels'] = labels
        self.list_of_sequences = self.__create_dataset(self.data, verbose=False)

    def __create_dataset(self, df:pd.DataFrame, verbose:bool=False) -> list:
        """
            Arguments:
            ---------
                - df (pd.DataFrame): Well log data
            Return:
            ---------
                - list_of_sequences (list): list of all sequences without null values in the dataset
        """

        list_of_sequences = list()

        for wellname in tqdm(self.list_of_wells, disable=(not verbose)):

            well_df = df[df[self.well_name_column] == wellname]

            j=0
            while j < well_df.shape[0]-(self.seq_size-1): # While a sequence of size seq_size still fits in the well

                sequence = well_df.iloc[j:j+self.seq_size]

                # Find the positions of rows with null values inside the sequence
                idx_null = [k for k,x in enumerate(sequence[self.no_missing_logs].values) if np.isnan(x).any()]

                # If the sequence has no null values
                if idx_null == []:
                    list_of_sequences.append([wellname, sequence[self.logs], sequence['labels']])
                    j = j + self.interval_size
                # Otherwise, jump to the index right after the last null value in the sequence
                else:
                    j = j + idx_null[-1] + 1

        return list_of_sequences

    def __len__(self):

        return len(self.list_of_sequences)

    def __getitem__(self, idx) -> tuple[str, torch.Tensor, torch.Tensor]:
        """
            Arguments:
            ---------
                - idx (int): Index for selecting a sample from the dataset
            Return:
            ---------
                - wellname (str): Name of the well from which the sequence is taken
                - well_data_torch (torch.Tensor): Well log sequence
                - labels_torch (torch.Tensor): One-hot-encoded lithology labels sequence
        """

        wellname, sequence, labels = self.list_of_sequences[idx]
        # To numpy
        sequence_numpy = sequence.to_numpy()
        sequence_numpy = np.reshape(sequence_numpy, (-1, len(self.logs)))

        # Create one-hot vector to represent labels
        labels_numpy = np.array([np.array([1. if i==label else 0. for i in range(self.num_classes)]) for label in labels.to_numpy()])

        # To torch
        well_data_torch = torch.from_numpy(sequence_numpy).float()
        labels_torch = torch.from_numpy(labels_numpy).float()

        return wellname, well_data_torch, labels_torch
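
To see which windows the dataset keeps, the loop in `__create_dataset` can be reproduced on a plain NumPy array. This is a simplified sketch: `window_starts` is a hypothetical helper that is not part of the notebook, and the 12-row toy well is invented for illustration.

```python
import numpy as np

def window_starts(values, seq_size, interval_size):
    """Return start indices of NaN-free windows, mirroring the loop in __create_dataset."""
    starts = []
    j = 0
    n = values.shape[0]
    while j < n - (seq_size - 1):
        window = values[j:j + seq_size]
        # Row positions (relative to the window) that contain at least one NaN
        null_rows = np.flatnonzero(np.isnan(window).any(axis=1))
        if null_rows.size == 0:
            starts.append(j)
            j += interval_size
        else:
            # Jump to the row right after the last NaN in this window
            j += int(null_rows[-1]) + 1
    return starts

# Hypothetical well with 12 depth samples and 2 logs; row 5 has a missing reading
toy_well = np.arange(24, dtype=float).reshape(12, 2)
toy_well[5, 0] = np.nan

print(window_starts(toy_well, seq_size=4, interval_size=4))  # [0, 6]
```

With `interval_size` equal to `seq_size`, as in the cells that build the datasets, the windows do not overlap. Rows that cannot complete a full window (here rows 4, 5, 10 and 11) are left out.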

In [None]:
train_dataset = LithologyDataset(X_train, y_train, cfg.logs, cfg.num_classes, seq_size=cfg.seq_size, interval_size=cfg.seq_size)
train_dataloader = DataLoader(train_dataset, batch_size=cfg.batch_size, shuffle=True)

val_dataset = LithologyDataset(X_val, y_val, cfg.logs, cfg.num_classes, seq_size=cfg.seq_size, interval_size=cfg.seq_size)
val_dataloader = DataLoader(val_dataset, batch_size=cfg.batch_size, shuffle=False)

test_dataset = LithologyDataset(X_test, y_test, cfg.logs, cfg.num_classes, seq_size=cfg.seq_size, interval_size=cfg.seq_size)
test_dataloader = DataLoader(test_dataset, batch_size=cfg.batch_size, shuffle=False)



# Models

#### **LSTM class**

1. Implement an LSTM to model long-term dependencies. It must be possible to configure your model as unidirectional or bidirectional.

In [None]:
# Implement your solution here
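
One possible shape for this class is sketched below. It is not the notebook's reference solution: the name `LSTMClassifier`, the single linear output layer and the `num_layers` argument are assumptions. The only constraint taken from the notebook is that the training and evaluation code unpacks `output, hidden = model(x)` and reshapes `output` to `(-1, num_classes)`, so the model returns per-depth class scores together with the recurrent state.

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):  # hypothetical name, not from the notebook
    def __init__(self, input_size, hidden_size, output_size, bidirectional=False, num_layers=1):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers=num_layers,
                            batch_first=True, bidirectional=bidirectional)
        # A bidirectional LSTM concatenates the forward and backward outputs
        num_directions = 2 if bidirectional else 1
        self.fc = nn.Linear(hidden_size * num_directions, output_size)

    def forward(self, x):
        # x: (batch, seq_len, input_size)
        out, hidden = self.lstm(x)   # out: (batch, seq_len, hidden_size * num_directions)
        logits = self.fc(out)        # per-depth class scores: (batch, seq_len, output_size)
        return logits, hidden        # hidden is the tuple (h_n, c_n)

model = LSTMClassifier(input_size=4, hidden_size=64, output_size=11, bidirectional=True)
logits, (h_n, c_n) = model(torch.randn(8, 256, 4))
print(logits.shape)  # torch.Size([8, 256, 11])
print(h_n.shape)     # torch.Size([2, 8, 64])
```

The model returns raw scores without a softmax because `nn.CrossEntropyLoss` applies log-softmax internally.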

#### **GRU class**

2. Implement a GRU. Likewise, the model should be configurable as unidirectional or bidirectional.

In [None]:
# Implement your solution here
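
A GRU version can follow the same pattern. In this sketch the class name `GRUClassifier` and the linear output layer are again assumptions rather than the notebook's solution. The main practical difference from an LSTM is that `nn.GRU` returns a single hidden-state tensor instead of a `(h_n, c_n)` tuple.

```python
import torch
import torch.nn as nn

class GRUClassifier(nn.Module):  # hypothetical name, not from the notebook
    def __init__(self, input_size, hidden_size, output_size, bidirectional=False):
        super().__init__()
        self.gru = nn.GRU(input_size, hidden_size, batch_first=True, bidirectional=bidirectional)
        # Forward and backward outputs are concatenated when bidirectional=True
        self.fc = nn.Linear(hidden_size * (2 if bidirectional else 1), output_size)

    def forward(self, x):
        out, h_n = self.gru(x)       # h_n: (num_directions, batch, hidden_size)
        return self.fc(out), h_n     # logits: (batch, seq_len, output_size)

gru_model = GRUClassifier(input_size=4, hidden_size=64, output_size=11)
gru_logits, gru_hidden = gru_model(torch.randn(8, 256, 4))
print(gru_logits.shape)  # torch.Size([8, 256, 11])
print(gru_hidden.shape)  # torch.Size([1, 8, 64])
```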

# Training

### Evaluation function (`evaluate`)

In [None]:
def evaluate(model, dataloader, print_metrics=False):

    model.eval()
    y_pred_deep = []
    y_test_deep = []
    with torch.no_grad():

        for i, (wellnames, well_data_torch, labels_torch) in enumerate(dataloader):

            well_data_torch = well_data_torch.to(device)
            labels_torch = labels_torch.to(device)

            output, hidden = model(well_data_torch)

            # Reshape to 2D tensor (batch_size * seq_len, num_classes)
            output = output.reshape(-1, cfg.num_classes).detach().cpu().numpy()

            labels_torch = labels_torch.reshape(-1, cfg.num_classes).detach().cpu().numpy()

            test_probs = np.argmax(output, axis=1).tolist()
            test_labels = np.argmax(labels_torch, axis=1).tolist()

            y_pred_deep.extend(test_probs)
            y_test_deep.extend(test_labels)

    y_test = y_test_deep
    y_predict = y_pred_deep

    if print_metrics:
        print(f'Accuracy: {accuracy_score(y_test, y_predict):.2f}')
        print(f'MCC: {matthews_corrcoef(y_test, y_predict):.2f}')
        print(f'Precision: {precision_score(y_test, y_predict, average="macro"):.2f}')
        print(f'Recall: {recall_score(y_test, y_predict, average="macro"):.2f}')
        print(f'F1-Score: {f1_score(y_test, y_predict, average="macro"):.2f}')

    lithology_keys = {30000: 'Sandstone',
                     65030: 'Sandstone/Shale',
                     65000: 'Shale',
                     80000: 'Marl',
                     74000: 'Dolomite',
                     70000: 'Limestone',
                     70032: 'Chalk',
                     88000: 'Halite',
                     86000: 'Anhydrite',
                     99000: 'Tuff',
                     90000: 'Coal',
                     93000: 'Basement'}

    labels = [le_data.inverse_transform([value])[0] for value in np.unique(np.hstack((y_test, y_predict)))]
    label_names = [lithology_keys[label_values] for label_values in labels]

    cm = confusion_matrix(y_test, y_predict, normalize='true')
    conf_mat_norm = np.around(cm.astype('float'), decimals=2)

    disp = ConfusionMatrixDisplay(confusion_matrix=conf_mat_norm, display_labels=label_names)
    disp.plot()
    plt.xticks(rotation=90)
    plt.show()

### Model, loss function and optimizer setup


Model initialization
- Model parameters:
    - Input size (`len(cfg.logs)`): number of input features (well log types).
    - Hidden size (`64`): number of hidden units in the model.
    - Output size (`cfg.num_classes`): number of lithology classes.
    - Bidirectionality (`bidirectional`): whether the network is bidirectional.

Loss function
- Computes class weights with `compute_class_weight` to handle the imbalanced dataset.
- Applies `nn.CrossEntropyLoss` with the computed class weights.

Optimizer
- Uses Adam (`torch.optim.Adam`) over the model's parameters with learning rate `lr`.
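
The effect of the class weights is easier to see on a tiny example. The sketch below uses an invented label vector with three classes (an assumption made for illustration; the notebook has eleven). It relies on `nn.CrossEntropyLoss` accepting class-probability targets, which PyTorch supports from version 1.10 onward and which is the form the one-hot labels from `LithologyDataset` take.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: class 0 is six times more frequent than class 2
y = np.array([0] * 6 + [1] * 3 + [2] * 1)

# 'balanced' gives n_samples / (n_classes * count(class)), so rare classes weigh more
weights = compute_class_weight('balanced', classes=np.unique(y), y=y)
print(weights)

criterion = nn.CrossEntropyLoss(weight=torch.tensor(weights, dtype=torch.float))

logits = torch.randn(len(y), 3)                           # raw scores, no softmax applied
one_hot = torch.eye(3)[torch.as_tensor(y, dtype=torch.long)]  # float one-hot targets
loss = criterion(logits, one_hot)
print(loss.item())
```

Note that the weight vector has one entry per class found in `y`. If a class is absent from the training wells, the vector becomes shorter than `cfg.num_classes` and the loss raises an error, so it is worth checking its length before training.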

3. Complete the code below to create the models: LSTM, GRU, BiLSTM and BiGRU.

In [None]:
LSTM_model = ...

GRU_model = ...

BiLSTM_model = ...

BiGRU_model = ...

class_weights = compute_class_weight('balanced', classes=np.unique(y_train.values), y=y_train.values)
class_weights_torch = torch.tensor(class_weights, dtype=torch.float).to(device)
criterion = nn.CrossEntropyLoss(weight=class_weights_torch, reduction='mean')

### Training loop

- Epoch iteration: iterates over the number of epochs.

    - Batch iteration: loops over `train_dataloader`, fetching batches of well log data and the corresponding labels.

        - Forward pass: computes the model's predictions for the input data.

        - Loss computation: computes the loss with the chosen criterion (here, `nn.CrossEntropyLoss`).

        - Backward pass: computes gradients with `loss.backward()`.

        - Optimization: updates the model parameters with `optimizer.step()` based on the gradients.

    - Logging: prints the mean training loss every 11 epochs and evaluates the model on the validation set every 33 epochs.

In [None]:
def train(model, epochs=100, lr=0.0001):

    # Make sure the model lives on the same device as the batches
    model.to(device)

    # Each model gets its own Adam optimizer, built from that model's parameters
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    for epoch in range(epochs):

        if cfg.verbose:
            print(f'Epoch {epoch+1}/{epochs}', end='\r')

        total_loss = 0
        model.train()

        for i, (wellnames, well_data_torch, labels_torch) in enumerate(train_dataloader):

            well_data_torch = well_data_torch.float().to(device)
            labels_torch = labels_torch.to(device)

            optimizer.zero_grad()

            output, hidden = model(well_data_torch)

            # Reshape to 2D tensor (batch_size * seq_len, num_classes)
            output = output.reshape(-1, cfg.num_classes)

            labels_torch = labels_torch.reshape(-1, cfg.num_classes)

            loss = criterion(output, labels_torch)
            total_loss += loss.item()

            loss.backward()

            optimizer.step()

        # Report the mean training loss every 11 epochs
        if cfg.verbose and epoch % 11 == 0:
            print(f"Epoch {epoch+1}/{epochs} - Training Loss: {(total_loss/len(train_dataloader)):.2f}")

        # Plot the validation confusion matrix every 33 epochs
        if epoch % 33 == 0:
            evaluate(model, val_dataloader)

In [None]:
train(LSTM_model)

In [None]:
train(GRU_model)

In [None]:
train(BiLSTM_model)

In [None]:
train(BiGRU_model)

### Final evaluation

4. Compare the training and inference time of each architecture (LSTM, GRU, BiLSTM and BiGRU). Also compare the number of parameters.

In [None]:
# Implement your solution here
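
As a starting point for this comparison, the sketch below counts trainable parameters and times the forward pass of the bare recurrent layers. The helpers `count_parameters` and `time_forward` are hypothetical names. The sizes (4 input logs, 64 hidden units, sequences of 256 samples) follow the notebook's configuration, and the output linear layer of a full model is left out.

```python
import time
import torch
import torch.nn as nn

def count_parameters(module):
    """Number of trainable parameters."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

def time_forward(module, x, repeats=5):
    """Mean wall-clock seconds per forward pass, without gradient tracking."""
    module.eval()
    with torch.no_grad():
        module(x)  # warm-up call, excluded from the measurement
        start = time.perf_counter()
        for _ in range(repeats):
            module(x)
        return (time.perf_counter() - start) / repeats

x = torch.randn(8, 256, 4)
layers = {
    'LSTM': nn.LSTM(4, 64, batch_first=True),
    'GRU': nn.GRU(4, 64, batch_first=True),
    'BiLSTM': nn.LSTM(4, 64, batch_first=True, bidirectional=True),
    'BiGRU': nn.GRU(4, 64, batch_first=True, bidirectional=True),
}
for name, layer in layers.items():
    print(f'{name}: {count_parameters(layer)} parameters, {time_forward(layer, x):.4f} s per forward pass')
```

This example runs on the CPU. When timing on a GPU, CUDA kernels run asynchronously, so `torch.cuda.synchronize()` should be called before reading the clock. A GRU has three gates against the LSTM's four, which is why its layer holds three quarters as many parameters. The bidirectional variants double the count.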