# 3W Dataset's General Presentation

This notebook gives a general presentation of the 3W Dataset, which is, to the best of its authors' knowledge, the first realistic and public dataset with rare undesirable real events in oil wells that can be readily used as a benchmark for developing machine learning techniques that deal with the inherent difficulties of actual data.

For more information about the theory behind this dataset, refer to the paper **A Realistic and Public Dataset with Rare Undesirable Real Events in Oil Wells** published in the **Journal of Petroleum Science and Engineering** (link [here](https://doi.org/10.1016/j.petrol.2019.106223)).

# 1. Introduction

This Jupyter Notebook presents an overview of the 3W Dataset. To that end, it shows one **interactive plot** of a specific instance of an event class. By default, the instance is downsampled (n=100) and scaled with a Z-score scaler. To aid visualization, transient labels are changed to '0.5'.
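The scaling and label adjustment mentioned above can be sketched on a toy frame. This is only an illustration, not the toolkit's code: the column values are made up, and the rule that transient labels equal 100 plus the event class code (`TRANSIENT_OFFSET`) is an assumption that should be checked against the dataset documentation.

```python
import pandas as pd

# Assumption: a transient label is encoded as 100 + event class code (e.g. 107).
TRANSIENT_OFFSET = 100


def zscore(series: pd.Series) -> pd.Series:
    """Scale a series to zero mean and unit (population) standard deviation."""
    return (series - series.mean()) / series.std(ddof=0)


def soften_transients(labels: pd.Series) -> pd.Series:
    """Replace every transient label with 0.5 so it fits the plot's scale."""
    return labels.astype(float).where(labels < TRANSIENT_OFFSET, 0.5)


# Hypothetical five-observation instance, for illustration only.
toy = pd.DataFrame(
    {
        "P-TPT": [1.0, 2.0, 3.0, 4.0, 5.0],
        "class": [0, 0, 107, 107, 7],
    }
)

scaled = zscore(toy["P-TPT"])
plot_labels = soften_transients(toy["class"])
```

After scaling, sensors with very different magnitudes (pressures in Pa, valve states between 0 and 1) can share one vertical axis, which is presumably why the plot applies it.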

# 2. Imports and Configurations

In [1]:
import warnings

warnings.simplefilter("ignore", FutureWarning)

import sys
import os

sys.path.append(os.path.join("..", ".."))
import toolkit as tk

import plotly.offline as py
import plotly.graph_objs as go
import glob
import pandas as pd
import matplotlib.pyplot as plt
from ydata_profiling import ProfileReport

%matplotlib inline
%config InlineBackend.figure_format = 'svg'



Each instance is stored in a [Parquet file](https://parquet.apache.org/docs/) and loaded into a pandas DataFrame as follows:

* All Parquet files are created and read with pandas functions, using the `pyarrow` engine and `brotli` compression;
* For each instance, the timestamps of the observations are stored as the index of the Parquet file and loaded as the index of the pandas DataFrame;
* Each observation is stored as a row of the Parquet file and loaded as a row of the pandas DataFrame;
* All variables are stored as float columns in the Parquet files and loaded as float columns of the pandas DataFrame;
* All labels are stored as `Int64` (not `int64`) columns in the Parquet files and loaded as `Int64` (not `int64`) columns of the pandas DataFrame.
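The conventions above can be illustrated with a small hand-made frame. This is a sketch under assumptions: the values, the index name `timestamp`, and the helper `roundtrip` are hypothetical, and `roundtrip` is defined but not called because it needs `pyarrow` and a writable path.

```python
import numpy as np
import pandas as pd

# Hypothetical three-observation instance; the index name is an assumption.
timestamps = pd.date_range("2020-01-01", periods=3, freq="s", name="timestamp")
instance = pd.DataFrame(
    {
        "P-PDG": np.array([1.0e7, np.nan, 1.2e7]),      # variables are float
        "class": pd.array([0, None, 1], dtype="Int64"),  # labels are nullable Int64
    },
    index=timestamps,
)


def roundtrip(df: pd.DataFrame, path: str) -> pd.DataFrame:
    """Write and re-read a frame with the engine and compression listed above."""
    df.to_parquet(path, engine="pyarrow", compression="brotli")
    return pd.read_parquet(path, engine="pyarrow")


# The capital-I dtype can hold missing labels; NumPy's int64 cannot.
missing_labels = int(instance["class"].isna().sum())
try:
    instance["class"].astype("int64")
    int64_cast_fails = False
except (ValueError, TypeError):
    int64_cast_fails = True
```

The distinction between `Int64` and `int64` matters because an unlabeled observation stays an integer-typed missing value instead of forcing the whole column to float.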

The variables and labels are as follows:

* **ABER-CKGL**: Opening of the GLCK (gas lift choke) [%];
* **ABER-CKP**: Opening of the PCK (production choke) [%];
* **ESTADO-DHSV**: State of the DHSV (downhole safety valve) [0, 0.5, or 1];
* **ESTADO-M1**: State of the PMV (production master valve) [0, 0.5, or 1];
* **ESTADO-M2**: State of the AMV (annulus master valve) [0, 0.5, or 1];
* **ESTADO-PXO**: State of the PXO (pig-crossover) valve [0, 0.5, or 1];
* **ESTADO-SDV-GL**: State of the gas lift SDV (shutdown valve) [0, 0.5, or 1];
* **ESTADO-SDV-P**: State of the production SDV (shutdown valve) [0, 0.5, or 1];
* **ESTADO-W1**: State of the PWV (production wing valve) [0, 0.5, or 1];
* **ESTADO-W2**: State of the AWV (annulus wing valve) [0, 0.5, or 1];
* **ESTADO-XO**: State of the XO (crossover) valve [0, 0.5, or 1];
* **P-ANULAR**: Pressure in the well annulus [Pa];
* **P-JUS-BS**: Downstream pressure of the SP (service pump) [Pa];
* **P-JUS-CKGL**: Downstream pressure of the GLCK (gas lift choke) [Pa];
* **P-JUS-CKP**: Downstream pressure of the PCK (production choke) [Pa];
* **P-MON-CKGL**: Upstream pressure of the GLCK (gas lift choke) [Pa];
* **P-MON-CKP**: Upstream pressure of the PCK (production choke) [Pa];
* **P-MON-SDV-P**: Upstream pressure of the production SDV (shutdown valve) [Pa];
* **P-PDG**: Pressure at the PDG (permanent downhole gauge) [Pa];
* **PT-P**: Downstream pressure of the PWV (production wing valve) in the production tube [Pa];
* **P-TPT**: Pressure at the TPT (temperature and pressure transducer) [Pa];
* **QBS**: Flow rate at the SP (service pump) [m³/s];
* **QGL**: Gas lift flow rate [m³/s];
* **T-JUS-CKP**: Downstream temperature of the PCK (production choke) [°C];
* **T-MON-CKP**: Upstream temperature of the PCK (production choke) [°C];
* **T-PDG**: Temperature at the PDG (permanent downhole gauge) [°C];
* **T-TPT**: Temperature at the TPT (temperature and pressure transducer) [°C];
* **class**: Label of the observation;
* **state**: Well operational status.

Other information is also loaded into each pandas DataFrame:

* **label**: instance label (event type);
* **well**: well name. Hand-drawn and simulated instances have fixed names. Real instances have names masked with incremental id;
* **id**: instance identifier. Hand-drawn and simulated instances have incremental id. Each real instance has an id generated from its first timestamp.

More information about these variables can be obtained from the following publicly available documents:

* ***Option in Portuguese***: R.E.V. Vargas. Base de dados e benchmarks para prognóstico de anomalias em sistemas de elevação de petróleo. Universidade Federal do Espírito Santo. Doctoral thesis. 2019. https://github.com/petrobras/3W/raw/main/docs/doctoral_thesis_ricardo_vargas.pdf.
* ***Option in English***: B.G. Carvalho. Evaluating machine learning techniques for detection of flow instability events in offshore oil wells. Universidade Federal do Espírito Santo. Master's degree dissertation. 2021. https://github.com/petrobras/3W/raw/main/docs/master_degree_dissertation_bruno_carvalho.pdf.

# 3. Plot Instances

Plot one interactive graph for a specific event class and instance.

In [None]:
class_number = 7
instance_index = 5
resample_factor = 100
tk.plot_instance(class_number, instance_index, resample_factor)

# 4. Profiling Report

In this part, we generate a complete interactive HTML report from the dataset. It gives a full view of one event class of the 3W Dataset: the number of rows, number of columns (variables), number of missing values (null cells, NaNs), duplicate rows, size, and the types of the variables. The tool also provides statistics, histograms, interactions, and correlations.

The Warnings field of the report flags issues that call for care when analyzing the dataset. This makes it possible to assess whether the data needs some initial treatment before exploration starts.
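Some of the checks the report performs can be approximated directly with pandas when a quick look is enough. The sketch below is a lightweight stand-in, not the profiling tool itself; the function `quick_checks` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def quick_checks(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary: dtype, share of missing values, distinct values."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "missing_ratio": df.isna().mean(),
            "n_unique": df.nunique(dropna=True),
        }
    )


# Hypothetical frame with one empty column and one constant column.
toy = pd.DataFrame(
    {
        "P-PDG": [1.0, 1.0, np.nan, 2.0],
        "T-TPT": [np.nan, np.nan, np.nan, np.nan],
        "QGL": [0.0, 0.0, 0.0, 0.0],
    }
)

summary = quick_checks(toy)
n_duplicates = int(toy.duplicated().sum())  # rows identical to an earlier row
```

A column with `missing_ratio` equal to 1 or `n_unique` equal to 1 carries no information for that class and is a natural candidate for the initial treatment mentioned above.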

The original sampling rate is 1 Hz. For some 3W classes, the large number of samples exceeds the maximum size the report allows, so we reduce the sampling rate. The parameter that determines the new rate is `resample_factor`. In this case, we downsample by a factor of 100. To visualize the original data, use `resample_factor = 1`, but there is no guarantee that the report will be generated.
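To make the effect of the factor concrete, the sketch below downsamples a synthetic 1 Hz series by averaging windows of `n` seconds. It is one plausible way to reduce the rate and is not claimed to match what `tk.resample` does internally; the function `downsample` and the data are hypothetical.

```python
import numpy as np
import pandas as pd


def downsample(df: pd.DataFrame, n: int) -> pd.DataFrame:
    """Average each window of n seconds into a single observation."""
    return df.resample(f"{n}s").mean()


# Synthetic 1 Hz signal: 1000 observations, one per second.
index = pd.date_range("2020-01-01", periods=1000, freq="s")
one_hz = pd.DataFrame({"P-TPT": np.arange(1000, dtype=float)}, index=index)

reduced = downsample(one_hz, 100)  # 1 observation per 100 s, i.e. 0.01 Hz
```

With a factor of 100, the 1000 observations shrink to 10. Averaging suits continuous variables; label columns would need a different rule (for example, the most frequent value in each window), which this sketch does not cover.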

In [None]:
class_number = 2
resample_factor = 100
df_all_instances_class = pd.concat(
    [
        tk.resample(pd.read_parquet(f, engine="pyarrow"), resample_factor, class_number)
        for f in glob.glob(
            os.path.join(tk.PATH_DATASET, str(class_number), "*.parquet")
        )
    ],
    ignore_index=True,
)

Generate the Profile Report

In [None]:
# Build the report file name once and reuse it for saving and logging
report_file = tk.EVENT_NAMES[class_number].replace(" ", "") + "DataReport.html"
profile = df_all_instances_class.profile_report(
    title=tk.EVENT_NAMES[class_number] + " Profiling Report"
)
profile.to_file(report_file)
print("Generated Profiling Report: " + report_file)

Open the interactive report in a new browser tab

In [None]:
import webbrowser

webbrowser.open_new_tab(
    tk.EVENT_NAMES[class_number].replace(" ", "") + "DataReport.html"
)