<div style="background-color: yellow; padding: 18px;">
    <h1>Data Science Challenge | Data & Analytics Team</h1>
</div>

<div style="background-color: lightblue; padding: 10px;">
    <h2>Case 3 - Failure Prediction</h2>
</div>
    

**Description**

The Full warehouses of Mercado Libre have a fleet of devices that transmit daily telemetry aggregated across several attributes.

Predictive maintenance techniques are designed to help determine the condition of in-service equipment in order to predict when maintenance should be performed. This approach promises cost savings over routine or time-based preventive maintenance because tasks are performed only when they are warranted.

Your task is to produce a Jupyter notebook with a predictive model that estimates the probability of device failure, with the goal of lowering the cost of the process. As a reference, a device failure has a cost of 1, while a maintenance action has a cost of 0.5. The file "full_devices.csv" contains the daily values of the 9 device attributes, and the column to predict is called 'failure', with the binary value 0 for no failure and 1 for failure.
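
The two costs in the statement can be turned into a single number for comparing models. The sketch below is one possible reading, not part of the challenge: it assumes maintenance is performed on every device flagged as a failure and that maintenance always prevents the failure. The function name `total_cost` and the toy labels are hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

FAILURE_COST = 1.0      # cost of an unanticipated failure (from the statement)
MAINTENANCE_COST = 0.5  # cost of one maintenance action (from the statement)

def total_cost(y_true, y_pred):
    """Total cost of acting on the predictions.

    Assumption (not stated in the challenge): every predicted failure
    triggers maintenance, and maintenance always prevents the failure.
    """
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    # Missed failures cost 1 each; every flagged device costs 0.5.
    return fn * FAILURE_COST + (tp + fp) * MAINTENANCE_COST

# Toy labels, for illustration only
y_true = np.array([0, 0, 0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 1, 0, 0])
cost = total_cost(y_true, y_pred)
print(cost)
```

Under these assumptions a false negative costs twice as much as a false positive, which is why recall on the failure class matters more than raw accuracy here.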
    
___

# Adopted strategy:

- Run a basic analysis of the data
- Engineer features and prepare the data
- Build models
- Evaluate the models
- Balance the classes
    - Resampling
- Optimize the models
- Conclusion


## Load, analyze, and clean the data

In [1]:
# Import the required libraries

import numpy as np
import pandas as pd
import datetime as dt

import matplotlib.pyplot as plt
import seaborn as sns

# import plotly.express as px
# from plotly.offline import init_notebook_mode, iplot

from sklearn.metrics import (ConfusionMatrixDisplay, classification_report, precision_recall_curve, 
                             PrecisionRecallDisplay, RocCurveDisplay,
                             recall_score, make_scorer, roc_auc_score)

In [2]:
df = pd.read_csv("files/full_devices.csv", encoding="latin1")

In [3]:
df

Unnamed: 0,date,device,failure,attribute1,attribute2,attribute3,attribute4,attribute5,attribute6,attribute7,attribute8,attribute9
0,2015-01-01,S1F01085,0,215630672,56,0,52,6,407438,0,0,7
1,2015-01-01,S1F0166B,0,61370680,0,3,0,6,403174,0,0,0
2,2015-01-01,S1F01E6Y,0,173295968,0,0,0,12,237394,0,0,0
3,2015-01-01,S1F01JE0,0,79694024,0,0,0,6,410186,0,0,0
4,2015-01-01,S1F01R2B,0,135970480,0,0,0,15,313173,0,0,3
...,...,...,...,...,...,...,...,...,...,...,...,...
124489,2015-11-02,Z1F0MA1S,0,18310224,0,0,0,10,353705,8,8,0
124490,2015-11-02,Z1F0Q8RT,0,172556680,96,107,4,11,332792,0,0,13
124491,2015-11-02,Z1F0QK05,0,19029120,4832,0,0,11,350410,0,0,0
124492,2015-11-02,Z1F0QL3N,0,226953408,0,0,0,12,358980,0,0,0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 124494 entries, 0 to 124493
Data columns (total 12 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   date        124494 non-null  object
 1   device      124494 non-null  object
 2   failure     124494 non-null  int64 
 3   attribute1  124494 non-null  int64 
 4   attribute2  124494 non-null  int64 
 5   attribute3  124494 non-null  int64 
 6   attribute4  124494 non-null  int64 
 7   attribute5  124494 non-null  int64 
 8   attribute6  124494 non-null  int64 
 9   attribute7  124494 non-null  int64 
 10  attribute8  124494 non-null  int64 
 11  attribute9  124494 non-null  int64 
dtypes: int64(10), object(2)
memory usage: 11.4+ MB


In [5]:
# Basic summary statistics for the numeric features
df.describe()

Unnamed: 0,failure,attribute1,attribute2,attribute3,attribute4,attribute5,attribute6,attribute7,attribute8,attribute9
count,124494.0,124494.0,124494.0,124494.0,124494.0,124494.0,124494.0,124494.0,124494.0,124494.0
mean,0.000851,122388100.0,159.484762,9.940455,1.74112,14.222669,260172.657726,0.292528,0.292528,12.451524
std,0.029167,70459330.0,2179.65773,185.747321,22.908507,15.943028,99151.078547,7.436924,7.436924,191.425623
min,0.0,0.0,0.0,0.0,0.0,1.0,8.0,0.0,0.0,0.0
25%,0.0,61284760.0,0.0,0.0,0.0,8.0,221452.0,0.0,0.0,0.0
50%,0.0,122797400.0,0.0,0.0,0.0,10.0,249799.5,0.0,0.0,0.0
75%,0.0,183309600.0,0.0,0.0,0.0,12.0,310266.0,0.0,0.0,0.0
max,1.0,244140500.0,64968.0,24929.0,1666.0,98.0,689161.0,832.0,832.0,18701.0


The attributes are on very different scales (attribute1 reaches about 2.4e8, while attribute5 never exceeds 98). If we use distance-based models, the attributes will need to be normalized or standardized first.
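
A minimal sketch of standardization, assuming scikit-learn's `StandardScaler` is an acceptable choice (the notebook does not say which scaler is used). The toy frame reuses a few values from the first rows shown above; the name `toy` is hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Two attributes on very different scales (values taken from the head of the data)
toy = pd.DataFrame({
    "attribute1": [215630672, 61370680, 173295968, 79694024],
    "attribute5": [6, 6, 12, 6],
})

# Rescale each column to zero mean and unit (population) variance
scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(toy), columns=toy.columns, index=toy.index
)
print(scaled.round(3))
```

In a real pipeline the scaler should be fitted on the training split only and then applied to the test split, to avoid leaking information.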

In [11]:
# Basic summary statistics for the categorical features
df.describe(include='O')

Unnamed: 0,date,device
count,124493,124493
unique,304,1169
top,2015-01-01,Z1F0QLC1
freq,1163,304


In [7]:
# Drop duplicate rows (1 record)
df = df.drop_duplicates()

In [8]:
df.isna().sum()

date          0
device        0
failure       0
attribute1    0
attribute2    0
attribute3    0
attribute4    0
attribute5    0
attribute6    0
attribute7    0
attribute8    0
attribute9    0
dtype: int64

In [9]:
df.date.min(), df.date.max()

('2015-01-01', '2015-11-02')

In [10]:
# Target variable: failure
df.failure.value_counts(dropna=False)

failure
0    124387
1       106
Name: count, dtype: int64

In [40]:
print(f"Data covers the period from {df.date.min()} to {df.date.max()}")

# Computed outside the f-strings so the cell also runs on Python < 3.12,
# which does not allow line breaks inside f-string replacement fields
n_failures = df.failure.value_counts(dropna=False)[1]
failure_pct = round(df.failure.value_counts(dropna=False, normalize=True)[1] * 100, 4)
print(f"{len(df)} records from {df.device.nunique()} devices on {df.date.nunique()} distinct days")
print(f"There are {n_failures} failures, i.e. {failure_pct}% of cases")

Data covers the period from 2015-01-01 to 2015-11-02
124494 records from 1169 devices on 304 distinct days
There are 106 failures, i.e. 0.0851% of cases


**The target variable, failure, is severely imbalanced: failures account for fewer than 0.1% of the records.**
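
One common way to compensate for this kind of imbalance, alongside resampling, is to weight the classes inversely to their frequency. The sketch below uses scikit-learn's `compute_class_weight` on a toy label vector; the 990/10 split is illustrative and is not the real class ratio, and nothing here implies this is the approach the notebook ends up using.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 990 non-failures and 10 failures (illustrative ratio only)
y = np.array([0] * 990 + [1] * 10)
classes = np.array([0, 1])

# "balanced" weights are n_samples / (n_classes * count_per_class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weight = dict(zip(classes.tolist(), weights.tolist()))
print(class_weight)
```

A dictionary like this can be passed as the `class_weight` argument of many scikit-learn classifiers, such as `LogisticRegression` or `RandomForestClassifier`.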

In [42]:
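# Convert the date to datetime and set it as the index to simplify plotting.
# The guard makes the cell safe to re-run: once 'date' is the index it is no
# longer a column, and a second run would otherwise raise KeyError: 'date'.

if 'date' in df.columns:
    df['date'] = pd.to_datetime(df['date'])
    df = df.set_index('date').sort_index()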

In [10]:
attributes = [col for col in df.columns if "attribute" in col]