## Clustering

##### By: Marcio Carvalho

### Business Problem:
ANEEL (regulatory agency).

We know consumption by state, but we do not know the customer profile. We need to group customers by consumption level, regardless of where they live, because we want to run a targeted marketing campaign.

We need to build a predictive model (Máquina Preditiva, MP) from customer energy consumption. It should group consumers by similarity in order to understand customer behavior and how it relates to energy consumption.

* This dataframe is used for teaching purposes only: each row is treated as a different consumer so that the data fits the business problem.
    
### Source
   https://archive.ics.uci.edu/dataset/235/individual+household+electric+power+consumption

### Dataset:

1. date: date in dd/mm/yyyy format

2. time: time in hh:mm:ss format

3. global_active_power: household global minute-averaged active power (in kilowatts)

4. global_reactive_power: household global minute-averaged reactive power (in kilowatts)

5. voltage: minute-averaged voltage (in volts)

6. global_intensity: household global minute-averaged current intensity (in amperes)

7. sub_metering_1: energy sub-metering No. 1 (in watt-hours of active energy). It corresponds to the kitchen, containing mainly a dishwasher, an oven and a microwave (the hot plates are gas powered, not electric).

8. sub_metering_2: energy sub-metering No. 2 (in watt-hours of active energy). It corresponds to the laundry room, containing a washing machine, a tumble dryer, a refrigerator and a light.

9. sub_metering_3: energy sub-metering No. 3 (in watt-hours of active energy). It corresponds to an electric water heater and an air conditioner.
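The units in this list differ: the power columns are minute averages in kilowatts, while the sub-meterings are watt-hours consumed per minute. The dataset's UCI page notes that `global_active_power*1000/60 - sub_metering_1 - sub_metering_2 - sub_metering_3` gives the active energy (in watt-hours) consumed each minute by equipment not covered by the three sub-meters. A minimal sketch of that conversion follows; the function name `unmetered_energy_wh` is hypothetical and not part of the notebook.

```python
def unmetered_energy_wh(global_active_power_kw, sub_1_wh, sub_2_wh, sub_3_wh):
    """Active energy (Wh) used in one minute outside the three sub-metered circuits."""
    # kW averaged over one minute -> Wh consumed in that minute
    total_wh = global_active_power_kw * 1000 / 60
    return total_wh - sub_1_wh - sub_2_wh - sub_3_wh


# Values of the reading taken on 16/12/2006 at 17:24:00
rest_wh = unmetered_energy_wh(4.216, 0.0, 1.0, 17.0)
print(round(rest_wh, 2))  # 52.27
```

Keeping the conversion in one place makes it harder to mix kilowatts and watt-hours when the columns are later compared or combined.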
      

In [1]:
# Import libraries
import pandas as pd
import numpy as np
from datetime import datetime
import warnings
warnings.filterwarnings("ignore")



In [15]:
# Load the data and rename the features to clearer, more objective Portuguese names
# (raw string so the backslashes in the Windows path are not treated as escape sequences)
df = pd.read_table(r'..\data\household_power_consumption.txt', sep=';')
df.columns = ['Data','Tempo','Potencia_ativa_total','Potencia_reativa_total', 'Tensao', 'Corrente_total', 'Subdivisao_cozinha', 'Subdivisao_lavanderia', 'Subdivisao_aquecedor_Ar']
df.head()

Unnamed: 0,Data,Tempo,Potencia_ativa_total,Potencia_reativa_total,Tensao,Corrente_total,Subdivisao_cozinha,Subdivisao_lavanderia,Subdivisao_aquecedor_Ar
0,16/12/2006,17:24:00,4.216,0.418,234.84,18.4,0.0,1.0,17.0
1,16/12/2006,17:25:00,5.36,0.436,233.63,23.0,0.0,1.0,16.0
2,16/12/2006,17:26:00,5.374,0.498,233.29,23.0,0.0,2.0,17.0
3,16/12/2006,17:27:00,5.388,0.502,233.74,23.0,0.0,1.0,17.0
4,16/12/2006,17:28:00,3.666,0.528,235.68,15.8,0.0,1.0,17.0
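The raw file marks missing measurements with `?`, which this notebook replaces in a later cell. As an alternative to cleaning after loading, `pd.read_csv` accepts an `na_values` argument so the placeholder becomes `NaN` at load time. The sketch below uses an in-memory sample rather than the real file: the first data row is the first reading shown above, and the second row is invented to illustrate a missing reading.

```python
import io
import pandas as pd

# Illustrative sample in the same ';'-separated layout; the second row is invented
sample = io.StringIO(
    "Date;Time;Global_active_power;Global_reactive_power;Voltage;"
    "Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3\n"
    "16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000\n"
    "16/12/2006;17:25:00;?;?;?;?;?;?;\n"
)

# na_values='?' turns the placeholder into NaN, so the columns are parsed as float64
df_sample = pd.read_csv(sample, sep=';', na_values='?')
print(df_sample.dtypes)
print(df_sample.isnull().sum())
```

With this option the measurement columns arrive as `float64` directly, so only the date and time columns would still need converting.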


In [31]:
# Work on a copy so the original dataframe stays untouched
df_cop = df.copy()
df_cop

Unnamed: 0,Data,Tempo,Potencia_ativa_total,Potencia_reativa_total,Tensao,Corrente_total,Subdivisao_cozinha,Subdivisao_lavanderia,Subdivisao_aquecedor_Ar
0,16/12/2006,17:24:00,4.216,0.418,234.840,18.400,0.000,1.000,17.0
1,16/12/2006,17:25:00,5.360,0.436,233.630,23.000,0.000,1.000,16.0
2,16/12/2006,17:26:00,5.374,0.498,233.290,23.000,0.000,2.000,17.0
3,16/12/2006,17:27:00,5.388,0.502,233.740,23.000,0.000,1.000,17.0
4,16/12/2006,17:28:00,3.666,0.528,235.680,15.800,0.000,1.000,17.0
...,...,...,...,...,...,...,...,...,...
2075254,26/11/2010,20:58:00,0.946,0.0,240.43,4.0,0.0,0.0,0.0
2075255,26/11/2010,20:59:00,0.944,0.0,240.0,4.0,0.0,0.0,0.0
2075256,26/11/2010,21:00:00,0.938,0.0,239.82,3.8,0.0,0.0,0.0
2075257,26/11/2010,21:01:00,0.934,0.0,239.7,3.8,0.0,0.0,0.0


In [32]:
df_cop.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075259 entries, 0 to 2075258
Data columns (total 9 columns):
 #   Column                   Dtype  
---  ------                   -----  
 0   Data                     object 
 1   Tempo                    object 
 2   Potencia_ativa_total     object 
 3   Potencia_reativa_total   object 
 4   Tensao                   object 
 5   Corrente_total           object 
 6   Subdivisao_cozinha       object 
 7   Subdivisao_lavanderia    object 
 8   Subdivisao_aquecedor_Ar  float64
dtypes: float64(1), object(8)
memory usage: 142.5+ MB
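Eight of the nine columns come back as `object` even though the measurements are numeric, which usually means at least one value in each column could not be parsed as a number. One way to find the offending tokens is to coerce with `pd.to_numeric(errors='coerce')` and inspect the values that became `NaN`. A small sketch on a hypothetical series (the name `coluna` and its values are for illustration only):

```python
import pandas as pd

# Hypothetical text column: numeric strings plus a '?' placeholder
coluna = pd.Series(['4.216', '5.360', '?', '3.666'])

# Unparseable entries become NaN instead of raising an error
convertida = pd.to_numeric(coluna, errors='coerce')

# Entries that were present in the text but failed to parse
invalidos = coluna[convertida.isna() & coluna.notna()]
print(invalidos.unique())
```

Running the same check on each `object` column of the real dataframe would show which placeholder tokens have to be replaced before casting to float.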


In [33]:
# Replace the '?' placeholder with NaN, parse the date and time columns,
# and cast every measurement column to float
df_cop = df_cop.replace('?', np.nan)
df_cop['Data'] = pd.to_datetime(df_cop['Data'], format='%d/%m/%Y')
df_cop['Tempo'] = pd.to_datetime(df_cop['Tempo'], format='%H:%M:%S')
colunas_numericas = df_cop.columns.drop(['Data', 'Tempo'])
df_cop[colunas_numericas] = df_cop[colunas_numericas].astype(float)
df_cop.head()

Unnamed: 0,Data,Tempo,Potencia_ativa_total,Potencia_reativa_total,Tensao,Corrente_total,Subdivisao_cozinha,Subdivisao_lavanderia,Subdivisao_aquecedor_Ar
0,2006-12-16,1900-01-01 17:24:00,4.216,0.418,234.84,18.4,0.0,1.0,17.0
1,2006-12-16,1900-01-01 17:25:00,5.36,0.436,233.63,23.0,0.0,1.0,16.0
2,2006-12-16,1900-01-01 17:26:00,5.374,0.498,233.29,23.0,0.0,2.0,17.0
3,2006-12-16,1900-01-01 17:27:00,5.388,0.502,233.74,23.0,0.0,1.0,17.0
4,2006-12-16,1900-01-01 17:28:00,3.666,0.528,235.68,15.8,0.0,1.0,17.0
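Because `Tempo` holds only a time of day, `pd.to_datetime` fills in the default date `1900-01-01`, as the output above shows. If a single timestamp per reading is more convenient, the two text columns can be joined before parsing. This is a sketch under that assumption; the column name `Data_hora` and the frame `bruto` are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Two readings in the raw text format of the file
bruto = pd.DataFrame({
    'Data': ['16/12/2006', '16/12/2006'],
    'Tempo': ['17:24:00', '17:25:00'],
})

# Join date and time as text, then parse once with an explicit format
bruto['Data_hora'] = pd.to_datetime(
    bruto['Data'] + ' ' + bruto['Tempo'],
    format='%d/%m/%Y %H:%M:%S',
)
print(bruto['Data_hora'])
```

A combined timestamp keeps the real date attached to each reading, and components such as the hour remain available through the `.dt` accessor.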


In [34]:
df_cop.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075259 entries, 0 to 2075258
Data columns (total 9 columns):
 #   Column                   Dtype         
---  ------                   -----         
 0   Data                     datetime64[ns]
 1   Tempo                    datetime64[ns]
 2   Potencia_ativa_total     float64       
 3   Potencia_reativa_total   float64       
 4   Tensao                   float64       
 5   Corrente_total           float64       
 6   Subdivisao_cozinha       float64       
 7   Subdivisao_lavanderia    float64       
 8   Subdivisao_aquecedor_Ar  float64       
dtypes: datetime64[ns](2), float64(7)
memory usage: 142.5 MB


In [35]:
df_cop.isnull().sum()

Data                           0
Tempo                          0
Potencia_ativa_total       25979
Potencia_reativa_total     25979
Tensao                     25979
Corrente_total             25979
Subdivisao_cozinha         25979
Subdivisao_lavanderia      25979
Subdivisao_aquecedor_Ar    25979
dtype: int64
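Every measurement column is missing the same 25979 values, roughly 1.25% of the 2075259 rows, which suggests that whole readings are absent rather than isolated cells. A minimal sketch of how that could be checked and handled, using a toy frame with invented values. Dropping the rows is only one option (interpolation is another) and is not necessarily what the notebook does next.

```python
import numpy as np
import pandas as pd

# Share of rows with missing measurements, from the counts printed above
percentual = round(25979 / 2075259 * 100, 2)
print(percentual)  # 1.25

# Toy frame (invented values) where one reading is missing entirely
toy = pd.DataFrame({
    'Potencia_ativa_total': [4.216, np.nan, 5.374],
    'Tensao': [234.84, np.nan, 233.29],
})

# Rows where every measurement column is NaN
linhas_vazias = toy.isnull().all(axis=1)
print(linhas_vazias.sum())

# Drop those rows and rebuild the index
toy_limpo = toy.dropna(how='all').reset_index(drop=True)
print(len(toy_limpo))
```

On the real dataframe the `Data` and `Tempo` columns are never null, so the same check would need to be restricted to the measurement columns, for example with `dropna(subset=...)`.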