# 1. Data preparation: Air Quality time series

**Objective:** Cleaning and exploring a time series


The dataset contains 9358 instances of hourly averaged responses from an array of 5 metal oxide chemical sensors embedded in an air quality chemical multisensor device.
https://archive.ics.uci.edu/ml//datasets/Air+quality

**Feature information**
* 0 Date (DD.MM.YYYY)
* 1 Time (HH.MM.SS)
* 2 Hourly CO concentration in mg/m^3 (reference analyzer)
* 3 PT08.S1 (tin oxide) hourly averaged sensor response (nominally CO targeted)
* 4 True hourly averaged overall Non-Methane Hydrocarbons concentration in microg/m^3 (reference analyzer)
* 5 True hourly averaged Benzene concentration in microg/m^3 (reference analyzer)
* 6 PT08.S2 (titania) hourly averaged sensor response (nominally NMHC targeted)
* 7 True hourly averaged NOx concentration in ppb (reference analyzer)
* 8 PT08.S3 (tungsten oxide) hourly averaged sensor response (nominally NOx targeted)
* 9 True hourly averaged NO2 concentration in microg/m^3 (reference analyzer)
* 10 PT08.S4 (tungsten oxide) hourly averaged sensor response (nominally NO2 targeted)
* 11 PT08.S5 (indium oxide) hourly averaged sensor response (nominally O3 targeted)
* 12 Temperature in °C
* 13 Relative Humidity (%)
* 14 AH Absolute Humidity

**Number of instances:** 9358

**Number of attributes:** 14

**Dependent variable:** Temperature
 

# 2. Accessing Google Drive


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# 3. Importing libraries

In [None]:
import os
import pandas as pd
import plotly.express as px
import plotly.io as pio
import plotly.graph_objects as go
import seaborn as sns
import matplotlib.pyplot as plt

# 4. Reading and viewing the data

In [None]:
path = r'/content/drive/Shareddrives/Data Science para Geociencias/2. Preparación de los datos'
name = 'AirQualityUCI.xlsx'

Concatenating the path and the file name

In [None]:
path_file = os.path.join(path, name)
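
As a small illustration of what `os.path.join` does, the sketch below joins a folder and a file name. Both `example_folder` and `example_name` are hypothetical placeholders, not the course's Drive path.

```python
import os

# Hypothetical folder and file name, used only for illustration.
example_folder = os.path.join("data", "raw")
example_name = "example.xlsx"

# os.path.join inserts the separator that the operating system expects.
example_path = os.path.join(example_folder, example_name)

# The joined path ends with the file name and keeps the folder as its parent.
print(os.path.basename(example_path))
print(os.path.dirname(example_path) == example_folder)
```

Building the path this way avoids hard-coding `/` or `\` and so keeps the cell portable between Colab and a local machine.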

In [None]:
air_quality = pd.read_excel(path_file)
air_quality.head(5)

In [None]:
print('Number of instances: %s'%len(air_quality))
print('Number of attributes: %s'%(air_quality.shape[1]))

# 5. Data cleaning

### b) Modifying the time column

In [None]:
# Combine the Date and Time columns into a single timestamp column named 'Fecha'
air_quality.loc[:, 'Fecha'] = pd.to_datetime(air_quality.Date.astype(str)+' '+air_quality.Time.astype(str))
air_quality.drop(['Date','Time'], axis=1, inplace=True)
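
The same idea can be tried on a tiny made-up frame before running it on the real sheet. This is a minimal sketch: the `toy` frame and its values are invented, and only the column names `Date`, `Time`, `T` and `Fecha` come from the notebook.

```python
import pandas as pd

# Toy frame standing in for the raw sheet; the values are made up.
toy = pd.DataFrame({
    "Date": ["2004-03-10", "2004-03-10", "2004-03-11"],
    "Time": ["18:00:00", "19:00:00", "00:00:00"],
    "T": [13.6, 13.3, 11.0],
})

# Join the two text columns and parse the result as one timestamp.
toy["Fecha"] = pd.to_datetime(toy["Date"].astype(str) + " " + toy["Time"].astype(str))

# The separate Date and Time columns are no longer needed.
toy = toy.drop(columns=["Date", "Time"])

print(toy.dtypes)
```

After this step `Fecha` has a datetime dtype, which is what lets Plotly treat the x axis as time.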

In [None]:
air_quality = air_quality[['Fecha','CO(GT)','PT08.S1(CO)','NMHC(GT)','C6H6(GT)','PT08.S2(NMHC)','NOx(GT)','PT08.S3(NOx)','NO2(GT)','PT08.S4(NO2)','PT08.S5(O3)','T','RH','AH']]
air_quality.head()

In [None]:
fig=go.Figure()
fig.layout.template = "ggplot2"
fig.add_scatter(x=air_quality['Fecha'], y=air_quality['AH'], mode='lines',name='AH')
fig.add_scatter(x=air_quality['Fecha'], y=air_quality['RH'], mode='lines',name='RH')
fig.add_scatter(x=air_quality['Fecha'], y=air_quality['T'], mode='lines',name='T')

fig.update_traces(marker=dict(size=3),
                  selector=dict(mode='markers'))
pio.show(fig)

### c) Handling missing data



How do we deal with missing values?


*   One option is to drop the entire instance
*   Another option is to fill the missing value with the **mean, mode, median, minimum, maximum...**
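
The two options above can be compared side by side on a small frame. This is only a sketch on invented values; the names `toy`, `dropped`, `filled_mean` and `filled_median` are hypothetical.

```python
import numpy as np
import pandas as pd

# Invented readings with one gap in each column.
toy = pd.DataFrame({
    "T": [10.0, np.nan, 14.0, 12.0],
    "RH": [50.0, 55.0, np.nan, 45.0],
})

# Option 1: drop every row that has at least one missing value.
dropped = toy.dropna()

# Option 2: fill each gap with a statistic of its own column.
filled_mean = toy.fillna(toy.mean())
filled_median = toy.fillna(toy.median())

print(len(toy), len(dropped))
print(filled_mean)
```

Dropping rows loses half of this toy frame, while filling keeps every timestamp at the cost of inventing a value, which is the trade-off to weigh for a time series.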

#### Removing instances with missing data

In [None]:
# .copy() creates an independent dataframe; plain assignment would only alias air_quality
copy_df = air_quality.copy()
copy_df.dropna().head()

#### Filling NaNs

In [None]:
air_quality['PT08.S2(NMHC)'] = air_quality['PT08.S2(NMHC)'].fillna(air_quality['PT08.S2(NMHC)'][air_quality['PT08.S2(NMHC)'] > 0].mean())
air_quality.head()

**Now fill the remaining NaNs with the mean**



In [None]:
air_quality['CO(GT)'] = air_quality['CO(GT)'].fillna(air_quality['CO(GT)'][air_quality['CO(GT)']>0].mean())
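air_quality['PT08.S1(CO)'] = air_quality['PT08.S1(CO)'].fillna(air_quality['PT08.S1(CO)'][air_quality['PT08.S1(CO)']>0].mean())
air_quality['NMHC(GT)'] = air_quality['NMHC(GT)'].fillna(air_quality['NMHC(GT)'][air_quality['NMHC(GT)']>0].mean())
air_quality['C6H6(GT)'] = air_quality['C6H6(GT)'].fillna(air_quality['C6H6(GT)'][air_quality['C6H6(GT)']>0].mean())
air_quality['PT08.S2(NMHC)'] = air_quality['PT08.S2(NMHC)'].fillna(air_quality['PT08.S2(NMHC)'][air_quality['PT08.S2(NMHC)']>0].mean())
air_quality['NOx(GT)'] = air_quality['NOx(GT)'].fillna(air_quality['NOx(GT)'][air_quality['NOx(GT)']>0].mean())
air_quality['PT08.S3(NOx)'] = air_quality['PT08.S3(NOx)'].fillna(air_quality['PT08.S3(NOx)'][air_quality['PT08.S3(NOx)']>0].mean())
air_quality['NO2(GT)'] = air_quality['NO2(GT)'].fillna(air_quality['NO2(GT)'][air_quality['NO2(GT)']>0].mean())
air_quality['PT08.S4(NO2)'] = air_quality['PT08.S4(NO2)'].fillna(air_quality['PT08.S4(NO2)'][air_quality['PT08.S4(NO2)']>0].mean())
air_quality['PT08.S5(O3)'] = air_quality['PT08.S5(O3)'].fillna(air_quality['PT08.S5(O3)'][air_quality['PT08.S5(O3)']>0].mean())
air_quality['T'] = air_quality['T'].fillna(air_quality['T'][air_quality['T']>0].mean())
air_quality['RH'] = air_quality['RH'].fillna(air_quality['RH'][air_quality['RH']>0].mean())
air_quality['AH'] = air_quality['AH'].fillna(air_quality['AH'][air_quality['AH']>0].mean())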
air_quality.head()
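
The per-column fill can also be written as a loop, which avoids repeating the column name three times per line. This is a sketch on an invented frame: `toy` and `value_columns` are hypothetical names, and only `Fecha`, `CO(GT)` and `T` come from the notebook.

```python
import numpy as np
import pandas as pd

# Invented readings with one NaN in each measurement column.
toy = pd.DataFrame({
    "Fecha": pd.to_datetime([
        "2004-03-10 18:00", "2004-03-10 19:00",
        "2004-03-10 20:00", "2004-03-10 21:00",
    ]),
    "CO(GT)": [2.0, np.nan, 4.0, 3.0],
    "T": [10.0, 12.0, np.nan, 14.0],
})

# Every column except the timestamp is treated as a numeric measurement.
value_columns = [c for c in toy.columns if c != "Fecha"]

for col in value_columns:
    # Mean of the strictly positive readings, the same `> 0` filter used in the cells above.
    positive_mean = toy.loc[toy[col] > 0, col].mean()
    toy[col] = toy[col].fillna(positive_mean)

print(toy)
```

One thing to keep in mind: the `> 0` filter also leaves out legitimate readings that are zero or negative, such as temperatures at or below 0 °C, so the resulting mean may be biased for those columns.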

### d) Removing duplicates

Instances that are exactly identical must be removed because they only add noise.

In [None]:
air_quality.drop_duplicates(inplace=True)
air_quality.head()
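
Before dropping anything it can help to count how many rows are exact copies. A minimal sketch on invented data follows; `toy`, `n_duplicates` and `deduplicated` are hypothetical names.

```python
import pandas as pd

# Invented frame: the first two rows are identical in every column.
toy = pd.DataFrame({
    "Fecha": pd.to_datetime(["2004-03-10 18:00", "2004-03-10 18:00", "2004-03-10 19:00"]),
    "T": [13.6, 13.6, 13.6],
})

# duplicated() marks a row only when ALL of its columns repeat an earlier row.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence; the index is rebuilt afterwards.
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(deduplicated))
```

The third row survives even though its temperature repeats, because its timestamp differs.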

### e) Filling in missing values

For time series, outliers can be detected by visual inspection of a plot of the attribute against time.

In [None]:
# Assumption: -200 is the value the UCI description of this dataset uses to tag missing readings
for i in range(air_quality.shape[1]-1):
  air_quality.iloc[:,i+1] = air_quality.iloc[:,i+1].replace(to_replace=-200, value=air_quality.iloc[:,i+1][air_quality.iloc[:,i+1]>0].mean())
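
The replacement step can be checked on a toy frame first. This sketch assumes the placeholder for missing readings is -200, which is how the UCI description of this dataset tags them; `MISSING_SENTINEL` and `toy` are hypothetical names and the values are invented.

```python
import pandas as pd

# Assumption: -200 marks a missing reading, as stated on the UCI dataset page.
MISSING_SENTINEL = -200

toy = pd.DataFrame({
    "Fecha": pd.to_datetime([
        "2004-03-10 18:00", "2004-03-10 19:00",
        "2004-03-10 20:00", "2004-03-10 21:00",
    ]),
    "T": [10.0, -200.0, 14.0, 12.0],
    "RH": [40.0, 50.0, -200.0, 60.0],
})

# Skip the first column (the timestamp) and repair each measurement column.
for col in toy.columns[1:]:
    # The mean is taken over positive readings only, so the sentinel does not distort it.
    positive_mean = toy.loc[toy[col] > 0, col].mean()
    toy[col] = toy[col].replace(MISSING_SENTINEL, positive_mean)

print(toy)
```

In a time-versus-attribute plot the sentinel shows up as sharp spikes down to -200, which is why visual inspection is enough to spot it.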

In [None]:
fig=go.Figure()
fig.add_scatter(x=air_quality['Fecha'], y=air_quality['AH'], mode='lines',name='AH')
fig.add_scatter(x=air_quality['Fecha'], y=air_quality['RH'], mode='lines',name='RH')
fig.add_scatter(x=air_quality['Fecha'], y=air_quality['T'], mode='lines',name='T')
fig.update_traces(marker=dict(size=3),
                  selector=dict(mode='markers'))
pio.show(fig)

In [None]:
air_quality.describe()

# 6. Exploratory data analysis (EDA)

In [None]:
sns.set(style="ticks", context="talk")
# Matplotlib 3.6 renamed the seaborn styles; on older versions use 'seaborn-paper'
plt.style.use('seaborn-v0_8-paper')
g = sns.PairGrid(air_quality.iloc[:,1:], diag_sharey=False, corner=True)
g.map_lower(sns.scatterplot)
g.map_diag(sns.kdeplot)
g.add_legend()

### a) Saving the dataframe

In [None]:
air_quality.to_csv(os.path.join(path,'AirQuality_New.csv'))
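
A quick way to confirm that a saved file can be read back is a round trip through a temporary folder. This is a sketch on invented data: `toy`, `out_file` and the file name `toy_air_quality.csv` are hypothetical, and `index=False` is a choice made here for illustration, not the setting used in the cell above.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({
    "Fecha": pd.to_datetime(["2004-03-10 18:00", "2004-03-10 19:00"]),
    "T": [13.6, 13.3],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_file = os.path.join(tmp_dir, "toy_air_quality.csv")  # hypothetical file name
    # index=False keeps the row index out of the file.
    toy.to_csv(out_file, index=False)
    # CSV stores timestamps as text, so parse_dates restores the datetime dtype.
    restored = pd.read_csv(out_file, parse_dates=["Fecha"])

print(restored.dtypes)
```

With the default settings `to_csv` also writes the row index as an extra first column, which then appears as an unnamed column when the file is read back.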