## 🧼 Day 10 – Cleaning, Outlier Detection, and Data Scaling

**Today you will learn to:**
- Detect and fill in missing values
- Identify and treat outliers
- Scale numeric columns for ML models

In [3]:
# 1. Load and explore the dataset
import pandas as pd
import numpy as np
import sys
sys.path.append('../fundamentals/week02')
from Limpiar_Daraset import limpiar

# Re-read the file, converting blank strings and the text 'None' into NaN
df = pd.read_csv("../../data/ventas_limpio.csv", 
                 na_values=[" ", "None", "nan", "NaN"],
                 parse_dates=['fecha_venta'])

print(df['fecha_venta'].dtype)
df

datetime64[ns]


  df[col] = df[col].replace('[\$,]', '', regex=True).astype(float)


Unnamed: 0,producto,precio,cantidad,fecha_venta,ingreso
0,F,$142.50,12.0,2024-01-10,"$1,710.00"
1,A,,10.0,NaT,"$1,000.00"
2,C,$142.50,8.0,2024-01-03,"$1,140.00"
3,D,$150.00,8.0,2024-01-05,
4,E,,7.0,2024-01-07,$840.00
5,B,$200.00,3.0,2024-01-03,"$1,000.00"
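
Before imputing anything, it can help to count how many values are missing in each column. The sketch below rebuilds the table above by hand, since the CSV file itself is not available here; the name `sample` is an assumption used only for illustration.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the table shown above (`sample` is a hypothetical name)
sample = pd.DataFrame({
    "producto": ["F", "A", "C", "D", "E", "B"],
    "precio": ["$142.50", np.nan, "$142.50", "$150.00", np.nan, "$200.00"],
    "cantidad": [12.0, 10.0, 8.0, 8.0, 7.0, 3.0],
    "fecha_venta": pd.to_datetime(
        ["2024-01-10", None, "2024-01-03", "2024-01-05", "2024-01-07", "2024-01-03"]
    ),
    "ingreso": ["$1,710.00", "$1,000.00", "$1,140.00", np.nan, "$840.00", "$1,000.00"],
})

# Count missing entries per column (NaN and NaT both count as missing)
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Select the rows that contain at least one missing value
rows_with_gaps = sample[sample.isna().any(axis=1)]
print(rows_with_gaps)
```

On the real data, the same two calls can be applied directly to `df` right after `pd.read_csv`.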


In [4]:
# 2. Clean the dataset with the helper imported above
df_new = limpiar(df)
df_new

Unnamed: 0,producto,precio,cantidad,fecha_venta,ingreso
0,F,142.5,12.0,2024-01-10,1710.0
1,A,158.75,10.0,2024-01-03,1587.5
2,C,142.5,8.0,2024-01-03,1140.0
3,D,150.0,8.0,2024-01-05,1200.0
4,E,158.75,7.0,2024-01-07,1111.25
5,B,200.0,3.0,2024-01-03,600.0
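
The source of `limpiar` is not shown in this notebook. Judging only from the before and after tables, it appears to strip currency symbols, fill missing prices with the mean, fill the missing date with the most frequent date, and recompute `ingreso` as `precio * cantidad`. The function below is a minimal sketch under those assumptions; `limpiar_sketch` and `raw` are hypothetical names, and the real helper may work differently.

```python
import numpy as np
import pandas as pd

def limpiar_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for limpiar(); not the author's implementation."""
    out = df.copy()
    # Strip "$" and thousands separators, then convert the text to numbers
    for col in ["precio", "ingreso"]:
        stripped = out[col].str.replace(r"[\$,]", "", regex=True)
        out[col] = pd.to_numeric(stripped, errors="coerce")
    # Assumed imputation rules: mean price, median quantity, most frequent date
    out["precio"] = out["precio"].fillna(out["precio"].mean())
    out["cantidad"] = out["cantidad"].fillna(out["cantidad"].median())
    out["fecha_venta"] = out["fecha_venta"].fillna(out["fecha_venta"].mode().iloc[0])
    # Recompute revenue so it stays consistent with price and quantity
    out["ingreso"] = out["precio"] * out["cantidad"]
    return out

# Hand-built copy of the raw table shown earlier
raw = pd.DataFrame({
    "producto": ["F", "A", "C", "D", "E", "B"],
    "precio": ["$142.50", np.nan, "$142.50", "$150.00", np.nan, "$200.00"],
    "cantidad": [12.0, 10.0, 8.0, 8.0, 7.0, 3.0],
    "fecha_venta": pd.to_datetime(
        ["2024-01-10", None, "2024-01-03", "2024-01-05", "2024-01-07", "2024-01-03"]
    ),
    "ingreso": ["$1,710.00", "$1,000.00", "$1,140.00", np.nan, "$840.00", "$1,000.00"],
})

cleaned = limpiar_sketch(raw)
print(cleaned)
```

Unlike what the notebook output suggests about `limpiar`, this sketch works on a copy and leaves its input untouched, which makes it easier to compare the raw and cleaned tables side by side.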


In [5]:
# 3. Impute missing values
# Assign the result back instead of calling fillna(..., inplace=True) on a
# column selection, which raises a FutureWarning in recent pandas versions
df['precio'] = df['precio'].fillna(df['precio'].mean())
df['cantidad'] = df['cantidad'].fillna(df['cantidad'].median())

# Recompute the revenue column
df['ingreso'] = df['precio'] * df['cantidad']

# Show the updated result
df


Unnamed: 0,producto,precio,cantidad,fecha_venta,ingreso
0,F,142.5,12.0,2024-01-10,1710.0
1,A,158.75,10.0,2024-01-03,1587.5
2,C,142.5,8.0,2024-01-03,1140.0
3,D,150.0,8.0,2024-01-05,1200.0
4,E,158.75,7.0,2024-01-07,1111.25
5,B,200.0,3.0,2024-01-03,600.0


In [6]:
# 4. Outlier detection with the IQR rule
Q1 = df['ingreso'].quantile(0.25)
Q3 = df['ingreso'].quantile(0.75)
IQR = Q3 - Q1
print(Q1, Q3, IQR)
outliers = df[(df['ingreso'] < Q1 - 1.5 * IQR) | (df['ingreso'] > Q3 + 1.5 * IQR)]
print(outliers)

1118.4375 1490.625 372.1875
Empty DataFrame
Columns: [producto, precio, cantidad, fecha_venta, ingreso]
Index: []
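
The IQR rule finds no outliers in this small dataset, so the treatment step has nothing to act on. To illustrate what treatment could look like, the sketch below appends one made-up extreme value (9000.0, not part of the original data) to the `ingreso` values and shows two common options: capping values at the fences, or dropping the flagged rows. The helper name `iqr_bounds` is an assumption.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the lower and upper fences of the IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# ingreso values from the cleaned table, plus one invented extreme value
ingreso = pd.Series([1710.0, 1587.5, 1140.0, 1200.0, 1111.25, 600.0, 9000.0])

lower, upper = iqr_bounds(ingreso)
is_outlier = (ingreso < lower) | (ingreso > upper)
print(ingreso[is_outlier])

# Option A: cap extreme values at the fences (keeps every row)
capped = ingreso.clip(lower=lower, upper=upper)

# Option B: drop the flagged rows entirely
filtered = ingreso[~is_outlier]

print(capped.max(), len(filtered))
```

Capping keeps the sample size intact, which matters with so few rows; dropping is simpler but discards the whole observation. Which one is appropriate depends on whether the extreme value is a data-entry error or a real sale.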


In [7]:
# 5. Min-Max scaling
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df[['precio', 'cantidad', 'ingreso']] = scaler.fit_transform(df[['precio', 'cantidad', 'ingreso']])
df.head()

Unnamed: 0,producto,precio,cantidad,fecha_venta,ingreso
0,F,0.0,1.0,2024-01-10,1.0
1,A,0.282609,0.777778,2024-01-03,0.88964
2,C,0.0,0.555556,2024-01-03,0.486486
3,D,0.130435,0.555556,2024-01-05,0.540541
4,E,0.282609,0.444444,2024-01-07,0.460586
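
With its default settings, `MinMaxScaler` maps each column to the range [0, 1] using `(x - min) / (max - min)`. As a sanity check, the sketch below reproduces the scaled `precio` column by hand with plain pandas, using the cleaned prices from the earlier table, and then inverts the transformation.

```python
import pandas as pd

# Cleaned precio values, copied from the table before scaling
precio = pd.Series([142.5, 158.75, 142.5, 150.0, 158.75, 200.0], name="precio")

p_min, p_max = precio.min(), precio.max()

# Min-max formula: the smallest value maps to 0 and the largest to 1
precio_scaled = (precio - p_min) / (p_max - p_min)
print(precio_scaled.round(6).tolist())

# Invert the scaling to recover the original units
precio_restored = precio_scaled * (p_max - p_min) + p_min
print(precio_restored.tolist())
```

In a modelling workflow, the minimum and maximum would normally be learned from the training split only and then reused on validation and test data, which is what calling `fit` once and `transform` afterwards achieves.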
