# Practical Assignment: Exploring and Preparing Airbnb Listings


## Objective
In this notebook we analyse the `listings_big.csv` dataset to understand its structure, diagnose quality problems and prepare a feature set suitable for estimating **how rentable a listing is**. The general flow includes:
- Initial exploration and statistical summary
- Analysis of missing values and availability patterns
- Visualizations that help interpret occupancy and its drivers
- Definition of a supervised classification problem focused on high occupancy
- Construction of a pipeline for preprocessing, feature engineering and feature selection/dimensionality reduction


# 🏡 Airbnb Listings Data Dictionary

> Data dictionary for the Airbnb listings dataset, with data types, calculated fields and descriptions.  
> Renders on GitHub (large tables scroll horizontally).

---

<div style="overflow-x: auto;">

| **Field** | **Type** | **Calculated** | **Description** |
|------------|-----------|----------------|-----------------|
| id | integer |  | Airbnb's unique identifier for the listing |
| listing_url | text | ✅ |  |
| scrape_id | bigint | ✅ | Inside Airbnb "Scrape" this was part of |
| last_scraped | datetime | ✅ | UTC. The date and time this listing was "scraped". |
| source | text |  | One of `"neighbourhood search"` or `"previous scrape"`. Describes how the listing was discovered. |
| name | text |  | Name of the listing |
| description | text |  | Detailed description of the listing |
| neighborhood_overview | text |  | Host's description of the neighbourhood |
| picture_url | text |  | URL to Airbnb-hosted regular-sized image |
| host_id | integer |  | Airbnb's unique identifier for the host/user |
| host_url | text | ✅ | Airbnb page for the host |
| host_name | text |  | Name of the host (usually first name). |
| host_since | date |  | Host account creation date |
| host_location | text |  | Host's self-reported location |
| host_about | text |  | Description about the host |
| host_response_time | text |  |  |
| host_response_rate | text |  |  |
| host_acceptance_rate | text |  | Rate at which host accepts booking requests |
| host_is_superhost | boolean *(t=true; f=false)* |  |  |
| host_thumbnail_url | text |  |  |
| host_picture_url | text |  |  |
| host_neighbourhood | text |  |  |
| host_listings_count | text |  | Number of listings (Airbnb internal) |
| host_total_listings_count | text |  | Total listings (Airbnb internal) |
| host_verifications | text |  |  |
| host_has_profile_pic | boolean *(t=true; f=false)* |  |  |
| host_identity_verified | boolean *(t=true; f=false)* |  |  |
| neighbourhood | text |  |  |
| neighbourhood_cleansed | text | ✅ | Geocoded neighbourhood (from shapefiles) |
| neighbourhood_group_cleansed | text | ✅ | Geocoded neighbourhood group (from shapefiles) |
| latitude | numeric |  | WGS84 latitude |
| longitude | numeric |  | WGS84 longitude |
| property_type | text |  | Self-selected property type |
| room_type | text |  | One of:<br>• **Entire home/apt** — Whole space<br>• **Private room** — Own room, shared areas<br>• **Shared room** — Shared sleeping space |
| accommodates | integer |  | Max guest capacity |
| bathrooms | numeric |  | Number of bathrooms |
| bathrooms_text | string |  | Text description of bathrooms (legacy field) |
| bedrooms | integer |  | Number of bedrooms |
| beds | integer |  | Number of beds |
| amenities | json |  | JSON array of amenities |
| price | currency |  | Daily price (ignore `$` artifact) |
| minimum_nights | integer |  | Minimum nights allowed |
| maximum_nights | integer |  | Maximum nights allowed |
| minimum_minimum_nights | integer | ✅ | Smallest min nights (calendar, 365 days ahead) |
| maximum_minimum_nights | integer | ✅ | Largest min nights (calendar, 365 days ahead) |
| minimum_maximum_nights | integer | ✅ | Smallest max nights (calendar, 365 days ahead) |
| maximum_maximum_nights | integer | ✅ | Largest max nights (calendar, 365 days ahead) |
| minimum_nights_avg_ntm | numeric | ✅ | Avg min nights (calendar, 365 days ahead) |
| maximum_nights_avg_ntm | numeric | ✅ | Avg max nights (calendar, 365 days ahead) |
| calendar_updated | date |  |  |
| has_availability | boolean |  | *(t=true; f=false)* |
| availability_30 | integer | ✅ | Available nights (next 30 days) |
| availability_60 | integer | ✅ | Available nights (next 60 days) |
| availability_90 | integer | ✅ | Available nights (next 90 days) |
| availability_365 | integer | ✅ | Available nights (next 365 days) |
| calendar_last_scraped | date |  |  |
| number_of_reviews | integer |  | Total number of reviews |
| number_of_reviews_ltm | integer | ✅ | Reviews in last 12 months |
| number_of_reviews_l30d | integer | ✅ | Reviews in last 30 days |
| first_review | date | ✅ | Date of first review |
| last_review | date | ✅ | Date of most recent review |
| review_scores_rating | numeric |  |  |
| review_scores_accuracy | numeric |  |  |
| review_scores_cleanliness | numeric |  |  |
| review_scores_checkin | numeric |  |  |
| review_scores_communication | numeric |  |  |
| review_scores_location | numeric |  |  |
| review_scores_value | numeric |  |  |
| license | text |  | Licence/permit/registration number |
| instant_bookable | boolean |  | *(t=true; f=false)* — Guests can book instantly |
| calculated_host_listings_count | integer | ✅ | Number of listings host has in this scrape (city/region) |
| calculated_host_listings_count_entire_homes | integer | ✅ | Entire homes owned by host |
| calculated_host_listings_count_private_rooms | integer | ✅ | Private rooms owned by host |
| calculated_host_listings_count_shared_rooms | integer | ✅ | Shared rooms owned by host |
| reviews_per_month | numeric | ✅ | Average reviews per month over listing lifetime.<br><br>**Pseudocode/SQL:**<br>`IF scrape_date - first_review <= 30 THEN number_of_reviews`<br>`ELSE number_of_reviews / ((scrape_date - first_review + 1) / (365/12))` |

</div>

---

✅ = Calculated field  
📘 *Boolean values*: `t=true`, `f=false`  
🌐 *Coordinates*: WGS84 projection  
💡 *Source*: [Inside Airbnb](http://insideairbnb.com/)
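
The pseudocode in the `reviews_per_month` row can be sketched in Python as follows. This is an illustrative re-implementation only: the function name and its arguments are hypothetical, and the exact rule Inside Airbnb applies may differ in details.

```python
from datetime import date

DAYS_PER_MONTH = 365 / 12  # average month length used in the dictionary's formula


def reviews_per_month(number_of_reviews: int, first_review: date, scrape_date: date) -> float:
    """Hypothetical helper mirroring the pseudocode in the data dictionary."""
    days_active = (scrape_date - first_review).days
    if days_active <= 30:
        # First review is at most 30 days old: the raw count is returned.
        return float(number_of_reviews)
    # Otherwise divide the count by the listing's review lifetime in months.
    return number_of_reviews / ((days_active + 1) / DAYS_PER_MONTH)


# 24 reviews over exactly 365 days gives slightly under 2 reviews per month.
example = reviews_per_month(24, date(2023, 1, 1), date(2024, 1, 1))
```

Note that the formula needs a `first_review` date, so listings that have never been reviewed cannot receive a value.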


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import json
import ast
import re

plt.style.use('seaborn-v0_8')
plt.rcParams['figure.figsize'] = (8, 5)
pd.set_option('display.max_columns', 80)
pd.set_option("display.float_format", "{:.4f}".format)

DATA_PATH = Path('listings_big.csv')


## 1. Loading and initial cleanup


In [None]:
df_raw = pd.read_csv(
    DATA_PATH,
    na_values=("N/A", "NA", "na", "n/a", ""),
    keep_default_na=True
)

df = df_raw.copy()
# Price cleanup: strip currency symbols and thousands separators, then convert to numeric
df['price'] = (df['price']
               .astype(str)
               .str.replace(r'[€$£,]', '', regex=True)
               .str.strip())
df['price'] = pd.to_numeric(df['price'], errors='coerce')
# Parse date columns for later feature engineering
date_columns = ['last_scraped', 'host_since', 'first_review', 'last_review']
for col in date_columns:
    df[col] = pd.to_datetime(df[col], errors='coerce')
# Normalize binary flags: 't'/'f' become 'yes'/'no'; missing values are treated as 'no'
def to_bool(series):
    mapped = series.map({'t': 'yes', 'f': 'no', np.nan: 'no'})
    return mapped.astype('object')
for col in ['host_is_superhost', 'host_has_profile_pic', 'host_identity_verified', 'instant_bookable', 'has_availability']:
    if col in df.columns:
        df[col] = to_bool(df[col])
print(f"Original shape: {df_raw.shape}")
print(f"Shape after basic transformations: {df.shape}")


In [None]:
# Diagnose and correct non-positive prices
price_zero_count = (df['price'] == 0).sum()
price_negative_count = (df['price'] < 0).sum()
print(f'Prices equal to 0: {price_zero_count}')
print(f'Negative prices: {price_negative_count}')
positive_prices = df.loc[df['price'] > 0, 'price']
if positive_prices.empty:
    raise ValueError('No positive prices available for imputation.')
median_positive_price = positive_prices.median()
if price_zero_count + price_negative_count > 0:
    print(f"Imputing non-positive prices with the median of positive values: {median_positive_price:.2f}")
# Missing prices stay NaN here; the modelling pipeline imputes them later
df['price_imputed'] = df['price'].mask(df['price'] <= 0, median_positive_price)
print(f"Minimum price after imputation: {df['price_imputed'].min():.2f}")


In [None]:
# Check the range after applying MinMaxScaler to the imputed price
from sklearn.preprocessing import MinMaxScaler
price_scaler = MinMaxScaler()
price_scaled_preview = price_scaler.fit_transform(df[['price_imputed']])
# nanmin/nanmax: rows whose price is missing remain NaN after scaling
print(f"Scaled price range: min={np.nanmin(price_scaled_preview):.3f}, max={np.nanmax(price_scaled_preview):.3f}")


### 1.1 General dataset information


In [None]:
df.info()


In [None]:
# Summary of column dtypes and count per dtype
type_summary = df.dtypes.value_counts().rename('count').to_frame()
type_summary


In [None]:
# Descriptive statistics for numeric variables
def format_numeric_summary(df, exclude_patterns=("id",), currency_cols=None, decimals=4):
    # Drop columns with no statistical meaning; each pattern is matched as a
    # substring of the lower-cased column name (so "id" also drops "host_id")
    columns_to_exclude = [c for c in df.columns if any(pat in c.lower() for pat in exclude_patterns)]
    summary_num = (
        df.drop(columns=columns_to_exclude)
          .select_dtypes(include="number")
          .describe()
          .T
          .round(decimals)
    )

    # Formatted copy for display
    summary_fmt = summary_num.copy().astype(object)

    # Apply currency formatting to currency_cols, leaving `count` as a plain number
    if currency_cols:
        money_stats = [s for s in summary_num.columns if s != "count"]
        for col in currency_cols:
            if col in summary_fmt.index:
                summary_fmt.loc[col, money_stats] = (
                    summary_num.loc[col, money_stats].map(lambda x: f"${x:,.{decimals}f}").tolist()
                )

    return summary_num, summary_fmt

columns_to_exclude = ["id", "scrape_id", "host_id", "calendar_updated","neighbourhood_group_cleansed"]
currency_cols = ["price", "estimated_revenue_l365d"]

numeric_summary, numeric_summary_fmt = format_numeric_summary(
    df,
    exclude_patterns=columns_to_exclude,
    currency_cols=currency_cols,
    decimals=2
)

numeric_summary_fmt


In [None]:
# Basic statistics for categorical/string variables
categorical_cols = df.select_dtypes(include=['object', 'string']).columns
categorical_summary = df[categorical_cols].describe().T
categorical_summary_sorted = categorical_summary.sort_values("freq", ascending=False)

categorical_summary_sorted.head(20)



### 1.2 Checking duplicates and consistency


In [None]:
duplicate_count = df.duplicated(subset=['id']).sum()
print(f"Duplicates based on 'id': {duplicate_count}")

missing_ids = df['id'].isna().sum()
print(f"Missing IDs: {missing_ids}")


## 2. Missing values


In [None]:
missing_abs = df.isna().sum().sort_values(ascending=False)
missing_pct = (missing_abs / len(df)).round(4)
missing_table = pd.DataFrame({'missing': missing_abs, 'percent': missing_pct})
missing_table.head(20)


In [None]:
# Plot the variables with the highest share of missing values
top_missing = missing_table[missing_table['percent'] > 0].head(25)
ax = top_missing.sort_values('percent').plot.barh(y='percent', figsize=(8, 6))
ax.set_xlabel('Proportion of missing values')
ax.set_ylabel('Variable')
ax.set_title('Variables with the highest proportion of missing values')
plt.tight_layout()
plt.show()


In [None]:
# Missingness matrix to detect patterns
try:
    import missingno as msno
    msno.matrix(df[['price', 'reviews_per_month', 'last_review', 'description', 'host_about', 'bathrooms']])
    plt.show()
except ImportError:
    print('missingno is not available in the current environment.')


In [None]:
# Example MAR/MNAR analysis: compare missing reviews_per_month against the number of reviews
reviews_missing_flag = df['reviews_per_month'].isna()
comparison = df.groupby(reviews_missing_flag)['number_of_reviews'].agg(['mean', 'median', 'count'])
# Rename by key so the labels stay correct even if one of the two groups is absent
comparison = comparison.rename(index={False: 'reviews_per_month available', True: 'reviews_per_month missing'})
comparison

**Initial interpretation:** `reviews_per_month` is typically missing for listings that have no reviews at all (there is no `first_review` date to compute it from). This suggests a MAR mechanism: the missingness depends on the observed `number_of_reviews`.
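
One way to check this interpretation is to cross-tabulate the missingness flag against "the listing has zero reviews". The sketch below runs on a small toy frame with hypothetical values; on the real data the same two flags would be built from `df`.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) mimicking the two columns involved.
toy = pd.DataFrame({
    "number_of_reviews": [0, 0, 5, 12, 0, 3],
    "reviews_per_month": [np.nan, np.nan, 0.4, 1.1, np.nan, 0.2],
})

missing_flag = toy["reviews_per_month"].isna()
no_reviews_flag = toy["number_of_reviews"].eq(0)

# Rows: listing has no reviews (False/True); columns: reviews_per_month missing (False/True).
overlap = pd.crosstab(no_reviews_flag, missing_flag)

# Share of the missing rows that coincide with number_of_reviews == 0.
explained_share = (missing_flag & no_reviews_flag).sum() / missing_flag.sum()
```

If `explained_share` is close to 1 on the real dataset, filling the missing values with 0 is a defensible choice, because the gap then simply encodes "no reviews yet".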


## 3. Key distributions and visualizations


In [None]:
sns.histplot(df['price'].dropna(), bins=100, kde=True)
plt.title('Price distribution (all observations)')
plt.xlabel('Price per night (USD)')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()


In [None]:
# Distribution after trimming the top 1% to dampen outliers
price_cap = df['price'].quantile(0.99)
sns.histplot(df.loc[df['price'] <= price_cap, 'price'], bins=60, kde=True)
plt.axvline(df['price'].median(), color='red', linestyle='--', label='Median')
plt.title('Price distribution (<= 99th percentile)')
plt.xlabel('Price per night (USD)')
plt.legend()
plt.tight_layout()
plt.show()


In [None]:
# Price skewness, to assess asymmetry
price_series = df['price'].dropna()
price_skew = price_series.skew()
positive_prices = price_series[price_series > 0]
price_log_skew = np.log1p(positive_prices).skew() if not positive_prices.empty else np.nan
print(f'Skewness (raw price): {price_skew:.3f}')
print(f'Skewness (log1p of price): {price_log_skew:.3f}')


In [None]:
sns.boxplot(data=df, x='room_type', y='price')
plt.yscale('log')
plt.title('Price by room type (log scale)')
plt.xlabel('Room type')
plt.ylabel('Price (log)')
plt.tight_layout()
plt.show()


In [None]:
# Price vs capacity
sns.scatterplot(data=df, x='accommodates', y='price', hue='room_type', alpha=0.3)
plt.yscale('log')
plt.title('Price vs guests accommodated')
plt.xlabel('Capacity (accommodates)')
plt.ylabel('Price (log)')
plt.tight_layout()
plt.show()


### Availability and occupancy distributions


In [None]:
sns.histplot(df['availability_365'], bins=40, kde=False)
plt.title('Distribution of annual availability (days available)')
plt.xlabel('Days available out of 365')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()


In [None]:
sns.histplot(1 - (df['availability_365'] / 365), bins=40, kde=True)
plt.title('Estimated annual occupancy rate')
plt.xlabel('Proportion of nights occupied')
plt.tight_layout()
plt.show()


### Availability-oriented correlation (encoded categoricals)
To study the drivers of availability we one-hot encode the categorical variables and analyse their correlation with the occupancy rates.


In [None]:
# Encode categoricals for the availability-oriented correlation matrix
availability_cols = ['availability_30', 'availability_60', 'availability_90', 'availability_365']
base_columns = ['accommodates', 'bedrooms', 'beds', 'minimum_nights', 'maximum_nights',
                'number_of_reviews', 'reviews_per_month', 'price']

corr_base = df[base_columns + availability_cols].copy()

for horizon in [30, 60, 90, 365]:
    rate_col = f'occupancy_rate_{horizon}'
    if rate_col not in df.columns:
        df[rate_col] = 1 - (df[f'availability_{horizon}'] / horizon)
    corr_base[rate_col] = df[rate_col]

binary_map = {'yes': 1, 'no': 0}
for col in ['host_is_superhost', 'instant_bookable']:
    corr_base[f'{col}_flag'] = df[col].map(binary_map)

categorical_for_corr = ['room_type', 'property_type', 'neighbourhood_cleansed']
corr_encoded = pd.get_dummies(
    pd.concat([corr_base, df[categorical_for_corr]], axis=1),
    columns=categorical_for_corr,
    drop_first=True
)

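corr_matrix = corr_encoded.corr(numeric_only=True)
# Rank features by absolute correlation with the annual occupancy rate, leaving out the
# availability/occupancy columns themselves (they define the target and correlate trivially)
target_related = availability_cols + [f'occupancy_rate_{h}' for h in [30, 60, 90, 365]]
target_corr = (
    corr_matrix[['availability_365', 'occupancy_rate_365']]
    .drop(index=target_related)
    .sort_values(by='occupancy_rate_365', key=lambda s: s.abs(), ascending=False)
)

top_features = target_corr.head(15).index
sns.heatmap(
    corr_matrix.loc[top_features, ['availability_365', 'occupancy_rate_365']],
    annot=True, fmt='.2f', cmap='coolwarm', vmin=-1, vmax=1
)
plt.title('Correlation with availability and occupancy (top 15)')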
plt.tight_layout()
plt.show()

target_corr.head(15)


## 4. Availability and the occupancy target


In [None]:
# Compute occupancy rates from availability
availability_cols = ['availability_30', 'availability_60', 'availability_90', 'availability_365']
occupancy_df = df[availability_cols].copy()

for horizon in [30, 60, 90, 365]:
    rate_col = f'occupancy_rate_{horizon}'
    df[rate_col] = 1 - (df[f'availability_{horizon}'] / horizon)
    occupancy_df[rate_col] = df[rate_col]

occupancy_summary = occupancy_df[[f'occupancy_rate_{h}' for h in [30, 60, 90, 365]]].describe().T
print(occupancy_summary[['mean', 'std', 'min', '25%', '50%', '75%', 'max']])

# Multiclass target based on the annual rate
bins = [0.0, 0.4, 0.7, 1.01]
labels = ['low', 'mid', 'high']
df['occupancy_level'] = pd.cut(
    df['occupancy_rate_365'].clip(lower=0, upper=1),
    bins=bins, labels=labels, right=False, include_lowest=True
)

level_counts = df['occupancy_level'].value_counts(dropna=False).sort_index()
print('Distribution of occupancy_level:')
print(level_counts)
print('Proportions:')
print((level_counts / len(df)).round(3))


**Decision:** The multiclass label `occupancy_level` (low/mid/high) will represent the expected occupancy. The modelling goal is to predict this level without using future-availability variables.


## 5. Framing the supervised problem
We frame a **multiclass classification** problem whose target variable is `occupancy_level`, built from the estimated annual occupancy rate. The levels are:
- `high`: occupancy >= 70%
- `mid`: occupancy >= 40% and < 70%
- `low`: occupancy < 40%

The goal is to anticipate the occupancy category using attributes of the listing, its host and its review history, without resorting to future availabilities. Models will be evaluated with metrics such as macro F1, balanced accuracy and confusion matrices, so that performance is checked on all three classes rather than only on the most frequent one.
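
As a rough illustration of these metrics, the sketch below computes them with scikit-learn on hand-written labels. The label vectors are hypothetical and do not come from any model in this notebook.

```python
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, f1_score

labels = ["low", "mid", "high"]

# Hypothetical true and predicted labels for eight listings.
y_true = ["low", "low", "low", "low", "mid", "mid", "high", "high"]
y_pred = ["low", "low", "low", "mid", "mid", "low", "high", "high"]

# Macro F1: unweighted mean of the per-class F1 scores.
macro_f1 = f1_score(y_true, y_pred, labels=labels, average="macro")

# Balanced accuracy: unweighted mean of the per-class recalls.
bal_acc = balanced_accuracy_score(y_true, y_pred)

# Rows are true classes and columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_true, y_pred, labels=labels)
```

Both scores weight the three classes equally, so a classifier that ignores a minority class is penalized even when plain accuracy looks acceptable.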

The features considered combine location attributes (`latitude`, `longitude`, `neighbourhood_cleansed`), configuration (`room_type`, `property_type`, `accommodates`, `bedrooms`), reputation (`number_of_reviews`, `reviews_per_month`, `host_is_superhost`) and metrics derived from host activity (`host_tenure_days`, `reviews_per_year`, `days_since_last_review`).


**Target definition:** `occupancy_level` is obtained by discretizing the annual occupancy rate into three left-closed bands: `low` = [0.0, 0.4), `mid` = [0.4, 0.7) and `high` = [0.7, 1.0]. This captures distinct demand profiles and avoids depending on future availability as a feature. Records without enough information to compute the rate get a null target and are discarded before the train/test split.
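
The behaviour at the band edges can be checked with a short sketch. The rates below are hypothetical values chosen to sit on and around the thresholds; the `pd.cut` arguments follow the ones used in section 4.

```python
import pandas as pd

# Hypothetical occupancy rates placed on and around the band edges.
rates = pd.Series([0.0, 0.39, 0.4, 0.69, 0.7, 1.0])

levels = pd.cut(
    rates,
    bins=[0.0, 0.4, 0.7, 1.01],   # upper edge slightly above 1 so that a rate of 1.0 is kept
    labels=["low", "mid", "high"],
    right=False,                  # intervals are closed on the left: [a, b)
)
```

With `right=False`, a rate of exactly 0.4 falls in `mid` and a rate of exactly 0.7 falls in `high`, which matches the `>=` wording of the level definitions.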


## 6. Feature engineering and preprocessing


### 6.1 Encoding amenities

We generate binary variables from the most frequent amenities so they can be used as interpretable features.


In [None]:
# Normalize the amenities column and generate binary indicators
def normalize_amenities(value):
    if pd.isna(value):
        return []
    text = str(value).strip()
    if not text:
        return []
    candidate = text.replace('{', '[').replace('}', ']')
    try:
        parsed = json.loads(candidate)
    except json.JSONDecodeError:
        try:
            parsed = ast.literal_eval(candidate)
        except (ValueError, SyntaxError):
            parsed = [item.strip() for item in candidate.strip('[]').split(',')]
    normalized = []
    for item in parsed:
        token = str(item).strip().strip('"').strip("'")
        if token:
            normalized.append(token.lower())
    return sorted(set(normalized))

amenity_lists = df['amenities'].apply(normalize_amenities)
amenity_counts = amenity_lists.explode().value_counts()
top_k = 20
top_amenities = amenity_counts.head(top_k)

def amenity_to_col(amenity):
    slug = re.sub(r'[^a-z0-9]+', '_', amenity).strip('_')
    return f'amenity_{slug}' if slug else 'amenity_other'

amenities_feature_cols = []
for amenity in top_amenities.index:
    base_name = amenity_to_col(amenity)
    name = base_name
    suffix = 1
    while name in amenities_feature_cols:
        suffix += 1
        name = f"{base_name}_{suffix}"
    amenities_feature_cols.append(name)
    df[name] = amenity_lists.apply(lambda items, target=amenity: int(target in items))

top_amenities.to_frame(name='count').assign(percent=lambda data: data['count'] / len(df))



In [None]:
# Occupancy-oriented feature engineering
# Fill missing reviews per month with 0 (missing suggests the listing has no reviews)
df['reviews_per_month_filled'] = df['reviews_per_month'].fillna(0)

# Host tenure and time since the last review
reference_date = df['last_scraped'].max()
df['host_tenure_days'] = (reference_date - df['host_since']).dt.days

df['days_since_last_review'] = (reference_date - df['last_review']).dt.days

# Review density: accumulated reviews over tenure in years (non-positive tenure becomes NaN)
df['reviews_per_year'] = df['number_of_reviews'] / (df['host_tenure_days'] / 365)
df.loc[df['host_tenure_days'] <= 0, 'reviews_per_year'] = np.nan

# Build the dataset without availability columns to avoid information leakage
amenities_feature_cols = globals().get('amenities_feature_cols', [])
base_feature_cols = [
    'accommodates', 'bedrooms', 'beds', 'minimum_nights', 'maximum_nights',
    'price_imputed',
    'number_of_reviews', 'reviews_per_month_filled', 'latitude', 'longitude',
    'host_is_superhost', 'instant_bookable', 'room_type', 'property_type',
    'neighbourhood_cleansed', 'host_tenure_days', 'days_since_last_review', 'reviews_per_year'
]

feature_cols = base_feature_cols + amenities_feature_cols

model_df = df[feature_cols + ['occupancy_level']].copy()
model_df.head()



In [None]:
target_col = 'occupancy_level'
X = model_df.drop(columns=[target_col])
y = model_df[target_col]

print(f'Total observations for modelling: {len(X)}')
print(f'Records without target: {y.isna().sum()} (dropped before the split)')


In [None]:
from sklearn.model_selection import train_test_split

valid_rows = y.notna()
X_valid = X.loc[valid_rows]
y_valid = y.loc[valid_rows]

# Stratify so that low/mid/high keep the same proportions in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X_valid, y_valid, test_size=0.2, random_state=42, stratify=y_valid
)

print(f"Train: {X_train.shape}, Test: {X_test.shape}")


In [None]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler
from sklearn.impute import SimpleImputer

amenities_feature_cols = globals().get('amenities_feature_cols', [])

numeric_features = [
    'accommodates', 'bedrooms', 'beds', 'minimum_nights', 'maximum_nights',
    'number_of_reviews', 'reviews_per_month_filled', 'latitude', 'longitude',
    'host_tenure_days', 'days_since_last_review', 'reviews_per_year'
]

numeric_features += amenities_feature_cols

price_feature = ['price_imputed']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

price_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', MinMaxScaler())
])

categorical_features = [
    'host_is_superhost', 'instant_bookable', 'room_type', 'property_type', 'neighbourhood_cleansed'
]

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('price', price_transformer, price_feature),
        ('cat', categorical_transformer, categorical_features)
    ]
)

X_train_prepared = preprocessor.fit_transform(X_train)
X_test_prepared = preprocessor.transform(X_test)

print(f'Transformed matrix - train: {X_train_prepared.shape}')
print(f'Transformed matrix - test: {X_test_prepared.shape}')



### Target balance (low/mid/high)


In [None]:
y_train.value_counts(normalize=True).sort_index()


## 7. Feature selection and dimensionality reduction


In [None]:
# Relevance of numeric features (ANOVA F-score)
from sklearn.feature_selection import f_classif

numeric_matrix = preprocessor.named_transformers_['num'].transform(X_train[numeric_features])
f_scores, p_values = f_classif(numeric_matrix, y_train)

anova_scores = (pd.Series(f_scores, index=numeric_features)
                  .sort_values(ascending=False))
anova_scores


In [None]:
from functools import partial
from sklearn.feature_selection import SelectKBest, mutual_info_classif

numeric_matrix = preprocessor.named_transformers_['num'].transform(X_train[numeric_features])
# Fix random_state so the mutual-information estimate (and the selection) is reproducible
mi_score = partial(mutual_info_classif, random_state=42)
selector = SelectKBest(score_func=mi_score, k=min(8, len(numeric_features)))
selector.fit(numeric_matrix, y_train)

selected_numeric = [numeric_features[i] for i in selector.get_support(indices=True)]
print('Best numeric features according to mutual information:', selected_numeric)


In [None]:
from sklearn.decomposition import PCA

# Apply PCA to the imputed and standardized numeric features.
# Use transform (not fit_transform) so the already-fitted preprocessor is not refit.
numeric_imputed = preprocessor.named_transformers_['num'].transform(X_train[numeric_features])
pca = PCA().fit(numeric_imputed)
explained_variance = pd.DataFrame({
    'componente': np.arange(1, len(pca.explained_variance_ratio_) + 1),
    'var_ratio': pca.explained_variance_ratio_,
    'var_acumulada': np.cumsum(pca.explained_variance_ratio_)
})
explained_variance.head(10)


In [None]:
plt.plot(explained_variance['componente'], explained_variance['var_acumulada'], marker='o')
plt.xlabel('Number of components')
plt.ylabel('Cumulative variance')
plt.title('PCA explained-variance curve')
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()


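**Interpretation:** With ~6 components about 80% of the numeric variance is retained, which is useful if the feature space needs to be compacted before training classifiers that are sensitive to dimensionality (e.g. logistic regression). The exact count should be read from the cumulative-variance table, since it changes with the number of amenity indicators included.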
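
Rather than reading the number of components off the plot, it can be derived from the cumulative curve. The sketch below uses hypothetical variance ratios standing in for `pca.explained_variance_ratio_`; the 0.80 threshold is the one mentioned in the interpretation.

```python
import numpy as np

# Hypothetical explained-variance ratios (not taken from the real dataset).
var_ratio = np.array([0.30, 0.20, 0.12, 0.10, 0.06, 0.05, 0.05, 0.04, 0.04, 0.04])
cumulative = np.cumsum(var_ratio)

threshold = 0.80
# Index of the first cumulative value that reaches the threshold, as a 1-based count.
n_components = int(np.searchsorted(cumulative, threshold) + 1)
```

scikit-learn can make a similar choice internally: passing a float between 0 and 1 as `n_components` to `PCA` keeps enough components to exceed that share of variance.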


## 8. Suggested next steps
- Train multiclass classifiers (`LogisticRegression`, `RandomForestClassifier`, `XGBoost`) using the `preprocessor` pipeline.
- Measure macro F1 and balanced accuracy, and analyse the confusion matrix for the `low/mid/high` levels.
- Adjust thresholds or decision rules according to the commercial strategy (e.g. prioritize recall for `high`).
- Incorporate new variables available when the listing is created (amenities, policies, text) to improve predictive power without depending on future availability.
- If occupancy is instead modelled as a continuous rate, train a baseline regressor (e.g. `RandomForestRegressor`, `GradientBoostingRegressor`) using the `preprocessor` pipeline, validate it with a robust metric (MAE) and compare it against a baseline such as the median.
- Explore neighbourhood-specific engineering (listing density, mean price per area) and availability, taking care not to feed the model the availability columns that define the target.
- Evaluate tuning strategies (GridSearch/Optuna) and drift monitoring with `wandb`.
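
As a starting point for the first suggestion, a classifier can be chained after a preprocessing step in a single `Pipeline`. The sketch below is self-contained and uses random toy data with hypothetical values. `toy_preprocessor` is a reduced stand-in for the notebook's `preprocessor`, and the choice of `class_weight="balanced"` is an assumption, not a tuned setting.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 120

# Toy features and labels (random, for illustration only).
toy_X = pd.DataFrame({
    "accommodates": rng.integers(1, 8, size=n),
    "price_imputed": rng.uniform(20, 300, size=n),
    "room_type": rng.choice(["Entire home/apt", "Private room"], size=n),
})
toy_y = pd.Series(rng.choice(["low", "mid", "high"], size=n))

numeric_steps = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
toy_preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_steps, ["accommodates", "price_imputed"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["room_type"]),
])

# Preprocessing and model are fitted together, so cross-validation refits both per fold.
clf = Pipeline(steps=[
    ("preprocess", toy_preprocessor),
    ("model", LogisticRegression(max_iter=1000, class_weight="balanced")),
])
clf.fit(toy_X, toy_y)
toy_pred = clf.predict(toy_X)
```

Keeping the preprocessor inside the pipeline means imputation and scaling statistics are learned only from the training portion of each split, which avoids leaking test information into the model.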
