# Proyecto Final — Predicción del Total de Venta en Coffee Shop
### Autor: *Estudiante*

Este notebook forma parte del **proyecto final** para aplicar técnicas de *feature selection*, *regresión* y *evaluación de métricas*.
El objetivo es predecir el monto total de una transacción (`Total_Bill`) utilizando un dataset de ventas de una cafetería.

## 1. Carga de librerías y dataset

In [None]:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

# Cargar dataset
df = pd.read_csv('Dataset Coffeshop.csv')
df.head()


## 2. Exploración inicial del dataset

In [None]:

print("Dimensiones:", df.shape)
print("Columnas:", df.columns.tolist())
print("\nValores nulos por columna:\n", df.isnull().sum())
df.describe(include='all').T.head(15)


## 3. Selección del target y preparación de features

In [None]:

target_col = 'Total_Bill'
drop_cols = ['transaction_id', 'transaction_date', 'transaction_time']
features = [c for c in df.columns if c not in drop_cols + [target_col]]

df2 = df[features + [target_col]].copy()
num_cols = df2.select_dtypes(include=[np.number]).columns.tolist()
cat_cols = [c for c in df2.columns if c not in num_cols and c != target_col]

imp_num = SimpleImputer(strategy='median')
imp_cat = SimpleImputer(strategy='most_frequent')
if num_cols: df2[num_cols] = imp_num.fit_transform(df2[num_cols])
if cat_cols: df2[cat_cols] = imp_cat.fit_transform(df2[cat_cols])

df_enc = pd.get_dummies(df2.drop(columns=[target_col]), drop_first=True)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_enc)
y = df2[target_col].values

print("Dimensiones tras one-hot:", X_scaled.shape)


## 4. Selección de características (Feature Selection)

In [None]:

sel_model = RandomForestRegressor(n_estimators=40, random_state=42, n_jobs=1)
sample_idx = np.random.RandomState(42).choice(X_scaled.shape[0], min(30000, X_scaled.shape[0]), replace=False)
sel_model.fit(X_scaled[sample_idx], y[sample_idx])

sfm = SelectFromModel(sel_model, prefit=True, threshold='median')
X_reduced = sfm.transform(X_scaled)
selected_features = df_enc.columns[sfm.get_support()].tolist()

print("Features seleccionadas:", len(selected_features))


## 5. Entrenamiento del modelo de regresión

In [None]:

X_train_full, X_test_full, y_train_full, y_test_full = train_test_split(X_reduced, y, test_size=0.2, random_state=42)

# Muestreo para entrenamiento rápido
train_sample_size = min(30000, X_train_full.shape[0])
train_idx = np.random.RandomState(1).choice(X_train_full.shape[0], train_sample_size, replace=False)
X_train = X_train_full[train_idx]
y_train = y_train_full[train_idx]
X_test = X_test_full
y_test = y_test_full

final_model = RandomForestRegressor(n_estimators=80, random_state=42, n_jobs=1)
final_model.fit(X_train, y_train)
y_pred = final_model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.4f}, RMSE: {rmse:.4f}, R2: {r2:.4f}")


## 6. Importancia de características

In [None]:

importances = pd.Series(final_model.feature_importances_, index=selected_features).sort_values(ascending=False)
importances.head(20).plot(kind='barh', figsize=(6,6), title='Top 20 Features Importantes')
plt.show()


## 7. Evaluación visual (Predicciones vs Reales)

In [None]:

plt.figure(figsize=(6,5))
plt.scatter(y_test, y_pred, alpha=0.2)
plt.xlabel('True Total_Bill')
plt.ylabel('Predicted Total_Bill')
plt.title('Real vs Predicho')
plt.tight_layout()
plt.show()


## 8. Conclusiones


- **Target:** `Total_Bill`
- **Feature selection:** RandomForest con umbral de importancia mediana → reducción de 103 a ~52 features.
- **Modelo:** RandomForestRegressor (80 árboles, muestreo de 30k filas).
- **Resultados:**  
  - R² ≈ 0.90  
  - RMSE ≈ 1.46  
- **Conclusión:** El modelo logra predecir con alta precisión el monto total de la cuenta.  
  Posibles mejoras: ajustar hiperparámetros, probar XGBoost o LightGBM, agregar variables derivadas (ej. `qty * unit_price`).