<a href="https://colab.research.google.com/github/samipn/Pycaret/blob/main/03_regression_california_housing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Regression (California Housing) — PyCaret

# ✅ Enable GPU in Colab
*Runtime → Change runtime type → **T4 / L4 GPU** → Save.*  
This notebook passes `use_gpu=True` to `setup()`. Estimators with GPU support (e.g., XGBoost, CatBoost, LightGBM) use the GPU when one is available; the others train on the CPU.
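
Before relying on `use_gpu=True`, it can help to confirm that the runtime actually exposes a GPU. The following is a minimal sketch, assuming that a working `nvidia-smi` binary is a reasonable proxy for "GPU attached"; the helper name `gpu_visible` and the `USE_GPU` flag are hypothetical and not part of PyCaret.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if `nvidia-smi -L` runs successfully and lists a GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # No NVIDIA driver tools on PATH: assume a CPU-only runtime.
        return False
    try:
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


# Hypothetical flag that could be passed as setup(..., use_gpu=USE_GPU).
USE_GPU = gpu_visible()
print("GPU visible:", USE_GPU)
```

Passing the flag instead of a hard-coded `True` would let the same notebook run on a CPU-only runtime.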

This notebook uses the California Housing dataset; the regression target is `median_house_value`.

In [1]:
!pip install pycaret[full]

Collecting pycaret[full]
  Downloading pycaret-3.3.2-py3-none-any.whl.metadata (17 kB)
Collecting numpy<1.27,>=1.21 (from pycaret[full])
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting pandas<2.2.0 (from pycaret[full])
  Downloading pandas-2.1.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Collecting scipy<=1.11.4,>=1.6.1 (from pycaret[full])
  Downloading scipy-1.11.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Collecting joblib<1.4,>=1.2.0 (from pycaret[full])
  Downloading joblib-1.3.2-py3-none-any.whl.metadata (5.4 kB)
Collecting pyod>=1.1.3 (from pycaret[full])
  Downloading pyod-2.0.5-py3-none-any.whl.meta

In [3]:
import pandas as pd
url = "https://raw.githubusercontent.com/ageron/handson-ml/master/datasets/housing/housing.csv"
df = pd.read_csv(url)
# Basic feature engineering
df['rooms_per_household'] = df['total_rooms'] / (df['households'] + 1e-9)
df['bedrooms_per_room'] = df['total_bedrooms'] / (df['total_rooms'] + 1e-9)
df['population_per_household'] = df['population'] / (df['households'] + 1e-9)
df = df.dropna()
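
The ratio features above add `1e-9` to each denominator. A small sketch on a made-up three-row frame (the values and the `EPS` name are illustrative, not taken from the dataset) suggests what that guard does and does not do: a zero denominator yields a very large finite ratio instead of `inf`, while a missing `total_bedrooms` still propagates to `NaN` and is removed by `dropna()`.

```python
import pandas as pd

EPS = 1e-9  # same guard value as in the cell above

# Toy data: row 1 has a missing bedroom count, row 2 has zero households.
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "total_bedrooms": [129.0, None, 190.0],
    "population": [322.0, 2401.0, 496.0],
    "households": [126.0, 1138.0, 0.0],
})

toy["rooms_per_household"] = toy["total_rooms"] / (toy["households"] + EPS)
toy["bedrooms_per_room"] = toy["total_bedrooms"] / (toy["total_rooms"] + EPS)
toy["population_per_household"] = toy["population"] / (toy["households"] + EPS)

# Rows with any NaN (here, the missing bedroom count) are dropped.
clean = toy.dropna()
print(clean[["rooms_per_household", "bedrooms_per_room"]])
```

The zero-household row survives `dropna()` with an enormous `rooms_per_household`, so filtering such rows explicitly may be preferable if they occur in real data.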

In [4]:
from pycaret.regression import setup, compare_models, tune_model, blend_models, stack_models, finalize_model, predict_model, pull, save_model

s = setup(
    data=df,
    target='median_house_value',
    session_id=7,
    use_gpu=True,
    fold=5,
    log_experiment=False
)
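# Rank candidate models by 5-fold CV and keep the top three
best3 = compare_models(n_select=3)
# Tune each one; tune_model returns the original model when tuning does not improve it
tuned = [tune_model(m) for m in best3]
# Combine the three models into a single voting ensemble
blended = blend_models(estimator_list=tuned)
# Refit the ensemble on the full dataset, hold-out set included
final_model = finalize_model(blended)
# Caution: the hold-out rows are now training data, so the metrics below are optimistic
preds = predict_model(final_model)
preds.head()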

[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 0
[LightGBM] [Info] Number of data points in the train set: 2, number of used features: 0
[LightGBM] [Info] Using GPU Device: Tesla T4, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 16 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Start training from score 0.500000

|   | Description | Value |
|---|---|---|
| 0 | Session id | 7 |
| 1 | Target | median_house_value |
| 2 | Target type | Regression |
| 3 | Original data shape | (20433, 13) |
| 4 | Transformed data shape | (20433, 17) |
| 5 | Transformed train set shape | (14303, 17) |
| 6 | Transformed test set shape | (6130, 17) |
| 7 | Numeric features | 11 |
| 8 | Categorical features | 1 |
| 9 | Preprocess | True |




| ID | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
|---|---|---|---|---|---|---|---|---|
| lightgbm | Light Gradient Boosting Machine | 30876.4321 | 2147259895.0694 | 46331.4059 | 0.8372 | 0.2263 | 0.1709 | 1.322 |
| catboost | CatBoost Regressor | 31515.1777 | 2217268922.8875 | 47081.8415 | 0.832 | 0.2291 | 0.1735 | 7.108 |
| et | Extra Trees Regressor | 33987.9839 | 2592747837.7498 | 50908.6278 | 0.8036 | 0.2424 | 0.1871 | 1.026 |
| gbr | Gradient Boosting Regressor | 36640.7267 | 2789275273.851 | 52808.3239 | 0.7888 | 0.2608 | 0.205 | 3.944 |
| lasso | Lasso Regression | 49209.9836 | 4677013509.3435 | 68374.7238 | 0.6456 | 0.3669 | 0.2828 | 0.082 |
| llar | Lasso Least Angle Regression | 49259.9261 | 4680276115.7947 | 68398.0732 | 0.6454 | 0.371 | 0.284 | 0.08 |
| lr | Linear Regression | 49258.5235 | 4680301559.7121 | 68398.2573 | 0.6453 | 0.3711 | 0.284 | 0.122 |
| ridge | Ridge Regression | 49286.0332 | 4680688353.9327 | 68401.3422 | 0.6453 | 0.371 | 0.284 | 0.1 |
| br | Bayesian Ridge | 49276.408 | 4680474970.9673 | 68399.6991 | 0.6453 | 0.3709 | 0.284 | 0.132 |
| en | Elastic Net | 52785.3062 | 5147001898.8063 | 71735.2992 | 0.61 | 0.3689 | 0.3171 | 0.068 |
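
The metric columns in the comparison table follow the usual regression definitions. As a rough sketch in plain Python on made-up numbers (the function name `regression_metrics` is hypothetical, and this is not PyCaret's own implementation; RMSLE is assumed to use the `log1p` convention), they can be computed like this:

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute common regression metrics for two equal-length sequences."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)

    # R2 = 1 - (residual sum of squares / total sum of squares)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - (mse * n) / ss_tot

    # RMSLE compares log(1 + value), so it penalizes relative error
    rmsle = math.sqrt(
        sum((math.log1p(p) - math.log1p(t)) ** 2
            for t, p in zip(y_true, y_pred)) / n
    )

    # MAPE as a fraction (0.17 means 17%), assuming no zero targets
    mape = sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n

    return {"MAE": mae, "MSE": mse, "RMSE": rmse,
            "R2": r2, "RMSLE": rmsle, "MAPE": mape}


# Illustrative values only, not taken from the dataset.
metrics = regression_metrics(
    [100000.0, 200000.0, 300000.0],
    [110000.0, 190000.0, 330000.0],
)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Note that in the table the RMSE column is not exactly the square root of the MSE column (the square root of 2147259895.0694 is about 46338.5, while the reported RMSE is 46331.4059), presumably because each metric is averaged across the folds separately.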



[2025-10-26 20:42:49.084] [CUML] [info] Unused keyword parameter: n_jobs during cuML estimator initialization


| Fold | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|
| 0 | 30580.7587 | 2046414795.0238 | 45237.3164 | 0.842 | 0.2276 | 0.1692 |
| 1 | 30757.4807 | 2133416414.9355 | 46188.9209 | 0.845 | 0.2266 | 0.1694 |
| 2 | 30352.8701 | 2118797280.4122 | 46030.3952 | 0.846 | 0.2203 | 0.163 |
| 3 | 30772.6715 | 2172150810.026 | 46606.3387 | 0.8369 | 0.2296 | 0.1705 |
| 4 | 30911.5706 | 2231921686.5449 | 47243.2184 | 0.8188 | 0.2325 | 0.1729 |
| Mean | 30675.0703 | 2140540197.3885 | 46261.2379 | 0.8377 | 0.2273 | 0.169 |
| Std | 192.3317 | 61209315.4302 | 661.8631 | 0.01 | 0.004 | 0.0033 |
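
The `Mean` and `Std` rows summarize the five folds. As a quick check using only the standard library, the MAE column above can be recomputed by hand; the numbers suggest that `Std` is the population standard deviation (dividing by the number of folds) rather than the sample standard deviation. The variable names here are illustrative.

```python
import statistics

# Per-fold MAE values copied from the table above.
fold_mae = [30580.7587, 30757.4807, 30352.8701, 30772.6715, 30911.5706]

mean_mae = statistics.fmean(fold_mae)
population_std = statistics.pstdev(fold_mae)  # divides by n
sample_std = statistics.stdev(fold_mae)       # divides by n - 1

print(f"mean           = {mean_mae:.4f}")
print(f"population std = {population_std:.4f}")
print(f"sample std     = {sample_std:.4f}")
```

The mean and the population standard deviation come out at roughly 30675.07 and 192.33, in line with the reported rows, while the sample standard deviation is noticeably larger.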


Streaming output truncated to the last 5000 lines.
[LightGBM] [Info] 12 dense feature groups (0.05 MB) transferred to GPU in 0.000852 secs. 0 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 12 dense feature groups (0.05 MB) transferred to GPU in 0.000837 secs. 0 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 12 dense feature groups (0.05 MB) transferred to GPU in 0.000825 secs. 0 sparse feature groups
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 12 dense feature groups (0.05 MB) transferred to GPU in 0.000840 secs. 0 sparse feature groups
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 2610
[LightGBM] [Info] Number of data points in the train set: 11442, number of used features: 15
[LightGBM] [Info] Using GPU Device: Tesla T4, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs ha

| Fold | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|
| 0 | 35175.2084 | 2525167606.6413 | 50251.0458 | 0.805 | 0.2545 | 0.2004 |
| 1 | 36170.4814 | 2767679901.4534 | 52608.7436 | 0.7989 | 0.2557 | 0.201 |
| 2 | 36517.7557 | 2803488920.7134 | 52947.9832 | 0.7962 | 0.2534 | 0.1985 |
| 3 | 35187.5831 | 2692895389.4515 | 51893.115 | 0.7977 | 0.2528 | 0.1974 |
| 4 | 35365.44 | 2656214086.2841 | 51538.4719 | 0.7844 | 0.2538 | 0.1997 |
| Mean | 35683.2937 | 2689089180.9087 | 51847.8719 | 0.7965 | 0.254 | 0.1994 |
| Std | 554.7232 | 97189858.9396 | 941.9978 | 0.0067 | 0.001 | 0.0013 |



Fitting 5 folds for each of 10 candidates, totalling 50 fits


Original model was better than the tuned model, hence it will be returned. NOTE: The display metrics are for the tuned model (not the original one).


| Fold | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|
| 0 | 41604.8414 | 3784721949.5457 | 61520.0939 | 0.7078 | 0.284 | 0.2192 |
| 1 | 42812.7223 | 4151677085.4409 | 64433.509 | 0.6983 | 0.288 | 0.22 |
| 2 | 43221.8577 | 4228937350.0834 | 65030.2803 | 0.6926 | 0.2873 | 0.2181 |
| 3 | 40713.5269 | 3799904246.7542 | 61643.3634 | 0.7146 | 0.2849 | 0.2161 |
| 4 | 41932.7428 | 3917980996.547 | 62593.7776 | 0.682 | 0.2887 | 0.2208 |
| Mean | 42057.1382 | 3976644325.6742 | 63044.2048 | 0.6991 | 0.2866 | 0.2188 |
| Std | 889.1363 | 182099356.0103 | 1439.6401 | 0.0114 | 0.0018 | 0.0016 |



Fitting 5 folds for each of 10 candidates, totalling 50 fits


Original model was better than the tuned model, hence it will be returned. NOTE: The display metrics are for the tuned model (not the original one).


| Fold | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|
| 0 | 30150.545 | 2000185676.3941 | 44723.4354 | 0.8456 | 0.2233 | 0.1679 |
| 1 | 30778.5391 | 2144119309.8108 | 46304.6359 | 0.8442 | 0.2217 | 0.1684 |
| 2 | 30845.5348 | 2177235040.0272 | 46660.8513 | 0.8417 | 0.221 | 0.1662 |
| 3 | 30516.8641 | 2162470168.8337 | 46502.3673 | 0.8376 | 0.2248 | 0.1699 |
| 4 | 30799.8019 | 2204960788.754 | 46957.01 | 0.821 | 0.2276 | 0.1717 |
| Mean | 30618.257 | 2137794196.764 | 46229.66 | 0.838 | 0.2237 | 0.1688 |
| Std | 260.5145 | 71626736.2483 | 782.7729 | 0.0089 | 0.0024 | 0.0019 |



|   | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|---|
| 0 | Voting Regressor | 17629.6358 | 688446152.8972 | 26238.2574 | 0.9492 | 0.141 | 0.1007 |




|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | ocean_proximity | rooms_per_household | bedrooms_per_room | population_per_household | median_house_value | prediction_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6374 | -118.019997 | 34.169998 | 32.0 | 3868.0 | 548.0 | 1558.0 | 528.0 | 9.4667 | INLAND | 7.325758 | 0.141675 | 2.950758 | 500001.0 | 494631.118862 |
| 15061 | -116.949997 | 32.790001 | 19.0 | 11391.0 | 3093.0 | 7178.0 | 2905.0 | 2.0326 | <1H OCEAN | 3.92117 | 0.27153 | 2.470912 | 123200.0 | 121497.575805 |
| 13633 | -117.339996 | 34.080002 | 35.0 | 1380.0 | 248.0 | 730.0 | 264.0 | 3.2305 | INLAND | 5.227273 | 0.17971 | 2.765152 | 93700.0 | 98764.676908 |
| 17670 | -121.870003 | 37.299999 | 28.0 | 859.0 | 199.0 | 455.0 | 211.0 | 2.3293 | <1H OCEAN | 4.07109 | 0.231665 | 2.156398 | 215900.0 | 216305.870592 |
| 632 | -122.160004 | 37.720001 | 38.0 | 1007.0 | 245.0 | 618.0 | 239.0 | 2.875 | NEAR BAY | 4.213389 | 0.243297 | 2.585774 | 144800.0 | 143670.642969 |
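
`predict_model` appends a `prediction_label` column next to the target. A minimal sketch, using only the five rows shown above (copied by hand, and far too few to say anything about overall accuracy), of how per-row errors follow from those two columns:

```python
# (row index, median_house_value, prediction_label) copied from the table above.
rows = [
    (6374, 500001.0, 494631.118862),
    (15061, 123200.0, 121497.575805),
    (13633, 93700.0, 98764.676908),
    (17670, 215900.0, 216305.870592),
    (632, 144800.0, 143670.642969),
]

# Absolute error per row, keyed by the row index.
abs_errors = {idx: abs(pred - actual) for idx, actual, pred in rows}

sample_mae = sum(abs_errors.values()) / len(abs_errors)
worst_row = max(abs_errors, key=abs_errors.get)

for idx, err in abs_errors.items():
    print(f"row {idx}: absolute error {err:.2f}")
print(f"MAE over these five rows: {sample_mae:.2f}")
print(f"largest error at row {worst_row}")
```

On a full `preds` DataFrame the same idea would be a column subtraction, for example `(preds['prediction_label'] - preds['median_house_value']).abs()`.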


In [5]:
# Save the finalized blended model (PyCaret 3.x) and export metrics.
# save_model returns a (pipeline, filename) tuple, which is what gets printed.
saved = save_model(final_model, 'best_model')
print("Saved model:", saved)

# Export the most recently displayed results table (here, the predict_model metrics)
results_df = pull()
results_df.to_csv('experiment_results.csv', index=False)
print("Exported experiment results to experiment_results.csv")

Transformation Pipeline and Model Successfully Saved
Saved model: (Pipeline(memory=Memory(location=None),
         steps=[('numerical_imputer',
                 TransformerWrapper(include=['longitude', 'latitude',
                                             'housing_median_age',
                                             'total_rooms', 'total_bedrooms',
                                             'population', 'households',
                                             'median_income',
                                             'rooms_per_household',
                                             'bedrooms_per_room',
                                             'population_per_household'],
                                    transformer=SimpleImputer())),
                ('categorical_imputer',
                 TransformerWrapp...
                                    transformer=SimpleImputer(strategy='most_frequent'))),
                ('onehot_encoding',
                 Transforme