<h2>Importing Libraries</h2>
  <p>
    This code brings in the external tools needed for the first steps
    (<code>AdaBoostRegressor</code> and <code>RandomForestRegressor</code> are imported later, in the cells that use them):
  </p>
  <ul>
    <li><strong>pandas</strong> for data manipulation (aliased as <code>pd</code>).</li>
    <li><code>fetch_california_housing</code> to load the California housing dataset (downloaded and cached on first use).</li>
    <li><code>train_test_split</code> to divide our data into training and test sets.</li>
    <li><code>mean_squared_error</code> and <code>r2_score</code> for evaluating regression performance.</li>
    <li><code>XGBRegressor</code> from XGBoost as one of our tree-based regression models.</li>
  </ul>

In [1]:
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from xgboost import XGBRegressor


<h2>Loading & Assembling the Dataset in Detail</h2>

<p>Using this code, we:</p>
<ol>
  <li><strong>Fetch the California housing data</strong> as a pandas DataFrame and Series (via <code>as_frame=True</code>):
    <ul>
      <li><code>data.data</code> is a DataFrame of 8 feature columns.</li>
      <li><code>data.target</code> is a Series of the target values (median house values).</li>
    </ul>
  </li>
  <li><strong>Concatenate</strong> those two parts side-by-side into one DataFrame <code>df</code>, renaming the target column to <code>Target</code>.</li>
  <li><strong>Inspect</strong> the first five rows with <code>df.head()</code> to confirm everything loaded correctly.</li>
</ol>
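<p>
  The concatenation step can be sketched on toy data. The names <code>features</code>, <code>target</code>, and <code>toy_df</code> below are illustrative stand-ins, not part of the notebook; the point is only that <code>pd.concat(..., axis=1)</code> places columns side by side and matches rows by index label.
</p>

```python
import pandas as pd

# Toy stand-ins for data.data (features) and data.target (labels).
features = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})
target = pd.Series([0.5, 1.5, 2.5], name="y")

# axis=1 puts the Series next to the feature columns. Rows are matched
# by index label, so both pieces should share the same index.
toy_df = pd.concat([features, target.rename("Target")], axis=1)

print(toy_df.columns.tolist())
print(toy_df.shape)
```

<p>
  If the two pieces had different indexes, the result would contain missing values wherever a label appears in only one of them, which is why checking the first rows afterwards is worthwhile.
</p>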

<p><strong>Feature (Column) Descriptions:</strong></p>
<dl>
  <dt>MedInc</dt>
  <dd>
    <em>Median Income</em> of households in the block group, measured in tens of thousands of U.S. dollars.  
    A value of 3.5 → \$35,000.
  </dd>

  <dt>HouseAge</dt>
  <dd>
    <em>Median House Age</em> in the block group, in years.  
    Higher values indicate older housing stock.
  </dd>

  <dt>AveRooms</dt>
  <dd>
    <em>Average Rooms per Household</em>.  
    Computed as total rooms ÷ number of households.
  </dd>

  <dt>AveBedrms</dt>
  <dd>
    <em>Average Bedrooms per Household</em>.  
    Total bedrooms ÷ number of households.
  </dd>

  <dt>Population</dt>
  <dd>
    <em>Block Group Population</em>.  
    Total number of people living in that block group.
  </dd>

  <dt>AveOccup</dt>
  <dd>
    <em>Average Occupancy</em> (household size).  
    Total population ÷ number of households.
  </dd>

  <dt>Latitude</dt>
  <dd>
    <em>Block Group Latitude</em> coordinate (in degrees).  
    Indicates north–south position.
  </dd>

  <dt>Longitude</dt>
  <dd>
    <em>Block Group Longitude</em> coordinate (in degrees).  
    Indicates east–west position.
  </dd>

  <dt>Target</dt>
  <dd>
    <em>Median House Value</em> for the block group, in hundreds of thousands of U.S. dollars.  
    A value of 2.5 → \$250,000.
  </dd>
</dl>


In [2]:
data = fetch_california_housing(as_frame=True)
df = pd.concat([data.data, data.target.rename("Target")], axis=1)
df.head()


   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  Target
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88    -122.23   4.526
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86    -122.22   3.585
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85    -122.24   3.521
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85    -122.25   3.413
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85    -122.25   3.422
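
<p>
  To make the units in the feature descriptions concrete, here is a minimal sketch that converts the first row's <code>MedInc</code> and <code>Target</code> values into dollars. The helper names are hypothetical and exist only for this illustration.
</p>

```python
def medinc_to_dollars(medinc):
    """MedInc is expressed in units of $10,000."""
    return medinc * 10_000


def target_to_dollars(target):
    """Target (median house value) is expressed in units of $100,000."""
    return target * 100_000


# Values taken from the first row of the df.head() output.
first_medinc, first_target = 8.3252, 4.526

print(f"Median income:      ${medinc_to_dollars(first_medinc):,.0f}")
print(f"Median house value: ${target_to_dollars(first_target):,.0f}")
```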


  <h2>Defining Features and Splitting Data</h2>
  <p>
    The DataFrame is split into two parts:
  </p>
  <ul>
    <li><strong>X</strong> holds all predictor columns.</li>
    <li><strong>y</strong> holds the target column (“Target”).</li>
  </ul>
  <p>
    We then use <code>train_test_split</code> to randomly shuffle the rows and carve off 10% of them as a test set
    (2,064 rows in this run), keeping 90% for training. The fixed <code>random_state</code> makes the split reproducible.
  </p>

In [3]:
X = df.iloc[:, :-1]  # All columns except "Target"
y = df["Target"]     # The column we want to predict

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=89)
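
<p>
  Conceptually, the split shuffles row positions and slices off a fraction. The sketch below is an illustration only: <code>simple_split</code> is a hypothetical helper, it uses Python's own random number generator rather than scikit-learn's, and the round-up rule for the test size is an assumption. It will therefore not select the same rows as <code>train_test_split</code>, but it shows where the 2,064 test rows come from.
</p>

```python
import math
import random


def simple_split(n_rows, test_size, seed):
    """Shuffle row positions and carve off a test fraction.

    Illustrative only: scikit-learn uses its own random number generator,
    so the rows chosen by train_test_split will differ.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * n_rows)  # assumed rounding rule
    return indices[n_test:], indices[:n_test]


# 20,640 is the row count of the California housing dataset.
train_idx, test_idx = simple_split(n_rows=20_640, test_size=0.1, seed=89)
print(len(train_idx), len(test_idx))
```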


<h2>Training & Predicting with XGBoost</h2>
  <p>
    An <code>XGBRegressor</code> model is instantiated with 100 boosting rounds and a squared-error objective.
    We fit it on the training data and then generate predictions on the held-out test set, which we store for later comparison.
  </p>

In [4]:
model = XGBRegressor(objective='reg:squarederror', n_estimators=100, random_state=89)
model.fit(X_train, y_train)

y_pred_xgb = model.predict(X_test)
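
<p>
  To give a feel for what "boosting rounds" with a squared-error objective means, here is a deliberately simplified sketch in plain Python. It is not XGBoost's algorithm (which adds regularization, second-order gradient information, and much more); it only shows the core idea that each round fits a small tree to the residuals left by the previous rounds. The functions <code>fit_stump</code>, <code>boost</code>, and <code>mse</code>, the toy data, and the learning rate of 0.3 are all assumptions made for this illustration.
</p>

```python
def fit_stump(xs, residuals):
    """Return the one-split regressor that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds, learning_rate=0.3):
    """Fit n_rounds stumps, each on the residuals left by the earlier ones."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        # For squared error, the residual y - prediction is the direction
        # in which each prediction should move.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    """Mean squared error of a fitted model on the given points."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy 1-D regression problem: y = x^2 on a small grid.
xs = [i / 10 for i in range(40)]
ys = [x ** 2 for x in xs]

for rounds in (1, 10, 100):
    print(rounds, round(mse(boost(xs, ys, rounds), xs, ys), 4))
```

<p>
  On this toy problem the training error shrinks as rounds are added. On real data, more rounds can eventually overfit, which is why performance is judged on the held-out test set.
</p>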


  <h2>Comparing True vs. XGBoost Predictions</h2>
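  <p>
    We assemble a table showing the actual test-set values alongside the XGBoost predictions.
    Because the table is built from <code>y_test.values</code> (a plain array), it already has a clean 0-based index,
    so the <code>reset_index(drop=True)</code> call is a harmless no-op. Scanning the two columns side by side gives a quick,
    informal sense of where the model misses badly, such as row 2 in the output (0.325 actual vs. about 1.114 predicted).
  </p>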

In [5]:
comparison_df = pd.DataFrame({
    'True Values': y_test.values,
    'XGBoost Predicted Values': y_pred_xgb
})

print(comparison_df.reset_index(drop=True))


      True Values  XGBoost Predicted Values
0           0.678                  0.760134
1           1.633                  1.464027
2           0.325                  1.113717
3           0.974                  1.455863
4           0.689                  0.622672
...           ...                       ...
2059        2.089                  2.243745
2060        2.738                  1.955471
2061        2.939                  2.900734
2062        1.050                  1.018348
2063        0.958                  1.181661

[2064 rows x 2 columns]
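
<p>
  Scanning 2,064 rows by eye is tedious, so one possible refinement is to add an absolute-error column and sort by it. The sketch below rebuilds only the first five rows of the printed table; <code>sample</code>, <code>worst_first</code>, and the <code>Abs Error</code> column are names introduced here for illustration.
</p>

```python
import pandas as pd

# First five rows of the comparison table printed above.
sample = pd.DataFrame({
    "True Values": [0.678, 1.633, 0.325, 0.974, 0.689],
    "XGBoost Predicted Values": [0.760134, 1.464027, 1.113717, 1.455863, 0.622672],
})

# Absolute error per row, then sort so the worst misses come first.
sample["Abs Error"] = (
    sample["True Values"] - sample["XGBoost Predicted Values"]
).abs()
worst_first = sample.sort_values("Abs Error", ascending=False)

print(worst_first)
```

<p>
  The same two lines would work on the full <code>comparison_df</code>, assuming its column names match those used here.
</p>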


  <h2>Training & Comparing AdaBoost</h2>
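  <p>
    This code repeats the previous workflow using <code>AdaBoostRegressor</code> with up to 100 weak learners
    (by default, shallow decision trees of depth 3). Note that the <code>model</code> variable is reused, so it now refers to the AdaBoost model.
    After fitting on the same training split and predicting, we again build and display a comparison table
    between true values and AdaBoost’s estimates.
  </p>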

In [6]:
from sklearn.ensemble import AdaBoostRegressor
model = AdaBoostRegressor(n_estimators=100, random_state=89)
model.fit(X_train, y_train)
y_pred_ada = model.predict(X_test)
comparison_df = pd.DataFrame({
    'True Values': y_test.values,
    'AdaBoost Predicted Values': y_pred_ada
})

print(comparison_df.reset_index(drop=True))

      True Values  AdaBoost Predicted Values
0           0.678                   1.476444
1           1.633                   1.600211
2           0.325                   2.257839
3           0.974                   2.891567
4           0.689                   1.490360
...           ...                        ...
2059        2.089                   2.116662
2060        2.738                   2.653919
2061        2.939                   3.396618
2062        1.050                   2.083498
2063        0.958                   1.555354

[2064 rows x 2 columns]


  <h2>Training & Comparing Random Forest</h2>
  <p>
    A <code>RandomForestRegressor</code> with 100 trees is trained on the same data.
    We predict on the test set and present the side-by-side table of true vs. Random Forest predictions
    to visually assess its accuracy relative to the other two models.
  </p>

In [7]:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=100, random_state=89)
model.fit(X_train, y_train)
y_pred_rf = model.predict(X_test)
comparison_df = pd.DataFrame({
    'True Values': y_test.values,
    'Random Forest Predicted Values': y_pred_rf
})

print(comparison_df.reset_index(drop=True))


      True Values  Random Forest Predicted Values
0           0.678                        0.666340
1           1.633                        1.580540
2           0.325                        1.903001
3           0.974                        1.367020
4           0.689                        0.621760
...           ...                             ...
2059        2.089                        2.003770
2060        2.738                        1.957180
2061        2.939                        3.242730
2062        1.050                        1.019100
2063        0.958                        1.435980

[2064 rows x 2 columns]


  <h2>Evaluating XGBoost</h2>
  <p>
    We compute two key regression metrics on the XGBoost predictions:
  </p>
  <ul>
    <li><strong>Mean Squared Error (MSE):</strong> the average of squared prediction errors, in squared target units. Lower is better.</li>
    <li><strong>R² Score:</strong> the proportion of variance in the target explained by the model. A score of 1.0 is a perfect fit, and 0 matches always predicting the mean.</li>
  </ul>
  <p>
    Printing these values gives a numeric summary of XGBoost’s performance.
  </p>

In [8]:
mse = mean_squared_error(y_test, y_pred_xgb)
r2 = r2_score(y_test, y_pred_xgb)

print("Mean Squared Error:", mse)
print("R^2 Score:", r2)

Mean Squared Error: 0.1895106604397491
R^2 Score: 0.8501974586838057
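
<p>
  The two metrics are simple enough to compute by hand. The sketch below implements the standard formulas in plain Python and applies them to the first five rows of the XGBoost comparison table. The function names are hypothetical, and because only five rows are used, the printed values will not match the full test-set figures above.
</p>

```python
def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared differences between truth and prediction."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score_manual(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# First five rows of the XGBoost comparison table shown earlier.
y_true = [0.678, 1.633, 0.325, 0.974, 0.689]
y_hat = [0.760134, 1.464027, 1.113717, 1.455863, 0.622672]

print("MSE on these five rows:", mean_squared_error_manual(y_true, y_hat))
print("R^2 on these five rows:", r2_score_manual(y_true, y_hat))
```

<p>
  Note that R² can be negative when a model does worse than simply predicting the mean of the true values.
</p>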


  <h2>Evaluating AdaBoost</h2>
  <p>
    The same two metrics (MSE and R²) are calculated for the AdaBoost predictions.
    Comparing these numbers to XGBoost’s helps us judge which algorithm captures the data patterns best.
  </p>

In [9]:
mse = mean_squared_error(y_test, y_pred_ada)
r2 = r2_score(y_test, y_pred_ada)

print("Mean Squared Error:", mse)
print("R^2 Score:", r2)

Mean Squared Error: 0.67750204520055
R^2 Score: 0.46445478115871663


  <h2>Evaluating Random Forest</h2>
  <p>
    Finally, we compute MSE and R² for the Random Forest’s predictions on the test set.
    With all three models evaluated on identical metrics and data, we can make an apples-to-apples comparison
    of their regression accuracy.
  </p>

In [10]:
mse = mean_squared_error(y_test, y_pred_rf)
r2 = r2_score(y_test, y_pred_rf)

print("Mean Squared Error:", mse)
print("R^2 Score:", r2)

Mean Squared Error: 0.21504261059267016
R^2 Score: 0.8300152113696395


<h2>Performance Summary of Regression Models</h2>

<ul>
  <li>
    <strong>XGBoost</strong><br>
    • <em>Mean Squared Error (MSE):</em> 0.1895 (lowest error)<br>
    • <em>R² Score:</em> 0.8502 (explains 85% of variance)
  </li>
  <li>
    <strong>Random Forest</strong><br>
    • <em>MSE:</em> 0.2150 (slightly higher error than XGBoost)<br>
    • <em>R² Score:</em> 0.8300 (explains 83% of variance)
  </li>
  <li>
    <strong>AdaBoost</strong><br>
    • <em>MSE:</em> 0.6775 (highest error)<br>
    • <em>R² Score:</em> 0.4645 (explains 46% of variance)
  </li>
</ul>

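<p>
  <strong>Interpretation:</strong>
  Among the three algorithms, <em>XGBoost</em> achieves the best regression performance on this test split,
  with the lowest mean squared error and the highest R².
  <em>Random Forest</em> follows closely (MSE 0.2150 vs. 0.1895), showing strong predictive power.
  <em>AdaBoost</em> underperforms in this task, with an MSE roughly 3.6 times that of XGBoost and far lower explained variance.
  All three models were trained with near-default settings on a single 90/10 split, so the ranking reflects this setup rather than tuned performance.
</p>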
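
<p>
  MSE is in squared target units, which makes it hard to read directly. One common way to interpret it is to take the square root (RMSE) and convert to dollars, since <code>Target</code> is in units of $100,000. The sketch below does this for the three test-set results reported above; <code>results</code>, <code>ranking</code>, and <code>rmse_in_dollars</code> are names introduced for this illustration.
</p>

```python
import math

# Test-set metrics copied from the notebook output above.
results = {
    "XGBoost": {"mse": 0.1895106604397491, "r2": 0.8501974586838057},
    "AdaBoost": {"mse": 0.67750204520055, "r2": 0.46445478115871663},
    "Random Forest": {"mse": 0.21504261059267016, "r2": 0.8300152113696395},
}


def rmse_in_dollars(mse):
    """Target is in units of $100,000, so RMSE scales by the same factor."""
    return math.sqrt(mse) * 100_000


# Rank models from lowest to highest MSE.
ranking = sorted(results, key=lambda name: results[name]["mse"])
for name in ranking:
    m = results[name]
    print(f"{name:<14} MSE={m['mse']:.4f}  R2={m['r2']:.4f}  "
          f"RMSE=${rmse_in_dollars(m['mse']):,.0f}")
```

<p>
  Read this way, RMSE gives a rough sense of the typical size of a prediction error in dollars, although it weights large misses more heavily than small ones.
</p>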
