### Ensuring Consistency Across Training & Inference Datasets: Pipeline Integration
**Question**: Create and train a machine learning pipeline that ensures feature transformation consistency across training and inference datasets using scikit-learn.
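
As a minimal sketch of what "consistency" means here, the snippet below uses synthetic data (an assumption, not the dataset used in this notebook) to show that a fitted `Pipeline` rescales new inputs with the statistics learned from the training set. The variable names are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, purely illustrative (hypothetical, not the notebook's dataset).
rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_train = X_train @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)
X_new = rng.normal(loc=50.0, scale=10.0, size=(5, 3))

# Pipeline route: fit() learns the scaler statistics and the model together.
pipe = Pipeline([("scaler", StandardScaler()), ("regressor", LinearRegression())])
pipe.fit(X_train, y_train)
pipeline_pred = pipe.predict(X_new)

# Manual route: scale the new data with the *training* mean and std.
scaler = StandardScaler().fit(X_train)
model = LinearRegression().fit(scaler.transform(X_train), y_train)
manual_pred = model.predict(scaler.transform(X_new))

# Both routes should agree, because predict() reuses the fitted scaler.
same = bool(np.allclose(pipeline_pred, manual_pred))
print(same)
```

The manual route works only if the fitted scaler is carried along to inference by hand; the pipeline bundles that state with the model, so the two cannot drift apart.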

In [2]:
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error
import numpy as np

# --- Step 1: Load the Boston Housing Dataset ---
boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df['MEDV'] = boston.target

print("Boston Housing Dataset loaded successfully.\n")
print("First 5 rows of the dataset:")
print(df.head())

# --- Step 2: Separate Features and Target, Split Data ---
X = df.drop('MEDV', axis=1)
y = df['MEDV']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"\nShape of training features: {X_train.shape}")
print(f"Shape of testing features: {X_test.shape}")

# --- Step 3: Create a Machine Learning Pipeline ---
# The pipeline standardizes the features, adds degree-2 polynomial terms, and fits a linear regression model
pipeline = Pipeline([
    ('scaler', StandardScaler()),        # Step 1: Standardize features
    ('poly', PolynomialFeatures(degree=2, include_bias=False)), # Step 2: Add polynomial features (optional)
    ('regressor', LinearRegression())    # Step 3: Linear Regression model
])

print("\nMachine Learning Pipeline created:")
print(pipeline)

# --- Step 4: Train the Pipeline on the Training Data ---
pipeline.fit(X_train, y_train)  # the final regressor is supervised, so the target is required

print("\nPipeline trained on the training data.")

# --- Step 5: Evaluate the Pipeline on the Testing Data (Inference) ---
y_pred = pipeline.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"\nMean Squared Error on the testing data (inference): {mse:.2f}")

# --- Step 6: Simulate Inference with New Data ---
print("\n--- Simulating Inference with New Data ---")

# Assume we have a new data point for inference
new_data = np.array([[0.02731, 0.0, 7.07, 0.0, 0.469, 6.421, 78.9, 4.9671, 2.0, 242.0, 17.8, 396.90, 9.14]])
new_data_df = pd.DataFrame(new_data, columns=X.columns)
print("\nNew data point for inference:")
print(new_data_df)

# Use the *same* trained pipeline to make predictions on the new data.
# Pass the DataFrame so the column names match those seen during fit.
predicted_price = pipeline.predict(new_data_df)
print(f"\nPredicted price for the new data point: {predicted_price[0]:.2f}")

print("\nFeature transformation consistency ensured through the scikit-learn pipeline.")
print("The same scaling and polynomial feature generation (if included) applied during training")
print("are automatically applied during inference on the testing data and new data points.")

ImportError: 
`load_boston` has been removed from scikit-learn since version 1.2.

The Boston housing prices dataset has an ethical problem: as
investigated in [1], the authors of this dataset engineered a
non-invertible variable "B" assuming that racial self-segregation had a
positive impact on house prices [2]. Furthermore the goal of the
research that led to the creation of this dataset was to study the
impact of air quality but it did not give adequate demonstration of the
validity of this assumption.

The scikit-learn maintainers therefore strongly discourage the use of
this dataset unless the purpose of the code is to study and educate
about ethical issues in data science and machine learning.

In this special case, you can fetch the dataset from the original
source::

    import pandas as pd
    import numpy as np

    data_url = "http://lib.stat.cmu.edu/datasets/boston"
    raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
    data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
    target = raw_df.values[1::2, 2]

Alternative datasets include the California housing dataset and the
Ames housing dataset. You can load the datasets as follows::

    from sklearn.datasets import fetch_california_housing
    housing = fetch_california_housing()

for the California housing dataset and::

    from sklearn.datasets import fetch_openml
    housing = fetch_openml(name="house_prices", as_frame=True)

for the Ames housing dataset.

[1] M Carlisle.
"Racist data destruction?"
<https://medium.com/@docintangible/racist-data-destruction-113e3eff54a8>

[2] Harrison Jr, David, and Daniel L. Rubinfeld.
"Hedonic housing prices and the demand for clean air."
Journal of environmental economics and management 5.1 (1978): 81-102.
<https://www.researchgate.net/publication/4974606_Hedonic_housing_prices_and_the_demand_for_clean_air>
