
# Car Pricing — Regression Starter Notebook

This notebook builds a clean, reproducible baseline for predicting used vehicle prices from **`vehicles.csv`** using scikit-learn Pipelines:
- Data preview & sanity checks
- Train/validation split
- Preprocessing with `ColumnTransformer` (impute + encode + scale)
- Models: **Linear Regression** (baseline) and **RandomForestRegressor**
- Optional: **XGBRegressor** if `xgboost` is available
- Evaluation: RMSE, MAE, R² + predicted vs. actual plot

> Place this notebook in the same folder as `vehicles.csv`, or adjust `DATA_PATH` below.


## 1) Setup & Data Loading

In [4]:
# 1

import sys, sysconfig, platform
print("Python exe:", sys.executable)
print("Version   :", platform.python_version())


Python exe: C:\Users\Owner\anaconda3\python.exe
Version   : 3.11.7


In [5]:
# [1.1] Imports

!conda install -c conda-forge scikit-learn -y
!pip install scikit-learn matplotlib xgboost seaborn

import os
import sys

import numpy as np
import pandas as pd
!{sys.executable} -m pip install matplotlib
import matplotlib.pyplot as plt

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Optional XGBoost
try:
    from xgboost import XGBRegressor
    XGB_AVAILABLE = True
except Exception:
    XGB_AVAILABLE = False

print(sys.executable)

Channels:
 - conda-forge
 - defaults
Platform: win-64
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: C:\Users\Owner\anaconda3

  added / updated specs:
    - scikit-learn


The following NEW packages will be INSTALLED:

  python_abi         conda-forge/win-64::python_abi-3.11-2_cp311 
  scikit-learn       conda-forge/win-64::scikit-learn-1.7.2-py311h8a15ebc_0 
  threadpoolctl      conda-forge/noarch::threadpoolctl-3.6.0-pyhecae5ae_0 

The following packages will be UPDATED:

  openssl              pkgs/main::openssl-3.0.17-h35632f6_0 --> conda-forge::openssl-3.5.2-h725018a_0 
  vc                          pkgs/main::vc-14.2-h21ff451_1 --> conda-forge::vc-14.3-h41ae7f8_31 

The following packages will be SUPERSEDED by a higher-priority channel:

  certifi            pkgs/main/win-64::certifi-2025.8.3-py~ --> conda-forge/noarch::certifi-2025.8.3-pyhd8ed1ab_0 



Downloading and Extractin


RemoveError: 'setuptools' is a dependency of conda and cannot be removed from
conda's operating environment.
RemoveError: 'tqdm' is a dependency of conda and cannot be removed from
conda's operating environment.



C:\Users\Owner\anaconda3\python.exe


In [7]:
# 1.2

# Test whether %pip magic is available in this Jupyter environment
try:
    get_ipython().run_line_magic("pip", "--version")
    print("✅ %pip magic is available. Using it to install scikit-learn...")
    %pip install scikit-learn matplotlib seaborn xgboost
except Exception as e:
    print("⚠️ %pip magic is not available here.")
    print("Error was:", e)


pip 25.2 from C:\Users\Owner\anaconda3\Lib\site-packages\pip (python 3.11)

Note: you may need to restart the kernel to use updated packages.
✅ %pip magic is available. Using it to install scikit-learn...
Note: you may need to restart the kernel to use updated packages.


In [8]:
# 1.3

import sklearn; print("sklearn:", sklearn.__version__)


sklearn: 1.7.2


In [9]:
# [1.4]

# upgrade pip
!"{sys.executable}" -m pip install -U pip

# install scikit-learn into this interpreter
!"{sys.executable}" -m pip install -U scikit-learn




In [10]:
# [1.5]

# Load data

DATA_PATH = "vehicles.csv"  # adjust if needed

assert os.path.exists(DATA_PATH), f"File not found at {DATA_PATH}"
df = pd.read_csv(DATA_PATH)

print(df.shape)
df.head(3)

(426880, 18)


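| | id | region | price | year | manufacturer | model | condition | cylinders | fuel | odometer | title_status | transmission | VIN | drive | size | type | paint_color | state |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7222695916 | prescott | 6000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | az |
| 1 | 7218891961 | fayetteville | 11900 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ar |
| 2 | 7221797935 | florida keys | 21000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | fl |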


## 2) Quick EDA & Target Definition

In [11]:
# 2 

# Inspect columns and missingness

pd.set_option("display.max_columns", 200)
print("Columns:", list(df.columns))
print("\nMissing values (top 20):")
print(df.isna().sum().sort_values(ascending=False).head(20))


Columns: ['id', 'region', 'price', 'year', 'manufacturer', 'model', 'condition', 'cylinders', 'fuel', 'odometer', 'title_status', 'transmission', 'VIN', 'drive', 'size', 'type', 'paint_color', 'state']

Missing values (top 20):
size            306361
cylinders       177678
condition       174104
VIN             161042
drive           130567
paint_color     130203
type             92858
manufacturer     17646
title_status      8242
model             5277
odometer          4400
fuel              3013
transmission      2556
year              1205
id                   0
region               0
price                0
state                0
dtype: int64
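
Raw counts are easier to judge as a share of rows: with 426,880 rows, `size` is missing in roughly 72% of listings. The sketch below shows one way to compute that share per column. The `toy` frame and the `missing_share` helper are illustrative stand-ins, not part of the notebook; in practice you would pass `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `df` (hypothetical values); pass the real DataFrame in the notebook.
toy = pd.DataFrame({
    "size": [np.nan, np.nan, np.nan, "compact"],
    "odometer": [120000.0, np.nan, 45000.0, 80000.0],
    "state": ["az", "ar", "fl", "ca"],
})

def missing_share(frame: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, largest first."""
    return frame.isna().mean().sort_values(ascending=False)

share = missing_share(toy)
print(share)
```

Columns with a very high missing share are candidates for dropping or for an explicit "missing" category rather than most-frequent imputation.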


In [12]:
# [2.1] Infer target column name

# Candidate target column names, checked in order
possible_targets = ['price', 'Price', 'sale_price', 'SalePrice']
target = None
for t in possible_targets:
    if t in df.columns:
        target = t
        break
assert target is not None, f"Could not find a price column among: {possible_targets}. Please set 'target' manually."

print("Using target:", target)


Using target: price


In [13]:
# [2.2] Basic target sanity check

print("Target summary:")
print(df[target].describe())

# Drop rows with a missing or non-positive price.
# Note: this does not remove implausibly large prices (the max above is about 3.7e9).
df = df[df[target].notna() & (df[target] > 0)].copy()
print("After filtering invalid target values, shape:", df.shape)


Target summary:
count    4.268800e+05
mean     7.519903e+04
std      1.218228e+07
min      0.000000e+00
25%      5.900000e+03
50%      1.395000e+04
75%      2.648575e+04
max      3.736929e+09
Name: price, dtype: float64
After filtering invalid target values, shape: (393985, 18)
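
The summary above shows a mean of about 75,199 against a median of 13,950 and a maximum near 3.7 billion, so a few implausible listings likely dominate the target. One possible response is percentile-based trimming, sketched below. The helper `trim_by_quantile`, the toy data, and the 1%/99% cutoffs are assumptions chosen for illustration, not values taken from this notebook.

```python
import pandas as pd

def trim_by_quantile(frame, column, lower=0.01, upper=0.99):
    """Keep rows whose `column` lies between the given quantiles (inclusive)."""
    lo, hi = frame[column].quantile([lower, upper])
    mask = frame[column].between(lo, hi)
    return frame.loc[mask].copy()

# Toy prices (hypothetical): 98 ordinary listings plus one tiny and one absurd price.
toy = pd.DataFrame(
    {"price": [1, 5900, 13950, 26485, 3_736_929_000] + [15000] * 95}
)
trimmed = trim_by_quantile(toy, "price")
print(len(toy), "->", len(trimmed))
print(trimmed["price"].min(), trimmed["price"].max())
```

If trimming is applied, it is safer to compute the cutoffs on the training split only, so the validation set does not influence them.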


## 3) Feature Split & Preprocessing

In [14]:
# [3.1] Split features/target

X = df.drop(columns=[target]).copy()
y = df[target].copy()

# Separate columns by dtype
categorical_cols = X.select_dtypes(include=["object", "category"]).columns.tolist()
numeric_cols = X.select_dtypes(exclude=["object", "category"]).columns.tolist()

print("Numeric features:", len(numeric_cols))
print("Categorical features:", len(categorical_cols))

# [3.2] Preprocessing pipelines
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_cols),
        ("cat", categorical_transformer, categorical_cols)
    ]
)


Numeric features: 3
Categorical features: 14
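
The dtype-based split above counts `id` among the numeric features and `VIN` among the categorical ones. Both look like identifiers rather than predictors, so they are unlikely to generalize. The sketch below shows one way to exclude them before building the column lists. Treating `id` and `VIN` as identifiers is an assumption, and `split_feature_columns` is a hypothetical helper.

```python
import pandas as pd

ID_LIKE = ["id", "VIN"]  # assumption: identifier columns with no predictive meaning

def split_feature_columns(X, exclude=ID_LIKE):
    """Return (numeric, categorical) column lists, skipping identifier columns."""
    usable = X.drop(columns=[c for c in exclude if c in X.columns])
    numeric = usable.select_dtypes(include="number").columns.tolist()
    categorical = [c for c in usable.columns if c not in numeric]
    return numeric, categorical

# Toy frame with hypothetical values, using column names from the dataset.
toy_X = pd.DataFrame({
    "id": [7222695916, 7218891961],
    "year": [2014.0, 2010.0],
    "odometer": [57923.0, 71229.0],
    "VIN": ["abc", "def"],
    "fuel": ["gas", "diesel"],
})
numeric_cols_demo, categorical_cols_demo = split_feature_columns(toy_X)
print(numeric_cols_demo, categorical_cols_demo)
```

Because `ColumnTransformer` drops unlisted columns by default (`remainder='drop'`), leaving a column out of both lists is enough to keep it out of the model.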


## 4) Train/Validation Split

In [15]:
# [4.1] Split data

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train.shape, X_valid.shape


((315188, 17), (78797, 17))

## 5) Models

In [19]:
# [5.1] Linear Regression (baseline)

# Define pipeline
linreg_pipe = Pipeline(steps=[
    ("preprocess", preprocessor),   # defined in [3.2]
    ("model", LinearRegression())
])

# Fit model
linreg_pipe.fit(X_train, y_train)


Pipeline
  preprocess: ColumnTransformer
    num: SimpleImputer(strategy='median') -> StandardScaler()
    cat: SimpleImputer(strategy='most_frequent') -> OneHotEncoder(handle_unknown='ignore')
  model: LinearRegression()

In [20]:
# Evaluate the linear baseline on the validation split (imports come from [1.1])

y_pred_lr = linreg_pipe.predict(X_valid)
mse_lr  = mean_squared_error(y_valid, y_pred_lr)
rmse_lr = np.sqrt(mse_lr)
mae_lr  = mean_absolute_error(y_valid, y_pred_lr)
r2_lr   = r2_score(y_valid, y_pred_lr)

print(f"Linear Regression -> RMSE: {rmse_lr:,.2f} | MAE: {mae_lr:,.2f} | R²: {r2_lr:.3f}")


Linear Regression -> RMSE: 17,277,517.57 | MAE: 524,251.18 | R²: -0.018
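
An R² of -0.018 means the baseline does slightly worse than always predicting the mean price, and the gap between RMSE (about 17 million) and MAE (about 524 thousand) suggests that a few extreme prices dominate the squared error. One common response, sketched here on synthetic data, is to fit on a log-transformed target with scikit-learn's `TransformedTargetRegressor`. The data and the names `X_demo`, `y_demo`, and `log_model` are made up for illustration.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): the target is exactly linear after log1p.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.expm1(1.0 + 0.5 * X_demo[:, 0])

# Fit on log1p(y); predictions are mapped back to the original scale with expm1.
log_model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
log_model.fit(X_demo, y_demo)
pred_demo = log_model.predict(X_demo)
print(np.abs(pred_demo - y_demo).max())
```

In the notebook, this would amount to passing the full preprocessing-plus-model pipeline as `regressor`. Whether it helps on this dataset would need to be checked on the validation split.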


In [None]:
# [5.2] Random Forest

rf_pipe = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("model", RandomForestRegressor(
        n_estimators=400,
        random_state=42,
        n_jobs=-1
    ))
])
rf_pipe.fit(X_train, y_train)
y_pred_rf = rf_pipe.predict(X_valid)

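# The `squared=False` argument is no longer accepted by recent scikit-learn releases; take the square root instead
rmse_rf = np.sqrt(mean_squared_error(y_valid, y_pred_rf))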
mae_rf = mean_absolute_error(y_valid, y_pred_rf)
r2_rf = r2_score(y_valid, y_pred_rf)

print(f"Random Forest -> RMSE: {rmse_rf:,.2f} | MAE: {mae_rf:,.2f} | R²: {r2_rf:.3f}")


In [None]:

# [5.3] 

# XGBoost

if XGB_AVAILABLE:
    xgb_pipe = Pipeline(steps=[
        ("preprocess", preprocessor),
        ("model", XGBRegressor(
            n_estimators=600,
            max_depth=8,
            learning_rate=0.06,
            subsample=0.8,
            colsample_bytree=0.8,
            reg_lambda=1.0,
            tree_method="hist",
            random_state=42,
            n_jobs=-1
        ))
    ])
    xgb_pipe.fit(X_train, y_train)
    y_pred_xgb = xgb_pipe.predict(X_valid)

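    # The `squared=False` argument is no longer accepted by recent scikit-learn releases; take the square root instead
    rmse_xgb = np.sqrt(mean_squared_error(y_valid, y_pred_xgb))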
    mae_xgb = mean_absolute_error(y_valid, y_pred_xgb)
    r2_xgb = r2_score(y_valid, y_pred_xgb)

    print(f"XGBoost -> RMSE: {rmse_xgb:,.2f} | MAE: {mae_xgb:,.2f} | R²: {r2_xgb:.3f}")
else:
    print("XGBoost not installed — skipping. Install with `pip install xgboost` to enable.")
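
The same three metrics are computed in every model cell above. A small helper, sketched here, keeps that logic in one place and avoids any dependence on the `squared` argument of `mean_squared_error`. The name `regression_report` and the toy numbers are illustrative.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true, y_pred):
    """Return RMSE, MAE, and R² as plain floats."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "rmse": float(np.sqrt(mse)),
        "mae": float(mean_absolute_error(y_true, y_pred)),
        "r2": float(r2_score(y_true, y_pred)),
    }

# Toy check: errors are 2, -2, and 3.
report = regression_report([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
print(report)
```

In the notebook, a call such as `regression_report(y_valid, y_pred_rf)` would replace the three separate metric lines.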


## 6) Predicted vs. Actual Plot

In [None]:

# [6.1] Plot for best model (choose RF by default)
best_pred = y_pred_rf
best_name = "Random Forest"

plt.figure(figsize=(5,5))
plt.scatter(y_valid, best_pred, alpha=0.4)
lims = [min(y_valid.min(), best_pred.min()), max(y_valid.max(), best_pred.max())]
plt.plot(lims, lims)  # 45-degree line
plt.xlabel("Actual Price")
plt.ylabel("Predicted Price")
plt.title(f"Predicted vs Actual — {best_name}")
plt.tight_layout()
plt.show()
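
With prices spanning several orders of magnitude, a linear-scale scatter tends to squeeze most points into one corner. The sketch below draws the same predicted-versus-actual plot on log axes. The helper `plot_pred_vs_actual_log` and the toy values are assumptions; non-positive values are dropped because they cannot be shown on a log scale.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_pred_vs_actual_log(y_true, y_pred, title):
    """Scatter predicted vs. actual values on log-log axes; returns (fig, ax)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    keep = (y_true > 0) & (y_pred > 0)  # log axes cannot show non-positive values
    fig, ax = plt.subplots(figsize=(5, 5))
    ax.scatter(y_true[keep], y_pred[keep], alpha=0.4)
    lims = [
        min(y_true[keep].min(), y_pred[keep].min()),
        max(y_true[keep].max(), y_pred[keep].max()),
    ]
    ax.plot(lims, lims)  # 45-degree reference line
    ax.set_xscale("log")
    ax.set_yscale("log")
    ax.set_xlabel("Actual Price")
    ax.set_ylabel("Predicted Price")
    ax.set_title(title)
    fig.tight_layout()
    return fig, ax

# Toy values (hypothetical); the pair with an actual price of 0 is filtered out.
fig, ax = plot_pred_vs_actual_log(
    [100, 1000, 10000, 0], [120, 900, 15000, 50], "Predicted vs Actual (log scale)"
)
```

In the notebook, passing `y_valid` and `best_pred` and then calling `plt.show()` would display the figure.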


In [None]:
# [6.2]

# === Ridge / Lasso / ElasticNet (with the same preprocessing) ===
from sklearn.linear_model import Ridge, Lasso, ElasticNet

def eval_reg(model, name):
    pipe = Pipeline([
        ("preprocess", preprocessor),
        ("model", model)
    ])
    pipe.fit(X_train, y_train)
    pred = pipe.predict(X_valid)
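    # The `squared=False` argument is no longer accepted by recent scikit-learn releases; take the square root instead
    rmse = np.sqrt(mean_squared_error(y_valid, pred))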
    mae  = mean_absolute_error(y_valid, pred)
    r2   = r2_score(y_valid, pred)
    print(f"{name:12s} -> RMSE: {rmse:,.2f} | MAE: {mae:,.2f} | R²: {r2:.3f}")

eval_reg(Ridge(alpha=1.0, random_state=42),       "Ridge")
eval_reg(Lasso(alpha=0.001, random_state=42),     "Lasso")
eval_reg(ElasticNet(alpha=0.001, l1_ratio=0.5, random_state=42), "ElasticNet")


In [None]:
# [6.3] XGBoost randomized hyperparameter search

from xgboost import XGBRegressor
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

xgb = Pipeline([
    ("preprocess", preprocessor),
    ("model", XGBRegressor(
        tree_method="hist", n_jobs=1, random_state=42,  # parallelism comes from the search below
        objective="reg:squarederror", eval_metric="rmse"
    ))
])

# scipy's uniform(loc, scale) samples from [loc, loc + scale]
xgb_param_dist = {
    "model__n_estimators": randint(200, 800),
    "model__max_depth": randint(4, 10),
    "model__learning_rate": uniform(0.03, 0.15),
    "model__subsample": uniform(0.6, 0.4),
    "model__colsample_bytree": uniform(0.6, 0.4),
    "model__reg_lambda": uniform(0.0, 2.0)
}

# Early stopping is deliberately left out. Passing preprocessor.fit_transform(X_valid)
# as an eval_set refits the preprocessor on validation data, so its columns need not
# match those of the pipeline fitted on each CV split, and it leaks the holdout set
# into model selection.

xgb_search = RandomizedSearchCV(
    xgb, xgb_param_dist,
    n_iter=25, cv=3, scoring="neg_root_mean_squared_error",
    random_state=42, n_jobs=-1, verbose=1
)
xgb_search.fit(X_train, y_train)

print("XGB best params:", xgb_search.best_params_)
print("XGB CV best RMSE:", -xgb_search.best_score_)
pred = xgb_search.best_estimator_.predict(X_valid)
print("XGB holdout RMSE:", np.sqrt(mean_squared_error(y_valid, pred)))


## 7) Next Steps


- Try additional regressors (e.g., `Ridge`, `Lasso`).
- Tune hyperparameters (`RandomizedSearchCV` or `GridSearchCV`).
- Engineer domain features (e.g., vehicle age = current_year − year, brand/model embeddings).
- Add robust outlier handling (clip extreme odometer values, etc.).
- Export the best model with `joblib` and serve via a small API or Streamlit.
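
Two of the ideas above, vehicle age and odometer clipping, can be sketched in a few lines. The helper names, the `reference_year` of 2021, and the 99th-percentile cap are assumptions for illustration; the listing year of the dataset is not stated in this notebook.

```python
import numpy as np
import pandas as pd

def add_vehicle_age(frame, reference_year):
    """Add `vehicle_age` = reference_year - year, floored at 0 (NaN stays NaN)."""
    out = frame.copy()
    out["vehicle_age"] = (reference_year - out["year"]).clip(lower=0)
    return out

def clip_odometer(frame, upper_quantile=0.99):
    """Cap `odometer` at the given quantile to limit the effect of extreme values."""
    out = frame.copy()
    cap = out["odometer"].quantile(upper_quantile)
    out["odometer"] = out["odometer"].clip(upper=cap)
    return out

# Toy rows (hypothetical values) with one missing year and one extreme odometer reading.
toy = pd.DataFrame({
    "year": [2014.0, 2010.0, np.nan],
    "odometer": [57923.0, 71229.0, 9_999_999.0],
})
engineered = clip_odometer(add_vehicle_age(toy, reference_year=2021))
print(engineered)
```

As with any data-dependent threshold, the odometer cap is best computed on the training split and then reused for validation data.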
