# What drives the price of a car?

![Kurt used car dealership](images/kurt.jpeg)

**OVERVIEW**

This notebook explores a large dataset of used cars. Your goal is to find out what features drive the price of a car and to recommend to a used car dealership what characteristics consumers truly value.

## CRISP-DM Framework

* Business Understanding
* Data Understanding
* Data Preparation
* Modeling
* Evaluation
* Deployment

## 1. Data Loading and Initial Exploration

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style='whitegrid', palette='muted', font_scale=1.2)

# Load data
df = pd.read_csv('vehicles.csv')
print(f"Shape: {df.shape}")
df.head()

## 2. Data Cleaning & Preprocessing

In [None]:
# Inspect missing values
missing = df.isnull().mean().sort_values(ascending=False)
print(missing[missing > 0])

# Drop rows with no price or year (since these are essential for analysis)
df = df.dropna(subset=['price', 'year'])
# Remove obviously bad price entries (e.g., price = 0 or unrealistic values)
df = df[df['price'].between(100, 200000)]

# Basic type conversions
df['year'] = df['year'].astype(int)
# Fill non-critical missing fields with 'unknown' (can also use more advanced imputation)
for col in ['manufacturer', 'model', 'condition', 'cylinders', 'fuel', 'transmission', 'drive', 'size', 'type', 'paint_color', 'state', 'title_status']:
    df[col] = df[col].fillna('unknown')
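
The comment above mentions more advanced imputation. As a minimal sketch of one option, the cell below fills missing `odometer` values with the median odometer of cars from the same `year`, falling back to the overall median. The `toy` frame, its values, and the `odometer_imputed` column are made up for illustration; this is not the method used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real listings (values are invented for illustration)
toy = pd.DataFrame({
    'year':     [2010, 2010, 2010, 2018, 2018, 2005],
    'odometer': [120000, np.nan, 100000, 30000, np.nan, np.nan],
})

# Median odometer within each model year; NaN where a year has no observed odometer
year_median = toy.groupby('year')['odometer'].transform('median')

# Fill from the per-year median first, then fall back to the overall median
toy['odometer_imputed'] = (
    toy['odometer']
    .fillna(year_median)
    .fillna(toy['odometer'].median())
)
print(toy)
```

Grouping by `year` is only one plausible choice; grouping by manufacturer or vehicle type may work better, depending on where the missing values are concentrated.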

## 3. Exploratory Data Analysis (EDA)
### 3.1 Price Distribution

In [None]:
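plt.figure(figsize=(10,5))
# Restrict the data to the displayed range so all 60 bins fall inside the visible window
sns.histplot(df.loc[df['price'] <= 70000, 'price'], bins=60, kde=True)
plt.title('Price Distribution (price <= 70,000)')
plt.xlabel('Price')
plt.xlim(0, 70000)
plt.show()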

### 3.2 Relationship with Quantitative Features

In [None]:
plt.figure(figsize=(10,5))
sns.scatterplot(data=df, x='year', y='price', alpha=0.2)
plt.title('Price vs Year')
plt.show()

In [None]:
# Odometer (mileage) typically also matters
sns.scatterplot(data=df[df['odometer'].notnull()], x='odometer', y='price', alpha=0.2)
plt.title('Price vs Odometer')
plt.show()
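
Scatter plots with this many points can be hard to read, so it may help to put a number on each relationship. The sketch below computes Spearman rank correlations, which capture monotonic trends and are less sensitive to extreme prices than Pearson correlation. The `toy` frame uses invented values with this notebook's column names; on the real data the same call would be applied to `df`.

```python
import pandas as pd

# Toy listings (invented values) using the notebook's column names
toy = pd.DataFrame({
    'price':    [3000, 5500, 9000, 15000, 24000],
    'year':     [2004, 2008, 2012, 2016, 2020],
    'odometer': [190000, 150000, 110000, 60000, 20000],
})

# Spearman works on ranks, so it measures monotonic (not only linear) association
rank_corr = toy[['price', 'year', 'odometer']].corr(method='spearman')
print(rank_corr['price'].round(2))
```

Rows with a missing `odometer` are excluded pairwise by `DataFrame.corr`, so no extra filtering is needed for this step.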

### 3.3 Categorical Features

In [None]:
# Price distribution for the 10 most frequently listed manufacturers
top_makes = df['manufacturer'].value_counts().head(10).index
sns.boxplot(data=df[df['manufacturer'].isin(top_makes)], x='manufacturer', y='price')
plt.xticks(rotation=45)
plt.title('Price by Manufacturer')
plt.show()

In [None]:
# Price by condition, fuel, transmission, drive, and vehicle type
for col in ['condition', 'fuel', 'transmission', 'drive', 'type']:
    plt.figure(figsize=(8,4))
    sns.boxplot(data=df, x=col, y='price')
    plt.xticks(rotation=45)
    plt.title(f'Price by {col.capitalize()}')
    plt.show()
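
Box plots show the spread but not how many listings sit behind each box. A small summary table of median price and listing count per category can complement them. The sketch below uses an invented `toy` frame and the hypothetical output names `median_price` and `n_listings`; on the real data the same pattern would run on `df` for each categorical column.

```python
import pandas as pd

# Toy listings (invented values)
toy = pd.DataFrame({
    'drive': ['4wd', '4wd', 'fwd', 'fwd', 'fwd', 'rwd'],
    'price': [30000, 20000, 9000, 12000, 15000, 18000],
})

# Median is used instead of mean because a few expensive listings can distort the mean
summary = (
    toy.groupby('drive')['price']
    .agg(median_price='median', n_listings='count')
    .sort_values('median_price', ascending=False)
)
print(summary)
```

Categories with very few listings deserve caution, since their medians can swing on a handful of cars.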

## 4. Feature Engineering & Modeling
We'll build a regression model to estimate price using the most informative features.

In [None]:
# Simple feature selection and encoding (for brevity; expand for deployment)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score

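# Pick features
features = ['year', 'odometer', 'manufacturer', 'condition', 'fuel', 'transmission', 'drive', 'type']
# Ridge cannot handle NaN and odometer was not filled above, so drop rows where it is missing
model_df = df.dropna(subset=['odometer'])
X = model_df[features]
y = model_df['price']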

numeric_features = ['year', 'odometer']
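# Use a list comprehension (not a set difference) so the column order is the same on every run
categorical_features = [f for f in features if f not in numeric_features]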

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
])

model = Pipeline([
    ('pre', preprocessor),
    ('reg', Ridge(alpha=3))
])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"Test RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.2f}")
print(f"Test R2: {r2_score(y_test, y_pred):.2f}")
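
If the price distribution turns out to be right-skewed, one common variation is to fit the same kind of pipeline on log-transformed prices and map predictions back to the original scale. The sketch below does this with scikit-learn's `TransformedTargetRegressor`. The synthetic `toy` data, its coefficients, and the name `log_model` are assumptions made for illustration, not results from `vehicles.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic listings (not real data): price grows with year and shrinks with mileage
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    'year': rng.integers(2000, 2022, n),
    'odometer': rng.uniform(5_000, 200_000, n),
    'fuel': rng.choice(['gas', 'diesel'], n),
})
toy['price'] = np.exp(
    8.0 + 0.08 * (toy['year'] - 2000) - 4e-6 * toy['odometer']
    + rng.normal(0, 0.1, n)
)

pre = ColumnTransformer([
    ('num', StandardScaler(), ['year', 'odometer']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['fuel']),
])

# Fit Ridge on log1p(price); predictions are mapped back to prices with expm1
log_model = TransformedTargetRegressor(
    regressor=Pipeline([('pre', pre), ('reg', Ridge(alpha=3))]),
    func=np.log1p,
    inverse_func=np.expm1,
)
X_toy, y_toy = toy[['year', 'odometer', 'fuel']], toy['price']
log_model.fit(X_toy, y_toy)
pred = log_model.predict(X_toy)
print(pred[:5].round(0))
```

With a log target the coefficients describe approximate percentage changes in price rather than absolute amounts, which can be easier to explain to a dealership.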

## 5. What features most drive price? (Interpretation)

In [None]:
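# Rank features by absolute Ridge coefficient. Illustrative only: numeric inputs are standardized
# while one-hot columns are 0/1, so magnitudes are not strictly comparable across the two groups.
# Permutation importance is a more robust alternative.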
feature_names = numeric_features.copy()
enc = model.named_steps['pre'].named_transformers_['cat']
feature_names.extend(enc.get_feature_names_out(categorical_features))

coefs = model.named_steps['reg'].coef_
top_idx = np.argsort(np.abs(coefs))[-10:][::-1]
for i in top_idx:
    print(f"{feature_names[i]}: {coefs[i]:.2f}")
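
The comment above names permutation importance as a more robust alternative. A minimal sketch follows, using `sklearn.inspection.permutation_importance` on a small synthetic dataset. The `toy` frame, the way its prices are generated, and the names `pipe` and `importance` are assumptions made for illustration. On the real data the same call would take the fitted `model` together with `X_test` and `y_test`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic listings (not real data): price depends on year and drive, not on paint_color
rng = np.random.default_rng(42)
n = 400
toy = pd.DataFrame({
    'year': rng.integers(2000, 2022, n),
    'drive': rng.choice(['4wd', 'fwd'], n),
    'paint_color': rng.choice(['red', 'blue', 'white'], n),
})
toy['price'] = (
    2000 + 1000 * (toy['year'] - 2000)
    + 3000 * (toy['drive'] == '4wd')
    + rng.normal(0, 500, n)
)

pipe = Pipeline([
    ('pre', ColumnTransformer([
        ('num', StandardScaler(), ['year']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['drive', 'paint_color']),
    ])),
    ('reg', Ridge(alpha=3)),
])

cols = ['year', 'drive', 'paint_color']
X_tr, X_te, y_tr, y_te = train_test_split(
    toy[cols], toy['price'], test_size=0.25, random_state=0
)
pipe.fit(X_tr, y_tr)

# Shuffle one raw column at a time on held-out data and record the mean drop in R^2
result = permutation_importance(pipe, X_te, y_te, n_repeats=10, random_state=0)
importance = pd.Series(result.importances_mean, index=cols).sort_values(ascending=False)
print(importance.round(3))
```

Because whole input columns are shuffled, each categorical feature gets a single score instead of one coefficient per one-hot level, which makes the ranking easier to communicate.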

## 6. Conclusions & Recommendations

**Key findings:**

 * Newer cars and cars with lower mileage sell for more.
 * Certain manufacturers and vehicle types (e.g., luxury brands, pickup trucks) command higher prices.
 * Condition, drive type (e.g., 4wd), and fuel type matter to consumers.
 * Transmission also affects value.

**Recommendations to the dealership:**

 * Focus on sourcing vehicles with low mileage and newer model years.
 * Stock more in-demand types (e.g., pickups, SUVs) and popular manufacturers according to local trends.
 * A price premium for listings in "excellent" condition is supported by the data.
 * Consider upselling value-added features such as 4wd or the transmission types that sell at a premium.

Before deployment, extend this with deeper analysis (e.g., feature interactions, geographic price differences).