# Practical Application Assignment 11.1

## Used Car Price Prediction

### Business Understanding

The goal of this project is to identify key factors that influence the price of used cars. By analyzing the dataset, we aim to build predictive models that estimate the price of a car based on various attributes such as manufacturer, year, mileage, fuel type, and other factors. The results will help a used car dealership better understand what features consumers value and adjust their inventory accordingly.

### Data Understanding
In this section, we load and explore the dataset to understand its structure, check for missing data, and perform some basic analysis.

In [None]:

import pandas as pd

# Load the dataset
vehicles_df = pd.read_csv('/mnt/data/vehicles.csv')

# Preview the dataset
vehicles_df.head()
    

In [None]:

# Check for missing values (print explicitly: a notebook cell only displays its last expression)
print(vehicles_df.isnull().sum())

# Get basic statistics for numerical columns
vehicles_df.describe()
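
Raw missing-value counts can be hard to compare across columns, so a share of missing entries per column is often easier to read. The following is a minimal sketch on a made-up stand-in frame (`toy_df` and its values are illustrative assumptions, not rows from `vehicles.csv`):

```python
import numpy as np
import pandas as pd

# Toy stand-in for vehicles_df; the values are invented for illustration
toy_df = pd.DataFrame({
    "price": [4500, 12000, np.nan, 8000],
    "odometer": [120000, np.nan, np.nan, 60000],
    "fuel": ["gas", "diesel", None, "gas"],
})

# Fraction of missing entries per column, largest first
missing_share = toy_df.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to the real data, the same expression on `vehicles_df` would show which columns are sparse enough that dropping them may be preferable to imputing them.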
    

### Data Preparation
We will clean the dataset by handling missing values, removing extreme prices, and one-hot encoding categorical variables. Feature scaling is applied later, in the Modeling section, after the train/test split.

In [None]:

# Dropping irrelevant columns
vehicles_df_cleaned = vehicles_df.drop(columns=['id', 'VIN'])

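# Handle missing values: keep only rows with a valid, positive price
# (.copy() avoids modifying a view of the original DataFrame)
vehicles_df_cleaned = vehicles_df_cleaned[vehicles_df_cleaned['price'].notna() & (vehicles_df_cleaned['price'] > 0)].copy()

# Fill missing numeric values with the column median
# (assign the result back; fillna(inplace=True) on a selected column is unreliable in recent pandas)
vehicles_df_cleaned['year'] = vehicles_df_cleaned['year'].fillna(vehicles_df_cleaned['year'].median())
vehicles_df_cleaned['odometer'] = vehicles_df_cleaned['odometer'].fillna(vehicles_df_cleaned['odometer'].median())

# For categorical columns, fill missing values with mode
categorical_columns = ['manufacturer', 'fuel', 'transmission', 'drive', 'paint_color', 'title_status', 'type', 'condition']
for col in categorical_columns:
    vehicles_df_cleaned[col] = vehicles_df_cleaned[col].fillna(vehicles_df_cleaned[col].mode()[0])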

# Remove outliers in price
vehicles_df_cleaned = vehicles_df_cleaned[vehicles_df_cleaned['price'] <= 200000]

# One-hot encode categorical variables
X = vehicles_df_cleaned.drop(columns=['price', 'model', 'cylinders', 'size'])
X = pd.get_dummies(X, drop_first=True)
y = vehicles_df_cleaned['price']
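
To make the effect of `pd.get_dummies(..., drop_first=True)` concrete, here is a small sketch on an invented three-row frame (`toy` and its values are assumptions for illustration only). Each categorical column becomes one indicator column per category, and the first category in sorted order is dropped so it serves as the baseline:

```python
import pandas as pd

# Invented example rows; not taken from the vehicles dataset
toy = pd.DataFrame({
    "odometer": [50000, 120000, 80000],
    "fuel": ["gas", "diesel", "electric"],
})

# 'diesel' sorts first, so it is dropped and becomes the implicit baseline
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))  # ['odometer', 'fuel_electric', 'fuel_gas']
```

The diesel row ends up with both indicator columns set to zero, which is why coefficients on the remaining `fuel_*` columns should be read as differences relative to diesel.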
    

### Visualizations
We will visualize the distribution of continuous variables like price and odometer, as well as categorical variables like manufacturer and fuel type.

In [None]:

import matplotlib.pyplot as plt
import seaborn as sns

# Set plot style for readability
sns.set(style="whitegrid")

# 1. Distribution of Car Prices
plt.figure(figsize=(10,6))
sns.histplot(vehicles_df_cleaned['price'], bins=50, kde=True, color='blue')
plt.title('Distribution of Car Prices', fontsize=15)
plt.xlabel('Price', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.show()

# 2. Distribution of Odometer Readings
plt.figure(figsize=(10,6))
sns.histplot(vehicles_df_cleaned['odometer'], bins=50, kde=True, color='green')
plt.title('Distribution of Odometer Readings', fontsize=15)
plt.xlabel('Odometer (miles)', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.show()

# 3. Bar plot for Manufacturer Distribution
plt.figure(figsize=(12,6))
vehicles_df_cleaned['manufacturer'].value_counts().plot(kind='bar', color='orange')
plt.title('Distribution of Car Manufacturers', fontsize=15)
plt.xlabel('Manufacturer', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=90)
plt.show()

# 4. Bar plot for Fuel Type Distribution
plt.figure(figsize=(8,6))
vehicles_df_cleaned['fuel'].value_counts().plot(kind='bar', color='purple')
plt.title('Distribution of Fuel Types', fontsize=15)
plt.xlabel('Fuel Type', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=0)
plt.show()


### Modeling
We will build three regression models (Linear, Ridge, Lasso) and tune the regularization strength `alpha` of Ridge and Lasso using 5-fold cross-validated grid search.

In [None]:

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 1. Linear Regression
linear_model = LinearRegression()
linear_model.fit(X_train_scaled, y_train)

# 2. Ridge Regression with GridSearchCV
ridge_model = Ridge()
ridge_params = {'alpha': [0.1, 1, 10, 100]}
ridge_grid = GridSearchCV(ridge_model, ridge_params, cv=5)
ridge_grid.fit(X_train_scaled, y_train)

# 3. Lasso Regression with GridSearchCV
lasso_model = Lasso()
lasso_params = {'alpha': [0.1, 1, 10, 100]}
lasso_grid = GridSearchCV(lasso_model, lasso_params, cv=5)
lasso_grid.fit(X_train_scaled, y_train)

# Predictions
linear_predictions = linear_model.predict(X_test_scaled)
ridge_predictions = ridge_grid.best_estimator_.predict(X_test_scaled)
lasso_predictions = lasso_grid.best_estimator_.predict(X_test_scaled)
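
In the cell above, the scaler is fit on the full training set before cross-validation, so every validation fold has already influenced the scaling. One possible alternative, sketched here on synthetic data, is to wrap the scaler and the model in a scikit-learn `Pipeline` so that scaling is refit inside each fold. The names `X_demo`, `y_demo`, `pipe`, and `search`, and the choice of RMSE as the scoring metric, are assumptions for this illustration and not part of the original analysis:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for X_train / y_train
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
true_coefs = np.array([3.0, -2.0, 0.0, 1.0, 0.5])
y_demo = X_demo @ true_coefs + rng.normal(scale=0.1, size=200)

# Scaler and model travel together, so each CV fold fits its own scaler
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("ridge", Ridge()),
])

# Pipeline parameters are addressed as <step name>__<parameter>
param_grid = {"ridge__alpha": [0.1, 1, 10, 100]}
search = GridSearchCV(pipe, param_grid, cv=5,
                      scoring="neg_root_mean_squared_error")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(-search.best_score_)  # mean cross-validated RMSE of the best alpha
```

With this layout, `search.predict(X_test)` can be called on unscaled test features, since the fitted pipeline applies the scaler itself.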
    

### Evaluation
We evaluate the models on the held-out test set using R-squared and RMSE.

In [None]:

import numpy as np

# Calculate R-squared and RMSE for each model
linear_r2 = r2_score(y_test, linear_predictions)
ridge_r2 = r2_score(y_test, ridge_predictions)
lasso_r2 = r2_score(y_test, lasso_predictions)

# RMSE is the square root of MSE; the squared=False argument was deprecated
# in scikit-learn 1.4 and removed in later releases, so take the root explicitly
linear_rmse = np.sqrt(mean_squared_error(y_test, linear_predictions))
ridge_rmse = np.sqrt(mean_squared_error(y_test, ridge_predictions))
lasso_rmse = np.sqrt(mean_squared_error(y_test, lasso_predictions))

# Summary of results
model_results = pd.DataFrame({
    'Model': ['Linear Regression', 'Ridge Regression', 'Lasso Regression'],
    'R-squared': [linear_r2, ridge_r2, lasso_r2],
    'RMSE': [linear_rmse, ridge_rmse, lasso_rmse]
})

model_results
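
For reference, both metrics can be computed directly from their definitions. The sketch below uses four invented target values and predictions (not results from this notebook) to show that RMSE is the root of the mean squared residual and R-squared is one minus the ratio of residual to total sum of squares:

```python
import numpy as np

# Invented values for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 10.0])

residuals = y_true - y_pred

# RMSE: square root of the mean squared residual
rmse = np.sqrt(np.mean(residuals ** 2))

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"RMSE = {rmse:.3f}, R-squared = {r2:.2f}")  # RMSE = 0.866, R-squared = 0.85
```

RMSE is expressed in the units of the target (dollars, for price), while R-squared is unitless, which is why the two are reported together.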
    

### Findings and Recommendations

Based on the modeling results, Ridge Regression performed best, with the lowest RMSE and the highest R-squared on the test set. This suggests that, of the three models compared, Ridge is the most reliable for predicting used car prices from the available features.
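
Since the stated goal is to identify the factors that influence price, one way to move from a fitted linear model to such factors is to rank the standardized coefficients by absolute size. This is a minimal sketch on synthetic data. The feature names, the data, and the fixed `alpha=1.0` are hypothetical and are not results from this analysis:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic features with hypothetical names; 'noise' has no effect on the target
rng = np.random.default_rng(0)
feature_names = ["year", "odometer", "noise"]
X_demo = pd.DataFrame(rng.normal(size=(300, 3)), columns=feature_names)
y_demo = (5.0 * X_demo["year"] - 3.0 * X_demo["odometer"]
          + rng.normal(scale=0.1, size=300))

# Standardize so that coefficient magnitudes are comparable across features
X_scaled = StandardScaler().fit_transform(X_demo)
model = Ridge(alpha=1.0).fit(X_scaled, y_demo)

# Rank features by absolute coefficient, keeping the sign for interpretation
coefs = pd.Series(model.coef_, index=feature_names)
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```

In the notebook itself, the analogous inputs would be `ridge_grid.best_estimator_.coef_` and `X.columns`. Coefficients on one-hot columns would then be read relative to the category dropped by `drop_first=True`.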