# Car Price Prediction: Data Debugging and Model Training
This notebook debugs the dataset and script errors, then trains the model.

In [1]:
# Section 1: Load Dataset and Inspect Columns
import pandas as pd
cars = pd.read_csv('Cars.csv')
print('Columns:', cars.columns.tolist())

Columns: ['name', 'year', 'selling_price', 'km_driven', 'fuel', 'seller_type', 'transmission', 'owner', 'mileage', 'engine', 'max_power', 'torque', 'seats']


In [2]:
# Section 2: Check for Missing Values
print('Missing values per column:')
print(cars.isnull().sum())

Missing values per column:
name               0
year               0
selling_price      0
km_driven          0
fuel               0
seller_type        0
transmission       0
owner              0
mileage          221
engine           221
max_power        215
torque           222
seats            221
dtype: int64


In [3]:
# Section 3: Display Sample Data
print('Sample data:')
cars.head()

Sample data:


| | name | year | selling_price | km_driven | fuel | seller_type | transmission | owner | mileage | engine | max_power | torque | seats |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Maruti Swift Dzire VDI | 2014 | 450000 | 145500 | Diesel | Individual | Manual | First Owner | 23.4 kmpl | 1248 CC | 74 bhp | 190Nm@ 2000rpm | 5.0 |
| 1 | Skoda Rapid 1.5 TDI Ambition | 2014 | 370000 | 120000 | Diesel | Individual | Manual | Second Owner | 21.14 kmpl | 1498 CC | 103.52 bhp | 250Nm@ 1500-2500rpm | 5.0 |
| 2 | Honda City 2017-2020 EXi | 2006 | 158000 | 140000 | Petrol | Individual | Manual | Third Owner | 17.7 kmpl | 1497 CC | 78 bhp | 12.7@ 2,700(kgm@ rpm) | 5.0 |
| 3 | Hyundai i20 Sportz Diesel | 2010 | 225000 | 127000 | Diesel | Individual | Manual | First Owner | 23.0 kmpl | 1396 CC | 90 bhp | 22.4 kgm at 1750-2750rpm | 5.0 |
| 4 | Maruti Swift VXI BSIII | 2007 | 130000 | 120000 | Petrol | Individual | Manual | First Owner | 16.1 kmpl | 1298 CC | 88.2 bhp | 11.5@ 4,500(kgm@ rpm) | 5.0 |


In [4]:
# Section 4: Identify Target Column Issue
# Check if 'price' or 'selling_price' exists
print("Columns in dataset:", cars.columns.tolist())
if 'price' in cars.columns:
    print("Target column is 'price'.")
elif 'selling_price' in cars.columns:
    print("Target column is 'selling_price'.")
else:
    print("No suitable target column found.")

Columns in dataset: ['name', 'year', 'selling_price', 'km_driven', 'fuel', 'seller_type', 'transmission', 'owner', 'mileage', 'engine', 'max_power', 'torque', 'seats']
Target column is 'selling_price'.


In [5]:
# Section 5: Fix Target Column Name
# Use 'selling_price' as the target column
features = [col for col in cars.columns if col != 'selling_price']
target = 'selling_price'
X = cars[features]
y = cars[target]

In [6]:
# Section 6: Handle Missing Values
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

num_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
cat_features = X.select_dtypes(include=['object']).columns.tolist()

num_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])
cat_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])
preprocessor = ColumnTransformer([
    ('num', num_transformer, num_features),
    ('cat', cat_transformer, cat_features)
])
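
The sample rows in Section 3 show `mileage`, `engine`, and `max_power` stored as strings with unit suffixes (`23.4 kmpl`, `1248 CC`, `74 bhp`), so `select_dtypes` places them in `cat_features` and they are one-hot encoded rather than scaled. If numeric treatment is preferred, one option is to strip the units before building `X`. The sketch below is a minimal illustration and not part of the original notebook; the helper name `parse_leading_number` is hypothetical.

```python
import re
from typing import Optional

# Matches an integer or decimal number at the start of a string.
_LEADING_NUMBER = re.compile(r"^\s*(\d+(?:\.\d+)?)")


def parse_leading_number(value: object) -> Optional[float]:
    """Return the leading number of a string such as '23.4 kmpl', else None.

    Intended only for the unit-bearing string columns; any non-string
    entry (including NaN) is returned as None, i.e. treated as missing.
    """
    if not isinstance(value, str):
        return None
    match = _LEADING_NUMBER.match(value)
    return float(match.group(1)) if match else None


if __name__ == "__main__":
    for raw in ["23.4 kmpl", "1248 CC", "103.52 bhp", "", float("nan")]:
        print(repr(raw), "->", parse_leading_number(raw))
```

Applied to the DataFrame, this would look like `cars['mileage'].map(parse_leading_number)`, and likewise for `engine` and `max_power`. `torque` mixes units (Nm and kgm) and would need separate handling. Converting these columns changes the model inputs, so the metrics reported in Section 7 would no longer apply.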

In [7]:
# Section 7: Train Model and Save as a1_model_artifacts.pkl
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import pickle
import numpy as np

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train model pipeline
model = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())
])

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Calculate metrics
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print('A1 Linear Regression Results:')
print(f'RMSE: {rmse:.2f}')
print(f'R²: {r2:.4f}')

# Save model artifacts
artifacts = {
    'model': model,
    'preprocessor': preprocessor,
    'feature_names': features,
    'target_name': target,
    'metrics': {'rmse': rmse, 'r2': r2},
    'model_type': 'LinearRegression'
}

with open('a1_model_artifacts.pkl', 'wb') as f:
    pickle.dump(artifacts, f)

print('\nModel saved as a1_model_artifacts.pkl')

A1 Linear Regression Results:
RMSE: 194179.52
R²: 0.9425

Model saved as a1_model_artifacts.pkl
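
To reuse the saved model later, the artifact dictionary can be read back with `pickle`. The sketch below assumes the dictionary layout written by the cell above; the function name `load_artifacts` is hypothetical.

```python
import pickle
from pathlib import Path
from typing import Any, Dict

# Keys written by the training cell above.
EXPECTED_KEYS = {
    "model", "preprocessor", "feature_names",
    "target_name", "metrics", "model_type",
}


def load_artifacts(path: str) -> Dict[str, Any]:
    """Load the artifact dictionary and check that all expected keys exist."""
    # Only unpickle files you created or trust: pickle can run code on load.
    with Path(path).open("rb") as f:
        artifacts = pickle.load(f)
    missing = EXPECTED_KEYS - set(artifacts)
    if missing:
        raise KeyError(f"artifact file is missing keys: {sorted(missing)}")
    return artifacts
```

With the file from Section 7 in place, `artifacts = load_artifacts('a1_model_artifacts.pkl')` followed by `artifacts['model'].predict(new_rows)` would return predictions, where `new_rows` is a DataFrame with the columns listed in `artifacts['feature_names']`. scikit-learn must be importable when the file is loaded, and ideally at the same version used for training.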


## Model Performance Analysis

### Which Algorithm Performs Well? Which Does Not? Why?

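Only the Linear Regression run appears above; the Random Forest and Decision Tree figures come from runs not shown in this notebook: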

**Best Performing Algorithm: Random Forest**
- RMSE: 161,104.38
- MAE: 74,376.54

**Second Best: Decision Tree**
- RMSE: 163,670.36
- MAE: 79,301.95

**Worst Performing: Linear Regression**
- RMSE: 194,179.49
- MAE: 83,805.46
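
For reference, the two metrics reported above can be written out directly. The sketch below is a plain-Python illustration using made-up toy values, not the notebook's data; scikit-learn's `mean_absolute_error` and `mean_squared_error` compute the same quantities.

```python
import math
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error: the average size of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error: errors are squared before averaging."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


if __name__ == "__main__":
    # Toy values (not from Cars.csv): two small errors and one larger one.
    actual = [100.0, 200.0, 300.0]
    predicted = [110.0, 190.0, 330.0]
    print(f"MAE:  {mae(actual, predicted):.2f}")   # 16.67
    print(f"RMSE: {rmse(actual, predicted):.2f}")  # 19.15
```

RMSE is never smaller than MAE, and the gap widens when a few errors are much larger than the rest. In the figures above, each model's RMSE is roughly twice its MAE, which suggests that a minority of cars account for large prediction errors.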

### Why Random Forest Performs Best:

1. **Ensemble Method**: Random Forest combines multiple decision trees, reducing overfitting and improving generalization
2. **Handles Non-linearity**: Car prices often have complex, non-linear relationships with features (age, mileage, brand)
3. **Feature Interactions**: Can capture interactions between features (e.g., luxury brand + low mileage = higher price)
4. **Robust to Outliers**: Averaging multiple trees makes it less sensitive to outliers
5. **Handles Mixed Data Types**: Works well with both numerical (year, mileage) and categorical (fuel type, brand) features
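
The variance-reduction argument in point 1 can be made concrete with a standard result. Assume $B$ trees $T_1, \dots, T_B$ whose predictions at a point $x$ each have variance $\sigma^2$ and pairwise correlation $\rho$ (symbols introduced here for illustration). The variance of their average is then:

$$
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
$$

Adding trees shrinks the second term, while the random feature subsets considered at each split lower $\rho$ and so shrink the first. A single decision tree corresponds to $B = 1$, where the variance is the full $\sigma^2$.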

### Why Linear Regression Performs Poorly:

1. **Assumes Linear Relationships**: Car price relationships are often non-linear (depreciation curves)
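2. **Limited Feature Interactions**: Without explicit interaction terms, the model treats each feature's effect as additive and cannot capture interactions between features
3. **Sensitive to Outliers**: Least squares penalizes squared errors, so luxury cars or very old cars can skew the fitted coefficients
4. **Categorical Encoding Impact**: One-hot encoding gives each category a fixed additive offset, so the model cannot express category-specific slopes (e.g., different depreciation rates per brand)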

The Random Forest model will be used for deployment: it has the lowest error on both metrics, with an RMSE about 17% below that of Linear Regression (161,104.38 vs. 194,179.49).

## Deployment - Dash Web Application

The trained Random Forest model has been deployed as a web application using Dash by Plotly. The application provides:

### Features:
1. **User-friendly Interface**: Clean, responsive web form for inputting car details
2. **Smart Input Handling**: Required fields (Year, KM Driven) and optional fields with defaults
3. **Missing Value Imputation**: Automatically handles missing values using the same imputation strategy from training
4. **Real-time Predictions**: Instant price predictions when user clicks "Predict Price"
5. **Error Handling**: Validates inputs and provides helpful error messages
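
The app's source is not reproduced in this notebook. As a rough sketch of how the input handling in points 2, 3, and 5 could fit together, the function below validates the two required fields and marks blank optional fields as NaN so that imputers in the saved pipeline can treat them as missing. The function name `build_input_row` and the form keys are assumptions that mirror the dataset's column names.

```python
import math
from typing import Any, Dict, List

REQUIRED_FIELDS = ("year", "km_driven")  # required per the feature list above


def build_input_row(form: Dict[str, Any], feature_names: List[str]) -> Dict[str, Any]:
    """Validate form values and return one row keyed by training feature name."""
    errors = []
    for field in REQUIRED_FIELDS:
        value = form.get(field)
        if value is None or value == "":
            errors.append(f"'{field}' is required")
        elif not isinstance(value, (int, float)) or value < 0:
            errors.append(f"'{field}' must be a non-negative number")
    if errors:
        raise ValueError("; ".join(errors))

    # Blank or absent optional fields become NaN (missing).
    row = {}
    for name in feature_names:
        value = form.get(name)
        row[name] = math.nan if value in (None, "") else value
    return row
```

A Dash callback could wrap the returned row in a one-row DataFrame before calling the model's `predict`, and display the `ValueError` message to the user when validation fails.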

### Application Structure:
```
app/
├── Dockerfile              # Docker configuration for containerization
├── docker-compose.yaml     # Docker Compose for easy deployment
├── requirements.txt        # Python dependencies
└── code/
    ├── app.py             # Main Dash application
    └── best_model.joblib  # Trained Random Forest model
```

### How to Run Locally:

#### Option 1: Direct Python Execution
```bash
cd app/code
python app.py
```
Then visit: http://localhost:8050

#### Option 2: Docker (Recommended for Production)
```bash
cd app
docker-compose up --build
```
Then visit: http://localhost:8050

### User Workflow:
1. User visits the web application
2. Reads instructions on how the prediction works
3. Fills out the car details form (required: Year, KM Driven)
4. Clicks "Predict Price" button
5. Receives instant price prediction with confidence information

Missing inputs are filled using the imputation values learned during model training, so each prediction is preprocessed the same way as the training data.