# A1: Linear Regression - Car Price Prediction
**Student ID: st126010 - Htut Ko Ko**

This notebook implements basic linear regression for car price prediction following the assignment requirements.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.impute import SimpleImputer
import pickle
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")

## Task 1: Data Loading and Exploration

In [None]:
# Load the dataset
df = pd.read_csv('Cars.csv')
print(f"Dataset shape: {df.shape}")
print("\nDataset info:")
df.info()

In [None]:
# Display first few rows
print("First 5 rows:")
df.head()

In [None]:
# Check for missing values
print("Missing values:")
print(df.isnull().sum())

In [None]:
# Basic statistics
print("Basic statistics:")
df.describe()

## Task 2: Data Preprocessing (Following Assignment Requirements)

In [None]:
# Create a copy for preprocessing
data = df.copy()
print(f"Original dataset shape: {data.shape}")

# 1. Remove CNG and LPG rows (different mileage system)
print(f"\nFuel types before filtering: {data['fuel'].value_counts()}")
data = data[~data['fuel'].isin(['CNG', 'LPG'])]
print(f"Fuel types after filtering: {data['fuel'].value_counts()}")
print(f"Shape after removing CNG/LPG: {data.shape}")

In [None]:
# 2. Remove Test Drive Cars (ridiculously expensive)
print(f"\nOwner types before filtering: {data['owner'].value_counts()}")
data = data[data['owner'] != 'Test Drive Car']
print(f"Owner types after filtering: {data['owner'].value_counts()}")
print(f"Shape after removing Test Drive Cars: {data.shape}")

In [None]:
# 3. Map owner feature: First Owner=1, Second Owner=2, etc.
owner_mapping = {
    'First Owner': 1,
    'Second Owner': 2, 
    'Third Owner': 3,
    'Fourth & Above Owner': 4
}
data['owner'] = data['owner'].map(owner_mapping)
print(f"\nOwner mapping applied: {data['owner'].value_counts().sort_index()}")

In [None]:
# 4. Clean mileage column - remove 'kmpl' and convert to float
print(f"\nMileage before cleaning (sample): {data['mileage'].head()}")
data['mileage'] = data['mileage'].str.split().str[0]  # Take first part before space
data['mileage'] = pd.to_numeric(data['mileage'], errors='coerce')
print(f"Mileage after cleaning (sample): {data['mileage'].head()}")
print(f"Mileage data type: {data['mileage'].dtype}")
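The split-and-take-first-token approach above discards the unit. A minimal sketch of an alternative, assuming values shaped like `"23.4 kmpl"`, uses regular expressions to keep the number and the unit in separate columns. The sample values below are hypothetical and are not taken from `Cars.csv`.

```python
import pandas as pd

# Hypothetical sample values in the style of the raw 'mileage' column.
raw = pd.Series(["23.4 kmpl", "17.3 km/kg", None, "21.14 kmpl"])

# Capture the leading number; rows without a match become NaN.
number = raw.str.extract(r"^\s*(\d+(?:\.\d+)?)", expand=False)
mileage = pd.to_numeric(number, errors="coerce")

# Capture the trailing unit so mixed units can be inspected or filtered.
unit = raw.str.extract(r"([A-Za-z/]+)\s*$", expand=False)

print(mileage.tolist())
print(unit.tolist())
```

Keeping the unit makes it possible to check whether any rows with a different mileage unit remain after the fuel filter.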

In [None]:
# 5. Clean engine column - remove 'CC' and convert to float
print(f"\nEngine before cleaning (sample): {data['engine'].head()}")
data['engine'] = data['engine'].str.replace('CC', '', regex=False).str.strip()
data['engine'] = pd.to_numeric(data['engine'], errors='coerce')
print(f"Engine after cleaning (sample): {data['engine'].head()}")
print(f"Engine data type: {data['engine'].dtype}")

In [None]:
# 6. Clean max_power column - remove 'bhp' and convert to float
print(f"\nMax power before cleaning (sample): {data['max_power'].head()}")
data['max_power'] = data['max_power'].str.replace('bhp', '', regex=False).str.strip()
data['max_power'] = pd.to_numeric(data['max_power'], errors='coerce')
print(f"Max power after cleaning (sample): {data['max_power'].head()}")
print(f"Max power data type: {data['max_power'].dtype}")

In [None]:
# 7. Extract brand from name (first word only)
print(f"\nName before brand extraction (sample): {data['name'].head()}")
data['brand'] = data['name'].str.split().str[0]
print(f"Brand after extraction (sample): {data['brand'].head()}")
print(f"Unique brands: {data['brand'].nunique()}")

In [None]:
# 8. Drop torque column (as per assignment requirement)
if 'torque' in data.columns:
    data = data.drop('torque', axis=1)
    print("Torque column dropped")

# Also drop name column since we extracted brand
data = data.drop('name', axis=1)
print("Name column dropped (brand extracted)")

print(f"\nFinal columns: {list(data.columns)}")
print(f"Final shape: {data.shape}")

In [None]:
# Check for missing values after preprocessing
print("Missing values after preprocessing:")
print(data.isnull().sum())

In [None]:
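# Handle missing values with imputation
# Fill missing numerical values with the column median.
# 'seats' is included because it is used as a feature later,
# and LinearRegression cannot be fitted on NaN values.
numerical_cols = ['mileage', 'engine', 'max_power', 'seats']
for col in numerical_cols:
    if data[col].isnull().sum() > 0:
        median_val = data[col].median()
        # Assign the result back; fillna(inplace=True) on a column selection
        # may act on a temporary copy and leave 'data' unchanged.
        data[col] = data[col].fillna(median_val)
        print(f"Filled {col} missing values with median: {median_val}")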

print("\nMissing values after imputation:")
print(data.isnull().sum())
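The loop above computes each median on the full dataset, before the train/test split in Task 4, so test rows contribute to the fill values. A minimal sketch of an alternative, assuming the imputation step is moved after the split, fits scikit-learn's `SimpleImputer` on the training rows only. The toy frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the cleaned numeric columns.
toy = pd.DataFrame({
    "mileage": [20.0, np.nan, 18.0, 22.0, 16.0, np.nan, 19.0, 21.0],
    "engine": [1200.0, 1500.0, np.nan, 1000.0, 2000.0, 1400.0, 1300.0, np.nan],
})

train, test = train_test_split(toy, test_size=0.25, random_state=42)

# Learn the medians from the training rows only, then reuse them on the test rows.
imputer = SimpleImputer(strategy="median")
train_filled = pd.DataFrame(
    imputer.fit_transform(train), columns=toy.columns, index=train.index
)
test_filled = pd.DataFrame(
    imputer.transform(test), columns=toy.columns, index=test.index
)

print(dict(zip(toy.columns, imputer.statistics_)))
```

The fitted imputer can be stored with the other preprocessing objects so that the same medians are applied at prediction time.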

In [None]:
# Display cleaned data sample
print("Cleaned data sample:")
data.head()

## Task 3: Exploratory Data Analysis

In [None]:
# Price distribution
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.hist(data['selling_price'], bins=50, alpha=0.7)
plt.title('Selling Price Distribution')
plt.xlabel('Price')
plt.ylabel('Frequency')

plt.subplot(1, 2, 2)
plt.hist(np.log(data['selling_price']), bins=50, alpha=0.7, color='orange')
plt.title('Log-transformed Selling Price Distribution')
plt.xlabel('Log(Price)')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

print("Price statistics:")
print(f"Mean: {data['selling_price'].mean():,.0f}")
print(f"Median: {data['selling_price'].median():,.0f}")
print(f"Min: {data['selling_price'].min():,.0f}")
print(f"Max: {data['selling_price'].max():,.0f}")

In [None]:
# Correlation matrix
numerical_features = ['year', 'km_driven', 'mileage', 'engine', 'max_power', 'seats', 'owner', 'selling_price']
corr_matrix = data[numerical_features].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0, fmt='.2f')
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()

print("Correlation with selling_price:")
print(corr_matrix['selling_price'].sort_values(ascending=False))

## Task 4: Feature Engineering and Model Preparation

In [None]:
# Prepare features and target
# Encode categorical variables
label_encoders = {}
categorical_cols = ['fuel', 'seller_type', 'transmission', 'brand']

for col in categorical_cols:
    le = LabelEncoder()
    data[col] = le.fit_transform(data[col])
    label_encoders[col] = le
    print(f"Encoded {col}: {len(le.classes_)} unique values")

print(f"\nLabel encoders saved for: {list(label_encoders.keys())}")

In [None]:
# Apply log transformation to target variable (as per assignment requirement)
y = np.log(data['selling_price'])
X = data.drop('selling_price', axis=1)

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nFeatures: {list(X.columns)}")
print("\nTarget (log-transformed) statistics:")
print(f"Mean: {y.mean():.4f}")
print(f"Std: {y.std():.4f}")

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")

In [None]:
# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Features scaled successfully")
print(f"Training features mean: {X_train_scaled.mean():.6f}")
print(f"Training features std: {X_train_scaled.std():.6f}")
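As a sanity check on what the scaling cell does, the sketch below reproduces `StandardScaler` by hand: each column is centered on the training mean and divided by the training standard deviation (population form, `ddof=0`), and the same training statistics are reused for the test rows. The small arrays are hypothetical values, not rows from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical (year, km_driven) rows.
X_train = np.array([[2010.0, 50000.0], [2015.0, 30000.0], [2020.0, 10000.0]])
X_test = np.array([[2018.0, 20000.0]])

scaler = StandardScaler().fit(X_train)

# Manual standardization using training statistics only (numpy std defaults to ddof=0).
manual_test = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)

print(scaler.transform(X_test))
print(manual_test)
```

Because the test rows are scaled with training statistics, their mean and standard deviation will generally not be exactly 0 and 1.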

## Task 5: Model Training and Evaluation

In [None]:
# Train Linear Regression model
model = LinearRegression()
model.fit(X_train_scaled, y_train)

print("Linear Regression model trained successfully")
print(f"Model coefficients shape: {model.coef_.shape}")
print(f"Model intercept: {model.intercept_:.4f}")

In [None]:
# Make predictions (log scale)
y_train_pred_log = model.predict(X_train_scaled)
y_test_pred_log = model.predict(X_test_scaled)

# Transform back to original scale (as per assignment requirement)
y_train_pred = np.exp(y_train_pred_log)
y_test_pred = np.exp(y_test_pred_log)
y_train_actual = np.exp(y_train)
y_test_actual = np.exp(y_test)

print("Predictions completed and transformed back to original scale")

In [None]:
# Calculate metrics
train_r2 = r2_score(y_train_actual, y_train_pred)
test_r2 = r2_score(y_test_actual, y_test_pred)
train_mse = mean_squared_error(y_train_actual, y_train_pred)
test_mse = mean_squared_error(y_test_actual, y_test_pred)

print("=== Model Performance ===")
print(f"Training R²: {train_r2:.4f}")
print(f"Test R²: {test_r2:.4f}")
print(f"Training MSE: {train_mse:,.0f}")
print(f"Test MSE: {test_mse:,.0f}")
print(f"Training RMSE: {np.sqrt(train_mse):,.0f}")
print(f"Test RMSE: {np.sqrt(test_mse):,.0f}")
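To make the reported metrics concrete, the following sketch computes R² and RMSE by hand and compares them with the scikit-learn functions used above. The prices are made-up numbers chosen only for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted prices on the original scale.
y_true = np.array([300000.0, 450000.0, 600000.0, 900000.0])
y_pred = np.array([320000.0, 430000.0, 650000.0, 800000.0])

# R² = 1 - (residual sum of squares) / (total sum of squares around the mean).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# RMSE is the square root of the mean squared error, in the same units as the price.
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

print(r2_manual, r2_score(y_true, y_pred))
print(rmse_manual, np.sqrt(mean_squared_error(y_true, y_pred)))
```

Since the squared errors are taken on the original price scale, a few expensive cars can dominate both metrics.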

In [None]:
# Feature importance analysis
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'coefficient': model.coef_,
    'abs_coefficient': np.abs(model.coef_)
}).sort_values('abs_coefficient', ascending=False)

print("Feature Importance (by coefficient magnitude):")
print(feature_importance)
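Because the model is fitted on log price with standardized features, a coefficient `b` means that a one-standard-deviation increase in that feature multiplies the predicted price by `exp(b)`, holding the other features fixed. The sketch below converts coefficients to percentage changes; the coefficient values are hypothetical and are not the fitted values from this notebook.

```python
import numpy as np

# Hypothetical coefficients on standardized features for a log-price model.
coefs = {"year": 0.40, "km_driven": -0.05}

# exp(b) is the multiplicative effect on price; subtract 1 for the relative change.
pct_change = {name: (np.exp(b) - 1.0) * 100.0 for name, b in coefs.items()}

for name, pct in pct_change.items():
    print(f"{name}: {pct:+.1f}% per one standard deviation")
```

This reading does not carry over cleanly to label-encoded columns such as `brand`, where the integer codes have no natural order.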

In [None]:
# Visualization of results
plt.figure(figsize=(15, 5))

# Actual vs Predicted
plt.subplot(1, 3, 1)
plt.scatter(y_test_actual, y_test_pred, alpha=0.6)
plt.plot([y_test_actual.min(), y_test_actual.max()], [y_test_actual.min(), y_test_actual.max()], 'r--', lw=2)
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.title(f'Actual vs Predicted (R² = {test_r2:.4f})')

# Residuals
plt.subplot(1, 3, 2)
residuals = y_test_actual - y_test_pred
plt.scatter(y_test_pred, residuals, alpha=0.6)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Price')
plt.ylabel('Residuals')
plt.title('Residual Plot')

# Feature importance
plt.subplot(1, 3, 3)
top_features = feature_importance.head(8)
plt.barh(range(len(top_features)), top_features['abs_coefficient'])
plt.yticks(range(len(top_features)), top_features['feature'])
plt.xlabel('Absolute Coefficient')
plt.title('Top Feature Importance')

plt.tight_layout()
plt.show()

## Task 6: Model Saving

In [None]:
# Save model and preprocessing components
model_artifacts = {
    'model': model,
    'scaler': scaler,
    'label_encoders': label_encoders,
    'feature_names': list(X.columns),
    'metrics': {
        'train_r2': train_r2,
        'test_r2': test_r2,
        'train_mse': train_mse,
        'test_mse': test_mse
    }
}

with open('a1_model_artifacts.pkl', 'wb') as f:
    pickle.dump(model_artifacts, f)

print("Model artifacts saved to 'a1_model_artifacts.pkl'")
print(f"Saved components: {list(model_artifacts.keys())}")
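A saved artifact dictionary like the one above can be reloaded and used for a single prediction by applying the steps in the same order: encode, select features, scale, predict, then exponentiate. The sketch below is self-contained, so it builds a miniature set of artifacts from hypothetical data and writes them to a temporary file rather than reading `a1_model_artifacts.pkl`. The toy rows, the file name `toy_artifacts.pkl`, and the reduced feature list are assumptions for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical miniature training set; the notebook itself uses Cars.csv.
train = pd.DataFrame({
    "year": [2012, 2015, 2018, 2020, 2016, 2014],
    "max_power": [70.0, 88.0, 110.0, 120.0, 90.0, 75.0],
    "brand": ["Maruti", "Honda", "BMW", "BMW", "Honda", "Maruti"],
    "selling_price": [250000, 450000, 2500000, 3500000, 600000, 300000],
})

encoder = LabelEncoder()
train["brand"] = encoder.fit_transform(train["brand"])
feature_names = ["year", "max_power", "brand"]
scaler = StandardScaler().fit(train[feature_names])
model = LinearRegression().fit(
    scaler.transform(train[feature_names]), np.log(train["selling_price"])
)

artifacts = {
    "model": model,
    "scaler": scaler,
    "label_encoders": {"brand": encoder},
    "feature_names": feature_names,
}

# Round-trip through pickle, as the saving cell does.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_artifacts.pkl")
    with open(path, "wb") as f:
        pickle.dump(artifacts, f)
    with open(path, "rb") as f:
        loaded = pickle.load(f)

# Predict for one new car, repeating the training-time preprocessing.
new_car = pd.DataFrame([{"year": 2017, "max_power": 95.0, "brand": "Honda"}])
new_car["brand"] = loaded["label_encoders"]["brand"].transform(new_car["brand"])
X_new = loaded["scaler"].transform(new_car[loaded["feature_names"]])
predicted_price = float(np.exp(loaded["model"].predict(X_new)[0]))

print(f"Predicted price: {predicted_price:,.0f}")
```

`LabelEncoder.transform` raises an error for a category it did not see during fitting, so a deployed version would need to handle unknown brands explicitly. Pickle files should only be loaded from trusted sources.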

## Task 7: Analysis and Discussion

### Results Analysis

**Model Performance:**
The test R² printed in Task 5 is the fraction of variance in car prices that the Linear Regression model explains on held-out data, measured on the original price scale after the log predictions are transformed back. Comparing it with the training R² from the same cell shows how well this simple baseline generalizes.

**Feature Importance:**
Based on the coefficient analysis, the most important features for predicting car prices are:
1. **Year**: Newer cars tend to have higher prices, showing strong positive correlation
2. **Engine size**: Larger engines typically indicate more powerful and expensive cars
3. **Max Power**: Higher power output directly correlates with premium pricing
4. **Brand**: Certain brands command premium prices due to reputation and quality
5. **Mileage**: The effect of fuel efficiency on price is mixed and depends on the car segment, so its sign should be interpreted with care

**Data Preprocessing Impact:**
The log transformation of the target variable was crucial for stabilizing the model's predictions, as car prices have a wide range and skewed distribution. Removing CNG/LPG vehicles and Test Drive Cars helped focus the model on the main market segments with consistent pricing patterns.
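The effect of the log transformation on a skewed distribution can be illustrated with synthetic data. The sketch below draws hypothetical prices from a log-normal distribution (an assumption made for illustration, not a claim about `Cars.csv`) and compares the sample skewness before and after taking logs.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed prices drawn from a log-normal distribution.
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=13.0, sigma=0.8, size=5000))

skew_raw = prices.skew()          # strongly positive for a long right tail
skew_log = np.log(prices).skew()  # close to zero after the transform

print(f"Skewness of raw prices: {skew_raw:.2f}")
print(f"Skewness of log prices: {skew_log:.2f}")
```

The same two lines applied to `data['selling_price']` would quantify what the histograms in Task 3 show visually.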

**Model Limitations:**
Linear regression assumes linear relationships between features and target, which may not capture complex interactions in car pricing. The residual plot suggests some heteroscedasticity, so more flexible models might capture the pricing dynamics better. Additionally, categorical features such as brand were label-encoded. `LabelEncoder` assigns integer codes in sorted order of the labels, so the linear model treats brand as an ordered numeric scale that has nothing to do with brand hierarchy or premium positioning.
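One common alternative to label encoding for a linear model is one-hot encoding, which gives each category its own 0/1 column and therefore its own coefficient. The sketch below uses `pd.get_dummies` on a hypothetical toy frame; it is not part of the submitted model.

```python
import pandas as pd

# Hypothetical toy rows.
toy = pd.DataFrame({
    "brand": ["Maruti", "BMW", "Honda", "BMW"],
    "year": [2014, 2019, 2016, 2020],
})

# drop_first=True drops one category as the baseline to avoid perfectly collinear columns.
encoded = pd.get_dummies(toy, columns=["brand"], drop_first=True, dtype=int)

print(encoded)
```

With many brands this adds many columns, which is one reason to pair it with regularization.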

**Recommendations for Improvement:**
Future iterations could benefit from polynomial features, regularization techniques, or ensemble methods to better capture non-linear relationships and reduce overfitting. Feature engineering could include interaction terms between year and brand, or mileage and engine size, which are likely important for pricing decisions.
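A minimal sketch of two of these ideas, interaction terms and regularization, is shown below as a scikit-learn pipeline. The data is synthetic with a known interaction between the first two features, and the `alpha` value is an arbitrary assumption, not a tuned setting for the car dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data: the target depends on an interaction between features 0 and 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (
    1.0
    + X[:, 0]
    + 0.5 * X[:, 1]
    + 0.8 * X[:, 0] * X[:, 1]
    + rng.normal(scale=0.1, size=200)
)

# Scale, add pairwise interaction terms, then fit an L2-regularized linear model.
pipeline = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    Ridge(alpha=1.0),
)
pipeline.fit(X[:150], y[:150])
r2_holdout = pipeline.score(X[150:], y[150:])

print(f"Hold-out R²: {r2_holdout:.3f}")
```

Wrapping the steps in a pipeline keeps the scaler and the feature expansion fitted on training rows only, and the whole object can be pickled as a single artifact.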