<h1>Estimating Car Prices using Multiple Linear Regression</h1>

#### Introduction:

In the rapidly evolving automobile market, understanding the factors that influence car prices is crucial for both sellers and buyers. This case study aims to predict the selling price of used cars based on various features such as the car's year, mileage, engine specifications, and other relevant attributes. By leveraging statistical modeling techniques, particularly Multiple Linear Regression (MLR), we can identify the key determinants of car prices and make informed predictions that assist in decision-making processes.

#### Dataset Overview:

The dataset for this case study, `car_price_prediction.csv`, contains features of used cars that are believed to influence their selling prices. It includes the following columns:

-   **name**: The name of the car model.
-   **year**: The year the car was manufactured.
-   **selling_price**: The price at which the car is sold.
-   **km_driven**: The total kilometers the car has been driven.
-   **fuel**: The type of fuel used by the car (e.g., Petrol, Diesel, LPG).
-   **seller_type**: The type of seller (e.g., Individual, Dealer).
-   **transmission**: The type of transmission (e.g., Manual, Automatic).
-   **owner**: The ownership status (e.g., First Owner, Second Owner).
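-   **mileage(km/ltr/kg)**: The fuel efficiency of the car in kilometers per liter (or kilometers per kilogram, depending on the fuel). The parenthesized unit is part of the column name in the file.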
-   **engine**: The engine size of the car in cubic centimeters (cc).
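-   **max_power**: The maximum power output of the car in horsepower (hp). In this file the column is read as text rather than as a number, which matters for the encoding in Step 2.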
-   **seats**: The number of seats in the car.

This dataset provides a comprehensive view of various factors that can affect the selling price of a car, making it an ideal candidate for a regression analysis.

#### Why Choose Multiple Linear Regression (MLR):

Multiple Linear Regression (MLR) is chosen for this dataset due to several compelling reasons:

1.  **Predictive Goal**:

    -   The primary goal is to predict a continuous variable (selling_price) based on multiple predictor variables. MLR is designed for such predictive tasks, making it a suitable choice.
2.  **Nature of the Relationship**:

    -   MLR assumes a linear relationship between the dependent variable (selling_price) and the independent variables (features). Many economic and market-driven quantities are approximately linear over a limited range, so this assumption is a reasonable starting point for our dataset, although it should be checked against the fitted model.
3.  **Simplicity and Interpretability**:

    -   MLR provides clear interpretability of the model coefficients, helping us understand how each feature impacts the selling price. This interpretability is essential for stakeholders who need insights into the determinants of car prices.
4.  **Handling of Multiple Predictors**:

    -   MLR can simultaneously consider multiple predictors, providing a comprehensive model that accounts for the combined effects of various car features on the selling price.
5.  **Computational Efficiency**:

    -   Compared to more complex models (e.g., neural networks, ensemble methods), MLR is computationally efficient and easy to implement. This makes it a practical choice, especially for preliminary analysis and datasets of moderate size.
6.  **Baseline Performance**:

    -   MLR serves as a robust baseline model. Even if more complex models are considered later, MLR provides a benchmark for comparison, ensuring that any additional complexity in the model is justified by significant improvements in predictive performance.
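
Written out, the model behind these points is a weighted sum of the features plus an error term. The symbols below ($\beta_j$, $x_j$, $\varepsilon$, $p$) are standard regression notation and are not names taken from the dataset:

```latex
\text{selling\_price} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon
```

Here $\beta_0$ is the intercept, each $\beta_j$ is the expected change in selling price for a one-unit increase in feature $x_j$ with the other features held fixed, and $\varepsilon$ collects whatever the features do not explain.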

## Steps for Multiple Linear Regression:

#### Step 1: Load the Data

First, load the dataset and perform initial exploratory data analysis (EDA).

In [19]:
import pandas as pd

# Load the dataset
data = pd.read_csv('../Datasets/car_price_prediction.csv')

# Display the first few rows of the dataset
print(data.head())


                           name  year  selling_price  km_driven    fuel  \
0        Maruti Swift Dzire VDI  2014         450000     145500  Diesel   
1  Skoda Rapid 1.5 TDI Ambition  2014         370000     120000  Diesel   
2      Honda City 2017-2020 EXi  2006         158000     140000  Petrol   
3     Hyundai i20 Sportz Diesel  2010         225000     127000  Diesel   
4        Maruti Swift VXI BSIII  2007         130000     120000  Petrol   

  seller_type transmission         owner  mileage(km/ltr/kg)  engine  \
0  Individual       Manual   First Owner               23.40  1248.0   
1  Individual       Manual  Second Owner               21.14  1498.0   
2  Individual       Manual   Third Owner               17.70  1497.0   
3  Individual       Manual   First Owner               23.00  1396.0   
4  Individual       Manual   First Owner               16.10  1298.0   

  max_power  seats  
0        74    5.0  
1    103.52    5.0  
2        78    5.0  
3        90    5.0  
4      88.2
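
Beyond `head()`, a quick look at column types and summary statistics helps spot columns that were parsed as text. The sketch below rebuilds the first four rows printed above as an in-memory frame so that it runs without the CSV; only a subset of columns is included, and `max_power` is stored as strings on the assumption (suggested by the mixed formatting in the output) that pandas reads it as text.

```python
import pandas as pd

# First four rows of the head() output above, subset of columns.
# Assumption: max_power is kept as strings to mimic how the CSV column is parsed.
sample = pd.DataFrame({
    "name": ["Maruti Swift Dzire VDI", "Skoda Rapid 1.5 TDI Ambition",
             "Honda City 2017-2020 EXi", "Hyundai i20 Sportz Diesel"],
    "year": [2014, 2014, 2006, 2010],
    "selling_price": [450000, 370000, 158000, 225000],
    "km_driven": [145500, 120000, 140000, 127000],
    "fuel": ["Diesel", "Diesel", "Petrol", "Diesel"],
    "max_power": ["74", "103.52", "78", "90"],
})

# Column types: anything non-numeric will later be one-hot encoded
print(sample.dtypes)

# Summary statistics for the numeric columns only
summary = sample.describe()
print(summary)

# Names of the non-numeric columns
text_columns = sample.select_dtypes(exclude="number").columns.tolist()
print(text_columns)
```

On the full dataset the same three calls can be applied to `data` directly.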

#### Step 2: Data Cleaning

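Make sure the data is clean and ready for modeling: check for missing values, drop incomplete rows, and one-hot encode the categorical columns. `pd.get_dummies` encodes every non-numeric column, which here includes `name` and `max_power` (read as text from this file), so the encoded frame is very wide. The output shown after the code already lists encoded columns such as `max_power_98.96`, which indicates the cell was run on a frame that had been encoded before.

In [23]:
# Check for missing values
print(data.isnull().sum())

# Drop rows that contain missing values
data = data.dropna()

# One-hot encode all non-numeric columns; drop_first removes one level per column
data = pd.get_dummies(data, drop_first=True)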


year                  0
selling_price         0
km_driven             0
mileage(km/ltr/kg)    0
engine                0
                     ..
max_power_98.96       0
max_power_98.97       0
max_power_99          0
max_power_99.23       0
max_power_99.6        0
Length: 2316, dtype: int64
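
The 2316 columns above come largely from one-hot encoding `name` and `max_power`. One possible alternative, shown here only as a sketch and not as what this notebook ran, is to drop `name` and convert `max_power` to a number before encoding. The small frame, its car names, and the unparseable `"bhp"` entry are hypothetical; the results reported in the later steps were produced without this change.

```python
import pandas as pd

# Hypothetical miniature of the raw data; "bhp" stands in for an unparseable value
raw = pd.DataFrame({
    "name": ["Car A", "Car B", "Car C", "Car D", "Car E"],
    "year": [2014, 2014, 2006, 2010, 2007],
    "fuel": ["Diesel", "Petrol", "Diesel", "Petrol", "Diesel"],
    "transmission": ["Manual", "Manual", "Automatic", "Manual", "Automatic"],
    "max_power": ["74", "103.52", "bhp", "90", "88.2"],
})

# Drop the free-text model name instead of encoding it
clean = raw.drop(columns=["name"])

# Convert max_power to float; values that cannot be parsed become NaN
clean["max_power"] = pd.to_numeric(clean["max_power"], errors="coerce")

# Remove rows with missing values, then encode the remaining text columns
clean = clean.dropna()
encoded = pd.get_dummies(clean, drop_first=True)

print(encoded.columns.tolist())
```

With this approach `max_power` contributes a single numeric column rather than one dummy column per distinct value.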


#### Step 3: Feature Selection

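Select the features used for prediction. Here we keep every column except the target, `selling_price`. Because Step 2 one-hot encoded all text columns, `X` also contains the dummy columns generated from `name` and `max_power`.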

In [5]:
# Define the independent variables (features)
X = data.drop(['selling_price'], axis=1)

# Define the dependent variable (target)
y = data['selling_price']


#### Step 4: Split the Data

Split the data into training and testing sets.

In [6]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
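
To make the effect of `test_size=0.2` and `random_state=42` concrete, here is a minimal sketch on a made-up ten-row array (the `*_demo` names are illustrative and not part of the notebook). It shows the 80/20 split sizes and that a fixed seed reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 rows, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# 20% of 10 rows go to the test set, the remaining 80% to the training set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state gives the same test rows
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(np.array_equal(X_te, X_te_again))
```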


#### Step 5: Train the Model

Train the multiple linear regression model.

In [8]:
from sklearn.linear_model import LinearRegression

# Instantiate the model
model = LinearRegression()

# Fit the model on the training data
model.fit(X_train, y_train)
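
What `fit` estimates can be checked on synthetic data where the true relationship is known. In the sketch below the data, the true intercept (1.0), and the true coefficients (2.0 and -3.0) are all invented for illustration; with no noise added, ordinary least squares should recover them up to floating-point error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data: y = 1 + 2*x1 - 3*x2
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 1.0 + 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1]

demo_model = LinearRegression().fit(X_demo, y_demo)
print(demo_model.intercept_, demo_model.coef_)

# A prediction is the intercept plus the weighted sum of the features
manual = demo_model.intercept_ + X_demo @ demo_model.coef_
print(np.allclose(manual, demo_model.predict(X_demo)))
```

The fitted `model` in this step exposes the same `intercept_` and `coef_` attributes.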


#### Step 6: Evaluate the Model

Evaluate the model's performance on the test set.

In [9]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate evaluation metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5  # RMSE is the square root of MSE; the squared=False argument is deprecated in recent scikit-learn releases
r2 = r2_score(y_test, y_pred)

print(f'MAE: {mae}')
print(f'MSE: {mse}')
print(f'RMSE: {rmse}')
print(f'R^2: {r2}')


MAE: 77595.81669650644
MSE: 24576595630.31565
RMSE: 156769.24325362948
R^2: 0.9671242120611796
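
The four metrics have simple definitions, worked here by hand with NumPy. The three true and predicted values are made up for the example and are not taken from the dataset.

```python
import numpy as np

# Made-up true values and predictions
y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])

errors = y_true - y_hat                  # [1, 0, -2]

mae_demo = np.mean(np.abs(errors))       # mean absolute error: (1 + 0 + 2) / 3 = 1.0
mse_demo = np.mean(errors ** 2)          # mean squared error: (1 + 0 + 4) / 3
rmse_demo = np.sqrt(mse_demo)            # same units as the target

# R^2 compares the residual error with the error of always predicting the mean
ss_res = np.sum(errors ** 2)                      # 5.0
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # 8.0
r2_demo = 1.0 - ss_res / ss_tot                   # 0.375

print(mae_demo, mse_demo, rmse_demo, r2_demo)
```

MAE and RMSE are expressed in the same currency units as `selling_price`, while MSE is in squared units, which is why it is so much larger in the output above.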


#### Step 7: Interpret the Results

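Inspect the model coefficients. Each coefficient is the change in predicted selling price for a one-unit increase in that feature, holding the others fixed; for example, the `year` coefficient of about 34,365 means one additional model year raises the predicted price by roughly that amount. Coefficients from `LinearRegression` show direction and size only; they do not come with p-values, so statistical significance cannot be read from them.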

In [24]:
# Get the model coefficients
coefficients = pd.DataFrame(model.coef_, index=X.columns, columns=['Coefficient'])

# Display the coefficients
print(coefficients)


                      Coefficient
year                 34365.353060
km_driven               -0.392782
mileage(km/ltr/kg)    4705.540085
engine                 512.539428
seats              -151096.131589
...                           ...
max_power_98.96    -142294.939904
max_power_98.97    -135028.520049
max_power_99       -357499.840273
max_power_99.23    -255141.030881
max_power_99.6     -298818.634366

[2315 rows x 1 columns]
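
With 2315 coefficients it can help to rank them by absolute size. The sketch below does this for the five numeric coefficients printed above, copied into a small frame so it runs on its own; `coef_demo`, `ranked`, and `effect_10k_km` are illustrative names. It also rescales the `km_driven` coefficient to a 10,000 km change, since raw coefficient sizes depend on each feature's units.

```python
import pandas as pd

# The five numeric coefficients from the output above
coef_demo = pd.DataFrame(
    {"Coefficient": [34365.353060, -0.392782, 4705.540085, 512.539428, -151096.131589]},
    index=["year", "km_driven", "mileage(km/ltr/kg)", "engine", "seats"],
)

# Order the rows by absolute coefficient size, largest first
order = coef_demo["Coefficient"].abs().sort_values(ascending=False).index
ranked = coef_demo.reindex(order)
print(ranked)

# A per-kilometer coefficient looks tiny; per 10,000 km it is easier to read
effect_10k_km = coef_demo.loc["km_driven", "Coefficient"] * 10_000
print(effect_10k_km)
```

The same ranking can be applied to the full `coefficients` frame, keeping in mind that most of its rows are dummy columns whose values are relative to the dropped reference level.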
