In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [25]:
df = pd.read_csv("/kaggle/input/car-price-dataset/car_price_dataset.csv")
df

Unnamed: 0,Brand,Model,Year,Engine_Size,Fuel_Type,Transmission,Mileage,Doors,Owner_Count,Price
0,Kia,Rio,2020,4.2,Diesel,Manual,289944,3,5,8501
1,Chevrolet,Malibu,2012,2.0,Hybrid,Automatic,5356,2,3,12092
2,Mercedes,GLA,2020,4.2,Diesel,Automatic,231440,4,2,11171
3,Audi,Q5,2023,2.0,Electric,Manual,160971,2,1,11780
4,Volkswagen,Golf,2003,2.6,Hybrid,Semi-Automatic,286618,3,3,2867
...,...,...,...,...,...,...,...,...,...,...
9995,Kia,Optima,2004,3.7,Diesel,Semi-Automatic,5794,2,4,8884
9996,Chevrolet,Impala,2002,1.4,Electric,Automatic,168000,2,1,6240
9997,BMW,3 Series,2010,3.0,Petrol,Automatic,86664,5,1,9866
9998,Ford,Explorer,2002,1.4,Hybrid,Automatic,225772,4,1,4084


The first step is to check the data for any missing values or errors.

In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Brand         10000 non-null  object 
 1   Model         10000 non-null  object 
 2   Year          10000 non-null  int64  
 3   Engine_Size   10000 non-null  float64
 4   Fuel_Type     10000 non-null  object 
 5   Transmission  10000 non-null  object 
 6   Mileage       10000 non-null  int64  
 7   Doors         10000 non-null  int64  
 8   Owner_Count   10000 non-null  int64  
 9   Price         10000 non-null  int64  
dtypes: float64(1), int64(5), object(4)
memory usage: 781.4+ KB
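
The `Non-Null Count` column above is what shows the data is complete. As a minimal sketch, missing values can also be counted per column directly with `isna().sum()`. The `toy` frame below is hypothetical data made up for illustration, not part of the car price dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data) with one missing value in each column
toy = pd.DataFrame({
    "Mileage": [120000, np.nan, 45000],
    "Fuel_Type": ["Diesel", "Petrol", None],
})

# isna() marks missing cells as True; sum() then counts them per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

On a complete dataset such as ours, every count would be 0.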


Luckily, there is no missing data: every column has 10000 non-null values. Next, we want to check whether the values themselves are plausible.

In [27]:
print(df.describe().T)
print(df[['Fuel_Type', 'Transmission']].apply(pd.Series.unique))

               count          mean           std     min       25%       50%  \
Year         10000.0    2011.54370      6.897699  2000.0   2006.00    2012.0   
Engine_Size  10000.0       3.00056      1.149324     1.0      2.00       3.0   
Mileage      10000.0  149239.11180  86322.348957    25.0  74649.25  149587.0   
Doors        10000.0       3.49710      1.110097     2.0      3.00       3.0   
Owner_Count  10000.0       2.99110      1.422682     1.0      2.00       3.0   
Price        10000.0    8852.96440   3112.596810  2000.0   6646.00    8858.5   

                  75%       max  
Year           2017.0    2023.0  
Engine_Size       4.0       5.0  
Mileage      223577.5  299947.0  
Doors             4.0       5.0  
Owner_Count       4.0       5.0  
Price         11086.5   18301.0  
Fuel_Type        [Diesel, Hybrid, Electric, Petrol]
Transmission    [Manual, Automatic, Semi-Automatic]
dtype: object


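All of our data seems to be in order: the numeric ranges are plausible (years from 2000 to 2023, prices from 2000 to 18301) and the categorical columns contain only the expected labels.

Considering we have a limited number of columns, and each could be relevant to the price of a car, we will keep them all to build our model.

However, to simplify the work, we will merge the Brand and Model columns into the Model column.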

In [28]:
df['Model'] = df['Brand'] + ' ' + df['Model']  # Concatenate Brand and Model
df.drop(columns=['Brand'], inplace=True)  # Drop the Brand column
df

Unnamed: 0,Model,Year,Engine_Size,Fuel_Type,Transmission,Mileage,Doors,Owner_Count,Price
0,Kia Rio,2020,4.2,Diesel,Manual,289944,3,5,8501
1,Chevrolet Malibu,2012,2.0,Hybrid,Automatic,5356,2,3,12092
2,Mercedes GLA,2020,4.2,Diesel,Automatic,231440,4,2,11171
3,Audi Q5,2023,2.0,Electric,Manual,160971,2,1,11780
4,Volkswagen Golf,2003,2.6,Hybrid,Semi-Automatic,286618,3,3,2867
...,...,...,...,...,...,...,...,...,...
9995,Kia Optima,2004,3.7,Diesel,Semi-Automatic,5794,2,4,8884
9996,Chevrolet Impala,2002,1.4,Electric,Automatic,168000,2,1,6240
9997,BMW 3 Series,2010,3.0,Petrol,Automatic,86664,5,1,9866
9998,Ford Explorer,2002,1.4,Hybrid,Automatic,225772,4,1,4084


For our model, we need every value to be numerical. That's why we are going to apply label encoding to the 'Model', 'Fuel_Type' and 'Transmission' columns, meaning we assign an integer to each distinct text value in a column.

In [29]:
from sklearn.preprocessing import LabelEncoder

# Apply Label Encoding, keeping one fitted encoder per column so the
# integer codes can be mapped back to the original labels later
encoders = {}
for col in ['Model', 'Fuel_Type', 'Transmission']:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])
df

Unnamed: 0,Model,Year,Engine_Size,Fuel_Type,Transmission,Mileage,Doors,Owner_Count,Price
0,19,2020,4.2,0,1,289944,3,5,8501
1,8,2012,2.0,2,0,5356,2,3,12092
2,23,2020,4.2,0,0,231440,4,2,11171
3,2,2023,2.0,1,1,160971,2,1,11780
4,27,2003,2.6,2,2,286618,3,3,2867
...,...,...,...,...,...,...,...,...,...
9995,18,2004,3.7,0,2,5794,2,4,8884
9996,7,2002,1.4,1,0,168000,2,1,6240
9997,3,2010,3.0,3,0,86664,5,1,9866
9998,9,2002,1.4,2,0,225772,4,1,4084
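
To make the integer codes in the table above easier to read, here is a small sketch of how `LabelEncoder` assigns them: it sorts the distinct labels and numbers them from 0. The list `fuel_demo` is an illustrative sample, not the real column.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative sample of fuel types (not the full dataset column)
fuel_demo = ["Diesel", "Hybrid", "Electric", "Petrol", "Diesel"]

enc = LabelEncoder()
codes = enc.fit_transform(fuel_demo)

# classes_ holds the sorted labels; a label's position is its code
mapping = {label: code for code, label in enumerate(enc.classes_)}
print(mapping)

# inverse_transform recovers the original text labels from the codes
decoded = enc.inverse_transform(codes)
print(list(decoded))
```

This matches the table: Diesel becomes 0, Electric 1, Hybrid 2 and Petrol 3. The codes carry no real ordering between categories; they are only identifiers.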


Now that our data is ready, we can start working on our model. As mentioned earlier, we are going to keep every column to build our first model.

In [30]:
from sklearn.model_selection import train_test_split

X = df.drop(columns=['Price'])
y = df['Price']

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=25)

In [37]:
from xgboost import XGBRegressor

my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05)
my_model.fit(X_train, y_train)

In [38]:
y_pred = my_model.predict(X_valid)

To analyze the results of our model, we will use two metrics:

- Mean Absolute Error (MAE): the average absolute difference between our model's predicted prices and the real prices (the lower the better).
- R² Score: the coefficient of determination. It measures how much of the variation in the actual prices is explained by the model's predictions (the closer to 1 the better).
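
As a sketch of what these two metrics compute, here they are written out by hand with NumPy. The helper names `mae_by_hand` and `r2_by_hand` and the three demo prices are made up for illustration; in the notebook itself we rely on the scikit-learn functions.

```python
import numpy as np

def mae_by_hand(y_true, y_pred):
    """Mean of the absolute differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

def r2_by_hand(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical prices, chosen only to make the arithmetic easy to follow
y_true_demo = [8000, 10000, 12000]
y_pred_demo = [8100, 9900, 12200]

mae_demo = mae_by_hand(y_true_demo, y_pred_demo)  # (100 + 100 + 200) / 3
r2_demo = r2_by_hand(y_true_demo, y_pred_demo)    # 1 - 60000 / 8000000
print(mae_demo, r2_demo)
```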

In [43]:
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score

mae = mean_absolute_error(y_valid, y_pred)
r2 = r2_score(y_valid, y_pred)

print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"R² Score: {r2}")

Mean Absolute Error (MAE): 104.38
R² Score: 0.9981418011559952


To make sure these results are not due to one particular split, we use something called cross-validation. It divides the data into several folds and, in turn, trains the model on all folds but one and tests it on the remaining fold, so every row is used for testing exactly once. This ensures our model didn't encounter a lucky or unlucky split, where a specific combination of training and testing data makes the model look better or worse than it actually is.
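
The fold mechanism can be sketched on ten dummy rows with scikit-learn's `KFold`. The array `X_demo` is a placeholder, and `shuffle=False` is used here because, to my understanding, that is how an integer `cv` behaves for a regressor in `cross_val_score`.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten dummy rows standing in for the dataset (placeholder data)
X_demo = np.arange(10).reshape(-1, 1)

# Five folds without shuffling: each fold is a contiguous block of rows
kf = KFold(n_splits=5, shuffle=False)
folds = list(kf.split(X_demo))

for i, (train_idx, test_idx) in enumerate(folds):
    # Each iteration trains on 8 rows and tests on the 2 held-out rows
    print(f"fold {i}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Passing `shuffle=True` with a `random_state` to `KFold` would randomize which rows land in each fold while still testing each row exactly once.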

In [45]:
from sklearn.model_selection import cross_val_score

# Multiply by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(my_model, X, y,
                              cv=5,
                              scoring='neg_mean_absolute_error')

print("MAE scores:\n", scores.mean())

scores = cross_val_score(my_model, X, y,
                              cv=5,
                              scoring='r2')

print("R² scores:\n", scores.mean())

MAE scores:
 103.53246778564453
R² scores:
 0.998142675386379


Given that the mean values of our metrics are similar to our first test, we can assume that our model's performance does not depend on one particular split.

Now, if we look at our metrics, we see an R² score very close to 1, meaning our model's predictions almost match the actual prices of the cars. Likewise, our MAE is about 104, which indicates that on average there is a difference of roughly 100 dollars between the predicted price and the actual price. Given that the average price of a car is around 9000 dollars, this is about 1% of the price, which makes our model pretty good.

We can conclude that our model is accurate at predicting car prices on this dataset.
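
The comparison between the error and the average price can be made explicit with a quick calculation. The numbers below are copied from the outputs above (rounded); the variable names are only illustrative.

```python
# Values taken from the notebook outputs above (rounded)
mae_holdout = 104.38   # MAE on the single validation split
mae_cv = 103.53        # mean MAE over the 5 cross-validation folds
mean_price = 8852.96   # mean of the Price column from df.describe()

# Express each error as a fraction of the mean price
rel_holdout = mae_holdout / mean_price
rel_cv = mae_cv / mean_price

print(f"Hold-out MAE relative to mean price: {rel_holdout:.2%}")
print(f"Cross-validated MAE relative to mean price: {rel_cv:.2%}")
```

Both ratios come out at a little over 1%.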