# **Cars 4 You | Machine Learning Project**

`This project was developed as part of the Machine Learning course, Master's in Advanced Analytics and Data Science, at NOVA IMS University.`


The objective of this project is to develop a regression model capable of predicting the price of a car based on its technical and market characteristics.

The problem is treated as a supervised regression task, where the target variable is the price of the car and the predictor variables include attributes such as brand, model, year of manufacture, transmission, mileage, miles per gallon, fuel type, engine size, and other relevant specifications.

The main evaluation metric will be MAE (Mean Absolute Error), complemented by RMSE (Root Mean Squared Error) and R² to measure the quality of the fit. The goal is to obtain the lowest possible MAE, so that the model's predictions deviate as little as possible, on average, from the actual prices.
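As a rough illustration of how the three metrics relate, the sketch below computes them with NumPy only. The function name `regression_metrics` and the four example prices are made up for this illustration and are not taken from the project data.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R² for two equally sized sequences (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))            # average absolute deviation
    rmse = np.sqrt(np.mean(errors ** 2))     # penalises large errors more than MAE
    ss_res = np.sum(errors ** 2)             # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot               # share of variance explained

    return {"MAE": mae, "RMSE": rmse, "R2": r2}

# Made-up prices, for illustration only
y_true = [10000, 15000, 20000, 25000]
y_pred = [11000, 14000, 20000, 27000]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

Because RMSE squares the errors before averaging, a single large miss raises it more than it raises MAE, which is one reason to report both.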

### **Import Libraries**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### **Load Data**

In [2]:
df = pd.read_csv('../data/train.csv')

### **Data Understanding**

In [4]:
df.head()

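|   | carID | Brand | model | year | price | transmission | mileage | fuelType | tax | mpg | engineSize | paintQuality% | previousOwners | hasDamage |
|---|-------|-------|-------|------|-------|--------------|---------|----------|-----|-----|------------|---------------|----------------|-----------|
| 0 | 69512 | VW | Golf | 2016.0 | 22290 | Semi-Auto | 28421.0 | Petrol | NaN | 11.417268 | 2.0 | 63.0 | 4.0 | 0.0 |
| 1 | 53000 | Toyota | Yaris | 2019.0 | 13790 | Manual | 4589.0 | Petrol | 145.0 | 47.9 | 1.5 | 50.0 | 1.0 | 0.0 |
| 2 | 6366 | Audi | Q2 | 2019.0 | 24990 | Semi-Auto | 3624.0 | Petrol | 145.0 | 40.9 | 1.5 | 56.0 | 4.0 | 0.0 |
| 3 | 29021 | Ford | FIESTA | 2018.0 | 12500 | anual | 9102.0 | Petrol | 145.0 | 65.7 | 1.0 | 50.0 | -2.340306 | 0.0 |
| 4 | 10062 | BMW | 2 Series | 2019.0 | 22995 | Manual | 1000.0 | Petrol | 145.0 | 42.8 | 1.5 | 97.0 | 3.0 | 0.0 |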


In [5]:
df.describe().T

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|-------|------|-----|-----|-----|-----|-----|-----|
| carID | 75973.0 | 37986.0 | 21931.660338 | 0.0 | 18993.0 | 37986.0 | 56979.0 | 75972.0 |
| year | 74482.0 | 2017.096611 | 2.208704 | 1970.0 | 2016.0 | 2017.0 | 2019.0 | 2024.121759 |
| price | 75973.0 | 16881.889553 | 9736.926322 | 450.0 | 10200.0 | 14699.0 | 20950.0 | 159999.0 |
| mileage | 74510.0 | 23004.184088 | 22129.788366 | -58540.574478 | 7423.25 | 17300.0 | 32427.5 | 323000.0 |
| tax | 68069.0 | 120.329078 | 65.521176 | -91.12163 | 125.0 | 145.0 | 145.0 | 580.0 |
| mpg | 68047.0 | 55.152666 | 16.497837 | -43.421768 | 46.3 | 54.3 | 62.8 | 470.8 |
| engineSize | 74457.0 | 1.660136 | 0.573462 | -0.103493 | 1.2 | 1.6 | 2.0 | 6.6 |
| paintQuality% | 74449.0 | 64.590667 | 21.021065 | 1.638913 | 47.0 | 65.0 | 82.0 | 125.594308 |
| previousOwners | 74423.0 | 1.99458 | 1.472981 | -2.34565 | 1.0 | 2.0 | 3.0 | 6.258371 |
| hasDamage | 74425.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


- **Price**: there are outliers, given the large jump from the 3rd quartile (20,950) to the maximum (159,999). The mean (16,882) sits above the median (14,699), so the distribution is right-skewed.
- **Mileage**: there are outliers, given the large jump from the 3rd quartile (32,427.5) to the maximum (323,000), and the distribution is right-skewed (mean 23,004 vs. median 17,300). There are also negative values, which need to be handled.
- **Tax**: there are negative values; we need to investigate why and whether they make sense in this context.
- **MPG**: there are negative values, which need to be handled.
- **Engine Size**: there are negative values, which need to be handled.
- **PaintQuality%**: there are values above 100%; we need to check these cases and see whether they make sense.
- **previousOwners**: there are negative and non-integer values, which need to be handled as well.
- **hasDamage**: every non-null value is 0, so this looks like a constant feature; we need to confirm that this is the case.
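The checks listed above could be turned into counts along the following lines. This is only a sketch: the helper `count_invalid_values` and the three-row `toy` frame are hypothetical, the thresholds are one possible reading of "invalid", and the column names are assumed to match the dataset shown earlier.

```python
import pandas as pd

def count_invalid_values(frame):
    """Count rows that break simple plausibility rules (hypothetical helper)."""
    owners = frame["previousOwners"]
    checks = {
        "mileage < 0": frame["mileage"] < 0,
        "tax < 0": frame["tax"] < 0,
        "mpg < 0": frame["mpg"] < 0,
        "engineSize < 0": frame["engineSize"] < 0,
        "paintQuality% > 100": frame["paintQuality%"] > 100,
        "previousOwners < 0": owners < 0,
        # Missing values are excluded so that NaN is not counted as "not whole"
        "previousOwners not whole": owners.notna() & (owners % 1 != 0),
    }
    return pd.Series({name: int(mask.sum()) for name, mask in checks.items()})

# Toy frame for illustration only; the real checks would run on df
toy = pd.DataFrame({
    "mileage": [28421.0, -58540.57, None],
    "tax": [145.0, -91.12, None],
    "mpg": [47.9, -43.42, 40.9],
    "engineSize": [2.0, -0.10, 1.5],
    "paintQuality%": [63.0, 125.59, 50.0],
    "previousOwners": [4.0, -2.34, 1.0],
    "hasDamage": [0.0, 0.0, None],
})

report = count_invalid_values(toy)
# A column is constant if it has a single distinct non-null value
has_damage_is_constant = toy["hasDamage"].nunique(dropna=True) == 1
print(report)
print("hasDamage constant:", has_damage_is_constant)
```

Comparisons against `NaN` evaluate to `False` in pandas, so missing entries are not counted as violations here and would need to be handled separately.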


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75973 entries, 0 to 75972
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   carID           75973 non-null  int64  
 1   Brand           74452 non-null  object 
 2   model           74456 non-null  object 
 3   year            74482 non-null  float64
 4   price           75973 non-null  int64  
 5   transmission    74451 non-null  object 
 6   mileage         74510 non-null  float64
 7   fuelType        74462 non-null  object 
 8   tax             68069 non-null  float64
 9   mpg             68047 non-null  float64
 10  engineSize      74457 non-null  float64
 11  paintQuality%   74449 non-null  float64
 12  previousOwners  74423 non-null  float64
 13  hasDamage       74425 non-null  float64
dtypes: float64(8), int64(2), object(4)
memory usage: 8.1+ MB


In [7]:
df.isna().sum()

carID                0
Brand             1521
model             1517
year              1491
price                0
transmission      1522
mileage           1463
fuelType          1511
tax               7904
mpg               7926
engineSize        1516
paintQuality%     1524
previousOwners    1550
hasDamage         1548
dtype: int64
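
Relative to the 75,973 rows, `tax` and `mpg` are each missing in roughly 10% of the records, while the other incomplete columns are missing in about 2%. A small sketch of how such shares could be tabulated is shown below; the helper `missing_summary` and the `toy` frame are hypothetical and only illustrate the calculation.

```python
import pandas as pd

def missing_summary(frame):
    """Tabulate missing counts and percentages per column (hypothetical helper)."""
    n_missing = frame.isna().sum()
    pct_missing = (n_missing / len(frame) * 100).round(2)
    summary = pd.DataFrame({"n_missing": n_missing, "pct_missing": pct_missing})
    # Keep only columns with at least one missing value, most affected first
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)

# Toy frame for illustration only; the real call would be missing_summary(df)
toy = pd.DataFrame({
    "price": [22290, 13790, 24990, 12500],
    "tax": [None, 145.0, None, 145.0],
    "mpg": [11.4, 47.9, None, 65.7],
})

summary = missing_summary(toy)
print(summary)
```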