# **Car Price Prediction Project**

## **Project Introduction**
The Car Price Prediction project is a **supervised machine learning task** aimed at predicting the selling price of cars based on multiple features such as year of manufacture, mileage, fuel type, engine size, transmission, and ownership history.  

This project helps to:  
- Understand the key factors affecting car prices.  
- Build a predictive model to estimate car prices accurately.  
- Apply regression techniques to real-world datasets for practical insights.  

The dataset used here includes attributes such as `brand`, `model`, `model_year`, `milage`, `fuel_type`, `engine`, `transmission`, `accident`, and `clean_title`, with `price` as the target. By analyzing and modeling this data, we can estimate prices for cars on the market and help buyers and sellers make decisions.


In [1]:
#Importing libraries 

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 

from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

import joblib 
import warnings 
warnings.filterwarnings('ignore')



In [6]:
import kagglehub
import os

# Download latest version
path = kagglehub.dataset_download("taeefnajib/used-car-price-prediction-dataset")

print("Path to dataset files:", path)

# Build the CSV path inside the download directory (adjust the file name if it differs)
csv_file = os.path.join(path, "used_cars.csv")
data = pd.read_csv(csv_file)

print(data.head())


Path to dataset files: /home/saroj/.cache/kagglehub/datasets/taeefnajib/used-car-price-prediction-dataset/versions/1
      brand                            model  model_year      milage  \
0      Ford  Utility Police Interceptor Base        2013  51,000 mi.   
1   Hyundai                     Palisade SEL        2021  34,742 mi.   
2     Lexus                    RX 350 RX 350        2022  22,372 mi.   
3  INFINITI                 Q50 Hybrid Sport        2015  88,900 mi.   
4      Audi        Q3 45 S line Premium Plus        2021   9,835 mi.   

       fuel_type                                             engine  \
0  E85 Flex Fuel  300.0HP 3.7L V6 Cylinder Engine Flex Fuel Capa...   
1       Gasoline                               3.8L V6 24V GDI DOHC   
2       Gasoline                                     3.5 Liter DOHC   
3         Hybrid  354.0HP 3.5L V6 Cylinder Engine Gas/Electric H...   
4       Gasoline                         2.0L I4 16V GDI DOHC Turbo   

        transmission   

In [7]:
data.head()

| | brand | model | model_year | milage | fuel_type | engine | transmission | ext_col | int_col | accident | clean_title | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ford | Utility Police Interceptor Base | 2013 | 51,000 mi. | E85 Flex Fuel | 300.0HP 3.7L V6 Cylinder Engine Flex Fuel Capa... | 6-Speed A/T | Black | Black | At least 1 accident or damage reported | Yes | $10,300 |
| 1 | Hyundai | Palisade SEL | 2021 | 34,742 mi. | Gasoline | 3.8L V6 24V GDI DOHC | 8-Speed Automatic | Moonlight Cloud | Gray | At least 1 accident or damage reported | Yes | $38,005 |
| 2 | Lexus | RX 350 RX 350 | 2022 | 22,372 mi. | Gasoline | 3.5 Liter DOHC | Automatic | Blue | Black | None reported | | $54,598 |
| 3 | INFINITI | Q50 Hybrid Sport | 2015 | 88,900 mi. | Hybrid | 354.0HP 3.5L V6 Cylinder Engine Gas/Electric H... | 7-Speed A/T | Black | Black | None reported | Yes | $15,500 |
| 4 | Audi | Q3 45 S line Premium Plus | 2021 | 9,835 mi. | Gasoline | 2.0L I4 16V GDI DOHC Turbo | 8-Speed Automatic | Glacier White Metallic | Black | None reported | | $34,999 |


In [8]:
data.describe()

| | model_year |
|---|---|
| count | 4009.0 |
| mean | 2015.51559 |
| std | 6.104816 |
| min | 1974.0 |
| 25% | 2012.0 |
| 50% | 2017.0 |
| 75% | 2020.0 |
| max | 2024.0 |


In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4009 entries, 0 to 4008
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   brand         4009 non-null   object
 1   model         4009 non-null   object
 2   model_year    4009 non-null   int64 
 3   milage        4009 non-null   object
 4   fuel_type     3839 non-null   object
 5   engine        4009 non-null   object
 6   transmission  4009 non-null   object
 7   ext_col       4009 non-null   object
 8   int_col       4009 non-null   object
 9   accident      3896 non-null   object
 10  clean_title   3413 non-null   object
 11  price         4009 non-null   object
dtypes: int64(1), object(11)
memory usage: 376.0+ KB


In [10]:
data.shape

(4009, 12)
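
The `data.info()` output shows that three columns have fewer than 4009 non-null entries: `fuel_type` (170 missing), `accident` (113 missing), and `clean_title` (596 missing). One way to check this directly is sketched below. The `sample` frame is a hypothetical stand-in built from the five rows shown in `data.head()`, so the counts it prints apply only to those rows; on the real data the same two calls would be made on `data`.

```python
import pandas as pd

# Hypothetical stand-in: three columns from the five rows shown by data.head()
sample = pd.DataFrame({
    "brand": ["Ford", "Hyundai", "Lexus", "INFINITI", "Audi"],
    "accident": [
        "At least 1 accident or damage reported",
        "At least 1 accident or damage reported",
        "None reported",
        "None reported",
        "None reported",
    ],
    "clean_title": ["Yes", "Yes", None, "Yes", None],
})

# Number of missing values per column
missing_counts = sample.isna().sum()

# Fraction of missing values per column
missing_share = sample.isna().mean()

print(missing_counts)
print(missing_share)
```

Looking at the share as well as the raw count helps decide whether a column should be imputed or dropped.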

### Data cleaning 

In [None]:
# Remove "$" and "," from price and convert it to a numeric value
data['price'] = data['price'].str.replace(r'[\$,]', '', regex=True).astype(float)

# Remove " mi." and "," from milage and convert it to a numeric value
data['milage'] = (
    data['milage']
    .str.replace(' mi.', '', regex=False)
    .str.replace(',', '', regex=False)
    .astype(float)
)

# Fill missing values in clean_title with 'No'
data['clean_title'] = data['clean_title'].fillna('No')

# Simplify accident info: 0 if the text contains 'None', otherwise 1
# (note: missing values are also mapped to 1 by this rule)
data['accident'] = data['accident'].apply(lambda x: 0 if 'None' in str(x) else 1)

# Drop columns that will not be used for prediction
data = data.drop(columns=['engine', 'ext_col', 'int_col'])
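
The string-to-number conversion can be tried in isolation on a few values before running it on the full frame. The sketch below is a minimal, self-contained example: the `sample` frame reuses values from the `data.head()` output, and `clean_numeric` is a hypothetical helper, not part of the original notebook. It assumes every value contains exactly one number, optionally with thousands separators.

```python
import pandas as pd

# Hypothetical sample values copied from the data.head() output
sample = pd.DataFrame({
    "milage": ["51,000 mi.", "34,742 mi.", "9,835 mi."],
    "price": ["$10,300", "$38,005", "$34,999"],
})


def clean_numeric(series: pd.Series) -> pd.Series:
    """Strip thousands separators, pull out the first number, and cast to float."""
    without_commas = series.str.replace(",", "", regex=False)
    # Capture digits with an optional decimal part; units such as " mi." or "$" are ignored
    number_text = without_commas.str.extract(r"(\d+(?:\.\d+)?)")[0]
    return number_text.astype(float)


sample_clean = sample.apply(clean_numeric)
print(sample_clean)
```

Extracting the number rather than deleting a specific suffix makes the step less sensitive to a mistyped unit string.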