## 📄 Dataset Overview: Used Car Price Prediction
For this project, I am using the Used Car Price Prediction Dataset from Kaggle. It is a collection of real-world used car listings collected from the automobile marketplace https://www.cars.com. The dataset provides 11 feature columns that can be used to predict a car's price.

Total Instances (Rows): 4,009

Total Features (Columns): 11 (excluding the target variable)

Kaggle Dataset: https://www.kaggle.com/datasets/taeefnajib/used-car-price-prediction-dataset/data

### 🔍 Features:

Brand: Name of the manufacturer (e.g., Toyota, Ford, BMW)

Model: Specific Model of the car (e.g. Rover LR4 HSE, RC 350 F Sport)

Model Year: Manufacturing year (YYYY), a key feature for assessing depreciation and technology level

Mileage: Total distance the car has been driven, in miles. A key indicator of the vehicle's wear and tear

Fuel Type: Type of fuel used (e.g., Gasoline, Diesel, Hybrid)

Engine Type: Engine specifications (e.g., V6, 4-Cylinder)

Transmission: Type of transmission (e.g., automatic, manual)

Exterior & Interior Colors: Exterior paint color and interior color of the car

Accident History: Whether any accident or damage has been reported for the vehicle, which matters for an informed purchase decision (Yes/No)

Clean Title: Whether the vehicle has a clean title, which affects its resale value and legal status (Yes/No)
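Although the last two features are described as Yes/No, the raw listings store them as text: `accident` holds phrases such as "None reported", and `clean_title` is either "Yes" or missing. The snippet below is a minimal sketch of one way to turn them into numeric flags. The three-row `sample` frame is hypothetical, and treating a missing `clean_title` as 0 is an assumption, not something the dataset documents.

```python
import pandas as pd

# Hypothetical three-row sample; the strings mirror values seen in the raw listings.
sample = pd.DataFrame({
    "accident": ["At least 1 accident or damage reported", "None reported", None],
    "clean_title": ["Yes", None, "Yes"],
})

# Map the two known phrases to 1/0; missing values stay missing (NaN).
accident_flag = sample["accident"].map({
    "At least 1 accident or damage reported": 1,
    "None reported": 0,
})

# Assumption: a missing clean_title is treated as "not confirmed" (0).
clean_title_flag = (sample["clean_title"] == "Yes").astype(int)

print(accident_flag.tolist())
print(clean_title_flag.tolist())
```

Keeping the missing `accident` values as NaN leaves the imputation decision open for a later step.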

### 🎯 Target Variable:

Price: The listed price of the vehicle in USD (continuous numeric variable)

### 🔧  Prediction Task
Objective: I aim to build a machine learning model that predicts the price of a used car from its features. Such a model can help buyers and sellers estimate the fair value of a vehicle based on historical listing data.

## 1. Load the Data

In [31]:
import pandas as pd                        # Importing libraries
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import re

In [32]:
df = pd.read_csv("used_cars.csv")          # Loading the dataset
df.head()                                  # Preview the first five rows

,brand,model,model_year,milage,fuel_type,engine,transmission,ext_col,int_col,accident,clean_title,price
0,Ford,Utility Police Interceptor Base,2013,"51,000 mi.",E85 Flex Fuel,300.0HP 3.7L V6 Cylinder Engine Flex Fuel Capa...,6-Speed A/T,Black,Black,At least 1 accident or damage reported,Yes,"$10,300"
1,Hyundai,Palisade SEL,2021,"34,742 mi.",Gasoline,3.8L V6 24V GDI DOHC,8-Speed Automatic,Moonlight Cloud,Gray,At least 1 accident or damage reported,Yes,"$38,005"
2,Lexus,RX 350 RX 350,2022,"22,372 mi.",Gasoline,3.5 Liter DOHC,Automatic,Blue,Black,None reported,,"$54,598"
3,INFINITI,Q50 Hybrid Sport,2015,"88,900 mi.",Hybrid,354.0HP 3.5L V6 Cylinder Engine Gas/Electric H...,7-Speed A/T,Black,Black,None reported,Yes,"$15,500"
4,Audi,Q3 45 S line Premium Plus,2021,"9,835 mi.",Gasoline,2.0L I4 16V GDI DOHC Turbo,8-Speed Automatic,Glacier White Metallic,Black,None reported,,"$34,999"


## 2. Explore the Data (EDA)

In [33]:
df.info()                     # Checking datatype for columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4009 entries, 0 to 4008
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   brand         4009 non-null   object
 1   model         4009 non-null   object
 2   model_year    4009 non-null   int64 
 3   milage        4009 non-null   object
 4   fuel_type     3839 non-null   object
 5   engine        4009 non-null   object
 6   transmission  4009 non-null   object
 7   ext_col       4009 non-null   object
 8   int_col       4009 non-null   object
 9   accident      3896 non-null   object
 10  clean_title   3413 non-null   object
 11  price         4009 non-null   object
dtypes: int64(1), object(11)
memory usage: 376.0+ KB
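The non-null counts above imply missing values in three columns: `fuel_type` (4009 - 3839 = 170), `accident` (4009 - 3896 = 113), and `clean_title` (4009 - 3413 = 596). A minimal sketch of how such a summary could be computed directly is shown below. The `demo` frame is a hypothetical stand-in, so the snippet runs without the CSV. On the real data, `df` would take its place.

```python
import pandas as pd

# Hypothetical mini-frame standing in for df.
demo = pd.DataFrame({
    "fuel_type": ["Gasoline", None, "Hybrid", "Gasoline"],
    "accident": ["None reported", "None reported", None, None],
    "price": ["$10,300", "$38,005", "$54,598", "$15,500"],
})

# Count missing values per column and express them as a share of all rows.
missing = demo.isna().sum()
missing_pct = (demo.isna().mean() * 100).round(1)

# Keep only columns that actually have gaps, largest first.
summary = pd.DataFrame({"missing": missing, "percent": missing_pct})
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)
print(summary)
```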


##### We can see that Milage & Price have the Object datatype, but we can convert them to Integer for ease of analysis

In [34]:
df[['milage', 'price']]

,milage,price
0,"51,000 mi.","$10,300"
1,"34,742 mi.","$38,005"
2,"22,372 mi.","$54,598"
3,"88,900 mi.","$15,500"
4,"9,835 mi.","$34,999"
...,...,...
4004,714 mi.,"$349,950"
4005,"10,900 mi.","$53,900"
4006,"2,116 mi.","$90,998"
4007,"33,000 mi.","$62,999"


In [35]:
df['milage'] = df['milage'].replace([',', r' mi\.'], ['', ''], regex=True).astype(int)    # Remove the thousands separator (,) and the " mi." suffix, then cast to int

In [36]:
df.price

0        $10,300
1        $38,005
2        $54,598
3        $15,500
4        $34,999
          ...   
4004    $349,950
4005     $53,900
4006     $90,998
4007     $62,999
4008     $40,000
Name: price, Length: 4009, dtype: object

In [37]:
df['price'] = df['price'].replace(r'[$,]', '', regex=True).astype(int)     # Remove the dollar sign ($) and thousands separator (,), then cast to int

In [38]:
df[['milage', 'price']]

,milage,price
0,51000,10300
1,34742,38005
2,22372,54598
3,88900,15500
4,9835,34999
...,...,...
4004,714,349950
4005,10900,53900
4006,2116,90998
4007,33000,62999
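The two conversions above follow the same pattern, so they could be folded into one helper. The sketch below is one possible version, not the approach used in this notebook: `to_int_column` is a hypothetical name, and it assumes every value is a whole, non-negative number, because stripping all non-digits would also drop decimal points and minus signs. Unlike a plain `astype(int)`, it turns unparseable entries into missing values instead of raising an error.

```python
import pandas as pd

def to_int_column(series: pd.Series) -> pd.Series:
    """Strip every non-digit character and convert to nullable integers.

    Assumes whole, non-negative numbers: decimal points and minus
    signs are removed together with symbols such as '$' and ','.
    """
    digits = series.astype(str).str.replace(r"\D", "", regex=True)
    # Entries with no digits left become missing (<NA>) instead of raising.
    return pd.to_numeric(digits, errors="coerce").astype("Int64")

# Hypothetical raw values in the same format as the listings.
raw = pd.DataFrame({
    "milage": ["51,000 mi.", "714 mi."],
    "price": ["$10,300", "$349,950"],
})
cleaned = raw.apply(to_int_column)
print(cleaned)
```

The nullable `Int64` dtype is used so that missing entries can coexist with integer values.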
