# 🧠 Price Car Prediction

## 2. Data Understanding
- Load initial dataset
- Explore variables and data types
- Visualize initial trends

### 📘 Data Dictionary – Car Price Prediction Dataset

**Dataset Information**
This data set consists of three types of entities: 
- (a) the specification of an auto in terms of various characteristics; 
- (b) it's assigned insurance risk rating;
- (c) it's normalized losses in use as compared to other cars.  

The second rating (b) `symboling` corresponds to the degree to which the auto is more risky than its price indicates. Cars are initially assigned a risk factor symbol associated with its price.   Then, if it is more risky (or less), this symbol is adjusted by moving it up (or down) the scale.  Actuarians call this process "symboling".
A value of +3 indicates that the auto is risky, -3 that it is probably pretty safe. 
The third factor (c) `normalized-losses` is the relative average loss payment per insured vehicle year.  This value is normalized for all autos within a particular size classification (two-door small, station wagons, sports/speciality, etc...), and represents the average loss per car per year.

This data dictionary provides a structured description of the variables used in the car price prediction dataset.

| Variable Name       | Role                | Type Variable | Description |
|---------------------|---------------------|----------------|-------------|
| symboling           | Feature             | Categorical    | Risk factor assigned by insurance (ranging from -3 to 3; higher means more risky) |
| normalized-losses   | Feature             | Continuous     | Relative average loss paid by insurance; indicates vehicle damage risk |
| make                | Feature             | Categorical    | Manufacturer brand of the car |
| fuel-type           | Feature             | Categorical    | Type of fuel the car uses (gas or diesel) |
| aspiration          | Feature             | Categorical    | Engine type: standard or turbocharged |
| num-of-doors        | Feature             | Categorical    | Number of doors in the vehicle |
| body-style          | Feature             | Categorical    | Body type of the car (e.g., sedan, hatchback) |
| drive-wheels        | Feature             | Categorical    | Type of drivetrain (front, rear, or 4-wheel drive) |
| engine-location     | Feature             | Categorical    | Location of the engine (front or rear) |
| wheel-base          | Feature             | Continuous     | Distance between front and rear wheels (affects stability) |
| length              | Feature             | Continuous     | Overall length of the vehicle in inches |
| width               | Feature             | Continuous     | Width of the vehicle in inches |
| height              | Feature             | Continuous     | Height of the vehicle in inches |
| curb-weight         | Feature             | Continuous     | Total weight of the car without passengers or cargo |
| engine-type         | Feature             | Categorical    | Engine configuration (e.g., OHV, OHC, DOHC) |
| num-of-cylinders    | Feature             | Categorical    | Number of cylinders in the engine |
| engine-size         | Feature             | Continuous     | Engine displacement in cubic centimeters |
| fuel-system         | Feature             | Categorical    | Type of fuel injection system |
| bore                | Feature             | Continuous     | Diameter of each cylinder in inches |
| stroke              | Feature             | Continuous     | Distance the piston travels within the cylinder (in inches) |
| compression-ratio   | Feature             | Continuous     | Ratio of engine cylinder volume to combustion chamber volume |
| horsepower          | Feature             | Continuous     | Engine power output measured in horsepower |
| peak-rpm            | Feature             | Continuous     | Engine speed at which maximum horsepower is generated |
| city-mpg            | Feature             | Continuous     | Fuel consumption in city driving (miles per gallon) |
| highway-mpg         | Feature             | Continuous     | Fuel consumption on highways (miles per gallon) |
| price               | Target              | Continuous     | Selling price of the car in USD |


In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

#### Load Data

In [2]:
# Load Dataset in .csv
file_path = "https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"
df = pd.read_csv(file_path, header=None)

# View the top 5 rows of the dataset
df.head()


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,16,17,18,19,20,21,22,23,24,25
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


#### Data Exploration

In [3]:
# Rename the collums name
headers = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style",
         "drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type",
         "num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower",
         "peak-rpm","city-mpg","highway-mpg","price"]

df.columns = headers
df.head()


Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


In [5]:
# Check the shape of the DataFrame
print("Total Rows:",df.shape[0])
print("Total Cols:",df.shape[1])

Total Rows: 205
Total Cols: 26


In [6]:
# Examine the data types of each column
print("Data Types:")
df.dtypes


Data Types:


symboling              int64
normalized-losses     object
make                  object
fuel-type             object
aspiration            object
num-of-doors          object
body-style            object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size            int64
fuel-system           object
bore                  object
stroke                object
compression-ratio    float64
horsepower            object
peak-rpm              object
city-mpg               int64
highway-mpg            int64
price                 object
dtype: object

In [None]:
# Check for missing values
print("Missing Values:")
print(df.isnull().sum())

# Show the Missing Data Values
df[df.isnull()]

Missing Values:
symboling            0
normalized-losses    0
make                 0
fuel-type            0
aspiration           0
num-of-doors         0
body-style           0
drive-wheels         0
engine-location      0
wheel-base           0
length               0
width                0
height               0
curb-weight          0
engine-type          0
num-of-cylinders     0
engine-size          0
fuel-system          0
bore                 0
stroke               0
compression-ratio    0
horsepower           0
peak-rpm             0
city-mpg             0
highway-mpg          0
price                0
dtype: int64


Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.40,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.40,8.0,115,5500,18,22,17450
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
200,-1,95,volvo,gas,std,four,sedan,rwd,front,109.1,...,141,mpfi,3.78,3.15,9.5,114,5400,23,28,16845
201,-1,95,volvo,gas,turbo,four,sedan,rwd,front,109.1,...,141,mpfi,3.78,3.15,8.7,160,5300,19,25,19045
202,-1,95,volvo,gas,std,four,sedan,rwd,front,109.1,...,173,mpfi,3.58,2.87,8.8,134,5500,18,23,21485
203,-1,95,volvo,diesel,turbo,four,sedan,rwd,front,109.1,...,145,idi,3.01,3.40,23.0,106,4800,26,27,22470


In [19]:
# Check the missing data / inconsistent Data
for i, col in enumerate(df.columns): 
    unique_vals = df[col].unique()
    print(f"Name Col: {col}: \n Unique Values: {unique_vals[:10]} \n")


#
df["num-of-doors"].value_counts()
df["price"].value_counts()

# Possível identificar que as colunas `normalized_losses` e `num-of-doors` e `price` possui inconsistências nos dados com valor "?"

Name Col: symboling: 
 Unique Values: [ 3  1  2  0 -1 -2] 

Name Col: normalized-losses: 
 Unique Values: ['?' '164' '158' '192' '188' '121' '98' '81' '118' '148'] 

Name Col: make: 
 Unique Values: ['alfa-romero' 'audi' 'bmw' 'chevrolet' 'dodge' 'honda' 'isuzu' 'jaguar'
 'mazda' 'mercedes-benz'] 

Name Col: fuel-type: 
 Unique Values: ['gas' 'diesel'] 

Name Col: aspiration: 
 Unique Values: ['std' 'turbo'] 

Name Col: num-of-doors: 
 Unique Values: ['two' 'four' '?'] 

Name Col: body-style: 
 Unique Values: ['convertible' 'hatchback' 'sedan' 'wagon' 'hardtop'] 

Name Col: drive-wheels: 
 Unique Values: ['rwd' 'fwd' '4wd'] 

Name Col: engine-location: 
 Unique Values: ['front' 'rear'] 

Name Col: wheel-base: 
 Unique Values: [ 88.6  94.5  99.8  99.4 105.8  99.5 101.2 103.5 110.   88.4] 

Name Col: length: 
 Unique Values: [168.8 171.2 176.6 177.3 192.7 178.2 176.8 189.  193.8 197. ] 

Name Col: width: 
 Unique Values: [64.1 65.5 66.2 66.4 66.3 71.4 67.9 64.8 66.9 70.9] 

Name Col: hei

price
?        4
16500    2
6229     2
7609     2
7957     2
        ..
16845    1
19045    1
21485    1
22470    1
22625    1
Name: count, Length: 187, dtype: int64

In [None]:
# Check there is duplicated data
df.duplicated().sum()

# Show the duplicated Data
df[df.duplicated()]

# Drop Duplicated
#df.drop_duplicates(inplace=True)

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price


In [15]:
# Describe the numerical variables 
df.describe()

Unnamed: 0,symboling,wheel-base,length,width,height,curb-weight,engine-size,compression-ratio,city-mpg,highway-mpg
count,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0
mean,0.834146,98.756585,174.049268,65.907805,53.724878,2555.565854,126.907317,10.142537,25.219512,30.75122
std,1.245307,6.021776,12.337289,2.145204,2.443522,520.680204,41.642693,3.97204,6.542142,6.886443
min,-2.0,86.6,141.1,60.3,47.8,1488.0,61.0,7.0,13.0,16.0
25%,0.0,94.5,166.3,64.1,52.0,2145.0,97.0,8.6,19.0,25.0
50%,1.0,97.0,173.2,65.5,54.1,2414.0,120.0,9.0,24.0,30.0
75%,2.0,102.4,183.1,66.9,55.5,2935.0,141.0,9.4,30.0,34.0
max,3.0,120.9,208.1,72.3,59.8,4066.0,326.0,23.0,49.0,54.0
