# 1- Import libraries and data

In [1]:
import pandas as pd 
import numpy as np
import plotly.express as px
import plotly.graph_objects as go

import warnings
warnings.filterwarnings('ignore')


In [2]:
# Forward slashes work on Windows, macOS and Linux alike
df = pd.read_csv("data/car_price.csv")
df.head()

Unnamed: 0,CarName,carbody,drivewheel,enginelocation,fueltype,aspiration,doornumber,cylindernumber,enginetype,fuelsystem,...,curbweight,enginesize,boreratio,stroke,compressionratio,horsepower,peakrpm,citympg,highwaympg,price
0,alfa-romero giulia,convertible,rwd,front,gas,std,two,four,dohc,mpfi,...,2548,130,3.47,2.68,9.0,111,5000,21,27,13495.0
1,alfa-romero stelvio,convertible,rwd,front,gas,std,two,four,dohc,mpfi,...,2548,130,3.47,2.68,9.0,111,5000,21,27,16500.0
2,alfa-romero Quadrifoglio,hatchback,rwd,front,gas,std,two,six,ohcv,mpfi,...,2823,152,2.68,3.47,9.0,154,5000,19,26,16500.0
3,audi 100 ls,sedan,fwd,front,gas,std,four,four,ohc,mpfi,...,2337,109,3.19,3.4,10.0,102,5500,24,30,13950.0
4,audi 100ls,sedan,4wd,front,gas,std,four,five,ohc,mpfi,...,2824,136,3.19,3.4,8.0,115,5500,18,22,17450.0


# 2- Data exploration

**categorical data**

In [3]:
# Making a list of all categorical variables
cat_columns = df.select_dtypes(include = 'object').columns
print(list(cat_columns))

['CarName', 'carbody', 'drivewheel', 'enginelocation', 'fueltype', 'aspiration', 'doornumber', 'cylindernumber', 'enginetype', 'fuelsystem']


In [4]:
print(f"number of unique values = {df['CarName'].nunique()} \n")

number of unique values = 147 



There are a lot of unique values in the **CarName** column. We will keep only the company name (the first word) and store it in a new column called **company name**.

In [5]:
# create company name column
df["company name"] = df["CarName"].str.split(" ", expand=True)[0]

# drop CarName column
df.drop(columns="CarName", inplace=True)

In [6]:
print(f"number of unique values = {df['company name'].nunique()} \n")

print(df['company name'].unique())

number of unique values = 28 

['alfa-romero' 'audi' 'bmw' 'chevrolet' 'dodge' 'honda' 'isuzu' 'jaguar'
 'maxda' 'mazda' 'buick' 'mercury' 'mitsubishi' 'Nissan' 'nissan'
 'peugeot' 'plymouth' 'porsche' 'porcshce' 'renault' 'saab' 'subaru'
 'toyota' 'toyouta' 'vokswagen' 'volkswagen' 'vw' 'volvo']


**Note that some company names are misspelled or inconsistently capitalized:**

* maxda = mazda
* Nissan = nissan
* porcshce = porsche
* toyouta = toyota
* vokswagen = vw = volkswagen

In [7]:
df.replace({'company name':{"maxda":"mazda" , "Nissan":"nissan" ,
                        "porcshce":"porsche", "toyouta":"toyota" ,
                        "vokswagen":"volkswagen", "vw":"volkswagen"}},inplace=True)

In [8]:
print(f"number of unique values = {df['company name'].nunique()} \n")

print(df['company name'].unique())

number of unique values = 22 

['alfa-romero' 'audi' 'bmw' 'chevrolet' 'dodge' 'honda' 'isuzu' 'jaguar'
 'mazda' 'buick' 'mercury' 'mitsubishi' 'nissan' 'peugeot' 'plymouth'
 'porsche' 'renault' 'saab' 'subaru' 'toyota' 'volkswagen' 'volvo']
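
As an alternative to listing every variant by hand, case differences can be removed before the mapping is applied. This is a minimal sketch on a toy Series; the names `names`, `fixes` and `cleaned` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy values standing in for the 'company name' column (illustrative only)
names = pd.Series(["maxda", "Nissan", "porcshce", "toyouta", "vokswagen", "vw", "audi"])

# Misspelling -> canonical spelling, taken from the note above
fixes = {
    "maxda": "mazda",
    "porcshce": "porsche",
    "toyouta": "toyota",
    "vokswagen": "volkswagen",
    "vw": "volkswagen",
}

# Lower-case first so 'Nissan' and 'nissan' collapse without a mapping entry,
# then replace the remaining misspellings by exact match
cleaned = names.str.lower().replace(fixes)
print(cleaned.unique())
```

Lower-casing first keeps the dictionary limited to genuine misspellings.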


In [48]:
df.iloc[:5, 5:]

Unnamed: 0,doornumber,cylindernumber,enginetype,fuelsystem,symboling,wheelbase,carlength,carwidth,carheight,curbweight,enginesize,boreratio,stroke,compressionratio,horsepower,peakrpm,citympg,highwaympg,price,company name
0,two,four,dohc,mpfi,3,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,9.0,111,5000,21,27,13495.0,alfa-romero
1,two,four,dohc,mpfi,3,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,9.0,111,5000,21,27,16500.0,alfa-romero
2,two,six,ohcv,mpfi,1,94.5,171.2,65.5,52.4,2823,152,2.68,3.47,9.0,154,5000,19,26,16500.0,alfa-romero
3,four,four,ohc,mpfi,2,99.8,176.6,66.2,54.3,2337,109,3.19,3.4,10.0,102,5500,24,30,13950.0,audi
4,four,five,ohc,mpfi,2,99.4,176.6,66.4,54.3,2824,136,3.19,3.4,8.0,115,5500,18,22,17450.0,audi


# 3- Data visualization

## Mean price per company

In [29]:
carname_by_price = df.groupby('company name', as_index=False)['price'].mean().sort_values(by='price')

fig= px.bar(x= carname_by_price['price'],
       y= carname_by_price['company name'],
       text_auto=True,
       title='Mean price per company')

fig.update_traces(textfont_size=12, textangle=0, cliponaxis=False)
fig.update_layout(
    xaxis_title="Price",
    yaxis_title="Company"
)

**Jaguar and Buick cars have the highest average price, about `34K`.**
 **Chevrolet cars have the lowest average price, about `6K`, followed by Dodge, Plymouth and Honda at about `8K`.**
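
An average can be misleading when a company has only a few cars, so it may help to look at the group size next to the mean. The sketch below uses made-up toy data (not the notebook's dataset), and the output column names `mean_price` and `n_cars` are assumptions chosen for illustration.

```python
import pandas as pd

# Toy data (made up for illustration, not the notebook's dataset)
toy = pd.DataFrame({
    "company name": ["a", "a", "b", "c", "c", "c"],
    "price": [10_000.0, 20_000.0, 30_000.0, 5_000.0, 7_000.0, 9_000.0],
})

# Named aggregation: mean price and number of priced cars per company
summary = (
    toy.groupby("company name")["price"]
       .agg(mean_price="mean", n_cars="count")
       .sort_values("mean_price")
)
print(summary)
```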

## Car Body

In [11]:
# Count cars per body type once and reuse the result for both charts
carbody_counts = df['carbody'].value_counts()
keys = carbody_counts.index
values = carbody_counts.values


# Create the bar chart
fig = px.bar(df, x= keys, y= values, text= values, title='Car Body Counts')
fig.update_traces(textposition='outside')
fig.update_layout(
    xaxis_title="Car Body",
    yaxis_title="Number of cars"
)
# Create the pie chart
fig2 = px.pie(df, names= keys, values= values, title='Car Body Distribution')

# Display the plots
fig.show()
fig2.show()

In [12]:
carbody_by_price = df.groupby('carbody', as_index=False)['price'].mean().sort_values(by='price')

fig= px.bar(x= carbody_by_price['price'],
       y=carbody_by_price['carbody'],
       text_auto=True,
       title='Mean car price per each body')

fig.update_traces(textfont_size=12, textangle=0, cliponaxis=False)
fig.update_layout(
    xaxis_title="Price",
    yaxis_title="Car body name"
)

**Sedan is the most common car body, at about `46.7%` of cars (98 cars). This may be because sedans are mid-priced compared with other bodies (average price `14.4K`). Similarly, hatchbacks account for `33.8%` of cars (71 cars) and have the lowest average price, about `10.3K`.**

<br>

**Hardtop and convertible bodies have the highest average price, about `22K`, which may explain why they are less common.**
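
Counts and percentage shares like the ones quoted above can be computed directly, without reading them off the pie chart. A minimal sketch on a toy Series (the values are invented for illustration):

```python
import pandas as pd

# Toy body types (illustrative only)
bodies = pd.Series(["sedan", "sedan", "hatchback", "wagon"])

counts = bodies.value_counts()
# normalize=True returns fractions; multiply by 100 for percentages
shares = bodies.value_counts(normalize=True).mul(100).round(1)

print(pd.DataFrame({"cars": counts, "share_%": shares}))
```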

## Distribution of car bodies by company

In [13]:
carbody_per_company_names= df.groupby(['carbody', 'company name'], as_index=False).size().sort_values('size', ascending=False)

fig = px.bar(
    carbody_per_company_names, x='company name', y='size', color='carbody', text_auto=True,
    title='Number of cars by company and car body'
)

fig.update_layout(
    xaxis_title="Company",
    yaxis_title="Number of Cars"
)
fig.show()

**Almost all cars have sedan or hatchback bodies.**
   * **Toyota has 14 hatchbacks (the highest count) and 10 sedans, and it is the most common company in the data (32 cars).**
   * **BMW and Jaguar have only sedans.**
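
The same information can be read from a contingency table instead of a stacked bar chart. The sketch below uses `pd.crosstab` on invented toy rows; `single_body` is a hypothetical helper variable listing companies that appear with only one body type.

```python
import pandas as pd

# Toy rows (illustrative only)
toy = pd.DataFrame({
    "company name": ["toyota", "toyota", "toyota", "bmw", "bmw"],
    "carbody": ["hatchback", "hatchback", "sedan", "sedan", "sedan"],
})

# Rows: companies, columns: body types, cells: number of cars
table = pd.crosstab(toy["company name"], toy["carbody"])
print(table)

# Companies whose cars all share a single body type
n_bodies = (table > 0).sum(axis=1)
single_body = n_bodies[n_bodies == 1].index.tolist()
print(single_body)
```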
   


## Drive wheel

In [34]:
# Count cars per drive wheel type once and reuse the result for both charts
drivewheel_counts = df['drivewheel'].value_counts()
keys = drivewheel_counts.index
values = drivewheel_counts.values

# Create the bar chart
fig = px.bar(df, x= keys, y= values, text= values, title='Drive wheel Counts')
fig.update_traces(textposition='inside')
fig.update_layout(
    xaxis_title="Drive wheel type",
    yaxis_title="Number of cars"
)
# Create the pie chart
fig2 = px.pie(df, names= keys, values= values, title='Drive wheel Distribution')

# Display the plots
fig.show()
fig2.show()

**Most cars are front-wheel drive (FWD): about `57.6%` of cars (`121` cars).**

**Four-wheel drive (4WD) is the least common drive wheel type, probably because it is typically needed only for off-road driving. Just `9` cars in our data have 4WD.**

## Price by drive wheel

In [35]:
drivewheel_by_price = df.groupby('drivewheel', as_index=False)['price'].mean().sort_values(by='price')

fig= px.bar(x= drivewheel_by_price['price'],
       y=drivewheel_by_price['drivewheel'],
       text_auto=True,
       title='Mean price per drive wheel')

fig.update_traces(textfont_size=12, textangle=0, cliponaxis=False)
fig.update_layout(
    xaxis_title="Price",
    yaxis_title="drive wheels"
)

In [18]:
fig = px.histogram(df, x="price", color="drivewheel", title='Price distribution by Drive Wheel types')

fig.update_layout(
    xaxis_title="Price",
    yaxis_title="Number of Cars"
)
fig.show()

## Drive wheel with engine location

In [42]:
drivewheels_per_enginelocation= df.groupby(['drivewheel', 'enginelocation'], as_index=False).size().sort_values('size', ascending=False)


fig = px.bar(drivewheels_per_enginelocation, x="drivewheel", y="size",
             color='enginelocation', barmode='group',
             height=400)


fig.update_layout(
    xaxis_title="drive wheel type",
    yaxis_title="Number of Cars"
)
fig.show()

**Front-wheel drive (FWD) is the least complex to design and also the cheapest, mainly because most cars have their engines mounted at the front.**

   * **All front-wheel-drive cars have front-mounted engines (`121` cars).**
   
**Rear-wheel drive (RWD) is considerably more complex, but it serves a different purpose: better performance is the core of this drive type, so it is the most expensive (average price `19.6K`).**
   * **Rear-mounted engines appear only in RWD cars.**
   * **Just `3` cars in our data have a rear-mounted engine.**
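
Statements of the form "X appears only with Y" can be checked with a boolean filter rather than by eye. A small sketch on invented toy rows; `rear_drives` and `fwd_locations` are illustrative names.

```python
import pandas as pd

# Toy rows (illustrative only)
toy = pd.DataFrame({
    "drivewheel": ["fwd", "fwd", "rwd", "rwd", "4wd"],
    "enginelocation": ["front", "front", "front", "rear", "front"],
})

# Drive wheel types that occur together with a rear-mounted engine
rear_drives = set(toy.loc[toy["enginelocation"] == "rear", "drivewheel"])

# Engine locations that occur for front-wheel-drive cars
fwd_locations = set(toy.loc[toy["drivewheel"] == "fwd", "enginelocation"])

print(rear_drives, fwd_locations)
```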



## Companies by drive wheel

In [43]:
company_names_per_drivewheels= df.groupby(['drivewheel', 'company name'], as_index=False).size().sort_values('size', ascending=False)

fig = px.bar(
    company_names_per_drivewheels, x='company name', y='size', color='drivewheel', text_auto=True,
    title='Number of cars by Drive Wheel and company names'
)

fig.update_layout(
    xaxis_title="Company",
    yaxis_title="Number of Cars"
)
fig.show()

**There is a clear association between car company and drive wheel type. For example, Japanese companies such as Toyota, Nissan, Honda and Mazda tend to favor front-wheel drive, while German and French companies such as BMW, Porsche and Peugeot tend to favor rear-wheel drive.**
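
Because companies differ a lot in how many cars they have, row-wise shares can make this kind of preference easier to compare than raw counts. A sketch with `pd.crosstab(..., normalize="index")` on invented toy rows:

```python
import pandas as pd

# Toy rows (illustrative only)
toy = pd.DataFrame({
    "company name": ["honda", "honda", "honda", "bmw", "bmw", "subaru", "subaru"],
    "drivewheel": ["fwd", "fwd", "fwd", "rwd", "rwd", "fwd", "4wd"],
})

# normalize="index" turns each row into fractions that sum to 1
shares = pd.crosstab(toy["company name"], toy["drivewheel"], normalize="index")
print(shares.round(2))
```

Each row then reads as "the fraction of this company's cars with each drive wheel type".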



## Horsepower distribution by enginetype

In [44]:
fig = px.histogram(df, x="horsepower", color="enginetype", title='Horsepower distribution by enginetype')

fig.update_layout(
    xaxis_title="horsepower values",
    yaxis_title="Number of Cars"
)
fig.show()

**The most common horsepower range is `50-150`. This suggests that most cars are designed to provide a balance between performance and fuel efficiency.**

**Cars with the `ohc` engine type have horsepower in the `50-100` range, while cars with the `ohcv` engine type have high horsepower. Only one car has the `dohcv` engine type, with a very high horsepower of `288`.**

**Engine type isn't the only factor that determines horsepower. The number of cylinders is one of the most important other factors.**

Let's see how!

In [47]:
fig = px.histogram(df, x="horsepower", color="cylindernumber", title='Horsepower distribution by cylinder number')

fig.update_layout(
    xaxis_title="horsepower values",
    yaxis_title="Number of Cars"
)
fig.show()

**There is a positive correlation between the number of cylinders and horsepower: cars with more cylinders tend to have more horsepower.**
**More cylinders mean more power strokes per engine revolution, which generally results in more horsepower.**

**Cars with `4` cylinders have horsepower in the `50-100` range, while cars with `6` cylinders have horsepower around `150-200`.**

**Only one car has `12` cylinders, with a high horsepower of `262`.**
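
The `cylindernumber` column stores number words, so a numeric summary needs a mapping to integers first. The sketch below uses invented toy rows; the `word_to_int` dictionary and the `cylinders` column are assumptions added for illustration.

```python
import pandas as pd

# Toy rows (illustrative only)
toy = pd.DataFrame({
    "cylindernumber": ["four", "four", "six", "six", "eight"],
    "horsepower": [70, 90, 150, 170, 200],
})

# The column stores number words, so map them to integers first
word_to_int = {"two": 2, "three": 3, "four": 4, "five": 5,
               "six": 6, "eight": 8, "twelve": 12}
toy["cylinders"] = toy["cylindernumber"].map(word_to_int)

# Mean horsepower per cylinder count, and the Pearson correlation
mean_hp = toy.groupby("cylinders")["horsepower"].mean()
r = toy["cylinders"].corr(toy["horsepower"])

print(mean_hp)
print(round(r, 2))
```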

## Drive Wheel with Cylinder Number

In [19]:
# Sunburst chart of total price by drive wheel and cylinder number
fig = px.sunburst(df, path=['drivewheel', 'cylindernumber'], values='price',
                  color='drivewheel', 
                  title="Total price by Drive Wheel and Cylinder Number")

fig.show()

In [20]:
drivewheel_per_cylindernumber= df.groupby(['drivewheel', 'cylindernumber'], as_index=False).size().sort_values('size', ascending=False)

fig = px.sunburst(drivewheel_per_cylindernumber, path=['drivewheel', 'cylindernumber'], values='size',
                  color='drivewheel', 
                  title='Number of cars by Drive Wheel and Cylinder Number')


fig.show()
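
The two sunburst charts size their sectors differently: one by the summed `price` of each group, the other by the number of rows. A minimal sketch of the difference on invented toy rows (`by_count` and `by_price` are illustrative names):

```python
import pandas as pd

# Toy rows (illustrative only)
toy = pd.DataFrame({
    "drivewheel": ["fwd", "fwd", "fwd", "rwd"],
    "cylindernumber": ["four", "four", "four", "six"],
    "price": [7_000.0, 8_000.0, 9_000.0, 30_000.0],
})

grouped = toy.groupby(["drivewheel", "cylindernumber"])["price"]
by_count = grouped.size()   # what a sunburst sized by row count shows
by_price = grouped.sum()    # what a sunburst with values='price' shows

print(by_count)
print(by_price)
```

In this toy example the single expensive RWD car has the larger sector by total price, even though the FWD group contains more cars.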

## Company names by engine type

In [21]:
enginetype_per_company_names= df.groupby(['enginetype', 'company name'], as_index=False).size().sort_values('size', ascending=False)

fig = px.bar(
    enginetype_per_company_names, x='company name', y='size', color='enginetype', text_auto=True,
    title='Company names by engine type'
)

fig.update_layout(
    xaxis_title="Company",
    yaxis_title="Number of Cars"
)
fig.show()

**Only 12 cars have DOHC engines, although DOHC is the dominant type today because such engines are generally the most efficient and can produce the most horsepower for their size.**

**Most cars have OHC engines.**

In [36]:
enginetype_by_price = df.groupby('enginetype', as_index=False)['price'].mean().sort_values(by='price')

fig= px.bar(x= enginetype_by_price['price'],
       y=enginetype_by_price['enginetype'],
       text_auto=True,
       title='Car price by engine types')

fig.update_traces(textfont_size=12, textangle=0, cliponaxis=False)
fig.update_layout(
    xaxis_title="Price",
    yaxis_title="engine type"
)

**OHC engines are the cheapest type, with an average price of about `11.6K`.**

**OHCV and DOHCV engines are the least common in the data, possibly because they have the highest average prices.**
   * **OHCV engines have an average price of about `25.1K`.**
   * **DOHCV engines have an average price of about `31.4K`.**


In [23]:
# Sunburst chart of total price by drive wheel, cylinder number and engine type
fig = px.sunburst(df, path=['drivewheel', 'cylindernumber', 'enginetype'], values='price',
                  color='drivewheel', 
                  title="Total price by Drive Wheel, Cylinder Number and Engine Type")

fig.show()

## Engine size distribution by engine types

In [49]:
fig = px.histogram(df, x="enginesize", color="enginetype", title='Engine size distribution by engine types')

fig.update_layout(
    xaxis_title="engine size",
    yaxis_title="Number of Cars"
)
fig.show()

**The number of cars decreases as engine size increases. This suggests that larger engines are less common, likely because they are more expensive and less fuel-efficient, even though they are more powerful.**

**The most common engine size is 100-150 cubic inches. OHCV and DOHCV engines tend to have large engine sizes.**
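
The most common size range can also be found numerically by binning the values. The sketch below uses `pd.cut` on invented toy sizes; the bin edges are an assumption chosen for illustration.

```python
import pandas as pd

# Toy engine sizes (illustrative only)
sizes = pd.Series([90, 97, 108, 110, 122, 130, 152, 209])

# Right-closed bins: (50, 100], (100, 150], (150, 200], (200, 250]
bins = [50, 100, 150, 200, 250]
binned = pd.cut(sizes, bins=bins)

# Number of cars per bin, in bin order, and the most populated bin
counts = binned.value_counts().sort_index()
print(counts)
print(counts.idxmax())
```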

In [41]:
fueltype_with_fuelsystem= df.groupby(['fueltype', 'fuelsystem'], as_index=False).size().sort_values('size', ascending=False)

fig = px.bar(
    fueltype_with_fuelsystem, x='fuelsystem', y='size', color='fueltype', text_auto=True,
    title='fuel system by fuel type'
)

fig.update_layout(
    xaxis_title="fuel system",
    yaxis_title="Number of Cars"
)
fig.show()

**MPFI is the most popular fuel system (96 cars).**

**Every fuel system is used only with gas, except `idi`, which is used only with diesel.**

## Correlation matrix

In [26]:
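# Correlations are defined only for numeric columns; recent pandas versions
# raise an error if text columns are passed to corr()
corr = df.select_dtypes(include='number').corr()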

fig = go.Figure()
fig.add_trace(
    go.Heatmap(
        x = corr.columns,
        y = corr.index,
        z = np.array(corr),
        text= corr.values,
        texttemplate='%{text:.2f}',
        colorscale="Viridis"    
    )
)
fig.show()

* **Price is positively correlated with engine size, curb weight, car length, car width, and horsepower. This means that cars with larger engines, greater weight, more power, and longer and wider bodies tend to be more expensive.**
<br>

* **Price is also positively correlated with wheelbase. This is not surprising, as wheelbase is often seen as a measure of comfort.**
<br>

* **Price is negatively correlated with citympg and highwaympg.**
<br>

* **Price is not strongly correlated with other features such as car height, compression ratio and peak rpm. This suggests that these features may be less important to consumers when making a purchase decision.**
<br>

* **curbweight, carlength and carwidth are highly correlated with each other.**
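
Reading individual cells off a heatmap is error-prone, so it can help to list each feature's correlation with price as a sorted column. This is a sketch on invented toy rows, not the notebook's data; `price_corr` is an illustrative name.

```python
import pandas as pd

# Toy rows (illustrative only)
toy = pd.DataFrame({
    "company name": ["a", "b", "c", "d"],   # non-numeric, must be excluded
    "enginesize": [90, 110, 150, 200],
    "citympg": [38, 30, 24, 16],
    "price": [6_000.0, 9_000.0, 16_000.0, 30_000.0],
})

# Keep numeric columns only before computing Pearson correlations
corr = toy.select_dtypes(include="number").corr()

# Correlation of every numeric feature with price, most negative first
price_corr = corr["price"].drop("price").sort_values()
print(price_corr.round(2))
```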
