# 📊 Exploratory Data Analysis: Used Car Sales

This notebook performs basic exploratory data analysis on a dataset of used vehicle listings. 
The goal is to identify patterns in pricing and mileage, detect duplicates, and provide visual insights that support the final dashboard.


In [1]:
import pandas as pd
import plotly.express as px
import plotly.io as pio

pio.renderers.default = "notebook_connected"

# load the dataset
df = pd.read_csv(r'C:\Users\INVERSE\car-sales-dashboard-triple10\vehicles_us.csv')
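Before plotting, it can help to confirm the frame's shape, column types, and missing values. The following is a minimal sketch, not part of the original notebook run: `sample` is a small hypothetical stand-in for `df`, and only the `price`, `odometer`, and `type` columns used later in this notebook are assumed.

```python
import pandas as pd

# Hypothetical stand-in rows for illustration only; in the notebook,
# the same calls would be made on the loaded `df`.
sample = pd.DataFrame({
    "price": [9400, 25500, 5500, 1500, 14900],
    "odometer": [145000.0, 88705.0, 110000.0, None, 80903.0],
    "type": ["SUV", "pickup", "sedan", "pickup", "sedan"],
})

# Basic structure: number of rows and columns
n_rows, n_cols = sample.shape

# Data type of each column
column_dtypes = sample.dtypes

# Count of missing values per column
missing_per_column = sample.isna().sum()

print(f"{n_rows} rows x {n_cols} columns")
print(column_dtypes)
print(missing_per_column)
```

Missing values matter here because rows without an `odometer` reading cannot be placed on the mileage axis of the scatter plot.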



In [2]:
# price distribution
px.histogram(df, x='price')



#### 💡 Insight on Price Distribution
The histogram shows that most used vehicles are priced below $20,000, with frequency dropping sharply as price increases. A small number of listings exceed $50,000; these are rare or luxury vehicles that form a long right tail.

This right-skewed distribution indicates that the dataset mostly covers affordable to mid-range vehicles, which aligns with the purpose of the dashboard: helping users explore common pricing trends in the used car market.
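The visual reading of the histogram can be backed with a few summary numbers. This is a rough sketch under stated assumptions: `prices` holds hypothetical values chosen only to illustrate the calculation, and the $20,000 threshold comes from the observation above. On the real data, the same expressions would be applied to `df['price']`.

```python
import pandas as pd

# Hypothetical prices for illustration; not taken from the dataset.
prices = pd.Series(
    [1500, 5500, 9400, 12990, 14900, 18500, 25500, 64000], name="price"
)

# Fraction of listings priced under $20,000
share_under_20k = (prices < 20_000).mean()

# In a right-skewed distribution the mean sits above the median
median_price = prices.median()
mean_price = prices.mean()

# Sample skewness: positive values indicate a longer right tail
skewness = prices.skew()

print(f"Share under $20,000: {share_under_20k:.0%}")
print(f"Median: {median_price:.0f}, mean: {mean_price:.0f}")
print(f"Skewness: {skewness:.2f}")
```

A mean well above the median, together with positive skewness, is consistent with the long tail of expensive listings seen in the plot.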


In [3]:
# price vs mileage
px.scatter(df, x='odometer', y='price', color='type')

#### 💡 Insight on Mileage vs. Price
There is a clear negative relationship: as mileage increases, price tends to drop.

We also observe:
- A large cluster of affordable vehicles under 200,000 miles
- A few outliers with extremely high prices or unusually high mileage
- SUVs and trucks show more price variability than sedans, especially at higher mileages

These insights help support the dashboard's filters and pricing trends for potential buyers.
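Both observations, the negative relationship and the difference in variability by type, can be checked numerically. The sketch below uses a hypothetical `sample` frame with made-up values. Pearson correlation and per-type standard deviation are one possible choice of measures, not necessarily what the dashboard uses.

```python
import pandas as pd

# Hypothetical rows for illustration; on the real data, use `df` instead.
sample = pd.DataFrame({
    "odometer": [20000, 60000, 100000, 150000, 30000, 90000, 140000, 200000],
    "price": [18000, 13000, 9000, 5000, 32000, 24000, 11000, 6000],
    "type": ["sedan"] * 4 + ["truck"] * 4,
})

# Pearson correlation between mileage and price.
# Series.corr skips pairs where either value is missing.
pearson_r = sample["odometer"].corr(sample["price"])

# Spread of prices within each vehicle type
price_spread = sample.groupby("type")["price"].std()

print(f"Pearson r (odometer vs. price): {pearson_r:.2f}")
print(price_spread)
```

A negative `pearson_r` matches the downward trend in the scatter plot, and a larger standard deviation for one type indicates more pricing variability within that group.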


In [4]:
# Check for duplicate rows
duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")

df = df.drop_duplicates()

Number of duplicate rows: 0


#### 🔁 Duplicate Check

The check found **0 duplicate rows** in the dataset, so `drop_duplicates()` removed nothing.
Confirming this ensures that the visualizations and summary statistics are not skewed by repeated entries.
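`df.duplicated()` only flags rows that match on every column. As an optional extra check, the sketch below looks for rows that repeat on a chosen subset of columns. Both the `sample` frame and the `days_listed` column are hypothetical, as is the choice of `key_cols`.

```python
import pandas as pd

# Hypothetical rows for illustration; `days_listed` is an assumed column.
sample = pd.DataFrame({
    "price": [9400, 9400, 5500, 5500],
    "odometer": [145000, 145000, 110000, 110000],
    "type": ["SUV", "SUV", "sedan", "sedan"],
    "days_listed": [19, 19, 50, 79],
})

# Exact duplicates: rows identical to an earlier row in every column
exact_dupes = sample.duplicated().sum()

# Rows that repeat on the key columns only; keep=False marks every member
# of a repeated group so they can be inspected side by side
key_cols = ["price", "odometer", "type"]
repeated_on_key = sample[sample.duplicated(subset=key_cols, keep=False)]

print(f"Exact duplicate rows: {exact_dupes}")
print(repeated_on_key)
```

Rows that repeat on the key columns are not necessarily errors, since two different cars can share a price, mileage, and type, so they are better reviewed than dropped automatically.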


## ✅ Final Conclusion

This dataset shows expected market behavior: higher mileage is associated with lower prices, and most listings are affordable.

We performed exploratory data analysis (EDA) on a used car sales dataset. The following steps were taken:

- Loaded and inspected the dataset structure
- Checked for duplicate rows to ensure data quality
- Created a histogram to visualize the distribution of car prices
- Created a scatter plot to explore the relationship between mileage and price

#### 🔍 Key Insights:

- Most vehicles are priced below $20,000, indicating that the dataset represents mostly affordable listings.
- There is a negative correlation between odometer reading and price: vehicles with higher mileage tend to be cheaper.
- Different vehicle types show varying distributions, with trucks and SUVs sometimes retaining higher value despite mileage.

These insights helped inform the visual layout and filters in the Streamlit dashboard, ensuring that the final web app clearly highlights pricing patterns and market behavior.
