# Exploratory Data Analysis (EDA)

This notebook explores the used-vehicle dataset that powers the Streamlit application.
The goal is to understand distributions, trends, and relationships between vehicle
price, model year, and condition, which directly inform the visualizations shown
in the deployed app.

import pandas as pd
import plotly.express as px

df = pd.read_csv('../vehicles_us.csv')

df.head()
df.info()
df.describe()

The dataset contains information on vehicle prices, model year, and condition.
Some columns contain missing values, which will be handled during analysis.

df.isna().sum()

Missing values are present in several columns. For this project, rows with missing
price, model year, or condition values are removed since these variables are central to the analysis.

df = df.dropna(subset=['price', 'model_year', 'condition'])
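It can help to check how many rows a `dropna` call like the one above removes. The following is a minimal sketch on a made-up toy frame, not the real `vehicles_us.csv`; the values and the `paint_color` column are assumptions used only for illustration.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset; all values are invented.
toy = pd.DataFrame({
    "price": [9400, None, 5500, 14900],
    "model_year": [2011.0, 2013.0, None, 2017.0],
    "condition": ["good", "excellent", "fair", None],
    "paint_color": [None, "white", "red", "black"],  # assumed extra column
})

rows_before = len(toy)

# Drop only rows missing one of the three key variables.
cleaned = toy.dropna(subset=["price", "model_year", "condition"])
rows_dropped = rows_before - len(cleaned)

print(f"Dropped {rows_dropped} of {rows_before} rows")
```

Because `subset` limits the check to the listed columns, a row that is missing only `paint_color` is kept.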

fig = px.histogram(
    df,
    x='price',
    nbins=50,
    title='Distribution of Vehicle Prices'
)
fig.show()

Vehicle prices are right-skewed, with most vehicles priced below $20,000.
This insight influenced the price range used in the app visualizations.
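The skew and the share of listings under a given price can also be checked numerically rather than read off the histogram. This is a rough sketch on invented prices; the $20,000 threshold comes from the observation above, while the sample values are assumptions.

```python
import pandas as pd

# Hypothetical prices chosen to mimic a right-skewed distribution.
prices = pd.Series(
    [1500, 4200, 6900, 9400, 12500, 15900, 18500, 24900, 38000, 125000],
    name="price",
)

# Fraction of listings priced below the threshold.
share_below = (prices < 20_000).mean()

# In a right-skewed distribution the mean sits above the median.
mean_price = prices.mean()
median_price = prices.median()

print(f"Share below $20,000: {share_below:.0%}")
print(f"Mean: {mean_price:,.0f}  Median: {median_price:,.0f}")
```

Running the same three lines on `df['price']` would give the actual figures for the dataset.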

fig = px.scatter(
    df,
    x='model_year',
    y='price',
    title='Vehicle Price vs Model Year'
)
fig.show()

Newer vehicles generally have higher prices, though there is significant variation
depending on condition and other factors.
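One way to look past the scatter in the plot is to compare median prices by model year, and then by model year and condition together. The sketch below uses a small invented frame, so the numbers are illustrative assumptions rather than results from the dataset.

```python
import pandas as pd

# Hypothetical listings; values are made up for illustration.
toy = pd.DataFrame({
    "model_year": [2008, 2008, 2013, 2013, 2018, 2018],
    "condition": ["good", "excellent", "good", "excellent", "good", "excellent"],
    "price": [4000, 5500, 9000, 12000, 18000, 23000],
})

# Median price per model year shows the overall trend.
median_by_year = toy.groupby("model_year")["price"].median()

# Splitting by condition shows how much of the spread it accounts for.
median_by_year_condition = toy.pivot_table(
    index="model_year", columns="condition", values="price", aggfunc="median"
)

print(median_by_year)
print(median_by_year_condition)
```

The median is used instead of the mean because the price distribution is right-skewed.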

fig = px.histogram(
    df,
    x='condition',
    title='Vehicle Condition Distribution'
)
fig.show()

Most vehicles fall into the 'good' or 'excellent' condition categories.
This informed the condition filters used in the Streamlit application.
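The share of each condition category can be computed directly with `value_counts`. This is a minimal sketch on invented labels; apart from 'good' and 'excellent', the category names are assumptions.

```python
import pandas as pd

# Hypothetical condition labels; only 'good' and 'excellent' are named in the text.
conditions = pd.Series(
    ["good", "excellent", "good", "like new", "fair", "excellent", "good", "salvage"],
    name="condition",
)

# normalize=True returns proportions instead of raw counts, sorted descending.
shares = conditions.value_counts(normalize=True)
top_two = shares.index[:2].tolist()

print(shares)
print("Two most common conditions:", top_two)
```

Applied to `df['condition']`, the same call would show how dominant the top categories actually are.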

## Conclusion

The exploratory data analysis provided insights into pricing trends, model year
relationships, and vehicle condition distributions. These findings guided the
design of the Streamlit app and ensured the displayed charts reflect real patterns
in the data.