# 🚗 Car Sales Analysis with Apache Spark & SQL

This project uses Apache Spark (through PySpark and Spark SQL) and the Medallion Architecture to build a scalable data pipeline for analyzing used car sales data. The Bronze, Silver, and Gold layers ingest, clean, and aggregate the listings to produce insights that support decisions in pricing strategy, inventory management, and marketing optimization.


### 🥉 Bronze Layer: Raw Ingestion

- Ingested raw JSON data containing listings of used cars.
- Used PySpark to read and initially explore semi-structured input data.
- Stored the unfiltered data as a raw DataFrame for auditability and traceability.

In [0]:
bronze_data = spark.table("default.bronze_data")
display(bronze_data.limit(5))

| price | brand | model | year | title_status | mileage | color | vin | lot | state | country | condition |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 6300 | toyota | cruiser | 2008 | clean vehicle | 274117.0 | black | jtezu11f88k007763 | 159348797 | new jersey | usa | 10 days left |
| 2899 | ford | se | 2011 | clean vehicle | 190552.0 | silver | 2fmdk3gc4bbb02217 | 166951262 | tennessee | usa | 6 days left |
| 5350 | dodge | mpv | 2018 | clean vehicle | 39590.0 | silver | 3c4pdcgg5jt346413 | 167655728 | georgia | usa | 2 days left |
| 25000 | ford | door | 2014 | clean vehicle | 64146.0 | blue | 1ftfw1et4efc23745 | 167753855 | virginia | usa | 22 hours left |
| 27700 | chevrolet | 1500 | 2018 | clean vehicle | 6654.0 | red | 3gcpcrec2jg473991 | 167763266 | florida | usa | 22 hours left |


### 🥈 Silver Layer: Data Cleaning and Transformation

- Applied PySpark SQL transformations to filter out incomplete, duplicate, or irrelevant rows.
- Standardized columns (e.g., trimming whitespace, converting strings to lowercase) to ensure schema consistency.
- Derived features such as vehicle age and price per mile, and encoded categorical variables for ML readiness.

In [0]:
silver_car = spark.sql("""
    -- Keep complete listings with a non-zero price from model year 2015 onward
    SELECT price, TRIM(brand) AS brand, model, year, mileage
    FROM bronze_data
    WHERE price IS NOT NULL
      AND price != 0
      AND brand IS NOT NULL
      AND model IS NOT NULL
      AND year IS NOT NULL
      AND year >= 2015
      AND mileage IS NOT NULL
""")

silver_car.createOrReplaceTempView("silver_car")

display(silver_car.limit(5))

| price | brand | model | year | mileage |
|---|---|---|---|---|
| 5350 | dodge | mpv | 2018 | 39590.0 |
| 27700 | chevrolet | 1500 | 2018 | 6654.0 |
| 5700 | dodge | mpv | 2018 | 45561.0 |
| 13350 | gmc | door | 2017 | 23525.0 |
| 14600 | chevrolet | malibu | 2018 | 9371.0 |


### 🥇 Gold Layer: Aggregated Insights and Business Value

- Executed Spark SQL queries to compute listing counts, average price, and average mileage by brand and year.
- Identified pricing trends and outliers that inform optimal listing strategies for dealerships.
- Created visualization-ready DataFrames for downstream tools like Tableau or Power BI to help non-technical stakeholders interpret insights.


In [0]:
gold_car = spark.sql("""
SELECT 
    brand, 
    year, 
    COUNT(*) AS total_listings,
    ROUND(AVG(price), 2) AS avg_price, 
    ROUND(AVG(mileage), 2) AS avg_mileage
FROM silver_car
GROUP BY brand, year
""")
display(gold_car.limit(5))

| brand | year | total_listings | avg_price | avg_mileage |
|---|---|---|---|---|
| harley-davidson | 2016 | 1 | 54680.0 | 9502.0 |
| chevrolet | 2015 | 22 | 17488.64 | 75061.27 |
| hyundai | 2015 | 2 | 5625.0 | 99943.0 |
| mercedes-benz | 2015 | 2 | 17950.0 | 66091.5 |
| honda | 2015 | 3 | 6120.0 | 95926.0 |
