Project Description:

This project analyzes a dataset of used-car listings, focusing on data cleaning, data analysis, and data extraction. The goal is to handle missing values, visualize key trends, and support decision-making related to car pricing and sales. We also examine how model year, condition, and mileage influence car prices.

In [None]:
import pandas as pd


In [None]:
import plotly.express as px 

In [None]:
df = pd.read_csv('vehicles_us.csv')

In [None]:
# Convert columns to numeric; entries that cannot be parsed become NaN
df["model_year"] = pd.to_numeric(df["model_year"], errors="coerce")
df["odometer"] = pd.to_numeric(df["odometer"], errors="coerce")
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# Handle missing values and convert data types.
# Assign the result back instead of calling fillna(..., inplace=True) on df["price"]:
# the chained inplace call raises a FutureWarning in recent pandas.
df["price"] = df["price"].fillna(0).astype("Int64")

df["days_listed"] = df["days_listed"].fillna(0).astype("float32")
df["model_year"] = df["model_year"].fillna(0).astype("Int64")
df["cylinders"] = df["cylinders"].fillna(0).astype("Int64")
# Leave missing 'odometer' as NaN: 0 is a valid reading, so a 0 fill
# could not be told apart from real data later on
df["odometer"] = df["odometer"].astype("float32")
df["is_4wd"] = df["is_4wd"].fillna(0).astype(bool)
df["paint_color"] = df["paint_color"].fillna("unknown")

print(df.columns)
print(df.dtypes)


Index(['price', 'model_year', 'model', 'condition', 'cylinders', 'fuel',
       'odometer', 'transmission', 'type', 'paint_color', 'is_4wd',
       'date_posted', 'days_listed'],
      dtype='object')
price           float64
model_year        Int64
model            object
condition        object
cylinders         Int64
fuel             object
odometer        float32
transmission     object
type             object
paint_color      object
is_4wd             bool
date_posted      object
days_listed     float32
dtype: object

In [None]:

# df.info() prints its summary itself and returns None, so wrapping it in print() adds a stray "None"
df.info()
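
Alongside the dtype summary, it can help to count how many values each column is missing before deciding how to fill them. The sketch below uses a small made-up frame; its values are illustrative only and are not taken from `vehicles_us.csv`.

```python
import pandas as pd

# Toy stand-in for the real data; the values are invented for illustration.
toy = pd.DataFrame({
    "model_year": [2011.0, None, 2013.0, 2011.0],
    "cylinders": [6.0, 4.0, None, None],
    "paint_color": ["red", None, "black", "white"],
})

# Number and share of missing values per column.
missing = toy.isna().sum()
missing_share = toy.isna().mean()
print(pd.DataFrame({"missing": missing, "share": missing_share}))
```

Running the same two calls on `df` right after loading shows which columns need a fill strategy at all.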

In [None]:
# Create histograms
hist1 = px.histogram(df, x="price", title="Distribution of Car Prices")

In [None]:
# Create histograms
hist2 = px.histogram(df, x="odometer", title="Distribution of Odometer Readings")

In [None]:
# Create scatter plots
scatterplot1 = px.scatter(df, x="odometer", y="price", title="Price vs Odometer")
scatterplot2 = px.scatter(df, x="model_year", y="price", title="Price vs Model Year")


In [None]:
# Show plots
hist1.show()
hist2.show()
scatterplot1.show()
scatterplot2.show()

In [None]:
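# Fill missing 'model_year' (stored as the 0 placeholder by the earlier fill) with the median per model
year = df["model_year"].astype("float64").replace(0, float("nan"))
df["model_year"] = year.fillna(year.groupby(df["model"]).transform("median")).round().astype("Int64")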


In [None]:
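# Fill missing 'cylinders' (stored as the 0 placeholder by the earlier fill) with the median per vehicle type
cyl = df["cylinders"].astype("float64").replace(0, float("nan"))
df["cylinders"] = cyl.fillna(cyl.groupby(df["type"]).transform("median")).round().astype("Int64")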


In [None]:
# Fill missing 'odometer' based on median per model_year
df["odometer"] = df.groupby("model_year")["odometer"].transform(lambda x: x.fillna(x.median()))


In [None]:
# Display cleaned data
print(df)
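
As a possible next step toward the stated goal of identifying what influences price, the relationship with model year, mileage, and condition can be summarized numerically. The sketch below runs on a small made-up frame (all values are invented); the same calls could be applied to the cleaned `df`.

```python
import pandas as pd

# Toy listings, invented for illustration only.
toy = pd.DataFrame({
    "price": [25000, 18000, 9000, 4000, 15000, 6000],
    "model_year": [2018, 2015, 2010, 2004, 2014, 2007],
    "odometer": [20000, 60000, 120000, 180000, 80000, 150000],
    "condition": ["excellent", "good", "good", "fair", "excellent", "fair"],
})

# Median price per condition level.
median_by_condition = toy.groupby("condition")["price"].median()

# Pearson correlation of price with the numeric factors.
corr_with_price = toy[["price", "model_year", "odometer"]].corr()["price"]

print(median_by_condition)
print(corr_with_price)
```

Correlation only captures linear association, so it complements the scatter plots rather than replacing them.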