# Car prices data preparation

Before running this code, make sure you've downloaded the data CSV file from https://www.kaggle.com/austinreese/craigslist-carstrucks-data.

You may have to create a Kaggle account to download the data.

After downloading it, extract the ZIP file and make sure you have a file named `vehicles.csv` in the current directory.

In [1]:
import pandas as pd

In [2]:
used_cars = pd.read_csv("vehicles.csv")

The dataset includes the following columns:

In [3]:
used_cars.columns

Index(['id', 'url', 'region', 'region_url', 'price', 'year', 'manufacturer',
       'model', 'condition', 'cylinders', 'fuel', 'odometer', 'title_status',
       'transmission', 'vin', 'drive', 'size', 'type', 'paint_color',
       'image_url', 'description', 'county', 'state', 'lat', 'long'],
      dtype='object')

You'll use just a subset of these columns:

In [4]:
u = used_cars[
    [
        "price",
        "year",
        "condition",
        "cylinders",
        "fuel",
        "odometer",
        "transmission",
        "size",
        "type",
    ]
]

Some rows have missing values. In this example, you'll simply drop them:

In [5]:
u = u.dropna()

In [6]:
u.shape

(70256, 9)

There are 70256 rows with complete data.
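Before dropping rows, it can help to see which columns are responsible for the missing values. The following is a minimal sketch on a small made-up DataFrame (the `sample` data is hypothetical and only stands in for the real listings):

```python
import pandas as pd

# Hypothetical toy data standing in for the real listings.
sample = pd.DataFrame(
    {
        "price": [5000, 7500, 0, 12000],
        "year": [2010.0, None, 2015.0, 2018.0],
        "fuel": ["gas", "diesel", None, "gas"],
    }
)

# Number of missing values in each column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# dropna() keeps only the rows that have no missing value at all.
complete = sample.dropna()
print(len(sample), "rows before,", len(complete), "rows after")
```

The same `isna().sum()` call could be applied to `u` before `dropna()` to judge how much data each column costs you.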

It's also important to take a look at the categorical variables:

In [7]:
u["condition"].unique()

array(['good', 'fair', 'excellent', 'like new', 'salvage', 'new'],
      dtype=object)

In [8]:
u["cylinders"].unique()

array(['8 cylinders', '6 cylinders', '4 cylinders', '5 cylinders',
       '10 cylinders', '3 cylinders', 'other', '12 cylinders'],
      dtype=object)

In [9]:
u["fuel"].unique()

array(['diesel', 'gas', 'electric', 'hybrid', 'other'], dtype=object)

In [10]:
u["transmission"].unique()

array(['automatic', 'manual', 'other'], dtype=object)

In [11]:
u["size"].unique()

array(['full-size', 'mid-size', 'compact', 'sub-compact'], dtype=object)

In [12]:
u["type"].unique()

array(['truck', 'SUV', 'sedan', 'mini-van', 'hatchback', 'coupe',
       'pickup', 'wagon', 'convertible', 'van', 'other', 'offroad', 'bus'],
      dtype=object)
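`unique()` lists the categories but not how often each one occurs. As a rough sketch, `value_counts()` shows the frequency of each category, which can guide the choice of which ones to keep. The `sample` data below is made up for illustration:

```python
import pandas as pd

# Hypothetical toy column standing in for u["type"].
sample = pd.DataFrame(
    {"type": ["sedan", "SUV", "sedan", "truck", "sedan", "SUV"]}
)

# Count the rows in each category, most frequent first.
counts = sample["type"].value_counts()
print(counts)
```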

It's difficult to build a model that generalizes across a wide range of car types, so you'll keep only the samples with the most common features:

In [13]:
u = u[u.cylinders.isin(["4 cylinders", "6 cylinders"])]
u = u[u.fuel.isin(["gas", "diesel"])]
u = u[u.transmission.isin(["automatic", "manual"])]
u = u[u.type.isin(["sedan", "coupe", "wagon", "hatchback"])]

In [14]:
u.shape

(23382, 9)
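The four `isin` filters can also be expressed as a single boolean mask built from a dictionary of allowed values. This is only a sketch of an alternative style, not a change to the notebook's approach; the `sample` data and the `allowed` dictionary name are assumptions made for illustration:

```python
import pandas as pd

# Hypothetical toy data with the same categorical columns as u.
sample = pd.DataFrame(
    {
        "cylinders": ["4 cylinders", "8 cylinders", "6 cylinders", "4 cylinders"],
        "fuel": ["gas", "gas", "electric", "diesel"],
        "transmission": ["automatic", "manual", "automatic", "other"],
        "type": ["sedan", "truck", "coupe", "wagon"],
    }
)

# Allowed values per column, mirroring the filters used in the notebook.
allowed = {
    "cylinders": ["4 cylinders", "6 cylinders"],
    "fuel": ["gas", "diesel"],
    "transmission": ["automatic", "manual"],
    "type": ["sedan", "coupe", "wagon", "hatchback"],
}

# Start with every row selected, then require each column to match.
mask = pd.Series(True, index=sample.index)
for column, values in allowed.items():
    mask &= sample[column].isin(values)

filtered = sample[mask]
print(filtered)
```

Keeping the allowed values in one place makes it easier to see, and later change, which categories the model covers.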

Another problem is that some rows have a price of zero. There are also rows with very high prices, some of them due to input errors. Here, you'll filter out these rows and keep only those with 0 < price < 40000:

In [15]:
u = u[u["price"] > 0]
u = u[u["price"] < 40000]

In [16]:
u.shape

(22355, 9)
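The open interval 0 < price < 40000 can also be written with `Series.between`, which accepts `inclusive="neither"` in pandas 1.3 or later. A minimal sketch on made-up prices (the `sample` data is hypothetical):

```python
import pandas as pd

# Hypothetical prices, including a zero and an obvious input error.
sample = pd.DataFrame({"price": [0, 500, 39999, 40000, 123456789]})

# Keep only rows strictly between the two bounds.
in_range = sample[sample["price"].between(0, 40000, inclusive="neither")]
print(in_range)
```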

The `odometer` column also includes several rows with very high values. For this model, you'll consider only rows with `odometer` < 100000:

In [17]:
u = u[u.odometer < 1e5]

In [18]:
u.shape

(9800, 9)

To build a consistent model, you'll consider only cars with `year` > 2000:

In [19]:
u = u[u.year > 2000]

In [20]:
u.shape

(9332, 9)

After the filtering, you get a dataset with 9332 samples.
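To see how many rows each numeric filter removes, you could apply the steps in a loop and record the row count after each one. The sketch below uses made-up data; the `steps` list and the `row_counts` name are assumptions for illustration:

```python
import pandas as pd

# Hypothetical toy data with the three numeric columns.
sample = pd.DataFrame(
    {
        "price": [0, 8000, 15000, 52000, 9000],
        "odometer": [50000.0, 120000.0, 30000.0, 10000.0, 80000.0],
        "year": [2012.0, 2008.0, 2016.0, 2019.0, 1998.0],
    }
)

# Each step is a name plus a function that returns the filtered DataFrame.
steps = [
    ("price", lambda df: df[(df["price"] > 0) & (df["price"] < 40000)]),
    ("odometer", lambda df: df[df["odometer"] < 1e5]),
    ("year", lambda df: df[df["year"] > 2000]),
]

# Record the number of rows left after each step.
row_counts = [("start", len(sample))]
current = sample
for name, step in steps:
    current = step(current)
    row_counts.append((name, len(current)))

for name, count in row_counts:
    print(f"{name}: {count} rows")
```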

Now, you'll take a look at the data types used in the columns:

In [21]:
u.dtypes

price             int64
year            float64
condition        object
cylinders        object
fuel             object
odometer        float64
transmission     object
size             object
type             object
dtype: object

As you can see, `year` and `odometer` are stored as `float64` columns, which isn't ideal because both hold whole numbers. You can convert these columns to `int` with the following:

In [22]:
# Reassign the result: assigning through .loc can keep the original float64 dtype.
u = u.astype({"year": int, "odometer": int})
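A cast to an integer type truncates any fractional part, so it can be worth confirming first that the values are whole numbers. The following sketch does that check on made-up data and then verifies the resulting dtypes; the `sample` data and the explicit `"int64"` target are assumptions for illustration:

```python
import pandas as pd

# Hypothetical toy data with float columns that hold whole numbers.
sample = pd.DataFrame(
    {"year": [2012.0, 2016.0], "odometer": [50000.0, 30250.0]}
)
columns = ["year", "odometer"]

# True only if every value in both columns has no fractional part.
all_whole = (sample[columns] % 1 == 0).all().all()
print("All values are whole numbers:", all_whole)

# astype returns a new DataFrame, so the result has to be reassigned.
converted = sample.astype({"year": "int64", "odometer": "int64"})
print(converted.dtypes)
```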

Finally, save the filtered dataset to a CSV file:

In [23]:
u.to_csv("vehicles_cleaned.csv", index=False)
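As a quick sanity check, you could read the saved data back and confirm that the shape and dtypes survived the round trip. The sketch below uses an in-memory buffer and made-up data instead of the real `vehicles_cleaned.csv` file, so it runs without the dataset:

```python
import io

import pandas as pd

# Hypothetical cleaned data standing in for u.
sample = pd.DataFrame(
    {
        "price": [8000, 15000],
        "year": [2012, 2016],
        "fuel": ["gas", "diesel"],
    }
)

# Write to an in-memory buffer the same way the notebook writes to a file.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Rewind and read the CSV text back into a new DataFrame.
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.shape)
print(restored.dtypes)
```

With `index=False`, the row index isn't written, so the restored DataFrame has the same columns as the original.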