# Exploring eBay Car Sales Data

This project explores a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website. The dataset is available [here](https://www.kaggle.com/orgesleka/used-cars-database/data).

The main fields in the dataset are:
- **dateCrawled** - When this ad was first crawled. All field-values are taken from this date.
- **name** - Name of the car.
- **seller** - Whether the seller is private or a dealer.
- **offerType** - The type of listing.
- **price** - The price on the ad to sell the car.
- **abtest** - Whether the listing is included in an A/B test.
- **vehicleType** - The vehicle type.
- **yearOfRegistration** - The year in which the car was first registered.
- **gearbox** - The transmission type.
- **powerPS** - The power of the car in PS.
- **model** - The car model name.
- **kilometer** - How many kilometers the car has driven.
- **monthOfRegistration** - The month in which the car was first registered.
- **fuelType** - What type of fuel the car uses.
- **brand** - The brand of the car.
- **notRepairedDamage** - Whether the car has damage that has not yet been repaired.
- **dateCreated** - The date on which the eBay listing was created.
- **nrOfPictures** - The number of pictures in the ad.
- **postalCode** - The postal code for the location of the vehicle.
- **lastSeenOnline** - When the crawler saw this ad last online.

In [35]:
import pandas as pd
import numpy as np

autos = pd.read_csv('autos.csv', encoding = 'Latin-1')

In [36]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null obj

In [37]:
autos.head()

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


## Initial observations... 
- Categorical values are in German (for example 'privat', 'Angebot', 'manuell')
- 'vehicleType', 'gearbox', 'model', 'fuelType', and 'notRepairedDamage' have missing values, but none is missing in more than 20% of rows
- Columns are stored as either objects or integers; most are objects (strings), including 'price' and 'odometer'
- Some fields are not readily usable, e.g. 'name', and could be split into further columns for better readability and access
- Column names use camelCase rather than the snake_case preferred in Python
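
The share of missing values per column can be checked directly rather than read off `info()`. The snippet below is a minimal sketch on a toy frame (the `sample` data is illustrative, not taken from the dataset). In the real data, `info()` reports 40171 non-null values for 'notRepairedDamage', so 9829 of 50000 rows (about 19.7%) are missing there.

```python
import pandas as pd

# Toy stand-in for the real `autos` frame (values are illustrative only).
sample = pd.DataFrame({
    "vehicleType": ["bus", None, "kombi", "limousine"],
    "gearbox": ["manuell", "automatik", None, None],
    "brand": ["peugeot", "bmw", "ford", "smart"],
})

# isnull() gives a boolean frame; the column-wise mean is the share of missing values.
missing_share = sample.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Running the same two calls on `autos` should list the five columns named above at the top.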

In [38]:
print(autos.columns)

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')


In [39]:
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price_dollars', 'abtest',
    'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
    'odometer_km', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'no_of_pictures', 'postal_code',
       'last_seen']
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price_dollars,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,no_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


## Changing column headings...

- Column names are changed from camelCase to snake_case, the usual naming convention in Python
- Some names are also reworded for easier access, e.g. 'yearOfRegistration' to 'registration_year'
- Some names now include the unit of measurement, e.g. 'price' to 'price_dollars'
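
The casing change itself can be automated. The helper below is a sketch, and `camel_to_snake` is a hypothetical name rather than something used in this notebook. It only converts the casing; the reworded names (such as 'registration_year') and the added units still need to be set by hand, as in the cell above.

```python
import re

def camel_to_snake(name):
    """Insert '_' before each capital that follows a lowercase letter or digit, then lowercase."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

columns = ["dateCrawled", "offerType", "powerPS", "nrOfPictures", "abtest"]
snake_columns = [camel_to_snake(c) for c in columns]
print(snake_columns)
```

The lookbehind keeps runs of capitals together, so 'powerPS' becomes 'power_ps' rather than 'power_p_s'.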

In [40]:
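# regex=False treats '$' as a literal character, not as the regex end-of-string anchor
autos['price_dollars'] = autos['price_dollars'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(int)
autos['odometer_km'] = autos['odometer_km'].str.replace('km', '', regex=False).str.replace(',', '', regex=False).astype(int)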
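
An alternative to chaining one `replace` per symbol is to strip every non-digit character in a single pass. This is a sketch only: `to_int` is a hypothetical helper, the sample strings are copied from the `head()` output above, and the approach assumes the values are non-negative whole numbers (it would silently mangle decimals or minus signs).

```python
import pandas as pd

def to_int(series):
    """Drop every non-digit character, then cast to int."""
    return series.str.replace(r"\D", "", regex=True).astype(int)

prices = pd.Series(["$5,000", "$8,500", "$1,350"])
odometer = pd.Series(["150,000km", "70,000km"])

clean_prices = to_int(prices)
clean_odometer = to_int(odometer)
print(clean_prices.tolist(), clean_odometer.tolist())
```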

In [41]:
autos.describe(include = 'all')

Unnamed: 0,date_crawled,name,seller,offer_type,price_dollars,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,no_of_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000.0,50000,44905,50000.0,47320,50000.0,47242,50000.0,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,,2,8,,2,,245,,,7,40,2,76,,,39481
top,2016-03-16 21:50:53,Ford_Fiesta,privat,Angebot,,test,limousine,,manuell,,golf,,,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,,25756,12859,,36993,,4024,,,30107,10687,35232,1946,,,8
mean,,,,,9840.044,,,2005.07328,,116.35592,,125732.7,5.72336,,,,,0.0,50813.6273,
std,,,,,481104.4,,,105.712813,,209.216627,,40042.211706,3.711984,,,,,0.0,25779.747957,
min,,,,,0.0,,,1000.0,,0.0,,5000.0,0.0,,,,,0.0,1067.0,
25%,,,,,1100.0,,,1999.0,,70.0,,125000.0,3.0,,,,,0.0,30451.0,
50%,,,,,2950.0,,,2003.0,,105.0,,150000.0,6.0,,,,,0.0,49577.0,
75%,,,,,7200.0,,,2008.0,,150.0,,150000.0,9.0,,,,,0.0,71540.0,


The describe() function reveals little for columns such as 'postal_code', but it does show that 'no_of_pictures' is irrelevant: every value in it is 0.

There are also 5 columns with only 2 unique values each. For two of these, 'seller' and 'offer_type', all but one of the 50,000 entries share the same value, so these columns carry almost no information and may not be necessary.

The 'price_dollars', 'registration_year', and 'power_ps' columns also have unrealistic minimum and maximum values, so rows outside a plausible range can be removed.
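
Columns that are constant or nearly constant can be found programmatically and then dropped. The following is a sketch on toy data; `near_constant_columns` and its `threshold` parameter are hypothetical, and the values in `sample` are illustrative.

```python
import pandas as pd

def near_constant_columns(df, threshold=0.99):
    """Return columns whose most frequent value covers at least `threshold` of the rows."""
    flagged = []
    for col in df.columns:
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share >= threshold:
            flagged.append(col)
    return flagged

# Toy frame: one almost-constant column, one constant column, one balanced column.
sample = pd.DataFrame({
    "seller": ["privat"] * 199 + ["gewerblich"],
    "no_of_pictures": [0] * 200,
    "abtest": ["test", "control"] * 100,
})

to_drop = near_constant_columns(sample)
trimmed = sample.drop(columns=to_drop)
print(to_drop, list(trimmed.columns))
```

The notebook itself keeps all 20 columns; dropping them is optional and shown here only to illustrate the observation.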

## Filtering outliers

Next we apply filters to remove outliers from the following columns:
- registration_year => 1900-2018
- price_dollars => 1-500,000
- power_ps => 50-1000
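
The notebook applies these filters one cell at a time. As a sketch, the same bounds (taken from the notebook's code cells) can be combined into a single boolean mask; the `sample` rows below are illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    "price_dollars": [0, 5000, 99999999, 8500],
    "registration_year": [2004, 1000, 2009, 1997],
    "power_ps": [158, 102, 71, 286],
})

# between() is inclusive on both ends by default; '&' keeps rows passing all three checks.
keep = (
    sample["price_dollars"].between(1, 500000)
    & sample["registration_year"].between(1900, 2018)
    & sample["power_ps"].between(50, 1000)
)
filtered = sample[keep]
print(filtered)
```

Only the last row survives: the others fail on price, registration year, and price respectively.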

In [42]:
autos['price_dollars'].value_counts().sort_index(ascending = False)

99999999       1
27322222       1
12345678       3
11111111       2
10000000       1
3890000        1
1300000        1
1234566        1
999999         2
999990         1
350000         1
345000         1
299000         1
295000         1
265000         1
259000         1
250000         1
220000         1
198000         1
197000         1
194000         1
190000         1
180000         1
175000         1
169999         1
169000         1
163991         1
163500         1
155000         1
151990         1
            ... 
66             1
65             5
60             9
59             1
55             2
50            49
49             4
47             1
45             4
40             6
35             1
30             7
29             1
25             5
20             4
18             1
17             3
15             2
14             1
13             2
12             3
11             2
10             7
9              1
8              1
5              2
3              1
2             

Listing prices vary widely. The extremely high prices (such as 99,999,999 or 12,345,678) are likely errors, since cars priced near or above a million would not realistically be sold on eBay, and a price of 0 cannot be a valid listing.

In [43]:
autos = autos[autos['price_dollars'].between(1, 500000)]
autos['odometer_km'].value_counts().sort_index(ascending = False)

150000    31414
125000     5057
100000     2115
90000      1734
80000      1415
70000      1217
60000      1155
50000      1012
40000       815
30000       780
20000       762
10000       253
5000        836
Name: odometer_km, dtype: int64

In [44]:
autos['registration_year'].value_counts().sort_index(ascending = False)

9999       3
9000       1
8888       1
6200       1
5911       1
5000       4
4800       1
4500       1
4100       1
2800       1
2019       2
2018     470
2017    1392
2016    1220
2015     392
2014     663
2013     803
2012    1310
2011    1623
2010    1589
2009    2085
2008    2215
2007    2277
2006    2670
2005    2936
2004    2703
2003    2699
2002    2486
2001    2636
2000    3156
        ... 
1964      12
1963       8
1962       4
1961       6
1960      23
1959       6
1958       4
1957       2
1956       4
1955       2
1954       2
1953       1
1952       1
1951       2
1950       3
1948       1
1943       1
1941       2
1939       1
1938       1
1937       4
1934       2
1931       1
1929       1
1927       1
1910       5
1800       2
1111       1
1001       1
1000       1
Name: registration_year, Length: 95, dtype: int64

The registration years include clearly false values: cars registered before the 20th century are implausible, and registration years in the future are impossible.

In [45]:
autos = autos[autos['registration_year'].between(1900, 2018)]
autos['power_ps'].value_counts().sort_index(ascending = False)

17700       1
16312       1
16011       1
15001       1
14009       1
9011        1
8404        1
7511        1
6512        1
6226        1
6045        1
5867        1
4400        1
3750        1
3500        1
2729        1
2018        1
1998        2
1988        1
1986        1
1800        1
1796        1
1793        1
1781        1
1780        1
1779        1
1771        1
1753        1
1704        1
1405        1
         ... 
37          7
35          2
34         27
33          9
30          3
29          4
27          5
26         34
25          2
24          1
23          3
21          1
20          4
19          3
18          6
16          1
15          5
14          1
12          1
11          4
10          2
9           1
8           2
6           3
5          13
4           4
3           2
2           2
1           5
0        4973
Name: power_ps, Length: 444, dtype: int64

In [46]:
autos = autos[autos['power_ps'].between(50, 1000)]
autos.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 42756 entries, 0 to 49999
Data columns (total 20 columns):
date_crawled          42756 non-null object
name                  42756 non-null object
seller                42756 non-null object
offer_type            42756 non-null object
price_dollars         42756 non-null int64
abtest                42756 non-null object
vehicle_type          39963 non-null object
registration_year     42756 non-null int64
gearbox               41984 non-null object
power_ps              42756 non-null int64
model                 41042 non-null object
odometer_km           42756 non-null int64
registration_month    42756 non-null int64
fuel_type             40146 non-null object
brand                 42756 non-null object
unrepaired_damage     36440 non-null object
ad_created            42756 non-null object
no_of_pictures        42756 non-null int64
postal_code           42756 non-null int64
last_seen             42756 non-null object
dtypes: int64(7), 



Applying the filters removes 7,244 entries (from 50,000 down to 42,756), or about 14.5% of the data.
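
The effect of the filters can be worked out from the two `info()` outputs, which report 50000 rows before filtering and 42756 after:

```python
rows_before = 50000  # from the first autos.info() output
rows_after = 42756   # from autos.info() after the three filters

removed = rows_before - rows_after
share_removed = removed / rows_before
print(removed, f"{share_removed:.1%}")
```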

## Exploring Date Columns

Out of the 5 columns representing date values, we focus on the 3 which are currently stored as strings in the dataframe. 
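
The cells below work with the first ten characters of each string. As an alternative sketch, the three columns could be parsed into real timestamps with `pd.to_datetime`, which also allows date arithmetic. The rows are copied from the `head()` output above; `days_online` is a hypothetical name.

```python
import pandas as pd

sample = pd.DataFrame({
    "date_crawled": ["2016-03-26 17:47:46", "2016-04-04 13:38:56"],
    "ad_created": ["2016-03-26 00:00:00", "2016-04-04 00:00:00"],
    "last_seen": ["2016-04-06 06:45:54", "2016-04-06 14:45:08"],
})

# Parse each string column into datetime64 using the format seen in the data.
for col in ["date_crawled", "ad_created", "last_seen"]:
    sample[col] = pd.to_datetime(sample[col], format="%Y-%m-%d %H:%M:%S")

# Whole days between ad creation and the last time the crawler saw the ad.
days_online = (sample["last_seen"] - sample["ad_created"]).dt.days
print(days_online.tolist())
```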

In [47]:
autos[['date_crawled','ad_created','last_seen']][0:5]

Unnamed: 0,date_crawled,ad_created,last_seen
0,2016-03-26 17:47:46,2016-03-26 00:00:00,2016-04-06 06:45:54
1,2016-04-04 13:38:56,2016-04-04 00:00:00,2016-04-06 14:45:08
2,2016-03-26 18:57:24,2016-03-26 00:00:00,2016-04-06 20:15:37
3,2016-03-12 16:58:10,2016-03-12 00:00:00,2016-03-15 03:16:28
5,2016-03-21 13:47:45,2016-03-21 00:00:00,2016-04-06 09:45:21


In [48]:
(autos['date_crawled']
    .str[:10]
    .value_counts(normalize=True, dropna=False)
    .sort_index()
    )

2016-03-05    0.025564
2016-03-06    0.014314
2016-03-07    0.036229
2016-03-08    0.033188
2016-03-09    0.032463
2016-03-10    0.032276
2016-03-11    0.032323
2016-03-12    0.037281
2016-03-13    0.015670
2016-03-14    0.036697
2016-03-15    0.034124
2016-03-16    0.029259
2016-03-17    0.031528
2016-03-18    0.012981
2016-03-19    0.034498
2016-03-20    0.038334
2016-03-21    0.037211
2016-03-22    0.032791
2016-03-23    0.031551
2016-03-24    0.029329
2016-03-25    0.031855
2016-03-26    0.032557
2016-03-27    0.030803
2016-03-28    0.035480
2016-03-29    0.033399
2016-03-30    0.033633
2016-03-31    0.031645
2016-04-01    0.033656
2016-04-02    0.035504
2016-04-03    0.039152
2016-04-04    0.037001
2016-04-05    0.013191
2016-04-06    0.003181
2016-04-07    0.001333
Name: date_crawled, dtype: float64

The site was crawled daily from 5 March to 7 April 2016. Most days account for roughly 2.5% to 4% of listings, with a few lower days (for example 6, 13 and 18 March) and a sharp drop over the final three days.

In [49]:
(autos['ad_created']
    .str[:10]
    .value_counts(normalize=True, dropna=False)
    .sort_index()
    )

2015-08-10    0.000023
2015-09-09    0.000023
2015-11-10    0.000023
2015-12-05    0.000023
2015-12-30    0.000023
2016-01-03    0.000023
2016-01-07    0.000023
2016-01-10    0.000047
2016-01-13    0.000023
2016-01-14    0.000023
2016-01-16    0.000023
2016-01-22    0.000023
2016-01-27    0.000070
2016-01-29    0.000023
2016-02-01    0.000023
2016-02-02    0.000047
2016-02-05    0.000047
2016-02-07    0.000023
2016-02-08    0.000023
2016-02-09    0.000023
2016-02-12    0.000047
2016-02-14    0.000023
2016-02-16    0.000023
2016-02-17    0.000023
2016-02-18    0.000047
2016-02-19    0.000047
2016-02-20    0.000047
2016-02-21    0.000070
2016-02-22    0.000023
2016-02-23    0.000070
                ...   
2016-03-09    0.032557
2016-03-10    0.032019
2016-03-11    0.032627
2016-03-12    0.037118
2016-03-13    0.017050
2016-03-14    0.035317
2016-03-15    0.033867
2016-03-16    0.029750
2016-03-17    0.031317
2016-03-18    0.013542
2016-03-19    0.033422
2016-03-20    0.038404
2016-03-21 

Most of the listings seem to have been created around March/April 2016. Some are older, however, with the oldest dating from August 2015.

In [50]:
(autos['last_seen']
    .str[:10]
    .value_counts(normalize=True, dropna=False)
    .sort_index()
    )

2016-03-05    0.001052
2016-03-06    0.004116
2016-03-07    0.005029
2016-03-08    0.006946
2016-03-09    0.009309
2016-03-10    0.010034
2016-03-11    0.012092
2016-03-12    0.023529
2016-03-13    0.008584
2016-03-14    0.012606
2016-03-15    0.015670
2016-03-16    0.015904
2016-03-17    0.027318
2016-03-18    0.007157
2016-03-19    0.015811
2016-03-20    0.020652
2016-03-21    0.020091
2016-03-22    0.020980
2016-03-23    0.018407
2016-03-24    0.019342
2016-03-25    0.019038
2016-03-26    0.016466
2016-03-27    0.015226
2016-03-28    0.020488
2016-03-29    0.021821
2016-03-30    0.024722
2016-03-31    0.023412
2016-04-01    0.022921
2016-04-02    0.025306
2016-04-03    0.024722
2016-04-04    0.024043
2016-04-05    0.127163
2016-04-06    0.225886
2016-04-07    0.134157
Name: last_seen, dtype: float64

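The date a listing was last seen approximates when it was taken down, either because it expired or because the car sold. About 49% of listings were last seen in the final three days (5 to 7 April 2016). This may suggest that monthly listings are ending and pending renewal, although these are also the last days of the crawl, so the spike could equally reflect listings that were still online when crawling stopped. Before that, the daily share is a fairly steady stream of roughly 1% to 2.5% of listings.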
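
The weight of the final three days can be summed directly from the output above:

```python
# Shares of 'last_seen' per day, copied from the value_counts output.
last_three_days = {
    "2016-04-05": 0.127163,
    "2016-04-06": 0.225886,
    "2016-04-07": 0.134157,
}

share = sum(last_three_days.values())
print(f"{share:.1%}")
```

So just under half of the filtered listings were last seen between 5 and 7 April 2016.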

## Brand value

We now look at the pricing of the listings by brand.

In [51]:
autos['brand'].value_counts(dropna=False)

volkswagen        9126
bmw               4910
opel              4445
mercedes_benz     4156
audi              3876
ford              2942
renault           1933
peugeot           1246
fiat              1048
seat               824
skoda              729
mazda              669
nissan             633
citroen            592
toyota             553
smart              496
hyundai            441
volvo              405
mini               401
mitsubishi         356
honda              353
kia                311
sonstige_autos     305
alfa_romeo         292
porsche            265
suzuki             251
chevrolet          242
chrysler           152
dacia              119
jeep                98
subaru              94
land_rover          91
daihatsu            86
saab                74
jaguar              66
daewoo              56
rover               52
lancia              46
lada                20
trabant              2
Name: brand, dtype: int64

In [52]:
brands = autos['brand'].value_counts().index
print(brands)

Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford', 'renault',
       'peugeot', 'fiat', 'seat', 'skoda', 'mazda', 'nissan', 'citroen',
       'toyota', 'smart', 'hyundai', 'volvo', 'mini', 'mitsubishi', 'honda',
       'kia', 'sonstige_autos', 'alfa_romeo', 'porsche', 'suzuki', 'chevrolet',
       'chrysler', 'dacia', 'jeep', 'subaru', 'land_rover', 'daihatsu', 'saab',
       'jaguar', 'daewoo', 'rover', 'lancia', 'lada', 'trabant'],
      dtype='object')


In [53]:
# Mean listing price per brand, stored as {brand: mean price}
mean_price = {}
for brand in brands:
    brand_rows = autos[autos['brand'] == brand]  # rows for this brand only
    mean_price[brand] = brand_rows['price_dollars'].mean()
mean_price

{'alfa_romeo': 4140.390410958904,
 'audi': 9609.870227038184,
 'bmw': 8535.964969450102,
 'chevrolet': 6621.764462809917,
 'chrysler': 3604.0855263157896,
 'citroen': 3893.054054054054,
 'dacia': 6071.075630252101,
 'daewoo': 1095.5714285714287,
 'daihatsu': 1877.7325581395348,
 'fiat': 2977.0248091603053,
 'ford': 3928.137321549966,
 'honda': 4203.2776203966005,
 'hyundai': 5644.868480725623,
 'jaguar': 12063.242424242424,
 'jeep': 12036.775510204081,
 'kia': 6132.614147909968,
 'lada': 2922.2,
 'lancia': 3476.3695652173915,
 'land_rover': 19665.86813186813,
 'mazda': 4298.896860986547,
 'mercedes_benz': 8930.289461020211,
 'mini': 10649.294264339153,
 'mitsubishi': 3529.2106741573034,
 'nissan': 5098.925750394945,
 'opel': 3222.111361079865,
 'peugeot': 3287.4903691813806,
 'porsche': 48507.09433962264,
 'renault': 2631.183652353854,
 'rover': 1767.0,
 'saab': 3341.175675675676,
 'seat': 4665.377427184466,
 'skoda': 6657.319615912208,
 'smart': 3970.1774193548385,
 'sonstige_autos': 

The most expensive brand on average is Porsche at about \$48.5k, followed by Land Rover at about \$19.7k. A mid-range group including sonstige_autos ("other cars"), Jaguar, Jeep and Mini sits between roughly \$10.6k and \$14.9k. The lowest value brand turns out to be Trabant at less than \$1k.
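
The per-brand loop can also be written as a single `groupby`, which returns a Series that is easy to sort. This is a sketch on toy rows (the prices are illustrative), not the notebook's own method:

```python
import pandas as pd

sample = pd.DataFrame({
    "brand": ["bmw", "bmw", "ford", "ford", "smart"],
    "price_dollars": [8500, 6500, 1350, 2650, 4350],
})

# One mean per brand, most expensive first.
mean_price_by_brand = (
    sample.groupby("brand")["price_dollars"].mean().sort_values(ascending=False)
)
print(mean_price_by_brand)
```

Calling `.to_dict()` on the result gives the same kind of dictionary as the loop.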

## Storing aggregated data

In [54]:
# Mean odometer reading per brand, stored as {brand: mean km}
mean_mileage = {}
for brand in brands:
    brand_rows = autos[autos['brand'] == brand]  # rows for this brand only
    mean_mileage[brand] = brand_rows['odometer_km'].mean()
mean_mileage

{'alfa_romeo': 131061.64383561644,
 'audi': 128956.3983488132,
 'bmw': 132747.45417515276,
 'chevrolet': 97272.72727272728,
 'chrysler': 134013.15789473685,
 'citroen': 119442.56756756757,
 'dacia': 83025.21008403362,
 'daewoo': 123660.71428571429,
 'daihatsu': 117732.55813953489,
 'fiat': 117433.20610687023,
 'ford': 124406.8660774983,
 'honda': 122379.60339943343,
 'hyundai': 104909.2970521542,
 'jaguar': 127424.24242424243,
 'jeep': 127806.12244897959,
 'kia': 112443.72990353698,
 'lada': 90000.0,
 'lancia': 125000.0,
 'land_rover': 118076.92307692308,
 'mazda': 124880.41853512706,
 'mercedes_benz': 130742.30028873918,
 'mini': 89164.5885286783,
 'mitsubishi': 125702.24719101124,
 'nissan': 117353.87045813586,
 'opel': 128903.26209223847,
 'peugeot': 126448.63563402889,
 'porsche': 97716.98113207547,
 'renault': 127335.74754267978,
 'rover': 136730.76923076922,
 'saab': 143243.24324324325,
 'seat': 121371.35922330097,
 'skoda': 110185.18518518518,
 'smart': 93608.87096774194,
 'sons

In [57]:
price_series = pd.Series(mean_price)
mileage_series = pd.Series(mean_mileage)
agg = pd.DataFrame(price_series, columns=['mean_price_dollars'])
agg['mean_mileage_km'] = mileage_series
agg['mileage/price ratio'] = agg['mean_mileage_km'] / agg['mean_price_dollars']
agg.round(2).sort_values(by=['mean_price_dollars'], ascending=False)

Unnamed: 0,mean_price_dollars,mean_mileage_km,mileage/price ratio
porsche,48507.09,97716.98,2.01
land_rover,19665.87,118076.92,6.0
sonstige_autos,14860.19,101950.82,6.86
jaguar,12063.24,127424.24,10.56
jeep,12036.78,127806.12,10.62
mini,10649.29,89164.59,8.37
audi,9609.87,128956.4,13.42
mercedes_benz,8930.29,130742.3,14.64
bmw,8535.96,132747.45,15.55
skoda,6657.32,110185.19,16.55


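From this analysis we see that average mileage differs far less between brands than average price does: Porsche averages about 98,000 km and BMW about 133,000 km, while their mean prices differ by a factor of more than five. The top brands therefore do not come with substantially lower mileage, and the mileage/price ratio falls from 16.55 km per dollar for Skoda to 2.01 for Porsche. This effectively means that the price paid per kilometre already driven is much higher for the top brands.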
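
The same comparison may be easier to read as dollars per kilometre, the inverse of the ratio in the table. The sketch below reuses three rows of the table above; `dollars_per_km` is a hypothetical column name.

```python
import pandas as pd

# Mean price and mean mileage for three brands, copied from the aggregate table.
agg_sample = pd.DataFrame(
    {
        "mean_price_dollars": [48507.09, 8535.96, 6657.32],
        "mean_mileage_km": [97716.98, 132747.45, 110185.19],
    },
    index=["porsche", "bmw", "skoda"],
)

# Inverse of the mileage/price ratio: listing price per kilometre already driven.
agg_sample["dollars_per_km"] = (
    agg_sample["mean_price_dollars"] / agg_sample["mean_mileage_km"]
)
print(agg_sample.round(3))
```

On these figures a Porsche costs roughly half a dollar per kilometre driven, against about six cents for a Skoda.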

## Next steps...

Here are some next steps to consider:

- Data cleaning next steps:
    - Identify categorical data that uses German words, translate them, and map the values to their English counterparts
    - Convert the dates to uniform numeric data, so "2016-03-21" becomes the integer 20160321
    - See if there are particular keywords in the name column that can be extracted as new columns
- Analysis next steps:
    - Find the most common brand/model combinations
    - Split odometer_km into groups, and use aggregation to see if average prices follow any pattern based on mileage
    - How much cheaper are cars with damage than their non-damaged counterparts?
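
The first two data cleaning steps could be sketched as follows. The `translations` dictionary is an assumption covering only a few of the German values seen in the outputs above, and the toy `sample` frame stands in for `autos`.

```python
import pandas as pd

sample = pd.DataFrame({
    "gearbox": ["manuell", "automatik", None],
    "unrepaired_damage": ["nein", "ja", "nein"],
    "ad_created": ["2016-03-21 00:00:00", "2016-04-04 00:00:00", "2016-03-12 00:00:00"],
})

# Partial German -> English mapping per column (illustrative, not exhaustive).
translations = {
    "gearbox": {"manuell": "manual", "automatik": "automatic"},
    "unrepaired_damage": {"nein": "no", "ja": "yes"},
}
for col, mapping in translations.items():
    # replace() leaves unmapped and missing values untouched.
    sample[col] = sample[col].replace(mapping)

# Keep the date part, drop the dashes, and cast: "2016-03-21" -> 20160321.
sample["ad_created"] = (
    sample["ad_created"].str[:10].str.replace("-", "", regex=False).astype(int)
)
print(sample)
```

Using `replace` rather than `map` is a deliberate choice here: values missing from the dictionary stay as they are instead of turning into NaN, which makes untranslated categories easy to spot.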