# Project Description

## Data

**Abstract**:
- Analyse car prices for German and UK markets.
- Determine value depreciation of cars. Model car prices value depreciation for future.

**Data Sources**:
1. German Car Market: the data got via scraping from [Web site AutoScout24](https://www.autoscout24.de/)
2. UK Car Market: [Kaggle Datasource](https://www.kaggle.com/datasets/adityadesai13/used-car-dataset-ford-and-mercedes?select=audi.csv), which also was scraped by the author, from download link. 

**Data Set Characteristics:**
- Obtained: scraped April 2023 (german Market), Kaggle Download (UK market)
- Multivariate dataset
- Shape of the dataset: 10976 rows, 20 columns
- Area: busines, used cars markets UK and Germany
- Attribute Characteristics: Categorical, Integer, Float, String
- Date priods: cars offerd for buying in April 2023, with first registration from 2023 to 2012.
- Associated Tasks: EDA, Regression (Car price prediction)
- Missing Values?: Yes

**Variable Description:**

|Variable|Definition   | Key  |
|---|---|---|
|make |Car Manufacturer  |   Audi,BMW, Volkswagon,Mercedes|
|model|Car Model within each Manufacturer| |
|fuel|Fuel Type|'Petrol', 'Diesel', 'Electro'|
|mileage|Km stand of car|in KM|
|gear|Gear Type |Automatic,manual|
|registration|year of first registration of car|Year|
|hp|Engine Power|kW|
|owner|no. of Previous Owner ||
|body|Car Type|Sedan, Small car, Station wagon, Convertible, Coupe, SUV|
|car_condition|Demonstration vehicle, Day admission, Annual car, Used, New||
|consumption|Fuel Consumption|in l/Km|
|emission|Exhaust emission|in g/Km|
|color|car color||
|car_id|unique carid||
|displacement|Engine Size|in cm3|
|drive_type|Types of drivetrain|Front,Rear, Four wheel drive(four w.d.)|
|link|link of car description||
|price|Price of car||

# Imports

In [1]:
# import pandas, numpy, datetime module
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np

from datetime import datetime, date, time, timedelta

import datetime
import time
import re

# import plotting modules
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# import formattings
from matplotlib.ticker import PercentFormatter
plt.rcParams.update({ "figure.figsize" : (8, 5),"axes.facecolor" : "white", "axes.edgecolor":  "black"})
plt.rcParams["figure.facecolor"]= "w"
pd.plotting.register_matplotlib_converters()
pd.set_option('display.float_format', lambda x: '%.2f' % x)
pd.set_option('display.max_columns', None)

RSEED = 3

# Setting plt style
plt.style.use('fivethirtyeight')

# set color theme
sns_colors = ["#FF6D43", "#00135D", '#00135D', '#00135D']
sns.set_palette(sns.color_palette(sns_colors))

primary = '#FF6D43'
secondary = '#00135D'

# Data understanding

In [2]:
# Import the dataset and load DF
df = pd.read_csv('../data/german_cars_scraping.csv')
df.drop(['doors', 'seats'], inplace=True, axis=1)

In [3]:
#Check the shape of the data
print(f'The dataset has {df.shape[0]} rows and {df.shape[1]} columns')

The dataset has 11109 rows and 18 columns


In [4]:
# check for duplicated values
df.duplicated().value_counts()

False    11104
True         5
dtype: int64

In [5]:
df[df.duplicated()]

Unnamed: 0,make,model,price,fuel,mileage,gear,registration,hp,owner,body,car_condition,consumption,emission,color,car_id,displacement,drive_type,link
10393,Audi,A1,"€ 12.790,-",Benzin,46.533 km,Schaltgetriebe,02/2013,63 kW (86 PS),2.0,Limousine,Gebraucht,"5,1 l/100 km (komb.)",121 g/km (komb.),Orange,ATS9040,1.197 cm³,Front,https://www.autoscout24.de/angebote/audi-a1-sp...
11026,Audi,A3,"€ 35.920,-1",Elektro/Benzin,13.000 km,Automatik,03/2022,180 kW (245 PS),1.0,Limousine,Jahreswagen,"1,4 l/100 km (komb.)",31 g/km (komb.),Blau,663018437048,1.395 cm³,,https://www.autoscout24.de/angebote/audi-a3-sp...
11032,Volkswagen,Passat Variant,"€ 43.340,-1",Elektro/Benzin,12.500 km,Automatik,07/2022,160 kW (218 PS),1.0,Kombi,Jahreswagen,"1,7 l/100 km (komb.)",38 g/km (komb.),Grau,22816,1.395 cm³,,https://www.autoscout24.de/angebote/volkswagen...
11040,Volkswagen,Tiguan,"€ 45.820,-1",Elektro/Benzin,23.671 km,Automatik,02/2022,180 kW (245 PS),1.0,SUV/Geländewagen/Pickup,Jahreswagen,"1,5 l/100 km (komb.)",153 g/km (komb.),Blau,23032,1.395 cm³,,https://www.autoscout24.de/angebote/volkswagen...
11070,Volkswagen,Tiguan,"€ 45.520,-1",Elektro/Benzin,2.735 km,Automatik,06/2021,180 kW (245 PS),1.0,SUV/Geländewagen/Pickup,Gebraucht,"1,5 l/100 km (komb.)",158 g/km (komb.),Schwarz,23028,1.395 cm³,,https://www.autoscout24.de/angebote/volkswagen...


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11109 entries, 0 to 11108
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   make           11109 non-null  object 
 1   model          11109 non-null  object 
 2   price          11109 non-null  object 
 3   fuel           11109 non-null  object 
 4   mileage        11109 non-null  object 
 5   gear           11038 non-null  object 
 6   registration   11109 non-null  object 
 7   hp             11109 non-null  object 
 8   owner          9579 non-null   float64
 9   body           11109 non-null  object 
 10  car_condition  11109 non-null  object 
 11  consumption    9745 non-null   object 
 12  emission       9444 non-null   object 
 13  color          11000 non-null  object 
 14  car_id         7281 non-null   object 
 15  displacement   10807 non-null  object 
 16  drive_type     7287 non-null   object 
 17  link           11109 non-null  object 
dtypes: flo

In [7]:
#Getting unique values counts for each column
for col in df.columns:
    print(f"{col} - {df[col].nunique()}")

make - 5
model - 16
price - 4095
fuel - 7
mileage - 5881
gear - 4
registration - 197
hp - 73
owner - 7
body - 8
car_condition - 5
consumption - 99
emission - 156
color - 14
car_id - 6708
displacement - 54
drive_type - 3
link - 8749


In [8]:
# checking Null values in columns
df.isna().sum()

make                0
model               0
price               0
fuel                0
mileage             0
gear               71
registration        0
hp                  0
owner            1530
body                0
car_condition       0
consumption      1364
emission         1665
color             109
car_id           3828
displacement      302
drive_type       3822
link                0
dtype: int64

In [9]:
df_links = df.copy()

# Data Cleaning

## scrape anew and fill missing values in columns

In [10]:
#df_scrape = df[df['drive_type'].isna()]
#df_scrape.shape
#df_scrape['link'].to_csv('../data/links_drive_type.csv', index=False)

In [11]:
# Read in newly scraped data for gear values
#df_1 = pd.read_csv('../data/scraped_anew.csv')
#print('Shape:', df_1.shape)

In [12]:
#df.update(df_1, overwrite=False)

In [13]:
#df.isna().sum()

In [14]:
#df.to_csv('../data/german_cars_scraping.csv', index=False)

## strip numerics from text 

In [15]:
import unicodedata

In [16]:
df['price'] = df['price'].str.replace(unicodedata.lookup('EURO SIGN'), '')
df['price'] = df['price'].str.split(',').str[0]
df['price'] = df['price'].str.replace('.', '')

In [17]:
for col in df[['mileage', 'emission', 'displacement', 'hp']]:
    df[col] = df[col].str.split(' ').str[0]

In [18]:
df['mileage'] = df['mileage'].str.replace('.', '')
df['consumption'] = df['consumption'].str.replace(',', '.')

In [19]:
df['registration'] = df['registration'].str.split('/').str[1]
df['consumption'] = df['consumption'].str.split('l').str[0]

In [20]:
df

Unnamed: 0,make,model,price,fuel,mileage,gear,registration,hp,owner,body,car_condition,consumption,emission,color,car_id,displacement,drive_type,link
0,Volkswagen,Polo GTI,34690,Benzin,10,Automatik,2023,152,1.00,Limousine,Vorführfahrzeug,4.2,,Grau,G3109,1.984,Front,https://www.autoscout24.de/angebote/volkswagen...
1,Volkswagen,Polo GTI,34950,Benzin,1500,Automatik,2023,152,1.00,Limousine,Vorführfahrzeug,6.6,0,Schwarz,363794549,1.998,Heck,https://www.autoscout24.de/angebote/volkswagen...
2,Volkswagen,Polo GTI,29990,Benzin,9000,Automatik,2023,152,1.00,Limousine,Vorführfahrzeug,7.4,0,Weiß,NU057791,1.984,Front,https://www.autoscout24.de/angebote/volkswagen...
3,Volkswagen,Polo GTI,33989,Benzin,3511,Automatik,2023,152,1.00,Limousine,Vorführfahrzeug,,,Schwarz,04197NPVFW,1.984,Front,https://www.autoscout24.de/angebote/volkswagen...
4,Volkswagen,Polo GTI,37980,Benzin,211,Automatik,2023,152,1.00,Kleinwagen,Vorführfahrzeug,5.7,,Weiß,VFW028766,1.984,Front,https://www.autoscout24.de/angebote/volkswagen...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11104,BMW,X1,35994,Elektro/Benzin,32082,Automatik,2020,162,1.00,SUV/Geländewagen/Pickup,Gebraucht,1.9,43,Schwarz,130645,1.499,Allrad,https://www.autoscout24.de/angebote/bmw-x1-xdr...
11105,BMW,X1,38200,Elektro/Benzin,13000,Automatik,2020,92,1.00,SUV/Geländewagen/Pickup,Gebraucht,1.9,,Blau,,1.499,Allrad,https://www.autoscout24.de/angebote/bmw-x1-x1-...
11106,Volkswagen,Passat Variant,35185,Elektro/Benzin,75226,Automatik,2019,115,1.00,Kombi,Gebraucht,1.7,38,Silber,30729,1.395,,https://www.autoscout24.de/angebote/volkswagen...
11107,Audi,A3,24980,Elektro/Benzin,57509,Automatik,2018,150,1.00,Limousine,Gebraucht,1.8,40,Rot,36598,1.395,Front,https://www.autoscout24.de/angebote/audi-a3-sp...


## strip white spaces

In [21]:
for col in df.columns:
    if df[col].dtypes == object:
        df[col] = df[col].str.strip()

## missing values

In [22]:
df.isna().sum()

make                0
model               0
price               0
fuel                0
mileage             0
gear               71
registration        0
hp                  0
owner            1530
body                0
car_condition       0
consumption      1364
emission         1665
color             109
car_id           3828
displacement      302
drive_type       3822
link                0
dtype: int64

### gear

In [23]:
# show nan values in col
print('Null values:', df['gear'].isna().sum())
print('---'*15)

# check unique values
print('Unique values:')
df['gear'].value_counts()

Null values: 71
---------------------------------------------
Unique values:


Automatik         7896
Schaltgetriebe    3097
-                   40
Halbautomatik        5
Name: gear, dtype: int64

In [24]:
df['gear'] = df['gear'].replace('-', np.nan)

In [25]:
df[df['gear'].isna()].groupby(['fuel']).count()

Unnamed: 0_level_0,make,model,price,mileage,gear,registration,hp,owner,body,car_condition,consumption,emission,color,car_id,displacement,drive_type,link
fuel,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
Benzin,16,16,16,16,0,16,16,13,16,16,12,13,16,1,13,8,16
Diesel,24,24,24,24,0,24,24,8,24,24,12,9,24,0,10,10,24
Elektro,34,34,34,34,0,34,34,33,34,34,9,29,34,30,11,12,34
Elektro/Benzin,37,37,37,37,0,37,37,33,37,37,26,28,36,33,36,30,37


In [26]:
# replace Nan values by mode of a group of car fuel-type
df['gear']=df.groupby('fuel').gear.transform(lambda x: x.fillna(x.mode()[0]))

In [27]:
df['gear'].value_counts()

Automatik         8007
Schaltgetriebe    3097
Halbautomatik        5
Name: gear, dtype: int64

In [28]:
df.isna().sum()

make                0
model               0
price               0
fuel                0
mileage             0
gear                0
registration        0
hp                  0
owner            1530
body                0
car_condition       0
consumption      1364
emission         1665
color             109
car_id           3828
displacement      302
drive_type       3822
link                0
dtype: int64

In [29]:
#df['gear'].fillna(df['gear'].mode()[0], inplace=True)

### color

In [30]:
df['color'].unique()

array(['Grau', 'Schwarz', 'Weiß', 'Rot', 'Blau', 'Silber', 'Gelb', nan,
       'Grün', 'Braun', 'Gold', 'Orange', 'Beige', 'Bronze', 'Violett'],
      dtype=object)

In [31]:
#printing whole link with max_width

with pd.option_context('display.max_colwidth', 150):
   print(df.iloc[:,17]) 
#df.iloc[5:,19]

0                  https://www.autoscout24.de/angebote/volkswagen-polo-gti-2-0-tsi-dsg-iq-light-acc-navi-pano-benzin-grau-69e912c7-81bf-49fb-840b-c6e71f6560f0
1          https://www.autoscout24.de/angebote/volkswagen-polo-gti-gti-2-0-l-tsi-dsg-panoramdach-beats-sou-benzin-schwarz-d4086c5e-9b90-4b33-9c8e-7497e26c6c52
2               https://www.autoscout24.de/angebote/volkswagen-polo-gti-2-0-tsi-dsg-iq-light-rueckfahrkamera-benzin-weiss-c20e5834-2267-40a2-9544-d5a346f1547f
3        https://www.autoscout24.de/angebote/volkswagen-polo-gti-2-0-tsi-dsg-iq-light-kamera-sport-select-klima-le-benzin-schwarz-f512787b-d2c8-4b69-9b62-6...
4               https://www.autoscout24.de/angebote/volkswagen-polo-gti-2-0-tsi-shz-matrix-led-navi-acc-pano-benzin-weiss-500d77c8-0d1d-4004-aab9-9cfd5c609c53
                                                                                 ...                                                                          
11104      https://www.autoscout24.de/angebote

In [32]:
# new dataframe with missing color value only
df_c = df.query('color.isnull()')
print(f'DF color has {df_c.shape[0]} rows and {df_c.shape[1]} colomns')
print('=='*20)
df_c.head()

DF color has 109 rows and 18 colomns


Unnamed: 0,make,model,price,fuel,mileage,gear,registration,hp,owner,body,car_condition,consumption,emission,color,car_id,displacement,drive_type,link
99,BMW,120,52690,Diesel,1503,Automatik,2023,140,1.0,Limousine,Gebraucht,4.8,127,,7K94547,1.995,Allrad,https://www.autoscout24.de/angebote/mercedes-b...
118,BMW,120,44390,Diesel,2896,Automatik,2023,140,1.0,Limousine,Vorführfahrzeug,1.4,25,,214261,1.995,Allrad,https://www.autoscout24.de/angebote/audi-q3-ad...
223,BMW,320,49880,Benzin,6000,Automatik,2023,135,1.0,Kombi,Vorführfahrzeug,2.0,0,,020804,1.998,Heck,https://www.autoscout24.de/angebote/bmw-i3-120...
301,BMW,X1,59740,Benzin,1,Automatik,2023,160,1.0,SUV/Geländewagen/Pickup,Vorführfahrzeug,4.9,0,,2923538,1.998,Allrad,https://www.autoscout24.de/angebote/audi-a1-ad...
303,BMW,X1,60499,Benzin,3,Automatik,2023,160,1.0,SUV/Geländewagen/Pickup,Vorführfahrzeug,6.8,0,,214305,1.998,Allrad,https://www.autoscout24.de/angebote/audi-a1-sp...


In [33]:
#assigning a list with colours of cars

farbe =['grau', 'schwarz', 'weiß', 'rot', 'blau', 'silber', 'gelb',
       'grün','gruen' 'braun', 'gold', 'orange', 'beige', 'bronze', 'violett']

In [34]:
#define a new column after splitting the link into list
      
df_c['new_link'] = df_c['link'].str.split('/').str[-1].str.split('-')
with pd.option_context('display.max_colwidth', 150):
   print(df_c.iloc[:,18]) 

99       [mercedes, benz, gla, 200, amg, amg, night, cam, led, ahk, ambiente, easy, benzin, schwarz, 492391ad, 371e, 416d, b65d, 4bddd35cf55b]
118                                  [audi, q3, advanced, 35, tfsi, s, tronic, fla, benzin, schwarz, 2ffc9aca, 0ab6, 4b05, 8050, e045406b6662]
223                                                                [bmw, i3, 120, ah, elektro, blau, dfed8e81, 17e6, 4c5e, 9251, b70ce3654491]
301                            [audi, a1, advanced, 30, tfsi, navi, gra, shz, klim, benzin, schwarz, 7282e0a9, c476, 4248, b418, 3b9058605e15]
303              [audi, a1, sportback, 25, tfsi, advanced, virtual, pdc, radio, klim, benzin, gruen, 167ba974, b6bb, 4b24, 9ca9, 549116e36c15]
                                                                         ...                                                                  
10286                           [audi, a1, s, line, panorama, 192ps, 18zoll, rotorbose, mmi, benzin, 0e1c2d27, 3d1e, 40a9, b5c8, 29c2ed4716a3]

In [35]:
#car color ist at (-6) position from left,if available in link
#extract cr color

df_c['new_link'] = df_c['link'].str.split('/').str[-1].str.split('-').str[-6]
with pd.option_context('display.max_colwidth', 150):
   print(df_c.iloc[:,18])


# if color isnot present in link then ,another vslue is extracted

99       schwarz
118      schwarz
223         blau
301      schwarz
303        gruen
          ...   
10286     benzin
10318     benzin
10388     benzin
10649     diesel
10759     diesel
Name: new_link, Length: 109, dtype: object


In [36]:
#compare volors from Farbe list is present in datagrame_df_c

df_c['new_link'].isin(farbe).value_counts()

False    87
True     22
Name: new_link, dtype: int64

In [37]:
df_c['color'] = np.where(df_c['new_link'].isin(farbe), df_c['new_link'], 'missing')

In [38]:
df_c['color'].value_counts()

missing    87
schwarz    12
grau        5
silber      2
blau        1
rot         1
beige       1
Name: color, dtype: int64

In [39]:
df.update(df_c)

In [40]:
df.isna().sum()

make                0
model               0
price               0
fuel                0
mileage             0
gear                0
registration        0
hp                  0
owner            1530
body                0
car_condition       0
consumption      1364
emission         1665
color               0
car_id           3828
displacement      302
drive_type       3822
link                0
dtype: int64

### owner

In [41]:
# show nan values in col
print('Null values:', df['owner'].isna().sum())
print('---'*15)

# check unique values
print('Unique values:')
df['owner'].value_counts()

Null values: 1530
---------------------------------------------
Unique values:


1.00    5858
2.00    3099
3.00     544
4.00      60
5.00      11
6.00       5
7.00       2
Name: owner, dtype: int64

In [42]:
# check cars in terms of first registration and number of owners, with more then 4 owners
#df[df['owner']>4]

In [43]:
# replace Nan values by mode of a group of car fuel-type
df['owner']=df.groupby('registration').owner.transform(lambda x: x.fillna(x.mode()[0]))

In [44]:
df['owner'].value_counts()

1.00    6342
2.00    4145
3.00     544
4.00      60
5.00      11
6.00       5
7.00       2
Name: owner, dtype: int64

In [45]:
df.isna().sum()

make                0
model               0
price               0
fuel                0
mileage             0
gear                0
registration        0
hp                  0
owner               0
body                0
car_condition       0
consumption      1364
emission         1665
color               0
car_id           3828
displacement      302
drive_type       3822
link                0
dtype: int64

### drive_type

In [46]:
# show nan values in col
print('Null values:', df['drive_type'].isna().sum())
print('---'*15)

# check unique values
print('Unique values:')
df['drive_type'].value_counts()

Null values: 3822
---------------------------------------------
Unique values:


Front     3588
Allrad    2705
Heck       994
Name: drive_type, dtype: int64

In [47]:
# replace Nan value sby mode of a group of car model
df['drive_type']=df.groupby('model').drive_type.transform(lambda x: x.fillna(x.mode()[0]))

In [48]:
print(df['drive_type'].value_counts())
print('--'*15)
print(df['drive_type'].isna().sum())

Front     5145
Allrad    3793
Heck      2171
Name: drive_type, dtype: int64
------------------------------
0


### displacement

In [49]:
# show nan values in col gear
print('Null values:', df['displacement'].isna().sum())
print('---'*15)
# check unique values
print('Unique values:')
df['displacement'].unique()

Null values: 302
---------------------------------------------
Unique values:


array(['1.984', '1.998', '1.498', '999', '1.332', '1.968', '1.395',
       '1.995', '1.497', '1.496', '1.499', '0', '2.967', '1.950', '1.988',
       '2.275', '647', '2.143', '1.422', '1.598', '1.500', '1.993',
       '1.398', '2.000', '1.000', '1.597', '1.965', '1', '898', '1.798',
       '2.148', '1.197', '2.698', '1.796', '1.991', '1.595', '1.390',
       '1.495', '998', nan, '1.400', '1.781', '2.995', '1.997', '1.576',
       '1.996', '1.896', '2.199', '1.594', '1.960', '855', '3.197',
       '1.600', '1.124', '1.800'], dtype=object)

In [50]:
# if not Elektro and displacement 0, replace as Nan
df['displacement'] = np.where((df['displacement']== '0') & (df['fuel']!='Elektro'), '0', df['displacement'].replace('0', np.nan))

In [51]:
df['displacement'].isna().sum()

393

In [52]:
#df['displacement'] = np.where((df['displacement']== '999 cm³'), '999 cm³', df['displacement'].replace('999 cm³', '0.999 cm³'))

df.loc[df.displacement =='999', 'displacement'] = '0.999'
df.loc[df.displacement =='898', 'displacement'] = '0.898'
df.loc[df.displacement =='998', 'displacement'] = '0.998'
df.loc[df.displacement =='647', 'displacement'] = '0.647'
df.loc[df.displacement =='855', 'displacement'] = '0.855'

In [53]:
df['displacement'].value_counts()

1.968    2365
1.995    1537
1.984    1311
1.395     799
1.498     563
1.332     508
1.998     403
0.999     317
1.390     309
1.598     280
1.950     261
1.499     224
1.798     210
1.595     206
2.143     190
2.967     182
1.497     166
1.991     148
1.796     127
1.597     121
1.496     106
1.197      79
0.898      68
1.993      29
2.148      28
1.422      28
2.000      25
2.698      21
1.896      18
1.997      16
2.995      14
1.500      10
3.197       6
1           5
1.398       4
1.965       3
0.998       3
1.400       3
0           3
1.988       2
1.600       2
1.996       2
0.647       2
1.781       2
2.275       1
1.576       1
1.000       1
2.199       1
1.594       1
1.960       1
0.855       1
1.495       1
1.124       1
1.800       1
Name: displacement, dtype: int64

In [54]:
# replace Nan values by mode of a group of car model
df['displacement']=df.groupby('model').displacement.transform(lambda x: x.fillna(x.mode()[0]))

In [55]:
df.isna().sum()

make                0
model               0
price               0
fuel                0
mileage             0
gear                0
registration        0
hp                  0
owner               0
body                0
car_condition       0
consumption      1364
emission         1665
color               0
car_id           3828
displacement        0
drive_type          0
link                0
dtype: int64

### consumption & emission

In [56]:
# show nan values in col gear
print('Null values:', df['consumption'].isna().sum())
print('---'*15)
# check unique values
print('Unique values:')
df['consumption'].value_counts()

Null values: 1364
---------------------------------------------
Unique values:


5.8    478
4.7    448
5.3    447
4.5    428
5.2    376
      ... 
1.2      1
3.2      1
9.1      1
47       1
3.3      1
Name: consumption, Length: 81, dtype: int64

In [57]:
# show nan values in col gear
print('Null values:', df['emission'].isna().sum())
print('---'*15)
# check unique values
print('Unique values:')
df['emission'].value_counts()

Null values: 1665
---------------------------------------------
Unique values:


0      1195
119     305
109     195
135     193
130     191
       ... 
227       1
181       1
239       1
212       1
11        1
Name: emission, Length: 156, dtype: int64

In [58]:
df[(df['fuel']=='Elektro')&(df['consumption']!=0)]

Unnamed: 0,make,model,price,fuel,mileage,gear,registration,hp,owner,body,car_condition,consumption,emission,color,car_id,displacement,drive_type,link
305,BMW,X1,69990,Elektro,1505,Automatik,2023,200,1.00,SUV/Geländewagen/Pickup,Vorführfahrzeug,6.9,0,Schwarz,132418,1.995,Allrad,https://www.autoscout24.de/angebote/audi-a1-sp...
334,smart,forFour,19770,Elektro,6599,Automatik,2022,60,1.00,Limousine,Jahreswagen,,0,Braun,30230015,1.968,Allrad,https://www.autoscout24.de/angebote/bmw-i3-120...
335,smart,forFour,18700,Elektro,9548,Automatik,2022,60,1.00,Limousine,Jahreswagen,,0,Silber,30230324,1.968,Heck,https://www.autoscout24.de/angebote/bmw-i3-120...
336,smart,forFour,19900,Elektro,6046,Automatik,2022,60,1.00,Limousine,Jahreswagen,4.7,0,Schwarz,18544,1.995,Heck,https://www.autoscout24.de/angebote/bmw-i3-s-1...
337,smart,forFour,18900,Elektro,11289,Automatik,2022,60,1.00,Limousine,Jahreswagen,4.3,0,Schwarz,8166,1.995,Heck,https://www.autoscout24.de/angebote/bmw-i3-120...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11018,BMW,i3,19750,Elektro,35846,Automatik,2018,125,1.00,Limousine,Gebraucht,,0,Schwarz,RAVZ31743,1.499,Heck,https://www.autoscout24.de/angebote/bmw-i3-94a...
11019,BMW,i3,20500,Elektro,31260,Automatik,2018,125,1.00,Limousine,Gebraucht,,0,Schwarz,B26804198,1.499,Heck,https://www.autoscout24.de/angebote/bmw-i3-nav...
11020,BMW,i3,19999,Elektro,34300,Automatik,2018,125,1.00,Kleinwagen,Gebraucht,,0,Schwarz,5N,1.499,Heck,https://www.autoscout24.de/angebote/bmw-i3-edr...
11021,BMW,i3,21999,Elektro,46000,Automatik,2018,75,2.00,Kleinwagen,Gebraucht,,0,Grau,WBY7Z210307B31253,1.499,Heck,https://www.autoscout24.de/angebote/bmw-i3-pro...


In [59]:
# if fuel is is elektro => then consumption replace with 0
df['consumption'].mask((df['fuel']=='Elektro'), 0, inplace=True)
df['emission'].mask((df['fuel']=='Elektro'), 0, inplace=True)

#if fuel is not elektro, but consumption 0 => replace 0 with Nan values
df['consumption'].mask((df['fuel']!='Elektro')&(df['consumption']== '0'), np.nan, inplace=True)
df['emission'].mask((df['fuel']!='Elektro')&(df['emission']== '0'), np.nan, inplace=True)

#check Nan values again
print('NAN in consupmtion:', df['consumption'].isna().sum())
print('NAN in emission:', df['emission'].isna().sum())

NAN in consupmtion: 1104
NAN in emission: 2429


In [60]:
df['emission'].value_counts()

0      440
119    305
109    195
135    193
130    191
      ... 
240      1
227      1
197      1
181      1
11       1
Name: emission, Length: 156, dtype: int64

In [61]:
# fill NAN by model-group mode
df['consumption']=df.groupby('model').consumption.transform(lambda x: x.fillna(x.mode()[0]))
df['emission']=df.groupby('model').emission.transform(lambda x: x.fillna(x.mode()[0]))

In [62]:
df.isna().sum()

make                0
model               0
price               0
fuel                0
mileage             0
gear                0
registration        0
hp                  0
owner               0
body                0
car_condition       0
consumption         0
emission            0
color               0
car_id           3828
displacement        0
drive_type          0
link                0
dtype: int64

## translate terms

In [63]:
#some colors are in upper and lower case and are german words, hence translating them to english

df.replace({'Benzin': 'Petrol', 'Elektro/Benzin': 'Electro/Petrol', 'Sonstige':'Other', 'Elektro':'Electro', 'Elektro/Diesel': 'Electro/Diesel', ' Benzin': 'Petrol',
           'Automatik': 'Automatic', 'Schaltgetriebe': 'Manual', 'Halbautomatik':'Semi-automatic',
            'Limousine': 'Sedan', 'Kleinwagen':'Small car', 'Kombi': 'Station wagon', 'Cabrio':'Convertible',
            'Coupé':'Coupe', 'SUV/Geländewagen/Pickup':'SUV', 'Van/Kleinbus':'SUV',
           'Vorführfahrzeug':'Demonstration vehicle',
           'Tageszulassung':'Day admission', 'Jahreswagen':'Annual car', 'Gebraucht':'Used', 'Neu':'New', 'Allrad':'Four.w.d', 'Heck':'Rear', 'Grau': 'Gray','grau': 'Gray', 'Schwarz': 'Black','schwarz': 'Black', 'Weiß':'White', 'Rot':'Red', 'rot':'Red', 'Blau':'Blue', 'blau':'Blue', 'Silber':'Silver', 'silber':'Silver', 'Gelb':'Yellow',
       'Grün':'Green', 'Braun':'Brown', 'Violett':'Purple', 'beige':'Beige', '-': 'unknown', 'nan':'unknown'}, inplace=True)

## change datatypes

In [64]:
display(df.dtypes)

make              object
model             object
price             object
fuel              object
mileage           object
gear              object
registration      object
hp                object
owner            float64
body              object
car_condition     object
consumption       object
emission          object
color             object
car_id            object
displacement      object
drive_type        object
link              object
dtype: object

In [65]:
#object to float

cols = ['price', 'displacement', 'consumption']
for col in cols:
    df[col] = pd.to_numeric(df[col])

In [66]:
#floats to int 
cols = ['mileage', 'owner', 'emission', 'registration']
for col in cols:
    df[col] = df[col].astype(int)

In [67]:
display(df.dtypes)

make              object
model             object
price              int64
fuel              object
mileage            int64
gear              object
registration       int64
hp                object
owner              int64
body              object
car_condition     object
consumption      float64
emission           int64
color             object
car_id            object
displacement     float64
drive_type        object
link              object
dtype: object

In [68]:
df.isna().sum()

make                0
model               0
price               0
fuel                0
mileage             0
gear                0
registration        0
hp                  0
owner               0
body                0
car_condition       0
consumption         0
emission            0
color               0
car_id           3828
displacement        0
drive_type          0
link                0
dtype: int64

### hp 

In [69]:
df['hp'].unique()

array(['152', '110', '81', '70', '120', '221', '180', '147', '150', '85',
       '131', '140', '130', '135', '90', '195', '210', '235', '115',
       '100', '160', '200', '155', '125', '60', '75', 'unknown', '206',
       '170', '162', '118', '169', '141', '103', '41', '66', '213', '96',
       '145', '176', '52', '185', '92', '132', '80', '243', '284', '228',
       '250', '106', '165', '45', '63', '136', '230', '77', '173', '55',
       '88', '164', '137', '101', '105', '177', '116', '28', '211', '74',
       '122', '202', '127', '199', '217'], dtype=object)

In [70]:
df[df['hp']== 'unknown']

Unnamed: 0,make,model,price,fuel,mileage,gear,registration,hp,owner,body,car_condition,consumption,emission,color,car_id,displacement,drive_type,link
453,BMW,i3,44500,Electro,2200,Automatic,2022,unknown,1,Sedan,Used,0.0,0,Gray,768.0,2.0,Rear,https://www.autoscout24.de/angebote/smart-forf...
687,smart,forFour,15850,Electro,9023,Automatic,2021,unknown,1,Small car,Used,0.0,0,Black,,1.4,Front,https://www.autoscout24.de/angebote/audi-a5-ca...
3764,Volkswagen,Tiguan,12490,Petrol,109000,Manual,2015,unknown,2,SUV,Used,6.5,152,Gray,,1.39,Four.w.d,https://www.autoscout24.de/angebote/volkswagen...
7158,smart,forFour,20950,Electro,4157,Automatic,2021,unknown,1,Small car,Used,0.0,0,Black,51521.0,1.0,Front,https://www.autoscout24.de/angebote/smart-forf...


In [71]:
df['hp'].replace({'unknown': np.nan}, inplace=True)
df['hp'].isna().sum()

4

In [72]:
df['hp']=df.groupby('model').hp.transform(lambda x: x.fillna(x.mode()[0]))
df['hp'] = df['hp'].astype(int)

In [73]:
df.isna().sum()

make                0
model               0
price               0
fuel                0
mileage             0
gear                0
registration        0
hp                  0
owner               0
body                0
car_condition       0
consumption         0
emission            0
color               0
car_id           3828
displacement        0
drive_type          0
link                0
dtype: int64

In [74]:
display(df.dtypes)

make              object
model             object
price              int64
fuel              object
mileage            int64
gear              object
registration       int64
hp                 int64
owner              int64
body              object
car_condition     object
consumption      float64
emission           int64
color             object
car_id            object
displacement     float64
drive_type        object
link              object
dtype: object

In [75]:
df.head(3)

Unnamed: 0,make,model,price,fuel,mileage,gear,registration,hp,owner,body,car_condition,consumption,emission,color,car_id,displacement,drive_type,link
0,Volkswagen,Polo GTI,34690,Petrol,10,Automatic,2023,152,1,Sedan,Demonstration vehicle,4.2,138,Gray,G3109,1.98,Front,https://www.autoscout24.de/angebote/volkswagen...
1,Volkswagen,Polo GTI,34950,Petrol,1500,Automatic,2023,152,1,Sedan,Demonstration vehicle,6.6,138,Black,363794549,2.0,Rear,https://www.autoscout24.de/angebote/volkswagen...
2,Volkswagen,Polo GTI,29990,Petrol,9000,Automatic,2023,152,1,Sedan,Demonstration vehicle,7.4,138,White,NU057791,1.98,Front,https://www.autoscout24.de/angebote/volkswagen...


In [76]:
df.describe()

Unnamed: 0,price,mileage,registration,hp,owner,consumption,emission,displacement
count,11109.0,11109.0,11109.0,11109.0,11109.0,11109.0,11109.0,11109.0
mean,26144.89,73161.89,2017.58,124.33,1.49,5.05,122.08,1.78
std,12934.45,68493.59,4.14,29.79,0.64,1.68,39.29,0.35
min,245.0,0.0,2007.0,28.0,1.0,0.0,0.0,0.0
25%,15990.0,16800.0,2015.0,110.0,1.0,4.5,117.0,1.5
50%,24900.0,53429.0,2019.0,120.0,1.0,5.2,126.0,1.97
75%,34950.0,110589.0,2021.0,140.0,2.0,5.9,143.0,1.98
max,87985.0,570000.0,2023.0,284.0,7.0,47.0,240.0,3.2


## check duplicates

In [77]:
df.duplicated().value_counts()

False    11101
True         8
dtype: int64

In [78]:
df.duplicated(subset=['link']).value_counts()

False    8749
True     2360
dtype: int64

In [79]:
df.drop('link', inplace=True, axis=1)

In [80]:
df.duplicated().value_counts()

False    10969
True       140
dtype: int64

In [81]:
df.drop_duplicates(inplace=True)

In [82]:
df.duplicated().value_counts()

False    10969
dtype: int64

## create car classes

In [83]:
df.body.value_counts()

Sedan            4105
SUV              3165
Station wagon    2233
Coupe             524
Small car         520
Convertible       378
Other              44
Name: body, dtype: int64

In [84]:
df['body'] = df['body'].replace('Other', np.nan)

In [85]:
df['body'].isna().sum()

44

In [86]:
df['body']=df.groupby('model').body.transform(lambda x: x.fillna(x.mode()[0]))

In [87]:
df.body.value_counts()

Sedan            4122
SUV              3173
Station wagon    2247
Coupe             529
Small car         520
Convertible       378
Name: body, dtype: int64

In [88]:
# create new column car class
df['class'] = df['body']

### small car

In [89]:
# smart
df['class'] = np.where((df['make'] == 'smart'), 'Small car', df['class'])

# Polo GTI
df['class'] = np.where((df['make'] == 'Volkswagen') & (df['model'] == 'Polo GTI') , 'Small car', df['class'])

# Audi A1
df['class'] = np.where((df['make'] == 'Audi') & (df['model'] == 'A1') , 'Small car', df['class'])

# BMW i3
df['class'] = np.where((df['make'] == 'BMW') & (df['model'] == 'i3') , 'Small car', df['class'])

### small family car

In [90]:
# MB A 200
df['class'] = np.where((df['make'] == 'Mercedes-Benz') & (df['model'] == 'A 200'), 'Small family car', df['class'])

# VW Golf GTI
df['class'] = np.where((df['make'] == 'Volkswagen') & (df['model'] == 'Golf GTI') , 'Small family car', df['class'])

# Audi A3
df['class'] = np.where((df['make'] == 'Audi') & (df['model'] == 'A3') , 'Small family car', df['class'])

# BMW 120
df['class'] = np.where((df['make'] == 'BMW') & (df['model'] == '120') , 'Small family car', df['class'])

### large family car

In [91]:
# MB C 200
df['class'] = np.where((df['make'] == 'Mercedes-Benz') & (df['model'] == 'C 200'), 'Large family car', df['class'])

# VW Passat Variant
df['class'] = np.where((df['make'] == 'Volkswagen') & (df['model'] == 'Passat Variant') , 'Large family car', df['class'])

# Audi A5
df['class'] = np.where((df['make'] == 'Audi') & (df['model'] == 'A5') , 'Large family car', df['class'])

# BMW 330
df['class'] = np.where((df['make'] == 'BMW') & (df['model'] == '320') , 'Large family car', df['class'])

### compact SUV

In [92]:
# MB GLA 200
df['class'] = np.where((df['make'] == 'Mercedes-Benz') & (df['model'] == 'GLA 200'), 'Compact SUV', df['class'])

# VW Tiguan
df['class'] = np.where((df['make'] == 'Volkswagen') & (df['model'] == 'Tiguan') , 'Compact SUV', df['class'])

# Audi Q3
df['class'] = np.where((df['make'] == 'Audi') & (df['model'] == 'Q3') , 'Compact SUV', df['class'])
# BMW X1
df['class'] = np.where((df['make'] == 'BMW') & (df['model'] == 'X1') , 'Compact SUV', df['class'])

### grouped by class, make, model

In [93]:
df['class'].unique()

array(['Small car', 'Small family car', 'Large family car', 'Compact SUV'],
      dtype=object)

In [94]:
df.groupby(['make', 'model']).count()['price']

make           model         
Audi           A1                 548
               A3                1304
               A5                1003
               Q3                 842
BMW            120                439
               320                903
               X1                 797
               i3                 215
Mercedes-Benz  A 200              509
               C 200              940
               GLA 200            492
Volkswagen     Golf GTI           301
               Passat Variant    1057
               Polo GTI           190
               Tiguan            1065
smart          forFour            364
Name: price, dtype: int64

## generate unique IDs

In [95]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10969 entries, 0 to 11108
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   make           10969 non-null  object 
 1   model          10969 non-null  object 
 2   price          10969 non-null  int64  
 3   fuel           10969 non-null  object 
 4   mileage        10969 non-null  int64  
 5   gear           10969 non-null  object 
 6   registration   10969 non-null  int64  
 7   hp             10969 non-null  int64  
 8   owner          10969 non-null  int64  
 9   body           10969 non-null  object 
 10  car_condition  10969 non-null  object 
 11  consumption    10969 non-null  float64
 12  emission       10969 non-null  int64  
 13  color          10969 non-null  object 
 14  car_id         7166 non-null   object 
 15  displacement   10969 non-null  float64
 16  drive_type     10969 non-null  object 
 17  class          10969 non-null  object 
dtypes: flo

In [96]:
df['car_id'] = df['car_id'].astype(str)

In [97]:
ddf = df[(df['car_id'].apply(len)<=5)]

In [98]:
#function to generate random unique ids

import string
import random

def ran_gen(size, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for x in range(size))

In [99]:
#generating ids using above function and assign it to dataframe df2

a = []
for i in ddf['car_id']:
    i = ran_gen(8, "AEIOS56723")
    a.append(i)
    
ddf['car_id'] = a

In [100]:
df.update(ddf)

In [101]:
df['car_id'].value_counts()

3614358     4
TS-Q3035    3
3615025     3
235762      3
GW-K115     3
           ..
13-33373    1
IIAAOA3I    1
72236OA7    1
IS5OIIAE    1
OI37I3E6    1
Name: car_id, Length: 10798, dtype: int64

In [102]:
df

Unnamed: 0,make,model,price,fuel,mileage,gear,registration,hp,owner,body,car_condition,consumption,emission,color,car_id,displacement,drive_type,class
0,Volkswagen,Polo GTI,34690.00,Petrol,10.00,Automatic,2023.00,152.00,1.00,Sedan,Demonstration vehicle,4.20,138.00,Gray,35OI76IE,1.98,Front,Small car
1,Volkswagen,Polo GTI,34950.00,Petrol,1500.00,Automatic,2023.00,152.00,1.00,Sedan,Demonstration vehicle,6.60,138.00,Black,363794549,2.00,Rear,Small car
2,Volkswagen,Polo GTI,29990.00,Petrol,9000.00,Automatic,2023.00,152.00,1.00,Sedan,Demonstration vehicle,7.40,138.00,White,NU057791,1.98,Front,Small car
3,Volkswagen,Polo GTI,33989.00,Petrol,3511.00,Automatic,2023.00,152.00,1.00,Sedan,Demonstration vehicle,5.80,138.00,Black,04197NPVFW,1.98,Front,Small car
4,Volkswagen,Polo GTI,37980.00,Petrol,211.00,Automatic,2023.00,152.00,1.00,Small car,Demonstration vehicle,5.70,138.00,White,VFW028766,1.98,Front,Small car
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11104,BMW,X1,35994.00,Electro/Petrol,32082.00,Automatic,2020.00,162.00,1.00,SUV,Used,1.90,43.00,Black,130645,1.50,Four.w.d,Compact SUV
11105,BMW,X1,38200.00,Electro/Petrol,13000.00,Automatic,2020.00,92.00,1.00,SUV,Used,1.90,43.00,Blue,SAI33O32,1.50,Four.w.d,Compact SUV
11106,Volkswagen,Passat Variant,35185.00,Electro/Petrol,75226.00,Automatic,2019.00,115.00,1.00,Station wagon,Used,1.70,38.00,Silver,AOS62567,1.40,Front,Large family car
11107,Audi,A3,24980.00,Electro/Petrol,57509.00,Automatic,2018.00,150.00,1.00,Sedan,Used,1.80,40.00,Red,72O3IEAS,1.40,Front,Small family car


## colum car age

In [103]:
df['car_age'] = 2023-df['registration']

In [104]:
df['car_age'].unique()

array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12.,
       13., 14., 15., 16.])

## cleaning values

In [105]:
#Getting unique values counts for each column
for col in df.columns:
    print(f"{col} - {df[col].nunique()}")

make - 5
model - 16
price - 3390
fuel - 6
mileage - 5881
gear - 3
registration - 17
hp - 72
owner - 7
body - 6
car_condition - 5
consumption - 81
emission - 156
color - 15
car_id - 10798
displacement - 53
drive_type - 3
class - 4
car_age - 17


In [106]:
for col in df[['fuel', 'hp', 'displacement']]:
    print(f"{col}")
    print('--'*4)
    print(f"{df[col].value_counts()}")
    print('=='*20)

fuel
--------
Petrol            5149
Diesel            4829
Electro/Petrol     551
Electro            436
Electro/Diesel       3
Other                1
Name: fuel, dtype: int64
hp
--------
110.00    2152
140.00    1212
135.00    1067
120.00     631
103.00     451
          ... 
137.00       1
164.00       1
243.00       1
284.00       1
217.00       1
Name: hp, Length: 72, dtype: int64
displacement
--------
1.97    2342
2.00    1543
1.98    1293
1.40     792
1.50     559
1.33     507
1.00     459
2.00     403
1.50     402
1.39     306
1.60     276
1.95     258
1.80     209
1.59     202
2.14     187
2.97     180
1.50     176
1.99     145
1.80     127
1.60     121
1.50     106
1.20      77
0.90      64
1.99      29
2.15      28
1.42      28
2.00      25
2.70      21
1.90      18
2.00      16
3.00      14
1.50      10
1.00       6
3.20       6
1.40       3
1.00       3
1.97       3
1.40       3
0.00       3
1.78       2
0.65       2
2.00       2
1.99       2
1.60       2
1.50       1
2.27

In [107]:
df[df['fuel'] == 'Other']

Unnamed: 0,make,model,price,fuel,mileage,gear,registration,hp,owner,body,car_condition,consumption,emission,color,car_id,displacement,drive_type,class,car_age
115,BMW,120,41690.0,Other,1412.0,Automatic,2023.0,131.0,1.0,Sedan,Demonstration vehicle,1.4,24.0,Black,2945886,1.4,Front,Small family car,0.0


In [108]:
# Drop One Value "other" in col['Fuel']
df = df.drop([115])

In [109]:
# restrict car age range for 10/11 years, as there cars only with first registration in 2022
df = df[df['registration']>2011]

### clean price column

- there are prices less then a 3500
- after checking on AutoScout its clear, that these are ranges for leasing
- this cars can be deleted

In [110]:
df = df[(df['price']>3500)]

### Clean owner values

In [111]:
df[df['owner'] == 0]

Unnamed: 0,make,model,price,fuel,mileage,gear,registration,hp,owner,body,car_condition,consumption,emission,color,car_id,displacement,drive_type,class,car_age


In [112]:
df_owner_over_4 = df[df['owner'] > 4]
df_owner_over_4.owner.value_counts()

5.00    3
6.00    3
7.00    1
Name: owner, dtype: int64

In [113]:
df = df[df['owner'] <=4]

### clean fuel values

In [114]:
df = df[(df['fuel'] != 'Electro/Diesel') & (df['fuel'] != 'Electro/Petrol')]

# create column make-model

In [115]:
df["car"] = df["make"] + ' ' + df["model"]

In [116]:
df.drop(['car_id'], inplace=True, axis=1)
df

Unnamed: 0,make,model,price,fuel,mileage,gear,registration,hp,owner,body,car_condition,consumption,emission,color,displacement,drive_type,class,car_age,car
0,Volkswagen,Polo GTI,34690.00,Petrol,10.00,Automatic,2023.00,152.00,1.00,Sedan,Demonstration vehicle,4.20,138.00,Gray,1.98,Front,Small car,0.00,Volkswagen Polo GTI
1,Volkswagen,Polo GTI,34950.00,Petrol,1500.00,Automatic,2023.00,152.00,1.00,Sedan,Demonstration vehicle,6.60,138.00,Black,2.00,Rear,Small car,0.00,Volkswagen Polo GTI
2,Volkswagen,Polo GTI,29990.00,Petrol,9000.00,Automatic,2023.00,152.00,1.00,Sedan,Demonstration vehicle,7.40,138.00,White,1.98,Front,Small car,0.00,Volkswagen Polo GTI
3,Volkswagen,Polo GTI,33989.00,Petrol,3511.00,Automatic,2023.00,152.00,1.00,Sedan,Demonstration vehicle,5.80,138.00,Black,1.98,Front,Small car,0.00,Volkswagen Polo GTI
4,Volkswagen,Polo GTI,37980.00,Petrol,211.00,Automatic,2023.00,152.00,1.00,Small car,Demonstration vehicle,5.70,138.00,White,1.98,Front,Small car,0.00,Volkswagen Polo GTI
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11017,BMW,i3,24500.00,Electro,41800.00,Automatic,2018.00,135.00,2.00,Small car,Used,0.00,0.00,Gray,1.50,Rear,Small car,5.00,BMW i3
11018,BMW,i3,19750.00,Electro,35846.00,Automatic,2018.00,125.00,1.00,Sedan,Used,0.00,0.00,Black,1.50,Rear,Small car,5.00,BMW i3
11020,BMW,i3,19999.00,Electro,34300.00,Automatic,2018.00,125.00,1.00,Small car,Used,0.00,0.00,Black,1.50,Rear,Small car,5.00,BMW i3
11021,BMW,i3,21999.00,Electro,46000.00,Automatic,2018.00,75.00,2.00,Small car,Used,0.00,0.00,Gray,1.50,Rear,Small car,5.00,BMW i3


In [117]:
df.rename(columns={'class': 'car_class'}, inplace=True)

# DF for Machine Learning and EDA

In [118]:
df_ml = df.copy()
df_ml.shape

(9158, 19)

In [119]:
#df_ml.to_csv('../data/df_ml.csv', index=False)

In [119]:
df_ml.isna().sum()

make             0
model            0
price            0
fuel             0
mileage          0
gear             0
registration     0
hp               0
owner            0
body             0
car_condition    0
consumption      0
emission         0
color            0
displacement     0
drive_type       0
car_class        0
car_age          0
car              0
dtype: int64

In [None]:
df_ml.info()